lundi 13 juin 2016

Efficiently query missing integers in a range on a field?

I have a database for a backup service I'm writing to backup Yahoo! Groups. It incrementally retrieves messages, which have a contiguous numeric id. stored in a 'message_id' field. So, if the last message on the service is message number 10000, then once the backup is complete, the database should contain 10000 documents, with the sorted 'message_id's of each document being equivalent to range(1, 10000+1).

I'd like to write a query yielding the missing message ids. So if I have 9995 documents in the database, and messages 10, 15, 49, 99, and 1043 are missing, it should return [10, 15, 49, 99, 1043].

I've done the following, getting just the ids from the database and running a set intersection in my app code:

def missing_message_ids(self):
    """Return the set of the ids of all missing messages.."""
    latest = self.get_latest_message()
    ids = set(range(1, latest['_id']+1))
    present_ids = set(doc['_id'] for doc in self.db.messages.find({}, {'_id': 1}))
    return ids - present_ids

This is fine for my purposes, but it seems like it might get too slow for a vast number of messages. This is more for curiosity's sake than a real performance requirement: Is there any more efficient way to do this, perhaps entirely on the database engine?

Aucun commentaire:

Enregistrer un commentaire