I have a database for a backup service I'm writing to backup Yahoo! Groups. It incrementally retrieves messages, which have a contiguous numeric id. stored in a 'message_id'
field. So, if the last message on the service is message number 10000, then once the backup is complete, the database should contain 10000 documents, with the sorted 'message_id'
s of each document being equivalent to range(1, 10000+1)
.
I'd like to write a query yielding the missing message ids. So if I have 9995 documents in the database, and messages 10, 15, 49, 99, and 1043 are missing, it should return [10, 15, 49, 99, 1043]
.
I've done the following, getting just the ids from the database and running a set intersection in my app code:
def missing_message_ids(self):
"""Return the set of the ids of all missing messages.."""
latest = self.get_latest_message()
ids = set(range(1, latest['_id']+1))
present_ids = set(doc['_id'] for doc in self.db.messages.find({}, {'_id': 1}))
return ids - present_ids
This is fine for my purposes, but it seems like it might get too slow for a vast number of messages. This is more for curiosity's sake than a real performance requirement: Is there any more efficient way to do this, perhaps entirely on the database engine?
Aucun commentaire:
Enregistrer un commentaire