January 23, 2015

Answering my own question. Does it matter which record is marked as the original?

Back in October, I posed a question about web archive deduplication: does it matter which record is marked as the original?

I've come up with at least one reason why it might matter during playback.

Consider what happens in a replay tool like OpenWayback when it encounters a revisit record. The revisit record should contain information about the original URL and original capture time. The replay tool then has to go and make another query on its index to resolve these values.
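To make the extra lookup concrete, here is a minimal sketch in Python. It is not OpenWayback's code; the `Capture` type, its field names, and the dictionary standing in for the index are all assumptions made up for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Capture(NamedTuple):
    """Hypothetical index entry; field names are illustrative only."""
    url: str
    timestamp: str                        # 14-digit capture time
    kind: str                             # "response" or "revisit"
    orig_url: Optional[str] = None        # set on revisit records
    orig_timestamp: Optional[str] = None  # set on revisit records


Index = Dict[Tuple[str, str], Capture]


def resolve(index: Index, url: str, timestamp: str) -> Tuple[Capture, int]:
    """Return the capture holding the payload and the number of index queries used."""
    capture = index[(url, timestamp)]
    queries = 1
    if capture.kind == "revisit":
        # The revisit only points at the original, so a second query is needed.
        capture = index[(capture.orig_url, capture.orig_timestamp)]
        queries += 1
    return capture, queries


original = Capture("http://example.com/logo.png", "20141001000000", "response")
revisit = Capture("http://example.com/logo.png", "20150101000000", "revisit",
                  orig_url=original.url, orig_timestamp=original.timestamp)
index: Index = {(c.url, c.timestamp): c for c in (original, revisit)}

payload, queries = resolve(index, revisit.url, revisit.timestamp)
```

Replaying the revisit costs two queries where replaying the original costs one. What that second query costs is the interesting part.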

If the URL is the same URL as was just resolved for the revisit record itself, it follows that any cache (including the OS disk cache for a CDX file) will likely be warm, leading to a quick lookup. In fact, OpenWayback will likely just be able to scroll backwards in its 'hit' list for the original query.

On the other hand, if the URL is different, it will be located in a different part of the index (whether it is a CDX file or some other structure) and may require hitting the disk(s) again. Any help from the cache will be incidental.
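A toy illustration of the locality argument, assuming a sorted, CDX-like list of "urlkey timestamp" strings. The keys and the `index_distance` helper are invented for this sketch, and a real index is far larger and lives on disk, but the relative positions behave the same way.

```python
from bisect import bisect_left
from typing import List


def index_distance(sorted_keys: List[str], key_a: str, key_b: str) -> int:
    """Number of index entries separating two keys in a sorted index."""
    return abs(bisect_left(sorted_keys, key_a) - bisect_left(sorted_keys, key_b))


# Made-up index: three captures of one URL, a thousand unrelated entries,
# and one capture of a different URL that happens to share the same payload.
keys = sorted(
    [
        "com,example)/logo.png 20141001000000",
        "com,example)/logo.png 20141101000000",
        "com,example)/logo.png 20150101000000",
        "org,other)/images/logo.png 20140901000000",
    ]
    + [f"net,filler)/page{i:04d} 20140101000000" for i in range(1000)]
)

revisit_key = "com,example)/logo.png 20150101000000"

# Original under the same URL: a couple of lines away from the revisit.
same_url = index_distance(keys, revisit_key, "com,example)/logo.png 20141001000000")

# Original under a different URL: the whole filler block lies in between.
different_url = index_distance(keys, revisit_key, "org,other)/images/logo.png 20140901000000")
```

With the same URL, the original sits within a few entries of the revisit, so it is very likely in a block that has already been read. With a different URL, the distance grows with the size of the index.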

Just how much this matters in practice depends on how large your index is and how quickly it can be accessed. If it is served from a machine with super fast SSDs and copious amounts of RAM, it may not matter at all. If the index is on a 5400 RPM HDD in a memory-starved machine, it may matter a lot.

In any case, I think this is sufficient justification for keeping the deduplication strategy of preferring exact URL matches when available, while allowing digest-only matches when there aren't any.
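That strategy could be sketched roughly as follows. This is only an illustration, not the code of any actual crawler or deduplication tool. The `Candidate` type and `choose_original` function are hypothetical, and picking the earliest capture within a pool is an arbitrary tie-breaker chosen for the example.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical earlier capture that a new capture could be a revisit of."""
    url: str
    timestamp: str  # 14-digit capture time
    digest: str     # payload digest


def choose_original(candidates: List[Candidate], url: str, digest: str) -> Optional[Candidate]:
    """Prefer an exact URL match; fall back to a digest-only match."""
    same_digest = [c for c in candidates if c.digest == digest]
    exact_url = [c for c in same_digest if c.url == url]
    pool = exact_url or same_digest
    if not pool:
        return None  # nothing to deduplicate against
    # Assumption: the earliest capture in the preferred pool is used.
    return min(pool, key=lambda c: c.timestamp)


seen = [
    Candidate("http://other.org/images/logo.png", "20140901000000", "sha1:AAAA"),
    Candidate("http://example.com/logo.png", "20141001000000", "sha1:AAAA"),
]

exact = choose_original(seen, "http://example.com/logo.png", "sha1:AAAA")
fallback = choose_original(seen, "http://example.net/copy.png", "sha1:AAAA")
missing = choose_original(seen, "http://example.com/logo.png", "sha1:BBBB")
```

Note that `exact` resolves to the capture under the same URL even though an older capture with the same digest exists under another URL, which is exactly the case where the replay-time lookup stays cheap.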

