We've been doing deduplication in web archiving for a long time now. But, due to limits in our tools and storage format (WARC), it has always been so-called URL based deduplication. That is, we record that this capture of a particular URL is a duplicate (or revisit, if you prefer). The content isn't stored, and playback software simply moves 'backwards in time' until it finds the original record.
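As a rough illustration of that 'backwards in time' lookup, here is a minimal Python sketch. It is not how OpenWayback or any other replay tool is actually implemented; the `Capture` class and the `resolve_original` function are hypothetical names made up for this example, and timestamps are assumed to be sortable strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Capture:
    """Hypothetical, simplified view of one capture in a replay index."""
    url: str
    timestamp: str    # assumed sortable, e.g. "20140301000000"
    is_revisit: bool  # True when the content itself was not stored


def resolve_original(captures: List[Capture], url: str,
                     timestamp: str) -> Optional[Capture]:
    """Step backwards through captures of the same URL until one has stored content."""
    candidates = sorted(
        (c for c in captures if c.url == url and c.timestamp <= timestamp),
        key=lambda c: c.timestamp,
        reverse=True,  # newest first, so we walk backwards in time
    )
    for capture in candidates:
        if not capture.is_revisit:
            return capture
    return None  # no stored copy of this URL at or before the given time
```

Note that the search never leaves the requested URL, which is exactly the limitation of URL based deduplication.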
With recent clarifications to the WARC spec, adopted by the IIPC and implemented in tools like Heritrix and OpenWayback, we are no longer limited to this URL based deduplication.
With Heritrix 3.3.0 (still in development) now having robust handling for 'url agnostic' or 'digest based' deduplication, I set out to update my DeDuplicator software, an add-on for Heritrix. Implementing digest based deduplication was super easy.
But something nagged at me.
Consider the following scenario. You are crawling URL A. Its content digest indicates that it is a duplicate of URL A from some earlier crawl (let's call this A-1), but it is also a duplicate of URL B. B may have been crawled at the same time (i.e. during the same harvesting round) as A-1, or at another time altogether, including earlier during the current round of harvesting.
Obviously, if we didn't have A-1, we would deduplicate on B and say that A is a duplicate of B. That's the whole point of digest based deduplication.
But, if you do have A-1, does it matter which one A is declared a duplicate of?
During large scale crawling you need lookups in your deduplication index to be as efficient as possible. Doing simple lookups on digests is more performant than doing a lookup on both digest and URL (or searching within a result set for the digest). Additionally, you can make the index smaller by only including one instance of each digest in the index.
But this bowing to performance means that it is blind chance whether A-1 or B will be designated as the 'original' for A.
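To make the trade-off concrete, here is a minimal Python sketch of the two index styles. It is an illustration only, not the index format used by Heritrix or the DeDuplicator; the function names and the `(digest, url, timestamp)` tuple layout are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Tuple[str, str]        # (url, timestamp) of a capture whose content was stored
Record = Tuple[str, str, str]  # (digest, url, timestamp)


def build_digest_only_index(records: Iterable[Record]) -> Dict[str, Entry]:
    """Small index: one entry per digest. Whichever record is seen first wins."""
    index: Dict[str, Entry] = {}
    for digest, url, timestamp in records:
        index.setdefault(digest, (url, timestamp))
    return index


def build_full_index(records: Iterable[Record]) -> Dict[str, List[Entry]]:
    """Larger index: keeps every stored capture for each digest."""
    index: Dict[str, List[Entry]] = {}
    for digest, url, timestamp in records:
        index.setdefault(digest, []).append((url, timestamp))
    return index


def lookup_prefer_same_url(index: Dict[str, List[Entry]], digest: str,
                           url: str) -> Optional[Entry]:
    """Search within the digest's result set, preferring an entry with the same URL."""
    candidates = index.get(digest, [])
    for candidate in candidates:
        if candidate[0] == url:
            return candidate
    return candidates[0] if candidates else None
```

With the digest-only index, whether A resolves to A-1 or to B depends purely on the order in which the records were added. The full index can always prefer A-1, at the cost of more storage and a scan of the result set on every lookup.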
The engineer in me insists that this is irrelevant. After all, if we didn't have A-1, we'd want to use B. So does the existence of A-1 in our archive really matter?
I haven't been able to come up with any solid argument for preferring A-1. Logically, it shouldn't matter. But, somehow, it just feels off to me.
If anyone has a concrete technical or practical reason for preferring A-1 (when available) please share!
Thank you.
Edit: In response to Peter Webster's comment, let me just clarify that replay tools (e.g. OpenWayback) would still be aware of A-1, as it would be in their index, and they would show it as the precursor to A. The link between A and B would be considered incidental (as far as replay tools are concerned) and would not be directly evident to users.
Update: I've come up with at least one reason why it might matter. See my blog post answering my own question.
Hi Kristinn,
Thanks for this, which is an important question.
I think a user needs to know of the existence, at a point in time, of A-1 *and* B, (and A-2, and A-3, as well as C or D.) I don't think it matters which of these is designated as being next in line for playback, so long as the user can know of the (non)existence of the others.
I imagine that the existence of these networks or webs of the same thing should not be managed in the same index that does all the heavy lifting, but could be managed some other way.
(It would also be a user interface design challenge to represent them to the user, but that's another story.)
Peter (@pj_webster, http://peterwebster.me )