January 29, 2015

The downside of web archive deduplication

I've talked a lot about deduplication (both here and elsewhere). It's a huge boon for web archiving efforts lacking infinite storage. Its value increases as you crawl more frequently, and crawling frequently has been a key aim for my institution. The web is just too fickle.

Still, we mustn't forget about the downsides of deduplication. It's not all rainbows and unicorns, after all.

With proper deduplication, the aim is that you should be able to undo it. You can dededuplicate, as it were, or perhaps (less tongue-twisty) reduplicate. This was, in part, the motivation for the IIPC Proposal for Standardizing the Recording of Arbitrary Duplicates in WARC Files.
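To make that concrete, a revisit record written along the lines of the proposal looks roughly like the following (the URLs, dates and digest are invented for illustration). The WARC-Refers-To-* fields point back at the original capture, and the record block carries the HTTP response headers so that nothing essential is lost when the payload itself is omitted:

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://example.com/style.css
WARC-Date: 2015-01-29T12:00:00Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: http://example.com/style.css
WARC-Refers-To-Date: 2015-01-22T12:00:00Z
WARC-Payload-Digest: sha1:EXAMPLEPAYLOADDIGESTAAAAAAAAAAAA
WARC-Truncated: length
Content-Type: application/http; msgtype=response

HTTP/1.1 200 OK
Content-Type: text/css
Last-Modified: Thu, 22 Jan 2015 11:59:00 GMT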

So, if it is reversible, there can't be a downside? Right? Err, wrong.

We'll ignore for the moment the fact that many institutions have in the past done (and some may well still do) non-reversible deduplication, most notably by not storing the response headers in the revisit record. Even when done properly, a deduplicated archive has some practical implications. The two primary ones are loss potential and loss of coherence.

Loss potential

There is the adage that 'lots of copies keep stuff safe'. In a fully deduplicated archive, the loss of one original record could affect the replay of many different collections. One could argue that if a resource is used enough to be represented in multiple crawls, it may be of value to store more than one copy in order to reduce the chance of its loss.

It is worrying to think that the loss of one collection (or part of one collection) could damage the integrity of multiple other collections. Even if we only deduplicate against previous crawls of the same collection, there are implications for data safety.

Complicating this is the fact that we do not normally have any visibility into which original records have a lot of revisit records referencing them. It would be useful to be able to map out which records are most heavily referenced; that could enable additional replication once a certain threshold is met.
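As a rough sketch of how one might get that visibility, assuming a local CDX file (here hypothetically named collection.cdx) in the common eleven-field flavour, with the MIME type in the fourth field and the payload digest in the sixth (adjust the positions to whatever your index actually uses), something like this Python would tally how many revisit lines point at each digest:

from collections import Counter

def most_referenced(cdx_path, top=20):
    # Tally revisit lines by the payload digest they share with the original.
    counts = Counter()
    with open(cdx_path) as cdx:
        for line in cdx:
            fields = line.split()
            # Revisits are conventionally indexed with MIME type "warc/revisit".
            if len(fields) > 5 and fields[3] == 'warc/revisit':
                counts[fields[5]] += 1
    return counts.most_common(top)

# Originals whose digests top this list are the ones whose loss would hurt most.
for digest, refs in most_referenced('collection.cdx'):
    print(refs, digest)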

Concerns about data safety/integrity must be weighed against the cost of storage and the likelihood of data loss. That is a judgement each institution must make for its own collection.

Loss of coherence

While the deduplication is reversible (or should be!), that is not to say that reversing it is trivial. If you wish to extract a single harvest from a series of (let's say) weekly crawls that have been going on for years, you may find yourself resolving revisit records against hundreds of different harvests.

While resolving any one revisit is relatively cheap, it is not entirely free. A weekly crawl of one million URLs may have as many as half a million revisit records. For each such record you need to query an index (CDX or otherwise) to resolve the original record, and then you need to access and copy that record. Index searches take on the order of hundreds of milliseconds for non-trivial indexes, and because the original records will be scattered across many files, you end up reading bits and pieces from all over your disk. Such scattered reads are much more costly on spinning HDDs than linear reads of large files.
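To put rough numbers on that: half a million lookups at, say, 200 milliseconds apiece comes to around 100,000 seconds, well over a day, spent on index queries alone, before a single original record has been copied.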

Copying this hypothetical harvest might take only minutes if it were not deduplicated. Creating a reduplicated harvest might take days. (Of course, if you have a Hadoop cluster or other high-performance infrastructure, this may not be an issue.)

I'm sure there are ways of optimizing this. For example, it is probably best to collect all the revisits, sort them, and then query the index in the same order it is structured in. This improves the index's cache hit rate. Something similar can be done when reading the originals from the WARC files.
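A minimal sketch of that first optimization, where cdx_lookup is a stand-in for whatever index API you actually have, and the keys come from the revisit records' WARC-Refers-To-* fields after canonicalization:

def resolve_in_index_order(revisits, cdx_lookup):
    # revisits: iterable of (urlkey, timestamp) pairs identifying the originals.
    # Sorting them puts the queries in the same order as the sorted CDX,
    # so consecutive lookups tend to hit the same index blocks.
    for urlkey, timestamp in sorted(revisits):
        yield cdx_lookup(urlkey, timestamp)  # location of the original record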

There is, though, one other problem: no tool currently exists to do this. You'll need to write your own "reduplicator" if you want to reverse the process.


Neither of these issues is argument enough to cause me to abandon deduplication. The fact is that we simply could not do the web archiving we currently do without it. But it is worth being aware of the downsides as well.
