January 29, 2015

The downside of web archive deduplication

I've talked a lot about deduplication (both here and elsewhere). It's a huge boon for web archiving efforts lacking infinite storage. Its value increases as you crawl more frequently, and crawling frequently has been a key aim for my institution. The web is just too fickle.

Still, we mustn't forget about the downsides of deduplication. It's not all rainbows and unicorns, after all.

With proper deduplication, the aim is that you should be able to undo it. You can dededuplicate, as it were, or perhaps (less tongue-twisty) reduplicate. This was, in part, the reason for the IIPC Proposal for Standardizing the Recording of Arbitrary Duplicates in WARC Files.

So, if it is reversible, there can't be a downside? Right? Err, wrong.

We'll ignore for the moment the fact that many institutions have in the past done (and some may well still do) non-reversible deduplication (notably, not storing the response header in the revisit record). There are some practical implications of having a deduplicated archive. The two primary ones are loss potential and loss of coherence.

Loss potential

There is the adage that 'lots of copies keep stuff safe'. In a fully deduplicated archive, the loss of one original record could affect the replay of many different collections. One could argue that if a resource is used enough to be represented in multiple crawls, it may be of value to store more than one copy in order to reduce the chance of its loss.

It is worrying to think that the loss of one collection (or part of one collection) could damage the integrity of multiple other collections. Even if we only deduplicate against previous crawls of the same collection, there are implications for data safety.

Complicating this is the fact that we do not normally have any overview of which original records have a lot of revisit records referencing them. It might be useful to be able to map out which records are most heavily referenced. This could enable additional replication when a certain threshold is met.
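As a rough illustration of what such a mapping could look like, here is a minimal sketch in Python. It assumes the common 11-field CDX layout (mime type in the fourth column, payload digest in the sixth) and simply tallies how many revisit entries point at each digest; the file names and the threshold are made up.

```python
# Hypothetical sketch: tally how many revisit entries across a set of CDX
# files point at each payload digest. Heavily referenced digests identify
# original records whose loss would hurt the most.
from collections import Counter

def revisit_counts(cdx_paths):
    counts = Counter()
    for path in cdx_paths:
        with open(path, encoding='utf-8', errors='replace') as f:
            for line in f:
                if line.startswith(' CDX'):      # header line
                    continue
                fields = line.split()
                if len(fields) < 6:
                    continue
                mime, digest = fields[3], fields[5]
                if mime == 'warc/revisit':
                    counts[digest] += 1
    return counts

if __name__ == '__main__':
    # Made-up file names; in practice, point this at the indexes of all
    # crawls that deduplicate against each other.
    counts = revisit_counts(['weekly-2014.cdx', 'weekly-2015.cdx'])
    for digest, n in counts.most_common():
        if n >= 100:                             # arbitrary 'heavily cited' threshold
            print(digest, n)
```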

Concerns about data safety/integrity must be weighed against the cost of storage and the likelihood of data loss. That is a judgement each institution must make for its own collection.

Loss of coherence

While the deduplication is reversible (or should be!), that is not to say that reversing it is trivial. If you wish to extract a single harvest from a series of (let's say) weekly crawls that have been going on for years, you may find yourself resolving revisit records to hundreds of different harvests.

While resolving any one revisit is relatively cheap, it is not entirely free. A weekly crawl of one million URLs may have as many as half a million revisit records. For each such record you need to query an index (CDX or otherwise) to resolve the original record, and then you need to access and copy that record. Index searches take on the order of hundreds of milliseconds for non-trivial indexes, and because the original records will be scattered all over, you'll be reading bits and pieces from all over your disk. Reading bits and pieces is much more costly on spinning HDDs than linear reads of large files.

Copying this hypothetical harvest might only take minutes if non-deduplicated. Creating a reduplicated harvest might take days. (Of course, if you have a Hadoop cluster or other high-performance infrastructure this may not be an issue.)

I'm sure there are ways of optimizing this. For example, it is probably best to record all the revisits, sort them, and then search the index for them in the same order the index itself is sorted in. This will improve the index's cache hit rate. Similar things can be done for reading the originals from the WARC files.
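To sketch the idea (under the assumption of a plain, sorted CDX file whose first two fields are the canonicalized URL key and the timestamp), one could sort the wanted entries the same way and stream through the index once, merge-join style, rather than doing a random lookup per revisit:

```python
# Sketch of an ordered resolution pass. 'wanted' is an iterable of
# (urlkey, timestamp) pairs taken from the revisit records; the CDX file is
# assumed to be sorted on those same two fields.
def resolve_in_index_order(wanted, cdx_path):
    queue = sorted(set(wanted))          # same collation as the sorted index
    results = {}
    i = 0
    with open(cdx_path, encoding='utf-8', errors='replace') as cdx:
        for line in cdx:
            if i >= len(queue):
                break                    # everything resolved (or passed)
            if line.startswith(' CDX'):  # header line
                continue
            fields = line.split()
            if len(fields) < 2:
                continue
            key = (fields[0], fields[1])
            while i < len(queue) and queue[i] < key:
                i += 1                   # no entry found; leave unresolved
            if i < len(queue) and queue[i] == key:
                results[queue[i]] = line.rstrip('\n')
                i += 1
    return results
```

The details matter less than the shape of it: one sequential pass over the index instead of hundreds of thousands of random lookups.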

There is, though, one other problem: no tool currently exists to do this. If you need to do this, you'll have to write your own "reduplicator".
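For what it's worth, the mechanical core of such a tool is not enormous. Here is a bare-bones sketch of the copy step, assuming the originals have already been resolved to CDX lines whose last three fields are compressed record size, offset and WARC file name, and relying on each record in a .warc.gz being its own gzip member. Everything else a real reduplicator would need (handling the revisit records themselves, for instance) is left out.

```python
# Hypothetical copy step of a 'reduplicator': append the raw, still-gzipped
# original records to an output WARC, using the size/offset/filename fields
# from their resolved CDX lines.
import os

def copy_originals(resolved_cdx_lines, warc_dir, out_path):
    with open(out_path, 'ab') as out:
        for line in resolved_cdx_lines:
            fields = line.split()
            size, offset, filename = int(fields[-3]), int(fields[-2]), fields[-1]
            with open(os.path.join(warc_dir, filename), 'rb') as warc:
                warc.seek(offset)
                out.write(warc.read(size))   # one complete gzip member
```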


Neither of these issues is argument enough to cause me to abandon deduplication. The fact is that we simply could not do the web archiving we currently do without it. But it is worth being aware of the downsides as well.

January 23, 2015

Answering my own question. Does it matter which record is marked as the original?

Back in October, I posed a question about web archive deduplication: does it matter which record is marked as the original?

I've come up with at least one reason why it might matter during playback.

Consider what happens in a replay tool like OpenWayback when it encounters a revisit record. The revisit record should contain information about the original URL and original capture time. The replay tool then has to go and make another query on its index to resolve these values.

If the URL is the same URL as was just resolved for the revisit record itself, it follows that any cache (including the OS disk cache for a CDX file) will likely be warm, leading to a quick lookup. In fact, OpenWayback will likely just be able to scroll backwards in its 'hit' list for the original query.

On the other hand, if the URL is different it will be located in a different part of the index (whether it is a CDX file or some other structure) and may require hitting the disk(s) again. Any help from the cache will be incidental.
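A toy sketch may make the difference clearer. Assuming the revisit record carries WARC-Refers-To-Target-URI and WARC-Refers-To-Date headers (as described in the IIPC duplicates proposal), the replay tool turns them into an index key and searches for it. The index keys and the canonicalization below are invented for illustration:

```python
# Toy illustration of the second lookup a replay tool makes for a revisit.
# The 'index' is a sorted list of (urlkey, timestamp) pairs standing in for
# a CDX file; the keys and the canonicalize() stand-in are made up.
import bisect

def canonicalize(uri):
    return uri.lower()                   # stand-in for real URL canonicalization

def to_timestamp(warc_date):
    # '2015-01-01T12:00:00Z' -> '20150101120000'
    return ''.join(ch for ch in warc_date if ch.isdigit())

index_keys = sorted([
    ('http://example.com/logo.png', '20150101120000'),   # original capture
    ('http://example.com/logo.png', '20150108120000'),   # the revisit itself
    ('http://example.com/style.css', '20150108120000'),
    ('http://other.org/logo.png', '20140601000000'),     # same payload, different URL
])

def resolve_original(headers):
    key = (canonicalize(headers['WARC-Refers-To-Target-URI']),
           to_timestamp(headers['WARC-Refers-To-Date']))
    return bisect.bisect_left(index_keys, key)

# Same-URL original: lands right next to the revisit's own entry (cache-warm).
print(resolve_original({'WARC-Refers-To-Target-URI': 'http://example.com/logo.png',
                        'WARC-Refers-To-Date': '2015-01-01T12:00:00Z'}))   # -> 0

# Digest-only match under another URL: lands in a different region of the index.
print(resolve_original({'WARC-Refers-To-Target-URI': 'http://other.org/logo.png',
                        'WARC-Refers-To-Date': '2014-06-01T00:00:00Z'}))   # -> 3
```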

Just how much this matters in practice depends on how large your index is and how quickly it can be accessed. If it is served from a machine with super fast SSDs and copious amounts of RAM, it may not matter at all. If the index is on a 5400 RPM HDD on a memory starved machine it may matter a lot.

In any case, I think this is sufficient justification for keeping the deduplication strategy of preferring exact URL matches when available, while allowing digest-only matches when no exact match exists.


January 19, 2015

The First IIPC Technical Training Workshop

It has always been interesting to me how often a chance remark will lead to big things within the IIPC. A stray thought, given voice during a coffee break or dinner at a general assembly, can hit a resonance and lead to something new, something exciting.

So it was with the first IIPC technical training workshop. It started off as an off-the-cuff joke at last year's Paris GA, about having a 'Heritrix DecideRule party'. It struck a chord, and quickly snowballed to include an 'OpenWayback upgradathon' and a 'SolrJam'.

The more we talked about this, the more convinced I became that this was a good idea: a hands-on workshop where IIPC members could send staff for practical training in these tools. Fortunately, Helen Hockx-Yu of the British Library shared this conviction. Even more fortunately, the IIPC Steering Committee wholeheartedly supported the idea. Most fortunate of all, the BL was ready, willing and able to host such an event.

So, last week, on a rather dreary January morning, around thirty web archiving professionals, from as far away as New Zealand, gathered outside the British Library in London and waited for the doors to open, everyone eager to learn more about Heritrix, OpenWayback and Solr.

Day one was dedicated to traditional, presentation-oriented dissemination of knowledge. On hand were several invited experts on each topic. In the morning the fundamentals of the three tools were discussed, with more in-depth topics after lunch. Roger Coram (BL) and I were responsible for covering Heritrix: in the morning Roger discussed the basics of Heritrix DecideRules while I covered other core features, notably sheet overlays. The afternoon focused on Heritrix's REST API, deduplication at crawl time, and writing your own Heritrix modules.
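For those who missed the REST API session, the gist of it is that a Heritrix 3 job can be driven entirely over HTTPS. As a rough, hedged example (assuming a local Heritrix 3 instance with its default port, self-signed certificate and admin/admin credentials, and a hypothetical job named 'weekly-crawl'), something like the following builds and launches a job from Python:

```python
# Sketch of driving the Heritrix 3 REST API. Port, credentials and the job
# name 'weekly-crawl' are illustrative defaults; adjust to your own install.
import requests
from requests.auth import HTTPDigestAuth

JOB_URL = 'https://localhost:8443/engine/job/weekly-crawl'
AUTH = HTTPDigestAuth('admin', 'admin')

for action in ('build', 'launch', 'unpause'):
    resp = requests.post(JOB_URL,
                         data={'action': action},
                         auth=AUTH,
                         headers={'Accept': 'application/xml'},
                         verify=False)        # Heritrix ships a self-signed cert
    resp.raise_for_status()
    print(action, resp.status_code)
```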

There is no need for me to repeat all of the topics. The entire first day was filmed and made available online on IIPC's YouTube channel.

Day one went well, but it wasn't radically different from what we have done before at GAs. It was days two and three that made this meeting unique.

For the latter two days only a very loose agenda was provided: a list of tasks related to each tool, varying in complexity. Attendees chose tasks according to their interests and level of technical know-how. Some installed and ran their very first Heritrix crawl or set up their first OpenWayback instance. I set up Solr via the BL's webarchive-discovery and set it to indexing one of our collections.

Others focused on more advanced tasks involving Heritrix sheet overlays and the REST API, OpenWayback WAR overlays and CDX generation, or ... I really don't know what the advanced Solr tasks were. I was just happy to get the basic indexing up and running.

The 'experts' who did presentations on day one were, of course, on hand during days two and three to assist. I found this to be a very good model. Impromptu presentations were made on specific topics, and the particular issues of individual attendees could be addressed. I learned a fair amount about how other IIPC members actually conduct their crawls. There is nothing like hands-on knowledge. I think both experts and attendees got a lot out of it.

It was almost sad to see the three day event come to an end.

So, overall, a success. Definitely meriting an encore.

That isn't to say it was perfect; there is always room for improvement. Given a bit more lead-up time, it would have been possible to get a firmer idea of the actual interests of the attendees. For this workshop there was a bit of guesswork. I think we were in the ballpark, but we can do better next time. It would also have been useful to have better developed tasks for the less experienced attendees.

So, will there be an opportunity to improve? I certainly hope so. We will need to decide where (London again or elsewhere) and when (same time next year or ...). The final decision will then be up to the IIPC Steering Committee. All I can say is that I'm for it, and I hope we can make this an annual event. A sort of counterpoint to the GA.

We'll see.

Finally, I'd like to thank Helen and the British Library for their role as host and all of our experts for their contribution.