March 18, 2015

Preservation Working Group Web Archiving Deduplication Survey

The IIPC Preservation Working Group is conducting a survey of member institutions' deduplication activities. If you or your institution hasn't yet responded, I urge you to do so. Right now. Don't worry, I'll wait.

Answering the survey for my own institution brought up a few things I'd like to go into in greater detail than the survey format allowed.

What kind of content do you deduplicate?

We've been doing deduplication for over nine years and have analyzed its effects from time to time, most recently last December, which led to this blog post on deduplication of text-based data.

Ultimately, it boils down to a trade-off between the impact deduplication has on crawl rate and the potential storage savings. For our large, thrice-yearly crawls, the only data we now exclude from deduplication is that with a content type of "text/html".

HTML documents make up over half of all documents, but they are generally small, frequently created dynamically, and compress easily. In other words, they make the deduplication index much bigger while contributing little to the space savings from deduplication.

For our weekly and other focused crawls, on the other hand, we do deduplicate on all content types, as those crawls' speeds are usually dictated by politeness rules; deduplicating on everything thus comes at near-zero cost. The more frequent crawl schedule also means that, relatively speaking, there are greater savings to be had from deduplicating HTML than in less frequent crawls.
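To make the policy concrete, here is a minimal sketch, in Java, of the content-type rule described above. The class and method names are hypothetical; this is not actual Heritrix or DeDuplicator code, just an illustration of the decision:

    // Minimal sketch of the content-type policy; hypothetical names only.
    public class ContentTypeDedupPolicy {

        /**
         * Should a fetched document be considered for deduplication?
         *
         * @param contentType  the Content-Type reported for the document
         * @param isBroadCrawl true for the large, thrice-yearly domain crawls,
         *                     false for weekly and other focused crawls
         */
        public boolean shouldConsiderForDedup(String contentType, boolean isBroadCrawl) {
            // Normalize "text/html; charset=utf-8" and the like down to the bare type.
            String mime = contentType == null ? "" : contentType.split(";")[0].trim();

            if (isBroadCrawl) {
                // Broad crawls: HTML bloats the deduplication index while
                // contributing little to the storage savings, so skip it.
                return !"text/html".equalsIgnoreCase(mime);
            }
            // Focused crawls: politeness rules dictate crawl speed anyway,
            // so deduplicating every content type comes at near-zero cost.
            return true;
        }
    }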

One aspect the survey didn't tackle was which protocols and response types we deduplicate on. We currently do not deduplicate on FTP (we do not crawl much FTP content), and we only deduplicate on HTTP 200 responses. I plan to look into deduplicating on 404 responses in the near future; most other response types have an empty or negligible payload.
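Expressed in the same hypothetical style, the protocol and response-code constraints amount to a guard that could sit alongside the method above:

    // Sketch of the protocol/response-code guard; again, hypothetical names.
    public boolean isEligibleResponse(String scheme, int statusCode) {
        // No deduplication on FTP (we crawl very little of it), and only
        // HTTP 200 responses are considered for now; most other response
        // types carry an empty or negligible payload.
        boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        return isHttp && statusCode == 200;
    }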

If you deduplicate, do you deduplicate based on URL, hash or both?

Historically, we've only done URL-based deduplication. Starting with the last domain crawl of 2014, we've begun rolling out hash-based deduplication. This is done fairly conservatively, with the same URL being chosen as the 'original' whenever possible. This has an impact on crawl speeds, and we may revise the policy as we gain more confidence in hash-based deduplication.
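Roughly, the lookup behaves like the sketch below. Everything in it, the class names and the way keys are built, is hypothetical and only meant to illustrate the 'same URL preferred' policy, not how our tools actually implement it:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of a crawl-time deduplication index that prefers
    // a same-URL original before falling back to a pure hash match.
    public class DedupIndex {

        /** Minimal description of a stored 'original' record. */
        public static final class StoredRecord {
            public final String url;
            public final String date;      // capture timestamp
            public final String warcFile;  // WARC file holding the original
            public StoredRecord(String url, String date, String warcFile) {
                this.url = url; this.date = date; this.warcFile = warcFile;
            }
        }

        // url + digest -> a record captured under that same URL
        private final Map<String, StoredRecord> byUrlAndDigest = new HashMap<>();
        // digest -> any record with that payload digest
        private final Map<String, StoredRecord> byDigest = new HashMap<>();

        public void add(String url, String payloadDigest, StoredRecord record) {
            byUrlAndDigest.put(url + " " + payloadDigest, record);
            byDigest.putIfAbsent(payloadDigest, record);
        }

        /** Find a possible 'original' for a freshly fetched document. */
        public StoredRecord findOriginal(String url, String payloadDigest) {
            // Conservative policy: prefer an original stored under the same URL.
            StoredRecord sameUrl = byUrlAndDigest.get(url + " " + payloadDigest);
            if (sameUrl != null) {
                return sameUrl;
            }
            // Otherwise accept any record with an identical payload digest.
            return byDigest.get(payloadDigest);
        }
    }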

The reason hash-based deduplication wasn't used earlier comes down to tool support. OpenWayback has never had a secondary index for looking records up by hash. It wasn't until we introduced additional data into the WARC revisit records that this became possible. That support has now been implemented in both Heritrix and OpenWayback, making hash-based deduplication viable.
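For illustration, the headers of such a revisit record might look roughly like this (all values below are made up). The WARC-Refers-To-Target-URI and WARC-Refers-To-Date fields are the additional data in question; they let the replay tool find the original record even when it was captured under a different URL:

    WARC/1.0
    WARC-Type: revisit
    WARC-Target-URI: http://example.org/logo.png
    WARC-Date: 2015-03-01T12:00:00Z
    WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
    WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
    WARC-Refers-To-Target-URI: http://example.org/images/logo.png
    WARC-Refers-To-Date: 2014-11-01T12:00:00Z
    Content-Type: application/http; msgtype=response
    Content-Length: 217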

The additional savings gained by hash-based deduplication are modest, about 10% in our crawls. In certain circumstances, however, hash-based deduplication can help deal with unfortunate website design choices.

For example, one media site we crawl recently started adding a time code to their videos' URLs so their JavaScript player could skip to the right place in them. The time code is a URL parameter (e.g. "video.mp4?t=100") that doesn't affect the downloaded content at all; it is merely a hint to the JavaScript player. With crawl-time, hash-based deduplication, it is possible to store each video only once.
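In terms of the hypothetical DedupIndex sketched earlier, every time-coded variant resolves to the single stored copy:

    // Hypothetical usage of the DedupIndex sketch from above.
    public class TimeCodeExample {
        public static void main(String[] args) {
            DedupIndex index = new DedupIndex();
            // Every time-coded variant of the video has the same payload digest.
            String digest = "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ";

            // The video is stored in full the first time it is fetched.
            index.add("http://media.example/video.mp4", digest,
                    new DedupIndex.StoredRecord("http://media.example/video.mp4",
                            "2015-03-01T12:00:00Z", "crawl-00001.warc.gz"));

            // A later fetch of a time-coded variant hits the digest index and can be
            // written as a revisit record instead of storing the video payload again.
            DedupIndex.StoredRecord original =
                    index.findOriginal("http://media.example/video.mp4?t=100", digest);
            System.out.println(original.url);  // prints http://media.example/video.mp4
        }
    }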

How do you deduplicate?

This question and the next few after it address the same issues I discussed in my blog post about the downsides of web archive deduplication.

The primary motivation behind our deduplication effort is to save on storage. We wouldn't be able to crawl nearly as often without it. Moreover, our focused crawls would be impossible without deduplication, as we discard as much as 90% of the data as duplicates in those crawls.

Even doing a single 'clean' (non-deduplicated) crawl every two years would, given our storage budget, force us to go from three crawls a year down to two. So we don't do that.

It's an arguable trade-off. Certainly, it is going to be more difficult for us to break out discrete crawls due to the reduplication problem. Ultimately, though, it comes down to the fact that what we don't crawl now is lost to us forever. The reduplication problem can be left to the future.

That just leaves...

Do you see any preservation risks related to deduplication? 
Do you see any preservation risks related to deduplication and the number of W/ARC copies the institution should keep? 

In general, yes, there is a risk. Any data loss has the potential to affect a larger part of your archive.

To make matters worse, if you lose a WARC you may not be able to accurately judge the effects of its loss! Unless you have an index of the original records that your revisit records point to, you may need to scan every WARC that could potentially contain revisit records pointing to the lost one.

This is clearly a risk.

You also can't easily mitigate this by storing additional copies of highly cited original records, as those records are scattered far and wide within your archive. Again, you'd need some kind of index of revisit references.
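To make that concrete, here is a rough sketch of what such an index of revisit references could look like. It is purely illustrative; the class and method names are made up and nothing like this exists in our setup today:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical reverse index: for each WARC file holding 'original' records,
    // which captures depend on it via revisit records. With such an index,
    // assessing the impact of a lost WARC, or adding extra copies of heavily
    // referenced ones, no longer requires scanning the whole archive.
    public class RevisitReferenceIndex {

        /** Minimal description of a capture: URI, timestamp and containing WARC. */
        public static final class Capture {
            public final String uri;
            public final String date;      // e.g. "2014-11-01T12:00:00Z"
            public final String warcFile;
            public Capture(String uri, String date, String warcFile) {
                this.uri = uri; this.date = date; this.warcFile = warcFile;
            }
        }

        // (uri + date) of an original -> WARC file that holds it
        private final Map<String, String> originalLocation = new HashMap<>();
        // WARC file holding originals -> captures whose revisit records point into it
        private final Map<String, List<Capture>> dependents = new HashMap<>();

        public void addOriginal(Capture original) {
            originalLocation.put(original.uri + " " + original.date, original.warcFile);
        }

        /** Register a revisit record and the original capture it refers to. */
        public void addRevisit(Capture revisit, String refersToUri, String refersToDate) {
            String originalWarc = originalLocation.get(refersToUri + " " + refersToDate);
            if (originalWarc != null) {
                dependents.computeIfAbsent(originalWarc, k -> new ArrayList<>()).add(revisit);
            }
        }

        /** Captures that become unresolvable if the given WARC file is lost. */
        public List<Capture> affectedByLossOf(String warcFile) {
            return dependents.getOrDefault(warcFile, Collections.emptyList());
        }
    }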

We address this simply by having three full copies (two of which are, in turn, stored on RAID-based media). You may ask: doesn't storing the data multiple times negate some of the monetary savings of deduplication? True, but we would want at least two copies in any case. Additionally, only one copy needs to be on a high-speed, high-availability system for access (we currently have two copies on such systems). Further copies can be stored on slower or offline media, which tends to be much cheaper.

Whether three (technically 3.5 due to RAID) is enough is then the question. I don't have a definitive answer. Do you?
