Kris's blog: March 2015

March 23, 2015

URI Canonicalization in Web Archiving

URI canonicalization is an arcane part of web archiving that is often overlooked, despite having some important implications. This blog post is an attempt at shining a light on this topic, to illustrate those implications and point at at least one thing that we can improve on.

Heritrix

Heritrix keeps track of which URIs it has already crawled so that it doesn't repeatedly crawl the same URI. This fails when there are multiple URIs leading to the same content. Unless you deal with the issue, you wind up in a crawler trap, crawling the same (or nearly the same) content endlessly.

URI canonicalization is an attempt to deal with certain classes of these. This is done by establishing transformation rules that are applied to each discovered URI. The result of these transformations is regarded as the canonical form of the URI. It is this canonical form which is used to discard already seen URIs. The original URI is always used when making the HTTP request, the canonical form is only to prevent multiple fetches of the same content.

By default Heritrix applies the following transformations (in this order) to generate the canonical form of a URI:

Make all letters in the URI lower case.
Remove any user/password info in the URI (e.g. http://user:password@example.com becomes simply http://example.com)
Strip any leading www prefixes. Also strip any www prefixes with a number (e.g. http://www3.example.com becomes http://example.com)
Strip session IDs. This deals with several common forms of session IDs that are stored in URIs. It is unlikely to be exhaustive.
Strip trailing empty query strings. E.g. http://example.com? becomes http://example.com

Two issues are quickly apparent.

One, some of these rules may cause URIs that contain different content to be regarded is the same. For example, while domain names are case insensitive, the path segment need not be. Thus the two URIs http://example.com/A.html and http://example.com/a.html may be two different documents.

Two, despite fairly aggressive URI canonicalization rules, there will still be many instances where the exact same content is served up on multiple URIs. For example, there are likely many more types of session ids than we have identified here.

We can never hope to enumerate all the possible ways websites deliver the same content with different URIs. Heritrix does, however, make it possible to tweak the URI canonicalization rules. You can, thus deal with them on a case by case basis. Here is how to configure URI canonicalization rules in Heritrix 3.

But, you may not want to do that.

You'll understand why as we look at what URI canonicalization means during playback.

OpenWayback

As difficult as harvesting can be, accurate replay is harder. This holds doubly true for URI canonicalization.

Replay tools, like OpenWayback, must apply canonicalization under two circumstances. Firstly, when a new resource as added to the index (CDX or otherwise) it must be a canonical form that is indexed. Secondly, when resolving a URI requested by the user.

User requested URIs are not just the ones users type into search boxes. They are also the URIs for embedded content needed to display a page and the URIs of links that users click in replayed webpages.

Which is where things get complicated. OpenWayback may not use the exact same rules as Heritrix.

The current OpenWayback canonicalization rules (called AggressiveUrlCanonicalizer) applies rules that work more or less the same way as what Heritrix uses. It is not, however, the exact same rules (i.e. code), nor can they be customized in the same manner as Heritrix.

There is a strong argument for harmonizing these two. Move the code into a shared library, ensure that both tools use the same defaults and can be customized in the same manner. (It was discussion of exactly this that prompted this blog post.)

Additionally, it would be very beneficial if each WARC contained information, in some standard notation, about which canonicalization rules were in effect during its creation.

All this would clearly help but even then you still may not want to fiddle with canonicalization rules.

Every rule you add must be applied to every URI. If you add a new rule to Heritrix, you must add it to OpenWayback to ensure playback. Once added to OpenWayback it will go into effect for all URIs being searched for. However, unless you rebuild your index it may contain URIs in a form that is no longer consistent with the current canonical form. So, you need to rebuild the index, which is expensive.

This is also assuming that it is safe to impose a rule retroactively. OpenWayback does not support applying canonicalization rules to time periods. To make matters worse, a rule that would be very useful on one website, may not be suitable for another website.

Dealing with a matrix of rules which apply some of the time to some of the content from some types does not seem enticing. Maybe there is a good way to express and apply such rules that no one has yet brought to the table. In the meantime, all I can say is, proceed with caution.

March 18, 2015

Preservation Working Group Web Archiving Deduplication Survey

The IIPC Preservation Working Group is conducting a survey of member institutions deduplication activities. If you or your institution hasn't yet responded, I urge you to do so. Right now. Don't worry, I'll wait.

Answering the survey for my own institution brought up a few things I'd like to go into in greater detail than was possible in the survey.

What kind of content do you deduplicate?

We've been doing deduplication for over nine years. In that time we've done some analysis on this from time to time. Most recently last December which led to this blog posts on deduplication of text based data.

Ultimately, it boils down to a trade off between the impact the deduplication has on crawl rate and the potential savings in terms of storage. For our large, thrice yearly crawls we're now only excluding data whose content type equals "text/html".

HTML documents make up over half of all documents, but they are generally small, frequently created dynamically and compress easily. In other words, they make the deduplication index much bigger while not contributing much to deduplication space saving.

Despite this, for our weekly and other focused crawls, we do deduplicate on all content types as those crawls' speeds are usually dictated by politeness rules. Deduplicating on all content types thus comes at near zero cost. Also the more frequent crawl schedules means that there is, relatively, more savings to be had from deduplicating on HTML than in less frequent crawls.

One aspect that the survey didn't tackle, was on which protocols and response types we deduplicate. We currently do not deduplicate on FTP (we do not crawl much FTP content). We also only deduplicate on HTTP 200 responses. I plan to look into deduplicating on 404 responses in the near future. Most other response types have an empty or negligible payload.

If you deduplicate, do you deduplicate based on URL, hash or both?

Historically, we've only done URL based deduplication. Starting with the last domain crawl of 2014, we've begun rolling out hash based deduplication. This is done fairly conservatively with the same URL being chosen as the 'original' when possible. This has an impact on crawl speeds and we may revise this policy as we gain more confidence in hash based deduplication.

The reason why hash based deduplication hasn't been used earlier is all about tool support. OpenWayback has never had a secondary index to look up based on hashes. It wasn't until we introduced additional data into the WARC revisit records that this was possible. This has now been implemented by both Heritrix and OpenWayback making hash based deduplication viable.

The additional savings gained by hash based deduplication are modest or about 10% in our crawls. In certain circumstances, however, they may help deal with unfortunate website design choices.

For example, one media site we crawl recently started adding a time code to their videos' URLs so their JavaScript player could skip to the right place in them. The time code is an URL parameter (e.g. "video.mp4?t=100") that doesn't affect the downloaded content at all. It is merely a hint to the JavaScript player. With crawl time hash based deduplication, it is possible to store each video only once.

How do you deduplicate?

This question and the next few after it addresses the same issues I discussed in my blog post about the downsides of web archive deduplication.

The primary motivation behind our deduplication effort is to save on storage. We wouldn't be able to crawl nearly as often without it. Moreover, our focused crawls would be impossible without deduplication as we discard as much as 90% of data as duplicates in such crawls.

Even doing a single 'clean' (non deduplicated) crawl every two years would mean we had to go from three to two crawls a year due to our storage budget. So we don't do that.

It's an arguable tradeoff. Certainly, it is going to be more difficult for us to break out discrete crawls due to the reduplication problem. Ultimately it comes down to the fact that what we don't crawl now is lost to us, forever. The reduplication problem can be left to the future.

That just leaves...

Do you see any preservation risks related to deduplication?
Do you see any preservation risks related to deduplication and the number of W/ARC copies the institution should keep?

In general, yes, there is a risk. Any data loss has the potential to affect a larger part of your archive.

To make matters worse, if you lose a WARC you may not be able to accurately judge the effects of its loss! Unless you have an index of all the original records that all the revisit records point to, you may need to scan all the WARCs that could potentially contain revisit records pointing to the lost WARC.

This is clearly a risk.

You also can't easily mitigate it by having some highly cited original records be stored with additional copies as those original records are scattered far and wide within your archive. Again, you'd need some kind of index of revisit references.

We address this simply by having three full copies (two of which are, in turn, stored on RAID based media). You may ask, but doesn't that negate some of the monetary savings of deduplication if you then store it multiple times? True, but we are going to want at least two copies at a minimum. Additionally, only one copy need be on a high speed, high availability system (we have two copies on such systems) for access. Further copies can be stored in slower or offline media which tends to be much cheaper.

Whether three (technically 3.5 due to RAID) is enough is then the question. I don't have a definitive answer. Do you?

March 13, 2015

What's Next for OpenWayback

About one month ago, OpenWayback 2.1.0 was released. This was mostly a bug-fix release with a few new features merged in from Internet Archive's Wayback development fork. For the most part, the OpenWayback effort has focused on 'fixing' things. Making sure everything builds and runs nicely and is better documented.

I think we've made some very positive strides.

Work is now ongoing for version 2.2.0. Finally, we are moving towards implementing new things! 2.2.0 still has some fixing to do. For example, localization support needs to be improved. But, we're also planning to implement something new, support for internationalized domain names.

We've tentatively scheduled the 2.2.0 release for "spring/early summer".

After 2.2.0 is released, the question will be which features or improvements to focus on next. The OpenWayback issue tracker on GitHub has (at the time of writing) about 60 open issues in the backlog (i.e. not assigned to a specific release).

We're currently in the process of trying to prioritize these. Our current resources are nowhere sufficient to resolve them all. Prioritization will involve several aspects, including how difficult they are to implement, how popular they are and, not least, how clearly they are defined.

This is where you, dear reader, can help us out by reviewing the backlog and commenting on issues you believe to by relevant to your organization. We also invite you to submit new issues if needed.

It is enough to just leave a comment that this is relevant to your organization. Even better would be to explain why it is relevant (this helps frame the solution). Where appropriate we would also welcome suggestions for how to implement the feature. Notably in issues like the one about surfacing metadata in the interface.

If you really want to see a feature happen, the best way to make it happen is, of course, to pitch in.

Some of the features and improvements we are currently reviewing are:

Enable users to 'diff' different captures of an HTML page. Issue 15.
Enable search results with a very large number of hits. Issue 19.
Surface more metadata. Issue 28 and 29.
Enable time ranged exclusions. Issue 212.
Create a revisit test dataset. Issue 117.
Using CDX indexing as the default instead of the BDB index. Issue 132.

As I said, these are just the ones currently being considered. We're happy to look at others if there is someone championing them.

If you'd like to join the conversation, go to the OpenWayback issue tracker on GitHub and review issues without a milestone.

If you'd like to submit a new issue, please read the instructions on the wiki. The main thing to remember is to provide ample details.

We only have so many resources available. Your input is important to help us allocate them most effectively.

March 10, 2015

Implementing CrawlRSS

In my last post I talked about how you could use RSS feeds to continuously crawl sites. Or, at least, crawl sites that provide good RSS feeds.

The idea for how this could work first came to me five years ago. Indeed, at the IIPC Harvesting Working Group meeting in Vienna in September of 2010, I outlined the basic concept (slides PPT).

The concept was simple enough, unfortunately, the execution was a little more tricky. I wanted to implement this building on top of Heritrix. This saves having to redo things like WARC writing, deduplication, link extraction, politeness enforcement etc. But it did bump up against the fact that Heritrix is built around doing snapshots, not crawling continuously.

Heritrix in 30 Seconds

For those not familiar with Heritrix's inner working, I'm going to outline the key parts of the crawler.

Most crawl state (i.e. what URLs are waiting to be crawled) is managed by the frontier. All new URLs are first entered into the Frontier where they are scheduled for crawling. This is done by placing URLs in proper queues, possibly putting them in front of other URLs and so forth.

When an URL comes up for crawling it is emitted by the Frontier. At any given time, there may be multiple URLs being crawled. Each emitted URL passes through a chain of processors. This chain begins with preparatory work (e.g. check robots.txt), proceeds to actually fetching the URL, to link extraction, WARC writing and eventually links discovered in the downloaded document are processed and finally the URL is sent back to the Frontier for final disposition.

Each discovered URL passes through another chain of processors which check if it is in scope before registering it with the Frontier.

If an error occurs, the URL is still returned to the Frontier which decides if it should be retried or not.

The Frontier is Heritrix's state machine and, typically, when the Frontier is empty, the crawl is over.

Implementing CrawlRSS

Heritrix's frontier is a replaceable module. Thus one solution is to simply provide an alternative Frontier that implements the RSS crawling described in my last post. This is what I tried first and initial results where promising. I soon ran into some issues, however, when I tried to scale things up.

A lot of functionality is built into the default Heritrix frontier (so called BdbFrontier). While an attempt has been made to make this flexible via intermediate classes (AbstractFrontier) and some helper classes, the truth is that you have to redo a lot of 'plumbing' if you replace the frontier wholesale.

Because the basic function of the RssFrontier was so different from what was expected of the regular Frontier, I found no way of leveraging the existing code. Since I was also less then keen on reimplementing things like canonicalization, politeness enforcement and numerous other frontier functions, I had to change tack.

Instead of replacing the default frontier, I decided to 'manage' it instead. The key to this is the ability of the default frontier to 'run while empty'. That is to say, it can be configured to not end the crawl when all the queues are empty.

The responsibility for managing the crawl state would reside primarily in a new object, the RssCrawlController. Instead of the frontier being fed an initial list of seed URLs via the usual mechanism, the RssCrawlController is wholly responsible for providing seeds. It also keeps track of any URLs deriving from the RSS feeds and ensures that the feeds aren't rescheduled with the frontier until the proper time.

This is accomplished by doing three things. One, providing a replacement FrontierPreparer module. The FrontierPreparer module is responsible for preparing discovered URLs for entry into the frontier. The overloaded class RssFrontierPreparer also notifies the RssCrawlController allowing it to keep track of URLs descended from a feed.

Two, listen for CrawlURIDispositionEvents. These are events that are triggered whenever an URL is finished, either successfully or fails with no (further) retry possible.

Three, deal with the fact that no disposition event is triggered for URLs that are deemed duplicates.

UriUniqFilter

Heritrix uses so called UriUniqFilters to avoid repeatedly downloading the same URL. This can then be bypassed for individual URLs if desired (such as for refreshing robots.txt or, as in our case, when recrawling RSS feeds).

Unfortunately, this function is done in an opaque manner inside the frontier. An URL that is scheduled with the frontier and then fails to pass this filter will simply disappear. This was no good as the RssCrawlController needs a full accounting of each discovered URLs or it can't proceed to recrawl the RSS feeds.

To accomplish this I subclassed the provided filters so they implemented a DuplicateNotifier interface. This allows the RssCrawlController to keep track of those URLs that are discarded as duplicates.

It seems that perhaps the UriUniqFilters should be applied prior to candidate URIs being scheduled with the frontier, but that is a subject for another time.

RSS Link Extraction

A custom RssExtractor processor handles extracting links from RSS feeds. This is done by parsing the RSS feeds using ROME.

The processor is aware of the time of the most recently seen item for each feed and will only extract those items whose date is after that. This avoids crawling the same items again and again. Of course, if the feed itself is deemed a duplicate by hash comparison, this extraction is skipped.

If any new items are found in the feed, the extractor will also add the implied pages for that feed as discovered links. Both the feed items and implied pages are flagged specially so that they avoid the UriUniqFilter and will be given a pass by the scoping rules.

The usual link extractors are then used to find further links in the URLs that are extracted by the RssExtractor.

Scope

As I noted in my last post, getting the scope right is very important. It is vital that it does not leak and that after crawling a feed, its new items and embedded resources, you exhaust all the URLs deriving from the feed. Otherwise, you'll never recrawl the feed.

Some of the usual scoping classes are still used with CrawlRSS, such as PrerequisiteAcceptDecideRule and SchemeNotInSetDecideRule but for the most part a new decide rule, RssInScopeDecideRule, handles scoping.

RssInScopeDecideRule is a variant on HopsPathMatchesRegexDecideRule. By default the regular expression .R?((E{0,2})|XE?) determines if an URL is accepted or not. Remember, this is applied to the hop path, not the URL itself. This allows, starting from the seed, one hop (any type) then one optional redirect (R) then up to two levels of embeds (E), only the first of which may be a speculative embed (X).

The decide rule also automatically accepts URLs that are discovered in the feeds by RssExtractor.

As noted in my last post, during the initial startup it will take awhile to exhaust the scope. However, after the first run, most embedded content will be filtered out by the UriUniqFilter, making each feed refresh rather quick. You can tweak how long each round takes by changing the politeness settings. In practice we've found no problems refreshing feeds every 10 minutes. Possibly, you could go to as little as 1 minute, but that would likely necessitate a rather aggressive politeness policy.

Configuration

CrawlRSS comes packaged with a Heritrix profile where all of the above has been set up correctly. The only element that needs to be configured is which RSS feeds you want to crawl.

This is done by specifying a special bean, RssSite, in the cxml. Lets look at an example.

<bean id="rssRuv" class="is.landsbokasafn.crawler.rss.RssSite">
<property name="name" value="ruv.is"/>
<property name="minWaitInterval" value="1h30m"/>
<property name="rssFeeds">
    <list>
      <bean class="is.landsbokasafn.crawler.rss.RssFeed">
        <property name="uri" value="http://www.ruv.is/rss/frettir"/>
        <property name="impliedPages">
          <list>
            <value>http://www.ruv.is/</value>
          </list>
        </property>
      </bean>
      <bean class="is.landsbokasafn.crawler.rss.RssFeed">
        <property name="uri" value="http://www.ruv.is/rss/innlent"/>
        <property name="impliedPages">
          <list>
            <value>http://www.ruv.is/</value>
            <value>http://www.ruv.is/innlent</value>
          </list>
        </property>
      </bean>
    </list>
</property>

</bean>

Each site need not map to an actual site, but I generally find that to be the best way. For each site you set a name (purely for reporting purposes) and a minWaitInterval. This is the minimum amount of time that will elapse between the feeds for this site being crawled.

You'll note that the minimum wait interval is expressed in human readable (-ish) form, e.g. 1h30m instead of specifying simply a number of seconds or minutes. This provides good flexibility (i.e. you can specify intervals down to a second) with a user friendly notation (no need to figure out how many minutes there are in 8 hours!)

There is then a list of RssFeeds. Each of which specifies the URL of one RSS feed. Each feed then has a list of implied pages, expressed as URLs.

You can have as many feeds within a site as you like.

You repeat the site bean as often as needed for additional sites. I've tested with up to around 40 sites and nearly 100 feeds. There is probably a point at which this will stop scaling, but I haven't encountered it.

Alternative Configuration - Databases

Configuring dozens or even hundreds of sites and feeds via the CXML quickly get tedious. It also makes it difficult to change the configuration without stopping and restarting the crawl, losing the current state of the crawl.

For this purpose, I added a configuration manager interface. The configuration manager provides the RssCrawlController with the sites, all set up correctly. Simply wire in the appropriate implementation. Provided is the CxmlConfigurationManager as described above and the DbConfigurationManager that interacts with an SQL database.

The database is accessed using Hibernate and I've included the necessary MySQL libraries (for other DBs you'll need to add the connector JARs to Heritrix's lib directory). This facility does require creating your own CRUD interface to the database, but makes it much easier to manage the configuration.

Changes in the database are picked up by the crawler and the crawler also updates the database to reflect last fetch times etc. This can survive crawl job restarts, making it much easier to stop and start crawl without getting a lot of redundant content. When pared with the usual deduplication strategies, no unnecessary duplicates should be written to WARC.

There are two sample profiles bundled with CrawlRSS. One for each of these two configuration styles. I recommend starting with the default, CXML, configuration.

Implementing additional configuration managers is fairly simple if a different framework (e.g. JSON) is preferred.

Reporting

The RssCrawlController provides a detailed report. In the sample configuration, I've added the necessary plumbing so that this reports shows up alongside the normal Heritrix reports in the web interface. It can also be accessed via the scripting console by invoking the getReport() method on the RssCrawlController bean.

Release

All of what I've described above is written and ready for use. You can download a snapshot build from CrawlRSS's Sonatype snapshot repository. Alternatively, you can download the source from CrawlRSS's Github project page and build it yourself (simply download and run 'mvn package').

So, why no proper release? The answer is that the CrawlRSS is built against Heritrix 3.3.0-SNAPSHOT. There have been notable changes in Heritrix's API since 3.2.0 (latest stable) requiring a SNAPSHOT dependency.

I would very much like to have a 'stable' release of the 3.3.0 branch Heritrix, but that is a subject for another post.