March 23, 2015

URI Canonicalization in Web Archiving

URI canonicalization is an arcane part of web archiving that is often overlooked, despite having some important implications. This blog post is an attempt at shining a light on this topic, to illustrate those implications and point at at least one thing that we can improve on.

Heritrix

Heritrix keeps track of which URIs it has already crawled so that it doesn't repeatedly crawl the same URI. This fails when there are multiple URIs leading to the same content. Unless you deal with the issue, you wind up in a crawler trap, crawling the same (or nearly the same) content endlessly.

URI canonicalization is an attempt to deal with certain classes of these. This is done by establishing transformation rules that are applied to each discovered URI. The result of these transformations is regarded as the canonical form of the URI. It is this canonical form which is used to discard already seen URIs. The original URI is always used when making the HTTP request, the canonical form is only to prevent multiple fetches of the same content.

By default Heritrix applies the following transformations (in this order) to generate the canonical form of a URI:
  • Make all letters in the URI lower case.
  • Remove any user/password info in the URI (e.g. http://user:password@example.com becomes simply http://example.com)
  • Strip any leading www prefixes. Also strip any www prefixes with a number (e.g. http://www3.example.com becomes http://example.com)
  • Strip session IDs. This deals with several common forms of session IDs that are stored in URIs. It is unlikely to be exhaustive.
  • Strip trailing empty query strings. E.g. http://example.com? becomes http://example.com

Two issues are quickly apparent.

One, some of these rules may cause URIs that contain different content to be regarded is the same. For example, while domain names are case insensitive, the path segment need not be. Thus the two URIs http://example.com/A.html and http://example.com/a.html may be two different documents.

Two, despite fairly aggressive URI canonicalization rules, there will still be many instances where the exact same content is served up on multiple URIs. For example, there are likely many more types of session ids than we have identified here.

We can never hope to enumerate all the possible ways websites deliver the same content with different URIs. Heritrix does, however, make it possible to tweak the URI canonicalization rules. You can, thus deal with them on a case by case basis. Here is how to configure URI canonicalization rules in Heritrix 3.

But, you may not want to do that.

You'll understand why as we look at what URI canonicalization means during playback.

OpenWayback

As difficult as harvesting can be, accurate replay is harder. This holds doubly true for URI canonicalization.

Replay tools, like OpenWayback, must apply canonicalization under two circumstances. Firstly, when a new resource as added to the index (CDX or otherwise) it must be a canonical form that is indexed. Secondly, when resolving a URI requested by the user.

User requested URIs are not just the ones users type into search boxes. They are also the URIs for embedded content needed to display a page and the URIs of links that users click in replayed webpages.

Which is where things get complicated. OpenWayback may not use the exact same rules as Heritrix.

The current OpenWayback canonicalization rules (called AggressiveUrlCanonicalizer) applies rules that work more or less the same way as what Heritrix uses. It is not, however, the exact same rules (i.e. code), nor can they be customized in the same manner as Heritrix.

There is a strong argument for harmonizing these two. Move the code into a shared library, ensure that both tools use the same defaults and can be customized in the same manner. (It was discussion of exactly this that prompted this blog post.)

Additionally, it would be very beneficial if each WARC contained information, in some standard notation, about which canonicalization rules were in effect during its creation.

All this would clearly help but even then you still may not want to fiddle with canonicalization rules.

Every rule you add must be applied to every URI. If you add a new rule to Heritrix, you must add it to OpenWayback to ensure playback. Once added to OpenWayback it will go into effect for all URIs being searched for. However, unless you rebuild your index it may contain URIs in a form that is no longer consistent with the current canonical form. So, you need to rebuild the index, which is expensive.

This is also assuming that it is safe to impose a rule retroactively. OpenWayback does not support applying canonicalization rules to time periods. To make matters worse, a rule that would be very useful on one website, may not be suitable for another website.

Dealing with a matrix of rules which apply some of the time to some of the content from some types does not seem enticing. Maybe there is a good way to express and apply such rules that no one has yet brought to the table. In the meantime, all I can say is, proceed with caution.

No comments:

Post a Comment