March 18, 2016

Declaring WARR on "CDX Server" API

Work is currently ongoing to specify a "CDX Server API" for OpenWayback. The name of this API has, however, caused an unfortunate amount of confusion. Despite the name, the data served via this API needn't be in CDX files!

The core purpose of this API is to respond to a query containing an URL and optionally a timestamp or timerange with a set of records that fall within those parameters. This is meant to support two basic functionalities. One, replay of captured web content and, two, discovery of capture web content.

CDXs need not enter into it. It is just that the most common way (by far) to manage such an index is to use sorted CDX files. Thus the unfortunate name. Nothing prevents alternative indexing solutions being used. You could use a relational database, Lucene or whatever tool allows lookups of strings!

So, this API desperately needs a new name. My suggestion is "Web Archive Resource Resolution Service" or WARR Service for short. Yes, I did torture that until it produced a usable acronym.

In my last post I discussed changes to the CDX file format itself. Those changes should facilitate WARR servers running on CDX indexes. But ultimately, the development of the WARR Service API is not directly coupled to those changes. We should focus on developing the WARR Service API with respect to the established use cases.

In truth, the exact scope and nature of this new API remains debated. You can find some lively discussion in this Github issue. More on that topic another day.

March 17, 2016

Rewriting the CDX file format

CDX files are used to support URL+timestamp searching of web archives. They've been around for a long time, having first been used to catalog the contents of ARC files. Despite the advent of the WARC file format, they haven't changed much. I think it is past due that we reconsider the format from the ground up.

The current specification lists a large number of possible fields. Many are not used in typical scenarios.

The first field is a canonicalized URL. I.e. an URL with trivial elements (such as protocol) removed so that equivalent URLs end up with the same canonical URL here. This serves as the primary search key.

The only problem with this is that searching for content in all subdomains is not possible without scanning the entire CDX. This is because the subdomain comes before the domain. Instead, we should use a SURT (Sort-friendly URI Reordering Transform) form of the canonical URL instead. SURT URLs turn the domain/sub-domain structure around, making such queries fairly straightforward. There is essentially no downside to doing this and, in fact, a number of CDXs have been built in this manner, regardless of any "formal" standardization (as there isn't really any formal standard).

I suggest that any revised CDX format mandate the use of SURT URLs for the first field. Furthermore, we should utilize the correct SURT format. In most (probably all) current CDXs with SURT URLs, an annoying mistake has been made where the closing comma is missing. An URL that should read:
   com,example,www,)
instead reads:
   com,example,www)
The protocol prefix has been removed as unnecessary along with the opening ellipse. 

The second field should remain the timestamp with whatever precision is available in the ARC/WARC. I.e. an w3c-iso8601 of varying accuracy as per this proposed revision the WARC standard (the revision is extremely likely to be included in WARC 1.1).

The third field would remain the original URL.

The fourth field should be a content digest including the hashing algorithm. Presently, this field is missing the algorithm.

The fifth field would be the WARC record type (or a special value to indicate an ARC response record). This is the most significant change as it allows us to capture additional WARC record types (such as metadata and conversion) while also handling the existing fields in a more targeted manner (e.g. response vs revisit). It might be argued that this should be the second field to facilitate searches of a specific record type. I believe that, probably implemented, this field would allow replay tools to effectively surface any content "related" to the URL currently being viewed, a problem that I know many are trying to tackle.

The next two fields would be the WARC (or ARC) filename (this is supposed to be unique) of the file containing the record and offset at which the record exists within the (W)ARC. This is as it works currently. Some would argue for a more expressive resource locator here, but I believe that is best handled be a separate (W)ARC resolution service. Otherwise you may have to substantially rebuild your CDX index just because you moved your (W)ARCs to a new disk or service.

Lastly, there should be a single line JSON "blob" containing record type relevant additional data. For response records, this would include HTTP status code and content type which I've excluded from the "base" fields in the CDX. This part would be significantly more flexible due to the JSON format, allowing us to include optional data where appropriate etc. The full range of possible values is beyond the scope of this blog post.

There is clearly more work to be done on the JSON aspect, plus some adjustments may be necessary to the base data, but I believe that, at minimum, this is the right direction to head in. Of course, this means we have to rebuild all our CDX files in order to implement this. That's a tall order, but the benefits should be more than enough to justify that one-time cost.