March 18, 2016

Declaring WARR on "CDX Server" API

Work is currently ongoing to specify a "CDX Server API" for OpenWayback. The name of this API has, however, caused an unfortunate amount of confusion. Despite the name, the data served via this API needn't be in CDX files!

The core purpose of this API is to respond to a query containing a URL, and optionally a timestamp or time range, with the set of records that match those parameters. This is meant to support two basic functions: replay of captured web content and discovery of captured web content.
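To make that query shape concrete, here is a minimal sketch in Python. It is not taken from any draft of the API: the names `CaptureRecord`, `resolve`, `closest`, `start` and `end` are all hypothetical, and the index is just an in-memory list. Timestamps are assumed to be the 14-digit `YYYYMMDDhhmmss` strings commonly used in web archives.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class CaptureRecord:
    """One captured resource (hypothetical record shape)."""
    url: str
    timestamp: str  # assumed 14-digit YYYYMMDDhhmmss
    location: str   # hypothetical pointer to where the capture is stored


def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def resolve(index: List[CaptureRecord], url: str,
            closest: Optional[str] = None,
            start: Optional[str] = None,
            end: Optional[str] = None) -> List[CaptureRecord]:
    """Return the captures of `url`, oldest first.

    `start`/`end` narrow the result to a time range (discovery).
    `closest` returns only the capture nearest that timestamp (replay).
    """
    matches = sorted((r for r in index if r.url == url),
                     key=lambda r: r.timestamp)
    # Fixed-width timestamps compare correctly as plain strings.
    if start is not None:
        matches = [r for r in matches if r.timestamp >= start]
    if end is not None:
        matches = [r for r in matches if r.timestamp <= end]
    if closest is not None and matches:
        target = _parse(closest)
        nearest = min(
            matches,
            key=lambda r: abs((_parse(r.timestamp) - target).total_seconds()))
        return [nearest]
    return matches


# Made-up sample data, for illustration only.
SAMPLE_INDEX = [
    CaptureRecord("http://example.org/", "20150101000000", "a.warc.gz:0"),
    CaptureRecord("http://example.org/", "20160301120000", "b.warc.gz:512"),
    CaptureRecord("http://example.org/other", "20160101000000", "b.warc.gz:2048"),
]
```

A replay tool would call something like `resolve(SAMPLE_INDEX, "http://example.org/", closest="20160318000000")`, while a discovery tool would pass `start` and `end` instead.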

CDXs need not enter into it. It is just that the most common way (by far) to manage such an index is to use sorted CDX files. Thus the unfortunate name. Nothing prevents alternative indexing solutions from being used. You could use a relational database, Lucene, or whatever tool allows lookups of strings!
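As a rough illustration of the relational database option, the sketch below answers the same kind of lookup from an in-memory SQLite table using Python's standard `sqlite3` module. The class name `SqliteIndex`, the table layout and the column names are assumptions made up for this example; nothing here reflects an actual OpenWayback backend.

```python
import sqlite3
from typing import List, Tuple


class SqliteIndex:
    """Hypothetical capture index backed by SQLite instead of CDX files."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE captures ("
            "url TEXT NOT NULL, timestamp TEXT NOT NULL, location TEXT NOT NULL)")
        # A composite index plays the role that sorting plays in a CDX file.
        self.conn.execute(
            "CREATE INDEX idx_url_ts ON captures (url, timestamp)")

    def add(self, url: str, timestamp: str, location: str) -> None:
        self.conn.execute(
            "INSERT INTO captures VALUES (?, ?, ?)",
            (url, timestamp, location))

    def lookup(self, url: str,
               start: str = "00000000000000",
               end: str = "99999999999999") -> List[Tuple[str, str]]:
        """Return (timestamp, location) pairs for `url`, oldest first."""
        cursor = self.conn.execute(
            "SELECT timestamp, location FROM captures "
            "WHERE url = ? AND timestamp BETWEEN ? AND ? "
            "ORDER BY timestamp",
            (url, start, end))
        return cursor.fetchall()
```

The caller only ever sees a URL and a time range going in and records coming out, which is the point: whether the lookup is served from sorted CDX files or from a table like this one is an implementation detail.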

So, this API desperately needs a new name. My suggestion is "Web Archive Resource Resolution Service" or WARR Service for short. Yes, I did torture that until it produced a usable acronym.

In my last post I discussed changes to the CDX file format itself. Those changes should facilitate WARR servers running on CDX indexes. But ultimately, the development of the WARR Service API is not directly coupled to those changes. We should focus on developing the WARR Service API with respect to the established use cases.

In truth, the exact scope and nature of this new API remain debated. You can find some lively discussion in this GitHub issue. More on that topic another day.
