June 30, 2015

Web archiving APIs - a start

Even though it didn't feature heavily on the official agenda, the topic of web archive APIs repeatedly came up during the last IIPC GA in Stanford. This is an understandable desire, as the use of APIs enables interoperability among different tools. This, in theory, allows you to build many narrow-purpose tools and then have them work together seamlessly. No more monolithic behemoths that are a pain to maintain.

In fact, we already have one web archive "API" in wide use: the WARC file format. While technically not an "application programming interface", it serves the same fundamental purpose, which is to enable interoperability. It has decoupled harvesters (e.g. Heritrix) from replay systems (e.g. OpenWayback), and both of those from analytical/data mining software.

Overall I'd say the WARC format has been a success, albeit not without its flaws. More on that later.

So, now there is interest in defining more "APIs". Indeed, there was a survey, recently, about the possibility of setting up a special "APIs Working Group" within the IIPC.

What I find missing from this conversation is: what are these APIs? Which facets of web archiving are amenable to being formalized in this manner?

We do have a few informal "APIs" floating around. The CDX file format is one. The Memento protocol is another. Cleaning up and formalizing an existing de facto API avoids the pitfall of creating an API that entirely fails to work in the "real world."

To understand this, let's go back to the WARC format. It was an extension of the ARC file format developed by the Internet Archive. In drafting the WARC standard, the lessons of the ARC file format were applied and most of its shortcomings were addressed.

Predictably, where the WARC format has shown weakness is in areas that ARC did not address at all, such as the handling of duplicate records. In those areas, wrong assumptions were made due to lack of experience. This in turn significantly delayed, for example, the widespread use of deduplication.
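For readers who have not run into deduplication before, here is a rough sketch of the idea: when a harvester fetches content it has already stored, it can record a small "revisit" entry pointing at the earlier capture instead of storing the payload again. The `DuplicateTracker` class and its method names are made up for illustration, and keying on the payload digest alone is an assumption; real crawlers may key on URL plus digest or use other strategies.

```python
import base64
import hashlib
from typing import Dict, Optional, Tuple


def payload_digest(payload: bytes) -> str:
    """Return a SHA-1 digest in the 'sha1:<base32>' style often seen in WARC headers."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


class DuplicateTracker:
    """Hypothetical helper that decides between a full record and a revisit."""

    def __init__(self) -> None:
        # digest -> (url, timestamp) of the first capture with that payload
        self._seen: Dict[str, Tuple[str, str]] = {}

    def classify(
        self, url: str, timestamp: str, payload: bytes
    ) -> Tuple[str, Optional[Tuple[str, str]]]:
        """Return ('response', None) for new content, or
        ('revisit', (original_url, original_timestamp)) for a duplicate."""
        digest = payload_digest(payload)
        original = self._seen.get(digest)
        if original is None:
            self._seen[digest] = (url, timestamp)
            return "response", None
        return "revisit", original
```

The hard part, and the part where the standard had little prior experience to draw on, is agreeing on what a revisit entry must contain so that a replay tool can find the original payload later.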

The best APIs/protocols/standards emerge from hard-won experience. I would argue very strongly against developing any API from the top down.

With that in mind, there are two possible APIs I believe are ready to be made more formal.

The first is the current "CDX Server protocol". The CDX Server is a component within OpenWayback that resolves URI+timestamp requests into stored resources (either a single one or a range, depending on the nature of the query). Note that a "CDX Server" need not use a CDX style index. That is merely how it is now. This is a protocol for separating the user interface of a replay tool (like OpenWayback) from its index. Let's call it Web Archive Query Protocol, WAQP, for now.
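To make the two kinds of query concrete, here is a minimal sketch of what the index side of such a protocol might look like. This is not the actual CDX Server interface; `Capture`, `InMemoryIndex`, `query_closest` and `query_range` are names invented for this illustration, and the 14-digit timestamp format is assumed because it is what CDX files commonly use.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Capture:
    """One archived copy of a URL (field names are illustrative only)."""
    url: str
    timestamp: str  # 14 digits, YYYYMMDDhhmmss
    filename: str   # WARC file that holds the record
    offset: int     # byte offset of the record within that file


def _parse(timestamp: str) -> datetime:
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


class InMemoryIndex:
    """Toy index. A real one could sit on CDX files, a database, or anything else."""

    def __init__(self, captures: List[Capture]) -> None:
        self._by_url: Dict[str, List[Capture]] = {}
        for capture in sorted(captures, key=lambda c: c.timestamp):
            self._by_url.setdefault(capture.url, []).append(capture)

    def query_range(
        self, url: str, start: str = "00000000000000", end: str = "99999999999999"
    ) -> List[Capture]:
        """All captures of url with start <= timestamp <= end, oldest first."""
        # 14-digit timestamps sort the same way as strings and as dates.
        return [c for c in self._by_url.get(url, []) if start <= c.timestamp <= end]

    def query_closest(self, url: str, timestamp: str) -> Optional[Capture]:
        """The single capture of url nearest in time to timestamp, if any."""
        captures = self._by_url.get(url, [])
        if not captures:
            return None
        target = _parse(timestamp)
        return min(captures, key=lambda c: abs(_parse(c.timestamp) - target))
```

A caller only ever sees the two query methods and the `Capture` records they return, so nothing about how the index is stored leaks out.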

Pywb, another replay tool, uses almost the same protocol in its implementation. With a well defined WAQP, you might be able to use the Pywb front end (display) with the OpenWayback back end (CDX server), or the other way around.

With a well defined WAQP, the job of indexing WARCs would become independent of the job of replaying web archives by rewriting URLs. This would make it much easier to develop new versions of either type of software.
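As a rough illustration of that separation, the sketch below shows a replay-side link rewriter that knows nothing about how the index works. All it receives is a lookup function standing in for a WAQP query. The `Lookup` type, the `rewrite_links` function and the `/replay/` prefix are assumptions made for this example, and the regular expression handles only the simplest `href` attributes. Real replay tools rewrite far more than this and usually resolve timestamps when a link is followed, not ahead of time.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in for a WAQP "closest" query:
# (url, requested timestamp) -> timestamp of the nearest capture, or None.
Lookup = Callable[[str, str], Optional[str]]

# Deliberately naive: matches only double-quoted absolute http(s) hrefs.
_HREF = re.compile(r'href="(https?://[^"]+)"')


def rewrite_links(
    html: str, requested_ts: str, lookup: Lookup, prefix: str = "/replay/"
) -> str:
    """Point links at archived copies, leaving uncaptured URLs untouched."""

    def _replace(match: "re.Match[str]") -> str:
        target = match.group(1)
        capture_ts = lookup(target, requested_ts)
        if capture_ts is None:
            return match.group(0)  # nothing archived, keep the original link
        return f'href="{prefix}{capture_ts}/{target}"'

    return _HREF.sub(_replace, html)


def dict_lookup(table: dict) -> Lookup:
    """Toy back end; any index that answers the same question would do."""
    return lambda url, _requested_ts: table.get(url)
```

Swapping `dict_lookup` for a function that calls a remote index would not require any change to `rewrite_links`, which is the whole point of agreeing on the protocol first.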

The other API is a bit more speculative. Web archiving proxies now exist and serve a very useful function. If a standard API were established for how these proxies interact with the software driving the harvesting activity, it would be much easier to pair the two together. This API could possibly be built on top of existing proxy protocols.

I don't mean to imply that this is the definitive final list of web archiving APIs to be developed. These are simply two areas I believe are ready to be formalized and where doing so is likely to be of benefit in the near term.

So, how best to proceed? As noted earlier, the idea has been floated to establish a special APIs working group. On the other hand, there is already precedent for running API development through the existing working groups (WARC was developed within the harvesting working group).

My personal opinion is that once a specific API has been identified as a goal, a special working group (or task force or whatever name you'd like to give it) should be formed around that one API. This working group would then disband once the job is done. Such groups could be re-formed if there is a need for a significant revision.

It is also important that any API formalization be done in close cooperation with at least some of the relevant tool maintainers. APIs that are not implemented are, after all, useless.



