June 30, 2015

Web archiving APIs - a start

Even though it didn't feature heavily on the official agenda, the topic of web archive APIs repeatedly came up during the last IIPC GA in Stanford. This is an understandable desire as the use of APIs enables interoperability amongst different tools. This, in theory, allows you to build (many) narrow purpose tools and then have them work, seamlessly, together. No more monolithic behemoths that are a pain to maintain.

In fact, we already have one web archive "API" in wide use: the WARC file format. While technically not an "application programming interface", it serves the same fundamental purpose: to enable interoperability. It has decoupled harvesters (e.g. Heritrix) from replay systems (e.g. OpenWayback), and both of those from analytical and data mining software.

Overall I'd say the WARC format has been a success, albeit not without its flaws. More on that later.

So, now there is interest in defining more "APIs". Indeed, there was a survey, recently, about the possibility of setting up a special "APIs Working Group" within the IIPC.

What I find missing from these conversations is: what are these APIs? Which facets of web archiving are amenable to being formalized in this manner?

We do have a few informal "APIs" floating around. The CDX file format is one. The Memento protocol is another. Cleaning up and formalizing an existing de facto API avoids the pitfall of creating an API that entirely fails to work in the "real world."

To understand this, let's go back to the WARC format. It was an extension of the ARC file format developed by the Internet Archive. In drafting the WARC standard, the lessons of the ARC file format were applied and most of its shortcomings were addressed.

Predictably, where the WARC format has shown weakness is in areas that ARC did not address at all, for example the handling of duplicate records. In those areas, wrong assumptions were made due to lack of experience. This, in turn, significantly delayed the widespread use of deduplication.

The best APIs/protocols/standards emerge from hard won experience. I would argue very strongly against developing any API from the top down.

With that in mind, there are two possible APIs I believe are ready to be made more formal.

The first is the current "CDX Server protocol". The CDX Server is a component within OpenWayback that resolves URI+timestamp requests into stored resources (either a single one or a range, depending on the nature of the query). Note that a "CDX Server" need not use a CDX style index. That is merely how it is now. This is a protocol for separating the user interface of a replay tool (like OpenWayback) from its index. Let's call it the Web Archive Query Protocol, WAQP, for now.

Pywb, another replay tool, uses almost the same protocol in its implementation. With a well defined WAQP, you might be able to use the Pywb front end (display) with the OpenWayback back end (CDX server), or the other way around.

With a well defined WAQP, the job of indexing WARCs would become independent of the job of replaying web archives by rewriting URLs. This would make it much easier to develop new versions of either type of software.
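To make the idea a little more concrete, here is a minimal Python sketch of what a front end talking to such an index might do: build a query for a URI and time range, then parse one line of the response. This is an illustration under stated assumptions, not a definition of the protocol. The endpoint address is hypothetical, the parameter names (url, from, to, limit) are modeled on those used by existing CDX server implementations and should be checked against the tool you use, and the response is assumed to be in the common 11-column, space-separated CDX layout.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real deployment's host and path will differ.
CDX_ENDPOINT = "http://localhost:8080/cdx"

# Assumed field order: the common 11-column CDX line layout.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]


def build_query(url, from_ts=None, to_ts=None, limit=None):
    """Build a query URL asking the index for captures of `url`.

    `from_ts` and `to_ts` are timestamp prefixes such as "2015" or
    "20150630"; parameter names are assumptions (see the text above).
    """
    params = {"url": url}
    if from_ts is not None:
        params["from"] = from_ts
    if to_ts is not None:
        params["to"] = to_ts
    if limit is not None:
        params["limit"] = limit
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by field name."""
    values = line.strip().split(" ")
    if len(values) != len(CDX_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(CDX_FIELDS), len(values)))
    return dict(zip(CDX_FIELDS, values))
```

The point of the sketch is that the front end only needs the query and the parsed record (timestamp, WARC filename and offset). How the index is stored behind the endpoint is invisible to it, which is exactly the separation a formal WAQP would guarantee.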

The other API is a bit more speculative. Web archiving proxies now exist and serve a very useful function. If a standard API were established for how these proxies interact with the software driving the harvesting activity, it would be much easier to pair the two together. This API could possibly be built on top of existing proxy protocols.

I don't mean to imply that this is the definitive final list of web archiving APIs to be developed. These are simply two areas I believe are ready to be formalized and where doing so is likely to be of benefit in the near term.

So, how best to proceed? As noted earlier, the idea has been floated to establish a special APIs working group. On the other hand, there is already precedent for running API development through the existing working groups (WARC was developed within the harvesting working group).

My personal opinion is that once a specific API has been identified as a goal, a special working group (or task force or whatever name you'd like to give it) should be formed around that one API. This working group would then disband once the job is done. Such groups could be re-formed if there is a need for a significant revision.

It is also important that any API formalization be done in close cooperation with at least some of the relevant tool maintainers. APIs that are not implemented are, after all, useless.




June 24, 2015

OpenWayback 2.2.0


OpenWayback 2.2.0 was recently released. This marks OpenWayback's third release since becoming a ward of the IIPC in late 2013. This is a fairly modest update and reflects our desire to make frequent, modest sized releases. A few things are still worth pointing out.

First, as of this release, OpenWayback requires Java 7. Java 7 has been out for four years and Java 6 has not been publicly updated in over two years. It is time to move on.

Second, OpenWayback now officially supports internationalized domain names, i.e. domain names containing non-ASCII characters.

Third, UI localization has been much improved. It should now be possible to translate the entire interface without having to mess with the JSP files and otherwise "go under the hood".

And the last thing I'll mention is the new WatchedCDXSource which removes the need to enumerate all the CDX files you wish to use. Simply designate a folder and OpenWayback will pick up all the CDX files in it.
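The idea behind a watched folder can be sketched in a few lines of Python. To be clear, this is not how WatchedCDXSource is implemented (OpenWayback is written in Java, and details such as the file extension it matches and whether it looks into subfolders are not covered here). The function name and the ".cdx" suffix are assumptions made purely for illustration.

```python
from pathlib import Path


def scan_for_new_cdx(folder, known):
    """Return CDX files in `folder` that are not yet in the set `known`.

    Only files directly inside `folder` with a ".cdx" suffix are considered
    (an assumption for this sketch). Newly found paths are added to `known`
    so that repeated calls report each file only once.
    """
    found = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix == ".cdx"
    )
    new_files = [p for p in found if p not in known]
    known.update(new_files)
    return new_files
```

Calling such a function periodically with the same `known` set picks up index files as they are dropped into the folder, which is the convenience the new source provides: no configuration change is needed when a CDX file is added.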

The road here hasn't been easy, but it is encouraging to see that the number of people involved is slowly but surely rising. For the 2.2.0 release, we had code contributions from Roger Coram (BL), Lauren Ko (UNT), John Erik Halse (NLN), Sawood Alam (ODU), Mohamed Elsayed (BA) and myself, in addition to the IIPC-paid-for work by Roger Mathisen (NLN). Even more people were involved in reporting issues, managing the project and testing the release candidate. My thanks to everyone who helped out.

And going forward, we are certainly going to need people to help out.

Version 2.3.0 of OpenWayback will be another modest bundle of fixes and minor features. We hope it will be ready in September (or so). There are already 10 issues open for it as I write this.

But we also have larger ambitions. Enter version 3.0.0. It will be developed in parallel with 2.3.0 and aims to make some big changes. Breaking changes. OpenWayback is built on an aging codebase, almost a decade old at this point, and moving forward requires reworking parts of it.

The exact features to be implemented will likely shift as work progresses, but we are going to increase modularity by pushing the CDXServer front and center and removing the legacy resource stores. In addition to simplifying the codebase, this fits very nicely with the talk at the last GA about APIs.

We'll also be looking at redoing the user interface using iFrames and providing insight into the temporal drift of the page being viewed. The planned issues are available on GitHub. The list is far from locked and we welcome additional input on which features to work on.

We welcome additional work on those features even more!

I'd like to wrap this up with a call to action. We need a reasonably large community around the project to sustain it. Whether it's testing and bug reporting, occasional development work or taking on more by becoming one of our core developers, your help is both needed and appreciated.

If you'd like to become involved, you can simply join the conversation on the OpenWayback GitHub page. Anyone can open new issues and post comments on existing issues. You can also join the OpenWayback developers mailing list.

---

This post was written for the IIPC blog.