August 17, 2015

The WARC Format 1.1

The WARC Format 1.0 is an ISO standard describing the container format that web archives use to store their data. WARCs contain not only the actual file resources (HTML, images, JavaScript etc.) but also request and response headers, metadata about the resources and the overall collection, deduplication records and conversion records.

It's a pretty flexible format. It has served us quite well, but it is not perfect.

While it is an ISO standard, most of it was written by IIPC members. Indeed, it is heavily influenced by the ARC format developed by The Internet Archive. So, now that the WARC format is being revisited, it is only natural that the IIPC community, again, writes the first draft.

At the IIPC GA this year, in Stanford, there was a workshop where the pain points of the current specification were brought to light. There was a lot of energy in the room and people were excited. But as everyone got back home, a lot of that energy went away.

It is a lot easier to talk about change than to make it happen. To make things more difficult, few of us know much about the standards process. It all felt very inscrutable.

To help with the procedural aspect we came up with an approach that involves using the tools we are familiar with (software development). Consequently, we (and by "we" I mean Andy Jackson of the British Library) set up a GitHub project around the WARC specification.

Any problems with the existing specification could be raised there as "issues" (you'll find all the ones discussed in Stanford on there!). The existing spec could be included as Markdown-formatted text, and any proposed changes could be submitted as "pull requests" acting on the text of the existing spec.

Currently there are two pull requests, each representing a proposed set of changes to address one specific shortcoming of the existing spec.

One of the pull requests comes from yours truly. It addresses the concerns of "URI-agnostic revisit records". This was previously dealt with via an advisory on the subject adopted by the IIPC. The pull request lets us promote what has been a de facto standard into the actual standard.
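
For readers who have not met the idea before, here is a rough Python sketch of what "URI agnostic" means in practice: the deduplication lookup is keyed on the payload digest alone, so a revisit record can point back at an earlier capture even if that capture had a different URI. The function name, the index layout and the example values are all made up for illustration. The two "WARC-Refers-To-..." header names reflect my understanding of the advisory and should be checked against the pull request itself.

```python
def revisit_headers(target_uri, payload_digest, index):
    """Return headers for a revisit record, or None if the payload is new.

    `index` is a hypothetical in-memory map from payload digest to the
    URI and capture date of the first record that carried that payload.
    """
    original = index.get(payload_digest)  # lookup ignores the URI entirely
    if original is None:
        return None  # nothing to deduplicate against
    return {
        "WARC-Type": "revisit",
        "WARC-Target-URI": target_uri,
        "WARC-Payload-Digest": payload_digest,
        # Assumed header names: they identify the original capture
        # by its URI and date rather than by record ID alone.
        "WARC-Refers-To-Target-URI": original["uri"],
        "WARC-Refers-To-Date": original["date"],
    }


# Illustrative data only: one earlier capture of some image.
index = {
    "sha1:EXAMPLEDIGEST": {
        "uri": "http://example.org/logo.png",
        "date": "2015-08-01T10:00:00Z",
    }
}

# The same bytes served from a different URI still produce a revisit.
headers = revisit_headers(
    "http://cdn.example.org/img/logo.png", "sha1:EXAMPLEDIGEST", index
)
```

The point of carrying the original URI and date in the revisit record is that a replay tool can then find the original capture without needing the WARC that holds it to be indexed by record ID.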

The other pull request centers on improving the resolution of timestamps in WARC headers.
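
To see why resolution matters, consider two captures made within the same second. The Python sketch below is only an illustration: the helper name is invented, and the fractional-seconds format shown is my assumption about what a higher-resolution timestamp might look like, not the syntax the pull request actually proposes.

```python
from datetime import datetime, timezone


def warc_date(dt, fractional=False):
    """Format a timezone-aware datetime as a UTC timestamp string."""
    dt = dt.astimezone(timezone.utc)
    if fractional:
        # Assumed higher-resolution form: six digits of fractional seconds.
        return dt.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    # Whole-second form, as used in WARC-Date headers today.
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


# Two captures 750 milliseconds apart, within the same second.
first = datetime(2015, 8, 17, 12, 0, 0, 120000, tzinfo=timezone.utc)
second = datetime(2015, 8, 17, 12, 0, 0, 870000, tzinfo=timezone.utc)

coarse = (warc_date(first), warc_date(second))
fine = (warc_date(first, True), warc_date(second, True))

print(coarse)  # both entries are identical at one-second resolution
print(fine)    # the entries differ once fractional seconds are kept
```

With whole seconds only, the order of the two captures cannot be recovered from the headers. That becomes a practical problem when a crawler fetches the same URI more than once in quick succession.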

Neither pull request has been merged, meaning that both are up for comment and may change or be rejected altogether. There are also many issues that still need to be addressed.

I would like to encourage all interested parties (IIPC members and non-members alike) to take advantage of the GitHub venue if the WARC format is important to you. You can do this by opening issues if you have a problem with the spec that hasn't been brought up. You can do this by commenting on existing issues and pull requests, suggesting solutions or pointing out pitfalls.

And you can do this by suggesting actual edits to the standard via pull requests (let us know if you need help with the technical bits).

Ultimately, the draft thus generated will be passed on to the relevant ISO group for review and approval. This will happen (as I understand it) next year.

So grab the opportunity while it presents itself and have your say on The WARC Format 1.1.

1 comment:

  1. So long as ISO insists on paywalling their version of standards, I'd prefer that the open-standard community refrain from working with them until they change their ways.