November 12, 2018

On screenshots and other 'associated resources' in WARCs

Having screenshots of web pages can be a useful augmentation to a web archive by providing a record of how the website looked in browsers contemporary to the capture. With browser-based crawling becoming more common, creating these screenshots is also becoming easier.

Currently, any screenshots stored in WARCs are invisible in replay tools, which significantly reduces their usefulness. It is also not entirely clear how these screenshots should be stored in WARCs. It is important that a standard way be defined for this type of 'associated resource' so that replay tools can provide consistent access to them.

I've used the term 'associated resource' rather than just 'screenshot' as there are other types of data associated with a URL that we may wish to store. An obvious example might be a video that is embedded in a webpage in a manner that cannot be easily crawled or replayed. That video might be extracted through side channels and associated with the original web page via the same mechanism. Replay tools would then, at minimum, provide links to open the video in their 'header'. More advanced replay tools might use rewriting rules to inject it into, e.g., YouTube pages.

There are likely a number of other uses for this type of mechanism. Another example might be to attach some type of annotation or documentation to a URL in this manner, either via manual curation or an automatic process.

As these are not 'just' metadata made available for 'big data' style processing, it is important to carefully consider this from the perspective of the replay tools. Notably, how we store this in WARC must facilitate easy discovery using typical web archive indexing (CDX/URL based indexing).

We must also consider that each 'associated resource' will require some amount of metadata to provide suitable context for the replay tool. This goes beyond just the type (screenshot, video) and also covers, e.g., the type of capture (is the screenshot based on the same HTTP transaction as the primary resource, was it created in parallel, or was it perhaps created via a replay mechanism), which browser was used, etc.

An initial idea might be to store each of these 'associated resources' in a WARC resource record using the URL of the primary resource with a special prefix, e.g. PREFIX:http://example.org. Metadata could then be stored in the record's header using a set of custom fields, a single 'metadata' field containing a JSON 'payload', or some mix of the two.
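
To make the idea a little more concrete, here is a rough Python sketch of what writing such a record could look like, using only the standard library. The `screenshot:` prefix, the `WARC-Associated-Metadata` field name, the metadata keys and the helper functions are all placeholders I made up for illustration; none of them are part of the WARC standard or of any existing tool.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical values: neither the prefix nor the field name is standardized.
PREFIX = "screenshot:"
METADATA_FIELD = "WARC-Associated-Metadata"


def associated_uri(primary_url, prefix=PREFIX):
    """Return the prefixed URI an index would use to find the associated resource."""
    return prefix + primary_url


def build_resource_record(primary_url, payload, content_type, metadata, when=None):
    """Serialize one uncompressed WARC/1.0 'resource' record as bytes."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", associated_uri(primary_url)),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
        # Single custom field carrying the JSON 'payload' (an assumption).
        # json.dumps emits ASCII on one line by default, so it fits a header.
        (METADATA_FIELD, json.dumps(metadata, sort_keys=True)),
    ]
    head = "WARC/1.0\r\n"
    head += "".join("%s: %s\r\n" % (name, value) for name, value in headers)
    head += "\r\n"  # blank line separates the header from the record block
    # Every WARC record is terminated by two CRLFs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = build_resource_record(
    "http://example.org",
    b"not really a PNG",  # stand-in for the actual image bytes
    "image/png",
    {"type": "screenshot", "capture": "parallel", "browser": "example-browser"},
)
```

Because the prefixed value sits in WARC-Target-URI, a URL based index would pick the record up like any other, and a replay tool could find it by looking up `associated_uri("http://example.org")` alongside the primary URL.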

I'm unsure if this is the best approach, but it serves as a starting point. Over the next few months I'm hoping, with broad input from the people who are building the tools that create and use this data, to write up a proposal for standardizing this that the IIPC could endorse. The document might also include some 'best practice' guidance on how replay tools should handle this data.

If you would like to be a part of that conversation, please get in touch.