November 12, 2015

Workshop on Missing WARC Features

Yesterday I tweeted:
I thought I'd expand a bit on this without a 140-character limit.

The idea for this session came to me while reviewing the various issues put forward for the WARC 1.1 review. Several of the issues/proposals were clearly important, as they addressed real needs, but at the same time they were nowhere near ready for standardization.

It is important that we mature a solution before we enshrine it in the standard. This may include exploring multiple avenues to resolve the matter. At minimum, it requires implementing it in the relevant tools as a proof of concept.

The issues in question include:
  • Screenshots – Crawling with browsers is becoming more common. A very useful by-product of this is a screenshot of the webpage as rendered by the browser. It is, however, not clear how best to store this in a WARC, especially in a way that makes it easily discoverable to a user of a Wayback-like replay tool. This may also affect other types of related resources and metadata.
    See further on WARC specification issue tracker: #13 #27
  • HTTP/2 – The new HTTP/2 standard uses binary headers. This seems to break one of the expectations of WARC response records containing HTTP responses.
    See further on WARC specification issue tracker: #15
  • SSL certificates – Store the SSL certificates used during an HTTPS session. #12
  • AJAX Interactions – #14
The above list is unlikely to be exhaustive. It merely enumerates the issues that I'm currently aware of.
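To make the screenshot question concrete, here is a minimal sketch of one candidate layout, not a settled answer: the screenshot is written as a WARC/1.0 `resource` record and tied to the page's response record with `WARC-Concurrent-To`. The `urn:screenshot:` URI prefix and the function name `build_screenshot_record` are assumptions made up for illustration; nothing in the standard or the issue tracker prescribes them.

```python
import uuid
from datetime import datetime, timezone


def build_screenshot_record(page_uri, png_bytes, response_record_id, when=None):
    """Serialize a PNG screenshot as one WARC/1.0 'resource' record (bytes)."""
    # Default to "now"; a caller-supplied aware datetime is converted to UTC.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        # Hypothetical convention: prefix the page URI so that a replay tool
        # could look the screenshot up by the URI of the page it depicts.
        ("WARC-Target-URI", "urn:screenshot:" + page_uri),
        # Ties the screenshot to the response record captured in the same visit.
        ("WARC-Concurrent-To", response_record_id),
        ("Content-Type", "image/png"),
        ("Content-Length", str(len(png_bytes))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # A record is: header block, blank line, content block, two CRLFs.
    return head.encode("utf-8") + png_bytes + b"\r\n\r\n"
```

Whether a derived URI like this, a `metadata` record, or something else entirely makes screenshots easiest to discover at replay time is exactly the kind of question the workshop would need to settle.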

I'd like to organize a workshop during the 2016 IIPC GA to discuss these issues. For that to become a reality I'm looking for people willing to present on one or more of these topics (or a related one that I missed). It will probably be aimed at the open days so attendance is not strictly limited to IIPC members.

The idea is that we'd have a ~10-20 minute presentation in which a particular issue's problems and needs are defined and a solution is proposed. It doesn't matter whether the solution has been implemented in code. Each presentation would be followed by a discussion on the merits of the proposed solution and possible alternatives.

A minimum of three presentations would be needed to "fill up" the workshop. We should be able to fit in four.

So, that's what I'm looking for: volunteers to present one or more of these issues, or a related issue I've missed.

To be clear, while the idea for this workshop comes out of the WARC 1.1 work, it is entirely separate from that review. By the time of the GA the WARC 1.1 revision will be all but settled. Consider this, instead, as a possible first step toward the WARC 1.2 revision.

4 comments:

  1. We're collecting a LOT of screenshots/renderings so we should present that (although I'd like to hear from others who are doing that).

    We'd also like to standardise how we store videos in WARC, particularly in the case where we are pulling down multiple streams (e.g. separate audio) via youtube_dl. We're not WARCing them yet as we're not sure how to do it such that playback is definitely unambiguous.

  2. Thanks Andy. I figured I could count on you as one of the presenters. Still have room for 2-3 more.

  3. If there's still room, I could cover how video playback from youtube_dl is supported in pywb, using a pattern of storing the response from youtube_dl as metadata records, as well as POST request replay.

    1. Hi Ilya. Yes there is still room. I'll be in touch with those who have offered to present soon after the new year.
