November 12, 2018

On screenshots and other 'associated resources' in WARCs

Having screenshots of web pages can be a useful augmentation to a web archive, as they provide a record of how a website looked in browsers contemporary to the capture. With browser-based crawling becoming more common, creating them is, likewise, becoming easier.

Currently, any screenshots stored in WARCs are invisible to replay tools, which significantly reduces their usefulness, and it is not entirely clear how such screenshots should be stored in WARCs in the first place. It is important that a standard way be defined for this type of 'associated resource' so that replay tools can provide consistent access to them.

I've used the term 'associated resource' rather than just 'screenshot' as there are other types of data associated with a URL that we may wish to store. An obvious example might be a video that is embedded in a webpage in a manner that cannot be easily crawled or replayed. That video might be extracted through side channels and associated with the original web page via the same mechanism. Replay tools would then, at minimum, provide links to open the video in their 'header'. More advanced replay tools might use rewriting rules to inject it into, e.g., YouTube pages.

There are likely a number of other uses for this type of mechanism. Another example might be attaching some type of annotation or documentation to a URL in this manner, either via manual curation or an automatic process.

As these are not 'just' metadata available for 'big data' style processing, it is important to consider this carefully from the perspective of the replay tools. Notably, how we store this in WARCs must facilitate easy discovery using typical web archive indexing (CDX/URL-based indexing).

We must also consider that each 'associated resource' will require some amount of metadata to provide suitable context for the replay tool. This goes beyond just the type (screenshot, video); it also includes, e.g., the type of capture (is the screenshot based on the same HTTP transaction as the primary resource, was it created in parallel, or was it perhaps created via a replay mechanism), which browser was used, etc.

An initial idea might be to store each of these 'associated resources' in a WARC resource record using the URL of the primary resource with a special prefix, e.g. PREFIX:http://example.org. Metadata could then be stored in the record's header using a set of custom fields, a single 'metadata' field containing a JSON 'payload', or some mix of the two.
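To make the idea concrete, here is a minimal sketch of what writing such a record could look like using the warcio library. The 'urn:screenshot:' prefix and the 'WARC-Associated-Metadata' header name are invented for illustration; nothing here is a settled convention.

    from io import BytesIO
    from warcio.warcwriter import WARCWriter

    # Illustrative metadata; the exact field set is precisely the open question.
    metadata = '{"kind": "screenshot", "capture": "parallel", "browser": "Firefox 63"}'

    with open('associated.warc.gz', 'wb') as out, open('page.png', 'rb') as img:
        writer = WARCWriter(out, gzip=True)
        record = writer.create_warc_record(
            'urn:screenshot:http://example.org/',  # primary URL with invented prefix
            'resource',
            payload=BytesIO(img.read()),
            warc_content_type='image/png',
            warc_headers_dict={'WARC-Associated-Metadata': metadata})  # custom field
        writer.write_record(record)

A CDX indexer would then pick up the prefixed URL like any other record, which is what makes the discovery side of this workable.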

I'm unsure whether this is the best approach, but it serves as a starting point. Over the next few months I'm hoping, with broad input from the people who are building the tools that create and use this data, to write up a proposal for standardizing this that the IIPC could endorse. The document might also include some 'best practice' guidance on how replay tools should handle this data.

If you would like to be a part of that conversation, please get in touch.

October 10, 2016

Wanted: New Leaders for OpenWayback

[This post originally appeared on the IIPC blog on 03/10/2016]

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition, the OpenWayback project has been working to define access-related APIs.

The OpenWayback project thus plays an important role in the IIPC's efforts to foster the development and use of common tools and standards for web archives.

Why now?


The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective to make the software easier to install, run and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software as evidenced by the release of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task: improving the core of the software. That work is now reaching a significant milestone as a new 'CDX server', or resource resolver, is being introduced. You can read more about that here.

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (which hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year, since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I was never able to give the project the kind of attention needed to grow it. Now seems to be a good time for a change.

With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit, which is a shame bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?


While a technical background is certainly useful it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You’ll not be alone: aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

September 26, 2016

3 crawlers : 1 writer

Last week I attended an IIPC sponsored hackathon with the overarching theme of 'Building better crawlers'. I can't say we built a better crawler in the room, but it did help clarify for me the likely future of archival crawling. And it involves three types of crawlers.

The first type is the bulk crawler; Heritrix is an example of this. It can crawl a wide variety of sites 'well enough' and has fairly modest hardware requirements, allowing it to scale quite well. It is, however, limited in its ability to handle scripted content (i.e. JavaScript), as all link extraction is based on heuristics.

The second type is a browser driven crawler. Still fully (mostly) automated but using a browser to render pages. Additionally, scripts can be run on rendered pages to simulate scrolling, clicking and other user behavior we may wish to capture. Brozzler (Internet Archive) is an example of this approach. This allows far better capture of scripted content, but at a price in terms of resources.

For large-scale crawls, it seems likely that a hybrid approach would serve us best: have a bulk crawler cover the majority of URLs, delegating only those URLs that are deemed 'troublesome' to the more expensive browser-based rendering.

The trick here is to make the two approaches work together smoothly (Brozzler, for example, handles state very differently from Heritrix) and to be smart about which content goes in which bucket.
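As a rough illustration of that bucketing decision, here is a sketch of the kind of routing logic involved. The heuristics below are invented for the example; a real crawler would base this on its own link-extraction feedback.

    def pick_crawler(url, prior_html=None):
        """Route a URL to 'browser' or 'bulk' crawling. Purely a sketch."""
        # Hash-bang fragments and javascript: links hint at script-driven navigation.
        if '#!' in url or url.startswith('javascript:'):
            return 'browser'
        # If a prior bulk fetch returned script-heavy markup, escalate.
        if prior_html is not None and prior_html.count('<script') > 20:
            return 'browser'
        return 'bulk'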

The third type of crawler is what I'll call a manual crawler, i.e. a crawler whose activities are entirely driven by a human operator. An example of this is Webrecorder.io. This enables us to fill in whatever blanks the automated crawlers leave. It can also prove useful for highly targeted collection, where curators are handpicking not just sites but specific individual pages. They can then complete the process right there in the browser.

There is, however, no reason that these crawlers cannot all use the same back end for writing WARCs, handling deduplication and otherwise doing post-acquisition tasks. By using a suitable archiving proxy, all three types of crawlers can easily add their data to our collections.

Such proxy tools already exist; it is simply a matter of making sure these crawlers use them (many already do), and that they use them consistently, i.e. that there is a nice clear API for an archiving proxy that covers the use cases of all the crawlers and allows them to communicate collection metadata, dictate deduplication policies, etc.
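For a sense of what such an API can look like today, warcprox accepts a 'Warcprox-Meta' request header carrying JSON instructions from the crawler. The specific keys below are illustrative only; check the warcprox documentation for the supported set.

    import json
    import requests

    # Route a request through a (hypothetical) local warcprox instance.
    proxies = {'http': 'http://localhost:8000', 'https': 'http://localhost:8000'}
    meta = {'warc-prefix': 'manual-capture', 'dedup-bucket': 'collection-42'}

    resp = requests.get('http://example.org/',
                        proxies=proxies,
                        headers={'Warcprox-Meta': json.dumps(meta)},
                        verify=False)  # the proxy re-signs TLS, so verification fails

Any of the three crawler types could speak this same protocol, which is what makes a shared, standardized header (or something like it) attractive.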

Now is the right time to establish this API. I think the first steps in that direction were taken at the hackathon. Hopefully, we'll have a first draft available on the IIPC GitHub page before too long.

May 30, 2016

Heritrix 3.3.0-LBS-2016-02, now in stores

A month ago I posted that I was testing a 'semi-stable' build of Heritrix. The new build is called "Heritrix 3.3.0-LBS-2016-02" as this is built for LBS's (Icelandic acronym for my library) 2016-02 domain crawl (i.e. the second one this year).

I can now report that this version has passed all my tests without any regressions showing up. I've noted two minor issues, one of which was fixed immediately (Noisy alerts about 401s without auth challenge) and the other has been around since Heritrix 3.1.0 at the least and does not affect crawling in any way (Bug in non-fatal-error log).

Additionally, I heard from Netarkivet.dk. They also tested this version with no regressions found.

I think it is safe to say that if you are currently using my previous semi-stable build (LBS-2015-01), upgrading to this version should be entirely straightforward. There are no notable API changes to worry about either. Unless, of course, you are using features that are less 'mainstream'.

You can find this version on our Github page. You'll have to download the source and build it for yourself.

Update: As you can see in the comments below, Netarkivet.dk has put the artifacts into a publicly accessible repository. Very helpful if you have code with dependencies on Heritrix and you don't have your own repository.

Thanks for the heads-up, Nicholas.

May 17, 2016

WARC MIME Media Type

A curious thing came up during the WARC 1.1 review process. In version 1.0, section 8 talked about which MIME media types should be used when exchanging WARCs over the Internet. During the review, however, it was pointed out that this is actually outside the scope of the standard; 1.1 consequently drops section 8.

For now, we should regard the instructions from 1.0 section 8 as best practice, even though they aren't part of any official standard.
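As a reminder of what that guidance amounts to in practice, here is a minimal sketch of serving WARC files with the media type that section 8 described (application/warc, with application/warc-fields for field blocks). Treat the type as best practice rather than a registered one.

    import http.server

    class WarcHandler(http.server.SimpleHTTPRequestHandler):
        # Map the .warc extension to the media type from WARC 1.0 section 8.
        extensions_map = {**http.server.SimpleHTTPRequestHandler.extensions_map,
                          '.warc': 'application/warc'}

    if __name__ == '__main__':
        http.server.HTTPServer(('', 8080), WarcHandler).serve_forever()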

That's not to say that it isn't important to have a standard set of MIME types for WARC content, only that the WARC ISO standard isn't the place for it. This is actually something that IANA is responsible for, with specification work going through the IETF, if I'm understanding this correctly.

I'm not at all familiar with this process. But it is clear that if we wish to have this standardized then going through this process is the only option. If anyone can offer further insight into how we could move this forward please get in touch.


May 12, 2016

What I learned hosting the 2016 IIPC GA/WAC

The National Library of Iceland. Photo taken by a GA/WAC attendee.
It's been nearly a month since the 2016 IIPC General Assembly (GA) / Web Archiving Conference (WAC) in Reykjavik ended and I think I'm just about ready to try to deconstruct the experience a bit.

Plan ahead


Looking back, planning of the practical aspects - the logistics - of the conference seems to have been mostly spot on. The 2015 event at Stanford had had a problem with no-shows, but this wasn't a big factor in Reykjavik, I suspect largely due to the small number of local attendees. Our expectations about the number of people who would come ended up being more or less correct (about 90 for the GA and 145 for the WAC).

A big part of why the logistics side ran smoothly was, I feel, due to advance planning. We first decided to offer to host the 2016 GA in October of 2013. We made the space reservations at the conference hotel in September 2014. Consequently, there was never any rush or panic on the logistics. Everything felt like it was happening right on schedule with very few surprises.

The IIPC SC had a meeting in Lisbon following the 2013 iPres conference. The idea for Reykjavik as the venue for the 2016 IIPC GA first arose there.

Given how much work it was, despite all the careful planning, I don't care to imagine what doing this under pressure would be like. I've been advocating in the IIPC Steering Committee (SC), for years, that we should leave each GA with a firm date and place for the next two GAs and a good idea of where the one to be held in three years will be.

Nothing, in my experience hosting a GA, has changed my mind about that.

Spendthrift 


There was some discussion about whether some days/sessions should be recorded and put online. This was done at Stanford, but looking at the viewing numbers, I felt that it represented a poor use of money; recording and editing can be quite costly. Ultimately the SC agreed. It may be worth reviewing this decision in the future, or perhaps something else can be used to 'open' the conference to those not physically present.

It was certainly a worthwhile experiment, but overall, I think we made the right decision not doing it in Reykjavik, especially as the cost would have been quite high, even compared to Stanford.

Another thing we decided not to spend money on was an event planner. I know one was used for the 2015 GA; that event needed to be planned in a hurry and thus may have required such a service. But I can't see how it would have made things much easier in 2016, unless you're willing to hand over to the planner the responsibility for making specific choices, such as catering etc.

True, that does take a bit of effort, but I felt that was a part of the responsibility that comes with hosting. Just handing it over to a planner wouldn't have sat right. And if I'm vetting the planner's choices, then very little effort is being saved.

I'm happy to concede, though, that this may vary very much by location and host.

Communication


Some of the communication surrounding the GA/WAC was sub-optimal. The GA page on netpreserve.org was never really up to the task, although it got better over time. Some of this was down to the lack of flexibility of the netpreserve website. Future events should have a solid communication plan at an early date, including what gets communicated where and who is responsible for it. Perhaps it is time that each GA/WAC gets its own little website? Or perhaps not.

The dual nature of the event also caused some confusion, leading some people to register for only one of the two events. There was also confusion (even among the program committee!) about whether the CFP was for both the WAC and the GA or the WAC only.

This leads us to the most important lesson I took away from this all...

Clearly separate the General Assembly and the Web Archiving Conference!


This isn't a new insight. We've been discussing what separates the 'open days' from 'member only' days for several years. In Reykjavik this was, for the first time, formally divided into two separate events. Yet, the distinction between them was less than absolutely clear.

This is, at least in part, due to how the schedule for the two events was organized. A single program committee was set up (as has been the case most years); it was quite small this year. This committee then organized the call for proposals (CFP) and arranged the schedule to accommodate the proposals that came in from the CFP.

This led to the conference spilling over onto GA days (notably Tuesday), and it wasn't the first time that has happened. There was definitely a lack of separation at Stanford (although perhaps for slightly different reasons), and in Paris, in 2014, the effort to shoehorn in all the proposals from the CFP had a profound effect on the member-only days.

This model of a program committee and a CFP is entirely suitable for a conference and should be continued. But going forward, I think it is absolutely necessary that the program committee for the WAC have no responsibility or direct influence on the GA agenda.

To facilitate this I suggest that the organization of these two events consist of three bodies (in addition to the IIPC Steering Committee (SC) which will continue to bear overall responsibility).

  1. Logistics Team. Membership includes 1-2 people from the hosting institution, the IIPC officers, at least one SC member (if the hosting institution is an SC member, this may be their representative) and perhaps one or two people with relevant experience (e.g. people who have hosted before).
    This group is responsible for arranging space, catering, the reception, badges and other printed conference material, hotels (if needed) etc. They get their direction on the amount of space needed from the SC and the two other teams.
    This group is responsible for the event staying within budget, which is why the treasurer is included.
  2. WAC Program Committee. The program committee would be composed of a number of members and may include several non-members who bring notable expertise and have been engaged in this community for a long time.
    The program committee would have a reserved space on it for the hosting institution (which they may decline). There should also be a minimum of one SC member on the committee.
    The PCO (program and communications officer) would be included in all communications and assist the committee with communications with members and other prospective attendees (e.g. in sending out the CFP) but would not participate in evaluating the proposals sent in.
    The program committee would have a hand in crafting the CFP, but input on overall 'theme' would be expected from the SC.
    The program committee's primary task would be to evaluate the proposals sent in after the CFP and to arrange them into a coherent schedule. The mechanism for evaluating (and potentially rejecting!) proposals needs to be established before the proposals come in! Otherwise, it will be hard to avoid the feeling that the criteria are being tailored to fit specific proposals.
  3. GA Organizing Group. The PCO would be responsible for coordinating this group. Included are the SC Chair and Vice Chair, portfolio leads and leaders of working and interest groups. For the most part, each member is primarily responsible for the needs of their respective areas of responsibility.
    More on GA organization in a bit.
None of this gives the SC a free pass. As you'll note, I've mandated an SC presence in all the groups. This both gives the groups access to someone who can easily bring matters to the SC's attention and ensures that someone is there to see that the direction the SC has laid out is, broadly speaking, followed.

For the WAC, the SC's biggest responsibility (aside from choosing the location and setting the budget) will be in deciding how much time it gets (two days, two and a half, three?), what themes to focus around and whether the conference should try to accomplish a specific outreach goal (and if so how).

This was, for example, the case in Stanford where the goal was to get the attention of the big tech companies. Getting Vint Cerf (a VP of Google) to be a keynote speaker was a good effort in that direction. Nothing similar was done during the Reykjavik meeting.

Keynotes


Keynotes are likely to be one of the best ways of accomplishing this. Getting a keynote speaker from a different background can help build bridges. I think this is absolutely a worthwhile path to consider.

However, unless we are hosting the WAC in their backyard (as was the case with Vint Cerf), we need to reach out to them very early and probably be prepared to cover the cost of travel. This is a choice that needs to be made very early. And, indeed, the choice of a keynote may ultimately help frame the overall theme of the conference (or not).

Hjálmar Gíslason delivering the 2016 IIPC WAC opening keynote
We had two keynotes in Reykjavik. Both were great, although neither was chosen 'strategically'. The choice of Hjálmar Gíslason rested largely with my library; allowing the hosting institution some influence over one of the keynotes may be appropriate. The other keynote speaker, Brewster Kahle, wasn't chosen until after the CFP was in; we essentially asked him to expand his proposal into a keynote. Given the topic and Brewster's acclaim within our community, this worked out very well. We did have other candidates in mind (but no one confirmed), so it was quite fortunate that such a perfect candidate fell into our laps.

It is worth planning this early as people become unavailable surprisingly far in advance.

It could also be argued that we don't need keynotes. People aren't coming to the IIPC WAC to hear some 'rock star' presenter. The event itself is the draw. But I think a couple of keynotes really help tie the event together.

One change may be worth considering. Instead of a whole day with a single track featuring both keynotes, perhaps have multiple tracks on all days but hold single-track sessions at the start of day one and at the end of day two to accommodate the keynotes and the welcome and wrap-up talks.

When we were trying to fit in all the proposals we got for Reykjavik, we considered doing this, but the idea simply arose too late. We were unable to secure the additional space required.

Again, we need to plan early.

The General Assembly should not be a conference


The GAs have changed a lot over the years. The IIPC met in Reykjavik for the first time in 2005. Back then we didn't call the meetings "GAs"; they were just meetings, mostly oriented around discussions. They were working meetings. And they were usually very good.

The first GA, in Paris 2007, largely retained that, despite the fact that the IIPC was already beginning to grow. There was no 'open day'. 

By 2010 in Singapore, the open day was there, but in a way that made sense, and it didn't overly affect the rest of the GA. I did notice, however, a marked change in the level of engagement by the attendees during sessions.

There seemed to be more people there 'just to listen'. There had always been some of those, but I found it difficult to get discussions going where, two years prior, they'd usually been difficult to stop in order to take breaks! Not that those discussions had always been all that productive (some of it was just talk), but the atmosphere was more restrained.

At that time I was co-chair of the Harvesting Working Group (HWG) along with Lewis Crawford of the British Library. And although there was always good attendance at the HWG meetings we really struggled to engage the attendees.

Helen Hockx-Yu and Kris Carpenter, who led the Access Working Group (AWG), did a better job of this but clearly felt the same problem. Ultimately, both the HWG and the AWG became more GA events than working groups, and both have now been decommissioned.

With larger groups, and especially with many there 'just to listen', it becomes much easier to just do a series of presentations. It's safer, more predictable, and when you add the pressure to fit in all the material from the CFP, it becomes inevitable.

But, in the process we have lost something.

Now that the WAC is firmly established and can serve very well for the people who 'just want to listen', I think it is time we refocus the GA on being working meetings: a venue for addressing both consortium business (like the portfolio breakout sessions in Reykjavik, but with more time!) and the work of the consortium (like the OpenWayback meeting and the Preservation and Content Development Working Group meetings in Reykjavik).

This will inevitably include some presentations (but keep them to a minimum!) and there may be some panel discussions, but the overall focus should be on working meetings where specific topics are discussed and, as much as possible, actions are decided.

That's why I nominated the people I did for the GA Organizing Group. These are the people driving the work of the consortium. They should help form the GA agenda. At least as far as their area of responsibility is concerned.

To accommodate the less knowledgeable GA attendee (e.g. new members) it may be a good idea to schedule tutorials and/or training sessions in parallel to some of these working meetings.

I believe this can build up a more engaged community. And for those not interested in participating in specific work, the WAC will be there to provide them with an opportunity to learn and connect with other members.

This won't be an easy transition. As my experience with the HWG showed, it can be difficult to engage people. But by having a conference (and perhaps training events) to divert those just looking to learn, and by building sessions around specific strategic goals, I think we can bring this element of 'work' back.

And if we can't, I'm not sure we have much of a future except as a yearly conference.

April 28, 2016

New 'semi-stable' build for Heritrix

Earlier this month I mentioned that I was planning on making another one of my "semi-stable" Heritrix builds, using the current master 3.3.0. This work is now underway in the Landsbokasafn Heritrix repo on GitHub: LBS-2016-02.

Heritrix 3.3.0-LBS-2016-02
I've merged in one pull request that is still open in the IA repository, #154 Fixes for apparent build errors. Most notably, this makes it possible to have Travis-CI build and test Heritrix.

You can review the full list of changes between my last Heritrix build (2015-01) and this new one here. Here is a list of the main changes:

  • Some fixes to how server-not-modified revisit records are written (PR #118).
  • Fix outlink hoppath in metadata records (PR #119)
  • Allow dots in filenames for known good extensions (PR #120)
  • Require Maven 3.3 (PR #126)
  • Allow realm to be set by server for basic auth (PR #124)
  • Better error handling in StatisticsTracker (PR #130)
  • Fix to Java 8 Keytool (PR #129) - I wrote a post about this back in 2014.
  • Changes to how cookies are stored in Bdb (PR #133)
  • Handle multiple clauses for same user agent in robots.txt (PR #139)
  • SourceSeedDecideRule and SeedLimitsEnforcer (PR #137 and #148)
  • 'Novel' URL and byte quotas (PR #138)
  • Only submit 'checked' checkbox and radio buttons when submitting forms (PR #122)
  • Form login improvements (PR #142 and #143)
  • Improvements to hosts report (PR #123)
  • Handle SNI error better (PR #141)
  • Allow some whitespace in URLs extracted by ExtractorJS (PR #145)
  • Fix to ExtractorHTML dealing with HTML comments (PR #149)
  • Build against Java 7 (PR #152)

I've ignored all pull requests that apply primarily to the contrib package in the above. There were quite a few there, mostly (but not exclusively) relating to AMQP.

I've done some preliminary testing and everything looks good.  So far, the only issue I've noted is one that I was already aware of, about noisy alerts relating to 401s.

I'll be testing this version further over the next few weeks and welcome any additional input.