Kris's blog: Web archiving, mostly from a technical aspect. Plus other things, occasionally. By Kristinn Sigurðsson.<br />
<br />
<h2>On screenshots and other 'associated resources' in WARCs</h2>
<i>2018-11-12</i><br />
<br />
Having screenshots of web pages can be a useful augmentation to a web archive, providing a record of how a website looked in browsers contemporary to the capture. With browser-based crawling becoming more common, creating these is, likewise, becoming easier.<br />
<br />
Currently, any screenshots stored in WARCs are invisible to replay tools, significantly reducing their usefulness, and it is not entirely clear how these screenshots should be stored in WARCs in the first place. It is important that a standard way be defined for this type of 'associated resource' so that replay tools can provide consistent access to them.<br />
<br />
I've used the term 'associated resource' rather than just 'screenshot' as there are other types of data associated with a URL that we may wish to store. An obvious example might be a video that is embedded in a webpage in a manner that cannot be easily crawled or replayed. That video might be extracted through side channels and associated with the original web page via the same mechanism. Replay tools would then, at minimum, provide links to open the video in their 'header'. More advanced replay tools might use rewriting rules to inject it into, e.g., YouTube pages.<br />
<br />
There are likely a number of other uses for this type of mechanism. Another example might be annotation or documentation attached to a URL in this manner, either via manual curation or an automatic process.<br />
<br />
As these are not 'just' metadata that is available for 'big data' style processing, it is important to carefully consider this from the perspective of the replay tools. Notably, how we store this in WARC must facilitate easy discovery using typical web archive indexing (CDX/URL based indexing).<br />
<br />
We must also consider that each 'associated resource' will require some amount of metadata to provide suitable context for the replay tool. This goes beyond just the type (screenshot, video) to include, e.g., the type of capture (is the screenshot based on the same HTTP transaction as the primary resource, was it created in parallel, or was it perhaps created via a replay mechanism), which browser was used, etc.<br />
<br />
An initial idea might be to store each of these 'associated resources' in a WARC resource record using the URL of the primary resource with a special prefix. E.g. PREFIX:http://example.org. Metadata could then be stored in the record's header using a set of custom fields, a single 'metadata' field containing a JSON 'payload' or some mix of the two.<br />
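<br />
As a purely illustrative sketch (the 'screenshot:' prefix, the metadata header and every value below are hypothetical, not an agreed convention), such a record might look something like this:<br />
<br />
<pre>WARC/1.0
WARC-Type: resource
WARC-Target-URI: screenshot:http://example.org/
WARC-Date: 2018-11-12T21:20:00Z
WARC-Record-ID: &lt;urn:uuid:0f87e547-2f27-4b0a-8a2a-3d5c1e2b9a01&gt;
WARC-Concurrent-To: &lt;urn:uuid:5e0f6a2c-9a1b-4d7e-8c3f-2b4a6d8e0f12&gt;
X-Resource-Metadata: {"type":"screenshot","capture":"parallel","browser":"Chrome 70"}
Content-Type: image/png
Content-Length: 48213</pre>
<br />
The standard WARC-Concurrent-To field could tie the screenshot to the record ID of the primary capture it belongs to.<br />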
<br />
I'm unsure if this is the best approach, but it serves as a starting point. Over the next few months I'm hoping, with broad input from the people who are building the tools that create and use this data, to write up a proposal for standardizing this that the IIPC could endorse. The document might also include some 'best practice' guidance on how replay tools should handle this data.<br />
<br />
If you would like to be a part of that conversation, please get in touch.<br />
<br />
<h2>Wanted: New Leaders for OpenWayback</h2>
<i>2016-10-10</i><br />
<br />
[This post originally appeared on the <a href="https://netpreserveblog.wordpress.com/2016/10/03/wanted-new-leaders-for-openwayback/">IIPC blog</a> on 03/10/2016]<br />
<br />
The IIPC is looking for one or two people to take on a leadership role in the <a href="http://www.netpreserve.org/openwayback">OpenWayback project</a>.<br /><br />The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition, the OpenWayback project has been working to define access-related APIs.<br /><br />The OpenWayback project thus plays an important role in the IIPC's efforts to foster the development and use of common tools and standards for web archives.<br /><br />
<h3>
Why now?</h3>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIg4lEeM55pS2mWnaVvBGeN_rxjplZsuw3wFJ93-55HobxCR6_PtF-ReBmk97PyHAIYfkwi2LHD1O-nkPyuezO36aw0hcloSJln02hBTCDdKpfIGQL-vaL_muJy91zFzr-UamvzcipP2k/s1600/OpenWayback-banner100.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="50" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIg4lEeM55pS2mWnaVvBGeN_rxjplZsuw3wFJ93-55HobxCR6_PtF-ReBmk97PyHAIYfkwi2LHD1O-nkPyuezO36aw0hcloSJln02hBTCDdKpfIGQL-vaL_muJy91zFzr-UamvzcipP2k/s200/OpenWayback-banner100.png" width="200" /></a><br />The OpenWayback project is at a cross roads. The IIPC first took on this project three years ago with <br /><br />Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software as evidenced by the release of OpenWayback 2.0.0 through 2.3.0.<br /><br />We then embarked on a somewhat more ambitious task to improve the core of the software. A significant milestone that is now ending as a new ‘CDX server’ or resource resolver is being introduced. You can read more about that here.<br /><br />This marks the end of the paid position (at least for time being). The original 16 months wound up being spread over somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (who hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.<br /><br />I’ve been guiding the project over the last year since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I felt like I was never able to give the project the kind of attention that is needed to grow it. Now it seems to be a good time for a change.<br /><br />With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit, which is a shame bearing in mind the significance of the project to the community and the time already invested in it.<br />
the initial objective to make the software easier to install, run and manage. This included cleaning up the code and improving documentation.<br />
<br />
<h3>
Who are we looking for?</h3>
<br />While a technical background is certainly useful, it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.<br /><br />Ideally, I'd like to see two leads with complementary skill sets: technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.<br /><br />You'll not be alone; aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the <a href="https://groups.google.com/forum/#!forum/openwayback-dev">OpenWayback Google Group</a> and the <a href="https://github.com/iipc/">IIPC GitHub page</a>.<br /><br />It would be simplest if the new leads were drawn from <a href="http://netpreserve.org/about-us/members">IIPC member institutions</a>. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.<br /><br />If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.<br /><br />There is no deadline, as such, but ideally I'd like the new leads to be in place prior to our next General Assembly in Lisbon next March.<br />
<br />
<h2>3 crawlers : 1 writer</h2>
<i>2016-09-26</i><br />
<br />
Last week I attended an IIPC-sponsored hackathon with the overarching theme of 'Building better crawlers'. I can't say we built a better crawler in the room, but it did help clarify for me the likely future of archival crawling. And it involves three types of crawlers.<br />
<br />
The first type is the bulk crawler; Heritrix is an example of this. It can crawl a wide variety of sites 'well enough' and has fairly modest hardware requirements, allowing it to scale quite well. It is, however, limited in its ability to handle scripted content (i.e. JavaScript), as all link extraction is based on heuristics.<br />
<br />
The second type is a browser driven crawler. Still fully (mostly) automated but using a browser to render pages. Additionally, scripts can be run on rendered pages to simulate scrolling, clicking and other user behavior we may wish to capture. Brozzler (Internet Archive) is an example of this approach. This allows far better capture of scripted content, but at a price in terms of resources.<br />
<br />
For large scale crawls, it seems likely that a hybrid approach would serve us best: have a bulk crawler cover the majority of URLs, only delegating those URLs that are deemed 'troublesome' to the more expensive browser-based rendering.<br />
<br />
The trick here is to make the two approaches work together smoothly (Brozzler, for example, handles state very differently from Heritrix) and to be smart about which content goes in which bucket.<br />
<br />
The third type of crawler is what I'll call a <i>manual</i> crawler. I.e. a crawler whose activities are entirely driven by a human operator. An example of this is Webrecorder.io. This enables us to fill in whatever blanks the automated crawlers leave. It can also prove useful for highly targeted collection, where curators are handpicking, not just sites, but specific individual pages. They can then complete the process, right there in the browser.<br />
<br />
There is, however, no reason that these crawlers can not all use the same back end for writing WARCs, handling deduplication and otherwise doing post acquisition tasks. By using a suitable <i>archiving proxy</i> all three types of crawlers can easily add their data to our collections. <br />
<br />
Such proxy tools already exist; it is simply a matter of making sure these crawlers use them (many already do), and that they use them consistently. I.e. that there is a nice clear API for an archiving proxy that covers the use cases of all the crawlers and allows them to communicate collection metadata, dictate deduplication policies, etc.<br />
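<br />
As a rough sketch of what this could look like from a crawler's side, the snippet below (Python) routes a fetch through a local archiving proxy and passes collection metadata in a request header. The proxy address is made up, and the 'Warcprox-Meta' header is borrowed from warcprox purely as an illustration; the exact header and keys depend on the proxy you use.<br />
<br />
<pre>import json
import requests

# Hypothetical proxy endpoint; warcprox, for example, listens on a local port like this.
proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}

# Illustrative metadata: ask the proxy to file records under a named collection.
meta = {"warc-prefix": "hackathon-test-crawl"}

resp = requests.get(
    "http://example.org/",
    proxies=proxies,
    headers={"Warcprox-Meta": json.dumps(meta)},
    verify=False,  # archiving proxies re-sign TLS with their own CA certificate
)
print(resp.status_code)</pre>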
<br />
Now is the right time to establish this API. I think the first steps in that direction were taken at the hackathon. Hopefully, we'll have a first draft available on the IIPC GitHub page before too long.<br />
<br />
<h2>Heritrix 3.3.0-LBS-2016-02, now in stores</h2>
<i>2016-05-30</i><br />
<br />
A month ago I posted that <a href="https://kris-sigur.blogspot.is/2016/04/new-semi-stable-build-for-heritrix.html">I was testing a 'semi-stable' build of Heritrix</a>. The new build is called "Heritrix 3.3.0-LBS-2016-02" as it is built for LBS's (the Icelandic acronym for my library) 2016-02 domain crawl (i.e. the second one this year).<br />
<br />
I can now report that this version has passed all my tests without any regressions showing up. I've noted two minor issues, one of which was fixed immediately (<a class="issue-title-link js-navigation-open" href="https://github.com/internetarchive/heritrix3/issues/158">Noisy alerts about 401s without auth challenge</a>) while the other has been around since at least Heritrix 3.1.0 and does not affect crawling in any way (<a class="issue-title-link js-navigation-open" href="https://github.com/internetarchive/heritrix3/issues/160">Bug in non-fatal-error log</a>).<br />
<br />
Additionally, I heard from Netarkivet.dk. They also tested this version with no regressions found.<br />
<br />
I think it is safe to say that if you are currently using my previous semi-stable build (LBS-2015-01), upgrading to this version should be entirely straightforward. There are no notable API changes to worry about either. Unless, of course, you are using features that are less 'mainstream'.<br />
<br />
You can find this version on our <a href="https://github.com/Landsbokasafn/heritrix3/tree/LBS-2016-02">Github page</a>. You'll have to download the source and build it for yourself.<br />
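<br />
For reference, the build is the usual Maven routine; a sketch assuming Git, a JDK and Maven are installed (the exact location of the output may vary by version):<br />
<br />
<pre>git clone -b LBS-2016-02 https://github.com/Landsbokasafn/heritrix3.git
cd heritrix3
mvn clean package -DskipTests
# the distribution tarball should then appear under the dist module's target/ directory</pre>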
<br />
<b>Update</b> As you can see in the comments below, Netarkivet.dk has put the artifacts into a <a href="https://sbforge.org/nexus/index.html#view-repositories;thirdparty~browsestorage">publicly accessible repository</a>. Very helpful if you have code with dependencies on Heritrix and you don't have your own repository.<br />
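<br />
If you go down that route, the wiring in your pom.xml would look roughly like the sketch below. Both the repository URL (guessed from the standard Nexus layout) and the artifact coordinates are illustrative; check the repository itself for the exact values.<br />
<br />
<pre>&lt;repositories&gt;
  &lt;repository&gt;
    &lt;id&gt;sbforge-thirdparty&lt;/id&gt;
    &lt;url&gt;https://sbforge.org/nexus/content/repositories/thirdparty/&lt;/url&gt;
  &lt;/repository&gt;
&lt;/repositories&gt;

&lt;dependency&gt;
  &lt;groupId&gt;org.archive.heritrix&lt;/groupId&gt;
  &lt;artifactId&gt;heritrix-engine&lt;/artifactId&gt;
  &lt;version&gt;3.3.0-LBS-2016-02&lt;/version&gt;
&lt;/dependency&gt;</pre>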
<br />
Thanks for the heads-up, Nicholas.<br />
<br />
<h2>WARC MIME Media Type</h2>
<i>2016-05-17</i><br />
<br />
A curious thing came up during the <a href="https://kris-sigur.blogspot.is/2015/08/the-warc-format-11.html">WARC 1.1</a> review process. In version 1.0, <a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#registration-of-mime-media-types-applicationwarc-and-applicationwarc-fields">section 8</a> talked about what <a href="https://en.wikipedia.org/wiki/MIME_type">MIME media types</a> should be used when exchanging WARCs over the Internet. During the review process, however, it was pointed out that this is actually <a href="https://github.com/iipc/warc-specifications/issues/33">outside the scope of the standard</a>. 1.1 consequently drops section 8.<br />
<br />
For now we should regard the instructions from 1.0 section 8 as best practices. But it isn't part of any official standard.<br />
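<br />
Concretely, that means continuing to label WARC files with the media types section 8 registered (application/warc and application/warc-fields), e.g. when serving one over HTTP. A sketch, with the filename and length invented:<br />
<br />
<pre>HTTP/1.1 200 OK
Content-Type: application/warc
Content-Disposition: attachment; filename="crawl-20160517.warc"
Content-Length: 1048576</pre>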
<br />
That's not to say that it isn't important to have a standard set of MIME types for WARC content. Only that the WARC ISO standard isn't the place for it. This is actually something that <a href="https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority">IANA</a> is responsible for, with specification work going through the <a href="https://en.wikipedia.org/wiki/Internet_Engineering_Task_Force">IETF</a>, if I'm understanding this correctly.<br />
<br />
I'm not at all familiar with this <a href="https://www.iana.org/assignments/media-types/media-types.xhtml">process</a>. But it is clear that if we wish to have this <i>standardized</i> then going through this process is the only option. If anyone can offer further insight into how we could move this forward please get in touch.<br />
<br />
<br />
<h2>What I learned hosting the 2016 IIPC GA/WAC</h2>
<i>2016-05-12</i><br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/wp-content/uploads/sites/161/2016/04/20160413_134700.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/wp-content/uploads/sites/161/2016/04/20160413_134700.jpg" height="143" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">National Library of Iceland<br />
Photo taken by GA/WAC <a href="http://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/2016/04/29/iipc-conference-reykjavik/">attendee</a></td></tr>
</tbody></table>
It's been nearly a month since the 2016 IIPC General Assembly (GA) / Web Archiving Conference (WAC) in Reykjavik <a href="https://kris-sigur.blogspot.is/2016/04/a-long-week-is-over-thank-you-all.html">ended</a> and I think I'm just about ready to try to deconstruct the experience a bit.<br />
<br />
<h3>
Plan ahead</h3>
<br />
Looking back, planning of the practical aspects - logistics - of the conference seems to have been mostly spot on. The 2015 event in Stanford had had a problem with no-shows, but this wasn't a big factor in Reykjavik; I suspect largely due to the small number of local attendees. Our expectations about the number of people who would come ended up being more or less correct (about 90 for the GA and 145 for the WAC).<br />
<br />
A big part of why the logistics side ran smoothly was, I feel, due to advance planning. We first decided to offer to host the 2016 GA in October of 2013. We made the space reservations at the conference hotel in September 2014. Consequently, there was never any rush or panic on the logistics. Everything felt like it was happening right on schedule with very few surprises.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://ipres-conference.org/ipres13/images/ist-lisbon/IST_sky-medium.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://ipres-conference.org/ipres13/images/ist-lisbon/IST_sky-medium.jpg" height="132" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The IIPC SC had a meeting in Lisbon<br />
following the 2013 iPres conference.<br />
The idea for Reykjavik as the venue <br />
for the 2016 IIPC GA first arose there.</td><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
Given how much work it was, despite all the careful planning, I don't care to imagine what doing this under pressure would be like. I've been advocating in the IIPC Steering Committee (SC), for years, that we should leave each GA with a firm date and place for the next <i>two</i> GAs and a good idea of where the one to be held in three years will be.<br />
<br />
Nothing, in my experience hosting a GA, has changed my mind about that.<br />
<br />
<h3>
Spendthrift </h3>
<br />
There was some discussion about whether some days/sessions should be recorded and put online. This was done in <a href="https://www.youtube.com/playlist?list=PL5AWMCpp1Dii04znyO2dYwQvVQ12ndqv4">Stanford</a>, but looking at the viewing numbers, I felt that it represented a poor use of money. Ultimately the SC agreed. Recording and editing can be quite costly. It may be worth reviewing this decision in the future. Or, perhaps something else can be used to 'open' the conference to those not physically present.<br />
<br />
It was certainly a worthwhile experiment, but overall, I think we made the right decision not doing it in Reykjavik. Especially as the cost would have been quite high, even compared to Stanford.<br />
<br />
Another thing we decided not to spend money on was an event planner. I know one was used for the 2015 GA. That one needed to be planned in a hurry and thus may have required such a service. But I can't see how it would have made things much easier in 2016 unless you're willing to hand over the responsibility for making specific choices to the planner. Such as catering etc.<br />
<br />
True, that does take a bit of effort, but I felt that was a part of the responsibility that comes with hosting. Just handing it over to a planner wouldn't have sat right. And if I'm vetting the planner's choices, then very little effort is being saved.<br />
<br />
I'm happy to concede, though, that this may vary very much by location and host. <br />
<br />
<h3>
Communication</h3>
<br />
Some of the communication surrounding the GA/WAC was sub-optimal. The <a href="http://netpreserve.org/general-assembly/2016/overview">GA page on netpreserve.org</a> was never really up to the task, although it got better over time. Some of this was down to the lack of flexibility of the netpreserve website. Future events should have a solid communication plan at an early date. Including what gets communicated where and who is responsible for it. Perhaps it is time that each GA/WAC gets its own little website? Or perhaps not.<br />
<br />
The dual nature of the event also caused some confusion. This led some people to only register for one of the two events etc. There was also confusion (even among the program committee!) about whether the CFP was for the WAC and GA or WAC only.<br />
<br />
This leads us to the most important lesson I took away from this all...<br />
<br />
<h3>
Clearly separate the General Assembly and the Web Archiving Conference!</h3>
<br />
This isn't a new insight. We've been discussing what separates the 'open days' from 'member only' days for several years. In Reykjavik this was, for the first time, formally divided into two separate events. Yet, the distinction between them was less than absolutely clear.<br />
<br />
This is, at least in part, due to how the schedule for the two events was organized. A single program committee was set up (as has been the case most years). It was quite small this year. This committee then organized the call for proposals (CFP) and arranged the schedule to accommodate the proposals that came in from the CFP.<br />
<br />
This led to the conference over-spilling onto GA days (notably Tuesday). And it wasn't the first time that has happened. There was definitely a lack of separation in Stanford (although perhaps for slightly different reasons) and in Paris, in 2014, the effort to shoehorn in all the proposals from the CFP had a profound effect on the member-only days.<br />
<br />
This model of a program committee and a CFP is entirely suitable for a conference and should be continued. But going forward, I think it is absolutely necessary that <b>the program committee for the WAC have no responsibility or direct influence on the GA agenda</b>. <br />
<br />
To facilitate this I suggest that the organization of these two events consist of three bodies (in addition to the IIPC Steering Committee (SC) which will continue to bear overall responsibility).<br />
<br />
<ol>
<li><b>Logistics Team.</b> Membership includes 1-2 people from the hosting institution, the IIPC officers, at least one SC member (if the hosting institution is an SC member this may be their representative) and perhaps one or two people with relevant experience (e.g. have hosted before etc.). <br />This group is responsible for arranging space, catering, the reception, badges and other printed conference material, hotels (if needed) etc. They get their direction on the amount of space needed from the SC and the two other teams. <br />This group is responsible for the event staying under budget, which is why the treasurer is included.</li>
<li><b>WAC Program Committee.</b> The program committee would be comprised of a number of members and may include several non-members who bring notable expertise and have been engaged in this community for a long time. <br />The program committee would have a reserved space on it for the hosting institution (which they may decline). There should also be a minimum of one SC member on the committee. <br />The PCO (program and communications officer) would be included in all communications and assist the committee with communications with members and other prospective attendees (e.g. in sending out the CFP) but would not participate in evaluating the proposals sent in. <br />The program committee would have a hand in crafting the CFP, but input on the overall 'theme' would be expected from the SC.<br />The program committee's primary task would be to evaluate the proposals sent in after the CFP and arrange them into a coherent schedule. The mechanism for evaluating (and potentially rejecting!) proposals needs to be established before the proposals come in! Otherwise, it will be hard to avoid the feeling that the evaluations are being tailored to fit specific proposals. </li>
<li><b>GA Organizing Group</b>. The PCO would be responsible for coordinating this group. Included are the SC Chair and Vice Chair, portfolio leads and leaders of working and interest groups. For the most part, each member is primarily responsible for the needs of their respective areas of responsibility.<br />More on GA organization in a bit.</li>
</ol>
None of this gives the SC a free pass. As you'll note, I've mandated an SC presence in all the groups. This both gives the groups access to someone who can easily bring matters to the SC's attention and ensures that there is someone there to see that the direction the SC has laid out is, broadly speaking, followed.<br />
<br />
For the WAC, the SC's biggest responsibility (aside from choosing the location and setting the budget) will be in deciding how much time it gets (two days, two and a half, three?), what themes to focus around and whether the conference should try to accomplish a specific outreach goal (and if so how).<br />
<br />
This was, for example, the case in Stanford where the goal was to get the attention of the big tech companies. Getting Vint Cerf (a VP of Google) to be a keynote speaker was a good effort in that direction. Nothing similar was done during the Reykjavik meeting.<br />
<br />
<h3>
Keynotes</h3>
<br />
Keynotes are likely to be one of the best ways of accomplishing this. Getting a keynote speaker from a different background can help build bridges. I think this is absolutely a worthwhile path to consider.<br />
<br />
However, unless we are hosting the WAC in their backyard (as was the case with Vint Cerf), we need to reach out to them very early and probably be prepared to cover the cost of travel. This is a choice that needs to be made very early. And, indeed, the choice of a keynote may ultimately help frame the overall theme of the conference (or not).<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://pbs.twimg.com/media/Cf6YV4jWsAArbc4.jpg:large" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://pbs.twimg.com/media/Cf6YV4jWsAArbc4.jpg:large" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Hjálmar Gíslason delivering the<br />
2016 IIPC WAC opening keynote</td></tr>
</tbody></table>
We had two keynotes in Reykjavik. Both were great, although neither was chosen 'strategically'. The choice of Hjálmar Gíslason lay largely with my library. Allowing the hosting institution some influence on one of the keynotes may be appropriate. The other keynote, Brewster Kahle, wasn't chosen until after the CFP was in. We essentially asked him to expand his proposal into a keynote. Given the topic and Brewster's acclaim within our community, this worked out very well. We did have other candidates in mind (but no one confirmed). It was quite fortunate that such a perfect candidate fell into our laps.<br />
<br />
It is worth planning this early as people become unavailable surprisingly far in advance. <br />
<br />
It could also be argued that we don't need keynotes. People aren't coming to the IIPC WAC to hear some 'rock star' presenter. The event itself is the draw. But I think a couple of keynotes really help tie the event together.<br />
<br />
One change may be worth considering. Instead of a whole day with a single track featuring both keynotes, perhaps have multiple tracks on all days but do a single track session at the start of day one and at the end of day two that accommodates the keynotes and the welcome and wrap up talks.<br />
<br />
When we were trying to fit in all the proposals we got for Reykjavik, we considered doing this, but the idea simply arose too late. We were unable to secure the additional space required.<br />
<br />
Again, we need to plan early. <br />
<br />
<h3>
The General Assembly should not be a conference</h3>
<br />
The GAs have changed a lot over the years. The IIPC met in Reykjavik for the first time in 2005. Back then we didn't call the meetings "GAs"; they were just meetings. And they were mostly oriented around discussions. They were working meetings. And they were usually very good.<br />
<br />
The first GA, in Paris 2007, largely retained that, despite the fact that the IIPC was already beginning to grow. There was no 'open day'. <br />
<br />
By 2010 in Singapore, the open day was there. But in a way that made sense and it didn't overly affect the rest of the GA. I did notice, however, a marked change in the level of engagement by the attendees during sessions.<br />
<br />
There seemed to be more people there 'just to listen'. There had always been some of those, but I found it difficult to get discussions going, where two years prior they'd usually been difficult to stop in order to take breaks! Not that those discussions had always been all that productive (some of it was just talk), but the atmosphere was more restrained.<br />
<br />
At that time I was co-chair of the Harvesting Working Group (HWG) along with Lewis Crawford of the British Library. And although there was always good attendance at the HWG meetings we really struggled to engage the attendees.<br />
<br />
Helen Hockx-Yu and Kris Carpenter, who led the Access Working Group (AWG), did a better job of this but clearly felt the same problem. Ultimately, both the HWG and AWG became more GA events than working groups and have now been decommissioned.<br />
<br />
With larger groups, and especially with many there 'just to listen', it becomes much easier to just do a series of presentations. It's safer, more predictable, and when you add the pressure to fit in all the material from the CFP, it becomes inevitable.<br />
<br />
But, in the process we have lost something.<br />
<br />
Now that the WAC is firmly established and can serve very well for the people who 'just want to listen', I think it is time we refocus the GA on being working meetings. A venue for addressing both consortium business (like the portfolio breakout sessions in Reykjavik, but with more time!) and the work <i>of </i>the consortium (like the OpenWayback meeting and the Preservation and Content Development Working Group meetings in Reykjavik).<br />
<br />
This will inevitably include some presentations (but keep them to a minimum!) and there may be some panel discussions but the overall focus should be on working meetings. Where specific topics are discussed and, as much as possible, actions are decided.<br />
<br />
That's why I nominated the people I did for the <i>GA Organizing Group</i>. These are the people driving the work of the consortium. They should help form the GA agenda. At least as far as their area of responsibility is concerned.<br />
<br />
To accommodate the less knowledgeable GA attendee (e.g. new members) it <i>may</i> be a good idea to schedule tutorials and/or training sessions in parallel to some of these working meetings.<br />
<br />
I believe this can build up a more engaged community. And for those not interested in participating in specific work, the WAC will be there to provide them with an opportunity to learn and connect with other members.<br />
<br />
This won't be an easy transition. As my experience with the HWG showed, it can be difficult to engage people. But by having a conference (and perhaps training events) to divert those just looking to learn, and building sessions around specific strategic goals, I think we can bring this element of 'work' back.<br />
<br />
And if we can't, I'm not sure we have much of a future except as a yearly conference.<br />
<br />
<h2>New 'semi-stable' build for Heritrix</h2>
<i>2016-04-28</i><br />
<br />
Earlier this month <a href="https://kris-sigur.blogspot.is/2016/04/still-looking-for-stability-in-heritrix.html">I mentioned</a> that I was planning on making another one of my "semi-stable" Heritrix builds, using the current master 3.3.0. This work is now underway in the <a href="https://github.com/Landsbokasafn">Landsbokasafn</a> <a href="https://github.com/Landsbokasafn/heritrix3">Heritrix</a> repo on GitHub, as <a href="https://github.com/Landsbokasafn/heritrix3/tree/LBS-2016-02">LBS-2016-02</a>.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;">
<tbody>
<tr>
<td style="text-align: center;"><a href="https://travis-ci.org/Landsbokasafn/heritrix3/branches"><img border="0" src="https://travis-ci.org/Landsbokasafn/heritrix3.svg?branch=LBS-2016-02" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Heritrix 3.3.0-LBS-2016-02</td><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
I've merged in one pull request that is still open in the IA repository, <a href="https://github.com/internetarchive/heritrix3/pull/154">#154 Fixes for apparent build errors</a>. Most notably, this makes it possible to have Travis-CI build and test Heritrix.<br />
<br />
You can review the full list of changes between my last Heritrix build (2015-01) and this new one <a href="https://github.com/Landsbokasafn/heritrix3/compare/LBS-2015-01...Landsbokasafn:LBS-2016-02?expand=1">here</a>. Here is a list of the main changes:<br />
<br />
<ul>
<li>Some fixes to how <i>server-not-modified</i> revisit records are written (PR <a href="https://github.com/internetarchive/heritrix3/pull/118">#118</a>).</li>
<li>Fix outlink hoppath in metadata records (PR <a href="https://github.com/internetarchive/heritrix3/pull/119">#119</a>)</li>
<li>Allow dots in filenames for known good extensions (PR <a href="https://github.com/internetarchive/heritrix3/pull/120">#120</a>)</li>
<li>Require Maven 3.3 (PR <a href="https://github.com/internetarchive/heritrix3/pull/126">#126</a>) </li>
<li>Allow realm to be set by server for basic auth (PR <a href="https://github.com/internetarchive/heritrix3/pull/124">#124</a>)</li>
<li>Better error handling in StatisticsTracker (PR <a href="https://github.com/internetarchive/heritrix3/pull/130">#130</a>)</li>
<li>Fix to Java 8 Keytool (PR <a href="https://github.com/internetarchive/heritrix3/pull/129">#129</a>) - <a href="http://kris-sigur.blogspot.is/2014/10/heritrix-java-8-and-sunsecuritytoolskey.html">I wrote a post about this back in 2014</a>.</li>
<li>Changes to how cookies are stored in Bdb (PR <a href="https://github.com/internetarchive/heritrix3/pull/133">#133</a>)</li>
<li>Handle multiple clauses for same user agent in robots.txt (PR <a href="https://github.com/internetarchive/heritrix3/pull/139">#139</a>)</li>
<li>SourceSeedDecideRule and SeedLimitsEnforcer (PR <a href="https://github.com/internetarchive/heritrix3/pull/137">#137</a> and <a href="https://github.com/internetarchive/heritrix3/pull/148">#148</a>) </li>
<li>'Novel' URL and byte quotes (PR <a href="https://github.com/internetarchive/heritrix3/pull/138">#138</a>)</li>
<li>Only submit 'checked' checkbox and radio buttons when submitting forms (PR <a href="https://github.com/internetarchive/heritrix3/pull/122">#122</a>)</li>
<li>Form login improvements (PR <a href="https://github.com/internetarchive/heritrix3/pull/142">#142</a> and <a href="https://github.com/internetarchive/heritrix3/pull/143">#143</a>)</li>
<li>Improvements to hosts report (PR <a href="https://github.com/internetarchive/heritrix3/pull/123">#123</a>)</li>
<li>Handle SNI error better (PR <a href="https://github.com/internetarchive/heritrix3/pull/141">#141</a>)</li>
<li>Allow some whitespace in URLs extracted by ExtractorJS (PR <a href="https://github.com/internetarchive/heritrix3/pull/145">#145</a>)</li>
<li>Fix to ExtractorHTML dealing with HTML comments (PR <a href="https://github.com/internetarchive/heritrix3/pull/149">#149</a>)</li>
<li>Build against Java 7 (PR <a href="https://github.com/internetarchive/heritrix3/pull/152">#152</a>)</li>
</ul>
<br />
I've ignored all pull requests that apply primarily to the <i>contrib</i> package in the above. There were quite a few there, mostly (but not exclusively) relating to AMQP.<br />
<br />
I've done some preliminary testing and everything looks good. So far, the only issue I've noted is one that I was already aware of, about <a href="https://github.com/internetarchive/heritrix3/issues/158">noisy alerts relating to 401s</a>.<br />
<br />
I'll be testing this version further over the next few weeks and welcome any additional input.<br />
<br />
<h2>A long week is over. Thank you all.</h2>
<i>2016-04-18</i><br />
<br />
The 2016 IIPC General Assembly and Web Archiving Conference is over. Phew!<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU8Lhv2JKycKuo_ituRnHxAnyNm4xVHHOl0CGsOGZKcvvLwCikgPU3X790sr0mgKATTVeFufFaj1maI5tVLeebBjPmwN51iNowEQImcHuNce3M5IbfAd1yEgYcmvACduGzEdDDKRjpglQ/s1600/kris1.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU8Lhv2JKycKuo_ituRnHxAnyNm4xVHHOl0CGsOGZKcvvLwCikgPU3X790sr0mgKATTVeFufFaj1maI5tVLeebBjPmwN51iNowEQImcHuNce3M5IbfAd1yEgYcmvACduGzEdDDKRjpglQ/s200/kris1.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Me, opening the Harvesting Tools<br />
session on Tuesday </td><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
I always look forward to this event each year. It is by far the most stimulating and productive meeting/conference that I attend regularly. I believe we managed to live up to that this time.<br />
<br />
The meeting had a wonderful Twitter back-channel that you can still review using the hashtags <a href="https://twitter.com/search?f=tweets&vertical=default&q=%23iipcGA16&src=typd">#iipcGA16</a> and <a href="https://twitter.com/search?f=tweets&vertical=default&q=%23iipcWAC16&src=typd">#iipcWAC16</a>. <br />
<br />
It has been over two years since we, at the <a href="http://landsbokasafn.is/">National and University Library of Iceland</a>, offered to host the 2016 GA, and over a half year before that when the initial decision was made. Even with a 2.5 year lead time, it barely felt like enough.<br />
<br />
I'd like to take this opportunity to thank, again, all the people who helped make last week's event a success.<br />
<br />
First off, there is the program committee, which was very small this year, comprising, in addition to myself, (in alphabetical order) <a href="https://twitter.com/athurman">Alex Thurman</a> (Columbia University Libraries), <a href="https://twitter.com/gmj2053">Gina Jones</a> (Library of Congress), <a href="https://twitter.com/jasonmarkwebber">Jason Webber</a> (IIPC PCO/British Library), <a href="https://twitter.com/nullhandle">Nicholas Taylor</a> (Stanford University Libraries) and Peter Stirling (Bibliothèque nationale de France). I literally couldn't have done this without you.<br />
<br />
I'd also like to note the contribution of our incoming PCO in this list, Olga Holownia, who put in a lot of work during the conference to help make sure everything was just right for each session.<br />
<br />
Next, I'd like to thank my colleagues at the National Library who assisted me in organizing this event and helped out during the week by handling registration, running tours etc. It was a team effort. Notable mentions to <a href="https://twitter.com/Akigka">Áki Karlsson</a> and Erla Bjarnadóttir, who spent much of the week making sure that all the little details were attended to.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://pbs.twimg.com/media/CgJpTYaWsAAXN82.jpg:small" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="122" src="https://pbs.twimg.com/media/CgJpTYaWsAAXN82.jpg:small" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Steering Committee on Friday<br />
following the SC meeting</td></tr>
</tbody></table>
A big thank you to all the speakers and session moderators.<br />
<br />
And lastly, I'd like to thank the members of the Steering Committee for being willing to entrust the single most important event of the IIPC calendar to one of the IIPC's smallest members. Indeed, doing so without the slightest hesitation.<br />
<br />
I've learned a lot from this past week and I hope to be able to distill that experience and write it up so that next year's GA/WAC can be even better. But that will have to wait for another day and another blog post.<br />
<br />
For now, I'll just say thanks for coming and see you all again in Lisbon for #iipcGA17 and #iipcWAC17.<br />
<br />
<h2>Still Looking For Stability In Heritrix Releases</h2>
<i>2016-04-07</i><br />
<br />
I'd just like to briefly follow up on a blog post I wrote last September, <a href="http://kris-sigur.blogspot.is/2015/09/looking-for-stability-in-heritrix.html">Looking For Stability In Heritrix Releases</a>.<br />
<br />
The short version is that the response I got was, in my opinion, insufficient to proceed. I'm open to revisiting the idea if that changes, but for now it is on ice.<br />
<br />
There is little doubt in my mind that having (somewhat) regular stable releases made of Heritrix would be of notable benefit. Even better if they are published to Maven Central.<br />
<br />
Instead, I'll continue to make my own forks from time to time and make sure they are stable for me. The last one was dubbed <a href="https://github.com/Landsbokasafn/heritrix3/tree/LBS-2015-01">LBS-2015-01</a>. It is now over a year old and <a href="https://github.com/Landsbokasafn/heritrix3/compare/LBS-2015-01...internetarchive:master?expand=1">a lot has changed</a>. I expect I'll be making a new one in May/June. You can see what's changed in Heritrix in the meantime <a href="https://github.com/Landsbokasafn/heritrix3/compare/LBS-2015-01...internetarchive:master?expand=1">here</a>.<br />
<br />
I know a few organizations are also using my semi-stable releases. If you are one of them and would like to get some changes in before the next version (to be dubbed LBS-2016-02), you should try to get a PR into Heritrix before the end of April. Likewise, if you know of a serious/blocking bug in the current master of Heritrix, please bring it to my attention.<br />
<br />
<h2>Duplicate DeDuplicators?</h2>
<i>2016-04-01</i><br />
<br />
A <a href="https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages/8819">question</a> on the <a href="https://groups.yahoo.com/neo/groups/archive-crawler/info">Heritrix mailing list</a> prompted me to write a few words about deduplication in Heritrix and why there are multiple ways of doing it.<br />
<br />
<h3>
Heritrix's built in service</h3>
<br />
Heritrix's inbuilt deduplication service comes in two processors. One processor records each URL in a BDB datastore, or index (<code>PersistStoreProcessor</code>); the other looks up the current URL in this datastore and, if it finds it, compares the content digest (<code>FetchHistoryProcessor</code>).<br />
<br />
The index used to be mingled in with other crawler state. That is often undesirable as you may not wish to carry forward any of that state to subsequent crawls. Typically, thus, you configure the above processors to use their own directory by wiring in a special "BDB module" configured to write to an alternative directory.<br />
<br />
There is no way to construct the index outside of a crawl. This can be problematic since a hard crash will often corrupt the BDB data. Of course, you can recover from a checkpoint, if you have one. <br />
<br />
More recently, a new set of processors have been added. The <code class="xml string">ContentDigestHistoryLoader</code> and <code class="xml string">ContentDigestHistoryStorer</code>. They work in much the same way except they use an index that is keyed on the content digest, rather than the URL. This enables URL agnostic deduplication.<br />
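<br />
For orientation, wiring the digest-keyed pair into a crawl job's crawler-beans.cxml looks roughly like this. A sketch from memory: verify the class names, packages and the remaining wiring (e.g. pointing the processors at a digest history store) against the job profiles shipped with your Heritrix version.<br />
<br />
<pre>&lt;!-- fetch chain: look up the current capture's digest among known duplicates --&gt;
&lt;bean id="contentDigestHistoryLoader"
      class="org.archive.modules.recrawl.ContentDigestHistoryLoader"/&gt;

&lt;!-- disposition chain: record this capture's digest for later lookups --&gt;
&lt;bean id="contentDigestHistoryStorer"
      class="org.archive.modules.recrawl.ContentDigestHistoryStorer"/&gt;

&lt;!-- separate BDB environment, so deduplication state is kept apart from crawler state --&gt;
&lt;bean id="historyBdb" class="org.archive.bdb.BdbModule"&gt;
  &lt;property name="dir" value="history"/&gt;
&lt;/bean&gt;</pre>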
<br />
This was a questionable feature when introduced, but after the <a href="https://github.com/internetarchive/heritrix3/pull/79">changes made</a> to implement <a href="https://iipc.github.io/warc-specifications/specifications/warc-deduplication/recording-arbitrary-duplicates-1.0/">a more robust way of recording non-URL duplicates</a>, it became a useful feature, although its <a href="http://kris-sigur.blogspot.is/2014/12/the-results-of-uri-agnostic.html">utility will vary</a> based on the nature of your crawl.<br />
<br />
As this index is updated at crawl time, it also makes it possible to deduplicate on material discovered during the same crawl. A very useful feature that I now use in most crawls.<br />
<br />
You still can't build the index outside of a crawl.<br />
<br />
For more information about the built in features, consult the <a href="https://webarchive.jira.com/wiki/display/Heritrix/Duplication+Reduction+Processors">Duplication Reduction Processors</a> page on the <a href="https://webarchive.jira.com/wiki/display/Heritrix/Heritrix">Heritrix wiki</a>. <br />
<br />
<h3>
The DeDuplicator add-on</h3>
<br />
The <a href="https://github.com/Landsbokasafn/deduplicator">DeDuplicator</a> add-on pre-dates the built in function in Heritrix by about a year (released in 2006). It essentially accomplishes the same thing, but with a few notable notable differences in tactics.<br />
<br />
Most importantly, its index is <i>always</i> built outside of a crawl. Either from the crawl.log (possibly multiple log files) or from WARC files. This provides a considerable amount of flexibility as you can build an index covering multiple crawls. You can also gain the benefit of the deduplication as soon as you implement it. You don't have to run one crawl just to populate it.<br />
<br />
The DeDuplicator uses Lucene to build its index. This allows for multiple searchable fields which in turn means that deduplication can do things like prefer exact URL matches but still do digest only matches when exact URL matches do not exist. This affords a choice of <a href="https://github.com/Landsbokasafn/deduplicator/blob/master/deduplicator-heritrix/src/main/java/is/landsbokasafn/deduplicator/heritrix/SearchStrategy.java">search strategies</a>.<br />
<br />
The DeDuplicator also provides some additional statistics, can write more detailed deduplication data to the crawl.log and comes with pre-configured job profiles.<br />
<br />
The Heritrix 1 version of the DeDuplicator actually also supported deduplication based on '<a href="https://httpstatuses.com/304">server not modified</a>'. But it was dropped when migrating to H3 as no one seemed to be using it. The index still contains enough information to easily bring it back.<br />
<br />
<h3>
Bottom line</h3>
<br />
Both approaches ultimately accomplish the same thing. Especially after the <a href="https://github.com/internetarchive/heritrix3/pull/79">changes that were made</a> a couple of years ago to how these modules interact with the rest of Heritrix, there really isn't any notable difference in the output. All these processors, after determining a document is a duplicate, set the same flags and cause the same information to be written to the WARC (if you are using ARCs, do <i>not</i> use any URL agnostic features!)<br />
<br />
Ultimately, it is just a question of which fits better into your workflow.<br />
<br />
<br />
<h2>Declaring WARR on "CDX Server" API</h2>
<i>2016-03-18</i><br />
<br />
Work is currently ongoing to specify a "<a href="http://kris-sigur.blogspot.is/2016/02/openwayback-developing-api-for-cdx.html">CDX Server API</a>" for OpenWayback. The name of this API has, however, caused an unfortunate amount of confusion. Despite the name, the data served via this API needn't be in CDX files!<br />
<br />
The core purpose of this API is to respond to a query containing a URL and, optionally, a timestamp or time range with a set of <i>records</i> that fall within those parameters. This is meant to support two basic functionalities: one, replay of captured web content and, two, discovery of captured web content.<br />
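<br />
Whatever the name, the shape of the exchange is simple. A lookup might look something like this (the path, parameter names and response fields here are placeholders, not a settled API):<br />
<br />
<pre>GET /warr?url=http://example.org/&closest=20160317134700

{
  "url": "http://example.org/",
  "timestamp": "20160317134700",
  "record-type": "response",
  "digest": "sha1:EXAMPLEDIGESTVALUE234567890ABCDE",
  "filename": "crawl-20160317-00001.warc.gz",
  "offset": 881921
}</pre>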
<br />
CDXs need not enter into it. It is just that the most common way (by far) to manage such an index is to use sorted CDX files. Thus the unfortunate name. Nothing prevents alternative indexing solutions being used. You could use a relational database, Lucene or whatever tool allows lookups of strings!<br />
<br />
So, this API desperately needs a new name. My suggestion is "<b>W</b>eb <b>A</b>rchive <b>R</b>esource <b>R</b>esolution Service" or WARR Service for short. Yes, I did torture that until it produced a usable acronym.<br />
<br />
In my <a href="http://kris-sigur.blogspot.is/2016/03/rewriting-cdx-file-format.html">last post</a> I discussed changes to the CDX file format itself. Those changes should facilitate WARR servers running on CDX indexes. But ultimately, the development of the WARR Service API is not directly coupled to those changes. We should focus on developing the WARR Service API with respect to the established use cases.<br />
<br />
In truth, the exact scope and nature of this new API remains debated. You can find some lively discussion in <a href="https://github.com/iipc/openwayback/issues/305">this Github issue</a>. More on that topic another day.<br />
<br />
<h2>Rewriting the CDX file format</h2>
<i>2016-03-17</i><br />
<br />
<a href="https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/">CDX files</a> are used to support URL+timestamp searching of web archives. They've been around for a long time, having first been used to catalog the contents of <a href="https://archive.org/web/researcher/ArcFileFormat.php">ARC</a> files. Despite the advent of the <a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/">WARC file format</a>, they <a href="https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/">haven't changed much</a>. I think it is past due that we reconsider the format from the ground up.<br />
<br />
The current specification lists a large number of possible fields. Many are not used in typical scenarios.<br />
<br />
The first field is a canonicalized URL, i.e. a URL with trivial elements (such as the protocol) removed so that equivalent URLs end up with the same canonical form. This serves as the primary search key.<br />
<br />
The only problem with this is that searching for content in all subdomains is not possible without scanning the entire CDX. This is because the subdomain comes <i>before</i> the domain. Instead, we should use a <a href="http://crawler.archive.org/articles/user_manual/glossary.html#surt">SURT (Sort-friendly URI Reordering Transform)</a> form of the canonical URL. SURT URLs turn the domain/sub-domain structure around, making such queries fairly straightforward. There is essentially no downside to doing this and, in fact, a number of CDXs have been built in this manner, regardless of any "formal" standardization (as there isn't really any formal standard).<br />
<br />
I suggest that any revised CDX format mandate the use of SURT URLs for the first field. Furthermore, we should utilize the correct SURT format. In most (probably all) current CDXs with SURT URLs, an annoying mistake has been made where the closing comma is missing. An URL that should read:<br />
<code> com,example,www,)</code><br />
instead reads:<br />
<code> com,example,www)</code><br />
The protocol prefix has been removed as unnecessary, along with the opening parenthesis. <br />
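<br />
A toy transform (Python) showing the idea; it ignores the many details, IP addresses, ports, case folding, session-id stripping and so on, that real canonicalization must handle:<br />
<br />
<pre>from urllib.parse import urlsplit

def surt_key(url):
    """Reverse the host labels and keep the closing comma before ')'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return reversed_host + ",)" + path + query

print(surt_key("http://www.example.com/index.html"))  # com,example,www,)/index.html</pre>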
<br />
The second field should remain the timestamp, with whatever precision is available in the ARC/WARC, i.e. a w3c-iso8601 date of varying accuracy as per this <a href="https://github.com/iipc/warc-specifications/pull/21/files">proposed revision of the WARC standard</a> (the revision is extremely likely to be included in WARC 1.1).<br />
<br />
The third field would remain the original URL.<br />
<br />
The fourth field should be a content digest <i>including the hashing algorithm.</i> Presently, this field is missing the algorithm.<br />
<br />
The fifth field would be the WARC record type (or a special value to indicate an ARC response record). This is the most significant change, as it allows us to capture additional WARC record types (such as metadata and conversion) while also handling the existing fields in a more targeted manner (e.g. response vs revisit). It might be argued that this should be the second field, to facilitate searches for a specific record type. I believe that, properly implemented, this field would allow replay tools to effectively surface any content "related" to the URL currently being viewed, a problem that I know many are trying to tackle.<br />
<br />
The next two fields would be the WARC (or ARC) filename (this is supposed to be unique) of the file containing the record and the offset at which the record exists within the (W)ARC. This is as it works currently. Some would argue for a more expressive resource locator here, but I believe that is best handled by a separate (W)ARC resolution service. Otherwise you may have to substantially rebuild your CDX index just because you moved your (W)ARCs to a new disk or service.<br />
<br />
Lastly, there should be a single line JSON "blob" containing record type relevant additional data. For response records, this would include HTTP status code and content type which I've excluded from the "base" fields in the CDX. This part would be significantly more flexible due to the JSON format, allowing us to include optional data where appropriate etc. The full range of possible values is beyond the scope of this blog post.<br />
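<br />
Putting it all together, a single entry in the revised format might read as follows (one line; all values invented for illustration):<br />
<br />
<pre>com,example,www,)/page.html 2016-03-17T13:47:42Z http://www.example.com/page.html sha1:LSYHKOLOGT2V2KIAULOJ3FPDTVNQ6NAO response crawl-20160317-00001.warc.gz 881921 {"status":200,"mime":"text/html"}</pre>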
<br />
There is clearly more work to be done on the JSON aspect, plus some adjustments may be necessary to the base data, but I believe that, at minimum, this is the right direction to head in. Of course, this means we have to rebuild all our CDX files in order to implement this. That's a tall order, but the benefits should be more than enough to justify that one-time cost.<br />
<br />Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com2tag:blogger.com,1999:blog-5951939749830079328.post-2823546676030314552016-02-24T13:08:00.000+00:002016-02-24T13:08:04.478+00:003 things I shouldn't have to tell you about running a "good" crawlerI've been running large and small scale crawls for almost 12 years. In that time I've encountered any number of unfortunate circumstances where our crawler has caused a website some amount of trouble. We always aim to run a "good" crawler. One that is respectful of website operators and doesn't cause any issues. To accomplish this we have some basic rules and limits to abide by. The reasons we sometimes do cause trouble come down to the complicated nature of the internet. <br />
<br />
During this time I've also been responsible for <i>operating</i> a number of websites. Including a few that are (by Icelandic standards) quite popular. Doing so has shown me the other side of web crawling. Turns out that <i>a lot</i> of crawlers are not following the basic dos and don'ts of web crawling. Including some supposedly respectable crawlers.<br />
<br />
This seems to be getting worse each year.<br />
<br />
Of course there are "bad" robots. Run by people who do not care about the negative impact they cause. But even "good" robots (i.e. ones that at least <i>seem</i> to have good intentions) are all too frequently misbehaving.<br />
<br />
So, here are 3 things you absolutely should abide by if you want your robot to be considered "good". Remember, it doesn't matter if your robot is doing a web-scale crawl or scraping a single site. As soon as you've scripted something to automatically fetch stuff from a 3rd party server (without the 3rd party's explicit permission) you are running a crawler.<br />
<br />
<b>1. Be polite</b><br />
<br />
Never make concurrent requests to the same site. Rate limit yourself to around 1 request every 2-5 seconds or so. Wait longer between requests if the responses are slow. If you hit a 500 error code, wait a few minutes.<br />
<br />
If you know (with certainty) that the site you are crawling can handle a more aggressive load (e.g. crawling google.com), it may be OK to step it up a bit. But when crawling sites where you do not have any insight, it is best to be cautious. Remember, yours is probably not the only crawler hitting them. Not to mention all the regular users.<br />
<br />
I know, it can be infuriating when trying to scrape a large dataset. It could take weeks! But your needs do not outweigh the needs of other users. Also, crawlers are often more expensive to serve than "normal" users. They tend to systematically go through the entire site, meaning that caching strategies fail to speed up their requests. <br />
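<br />
To make rule 1 concrete, here is a minimal sketch, in Java, of what a polite, single-threaded fetch loop might look like. The names and numbers here are mine, not anything standard:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">import java.net.HttpURLConnection;
import java.net.URL;

public class PoliteFetcher {
    // Minimum pause between requests to the same site.
    private static final long CRAWL_DELAY_MS = 3000;

    public static void fetchAll(String[] urls) throws Exception {
        for (String u : urls) {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            int status = conn.getResponseCode();
            long elapsed = System.currentTimeMillis() - start;
            conn.disconnect();
            if (status >= 500) {
                Thread.sleep(5 * 60 * 1000); // server trouble: back off for a few minutes
            }
            // Never fire requests back to back; slow down further if the server is slow.
            Thread.sleep(Math.max(CRAWL_DELAY_MS, elapsed * 2));
        }
    }
}</div>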
<br />
<b>2. Identify yourself</b><br />
<br />
Your user agent string <b><i>must</i> </b>contain enough information to allow a website operator to find out who you are, why you are crawling their site and how to get in touch with you. This isn't negotiable.<br />
<br />
It also goes for little custom tools built to just scrape that one website. You don't have to set a user agent if you are using e.g. curl on the command line to get a single resource. But the moment you script or program something you must put identifying information in the user agent string. If all I see is<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"> curl/7.19.7 (x86_64-redhat-linux-gnu) lib...</span><br />
<br />
I'll assume it is a "bad" crawler. Same for things written in Python, Java etc. <b>Always identify yourself.</b><br />
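<br />
Doing this takes one line. A common convention (not a formal standard) is to include a URL describing the robot, prefixed with a plus sign. Something along these lines, with the bot name and URLs obviously being placeholders:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
// Who you are, where to read more and how to reach you.
conn.setRequestProperty("User-Agent",
        "MyResearchBot/1.0 (+http://example.org/bot-info; mailto:crawler@example.org)");</div>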
<br />
And make sure you are responsive to any feedback you may get. Something a number of big, supposedly "good" crawl operators fail to do.<br />
<br />
The thing is, if your robot is causing a problem and I don't know who is operating it, I will ban it. Should you wish to have that ban lifted, I will not be especially predisposed towards cooperation. You've already made a very bad first impression.<br />
<br />
Yes, you might be able to get around a ban by getting a new IP address, but at that point you are no longer running a "good" crawler.<br />
<br />
The bottom line is, even if you are very careful, your crawler may inadvertently cause a problem. In those scenarios, if your intentions are good, you must make yourself available to deal with the issue and prevent it from reoccurring. If you do not do this, you are running a "bad" crawler.<br />
<br />
<b>3. Honor robots.txt</b><br />
<br />
I'm aware of the irony here. As an operator of a <a href="https://en.wikipedia.org/wiki/Legal_deposit">legal deposit</a> crawler, I do not respect <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard">robots.txt</a> in most of the crawls I conduct. But you probably don't have that shield of being legally required to "get all the URLs".<br />
<br />
A lot of websites block perfectly "legitimate" material using robots.txt (e.g. images). It is annoying. But if you are running a "good" crawler, then you have to abide by these rules, including the crawl delay. You really should also be able to parse the newer wildcard rules.<br />
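<br />
If you are writing in Java, you don't even have to write the parser yourself. The open source <a href="https://github.com/crawler-commons/crawler-commons">crawler-commons</a> library handles Allow/Disallow and the wildcard rules. A rough sketch (check the exact signatures and units against the library's documentation; the bot name and URLs are placeholders):<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// robotsTxt holds the raw bytes of http://example.org/robots.txt, fetched beforehand.
static boolean mayFetch(String url, byte[] robotsTxt) {
    BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
            "http://example.org/robots.txt", robotsTxt, "text/plain", "MyResearchBot");
    // Remember to also honor rules.getCrawlDelay() between requests.
    return rules.isAllowed(url);
}</div>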
<br />
At most, you might bend the rules to get embedded content (images) necessary to render the page. Even that should be done carefully and only while adhering to the first two rules firmly.<br />
<br />
If you feel you need content blocked by robots.txt, you must <i>ask</i> for it. Politely. Some sites may be happy to assist you (ours included). Others may tell you to go away. Either way, if you are running a "good" crawler, you'll have your answer. Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-23613585836876375602016-02-15T14:11:00.001+00:002016-02-15T14:11:40.268+00:00OpenWayback - Developing an API for a CDX Server<a href="https://iipc.github.io/openwayback/2.3.0/release_notes.html">OpenWayback 2.3.0</a> was released about a month ago. It was a modest release aimed at fixing a number of bugs and introducing a few minor features. Currently, work is focused on version 3.0.0, which is meant to be a much more impactful release.<br />
<br />
Notably, the indexing function is being moved into a separate (bundled) entity, called the CDX Server. This will provide a clean separation of concerns between resource resolution and the user interface.<br />
<br />
The CDX Server had already been partially implemented. But as work began on refining that implementation it became clear that we would also need to examine very closely the API between these two discrete parts. The API that currently exists is incomplete, at times vague and even, on occasion, contradictory. Existing documentation (or code review) may tell you <i>what</i> can be done, but it fails to shed much light on <i>why</i> you'd do that.<br />
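<br />
For those who haven't poked at it, the API in question is essentially an HTTP query interface over the index. A typical request today looks something along these lines (parameter names vary a bit between implementations and versions, which is part of the problem):<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">GET /cdx?url=example.com/path/page.html&from=20150101000000&to=20151231235959&limit=10</div>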
<br />
This isn't really surprising for an API that was developed "bottom-up", guided by personal experience and existing (but undocumented!) code requirements. This is pretty typical of what happens when delivering something <i>functional</i> is the first priority. It is a technical debt, but one you may feel is acceptable when delivering a single customer solution.<br />
<br />
The problem we face is that once we push for the CDX Server to become the norm, it won't be a single customer solution. Changes to the API will be painful and, consequently, rare.<br />
<br />
So, we've had to step back and examine the API with a critical eye. To do this we've begun to compile a <a href="https://github.com/iipc/openwayback/wiki/CDX-Server-requirements">list of use cases</a> that the CDX Server API needs to meet. Some are fairly obvious, others are perhaps more wishful thinking. In between there are a large number of corner cases and essential but non-obvious use cases that must be addressed. <br />
<br />
We welcome and encourage any input on this list. You can contribute either by editing <a href="https://github.com/iipc/openwayback/wiki/CDX-Server-requirements">the wiki page</a>, by commenting on the <a href="https://github.com/iipc/openwayback/issues/305">relevant issue in our tracker</a> or by sending an e-mail to the <a href="https://groups.google.com/forum/#!forum/openwayback-dev">OpenWayback-dev mailing list</a>.<br />
<br />
To be clear, the purpose of this is primarily to ensure that we fully support existing use cases. While we will consider use cases that the current OpenWayback cannot handle, they will, necessarily, be ascribed a much lower priority.<br />
<br />
Hopefully, this will lead to a fairly robust API for a CDX Server. The use cases may also allow us to firm up the API of the wayback URL structure itself. Not only will this serve the OpenWayback project, getting this API right is very important to facilitate alternative replay tools!<br />
<br />
As I said before, we welcome any input you may have on the subject. If you feel unsure of how to become engaged, please feel free to contact me directly.Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-59050879867673612802016-02-03T14:25:00.000+00:002016-02-03T14:25:18.083+00:00How to SURT IDN domains?Converting URLs to SURT (<a href="http://crawler.archive.org/articles/user_manual/glossary.html#surt">Sort-friendly URI Reordering Transform</a>) form has many benefits in web archiving and is widely used when configuring crawlers. Notably, SURT prefixes make it easy to apply rules to a logical segment of the web.<br />
<br />
Thus the SURT prefix:<br />
<blockquote class="tr_bq">
http://(is,</blockquote>
Will match all domains under the .is TLD.<br />
<br />
A bit less known ability is to match against <i>partial</i> domain names. Thus the following SURT prefix:<br />
<blockquote class="tr_bq">
http://(is,a</blockquote>
Would match all .is domains that begin with the letter <b>a</b> (note that there isn't a comma at the end).<br />
<br />
This all works quite well, until you hit <a href="https://en.wikipedia.org/wiki/Internationalized_domain_name">Internationalized Domain Names</a> (IDNs). As the original infrastructure of the web does not really support non-ASCII characters, all IDNs are designed so that they can be translated into an ASCII equivalent.<br />
<br />
Thus the IDN domain <i>landsbókasafn.is</i> is actually represented using the "<a href="https://en.wikipedia.org/wiki/Punycode">punycode</a>" representation <i>xn--landsbkasafn-5hb.is</i>.<br />
<br />
When matching SURTs against full domain names (trailing comma), this doesn't really matter. But, when matching against a domain name prefix, you run into an issue. Considering the example above, should <i>landsbókasafn.is</i> match the SURT <i>http://(is,l</i>?<br />
<br />
The current implementation (at least in Heritrix's much used <a href="https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/deciderules/surt/SurtPrefixedDecideRule.java"><i>SurtPrefixedDecideRule</i></a>) is to evaluate only the punycode version (so no, but it would match <i>http://(is,x</i>).<br />
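<br />
The translation itself is easy to reproduce with java.net.IDN, which gives a feel for what the crawler is working with by the time SURTs are formed:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">import java.net.IDN;

public class IdnSurtExample {
    public static void main(String[] args) {
        // Prints "xn--landsbkasafn-5hb.is". Once reordered, the SURT begins
        // "http://(is,xn--", so it matches http://(is,x but not http://(is,l.
        System.out.println(IDN.toASCII("landsbókasafn.is"));
    }
}</div>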
<br />
This seems potentially limiting and likely to cause confusion.<br />
<br />Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-22895388832256103902016-01-29T15:48:00.000+00:002016-01-31T23:43:26.889+00:00Things to do in Iceland ...<h2>
... when you are not in a conference center</h2>
<br />
It is clear that many attendees at the IIPC GA and web archiving conference in Reykjavík next April (<a href="http://netpreserve.org/2016-general-assembly-web-archiving-conference-reykjav%C3%ADk-overview">details</a>) plan to extend their stay. Several have contacted me for advice on what not to miss. So, I figured I'd write something up for all to see.<br />
<br />
The following is far from exhaustive or authoritative. It largely reflects my personal taste and may have glaring omissions. It probably reflects the gaps in my memory as well!<br />
<br />
<h3>
Reykjavík</h3>
Downtown Reykjavík has many interesting sights, museums and attractions. Not to mention shops, bars and restaurants. Of particular note is <a href="https://en.wikipedia.org/wiki/Hallgr%C3%ADmskirkja">Hallgrímskirkja</a> which rises up over the heart of the city, and right outside it is the <a href="http://www.tripadvisor.com/Attraction_Review-g189970-d8616236-Reviews-The_Statue_of_Leif_Eiriksson-Reykjavik_Capital_Region.html">statue of Leif Eriksson</a>. <a href="https://en.wikipedia.org/wiki/Perlan">The Pearl</a> dominates the skyline a bit further south; there are stunning views to be had from its observation deck. In the heart of the city you'll find <a href="https://en.wikipedia.org/wiki/Al%C3%BEingish%C3%BAsi%C3%B0">Alþingishúsið</a> (Parliament building), <a href="http://www.tripadvisor.com/Attraction_Review-g189970-d3460294-Reviews-Reykjavik_City_Hall-Reykjavik_Capital_Region.html">city hall</a>, <a href="http://www.visitreykjavik.is/reykjavik-old-harbour">the old harbor</a>, <a href="https://en.wikipedia.org/wiki/Harpa_(concert_hall)">Harpa</a> (concert hall) and many other notable buildings and sights.<br />
<br />
A bit further afield you'll find <a href="https://en.wikipedia.org/wiki/H%C3%B6f%C3%B0i">Höfði</a> and the <a href="https://en.wikipedia.org/wiki/The_Sun_Voyager">Sólfar sculpture</a>. Sólfarið is, without question, my favorite public sculpture, anywhere.<br />
<br />
There are a <a href="http://borgarsogusafn.is/en/">large</a> <a href="http://www.listasafn.is/english">number</a> of <a href="http://www.visitreykjavik.is/natural-history-museum">museums </a>in Reykjavík. Ranging from the <a href="http://www.thjodminjasafn.is/english">what-you'd-expect</a> to the <a href="https://en.wikipedia.org/wiki/Icelandic_Phallological_Museum">downright-weird</a>. <br />
<br />
Thanks to abundant geothermal energy, you'll find heated, open air, public swimming pools open year round. I highly recommend a visit to one of them. Reykjavík's most prominent public pool is <a href="http://www.visitreykjavik.is/laugardalslaug">Laugardalslaug</a>.<br />
<br />
Lastly, the most famous eatery in Iceland is in the heart of the city, and worth a visit, <a href="https://en.wikipedia.org/wiki/B%C3%A6jarins_Beztu_Pylsur">Bæjarins Beztu</a>.<br />
<br />
<h3>
Out of Reykjavík</h3>
Just south of Reykjavík (15 minutes), in my hometown of Hafnarfjörður, you'll find a <a href="http://fjorukrain.is/">Viking Village</a>!<br />
<br />
A bit further there is the <a href="http://www.bluelagoon.is/">Blue Lagoon</a>. I highly recommend everyone try it at least once. Do note that you may need to book in advance!<br />
<br />
And just a bit further still, you'll find the <a href="http://www.visitreykjanes.is/en/moya/toy/index/place/bridge-between-continents">Bridge Between Continents</a>! Okay, so the last one is a bit overwrought, but still a fun visit if you're at all into plate tectonics. <br />
<br />
<a href="http://elding.is/">Whale watching</a> tours are operated out of Reykjavík, even in April. You'll want some warm clothes if you go on one of those!<br />
<br />
Probably the most popular (for good reasons) day tour out of Reykjavík is the <a href="https://en.wikipedia.org/wiki/Golden_Circle_(Iceland)">Golden Circle</a>. It covers <a href="https://en.wikipedia.org/wiki/Geysir">Geysir</a>, <a href="https://en.wikipedia.org/wiki/Gullfoss">Gullfoss</a> (Iceland's largest waterfall) and <a href="http://whc.unesco.org/en/list/1152">Þingvellir</a> (the site of the original Icelandic parliament, formed in 930; it is a UNESCO World Heritage Site). You can do the circle in a rented car and there are also <a href="https://www.re.is/day-tours/the-golden-circle">numerous</a> <a href="http://grayline.is/tours/reykjavik/golden-circle-classic-tour-8706_9">tour</a> <a href="http://www.icelandexcursions.is/tour/day_tours/AH11_golden_circle_afternoon_tour/Iceland.is">operators</a> offering this trip and variations on it.<br />
<br />
Driving along the southern coast on highway 1, you'll also come across many interesting towns and sights. It is possible to drive as far as <a href="https://en.wikipedia.org/wiki/J%C3%B6kuls%C3%A1rl%C3%B3n">Jökulsárlón</a> and back in a single day. Stopping at places like <a href="https://en.wikipedia.org/wiki/Selfoss_(town)">Selfoss</a>, <a href="https://en.wikipedia.org/wiki/V%C3%ADk_%C3%AD_M%C3%BDrdal">Vík í Mýrdal</a>, <a href="https://en.wikipedia.org/wiki/Seljalandsfoss">Seljalandsfoss</a>, <a href="https://en.wikipedia.org/wiki/Sk%C3%B3gafoss">Skógafoss</a> and <a href="https://en.wikipedia.org/wiki/Skaftafell">Skaftafell National Park</a>. It may be better to plan an overnight stop, however. This is an ideal route for a modest excursion as there are so many interesting sights right by the highway.<br />
<br />
<br />
<h3>
Further afield</h3>
For longer trips outside of the capital there are too many options to count. You could go to <a href="https://en.wikipedia.org/wiki/Vestmannaeyjar">Vestmannaeyjar</a>, just off the south coast, or to <a href="https://en.wikipedia.org/wiki/Akureyri">Akureyri</a>, the capital of the north. A popular trip is to take highway 1 around the country (it loops around). Such a trip can be done in 3-4 days, but you'd need closer to a week to fully appreciate all the sights along the way.<br />
<br />
<br />
Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com2tag:blogger.com,1999:blog-5951939749830079328.post-9418348391516064992016-01-28T15:46:00.000+00:002016-01-28T15:55:20.384+00:00To ZIP or not to ZIP, that is the (web archiving) questionDo you use uncompressed (W)ARC files?<br />
<br />
It is hard to imagine why you would want to store the material uncompressed. After all, web archives are <b>big</b>. Compression saves space and space is <i>money</i>.<br />
<br />
While this seems straightforward it is worth examining some of the assumptions made here and considering what trade-offs we may be making.<br />
<br />
Let's start by considering that a lot of the files on the Internet are already compressed. Images, audio and video files, as well as almost every file format for "large data", are compressed. Sometimes simply by wrapping everything in a ZIP container (e.g. <a href="https://en.wikipedia.org/wiki/EPUB">EPUB</a>). There is very little additional benefit gained from compressing these files again (it may even <i>increase</i> the size very slightly).<br />
<br />
For some, highly specific crawls, it is possible that compression will accomplish very little.<br />
<br />
But it is also true that compression <i>costs very little</i>. We'll get back to that point in a bit.<br />
<br />
For most general crawls, the amount of HTML, CSS, JavaScript and various other highly compressible material will make up a substantial portion of the overall data. Those files may be smaller, but there are <i>a lot</i> more of them. Especially automatically generated HTML pages and other crawler traps that are impossible to avoid entirely.<br />
<br />
In our domain crawls, HTML documents alone typically make up around a quarter of the total data downloaded. Given that we then deduplicate images, videos and other, largely static, file formats, HTML files' share in the overall data needing to be stored is even greater. Typically approaching half!<br />
<br />
Given that these text files compress heavily (by 70-80% usually), tremendous storage savings can be realized using compression. In practice, our domain crawls' compressed size is usually about 60% of the uncompressed size (after deduplication).<br />
<br />
More frequently run crawls (with higher levels of deduplication) will benefit even more. Our weekly crawls' compressed size is usually closer to 35-40% of the uncompressed volume (after deduplication discards about three quarters of the crawled data).<br />
<br />
So you can save anywhere from ten to sixty percent of the storage needed, depending on the types of crawling you do. But at what cost?<br />
<br />
On the crawler side the limiting factor is usually disk or network access. Memory is also sometimes a bottleneck. CPU cycles are rarely an issue. Thus the additional overhead of compressing a file, as it is written to disk, is trivial.<br />
<br />
On the access side, you also largely find that CPU isn't a limiting factor. The bottleneck is disk access. And here compression can actually help! It probably doesn't make much difference when only serving up one small slice of a WARC, but when processing entire WARCs it will take less time to lift them off the slow HDD if the file is smaller. The additional overhead of decompression is insignificant in this scenario except in highly specific circumstances where CPU is very limited (but why would you process entire WARCs in such an environment?). <br />
<br />
So, you save space (and money!) and performance is barely affected. It seems like there is no good reason to not compress your (W)ARCs.<br />
<br />
But, there may just be one: <a href="https://tools.ietf.org/html/rfc7233">HTTP Range Requests</a>. <br />
<br />
To handle an HTTP Range Request, a replay tool using compressed (W)ARCs will have to access a WARC record and then decompress the <i>entire </i>payload (or at least from the start and as far as needed). If uncompressed, the replay tool could simply locate the start of the record and then skip the required number of bytes. <br />
<br />
This only affects large files and is probably most evident when replaying video files. Users may wish to skip ahead etc. and that is implemented via range requests. Imagine the benefit when skipping to the last few minutes of a movie that is 10 GB on disk!<br />
<br />
Thus, it seems to me that a hybrid solution may be the best course of action. Compress everything except files whose content type indicates an already compressed format. Configure it to compress when in doubt. It may also be best to compress files under a certain size threshold regardless of content type, since headers make up a large share of small records and compress well. That would need to be evaluated.<br />
<br />
Unfortunately, you can't mix compressed and uncompressed records within a single GZ file such as a (W)ARC. But it is fairly simple to configure the crawler to use separate output files for these content types. Most crawls generate more than one (W)ARC anyway.<br />
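<br />
The routing decision itself is simple enough. A sketch of the sort of check such a hybrid writer might make (the content type list is illustrative and far from complete):<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">// Content types we expect to be compressed already; everything else, including
// anything unknown, goes to the compressed WARCs.
private static final String[] ALREADY_COMPRESSED = {
        "image/jpeg", "image/png", "video/", "audio/",
        "application/zip", "application/x-gzip", "application/epub+zip" };

static boolean shouldCompress(String contentType) {
    if (contentType == null) {
        return true; // when in doubt, compress
    }
    for (String prefix : ALREADY_COMPRESSED) {
        if (contentType.startsWith(prefix)) {
            return false;
        }
    }
    return true;
}</div>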
<br />
Not only would this resolve the HTTP Range Request issue, it would also avoid a lot of pointless compression/uncompression work being done.<br />
<br />Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-41000587885529899532015-11-12T10:59:00.000+00:002015-11-12T11:00:36.978+00:00Workshop on Missing Warc FeaturesYesterday I tweeted:<br />
<blockquote class="twitter-tweet" lang="en" style="border: 1 solid;">
<div dir="ltr" lang="en">
Considering a session on crawl artifacts that don't yet fit in WARCs for <a href="https://twitter.com/hashtag/iipcGA16?src=hash">#iipcGA16</a>. If you're interested in helping organize, let me know.</div>
— Kristinn Sigurðsson (@kristsi) <a href="https://twitter.com/kristsi/status/664475478862991361">November 11, 2015</a></blockquote>
<blockquote class="twitter-tweet" lang="en">
<div dir="ltr" lang="en">
Possible topics include screenshots, SSL certs, HTTP 2.X and AJAX interactions. <a href="https://twitter.com/hashtag/iipcGA16?src=hash">#iipcGA16</a></div>
— Kristinn Sigurðsson (@kristsi) <a href="https://twitter.com/kristsi/status/664475856102891521">November 11, 2015</a></blockquote>
I thought I'd expand a bit on this without a 140 character limit.<br />
<br />
The idea for this session came to me while reviewing the various issues put forward for the WARC 1.1 review. Several of the issues/proposals were clearly important as they addressed real needs, but at the same time they were nowhere near ready for standardization.<br />
<br />
It is important that we mature a solution before we enshrine it in the standard. This may include exploring multiple avenues to resolve the matter. At minimum it requires implementing it in the relevant tools as proof of concept.<br />
<br />
The issues in question include:
<br />
<ul>
<li><b>Screenshots</b> – Crawling using browsers is becoming more common. A very useful by-product of this is a screenshot of the webpage as rendered by the browser. It is, however, not clear how best to store this in a WARC. Especially not in terms of making it easily discoverable to a user of a Wayback-like replay tool. This may also affect other types of related resources and metadata.<br />See further on WARC specification issue tracker: <a href="https://github.com/iipc/warc-specifications/issues/13">#13</a> <a href="https://github.com/iipc/warc-specifications/issues/27">#27</a></li>
<li><b>HTTP 2</b> – The new HTTP 2 standard uses binary headers. This seems to break one of the expectations of WARC response records containing HTTP responses.<br />See further on WARC specification issue tracker: <a href="https://github.com/iipc/warc-specifications/issues/15">#15</a></li>
<li><b>SSL certificates</b> – Store the SSL certificates used during an HTTPS session. <a href="https://github.com/iipc/warc-specifications/issues/12">#12</a></li>
<li><b>AJAX Interactions</b> – <a href="https://github.com/iipc/warc-specifications/issues/14">#14</a> </li>
</ul>
The above list is unlikely to be exhaustive. It merely enumerates the issues that I'm currently aware of.<br />
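<br />
To give a feel for why the screenshot case is unsettled, here is one conceivable shape for such a record, using a plain resource record. Every value is illustrative, and the target URI scheme is entirely made up; settling on something like it (and on the accompanying metadata) is precisely the open question:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">WARC/1.0
WARC-Type: resource
WARC-Target-URI: urn:screenshot:http://example.org/
WARC-Date: 2015-11-11T12:00:00Z
WARC-Record-ID: &lt;urn:uuid:...&gt;
WARC-Concurrent-To: &lt;urn:uuid:...&gt;
Content-Type: image/png
Content-Length: 48211</div>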
<br />
I'd like to organize a workshop during the 2016 IIPC GA to discuss these issues. For that to become a reality I'm looking for people willing to present on one or more of these topics (or a related one that I missed). It will probably be aimed at the open days so attendance is not strictly limited to IIPC members.<br />
<br />
The idea is that we'd have a ~10-20 minute presentation where a particular issue's problems and needs were defined and a solution proposed. It doesn't matter whether the solution has been implemented in code. Following each presentation there would then be a discussion on the merits of the proposed solution and possible alternatives.<br />
<br />
A minimum of three presentations would be needed to "fill up" the workshop. We should be able to fit in four.<br />
<br />
So, that's what I'm looking for, <b>volunteers to present one or more of these issues.</b> Or, a related issue I've missed. <br />
<br />
To be clear, while the idea for this workshop comes out of the WARC 1.1 work it is entirely separate from that review. By the time of the GA the WARC 1.1 revision will be all but settled. Consider this, instead, as a first possible step on the WARC 1.2 revision.Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com4tag:blogger.com,1999:blog-5951939749830079328.post-65578367954684446492015-09-15T13:14:00.000+00:002015-09-15T13:14:44.505+00:00Looking For Stability In Heritrix ReleasesWhich version of Heritrix do you use?<br />
<br />
If the answer is version <i><span style="font-family: inherit;"><b>3.3.0-LBS-2015-01</b></span></i> then you probably already know where I'm going with this post and may want to skip to the proposed solution. <span style="font-family: inherit;"><b> </b></span><br />
<br />
<span style="font-family: inherit;"><b>3.3.0-LBS-2015-01</b></span> is a version of Heritrix that I "made" and currently use because there isn't a "proper" <b>3.3.0</b> release. I know of a couple of other institutions that have taken advantage of it. <br />
<br />
<h3>
The Problem (My Experience)</h3>
<br />
The last proper release of Heritrix (i.e. non-SNAPSHOT release that got pushed to a public Maven repo, even if just the Internet Archive one) that I could use was <b><span style="font-family: inherit;">3.1.0-RC1</span></b>. There were regression bugs in both <b>3.1.0 </b>and <b>3.2.0</b> that kept me from using them.<br />
<br />
After <b>3.2.0</b> came out the main bug keeping me from upgrading was fixed. Then a big change to how revisit records were created was <a href="https://github.com/internetarchive/heritrix3/pull/79">merged in</a> and it was definitely time for me to stop using a 4 year old version. Unfortunately, stable releases had now mostly gone away. Even when a release is made (as I discovered with <b>3.1.0</b> and <b>3.2.0</b>) it may only be "stable" for those making the release.<br />
<br />
So, I started working with the "unstable" <i>SNAPSHOT</i> builds of the unreleased <b>3.3.0</b> version. This, however, presented some issues. I bundle Heritrix with a few customizations and crawl job profiles. This is done via a Maven build process. Without a stable release, I'd run the risk that a change to Heritrix would cause my internal build to create something that no longer works. It also makes it impossible to release stable builds of tools that rely on new features in Heritrix <b>3.3.0</b>. Thus no stable releases for the <a href="https://github.com/Landsbokasafn/deduplicator">DeDuplicator</a> or <a href="https://github.com/Landsbokasafn/crawlrss">CrawlRSS</a>. Both are way overdue.<br />
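<br />
To illustrate the problem, my internal build depends on Heritrix with something like the following (coordinates from memory, so treat them as illustrative). With no stable 3.3.0 release, the only option is a moving SNAPSHOT target:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">&lt;dependency&gt;
  &lt;groupId&gt;org.archive.heritrix&lt;/groupId&gt;
  &lt;artifactId&gt;heritrix-engine&lt;/artifactId&gt;
  &lt;!-- a SNAPSHOT version may break the build whenever upstream changes --&gt;
  &lt;version&gt;3.3.0-SNAPSHOT&lt;/version&gt;
&lt;/dependency&gt;</div>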
<br />
Late last year, after getting a <a href="https://webarchive.jira.com/browse/HER-2070">very nasty bug</a> fixed in Heritrix, I spent a good while testing it and making sure no further bugs interfered with my jobs. I discovered a few minor flaws and wound up creating a fork that contained fixes for these flaws. Realizing I now had something that was as close to a "stable" build as I was likely to see, I dubbed it Heritrix version <a href="https://github.com/Landsbokasafn/heritrix3/tree/LBS-2014-03"><b>3.3.0-LBS-2014-03</b></a> (LBS is the Icelandic abbreviation of the library's name and 2014-03 is the domain crawl it was made for).<br />
<br />
The fork is still available on <a href="https://github.com/Landsbokasafn/heritrix3/tree/LBS-2014-03">GitHub</a>. More importantly, this version was built and deployed to our in-house Maven repo. It doesn't solve the issue for the open tools we have, but for internal projects we now had a proper release to build against.<br />
<br />
You can see here <a href="https://github.com/internetarchive/heritrix3/compare/3.2.0...Landsbokasafn:LBS-2014-03">all the commits that separate <b>3.2.0</b> and <b>3.3.0-LBS-2014-03</b></a> (there are a lot!).<br />
<br />
Which brings us to <b>3.3.0-LBS-2015-01</b>. When getting ready for the first crawl of this year I realized that the issues I'd had were now resolved, plus a few more things had been fixed (<a href="https://github.com/Landsbokasafn/heritrix3/compare/LBS-2014-03...Landsbokasafn:LBS-2015-01">full list of commits</a>). So, I created a <a href="https://github.com/Landsbokasafn/heritrix3/tree/LBS-2015-01">new fork</a> and, again, put it through some testing. When it came up clean I released it internally as <b>3.3.0-LBS-2015-01</b>. It's now used for all crawling at the library.<br />
<br />
This sorta works for me. But it isn't really a good model for a widely used piece of software. The unreleased <b>3.3.0</b> version contains significant fixes and improvements. Getting people stuck on <b>3.2.0</b> or forcing them to use a non-stable release isn't good. And, while anyone <i>may</i> use my build, doing so requires a bit of know-how and there still isn't any promise of it being stable in general just because it is stable for me. This was clearly illustrated with the <b>3.1.0</b> and <b>3.2.0</b> releases which were stable for IA, but not for me.<br />
<br />
Stable releases require some quality assurance.<br />
<br />
<h3>
Proposed Solution</h3>
<br />
What I'd really like to see is an initiative of multiple Heritrix users (be they individuals or institutions). These would come together, once or twice a year, create a release candidate and test it based on each user's particular needs. This would mostly entail running each party's usual crawls and looking for anything abnormal.<br />
<br />
Serious regressions would either lead to fixes, rollback of features or (in dire cases) cancelling the release. Once everyone signs off, a new release is minted and pushed to a public Maven repo.<br />
<br />
The focus here is primarily on <i>testing</i>. While there might be a bit of development work to fix a bug that is discovered, the emphasis is on <b>vetting that the proposed release does not contain any notable regressions</b>.<br />
<br />
By having multiple parties, each running the candidate build through their own workflow, the odds are greatly improved that we'll catch any serious issues. Of course, this could be done by a dedicated QA team. But the odds of putting that together are small, so we must make do. <br />
<br />
I'd love if the Internet Archive (IA) was party to this or even took over leading it. But, they aren't essential. It is perfectly feasible to alter the "group ID" and release a version under another "flag", as it were, if IA proves uninterested.<br />
<br />
Again, to be clear, this is not an attempt to set up a development effort around Heritrix, like the IIPC did for OpenWayback. This is just focused on getting regular stable builds released based on the latest code. Period. <br />
<br />
<br />
<h3>
Sign up </h3>
<br />
If the above sounds good and you'd like to participate, by all means get in touch. In comments below, on <a href="https://twitter.com/kristsi">Twitter</a> or <a href="mailto:kristinn@landsbokasafn.is">e-mail</a>.<br />
<br />
At minimum you must be willing to do the following once or twice a year:<br />
<ul>
<li>Download a specific build of Heritrix</li>
<li>Run crawls with said build that match your production crawls</li>
<li>Evaluate those crawls, looking for abnormalities and errors compared to your usual crawls</li>
<ul>
<li>A fair amount of experience with <i>running</i> Heritrix is clearly needed. </li>
</ul>
<li>Report your results</li>
<ul>
<li>Ideally, in a manner that allows issues you uncover to be reproduced</li>
</ul>
</ul>
Doing all of this during a coordinated time frame, probably spanning about two weeks.<br />
<br />
Better still if you are willing to look into the causes of any problems you discover.<br />
<br />
Help with admin tasks, such as pushing releases etc. would also be welcome.<br />
<br />
At the moment, this is nothing more than an idea and a blog post. Your responses will determine if it ever amounts to anything more. Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com7tag:blogger.com,1999:blog-5951939749830079328.post-12741661260521029312015-09-04T13:26:00.000+00:002015-09-04T13:27:09.506+00:00How big is your web archive?<b>How big is your web archive?</b> I don't want you to actually answer that question. What I'm interested in is, when you read that question, what <i>metric</i> jumped into your head?<br />
<br />
I remember long discussions on web archive metrics when I was first starting in this field, over 10 years ago. Most vividly I remember the discussion on what constitutes a "document".<br />
<br />
Ultimately, nothing much came of any of those discussions. We are mostly left talking about the number of URL "captures" and bytes stored (3.17 billion and 64 TiB, by the way). Yet these don't really convey all that much, at least not without additional context.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 10px; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://image.slidesharecdn.com/measureallthewebarchivingthings-150825220407-lva1-app6892/95/measure-all-the-web-archiving-things-12-638.jpg?cb=1440540459" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="240" src="https://image.slidesharecdn.com/measureallthewebarchivingthings-150825220407-lva1-app6892/95/measure-all-the-web-archiving-things-12-638.jpg?cb=1440540459" title="what to measure" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div class="notranslate slideshow-title-text" itemprop="headline">
<span class="j-title-breadcrumb"><a href="http://www.slideshare.net/nullhandle/measure-all-the-web-archiving-things">Measure All the (Web Archiving) Things!</a><br />- Nicholas Taylor (Stanford University Libraries)</span></div>
</td></tr>
</tbody></table>
Another approach is to talk about websites (or seeds/hosts) captured. That still leaves you with a complicated dataset. How many crawls? How many URLs per site? Etc.<br />
<br />
<a href="https://twitter.com/nullhandle">Nicholas Taylor</a> (of Stanford University Libraries) recently shared some <a href="http://www.slideshare.net/nullhandle/measure-all-the-web-archiving-things">slides</a> on this subject, that I found quite interesting and revived this topic in my mind. <br />
<br />
It can be a very frustrating exercise to communicate the <i>qualities</i> of your web archive. If Internet Archive has 434 billion web pages and I only have 3.17 billion does that make IA's collection 137 times <i>better</i>?<br />
<br />
I imagine anyone who's reading this knows that that isn't how things work. If you are after world wide coverage, IA's collection is infinitely better than mine. Conversely, for Icelandic material, the Icelandic web archive is vastly deeper than IA's.<br />
<br />
We are, therefore, left writing lengthy reports. Detailing the number of seeds/sites, URLs crawled, frequency of crawling etc. Not only does this make it hard to explain to others how good (or not good, as the case may be) our web archive is. It also makes it very difficult to judge it against other archives.<br />
<br />
To put it bluntly, it leaves us guessing at just how "good" our archive is.<br />
<br />
To be fair, we are not guessing blindly. Experience (either first hand or learned from others) provides useful yardsticks and rules of thumb. If we have resources for some data mining and quality assurance, those guesses get much better.<br />
<br />
But it sure would be nice to have some handy metrics. I fear, however, that this isn't to be.Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-30455163724969895262015-08-24T17:16:00.003+00:002015-08-24T17:16:38.999+00:00Writing WARCsWe started doing deduplication four years before we started using WARC. As the ARC format had no <i>revisit</i> concept, the only record of the deduplicated items from that era lies in the crawl logs.<br />
<br />
When we put our collection online back in 2009 we built our own indexer that consumed these crawl logs so we could include these items. It worked very well at the time.<br />
<br />
As our collection grew, we eventually hit scaling issues with the custom indexer. Ultimately, it was abandoned when we upgraded to Wayback 1.8 two years ago. We moved to use a CDX index instead. The only casualty was the early deduplication/revisit records from our ARC days which were no longer included.<br />
<br />
So, for the last two years, I've had a task on my todo list. Create a program that consumes all of those crawl logs and spits out WARC files containing the revisit records that are missing from the ARC files.<br />
<br />
For a while I toyed with the idea of incorporating this in a wider ARC to WARC migration effort. But it's become clear that such a migration just isn't worthwhile. At least not yet.<br />
<br />
Recently I finally made some time to address this. I figured it shouldn't take much time to implement. Basically, the program needs to do two things:<br />
<br />
<ol>
<li>Ingest and parse a crawl.log file from Heritrix, making it possible to iterate over its lines and access specific fields. As the <a href="https://github.com/Landsbokasafn/deduplicator">DeDuplicator</a> already does this, it was a minor task to adapt the code.</li>
<li>For each line that represented a deduplicated URI, write a revisit record in a WARC file.</li>
</ol>
<br />
Boom, done. Can knock that out in an hour, easy.<br />
<br />
Or, I should be able to.<br />
<br />
It turns out that writing WARCs is painfully tedious. The libraries require you to understand the structure of a WARC file very well and do nothing to guide you. There is also no useful documentation on this subject. Your best bet is to find code examples, but even those can be hard to find and may not directly address your needs.<br />
<br />
I tried both the <a href="https://github.com/iipc/webarchive-commons">webarchive-commons</a> library and <a href="https://sbforge.org/display/JWAT/JWAT">JWAT</a>. Both were tedious, but JWAT was less so. Both require you to understand exactly what header fields need to be written for each record type. To know that you need to write a <i>warcinfo</i> record first and so on. At least JWAT made it fairly simple to configure the header fields.<br />
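<br />
For illustration, the minimal revisit record my program has to emit looks roughly like this, as I read the WARC spec and the IIPC recommendation on deduplication (the ellipses are placeholders, and fields like the profile URI are easy to get subtly wrong):<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://example.org/image.png
WARC-Date: 2009-03-04T12:00:00Z
WARC-Record-ID: &lt;urn:uuid:...&gt;
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:...
WARC-Refers-To-Target-URI: http://example.org/image.png
WARC-Refers-To-Date: 2008-11-01T12:00:00Z
Content-Length: 0</div>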
<br />
Consulting both existing WARC files and the <a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/">WARC spec</a>, I was able to put all the pieces together in about half a day using JWAT.<br />
<br />
And that's when I realized that JWAT does not track the number of bytes written out. That means I can't split the writing up to make "even sized" WARCs like Heritrix does (usually 1 GiB/file).<br />
<br />
Darn, I need to rewrite this, after all, using webarchive-commons.<br />
<br />
**Sigh**<br />
<br />
It seems to me that such an important format should have better library support. A better library would save a lot of time and effort.<br />
<br />
By saving effort, we may also wind up saving the format itself. The danger that someone will create a widely used program that writes <i>invalid</i> WARCs is very real. If such invalid WARCs proliferate that can greatly undermine the usefulness of the format.<br />
<br />
It is important to make it easy to do the right thing. Right now, even someone, like myself, who is very familiar with WARC needs to double and triple check every step of a very simple WARC writing program. Never mind if you need to do something a little advanced.<br />
<br />
Someone with less knowledge and, perhaps, less motivation to "get it right" could all too easily write something "good enough".<br />
<br />
<br />
It is important to <i>demystify</i> the WARC format. Good library support would be an ideal start.Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com2tag:blogger.com,1999:blog-5951939749830079328.post-32508962728230129812015-08-17T16:42:00.000+00:002015-08-17T16:42:30.408+00:00The WARC Format 1.1<a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/">The WARC Format 1.0</a> is an ISO standard describing the container format that web archives use to store their data. WARCs contain not only the actual file resources (HTML, images, JavaScript etc.) but also request and response headers, metadata about the resources and the overall collection, deduplication records and conversion records.<br />
<br />
It's a pretty flexible format. It has served us quite well, but it is not perfect.<br />
<br />
While it is an ISO standard, most of it was written by IIPC members. Indeed, it is heavily influenced by the <a href="https://archive.org/web/researcher/ArcFileFormat.php">ARC format</a> developed by The Internet Archive. So, now that the WARC format is being revisited, it is only natural that the IIPC community, again, writes the first draft.<br />
<br />
At the IIPC GA this year, in Stanford, there was a workshop where the pain points of the current specification were brought to light. There was a lot of energy in the room and people were excited. But, as everyone got back home a lot of that energy went away.<br />
<br />
It is a lot easier to talk about change than to make it happen. Making things more difficult, few of us know much about the standards process. It all felt very inscrutable.<br />
<br />
To help with the procedural aspect we came up with an approach that involves using the tools we are familiar with (software development). Consequently, we (and by "we" I mean Andy Jackson of the British Library) set up a <a href="https://github.com/iipc/warc-specifications/">GitHub project around the WARC specification</a>.<br />
<br />
Any problems with the existing specification could be raised there as "<a href="https://github.com/iipc/warc-specifications/issues">issues</a>" (you'll find all the ones discussed in Stanford on there!). The existing spec could be included as markdown formatted text and any proposed changes could be submitted as "<a href="https://github.com/iipc/warc-specifications/pulls">pull requests</a>" acting on the text of the existing spec.<br />
<br />
Currently there are two pull requests, each representing a proposed set of changes to address one specific shortcoming of the existing spec.<br />
<br />
One of the <a href="https://github.com/iipc/warc-specifications/pull/20">pull requests comes from yours truly</a>. It addresses the concerns of "uri agnostic revisit records". This was previously dealt with via an <a href="https://iipc.github.io/warc-specifications/specifications/warc-deduplication/recording-arbitrary-duplicates-1.0/">advisory on the subject adopted by the IIPC</a>. This allows us to promote what has been a de facto standard into the actual standard.<br />
<br />
The other pull request centers on <a href="https://github.com/iipc/warc-specifications/pull/21">improving the resolution of timestamps in WARC headers</a>.<br />
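<br />
In practical terms, the second proposal would allow a record header like the second line below, where the current spec only allows the first (values illustrative):<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">WARC-Date: 2015-08-17T16:42:30Z        (WARC 1.0, one second resolution)
WARC-Date: 2015-08-17T16:42:30.408Z    (proposed, optional sub-second resolution)</div>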
<br />
Neither pull request has been merged, meaning that both are up for comment and may change or be rejected altogether. There are also many issues that still need to be addressed.<br />
<br />
I would like to encourage all interested parties (IIPC members and non-members alike) to take advantage of the GitHub venue if the WARC format is important to you. You can do this by opening issues if you have a problem with the spec that hasn't been brought up. You can do this by commenting on existing issues and pull requests, suggesting solutions or pointing out pitfalls.<br />
<br />
And you can do this by suggesting actual edits to the standard via pull requests (let us know if you need help with the technical bits).<br />
<br />
Ultimately, the draft thus generated will be passed on to the relevant ISO group for review and approval. This will happen (as I understand it) next year.<br />
<br />
So grab the opportunity while it presents itself and have your say on The WARC Format 1.1.Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com1tag:blogger.com,1999:blog-5951939749830079328.post-47755692147410066452015-07-13T17:16:00.003+00:002015-07-13T17:17:10.802+00:00Customizing Heritrix ReportsIt is a well known fact (at least among its users) that Heritrix comes with ten default reports. These can be generated and viewed at crawl time and will be automatically generated when the crawl is terminated. Most of these reports have been part of Heritrix from the very beginning and haven't changed all that much.<br />
<br />
It is less well known that these reports are part of Heritrix's modular configuration structure. They can be replaced, configured (in theory) and additional reports added.<br />
<br />
The ten built-in reports do not offer any additional configuration options. (Although, that may change if a <a href="https://github.com/internetarchive/heritrix3/pull/123">pull request</a> I made is merged.) But the most useful aspect is the ability to add reports tailored to your specific needs. Both the <a href="https://landsbokasafn.github.io/DeDuplicator/">DeDuplicator</a> and the <a href="https://landsbokasafn.github.io/CrawlRSS/">CrawlRSS</a> Heritrix add-on modules use this to surface their own reports in the UI and ensure those are written to disk at the end of a crawl.<br />
<br />
In order to configure reports, it is necessary to edit the <i>statisticsTracker</i> bean in your <a href="https://github.com/internetarchive/heritrix3/blob/3.2.0/engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml#L627">CXML crawl configuration</a>. That bean has a list property called <i>reports</i>. Each element in that list is a Java class that extends the abstract <a href="https://github.com/internetarchive/heritrix3/blob/3.2.0/engine/src/main/java/org/archive/crawler/reporting/Report.java"><i>Report</i></a> class in Heritrix.<br />
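<br />
A trimmed sketch of what that section of the configuration looks like with a custom report added. The default report class names are from memory, and com.example.DiskUsageReport stands in for your own class:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">&lt;bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker"&gt;
  &lt;property name="reports"&gt;
    &lt;list&gt;
      &lt;bean class="org.archive.crawler.reporting.CrawlSummaryReport" /&gt;
      &lt;bean class="org.archive.crawler.reporting.SeedsReport" /&gt;
      &lt;!-- ... the rest of the ten default reports ... --&gt;
      &lt;bean class="com.example.DiskUsageReport" /&gt;
    &lt;/list&gt;
  &lt;/property&gt;
&lt;/bean&gt;</div>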
<br />
That is really all there is to it. Those beans can have their own properties (although none do--yet!) and behave just like any other simple bean. To add your own, just write a class, extend <i>Report</i> and wire it in. Done.<br />
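<br />
A minimal custom report might look like the following. The method signatures are as I recall them from the 3.x code, so double check against the Report class you are extending:<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">import java.io.PrintWriter;
import org.archive.crawler.reporting.Report;
import org.archive.crawler.reporting.StatisticsTracker;

// Hypothetical report; wire an instance into the reports list as shown above.
public class DiskUsageReport extends Report {
    @Override
    public void write(PrintWriter writer, StatisticsTracker stats) {
        writer.println("Usable space on crawl volume: "
                + new java.io.File(".").getUsableSpace() + " bytes");
    }

    @Override
    public String getFilename() {
        return "disk-usage-report.txt";
    }
}</div>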
<br />
One caveat. In the default profile, this reports section is all commented-out. When the <i>reports</i> property is left empty, the <i>StatisticsTracker</i> loads up the default reports. Once you uncomment that list, the list overrides any 'default list' of reports. This means that if future versions of Heritrix change what reports are 'default', you'll need to update your configuration or miss out.<br />
<br />
Of course, you may want to '<i>miss out</i>', depending on what the change is.<br />
<br />
My main annoyance is that the <i>reports</i> list requires subclasses of a class, rather than specifying an interface. This needs to change so that any interested class could implement the contract and become a report. As it is, if you have a class that already extends a class and has a reporting function, you need to create a special <i>report</i> class that does nothing but bridge the gap to what Heritrix needs. You can see this quite clearly in the DeDuplicator where there is a special <a href="https://github.com/Landsbokasafn/deduplicator/blob/7986b86e06a2ca7458eb68b1059f808d04fa7cb9/deduplicator-heritrix/src/main/java/is/landsbokasafn/deduplicator/heritrix/DeDuplicatorReport.java"><i>DeDuplicatorReport</i></a> class for exactly that purpose. A similar thing came up in CrawlRSS.<br />
<br />
I've found it to be very useful to be able to surface customized reports in this manner. In addition to the two use cases I've mentioned (which are publicly available), I also use it to draw up a report on disk usage (built into the module that monitors for out-of-disk-space conditions) and for a report on which regular expressions have been triggered during scoping (built into a variant of the <i>MatchesListRegexDecideRule</i>).<br />
<br />
I'd had both of those reports available for years, but they had always required using the scripting console to get at them. Having them just a click away and automatically written at the end of a crawl has been quite helpful.<br />
<br />
<br />
If you have Heritrix related reporting needs that are not being met, there is a comment box below.Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-22805563711705164292015-07-01T13:16:00.003+00:002015-07-02T15:20:58.047+00:00Leap second and web crawlingA <a href="https://en.wikipedia.org/wiki/Leap_second">leap second</a> was added last midnight. This is only the third time that has happened since I started doing web crawls, and the first time it happened while I had a crawl running. So, I decided to look into any possible effects or ramifications a leap second might have for web archiving.<br />
<br />
Spoiler alert: there isn't really one. At least not for Heritrix writing WARCs.<br />
<br />
Heritrix is typically run on a Unix type system (usually Linux). On those systems, the leap second is implemented by repeating the last second of the day. I.e. 23:59:59 comes around twice. This effectively means that the clock gets set back by a second when the leap second is inserted.<br />
<br />
Fortunately, Heritrix does not care if time gets set back. My <span style="font-family: "Courier New",Courier,monospace;">crawl.log</span> did show this event quite clearly as the following excerpt shows (we were crawling about 37 URLs/second at the time):<br />
<br />
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">2015-06-30T23:59:59.899Z 404 19155 http://sudurnes.dv.is/media/cache/29/74/29747ef7a4574312a4fc44d117148790.jpg LLLLLE http://sudurnes.dv.is/folk/2013/6/21/slagurinn-tok-mig/ text/html #011 2015063023595
2015-06-30T23:59:59.915Z 404 242 http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-ford-econoline-v8-351-w/textarea.bbp-the-content ELRLLLLLLLX http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-for
2015-06-30T23:59:59.936Z 200 3603 http://foldaskoli.is/myndasafn/index.php?album=2011_til_2012/47-%C3%9Alflj%C3%B3tsvatn&image=dsc01729-2372.jpg LLLLLLLLL http://foldaskoli.is/myndasafn/index.php?album=
2015-06-30T23:59:59.019Z 200 42854 http://baikal.123.is/themes/Nature/images/header.jpg PLLPEE http://baikal.123.is/ottSupportFiles/getThemeCss.aspx?id=19&g=6843&ver=2 image/jpeg #024 20150630235959985+-
2015-06-30T23:59:59.025Z 200 21520 http://bekka.bloggar.is/sida/37229/ LLLLL http://bekka.bloggar.is/ text/html #041 20150630235959986+13 sha1:C2ZF67KFGUDFVPV46CPR57J45YZRI77U http://bloggar.is/ - {"warc
2015-06-30T23:59:59.040Z 200 298365 http://www.birds.is/leikskoli/images/3072/img_2771__large_.jpg LLRLLLLLLLLLL http://www.birds.is/leikskoli/?pageid=3072 image/jpeg #005 20150630235956420+2603 sha1:F65B</div>
<br />
There may be some impact on tools parsing your logs if they expect the timestamps to, effectively, be in order. But I'm unaware of any tools that make that assumption.
<br /><br />
<h2>But, what about replay?</h2>
<br />
The current WARC spec calls for using timestamps with a resolution of one second. This means that all the URLs captured during the leap second will get the same value as those captured during the preceding second. No assumptions can be made about the order in which these URLs were captured, any more than you can about the order of URLs captured normally during a single second. It doesn't really change anything that this period of uncertainty now spans two seconds instead of one. The effective level of uncertainty remains about the same.
<br /><br />
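This is easy to demonstrate. Taking two of the captures from the log excerpt above, one from just before and one from during the repeated second:
<div style="font-family:monospace; white-space:pre; overflow: scroll; background-color:#EEEEEE">import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class LeapSecondDemo {
    public static void main(String[] args) {
        DateTimeFormatter warcDate = DateTimeFormatter
                .ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);
        // Both print 2015-06-30T23:59:59Z; at one second resolution the
        // captures are indistinguishable, leap second or not.
        System.out.println(warcDate.format(Instant.parse("2015-06-30T23:59:59.936Z")));
        System.out.println(warcDate.format(Instant.parse("2015-06-30T23:59:59.019Z")));
    }
}</div>
<br /><br />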
Sidenote. The order of the captured URLs in the WARC may be indicative of crawl order, but that is not something that can be relied on.
<br /><br />
There is actually a proposal for improving the resolution of WARC dates. You can review it on the <a href="https://github.com/iipc/warc-specifications/pull/6">WARC review GitHub issue tracker</a>. If adopted, a leap second event would mean that the WARCs actually contain incorrect information.
<br /><br />
The fix to that would be to ensure that the leap second is encoded as 23:59:60 as per the official UTC spec. But that seems unlikely to happen as it would require changes to Unix timekeeping or using non-system timekeeping in the crawler.
<br /><br />
Perhaps it is best to just leave the WARC date resolution at one second.
Kristinn Sigurðssonhttp://www.blogger.com/profile/03734283518658091605noreply@blogger.com0tag:blogger.com,1999:blog-5951939749830079328.post-86773775755947432522015-06-30T14:28:00.003+00:002015-06-30T14:29:23.547+00:00Web archiving APIs - a startEven though it didn't feature heavily on the official agenda, the topic of web archive APIs repeatedly came up during the last IIPC GA in Stanford. This interest is understandable, as the use of APIs enables interoperability amongst different tools. This, in theory, allows you to build (many) narrow-purpose tools and then have them work, seamlessly, together. No more monolithic behemoths that are a pain to maintain.<br />
<br />
In fact, we already have one web archive "API" in wide use: the WARC file format. While technically not an "<span class="_Tgc">application programming interface", it serves the same fundamental purpose: to enable interoperability. It has decoupled harvesters (e.g. Heritrix) from replay systems (e.g. OpenWayback) and both of those from analytical/data mining software etc.</span><br />
<br />
<span class="_Tgc">Overall I'd say the WARC format has been a success, albeit not without its flaws. More on that later.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">So, now there is interest in defining more "APIs". Indeed, there was a survey, recently, about the possibility of setting up a special "APIs Working Group" within the IIPC. </span><br />
<br />
<span class="_Tgc">What I find missing from this conversations is, what are these APIs? Which facets of web archiving are amicable to being formalized in this manner?</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">We do have a few informal "APIs" floating around. The CDX file format is one. The Memento protocol is another. Cleaning up and formalizing an existing 'defacto' API avoids the pitfall of creating an API that entirely fails to work in the "real world."</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">To understand this, lets go back the WARC format. It was an extension of the ARC file format developed by the Internet Archive. In drafting the WARC standard, the lessons of the ARC file format were used and most of the ARC file format's short comings were addressed. </span><br />
<br />
<span class="_Tgc">Predictably, where the WARC format has shown weakness is in areas that ARC did not address at all. For example in handling duplicate records. In those areas, wrong assumptions were made due to lack of experience. This in turn, for example, significantly delayed the widespread use of deduplication.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">The best APIs/protocols/standards emerge from hard won experience. I would argue very strongly against developing any API from the top down. </span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">With that in mind, there are two possible APIs I believe are ready to be made more formal.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">The first is the current "CDX Server protocol". The CDX Server is a component within OpenWayback that resolves URI+timestamp requests into stored resources (either a single one or a range depending on the nature of the query). Note, that a "CDX Server" need not use a CDX style index. That is merely how it is now. This is a protocol for separating the user interface of a replay tool (like OpenWayback) from its the index. Lets call it Web Archive Query Protocol, WAQP, for now.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">Pywb, another replay tool, uses almost the same protocol in its implementation. With a well defined WAQP, you might be able to use Pywb front end (display) with the OpenWayback back end (CDX server) or the other way around.</span><br />
<br />
<span class="_Tgc">With a well defined WAQP, the job of indexing WARCs would be become independent of the job of replaying web archives by rewriting URLs. This would make it much easier to develop new versions of either type of software.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">The other API is a bit more speculative. Web archiving proxies now exist and serve a very useful function. If a standard API was established for how these proxies interacted with the software driving the harvesting activity, it would be much easier to pair the two together. This API could possibly be built on top of existing proxy protocols.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">I don't mean to imply that this is the definitive final list of web archiving APIs to be developed. These are simply two areas I believe are ready to be formalized and where doing so is likely to be of benefit in the near term.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">So, how best to proceed? As noted earlier, the idea has been floated to establish a special APIs working group. On the other hand, there is already precedent for running API development through the existing working groups (WARC was developed within the harvesting working group).</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">My personal opinion is that once a specific API has been identified as a goal, a special working group (or task force or whatever name you'd like to give it) be formed around that one API. This working groups would than disband once the job is done. Such groups could be reformed if there is a need for a significant revision.</span><br />
<span class="_Tgc"><br /></span>
<span class="_Tgc">It is also important that any API formalization be done in close cooperation with at least some of the relevant tool maintainers. APIs that are not implemented are, after all, useless.</span><br />