September 15, 2015

Looking For Stability In Heritrix Releases

Which version of Heritrix do you use?

If the answer is version 3.3.0-LBS-2015-01 then you probably already know where I'm going with this post and may want to skip to the proposed solution.  

3.3.0-LBS-2015-01 is a version of Heritrix that I "made" and currently use because there isn't a "proper" 3.3.0 release. I know of a couple of other institutions that have taken advantage of it.

The Problem (My Experience)

The last proper release of Heritrix (i.e. non-SNAPSHOT release that got pushed to a public Maven repo, even if just the Internet Archive one) that I could use was 3.1.0-RC1. There were regression bugs in both 3.1.0 and 3.2.0 that kept me from using them.

After 3.2.0 came out, the main bug keeping me from upgrading was fixed. Then a big change to how revisit records were created was merged in, and it was definitely time for me to stop using a four-year-old version. Unfortunately, stable releases had by then mostly gone away. And even when a release is made (as I discovered with 3.1.0 and 3.2.0), it may only be "stable" for those making the release.

So, I started working with the "unstable" SNAPSHOT builds of the unreleased 3.3.0 version. This, however, presented some issues. I bundle Heritrix with a few customizations and crawl job profiles. This is done via a Maven build process. Without a stable release, I run the risk that a change to Heritrix will cause my internal build to produce something that no longer works. It also makes it impossible to release stable builds of tools that rely on new features in Heritrix 3.3.0. Thus no stable releases for the DeDuplicator or CrawlRSS. Both are way overdue.
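To make the problem concrete, this is roughly what pinning a vetted build looks like in a bundling project's pom.xml. The coordinates match Heritrix 3.x's published group ID, but the repository URL and version shown here are illustrative; with only SNAPSHOTs available, the version line would have to read 3.3.0-SNAPSHOT and could break underneath you at any time.

```xml
<!-- Sketch: depending on a fixed, internally released Heritrix build
     instead of a moving 3.3.0-SNAPSHOT. Repository URL is illustrative. -->
<repositories>
  <repository>
    <id>internal-releases</id>
    <!-- In-house Maven repo holding the vetted build -->
    <url>https://maven.example.org/releases</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.archive.heritrix</groupId>
    <artifactId>heritrix-engine</artifactId>
    <!-- A fixed version; builds are reproducible as long as this exists -->
    <version>3.3.0-LBS-2015-01</version>
  </dependency>
</dependencies>
```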

Late last year, after getting a very nasty bug fixed in Heritrix, I spent a good while testing it and making sure no further bugs interfered with my jobs. I discovered a few minor flaws and wound up creating a fork that contained fixes for these flaws. Realizing I now had something that was as close to a "stable" build as I was likely to see, I dubbed it Heritrix version 3.3.0-LBS-2014-03 (LBS is the Icelandic abbreviation of the library's name and 2014-03 is the domain crawl it was made for).

The fork is still available on GitHub. More importantly, this version was built and deployed to our in-house Maven repo. It doesn't solve the issue of the open tools we have but for internal projects, we now had a proper release to build against.

You can see here all the commits that separate 3.2.0 and 3.3.0-LBS-2014-03 (there are a lot!).

Which brings us to 3.3.0-LBS-2015-01. When getting ready for the first crawl of this year, I realized that the issues I'd had were now resolved, plus a few more things had been fixed (full list of commits). So, I created a new fork and, again, put it through some testing. When it came up clean, I released it internally as 3.3.0-LBS-2015-01. It's now used for all crawling at the library.

This sorta works for me. But it isn't really a good model for a widely used piece of software. The unreleased 3.3.0 version contains significant fixes and improvements. Getting people stuck on 3.2.0 or forcing them to use a non-stable release isn't good. And, while anyone may use my build, doing so requires a bit of know-how and there still isn't any promise of it being stable in general just because it is stable for me. This was clearly illustrated with the 3.1.0 and 3.2.0 releases which were stable for IA, but not for me.

Stable releases require some quality assurance.

Proposed Solution

What I'd really like to see is an initiative of multiple Heritrix users (be they individuals or institutions). These would come together, once or twice a year, to create a release candidate and test it based on each user's particular needs. This would mostly entail running each party's usual crawls and looking for anything abnormal.

Serious regressions would lead to fixes, rollbacks of features or (in dire cases) cancellation of the release. Once everyone signs off, a new release is minted and pushed to a public Maven repo.

The focus here is primarily on testing. While there might be a bit of development work to fix a bug that is discovered, the emphasis is on vetting that the proposed release does not contain any notable regressions.

By having multiple parties, each running the candidate build through their own workflow, the odds are greatly improved that we'll catch any serious issues. Of course, this could be done by a dedicated QA team. But the odds of putting one together are small, so we must make do.

I'd love it if the Internet Archive (IA) was party to this, or even took over leading it. But they aren't essential. It is perfectly feasible to alter the "group ID" and release a version under another "flag", as it were, if IA proves uninterested.
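Changing the "flag" really is that small a step in Maven terms: the same source gets re-released under a different group ID. The group ID below is hypothetical, purely to illustrate the idea.

```xml
<!-- Hypothetical: the same code re-released under a non-IA group ID -->
<dependency>
  <!-- e.g. org.iipc.heritrix instead of org.archive.heritrix -->
  <groupId>org.iipc.heritrix</groupId>
  <artifactId>heritrix-engine</artifactId>
  <version>3.3.0</version>
</dependency>
```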

Again, to be clear, this is not an effort to set up a development effort around Heritrix, like the IIPC did for OpenWayback. This is just focused on getting regular stable builds released based on the latest code. Period.

Sign up 

If the above sounds good and you'd like to participate, by all means get in touch: in the comments below, on Twitter, or via e-mail.

At minimum you must be willing to do the following once or twice a year:
  • Download a specific build of Heritrix
  • Run crawls with said build that match your production crawls
  • Evaluate those crawls, looking for abnormalities and errors compared to your usual crawls
    • A fair amount of experience with running Heritrix is clearly needed.
  • Report your results
    • Ideally, in a manner that allows issues you uncover to be reproduced
All of this would happen within a coordinated time frame, probably spanning about two weeks.

Better still if you are willing to look into the causes of any problems you discover.

Help with admin tasks, such as pushing releases etc. would also be welcome.

At the moment, this is nothing more than an idea and a blog post. Your responses will determine if it ever amounts to anything more.

September 4, 2015

How big is your web archive?

How big is your web archive? I don't want you to actually answer that question. What I'm interested in is, when you read that question, what metric jumped into your head?

I remember long discussions on web archive metrics when I was first starting in this field, over 10 years ago. Most vividly I remember the discussion on what constitutes a "document".

Ultimately, nothing much came of any of those discussions. We are mostly left talking about number of URL "captures" and bytes stored (3.17 billion and 64 TiB, by the way). Yet these don't really convey all that much, at least not without additional context.

[Slides: Measure All the (Web Archiving) Things! by Nicholas Taylor, Stanford University Libraries]
Another approach is to talk about websites (or seeds/hosts) captured. That still leaves you with a complicated dataset. How many crawls? How many URLs per site? Etc.

Nicholas Taylor (of Stanford University Libraries) recently shared some slides on this subject, that I found quite interesting and revived this topic in my mind.

It can be a very frustrating exercise to communicate the qualities of your web archive. If Internet Archive has 434 billion web pages and I only have 3.17 billion does that make IA's collection 137 times better?

I imagine anyone who's reading this knows that that isn't how things work. If you are after world wide coverage, IA's collection is infinitely better than mine. Conversely, for Icelandic material, the Icelandic web archive is vastly deeper than IA's.
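For what it's worth, the "137 times" figure above is nothing more than the ratio of the raw capture counts, which is exactly why it says so little:

```python
# The naive size comparison: raw capture counts only.
ia_captures = 434e9    # Internet Archive, reported web pages
lbs_captures = 3.17e9  # this archive's URL captures

print(round(ia_captures / lbs_captures))  # 137
```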

We are, therefore, left writing lengthy reports detailing the number of seeds/sites, URLs crawled, frequency of crawling, etc. Not only does this make it hard to explain to others how good (or not so good, as the case may be) our web archive is; it also makes it very difficult to judge it against other archives.

To put it bluntly, it leaves us guessing at just how "good" our archive is.

To be fair, we are not guessing blindly. Experience (either first hand or learned from others) provides useful yardsticks and rules of thumb. If we have resources for some data mining and quality assurance, those guesses get much better.

But it sure would be nice to have some handy metrics. I fear, however, that this isn't to be.