September 15, 2015

Looking For Stability In Heritrix Releases

Which version of Heritrix do you use?

If the answer is version 3.3.0-LBS-2015-01 then you probably already know where I'm going with this post and may want to skip to the proposed solution.  

3.3.0-LBS-2015-01 is a version of Heritrix that I "made" and currently use because there isn't a "proper" 3.3.0 release. I know of a couple of other institutions that have taken advantage of it.

The Problem (My Experience)

The last proper release of Heritrix (i.e. non-SNAPSHOT release that got pushed to a public Maven repo, even if just the Internet Archive one) that I could use was 3.1.0-RC1. There were regression bugs in both 3.1.0 and 3.2.0 that kept me from using them.

After 3.2.0 came out the main bug keeping me from upgrading was fixed. Then a big change to how revisit records were created was merged in and it was definitely time for me to stop using a four-year-old version. Unfortunately, stable releases had now mostly gone away. Even when a release is made (as I discovered with 3.1.0 and 3.2.0) it may only be "stable" for those making the release.

So, I started working with the "unstable" SNAPSHOT builds of the unreleased 3.3.0 version. This, however, presented some issues. I bundle Heritrix with a few customizations and crawl job profiles. This is done via a Maven build process. Without a stable release, I'd run the risk that a change to Heritrix would cause my internal build to create something that no longer works. It also makes it impossible to release stable builds of tools that rely on new features in Heritrix 3.3.0. Thus no stable releases for the DeDuplicator or CrawlRSS. Both are way overdue.

Late last year, after getting a very nasty bug fixed in Heritrix, I spent a good while testing it and making sure no further bugs interfered with my jobs. I discovered a few minor flaws and wound up creating a fork that contained fixes for these flaws. Realizing I now had something that was as close to a "stable" build as I was likely to see, I dubbed it Heritrix version 3.3.0-LBS-2014-03 (LBS is the Icelandic abbreviation of the library's name and 2014-03 is the domain crawl it was made for).

The fork is still available on GitHub. More importantly, this version was built and deployed to our in-house Maven repo. It doesn't solve the issue of the open tools we have, but for internal projects we now had a proper release to build against.

You can see here all the commits that separate 3.2.0 and 3.3.0-LBS-2014-03 (there are a lot!).

Which brings us to 3.3.0-LBS-2015-01. When getting ready for the first crawl of this year I realized that the issues I'd had were now resolved, plus a few more things had been fixed (full list of commits). So, I created a new fork and, again, put it through some testing. When it came up clean I released it internally as 3.3.0-LBS-2015-01. It's now used for all crawling at the library.

This sorta works for me. But it isn't really a good model for a widely used piece of software. The unreleased 3.3.0 version contains significant fixes and improvements. Leaving people stuck on 3.2.0 or forcing them to use a non-stable release isn't good. And, while anyone may use my build, doing so requires a bit of know-how and there still isn't any promise of it being stable in general just because it is stable for me. This was clearly illustrated with the 3.1.0 and 3.2.0 releases, which were stable for IA, but not for me.

Stable releases require some quality assurance.

Proposed Solution

What I'd really like to see is an initiative of multiple Heritrix users (be they individuals or institutions). These would come together, once or twice a year, create a release candidate and test it based on each user's particular needs. This would mostly entail running each party's usual crawls and looking for anything abnormal.

Serious regressions would either lead to fixes, rollbacks of features or (in dire cases) cancelling the release. Once everyone signs off, a new release is minted and pushed to a public Maven repo.

The focus here is primarily on testing. While there might be a bit of development work to fix a bug that is discovered, the aim is to vet that the proposed release does not contain any notable regressions.

By having multiple parties, each running the candidate build through their own workflow, the odds are greatly improved that we'll catch any serious issues. Of course, this could be done by a dedicated QA team. But the odds of putting that together are small, so we must make do.

I'd love it if the Internet Archive (IA) were party to this or even took over leading it. But they aren't essential. It is perfectly feasible to alter the "group ID" and release a version under another "flag", as it were, if IA proves uninterested.

Again, to be clear, this is not an effort to set up a development effort around Heritrix, like the IIPC did for OpenWayback. This is just focused on getting regular stable builds released based on the latest code. Period.

Sign up 

If the above sounds good and you'd like to participate, by all means get in touch, whether in the comments below, on Twitter or by e-mail.

At minimum you must be willing to do the following once or twice a year:
  • Download a specific build of Heritrix
  • Run crawls with said build that match your production crawls
  • Evaluate those crawls, looking for abnormalities and errors compared to your usual crawls
    • A fair amount of experience with running Heritrix is clearly needed.
  • Report your results
    • Ideally, in a manner that allows issues you uncover to be reproduced
All of this would be done during a coordinated time frame, probably spanning about two weeks.

Better still if you are willing to look into the causes of any problems you discover.

Help with admin tasks, such as pushing releases etc. would also be welcome.

At the moment, this is nothing more than an idea and a blog post. Your responses will determine if it ever amounts to anything more.


  1. UNT Libraries can contribute to downloading a specific build, running our regular crawls with it, evaluating, and reporting back. Stable regular releases of Heritrix would be very welcome.

  2. I swear I commented here last week, what happened to my comment? :) Anyway, Archive-It generally runs the latest Heritrix master in production. I think that represents a good level of testing on our part?

    Does anyone want to volunteer to go through the steps involved in "minting" a release? Here they are as described by Gordon some years ago.

    - rename what's currently in JIRA as H3.1.0-final as H3.1.0-[TESTNAME] (where testname in this case might be "beta2" or "RC1"), and create a new 'H3.1.0' release for whatever's between now and then
    - update all the pom.xml's, in a single commit, to replace the -SNAPSHOT version IDs with -[TESTNAME], letting an 'official' build happen on the build box
    - as soon as that's done, again single-commit all the pom.xml's back to -SNAPSHOT
    - grab one of the representative artifacts (.tar.gz), expand somewhere, and make sure it runs at least a trivial test crawl (in case something went horribly wrong in packaging). Provided that passes, then...
    - update the release notes (including changing the article name) to reflect the new version-id, download links, and release date. Yes, with in-place updating and Confluence's handling of article renames, this means all links to the old '-beta' should now wind up at the updated notes.
    - announce to the list, usually by C&Ping an old announcement with just the key names/links/dates/major-bullets changed
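
The two pom.xml commits in the steps above amount to a mechanical version swap. Here is a minimal sketch of that swap in Python, assuming each version is written as a plain `<version>X.Y.Z-SNAPSHOT</version>` element. The function names are hypothetical and this is not part of Heritrix's actual build tooling. Note that it rewrites every matching version element, including any dependency that happens to carry the same suffix, so a real run would need to be reviewed before committing.

```python
import re

# Matches a <version> element whose value ends in -SNAPSHOT,
# e.g. <version>3.3.0-SNAPSHOT</version>.
SNAPSHOT_VERSION = re.compile(r"(<version>\s*[^<\s]+?)-SNAPSHOT(\s*</version>)")


def set_release_version(pom_text: str, test_name: str) -> str:
    """Replace every -SNAPSHOT version suffix with -<test_name> (e.g. RC1)."""
    return SNAPSHOT_VERSION.sub(
        lambda m: f"{m.group(1)}-{test_name}{m.group(2)}", pom_text
    )


def restore_snapshot_version(pom_text: str, test_name: str) -> str:
    """Undo set_release_version: turn -<test_name> suffixes back into -SNAPSHOT."""
    pattern = re.compile(
        r"(<version>\s*[^<\s]+?)-" + re.escape(test_name) + r"(\s*</version>)"
    )
    return pattern.sub(lambda m: f"{m.group(1)}-SNAPSHOT{m.group(2)}", pom_text)
```

Applying `set_release_version(text, "RC1")` to each pom.xml corresponds to the first commit, and `restore_snapshot_version(text, "RC1")` to the follow-up commit that puts the tree back on -SNAPSHOT.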

  3. That last comment was from me.

    1. Thanks Noah. I did see your comment from last week. Have no idea why it disappeared.

      I did suspect that Heritrix was stable for IA.

      One question; Does doing the steps mentioned push the stable release to IA's Maven repo?

      I was also wondering if it wasn't worthwhile to push it to Maven Central (as is done for OpenWayback). Admittedly, a slightly larger task, but not an insurmountable one.

    2. > One question; Does doing the steps mentioned push the stable release to IA's Maven repo?

      Yep. Every commit triggers a build, and every build is published to the maven repo. Same thing happens whether the version in pom.xml is -SNAPSHOT or not.

      > I was also wondering if it wasn't worthwhile to push it to Maven Central

      I would support that if someone else wants to do the work. :)

  4. In the Danish Netarkiv we use Heritrix as a dependency in NetarchiveSuite. So from our point of view it's fairly critical that the release candidate is also available in a public Maven repository if we're going to be able to use it. I think we could commit to running some limited automatic testing. I'm not sure I can promise resources for more realistic harvest-testing without consulting with my overlords.

    1. Hi Colin,

      Unfortunately, the response I got to this post fell well short of what would be needed IMHO. Thus I've put this idea on ice for now.