May 30, 2016
A month ago I posted that I was testing a 'semi-stable' build of Heritrix. The new build is called "Heritrix 3.3.0-LBS-2016-02" as this is built for LBS's (Icelandic acronym for my library) 2016-02 domain crawl (i.e. the second one this year).
I can now report that this version has passed all my tests without any regressions showing up. I've noted two minor issues, one of which was fixed immediately (Noisy alerts about 401s without auth challenge) and the other has been around since Heritrix 3.1.0 at the least and does not affect crawling in any way (Bug in non-fatal-error log).
Additionally, I heard from Netarkivet.dk. They also tested this version with no regressions found.
I think it is safe to say that if you are currently using my previous semi-stable build (LBS-2015-01), upgrading to this version should be entirely straightforward. There are no notable API changes to worry about either, unless, of course, you are using features that are less 'mainstream'.
You can find this version on our GitHub page. You'll have to download the source and build it for yourself.
Update: As you can see in the comments below, Netarkivet.dk has put the artifacts into a publicly accessible repository. Very helpful if you have code with dependencies on Heritrix and you don't have your own repository.
Thanks for the heads-up, Nicholas.
April 28, 2016
New 'semi-stable' build for Heritrix
Earlier this month I mentioned that I was planning on making another one of my "semi-stable" Heritrix builds, using the current master 3.3.0. This work is now underway in the Landsbokasafn Heritrix repo on GitHub, as LBS-2016-02.
I've merged in one pull request that is still open in the IA repository, #154 Fixes for apparent build errors. Most notably, this makes it possible to have Travis-CI build and test Heritrix.
You can review the full list of changes between my last Heritrix build (2015-01) and this new one here. Here is a list of the main changes:
- Some fixes to how server-not-modified revisit records are written (PR #118).
- Fix outlink hoppath in metadata records (PR #119)
- Allow dots in filenames for known good extensions (PR #120)
- Require Maven 3.3 (PR #126)
- Allow realm to be set by server for basic auth (PR #124)
- Better error handling in StatisticsTracker (PR #130)
- Fix to Java 8 Keytool (PR #129) - I wrote a post about this back in 2014.
- Changes to how cookies are stored in Bdb (PR #133)
- Handle multiple clauses for same user agent in robots.txt (PR #139)
- SourceSeedDecideRule and SeedLimitsEnforcer (PR #137 and #148)
- 'Novel' URL and byte quotas (PR #138)
- Only submit 'checked' checkbox and radio buttons when submitting forms (PR #122)
- Form login improvements (PR #142 and #143)
- Improvements to hosts report (PR #123)
- Handle SNI error better (PR #141)
- Allow some whitespace in URLs extracted by ExtractorJS (PR #145)
- Fix to ExtractorHTML dealing with HTML comments (PR #149)
- Build against Java 7 (PR #152)
I've ignored all pull requests that apply primarily to the contrib package in the above. There were quite a few there, mostly (but not exclusively) relating to AMQP.
I've done some preliminary testing and everything looks good. So far, the only issue I've noted is one that I was already aware of, about noisy alerts relating to 401s.
I'll be testing this version further over the next few weeks and welcome any additional input.
April 7, 2016
Still Looking For Stability In Heritrix Releases
I'd just like to briefly follow up on a blog post I wrote last September, Looking For Stability In Heritrix Releases.
The short version is that the response I got was, in my opinion, insufficient to proceed. I'm open to revisiting the idea if that changes, but for now it is on ice.
There is little doubt in my mind that having (somewhat) regular stable releases made of Heritrix would be of notable benefit. Even better if they are published to Maven Central.
Instead, I'll continue to make my own forks from time to time and make sure they are stable for me. The last one was dubbed LBS-2015-01. It is now over a year old and a lot has changed. I expect I'll be making a new one in May/June. You can see what's changed in Heritrix in the meantime here.
I know a few organizations are also using my semi-stable releases. If you are one of them and would like to get some changes in before the next version (to be dubbed LBS-2016-02), you should try to get a PR into Heritrix before the end of April. Likewise, if you know of a serious/blocking bug in the current master of Heritrix, please bring it to my attention.
April 1, 2016
Duplicate DeDuplicators?
A question on the Heritrix mailing list prompted me to write a few words about deduplication in Heritrix and why there are multiple ways of doing it.
Heritrix's built-in service
Heritrix's inbuilt deduplication service comes in two processors. One processor records each URL in a BDB datastore or index (PersistStoreProcessor); the other looks up the current URL in this datastore and, if it finds it, compares the content digest (FetchHistoryProcessor).
The index used to be mingled in with other crawler state. That is often undesirable, as you may not wish to carry forward any of that state to subsequent crawls. Typically, therefore, you configure the above processors to use their own directory by wiring in a separate "BDB module" configured to write to an alternative directory.
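To make that a bit more concrete, here is a minimal sketch of what such wiring can look like in a crawler-beans.cxml. This is my own illustration, not lifted from a working profile: the bean ids, the directory name and the bdbModule property are assumptions you should verify against your Heritrix version and the wiki page linked further down.

<!-- Sketch only: a dedicated BDB module so the deduplication index lives in
     its own directory rather than with the main crawler state. -->
<bean id="historyBdb" class="org.archive.bdb.BdbModule">
  <property name="dir" value="history" />
</bean>

<!-- Compares the current fetch against the recorded history (fetch chain). -->
<bean id="fetchHistoryProcessor"
      class="org.archive.modules.recrawl.FetchHistoryProcessor" />

<!-- Records each URL's fetch history in the separate BDB index
     (disposition chain). The 'bdbModule' property name is an assumption. -->
<bean id="persistStoreProcessor"
      class="org.archive.modules.recrawl.PersistStoreProcessor">
  <property name="bdbModule" ref="historyBdb" />
</bean>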
There is no way to construct the index outside of a crawl, which can be problematic since a hard crash will often corrupt the BDB data. Of course, you can recover from a checkpoint, if you have one.
More recently, a new set of processors has been added: ContentDigestHistoryLoader and ContentDigestHistoryStorer. They work in much the same way, except that they use an index keyed on the content digest rather than the URL. This enables URL-agnostic deduplication. That was a questionable feature when introduced, but after the changes made to implement a more robust way of recording non-URL duplicates, it became a useful feature, although its utility will vary based on the nature of your crawl.
As this index is updated at crawl time, it also makes it possible to deduplicate on material discovered during the same crawl. A very useful feature that I now use in most crawls.
You still can't build the index outside of a crawl.
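As a similarly hedged sketch (bean ids and the bdbModule property are my assumptions; the wiki page mentioned below has the authoritative example), the digest-keyed variant can be wired roughly like this, again with its own BDB directory:

<!-- Sketch only: digest-keyed history store in its own BDB directory. -->
<bean id="contentDigestHistoryBdb" class="org.archive.bdb.BdbModule">
  <property name="dir" value="digest-history" />
</bean>
<bean id="contentDigestHistory"
      class="org.archive.modules.recrawl.BdbContentDigestHistory">
  <property name="bdbModule" ref="contentDigestHistoryBdb" />
</bean>

<!-- The loader goes in the fetch chain, the storer in the disposition chain. -->
<bean id="contentDigestHistoryLoader"
      class="org.archive.modules.recrawl.ContentDigestHistoryLoader" />
<bean id="contentDigestHistoryStorer"
      class="org.archive.modules.recrawl.ContentDigestHistoryStorer" />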
For more information about the built in features, consult the Duplication Reduction Processors page on the Heritrix wiki.
The DeDuplicator add-on
The DeDuplicator add-on pre-dates the built-in function in Heritrix by about a year (released in 2006). It essentially accomplishes the same thing, but with a few notable differences in tactics.
Most importantly, its index is always built outside of a crawl. Either from the crawl.log (possibly multiple log files) or from WARC files. This provides a considerable amount of flexibility as you can build an index covering multiple crawls. You can also gain the benefit of the deduplication as soon as you implement it. You don't have to run one crawl just to populate it.
The DeDuplicator uses Lucene to build its index. This allows for multiple searchable fields which in turn means that deduplication can do things like prefer exact URL matches but still do digest only matches when exact URL matches do not exist. This affords a choice of search strategies.
The DeDuplicator also provides some additional statistics, can write more detailed deduplication data to the crawl.log, and comes with pre-configured job profiles.
The Heritrix 1 version of the DeDuplicator actually also supported deduplication based on 'server not modified'. But it was dropped when migrating to H3, as no one seemed to be using it. The index still contains enough information to easily bring it back.
Bottom line
Both approaches ultimately accomplish the same thing. Especially after the changes that were made a couple of years ago to how these modules interact with the rest of Heritrix, there really isn't any notable difference in the output. All these processors, after determining that a document is a duplicate, set the same flags and cause the same information to be written to the WARC (if you are using ARCs, do not use any URL-agnostic features!).
Ultimately, it is just a question of which fits better into your workflow.
September 15, 2015
Looking For Stability In Heritrix Releases
Which version of Heritrix do you use?
If the answer is version 3.3.0-LBS-2015-01 then you probably already know where I'm going with this post and may want to skip to the proposed solution.
3.3.0-LBS-2015-01 is a version of Heritrix that I "made" and currently use because there isn't a "proper" 3.3.0 release. I know of a couple of other institutions that have taken advantage of it.
The Problem (My Experience)
The last proper release of Heritrix (i.e. non-SNAPSHOT release that got pushed to a public Maven repo, even if just the Internet Archive one) that I could use was 3.1.0-RC1. There were regression bugs in both 3.1.0 and 3.2.0 that kept me from using them.
After 3.2.0 came out the main bug keeping me from upgrading was fixed. Then a big change to how revisit records were created was merged in and it was definitely time for me to stop using a 4 year old version. Unfortunately, stable releases had now mostly gone away. Even when a release is made (as I discovered with 3.1.0 and 3.2.0) they may only be "stable" for those making the release.
So, I started working with the "unstable" SNAPSHOT builds of the unreleased 3.3.0 version. This, however, presented some issues. I bundle Heritrix with a few customizations and crawl job profiles. This is done via a Maven build process. Without a stable release, I'd run the risk that a change to Heritrix would cause my internal build to create something that no longer works. It also makes it impossible to release stable builds of tools that rely on new features in Heritrix 3.3.0. Thus no stable releases for the DeDuplicator or CrawlRSS. Both are way overdue.
Late last year, after getting a very nasty bug fixed in Heritrix, I spent a good while testing it and making sure no further bugs interfered with my jobs. I discovered a few minor flaws and wound up creating a fork that contained fixes for these flaws. Realizing I now had something that was as close to a "stable" build as I was likely to see, I dubbed it Heritrix version 3.3.0-LBS-2014-03 (LBS is the Icelandic abbreviation of the library's name and 2014-03 is the domain crawl it was made for).
The fork is still available on GitHub. More importantly, this version was built and deployed to our in-house Maven repo. It doesn't solve the issue for the open tools we maintain, but for internal projects we now had a proper release to build against.
You can see here all the commits that separate 3.2.0 and 3.3.0-LBS-2014-03 (there are a lot!).
Which brings us to 3.3.0-LBS-2015-01. When getting ready for the first crawl of this year, I realized that the issues I'd had were now resolved, plus a few more things had been fixed (full list of commits). So, I created a new fork and, again, put it through some testing. When it came up clean I released it internally as 3.3.0-LBS-2015-01. It's now used for all crawling at the library.
This sorta works for me. But it isn't really a good model for a widely used piece of software. The unreleased 3.3.0 version contains significant fixes and improvements. Getting people stuck on 3.2.0 or forcing them to use a non-stable release isn't good. And, while anyone may use my build, doing so requires a bit of know-how and there still isn't any promise of it being stable in general just because it is stable for me. This was clearly illustrated with the 3.1.0 and 3.2.0 releases which were stable for IA, but not for me.
Stable releases require some quality assurance.
Proposed Solution
What I'd really like to see is an initiative of multiple Heritrix users (be they individuals or institutions). These would come together, once or twice a year, create a release candidate and test it based on each user's particular needs. This would mostly entail running each party's usual crawls and looking for anything abnormal.
Serious regressions would either lead to fixes, rollback of features or (in dire cases) cancelling of the release. Once everyone signs off, a new release is minted and pushed to a public Maven repo.
The focus here is primarily on testing. While there might be a bit of development work to fix a bug that is discovered, the emphasis is on vetting that the proposed release does not contain any notable regressions.
By having multiple parties, each running the candidate build through their own workflow, the odds are greatly improved that we'll catch any serious issues. Of course, this could be done by a dedicated QA team. But the odds of putting that together are small, so we must make do.
I'd love it if the Internet Archive (IA) was party to this or even took over leading it. But they aren't essential. It is perfectly feasible to alter the "group ID" and release a version under another "flag", as it were, if IA proves uninterested.
Again, to be clear, this is not an effort to set up a development effort around Heritrix, like the IIPC did for OpenWayback. This is just focused on getting regular stable builds released based on the latest code. Period.
Sign up
If the above sounds good and you'd like to participate, by all means get in touch. In comments below, on Twitter or e-mail.
At minimum you must be willing to do the following once or twice a year:
- Download a specific build of Heritrix
- Run crawls with said build that match your production crawls
- Evaluate those crawls, looking for abnormalities and errors compared to your usual crawls (a fair amount of experience with running Heritrix is clearly needed)
- Report your results, ideally in a manner that allows issues you uncover to be reproduced
Better still if you are willing to look into the causes of any problems you discover.
Help with admin tasks, such as pushing releases etc. would also be welcome.
At the moment, this is nothing more than an idea and a blog post. Your responses will determine if it ever amounts to anything more.
July 13, 2015
Customizing Heritrix Reports
It is a well known fact (at least among its users) that Heritrix comes with ten default reports. These can be generated and viewed at crawl time and will be automatically generated when the crawl is terminated. Most of these reports have been part of Heritrix from the very beginning and haven't changed all that much.
It is less well known that these reports are a part of Heritrix's modular configuration structure. They can be replaced, configured (in theory) and additional reports added.
The ten built-in reports do not offer any additional configuration options. (Although that may change if a pull request I made is merged.) But the most useful aspect is the ability to add reports tailored to your specific needs. Both the DeDuplicator and the CrawlRSS Heritrix add-on modules use this to surface their own reports in the UI and ensure those are written to disk at the end of a crawl.
In order to configure reports, it is necessary to edit the statisticsTracker bean in your CXML crawl configuration. That bean has a list property called reports. Each element in that list is a Java class that extends the abstract Report class in Heritrix.
That is really all there is to it. Those beans can have their own properties (although none do--yet!) and behave just like any other simple bean. To add your own, just write a class, extend Report and wire it in. Done.
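For orientation, the relevant part of the configuration looks roughly like the sketch below. This is from memory and abbreviated: only a few of the ten default report classes are shown, and MyCustomReport is a purely hypothetical custom report.

<!-- Sketch only: the reports list on the statisticsTracker bean, shown
     uncommented. In the stock profile this property ships commented-out. -->
<bean id="statisticsTracker"
      class="org.archive.crawler.reporting.StatisticsTracker">
  <property name="reports">
    <list>
      <bean class="org.archive.crawler.reporting.CrawlSummaryReport" />
      <bean class="org.archive.crawler.reporting.SeedsReport" />
      <bean class="org.archive.crawler.reporting.HostsReport" />
      <bean class="org.archive.crawler.reporting.MimetypesReport" />
      <bean class="org.archive.crawler.reporting.ResponseCodeReport" />
      <!-- ...the remaining default reports... -->
      <!-- A hypothetical custom report: a class extending Report, wired in. -->
      <bean class="com.example.MyCustomReport" />
    </list>
  </property>
</bean>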
One caveat: you'll notice this section is all commented-out in the default profile. When the reports property is left empty, the StatisticsTracker loads up the default reports. Once you uncomment that list, it overrides any 'default list' of reports. This means that if future versions of Heritrix change which reports are 'default', you'll need to update your configuration or miss out.
Of course, you may want to 'miss out', depending on what the change is.
My main annoyance is that the reports list requires subclasses of a class, rather than specifying an interface. This needs to change so that any interested class could implement the contract and become a report. As it is, if you have a class that already extends another class and has a reporting function, you need to create a special report class that does nothing but bridge the gap to what Heritrix needs. You can see this quite clearly in the DeDuplicator, where there is a special DeDuplicatorReport class for exactly that purpose. A similar thing came up in CrawlRSS.
I've found it to be very useful to be able to surface customized reports in this manner. In addition to the two use cases I've mentioned (which are publicly available), I also use it to draw up a report on disk usage (built into the module that monitors for out-of-disk-space conditions) and for a report on which regular expressions have been triggered during scoping (built into a variant of the MatchesListRegexDecideRule).
I'd had both of those reports available for years, but they had always required using the scripting console to get at. Having them just a click away and automatically written at the end of a crawl has been quite helpful.
If you have Heritrix related reporting needs that are not being met, there is a comment box below.
July 1, 2015
Leap second and web crawling
A leap second was added last midnight. This is only the third time that has happened since I started doing web crawls, and the first time it happened while I had a crawl running. So, I decided to look into any possible effects or ramifications a leap second might have for web archiving.
Spoiler alert; there isn't really one. At least not for Heritrix writing WARCs.
Heritrix is typically run on a Unix type system (usually Linux). On those systems, the leap second is implemented by repeating the last second of the day. I.e. 23:59:59 comes around twice. This effectively means that the clock gets set back by a second when the leap second is inserted.
Fortunately, Heritrix does not care if time gets set back. My crawl.log did show this event quite clearly as the following excerpt shows (we were crawling about 37 URLs/second at the time):
2015-06-30T23:59:59.899Z 404 19155 http://sudurnes.dv.is/media/cache/29/74/29747ef7a4574312a4fc44d117148790.jpg LLLLLE http://sudurnes.dv.is/folk/2013/6/21/slagurinn-tok-mig/ text/html #011 2015063023595
2015-06-30T23:59:59.915Z 404 242 http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-ford-econoline-v8-351-w/textarea.bbp-the-content ELRLLLLLLLX http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-for
2015-06-30T23:59:59.936Z 200 3603 http://foldaskoli.is/myndasafn/index.php?album=2011_til_2012/47-%C3%9Alflj%C3%B3tsvatn&image=dsc01729-2372.jpg LLLLLLLLL http://foldaskoli.is/myndasafn/index.php?album=
2015-06-30T23:59:59.019Z 200 42854 http://baikal.123.is/themes/Nature/images/header.jpg PLLPEE http://baikal.123.is/ottSupportFiles/getThemeCss.aspx?id=19&g=6843&ver=2 image/jpeg #024 20150630235959985+-
2015-06-30T23:59:59.025Z 200 21520 http://bekka.bloggar.is/sida/37229/ LLLLL http://bekka.bloggar.is/ text/html #041 20150630235959986+13 sha1:C2ZF67KFGUDFVPV46CPR57J45YZRI77U http://bloggar.is/ - {"warc
2015-06-30T23:59:59.040Z 200 298365 http://www.birds.is/leikskoli/images/3072/img_2771__large_.jpg LLRLLLLLLLLLL http://www.birds.is/leikskoli/?pageid=3072 image/jpeg #005 20150630235956420+2603 sha1:F65B
There may be some impact on tools parsing your logs if they expect the timestamps to, effectively, be in order. But I'm unaware of any tools that make that assumption.
But, what about replay?
The current WARC spec calls for using timestamps with a resolution of one second. This means that all the URLs captured during the leap second will get the same value as those captured during the preceding second. No assumptions can be made about the order in which these URLs were captured, any more than you can about the order of URLs captured normally during a single second. It doesn't really change anything that this period of uncertainty now spans two seconds instead of one. The effective level of uncertainty remains about the same.
Sidenote. The order of the captured URLs in the WARC may be indicative of crawl order, but that is not something that can be relied on.
There is actually a proposal for improving the resolution of WARC dates. You can review it on the WARC review GitHub issue tracker. If adopted, a leap second event would mean that the WARCs actually contain incorrect information.
The fix to that would be to ensure that the leap second is encoded as 23:59:60 as per the official UTC spec. But that seems unlikely to happen as it would require changes to Unix timekeeping or using non-system timekeeping in the crawler.
Perhaps it is best to just leave the WARC date resolution at one second.
March 23, 2015
URI Canonicalization in Web Archiving
URI canonicalization is an arcane part of web archiving that is often overlooked, despite having some important implications. This blog post is an attempt at shining a light on this topic, to illustrate those implications and point out at least one thing that we can improve on.
Heritrix
Heritrix keeps track of which URIs it has already crawled so that it doesn't repeatedly crawl the same URI. This fails when there are multiple URIs leading to the same content. Unless you deal with the issue, you wind up in a crawler trap, crawling the same (or nearly the same) content endlessly.
URI canonicalization is an attempt to deal with certain classes of these. This is done by establishing transformation rules that are applied to each discovered URI. The result of these transformations is regarded as the canonical form of the URI. It is this canonical form which is used to discard already seen URIs. The original URI is always used when making the HTTP request, the canonical form is only to prevent multiple fetches of the same content.
By default Heritrix applies the following transformations (in this order) to generate the canonical form of a URI:
- Make all letters in the URI lower case.
- Remove any user/password info in the URI (e.g. http://user:password@example.com becomes simply http://example.com)
- Strip any leading www prefixes. Also strip any www prefixes with a number (e.g. http://www3.example.com becomes http://example.com)
- Strip session IDs. This deals with several common forms of session IDs that are stored in URIs. It is unlikely to be exhaustive.
- Strip trailing empty query strings. E.g. http://example.com? becomes http://example.com
Two issues are quickly apparent.
One, some of these rules may cause URIs that contain different content to be regarded as the same. For example, while domain names are case insensitive, the path segment need not be. Thus the two URIs http://example.com/A.html and http://example.com/a.html may be two different documents.
Two, despite fairly aggressive URI canonicalization rules, there will still be many instances where the exact same content is served up on multiple URIs. For example, there are likely many more types of session ids than we have identified here.
We can never hope to enumerate all the possible ways websites deliver the same content with different URIs. Heritrix does, however, make it possible to tweak the URI canonicalization rules. You can thus deal with them on a case-by-case basis. Here is how to configure URI canonicalization rules in Heritrix 3.
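To give a feel for it, the policy is declared roughly as below. This mirrors the commented-out section of the default profile as I remember it; treat the class names and structure as an approximation and check them against your own crawler-beans.cxml.

<!-- Sketch only: the default set of canonicalization rules, as a starting
     point for adding your own site-specific rules to the list. -->
<bean id="canonicalizationPolicy"
      class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.canonicalize.LowercaseRule" />
      <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
      <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
      <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
      <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
      <bean class="org.archive.modules.canonicalize.FixupQueryString" />
      <!-- Additional, site-specific rules (e.g. a regex-based rule) go here. -->
    </list>
  </property>
</bean>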
But, you may not want to do that.
You'll understand why as we look at what URI canonicalization means during playback.
OpenWayback
As difficult as harvesting can be, accurate replay is harder. This holds doubly true for URI canonicalization.
Replay tools, like OpenWayback, must apply canonicalization under two circumstances. Firstly, when a new resource is added to the index (CDX or otherwise), it is the canonical form that is indexed. Secondly, when resolving a URI requested by the user.
User requested URIs are not just the ones users type into search boxes. They are also the URIs for embedded content needed to display a page and the URIs of links that users click in replayed webpages.
Which is where things get complicated. OpenWayback may not use the exact same rules as Heritrix.
The current OpenWayback canonicalizer (called AggressiveUrlCanonicalizer) applies rules that work more or less the same way as what Heritrix uses. It is not, however, the exact same set of rules (i.e. code), nor can they be customized in the same manner as in Heritrix.
There is a strong argument for harmonizing these two. Move the code into a shared library, ensure that both tools use the same defaults and can be customized in the same manner. (It was discussion of exactly this that prompted this blog post.)
Additionally, it would be very beneficial if each WARC contained information, in some standard notation, about which canonicalization rules were in effect during its creation.
All this would clearly help but even then you still may not want to fiddle with canonicalization rules.
Every rule you add must be applied to every URI. If you add a new rule to Heritrix, you must add it to OpenWayback to ensure playback. Once added to OpenWayback it will go into effect for all URIs being searched for. However, unless you rebuild your index it may contain URIs in a form that is no longer consistent with the current canonical form. So, you need to rebuild the index, which is expensive.
This is also assuming that it is safe to impose a rule retroactively. OpenWayback does not support applying canonicalization rules to time periods. To make matters worse, a rule that would be very useful on one website, may not be suitable for another website.
Dealing with a matrix of rules which apply some of the time to some of the content from some types does not seem enticing. Maybe there is a good way to express and apply such rules that no one has yet brought to the table. In the meantime, all I can say is, proceed with caution.
January 19, 2015
The First IIPC Technical Training Workshop
It has always been interesting to me how often a chance remark will lead to big things within the IIPC. A stray thought, given voice during a coffee break or dinner at a general assembly, can hit a resonance and lead to something new, something exciting.
So it was with the first IIPC technical training workshop. It started off as an off the cuff joke at last year's Paris GA, about having a 'Heritrix DecideRule party'. It struck a nerve, and it also quickly snowballed to include an 'OpenWayback upgradathon' and a 'SolrJam'.
The more we talked about this, the more convinced I became that this was a good idea. To have a hands on workshop, where IIPC members could send staff for practical training in these tools. Fortunately, Helen Hockx-Yu of the British Library shared this conviction. Even more fortunately, the IIPC Steering Committee wholeheartedly supported the idea. Most fortunate of all, the BL was ready, willing and able to host such an event.
So, last week, on a rather dreary January morning, around thirty web archiving professionals, from as far away as New Zealand, gathered outside the British Library in London and waited for the doors to open. Everyone was eager to learn more about Heritrix, OpenWayback and Solr.
Day one was dedicated to traditional, presentation-oriented dissemination of knowledge. On hand were several invited experts on each topic. In the morning the fundamentals of the three tools were discussed, with more in-depth topics after lunch. Roger Coram (BL) and I were responsible for covering Heritrix. Roger discussed the basics of Heritrix DecideRules and I covered other core features, notably sheet overlays, in the morning. The afternoon focused on Heritrix's REST API, deduplication at crawl time, and writing your own Heritrix modules.
There is no need for me to repeat all of the topics. The entire first day was filmed and made available online on IIPC's YouTube channel.
Day one went well, but it wasn't radically different from what we have done before at GAs. It was days two and three that made this meeting unique.
For the latter two days only a very loose agenda was provided: a list of tasks related to each tool, varying in complexity. Attendees chose tasks according to their interests and level of technical know-how. Some installed and ran their very first Heritrix crawl or set up their first OpenWayback instance. I set up Solr via the BL's webarchive-discovery and set it to indexing one of our collections.
Others focused on more advanced tasks involving Heritrix sheet overlays and REST API, OpenWayback WAR overlays and CDX generation or ... I really don't know what the advanced Solr tasks were. I was just happy to get the basic indexing up and running.
The 'experts' who did presentations on day one were, of course, on hand during days two and three to assist. I found this to be a very good model. Impromptu presentations were made on specific topics and the specific issues of different attendees could be addressed. I learned a fair amount about how other IIPC members actually conduct their crawls. There is nothing like hands-on knowledge. I think both experts and attendees got a lot out of it.
It was almost sad to see the three day event come to an end.
So, overall, a success. Definitely meriting an encore.
That isn't to say it was perfect, there is always room for improvement. Given a bit more lead-up time, it would have been possible to get a firmer idea of the actual interests of the attendees. For this workshop there was a bit of guess work. I think we were in the ballpark, but we can do better next time. It would also have been useful to have better developed tasks for the less experienced attendees.
So, will there be an opportunity to improve? I certainly hope so. We will need to decide where (London again or elsewhere) and when (same time next year or ...). The final decision will then be up to the IIPC Steering Committee. All I can say, is that I'm for it and I hope we can make this an annual event. A sort of counter-point to the GA.
We'll see.
Finally, I'd like to thank Helen and the British Library for their role as host and all of our experts for their contribution.
October 2, 2014
Heritrix, Java 8 and sun.security.tools.Keytool
I ran into this issue and I figured if I don't write it up, I'll be sure to have forgotten all the details when it occurs again.
The issue is that Heritrix (which is still built against Java 6) uses sun.security.tools.Keytool on startup to generate a self-signed certificate for its HTTPS connection. However, in Java 8, Oracle changed this class to sun.security.tools.keytool.Main.
As Heritrix only generates the certificate once, I only ran into this issue when installing a new build of Heritrix, not when I upgraded Java to 8 on my crawl server. You can run Heritrix with Java 8 just fine as long as you launch it once with Java 7 (or, presumably older).
It should be noted that Java warns against using anything from the sun package. It is not considered part of the Java API. But I believe that the only alternative is to have people manually set up the certificates.
This does mean two things:
1. You need a version of Java prior to 8 to work on Heritrix. It is possible for newer versions to compile in compatibility mode with a prior version, but Keytool isn't part of Java proper. If you only have Java 8 installed, you will not have the necessary dependency available on the classpath and your IDE will complain incessantly.
2. Building Heritrix on machines with only Java 8 is not possible.
I've also seen at least one unit test using Keytool (this may be only in a pending pull request, I haven't looked into it deeply).
This isn't an immediate problem as Java 7 is still supported and available from Oracle. However, if they discontinue Java 7 it will quickly become a problem (just try to get Java 6 to install from Oracle).
If you want to run Heritrix with Java 8 your options are:
1. First run it once with Java 7 or prior.
2. Use the -s option to specify a custom keystore location and passwords. You can build that keystore using external tools.
3. Manually create the adhoc.keystore file (in Heritrix's working directory) that Heritrix usually generates automatically. This can be done using Java 8 tools with the following command (assumes Java's bin directory is on the path):
$ keytool -keystore adhoc.keystore -storepass password \
    -keypass password -alias adhoc -genkey -keyalg RSA \
    -dname "CN=Heritrix Ad-Hoc HTTPS Certificate" -validity 3650
Number 3 rather points at a possible solution to this: just move the generation of the adhoc keystore to the shell script that launches Heritrix.
Edited to add #4: Copy an adhoc.keystore from a previous Heritrix install, if you have one lying about.
August 28, 2014
Rethinking Heritrix's crawl.log
I've been looking a lot at the Heritrix crawl.log recently, for a number of reasons. I can't help feeling that it is flawed and it's time to rethink it.
Issue 1; it contains data that isn't directly comparable
Currently it has data for at least two protocols (DNS and HTTP, three if you count HTTPS) and that is assuming you aren't doing any FTP or other 'exotic' crawling. While these share elements (notably a URL), they are not apples-to-apples comparable.
Worse still, the crawl.log includes failed fetches and even decisions to not crawl (-9998 and -500X status codes).
It seems to me that we really need multiple logs. One per protocol, plus one for failures and one for URLs that the crawler chooses not to crawl. Each with fields that are appropriate to its nature.
This resolves, for example, the problem of the status code sometimes being Heritrix specific and sometimes protocol specific. In fact, we may replace integer status codes with short text labels for increased clarity in the non-protocol logs.
For the sake of backward compatibility, these could be added while retaining the existing crawl.log. Ultimately, the basic crawl.log could be either eliminated or changed into a simpler form meant primarily as a way of gauging activity in the crawler at crawl time.
Issue 2; It doesn't account for revisit/duplicate discovery
Annotations have been used to address this, but it deserves to be treated better. This can be done by adding three new fields:
- Revisit Profile - Either '-' if not a revisit, or a suitable constant for server-not-modified and identical-payload-digest. These should not be obtuse numbers or some such, to make it easy for custom deduplication schemes to extend this as needed.
- Original Capture Time - Mandatory if the revisit profile is not '-'.
- Original URL - Either '-', to signify that the original URL is the same as the current one, or the original URL.
Issue 3; Changes are extremely difficult because tools will break
To help with this, going forward, a header specification for the new logs should be written. Either to note a version or to specify fields in use (e.g. the first line of CDX files). Possibly both.
This will allow for somewhat more flexible log formats and we should provide an ability to configure exactly which fields are written in each log.
This does place a burden on tool writers, but at least it will be a solvable issue. Currently, tools need to sniff for the few minor changes that have been made in the last eleven years, such as the change in the date format of the first field.
Annotations were added for the sake of flexibility in what the log could contain. They have, however been abused quite thoroughly. I'm in favor of dropping them entirely (not just in the logs, but excising them completely at the code level) and in their place (for data that isn't accounted for in the log spec) use the JSON style data structure "extra information" that is present but generally unused.
Very common usages of the annotations field should be promoted to dedicated fields. Notably, revisits (as discussed above) and number of fetch attempts.
Some of these fields might be optional as per issue 3.
In writing the above I've intentionally avoided more moderate/short term fixes that could be applied with less of an upset. I wanted to shake things up and hopefully get all us to reevaluate our stance on this long serving tool.
Whether the solutions I outline are used or not, the issues remain and the above is not an exhaustive list, I'm sure. It's time, indeed past time, we did something about them.
Issue 1; it contains data that isn't directly comparable
Currently it has data for at least two protocols (DNS and HTTP, three if you count HTTPS), and that is assuming you aren't doing any FTP or other 'exotic' crawling. While these share elements (notably a URL), they are not comparable apples to apples.
Worse still, the crawl.log includes failed fetches and even decisions to not crawl (-9998 and -500X status codes).
It seems to me that we really need multiple logs. One per protocol, plus one for failures and one for URLs that the crawler chooses not to crawl. Each with fields that are appropriate to its nature.
This resolves, for example, the problem of the status code sometimes being Heritrix specific and sometimes protocol specific. In fact, we may replace integer status codes with short text labels for increased clarity in the non-protocol logs.
For the sake of backward compatibility, these could be added while retaining the existing crawl.log. Ultimately, the basic crawl.log could be either eliminated or changed into a simpler form meant primarily as a way of gauging activity in the crawler at crawl time.
Issue 2; It doesn't account for revisit/duplicate discovery
Annotations have been used to address this, but revisit information deserves better treatment. This can be done by adding three new fields, illustrated below:
- Revisit Profile - Either "-" (if not a revisit) or a suitable constant for server-not-modified and identical-payload-digest revisits. These should not be obtuse numbers or some such, so that custom deduplication schemes can easily extend the set as needed.
- Original Capture Time - Mandatory if the revisit profile is not "-".
- Original URL - Either "-", to signify that the original URL is the same as the current one, or the original URL itself.
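Purely to illustrate the idea (the constants and layout here are hypothetical, not an existing Heritrix format), the three fields might read as follows for a normal fetch, a server-not-modified revisit, and an identical-payload revisit of a copy found at a different URL:
-                          -                     -
server-not-modified        2014-06-01T12:00:00Z  -
identical-payload-digest   2014-06-01T12:00:00Z  http://example.com/other-copy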
Issue 3; Changes are extremely difficult because tools will break
To help with this going forward, a header specification for the new logs should be written, either to note a version or to specify the fields in use (like the first line of CDX files), possibly both.
This will allow for somewhat more flexible log formats and we should provide an ability to configure exactly which fields are written in each log.
This does place a burden on tool writers, but at least it will be a solvable issue. Currently, tools need to sniff for the few minor changes that have been made in the last eleven years, such as the change in the date format of the first field.
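As an illustration only (the version number and field names are made up for the example), such a header might look something like:
#crawl-log-version: 2.0
#fields: timestamp status size url hop-path via mime-type revisit-profile original-capture-time original-url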
Issue 4; Annotations have been horribly abused
Annotations were added for the sake of flexibility in what the log could contain. They have, however, been abused quite thoroughly. I'm in favor of dropping them entirely (not just in the logs, but excising them completely at the code level) and, in their place (for data that isn't accounted for in the log spec), using the JSON-style "extra information" data structure that is present but generally unused.
Very common usages of the annotations field should be promoted to dedicated fields. Notably, revisits (as discussed above) and number of fetch attempts.
Some of these fields might be optional as per issue 3.
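For example, a custom processor that wants to record data not covered by the log spec could put it in that "extra information" JSON field instead of an annotation; the keys here are made up purely for the sake of illustration:
{"contentLanguage": "is", "cookieCount": 3}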
Closing thoughts
In writing the above I've intentionally avoided more moderate, short-term fixes that could be applied with less of an upset. I wanted to shake things up and hopefully get us all to reevaluate our stance on this long-serving tool.
Whether the solutions I outline are used or not, the issues remain and the above is not an exhaustive list, I'm sure. It's time, indeed past time, we did something about them.
August 15, 2014
Packaging Heritrix add-on projects
I've made several projects that add on to Heritrix. Typically, these build a tarball (or zip file) that you can explode into Heritrix's root directory, so that all the necessary JAR files, job configurations and shell scripts wind up where they are supposed to be. This works well enough, but it does impose an extra step, a double install if you will.
So I decided to see if I could improve on this and have the add-on project actually bake itself into the Heritrix distribution. Turns out, this is easy!
Step one, update the project POM to have a dependency on the root Heritrix project distribution. Like so:
<dependency>
  <groupId>org.archive.heritrix</groupId>
  <artifactId>heritrix</artifactId>
  <version>${org.archive.heritrix.version}</version>
  <classifier>dist</classifier>
  <type>tar.gz</type>
  <scope>test</scope>
</dependency>
The key there is the classifier and type.
Next, add instructions to the plugin section of the POM to unpack the above. Make sure this comes before the assembly plugin.
<!-- Unzip Heritrix distribution -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>unpack-heritrix</id>
      <goals>
        <goal>unpack-dependencies</goal>
      </goals>
      <phase>package</phase>
      <configuration>
        <outputDirectory>${project.build.directory}/heritrix</outputDirectory>
        <includeGroupIds>org.archive.heritrix</includeGroupIds>
        <excludeTransitive>true</excludeTransitive>
        <excludeTypes>pom</excludeTypes>
        <scope>test</scope>
      </configuration>
    </execution>
  </executions>
</plugin>
Now all you need to do is ensure that the assembly plugin puts the necessary files into the correct directories. This can be done by specifying the outputDirectory as follows:
<outputDirectory>
/heritrix-${org.archive.heritrix.version}/lib
</outputDirectory>
and make sure that there is a fileSet to include the entire exploded Heritrix distribution. E.g.:
<fileSet>
  <directory>target/heritrix</directory>
  <outputDirectory>/</outputDirectory>
  <includes>
    <include>**</include>
  </includes>
</fileSet>
And done. The assembly will now unpack the Heritrix distribution, add your files and pack it back up, ready for install like any other Heritrix distro.
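With the POM and assembly descriptor set up as above, a normal build should be all that's needed; the artifact name shown below is just an example and will depend on your own project and assembly descriptor:
$ mvn clean package
$ ls target/*.tar.gz
target/my-heritrix-addon-1.0-dist.tar.gz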