July 13, 2015

Customizing Heritrix Reports

It is a well-known fact (at least among its users) that Heritrix comes with ten default reports. These can be generated and viewed at crawl time and are automatically generated when the crawl is terminated. Most of these reports have been part of Heritrix from the very beginning and haven't changed all that much.

It is less well known that these reports are part of Heritrix's modular configuration structure. They can be replaced and configured (in theory), and additional reports can be added.

The ten built-in reports do not offer any additional configuration options (although that may change if a pull request I made is merged). The most useful aspect, though, is the ability to add reports tailored to your specific needs. Both the DeDuplicator and the CrawlRSS Heritrix add-on modules use this to surface their own reports in the UI and ensure those are written to disk at the end of a crawl.

In order to configure reports, it is necessary to edit the statisticsTracker bean in your CXML crawl configuration. That bean has a list property called reports. Each element in that list is a Java class that extends the abstract Report class in Heritrix.

That is really all there is to it. Those beans can have their own properties (although none do--yet!) and behave just like any other simple bean. To add your own, just write a class, extend Report and wire it in. Done.
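As a rough illustration of "write a class, extend Report", here is a minimal sketch. It does not depend on the Heritrix jar: the Report and StatisticsTracker types below are stand-ins, and the two-method shape of Report (write and getFilename) is an assumption that should be checked against the Heritrix version you run. NoteReport, its note property and ReportPreview are hypothetical names made up for this example.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Stand-in for Heritrix's StatisticsTracker, only here so the sketch compiles alone.
class StatisticsTracker {
}

// Stand-in for Heritrix's abstract Report class. The two-method shape is an
// assumption; check the Report class in the Heritrix version you run.
abstract class Report {
    public abstract void write(PrintWriter writer, StatisticsTracker stats);

    public abstract String getFilename();
}

// Hypothetical custom report with one bean-style property that could be set
// from the crawl configuration like any other simple bean property.
class NoteReport extends Report {
    private String note = "nothing to report";

    public String getNote() {
        return note;
    }

    public void setNote(String note) {
        this.note = note;
    }

    @Override
    public void write(PrintWriter writer, StatisticsTracker stats) {
        writer.println("Note report");
        writer.println("note: " + note);
        writer.flush();
    }

    @Override
    public String getFilename() {
        return "note-report.txt";
    }
}

// Small helper that renders a report to a string, for trying the sketch out.
class ReportPreview {
    static String render(Report report) {
        StringWriter buffer = new StringWriter();
        report.write(new PrintWriter(buffer), new StatisticsTracker());
        return buffer.toString();
    }
}
```

In a real module the class would extend the Report class from Heritrix itself and be added as a bean to the reports list of the statisticsTracker.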

One caveat. In the default crawl configuration that ships with Heritrix, the reports property is entirely commented out. When the reports property is left empty, the StatisticsTracker loads up the default reports. Once you uncomment that list, it overrides any 'default list' of reports. This means that if future versions of Heritrix change what reports are 'default', you'll need to update your configuration or miss out.

Of course, you may want to 'miss out', depending on what the change is.

My main annoyance is that the reports list requires subclasses of a class, rather than specifying an interface. This needs to change so that any interested class could implement the contract and become a report. As it is, if you have a class that already extends a class and has a reporting function, you need to create a special report class that does nothing but bridge the gap to what Heritrix needs. You can see this quite clearly in the DeDuplicator where there is a special DeDuplicatorReport class for exactly that purpose. A similar thing came up in CrawlRSS.
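The bridging pattern described here can be sketched as follows. This is not the actual DeDuplicatorReport code; DuplicateCounter, BaseProcessor and DuplicateCounterReport are hypothetical names, and Report and StatisticsTracker are stand-ins whose shape is assumed rather than taken from the Heritrix source.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Stand-ins for the Heritrix types (assumed shape), so the sketch compiles alone.
class StatisticsTracker {
}

abstract class Report {
    public abstract void write(PrintWriter writer, StatisticsTracker stats);

    public abstract String getFilename();
}

// Hypothetical superclass that the module is already committed to.
class BaseProcessor {
}

// Hypothetical module: it already extends something else, so it cannot also
// extend Report, even though it knows how to describe its own state.
class DuplicateCounter extends BaseProcessor {
    private long duplicates;

    void recordDuplicate() {
        duplicates++;
    }

    String report() {
        return "duplicates found: " + duplicates;
    }
}

// The bridge: a Report subclass that does nothing but delegate to the module.
class DuplicateCounterReport extends Report {
    private DuplicateCounter counter;

    public void setCounter(DuplicateCounter counter) {
        this.counter = counter;
    }

    @Override
    public void write(PrintWriter writer, StatisticsTracker stats) {
        writer.println(counter.report());
        writer.flush();
    }

    @Override
    public String getFilename() {
        return "duplicate-counter-report.txt";
    }
}

// Helper that renders the bridged report to a string, for trying the sketch out.
class BridgePreview {
    static String render(Report report) {
        StringWriter buffer = new StringWriter();
        report.write(new PrintWriter(buffer), new StatisticsTracker());
        return buffer.toString();
    }
}
```

If the reports list accepted an interface instead, DuplicateCounter could implement it directly and the bridge class would not be needed.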

I've found it very useful to be able to surface customized reports in this manner. In addition to the two use cases I've mentioned (both of which are publicly available), I also use it to draw up a report on disk usage (built into the module that monitors for out-of-disk-space conditions) and a report on which regular expressions have been triggered during scoping (built into a variant of the MatchesListRegexDecideRule).

I'd had both of those reports available for years, but they had always required using the scripting console to get at. Having them just a click away and automatically written at the end of a crawl has been quite helpful.

If you have Heritrix-related reporting needs that are not being met, there is a comment box below.

July 1, 2015

Leap second and web crawling

A leap second was added last midnight. This is only the third time that has happened since I started doing web crawls, and the first time it happened while I had a crawl running. So, I decided to look into any possible effects or ramifications a leap second might have for web archiving.

Spoiler alert: there aren't really any. At least not for Heritrix writing WARCs.

Heritrix is typically run on a Unix-like system (usually Linux). On those systems, the leap second is implemented by repeating the last second of the day. That is, 23:59:59 comes around twice. This effectively means that the clock gets set back by a second when the leap second is inserted.

Fortunately, Heritrix does not care if time gets set back. My crawl.log did show this event quite clearly as the following excerpt shows (we were crawling about 37 URLs/second at the time):

2015-06-30T23:59:59.899Z 404 19155 http://sudurnes.dv.is/media/cache/29/74/29747ef7a4574312a4fc44d117148790.jpg LLLLLE http://sudurnes.dv.is/folk/2013/6/21/slagurinn-tok-mig/ text/html #011 2015063023595
2015-06-30T23:59:59.915Z 404 242 http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-ford-econoline-v8-351-w/textarea.bbp-the-content ELRLLLLLLLX http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-for
2015-06-30T23:59:59.936Z 200 3603 http://foldaskoli.is/myndasafn/index.php?album=2011_til_2012/47-%C3%9Alflj%C3%B3tsvatn&image=dsc01729-2372.jpg LLLLLLLLL http://foldaskoli.is/myndasafn/index.php?album=
2015-06-30T23:59:59.019Z 200 42854 http://baikal.123.is/themes/Nature/images/header.jpg PLLPEE http://baikal.123.is/ottSupportFiles/getThemeCss.aspx?id=19&g=6843&ver=2 image/jpeg #024 20150630235959985+-
2015-06-30T23:59:59.025Z 200 21520 http://bekka.bloggar.is/sida/37229/ LLLLL http://bekka.bloggar.is/ text/html #041 20150630235959986+13 sha1:C2ZF67KFGUDFVPV46CPR57J45YZRI77U http://bloggar.is/ - {"warc
2015-06-30T23:59:59.040Z 200 298365 http://www.birds.is/leikskoli/images/3072/img_2771__large_.jpg LLRLLLLLLLLLL http://www.birds.is/leikskoli/?pageid=3072 image/jpeg #005 20150630235956420+2603 sha1:F65B

There may be some impact on tools parsing your logs if they expect the timestamps to, effectively, be in order. But I'm unaware of any tools that make that assumption.
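To see whether a log contains such a backwards step, something along these lines could be used. It is only a sketch: it assumes that each log line starts with an ISO 8601 UTC timestamp followed by a space, as in the excerpt above, and LogOrderCheck and findBackwardSteps are hypothetical names.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class LogOrderCheck {
    // Returns the indexes of lines whose leading timestamp is earlier than
    // the timestamp of the line before them.
    static List<Integer> findBackwardSteps(List<String> lines) {
        List<Integer> backwardSteps = new ArrayList<>();
        Instant previous = null;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // Assumption: the timestamp is the first space-separated field.
            int space = line.indexOf(' ');
            String field = space < 0 ? line : line.substring(0, space);
            Instant current = Instant.parse(field);
            if (previous != null && current.isBefore(previous)) {
                backwardSteps.add(i);
            }
            previous = current;
        }
        return backwardSteps;
    }
}
```

Run against the six lines of the excerpt, this flags only the fourth line (index 3), where the clock goes from .936 back to .019.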

But, what about replay?

The current WARC spec calls for using timestamps with a resolution of one second. This means that all the URLs captured during the leap second will get the same value as those captured during the preceding second. No assumptions can be made about the order in which these URLs were captured, any more than you can make assumptions about the order of URLs captured during any ordinary single second. It doesn't really change anything that this period of uncertainty now spans two seconds instead of one. The effective level of uncertainty remains about the same.
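A small sketch of why the ordering is lost at one-second resolution. It uses plain java.time rather than anything from Heritrix, and WarcDateSketch and warcDate are hypothetical names; the only assumption is that the date is written as a UTC timestamp truncated to whole seconds.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

class WarcDateSketch {
    // Formats an instant at one-second resolution by dropping the
    // sub-second part, giving values such as 2015-06-30T23:59:59Z.
    static String warcDate(Instant captured) {
        return captured.truncatedTo(ChronoUnit.SECONDS).toString();
    }
}
```

Given the log timestamps 2015-06-30T23:59:59.936Z (before the clock was set back) and 2015-06-30T23:59:59.019Z (after), both come out as 2015-06-30T23:59:59Z, so nothing in the date itself says which capture came first.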

Sidenote. The order of the captured URLs in the WARC may be indicative of crawl order, but that is not something that can be relied on.

There is actually a proposal for improving the resolution of WARC dates. You can review it on the WARC review GitHub issue tracker. If adopted, a leap second event would mean that the WARCs actually contain incorrect information.

The fix to that would be to ensure that the leap second is encoded as 23:59:60 as per the official UTC spec. But that seems unlikely to happen as it would require changes to Unix timekeeping or using non-system timekeeping in the crawler.

Perhaps it is best to just leave the WARC date resolution at one second.