July 13, 2015

Customizing Heritrix Reports

It is a well known fact (at least among its users) that Heritrix comes with ten default reports. These can be generated and viewed at crawl time and will be automatically generated when the crawl is terminated. Most of these reports have been part of Heritrix from the very beginning and haven't changed all that much.

It is less well known that these reports are a part of Heritrix modular configuration structure. They can be replaced, configured (in theory) and additional reports added.

The ten, built in, reports do not offer any additional configuration option. (Although, that may change if a pull request I made is merged.) But, the most useful aspect is the ability to add reports tailored to your specific needs. Both the DeDuplicator and the CrawlRSS Heritrix add-on modules use this to surface their own reports in the UI and ensure those are written to disk at the end of a crawl.

In order to configure reports, it is necessary to edit the statisticsTracker bean in your CXML crawl configuration. That bean has a list property called reports. Each element in that list is a Java class that extends the abstract Report class in Heritrix.

That is really all there is to it. Those beans can have their own properties (although none do--yet!) and behave just like any other simple bean. To add your own, just write a class, extend Report and wire it in. Done.

One caveat. You'll notice this section is all commented-out. When the reports property is left empty, the StatisticsTracker loads up the default reports. Once you uncomment that list, the list overrides any 'default list' of reports. This means that if future versions of Heritrix change what reports are 'default', you'll need to update your configuration or miss out.

Of course, you may want to 'miss out', depending on what the change is.

My main annoyance is that the reports list requires subclasses of a class, rather than specifying an interface. This needs to change so that any interested class could implement the contract and become a report. As it is, if you have a class that already extends a class and has a reporting function, you need to create a special report class that does nothing but bridge the gap to what Heritrix needs. You can see this quite clearly in the DeDuplicator where there is a special DeDuplicatorReport class for exactly that purpose. A similar thing came up in CrawlRSS.

I've found it to be very useful to be able to surface customized reports in this manner. In addition to the two use cases I've mentioned (and are publicly available), I also use it to draw up a report on disk usage use (built into the module that monitors for out of disk space conditions) and to for a report on which regular expressions have been triggered during scoping (built into a variant of the MatchesListRegexDecideRule).

I'd had both of those reports available for years be but they had always required using the scripting console to get at. Having them just a click away and automatically written at the end of a crawl has been quite helpful.


If you have Heritrix related reporting needs that are not being met, there is a comment box below.

No comments:

Post a Comment