August 28, 2014

Rethinking Heritrix's crawl.log

I've been looking a lot at the Heritrix crawl.log recently, for a number of reasons. I can't help feeling that it is flawed and it's time to rethink it. 

Issue 1; It contains data that isn't directly comparable

Currently it has data for at least two protocols (DNS and HTTP, three if you count HTTPS), and that is assuming you aren't doing any FTP or other 'exotic' crawling. While these share elements (notably a URL), they are not an apples-to-apples comparison.

Worse still, the crawl.log includes failed fetches and even decisions to not crawl (-9998 and -500X status codes).

It seems to me that we really need multiple logs. One per protocol, plus one for failures and one for URLs that the crawler chooses not to crawl. Each with fields that are appropriate to its nature.

This resolves, for example, the problem of the status code sometimes being Heritrix specific and sometimes protocol specific. In fact, we may replace integer status codes with short text labels for increased clarity in the non-protocol logs.

For the sake of backward compatibility, these could be added while retaining the existing crawl.log. Ultimately, the basic crawl.log could be either eliminated or changed into a simpler form meant primarily as a way of gauging activity in the crawler at crawl time.
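To make the split concrete, here is a minimal sketch of how existing crawl.log lines might be routed into separate logs. The log file names are hypothetical, and it assumes whitespace-separated lines with the status code in the second field and the URL in the fourth. It also assumes that -9998 and the -500X range mark decisions not to crawl, and that any other non-positive code is a failure.

```python
from urllib.parse import urlsplit

# Assumption: -9998 and -5000..-5009 ("-500X") mean the crawler chose not to crawl.
NOT_CRAWLED = {-9998} | set(range(-5009, -4999))


def target_log(status: int, uri: str) -> str:
    """Return the (hypothetical) log file an entry would be written to."""
    if status in NOT_CRAWLED:
        return "not-crawled.log"
    if status <= 0:
        # Assumption: every other non-positive code is a failed fetch.
        return "failures.log"
    scheme = urlsplit(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http.log"
    # One log per remaining protocol, e.g. dns.log or ftp.log.
    return scheme + ".log"


def route_line(line: str) -> str:
    """Route one crawl.log line, assuming status is field 2 and URL is field 4."""
    fields = line.split()
    return target_log(int(fields[1]), fields[3])
```

Each target log could then carry only the fields that make sense for it, which is the point of the split.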

Issue 2; It doesn't account for revisit/duplicate discovery

Annotations have been used to address this, but it deserves to be treated better. This can be done by adding three new fields:

  • Revisit Profile - Either - if not a revisit, or a suitable constant for server-not-modified or identical-payload-digest. These should not be obtuse numbers or some such, so that custom deduplication schemes can easily extend them as needed.
  • Original Capture Time - Mandatory if revisit profile is not -
  • Original URL - Either - to signify that the original URL is the same as the current one, or the original URL itself
These would only be in the logs for protocols. Possibly omitted in the DNS protocol log.
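As a rough sketch of how the three fields could behave, the class below is hypothetical (the names Revisit and fields are mine, not Heritrix's). It assumes that all three fields are written as - when the entry is not a revisit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Revisit:
    """Hypothetical holder for the three proposed revisit fields."""
    profile: Optional[str] = None        # e.g. "identical-payload-digest"; None if not a revisit
    original_time: Optional[str] = None  # capture time of the original
    original_url: Optional[str] = None   # None means "same as the current URL"

    def fields(self) -> List[str]:
        """Return the three log fields, using "-" as the placeholder."""
        if self.profile is None:
            # Assumption: a non-revisit writes "-" in all three fields.
            return ["-", "-", "-"]
        if not self.original_time:
            raise ValueError("original capture time is mandatory for a revisit")
        return [self.profile, self.original_time, self.original_url or "-"]
```

Because the profile is a plain text label, a custom deduplication scheme could add its own value without touching the other two fields.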

Issue 3; Changes are extremely difficult because tools will break

To help with this, going forward, a header specification for the new logs should be written. Either to note a version or to specify fields in use (e.g. the first line of CDX files). Possibly both.

This will allow for somewhat more flexible log formats and we should provide an ability to configure exactly which fields are written in each log.

This does place a burden on tool writers, but at least it will be a solvable issue. Currently, tools need to sniff for the few minor changes that have been made in the last eleven years, such as the change in the date format of the first field.
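From the tool writer's side, a header that names the fields might be consumed roughly like this. The "#fields" header syntax and the field names are purely an assumption for illustration; no such specification exists yet.

```python
from typing import Dict, Iterable, Iterator


def read_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log line, keyed by the names given in the header.

    Assumes a hypothetical first line such as "#fields timestamp status uri".
    """
    it = iter(lines)
    header = next(it, "").split()
    if not header or header[0] != "#fields":
        raise ValueError("log does not start with a field header")
    names = header[1:]
    for line in it:
        if not line.strip():
            continue  # ignore blank lines
        values = line.split()
        if len(values) != len(names):
            raise ValueError("field count does not match header: " + line)
        yield dict(zip(names, values))
```

A tool written this way looks fields up by name, so adding, removing or reordering fields in the log configuration would not break it.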

Issue 4; Annotations have been horribly abused

Annotations were added for the sake of flexibility in what the log could contain. They have, however, been abused quite thoroughly. I'm in favor of dropping them entirely (not just in the logs, but excising them completely at the code level) and, in their place (for data that isn't accounted for in the log spec), using the JSON-style "extra information" data structure that is present but generally unused.

Very common usages of the annotations field should be promoted to dedicated fields. Notably, revisits (as discussed above) and number of fetch attempts.

Some of these fields might be optional as per issue 3.
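A migration along these lines might look like the sketch below. It assumes the annotations field is a comma-separated list, that a fetch-attempt annotation looks like "3t", and that anything left over is kept under a made-up "other" key in the JSON extra information. All three are assumptions for the sake of illustration.

```python
import json
import re
from typing import List, Tuple

# Assumed form of a fetch-attempt annotation, e.g. "3t" for three tries.
TRIES = re.compile(r"^(\d+)t$")


def promote(annotations: str) -> Tuple[int, str]:
    """Split an annotations field into (fetch attempts, JSON extra info)."""
    attempts = 1
    leftover: List[str] = []
    for tag in annotations.split(","):
        if tag in ("", "-"):
            continue  # empty field or placeholder
        match = TRIES.match(tag)
        if match:
            attempts = int(match.group(1))  # promoted to a dedicated field
        else:
            leftover.append(tag)
    # "other" is a hypothetical key; unrecognised tags are preserved, not dropped.
    extra = json.dumps({"other": leftover}) if leftover else "{}"
    return attempts, extra
```

For example, promote("3t,customTag") would give 3 attempts and keep customTag in the extra information.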

Closing thoughts 

In writing the above I've intentionally avoided more moderate/short-term fixes that could be applied with less of an upset. I wanted to shake things up and hopefully get all of us to reevaluate our stance on this long-serving tool.

Whether the solutions I outline are used or not, the issues remain, and I'm sure the above is not an exhaustive list. It's time, indeed past time, we did something about them.
