August 28, 2014

Rethinking Heritrix's crawl.log

I've been looking a lot at the Heritrix crawl.log recently, for a number of reasons. I can't help feeling that it is flawed and it's time to rethink it. 

Issue 1; It contains data that isn't directly comparable


Currently it contains data for at least two protocols (DNS and HTTP; three if you count HTTPS), and that is assuming you aren't doing any FTP or other 'exotic' crawling. While these share elements (notably a URL), they are not directly comparable, apples to apples.

Worse still, the crawl.log includes failed fetches and even decisions to not crawl (-9998 and -500X status codes).

It seems to me that we really need multiple logs: one per protocol, plus one for failures and one for URLs that the crawler chooses not to crawl, each with fields appropriate to its nature.

This resolves, for example, the problem of the status code sometimes being Heritrix-specific and sometimes protocol-specific. In fact, we might replace integer status codes with short text labels in the non-protocol logs for increased clarity.

For the sake of backward compatibility, these could be added while retaining the existing crawl.log. Ultimately, the basic crawl.log could be either eliminated or changed into a simpler form meant primarily as a way of gauging activity in the crawler at crawl time.


Issue 2; It doesn't account for revisit/duplicate discovery


Annotations have been used to address this, but it deserves better treatment. That can be done by adding three new fields:

  • Revisit Profile - Either "-" if the capture is not a revisit, or a suitable constant such as server-not-modified or identical-payload-digest. These should not be obtuse numbers or the like, so that custom deduplication schemes can easily extend the set as needed.
  • Original Capture Time - Mandatory if the revisit profile is not "-".
  • Original URL - Either "-", to signify that the original URL is the same as the current one, or the original URL itself.
These would only appear in the per-protocol logs, and could possibly be omitted from the DNS log. A rough illustration of such entries follows.
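
Purely as a placeholder sketch (the field order, timestamp format and exact spellings below are not a concrete proposal), the tail end of three log lines might then look like this: a server-not-modified revisit of the same URL, an identical-payload-digest revisit whose original capture was at a different URL, and an ordinary non-revisit capture.

  ... server-not-modified      20140601120000 -
  ... identical-payload-digest 20140512083000 http://example.com/old/path/page.html
  ... -                        -              -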


Issue 3; Changes are extremely difficult because tools will break


To help with this going forward, a header specification for the new logs should be written, either to note a version or to specify the fields in use (as the first line of a CDX file does), or possibly both.

This will allow for somewhat more flexible log formats, and we should provide the ability to configure exactly which fields are written to each log.
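
For illustration only (the directive names here are invented, not a proposed syntax), such a header could be as simple as a line or two at the top of each log:

  #log-version: 2.0
  #fields: timestamp status size url discovery-path referrer mime-type digest revisit-profile original-capture-time original-url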

This does place a burden on tool writers, but at least it will be a solvable issue. Currently, tools need to sniff for the few minor changes that have been made in the last eleven years, such as the change in the date format of the first field.


Issue 4; Annotations have been horribly abused


Annotations were added for the sake of flexibility in what the log could contain. They have, however, been abused quite thoroughly. I'm in favor of dropping them entirely (not just from the logs, but excising them completely at the code level) and, in their place (for data that isn't accounted for in the log spec), using the JSON-style "extra information" data structure that is already present but generally unused.
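
For example (the keys here are invented purely for illustration), a module could record its data as a small JSON object on the log line instead of an opaque annotation:

  {"politenessDelayMs": 5000, "dedupMethod": "digest-only"}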

Very common uses of the annotations field should be promoted to dedicated fields, notably revisits (as discussed above) and the number of fetch attempts.

Some of these fields might be optional as per issue 3.


Closing thoughts 


In writing the above I've intentionally avoided more moderate, short-term fixes that could be applied with less of an upset. I wanted to shake things up and hopefully get us all to reevaluate our stance on this long-serving tool.

Whether the solutions I outline are used or not, the issues remain, and I'm sure the above is not an exhaustive list. It's time, indeed past time, we did something about them.

August 15, 2014

Packaging Heritrix add-on projects

I've made several projects that add on to Heritrix. Typically, these build a tarball (or zip file) that you can explode into Heritrix's root directory, so that all the necessary JAR files, job configurations and shell scripts wind up where they are supposed to be. This works well enough, but it does impose an extra step, a double install if you will.

So I decided to see if I could improve on this and have the add-on project actually bake itself into the Heritrix distribution. Turns out, this is easy!

Step one, update the project POM to have a dependency on the root Heritrix project distribution. Like so:

  <dependency>
    <groupId>org.archive.heritrix</groupId>
    <artifactId>heritrix</artifactId>
    <version>${org.archive.heritrix.version}</version>
    <classifier>dist</classifier>
    <type>tar.gz</type>
    <scope>test</scope>
  </dependency>


The key there is the classifier and type.
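
This assumes that the ${org.archive.heritrix.version} property is defined somewhere in the POM, for instance in a properties block (the version below is just an example; use whichever Heritrix release you build against):

  <properties>
    <org.archive.heritrix.version>3.2.0</org.archive.heritrix.version>
  </properties>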

Next, add instructions to the plugins section of the POM to unpack the above. Make sure this comes before the assembly plugin.

  <!-- Unzip Heritrix distribution -->
  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-dependency-plugin</artifactId>
    <executions>
      <execution>
        <id>unpack-heritrix</id>
        <goals>
          <goal>unpack-dependencies</goal>
        </goals>
        <phase>package</phase>
        <configuration>
          <outputDirectory>${project.build.directory}/heritrix</outputDirectory>
          <includeGroupIds>org.archive.heritrix</includeGroupIds>
          <excludeTransitive>true</excludeTransitive>
          <excludeTypes>pom</excludeTypes>
          <scope>test</scope>
        </configuration>
      </execution>
    </executions>
  </plugin>


Now all you need to do is ensure that the assembly plugin puts the necessary files into the correct directories. This can be done by specifying the outputDirectory as follows:

  <outputDirectory>
    /heritrix-${org.archive.heritrix.version}/lib
  </outputDirectory>

and make sure that there is a fileSet to include the entire exploded Heritrix distribution. E.g.:

  <fileSet>
    <directory>target/heritrix</directory>
    <outputDirectory>/</outputDirectory>
    <includes>
      <include>**</include>
    </includes>
  </fileSet>
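
For context, both fragments above belong in the project's assembly descriptor. A minimal descriptor along these lines should do the trick, though the id, format and dependencySet details are assumptions you will want to adapt to your own project:

  <assembly>
    <id>dist</id>
    <formats>
      <format>tar.gz</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
      <dependencySet>
        <!-- Put this project's JAR and its runtime dependencies into Heritrix's lib -->
        <outputDirectory>/heritrix-${org.archive.heritrix.version}/lib</outputDirectory>
        <scope>runtime</scope>
      </dependencySet>
    </dependencySets>
    <fileSets>
      <fileSet>
        <!-- Include the Heritrix distribution unpacked during the package phase -->
        <directory>target/heritrix</directory>
        <outputDirectory>/</outputDirectory>
        <includes>
          <include>**</include>
        </includes>
      </fileSet>
    </fileSets>
  </assembly>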


And done. The assembly will now unpack the Heritrix distribution, add your files and pack it back up, ready for install like any other Heritrix distro.

August 12, 2014

JWAT WARC reading speed doubled

This is a follow-up to my last post about inefficiencies in WARC readers.

As I noted there, reading a WARC file straight through (via a GZIP reader, i.e. decompressing it but not parsing the content) takes about 10 seconds, whereas iterating over the same file using the available Java tools (webarchive-commons, JWAT) takes about 40 seconds.

More specifically:

JWAT GzipReader: 10s
webarchive-commons WARCReader: 42s
JWAT WarcReader: 45s

There is variability of a few hundred milliseconds between runs, so I've rounded to the nearest second. Note also that the WARC readers were run immediately after the GzipReader and would have benefited from any OS caching.

In my earlier post I speculated that adding a read buffer to JWAT's ByteCountingPushBackInputStream would likely improve the efficiency considerably. I proceeded to test this hypothesis. New run time:

JWAT WarcReader for GZ: 21s

So, JWAT goes from being slightly slower than webarchive-commons, to being twice as fast.

The class still passes all its unit tests and as far as I can tell it is still functionally equivalent.

I've forked the JWAT project to my account on GitHub. You can see the modified file here: ByteCountingPushBackInputStream.java
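
The change itself is best read in the diff, but the general idea is simple. Here is a minimal sketch of that idea (not the actual patch; the class name is made up): serve single-byte reads from an internal buffer that is refilled with bulk reads from the wrapped stream, which is essentially what java.io.BufferedInputStream does.

  import java.io.IOException;
  import java.io.InputStream;

  // Sketch only: single-byte reads are served from an in-memory buffer,
  // so the wrapped (decompressing) stream only ever sees large, bulk reads.
  public class BufferedReadSketch extends InputStream {
      private final InputStream in;
      private final byte[] buffer;
      private int pos = 0;
      private int limit = 0;

      public BufferedReadSketch(InputStream in, int bufferSize) {
          this.in = in;
          this.buffer = new byte[bufferSize];
      }

      @Override
      public int read() throws IOException {
          if (pos >= limit) {
              limit = in.read(buffer);  // one bulk read refills the buffer
              pos = 0;
              if (limit == -1) {
                  return -1;            // end of stream
              }
          }
          return buffer[pos++] & 0xff;  // cheap in-memory byte read
      }
  }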

There are no doubt greater gains to be had, but they'll require a deeper understanding of the code than I possess at the moment.

The downside to this change is, of course, a slight uptick in memory use, as a 10 KB buffer is allocated every time an instance of the ByteCountingPushBackInputStream is created (and that happens more often than just to wrap the core GZIP read). Still, it seems a small price to pay for the speed increase.

I have no doubt that improvements can also be made in webarchive-commons, but it is far less clear to me where those changes should be made.

August 6, 2014

Inefficiencies in WARC readers

There is a hard limit to how fast you can read a WARC file, dictated by the storage mechanism. You can't process it faster than you can read it off of an HDD or SSD.

Considering how glacially slow those devices are compared to everything else, you might expect that processing a WARC (to build a CDX, for example) would take about as long as it takes to read it from disk. After all, nothing overly special is being done: GZIP decompression is fairly cheap, and parsing the WARC header fields is simple enough.

This is, unfortunately, not the case.

I've suspected as much ever since I generated CDXs for our entire collection some months ago. I ran into this issue again last week as I was trying to extract data from WARC files to build a deduplication index. So I set out to do some testing.

Using a randomly selected 1 GB gzipped WARC file, I tried running it through the WARCReader in the webarchive-commons library. It averaged about 40 seconds to process this one WARC file.

I then tried reading it using the GZIPInputStream provided by the Java Standard Library and found that it took about 10 seconds to read the file. This, by the way, is consistent with the amount of time needed to read a file of that size from disk.

So, why does it take the webarchive-commons tools four times as long?

I looked under the hood and saw that it was using its own input stream named GZIPMembersInputStream. However, when I tried to use that directly, I also got a run time of about 10 seconds.

Digging a bit further I noticed something interesting. Whereas I was reading the file like so:

  GZIPMembersInputStream in = new GZIPMembersInputStream(
      new FileInputStream(file), BUFFER);

  byte[] buf = new byte[BUFFER];
  int bytesRead = in.read(buf);
  while (bytesRead != -1) {
    bytesRead = in.read(buf);
  }


That is, I was reading in 10 KB blocks, whereas the webarchive-commons tools were (at least mostly) reading a byte at a time. This can be seen, for example, in LaxHttpParser, line 84. Calls to read() accounted for 20 of the 40-second run time.

I changed my tests of the GZIPInputStream and GZIPMembersInputStream to read a byte at a time and, sure enough, it now took about 40 seconds to read the WARC file.

Clearly, there is an overhead to each read action from a compressed stream, unrelated to the size of the read.
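
For anyone who wants to reproduce that comparison in isolation, a crude, self-contained benchmark using only the JDK's GZIPInputStream (the file path is passed as an argument) looks something like this:

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.util.zip.GZIPInputStream;

  public class GzipReadBenchmark {
      public static void main(String[] args) throws Exception {
          String file = args[0];  // any gzipped file, e.g. a .warc.gz

          // Read through the stream in 10 KB blocks.
          long t0 = System.currentTimeMillis();
          try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
              byte[] buf = new byte[10 * 1024];
              while (in.read(buf) != -1) {
                  // just reading through
              }
          }
          System.out.println("10 KB reads: " + (System.currentTimeMillis() - t0) + " ms");

          // Read through the same stream one byte per call.
          long t1 = System.currentTimeMillis();
          try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
              while (in.read() != -1) {
                  // one byte per call
              }
          }
          System.out.println("1 byte reads: " + (System.currentTimeMillis() - t1) + " ms");
      }
  }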

Unfortunately, the GZIPMembersInputStream and its parent classes are quite convoluted, so adding an output buffer is complicated. I barely know where to begin.

For the sake of completeness, I also tested the JWAT WARC reader. It has the same issue, although I suspect it may be easier to address there in either the ByteCountingPushBackInputStream or PushbackInputStream as they aren't nearly as convoluted as GZIPMembersInputStream in webarchive-commons.

Bottom line: byte-at-a-time reading from a compressed stream is very inefficient. A buffer (of about 7-8 KB according to my tests) provides massive improvements. Implementing this in webarchive-commons is going to be tricky, given the legacy code there, but the performance gains would seem to justify the effort.