August 6, 2014

Inefficiencies in WARC readers

There is a hard limit to how fast you can read a WARC file, dictated by the storage mechanism. You can't process it faster than you can read it off of an HDD or SSD.

Considering how glacially slow those devices are compared to everything else, you might expect that processing a WARC (to build a CDX, for example) would take about as long as it takes to read it from disk. After all, nothing overly special is being done. GZIP decompression is fairly fast and parsing the WARC header fields is simple enough.

This is, unfortunately, not the case.

I've suspected as much ever since I generated CDXs for our entire collection some months ago. I ran into this issue again last week as I was trying to extract data from WARC files to build a deduplication index. So I set out to do some testing.

Using a randomly selected 1 GB gzipped WARC file I tried running it through the WarcReader that is in the webarchive-commons library. It averaged about 40 seconds to process this one WARC file.

I then tried reading it using the GZIPInputStream provided by the Java Standard Library and found that it took about 10 seconds to read the file. This, by the way, is consistent with the amount of time needed to read a file of that size from disk.
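A timing harness along these lines reproduces the buffered test. This is a minimal sketch: it builds a small gzipped payload in memory as a stand-in for the 1 GB WARC file, and the class name and 10 KB buffer size are my own placeholders.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipReadTiming {
    static final int BUFFER = 10 * 1024; // 10 KB blocks

    // Drain a stream in BUFFER-sized blocks; returns total decompressed bytes.
    static long readAll(InputStream in) throws IOException {
        byte[] buf = new byte[BUFFER];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real WARC: 1 MB of zeros, gzipped in memory.
        byte[] payload = new byte[1 << 20];
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }

        long start = System.nanoTime();
        long total = readAll(new GZIPInputStream(
                new ByteArrayInputStream(bos.toByteArray()), BUFFER));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(total + " bytes in " + elapsedMs + " ms");
    }
}
```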

So, why does it take the webarchive-commons tools four times as long?

I looked under the hood and saw that it was using its own input stream named GZIPMembersInputStream. However, when I tried to use that directly, I also got a run time of about 10 seconds.

Digging a bit further I noticed something interesting. Whereas I was reading the file like so:

  GZIPMembersInputStream gin = new GZIPMembersInputStream(
      new FileInputStream(file), BUFFER);

  byte[] buf = new byte[BUFFER];
  int bytesRead = 0;
  while (bytesRead != -1) {
    bytesRead = gin.read(buf);
  }
I.e. in 10 KB blocks. The webarchive-commons tools, by contrast, were (at least mostly) reading it a byte at a time. This can be seen, for example, in LaxHttpParser, line 84. Calls to read() accounted for 20 of the 40 second run time.

I changed my tests on the GZIPInputStream and GZIPMembersInputStream to read a byte at a time and, sure enough, each now took about 40 seconds to read the WARC file.
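The single-byte pattern that dominates the slow path looks, in essence, like this (a minimal sketch, not the library's actual code; the class name and payload are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ByteAtATime {
    // Drain a stream one byte per call. Every read() crosses into the
    // inflater for a single byte, which is where the per-call overhead
    // accumulates.
    static long readByByte(InputStream in) throws IOException {
        long total = 0;
        while (in.read() != -1) {
            total++;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1 << 20]; // 1 MB stand-in, gzipped in memory
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        long start = System.nanoTime();
        long total = readByByte(new GZIPInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(total + " bytes, one read() call each, in "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```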

Clearly, there is an overhead to each read action from a compressed stream, unrelated to the size of the read.
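One generic mitigation, assuming the byte-at-a-time consumers can't easily be rewritten, is to put a plain BufferedInputStream between the decompressor and the parser, so single-byte read() calls are served from an in-memory buffer that is refilled from the inflater in large chunks. A sketch (the class name and 8 KB size are my assumptions, not anything in webarchive-commons):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BufferedWrap {
    // Wrap the decompressor so byte-at-a-time consumers pay the inflater's
    // per-call cost only once per 8 KB refill instead of once per byte.
    static InputStream buffered(InputStream compressed) throws IOException {
        return new BufferedInputStream(new GZIPInputStream(compressed), 8 * 1024);
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "WARC/1.0\r\nWARC-Type: response\r\n".getBytes();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        // The consumer still reads one byte at a time, but cheaply.
        InputStream in = buffered(new ByteArrayInputStream(bos.toByteArray()));
        long total = 0;
        while (in.read() != -1) {
            total++;
        }
        System.out.println(total + " bytes read"); // 31 bytes read
    }
}
```

The catch, and likely part of why this is hard to retrofit, is that a buffer sits between the parser and the stream's notion of position, so anything that relies on exact offsets into the compressed members has to account for the buffered-but-unconsumed bytes.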

Unfortunately, the GZIPMembersInputStream and its parent classes are quite convoluted, so adding an output buffer is complicated. I barely know where to begin.

For the sake of completeness, I also tested the JWAT WARC reader. It has the same issue, although I suspect it may be easier to address there in either the ByteCountingPushBackInputStream or PushbackInputStream as they aren't nearly as convoluted as GZIPMembersInputStream in webarchive-commons.

Bottom line: byte-at-a-time reading from a compressed stream is very inefficient. A buffer (of about 7-8 KB, according to my tests) provides massive improvements. Implementing this in webarchive-commons is going to be tricky, given the legacy code there, but the performance gains would seem to justify the effort.
