When we put our collection online back in 2009 we built our own indexer that consumed these crawl logs so we could include these items. It worked very well at the time.
As our collection grew, we eventually hit scaling issues with the custom indexer. Ultimately, it was abandoned two years ago when we upgraded to Wayback 1.8 and moved to a CDX index instead. The only casualties were the early deduplication/revisit records from our ARC days, which were no longer included.
So, for the last two years, I've had a task on my to-do list: create a program that consumes all of those crawl logs and spits out WARC files containing the revisit records that are missing from the ARC files.
For a while I toyed with the idea of incorporating this in a wider ARC to WARC migration effort. But it's become clear that such a migration just isn't worthwhile. At least not yet.
Recently I finally made some time to address this. I figured it shouldn't take much time to implement. Basically, the program needs to do two things:
- Ingest and parse a crawl.log file from Heritrix, making it possible to iterate over its lines and access specific fields. As the DeDuplicator already does this, it was a minor task to adapt the code.
- For each line that represents a deduplicated URI, write a revisit record to a WARC file (a rough sketch of both steps follows below).
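To give a feel for those two steps, here is a minimal sketch in Java. The field positions follow the standard crawl.log layout, but the "duplicate:" annotation marker is an assumption (check the annotations your DeDuplicator version actually writes), and writeRevisitRecord is a hypothetical placeholder for the WARC writing discussed further down.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CrawlLogRevisits {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Standard crawl.log layout: timestamp, status, size, URI, hop path,
                // referrer, mime type, thread, fetch timestamps, digest, source tag,
                // annotations. Split with a limit so the annotations field stays whole.
                String[] fields = line.trim().split("\\s+", 12);
                if (fields.length < 12) {
                    continue; // malformed or truncated line
                }
                String timestamp = fields[0];
                String uri = fields[3];
                String digest = fields[9];
                String annotations = fields[11];
                // Assumption: deduplicated URIs carry a "duplicate:" annotation.
                if (annotations.contains("duplicate:")) {
                    writeRevisitRecord(uri, digest, timestamp); // hypothetical helper
                }
            }
        }
    }

    private static void writeRevisitRecord(String uri, String digest, String timestamp) {
        // Placeholder; see the JWAT sketch further down.
    }
}
```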
Boom, done. Can knock that out in an hour, easy.
Or, I should be able to.
It turns out that writing WARCs is painfully tedious. The libraries require you to understand the structure of a WARC file very well and do nothing to guide you. There is also no useful documentation on this subject. Your best bet is to find code examples, but even those can be hard to find and may not directly address your needs.
I tried both the webarchive-commons library and JWAT. Both were tedious, but JWAT was less so. Both require you to understand exactly which header fields need to be written for each record type, to know that you need to write a warcinfo record first, and so on. At least JWAT made it fairly simple to configure the header fields.
Consulting both existing WARC files and the WARC spec, I was able to put all the pieces together in about half a day using JWAT.
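To give a feel for what that looks like, here is a minimal, hedged sketch of how the JWAT calls fit together: a warcinfo record first, then one revisit record per deduplicated URI. All URIs, dates and digests are placeholders, and the exact method signatures should be checked against the JWAT release you are using.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

import org.jwat.warc.WarcRecord;
import org.jwat.warc.WarcWriter;
import org.jwat.warc.WarcWriterFactory;

public class RevisitWriterSketch {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("revisits-00000.warc.gz")) {
            // true = gzip each record, as Heritrix does by default
            WarcWriter writer = WarcWriterFactory.getWriter(out, true);

            // A warcinfo record should come first in each WARC file.
            WarcRecord info = WarcRecord.createRecord(writer);
            info.header.addHeader("WARC-Type", "warcinfo");
            info.header.addHeader("WARC-Date", "2015-01-01T00:00:00Z");
            info.header.addHeader("WARC-Record-ID", "<urn:uuid:" + UUID.randomUUID() + ">");
            info.header.addHeader("Content-Type", "application/warc-fields");
            byte[] fields = "software: crawl-log-revisit-tool\r\n".getBytes("UTF-8");
            info.header.addHeader("Content-Length", Long.toString(fields.length));
            writer.writeHeader(info);
            writer.writePayload(fields);
            writer.closeRecord();

            // One revisit record per deduplicated crawl.log line (placeholder values).
            WarcRecord revisit = WarcRecord.createRecord(writer);
            revisit.header.addHeader("WARC-Type", "revisit");
            revisit.header.addHeader("WARC-Target-URI", "http://example.com/");
            revisit.header.addHeader("WARC-Date", "2015-01-01T00:00:00Z");
            revisit.header.addHeader("WARC-Record-ID", "<urn:uuid:" + UUID.randomUUID() + ">");
            revisit.header.addHeader("WARC-Payload-Digest", "sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
            revisit.header.addHeader("WARC-Profile",
                    "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest");
            revisit.header.addHeader("Content-Length", "0");
            writer.writeHeader(revisit);
            writer.closeRecord();

            writer.close();
        }
    }
}
```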
And that's when I realized that JWAT does not track the number of bytes written out. That means I can't split the writing up to make "even sized" WARCs like Heritrix does (usually 1 GiB/file).
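In principle you can keep count yourself, for instance by wrapping the target OutputStream in a small byte-counting stream and checking the total between records before deciding whether to roll over to a new file. A rough, untested sketch of that idea (Apache Commons IO ships a ready-made CountingOutputStream that does much the same):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Counts every byte passed to the underlying stream so the caller can check,
// between WARC records, whether the current file has reached its target size.
public class ByteCountingOutputStream extends FilterOutputStream {
    private long count = 0;

    public ByteCountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        count += len;
    }

    public long getCount() {
        return count;
    }
}
```

But that is exactly the kind of bookkeeping the library ought to be doing for me.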
Darn, I need to rewrite this, after all, using webarchive-commons.
**Sigh**
It seems to me that such an important format should have better library support. A better library would save a lot of time and effort.
By saving effort, we may also wind up saving the format itself. The danger that someone will create a widely used program that writes invalid WARCs is very real. If such invalid WARCs proliferate, that can greatly undermine the usefulness of the format.
It is important to make it easy to do the right thing. Right now, even someone like myself, who is very familiar with WARC, needs to double- and triple-check every step of a very simple WARC-writing program. Never mind if you need to do something a little advanced.
Someone with less knowledge and, perhaps, less motivation to "get it right" could all too easily write something "good enough".
It is important to demystify the WARC format. Good library support would be an ideal start.
Totally true. If you don't get the format right you cripple how your data can be used; just think of losing the ability to use your crawled data with a lot of apps.
https://bitbucket.org/nclarkekb/jwat/src/f7c7c17fcd4f71b9013ea2cd5477a5828f9868f9/jwat-warc/src/main/java/org/jwat/warc/WarcFileWriter.java?at=default&fileviewer=file-view-default
This handles automatic splitting/naming of WARC files when writing large amounts of data. These files were added recently to help with writing ARC/WARC files.
Nobody tells me these things, so I have to read about people's problems with JWAT here and there. :)
Admittedly the documentation could be a lot better. Moving from bitbucket/HG to github/git would maybe also be an improvement.