October 17, 2014

Webarchive deduplication. Does it matter which record is marked as the original?

We've been doing deduplication in webarchiving for a long time now. But, due to limits in our tools and storage format (WARC), it has always been so-called URL based deduplication. That is, we record that this capture of a particular URL is a duplicate (or revisit, if you prefer). The content isn't stored, and playback software simply moves 'backwards in time' until it finds the original record.

With recent clarifications to the WARC spec adopted by the IIPC and implemented in tools like Heritrix and OpenWayback we are no longer limited to this URL based deduplication.

With Heritrix 3.3.0 (still in development) now having robust handling for 'url agnostic' or 'digest based' deduplication, I set out to update my DeDuplicator software, an add-on for Heritrix. Implementing digest based deduplication was super easy.

But something nagged at me.

Consider the following scenario. You are crawling URL A. Its content digest indicates that it is a duplicate of URL A from some earlier crawl (let's call this A-1), but it is also a duplicate of URL B. B may have been crawled at the same time (i.e. during the same harvesting round) as A-1, or at another time altogether, including earlier during the current round of harvesting.

Obviously, if we didn't have A-1, we would deduplicate on B and say that A is a duplicate of B. That's the whole point of digest based deduplication.

But, if you do have A-1, does it matter which one A is declared a duplicate of?

During large scale crawling you need lookups in your deduplication index to be as efficient as possible. Doing simple lookups on digests is more performant than doing a lookup on both digest and URL (or searching within a result set for the digest). Additionally, you can make the index smaller by only including one instance of each digest in the index.

But this bowing to performance means that it is blind chance whether A-1 or B will be designated as the 'original' for A.
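To make the trade-off concrete, here is a minimal sketch of the two index shapes. All class and method names are hypothetical and this is not how DeDuplicator is actually implemented; it only assumes an in-memory map keyed on the digest. In the digest-only index, whichever record happens to be indexed first becomes the 'original', while the URL-preferring variant pays for a second, larger map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** A capture that a later duplicate can point to as its 'original'. */
final class OriginalRecord {
    final String url;
    final String timestamp;

    OriginalRecord(String url, String timestamp) {
        this.url = url;
        this.timestamp = timestamp;
    }
}

/** Digest-only index: one entry per digest, a single key lookup. */
final class DigestIndex {
    private final Map<String, OriginalRecord> byDigest = new HashMap<>();

    /** Keeps the first record seen for a digest; later ones are ignored. */
    void add(String digest, OriginalRecord record) {
        byDigest.putIfAbsent(digest, record);
    }

    Optional<OriginalRecord> lookup(String digest) {
        return Optional.ofNullable(byDigest.get(digest));
    }
}

/** Larger index that also remembers a same-URL original for each digest. */
final class UrlPreferringIndex {
    private final Map<String, OriginalRecord> byDigest = new HashMap<>();
    private final Map<String, OriginalRecord> byDigestAndUrl = new HashMap<>();

    void add(String digest, OriginalRecord record) {
        byDigest.putIfAbsent(digest, record);
        byDigestAndUrl.putIfAbsent(digest + " " + record.url, record);
    }

    /** Prefers an original with the same URL (A-1), else any match (B). */
    Optional<OriginalRecord> lookup(String digest, String url) {
        OriginalRecord sameUrl = byDigestAndUrl.get(digest + " " + url);
        return Optional.ofNullable(sameUrl != null ? sameUrl : byDigest.get(digest));
    }
}
```

With B indexed before A-1, the digest-only index returns B for a new capture of A, whereas the URL-preferring index returns A-1 and only falls back to B for URLs it has never seen with that digest.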

The engineer in me insists that this is irrelevant. After all, if we didn't have A-1, we'd want to use B. So does the existence of A-1 in our archive really matter?

I haven't been able to come up with any solid argument for preferring A-1. Logically, it shouldn't matter. But, somehow, it just feels off to me.

If anyone has a concrete technical or practical reason for preferring A-1 (when available) please share!

Thank you.

Edit: In response to Peter Webster's comment, let me just clarify that replay tools (e.g. OpenWayback) would still be aware of A-1, as it would be in their index, and they would show it as the precursor to A. The link between A and B would be considered incidental (as far as replay tools are concerned) and would not be directly evident to users.

Update: I've come up with at least one reason why it might matter. See my blog post answering my own question.


October 2, 2014

Heritrix, Java 8 and sun.security.tools.Keytool

I ran into this issue and I figured if I don't write it up, I'll be sure to have forgotten all the details when it occurs again.

The issue is that Heritrix (which is still built against Java 6) uses sun.security.tools.Keytool on startup to generate a self signed certificate for its HTTPS connection. However, in Java 8, Oracle changed this class to be sun.security.tools.keytool.Main.
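To illustrate the rename, here is a small sketch that probes the running JVM for either class name via reflection. The class `KeytoolLocator` is hypothetical and this is not something Heritrix does; it only shows which of the two names mentioned above is present in a given runtime.

```java
final class KeytoolLocator {
    static final String[] CANDIDATES = {
        "sun.security.tools.keytool.Main", // name in Java 8, per the text above
        "sun.security.tools.Keytool"       // name Heritrix currently relies on
    };

    /** Returns the first candidate class name that can be loaded, or null. */
    static String findKeytoolClass() {
        for (String name : CANDIDATES) {
            try {
                // Load without initializing; we only want to know if it exists.
                Class.forName(name, false, KeytoolLocator.class.getClassLoader());
                return name;
            } catch (ClassNotFoundException | LinkageError e) {
                // Not present in this runtime; try the next name.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("Keytool class found: " + findKeytoolClass());
    }
}
```

Finding the class is only half the story. Actually invoking an internal `sun` class is still unsupported, which is why the options further down avoid it.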

As Heritrix only generates the certificate once, I only ran into this issue when installing a new build of Heritrix, not when I upgraded Java to 8 on my crawl server. You can run Heritrix with Java 8 just fine as long as you launch it once with Java 7 (or, presumably, older).

It should be noted that Java warns against using anything from the sun package. It is not considered part of the Java API. But I believe that the only alternative is to have people manually set up the certificates.

This does mean two things:

1. You need a version of Java prior to 8 to work on Heritrix. It is possible for newer versions to be in compatibility mode with a prior version. But Keytool isn't part of Java proper. If you only have Java 8 installed, you will not have the necessary dependency available on the classpath. Your IDE will complain incessantly.

2. Building Heritrix on machines with only Java 8 is not possible.

I've also seen at least one unit test using Keytool (this may be only in a pending pull request, I haven't looked into it deeply).

This isn't an immediate problem, as Java 7 is still supported and available from Oracle. However, if they discontinue Java 7 it will quickly become a problem (just try to get Java 6 to install from Oracle).

If you want to run Heritrix with Java 8 your options are:

1. First run it once with Java 7 or prior.
2. Use the -s option to specify a custom keystore location and passwords. You can build that keystore using external tools.
3. Manually create the adhoc.keystore file (in Heritrix's working directory) that Heritrix usually generates automatically. This can be done using Java 8 tools with the following command (assumes Java's bin directory is on the path):
  $ keytool -keystore adhoc.keystore -storepass password \
    -keypass password -alias adhoc -genkey -keyalg RSA \
    -dname "CN=Heritrix Ad-Hoc HTTPS Certificate" -validity 3650

Number 3 rather points at a possible solution to this. Just move this generation of an adhoc keystore to the shell script that launches Heritrix.
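The same "generate only if missing" logic is sketched below in Java rather than in the shell script, purely for illustration. The class `AdhocKeystore` is hypothetical. It assumes the `keytool` executable is on the path and reuses the arguments from the command in option 3, so it never touches the internal `sun` classes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class AdhocKeystore {
    /** The keytool arguments from option 3, as a process command line. */
    static List<String> keytoolCommand(Path keystore) {
        return Arrays.asList(
            "keytool",
            "-keystore", keystore.toString(),
            "-storepass", "password",
            "-keypass", "password",
            "-alias", "adhoc",
            "-genkey",
            "-keyalg", "RSA",
            "-dname", "CN=Heritrix Ad-Hoc HTTPS Certificate",
            "-validity", "3650");
    }

    /** Generates the keystore only if it is missing; returns true if keytool ran. */
    static boolean ensureKeystore(Path keystore) throws IOException, InterruptedException {
        if (Files.exists(keystore)) {
            return false; // Already there, e.g. from an earlier launch.
        }
        Process p = new ProcessBuilder(keytoolCommand(keystore)).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("keytool failed for " + keystore);
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Assumes the current directory is Heritrix's working directory.
        ensureKeystore(Paths.get("adhoc.keystore"));
    }
}
```

In a launch script the equivalent would be a file-exists check followed by the keytool command itself.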

Edited to add #4: Copy an adhoc.keystore from a previous Heritrix install, if you have one lying about.