April 1, 2016

Duplicate DeDuplicators?

A question on the Heritrix mailing list prompted my to write a few words about deduplication in Heritrix and why there are multiple ways of doing.

Heritrix's built in service

Heritrix's inbuilt deduplication service comes in two processors. One processor records each URL in a BDB datastore or index (PersistStoreProcessor), the other looks up the current URL in this datastore and, if it finds it, compares the content digest (FetchHistoryProcessor).

The index used to be mingled in with other crawler state. That is often undesirable as you may not wish to carry forward any of that state to subsequent crawls. Typically, thus, you configure the above processors to use their own directory by wiring in a special "BDB module" configured to write to an alternative directory.

There is no way to construct the index outside of a crawl. Which can be problematic since a hard crash will often corrupt the DBD data. Of course, you can recover from a checkpoint, if you have one.

More recently, a new set of processors have been added. The ContentDigestHistoryLoader and ContentDigestHistoryStorer. They work in much the same way except they use an index that is keyed on the content digest, rather than the URL. This enables URL agnostic deduplication.

This was a questionable feature when introduced, but after the changes made to implement a more robust way of recording non-URL duplicates, it became a useful feature. Although its utility will vary based on the nature of your crawl.

As this index is updated at crawl time, it also makes it possible to deduplicate on material discovered during the same crawl. A very useful feature that I now use in most crawls.

You still can't build the index outside of a crawl.

For more information about the built in features, consult the Duplication Reduction Processors page on the Heritrix wiki.

The DeDuplicator add-on

The DeDuplicator add-on pre-dates the built in function in Heritrix by about a year (released in 2006). It essentially accomplishes the same thing, but with a few notable notable differences in tactics.

Most importantly, its index is always built outside of a crawl. Either from the crawl.log (possibly multiple log files) or from WARC files. This provides a considerable amount of flexibility as you can build an index covering multiple crawls. You can also gain the benefit of the deduplication as soon as you implement it. You don't have to run one crawl just to populate it.

The DeDuplicator uses Lucene to build its index. This allows for multiple searchable fields which in turn means that deduplication can do things like prefer exact URL matches but still do digest only matches when exact URL matches do not exist. This affords a choice of search strategies.

The DeDuplicator also some additional statistics, can write more detailed deduplication data to the crawl.log and comes with pre-configured job profiles.

The Heritrix 1 version of the DeDuplicator actually also supported deduplication based on 'server not modified'. But it was dropped when migrating the H3 as no one seemed to be using it. The index still contains enough information to easily bring it back.

Bottom line

Both approaches ultimately accomplish the same thing. Especially after the changes that were made a couple of years ago to how these modules interact with the rest of Heritrix, there really isn't any notable difference in the output. All these processors, after determining an document is a duplicate, set the same flags and cause the same information to be written to the WARC (if you are using ARCs, do not use any URL agnostic features!)

Ultimately, it is just a question of which fits better into your workflow.


  1. FWIW, I think Roger managed to build a persistlog file from previous crawls and pass it in to the persistLoadProcessor as the preloadSource. Not convinced this is a sustainable approach.



    1. I'm sure something robust, along those lines, could be built. The point is, it doesn't currently exist for the BDB modules. Maybe it should, though.