December 16, 2014

Deduplicating text-based data

In my last two posts about deduplication, you may have noticed the following caveat:
It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated.
The reasons for ignoring text documents derive from analysis I did 8-9 years ago when first developing the DeDuplicator. The basic argument was essentially that HTML documents (which at the time made up the overwhelming majority of text documents) were highly dynamic (yielding little by way of deduplication), generally small, and highly compressible.

The first and last assumptions are still true, but HTML files have certainly gotten bigger in the interim. Perhaps more importantly, other text files are becoming more common and may benefit from deduplication. For example, CSS and JavaScript files are unlikely to change very frequently and have gotten much larger. Also, commonly used JavaScript libraries are replicated across many websites.

So, I figured it was time to do a little analysis and see what numbers came up. I've used the same crawl logs as were discussed in the last two posts.

In my last post there was a table showing that a total of 73.6 million URLs had been exempted from deduplication based on mime type, about two thirds of all URLs crawled. That figure is a bit misleading, as it includes non-200 responses. When we limit ourselves to 200 responses, the actual number is 53.9 million. Thus text documents account for 62% of URLs with a 200 response.

The table also showed that these URLs accounted for about 2.3 TiB of the data collected, about 32% of all the data (which sounds about right, as non-200 responses usually have no or negligible payload). Clearly the average uncompressed size of a text document is much smaller than that of a non-text document.

Of the 2.3 TiB, about 25% (553 GiB) could have been deduplicated. By URL count, the figure is about 26%.

I didn't attempt to break it down into exact URL, digest, and crawl-time deduplication.

Looking at it further, about half of the duplicate data came from non-HTML documents. More interestingly, while 14 million of the 50 million HTML documents were deemed duplicates, 2 million of the 3.1 million non-HTML text documents were deemed duplicates. The probability of a non-HTML text document being a duplicate is thus very high (almost comparable to non-text documents), and its size is, on average, much larger than that of an HTML document.
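
To put rough numbers on that, these are the duplicate rates implied by the figures above (a quick bit of arithmetic, using only the approximate counts quoted in this post):

  # Duplicate rates implied by the counts quoted above.
  html_total, html_duplicates = 50_000_000, 14_000_000
  other_text_total, other_text_duplicates = 3_100_000, 2_000_000

  print(f"HTML duplicate rate:          {html_duplicates / html_total:.0%}")              # ~28%
  print(f"Non-HTML text duplicate rate: {other_text_duplicates / other_text_total:.0%}")  # ~65%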

This is pretty conclusive. Including non-HTML text documents in deduplication yields significant savings and is fairly cheap in terms of index size and number of additional lookups.

The savings with regard to HTML are more questionable. By more than doubling the number of lookups, there is a potential saving of about 5% of the total compressed data size (assuming 60% compression, which is likely conservative). With a larger index, each lookup also becomes more expensive.
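
For what it's worth, here is a rough sketch of how a figure of that order comes about, assuming that about half of the 553 GiB of duplicate text data is HTML (as noted above), that "60% compression" means a 60% size reduction, and that the roughly 2.1 TiB this crawl ultimately stored is the relevant compressed total:

  # Back-of-the-envelope estimate of the potential saving from deduplicating HTML.
  # The assumptions are spelled out above; this is a rough sketch only.
  duplicate_text_gib = 553          # duplicate text data, uncompressed
  html_share = 0.5                  # roughly half of that is HTML
  compression_reduction = 0.60      # "60% compression" read as a 60% size reduction
  stored_total_gib = 2.1 * 1024     # total compressed storage for the crawl

  saved_gib = duplicate_text_gib * html_share * (1 - compression_reduction)
  print(f"~{saved_gib:.0f} GiB, about {saved_gib / stored_total_gib:.1%} of stored data")
  # -> ~111 GiB, about 5.1%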

Unless resources are plentiful, I believe that skipping HTML documents when deduplicating is still a reasonable choice for large scale crawls. For focused crawls (especially those conducted on short intervals), I would however recommend including HTML content.

December 5, 2014

URI-agnostic deduplication on content discovered at crawl time

In my last blog post I showed that URI-agnostic duplicates accounted for about 5% of all duplicates by volume (bytes) and about 11% by URI count. But this is limited to looking up content digests that had been discovered in a previous crawl. What if we also deduplicated on content digests that are first discovered during a crawl?

So I put together a little script and set it loose on the crawl logs for the domain crawl. As before, I only considered documents whose content (mime) type does not start with "text/".
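
The core of that script is simple enough to sketch. Something along these lines captures the idea; the record fields and the function name are just placeholders, and extracting those fields from the crawl log is omitted:

  # Minimal sketch: count stored documents whose payload digest was already
  # seen (and stored) earlier in the same crawl, i.e. duplicates that the
  # pre-crawl index cannot catch. Field and function names are placeholders.
  def crawl_time_duplicates(records):
      seen = set()
      missed_urls, missed_bytes = 0, 0
      for url, mime, size, digest, already_deduplicated in records:
          if mime.startswith("text/") or already_deduplicated:
              continue              # skip text documents and already handled duplicates
          if digest in seen:
              missed_urls += 1      # would have been caught by crawl-time deduplication
              missed_bytes += size
          else:
              seen.add(digest)
      return missed_urls, missed_bytes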

In all, this would have increased the number of duplicates found by 3.5%. It would also have increased the number of bytes deemed duplicate by 3.5%.

In practical terms this means I could have avoided storing 121 GiB of data. That amounts to about 9.2% of the data volume that was subjected to deduplication and deemed novel, or 3.3% of the overall data volume deemed novel and stored.

The following is a table showing the actual numbers. The difference between the total and 'subject to deduplication' is made up of URIs whose content type started with "text/".

                                             URIs     GiB
Total:                                106,690,792   7,127
Subject to deduplication:              33,096,219   4,791
Deemed duplicates (total):             24,522,409   3,477
- Exact URL matches:                   18,505,138   2,941
- Canonical URL matches:                3,273,397     353
- Digest only matches:                  2,743,874     176
Missed digest at crawl time matches:      853,013     121
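
The percentages quoted above follow directly from the table; the arithmetic is simply:

  # Deriving the quoted percentages from the table above.
  total_gib, subject_gib, duplicate_gib = 7127, 4791, 3477
  duplicate_uris, missed_uris = 24_522_409, 853_013
  missed_gib = 121

  print(f"{missed_uris / duplicate_uris:.1%}")                # ~3.5% more duplicate URIs
  print(f"{missed_gib / duplicate_gib:.1%}")                  # ~3.5% more duplicate bytes
  print(f"{missed_gib / (subject_gib - duplicate_gib):.1%}")  # ~9.2% of novel data subject to deduplication
  print(f"{missed_gib / (total_gib - duplicate_gib):.1%}")    # ~3.3% of all novel data stored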

So there doesn't seem to be that much gain from tackling this class of duplicates. Heritrix does offer a tool for this (which I haven't tried). I think it'll come down to how difficult this is to implement and its effect on performance. If it's easy and doesn't hurt performance, reducing data volume by 3-4% can add up.

December 4, 2014

The results of URI-agnostic deduplication on a domain crawl

During the recently completed .is domain crawl, URI-agnostic deduplication was employed for the first time. Until now, .is domain crawls have only deduplicated in instances where the same URI served up identical content to that observed during the most recent prior crawl.

It should be noted that the URI-agnostic deduplication only relied on the data from the index created pre-crawl. It did not try to deduplicate on documents that were discovered for the first time during the crawl.

It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated. Generally, deduplicating on HTML and other text based documents yields limited results due to them being dynamically generated and heavily compressible. We will be looking at this class in the future.

The deduplication index contained all known URIs for known hashes. This made it possible to record for each encountered duplicate if it was...
  • An exact URL match.
    Meaning that the exact same URL had already been recorded as having this payload.
  • A canonical URL match.
    Meaning that a URL whose canonical form is identical to the current URL's canonical form had already been recorded as having this payload. The canonicalization employed is the same one utilized by OpenWayback when determining URL equivalency. Exact URL matches do not count as canonical matches.
  • A digest only match.
    Meaning that we have recorded an unrelated URL as having the same payload. Exact and canonical URL matches do not count towards this.
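
As an illustration, here is a minimal sketch of that classification against an index mapping each payload digest to the set of URLs previously recorded with it (the function and the canonicalize() helper are placeholder names; the real implementation uses OpenWayback's canonicalization):

  # Minimal sketch of the duplicate classification described above. The index maps
  # each payload digest to the set of URLs previously recorded with that digest;
  # canonicalize() stands in for OpenWayback-style URL canonicalization.
  def classify_duplicate(url, digest, index, canonicalize):
      urls = index.get(digest)
      if not urls:
          return None                     # not a duplicate at all
      if url in urls:
          return "exact URL match"
      canonical = canonicalize(url)
      if any(canonicalize(u) == canonical for u in urls):
          return "canonical URL match"    # exact matches never reach this point
      return "digest only match"
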
The results:
  • Exact URL: 75.46% of URLs, 84.58% of bytes deduplicated
  • Canonical URL: 13.35% of URLs, 10.15% of bytes deduplicated
  • Digest only: 11.19% of URLs, 5.07% of bytes deduplicated
As you can see, allowing URI-agnostic matching does not significantly increase the amount of data that can be deduplicated. It is also clear that it is primarily smaller files that are likely to be deduplicated in this manner.

In all, URI-agnostic deduplication saved 176 GiB in a crawl that ultimately required 2.1 TiB of storage space. So even if we assume that none of those 176 GiB were compressible, that is a saving of 7.6%. The actual value is probably closer to 5%.

If you count canonical URL matches as URI-agnostic (it is an arguable point) then the number rises notably to 19.7%.
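
For the record, here is the arithmetic behind those two percentages, assuming they are measured against what would have been stored without these savings (with 2.1 TiB read as roughly 2,150 GiB, and the canonical URL matches derived from their share of the deduplicated bytes):

  # Arithmetic behind the 7.6% and 19.7% figures, assuming the savings are
  # measured against what would have been stored without them.
  stored_gib = 2.1 * 1024                  # data actually stored, ~2,150 GiB
  digest_only_gib = 176                    # digest only (URI-agnostic) savings
  canonical_gib = 176 / 0.0507 * 0.1015    # ~353 GiB, from the byte shares above

  print(f"{digest_only_gib / (stored_gib + digest_only_gib):.1%}")    # ~7.6%
  combined = digest_only_gib + canonical_gib
  print(f"{combined / (stored_gib + combined):.1%}")                  # ~19.7%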