December 4, 2014

The results of URI-agnostic deduplication on a domain crawl

During the recently completed .is domain crawl, URI-agnostic deduplicaction was employed for the first time. Until now, .is domain crawls have only deduplicated in instances where the same URI served up identical content to that observed during the most recent prior crawl.

It should be noted that the URI-agnostic deduplication only relied on the data from the index created pre-crawl. It did not try to deduplicate on documents that were discovered for the first time during the crawl.

It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated. Generally, deduplicating on HTML and other text based documents yields limited results due to them being dynamically generated and heavily compressible. We will be looking at this class in the future.

The deduplication index contained all known URIs for known hashes. This made it possible to record for each encountered duplicate if it was...
  • An exact URL match.
    Meaning that that exact URL had already been recorded as having this payload
  • A canonical URL match.
    Meaning that an URL whose canonical form is identical to the current URL's canonical form had already been recorded as having this payload. The canonicalization employed is the same one utilized by OpenWayback when determining URL equivalency. Exact URL matches do not count as canonical matches.
  • A digest only match
    Meaning that we have recorded an unrelated URL as having the same payload. Exact and canonical URL matches do not count towards this.
The results:
  • Exact URL: 75,46% of URLs, 84.58% of bytes deduplicated
  • Canonical URL: 13,35% of URLs, 10.15% of bytes deduplicated
  • Digest only: 11,19% of URLs, 5.07% of bytes deduplicated
As you can see, allowing URI-agnostic does not significantly affect the amount of data that can be deduplicated. It is also clear that it is primarily smaller files that are likely to be deduplicated in this manner.

In all, URI-agnostic deduplication saved 176 GiB in a crawl that ultimately required 2.1 TiB of storage space. So even if we assume that none of the 176 was compressible it is a saving of 7.6%. The actual value is probably closer to 5%.

If you count canonical URL matches as URI-agnostic (it is an arguable point) then the number rises notably to 19.7%.

No comments:

Post a Comment