Note that the URI-agnostic deduplication relied only on the data in the index created pre-crawl. It did not try to deduplicate against documents that were discovered for the first time during the crawl.
Also note that only URLs whose content (MIME) type did not begin with "text/" were deduplicated. Generally, deduplicating HTML and other text-based documents yields limited results because they are dynamically generated and highly compressible. We will be looking at this class in the future.
The deduplication index contained all known URIs for known hashes. This made it possible to record, for each encountered duplicate, whether it was...
- An exact URL match.
Meaning that this exact URL had already been recorded as having this payload.
- A canonical URL match.
Meaning that a URL whose canonical form is identical to the current URL's canonical form had already been recorded as having this payload. The canonicalization employed is the same one used by OpenWayback when determining URL equivalency. Exact URL matches do not count as canonical matches.
- A digest only match.
Meaning that we have recorded an unrelated URL as having the same payload. Exact and canonical URL matches do not count towards this.
- Exact URL: 75.46% of URLs, 84.58% of bytes deduplicated
- Canonical URL: 13.35% of URLs, 10.15% of bytes deduplicated
- Digest only: 11.19% of URLs, 5.07% of bytes deduplicated
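The three-way classification described above can be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the function names, the in-memory `index` mapping digests to sets of URLs, and especially `canonicalize` are assumptions made for the example. The real canonicalization is OpenWayback's, which this simplified stand-in does not reproduce.

```python
from urllib.parse import urlsplit


def eligible_for_dedup(content_type: str) -> bool:
    """Only non-text content types were deduplicated in this crawl."""
    return not content_type.strip().lower().startswith("text/")


def canonicalize(url: str) -> str:
    """Simplified stand-in canonicalizer (ASSUMPTION, not OpenWayback's rules):
    drops the scheme, lowercases the host and strips a leading 'www.'."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def classify_duplicate(url, digest, index):
    """Return 'exact', 'canonical', 'digest-only', or None if the digest is new.

    `index` is a hypothetical pre-crawl index: digest -> set of known URLs.
    """
    known_urls = index.get(digest)
    if not known_urls:
        return None  # payload not seen before, so not a duplicate
    if url in known_urls:
        return "exact"
    canon = canonicalize(url)
    if any(canonicalize(known) == canon for known in known_urls):
        return "canonical"
    # Same payload, but only recorded under unrelated URLs.
    return "digest-only"


# Example with made-up URLs and a made-up digest value.
index = {"sha1:abc": {"http://www.example.com/logo.png"}}
for candidate in (
    "http://www.example.com/logo.png",
    "https://example.com/logo.png",
    "http://cdn.example.org/img/logo.png",
):
    print(candidate, "->", classify_duplicate(candidate, "sha1:abc", index))
```

Checking the categories in this order is what keeps them mutually exclusive: an exact match is never counted as canonical, and neither is counted as digest only.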
In all, URI-agnostic deduplication saved 176 GiB in a crawl that ultimately required 2.1 TiB of storage space. So even if we assume that none of the 176 GiB was compressible, it is a saving of 7.6%. The actual value is probably closer to 5%.
If you count canonical URL matches as URI-agnostic (it is an arguable point), then the number rises notably to 19.7%.
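For readers who want to check the arithmetic, here is a small sketch. It assumes the percentages are taken relative to what the crawl would have needed without the saving (stored plus saved), which is the reading that reproduces 7.6%. The second figure additionally assumes that the 176 GiB corresponds to the digest only share of deduplicated bytes (5.07%) and scales the canonical share (10.15%) from it. That is a guess that happens to match 19.7%, not a method stated in the post.

```python
GIB_PER_TIB = 1024

stored_gib = 2.1 * GIB_PER_TIB  # storage the crawl ultimately required
saved_gib = 176.0               # saved by URI-agnostic (digest only) dedup


def saving_share(saved: float, stored: float) -> float:
    """Fraction of the would-be crawl size (stored + saved) that was avoided."""
    return saved / (stored + saved)


digest_only_pct = round(100 * saving_share(saved_gib, stored_gib), 1)

# ASSUMPTION: canonical matches saved bytes in proportion to their share of
# deduplicated bytes (10.15%) relative to the digest only share (5.07%).
canonical_gib = saved_gib * 10.15 / 5.07
combined_pct = round(
    100 * saving_share(saved_gib + canonical_gib, stored_gib), 1
)

print(digest_only_pct, combined_pct)
```

Under these assumptions the script prints 7.6 and 19.7, matching the figures quoted above.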