Kris's blog: URI agnostic deduplication on content discovered at crawl time

In my last blog post I showed that URI agnostic duplicates accounted for about 5% of all duplicates by volume (bytes) and about 11% by URI count. But this is limited to looking up content digests that had been discovered in a previous crawl. What if we also deduplicated on content digests that are discovered during a crawl?

So I put together a little script and set it loose on the crawl logs for the domain crawl. As before, I only considered documents whose content (MIME) type does not start with "text/".
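The script itself isn't reproduced here. As a rough illustration of the idea only, a minimal Python sketch could look like the one below. It assumes the usual whitespace-separated Heritrix crawl.log layout (size in the third field, MIME type in the seventh, content digest in the tenth, annotations in the twelfth) and assumes that documents already caught by deduplication carry an annotation containing "duplicate". The function name, the result type and that annotation marker are all hypothetical, so adjust them to match your own logs.

```python
from typing import Iterable, NamedTuple


class Missed(NamedTuple):
    uris: int        # number of URIs that repeat a digest seen earlier in the crawl
    size_bytes: int  # total size of those repeated documents


def count_missed_duplicates(lines: Iterable[str]) -> Missed:
    """Count non-text documents whose digest was already seen in the same crawl.

    Assumed field positions (0-based) after splitting on whitespace:
    2 = document size, 6 = MIME type, 9 = content digest, 11 = annotations.
    """
    seen = set()
    uris = 0
    size_bytes = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # malformed or truncated line
        size, mime, digest, annotations = fields[2], fields[6], fields[9], fields[11]
        if mime.startswith("text/"):
            continue  # text documents are not subject to deduplication
        if not size.isdigit() or digest == "-":
            continue  # nothing was downloaded, so there is no digest to compare
        if "duplicate" in annotations:
            continue  # assumed marker: already caught by the existing deduplication
        if digest in seen:
            uris += 1
            size_bytes += int(size)
        else:
            seen.add(digest)
    return Missed(uris, size_bytes)
```

Feeding it the lines of a crawl log, for example `count_missed_duplicates(open("crawl.log"))`, gives the number of URIs and bytes that would additionally have been deemed duplicates. Holding every digest in an in-memory set is the simplest approach, but a domain-sized crawl may need something more economical.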

In all, this would have increased the number of duplicates found by 3.5%. It would also increase the number of bytes deemed duplicate by 3.5%.

In practical terms this means I could have avoided storing 121 GiB of data. This is about 9.2% of the data volume that was subjected to deduplication and deemed novel, or 3.3% of the overall data volume deemed novel and stored.

The following is a table showing the actual numbers. The difference between the total and 'subject to deduplication' is made up of URIs whose content type started with "text/".

	URIs	GiB
Total:	106.690.792	7.127
Subject to deduplication:	33.096.219	4.791
Deemed duplicates (total):	24.522.409	3.477
- Exact URL matches:	18.505.138	2.941
- Canonical URL matches:	3.273.397	353
- Digest only matches:	2.743.874	176
Missed digest at crawl time matches:	853.013	121
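
The percentages quoted earlier can be re-derived from the table. The short Python sketch below does the arithmetic. The variable names are just for illustration, and the dots in the table are read as thousands separators (so 7.127 means 7,127 GiB).

```python
# Figures copied from the table; dots there are thousands separators.
total_gib = 7_127      # Total
subject_gib = 4_791    # Subject to deduplication
dup_uris = 24_522_409  # Deemed duplicates (total), URIs
dup_gib = 3_477        # Deemed duplicates (total), GiB
missed_uris = 853_013  # Missed digest at crawl time matches, URIs
missed_gib = 121       # Missed digest at crawl time matches, GiB


def pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Increase in duplicates found, by URI count and by volume.
more_dup_uris = pct(missed_uris, dup_uris)  # 3.5
more_dup_gib = pct(missed_gib, dup_gib)     # 3.5

# Share of the novel (stored) data that could have been avoided.
of_novel_subject = pct(missed_gib, subject_gib - dup_gib)  # 9.2
of_novel_overall = pct(missed_gib, total_gib - dup_gib)    # 3.3
```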

So there doesn't seem to be that much gain from tackling this class of duplicates. Heritrix does offer a tool for this (which I haven't tried). I think it'll come down to how difficult this is to implement and its effect on performance. If it's easy and doesn't hurt performance, reducing data volume by 3-4% can add up.

December 5, 2014
