In my last blog post I showed that URI-agnostic duplicates accounted for about 5% of all duplicates by volume (bytes) and about 11% by URI count. But that analysis was limited to looking up content digests that had been discovered in a previous crawl. What if we also deduplicated on content digests that are first discovered during the crawl itself?
So I put together a little script and set it loose on the crawl logs for the domain crawl. As before, I only considered documents whose content (MIME) type does not start with "text/".
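The script itself is nothing fancy; the gist is roughly something along these lines (a simplified sketch, assuming the standard whitespace-separated crawl.log layout with the content size in field 3, the MIME type in field 7 and the SHA-1 digest in field 10; in practice one also has to skip records the index lookup had already marked as duplicates):

```python
import sys

# Remember every content digest the first time it appears in the crawl log.
# Any later non-"text/" record with the same digest would have been a
# candidate for crawl-time (as opposed to index-based) deduplication.
seen_digests = set()
dup_uris = 0
dup_bytes = 0

with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 10:
            continue                      # skip malformed/truncated lines
        size, mime, digest = fields[2], fields[6], fields[9]
        if mime.startswith("text/"):
            continue                      # text/ content is excluded as before
        if not digest.startswith("sha1:"):
            continue                      # no usable digest (e.g. failed fetch)
        if digest in seen_digests:
            dup_uris += 1
            if size.isdigit():
                dup_bytes += int(size)
        else:
            seen_digests.add(digest)

print(f"Missed at crawl time: {dup_uris} URIs, {dup_bytes / 2**30:.0f} GiB")
```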
In all, this would have increased the number of duplicates found by 3.5%, both by URI count and by volume (bytes).
In practical terms, this means I could have avoided storing 121 GiB of data. That is about 9.2% of the data volume that was subjected to deduplication and still deemed novel, or 3.3% of the overall data volume deemed novel and stored.
The following table shows the actual numbers. The difference between the total and 'subject to deduplication' is made up of URIs whose content type starts with "text/".
| | URIs | GiB |
|---|---:|---:|
| Total | 106,690,792 | 7,127 |
| Subject to deduplication | 33,096,219 | 4,791 |
| Deemed duplicates (total) | 24,522,409 | 3,477 |
| - Exact URL matches | 18,505,138 | 2,941 |
| - Canonical URL matches | 3,273,397 | 353 |
| - Digest only matches | 2,743,874 | 176 |
| Missed digest at crawl time matches | 853,013 | 121 |
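Those figures tie back to the percentages above: 121 GiB is roughly 9.2% of the 1,314 GiB that went through deduplication and was still deemed novel (4,791 − 3,477), and roughly 3.3% of the 3,650 GiB deemed novel and stored overall (7,127 − 3,477).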
So there doesn't seem to be that much gain from tackling this class of duplicates. Heritrix does offer a tool for this (which I haven't tried). I think it'll come down to how difficult this is to implement and what effect it has on performance. If it's easy and doesn't hurt performance, reducing the data volume by 3-4% can add up.