December 16, 2014

Deduplicating text-based data

In my last two posts about deduplication, you may have noticed the following caveat:
It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated.
The reasons for ignoring text documents derive from analysis I did 8-9 years ago when first developing the DeDuplicator. The basic argument was that HTML documents (which at the time made up the overwhelming majority of text documents) were highly dynamic (yielding little by way of deduplication), generally small, and highly compressible.

The first and last assumptions are still true, but HTML files have certainly gotten bigger in the interim. Perhaps more importantly, other text files are becoming more common and may benefit from deduplication. For example, CSS and JavaScript files are unlikely to change very frequently and have gotten much larger. Also, commonly used JavaScript libraries are replicated across many websites.

So, I figured it was time to do a little analysis and see what numbers came up. I've used the same crawl logs as were discussed in the last two posts.

In my last post there was a table that showed that a total of 73.6 million URLs had been exempted from deduplication based on mime type. This is about two thirds of all URLs crawled. That figure is a bit misleading, as it includes non-200 responses. When we limit ourselves to 200 responses, the actual number is 53.9 million. Thus text documents account for 62% of URLs.

The table also showed that these URLs accounted for about 2.3 TiB of the data collected, about 32% of all the data (that is about right, as non-200 responses usually have no or negligible payload). Clearly the average uncompressed file size of text documents is much smaller than that of non-text documents.

Of the 2.3 TiB, about 25% (553 GiB) could have been deduplicated. By URL, it was about 26% overall.

I didn't attempt to break it down into exact-URL, digest, and crawl-time deduplication.

Looking at it further, about half of the duplicate data was from non-HTML documents. More interesting is that while 14 million of the 50 million HTML documents were deemed duplicates, 2 million of the 3.1 million non-HTML text documents were deemed duplicates. The probability of a non-HTML text document being a duplicate is very high (almost comparable to non-text documents) and its size is, on average, much larger than that of an HTML document.
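The ratios behind that comparison can be checked with a few lines of Python. This is only a sketch using the rounded counts quoted above; the variable and function names are mine and are not taken from the crawl logs or the DeDuplicator.

```python
# Rounded document counts from the paragraph above (in millions).
html_total, html_dupes = 50.0, 14.0
other_text_total, other_text_dupes = 3.1, 2.0


def duplicate_rate(duplicates: float, total: float) -> float:
    """Return the fraction of documents that were deemed duplicates."""
    return duplicates / total


html_rate = duplicate_rate(html_dupes, html_total)
other_text_rate = duplicate_rate(other_text_dupes, other_text_total)

print(f"HTML duplicate rate:          {html_rate:.1%}")
print(f"non-HTML text duplicate rate: {other_text_rate:.1%}")
```

With these rounded inputs, the rates come out at roughly 28% for HTML and roughly 65% for non-HTML text.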

This is pretty conclusive. Including non-HTML text documents in deduplication yields significant savings and is fairly cheap in terms of index size and number of additional lookups.

The savings with regard to HTML are more questionable. More than doubling the number of lookups yields a potential saving of about 5% of the total compressed data size (assuming 60% compression, which is likely conservative). With a larger index, each lookup also becomes more expensive.
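To make "more than doubling" concrete, here is a rough back-of-the-envelope calculation. It assumes that the 62% text share quoted earlier applies to 200 responses only, and that every non-text 200 response is already looked up in the index, so the current number of lookups is derived rather than measured. Treat the result as an estimate.

```python
# Figures quoted earlier in this post (millions of URLs, 200 responses only).
text_urls = 53.9        # URLs whose content type begins with "text/"
text_share = 0.62       # share of all URLs that are text documents
html_urls = 50.0        # HTML documents (rounded)
other_text_urls = 3.1   # non-HTML text documents (rounded)

# Assumption: everything that is not text is already looked up in the index.
non_text_urls = text_urls / text_share - text_urls

with_other_text = non_text_urls + other_text_urls
with_all_text = with_other_text + html_urls

print(f"current lookups: {non_text_urls:.1f}M")
print(f"+ non-HTML text: {with_other_text:.1f}M "
      f"({with_other_text / non_text_urls:.2f}x)")
print(f"+ HTML as well:  {with_all_text:.1f}M "
      f"({with_all_text / non_text_urls:.2f}x)")
```

Under these assumptions, adding non-HTML text increases the number of lookups by less than 10%, while adding HTML as well pushes the total to roughly 2.6 times the current number.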

Unless resources are plentiful, I believe that skipping HTML documents when deduplicating is still a reasonable choice for large-scale crawls. For focused crawls (especially those conducted at short intervals), I would however recommend including HTML content.
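As a minimal sketch of that policy, the function below decides whether a response should be considered for deduplication based on its content type. The function name and the `include_html` flag are hypothetical and only illustrate the rule; this is not how the DeDuplicator itself is configured.

```python
def should_deduplicate(mime_type: str, include_html: bool = False) -> bool:
    """Decide whether a response is a candidate for deduplication.

    Hypothetical helper: non-text content and non-HTML text (CSS,
    JavaScript, ...) are always candidates; HTML is only a candidate
    when include_html is set, e.g. for focused crawls.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base_type = mime_type.split(";", 1)[0].strip().lower()

    if base_type == "text/html":
        return include_html
    return True


# Large-scale crawl: skip HTML, deduplicate everything else.
print(should_deduplicate("image/png"))                 # True
print(should_deduplicate("text/css"))                  # True
print(should_deduplicate("text/html; charset=UTF-8"))  # False

# Focused crawl: include HTML as well.
print(should_deduplicate("text/html", include_html=True))  # True
```

Keeping the HTML decision behind a single flag makes it easy to use one setting for broad crawls and another for focused crawls.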

1 comment:

  1. A very informative post that I stumbled upon while looking for something completely unrelated (I am glad I did). Currently in planning stages of deploying a crawler for the purposes of indexing news + records related to a specific subject & text-based deduplication was one of the many questions rattling around in my mind. Because large portions of the news industry rely on syndication, identical text is incredibly common. Great points made here re: including HTML content for the sort of focused crawling I will be working on. Cheers!