Comments on Kris's blog: Deduplicating text based data
Kristinn Sigurðsson (http://www.blogger.com/profile/03734283518658091605)
Feed updated: 2023-03-25T10:27:17.347+00:00

Josh Wieder (https://www.blogger.com/profile/11800273950071585348), 2015-08-10T14:57:27.257+00:00:

A very informative post that I stumbled upon while looking for something completely unrelated (I am glad I did). I am currently in the planning stages of deploying a crawler to index news and records related to a specific subject, and text-based deduplication was one of the many questions rattling around in my mind. Because large portions of the news industry rely on syndication, identical text is incredibly common. Great points made here re: including HTML content for the sort of focused crawling I will be working on. Cheers!