March 10, 2015

Implementing CrawlRSS

In my last post I talked about how you could use RSS feeds to continuously crawl sites. Or, at least, crawl sites that provide good RSS feeds.

The idea for how this could work first came to me five years ago. Indeed, at the IIPC Harvesting Working Group meeting in Vienna in September of 2010, I outlined the basic concept (slides PPT).

The concept was simple enough; unfortunately, the execution was a little trickier. I wanted to implement this by building on top of Heritrix. That saves having to redo things like WARC writing, deduplication, link extraction, politeness enforcement and so on. But it did bump up against the fact that Heritrix is built around doing snapshots, not crawling continuously.

Heritrix in 30 Seconds

For those not familiar with Heritrix's inner workings, I'm going to outline the key parts of the crawler.

Most crawl state (i.e. which URLs are waiting to be crawled) is managed by the Frontier. All new URLs are first entered into the Frontier, where they are scheduled for crawling. This is done by placing them in the appropriate queues, possibly ahead of other URLs, and so forth.

When a URL comes up for crawling, it is emitted by the Frontier. At any given time there may be multiple URLs being crawled. Each emitted URL passes through a chain of processors. The chain begins with preparatory work (e.g. checking robots.txt), proceeds to actually fetching the URL, then to link extraction and WARC writing; the links discovered in the downloaded document are then processed, and finally the URL is sent back to the Frontier for final disposition.

Each discovered URL passes through another chain of processors which check if it is in scope before registering it with the Frontier.

If an error occurs, the URL is still returned to the Frontier, which decides whether or not it should be retried.

The Frontier is Heritrix's state machine and, typically, when the Frontier is empty, the crawl is over.
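
To make that flow concrete, here is a conceptual paraphrase in Java. None of these types are actual Heritrix classes; real Heritrix is far richer and runs many such loops in parallel worker ("toe") threads.

import java.util.List;

/** Conceptual paraphrase of the Frontier/processor-chain flow described above. */
public class CrawlLoopSketch {
    interface Frontier {
        boolean isEmpty();
        String next();                 // emit the next URL due for crawling
        void schedule(String uri);     // register a discovered, in-scope URL
        void finished(String uri);     // final disposition, possibly queue a retry
    }
    interface FetchChain {             // robots check, fetch, link extraction, WARC writing
        List<String> process(String uri);   // returns the links discovered in the document
    }
    interface Scope {
        boolean accepts(String uri);
    }

    static void crawl(Frontier frontier, FetchChain fetchChain, Scope scope) {
        while (!frontier.isEmpty()) {
            String uri = frontier.next();
            for (String link : fetchChain.process(uri)) {
                if (scope.accepts(link)) {
                    frontier.schedule(link);
                }
            }
            frontier.finished(uri);
        }
    }
}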

Implementing CrawlRSS

Heritrix's Frontier is a replaceable module. Thus one solution is to simply provide an alternative Frontier that implements the RSS crawling described in my last post. This is what I tried first, and initial results were promising. I soon ran into issues, however, when I tried to scale things up.

A lot of functionality is built into the default Heritrix Frontier (the so-called BdbFrontier). While an attempt has been made to make this flexible via intermediate classes (AbstractFrontier) and some helper classes, the truth is that you have to redo a lot of 'plumbing' if you replace the Frontier wholesale.

Because the basic function of the RssFrontier was so different from what was expected of the regular Frontier, I found no way of leveraging the existing code. Since I was also less than keen on reimplementing things like canonicalization, politeness enforcement and numerous other frontier functions, I had to change tack.

Rather than replacing the default Frontier, I decided to 'manage' it instead. The key to this is the ability of the default frontier to 'run while empty'. That is to say, it can be configured to not end the crawl when all the queues are empty.

The responsibility for managing the crawl state would reside primarily in a new object, the RssCrawlController. Instead of the frontier being fed an initial list of seed URLs via the usual mechanism, the RssCrawlController is wholly responsible for providing seeds. It also keeps track of any URLs deriving from the RSS feeds and ensures that the feeds aren't rescheduled with the frontier until the proper time.

This is accomplished by doing three things. One, providing a replacement FrontierPreparer module. The FrontierPreparer is responsible for preparing discovered URLs for entry into the Frontier. The subclass RssFrontierPreparer additionally notifies the RssCrawlController, allowing it to keep track of URLs descended from a feed.

Two, listening for CrawlURIDispositionEvents. These events are triggered whenever a URL is finished, either successfully or by failing with no (further) retries possible.

Three, dealing with the fact that no disposition event is triggered for URLs that are deemed duplicates.
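
To make the bookkeeping concrete, here is a hypothetical sketch of the state the RssCrawlController has to maintain and how the three mechanisms above feed into it. The class and method names are mine, not the actual CrawlRSS API; the real implementation integrates with the Heritrix frontier rather than standing alone.

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the per-site state the controller needs. */
class RssSiteState {
    final List<String> feedUris;          // the RSS feeds belonging to one site
    final long minWaitIntervalMillis;     // minimum time between recrawls of those feeds
    long lastFeedCrawlStart = 0L;

    /** URIs derived from the feeds (new items, implied pages and their embeds)
     *  that are still in flight. The feeds are only rescheduled once this set
     *  is empty and the wait interval has passed. */
    final Set<String> outstandingUris = ConcurrentHashMap.newKeySet();

    RssSiteState(List<String> feedUris, long minWaitIntervalMillis) {
        this.feedUris = feedUris;
        this.minWaitIntervalMillis = minWaitIntervalMillis;
    }

    boolean readyToRecrawl(long now) {
        return outstandingUris.isEmpty()
                && now - lastFeedCrawlStart >= minWaitIntervalMillis;
    }
}

/** Hypothetical sketch of the controller's feedback loop. */
class RssCrawlControllerSketch {
    private final RssSiteState site;

    RssCrawlControllerSketch(RssSiteState site) { this.site = site; }

    /** Called when RssFrontierPreparer sees a URI descended from one of the feeds. */
    void uriScheduled(String uri) {
        site.outstandingUris.add(uri);
    }

    /** Called on a CrawlURIDispositionEvent (success or final failure) and,
     *  via the duplicate notification described in the next section, for URIs
     *  that are silently dropped as duplicates. */
    void uriFinished(String uri) {
        site.outstandingUris.remove(uri);
        maybeRescheduleFeeds();
    }

    private void maybeRescheduleFeeds() {
        long now = System.currentTimeMillis();
        // In practice a timer is also needed, since the wait interval may not
        // have elapsed yet when the last outstanding URI finishes.
        if (site.readyToRecrawl(now)) {
            site.lastFeedCrawlStart = now;
            for (String feed : site.feedUris) {
                // schedule 'feed' with the frontier, flagged to bypass the UriUniqFilter
            }
        }
    }
}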

UriUniqFilter

Heritrix uses so-called UriUniqFilters to avoid repeatedly downloading the same URL. The filter can be bypassed for individual URLs if desired (such as when refreshing robots.txt or, as in our case, when recrawling RSS feeds).

Unfortunately, this filtering is done in an opaque manner inside the Frontier. A URL that is scheduled with the Frontier and then fails to pass the filter simply disappears. This was no good, as the RssCrawlController needs a full accounting of every discovered URL or it can't proceed to recrawl the RSS feeds.

To accomplish this, I subclassed the provided filters so that they implement a DuplicateNotifier interface. This allows the RssCrawlController to keep track of the URLs that are discarded as duplicates.
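
Stripped of the Heritrix specifics, the idea looks roughly like this. The interface and method names are illustrative rather than the actual CrawlRSS ones, and the real classes subclass Heritrix's UriUniqFilter implementations instead of this toy in-memory filter.

import java.util.HashSet;
import java.util.Set;

/** Illustrative callback for URIs that would otherwise vanish silently. */
interface DuplicateNotifier {
    void duplicateDiscarded(String uri);
}

/** Toy stand-in for a subclassed UriUniqFilter. */
class NotifyingUniqFilter {
    private final Set<String> seen = new HashSet<>();
    private DuplicateNotifier notifier;

    void setDuplicateNotifier(DuplicateNotifier notifier) {
        this.notifier = notifier;   // in CrawlRSS this would be the RssCrawlController
    }

    /** Returns true if the URI is new; otherwise reports it as a duplicate. */
    boolean addIfNew(String uri) {
        if (seen.add(uri)) {
            return true;            // new URI, goes on to be scheduled
        }
        if (notifier != null) {
            notifier.duplicateDiscarded(uri);
        }
        return false;
    }
}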

It seems that perhaps the UriUniqFilters should be applied prior to candidate URIs being scheduled with the frontier, but that is a subject for another time.

RSS Link Extraction

A custom RssExtractor processor handles extracting links from RSS feeds. This is done by parsing the RSS feeds using ROME.

The processor is aware of the time of the most recently seen item for each feed and will only extract those items whose date is after that. This avoids crawling the same items again and again. Of course, if the feed itself is deemed a duplicate by hash comparison, this extraction is skipped.

If any new items are found in the feed, the extractor will also add the implied pages for that feed as discovered links. Both the feed items and implied pages are flagged specially so that they avoid the UriUniqFilter and will be given a pass by the scoping rules.

The usual link extractors are then used to find further links in the URLs that are extracted by the RssExtractor.
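
As a stand-alone illustration of the date-based filtering, this is roughly what ROME-based extraction looks like. The imports use ROME's current com.rometools.rome packages; the version bundled with CrawlRSS may use ROME's older package names, and RssExtractor of course does this inside a Heritrix processor rather than a main() method.

import java.net.URL;
import java.util.Date;

import com.rometools.rome.feed.synd.SyndEntry;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;

public class FeedDateFilterExample {
    public static void main(String[] args) throws Exception {
        // Most recent item date seen for this feed on earlier visits;
        // epoch here just means "nothing seen yet".
        Date lastSeen = new Date(0);

        SyndFeed feed = new SyndFeedInput().build(
                new XmlReader(new URL("http://www.ruv.is/rss/frettir")));

        for (SyndEntry entry : feed.getEntries()) {
            Date published = entry.getPublishedDate();
            if (published != null && published.after(lastSeen)) {
                System.out.println("New item: " + entry.getLink());
            }
        }
    }
}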

Scope

As I noted in my last post, getting the scope right is very important. It is vital that the scope does not leak and that, after crawling a feed, its new items and their embedded resources, you exhaust all the URLs deriving from the feed. Otherwise, you'll never recrawl the feed.

Some of the usual scoping classes are still used with CrawlRSS, such as PrerequisiteAcceptDecideRule and SchemeNotInSetDecideRule, but for the most part a new decide rule, RssInScopeDecideRule, handles scoping.

RssInScopeDecideRule is a variant of HopsPathMatchesRegexDecideRule. By default, the regular expression .R?((E{0,2})|XE?) determines whether a URL is accepted. Remember, this is applied to the hop path, not to the URL itself. It allows, starting from the seed, one hop (of any type), then one optional redirect (R), then up to two levels of embeds (E), only the first of which may be a speculative embed (X).
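
If you want a feel for what that expression accepts, a quick test against some sample hop paths (using L for an ordinary link hop) shows the intent:

import java.util.regex.Pattern;

/** Small demo of the default hop-path expression described above. */
public class HopPathScopeDemo {
    public static void main(String[] args) {
        Pattern hopPath = Pattern.compile(".R?((E{0,2})|XE?)");
        // Heritrix hop types: L = link, R = redirect, E = embed, X = speculative embed.
        String[] examples = { "L", "LE", "LEE", "LRE", "LXE", "LEEE", "LLE" };
        for (String path : examples) {
            boolean inScope = hopPath.matcher(path).matches();
            System.out.println(path + " -> " + (inScope ? "in scope" : "rejected"));
        }
    }
}

The last two paths are rejected: a third level of embeds, or a second ordinary link hop, would let the scope leak.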

The decide rule also automatically accepts URLs that are discovered in the feeds by RssExtractor.

As noted in my last post, during the initial startup it will take a while to exhaust the scope. However, after the first run, most embedded content will be filtered out by the UriUniqFilter, making each feed refresh rather quick. You can tweak how long each round takes by changing the politeness settings. In practice, we've had no problems refreshing feeds every 10 minutes. You could possibly go as low as 1 minute, but that would likely require a rather aggressive politeness policy.

Configuration

CrawlRSS comes packaged with a Heritrix profile where all of the above has been set up correctly. The only element that needs to be configured is which RSS feeds you want to crawl.

This is done by specifying a special bean, RssSite, in the CXML. Let's look at an example.

<bean id="rssRuv" class="is.landsbokasafn.crawler.rss.RssSite">
  <property name="name" value="ruv.is"/>
  <property name="minWaitInterval" value="1h30m"/>
  <property name="rssFeeds">
    <list>
      <bean class="is.landsbokasafn.crawler.rss.RssFeed">
        <property name="uri" value="http://www.ruv.is/rss/frettir"/>
        <property name="impliedPages">
          <list>
            <value>http://www.ruv.is/</value>
          </list>
        </property>
      </bean>
      <bean class="is.landsbokasafn.crawler.rss.RssFeed">
        <property name="uri" value="http://www.ruv.is/rss/innlent"/>
        <property name="impliedPages">
          <list>
            <value>http://www.ruv.is/</value>
            <value>http://www.ruv.is/innlent</value>
          </list>
        </property>
      </bean>
    </list>
  </property>
</bean>


Each configured site need not map to an actual website, but I generally find that to be the best approach. For each site you set a name (purely for reporting purposes) and a minWaitInterval, the minimum amount of time that must elapse between crawls of that site's feeds.

You'll note that the minimum wait interval is expressed in human-readable(-ish) form, e.g. 1h30m, rather than simply as a number of seconds or minutes. This provides good flexibility (you can specify intervals down to a second) with a user-friendly notation (no need to figure out how many minutes there are in 8 hours!).
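
For illustration, parsing such interval strings is straightforward. This sketch is mine, not the CrawlRSS parser, and the exact set of units it accepts (here d, h, m and s) is an assumption.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for interval strings like "1h30m" or "8h". */
public class IntervalParser {
    private static final Pattern PART = Pattern.compile("(\\d+)([dhms])");

    public static long toSeconds(String interval) {
        long seconds = 0;
        Matcher m = PART.matcher(interval.toLowerCase());
        while (m.find()) {
            long value = Long.parseLong(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'd': seconds += value * 86400; break;
                case 'h': seconds += value * 3600;  break;
                case 'm': seconds += value * 60;    break;
                case 's': seconds += value;         break;
            }
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println(toSeconds("1h30m"));  // 5400
        System.out.println(toSeconds("8h"));     // 28800
    }
}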

There is then a list of RssFeeds, each of which specifies the URL of one RSS feed. Each feed in turn has a list of implied pages, expressed as URLs.

You can have as many feeds within a site as you like.

You repeat the site bean as often as needed for additional sites. I've tested with up to around 40 sites and nearly 100 feeds. There is probably a point at which this will stop scaling, but I haven't encountered it.

Alternative Configuration - Databases

Configuring dozens or even hundreds of sites and feeds via the CXML quickly gets tedious. It also makes it difficult to change the configuration without stopping and restarting the crawl, losing the current crawl state in the process.

For this purpose, I added a configuration manager interface. The configuration manager provides the RssCrawlController with the sites, all set up correctly; you simply wire in the appropriate implementation. Provided are the CxmlConfigurationManager, as described above, and the DbConfigurationManager, which interacts with an SQL database.
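
Conceptually the contract is tiny. This sketch is hypothetical; the real interface in CrawlRSS may have a different name and additional methods, but the essence is handing the controller a collection of configured RssSite objects.

import java.util.Collection;

import is.landsbokasafn.crawler.rss.RssSite;

/** Hypothetical sketch of the configuration-manager contract. */
public interface RssConfigurationManager {
    /** Hands the RssCrawlController the fully configured sites to crawl. */
    Collection<RssSite> getSites();
}

An alternative implementation (reading JSON, say) only has to build the same RssSite objects from its own source.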

The database is accessed using Hibernate and I've included the necessary MySQL libraries (for other DBs you'll need to add the connector JARs to Heritrix's lib directory). This facility does require creating your own CRUD interface to the database, but makes it much easier to manage the configuration.

Changes in the database are picked up by the crawler, and the crawler also updates the database to reflect last fetch times etc. This state survives crawl job restarts, making it much easier to stop and start crawls without getting a lot of redundant content. When paired with the usual deduplication strategies, no unnecessary duplicates should be written to WARC.

There are two sample profiles bundled with CrawlRSS, one for each of these two configuration styles. I recommend starting with the default CXML configuration.

Implementing additional configuration managers is fairly simple if a different framework (e.g. JSON) is preferred.

Reporting

The RssCrawlController provides a detailed report. In the sample configuration, I've added the necessary plumbing so that this report shows up alongside the normal Heritrix reports in the web interface. It can also be accessed via the scripting console by invoking the getReport() method on the RssCrawlController bean.
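
For example, something along these lines should work in the scripting console (Beanshell engine, plain Java syntax). The appCtx and rawOut variables are provided by the console itself, while the controller's fully qualified class name here is my assumption and may differ in your build; looking the bean up by its id from the profile works just as well.

// Print the RSS crawl report to the console's raw output.
is.landsbokasafn.crawler.rss.RssCrawlController controller =
        appCtx.getBean(is.landsbokasafn.crawler.rss.RssCrawlController.class);
rawOut.println(controller.getReport());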

Release

All of what I've described above is written and ready for use. You can download a snapshot build from CrawlRSS's Sonatype snapshot repository. Alternatively, you can download the source from CrawlRSS's GitHub project page and build it yourself (simply run 'mvn package').

So, why no proper release? The answer is that CrawlRSS is built against Heritrix 3.3.0-SNAPSHOT. There have been notable changes in Heritrix's API since 3.2.0 (the latest stable release), requiring a SNAPSHOT dependency.

I would very much like to see a 'stable' release of the Heritrix 3.3.0 branch, but that is a subject for another post.
