September 26, 2016

3 crawlers : 1 writer

Last week I attended an IIPC-sponsored hackathon with the overarching theme of 'Building better crawlers'. I can't say we built a better crawler in the room, but it did help clarify for me the likely future of archival crawling. And it involves three types of crawlers.

The first type is the bulk crawler. Heritrix is an example of this. It can crawl a wide variety of sites 'good enough' and has fairly modest hardware requirements, allowing it to scale quite well. It is, however, limited in its ability to handle scripted content (i.e. JavaScript), as all link extraction is based on heuristics.

The second type is a browser-driven crawler. It is still fully (or mostly) automated, but uses a browser to render pages. Additionally, scripts can be run on the rendered pages to simulate scrolling, clicking and other user behavior we may wish to capture. Brozzler (Internet Archive) is an example of this approach. This allows far better capture of scripted content, but at a price in terms of resources.
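To give a rough idea of what 'running scripts on rendered pages' can look like, here is a minimal sketch using Selenium and headless Chrome. Brozzler itself drives the browser differently (via the Chrome DevTools protocol and its own behavior scripts), so this is only meant to convey the general idea, not its implementation.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def render_and_extract(url, scroll_steps=5, pause=1.0):
    """Render a page in a headless browser, simulate scrolling, and return
    the outlinks found in the resulting DOM rather than the raw HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Behavior script: scroll repeatedly to trigger lazy-loaded content.
        for _ in range(scroll_steps):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)
        # Extract links from the fully rendered DOM.
        return [a.get_attribute("href")
                for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")]
    finally:
        driver.quit()
```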

For large-scale crawls, it seems likely that a hybrid approach would serve us best: have a bulk crawler cover the majority of URLs, delegating only those URLs that are deemed 'troublesome' to the more expensive browser-based rendering.

The trick here is to make the two approaches work together smoothly (Brozzler, for example, handles state very differently from Heritrix) and to be smart about which content goes in which bucket.
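To make the delegation a little more concrete, here is a hedged sketch of what the bucketing decision might look like. The heuristics, the domain list and the frontier names are all invented for illustration; a real deployment would use richer signals, such as re-queueing URLs whose first capture came back suspiciously empty.

```python
from urllib.parse import urlparse

# Hypothetical curator-maintained list of domains known to need a real browser.
SCRIPT_HEAVY_DOMAINS = {"example-spa.com"}

def needs_browser(url, response_text):
    """Crude heuristics for deciding whether a URL should be re-queued for
    browser-based capture."""
    if urlparse(url).hostname in SCRIPT_HEAVY_DOMAINS:
        return True
    body = response_text.lower()
    # Pages that are mostly script with few plain outlinks are a warning sign
    # that heuristic link extraction probably missed content.
    return body.count("<script") > 10 and body.count("<a ") < 3

def route(url, response_text, bulk_frontier, browser_frontier):
    """Append the URL to whichever crawler's frontier should handle it next."""
    target = browser_frontier if needs_browser(url, response_text) else bulk_frontier
    target.append(url)
```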

The third type of crawler is what I'll call a manual crawler, i.e. a crawler whose activities are entirely driven by a human operator. An example of this is Webrecorder.io. This enables us to fill in whatever blanks the automated crawlers leave. It can also prove useful for highly targeted collection, where curators are handpicking not just sites but specific individual pages. They can then complete the capture right there in the browser.

There is, however, no reason these crawlers cannot all use the same back end for writing WARCs, handling deduplication and otherwise doing post-acquisition tasks. By using a suitable archiving proxy, all three types of crawlers can easily add their data to our collections.
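In practice this just means pointing each crawler's HTTP traffic at the proxy. The sketch below assumes a warcprox-style archiving proxy listening locally; the address and the certificate handling are assumptions, but any client that honours standard proxy settings, from a bulk crawler worker to a scripted browser, can record through it in the same way.

```python
import requests

# Assumed address of a local archiving proxy (e.g. warcprox).
ARCHIVING_PROXY = "http://localhost:8000"

session = requests.Session()
session.proxies = {"http": ARCHIVING_PROXY, "https": ARCHIVING_PROXY}
# An archiving proxy has to man-in-the-middle https traffic, so its CA
# certificate would normally need to be trusted; verification is skipped
# here only to keep the sketch short.
session.verify = False

response = session.get("https://example.com/")
# The content comes back as usual; as a side effect the proxy has written the
# request/response pair to a WARC file and recorded it for deduplication.
print(response.status_code, len(response.content))
```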

Such proxy tools already exist; it is simply a matter of making sure these crawlers use them (many already do), and that they use them consistently, i.e. that there is a clear API for an archiving proxy that covers the use cases of all the crawlers and allows them to communicate collection metadata, dictate deduplication policies and so on.
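warcprox already does something along these lines with its Warcprox-Meta request header: the crawler attaches a small JSON blob to each request telling the proxy how to treat the capture. The field names below are my own illustration of the kind of information a standard API would carry, not warcprox's actual schema.

```python
import json
import requests

# Illustrative per-crawl metadata; the field names are assumptions, not a spec.
crawl_metadata = {
    "warc-prefix": "news-crawl-2016",   # which collection / WARC series to write to
    "dedup-bucket": "news-crawl",       # scope within which duplicates are detected
}

requests.get(
    "https://example.com/article",
    proxies={"http": "http://localhost:8000",
             "https": "http://localhost:8000"},
    headers={"Warcprox-Meta": json.dumps(crawl_metadata)},
    verify=False,
)
```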

Now is the right time to establish this API. I think the first steps in that direction were taken at the hackathon. Hopefully, we'll have a first draft available on the IIPC GitHub page before too long.