February 24, 2016

3 things I shouldn't have to tell you about running a "good" crawler

I've been running large- and small-scale crawls for almost 12 years. In that time I've encountered any number of unfortunate circumstances where our crawler has caused a website some amount of trouble. We always aim to run a "good" crawler, one that is respectful of website operators and doesn't cause any issues. To accomplish this we have some basic rules and limits to abide by. The reason we sometimes cause trouble anyway comes down to the complicated nature of the internet.

During this time I've also been responsible for operating a number of websites, including a few that are (by Icelandic standards) quite popular. Doing so has shown me the other side of web crawling. It turns out that a lot of crawlers are not following the basic dos and don'ts of web crawling, including some supposedly respectable crawlers.

This seems to be getting worse each year.

Of course there are "bad" robots, run by people who do not care about the negative impact they cause. But even "good" robots (i.e. ones that at least seem to have good intentions) are all too frequently misbehaving.

So, here are 3 things you absolutely should abide by if you want your robot to be considered "good". Remember, it doesn't matter whether your robot is doing a web-scale crawl or scraping a single site. As soon as you've scripted something to automatically fetch stuff from a 3rd party server (without the 3rd party's explicit permission) you are running a crawler.

1. Be polite

Never make concurrent requests to the same site. Rate limit yourself to around 1 request every 2-5 seconds or so, and wait longer if the responses are slow. If you hit a 500 error code, wait a few minutes.

If you know (with certainty) that the site you are crawling can handle a more aggressive load (e.g. crawling google.com), it may be OK to step it up a bit. But when crawling sites where you do not have any insight, it is best to be cautious. Remember, yours is probably not the only crawler hitting them. Not to mention all the regular users.

I know, it can be infuriating when trying to scrape a large dataset. It could take weeks! But your needs do not outweigh the needs of other users. Also, crawlers are often more expensive to serve than "normal" users. They tend to systematically go through the entire site, meaning that caching strategies fail to speed up their requests.
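
To make that concrete, here is a minimal sketch of a polite fetch loop in Python. The delay values, the bot name and the contact details are all placeholders, not recommendations for any particular site:

  import time
  import urllib.error
  import urllib.request

  DELAY = 5            # seconds between requests to the same site
  ERROR_BACKOFF = 300  # wait a few minutes after a server error

  # A descriptive User-Agent; see rule 2 below for what belongs in it.
  HEADERS = {"User-Agent": "examplebot/1.0 (+http://example.org/bot; bot@example.org)"}

  def polite_fetch(urls):
      """Fetch URLs from a single site sequentially, never concurrently."""
      for url in urls:
          request = urllib.request.Request(url, headers=HEADERS)
          try:
              with urllib.request.urlopen(request) as response:
                  yield url, response.read()
          except urllib.error.HTTPError as error:
              if error.code >= 500:
                  time.sleep(ERROR_BACKOFF)  # the server is struggling; back off
              yield url, None
          time.sleep(DELAY)  # rate limit: one request every few seconds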

2. Identify yourself

Your user agent string must contain enough information to allow a website operator to find out who you are, why you are crawling their site and how to get in touch with you. This isn't negotiable.

The same goes for little custom tools built to scrape just that one website. You don't have to set a user agent if you are using e.g. curl on the command line to get a single resource. But the moment you script or program something you must put identifying information in the user agent string. If all I see is

  curl/7.19.7 (x86_64-redhat-linux-gnu) lib...

I'll assume it is a "bad" crawler. Same for things written in Python, Java etc. Always identify yourself.
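
Even a throwaway script can do better than the library default. A minimal sketch, with a bot name, URL and e-mail address that you would of course replace with your own:

  import urllib.request

  # Say who you are, why you are crawling, and how to reach you.
  # The bot name, URL and e-mail address below are placeholders.
  USER_AGENT = ("examplebot/0.1 (one-off scrape for research; "
                "+http://example.org/about-this-bot; crawl-admin@example.org)")

  request = urllib.request.Request("http://example.com/",
                                   headers={"User-Agent": USER_AGENT})
  with urllib.request.urlopen(request) as response:
      body = response.read()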

And make sure you are responsive to feedback you may get. That is something a number of big, supposedly "good" crawl operators fail to do.

The thing is, if your robot is causing a problem and I don't know who is operating it, I will ban it. Should you wish to have that ban lifted, I will not be especially predisposed towards cooperation. You've already made a very bad first impression.

Yes, you might be able to get around a ban by getting a new IP address, but at that point you are no longer running a "good" crawler.

The bottom line is, even if you are very careful, your crawler may inadvertently cause a problem. In those scenarios, if your intentions are good, you must make yourself available to deal with the issue and prevent it from recurring. If you do not do this, you are running a "bad" crawler.

3. Honor robots.txt

I'm aware of the irony here. As an operator of a legal deposit crawler, I do not respect robots.txt in most of the crawls I conduct. But you probably don't have that shield of being legally required to "get all the URLs".

A lot of websites block perfectly "legitimate" material using robots.txt (e.g. images). It is annoying. But if you are running a "good" crawler, then you have to abide by these rules, including the crawl delay. You really should also be able to parse the newer wildcard rules.
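
Python's standard library, for example, can check the allow/deny rules and the crawl delay for you; a minimal sketch (note that a simple parser like this one may not understand the newer wildcard rules):

  import urllib.robotparser

  AGENT = "examplebot"  # placeholder; use your own User-Agent token

  robots = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
  robots.read()

  if robots.can_fetch(AGENT, "http://example.com/images/logo.png"):
      delay = robots.crawl_delay(AGENT) or 5  # fall back to a polite default
      # ...fetch, then sleep for `delay` seconds before the next request
  else:
      # Blocked by robots.txt: skip it, or ask the site operator for permission.
      pass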

At most, you might bend the rules to get embedded content (images) necessary to render the page. Even that should be done carefully and only while adhering to the first two rules firmly.

If you feel you need content blocked by robots.txt, you must ask for it. Politely. Some sites may be happy to assist you (ours included). Others may tell you to go away. Either way, if you are running a "good" crawler, you'll have your answer. 

February 15, 2016

OpenWayback - Developing an API for a CDX Server

OpenWayback 2.3.0 was released about a month ago. It was a modest release aimed at fixing a number of bugs and introducing a few minor features. Currently, work is focused on version 3.0.0, which is meant to be a much more impactful release.

Notably, the indexing function is being moved into a separate (bundled) entity, called the CDX Server. This will provide a clean separation of concerns between resource resolution and the user interface.
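
To give a rough flavour of that split, resolving a resource becomes a plain HTTP query against the CDX Server. A hedged sketch, assuming an endpoint at http://localhost:8080/cdx and the commonly used url, limit and output parameters; the exact parameter set is part of what needs to be pinned down:

  import json
  import urllib.parse
  import urllib.request

  # Hypothetical endpoint; the parameter names follow common CDX server
  # conventions and may change as the API is firmed up.
  CDX_SERVER = "http://localhost:8080/cdx"

  params = urllib.parse.urlencode({
      "url": "example.com/",
      "limit": 10,
      "output": "json",
  })
  with urllib.request.urlopen(CDX_SERVER + "?" + params) as response:
      captures = json.loads(response.read().decode("utf-8"))
  for capture in captures:
      print(capture)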

The CDX Server had already been partially implemented. But as work began on refining that implementation, it became clear that we would also need to examine very closely the API between these two discrete parts. The API that currently exists is incomplete, at times vague and even, on occasion, contradictory. Existing documentation (or code review) may tell you what can be done, but it fails to shed much light on why you'd do that.

This isn't really surprising for an API that was developed "bottom-up", guided by personal experience and existing (but undocumented!) code requirements. This is pretty typical of what happens when delivering something functional is the first priority. It is technical debt, but the kind you may feel is acceptable when delivering a single-customer solution.

The problem we face is that once we push for the CDX Server to become the norm, it won't be a single-customer solution. Changes to the API will be painful and, consequently, rare.

So, we've had to step back and examine the API with a critical eye. To do this we've begun to compile a list of use cases that the CDX Server API needs to meet. Some are fairly obvious, others are perhaps more wishful thinking. In between there are a large number of corner cases and essential but non-obvious use cases that must be addressed.

We welcome and encourage any input on this list. You can do so either by editing the wiki page, by commenting on the relevant issue in our tracker, or by sending an e-mail to the OpenWayback-dev mailing list.

To be clear, the purpose of this is primarily to ensure that we fully support existing use cases. While we will consider use cases that the current OpenWayback can not handle, they will necessarily be ascribed a much lower priority.

Hopefully, this will lead to a fairly robust API for a CDX Server. The use cases may also allow us to firm up the API of the wayback URL structure itself. Not only will this serve the OpenWayback project; getting this API right is also very important for facilitating alternative replay tools!

As I said before, we welcome any input you may have on the subject. If you feel unsure of how to become engaged, please feel free to contact me directly.

February 3, 2016

How to SURT IDN domains?

Converting URLs to SURT (Sort-friendly URI Reordering Transform) form has many benefits in web archiving and is widely used when configuring crawlers. Notably, SURT prefixes make it easy to apply rules to a logical segment of the web.

Thus the SURT prefix:

  http://(is,

will match all domains under the .is TLD.

A lesser-known ability is to match against partial domain names. Thus the following SURT prefix:

  http://(is,a

would match all .is domains that begin with the letter a (note that there is no comma at the end).
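
A minimal sketch of how such host-only prefix matching works (ignoring schemes, ports and paths for brevity; the hostnames are just examples):

  def host_to_surt(host):
      """Reverse the host labels: 'landsbokasafn.is' -> 'http://(is,landsbokasafn,'."""
      return "http://(" + ",".join(reversed(host.split("."))) + ","

  def matches(surt_prefix, host):
      return host_to_surt(host).startswith(surt_prefix)

  print(matches("http://(is,", "landsbokasafn.is"))   # True: any .is domain
  print(matches("http://(is,a", "abc.is"))            # True: begins with 'a'
  print(matches("http://(is,a", "landsbokasafn.is"))  # False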

This all works quite well, until you hit Internationalized Domain Names (IDNs). As the original infrastructure of the web does not really support non-ASCII characters, all IDNs are designed so that they can be translated into an ASCII equivalent.

Thus the IDN domain landsbókasafn.is is actually represented using the "punycode" representation xn--landsbkasafn-5hb.is.

When matching SURTs against full domain names (trailing comma), this doesn't really matter. But, when matching against a domain name prefix, you run into an issue. Considering the example above, should landsbókasafn.is match the SURT http://(is,l?

The current implementation (at least in Heritrix's much-used SurtPrefixedDecideRule) is to evaluate only the punycode version, so the answer is no, but it would match http://(is,x.
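
That behaviour is easy to reproduce by extending the sketch above to convert the host to its punycode form first, here using Python's built-in IDNA codec:

  def host_to_surt(host):
      # Hosts are matched in their punycode (ASCII) form.
      ascii_host = host.encode("idna").decode("ascii")
      return "http://(" + ",".join(reversed(ascii_host.split("."))) + ","

  surt = host_to_surt("landsbókasafn.is")
  print(surt)                             # http://(is,xn--landsbkasafn-5hb,
  print(surt.startswith("http://(is,l"))  # False
  print(surt.startswith("http://(is,x"))  # True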

This seems potentially limiting and likely to cause confusion.