November 12, 2015

Workshop on Missing Warc Features

Yesterday I tweeted about the idea of holding a workshop on missing WARC features at the next IIPC General Assembly. I thought I'd expand a bit on it here, without a 140 character limit.

The idea for this session came to me while reviewing the various issues put forward for the WARC 1.1 review. Several of the issues/proposals were clearly important, as they addressed real needs, but at the same time they were nowhere near ready for standardization.

It is important that we mature a solution before we enshrine it in the standard. This may include exploring multiple avenues to resolve the matter. At a minimum, it requires implementing the solution in the relevant tools as a proof of concept.

The issues in question include:
  • Screenshots – Crawling using browsers is becoming more common. A very useful by-product of this is a screenshot of the webpage as rendered by the browser. It is, however, not clear how best to store this in a WARC, especially in terms of making it easily discoverable to a user of a Wayback-like replay tool. This may also affect other types of related resources and metadata.
    See further on WARC specification issue tracker: #13 #27
  • HTTP 2 – The new HTTP 2 standard uses binary headers. This seems to break one of the expectations of WARC response records containing HTTP responses.
    See further on WARC specification issue tracker: #15
  • SSL certificates – Store the SSL certificates used during an HTTPS session. #12
  • AJAX Interactions – #14
The above list is unlikely to be exhaustive. It merely enumerates the issues that I'm currently aware of.

I'd like to organize a workshop during the 2016 IIPC GA to discuss these issues. For that to become a reality I'm looking for people willing to present on one or more of these topics (or a related one that I missed). It will probably be aimed at the open days so attendance is not strictly limited to IIPC members.

The idea is that we'd have a ~10-20 minute presentation where a particular issue's problems and needs are defined and a solution proposed. It doesn't matter whether or not the solution has been implemented in code. Following each presentation there would then be a discussion on the merits of the proposed solution and possible alternatives.

A minimum of three presentations would be needed to "fill up" the workshop. We should be able to fit in four.

So, that's what I'm looking for, volunteers to present one or more of these issues. Or, a related issue I've missed.

To be clear, while the idea for this workshop comes out of the WARC 1.1 work it is entirely separate from that review. By the time of the GA the WARC 1.1 revision will be all but settled. Consider this, instead, as a first possible step on the WARC 1.2 revision.

September 15, 2015

Looking For Stability In Heritrix Releases

Which version of Heritrix do you use?

If the answer is version 3.3.0-LBS-2015-01 then you probably already know where I'm going with this post and may want to skip to the proposed solution.  

3.3.0-LBS-2015-01 is a version of Heritrix that I "made" and currently use because there isn't a "proper" 3.3.0 release. I know of a couple of other institutions that have taken advantage of it.

The Problem (My Experience)


The last proper release of Heritrix (i.e. non-SNAPSHOT release that got pushed to a public Maven repo, even if just the Internet Archive one) that I could use was 3.1.0-RC1. There were regression bugs in both 3.1.0 and 3.2.0 that kept me from using them.

After 3.2.0 came out, the main bug keeping me from upgrading was fixed. Then a big change to how revisit records were created was merged in, and it was definitely time for me to stop using a four year old version. Unfortunately, stable releases had by then mostly gone away. Even when a release is made (as I discovered with 3.1.0 and 3.2.0), it may only be "stable" for those making the release.

So, I started working with the "unstable" SNAPSHOT builds of the unreleased 3.3.0 version. This, however, presented some issues. I bundle Heritrix with a few customizations and crawl job profiles. This is done via a Maven build process. Without a stable release, I run the risk that a change to Heritrix will cause my internal build to create something that no longer works. It also makes it impossible to release stable builds of tools that rely on new features in Heritrix 3.3.0. Thus no stable releases for the DeDuplicator or CrawlRSS. Both are way overdue.

Late last year, after getting a very nasty bug fixed in Heritrix, I spent a good while testing it and making sure no further bugs interfered with my jobs. I discovered a few minor flaws and wound up creating a fork that contained fixes for them. Realizing I now had something that was as close to a "stable" build as I was likely to see, I dubbed it Heritrix version 3.3.0-LBS-2014-03 (LBS is the Icelandic abbreviation of the library's name and 2014-03 is the domain crawl it was made for).

The fork is still available on GitHub. More importantly, this version was built and deployed to our in-house Maven repo. It doesn't solve the issue of the open tools we have but for internal projects, we now had a proper release to build against.
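Building internal tools against that release then becomes a matter of pinning the exact version in the downstream project's POM. A rough sketch of what that looks like (the group and artifact IDs are from memory and the repository URL is just a placeholder for our in-house repo):

<!-- In-house repository holding the LBS builds (URL is a placeholder) -->
<repositories>
  <repository>
    <id>lbs-internal</id>
    <url>https://maven.example.org/repository/releases</url>
  </repository>
</repositories>

<dependencies>
  <!-- Pin the exact internal release instead of a moving 3.3.0-SNAPSHOT -->
  <dependency>
    <groupId>org.archive.heritrix</groupId>
    <artifactId>heritrix-engine</artifactId>
    <version>3.3.0-LBS-2014-03</version>
  </dependency>
</dependencies>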

You can see here all the commits that separate 3.2.0 and 3.3.0-LBS-2014-03 (there are a lot!).

Which brings us to 3.3.0-LBS-2015-01. When getting ready for the first crawl of this year, I realized that the issues I'd had were now resolved, plus a few more things had been fixed (full list of commits). So, I created a new fork and, again, put it through some testing. When it came up clean, I released it internally as 3.3.0-LBS-2015-01. It's now used for all crawling at the library.

This sorta works for me. But it isn't really a good model for a widely used piece of software. The unreleased 3.3.0 version contains significant fixes and improvements. Leaving people stuck on 3.2.0, or forcing them to use a non-stable release, isn't good. And, while anyone may use my build, doing so requires a bit of know-how, and there still isn't any promise of it being stable in general just because it is stable for me. This was clearly illustrated with the 3.1.0 and 3.2.0 releases, which were stable for IA, but not for me.

Stable releases require some quality assurance.

Proposed Solution


What I'd really like to see is an initiative of multiple Heritrix users (be they individuals or institutions). These would come together, once or twice a year, to create a release candidate and test it based on each user's particular needs. This would mostly entail running each party's usual crawls and looking for anything abnormal.

Serious regressions would lead to fixes, rollback of features or (in dire cases) cancelling the release. Once everyone signs off, a new release is minted and pushed to a public Maven repo.

The focus here is primarily on testing. While there might be a bit of development work to fix a bug that is discovered, the emphasis is on vetting that the proposed release does not contain any notable regressions.

By having multiple parties, each running the candidate build through their own workflow, the odds are greatly improved that we'll catch any serious issues. Of course, this could be done by a dedicated QA team. But the odds of putting that together are small, so we must make do.

I'd love it if the Internet Archive (IA) was party to this or even took over leading it. But they aren't essential. It is perfectly feasible to alter the "group ID" and release a version under another "flag", as it were, if IA proves uninterested.

Again, to be clear, this is not an effort to set up a development effort around Heritrix, like the IIPC did for OpenWayback. This is just focused on getting regular stable builds released based on the latest code. Period.


Sign up 


If the above sounds good and you'd like to participate, by all means get in touch: in the comments below, on Twitter or by e-mail.

At minimum you must be willing to do the following once or twice a year:
  • Download a specific build of Heritrix
  • Run crawls with said build that match your production crawls
  • Evaluate those crawls, looking for abnormalities and errors compared to your usual crawls
    • A fair amount of experience with running Heritrix is clearly needed.
  • Report your results
    • Ideally, in a manner that allows issues you uncover to be reproduced
All of this would be done during a coordinated time frame, probably spanning about two weeks.

Better still if you are willing to look into the causes of any problems you discover.

Help with admin tasks, such as pushing releases etc. would also be welcome.

At the moment, this is nothing more than an idea and a blog post. Your responses will determine if it ever amounts to anything more.

September 4, 2015

How big is your web archive?

How big is your web archive? I don't want you to actually answer that question. What I'm interested in is, when you read that question, what metric jumped into your head?

I remember long discussions on web archive metrics when I was first starting in this field, over 10 years ago. Most vividly I remember the discussion on what constitutes a "document".

Ultimately, nothing much came of any of those discussions. We are mostly left talking about number of URL "captures" and bytes stored (3.17 billion and 64 TiB, by the way). Yet these don't really convey all that much, at least not without additional context.

Another approach is to talk about websites (or seeds/hosts) captured. That still leaves you with a complicated dataset. How many crawls? How many URLs per site? Etc.

Nicholas Taylor (of Stanford University Libraries) recently shared some slides on this subject, "Measure All the (Web Archiving) Things!", that I found quite interesting and that revived this topic in my mind.

It can be a very frustrating exercise to communicate the qualities of your web archive. If the Internet Archive has 434 billion web pages and I only have 3.17 billion, does that make IA's collection 137 times better?

I imagine anyone who's reading this knows that that isn't how things work. If you are after world wide coverage, IA's collection is infinitely better than mine. Conversely, for Icelandic material, the Icelandic web archive is vastly deeper than IA's.

We are, therefore, left writing lengthy reports detailing the number of seeds/sites, URLs crawled, frequency of crawling and so on. Not only does this make it hard to explain to others how good (or not so good, as the case may be) our web archive is, it also makes it very difficult to judge it against other archives.

To put it bluntly, it leaves us guessing at just how "good" our archive is.

To be fair, we are not guessing blindly. Experience (either first hand or learned from others) provides useful yardsticks and rules of thumb. If we have resources for some data mining and quality assurance, those guesses get much better.

But it sure would be nice to have some handy metrics. I fear, however, that this isn't to be.

August 24, 2015

Writing WARCs

We started doing deduplication four years before we started using WARC. As the ARC format had no revisit concept, the only record of the deduplicated items from that era lies in the crawl logs.

When we put our collection online back in 2009 we built our own indexer that consumed these crawl logs so we could include these items. It worked very well at the time.

As our collection grew, we eventually hit scaling issues with the custom indexer. Ultimately, it was abandoned when we upgraded to Wayback 1.8 two years ago. We moved to use a CDX index instead. The only casualty was the early deduplication/revisit records from our ARC days which were no longer included.

So, for the last two years, I've had a task on my todo list. Create a program that consumes all of those crawl logs and spits out WARC files containing the revisit records that are missing from the ARC files.

For a while I toyed with the idea of incorporating this in a wider ARC to WARC migration effort. But it's become clear that such a migration just isn't worthwhile. At least not yet.

Recently I finally made some time to address this. I figured it shouldn't take much time to implement. Basically, the program needs to do two things:

  1. Ingest and parse a crawl.log file from Heritrix, making it possible to iterate over its lines and access specific fields. As the DeDuplicator already does this, it was a minor task to adapt the code.
  2. For each line that represented a deduplicated URI, write a revisit record in a WARC file.

Boom, done. Can knock that out in an hour, easy.

Or, I should be able to.

It turns out that writing WARCs is painfully tedious. The libraries require you to understand the structure of a WARC file very well and do nothing to guide you. There is also no useful documentation on this subject. Your best bet is to find code examples, but even those can be hard to find and may not directly address your needs.

I tried both the webarchive-commons library and JWAT. Both were tedious, but JWAT was less so. Both require you to understand exactly what header fields need to be written for each record type, to know that you need to write a warcinfo record first, and so on. At least JWAT made it fairly simple to configure the header fields.
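To give an idea of what "knowing exactly what to write" means, here is roughly the kind of record my little program needs to emit for each deduplicated URL. The header names come from the WARC 1.0 spec plus the IIPC revisit advisory mentioned elsewhere on this blog; the URI, dates, digest and record ID below are made-up placeholders:

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://example.is/images/header.jpg
WARC-Date: 2012-03-14T08:30:00Z
WARC-Record-ID: <urn:uuid:0f87e1d2-9f4c-4e0b-b8a6-1c2d3e4f5a6b>
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
WARC-Refers-To-Target-URI: http://example.is/images/header.jpg
WARC-Refers-To-Date: 2011-11-01T12:00:00Z
Content-Length: 0

Not a lot of data, but nothing in the libraries tells you that this is the set of fields a revisit record needs.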

Consulting both existing WARC files and the WARC spec, I was able to put all the pieces together in about half a day using JWAT.

And that's when I realized that JWAT does not track the number of bytes written out. That means I can't split the writing up to make "even sized" WARCs like Heritrix does (usually 1 GiB/file).

Darn, I need to rewrite this, after all, using webarchive-commons.

**Sigh**

It seems to me that such an important format should have better library support. A better library would save a lot of time and effort.

By saving effort, we may also wind up saving the format itself. The danger that someone will create a widely used program that writes invalid WARCs is very real. If such invalid WARCs proliferate that can greatly undermine the usefulness of the format.

It is important to make it easy to do the right thing. Right now, even someone, like myself, who is very familiar with WARC needs to double and triple check every step of a very simple WARC writing program. Never mind if you need to do something a little advanced.

Someone with less knowledge and, perhaps, less motivation to "get it right" could all too easily write something "good enough".


It is important to demystify the WARC format. Good library support would be an ideal start.

August 17, 2015

The WARC Format 1.1

The WARC Format 1.0 is an ISO standard describing the container format that web archives use to store their data. WARCs contain not only the actual file resources (HTML, images, JavaScript etc.) but also request and response headers, metadata about the resources and the overall collection, deduplication records and conversion records.

It's a pretty flexible format. It has served us quite well, but it is not perfect.

While it is an ISO standard, most of it was written by IIPC members. Indeed, it is heavily influenced by the ARC format developed by the Internet Archive. So, now that the WARC format is being revisited, it is only natural that the IIPC community, again, writes the first draft.

At the IIPC GA this year, in Stanford, there was a workshop where the pain points of the current specification were brought to light. There was a lot of energy in the room and people were excited. But, as everyone got back home a lot of that energy went away.

It is a lot easier to talk about change than to make it happen. Making things more difficult, few of us know much about the standards process. It all felt very inscrutable.

To help with the procedural aspect we came up with an approach that involves using the tools we are familiar with (software development). Consequently, we (and by "we" I mean Andy Jackson of the British Library) set up a GitHub project around the WARC specification.

Any problems with the existing specification could be raised there as "issues" (you'll find all the ones discussed in Stanford on there!). The existing spec could be included as markdown formatted text and any proposed changes could be submitted as "pull requests" acting on the text of the existing spec.

Currently there are two pull requests, each representing a proposed set of changes to address one specific shortcoming of the existing spec.

One of the pull requests comes from yours truly. It addresses the matter of "URI-agnostic revisit records". This was previously dealt with via an advisory on the subject adopted by the IIPC. This allows us to promote what has been a de facto standard into the actual standard.

The other pull request centers on improving the resolution of timestamps in WARC headers.

Neither pull request has been merged, meaning that both are up for comment and may change or be rejected altogether. There are also many issues that still need to be addressed.

I would like to encourage all interested parties (IIPC members and non-members alike) to take advantage of the GitHub venue if the WARC format is important to you. You can do this by opening issues if you have a problem with the spec that hasn't been brought up. You can do this by commenting on existing issues and pull requests, suggesting solutions or pointing out pitfalls.

And you can do this by suggesting actual edits to the standard via pull requests (let us know if you need help with the technical bits).

Ultimately, the draft thus generated will be passed on to the relevant ISO group for review and approval. This will happen (as I understand it) next year.

So grab the opportunity while it presents itself and have your say on The WARC Format 1.1.

July 13, 2015

Customizing Heritrix Reports

It is a well known fact (at least among its users) that Heritrix comes with ten default reports. These can be generated and viewed at crawl time and will be automatically generated when the crawl is terminated. Most of these reports have been part of Heritrix from the very beginning and haven't changed all that much.

It is less well known that these reports are part of Heritrix's modular configuration structure. They can be replaced, configured (in theory), and additional reports can be added.

The ten built-in reports do not offer any additional configuration options. (Although that may change if a pull request I made is merged.) But the most useful aspect is the ability to add reports tailored to your specific needs. Both the DeDuplicator and the CrawlRSS Heritrix add-on modules use this to surface their own reports in the UI and ensure those are written to disk at the end of a crawl.

In order to configure reports, it is necessary to edit the statisticsTracker bean in your CXML crawl configuration. That bean has a list property called reports. Each element in that list is a Java class that extends the abstract Report class in Heritrix.
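As a rough sketch, the relevant part of the configuration looks something like this (the class names are from memory of the default Heritrix 3 profile and may differ slightly between versions; the last bean is a hypothetical custom report):

<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">
  <property name="reports">
    <list>
      <!-- a few of the ten built-in reports, normally loaded by default -->
      <bean id="crawlSummaryReport" class="org.archive.crawler.reporting.CrawlSummaryReport" />
      <bean id="seedsReport" class="org.archive.crawler.reporting.SeedsReport" />
      <bean id="hostsReport" class="org.archive.crawler.reporting.HostsReport" />
      <!-- ... the rest of the built-in reports ... -->
      <!-- a custom report wired in alongside them (hypothetical class) -->
      <bean id="diskUsageReport" class="is.example.reporting.DiskUsageReport" />
    </list>
  </property>
</bean>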

That is really all there is to it. Those beans can have their own properties (although none do--yet!) and behave just like any other simple bean. To add your own, just write a class, extend Report and wire it in. Done.

One caveat: in the default profile this reports list is commented out. When the reports property is left empty, the StatisticsTracker loads up the default reports. Once you uncomment the list, it overrides any 'default list' of reports. This means that if future versions of Heritrix change which reports are 'default', you'll need to update your configuration or miss out.

Of course, you may want to 'miss out', depending on what the change is.

My main annoyance is that the reports list requires subclasses of a class, rather than implementations of an interface. This needs to change so that any interested class could implement the contract and become a report. As it is, if you have a class that already extends another class and has a reporting function, you need to create a special report class that does nothing but bridge the gap to what Heritrix needs. You can see this quite clearly in the DeDuplicator, where there is a special DeDuplicatorReport class for exactly that purpose. A similar thing came up in CrawlRSS.

I've found it to be very useful to be able to surface customized reports in this manner. In addition to the two use cases I've mentioned (which are publicly available), I also use it to draw up a report on disk usage (built into the module that monitors for out-of-disk-space conditions) and for a report on which regular expressions have been triggered during scoping (built into a variant of the MatchesListRegexDecideRule).

I'd had both of those reports available for years, but they had always required using the scripting console to get at them. Having them just a click away and automatically written at the end of a crawl has been quite helpful.


If you have Heritrix related reporting needs that are not being met, there is a comment box below.

July 1, 2015

Leap second and web crawling

A leap second was added last midnight. This is only the third time that has happened since I started doing web crawls, and the first time it happened while I had a crawl running. So, I decided to look into any possible effects or ramifications a leap second might have for web archiving.

Spoiler alert: there isn't really one. At least not for Heritrix writing WARCs.

Heritrix is typically run on a Unix type system (usually Linux). On those systems, the leap second is implemented by repeating the last second of the day. I.e. 23:59:59 comes around twice. This effectively means that the clock gets set back by a second when the leap second is inserted.

Fortunately, Heritrix does not care if time gets set back. My crawl.log did record the event quite clearly, as the following excerpt shows (we were crawling about 37 URLs/second at the time):

2015-06-30T23:59:59.899Z 404 19155 http://sudurnes.dv.is/media/cache/29/74/29747ef7a4574312a4fc44d117148790.jpg LLLLLE http://sudurnes.dv.is/folk/2013/6/21/slagurinn-tok-mig/ text/html #011 2015063023595
2015-06-30T23:59:59.915Z 404 242 http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-ford-econoline-v8-351-w/textarea.bbp-the-content ELRLLLLLLLX http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-for
2015-06-30T23:59:59.936Z 200 3603 http://foldaskoli.is/myndasafn/index.php?album=2011_til_2012/47-%C3%9Alflj%C3%B3tsvatn&image=dsc01729-2372.jpg LLLLLLLLL http://foldaskoli.is/myndasafn/index.php?album=
2015-06-30T23:59:59.019Z 200 42854 http://baikal.123.is/themes/Nature/images/header.jpg PLLPEE http://baikal.123.is/ottSupportFiles/getThemeCss.aspx?id=19&g=6843&ver=2 image/jpeg #024 20150630235959985+-
2015-06-30T23:59:59.025Z 200 21520 http://bekka.bloggar.is/sida/37229/ LLLLL http://bekka.bloggar.is/ text/html #041 20150630235959986+13 sha1:C2ZF67KFGUDFVPV46CPR57J45YZRI77U http://bloggar.is/ - {"warc
2015-06-30T23:59:59.040Z 200 298365 http://www.birds.is/leikskoli/images/3072/img_2771__large_.jpg LLRLLLLLLLLLL http://www.birds.is/leikskoli/?pageid=3072 image/jpeg #005 20150630235956420+2603 sha1:F65B

There may be some impact on tools parsing your logs if they expect the timestamps to, effectively, be in order. But I'm unaware of any tools that make that assumption.

But, what about replay?


The current WARC spec calls for timestamps with a resolution of one second. This means that all the URLs captured during the leap second will get the same value as those captured during the preceding second. No assumptions can be made about the order in which these URLs were captured, any more than you can about the order of URLs captured normally during a single second. It doesn't really change anything that this period of uncertainty now spans two seconds instead of one. The effective level of uncertainty remains about the same.

Sidenote: the order of the captured URLs in the WARC may be indicative of crawl order, but that is not something that can be relied on.

There is actually a proposal for improving the resolution of WARC dates. You can review it on the WARC review GitHub issue tracker. If adopted, a leap second event would mean that the WARCs actually contain incorrect information.

The fix to that would be to ensure that the leap second is encoded as 23:59:60 as per the official UTC spec. But that seems unlikely to happen as it would require changes to Unix timekeeping or using non-system timekeeping in the crawler.

Perhaps it is best to just leave the WARC date resolution at one second.

June 30, 2015

Web archiving APIs - a start

Even though it didn't feature heavily on the official agenda, the topic of web archive APIs repeatedly came up during the last IIPC GA in Stanford. This is an understandable desire as the use of APIs enables interoperability amongst different tools. This, in theory, allows you to build (many) narrow purpose tools and then have them work, seamlessly, together. No more monolithic behemoths that are a pain to maintain.

In fact, we already have one web archive "API" in wide use; the WARC file format. While technically not an "application programming interface" it serves the same fundamental purpose, to enable interoperability. It has decoupled harvesters (e.g. Heritrix) from replay systems (e.g. OpenWayback) and both of those from analytical/data mining software etc.

Overall I'd say the WARC format has been a success, albeit not without its flaws. More on that later.

So, now there is interest in defining more "APIs". Indeed, there was a survey, recently, about the possibility of setting up a special "APIs Working Group" within the IIPC.

What I find missing from these conversations is: what are these APIs? Which facets of web archiving are amenable to being formalized in this manner?

We do have a few informal "APIs" floating around. The CDX file format is one. The Memento protocol is another. Cleaning up and formalizing an existing de facto API avoids the pitfall of creating an API that entirely fails to work in the "real world."

To understand this, let's go back to the WARC format. It was an extension of the ARC file format developed by the Internet Archive. In drafting the WARC standard, the lessons of the ARC file format were used and most of the ARC format's shortcomings were addressed.

Predictably, where the WARC format has shown weakness is in areas that ARC did not address at all. For example in handling duplicate records. In those areas, wrong assumptions were made due to lack of experience. This in turn, for example, significantly delayed the widespread use of deduplication.

The best APIs/protocols/standards emerge from hard won experience. I would argue very strongly against developing any API from the top down.

With that in mind, there are two possible APIs I believe are ready to be made more formal.

The first is the current "CDX Server protocol". The CDX Server is a component within OpenWayback that resolves URI+timestamp requests into stored resources (either a single one or a range, depending on the nature of the query). Note that a "CDX Server" need not use a CDX style index; that is merely how it is done now. This is a protocol for separating the user interface of a replay tool (like OpenWayback) from its index. Let's call it the Web Archive Query Protocol, WAQP, for now.
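To make that a bit more concrete, a WAQP request is essentially an HTTP GET against the index. Something along these lines (parameter names as used, from memory, by the current CDX server implementations; the host and query values are made up):

GET http://wayback.example.org/cdx?url=example.com/index.html&from=20140101&to=20141231&limit=10&output=json

The response is a list of matching captures (timestamp, original URL, digest, WARC filename and offset, and so on) which the front end then uses to fetch and rewrite the actual records.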

Pywb, another replay tool, uses almost the same protocol in its implementation. With a well defined WAQP, you might be able to use the Pywb front end (display) with the OpenWayback back end (CDX server), or the other way around.

With a well defined WAQP, the job of indexing WARCs would become independent of the job of replaying web archives by rewriting URLs. This would make it much easier to develop new versions of either type of software.

The other API is a bit more speculative. Web archiving proxies now exist and serve a very useful function. If a standard API was established for how these proxies interacted with the software driving the harvesting activity, it would be much easier to pair the two together. This API could possibly be built on top of existing proxy protocols.

I don't mean to imply that this is the definitive final list of web archiving APIs to be developed. These are simply two areas I believe are ready to be formalized and where doing so is likely to be of benefit in the near term.

So, how best to proceed? As noted earlier, the idea has been floated to establish a special APIs working group. On the other hand, there is already precedent for running API development through the existing working groups (WARC was developed within the harvesting working group).

My personal opinion is that once a specific API has been identified as a goal, a special working group (or task force or whatever name you'd like to give it) be formed around that one API. This working group would then disband once the job is done. Such groups could be re-formed if there is a need for a significant revision.

It is also important that any API formalization be done in close cooperation with at least some of the relevant tool maintainers. APIs that are not implemented are, after all, useless.




June 24, 2015

OpenWayback 2.2.0


OpenWayback 2.2.0 was recently released. This marks OpenWayback's third release since becoming a ward of the IIPC in late 2013. This is a fairly modest update and reflects our desire to make frequent, modest sized releases. A few things are still worth pointing out.

First, as of this release, OpenWayback requires Java 7. Java 7 has been out for four years and Java 6 has not been publicly updated in over two years. It is time to move on.

Second, OpenWayback now officially supports internationalized domain names. I.e. domain names containing non-ASCII characters.

Third, UI localization has been much improved. It should now be possible to translate the entire interface without having to mess with the JSP files and otherwise "go under the hood".

And the last thing I'll mention is the new WatchedCDXSource which removes the need to enumerate all the CDX files you wish to use. Simply designate a folder and OpenWayback will pick up all the CDX files in it.

The road to here hasn't been easy, but it is encouraging to see that the number of people involved is slowly, but surely, rising. For the 2.2.0 release, we had code contributions from Roger Coram (BL), Lauren Ko (UNT), John Erik Halse (NLN), Sawood Alam (ODU), Mohamed Elsayed (BA) and myself, in addition to the IIPC-paid-for work by Roger Mathisen (NLN). Even more people were involved in reporting issues, managing the project and testing the release candidate. My thanks to everyone who helped out.

And going forward, we are certainly going to need people to help out.

Version 2.3.0 of OpenWayback will be another modest bundle of fixes and minor features. We hope it will be ready in September (or so). There are already 10 issues open for it as I write this.

But, we also have larger ambitions. Enter version 3.0.0. It will be developed in parallel with 2.3.0 and aims to make some big changes. Breaking changes. OpenWayback is built on an aging codebase, almost a decade old at this point. To move forward, some big changes need to be made.

The exact features to be implemented will likely shift as work progresses but we are going to increase modularity by pushing the CDXServer front and center and removing the legacy resource stores. In addition to simplifying the codebase, this fits very nicely with the talk at the last GA about APIs.

We'll also be looking at redoing the user interface using iFrames and providing insight into the temporal drift of the page being viewed. The planned issues are available on GitHub. The list is far from locked and we welcome additional input on which features to work on.

We welcome additional work on those features even more!

I'd like to wrap this up with a call to action. We need a reasonably large community around the project to sustain it. Whether it's testing and bug reporting, occasional development work or taking on more by becoming one of our core developers, your help is both needed and appreciated.

If you'd like to become involved, you can simply join the conversation on the OpenWayback GitHub page. Anyone can open new issues and post comments on existing issues. You can also join the OpenWayback developers mailing list.

---

This post was written for the IIPC blog.

March 23, 2015

URI Canonicalization in Web Archiving

URI canonicalization is an arcane part of web archiving that is often overlooked, despite having some important implications. This blog post is an attempt at shining a light on this topic, to illustrate those implications and point out at least one thing that we can improve on.

Heritrix

Heritrix keeps track of which URIs it has already crawled so that it doesn't repeatedly crawl the same URI. This fails when there are multiple URIs leading to the same content. Unless you deal with the issue, you wind up in a crawler trap, crawling the same (or nearly the same) content endlessly.

URI canonicalization is an attempt to deal with certain classes of these. This is done by establishing transformation rules that are applied to each discovered URI. The result of these transformations is regarded as the canonical form of the URI. It is this canonical form which is used to discard already seen URIs. The original URI is always used when making the HTTP request; the canonical form serves only to prevent multiple fetches of the same content.

By default Heritrix applies the following transformations (in this order) to generate the canonical form of a URI:
  • Make all letters in the URI lower case.
  • Remove any user/password info in the URI (e.g. http://user:password@example.com becomes simply http://example.com)
  • Strip any leading www prefixes. Also strip any www prefixes with a number (e.g. http://www3.example.com becomes http://example.com)
  • Strip session IDs. This deals with several common forms of session IDs that are stored in URIs. It is unlikely to be exhaustive.
  • Strip trailing empty query strings. E.g. http://example.com? becomes http://example.com

Two issues are quickly apparent.

One, some of these rules may cause URIs that contain different content to be regarded as the same. For example, while domain names are case insensitive, the path segment need not be. Thus the two URIs http://example.com/A.html and http://example.com/a.html may be two different documents.

Two, despite fairly aggressive URI canonicalization rules, there will still be many instances where the exact same content is served up on multiple URIs. For example, there are likely many more types of session ids than we have identified here.

We can never hope to enumerate all the possible ways websites deliver the same content with different URIs. Heritrix does, however, make it possible to tweak the URI canonicalization rules. You can thus deal with them on a case-by-case basis. Here is how to configure URI canonicalization rules in Heritrix 3.
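For reference, the defaults listed above correspond, roughly, to a list of rule beans in the crawl configuration. A sketch, with class names recalled from the Heritrix 3 defaults (they may be slightly off in your version):

<bean id="canonicalizationPolicy" class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.canonicalize.LowercaseRule" />
      <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
      <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
      <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
      <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
      <bean class="org.archive.modules.canonicalize.FixupQueryString" />
      <!-- site-specific rules would be appended here -->
    </list>
  </property>
</bean>

Adding or removing beans in that list is how you tweak the rules.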

But, you may not want to do that.

You'll understand why as we look at what URI canonicalization means during playback.

OpenWayback

As difficult as harvesting can be, accurate replay is harder. This holds doubly true for URI canonicalization.

Replay tools, like OpenWayback, must apply canonicalization under two circumstances. Firstly, when a new resource is added to the index (CDX or otherwise), it is the canonical form that must be indexed. Secondly, when resolving a URI requested by the user.

User requested URIs are not just the ones users type into search boxes. They are also the URIs for embedded content needed to display a page and the URIs of links that users click in replayed webpages.

Which is where things get complicated. OpenWayback may not use the exact same rules as Heritrix.

The current OpenWayback canonicalizer (called AggressiveUrlCanonicalizer) applies rules that work more or less the same way as what Heritrix uses. They are not, however, the exact same rules (i.e. code), nor can they be customized in the same manner as in Heritrix.

There is a strong argument for harmonizing these two. Move the code into a shared library, ensure that both tools use the same defaults and can be customized in the same manner. (It was discussion of exactly this that prompted this blog post.)

Additionally, it would be very beneficial if each WARC contained information, in some standard notation, about which canonicalization rules were in effect during its creation.

All this would clearly help but even then you still may not want to fiddle with canonicalization rules.

Every rule you add must be applied to every URI. If you add a new rule to Heritrix, you must add it to OpenWayback to ensure playback. Once added to OpenWayback it will go into effect for all URIs being searched for. However, unless you rebuild your index it may contain URIs in a form that is no longer consistent with the current canonical form. So, you need to rebuild the index, which is expensive.

This is also assuming that it is safe to impose a rule retroactively. OpenWayback does not support limiting canonicalization rules to specific time periods. To make matters worse, a rule that would be very useful on one website may not be suitable for another.

Dealing with a matrix of rules that apply some of the time to some of the content of some types does not seem enticing. Maybe there is a good way to express and apply such rules that no one has yet brought to the table. In the meantime, all I can say is: proceed with caution.

March 18, 2015

Preservation Working Group Web Archiving Deduplication Survey

The IIPC Preservation Working Group is conducting a survey of member institutions' deduplication activities. If you or your institution hasn't yet responded, I urge you to do so. Right now. Don't worry, I'll wait.

Answering the survey for my own institution brought up a few things I'd like to go into in greater detail than was possible in the survey.

What kind of content do you deduplicate?

We've been doing deduplication for over nine years. In that time we've done some analysis on this from time to time, most recently last December, which led to this blog post on deduplication of text-based data.

Ultimately, it boils down to a trade-off between the impact the deduplication has on crawl rate and the potential savings in terms of storage. For our large, thrice yearly crawls, we now exclude from deduplication only data whose content type equals "text/html".

HTML documents make up over half of all documents, but they are generally small, frequently created dynamically and compress easily. In other words, they make the deduplication index much bigger while not contributing much to deduplication space saving.

Despite this, for our weekly and other focused crawls, we do deduplicate on all content types, as those crawls' speeds are usually dictated by politeness rules. Deduplicating on all content types thus comes at near zero cost. Also, the more frequent crawl schedule means that there are, relatively, more savings to be had from deduplicating on HTML than in less frequent crawls.

One aspect that the survey didn't tackle was which protocols and response types we deduplicate on. We currently do not deduplicate on FTP (we do not crawl much FTP content). We also only deduplicate on HTTP 200 responses. I plan to look into deduplicating on 404 responses in the near future. Most other response types have an empty or negligible payload.

If you deduplicate, do you deduplicate based on URL, hash or both?

Historically, we've only done URL based deduplication. Starting with the last domain crawl of 2014, we've begun rolling out hash based deduplication. This is done fairly conservatively with the same URL being chosen as the 'original' when possible. This has an impact on crawl speeds and we may revise this policy as we gain more confidence in hash based deduplication.

The reason why hash based deduplication hasn't been used earlier is all about tool support. OpenWayback has never had a secondary index to look up based on hashes. It wasn't until we introduced additional data into the WARC revisit records that this was possible. This has now been implemented by both Heritrix and OpenWayback making hash based deduplication viable.

The additional savings gained by hash based deduplication are modest, about 10% in our crawls. In certain circumstances, however, they may help deal with unfortunate website design choices.

For example, one media site we crawl recently started adding a time code to their videos' URLs so their JavaScript player could skip to the right place in them. The time code is an URL parameter (e.g. "video.mp4?t=100") that doesn't affect the downloaded content at all. It is merely a hint to the JavaScript player. With crawl time hash based deduplication, it is possible to store each video only once.

How do you deduplicate?

This question and the next few after it address the same issues I discussed in my blog post about the downsides of web archive deduplication.

The primary motivation behind our deduplication effort is to save on storage. We wouldn't be able to crawl nearly as often without it. Moreover, our focused crawls would be impossible without deduplication as we discard as much as 90% of data as duplicates in such crawls.

Even doing a single 'clean' (non deduplicated) crawl every two years would mean we had to go from three to two crawls a year due to our storage budget. So we don't do that.

It's an arguable tradeoff. Certainly, it is going to be more difficult for us to break out discrete crawls due to the reduplication problem. Ultimately it comes down to the fact that what we don't crawl now is lost to us, forever. The reduplication problem can be left to the future.

That just leaves...

Do you see any preservation risks related to deduplication? 
Do you see any preservation risks related to deduplication and the number of W/ARC copies the institution should keep? 

In general, yes, there is a risk. Any data loss has the potential to affect a larger part of your archive.

To make matters worse, if you lose a WARC you may not be able to accurately judge the effects of its loss! Unless you have an index of all the original records that all the revisit records point to, you may need to scan all the WARCs that could potentially contain revisit records pointing to the lost WARC.

This is clearly a risk.

You also can't easily mitigate it by having some highly cited original records be stored with additional copies as those original records are scattered far and wide within your archive. Again, you'd need some kind of index of revisit references.

We address this simply by having three full copies (two of which are, in turn, stored on RAID based media). You may ask, but doesn't that negate some of the monetary savings of deduplication if you then store it multiple times? True, but we are going to want at least two copies at a minimum. Additionally, only one copy need be on a high speed, high availability system (we have two copies on such systems) for access. Further copies can be stored in slower or offline media which tends to be much cheaper.

Whether three (technically 3.5 due to RAID) is enough is then the question. I don't have a definitive answer. Do you?

March 13, 2015

What's Next for OpenWayback

About one month ago, OpenWayback 2.1.0 was released. This was mostly a bug-fix release with a few new features merged in from Internet Archive's Wayback development fork. For the most part, the OpenWayback effort has focused on 'fixing' things. Making sure everything builds and runs nicely and is better documented.

I think we've made some very positive strides.

Work is now ongoing for version 2.2.0. Finally, we are moving towards implementing new things! 2.2.0 still has some fixing to do. For example, localization support needs to be improved. But, we're also planning to implement something new, support for internationalized domain names.

We've tentatively scheduled the 2.2.0 release for "spring/early summer".

After 2.2.0 is released, the question will be which features or improvements to focus on next. The OpenWayback issue tracker on GitHub has (at the time of writing) about 60 open issues in the backlog (i.e. not assigned to a specific release).

We're currently in the process of trying to prioritize these. Our current resources are nowhere near sufficient to resolve them all. Prioritization will involve several aspects, including how difficult they are to implement, how popular they are and, not least, how clearly they are defined.

This is where you, dear reader, can help us out by reviewing the backlog and commenting on issues you believe to be relevant to your organization. We also invite you to submit new issues if needed.

It is enough to just leave a comment that this is relevant to your organization. Even better would be to explain why it is relevant (this helps frame the solution). Where appropriate we would also welcome suggestions for how to implement the feature. Notably in issues like the one about surfacing metadata in the interface.

If you really want to see a feature happen, the best way to make it happen is, of course, to pitch in.

Some of the features and improvements we are currently reviewing are:
  • Enable users to 'diff' different captures of an HTML page. Issue 15.
  • Enable search results with a very large number of hits. Issue 19.
  • Surface more metadata. Issue 28 and 29.
  • Enable time ranged exclusions. Issue 212.
  • Create a revisit test dataset. Issue 117.
  • Use CDX indexing as the default instead of the BDB index. Issue 132.

As I said, these are just the ones currently being considered. We're happy to look at others if there is someone championing them.

If you'd like to join the conversation, go to the OpenWayback issue tracker on GitHub and review issues without a milestone.

If you'd like to submit a new issue, please read the instructions on the wiki. The main thing to remember is to provide ample details.

We only have so many resources available. Your input is important to help us allocate them most effectively.

March 10, 2015

Implementing CrawlRSS

In my last post I talked about how you could use RSS feeds to continuously crawl sites. Or, at least, crawl sites that provide good RSS feeds.

The idea for how this could work first came to me five years ago. Indeed, at the IIPC Harvesting Working Group meeting in Vienna in September of 2010, I outlined the basic concept (slides PPT).

The concept was simple enough, unfortunately, the execution was a little more tricky. I wanted to implement this building on top of Heritrix. This saves having to redo things like WARC writing, deduplication, link extraction, politeness enforcement etc. But it did bump up against the fact that Heritrix is built around doing snapshots, not crawling continuously.

Heritrix in 30 Seconds

For those not familiar with Heritrix's inner workings, I'm going to outline the key parts of the crawler.

Most crawl state (i.e. what URLs are waiting to be crawled) is managed by the frontier. All new URLs are first entered into the Frontier where they are scheduled for crawling. This is done by placing URLs in proper queues, possibly putting them in front of other URLs and so forth.

When an URL comes up for crawling it is emitted by the Frontier. At any given time, there may be multiple URLs being crawled. Each emitted URL passes through a chain of processors. This chain begins with preparatory work (e.g. checking robots.txt), proceeds to actually fetching the URL, then to link extraction and WARC writing; eventually the links discovered in the downloaded document are processed and, finally, the URL is sent back to the Frontier for final disposition.

Each discovered URL passes through another chain of processors which check if it is in scope before registering it with the Frontier.

If an error occurs, the URL is still returned to the Frontier which decides if it should be retried or not.

The Frontier is Heritrix's state machine and, typically, when the Frontier is empty, the crawl is over.

Implementing CrawlRSS

Heritrix's frontier is a replaceable module. Thus one solution is to simply provide an alternative Frontier that implements the RSS crawling described in my last post. This is what I tried first and initial results were promising. I soon ran into some issues, however, when I tried to scale things up.

A lot of functionality is built into the default Heritrix frontier (so called BdbFrontier). While an attempt has been made to make this flexible via intermediate classes (AbstractFrontier) and some helper classes, the truth is that you have to redo a lot of 'plumbing' if you replace the frontier wholesale.

Because the basic function of the RssFrontier was so different from what was expected of the regular Frontier, I found no way of leveraging the existing code. Since I was also less than keen on reimplementing things like canonicalization, politeness enforcement and numerous other frontier functions, I had to change tack.

Instead of replacing the default frontier, I decided to 'manage' it instead. The key to this is the ability of the default frontier to 'run while empty'. That is to say, it can be configured to not end the crawl when all the queues are empty.

The responsibility for managing the crawl state would reside primarily in a new object, the RssCrawlController. Instead of the frontier being fed an initial list of seed URLs via the usual mechanism, the RssCrawlController is wholly responsible for providing seeds. It also keeps track of any URLs deriving from the RSS feeds and ensures that the feeds aren't rescheduled with the frontier until the proper time.

This is accomplished by doing three things. One, providing a replacement FrontierPreparer module. The FrontierPreparer module is responsible for preparing discovered URLs for entry into the frontier. The overloaded class RssFrontierPreparer also notifies the RssCrawlController allowing it to keep track of URLs descended from a feed.

Two, listen for CrawlURIDispositionEvents. These are events that are triggered whenever an URL is finished, either successfully or by failing with no (further) retries possible.

Three, deal with the fact that no disposition event is triggered for URLs that are deemed duplicates.

UriUniqFilter

Heritrix uses so called UriUniqFilters to avoid repeatedly downloading the same URL. This can then be bypassed for individual URLs if desired (such as for refreshing robots.txt or, as in our case, when recrawling RSS feeds).

Unfortunately, this function is done in an opaque manner inside the frontier. An URL that is scheduled with the frontier and then fails to pass this filter will simply disappear. This was no good, as the RssCrawlController needs a full accounting of each discovered URL or it can't proceed to recrawl the RSS feeds.

To accomplish this I subclassed the provided filters so they implemented a DuplicateNotifier interface. This allows the RssCrawlController to keep track of those URLs that are discarded as duplicates.

It seems that perhaps the UriUniqFilters should be applied prior to candidate URIs being scheduled with the frontier, but that is a subject for another time.

RSS Link Extraction

A custom RssExtractor processor handles extracting links from RSS feeds. This is done by parsing the RSS feeds using ROME.

The processor is aware of the time of the most recently seen item for each feed and will only extract those items whose date is after that. This avoids crawling the same items again and again. Of course, if the feed itself is deemed a duplicate by hash comparison, this extraction is skipped.

If any new items are found in the feed, the extractor will also add the implied pages for that feed as discovered links. Both the feed items and implied pages are flagged specially so that they avoid the UriUniqFilter and will be given a pass by the scoping rules.

The usual link extractors are then used to find further links in the URLs that are extracted by the RssExtractor.

Scope

As I noted in my last post, getting the scope right is very important. It is vital that it does not leak and that after crawling a feed, its new items and embedded resources, you exhaust all the URLs deriving from the feed. Otherwise, you'll never recrawl the feed.

Some of the usual scoping classes are still used with CrawlRSS, such as PrerequisiteAcceptDecideRule and SchemeNotInSetDecideRule but for the most part a new decide rule, RssInScopeDecideRule, handles scoping.

RssInScopeDecideRule is a variant on HopsPathMatchesRegexDecideRule. By default the regular expression .R?((E{0,2})|XE?) determines if an URL is accepted or not. Remember, this is applied to the hop path, not the URL itself. This allows, starting from the seed, one hop (any type) then one optional redirect (R) then up to two levels of embeds (E), only the first of which may be a speculative embed (X).

The decide rule also automatically accepts URLs that are discovered in the feeds by RssExtractor.

As noted in my last post, during the initial startup it will take a while to exhaust the scope. However, after the first run, most embedded content will be filtered out by the UriUniqFilter, making each feed refresh rather quick. You can tweak how long each round takes by changing the politeness settings. In practice we've found no problems refreshing feeds every 10 minutes. Possibly, you could go as low as 1 minute, but that would likely necessitate a rather aggressive politeness policy.

Configuration

CrawlRSS comes packaged with a Heritrix profile where all of the above has been set up correctly. The only element that needs to be configured is which RSS feeds you want to crawl.

This is done by specifying a special bean, RssSite, in the CXML. Let's look at an example.

<bean id="rssRuv" class="is.landsbokasafn.crawler.rss.RssSite">
  <property name="name" value="ruv.is"/>
  <property name="minWaitInterval" value="1h30m"/>
  <property name="rssFeeds">
    <list>
      <bean class="is.landsbokasafn.crawler.rss.RssFeed">
        <property name="uri" value="http://www.ruv.is/rss/frettir"/>
        <property name="impliedPages">
          <list>
            <value>http://www.ruv.is/</value>
          </list>
        </property>
      </bean>
      <bean class="is.landsbokasafn.crawler.rss.RssFeed">
        <property name="uri" value="http://www.ruv.is/rss/innlent"/>
        <property name="impliedPages">
          <list>
            <value>http://www.ruv.is/</value>
            <value>http://www.ruv.is/innlent</value>
          </list>
        </property>
      </bean>
    </list>
  </property>
  
</bean>


Each site need not map to an actual site, but I generally find that to be the best way. For each site you set a name (purely for reporting purposes) and a minWaitInterval. This is the minimum amount of time that will elapse between the feeds for this site being crawled.

You'll note that the minimum wait interval is expressed in human readable (-ish) form, e.g. 1h30m, instead of simply a number of seconds or minutes. This provides good flexibility (i.e. you can specify intervals down to a second) with a user-friendly notation (no need to figure out how many minutes there are in 8 hours!).

There is then a list of RssFeeds, each of which specifies the URL of one RSS feed. Each feed then has a list of implied pages, expressed as URLs.

You can have as many feeds within a site as you like.

You repeat the site bean as often as needed for additional sites. I've tested with up to around 40 sites and nearly 100 feeds. There is probably a point at which this will stop scaling, but I haven't encountered it.

Alternative Configuration - Databases

Configuring dozens or even hundreds of sites and feeds via the CXML quickly gets tedious. It also makes it difficult to change the configuration without stopping and restarting the crawl, losing the current state of the crawl.

For this purpose, I added a configuration manager interface. The configuration manager provides the RssCrawlController with the sites, all set up correctly. Simply wire in the appropriate implementation. Provided is the CxmlConfigurationManager as described above and the DbConfigurationManager that interacts with an SQL database.

The database is accessed using Hibernate and I've included the necessary MySQL libraries (for other DBs you'll need to add the connector JARs to Heritrix's lib directory). This facility does require creating your own CRUD interface to the database, but makes it much easier to manage the configuration.

Changes in the database are picked up by the crawler, and the crawler also updates the database to reflect last fetch times etc. This can survive crawl job restarts, making it much easier to stop and start crawls without getting a lot of redundant content. When paired with the usual deduplication strategies, no unnecessary duplicates should be written to WARC.

There are two sample profiles bundled with CrawlRSS, one for each of these two configuration styles. I recommend starting with the default, CXML, configuration.

Implementing additional configuration managers is fairly simple if a different framework (e.g. JSON) is preferred.

Reporting

The RssCrawlController provides a detailed report. In the sample configuration, I've added the necessary plumbing so that this report shows up alongside the normal Heritrix reports in the web interface. It can also be accessed via the scripting console by invoking the getReport() method on the RssCrawlController bean.
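
For example, something along these lines should work in the scripting console using the Beanshell engine. The bean id used here is an assumption; use whatever id your CXML assigns to the RssCrawlController bean.

// Heritrix scripting console, Beanshell engine. The bean id
// "rssCrawlController" is an assumption; check your CXML for the actual id.
controller = appCtx.getBean("rssCrawlController");
rawOut.println(controller.getReport());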

Release

All of what I've described above is written and ready for use. You can download a snapshot build from CrawlRSS's Sonatype snapshot repository. Alternatively, you can download the source from CrawlRSS's Github project page and build it yourself (simply download and run 'mvn package').

So, why no proper release? The answer is that CrawlRSS is built against Heritrix 3.3.0-SNAPSHOT. There have been notable changes in Heritrix's API since 3.2.0 (the latest stable release), requiring a SNAPSHOT dependency.

I would very much like to have a 'stable' release of the 3.3.0 branch of Heritrix, but that is a subject for another post.

February 27, 2015

Crawling websites using RSS feeds

A lot of websites have RSS feeds. Especially websites that are updated regularly with new content, such as news websites and blogs. Using these feeds it is possible to crawl the sites (notably news sites) both more effectively (getting more good content) and efficiently (getting less redundant content). This is because the RSS feeds shine a light on the most critical datum we need, what has changed.

To take advantage of this, I have developed an add-on for Heritrix, called simply CrawlRSS.

The way it works is that for any given site a number of RSS feeds are identified. For each feed a number of 'implied pages' are defined. When a new item is detected in a feed that is taken to imply that the "implied pages" (such as a category front page etc.) have also changed, causing them to be recrawled.

The feeds for a given site are emitted all at the same time. If there is duplication amongst them (a news story is both under domestic and financial news, for example) the URL is only enqueued once, but the implied pages for both feeds are enqueued. All of the discovered URLs are then crawled along with embedded items such as images, stylesheets etc that are discovered via regular link extraction.
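
To make the mechanism concrete, here is a much-simplified sketch of one feed refresh round. It is not the actual CrawlRSS code; the types, names, and the way new items are detected are stand-ins for illustration only. Embedded resources are then discovered from the enqueued pages by regular link extraction, as described above.

// Simplified sketch of the logic described above (not the actual CrawlRSS
// code): when a feed contains items we have not seen before, enqueue those
// items plus the feed's implied pages, deduplicating across the site's feeds.
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FeedRoundSketch {
    // Hypothetical stand-in for the real feed configuration.
    record Feed(String uri, List<String> impliedPages) {}

    /** Returns the URLs to enqueue for one site's feed refresh. */
    static Set<String> urlsToEnqueue(List<Feed> feeds, List<List<String>> newItemsPerFeed) {
        Set<String> toEnqueue = new LinkedHashSet<>();  // a set: each URL is enqueued only once
        for (int i = 0; i < feeds.size(); i++) {
            List<String> newItems = newItemsPerFeed.get(i);
            if (newItems.isEmpty()) {
                continue;  // nothing new in this feed; do not refresh its implied pages
            }
            toEnqueue.addAll(newItems);                    // the newly published items themselves
            toEnqueue.addAll(feeds.get(i).impliedPages()); // plus the pages implied to have changed
        }
        return toEnqueue;
    }
}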

The scoping here is very important. Aside from URLs deriving from the feeds, only a very narrow range of embedded content is allowed to be crawled. This ensures that the crawler is able to complete a crawl round in a reasonable amount of time. Obviously, this means that an RSS-triggered crawl will not capture, for example, old news stories. It is not a replacement for traditional snapshot crawls, but rather a complementary technique that can help capture the minute-to-minute changes in a website.

Once all discovered URLs deriving from one site's feeds have been crawled, the site becomes eligible for a feed update, i.e. the feeds may be crawled again. A minimum wait between crawls of the feeds can be specified. However, this is just a minimum (so we aren't constantly updating a feed that hasn't changed); no maximum limit is imposed. If there is a lot of content to download, it will simply take the time it needs. Indeed, during the first round (when all the layout images, stylesheets etc. are crawled) it may take a couple of hours for all derived URLs to be crawled.

Typically, the crawler will quickly reach the point where any given site is mostly just waiting for the next feed update. Of course, this depends mostly on the nature of the sites being monitored.

The quality of the RSS feeds varies immensely. Some sites have very comprehensive feeds for each category on the site. Others have only one overarching feed. A problematic aspect is that some websites only update their RSS feeds at fixed intervals, not as new items are added to the site. This means that the feed may be up to an hour out of date.

It is necessary to examine and evaluate the RSS feeds of prospective sites. This approach is best suited to high value targets where it is reasonable to invest a degree of effort into capturing them as best as possible.

When done correctly, on a site with well-configured feeds, this enables a very detailed capture of the site and how it changes. And this is accomplished with a fairly moderate volume of data. For example, we have been crawling several news websites weekly. This has not given a very good view of day-to-day (let alone minute-to-minute) changes, but has produced 15-20 GiB of data weekly. Similar crawling via RSS has given us much better results at a fraction of the data (less than 0.5 GiB a day). Additionally, the risk of scope leakage is greatly reduced.

Where before we got 1 capture a week, we now get up to 500 captures of the front pages per week. This is done while reducing the overall amount of data gathered! For websites that update their RSS feeds when stories are updated, we also capture the changes in individual stories.

This is, overall, a giant success. While it is true that not all websites provide a usable RSS feed, where one is present this approach can make a huge difference. Look, for example, at the search results for the front page of the state broadcaster here in Iceland, RUV, in our wayback. Before we started using their RSS feed, we would capture their front page about 15-30 times a month. After, it was more like 1500-1800 times a month. A hundredfold increase, with a very small storage footprint.

As I stated before, this doesn't replace conventional snapshot crawling. But for high value, highly dynamic sites, this can, very cheaply, improve their capture by a staggering amount.

Next week I'll do another blog post with the more technical aspects of this.

January 29, 2015

The downside of web archive deduplication

I've talked a lot about deduplication (both here and elsewhere). It's a huge boon for web archiving efforts lacking infinite storage. Its value increases as you crawl more frequently, and crawling frequently has been a key aim for my institution. The web is just too fickle.

Still, we mustn't forget about the downsides of deduplication. It's not all rainbows and unicorns, after all.

Using proper deduplication, the aim is that you should be able to undo it. You can dededuplicate, as it were, or perhaps (less tongue-twisty) reduplicate. This was, in part, the reason for the IIPC Proposal for Standardizing the Recording of Arbitrary Duplicates in WARC Files.

So, if it is reversible, there can't be a downside? Right? Err, wrong.

We'll ignore for the moment the fact that many institutions have in the past done (and some may well still do) non-reversible deduplication (notably, not storing the response header in the revisit record). There are some practical implications of having a deduplicated archive. The two primary ones are loss potential and loss of coherence.

Loss potential

There is the adage that 'lots of copies keeps stuff safe'. In a fully deduplicated archive, the loss of one original record could affect the replay of many different collections. One could argue that if a resource is used enough to be represented in multiple crawls, it may be of value to store more than one copy in order to reduce the chance of its loss.

It is worrying to think that the loss of one collection (or part of one collection) could damage the integrity of multiple other collections. Even if we only deduplicate against previous crawls of the same collection, there are implications for data safety.

Complicating this is the fact that we do not normally have any oversight over which original records have a lot of revisit records referencing them. It would be an interesting capability to be able to map out which records are most heavily cited. This could enable additional replication when a certain threshold is met.

Concerns about data safety/integrity must be weighed against the cost of storage and the likelihood of data loss. That is a judgement each institution must make for its own collection.

Loss of coherence

While the deduplication is reversible (or should be!), that is not to say that reversing it is trivial. If you wish to extract a single harvest from a series of (let's say) weekly crawls that have been going on for years, you may find yourself resolving revisit records to hundreds of different harvests.

While resolving any one revisit is relatively cheap, it is not entirely free. A weekly crawl of one million URLs may have as many as half a million revisit records. For each such record you need to query an index (CDX or otherwise) to resolve the original record, and then you need to access and copy that record. Index searches take on the order of hundreds of milliseconds for non-trivial indexes, and because the original records will be scattered all over, you'll be reading bits and pieces from all over your disk. Reading bits and pieces is much more costly on spinning HDDs than linear reads of large files.

Copying this hypothetical harvest might only take minutes if non-deduplicated. Creating a reduplicated harvest might take days; at, say, 200 ms per lookup, half a million index queries alone add up to more than a day. (Of course, if you have a Hadoop cluster or other high performance infrastructure this may not be an issue.)

I'm sure there are ways of optimizing this. For example, it is probably best to record all the revisits, order them, and then search the index for them in the same order as the index itself is structured. This will improve the index's cache hit rate. Similar things can be done when reading the originals from the WARC files.
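
A rough sketch of that ordering idea follows. The types and the index interface are hypothetical stand-ins, and it assumes the index is sorted by canonicalized URL and then timestamp, as a CDX typically is.

// Rough sketch of the ordering optimization (hypothetical types and index
// API): sort the revisit records into the index's own sort order before
// resolving them, so that consecutive lookups land close together.
import java.util.Comparator;
import java.util.List;

public class ReduplicatorSketch {
    // Stand-ins for real revisit records and a real CDX-style index.
    record Revisit(String canonicalUrl, String originalTimestamp) {}
    interface CdxIndex { String resolveOriginal(String canonicalUrl, String timestamp); }

    static void resolveInIndexOrder(List<Revisit> revisits, CdxIndex index) {
        // A CDX index is typically sorted by canonicalized URL, then timestamp.
        revisits.sort(Comparator.comparing(Revisit::canonicalUrl)
                                .thenComparing(Revisit::originalTimestamp));
        for (Revisit r : revisits) {
            // Consecutive queries now hit neighbouring parts of the index,
            // improving cache hits. The same idea applies when reading the
            // original records back out of the WARC files.
            String location = index.resolveOriginal(r.canonicalUrl(), r.originalTimestamp());
            // ... copy the original record into the reduplicated harvest ...
        }
    }
}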

There is, though, one other problem. No tool currently exists to do this. You'll need to write your own "reduplicator" if you need one.


Neither of these issues is argument enough to cause me to abandon deduplication. The fact is that we simply could not do the web archiving we currently do without it. But it is worth being aware of the downsides as well.

January 23, 2015

Answering my own question. Does it matter which record is marked as the original?

Back in October, I posed the question about web archive deduplication, does it matter which record is marked as the original?

I've come up with at least one reason why it might matter during playback.

Consider what happens in a replay tool like OpenWayback when it encounters a revisit record. The revisit record should contain information about the original URL and original capture time. The replay tool then has to go and make another query on its index to resolve these values.

If the URL is the same URL as was just resolved for the revisit record itself, it follows that any cache (including the OS disk cache for a CDX file) will likely be warm, leading to a quick lookup. In fact, OpenWayback will likely just be able to scroll backwards in its 'hit' list for the original query.

On the other hand, if the URL is different it will be located in a different part of the index (whether it is a CDX file or some other structure) and may require hitting the disk(s) again. Any help from the cache will be incidental.
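
Roughly, the flow looks like the sketch below. This is illustrative pseudologic rather than OpenWayback's actual code; the types and the index interface are made up for the example.

// Illustrative pseudologic only, not OpenWayback's actual code: replaying a
// revisit record requires a second index lookup, driven by the original URL
// and capture time recorded in the revisit record.
public class RevisitReplaySketch {
    // Hypothetical stand-ins for a real replay index and its results.
    record Hit(String url, String timestamp, boolean revisit,
               String originalUrl, String originalTimestamp) {}
    interface ReplayIndex { Hit lookup(String url, String timestamp); }

    static Hit resolve(ReplayIndex index, String url, String timestamp) {
        Hit hit = index.lookup(url, timestamp);
        if (!hit.revisit()) {
            return hit;  // a normal response record; replay it directly
        }
        // Second lookup. If originalUrl equals the URL just resolved, this
        // query lands in the same, likely cached, region of the index. A
        // different URL lands elsewhere and may mean another disk seek.
        return index.lookup(hit.originalUrl(), hit.originalTimestamp());
    }
}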

Just how much this matters in practice depends on how large your index is and how quickly it can be accessed. If it is served from a machine with super fast SSDs and copious amounts of RAM, it may not matter at all. If the index is on a 5400 RPM HDD on a memory starved machine it may matter a lot.

In any case, I think this is sufficient justification for keeping the deduplication strategy of preferring exact URL matches when available while allowing digest only matches when there isn't.


January 19, 2015

The First IIPC Technical Training Workshop

It has always been interesting to me how often a chance remark will lead to big things within the IIPC. A stray thought, given voice during a coffee break or dinner at a general assembly, can hit a resonance and lead to something new, something exciting.

So it was with the first IIPC technical training workshop. It started off as an off-the-cuff joke at last year's Paris GA, about having a 'Heritrix DecideRule party'. It struck a chord, and it quickly snowballed to include an 'OpenWayback upgradathon' and a 'SolrJam'.

The more we talked about this, the more convinced I became that this was a good idea. To have a hands on workshop, where IIPC members could send staff for practical training in these tools. Fortunately, Helen Hockx-Yu of the British Library shared this conviction. Even more fortunately, the IIPC Steering Committee wholeheartedly supported the idea. Most fortunate of all, the BL was ready, willing and able to host such an event.

So, last week, on a rather dreary January morning, around thirty web archiving professionals, from as far away as New Zealand, gathered outside the British Library in London and waited for the doors to open, everyone eager to learn more about Heritrix, OpenWayback and Solr.

Day one was dedicated to traditional, presentation-oriented dissemination of knowledge. On hand were several invited experts on each topic. In the morning the fundamentals of the three tools were discussed, with more in-depth topics after lunch. Roger Coram (BL) and I were responsible for covering Heritrix. In the morning Roger discussed the basics of Heritrix DecideRules and I covered other core features, notably sheet overlays. The afternoon focused on Heritrix's REST API, deduplication at crawl time, and writing your own Heritrix modules.

There is no need for me to repeat all of the topics. The entire first day was filmed and made available online on IIPC's YouTube channel.

Day one went well, but it wasn't radically different from what we have done before at GAs. It was days two and three that made this meeting unique.

For the latter two days only a very loose agenda was provided: a list of tasks related to each tool, varying in complexity. Attendees chose tasks according to their interests and level of technical know-how. Some installed and ran their very first Heritrix crawl or set up their first OpenWayback instance. I set up Solr via the BL's webarchive-discovery and set it to indexing one of our collections.

Others focused on more advanced tasks involving Heritrix sheet overlays and the REST API, OpenWayback WAR overlays and CDX generation, or ... I really don't know what the advanced Solr tasks were. I was just happy to get the basic indexing up and running.

The 'experts' who did presentations on day one were, of course, on hand during days two and three to assist. I found this to be a very good model. Impromptu presentations were made on specific topics and the specific issues of different attendees could be addressed. I learned a fair amount about how other IIPC members actually conduct their crawls. There is nothing like hands-on knowledge. I think both experts and attendees got a lot out of it.

It was almost sad to see the three day event come to an end.

So, overall, a success. Definitely meriting an encore.

That isn't to say it was perfect; there is always room for improvement. Given a bit more lead-up time, it would have been possible to get a firmer idea of the actual interests of the attendees. For this workshop there was a bit of guesswork. I think we were in the ballpark, but we can do better next time. It would also have been useful to have better developed tasks for the less experienced attendees.

So, will there be an opportunity to improve? I certainly hope so. We will need to decide where (London again or elsewhere) and when (same time next year or ...). The final decision will then be up to the IIPC Steering Committee. All I can say, is that I'm for it and I hope we can make this an annual event. A sort of counter-point to the GA.

We'll see.

Finally, I'd like to thank Helen and the British Library for their role as host and all of our experts for their contribution.