October 10, 2016

Wanted: New Leaders for OpenWayback

[This post originally appeared on the IIPC blog on 03/10/2016]

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition the OpenWayback project has been working to define access related APIs.

The OpenWayback project thus plays an important role in the IIPC's efforts to foster the development and use of common tools and standards for web archives.

Why now?


The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective of making the software easier to install, run and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With it, we were able to complete the task of stabilizing the software, as evidenced by the releases of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task: improving the core of the software. That effort is now reaching a significant milestone as a new ‘CDX server’, or resource resolver, is being introduced. You can read more about that here.

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (which hosted the paid developer) is contributing, free of charge, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year, since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I was never able to give the project the kind of attention needed to grow it. Now seems to be a good time for a change.

With the end of the paid position, we are now at a point where the project either undergoes a significant transformation or it will likely die away, bit by bit. That would be a shame, bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?


While a technical background is certainly useful it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You’ll not be alone. Aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

September 26, 2016

3 crawlers : 1 writer

Last week I attended an IIPC sponsored hackathon with the overarching theme of 'Building better crawlers'. I can't say we built a better crawler in the room, but it did help clarify for me the likely future of archival crawling. And it involves three types of crawlers.

The first type is the bulk crawler. Heritrix is an example of this. It can crawl a wide variety of sites 'well enough' and has fairly modest hardware requirements, allowing it to scale quite well. It is, however, limited in its ability to handle scripted content (i.e. JavaScript), as all link extraction is based on heuristics.

The second type is a browser driven crawler. Still fully (mostly) automated but using a browser to render pages. Additionally, scripts can be run on rendered pages to simulate scrolling, clicking and other user behavior we may wish to capture. Brozzler (Internet Archive) is an example of this approach. This allows far better capture of scripted content, but at a price in terms of resources.

For large scale crawls, it seems likely that a hybrid approach would serve us best: have a bulk crawler cover the majority of URLs, delegating only those URLs that are deemed 'troublesome' to the more expensive browser-based rendering.

The trick here is to make the two approaches work together smoothly (Brozzler, for example, handles state very differently from Heritrix) and to be smart about which content goes in which bucket.

The third type of crawler is what I'll call a manual crawler. I.e. a crawler whose activities are entirely driven by a human operator. An example of this is Webrecorder.io. This enables us to fill in whatever blanks the automated crawlers leave. It can also prove useful for highly targeted collection, where curators are handpicking, not just sites, but specific individual pages. They can then complete the process, right there in the browser.

There is, however, no reason that these crawlers can not all use the same back end for writing WARCs, handling deduplication and otherwise doing post acquisition tasks. By using a suitable archiving proxy all three types of crawlers can easily add their data to our collections.
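
As a very rough illustration of the idea, here is a minimal sketch of how any HTTP-speaking tool can push its traffic through an archiving proxy. The localhost:8000 address and the warcprox-style proxy are assumptions for the example; HTTPS capture additionally requires trusting the proxy's certificate.

    import java.net.InetSocketAddress;
    import java.net.ProxySelector;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ProxyFetchSketch {
        public static void main(String[] args) throws Exception {
            // Route all requests through a local archiving proxy, which takes
            // care of writing WARC records and deduplication.
            HttpClient client = HttpClient.newBuilder()
                    .proxy(ProxySelector.of(new InetSocketAddress("localhost", 8000)))
                    .build();

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://example.com/"))
                    .header("User-Agent", "my-crawler/0.1 (+http://example.org/crawler-info)")
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }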

Such proxy tools already exist; it is simply a matter of making sure these crawlers use them (many already do), and that they use them consistently. I.e. that there is a nice clear API for an archiving proxy that covers the use cases of all the crawlers and allows them to communicate collection metadata, dictate deduplication policies etc.

Now is the right time to establish this API. I think the first steps in that direction were taken at the hackathon. Hopefully, we'll have a first draft available on the IIPC GitHub page before too long.

May 30, 2016

Heritrix 3.3.0-LBS-2016-02, now in stores

A month ago I posted that I was testing a 'semi-stable' build of Heritrix. The new build is called "Heritrix 3.3.0-LBS-2016-02" as this is built for LBS's (Icelandic acronym for my library) 2016-02 domain crawl (i.e. the second one this year).

I can now report that this version has passed all my tests without any regressions showing up. I've noted two minor issues, one of which was fixed immediately (Noisy alerts about 401s without auth challenge) and the other has been around since Heritrix 3.1.0 at the least and does not affect crawling in any way (Bug in non-fatal-error log).

Additionally, I heard from Netarkivet.dk. They also tested this version with no regressions found.

I think it is safe to say that if you are currently using my previous semi-stable build (LBS-2015-01), upgrading to this version should be entirely straightforward. There are no notable API changes to worry about either. Unless, of course, you are using features that are less 'mainstream'.

You can find this version on our Github page. You'll have to download the source and build it for yourself.
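
For anyone who hasn't built Heritrix before, it is a standard Maven project, so something along these lines (the exact repository URL and branch/tag name are deliberately left as placeholders; assuming Maven 3.3+ and a JDK are installed) should produce the usual distribution archive:

    git clone <the Landsbokasafn heritrix3 fork>
    cd heritrix3
    git checkout <the LBS-2016-02 branch/tag>
    mvn clean package -DskipTests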

Update: As you can see in the comments below, Netarkivet.dk has put the artifacts into a publicly accessible repository. Very helpful if you have code with dependencies on Heritrix and you don't have your own repository.

Thanks for the heads-up, Nicholas.

May 17, 2016

WARC MIME Media Type

A curious thing came up during the WARC 1.1 review process. In version 1.0, section 8 talked about what MIME media types should be used when exchanging WARCs over the Internet. During the review process, however, it was pointed out that this is actually outside the scope of the standard. 1.1 consequently drops section 8.

For now we should regard the instructions from 1.0 section 8 as best practice, even though they are not part of any official standard.

That's not to say that it isn't important to have a standard set of MIME types for WARC content. Only that the WARC ISO standard isn't the place for it. This is actually something that IANA is responsible for, with specification work going through the IETF if I'm understanding this correctly.

I'm not at all familiar with this process. But it is clear that if we wish to have this standardized then going through this process is the only option. If anyone can offer further insight into how we could move this forward please get in touch.


May 12, 2016

What I learned hosting the 2016 IIPC GA/WAC

National Library of Iceland
Photo taken by GA/WAC attendee
It's been nearly a month since the 2016 IIPC General Assembly (GA) / Web Archiving Conference (WAC) in Reykjavik ended and I think I'm just about ready to try to deconstruct the experience a bit.

Plan ahead


Looking back, planning of the practical aspects - the logistics - of the conference seems to have been mostly spot on. The 2015 event in Stanford had had a problem with no-shows, but this wasn't a big factor in Reykjavik, I suspect largely due to the small number of local attendees. Our expectations about the number of people who would come ended up being more or less correct (about 90 for the GA and 145 for the WAC).

A big part of why the logistics side ran smoothly was, I feel, due to advance planning. We first decided to offer to host the 2016 GA in October of 2013. We made the space reservations at the conference hotel in September 2014. Consequently, there was never any rush or panic on the logistics. Everything felt like it was happening right on schedule with very few surprises.

The IIPC SC had a meeting in Lisbon following the 2013 iPres conference. The idea for Reykjavik as the venue for the 2016 IIPC GA first arose there.

Given how much work it was, despite all the careful planning, I don't care to imagine what doing this under pressure would be like. I've been advocating in the IIPC Steering Committee (SC), for years, that we should leave each GA with a firm date and place for the next two GAs and a good idea of where the one to be held in three years will be.

Nothing, in my experience hosting a GA, has changed my mind about that.

Spendthrift 


There was some discussion about whether some days/sessions should be recorded and put online. This was done in Stanford, but looking at the viewing numbers, I felt that it represented a poor use of money. Ultimately the SC agreed. Recording and editing can be quite costly. It may be worth reviewing this decision in the future. Or, perhaps something else can be used to 'open' the conference to those not physically present.

It was certainly a worthwhile experiment, but overall, I think we made the right decision not doing it in Reykjavik, especially as the cost would have been quite high, even compared to Stanford.

Another thing we decided not to spend money on was an event planner. I know one was used for the 2015 GA. That one needed to be planned in a hurry and thus may have required such a service. But I can't see how it would have made things much easier in 2016 unless you're willing to hand over the responsibility for making specific choices to the planner. Such as catering etc.

True, that does take a bit of effort, but I felt that was part of the responsibility that comes with hosting. Just handing it over to a planner wouldn't have sat right. And if I'm vetting the planner's choices, then very little effort is being saved.

I'm happy to concede, though, that this may vary very much by location and host.

Communication


Some of the communication surrounding the GA/WAC was sub-optimal. The GA page on netpreserve.org was never really up to the task, although it got better over time. Some of this was down to the lack of flexibility of the netpreserve website. Future events should have a solid communication plan at an early date, including what gets communicated where and who is responsible for it. Perhaps it is time that each GA/WAC gets its own little website? Or perhaps not.

The dual nature of the event also caused some confusion. This led some people to only register for one of the two events etc. There was also confusion (even among the program committee!) about whether the CFP was for the WAC and GA or WAC only.

This leads us to the most important lesson I took away from this all...

Clearly separate the General Assembly and the Web Archiving Conference!


This isn't a new insight. We've been discussing what separates the 'open days' from 'member only' days for several years. In Reykjavik this was, for the first time, formally divided into two separate events. Yet, the distinction between them was less than absolutely clear.

This is, at least in part, due to how the schedule for the two events was organized. A single program committee was set up (as has been the case most years). It was quite small this year. This committee then organized the call for proposals (CFP) and arranged the schedule to accommodate the proposals that came in from the CFP.

This led to the conference spilling over onto GA days (notably Tuesday). And it wasn't the first time that has happened. There was definitely a lack of separation in Stanford (although perhaps for slightly different reasons) and in Paris, in 2014, the effort to shoehorn in all the proposals from the CFP had a profound effect on the member-only days.

This model of a program committee and a CFP is entirely suitable for a conference and should be continued. But going forward, I think it is absolutely necessary that the program committee for the WAC have no responsibility or direct influence on the GA agenda.

To facilitate this I suggest that the organization of these two events consist of three bodies (in addition to the IIPC Steering Committee (SC) which will continue to bear overall responsibility).

  1. Logistics Team. Membership includes 1-2 people from the hosting institution, the IIPC officers, at least one SC member (if the hosting institution is an SC member this may be their representative) and perhaps one or two people with relevant experience (e.g. have hosted before etc.).
    This group is responsible for arranging space, catering, the reception, badges and other printed conference material, hotels (if needed) etc. They get their direction on the amount of space needed from the SC and the two other teams.
    This group is responsible for the event staying under budget, which is why the treasurer is included.
  2. WAC Program Committee. The program committee would be composed of a number of members and may include several non-members who bring notable expertise and have been engaged in this community for a long time.
    The program committee would have a reserved space on it for the hosting institution (which they may decline). There should also be a minimum of one SC member on the committee.
    The PCO (program and communications officer) would be included in all communications and assist the committee with communications with members and other prospective attendees (e.g. in sending out the CFP) but would not participate in evaluating the proposals sent in.
    The program committee would have a hand in crafting the CFP, but input on overall 'theme' would be expected from the SC.
    The program committee's primary task would be to evaluate the proposals sent in after the CFP and arrange them into a coherent schedule. The mechanism for evaluating (and potentially rejecting!) proposals needs to be established before the proposals come in! Otherwise, it will be hard to avoid the feeling that the criteria are being tailored to fit specific proposals.
  3. GA Organizing Group. The PCO would be responsible for coordinating this group. Included are the SC Chair and Vice Chair, portfolio leads and leaders of working and interest groups. For the most part, each member is primarily responsible for the needs of their respective areas of responsibility.
    More on GA organization in a bit.
None of this gives the SC a free pass. As you'll note, I've mandated an SC presence in all the groups. This both gives the groups access to someone who can easily bring matters to the SC's attention and ensures that there is someone there to see that the direction the SC has laid out is, broadly speaking, followed.

For the WAC, the SC's biggest responsibility (aside from choosing the location and setting the budget) will be in deciding how much time it gets (two days, two and a half, three?), what themes to focus around and whether the conference should try to accomplish a specific outreach goal (and if so how).

This was, for example, the case in Stanford where the goal was to get the attention of the big tech companies. Getting Vint Cerf (a VP of Google) to be a keynote speaker was a good effort in that direction. Nothing similar was done during the Reykjavik meeting.

Keynotes


Keynotes are likely to be one of the best ways of accomplishing this. Getting a keynote speaker from a different background can help build bridges. I think this is absolutely a worthwhile path to consider.

However, unless we are hosting the WAC in their backyard (as was the case with Vint Cerf), we need to reach out to them very early and probably be prepared to cover the cost of travel. This is a choice that needs to be made very early. And, indeed, the choice of a keynote may ultimately help frame the overall theme of the conference (or not).

Hjálmar Gíslason delivering the
2016 IIPC WAC opening keynote
We had two keynotes in Reykjavik. Both were great, although neither was chosen 'strategically'. The choice of Hjálmar Gíslason lay largely with my library. Allowing the hosting institution some influence on one of the keynotes may be appropriate. The other keynote, Brewster Kahle, wasn't chosen until after the CFP was in. We essentially asked him to expand his proposal into a keynote. Given the topic and Brewster's acclaim within our community, this worked out very well. We did have other candidates in mind (but no one confirmed). It was quite fortunate that such a perfect candidate fell into our laps.

It is worth planning this early as people become unavailable surprisingly far in advance.

It could also be argued that we don't need keynotes. People aren't coming to the IIPC WAC to hear some 'rock star' presenter. The event itself is the draw. But I think a couple of keynotes really help tie the event together.

One change may be worth considering. Instead of a whole day with a single track featuring both keynotes, perhaps have multiple tracks on all days but do a single track session at the start of day one and at the end of day two that accommodates the keynotes and the welcome and wrap up talks.

When we were trying to fit in all the proposals we got for Reykjavik, we considered doing this, but the idea simply arose too late. We were unable to secure the additional space required.

Again, we need to plan early.

The General Assembly should not be a conference


The GAs have changed a lot over the years. The IIPC met in Reykjavik for the first time in 2005. Back then we didn't call the meetings "GAs", they were just meetings. And they mostly oriented around discussions. They were working meetings. And they were usually very good.

The first GA, in Paris 2007, largely retained that, despite the fact that the IIPC was already beginning to grow. There was no 'open day'. 

By 2010 in Singapore, the open day was there. But in a way that made sense and it didn't overly affect the rest of the GA. I did notice, however, a marked change in the level of engagement by the attendees during sessions.

There seemed to be more people there 'just to listen'. There had always been some of those, but I found it difficult to get discussions going, where two years prior they'd usually been difficult to stop in order to take breaks! Not that those discussions had always been all that productive (some of it was just talk), but the atmosphere was more restrained.

At that time I was co-chair of the Harvesting Working Group (HWG) along with Lewis Crawford of the British Library. And although there was always good attendance at the HWG meetings we really struggled to engage the attendees.

Helen Hockx-Yu and Kris Carpenter, who led the Access Working Group (AWG), did a better job of this but clearly felt the same problem. Ultimately, both the HWG and the AWG became more GA events than working groups and have now been decommissioned.

With larger groups, and especially with many there 'just to listen', it becomes much easier to just do a series of presentations. It's safer, more predictable, and when you add the pressure to fit in all the material from the CFP, it becomes inevitable.

But, in the process we have lost something.

Now that the WAC is firmly established and can serve very well for the people who 'just want to listen', I think it is time we refocus the GA on being working meetings. A venue for addressing both consortium business (like the portfolio breakout sessions in Reykjavik, but with more time!) and the work of the consortium (like the OpenWayback meeting and the Preservation and Content Development Working Group meetings in Reykjavik).

This will inevitably include some presentations (but keep them to a minimum!) and there may be some panel discussions but the overall focus should be on working meetings. Where specific topics are discussed and, as much as possible, actions are decided.

That's why I nominated the people I did for the GA Organizing Group. These are the people driving the work of the consortium. They should help form the GA agenda. At least as far as their area of responsibility is concerned.

To accommodate the less knowledgeable GA attendee (e.g. new members) it may be a good idea to schedule tutorials and/or training sessions in parallel to some of these working meetings.

I believe this can build up a more engaged community. And for those not interested in participating in specific work, the WAC will be there to provide them with an opportunity to learn and connect with other members.

This won't be an easy transition. As my experience with the HWG showed, it can be difficult to engage people. But by having a conference (and perhaps training events) to divert those just looking to learn, and building sessions around specific strategic goals, I think we can bring this element of 'work' back.

And if we can't, I'm not sure we have much of a future except as a yearly conference.

April 28, 2016

New 'semi-stable' build for Heritrix

Earlier this month I mentioned that I was planning on making another one of my "semi-stable" Heritrix builds, using the current master 3.3.0. This work is now underway in the Landsbokasafn Heritrix repo on GitHub, as LBS-2016-02.

Heritrix 3.3.0-LBS-2016-02
I've merged in one pull request that is still open in the IA repository, #154 Fixes for apparent build errors. Most notably, this makes it possible to have Travis-CI build and test Heritrix.

You can review the full list of changes between my last Heritrix build (2015-01) and this new one here. The main changes are:

  • Some fixes to how server-not-modified revisit records are written (PR #118).
  • Fix outlink hoppath in metadata records (PR #119)
  • Allow dots in filenames for known good extensions (PR #120)
  • Require Maven 3.3 (PR #126)
  • Allow realm to be set by server for basic auth (PR #124)
  • Better error handling in StatisticsTracker (PR #130)
  • Fix to Java 8 Keytool (PR #129) - I wrote a post about this back in 2014.
  • Changes to how cookies are stored in Bdb (PR #133)
  • Handle multiple clauses for same user agent in robots.txt (PR #139)
  • SourceSeedDecideRule and SeedLimitsEnforcer (PR #137 and #148)
  • 'Novel' URL and byte quotas (PR #138)
  • Only submit 'checked' checkbox and radio buttons when submitting forms (PR #122)
  • Form login improvements (PR #142 and #143)
  • Improvements to hosts report (PR #123)
  • Handle SNI error better (PR #141)
  • Allow some whitespace in URLs extracted by ExtractorJS (PR #145)
  • Fix to ExtractorHTML dealing with HTML comments (PR #149)
  • Build against Java 7 (PR #152)

I've ignored all pull requests that apply primarily to the contrib package in the above. There were quite a few, mostly (but not exclusively) relating to AMQP.

I've done some preliminary testing and everything looks good. So far, the only issue I've noted is one that I was already aware of, about noisy alerts relating to 401s.

I'll be testing this version further over the next few weeks and welcome any additional input.

April 18, 2016

A long week is over. Thank you all.

The 2016 IIPC General Assembly and Web Archiving Conference is over. Phew!

Me, opening the Harvesting Tools
session on Tuesday

I always look forward to this event each year. It is by far the most stimulating and productive meeting/conference that I attend regularly. I believe we managed to live up to that this time.

The meeting had a wonderful Twitter back-channel that you can still review using the hashtags #iipcGA16 and #iipcWAC16.

It has been over two years since we, at the National and University Library of Iceland, offered to host the 2016 GA, and the initial decision was made over half a year before that. Even with a 2.5 year lead time, it barely felt like enough.

I'd like to take this opportunity to thank, again, all the people who helped make last week's event a success.

First off, there is the program committee, which was very small this year, comprising, in addition to myself, (in alphabetical order) Alex Thurman (Columbia University Libraries), Gina Jones (Library of Congress), Jason Webber (IIPC PCO/British Library), Nicholas Taylor (Stanford University Libraries) and Peter Stirling (Bibliothèque nationale de France). I literally couldn't have done this without you.

I'd also like to add to this list the contribution of our incoming PCO, Olga Holownia, who put in a lot of work during the conference to help make sure everything was just right for each session.

Next, I'd like to thank my colleagues at the National Library who assisted me in organizing this event and helped out during the week by handling registration, running tours etc. It was a team effort. Notable mentions go to Áki Karlsson and Erla Bjarnadóttir, who spent much of the week making sure that all the little details were attended to.

The Steering Committee on Friday
following the SC meeting
A big thank you to all the speakers and session moderators.

And lastly, I'd like to thank the members of the Steering Committee for being willing to entrust the single most important event of the IIPC calendar to one of the IIPC's smallest members. Indeed, doing so without the slightest hesitation.

I've learned a lot from this past week and I hope to be able to distill that experience and write it up so that next year's GA/WAC can be even better. But that will have to wait for another day and another blog post.

For now, I'll just say thanks for coming and see you all again in Lisbon for #iipcGA17 and #iipcWAC17.

April 7, 2016

Still Looking For Stability In Heritrix Releases

I'd just like to briefly follow up on a blog post I wrote last September, Looking For Stability In Heritrix Releases.

The short version is that the response I got was, in my opinion, insufficient to proceed. I'm open to revisiting the idea if that changes, but for now it is on ice.

There is little doubt in my mind that having (somewhat) regular stable releases made of Heritrix would be of notable benefit. Even better if they are published to Maven Central.

Instead, I'll continue to make my own forks from time to time and make sure they are stable for me. The last one was dubbed LBS-2015-01. It is now over a year old and a lot has changed. I expect I'll be making a new one in May/June. You can see what's changed in Heritrix in the meantime here.

I know a few organizations are also using my semi-stable releases. If you are one of them and would like to get some changes in before the next version (to be dubbed LBS-2016-02), you should try to get a PR into Heritrix before the end of April. Likewise, if you know of a serious/blocking bug in the current master of Heritrix, please bring it to my attention.

April 1, 2016

Duplicate DeDuplicators?


A question on the Heritrix mailing list prompted me to write a few words about deduplication in Heritrix and why there are multiple ways of doing it.

Heritrix's built in service


Heritrix's inbuilt deduplication service comes in two processors. One processor records each URL in a BDB datastore, or index (PersistStoreProcessor); the other looks up the current URL in this datastore and, if it finds it, compares the content digests (FetchHistoryProcessor).

The index used to be mingled in with other crawler state. That is often undesirable, as you may not wish to carry forward any of that state to subsequent crawls. Typically, therefore, you configure the above processors to use their own directory by wiring in a separate "BDB module" configured to write to an alternative directory.

There is no way to construct the index outside of a crawl, which can be problematic since a hard crash will often corrupt the BDB data. Of course, you can recover from a checkpoint, if you have one.

More recently, a new set of processors has been added: the ContentDigestHistoryLoader and ContentDigestHistoryStorer. They work in much the same way, except they use an index that is keyed on the content digest rather than the URL. This enables URL-agnostic deduplication.

This was a questionable feature when introduced, but after the changes made to implement a more robust way of recording non-URL duplicates, it became genuinely useful, although its utility will vary based on the nature of your crawl.

As this index is updated at crawl time, it also makes it possible to deduplicate on material discovered during the same crawl. A very useful feature that I now use in most crawls.

You still can't build the index outside of a crawl.

For more information about the built in features, consult the Duplication Reduction Processors page on the Heritrix wiki.

The DeDuplicator add-on


The DeDuplicator add-on pre-dates the built-in function in Heritrix by about a year (released in 2006). It essentially accomplishes the same thing, but with a few notable differences in tactics.

Most importantly, its index is always built outside of a crawl. Either from the crawl.log (possibly multiple log files) or from WARC files. This provides a considerable amount of flexibility as you can build an index covering multiple crawls. You can also gain the benefit of the deduplication as soon as you implement it. You don't have to run one crawl just to populate it.

The DeDuplicator uses Lucene to build its index. This allows for multiple searchable fields, which in turn means that deduplication can do things like prefer exact URL matches but still fall back to digest-only matches when no exact URL match exists. This affords a choice of search strategies.
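
To illustrate the kind of lookup strategy a multi-field Lucene index enables, here is a conceptual sketch only; the field names "url" and "digest" are hypothetical and the DeDuplicator's actual implementation differs:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class DedupLookupSketch {
        // Returns an index entry describing an earlier capture, or null if none is found.
        static Document findDuplicate(IndexSearcher searcher, String url, String digest)
                throws Exception {
            // 1. Prefer an exact URL match with the same content digest.
            BooleanQuery exact = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("url", url)), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("digest", digest)), BooleanClause.Occur.MUST)
                    .build();
            TopDocs hits = searcher.search(exact, 1);
            if (hits.scoreDocs.length == 0) {
                // 2. Fall back to a digest-only (URL agnostic) match.
                hits = searcher.search(new TermQuery(new Term("digest", digest)), 1);
            }
            return hits.scoreDocs.length > 0 ? searcher.doc(hits.scoreDocs[0].doc) : null;
        }
    }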

The DeDuplicator also offers some additional statistics, can write more detailed deduplication data to the crawl.log and comes with pre-configured job profiles.

The Heritrix 1 version of the DeDuplicator also supported deduplication based on 'server not modified', but that was dropped when migrating to H3 as no one seemed to be using it. The index still contains enough information to easily bring it back.

Bottom line


Both approaches ultimately accomplish the same thing. Especially after the changes that were made a couple of years ago to how these modules interact with the rest of Heritrix, there really isn't any notable difference in the output. All these processors, after determining that a document is a duplicate, set the same flags and cause the same information to be written to the WARC. (If you are using ARCs, do not use any URL agnostic features!)

Ultimately, it is just a question of which fits better into your workflow.


March 18, 2016

Declaring WARR on "CDX Server" API

Work is currently ongoing to specify a "CDX Server API" for OpenWayback. The name of this API has, however, caused an unfortunate amount of confusion. Despite the name, the data served via this API needn't be in CDX files!

The core purpose of this API is to respond to a query containing an URL, and optionally a timestamp or time range, with a set of records that fall within those parameters. This is meant to support two basic functions: one, replay of captured web content and, two, discovery of captured web content.
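
Whatever the backing store, the query shape stays the same. Purely as an illustration (the host, path and parameter names here are made up, loosely modeled on existing CDX server query conventions), a request might look like:

    http://archive.example.org/warr?url=example.com/index.html&from=20150101&to=20161231

and the response would be the set of matching capture records, however they happen to be indexed.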

CDX files need not enter into it. It is just that the most common way (by far) to manage such an index is to use sorted CDX files; thus the unfortunate name. Nothing prevents alternative indexing solutions from being used. You could use a relational database, Lucene or whatever tool allows lookups of strings!

So, this API desperately needs a new name. My suggestion is "Web Archive Resource Resolution Service" or WARR Service for short. Yes, I did torture that until it produced a usable acronym.

In my last post I discussed changes to the CDX file format itself. Those changes should facilitate WARR servers running on CDX indexes. But ultimately, the development of the WARR Service API is not directly coupled to those changes. We should focus on developing the WARR Service API with respect to the established use cases.

In truth, the exact scope and nature of this new API remains debated. You can find some lively discussion in this Github issue. More on that topic another day.

March 17, 2016

Rewriting the CDX file format

CDX files are used to support URL+timestamp searching of web archives. They've been around for a long time, having first been used to catalog the contents of ARC files. Despite the advent of the WARC file format, they haven't changed much. I think it is past due that we reconsider the format from the ground up.

The current specification lists a large number of possible fields. Many are not used in typical scenarios.

The first field is a canonicalized URL, i.e. an URL with trivial elements (such as the protocol) removed so that equivalent URLs end up with the same canonical form. This serves as the primary search key.

The only problem with this is that searching for content in all subdomains is not possible without scanning the entire CDX, because the subdomain comes before the domain. Instead, we should use a SURT (Sort-friendly URI Reordering Transform) form of the canonical URL. SURT URLs turn the domain/sub-domain structure around, making such queries fairly straightforward. There is essentially no downside to doing this and, in fact, a number of CDXs have already been built in this manner, regardless of any "formal" standardization (as there isn't really any formal standard).

I suggest that any revised CDX format mandate the use of SURT URLs for the first field. Furthermore, we should utilize the correct SURT format. In most (probably all) current CDXs with SURT URLs, an annoying mistake has been made where the closing comma is missing. An URL that should read:
   com,example,www,)
instead reads:
   com,example,www)
The protocol prefix has been removed as unnecessary, along with the opening parenthesis.

The second field should remain the timestamp, with whatever precision is available in the ARC/WARC, i.e. a W3C ISO 8601 date of varying precision, as per this proposed revision to the WARC standard (the revision is extremely likely to be included in WARC 1.1).

The third field would remain the original URL.

The fourth field should be a content digest including the hashing algorithm. Presently, this field is missing the algorithm.

The fifth field would be the WARC record type (or a special value to indicate an ARC response record). This is the most significant change, as it allows us to capture additional WARC record types (such as metadata and conversion) while also handling the existing fields in a more targeted manner (e.g. response vs. revisit). It might be argued that this should be the second field, to facilitate searches for a specific record type. I believe that, properly implemented, this field would allow replay tools to effectively surface any content "related" to the URL currently being viewed, a problem that I know many are trying to tackle.

The next two fields would be the WARC (or ARC) filename (this is supposed to be unique) of the file containing the record and the offset at which the record exists within the (W)ARC. This is as it works currently. Some would argue for a more expressive resource locator here, but I believe that is best handled by a separate (W)ARC resolution service. Otherwise you may have to substantially rebuild your CDX index just because you moved your (W)ARCs to a new disk or service.

Lastly, there should be a single line JSON "blob" containing record type relevant additional data. For response records, this would include HTTP status code and content type which I've excluded from the "base" fields in the CDX. This part would be significantly more flexible due to the JSON format, allowing us to include optional data where appropriate etc. The full range of possible values is beyond the scope of this blog post.
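
Pulling the fields together, a purely illustrative record under this proposal might look something like the following (all values are made up, the JSON keys are only an assumption, and in practice this would be a single line):

    com,example,www,)/index.html 2016-03-17T12:34:56Z http://www.example.com/index.html sha1:ZMSCWOFCILTK5YRTH6QKZ3GMQEQHEP26 response WEB-20160317120000-00001.warc.gz 91284 {"status-code":200,"content-type":"text/html"}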

There is clearly more work to be done on the JSON aspect, plus some adjustments may be necessary to the base data, but I believe that, at minimum, this is the right direction to head in. Of course, this means we have to rebuild all our CDX files in order to implement this. That's a tall order, but the benefits should be more than enough to justify that one-time cost.

February 24, 2016

3 things I shouldn't have to tell you about running a "good" crawler

I've been running large and small scale crawls for almost 12 years. In that time I've encountered any number of unfortunate circumstances where our crawler has caused a website some amount of trouble. We always aim to run a "good" crawler: one that is respectful of website operators and doesn't cause any issues. To accomplish this we have some basic rules and limits to abide by. The reason we sometimes do cause trouble comes down to the complicated nature of the internet.

During this time I've also been responsible for operating a number of websites, including a few that are (by Icelandic standards) quite popular. Doing so has shown me the other side of web crawling. It turns out that a lot of crawlers are not following the basic dos and don'ts of web crawling, including some supposedly respectable crawlers.

This seems to be getting worse each year.

Of course there are "bad" robots, run by people who do not care about the negative impact they cause. But even "good" robots (i.e. ones that at least seem to have good intentions) are all too frequently misbehaving.

So, here are 3 things you absolutely should abide by if you want your robot to be considered "good". Remember, it doesn't matter if your robot is a web-scale crawler or a script scraping a single site. As soon as you've scripted something to automatically fetch stuff from a 3rd party server (without the 3rd party's explicit permission), you are running a crawler.

1. Be polite

Never make concurrent requests to the same site. Rate limit yourself to around 1 request every 2-5 seconds or so; more if the responses are slow. If you hit a 500 error code, wait a few minutes.

If you know (with certainty) that the site you are crawling can handle a more aggressive load (e.g. crawling google.com), it may be OK to step it up a bit. But when crawling sites where you do not have any insight, it is best to be cautious. Remember, yours is probably not the only crawler hitting them. Not to mention all the regular users.

I know, it can be infuriating when trying to scrape a large dataset; it could take weeks! But your needs do not outweigh the needs of other users. Also, crawlers are often more expensive to serve than "normal" users: they tend to systematically go through the entire site, meaning that caching strategies fail to speed up their requests.
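
To make this concrete, here is a minimal sketch of a polite fetch loop. The URLs, contact details and exact delays are placeholders, and a real crawler would also honor robots.txt, per rule 3 below:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;

    public class PoliteFetcher {
        // Identifying user agent: who we are, why we crawl and how to reach us (see rule 2).
        private static final String USER_AGENT =
                "example-crawler/1.0 (+http://example.org/about-our-crawler; crawl-admin@example.org)";

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            List<String> urls = List.of(
                    "http://example.com/page1",
                    "http://example.com/page2");

            for (String url : urls) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(url))
                        .timeout(Duration.ofSeconds(30))
                        .header("User-Agent", USER_AGENT)
                        .build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());

                if (response.statusCode() >= 500) {
                    // Server trouble: back off for a few minutes before continuing.
                    Thread.sleep(Duration.ofMinutes(3).toMillis());
                }
                // One request at a time, a few seconds apart.
                Thread.sleep(Duration.ofSeconds(3).toMillis());
            }
        }
    }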

2. Identify yourself

Your user agent string must contain enough information to allow a website operator to find out who you are, why you are crawling their site and how to get in touch with you. This isn't negotiable.

It also goes for little custom tools built just to scrape that one website. You don't have to set a user agent if you are using e.g. curl on the command line to get a single resource. But the moment you script or program something, you must put identifying information in the user agent string. If all I see is

  curl/7.19.7 (x86_64-redhat-linux-gnu) lib...

I'll assume it is a "bad" crawler. Same for things written in Python, Java etc. Always identify yourself.

And make sure you are responsive to feedback you may get. That is something a number of big, supposedly "good" crawl operators fail to do.

The thing is, if your robot is causing a problem and I don't know who is operating it, I will ban it. Should you wish to have that ban lifted, I will not be especially predisposed towards cooperation; you've already made a very bad first impression.

Yes, you might be able to get around a ban by getting a new IP address, but at that point you are no longer running a "good" crawler.

The bottom line is, even if you are very careful, your crawler may inadvertently cause a problem. In those scenarios, if your intentions are good, you must make yourself available to deal with the issue and prevent it from recurring. If you do not do this, you are running a "bad" crawler.

3. Honor robots.txt

I'm aware of the irony here. As an operator of a legal deposit crawler, I do not respect robots.txt in most of the crawls I conduct. But you probably don't have that shield of being legally required to "get all the URLs".

A lot of websites block perfectly "legitimate" material using robots.txt (e.g. images). It is annoying. But if you are running a "good" crawler, then you have to abide by these rules, including the crawl delay. You really should also be able to parse the newer wildcard rules.

At most, you might bend the rules to get embedded content (images) necessary to render the page. Even that should be done carefully and only while adhering to the first two rules firmly.

If you feel you need content blocked by robots.txt, you must ask for it. Politely. Some sites may be happy to assist you (ours included). Others may tell you to go away. Either way, if you are running a "good" crawler, you'll have your answer. 

February 15, 2016

OpenWayback - Developing an API for a CDX Server

OpenWayback 2.3.0 was released about a month ago. It was a modest release aimed at fixing a number of bugs and introducing a few minor features. Currently, work is focused on version 3.0.0, which is meant to be a much more impactful release.

Notably, the indexing function is being moved into a separate (bundled) entity, called the CDX Server. This will provide a clean separation of concerns between resource resolution and the user interface.

The CDX Server had already been partially implemented. But as work began on refining that implementation, it became clear that we would also need to examine very closely the API between these two discrete parts. The API that currently exists is incomplete, at times vague and even, on occasion, contradictory. Existing documentation (or code review) may tell you what can be done, but it fails to shed much light on why you'd do it.

This isn't really surprising for an API that was developed "bottom-up", guided by personal experience and existing (but undocumented!) code requirements. This is pretty typical of what happens when delivering something functional is the first priority. It is technical debt, but one you may feel is acceptable when delivering a single-customer solution.

The problem we face is that once we push for the CDX Server to become the norm, it won't be a single-customer solution. Changes to the API will be painful and, consequently, rare.

So, we've had to step back and examine the API with a critical eye. To do this we've begun to compile a list of use cases that the CDX Server API needs to meet. Some are fairly obvious, others are perhaps more wishful thinking. In between there are a large number of corner cases and essential but non-obvious use cases that must be addressed.

We welcome and encourage any input on this list. You can do so either by editing the wiki page, by commenting on the relevant issue in our tracker or by sending an e-mail to the OpenWayback-dev mailing list.

To be clear, the purpose of this is primarily to ensure that we fully support existing use cases. While we will consider use cases that the current OpenWayback cannot handle, they will necessarily be ascribed a much lower priority.

Hopefully, this will lead to a fairly robust API for a CDX Server. The use cases may also allow us to firm up the API of the wayback URL structure itself. Not only will this serve the OpenWayback project; getting this API right is very important to facilitate alternative replay tools!

As I said before, we welcome any input you may have on the subject. If you feel unsure of how to become engaged, please feel free to contact me directly.

February 3, 2016

How to SURT IDN domains?

Converting URLs to SURT (Sort-friendly URI Reordering Transform) form has many benefits in web archiving and is widely used when configuring crawlers. Notably, SURT prefixes make it easy to apply rules to a logical segment of the web.

Thus the SURT prefix:
http://(is,
will match all domains under the .is TLD.

A lesser-known ability is to match against partial domain names. Thus the following SURT prefix:
http://(is,a
would match all .is domains that begin with the letter a (note that there isn't a comma at the end).

This all works quite well, until you hit Internationalized Domain Names (IDNs). As the original infrastructure of the web does not really support non-ASCII characters, all IDNs are designed so that they can be translated into an ASCII equivalent.

Thus the IDN domain landsbókasafn.is is actually represented using the "punycode" representation xn--landsbkasafn-5hb.is.

When matching SURTs against full domain names (trailing comma), this doesn't really matter. But, when matching against a domain name prefix, you run into an issue. Considering the example above, should landsbókasafn.is match the SURT http://(is,l?

The current implementation (at least in Heritrix's much-used SurtPrefixedDecideRule) is to evaluate only the punycode version, so the answer is no; it would, however, match http://(is,x.
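
A minimal sketch of the behavior (this is illustrative only; Heritrix's own SURT code handles full URLs, ports, paths and trailing commas, and the helper below is a hypothetical simplification):

    import java.net.IDN;

    public class SurtIdnExample {
        // Build a simplified SURT prefix, e.g. "example.com" -> "http://(com,example,"
        static String hostToSurtPrefix(String host) {
            String ascii = IDN.toASCII(host);   // IDN -> punycode form
            String[] labels = ascii.split("\\.");
            StringBuilder sb = new StringBuilder("http://(");
            for (int i = labels.length - 1; i >= 0; i--) {
                sb.append(labels[i].toLowerCase()).append(',');
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            String surt = hostToSurtPrefix("landsbókasafn.is");
            System.out.println(surt);                             // http://(is,xn--landsbkasafn-5hb,
            System.out.println(surt.startsWith("http://(is,l"));  // false
            System.out.println(surt.startsWith("http://(is,x"));  // true
        }
    }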

This seems potentially limiting and likely to cause confusion.

January 29, 2016

Things to do in Iceland ...

... when you are not in a conference center


It is clear that many attendees at the IIPC GA and web archiving conference in Reykjavík next April (details) plan to extend their stay. Several have contacted me for advice on what not to miss. So, I figured I'd write something up for all to see.

The following is far from exhaustive or authoritative. It largely reflects my personal taste and may have glaring omissions. It probably reflects the limits of my memory as well!

Reykjavík

Downtown Reykjavík has many interesting sights, museums and attractions, not to mention shops, bars and restaurants. Of particular note is Hallgrímskirkja, which rises up over the heart of the city; right outside it is the statue of Leif Eriksson. The Pearl dominates the skyline a bit further south, and there are stunning views to be had from its observation deck. In the heart of the city you'll find Alþingishúsið (the Parliament building), city hall, the old harbor, Harpa (the concert hall) and many other notable buildings and sights.

A bit further afield you'll find Höfði and the Sólfar sculpture. Sólfarið is, without question, my favorite public sculpture anywhere.

There are a large number of museums in Reykjavík. Ranging from the what-you'd-expect to the downright-weird.

Thanks to abundant geothermal energy, you'll find heated, open air, public swimming pools open year round. I highly recommend a visit to one of them. Reykjavík's most prominent public pool is Laugardalslaug.

Lastly, the most famous eatery in Iceland is in the heart of the city, and worth a visit, Bæjarins Beztu.

Out of Reykjavík

Just south of Reykjavík (15 minutes), in my hometown of Hafnarfjörður, you'll find a Viking Village!

A bit further there is the Blue Lagoon. I highly recommend everyone try it at least once. Do note that you may need to book in advance!

And just a bit further still, you'll find the Bridge Between Continents! Okay, so the last one is a bit overwrought, but still a fun visit if you're at all into plate tectonics.

Whale watching tours are operated out of Reykjavík, even in April. You'll want some warm clothes if you go on one of those!

Probably the most popular (for good reasons) day tour out of Reykjavík is the Golden Circle. It covers Geysir, Gullfoss (Iceland's largest waterfall) and Þingvellir (the site of the original Icelandic parliament, formed in 930, and a UNESCO World Heritage Site). You can do the circle in a rented car, and there are also numerous tour operators offering this trip and variations on it.

Driving along the southern coast on highway 1, you'll also come across many interesting towns and sights. It is possible to drive as far as Jökulsárlón and back in a single day, stopping at places like Selfoss, Vík í Mýrdal, Seljalandsfoss, Skógarfoss and Skaftafell National Park. It may be better to plan an overnight stop, however. This is an ideal route for a modest excursion as there are so many interesting sights right by the highway.


Further afield

For longer trips outside of the capital there are too many options to count. You could go to Vestmannaeyjar, just off the south coast, or to Akureyri, the capital of the north. A popular trip is to take highway 1 around the country (it loops around). Such a trip can be done in 3-4 days, but you'd need closer to a week to fully appreciate all the sights along the way.


January 28, 2016

To ZIP or not to ZIP, that is the (web archiving) question

Do you use uncompressed (W)ARC files?

It is hard to imagine why you would want to store the material uncompressed. After all, web archives are big. Compression saves space and space is money.

While this seems straightforward it is worth examining some of the assumptions made here and considering what trade-offs we may be making.

Let's start by considering that a lot of the files on the Internet are already compressed. Images, audio and video files, as well as almost every file format for "large data", are compressed, sometimes simply by wrapping everything in a ZIP container (e.g. EPUB). There is very little additional benefit gained from compressing these files again (it may even increase the size very slightly).

For some, highly specific crawls, it is possible that compression will accomplish very little.

But it is also true that compression costs very little. We'll get back to that point in a bit.

For most general crawls, the amount of HTML, CSS, JavaScript and various other highly compressible material will make up a substantial portion of the overall data. Those files may be smaller, but there are a lot more of them, especially automatically generated HTML pages and other crawler traps that are impossible to avoid entirely.

In our domain crawls, HTML documents alone typically make up around a quarter of the total data downloaded. Given that we then deduplicate images, videos and other largely static file formats, HTML files' share of the overall data needing to be stored is even greater, typically approaching half!

Given that these text files compress heavily (usually by 70-80%), tremendous storage savings can be realized using compression. In practice, our domain crawls' compressed size is usually about 60% of the uncompressed size (after deduplication).

More frequently run crawls (with higher levels of deduplication) will benefit even more. Our weekly crawls' compressed size is usually closer to 35-40% of the uncompressed volume (after deduplication discards about three quarters of the crawled data).
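
If you want a rough idea of how your own content behaves, a quick sketch like this (plain GZIP, which is what compressed (W)ARCs use per record) will report the ratio for any file you point it at:

    import java.io.ByteArrayOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPOutputStream;

    public class CompressionRatio {
        public static void main(String[] args) throws Exception {
            // Pass any file on the command line, e.g. an HTML page and a JPEG,
            // to compare how much each would shrink inside a compressed WARC.
            byte[] original = Files.readAllBytes(Path.of(args[0]));
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
                gz.write(original);
            }
            System.out.printf("%s: %d -> %d bytes (%.0f%% of original)%n",
                    args[0], original.length, compressed.size(),
                    100.0 * compressed.size() / original.length);
        }
    }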

So you can save anywhere from ten to sixty percent of the storage needed, depending on the types of crawling you do. But at what cost?

On the crawler side the limiting factor is usually disk or network access. Memory is also sometimes a bottleneck. CPU cycles are rarely an issue. Thus the additional overhead of compressing a file, as it is written to disk, is trivial.

On the access side, you also largely find that CPU isn't a limiting factor; the bottleneck is disk access. And here compression can actually help! It probably doesn't make much difference when only serving up one small slice of a WARC, but when processing entire WARCs it will take less time to lift them off the slow HDD if the file is smaller. The additional overhead of decompression is insignificant in this scenario, except in highly specific circumstances where CPU is very limited (but why would you process entire WARCs in such an environment?).

So, you save space (and money!) and performance is barely affected. It seems like there is no good reason to not compress your (W)ARCs.

But, there may just be one, HTTP Range Requests.

To handle an HTTP Range Request, a replay tool using compressed (W)ARCs has to access the WARC record and then decompress the payload from the start, at least as far as the requested range. If uncompressed, the replay tool could simply locate the start of the record and then skip the required number of bytes.

This only affects large files and is probably most evident when replaying video files. Users may wish to skip ahead etc., and that is implemented via range requests. Imagine the benefit when skipping to the last few minutes of a movie that is 10 GB on disk!
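
To make the asymmetry concrete, here is a minimal sketch (ignoring WARC and HTTP header parsing entirely, which a real replay tool obviously cannot do) of what reaching a byte range deep inside a record looks like in each case:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class RangeRequestSketch {
        // offset: start of the record within the (W)ARC file (from the index)
        // rangeStart: first payload byte the client asked for

        static InputStream openUncompressed(String file, long offset, long rangeStart)
                throws Exception {
            FileInputStream in = new FileInputStream(file);
            in.skip(offset + rangeStart);   // jump straight to the requested byte; cheap
            return in;
        }

        static InputStream openCompressed(String file, long offset, long rangeStart)
                throws Exception {
            FileInputStream in = new FileInputStream(file);
            in.skip(offset);                // start of the gzipped record
            InputStream payload = new GZIPInputStream(in);
            long remaining = rangeStart;    // everything up to here must be decompressed
            while (remaining > 0) {
                long skipped = payload.skip(remaining);
                if (skipped <= 0) break;
                remaining -= skipped;
            }
            return payload;
        }
    }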

Thus, it seems to me that a hybrid solution may be the best course of action: compress everything except files whose content type indicates an already compressed format, and compress when in doubt. It may also be best to compress files under a certain size threshold regardless of content type, since the record headers still compress well. That would need to be evaluated.

Unfortunately, you can't alternate compressed and uncompressed records in a GZ file such as a compressed (W)ARC. But it is fairly simple to configure the crawler to use separate output files for these content types. Most crawls generate more than one (W)ARC anyway.

Not only would this resolve the HTTP Range Request issue, it would also avoid a lot of pointless compression/uncompression work being done.