January 29, 2016

Things to do in Iceland ...

... when you are not in a conference center

It is clear that many attendees at the IIPC GA and web archiving conference in Reykjavík next April (details) plan to extend their stay. Several have contacted me for advice on what not to miss. So, I figured I'd write something up for all to see.

The following is far from exhaustive or authoritative. It largely reflects my personal taste and may have glaring omissions. It also probably reflects my memory as well!


Downtown Reykjavík has many interesting sights, museums and attractions. Not to mention shops, bars and restaurants. Of particular note is Hallgrímskirkja which rises up over the heart of city and right outside it is the statue of Leif Eriksson. The Pearl dominates the skyline a bit further south, there are stunning views to be had from its observation deck. In the heart of the city you'll find Alþingishúsið (Parliment building), city hall, the old harbor, Harpa (concert hall) and many other notable buildings and sights.

A bit further afield you'll find Höfði and the Sólfar sculpture. Sólfarið is without a question my favorite public sculpture, anywhere.

There are a large number of museums in Reykjavík. Ranging from the what-you'd-expect to the downright-weird.

Thanks to abundant geothermal energy, you'll find heated, open air, public swimming pools open year round. I highly recommend a visit to one of them. Reykjavík's most prominent public pool is Laugardalslaug.

Lastly, the most famous eatery in Iceland is in the heart of the city, and worth a visit, Bæjarins Beztu.

Out of Reykjavík

Just south of Reykjavík (15 minutes), in my hometown of Hafnarfjörður, you'll find a Viking Village!

A bit further there is the Blue Lagoon. I highly recommend everyone try it at least once. Do note that you may need to book in advance!

And just a bit further still, you'll find the Bridge Between Continents! Okay, so the last one is a bit overwrought, but still a fun visit if you're at all into plate tectonics.

Whale watching tours are operated out of Reykjavík, even in April. You'll want some warm clothes if you go on one of those!

Probably the most popular (for good reasons) day tour out of Reykjavík is the Golden Circle. It covers Geysir, Gullfoss (Iceland's largest waterfall) and Þingvellir (the site of the original Icelandic parliament, formed in 930, it is a UNESCO World Heritrage Site). You can do the circle in a rented car and there are also numerous tour operators offering this tip and variations on it.

Driving along the southern coast on highway 1, you'll also come across many interesting town and sights. It is possible to drive as far as Jökulsárlón and back in a single day. Stopping at places like Selfoss, Vík í Mýrdal, Seljalandsfoss, Skógarfoss and Skaftafell National Park. It may be better to plan an overnight stop, however. This is an ideal route for a modest excursion as there are so many interesting sights right by the highway.

Further afield

For longer trips outside of the capitol there are too many options to count. You could go to Vestmannaeyjar, just off the south coast, or to Akureyri, the capitol of the north. A popular trip is to take highway 1 around the country (it loops around). Such a trip can be done in 3-4 days, but you'd need closer to a week to fully appreciate all the sights along the way.

January 28, 2016

To ZIP or not to ZIP, that is the (web archiving) question

Do you use uncompressed (W)ARC files?

It is hard to imagine why you would want to store the material uncompressed. After all, web archives are big. Compression saves space and space is money.

While this seems straightforward it is worth examining some of the assumptions made here and considering what trade-offs we may be making.

Lets start by consider that a lot of the files on the Internet are already compressed. Images, audio and video files as well as almost every file format for "large data" is compressed. Sometimes simply by wrapping everything in a ZIP container (e.g. EPUB). There is very little additional benefit gained from compressing these files again (it may even increase the size very slightly).

For some, highly specific crawls, it is possible that compression will accomplish very little.

But it is also true that compression costs very little. We'll get back to that point in a bit.

For most general crawls, the amount of HTML, CSS, JavaScript and various other highly compressible material will make up a substantial portion of the overall data. Those files may be smaller, but there are a lot more of them. Especially, automatically generated HTML pages and other crawler traps that are impossible to avoid entirely.

In our domain crawls, HTML documents alone typically make up around a quarter of the total data downloaded. Given that we then deduplicate images, videos and other, largely static, file formats, HTML files' share in the overall data needing to be stored is even greater. Typically approaching half!

Given that these text files compress heavily (by 70-80% usually), tremendous storage savings can be realized using compression. In practice, our domain crawls compressed size is usually about 60% of the uncompressed size (after deduplication).

More frequently run crawls (with higher levels of deduplication) will benefit even more. Our weekly crawls' compressed size is usually closer to 35-40% of the uncompressed volume (after deduplication discards about three quarters of the crawled data).

So you can save anywhere from ten to sixty percent of the storage needed, depending on the types of crawling you do. But at what cost?

On the crawler side the limiting factor is usually disk or network access. Memory is also sometimes a bottleneck. CPU cycles are rarely an issue. Thus the additional overhead of compressing a file, as it is written to disk, is trivial.

On the access side, you also largely find that CPU isn't a limiting factor. The bottleneck is disk access. And here compression can actually help! It probably doesn't make much difference when only serving up one small slice of a WARC but when processing entire WARCs it will take less time to lift them off of the slow HDD if the file is smaller. The additional overhead of decompression is insignificant in this scenario except in highly specific circumstances where CPU is very limited (but why would you process entire WARCs in such an environment).

So, you save space (and money!) and performance is barely affected. It seems like there is no good reason to not compress your (W)ARCs.

But, there may just be one, HTTP Range Requests.

To handle an HTTP Range Request a replay tool using compressed (W)ARCs will have to access a WARC record and then decompress the entire payload (or at least from start and as far as needed). If uncompressed, the replay tool could simply locate the start of the record and then skip the required number of bytes.

This only affects large files and is probably most evident when replaying video files. User's may wish to skip ahead etc. and that is implemented via range requests. Imagine the benefit when skipping to the last few minutes of a movie that is 10 GB on disk!

Thus, it seems to me that a hybrid solution may be the best course of action. Compress everything except files whose content type indicates an already compressed format. Configure it to compress when in doubt. It may also be best to compress files under a certain size threshold due to headers being compressed. That would need to be evaluated.

Unfortunately, you can't alternate compressed and uncompressed records in a GZ file such as (W)ARC. But it is fairly simple to configure the crawler to use separate output files for these content types. Most crawls generate more than one (W)ARC anyway.

Not only would this resolve the HTTP Range Request issue, it would also avoid a lot of pointless compression/uncompression work being done.