July 1, 2015

Leap second and web crawling

A leap second was added last midnight. This is only the third time that has happened since I started doing web crawls, and the first time it happened while I had a crawl running. So, I decided to look into any possible effects or ramifications a leap second might have for web archiving.

Spoiler alert: there aren't really any. At least not for Heritrix writing WARCs.

Heritrix is typically run on a Unix-like system (usually Linux). On those systems, the leap second is implemented by repeating the last second of the day. That is, 23:59:59 comes around twice. This effectively means that the clock gets set back by a second when the leap second is inserted.
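To make the "23:59:59 comes around twice" behaviour concrete, here is a small Python sketch. It is only a toy model of what I described above, not how any kernel actually implements it, and the function name clock_reading is made up for this post. It maps the true number of seconds elapsed since midnight on June 30 to what the system clock would show.

```python
from datetime import datetime, timedelta, timezone

# Toy model (an assumption for illustration, not kernel code):
# 2015-06-30 really lasted 86401 seconds, but the system clock only
# has room for 86400, so it steps back one second at the end of the day.
DAY_START = datetime(2015, 6, 30, tzinfo=timezone.utc)
NORMAL_DAY_SECONDS = 86400


def clock_reading(elapsed_seconds):
    """Return what the system clock shows after elapsed_seconds of real time."""
    # Everything from the leap second onwards is shifted back by one second.
    step_back = 1 if elapsed_seconds >= NORMAL_DAY_SECONDS else 0
    return DAY_START + timedelta(seconds=elapsed_seconds - step_back)


if __name__ == "__main__":
    for elapsed in (86399, 86399.9, 86400, 86400.5, 86401):
        print(elapsed, clock_reading(elapsed).isoformat())
```

In this model, 86399 and 86400 elapsed seconds both read as 23:59:59, and a reading taken half a second into the leap second looks earlier than one taken just before it.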

Fortunately, Heritrix does not care if time gets set back. My crawl.log did show this event quite clearly as the following excerpt shows (we were crawling about 37 URLs/second at the time):

2015-06-30T23:59:59.899Z 404 19155 http://sudurnes.dv.is/media/cache/29/74/29747ef7a4574312a4fc44d117148790.jpg LLLLLE http://sudurnes.dv.is/folk/2013/6/21/slagurinn-tok-mig/ text/html #011 2015063023595
2015-06-30T23:59:59.915Z 404 242 http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-ford-econoline-v8-351-w/textarea.bbp-the-content ELRLLLLLLLX http://old.f4x4.is/myndasvaedi/44-tommu-breytinga-a-for
2015-06-30T23:59:59.936Z 200 3603 http://foldaskoli.is/myndasafn/index.php?album=2011_til_2012/47-%C3%9Alflj%C3%B3tsvatn&image=dsc01729-2372.jpg LLLLLLLLL http://foldaskoli.is/myndasafn/index.php?album=
2015-06-30T23:59:59.019Z 200 42854 http://baikal.123.is/themes/Nature/images/header.jpg PLLPEE http://baikal.123.is/ottSupportFiles/getThemeCss.aspx?id=19&g=6843&ver=2 image/jpeg #024 20150630235959985+-
2015-06-30T23:59:59.025Z 200 21520 http://bekka.bloggar.is/sida/37229/ LLLLL http://bekka.bloggar.is/ text/html #041 20150630235959986+13 sha1:C2ZF67KFGUDFVPV46CPR57J45YZRI77U http://bloggar.is/ - {"warc
2015-06-30T23:59:59.040Z 200 298365 http://www.birds.is/leikskoli/images/3072/img_2771__large_.jpg LLRLLLLLLLLLL http://www.birds.is/leikskoli/?pageid=3072 image/jpeg #005 20150630235956420+2603 sha1:F65B

There may be some impact on tools parsing your logs if they expect the timestamps to, effectively, be in order. But I'm unaware of any tools that make that assumption.
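If you want to check whether your own log shows the same jump, something along these lines should do. It is a minimal sketch that assumes the first whitespace-separated field of each line is a timestamp in the format seen in the excerpt above. The function name find_backward_jumps is mine, and the sample lines are just the first three fields of the excerpt.

```python
from datetime import datetime

# Timestamp layout as it appears in the crawl.log excerpt (assumed to
# hold for every line being checked).
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def find_backward_jumps(lines):
    """Return (line_number, previous, current) wherever time goes backwards."""
    jumps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        current = datetime.strptime(fields[0], LOG_TIME_FORMAT)
        if previous is not None and current < previous:
            jumps.append((number, previous, current))
        previous = current
    return jumps


# First three fields of each line in the excerpt.
sample = [
    "2015-06-30T23:59:59.899Z 404 19155",
    "2015-06-30T23:59:59.915Z 404 242",
    "2015-06-30T23:59:59.936Z 200 3603",
    "2015-06-30T23:59:59.019Z 200 42854",
    "2015-06-30T23:59:59.025Z 200 21520",
    "2015-06-30T23:59:59.040Z 200 298365",
]

if __name__ == "__main__":
    for number, before, after in find_backward_jumps(sample):
        print(f"line {number}: {before.time()} -> {after.time()}")
```

On the excerpt this flags a single jump, at the fourth line, where .936 is followed by .019.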

But, what about replay?


The current WARC spec calls for using timestamps with a resolution of one second. This means that all the URLs captured during the leap second will get the same value as those captured during the preceding second. No assumptions can be made about the order in which these URLs were captured, any more than you can make assumptions about the order of URLs captured during any ordinary second. It doesn't really change anything that this period of uncertainty now spans two seconds instead of one. The effective level of uncertainty remains about the same.
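A quick way to see this is to cut the timestamps from the log excerpt down to whole seconds. This is only an illustration: I am using the crawl.log timestamps as stand-ins for capture times, which is not exactly what ends up in the WARC, and the helper name to_second_resolution is made up.

```python
from datetime import datetime


def to_second_resolution(log_timestamp):
    """Drop the sub-second part, leaving a one-second-resolution timestamp."""
    parsed = datetime.strptime(log_timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


# Timestamps from the excerpt: three from before the clock stepped
# back and three from after.
log_timestamps = [
    "2015-06-30T23:59:59.899Z",
    "2015-06-30T23:59:59.915Z",
    "2015-06-30T23:59:59.936Z",
    "2015-06-30T23:59:59.019Z",
    "2015-06-30T23:59:59.025Z",
    "2015-06-30T23:59:59.040Z",
]

distinct_values = {to_second_resolution(t) for t in log_timestamps}

if __name__ == "__main__":
    print(distinct_values)
```

All six collapse to the single value 2015-06-30T23:59:59Z, so at this resolution nothing distinguishes the leap second from the second before it.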

Side note: the order of the captured URLs in the WARC may be indicative of crawl order, but that is not something that can be relied on.

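There is actually a proposal for improving the resolution of WARC dates. You can review it on the WARC review GitHub issue tracker. If adopted, a leap second event would mean that the WARCs actually contain incorrect information: captures made during the repeated second would carry sub-second timestamps that look earlier than captures that in fact preceded them.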

The fix for that would be to ensure that the leap second is encoded as 23:59:60, as per the official UTC spec. But that seems unlikely to happen, as it would require changes to Unix timekeeping or using non-system timekeeping in the crawler.
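To give a feel for how deep the 60-seconds-per-minute assumption runs, here is what Python's standard library does with 23:59:60. This is just one language's behaviour, shown as an example. Heritrix is written in Java and I have not checked how its time classes behave. The helper name datetime_accepts is mine.

```python
import calendar
from datetime import datetime


def datetime_accepts(second):
    """Check whether datetime can hold 2015-06-30 23:59:<second>."""
    try:
        datetime(2015, 6, 30, 23, 59, second)
        return True
    except ValueError:
        return False


# calendar.timegm does plain arithmetic on the fields, so 23:59:60
# lands on the same Unix timestamp as the following midnight.
leap_second = calendar.timegm((2015, 6, 30, 23, 59, 60))
next_midnight = calendar.timegm((2015, 7, 1, 0, 0, 0))

if __name__ == "__main__":
    print(datetime_accepts(59), datetime_accepts(60))
    print(leap_second == next_midnight)
```

The datetime type refuses a seconds value of 60 outright, and the Unix timestamp arithmetic has no distinct value for it, so a crawler wanting to record 23:59:60 would have to carry that information some other way.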

Perhaps it is best to just leave the WARC date resolution at one second.
