Thus the SURT prefix:
http://(is,
will match all domains under the .is TLD.
A lesser-known ability is matching against partial domain names. Thus the following SURT prefix:
http://(is,a
would match all .is domains that begin with the letter a (note that there isn't a comma at the end).
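A minimal sketch of the prefix matching described above, assuming a simplified SURT form that covers only the host part of a URL. The helper names surt_host and matches are hypothetical and are not Heritrix's API, and the domains are made-up examples; real SURTs also carry ports, paths and other URL parts.

```python
def surt_host(host: str) -> str:
    """Reverse the labels of a host name into a simplified SURT-style string."""
    labels = host.lower().split(".")
    return "http://(" + ",".join(reversed(labels)) + ",)"


def matches(prefix: str, host: str) -> bool:
    """A SURT prefix matches when the SURT form of the host starts with it."""
    return surt_host(host).startswith(prefix)


print(surt_host("example.is"))                # http://(is,example,)
print(matches("http://(is,", "example.is"))   # True: any .is domain
print(matches("http://(is,a", "abc.is"))      # True: .is domain starting with a
print(matches("http://(is,a", "example.is"))  # False: starts with e
print(matches("http://(is,a,", "abc.is"))     # False: trailing comma means exactly a.is
```

Because the match is a plain string prefix test, the trailing comma is what separates "the domain a.is and everything under it" from "any domain beginning with a".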
This all works quite well until you hit Internationalized Domain Names (IDNs). As the original infrastructure of the web does not really support non-ASCII characters in host names, all IDNs are designed so that they can be translated into an ASCII equivalent.
Thus the IDN landsbókasafn.is is actually represented by its "punycode" form, xn--landsbkasafn-5hb.is.
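The translation can be checked with a short sketch using Python's built-in "idna" codec (which implements the older IDNA 2003 rules; that is an assumption about tooling, not something Heritrix uses). The variable names are illustrative only.

```python
# "landsb\u00f3kasafn.is" is landsbókasafn.is, written with an escape for the ó.
idn = "landsb\u00f3kasafn.is"

# Encode each label to its ASCII (punycode) form.
ascii_form = idn.encode("idna").decode("ascii")
print(ascii_form)  # xn--landsbkasafn-5hb.is

# Decoding the ASCII form gives back the original Unicode name.
round_trip = ascii_form.encode("ascii").decode("idna")
print(round_trip == idn)  # True
```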
When matching SURTs against full domain names (trailing comma), this doesn't really matter. But when matching against a domain name prefix, you run into an issue. Considering the example above, should landsbókasafn.is match the SURT http://(is,l?
The current behaviour (at least in Heritrix's much-used SurtPrefixedDecideRule) is to evaluate only the punycode version. So the answer is no, although the domain would match http://(is,x.
This seems potentially limiting and likely to cause confusion.
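The difference can be illustrated with a small sketch. It is not Heritrix's code: surt_host and matches are hypothetical helpers that build a simplified, host-only SURT, and the to_ascii flag is an assumption added here purely to compare the two possible behaviours.

```python
def surt_host(host: str, to_ascii: bool = True) -> str:
    """Build a simplified, host-only SURT string (hypothetical helper)."""
    if to_ascii:
        # Python's built-in "idna" codec converts IDN labels to punycode.
        host = host.encode("idna").decode("ascii")
    labels = host.lower().split(".")
    return "http://(" + ",".join(reversed(labels)) + ",)"


def matches(prefix: str, host: str, to_ascii: bool = True) -> bool:
    """Plain string-prefix test against the SURT form of the host."""
    return surt_host(host, to_ascii).startswith(prefix)


idn = "landsb\u00f3kasafn.is"  # landsbókasafn.is

# Evaluating only the punycode form, as described above:
print(matches("http://(is,l", idn))  # False
print(matches("http://(is,x", idn))  # True, because of the xn-- prefix

# Evaluating the Unicode form instead (the alternative reading):
print(matches("http://(is,l", idn, to_ascii=False))  # True
```

With punycode-only evaluation, every IDN under a TLD falls under the x prefix regardless of how its name is spelled, which is the source of the confusion.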