I wrote a web crawler in C# a couple of years ago, and I’ve been fiddling with it ever since. During that time, I’ve been forcibly introduced to the following list of things my crawler hates.
- Websites that return a 200 OK for everything, even if it was a 404 or a 500 or a 302 or whatever
- Websites that don’t use canonical URL tags
- Websites with self-replicating URL rabbit holes
- Websites that don’t use the Google Sitemap protocol (no, I don’t depend on it, but it’s awfully handy to seed the crawler with starting points – I promise that a crawl will be better with one than without one)
- Websites that have non-critical information carried into the page on querystring params, thus giving multiple URLs to the same content
- Websites with SSL that don’t control their schemes (secured pages served only under HTTPS, everything else only under HTTP), which leaves two URLs for the same content that differ only in scheme
- Websites with a “print” option on every single page with a querystring param, thus giving that page two different URLs (okay, okay, this one is easy to filter for – I just always forget…)
- Misuse of the content-type HTTP header, because file extensions will handle it all…
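
Several of these complaints (the querystring noise, the “print” param, the duplicate schemes) boil down to one URL having many spellings. Here is a rough sketch of the kind of normalization a crawler can apply before deciding whether it has already seen a page. It’s in Python rather than the C# my crawler uses, and everything in it is an assumption made for illustration: the `normalize_url` name, the `IGNORED_PARAMS` list, and especially the choice to fold every URL onto a single scheme, which is only safe when a site really does serve the same content under both.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of params that do not change the page content.
# A real crawler would tune this per site.
IGNORED_PARAMS = {"print", "utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url, preferred_scheme="https"):
    """Collapse equivalent spellings of a URL into one key for de-duplication."""
    parts = urlsplit(url)

    # Drop non-critical params, then sort the rest so order doesn't matter.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()

    # Hostnames are case-insensitive; keep only non-default ports.
    # Note: any user:password@ prefix is dropped by using .hostname.
    host = (parts.hostname or "").lower()
    if parts.port is not None and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"

    # Treat an empty path as "/", and discard the fragment entirely.
    path = parts.path or "/"

    # Assumption: http and https serve the same content on this site.
    return urlunsplit((preferred_scheme, host, path, urlencode(kept), ""))


if __name__ == "__main__":
    print(normalize_url("http://Example.com/page?print=1&id=7"))
    print(normalize_url("https://example.com/page?id=7#top"))
```

Both calls at the bottom produce the same key, so the crawler would fetch that page once instead of twice. None of this helps with self-replicating rabbit holes or a 200 OK on an error page; those need different handling.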
Admittedly, a lot of things in this list are why crawlers are hard to write, and I should just suck it up and deal with it because this is reality. But the entire process has underscored to me how loosely we treat URLs (see the canonical URL post linked above for more on this).
We’re generally very cavalier about our URLs, and I think the web as a whole is worse off for it. URLs are a core technology, and there’s a philosophical point behind them that has to do with universal access to information, find-ability, and index-ability.
We should be more careful. Rant over.