Things that Web Crawlers Hate

By Deane Barker

We make the lives of web crawlers much more difficult and much less effective, unnecessarily.

The author discusses the issues faced by web crawling software, including websites that return a 200 OK for everything, don’t use canonical URL tags, have self-replicating URL rabbit holes, and don’t use the Google Sitemap protocol. They also highlight the problems caused by non-critical information carried into the page on querystring params, SSL-enabled sites that don’t control which scheme a page is served under, and the misuse of the content-type HTTP header. The author criticizes the casual treatment of URLs, arguing that it negatively impacts the web’s accessibility, find-ability, and index-ability.

Generated by Azure AI on June 24, 2024

I wrote a web crawler in C# a couple years ago. I’ve been fiddling with it ever since. During that time, I’ve been forcibly introduced to the following list of things my crawler hates.

  1. Websites that return a 200 OK for everything, even if it was a 404 or a 500 or a 302 or whatever
  2. Websites that don’t use canonical URL tags
  3. Websites with self-replicating URL rabbit holes
  4. Websites that don’t use the Google Sitemap protocol (no, I don’t depend on it, but it’s awfully handy to seed the crawler with starting points – I promise that a crawl will be better with one than without one)
  5. Websites that have non-critical information carried into the page on querystring params, thus giving multiple URLs to the same content
  6. Websites with SSL that don’t control their schemes – they should only allow secured pages under HTTPS, and vice-versa, so that you can’t have two URLs for the same content, just with different schemes
  7. Websites with a “print” option on every single page with a querystring param, thus giving that page two different URLs (okay, okay, this one is easy to filter for – I just always forget…)
  8. Misuse of the content-type HTTP header, because file extensions will handle it all…
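Items 5, 6, and 7 all come down to several URLs pointing at the same content, and a crawler can defend itself by normalizing each URL before checking whether it has already been visited. What follows is a minimal Python sketch of that idea, not the code from my crawler. The `normalize_url` function and the `NON_CRITICAL_PARAMS` set are hypothetical names, the parameter list is only a guess at what a given site might use, and collapsing `http` and `https` into one scheme assumes the site serves the same content under both.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that do not change the page content.
# A real crawler would need to tune this per site.
NON_CRITICAL_PARAMS = {"print", "utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url, preferred_scheme="https"):
    """Return a normalized form of url for duplicate detection."""
    parts = urlsplit(url)

    # Drop non-critical parameters and sort the rest so that
    # parameter order does not produce distinct URLs.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CRITICAL_PARAMS
    )

    # Assumption: http and https serve the same content on this site.
    scheme = preferred_scheme if parts.scheme in ("http", "https") else parts.scheme

    # Host names are case-insensitive (this sketch assumes the URL
    # carries no user:password@ prefix); an empty path means "/".
    host = parts.netloc.lower()
    path = parts.path or "/"

    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, urlencode(kept), ""))


seen = set()
for link in ("http://Example.com/page?print=1&id=7", "https://example.com/page?id=7"):
    seen.add(normalize_url(link))
```

After the loop, `seen` holds a single entry, because both links normalize to the same string. The trade-off is that an overly aggressive parameter list can merge pages that really are different, so the list should start small.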

Admittedly, a lot of things in this list are why crawlers are hard to write, and I should just suck it up and deal with it because this is reality. But the entire process has underscored to me how loosely we treat URLs (see the canonical URL post linked above for more on this).
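When a site does publish a canonical URL, picking it up is cheap. Here is a rough Python sketch, using only the standard library, that reads the first `<link rel="canonical">` tag from a page and falls back to the fetched URL when there isn't one. The `CanonicalLinkParser` class and `canonical_url` function are hypothetical names for illustration, not part of my crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(page_url, html):
    """Return the declared canonical URL, or page_url if none is declared."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return page_url
    # Resolve relative hrefs against the URL the page was fetched from.
    return urljoin(page_url, parser.canonical)


page = '<html><head><link rel="canonical" href="/page"></head><body></body></html>'
result = canonical_url("https://example.com/page?print=1", page)
```

Here `result` is `https://example.com/page`, so the "print" variant and the plain page are indexed under one URL. This only works when the site fills in the tag honestly, which is the whole complaint.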

We’re generally very cavalier about our URLs, and I think the web as a whole is worse off for it. URLs are a core technology, and there’s a philosophical point behind them that deals with universal access to information, find-ability, and index-ability.

We should be more careful. Rant over.
