Things that Web Crawlers Hate

By Deane Barker · 1 min read

We make the lives of web crawlers much more difficult and their work much less effective, unnecessarily.

AI Summary

This post lists habits that make websites needlessly hard to crawl: returning 200 OK for error pages, skipping canonical URL tags, self-replicating URL rabbit holes, missing sitemaps, and querystring, scheme, and "print" variations that give the same content multiple URLs. The author argues that we treat URLs too cavalierly, and that the web is worse off for it.

I wrote a web crawler in C# a couple of years ago. I’ve been fiddling with it ever since. During that time, I’ve been forcibly introduced to the following list of things my crawler hates.

  1. Websites that return a 200 OK for everything, even when the response should really have been a 404 or a 500 or a 302 or whatever
  2. Websites that don’t use canonical URL tags (there’s a small fetch-and-check sketch after this list)
  3. Websites with self-replicating URL rabbit holes
  4. Websites that don’t use the Google Sitemap protocol (no, I don’t depend on it, but it’s awfully handy to seed the crawler with starting points – I promise that a crawl will be better with one than without one)
  5. Websites that have non-critical information carried into the page on querystring params, thus giving multiple URLs to the same content (a normalization sketch follows this list)
  6. Websites with SSL that don’t control their schemes – secured pages should only be served under HTTPS, and unsecured pages only under HTTP – so that you can’t end up with two URLs for the same content that differ only by scheme
  7. Websites with a “print” option on every single page with a querystring param, thus giving that page two different URLs (okay, okay, this one is easy to filter for – I just always forget…)
  8. Misuse of the Content-Type HTTP header, because file extensions will handle it all…
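
Several of these – the querystring noise in #5, the scheme split in #6, the print links in #7 – come down to normalizing every URL before it goes into the crawl queue. Here’s a rough C# sketch of the kind of normalization I mean; the IgnoredParams list is just an illustrative guess, and you’d tune it (and the forced scheme) per site.

using System;
using System.Linq;
using System.Web; // HttpUtility.ParseQueryString (built into modern .NET; a reference in .NET Framework)

public static class UrlNormalizer
{
    // Params that don't change the content -- an illustrative list, not a canonical one.
    private static readonly string[] IgnoredParams = { "print", "utm_source", "utm_medium", "utm_campaign" };

    public static string Normalize(string url)
    {
        var uri = new Uri(url);

        // Force one scheme so http:// and https:// don't become two different URLs for the same page.
        var builder = new UriBuilder(uri) { Scheme = "https", Port = -1 };

        // Drop the non-critical params and sort what's left so parameter order doesn't matter either.
        var query = HttpUtility.ParseQueryString(uri.Query);
        var kept = query.AllKeys
            .Where(k => k != null && !IgnoredParams.Contains(k, StringComparer.OrdinalIgnoreCase))
            .OrderBy(k => k, StringComparer.OrdinalIgnoreCase)
            .Select(k => k + "=" + Uri.EscapeDataString(query[k]));

        builder.Query = string.Join("&", kept);
        return builder.Uri.ToString();
    }
}

Run every discovered link through something like this before de-duping, and items 5 through 7 mostly take care of themselves.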
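As for #1 and #2, the crawler has to check the status code for itself and honor a canonical tag when one exists. A minimal sketch, using a regex where a real HTML parser belongs (and note that it can’t do anything about a “soft 404” that comes back as a genuine 200):

using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class CanonicalFetcher
{
    // Don't silently follow redirects -- the crawler should see the 301/302 itself.
    private static readonly HttpClient Client =
        new HttpClient(new HttpClientHandler { AllowAutoRedirect = false });

    // Returns the URL the page claims as canonical, the request URL if no tag is present,
    // or null if the response wasn't a plain 200.
    public static async Task<string> GetCanonicalAsync(string url)
    {
        var response = await Client.GetAsync(url);
        if ((int)response.StatusCode != 200)
            return null;

        var html = await response.Content.ReadAsStringAsync();

        // Crude scan for <link rel="canonical" href="...">; it assumes rel comes before href.
        var match = Regex.Match(html,
            "<link[^>]+rel=[\"']canonical[\"'][^>]+href=[\"']([^\"']+)[\"']",
            RegexOptions.IgnoreCase);

        return match.Success ? match.Groups[1].Value : url;
    }
}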

Admittedly, a lot of things in this list are why crawlers are hard to write, and I should just suck it up and deal with it because this is reality. But the entire process has underscored to me how loosely we treat URLs (see the canonical URL post linked below for more on this).

We’re generally very cavalier about our URLs, and I think the web as a whole is worse off for it. URLs are a core technology, and there’s a philosophical point behind them about universal access to information, find-ability, and index-ability.

We should be more careful. Rant over.

Links to this – Patterns in URL Redirection After CMS Migrations (August 11, 2016)
A necessary part of any content migration is redirecting the URLs from old content to new. There are a number of strong patterns to this task.
Links to this – We Suck at HTTP (January 7, 2015)
If you're a web developer, then you owe your job to HTTP. You should probably know more about it than you do.
Links from this – Use Canonical URLs, Please (May 12, 2012)
URLs are not absolute. There are a million shades of gray, and canonicals were invented to resolve this. Use them.
Links from this – The Peril of Self-Replicating Hyperlinks (May 2, 2008)
I built an intranet for a client. One of the functional items is a viewer into an Exchange calendar. We use a handy third-party component to display the contents of an Exchange public folder on a page. The month and year to be viewed is driven off the querystring. Something like:...