Things that Web Crawlers Hate

By Deane Barker • November 12, 2014

We unnecessarily make the lives of web crawlers much more difficult, and the crawlers themselves much less effective.

I wrote a web crawler in C# a couple of years ago, and I’ve been fiddling with it ever since. During that time, I’ve been forcibly introduced to the following list of things my crawler hates.

Websites that return a 200 OK for everything, even when the real status should have been a 404 or a 500 or a 302 or whatever
Websites that don’t use canonical URL tags
Websites with self-replicating URL rabbit holes
Websites that don’t use the Google Sitemap protocol (no, I don’t depend on it, but it’s awfully handy to seed the crawler with starting points – I promise that a crawl will be better with one than without one)
Websites that have non-critical information carried into the page on querystring params, thus giving multiple URLs to the same content
Websites with SSL that don’t control their schemes – they should only allow secured pages under HTTPS, and vice-versa, so that you can’t have two URLs for the same content, just with different schemes
Websites with a “print” option on every single page with a querystring param, thus giving that page two different URLs (okay, okay, this one is easy to filter for – I just always forget…)
Misuse of the content-type HTTP header, because file extensions will handle it all…
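
Several of the items above (querystring noise, "print" params, uncontrolled schemes) come down to one page having many URLs. As a rough sketch of how a crawler might collapse them before checking its "already seen" list, here is a small Python example. The function name `normalize_url`, the `IGNORED_PARAMS` set, and the choice to force everything to HTTPS are all assumptions for illustration; which params are safe to drop depends entirely on the site being crawled.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of querystring params that do not change the content.
# This is an assumption; a real crawl would tune it per site.
IGNORED_PARAMS = {"print", "utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url, preferred_scheme="https"):
    """Collapse URL variants that probably point at the same content."""
    parts = urlsplit(url)

    # Assumption: http and https serve the same pages, so pick one scheme.
    scheme = preferred_scheme if parts.scheme in ("http", "https") else parts.scheme

    # Host names are case-insensitive, so lowercase the network location.
    netloc = parts.netloc.lower()

    # Drop the ignored params and sort the rest so their order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )

    # An empty path and "/" are the same resource.
    path = parts.path or "/"

    # Fragments are never sent to the server, so discard them.
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))


print(normalize_url("http://Example.com/page?print=1&b=2&a=1#top"))
```

With this in place, the crawler would key its visited set on the normalized URL rather than the raw one. It does not replace a canonical URL tag when the site provides one; it is only a fallback guess.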

Admittedly, a lot of things in this list are why crawlers are hard to write, and I should just suck it up and deal with it because this is reality. But the entire process has underscored to me how loosely we treat URLs (see the canonical URL post linked above for more on this).
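
As one example of "dealing with it," a crawler can guard against self-replicating URL rabbit holes with a simple heuristic. The sketch below is in Python, not the C# of my crawler. The function name `looks_like_rabbit_hole` and both thresholds are made up for illustration, and a site with legitimately deep or repetitive paths would need different numbers.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_rabbit_hole(url, max_depth=12, max_repeats=3):
    """Heuristically flag URLs that were probably produced by a link loop.

    Both thresholds are assumptions, not established values.
    """
    # Split the path into its non-empty segments.
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]

    # Very deep paths are suspicious on their own.
    if len(segments) > max_depth:
        return True

    # So is the same segment showing up again and again (/a/b/a/b/a/b).
    counts = Counter(segments)
    return any(count >= max_repeats for count in counts.values())


print(looks_like_rabbit_hole("http://example.com/a/b/a/b/a/b"))
print(looks_like_rabbit_hole("http://example.com/blog/2014/11/post"))
```

A crawler would skip (or at least deprioritize) any discovered link for which this returns true, trading a few missed pages for a crawl that actually finishes.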

We’re generally very cavalier about our URLs, and I think the web as a whole is worse off for it. URLs are a core technology, and there’s a philosophical point behind them that deals with universal access to information, find-ability, and index-ability.

We should be more careful. Rant over.