Things that Web Crawlers Hate

By Deane Barker · 1 min read

We make the lives of web crawlers much more difficult and their work much less effective, unnecessarily.

AI Summary

This post lists habits that make websites needlessly hard to crawl: returning 200 OK for error pages, skipping canonical URL tags, self-replicating URL rabbit holes, missing sitemaps, and querystring, scheme, and "print" variations that give the same content multiple URLs. The author argues that we treat URLs too cavalierly, and that the web is worse off for it.

I wrote a web crawler in C# a couple of years ago. I’ve been fiddling with it ever since. During that time, I’ve been forcibly introduced to the following list of things my crawler hates.

  1. Websites that return a 200 OK for everything, even when the response should really have been a 404 or a 500 or a 302 or whatever
  2. Websites that don’t use canonical URL tags (there’s a small fetch-and-check sketch after this list)
  3. Websites with self-replicating URL rabbit holes
  4. Websites that don’t use the Google Sitemap protocol (no, I don’t depend on it, but it’s awfully handy to seed the crawler with starting points – I promise that a crawl will be better with one than without one)
  5. Websites that have non-critical information carried into the page on querystring params, thus giving multiple URLs to the same content (a normalization sketch follows this list)
  6. Websites with SSL that don’t control their schemes – secured pages should only be served under HTTPS, and unsecured pages only under HTTP – so that you can’t end up with two URLs for the same content that differ only by scheme
  7. Websites with a “print” option on every single page with a querystring param, thus giving that page two different URLs (okay, okay, this one is easy to filter for – I just always forget…)
  8. Misuse of the Content-Type HTTP header, because file extensions will handle it all…
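
Several of these – the querystring noise in #5, the scheme split in #6, the print links in #7 – come down to normalizing every URL before it goes into the crawl queue. Here’s a rough C# sketch of the kind of normalization I mean; the IgnoredParams list is just an illustrative guess, and you’d tune it (and the forced scheme) per site.

using System;
using System.Linq;
using System.Web; // HttpUtility.ParseQueryString (built into modern .NET; a reference in .NET Framework)

public static class UrlNormalizer
{
    // Params that don't change the content -- an illustrative list, not a canonical one.
    private static readonly string[] IgnoredParams = { "print", "utm_source", "utm_medium", "utm_campaign" };

    public static string Normalize(string url)
    {
        var uri = new Uri(url);

        // Force one scheme so http:// and https:// don't become two different URLs for the same page.
        var builder = new UriBuilder(uri) { Scheme = "https", Port = -1 };

        // Drop the non-critical params and sort what's left so parameter order doesn't matter either.
        var query = HttpUtility.ParseQueryString(uri.Query);
        var kept = query.AllKeys
            .Where(k => k != null && !IgnoredParams.Contains(k, StringComparer.OrdinalIgnoreCase))
            .OrderBy(k => k, StringComparer.OrdinalIgnoreCase)
            .Select(k => k + "=" + Uri.EscapeDataString(query[k]));

        builder.Query = string.Join("&", kept);
        return builder.Uri.ToString();
    }
}

Run every discovered link through something like this before de-duping, and items 5 through 7 mostly take care of themselves.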
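As for #1 and #2, the crawler has to check the status code for itself and honor a canonical tag when one exists. A minimal sketch, using a regex where a real HTML parser belongs (and note that it can’t do anything about a “soft 404” that comes back as a genuine 200):

using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class CanonicalFetcher
{
    // Don't silently follow redirects -- the crawler should see the 301/302 itself.
    private static readonly HttpClient Client =
        new HttpClient(new HttpClientHandler { AllowAutoRedirect = false });

    // Returns the URL the page claims as canonical, the request URL if no tag is present,
    // or null if the response wasn't a plain 200.
    public static async Task<string> GetCanonicalAsync(string url)
    {
        var response = await Client.GetAsync(url);
        if ((int)response.StatusCode != 200)
            return null;

        var html = await response.Content.ReadAsStringAsync();

        // Crude scan for <link rel="canonical" href="...">; it assumes rel comes before href.
        var match = Regex.Match(html,
            "<link[^>]+rel=[\"']canonical[\"'][^>]+href=[\"']([^\"']+)[\"']",
            RegexOptions.IgnoreCase);

        return match.Success ? match.Groups[1].Value : url;
    }
}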

Admittedly, a lot of things in this list are why crawlers are hard to write, and I should just suck it up and deal with it because this is reality. But the entire process has underscored to me how loosely we treat URLs (see the canonical URL post linked below for more on this).

We’re generally very cavalier about our URLs, and I think the web as a whole is worse off for it. URLs are a core technology, and there’s a philosophical point behind them about universal access to information, find-ability, and index-ability.

We should be more careful. Rant over.

Links to this – Patterns in URL Redirection After CMS Migrations (August 11, 2016)
A necessary part of any content migration is redirecting the URLs from old content to new. There are a number of strong patterns to this task.
Links to this – We Suck at HTTP (January 7, 2015)
If you're a web developer, then you owe your job to HTTP. You should probably know more about it than you do.
Links from this – Use Canonical URLs, Please (May 12, 2012)
URLs are not absolute. There are a million shades of gray, and canonicals were invented to resolve this. Use them.
Links from this – The Peril of Self-Replicating Hyperlinks (May 2, 2008)
I built an intranet for a client. One of the functional items is a viewer into an Exchange calendar. We use a handy third-party component to display the contents of an Exchange public folder on a page. The month and year to be viewed is driven off the querystring. Something like:...