Spiders are Stupid

By Deane Barker

The author discusses the common mistakes made by web spiders, which account for 99% of 404 errors on their site. They highlight that spiders often struggle with relative and absolute URLs, old URLs that haven’t been used in years, and truncated or inserted spaces in URLs. Despite these issues, the author notes that monitoring 404s can alert to many problems, such as bad links and missing images, and can aid in fighting content rot.

Generated by Azure AI on June 24, 2024

I’ve been monitoring the 404s on this site. I changed our URL pattern a while back, so I have a page that catches every 404, resolves the old pattern against the new one, and then redirects. Anything that doesn’t resolve gets logged, and I have an RSS feed where I can watch them all.
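For anyone who wants to try the same thing, here is a minimal Python sketch of the catch-and-redirect idea. It is not the code running on this site. The old pattern (`/archives/000123.html`), the new pattern (`/posts/123/`), and the names `resolve_legacy_url` and `handle_404` are all made up for illustration, and the "log" is just a list standing in for whatever storage feeds the RSS.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical old pattern: /archives/000123.html (zero-padded numeric ID).
LEGACY_PATTERN = re.compile(r"^/archives/0*(\d+)\.html$")


def resolve_legacy_url(path: str) -> Optional[str]:
    """Map an old-style path to the new-style path, or None if it doesn't match."""
    match = LEGACY_PATTERN.match(path)
    if match is None:
        return None
    # Hypothetical new pattern: /posts/<id>/
    return f"/posts/{match.group(1)}/"


def handle_404(path: str, unresolved_log: List[str]) -> Tuple[int, Optional[str]]:
    """Return (status, location); paths that can't be resolved are logged."""
    target = resolve_legacy_url(path)
    if target is not None:
        # The old URL maps cleanly onto the new scheme: redirect permanently.
        return 301, target
    # Nothing matched: record the path for later review and serve the 404.
    unresolved_log.append(path)
    return 404, None
```

In a real site, `handle_404` would sit behind the web server's custom error page, and the log would go to a file or database rather than an in-memory list.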

Which brings me to my point: Web spiders are pretty stupid. Ninety-nine percent of 404s to this site are from spiders. They’re looking for URLs that:

I’ve also noticed a lot of one-off spiders that I’ve never seen before. They come out of colleges a lot, it seems.

And, of course, there are hack attempts galore. Trying to hack the XMLRPC vulnerability that was revealed a few months ago is pretty common, and I get scads of long, long requests for things in _vti directories.

That said, monitoring your 404s is a really handy thing to do, as it alerts you to a lot of problems. The log has over 4,500 entries now, and by watching bad requests, I find out all the time about bad links, missing images, etc. It’s a good, simple way to get a leg up on fighting content rot.
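Here is a rough sketch of how you might dig through a 404 log like that. The record format is an assumption (a dict with `path` and `referrer` keys), as are the function names. The idea is simply that the most-requested missing paths are worth a look, and that a 404 whose referrer is your own site probably means a bad link or missing image you can fix yourself.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_404s(entries, top=10):
    """Count logged 404s per requested path, most frequent first."""
    counts = Counter(entry["path"] for entry in entries)
    return counts.most_common(top)


def internal_broken_links(entries, site_host):
    """Return sorted (referrer, path) pairs whose referrer is on our own host."""
    pairs = set()
    for entry in entries:
        # Hypothetical record format: the referrer may be missing or empty.
        referrer = entry.get("referrer") or ""
        if urlparse(referrer).hostname == site_host:
            pairs.add((referrer, entry["path"]))
    return sorted(pairs)
```

Requests with no referrer at all are left out of the second report, since there is no page to go fix.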

But don’t think the spiders are the smart ones. You’d think since they were programmed by (supposed) professionals, and have everything in a database somewhere, that they’d be pretty on top of things. My experience, however, indicates that a bunch of two-year-olds mashing on the keyboard would probably come up with more valid URLs than your average web spider.
