Spiders are Stupid

By Deane Barker • November 4, 2005

I’ve been monitoring the 404s on this site. I changed our URL pattern a while back, so I have a page that catches every 404, resolves the old pattern against the new one, and then redirects. Anything that doesn’t resolve gets logged, and I have an RSS feed where I can watch them all.
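A minimal sketch of that catch-and-redirect flow, in Python. The old and new URL patterns here (`/archives/000123.html` and `/posts/123/`) are hypothetical stand-ins, not this site's actual patterns, and `handle_404` and `unresolved_log` are illustrative names only.

```python
import re

# Hypothetical old pattern: /archives/000123.html
# Hypothetical new pattern: /posts/123/
OLD_PATTERN = re.compile(r"^/archives/0*(\d+)\.html$")

# Stand-in for the log that would feed an RSS feed of unresolved requests.
unresolved_log = []


def handle_404(path):
    """Decide what to do with a request that matched no real page.

    Returns (status, location): a 301 with the new URL when the old
    pattern resolves, otherwise a 404 after logging the path.
    """
    match = OLD_PATTERN.match(path)
    if match:
        return 301, f"/posts/{match.group(1)}/"
    unresolved_log.append(path)
    return 404, None
```

Anything that lands in `unresolved_log` is what would end up in the feed: requests that neither exist nor map onto the new pattern.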

Which brings me to my point: Web spiders are pretty stupid. Ninety-nine percent of 404s to this site are from spiders. They’re looking for URLs that:

…they couldn’t possibly have derived from any other page on the site. Oftentimes they screw up relative vs. absolute URLs. I usually go check, just in case I forgot to put “http://” in front of something, but I usually find everything is in order and it must just be the spider that’s confused.
…existed a long, long time ago. I still get spiders coming in for pages with URLs that haven’t been around for three years. They must have them stored somewhere because every once in a while I’ll get about 300 consecutive requests from the same spider for the same old pattern, like it was reading them from a file somewhere.
…are obviously munged. Spiders truncate a lot, or insert random spaces in URLs. I finally modified my lookup script to remove spaces from the target URL first and, if it can’t find what they want, to try matching what they ask for against the front of a known URL, so I can catch truncations.
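The space-stripping and truncation matching described in the last item could look something like the sketch below. It is an illustration under assumptions, not the actual lookup script: the function name, the `known_urls` set, the minimum prefix length, and the rule that a truncated request must match exactly one known URL are all additions made here to keep the example safe.

```python
# Assumed cutoff: prefixes shorter than this match too many pages to trust.
MIN_PREFIX = 8


def resolve(requested, known_urls):
    """Best-effort match of a mangled request path against known URLs.

    Returns the matching known URL, or None if nothing plausible is found.
    """
    # Step 1: drop any spaces the spider inserted, raw or percent-encoded.
    cleaned = "".join(requested.replace("%20", "").split())
    if cleaned in known_urls:
        return cleaned

    # Step 2: treat the request as a truncated URL and match it against
    # the front of each known URL.
    if len(cleaned) < MIN_PREFIX:
        return None
    candidates = [url for url in known_urls if url.startswith(cleaned)]

    # Assumption: only redirect when the truncation is unambiguous.
    return candidates[0] if len(candidates) == 1 else None
```

For example, with known URLs `/posts/spiders-are-stupid/` and `/posts/spam/`, a request for `/posts/spiders -are-stupid/` or `/posts/spiders-are` would resolve to the first one, while `/posts/sp` would stay unresolved because it could be either.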

I’ve also noticed a lot of one-off spiders that I’ve never seen before. They come out of colleges a lot, it seems.

And, of course, there are hack attempts galore. Trying to hack the XMLRPC vulnerability that was revealed a few months ago is pretty common, and I get scads of long, long requests for things in _vti directories.

That said, monitoring your 404s is a really handy thing to do, as it alerts you to a lot of problems. We have over 4,500 entries now, and by watching bad requests, I find out all the time about bad links, missing images, etc. It’s a good, simple way to get a leg up on fighting content rot.

But don’t think the spiders are the smart ones. You’d think since they were programmed by (supposed) professionals, and have everything in a database somewhere, that they’d be pretty on top of things. My experience, however, indicates that a bunch of two-year-olds mashing on the keyboard would probably come up with more valid URLs than your average web spider.