Unresolved 404 Patterns

By Deane Barker

I changed the URL scheme of this Web site over the weekend. I had been meaning to do it for a while, but some problems with Movable Type 3.2 kind of forced the issue. (I have got to stop rushing into every beta that presents itself…)

To make everything backwards compatible, I built a simple redirect system – I have a table in the database with every single permalink from the old site (all 9,000 of them – including entry RSS feeds and category pages), each mapped to its new URL.

If someone looks for a page which has moved, the 404 page does a lookup on this table, “resolves” the old URL against a new one, then redirects with a “301 Moved Permanently.” It seems to work well.
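A minimal sketch of that lookup, not the site's actual code: it assumes a two-column table (the names `redirects`, `old_path`, and `new_path` are hypothetical), uses an in-memory SQLite database as a stand-in for the real one, and uses made-up example paths. A hit returns a 301 with the new location; a miss stays a 404 and is recorded so it can be reviewed later.

```python
import sqlite3


def build_table(rows):
    """Create an in-memory lookup table of (old_path, new_path) pairs.

    Table and column names are assumptions made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE redirects (old_path TEXT PRIMARY KEY, new_path TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO redirects VALUES (?, ?)", rows)
    return conn


def handle_404(conn, requested_path, unresolved_log):
    """Resolve a missing path against the table.

    Returns (status, headers). Known old paths get a 301 pointing at the
    new URL; anything else remains a 404 and is appended to unresolved_log.
    """
    row = conn.execute(
        "SELECT new_path FROM redirects WHERE old_path = ?", (requested_path,)
    ).fetchone()
    if row is not None:
        return 301, {"Location": row[0]}
    unresolved_log.append(requested_path)
    return 404, {}


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    conn = build_table([("/old/example.html", "/1234")])
    log = []
    print(handle_404(conn, "/old/example.html", log))
    print(handle_404(conn, "/never/existed", log))
    print(log)
```

Keeping the misses in a separate list (or table) is what makes the leftover 404s easy to inspect on their own.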

A side benefit of this system is that I can watch for “unresolved 404s,” meaning 404s that were not in my lookup table – a genuine 404, if you will. I’ve noticed some interesting phenomena:

Others, however, are more mysterious. Just two minutes ago, a spider tried to access a URL that it could only have hit if it missed the leading “/” in the HREF. Coming from this page…

/2005/07/09/IsPerlStillRelevant.html

…the spider tried to hit:

/2005/07/09/4131

I just checked that page and there’s no way it pulled that URL out of the code. The correct URL was…

/4131

But the URL it actually requested could only have come about if the spider had a bug of some kind or if the HTML got mangled on the way down.
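To see why a lost leading "/" would produce exactly that request, here is a small illustration using standard relative-URL resolution from Python's `urllib.parse`. It only shows what a client would do if it saw `4131` instead of `/4131`; it does not explain how the slash went missing, and the helper name `resolved_path` is made up for this sketch.

```python
from urllib.parse import urljoin, urlsplit

# The page the spider was on when it followed the link.
BASE = "http://www.gadgetopia.com/2005/07/09/IsPerlStillRelevant.html"


def resolved_path(base, href):
    """Resolve href against base the way a browser or spider would,
    and return only the path portion of the result."""
    return urlsplit(urljoin(base, href)).path


if __name__ == "__main__":
    # With the leading slash, the link is relative to the site root.
    print(resolved_path(BASE, "/4131"))
    # Without it, the link is relative to the current "directory".
    print(resolved_path(BASE, "4131"))
```

A root-relative link ignores the directory of the referring page, while a bare `4131` is resolved against `/2005/07/09/`, which matches the stray request.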

I also get hits to things like this:

/2005/04/15/EasyJavaScriptAutocompleteI

No mystery here – that’s just a truncated version of this:

/2005/04/15/EasyJavaScriptAutocompleteIntellisenseScript.html

Truncation, it seems, happens a lot. The Ask Jeeves/Teoma spider, for instance, has been trying all day to get at URLs whose paths are all truncated at 39 characters. Put http://www.gadgetopia.com (25 characters) in front of that, and you get 64 characters.
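The arithmetic can be checked against the truncated example above. This sketch assumes the spider stored absolute URLs and cut them at 64 characters; that limit is a guess consistent with the numbers, not something confirmed about the spider. The host prefix is counted without a trailing slash (25 characters), because the path already begins with "/".

```python
PREFIX = "http://www.gadgetopia.com"
FULL_PATH = "/2005/04/15/EasyJavaScriptAutocompleteIntellisenseScript.html"
TRUNCATED_PATH = "/2005/04/15/EasyJavaScriptAutocompleteI"

# Assumed storage limit for an absolute URL (hypothetical).
ASSUMED_LIMIT = 64


def truncate_url(url, limit=ASSUMED_LIMIT):
    """Cut an absolute URL to at most `limit` characters."""
    return url[:limit]


if __name__ == "__main__":
    stored = truncate_url(PREFIX + FULL_PATH)
    print(len(PREFIX), len(TRUNCATED_PATH), len(stored))
    # The path that survives is what the spider would then request.
    print(stored[len(PREFIX):])
```

Under that assumption, cutting the full URL at 64 characters leaves exactly the 39-character path seen in the logs.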

Why is this, I wonder? Was that the size of the database field they stored the URL in? More importantly, does it explain why I’ve never done so well in that index? I’m wondering now if my previously long URLs have hurt my search engine placement in other indexes besides Google.
