Unresolved 404 Patterns

I changed the URL scheme of this Web site over the weekend. I had been meaning to do it for a while, but some problems with Movable Type 3.2 kind of forced the issue. (I have got to stop rushing into every beta that presents itself…)

To make everything backwards compatible, I built a simple redirect system – I have a table in the database with every single permalink from the old site (all 9,000 of them – including entry RSS feeds and category pages), each mapped to its new URL.

If someone looks for a page which has moved, the 404 page does a lookup on this table, “resolves” the old URL against a new one, then redirects with a “301 Moved Permanently.” It seems to work well.
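A minimal sketch of that kind of lookup, in Python. This is not the code running on this site: the `REDIRECTS` dictionary stands in for the database table, and the paths in it and the name `handle_not_found` are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the database table: old permalink -> new URL.
REDIRECTS: Dict[str, str] = {
    "/archives/001234.html": "/2005/08/example-entry/",
}


def handle_not_found(path: str,
                     table: Dict[str, str] = REDIRECTS) -> Tuple[int, Optional[str]]:
    """Decide what the 404 page should answer for a requested path.

    Returns (301, new_url) when the old URL resolves against the table,
    or (404, None) for an unresolved, genuine 404.
    """
    new_url = table.get(path)
    if new_url is not None:
        return 301, new_url  # "301 Moved Permanently" plus a Location header
    return 404, None
```

In a real handler the first element would become the response status and the second the `Location` header.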

A side benefit of this system is that I can watch for “unresolved 404s,” meaning 404s that were not in my lookup table – a genuine 404, if you will. I’ve noticed some interesting phenomena:

Others, however, are more mysterious. Just two minutes ago, a spider tried to access a URL that it could only have hit if it missed the leading “/” in the HREF. Coming from this page…


…the spider tried to hit:


I just checked that page and there’s no way it pulled that URL out of the code. The correct URL was…


But the URL it requested could only have come about if the spider had a bug of some kind or if the HTML got mangled on the way down.
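To see why a dropped leading “/” produces a different URL, here is a small sketch using Python's standard `urljoin`. The page URL and the HREF are hypothetical stand-ins on example.com, not the actual URLs involved.

```python
from urllib.parse import urljoin

# Hypothetical page the spider was crawling.
PAGE = "http://www.example.com/archives/2005/page.html"


def resolve_href(page_url: str, href: str) -> str:
    """Resolve an HREF the way a browser or spider would, relative to the page."""
    return urljoin(page_url, href)


# With the leading slash, the link is resolved from the site root.
correct = resolve_href(PAGE, "/feeds/index.xml")

# Without it, the link is resolved relative to the page's own directory.
mangled = resolve_href(PAGE, "feeds/index.xml")

print(correct)  # http://www.example.com/feeds/index.xml
print(mangled)  # http://www.example.com/archives/2005/feeds/index.xml
```

The second result is the kind of URL that shows up as an unresolved 404: a valid path with the referring page's directory stuck on the front.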

I also get hits to things like this:


No mystery here – that’s just a truncated version of this:


Truncation, it seems, happens a lot. The Ask Jeeves/Teoma spider, for instance, has been trying all day to get at URLs that are all truncated at 39 characters. Add http://www.gadgetopia.com to that (25 characters, with the leading slash already counted in the 39), and you get 64 characters.

Why is this, I wonder? Was that the size of the database field they stored the URL in? More importantly, does it explain why I’ve never done so well in that index? I’m wondering now if my previously-long URLs have hurt my engine placement in other indexes besides Google.
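One way to check for this kind of truncation in a list of unresolved 404s is sketched below in Python. The function names and sample paths are hypothetical. The idea is simply to test whether a requested path is a strict prefix of a known URL, and to count unresolved paths by length to see whether they pile up at one cutoff.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def truncation_candidates(requested: str, known_paths: Iterable[str]) -> List[str]:
    """Return the known paths that the requested path is a strict prefix of."""
    return [p for p in known_paths
            if p.startswith(requested) and len(p) > len(requested)]


def common_lengths(unresolved_paths: Iterable[str]) -> List[Tuple[int, int]]:
    """Count unresolved paths by length, most frequent length first.

    A spike at a single length hints at a fixed-width cutoff somewhere upstream.
    """
    return Counter(len(p) for p in unresolved_paths).most_common()


# Hypothetical data for illustration only.
known = ["/2005/08/a-long-entry-title/", "/2005/08/another-entry/"]
print(truncation_candidates("/2005/08/a-long-en", known))
print(common_lengths(["/abc", "/abd", "/a"]))
```

If nearly every unresolved request from one spider has the same length, that points to a limit on the spider's side rather than to broken links on the site.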
