How do you do the actual URL lookup once you have an inbound URL you want to search for? This depends on the CMS and volume of content, but it’s usually complicated by the stemming issues mentioned in the prior section. If you have a 10-step stemming algorithm that you need to run old URLs through before you search them, and you have 100,000 old URLs stored alongside their new content in your CMS, this becomes a significant computational problem.
(Additionally, as noted in “Storage” above, you might be storing multiple old URLs in a single field within a content object, which is not great for searching. Thus, there needs to be an asynchronous step where you split these out and write them to individual records somewhere to facilitate searching.)
We generally write the URL lookup data, post-stemming, to a repository separate from the CMS. When a content object in the CMS is saved, the old URLs contained within it are split, stemmed, and the results written to a simple database table (or even something more elaborate, like a Redis cache). We’ve occasionally cached URL data in memory in the web server process, but the lack of persistence can be a problem because the cache has to be rebuilt on start-up, which might take too much time.
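Here’s a minimal sketch of what that save-time step might look like, assuming a SQLite table and a stand-in stem() function – the table layout and function names are illustrative, not prescriptive:

```python
import sqlite3

def stem(url: str) -> str:
    """Stand-in for the multi-step stemming algorithm from the prior section:
    drop the protocol and domain, lowercase, strip the querystring, fragment,
    and trailing slash."""
    url = url.lower()
    if "://" in url:
        url = url.split("://", 1)[1]
        url = "/" + url.split("/", 1)[1] if "/" in url else "/"
    url = url.split("?", 1)[0].split("#", 1)[0]
    return url.rstrip("/") or "/"

def ensure_lookup_table(db: sqlite3.Connection):
    db.execute(
        """CREATE TABLE IF NOT EXISTS url_lookup
           (stemmed_url TEXT PRIMARY KEY, content_id TEXT, new_url TEXT)"""
    )

def on_content_saved(db, content_id, old_urls_field, new_url):
    """Save-time hook: split the multi-valued old-URLs field, stem each entry,
    and rewrite that object's rows in the lookup table."""
    ensure_lookup_table(db)
    db.execute("DELETE FROM url_lookup WHERE content_id = ?", (content_id,))
    for old_url in old_urls_field.splitlines():
        if old_url.strip():
            db.execute(
                "INSERT OR REPLACE INTO url_lookup VALUES (?, ?, ?)",
                (stem(old_url.strip()), content_id, new_url),
            )
    db.commit()
```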
(A Cautionary Tale: We did this on a site with 70,000 URLs. It took about a minute to build the in-memory cache, and the build was triggered by the first 404 received. So, the site would start and get its first 404 a couple of seconds later. The handler would look for the cached URLs, wouldn’t find them, and so would start to build the cache. Before it finished, another 404 would come in, look for the URLs, not find them…and kick off another build of the cache. And so on and so on. This story does not have a happy ending.)
Remember that if you’ve modeled an explicit Redirect object for ad hoc redirects, a second step needs to find those and write them to the same storage location.
If you cache to an external, persistent datastore, you can rewrite this data whenever an object is saved, and perhaps run a batch job which clears and rewrites all the data in the event your stemming algorithm changes. The old URLs in the CMS are considered canonical – the external storage location should just be an easily queryable, denormalized cache which can be wiped out and rebuilt without too much pain.
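A rough sketch of that batch job, which also picks up the explicit Redirect objects mentioned above – `cms.all_content()` and `cms.all_redirects()` are hypothetical API calls standing in for whatever your CMS actually exposes:

```python
def rebuild_lookup(db, cms):
    """Batch job: wipe the denormalized cache and rebuild it from the canonical
    old-URL data in the CMS. Run this whenever the stemming algorithm changes."""
    ensure_lookup_table(db)
    db.execute("DELETE FROM url_lookup")
    for item in cms.all_content():
        on_content_saved(db, item.id, item.old_urls_field, item.url)
    # The second step from above: explicit Redirect objects get stemmed and
    # written to the same table.
    for r in cms.all_redirects():
        db.execute(
            "INSERT OR REPLACE INTO url_lookup VALUES (?, ?, ?)",
            (stem(r.old_url), r.id, r.target_url),
        )
    db.commit()
```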
Understand, too, that matching on a stored URL might be just the first step in your final resolution. Beyond trying to match on an old URL, you might have additional programming logic that attempts to find a new URL. My friend David Hobbs (who wrote the aptly titled Migration Handbook) relays a common type of scenario:
I once worked on a project where the old URL had the report number (http://site.com/documents/?id=523) and the new URL had the report name (http://site.com/library/why-monkeys-are-cute). There was a document management system keeping track of this, so the redirect engine in that case called the document management system to resolve the ID to the new URL.
In this case, you have some logic that runs either before or after the matching attempt and performs more custom resolution. Do you do this before or after your matching? I’m not sure it’s incredibly important (database queries are pretty fast these days), but it depends on your gut feeling of where the resolution is most likely to succeed.
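A sketch of that resolution chain, with the lookup first and a custom rule second – `dms` is a hypothetical client for an external system, like the document management system in the example above:

```python
def resolve(db, inbound_url, dms=None):
    """Resolution chain: try the stemmed-URL match first, then fall back to
    custom rules. Flip the order if your gut says the rules will hit more often."""
    row = db.execute(
        "SELECT new_url FROM url_lookup WHERE stemmed_url = ?",
        (stem(inbound_url),),
    ).fetchone()
    if row:
        return row[0]

    # Custom rule: old URLs like /documents/?id=523 carry a report ID that an
    # external system can map to the new URL.
    if dms and "/documents/" in inbound_url and "id=" in inbound_url:
        report_id = inbound_url.split("id=", 1)[1]
        return dms.url_for_report(report_id)

    return None
```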
The idea of rules logic could be extended further to try to preemptively handle common URL errors. If you had a bunch of URLs containing your product name (something/powerslicer-5000/something), and then you changed the name of the product, it might be worth changing the product name on the inbound URL and testing to see if the result matches an existing content item (assuming your CMS has an API call for this – a lot of them do).
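Something like this, where the renamed slug and `cms.get_by_url()` are hypothetical – most CMSs expose an equivalent “does this path resolve to content?” call:

```python
RENAMED_SLUGS = {"powerslicer-5000": "powerslicer-6000"}  # hypothetical rename

def try_rename_rule(cms, inbound_url):
    """Preemptive 'common error' rule: swap a renamed product slug in the
    inbound URL and check whether the corrected path now resolves to live content."""
    for old_slug, new_slug in RENAMED_SLUGS.items():
        if old_slug in inbound_url:
            candidate = inbound_url.replace(old_slug, new_slug)
            if cms.get_by_url(candidate):
                return candidate
    return None
```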
The basic idea here is this: if you’re running this on the 404 handler (and you should be), then something bad has already happened. You are already in a failure state, and nothing worse can happen at this point, so you should probably knock yourself out trying to get the visitor to the right place. Don’t send them somewhere irrelevant, but interrogate your inbound 404s vigorously and ask yourself, “Where were they trying to go, and what logic can I put in place to get them there?”
This is actually the easy part. Once you have a simple data source of stemmed URLs (and access to the same stemming function to run the inbound URL through), stem the inbound URL and see if you have a match. If you do, route the visitor with a 301 or whatever else your UI plan dictates.
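Wired into a 404 handler, that might look something like the following – a minimal sketch assuming Flask and the lookup table and resolve() function from the earlier sketches, not any particular CMS’s API:

```python
import sqlite3
from flask import Flask, redirect, request

app = Flask(__name__)
db = sqlite3.connect("redirects.db", check_same_thread=False)

@app.errorhandler(404)
def not_found(error):
    # Try to resolve the failed path; fall through to a normal 404 if we can't.
    target = resolve(db, request.path)
    if target:
        return redirect(target, code=301)
    return "Page Not Found", 404
```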
If your CMS is decoupled, you can likely still execute programming code of some kind. However, for static sites on execution-limited environments like S3, this might be a problem. If you have absolutely no way to run code on your web server, your options are limited. You could stand up the redirection engine as a service in a different environment, then make an Ajax call from the 404 page to find a match and redirect from the client. Or, depending on volume, you might be able to write and upload actual HTML page shells for each old URL with a redirect META tag in the header.
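For that last option, a small build-time script can generate the shells – here `redirect_map` is a hypothetical dict of old path to new URL, built from the same lookup data described earlier:

```python
from pathlib import Path

SHELL = """<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="refresh" content="0; url={new_url}">
    <link rel="canonical" href="{new_url}">
  </head>
  <body><p>This page has moved to <a href="{new_url}">{new_url}</a>.</p></body>
</html>
"""

def write_shells(redirect_map, out_dir="build"):
    """Write one static HTML shell per old URL, each META-refreshing to the new URL."""
    for old_path, new_url in redirect_map.items():
        target = Path(out_dir) / old_path.strip("/") / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(SHELL.format(new_url=new_url))
```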
It helps to check the inbound URL for a bookmark (“#chapter-1”) and re-add it to the redirection URL. If someone is requesting a bookmark on a page that has a valid redirect, it’s a safe assumption that it maps to an anchor inside the new page and the visitor would like to be scrolled down to it. If it doesn’t, no harm done.
If you don’t find a match through any method, you just let the 404 handler execute as normal and display a “Page Not Found” page to the visitor.
It might be valuable to store analytics on which redirects have been hit and where they’re coming from. With that data, you can see where old links live and ask other site owners to update them (truth: you’ll often find that some of the old links are on your own site).
Two schools of thought here:
Just store an incrementing “hits” field on the lookup database table. This is simple, but you lose the time dimension. A simple number will tell you volume, but not (1) in what timespan it accumulated, and (2) when it was last hit.
Store a record for each resolved redirection, with metadata like actual inbound URL (pre-stem), HTTP referrer, and timestamp. This is more complicated, but it lets you do more advanced analytics and track trends over time.
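The second option is just one more write in the 404 handler – a minimal sketch, with the table name and columns purely illustrative:

```python
from datetime import datetime, timezone

def log_redirect_hit(db, inbound_url, referrer, target_url):
    """One row per resolved redirection: keeps the time dimension so you can
    see both volume and recency, and query trends over time."""
    db.execute(
        """CREATE TABLE IF NOT EXISTS redirect_log
           (hit_at TEXT, inbound_url TEXT, referrer TEXT, target_url TEXT)"""
    )
    db.execute(
        "INSERT INTO redirect_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), inbound_url, referrer, target_url),
    )
    db.commit()
```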
You might want to store unresolved redirection attempts as well, as they can be handy for figuring out URL patterns you might be missing, and for finding other sites on the internet that just have bad links. You could contact them for updates, or just compensate for their mistakes by adding ad hoc redirects or additional URLs to the intended content objects. However, you’ll quickly discover that a lot of unresolved URLs come from spambots which have munged your URLs somehow, or from blatant hack attempts.