# Patterns in URL Redirection After CMS Migrations
After a CMS migration, URL redirection is often a problem. A lot of our work at Blend is CMS re-platforming, so we often have clients with tens of thousands of existing URLs. When they switch to a new CMS, they often change URL schemes, and they want to make sure all their old traffic gets to their new pages.
And this matters beyond just traffic. URLs matter. The URLs are the history of the web. When you publish a URL, you’re giving the world a key to unlock that content. To abandon a URL is the same as just changing the locks and walking away. Respect your URLs, people.
Okay, rant over.
Practically speaking, there are a bunch of ways to handle redirection, and over the years we’ve probably tried them all and settled into some common patterns and practices which I’ll touch on below. This will be a recap for some of you, but if you haven’t done this before, here’s the basics of how we do it.
You have to answer multiple questions:
- When do you “catch” and evaluate a URL for redirection?
- How do you store the old/new URL pairs?
- How do you normalize a URL for comparison?
- How do you efficiently execute the comparison?
- Beyond a simple lookup, how else will you try to redirect an inbound URL?
- How do you report on redirection in general?
- How do you maintain your redirects over time?
Of course, all of this assumes your CMS doesn’t offer some options here. Many CMSs build this functionality in. However, even then, it’s often a fairly superficial feature only meant to check a box on some matrix. The questions and issues specific to your particular scenario can vary enough to make a universally applicable framework difficult to build.
## Catching

When do you catch and evaluate the inbound URL to see if it should be redirected? If you consider the inbound request as a series of steps, at what point do you intervene?
One school of thought says to do it right away as the first action in a request, via an HttpModule (.NET) or Apache Rewrite rule.
We don’t like this, for two reasons:

1. It’s extra overhead for a lot of requests that will never be redirected. You’re trying to solve a problem before you even have a problem.
2. The site can never “recover” that URL. If an editor someday wants to proactively put a page at that URL (say, to replace an old deleted page), they can’t, because the request is redirected before it ever hits the CMS. That URL is forever lost.
In general, we like to put this code on the actual 404 handler. Let the CMS and the web server do their thing and confirm that there is, in fact, no page at that URL. If this has been determined, a 404 handler will execute in some form. Do your redirection here in the teeny space between (1) determining there’s no content for the URL, and (2) displaying “Page Not Found.” You’ve basically accepted you have nothing for this URL, so you’re going to make a last ditch attempt in the split second before giving up.
Put another way: you have the luxury of knowing you’re already in a failure state. If the 404 handler executes, the situation is already sub-optimal, so there’s little damage in doing some acrobatics to fix it – you have nothing to lose. There’s nowhere to go but up.
Handling the redirect this far “back” in the request might lead to the occasional situation of a redirection URL being “hidden” by a new page, but this is a minor problem. If an editor can’t figure out why a URL is redirecting, all they have to do is visit the URL to see that there’s actual content there and not a 404, and it should all become pretty clear. We don’t normally go back through and prune old redirection URLs that now have new content at them since this is not a terribly common occurrence.
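The 404-handler approach can be sketched as a last-chance hook between "no content found" and "render Page Not Found." This is a minimal illustration, not any particular CMS's API; `REDIRECTS` and `handle_404` are hypothetical names.

```python
# Sketch of redirect-on-404: this function is assumed to run only after
# the CMS has already confirmed there is no page at the requested path.
REDIRECTS = {
    "/old-products/widget": "/products/widget",
}

def handle_404(path):
    """Last-ditch attempt to rescue a request before showing Page Not Found.

    Returns a (status, location_or_body) tuple: a 301 if we find a match,
    otherwise the normal 404 response.
    """
    target = REDIRECTS.get(path)
    if target:
        return (301, target)          # rescue the request
    return (404, "Page Not Found")    # give up and render the error page
```

Because this runs only in the failure path, ordinary page views pay no cost for it.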
## Storage

Where do you store the URL lookups? Somewhere, there needs to be a list of old URLs and their corresponding new counterparts. And remember that this list could easily be one-to-many – one new page might have had 20 old URLs pointing to it over a period of time, and all 20 of those need to be redirected.
In most cases, we store this in the CMS itself, within each content object (page). A textarea with one old URL per line works well. We’ll store the old URL as content on initial import/migration, and an editor could easily add another line to this whenever they wanted. (Yes, globbing data up in a single field like this makes searching hard, but read the section on “Resolution” below and you’ll see why this isn’t normally an issue.) Having the URLs built directly into content also eliminates the need for a dedicated admin interface, which is handy.
However, in addition to URLs that point directly to content objects, you might have redirects that don’t point to content at all. For instance, if you now use an external vendor/site for something that used to be on your site directly, you need to push people to that external site, and there’s no content object for it. Thus, editors need a way to add ad hoc redirects.
In these cases, we’ll usually model an explicit object for “Redirect” which takes an old and a new URL. These can be created, edited, and deleted by editors like any other content. As a bonus, they’re subject to permissions, versioning, and workflow, which is helpful when redirection is subject to audit controls at your organization.
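The shape of such a Redirect object is trivial; here is a minimal sketch, assuming your CMS lets you model arbitrary content types. Field names are illustrative.

```python
# A minimal ad hoc Redirect content object. In a real CMS this type would
# also pick up versioning, permissions, and workflow for free.
from dataclasses import dataclass

@dataclass
class Redirect:
    old_url: str   # the inbound URL to catch
    new_url: str   # where to send the visitor; may be an external site

r = Redirect(old_url="/store", new_url="https://shop.example.com/")
```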
## Stemming

How do you normalize a URL so you can compare incoming URLs for matches? You’d like to think that URLs are just strings, that we can simply check whether A equals B and be done, but URLs are actually pretty vague things.
I’ve discussed this before. The internet is sloppy and forgiving with URLs. Visitors can sometimes dramatically change URLs and yet produce the same page of content. This can be helpful in many cases, but it makes comparison difficult. How do you reliably determine the equality of two URLs?
Consider the ways the same content might have variable URLs, and how you need to normalize the differences.
- Your site might respond to multiple domain names or subdomains. Does this matter? If not, they need to be removed.
- Your site might respond to URLs both with and without trailing slashes. Add or remove, just be consistent.
- Casing could be different. By default, Apache servers are case-sensitive, as are many programming languages (which will be doing the comparison). I’m not going to explain what to do here, because if you actually have different content at the same URL, differentiated only by case, then you are a bad person and deserve what happens to you.
- Bookmarks (“#chapter-1”) might be present on the end of the URL. It’s very hard to account for all possibilities (and new ones can be added very easily), so we just remove them for storage and comparison (but they should be re-added later; see “Redirection” below).
- Querystring order could be different: `?a=b&c=d` is a different string than `?c=d&a=b`, yet both would most likely produce the same page. Querystring arguments will have to be parsed and re-ordered consistently.
- Arbitrary querystring arguments could be added. A link back to this page with a random querystring argument appended will still get you here, but it makes the URL worthless for string matching. Remember that inbound marketing campaigns might tack all sorts of analytics tracking codes onto the URL (like the ubiquitous “utm_source”), and all of these “pollute” a URL from a string-matching perspective.
- Some querystring arguments might be valid but not worth 404’ing a request for. A parameter called “sort_by”, for instance, might sort an embedded table by different columns. You either need to account for all possible values, or just ignore that argument altogether and accept a default sort as better than trying to manage all the possibilities. (When it comes to querystring arguments in general, do you “blacklist” or “whitelist”? Do you keep a list of arguments you’ll always remove, or a list of the only arguments you’ll allow? The right answer depends on your URL architecture.)
These issues combine to make raw URLs “dirty,” and simple string matching is often worthless. URLs on both sides – what you have stored and the inbound URL you’re analyzing – will need to be deconstructed and put back together according to consistent, repeatable rules so you can do an effective comparison. You need to go beyond the simple string of characters and normalize the URL to determine its intent.
This is called “stemming” (so-named because you’re reducing text to a consistent “stem”) and it’s common in information retrieval. Any search engine worth its salt has stemming algorithms which take both (1) the content, and (2) the requested search, and attempt to stem both back to common ground.
For example, “swimming” and “swims” both become “swim” when stemmed by the Porter algorithm. Stemming attempts to remove variations which don’t change the core meaning of the text. This is one way search engines do what they call “fuzzy searching” – when you search for “swimming” but still get a result which contains “swim.” This happens because whenever the engine sees “swimming” (in either what you want to search for, or the content you’re searching), it just reads “swim” and moves on.
The goal is to get both the (1) old URL and (2) the inbound URL reduced to the simplest, most consistent possible text string in order to maximize the chances of matching. And remember that the stemming algorithm needs to be available to process inbound URLs as well – it only has value if you run two text strings through the same function.
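A URL stemmer along the lines described above might look like the following sketch. The specific rules here (lowercasing, trailing-slash removal, fragment removal, querystring sorting, a blacklist of ignored arguments) are assumptions to tune for your own URL architecture.

```python
# One possible URL normalization ("stemming") function, run over BOTH the
# stored old URLs and the inbound URL so the two sides are comparable.
from urllib.parse import urlsplit, parse_qsl, urlencode

# Assumed blacklist: arguments that never change which page is served.
IGNORED_ARGS = {"utm_source", "utm_medium", "utm_campaign", "sort_by"}

def stem_url(url):
    parts = urlsplit(url.lower())             # drops casing (values too)
    path = parts.path.rstrip("/") or "/"      # trailing slash: pick one rule
    # Drop the fragment entirely; re-add it at redirect time if needed.
    args = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in IGNORED_ARGS]
    args.sort()                               # consistent querystring order
    query = urlencode(args)
    return path + ("?" + query if query else "")
```

Run two messy variants of the same URL through this and they come out identical, which is the whole point: `stem_url("http://Example.com/Docs/?c=d&a=b&utm_source=x#top")` and `stem_url("/docs?a=b&c=d")` both yield `/docs?a=b&c=d`.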
## Resolution

How do you do the actual URL lookup once you have an inbound URL you want to search for? This depends on the CMS and volume of content, but it’s usually complicated by the stemming issues mentioned in the prior section. If you have a 10-step stemming algorithm that you need to run old URLs through before you search them, and you have 100,000 old URLs stored alongside their new content in your CMS, this becomes a significant computational problem.
(Additionally, as noted in “Storage” above, you might be storing multiple old URLs in a single field within a content object, which is not great for searching. Thus, there needs to be an asynchronous step where you split these out and write them to individual records somewhere to facilitate searching.)
We generally write the URL lookup data, post-stemming, to a repository separate from the CMS. When a content object in the CMS is saved, the old URLs contained within it are split, stemmed, and the results written to a simple database table (or even something more elaborate like a Redis cache). We’ve occasionally cached URL data in memory in the web server process, but the lack of persistence can be a problem because the cache has to be rebuilt on start-up, which might take too much time.
(A Cautionary Tale: We did this on a site with 70,000 URLs. It took about a minute to build the in-memory cache, and this was done when the first 404 was received. So, the site would start and get the first 404 a couple seconds later. The handler would look for the cached URLs, wouldn’t find them, and so would start to build the cache. Before it finished, another 404 would come in, look for the URLs, wouldn’t find them…and would kick off another build of the cache. And so on and so on. This story does not have a happy ending.)
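The stampede in that story is avoidable with a lock around the cache build, so the first 404 builds the cache once and concurrent requests wait instead of kicking off parallel rebuilds. This is a sketch; `load_all_redirects` stands in for your (slow) cache-building query.

```python
# Double-checked locking around a lazily built in-memory redirect cache.
import threading

_cache = None
_cache_lock = threading.Lock()

def load_all_redirects():
    # Imagine a minute-long scan of 70,000 URLs here.
    return {"/old": "/new"}

def get_redirects():
    global _cache
    if _cache is None:                 # fast path: no lock once built
        with _cache_lock:
            if _cache is None:         # re-check: another thread may have won
                _cache = load_all_redirects()
    return _cache
```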
Remember that if you’ve modeled an explicit Redirect object for ad hoc redirects, a second step needs to find those and write them to the same storage location.
If you cache to an external, persistent datastore, you can rewrite this data whenever an object is saved, and perhaps have a batch job which clears and rewrites all the data in the event your stemming algorithm changes. The old URLs in the CMS are considered canonical – the external storage location should just be an easily queryable, denormalized cache location which can be wiped out and rebuilt without too much pain.
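The save-time step might look like the following sketch, assuming a relational table keyed by stemmed URL. The table layout, the simplified stand-in stemmer, and `on_content_saved` are all illustrative, not any particular CMS's API.

```python
# On content save: split the textarea of old URLs, stem each one, and
# upsert a row into a denormalized, easily queryable lookup table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE redirects (
    stemmed_url TEXT PRIMARY KEY,
    content_id  INTEGER,
    hits        INTEGER DEFAULT 0)""")

def stem_url(url):
    return url.lower().rstrip("/") or "/"   # stand-in for the real stemmer

def on_content_saved(content_id, old_urls_field):
    for line in old_urls_field.splitlines():
        line = line.strip()
        if line:
            db.execute(
                "INSERT OR REPLACE INTO redirects (stemmed_url, content_id) "
                "VALUES (?, ?)", (stem_url(line), content_id))
    db.commit()

# One content object carrying two old URLs in its textarea field.
on_content_saved(42, "/Old-Page/\n/legacy/old-page")
```

Because the CMS remains canonical, this whole table can be truncated and rebuilt by a batch job whenever the stemming rules change.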
Understand too that the matching resolution might be just the first step in your final resolution. Beyond trying to match on an old URL, you might have additional programming logic that attempts to find a new URL. My friend David Hobbs (who wrote the aptly titled Migration Handbook) relays a common type of scenario:
> I once worked on a project where the old URL had the report number (`http://site.com/documents/?id=523`) and the new URL had the report name (`http://site.com/library/why-monkeys-are-cute`). There was a document management system keeping track of this, so the redirect engine in that case called the document management system to resolve the ID to the new URL.
In this case, some logic runs either before or after the matching attempt and performs more custom work. Do you run it before or after your matching? I’m not sure it’s incredibly important (database queries are pretty fast these days), but it depends on your gut feeling about where the resolution is most likely to happen.
The idea of rules logic could be extended further to preemptively handle common URL errors. If you had a bunch of URLs containing your product name (`something/powerslicer-5000/something`) and then changed the name of the product, it might be worth swapping the product name in the inbound URL and testing whether the result matches an existing content item (assuming your CMS has an API call for this; a lot of them do).
The basic idea here is this: if you’re running this on the 404 handler (and you should be), then something bad has already happened. You are already in a failure state, and nothing worse can happen at this point, therefore you should probably knock yourself out trying to get them to the right place. Don’t send them somewhere irrelevant, but interrogate your inbound 404s vigorously and ask yourself, “Where were they trying to go, and what logic can I put in place to get them there?”
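One way to organize this interrogation is a fallback chain: try the exact lookup first, then rule-based attempts like the report-ID scenario above. Every function here is a hypothetical stand-in for your own lookup table and back-end system calls.

```python
# A resolver chain for the 404 handler: each resolver gets a crack at the
# inbound URL; the first non-None answer wins.
import re

def lookup_exact(url):
    # Stemmed old URL -> new URL (normally a database query).
    table = {"/old-reports": "/library/reports"}
    return table.get(url)

def lookup_by_report_id(url):
    # Rule-based attempt: extract a numeric id and ask another system
    # (here faked with a dict) to resolve it to the new URL.
    m = re.search(r"[?&]id=(\d+)", url)
    if m:
        return {"523": "/library/why-monkeys-are-cute"}.get(m.group(1))

RESOLVERS = [lookup_exact, lookup_by_report_id]

def resolve(url):
    for resolver in RESOLVERS:
        target = resolver(url)
        if target:
            return target
    return None   # fall through to the normal Page Not Found
```

Adding a new rule later (say, the product-rename substitution) is just another function appended to `RESOLVERS`.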
## Redirection

This is actually the easy part. Once you have a simple data source of stemmed URLs (and access to the same stemming function to run the inbound URL through), stem the inbound URL and see if you have a match. If you do, route the visitor with a 301 or whatever else your UI plan dictates.
If your CMS is decoupled, you can likely still execute programming code of some kind. However, for static sites on execution-limited environments like S3, this might be a problem. If you have absolutely no way to do active programming on your web server, then your options are limited. You could put the redirection engine on a different environment as a service, then do an Ajax call out of the 404 page to find a match and redirect from the client. Or, depending on volume, you might be able to write and upload actual HTML page shells for each old URL with a redirect META tag in the header.
It helps to check the inbound URL for a bookmark (“#chapter-1”) and re-add it to the redirection URL. If someone is requesting a bookmark on a page that has a valid redirect, it’s a safe assumption that it maps to an anchor inside the new page and the visitor would like to be scrolled down to it. If it doesn’t, no harm done.
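Re-attaching the fragment is a one-liner; a sketch, assuming the redirect target carries no fragment of its own:

```python
# Carry the inbound bookmark over to the redirect target, if one exists.
from urllib.parse import urlsplit

def with_fragment(target, inbound_url):
    fragment = urlsplit(inbound_url).fragment
    return target + "#" + fragment if fragment else target
```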
If you don’t find a match through any method, you just let the 404 handler execute as normal and display a “Page Not Found” page to the visitor.
## Reporting

It might be valuable to store analytics about which redirects have been hit and where the traffic is coming from. With this, you can see where old links live and ask other site owners to update them (truth: you’ll often find that some of the old links are on your own site).
Two schools of thought here:

1. Just store an incrementing “hits” field on the lookup database table. This is simple, but you lose the time dimension. A simple number will tell you volume, but not (1) in what timespan it accumulated, and (2) when it was last hit.
2. Store a record for each resolved redirection, with metadata like the actual inbound URL (pre-stem), the HTTP referrer, and a timestamp. This is more complicated, but it lets you do more advanced analytics and track trends over time.
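The second option is just an append-only log table; a sketch with an illustrative schema:

```python
# One row per resolved redirect: keeps the time dimension and the referrer,
# at the cost of a table that grows with traffic.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE redirect_hits (
    inbound_url TEXT,   -- raw, pre-stem URL as requested
    referrer    TEXT,
    hit_at      REAL)""")

def log_hit(inbound_url, referrer):
    db.execute("INSERT INTO redirect_hits VALUES (?, ?, ?)",
               (inbound_url, referrer, time.time()))
    db.commit()

log_hit("/old-page?utm_source=news", "https://example.org/links")
```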
You might want to store unresolved redirection attempts as well, as they can be handy to figure out URL patterns you might be missing, and other sites on the internet that just have bad links. You could contact them for updates, or just compensate for their mistakes by adding ad hoc redirects or additional URLs to the intended content objects. However, you’ll likely quickly discover that a lot of unresolved URLs are spambots which have munged your URLs somehow, or blatant hack attempts.
## Maintenance

Managing the redirects over time is partially accomplished by storing the old URLs as content. When a content object is deleted, its old URLs go with it, and requests for them will now correctly 404 instead of redirecting (indeed, what would they redirect to?). If an editor creates a new page, they can add any redirects they want (though why would they on a new page?).
A trickier problem is when URLs get changed, especially if the change is high on a content tree that cascades URL segments. For example, say you have a page with a URL segment of “products” that has 250 descendants. If you decide to change this URL segment to “solutions,” you have likely just changed the effective URLs of the 250 pages below it, since a core segment of their path has changed.
In these cases, we haven’t found a better plan other than to just mass update all those descendant content objects with their old URL. When a content object is changed, we check to see if the URL segment being saved differs from the one already in the database. If so, then we retrieve all the descendants of the page and store their previous URL as a redirect.
If you do this after the save of the content object, their URL has likely already changed and you’ll have to reverse engineer what it was. If you do it before the object saves, you can usually access the prior URL, but the save action might wait a long time for this to occur. What we’ve done in the past is capture the descendants before the save (with their current URL), then spawn a new thread to complete the updating.
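That capture-then-spawn approach can be sketched as follows. `get_descendants` and `add_old_url` are hypothetical stand-ins for your CMS's API.

```python
# Before the segment change commits: snapshot each descendant's current
# URL, then write those as redirects on a background thread so the
# editor's save isn't blocked by a 250-page update.
import threading

captured = []   # stands in for redirect records written back to the CMS

def get_descendants(page_id):
    # Hypothetical CMS call, returning (id, current_url) pairs.
    return [("page-1", "/products/slicers"), ("page-2", "/products/dicers")]

def add_old_url(page_id, old_url):
    captured.append((page_id, old_url))

def before_segment_change(page_id):
    # Snapshot URLs while the old segment is still in effect.
    snapshot = get_descendants(page_id)
    def write_redirects():
        for child_id, old_url in snapshot:
            add_old_url(child_id, old_url)
    t = threading.Thread(target=write_redirects)
    t.start()   # don't make the save wait for the mass update
    return t

before_segment_change("products-page").join()
```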
(Note: for massive cascading changes like this, you might consider a rule-based approach as discussed in “Resolution” above. Perhaps check the raw URL string for `/products/` and just replace it with `/solutions/`. There are some edge cases where this could be a problem, so your mileage may vary.)
## Pruning

Should you ever prune your redirects and delete older ones? I don’t think so, for reasons both simple and complex.
First, the simple reason: doing this is a low-return task. What harm is an old redirect doing? In how many situations will redirecting a previously-valid URL cause some actual harm that you have to guard against? I can’t think of many. Reviewing and evaluating old redirects is just not a great use of time, generally.
Second, the more complex reason: that old URL is part of your content’s history, whether you like it or not. It doesn’t matter if your content existed at that URL for only five minutes before it was changed, that still happened. A URL is an “id,” which is short for “identifier” and encompasses all sorts of concepts about “identity.” That old URL, as annoying as it may be for you to have sitting out there, is part of your content’s identity and should be retained.
Again, respect your URLs, people.