My Own WTF, or How Not to Handle URL Redirects
About 10 years ago, I did a CMS migration for a large health system.
With the new CMS, they were changing the URLs of about 100,000 pages, so we used a standard redirection strategy. We had each old URL stored with its new content, so on app startup, we iterated all the content and loaded a URL lookup table into memory: Old URL → New URL. When a 404 came in, we’d look up the old URL and send a redirect to the new one. (I wrote quite a bit about that strategy here: Patterns in URL Redirection After CMS Migrations.)
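To make that concrete, here’s a minimal sketch of the idea in Python. The real system wasn’t written like this (or necessarily in this language), and every name below is mine, not the original code:

```python
# A sketch of the lookup-table strategy; all names are illustrative.

redirect_map: dict[str, str] = {}  # old URL -> new URL

def build_redirect_map(all_content) -> dict[str, str]:
    """Iterate all the content and map each stored old URL to its new URL."""
    return {
        item["old_url"]: item["new_url"]
        for item in all_content
        if item.get("old_url")
    }

def handle_404(path: str):
    """On a 404, redirect if we know where the old URL moved to."""
    new_url = redirect_map.get(path)
    if new_url:
        return 301, new_url  # permanent redirect to the new location
    return 404, None         # a genuine 404
```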
This is all pretty standard. Even 100,000 key/value pairs don’t take up much space in memory, and creating that structure took about 90 seconds on app startup. It was done in the background, asynchronously, so it would become available just over a minute after the site came online. Given that the app restarted pretty rarely (once a month, maybe?), this was fine.
However, I made a critical mistake: I didn’t actually start the process on app startup. For whatever reason, I triggered the creation of the lookup table when the first 404 came in.
The first 404 would look for the data in memory, and if it didn’t exist, it would then kick off the 90-second process to create it. I figured this was pretty efficient.
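Reconstructed against the sketch above, the mistake looked roughly like this. It’s a reconstruction, not the actual code, and `load_all_content` is a hypothetical stand-in for loading all the CMS content:

```python
import threading

# The mistake, reconstructed: lazy initialization with nothing to stop
# two (or two hundred) requests from starting the same build.

def handle_404(path: str):
    if not redirect_map:
        # No table in memory yet: kick off the ~90-second build.
        threading.Thread(target=rebuild, daemon=True).start()
        return 404, None  # can't redirect yet, so fall back to a real 404
    new_url = redirect_map.get(path)
    return (301, new_url) if new_url else (404, None)

def rebuild():
    """Rebuild the table from scratch, overwriting whatever was there."""
    global redirect_map
    redirect_map = build_redirect_map(load_all_content())
```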
What I didn’t realize is that this site got a lot of traffic. And in the 90 seconds after that initial 404 started the process, a lot of other 404s would come in. The site was big and had been online for a long time, so it was constantly being pelted with 404s. I’m sure most of this traffic was from crawlers trying to access the accumulated history of dozens of URL changes over a couple of decades of existence.
This meant that, for 90 seconds (theoretically), every single 404 kicked off that same process. Until that first process completed and the lookup table was in memory, none of those 404s would find the data, and each would dutifully set about creating it.
We launched the new site, and it mysteriously kept failing. Seconds after launch, the web server process would go ballistic until it stopped responding, at which point we’d fall back to the old environment. I never checked, but at some point, I’m sure we had hundreds of threads trying to iterate all the content and create the same lookup table in memory.
This happened on three consecutive days. Each day, I had “fixed” something that I suspected was the problem. Each day, I was wrong.
Needless to say, the client was just thrilled with all of this…
Confusingly, there was no excess memory usage, because as each process completed, it overwrote the table that was already there, so we only ever had the one table in memory. We just had hundreds of processes overwriting it, again and again.
In theory, it would eventually have resolved itself: once the first process finally wrote the data, new 404s would find the table in memory and wouldn’t start the process again. But we never got to the end of that chain of processes, because the server would straight up die before that point. I suppose it would have chewed through the backlog eventually (90 seconds turning into 90 minutes, or whatever), but we could never wait that long.
Since the problem seemed to be CPU-bound, I started looking for infinite recursion somewhere, and I wasn’t finding it.
I don’t remember when I figured out what was going on, but when I did, I felt like the sky fell in on me. The sheer stupidity of what I’d done was epic.
I believe I got very quiet. I think I laid my head on my desk for a long period of time.
I fixed it by setting an “in process” flag when the first process started, and just returning an actual 404 and bailing out for anything that came in during that 90-second window, so that only one background process was ever running at a time. There were undoubtedly better ways to fix and improve this, but it needed to be done in a hurry, so that’s what I did.
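Continuing the sketch, the hasty fix looked something like this. I’ve added a lock around the flag check to make the sketch thread-safe; the real code may have used nothing more than a bare flag:

```python
import threading

_build_lock = threading.Lock()
_build_in_progress = False  # the "in process" flag

def handle_404(path: str):
    global _build_in_progress
    if not redirect_map:
        with _build_lock:
            if _build_in_progress:
                return 404, None  # a build is already running; just bail
            _build_in_progress = True
        # Only the very first 404 reaches this point, so exactly one
        # background build ever runs.
        threading.Thread(target=rebuild, daemon=True).start()
        return 404, None  # this request also gets a plain 404
    new_url = redirect_map.get(path)
    return (301, new_url) if new_url else (404, None)
```

(The cleaner fix, of course, was the original plan: start the build on app startup, so the window never exists in the first place.)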
It’s a perfect demonstration that you need to test error conditions like lots of 404s. Consider that:
The average load test would likely have tested only URLs that existed, so it would never have triggered this
Even if my load test had included a single 404 (it didn’t…), the problem would never have surfaced on a single 404 (indeed, that was the exact situation I had coded for)
To be fair to me, these two points made it a non-trivial thing to discover, but coding it in the first place was clearly not my finest moment.