The Line Between the Web and Your CMS
A couple weeks ago, I wrote about a stupid thing I did with redirects.
My friend Andy Cohen reposted it on LinkedIn, and, among other things, said this:
Redirects don’t belong in the CMS layer ;)
I’m inclined to agree. I think a CMS can be a source for redirect data, but I’m coming to believe that any redirection should happen long before anything arrives on your CMS’s doorstep. In fact, a lot of things should happen before anything is serviced by your CMS. A lot more than I was doing, anyway.
I self-host this site, cached and proxied through a Cloudflare tunnel. It’s on a bespoke system that has grown organically, basically providing templating and organizational services to 6,000 or so Markdown files.
I usually run my website as a service, but last week, for whatever reason, I was running it in the console, which meant I was watching the requests append to the bottom of the output, down the screen, Matrix-style.
What I was seeing was mostly redirects.
(One problem: I wasn’t sending cache headers with redirect responses. It didn’t occur to me that they have “cachability” like anything else. I fixed this.)
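A minimal sketch of what that kind of fix can look like, assuming a fetch-style `Response` API (as found in Cloudflare Workers and recent Node versions). The helper name and the one-day lifetime are hypothetical, not this site's actual code or values.

```javascript
// Hypothetical helper: build a permanent redirect that caches are told they may store.
// The one-day max-age is an illustrative assumption, not a recommendation.
function cacheableRedirect(location, maxAgeSeconds = 86400) {
  return new Response(null, {
    status: 301,
    headers: {
      Location: location,
      "Cache-Control": `public, max-age=${maxAgeSeconds}`,
    },
  });
}

const res = cacheableRedirect("https://example.com/archives/some-post/");
console.log(res.status, res.headers.get("Cache-Control"));
```

With a header like this, a CDN or browser can answer repeat requests for the old URL without asking the origin again.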
Remember that this site used to be Gadgetopia, which has been on the web since 2002. Given the URL footprint (7,000 or so unique inbound URLs) and the time it was on the web, the site gets a lot of random traffic. Mostly bots, I’m sure, but there are still a lot of links out there.
A few years ago, I salvaged about 350 of those posts, then pushed the other 7,000+ off to another site I called the “archives” (an example). Getting people there involves some logic and redirection.
I had built that logic into my CMS, but the wastefulness of this suddenly occurred to me. I’m no longer adding to Gadgetopia, and that content is not going to change. It was time to move that off to some “outer shield.”
Additionally, I had a bunch of other logic that just didn’t need to be in my CMS.
I have a rule that an inbound URL either has to end with a trailing slash, or have a dot (“.”) in it (meaning a direct request to a file). If you try to come into a path without a trailing slash, you’ll get redirected.
The URL needs to be clean. Any double slashes, parent pathing, or upper-casing will be corrected and redirected.
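The two path rules above can be sketched as a single normalizing function. This is an illustration under assumptions, not my actual code: the name `canonicalPath` is hypothetical, and I'm assuming the "dot" test applies to the last path segment.

```javascript
// Hypothetical normalizer: returns the canonical form of a URL pathname.
function canonicalPath(pathname) {
  const segments = [];
  for (const segment of pathname.toLowerCase().split("/")) {
    if (segment === "" || segment === ".") continue; // drops the empty segments left by "//"
    if (segment === "..") segments.pop();            // parent pathing, never above the root
    else segments.push(segment);
  }
  if (segments.length === 0) return "/";

  const last = segments[segments.length - 1];
  // Assumed reading of the rule: a dot in the last segment means a file request;
  // anything else must end with a trailing slash.
  const needsSlash = pathname.endsWith("/") || !last.includes(".");
  return "/" + segments.join("/") + (needsSlash ? "/" : "");
}

// A request is redirected only when its path differs from the canonical form.
console.log(canonicalPath("/Foo//Bar"));
console.log(canonicalPath("/img/Logo.PNG"));
```

Comparing the incoming path to the canonical one gives a single yes/no answer to "does this request need a redirect?", which keeps all the rules down to one redirect instead of a chain.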
I control querystring args. I have a list of arguments I use for the site, and the CMS will strip anything not in that list and redirect you back without them. I do this so that the URL you see will be “clean,” stripped of any marketing cruft, in case you bookmark it or share it out again.
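As a rough sketch of that allowlist idea, using the standard `URL` and `URLSearchParams` APIs. The parameter names in `ALLOWED_PARAMS` are made up for illustration; the real list isn't important here.

```javascript
// Hypothetical allowlist of querystring arguments the site actually uses.
const ALLOWED_PARAMS = new Set(["q", "page"]);

// Returns the cleaned URL if anything was stripped, or null if the URL is already clean.
function stripUnknownParams(rawUrl) {
  const url = new URL(rawUrl);
  const unknown = [...url.searchParams.keys()].filter((key) => !ALLOWED_PARAMS.has(key));
  if (unknown.length === 0) return null;
  for (const key of unknown) url.searchParams.delete(key);
  return url.toString();
}

console.log(stripUnknownParams("https://example.com/post/?utm_source=linkedin"));
console.log(stripUnknownParams("https://example.com/search/?q=cms"));
```

A non-null result is the redirect target; null means the request can carry on untouched.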
I have a “kill list” of URL patterns that are DOA. These are mostly hack attempts to random WordPress-ish directories or other vulnerabilities. These are detected and 404ed very early in the request.
POST requests are only allowed through a very narrow range of URLs.
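The kill list and the POST restriction are both "reject early" checks, and could be sketched together like this. The patterns are invented examples, and answering a disallowed POST with a 405 is my assumption for the sketch; the point is only that the decision needs nothing from the CMS.

```javascript
// Hypothetical patterns; the real lists are longer and different.
const KILL_LIST = [/^\/wp-(admin|content|includes)\//, /^\/wp-login\.php$/, /\.env$/];
const POST_ALLOWED = [/^\/contact\/$/];

// Returns an HTTP status to short-circuit with, or null to let the request continue.
function earlyReject(method, pathname) {
  if (KILL_LIST.some((pattern) => pattern.test(pathname))) return 404;
  if (method === "POST" && !POST_ALLOWED.some((pattern) => pattern.test(pathname))) {
    return 405; // assumed status for a POST outside the allowed URLs
  }
  return null;
}

console.log(earlyReject("GET", "/wp-admin/setup.php"));
console.log(earlyReject("POST", "/contact/"));
```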
There are some more, getting increasingly esoteric. I have these in a class/infrastructure I called a PathFilter.
The goal is that any inbound URL that’s actually processed by my CMS should be… perfect. Maybe this is anal-retentiveness, I don’t know. But I steadfastly maintain that there’s value in being protective of your “URL-space.” URLs are “tickets” to content. Keep your tickets in order.
With Andy’s prompting, I’ve determined that this doesn’t need to be in my CMS. My CMS can be a source of information for it, but the action of all these things should be somewhere else.
I wrote a simple exporter to get the old Gadgetopia URLs out to a Cloudflare redirect CSV, and uploaded that via their Bulk Redirects feature. Instantly, that dropped inbound traffic by 80% - 90%. To be fair, the overhead of that traffic was low – the requests were redirected early, and the responses were all empty bodies with just a Location header, but still…
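An exporter like that can be tiny. Here is a sketch, with loud assumptions: both hostnames are placeholders, the path-for-path mapping is invented, and the column order (source URL, target URL, status code) is what I understand the bulk redirect CSV to expect, so check Cloudflare's current documentation before relying on it.

```javascript
// Quote a CSV field only when it contains a comma, quote, or newline.
function csvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical mapping: every old path moves to the same path on the archive host.
function toRedirectCsv(oldPaths, oldHost = "https://old.example.com", archiveHost = "https://archive.example.com") {
  return oldPaths
    .map((path) => [oldHost + path, archiveHost + path, 301].map(csvField).join(","))
    .join("\n");
}

console.log(toRedirectCsv(["/2004/05/some-post/", "/2009/11/another-post/"]));
```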
As for my weird logic, I pushed this off to a Cloudflare Worker. You can route all inbound traffic through a JavaScript function (the “Worker”). That JavaScript can do whatever it likes, up to and including servicing the entire request via some MVC-ish mechanism (yeah, you could easily write some form of a CMS directly on the Cloudflare edge network, if you wanted; I’m sure this has been done – I think Conscia does something like this).
My Worker examines the request, does the logic noted above and some other stuff, and either redirects, 404s, or passes the request through. Essentially, you can program against Cloudflare’s CDN logic.
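The overall shape of such a Worker is roughly the following. This is a simplified sketch, not my actual Worker: `decide` and its two toy rules are hypothetical stand-ins for the real checks, and in a real Worker module the `worker` object would be the default export.

```javascript
// Hypothetical decision function with two toy rules standing in for the real logic.
function decide(rawUrl) {
  const url = new URL(rawUrl);
  if (url.pathname.startsWith("/wp-admin/")) return { action: "notFound" };

  const canonical = new URL(url);
  canonical.pathname = url.pathname.toLowerCase();
  if (canonical.href !== url.href) return { action: "redirect", location: canonical.href };

  return { action: "pass" };
}

const worker = {
  async fetch(request) {
    const decision = decide(request.url);
    if (decision.action === "notFound") return new Response("Not found", { status: 404 });
    if (decision.action === "redirect") return Response.redirect(decision.location, 301);
    return fetch(request); // pass through to the origin unchanged
  },
};
// In a Worker module: export default worker;

console.log(decide("https://example.com/About/"));
```

Keeping the decision in a plain function, separate from the `fetch` handler, means the rules can be tested without deploying anything.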
Now, none of this is new. Cloudflare Workers have been around since 2017.
But the whole thing got me thinking about the division of responsibilities. How much work should your CMS do on the incoming request? Is that the highest and best purpose of your CMS – figuring out if it should even service the request? Or should that work be pushed off somewhere else?
At this moment, my CMS can trust that every single inbound request is one that should be serviced. I pulled hundreds of lines of logic out of the CMS – deleting an entire subsystem, in fact. For all my CMS knows, the web is a lovely place where all URLs are perfect and no mistakes are ever made.
I’ve started to take this a step further, and I’m separating the idea of my CMS from the web server entirely. Yes, yes, I know about headless (please don’t come at me with “CMS != web” stuff), but it’s a little different in that the web server will have a native connection to the CMS (a C# API). So it’s not remote, but at the same time, I’m divorcing the concept of my CMS and my web server.
So, even if your system is installed, there could be an argument to have two systems: a CMS, and then a web server integration. Requests hit your web server, are processed by your web server integration code, and then call out to the CMS code to be resolved against content.
As a neat side effect of all this, static file generation is a hop, skip, and a jump away. Since my CMS is just producing HTML and providing it to a web server – with no notion of it being used to satisfy a web request – then who says we can’t just write it to a file instead? My web server is now calling an HtmlOutputFactory which I can also invoke from a LINQPad script and get the same HTML that the web server gets.
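To make the idea concrete, here is a sketch in JavaScript (my real code is C#, and every name below is hypothetical). One rendering function knows nothing about HTTP, and two different consumers call it: one shapes a web response, the other hands the same HTML to whatever "write" function it is given.

```javascript
// Hypothetical stand-in for the CMS: resolves a path to HTML, with no notion of HTTP.
const content = new Map([
  ["/", "<h1>Home</h1>"],
  ["/about/", "<h1>About</h1>"],
]);
function renderHtml(path) {
  return content.has(path) ? content.get(path) : null;
}

// Consumer 1: the web server integration turns the result into a response-shaped object.
function handleRequest(path) {
  const html = renderHtml(path);
  return html === null ? { status: 404, body: "" } : { status: 200, body: html };
}

// Consumer 2: a static generator passes the same HTML to an injected write function.
function generateStatic(paths, write) {
  for (const path of paths) {
    const html = renderHtml(path);
    if (html !== null) write(path + "index.html", html);
  }
}

// Collect the "files" in memory here; a real run would write them to disk instead.
const files = {};
generateStatic(["/", "/about/"], (file, html) => { files[file] = html; });
console.log(Object.keys(files));
```

Because the write step is passed in, the same generator could target the file system, an object store, or a test harness without the rendering code changing.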
(Consider the inverse: could your static site generator be hooked up to your CMS as a dynamic content generation system? Could it generate a single path at a time, on demand, and return that to your CMS? The lines get really fuzzy here…)
This is probably an example of turning around in a circle to end up right back where you started, but it was an educational experience. …or a re-educational experience? I think these were all concepts I knew, but had lost track of for some reason.
Anyway, thanks to Andy for his LinkedIn assertion that flipped some bit in my brain and allowed me to separate two concerns more clearly.