Case Studies of CMS-to-SQL Decoupled Publishing
I have two teenage daughters. They have a lot of skills, but none so highly developed as the selfie. They generate spectacular selfies. But, here’s the thing – for each selfie that makes it onto Instagram, they throw away, like, 10 of them. Smile, flash, review, frown, delete…and repeat.
Digital photography means we can just do trial run after trial run, and throw things we don’t like away until we have something we like. Then we have this moment, frozen in time, that looks perfect and hides the 10 that didn’t work, plus all the other really annoying parts of life like homework that isn’t done and the fact that we haven’t vacuumed the stairs like our father asked us to do three days ago (ahem).
Photography is like this. It’s a single, perfect, and often unrealistic and idealized moment in time.
Why am I talking about this?
Because this is kind of the same principle behind decoupled content management. In that paradigm, content is managed in one environment. It’s labored over until it’s perfect, then it’s snapshotted (“selfied”) and pushed out to the delivery tier (“Instagram”) and we nonchalantly pretend like our content just springs forth that perfect all the time.
I enjoy snapshotting content like this. It hides all the messiness. Static site generators are based on this principle. Do crazy, arcane things to build your site, but then just snap a picture of it in its most perfect state, and publish that to the world. No one ever has to know the ridiculous lengths you might have employed to get it there.
The problem with static generators is that flat HTML files are pretty inflexible, so it’s sometimes easier to snapshot to a dynamic datasource – one you can manipulate and query in real-time, like a relational database. I’ve implemented this model a few times. It uses a CMS architecture to manage the content, then just pushes it into a set of database tables.
The database itself has no idea where the content came from, and developers using the content don’t have to know either. They just get this wonderfully neutral datasource, around which there’s probably the deepest toolset in our industry. A relational database is almost universally accessible and usable. Indeed, the first dynamic websites any of us built were probably just spitting database records out onto a page.
I have two examples of this, at opposite extremes. The first is about as “enterprisey” as I’ve ever done it. It was done for a client and was deployed to production. The second is a counterpoint and is literally as simple as I can make it, using Contentful and their webhook architecture. I’ve provided the code for this which should integrate into any project pretty easily.
The Enterprise Case Study: Sitecore
A client had all their content in a custom SQL Server database. Their marketing department wanted to migrate to Sitecore. Unfortunately, they had many processes tied to this database – for example, it was behind a large REST API from which their mobile app was powered – and they didn’t want to re-write all of that.
We came up with a plan to populate this database from Sitecore. It worked fairly well. We sort of fell down the rabbit hole with this one, and had to layer more and more functionality on top of it as more and more edge cases came up, but when we left it, they were stable.
The database was highly relational – lots of foreign keys. Any given record could be tied to a half-dozen other records. This made things complicated because a common pattern with database snapshots is that you simply delete and re-insert rows. If you accept that the CMS is the single, authoritative source of your content, then when a record is changed, just delete it and re-create it in the external database. However, with foreign keys, this can be problematic. If a keyed record changes, you can’t delete it without violating a bunch of keys, so you have to update it in place.
Additionally, volume was an issue. There were about 100,000 records going from Sitecore to SQL. Sitecore is pseudo-decoupled in that you’re moving content from one database to another, and you can do this en masse. You can re-publish the root node of the tree, and thus instigate a furious storm of database traffic. It’s quite easy to initiate a hundred thousand database queries with a single click of the mouse, and if something falls over during this process, all hell can break loose.
What we ended up doing was this:
The database mappings were stored directly in the content model. In Sitecore, everything is an “Item” which can be wonderfully abstract and recursive. I could extend the each Template (content type) with an extra Field that specified what SQL table it mapped to (we only did this if the table name was different than the Template name; if this was empty, we just used the Template name). I could even add a Field to a Field description to specify the column it mapped to (again, only if the column name was different than the Field name).
We hooked the Sitecore “Publish” pipeline. Sitecore’s pipelining system is quite well-done (it’s kind of an alternative event handling system), and we could easily inject our own code execution at a specific point in the publishing process. We did this asynchronously, so the processing that follows happened in a background thread. Execution would branch here, and Sitecore could continue on unimpeded.
We had an Item serializer that turned a Sitecore Item into XML. It serialized actual Fields, and also Fields that didn’t exist, like the URL and the parent Item ID. The system was extensible, so you could create or retrieve any data from anywhere and pretend it was a Field (ex:
currentTempInMoscow). This was necessary because some things we needed in the external database (like the URL to the item) didn’t exist as Fields. These things were produced by method and property calls to the Sitecore API, so we had to “ask” Sitecore for them while we were still in that context, then we serialized the answer as if it was a simple Field.
Each Field type had a specific serializer which formatted data in that type of Field to the correct format to get it into SQL. Additionally, we could have custom serializers for specific Template/Field combinations, so you could run custom filtering/formatting code on a specific Field on a specific Template to “prep” it for transport.
After serialization, we ended up with a perfectly neutral XML document containing all necessary data. It had no traces of Sitecore in it – it was perfectly self-contained and “detached from the mothership.”
We sent this XML document to Microsoft Messaging Queue (the concept of a “target” was abstracted – we could have sent it to any number of things, MSMQ just being one of them). We wanted a buffer or “holding pen” between Sitecore and SQL. If SQL went down for some reason, we didn’t want Sitecore to break, or to lose updates. MSMQ is as old as dirt and rock-solid stable, so it would just sit around, accepting messages from Sitecore, storing them in the correct chronological order of receipt until some process asked for them. Additionally, MSMQ is ridiculously fast, and there’s no way Sitecore could overwhelm it. (I’ll discuss volume later, but suffice it to say that performance mattered.)
We created a Windows Service that processed the messages from MSMQ. This service knew nothing about Sitecore, it just maintained an open connection to MSMQ and processed anything that came down the pipe (technically, a process other than Sitecore could have dropped a properly-formatted message in MSMQ as well). The service took the XML document and updated the corresponding tables in the SQL database. The foreign key issues got a little complicated, so in some cases, we serialized child objects with the parent as nested XML, then updated the SQL from the “inside out.”
Clearly, we ended up abstracting this process to quite a few levels. To this day, I wonder if we abstracted a level too far.
The basic path was this:
XML (via a custom serializer)
MSMQ (this was one target; technically we could have branched here into multiple targets)
SQL Server Database
For each level of abstraction, I can provide a justification…but just barely, and I concede that someone smarter than me might disagree.
The system worked well. For single content item updates, the process took milliseconds and was overkill. But it was in its glory on mass republishes.
You could republish the root node of Sitecore, and tens of thousands of records would explode through the process. Watching MSMQ could be a little breathtaking, as thousands and thousands of messages would instantly start queueing up. If I was running the Windows service from the console, log messages would start raining down the console window, Matrix-style, for three or four minutes. The cessation of activity would cascade – Sitecore would stop publishing; shortly thereafter the background thread would send the last document to MSMQ; sometime after that, the Windows service would process the last message out of MSMQ.
We built-in end-to-end logging. During the debugging, we often had to peel back the layers and watch the system work. We needed logs of (1) the formation of the XML doc, (2) the delivery to MSMQ, (3) the pickup by the Windows service, and (4) the SQL to update the record(s). We got it to the point where we could “follow” a record from start to finish, and we used these logs extensively.
(Lesson: don’t retrofit logging when you need it. Build copious logging in from the start and ideally create a consolidated logging architecture to which every part of the process is writing so the lifespan of a record can be seen from birth to death.)
This particular client engagement ended, and they took over responsibility for developing the system. Consequently, I can’t look back on years of history here to gauge long-term stability, but when we left the system, it was functioning as designed.
We had a few logical errors we fixed along the way (and the client eventually found a few more), but the basic architecture was solid. If we had more time to work on this, I’m convinced it could have been perfected (and the client may have done so since we left).
The Simple Case Study: Contentful
Recounting that Sitecore system was stressful. It was complicated, but necessarily so. To show the other extreme, I worked up with a very simple proof-of-concept using Contentful.
I actually – literally – tried to keep this as simple as possible. Robert Linde has written a fantastic .NET SDK for Contentful, and I’ve written the Contentful Webhook Server (which will soon be obviated by an official version), but I wanted to remove all voodoo and see what I could do with no external libraries, other than what’s provided by .NET MVC itself. So, I started from a fresh MVC project to see how far I could get.
I got it down to a couple dozen lines of code. The result doesn’t even rate a complete project: here’s a simple Gist of the MVC controller. That was the only file I created in the project. You will also need a single line in the global.asax – which is noted in a comment – and, of course, a connection string.
Contentful has a webhook system that it can trigger on any number of actions – Publish and Unpublish being the important ones here. So when an object is published or unpublished, Contentful will POST a request to any URL you give it. As a bonus, it also sends the content object that triggered the event. So it doesn’t just notify your URL, it also says, “…and, by the way, here’s the entire content object that’s under consideration at the moment.”
If you combine this with .NET MVC model binding, you get something really magical. The framework will accept the incoming request and automatically deserialize the JSON to an object.
You could create a strongly-typed model of your desired object using serialization attributes if you wanted (and which I would recommend), but I didn’t even bother with that for this demo. Model binding gave me a fully populated JObject, and that was “queryable” enough to get what I needed (remember that the NewtonSoft library is included in a default MVC project).
There’s no authentication here. I’m assuming you’ll implement basic auth (which Contentful supports for webhooks). If not that, Contentful can send custom HTTP headers, so just embed a password in one of those (or even in the querystring), and check that it matches before doing anything.
This uses the “delete and recreate” model discussed above. We just destroy the existing record (if it even exists; we actually don’t care), then create it again from scratch. This means that you might not be able to enforce your foreign keys (they can exist, just not be enforced). I maintain that it shouldn’t matter because you’re depending on Contentful to be the authoritative source of referential integrity. The SQL data is merely a facade to the real data – like a denormalized reporting database – so just trust that the source data is correct.
Clearly, this isn’t “enterprise ready.” You need some logging and error handling here. But I wanted to keep it simple, and, honestly, I’ve seen worse code get deployed to production. If this breaks, remember, your database is still just fine – you just can’t update it from Contentful.
There is no buffer like mentioned in the Sitecore example. If your website goes down, Contentful is going to get an error on the webhook call and you’re going to “lose” an event. You might want to stand this up as a separate endpoint from your website, so your website can go up and down as you deploy stuff, and this endpoint stays stable. (In that scenario, I might not even do this with MVC, but rather just as a single HttpHandler on an absolutely barebones IIS website. You could deploy nothing but a couple DLLs and a web.config.)
If you get “out of sync,” there’s no great solution other than just clearing the SQL tables and doing a mass unpublish and publish from Contentful. That would trigger a flurry of webhook calls that should repopulate the tables. You might want to build a small command line job that proactively repopulated the tables from scratch in these situations, but now we’re falling down the rabbit hole a bit. For a moderate amount of data in a non-critical scenario, mass unpublish and republish might be acceptable.
This example is about as simple as I can make it, and I realize you can tear it apart with potential issues and edge cases. But that’s for you to worry about – I can’t know what your particular situation calls for, I just wanted to show how simple I could make database snapshotting from a CMS.
You could set up a Contentful account and have content flowing into local SQL tables in minutes. Your developers could start working with that SQL database while remaining blissfully unaware that Contentful even exists. The challenge was to do it in a few lines of code as possible with no external dependencies. Mission accomplished.
But the specific platform doesn’t matter – we’ve done it once with Episerver too (which horrified them, BTW), and even did something like it with Ektron, back in the day. The principle itself is transcendent – take data that’s complicated to manage, and then snapshot it when it’s perfect into a format ideally suited for manipulation and delivery.
Throw away the selfies that don’t work. Only the perfect ones make it to Instagram.
I love this model. I love that content management can be about management again, and we’re abstracting away the datasource for presentation. To be sure, we’re going to lose a lot of functionality in the delivery tier that a full-stack CMS like Sitecore or Episerver offers, but sometimes you don’t need that, and we’re gaining a shocking amount of flexibility and ease of development in exchange.
Remember when we made websites from custom SQL databases? To my memory, spitting the content out on the page was never the problem. The headaches were mostly in the meta-work around managing the content in the database. If you could get the best of both worlds – the simplicity of templating without the headache of management – would that be a net positive?
I maintain that in many situations, it absolutely would be.