I have two teenage daughters. They have a lot of skills, but none so highly developed as the selfie. They generate spectacular selfies. But, here’s the thing – for each selfie that makes it onto Instagram, they throw away, like, 10 of them. Smile, flash, review, frown, delete…and repeat.
Digital photography means we can just do trial run after trial run, and throw things we don’t like away until we have something we like. Then we have this moment, frozen in time, that looks perfect and hides the 10 that didn’t work, plus all the other really annoying parts of life like homework that isn’t done and the fact that we haven’t vacuumed the stairs like our father asked us to do three days ago (ahem).
Photography is like this. It’s a single, perfect, and often unrealistic and idealized moment in time.
Why am I talking about this?
Because this is kind of the same principle behind decoupled content management. In that paradigm, content is managed in one environment. It’s labored over until it’s perfect, then it’s snapshotted (“selfied”) and pushed out to the delivery tier (“Instagram”) and we nonchalantly pretend like our content just springs forth that perfect all the time.
I enjoy snapshotting content like this. It hides all the messiness. Static site generators are based on this principle. Do crazy, arcane things to build your site, but then just snap a picture of it in its most perfect state, and publish that to the world. No one ever has to know the ridiculous lengths you might have employed to get it there.
The problem with static generators is that flat HTML files are pretty inflexible, so it’s sometimes easier to snapshot to a dynamic datasource – one you can manipulate and query in real-time, like a relational database. I’ve implemented this model a few times. It uses a CMS architecture to manage the content, then just pushes it into a set of database tables.
The database itself has no idea where the content came from, and developers using the content don’t have to know either. They just get this wonderfully neutral datasource, around which there’s probably the deepest toolset in our industry. A relational database is almost universally accessible and usable. Indeed, the first dynamic websites any of us built were probably just spitting database records out onto a page.
I have two examples of this, at opposite extremes. The first is about as “enterprisey” as I’ve ever done it. It was done for a client and was deployed to production. The second is a counterpoint and is literally as simple as I can make it, using Contentful and their webhook architecture. I’ve provided the code for this which should integrate into any project pretty easily.
The Enterprise Case Study: Sitecore
A client had all their content in a custom SQL Server database. Their marketing department wanted to migrate to Sitecore. Unfortunately, they had many processes tied to this database – for example, it was behind a large REST API from which their mobile app was powered – and they didn’t want to re-write all of that.
We came up with a plan to populate this database from Sitecore. It worked fairly well. We sort of fell down the rabbit hole with this one, and had to layer more and more functionality on top of it as more and more edge cases came up, but when we left it, they were stable.
The database was highly relational – lots of foreign keys. Any given record could be tied to a half-dozen other records. This made things complicated because a common pattern with database snapshots is that you simply delete and re-insert rows. If you accept that the CMS is the single, authoritative source of your content, then when a record is changed, just delete it and re-create it in the external database. However, with foreign keys, this can be problematic. If a keyed record changes, you can’t delete it without violating a bunch of keys, so you have to update it in place.
Additionally, volume was an issue. There were about 100,000 records going from Sitecore to SQL. Sitecore is pseudo-decoupled in that you’re moving content from one database to another, and you can do this en masse. You can re-publish the root node of the tree, and thus instigate a furious storm of database traffic. It’s quite easy to initiate a hundred thousand database queries with a single click of the mouse, and if something falls over during this process, all hell can break loose.
What we ended up doing was this:
The database mappings were stored directly in the content model. In Sitecore, everything is an “Item” which can be wonderfully abstract and recursive. I could extend the each Template (content type) with an extra Field that specified what SQL table it mapped to (we only did this if the table name was different than the Template name; if this was empty, we just used the Template name). I could even add a Field to a Field description to specify the column it mapped to (again, only if the column name was different than the Field name).
We hooked the Sitecore “Publish” pipeline. Sitecore’s pipelining system is quite well-done (it’s kind of an alternative event handling system), and we could easily inject our own code execution at a specific point in the publishing process. We did this asynchronously, so the processing that follows happened in a background thread. Execution would branch here, and Sitecore could continue on unimpeded.
We had an Item serializer that turned a Sitecore Item into XML. It serialized actual Fields, and also Fields that didn’t exist, like the URL and the parent Item ID. The system was extensible, so you could create or retrieve any data from anywhere and pretend it was a Field (ex: currentTempInMoscow
). This was necessary because some things we needed in the external database (like the URL to the item) didn’t exist as Fields. These things were produced by method and property calls to the Sitecore API, so we had to “ask” Sitecore for them while we were still in that context, then we serialized the answer as if it was a simple Field.
Each Field type had a specific serializer which formatted data in that type of Field to the correct format to get it into SQL. Additionally, we could have custom serializers for specific Template/Field combinations, so you could run custom filtering/formatting code on a specific Field on a specific Template to “prep” it for transport.
After serialization, we ended up with a perfectly neutral XML document containing all necessary data. It had no traces of Sitecore in it – it was perfectly self-contained and “detached from the mothership.”
We sent this XML document to Microsoft Messaging Queue (the concept of a “target” was abstracted – we could have sent it to any number of things, MSMQ just being one of them). We wanted a buffer or “holding pen” between Sitecore and SQL. If SQL went down for some reason, we didn’t want Sitecore to break, or to lose updates. MSMQ is as old as dirt and rock-solid stable, so it would just sit around, accepting messages from Sitecore, storing them in the correct chronological order of receipt until some process asked for them. Additionally, MSMQ is ridiculously fast, and there’s no way Sitecore could overwhelm it. (I’ll discuss volume later, but suffice it to say that performance mattered.)
We created a Windows Service that processed the messages from MSMQ. This service knew nothing about Sitecore, it just maintained an open connection to MSMQ and processed anything that came down the pipe (technically, a process other than Sitecore could have dropped a properly-formatted message in MSMQ as well). The service took the XML document and updated the corresponding tables in the SQL database. The foreign key issues got a little complicated, so in some cases, we serialized child objects with the parent as nested XML, then updated the SQL from the “inside out.”
Clearly, we ended up abstracting this process to quite a few levels. To this day, I wonder if we abstracted a level too far.
The basic path was this:
Sitecore
XML (via a custom serializer)
MSMQ (this was one target; technically we could have branched here into multiple targets)
Windows Service
SQL Server Database
For each level of abstraction, I can provide a justification…but just barely, and I concede that someone smarter than me might disagree.
The system worked well. For single content item updates, the process took milliseconds and was overkill. But it was in its glory on mass republishes.
You could republish the root node of Sitecore, and tens of thousands of records would explode through the process. Watching MSMQ could be a little breathtaking, as thousands and thousands of messages would instantly start queueing up. If I was running the Windows service from the console, log messages would start raining down the console window, Matrix-style, for three or four minutes. The cessation of activity would cascade – Sitecore would stop publishing; shortly thereafter the background thread would send the last document to MSMQ; sometime after that, the Windows service would process the last message out of MSMQ.
We built-in end-to-end logging. During the debugging, we often had to peel back the layers and watch the system work. We needed logs of (1) the formation of the XML doc, (2) the delivery to MSMQ, (3) the pickup by the Windows service, and (4) the SQL to update the record(s). We got it to the point where we could “follow” a record from start to finish, and we used these logs extensively.
(Lesson: don’t retrofit logging when you need it. Build copious logging in from the start and ideally create a consolidated logging architecture to which every part of the process is writing so the lifespan of a record can be seen from birth to death.)
This particular client engagement ended, and they took over responsibility for developing the system. Consequently, I can’t look back on years of history here to gauge long-term stability, but when we left the system, it was functioning as designed.