The “Import and Update” Pattern

By Deane Barker • November 12, 2014

Often you need to import AND update content, rather than simply importing it. Supporting both makes content integration much easier.

Almost every CMS supports content import to some extent. There’s always an API and often a web service for you to fire content into a system from the outside.

But a model we see over and over that really needs to be explicitly acknowledged is that of “import and update.” This means: create new content if it doesn’t exist, but update the content in place if it was previously created. It supports cases where we’re syncing information stored inside the CMS with information stored outside the CMS.

For example, let’s say our hospital maintains its physician profiles in a separate database (for whatever reason). However, we need our physicians to have managed content objects inside the CMS, for a variety of reasons (for a list of why this is handy, see my post on proxy objects in CMS).

We can easily write a job to import our physician profiles, but what happens when they are updated in the source database? We don’t want to import them again. We just want to update the page inside the CMS. Sure, we could delete it and recreate it, but that becomes problematic when it might change the URL, or increment a set of ID numbers, or even delete information in the CMS that references that specific content object (analytics, for example).

Episerver has a “Content Channel” architecture that handles this. You fire a dictionary of key-value pairs (representing content properties and their values) at a web service. You can optionally include the GUID of an existing content object. With no GUID, Episerver creates a new object. With a GUID, it finds the corresponding page and updates it with the incoming information. It essentially keeps the content object shell, but overwrites all the information in it.
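
To make the create-or-overwrite behavior concrete, here is a minimal sketch in Python. It is not Episerver’s API: `ContentStore` and `receive` are hypothetical names, and the “CMS” is just an in-memory dictionary. It only illustrates the rule described above, where no GUID means create and a GUID means overwrite in place.

```python
import uuid


class ContentStore:
    """In-memory stand-in for a CMS repository (hypothetical, not a real CMS API)."""

    def __init__(self):
        self._objects = {}  # GUID string -> dict of content properties

    def receive(self, properties, guid=None):
        """Create when no GUID is supplied; otherwise overwrite the existing object."""
        if guid is None:
            # No GUID: this is new content, so give it a new identity.
            guid = str(uuid.uuid4())
        elif guid not in self._objects:
            raise KeyError(f"no content object with GUID {guid}")
        # Keep the shell (the GUID), replace every property inside it.
        self._objects[guid] = dict(properties)
        return guid

    def get(self, guid):
        return dict(self._objects[guid])


store = ContentStore()
guid = store.receive({"name": "Dr. Smith", "specialty": "Cardiology"})
same_guid = store.receive({"name": "Dr. Smith", "specialty": "Oncology"}, guid=guid)
```

The second call returns the same GUID it was given, so anything keyed to that identity (a URL, analytics) would survive the update in a system built this way.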

With any system like this, you need to maintain a mapping between the ID outside the CMS and the ID inside the CMS. You need to know that Database Record #654 is updating Content ID #492. When iterating your database rows, when you run across ID #654, you know to reference ID #492 when talking to the CMS. You also need to be able to get the newly-created ID back out of the CMS when content is created, so you can create a mapping for it. If my CMS creates Content ID #732, I need to know this so I can reference it later.
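
Here is a rough sketch of what that sync job might look like, assuming a CMS client whose `create` call returns the new content ID. `FakeCms`, `sync`, and `id_map` are all made-up names for illustration, and a real job would persist the mapping (in a database table, say) rather than hold it in memory.

```python
import itertools


class FakeCms:
    """Hypothetical CMS client; create() hands back the newly created content ID."""

    def __init__(self, first_id=1):
        self._ids = itertools.count(first_id)
        self.content = {}  # content ID -> dict of properties

    def create(self, properties):
        content_id = next(self._ids)
        self.content[content_id] = dict(properties)
        return content_id

    def update(self, content_id, properties):
        self.content[content_id] = dict(properties)


def sync(rows, cms, id_map):
    """Import new rows and update known ones. id_map: source ID -> CMS content ID."""
    for row in rows:
        source_id = row["id"]
        properties = {k: v for k, v in row.items() if k != "id"}
        if source_id in id_map:
            # Seen before: update the existing content object in place.
            cms.update(id_map[source_id], properties)
        else:
            # First time: create it and remember the ID the CMS assigned.
            id_map[source_id] = cms.create(properties)
    return id_map


cms = FakeCms(first_id=492)
id_map = {}
sync([{"id": 654, "name": "Dr. Jones"}], cms, id_map)
sync([{"id": 654, "name": "Dr. Jones, MD"}, {"id": 655, "name": "Dr. Lee"}], cms, id_map)
```

After the second run, record #654 still points at Content ID #492 with its updated name, and record #655 has picked up a mapping of its own.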

Some CMS offer “content provider” models, which are real-time methods to “mount” other repositories. So, instead of importing and updating this data, the CMS reaches out to our external database in real-time when required to get objects back and mock them up as first-order content objects.

This is certainly elegant and sophisticated, but it presents problems with performance, uptime of the source system, unnecessary computational overhead if the content doesn’t change much, network topology and unbroken connectivity, and the inability to extend the content with new data inside the CMS (for instance, while 90% of the information about our physicians comes from the external database, perhaps we have a couple of properties that live inside the CMS only).
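
With import and update, by contrast, those CMS-only properties are easy to accommodate: the update just has to leave them alone. A minimal sketch, assuming content properties are a plain dictionary and that we know which keys live only in the CMS (`apply_import`, `cms_only`, and the property names are hypothetical):

```python
def apply_import(existing, incoming, cms_only=frozenset()):
    """Return the updated properties: incoming values win, except for CMS-only keys."""
    # Start from the properties that live only in the CMS.
    updated = {k: v for k, v in existing.items() if k in cms_only}
    # Overwrite everything else with what the external source sent.
    updated.update({k: v for k, v in incoming.items() if k not in cms_only})
    return updated


page = {
    "name": "Dr. Smith",
    "specialty": "Cardiology",
    "featured_quote": "Care comes first.",  # lives inside the CMS only
}
incoming = {"name": "Dr. Smith", "specialty": "Oncology"}
page = apply_import(page, incoming, cms_only={"featured_quote"})
```

Note that this sketch keeps the overwrite semantics for everything the external source owns: an externally owned property that is missing from the incoming data is dropped, not preserved.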

I hope to see this pattern more often. Episerver has it, eZ publish has it, and I’m sure many others do too. It’s also not hard to build. If you can put together a web service, you should be able to pull it off.

It’s a handy thing to have.