Content Archival

tags: archival

This document describes a proposed content archival service: what the product is, the role it would play, and how it might be marketed.

Prelude

I perversely love content migration, and specifically the scrubbing and reorganization of content. Give me a content management system (CMS) that’s made a mess of things and ask me to get the core of the content out, and I’m in heaven.

I love the stripping away of cruft. I love finding content in all the nooks and crannies of a CMS, cleaning it up, and putting it back in perfect order. I love this because it gets to the heart of what we do at Blend: we do content. We “do” CMS too, but really only as a proxy or tool for content.

The content is the point. Not the CMS. The CMS is just a means to an end. It’s a tool in service of a greater commodity – the content itself.

This week, I’ve been doing some content migration work for a non-profit out of DC. A big reason for the existence of this organization is to generate information products. They have 16,000 news articles, but these aren’t vanity items. These are the entire point. They monitor an industry segment for their members. This is their life’s work.

I loved working through this content. As I did, I saw “footprints” left by multiple CMSs over the years. Going back a decade, I counted at least three content creation tools that had mucked up the HTML. The organization moved from one tool to another, rolling over all their content each time. And each time, I had to assume that the tool they were using was the primary home for these words and images.

Over the years, the content was like baggage, thrown from one container to another. The CMS tail was wagging the content dog. The content was subservient to whatever platform happened to be managing it at the time.

At this moment, the exported content is almost perfect – sterile, even. Through a combination of parsing, lexing, and old-fashioned text pattern replacement, it’s lovely again. Each news article reads perfectly, with semantic markup that enhances it, not obscures it.

And now I have to cram it back into yet another CMS.

Dammit.

The Idea

While a WCMS is a great tool for creation and presentation, I can’t shake the idea that an organization should keep a copy of their content…somewhere else. Once created, content should be allowed to “come to rest” outside the CMS, in a format that is:

  - Open and vendor-neutral, carrying none of the CMS’s baggage
  - Owned and controlled by the organization itself
  - Readable and manipulable with common tools
  - Durable – once written, never changed or deleted

Therefore, here’s the idea: a content archival solution that continually maintains a separate copy of your content in accordance with the above principles.

This would be a software system in three parts.

  1. A Packager would run in the customer’s CMS environment. It would log changes to content, convert that content to an XML format, and transmit it to the Archiver. (A rough sketch of this half of the flow follows the list.)
  2. The Archiver would be our service running in the cloud. It would receive the XML from the Packager, log specific information from it, then place it into Storage.
  3. Storage would be any one of multiple cloud storage options. Azure Blob Storage or Amazon S3 would be the most common. A key point: this would be the customer’s own storage account. They would create an API key for the Archiver to use. Ideally, this API key would allow Read and Create access, but not Delete or Change – once archive files are written, they can neither be changed nor deleted.
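
To make the Packager’s half of that flow concrete, here’s a rough sketch in Python. Everything in it is an assumption for illustration: the field names, the ContentML element names, and the Archiver endpoint are placeholders, not a designed API.

    # Hypothetical Packager sketch: turn a CMS content change into ContentML
    # and hand it to the Archiver. Field names and the endpoint are placeholders.
    import urllib.request
    import xml.etree.ElementTree as ET

    def package(change):
        # Build a minimal ContentML document from a generic "content change" dict.
        root = ET.Element("content", id=str(change["id"]))
        ET.SubElement(root, "title").text = change["title"]
        ET.SubElement(root, "created").text = change["created"]  # ISO 8601 string
        ET.SubElement(root, "body").text = change["body"]        # cleaned-up HTML or text
        return ET.tostring(root, encoding="utf-8", xml_declaration=True)

    def transmit(xml_bytes):
        # POST the ContentML document to the (hypothetical) Archiver service.
        req = urllib.request.Request(
            "https://archiver.example.com/v1/objects",  # placeholder URL
            data=xml_bytes,
            headers={"Content-Type": "application/xml"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status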

Using this system, the content from a CMS would be archived into cloud storage as context-free XML, with no vendor baggage. The storage account would belong to the customer, so they would know that the content in it was theirs, forever, and can always be accessed and manipulated using common tools. If they want more direct storage, there are dozens of methods of syncing cloud storage to their local networks, or even to their local workstations.
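
On the storage side, the Archiver’s job per object could be as simple as a single write into the customer’s bucket. Here’s a minimal sketch, assuming S3 and boto3; the key layout is an assumption, and the “can’t delete or change” property would come from the customer’s own policy (e.g., granting the Archiver’s credentials put and get but not delete, with bucket versioning or S3 Object Lock available if they want stronger guarantees).

    # Hypothetical Archiver storage step: write one ContentML document into
    # the customer's own S3 bucket. The customer's policy for this key would
    # allow put and get but not delete.
    import boto3

    def store(xml_bytes, customer_bucket, object_id, version_stamp):
        s3 = boto3.client("s3")  # credentials provisioned by the customer
        key = f"contentml/{object_id}/{version_stamp}.xml"  # assumed key layout
        s3.put_object(
            Bucket=customer_bucket,
            Key=key,
            Body=xml_bytes,
            ContentType="application/xml",
        )
        return key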

The Packager could run on a schedule, or in real time in response to events, depending on the organization’s concern for latency.

The XML format would be built around what’s common to the concept of content in any CMS. This would require us to ignore specific idiosyncrasies from one CMS to another – the goal is not to provide a perfect serialization of the state of any CMS, but to abstract content from the CMS itself, and store it as “content,” not “Episerver content” or “Drupal content.”

At the base, “ContentML” looks like this:
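
(What follows is a minimal sketch, not a worked-out schema – the element names are placeholders, and the name-spaced element at the end is only there to show where customer-specific data could hang.)

    <content id="12345">
      <title>Example article title</title>
      <created>2015-06-01T14:30:00Z</created>
      <modified>2015-06-02T09:00:00Z</modified>
      <author>Example Author</author>
      <uri>https://www.example.org/news/12345</uri>
      <body><![CDATA[
        <p>The cleaned-up body of the article, as semantic HTML.</p>
      ]]></body>
      <!-- optional, name-spaced data a customer might attach via the Packager -->
      <ext:legacy-id xmlns:ext="https://example.org/contentml-ext">A-0042</ext:legacy-id>
    </content>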

And – to start, at least – that’s about it.

With this, we could capture the core essence of the content out of a system. Again, this is not backup. It is not meant to be restored. It’s not concerned with the idiosyncrasies of any one system. This is pure content. What we capture above would transcend any specific system. It is an archival format (think PDF/A).

As noted above, an organization might want to hook the Packager and add some implementation-specific (name-spaced) information, but that’s up to them. Theoretically, they could attach any information they wanted to an object before storage.

In my head, I have a few use cases:

The business model would be subscription-based. The Archiver is the proprietary piece that receives content from the Packager and places it in Storage. It sits in the middle, moving content between the CMS (controlled by the organization) and the Storage (also controlled by the organization).

If an organization wanted to be completely hands-off, we could manage the Storage for them, but our preference is that they own it. If so, then the perpetual question of “what happens to our stuff if you go out of business” no longer has any power. If they cancel the account, or revoke the API key, all that happens is that their archives are no longer updated. What’s there, stays there, and it would consist of nothing but straightforward XML.

The Archiver would be transactional, only logging information necessary to power a slim management interface which allows organizations to browse and search their stored archives. The Archiver would issue reports via email or RSS on its activities, so a content manager could be notified of how many objects were archived on any day/week/month. An optional service might screencap a page if an archived object has a URI.

There would be a limited data export API for the organization. They could programmatically query a specific object, or get a mass download of the archive as of a specific date. A webhook system would ping a URL in response to various events, even sending related content objects with the request.
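
As a sketch of what that export side might feel like (the endpoint, parameters, and auth scheme here are placeholders, not a designed API):

    # Hypothetical export-API call: fetch the archived ContentML for one object.
    import urllib.request

    def fetch_object(object_id, api_key):
        req = urllib.request.Request(
            f"https://archiver.example.com/v1/objects/{object_id}",  # placeholder
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # raw ContentML XML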

(Note that this API is at the Archiver. To continue a common theme, the content is in their own storage. They can do whatever they want to it directly, or by using any tool that integrates with AWS or Azure.)

The original CMS might have access to a small gadget or UI add-on to show editors the “archival status” of a particular content object, but we would avoid building extended functionality into the CMS. Cognitively, we want editors to go “somewhere else” for their archives. Part of the mental model is that this is separate from the CMS by design. Archive users might not be the same as CMS users.

ContentML would be promoted as an open standard. It would be a promotional tool for the subscription product. We would openly document this to a level that a customer could cancel the service and still make perfect sense of the XML archive that has been created to that point.

(We need to do some more research here, as there might be something which already fits. If not, we’d likely try to leverage an API standard like JCR or CMIS, by providing some XML representation of it.)

We’d start with an Episerver plugin, given that this is where most of our experience lies. Next up would likely be WordPress – a knowledgeable friend has told me that lots of large organizations have a bunch of content in WordPress and “they’re all freaked out about it.” Next would be Drupal, and on from there.

So, that’s the idea.

Your CMS is disposable.

Your content is eternal.

We need to start treating it this way.