Content Archival

tags: archival

This document describes a proposed content archival service: what the product is, the role it would play, and how it might be marketed.

Prelude

I perversely love content migration, and specifically the scrubbing and reorganization of content. Give me a content management system (CMS) that’s made a mess of things and ask me to get the core of the content out, and I’m in heaven.

I love the stripping away of cruft. I love finding content in all the nooks and crannies of a CMS, cleaning it up, and putting it back in perfect order. I love this because it gets to the heart of what we do at Blend: we do content. We “do” CMS too, but really only as a proxy or tool for content.

The content is the point. Not the CMS. The CMS is just a means to an end. It’s a tool in service of a greater commodity – the content itself.

This week, I’ve been doing some content migration work for a non-profit out of DC. A big reason for the existence of this organization is to generate information products. They have 16,000 news articles, but these aren’t vanity items. These are the entire point. They monitor an industry segment for their members. This is their life’s work.

I loved working through this content. As I did, I saw “footprints” left by multiple CMSs over the years. Going back a decade, I counted at least three content creation tools that had mucked up the HTML. The organization moved from one tool to another, rolling over all their content each time. And each time, I had to assume that the tool they were using was the primary home for these words and images.

Over the years, the content was like baggage, thrown from one container to another. The CMS tail was wagging the content dog. The content was subservient to whatever platform happened to be managing it at the time.

At this moment, the exported content is almost perfect – sterile, even. Through a combination of parsing, lexing, and old-fashioned text pattern replacement, it’s lovely again. Each news article reads perfectly, with semantic markup that enhances it, not obscures it.

And now I have to cram it back into yet another CMS.

Dammit.

The Idea

While a WCMS is a great tool for creation and presentation, I can’t shake the idea that an organization should keep a copy of their content…somewhere else. Once created, content should be allowed to “come to rest” outside the CMS, in a format that is:

  - Open and vendor-neutral, carrying none of the CMS’s baggage
  - Owned and controlled by the organization itself
  - Readable and manipulable with common tools
  - Durable – once written, never changed or deleted

Therefore, here’s the idea: a content archival solution that continually maintains a separate copy of your content in accordance with the above principles.

This would be a software system in three parts.

  1. A Packager would run in the customer’s CMS environment. It would log changes to content, convert that content to an XML format, and transmit it to the Archiver. (A rough sketch of this half of the flow follows the list.)
  2. The Archiver would be our service running in the cloud. It would receive the XML from the Packager, log specific information from it, then place it into Storage.
  3. Storage would be any one of multiple cloud storage options. Azure Blob Storage or Amazon S3 would be the most common. A key point: this would be the customer’s own storage account. They would create an API key for the Archiver to use. Ideally, this API key would allow Read and Create access, but not Delete or Change – once archive files are written, they can neither be changed nor deleted.
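
To make the Packager’s half of that flow concrete, here’s a rough sketch in Python. Everything in it is an assumption for illustration: the field names, the ContentML element names, and the Archiver endpoint are placeholders, not a designed API.

    # Hypothetical Packager sketch: turn a CMS content change into ContentML
    # and hand it to the Archiver. Field names and the endpoint are placeholders.
    import urllib.request
    import xml.etree.ElementTree as ET

    def package(change):
        # Build a minimal ContentML document from a generic "content change" dict.
        root = ET.Element("content", id=str(change["id"]))
        ET.SubElement(root, "title").text = change["title"]
        ET.SubElement(root, "created").text = change["created"]  # ISO 8601 string
        ET.SubElement(root, "body").text = change["body"]        # cleaned-up HTML or text
        return ET.tostring(root, encoding="utf-8", xml_declaration=True)

    def transmit(xml_bytes):
        # POST the ContentML document to the (hypothetical) Archiver service.
        req = urllib.request.Request(
            "https://archiver.example.com/v1/objects",  # placeholder URL
            data=xml_bytes,
            headers={"Content-Type": "application/xml"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status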

Using this system, the content from a CMS would be archived into cloud storage as context-free XML, with no vendor baggage. The storage account would belong to the customer, so they would know that the content in it was theirs, forever, and can always be accessed and manipulated using common tools. If they want more direct storage, there are dozens of methods of syncing cloud storage to their local networks, or even to their local workstations.
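
On the storage side, the Archiver’s job per object could be as simple as a single write into the customer’s bucket. Here’s a minimal sketch, assuming S3 and boto3; the key layout is an assumption, and the “can’t delete or change” property would come from the customer’s own policy (e.g., granting the Archiver’s credentials put and get but not delete, with bucket versioning or S3 Object Lock available if they want stronger guarantees).

    # Hypothetical Archiver storage step: write one ContentML document into
    # the customer's own S3 bucket. The customer's policy for this key would
    # allow put and get but not delete.
    import boto3

    def store(xml_bytes, customer_bucket, object_id, version_stamp):
        s3 = boto3.client("s3")  # credentials provisioned by the customer
        key = f"contentml/{object_id}/{version_stamp}.xml"  # assumed key layout
        s3.put_object(
            Bucket=customer_bucket,
            Key=key,
            Body=xml_bytes,
            ContentType="application/xml",
        )
        return key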

The Packager could run on a schedule, or in real time in response to events, depending on the organization’s concern for latency.

The XML format would be built around what’s common to the concept of content in any CMS. This would require us to ignore specific idiosyncrasies from one CMS to another – the goal is not to provide a perfect serialization of the state of any CMS, but to abstract content from the CMS itself, and store it as “content,” not “Episerver content” or “Drupal content.”

At the base, “ContentML” looks like this:
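
(What follows is a minimal sketch, not a worked-out schema – the element names are placeholders, and the name-spaced element at the end is only there to show where customer-specific data could hang.)

    <content id="12345">
      <title>Example article title</title>
      <created>2015-06-01T14:30:00Z</created>
      <modified>2015-06-02T09:00:00Z</modified>
      <author>Example Author</author>
      <uri>https://www.example.org/news/12345</uri>
      <body><![CDATA[
        <p>The cleaned-up body of the article, as semantic HTML.</p>
      ]]></body>
      <!-- optional, name-spaced data a customer might attach via the Packager -->
      <ext:legacy-id xmlns:ext="https://example.org/contentml-ext">A-0042</ext:legacy-id>
    </content>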

And – to start, at least – that’s about it.

With this, we could capture the core essence of the content out of a system. Again, this is not backup. It is not meant to be restored. It’s not concerned with the idiosyncrasies of any one system. This is pure content. What we capture above would transcend any specific system. It is an archival format (think PDF/A).

As noted above, an organization might want to hook the Packager and add some implementation-specific (name-spaced) information, but that’s up to them. Theoretically, they could attach any information they wanted to an object before storage.

In my head, I have a few use cases:

The business model would be subscription-based. The Archiver is the proprietary piece that receives content from the Packager and places it in Storage. It sits in the middle, moving content between the CMS (controlled by the organization) and the Storage (also controlled by the organization).

If an organization wanted to be completely hands-off, we could manage the Storage for them, but our preference is that they own it. If so, then the perpetual question of “what happens to our stuff if you go out of business” no longer has any power. If they cancel the account, or revoke the API key, all that happens is that their archives are no longer updated. What’s there, stays there, and it would consist of nothing but straightforward XML.

The Archiver would be transactional, only logging information necessary to power a slim management interface which allows organizations to browse and search their stored archives. The Archiver would issue reports via email or RSS on its activities, so a content manager could be notified of how many objects were archived on any day/week/month. An optional service might screencap a page if an archived object has a URI.

There would be a limited data export API for the organization. They could programmatically query a specific object, or get a mass download of the archive as of a specific date. A webhook system would ping a URL in response to various events, even sending related content objects with the request.
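
As a sketch of what that export side might feel like (the endpoint, parameters, and auth scheme here are placeholders, not a designed API):

    # Hypothetical export-API call: fetch the archived ContentML for one object.
    import urllib.request

    def fetch_object(object_id, api_key):
        req = urllib.request.Request(
            f"https://archiver.example.com/v1/objects/{object_id}",  # placeholder
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # raw ContentML XML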

(Note that this API is at the Archiver. To continue a common theme, the content is in their own storage. They can do whatever they want to it directly, or by using any tool that integrates with AWS or Azure.)

The original CMS might have access to a small gadget or UI add-on to show editors the “archival status” of a particular content object, but we would avoid building extended functionality into the CMS. Cognitively, we want editors to go “somewhere else” for their archives. Part of the mental model is that this is separate from the CMS by design. Archive users might not be the same as CMS users.

ContentML would be promoted as an open standard. It would be a promotional tool for the subscription product. We would openly document this to a level that a customer could cancel the service and still make perfect sense of the XML archive that has been created to that point.

(We need to do some more research here, as there might be something which already fits. If not, we’d likely try to leverage an API standard like JCR or CMIS, by providing some XML representation of it.)

We’d start with an Episerver plugin, given that this is where most of our experience lies. Next up would likely be WordPress – a knowledgeable friend has told me that lots of large organizations have a bunch of content in WordPress and “they’re all freaked out about it.” Next would be Drupal, and on from there.

So, that’s the idea.

Your CMS is disposable.

Your content is eternal.

We need to start treating it this way.