Towards a Content Modeling Standard

By Deane Barker • April 24, 2019 • 8 min read

This industry would benefit greatly if only we could agree on how content is modeled.

Note

When I started writing this post a year ago, it was about headless. Then I realized that my basic point was universal, and didn’t depend on coupling model. Therefore, I changed the title so it didn’t refer to headless, but the rest of the post still seems headless-centric. If you don’t care about headless, that’s cool – just bear with me, thanks.

We (I?) talk incessantly about headless CMS. But not many people are talking about the “head.” Headless is obsessed with how content is managed but ambivalent about how it’s delivered. It seems that delivery is always custom, and this is largely the selling point.

But does it always have to be custom? Should it always be custom? What’s stopping us from making it…less custom?

But, first, a quick recap –

Headless CMS is a content management system without a publishing infrastructure. Content is “published” by simply making it available through an API (though, usually, by pushing it into a CDN), and some channel – a website, a mobile app, etc. – proactively retrieves the content for display.

(Also, see the definition of coupling models over at the Web Content Management Glossary.)

What are the majority use cases of headless? I wrote about this at length a couple years ago. In that post, I made the point that headless in WCM is still very much a question mark:

What might be surprising is that we more or less agreed on this: the use case of a single repository of information feeding a single website is not a great value-add for a headless CMS.

The “we” in that post is me and some of the folks from Contentful, who I visited in Berlin. In that afternoon, and in that post, we talked about a lot of different things you can do with a headless CMS, but running a single website wasn’t really one of them.

Can you run a website from a headless CMS? Of course. And, as I pointed out in a CMSWire article last year, headless CMS is something of a return to how we were all building sites at the turn of the century. Back then, we all built “headless” CMSs – systems that managed content using one set of logic, and displayed it using another.

Fourteen years ago, I wrote this:

Now, lets put the 50-yard line of this game at the database. So the creation, management, approval, and general administration of content all leads up to one moment – when a certain content record in a database table is declared “active.” Everything is working up to that point. The “active” records in the database table are free to be used on the public side of the site.

Why not just create a view of the database that only includes those records, then give your designers and presentation specialists a read-only user and a copy of ColdFusion? Who says that the language the CMS is programmed in has to be the language the content is presented in?

Today, that seems a little prescient.

Headless is all about administration and management. It actively eschews presentation. It does not care what you do with the content it manages.

So…who does care?

Put another way, what is the “head” in headless, when it comes to WCM? What is the counterpart to the headless system that manages the content?

To be pedantic, we could just say “website,” and this isn’t wrong. But just like we created content management systems to standardize the common patterns of managing content, will someone create a “content delivery system” to standardize the delivery of content?

Yes, there are any number of web frameworks to build a website on, but they’re just tools that neither know nor care what you’re writing. The line between content-based website and transactional web application gets pretty slippery.

Speaking of content sites in particular, there are patterns. Meaning, there are things we have to do over and over again, and these are patterns that traditional web CMSs have organically grown around. Things like:

Global page elements
Hierarchical navigation
Positionable content elements
Personalized content delivery
Full-text search
Object-level permissions

When will someone standardize these elements into a framework that can be “backed” by a headless CMS?

(True story: Blend is sort of stuck on a project to create a “base site” profile, meaning we’re trying to codify the basic, foundational elements that go in every site we end up building. It’s harder than you think, but work continues.)

Here’s an example of how this might play out at a business level.

Some enterprising front-end dev might look deeply at Kentico Cloud and think,

“You know, I could build a front-end for this. With some conventions and a little training, I could create a system that would give you a packaged website, backed by a Kentico Cloud instance.”

And then, later, when this has become a raging success –

“Man, there are a lot of Contentful users. Maybe we should create a version of the front-end to be backed by Contentful.”

Even later –

“Okay, now I have versions for Kentico Cloud, Contentful, Directus, Prismic.io. And for the traditional systems that have remote APIs, like Sitecore, Episerver, Drupal, etc. And also for the ECM systems like Alfresco and Nuxeo.”

And then this person dies from exhaustion because they’re updating all these front-ends all the time.

The point: before we could standardize a front-end (a “head”), we would need to standardize the APIs. All of these headless systems would either need to agree on an API, or there would need to be a thriving ecosystem of API abstractions for them. This way, your front-end could make generic calls into the void and not care who responds to them.
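
To make the "API abstraction" idea concrete, here is a minimal sketch, not a real integration. Everything in it is an assumption for illustration: the `ContentRepository` contract, the two adapters, and the response shapes they unwrap are all made up and don't correspond to any actual vendor's API.

```python
from typing import Protocol


class ContentRepository(Protocol):
    """Hypothetical generic contract that a front-end would code against."""

    def get(self, content_id: str) -> dict: ...


class VendorAAdapter:
    """Wraps a made-up vendor that nests metadata under 'sys' and data under 'fields'."""

    def __init__(self, raw_items: dict):
        self._raw = raw_items

    def get(self, content_id: str) -> dict:
        item = self._raw[content_id]
        return {"id": item["sys"]["id"], "type": item["sys"]["type"], **item["fields"]}


class VendorBAdapter:
    """Wraps a made-up vendor that returns flat records with '_'-prefixed metadata."""

    def __init__(self, raw_items: dict):
        self._raw = raw_items

    def get(self, content_id: str) -> dict:
        item = self._raw[content_id]
        fields = {k: v for k, v in item.items() if not k.startswith("_")}
        return {"id": item["_id"], "type": item["_kind"], **fields}


def render_title(repo: ContentRepository, content_id: str) -> str:
    """The 'head': it makes a generic call and does not care who answers."""
    item = repo.get(content_id)
    return f"<h1>{item['title']}</h1>"


# Two differently shaped back ends holding the same article.
vendor_a = VendorAAdapter(
    {"42": {"sys": {"id": "42", "type": "article"}, "fields": {"title": "Hello"}}}
)
vendor_b = VendorBAdapter({"42": {"_id": "42", "_kind": "article", "title": "Hello"}})

print(render_title(vendor_a, "42"))
print(render_title(vendor_b, "42"))
```

Note that the adapters only work because both sides already agree on what an "item" with an "id," a "type," and a "title" is. That agreement is the hard part, and it's where the rest of this post is headed.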

And when I say “standardize an API,” I don’t mean to standardize the bindings or the protocols. So, not like REST, for example. And I don’t mean to standardize the discovery, like WSDL or Swagger.

I mean that we need to abstract and standardize the very idea of content. We need to come up with a common lens with which to view content types, content objects, properties, datatypes, values, and relationships in the ways they relate to WCM.
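
As a rough illustration of what such a common lens might look like, here is a sketch of that vocabulary as plain data structures. The class names, the field names, and the datatype strings are my assumptions for the sake of the example; no existing standard defines them this way.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class AttributeDef:
    """One property of a content type (hypothetical shape)."""

    name: str
    datatype: str                      # e.g. "string", "integer", "reference"
    repeats: bool = False              # may the attribute hold several values?
    target_type: Optional[str] = None  # for references: which type may be pointed at


@dataclass(frozen=True)
class ContentType:
    """A structured content type: a name plus its attribute definitions."""

    name: str
    attributes: Tuple[AttributeDef, ...]


@dataclass
class ContentObject:
    """A single piece of content: an id, its type, and attribute values."""

    id: str
    type: ContentType
    values: Dict[str, Any] = field(default_factory=dict)


author_type = ContentType("author", (AttributeDef("name", "string"),))
article_type = ContentType(
    "article",
    (
        AttributeDef("title", "string"),
        AttributeDef("tags", "string", repeats=True),
        AttributeDef("author", "reference", target_type="author"),
    ),
)

jane = ContentObject("a1", author_type, {"name": "Jane"})
# The relationship is expressed as a reference to another object's id.
post = ContentObject(
    "p1", article_type, {"title": "Hello", "tags": ["cms"], "author": jane.id}
)
```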

This isn’t to say we haven’t tried. There are two major content-centric API systems in (somewhat) common use:

Content Repository API for Java (“the JCR”). This was developed at Day Software for their CQ product (now Adobe Experience Manager), then spun off as a standard (JSR 170). It’s popular in the Java WCM space (AEM, of course, is built on it, as is Magnolia). Clearly, it’s Java-centric, but does have a PHP port (but no .NET port, to my eternal temptation).
Content Management Interoperability Services (CMIS). This is an open standard started by the Association for Intelligent Information Management (AIIM) and since approved by OASIS. It’s very well-accepted in the enterprise content management space.

Now, I don’t claim to be an expert in either of these, but my understanding is that they’re not competing standards. They have specific angles and can work together productively.

When will something like API standardization come to the headless space, to define the interactions with a WCM “head”? I think the next logical step in the slow uncoupling of content management and delivery is the commoditization of a delivery layer. There are a ton of front-end ninjas floating around out there that are probably just dying to build something, but the headless market is so red ocean right now that some people don’t know which horse to back.

I think the market might be waking up to this. I had a demo of GatsbyJS a while back, which purports to be the front-end for a headless website. It’s a set of JS libraries around React. Combined with static site generation, this provides a presentation layer that actually doesn’t rely on JavaScript and has a static version for all the content. (During the demo, my partner was browsing their docs. For giggles, he shut off JavaScript. It worked exactly the same, just a little slower.)

On the subject of standardization of APIs, Sam Bhagwat from GatsbyJS had this to say:

The tricky part here, of course, is that CMS-es don’t want to be commoditized any more than newspaper publishers do (hence all the controversy around AMP!). And publishing your data in a standard format is basically screaming “I’m a commodity!” So every CMS will output different schemas, and probably will for a while.

Andrea Schauerhuber from Gentics Mesh, based in Vienna, mentioned that perhaps we don’t need to standardize, given the flexibility of one of the common API protocols:

[…] with GraphQL in the game I think that today it is very easy to offer APIs that are extremely flexible. Developers basically are able to tailor the requests to their need. […] with GraphQL it has become so simple to “model” the data response in a way that your app can process it easily.

(Andrea also wanted to point out that Gentics Mesh supports content trees, in defiance of my rant above…)

From the traditional (coupled) CMS vendor side, I’ve recently become aware of JavaScript Services for Sitecore, which is an attempt by Sitecore to provide a front-end for a headless implementation of their system. JSS is a set of libraries that will bind components directly to Sitecore objects and do all the plumbing behind the scenes.

Let’s look at The Headless Equation™ as having three layers:

Repository
API
Delivery Channel (the “head”)

In this model, the title of this post is getting ahead of itself. We can’t standardize the head until we standardize the API. And we can’t standardize the API until we standardize how repositories conceptualize content models. For us to make progress here, we’re going to need to agree on a set of concepts around which we can slap any API (REST, GraphQL, SOAP, whatever).

Some of the basic content modeling questions we need to resolve to create The One Standard™ (some are obvious; others, less so) –

Are there structured content types?
Do those types have attributes?
What base data types can we agree on?
What description framework can we put around custom types?
Can attributes repeat?
How are empty attributes/null indicated?
Can attributes be a full, encapsulated object? Can this be restricted by type?
Can attributes reference another object? Can this be restricted by type?
Are those references two-way?
Can we enforce referential integrity?
Is content in a hierarchy?
What base grouping or sectioning structures can we agree on? (Categories, types, branches of a content tree, etc.)
What parametric search options can we agree on?
Do we support full-text, tokenized search?
What discovery framework can we create to allow programmatic investigations of capabilities? (WSDL, FTW, finally.)
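
To show why the answers matter, here is a sketch of a validator that assumes one particular set of answers to a few of the questions above: attributes can repeat, null has to be explicitly allowed, references can be restricted by type, and referential integrity is enforced. The schema layout, the rule names (`repeats`, `nullable`, `target_type`), and the error messages are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical schema: type name -> attribute name -> rules.
SCHEMA: Dict[str, Dict[str, Dict[str, Any]]] = {
    "author": {"name": {"datatype": "string"}},
    "article": {
        "title": {"datatype": "string"},
        "tags": {"datatype": "string", "repeats": True},
        "subtitle": {"datatype": "string", "nullable": True},
        "author": {"datatype": "reference", "target_type": "author"},
    },
}


def validate(obj: Dict[str, Any], schema: Dict, repository: Dict) -> List[str]:
    """Return a list of problems found in one content object."""
    errors: List[str] = []
    for name, rules in schema[obj["type"]].items():
        value = obj["values"].get(name)
        # Null handling: an absent or None value is only fine if nullable.
        if value is None:
            if not rules.get("nullable", False):
                errors.append(f"{name}: missing or null")
            continue
        # Repetition: repeating attributes must be lists, others must not be.
        if rules.get("repeats", False):
            if not isinstance(value, list):
                errors.append(f"{name}: expected a list")
                continue
            items = value
        else:
            if isinstance(value, list):
                errors.append(f"{name}: does not repeat")
                continue
            items = [value]
        # References: the target must exist and be of the allowed type.
        if rules["datatype"] == "reference":
            for target_id in items:
                target = repository.get(target_id)
                if target is None:
                    errors.append(f"{name}: dangling reference to {target_id}")
                elif target["type"] != rules["target_type"]:
                    errors.append(
                        f"{name}: {target_id} is not of type {rules['target_type']}"
                    )
    return errors


repository = {
    "a1": {"type": "author", "values": {"name": "Jane"}},
    "p1": {"type": "article", "values": {"title": "Hello", "tags": ["cms"], "author": "a1"}},
    "p2": {"type": "article", "values": {"title": "Oops", "tags": "cms", "author": "p1"}},
}

print(validate(repository["p1"], SCHEMA, repository))  # no problems expected
print(validate(repository["p2"], SCHEMA, repository))  # two problems expected
```

A repository that answers any of these questions differently (say, one that has no notion of typed references) would need a different validator, which is exactly why a generic front-end can't be written until the answers are agreed on.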

Finding a baseline standard here would help the entire headless industry enormously. And there’s precedent for this – standards like Dublin Core and CommonMark and even ANSI SQL have tried (or are trying) to create a baseline for a wide variety of implementations. Vendors can do whatever they want as a superset of this, but claiming standards compliance means they at least support the baseline.
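
The "superset of a baseline" idea boils down to a subset check. Here is a small sketch, assuming a vendor could publish its capabilities as a set of names through some discovery mechanism. The capability names and the baseline itself are invented for illustration.

```python
from typing import Iterable, List

# Hypothetical baseline: the capabilities every compliant system must support.
BASELINE = frozenset(
    {"structured_types", "typed_attributes", "repeating_attributes", "typed_references"}
)


def is_compliant(vendor_capabilities: Iterable[str]) -> bool:
    """A vendor is compliant if its capabilities are a superset of the baseline."""
    return BASELINE <= set(vendor_capabilities)


def missing_from_baseline(vendor_capabilities: Iterable[str]) -> List[str]:
    """List the baseline capabilities a vendor lacks, in alphabetical order."""
    return sorted(BASELINE - set(vendor_capabilities))


# A vendor that supports the baseline plus extras of its own.
full_vendor = BASELINE | {"content_tree", "full_text_search"}
# A vendor with plenty of features, but not the whole baseline.
partial_vendor = {"structured_types", "typed_attributes", "full_text_search"}

print(is_compliant(full_vendor))
print(missing_from_baseline(partial_vendor))
```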

But I circle back to a cynical problem endemic to these situations: vendors in the space right now probably don’t want portability. They want lock-in.

No one who is depending on subscription revenue or looking for a seed round wants to be easily swappable. Rather, they want people to base the upper layer around their product and Cement. That. In. Place. And this probably has some variance based on the size of the vendor – a smaller vendor likes the idea of portability to gain new adherents, while an established vendor with an existing base of users wants to keep them in place.

So, this entire idea will likely remain a pipe dream. But there’s value here, and if anyone knows of a way to build some momentum behind this idea, I’m all ears.