The Necessity of a Content Index

By Deane Barker

There’s an aspect of content management that gets completely taken for granted (yes, even more so than the core content geography). It’s something so inherent to CMS that you rarely acknowledge that it’s even there. But I promise that it’s one of the first benefits you found when you fired up your first CMS so many years ago.

It’s the simple ability for the CMS to (1) know about all your content, and (2) respond to queries about this content.

Think back to before you had a CMS and were working in FrontPage or Dreamweaver or Notepad or whatever. You had two big problems:

  1. There was no templating – content and presentation were mixed.

  2. The running website had no overarching knowledge about all its content. There was no central authority to which you could pose questions about your content and get answers.

It’s that second problem which I’m going to talk about here.

To demonstrate, let’s go back to your static website for a second –

You want a list of press releases? Fine, create it manually, then whenever you add a new press release, be sure to add it to that list. You need to render local navigation? Okay, whenever you add a page to a section, be sure to add the page to the menu. You deleted the page? Go back and take the link off. You want a list of content related to the current page? Yeah, well, good luck with that.

When you first hooked a website up to a database, most of these problems went away, didn’t they? You want a list of press releases? That’s just one query away. Navigation too gets rendered from the database, with links coming on and off as the pages get added and removed.

Here’s the difference: with a database, you had some resource that knew about all your content. It had the big picture; the overhead view. You could ask it for something (“Can I have a list of all the press releases in reverse order of date published?”) and get them back in the format you wanted.

You had a content index.
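To make that concrete, here is a minimal sketch of the press release question posed to a database. The table name, column names, and sample rows are all made up for illustration; any real schema would differ.

```python
import sqlite3

# Hypothetical schema: one table that knows about every piece of content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (title TEXT, type TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?)",
    [
        ("Q1 results", "press_release", "2010-04-15"),
        ("About us", "page", "2009-01-10"),
        ("New office opens", "press_release", "2010-06-01"),
    ],
)


def latest_press_releases(conn, limit=10):
    """Return press release titles, newest first."""
    rows = conn.execute(
        "SELECT title FROM content WHERE type = ? "
        "ORDER BY published DESC LIMIT ?",
        ("press_release", limit),
    )
    return [title for (title,) in rows]


print(latest_press_releases(conn))
```

The dates are stored as ISO 8601 strings so that plain text ordering matches chronological ordering.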

(I spent some time trying to figure out what to call this feature. I came close to using “content manifest,” since “manifest” as a general term means “list of individual items in a group” – think a ship’s manifest or a passenger manifest. I also thought about “content oracle,” but that might lead to confusion for obvious reasons. I also half-seriously considered “content gossip,” because in every group of people, there’s always one person that knows all the dirt on everyone.)

I’ve been talking about content indexes in various forms for years, perhaps most passionately in the creatively-named post: Give Me My Friggin' Content! Or, why methods that start with "Get" are a good thing. (You can skip this if you don’t want to hear me rant…)

A good CMS should manage your content well and let you have it back in whatever format, permutation, grouping, filtering level, depth, and quantity as you want. And then it should get the hell out of your way.

[…] Here’s the deal – if you make a CMS and persuade me to put my content into it, you need to make damn sure that I can get content back out of it with at least as much proficiency as the crappy Access databases I was writing back in 1998. If not, then rework your API until it no longer pisses me off.

Retrieval APIs are foundational. They are not an add-on. They are one of the pillars of content management, period. If you are putting more work into your widgets than your retrieval API, then you are boned.

The fact is, a good content index is something you don’t think about much until you don’t have it (or you have a crappy one).

I got to thinking about this feature the other day while playing around with DropPages. While I think DropPages is neat, it’s missing a comprehensive content index. I believe there’s some type of navigation management (Nesta has this), but I couldn’t find any feature in DropPages that specifically created a queryable index of all the content in the site, and this is critical for even a moderately large collection of content.

What if I started entering news articles for my company in a DropPages site? I kept at it, and before long had hundreds of them in there. What if I wanted to categorize them? Display a date range? Show the 10 latest articles? This amount of content gets tough to manage without an index that I can query.

Years ago, we had a customer come to us after having a bad CMS experience. He wanted to be in Dreamweaver. At the time, I was fresh off some blog posts on mixing static sites with CMS and really wanted to give it a try, so we put him in a static site with considerable automation.

The results were mixed. While he got some nice templating features, he still had the problem that there was no comprehensive content index for the entire site. He would add a page in Dreamweaver, but “the site” (quotes to suggest the amorphous concept of the website as a single entity) didn’t know about it, thus I couldn’t automatically put it in any navigation or do anything else with it. (Honestly, we should have just talked this client into a better CMS…)

With a CMS or database-driven site, the content index is inherent and obvious. So the question is, how do we create an index for a static site, or for something like DropPages or Nesta?

My feeling is that you do some kind of supplemental indexing, via a web or file crawler. When a new page is added, that’s detected somehow and the site is re-indexed (or, worst case, the editor has to manually initiate this). All of the pages are consumed and indexed, both to record their mere existence and to capture any structured data that can be parsed out of them.
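A file crawler along these lines could be sketched roughly as follows. This is only one way to do the "detected somehow" part: it assumes pages are .html or .htm files on disk and uses last-modified times to spot changes. The function names are hypothetical.

```python
import os


def crawl(root, extensions=(".html", ".htm")):
    """Walk root and return {relative_path: last_modified_time} for each page."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                full = os.path.join(dirpath, name)
                # Normalize to forward slashes so paths look like URLs.
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                index[rel] = os.path.getmtime(full)
    return index


def changed_pages(old_index, new_index):
    """Paths that are new or modified since the previous crawl."""
    return sorted(
        path for path, mtime in new_index.items()
        if old_index.get(path) != mtime
    )


def removed_pages(old_index, new_index):
    """Paths that existed in the previous crawl but are gone now."""
    return sorted(path for path in old_index if path not in new_index)
```

Running crawl on a schedule and comparing the result with the previous run gives you the list of pages to re-index and the list of links to take down.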

The goal would be to get a standard set of metadata out of them, say:

  1. Title

  2. Type (news article, blog post, product description, etc.)

  3. Date Published

  4. Author

Both DropPages and Nesta give you a way to model content and specify individual fields, so you could pull this out of the source files. Alternately, render these items to meta tags and pull them out of the rendered page via an HTTP call.
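Pulling those four fields out of a rendered page might look something like this. The meta tag names are my own invention for the sketch (neither DropPages nor Nesta prescribes them), and the HTTP fetch is left out; the function just takes the HTML as a string.

```python
from html.parser import HTMLParser

# Assumed meta tag names for the four fields listed above.
FIELDS = {"title", "type", "date-published", "author"}


class MetaExtractor(HTMLParser):
    """Collect the content of <meta name="..."> tags we care about."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in FIELDS and attrs.get("content") is not None:
            self.meta[name] = attrs["content"]


def extract_metadata(html):
    """Return a dict of the known metadata fields found in the page."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta


SAMPLE = """<html><head>
<meta name="title" content="New office opens">
<meta name="type" content="press release">
<meta name="date-published" content="2010-06-01">
<meta name="author" content="Jane Doe">
<meta name="viewport" content="width=device-width">
</head><body>...</body></html>"""

print(extract_metadata(SAMPLE))
```

Tags the index doesn't know about (the viewport tag in the sample) are simply ignored.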

Then, inside the templating system, there just needs to be a way to query this index and render the results into a navigation structure, such as a list of press releases or a menu. There’s a critical question of how deep and sophisticated the query facility can be – you could go from type filtering on one side all the way to full-blown SQL on the other. However, even a simple system that let you query and sort by dates and types would be a big step forward.
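At the simple end of that spectrum, a query facility over the index might be little more than this. It's a sketch that assumes the index is a list of dictionaries held in memory; the field names and the query function are hypothetical.

```python
from datetime import date

# Hypothetical index entries, as a crawler might have recorded them.
INDEX = [
    {"title": "Q1 results", "type": "press release", "published": date(2010, 4, 15)},
    {"title": "Summer picnic", "type": "blog post", "published": date(2010, 5, 20)},
    {"title": "New office opens", "type": "press release", "published": date(2010, 6, 1)},
]


def query(index, content_type=None, since=None, until=None,
          newest_first=True, limit=None):
    """Filter by type and date range, sort by date, and cap the result count."""
    results = [
        item for item in index
        if (content_type is None or item["type"] == content_type)
        and (since is None or item["published"] >= since)
        and (until is None or item["published"] <= until)
    ]
    results.sort(key=lambda item: item["published"], reverse=newest_first)
    return results if limit is None else results[:limit]


# A template could render this as the press release list.
for item in query(INDEX, content_type="press release"):
    print(item["published"], item["title"])
```

That covers categorizing, date ranges, and "show the 10 latest" with one function, which is already more than a static site gives you.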

(There’s some interesting precedent in ASP.Net. Microsoft has tried to solve the problem of a content index on a static site with their site map protocol. You create an XML document in a specific location and format. You can use this to manage your navigation (though, sadly, not much more). You have to manually add and position pages in it, but it’s still a content index – a single entity that knows about all the content and can be queried about it.)

What gets really interesting is when you step back a level and try to come up with a content index for multiple, related sites. This gets challenging because you might have content siloed in multiple repositories, none of which know about each other.

Consider an intranet, which can often consist of multiple sites given the distributed enterprise and the reality of political fiefdoms. If we have intranet sites A, B, and C, each in its own CMS, then we have a lot of local indexes, but still no global index for all the content in the enterprise.

In these cases, you usually end up with some kind of federated search where the search engine itself becomes the content index. The crawler is constantly wandering around the intranet looking for new content, and it is inherently built to suck up content from disparate data sources and present those on a SERP in some kind of homogenized format.

If you can manage to put some common metadata scheme around all your content, then you can get some real use out of it. Indexed metadata can be treated as a sort of free-form database, with each meta tag constituting a queryable field.

Consider news – let’s say every department has their own way of distributing news to their employees, and the CEO wants a master list of news where he can see all the news for the company. Problem is, some departments use local WordPress sites, some have more advanced CMS, a couple are using FrontPage, and so on.

So long as all the editors put a common set of meta tags on the pages, you should be able to create a federated content index using a search engine. Dublin Core defines standard elements for the date (date, or created in the extended terms), the author (creator), and the title (title). Any competent search engine should be able to index these. Using this index, you could produce a list of all news, across all departments, in reverse order of when it was published, without forcing any department to change their content publishing infrastructure.
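Here is a rough sketch of the merging step, assuming a crawler has already collected the meta tags from each department's pages. The source names, URLs, and sample data are invented, and the DC.title, DC.creator, and DC.date names follow the common convention for embedding Dublin Core in HTML meta tags rather than anything a particular search engine requires.

```python
from datetime import date

# Hypothetical crawl results: meta tags found on each page, grouped by site.
CRAWLED = {
    "hr-wordpress": [
        {"url": "http://hr.example/news/1", "DC.title": "Benefits update",
         "DC.creator": "HR", "DC.date": "2010-05-03"},
    ],
    "sales-frontpage": [
        {"url": "http://sales.example/big-deal.htm", "DC.title": "Big deal signed",
         "DC.creator": "Sales", "DC.date": "2010-06-12"},
        {"url": "http://sales.example/untagged.htm"},  # editor forgot the tags
    ],
}


def federated_news(crawled):
    """Merge pages from every source into one list, newest first."""
    merged = []
    for source, pages in crawled.items():
        for tags in pages:
            # Pages without the agreed tags can't be placed in the list.
            if "DC.title" not in tags or "DC.date" not in tags:
                continue
            merged.append({
                "source": source,
                "url": tags["url"],
                "title": tags["DC.title"],
                "author": tags.get("DC.creator", ""),
                "published": date.fromisoformat(tags["DC.date"]),
            })
    merged.sort(key=lambda item: item["published"], reverse=True)
    return merged


for item in federated_news(CRAWLED):
    print(item["published"], item["source"], item["title"])
```

Note the weak point: a page whose editor skipped the tags silently drops out of the master list, so the whole approach depends on that common metadata scheme actually being followed.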

(There’s no reason why this can’t be done with a single site as well. You just need a search system that will parse meta tags and allow you to filter and order based on them. Yes, you’re using a search engine for structured, parameterized search rather than free-text search, but so what – it’s not like there’s a law against this or something.)

Regardless of the method (and there are likely hundreds of viable methods), a content index is truly the backbone of a CMS, and it’s absolutely necessary to be able to say a website is “managed” in any real sense. It’s the thing that binds all the disparate threads of content together and allows them to be used in service of a greater goal.

Without an index, each page of content is a world unto itself, completely ignorant of the larger structure in which it resides. With this limitation, the utility of your content is drastically reduced.
