Supplemental Indexing in Content Management

By Deane Barker

I’ve suddenly become quite interested in the idea of supplemental indexing in content management. This is primarily because I have a long history of hating content management query capabilities.

I’m mystified at systems that are less adept than a simple relational database at letting you have your content back. You put it in the system, then go to get it back out…and the API just sucks.

This still gets me, to this day. The fact is, the query abilities of most content management systems just suck. It’s as if they wanted us to put our content in, but never expected us to want to get it back out. Vendors, if your CMS isn’t at least as good as the SQL databases we all started building back in the day, you really need to re-examine your priorities.

Traditionally, a CMS forces you to rely on their provided API. The two systems we work with regularly have facilities for querying: eZ publish has their Fetch architecture, and Episerver has what they call Find Pages With Criteria. Both are…okay, if a little verbose. (We worked with Ektron in the past, and their system has traditionally been completely hopeless in this regard, though they’ve done some neat work lately with their new Framework API, and I hear they’ve solved this problem.)

But, there are times when even the most well thought-out API breaks down. With Episerver, a client needed to do a simple parenthetical query – WHERE A=B AND ( C=D OR E=F ) – and the Find Pages With Criteria API did not handle it well (or at all, really). In discussions with other EPIServer developers, the proposed solutions were all some form of code acrobatics for something which SQL handles natively.

I’ve run into the same problem with about every CMS I’ve ever worked with. Querying APIs are fundamentally wrappers around SQL (insomuch as most every CMS is still using a SQL-based database), and they always lose something in the abstraction. They’re built to handle 90% of the things you can come up with. The last 10% is where things break down and where you find yourself lamenting, “But I could do this in straight SQL in about two seconds…”

And this is a real problem, because as I passionately wrote in the previously-linked post:

Retrieval APIs are foundational. They are not an add-on. They are one of the pillars of content management, period.

In most cases, your need for content retrieval is not that advanced. But there are situations when you really want to use a CMS (or part of one) as a big relational database, and unless you’re using a natively relational framework like Rails or Django, or a ORM-based CMS like WebNodes, you’re going to test the limits of your API.

So, recently, Blend has done a few projects where we had to do some advanced querying and we just…punted. We created companion, query-friendly indices of CMS content alongside the CMS repository, and used them for querying. If your CMS has a decent event-based API, this isn’t hard, and I’ve found it to be a fantastic way around sticky problems.

For the first project, we used Lucene.Net. Though Lucene is traditionally used for full-text indexing, we used it here for straight-up field-based search of a subset of the CMS content. In this case, we denormalized the data considerably, which is a great thing about a supplemental index – you store content in the method that makes it easiest to query, regardless of how many data modeling rules that breaks.

In the second case, we went even more basic – we wrote a plugin for Episerver which converted portions of the page repository to a set of flat SQL tables. So we essentially write out a companion database, next to the CMS database, which we use for advanced querying. This is for a client who is converting existing SQL queries, and it’s worked beautifully to re-use their existing query logic.

For the third project, we hooked up Solr, and used it to store an separate index of all the content in a truly massive eZ publish installation. We’ve got this dialed-in to the point where could effectively use this separate index for any queries of the repository – you can fundamentally drive the entire site out of Solr, if you want.

These have been great solutions, but using a supplemental index hinges on a couple things –

First, your CMS has to be such that you can keep the index updated. This means you need an event-capture API so you can be notified of content changes and update the changed content in the index. Additionally, you need to make sure you capture cascading updates – changing page A may change page B, can how do you capture this? In a tree structure, changing the parent may change every page below it in the tree, so you need to make sure you can correctly re-index content, even when that content might not have changed directly.

Second, your CMS needs to have a decent filtering API, meaning you can toss a bucket of content at it and say, “Just give me back the content this user has access to,” or “Just give me back the content that’s public and published.” Your supplemental index probably knows nothing about your permissions model, so you need to make sure your CMS can pick through a raw pile of content and take out what the current user shouldn’t see.

Third, you need to faithfully treat your supplemental index as disposable. It’s truly supplemental – it’s a querying facility, and nothing more. You don’t actually read content out of it, rather you just read some page identifier that you then populate through from the core CMS API. You should be able to wipe out and re-create this index without concern. A sure way to run into problems is to start using your index as more than just a querying tool.

My initial reaction to the idea of a supplemental index was that it was somehow less “pure” than a real-time query of the actual CMS repository, but my experiences of the last year have completely changed my mind. Writing your own index can alleviate so many problems that it can free you up spend more time on the more challenging aspects of your integration.

Trying to force a query API to do what even something as simple as SQL has done for years shouldn’t be the hardest thing about a CMS. Sadly, it often is.

This is item #102 in a sequence of 357 items.

You can use your left/right arrow keys to navigate