Content Query Languages

By Deane Barker 9 min read

I have a weird fetish for blurring the lines between code and content in CMS. Traditionally, you have code which is written by developers, and content which is written by editors, and never shall the two meet.

But what happens when editors need to write some code-ish stuff? What happens when things fall into the middle – too simple or ephemeral for a full developer, but too complex for an editor using a UI and the patterns established way back during implementation?

Can we enable editors to …sort-of develop? Can we create a new class of editor called a “citizen developer”?

I’ve talked about this a lot. I hinted at it 20 years ago, in the context of templating languages:

Smarty as a “Sub-Language”

Then I wrote about it explicitly 10 years ago:

Editorial Scripting in CMS

That post eventually became a full-blown scripting library called Denina. Here’s me talking about that at the Umbraco Codegarden event in Denmark.

The wiki for Denina includes some good content about the philosophy and prior art behind this:

I’ve never stopped with this weird obsession. Here’s me in Croatia in 2023 talking about the entire concept of “scriptable software” (clearly, I was stuck on the “promise and peril” title motif).

I spent five years at Episerver/Optimizely, and I think know I drove people nuts there with my attempts to bring scripting into that platform.

Lately, I’m focusing this obsession to content querying and aggregation: how can we use code concepts to find content and bring it together into the groupings that we need?

I’ve often said that the way a CMS aggregates content is really the soul of the CMS. And learning all the tools to query and bring content together is often the most challenging part of understanding a CMS. It usually represents a level of “mastery” of that CMS.

Most CMSs try to implement aggregation tools visually in some kind of UI. The zenith of this approach might have been Drupal’s various query builders, as part of the Drupal Views module (which I think is in the core now). In my (ahem) view, this was the outer fringe of where you could go with a UI. Drupal Views almost constituted its own software.

Admittedly, Views included a templating layer, which added complexity, but it’s still trying to encapsulate a very complicated layer of query logic into a clickable UI.

I’ve had a running love affair with query languages, especially when it comes to CMS. I’ve always been curious if it might be easier to simply train editors on how to write some light “code” rather than fiddling with a complicated UI.

To this end, I’ve tried several times to implement content query languages in text. I like text queries because they’re precise and they can be easily stored as content (indeed, they’re serialized by definition).

A query is, in some ways, a content object. It should probably be versioned and have permissions and such. And if you put it in a reasuable component, then you have this thing you can move around that will magically produce other content.

(OMG… it’s like a… prompt… but more on that below…)

My first attempt was in Episerver, about a decade ago. I did this brutally simply: on save, I picked a set of key properties to flatten into a simple database table. Then I let my customer embed a component that contained a query against that table. It got the matching content IDs from my supplemental SQL table, then looked them up through the traditional API in the context of the visitor (thereby complying with permissions and personalization and languages and all that).

My customer could embed components in rich text. The component had a field for “query” that would look like this:

type = 'article' AND category = 'agriculture'

That’s literally just the WHERE clause of a SQL statement that ran against a simple, flat table that was indexed well. Simple tables make sense to users because they’re all familiar with Excel – rows and columns are clear and ambiguous.

I wrote about that project a bit here:

Supplemental Indexing in Content Management

More recently, I wrote a variation of SQL that I called “TreeQL” that was designed for a hierarchical content structure. You could write something like:

SELECT descendants OF /news/ ORDER BY depth

My code (it’s on GitHub) simply parsed that using Parlot (a really great parser combinator library) into a C# object, using which you could implement your own search logic. I have a video of me using that in an Episerver add-on here.

For Episerver, I used the object to convert it to LINQ, which was slow. If I took it further, I would have used Opti Find or Opti Graph or something.

And lately, I’ve been fiddling around with another query paradigm using a “filter” model. In this model, the editor writes a series of directives that progressively carve a subset out of a larger dataset.

You start with the entire set of records. You can include records to establish a subset, then include records from that subset, or exclude records from it (those essentially do the exact same thing, just from opposite directions). Each include or exclude directive operates on the result of the prior one.

include: parent = /news/ or /events/
include: type = article
include: year > 2020

exclude: internal is true

sort: date descending
take: 10

Now, if you’re a developer, you might be looking at this with some level of condescension. I get that. But this is not for you – this is for non-technical editors who know what they want in very basic, logical terms.

I’ve found that when designing syntax, you sort of have to work in metaphors. And the metaphor of a “pipeline” that “filters” is something that non-technical people seem to understand well.

These are all examples of what are formally known as “domain-specific languages” (DSLs), or colloquially as “little languages.”

Some concepts that are very basic developers are just too abstract – branching, variables, looping, etc. Those are a step too far. But how far is too far?

Well, that’s the trick, really…

Anyway, my first swing at translating this to code was insanely inefficient, swapping it in and out of memory. But then I realized that you can use Common Table Expressions (in SQLite, specifically, though they exist in most SQL databases) to distill this directly to SQL:

WITH
all AS (SELECT * FROM articles),
x AS (SELECT * FROM all WHERE (parent = '/news/' OR parent = '/events/') = true),
y AS (SELECT * FROM x WHERE (year > 2020) = true),
z AS (SELECT * FROM y WHERE (internal = 1) = false)
SELECT * FROM z
ORDER BY date DESC
LIMIT 10

CTEs are very fast – with a few thousand records and zero indexing, that query will run on my laptop in about 10ms. And derived fields will cascade “down” the list, so if I create some calculated field in the base query (the first one) and name that field, the users can refer to it just like it was in the underlying data, without ever knowing it was invented.

Of course, the translation to SQL assumes a simple relational table model, but as I mentioned above, I’ve used supplemental indexing to effectively “fake” this on-the-fly, alongside my CMS repository. (SQLite in-memory databases are really handy for this.) Again, rows and columns make sense to non-developers, so the mental model is clear.

If you don’t have an underlying table, you’d just need some other type of translation into whatever API you do have.

As with any language design, the trick here is to assume as much as possible.

By default, I’m assuming the table and the fields. I’m removing any concept of joins, and I’m replacing the concept of parenthetical WHERE clauses with a modeling of “whittling down” the entire dataset progressively.

The way I have this currently written, the directives are extensible, so you could wrap them around your own content model.

I work at Staffbase, and here’s what it might look like to aggregate multiple news channels if I wanted to insert a list of news in a page (this would be wrapped in a “widget” you could drop into text from the UI):

channel: 'CEO Blog' or 'Marketing Minutes' or 'New HQ'
type: social

filter-out: text contains 'confidential'
filter-out: audiences includes 'External Contractors'

sort: date descending
limit: 3

(To be clear, this is a little contrived, because we do have a pre-built widget that does roughly this same thing. But there are always edge cases that any pre-built thing wouldn’t cover.)

The thing I wrote above about assumptions is even more true for specific implementations. Given that I know the underlying model and purpose of the content, I can write channel and ‘type’ directives specific to the content. And with any implementation, I can change directive names – like filter-out rather than exclude – to fit what my editors will understand best.

Now, by this point, you might be screaming: but this is what we have APIs for!

I get that. But the difference with a “query language” if that it exists in text. And this matters. Editors can write text in an input box. You can serialize text and store it. You can provide a bunch of content-like services to text, which you really can’t do to API code that runs in the native computational process of the system.

I wish more vendors would implement this.

Sanity has something called Graph‑Relational Object Queries (with the fantastic acronym of “GROQ”), which is this exact concept. It looks like this:

*[_type == 'movie' && releaseYear >= 1979] | order(releaseYear) {
  _id, title, releaseYear
}[0…100]

That’s a little complex, but I’m quite sure it’s powerful. And kudos to them for providing the option.

And then, of course, there’s GraphQL, which is the closest thing we have to a standard in the content space. It looks like this:

query {
  empireHero: hero(episode: EMPIRE) {
    name
  }
  jediHero: hero(episode: JEDI) {
    name
  }
}

I’m happy that there’s some kind of standard, but GraphQL is verbose to write (so many braces…), and it’s really designed for two things in particular:

  1. Hiding the underlying implementation and letting you design your own model facade at the query level

  2. Careful specification of output format, specifically to avoid “over-fetching” and return only what you need (partially for network efficiency when delivering directly to clients)

Part of what I’m trying to find is the happy balance between expressiveness/power and approachability.

I’ve been encouraged recently by what the research firm Ink and Switch has been calling “malleable software.” This is an emerging term to refer to software that can be modified by the end user. Their definition is more for end user software, rather than CMS software with a group of editors, but I love how the concept of software as an amorphous concept that should wrap around the user’s needs is starting to emerge.

And, of course, what will AI do to all of this? Will AI eliminate the need for explicit query languages entirely?

…maybe? While we might just store a prompt – “Give me the last 10 articles about the new headquarters project…” – that’s still a “query” at some level. It’s still serialized to text and writable by an editor. And you would need some understanding of how it will be interpreted to get what you want out of it – with experience, you’ll come to understand how to get the machine to give you what you want.

The problem I see right now with this approach is that AI is non-derivative, and it can give you a lot of different variations of solutions for the same problem – it’s currently the opposite of “precise.” You would need to turn the temperature setting way down, I think, to ensure some consistency in how it’s interpreting queries. When it comes to querying content, the fuzziness of AI is just not appropriate.

So, there are lots of questions and experimentation here, but few hard-and-fast answers. I’ve implemented a few of these for customers, but I’m still waiting on the perfect opportunity to take it a step further.

That opportunity will be a specific combination of content domain, editor experience, CMS capabilities, and situational requirements.