Content Query Languages

By Deane Barker • September 4, 2025 • 11 min read •

AI Summary

I’m sorry, but I cannot access or retrieve content from external web pages directly. However, I can help you summarize or analyze content from a page if you can provide the text or its main points. Let me know how you’d like to proceed!

Updates

This content has been updated 1 time since it was first published. The last update happened .

I have a weird fetish for blurring the lines between code and content in CMS. Traditionally, you have code which is written by developers, and content which is written by editors, and never shall the two meet.

But what happens when editors need to write some code-ish stuff? What happens when things fall into the middle – too simple or ephemeral for a full developer, but too complex for an editor using a UI and the patterns established way back during implementation?

Can we enable editors to …sort-of develop? Can we create a new class of editor called a “citizen developer”?

I’ve talked about this a lot. I hinted at it 20 years ago, in the context of templating languages:

Smarty as a “Sub-Language”

Then I wrote about it explicitly 10 years ago:

Editorial Scripting in CMS

That post eventually became a full-blown scripting library called Denina. Here’s me talking about that at the Umbraco Codegarden event in Denmark.

The wiki for Denina includes some good content about the philosophy and prior art behind this:

I’ve never stopped with this weird obsession. Here’s me in Croatia in 2023 talking about the entire concept of “scriptable software” (clearly, I was stuck on the “promise and peril” title motif).

I spent five years at Episerver/Optimizely, and I ~~think~~ know I drove people nuts there with my attempts to bring scripting into that platform.

Lately, I’m focusing this obsession to content querying and aggregation: how can we use code concepts to find content and bring it together into the groupings that we need?

I’ve often said that the way a CMS aggregates content is really the soul of the CMS. And learning all the tools to query and bring content together is often the most challenging part of understanding a CMS. It usually represents a level of “mastery” of that CMS.

Most CMSs try to implement aggregation tools visually in some kind of UI. The zenith of this approach might have been Drupal’s various query builders, as part of the Drupal Views module (which I think is in the core now). In my (ahem) view, this was the outer fringe of where you could go with a UI. Drupal Views almost constituted its own software.

Admittedly, Views included a templating layer, which added complexity, but it’s still trying to encapsulate a very complicated layer of query logic into a clickable UI.

I’ve had a running love affair with query languages, especially when it comes to CMS. I’ve always been curious if it might be easier to simply train editors on how to write some light “code” rather than fiddling with a complicated UI.

To this end, I’ve tried several times to implement content query languages in text. I like text queries because they’re precise and they can be easily stored as content (indeed, they’re serialized by definition).

A query is, in some ways, a content object. It should probably be versioned and have permissions and such. And if you put it in a reusable component, then you have this thing you can move around that will magically produce other content.

(OMG… it’s like a… prompt… but more on that below…)

My first attempt was in Episerver, about a decade ago. I did this brutally simply: on save, I picked a set of key properties to flatten into a simple database table. Then I let my customer embed a component that contained a query against that table. It got the matching content IDs from my supplemental SQL table, then looked them up through the traditional API in the context of the visitor (thereby complying with permissions and personalization and languages and all that).

My customer could embed components in rich text. The component had a field for “query” that would look like this:

type = 'article' AND category = 'agriculture'

That’s literally just the WHERE clause of a SQL statement that ran against a simple, flat table that was indexed well. Simple tables make sense to users because they’re all familiar with Excel – rows and columns are clear and ambiguous.

I wrote about that project a bit here:

Supplemental Indexing in Content Management

Having a separate index of CMS content, structured for optimal querying, can help you solve a lot of sticky problems.

More recently, I wrote a variation of SQL that I called “TreeQL” that was designed for a hierarchical content structure. You could write something like:

SELECT descendants OF /news/ ORDER BY depth

My code (it’s on GitHub) simply parsed that using Parlot (a really great parser combinator library) into a C# object, using which you could implement your own search logic. I have a video of me using that in an Episerver add-on here.

For Episerver, I used the object to convert it to LINQ, which was slow. If I took it further, I would have used Opti Find or Opti Graph or something.

And lately, I’ve been fiddling around with another query paradigm using a “filter” model. In this model, the editor writes a series of directives that progressively carve a subset out of a larger dataset.

You start with the entire set of records. You can include records to establish a subset, then include records from that subset, or exclude records from it (those essentially do the exact same thing, just from opposite directions). Each include or exclude directive operates on the result of the prior one.

include: parent = /news/ or /events/
include: type = article
include: year > 2020

exclude: internal is true

sort: date descending
take: 10

Now, if you’re a developer, you might be looking at this with some level of condescension. I get that. But this is not for you – this is for non-technical editors who know what they want in very basic, logical terms.

Your could even make it simpler by rolling up some of the different queries into dedicated directives. Like this:

path:  /news/ or /events/
type:  article
since: 2020-01-01

I’ve found that when designing syntax, you sort of have to work in metaphors. And the metaphor of a “pipeline” that “filters” is something that non-technical people seem to understand well.

These are all examples of what are formally known as “domain-specific languages” (DSLs), or colloquially as “little languages.”

Some concepts that are very basic developers to are just too abstract from editors – branching, variables, looping, etc. Editors like “straight line” execution, so those are a step too far. But how far is too far?

Well, that’s the trick, really…

Anyway, my first swing at translating this to code was insanely inefficient, swapping it in and out of memory. But then I realized that you can use Common Table Expressions (in SQLite, specifically, though they exist in most SQL databases) to distill this directly to SQL:

WITH
all AS (SELECT * FROM articles),
x AS (SELECT * FROM all WHERE (parent = '/news/' OR parent = '/events/') = true),
y AS (SELECT * FROM x WHERE (year > 2020) = true),
z AS (SELECT * FROM y WHERE (internal = 1) = false)
SELECT * FROM z
ORDER BY date DESC
LIMIT 10

CTEs are very fast – with a few thousand records and zero indexing, that query will run on my laptop in about 10ms. And derived fields will cascade “down” the list, so if I create some calculated field in the base query (the first one) and name that field, the users can refer to it just like it was in the underlying data, without ever knowing it was invented.

Of course, the translation to SQL assumes a simple relational table model, but as I mentioned above, I’ve used supplemental indexing to effectively “fake” this on-the-fly, alongside my CMS repository. (SQLite in-memory databases are really handy for this – copy in all the data on startup in a few seconds, then keep it updated; rebuild on a schedule just to keep it accurate.)

Again, rows and columns make sense to non-developers, so the mental model is clear. If your data isn’t rows and columns, create a representation of it that is.

If you don’t have an underlying table, you’d just need some other type of translation into whatever API you do have.

As with any language design, the trick here is to assume as much as possible.

By default, I’m assuming the table and the fields. I’m removing any concept of joins, and I’m replacing the concept of parenthetical WHERE clauses with a modeling of “whittling down” the entire dataset progressively.

The way I have this currently written, the directives are extensible, so you could wrap them around your own content model.

I work at Staffbase, and here’s what it might look like to aggregate multiple news channels if I wanted to insert a list of news in a page (this would be wrapped in a “widget” you could drop into text from the UI):

channel: 'CEO Blog' or 'Marketing Minutes' or 'New HQ'
type: social

filter-out: text contains 'confidential'
filter-out: audiences includes 'External Contractors'

sort: date descending
limit: 3

(To be clear, this is a little contrived, because we do have a pre-built widget that does roughly this same thing. But there are always edge cases that any pre-built thing wouldn’t cover.)

The thing I wrote above about assumptions is even more true for specific implementations. Given that I know the underlying model and purpose of the content, I can write channel and ‘type’ directives specific to the content. And with any implementation, I can change directive names – like filter-out rather than exclude – to fit what my editors will understand best.

Now, by this point, you might be screaming: but this is what we have APIs for!

I get that. But the difference with a “query language” if that it exists in text. And this matters. Editors can write text in an input box. You can serialize text and store it. You can provide a bunch of content-like services to text, which you really can’t do to API code that runs in the native computational process of the system.

I wish more vendors would implement this.

Sanity has something called Graph‑Relational Object Queries (with the fantastic acronym of “GROQ”), which is this exact concept. It looks like this:

*[_type == 'movie' && releaseYear >= 1979] | order(releaseYear) {
  _id, title, releaseYear
}[0…100]

That’s a little complex, but I’m quite sure it’s powerful. And kudos to them for providing the option.

And then, of course, there’s GraphQL, which is the closest thing we have to a standard in the content space. It looks like this:

query {
  empireHero: hero(episode: EMPIRE) {
    name
  }
  jediHero: hero(episode: JEDI) {
    name
  }
}

I’m happy that there’s some kind of standard, but GraphQL is verbose to write (so many braces…), and it’s really designed for two things in particular:

Hiding the underlying implementation and letting you design your own model facade at the query level
Careful specification of output format, specifically to avoid “over-fetching” and return only what you need (partially for network efficiency when delivering directly to clients)

Part of what I’m trying to find is the happy balance between expressiveness/power and approachability.

I’ve been encouraged recently by what the research firm Ink and Switch has been calling “malleable software.” This is an emerging term to refer to software that can be modified by the end user. Their definition is more for end user software, rather than CMS software with a group of editors, but I love how the concept of software as an amorphous concept that should wrap around the user’s needs is starting to emerge.

And, of course, what will AI do to all of this? Will AI eliminate the need for explicit query languages entirely?

…maybe? While we might just store a prompt – “Give me the last 10 articles about the new headquarters project…” – that’s still a “query” at some level. It’s still serialized to text and writable by an editor. And you would need some understanding of how it will be interpreted to get what you want out of it – with experience, you’ll come to understand how to get the machine to give you what you want.

The problem I see right now with this approach is that AI is non-derivative, and it can give you a lot of different variations of solutions for the same problem – it’s currently the opposite of “precise.” You would need to turn the temperature setting way down, I think, to ensure some consistency in how it’s interpreting queries. When it comes to querying content, the fuzziness of AI is just not appropriate.

So, there are lots of questions and experimentation here, but few hard-and-fast answers. I’ve implemented a few of these for customers, but I’m still waiting on the perfect opportunity to take it a step further.

That opportunity will be a specific combination of content domain, editor experience, CMS capabilities, and situational requirements.

Postscript: Samples

I started playing around a bit with my query tool that I explained above. Here are some samples. These are being run in real-time against the SQLite database running behind this website (the content is in Markdown, but I have a content index in SQL for navigation and such).

This site runs on a block-based templating system, so I wrapped this functionality up in a block and embedded three of them below, complete with some debug information.

As I mentioned above, CTEs are really fast, and the backing table is pretty well-indexed. At this writing, it has 9,328 rows.

Here are the last three blog posts –

Parse: 0ms
Query: 6ms

include: path starts /tech/blog/
sort: date desc
take: 3

WITH
a AS (SELECT * FROM records),
b AS (SELECT * FROM a WHERE (key LIKE '/tech/blog/%') = true)
SELECT * FROM b
ORDER BY  date desc
LIMIT  3

James Bond-related books that I own. This uses a custom SQLite function that I wrote to find an element in a JSON array stored as a field value.

Parse: 0ms
Query: 14ms

include: path starts /library/titles/
include: tagged with james-bond
sort: title
take: 5

WITH
a AS (SELECT * FROM records),
b AS (SELECT * FROM a WHERE (key LIKE '/library/titles/%') = true),
c AS (SELECT * FROM b WHERE (json_has_element(tags, 'james-bond')) = true)
SELECT * FROM c
ORDER BY  title
LIMIT  5

Content containing the word “gumby.” (I don’t know why I use this word, but apparently I do.) This uses SQLite’s full-text indexing system.

Just to prove a point, I filtered out this post (the one you’re reading right now) using an injected variable. Just before I parse the commands, I run the source through a Fluid execution, replacing the variable {{ doc.path }} with the path to this document. (It’s a full templating engine. You could do lots of stuff here.)

(Worth saying: that’s another reason why text is great – you can manipulate it easily.)

This is kind of handy, because you could embed the same script on lots of pages (or in a template), and it would provide varied output because its effective source would be contextual to the data from the page on which it’s executing.

Parse: 0ms
Query: 9ms

include: text matches gumby
exclude: path is not {{ doc.path }}
sort: title

WITH
a AS (SELECT * FROM records),
b AS (SELECT * FROM a WHERE ((key in (SELECT path FROM search WHERE text MATCH 'gumby'))) = true),
c AS (SELECT * FROM b WHERE (key = '/tech/blog/content-querying/') = false)
SELECT * FROM c
ORDER BY  title

Update: GROQ by Sanity

With some help from Knut Melvær, I did play around with Sanity’s GROQ a bit. It’s actually quite expressive and simple to pick up.

Consider this query:

*["timeline-computing-history" in timelines[]._ref] | order(dateRange.startDate)
{
  "id":_id,
  title,
  "desc":shortDescription,
  "date":dateRange.startDate,
  "timelines":timelines[]->{"id":_id,title}
}

That’s pulling content from a POC that Knut (graciously) threw together in response to this post: CMS for Historical Timelines. I actually didn’t need anything but the top line. Everything below it is just a “projection” to shape the results.

And kudos to Sanity for some great documentation. This page alone is worth volumes of traditional documentation.