I’m going to try and impugn one of the great concepts of content management: metadata. I’m going to argue that in the world of Web content management (WCM), it doesn’t really exist. Well, it might, but if it does, it’s awfully slippery to define, and defining it doesn’t give you much value anyway.
Classically, “metadata” is “data about data.” People love that definition. Every time I see “metadata” defined anywhere, it’s quickly followed by that phrase (much like I just did right there).
The idea is that metadata describes another piece of data. So, the metadata is not the data, it’s just a description of the data. However, this raises the obvious question: what is “the data”? How do we sort out what is metadata (second order data) and what is data proper (first order data)? It’s harder than you think.
I propose that there are two major theories of metadata – two major schools of thought people invoke when they say, “this is metadata.”
The Geography Theory says that metadata is metadata because it’s stored “somewhere else,” apart from the data proper.
In some situations, this is obvious. In particular, the current concept of metadata in content management came out of the document management space (where most of the *CM spaces originated).
With document management, the core object under management is some binary file, like a Word document. This is clearly “the data.” And in most document management systems, you could “decorate” this data with additional data to describe the file, be it a category, author, status, whatever. And today, inside Word, you can go to “Properties” and add information, and this is logically metadata because the actual words in the document are “the data.”
So, in situations where you have a clear core of “data” and the information about this data is “somewhere else” (see how this is getting weird already?), then it’s pretty clear.
A lot of this geography was dictated by format. Back in the “golden age” of document management, applications like Word couldn’t store extra data like this, so the data was in the document management system instead (“somewhere else”).
Early versions of Ektron had the same problem – back then, Ektron only managed HTML content, rather than more structured things like XML. So, Ektron had a tab called “Metadata” for you to store other information that you couldn’t somehow embed in HTML. This tab still exists, even though with later versions of Ektron, you can put most of this information directly inside XML-based content. (What gets odd is that the datatypes differ between what’s on the “metadata” tab and what you can put in the XML, which sometimes forces you to put something under “metadata” when you’d rather just put it over with the core data.)
In other systems – especially Web content management systems – this distinction between metadata and core data breaks down. In Episerver, for instance, there is no concept of content being in one place or another – all properties of a page are in the same “place” (under the same “interface umbrella,” if you will), so it’s all just data. Nowhere can I say, “this is data…and this is metadata…” etc.
This situation is very common in WCM. There is no “somewhere else.” All the data relating to a logical piece of content is stored and administrated together, which completely negates The Geography Theory.
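To make the contrast concrete, here is a minimal sketch of the two storage shapes. Every name in it (`ManagedDocument`, `Page`, `metadata_by_geography`) is hypothetical and invented for illustration; it is not how Ektron, Episerver, or any real system models content.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedDocument:
    """Document-management shape: the binary file is clearly 'the data'."""
    file_bytes: bytes                             # the core object, e.g. a Word file
    metadata: dict = field(default_factory=dict)  # descriptive data kept 'somewhere else'


@dataclass
class Page:
    """WCM shape: every property sits under the same umbrella."""
    properties: dict = field(default_factory=dict)


def metadata_by_geography(item):
    """Return whatever the Geography Theory would call metadata, or None
    when the item has no separate 'somewhere else' to look in."""
    return getattr(item, "metadata", None)


doc = ManagedDocument(
    file_bytes=b"binary Word file contents",
    metadata={"author": "Pat", "status": "draft"},
)
page = Page(
    properties={
        "title": "Annual Report",
        "body": "<p>Report text</p>",
        "author": "Pat",
        "status": "draft",
    }
)
```

For `doc`, the helper has something to return. For `page`, it returns `None`: author and status are still there, but nothing about where they live marks them as different from the title or the body.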
The Visibility Theory says that metadata is data that’s used for some purpose other than publishing to a consumer.
Content has “publishable” information, which is data we intend to push to the consumer – the title of a news article is an obvious example. This is the data proper.
But what about data that’s for administrative purposes only? One of my clients was just asking me yesterday about “metadata” to help search for content on the admin side of the site. They wanted to be able to tag or otherwise identify pages so they could find certain pages later from among hundreds of others. This is information that would never be published to the end user.
Similarly, at Gilbane Boston last month, I took a question from a woman who wanted to use a taxonomy system to categorize the quality and review state of various content. This is very much information that will never be published – you don’t want your consumers to see your category label of “really crappy stuff I wrote after an eight-martini bender,” after all.
Both of these are perfectly reasonable endeavors, but do they define “metadata,” as opposed to data proper which is published to the end user? If we include explicitly defined information like this as metadata, do we also include systemic information? Is the Published Date considered metadata? What about the applied permission set?
In most systems, there’s no way to really define what data is going to be rendered to the consumer in the presentation layer. Maybe your template will output Published Date, but maybe it won’t, and there’s little way for the system to know that in most cases (few systems have any reason to parse their own presentation templates).
Furthermore, most WCM systems don’t really care (for lack of a better word) if you output a specific piece of the data to the end user. They manage, store, and treat all content data the same way – if you choose to output Datum X on your page, that’s up to you. A WCM system has yet to ask me why I’m storing any particular piece of data.
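A rough sketch of why visibility is a property of the template rather than of the data. The content item, field names, and both templates are made up for illustration, and Python's built-in string formatting stands in for a real presentation layer.

```python
from string import Formatter

# One stored content item. The storage layer treats every field the same way.
content = {
    "title": "Budget Approved",
    "body": "The council voted on Tuesday.",
    "published_date": "2011-01-15",   # hypothetical sample value
    "review_state": "needs work",     # admin-only, at least by intent
}

# Two hypothetical presentation templates for the same item.
template_a = "<h1>{title}</h1><p>{body}</p>"
template_b = "<h1>{title}</h1><time>{published_date}</time><p>{body}</p>"


def fields_output_by(template):
    """Names of the fields a template would actually show to the consumer."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render(template, item):
    """Render an item; fields the template does not mention are simply ignored."""
    return template.format(**item)
```

Under the Visibility Theory, `published_date` is metadata if the page uses `template_a` and data proper if it uses `template_b`, even though the stored item never changed. The only way to tell is to parse the template, which is exactly what few systems have any reason to do.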
So, in the end, when talking about WCM, I think that the use of the term “metadata” can really muddy the waters, especially for people new to the field. Unless you explicitly acknowledge one of the theories above and explain that “this is the operative definition we’re going to use for ‘metadata’,” it’s easy to get people confused by it.
But even if you do define this, and everyone knows what you’re talking about, what have you gained? In a WCM system where all pieces of data for a logical piece of content are jumbled in together, differentiating between what is “data” and what is “metadata” really has little practical value.