Eval Criteria # 5

How can attribute values be validated?

Attribute types use the editorial interface to coerce editors into creating valid attribute values. Our mapping example from the last chapter would offer editors nothing but a map from which they could pick a location, into which we would extract the coordinates. Thankfully, there’s no way an editor could type the entire text of War and Peace into that editorial element.

Value coercion is the first line of content modeling resiliency.

We discuss using the editorial interface to coerce values in a later chapter.

But sometimes you can’t rely only on coercion because the format of the value needs additional evaluation for logical truth. Therefore, our fallback position is validation rules.

A validation rule is a “test” to ensure a proposed value logically satisfies some condition. Failure to pass the test results in an error and a refusal of the system to store the value, which usually manifests as a refusal to store the entire set of changes of which the invalid value is a part.
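As a rough illustration, a rule can be modeled as a function that returns an error message or nothing, and a save can be modeled as all-or-nothing. This is only a sketch: the names `save_changes`, `ValidationFailed`, and `not_blank` are hypothetical and do not come from any particular CMS.

```python
from typing import Any, Callable, Optional

# A rule is a test: it returns an error message, or None when the value passes.
Rule = Callable[[Any], Optional[str]]


class ValidationFailed(Exception):
    """Raised when any value in a change set fails any of its rules."""

    def __init__(self, errors: dict):
        super().__init__(f"{len(errors)} attribute(s) failed validation")
        self.errors = errors


def save_changes(stored: dict, changes: dict, rules: dict) -> None:
    """Store every change, or none of them."""
    errors = {}
    for name, value in changes.items():
        for rule in rules.get(name, []):
            message = rule(value)
            if message is not None:
                errors.setdefault(name, []).append(message)
    if errors:
        # One invalid value rejects the entire set of changes.
        raise ValidationFailed(errors)
    stored.update(changes)


def not_blank(value: Any) -> Optional[str]:
    return None if str(value).strip() else "A value is required."


stored = {}
try:
    save_changes(stored, {"title": "Hello", "subtitle": " "}, {"subtitle": [not_blank]})
except ValidationFailed as failure:
    print(failure.errors)  # the valid title was not stored either
```

Collecting every error before refusing the save, rather than stopping at the first one, is a design choice here; it lets the editor see all the problems at once.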

Required Attributes

The most basic validation checks that a value exists at all; in other words, whether the attribute is required.

If I’m writing an Article, it might have a Date Published, and this has to exist before the article is considered publishable. If an article existed without a Date Published, several undesirable things might happen – the article might sort oddly, or not at all; or there would be a big blank space in the templated output.

Most systems will allow you to require a value for a particular attribute. When this is enforced, the content cannot be saved until a value is provided.

An example of multiple missing attributes preventing an object from being saved in Drupal.

As discussed above, this validation usually applies to the entire set of changes – there could be 100 other attributes, and saving the entire thing will be prevented by a single missing attribute value.
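A required-value check might look something like the following sketch. The function name and the attribute names are invented for illustration, and treating blank text as "not provided" is an assumption that a real system may or may not share.

```python
def missing_required(content: dict, required: list) -> list:
    """Return the names of required attributes that have no usable value."""
    missing = []
    for name in required:
        value = content.get(name)
        # Assumption: None and blank text both count as "not provided".
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


article = {"title": "Validation", "date_published": None, "author": "  "}
problems = missing_required(article, ["title", "date_published", "author"])
can_save = not problems  # a single missing value is enough to block the save
print(problems)
```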

This can be frustrating for editors, because they might not have the value for some particular reason (for example, if they’re collaborating with someone else on the content), so an editor will often put a placeholder value so the content can be saved, with the intention of fixing it later.

Content in a CMS goes through an editorial lifecycle where a specific version of a content object passes through several states. This is different for every system, but the following are common.

1. When content is loaded into an editorial interface, it is considered checked out to a particular editor. Other editors might see that the content is locked and cannot be edited.
2. When an editor has made their changes and submitted them, content is saved back to the repository, which means it is stored, but not checked in. The current editor still has a lock on it. (Perhaps the editor just needed to go get lunch. Or worse, perhaps they went on vacation.)
3. Content is checked in when the current editor is done working with it, and releases the lock on it. Checking content in necessarily involves saving changes as well. The content can now be checked out by someone else.
4. Content is submitted when an editor begins the process of publishing the content. The process might be direct, or it might be a series of reviews and approvals which have to pass satisfactorily.
5. Content is published when editing and approvals are complete, and the content can be displayed to its intended audience. If the content needs to be edited again, the cycle starts anew, but with a new version of the content. So, Version #1 remains published, and Version #2 is created and checked out, and we go back to Step 1.

The specifics differ between systems, but the lifecycle described above is relatively common. In most systems, validation occurs whenever an editor attempts to save content back to the repository (step #2).
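The lifecycle can be pictured as a small state machine, with validation hooked onto the move to "saved." The sketch below is an assumption-heavy illustration: the `State` and `Version` names and the table of allowed transitions are made up for this example, and real systems will permit different moves.

```python
from enum import Enum


class State(Enum):
    CHECKED_OUT = 1
    SAVED = 2
    CHECKED_IN = 3
    SUBMITTED = 4
    PUBLISHED = 5


# Assumed transitions, following the five steps above; real systems differ.
TRANSITIONS = {
    State.CHECKED_OUT: {State.SAVED},
    State.SAVED: {State.SAVED, State.CHECKED_IN},
    State.CHECKED_IN: {State.CHECKED_OUT, State.SUBMITTED},
    State.SUBMITTED: {State.PUBLISHED},
    State.PUBLISHED: set(),  # further edits happen on a new version
}


class Version:
    def __init__(self, number: int, values: dict):
        self.number = number
        self.values = values
        self.state = State.CHECKED_OUT

    def move_to(self, target: State, is_valid=lambda values: True) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {target.name}")
        # Validation runs on save (step 2), as in most systems.
        if target is State.SAVED and not is_valid(self.values):
            raise ValueError("validation failed; nothing was saved")
        self.state = target

    def new_version(self) -> "Version":
        if self.state is not State.PUBLISHED:
            raise ValueError("only published content starts a new version here")
        return Version(self.number + 1, dict(self.values))


v1 = Version(1, {"title": "Hello"})
for step in (State.SAVED, State.CHECKED_IN, State.SUBMITTED, State.PUBLISHED):
    v1.move_to(step, is_valid=lambda values: bool(values.get("title")))
v2 = v1.new_version()  # Version #1 stays published; Version #2 is checked out
```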

What Is No Value?

A required field cannot be stored without a value, but there’s a more subtle issue here about what “no value” means.

In general programming, there’s some surprisingly deep theory behind the concept of a null value. A null literally means “nothing”: the absence of any value at all, as opposed to things like empty text or zero, which we might casually consider nothing.

For text, humans would usually consider empty text to be no value, but your system might not. A text string of zero characters might still be considered a value. To a computer, this is still a string of text; it just has no characters. This is entirely different than a number or a date.
For a number, we often consider zero to be no value, but it’s not. Zero only means no value when referring to a specific instance of a range – zero ice cream cones certainly means no ice cream cones, but zero is still a value.
For a date…can we have no date? If we talk about a “date,” then it’s tough to say there’s nothing there. In fact, many systems have a completely separate datatype for “nullable date.” When a date is required but not provided, many languages fall back to a default date. For example, Unix-based systems often treat a missing date as timestamp zero, which is January 1, 1970 in UTC and displays as December 31, 1969 in time zones west of UTC.

A lot of this discussion depends on the underlying programming framework. Different languages are more or less forgiving of missing values.

From a processing standpoint, if there’s a default value, you just need to know what it is so you can compare. If you know the default date is “1969-12-31” and that date is unlikely to mean anything in your scenario, you can just compare against that to determine if a value is missing.
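Here is one way the comparison might be written. The `is_missing` function and the set of sentinel dates are assumptions for illustration; the one fixed fact is that Unix timestamp zero is midnight on January 1, 1970 in UTC, which shows up as December 31, 1969 in time zones west of UTC.

```python
from datetime import date, datetime, timezone
from typing import Any

# The Unix epoch: what timestamp 0 means in UTC.
EPOCH = datetime.fromtimestamp(0, tz=timezone.utc).date()
# Assumed sentinel set: the epoch date, plus the previous day, which is how
# timestamp 0 displays in time zones west of UTC.
SENTINEL_DATES = {EPOCH, date(1969, 12, 31)}


def is_missing(value: Any) -> bool:
    """One possible definition of "no value"; adjust it per system."""
    if value is None:
        return True                      # a true null
    if isinstance(value, str):
        return value.strip() == ""       # empty or whitespace-only text
    if isinstance(value, datetime):      # checked first: datetime is also a date
        return value.date() in SENTINEL_DATES
    if isinstance(value, date):
        return value in SENTINEL_DATES   # a default date standing in for null
    return False                         # numbers, including 0, are values


print(is_missing(""), is_missing(0), is_missing(date(1969, 12, 31)))
```

Whether blank text should count as missing is exactly the kind of decision the surrounding discussion describes, so treat that branch as a policy choice, not a rule.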

Logical Validation Rules

Beyond requiring a value, we often want to ensure some logical validity to the content, meaning a value that has some meaning when taken in the context of what the attribute is meant to represent.

Sure, something might be a number at its most basic level, but is it the right kind of number? If we’re compiling a list of movies and we want an attribute to store the year of release, then what do we actually want? We can ask for a number, but “1,976,566.539” is a number, yet that’s clearly way outside of the bounds of what we’re looking for.

However, if we were storing the Gross Domestic Product of a small nation, then perhaps “1,976,566.539” is quite valid.

It could be said that we want a four-digit number – certainly no movies were created before 1000AD, and 10000AD is a long way away. But if we want to get stricter, we could say we want a number between 1878 and…today? Or do you want to also count movies released in the next few years? So, 2025, to be safe?

The larger point here is that primitive datatype validation is not enough. We want to validate the logical concept of what the attribute value is intended to represent.

Some common validation rule types:

Range, for those values that exist on a scale, meaning the value is between two values, or greater or less than a single value
Pattern, for text values that need to be in a particular format (dates must be “yyyy-mm-dd”, for example); these are usually regular expressions (see below)
Length, for text, in both directions (usually to enforce a maximum, but sometimes a minimum)

And there are likely more specific validation rules for more specific attribute types offered by different systems.
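The three rule types can be sketched as small factory functions that each return a test. All names here are hypothetical, and the year bounds simply reuse the 1878 to 2025 example from earlier in this section.

```python
import re
from typing import Any, Callable, Optional

Rule = Callable[[Any], Optional[str]]  # returns an error message, or None


def in_range(low: float, high: float) -> Rule:
    def check(value):
        return None if low <= value <= high else f"Must be between {low} and {high}."
    return check


def matches(pattern: str) -> Rule:
    compiled = re.compile(pattern)

    def check(value):
        # fullmatch requires the whole value to fit the pattern.
        return None if compiled.fullmatch(value) else f"Must match {pattern}."
    return check


def length(minimum: int = 0, maximum: Optional[int] = None) -> Rule:
    def check(value):
        if len(value) < minimum:
            return f"Must be at least {minimum} characters."
        if maximum is not None and len(value) > maximum:
            return f"Must be at most {maximum} characters."
        return None
    return check


release_year = in_range(1878, 2025)
iso_date = matches(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
headline = length(minimum=1, maximum=80)

print(release_year(1977))          # None: passes
print(release_year(1976566.539))   # an error message: a number, but not a year
```

Note that the pattern rule only checks the shape of the text. A value like "2024-02-30" has the right shape and would pass, even though no such date exists.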

Regular Expressions

Validation rules often overlap, especially when regular expression (regex) pattern matching is available.

Regular expressions are a concept in computing where a text value can be interrogated for conformance to a defined pattern. You can require the existence of specific types of characters, in a specific order.

Some examples of simple regular expressions:

To ensure text is between 5 and 10 non-whitespace characters, you might have a regex pattern of ^[\S]{5,10}$. That says, “a non-whitespace character at least five times, but not more than 10, and nothing else.” Without the ^ and $ anchors, many regex engines would accept any longer text that merely contains such a run.
Required text can be represented as [\S]+. That says, “a non-whitespace character one or more times.”
A four-digit year can be ^[0-9]{4}$. That says, “a numeric digit exactly four times, and nothing else.”
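Whether a pattern has to match the whole value or can match anywhere inside it depends on the system. The sketch below uses Python's `re.search`, which matches anywhere, to show why the `^` and `$` anchors matter. The `passes` helper is hypothetical.

```python
import re


def passes(pattern: str, value: str) -> bool:
    """True if the pattern matches anywhere in the value (re.search semantics)."""
    return re.search(pattern, value) is not None


# Unanchored, the four-digit pattern is satisfied by any run of four digits.
print(passes(r"[0-9]{4}", "19770"))          # True: "1977" is found inside
# Anchored with ^ and $, the whole value has to be four digits.
print(passes(r"^[0-9]{4}$", "19770"))        # False
print(passes(r"^[0-9]{4}$", "1977"))         # True
# The length pattern counts non-whitespace characters only.
print(passes(r"^[\S]{5,10}$", "hello"))      # True
print(passes(r"^[\S]{5,10}$", "hi there"))   # False: the space is not \S
# The required pattern needs at least one non-whitespace character somewhere.
print(passes(r"[\S]+", "   "))               # False
print(passes(r"[\S]+", "two words"))         # True
```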

What’s handy is that pattern validation can identify and validate custom patterns the CMS can’t know about beforehand.

For example, if you need an editor to enter a product number, and the format of your product number is three upper-case letters, a dash, four numeric digits, another dash, and one more digit (ex. “GTF-6395-2”), that’s easily validated against the pattern:

^[A-Z]{3}-[0-9]{4}-[0-9]$

Anything not matching that pattern would fail validation.
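To see the product number pattern at work, here is a minimal check in Python. The function name is made up; the pattern is the one given above.

```python
import re

# Three upper-case letters, a dash, four digits, a dash, one digit.
PRODUCT_NUMBER = re.compile(r"^[A-Z]{3}-[0-9]{4}-[0-9]$")


def is_product_number(value: str) -> bool:
    # fullmatch insists that the pattern accounts for the entire value.
    return PRODUCT_NUMBER.fullmatch(value) is not None


for candidate in ("GTF-6395-2", "gtf-6395-2", "GTF-6395-22", "GTF63952"):
    print(candidate, is_product_number(candidate))
```

Only the first candidate passes. The others fail on letter case, an extra digit, and missing dashes respectively.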

Regular expressions don’t “understand” the content other than just as an arbitrary list of characters. Sure, it can make sure something is a string of four numeric digits, but it doesn’t “know” that’s a number, much less that it’s supposed to represent a year.

Regex just understands patterns. It doesn’t care what those patterns mean.

An example setting up a regular expression pattern validation in Sitecore.

The Timing of Validation

As discussed above, validation normally occurs whenever an editor attempts to save content back to the repository, whether that is new content that didn’t previously exist or existing content that has been changed. What this means is content cannot leave the editorial interface until it passes all validation rules.

Remember that “save” is not the same as “publish.” A particular version of content will likely be saved multiple times, then published once. An editor might work on a blog post over the course of a week, for example. The editor will change the content, save it, change it again, save it, over and over, before finally publishing the result.

To relieve the “validation pressure,” some systems can delay validation. Either they run validation rules only before content is published, or different rules can be configured to execute at different phases of the editorial lifecycle – some at save, some at check-in, some at publish, etc.
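One way to picture phase-specific validation is to tag each rule with the earliest phase at which it applies. This is a sketch under assumptions: the phase names, the `PhasedRule` type, and the idea that a rule keeps applying at every later phase are all invented for illustration.

```python
from typing import Callable, NamedTuple, Optional

PHASES = ("save", "check_in", "publish")  # assumed ordering of the lifecycle


class PhasedRule(NamedTuple):
    attribute: str
    phase: str                                # earliest phase that enforces the rule
    test: Callable[[object], Optional[str]]   # returns an error message or None


def required(value):
    return "A value is required." if value in (None, "") else None


def max_length(limit):
    return lambda value: (f"Must be {limit} characters or fewer."
                          if value and len(value) > limit else None)


def validate(content: dict, rules: list, phase: str) -> dict:
    """Run the rules whose phase has been reached; return errors by attribute."""
    reached = PHASES.index(phase)
    errors = {}
    for rule in rules:
        if PHASES.index(rule.phase) <= reached:
            message = rule.test(content.get(rule.attribute))
            if message:
                errors.setdefault(rule.attribute, []).append(message)
    return errors


RULES = [
    PhasedRule("title", "save", required),
    PhasedRule("title", "save", max_length(80)),
    PhasedRule("date_published", "publish", required),
]

draft = {"title": "Rough draft", "date_published": None}
print(validate(draft, RULES, "save"))     # {}
print(validate(draft, RULES, "publish"))  # {'date_published': ['A value is required.']}
```

The draft can be saved as often as the editor likes, and the missing date only becomes a problem at publish time.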

Loosening the timing of validation rules can make life easier for editors. Sometimes it’s helpful to allow looser rules during content development so editors can “rough in” content without any requirement to pass all validation. As long as content validation checks happen prior to publish, then there’s no risk invalid content will be exposed to the public.

If validation is delayed, anything that processes the content before it is published (a preview template, for example) needs to check for missing values and handle them.

If an editor wanted to preview her blog post, for example, the template would need to be coded to handle a situation where the incomplete blog post might not have a subtitle. In this case, the template could either just skip that section completely, or render some placeholder text (“Subtitle Goes Here”).
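A template that tolerates a missing subtitle might be sketched like this. The function and the markup are hypothetical, and showing the placeholder only in preview mode is just one way to combine the two options.

```python
from html import escape


def render_post(post: dict, preview: bool = False) -> str:
    """Render a blog post, tolerating a subtitle that has not been entered yet."""
    parts = [f"<h1>{escape(post['title'])}</h1>"]
    subtitle = post.get("subtitle")
    if subtitle:
        parts.append(f"<h2>{escape(subtitle)}</h2>")
    elif preview:
        # In preview, show the editor where the subtitle will appear.
        parts.append('<h2 class="placeholder">Subtitle Goes Here</h2>')
    # Outside preview, a missing subtitle is skipped entirely.
    parts.append(f"<div>{escape(post.get('body', ''))}</div>")
    return "\n".join(parts)


print(render_post({"title": "Draft post", "body": "Work in progress"}, preview=True))
```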

Default Values

Some systems will allow the assignment of default values to attributes.

If an Article needs a Date Published value, then the system might enter the current date as the default value when an editor is creating a new content object of that type. Normally, we wouldn’t consider this validation, but there’s a slight twist to default values that blurs the boundary a bit.

Default values can operate on two models:

A default value can simply pre-fill the editorial element to provide a value on new object creation, which is just a usability feature for editors. If a particular value is common (example: the current date as the Date Published for a blog post), then providing a default saves some keystrokes and provides some suggestive value. An editor can always change the default value if desired.
Alternately, a default value can be added prior to save only when nothing is entered for an attribute. So, if an attribute value is required, the interface might not enforce that and allow the editor to press the “Save” button without a value, but the default value will be added before the content is actually saved.

That latter feature is just validation in another form. It’s making an attribute required, but using logical processing to fill in the value when it’s not provided.
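The two models can be contrasted in a few lines. Everything named here (`DEFAULTS`, `new_form_values`, `apply_defaults_on_save`) is hypothetical, and the current date as a default for Date Published is taken from the example above.

```python
from datetime import date

# Defaults are callables so that "today" is evaluated when it is needed.
DEFAULTS = {"date_published": date.today}


def new_form_values(attributes: list) -> dict:
    """Model 1: pre-fill the editorial form; the editor can change or clear it."""
    return {name: DEFAULTS[name]() if name in DEFAULTS else None for name in attributes}


def apply_defaults_on_save(content: dict) -> dict:
    """Model 2: fill in a default only where the editor left nothing."""
    result = dict(content)
    for name, make_default in DEFAULTS.items():
        if result.get(name) in (None, ""):
            result[name] = make_default()
    return result


form = new_form_values(["title", "date_published"])
saved = apply_defaults_on_save({"title": "Hello", "date_published": ""})
print(form, saved)
```

Under the second model an editor who deliberately clears the date still ends up with one, which is the source of the confusion discussed next.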

There are some usability concerns here, since it can confuse editors if they’re actually trying to indicate no value but the content continues to be published with a value added. However, this would be either (1) a training issue, or (2) a documentation issue, since this should be noted in some help text somewhere in the interface.

The “forced default value” feature can also be enforced with API-level programming against an event model. Many systems will provide a “Content Saving” or “Before Content Save” event, in which values can be modified (and a default value is added) before storage.
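In code, a “before save” hook could look roughly like the following. This `Repository` class and its `before_save` method are a toy event model written for this example; they do not reflect the API of any actual CMS.

```python
class Repository:
    """A toy content store with a hypothetical "before save" event."""

    def __init__(self):
        self.items = {}
        self._before_save = []

    def before_save(self, handler):
        """Register a handler; usable as a decorator."""
        self._before_save.append(handler)
        return handler

    def save(self, key, content):
        content = dict(content)
        for handler in self._before_save:
            handler(content)  # handlers may modify the values in place
        self.items[key] = content


repo = Repository()


@repo.before_save
def force_default_author(content):
    # Forced default: applied only when the editor provided nothing.
    if not content.get("author"):
        content["author"] = "Staff"


repo.save("post-1", {"title": "Hello", "author": ""})
print(repo.items["post-1"])
```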

We’ll talk more about event models and event programming when we discuss APIs.

Rule Precedence and Conflicts

Attributes should be expected to support multiple validation rules.

This seems straightforward, but if an attribute is governed by more than one rule:

Do they execute in a particular order?
Does one rule failure prevent the others from executing? (If so, then they would logically have to be in a specified order.)
Does a particular rule only execute under specific circumstances?

For example, if a value is required, and forced to conform to a particular pattern or range, is the latter rule dependent on the former?

This takes us back to our prior discussion of what “no value” means – if a numeric value is required to be between 5 and 10, and the editor enters nothing, which rule is it failing? That nothing was entered at all, or that we assumed nothing meant “0” and that’s out of range? That distinction seems academic, but a lot of systems would display both error messages in that situation.

Two error messages might just be annoying, but what if you didn’t want to require a value? So your validation logic is: only if a value is entered, it must be between 5 and 10. This means the value is no longer required, and no value should pass validation.

Some systems have provisions for conditional validation which will apply some logic to validation. For example, a rule might only run if a value was entered – no value is considered valid and requires no further validation. Or, the rules might run in series, with the first failure short-circuiting all the rules following it.
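Both behaviors can be sketched briefly: a wrapper that makes a rule conditional on a value being present, and a runner that stops at the first failure. The function names are hypothetical, and the range of 5 to 10 is the example used above.

```python
from typing import Any, Callable, Optional

Rule = Callable[[Any], Optional[str]]  # returns an error message, or None


def required(value):
    return "A value is required." if value in (None, "") else None


def between(low, high):
    def check(value):
        return None if low <= value <= high else f"Must be between {low} and {high}."
    return check


def only_if_present(rule: Rule) -> Rule:
    """Conditional rule: no value is valid and needs no further checks."""
    return lambda value: None if value in (None, "") else rule(value)


def run_in_series(rules, value):
    """Return the first failure only; later rules are short-circuited."""
    for rule in rules:
        message = rule(value)
        if message is not None:
            return [message]
    return []


# Required and in range: the missing value fails once, not twice.
print(run_in_series([required, between(5, 10)], None))         # ['A value is required.']
# Optional, but in range when given: nothing entered is fine.
print(run_in_series([only_if_present(between(5, 10))], None))  # []
print(run_in_series([only_if_present(between(5, 10))], 3))     # ['Must be between 5 and 10.']
```

Running the rules in a fixed order also means the range rule never has to guess what a missing value means, because it is never reached.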

This logic can get tricky, and it’s likely the rules for these situations are baked into the system. When you run into limits here, you’ll often have to resort to custom validation.

We’ll talk about custom validation of attribute values in a later chapter.

Enforcing attribute validation is a key step in making a content model resilient. It prevents editors from making mistakes which cause more serious errors and confusion later in the publishing pipeline.

Additionally, it helps the CMS understand the content model and provide more contextual help when editors are trying to create content. An error isn’t universally undesirable – many times, it’s a key component in training editors and assisting in their discovery of the system and the content model.

Now let’s talk about how a CMS can use the developing model to assist editors further.

Evaluation Questions

What validation rules are globally available for all attribute types?
Can attribute values be required?
Can text-based attributes be validated by regular expressions or other pattern matching?
When is attribute validation executed – only on save, or can different validation rules be assigned to different points in the editorial lifecycle?
How are multiple triggered validation rules represented to editors, and can the execution of rules be conditional on the status of other rules?
Can default values be assigned to attributes, and are these merely pre-entered in the interface, or are they assigned when no other value is provided?