Middle Ground: Content Management using Static HTML

By Deane Barker

There needs to be a way to reconcile content management and static HTML.

I’ve been toying with an idea lately, and instead of actually doing it (don’t have the time), I’m going to throw it out here for fun. My idea is for an extremely simplistic content management system – one based on HTML files and a scheduled file system crawl.

First, some things I believe:

Creating content on the Web is, for the most part, structurally simple. A movie review, for instance, is a title, a big chunk of text, and perhaps a movie poster and the number of stars. A product listing is a title, a big chunk of text, and maybe a handful of specifications.

There are always exceptions, of course, but the fact remains that you don’t see a lot of really complicated, relational databases jammed into content management systems. Most systems manage pretty loose content.

At the same time that I’ve come to believe what I’ve written above, I’ve had some experience managing some larger static sites. When you manage a static site of 100 pages or more, you quickly run into two big problems:

  1. Enforcing consistency between pages

  2. Managing menus and index pages

Here’s how I handle the first one, and here’s a theory I’d like to try on the second.

Enforcing Consistency

I have a PHP prepend and append file, so every hit to a PHP page gets “book-ended” by these two files. The prepend file starts buffering, and the first thing the append file does is read the buffer into a variable. Nothing too out of the ordinary there.

But then the append file compares the URL against a series of regular expressions, stopping on the first one that matches. Based on the match, the append file inserts HTML just under the open BODY tag, and just above the close BODY tag. It also appends a suffix to the TITLE tag, and adds a stylesheet link just under the TITLE tag. (And I’m modifying it this weekend to also insert a submenu, when specified.)
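
To make that concrete, a stripped-down version of the two files might look something like the sketch below. The URL patterns, fragment file names, and stylesheet paths are all placeholders; the mechanics are the point.

    <?php
    // prepend.php -- wired up as auto_prepend_file in php.ini or .htaccess.
    // All it does is start buffering; the real work happens in the append file.
    ob_start();

    <?php
    // append.php -- wired up as auto_append_file.
    $page = ob_get_clean();

    // URL patterns mapped to header/footer fragments, a TITLE suffix, and a
    // stylesheet. First match wins. (Every name here is a placeholder.)
    $rules = array(
        '#^/movies/#' => array('header' => 'inc/movies-header.html',
                               'footer' => 'inc/movies-footer.html',
                               'suffix' => ' | Movie Reviews',
                               'css'    => '/css/movies.css'),
        '#^/#'        => array('header' => 'inc/main-header.html',
                               'footer' => 'inc/main-footer.html',
                               'suffix' => ' | My Site',
                               'css'    => '/css/main.css'),
    );

    foreach ($rules as $pattern => $rule) {
        if (!preg_match($pattern, $_SERVER['REQUEST_URI'])) {
            continue;
        }
        // HTML just under the open BODY tag, and just above the close BODY tag.
        $page = preg_replace('#<body[^>]*>#i', '$0' . file_get_contents($rule['header']), $page, 1);
        $page = preg_replace('#</body>#i', file_get_contents($rule['footer']) . '$0', $page, 1);
        // A suffix on the TITLE, and a stylesheet link just under it.
        $page = preg_replace('#</title>#i',
                             $rule['suffix'] . '$0' . '<link rel="stylesheet" href="' . $rule['css'] . '">',
                             $page, 1);
        break;
    }

    echo $page;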

What this means is that when I create a static HTML page for a site under this system, I don’t have to worry about header or footer includes, TITLE tag format, stylesheets – anything like that. I just compose in simple, unformatted HTML, and when the page is requested, it gets processed. Put another way, all the formatting of the page, from headers to footers to styles, is centrally controlled. The page author has no choice.

It also means you can vary things greatly by section. The main part of your site can look a certain way, and a subsection (designated by URL pattern) can look completely different. Put a page in Folder A and it looks one way, but move it to Folder B and it looks completely different.

This system has worked extremely well for me, and has enabled me to keep a hundred or so static HTML pages totally consistent with each other. What it ultimately means is that the page on the file system stays “pure.” There’s no need for PHP code, file includes, stylesheet references, etc. All that’s in the file is the actual content that’s supposed to be there.

Managing Menus and Index Pages

How do I maintain an index page of news articles without hand-coding it and updating it whenever I add or remove a page? How do I keep track of what goes on the front page of the site? And if I delete an HTML file, how do I know all the index pages in which it appears so I can remove reference to it?

The bottom line is that even when you have your static files managed as perfectly as possible, you still have problems relating all this content and keeping it organized and accessible. So how do you cross that chasm without going to full-blown content management?

Here’s an idea:

Create a scheduled process that crawls your HTML files and converts them to database records. Then use these records to power your index pages and other dynamic sections of your site.

An example:

Say I have a folder full of movie reviews. Each one is a static HTML page. I want to have an index page listing all the reviews. This is actually pretty simple – I just have a scheduled process that crawls the folder, extracts the TITLE tag from each file, and logs it with the filename in a database table. Run that process once an hour, and then pull from the database table to run your index page.
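
Under some assumptions (file locations, and a pages table with filename, title, and active columns, all of which are made up here), both halves might look something like this sketch:

    <?php
    // crawl.php -- run from cron, say once an hour.
    // Walk the reviews folder, pull the TITLE from each file, and log it
    // with the filename.
    $db = new PDO('sqlite:/var/data/site-index.db');
    $db->exec("CREATE TABLE IF NOT EXISTS pages (
                   filename TEXT PRIMARY KEY,
                   title    TEXT,
                   active   INTEGER DEFAULT 1
               )");
    $insert = $db->prepare("REPLACE INTO pages (filename, title, active) VALUES (?, ?, 1)");

    foreach (glob('/var/www/reviews/*.html') as $file) {
        $doc = new DOMDocument();
        @$doc->loadHTMLFile($file);   // quiet the warnings sloppy HTML throws

        $titles = $doc->getElementsByTagName('title');
        $title  = $titles->length ? trim($titles->item(0)->textContent) : basename($file);

        $insert->execute(array('/reviews/' . basename($file), $title));
    }

    <?php
    // index.php -- the index page just reads from the table.
    $db = new PDO('sqlite:/var/data/site-index.db');
    echo "<ul>\n";
    foreach ($db->query("SELECT filename, title FROM pages WHERE active = 1 ORDER BY title") as $row) {
        printf('<li><a href="%s">%s</a></li>' . "\n",
               htmlspecialchars($row['filename']), htmlspecialchars($row['title']));
    }
    echo "</ul>\n";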

But what if I wanted to have the star ratings and a one-line summary of the review on the index page too? Where do I put that stuff? In the page META. Have a META tag for description and another for star_rating. Then, when your process crawls the folder, log those in separate database fields.

(Yes, yes, there are potential datatype issues here. But your users just need to be careful and be notified when there’s been a problem. If the crawler finds anything other than an integer in the star_rating META tag, it skips it and logs an error.)
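
The META handling and the integer check could slot into that crawl loop something like this (description and star_rating being the hypothetical META names from above):

    <?php
    // Inside the crawl loop: pull the two META tags, and keep star_rating
    // only if it really is an integer -- otherwise skip it and log an error.
    $meta = array('description' => null, 'star_rating' => null);
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $name = strtolower($tag->getAttribute('name'));
        if (array_key_exists($name, $meta)) {
            $meta[$name] = trim($tag->getAttribute('content'));
        }
    }

    if ($meta['star_rating'] !== null && !ctype_digit($meta['star_rating'])) {
        error_log("crawl: non-integer star_rating in $file -- skipping it");
        $meta['star_rating'] = null;
    }

    // Then add description and star_rating to whatever the insert writes.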

Some other thoughts on this.

There’s no need to change an admin interface when you do this (the HTML file is the admin interface). And if you put your META fields in a table in key-value format, you don’t even have to change your data model when you start or stop using a certain META tag. The indexer would just log everything it found without question.
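
In key-value form, that table could be as simple as the sketch below (still hypothetical names, and still sitting inside the same crawl loop):

    <?php
    // One row per META tag per page. No schema change when authors start
    // or stop using a tag -- the indexer logs whatever it finds.
    $db->exec("CREATE TABLE IF NOT EXISTS page_meta (
                   filename   TEXT,
                   meta_key   TEXT,
                   meta_value TEXT
               )");

    $log = $db->prepare("INSERT INTO page_meta (filename, meta_key, meta_value) VALUES (?, ?, ?)");
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        if ($tag->getAttribute('name') !== '') {
            $log->execute(array('/reviews/' . basename($file),
                                $tag->getAttribute('name'),
                                $tag->getAttribute('content')));
        }
    }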

But what happens when a file gets deleted? Between crawls, its database record is still sitting there, feeding the index pages. To handle this, use the 404 as an alert: when a page is not found, have the 404 page look for the database record corresponding to the missing file and disable it, so that page reference instantly comes out of all index pages.
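
A minimal version of that 404 handler, assuming Apache’s ErrorDocument points at it and the same hypothetical pages table:

    <?php
    // 404.php -- wired up as: ErrorDocument 404 /404.php
    // If the missing file is one we indexed, flip it off so it drops out
    // of every index page immediately.
    $db = new PDO('sqlite:/var/data/site-index.db');
    $stmt = $db->prepare("UPDATE pages SET active = 0 WHERE filename = ?");
    $stmt->execute(array($_SERVER['REQUEST_URI']));

    header('HTTP/1.0 404 Not Found');
    echo '<h1>Page not found</h1>';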

So that’s the gist of it. The HTML file itself really becomes the database editing interface – it’s the bridge between the user and the database. The user can manage the file however he or she feels like it. At a certain interval (or on demand), the files get converted into database records which are awfully easy to query, manipulate, and display.

(Note that anything I said here about database tables goes the same for search engine indexers. A search engine like Swish-E could do pretty much everything I’ve described and it’s monstrously fast. Running on a 2.4GHz P4, Swish-E indexes all 4,600 HTML files on this site in eight seconds of CPU time. See this post.)

I envision a simple Web interface where the site admin can log in, then:

  1. Traverse the HTML folder structure and view files

  2. Re-index individual files or entire folders on-demand

  3. Kick off a full-scale index of the entire site

  4. Browse the logged meta

  5. Run test SQL

  6. See the results – including error reports – of previous crawls

  7. Specify headers, footers, stylesheets, and submenus for various URL patterns

Of course, this system only works if the users are managing their files via an HTML editor. But I think a lot of users could, and certainly most Web developers. I think there’s a fair number of situations where it could work very well.

And yes, this is simplistic. But it really bridges the gap between a big stack of HTML files and full-blown content management. Call it middle ground.
