Keyword Categorization: Thinking Out Loud

By Deane Barker

Here’s the problem with taxonomies and content categorization schemes: no one will maintain them. You can set up the greatest content tree or grouping structure in the world, but sooner or later, content authors (yourself included) are going to get complacent.

That’s because the value-add is on the reader’s end of the equation, not the author’s, so there’s not a compelling reason for an author to go to the trouble. What do you do about this?

This, then, is an exercise in thinking out loud.

Categorization can fall into two schemes: passive and active. Active is the most familiar: you add an item, then you decide where to stick it in the category map. However, as I mentioned, sooner or later, people are going to get lax in doing this. Passive, on the other hand, is just that: when someone adds an item, the system “knows” where it fits automatically, based on criteria and meta information.

Passive Categorization: An Example

When I was working with Inktomi Enterprise Search (now owned by Verity), they had an add-on for their search engine called “Content Categorization Engine.” This was a taxonomy where each node was simply a saved search. For a node on “content management,” for instance, the system ran a search for all documents with the phrase “content management” in the keywords (you could design your own search for every node, of course). It didn’t matter where a document was in the file system; it would be grouped under this node just by virtue of its keywords. Nothing was actively assigned – everything was a search from scratch.

It worked great for folder structures too. You could easily configure a search to return all documents located in a certain folder. So someone could create an HTML document, apply meta (title, author, description, keywords, etc.) and put it in a folder. By doing this, the author had unknowingly contributed to the taxonomy. A link to the document would now appear under that node in the tree, complete with title, author, description, abstract, date created, etc. The author would have no choice in the matter – the system just “knew” about the document and enforced the taxonomy.
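To make the "every node is a saved search" idea concrete, here is a minimal Python sketch. It is not how Inktomi's product worked internally; the names (Document, Node, keyword_search, folder_search, build_tree) and the sample paths are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    path: str
    title: str
    keywords: list = field(default_factory=list)


@dataclass
class Node:
    # A taxonomy node is just a name plus a saved search (a predicate).
    name: str
    search: Callable[[Document], bool]


def keyword_search(phrase):
    """Saved search: the phrase appears in any of the document's keywords."""
    phrase = phrase.lower()
    return lambda doc: any(phrase in k.lower() for k in doc.keywords)


def folder_search(folder):
    """Saved search: the document lives somewhere under the given folder."""
    prefix = folder.rstrip("/") + "/"
    return lambda doc: doc.path.startswith(prefix)


def build_tree(nodes, docs):
    # Nothing is stored per document; every node is recomputed from scratch.
    return {n.name: [d.title for d in docs if n.search(d)] for n in nodes}


# Hypothetical sample data.
docs = [
    Document("/intranet/cms/intro.html", "Intro to CMS", ["content management", "taxonomy"]),
    Document("/intranet/hr/policy.html", "Leave Policy", ["hr"]),
    Document("/intranet/misc/notes.html", "Vendor Notes", ["Content Management"]),
]
nodes = [
    Node("content management", keyword_search("content management")),
    Node("HR folder", folder_search("/intranet/hr")),
]
tree = build_tree(nodes, docs)
```

Note that the author of "Vendor Notes" never picked a category. The document lands under "content management" purely because of its keywords, wherever it sits in the file system.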

Plans for Gadgetopia: Keyword-based Categorization

This, I think, is where we’re headed with Gadgetopia. We’ve had some discussions recently about the correct practice for follow-up entries and how to link related entries together. Categorizing everything actively would lead to too many categories, so we started adding keywords. The keywords are linked from the permalink page, and you can click them to search for that term in the site.

One drawback, however, is that just because an entry contains the phrase “Bill Gates” doesn’t necessarily mean it’s about Bill Gates. Wouldn’t it be nice if the search would first grab only those entries where the search term appears in the keywords, since having the target text in the keywords would indicate that the entry has a high relation to what’s being searched for? Well, we’ve done this…sort of:

This is a search page we’re working on.

It first returns matches in the keywords of an entry, then returns matches in any field of an entry. So, by clicking on a keyword from one entry, you’ll first be shown all the other entries that share that keyword. Thus, we’ve created an organic, passive, open-ended categorization system.
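Roughly, the ordering works like the Python sketch below. This is an illustration under assumptions, not the code behind our search page: the entry fields (id, title, body, keywords) and the function name tiered_search are hypothetical.

```python
def tiered_search(term, entries):
    """Return keyword matches first, then matches anywhere else in the entry."""
    term = term.lower()

    # Tier 1: the term is one of the entry's keywords.
    keyword_hits = [
        e for e in entries
        if any(term == k.lower() for k in e["keywords"])
    ]
    seen = {e["id"] for e in keyword_hits}

    # Tier 2: the term appears in any field, skipping entries already listed.
    def full_text(e):
        return " ".join([e["title"], e["body"], " ".join(e["keywords"])]).lower()

    other_hits = [
        e for e in entries
        if e["id"] not in seen and term in full_text(e)
    ]
    return keyword_hits + other_hits


# Hypothetical sample entries.
entries = [
    {"id": 1, "title": "Gates keynote at CES", "body": "Notes from the keynote.",
     "keywords": ["Bill Gates", "CES"]},
    {"id": 2, "title": "New tablet PCs", "body": "Bill Gates mentioned these in passing.",
     "keywords": ["tablet pc"]},
    {"id": 3, "title": "Linux on the desktop", "body": "Nothing relevant here.",
     "keywords": ["linux"]},
    {"id": 4, "title": "Antitrust news", "body": "The latest ruling.",
     "keywords": ["microsoft", "bill gates"]},
]
result_ids = [e["id"] for e in tiered_search("Bill Gates", entries)]
```

Entries 1 and 4 come back first because "Bill Gates" is in their keywords. Entry 2 only mentions him in passing, so it trails behind.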

Whither the Category List?

But what about the category headers in the sidebar? Well, we have an idea for that too.

We’ll index our keywords: have a database table of all the keywords that are in use. We’ll have a page that lists all of these words, grouped to show how often each keyword has been used. This will give you a good idea of the subject matter of Gadgetopia overall, and if you restrict by date (show keywords in entries posted in the last month, for instance), you can see what we’ve been talking about lately.

A keyword will be considered a “real” category when it’s been used, say, 10 times. So the list of categories will really be a list of keyword frequency with a floor of 10 occurrences. The tenth time we use a certain keyword – voila – it’s in the category list.
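Here is a rough Python sketch of the frequency list and the category floor. It stands in for the database table with a plain list of entries; the field names (posted, keywords) and the function names are assumptions for illustration only.

```python
from collections import Counter
from datetime import date


def keyword_frequency(entries, since=None):
    """Count how many entries use each keyword, optionally restricted by date."""
    counts = Counter()
    for e in entries:
        if since is None or e["posted"] >= since:
            # Use a set so a keyword counts once per entry.
            counts.update({k.lower() for k in e["keywords"]})
    return counts


def category_list(counts, floor=10):
    """Keywords that have reached the floor, most frequent first."""
    promoted = [k for k, n in counts.items() if n >= floor]
    return sorted(promoted, key=lambda k: (-counts[k], k))


# Hypothetical data: "CMS" is used 10 times, "RSS" only 9.
entries = [
    {"posted": date(2004, 1, i + 1),
     "keywords": ["CMS"] + (["RSS"] if i < 9 else [])}
    for i in range(10)
]
all_time = keyword_frequency(entries)
recent = keyword_frequency(entries, since=date(2004, 1, 6))
categories = category_list(all_time)
```

With this data, "cms" makes the category list and "rss" is one use short. The since argument gives the "what have we been talking about lately" view.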

Related Keywords

Using the same database table mentioned in the last section opens up another handy tool: related keywords.

Using the system we’ve built (in our heads, anyway :-), it will be simple to show a user the frequency of keywords used by other entries that share keywords with the entry they’re looking at. So, when the entry displays, the system would find all the entries that it shares keywords with, then display a list of all the keywords of those entries, grouped by keyword to show frequency.

For example, an entry on “Documentum” might share keywords with 20 other entries. In the aggregate list of keywords for those 20 entries, the keyword “Interwoven” may appear 14 times, which would put “Interwoven” in the top spot for the related keywords. The user can then jump freely from keyword to keyword.
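In Python, the related-keywords lookup might look something like this sketch. The data shape (a dict of entry id to keyword list) and the name related_keywords are hypothetical. One assumption worth flagging: the sketch leaves out the current entry's own keywords, since those would otherwise crowd the top of the list.

```python
from collections import Counter


def related_keywords(entry_id, entries):
    """Rank the keywords of every entry that shares a keyword with this one."""
    current = set(entries[entry_id])
    counts = Counter()
    for other_id, keywords in entries.items():
        if other_id == entry_id:
            continue
        keywords = set(keywords)
        if current & keywords:
            # Assumption: skip keywords the current entry already has.
            counts.update(keywords - current)
    return counts.most_common()


# Hypothetical sample data: entry id -> keywords.
entries = {
    1: ["documentum", "cms"],
    2: ["documentum", "interwoven"],
    3: ["cms", "interwoven", "vignette"],
    4: ["linux"],
    5: ["documentum", "vignette", "interwoven"],
}
related = related_keywords(1, entries)
```

Here entry 1 shares keywords with entries 2, 3, and 5, and "interwoven" shows up in all three, so it takes the top spot.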

There are, of course, some technical difficulties in implementing this right now. The host on which Gadgetopia runs has an older version of MySQL. Boolean full-text searching didn’t come along until MySQL 4.0, so the sample search page above runs on LIKE queries. Given that performance hit, we’re not quite ready to release it lest we cripple the server. We’ll likely wait until Gadgetopia moves to a new host.
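For the curious, a keywords-first search built on LIKE could look something like the sketch below. It uses Python's built-in sqlite3 module rather than MySQL so it can run anywhere, and the table layout (entries with id, title, body, and a comma-separated keywords column) is an assumption, not our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT, body TEXT, keywords TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [
        (1, "Gates keynote at CES", "Notes from the keynote.", "Bill Gates, CES"),
        (2, "New tablet PCs", "Bill Gates mentioned these in passing.", "tablet pc"),
        (3, "Linux on the desktop", "Nothing relevant here.", "linux"),
        (4, "Antitrust news", "The latest ruling.", "microsoft, bill gates"),
    ],
)


def like_search(conn, term):
    """Keyword matches (tier 0) first, then matches in any other field (tier 1)."""
    # Note: % and _ inside the term are not escaped in this sketch.
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT id,
               CASE WHEN keywords LIKE ? THEN 0 ELSE 1 END AS tier
        FROM entries
        WHERE keywords LIKE ? OR title LIKE ? OR body LIKE ?
        ORDER BY tier, id
        """,
        (pattern, pattern, pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]


ids = like_search(conn, "bill gates")
```

Because every pattern starts with a wildcard, the database generally can't use an ordinary index and has to scan every row on every search, which is where the performance worry comes from.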

So, that’s the plan in a nutshell. If you have comments about or experience with this idea, post a comment.
