Improved Search

By Deane Barker

I worked on improving the search on this site today. Search has been through a number of iterations. First, I used the basic Movable Type search. But it was slow and I wanted to do some interesting things with search.

So last year, I switched to using a SQL “LIKE” query to return two-tiered results, first from the title and keywords, then from the body and extended body. This has worked really well so far.

However, one thing bothered me about it: LIKE has no concept of word boundaries. This…

WHERE entry_text LIKE '%date%'

…returns matches for “date,” “update,” “datebook,” “candidateship,” etc. So, I changed it a while ago to this syntax:

WHERE CONCAT(' ',entry_text,' ') LIKE CONCAT('% ','date',' %')

Basically, I started looking for the search term with a space on either side. This turned out to be an astoundingly stupid way to do it. What happens when the search term ends a sentence? Or begins the entry? Newlines, periods, and question marks are not spaces. Duh.

Today, however, I found the right way to do it:

WHERE entry_text RLIKE '[[:<_3a_5d_5d_date5b_5b_3a_>:]]'

Those bracket-y things are MySQL’s character classes for word boundaries. So it’s like my “tack a space on either side” method, but it includes anything that’s not a word. Very handy.

(But why don’t I just use MySQL’s full-text indexing, you ask? Because this is a hosted server, so I have to live with all the default MySQL settings. And the default settings exclude any word of less than four characters from being indexed. So three-letter acronyms like PHP, PDF, JSP, etc. wouldn’t be in the index, and for a site like this, that’s kind of a showstopper.)

I also did a little hacking of the search phrase you submit. For instance, searching for…

windows update

…will give you different results than if you search for…

“windows update”

Essentially, I tokenize the string but group quoted passages together. So, this…

“four score and seven years ago” abraham lincoln gettysburg

…would get tokenized like this:

four score seven years ago
abraham
lincoln
gettysburg

Finally, I included some stop words to save my database some work. This…

The penguin is the mascot of Linux

…gets reduced to this…

penguin mascot linux

So, I’m hoping to shake a few bugs out of it in the next few weeks, then I’ll post the class so anyone who wants to can take a look at it.

This is item #317 in a sequence of 357 items.

You can use your left/right arrow keys to navigate