Improved Search

By Deane Barker • Posted on July 11, 2004

I worked on improving the search on this site today. Search has been through a number of iterations. First, I used the basic Movable Type search. But it was slow and I wanted to do some interesting things with search.

So last year, I switched to using a SQL “LIKE” query to return two-tiered results, first from the title and keywords, then from the body and extended body. This has worked really well so far.

However, one thing bothered me about it: LIKE has no concept of word boundaries. This...

WHERE entry_text LIKE '%date%'

...returns matches for “date,” “update,” “datebook,” “candidateship,” etc. So, I changed it a while ago to this syntax:

WHERE CONCAT(' ',entry_text,' ') LIKE CONCAT('% ','date',' %')

Basically, I started looking for the search term with a space on either side. This turned out to be an astoundingly stupid way to do it. What happens when the search term ends a sentence? Or begins the entry? Newlines, periods, and question marks are not spaces. Duh.

Today, however, I found the right way to do it:

WHERE entry_text RLIKE '[[:<_3a_5d_5d_date5b_5b_3a_>:]]'

Those bracket-y things are MySQL’s character classes for word boundaries. So it’s like my “tack a space on either side” method, but it includes anything that’s not a word. Very handy.

(But why don’t I just use MySQL’s full-text indexing, you ask? Because this is a hosted server, so I have to live with all the default MySQL settings. And the default settings exclude any word of less than four characters from being indexed. So three-letter acronyms like PHP, PDF, JSP, etc. wouldn’t be in the index, and for a site like this, that’s kind of a showstopper.)

I also did a little hacking of the search phrase you submit. For instance, searching for...

windows update

...will give you different results than if you search for...

“windows update”

Essentially, I tokenize the string but group quoted passages together. So, this...

“four score and seven years ago” abraham lincoln gettysburg

...would get tokenized like this:

four score seven years ago
abraham
lincoln
gettysburg

Finally, I included some stop words to save my database some work. This...

The penguin is the mascot of Linux

...gets reduced to this...

penguin mascot linux

So, I’m hoping to shake a few bugs out of it in the next few weeks, then I’ll post the class so anyone who wants to can take a look at it.

This is item #314 in a sequence of 353 items.

You can use your left/right arrow keys or swipe left/right to navigate