Improved Search

By Deane Barker

The author has been working on improving the search on their site, which has been through several iterations. They have updated the SQL “Like” query to return two-tiered results, and have found a better way to handle word boundaries using MySQL’s character classes. They have also implemented tokenization for search phrases and included stop words to save database work.

Generated by Azure AI on June 24, 2024

I worked on improving the search on this site today. Search has been through a number of iterations. First, I used the basic Movable Type search. But it was slow and I wanted to do some interesting things with search.

So last year, I switched to using a SQL “LIKE” query to return two-tiered results, first from the title and keywords, then from the body and extended body. This has worked really well so far.

However, one thing bothered me about it: LIKE has no concept of word boundaries. This…

WHERE entry_text LIKE '%date%'

…returns matches for “date,” “update,” “datebook,” “candidateship,” etc. So, I changed it a while ago to this syntax:

WHERE CONCAT(' ',entry_text,' ') LIKE CONCAT('% ','date',' %')

Basically, I started looking for the search term with a space on either side. This turned out to be an astoundingly stupid way to do it. What happens when the search term ends a sentence? Or begins the entry? Newlines, periods, and question marks are not spaces. Duh.

Today, however, I found the right way to do it:

WHERE entry_text RLIKE '[[:<_3a_5d_5d_date5b_5b_3a_>:]]'

Those bracket-y things are MySQL’s character classes for word boundaries. So it’s like my “tack a space on either side” method, but it includes anything that’s not a word. Very handy.

(But why don’t I just use MySQL’s full-text indexing, you ask? Because this is a hosted server, so I have to live with all the default MySQL settings. And the default settings exclude any word of less than four characters from being indexed. So three-letter acronyms like PHP, PDF, JSP, etc. wouldn’t be in the index, and for a site like this, that’s kind of a showstopper.)

I also did a little hacking of the search phrase you submit. For instance, searching for…

windows update

…will give you different results than if you search for…

“windows update”

Essentially, I tokenize the string but group quoted passages together. So, this…

“four score and seven years ago” abraham lincoln gettysburg

…would get tokenized like this:

four score seven years ago
abraham
lincoln
gettysburg

Finally, I included some stop words to save my database some work. This…

The penguin is the mascot of Linux

…gets reduced to this…

penguin mascot linux

So, I’m hoping to shake a few bugs out of it in the next few weeks, then I’ll post the class so anyone who wants to can take a look at it.

This is item #320 in a sequence of 361 items.

You can use your left/right arrow keys to navigate