May 02, 2009

Stemming Words in MySQL

Stemming words using mysql user defined function looks very promising.
For e.g. the user might search for "testing", "tested" or "tests". All these words are related to each other because the base word "test" is common in all of them. the stem_word function will find out what the base word is for each word saved in the database and match it against the base word of the user input.

mysql> select stem_word('en', 'testing');
+----------------------------+
| stem_word('en', 'testing') |
+----------------------------+
| test |
+----------------------------+
1 row in set (0.00 sec)

mysql> select stem_word('en', 'tested');
+---------------------------+
| stem_word('en', 'tested') |
+---------------------------+
| test |
+---------------------------+
1 row in set (0.00 sec)

mysql> select stem_word('en', 'tests');
+--------------------------+
| stem_word('en', 'tests') |
+--------------------------+
| test |
+--------------------------+
1 row in set (0.00 sec)

The stem_word function is not there by default and needs to be installed from:
http://mysqludf.com/lib_mysqludf_stem/index.php

This is different than the soundex logic. As you can see from the following screen shot, all the words have different soundex value and hence are considered different. These three words doesn't has any relation to each other if we use the soundex logic, but using the stem_word logic (as explained above) we can easily find the word families.

mysql> select soundex('testing');
+--------------------+
| soundex('testing') |
+--------------------+
| T2352 |
+--------------------+
1 row in set (0.02 sec)

mysql> select soundex('tested');
+-------------------+
| soundex('tested') |
+-------------------+
| T230 |
+-------------------+
1 row in set (0.00 sec)

mysql> select soundex('tests');
+------------------+
| soundex('tests') |
+------------------+
| T232 |
+------------------+
1 row in set (0.00 sec)

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.