June 16, 2009

KStemmer Port for DotLucene

I worked with Lucene a few years ago and have not really played with it since. Yesterday I got an email asking whether I still had the code for my KStemmer port.

Lucene is a text search engine that was initially written in Java. It performs full-text search over files and data that you have indexed with it. If you want to search your network for files, or index data from your database and query it with a "Google" style search, then this is the tool for you. I was first introduced to Lucene by a client who wanted to index data from a database and search against it quickly. He had worked with these types of engines before and asked me to build it out for him. He wanted to index the data, add boosting for certain fields (basically a way to give certain fields higher priority), and stem the data using KStemmer and PorterStemmer. Stemming normalizes words, in English or any other language, by removing common endings. For example, it can strip suffixes such as 'ing', 'ed', or 'ly', so that a search for the word 'fishing' really searches for the root word 'fish'. The PorterStemmer was known as the de facto standard algorithm for English stemming, and it is included as part of Lucene.
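To make the idea of stemming concrete, here is a minimal sketch of suffix stripping in Java, the language Lucene was originally written in. It is only a toy and is not the Porter or KStemmer algorithm: the class name, the suffix list (taken from the examples above), and the minimum stem length are all assumptions made for illustration.

```java
import java.util.Locale;

// Toy suffix stripper for illustration only; NOT the Porter or KStemmer algorithm.
final class ToySuffixStripper {
    // Hypothetical suffix list, taken from the examples in the text.
    private static final String[] SUFFIXES = {"ing", "ed", "ly"};

    // Assumed threshold: keep at least this many characters,
    // so short words such as "red" or "sing" are left alone.
    private static final int MIN_STEM_LENGTH = 3;

    // Lowercases the word and strips the first matching suffix, if any.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        String[] words = {"fishing", "fished", "quickly", "fish", "red"};
        for (String w : words) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

With this sketch, 'fishing' and 'fished' both reduce to 'fish', so a query for one matches documents containing the other. It also shows why real stemmers need more rules: the toy turns 'running' into 'runn', a case that a proper algorithm has to handle explicitly.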

My client also wanted the KStemmer, which was not part of the Lucene package. I had to do some research, but basically I found KStemmer to be a bit less aggressive than the PorterStemmer, and faster. My client felt that for what he needed, the KStemmer was actually better.

This project needed to be in C#, and the only KStemmer version was in Java. I converted the code a few years ago, and now that the DotLucene site appears to be down, my code is hard to find. I dug around for it, and now you can find it here.

Perhaps I'll post some of the other tools and code from my time using Lucene.

1 comment:

  1. Hi, I downloaded KStemmer and integrated it into my solution. It passed testing, but it throws an "Index was outside the bounds of the array" exception. It happens 4-10 times per day, with about half a million calls to Stemmer.Stem.
    The problem is also not reproducible, which makes me believe it happens around static members that may not be thread safe.

    It happens on if (str[i] != word[r1]) in
    private bool endsIn(string str)

    Any ideas?
