June 17, 2009
DotLucene and Accented Characters
During my projects working with Lucene, I had to index data from a database and make it searchable. One of the issues I came across was that first and last names with accents did not play nicely when searching for them. Once again, a Java filter existed for this but nothing in C#. You can find my ISOLatin1AccentFilter conversion here; it is a filter that replaces accented characters in the ISO Latin 1 character set with their unaccented equivalents (case is not altered).
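To make the idea of accent folding concrete, here is a minimal sketch in Python. It is not the ISOLatin1AccentFilter code or its C# port. The function name `fold_accents` and the small `_SPECIAL` table are assumptions made for illustration, and the table covers only a handful of characters. It relies on Unicode decomposition for most letters, which is a different technique from a hand-written character table and also affects characters outside ISO Latin 1.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit replacement.
# Illustrative subset only (an assumption, not the filter's full table).
_SPECIAL = {
    "Æ": "AE", "æ": "ae",
    "Ø": "O", "ø": "o",
    "Œ": "OE", "œ": "oe",
    "ß": "ss",
}


def fold_accents(text: str) -> str:
    """Replace accented characters with unaccented equivalents, keeping case."""
    out = []
    for ch in text:
        if ch in _SPECIAL:
            out.append(_SPECIAL[ch])
            continue
        # NFD splits a letter such as "é" into "e" plus a combining accent;
        # dropping the combining marks leaves the base letter.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

Applying the same folding to both the indexed names and the query terms is what lets a search for "Jose" match a stored "José".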
June 16, 2009
KStemmer Port for DotLucene
I worked with Lucene a few years ago and since then I have not really played with it. Yesterday, I got an email asking me if I still had code for my KStemmer port.
Lucene is a text search engine that was initially written in Java. It does a full-text search of files and data that you have indexed using Lucene. The basic idea is that if you want to search your network for files, or index data from your database and search it with a "Google"-style search, then this is the tool for you. I was first introduced to Lucene by a client who wanted to index data from a database and search against it quickly. He had worked with these types of engines before and asked me to build it out for him. He wanted to index the data, add some boosting for certain fields (basically a way to give higher priority to matches in those fields), and stem the data using KStemmer and PorterStemmer. Stemming normalizes words by removing common endings, in English or any other language. It can strip suffixes such as 'ing', 'ed', or 'ly', so that if you search for the word 'fishing' you are really searching for the root word 'fish'. The PorterStemmer is widely regarded as the de facto standard algorithm for English stemming, and it is included as part of Lucene.
My client also wanted the KStemmer, which was not part of the Lucene package. I did some research and found KStemmer to be a bit less aggressive than the PorterStemmer, and faster. My client felt that, for what he needed, the KStemmer was actually better.
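As a rough illustration of the suffix stripping described above, here is a deliberately naive sketch in Python. It is neither the Porter algorithm nor KStemmer, both of which apply far more careful rules. The names `naive_stem`, `SUFFIXES`, and `min_stem` are hypothetical and exist only for this example.

```python
# Suffixes mentioned in the post; a real stemmer handles many more cases.
SUFFIXES = ("ing", "ed", "ly")


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip one common English suffix, keeping at least `min_stem` letters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left to be a plausible root,
        # so short words like "ring" or "red" are left alone.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower
```

Because the same function is applied at index time and at query time, "fishing" and "fish" end up as the same term. How readily a stemmer strips endings is what the "aggressive" versus "less aggressive" comparison refers to.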
This project needed to be in C#, and the only KStemmer version was in Java. I converted the code a few years ago, but now that the DotLucene site seems to be down, my code is hard to find. I dug around for it, and you can now find it here.
Perhaps I'll post some of the other tools and code from my time using Lucene.