Jason Rowe

Be curious! Choose your own adventure.

Poor Man's Reverse Search with Lucene.net

As I mentioned in my last post, I’ve been reading and learning about Lucene via Lucene in Action, 2nd edition. Section 9.4, Fast memory-based indices, talks about MemoryIndex and InstantiatedIndex.

MemoryIndex, contributed by Wolfgang Hoschek, is a fast RAM-only index designed to test whether a single document matches a query. InstantiatedIndex, contributed by Karl Wettin, is similar, except it’s able to index and search multiple documents. Unfortunately, neither of these has been ported to .NET yet. I thought about trying to port them, but figured I would first see if I could build a simple reverse search using the slower built-in RAMDirectory, which is available in both Lucene and Lucene.net.

So what I want to create is a reverse search, or on-the-fly matchmaking for documents. Put more simply: I take a list of queries and check whether each one matches a data object every time a data object is created. For simplicity's sake, I’m only working with the Lucene code that has already been ported.
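Stripped of Lucene entirely, the idea looks roughly like the sketch below. It is only an illustration: `StoredQuery`, `ReverseSearchSketch`, and `MatchingQueries` are made-up names, and a plain predicate over a string stands in for a real Lucene query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a stored query: a name plus a predicate
// that decides whether a newly created item matches.
public sealed class StoredQuery
{
    public StoredQuery(string name, Func<string, bool> matches)
    {
        Name = name;
        Matches = matches;
    }

    public string Name { get; }
    public Func<string, bool> Matches { get; }
}

public static class ReverseSearchSketch
{
    // Reverse search: run every stored query against one new item
    // and return the names of the queries that match it.
    public static List<string> MatchingQueries(IEnumerable<StoredQuery> queries, string newItemText)
    {
        return queries.Where(q => q.Matches(newItemText))
                      .Select(q => q.Name)
                      .ToList();
    }
}
```

Calling `MatchingQueries` once per new item is the whole pattern; the Lucene version swaps the predicate for a real query run against a one-document index.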

One cool feature added in Lucene 2.9 is near-real-time (NRT) searching. It seemed like a good fit for my scenario of trying to create a faster reverse search, because it provides a way to search documents that have not been committed yet. That removes some of the overhead of adding the single document to the index before running all the queries against it. As the book explains, NRT enables you to rapidly search changes made to the index with an open IndexWriter, without having to first close or commit changes to that writer.

The NRT feature allows me to do something like the class below. It seems to be a big memory hog, though, so I made it disposable:

   
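    using System;
    using System.Collections.Generic;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    // MetaQueryRequest is not shown in this post; it is assumed to expose
    // a LuceneQuery (Query) and a QueryName (string).
    public class QueryProcessor : IDisposable
    {
        private readonly RAMDirectory _directory;
        private readonly IndexWriter _writer;

        public QueryProcessor()
        {
            _directory = new RAMDirectory();
            _writer = new IndexWriter(_directory,
                                      new StandardAnalyzer(
                                          Lucene.Net.Util.Version.LUCENE_CURRENT),
                                      IndexWriter.MaxFieldLength.UNLIMITED);
            _writer.SetMergeFactor(2);
        }

        public void Optimize()
        {
            _writer.Optimize();
        }

        // Returns the names of the queries that match the given document.
        public List<string> Search(List<MetaQueryRequest> requests, Document doc)
        {
            _writer.AddDocument(doc);

            var results = new List<string>();

            // Near-real-time reader: sees the document even though it is not committed.
            // Opened once per call instead of once per query, and closed afterwards.
            var reader = _writer.GetReader();
            var indexSearcher = new IndexSearcher(reader);

            try
            {
                foreach (var request in requests)
                {
                    var primitiveQuery = request.LuceneQuery.Rewrite(reader);

                    var hits = indexSearcher.Search(primitiveQuery, 1);

                    if (hits.totalHits > 0)
                    {
                        results.Add(request.QueryName);
                    }
                }
            }
            finally
            {
                indexSearcher.Close();
                reader.Close();

                // Clear the index so the next call starts with an empty index.
                _writer.DeleteAll();
            }

            return results;
        }

        #region IDisposable Members

        public void Dispose()
        {
            // Close the writer before the directory it writes to.
            _writer.Close();
            _directory.Close();
        }

        #endregion
    }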

If you know of a faster way to accomplish this, please let me know. For more background information on this topic, see these resources.

