Jason Rowe

Be curious! Choose your own adventure.

Lucene in Action .Net Samples

 

I started reading Lucene in Action by Otis Gospodnetić and Erik Hatcher. Lucene is a high-performance, scalable Information Retrieval (IR) library. The library is written in Java, but I’m using the book to understand the Lucene .Net port.

Why buy a Java book to learn a .Net library? I couldn’t find a book specifically for the .Net version. Also, this is my first time working with IR, and I wanted a 1,000-foot view of search technologies. I was pleased to find that the .Net version closely mirrors the Java version presented in the book.

In some ways, learning from a different language has been beneficial: I invested more time in the samples because I had to convert them to C#. I’ve also read Mathematics and Physics for Programmers, which is written entirely in a pseudo language. Again, it was lots of fun to convert the samples to a language I’m more comfortable with.

It was also nice to find that the open source project Subtext uses Lucene .Net. They’ve worked out some locking issues recently and seem to be getting things set up nicely. Also, my coworkers Kevin and Tim put together a nice library for Lucene .Net called ActiveLucene.Net – Attributed Lucene.Net. The largest project seems to be Linq to Lucene, but I haven’t looked at it yet. It was helpful to see how others are integrating Lucene into the Microsoft Web Platform.

Anyway, here are the indexer and searcher in my C# interpretation. They are the first snippets presented in the book and a great way to get a jump start on the basics.

Indexer

   
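// Namespaces the indexer snippet relies on.
using System;
using System.Configuration;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

static readonly SimpleFSLockFactory _LockFactory = new SimpleFSLockFactory();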

static void Main(string[] args)
{

    var dataPath = ConfigurationManager.AppSettings["DataDirectory"];

    var indexPath = ConfigurationManager.AppSettings["IndexDirectory"];

    if (!System.IO.Directory.Exists(dataPath))
    {
        throw new IOException(dataPath + " directory does not exist");
    }

    DirectoryInfo indexInfo = new DirectoryInfo(indexPath);

    DirectoryInfo dataInfo = new DirectoryInfo(dataPath);

    Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(
                                                     indexInfo, _LockFactory);

    var start = DateTime.Now.TimeOfDay;
    var numIndexed = Index(indexDir, dataInfo);
    var end = DateTime.Now.TimeOfDay;

    var delta = end.TotalMilliseconds - start.TotalMilliseconds;

    Console.WriteLine(
        "Indexing " + numIndexed + " files took " + delta + " milliseconds");
}

public static int Index(Lucene.Net.Store.Directory indexDir, DirectoryInfo dataInfo)
{
    var writer = new IndexWriter(indexDir, new StandardAnalyzer(
                                      Lucene.Net.Util.Version.LUCENE_29), 
                                      IndexWriter.MaxFieldLength.UNLIMITED);
    writer.SetMergePolicy(new LogDocMergePolicy(writer));
    writer.SetMergeFactor(5);

    try
    {
        var paths = dataInfo.EnumerateFiles("*.txt");

        foreach (var path in paths)
        {
            IndexFile(writer, path);
        }
    }
    catch (Exception ex)
    {
        writer.Close();
        throw; // rethrow without resetting the original stack trace
    }

    var numIndexed = writer.MaxDoc();
    writer.Optimize();
    writer.Close();

    return numIndexed;
}

private static void IndexFile(IndexWriter writer, FileInfo file)
{
    if (!file.Exists)
    {
        return;
    }

    Console.WriteLine("Indexing " + file.Name);

    Document doc = new Document();

    var path = file.FullName;

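    // The reader-backed field is consumed during AddDocument, so keep the
    // reader open until that call returns, then dispose it.
    using (TextReader readFile = new StreamReader(path))
    {
        doc.Add(new Field("contents", readFile));

        doc.Add(new Field("filename", file.Name,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.YES));

        writer.AddDocument(doc);
    }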
}
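
A note on the timing: Main above subtracts two DateTime.Now.TimeOfDay values, which goes negative if a run happens to cross midnight. One possible alternative is System.Diagnostics.Stopwatch. The sketch below is only an illustration, not part of the book's sample; the Timing class and the MeasureMilliseconds helper are hypothetical names, and the lambda body stands in for the real call to Index(indexDir, dataInfo).

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs the supplied work and returns the elapsed
    // wall-clock time in milliseconds, measured with a Stopwatch.
    public static double MeasureMilliseconds(Action work)
    {
        var stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        var numIndexed = 0;

        var elapsed = MeasureMilliseconds(() =>
        {
            // Stand-in for: numIndexed = Index(indexDir, dataInfo);
            numIndexed = 3;
        });

        Console.WriteLine(
            "Indexing " + numIndexed + " files took " + elapsed + " milliseconds");
    }
}
```

The same helper could wrap the indexSearcher.Search call in the searcher, so both programs report elapsed time the same way.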

Searcher

   
public static SimpleFSLockFactory _LockFactory = new SimpleFSLockFactory();

static void Main(string[] args)
{
    if (args.Length < 2)
    {
        Console.WriteLine("Searcher takes two parameters");
        Console.WriteLine("Usage: ConsoleSearcher <index dir> <query>");
        return; // stop here; args[0] and args[1] are not both available
    }

    var indexInfo = new DirectoryInfo(args[0]);
    var query = args[1];

    if (!System.IO.Directory.Exists(args[0]))
    {
        throw new IOException(args[0] + " directory does not exist");
    }

    SearchOption(indexInfo, query);
}

private static void SearchOption(DirectoryInfo indexInfo, string query)
{
    Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(
                                                     indexInfo, _LockFactory);

    IndexSearcher indexSearcher = new IndexSearcher(indexDir, true);

    QueryParser parser = BuildQueryParser();
    var luceneQuery = parser.Parse(query);

    var start = DateTime.Now.TimeOfDay;

    var hits = indexSearcher.Search(luceneQuery);
    var end = DateTime.Now.TimeOfDay;

    var delta = end.TotalMilliseconds - start.TotalMilliseconds;

    Console.WriteLine("Found " + hits.Length() + 
        " document(s) (in " + delta + 
        " milliseconds) that matched query '" + query + "':");

    for (int i = 0; i < hits.Length(); i++)
    {
        Document doc = hits.Doc(i);

        Console.WriteLine(doc.Get("filename"));
    }
}

private static QueryParser BuildQueryParser()
{
    var parser = new QueryParser(
        Lucene.Net.Util.Version.LUCENE_29, "contents",
        new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));

    parser.SetDefaultOperator(QueryParser.Operator.AND);
    return parser;
}

Comments

One response to “Lucene in Action .Net Samples”

  1. Tarun Mangla

    I am facing an issue. While adding documents to the IndexWriter, it throws an “Arrays out of bounds exception”. I need to add almost 5 million documents, which are basically just words that I need to search. Please suggest.

    Regards
