Search Basics Using Lucene.Net

Posted by Phil on December 14, 2017

If you're a web creator or a .Net developer who wishes you could add search to your application, but it all feels like a bit of a mystery and leaning on Google is easier, then you need to look at Lucene.Net.

It's not uncommon to feel a little out of your depth when working with search. Google has refined the task to a well-tuned science, and the whole concept can feel like a big black box if your only frame of reference is the major search engines. The reality, though, is that it's far easier than you probably think, and the learning curve for getting a basic search function up and running on your site or application is a gentle one. This post is a run-through of exactly that: an introduction to a really powerful search engine and how to implement a fully-functional search feature quickly.

My company has an internal documentation site that we've built primarily using markdown files and Metalsmith. We're hosting it on Azure's Standard App Service tier and exposing that internally with Azure AD authentication. We want to be able to search pages and while we could have had SharePoint search or other OOTB products serve this content, everyone's pretty busy at this time of year and implementing something quick and functional fits our needs and the scale we're operating at. So here's what I've put together:

 

Scaffolding

For the purposes of this solution, I've created a new ASP.Net Web API project. I want my generated static pages to be able to search via an AJAX call (e.g. jQuery's $.get(...)).

I only need one controller. Call it "Search" or "Home" or whatever you want. The controller needs to do two things: build out a new index when necessary, and retrieve search results from the indexed pages.

        // Head: This rebuilds the index
        [HttpHead]
        public void Head()
        {
            luceneIndexer.Build();
        }

        [HttpGet]
        public IEnumerable<SearchResult> Get()
        {
            string term = Request.RequestUri.Query.TrimStart('?');
            List<SearchResult> results = luceneIndexer.Search(term);

            return results;
        }

I can call the same endpoint (e.g. /api/search) and, depending on the HTTP verb I supply, have the API perform different functions. This is hardly the only way to attack the problem, and you may find other approaches work better.

The next task is the actual Lucene indexing.

Lucene

The bulk of the work happens in my Index class, where Lucene is tasked with crawling the pages or returning results based on supplied search terms.

The constructor needs to know where to crawl for our searchable content and where Lucene's index file can be written, and I want to index only markdown files, Word documents or PDFs (in the unlikely event those last two are included). I set all of those in AppSettings.
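As a sketch of how that setting could drive the file filter (the literal setting value and the IsIndexable helper below are illustrative assumptions, standing in for ConfigurationManager):

```csharp
// Illustrative sketch, not part of the post's Index class: turning the
// comma-separated IndexableExtensions setting into a fast lookup set.
using System;
using System.Collections.Generic;
using System.Linq;

string indexableExtensions = "md,pdf,doc"; // would come from AppSettings

HashSet<string> allowed = indexableExtensions
    .Split(',')
    .Select(e => "." + e.Trim().ToLowerInvariant())
    .ToHashSet();

// Matches FileInfo.Extension values such as ".md"
bool IsIndexable(string extension) => allowed.Contains(extension.ToLowerInvariant());

Console.WriteLine(IsIndexable(".md"));  // True
Console.WriteLine(IsIndexable(".txt")); // False
```

Driving the filter from configuration like this would also let ScanFolders honour the setting instead of checking for markdown alone.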

My method for building out the search index is housed mainly in the Build() call. There, I start by creating a new Lucene IndexWriter that will write the search index file.

using (IndexWriter writer = CreateWriter(Analyzer, settings.IndexStoreLocation))            

Using that, I then recurse over the directories and files containing my markdown content, keeping a list of every file with a matching type.

ScanFolders(files);

Finally, for each file I've found, I create a new Lucene document, add some fields that are relevant for search purposes and flush the results to disk. As part of this task, my CreateDocument(FileInfo file) method includes a reference to a Parser class. That's purely a personal choice I've made to keep my code a little tidier. I explain that class further down.

Search

The actual search task happens in the Search() method. Here, we take a search term as a string, get a reference to our previously-created Lucene search index file and build out a structured search Query, which we then pass to the Lucene Index Searcher. Lucene will go away and find all the files with keywords matching our query, then return those results as "TopDocs": references to Documents that have a high match rate with our query.
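One gotcha worth hedging against here: StandardAnalyzer lowercases tokens at index time, but TermQuery does no analysis of its own, so a raw term like "Azure" would never match the stored token "azure". A minimal normalization sketch, assuming single-keyword queries (multi-word input would normally go through a QueryParser built over the same analyzer instead):

```csharp
// Sketch only: lowercase and trim the user's term before building the TermQuery,
// since TermQuery does not run the analyzer over its input.
using System;

string NormalizeTerm(string rawTerm) => rawTerm.Trim().ToLowerInvariant();

Console.WriteLine(NormalizeTerm("  Azure ")); // azure
```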

TopDocs hits = searcher.Search(query, 100);

I'm only interested in a maximum of 100 hits, so I pass that in as an additional constraint. We then iterate over the list of hits, getting the "Score" (a measure of how closely our search term fits each document; I have no idea how Lucene calculates it, so I'll learn more and blog about it another time), the file name, content and URL. These are all values I told Lucene to record earlier, when I created the search index and defined each field. I pass all these values into a custom SearchResult class, which is passed back to my controller and returned as JSON to the calling AJAX method.
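A caveat on that Confidence value: Lucene scores aren't percentages and can exceed 1.0, so a straight (int)(score * 100) can yield values over 100. One option is to normalize each hit against the best score in the result set; a sketch, using made-up scores:

```csharp
// Sketch: convert raw Lucene scores to 0-100 "confidence" values relative to
// the top-scoring hit. The input scores below are hypothetical.
using System;
using System.Linq;

int[] ToConfidence(float[] scores)
{
    if (scores.Length == 0) return Array.Empty<int>();
    float best = scores.Max();
    return scores.Select(s => (int)(s / best * 100)).ToArray();
}

Console.WriteLine(string.Join(",", ToConfidence(new[] { 2.4f, 1.2f, 0.6f }))); // 100,50,25
```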

That's really all there is to it! You can now GET search results from your indexed files with a simple AJAX request or just test out the results using Fiddler.

Here's the complete code:

 

Lucene Index Class

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using System;
using System.IO;
using System.Text;
using Lucene.Net.Documents;
using System.Configuration;
using System.Collections.Generic;
using Lucene.Net.Search;
using PatternsPracticesSearch.Models;

namespace PatternsPracticesSearch.LuceneIndex
{
    public class Index
    {
        IndexingSettings settings;
        private Lucene.Net.Util.Version version = Lucene.Net.Util.Version.LUCENE_30;
        private StandardAnalyzer analyzer;
        private List<FileInfo> fileScanResults = new List<FileInfo>();


        public Index()
        {
            settings = new IndexingSettings()
            {
                FileLocation = ConfigurationManager.AppSettings["FileLocation"],
                IndexableExtensions = ConfigurationManager.AppSettings["IndexableExtensions"],
                IndexStoreLocation = ConfigurationManager.AppSettings["IndexStoreLocation"]
            };

            analyzer = new StandardAnalyzer(version);
        }

        /// <summary>
        /// Reads through the Lucene search index and returns results based on the supplied query term
        /// </summary>
        /// <param name="queryTerm">A keyword or set of keywords to search for</param>
        /// <returns>A list of <see cref="SearchResult"/> results</returns>
        public List<SearchResult> Search(string queryTerm)
        {
            Lucene.Net.Store.Directory luceneDirectory = FSDirectory.Open(settings.IndexStoreLocation);

            IndexSearcher searcher = new IndexSearcher(luceneDirectory);

            Term searchTerm = new Term("body", queryTerm);
            Query query = new TermQuery(searchTerm);

            TopDocs hits = searcher.Search(query, 100);

            List<SearchResult> results = new List<SearchResult>();
            for (int i = 0; i < hits.ScoreDocs.Length; i++)
            {
                float score = hits.ScoreDocs[i].Score;
                Document foundDoc = searcher.Doc(hits.ScoreDocs[i].Doc);

                SearchResult result = new SearchResult()
                {
                    Title = foundDoc.GetField("title").StringValue,
                    Content = foundDoc.GetField("body").StringValue,
                    Url = foundDoc.GetField("url").StringValue,
                    Confidence = (int)(score * 100)
                };

                results.Add(result);
            }

            searcher.Dispose();
            luceneDirectory.Dispose();

            return results;
        }


        public void Build()
        {

            using (IndexWriter writer = CreateWriter(Analyzer, settings.IndexStoreLocation))
            {
                var files = new DirectoryInfo(settings.FileLocation);
                
                fileScanResults = new List<FileInfo>();

                try
                {
                    ScanFolders(files);

                    int docIndex = 0;

                    foreach (FileInfo file in fileScanResults)
                    {
                        docIndex++;

                        Document doc = CreateDocument(file);
                        doc.Add(new Field("id", docIndex.ToString(), Field.Store.YES, Field.Index.NO));

                        writer.AddDocument(doc);
                    }

                    writer.Optimize();
                    writer.Flush(true, true, true);

                }
                catch (Exception)
                {
                    // Rethrow without resetting the stack trace
                    throw;
                }
            }
        }

        private Document CreateDocument(FileInfo file)
        {
            string content = File.ReadAllText(file.FullName);

            var document = Parser.Parse(content);
            document.Add(new Field("url", Parser.ParseUrl(file.FullName), Field.Store.YES, Field.Index.ANALYZED));
            document.Add(new Field("lastupdated", file.LastWriteTimeUtc.ToLongDateString(), Field.Store.YES, Field.Index.NOT_ANALYZED));


            return document;
        }

        private IndexWriter CreateWriter(Analyzer analyzer, string path)
        {
            DirectoryInfo directory = new DirectoryInfo(path);

            if (!directory.Exists)
            {
                directory.Create();
            }

            try
            {
                var mMapDirectory = FSDirectory.Open(directory);
                return new IndexWriter(mMapDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            }
            catch (Exception ex)
            {
                // Preserve the original exception as the inner exception
                throw new Exception("MMapDirectory constructor exploded!", ex);
            }
        }

        private void ScanFolders(DirectoryInfo directory)
        {

            try
            {
                foreach (DirectoryInfo dir in directory.GetDirectories())
                {
                    foreach (FileInfo file in dir.GetFiles())
                    {
                        if (file.Extension.Equals(".md", StringComparison.OrdinalIgnoreCase))
                        {
                            fileScanResults.Add(file);
                        }
                    }
                    ScanFolders(dir);
                }
            }
            catch (Exception)
            {
                // Rethrow without resetting the stack trace
                throw;
            }
        }

        public StandardAnalyzer Analyzer { get { return analyzer ?? (analyzer = new StandardAnalyzer(version)); } }
    }
}

Document Field Parsing

Here is the Parser class to abstract away some of the text manipulation happening as part of Lucene's Document creation:

using Lucene.Net.Documents;
using System;
using System.Linq;

namespace PatternsPracticesSearch.LuceneIndex
{
    public class Parser
    {
        /// <summary>
        /// Takes a block of raw text (typically from a file read) and extracts the heading and body content,
        /// which are then added to a new field set.
        /// </summary>
        /// <param name="Text">Raw text for indexing as a Document</param>
        /// <returns>A new <see cref="Document"/> object</returns>
        public static Document Parse(string Text)
        {
            Document finishedDocument = new Document();

            string heading = ExtractHeading(Text);
            finishedDocument.Add(new Field("title", heading, Field.Store.YES, Field.Index.ANALYZED));
            finishedDocument.Add(new Field("body", Text, Field.Store.YES, Field.Index.ANALYZED));

            return finishedDocument;
        }

        /// <summary>
        /// Takes a reference on disk and transforms the file name to reflect the published URL location
        /// </summary>
        /// <param name="Filename">The full file path of the source document being indexed.</param>
        /// <returns>String representing the relative URL of the source document.</returns>
        public static string ParseUrl(string Filename)
        {
            // Bit of a kludge. Could probably do this better
            string root = Filename.Substring(Filename.IndexOf("\\", Filename.IndexOf("PatternsAndPractices")));

            string route = root.Replace(".md", "");

            return route.Replace("\\", "/");
        }

        private static string ExtractHeading(string Text)
        {
            try
            {
                string[] lines = Text.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

                return lines.First(x => x.StartsWith("#")).Replace("#", "").Trim();

            }
            catch (Exception ex)
            {
                // TODO: Implement some App Insights logging here or something to trap cases where lines do not match the expected format
                return "No heading";
            }
        }
    }
}
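To make the path-to-URL mapping concrete, here's the ParseUrl transformation reproduced as a standalone snippet, run against a hypothetical file path:

```csharp
// Standalone copy of the ParseUrl logic for illustration; the input path is made up.
using System;

string ParseUrl(string filename)
{
    // Everything after the "PatternsAndPractices" project folder becomes the route
    string root = filename.Substring(filename.IndexOf("\\", filename.IndexOf("PatternsAndPractices")));
    string route = root.Replace(".md", "");
    return route.Replace("\\", "/");
}

Console.WriteLine(ParseUrl(@"d:\dev\PatternsAndPractices\guides\search.md")); // /guides/search
```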

Search Results

Here's my custom SearchResult class:

namespace PatternsPracticesSearch.Models
{
    public class SearchResult
    {
        public string Title { get; set; }
        public string Url { get; set; }
        public string Content { get; set; }
        public int Confidence { get; set; }
    }
}

Application Settings

  <appSettings>
    <add key="FileLocation" value="d:\dev\MyMarkdownProject" />
    <add key="IndexableExtensions" value="md,pdf,doc" />
    <add key="IndexStoreLocation" value="d:\dev\luceneindex" />
  </appSettings>

Got any questions or advice? Find me on Twitter : @Phil_Wheeler, email me or message me on Keybase.

Photo Credit

João Silas on Unsplash