Solr and explaining "AND" and "OR"

A client asked me to explain relevancy because they didn't understand why adding a word to their query caused a result to drop out (or at least to stop showing up at the top of the list, where they expected it). It turned out the discussion wasn't really about relevancy at all, but about AND and OR and how they are used in Solr.

Kudos are in order to Hoss over at Lucidworks for writing the article Why Not AND, OR, And NOT, which does a great job of explaining how the query parsers work in Lucene. He also gives a great explanation of what the annotations mean.

AND - Think of this as MUST HAVE. In the world of Solr, if you have a query that looks like "School AND Paper", you should expect to receive only documents that contain BOTH School and Paper. This is because a document MUST HAVE School and MUST HAVE Paper.

OR - Think of this as COULD HAVE. If you have a query that looks like "School OR Paper", you should expect to receive all documents that contain School, Paper, or both words. This is because a document COULD HAVE School or COULD HAVE Paper.
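To make the MUST HAVE / COULD HAVE distinction concrete, here is a minimal Python sketch. It is not how Solr works internally; the documents, ids, and the helper names (terms, match_and, match_or) are all made up for illustration, and splitting on whitespace stands in for real text analysis.

```python
# Toy documents; the ids and text are invented for this example.
docs = {
    1: "school paper due friday",
    2: "school closed for snow",
    3: "paper shortage hits printers",
    4: "weather report",
}

def terms(text):
    """Split text into a set of lowercase words (a stand-in for real analysis)."""
    return set(text.lower().split())

def match_and(docs, words):
    """Ids of documents that contain every word (MUST HAVE)."""
    return sorted(i for i, text in docs.items()
                  if all(w in terms(text) for w in words))

def match_or(docs, words):
    """Ids of documents that contain at least one word (COULD HAVE)."""
    return sorted(i for i, text in docs.items()
                  if any(w in terms(text) for w in words))

print(match_and(docs, ["school", "paper"]))  # [1]
print(match_or(docs, ["school", "paper"]))   # [1, 2, 3]
```

Only document 1 has both words, so it is the only AND match. The OR query also picks up documents 2 and 3, which have one word each.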

Throw in Synonyms and Stemming

Now that I've explained these seemingly obvious Boolean operators... there are still some gotchas when debugging Solr. My client was using an AND query, but was getting results in which one or even both words did not appear. Honestly, it threw me for a loop too, which was part of the reason I set out on an expedition to find the truth behind AND and OR.

Solr has the ability to work with both synonyms and stemming. In this case, my client had a synonyms file that contained the words they were searching for. So let's take these two synonym lines as an example:

school, college, institution, seminary, academy, university, institute
paper, essay, report, script, newspaper

Our simple query for "School AND Paper" now becomes the following:

(school OR college OR institution OR seminary OR academy OR university OR institute) AND (paper OR essay OR report OR script OR newspaper)

Now the above doesn't seem too out of the ordinary... You might expect some of those matches... but you can start to stretch the outer limits of the imagination with, for example, "institution report". That might bring up an article on how well an insane asylum did on its last inspection, which is nowhere near the article you were looking for on "How to write a school report in MLA format". My client in fact had a custom synonym file that mapped some words they use around the workplace to other words, and the results were shockingly horrible.
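The expansion described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names (load_synonyms, expand, matches) are hypothetical, the query is assumed to be plain terms joined by " AND ", and real Solr applies synonyms inside its analysis chain rather than by rewriting the query string.

```python
# The two synonym lines from above, one group per line.
SYNONYM_LINES = [
    "school, college, institution, seminary, academy, university, institute",
    "paper, essay, report, script, newspaper",
]

def load_synonyms(lines):
    """Map every word to the full synonym group it belongs to."""
    table = {}
    for line in lines:
        group = [w.strip().lower() for w in line.split(",")]
        for word in group:
            table[word] = group
    return table

def expand(query, table):
    """Rewrite 'A AND B' so each term becomes an OR group of its synonyms."""
    clauses = []
    for term in query.split(" AND "):
        group = table.get(term.lower(), [term.lower()])
        if len(group) > 1:
            clauses.append("(" + " OR ".join(group) + ")")
        else:
            clauses.append(group[0])
    return " AND ".join(clauses)

def matches(text, words, table):
    """True if the text has at least one synonym for every query word."""
    doc_terms = set(text.lower().split())
    return all(doc_terms & set(table.get(w, [w])) for w in words)

table = load_synonyms(SYNONYM_LINES)
print(expand("School AND Paper", table))

# Neither "school" nor "paper" appears here, yet the AND query matches.
print(matches("institution report on the latest inspection",
              ["school", "paper"], table))  # True
```

The second print is the gotcha in miniature: the text contains neither of the words that were typed, but it satisfies both OR groups.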

Stemming takes this one step further. Depending on how your Solr configuration is set up (you can run stemming, as well as synonym detection, at index time or at query time), you may get the synonyms of a stemmed word as well. Stemming is an algorithmic way of getting the stem of a word. Some stems don't even equate to real words in the English language... but they get pretty close. A really handy tool I found for this was the Javascript Porter Stemmer Online, which gave me a foundation to figure out which words were being looked at. Running our example above would give us:

[Image: Stemmed Content]

So our query would now look like this:

(school OR college OR institution OR seminary OR academy OR university OR institute OR institut OR seminari OR academi OR univers) AND (paper OR essay OR report OR script OR newspaper OR newspap OR essai)

Big change, huh?
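To see where the extra terms come from, here is a rough Python sketch. It does not run a real Porter stemmer: STEMS is a hand-written lookup table holding only the stems that appear in the query above, and the pairing of each word with its stem is my assumption. The add_stems helper is hypothetical too.

```python
# Stand-in for a real stemmer: only the stems seen in the query above.
# The word-to-stem pairing is assumed, not produced by an algorithm here.
STEMS = {
    "institution": "institut",
    "institute": "institut",
    "seminary": "seminari",
    "academy": "academi",
    "university": "univers",
    "newspaper": "newspap",
    "essay": "essai",
}

def add_stems(group):
    """Return the group followed by any stems that are not already in it."""
    out = list(group)
    for word in group:
        stem = STEMS.get(word, word)  # words without an entry are left alone
        if stem not in out:
            out.append(stem)
    return out

school = ["school", "college", "institution", "seminary",
          "academy", "university", "institute"]
paper = ["paper", "essay", "report", "script", "newspaper"]

# Build the same kind of query as above; term order inside a group may differ.
query = " AND ".join("(" + " OR ".join(add_stems(g)) + ")"
                     for g in (school, paper))
print(query)
```

Note that "institution" and "institute" collapse to the same stem, so two different synonyms end up contributing a single extra term.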

Final Thoughts

AND and OR can be confusing unless you put them into the right context. Solr can be even more confusing in general if you don't know what's going on behind the scenes of your query. I hope this gives you a better idea of what's going on and more confidence in what a search query really means.