# Shantanu's Blog

Database Consultant

## How does elasticsearch work?

Let's assume we have 3 documents to be indexed in Elasticsearch.

d1: "This is the desert. There are no people in the desert. The Earth is large."

d2: "'Where are the people?' resumed the little prince at last. 'It's a little lonely in the desert…' ,' It is lonely when you're among people, too,' said the snake."

d3: " 'What makes the desert beautiful,' said the little prince, 'is that somewhere it hides a well' "
_____

Variables used:

λ : 0.1

1-λ : 0.9

tf("desert"): 4, total number of occurrences of "desert" across all documents in the collection

Lc : 59, total number of tokens in collection

Mc("desert") = (4 + 1) / (59 + 1) = 5/60

Md("desert") = tf / document length = 2/15, because "desert" occurs 2 times among the 15 tokens of d1

idf("desert") = log(total number of documents / number of documents that contain the keyword), i.e. log(3/3) = log(1) = 0
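
The variables above can be recomputed from the three documents. This is only a rough sketch: the `tokenize` helper and its regular expression are assumptions made for illustration (not Elasticsearch's analyzer), chosen so that "It's" and "you're" each count as one token.

```python
import math
import re

docs = {
    "d1": "This is the desert. There are no people in the desert. The Earth is large.",
    "d2": "'Where are the people?' resumed the little prince at last. "
          "'It's a little lonely in the desert…' ,' It is lonely when you're among people, too,' said the snake.",
    "d3": " 'What makes the desert beautiful,' said the little prince, 'is that somewhere it hides a well' ",
}

def tokenize(text):
    # Hypothetical tokenizer: lowercase words, keeping an inner apostrophe ("it's").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokens = {name: tokenize(text) for name, text in docs.items()}
term = "desert"

coll_tf = sum(t.count(term) for t in tokens.values())     # occurrences in the collection
coll_len = sum(len(t) for t in tokens.values())           # Lc, tokens in the collection
doc_freq = sum(1 for t in tokens.values() if term in t)   # documents containing the term

mc = (coll_tf + 1) / (coll_len + 1)                        # collection model, Mc
md = tokens["d1"].count(term) / len(tokens["d1"])          # document model for d1, Md
idf = math.log(len(docs) / doc_freq)

print(coll_tf, coll_len, mc, md, idf)
```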
_____

1) Classic:
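The simplest form of similarity scoring:
tf * idf
(2/15) * 0 = 0
Because "desert" appears in every document, its idf is 0 and the term does not help to rank the documents.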

2) BM25 similarity:
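The default similarity in Elasticsearch: a TF/IDF based similarity that has built-in tf normalization.

IDF * ((k + 1) * tf) / (k * (1.0 - b + b * (|d|/avgDl)) + tf)

Here k and b are tuning parameters, |d| is the length of the document in tokens and avgDl is the average document length in the collection.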
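
The BM25 formula above can be tried on d1 with a few lines of Python. This is a sketch under stated assumptions: `bm25_term_score` and `lucene_style_idf` are hypothetical helper names, tf is the raw count (2) rather than 2/15, k = 1.2 and b = 0.75 are the usual Elasticsearch defaults, and the smoothed idf is the variant commonly attributed to Lucene rather than the plain log(3/3) used in this post.

```python
import math

def bm25_term_score(idf, tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    # Length normalisation: long documents are penalised, short ones are boosted.
    norm = k * (1.0 - b + b * (doc_len / avg_doc_len))
    return idf * ((k + 1) * tf) / (norm + tf)

def lucene_style_idf(num_docs, doc_freq):
    # Assumption: smoothed idf, log(1 + (N - n + 0.5) / (n + 0.5)).
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

avg_dl = 59 / 3  # 59 tokens spread over 3 documents

# With the plain idf from this post, log(3/3) = 0, the score collapses to 0.
plain = bm25_term_score(math.log(3 / 3), tf=2, doc_len=15, avg_doc_len=avg_dl)

# With the smoothed idf the term still contributes a small positive score.
smoothed = bm25_term_score(lucene_style_idf(3, 3), tf=2, doc_len=15, avg_doc_len=avg_dl)

print(plain, smoothed)
```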

3) Jelinek Mercer smoothing:
The actual formula is:
log(1 + ((1-λ) * Md) / (λ * Mc))

The argument of the log, with the values substituted:
1 + ((1-λ) * Md("desert")) / (λ * Mc("desert"))

translates to:
(1+ ((0.9) * (2/15)) / (0.1 * (5/60)))

returns:
15.4

and the natural log is:
math.log(15.4)
2.7343675094195836
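
The same calculation can be wrapped in a small function and applied to all three documents. A minimal sketch: `jm_score` is a hypothetical helper, and the document lengths for d2 (28) and d3 (16) are my own word counts, which together with the 15 tokens of d1 add up to the 59 tokens used above.

```python
import math

def jm_score(tf, doc_len, coll_tf, coll_len, lam=0.1):
    md = tf / doc_len                    # document model
    mc = (coll_tf + 1) / (coll_len + 1)  # collection model
    return math.log(1 + ((1 - lam) * md) / (lam * mc))

# (occurrences of "desert", document length); d2 and d3 lengths are assumed counts.
doc_stats = {"d1": (2, 15), "d2": (1, 28), "d3": (1, 16)}

scores = {name: jm_score(tf, n, 4, 59) for name, (tf, n) in doc_stats.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranking)
```

With these numbers d1 ranks first, because it has both the highest count of "desert" and the shortest length.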

By increasing λ (lambda), we increase the importance of the collection model and diminish the importance of the document model.
A higher λ (around 0.7) is a good choice for longer queries, while a value around 0.1 suits short, title-like queries.
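
To see the effect of λ on the example numbers, the score can be recomputed for a few values. This is an illustrative sketch only: `jm_score` is a hypothetical helper and the document model 1/28 for d2 is based on my own token count for that document.

```python
import math

def jm_score(md, mc, lam):
    return math.log(1 + ((1 - lam) * md) / (lam * mc))

mc = 5 / 60  # collection model for "desert"
rows = []
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    d1 = jm_score(2 / 15, mc, lam)  # d1: 2 occurrences in 15 tokens
    d2 = jm_score(1 / 28, mc, lam)  # d2: 1 occurrence in 28 tokens (assumed count)
    rows.append((lam, d1, d2, d1 - d2))
    print(f"lambda={lam}: d1={d1:.3f} d2={d2:.3f} gap={d1 - d2:.3f}")
```

As λ grows, both scores drop and the gap between the two documents narrows, since the shared collection model carries more of the weight.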
_____

And this is how to test it:

https://gist.github.com/shantanuo/e203fb336ff0712f502c73a43cd85d75
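
Separately from the gist, here is a rough sketch of the index settings that select Jelinek Mercer smoothing for a field. The index name `little_prince`, the similarity name `my_jm` and the field name `text` are made up for illustration; the snippet only builds and prints the JSON body, which would then be sent when creating the index.

```python
import json

# Hypothetical names: "my_jm" (similarity) and "text" (field).
index_body = {
    "settings": {
        "index": {
            "similarity": {
                "my_jm": {"type": "LMJelinekMercer", "lambda": 0.1}
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "similarity": "my_jm"}
        }
    },
}

# Body for a request such as: PUT /little_prince
print(json.dumps(index_body, indent=2))
```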
