How Search Engines Work — Understanding Inverted Indexes and TF-IDF with Interactive Diagrams

Introduction

Google Search, IDE code search, e-commerce product search — every day we rely on technology that instantly finds target information from massive datasets. The core technology behind all of these is the search engine.

This article explains the three fundamental technologies at the heart of search engines, using interactive diagrams:

Inverted Index — the data structure that enables fast search
Boolean Search — processing queries as set operations
TF-IDF / BM25 — algorithms for ranking search results

Search Engine Architecture

Let's start with an overview of the search engine pipeline.

From document collection to returning results, the process is divided into roughly 6 stages. This article focuses on text analysis, index building, query processing, and ranking.

Text Analysis — Breaking Documents into Tokens

The first step in processing a document is text analysis. Let's look at the pipeline that transforms raw text into searchable units called tokens.

This pipeline consists of the following stages:

1. Character Normalization

Unicode normalization and case conversion. This ensures that "The" and "the" are treated as the same token.

2. Tokenization

Split text into individual words (tokens). In English, this is typically whitespace-based splitting. Languages like Japanese, which lack spaces between words, require morphological analysis (using tools like MeCab or Kuromoji).

3. Stopword Removal

Remove high-frequency, low-information words like "the", "is", and "on". This reduces index size and improves search precision.

4. Stemming / Lemmatization

Reduce words to their root form: "running" → "run", "cats" → "cat". This absorbs spelling variations and improves search recall.

Inverted Index — The Heart of Search

Why We Need an Inverted Index

A naive approach to full-text search would scan each document from start to finish looking for keywords. For $N$ documents, this requires $O(N \times L)$ time ( $L$ being average document length), which breaks down as the document count grows.

The Inverted Index is the data structure that solves this problem.

Forward Index vs Inverted Index

A normal index (Forward Index) maps "document → contained terms":

Forward Index:
  D1 → [cat, sat, mat]
  D2 → [cat, dog]
  D3 → [dog, sat, log]

An inverted index reverses this relationship. It maps "term → documents containing that term":

Inverted Index:
  cat → [D1, D2]
  dog → [D2, D3]
  log → [D3]
  mat → [D1]
  sat → [D1, D3]

This allows $O(1)$ lookup for documents containing any given keyword. The list of document IDs associated with each term is called a posting list.

Interactive Demo: Building and Searching an Inverted Index

In the demo below, watch how an inverted index is built from 3 documents, then used to execute a search query step by step.

Inverted Index Builder

Watch how an inverted index is built from documents, then used to answer a search query.

Documents

D1"the cat sat on the mat"

D2"the cat and the dog"

D3"the dog sat on a log"

We have 3 documents. Let's build an inverted index from them step by step.1 / 15

Posting List Structure

In real search engines, posting lists often contain more than just document IDs:

Field	Purpose
Document ID	Basic document identification
Term Frequency (TF)	Used for ranking
Position	Phrase search, proximity search
Offset	Snippet generation, highlighting

An index that includes position information is called a positional index, enabling phrase searches like "cat sat".

Query Processing — Boolean Search Model

Once the user's query is tokenized, we use the inverted index to find matching documents. The most fundamental model is Boolean search.

AND (Intersection)

"cat AND sat" computes the intersection of the posting lists for cat and sat. Since both lists are sorted by document ID, a sorted merge with two pointers processes them in $O(n + m)$ time.

def intersect(p1, p2):
    result = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

OR (Union)

"cat OR sat" returns the union of both posting lists — all documents containing either term.

Query Optimization

When multiple AND conditions exist, processing the term with the shortest posting list first keeps intermediate results small and improves efficiency. This is known as query optimization.

Ranking — TF-IDF and BM25

Boolean search only returns binary results — match or no match. In practice, we want to rank how relevant each document is to the query. This is where TF-IDF and BM25 come in.

TF: Term Frequency

\text{TF}(t, d) = \text{number of times term } t \text{ appears in document } d

The more frequently a term appears in a document, the more likely that document is relevant to that term.

IDF: Inverse Document Frequency

\text{IDF}(t) = \log_{10}\!\left(\frac{N}{\text{df}(t)}\right)

$N$ is the total number of documents, and $\text{df}(t)$ is the number of documents containing term $t$ . Words like "the" that appear everywhere have low IDF, while words like "algorithm" that appear in few documents have high IDF. This quantifies the intuition that rarer words are more important.

TF-IDF Score

\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)

For multi-term queries, TF-IDF scores for each term are summed to produce the document's final score.

Calculation Demo

Let's walk through an actual TF-IDF score calculation to see how ranking works.

TF-IDF Score Calculator

Step through the TF-IDF scoring process for the query "cat mat".

Query:"cat mat"N=3

D1[cat, sat, mat]

D2[cat, dog]

D3[dog, sat, log]

Let's calculate TF-IDF scores for the query "cat mat" across 3 documents (after stopword removal).1 / 7

BM25: Improving on TF-IDF

TF-IDF has some issues. For example, a document where a term appears 100 times gets 100× the score of one where it appears once — but the relevance difference isn't that large.

BM25 (Best Matching 25) is an improved algorithm that's widely used in modern search engines.

\text{BM25}(t, d) = \text{IDF}(t) \cdot \frac{\text{TF}(t,d) \cdot (k_1 + 1)}{\text{TF}(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}

Two parameters are key:

$k_1$ (typically 1.2): Controls TF saturation. As TF increases, the saturation factor $\frac{\text{TF} \cdot (k_1+1)}{\text{TF}+K}$ asymptotically approaches $(k_1 + 1)$ rather than growing without bound
$b$ (typically 0.75): Document length normalization. Longer documents tend to have higher TF simply because they contain more words; $b$ corrects for this bias

Practical Topics

Index Compression

Posting lists are sorted integer sequences of document IDs, so they can be efficiently compressed using delta encoding combined with Variable Byte Encoding:

Original list:    [101, 105, 108, 120]
Delta encoding:   [101,   4,   3,  12]  ← smaller numbers
Variable bytes:   efficiently packed into bytes

Distributed Search

Large-scale search engines shard (partition) the index across multiple nodes. Queries are broadcast to all shards (scatter), and results from each shard are aggregated (gather) to produce the final ranking. This scatter-gather pattern is widely used in systems like Elasticsearch.

Apache Lucene and Elasticsearch

Apache Lucene is the de facto standard full-text search library built on inverted indexes. Implemented in Java, it includes several optimizations:

Segmentation: Split the index into immutable segments, parallelizing reads and writes
Skip lists: Fast scanning of posting lists
Block encoding: SIMD-accelerated posting list processing

Elasticsearch is a distributed search engine built on Lucene, providing REST APIs, clustering, and real-time indexing.

Summary

We've examined search engine internals through three pillars:

Inverted Index: A "term → document list" data structure enabling $O(1)$ lookup
Boolean Search: Fast query processing through set operations on posting lists
TF-IDF / BM25: Relevance scoring by combining term frequency with document-level rarity

The "instant results" we experience in search is only possible through the combination of these technologies. Next time you type a query into a search box, the inverted index is being consulted and BM25 is ranking documents behind the scenes — I hope this article helps you imagine that mechanism.

References

C. D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
S. Robertson, H. Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009.
Apache Lucene Documentation