How Search Engines Work — Understanding Inverted Indexes and TF-IDF with Interactive Diagrams
An interactive deep-dive into search engine internals: from building inverted indexes to ranking with TF-IDF and BM25.
Introduction
Google Search, IDE code search, e-commerce product search — every day we rely on technology that instantly finds target information from massive datasets. The core technology behind all of these is the search engine.
This article explains the three fundamental technologies at the heart of search engines, using interactive diagrams:
- Inverted Index — the data structure that enables fast search
- Boolean Search — processing queries as set operations
- TF-IDF / BM25 — algorithms for ranking search results
Search Engine Architecture
Let's start with an overview of the search engine pipeline.
From document collection to returning results, the process is divided into roughly 6 stages. This article focuses on text analysis, index building, query processing, and ranking.
Text Analysis — Breaking Documents into Tokens
The first step in processing a document is text analysis. Let's look at the pipeline that transforms raw text into searchable units called tokens.
This pipeline consists of the following stages:
1. Character Normalization
Unicode normalization and case conversion. This ensures that "The" and "the" are treated as the same token.
2. Tokenization
Split text into individual words (tokens). In English, this is typically whitespace-based splitting. Languages like Japanese, which lack spaces between words, require morphological analysis (using tools like MeCab or Kuromoji).
3. Stopword Removal
Remove high-frequency, low-information words like "the", "is", and "on". This reduces index size and improves search precision.
4. Stemming / Lemmatization
Reduce words to their root form: "running" → "run", "cats" → "cat". This absorbs spelling variations and improves search recall.
Inverted Index — The Heart of Search
Why We Need an Inverted Index
A naive approach to full-text search would scan each document from start to finish looking for keywords. For documents, this requires time ( being average document length), which breaks down as the document count grows.
The Inverted Index is the data structure that solves this problem.
Forward Index vs Inverted Index
A normal index (Forward Index) maps "document → contained terms":
Forward Index:
D1 → [cat, sat, mat]
D2 → [cat, dog]
D3 → [dog, sat, log]An inverted index reverses this relationship. It maps "term → documents containing that term":
Inverted Index:
cat → [D1, D2]
dog → [D2, D3]
log → [D3]
mat → [D1]
sat → [D1, D3]This allows lookup for documents containing any given keyword. The list of document IDs associated with each term is called a posting list.
Interactive Demo: Building and Searching an Inverted Index
In the demo below, watch how an inverted index is built from 3 documents, then used to execute a search query step by step.
Inverted Index Builder
Watch how an inverted index is built from documents, then used to answer a search query.
Documents
Posting List Structure
In real search engines, posting lists often contain more than just document IDs:
| Field | Purpose |
|---|---|
| Document ID | Basic document identification |
| Term Frequency (TF) | Used for ranking |
| Position | Phrase search, proximity search |
| Offset | Snippet generation, highlighting |
An index that includes position information is called a positional index, enabling phrase searches like "cat sat".
Query Processing — Boolean Search Model
Once the user's query is tokenized, we use the inverted index to find matching documents. The most fundamental model is Boolean search.
AND (Intersection)
"cat AND sat" computes the intersection of the posting lists for cat and sat. Since both lists are sorted by document ID, a sorted merge with two pointers processes them in time.
def intersect(p1, p2):
result = []
i, j = 0, 0
while i < len(p1) and j < len(p2):
if p1[i] == p2[j]:
result.append(p1[i])
i += 1
j += 1
elif p1[i] < p2[j]:
i += 1
else:
j += 1
return resultOR (Union)
"cat OR sat" returns the union of both posting lists — all documents containing either term.
Query Optimization
When multiple AND conditions exist, processing the term with the shortest posting list first keeps intermediate results small and improves efficiency. This is known as query optimization.
Ranking — TF-IDF and BM25
Boolean search only returns binary results — match or no match. In practice, we want to rank how relevant each document is to the query. This is where TF-IDF and BM25 come in.
TF: Term Frequency
The more frequently a term appears in a document, the more likely that document is relevant to that term.
IDF: Inverse Document Frequency
is the total number of documents, and is the number of documents containing term . Words like "the" that appear everywhere have low IDF, while words like "algorithm" that appear in few documents have high IDF. This quantifies the intuition that rarer words are more important.
TF-IDF Score
For multi-term queries, TF-IDF scores for each term are summed to produce the document's final score.
Calculation Demo
Let's walk through an actual TF-IDF score calculation to see how ranking works.
TF-IDF Score Calculator
Step through the TF-IDF scoring process for the query "cat mat".
BM25: Improving on TF-IDF
TF-IDF has some issues. For example, a document where a term appears 100 times gets 100× the score of one where it appears once — but the relevance difference isn't that large.
BM25 (Best Matching 25) is an improved algorithm that's widely used in modern search engines.
Two parameters are key:
- (typically 1.2): Controls TF saturation. As TF increases, the saturation factor asymptotically approaches rather than growing without bound
- (typically 0.75): Document length normalization. Longer documents tend to have higher TF simply because they contain more words; corrects for this bias
Practical Topics
Index Compression
Posting lists are sorted integer sequences of document IDs, so they can be efficiently compressed using delta encoding combined with Variable Byte Encoding:
Original list: [101, 105, 108, 120]
Delta encoding: [101, 4, 3, 12] ← smaller numbers
Variable bytes: efficiently packed into bytesDistributed Search
Large-scale search engines shard (partition) the index across multiple nodes. Queries are broadcast to all shards (scatter), and results from each shard are aggregated (gather) to produce the final ranking. This scatter-gather pattern is widely used in systems like Elasticsearch.
Apache Lucene and Elasticsearch
Apache Lucene is the de facto standard full-text search library built on inverted indexes. Implemented in Java, it includes several optimizations:
- Segmentation: Split the index into immutable segments, parallelizing reads and writes
- Skip lists: Fast scanning of posting lists
- Block encoding: SIMD-accelerated posting list processing
Elasticsearch is a distributed search engine built on Lucene, providing REST APIs, clustering, and real-time indexing.
Summary
We've examined search engine internals through three pillars:
- Inverted Index: A "term → document list" data structure enabling lookup
- Boolean Search: Fast query processing through set operations on posting lists
- TF-IDF / BM25: Relevance scoring by combining term frequency with document-level rarity
The "instant results" we experience in search is only possible through the combination of these technologies. Next time you type a query into a search box, the inverted index is being consulted and BM25 is ranking documents behind the scenes — I hope this article helps you imagine that mechanism.
References
- C. D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
- S. Robertson, H. Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009.
- Apache Lucene Documentation