All posts

How Search Engines Work — Understanding Inverted Indexes and TF-IDF with Interactive Diagrams

An interactive deep-dive into search engine internals: from building inverted indexes to ranking with TF-IDF and BM25.

Search EngineInverted IndexTF-IDFInformation RetrievalData Structures

Introduction

Google Search, IDE code search, e-commerce product search — every day we rely on technology that instantly finds target information from massive datasets. The core technology behind all of these is the search engine.

This article explains the three fundamental technologies at the heart of search engines, using interactive diagrams:

  1. Inverted Index — the data structure that enables fast search
  2. Boolean Search — processing queries as set operations
  3. TF-IDF / BM25 — algorithms for ranking search results

Search Engine Architecture

Let's start with an overview of the search engine pipeline.

CollectDocumentsAnalyzeTokenizeIndexBuildQueryProcessRankScoreReturnResults

From document collection to returning results, the process is divided into roughly 6 stages. This article focuses on text analysis, index building, query processing, and ranking.

Text Analysis — Breaking Documents into Tokens

The first step in processing a document is text analysis. Let's look at the pipeline that transforms raw text into searchable units called tokens.

Raw Text"The Cats Were Running!"Lowercase"the cats were running"Tokenize["the", "cats", "were", "running"]Remove Stopwords["cats", "running"]Stemming["cat", "run"]

This pipeline consists of the following stages:

1. Character Normalization

Unicode normalization and case conversion. This ensures that "The" and "the" are treated as the same token.

2. Tokenization

Split text into individual words (tokens). In English, this is typically whitespace-based splitting. Languages like Japanese, which lack spaces between words, require morphological analysis (using tools like MeCab or Kuromoji).

3. Stopword Removal

Remove high-frequency, low-information words like "the", "is", and "on". This reduces index size and improves search precision.

4. Stemming / Lemmatization

Reduce words to their root form: "running""run", "cats""cat". This absorbs spelling variations and improves search recall.

Inverted Index — The Heart of Search

Why We Need an Inverted Index

A naive approach to full-text search would scan each document from start to finish looking for keywords. For NN documents, this requires O(N×L)O(N \times L) time (LL being average document length), which breaks down as the document count grows.

The Inverted Index is the data structure that solves this problem.

Forward Index vs Inverted Index

A normal index (Forward Index) maps "document → contained terms":

Forward Index:
  D1 → [cat, sat, mat]
  D2 → [cat, dog]
  D3 → [dog, sat, log]

An inverted index reverses this relationship. It maps "term → documents containing that term":

Inverted Index:
  cat → [D1, D2]
  dog → [D2, D3]
  log → [D3]
  mat → [D1]
  sat → [D1, D3]

This allows O(1)O(1) lookup for documents containing any given keyword. The list of document IDs associated with each term is called a posting list.

Interactive Demo: Building and Searching an Inverted Index

In the demo below, watch how an inverted index is built from 3 documents, then used to execute a search query step by step.

Inverted Index Builder

Watch how an inverted index is built from documents, then used to answer a search query.

Documents

D1"the cat sat on the mat"
D2"the cat and the dog"
D3"the dog sat on a log"
1 / 15

Posting List Structure

In real search engines, posting lists often contain more than just document IDs:

FieldPurpose
Document IDBasic document identification
Term Frequency (TF)Used for ranking
PositionPhrase search, proximity search
OffsetSnippet generation, highlighting

An index that includes position information is called a positional index, enabling phrase searches like "cat sat".

Query Processing — Boolean Search Model

Once the user's query is tokenized, we use the inverted index to find matching documents. The most fundamental model is Boolean search.

AND (Intersection)cat →D1D2sat →D1D3D1OR (Union)cat →D1D2sat →D1D3D1D2D3Sorted-merge algorithmO(n + m)Both lists are sorted by doc IDNOT (Complement)e.g. "cat NOT dog" → {D1,D2} \ {D2,D3} = {D1}

AND (Intersection)

"cat AND sat" computes the intersection of the posting lists for cat and sat. Since both lists are sorted by document ID, a sorted merge with two pointers processes them in O(n+m)O(n + m) time.

def intersect(p1, p2):
    result = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

OR (Union)

"cat OR sat" returns the union of both posting lists — all documents containing either term.

Query Optimization

When multiple AND conditions exist, processing the term with the shortest posting list first keeps intermediate results small and improves efficiency. This is known as query optimization.

Ranking — TF-IDF and BM25

Boolean search only returns binary results — match or no match. In practice, we want to rank how relevant each document is to the query. This is where TF-IDF and BM25 come in.

TF-IDF(t, d) = TF(t, d) × IDF(t)TF(t, d)Term Frequencycount(t in d)How often does t appear in d?×IDF(t)Inverse Document Frequencylog(N / df(t))How rare is t across all docs?Key InsightHigh TF-IDF = frequent in this document BUT rare overall"the" → low IDF (common everywhere) vs "algorithm" → high IDF (rare, specific)

TF: Term Frequency

TF(t,d)=number of times term t appears in document d\text{TF}(t, d) = \text{number of times term } t \text{ appears in document } d

The more frequently a term appears in a document, the more likely that document is relevant to that term.

IDF: Inverse Document Frequency

IDF(t)=log10 ⁣(Ndf(t))\text{IDF}(t) = \log_{10}\!\left(\frac{N}{\text{df}(t)}\right)

NN is the total number of documents, and df(t)\text{df}(t) is the number of documents containing term tt. Words like "the" that appear everywhere have low IDF, while words like "algorithm" that appear in few documents have high IDF. This quantifies the intuition that rarer words are more important.

TF-IDF Score

TF-IDF(t,d)=TF(t,d)×IDF(t)\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)

For multi-term queries, TF-IDF scores for each term are summed to produce the document's final score.

Calculation Demo

Let's walk through an actual TF-IDF score calculation to see how ranking works.

TF-IDF Score Calculator

Step through the TF-IDF scoring process for the query "cat mat".

Query:"cat mat"N=3
D1[cat, sat, mat]
D2[cat, dog]
D3[dog, sat, log]
1 / 7

BM25: Improving on TF-IDF

TF-IDF has some issues. For example, a document where a term appears 100 times gets 100× the score of one where it appears once — but the relevance difference isn't that large.

BM25 (Best Matching 25) is an improved algorithm that's widely used in modern search engines.

BM25(t,d)=IDF(t)TF(t,d)(k1+1)TF(t,d)+k1(1b+bdavgdl)\text{BM25}(t, d) = \text{IDF}(t) \cdot \frac{\text{TF}(t,d) \cdot (k_1 + 1)}{\text{TF}(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}

Two parameters are key:

  • k1k_1 (typically 1.2): Controls TF saturation. As TF increases, the saturation factor TF(k1+1)TF+K\frac{\text{TF} \cdot (k_1+1)}{\text{TF}+K} asymptotically approaches (k1+1)(k_1 + 1) rather than growing without bound
  • bb (typically 0.75): Document length normalization. Longer documents tend to have higher TF simply because they contain more words; bb corrects for this bias

Practical Topics

Index Compression

Posting lists are sorted integer sequences of document IDs, so they can be efficiently compressed using delta encoding combined with Variable Byte Encoding:

Original list:    [101, 105, 108, 120]
Delta encoding:   [101,   4,   3,  12]  ← smaller numbers
Variable bytes:   efficiently packed into bytes

Distributed Search

Large-scale search engines shard (partition) the index across multiple nodes. Queries are broadcast to all shards (scatter), and results from each shard are aggregated (gather) to produce the final ranking. This scatter-gather pattern is widely used in systems like Elasticsearch.

Apache Lucene and Elasticsearch

Apache Lucene is the de facto standard full-text search library built on inverted indexes. Implemented in Java, it includes several optimizations:

  • Segmentation: Split the index into immutable segments, parallelizing reads and writes
  • Skip lists: Fast scanning of posting lists
  • Block encoding: SIMD-accelerated posting list processing

Elasticsearch is a distributed search engine built on Lucene, providing REST APIs, clustering, and real-time indexing.

Summary

We've examined search engine internals through three pillars:

  1. Inverted Index: A "term → document list" data structure enabling O(1)O(1) lookup
  2. Boolean Search: Fast query processing through set operations on posting lists
  3. TF-IDF / BM25: Relevance scoring by combining term frequency with document-level rarity

The "instant results" we experience in search is only possible through the combination of these technologies. Next time you type a query into a search box, the inverted index is being consulted and BM25 is ranking documents behind the scenes — I hope this article helps you imagine that mechanism.

References