Random Walks — From a Drunkard's Stagger to PageRank

Introduction — The Drunkard's Return

A drunk stands under a lamppost. At each step, he moves left or right with equal probability. Will he eventually return to where he started?

This problem traces back to a short letter Karl Pearson published in Nature in 1905. Pearson was trying to model the migration pattern of mosquitoes. Lord Rayleigh promptly replied with a solution, and this exchange became the formal origin of the random walk concept. Coincidentally, the same model was implicitly contained in Einstein's theory of Brownian motion, also from 1905. By happy accident, both physics and pure mathematics gave birth to random walk theory in the same year.

The answer to this simple question is surprisingly deep, connecting probability theory, graph theory, and information retrieval. In this article, we trace a path of thought from a one-dimensional line through Markov chains on graphs to Google's PageRank algorithm.

Chapter 1 — 1D Random Walk

The Simplest Model

Start at the origin on the number line. At each step, move $+1$ with probability $1/2$ and $-1$ with probability $1/2$ . If we denote the position after $n$ steps as $S_n$ :

$S_n = X_1 + X_2 + \cdots + X_n$

where each $X_i$ independently takes $+1$ or $-1$ . We call each $X_i$ a step.

This model is especially important because in the limit it converges to Brownian motion (the Wiener process). Shrink the step size to $1/\sqrt{n}$ and let the number of steps go to infinity, and you get a continuous stochastic process. The 1D random walk is thus "discrete-time Brownian motion." The Black-Scholes model in financial engineering is built on this continuous version.

Expected Value and Variance

The expected value is obviously zero by symmetry. Each step has expectation $E[X_i] = (1/2)(+1) + (1/2)(-1) = 0$ , so:

$E[S_n] = \sum_{i=1}^{n} E[X_i] = 0$

What about variance? Each step has variance $\mathrm{Var}(X_i) = E[X_i^2] - (E[X_i])^2 = 1 - 0 = 1$ . Since the $X_i$ are independent, variances add:

$\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n$

The standard deviation is $\sqrt{n}$ . In other words, "on average you go nowhere, but the spread grows like $\sqrt{n}$ ." After 100 steps the average position is 0, but the standard deviation is 10. After 10,000 steps the average is still 0, but the standard deviation has grown to 100. Increasing the number of steps by a factor of 100 only increases the spread by a factor of 10. This is the essence of diffusion and the fundamental behavior of random walks.

The demo below runs 5 independent random walks simultaneously. Individual paths vary wildly, but watch how the observed variance approaches the theoretical value $n$ .

1D Random Walk

5 independent random walks. Expected value stays at 0, but variance grows over time

Step: 0Mean position: 0.0Variance: 0.0(theory: 0)

Recurrence — Returning to the Origin

One-dimensional random walks have a remarkable property: they return to the origin with probability 1.

First, let's compute the probability of being at the origin after $2n$ steps. For $S_{2n} = 0$ to hold, we need exactly $n$ steps of $+1$ and $n$ steps of $-1$ . The number of ways to choose $n$ positions from $2n$ trials is $\binom{2n}{n}$ , and each sequence has probability $(1/2)^{2n}$ :

$P(S_{2n} = 0) = \binom{2n}{n} \left(\frac{1}{2}\right)^{2n}$

Using Stirling's approximation $n! \approx \sqrt{2\pi n}\,(n/e)^n$ :

$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{4^n}{\sqrt{\pi n}}$

Therefore:

$P(S_{2n} = 0) \approx \frac{4^n}{\sqrt{\pi n}} \cdot \frac{1}{4^n} = \frac{1}{\sqrt{\pi n}}$

This probability decreases with $n$ . It might seem like "it gets harder to return," but the key is that the decay is slow enough. Consider the expected number of returns to the origin:

$E[\text{returns}] = \sum_{n=1}^{\infty} P(S_{2n} = 0) \approx \sum_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}} = \infty$

The series $\sum 1/\sqrt{n}$ diverges (compare with the integral $\int_1^\infty x^{-1/2}\,dx = \infty$ ). Since the expected number of returns is infinite, the walker returns to the origin infinitely many times — with probability 1. This is called recurrence.

Intuitively, " $1/\sqrt{n}$ is small, but large enough to diverge when summed to infinity." In contrast, the series $\sum 1/n^2$ converges. As we'll see, in 3D the return probability decays as $1/n^{3/2}$ , the series converges, and return is no longer guaranteed.

First Return Time Distribution

We know the walker "will return," but how long does it take? Surprisingly, the expected first return time is infinite.

Let $f_{2n}$ be the probability of first returning to the origin at step $2n$ :

$f_{2n} = \frac{1}{2n - 1} P(S_{2n} = 0) \approx \frac{1}{2n - 1} \cdot \frac{1}{\sqrt{\pi n}} \sim \frac{1}{n^{3/2}}$

The expected first return time is:

$E[\text{first return}] = \sum_{n=1}^{\infty} 2n \cdot f_{2n} \sim \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}} = \infty$

So we have the paradoxical situation: "you will certainly return, but on average it takes forever." This state — certain return but infinite expected return time — is called null recurrence. On finite graphs, this problem vanishes: the expected return time is always finite. This is an important foreshadowing for the next chapter.

Pólya's Theorem — Dimension and Recurrence

A natural question arises: "Does this hold in 2D and 3D too?"

In the 2D random walk, at each step we move in one of four directions (up, down, left, right) with equal probability. The probability of being at the origin after $2n$ steps is:

$P(S_{2n} = \mathbf{0}) \approx \frac{1}{\pi n}$

The expected number of returns is $\sum 1/(\pi n)$ , which diverges like the harmonic series. So 2D is also recurrent.

What about 3D? At each step we move in one of six directions with equal probability. The probability of being at the origin after $2n$ steps is:

$P(S_{2n} = \mathbf{0}) \approx 2\left(\frac{3}{4\pi n}\right)^{3/2}$

The expected number of returns is $\sum n^{-3/2}$ , which converges. Therefore the number of returns is finite, and there is a positive probability of never returning. George Pólya proved this unified result in 1921:

1 dimension: Recurrent (returns with probability 1)
2 dimensions: Recurrent (returns with probability 1)
3 dimensions and above: Transient (positive probability of never returning)

The specific return probability in 3D can be computed from what's known as Watson's triple integral and equals approximately 0.3405. That is, a 3D random walker has about a 66% chance of never returning to the origin. As dimensions increase, the space becomes "too vast" to find the way back.

Shizuo Kakutani expressed Pólya's result thus: "A drunk man will find his way home, but a drunk bird may get lost forever."

This result has deep implications in physics. In 2D, diffusion on a lattice is recurrent, so a single particle visits every lattice site with probability 1. But in 3D, unvisited sites remain. This plays an important role in theories of phase transitions and crystal growth.

Chapter 2 — Random Walks on Graphs

From Lines to Graphs

If we view the number line as a graph — each integer is a node, edges connect adjacent integers — it's an infinite path graph. The 1D random walk is just a special random walk on this path graph.

Now we extend the idea: can we perform a random walk on any graph? Social networks, transit maps, molecular bonding structures — on any network we can define the operation "be at a node, move to a neighbor."

Transition Rule

The rule is simple: from the current node, choose the next node uniformly at random from among the neighbors. When at node $i$ , the probability of moving to node $j$ is:

$P(i \to j) = \frac{1}{\deg(i)} \quad \text{(when $j$ is adjacent to $i$)}$

At high-degree nodes, the transition probability to each neighbor is small; at low-degree nodes, it's large. A drunk at a busy intersection has a thin probability of taking each road.

Formulation as a Markov Chain

This process is a Markov chain: the next state depends only on the current state, not on history (the Markov property, or "memorylessness").

Define the transition matrix $P$ as a $|V| \times |V|$ matrix:

$P_{ij} = \begin{cases} 1/\deg(i) & \text{if $i$ and $j$ are adjacent} \\ 0 & \text{otherwise} \end{cases}$

Using the matrix, the time evolution of the probability vector $\mathbf{p}^{(t)}$ (row vector of probabilities of being at each node at time $t$ ) is:

$\mathbf{p}^{(t+1)} = \mathbf{p}^{(t)} P$

The distribution after $t$ steps is simply the initial distribution multiplied by $P^t$ .

Stationary Distribution

If we walk long enough, the visit frequency of each node converges to a fixed distribution. This is the stationary distribution $\pi$ . Formally, $\pi$ is a probability distribution satisfying $\pi = \pi P$ — it's the left eigenvector of the transition matrix corresponding to eigenvalue 1.

For connected, non-bipartite graphs (graphs containing an odd-length cycle), the stationary distribution is unique, converged to from any starting distribution, and has a closed form:

$\pi_i = \frac{\deg(i)}{2|E|}$

where $|E|$ is the number of edges. The denominator $2|E|$ equals the sum of all degrees (each edge contributes 1 to the degree of each endpoint — the handshaking lemma).

Nodes with higher degree get visited more often. This makes intuitive sense — a node with many "entrances" naturally receives more walkers.

For example, in a graph with 5 edges where node $A$ has degree 3, we get $\pi_A = 3/10 = 0.3$ . The walker spends 30% of their time at node $A$ in the long run.

In the demo below, a random walker traverses a 6-node graph. Observe how visit frequencies converge to the theoretical stationary distribution $\pi_i = \deg(i) / 2|E|$ .

Random Walk on Graph → Stationary Distribution

Watch the walker's visit frequency converge to π_i = deg(i) / 2|E|

Node	deg	π	obs	err
A	2	0.111	—	—
B	4	0.222	—	—
C	4	0.222	—	—
D	2	0.111	—	—
E	4	0.222	—	—
F	2	0.111	—	—

Total steps: 0

Why a Stationary Distribution Exists — Detailed Balance

The most elegant way to show that $\pi$ satisfies $\pi = \pi P$ is through the detailed balance condition:

$\pi_i P_{ij} = \pi_j P_{ji}$

This condition means "the probability flow from $i$ to $j$ equals the flow from $j$ to $i$ ." If inflow and outflow balance at every pair, no node's probability changes, and a stationary state is achieved.

Substituting $\pi_i = \deg(i) / 2|E|$ and $P_{ij} = 1 / \deg(i)$ , for any adjacent pair $(i, j)$ :

$\frac{\deg(i)}{2|E|} \cdot \frac{1}{\deg(i)} = \frac{1}{2|E|} = \frac{\deg(j)}{2|E|} \cdot \frac{1}{\deg(j)}$

Both sides are equal, so detailed balance holds. A distribution satisfying detailed balance is necessarily stationary (summing $\pi_i P_{ij}$ over $i$ for each $j$ recovers $\pi_j$ ), completing the proof.

A Markov chain satisfying detailed balance is called reversible. Random walks on undirected graphs are always reversible, but Markov chains on general directed graphs need not be. This distinction becomes important for PageRank in the next chapter.

Convergence Speed — Mixing Time

Convergence to the stationary distribution is guaranteed, but how fast does it happen? The speed is governed by the second eigenvalue $\lambda_2$ of the transition matrix:

$\|\mathbf{p}^{(t)} - \pi\|_{\text{TV}} \leq C \cdot |\lambda_2|^t$

The closer $\lambda_2$ is to 1, the slower convergence; the closer to 0, the faster. The gap $1 - \lambda_2$ is called the spectral gap.

The mixing time $t_{\text{mix}}$ — the number of steps needed for the distribution to "get close enough" to stationarity — is approximately $t_{\text{mix}} \sim 1/(1 - \lambda_2)$ .

Complete graph ( $K_n$ ): Large spectral gap, mixes in just a few steps
Path graph (a line): Small spectral gap, needs $O(n^2)$ steps
Expander graphs: Constant spectral gap, mixes in $O(\log n)$ steps

Mixing time is central to convergence diagnostics in MCMC and algorithmic analysis.

Hitting Time and Cover Time

There are other important measures of random walk performance:

Hitting time $h(u, v)$ : Expected number of steps from node $u$ to first reach node $v$
Cover time $C(G)$ : Expected number of steps from some node to visit every node at least once

A famous result: the cover time of any connected $n$ -node graph is at most $O(n^3)$ (though in practice it's often much faster). The cover time of the complete graph is $\Theta(n \log n)$ , which is equivalent to the coupon collector problem (expected number of draws to collect all $n$ types of coupons).

Chapter 3 — The Invention of PageRank

From Citation Networks

Before the web, the random walk idea was applied to ranking academic papers. The Impact Factor is a simple citation-count metric, but it doesn't distinguish between "a citation from a high-quality paper" and "a citation from a low-quality one."

Gabriel Pinski proposed using citation network structure to rank academic journals in 1976. This is a direct precursor to PageRank.

The Web as a Giant Directed Graph

In 1998, Stanford graduate students Larry Page and Sergey Brin turned their attention to the structure of the web. At the time, search engines like AltaVista and Lycos used keyword-frequency-based ranking, making them vulnerable to spam and producing unreliable results.

Each page is a node; each hyperlink is a directed edge. Their question was:

"Can we compute a page's 'importance' purely from link structure?"

Their core insight was recursive: "A page linked from important pages is itself important." But defining "important pages" requires "importance" — a chicken-and-egg problem. Yet this is precisely an eigenvector problem.

The Random Surfer Model

Their idea was essentially a random walk on a graph. An imaginary "random surfer" browses the web:

From the current page, click one of the outgoing links uniformly at random
With probability $1 - d$ , get bored and jump to a random page instead

The damping factor $d$ is typically set to $0.85$ — follow links with 85% probability, teleport randomly with 15%.

The jump mechanism solves two technical problems:

Dangling nodes: Pages with no outgoing links (PDFs, images) would halt the walk. Random jumps provide an escape
Absorbing states: Small clusters of mutually linking pages could trap the walker. With jumps, the transition matrix becomes primitive (all entries positive), guaranteeing that every node can be visited

In matrix notation, the jump-augmented matrix $M$ is:

$M = d \cdot P + (1 - d) \cdot \frac{\mathbf{1}\mathbf{1}^T}{N}$

where $P$ is the link-based transition matrix and $\mathbf{1}\mathbf{1}^T/N$ is a matrix of uniform transition to all nodes (the teleportation matrix).

Power Iteration

The PageRank vector $\mathbf{r}$ is the stationary distribution $\pi M = \pi$ , computed by power iteration:

$r_j^{(t+1)} = \frac{1-d}{N} + d \sum_{i \to j} \frac{r_i^{(t)}}{\text{out}(i)}$

Starting from the uniform distribution $r_i^{(0)} = 1/N$ and iterating. Since $M$ is primitive, the Perron-Frobenius theorem guarantees the eigenvalue-1 eigenvector is unique, and all other eigenvalues have absolute value at most $d$ . Convergence is geometrically fast:

$\|\mathbf{r}^{(t)} - \pi\| \leq C \cdot d^t$

With $d = 0.85$ , about 50 iterations suffice for good precision. Even for the 1998 web (hundreds of millions of pages), this simple iteration computed rankings at practical speed.

The demo below runs PageRank power iteration on a small 6-page web graph. Watch how scores converge with each iteration.

PageRank Simulation

Power iteration PageRank based on the random surfer model. Node size proportional to score

Page	PR(n)	converged	diff
A	0.1667	0.2186	0.0519
B	0.1667	0.1569	0.0098
C	0.1667	0.2470	0.0804
D	0.1667	0.0917	0.0750
E	0.1667	0.1339	0.0327
F	0.1667	0.1519	0.0147

Iteration: 0 / 40d = 0.85

Mathematical Meaning of PageRank

With the previous chapters as background, the essence of PageRank becomes clear.

PageRank is nothing other than the stationary distribution of a damped random walk.

With $d = 1$ (no jumps), it's a pure random walk on a directed graph — but dangling nodes and disconnected components make the stationary distribution non-unique
The jump with $d < 1$ makes the transition matrix primitive (all entries positive). By Perron-Frobenius, the eigenvalue-1 left eigenvector is unique, and convergence from any starting distribution is guaranteed

The principle from the previous chapter — "a random walk's visit frequency converges to the stationary distribution" — directly underwrites the correctness of the PageRank algorithm.

Limitations and Successors

PageRank was groundbreaking, but modern web search combines link structure with hundreds of other signals (text relevance, user behavior, page quality metrics, etc.). PageRank is also vulnerable to link farms — artificially generating mutual links to inflate scores.

Still, the fundamental idea — using a random walk's stationary distribution to compute importance within a network — is widely applied today in social network analysis, bioinformatics gene network analysis, and financial network systemic risk assessment.

Chapter 4 — The Reach of Random Walks

Monte Carlo Methods

Monte Carlo methods use randomness for numerical computation. The name comes from the casino city of Monte Carlo. Von Neumann and Stanislaw Ulam pioneered the approach in the 1940s Manhattan Project for simulating neutron diffusion.

The most famous example: estimate $\pi$ by throwing random points in the unit square $[0,1]^2$ and counting the fraction $p$ that lands inside the unit circle: $\pi \approx 4p$ . With $N$ points, the estimation error is $O(1/\sqrt{N})$ , independent of dimension. This is why Monte Carlo excels at high-dimensional integration. Traditional numerical methods (e.g., Simpson's rule) deteriorate rapidly with dimension (the "curse of dimensionality"), but Monte Carlo maintains $O(1/\sqrt{N})$ accuracy regardless.

MCMC (Markov Chain Monte Carlo)

Monte Carlo assumes we can sample from a uniform distribution. But real-world problems often require sampling from complex distributions (posterior distributions, free energy landscapes).

MCMC constructs a Markov chain whose stationary distribution is the target, then uses a random walk to generate samples. The theory from Chapter 2 applies directly.

Metropolis-Hastings (1953/1970): Generate a candidate $x'$ from proposal distribution $q(x'|x)$ , accept with probability $\alpha = \min(1, \frac{\pi(x') q(x|x')}{\pi(x) q(x'|x)})$ , otherwise stay put. The acceptance probability design guarantees detailed balance, making $\pi$ the stationary distribution. A major advantage: the normalization constant of $\pi$ need not be known. Widely used in Bayesian statistics for posterior sampling
Gibbs Sampling (1984): Sequentially sample each variable from its conditional distribution given the others. A special case of Metropolis-Hastings. Standard in image processing (Markov random fields) and latent variable models (LDA topic models)

The mixing time of MCMC (steps needed for practical convergence) is directly related to the spectral gap from Chapter 2 — random walk theory in direct application.

Graph Clustering

A random walker tends to stay within the same community for a long time and rarely crosses bridge edges between communities. This property can detect community structure.

Markov Clustering (MCL): Alternates between expansion (computing $P^2$ ) and inflation (raising each element to power $r$ and re-normalizing columns). Expansion corresponds to two-step random walks, creating inter-cluster flow. Inflation amplifies strong flows and eliminates weak ones. After iterations, cluster structure emerges naturally. Widely used in protein interaction network analysis
Personalized PageRank: Fixes a specific node $v$ as the teleportation destination. Standard PageRank teleports uniformly to all nodes; Personalized PageRank returns to node $v$ with probability $1-d$ . The result assigns a "closeness" score from $v$ 's perspective to every node. Used by Twitter and Facebook for follow recommendations ("Who to follow")

Recommendation Systems

On a bipartite graph of users and items (connected by purchases, views, or clicks), start a random walk from a given user. Item nodes frequently visited by the walker become recommendations.

Two key advantages:

Natural exploration-exploitation balance: The walk recommends not just direct neighbors (related items to already-purchased products = exploitation) but also items 2+ hops away (= exploration), producing collaborative-filtering-like effects
Cold start handling: Even new items can be recommended if they're indirectly connected to other items in the graph

Pixie (Pinterest's recommendation engine) performs biased random walks on a graph of billions of nodes to deliver real-time personalized recommendations.

Connection to Graph Neural Networks

Recent Graph Neural Networks (GNNs) also have deep ties to random walks:

DeepWalk (2014): Performs random walks on graphs, treats the node visit sequences as "sentences," and trains them with Word2Vec's SkipGram. Co-occurrence in random walk sequences captures node similarity
Node2Vec (2016): Extends DeepWalk by introducing bias parameters $p, q$ that control the trade-off between BFS-like exploration (capturing local structure) and DFS-like exploration (capturing community structure)
GCN (Graph Convolutional Network): The graph convolution is mathematically equivalent to polynomial filters (Chebyshev polynomial approximation) of the random walk transition matrix. In other words, each GCN layer can be interpreted as performing random-walk-like information diffusion

Summary — From a Single Line to the Structure of the World

Let's review what we covered:

1D Random Walk: Expected value 0, variance $n$ , $\sqrt{n}$ diffusion. Recurrent in 1D and 2D, transient in 3D+ (Polya's theorem). The curious phenomenon of null recurrence
Random Walks on Graphs: Formulated as Markov chains. Visit frequency converges to the stationary distribution $\pi_i = \deg(i) / 2|E|$ , proven via detailed balance. Mixing time governed by the spectral gap
PageRank: Computes web page importance as the stationary distribution of a damped random walk. Efficiently computed via power iteration, with convergence guaranteed by Perron-Frobenius
Broader Applications: Monte Carlo, MCMC, graph clustering, recommendation systems, GNNs

The drunkard's simple stagger connects to the core algorithm behind web search engines and cutting-edge graph neural networks. The recurrence and diffusion properties discovered on the simplest structure — the number line — support the foundations of modern information technology via graph theory.