←All posts

Random Walks β€” From a Drunkard's Stagger to PageRank

Starting from 1D random walk recurrence, through Markov chains and stationary distributions on graphs, to the invention of PageRank

AlgorithmsProbabilityPageRankMarkov Chain

Introduction β€” The Drunkard's Return

A drunk stands under a lamppost. At each step, he moves left or right with equal probability. Will he eventually return to where he started?

This problem traces back to a short letter Karl Pearson published in Nature in 1905. Pearson was trying to model the migration pattern of mosquitoes. Lord Rayleigh promptly replied with a solution, and this exchange became the formal origin of the random walk concept. Coincidentally, the same model was implicitly contained in Einstein's theory of Brownian motion, also from 1905. By happy accident, both physics and pure mathematics gave birth to random walk theory in the same year.

The answer to this simple question is surprisingly deep, connecting probability theory, graph theory, and information retrieval. In this article, we trace a path of thought from a one-dimensional line through Markov chains on graphs to Google's PageRank algorithm.


Chapter 1 β€” 1D Random Walk

The Simplest Model

Start at the origin on the number line. At each step, move +1+1 with probability 1/21/2 and βˆ’1-1 with probability 1/21/2. If we denote the position after nn steps as SnS_n:

Sn=X1+X2+β‹―+XnS_n = X_1 + X_2 + \cdots + X_n

where each XiX_i independently takes +1+1 or βˆ’1-1. We call each XiX_i a step.

This model is especially important because in the limit it converges to Brownian motion (the Wiener process). Shrink the step size to 1/n1/\sqrt{n} and let the number of steps go to infinity, and you get a continuous stochastic process. The 1D random walk is thus "discrete-time Brownian motion." The Black-Scholes model in financial engineering is built on this continuous version.

Expected Value and Variance

The expected value is obviously zero by symmetry. Each step has expectation E[Xi]=(1/2)(+1)+(1/2)(βˆ’1)=0E[X_i] = (1/2)(+1) + (1/2)(-1) = 0, so:

E[Sn]=βˆ‘i=1nE[Xi]=0E[S_n] = \sum_{i=1}^{n} E[X_i] = 0

What about variance? Each step has variance Var(Xi)=E[Xi2]βˆ’(E[Xi])2=1βˆ’0=1\mathrm{Var}(X_i) = E[X_i^2] - (E[X_i])^2 = 1 - 0 = 1. Since the XiX_i are independent, variances add:

Var(Sn)=βˆ‘i=1nVar(Xi)=n\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n

The standard deviation is n\sqrt{n}. In other words, "on average you go nowhere, but the spread grows like n\sqrt{n}." After 100 steps the average position is 0, but the standard deviation is 10. After 10,000 steps the average is still 0, but the standard deviation has grown to 100. Increasing the number of steps by a factor of 100 only increases the spread by a factor of 10. This is the essence of diffusion and the fundamental behavior of random walks.

The demo below runs 5 independent random walks simultaneously. Individual paths vary wildly, but watch how the observed variance approaches the theoretical value nn.

1D Random Walk

5 independent random walks. Expected value stays at 0, but variance grows over time

03-20200steps
Step: 0Mean position: 0.0Variance: 0.0(theory: 0)

Recurrence β€” Returning to the Origin

One-dimensional random walks have a remarkable property: they return to the origin with probability 1.

First, let's compute the probability of being at the origin after 2n2n steps. For S2n=0S_{2n} = 0 to hold, we need exactly nn steps of +1+1 and nn steps of βˆ’1-1. The number of ways to choose nn positions from 2n2n trials is (2nn)\binom{2n}{n}, and each sequence has probability (1/2)2n(1/2)^{2n}:

P(S2n=0)=(2nn)(12)2nP(S_{2n} = 0) = \binom{2n}{n} \left(\frac{1}{2}\right)^{2n}

Using Stirling's approximation n!β‰ˆ2Ο€n (n/e)nn! \approx \sqrt{2\pi n}\,(n/e)^n:

(2nn)=(2n)!(n!)2β‰ˆ4nΟ€n\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{4^n}{\sqrt{\pi n}}

Therefore:

P(S2n=0)β‰ˆ4nΟ€nβ‹…14n=1Ο€nP(S_{2n} = 0) \approx \frac{4^n}{\sqrt{\pi n}} \cdot \frac{1}{4^n} = \frac{1}{\sqrt{\pi n}}

This probability decreases with nn. It might seem like "it gets harder to return," but the key is that the decay is slow enough. Consider the expected number of returns to the origin:

E[returns]=βˆ‘n=1∞P(S2n=0)β‰ˆβˆ‘n=1∞1Ο€n=∞E[\text{returns}] = \sum_{n=1}^{\infty} P(S_{2n} = 0) \approx \sum_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}} = \infty

The series βˆ‘1/n\sum 1/\sqrt{n} diverges (compare with the integral ∫1∞xβˆ’1/2 dx=∞\int_1^\infty x^{-1/2}\,dx = \infty). Since the expected number of returns is infinite, the walker returns to the origin infinitely many times β€” with probability 1. This is called recurrence.

Intuitively, "1/n1/\sqrt{n} is small, but large enough to diverge when summed to infinity." In contrast, the series βˆ‘1/n2\sum 1/n^2 converges. As we'll see, in 3D the return probability decays as 1/n3/21/n^{3/2}, the series converges, and return is no longer guaranteed.

First Return Time Distribution

We know the walker "will return," but how long does it take? Surprisingly, the expected first return time is infinite.

Let f2nf_{2n} be the probability of first returning to the origin at step 2n2n:

f2n=12nβˆ’1P(S2n=0)β‰ˆ12nβˆ’1β‹…1Ο€n∼1n3/2f_{2n} = \frac{1}{2n - 1} P(S_{2n} = 0) \approx \frac{1}{2n - 1} \cdot \frac{1}{\sqrt{\pi n}} \sim \frac{1}{n^{3/2}}

The expected first return time is:

E[firstΒ return]=βˆ‘n=1∞2nβ‹…f2nβˆΌβˆ‘n=1∞1n=∞E[\text{first return}] = \sum_{n=1}^{\infty} 2n \cdot f_{2n} \sim \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}} = \infty

So we have the paradoxical situation: "you will certainly return, but on average it takes forever." This state β€” certain return but infinite expected return time β€” is called null recurrence. On finite graphs, this problem vanishes: the expected return time is always finite. This is an important foreshadowing for the next chapter.

PΓ³lya's Theorem β€” Dimension and Recurrence

A natural question arises: "Does this hold in 2D and 3D too?"

In the 2D random walk, at each step we move in one of four directions (up, down, left, right) with equal probability. The probability of being at the origin after 2n2n steps is:

P(S2n=0)β‰ˆ1Ο€nP(S_{2n} = \mathbf{0}) \approx \frac{1}{\pi n}

The expected number of returns is βˆ‘1/(Ο€n)\sum 1/(\pi n), which diverges like the harmonic series. So 2D is also recurrent.

What about 3D? At each step we move in one of six directions with equal probability. The probability of being at the origin after 2n2n steps is:

P(S2n=0)β‰ˆ2(34Ο€n)3/2P(S_{2n} = \mathbf{0}) \approx 2\left(\frac{3}{4\pi n}\right)^{3/2}

The expected number of returns is βˆ‘nβˆ’3/2\sum n^{-3/2}, which converges. Therefore the number of returns is finite, and there is a positive probability of never returning. George PΓ³lya proved this unified result in 1921:

  • 1 dimension: Recurrent (returns with probability 1)
  • 2 dimensions: Recurrent (returns with probability 1)
  • 3 dimensions and above: Transient (positive probability of never returning)

The specific return probability in 3D can be computed from what's known as Watson's triple integral and equals approximately 0.3405. That is, a 3D random walker has about a 66% chance of never returning to the origin. As dimensions increase, the space becomes "too vast" to find the way back.

Shizuo Kakutani expressed PΓ³lya's result thus: "A drunk man will find his way home, but a drunk bird may get lost forever."

This result has deep implications in physics. In 2D, diffusion on a lattice is recurrent, so a single particle visits every lattice site with probability 1. But in 3D, unvisited sites remain. This plays an important role in theories of phase transitions and crystal growth.


Chapter 2 β€” Random Walks on Graphs

From Lines to Graphs

If we view the number line as a graph β€” each integer is a node, edges connect adjacent integers β€” it's an infinite path graph. The 1D random walk is just a special random walk on this path graph.

Now we extend the idea: can we perform a random walk on any graph? Social networks, transit maps, molecular bonding structures β€” on any network we can define the operation "be at a node, move to a neighbor."

Transition Rule

The rule is simple: from the current node, choose the next node uniformly at random from among the neighbors. When at node ii, the probability of moving to node jj is:

P(iβ†’j)=1deg⁑(i)(whenΒ jΒ isΒ adjacentΒ toΒ i)P(i \to j) = \frac{1}{\deg(i)} \quad \text{(when $j$ is adjacent to $i$)}

At high-degree nodes, the transition probability to each neighbor is small; at low-degree nodes, it's large. A drunk at a busy intersection has a thin probability of taking each road.

Formulation as a Markov Chain

This process is a Markov chain: the next state depends only on the current state, not on history (the Markov property, or "memorylessness").

Define the transition matrix PP as a ∣Vβˆ£Γ—βˆ£V∣|V| \times |V| matrix:

Pij={1/deg⁑(i)if i and j are adjacent0otherwiseP_{ij} = \begin{cases} 1/\deg(i) & \text{if $i$ and $j$ are adjacent} \\ 0 & \text{otherwise} \end{cases}

Using the matrix, the time evolution of the probability vector p(t)\mathbf{p}^{(t)} (row vector of probabilities of being at each node at time tt) is:

p(t+1)=p(t)P\mathbf{p}^{(t+1)} = \mathbf{p}^{(t)} P

The distribution after tt steps is simply the initial distribution multiplied by PtP^t.

Stationary Distribution

If we walk long enough, the visit frequency of each node converges to a fixed distribution. This is the stationary distribution Ο€\pi. Formally, Ο€\pi is a probability distribution satisfying Ο€=Ο€P\pi = \pi P β€” it's the left eigenvector of the transition matrix corresponding to eigenvalue 1.

For connected, non-bipartite graphs (graphs containing an odd-length cycle), the stationary distribution is unique, converged to from any starting distribution, and has a closed form:

Ο€i=deg⁑(i)2∣E∣\pi_i = \frac{\deg(i)}{2|E|}

where ∣E∣|E| is the number of edges. The denominator 2∣E∣2|E| equals the sum of all degrees (each edge contributes 1 to the degree of each endpoint β€” the handshaking lemma).

Nodes with higher degree get visited more often. This makes intuitive sense β€” a node with many "entrances" naturally receives more walkers.

For example, in a graph with 5 edges where node AA has degree 3, we get Ο€A=3/10=0.3\pi_A = 3/10 = 0.3. The walker spends 30% of their time at node AA in the long run.

In the demo below, a random walker traverses a 6-node graph. Observe how visit frequencies converge to the theoretical stationary distribution Ο€i=deg⁑(i)/2∣E∣\pi_i = \deg(i) / 2|E|.

Random Walk on Graph β†’ Stationary Distribution

Watch the walker's visit frequency converge to Ο€_i = deg(i) / 2|E|

A0B0C0D0E0F0
NodedegΟ€obserr
A20.111β€”β€”
B40.222β€”β€”
C40.222β€”β€”
D20.111β€”β€”
E40.222β€”β€”
F20.111β€”β€”
Total steps: 0

Why a Stationary Distribution Exists β€” Detailed Balance

The most elegant way to show that Ο€\pi satisfies Ο€=Ο€P\pi = \pi P is through the detailed balance condition:

Ο€iPij=Ο€jPji\pi_i P_{ij} = \pi_j P_{ji}

This condition means "the probability flow from ii to jj equals the flow from jj to ii." If inflow and outflow balance at every pair, no node's probability changes, and a stationary state is achieved.

Substituting Ο€i=deg⁑(i)/2∣E∣\pi_i = \deg(i) / 2|E| and Pij=1/deg⁑(i)P_{ij} = 1 / \deg(i), for any adjacent pair (i,j)(i, j):

deg⁑(i)2∣Eβˆ£β‹…1deg⁑(i)=12∣E∣=deg⁑(j)2∣Eβˆ£β‹…1deg⁑(j)\frac{\deg(i)}{2|E|} \cdot \frac{1}{\deg(i)} = \frac{1}{2|E|} = \frac{\deg(j)}{2|E|} \cdot \frac{1}{\deg(j)}

Both sides are equal, so detailed balance holds. A distribution satisfying detailed balance is necessarily stationary (summing Ο€iPij\pi_i P_{ij} over ii for each jj recovers Ο€j\pi_j), completing the proof.

A Markov chain satisfying detailed balance is called reversible. Random walks on undirected graphs are always reversible, but Markov chains on general directed graphs need not be. This distinction becomes important for PageRank in the next chapter.

Convergence Speed β€” Mixing Time

Convergence to the stationary distribution is guaranteed, but how fast does it happen? The speed is governed by the second eigenvalue Ξ»2\lambda_2 of the transition matrix:

βˆ₯p(t)βˆ’Ο€βˆ₯TV≀Cβ‹…βˆ£Ξ»2∣t\|\mathbf{p}^{(t)} - \pi\|_{\text{TV}} \leq C \cdot |\lambda_2|^t

The closer Ξ»2\lambda_2 is to 1, the slower convergence; the closer to 0, the faster. The gap 1βˆ’Ξ»21 - \lambda_2 is called the spectral gap.

The mixing time tmixt_{\text{mix}} β€” the number of steps needed for the distribution to "get close enough" to stationarity β€” is approximately tmix∼1/(1βˆ’Ξ»2)t_{\text{mix}} \sim 1/(1 - \lambda_2).

  • Complete graph (KnK_n): Large spectral gap, mixes in just a few steps
  • Path graph (a line): Small spectral gap, needs O(n2)O(n^2) steps
  • Expander graphs: Constant spectral gap, mixes in O(log⁑n)O(\log n) steps

Mixing time is central to convergence diagnostics in MCMC and algorithmic analysis.

Hitting Time and Cover Time

There are other important measures of random walk performance:

  • Hitting time h(u,v)h(u, v): Expected number of steps from node uu to first reach node vv
  • Cover time C(G)C(G): Expected number of steps from some node to visit every node at least once

A famous result: the cover time of any connected nn-node graph is at most O(n3)O(n^3) (though in practice it's often much faster). The cover time of the complete graph is Θ(nlog⁑n)\Theta(n \log n), which is equivalent to the coupon collector problem (expected number of draws to collect all nn types of coupons).


Chapter 3 β€” The Invention of PageRank

From Citation Networks

Before the web, the random walk idea was applied to ranking academic papers. The Impact Factor is a simple citation-count metric, but it doesn't distinguish between "a citation from a high-quality paper" and "a citation from a low-quality one."

Gabriel Pinski proposed using citation network structure to rank academic journals in 1976. This is a direct precursor to PageRank.

The Web as a Giant Directed Graph

In 1998, Stanford graduate students Larry Page and Sergey Brin turned their attention to the structure of the web. At the time, search engines like AltaVista and Lycos used keyword-frequency-based ranking, making them vulnerable to spam and producing unreliable results.

Each page is a node; each hyperlink is a directed edge. Their question was:

"Can we compute a page's 'importance' purely from link structure?"

Their core insight was recursive: "A page linked from important pages is itself important." But defining "important pages" requires "importance" β€” a chicken-and-egg problem. Yet this is precisely an eigenvector problem.

The Random Surfer Model

Their idea was essentially a random walk on a graph. An imaginary "random surfer" browses the web:

  1. From the current page, click one of the outgoing links uniformly at random
  2. With probability 1βˆ’d1 - d, get bored and jump to a random page instead

The damping factor dd is typically set to 0.850.85 β€” follow links with 85% probability, teleport randomly with 15%.

The jump mechanism solves two technical problems:

  1. Dangling nodes: Pages with no outgoing links (PDFs, images) would halt the walk. Random jumps provide an escape
  2. Absorbing states: Small clusters of mutually linking pages could trap the walker. With jumps, the transition matrix becomes primitive (all entries positive), guaranteeing that every node can be visited

In matrix notation, the jump-augmented matrix MM is:

M=dβ‹…P+(1βˆ’d)β‹…11TNM = d \cdot P + (1 - d) \cdot \frac{\mathbf{1}\mathbf{1}^T}{N}

where PP is the link-based transition matrix and 11T/N\mathbf{1}\mathbf{1}^T/N is a matrix of uniform transition to all nodes (the teleportation matrix).

Power Iteration

The PageRank vector r\mathbf{r} is the stationary distribution Ο€M=Ο€\pi M = \pi, computed by power iteration:

rj(t+1)=1βˆ’dN+dβˆ‘iβ†’jri(t)out(i)r_j^{(t+1)} = \frac{1-d}{N} + d \sum_{i \to j} \frac{r_i^{(t)}}{\text{out}(i)}

Starting from the uniform distribution ri(0)=1/Nr_i^{(0)} = 1/N and iterating. Since MM is primitive, the Perron-Frobenius theorem guarantees the eigenvalue-1 eigenvector is unique, and all other eigenvalues have absolute value at most dd. Convergence is geometrically fast:

βˆ₯r(t)βˆ’Ο€βˆ₯≀Cβ‹…dt\|\mathbf{r}^{(t)} - \pi\| \leq C \cdot d^t

With d=0.85d = 0.85, about 50 iterations suffice for good precision. Even for the 1998 web (hundreds of millions of pages), this simple iteration computed rankings at practical speed.

The demo below runs PageRank power iteration on a small 6-page web graph. Watch how scores converge with each iteration.

PageRank Simulation

Power iteration PageRank based on the random surfer model. Node size proportional to score

A0.167B0.167C0.167D0.167E0.167F0.167
PagePR(n)convergeddiff
A0.16670.21860.0519
B0.16670.15690.0098
C0.16670.24700.0804
D0.16670.09170.0750
E0.16670.13390.0327
F0.16670.15190.0147
Iteration: 0 / 40d = 0.85

Mathematical Meaning of PageRank

With the previous chapters as background, the essence of PageRank becomes clear.

PageRank is nothing other than the stationary distribution of a damped random walk.

  • With d=1d = 1 (no jumps), it's a pure random walk on a directed graph β€” but dangling nodes and disconnected components make the stationary distribution non-unique
  • The jump with d<1d < 1 makes the transition matrix primitive (all entries positive). By Perron-Frobenius, the eigenvalue-1 left eigenvector is unique, and convergence from any starting distribution is guaranteed

The principle from the previous chapter β€” "a random walk's visit frequency converges to the stationary distribution" β€” directly underwrites the correctness of the PageRank algorithm.

Limitations and Successors

PageRank was groundbreaking, but modern web search combines link structure with hundreds of other signals (text relevance, user behavior, page quality metrics, etc.). PageRank is also vulnerable to link farms β€” artificially generating mutual links to inflate scores.

Still, the fundamental idea β€” using a random walk's stationary distribution to compute importance within a network β€” is widely applied today in social network analysis, bioinformatics gene network analysis, and financial network systemic risk assessment.


Chapter 4 β€” The Reach of Random Walks

Monte Carlo Methods

Monte Carlo methods use randomness for numerical computation. The name comes from the casino city of Monte Carlo. Von Neumann and Stanislaw Ulam pioneered the approach in the 1940s Manhattan Project for simulating neutron diffusion.

The most famous example: estimate Ο€\pi by throwing random points in the unit square [0,1]2[0,1]^2 and counting the fraction pp that lands inside the unit circle: Ο€β‰ˆ4p\pi \approx 4p. With NN points, the estimation error is O(1/N)O(1/\sqrt{N}), independent of dimension. This is why Monte Carlo excels at high-dimensional integration. Traditional numerical methods (e.g., Simpson's rule) deteriorate rapidly with dimension (the "curse of dimensionality"), but Monte Carlo maintains O(1/N)O(1/\sqrt{N}) accuracy regardless.

MCMC (Markov Chain Monte Carlo)

Monte Carlo assumes we can sample from a uniform distribution. But real-world problems often require sampling from complex distributions (posterior distributions, free energy landscapes).

MCMC constructs a Markov chain whose stationary distribution is the target, then uses a random walk to generate samples. The theory from Chapter 2 applies directly.

  • Metropolis-Hastings (1953/1970): Generate a candidate xβ€²x' from proposal distribution q(xβ€²βˆ£x)q(x'|x), accept with probability Ξ±=min⁑(1,Ο€(xβ€²)q(x∣xβ€²)Ο€(x)q(xβ€²βˆ£x))\alpha = \min(1, \frac{\pi(x') q(x|x')}{\pi(x) q(x'|x)}), otherwise stay put. The acceptance probability design guarantees detailed balance, making Ο€\pi the stationary distribution. A major advantage: the normalization constant of Ο€\pi need not be known. Widely used in Bayesian statistics for posterior sampling

  • Gibbs Sampling (1984): Sequentially sample each variable from its conditional distribution given the others. A special case of Metropolis-Hastings. Standard in image processing (Markov random fields) and latent variable models (LDA topic models)

The mixing time of MCMC (steps needed for practical convergence) is directly related to the spectral gap from Chapter 2 β€” random walk theory in direct application.

Graph Clustering

A random walker tends to stay within the same community for a long time and rarely crosses bridge edges between communities. This property can detect community structure.

  • Markov Clustering (MCL): Alternates between expansion (computing P2P^2) and inflation (raising each element to power rr and re-normalizing columns). Expansion corresponds to two-step random walks, creating inter-cluster flow. Inflation amplifies strong flows and eliminates weak ones. After iterations, cluster structure emerges naturally. Widely used in protein interaction network analysis

  • Personalized PageRank: Fixes a specific node vv as the teleportation destination. Standard PageRank teleports uniformly to all nodes; Personalized PageRank returns to node vv with probability 1βˆ’d1-d. The result assigns a "closeness" score from vv's perspective to every node. Used by Twitter and Facebook for follow recommendations ("Who to follow")

Recommendation Systems

On a bipartite graph of users and items (connected by purchases, views, or clicks), start a random walk from a given user. Item nodes frequently visited by the walker become recommendations.

Two key advantages:

  1. Natural exploration-exploitation balance: The walk recommends not just direct neighbors (related items to already-purchased products = exploitation) but also items 2+ hops away (= exploration), producing collaborative-filtering-like effects
  2. Cold start handling: Even new items can be recommended if they're indirectly connected to other items in the graph

Pixie (Pinterest's recommendation engine) performs biased random walks on a graph of billions of nodes to deliver real-time personalized recommendations.

Connection to Graph Neural Networks

Recent Graph Neural Networks (GNNs) also have deep ties to random walks:

  • DeepWalk (2014): Performs random walks on graphs, treats the node visit sequences as "sentences," and trains them with Word2Vec's SkipGram. Co-occurrence in random walk sequences captures node similarity
  • Node2Vec (2016): Extends DeepWalk by introducing bias parameters p,qp, q that control the trade-off between BFS-like exploration (capturing local structure) and DFS-like exploration (capturing community structure)
  • GCN (Graph Convolutional Network): The graph convolution is mathematically equivalent to polynomial filters (Chebyshev polynomial approximation) of the random walk transition matrix. In other words, each GCN layer can be interpreted as performing random-walk-like information diffusion

Summary β€” From a Single Line to the Structure of the World

Let's review what we covered:

  1. 1D Random Walk: Expected value 0, variance nn, n\sqrt{n} diffusion. Recurrent in 1D and 2D, transient in 3D+ (Polya's theorem). The curious phenomenon of null recurrence
  2. Random Walks on Graphs: Formulated as Markov chains. Visit frequency converges to the stationary distribution Ο€i=deg⁑(i)/2∣E∣\pi_i = \deg(i) / 2|E|, proven via detailed balance. Mixing time governed by the spectral gap
  3. PageRank: Computes web page importance as the stationary distribution of a damped random walk. Efficiently computed via power iteration, with convergence guaranteed by Perron-Frobenius
  4. Broader Applications: Monte Carlo, MCMC, graph clustering, recommendation systems, GNNs

The drunkard's simple stagger connects to the core algorithm behind web search engines and cutting-edge graph neural networks. The recurrence and diffusion properties discovered on the simplest structure β€” the number line β€” support the foundations of modern information technology via graph theory.