Random Walks β From a Drunkard's Stagger to PageRank
Starting from 1D random walk recurrence, through Markov chains and stationary distributions on graphs, to the invention of PageRank
Introduction β The Drunkard's Return
A drunk stands under a lamppost. At each step, he moves left or right with equal probability. Will he eventually return to where he started?
This problem traces back to a short letter Karl Pearson published in Nature in 1905. Pearson was trying to model the migration pattern of mosquitoes. Lord Rayleigh promptly replied with a solution, and this exchange became the formal origin of the random walk concept. Coincidentally, the same model was implicitly contained in Einstein's theory of Brownian motion, also from 1905. By happy accident, both physics and pure mathematics gave birth to random walk theory in the same year.
The answer to this simple question is surprisingly deep, connecting probability theory, graph theory, and information retrieval. In this article, we trace a path of thought from a one-dimensional line through Markov chains on graphs to Google's PageRank algorithm.
Chapter 1 β 1D Random Walk
The Simplest Model
Start at the origin on the number line. At each step, move with probability and with probability . If we denote the position after steps as :
where each independently takes or . We call each a step.
This model is especially important because in the limit it converges to Brownian motion (the Wiener process). Shrink the step size to and let the number of steps go to infinity, and you get a continuous stochastic process. The 1D random walk is thus "discrete-time Brownian motion." The Black-Scholes model in financial engineering is built on this continuous version.
Expected Value and Variance
The expected value is obviously zero by symmetry. Each step has expectation , so:
What about variance? Each step has variance . Since the are independent, variances add:
The standard deviation is . In other words, "on average you go nowhere, but the spread grows like ." After 100 steps the average position is 0, but the standard deviation is 10. After 10,000 steps the average is still 0, but the standard deviation has grown to 100. Increasing the number of steps by a factor of 100 only increases the spread by a factor of 10. This is the essence of diffusion and the fundamental behavior of random walks.
The demo below runs 5 independent random walks simultaneously. Individual paths vary wildly, but watch how the observed variance approaches the theoretical value .
1D Random Walk
5 independent random walks. Expected value stays at 0, but variance grows over time
Recurrence β Returning to the Origin
One-dimensional random walks have a remarkable property: they return to the origin with probability 1.
First, let's compute the probability of being at the origin after steps. For to hold, we need exactly steps of and steps of . The number of ways to choose positions from trials is , and each sequence has probability :
Using Stirling's approximation :
Therefore:
This probability decreases with . It might seem like "it gets harder to return," but the key is that the decay is slow enough. Consider the expected number of returns to the origin:
The series diverges (compare with the integral ). Since the expected number of returns is infinite, the walker returns to the origin infinitely many times β with probability 1. This is called recurrence.
Intuitively, " is small, but large enough to diverge when summed to infinity." In contrast, the series converges. As we'll see, in 3D the return probability decays as , the series converges, and return is no longer guaranteed.
First Return Time Distribution
We know the walker "will return," but how long does it take? Surprisingly, the expected first return time is infinite.
Let be the probability of first returning to the origin at step :
The expected first return time is:
So we have the paradoxical situation: "you will certainly return, but on average it takes forever." This state β certain return but infinite expected return time β is called null recurrence. On finite graphs, this problem vanishes: the expected return time is always finite. This is an important foreshadowing for the next chapter.
PΓ³lya's Theorem β Dimension and Recurrence
A natural question arises: "Does this hold in 2D and 3D too?"
In the 2D random walk, at each step we move in one of four directions (up, down, left, right) with equal probability. The probability of being at the origin after steps is:
The expected number of returns is , which diverges like the harmonic series. So 2D is also recurrent.
What about 3D? At each step we move in one of six directions with equal probability. The probability of being at the origin after steps is:
The expected number of returns is , which converges. Therefore the number of returns is finite, and there is a positive probability of never returning. George PΓ³lya proved this unified result in 1921:
- 1 dimension: Recurrent (returns with probability 1)
- 2 dimensions: Recurrent (returns with probability 1)
- 3 dimensions and above: Transient (positive probability of never returning)
The specific return probability in 3D can be computed from what's known as Watson's triple integral and equals approximately 0.3405. That is, a 3D random walker has about a 66% chance of never returning to the origin. As dimensions increase, the space becomes "too vast" to find the way back.
Shizuo Kakutani expressed PΓ³lya's result thus: "A drunk man will find his way home, but a drunk bird may get lost forever."
This result has deep implications in physics. In 2D, diffusion on a lattice is recurrent, so a single particle visits every lattice site with probability 1. But in 3D, unvisited sites remain. This plays an important role in theories of phase transitions and crystal growth.
Chapter 2 β Random Walks on Graphs
From Lines to Graphs
If we view the number line as a graph β each integer is a node, edges connect adjacent integers β it's an infinite path graph. The 1D random walk is just a special random walk on this path graph.
Now we extend the idea: can we perform a random walk on any graph? Social networks, transit maps, molecular bonding structures β on any network we can define the operation "be at a node, move to a neighbor."
Transition Rule
The rule is simple: from the current node, choose the next node uniformly at random from among the neighbors. When at node , the probability of moving to node is:
At high-degree nodes, the transition probability to each neighbor is small; at low-degree nodes, it's large. A drunk at a busy intersection has a thin probability of taking each road.
Formulation as a Markov Chain
This process is a Markov chain: the next state depends only on the current state, not on history (the Markov property, or "memorylessness").
Define the transition matrix as a matrix:
Using the matrix, the time evolution of the probability vector (row vector of probabilities of being at each node at time ) is:
The distribution after steps is simply the initial distribution multiplied by .
Stationary Distribution
If we walk long enough, the visit frequency of each node converges to a fixed distribution. This is the stationary distribution . Formally, is a probability distribution satisfying β it's the left eigenvector of the transition matrix corresponding to eigenvalue 1.
For connected, non-bipartite graphs (graphs containing an odd-length cycle), the stationary distribution is unique, converged to from any starting distribution, and has a closed form:
where is the number of edges. The denominator equals the sum of all degrees (each edge contributes 1 to the degree of each endpoint β the handshaking lemma).
Nodes with higher degree get visited more often. This makes intuitive sense β a node with many "entrances" naturally receives more walkers.
For example, in a graph with 5 edges where node has degree 3, we get . The walker spends 30% of their time at node in the long run.
In the demo below, a random walker traverses a 6-node graph. Observe how visit frequencies converge to the theoretical stationary distribution .
Random Walk on Graph β Stationary Distribution
Watch the walker's visit frequency converge to Ο_i = deg(i) / 2|E|
| Node | deg | Ο | obs | err |
|---|---|---|---|---|
| A | 2 | 0.111 | β | β |
| B | 4 | 0.222 | β | β |
| C | 4 | 0.222 | β | β |
| D | 2 | 0.111 | β | β |
| E | 4 | 0.222 | β | β |
| F | 2 | 0.111 | β | β |
Why a Stationary Distribution Exists β Detailed Balance
The most elegant way to show that satisfies is through the detailed balance condition:
This condition means "the probability flow from to equals the flow from to ." If inflow and outflow balance at every pair, no node's probability changes, and a stationary state is achieved.
Substituting and , for any adjacent pair :
Both sides are equal, so detailed balance holds. A distribution satisfying detailed balance is necessarily stationary (summing over for each recovers ), completing the proof.
A Markov chain satisfying detailed balance is called reversible. Random walks on undirected graphs are always reversible, but Markov chains on general directed graphs need not be. This distinction becomes important for PageRank in the next chapter.
Convergence Speed β Mixing Time
Convergence to the stationary distribution is guaranteed, but how fast does it happen? The speed is governed by the second eigenvalue of the transition matrix:
The closer is to 1, the slower convergence; the closer to 0, the faster. The gap is called the spectral gap.
The mixing time β the number of steps needed for the distribution to "get close enough" to stationarity β is approximately .
- Complete graph (): Large spectral gap, mixes in just a few steps
- Path graph (a line): Small spectral gap, needs steps
- Expander graphs: Constant spectral gap, mixes in steps
Mixing time is central to convergence diagnostics in MCMC and algorithmic analysis.
Hitting Time and Cover Time
There are other important measures of random walk performance:
- Hitting time : Expected number of steps from node to first reach node
- Cover time : Expected number of steps from some node to visit every node at least once
A famous result: the cover time of any connected -node graph is at most (though in practice it's often much faster). The cover time of the complete graph is , which is equivalent to the coupon collector problem (expected number of draws to collect all types of coupons).
Chapter 3 β The Invention of PageRank
From Citation Networks
Before the web, the random walk idea was applied to ranking academic papers. The Impact Factor is a simple citation-count metric, but it doesn't distinguish between "a citation from a high-quality paper" and "a citation from a low-quality one."
Gabriel Pinski proposed using citation network structure to rank academic journals in 1976. This is a direct precursor to PageRank.
The Web as a Giant Directed Graph
In 1998, Stanford graduate students Larry Page and Sergey Brin turned their attention to the structure of the web. At the time, search engines like AltaVista and Lycos used keyword-frequency-based ranking, making them vulnerable to spam and producing unreliable results.
Each page is a node; each hyperlink is a directed edge. Their question was:
"Can we compute a page's 'importance' purely from link structure?"
Their core insight was recursive: "A page linked from important pages is itself important." But defining "important pages" requires "importance" β a chicken-and-egg problem. Yet this is precisely an eigenvector problem.
The Random Surfer Model
Their idea was essentially a random walk on a graph. An imaginary "random surfer" browses the web:
- From the current page, click one of the outgoing links uniformly at random
- With probability , get bored and jump to a random page instead
The damping factor is typically set to β follow links with 85% probability, teleport randomly with 15%.
The jump mechanism solves two technical problems:
- Dangling nodes: Pages with no outgoing links (PDFs, images) would halt the walk. Random jumps provide an escape
- Absorbing states: Small clusters of mutually linking pages could trap the walker. With jumps, the transition matrix becomes primitive (all entries positive), guaranteeing that every node can be visited
In matrix notation, the jump-augmented matrix is:
where is the link-based transition matrix and is a matrix of uniform transition to all nodes (the teleportation matrix).
Power Iteration
The PageRank vector is the stationary distribution , computed by power iteration:
Starting from the uniform distribution and iterating. Since is primitive, the Perron-Frobenius theorem guarantees the eigenvalue-1 eigenvector is unique, and all other eigenvalues have absolute value at most . Convergence is geometrically fast:
With , about 50 iterations suffice for good precision. Even for the 1998 web (hundreds of millions of pages), this simple iteration computed rankings at practical speed.
The demo below runs PageRank power iteration on a small 6-page web graph. Watch how scores converge with each iteration.
PageRank Simulation
Power iteration PageRank based on the random surfer model. Node size proportional to score
| Page | PR(n) | converged | diff |
|---|---|---|---|
| A | 0.1667 | 0.2186 | 0.0519 |
| B | 0.1667 | 0.1569 | 0.0098 |
| C | 0.1667 | 0.2470 | 0.0804 |
| D | 0.1667 | 0.0917 | 0.0750 |
| E | 0.1667 | 0.1339 | 0.0327 |
| F | 0.1667 | 0.1519 | 0.0147 |
Mathematical Meaning of PageRank
With the previous chapters as background, the essence of PageRank becomes clear.
PageRank is nothing other than the stationary distribution of a damped random walk.
- With (no jumps), it's a pure random walk on a directed graph β but dangling nodes and disconnected components make the stationary distribution non-unique
- The jump with makes the transition matrix primitive (all entries positive). By Perron-Frobenius, the eigenvalue-1 left eigenvector is unique, and convergence from any starting distribution is guaranteed
The principle from the previous chapter β "a random walk's visit frequency converges to the stationary distribution" β directly underwrites the correctness of the PageRank algorithm.
Limitations and Successors
PageRank was groundbreaking, but modern web search combines link structure with hundreds of other signals (text relevance, user behavior, page quality metrics, etc.). PageRank is also vulnerable to link farms β artificially generating mutual links to inflate scores.
Still, the fundamental idea β using a random walk's stationary distribution to compute importance within a network β is widely applied today in social network analysis, bioinformatics gene network analysis, and financial network systemic risk assessment.
Chapter 4 β The Reach of Random Walks
Monte Carlo Methods
Monte Carlo methods use randomness for numerical computation. The name comes from the casino city of Monte Carlo. Von Neumann and Stanislaw Ulam pioneered the approach in the 1940s Manhattan Project for simulating neutron diffusion.
The most famous example: estimate by throwing random points in the unit square and counting the fraction that lands inside the unit circle: . With points, the estimation error is , independent of dimension. This is why Monte Carlo excels at high-dimensional integration. Traditional numerical methods (e.g., Simpson's rule) deteriorate rapidly with dimension (the "curse of dimensionality"), but Monte Carlo maintains accuracy regardless.
MCMC (Markov Chain Monte Carlo)
Monte Carlo assumes we can sample from a uniform distribution. But real-world problems often require sampling from complex distributions (posterior distributions, free energy landscapes).
MCMC constructs a Markov chain whose stationary distribution is the target, then uses a random walk to generate samples. The theory from Chapter 2 applies directly.
-
Metropolis-Hastings (1953/1970): Generate a candidate from proposal distribution , accept with probability , otherwise stay put. The acceptance probability design guarantees detailed balance, making the stationary distribution. A major advantage: the normalization constant of need not be known. Widely used in Bayesian statistics for posterior sampling
-
Gibbs Sampling (1984): Sequentially sample each variable from its conditional distribution given the others. A special case of Metropolis-Hastings. Standard in image processing (Markov random fields) and latent variable models (LDA topic models)
The mixing time of MCMC (steps needed for practical convergence) is directly related to the spectral gap from Chapter 2 β random walk theory in direct application.
Graph Clustering
A random walker tends to stay within the same community for a long time and rarely crosses bridge edges between communities. This property can detect community structure.
-
Markov Clustering (MCL): Alternates between expansion (computing ) and inflation (raising each element to power and re-normalizing columns). Expansion corresponds to two-step random walks, creating inter-cluster flow. Inflation amplifies strong flows and eliminates weak ones. After iterations, cluster structure emerges naturally. Widely used in protein interaction network analysis
-
Personalized PageRank: Fixes a specific node as the teleportation destination. Standard PageRank teleports uniformly to all nodes; Personalized PageRank returns to node with probability . The result assigns a "closeness" score from 's perspective to every node. Used by Twitter and Facebook for follow recommendations ("Who to follow")
Recommendation Systems
On a bipartite graph of users and items (connected by purchases, views, or clicks), start a random walk from a given user. Item nodes frequently visited by the walker become recommendations.
Two key advantages:
- Natural exploration-exploitation balance: The walk recommends not just direct neighbors (related items to already-purchased products = exploitation) but also items 2+ hops away (= exploration), producing collaborative-filtering-like effects
- Cold start handling: Even new items can be recommended if they're indirectly connected to other items in the graph
Pixie (Pinterest's recommendation engine) performs biased random walks on a graph of billions of nodes to deliver real-time personalized recommendations.
Connection to Graph Neural Networks
Recent Graph Neural Networks (GNNs) also have deep ties to random walks:
- DeepWalk (2014): Performs random walks on graphs, treats the node visit sequences as "sentences," and trains them with Word2Vec's SkipGram. Co-occurrence in random walk sequences captures node similarity
- Node2Vec (2016): Extends DeepWalk by introducing bias parameters that control the trade-off between BFS-like exploration (capturing local structure) and DFS-like exploration (capturing community structure)
- GCN (Graph Convolutional Network): The graph convolution is mathematically equivalent to polynomial filters (Chebyshev polynomial approximation) of the random walk transition matrix. In other words, each GCN layer can be interpreted as performing random-walk-like information diffusion
Summary β From a Single Line to the Structure of the World
Let's review what we covered:
- 1D Random Walk: Expected value 0, variance , diffusion. Recurrent in 1D and 2D, transient in 3D+ (Polya's theorem). The curious phenomenon of null recurrence
- Random Walks on Graphs: Formulated as Markov chains. Visit frequency converges to the stationary distribution , proven via detailed balance. Mixing time governed by the spectral gap
- PageRank: Computes web page importance as the stationary distribution of a damped random walk. Efficiently computed via power iteration, with convergence guaranteed by Perron-Frobenius
- Broader Applications: Monte Carlo, MCMC, graph clustering, recommendation systems, GNNs
The drunkard's simple stagger connects to the core algorithm behind web search engines and cutting-edge graph neural networks. The recurrence and diffusion properties discovered on the simplest structure β the number line β support the foundations of modern information technology via graph theory.