All posts

How Garbage Collection Works — A History of GC and Language Strategies

A deep dive into garbage collection from the ground up. Covering reference counting, tracing GC, generational GC, and concurrent GC — tracing the evolution of GC technology and comparing strategies across major languages. Includes an interactive tri-color mark-and-sweep visualization.

GCRuntimeGoJavaPython

Memory management is an unavoidable challenge in programming. In C/C++, developers must manually call malloc/free, but most modern languages rely on a garbage collector (GC) to automatically reclaim memory.

This article covers GC fundamentals — from basic concepts to algorithm evolution, write barrier mechanics, and how major languages implement their GC strategies, all with concrete examples.

Why Do We Need GC?

Manual memory management leads to two classic categories of bugs.

Memory Leaks

Forgetting to free unused memory causes the program's memory usage to grow without bound, eventually crashing with an OOM (Out of Memory) error.

void leak_example() {
    char *buf = malloc(1024);
    // use buf for processing...
    if (error) return;  // ← forgot to free(buf) before return!
    free(buf);
}

In this example, when error is true, free is never called and 1024 bytes leak. In long-running server applications, this is a fatal problem.

Dangling Pointers

Prematurely freeing memory that's still in use causes Use-After-Free bugs.

void dangling_example() {
    char *a = malloc(64);
    char *b = a;  // a and b point to the same memory
    free(a);
    // b still points to freed memory (dangling pointer)
    strcpy(b, "hello");  // Undefined behavior! Crash or security vulnerability
}

GC fundamentally solves both problems. Developers no longer need to specify when to free memory, dramatically improving program safety and productivity.

GC was invented by John McCarthy in 1959 for Lisp. It's a technology with over 67 years of history, during which many algorithms have been developed and refined.

Fundamental GC Strategies

GC algorithms fall into two major families: reference counting and tracing.

Garbage CollectionReference CountingTracing GCMark & SweepMark-CompactCopying GCGenerationalIncrementalConcurrent

⮳ Generational, incremental, and concurrent are orthogonal optimization strategies that can combine with any tracing algorithm. For example, Java’s young generation uses copying GC, not mark & sweep. The diagram above is a simplified overview of common combinations.

Reference Counting

Each object maintains a counter tracking how many references point to it. The counter is incremented when a new reference is created and decremented when a reference is removed. When the counter reaches zero, the object is immediately freed.

Reference Counting in Action

a = Object()   # Object ref_count = 1 (a references it)
b = a           # Object ref_count = 2 (a and b reference it)
a = None        # Object ref_count = 1 (only b references it)
b = None        # Object ref_count = 0 → immediately freed!

With reference counting, object lifetimes are tracked in real-time and memory is reclaimed the instant it becomes unused.

Advantages

  • Simple implementation: Just increment/decrement counters — easy to understand and implement
  • Immediate reclamation: Garbage is collected the moment it's created. Latency is predictable
  • No Stop-The-World: No need to pause the entire application
  • Minimal memory footprint: Objects are freed immediately, keeping peak memory usage low

Disadvantages

  • Cannot collect circular references: The biggest weakness. When A→B→A forms a cycle, ref_count never reaches zero even when no external references remain
Aref=1Bref=1No external refs → ref_count ≠ 0 → Leak

As shown above, when A and B reference each other, their ref_count stays at 1 even after all external references are gone — a permanent leak.

  • Counter update overhead: Every pointer assignment requires a counter update. In multithreaded environments, this requires atomic operations (atomic_increment/atomic_decrement), causing cache line contention
  • Cascading frees: When the root of a large data structure reaches zero, a chain reaction of frees can cause a temporary latency spike

Tracing GC

Starting from a root set (stack variables, global variables, registers, etc.), the collector traverses the object graph and marks reachable objects as "alive." Unreachable objects are collected as garbage.

What Is the Root Set?

In tracing GC, "roots" are pointers directly accessible by the program. Specifically:

  • Local variables on stack frames: Each goroutine/thread's stack
  • Global variables: Package/module-level variables
  • CPU registers: Pointers held by the current thread

Only objects reachable from the root set are considered "alive" — everything else is garbage. Even circular references are correctly collected as long as the cycle is unreachable from roots, which is the key advantage over reference counting.

Tri-color Mark & Sweep

The most fundamental and widely used tracing GC algorithm. Proposed by Dijkstra et al. in their 1978 paper "On-the-Fly Garbage Collection: An Exercise in Cooperation".

The Three Colors

ColorMeaningState
WhiteNot yet discovered by GCCandidate (might be garbage)
GrayDiscovered, but references not fully scannedAwaiting scan
BlackFully scanned (self and all references)Definitely reachable (alive)

The Tri-color Invariant

Algorithm correctness depends on this invariant:

A black object never directly references a white object

Why is this invariant critical? Black objects are never scanned again by the GC. If a black object directly references a white object, that white object loses its chance to be scanned — it would be freed even though it's still alive. As long as gray objects exist between black and white, every reachable object will eventually be scanned.

Algorithm Flow

  1. Initialize: Color all objects white
  2. Root discovery: Color objects pointed to by GC roots gray, push them onto the gray stack (worklist)
  3. Scan: Pop an object from the gray stack:
    • Color its white referents gray and push them onto the gray stack
    • Color the object itself black
  4. Repeat: Repeat step 3 until the gray stack is empty (mark complete)
  5. Sweep: Walk the heap and free all objects that remain white

Step through this process with the interactive demo below. Watch how each object's color changes and how the gray stack evolves:

Tri-color Mark & Sweep in Action

Step through how a tri-color mark-and-sweep garbage collector traces reachable objects and frees garbage.

Idle Phase
White (未探索)Gray (発見済)Black (走査済)Freed (解放)
Root1root
Root2root
A
B
C
D
E
F
G
H
I
Gray Stack
empty
Initial state: 11 objects exist on the heap. All are white (unvisited). Root1 references A and B; Root2 references E. G, H, and I are unreachable from any root.
1 / 13

Challenges of Mark & Sweep

Basic Mark & Sweep has several challenges:

  • Fragmentation: Freed objects leave gaps between surviving objects, fragmenting memory. Large object allocations may fail to find contiguous space
  • Stop-The-World: All application threads must pause during mark and sweep
  • Full heap traversal: Sweeping requires walking the entire heap, with cost proportional to heap size

Various variants were developed to address these challenges.

Evolution of Tracing GC

Mark-Compact

Mark & Sweep causes memory fragmentation when objects are freed. Mark-Compact eliminates fragmentation by sliding surviving objects to one end of the heap.

BeforeAfterABCDABCDFree

As shown above, compaction fills the gaps left after sweep, creating a contiguous free region at the end of the heap.

Compaction algorithms come in several varieties:

  • Sliding compaction: Slides objects in address order, preserving their relative ordering
  • Lisp 2 algorithm: Three passes over the heap — compute new addresses, update pointers, then move objects
  • Threaded compaction: Converts references into linked lists to store temporary information. No extra memory needed but complex

Cost: Moving objects and updating references is expensive, but allocation becomes fast with bump-pointer allocation. Bump-pointer is essentially alloc(size) = ptr; ptr += size — a single instruction, significantly faster than free-list allocation (linear search for fitting blocks).

Copying GC (Semispace)

Splits the heap into two spaces: From space and To space. Objects are allocated in From space. During GC, surviving objects are copied to To space, then From and To are swapped and the old From is freed entirely.

Cheney's algorithm (1970) is the canonical implementation, using BFS (breadth-first search) to copy objects:

scan = free = start of To space
Copy objects referenced by GC roots to To space, advance free
while scan < free:
    For each field of the object at scan:
        If referent is in From space → copy to To space, advance free
        Set forwarding pointer at old location
    Advance scan to next object
  • Advantages: Free compaction (copies are always packed), bump-pointer allocation, no heap sweep needed
  • Disadvantages: Only half the heap is usable (50% memory efficiency), copying is expensive for large objects

Generational GC

An optimization based on the Weak Generational Hypothesis:

Most objects die young (become unreachable shortly after allocation)

This is backed by real-world observations. In typical applications, 80–95% of newly allocated objects become unreachable before the first GC cycle.

Young GenerationEdenSurvivor 0Survivor 1Minor GCPromoteOld GenerationTenured (Long-lived Objects)Minor GC: Young gen only (frequent, fast) / Major GC: Full heap (rare, slow)
  • Young generation: A small area for new allocations. GC runs frequently (Minor GC) but is very fast (a few milliseconds) since most objects are already dead. New objects are allocated in Eden, and survivors are copied to Survivor spaces
  • Old generation: Objects that survive multiple Minor GCs are promoted here. GC is less frequent but heavier (Major GC / Full GC, tens to hundreds of milliseconds)

Generational GC requires a remembered set to track references from Old to Young generation. This avoids scanning the entire Old generation during Minor GC. The card table approach is common: divide Old generation memory into ~512-byte cards and mark only cards containing Young pointers as "dirty" for scanning.

Incremental / Concurrent GC

To reduce Stop-The-World (STW) time, GC work runs alongside application execution.

Stop-The-WorldFull Heap GCIncrementalMarkAppMarkAppMarkSweepConcurrentApp (concurrent mark)STWApp (concurrent sweep)Time →
  • Stop-The-World GC: Application completely paused during mark + sweep. Simplest implementation but long pause times (proportional to heap size)
  • Incremental GC: Splits marking into small steps, interleaved with application execution. Each step's pause is short, but total GC time is longer (write barrier overhead)
  • Concurrent GC: Runs marking on separate threads in parallel. Application pauses only at mark start and remark (final synchronization), which are extremely short

In concurrent GC, the application (mutator) and GC (collector) run simultaneously, so reference relationships can change during marking. Write barriers handle this.

Write Barriers

The mechanism for maintaining the tri-color invariant in concurrent/incremental GC. When the application (mutator) modifies a pointer, code is inserted to "notify" the GC.

Why Write Barriers Are Necessary

Consider this scenario:

1. GC has scanned object A and colored it black
2. Application executes A.field = C (C is white)
3. Application executes B.field = nil (removes B→C reference; B is gray/white)
4. GC scans B, but C is no longer referenced by B
5. C remains white → freed as garbage → alive object freed!

This is the lost object problem. It occurs when a black→white reference is created (condition ①) and simultaneously all other paths to that white object are severed (condition ②).

Major Write Barrier Types

① Dijkstra Barrier (Insertion Barrier)

When a pointer is written to reference a new object, the referent is shaded gray.

// Barrier inserted when executing A.field = C
writeBarrier(A, field, C):
    shade(C)      // shade C gray first (prevent black→white reference)
    A.field = C   // then perform the store

The shade must happen before the store. If the order were reversed, a window would exist where black object A references white object C — a concurrent GC could see this and incorrectly free C.

Every newly referenced object becomes gray, ensuring it will be scanned. However, this conservatively shades many objects gray, increasing floating garbage (objects that are actually garbage but won't be collected this cycle).

② Yuasa Barrier (Deletion Barrier / Snapshot-at-the-beginning)

Shades the object losing a reference gray. This captures the idea of preserving a "snapshot" of the object graph at GC start time.

// Barrier inserted before A.field changes from old to new
writeBarrier(A, field, new):
    old = A.field
    if old is white:
        shade(old)  // shade old gray
    A.field = new

This guarantees all objects reachable at GC start will be marked.

③ Hybrid Barrier (Go 1.8+)

Go combines Dijkstra and Yuasa barriers into a hybrid write barrier.

writeBarrier(slot, new):
    shade(*slot)  // deletion barrier: shade the overwritten old value gray
    shade(new)    // insertion barrier: shade the new value gray
    *slot = new

The key benefit of the hybrid approach is that stack rescanning becomes unnecessary. In Go 1.5–1.7 with only the Dijkstra barrier, all goroutine stacks needed to be rescanned at the end of marking — this was the main contributor to STW time. With the hybrid barrier, Go's STW is typically under 100μs.

Memory Allocation Strategies

Closely related to GC is the memory allocation strategy. The choice of GC algorithm heavily influences allocation design.

Free List

Manages freed memory blocks as a linked list. Allocation searches for a block of suitable size.

  • First-fit: Use the first sufficiently large block found
  • Best-fit: Use the closest-sized block (less fragmentation)
  • Segregated-fit: Maintain separate lists for each size class (used by Go, tcmalloc)

Bump Pointer

Used with compacted heaps or copying GC. A pointer marks the start of free memory, and allocation simply advances it.

alloc(size):
    result = freePtr
    freePtr += size
    return result

Extremely fast (effectively one instruction) but requires contiguous free space.

TLAB (Thread-Local Allocation Buffer)

Pre-allocates a portion of the heap to each thread to avoid allocation contention in multithreaded environments. Within a thread, allocation uses lock-free bump-pointer. Used by Java's HotSpot JVM and Go's runtime.

GC Strategies by Language

Let's examine how major programming languages implement their GC in detail.

Go — Concurrent Tri-color Mark & Sweep

Go prioritizes simplicity and low latency above all else.

AspectDetails
AlgorithmConcurrent tri-color mark & sweep
GenerationalNo
CompactionNo
STWMark start/end only (typically < 100μs)
Write barrierHybrid (Dijkstra + Yuasa)
TuningGOGC / GOMEMLIMIT

Why doesn't Go use generational GC? Go aggressively stack-allocates. The compiler's escape analysis places objects that don't escape the function on the stack rather than the heap. This means most short-lived objects never even reach the heap. The generational hypothesis ("many heap objects are short-lived") is weakened, making the benefit of generational GC too small to justify the write barrier cost.

GC Pacer: Go's GC uses a "pacer" to decide when to start the next GC cycle. GOGC=100 (default) means "start the next GC when the heap has grown by 100% relative to the live heap after the last GC." Go 1.19 added GOMEMLIMIT, which sets a soft memory limit to prevent OOM in memory-constrained environments.

GC source code lives in runtime/mgc.go.

Java (HotSpot) — Generational + Multiple Collectors

Java has the most mature GC ecosystem, with multiple collectors for different use cases.

CollectorCharacteristicsSTWUse Case
Serial GCSingle-threadedLongSmall apps, client
Parallel GCMulti-threaded markMediumThroughput-oriented batch
G1 GCRegion-based, predictable pausesShort (target-configurable)General purpose (JDK 9+ default)
ZGCColored pointers, TB-scale heaps< 1msLow-latency requirements
ShenandoahConcurrent compaction< 10msLow-latency (RedHat)

How G1 GC works: The heap is divided into equal-sized regions (typically 1–32MB). Young/Old distinctions are assigned dynamically at the region level. G1 prioritizes collecting regions with the most garbage (Garbage-First), maximizing reclamation within a given pause time budget. Configure with -XX:MaxGCPauseMillis=200.

ZGC's Colored Pointers: ZGC embeds GC metadata (mark state, relocation state) in unused bits of object pointers. This eliminates the need to write mark information to object headers, enabling concurrent marking and relocation using only load barriers. This technique is what allows ZGC to maintain sub-1ms STW even with TB-scale heaps.

Python (CPython) — Reference Counting + Generational Tracing

Python uses a hybrid approach.

Reference Countingref == 0?YesFree immediatelyNoPossible circular refGenerational Tracing GCDetect & collect cyclic garbage
  • Reference Counting (Primary): The main memory reclamation mechanism. When an object's ob_refcnt reaches 0, tp_dealloc is called immediately to free it
  • Generational Tracing GC (Auxiliary): Runs generational tracing GC in the background to detect and collect circular references that reference counting cannot handle

Three-generation threshold system:

GenerationThreshold (default)Meaning
Generation 0700Run Gen 0 collection when allocations − deallocations reaches 700
Generation 110Run Gen 1 collection every 10 Gen 0 collections
Generation 210Run Gen 2 collection every 10 Gen 1 collections

Objects are promoted from Gen N to Gen N+1 after surviving just 1 collection. In other words, the thresholds control how frequently each generation is collected, not how many collections an object must survive. Check with gc.get_threshold(), adjust with gc.set_threshold().

CPython's GIL (Global Interpreter Lock) means reference count updates don't need atomic operations. However, Python 3.13+'s free-threaded mode (--disable-gil) switches to biased reference counting — the owning thread uses regular increment/decrement while other threads use atomic operations, achieving thread-safe reference counting without the GIL.

JavaScript (V8) — Generational + Concurrent Marking

V8 (Chrome, Node.js, Deno) uses generational GC.

AreaNameAlgorithm
Young genNursery (Scavenger)Semispace copying GC
Old genMajor GCConcurrent mark & sweep/compact

How Scavenger works: Young generation is split into two semispaces. New objects are allocated in "From" space. During Minor GC (Scavenge), surviving objects are copied to "To" space. Objects surviving twice are promoted to Old generation.

Orinoco: V8's GC optimization project name. It processes marking concurrently with the main thread, dramatically reducing STW time. Specific techniques:

  • Concurrent Marking: Worker threads traverse the object graph in parallel with the main thread
  • Parallel Scavenging: Multiple worker threads perform Young generation GC in parallel
  • Incremental Marking: Makes small marking progress during idle time or between tasks
  • Lazy Sweeping: Defers sweep processing to allocation time

.NET (CLR) — Generational + Segment-Based

.NET's GC is optimized for server workloads.

AspectDetails
AlgorithmGenerational mark & compact
Generations3 (Gen 0, 1, 2) + LOH (Large Object Heap)
ModeWorkstation GC / Server GC
ConcurrentBackground GC (concurrent Gen 2 collection)

Large Object Heap (LOH): Objects 85,000 bytes or larger are allocated directly in the LOH. The LOH is not compacted by default, though .NET 4.5.1+ allows explicit compaction via GCSettings.LargeObjectHeapCompactionMode.

Pinning: Prevents GC from moving objects during native interop. Achieved with fixed statements or GCHandle.Alloc(obj, GCHandleType.Pinned), but overuse causes fragmentation.

Rust — No GC (Ownership System)

Rust doesn't rely on GC. Its ownership + borrowing + lifetime system enforces memory safety at compile time.

Ownership system → Memory deallocation timing decided at compile time
                 → Zero runtime overhead
                 → Zero GC pause time

The RAII (Resource Acquisition Is Initialization) pattern ensures the Drop trait is called automatically when a variable goes out of scope. This code is inserted at compile time, so runtime cost is zero.

For circular references, Rc<T> / Arc<T> + Weak<T> provide reference counting. Weak<T> doesn't count toward Rc's reference count, allowing cycles to be broken.

Swift — ARC (Automatic Reference Counting)

Swift uses ARC, which inserts reference counting operations at compile time.

  • No runtime tracing GC needed
  • Predictable performance: All retain/release calls are inserted at compile time, making runtime overhead predictable
  • Circular references must be explicitly resolved by developers using weak (becomes Optional) / unowned (non-Optional, when lifecycle is guaranteed)

Swift's compiler includes ARC optimizations that eliminate redundant retain/release pairs (including Copy-on-Write optimization).

GC Technology Timeline

1959Mark & Sweep (McCarthy, Lisp)1960Reference Counting (Collins)1969Copying GC (Fenichel & Yochelson)1978Tri-color Marking (Dijkstra et al.)1983Generational GC (Lieberman & Hewitt)1992Train Algorithm (Hudson & Moss)2004G1 GC research paper (Detlefs et al.)2015Go 1.5 concurrent GC (STW < 10ms)2017Go 1.8 hybrid barrier (STW < 1ms)2018ZGC experimental (JDK 11)2020Shenandoah GA (JDK 15)2023Generational ZGC (JDK 21)TheoryImplementation

Summary

LanguageGC TypeGenerationalConcurrentSTWAllocation
GoTri-color M&SNoYes< 100μsSegregated-fit
JavaSelectable (G1/ZGC/Shenandoah)YesYes< 1ms+TLAB + bump pointer
PythonRef counting + tracingYes (3 gen)NoYespymalloc arena
JavaScript (V8)Copying + M&S/CompactYesYesShortSemispace + free list
.NETGenerational M&CYes (3 gen+LOH)YesShortBump pointer
RustNone (ownership)NoneSystem allocator
SwiftARCNoneARC + malloc

GC is not a mere convenience feature — it's one of the most critical architecture decisions in programming language runtime design. The simplicity of reference counting, the completeness of tracing GC, the efficiency of generational hypothesis, the low latency of concurrent execution — each technique represents a specific set of trade-offs. The depth and fascination of this field lies in how each language team draws on decades of research to choose the optimal balance.

References