Garbage Collection Trades Space and Time for Programming Safety
Automatic memory management presents the illusion of infinite memory. The illusion costs something: space overhead, pause times, cache pressure. But what it buys is freedom from an entire class of catastrophic bugs: dangling pointers, double frees, memory corruption, and the coupling that ownership semantics impose on API design. The biggest problem with explicit memory management is that it exists at all.
A garbage collector has two components: the mutator (the application threads, which allocate objects and mutate the object graph) and the collector (the component that finds and reclaims dead objects). The collector is correct only if it never reclaims live objects. Since true liveness, whether an object will ever be used again, is undecidable in general, most collectors approximate it with reachability: if no chain of references from the roots can reach an object, it is garbage.
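The reachability approximation is just a graph traversal from the roots. A minimal sketch on a toy heap (object names and layout invented for illustration, not any specific runtime):

```python
# Toy heap: each object is identified by a name and holds a list of
# references to other objects. An object is live iff some chain of
# references from the roots reaches it.

def reachable(heap, roots):
    """heap: {obj: [referenced objs]}, roots: iterable of root objs."""
    live = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in live:
            continue
        live.add(obj)
        worklist.extend(heap[obj])  # follow every outgoing reference
    return live

heap = {
    "a": ["b"],  # a -> b
    "b": ["a"],  # cycle back to a: still live if a root reaches it
    "c": ["d"],  # c and d reference each other's chain but no root reaches them
    "d": [],
    "e": [],
}
print(sorted(reachable(heap, roots=["a", "e"])))  # ['a', 'b', 'e']
```

Note that the cycle between `a` and `b` is handled naturally: reachability is a property of the whole graph, which is why tracing collectors reclaim cyclic garbage that naive reference counting cannot.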
The four fundamental GC approaches are mark-sweep, mark-compact, copying collection, and reference counting. Real implementations combine them. Mark-sweep is the simplest: walk all roots, mark every reachable object, then sweep the heap and free anything unmarked. Its elegance hides subtle tradeoffs. The mark phase chases pointers through the object graph, which is terrible for cache locality, since each object header is typically read and written only once. A key optimization is separating mark bits into a bitmap rather than storing them in object headers: marking no longer modifies objects, multiple objects can be marked with a single instruction, fewer dirty cache lines means fewer flushes, and sweeping can rely entirely on the bitmap without touching live objects.
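Mark-sweep with a side bitmap can be sketched over a toy heap of fixed-size slots (the slot layout and first-fit allocator here are invented for illustration; real collectors work over raw memory with size classes or free lists):

```python
# Mark-sweep over a toy heap of fixed-size slots. Mark bits live in a
# side bitmap, so the mark phase never writes into object memory and the
# sweep phase reads only the bitmap and the allocation state.

class Heap:
    def __init__(self, size):
        self.slots = [None] * size        # each slot: list of referenced slot indices, or None if free
        self.mark_bits = bytearray(size)  # side bitmap, one mark bit per slot

    def alloc(self, refs):
        for i, s in enumerate(self.slots):  # first-fit free-slot search
            if s is None:
                self.slots[i] = list(refs)
                return i
        raise MemoryError("heap full")

    def collect(self, roots):
        # Mark phase: O(live) pointer-chasing traversal from the roots.
        self.mark_bits[:] = bytes(len(self.slots))
        work = list(roots)
        while work:
            i = work.pop()
            if not self.mark_bits[i]:
                self.mark_bits[i] = 1
                work.extend(self.slots[i])
        # Sweep phase: O(heap) linear scan driven entirely by the bitmap.
        freed = 0
        for i in range(len(self.slots)):
            if self.slots[i] is not None and not self.mark_bits[i]:
                self.slots[i] = None
                freed += 1
        return freed

h = Heap(8)
a = h.alloc([])
b = h.alloc([a])   # b keeps a alive
c = h.alloc([])    # no root references c
print(h.collect(roots=[b]))  # 1: only c is swept
```

Freed slots become holes that later allocations must search for, which is the fragmentation problem the next paragraph describes.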
The irony of mark-sweep's complexity: the mark phase is O(L) where L is live data, and sweep is O(H) where H is heap size. You might expect sweep to dominate, but its linear scan has excellent spatial locality. The real cost is mark's cache-hostile pointer chasing: the phase that seems simpler is actually slower. Mark-sweep does not compact the heap, so fragmentation accumulates over time, degrading allocation performance for larger objects.
Generational collectors exploit a key empirical fact: most objects die young. By concentrating effort on the youngest generation, they improve total collection time and reduce average pause length. The space-vs-time tradeoff applies at every level: a GC that pauses less frequently must tolerate more garbage on the heap, while one that pauses often keeps the heap tight but interrupts the application more.
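The generational idea can be sketched as a small nursery plus an old generation: a minor collection traces only the nursery, promotes survivors, and discards the rest wholesale. This is a deliberately simplified model (the structure and names are invented; write barriers and remembered sets for old-to-young pointers are omitted):

```python
# Generational sketch: new objects land in the nursery; a minor collection
# scans only the nursery, promoting survivors to the old generation.
# Because most objects die young, most nursery slots are reclaimed
# without ever being traced.

class GenHeap:
    def __init__(self):
        self.nursery = {}  # obj -> list of referenced objs
        self.old = {}

    def alloc(self, obj, refs=()):
        self.nursery[obj] = list(refs)

    def minor_collect(self, roots):
        # Trace only within the nursery; old-generation objects are
        # conservatively treated as live during a minor collection.
        live = set()
        work = [r for r in roots if r in self.nursery]
        while work:
            o = work.pop()
            if o not in live:
                live.add(o)
                work.extend(r for r in self.nursery[o] if r in self.nursery)
        for o in live:               # survivors are promoted
            self.old[o] = self.nursery[o]
        died = len(self.nursery) - len(live)
        self.nursery.clear()         # the rest is reclaimed in bulk
        return died

g = GenHeap()
g.alloc("req1"); g.alloc("req2"); g.alloc("cache")
print(g.minor_collect(roots=["cache"]))  # 2 short-lived objects reclaimed
```

The payoff is that a minor collection's cost scales with nursery survivors, not heap size, which is exactly the "concentrate effort on the youngest generation" bet described above.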
Takeaway: Garbage collection is not a solved problem but an ongoing exercise in balancing correctness, throughput, latency, and space, just like distributed systems design but at the language-runtime level.
See also: Cognitive Load Is the Real Bottleneck in System Design | Latency Sneaks Up On You | Efficiency Is The Enemy of Resilience