Skip to content

Chapter 6: The TempDB Noose

November 2003, the eve of Thanksgiving.

Ten months had passed since surviving the "Broadcast Storm." The internal network of GenesisSoft had been sliced by Simon Li using VLANs into dozens of mutually non-interfering "watertight compartments." The infrastructure team was enjoying a brief period of peace.

But the greed of commerce never grants technology a moment to breathe.

"Listen up, folks!" Silas Horn stood in the center of the War Room, holding up an all black-and-green first-generation gaming console prototype (an early Xbox prototype) just brought out of the lab, a fanatic glint in his eyes.

"Sony is slaughtering us in the living room console market share! Tomorrow is Thanksgiving, and I'm going to run a 'Zero-Dollar Flash Sale' on our Hello 2.0 Platform that will shock the whole nation! Ten thousand prototype consoles, given away for free exactly on the hour!"

Silas turned around, staring dead at Simon. "Simon, I've already dropped heavy cash to build hype across national television networks. At 8 PM tonight, at least two million rabid gamers will simultaneously flood our website, frantically clicking that 'Buy' button! I want to make sure the system is as smooth as silk!"

Simon leaned back in his chair, looking at the main monitoring screen without any expression.

"The network layer is fine, the VLAN isolation is solid," Simon said indifferently. "The Web servers are also fine. Ever since we performed 'lobotomies' on them two years ago (Chapter 2) and ripped out all local in-memory Sessions, these fifty Web servers are now like a colony of absolute stateless worker ants with no memory. No matter how much traffic comes, we can just Scale-out."

"So where is the state stored now?" Silas pressed.

"Fully externalized." Simon tapped the keyboard and brought up the topology of the most core SQL Server hidden deep in the server room. "All the login states, shopping carts, and flash sale tokens of the two million users are uniformly stored by me into a special memory database at the underlying database layer—TempDB (Temporary Database)."

In 2003, this was an extremely advanced and "elegant" architectural design (and the standard practice before the advent of Redis). By uniformly resource pooling the state, the Web layer gained infinite horizontal scalability.

"Perfect! As long as the hardware can hold up!" Silas pumped his fists.

"The hardware absolutely will hold up." Dave, the Operations Lead, patted his chest in guarantee. "For this flash sale, we just procured the absolute top-tier multi-core CPU servers, and the underlying SAN storage arrays have been swapped to the latest 10-Gigabit Fibre Channel. It's a boundless ocean capable of containing the entire Pacific!"

Simon didn't object, but a faint trace of unease flickered in his heart.

Because in the world of architects, when everyone thinks everything is perfect, it usually means a more hidden, deeper abyss has quietly cast a wide net.


8:00:00 PM, the flash sale began.

Two million gamers across America, driven by rabid greed, pressed their left mouse buttons in the exact same microsecond.

"The flood peak is here!" Dave stared intently at the main monitoring screen.

Simon's cerebral cortex tightened instantly. Out of habit, he closed his eyes, preparing to welcome the deafening roar of disk reads and writes, or the blinding white light of CPU thread contention in his Synesthesia.

However... Dead silence.

No roars, no blinding light. The War Room was terrifyingly quiet.

"What happened? Did the front-end website go down?" Silas looked in bewilderment at his laptop, stuck on a white screen. A tiny hourglass icon spun endlessly on the screen.

"It didn't go down! The Web servers are all alive!" Dave looked at the dashboard in disbelief. "But this is impossible..."

This was indeed an extremely bizarre scene that defied computer common sense.

On the monitoring screen, the database CPU load, which should have instantly spiked to 100% to process two million concurrent requests, was actually sitting at a pitiful 5%! Meanwhile, the top-tier SAN storage array—costing tens of millions of dollars and claiming to be able to swallow the Pacific—had a flat disk IOPS curve resembling a straight line, in a state of completely idle spinning!

"The hardware isn't working at all!" Dave exclaimed. "The CPU is sleeping! The disks are sleeping, too! But why is the TPS (Transactions Per Second) zero?! Why are all user requests gridlocked?!"

This wasn't a crash. This was an "absolute freeze" far more bone-chilling than a crash.

Silas rushed over angrily and grabbed Dave by the collar. "Didn't you say it was a boundless ocean?! Why has the ocean dried up?!"

"I... I don't know! All hardware metrics are normal! The network is up, too! It's as if... the system just suddenly went on strike!"

"Let him go, Silas."

Simon abruptly opened his eyes. His face was pale as paper, beads of cold sweat seeping from his forehead.

During those ten-odd seconds, his synesthetic vision had no light, no pain. Taking their place was an extremely maddening, suffocating sense of congestion that could drive someone insane.

He finally "saw" clearly into that boundless ocean. Dave was right; the underlying SAN storage was indeed incredibly huge, empty, and ready to welcome massive amounts of data.

But this ocean had an extremely tiny entrance, so small it had been ignored by everyone.

In SQL Server, for any user login or any flash sale click, the Web server requested the backend TempDB to create an extremely tiny "Temporary Object."

When two million concurrent requests surged in, the database engine acted like a highly dedicated warehouse manager, ready to place two million packages into that impossibly massive warehouse (the SAN storage).

But before placing them in the warehouse, the manager had to do one thing first: He needed to make a minuscule mark on a system metadata page called PFS (Page Free Space) and GAM (Global Allocation Map) to record, "This piece of hard drive space is now occupied."

This was to prevent data from overwriting each other.

Within Simon's synesthetic grating, a microscopic scene of profound shock and irony was unfolding:

Two million threads representing user requests, like an armored army so massive its end couldn't be seen, marched grandly to the entrance of that enormous, empty warehouse. Inside the warehouse were millions of empty parking spots (ample disk space). The army possessed hundreds of extremely wide highways (powerful multi-core CPUs).

But at the main entrance of the warehouse, there sat only one security guard holding a tattered notepad!

That notepad was the mere three bytes of PFS/GAM metadata pages inside TempDB.

The army of two million, fighting over the "write permission" to these three bytes, caused a microscopic stampede in memory that was extremely tragic yet utterly silent!

"They're waiting for a lock..." Simon gritted his teeth, staring at an extremely obscure wait type skyrocketing into the hundreds of thousands on the screen—PAGELATCH_UP (Page Latch Update Wait).

"What lock?! We clearly solved row-level locks in Chapter 1!" Silas roared.

"This isn't a lock on business data, Silas! This is a latch contention on Metadata!" Simon's voice betrayed a hint of despair. "They are slaughtering each other fighting over that 'notepad for space allocation'!"

Only one thread could grab the notepad, make a checkmark, and put it down at a time. In this minuscule, microsecond-level pause, because the CPUs were too fast and the concurrency too high, the subsequent hundreds of thousands of threads instantly piled up, triggering a terrifying Convoy Effect.

Finding all threads suspended by memory latches, the CPUs concluded they had nothing to do, and peacefully entered sleep mode (load dropping to 5%). The disks, seeing absolutely no IO commands being issued, also entered an idle state.

The hardware slumbered. The software strangled itself to death.

"We stuffed national logistics through a transit hub manned by just one security guard." Simon stared dead at the console. "This is the price of 'global pooling' of all state!"

This was the "Resource Pooling Fallacy" in the history of system architecture.

When you think placing all session states centrally into a single, global TempDB liberates your front-ends, you have actually created an extremely hidden Single Point of Failure (SPOF) at the bedrock of your system. A globally shared resource (Shared-Everything) inevitably brings global lock contention.

"Then let's just hire more guards!" Though Silas didn't understand the underlying principles, he knew common sense. "Since there's only one notepad, hand out more notepads to them!"

"Silas... for once, you are right."

Simon jerked upright. The system wasn't dead; it was simply being strangled by a noose called TempDB.

To loosen this noose, conventional SQL statements were useless. He had to forcibly alter how the database engine worked down at the physical file level.

"Dave, suspend the database into single-user mode! I'm going to slice TempDB open!"

Simon snatched the keyboard, his eyes gleaming with a surgeon's icy resolve.

He typed out a set of low-level ALTER DATABASE physical commands. Since a single file could only possess one PFS/GAM metadata page, he would forcefully chop this gargantuan physical TempDB data file (.mdf) into 8 completely uniform physical files, like slicing a cake perfectly evenly.

"Why 8?" Dave asked nervously from the side.

"Because our core server happens to have exactly 8 logical CPU cores."

Simon's fingers hammered the keyboard like a hurricane. In the underlying algorithms of relational databases, there exists a mechanism called "Proportional Fill."

When Simon forced the data file to be chopped into 8 parts, the database engine was compelled to create 8 separate sets of own PFS/GAM metadata pages within those 8 files.

"If one guard isn't enough, I'll force open 8 main gates and station 8 guards!"

Right after, Simon typed in a mysterious trace flag that every senior DBA at major tech companies must etch into their soul, yet rarely speak of to outsiders:

DBCC TRACEON (1118, -1);

"What's that now?" Silas stared wide-eyed.

"Forcing Uniform Extents Allocation." Simon explained coldly. "Canceling mixed extent monopolization by temporary objects. Tell those 8 guards not to fight with each other, and simply manage their own stalls!"

Enter! Execute!

Within the synesthetic vision, a miraculous scene occurred.

That heavily congested gate with only one guard, responding to Simon's command with a thunderous boom, was violently widened into 8 parallel high-speed channels. Eight guards holding notepads simultaneously went on duty.

The army of two million, previously cramped into a violently trampling mess, was instantly and precisely shunted into the 8 channels by the low-level round-robin algorithm.

The Latch contention (PAGELATCH_UP) melted away like ice in the visible blink of an eye!

"Boom—"

Inside the previously dead-silent server room, an incredibly terrifying physical roar suddenly erupted! It was the fan shriek of 8 CPU cores simultaneously awakened from their slumber, load spiking from 5% straight to 85%! It was the low-frequency tremor of the underlying SAN storage disk array frantically swallowing and spitting data as IO instructions finally poured in!

"It's alive! TPS is shooting up!" Dave jumped up excitedly. "Five thousand! Ten thousand! Twenty thousand! The system is rapidly digesting the flash sale requests!"

"Transaction volumes are soaring! Paralyzing sales of the prototype!" Silas flushed bright red with excitement as he watched the flash sale page on his laptop instantly restore to silk-like smoothness.

The five-minute "suspended animation" finally ended. Simon slumped into his chair, the suffocating congestion brought by his synesthesia instantly draining away, his whole body drenched as if caught in a heavy rain, soaked entirely in cold sweat.

Silas walked over, patting Simon on the shoulder vigorously. "Beautifully done, Simon! You pulled off another miracle! I knew it was just a tiny software configuration error!"

"Tiny?"

Simon didn't smile. He lifted his head, looking through the glass wall at the rows of frantically blinking server racks, a deep worry hidden in the depths of his eyes.

"Silas, this wasn't a configuration error. This is the tragedy of architecture."

Simon turned around and wrote two massive words on the holographic whiteboard: Shared-Everything.

"To make the Web layer stateless, we crammed all states into the shared TempDB. As a result, the fate of our entire giant empire just hung by a mere 3 bytes of system metadata in memory."

Simon heavily struck a line across those words.

"Solving one bottleneck often means creating an even more hidden single-point bottleneck. As long as a system still shares the same physical resource, there will always be a Single Point of Failure (SPOF) that cannot be crossed. Infinite horizontal scaling in the face of run-of-the-mill Shared-Everything is just an extremely expensive, false proposition."

"So what?" Silas didn't care. "Didn't we solve it just now? At worst, we'll just buy 64-core CPUs later and slice it into 64 files!"

"We can't keep slicing forever." Simon shook his head. "One day, we will encounter something even physical hardware can't slice."

In this silent microscopic war, Simon finally touched the wall of sighs of a centralized architecture. The silent observation from the high-dimensional fractal deep down also logged the precious parameters regarding the "Metadata Latch Limits" of Earth's computer systems.

True distributed isolation must be decentralized.

But before that, Simon had to first clean up the other mess left behind by this colossal beast.

Just as Simon was about to close the monitoring screen, out of the corner of his eye, he suddenly caught an anomalous metric from the SAN storage array.

Even though TempDB was alive, gamers, after completing the flash sale, casually uploaded massive amounts of "images" showing off their loot directly to their personalized Hello message boards. The disk IOPS curve of the storage array was creeping upward in a bizarre, extremely unnatural manner.

"Trying to use a bank vault, to store massive amounts of trash..." Simon murmured to himself.

The meteor of Chapter 7 was already silently plummeting toward this fragile centralized storage.


Architect's Note / Post-Mortem

Document ID: PM-2003-11-26 Incident Grade: SEV-1 (Core link suspended animation, massive promotion paralyzed) Owner: Simon Li (Senior SDE)

1. What happened? The instant the Thanksgiving zero-dollar flash sale started, system TPS plummeted to zero, and massive amounts of requests timed out. Metrics showed an extremely bizarre phenomenon: the Web services were normal, underlying CPU load was critically low (<5%), and SAN disk I/O was idle, yet the system was simply incapable of processing any transactions.

2. Root Cause (5 Whys)

  • Why 1: Why was TPS zero while CPUs weren't working? Because all massive worker threads were suspended (Blocked) in memory.
  • Why 2: Why were the threads suspended? They were deadlocked waiting for a memory latch called PAGELATCH_UP.
  • Why 3: What memory latch were they fighting over? They were fighting over PFS (Page Free Space) and GAM (Global Allocation Map) metadata pages within SQL Server's TempDB, used to record free disk space.
  • Why 4: Why was there a contention? Because all Web nodes were stateless, and two million concurrent requests were all extremely frequently applying to create temporary objects in the single, global TempDB.
  • Why 5: Why was this a fatal single point? By default, TempDB has only one data file, thus possessing extremely scarce metadata control pages. Millions in concurrency ultimately converged onto a mere 3-byte memory marker, forging an absolute single-point bottleneck under the Resource Pooling Fallacy.

3. Action Items & Architecture Decision Record (ADR)

  • Workaround & Hotfix:
    • Slice the single physical TempDB file into 8 equal-sized .ndf files (matching the number of logical CPU cores on the server).
    • Globally enable Trace Flag 1118 to forcibly disable mixed extent allocations, forcefully compelling the database engine to use multiple metadata pages in parallel, exploiting hardware concurrency to thoroughly shatter the memory latch contention.
  • Long-term Fix / Architecture Redesign:
    • ADR-006: Strictly prohibit over-reliance on globally shared databases (Shared-Everything RDBMS) as ultra-high-frequency transient state pools.
    • Acknowledge reality: While externalized state solved the single point issue at the app layer, it transferred and concentrated pressure onto metadata locks down at the storage layer. We must initiate preliminary research into lightweight, pure memory-sharded external cache middleware (laying the groundwork for future Redis/Memcached adoption).

4. Blast Radius & Trade-offs Physical multi-core parallelism is powerful, but when faced with a poorly designed software underlying structure (single point data structure contention), even the most powerful hardware will be forced into the quagmire of Serialization. Hardware sleeping while software strangles itself—this is the noose a centralized architecture can never escape.


Architect's Note: Connecting Past and Modern System Design

1. The Cold Punishment of Amdahl's Law In this chapter, Dave naively thought that simply buying a top-tier multi-core CPU could withstand concurrency—a rookie mistake common among junior engineers. Computer science holds an extremely cold theorem—Amdahl's Law. It tells us that the overall throughput improvement of a system is limited by the portion of code that can be parallelized. But anything that requires queuing and competing for a shared lock (like that single security guard's notepad) is inherently the "Serialization Part" that cannot be parallelized. Even if your server had 1,000 CPU cores, as long as those CPUs needed to compete for the same fine-grained memory page lock (Latch), in that microsecond where the lock is acquired, the other 999 cores can only stare blankly in sleep. This is why during a massive promotion, you see metrics tell you the CPU is barely working at 5%, but the whole site is completely frozen. Hardware performance was mindlessly "serialized" by shared state.

2. From Force Slicing Files to Modern Redis Sharding Philosophy To mitigate this "single lock contention" in 2003, Simon took extreme physical measures, chopping the database file into 8 parts to forcefully create 8 parallel lanes. This is actually one of the greatest ideological underpinnings of modern distributed system design: Sharding and Slots. Today, when giant tech companies like Alibaba or Tencent face "Double 11" flash sales, it is absolutely impossible for them to pool all transient user states into a single relational database anymore. They would introduce purely memory distributed middleware like Redis Cluster. Furthermore, even with the blinding speed of Redis, if all flash sale items (relying on a single Key, e.g., iphone_stock) reside on the same Redis node, they will still encounter Simon's exact concurrent single-point bottleneck from back then (known famously as the Hot Key Problem). Modern giants solve it exactly the way Simon did: at the code layer, they forcibly break down the product stock into iphone_stock_1 through iphone_stock_8, scattering the storage across 8 distinct physical Redis machines, letting traffic strike them evenly. Shattering the curse of Shared-Everything, stepping toward the Shared-Nothing architecture, was the most bloody yet glorious leap in the evolution of the internet.