
Chapter 19: The Thundering Herd and the Deadlock Restart

Early 2016, Seattle. The eve of the Super Bowl, the biggest traffic event in the US.

Genesis Software had completely finished its Cloud Native transformation. Thousands of microservices expanded and contracted as naturally as breathing under the orchestration of Kubernetes (K8s).

To make sure everything was foolproof for tomorrow's Super Bowl traffic surge, Director of Operations Dave configured the highest level of automatic protection K8s offers for the core Database Proxy Service (DB-Proxy). "Look, Simon! I wrote it into the Liveness Probe. If the proxy doesn't respond within 3 seconds, K8s automatically deems it dead, then mercilessly kills and restarts it!" Dave proudly showed off the YAML configuration.
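A Liveness Probe of the kind Dave brags about might look roughly like this; only the 3-second limit comes from the story, and every other field and value here is illustrative:

```yaml
# Illustrative sketch of an over-aggressive liveness probe.
# Only the 3-second timeout is from the story; path, port, and
# thresholds are assumed for the example.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 3      # "no response within 3 seconds..."
  periodSeconds: 3
  failureThreshold: 1    # "...and K8s kills it" -- no grace, no backoff
```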

Simon Li looked at that line of code, his brow slightly furrowed. "Isn't 3 seconds a bit too short?"

"It's not short, Simon," Dave said. "If it hasn't responded in 3 seconds, it means the system is already deadlocked. Killing it and restarting it as early as possible (Fail Fast / Restart) is the bible of our Cloud Native architecture."

This crude but industry-praised "reboot cures all ailments" theory did indeed work most of the time. But that was only because they hadn't yet encountered a true Thundering Herd.


The next day, Super Bowl halftime show. Traffic crashed down like a tsunami.

Simon Li stood in the situation room. His Synesthesia vision was already filled with the intense roar of the holographic network. The entire network's QPS instantly soared tenfold.

Everything was running according to plan. K8s's Horizontal Pod Autoscaler (HPA) frantically spun up new containers to handle the traffic.

But a long queue formed at the single-point bottleneck named "MySQL Master" deep in the server room. The 50 "DB-Proxy" containers in front of it were trying to stuff tens of thousands of queries into the database.

With so many queries stuck in line, the master database only processed them more slowly.

At exactly 14:01, Proxy Service A sent a query to the database and began to wait. 1 second... 2 seconds... 3 seconds... The database still hadn't returned the data. Proxy Service A's thread was stuck.

At this time, the K8s patrolman (Kubelet) coming from the upper layer walked over with its "Liveness Probe". It knocked on Proxy Service A's door: "Brother, are you alive?" Proxy Service A's mouth was blocked by the queued thread from earlier; it couldn't answer in time.

"Oh, no response in 3 seconds. You're dead." The K8s patrolman coldly pulled out its gun.

Bang!

Proxy Service A was forcibly killed (SIGKILL).

In Simon Li's synesthesia, this was the first gunshot of the disaster.

"Dave! Something's happened!" Simon Li suddenly vaulted over the console. "K8s killed the proxy service!"

"Don't panic, Simon. It's dead, K8s will spin up a new one right away." Dave was unconcerned.

True, K8s very dutifully spun up a brand new Proxy Service A 1 second later. But this was a brand new container. It had zero cache in its head, and its Connection Pool was empty!

The moment this newborn proxy service stood up, facing the overwhelming Super Bowl traffic pressing in from outside, the first thing it did to survive was shout: "I need to open 100 physical connections (TCP handshakes) to the master database, right now, all at once!"

The master database was already slow, and now this newborn, ravenous as a starved ghost reincarnated, headbutted it. The master spent another 2 seconds just processing those 100 connection handshakes.

This caused the Proxy Service B standing next to it, which was already close to timing out, to also pass the 3-second mark.

The K8s patrolman walked up to Proxy Service B: "No response in 3 seconds. You're dead." Bang!

Proxy Service B was killed! Then K8s once again spun up a new Service B. Newborn B immediately initiated another 100 connections to the database!

This is the most terrifying combination of the famous "Thundering Herd" and "Serial Probe Kills" in the Cloud Native era.

"No! No!" Dave looked at the node monitoring charts on the big screen, his eyes filled with extreme terror.

In just ten short seconds, those 50 proxy services, which had been slow but still working hard, were all massacred by K8s on charges of "timeout"! Immediately, 50 newborns were spun up at the same time, and in perfect unison they unleashed 5,000 high-intensity concurrent TCP connections on the master database!

It was as if the database had been rammed by fifty fully loaded heavy trucks at once! "Boom!" Under the impact of this new-connection storm, the merely slow database had its CPU driven straight to 100% and locked up completely.

The database was deadlocked... which meant that the 50 newly resurrected proxy services couldn't even wait for a single query result.

3 seconds later. The K8s patrolman once again walked into the square, looking at these 50 completely frozen proxy services.

"No response in 3 seconds. You are all dead."

Bang! Bang! Bang! Bang! Bang!

Total annihilation! Then K8s, with extreme rigidity, once again spun up 50 new infants to continue the assault on the already dead database!

Start, block, get killed, restart... The entire cluster had turned into a meat grinder!

"Turn off the probes! Immediately delete the K8s Liveness Probes!" Simon Li roared, pointing at Dave, the veins on his forehead bulging. In the synesthetic world, the code that was constantly resurrecting only to be instantly decapitated let out deafening screams.

Dave, trembling all over, pounded on the keyboard, forcibly cancelling the "suicide switch" for these 50 proxies in K8s.

The meat grinder stopped its decapitations. The proxy services were no longer being killed. But the database remained deadlocked under the ocean of dirty connections the restarts had poured in, and the entire front-end went dark. Tens of millions of people across the US, in the middle of the Super Bowl halftime, were left staring at an endlessly spinning Hello World page.

"Simon! The database is still deadlocked! The external traffic is all stuck at the gateway! The gateway is about to OOM too!" an engineer yelled in despair.

"We set up circuit breakers against retry storms back in Chapter 13! To keep services alive, they fall back and retry after timing out." Dave tore at his hair. "But now the database is deadlocked, and all their retries are just piling up in the queue!"

Simon Li stared at the black hole in the center of the screen. "This is the backlash of microservices! When you shatter code too finely, even with retries, even with circuit breakers... once queuing begins, backpressure from the bottom layer flows upstream like a mudslide, drowning every node above it."

Simon Li shoved Dave aside and sat at the terminal with the highest root privileges.

He knew that to break this distributed global deadlock, the only way was to abandon those gentle "waiting" mechanisms. He had to erect an absolute Time Wall at the outermost edge of the network.

"Prepare to modify the Edge Gateway's logical configuration file." Simon Li's tone was as cold as a block of ice.

"Simon, what are you doing?!"

"I'm going to add a Global Context Deadline / Timeout at the very entrance of this massive system."

Simon Li's hands flew across the keyboard, using Go's core context package to forcibly strap a countdown stopwatch to the head of every single external request.

"From this moment on, whether it's a retry or a brand-new request, every single request sent from a phone to the gateway gets exactly 5 seconds of lifespan!"

Simon Li smashed the enter key. "Once the 5-second countdown finishes, no matter which layer of microservice this request has currently reached, no matter if it's currently queuing, or even if it's currently shaking hands with the database... all threads related to this request on every node across the entire network must, in the same microsecond, commit suicide on the spot! Interrupt simultaneously!"

This is an extremely advanced self-rescue measure in massive microservice webs: Context Cancel Propagation.

After this line of configuration was deployed.

A miracle occurred. In the synesthetic vision, those discarded request threads that had formed a ten-thousand-mile queue in the microservice chain, clinging to the channel like zombies, suddenly had a blood-red number appear above their heads: 0!

Boom! On the gateway's cold "timeout" signal, cancellation swept along the topology of the entire microservice network like falling dominoes. The queued threads stuck in the user service, the proxy service, even at the database's entrance, all let go at once, dissipating in memory like dust.

The channels were instantly emptied!

The master database's CPU utilization plummeted from 100% straight down to 10%. After five suffocating minutes, it finally drew its first meaningful breath.

As the old channels were forcibly swept clean, truly fresh requests, carrying a new 5-second lifespan, began their silky-smooth traversal.

The pages recovered. The Super Bowl traffic flowed smoothly through the data center.

Silas watched the heartbeat recover on the charts and let out a long-held breath. "Simon, you just saved us all."

Simon Li didn't smile. He stood quietly in front of the screen. His synesthetic intuition told him that he had merely postponed this death caused by an unbounded web of microservices.

"Silas, haven't you seen through it yet?" Simon Li turned around, his voice carrying an otherworldly weariness. "We split up hundreds of microservices, we used K8s, we configured circuit breakers, retries, and context timeouts. But the result?"

"The result is that we are using an even more complex network topology to wipe the ass of the mistakes we made earlier."

"K8s can certainly manage the life and death of containers. But in the deep waters of business, as long as the system is an infinitely sprawling 'web', as long as a data node in Shanghai can initiate frantic retries to a database in New York. This Thundering Herd and avalanche will always destroy you at the moment you least expect it!"

"So..." Silas said hesitantly.

"So, Volume Two should come to an end."

Simon Li walked up to the whiteboard and, with the eraser, viciously wiped clean that gigantic, hideously ugly, spider-web architecture diagram of three hundred interconnected microservices. The names representing K8s, Kafka, and Circuit Breakers were all wiped away.

The whiteboard was completely empty.

Then, Simon Li drew a tiny, perfectly square box right in the middle. He very carefully drew it into an absolutely closed Cell.

This was what that single line of Hello World in the first chapter of this book had once looked like in its purest form, and it was the ultimate destination pointed out by the high-dimensional algorithms after fifty years of wandering and tearing apart.

"We must abandon this web that spreads across the globe." Simon Li looked at this extremely tiny capsule, a near-pilgrimage fervor flickering in his eyes. "In Volume Three, I will stuff those hundreds of microservices you are so proud of, along with the database shards, entirely into this single, absolutely sealed vacuum Cell."

"It makes no requests outwards, and it allows no lateral interruption from outside. We will transform the system from 'one sprawling distributed abyss' into 'ten thousand mutually non-interfering little monolithic empires'."

This is the ultimate form in the history of architecture, and the absolute defense that blocks any blast radius—the Cell-Based Architecture.

With this, Volume Two, "The Distributed Swamp", brought down the curtain amidst endless retries and death restarts. The high-dimensional algorithm stopped its low-frequency vibration. It was waiting. Waiting for that final moment capable of establishing ten thousand vacuum cells on the surface of the planet.


Architecture Decision Record (ADR) & Post-Mortem

Document ID: PM-2016-02-07
Incident Level: SEV-0 (site-wide deadlock during the Super Bowl traffic peak; the service fleet fell into an endless kill-and-restart loop)
Lead: Simon Li (Principal Engineer)

1. What happened?
During the Super Bowl traffic peak, database queuing slowed everything down and frontend requests piled up. Driven by an extremely sensitive Liveness Probe (3-second timeout), K8s judged the 50 DB-Proxy containers, which had temporarily stopped responding due to congestion, to be dead, forcibly killed them (SIGKILL), and instantly restarted them. The 50 newly started containers, in a completely cold-start state, simultaneously fired a massive volume of concurrent connection requests at the database (Thundering Herd). This drained the already heavily loaded master database's resources and locked it up physically. The database deadlock in turn made the second wave of probes time out en masse, plunging the system into a serial-restart "meat grinder" loop: congestion -> killed -> mass reconnects -> worse congestion -> killed again.

2. 5 Whys Root Cause Analysis

  • Why 1: Why were all the proxy containers killed? Because the K8s Liveness probe threshold was set too strictly and had no backoff. It mistakenly equated "business blocking/slow response" with "process crash/deadlock".
  • Why 2: Why did restarting kill the system even more thoroughly? Because rebuilding nodes lost all warm-up states and connection pools. The high cost of forcing new nodes to establish physical connections triggered the famous Thundering Herd Problem.
  • Why 3: Why was the system still unable to recover after the probe mechanism was turned off? Even if the processes were no longer killed, a large number of "dirty requests" that had already timed out and been abandoned remained in the deep network (queued waiting for computation), blocking the entry channel for normal new requests.
  • Why 4: Why weren't the deep requests cancelled in time? Because in the microservice call chain, there lacked a global timeout state propagation mechanism between layers. Although the gateway determined that the user had timed out, the bottom layer was still foolishly spending CPU executing dead operations that "the user no longer cared about."
  • Why 5: Why did the problem snowball like a mudslide? The distributed network lacked absolute boundary isolation: pull a single thread, and the whole web moved with it.

3. Action Items & ADR

  • Workaround: Emergency shutdown of the Liveness Probe to stop the infinite restarts. Deployed a global forced cut-off (a hard deadline) at the edge gateway, flushing out all the backlogged, deadlocked historical requests.
  • Long-term Fix:
    • ADR-019A: Strictly distinguish the semantics of Liveness and Readiness probes. It is forbidden to point the Liveness "reboot weapon" at interfaces that are slow merely because of business congestion. Liveness must be reserved for states that genuinely require a physical power cut, such as a memory-leak deadlock. For ordinary congestion, use only the Readiness probe to temporarily shift traffic away (K8s stops routing new requests to the node). Under no circumstances should nodes be killed on a whim, triggering a connection storm.
    • ADR-019B: Comprehensively implement Global Context Control and Cascade Cancellation (Context Cancel Propagation). Introduce mechanisms such as Go's context package. Mandate that every incoming request carry its absolute death timestamp (Deadline) in its headers. Once the Deadline passes, all 150 microservices on the chain must unconditionally abort, in place, their related queries and any RPCs they initiated downstream, and release the thread, cutting off the pointless burn of zombie computing resources.

4. Blast Radius & Trade-offs
Blindly venerating Cloud Native tooling ("reboots cure all ailments") without understanding the underlying throughput mechanics often manufactures avalanches of terrifying power. In a massive network, a system is no longer simply dead or alive; it bogs down in a swamp. The lesson of Volume Two is cruel enough: no matter how we patch a grand distributed web with retries, rate limiting, and context isolation, in the face of Murphy's Law the blast radius will still ripple globally through retry storms. To truly solve the stability of large-scale systems, we must jump out of web-thinking. In the upcoming Volume Three, we pivot to the ultimate form of physically severing entanglements: the Cell-Based Architecture.


Architect's Note

1. K8s's Deadliest Meat Grinder: Abused Liveness Probes
Many entry-level engineers, when first meeting K8s, think it's extremely cool to add "kill and restart if it doesn't return 200" to every program; after all, K8s restarts don't cost money. This is an extremely lethal trigger for architectural tragedies. When massive traffic hits (like the Super Bowl Simon Li faced, or Taobao's Double 11), servers saturate and respond slowly, but they are still working hard. If you execute them with a gunshot just because they are slow, and pull up a newcomer with no cache built up, the connection-establishment pressure (extremely CPU-intensive) that newborn inflicts on the database is ten times what the old-timers caused, which only accelerates the extinction of the entire chain. After this classic lesson in anti-fragility, SREs only dare quietly use Readiness to pull fully loaded machines out of the traffic pool so they can digest their backlog, and rarely fire kill-and-restarts in deep waters.

2. Ghost Grinder for Dead Computation: Context Cancellation
When a user opens a webpage and it stays blank too long, they lose patience and press "X" to close it. But deep in your huge distributed web, half a dozen microservices are still toiling away, queuing on the task that user kicked off, even trekking to the hard drive to fetch data for someone who will never come back. At massive concurrency, this "stillborn computation", if not cut off, will pop a server's memory in minutes. One of the proudest features in the design of modern languages like Go is ctx context.Context. If the edge gateway determines that the request has timed out or the user has left, it issues a Cancel signal. That signal strikes like lightning along the wires through the dozens of services behind it; wherever they are queuing, they immediately drop the work in hand and terminate. This high-level design, which clears the ruins like a precision laser, is an extraordinary weapon against avalanches.

Volume Two ends here. The Big Ball of Mud has been forcibly isolated; please look forward to the next chapter.