Chapter 14: The Infinite Loop of Poison
Summer, 2012. The ticker outside the NASDAQ trading floor flashed GenesisSoft's proud financial report.
Guarded by the circuit breaker mechanism, the Hello World platform's microservice cluster successfully survived countless localized failures. The system's daily active users broke through the 300 million mark.
But the savage growth of traffic made the database—that old workhorse—let out an agonizing cry once again.
"Simon, our MySQL master is throwing alarms again," Operations Director Dave pointed at the violently fluctuating write-latency curve on the large screen. "A hundred thousand Hello World text submissions per second. Even though each is only 11 characters, the instantaneous write flood is already causing severe jitter in the database's disk IO. If we don't intercept it, the master will sooner or later be crushed by this transient flood."
In the center of the War Room, Silas Horn was toying with a limited-edition golf club.
"Then intercept it," Silas said nonchalantly. "Didn't you block the flood of read requests with a cache in Chapter 4? Now use the same logic and block the write requests for me."
"Write requests cannot be blocked with a cache," Simon Li typed on his keyboard without looking up. "Reads can tolerate slightly stale data, but if a write is lost, users will find that the Hello World they just sent has vanished into thin air. That would trigger a crisis of trust."
"Then what are you going to do? Buy another thirty top-tier SAN storage cabinets to handle the transient writes?" Silas frowned.
"No need to buy storage cabinets." Simon drew a giant funnel on the whiteboard. The top of the funnel connected to the broad sky, and the bottom connected to a thin pipe.
"I want to insert a Message Queue (MQ) between the microservices and the database."
Simon explained: "When a user clicks 'send', the microservice will no longer knock directly on the database's door. It will convert this 11-character message into a JSON text and throw it into the giant funnel that is the MQ. Then the microservice immediately tells the user: 'Sent successfully!' As for the database at the bottom of the funnel, it only needs to smoothly take out a few thousand messages from the funnel every second at its own comfortable pace, and unhurriedly write them to the hard drive."
"This is called Asynchronous Load Leveling," Simon wrote the term next to the funnel.
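The funnel Simon sketches can be reenacted in a few lines. This is a minimal sketch, assuming a toy in-memory deque in place of a real MQ like Kafka; names such as `handle_send` and `drain_batch` are illustrative, not any real API.

```python
import json
from collections import deque

message_queue = deque()  # the "funnel" between microservice and database
database = []            # stand-in for the slow MySQL master

def handle_send(user_id: int, msg: str) -> str:
    """The microservice path: serialize, enqueue, reply immediately."""
    message_queue.append(json.dumps({"user_id": user_id, "msg": msg}))
    return "Sent successfully!"   # user sees success before any disk write

def drain_batch(batch_size: int = 3) -> int:
    """The consumer path: drain at the database's comfortable pace."""
    written = 0
    while message_queue and written < batch_size:
        database.append(json.loads(message_queue.popleft()))
        written += 1
    return written

# A spike of 10 sends returns instantly; the database then absorbs
# the backlog in small batches on its own schedule.
for i in range(10):
    handle_send(i, "Hello World")
while message_queue:
    drain_batch()
```

The point of the trade: the user-facing call never waits on disk IO, and the write rate reaching the database is capped by the batch size, not by the traffic spike.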
Hearing that there was no need to spend a fortune on hardware, Silas readily approved the proposal.
A week later, Kafka—the hottest high-throughput message queue in the industry at the time—was introduced into GenesisSoft's core architecture.
In the early days after going online, the effect was utterly stunning. No matter how terrifying the spikes in external traffic were, Kafka acted like a bottomless pit, instantly sucking in all concurrent requests. And the poor MySQL master database finally lived a regular "nine-to-five" life, never sounding another alarm.
Until that extremely ordinary Tuesday afternoon.
3:14 PM. A silent suffocation.
Simon was sitting at his workstation, examining a line of low-level code. Suddenly, a nauseating sensation of a foreign object, as if he had swallowed a sharp stone, shot straight from his throat to his stomach.
He suddenly clutched his chest. In the vision of his synesthesia, the stream of Kafka messages, which had been surging like a great river, suddenly came to an absolute standstill at the entrance of a certain Consumer node!
"What's going on?!" Simon rushed into the War Room. "Did the consumer microservice writing to the database crash?"
"It didn't crash!" Dave stared at the screen. "The consumer process is still running, and there is still idle CPU! But... but the latest Hello World messages in the database haven't been refreshed for three minutes!"
This meant that millions of users on the front end clearly saw the "Sent successfully" prompt, but the messages they sent were all jammed in that funnel called Kafka, completely failing to be flushed to disk!
"Check the Kafka monitoring! Fast!"
When Dave pulled up the Kafka consumer monitoring dashboard, a dead silence fell over the room.
The large screen showed that in the topic queue named Hello_Write_Topic, the message backlog (Lag) was soaring at a terrifying speed of tens of thousands of messages per second! A hundred thousand, half a million, a million! Piling up like a mountain!
"Why isn't the consumer pulling data? Is the network disconnected?" Silas asked in panic.
"No... it is pulling data." Simon stared intently at a tiny corner of the console logs, his voice trembling at the sheer absurdity of it.
"Not only is it pulling, it is frantically pulling the exact same piece of data thousands of times a second!"
Simon typed a command on the keyboard, forcefully "fishing" out the extremely peculiar message that was blocking the very front of the queue.
It was a JSON-serialized Hello World request payload.
But when it was displayed on the large screen, everyone gasped. That was not a normal piece of data. Due to an extremely rare bit flip error in the underlying network transmission packet, the end of this JSON data was missing a right brace }.
{"user_id":10024, "msg":"Hello World"
It was such an extremely tiny piece of data, missing just a single byte. It had become a Poison Pill capable of destroying the entire distributed cluster.
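You can verify the poison's effect with any standard JSON parser. The payload below is a reconstruction assuming, as the narration says, that only the final right brace was lost; restoring that single byte makes it valid again.

```python
import json

poison = '{"user_id":10024, "msg":"Hello World"'  # missing the final }

try:
    json.loads(poison)
    parse_ok = True
except json.JSONDecodeError:
    parse_ok = False          # the parser rejects it outright

# Restoring the single missing byte makes the payload valid again.
repaired = json.loads(poison + "}")
```

One absent byte is the difference between a record and an exception: `parse_ok` ends up `False`, while `repaired` is an ordinary dict.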
In Simon's synesthetic grating, he saw an extremely tragic scene resembling the infinite loop of hell:
The Consumer microservice pulled this poison message from Kafka. It opened its mouth and swallowed it. However, when the code attempted to use the standard JSON library to deserialize this message, because of the missing right brace, the parser instantly reported an error and crashed! It directly threw an uncaught, severe Exception.
The consumer process vomited in agony.
BUT! Because the consumer crashed with an error, it failed to reach the last step of the code, and thus failed to send that extremely critical command back to Kafka—"I have finished processing this message, please Commit Offset."
This is the strictest system baseline adhered to by Kafka: At-least-once Semantics.
Seeing that the consumer did not commit the Offset, Kafka extremely dutifully and rigidly assumed: "Oh my, this poor consumer must have failed to process it due to network issues or a power outage. It's okay, I'll dispatch it again!"
Thus, the very microsecond the consumer process was restarted by the watchdog and staggered back to its feet, Kafka enthusiastically stuffed that jagged stone (the poison), entirely unchanged, right back into the consumer's mouth!
Swallow -> Parser Error Crash -> Offset Never Committed -> Kafka Redelivers the Same Message. This is the most terrifying infinite loop in system architecture.
"Swallow, vomit, and then forcefully stuffed back into the mouth." Simon looked coldly at the error logs frantically scrolling across the screen. "This is the infinite loop of the poison message. As long as this poison is not digested, the entire consumer cluster will march in place infinitely within this loop."
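The doom loop Simon describes can be reenacted in miniature. This is a toy sketch, assuming a trivial in-memory log that redelivers from the last committed offset; nothing here is Kafka's real API, it only mirrors the at-least-once semantics.

```python
import json

log = [
    '{"user_id":1, "msg":"Hello World"}',
    '{"user_id":10024, "msg":"Hello World"',  # the poison pill: no final }
    '{"user_id":2, "msg":"Hello World"}',
]
committed_offset = 0          # the broker's notion of consumer progress
processed = []

def consume_once():
    """One consumer lifetime: parse messages, commit only on success."""
    global committed_offset
    offset = committed_offset             # redelivery starts here
    while offset < len(log):
        record = json.loads(log[offset])  # raises on the poison -> "crash"
        processed.append(record)
        offset += 1
        committed_offset = offset         # the "ACK" back to the broker

# Each "restart" replays the same poison at the same offset, forever.
for _restart in range(5):
    try:
        consume_once()
    except json.JSONDecodeError:
        pass                              # watchdog restarts the process
```

After five restarts the committed offset is still stuck at 1, only the first healthy message was ever processed, and the healthy message queued behind the poison is never touched: head-of-line blocking in three list entries.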
This cluster of dozens of nodes, wielding terrifying computing power, was choked to death by an incomplete Hello World message of less than 200 bytes!
And behind this poison message, tens of millions of normal, perfectly structured Hello World messages waited in a long queue in despair. They would never be processed, because the channel ahead was staging an absurd, never-ending act of swallowing and vomiting.
"Absurd! This is too absurd!" Silas's face was ashen. "Just because of a broken bracket, our entire network is paralyzed?! Delete that message! Delete it from the queue immediately!"
"Kafka is an Append-only Log." Simon turned his head and looked at Silas, his eyes showing deep helplessness. "You can't delete a record from the middle of the queue with a DELETE statement the way you would in a database, because the log is one continuous, sequentially written physical file on disk."
"Then what do we do? Are we just going to watch the memory get maxed out and the messages of hundreds of millions of users be completely lost?" Silas roared in despair.
Just then, Operations Director Dave suddenly shouted: "Simon! The backlog at the tail of the queue has broken through the ten million mark! The upstream gateway is still frantically accepting user posts (they think the posts succeeded), but the downstream is completely blocked. The MQ's disk quota and memory are about to burst!"
This is the consequence of lacking a Backpressure mechanism. When the consuming end is choked to death by the poison and loses all processing capacity, the producing end tirelessly keeps pumping water into the system, ultimately triggering a thorough, global OOM.
Simon knew they couldn't wait any longer.
To deal with the poison in the immutable torrent, one must absolutely not attempt to modify the river; one must use a bypass diversion.
"Dave, open the configuration center! I need to hot-patch the consumer code!"
Simon's hands turned into blurs on the keyboard. He decisively modified the global logic (Try-Catch Block) used to catch exceptions inside the consumer.
The original code was: upon encountering an unparseable format, directly throw an exception and crash.
Simon cunningly changed it into two steps: "Step one, when you swallow this stone and feel it scratching your throat (catching a JSON parsing exception), do not spit it out! Do not crash!" "Step two, force it down! And immediately tell Kafka: 'I have finished processing, quickly commit the Offset and let me process the next one!'"
"But Simon!" Dave exclaimed, "What if it's dirty data, and we just skip it, won't that message be lost?"
"I didn't say we are going to drop it!"
Simon heavily struck the final piece of architectural instructions on the keyboard, building a highly concealed "dark isolation ward" right next to the main thoroughfare.
"If you encounter poison, immediately commit the Offset for the main path to let the subsequent queue pass. Then, extract this poison exactly as it is from the code level and throw it into this special bypass queue I just built specifically for holding garbage!"
This special bypass would later bear a grim and famous name in systems architecture: the Dead Letter Queue (DLQ).
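Simon's hot patch can be sketched under the same toy assumptions of an in-memory log standing in for a real broker: catch the parse failure, commit the offset anyway, and divert the raw payload to a dead letter queue instead of crashing.

```python
import json

log = [
    '{"user_id":1, "msg":"Hello World"}',
    '{"user_id":10024, "msg":"Hello World"',  # the poison pill: no final }
    '{"user_id":2, "msg":"Hello World"}',
]
committed_offset = 0
processed = []
dead_letter_queue = []   # the "dark isolation ward" on the bypass

def consume_with_dlq():
    """Patched consumer: quarantine poison, commit either way."""
    global committed_offset
    while committed_offset < len(log):
        raw = log[committed_offset]
        try:
            processed.append(json.loads(raw))
        except json.JSONDecodeError:
            dead_letter_queue.append(raw)  # swallow the stone, don't crash
        committed_offset += 1              # commit regardless: unblock

consume_with_dlq()
```

The main path now drains to the end: both healthy messages are processed, the offset reaches the tail, and only the poison lands in the DLQ for later human inspection.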
Five minutes later, the consumer code with the DLQ mechanism was pushed to production in a nail-biting hot deploy.
In the vision of synesthesia, the consumer, tortured to the point of agony, once again welcomed that stone with the missing right brace. This time, it still couldn't parse it. But it didn't crash and vomit. It grimly swallowed the stone, committed the offset on the main path, and then excreted the stone out the side, dropping it into a pitch-black "dead letter abyss" with no consumers attached.
Immediately following that, the channel was clear!
The more than ten million normal Hello World messages that had been suppressed behind the poison, like a bursting alpine reservoir, roared towards the downstream database!
The alarm lights went out, and the red line of the message backlog (Lag) directly presented a beautifully precipitous drop, smashing straight back to 0.
The entire system let out a long sigh and completely came back to life.
Silas slumped into the leather sofa, watching the restored large screen, feeling as if the last ten minutes had lasted a hundred centuries.
"A corrupted message... blocked tens of millions of normal messages." Silas felt his worldview had been overturned. "Simon, since this queue is so fragile, why do we still use it?"
Simon stood up and walked over to Silas.
"Silas, it's not the fragility of the queue, but rather that you thought of asynchronous decoupling too simply."
Simon looked directly into Silas's eyes. "You thought that by throwing the pressure into a bottomless funnel, the system would be perfectly fine. But you forgot, whatever can be received must ultimately be digested. If you haven't built a 'dead letter treatment hospital' on the bypass, nor have you built a 'backpressure valve' that can notify upstream to slow down when it's about to burst..."
Simon looked at the solitary poison wreckage lying quietly in the dead letter queue on the screen.
"Then this funnel meant to save lives will ultimately become the super tomb that buries your entire empire."
In this era of chaotic warfare among microservices and asynchronous architectures, the real challenge had only just begun. The higher-dimensional fragments deep within the Earth's servers coldly engraved the "at-least-once semantic deadlock flaw" into the star map as a precious negative model.
And next, in the far reaches of the system, the terrifying cardinality explosion of metadata brought about by this limitless scaling of microservice nodes... Was about to completely blind the all-seeing monitoring eye that Simon took such pride in.
That is Chapter 15, the starting point of flying blind.
Architecture Decision Record (ADR) & Post-Mortem
Document ID: PM-2012-07-28 Severity: SEV-1 (Core async write queue completely blocked) Lead: Simon Li (Principal Engineer)
1. What happened? (Incident Summary) We used Kafka for asynchronous load leveling of high-concurrency Hello World write requests. A malformed JSON message (missing its right brace), caused by a bit flip at the network transport layer, entered the queue. The consumer threw an uncaught exception during parsing, leading to a loop of continuous crashes and restarts. Because the consumption Offset could never be committed, this "poison pill" message was redelivered by Kafka infinitely, blocking the head of the single consumption partition. Millions of subsequent healthy messages suffered Head-of-Line Blocking, manifesting to the business as a site-wide failure to persist writes.
2. 5 Whys Root Cause Analysis (Root Cause)
- Why 1: Why did all writes halt? A tiny piece of malformed data (the poison) caused the consumer's deserialization step to crash with an exception.
- Why 2: Why did the crash freeze the whole system? Because the MQ pursues absolute "At-least-once" delivery semantics. As long as the client does not commit an acknowledgment (ACK / Offset), the broker will never proactively move past the current message.
- Why 3: Why could subsequent messages not be processed when one failed? Kafka is a high-throughput append-only log model, and Partitions have extremely strict sequentiality. It cannot skip the current blocked point to process data with offsets further back.
- Why 4: Why didn't the system auto-digest this error? Because the business code had an extremely low tolerance for exceptions and failed to catch them.
- Why 5: Why wasn't a bypass designed in advance? The architecture team blindly trusted the asynchronous middleware's so-called "infinite throughput halo," omitting the "dirty data interception and sewage discharge system" that must be fully equipped in a data stream model.
3. Action Items & Architecture Decision Records (ADR)
- Workaround: Hot patch the consumer code, wrapping the outermost layer of deserialization with a try-catch. Upon catching a failure, forcefully ACK and Commit the offset in the code to bypass the poison.
- Long-term Fix:
- ADR-014A: Globally mandate Dead Letter Queues (DLQ). When a subscriber's retry count reaches a threshold (e.g., 3 consecutive failures), or a hard format error occurs at the parsing level, the subscriber must break the loop itself and forward the abnormal Payload via a bypass to an isolated DLQ topic, ensuring the main artery stays unblocked. Humans or offline scripts can then audit and diagnose the data in the DLQ.
- ADR-014B: Build a baseline Backpressure defense. Once the consuming end is backlogged or times out, the upstream entrance must not recklessly keep piling on; it must propagate blocking feedback outward via flow control, cutting off the exponential flood on the eve of the avalanche.
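The backpressure baseline of ADR-014B can be sketched with nothing but a bounded queue. This is a minimal illustration, assuming an in-memory `queue.Queue` stands in for the MQ's disk/memory quota; when the consumer stalls, the bounded buffer fills and the producer gets immediate feedback instead of piling up toward an OOM.

```python
import queue

buffer = queue.Queue(maxsize=5)   # the "backpressure valve"
rejected = 0

# The consumer is stalled, so nothing drains; the producer keeps sending.
for i in range(8):
    try:
        buffer.put_nowait(f"msg-{i}")  # non-blocking put on a bounded queue
    except queue.Full:
        rejected += 1                  # signal upstream: slow down or shed load
```

Five messages are buffered and three are pushed back to the producer. In a real system the `queue.Full` branch would translate into throttling, load shedding, or an error to the gateway, rather than silent unbounded accumulation.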
4. Blast Radius & Trade-offs To escape the terrifying concurrent-write bottleneck of the monolithic database, we relied on the "sponge" of an asynchronous queue to absorb external detonations, and instead ushered in an even more bizarre pipe blockage. Asynchronous architecture, while masking localized spikes, reduces the tolerance for malformed data to zero. A 200-byte piece of misshapen data is no less devastating than a ten-million-level concurrency flood.
Architect's Note: Bridging Past and Present System Design
1. The Core Reason for Introducing MQ: Load Leveling In the vast majority of high-concurrency flash-sale, liking, and ordering systems, this is the least glamorous but most useful card in a top-tier architect's hand. The per-second write capacity of a relational database like MySQL on a standard machine might only be in the tens of thousands. If millions of concurrent requests crash down at that moment, without a Message Queue (MQ) like the Kafka in this chapter, the underlying layer will certainly burst and drop connections (as reenacted in Chapter 11). The brilliance of MQ lies here: it writes to disk sequentially at a furious pace (recall Simon hand-rolling a logging engine in Chapter 3), so swallowing millions of records a second is a piece of cake. The front end immediately prompts the user "Order processing", and a back-end worker then leisurely drains the queue at MySQL's comfortable throughput. Through so-called "asynchronous" decoupling, it cunningly trades instantaneous pressure for deferred processing time.
2. At-least-once Delivery and Head-of-Line Blocking Hell But this is not a fairytale ending. Every programmer who introduces a queue meets the story of this chapter during their first year of hard lessons: falling into a nightmare infinite loop because of some outrageously dirty data. Whether it's Kafka, RabbitMQ, or RocketMQ, to guarantee that a user's message won't vanish into thin air during a power outage or crash, they strictly follow the At-least-once delivery rule. This means your consumer code must give an explicit success receipt (Offset Commit or ACK); otherwise, the broker will throw the message back at you infinitely. Once your logic fails to contain the exception, this "poison" will jam your node squarely at the head of the queue, and all the normal customers can only wait in line behind it, looking on anxiously.
3. Industrial-Standard Safety Net: DLQ (Dead Letter Queue) In almost every serious large tech company's infrastructure today, this is no longer a choice but a hard requirement in code review: an architecture that introduces an MQ without configuring a corresponding DLQ bypass will simply be rejected. A DLQ is essentially an intensive-care isolation ward built into the architecture for data beyond saving. What it protects is not the terminally ill patient, but the tens of millions of healthy "normal people" queued behind it. Only with this can a system truly be called a resilient living organism.