
Chapter 18: The Disappearing Config and the Ghost Drift

2015, Autumn. The Era of Kubernetes (K8s).

The "Ghost Process" incident (Chapter 17) forced GenesisSoft's top management into a painful resolution: human beings could no longer be trusted to manually manage ten thousand unruly Docker containers.

Under Simon Li's forceful push, they adopted the ultimate weapon, newly open-sourced from the lineage of Google's internal Borg system: Kubernetes (K8s for short).

This was a dimensionality reduction strike in systems engineering.

On the brand-new giant screen in the War Room, Simon Li needed only a few lines of elegant YAML to declare: "I want 50 containers of the 'Reply Service' running." In the next second, K8s's Control Plane, like an omniscient and omnipotent god, would automatically search for idle machines in this vast ocean of tens of thousands of nodes, place the containers down precisely, and configure all the networking and load balancing. If a physical machine lost power, K8s would discover it within seconds and re-spin up the dead containers on another healthy machine.
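A Deployment declaring the desired state from the scene above might look like the following sketch. The service and image names are invented for illustration, and a 2015-era cluster would have used the `extensions/v1beta1` API; it is shown here in today's `apps/v1` form:

```yaml
# Hypothetical sketch of "I want 50 containers of the Reply Service running".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reply-service
spec:
  replicas: 50                  # the desired state; the Control Plane reconciles toward it
  selector:
    matchLabels:
      app: reply-service
  template:
    metadata:
      labels:
        app: reply-service
    spec:
      containers:
      - name: reply-service
        image: genesissoft/reply-service:1.0   # invented image name
        ports:
        - containerPort: 8080
```

Nothing here says *how* to place the containers; the scheduler closes the loop from this declaration alone.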

"This is simply magic!" Dave, the Director of Operations, looked at the Auto-scaling curve with teary eyes. He had finally bid farewell to the miserable days of waking up in the middle of the night to pound the keyboard and restart machines. In this system called K8s, Declarative APIs replaced Imperative operations. Humans no longer issued instructions on "what to do," but told the system "what result I want," and the system would close the loop itself (Reconciliation Loop).

But Silas still maintained the vigilance of a businessman: "Simon, this way of letting the system make all the decisions itself, what if it goes crazy?"

"As long as you write the Config correctly, it is the most loyal queen bee," Simon Li said, holding his coffee. "Microservices only care about running the business logic; state and configuration are left to the external ConfigMap... wait."

Simon Li suddenly stopped in his tracks, his brows furrowed. In a tiny corner of his synesthesia vision, he felt a trace of unusual fluctuation. It was like a few people in a group of highly disciplined soldiers suddenly experienced a brief "mental lapse."


2:00 PM. The Ghost Drift Begins.

"Simon, customer service received a strange complaint." A product manager hurried into the War Room. "A few users posting from Los Angeles said that they clearly clicked 'Public', but half of the Hello World posts they sent out became 'Private'! And when they refreshed the page, the status of the post frantically jumped back and forth between 'Public' and 'Private'!"

Simon Li immediately put down his coffee: "Which microservice manages this?"

"It's the newly refactored 'Auth Service'!" Dave immediately brought up the main dashboard. The "Auth Service" was deployed with a total of 300 replicas (Pods) in K8s. This was one of the foundational services with the highest traffic on the entire network.

Simon Li glanced at the status of these 300 replicas: "They are all Running, no errors, no OOM."

"Then why is it jumping back and forth?" The product manager was anxious.

"Because of Configuration Drift." Simon Li's intuition struck like lightning.

In the microservice architecture, Simon Li had set a strict rule: code and configuration must be separated. Once the code is packaged into an image, it must never change. All the switches that control business logic (for example: whether a new user's default posting permission is Public or Private) are written into an external K8s object called a ConfigMap.
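The business switch from the story could live in a ConfigMap roughly like this; the object name and key spelling are invented, with only the `default_auth` idea taken from the chapter:

```yaml
# Hypothetical ConfigMap holding the externalized business switch.
apiVersion: v1
kind: ConfigMap
metadata:
  name: auth-service-config
data:
  default_auth: "Private"   # later flipped to "Public" in the late-night change
```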

When posting requests hit these 300 replicas, the load balancer's round-robin scheduling meant any given request could land on any one of them.

"If the user feels the status is jumping..." Simon Li typed a query command on the keyboard, "That means, in the minds of these 300 'Auth Service' replicas, regarding the question of 'what exactly is the default permission', they have two completely different sets of memories!"

The logs split into two halves on the big screen. Simon Li pointed to the logs of 200 of the replicas: "Look! These 200 clones read the old version of the ConfigMap, released three days ago; their default permission is Private." Then he pointed to the other 100: "And these 100 clones were Rescheduled by K8s this morning because of underlying physical machine maintenance. When they were pulled up, they read the latest ConfigMap, and their default permission is Public!"

"How is this possible?!" Dave exclaimed, "Nobody touched the ConfigMap configuration this morning!"

"No, someone did. But in this ocean of K8s based on eventual consistency, this modification became a ghost."

The synesthesia world instantly switched to an extremely microscopic perspective. Simon Li traced that bizarre modification record in his mind.

In this massive K8s cluster, everything relied on asynchronous API interactions. Late last night, a senior developer had wanted to flip this switch. He expertly opened the Git repository where the configuration was stored and changed default_auth from Private to Public. Then the deployment pipeline ran, and the new ConfigMap was pushed to the K8s API server.

"However, merely modifying the scroll outside cannot wake up those soldiers who are already deep in sleep."

Simon Li's words made Dave shudder.

In early versions of K8s (or whenever users hadn't written a hot-reload listening mechanism), a container starting for the first time would mount the external ConfigMap into its belly like a USB drive, and the business code would then read it into memory (RAM) exactly once.

These 200 veteran Auth Service replicas had read the old configuration, Private, three days ago. Then they closed their eyes and focused entirely on processing traffic. No matter how the file on that external "USB drive" was later modified by development... the value in their memory stayed frozen exactly as it had been three days ago!
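The "USB drive" mount corresponds to a volume-mounted ConfigMap. A sketch of the relevant fragment of the Pod template (names invented) shows why drift is possible: K8s eventually updates the mounted file, but nothing forces a process that read it once at startup to read it again:

```yaml
# Fragment of a Pod template: the ConfigMap appears as files under /etc/auth.
# The process reads /etc/auth/default_auth once at startup and never again.
spec:
  containers:
  - name: auth-service
    image: genesissoft/auth-service:2.3   # invented image name
    volumeMounts:
    - name: config
      mountPath: /etc/auth
      readOnly: true
  volumes:
  - name: config
    configMap:
      name: auth-service-config           # invented ConfigMap name
```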

But! This morning, K8s conducted a routine Node Draining. K8s ruthlessly killed the other 100 old replicas, and then hatched 100 new replicas on new machines. When these 100 fresh replicas started, they naturally read that new "USB drive" (Public) that had already been modified by the developer.

Thus, an extremely terrifying "Ghost Drift Cluster" was born.

In the same cluster, 300 soldiers that looked exactly the same, had the exact same version number, and were commanded by the omnipotent K8s... actually produced an extremely fatal memory tear because they were hatched at different times!

On the frontend, this manifested as every time a user clicked refresh, the request was randomly distributed to a replica with the new or old memory, and the permission frantically jumped back and forth between "Public" and "Private". If this kind of configuration drift happened in the payment system, all the billing statements on the entire network today would completely explode!

"That's terrifying." Silas's face turned pale, "I thought K8s would solve everything, I didn't expect it couldn't even sync configuration file updates."

"K8s is just a scheduler, it's not a babysitter for business logic." Simon Li's hands flew across the keyboard, preparing to execute an emergency forced synchronization.

"Dave, I'm going to forcefully kill these 200 old Pods that are still living three days ago! Let K8s pull them back up using the latest configuration!"

"But Simon!" Dave yelled, "This is the core Auth Service! If 200 replicas are killed by you at the same time, the remaining 100 will instantly bear three times the QPS traffic flood peak! They'll be beaten to death!"

"I didn't say I would kill them at the same time!" Simon Li typed a long string of elegant commands based on Label Selectors into the console. This is the cloud-native era's greatest invention for safe cluster-wide updates: the Rolling Update.

Under Simon Li's command, K8s acted like a precision scalpel, beginning an extremely elegant "slow blood transfusion." It first killed 5 old replicas, then immediately pulled up 5 new replicas with the latest configuration in the newly emptied space. It waited until these 5 new replicas passed the Readiness Probe and could smoothly receive traffic. Only then did it move on to kill the next 5 old replicas...
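The "kill 5, replace 5" cadence maps directly onto the Deployment's rolling-update knobs. A sketch, with the probe path and port invented:

```yaml
# Rolling-update settings matching the "5 at a time" scalpel.
spec:
  replicas: 300
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 5    # kill at most 5 old Pods at a time
      maxSurge: 5          # start at most 5 extra new Pods ahead of the kills
  template:
    spec:
      containers:
      - name: auth-service
        readinessProbe:    # traffic reaches a new Pod only after this passes
          httpGet:
            path: /healthz   # invented health endpoint
            port: 8080
```

With these bounds, total capacity never dips below 295 nor exceeds 305 during the transfusion.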

Throughout the whole process, the overall Capacity of the service consistently remained at around 300. There was not a single second of a compute cliff. The error rate, which was originally frantically jumping, smoothly declined like walking down stairs as the new blood was injected.

Five minutes later, all the old blood had been completely cleansed. Three hundred microservice replicas carrying the latest "Public" memory neatly and uniformly processed the traffic. The jumping stopped.

"Whew..." Dave leaned back in his chair exhausted, "Even though it recovered, as long as someone changes the config and forgets to restart, won't this kind of 'memory split' still happen? This kind of configuration drift is simply impossible to defend against!"

Simon Li nodded and drew a closed loop with a Hash value on the whiteboard.

"This is the drawback of declarative systems: once state and configuration are separated, the business program is lazy (unwilling to re-read its configuration from disk), so the outside world changes while its inner world doesn't."

Simon Li heavily tapped the whiteboard text: "In order to completely eradicate the ghost drift. From today on, for all microservice configurations, you cannot just change the content. You must forcefully inject the checksum (Hash value) of the configuration file as an environment variable into the YAML description of the container deployment!"

"As long as you change even a punctuation mark in the config, its Hash value will change. K8s will keenly discover: 'Oh! Even though the code didn't tell me to restart, the environment variable has changed!'. K8s will automatically and immediately trigger the kind of elegant Rolling Update from just now!"
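Simon Li's rule can be sketched as a CI step that computes the ConfigMap's SHA-256 and substitutes it into the Deployment template; the variable name and placeholder syntax here are invented:

```yaml
# The pipeline replaces ${CONFIG_SHA256} with the hash of the rendered ConfigMap.
# Any config change -> new hash -> changed Pod template -> automatic Rolling Update.
spec:
  template:
    spec:
      containers:
      - name: auth-service
        env:
        - name: CONFIG_SHA256
          value: "${CONFIG_SHA256}"   # injected by CI, never hand-edited
```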

"There are no changed configs, only entirely new entities (Immutable Infrastructure)." Simon Li looked at Silas and Dave with burning eyes, "Since we even treat physical machines as disposable cattle, then these containers and configs running on them should also be single-use consumables!"

"Changing a config means replacing the entire container!"

This extremely cold-blooded yet incredibly solid "immutable" philosophy completely dominated the cloud-native era for the next ten years. Any geeky behavior of trying to "secretly tweak a few parameters" in a running container was ruthlessly exterminated by the system.

K8s indeed acts like a god. But to tame this god, engineers themselves must first discard that patching-things-up warmth belonging to primates.

After resolving the jumping nightmare brought on by the configuration drift, Simon Li finally let out a long sigh of relief. Yet the cold-blooded testing frequency of the high-dimensional shard was reaching an unprecedented resonance in this chapter. For once a system achieves an indestructible "container-level stability" through Immutable Infrastructure, a new danger is born.

A beast named the "Avalanche of Retries" (the Thundering Herd, the ultimate amplification of the herd effect under K8s) was about to exploit K8s's flawlessly automatic restart mechanism and evolve into an endless, all-devouring cycle of disease.

That was the ultimate quagmire in the deepest part of the microservice swamp, and the dawn of Chapter 19: the infamous Big Tech "Deadlock Chain Restart Case."


Architecture Decision Record (ADR) & Post-Mortem

Document ID: PM-2015-09-12
Severity Level: SEV-2 (core configuration drift; severe split-brain in global routing state)
Lead: Simon Li (Principal Engineer)

1. What happened? (Incident Phenomenon)

The permission status of posts submitted by users randomly jumped back and forth between "Public" and "Private" at a high frequency. Upon investigation, the core "Auth Service" responsible for this validation had a total of 300 replicas deployed in K8s. But due to differences in their startup lifecycles, these 300 replicas of the exact same image version had completely opposite default permission configurations cached in their memory, creating severe Configuration Drift and split-brain.

2. 5 Whys (Root Cause Analysis)

  • Why 1: Why did identically configured replicas read different parameters? Because the startup times of these 300 Pods were different. A portion underwent restarts due to recent eviction and read the latest value after the ConfigMap was modified. Meanwhile, the original old nodes, having stayed healthy and alive, failed to perceive that the configuration had changed.
  • Why 2: Why didn't the old nodes perceive the configuration change? Although the outside (K8s's ConfigMap) changed, the business code logic was extremely lazy. It only loaded the configuration into memory once at the moment of startup (Init phase) and subsequently lacked a periodic mechanism to watch the config file for changes (Hot Reload).
  • Why 3: Why weren't the old nodes restarted when the configuration was modified? The developer only issued a command to independently update the ConfigMap without modifying the Deployment's main YAML description file. K8s believed the Pods didn't need to be rescheduled.
  • Why 4: Why does the system implicitly allow this kind of separated modification? Under the microservice philosophy of "decoupling code and configuration", the configuration update pipeline and the code release pipeline were not physically tied together, resulting in a state where "the external entity mutated, but the container's internal closed-loop went blind."

3. Action Items & Architecture Decision Record (ADR)

  • Workaround (Temporary Bleeding Stop): Execute a manually triggered Rolling Update on the Deployment of the "Auth Service." Use K8s to elegantly shift traffic, wipe out, and replace all old cache instances, forcefully refreshing the memory state across the entire network.
  • Long-term Fix (Architecture Refactoring):
    • ADR-018: Implement fundamentalist Immutable Infrastructure.
    • Use tools like Kustomize or configuration hash injection plugins (e.g., checksum template annotations in Helm). At the CI/CD pipeline level, important configuration files (ConfigMap/Secret) involved must undergo SHA256 hashing, and the calculated hash value must be rigidly pinned within the Deployment template in the form of an environment variable (Annotation/Env).
    • The extreme brilliance of this move lies in: Even if the business layer is too lazy to write a "hot-reload watcher," as long as the config moves even slightly, the hash suffix injected into the environment variable will cause the Deployment description itself to change. This immediately and brutally triggers the system's underlying Pods to unconditionally undergo chain execution updates. Eradicate all lukewarm states and rebuild the immutable cycle.
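The Helm variant mentioned in ADR-018 is the well-known checksum-annotation pattern from Helm's chart development tips; the template path is illustrative:

```yaml
# In a Helm chart's deployment.yaml: hash the rendered ConfigMap into an
# annotation, so any config edit changes the Pod template and triggers a rollout.
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```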

4. Blast Radius & Trade-offs

Declarative APIs (the cornerstone of K8s) are incredibly beautiful but have latency blind spots. When an external dependency changes, the system cannot perfectly notify tens of thousands of running processes whose in-memory state was fixed long ago. Under the logic of Immutability, all changes are simplified to "kill and recreate."


Architect's Note: Connecting Past and Present System Design

1. The Arrival of the K8s (Kubernetes) Era

Around 2015, the entire operations world was completely reshuffled. If the previous chapter said Docker unified the standard shipping container, then the K8s Simon Li introduces in this chapter is the top-tier crane and loading command system that controls the port. It is also the foundation of every large company's cloud-native infrastructure today. Declarative replaces Imperative: you write one sentence of extremely dry YAML, "I want 100 such containers (Replicas=100)," and go to sleep. You don't write a single for loop or a Shell script to bind IPs. The underlying super scheduler acts like a living organism, forever steering back toward that final steady state even after power or network failures.

2. The Classic Pitfall: Configuration Drift (Config Drift)

Many beginners using K8s think that stripping database passwords or business switches out into an external ConfigMap (an excellent best practice) is the end of the story. Then they go to the Dashboard, change the Config's value, and the next day customer complaints say it didn't take effect. After two hours of investigation, they realize that unless the old Java/Go process running in the container was explicitly wired up with a hot-reloading watcher (e.g. via fsnotify), it memorized the dead value in its cerebellum back when it started 100 days ago, and is terrifyingly, completely blind to changes in the outside world. This is the classic Drift. The best killer move in K8s is to discard all hot-reloading courtesies: whenever the configuration changes, inject a Hash to force a Rolling Restart across the board, trading a minimal physical disturbance for the highest-level unification of truth.

3. The Art of Graceful Exit: Rolling Update

Without this great feature, releases at major tech companies would still be called "Downtime Maintenance Releases." In the cloud-native era, updates are imperceptible to users. This is exactly the scalpel Simon Li used in this chapter: never kill all 300 processes at once. Kill 5, replace 5; K8s considerately probes whether these 5 newborns are awake and Ready (Readiness Probe). Only once they are Ready does the load balancer direct a trickle of traffic over, and only then does it go kill the next batch of old replicas. This version transition, as smooth as Tai Chi, is the lifeline that protects baseline availability.