The business didn't see 0.5%. They saw "99.95% uptime." But I saw the angry tweets. I saw the support tickets: "Card declined. Please try again." Those weren't bank declines. Those were wap-03 swallowing the requests whole.
That 0.5% of failed payments? It wasn't random packet loss. It was the cluster waiting for a dead zombie to vote. remove web application proxy server from cluster
That's when I saw it. For the last 72 hours, wap-03 had been silently receiving packets from an old, forgotten monitoring script on a decommissioned jump box. Every five seconds, the script sent a malformed health check: GET / HTTP/1.1\r\nHost: \x00\x00 . wap-03 was spending 30% of its CPU trying to parse null bytes. The business didn't see 0
Instantly, the average response time for the payment API dropped from 340ms to 190ms. A 44% improvement. The error rate fell to 0.001%. Please try again
But here's the terrifying part. Because wap-03 was "alive" according to basic ICMP pings, the cluster's consensus protocol had been treating it as a voting member. For six months, every time wap-03 choked on a null byte, it would delay the cluster's session replication by 400ms.