The Mistake of Watching CPU Alone
Let me tell you a story. It's part real, part reel, to keep it spicy.
Several years ago, I found myself in a war room that felt more like a pressure cooker.
Production latency was shooting up. Dashboards were blinking. People were sweating.
The war room coordinator kept repeating like a stuck tape:
“CPU is only 40%. So it’s not infra. Must be code!”
Classic. If I had a rupee for every time I heard that line, I would have my own data center by now (with a chai stall inside).
Six hours later, after enough “deep dives” to find Atlantis, we discovered the real villain:
Disk I/O was maxed out.
The CPU wasn’t busy. It was bored. Just waiting for storage to catch up.
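For anyone who wants to see that "bored CPU" with their own eyes, here is a minimal sketch, assuming a Linux host where /proc/stat is readable. It splits CPU time into busy, idle, and iowait between two samples. The function names (parse_cpu_line, cpu_breakdown, sample_proc_stat) are made up for this illustration, and iowait is only a hint, not proof: treat a high number as a reason to go look at the disks, not as a verdict.

```python
import time

# Field order on the aggregate "cpu" line of /proc/stat (Linux):
# user nice system idle iowait irq softirq steal guest guest_nice
IDLE, IOWAIT = 3, 4


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_breakdown(before_text, after_text):
    """Percent of time spent busy, idle, and waiting on I/O between two samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return {"busy": 0.0, "idle": 0.0, "iowait": 0.0}
    idle, iowait = deltas[IDLE], deltas[IOWAIT]
    busy = total - idle - iowait
    parts = {"busy": busy, "idle": idle, "iowait": iowait}
    return {name: round(100.0 * ticks / total, 1) for name, ticks in parts.items()}


def sample_proc_stat(interval_s=1.0):
    """Read /proc/stat twice, interval_s apart, and return the breakdown (Linux only)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        after = f.read()
    return cpu_breakdown(before, after)
```

Calling sample_proc_stat() on a Linux box returns a small dictionary. A low "busy" next to a high "iowait" is the war-room pattern: the CPU is not working hard, it is waiting for storage.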
CPU ≠ Everything
Watching only CPU is like judging traffic by looking at the car’s tachometer.
Just because the engine isn’t straining doesn’t mean you aren’t stuck in a jam.
Sometimes the engine is fine, but the road’s blocked, fuel’s low, or the tire’s flat.
Let’s Bring In Jay & Dev
Jay: “CPU is just 60%. Why is the app slow?”
Dev: “Let’s check disk I/O and thread pools.”
Jay: “Why? CPU looks fine.”
Dev: “Exactly. It’s too fine.”
Turns out:
Thread pool was exhausted
GC was running around like a headless chicken
DB was throttling connections like an angry bouncer
Moral: CPU is like a selfie. It shows something, but never the whole picture.
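Here is a tiny sketch of why a full thread pool and a calm CPU go together. It assumes nothing about Jay and Dev's real app: blocked_call is a stand-in that just sleeps, the way a thread waits on a slow database or disk, and run_blocked_pool is a hypothetical helper that compares wall-clock time with CPU time for the same batch of work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocked_call(delay_s):
    """Stand-in for a slow DB or disk call: the thread waits and burns no CPU."""
    time.sleep(delay_s)


def run_blocked_pool(workers=2, tasks=6, delay_s=0.05):
    """Run blocking tasks on a small pool; return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # With more tasks than workers, the extra tasks queue up behind
        # the blocked threads instead of using the idle CPU.
        list(pool.map(blocked_call, [delay_s] * tasks))
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu
```

With the defaults, six tasks go through two workers in three rounds, so the batch takes at least three delays of wall-clock time while the process uses only a sliver of CPU. That gap is what a dashboard showing "CPU is fine" hides.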
What Should You Watch With CPU?
Your performance “thali” shouldn’t have just CPU. Here’s the full meal:
1. Memory & GC Activity – JVM/.NET apps love drama here
2. Disk I/O & Storage Latency – SSDs are fast, but not divine
3. Thread Pool Utilization – If they are blocked, CPU just sips chai
4. Queue Lengths & Timeouts – Long queues = something downstream can’t keep up
5. Network Latency – Especially that third-party API that behaves only during off-peak hours
6. Load Averages (Linux) – CPU may seem free, but it’s standing room only inside
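The load average item is easy to check in a few lines. On Linux, load average counts tasks that are runnable plus tasks stuck in uninterruptible sleep (usually waiting on I/O), which is why it can be high while CPU looks free. The sketch below is an illustration under stated assumptions: the helper names are invented, and the 50% "low CPU" cutoff is an arbitrary example threshold, not a standard.

```python
import os


def load_per_core(load_1min, cores):
    """1-minute load average divided by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1min / cores


def looks_crowded(load_1min, cores, cpu_percent, low_cpu=50.0):
    """True when more tasks are waiting than there are cores, yet CPU is low.

    low_cpu is an assumed example threshold; tune it for your own system.
    """
    return load_per_core(load_1min, cores) > 1.0 and cpu_percent < low_cpu


def current_load_per_core():
    """Read the live 1-minute load average (Unix only) and normalize by cores."""
    load_1min, _, _ = os.getloadavg()
    return load_per_core(load_1min, os.cpu_count() or 1)
```

For example, a load of 12 on a 4-core box with CPU at 30% makes looks_crowded return True: plenty of tasks are waiting, but not for CPU. That is the cue to check disk, locks, and thread pools.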
Real Life Combos
CPU 20%, Memory 95%: Your app is one GC away from a meltdown
CPU 90%, low response time: Celebrate! That’s good utilization
CPU 30%, Disk latency 800ms: Ferrari body, bullock cart wheels
CPU 70%, thread pool full: Time to scale threads, not CPUs
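The combos above can be turned into a small correlation check. This is a sketch, not a monitoring product: the Snapshot fields, the read_combo function, and every threshold in it (90% memory, 100 ms disk latency, 85% CPU) are assumptions picked to mirror the examples, and real cutoffs depend on your workload.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_pct: float
    mem_pct: float
    disk_latency_ms: float
    pool_active: int
    pool_size: int
    response_ok: bool  # True when response times meet the target


def read_combo(s):
    """Return a list of tags describing what the metrics suggest together."""
    findings = []
    if s.mem_pct >= 90:  # assumed threshold
        findings.append("memory_pressure")
    if s.disk_latency_ms >= 100:  # assumed threshold
        findings.append("slow_storage")
    if s.pool_size > 0 and s.pool_active >= s.pool_size:
        findings.append("pool_exhausted")
    if s.cpu_pct >= 85:  # assumed threshold
        # High CPU is only bad news when users are actually suffering.
        findings.append("good_utilization" if s.response_ok and not findings
                        else "cpu_bound")
    return findings
```

Feeding it the "Ferrari body, bullock cart wheels" case, Snapshot(30, 50, 800, 10, 50, False), yields only the slow-storage tag, even though CPU alone would have pointed nowhere. The point is the shape of the check: no single field decides the answer.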
To Young Engineers (And Stressed Seniors)
It’s okay to panic the first time CPU spikes. But before you hit Slack or MS Teams with “PROD IS DYING,” breathe.
Ask:
“Is CPU the cause or just a symptom?”
“What else can I check?”
“Can I explain this to my grandma without sounding confused?”
Performance engineering is detective work, not a Netflix drama where the wrong guy always ends up in jail.
Let’s Be Better Engineers
Watching only CPU is a rookie move. Great engineers don’t just look, they correlate.
They ask:
“What is this metric not telling me?”
And they say:
“I don’t know, let’s dig deeper.”
Most importantly, they never say:
“CPU is fine. Must be code.”
Because truth bombs often sit quietly in memory, storage, or third-party services.
TL;DR (Too Long; Debugged Right)
CPU is one actor. Don’t ignore the rest of the cast.
Context is everything.
Don’t blame infra or code based on one graph.
Teach others — kindly and clearly.
Let’s move from dashboard watchers to pattern recognizers.
That’s when performance becomes real engineering.
#PerformanceEngineering #CPUMetrics #SRElife #ObservabilityMatters #DevOpsCulture #AppPerformance #InfraMatters #DebuggingTips #SiteReliabilityEngineering #LatencyMatters #MicroservicesPerformance #ProductionIssues #PerformanceTesting #EngineeringLeadership #ThinkBeyondCPU
Performance Manager at HCLTech
9h: 💡 Great insight
Performance Engineer at TRUGlobal
2w: Excellent article, Samson. Being a performance engineer reminds me of Sherlock Holmes. We need to look beyond the obvious.
Tech Co-Founder | Driving QE Innovation for Growth-Stage Companies | Customer Success Leader | IIM Kozhikode Alumini
2w: Loved the narrative and the takeaways, Samson.
Systems Engineer at LifePoint Health
2w: Thanks for sharing, Samson. Yes, you need the whole picture. It can be something as simple as pulling unchanged data multiple times because the fetch was done inside the do loop instead of outside it. That kept the GC maxed out, busy reclaiming memory that was going to be reallocated on the next pass through the loop. Four hours turned into less than 40 minutes once the fetch was moved outside the loop.