The Mistake of Watching CPU Alone

Let me tell you a story. A combo of Real and Reel to keep it spicy.

Several years ago, I found myself in a war room that felt more like a pressure cooker.

Production latency was shooting up. Dashboards were blinking. People were sweating.

The war room coordinator kept repeating like a stuck tape:

“CPU is only 40%. So it’s not infra. Must be code!”

Classic. If I had a rupee for every time I heard that line, I would have my own data center by now (with a chai stall inside).

Six hours later, after enough “deep dives” to find Atlantis, we discovered the real villain:

Disk I/O was maxed out.

The CPU wasn’t busy. It was bored. Just waiting for storage to catch up.
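A "bored" CPU like this usually shows up as iowait rather than as busy time. Here is a minimal sketch, assuming a Linux box, that computes the iowait share between two readings of the first `cpu` line in `/proc/stat`. The two sample lines are made-up numbers for illustration, not data from that war room, and iowait is only a rough signal on multi-core machines.

```python
def iowait_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two 'cpu' lines of /proc/stat."""
    def fields(line: str) -> list[int]:
        # Layout: cpu user nice system idle iowait irq softirq steal guest guest_nice
        # guest and guest_nice are already included in user and nice, so skip them.
        return [int(x) for x in line.split()[1:9]]

    deltas = [a - b for b, a in zip(fields(before), fields(after))]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0  # index 4 is iowait


# Hypothetical samples taken a few seconds apart (illustrative values only).
BEFORE = "cpu 1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu 1400 0 700 8200 1700 0 0 0 0 0"

print(f"iowait share: {iowait_share(BEFORE, AFTER):.0%}")
```

In this made-up sample the CPU is only 30% busy (user plus system) while 60% of its time goes to waiting on I/O, which is exactly the pattern a CPU-only dashboard hides.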

CPU ≠ Everything

Watching only CPU is like judging traffic by looking at the car’s rev counter.

Just because the engine isn’t straining doesn’t mean the road is clear. You may simply be idling in a jam.

Sometimes the engine is fine, but the road’s blocked, fuel’s low, or the tire’s flat.

Let’s Bring In Jay & Dev

Jay: “CPU is just 60%. Why is the app slow?”

Dev: “Let’s check disk I/O and thread pools.”

Jay: “Why? CPU looks fine.”

Dev: “Exactly. It’s too fine.”

Turns out:

Thread pool was exhausted

GC was running around like a headless chicken

DB was throttling connections like an angry bouncer

Moral: CPU is like a selfie. It shows something, but never the whole picture.
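To see how Jay's app can be slow while the CPU looks "too fine", here is a small sketch that starves a thread pool with tasks that only wait. The sleep is a stand-in for a slow disk or DB call, and the pool size, task count, and delay are assumptions picked to keep the demo short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_blocking_tasks(workers: int, tasks: int, block_seconds: float):
    """Run tasks that only wait, and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task just sleeps; tasks beyond the worker count sit in the queue.
        list(pool.map(time.sleep, [block_seconds] * tasks))
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


# 8 blocking tasks on 2 workers: 4 rounds of waiting, almost no CPU used.
wall, cpu = run_blocking_tasks(workers=2, tasks=8, block_seconds=0.05)
print(f"wall={wall:.2f}s cpu={cpu:.3f}s")
```

The batch takes roughly four times longer than a single task while CPU time stays near zero. The work is queued behind blocked threads, not behind the processor.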

What Should You Watch With CPU?

Your performance “thali” shouldn’t have just CPU. Here’s the full meal:

1. Memory & GC Activity – JVM/.NET apps love drama here

2. Disk I/O & Storage Latency – SSDs are fast, but not divine

3. Thread Pool Utilization – If they are blocked, CPU just sips chai

4. Queue Lengths & Timeouts – Long queues = something downstream can’t keep up

5. Network Latency – Especially that 3rd party API that works only in off-peak

6. Load Averages (Linux) – CPU may seem free, but it’s standing room only inside
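Item 6 is the easiest one to check by hand. On Linux, load average counts tasks stuck in uninterruptible sleep (typically waiting on disk) as well as runnable ones, so it can climb while CPU stays low. A rough sketch, where the function names and the 1.0 and 50% cut-offs are illustrative assumptions rather than standards:

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """1-minute load average divided by core count; above 1.0 means tasks are queuing."""
    return load_1min / max(cores, 1)


def waiting_not_computing(load_1min: float, cores: int, cpu_busy_pct: float) -> bool:
    """True when tasks pile up even though the CPU has headroom.

    The 1.0 and 50% thresholds are assumptions for illustration only.
    """
    return load_per_core(load_1min, cores) > 1.0 and cpu_busy_pct < 50.0


if hasattr(os, "getloadavg"):  # os.getloadavg() is Unix-only
    one_min, _, _ = os.getloadavg()
    print(f"load per core: {load_per_core(one_min, os.cpu_count() or 1):.2f}")
```

A load of 8 on a 4-core box with CPU at 30% would trip this check, and that is your cue to look at disk and blocked threads before touching the code.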

Real Life Combos

CPU 20%, Memory 95%: Your app is one GC away from a meltdown

CPU 90%, low response time: Celebrate! That’s good utilization

CPU 30%, Disk latency 800ms: Ferrari body, bullock cart wheels

CPU 70%, thread pool full: Time to tune the pool (or fix what it’s blocked on), not add CPUs
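These combos can be written down as a tiny "first suspect" rule set. This is only a sketch: the `Snapshot` fields, the thresholds, and the order of the checks are all assumptions for illustration, and real systems need their own baselines.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_pct: float
    mem_pct: float
    disk_latency_ms: float
    pool_busy_pct: float
    p95_ms: float  # 95th percentile response time


def first_suspect(s: Snapshot, slo_ms: float = 500.0) -> str:
    """Return where to look first. Every threshold here is an illustrative assumption."""
    if s.disk_latency_ms >= 100:
        return "storage"
    if s.pool_busy_pct >= 100:
        return "thread pool"
    if s.mem_pct >= 90:
        return "memory / GC"
    if s.cpu_pct >= 85 and s.p95_ms > slo_ms:
        return "cpu"
    return "healthy"


# The four combos above, with made-up values for the fields they don't mention.
print(first_suspect(Snapshot(20, 95, 5, 40, 300)))   # memory / GC
print(first_suspect(Snapshot(90, 50, 5, 60, 120)))   # healthy
print(first_suspect(Snapshot(30, 50, 800, 40, 900))) # storage
print(first_suspect(Snapshot(70, 50, 5, 100, 900)))  # thread pool
```

Notice that CPU is the last thing checked, and even 90% only counts as a problem when response time has blown past the target.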

To Young Engineers (And Stressed Seniors)

It’s okay to panic the first time CPU spikes. But before you hit Slack or MS Teams with “PROD IS DYING,” breathe.

Ask:

“Is CPU the cause or just a symptom?”

“What else can I check?”

“Can I explain this to my grandma without sounding confused?”

Performance engineering is detective work, not a Netflix drama where the wrong guy always ends up in jail.

Let’s Be Better Engineers

Watching only CPU is a rookie move. Great engineers don’t just look, they correlate.

They ask:

“What is this metric not telling me?”

And they say:

“I don’t know, let’s dig deeper.”

Most importantly, they never say:

“CPU is fine. Must be code.”

Because truth bombs often sit quietly in memory, storage, or third-party services.

TL;DR (Too Long; Debugged Right)

CPU is one actor. Don’t ignore the rest of the cast.

Context is everything.

Don’t blame infra or code based on one graph.

Teach others — kindly and clearly.

Let’s move from dashboard watchers to pattern recognizers.

That’s when performance becomes real engineering.

#PerformanceEngineering #CPUMetrics #SRElife #ObservabilityMatters #DevOpsCulture #AppPerformance #InfraMatters #DebuggingTips #SiteReliabilityEngineering #LatencyMatters #MicroservicesPerformance #ProductionIssues #PerformanceTesting #EngineeringLeadership #ThinkBeyondCPU

Kapil Garg

Performance Manager at HCLTech

9h

💡 Great insight

Pratik Gupta

Performance Engineer at TRUGlobal

2w

Excellent article, Samson. Being a performance engineer reminds me of Sherlock Holmes. We need to look beyond the obvious.

Vijayanathan Naganathan

Tech Co-Founder | Driving QE Innovation for Growth-Stage Companies | Customer Success Leader | IIM Kozhikode Alumini

2w

Loved the narrative and the takeaways Samson.

Henry Steinhauer

Systems Engineer at LifePoint Health

2w

Thanks for sharing, Samson. Yes need to have the whole picture. Something as simple as pulling unchanged data multiple times because it was done inside the do loop instead of outside the loop. It kept the GC maxed and busy reclaiming memory it was going to reallocate again inside that loop. 4 hrs turned to less than 40 minutes once taken outside the loop.


More articles by Samson Jaykumar
