From Monitoring to Observability: Ensuring System Health and Performance

"Monitoring tells you when something is wrong, while observability can tell you what’s happening, why it’s happening, and how to fix it."

What is Monitoring?

Monitoring is the task of assessing the health of a system by collecting and analyzing aggregate data from IT systems based on a predefined set of metrics and logs. It consists of collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.

The Four Golden Signals of Monitoring:


(1) Latency:

Latency is the total time it takes for a user to send a request and get a response back. Both successful and failed requests have latency, and it’s vital to differentiate between the two. For example, an HTTP 500 error triggered by a lost database connection might be served very quickly, which drags the average latency down and can mask a real failure.

A better approach to latency monitoring is to track the latency of failed requests separately from that of successful ones. Define a target for acceptable latency and monitor successful requests against it, while watching error latency on its own, to get an accurate picture of the system's health.
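
To make this concrete, here is a minimal sketch using the Prometheus Python client (prometheus_client); the metric and label names are illustrative assumptions, not part of any standard:

import time
from prometheus_client import Histogram, start_http_server

# Latency histogram labelled by outcome, so fast errors cannot hide slow successes.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds, labelled by outcome",
    ["status"],
)

def handle_request(handler) -> None:
    start = time.perf_counter()
    status = "error"
    try:
        handler()
        status = "success"
    finally:
        # Record the duration under the right outcome label either way.
        REQUEST_LATENCY.labels(status=status).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape

Querying the two label series separately then shows whether slow requests are succeeding slowly or failing fast.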

(2) Traffic:

Traffic represents the volume of requests and responses moving through a network. What counts as traffic varies from business to business, depending on the kind of requests passing through the network.

For a web service, traffic measurement is generally HTTP requests per second, while for a storage system, traffic might be transactions per second or retrievals per second. Monitoring traffic in your application helps you prepare for future demand.
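
As a sketch (again with prometheus_client, and with hypothetical metric and label names), traffic is usually captured as a monotonically increasing counter, which the monitoring backend converts into a rate:

from prometheus_client import Counter

# Total requests, labelled by method and endpoint; the backend derives
# requests per second from this counter (e.g. rate(http_requests_total[5m]) in PromQL).
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests received",
    ["method", "endpoint"],
)

def record_request(method: str, endpoint: str) -> None:
    HTTP_REQUESTS.labels(method=method, endpoint=endpoint).inc()

record_request("GET", "/api/orders")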

(3) Errors:

The error rate is the rate of failed requests. Errors may highlight infrastructure misconfigurations, outages, flaws in application code, or broken dependencies.

For instance, a sudden spike in the error rate might indicate a service failure, database failure, or network outage. Requests can fail either

  • Explicitly: e.g., an HTTP 500 internal server error
  • Implicitly: e.g., an HTTP 200 is returned, but the wrong content is delivered

To act on errors effectively, we should categorize them as critical or non-critical so that they can be prioritized and handled accordingly.
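
A minimal sketch of that categorization, assuming a Prometheus counter and hypothetical error-type and severity labels:

from prometheus_client import Counter

ERRORS = Counter(
    "application_errors_total",
    "Failed requests, labelled by type and severity",
    ["error_type", "severity"],
)

def record_error(error_type: str, critical: bool) -> None:
    # Critical errors can page someone; non-critical ones can wait for triage.
    severity = "critical" if critical else "non_critical"
    ERRORS.labels(error_type=error_type, severity=severity).inc()

# Explicit failure: HTTP 500 caused by a lost database connection.
record_error("http_500_db_connection", critical=True)
# Implicit failure: HTTP 200 returned, but with the wrong content.
record_error("http_200_wrong_content", critical=False)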

(4) Saturation:

Saturation refers to how “full” the service is at a given time relative to its overall capacity. Saturation can occur for any resource the application depends on, such as memory, CPU, disk, or network. As a system nears full utilization of a resource, performance typically degrades.

Therefore, setting a utilization target is critical; it helps keep the service performant and available to users.
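
As an illustration, the sketch below samples CPU and memory utilization with psutil (an assumed dependency) and compares them against an example 80% target; the threshold and metric names are placeholders, not recommendations:

import psutil
from prometheus_client import Gauge

CPU_UTILIZATION = Gauge("cpu_utilization_ratio", "CPU utilization, 0.0-1.0")
MEMORY_UTILIZATION = Gauge("memory_utilization_ratio", "Memory utilization, 0.0-1.0")

UTILIZATION_TARGET = 0.80  # example target: alert well before the resource is full

def sample_saturation() -> None:
    cpu = psutil.cpu_percent(interval=1) / 100.0
    mem = psutil.virtual_memory().percent / 100.0
    CPU_UTILIZATION.set(cpu)
    MEMORY_UTILIZATION.set(mem)
    if cpu > UTILIZATION_TARGET or mem > UTILIZATION_TARGET:
        print("warning: approaching saturation, performance may degrade")

sample_saturation()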

What is Observability?

Observability helps us understand why something is behaving unexpectedly. It is the ability to understand a system’s internal state by analyzing the data it generates, such as logs, metrics, and traces.

Observability provides a deeper view of the system's functionality, i.e., the internal states of the system inferred from the relationship between its inputs and outputs. It involves gathering different types of signals and data about the components within a system to establish the “Why?”, not just the “What went wrong?”. The three pillars of observability supply the data needed to answer both questions.

"Metrics provide performance data, logs offer event records, and traces follow request paths."

Three Pillars of Observability:


(1) Logs:

Records of events, typically in textual or human-readable form, are known as logs. In simple terms, they are a record of what’s happening within your software. Logs are an extremely easy format to generate – usually a timestamp plus a payload.

Log entries describe events such as starting a process, handling an error, or completing some part of a workload. They typically carry context like a status code, response time, user ID, a success or error message depending on the status, and a stack trace – details that help with troubleshooting when something fails later.
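
A minimal sketch of such log entries using Python's standard logging module; the field names (status, response_ms, user_id) are illustrative:

import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("orders-service")

def log_request(status: int, response_ms: float, user_id: str) -> None:
    # Capture enough context (status, timing, user) to troubleshoot later.
    if status == 200:
        log.info("request succeeded status=%s response_ms=%.1f user_id=%s",
                 status, response_ms, user_id)
    else:
        log.error("request failed status=%s response_ms=%.1f user_id=%s",
                  status, response_ms, user_id)

log_request(200, 42.3, "user-123")
log_request(500, 910.7, "user-456")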

(2) Metrics:

Metrics are numerical measurements of application performance and resource utilization. They include real-time operating data on the health of the system, such as CPU, memory, and disk utilization.

Metrics are very good at tracking trends over time and showing how systems or services are changing. If the metrics show the system's health degrading, we can act accordingly.

(3) Traces:

Traces show how operations move through a system, from one node to another.

A trace provides a detailed record of the flow of a request and its response: where the request enters the application and which components it passes through before completing. Tracing helps break down end-to-end latency and attribute it to different tiers or components, making it easier to identify where the bottlenecks are and which component is failing or rejecting requests.
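
As a sketch, the OpenTelemetry Python SDK (an assumed dependency) can express this as a parent span for the request with child spans for each component it touches; the service and span names below are hypothetical:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print spans to the console so the example stays self-contained.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str) -> None:
    # The parent span covers the whole request; child spans attribute latency
    # to the individual components the request passes through.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("query_inventory"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment gateway here

handle_checkout("42")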

Logs, metrics, and traces provide valuable but limited visibility into applications and infrastructures. However, when combined, these three sources can provide a relatively complete view of a system.

Tools: Datadog, New Relic, Dynatrace, Grafana, Splunk, AppDynamics, Prometheus


In the end, observability and monitoring are closely related concepts in systems and software engineering. Both aim to provide insights into a system's health, performance, and behavior. They utilize data collection, analysis, and visualization techniques to enable proactive detection and troubleshooting of issues. If you want your system to be sustainable and reliable, you have to enable both monitoring and observability.

"A system without observability and monitoring is like a ship without a compass."
