Observability in Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that applies aspects of software engineering to operations with the goal of creating ultra-scalable and highly reliable software systems. SRE can be traced back to the early 2000s, when Google’s rapid growth led to system outages and performance issues as its customer base and usage expanded. The principle behind SRE is that using software to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention.

Observability, a concept from control theory, has become a crucial pillar in SRE. In the context of SRE, observability focuses on collecting data from all levels of a system so issues can be detected and fixed before they become bigger problems.

What is Observability in SRE?

A simple way to view observability is as “monitoring on steroids”, or the next generation of monitoring. Monitoring has been around for decades, and many tools are available for the purpose, but traditionally only a limited set of signals could be viewed. Infrastructure metrics, such as CPU load or network error rate, formed the core, with some enterprise application monitoring via APIs in the applications.

Observability requires the collection and analysis of data from the entire computing environment – applications and infrastructure. This data can include logs, metrics, traces, and more. The collected data provides a 360-degree view of the applications, databases, and infrastructure health, which is key to avoiding major issues and making informed decisions.

Observability has become increasingly important in recent years as systems have become more complex and distributed. It allows engineers to quickly identify and resolve problems with their systems. When a system is observable, engineers can see what is happening inside the system and identify the root cause of problems more quickly. This can lead to faster resolution times and improved system reliability. Observability is also important because it can help engineers optimize system performance, identify areas for improvement, and in many cases prevent software limitations from developing into problems.

Components of Observability

There are several key components of observability:

  • Metrics: Metrics are numerical values that represent the state of the system at a point in time. They allow teams to understand trends and patterns, helping them predict and prevent incidents. They can be used to track things like system performance, resource usage, and error rates, thus providing a high-level view of how the systems are performing. This information can be used to identify bottlenecks, identify areas for improvement, and detect problems early on.
  • Logs: Logs are records of events that happen in the applications and systems. They are useful for debugging and understanding the flow of transactions through the application. They can be used to track things like system startup and shutdown, configuration changes, and errors. Logs are important because they provide a detailed record of what has happened, which can be used to troubleshoot problems, track down the root cause of errors, and identify security breaches.
  • Traces: Traces are detailed records of the execution of a request. They can be used to track the flow of data through a system and identify the source of errors. Traces can be particularly important in a microservice architecture, where a single request may travel between several microservices. Tracing information can reveal where in this sequence an error first appeared, thus pinpointing the particular microservice that is failing.
  • Alerts: Alerts notify teams when something goes wrong. Well-tuned alerts ensure that teams are aware of issues as soon as they occur, minimizing downtime.
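To make these signals concrete, here is a minimal sketch of how a metric, a log line, and an alert can interact. All names and the 500 ms threshold are hypothetical; real systems would use dedicated metrics and alerting tooling rather than this toy loop:

```python
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

LATENCY_ALERT_MS = 500  # hypothetical alerting threshold

def record_latency(sample_ms, window):
    """Append a latency sample (metric), log the event, and fire an
    alert when the rolling average breaches the threshold."""
    window.append(sample_ms)
    avg = statistics.mean(window)
    log.info("latency_ms=%d avg_ms=%.1f", sample_ms, avg)   # the log record
    if avg > LATENCY_ALERT_MS:                              # the alert rule
        log.warning("ALERT: avg latency %.1f ms > %d ms", avg, LATENCY_ALERT_MS)
        return True
    return False

window = []
fired = [record_latency(ms, window) for ms in (120, 340, 900, 1100)]
# fired -> [False, False, False, True]
```

Note that the alert fires only once the average crosses the threshold, illustrating how metrics feed alerting rules while the logs preserve the per-event detail needed for later troubleshooting.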

Examples

Some examples of metrics include:

  • System performance metrics: CPU usage, memory usage, disk I/O, network traffic, request latency, response time, throughput.
  • Infrastructure metrics: Server health, disk space, network bandwidth, CPU temperature, fan speed.
  • Application metrics: Number of active users, number of requests per second, number of errors, queue lengths.
  • Customer-facing metrics: Website availability, page load time, number of support tickets.
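As a sketch of how such metrics are tracked in code, the following toy in-process registry records counters and gauges. The metric names are hypothetical; production systems typically use a metrics client such as a Prometheus or StatsD library instead:

```python
from collections import defaultdict

class Metrics:
    """A toy in-process metrics registry: counters for monotonically
    increasing values, gauges for point-in-time readings."""
    def __init__(self):
        self.counters = defaultdict(int)
        self.gauges = {}

    def incr(self, name, amount=1):
        self.counters[name] += amount  # e.g. requests, errors

    def set_gauge(self, name, value):
        self.gauges[name] = value      # e.g. queue length, CPU usage

m = Metrics()
m.incr("http_requests_total")
m.incr("http_requests_total")
m.incr("http_errors_total")
m.set_gauge("queue_length", 7)

# A derived metric: the error rate computed from two counters.
error_rate = m.counters["http_errors_total"] / m.counters["http_requests_total"]
```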

Some examples of traces include:

  • HTTP requests: Traces of HTTP requests are detailed timelines of a request’s journey through a system. These traces show the path a request took through various services, how long it spent in each one, and where errors or slowdowns occurred. For example, an engineer might track an HTTP request from the moment it is received by the web server to the moment it is completed and the response is sent back to the client. This information can be used to identify bottlenecks in the application, troubleshoot errors, and optimize performance.
  • Database queries: Traces of database queries can be used to track the flow of data through a database and identify the source of performance problems. This information can be used to identify slow queries, optimize queries, and troubleshoot database problems. For example, an engineer might track a database query from the moment it is received by the database server to the moment it is completed and the results are returned to the application.
  • RPC calls: Traces of RPC calls can be used to track the flow of data between distributed systems and identify the source of performance problems and errors. This information can be used to identify bottlenecks in the communication between the systems, troubleshoot errors, and optimize performance. For example, an engineer might track an RPC call from the moment it is made by one system to the moment it is received and processed by the other system.
  • Function calls: Traces of function calls can be used to track the flow of data through a function. This information can be used to identify slow functions, optimize functions, and troubleshoot function errors. For example, an engineer might track a function call from the moment it is made to the moment it returns.
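The common thread in these examples is the span: a named, timed unit of work with a parent, from which the request’s full timeline is reconstructed. The sketch below shows the idea with a context manager; the service names are hypothetical, and real tracers (such as OpenTelemetry implementations) also propagate trace and span IDs across process boundaries:

```python
import time
from contextlib import contextmanager

spans = []  # collected trace spans

@contextmanager
def span(name, parent=None):
    """Record a timed span; a child span names its parent so the
    request's call tree can be reconstructed afterwards."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name, "parent": parent,
                      "duration_ms": (time.perf_counter() - start) * 1000})

# Trace a hypothetical HTTP request that fans out to two downstream calls.
with span("GET /checkout"):
    with span("auth-service", parent="GET /checkout"):
        time.sleep(0.01)   # stand-in for the auth call
    with span("db.query", parent="GET /checkout"):
        time.sleep(0.02)   # stand-in for the database call
```

Sorting the recorded spans by duration immediately shows where the request spent its time, which is exactly how trace views pinpoint a slow microservice.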

Some examples of logs include:

  • System logs: System logs record events that occur on a system, such as startup and shutdown, configuration changes, and error messages.
  • Error logs: Error logs record errors that occur within an application’s runtime environment.
  • Security logs: Security logs record events that are relevant to security, such as login attempts, failed authentication attempts, and suspicious activity.
  • Application logs: These logs provide information about the behavior of an application, such as user activities, system events, and errors.
  • Audit logs: These logs record the sequence of activities that affect specific operations and procedures for auditing purposes.
  • Transaction logs: These logs record all transactions and the database changes made by each transaction.
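Whatever the log type, emitting records in a structured form makes them far easier for log pipelines to index and query. A common convention is one JSON object per line; the sketch below shows this with Python’s standard logging module (the logger name and events are hypothetical):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object, a common
    structured-logging convention for machine-readable logs."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "logger": record.name,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("auth")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("login_success user=alice")
log.warning("login_failed user=bob attempts=3")  # a security-relevant event
```

Because every field is named, a query such as “all WARNING events from the auth logger” becomes a simple filter rather than a fragile text search.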

Conclusion

Observability has become an integral part of Site Reliability Engineering. It provides valuable insights into system performance, enabling quick detection and resolution of issues. The future promises even more sophisticated tools for monitoring system health, predicting potential problems, and ensuring optimal performance.

Here are some ways observability is used:

  • Troubleshoot problems
  • Identify security threats
  • Track the performance of systems

