Observability Today

This newsletter will explore the many topics in IT Observability today - and I mean many!! Let us do a quick overview of the landscape.

Summary

Observability is rapidly evolving to meet the challenges of complex modern software systems. Key trends include the push for observability at all development stages ("shift-left"), the need to understand the full stack and the dependencies between components, and new ways of interacting with observability data (dashboards, APIs, natural language interfaces, LLMs, ...).

There is a focus on the cost of both the platforms themselves and the IT resources they monitor. The emphasis is moving from monitoring isolated resources to understanding complex interrelationships and the true business impact of IT systems.

It is important to acknowledge the roots in application performance monitoring of the term "observability" while also addressing the entirety of IT complexity, holistically, up and down the stack from infrastructure to digital experience and business impact, right and left from development to production, from rules-based analysis to AI/ML-based insights, and from alerts to action.

Discussion

Shift-left, shift-right, full-stack, and service dependencies are just some directions in which observability is exploding.

  • Shift-left is the general philosophy that errors caught in development will not cause problems in production. It focuses observability on development tool chains or, more specifically, on CI/CD, and perhaps even on the health of the development pipeline itself. Given the number of IT problems caused by configuration, we might well extend the idea of shift-left to automation and configuration.
  • Full stack is the idea that IT layers affect each other: servers, VMs/containers, and applications/services. If we could only connect the dots up and down all the layers, we could do observability better. It gets even more complicated when a physical network is involved, because the network has many layers of its own (TCP, IP, Ethernet, optical, cables, ...).
  • Shift-right is the general philosophy that we should do more "testing" in the production environment, as it is the only truly representative environment. This runs the risk of production problems but may also save money on the development environment. The development environment has to be part of the calculation because total IT cost is becoming a significant focus.

How do I get insights? Let me count the ways. Ways of interfacing with observability data and insights are also expanding, often overlapping.

  • The roots of monitoring are often tied to the dashboard paradigm. As resource types and monitoring concepts expanded, so did the number of dashboards, leading to a focus on a) ease of dashboard creation and b) queryless, well, queries. Ease of dashboard creation also extends to APIs.
  • Programmatic interfaces are also necessary: APIs, webhooks, scripts, mentions, and more. They all allow us to connect software components in powerful ways and are an important foundation for automation.
  • Then there was 2023, the year of Large Language Models (LLMs). While some platforms had already developed natural language processing (NLP) interfaces, LLMs hit the cultural consciousness that year, like Taylor Swift on tour. LLMs are not yet an exact science and have their challenges. Still, their ability to chain one question to the next, simulate a conversation, and perform more powerful manipulations than the average query language opens up new ways of interacting with observability platforms.

Observing Observability Platforms.

  • Observability platforms generate observability data about themselves so that the platform's own health can be monitored and troubleshot. Recursive? Perhaps.
  • Like any other complex software, observability platforms deal with their own reliability, availability, and performance fundamentals. Oh, and operating cost as well (as distinct from the customer's cost).

Three base-level functions are essential and have the potential for disaggregation: collection, storage, and query.

  • Arguably, collection is more complex for on-prem environments than cloud environments because the hardware is not abstracted like it is in the cloud and because there is a wide range of devices, interfaces, and vendors (and decades of deployment). Nonetheless, collection remains an essential and engineering-worthy activity, even in the cloud.
  • Storage is a biggie. How much to ingest and store in real-time, how much to retain over the long term, and how much to keep around for fast queries. These are just some of the considerations. There is also normalization before storage, which is arguably an art form, as we see in all areas of abstraction.
  • I use query as an umbrella term for all approaches to retrieving data and using it in visualization, analysis, and perhaps more. This is an area of significant engineering challenge, and when combined with analysis, it is likely to become an increasing focus of cost optimization as compute usage grows.
  • Earlier, I mentioned disaggregation. There are industry conversations about vendor-neutral collection/collectors and who should own and manage stored telemetry data. These conversations are more or less attractive depending on the engineering skills available within an IT team; however, there are rich and complex discussions to be had here, including the core value of observability platforms.
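To make the storage questions above a little more concrete, here is a minimal sketch of one possible retention policy: keep recent points at full resolution for fast queries and reduce older points to per-bucket averages for long-term retention. The function names, the `(timestamp, value)` tuple shape, and the window sizes are all assumptions made for illustration; they do not describe any particular platform.

```python
def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = {}
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(points, now, raw_window, bucket_seconds):
    """Split points into a downsampled long-term tier and a raw recent tier.

    Points older than raw_window seconds are averaged per bucket;
    newer points are kept untouched.
    """
    recent = [(ts, v) for ts, v in points if now - ts <= raw_window]
    old = [(ts, v) for ts, v in points if now - ts > raw_window]
    return downsample(old, bucket_seconds), recent


# Hypothetical samples: one reading every 30 seconds.
points = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0), (120, 70.0), (150, 90.0)]
long_term, raw = apply_retention(points, now=150, raw_window=60, bucket_seconds=60)
print(long_term)  # averaged older data
print(raw)        # full-resolution recent data
```

Averaging is only one choice of reduction. Keeping the minimum, maximum, or a percentile per bucket trades storage for the ability to see spikes later, which is part of why normalization and retention are judgment calls.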

Analysis is everything from simple alert rules to complex correlations, with anomaly detection and other newish approaches to detecting known knowns and uncovering unknown unknowns.

  • AI/ML is at the center of this conversation and is another big topic on its own, as are the unknown unknowns and the question of whether IT teams want more information given their existing fatigue.
  • At the end of that conversation is the unquestionable impact of these new approaches.
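The range from simple alert rules to anomaly detection can be illustrated with a small sketch. The first function is a static threshold rule. The second is a rolling z-score check, used here as a simple stand-in for anomaly detection and not as a claim about how any given platform does it. The sample data, window size, and limits are assumptions chosen for the example.

```python
from statistics import mean, stdev


def threshold_alerts(samples, limit):
    """Static rule: return indexes of samples above a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


def zscore_anomalies(samples, window=5, z_limit=3.0):
    """Return indexes of samples far from the mean of the preceding window.

    A sample is flagged when it sits more than z_limit standard
    deviations away from the mean of the previous `window` samples.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as anomalous.
            if samples[i] != mu:
                flagged.append(i)
        elif abs(samples[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Hypothetical latency readings in milliseconds with one spike.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 100]

print(threshold_alerts(latency_ms, limit=250))  # the static rule misses the spike
print(zscore_anomalies(latency_ms))             # the rolling check flags index 6
```

The static rule only catches what someone anticipated when choosing the limit, while the rolling check adapts to recent behavior. It can also produce more alerts, which is the fatigue trade-off raised above.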

Simple objects to complex relationships.

  • The engineering challenge here should not be underestimated, not only in terms of discovering and communicating the relationships but also with respect to the scale of telemetry that has to be collected, analyzed, and stored (particularly tough on visualization engines).
  • The modern concept of a "service" is at the center of this conversation, but underneath that business concept can be a sea of TCP/UDP connections (whether ephemeral or not, there is a logical relationship and telemetry).
  • Nothing says observability more than the transition from monitoring single resources to understanding collections of resources with complex relationships and chains.
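As a rough illustration of moving from single resources to relationships, the sketch below rolls connection-level records up into service-to-service edges and then walks the resulting dependency chain. The flow records, host names, and the host-to-service mapping are hypothetical; a real system would have to discover them, which is where much of the engineering challenge lies.

```python
from collections import defaultdict

# Hypothetical connection records: (source_host, dest_host, dest_port).
flows = [
    ("web-1", "api-1", 8080),
    ("web-2", "api-1", 8080),
    ("web-1", "api-2", 8080),
    ("api-1", "db-1", 5432),
    ("api-2", "db-1", 5432),
]

# Hypothetical mapping from hosts to the business-level service they belong to.
host_to_service = {
    "web-1": "storefront",
    "web-2": "storefront",
    "api-1": "orders",
    "api-2": "orders",
    "db-1": "orders-db",
}


def service_edges(flows, host_to_service):
    """Collapse host-level connections into counted service-to-service edges."""
    edges = defaultdict(int)
    for src, dst, _port in flows:
        src_service = host_to_service[src]
        dst_service = host_to_service[dst]
        if src_service != dst_service:
            edges[(src_service, dst_service)] += 1
    return dict(edges)


def downstream(service, edges):
    """Return every service reachable from `service` by following edges."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    seen, stack = set(), [service]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


edges = service_edges(flows, host_to_service)
print(edges)
print(downstream("storefront", edges))
```

Five connections collapse into two service edges here. At production scale the same roll-up has to cope with far more connections, many of them short-lived, which is the telemetry scale problem noted in the first bullet.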

Endless horizontal resource scalability is great, but operating budgets are not. There are two roles observability platforms play here:

  • Their own costs/pricing models
  • The role they can play in transforming IT leaders from resource provisioners to business partners across all areas of business impact. The cost conversation involves both development and production environments: IT all-up.

I know a little about networking, so it is surprising that I have not spoken more about it in this article. Don't worry, I will. SD-WAN, Cloud Managed Networking, Wireless, and more are all exciting areas in monitoring/observability systems.

It must be acknowledged that many associate the roots of the term observability with application performance monitoring and perhaps even more narrowly with tracing or application logs. While the use of the term "observability" in that way has to be acknowledged, the bottom line is complexity is all around, and IT teams need to know not only that an application is not performing well but why and the business impact: whether SLOs, cost, or other critical business-facing metrics. Observability today is about tackling all IT complexity.

Hopefully, this teaser provides "insight" into the tip of the observability iceberg and the many potential conversations to be had.


Godwin Josh

Co-Founder of Altrosyn and DIrector at CDTECH | Inventor | Manufacturer

1y

You mentioned the intriguing landscape of observability. The evolving complexity in IT environments raises questions about effective monitoring. In this context, how do you see observability tools adapting to handle the intricacies of distributed systems and microservices? Now, imagine a scenario where a critical application faces a sudden surge in traffic. How would you leverage observability techniques to quickly identify, analyze, and address potential bottlenecks or anomalies in such a dynamic environment? I'm curious to hear your thoughts on this practical application of observability in real-world scenarios.

More articles by Mark Seery
