Monitoring systems design

Dmitry Shyionak

Technical Lead & DevOps Manager | Blockchain Infrastructure Architect

Published Sep 26, 2023

What is the proper design of mine the monitoring and alerting system? This is the question you should ask yourself before implementing any monitoring system. I saw a lot of different monitoring systems based on different stacks and my conclusion is the simplest solution often wins complex. Why? The simple solution doesn’t provide the full picture, however, typically it gathers only important data and has a better feedback loop neither than an overcomplicated monitoring system with the enormous amount of alerts, which creates alert fatigue and blindness. Alert fatigue is a significant risk in modern infrastructure environments because it can have detrimental effects on system reliability, the efficiency of incident response, and the overall well-being of the teams involved. Alert fatigue occurs when individuals or teams become desensitized to the constant stream of alerts and notifications generated by monitoring and alerting systems. Here are some reasons below why alert fatigue is a risk in DevOps:

High Volume of Alerts: modern infrastructure environments typically involve a large number of microservices, containers, and infrastructure components. Each of these can generate its own set of alerts. With so many alerts to manage, it's easy for teams to become overwhelmed.
False Positives: Many alerts generated by monitoring systems turn out to be false alarms or non-actionable events. Dealing with these false positives can be frustrating and lead to a lack of trust in the alerting system.
Constant Interruptions: Frequent alerts disrupt the flow of work and concentration. Constantly stopping to address alerts can hinder productivity and cause stress.
Decreased Responsiveness: When teams are bombarded with alerts, they may not prioritize or respond to them effectively. This can result in slower incident resolution times and potentially lead to more significant outages.
Burnout: The constant pressure of managing alerts and incidents can lead to burnout among teams. Burnout not only harms the well-being of team members but also affects their ability to perform effectively.
Inefficient Use of Resources: Teams may allocate a significant portion of their time and effort to managing alerts, even if many of those alerts are not critical. This inefficient allocation of resources can lead to missed opportunities for proactive work and improvements.
Lack of Context: Alerts often lack context, making it difficult for teams to understand the underlying issues and prioritize them effectively. This can lead to confusion and misallocation of resources.

Organizations should consider the following strategies to build proper monitoring and alerting systems and avoid alert fatigue and blindness:

Alert Tuning: Fine-tune alerting thresholds to reduce the number of false positives and ensure that only critical and actionable alerts are triggered.
Better observability: Create multiples dashboards: an overview dashboard with the least possible number of metrics, ideally, not more than 5 to simplify the decision-making process for engineers and overview only for debugging purposes.
Aggregation and Correlation: Use alert aggregation and correlation tools to group related alerts and provide more context, reducing the overall number of notifications.
Prioritization: Implement a clear alert prioritization system to ensure that teams focus on the most critical issues first.
Automation: Automate routine responses to common alerts to free up human resources for more complex tasks.
Documentation and Runbooks: Create detailed documentation and runbooks for common incidents to guide teams in their resolution efforts.
Training and Education: Invest in training and education to help team members better understand how to handle alerts and incidents effectively.
Feedback Loops: Encourage open feedback from teams about the alerting system's effectiveness and make adjustments accordingly.

By using the proper design of a monitoring system from the very beginning teams can maintain a more efficient, responsive, and resilient environment while also promoting the well-being of their team members.