Recently, the team at Signal0ne has been investigating an interesting problem in observability and incident management, one I like to call the “abundance of signals”.
You can collect all of the data and still not know what is happening within your system. Why is that? Humans have a limited ability to process data; it is not what we are good at, and that is why we have computers.
The answer might be AIOps. Applying ML and DL to observability is nothing new, but it did not make a good first impression on the community. Part of the problem is the abundance of signals mentioned earlier. An algorithm may discover a change in the trend of app traffic, an anomaly, or a new baseline for updating system thresholds. The thing is, information like “there is an anomaly in your user traffic” or “/api/users operation returned 500 Internal Server Error for last 10 requests” does not tell you much about what is actually happening to the system; at best, it may confuse the on-call person. That is why such information must be put in context. Just imagine you get 2 alerts:
Which one do you prefer?