What is observability (o11y)
- The textbook definition of observability (which applies to systems of any type, not just software) is a measure of how well the internal state of a system can be understood based on data that the system makes available externally
- This definition, in its original form, comes from control theory: it was introduced by the Hungarian-American engineer Rudolf E. Kálmán (born in Hungary, 1930) around 1960
- Think of o11y as a property of a system — another attribute, like functionality, performance or testability. You could also think of it as a mindset permeating design decisions
- Observability is about getting answers to questions that we didn’t know we’d have to ask
- It encapsulates anything that lets you see how your app is performing against engineering goals. Engineering goals could include uptime, error rate, read/write times, API call times, etc.; the measurements behind these goals are known as Service Level Indicators (SLIs). A small sketch of computing such SLIs follows this list.
- It provides context across roles and organizations, as it enables developers, operators, managers, PMs, contractors, and any other approved team members to work with the same views and insights about services, specific customers, SQL queries, etc.
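To make the idea of SLIs concrete, here is a minimal sketch in Python of computing an error-rate and p99-latency SLI from raw request data. The in-memory request records and their fields are hypothetical stand-ins for whatever telemetry your system actually emits.

```python
# Minimal sketch: computing two common SLIs (error rate and p99 latency)
# from a batch of request records. The record fields are hypothetical.
from statistics import quantiles

requests = [
    {"status": 200, "latency_ms": 42},
    {"status": 200, "latency_ms": 87},
    {"status": 500, "latency_ms": 310},
    {"status": 200, "latency_ms": 55},
]

error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)
p99_latency = quantiles(
    [r["latency_ms"] for r in requests], n=100, method="inclusive"
)[98]

print(f"error rate SLI: {error_rate:.2%}")
print(f"p99 latency SLI: {p99_latency:.0f} ms")
```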
Why we need observability
- Achieving observability empowers users and IT teams to update and deploy software and apps safely, without negatively impacting the end-user experience, thus enabling faster innovation cycles. It provides teams with a far greater level of control over their systems
- By making a system observable, organizations can release more code, more quickly and more safely, to the benefit of the business. This becomes even more relevant for the highly distributed, complex, and interconnected systems of today.
- By providing the context and correlation that engineers need to resolve complex application issues, observability tools help teams spend less time tracking down root-cause problems and performing unplanned work.
- For teams of all sizes, observability offers a shared view of the system. This includes its health, its architecture, its performance over time, and how requests make their way from frontend / web apps to backend and third-party services.
Benefits of observability
- Comprehensive understanding of complex systems
- Smarter planning for code releases and application capacity
- Faster problem solving and shorter MTTR
- More insightful incident reviews
- Better uptime and performance
- Happier customers and more revenue
- Easier to understand systems, easier to control, and easier to fix
- More effective post-mortems following incidents
Questions that observability can answer
- Why is x broken?
- What services does my service depend on — and what services are dependent on my service?
- What went wrong during this release?
- Why has performance degraded over the past quarter?
- Why did my pager just go off?
- What changed? Why?
- What logs should we look at right now?
- Should we roll back this canary?
- Is this issue affecting certain Android users or all of them?
- What is system performance like for our most important customers?
- What SLO should we set?
- Are we out of SLO?
- What did my service look like at time point x?
- What was the relationship between my service and x at time point y?
- What was the relationship of attributes across the system before we deployed? What is it like now?
- What is most likely contributing to latency right now? What is most likely not?
- Are these performance optimizations on the critical path?
Difference from monitoring
Is o11y the same as monitoring? No, although achieving observability often relies on effective monitoring tools and platforms; it is more accurate to say that effective monitoring tools augment the observability of a system.
- Monitoring merely tells you when something is wrong. Observability helps pinpoint what is wrong, and why it happened.
- Monitoring is about collecting metrics and logs from a system, whereas observability is about the useful insights gained from that data
- Monitoring is failure-centric, whereas observability is about the overall behavior of the system
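The sketch below illustrates the difference in Python. The "monitoring" half checks a predefined threshold and can only tell you that something is wrong; the "observability" half slices the same structured events by an arbitrary attribute to ask a question nobody anticipated in advance. All event fields and the alert threshold are hypothetical.

```python
# Hypothetical structured events emitted by a service.
from collections import Counter

events = [
    {"status": 500, "region": "eu-west", "app_version": "2.3.1"},
    {"status": 200, "region": "us-east", "app_version": "2.3.0"},
    {"status": 500, "region": "eu-west", "app_version": "2.3.1"},
    {"status": 200, "region": "eu-west", "app_version": "2.3.0"},
]

# Monitoring: a predefined check that tells you *that* something is wrong.
error_rate = sum(e["status"] >= 500 for e in events) / len(events)
if error_rate > 0.05:  # hypothetical alert threshold
    print(f"ALERT: error rate {error_rate:.0%} exceeds threshold")

# Observability: slice the same events by any attribute, after the fact,
# to ask *why*: here, which app_version the errors are concentrated in.
errors_by_version = Counter(
    e["app_version"] for e in events if e["status"] >= 500
)
print("errors by app_version:", errors_by_version.most_common())
```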
Requirements of an observability system
- Observability tools must be simple to integrate
- Observability tools must be easy to use
- Observability tools must be built on high-quality telemetry
- Observability tools must be real-time
- Observability tools must aggregate and visualize your data
- Observability tools must provide irrefutable context: An observability tool should guide its users toward successful incident resolution and system exploration by providing context every step of the way.
- Temporal context: How does something look now versus one hour, one day, or one week ago?
- Relative context: How much has this changed relative to other changes in the system?
- Relational context: What is dependent on this, and what is it dependent on? How will changes to this dependency chain affect other services?
- Proportional context: What is the scope of an incident or issue? How many customers, versions, or geographies are affected? Are VIP customers more or less impacted?
- Observability tools must scale
- Observability tools must provide ROI to the business
Validating an observable system
Now that we have an observability system in place, how do we make sure the system is actually observable? To help with that, we can start by asking the following questions:
- Can you continually answer open-ended questions about the inner workings of your applications to explain any anomalies, without hitting investigative dead ends (i.e., the issue might be in a certain group of things, but you can’t break it down any further to confirm)?
- Can you understand what any particular user of your software may be experiencing at any given time?
- Can you quickly see any cross-section of system performance you care about, from top-level aggregate views, down to the single and exact user requests that may be contributing to any slowness (and anywhere in between)?
- Can you compare any arbitrary groups of user requests in ways that let you correctly identify which attributes are commonly shared by all users who are experiencing unexpected behavior in your application?
- Once you do find suspicious attributes within one individual user request, can you search across all user requests to identify similar behavioral patterns to confirm or rule out your suspicions?
- Can you identify which system user is generating the most load (and therefore slowing application performance the most), as well as the 2nd, 3rd, or 100th most load-generating users?
- Can you identify which of those most-load-generating users only recently started impacting performance?
- If the 142nd slowest user complained about performance speed, can you isolate their requests to understand why exactly things are slow for that specific user?
- If users complain about timeouts happening, but your graphs show that the 99th, 99.9th, even 99.99th percentile requests are fast, can you find the hidden timeouts?
- Can you answer questions like the preceding ones without first needing to predict that you might need to ask them someday (and therefore set up specific monitors in advance to aggregate the necessary data)?
- Can you answer questions like these about your applications even if you have never seen or debugged this particular issue before?
- Can you get answers to questions like the preceding ones quickly, so that you can iteratively ask a new question, and another, and another, until you get to the correct source of issues, without losing your train of thought (which typically means getting answers within seconds instead of minutes)?
- Can you answer questions like the preceding ones even if that particular issue has never happened before?
- Do the results of your debugging investigations often surprise you by revealing new, perplexing, and bizarre findings, or do you generally find only the issues you suspected that you might find?
- Can you quickly (within minutes) isolate any fault in your system, no matter how complex, deeply buried, or hidden within your stack?
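Several of these questions come down to being able to slice raw, request-level telemetry by arbitrary dimensions. As a rough illustration (not any particular vendor's query language), here is a Python sketch over hypothetical request events: ranking users by the load they generate, then isolating one specific user's slow requests.

```python
from collections import Counter

# Hypothetical request-level events, one record per user request.
events = [
    {"user_id": "u1", "duration_ms": 120},
    {"user_id": "u2", "duration_ms": 2300},
    {"user_id": "u1", "duration_ms": 95},
    {"user_id": "u3", "duration_ms": 4100},
    {"user_id": "u2", "duration_ms": 1800},
]

# "Which user generates the most load?": total time spent serving each user.
load_by_user = Counter()
for e in events:
    load_by_user[e["user_id"]] += e["duration_ms"]
print("top load-generating users:", load_by_user.most_common())

# "Why is it slow for this specific user?": isolate that user's slow requests.
slow_for_u2 = [e for e in events if e["user_id"] == "u2" and e["duration_ms"] > 1000]
print("slow requests for u2:", slow_for_u2)
```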
Sources of observability data
As with any observation, it is incomplete without sufficient and correct data. It is therefore of prime importance that we collect as much data as possible, from as many sources as possible, that directly or indirectly interact with the system under observation.
Common sources of such data can be:
- Network flow data: router/switch counters, firewall logs, etc.
- Virtual servers: VM Logs, ESXi Logs, etc.
- Cloud services: AWS data sources such as EC2, EMR, S3, etc.
- Docker: logging driver, syslog, app logs, etc.
- Containers and microservice architectures: container and microservices logs, container metrics and events, etc.
- Third-party services: SaaS, FaaS, serverless, etc.
- Control systems: vCenter, Swarm, Kubernetes, etc.
- Dev automation: Jenkins, SonarQube, etc.
- Infra orchestration: Chef, Puppet, Ansible, etc.
- Signals from mobile devices: product adoption, users and clients, feature adoption, etc.
- Metrics for business analytics: app data, HTTP events, SFA/CRM
- Signals from social sentiment analytics: analyzing tweets over time
- Customer experience analytics: app logs, business process logs, call detail records, etc.
- Message buses and middleware
These can be classified into three types, also known as the three pillars of observability:
- Metrics:
  - System metrics (CPU, memory, disk)
  - Infrastructure metrics (AWS CloudWatch)
  - Web tracking scripts (Google Analytics, Digital Experience Management)
  - Application agents/collectors (APM, error tracking)
  - Business metrics (revenue, customer sign-ups, bounce rate, cart abandonment)
- Logs (events): come in three forms: plain text, structured, and binary
  - System and server logs (syslog, journald)
  - Firewall and intrusion detection system logs
  - Social media feeds (Twitter, etc.)
  - Application, platform and server logs (log4j, log4net, Apache, MySQL, AWS)
- Traces:
  - Specific parts of a user’s journey are collected into traces, showing which services were invoked, which containers/hosts/instances they were running on, and what the results of each call were.
Tools needed for observability
- Infrastructure monitoring: Determine the health and performance of the containers and environment your applications run on
- Application performance monitoring: Investigate the behavior of your application at the service level. Determine where calls are going and how they perform.
- Real user monitoring: Understand the experience of real users by collecting data from browsers about how your site performs and looks. Isolate issues to the frontend or backend.
- Synthetic monitoring: Measure the impact that releases, third-party APIs and network issues have on the performance and reliability of your app.
- Log viewing: Dig deeper into “the why behind the what” when issues occur. Figure out how to remediate the issues quickly.
- Incident response: Alert the right team the first time to fix the issue and provide them with the data they need to succeed in doing so, all in one place.
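As a small illustration of the synthetic-monitoring idea above, the sketch below (Python standard library only; the endpoint URL and the 2-second threshold are placeholder assumptions) issues a scripted probe against an endpoint and records its status and latency:

```python
# Minimal synthetic-monitoring probe: scripted request plus latency measurement.
# The endpoint URL and the 2-second threshold are placeholder assumptions.
import time
import urllib.request

def probe(url, timeout_s=5.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except Exception as exc:
        status = f"error: {exc}"
    return {"url": url, "status": status, "latency_s": time.monotonic() - start}

result = probe("https://example.com/health")
if result["status"] != 200 or result["latency_s"] > 2.0:
    print("synthetic check failed:", result)
else:
    print("synthetic check ok:", result)
```

Run on a schedule (e.g. from a cron job in several regions), probes like this catch regressions from releases, third-party APIs, and network issues before real users do.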
Basic terminology and definitions
- Telemetry data:
  1. Granular measurements of the system that allow Service Level Indicators (SLIs) to be measured and explained.
  2. The most commonly used telemetry data types are logs, metrics, and traces.
- Logs:
  1. Structured or unstructured lines of timestamped text that are emitted by an application in response to some event in the code.
  2. Logs are distinct records of “what happened” to or with a specific system attribute at a specific time.
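For example, a structured log event is just a timestamped record with named attributes. The sketch below (Python standard library; the event name and field names are arbitrary) emits one as a JSON line:

```python
# Emitting a structured (JSON) log event; event and field names are arbitrary.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, **attributes):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **attributes,
    }
    logging.info(json.dumps(record))

log_event("payment_failed", user_id="u42", amount=19.99, error="card_declined")
```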
- Metrics:
  1. Values monitored over a period of time.
  2. Metrics are usually represented as counts or measures, and are often aggregated or calculated over a period of time.
  3. Metrics may be key performance indicators (KPIs), CPU capacity, memory, or any other measurement of the health and performance of a system.
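As a rough sketch of the "counts or measures aggregated over a period of time" idea, the snippet below (plain Python, made-up CPU samples) rolls raw observations up into a per-minute count and average:

```python
# Aggregating raw measurements into per-minute metrics (count and average).
from collections import defaultdict

# Hypothetical raw observations: (unix_timestamp_seconds, cpu_percent)
samples = [(1700000000, 41.0), (1700000020, 55.5), (1700000065, 73.2)]

per_minute = defaultdict(list)
for ts, value in samples:
    per_minute[ts // 60].append(value)  # bucket samples by minute

for minute, values in sorted(per_minute.items()):
    print(f"minute={minute} count={len(values)} avg_cpu={sum(values)/len(values):.1f}")
```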
- Traces:
  1. A single trace shows the activity for an individual transaction or request as it flows through an application.
  2. Traces are a critical part of observability, as they provide context for other telemetry. For example, traces can help define which metrics would be most valuable in a given situation, or which logs are relevant to a particular issue.
  3. Every operation performed upon the request is recorded as part of the trace. In a complex system, a single request may go through dozens of microservices. Each of these separate operations, or spans, contains crucial data that becomes a part of the trace.
  4. Traces are critical for identifying bottlenecks in systems or seeing where a process broke down.
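Here is a minimal, hand-rolled sketch of the trace/span structure described above: each span records an operation's name, timing, and its parent, all tied together by a shared trace ID. In practice a tracing library (e.g. OpenTelemetry) manages this for you; the code only shows the shape of the data.

```python
# Hand-rolled trace and span records to show the structure; real systems use a
# tracing library, but the shape (trace_id, span_id, parent, timing) is the same idea.
import time
import uuid

trace_id = uuid.uuid4().hex
spans = []

def record_span(name, parent_id, work):
    """Run `work(span_id)` and record a span describing it."""
    span_id = uuid.uuid4().hex[:16]
    start = time.monotonic()
    work(span_id)  # child spans created inside use this span_id as their parent
    spans.append({
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": parent_id,
        "name": name,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
    })

# One request flowing through two operations: the db.query span is nested
# inside the request-handler span, and both share the same trace_id.
record_span(
    "GET /checkout",
    None,
    lambda parent: record_span("db.query", parent, lambda _: time.sleep(0.005)),
)

for span in spans:
    print(span)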
- SLIs, SLOs, and SLAs:
  - These represent the promises we make to our users, the internal objectives that help us keep those promises, and the trackable measurements that tell us how we’re doing.
  - In other words, they are used to create agreed-upon measurements for service health, system reliability, and consumer or end-user experience.
  - The goal of all three is to get everybody, vendor and client alike, on the same page about system performance:
    - How often will your systems be available?
    - How quickly will your team respond if the system goes down?
    - What kind of promises are you making about speed and functionality?
  - SLI: a Service Level Indicator is measurable data such as latency, uptime, or error rate. SLIs measure compliance with an SLO.
  - SLA: a Service Level Agreement is an agreement between provider and client about measurable metrics like uptime, responsiveness, and responsibilities.
  - SLO: a Service Level Objective defines the target for SLIs, for example p99 latency < 1 s, 99.9% uptime, or < 1% errors. An SLO is an agreement within an SLA about a specific metric like uptime or response time. So, if the SLA is the formal agreement between you and your customer, SLOs are the individual promises you’re making to that customer.
- Error budget: leaving room for failures.
  1. An error budget not only protects the business from SLA violations (by providing a buffer) and their hefty consequences, it also leaves room for agility: room for the team to make changes quickly and to try innovative new solutions that might fail.
  2. Google actually recommends using leftover error budget for planned downtime, which can help you identify unforeseen issues (e.g. services using servers inappropriately) and maintain appropriate expectations with your clients.
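Tying these terms together, here is a small worked sketch in Python with made-up numbers: an availability SLI is measured over a 30-day window, compared against a 99.9% SLO, and the remaining error budget is computed.

```python
# Worked example: availability SLI vs. a 99.9% SLO and the resulting error budget.
# All numbers below are made up for illustration.

slo_target = 0.999              # SLO: 99.9% of requests succeed in a 30-day window
total_requests = 10_000_000     # measured over the window
failed_requests = 4_200

sli = 1 - failed_requests / total_requests             # measured availability SLI
allowed_failures = (1 - slo_target) * total_requests   # error budget in requests
budget_remaining = allowed_failures - failed_requests

print(f"SLI (availability): {sli:.5%}")
print(f"SLO target:         {slo_target:.1%}")
print(f"error budget:       {allowed_failures:.0f} failed requests allowed")
print(f"budget remaining:   {budget_remaining:.0f} "
      f"({budget_remaining / allowed_failures:.0%} of budget left)")
```

If budget_remaining goes negative, the SLO is breached; under a typical error-budget policy the team would slow down releases until the budget recovers.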
References:
- https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli
- https://www.observeinc.com/resources/10-lessons-for-observability-what-every-vp-of-engineering-needs-to-know/
- The complete guide to observability by Lightstep
- https://www.atlassian.com/incident-management/incident-response
- A Beginner’s Guide to Observability by Splunk
- Observability Engineering by Honeycomb (O’Reilly book)
- https://www.logicmonitor.com/blog/what-is-observability