Intro
If you're wary of this title, rightfully so. But fear not: this is not another fluffy article like "Five secret troubleshooting techniques that senior devs don't want you to know". It is a general framework for troubleshooting issues in software. The approach described here applies to most types of software, whether a single executable or an application composed of multiple services distributed across multiple machines.
An example issue
Because this guide is intended to be conceptual, and not focused on specific types of software, we will use a simple non-software example: a light switch. The product specification says that you have a light switch and a light.
Electricity can also be a complicated domain, so there are certain details which we will omit for simplicity.
You flip the switch, and somehow the light turns on
Easy enough.
You receive a bug report:
Today I flipped the switch and nothing happened
How to think about the issue
The first thing to do is to shine some light on the black box (pun intended). The black box here is not a literal box, but a metaphor for the components of a system that we either do not know about or do not fully understand.
Whenever I read a bug report or listen to an issue being described, I automatically run through this process in my head: "What are all of the components involved?"
Sometimes a diagram helps. It may be the case that such a diagram is available, but oftentimes it is not. When one is not available, sometimes I draw one out myself to help me wrap my head around the issue. If I'm not using paper, I generally use a free tool called diagrams.net.
Let's take an example of a very simple circuit that can power a light bulb.
We have some power source, a light switch, a light, and some wires. While this is still very simple, we have shed some light on the situation, and now have a good idea of what components are actually involved. There could be an issue with any of these components, or with the wires connecting them!
The red question marks in the image above illustrate each potential failure point in our product.
- Our light is broken
- Our switch is broken
- Our power source is broken
- The wire between our power source and switch is damaged
- The wire between the switch and light is damaged
- The wire between the light and power source is damaged
Our switch is a regular toggle switch. The bulb is a standard incandescent bulb, and the power source is a battery that is replaced periodically by a maintenance person.
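The list of failure points above follows mechanically from the diagram: every component is a candidate, and so is every connection between two components. Here is a minimal Python sketch of that enumeration. The names `COMPONENTS`, `WIRES`, and `failure_points` are made up for illustration and are not part of any real tool.

```python
# Hypothetical model of the example circuit; all names are illustrative.
COMPONENTS = ["power source", "switch", "light"]

# Each wire connects two components; together they form a closed loop.
WIRES = [
    ("power source", "switch"),
    ("switch", "light"),
    ("light", "power source"),
]


def failure_points(components, wires):
    """List every place where the system could be broken."""
    points = [f"{name} is broken" for name in components]
    points += [f"wire between {a} and {b} is damaged" for a, b in wires]
    return points


if __name__ == "__main__":
    for point in failure_points(COMPONENTS, WIRES):
        print("-", point)
```

The same idea carries over to software: list the services, then list every connection between them, and treat each entry as a place where the issue could live.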
The process
We will want to break down the issue, collect information, then attempt to isolate the issue. When isolating the issue, we want to first use information available to us to get as much intuition about the problem as we can, before starting the potentially time-consuming process of troubleshooting. In many cases, it is possible to narrow down the source of the problem substantially, without ever looking at code.
Breaking down the issue
It is helpful to consider each component (power source, switch, light) and each pair of components (each wire) in turn and ask yourself a few questions.
- What kinds of issues can the component experience?
- What would be true if the problem is X?
- What tools are available to verify retroactively whether the problem observed in the bug report corresponds with the problem being X? *
- What tools are available to verify this as I try to reproduce this? **
The results from considerations like this come in two forms. The first is a set of to-dos for yourself when you troubleshoot the issue. The second is a set of questions for the reporter of the issue.
* In a software scenario, these might be logs, alerts, or monitoring.
** In a software scenario, this might be a debugger.
Light
What kinds of issues can the component experience?
For the sake of our simple example, we will say that our light bulb is such that it either works, or the filament (the piece of metal in an incandescent bulb that heats up) has burnt out, and the light will never light again.
What would be true if the light is broken?
- The issue would be 100% reproducible. As of a certain time, the light never turns on when the switch is toggled.
- Other lights on the same circuit would continue to work.
- If connected directly to another power source, the light would not turn on.
- Even though there is a signal running to the light (electrical current), the light does not turn on.
What tools do I have to verify retroactively?
There might be other corroborating reports from the same time frame indicating that the issue has occurred. After the issue was first reported, there would not be a single case of someone seeing the light turn on.
What tools do I have to verify while troubleshooting?
A multimeter could be used to check if we can detect current running to and away from the light bulb. If there is current, then it is likely that the light bulb is the source of the issue. However, if there is no current, we cannot confirm that the light bulb is not part of the problem! Sometimes, we are unfortunate, and there are multiple problems all contributing to the reported behavior.
I can also investigate other lights on the same circuit to observe their behavior. If these lights are behaving correctly, then the wiring to this circuit is likely to be intact all the way through. This does not provide us a smoking gun, but it does increase the likelihood that this light bulb is the problem.
Finally, if I wish to isolate the light, I can remove it from the circuit and attach it directly to another power source, with wires that are known to be healthy, and see if the bulb turns on. Sometimes, in software, the analogous approach to this is not possible, but when it is, it can oftentimes be the most direct and definitive way to test.
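When isolation is possible in software, it often looks like running the suspect component against a dependency that is known to be good. The sketch below keeps the light bulb framing. The functions `light` and `known_good_source`, and the 1.5 volt value, are assumptions made up for this illustration.

```python
from typing import Callable


def light(read_voltage: Callable[[], float], filament_intact: bool) -> bool:
    """Return True if the bulb lights up with the given power source."""
    return filament_intact and read_voltage() > 0.0


def known_good_source() -> float:
    """Stand-in for a fresh battery on healthy wires (assumed 1.5 V)."""
    return 1.5


def bulb_works_in_isolation(filament_intact: bool) -> bool:
    """Bypass the switch and wiring: connect the bulb to the known-good source."""
    return light(known_good_source, filament_intact)
```

If the bulb still fails with a known-good source, the rest of the circuit is no longer needed to explain the failure. If it works, the bulb is cleared and attention moves elsewhere.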
Switch
What kinds of issues can the component experience?
Similar to the light, we will say that the switch either works or it doesn't.
What would be true?
- Issue would be 100% reproducible.
- If the switch were removed from the circuit, the light would turn on and remain lit as long as the power source is active.
- Other lights in the system that are not connected through a switch would continue to work.
- Any other lights connected to the switch would not work.
What tools do I have to verify retroactively?
Same as the light.
What tools do I have to verify while troubleshooting?
We can use the multimeter to check whether there is current running through the switch when it is in the ON position. We can also try to attach different bulbs to this wiring. The deduction process is the same as it was for the light.
See if you can think through the possible deductions we can make by implementing these techniques.
Power source
What kinds of issues can the component experience?
Our power source is a battery, so it can run out of stored energy. When it is low on energy, it can provide inconsistent levels of power.
What would be true?
Since we have two potential problems here with different traits, we should make sure to consider both of them.
- Issue would not be 100% reproducible, because the battery may have died and been replaced, or it may be low.
- If the battery is currently dead and has not been replaced, the issue would be 100% reproducible.
- Other lights on the circuit would experience the same issues at the same time.
- If the battery is currently low, we may see the light dimming periodically. Of the failure modes considered so far, this is the only one that would produce dimming.
What tools do I have to verify retroactively?
In addition to corroborating reports, I can check the maintenance logs to see if the battery has been changed recently. If the battery has been changed after the issue was reported, the issue may be resolved now, but I will still need to make sure.
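Checking a maintenance log against the time of the report is easy to do by hand, but it can also be sketched in a few lines. The example below assumes the log is simply a list of replacement dates. The function name and the dates are hypothetical.

```python
from datetime import date


def replacements_after(report_day: date, replacement_days: list[date]) -> list[date]:
    """Return the battery replacements logged after the issue was reported."""
    return sorted(day for day in replacement_days if day > report_day)


if __name__ == "__main__":
    # Hypothetical maintenance log.
    log = [date(2024, 1, 5), date(2024, 2, 5), date(2024, 3, 5)]
    later = replacements_after(date(2024, 2, 10), log)
    if later:
        print("Battery replaced since the report; the issue may already be resolved:", later)
    else:
        print("No replacement since the report; the battery is still a suspect.")
```

A non-empty result only says the issue may be resolved. It still needs to be confirmed by observing the light.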
What tools are available to verify while troubleshooting?
If I do not find maintenance logs, I can test the battery directly, using a multimeter. If it is just as easy for me to verify directly, I will prefer to do this even before checking the logs. It is of course possible that even a new battery is defective or was drained prematurely by a short, so simply knowing that it was replaced does not prove the issue is resolved.
Wiring
So far, we have identified 3 wires. Based on what we have outlined in the system so far, the behavior would be the same for each of them.
What kinds of problems can the wires have?
- The connection between the wire and either of the components could have come loose completely.
- The wire itself could be frayed, causing a faulty connection, or the connection could be poor, but not completely loose.
What would be true?
Since we have two potential problems here with different traits, we should make sure to consider both of them.
- If the connection has come loose, the issue would be 100% reproducible.
- If the connection has come loose, we would observe no current in the circuit.
- If the wire is frayed, or the connection is poor, we would observe the light sometimes turning on when the switch is turned on.
- If the wire is frayed, or the connection is poor, we would observe flickering.
What tools do I have to verify retroactively?
Corroborating reports.
What tools do I have to verify while troubleshooting?
We are able to visually inspect each connection. Some of the wires may be hidden, so we may or may not be able to inspect them. The multimeter, placed at the ends of each wire, can help to observe the flow of current and check for voltage fluctuations.
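The two wiring problems leave different traces in a series of multimeter readings: a loose connection reads as nothing at all, while a frayed wire or poor connection shows up as fluctuation. A rough sketch of that distinction follows. The nominal voltage and the tolerance are assumed values for illustration, not measured ones.

```python
def classify_wire(readings: list[float], nominal: float = 1.5, tolerance: float = 0.1) -> str:
    """Classify a wire from voltage readings taken at its far end.

    `nominal` and `tolerance` are illustrative assumptions.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    if all(value == 0 for value in readings):
        # Nothing arrives at all: the connection has likely come loose.
        return "disconnected"
    if max(readings) - min(readings) > tolerance * nominal:
        # Large swings suggest a frayed wire or a poor connection.
        return "intermittent"
    return "steady"
```

For example, `classify_wire([1.5, 0.2, 1.4])` reports an intermittent connection, which matches the flickering we would expect to see at the light.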
Gathering information
Although the initial bug report was very vague, we were able to develop some intuition around the issue by breaking it down and formulating possible questions to ask, either of the reporter of the bug, or of the support team. Of course, some of what we have learned so far may call for some investigation.
In the case of our example, the investigation might be simply trying to reproduce the issue and observing the behavior. Do we see the light misbehaving? Is it flickering when it does so?
In the case of software, the investigation might involve looking at some system logs.
This step is important in saving our time down the line, so that we pursue only the most relevant leads. For instance, if the light turns on at least some of the time, then we know it's not the light. If there is another light bulb powered by the same switch, and it works correctly 100% of the time, then it's not the switch, and likely not the power source.
Perhaps we are able to look at maintenance logs and see that the battery has been changed once a month, every month for a year, but then has not been changed for the last 3 months. In that case, it may not be necessary to spend a lot of time digging into the issue: a battery that normally lasts about a month is unlikely to have lasted three, so a dead battery is the most likely cause.
We continue this process and exclude possible causes as we go, to narrow our search. Sometimes, only one possibility remains.
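The eliminations described in this section can be written down as simple rules over observations. The sketch below encodes only the two deductions mentioned above. The function name, the parameter names, and the status labels are made up for illustration.

```python
def narrow_down(light_sometimes_works: bool, sibling_light_always_works: bool) -> dict[str, str]:
    """Update the status of each candidate cause from two observations.

    `sibling_light_always_works` refers to another bulb powered through
    the same switch.
    """
    status = {name: "possible" for name in ("light", "switch", "power source", "wiring")}
    if light_sometimes_works:
        # In our model a burnt-out filament never recovers.
        status["light"] = "ruled out"
    if sibling_light_always_works:
        # The shared switch works, and the shared power source most likely does too.
        status["switch"] = "ruled out"
        status["power source"] = "unlikely"
    return status
```

With both observations true, the wiring is the only candidate still marked as possible, which is where the hands-on investigation would start.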
Troubleshooting
On the other hand, if we have gathered all of the information we could, and made all of the deductions we can make, but still do not have a clear cause, then we must troubleshoot.
This is the part where we revisit the information we noted down while breaking down the problem and start to evaluate it. We can go through each of the "What would be true?" statements above, and apply our tools to see if we can eliminate further.
For instance, we have received no information about any other lights. Do any even exist? Perhaps the entire circuit actually resembles the diagram below.
Whether or not some of these lights function correctly may tell us where the problem is. For instance, if the light labeled "1" turns on, then we know the issue is not with our switch, and not with the wiring, because a faulty wire would cause all lights to malfunction.
Additional tips and thoughts
Reproducibility
In most cases, it is worth establishing right away whether the issue is reproducible 100% of the time, or only sometimes. If it is only reproducible sometimes, it is also good to establish any characteristics of when it is reproducible. For example: "Each morning", "Once every hour or so", "Exactly every third time".
This information is a great heuristic to keep in mind even as you build out your mental model of the problem and the system, and can be a good way to start brainstorming.
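When an issue can be triggered on demand, its reproducibility can be measured rather than guessed. Here is a minimal sketch, assuming a hypothetical callable `attempt` that returns True whenever the issue was observed on that run.

```python
import itertools
from typing import Callable


def reproduce(attempt: Callable[[], bool], trials: int = 20) -> str:
    """Run `attempt` repeatedly and summarize how often the issue shows up."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    hits = sum(1 for _ in range(trials) if attempt())
    if hits == trials:
        return f"always ({hits}/{trials})"
    if hits == 0:
        # Failing to reproduce is not proof that the issue is gone.
        return f"not reproduced (0/{trials})"
    return f"intermittent ({hits}/{trials})"


if __name__ == "__main__":
    # Simulated issue that appears exactly every third time.
    run_number = itertools.count(1)
    print(reproduce(lambda: next(run_number) % 3 == 0, trials=30))
    # prints "intermittent (10/30)"
```

A ratio like 10/30 is itself a clue: a clean one-in-three pattern suggests something periodic, while an irregular ratio points towards timing or environmental factors.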
Deductive Reasoning
The process outlined here is effectively a combination of system knowledge, intuition, and deductive reasoning. As you practice, you will strengthen all three of these skills. System knowledge and intuition will be strengthened naturally through practice, but you may need to pay extra attention to your deductive reasoning skills.
Most Computer Science educations cover some form of First-Order Logic. If you haven't been exposed to this, or if it has been a long time, then it may be worthwhile to get acquainted with the concepts. It is not usually necessary to practice this formally, but keeping it in mind as you go through the process will help to make sure you do not miss things.
This is a large topic, and I will not attempt to teach it in this article, but will rather leave you, dear reader, with this riddle:
If some doctors are men, and some men are tall, does it follow that some doctors are tall?
I'm looking forward to hearing your answers in the comments, as well as your reasoning! Please let me know if you like this content.
Happy troubleshooting!