calvinsadewa

Case Study: Reducing the Toil of Resolving Issues in Node.js - Introduction

Hi there! Welcome to the first part of a four-part series on how we changed the way we address and resolve issues in our Node.js projects at Kargo, and how that dramatically reduced the time and effort each issue takes.

In this first part, we look at the core problems that confronted us and outline the solutions we proposed.

Background

Some time ago, the Tech team at Kargo embarked on an interesting project: building a new Node.js backend service to support a feature that required specific integration with a third-party platform. This presented a unique challenge for Kargo's engineers, as a Node.js backend was a new addition to our technology stack, which was usually based on Elixir or Go. Despite the initial hardships stemming from our unfamiliarity with this tech stack, we persevered and successfully delivered the required functionality. The project not only met but exceeded our success metrics, leading to a well-deserved celebration among the team.

However, as time progressed, the project's scope expanded, and operational issues and bugs increased along with it. The time required to resolve these issues also grew significantly. On average, each problem took the team approximately four hours to address, and around three issues arose per week, which adds up to roughly 12 hours of resolution work every week. This not only cost considerable time but also began to hurt team productivity. More importantly, the user experience and the product's overall success suffered, as frequent issues and extended downtime became more common.

Diagnosis

After some investigation, we found that the Node.js project lacked the observability and debugging tools that we have in Kargo's usual backend services (Elixir/Go). This resulted in increased effort across the maintenance lifecycle (Detect, Diagnose, Debug). To illustrate, here's a hypothetical example of how a particular issue was resolved:

  1. Issue detection: The team receives a report from a user that the feature is not working as expected. An engineer is assigned to investigate the issue. There is no dashboard for the project, so the engineer usually needs to look at the relevant code for the impacted feature, then look at log files, run custom queries against the database, and take other ad-hoc actions to confirm the issue.
  2. Issue diagnosis: To determine the most likely cause of the issue, the engineer needs to be familiar with the codebase involved. Occasionally they are lucky and there's a log entry explicit enough to point out the issue. But most of the time, they need to add more logging to the codebase and then redeploy it to get more information.
  3. Issue debugging: Once the engineer has some idea of what the issue is, though usually without being 100% certain, they implement a probable fix and redeploy the code through CI/CD. If the issue still persists, they need to repeat the process. It usually takes 3 to 4 retries before the issue is resolved.

Let's fix this

Solution

We brainstormed how to improve the situation, also looking at which tools and techniques from our Elixir/Go backend services might help. We came up with the following solutions, which we think will help address the problems in each stage (Detect, Diagnose, Debug):

  1. Monitoring Dashboard + Canonical Log: By implementing an action-oriented dashboard, we've created a quick, centralized access point for vital aspects of the project, significantly speeding up our ability to detect and validate issues. This dashboard, built on a Canonical Log record approach, offers the flexibility needed to perform custom advanced analyses with ease. Our aim with this system is to verify common issues within 5 minutes of engineering effort. For example, we reduced the time to check for a frequent session connection issue from 20 minutes to just 1 minute.
  2. On-Demand Diagnostic Logging: We developed a feature for dynamic, detailed logging of operations, enabling us to drill down into issues as they happen. This was particularly effective in a recent incident where the typical logs were inconclusive. By enabling on-demand logging, we pinpointed the problem within minutes, a process that could otherwise have taken multiple redeployments.
  3. Developer Code Execution: Just as Elixir's remote shell capability helps Kargo's engineers rapidly diagnose and debug issues in our backend, we wanted a similar capability in the Node.js project.
  4. Operational Handbook: We compiled a comprehensive handbook detailing common issues and their solutions, which serves as a first reference point for our engineers. One memorable success story involved a team member who was relatively new to the project and was able to resolve an issue independently by following the handbook while the usual maintainer was taking a day off. This not only makes issue resolution more efficient but also gives engineers confidence.

Screenshot of Grafana dashboard
An action-oriented dashboard helps engineers quickly understand and troubleshoot an issue and decide on the fix
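
To make the Canonical Log idea from item 1 more concrete, here is a minimal sketch, assuming the approach means emitting one structured record per request that carries all the context needed for later queries. The helper name `createCanonicalLog` and every field name (`request_id`, `route`, `session_connected`, and so on) are hypothetical and chosen only for illustration; this is not the implementation used in the project.

```javascript
// Hypothetical sketch: one canonical log record per request.
// All names here are illustrative assumptions, not the project's real code.
function createCanonicalLog(base, write = (line) => console.log(line)) {
  const fields = { ...base };
  const startedAt = Date.now();

  return {
    // Attach more context as the request is being handled.
    set(key, value) {
      fields[key] = value;
    },
    // Emit exactly one structured JSON line when the request finishes.
    emit(outcome) {
      const record = {
        ...fields,
        outcome,
        duration_ms: Date.now() - startedAt,
        timestamp: new Date().toISOString(),
      };
      write(JSON.stringify(record));
      return record;
    },
  };
}

// Usage: collect context while handling a request, then emit once at the end.
const log = createCanonicalLog({ request_id: "req-123", route: "/session/connect" });
log.set("user_id", "user-42");
log.set("session_connected", false);
const record = log.emit("error");
```

Because every record has the same flat shape, a dashboard panel or an ad-hoc query can filter and aggregate on any field (for example, counting records where `session_connected` is false) without parsing free-form log messages.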

Ideal Outcome

Our target is to reduce the time needed to resolve an issue from 4 hours to 30 minutes. This would reduce the impact of issues on user experience and product success, and help the team focus on delivering more value to users.

An ideal scenario of how an issue is resolved in the Node.js project looks like this:

  1. Issue detection
    For the 50% of issues that are common, the issue would be automatically detected and alerted on from the metric dashboard. The engineer could easily validate that the issue exists by looking at the metric dashboard, then resolve it by following the operational handbook.
    For the remaining 50% of issues, the engineer could run an analysis based on the dashboard plus extended queries on the canonical log to confirm the issue and its scope. If the issue is persistent, the engineer could easily create a new metric dashboard for it from the canonical log data.

  2. Issue diagnosis
    When there's an unknown issue, the engineer could turn on diagnostic logging for the user they are interested in (based on user ID) and get the relevant debug logs from the WhatsApp bot server to form a reasonable guess about where the issue is. The engineer could then use the developer remote execution capability to validate whether the guess is correct.

  3. Issue debugging
    Once the root cause is identified, the engineer could implement a fix and test it first using developer remote execution. Once the fix is confirmed, the engineer could deploy it through CI/CD.
    Because the root cause is now well identified, the engineer could resolve the issue in just one try.
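
The per-user diagnostic logging described in step 2 could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's actual mechanism: the function names (`enableDiagnostics`, `isDiagnosticsEnabled`, `debugLog`), the in-memory `Map`, and the 15-minute expiry are all hypothetical. A real service running several instances would likely keep the flag in a shared store instead of process memory.

```javascript
// Hypothetical sketch of on-demand diagnostic logging keyed by user ID.
// Names, storage, and the default TTL are assumptions for illustration only.
const diagnosticUsers = new Map(); // userId -> expiry timestamp in ms

// Turn on debug logging for one user; it switches itself off after ttlMs.
function enableDiagnostics(userId, ttlMs = 15 * 60 * 1000, now = Date.now()) {
  diagnosticUsers.set(userId, now + ttlMs);
}

// Check whether debug logging is currently active for this user.
function isDiagnosticsEnabled(userId, now = Date.now()) {
  const expiresAt = diagnosticUsers.get(userId);
  if (expiresAt === undefined) return false;
  if (now >= expiresAt) {
    diagnosticUsers.delete(userId); // expired: clean up the flag
    return false;
  }
  return true;
}

// Write a debug line only when diagnostics are enabled for the user.
function debugLog(userId, message, details = {}, write = console.log) {
  if (!isDiagnosticsEnabled(userId)) return false;
  write(JSON.stringify({ level: "debug", user_id: userId, message, ...details }));
  return true;
}

// Usage: only the user under investigation produces debug output.
const lines = [];
const capture = (line) => lines.push(line);
enableDiagnostics("user-42");
debugLog("user-42", "received message", { step: "parse" }, capture); // written
debugLog("user-7", "received message", { step: "parse" }, capture); // skipped
```

The expiry is a design choice in this sketch: it keeps verbose logging from staying on by accident after the investigation ends, and toggling the flag at runtime avoids the add-logs-and-redeploy loop described in the Diagnosis section.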

Conclusion

That's the background of the problem we were facing. The next write-up will focus on how we implemented each of the solutions we proposed.

What do you do to make sure your team can resolve issues quickly? Share your experience in the comments below!

Stay tuned for the next part.
