In this short development diary, I’m going to share why and how I came up with the “Performance Impact Score”: what it is, why it’s needed, and how it was implemented and added to our Continuous Feedback (CF) platform.
Early Optimization is the Root of All Evil?
A few years ago I was working on an online service to support local businesses. The domain was QR codes, promotions, and mobile presence. We must have been doing something right, as the service gained momentum and I partnered with a large distributor that used my services in far-away places, supposedly sparking even more interest from other businesses.
Our product was meant to serve a few local customers, and the code was not designed for scale. As many startups and young products do, we made a conscious decision to focus on rapid development and did not bother optimizing the code. My thinking was that premature optimization would waste too much time and delay the service launch.
Reality was quick to catch up with us. Soon, customers began complaining. Pages were loading too slowly, operations were taking too long or even timing out. User experience suffered, and the business was impacted. Struggling to keep up with usage and demand, we seemed to discover a deep truth about optimization: if you’re not early - you’re late. We quickly moved from ‘premature optimization’ concerns to overdue optimization. We were in catch-up mode.
Undaunted, we rolled up our sleeves and started making the changes we believed were necessary.
Optimizing the code where it counts the most
The thing is, after two weeks of optimizing I had not managed to make an impactful change. We were very successful at refactoring and, if you count commits, made plenty of code changes and improvements. It’s just that their aggregated effect did not do much to move the needle. In retrospect, I know I was optimizing the wrong parts of the code.
Eventually, when I did end up fixing something that really counted - a serious bottleneck in the service - it was almost by chance. I did not expect this specific method to account for so many of the performance issues we were facing. Looking at metrics alone, it definitely wasn’t on any top 10 or even top 50 list. There were slower methods to optimize, critical sections causing delays, long synchronous operations, and other areas of the code that on the surface seemed like much better candidates to focus on.
I recalled this experience as we were thinking about the concept of “performance impact” as an added layer of information for Digma. After all, getting data about code performance and metrics is easy, but we were seeking ways to make optimization more guided and less random. In other words: analyzing the runtime data to help developers identify the pieces of code that most affect the system’s ability to scale.
Measuring Performance Impact
It is easy to make the mistake of focusing only on duration percentiles - for example, the median, P95, or P99. However, looking at performance data alone can create a misleading prioritization. Areas of the code that are seldom triggered, or that affect very little else in the application, might present themselves as major bottlenecks. Optimizing them will, in fact, have very little actual effect.
To avoid falling into that trap, we decided to weigh the importance of each piece of code by when it runs and how often it’s invoked, instead of just how long it takes to run. We understand that code that runs once, at the beginning of a process, might not be as impactful as functions used in user interactions and API calls. We also know that I/O operations, rendering operations, and database queries may require special attention.
We found the best way to do this is by applying exponential smoothing to the duration of each span or activity segment. The formula looks something like this:
s_t = α · T_t + (1 − α) · s_{t−1}
Here s_t is the score after time t, T_t is the duration (total runtime) of the span, and α, with 0 ≤ α ≤ 1, is a tuning parameter: values close to 1 give more importance to the new value over the accumulated score, and vice versa for values close to 0. s_0 is simply the span duration the first time we encounter it.
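To make the update rule concrete, here is a minimal Python sketch of this smoothing step. The class and names are illustrative assumptions on my part, not Digma’s actual implementation:

```python
# A minimal sketch of the exponential smoothing update described above.
# Names and structure are illustrative, not Digma's actual implementation.

class PerformanceImpactScore:
    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha    # weight given to the newest duration
        self.score = None     # s_0 is set on the first observation

    def update(self, duration_ms: float) -> float:
        """Fold the latest span duration T_t into the running score s_t."""
        if self.score is None:
            self.score = duration_ms  # s_0 = duration of the first occurrence
        else:
            self.score = self.alpha * duration_ms + (1 - self.alpha) * self.score
        return self.score
```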
There are several nuances to the measured quantities.
Firstly, we debated how to measure time itself! Two principal paradigms arose. The first option would be to use real time (seconds, milliseconds, and so on). This way, if a span is invoked again after one minute, its previous score would decay more than if it were invoked again after one second. The second option is to have time “tick” with every trace, meaning every time a piece of code is invoked externally. This way the score is independent of the scale of endpoint usage.
We chose the latter. It is the better choice for us since we deploy the same algorithm for all of our clients; had we chosen the former, we would have had to tune an opaque formula parameter per service.
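To make the difference between the two options concrete, here is a hedged sketch of how each could look in code. The real-time decay function is my own assumption for illustration; only the per-trace variant corresponds directly to the formula above:

```python
ALPHA = 0.95  # the tuning parameter from the formula above

def update_real_time(prev_score: float, duration: float, seconds_since_last: float) -> float:
    """Option 1 (illustrative): the old score fades with elapsed wall-clock time."""
    retention = (1 - ALPHA) ** seconds_since_last   # more elapsed time, more decay
    return (1 - retention) * duration + retention * prev_score

def update_per_trace(prev_score: float, duration: float) -> float:
    """Option 2 (the one we chose): every trace is one tick, so the decay per update is fixed."""
    return ALPHA * duration + (1 - ALPHA) * prev_score
```

With the per-trace variant, ALPHA means the same thing whether an endpoint handles ten requests a day or ten thousand, which is what lets the same parameters work for every client.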
Another important decision was how to measure duration. We can treat a span’s duration as the total time between when it’s opened and when it’s closed, or we can subtract from that any time spent in its child spans (its exclusive, or “self”, duration). Each has its own implications; for example, using total duration would lean towards giving a higher Performance Impact Score (PIS) to parent spans.
After a process of trial and error, we decided on a hybrid approach: we use the exclusive duration by default, and the total duration for client, endpoint, and consumer spans, as they may hide some additional information.
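Here is a rough sketch of that hybrid rule, using a simplified span model. The kind names loosely mirror OpenTelemetry conventions (endpoint spans are typically SERVER spans), but the helper itself is my own illustration, not Digma’s code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    kind: str                      # e.g. "internal", "client", "server", "consumer"
    start: float                   # timestamps in milliseconds
    end: float
    children: List["Span"] = field(default_factory=list)

    @property
    def total_duration(self) -> float:
        return self.end - self.start

    @property
    def exclusive_duration(self) -> float:
        # Total time minus time spent in direct children ("self" duration).
        # Simplified: assumes children run sequentially within the parent.
        return self.total_duration - sum(c.total_duration for c in self.children)

def duration_for_scoring(span: Span) -> float:
    """Hybrid rule: total duration for client, endpoint (server), and consumer spans;
    exclusive duration for everything else."""
    if span.kind in ("client", "server", "consumer"):
        return span.total_duration
    return span.exclusive_duration
```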
Check out this graph to see how the PIS develops over time, using α = 0.95:
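If you’d rather get a feel for that behavior in code, here is a tiny, purely illustrative simulation; the input durations are made up:

```python
# Purely illustrative: how the score reacts to a burst of slow invocations.
ALPHA = 0.95
durations = [10, 12, 11, 250, 240, 260, 12, 11, 10]   # made-up span durations (ms)

score = durations[0]                                   # s_0 = first observed duration
for tick, duration in enumerate(durations[1:], start=1):
    score = ALPHA * duration + (1 - ALPHA) * score
    print(f"tick {tick}: duration={duration}ms, score={score:.1f}")
```

With α this close to 1, the score reacts to a slowdown almost immediately and falls back just as quickly once durations return to normal.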
How measuring insights affected our own scaling
Recently, we noticed that one of our services was underperforming and set out to find the cause. The initial strategy of sorting by the slowest 5% (P95) or median duration revealed findings that were significant, as expected, but fixing them did not resolve the overall slowness we had been experiencing.
For example, here are the top three slowest consumers by P95 duration, as well as the top three slowest queries:
After optimizing these areas, we saw very little improvement in overall performance, resource consumption, or the throughput of our consumers.
This was a chance to test the effectiveness of our new PIS. By matching usage with slowness, we wanted to pinpoint specific key areas that would be worth the effort of optimization. We took a look at the code assets again, this time sorting by our new ‘Performance Impact’ measurement. The results were quite clear in revealing a likely candidate.
Double-clicking on that candidate revealed two clear sources of the problem; both seemed to have suffered a performance degradation in a major release about three weeks earlier. I won’t include the entire investigation here, but that very query also came up extremely high when sorting queries by performance impact. BINGO.
I was reminded of a recent Devoxx presentation by Mario Fusco and Francesco Nigro, in which they referred to the above-mentioned Donald Knuth quote about premature optimization and pointed out that there is a second part of that quote that is often ignored. Yes, Knuth writes, focusing on the other 97% of inefficiencies would be premature optimization; however, he goes on to say: “we should not pass up our opportunities in that critical 3%”.
Indeed, we were able to uncover our own critical 3% and get much more out of our optimization efforts.
Using Digma to identify your own application’s critical 3%
I’d love to hear whether this algorithm also produces good results for you! If you already have Digma installed, you’re ready to go. If not, please head over to the JetBrains Marketplace to quickly download it. You can check out this video to get started. Digma is a Continuous Feedback tool: it constantly monitors your code at runtime to collect data on its behavior. In this case, it will collect data automatically using OpenTelemetry.
Once installed, all you have to do is go to the Assets tab in the Digma plugin, select the category you want to investigate (e.g. Endpoints or Database queries), and sort the entries by “Overall impact” or “Performance impact”.
Let me know your thoughts and results!