## Table of Contents
- Introduction
- Background
- Objectives
- Implementation
- Evaluation and Results: GPT4o vs LLAMA 3.1
- Conclusion
- Future Work
## Introduction
Middleware is a platform that helps engineering leaders derive actionable insights from their teams' data and improve processes, making dev teams more efficient. With the fast pace of progress in AI, we have been continuously looking to integrate ML models across the product to surface those insights automatically.
We spent some time evaluating models and found that the open-source LLAMA and Mistral models we wanted to use were good, but GPT4o was more reliable for data-centric problems. So we decided to move in the more sophisticated direction of building RAG pipelines and using function calling.
All this changed when Meta dropped the LLAMA 3.1 models. The 70B and 405B models are among the best open-source models out there and compete neck and neck with GPT4o. So we decided to integrate AI-powered DORA reports as an experimental effort and see how GPT4o and LLAMA 3.1 perform when it comes to data analysis and reasoning.
## Background
DORA metrics provide critical insights into the performance and reliability of software delivery processes.
1) Lead Time for Changes
- Lead time consists of First Commit to PR Open time, First Response Time, Rework Time, Merge Time, and Merge to Deploy Time.
2) Deployment Frequency
- This metric gauges how frequently code changes are deployed to production.
3) Mean Time to Recover (MTTR)
- MTTR measures how swiftly a team can restore service after a failure occurs in production.
- The team's average incident resolution time is used to compute its MTTR.
4) Change Failure Rate (CFR)
- CFR quantifies the percentage of changes that result in a service impairment or outage in production, aiding in the evaluation of deployment process stability and reliability.
- CFR is computed by linking incidents to deployments within an interval; each deployment may have several or no incidents. (A rough computation sketch covering all four metrics follows this list.)
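For illustration, here is a minimal sketch of how the four metrics could be computed from synced delivery data. The field names and formulas are simplified assumptions, not Middleware's actual implementation:

```python
from datetime import timedelta

def four_keys(pr_lead_times_sec, deployments, incidents):
    # Lead Time for Changes: mean per-PR lead time (first commit -> deploy), in seconds
    lead_time = sum(pr_lead_times_sec) / max(len(pr_lead_times_sec), 1)

    # Deployment Frequency: deployments per week over the analysed window
    weeks = {d["deployed_at"].isocalendar()[:2] for d in deployments}
    deployment_frequency = len(deployments) / max(len(weeks), 1)

    # MTTR: mean incident resolution time
    mttr = sum(
        (i["resolved_at"] - i["created_at"] for i in incidents), timedelta()
    ) / max(len(incidents), 1)

    # CFR: share of deployments linked to at least one incident
    failed = {i["deployment_id"] for i in incidents if i.get("deployment_id")}
    change_failure_rate = 100 * len(failed) / max(len(deployments), 1)

    return lead_time, deployment_frequency, mttr, change_failure_rate
```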
You can learn more about DORA metrics from here. By leveraging advanced LLMs, we aim to automate the analysis of these metrics, providing teams with deeper and more actionable insights.
## Objectives
- To integrate LLMs into Middleware for the analysis of DORA metrics.
- To compare the performance of different large language models in terms of:
  - Mathematical Accuracy: How well can the model calculate the DORA score?
  - Data Analysis: Can the LLM analyse the input data and derive correct inferences?
  - Summarising: How well can the model summarise data?
  - Actionability: How well can the models suggest an action plan based on the input data?
## Implementation
### Data Processing: Middleware to the Rescue
- Middleware syncs all your data from different sources and calculates the DORA Metrics for your teams.
- Check out middlewarehq/middleware and set up the dev server using Docker.
### Model Integration: Fireworks AI and OpenAI
- We integrated OpenAI GPT4o and LLAMA 3.1 (70B and 405B) models.
- The OpenAI models use the official OpenAI API under the hood, while the Fireworks AI API is used to integrate the 70B and 405B LLAMA 3.1 models.
These AI analytics are powered by the AIAnalyticsService in the analytics server. This service can be extended to use more closed-source models from OpenAI or open-source models via Fireworks AI.
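Since Fireworks AI exposes an OpenAI-compatible endpoint, one way such a service could route between providers looks roughly like the sketch below. The class name, method, and Fireworks model id are illustrative assumptions, not Middleware's actual AIAnalyticsService code:

```python
from openai import OpenAI

class DoraAIAnalytics:
    """Illustrative provider wrapper; not Middleware's actual implementation."""

    def __init__(self, provider: str, api_key: str):
        if provider == "openai":
            self.client = OpenAI(api_key=api_key)
            self.model = "gpt-4o"
        else:
            # Fireworks AI serves an OpenAI-compatible API; the model id below is an assumption
            self.client = OpenAI(
                api_key=api_key, base_url="https://api.fireworks.ai/inference/v1"
            )
            self.model = "accounts/fireworks/models/llama-v3p1-405b-instruct"

    def analyse(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```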
Changes on the front end introduce components and BFF logic that allow users to enter their token, choose a large language model, and generate AI reports for their DORA metrics.
Whenever the user generates an AI analysis, the UI makes a POST request to the BFF API `internal/ai/dora_metrics` with all the preprocessed DORA metrics and trends data. This BFF API internally calls multiple analytics APIs with the DORA metrics and trends data, which in turn generate the analysis based on the processed data and the curated prompts.
Finally, the analysis for each individual metric trend is fed back into the LLM for summarisation, and all the data is sent to the front end.
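For a rough idea of the request shape, a call to the BFF route might look like the sketch below; the base URL, model identifier, and field names are placeholders rather than Middleware's exact schema:

```python
import requests

BFF_BASE_URL = "http://localhost:3333"  # placeholder; use your dev server's address

payload = {
    "model": "LLAMA_405B",                 # hypothetical identifier; could also be a GPT4o option
    "access_token": "<provider-api-key>",  # token entered by the user in the UI
    "dora_metrics": {
        "lead_time": 4000,
        "mean_time_to_recovery": 200000,
        "change_failure_rate": 20,
        "weekly_deployment_frequency": 2,
    },
    # weekly trend data keyed by week start dates (trimmed here)
    "trends": {"lead_time": {"2024-01-01": {}, "2024-01-08": {}}},
}

response = requests.post(f"{BFF_BASE_URL}/internal/ai/dora_metrics", json=payload)
print(response.json())  # per-metric analyses plus the overall summary
```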
More implementation details can be found in this pull request.
## Evaluation and Results: GPT4o vs LLAMA 3.1
We ran the DORA AI analysis for July on the following open-source repositories: facebook/react, middlewarehq/middleware, meta-llama/llama and facebookresearch/dora.
### Mathematical Accuracy
- Middleware generated a DORA Performance Score for the team based on this guide by dora.dev.
- To test the computational accuracy of each model, we provided it with the four key metrics and prompted the LLM to generate a DORA score, then compared the results with Middleware's own score (a simplified prompt sketch appears at the end of this subsection).
- The four keys were passed as a JSON object of the following format:
```json
{
  "lead_time": 4000,
  "mean_time_to_recovery": 200000,
  "change_failure_rate": 20,
  "weekly_deployment_frequency": 2
}
```
- The actual DORA score for these repositories was around 5. While OpenAI's GPT4o predicted a score of 4-5 most of the time, LLAMA 3.1 405B was often a margin away.
_DORA Metrics score: 5/10_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/saepp6t4su3j86fm1g3j.png)
_GPT 4o with DORA score 5/10_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9u7nln407p0rhhqkag71.png)
_LLAMA 3.1 with DORA Score 8/10 (incorrect)_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwpladuhj66ij2s5j1l7.png)
GPT4o's DORA score was closer to the actual DORA score than LLAMA 3.1's in 9 out of 10 cases, making GPT4o the more accurate model in this scenario.
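For reference, the scoring prompt we experimented with looked roughly like the sketch below; the exact curated prompt in the analytics server is worded differently:

```python
import json

four_keys = {
    "lead_time": 4000,
    "mean_time_to_recovery": 200000,
    "change_failure_rate": 20,
    "weekly_deployment_frequency": 2,
}

# Simplified scoring prompt; the wording here is illustrative
prompt = (
    "You are analysing a software team's DORA metrics. "
    "Lead time and MTTR are in seconds, CFR is a percentage.\n"
    f"Metrics: {json.dumps(four_keys)}\n"
    "Score the team's DORA performance from 1 to 10 and briefly justify the score."
)
```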
### Data Analysis
- The trend data for the four key DORA metrics, calculated by Middleware, was fed to the LLMs as input along with different experimental prompts to ensure a concrete data analysis.
- The trend data is usually a JSON object with date strings as keys, each representing a week's start date mapped to that week's metric data:
```json
{
  "2024-01-01": {
    ...
  },
  "2024-01-08": {
    ...
  }
}
```
- *Mapping Data*: Both models were on par at extracting data from the JSON and interpreting it correctly. Example: both GPT and LLAMA were able to map the correct data to the input weeks without errors or hallucinations.
_Deployment Trends Summarised: GPT4o_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey4jlh2o1nk5xkvg4tt0.png)
_Deployment Trends Summarised: LLAMA 3.1 405B_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ziiymc6tl0l360uam8hs.png)
- **Extracting Inferences**: Both the models were able to derive solid inferences from data.
- LLAMA 3.1 identified the week with the maximum lead time along with the reason for the high lead time.![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evww5o0tg6bu4m941z6h.png)
- This inference could be verified by the Middleware Trend Charts.![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lmu39pip49f0brsbd0ti.png)
- GPT4o was also able to extract the week with the maximum lead time along with the reason, which was high first-response time.![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74eepbadfzk24z0i5i80.png)
- **Data Presentation**: Data presentation has been hit or miss with LLMs. There are cases where GPT presents data better but lags behind LLAMA 3.1 in accuracy, and there are cases, like the DORA score, where GPT did the math better.
- LLAMA and GPT were both given the lead time value in seconds. LLAMA rounded the value closer to the actual 16.99 days, while GPT rounded it to 17 days 2 hours but presented the data in a more detailed format (see the quick arithmetic check below).
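A quick arithmetic check of the rounding (the input below is back-computed from 16.99 days and is only illustrative):

```python
seconds = 1_467_936  # roughly 16.99 days; illustrative, not the exact payload value
days, remainder = divmod(seconds, 86_400)  # 86,400 seconds in a day
hours = remainder // 3_600
print(f"{seconds / 86_400:.2f} days ≈ {days} days {hours} hours")
# prints: 16.99 days ≈ 16 days 23 hours
```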
_GPT4o_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3dpwmlcscgehi47zlx0c.png)
_LLAMA 3.1 405B_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c3owjv94wjfrtxetf91a.png)
### Actionability
<img width="100%" style="width:100%" src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExZXFmcmM2cno2c3liN3doeXJ6Z282NmxrZDN0ZGd3c2xta2RwOXp5eCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/jsrfOEfEHkHPFSNlir/giphy.gif">
- The models were able to output similar actionables for improving teams' efficiency based on all the metrics.
- Example: Both models identified the reason for high lead time to be first-response time and suggested the team use an alerting tool to avoid delayed PR reviews. The models also suggested better planning to avoid rework in weeks where rework was high.
_GPT4o_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tq9uaapz50z3dsom7jhd.png)
_LLAMA 3.1 405B_![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gbw7ovecvj3rc6fhykz3.png)
### Summarisation
To test out the summarisation capabilities of the models, we asked them to summarise each metric trend individually and then fed the outputs for all the trends back into the LLMs to get a single summary, or in Internet slang, a *DORA TLDR* for the team.
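In rough pseudocode, the two-pass flow looks like the sketch below; the `llm` callable stands in for whichever model client is configured, so this is an illustration rather than the shipped implementation:

```python
def dora_tldr(trends: dict, llm) -> str:
    # Pass 1: summarise each metric's weekly trend on its own
    per_metric = {
        metric: llm(f"Summarise this weekly trend for {metric}: {data}")
        for metric, data in trends.items()
    }
    # Pass 2: feed the per-metric summaries back in for the overall "DORA TLDR"
    notes = "\n".join(f"{metric}: {summary}" for metric, summary in per_metric.items())
    return llm(f"Write a short DORA summary for the team based on these notes:\n{notes}")
```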
Both LLMs showed similar summarisation capability over large inputs.
_LLAMA 3.1 405B_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ewsg3cgyqp3mikx1pb92.png)
_GPT4o_
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iq6qgz104pacq7nhpyku.png)
## Conclusion
For a long time, LLAMA was playing catch-up with GPT in terms of data processing and analytical abilities. Our earlier experimentation with older LLAMA models led us to believe that GPT was way ahead, but the recent LLAMA 3.1 405B model is on par with GPT4o.
If you value your customers' data privacy and want to try the open-source LLAMA 3.1 models instead of GPT4o, go ahead! The difference in performance is negligible, and self-hosted models let you ensure data privacy. Open-source LLMs have finally started to compete with their closed-source counterparts.
Both LLAMA 3.1 and GPT4o are super capable of deriving inferences from processed data and making Middleware’s DORA metrics more actionable and digestible for engineering leaders, leading to more efficient teams.
## Future Work
This was an experiment to build an AI-powered DORA solution, and in the future we will focus on adding greater support for self-hosted or locally running LLMs in Middleware. Enhanced support for AI-powered action plans throughout the product using self-hosted LLMs, while ensuring data privacy, will be our goal for the coming months.
In the meantime, you can try out the AI DORA summary feature [here](https://github.com/middlewarehq/middleware/tree/ai-beta).