Introduction
In the ever-evolving landscape of AI-powered tools, assistants for software development have carved a niche for themselves, especially in the realm of coding.
This post reports the results of experimenting with four leading Large Language Models (plus a bonus guest star at the end of the article).
OpenAI's GPT-4, Meta's CodeLlama70B and CodeLlama7B, and Mistral's Mixtral8x7B were tasked with a set of coding challenges to evaluate which one reigns supreme as a coding assistant. The aim is to assess their capabilities and discern which LLM could be most beneficial for various coding tasks.
For GPT-4, I've selected the latest release of GPT-4 Turbo (gpt-4-0125-preview), as it corrects some of the "laziness" of its predecessor.
Testing Setup
The battleground for this comparison was set up in Visual Studio Code, enhanced by the "Continue" plugin, allowing for direct interaction with each LLM.
This setup mirrors the functionality of other coding assistants like GitHub Copilot and AWS CodeWhisperer, while offering more privacy control over your code (for example, by running the LLM on private servers) and the option of switching to the best (or cheapest) LLM for the task at hand.
Here is how my setup appears:
Note, on the right-hand side, the answer I just got from CodeLlama70B.
The Tests
The LLMs were evaluated across eight critical areas in coding:
- Code Generation: Their prowess in crafting code snippets or full modules from scratch based on requirements.
- Code Explanation and Documentation: How well they could elucidate existing code and create meaningful documentation.
- Unit Test Generation: Their ability to autonomously generate unit tests for existing code.
- Debugging and Error Correction: Efficiency in identifying, explaining, and rectifying code bugs or errors.
- Refactoring / Optimization Recommendations: The LLMs' capacity to suggest and implement code improvements for better quality and performance.
- Code Review Assistance: Their ability to aid in code reviews by spotting potential issues and suggesting enhancements.
- Security and Best Practices: Proficiency in detecting security vulnerabilities and enforcing best practices.
- Requirements Analysis: The capability to comprehend software requirements expressed in natural language and translate them into technical specifications. While this is not exactly "coding", it is closely related: unclear requirements may need to be refined further or transformed in some way (for example, into a Finite State Machine that is then implemented in a programming language; a minimal sketch of what that could look like follows this list).
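As a toy illustration of that last point, here is a minimal sketch of what "deriving an FSM from requirements" could produce. The states, events, and transitions are invented for this example and are not taken from the actual test tasks.

```python
# Hypothetical example: a requirement such as "the door can only be opened after it
# has been unlocked" can be captured in a small table-driven FSM like this one.
TRANSITIONS = {
    ("locked", "unlock"): "closed",
    ("closed", "lock"): "locked",
    ("closed", "open"): "open",
    ("open", "close"): "closed",
}

def next_state(state: str, event: str) -> str:
    """Return the next state, or stay in the current one if the event is not allowed."""
    return TRANSITIONS.get((state, event), state)

# Trying to open a locked door leaves it locked; unlocking first makes opening possible.
assert next_state("locked", "open") == "locked"
assert next_state(next_state("locked", "unlock"), "open") == "open"
```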
The performance of each LLM has been rated on a scale from 0 to 3, with multiple tests per area (19 in total) so that the models competed on the same tasks within each area.
Throughout the tests, the system prompt has been kept very simple: it just sets up the role of the LLM as a coding assistant and instructs it to be concise and straight to the point (to avoid wasting tokens on lengthy, possibly useless explanations).
This was done to assess the LLMs' native abilities, and it leaves room for further improving them through task-specific system prompting.
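For reference, here is a minimal sketch of how such a prompt can be passed when calling one of the models directly through the OpenAI Python client (v1+). The exact wording of the system prompt and the sample user request are illustrative, not the verbatim text used in the tests.

```python
# Minimal chat-completion call with a simple "coding assistant" system prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # the GPT-4 Turbo release used in these tests
    messages=[
        {"role": "system",
         "content": "You are a coding assistant. Be concise and straight to the point."},
        {"role": "user",
         "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
)
print(response.choices[0].message.content)
```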
The Results
Here is the summary of the tests:
- GPT-4, not surprisingly, emerged as the overall victor, offering the most accurate and comprehensive assistance across all tasks.
- CodeLlama70B and Mixtral8x7B were close competitors, being on par with GPT-4 in some specific areas.
- CodeLlama7B, despite ranking last, showed potential in certain tasks, indicating that tailored prompting could enhance its performance. Its appeal lies in its small size, which allows it to run on consumer-grade hardware.
Here are the results for each category:
Some Task Examples
You can find the full list of tasks used for this test, with the prompts and each LLM's output, on GitHub. Note that I'll update it from time to time with new tests and, possibly, new LLMs.
Here are just a few examples of tasks used in the test.
FEN Counting: This task tested the models' knowledge of FEN strings for chessboard positions. While all four LLMs had some knowledge of what FEN is, only GPT-4 managed to generate completely accurate code. This might be one of those cases where having more "unrelated knowledge" results in better performance.
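The exact prompt and each model's answer are in the GitHub repository; purely as an illustration of the kind of FEN-handling code involved (and assuming a simple piece-counting variant of the task), a solution might look like this:

```python
# Count the pieces in the piece-placement field of a FEN string.
# Uppercase letters are White pieces, lowercase letters are Black pieces,
# digits encode runs of empty squares and '/' separates the ranks.
from collections import Counter

def count_pieces(fen: str) -> Counter:
    board = fen.split()[0]  # first FEN field: piece placement
    return Counter(c for c in board if c.isalpha())

# Starting position: 8 white pawns ('P') and 2 black rooks ('r'), among others.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
counts = count_pieces(start)
assert counts["P"] == 8 and counts["r"] == 2
```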
Guideline Compliance: Surprisingly, none of the LLMs detected all deviations from a set of coding style guidelines, revealing an area where more work is needed before an LLM can reliably assist. Better prompting, or even a RAG approach to make sure the relevant guidelines are fully "understood" by the LLM, might be required.
Ambiguity Analysis: Sometimes requirements clash with each other, potentially generating confusion, rework, or bugs. This task checked whether the LLMs were able to identify conflicting or overlapping requirements. Interestingly, Mixtral8x7B turned out to be better than CodeLlama70B at this task.
Conclusions
Setting up a personal coding assistant using these LLMs can mitigate the data and code privacy concerns common with cloud-based solutions.
There are also cost factors to be considered, such as the cost of hosting your own LLM, whether GPT-4's higher cost per token (compared with other cloud API solutions) is worth it, and so on.
In conclusion, while GPT-4 stands out for its comprehensive support, smaller models may present viable alternatives depending on your specific needs.
The tests used for this assessment are meant to cover a wide range of tasks and give a feel for how the different LLMs perform. Take them as a first cut, and conduct specific tests tailored to your use cases to make an informed choice.
One thing is sure: whoever is not using an AI assistant for coding will need to start very soon to avoid being left behind.
Now is the time to determine which tasks would benefit from an AI coding assistant, and to what extent.
Bonus section
During the preparation of this post, Google launched their new LLM, Gemini Advanced, showing significant improvements over Google Bard.
I've quickly compared it with GPT-4 across the eight categories. Further tests are surely needed but, in the meantime, here is a first-cut view of how Gemini Advanced scored against GPT-4 overall:
and across the eight areas:
The two LLMs are quite close! Clearly, Google's new LLM is a serious contender for the crown of "best LLM overall".
Top comments (14)
Question: When you said GPT-4 do you mean latest GPT-4-turbo or original GPT-4?
Anyway, I think you should include all of GPT-3.5-turbo, GPT-4, and GPT-4-turbo.
Also I wrote recently something more high level but relevant to your post here:
Optimizing Codebases for AI Development Era (Dom Sipowicz, Feb 11)
I've added the info to the article. The recent GPT-4 Turbo (gpt-4-0125-preview) has been used. GPT-3.5 didn't seem interesting to me as its performance is usually on par with the Open Source models. I might add it in the future.
Okay, that was interesting. But a question arises. Most developers work on laptops that do not have enough power to deploy models on a local machine. It seems to me that most will do this on servers. And this raises the question: is it profitable at all? Yes, you get control, but what is the financial side of the coin? For example, GitHub Copilot costs $100 per year; how much will it cost to deploy your own model, for example CodeLlama70B? And I'm more than sure that it will be expensive. I'm not trying to convince you of my point of view, but it seems to me that this is how things are. And until hardware becomes powerful and accessible enough, I think it's too early to talk about how cool it is to deploy your own assistant.
Analyzing the costs would need another article on its own, I think.
Mixtral8x7B can be run on a Mac or a PC with a high-end GPU, though neither option comes cheaply.
That's why I said that we should test our own use cases. I like the flexibility of the setup I described. You can, for example, use CodeLlama70B through a service like Together.ai on a regular basis, but fire up a dedicated instance on your favorite cloud provider for clients that require maximum privacy (imagine a Public Sector or Healthcare agency).
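As a rough sketch of that flexibility: with OpenAI-compatible endpoints, switching deployments is mostly a matter of changing the base URL. The URLs and the model identifier below are assumptions for illustration only; check your provider's documentation for the actual values.

```python
# Same client, same call; only the endpoint changes between a hosted service
# and a (hypothetical) private deployment.
import os
from openai import OpenAI

hosted = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # assumed Together.ai endpoint
)
private = OpenAI(
    api_key="not-needed",
    base_url="http://my-private-server:8000/v1",  # hypothetical self-hosted endpoint
)

def ask(client: OpenAI, prompt: str) -> str:
    resp = client.chat.completions.create(
        model="codellama/CodeLlama-70b-Instruct-hf",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Pick the deployment based on the client's privacy requirements.
print(ask(hosted, "Explain the difference between a list and a tuple in Python."))
```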
Accessing the LLMs through the API is quite cheap (well, GPT-4 is not so cheap, really).
There are tons of other tests that could be made, but time is limited. I hope others will share their own tests (for example, it would be interesting to thoroughly compare GitHub Copilot, AWS CodeWhisperer, TabNine, ...).
You used GPT-4 via the OpenAI API, right? Not ChatGPT plus?
It's a very useful article, thanks!
Yes, all the LLMs are accessed through APIs. For the article, I've used the OpenAI API for accessing GPT-4, but I've also tested the Azure OpenAI API (though not all models are available there). CodeLlama has been accessed through Together.ai and Mixtral8x7B through Mistral.ai. Cost-wise, I've spent a little less than $2 for all the tests I've done (some of them multiple times) for this article. GPT-4 accounted for the largest part of that money.
Thanks! I know why you decided to test GPT-4 only, but I think comparing to GPT-3.5 would also be very useful. Currently GPT-3.5-turbo is 20 times cheaper than GPT-4-preview, so if the performance is okay, it would be a very budget-friendly solution.
I am currently learning JavaScript. At what point should I start using a coding assistant?
My view: you should use an assistant while learning.
Whatever you are trying to accomplish (for example, solving an exercise), you can compare your solution with the one suggested by the assistant. And if what the assistant suggests is wrong (it happens), all the better: trying to understand why something doesn't work is even more instructive than trying to understand why it works.
Also, a very instructive thing is learning how to ask the right question to solve your problem. It may be simple for trivial things, but the more you use an assistant, the more you'll refine that skill.
And you can always ask for more in-depth explanations. You may want to consider using services like Together.ai to access CodeLlama, or use Mixtral8x7B directly from Mistral.ai. They will cost you a fraction of what GPT-4 costs.
Of course, one might be tempted to leave the solution to the assistant, using it without really understanding the outcome. But this is not an issue specific to AI assistants; I've seen too many "Stack Overflow programmers" around.
Any metrics for Claude?
It's been a while since I last looked at Claude; I'll check on it and might consider expanding the article.
EDIT: Bad luck. It's not available in my country :(
Such an assistant can be very helpful
Is there any solution for a local LLM with a code assistant?
Thanks for sharing.