DEV Community

chh


Use Code Llama (and other open LLMs) as Drop-In Replacement for Copilot Code Completion

CodeLlama is now available under a commercial-friendly license.

The question arises: Can we replace GitHub Copilot and use CodeLlama as the code completion LLM without transmitting source code to the cloud?

The answer is both yes and no. Tweaking hyperparameters becomes essential in this endeavor. Let's explore the options available as of August 2023.

Note: You might want to read my latest article on copilot

By analyzing Copilot's VSCode extension[1] at thakkarparth007/copilot-explorer, it becomes evident that Copilot relies on an OpenAI API-compatible backend. Drawing on earlier projects such as fauxpilot, we know it is possible to switch the backend by adding the following to the settings.json file:

"github.copilot.advanced": {
  // fauxpilot was using `codegen`
  "debug.overrideEngine": "codegen",
  // OpenAI API compatible server url
  "debug.testOverrideProxyUrl": "http://localhost:5000",
  "debug.overrideProxyUrl": "http://localhost:5000" 
}

Choosing an OpenAI API-Compatible Server

To make use of CodeLlama, an OpenAI API-compatible server is all that's required. As of 2023, there are numerous options available, and here are a few noteworthy ones:

  • llama-cpp-python: This Python-based option supports llama models exclusively.
  • vllm: Known for high performance, though it lacks support for GGML.
  • flexflow: Touting faster performance compared to vllm.
  • LocalAI: A feature-rich choice that even supports image generation.
  • FastChat: Developed by LMSYS.
  • OpenLLM: An actively developed project.
  • ialacol: Noteworthy for its focus on Kubernetes.
  • ...and many more

The choice among these options is entirely up to you. For the purpose of this article, I'll be focusing on ialacol, primarily because I am the main contributor and thus intimately familiar with all the implementation details.

Let's begin with GGML models. These models boast a low memory requirement and operate without the need for a GPU (which might not be as affordable anymore). If you possess robust CUDA (Nvidia) GPUs, I recommend directly proceeding to the GPTQ section of this article.

Setting up the OpenAI API-Compatible Server

Getting your OpenAI API-compatible server up and running is a straightforward process.

Clone the Repository and Install Dependencies

Use this one-liner to clone the repository and set up the necessary dependencies:

gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt

Run the server and download the model.

export DEFAULT_MODEL_HG_REPO_ID="TheBloke/CodeLlama-7B-GGML"
export DEFAULT_MODEL_FILE="codellama-7b.ggmlv3.Q2_K.bin"
export LOGGING_LEVEL="DEBUG" # optional, more on this later
uvicorn main:app --host 0.0.0.0 --port 9999

Configure VSCode Copilot extension, pointing to the server.

To integrate the server with the VSCode Copilot extension, edit settings.json:

"github.copilot.advanced": {
  "debug.overrideEngine": "codellama-7b.ggmlv3.Q2_K.bin",
  "debug.testOverrideProxyUrl": "http://localhost:9999",
  "debug.overrideProxyUrl": "http://localhost:9999"
}

With these configurations in place, you're ready to roll. CodeLlama's code completion capabilities will now be at your fingertips.
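Before relying on the editor, it can help to confirm that the server answers at all. The sketch below builds an OpenAI-style completion request by hand using only the standard library. It assumes the backend serves the standard `/v1/completions` route and accepts the model file name as `model`; the helper names `build_completion_request` and `complete` are made up for this illustration and are not part of ialacol.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text completion request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt)
    # CPU inference can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server from the previous step running, `complete("http://localhost:9999", "codellama-7b.ggmlv3.Q2_K.bin", "def fibonacci(n):")` should return a short continuation. If it fails here, the Copilot extension will not get completions either.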

Tweaking for Optimal Performance

CodeLlama's completion capabilities are impressive, but they will not always meet your expectations: with this setup, useful suggestions may show up only occasionally. The experience may also fall short of GitHub Copilot, especially in terms of inference speed.

Several factors contribute to this discrepancy:

  • Our current model utilizes 7 billion parameters. To potentially enhance performance, consider experimenting with the 13B and 34B variants.
  • GGML/GGUF models are tailored to minimize memory usage rather than prioritize speed. While they excel in asynchronous tasks, code completion mandates swift responses from the server.
  • GitHub Copilot's extension generates a multitude of requests as you type, which can pose challenges, given that language models typically process one prompt at a time.
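Since speed is the main gap, it is worth measuring it rather than guessing. The following is a minimal sketch of a timing helper; `measure_latency` and `summarize` are hypothetical names for this illustration, and `call` stands for whatever function sends one completion request to your server.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(call: Callable[[], object], runs: int = 5) -> List[float]:
    """Return the wall-clock duration in seconds of each of `runs` calls."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Median and worst-case latency of the measured calls."""
    return {"median": statistics.median(durations), "max": max(durations)}
```

Comparing the median before and after each change (smaller model, truncated prompt, more threads) shows which trade-off actually pays off on your hardware.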

To address these considerations, exploring smaller models is a viable option. Smaller models often exhibit a faster inference speed. Here are some alternatives to consider:

For a potential increase in throughput, a useful strategy is to queue requests in front of the inference server. This raises throughput (the number of requests served over time), not the speed of any single request, and can be achieved using tools like text-inference-batcher (Disclaimer: I authored this tool, and tib is still in its early alpha phase).
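To make the queuing idea concrete, here is a rough sketch in Python with asyncio. It is not how text-inference-batcher is implemented; it only illustrates the principle that each upstream handles one prompt at a time while the remaining prompts wait. The `Worker` signature and the `dispatch` function are assumptions made for this example.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical worker signature: (upstream_url, prompt) -> completion text.
Worker = Callable[[str, str], Awaitable[str]]


async def dispatch(prompts: List[str], upstreams: List[str], worker: Worker) -> List[str]:
    """Run every prompt, letting each upstream handle one prompt at a time."""
    idle: asyncio.Queue = asyncio.Queue()
    for upstream in upstreams:
        idle.put_nowait(upstream)

    async def run_one(prompt: str) -> str:
        upstream = await idle.get()        # wait until some upstream is idle
        try:
            return await worker(upstream, prompt)
        finally:
            idle.put_nowait(upstream)      # hand the upstream to the next waiter

    # gather() returns results in the same order as the prompts.
    return list(await asyncio.gather(*(run_one(p) for p in prompts)))
```

With two upstreams, two prompts are processed in parallel and the rest are queued, so the total time for a burst of requests drops even though each individual inference is no faster.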

Putting these trade-offs together, here is the plan: use a high-quality 3B model with a small footprint, and run two server instances to increase throughput.

# in `ialacol` folder you just cloned.
export THREADS=2
# Use small model https://stability.ai/blog/stablecode-llm-generative-ai-coding
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
export DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
# truncate the prompt to make inference faster...
# (it's a trade off, you get lower quality results too)
export TRUNCATE_PROMPT_LENGTH=100
uvicorn main:app --host 0.0.0.0 --port 9998
# in another terminal session (activate the venv and export the same variables first)
uvicorn main:app --host 0.0.0.0 --port 9999
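To see why truncation costs quality, consider what it does to the prompt. The sketch below shows one plausible strategy, keeping only the tail of the prompt because the text closest to the cursor usually matters most. This is an illustration under assumptions, not ialacol's implementation: the real setting may count tokens rather than characters and may cut differently, and `truncate_prompt` is a name invented here.

```python
def truncate_prompt(prompt: str, max_chars: int) -> str:
    """Keep at most the last `max_chars` characters, preferring whole lines."""
    if max_chars <= 0:
        return ""
    if len(prompt) <= max_chars:
        return prompt
    tail = prompt[-max_chars:]
    # If the cut already falls on a line boundary, nothing is half-cut.
    if prompt[len(prompt) - max_chars - 1] == "\n":
        return tail
    # Otherwise drop the leading partial line when a later line exists.
    newline = tail.find("\n")
    if newline != -1 and newline + 1 < len(tail):
        return tail[newline + 1:]
    return tail
```

Everything before the cut, such as imports or earlier function definitions, becomes invisible to the model, which is where the lower-quality suggestions come from.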

Load Balancing with a Queue to Increase Throughput

To enhance throughput, we can employ load balancing with a queuing mechanism. Here's how you can set it up using text-inference-batcher:

Setting Up tib for Load Balancing

  1. Clone the repository and set up the necessary environment:
# clone and setup
gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
  2. Start tib, pointing it at your servers.
export UPSTREAMS="http://localhost:9998,http://localhost:9999"
npm start
  3. Configure the Copilot extension, pointing it at the load balancer.
"github.copilot.advanced": {
  "debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
  // pointing to `tib`
  "debug.testOverrideProxyUrl": "http://localhost:8000",
  "debug.overrideProxyUrl": "http://localhost:8000"
}

Despite the compromise in inference quality due to smaller models and prompt truncation, results improved. However, they still fall short of GitHub Copilot's code completion capabilities.

Let's now venture to push the limits in the opposite direction.

Leveraging Cloud Infrastructure for Enhanced Performance

If you possess powerful cloud infrastructure equipped with GPUs, the process becomes notably streamlined.

In this scenario, we will harness the capabilities of Kubernetes due to its exceptional automation features. Both ialacol and text-inference-batcher are inherently compatible with Kubernetes, which further simplifies the setup.

Let's deploy the 34B CodeLlama GPTQ model to a Kubernetes cluster with CUDA acceleration, using the Helm package manager:

(values.yaml)

replicas: 1
deployment:
  image: ghcr.io/chenhunghan/ialacol-gptq:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/CodeLlama-34B-GPTQ
    TOP_K: 30
    TOP_P: 0.9
    MAX_TOKENS: 200
    THREADS: 1
resources:
  # Request a node with 1 Nvidia GPU
  limits:
    nvidia.com/gpu: 1
model:
  persistence:
    size: 30Gi
    accessModes:
      - ReadWriteOnce
    storageClassName: ~
service:
  type: ClusterIP
  port: 8000
  annotations: {}
# You probably need to use these to select a node with GPUs.
tolerations: []
affinity: {}

Add the chart repository, then install the workers with Helm:

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
# worker one
helm upgrade --install codellama-worker-0 ialacol/ialacol -f values.yaml
# worker two
helm upgrade --install codellama-worker-1 ialacol/ialacol -f values.yaml
# and maybe more? Depends on your budget :)
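The TOP_K and TOP_P values in the worker's values.yaml control how the next token is sampled. As a rough illustration of what the two settings mean (a generic sketch, not the code ialacol or its GPTQ backend runs; libraries differ in the exact order and edge cases), the hypothetical function below first keeps the `top_k` most likely tokens, then keeps the smallest prefix of them whose cumulative probability reaches `top_p`.

```python
import math
from typing import Dict


def filter_top_k_top_p(logits: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Return renormalized probabilities of the tokens that survive both filters."""
    # Softmax over the raw scores, shifted by the max for numerical stability.
    peak = max(logits.values())
    weights = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, token) for token, w in weights.items()), reverse=True)

    kept = []
    cumulative = 0.0
    for probability, token in ranked[:top_k]:   # top-k: only the k most likely tokens
        kept.append((token, probability))
        cumulative += probability
        if cumulative >= top_p:                 # top-p: stop once enough mass is covered
            break

    mass = sum(p for _, p in kept)
    return {token: p / mass for token, p in kept}
```

Lower values make completions more predictable, which tends to suit code; higher values allow more varied suggestions.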

Again, load balancing using tib with this values.yaml:

replicas: 1
deployment:
  image: ghcr.io/ialacol/text-inference-batcher-nodejs:latest
  env:
    # pointing to our workers
    UPSTREAMS: "http://codellama-worker-0:8000,http://codellama-worker-1:8000"
    # increase this if your worker can handle more than one inference at a time.
    MAX_CONNECT_PER_UPSTREAM: 1
resources:
  requests:
    cpu: 500m
    memory: 128Mi
service:
  type: ClusterIP
  port: 8000
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"
nodeSelector: {}
tolerations: []
affinity: {}

Install tib with Helm:

helm upgrade --install tib text-inference-batcher/text-inference-batcher-nodejs -f values.yaml

Expose the tib service by utilizing your cloud's load balancer, or for testing purposes, you can employ kubectl port-forward.

Conclusion

With the 34B CodeLlama model running with CUDA acceleration on at least one worker, the code completion experience becomes not only swift but also of commendable quality. I would confidently state that this setup is on par with the performance of GitHub Copilot.

Nonetheless, it's crucial to acknowledge that this particular configuration does come at a notably higher cost when compared to GitHub Copilot. Striking a balance between budget considerations and privacy concerns is imperative. This investment is especially justifiable when handling proprietary or enterprise-level software projects. Conversely, the pricing structure of Copilot holds its own appeal.

In essence, we're fortunate to have a range of options at our disposal. Your thoughts and feedback are valuable, so feel free to share your insights in the comments section.

Let's keep the conversation going! 🚀


  1. I highly recommend going through the Copilot source code: you will learn about its prompt engineering and about the client-side caching it applies at several levels before a request hits the server.

Top comments (4)

IT Lackey

Thank you for posting this!

I just got fast chat running in a container and leveraging Arc GPUs.
github.com/itlackey/ipex-arc-fastchat

Now I am going to use this to connect copilot to it! 🥳

Smyja

is there a way you can use anyscale or together.ai since they have llama models

iwaduarte

Hi chh. Great post. How would one go through the Copilot source code? I thought they were private.

chh

Hi, the client side (copilot-vscode-extension) is compiled to JavaScript. The code has been minified, but it is still possible to go through it with some hacks; see this awesome repo github.com/thakkarparth007/copilot...