ChatGPT is awesome, but privacy is a concern for many. What if you could host your own private AI on an old PC, without relying on GPU clusters?
Thanks to amazing community efforts such as ggml, llama.cpp, and TheBloke, it is now possible for anyone to chat with an AI privately, without internet, before the apocalypse.
In this article, we will containerize an AI before it ends the world. More precisely, we will deploy a Large Language Model (LLM, popularly known as AI) in a container within a Kubernetes cluster and then have a conversation with it.
To get started, you'll need a Kubernetes cluster, for example minikube with roughly 8 CPU threads and 5GB of memory. You'll also need Helm installed.
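Before going further, it can help to confirm that the tools this walkthrough relies on are on your PATH. The following is a minimal sketch: `missing_tools` is a hypothetical helper name, not part of any of these projects, and the `minikube start` line in the comment simply plugs in the CPU and memory figures mentioned above.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
  return 0
}

# Report anything that still needs installing (prints nothing if all are present).
missing_tools minikube kubectl helm

# With minikube installed, a cluster of the suggested size can be started with:
#   minikube start --cpus=8 --memory=5g
```

If you use a different Kubernetes distribution, drop `minikube` from the list and size the nodes with that distribution's own tooling.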
Let's begin by deploying the LLM within a minimal wrapper.
cat > values.yaml <<EOF
replicas: 1
deployment:
  image: quay.io/chenhunghan/ialacol:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML
    DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin
    DEFAULT_MODEL_META: ""
    THREADS: 8
    BATCH_SIZE: 8
    CONTEXT_LENGTH: 1024
service:
  type: ClusterIP
  port: 8000
  annotations: {}
EOF
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install orca-mini-3b ialacol/ialacol -f values.yaml
If you're interested in the technical details, here's what's happening behind the scenes:
- We are deploying a Helm release, orca-mini-3b, using the Helm chart ialacol.
- The container image ialacol is a mini RESTful API server compatible with the OpenAI API. Disclaimer: I am the main contributor to this project.
- The deployed LLM binary, orca mini, has 3 billion parameters. Orca mini is based on the OpenLLaMA project.
- The binary has been quantized by TheBloke into a 4-bit GGML format.
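As a rough sanity check on those numbers, the size of the weights can be estimated from the parameter count and the bit width. This is a back-of-the-envelope sketch, not how GGML computes anything: it assumes exactly 3 billion parameters at exactly 4 bits each and ignores everything else stored in the file.

```shell
# Assumption: exactly 3 billion parameters at exactly 4 bits each.
params=3000000000
bits_per_weight=4

# 8 bits per byte, so this is the raw size of the weights alone.
weight_bytes=$(( params * bits_per_weight / 8 ))
echo "approx. weight bytes: ${weight_bytes}"

# Same figure in whole megabytes (1 MB = 1,000,000 bytes).
echo "approx. weight MB: $(( weight_bytes / 1000000 ))"
```

The result, about 1.5GB, should be read as a lower bound. A real quantized file also has to carry scaling information and metadata alongside the 4-bit values, and "3 billion" is presumably a rounded figure, so expect the actual download to be somewhat larger.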
Now, please be patient for a few minutes as the container downloads the binary, which is around 1.93GB in size:
INFO: Downloading model... TheBloke/orca_mini_3B-GGML/orca-mini-3b.ggmlv3.q4_0.bin
Once the download is complete, it's time to start a conversation!
Expose the service:
kubectl port-forward svc/orca-mini-3b 8000:8000
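The port-forward keeps running in its terminal, and the server may still be loading the model when you first try it. From a second terminal, a small retry helper can poll until something answers. This is a generic sketch: `retry` is a hypothetical helper, and the `/v1/models` path in the commented usage line is an assumption based on the server being OpenAI-compatible, so substitute whichever endpoint your version exposes.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS tries have failed.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (assumed endpoint): try every 10 seconds, up to 30 times.
#   retry 30 10 curl --silent --fail --output /dev/null http://localhost:8000/v1/models
```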
Ask a question:
USER_QUERY="What is the meaning of life? Explain like I am 5."
MODEL="orca-mini-3b.ggmlv3.q4_0.bin"
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{ "prompt": "### System:You are an AI assistant that follows instruction extremely well. Help as much as you can.### User:'"${USER_QUERY}"'### Response:", "model": "'"${MODEL}"'" }' \
  http://localhost:8000/v1/completions
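The request above splices the question straight into the JSON body, so a question containing a double quote or a backslash would produce invalid JSON. A minimal sketch of a safer way to build the body follows. `json_escape` and `build_payload` are hypothetical helper names, the prompt template is copied from the request above, and only backslashes and double quotes are escaped, so multi-line questions are out of scope here.

```shell
# Hypothetical helper: escape backslashes and double quotes for use inside a JSON string.
# Newlines and other control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build_payload QUERY MODEL prints the JSON request body.
build_payload() {
  system_prompt="You are an AI assistant that follows instruction extremely well. Help as much as you can."
  query=$(json_escape "$1")
  model=$(json_escape "$2")
  printf '{ "prompt": "### System:%s### User:%s### Response:", "model": "%s" }\n' \
    "$system_prompt" "$query" "$model"
}

# Usage with the variables defined earlier:
#   curl -X POST -H 'Content-Type: application/json' \
#     -d "$(build_payload "$USER_QUERY" "$MODEL")" \
#     http://localhost:8000/v1/completions
```

For arbitrary input, a dedicated JSON tool is a more robust choice than hand-rolled escaping.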
According to AI...
The meaning of life is a question that has puzzled humans for centuries. Some believe it to be finding happiness, others think it's achieving success or something greater than ourselves, while some see it as fulfilling our purpose on this planet. Ultimately, everyone answers this question differently and what matters most in the end is how we live our lives with integrity and make a positive impact on those around us.
Let's start scaling LLMs on Kubernetes!