In this article, we will walk through setting up an offline AI on GitHub Actions that respects your privacy by NOT sending your source code to the internet. This AI adds a touch of humor by telling a joke whenever a developer opens a boring pull request.
GitHub provides a generous offering for open source projects: as long as your project is open source, you can use their GitHub-hosted runners for free.
However, the GitHub-hosted runner comes with limited computational power: 2 vCPUs, 7GB of RAM, and 14GB of storage (ref). AI computing, or LLM inference, on the other hand, is considered a luxury because of its resource requirements and the associated costs 💸.
The stock price of Nvidia (the company that makes the GPUs for AI):
However, thanks to the efforts of amazing community projects like ggml, it is now possible to run an LLM (Large Language Model) on edge devices such as the Raspberry Pi 4.
In this article, I will present the GitHub Actions snippets that let you run a 3B-parameter LLM directly on GitHub Actions, even with just 2 CPU cores and 7GB of RAM. The workflow is triggered when a developer opens a new pull request, and the AI lightens the mood by replying with a joke.
```yaml
name: Can 3B AI with 2 CPUs make good jokes?

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

env:
  TEMPERATURE: 1
  DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML
  DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin
  DEFAULT_MODEL_META: ""
  THREADS: 2
  BATCH_SIZE: 8
  CONTEXT_LENGTH: 1024
jobs:
  joke:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Create k8s Kind Cluster
        uses: helm/kind-action@v1.7.0
      - run: |
          kubectl cluster-info
          kubectl get nodes
      - name: Set up Helm
        uses: azure/setup-helm@v3
        with:
          version: v3.12.0
      - name: Install ialacol and wait for pods to be ready
        run: |
          helm repo add ialacol https://chenhunghan.github.io/ialacol
          helm repo update
          cat > values.yaml <<EOF
          replicas: 1
          deployment:
            image: quay.io/chenhunghan/ialacol:latest
            env:
              DEFAULT_MODEL_HG_REPO_ID: $DEFAULT_MODEL_HG_REPO_ID
              DEFAULT_MODEL_FILE: $DEFAULT_MODEL_FILE
              DEFAULT_MODEL_META: $DEFAULT_MODEL_META
              THREADS: $THREADS
              BATCH_SIZE: $BATCH_SIZE
              CONTEXT_LENGTH: $CONTEXT_LENGTH
          resources:
            {}
          cache:
            persistence:
              size: 0.5Gi
              accessModes:
                - ReadWriteOnce
              storageClass: ~
          cacheMountPath: /app/cache
          model:
            persistence:
              size: 2Gi
              accessModes:
                - ReadWriteOnce
              storageClass: ~
          modelMountPath: /app/models
          service:
            type: ClusterIP
            port: 8000
            annotations: {}
          nodeSelector: {}
          tolerations: []
          affinity: {}
          EOF
          helm install ialacol ialacol/ialacol -f values.yaml --namespace default
          echo "Wait for the pod to be ready; it takes about 36s to download the 1.93GB model (~50MB/s)"
          sleep 40
          kubectl get pods -n default
      - name: Ask the AI for a joke
        run: |
          kubectl port-forward svc/ialacol 8000:8000 &
          echo "Wait for port-forward to be ready"
          sleep 5
          curl http://localhost:8000/v1/models
          RESPONSE=$(curl -X POST -H 'Content-Type: application/json' -d '{ "messages": [{"role": "user", "content": "Tell me a joke."}], "temperature":"'${TEMPERATURE}'", "model": "'${DEFAULT_MODEL_FILE}'"}' http://localhost:8000/v1/chat/completions)
          echo "$RESPONSE"
          REPLY=$(echo "$RESPONSE" | jq -r '.choices[0].message.content')
          echo "$REPLY"
          # Dump the pod logs for debugging (the Helm release installed above is named "ialacol")
          kubectl logs --selector app.kubernetes.io/name=ialacol -n default
          # jq prints the literal string "null" when the field is missing, so check for both
          if [ -z "$REPLY" ] || [ "$REPLY" = "null" ]; then
            echo "No reply from AI"
            exit 1
          fi
          # Pass the joke to the next step (assumes the reply is a single line)
          echo "REPLY=$REPLY" >> $GITHUB_ENV
      - name: Comment the Joke
        # Only run on pull requests; a push to main has no PR to comment on
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        # Note: issues and PRs share the same comment API in GitHub's eyes
        with:
          script: |
            const REPLY = process.env.REPLY
            if (REPLY) {
              github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: `🤖: ${REPLY}`
              })
            }
```
Is the joke any good?
Well, it's up for debate. If you want better jokes, you can bring a self-hosted runner. A self-hosted runner (with, for example, 16 vCPUs and 32GB of RAM) would definitely be capable of running more sophisticated models such as MPT-30B; a rough sketch of the changes follows below.
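As a minimal sketch (not part of the original workflow), only the runner labels and the model-related environment variables need to change; the MPT-30B repo and file name below are assumptions you should verify on Hugging Face before using them:

```yaml
# Hypothetical tweaks for a beefier self-hosted runner
env:
  THREADS: 16                                      # more cores -> more inference threads
  DEFAULT_MODEL_HG_REPO_ID: TheBloke/mpt-30B-GGML  # assumed GGML build of MPT-30B
  DEFAULT_MODEL_FILE: mpt-30b.ggmlv0.q4_1.bin      # hypothetical file name, check the repo

jobs:
  joke:
    runs-on: [self-hosted, linux, x64]             # default labels of a Linux self-hosted runner
```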
You might be wondering why running Kubernetes is necessary for this project. This article was actually created during the development of a testing CI for the OSS project ialacol. The goal was to have a basic smoke test that verifies the Helm charts and ensures the endpoint returns a 200 status code. You can find the full source of the testing CI YAML here.
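For reference, here is a minimal sketch of what such a smoke-test step can look like (my illustration, not the exact step from ialacol's CI), reusing the port-forward from the workflow above:

```yaml
# Assert that the OpenAI-compatible endpoint answers with HTTP 200
- name: Smoke test the endpoint
  run: |
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/v1/models)
    if [ "$STATUS" != "200" ]; then
      echo "Expected HTTP 200 from /v1/models, got $STATUS"
      exit 1
    fi
```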
While running Kubernetes may not be necessary for your specific use case, it's worth mentioning that the overhead of the container runtime and Kubernetes is minimal: the whole CI run, from provisioning the cluster to completing the LLM inference, takes only 2 minutes.