In this article, we will walk through setting up an offline AI on GitHub Actions that respects your privacy by NOT sending your source code to the internet. This AI adds a touch of humor by telling a joke whenever a developer opens a boring pull request.
GitHub provides a generous offering for open source projects: as long as your project is open source, you can use their GitHub-hosted runners for free.
However, the GitHub-hosted runner comes with limited computational power: 2 vCPUs, 7GB of RAM, and 14GB of storage (ref). AI computing, or LLM inference, on the other hand, is considered a luxury because of its resource requirements and the associated costs 💸.
The stock price of Nvidia (the company that makes the GPUs for AI):
However, thanks to the efforts of amazing community projects like ggml, it is now possible to run an LLM (Large Language Model) on edge devices such as the Raspberry Pi 4.
In this article, I will present the GitHub Actions snippets that let you run a 3B-parameter LLM directly on GitHub Actions, even with just 2 CPU cores and 7GB of RAM. The workflow is triggered when a developer opens a new pull request, and the AI lightens the mood by replying with a joke.
```yaml
name: Can 3B AI with 2 CPUs make good jokes?

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

env:
  TEMPERATURE: 1
  DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML
  DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin
  DEFAULT_MODEL_META: ""
  THREADS: 2
  BATCH_SIZE: 8
  CONTEXT_LENGTH: 1024
jobs:
  joke:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Create k8s Kind Cluster
        uses: helm/kind-action@v1.7.0
      - run: |
          kubectl cluster-info
          kubectl get nodes
      - name: Set up Helm
        uses: azure/setup-helm@v3
        with:
          version: v3.12.0
      - name: Install ialacol and wait for pods to be ready
        run: |
          helm repo add ialacol https://chenhunghan.github.io/ialacol
          helm repo update
          cat > values.yaml <<EOF
          replicas: 1
          deployment:
            image: quay.io/chenhunghan/ialacol:latest
            env:
              DEFAULT_MODEL_HG_REPO_ID: $DEFAULT_MODEL_HG_REPO_ID
              DEFAULT_MODEL_FILE: $DEFAULT_MODEL_FILE
              DEFAULT_MODEL_META: $DEFAULT_MODEL_META
              THREADS: $THREADS
              BATCH_SIZE: $BATCH_SIZE
              CONTEXT_LENGTH: $CONTEXT_LENGTH
          resources:
            {}
          cache:
            persistence:
              size: 0.5Gi
              accessModes:
                - ReadWriteOnce
              storageClass: ~
          cacheMountPath: /app/cache
          model:
            persistence:
              size: 2Gi
              accessModes:
                - ReadWriteOnce
              storageClass: ~
          modelMountPath: /app/models
          service:
            type: ClusterIP
            port: 8000
            annotations: {}
          nodeSelector: {}
          tolerations: []
          affinity: {}
          EOF
          helm install ialacol ialacol/ialacol -f values.yaml --namespace default
          echo "Wait for the pod to be ready; it takes about 36s to download the 1.93GB model (~50MB/s)"
          sleep 40
          kubectl get pods -n default
      - name: Ask the AI for a joke
        run: |
          kubectl port-forward svc/ialacol 8000:8000 &
          echo "Wait for port-forward to be ready"
          sleep 5
          curl http://localhost:8000/v1/models
          RESPONSE=$(curl -X POST -H 'Content-Type: application/json' -d '{ "messages": [{"role": "user", "content": "Tell me a joke."}], "temperature":"'${TEMPERATURE}'", "model": "'${DEFAULT_MODEL_FILE}'"}' http://localhost:8000/v1/chat/completions)
          echo "$RESPONSE"
          REPLY=$(echo "$RESPONSE" | jq -r '.choices[0].message.content')
          echo "$REPLY"
          # Dump the pod logs for debugging (the Helm release installed above is named "ialacol")
          kubectl logs --selector app.kubernetes.io/name=ialacol -n default
          # jq prints the literal string "null" when the field is missing, so check for both
          if [ -z "$REPLY" ] || [ "$REPLY" = "null" ]; then
            echo "No reply from AI"
            exit 1
          fi
          # Pass the joke to the next step (assumes the reply is a single line)
          echo "REPLY=$REPLY" >> $GITHUB_ENV
      - name: Comment the Joke
        # Only run on pull requests; a push to main has no PR to comment on
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        # Note: issues and PRs share the same comment API in GitHub's eyes
        with:
          script: |
            const REPLY = process.env.REPLY
            if (REPLY) {
              github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: `🤖: ${REPLY}`
              })
            }
```
Is the joke any good?
Well, it's up for debate. If you want better jokes, you can bring a self-hosted runner. A self-hosted runner (with, for example, 16 vCPUs and 32GB of RAM) would definitely be capable of running more sophisticated models such as MPT-30B; a rough sketch of the changes follows below.
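As a minimal sketch (not part of the original workflow), only the runner labels and the model-related environment variables need to change; the MPT-30B repo and file name below are assumptions you should verify on Hugging Face before using them:

```yaml
# Hypothetical tweaks for a beefier self-hosted runner
env:
  THREADS: 16                                      # more cores -> more inference threads
  DEFAULT_MODEL_HG_REPO_ID: TheBloke/mpt-30B-GGML  # assumed GGML build of MPT-30B
  DEFAULT_MODEL_FILE: mpt-30b.ggmlv0.q4_1.bin      # hypothetical file name, check the repo

jobs:
  joke:
    runs-on: [self-hosted, linux, x64]             # default labels of a Linux self-hosted runner
```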
You might be wondering why running Kubernetes is necessary for this project. This article was actually created during the development of a testing CI for the OSS project ialacol. The goal was to have a basic smoke test that verifies the Helm charts and ensures the endpoint returns a 200 status code. You can find the full source of the testing CI YAML here.
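For reference, here is a minimal sketch of what such a smoke-test step can look like (my illustration, not the exact step from ialacol's CI), reusing the port-forward from the workflow above:

```yaml
# Assert that the OpenAI-compatible endpoint answers with HTTP 200
- name: Smoke test the endpoint
  run: |
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/v1/models)
    if [ "$STATUS" != "200" ]; then
      echo "Expected HTTP 200 from /v1/models, got $STATUS"
      exit 1
    fi
```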
While running Kubernetes may not be necessary for your specific use case, it's worth mentioning that the overhead of the container runtime and Kubernetes is minimal: the whole CI run, from provisioning the cluster to completing the LLM inference, takes only 2 minutes.