Jens Gerke

Preloading Ollama Models

A few weeks ago, I started using Ollama to run large language models (LLMs), and I've been really enjoying it. After getting the hang of it, I thought it was about time to try it out on one of our real-world use cases (I'll share more about this later).

At Direktiv we use Kubernetes for all our deployments, and when I tried to run Ollama as a pod, I faced a couple of issues.

The first issue was that Ollama downloads models on demand, which is logical given its support for multiple models. On startup, the required model has to be fetched, and models range from 1.5GB to 40GB in size. This really extends the time it takes for the container to start up.

To start the download, you either make an API call or use the CLI to fetch the model you need (the API variant is shown after the deployment below). In a Kubernetes setup, you can handle this with a postStart lifecycle hook. So, here's a simple example of an Ollama deployment I put together:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:0.1.29
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        lifecycle:
          postStart:
            exec:
              command: [ "/bin/sh", "-c", "ollama pull gemma:2b" ]
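For reference, the pull can also be triggered via Ollama's HTTP API instead of the CLI - a quick sketch against a locally running instance (the endpoint streams progress as JSON):

curl http://localhost:11434/api/pull -d '{
  "name": "gemma:2b"
}'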

That worked, but the startup problem remains - the lifecycle hook takes ages to run, and it won't work at all on Kubernetes nodes without internet access. On top of that, at Direktiv we use Knative a lot, and Knative does not support lifecycle hooks. So my plan was to build a container image based on the Ollama image with the model pre-downloaded.

So, a little hiccup: Ollama runs as an HTTP service with an API, and the pull command is just a client for that service. That makes it tricky to pull models while building the container image so they're ready to go right from the start. No services in docker build, remember?
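In other words, the obvious approach fails:

FROM ollama/ollama
# fails: `ollama pull` is just a client that talks to the Ollama server
# over HTTP, and no server is running during docker build
RUN ollama pull gemma:2b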

There have been a couple of GitHub issues pointing out this problem, and the workaround is to start an Ollama container, pull the model, and then transfer the downloaded models into a new container build. Personally, I found this process less than ideal for an automated build.
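A rough sketch of that manual workaround (container and directory names are arbitrary):

# start a throwaway Ollama container and pull the model into it
docker run -d --name puller ollama/ollama
docker exec puller ollama pull gemma:2b

# copy the downloaded model files out of the container, then clean up
docker cp puller:/root/.ollama ./ollama-models
docker rm -f puller

# ./ollama-models can now be COPY'd into a new image build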

So I put my developer gloves on and thought, "How hard can it be?" 🧤 I was excited to find that all the download functions in the Ollama project are exported, but oh boy, the dependencies didn't play nice! I ended up copying and tweaking the existing code. Voilà! Now we've got a neat little container for a multi-stage build. Check out the project here:

https://github.com/jensg-st/ollama-pull 💥

With this container, you can fetch the model in the first stage - in this scenario, gemma:2b. The main container can still use the default ollama/ollama image; the model just needs to be copied from the downloader stage into it at /root/.ollama. You can even download multiple models in the first stage (a variant follows the Dockerfile below).

# first stage: download the model files
FROM gerke74/ollama-model-loader AS downloader

RUN /ollama-pull gemma:2b

# main stage: the standard Ollama image with the model baked in
FROM ollama/ollama

ENV OLLAMA_HOST=0.0.0.0

COPY --from=downloader /root/.ollama /root/.ollama
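Pulling multiple models is just a matter of repeating the pull in the first stage - a sketch, with llama2 as a hypothetical second model:

FROM gerke74/ollama-model-loader AS downloader

RUN /ollama-pull gemma:2b
RUN /ollama-pull llama2

FROM ollama/ollama

ENV OLLAMA_HOST=0.0.0.0

COPY --from=downloader /root/.ollama /root/.ollama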

Let's build the single-model version and run it:

cat << 'EOF' > Dockerfile
FROM gerke74/ollama-model-loader AS downloader
RUN /ollama-pull gemma:2b
FROM ollama/ollama
ENV OLLAMA_HOST=0.0.0.0
COPY --from=downloader /root/.ollama /root/.ollama
EOF
docker build -t gemma .
docker run -p 11437:11434 gemma
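And to close the loop on Kubernetes: with the model baked into the image, the deployment from the beginning of this post loses its lifecycle hook entirely. A sketch, assuming the image has been pushed to a registry as registry.example.com/ollama-gemma (a placeholder name):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: registry.example.com/ollama-gemma # placeholder for the image built above
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        # no postStart hook needed - the model ships with the image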

Back to the local docker run: the curl command below sends a question to the container. It's important to use the right value for model - in this case, gemma:2b.

curl http://localhost:11437/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}'

The container responds with a stream of JSON like this:

{"model":"gemma:2b","created_at":"2024-03-26T15:16:56.780177872Z","response":"The","done":false}
{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.003156881Z","response":" sky","done":false}
{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.223483082Z","response":" appears","done":false}
...
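Since the answer arrives token by token, a small jq pipeline is handy to stitch the stream back together into plain text (assuming jq is installed):

curl -s http://localhost:11437/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}' | jq -j '.response'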

Please feel free to comment if this was helpful or if something is not working. In the next few posts, I'll add some real-life functionality on top of this.
