
Peter Maffay

How to Set Up and Run Ollama on a GPU-Powered VM (vast.ai)

In this tutorial, we'll walk through setting up and using Ollama for private model inference on a VM with a GPU, either on your local machine or on a VM rented from Vast.ai or Runpod.io. Ollama lets you run models privately, keeping your data under your control, while the GPU significantly speeds up inference.

Outline

  1. Set up a VM with GPU on Vast.ai

  2. Start Jupyter Terminal

  3. Install Ollama

  4. Run Ollama Serve

  5. Test Ollama with a model

  6. (Optional) Use your own model


🐰 AI Rabbit: more tutorials, news, and insights on AI developments at https://airabbit.blog/


Setting Up a VM with GPU on Vast.ai

1. Create a VM with GPU:
   - Visit Vast.ai to create your VM.
   - Choose a VM with at least 30 GB of storage so there is enough space for installation and model files.
   - Select a VM that costs less than $0.30 per hour to keep the setup cost-effective.
   - If you prefer working from a terminal, see the optional CLI sketch after this list.

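If you prefer a terminal to the web UI, vast.ai also offers a Python CLI. The sketch below is only a rough outline under the assumption that the `vastai` package and its `search offers` query fields work as shown; check `vastai --help` and the vast.ai docs before relying on it.

```bash
# Rough sketch, not verified against the current CLI: install the vast.ai CLI
# and look for cheap single-GPU offers with enough disk space.
pip install vastai
vastai set api-key YOUR_API_KEY   # API key from your vast.ai account settings
# query field names (num_gpus, disk_space, dph) are assumptions; verify with the CLI help
vastai search offers 'num_gpus=1 disk_space>=30 dph<0.3'
```

The web console remains the simpler path if you only need a single instance.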

Downloading and Running Ollama

1. Start Jupyter Terminal: Once your VM is up and running, start Jupyter and open a terminal within it. This is the easiest way to get a shell on the instance. Alternatively, you can connect over SSH from your local machine, for example with VS Code, but you will need to set up an SSH key first; see the sketch below.
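If you go the SSH route instead, a minimal sketch of the key setup looks like this; the key file name is arbitrary, and the host and port placeholders come from the connect details shown on your instance page once the public key is added to your vast.ai account.

```bash
# generate a local key pair (the file name is just an example)
ssh-keygen -t ed25519 -f ~/.ssh/vastai_key

# print the public key, then paste it into your vast.ai account's SSH key settings
cat ~/.ssh/vastai_key.pub

# connect with the host and port your instance page shows (placeholders here)
ssh -i ~/.ssh/vastai_key -p <PORT> root@<INSTANCE_IP>
```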

2. Install Ollama: Open the terminal in Jupyter and run the following command to install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
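Before moving on, you can confirm the binary landed on your PATH; the install script normally takes care of this.

```bash
# quick sanity check that Ollama is installed
ollama --version
```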

3. Run Ollama Serve: After installation, start the Ollama service by running:

```bash
ollama serve &
```

Watch the startup output for GPU errors. If Ollama cannot find the GPU and falls back to the CPU, responses will be noticeably slow when you interact with the model.
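If you prefer to keep the server output in a file and verify that it is actually up, a minimal sketch (the log file name is just an example; 11434 is Ollama's default port):

```bash
# run the server in the background and capture its output
nohup ollama serve > ollama.log 2>&1 &

# look for GPU/CUDA lines and any errors in the startup output
tail -n 30 ollama.log

# the API answers on port 11434 once the server is ready; this lists local models
curl http://localhost:11434/api/tags
```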

4. Test Ollama with a Model: Test the setup by running a sample model such as Mistral:

```bash
ollama run mistral
```

The first run downloads the model weights, which can take a few minutes; after that you can start chatting with the model to make sure everything is working correctly.
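If you prefer a scripted check over the interactive chat, the same model can be exercised through Ollama's HTTP API; the prompt text here is arbitrary.

```bash
# non-interactive test against the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain in one sentence what a GPU is.",
  "stream": false
}'
```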

Optional: Check GPU Usage

Check GPU Utilization: While the model is answering a prompt (the previous step), check whether the GPU is being used by running:

```bash
nvidia-smi
```

Make sure the memory utilization is greater than 0%, which indicates that the GPU is handling the inference.
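To watch utilization continuously, or to pull out just the relevant numbers, something like this works; the selected fields are only an example.

```bash
# refresh the full nvidia-smi view every second while the model is answering
watch -n 1 nvidia-smi

# or query only the fields of interest
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
```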

Using Your Own Hugging Face Model with Ollama

1. Install Hugging Face CLI: If you want to use your own model from Hugging Face, first install the Hugging Face CLI. As an example, we will use a fine-tuned Mistral model, TheBloke/em_german_mistral_v01-GGUF (file em_german_mistral_v01.Q4_K_M.gguf).

2. Download Your Model: Download your desired model from Hugging Face. For example, to download the fine-tuned Mistral model above:

```bash
pip3 install huggingface-hub

# Try with my custom model for fine-tuned Mistral
huggingface-cli download TheBloke/em_german_mistral_v01-GGUF em_german_mistral_v01.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
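Since the Q4_K_M file is several gigabytes, it is worth confirming the download finished and landed in the directory where the Modelfile will look for it:

```bash
# confirm the GGUF file is present in the current directory
ls -lh em_german_mistral_v01.Q4_K_M.gguf
```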

3. Create a Model File: Create a model config file named Modelfile with the following content:

```
FROM em_german_mistral_v01.Q4_K_M.gguf

# set the temperature [higher is more creative, lower is more deterministic]
PARAMETER temperature 0

# optionally set a system message, e.g.:
# SYSTEM """
# You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
# """
```
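If you want to tune the model further, the Modelfile accepts additional runtime parameters. The sketch below appends two of them; the values are purely illustrative, and the stop sequence in particular depends on the prompt format your model expects.

```bash
# append illustrative extra parameters to the Modelfile (values are examples only)
cat >> Modelfile <<'EOF'
# context window size in tokens
PARAMETER num_ctx 4096
# example stop sequence; adjust to your model's prompt format
PARAMETER stop "USER:"
EOF
```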

4. Instruct Ollama to Create the Model: Create the custom model using Ollama with the command:

```bash
ollama create mymodel -f Modelfile
```
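You can confirm the model was registered:

```bash
# the new model should now appear in Ollama's local model list
ollama list
```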

5. Run Your Custom Model: Run your custom model using:

```bash
ollama run mymodel
```
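Instead of the interactive chat, you can also pass a single prompt on the command line; the German prompt here is only an example, chosen because this model was fine-tuned on German data.

```bash
# one-shot prompt: the model answers and the command exits
ollama run mymodel "Erkläre in einem Satz, was eine GPU ist."
```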

By following these steps, you can use Ollama for private model inference on a GPU-powered VM, keeping your machine learning projects both secure and fast.

Happy prompting!
