Intro
In this article, I'll walk you through installing and configuring an open-weights LLM (Large Language Model) such as Mistral or Llama3 locally, equipped with a user-friendly interface for analysing your documents using RAG (Retrieval Augmented Generation). This setup lets you analyse your documents without sharing your private and sensitive data with third-party AI providers such as OpenAI, Microsoft, Google, etc.
Prerequisites
- You can use pretty much any machine you want, but it's preferable to use a machine with a dedicated GPU or Apple Silicon (M1, M2, M3, etc.) for faster inference.
- Docker must be preinstalled
Installation
Ollama
Ollama is a service that allows us to easily manage and run local open weights models such as Mistral, Llama3 and more (see the full list of available models).
Ollama installation is pretty straightforward: just download it from the official website and run it. There's no need to do anything else besides installing and starting the Ollama service.
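If you want to confirm that the service is actually running before moving on, a quick check from the terminal can help. This is a minimal sketch that assumes Ollama is listening on its default address, http://localhost:11434; the `OLLAMA_URL` variable and the `ollama_is_up` helper are names made up for this example.

```shell
# Assumption: Ollama listens on its default address; override OLLAMA_URL if yours differs.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Hypothetical helper: returns 0 when the server answers an HTTP request, non-zero otherwise.
ollama_is_up() {
  curl -fsS --max-time 5 "$OLLAMA_URL" >/dev/null 2>&1
}

if ollama_is_up; then
  echo "Ollama is reachable at $OLLAMA_URL"
else
  echo "Ollama is not reachable at $OLLAMA_URL (is the service started?)"
fi
```

If the second message shows up, start the Ollama application and run the check again.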
Installing Ollama User Interface
The next step is installing the Ollama User Interface, which runs on Docker, so Docker must be installed and running before you install the Ollama UI.
To install the UI simply run the following command in the terminal:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name ollama-webui --restart always -e WEBUI_AUTH=false ghcr.io/open-webui/open-webui:main
This will install and start the Ollama UI web server locally at http://localhost:3000/
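The first start can take a little while because the image has to be downloaded. Here is a rough sketch for checking that the container is up and the web server is answering. The container name and port come from the docker command above; the variable names are just for this example.

```shell
# Values taken from the docker run command: --name ollama-webui and -p 3000:8080.
CONTAINER_NAME="ollama-webui"
WEBUI_URL="http://localhost:3000"

if command -v docker >/dev/null 2>&1; then
  # Show the container's status line (no output means it is not running).
  docker ps --filter "name=$CONTAINER_NAME" --format '{{.Names}}: {{.Status}}' \
    || echo "Could not query Docker (is the daemon running?)"
  # Print the most recent log lines to spot startup errors.
  docker logs --tail 20 "$CONTAINER_NAME" \
    || echo "No logs available for $CONTAINER_NAME"
else
  echo "docker not found on PATH"
fi

# Check that the web server answers on the published port.
if curl -fsS --max-time 5 -o /dev/null "$WEBUI_URL" 2>/dev/null; then
  echo "Web UI is answering at $WEBUI_URL"
else
  echo "Web UI is not answering at $WEBUI_URL yet"
fi
```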
Download a local model
Now that everything is up and running, we need to download a model.
Good general-purpose models as of today (May 2024) are Llama3 (from Meta) and Mistral. In this article, I'll show how to install Mistral Instruct.
Go to the Ollama library: https://ollama.com/library and type "mistral" in the search bar, then click on the first result:
Pick the instruct variant in the dropdown menu:
And copy the name and the tag of the model from the right side (don't copy the entire command, just the model_name:tag part):
In the Ollama UI, click on the username in the bottom left corner to display the popover menu, then click on "Settings":
Then click on "Models" in the sidebar. The form below allows us to download any model that Ollama supports.
Paste the model tag mistral:instruct in the text field and click download:
Model installation works the same way for any other model in the Ollama library.
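If you prefer the terminal, the same model can also be downloaded with the Ollama CLI instead of the UI form. The sketch below assumes the `ollama` command is on your PATH. `model_from_command` is a hypothetical helper that strips a leading "ollama run " or "ollama pull " in case you copied the whole command from the library page rather than just the model_name:tag part.

```shell
# Hypothetical helper: reduce a copied command to its model_name:tag part.
# Runs in a subshell, so the temporary variable does not leak into your session.
model_from_command() (
  input="$1"
  input="${input#ollama run }"
  input="${input#ollama pull }"
  printf '%s\n' "$input"
)

MODEL="$(model_from_command "ollama run mistral:instruct")"
echo "Model to pull: $MODEL"

if command -v ollama >/dev/null 2>&1; then
  # Download the model, then list everything available locally.
  ollama pull "$MODEL" || echo "Pull failed (is the Ollama service running?)"
  ollama list || echo "Could not list models"
else
  echo "ollama CLI not found on PATH"
fi
```

Models pulled this way are served by the same Ollama service, so they should also appear in the UI's model list.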
Chat with the model
Once the model is downloaded, you can select it and set it as default:
Let's see if everything works by sending a message to the model:
Great! The model is loaded and running without any issues 🎉🥳
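The same smoke test can be done without the browser by calling Ollama's HTTP API directly. This is a minimal sketch assuming the default address http://localhost:11434 and that mistral:instruct has been downloaded. `build_payload` is a hypothetical helper, and it does no JSON escaping, so keep quotes and backslashes out of the prompt.

```shell
# Assumption: Ollama listens on its default address.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Hypothetical helper: build the JSON body for /api/generate.
# Limitation: model and prompt must not contain double quotes or backslashes.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

PAYLOAD="$(build_payload "mistral:instruct" "Say hello in one short sentence.")"
printf '%s\n' "$PAYLOAD"

# Send the request; the reply is a JSON object whose "response" field holds the generated text.
curl -fsS --max-time 120 "$OLLAMA_URL/api/generate" -d "$PAYLOAD" 2>/dev/null \
  || echo "Request failed: is Ollama running and is the model downloaded?"
```

Setting "stream" to false makes the server return one JSON object instead of a stream of partial replies, which is easier to read in a terminal.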
Now we can do some interesting things with it.
Analyse documents and data - RAG (Retrieval Augmented Generation)
You can upload documents and ask questions about them. You can also provide a publicly accessible web URL and ask the model questions about its contents (online documentation, for example). All files you add to the chat remain on your machine and won't be sent to the cloud.
Working with a PDF document example
Click the "+" icon in the chat and pick any PDF document you want:
I've uploaded the "Attention Is All You Need" paper as a PDF document and asked a specific question related to it:
"What is the purpose of multi head attention mechanism?"
Let's check if the RAG worked correctly by looking into the original PDF document:
The RAG system was able to pinpoint the relevant part of the paper in order to answer the question 🎉
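To make the idea more concrete, here is a toy sketch of the retrieve-then-prompt step. It is not how the Ollama UI implements RAG: real pipelines usually rank chunks by embedding similarity, and plain keyword overlap is swapped in here so the example runs with nothing but a shell and awk. The three chunks, the `best_chunk` helper and the prompt wording are all made up for illustration.

```shell
# Hypothetical chunks, one per line (a real pipeline would split the PDF text into chunks).
CHUNKS='Multi-head attention lets the model attend to information from different representation subspaces.
The encoder is composed of a stack of six identical layers.
The base models were trained for 100,000 steps.'

QUERY="What is the purpose of multi head attention mechanism?"

# Hypothetical helper: read chunks on stdin and print the one sharing the most
# query words (only words longer than 3 characters count; the first chunk wins ties).
# Prints an empty line when no chunk matches at all.
best_chunk() {
  awk -v q="$1" '
    BEGIN { n = split(tolower(q), words, /[^a-z0-9]+/) }
    {
      line = tolower($0); score = 0
      for (i = 1; i <= n; i++)
        if (length(words[i]) > 3 && index(line, words[i]) > 0) score++
      if (score > best) { best = score; text = $0 }
    }
    END { print text }
  '
}

# Retrieval step: pick the most relevant chunk for the question.
CONTEXT="$(printf '%s\n' "$CHUNKS" | best_chunk "$QUERY")"

# Augmentation step: put the retrieved chunk in front of the question.
printf 'Answer using only the context below.\nContext: %s\nQuestion: %s\n' "$CONTEXT" "$QUERY"
```

The printed text is what would be sent to the model, so the model answers from the retrieved passage rather than from memory alone.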
Ask questions about the contents of a Web Page
The URL of the web page must be publicly accessible. If you need to authenticate in order to view the page, the RAG won't work. A workaround for a web page protected by auth is to first download it as a PDF and upload it as a simple document.
In the chat field, type # followed by a URL. For this example I'll use Doctolib's FAQ about handling relatives in your Doctolib account:
Saving the documents to your Workspace
You can also save your most frequently used documents in your workspace so you don't have to upload them every time. To do that, click on "Workspace", go to the "Documents" tab, and upload your files there:
Later, when you want to work with your documents, just go to the chat and type # in the message field. You'll be presented with all the documents from your workspace, and you can choose to work with one specific document or all of them in a single chat session:
This is just scratching the surface. The Ollama UI can be configured to make retrieval perform even better with a few tricks. If you're interested in advanced configuration and usage of this workflow, let me know in the comments.
Top comments (4)
Hello Aslan, thank you for sharing. Your walkthrough is excellent, very descriptive. If you have found a good RAG workflow without using OpenAI tooling, please do share!
I started working with tools like n8n and langflow for my RAG workflow. I could do a write-up about these if you're interested.
Well, only if you think that they have great performance... otherwise I am afraid that we must wait some more time. Thank you for replying!
Great tutorial Aslan!
How would you run this on a larger scale with hundreds of company documents? Possibly host it on own server?
Tasks it would accomplish: