Quick and Dirty Guide to Running a Local LLM and Making API Requests

Alright, buckle up because we’re diving into a quick and dirty solution for running a local LLM (large language model) and making API requests — much like what the fancy commercial solutions do. Why? Well, why not? In just about three minutes, you can have a perfectly decent system running locally for most of your tests. And if you ever feel the need to scale up to the cloud again, switching back is practically effortless.

Here’s the documentation we’ll be following, mostly so you can claim you’ve read it: the OpenAI Chat Completions API reference, at https://platform.openai.com/docs/api-reference/chat.

In particular, we’ll focus on making a request like this:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
     "model": "gpt-4o-mini",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

So far, so good, right? Nothing groundbreaking. But here’s where it gets fun…

Enter LM Studio

There's this gem of a tool called [LM Studio](https://lmstudio.ai/), which makes local LLMs much easier to handle. After installing and running your model, you’ll notice a tab with a console icon called Developer. I know, it doesn’t sound too exciting at first, but hold on, because it gets better. This tab comes with a handy CURL example that shows you exactly how to use your local model. And, wouldn't you know it, it looks pretty familiar!

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-lexi-uncensored-v2",
    "messages": [
      { "role": "system", "content": "Always answer in rhymes. Today is Thursday" },
      { "role": "user", "content": "What day is it today?" }
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": false
}'

Looks pretty familiar, right? This is the local version of what we just saw. You get the same setup as the OpenAI API request, except it’s running on your local machine. Plus, it's got a little flair — like the "Always answer in rhymes" system prompt. Poetry, anyone?
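
And because LM Studio speaks the same protocol, you can even point the official openai Python client at it and leave the rest of your code untouched. Here's a minimal sketch, assuming you have the openai package installed — the api_key value is a dummy, since the local server doesn't actually check it:

from openai import OpenAI

# Same client you'd use for the cloud, just aimed at localhost.
# LM Studio doesn't validate the key, but the client requires one.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="llama-3.1-8b-lexi-uncensored-v2",
    messages=[
        {"role": "system", "content": "Always answer in rhymes. Today is Thursday"},
        {"role": "user", "content": "What day is it today?"}
    ],
    temperature=0.7,
)

print(completion.choices[0].message.content)

Swap base_url back to the default and drop in a real key, and you're on the cloud again. That's the whole migration. (If you'd rather skip the extra dependency, plain requests works too — see the next section.)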

What About Python? We Got You.

If you prefer working with Python (and let’s be real, who doesn’t?), here’s how you’d send the same request using the requests library:

import requests

url = "http://localhost:1234/v1/chat/completions"

data = {
    "model": "llama-3.1-8b-lexi-uncensored-v2",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": -1,  # -1 tells LM Studio to generate until the model stops on its own
    "stream": False
}

# `json=` serializes the payload and sets the Content-Type header for us
response = requests.post(url, json=data)

if response.status_code == 200:
    result = response.json()
    print(result["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code}")
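
One bonus trick: the payload above sets "stream": False. Flip it to True and the server streams tokens back as they're generated, line by line, in the same server-sent-events style the OpenAI API uses (each line looks like data: {...}, ending with data: [DONE]). Here's a rough sketch of reading that stream with requests — the exact chunk layout is assumed to follow the OpenAI streaming convention:

import requests
import json

url = "http://localhost:1234/v1/chat/completions"

data = {
    "model": "llama-3.1-8b-lexi-uncensored-v2",
    "messages": [{"role": "user", "content": "Tell me a short joke."}],
    "temperature": 0.7,
    "stream": True  # ask the server to send tokens as they are generated
}

with requests.post(url, json=data, stream=True) as response:
    for line in response.iter_lines():
        # Skip keep-alives and anything that isn't a data line
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):].decode("utf-8")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
print()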

And voilà! You’re now ready to send requests to your local LLM just like you would with a commercial API. Go ahead, test it, break it, make it rhyme — the world (or at least your model) is your oyster.

Enjoy!
