Francesco Mattia

Unlocking Vision: Evaluating LLMs for Home Security

Introduction

I am diving into the vision capabilities of large language models (LLMs) to see if they can accurately classify images, specifically focusing on spotting door handle positions to tell if they’re locked or unlocked. This experiment includes basic tests to evaluate accuracy, speed, and token usage, offering an initial comparison across models.

Code on GitHub

Scenario

Imagine using a webcam to monitor door security, providing images of door handles in different lighting conditions (day and night). The system’s goal is to classify the handle’s position—vertical (locked) or horizontal (unlocked)—and report the status in a parseable format like JSON. This could be a valuable feature in home automation systems. While a traditional machine learning model trained specifically for this task would likely perform better, this experiment explores how far general-purpose LLMs can get.
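
For example, the response could look something like this (an illustrative shape, not necessarily the exact schema used in the repo):

{ "position": "vertical", "status": "locked" }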

Approach

First, I took some pictures and fed them, with no code involved, to the leading LLMs through their web interfaces (Claude 3 Opus, OpenAI GPT-4) to see if they could accurately classify door handle positions. Was this method viable, or would it end up being a waste of time?

The initial results were encouraging, but I needed to verify if the models could consistently perform well. With a binary classifier, there’s a 50% chance of guessing correctly, so I wanted to ensure the accuracy was truly meaningful.

To make the outputs as deterministic as possible, I set the temperature to 0.0. To save on tokens and improve processing speed, I resized the images with ImageMagick:

convert original_image.jpg -resize 200x200 resized.jpg

Next, I wrote a script to access Anthropic models, comparing the classification results to the actual positions indicated by the image filenames (v for vertical, h for horizontal).
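
The script isn’t reproduced in full here, but the core Anthropic call looks roughly like the sketch below. It assumes the official @anthropic-ai/sdk package; the prompt wording and helper name are illustrative rather than the exact code in the repo:

// Sketch of the classification call (Node.js, @anthropic-ai/sdk); prompt wording is illustrative.
import fs from 'node:fs';
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function classify(imagePath) {
  const data = fs.readFileSync(imagePath).toString('base64');
  const msg = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 50,
    temperature: 0.0,
    messages: [{
      role: 'user',
      content: [
        { type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data } },
        { type: 'text', text: 'Is the door handle vertical or horizontal? Reply only with JSON: {"position": "vertical" | "horizontal"}' },
      ],
    }],
  });
  return JSON.parse(msg.content[0].text); // token counts come from msg.usage
}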

./locks_classifier.js -m Haiku -v
🤖 Haiku
images/test01_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 794 ms
images/test02_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 1073 ms
images/test03_h.jpg ❌
📊 In: 202 tkn Out: 11 Time: 604 ms
...

Correct Responses: (12 / 20) 60%
Total In Tokens: 3976
Total Out Tokens: 220
Avg Time: 598 ms

The results for Haiku were somewhat underwhelming, while Sonnet performed even worse, albeit with similar speed.

I experimented with few-shot examples embedded in the prompt, but this did not improve the results.
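
(For reference, the Messages API accepts few-shot examples as earlier user/assistant turns; a rough sketch of that structure is below, where exampleImageBlock and testImageBlock are hypothetical placeholders for image content blocks like the one shown earlier.)

// Hypothetical few-shot structure: a labelled example turn, then the image to classify.
const messages = [
  { role: 'user', content: [exampleImageBlock, { type: 'text', text: 'Position?' }] },
  { role: 'assistant', content: [{ type: 'text', text: '{"position": "vertical"}' }] },
  { role: 'user', content: [testImageBlock, { type: 'text', text: 'Position?' }] },
];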

Out of curiosity, I also tested OpenAI models, adapting my scripts to accommodate their slightly different APIs (it’s frustrating that there isn’t a standard yet, right?).
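
For comparison, the equivalent OpenAI call looks roughly like the sketch below, assuming the official openai npm package; images go in as data URLs rather than raw base64 blocks, and the prompt is again illustrative:

// Sketch of the OpenAI vision call (Node.js, openai package); prompt wording is illustrative.
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function classifyOpenAI(base64Jpeg) {
  const res = await openai.chat.completions.create({
    model: 'gpt-4o',
    max_tokens: 50,
    temperature: 0.0,
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Is the door handle vertical or horizontal? Reply only with JSON: {"position": "vertical" | "horizontal"}' },
        { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${base64Jpeg}` } },
      ],
    }],
  });
  return JSON.parse(res.choices[0].message.content);
}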

The results with OpenAI models were significantly better. Although slightly slower, they were much more accurate in comparison.

GPT-4-Turbo:

./locks_classifier.js -m GPT4 -v
Responses: (16 / 20) 80% 
In Tokens: 6360 Out Tokens: 240
Avg Time: 2246 ms

The just-released GPT-4o:

./locks_classifier.js -m GPT4o -v
Responses: (20 / 20) 100% 
In Tokens: 6340 Out Tokens: 232
Avg Time: 1751 ms

What I learnt

1) LLM Performance: I was curious to see how the models would perform, and I am quite impressed by GPT-4o. It delivered high accuracy and reasonable speed. On the other hand, Haiku’s performance was somewhat disappointing, although its lower cost and faster response time make it appealing for many applications. There’s definitely potential to explore Haiku further.

2) Temperature 0.0: I was surprised by the varying responses even with the temperature set to 0.0, which should theoretically produce consistent results. This variability was unexpected and suggests that other factors may be influencing the outputs. Any ideas on why this might be happening?

🤖 Haiku *Run #1*
Responses: (5 / 11) 45%
In Tokens: 2222 Out Tokens: 121
Avg Time: 585 ms

🤖 Haiku *Run #2*
Correct Responses: (7 / 11) 64% 
In Tokens: 2222 Out Tokens: 121 
Avg Time: 585 ms

🤖 Haiku *Run #3*
Correct Responses: (4 / 11) 36% 
In Tokens: 2222 Out Tokens: 121
Avg Time: 583 ms

3) Variability in Tokenization: There is significant variability in the number of tokens generated by different models for the same input. This variability impacts cost estimates and efficiency, as token usage directly influences the expense of using these models.

Model    In Tks   Out Tks   $/M In Tks   $/M Out Tks   Images per $1
Haiku    202      11        $0.25        $1.25         15,563
Sonnet   156      11        $3.00        $15.00        1,579
GPT-4    318      12        $5.00        $15.00        565
GPT-4o   317      12        $10.00       $30.00        283
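
For reference, the “Images per $1” column is the reciprocal of the per-image cost implied by the other columns (small differences are down to rounding):

// Images per $1 = 1 / (cost of input tokens + cost of output tokens) per image
const imagesPerDollar = (inTks, outTks, inPricePerM, outPricePerM) =>
  Math.round(1e6 / (inTks * inPricePerM + outTks * outPricePerM));

imagesPerDollar(202, 11, 0.25, 1.25);   // ~15,564 (Haiku)
imagesPerDollar(318, 12, 5.00, 15.00);  // ~565 (GPT-4)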

4) Variability in Response Time: I did not expect the same model, given the same input size, to have such a wide range of response times. This variability suggests that there are underlying factors affecting the inference speed.

Model    Avg Res Time (ms)   Min Res Time (ms)   Max Res Time (ms)
Haiku    598                 351                 1073
Sonnet   605                 468                 1011
GPT-4    2246                1716                6037
GPT-4o   1751                1172                4559

Overall, while the results are interesting, accuracy can vary significantly depending on the images used. For instance, would larger images improve the performance of models like Haiku and Sonnet?

Next steps

Here are a few ideas to dive deeper into:

1. Explore Different Challenges: Consider swapping the current challenge with a different task to further test the capabilities of LLMs in various scenarios.

2. Test Local Vision-Enabled Models: Evaluate models like LLaVA 1.5 7B running locally on platforms such as LM Studio or Ollama. Would a local LLM provide a viable option? (A rough sketch of what such a call could look like follows this list.)

3. Compare with Traditional ML Models: Conduct tests against more traditional machine learning models to see how many sample images are needed to achieve similar or better accuracy.
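
For point 2, a local run could look roughly like the sketch below. It assumes Ollama is running on its default port with a llava model pulled; the endpoint and request fields are Ollama’s standard generate API, while the prompt and function name are illustrative:

// Hypothetical local test against Ollama's /api/generate endpoint with a LLaVA model.
import fs from 'node:fs';

async function classifyLocal(imagePath) {
  const image = fs.readFileSync(imagePath).toString('base64');
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llava',
      prompt: 'Is the door handle vertical or horizontal? Reply only with JSON: {"position": "vertical" | "horizontal"}',
      images: [image],  // base64-encoded image data
      stream: false,
    }),
  });
  const data = await res.json();
  return data.response; // the model's text output
}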

Let me know if you have any comments or questions. I’d love to hear your suggestions on where to go next and what tests you’d like to see conducted!
