Moondream vs Llava vs ChatGPT 4: a crude comparison

This is a crude comparison of three vision models. I wanted to see how good these models are at explaining a screenshot taken from a tech presentation: in this case, a screenshot from a video showcasing the Taco Bell order middleware.

TLDR: GPT4 is leagues ahead when it comes to understanding a picture. While the other models provide only a high-level overview, GPT4 recognises every AWS service shown. It also guesses the meaning of a seemingly random number present in the image (which happens to be related to delivery metrics).

Suspicion: I wonder if the model has been trained on this video along with its transcription, and whether that's why it performed so well in comparison to the other models.

Note that this is a very crude test. Considering the vast difference in model size, one might even call it unfair. So take it with a pinch of salt or a dollop of ketchup.

You can find the full video here : https://www.youtube.com/watch?v=sezX7CSbXTg

Here's the image that I used:

1. Moondream
Demo Space : https://moondream.ai/

Question : Explain this image
Response:

The image features two men engaged in a conversation, with one man standing to the left and the other standing to the right. They are both looking at a large piece of paper or a whiteboard, which has a diagram or flowchart drawn on it. The diagram appears to be related to the topic of their discussion. The men seem to be discussing a project or a business idea, as they are standing close to each other and appear to be actively involved in the conversation. One of the men is wearing a backpack, which can be seen on the left side of the image. The other man is wearing a tie, adding a

2. Llava 1.6 34b
Demo Space : https://llava.hliu.cc/
Question : Explain this image
Response :

The image shows two individuals standing in front of a large screen with various drawings and text. The screen appears to be a whiteboard or a projection screen with notes and diagrams that seem to be related to a presentation or a brainstorming session. The drawings include icons that might represent different concepts or ideas, such as delivery, API, event, and location. There are also numbers written, which could be related to data or statistics.

The person on the left is smiling and appears to be engaged in a conversation or presentation, while the person on the right is listening attentively. Both individuals are dressed in casual attire, suggesting an informal or collaborative work environment. The background suggests that they might be in a conference room or a similar setting designed for meetings and presentations. The overall atmosphere seems to be one of collaboration and discussion.

3. GPT4
Question : Explain this image
Response:

The image shows two people standing in front of a blackboard with various drawings and words on it that seem to be explaining a tech or business concept. On the blackboard, there are several icons and terms which are related to cloud computing services provided by AWS (Amazon Web Services). Here's a breakdown of the elements visible on the board:

Delivery Aggregator: This likely refers to a service or a platform that aggregates (combines) delivery options or services, possibly as part of a logistics or e-commerce solution.

42 million, 7000: These numbers might be statistics related to the context of the discussion, such as the number of deliveries or transactions handled by the system.

API Gateway: This is an AWS service that allows developers to create, publish, maintain, monitor, and secure APIs at any scale. It acts as a front-door to manage all the API calls.

EventBridge: AWS EventBridge is a serverless event bus service that makes it easy to connect applications together using data from your own applications, integrated Software-as-a-Service (SaaS) applications, and AWS services.

Step Functions: This AWS service allows you to coordinate multiple AWS services into serverless workflows so you can build and update apps quickly.

Lambda (spelled as LAHDA in the image): AWS Lambda is a compute service that lets you run code without provisioning or managing servers, which executes code in response to triggers such as changes in data, shifts in system state, or user actions.

Taco Bell: This could be a reference to a case study or an example being used in the context of the discussion, possibly indicating how Taco Bell might be using the aforementioned AWS services.

The dashed box around Lambda: This might be highlighting the central role of AWS Lambda in the architecture being discussed.

Car Icon: This could represent delivery vehicles, relevant in a discussion about logistics or delivery services.

From the content on the board and the way the individuals are dressed, this looks like a professional or educational presentation, possibly a seminar or a workshop related to cloud computing architectures involving AWS services. The person on the right might be the presenter, given the AWS-branded apparel, and the person on the left could be a participant or co-presenter.
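
To put a rough number on "recognises every AWS service shown", here is a minimal sketch that counts how many board terms each response mentions. The term list is an assumption: it is copied from GPT4's own breakdown above, not from an independent reading of the screenshot, so it is biased towards GPT4. The response strings are short verbatim excerpts, not the full outputs, and plain substring matching is as crude as the rest of this test.

```python
# Terms copied from GPT4's breakdown of the board (assumption: this is NOT
# an independent ground truth for what the screenshot actually contains).
BOARD_TERMS = ["API Gateway", "EventBridge", "Step Functions", "Lambda", "Taco Bell"]

# Short verbatim excerpts from the three responses quoted in this post.
RESPONSES = {
    "Moondream": (
        "They are both looking at a large piece of paper or a whiteboard, "
        "which has a diagram or flowchart drawn on it."
    ),
    "Llava 1.6 34b": (
        "The drawings include icons that might represent different concepts "
        "or ideas, such as delivery, API, event, and location."
    ),
    "GPT4": (
        "API Gateway: This is an AWS service that allows developers to create, "
        "publish, maintain, monitor, and secure APIs at any scale. "
        "EventBridge: AWS EventBridge is a serverless event bus service "
        "Step Functions: This AWS service allows you to coordinate multiple "
        "AWS services into serverless workflows "
        "Lambda (spelled as LAHDA in the image): AWS Lambda is a compute service "
        "Taco Bell: This could be a reference to a case study"
    ),
}


def terms_mentioned(response: str, terms: list[str]) -> list[str]:
    """Return the terms that appear in the response (case-insensitive substring match)."""
    lowered = response.lower()
    return [term for term in terms if term.lower() in lowered]


# Map each model name to the list of board terms its excerpt mentions.
scores = {name: terms_mentioned(text, BOARD_TERMS) for name, text in RESPONSES.items()}

for name, found in scores.items():
    print(f"{name}: {len(found)}/{len(BOARD_TERMS)} terms -> {found}")
```

With these excerpts, GPT4 matches all five terms and the other two match none. Llava's generic "API" and "event" do not count as "API Gateway" or "EventBridge", which is the gap described in the TLDR.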
