Aryan Kargwal

Stress Testing VLMs: Multi QnA and Description Tasks

Video Link: https://youtu.be/pwW9zwVQ4L8
Repository Link: https://github.com/aryankargwal/genai-tutorials/tree/main


In the fast-evolving world of AI, Vision-Language Models (VLMs) have garnered attention for their ability to understand and generate responses based on visual and textual inputs. However, testing these models in a structured environment and comparing their performance across various scenarios is still a challenging task. This blog will walk you through an experiment where we used a custom-built Streamlit web application to stress test multiple VLMs like Llama 3.2, Qwen 2 VL, and GPT 4o on a range of tasks. We analyzed their response tokens and latency when generating answers to complex, multimodal questions.

Please note, however, that most of the findings are withheld for now, as this application is part of my process of building a VLM benchmark, the first installment of which you can check out on Hugging Face as SynCap-Flickr8K!

Why Compare Vision-Language Models?

The ability to compare the performance of different VLMs across domains is critical for:

  1. Understanding model efficiency (tokens used, latency).
  2. Measuring how well models can generate coherent responses based on image inputs and textual prompts.
  3. Creating benchmark datasets to further improve and fine-tune VLMs.

To achieve this, we built a VLM Stress Testing Web App in Python, utilizing Streamlit for a user-friendly interface. This allowed us to upload images, input textual prompts, and obtain model-generated responses in real time. The app also calculated and logged critical metrics such as the number of tokens used in responses and latency.

Project Setup

Our main application file, app.py, uses Streamlit as the frontend and is integrated with API requests to call the different VLMs. Each query to a model includes:

  • Image: Encoded in Base64 format.
  • Question: A text input by the user.
  • Model ID: We allow users to choose between multiple VLMs.
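
In the app, these three inputs map directly onto Streamlit widgets. The following is a minimal sketch of that front end, not the exact app.py: the model IDs in the MODELS mapping are placeholders, and query_model is the request helper shown in the next section.

import base64

import streamlit as st

# Placeholder display-name-to-model-ID mapping; the real IDs depend on the API provider.
MODELS = {
    "Llama 3.2": "llama-3.2-vision",
    "Qwen 2 VL": "qwen2-vl",
    "GPT 4o": "gpt-4o",
}

st.title("VLM Stress Test")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Ask a question about the image")
model_name = st.selectbox("Model", list(MODELS.keys()))

if st.button("Run query") and uploaded and question:
    # Encode the uploaded image as Base64 before it goes into the request payload.
    base64_image = base64.b64encode(uploaded.read()).decode("utf-8")
    result = query_model(base64_image, question, MODELS[model_name])  # defined below
    st.json(result)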

The API response includes:

  • Answer: The model-generated text.
  • Latency: Time taken for the model to generate the answer.
  • Token Count: Number of tokens used by the model in generating the response.

Below is the code structure for querying the models:

import os
import requests

# Endpoint and headers for the VLM API. Both are placeholders; the real values
# depend on the provider the requests are routed through.
url = "https://api.example.com/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {os.environ.get('API_KEY', '')}",
    "Content-Type": "application/json",
}


def query_model(base64_image, question, model_id, max_tokens=300, temperature=0.9, stream=False, frequency_penalty=0.2):
    # Wrap the Base64 image in the data-URL format expected by chat-style vision APIs.
    image_content = {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
        }
    }

    # A single user turn carrying both the text prompt and the image.
    data = {
        "model": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    image_content
                ]
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
        "frequency_penalty": frequency_penalty
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()
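Around each call, the app records latency and the token count. The exact response schema depends on the API provider, so the field names below are assumptions (they follow an OpenAI-style layout); latency, at least, can always be measured client-side.

import time

def timed_query(base64_image, question, model_id):
    # Measure wall-clock latency around the API call.
    start = time.time()
    response = query_model(base64_image, question, model_id)
    latency = time.time() - start

    # Assumed, OpenAI-style field names; adjust to whatever your provider returns.
    answer = response["choices"][0]["message"]["content"]
    tokens = response.get("usage", {}).get("completion_tokens")

    return {"answer": answer, "latency": latency, "tokens": tokens}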

Task Definitions and Experiments

We tested four different tasks, spanning four domains, using the following models:

  1. Llama 3.2
  2. Qwen 2 VL
  3. GPT 4o

Domains:

  • Medical: Questions related to complex medical scenarios.
  • Retail: Product-related queries.
  • CCTV: Surveillance footage analysis.
  • Art: Generating artistic interpretations and descriptions.

The experiment involved five queries per task for each model, and we recorded the following metrics:

  • Tokens: The number of tokens used by the model to generate a response.
  • Latency: Time taken to return the response.
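
The per-task summary statistics reported below can then be computed with the standard library. This is a sketch of the aggregation, assuming the timed_query helper above and a list of (image, question) pairs per task.

import statistics

def run_task(task_queries, model_id):
    # task_queries: list of (base64_image, question) pairs for one domain.
    results = [timed_query(img, q, model_id) for img, q in task_queries]

    tokens = [r["tokens"] for r in results]
    latencies = [r["latency"] for r in results]

    return {
        "mean_tokens": statistics.mean(tokens),
        "std_tokens": statistics.stdev(tokens),  # sample standard deviation
        "mean_latency": statistics.mean(latencies),
        "std_latency": statistics.stdev(latencies),
    }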

Results

Token Usage Comparison

The tables below highlight the token usage across the four domains for both Llama and GPT models.

Token usage for the Llama model:

| Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Std. Dev. (Tokens) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (Llama) | 1 | 12 | 1 | 1 | 1 | 3.2 | 4.81 |
| Retail (Llama) | 18 | 39 | 83 | 40 | 124 | 60.8 | 32.77 |
| CCTV (Llama) | 18 | 81 | 83 | 40 | 124 | 69.2 | 37.29 |
| Art (Llama) | 11 | 71 | 88 | 154 | 40 | 72.2 | 51.21 |

Token usage for the GPT model:

| Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Std. Dev. (Tokens) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (GPT) | 1 | 10 | 1 | 1 | 1 | 2.4 | 4.04 |
| Retail (GPT) | 7 | 13 | 26 | 14 | 29 | 17.8 | 8.53 |
| CCTV (GPT) | 7 | 8 | 26 | 14 | 29 | 16.8 | 7.69 |
| Art (GPT) | 10 | 13 | 102 | 43 | 35 | 40.6 | 35.73 |

Latency Comparison

Latency, measured in seconds, is another critical factor in evaluating each model's performance, especially for real-time applications. The following tables display latency results for the same set of tasks.

Latency for the Llama model:

| Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Std. Dev. (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (Llama) | 0.74 | 0.97 | 0.78 | 0.98 | 1.19 | 0.73 | 0.19 |
| Retail (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
| CCTV (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
| Art (Llama) | 1.35 | 2.46 | 2.91 | 4.45 | 2.09 | 2.46 | 1.06 |

Latency for the GPT model:

| Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Std. Dev. (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medical (GPT) | 1.35 | 1.50 | 1.21 | 1.50 | 1.23 | 1.38 | 0.10 |
| Retail (GPT) | 1.24 | 1.77 | 2.12 | 1.35 | 1.83 | 1.63 | 0.29 |
| CCTV (GPT) | 1.20 | 2.12 | 1.80 | 1.35 | 1.83 | 1.68 | 0.32 |
| Art (GPT) | 1.24 | 1.77 | 7.69 | 3.94 | 2.41 | 3.61 | 2.29 |

Observations

  1. Token Efficiency: Llama models generally use fewer tokens in response generation for simpler tasks like Medical compared to more complex domains like Art.
  2. Latency: Latency is higher for more complex images, especially for tasks like Retail and Art, indicating that these models take more time when generating in-depth descriptions or analyzing images.
  3. GPT vs. Llama: GPT models generally had lower token counts across the tasks, but the latency was comparable, with GPT showing slightly more variability in complex tasks like Art.

Conclusion and Future Work

This experiment highlights the importance of evaluating both token efficiency and latency when stress testing VLMs. The VLM Stress Test App allows us to quickly compare multiple models and analyze their performance across a variety of real-world tasks.

Future Plans:

  • Additional Models: We plan to add more models like Mistral and Claude to the comparison.
  • Expanded Dataset: New tasks in domains like Legal and Education will be added to challenge the models further.
  • Accuracy Metrics: We'll also integrate accuracy metrics like BLEU and ROUGE scores in the next iteration.
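
As a preview of that direction, a reference-based metric such as BLEU can be computed with NLTK once each query has a ground-truth answer; the reference and candidate tokens below are made up purely for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative tokenized reference and model output; real references would come
# from the benchmark dataset.
reference = ["a", "golden", "retriever", "running", "on", "the", "beach"]
candidate = ["a", "dog", "running", "on", "the", "beach"]

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")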

Check out our GitHub repository for the code and further instructions on how to set up and run your own VLM experiments.
