As machine learning practitioners, we often face the challenge of acquiring high-quality training data. While services like OpenAI's API have made synthetic data generation possible, the associated costs can quickly become prohibitive, especially for larger datasets or iterative experimentation. Enter Promptwright, an innovative Python library from Stacklok that is changing the game by enabling synthetic dataset generation using local Large Language Models (LLMs).
The Evolution of Synthetic Data Generation
The machine learning community has long recognized the value of synthetic data. It allows us to augment existing datasets, explore edge cases, and even create entirely new training scenarios. However, the traditional approach of using cloud-based LLMs comes with significant drawbacks: high costs, potential rate limiting, and dependency on external services.
The Technical Foundation
At its core, Promptwright leverages Ollama, a framework for running LLMs locally. This integration is particularly useful because it lets users run a wide range of open-source models without managing model deployment themselves. The library's architecture is built around several key components that work together:
The Dataset class serves as the primary interface for managing generated data, handling both creation and persistence.
The LocalDataEngine acts as the orchestrator, managing interactions with the LLM client and coordinating the generation process.
Configuration is handled through the LocalEngineArguments class, which provides fine-grained control over the generation process, including instruction setting, system prompts, and temperature parameters.
Real-world Applications
Consider a scenario where a research team needs to generate thousands of conversation examples for training a specialized chatbot. Using cloud-based solutions, this could cost hundreds or even thousands of dollars. With Promptwright, the team can:
- Pull their preferred model through Ollama (such as llama3:latest)
- Define their specific instruction set and system prompts
- Generate the required dataset locally
- Save the results in JSONL format for further processing
- Push the dataset to the Hugging Face Hub in Parquet format
The output format is well-structured and conforms to the standard chat-messages schema widely used for fine-tuning, as shown in this example:
{
  "messages": [
    {
      "role": "system",
      "content": "You are tasked with designing an immersive virtual reality experience..."
    },
    {
      "role": "user",
      "content": "Create a descriptive passage about a character discovering their hidden talents."
    },
    {
      "role": "assistant",
      "content": "As she stared at the canvas, Emma's fingers hovered above the paintbrushes..."
    }
  ]
}
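Because each record is one JSON object per line (JSONL), it is easy to validate the schema before using the data downstream. The sketch below is a minimal, standalone example (not part of Promptwright's API) that loads a JSONL file and checks that each record carries the expected roles in order:

```python
import json

EXPECTED_ROLES = ("system", "user", "assistant")

def load_and_validate(path):
    """Load a JSONL dataset and verify each record's message roles."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = tuple(m["role"] for m in record["messages"])
            if roles != EXPECTED_ROLES:
                raise ValueError(f"Line {line_no}: unexpected roles {list(roles)}")
            records.append(record)
    return records
```

A check like this catches malformed generations early, before they reach a training pipeline.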
Local Execution
The decision to focus on local execution brings several advantages that aren't immediately obvious. First, it provides complete control over the generation process. Users can adjust model parameters, retry failed generations, and fine-tune the output without worrying about API costs or rate limits.
Moreover, local execution means that sensitive data never leaves your environment. For organizations working with confidential information or under strict data governance requirements, this is a crucial feature that sets Promptwright apart from cloud-based alternatives.
Integration with the ML Ecosystem
One of Promptwright's most thoughtful features is its integration with the Hugging Face Hub. Through the HFUploader utility class, users can easily share their generated datasets with the broader machine learning community. This integration helps bridge the gap between local development and collaborative research, enabling teams to:
- Version control their synthetic datasets
- Share their generation approaches with the community
- Build upon existing synthetic datasets
- Contribute to the broader ML ecosystem
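Pushing to the Hub in Parquet format means the nested chat records must first be flattened into a columnar layout. Promptwright's HFUploader handles the upload itself; the sketch below is only an illustration, using the standard library, of what that data shaping step looks like (a Parquet writer such as pyarrow would then consume the columns):

```python
def to_columnar(records):
    """Flatten chat records into one list per role, the columnar
    shape a Parquet writer (e.g. pyarrow.Table.from_pydict) expects."""
    columns = {"system": [], "user": [], "assistant": []}
    for record in records:
        row = {m["role"]: m["content"] for m in record["messages"]}
        for role in columns:
            columns[role].append(row.get(role, ""))
    return columns
```

From here, pyarrow's `Table.from_pydict` plus `parquet.write_table` would produce the Parquet file to upload.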
How to Use
Let's generate some prompts for technical blogs on dev.to focused on JavaScript development.
You will first need a few prerequisites in place: install the library, run Ollama locally, and pull a model of your choice; in this instance we will use gemma2:2b.
pip install promptwright
ollama serve
ollama pull {model_name} # whichever model you want to use
Let's first set up a LocalDataEngine instance and configure it through LocalEngineArguments:
engine = LocalDataEngine(
    args=LocalEngineArguments(
        instructions="Generate technical writing prompts for blog posts, centred on example responses around JavaScript.",
        system_prompt="You are a creative technical blogger providing writing prompts and example responses for JavaScript blogs.",
        model_name="gemma2:2b",
        temperature=0.9,  # Higher temperature for more creative variations
        max_retries=2,
        prompt_template="""Return this exact JSON structure with a writing prompt and creative response:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a technical writing instructor providing writing prompts and example responses, centred on example responses around JavaScript."
    },
    {
      "role": "user",
      "content": "Write a blog about JavaScript, for a developer audience, of 500 words or more."
    },
    {
      "role": "assistant",
      "content": "JavaScript is a powerful, flexible language, and knowing a few cool tricks can make your code cleaner, faster, and more efficient. Below are 20 practical JavaScript tips and tricks that you can use in real-world applications to enhance your development process."
    }
  ]
}""",
    )
)
We then move to the bulk of the engine operations where the magic happens:
try:
    # Test the Ollama connection
    print("\nTesting Ollama connection...")
    models = engine.llm_client.list_local_models()
    print(f"Available models: {[m['name'] for m in models]}")

    # Generate a single test sample
    print("\nGenerating writing test sample...")
    dataset = engine.create_data(num_steps=1, batch_size=1, topic_tree=None)

    # Ensure consistency by adding the system message explicitly
    for data in dataset:
        if not any(msg.get("role") == "system" for msg in data.get("messages", [])):
            data["messages"].insert(
                0,
                {
                    "role": "system",
                    "content": "You are a technical writing instructor providing writing prompts and example responses.",
                },
            )

    if len(dataset) > 0:
        print("\nTest successful! Starting main generation...")
        response = input("\nProceed with full writing dataset generation? (y/n): ")
        if response.lower() == "y":
            dataset = engine.create_data(num_steps=100, batch_size=10, topic_tree=None)

            # Ensure system message consistency
            for data in dataset:
                if not any(msg.get("role") == "system" for msg in data.get("messages", [])):
                    data["messages"].insert(
                        0,
                        {
                            "role": "system",
                            "content": "You are a creative writing instructor providing writing prompts and example responses.",
                        },
                    )

            dataset.save("javascript_blog_dataset.jsonl")
            print(
                f"\nSaved {len(dataset)} writing prompts and responses to javascript_blog_dataset.jsonl"
            )
    else:
        print("\nError: Test generation failed")
except Exception as e:
    print(f"\nError encountered: {str(e)}")
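At a temperature of 0.9, large runs can produce near-duplicate samples. Promptwright itself does not require this step, but a simple post-processing pass (sketched below, stdlib only) can drop records whose user and assistant content has already been seen:

```python
import hashlib
import json

def deduplicate(records):
    """Drop records whose non-system message content duplicates an earlier record."""
    seen = set()
    unique = []
    for record in records:
        key_src = json.dumps(
            [m["content"] for m in record["messages"] if m["role"] != "system"]
        )
        digest = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Hashing only the user and assistant content means records that differ solely in their injected system message are still treated as duplicates.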
Let's take a look at a single generated sample.
{
  "messages": [
    {
      "role": "system",
      "content": "Welcome, aspiring JavaScript developers! Today, we'll explore how to craft engaging and informative technical blogs that attract readers."
    },
    {
      "role": "user",
      "content": "I want to write a blog about improving website performance in Javascript. 500 words or more is ideal. I'd like to focus on practical tips for optimizing front-end code."
    },
    {
      "role": "assistant",
      "content": "Here are some ideas to get you started: \n\n* **Optimization Techniques:** Explore common bottlenecks and discuss solutions such as lazy loading, memoization, and caching. Use real-world examples and tools like Lighthouse or WebPageTest.\n* **Code Structure & Design:** Analyze the impact of modularity, code clarity, and consistent naming conventions on performance. Focus on writing clean, maintainable JavaScript for scalability.\n* **Performance Monitoring Tools:** Explain how to use browser developer tools (Network tab, Performance tab) for performance analysis. Provide step-by-step guidance on how to identify and address issues like slow loading times or memory leaks.\n* **Frameworks & Libraries:** Consider discussing the advantages of frameworks like React or Vue.js when it comes to optimizing website development.\n* **Future Trends:** Mention emerging tools like WebAssembly, Server-Side Rendering (SSR), and Progressive Web Apps (PWAs) that can revolutionize performance in the long term."
    }
  ]
}
As a final check, take the content from both the user and the assistant messages, paste it into ChatGPT, and see what comes back: a complete draft article built from the generated prompt.
Conclusion
Promptwright represents a significant step forward in making synthetic dataset generation more accessible and cost-effective. Its focus on local execution, combined with thoughtful integration points and a clean architecture, makes it a valuable tool for ML practitioners, researchers, and organizations looking to leverage synthetic data in their work.
The library's open-source nature and integration with popular tools like Hugging Face Hub suggest it will continue to evolve and improve with community input. For teams looking to break free from the constraints of cloud-based synthetic data generation, Promptwright offers a compelling alternative that deserves serious consideration.
Whether you're working on research projects, developing commercial applications, or exploring new ML techniques, Promptwright provides the tools needed to generate high-quality synthetic datasets without breaking the bank. As the ML community continues to grow and evolve, tools like this will play an increasingly important role in democratizing access to advanced ML techniques and enabling innovation across the field.
GitHub Repository: https://github.com/StacklokLabs/promptwright
PyPi Registry: https://pypi.org/project/promptwright/