As machine learning practitioners, we often face the challenge of acquiring high-quality training data. While services like OpenAI's API have made synthetic data generation possible, the associated costs can quickly become prohibitive, especially for larger datasets or iterative experimentation. Enter Promptwright, an innovative Python library from Stacklok that is changing the game by enabling synthetic dataset generation using local Large Language Models (LLMs).
The Evolution of Synthetic Data Generation
The machine learning community has long recognized the value of synthetic data. It allows us to augment existing datasets, explore edge cases, and even create entirely new training scenarios. However, the traditional approach of using cloud-based LLMs comes with significant drawbacks: high costs, potential rate limiting, and dependency on external services.
The Technical Foundation
At its core, Promptwright leverages Ollama, a framework for running LLMs locally. This integration is particularly useful because it lets users run a wide range of open-source models without managing model deployment themselves. The library's architecture is built around several key components that work together:
The Dataset class serves as the primary interface for managing generated data, handling both creation and persistence.
The LocalDataEngine acts as the orchestrator, managing interactions with the LLM client and coordinating the generation process.
Configuration is handled through the LocalEngineArguments class, which provides fine-grained control over the generation process, including instruction setting, system prompts, and temperature parameters.
Real-world Applications
Consider a scenario where a research team needs to generate thousands of conversation examples for training a specialized chatbot. Using cloud-based solutions, this could cost hundreds or even thousands of dollars. With Promptwright, the team can:
- Pull their preferred model through Ollama (such as llama3:latest)
- Define their specific instruction set and system prompts
- Generate the required dataset locally
- Save the results in JSONL format for further processing
- Push the dataset to the Hugging Face Hub in Parquet format
The output format is well-structured and conforms to the standard chat-messages schema widely used for fine-tuning, as shown in this example:
{
  "messages": [
    {
      "role": "system",
      "content": "You are tasked with designing an immersive virtual reality experience..."
    },
    {
      "role": "user",
      "content": "Create a descriptive passage about a character discovering their hidden talents."
    },
    {
      "role": "assistant",
      "content": "As she stared at the canvas, Emma's fingers hovered above the paintbrushes..."
    }
  ]
}
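Because each record is one JSON object per line (JSONL), it is easy to validate the schema before using the data downstream. The sketch below is a minimal, standalone example (not part of Promptwright's API) that loads a JSONL file and checks that each record carries the expected roles in order:

```python
import json

EXPECTED_ROLES = ("system", "user", "assistant")

def load_and_validate(path):
    """Load a JSONL dataset and verify each record's message roles."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = tuple(m["role"] for m in record["messages"])
            if roles != EXPECTED_ROLES:
                raise ValueError(f"Line {line_no}: unexpected roles {list(roles)}")
            records.append(record)
    return records
```

A check like this catches malformed generations early, before they reach a training pipeline.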
Local Execution
The decision to focus on local execution brings several advantages that aren't immediately obvious. First, it provides complete control over the generation process. Users can adjust model parameters, retry failed generations, and fine-tune the output without worrying about API costs or rate limits.
Moreover, local execution means that sensitive data never leaves your environment. For organizations working with confidential information or under strict data governance requirements, this is a crucial feature that sets Promptwright apart from cloud-based alternatives.
Integration with the ML Ecosystem
One of Promptwright's most thoughtful features is its integration with the Hugging Face Hub. Through the HFUploader utility class, users can easily share their generated datasets with the broader machine learning community. This integration helps bridge the gap between local development and collaborative research, enabling teams to:
- Version control their synthetic datasets
- Share their generation approaches with the community
- Build upon existing synthetic datasets
- Contribute to the broader ML ecosystem
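Pushing to the Hub in Parquet format means the nested chat records must first be flattened into a columnar layout. Promptwright's HFUploader handles the upload itself; the sketch below is only an illustration, using the standard library, of what that data shaping step looks like (a Parquet writer such as pyarrow would then consume the columns):

```python
def to_columnar(records):
    """Flatten chat records into one list per role, the columnar
    shape a Parquet writer (e.g. pyarrow.Table.from_pydict) expects."""
    columns = {"system": [], "user": [], "assistant": []}
    for record in records:
        row = {m["role"]: m["content"] for m in record["messages"]}
        for role in columns:
            columns[role].append(row.get(role, ""))
    return columns
```

From here, pyarrow's `Table.from_pydict` plus `parquet.write_table` would produce the Parquet file to upload.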
How to Use
Let's generate some prompts for technical blogs on dev.to focused on JavaScript development.
You will first need a few prerequisites in place: install the library, run Ollama locally, and pull a model of your choice; in this instance we will use gemma2:2b.
pip install promptwright
ollama serve
ollama pull {model_name} # whichever model you want to use
Let's first set up a LocalDataEngine instance and configure it through LocalEngineArguments:
engine = LocalDataEngine(
    args=LocalEngineArguments(
        instructions="Generate technical writing prompts for blog posts, centred on example responses around JavaScript.",
        system_prompt="You are a creative technical blogger providing writing prompts and example responses for JavaScript blogs.",
        model_name="gemma2:2b",
        temperature=0.9,  # Higher temperature for more creative variations
        max_retries=2,
        prompt_template="""Return this exact JSON structure with a writing prompt and creative response:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a technical writing instructor providing writing prompts and example responses, centred on example responses around JavaScript."
    },
    {
      "role": "user",
      "content": "Write a blog about JavaScript, for a developer audience, of 500 words or more."
    },
    {
      "role": "assistant",
      "content": "JavaScript is a powerful, flexible language, and knowing a few cool tricks can make your code cleaner, faster, and more efficient. Below are 20 practical JavaScript tips and tricks that you can use in real-world applications to enhance your development process."
    }
  ]
}""",
    )
)
We then move to the bulk of the engine operations where the magic happens:
try:
    # Test the Ollama connection
    print("\nTesting Ollama connection...")
    models = engine.llm_client.list_local_models()
    print(f"Available models: {[m['name'] for m in models]}")

    # Generate a single test sample
    print("\nGenerating writing test sample...")
    dataset = engine.create_data(num_steps=1, batch_size=1, topic_tree=None)

    # Ensure consistency by adding the system message explicitly
    for data in dataset:
        if not any(msg.get("role") == "system" for msg in data.get("messages", [])):
            data["messages"].insert(
                0,
                {
                    "role": "system",
                    "content": "You are a technical writing instructor providing writing prompts and example responses.",
                },
            )

    if len(dataset) > 0:
        print("\nTest successful! Starting main generation...")
        response = input("\nProceed with full writing dataset generation? (y/n): ")
        if response.lower() == "y":
            dataset = engine.create_data(num_steps=100, batch_size=10, topic_tree=None)

            # Ensure system message consistency
            for data in dataset:
                if not any(msg.get("role") == "system" for msg in data.get("messages", [])):
                    data["messages"].insert(
                        0,
                        {
                            "role": "system",
                            "content": "You are a creative writing instructor providing writing prompts and example responses.",
                        },
                    )

            dataset.save("javascript_blog_dataset.jsonl")
            print(
                f"\nSaved {len(dataset)} writing prompts and responses to javascript_blog_dataset.jsonl"
            )
    else:
        print("\nError: Test generation failed")
except Exception as e:
    print(f"\nError encountered: {str(e)}")
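At a temperature of 0.9, large runs can produce near-duplicate samples. Promptwright itself does not require this step, but a simple post-processing pass (sketched below, stdlib only) can drop records whose user and assistant content has already been seen:

```python
import hashlib
import json

def deduplicate(records):
    """Drop records whose non-system message content duplicates an earlier record."""
    seen = set()
    unique = []
    for record in records:
        key_src = json.dumps(
            [m["content"] for m in record["messages"] if m["role"] != "system"]
        )
        digest = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Hashing only the user and assistant content means records that differ solely in their injected system message are still treated as duplicates.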
Let's take a look at a single generated sample.
{
  "messages": [
    {
      "role": "system",
      "content": "Welcome, aspiring JavaScript developers! Today, we'll explore how to craft engaging and informative technical blogs that attract readers."
    },
    {
      "role": "user",
      "content": "I want to write a blog about improving website performance in Javascript. 500 words or more is ideal. I'd like to focus on practical tips for optimizing front-end code."
    },
    {
      "role": "assistant",
      "content": "Here are some ideas to get you started: \n\n* **Optimization Techniques:** Explore common bottlenecks and discuss solutions such as lazy loading, memoization, and caching. Use real-world examples and tools like Lighthouse or WebPageTest.\n* **Code Structure & Design:** Analyze the impact of modularity, code clarity, and consistent naming conventions on performance. Focus on writing clean, maintainable JavaScript for scalability.\n* **Performance Monitoring Tools:** Explain how to use browser developer tools (Network tab, Performance tab) for performance analysis. Provide step-by-step guidance on how to identify and address issues like slow loading times or memory leaks.\n* **Frameworks & Libraries:** Consider discussing the advantages of frameworks like React or Vue.js when it comes to optimizing website development.\n* **Future Trends:** Mention emerging tools like WebAssembly, Server-Side Rendering (SSR), and Progressive Web Apps (PWAs) that can revolutionize performance in the long term."
    }
  ]
}
As a final check, take the content from both the user and the assistant messages, paste it into ChatGPT, and see what comes back: a complete draft article built from the generated prompt.
Conclusion
Promptwright represents a significant step forward in making synthetic dataset generation more accessible and cost-effective. Its focus on local execution, combined with thoughtful integration points and a clean architecture, makes it a valuable tool for ML practitioners, researchers, and organizations looking to leverage synthetic data in their work.
The library's open-source nature and integration with popular tools like Hugging Face Hub suggest it will continue to evolve and improve with community input. For teams looking to break free from the constraints of cloud-based synthetic data generation, Promptwright offers a compelling alternative that deserves serious consideration.
Whether you're working on research projects, developing commercial applications, or exploring new ML techniques, Promptwright provides the tools needed to generate high-quality synthetic datasets without breaking the bank. As the ML community continues to grow and evolve, tools like this will play an increasingly important role in democratizing access to advanced ML techniques and enabling innovation across the field.
GitHub Repository: https://github.com/StacklokLabs/promptwright
PyPi Registry: https://pypi.org/project/promptwright/