
Mustafa A. Elghrib

How We Generated a 10K Dataset Using LLM to Fine-Tune Another LLM

Introduction

After scraping the data that would serve as input for the LLM, we needed to build a dataset to fine-tune an LLM for Tarwiiga AdGen, a Google Ads generator powered by AI and developed at Tarwiiga. The tool takes an input and produces a JSON output. Since we were already relying on LLMs like OpenAI's GPT, Google's Gemini, and Anthropic's Claude to generate ads with special prompts, and on LangChain parsers to extract the JSON, we wanted to use the same approach to generate the dataset. Here, I discuss how we generated a 10K dataset.
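To make this concrete, here is a minimal sketch of what a prompt-plus-parser chain looks like with LangChain. The prompt wording, the ad fields, and the model choice are illustrative assumptions, not AdGen's actual internals:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

# Prompt that asks the model to return Google Ad elements as JSON.
prompt = ChatPromptTemplate.from_template(
    "Generate a Google Ad for the following product or service.\n"
    "Return only JSON with the keys: headlines, descriptions, keywords.\n"
    "Input: {user_input}"
)

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
parser = JsonOutputParser()  # parses the model's reply into a Python dict

# prompt -> model -> JSON dict
ad_chain = prompt | llm | parser

ad = ad_chain.invoke({"user_input": "handmade leather wallets"})
print(ad)
```

The same chain works with Gemini or Claude by swapping in the corresponding LangChain chat model.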

Generating a 10K dataset using an LLM

Old Approach 

But before that, I want to mention that I first tried to make the LLM generate everything, from inputs to outputs. I asked it for a list of 10 inputs, then looped through that list to generate the JSON outputs and save them in a CSV file. However, each time I requested a list of inputs, it produced many duplicates. I suspect this happened because the LLM's API was caching responses. While this issue could be worked around to reduce the number of duplicates, I decided to work with real data of the kind I expect to receive when the tool is actually used. Besides, generating all the inputs and then the outputs was taking too long.
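For reference, that first pipeline looked roughly like the sketch below. It is simplified, and the prompts and file name are made up for illustration; the point is the two-step loop of asking for inputs and then generating outputs:

```python
import csv
import json

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=1.0)
parser = JsonOutputParser()

# Step 1: ask the LLM for a batch of 10 candidate inputs
# (this is where the duplicates crept in).
inputs_prompt = ChatPromptTemplate.from_template(
    "List 10 short product or service descriptions someone might advertise. "
    "Return them as a JSON list of strings."
)
candidate_inputs = (inputs_prompt | llm | parser).invoke({})

# Step 2: generate a JSON ad for each input and append it to a CSV file.
ad_prompt = ChatPromptTemplate.from_template(
    "Generate a Google Ad as JSON for: {user_input}"
)
ad_chain = ad_prompt | llm | parser

with open("dataset.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for user_input in candidate_inputs:
        ad = ad_chain.invoke({"user_input": user_input})
        writer.writerow([user_input, json.dumps(ad)])
```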

Scraping Data

That's why I scraped data to use as input. With the approach I followed, as mentioned in the article, I was able to scrape millions of data points. Specifically, I scraped data from 12 categories, with each category containing 5,000 pages. Each page had about 20 inputs, for a total of 12 * 5,000 * 20 = 1,200,000 inputs. In reality, some pages contained more than 20 inputs, so I ended up with 1,239,232 data points. There were a lot of duplicate inputs - 1,173,847 to be exact - leaving me with 65,385 unique data points. While this approach didn't completely eliminate duplicate inputs, it was much faster to get inputs from another source than to rely on the LLM, which could then focus solely on generating outputs.
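Counting the duplicates is straightforward with pandas. The file and column names below are assumptions for the sake of the example:

```python
import pandas as pd

# scraped_inputs.csv is assumed to hold one scraped input per row in an "input" column.
df = pd.read_csv("scraped_inputs.csv")

total = len(df)                 # 1,239,232 in my case
unique = df["input"].nunique()  # 65,385 unique inputs
duplicates = total - unique     # 1,173,847 duplicates

print(f"total={total}, unique={unique}, duplicates={duplicates}")

# Keep only the unique inputs for the generation step.
df.drop_duplicates(subset="input").to_csv("unique_inputs.csv", index=False)
```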

Quick Overview

Since I was sending requests to LLM APIs, I needed a way to manage the generation process efficiently. I started with one category and looped through 200 pages, with each page including around 20 inputs, sometimes a bit more or less. This allowed me to generate around 3,859 data points for the first category. For another category, I generated around 3,899 data points, and for a third, 2,171 data points. In total, that's 3,859 + 3,899 + 2,171 = 9,929 data points, which is approximately a 10K dataset.

During the generation process, I was able to fine-tune Google's Gemma 2B on a 1K dataset, which yielded very good results. I will discuss fine-tuning in a future post, but for now, I want to focus on how I handled the generation process.

The Generation Process

The process is basic, and I didn't do any optimization initially; I just wanted to start and see how things would go. To understand it, let's start from the bottom up. First, we have the AdGen code that takes an input and generates a JSON output representing the Google Ad elements. This is crafted with a special prompt and parsers to extract JSON.

With around 20 inputs per page, I divided them into chunks of 5. Above this sits a loop that goes through pages to collect inputs: each run loops through 10 pages, takes the roughly 20 inputs from each page, and splits them into chunks of 5. For each input, a request was sent to the LLM and the output was saved to a CSV file. The result was a category folder with 200 page subfolders, each containing 4 dataset CSV files (one per chunk).
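In code, the chunking and the per-page folder layout looked roughly like this. The function and file names are illustrative, and generate_ad stands in for the AdGen call described above:

```python
import csv
import json
from pathlib import Path

def chunked(items, size=5):
    """Yield successive chunks of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def generate_page(category: str, page: int, inputs: list[str], generate_ad):
    """Generate ads for one page's inputs and save each chunk as its own CSV."""
    page_dir = Path("dataset") / category / f"page_{page}"
    page_dir.mkdir(parents=True, exist_ok=True)

    for chunk_id, chunk in enumerate(chunked(inputs, 5)):
        # One LLM call per input; generate_ad() returns the ad as a dict.
        rows = [(text, json.dumps(generate_ad(text))) for text in chunk]
        with open(page_dir / f"chunk_{chunk_id}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["input", "ad_json"])
            writer.writerows(rows)
```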

This process took a long time with some LLMs, like GPT-4, and was faster with others, like GPT-3.5 and Gemini 1.5 Pro. I think GPT-4 was slower because it was busy with other users' requests, though I'm not certain. There were also some issues with Gemini making a lot of retries. I ended up running the same script multiple times, changing the range of pages each time: the first run from page 0 to 10, the second from page 10 to 20, and so on.
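Each run of the script covered one slice of pages, roughly like this. The module and helper names are hypothetical, reusing generate_page from the sketch above:

```python
import argparse

# Hypothetical module: generate_page() is the sketch above, load_page_inputs()
# is assumed to read one page of scraped inputs from disk, and generate_ad()
# wraps the AdGen LLM call.
from adgen_dataset import generate_ad, generate_page, load_page_inputs

if __name__ == "__main__":
    cli = argparse.ArgumentParser()
    cli.add_argument("--category", default="restaurants")  # hypothetical category name
    cli.add_argument("--start", type=int, required=True)
    cli.add_argument("--end", type=int, required=True)
    args = cli.parse_args()

    # Each run covers one slice of pages, e.g.:
    #   python generate.py --start 0 --end 10
    #   python generate.py --start 10 --end 20
    for page in range(args.start, args.end):
        inputs = load_page_inputs(args.category, page)
        generate_page(args.category, page, inputs, generate_ad)
```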

While this approach could certainly be optimized and improved, my goal was to quickly generate a dataset for fine-tuning. With it, I was able to generate a 10K dataset, which is a good size for fine-tuning an LLM, though it still contains duplicate inputs. The unique inputs, as mentioned above, number around 65K. Generating a 65K dataset would require optimizing the code to make it faster, but that's not necessary for now; it can be done later.

Conclusion

I hope this article was helpful to you. Please don't hesitate to ask me any questions, and you can reach me on Twitter (X) and LinkedIn.
