The domain of data science now touches almost everything, and, at the field's current pace, it's apparent that it will reach every sphere of life in the coming years.
As the name suggests, "data science" revolves around data. It has been said that "Data is the new currency" and "Data is the new oil". The rise of AI has made this all the more true today. Machine learning models require vast amounts of data, and larger, higher-quality datasets generally lead to greater accuracy and better model performance.
Although the field of data science has evolved a lot during the last two decades, the challenges of data collection for AI remain. There's no doubt that today, we have access to more data than ever before, but gathering and molding the data according to the needs of the model demands a great deal of time and money. That's why big companies like Google and Microsoft spend millions of dollars just on data collection and cleaning.
Why is data collection difficult?
As we mentioned earlier, data collection for machine learning is challenging, but why is that? Here are a few reasons:
Quality: Getting lots of data is one thing, but ensuring it's accurate and relevant to the problem is a whole different challenge.
Diverse sources: Data comes from many different sources, like online platforms, sensors, and direct feedback. Each has its own format, making it difficult to handle.
Time: Collecting meaningful data isn't always quick. It can be a lengthy process to gather just the right information, and rushing it can unintentionally skew results if you're not careful.
Integration: Merging data from different sources can be like trying to blend oil and water. You need to be very careful to ensure consistency.
Bias: Sometimes, the data you collect is biased and doesn't accurately represent the bigger picture, which is usually required for better results.
These are the most common problems almost every AI expert or data scientist faces. In this article, we'll cover all these problems and their solutions and find out how to generate better datasets for AI and machine learning models.
What do you need to get started?
To completely understand the content and code samples showcased in this post, you should have the following:
Python installed
Python libraries (apify-client, imblearn, matplotlib, nlpaug, nltk, pandas, sklearn)
To install all these libraries in your environment, open your terminal and enter the following command:
```shell
pip install apify-client imbalanced-learn matplotlib nlpaug nltk pandas scikit-learn
```
It should be a quick install.
You'll use the Apify API token in this tutorial. Here's how to get it:
Sign up or log in to your Apify account (free, no credit card needed)
Navigate to the integrations section in the account settings
Here, you'll find your unique API token. Keep it handy; you'll need it shortly!
The aim of this tutorial is to build a data pipeline with the following steps:
Data acquisition
Data ingestion
Data preprocessing
Data generation
Ultimately, the data will be ready for any ML/DL model. Let's start with the first step.
How to acquire data for AI models
The first step in the data collection phase is data acquisition, which means gathering the data from a source. The source can be a website, camera, sound recorder, or any other source of data relevant to your use case.
But extracting data from a source (scraping a website, for example) is a challenging task. The most time-consuming part is creating a scraper for that particular website. The Apify platform solves this problem: it provides hundreds of ready-made scrapers that fetch data from virtually any website in just a few steps.
In this tutorial, we'll use two Amazon review scrapers from Apify Store:
bebity/amazon-reviews-scraper
junglee/amazon-reviews-scraper
These are two different scrapers even though they almost share the same name and perform similar tasks. Developers can create any tool (called Actors) on Apify and share it with other users, either for free or for a small fee. A cornerstone of the Apify platform is making it possible for developers to monetize their code.
So let's use these two Actors to fetch the reviews of two different products from the Amazon store.
💡 Note: The reason for using two different scrapers is that we want to implement two data streams to better understand the concepts of the "data ingestion" part that follows.
The code to scrape reviews of two different products from Amazon is given below:
```python
from apify_client import ApifyClient
import json

# Initialize the ApifyClient with API token
client = ApifyClient("apify_api_JaJdodNpT1eYrxklaWpG2gsL2N291O14MiAf")

# Prepare the first Actor input
first_run_input = {
    "url": "https://www.amazon.com/Apple-iPhone-11-64GB-White/dp/B08BHKSZ5P/ref=sr_1_1?keywords=iphone&qid=1692321184&sr=8-1&th=1"
}

# Prepare the second Actor input
second_run_input = {
    "productUrls": [{
        "url": "https://www.amazon.com/iPhone-Pro-1TB-Alpine-Green/dp/B0BGYCHZWF/ref=sr_1_1?crid=1KM34M1FH1SYO&keywords=iphone&qid=1692616691&sprefix=iphone%2Caps%2C384&sr=8-1&th=1"
    }],
    "proxyConfiguration": {"useApifyProxy": True},
}

print("Scraping the data for the first product. You can see the running Actors in the Apify Console")

# Run the first Actor and wait for it to finish
first_run = client.actor("bebity/amazon-reviews-scraper").call(run_input=first_run_input)
first_run_data = list(client.dataset(first_run["defaultDatasetId"]).iterate_items())

# Save the data in a JSON file
with open('first_run.json', 'w') as f:
    json.dump(first_run_data, f)

print("\nFirst five entries from first_run:")
for entry in first_run_data[:5]:
    print(entry)

print("\nScraping the data for the second product. You can see the running Actors in the Apify Console")

# Run the second Actor and wait for it to finish
second_run = client.actor("junglee/amazon-reviews-scraper").call(run_input=second_run_input)
second_run_data = list(client.dataset(second_run["defaultDatasetId"]).iterate_items())

# Save the data in a JSON file
with open('second_run.json', 'w') as f:
    json.dump(second_run_data, f)

print("\nFirst five entries from second_run:")
for entry in second_run_data[:5]:
    print(entry)
```
In a nutshell, we're using the Apify client to scrape reviews for two different products. First, we initialize the Apify client with an API token for authentication. Then we set up the product URLs as input. After that, we trigger two separate Actors to collect the reviews and save them to JSON files. Finally, we print the scraped data, which we'll use in the next step.
🗒 Note: We'll develop a classification model with five classes, ranging from 1 star to 5 stars. This model will predict a rating based on the customer's review or description.
Data ingestion for AI
Data ingestion is essentially the process of collecting and importing data from different sources, cleaning and structuring it, and then storing it for the next steps.
In this case, you have two data sources, and you may need to structure them to merge them and train the model.
🗒 Note: There are two data sources, both scraped from Amazon. The sources could just as well be different - one from Amazon and another from a different website - but this way it's easier to learn the concept of formatting and merging data from two sources.
Using Pandas for data ingestion
The first thing to do is put the data in Pandas data frames. Let's do that now.
```python
import pandas as pd

# Convert the first_run.json file into a dataframe
first_run_df = pd.read_json('first_run.json')

# Convert the second_run.json file into a dataframe
second_run_df = pd.read_json('second_run.json')

# To confirm, print the first few rows of each dataframe
print(first_run_df.head())
print(second_run_df.head())
```
The next step is to select the columns that you need for the model and remove the remaining ones. Let's print the columns and see what we have:
```python
print(second_run_df.columns)
print(first_run_df.columns)
```
The output would be something like this:
```
Index(['ratingScore', 'reviewTitle', 'reviewUrl', 'reviewReaction', 'reviewedIn',
       'date', 'country', 'countryCode', 'reviewDescription', 'isVerified',
       'variant', 'reviewImages', 'position', 'productAsin', 'reviewCategoryUrl',
       'totalCategoryRatings', 'totalCategoryReviews'],
      dtype='object')
Index(['reviewContent', 'title', 'date', 'clientName', 'notation',
       'profilePicture', 'commentImages', 'commentLink'],
      dtype='object')
```
Let's go with the following columns:
reviewContent or reviewDescription
ratingScore or notation
reviewTitle or title
reviewUrl or commentLink
date
Now let's remove the remaining columns from both data frames.
```python
# For the first_run DataFrame
first_run_df = first_run_df[['reviewContent', 'title', 'date', 'notation']]

# For the second_run DataFrame
second_run_df = second_run_df[['reviewDescription', 'ratingScore', 'reviewTitle', 'date']]
```
Finding and filling missing values
The next step is to find missing values in the columns and handle them.
```python
# Check for null values in both datasets
null_values_dataset_1 = first_run_df.isnull().sum()
null_values_dataset_2 = second_run_df.isnull().sum()

# Identify columns with null values
columns_with_null_values_1 = null_values_dataset_1[null_values_dataset_1 > 0]
columns_with_null_values_2 = null_values_dataset_2[null_values_dataset_2 > 0]

print(columns_with_null_values_1)
print(columns_with_null_values_2)
```
You can see some columns with missing values here. We'll drop those entries and generate new data from the remaining dataset in the next steps.
```python
# Drop rows with any NULL values from the first_run DataFrame
first_run_df = first_run_df.dropna()

# Drop rows with any NULL values from the second_run DataFrame
second_run_df = second_run_df.dropna()
```
This code will drop all the rows with missing values.
🗒 Note: The target column is "ratingScore" (or "notation"). If there are any missing values in the target column, it's not advisable to fill in those values. Instead, it's preferable to drop those rows, since filling in missing values in the target column can introduce bias or inaccuracies.
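To make the note concrete, here's a minimal sketch (the mini-DataFrame is invented for illustration): rows with a missing target are dropped, while a missing feature value can safely be filled instead.

```python
import pandas as pd

# Hypothetical mini-dataset with gaps in both a feature and the target
df = pd.DataFrame({
    'reviewDescription': ['Great phone', None, 'Battery drains fast', 'Love it'],
    'ratingScore': [5, 4, None, 5],
})

# Drop rows where the target is missing - never impute the label
df = df.dropna(subset=['ratingScore'])

# A missing feature value can be filled with a neutral placeholder instead
df['reviewDescription'] = df['reviewDescription'].fillna('')

print(len(df))  # 3 rows remain
```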
Making the data types and values consistent
The next step is to make the data types and values from the data sources consistent. In this case, if you look at the date column in both data frames, the format and the values differ. Let's make them consistent.
```python
# Format the date in the first dataframe
first_run_df['date'] = first_run_df['date'].str.extract(r'on (\w+ \d+, \d+)')
first_run_df['date'] = pd.to_datetime(first_run_df['date'], format='%B %d, %Y')

# Format the date in the second dataframe
second_run_df['date'] = pd.to_datetime(second_run_df['date'], format='%Y-%m-%d')

first_run_df.head(), second_run_df.head()
```
The last step in this pipeline is to make the names of the columns consistent in both the data frames and merge them.
```python
# Rename the columns in the first dataframe for consistency
first_run_df.rename(columns={
    'reviewContent': 'reviewDescription',
    'title': 'reviewTitle',
    'notation': 'ratingScore'
}, inplace=True)

# Merge the two dataframes
merged_df = pd.concat([first_run_df, second_run_df], ignore_index=True)
merged_df.head()
```
The refined data is now ready to be saved in any database - SQL or any other tool. In this case, we'll save the data in a CSV file.
```python
# Name the file
file_name = "merged_reviews.csv"

# Save the file as CSV
merged_df.to_csv(file_name, index=False)
```
AI and data augmentation
Data augmentation is the process of generating new synthetic data from the current data when you have too few samples. Augmentation methods are very popular in computer vision applications but are just as powerful for natural language processing. In computer vision, you can simply flip an image to generate a new data entry, but in NLP, you change the text by applying different methods. One of them is synonym replacement.
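To see the idea in miniature before reaching for a library, here's a toy synonym-replacement augmenter; the synonym table is invented purely for illustration.

```python
import random

# Hand-made synonym table - an assumption for illustration only
SYNONYMS = {
    'great': ['excellent', 'fantastic'],
    'phone': ['device', 'handset'],
    'fast': ['quick', 'speedy'],
}

def augment(sentence: str, rng: random.Random) -> str:
    # Swap each known word for a randomly chosen synonym; leave the rest intact
    return ' '.join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
        for w in sentence.split()
    )

print(augment("great phone charges fast", random.Random(0)))
```

Real augmenters are far more careful: they limit how many words are replaced and pick synonyms that fit the context, which is exactly what a dedicated library handles for you.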
With this method, you replace words in the existing text with synonyms to generate new text that retains the same meaning. It's a very delicate technique, because just one wrong synonym can change the whole context of the text. For this, we'll use nlpaug, a very powerful library for synonym replacement.
```shell
pip install nlpaug
```
After this, you can use the wordnet corpus to provide the synonyms.
```python
import nlpaug.augmenter.word as naw

# Initialize the SynonymAug augmenter; aug_max caps the number of words replaced per sentence
description_aug = naw.SynonymAug(aug_src='wordnet', aug_max=5)

# Define a new list to store the augmented rows
new_rows = []

# Loop through each row of the dataset
for _, row in merged_df.iterrows():
    # Augment the reviewDescription column; "n" denotes the number of augmented variants we want
    augmented_description = description_aug.augment(row['reviewDescription'], n=1)

    # Create a new row with the augmented data and other columns unchanged
    new_row = row.copy()
    if len(augmented_description) != 0:
        new_row['reviewDescription'] = augmented_description[0]
        new_rows.append(new_row)

# Convert the list of new rows to a DataFrame
new_rows_df = pd.DataFrame(new_rows)

# Append the new rows to the original DataFrame
merged_df_augmented = pd.concat([merged_df, new_rows_df], ignore_index=True)
```
The code above uses the nlpaug library to augment the reviewDescription column with synonyms from wordnet, generating a new version of each description, and then appends the augmented rows to the dataset.
Data preprocessing for machine learning
The preprocessing stage involves transforming raw textual data into a structured and clean format that can be easily fed into machine learning or deep learning models. This phase is crucial because, as the saying goes, "garbage in, garbage out." If the data is not cleaned enough, the model will eventually give bad results.
To clean the data, you need to go through several steps:
Lower casing
Removing punctuation
Tokenization
Removing stopwords
Lemmatization
If nltk is not installed in your environment, you can install it using the following command:
```shell
pip install nltk
```
Pass reviewDescription through all these steps:
```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Set the stop words
stop_words = set(stopwords.words('english'))

# Apply all preprocessing steps to a single piece of text
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to the reviewDescription column
merged_df['reviewDescription'] = merged_df['reviewDescription'].apply(preprocess_text)
```
In this code, we used the Natural Language Toolkit (NLTK) to clean the text. We changed the text to lowercase, removed punctuation, and broke it into individual words as tokens. Then we removed common stopwords like "and" or "the" and simplified each word to its root form. Finally, we transformed the tokenized words back into sentences.
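The same cleaning steps can be sketched without NLTK at all. Here's a dependency-free toy version, where the stopword list is a small hand-picked subset and lemmatization is omitted (that part genuinely needs WordNet):

```python
import re

# A tiny hand-picked stopword set - illustration only
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'for', 'to', 'of'}

def preprocess(text: str) -> str:
    text = text.lower()                  # lowercase
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = text.split()                # naive whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stopwords
    return ' '.join(tokens)

print(preprocess("The camera is AMAZING, and the battery lasts for days!"))
# camera amazing battery lasts days
```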
What if the data contains imbalanced classes?
After all that hassle, you don't want to end up with imbalanced data classes, do you? An imbalanced dataset may contain 90 samples of one class and just 10 samples of another. In this case, the model will be biased towards the first class, no matter how well-designed it is.
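That 90/10 scenario is easy to demonstrate with made-up labels: a "model" that always predicts the majority class scores 90% accuracy while learning nothing at all.

```python
from collections import Counter

# Hypothetical labels: 90 samples of one class, 10 of the other
labels = [1] * 90 + [0] * 10

majority_class, majority_count = Counter(labels).most_common(1)[0]

# Always predicting the majority class already "achieves" 90% accuracy,
# which is why raw accuracy is misleading on imbalanced data
accuracy = majority_count / len(labels)
print(majority_class, accuracy)  # 1 0.9
```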
It's very important to have a balanced dataset for an unbiased and optimized model performance. To achieve this, you need to implement methods that generate the data points of the minority class, ensuring the dataset is better suited for the model.
How to generate data for AI
In the end, it's very important to have a complete overview of the dataset. If the data is biased, you need to try to make it unbiased to improve the model's performance. Let's take a look at the distribution of the classes for the target column, ratingScore.
```python
import matplotlib.pyplot as plt

# Count the number of samples for each class in the ratingScore column
rating_counts = merged_df['ratingScore'].value_counts()

# Plot
plt.figure(figsize=(8, 6))
rating_counts.plot(kind='bar', color='skyblue')

# Set the title of the plot
plt.title('Class distribution for ratingScore')

# Set the x and y labels
plt.xlabel('Rating Score')
plt.ylabel('Number of Reviews')

plt.xticks(rotation=0)
plt.grid(axis='y')
plt.tight_layout()
plt.show()
```
The resulting graph of this example would look a little bit like this:
This means you have an unequal distribution of classes in the dataset, and you need to increase the data points or data samples of the minority classes. For this, you'll use the Synthetic Minority Oversampling Technique (SMOTE). This technique generates synthetic samples of the minority classes and tries to balance the class distribution.
Before applying SMOTE, you need to first perform two steps:
Vectorization: Convert text into a numerical format so the model can understand and process it.
Train-Test Split: Separate the data to ensure the model learns from one portion and gets tested on an unseen portion, keeping the evaluation genuine.
```python
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Define the feature extractor
tfidf_vectorizer_description = TfidfVectorizer(max_features=5000)

# Transform the reviews into vectors
X_description = tfidf_vectorizer_description.fit_transform(merged_df['reviewDescription'])

# Set the target column
y = merged_df['ratingScore']

# Apply the train-test split
X_description_train, X_description_test, y_description_train, y_description_test = train_test_split(
    X_description, y, test_size=0.2, random_state=42
)
```
The above code divides the data into train and test splits with a proportion of 80% training and 20% test.
Now, we're ready to apply SMOTE on the training data.
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42, k_neighbors=4)
X_description_train_resampled, y_description_train_resampled = smote.fit_resample(
    X_description_train, y_description_train
)
```
The code above generates synthetic samples for the minority classes; if you plot the class distribution again, it will now be equal.
```python
description_counts_train = y_description_train_resampled.value_counts()

# Plotting
plt.figure(figsize=(8, 6))
description_counts_train.plot(kind='bar', color='skyblue')
plt.title('Class distribution for ratingScore in the training set')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.tight_layout()
plt.show()
```
You'll see a graph like this:
This equal distribution will help the model generalize. After this step, the data is ready to be fed to the model, and the model will likely perform better.
An AI model is only as good as the data
We've covered almost the entire pipeline in this blog, from acquiring data to generating new samples before training a model. You learned that no matter how good the model is, it will only perform well if you feed it quality data. You learned about removing irrelevant data and only retaining what's necessary, and you discovered the importance of equal representation of all classes in our training sample.
In machine learning or deep learning, as important as the model is, perhaps the data fed to it is even more crucial.
Read next in deep learning and machine learning
Deep comparison of PyTorch vs. TensorFlow
What is Hugging Face?
6 open-source Pinecone alternatives