Tweets come with several built-in numerical signals, such as likes, comments, and retweets. However, training models to evaluate the text of a tweet and label its sentiment has many advantages. This project uses a dataset from Kaggle of tweets about Apple, each labeled negative, neutral, or positive. For simplicity, we will turn this into a binary classification problem by reducing the dataset to tweets labeled negative or positive. Negative tweets have also been relabeled from '-1' to '0'; positive tweets remain labeled '1'.
The first step is to import the dataset (with the neutral tweets removed), replace '-1' sentiments with '0', and print the first five rows.
import pandas as pd
df = pd.read_csv('apple_twitter_sentiment.csv')
df['sentiment'] = df['sentiment'].replace({-1: 0})
df.head()
[output]
Next, we'll call df.info() to make sure each column has the same number of rows and no values are missing. This also confirms our datatypes.
df.info()
[output]
Now let's check the distribution of our target variable. For that, I've written a function that returns a DataFrame with the value counts for the column as both totals and percentages, and also plots the distribution.
# function to present and plot the distribution of values in a series
# automatically displays the plot; the summary df is returned to unpack and present
def summarize_value_counts(series):
    # extract name of series
    series_name = series.name
    # build a dataframe with value count totals and percentages for the series
    series_count = series.value_counts().rename('sum')
    series_perc = series.value_counts(normalize=True).round(2).rename('percentage')
    series_values_df = pd.concat([series_count, series_perc], axis=1)
    # plot series distribution
    series_values_df['sum'].plot(kind='bar', title=f'Distribution of {series_name.title()} Column',
                                 xlabel=f'{series_name.title()}', ylabel='Count');
    # rename df index to series name
    series_values_df.index.name = series_name
    return series_values_df
Passing the sentiment column through the function shows that we have a large class imbalance between negative and positive sentiments. First we'll proceed as normal, but later we'll resample the data to account for the class imbalance and see if the model improves.
summarize_value_counts(df.sentiment)
[output]
To start, we'll have to separate our tweets and sentiments into different variables so we can further split them into a training and testing set.
text = df.text
sentiment = df.sentiment
sklearn's train_test_split function does this easily. Passing in our features and then our target gets us a training and testing set for each. It also lets us set a random state, which ensures the data in each set will not change when we rerun our models.
from sklearn.model_selection import train_test_split
# random state of 0 is established for the data
X_train, X_test, y_train, y_test = train_test_split(text, sentiment, random_state=0, test_size=0.25)
Now to handle our text data. We cannot pass strings into our models, so first we'll have to convert each tweet into a numerical representation. sklearn provides this ability with the TfidfVectorizer(). This converts our column of text into a matrix where each column is a unique word that appears anywhere in the dataset and each row is a tweet, with a floating point value in every column whose word appears in that tweet. With the TfidfVectorizer(), the value representing a word reflects not just how often that word appears in the tweet (its term frequency), but also how rarely it appears across the dataset as a whole (its inverse document frequency). This down-weights words that appear in nearly every tweet, so that common but uninformative words do not detract from the model's ability to discover underlying patterns in the data.
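As a quick toy illustration (made-up sentences, not the project data), a word that appears in every document receives a lower idf weight than a rarer one:
from sklearn.feature_extraction.text import TfidfVectorizer
# three toy documents: 'apple' appears in all of them, 'love' in only one
docs = ['apple is great', 'apple is terrible', 'love my apple phone']
vec = TfidfVectorizer()
vec.fit(docs)
# idf_ holds the learned inverse-document-frequency weight for each word
# (newer sklearn versions use get_feature_names_out() instead)
for word, idf in zip(vec.get_feature_names(), vec.idf_):
    print(word, round(idf, 2))  # 'apple' gets the lowest weight (1.0)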
To vectorize the data, all we have to do is import the vectorizer, fit and transform the training data with it and then transform the testing data.
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
X_train_tf = tf.fit_transform(X_train)
X_test_tf = tf.transform(X_test)
To view the matrix, all we have to do is convert it into a pandas DataFrame: pass in the matrix converted to an array with the .toarray() method, and pass the vectorizer's .get_feature_names() output (renamed get_feature_names_out() in newer sklearn versions) to 'columns'.
df_tf = pd.DataFrame(X_train_tf.toarray(), columns=tf.get_feature_names())
df_tf.head()
[output]
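One caveat: the vectorizer returns a sparse matrix, and .toarray() densifies it, which can be memory-heavy for large corpora. A quick check of its size and density (using the matrix from above):
# the TF-IDF matrix is sparse; check its shape and fraction of nonzero entries
print(X_train_tf.shape)  # (number of tweets, number of unique words)
print(f'density: {X_train_tf.nnz / (X_train_tf.shape[0] * X_train_tf.shape[1]):.4f}')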
Now that our string data has been converted into numerical form, we can pass it into a model. Since this is a binary classification problem, I chose LogisticRegression, but any binary classifier will work.
This is as easy as importing the model and fitting it with the vectorized training features and the sentiment labels in y_train.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train_tf, y_train)
[output]
LogisticRegression()
To score the model, first we'll use the model's .predict() method with the vectorized testing features to retrieve an array of predicted labels.
from sklearn.metrics import accuracy_score
y_pred_tf = clf.predict(X_test_tf)
print(f'accuracy: {accuracy_score(y_test, y_pred_tf)}')
[output]
accuracy: 0.8413461538461539
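Given the class imbalance we saw earlier, it's worth checking this number against a majority-class baseline (a quick sketch using sklearn's DummyClassifier and the variables above):
from sklearn.dummy import DummyClassifier
# baseline that always predicts the majority class (negative)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_tf, y_train)
print(f'baseline accuracy: {dummy.score(X_test_tf, y_test)}')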
To dig deeper into our results, first we'll plot a confusion matrix showing the count of, from left to right and top to bottom, True Negatives, False Positives, False Negatives, and True Positives.
Next we'll plot the ROC curve for both the training and testing data. Here, the greater the area under the curve, the better the model performs.
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(clf, X_test_tf, y_test);
[output]
from sklearn.metrics import plot_roc_curve
import matplotlib.pyplot as plt
# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
plot_roc_curve(clf, X_train_tf, y_train, name='Train', ax=ax)
plot_roc_curve(clf, X_test_tf, y_test, name='Test', ax=ax);
[output]
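If you prefer a single number for the ROC curve, roc_auc_score computes the area under the curve directly from predicted probabilities (a quick sketch using the fitted model above):
from sklearn.metrics import roc_auc_score
# probability of the positive class for each test tweet
probs = clf.predict_proba(X_test_tf)[:, 1]
print(f'test AUC: {roc_auc_score(y_test, probs):.3f}')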
The accuracy score suggests our model is performing well, and the ROC curve suggests it is performing even better, but the bottom row of our confusion matrix tells a different story. Our model mislabeled 33 of our positive tweets as negative and correctly labeled only 4 positive tweets.
This is because of the large class imbalance in our target variable that we discovered earlier. One way of dealing with this is SMOTE, or Synthetic Minority Over-sampling Technique, which balances the training data by generating synthetic examples of the class that has significantly fewer samples. To see if this helps the performance of our model, we'll redo our logistic regression with resampled data.
This is easily done in Python with the imbalanced-learn library, which is built on top of sklearn. To implement it, first we instantiate a SMOTE sampling object. For simplicity, we'll only pass a random state as an argument, the same state we used earlier. This means the default sampling strategy will be used, which resamples all classes except the majority class. For us, that means over-sampling the tweets labeled positive.
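To see what that default strategy does, we can compare class counts before and after resampling (a quick check using the vectorized training data from earlier; X_res and y_res are just illustrative names):
from collections import Counter
from imblearn.over_sampling import SMOTE
print(Counter(y_train))  # imbalanced: many more negative (0) than positive (1)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_tf, y_train)
print(Counter(y_res))    # balanced: both classes now match the majority count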
Let's redo our vectorizing and modeling process with SMOTE to see if the model does better at predicting true positives.
tf_sm = TfidfVectorizer()
X_train_tf_sm = tf_sm.fit_transform(X_train)
X_test_tf_sm = tf_sm.transform(X_test)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)
X_train_sm, y_train_sm = smote.fit_resample(X_train_tf_sm, y_train)
clf_sm = LogisticRegression()
clf_sm.fit(X_train_sm, y_train_sm)
y_pred_tf_sm = clf_sm.predict(X_test_tf_sm)
print(f'SMOTEd accuracy: {accuracy_score(y_test, y_pred_tf_sm)}')
[output]
SMOTEd accuracy: 0.8942307692307693
plot_confusion_matrix(clf_sm, X_test_tf_sm, y_test);
# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
plot_roc_curve(clf_sm, X_train_sm, y_train_sm, name='Train', ax=ax)
plot_roc_curve(clf_sm, X_test_tf_sm, y_test, name='Test', ax=ax);
[output]
As you can see, the ROC curve shows the same performance; however, this new model trained on SMOTEd data correctly predicted 23 more true positives than the original, with an accuracy score roughly 5 percentage points better.
As a final model, I'll give a quick example of the powerful preprocessing tools nltk comes with. A more detailed walkthrough can be found in my previous post here.
When working with textual data, preprocessing in a way that helps models get at the lexical meaning of words can greatly improve predictions from longer and more complex text features. Two steps toward this are removing filler words that don't 'mean' anything, and condensing similar words to a common form. In nltk the filler words are called 'stop words', and removing them helps reduce the noise in textual data so our models can focus on only the important words. Condensing words based on similar meaning is called lemmatizing in nltk; this process involves identifying the part of speech for each word in order to combine words whose spelling differences reflect different inflected forms of the same word, rather than words with completely different meanings.
Below I will import libraries to handle these tasks, preprocess the data we have been working with using two functions, then re-SMOTE and remodel to see how these preprocessing techniques help our model's performance.
import nltk
from nltk import pos_tag
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
# This function gets the correct part of speech so the lemmatizer can work
def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS tags to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # default to noun when the tag is unrecognized
        return wordnet.NOUN
def text_prep(text, sw):
    # tokenize, keeping only alphabetic words (with optional apostrophes)
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    text = regex_token.tokenize(text)
    # remove stop words
    text = [word for word in text if word not in sw]
    # tag each word with its part of speech and translate to wordnet tags
    text = pos_tag(text)
    text = [(word[0], get_wordnet_pos(word[1])) for word in text]
    # lemmatize each word using its part of speech
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(word[0], word[1]) for word in text]
    return ' '.join(text)
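For example, running a made-up sample sentence through the function (hypothetical text, just for illustration) drops the stop words and reduces each remaining word to its lemma:
sw = stopwords.words('english')
sample = "Apple stores were selling the new phones quickly"
print(text_prep(sample, sw))
# roughly: 'Apple store sell new phone quickly'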
tf_tok = TfidfVectorizer()
sw = stopwords.words('english')
# preprocess both the training and testing tweets before vectorizing
X_train_tokenized = [text_prep(text, sw) for text in X_train]
X_test_tokenized = [text_prep(text, sw) for text in X_test]
X_train_tf_tok = tf_tok.fit_transform(X_train_tokenized)
X_test_tf_tok = tf_tok.transform(X_test_tokenized)
smote2 = SMOTE(random_state=0)
X_train_sm2, y_train_sm2 = smote2.fit_resample(X_train_tf_tok, y_train)
clf_sm_tok = LogisticRegression()
clf_sm_tok.fit(X_train_sm2, y_train_sm2)
y_pred_tf_tok = clf_sm_tok.predict(X_test_tf_tok)
accuracy_score(y_test, y_pred_tf_tok)
[output]
0.9038461538461539
plot_confusion_matrix(clf_sm_tok, X_test_tf_tok, y_test);
[output]
# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
plot_roc_curve(clf_sm_tok, X_train_sm2, y_train_sm2, name='Train', ax=ax)
plot_roc_curve(clf_sm_tok, X_test_tf_tok, y_test, name='Test', ax=ax);
[output]
As you can see, preprocessing the tweets nudged accuracy up another point over the SMOTEd model, bringing the total improvement to roughly 6 percentage points over our original model.
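For a fuller picture than accuracy alone, a per-class report makes the gains on the minority (positive) class explicit (a quick sketch using the final model's predictions):
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_tf_tok, target_names=['negative', 'positive']))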