Stephen Collins

How to fine tune BERT for real time sentiment analysis

Post Series

  1. How to fine tune BERT for real time sentiment analysis (this post)
  2. How to run BERT on AWS

Table of Contents

  1. Introduction
  2. BERT Summary
  3. Overview of Hugging Face
  4. Setting up the remote environment
  5. Introduction to Jupyter Notebook
  6. Training BERT for social media sentiment analysis
    1. Finding and cleaning data for fine-tuning BERT
    2. Train, validation and testing method of BERT fine-tuning
      1. Training of train, validation and testing method
      2. Validation of train, validation and testing method
      3. Testing of train, validation and testing method
    3. Saving the model
  7. Conclusion

Introduction

In this blog post series (this is part one), I'll cover how to fine tune a pre-trained BERT model and give you enough information to build your own data mining cluster with BERT. To begin with, let's briefly summarize BERT.

BERT Summary

BERT (Bidirectional Encoder Representations from Transformers) is an open source natural language processing (NLP) model developed by Google in 2018. It's one of the first NLP models to use Transformers. This matters because BERT processes the words of a sentence bi-directionally, more fully capturing the context that human readers of languages like English take for granted, and that capability has allowed BERT to reach previously unprecedented levels of accuracy on NLP tasks like sentiment analysis. The BERT model has also made its way into Hugging Face, a company maintaining a library of freely available, open source machine learning models.

Overview of Hugging Face

Hugging Face exposes machine learning models with Python APIs that users can train and use in production. It offers many base models as well as fine-tuned models for more specific tasks. In this blog post, we're going to focus on Hugging Face's bert-base-uncased version of the BERT NLP model ("uncased" means the model makes no distinction between words like "Car" and "car"). The value of pre-trained models like the ones from Hugging Face is that they let users "stand on the shoulders of giants" and focus on their own problem rather than training a large model from scratch.
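
To make the "uncased" part concrete, here's a minimal sketch (assuming the transformers package is installed) showing that the bert-base-uncased tokenizer lowercases its input, so "Car" and "car" become the same token:

from transformers import BertTokenizer

# Download the uncased BERT tokenizer from the Hugging Face model hub
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Both capitalizations map to the same lowercased wordpiece token
print(tokenizer.tokenize("Car"))  # ['car']
print(tokenizer.tokenize("car"))  # ['car']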

Setting up the remote environment

So, we now know what model we are going to fine-tune with. What's next? We need to set up a remote environment. The reason is that fine-tuning a model often takes hours (if not days), and you don't want to tie up your own machine, even if it has the specs to make fine-tuning feasible. One of the easier options we found for setting up a remote environment is Lambda Labs. At least currently, we've found that Lambda Labs offers cheaper GPUs than AWS for model fine-tuning - but for everything else, we use AWS. We'll be explaining our architecture in an upcoming blog post!

The process to set up an account is pretty simple: you'll just need a payment method to complete signup. Afterwards, you'll be taken to the dashboard, where you can select which GPU-based instances you want to spin up. We used the gpu.2x.a6000 instance type. Once you have selected an instance and it finishes booting, you'll have the option to open the "cloud IDE" for the instance. This spins up a Jupyter Notebook, which we'll use to run the Python code that fine-tunes our pre-trained base BERT model on the GPU instance.

Introduction to Jupyter Notebook

Jupyter Notebook is a widely used tool in data science and machine learning. The Jupyter docs provide a ton of helpful resources for getting up to speed, but in this section we'll go over the basic concepts to help you get oriented when working with Jupyter Notebooks.

From a very high level architectural standpoint, Jupyter Notebook is an interactive GUI that executes Python code underneath with IPython. While there are a ton of features worth checking out, we'll focus on the "building blocks" of any notebook: cells.

Notebook cells let you write separate blocks of Python code that are runnable independently of other cells (but can also be run "all together" sequentially). The notebook persists the result of running a particular cell, which is especially useful when experimenting with data cleanup and transformation functions - you can keep running slight modifications of a cell, and its latest output is what gets saved and used by the next cell (if that cell's code uses a variable created by the previous cell). To clarify, every notebook cell runs in the same IPython environment; that's why re-running a cell that changes a variable's value is visible when running subsequent cells - the cell simply reassigns the variable.
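
For example, here's a hypothetical two-cell notebook; a variable assigned in the first cell is still in memory when the second cell runs, because both cells share the same IPython kernel:

# Cell 1: assign a variable
dataset_name = "sentiment140"

# Cell 2 (run after Cell 1): the variable from Cell 1 is still in memory
print(dataset_name.upper())  # SENTIMENT140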

Knowing about notebook cells - that all the cells of a given notebook (saved as a *.ipynb file) execute within the same IPython environment, and that the Jupyter Notebook interface is a GUI over the underlying IPython shell that runs the code from those cells - is enough to start working with Jupyter Notebook for training a machine learning model.

Training BERT for social media sentiment analysis

Once we have a GPU instance running on a Lambda Labs virtual machine, and are able to navigate around the installed IDE for working with Jupyter Notebooks, we can get to the most interesting part of this blog post: how to fine tune a pre-trained BERT model for analyzing sentiment from social media.

Finding and cleaning data for fine-tuning BERT

Before we get to fine-tuning BERT, we need to figure out what kind of data we can use for fine-tuning. In this tutorial, we are focused on social media sentiment analysis, so we wanted a dataset of social media posts already labeled for sentiment as a binary classification - a "1" for positive sentiment and a "0" for negative sentiment.

As luck would have it, such a dataset exists. The famous Sentiment140 dataset has often been used in Kaggle competitions, and it's also great for fine-tuning BERT with a little bit of cleanup. The issue is that while negative sentiment has a "0" score (which is great), positive sentiment has a score of "4". We need a simpler binary classification, so we do a bit of cleanup with a bash script like this:

#!/usr/bin/env bash

# "shuffle" the rows of the csv file so similar tweets aren't next to each other
shuf training.1600000.processed.noemoticon.csv -o shuffled_output.csv

# find every "4" label value, and replace it with "1"
awk -F ',' -v OFS=',' '$1 == "\"4\"" { $1 = "\"1\"" }1' shuffled_output.csv > output.csv

The script does two main things. First, we "shuffle" the initial csv file just to make sure there aren't any related / similar tweets next to each other in the Sentiment140 dataset. Granted, we are also going to tell the data pipeline to shuffle the dataset, but in this context more shuffling doesn't hurt, since we are saving this output.csv file for use in fine-tuning BERT. Second, we replace every "4" label with "1" so the scores form a simple 0/1 binary classification. The output.csv is now in good shape to create a Tensorflow dataset from, like so:

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification, InputFeatures

CSV_PATH = './output.csv'

NUM_EPOCHS = 10
# Reducing the size of the dataset to keep GPU hosting costs down as well as allowing this whole dataset to be held in this python process memory for simpler, initial fine-tuning
DATASET_SIZE = 2500
BATCH_SIZE = 25
AUTOTUNE = tf.data.experimental.AUTOTUNE

dataset = tf.data.experimental.make_csv_dataset(
    CSV_PATH,
    batch_size=BATCH_SIZE,
    column_names=['score','timestamp', 'datestring', 'N/A', 'user', 'tweet'],
    label_name='score',
    select_columns=['score', 'tweet'],
    num_epochs=NUM_EPOCHS,
    header=False,
    shuffle_seed=0,
    shuffle=True,
    num_rows_for_inference=1600000,
    ignore_errors=True,).prefetch(AUTOTUNE)

train_dataset = dataset.take(DATASET_SIZE)
validation_dataset = dataset.skip(DATASET_SIZE).take(DATASET_SIZE)

This creates a Tensorflow Dataset from our csv data file, along with our train_dataset and a validation_dataset that does not include any data from the train_dataset - which is the goal here.

We are not done yet with transforming our dataset. We need to take our now usable dataset (which still holds just raw tweet strings) and tokenize all the raw tweets before passing them to our model for fine-tuning. To do that, we need a tokenizer designed to work with BERT, called the BertTokenizer. This is another class from the Hugging Face transformers package, specifically for tokenizing input for BERT models. We can create the tokenizer like so:

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

At this point, we have our raw training dataset, our raw validation dataset, and a tokenizer instance. We have one final data cleaning step: tokenizing the input of both the training and validation datasets. No shame - this was inspired by this post about sentiment analysis with BERT:

def convert_examples_to_tf_dataset(batches, tokenizer, max_length=200):
    datasetArr = []
    for batch in batches:
      batchFeatures = [] # -> will hold InputFeatures to be converted later
      index = 0
      textArr = batch['text']
      labelArr = batch['label']
      for textItem in textArr:
        input_dict = tokenizer.encode_plus(
            textItem,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True, # pads to the right by default
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        batchFeatures.append(
            InputFeatures(
                  input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=int(labelArr[index])
              )
        )
        index += 1
      datasetArr.append(batchFeatures)

    def gen():
        for batch in datasetArr:
          for f in batch:
            yield ({
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                }, f.label)

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape(None),
        ),
    ).batch(BATCH_SIZE)

cleaned_batch = []
validation_cleaned_batch = []
for batch in train_dataset:
  rawTextArr = list(map( lambda tweet: str(tweet), batch[0]['tweet'].numpy()))
  rawLabel = list(map( lambda tweet: int(tweet), batch[1]))
  cleaned_batch.append({ 'text': rawTextArr, 'label': rawLabel })

for batch in validation_dataset:
  rawTextArr = list(map( lambda tweet: str(tweet), batch[0]['tweet'].numpy()))
  rawLabel = list(map( lambda tweet: int(tweet), batch[1]))
  validation_cleaned_batch.append({ 'text': rawTextArr, 'label': rawLabel })

# our cleaned datasets, ready to be consumed by pre-trained BERT model instance
convert_dataset = convert_examples_to_tf_dataset(cleaned_batch, tokenizer)
convert_validation_dataset = convert_examples_to_tf_dataset(validation_cleaned_batch, tokenizer)

Once we have the "cleaned" and tokenized datasets, we can move on to actually setting up the fine-tuning of BERT, through what's called the "train, validation and testing method" of model fine-tuning.

Train, validation and testing method of BERT fine-tuning

There are a variety of methods for fine-tuning machine learning models, but to keep explaining (and defending!) how we fine-tuned BERT simple, we are going to go over what's called the train, validation and test split method. A train, validation, and test split involves splitting your initial dataset (here, our cleaned output.csv dataset) into three independent datasets - for (big surprise) training, validation, and testing.
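
As a rough sketch (not the exact code used in this post), a three-way split can be done with the same tf.data take/skip pattern used above; the batch counts here are purely illustrative:

# Hypothetical batch counts for a train / validation / test split
TRAIN_BATCHES, VAL_BATCHES, TEST_BATCHES = 2000, 250, 250

train_split = dataset.take(TRAIN_BATCHES)
validation_split = dataset.skip(TRAIN_BATCHES).take(VAL_BATCHES)
test_split = dataset.skip(TRAIN_BATCHES + VAL_BATCHES).take(TEST_BATCHES)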

Training of train, validation and testing method

We've now cleaned our dataset and are ready to select which base model to fine tune for our task: sentiment analysis of social media. This is where Hugging Face really shines: we can pick the TFBertForSequenceClassification class.

This transformers class is very cool because of how transformer models (like BERT) are tailored for a specific task. This tailoring is accomplished by adding another layer on top of a base model. The layer added on top is called the "classifier" (or classification head): it reads BERT's representation of a special token prepended to every input (for sentiment analysis, think of it as a "sentiment token") and produces a score for each class, from 0 ("negative") to 1 ("positive"). The base BERT model is a general-purpose sequence encoder, so the classification head is what turns its output into the sentiment prediction we actually want.

The reason this needs to be mentioned is that the TFBertForSequenceClassification transformer class is the classifier that Hugging Face provides for us, which makes our fine tuning task that much simpler. We don't have to create a classifier; we only have to worry about creating good data for our train, validation and test datasets, and about working with TFBertForSequenceClassification's API - specifically the compile and fit methods, which come from its parent class, the Keras Model class.

Keras model compile method

The compile method first. Before we can call compile, we need to create the model instance that we'll call it on. For BERT sequence classification (for sentiment analysis) we can create a model instance like so:

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

This creates a Keras model instance using the pretrained "bert-base-uncased" model with a classifier layer on top. Now, we can look at what's involved for compiling the model:

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
)

There's a lot to unpack here, so we're going to go over each argument that we need to compile a model for BERT sequence classification (our sentiment analysis task).

First, the optimizer argument, which takes a tf.keras.optimizers type. A Keras optimizer is an optimization algorithm used for minimizing the loss of a predictive model with regard to a training dataset (see the difference between back-propagation and optimization). The particular optimization algorithm we are using is called the Adam algorithm. Adam is pretty common and works well here for fine-tuning BERT. The arguments we in turn passed to the Adam optimizer are the learning_rate, the epsilon, and the clipnorm.

Adam optimizer learning_rate argument

The learning_rate (also known as the "step size") determines the degree to which the model's weights are "changed" with respect to the gradient. A gradient simply measures the change in all weights with regard to the change in error.
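
To make the "step size" concrete, here's a toy calculation (illustrative numbers, not from BERT) of a single gradient descent update to one weight:

# Toy numbers to show how the learning rate scales each weight update
learning_rate = 3e-5
weight = 0.5      # a hypothetical model weight
gradient = 0.2    # hypothetical gradient of the loss with respect to that weight

# Gradient descent step: move the weight a small step against its gradient
weight = weight - learning_rate * gradient  # 0.499994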

Adam optimizer epsilon argument

The epsilon prevents a "divide by zero" error when the gradient is close to zero. There's an awesome Stack Overflow answer on what the epsilon is for the Adam Optimizer.

Adam optimizer clipnorm argument

The clipnorm "clips the gradient" to prevent the exploding gradient problem.
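
To make that concrete, here's a tiny sketch (illustrative values) of what clipnorm=1.0 does to a single gradient tensor:

import tensorflow as tf

# A hypothetical gradient with L2 norm 5.0
grad = tf.constant([3.0, 4.0])

# clipnorm=1.0 rescales any gradient whose norm exceeds 1.0 back down to norm 1.0
clipped = tf.clip_by_norm(grad, 1.0)
print(clipped.numpy())  # [0.6 0.8]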

Now, the model.compile() loss argument: the loss function to use. The purpose of loss functions is to compute the quantity that a model should seek to minimize during training (taken from the Keras losses docs). We are using the SparseCategoricalCrossentropy class, which "computes the crossentropy loss between the labels and predictions" (see previous link). We need a function to measure the gap between the predicted probability (from 0 for "negative" sentiment to 1 for "positive" sentiment) and the true label, and the SparseCategoricalCrossentropy Keras class provides that loss function. More reading on what cross-entropy is.
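
Here's a tiny worked example (illustrative logits, not real model output) of what this loss looks like for a single tweet labeled positive:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# The true label is 1 ("positive"); the raw logits favor class 1, so the loss is small
loss = loss_fn([1], [[-1.0, 2.0]])
print(float(loss))  # roughly 0.049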

Finally, the metrics argument. A metric is a function that is used to judge the performance of your model. We are using the SparseCategoricalAccuracy class. The "accuracy" string we pass to SparseCategoricalAccuracy is the name the metric is reported under (which is why we can read history.history['accuracy'] later). The metric itself, much like the base Accuracy class from the keras metrics package, calculates how often predictions match labels. More readable documentation on the Accuracy class.
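
And a tiny example (made-up predictions) of what SparseCategoricalAccuracy measures:

import tensorflow as tf

metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# The labels are 1 and 0; the predicted class (highest probability) matches both,
# so the accuracy is 1.0
metric.update_state([1, 0], [[0.2, 0.8], [0.9, 0.1]])
print(float(metric.result()))  # 1.0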

There are a few more arguments optionally accepted by compile, with explanations provided by the compile method docs.

Keras model fit method

Now onto the final method we need for fine-tuning the pre-trained BERT model: the fit method, which actually performs the work of fine-tuning the model:

history = model.fit(convert_dataset, epochs=NUM_EPOCHS, validation_data=convert_validation_dataset)

The fit method takes at least three arguments here. The first positional argument is the training data, for which we are passing the convert_dataset that we built in an earlier step to fit the Dataset type expected by model.fit(). The second argument (which we've named here) is epochs: the number of times the model makes a complete pass over all of the training data used for fine-tuning. We're passing NUM_EPOCHS set to 10, so we are telling the model to iterate over the entire training dataset ten times. The third argument, validation_data (our convert_validation_dataset), is evaluated after each pass over the training data, so we can see how well the model generalizes to data it isn't being trained on.

We want to assign the result of model.fit() to a history variable so we can analyze how training impacted the model, for example with matplotlib plots:

import matplotlib.pyplot as plt

# ...

# summarize history for accuracy
plt.figure()
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig('training_results_accuracy.pdf')
# summarize history for loss
plt.figure()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig('training_results_loss.pdf')

plt.close()

Validation of train, validation and testing method

Tensorflow's Keras model API for fit handles the validation dataset for us, so for the sake of this blog post we won't go into further detail about this step. We just need to make sure that the validation dataset (and later, the test dataset, as well as future data once the model is determined to be viable for production) is structured in exactly the same format as the training data and, very importantly, does not contain any input found in the training data.

Testing of train, validation and testing method

Once we've fine-tuned the BERT model using both the training data and the validation data, the final step is to perform end stage testing using previously unused data (data not found in either the training dataset or the validation dataset) to check for overfitting. Overfitting occurs when a model has learned the training and validation data too closely, and thus fails to make accurate predictions on new input it hasn't seen before.

What we did for final testing is take a very small number of tweets and look at the actual model output. This final test set has no direct impact on training (which only uses the validation set to monitor performance after each epoch); rather, it lets us confirm that the model's output after training makes sense. Basically, it's a "sanity check" to make sure the output is reasonable and the accuracy is acceptable before deploying the model into a production environment.
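
Here's a minimal sketch of that kind of sanity check (the tweets are made up, and this assumes a transformers version where the tokenizer instance is callable):

import tensorflow as tf

# A handful of held-out example tweets (hypothetical)
sample_tweets = ["I love this new phone!", "worst customer service ever"]

encoded = tokenizer(sample_tweets, padding=True, truncation=True, max_length=200, return_tensors="tf")
logits = model(encoded).logits
predictions = tf.argmax(logits, axis=-1).numpy()  # 1 = positive, 0 = negative
print(list(zip(sample_tweets, predictions)))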

Saving the model

We need to save the model in a format we can download from Lambda Labs (or whatever remote GPU hosting you are using). Something like this works for creating a tar archive and compressing it:

import os
import tarfile

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

# Need to "save" model output configuration
model.save_pretrained('fine_tuned_model')
tokenizer.save_pretrained('fine_tuned_model_tokenizer')

# tar and compress "saved" model
make_tarfile('fine_tuned_model.tar.gz', 'fine_tuned_model')
make_tarfile('fine_tuned_model_tokenizer.tar.gz', 'fine_tuned_model_tokenizer')
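
Later, the saved model and tokenizer can be loaded back with from_pretrained, pointed at the directories the archives are extracted to (a sketch; the paths assume the directory names used above):

from transformers import BertTokenizer, TFBertForSequenceClassification

loaded_model = TFBertForSequenceClassification.from_pretrained('fine_tuned_model')
loaded_tokenizer = BertTokenizer.from_pretrained('fine_tuned_model_tokenizer')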

Conclusion

Hopefully this blog post has demystified how to fine tune BERT for sentiment analysis. This how-to describes the exact same process we used to create our real time social media mining system with fine-tuned BERT, which powers both our homepage's charts and the data provided by our REST-based API.

Our next blog post (in the works!) will explain how we set up our data mining cluster, using the weights created from this fine-tuning approach to cost-effectively run a cluster 24/7, collecting relevant raw social media posts once a minute.

Like this post? Share on social media and connect with me on Twitter! And check back for updates as we go into further detail and our next blog post in this series: how to deploy a BERT fine-tuned model into production.

Congratulations, you made it to the end! Here's all the code in one snippet:

Put this in a bash script (after downloading the Sentiment140 dataset):

#!/usr/bin/env bash

# "shuffle" the rows of the csv file so similar tweets aren't next to each other
shuf training.1600000.processed.noemoticon.csv -o shuffled_output.csv

# find every "4" label value, and replace it with "1"
awk -F ',' -v OFS=',' '$1 == "\"4\"" { $1 = "\"1\"" }1' shuffled_output.csv > output.csv

Upload output.csv to a GPU instance (at least as big as a gpu.2x.a6000) on Lambda Labs (or another GPU remote hosting service) and copy and paste this in a Jupyter Notebook:

import os
import tarfile

import matplotlib.pyplot as plt
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification, InputFeatures

CSV_PATH = './output.csv'
NUM_EPOCHS = 10
# Reducing the size of the dataset to keep GPU hosting costs down as well as allowing this whole dataset to be held in this python process memory for simpler, initial fine-tuning
DATASET_SIZE = 2500
BATCH_SIZE = 25
AUTOTUNE = tf.data.experimental.AUTOTUNE

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

dataset = tf.data.experimental.make_csv_dataset(
    CSV_PATH,
    batch_size=BATCH_SIZE,
    column_names=['score','timestamp', 'datestring', 'N/A', 'user', 'tweet'],
    label_name='score',
    select_columns=['score', 'tweet'],
    num_epochs=NUM_EPOCHS,
    header=False,
    shuffle_seed=0,
    shuffle=True,
    num_rows_for_inference=1600000,
    ignore_errors=True,).prefetch(AUTOTUNE)

train_dataset = dataset.take(DATASET_SIZE)
validation_dataset = dataset.skip(DATASET_SIZE).take(DATASET_SIZE)

def convert_examples_to_tf_dataset(batches, tokenizer, max_length=200):
    datasetArr = []
    for batch in batches:
      batchFeatures = [] # -> will hold InputFeatures to be converted later
      index = 0
      textArr = batch['text']
      labelArr = batch['label']
      for textItem in textArr:
        input_dict = tokenizer.encode_plus(
            textItem,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True, # pads to the right by default
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        batchFeatures.append(
            InputFeatures(
                  input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=int(labelArr[index])
              )
        )
        index += 1
      datasetArr.append(batchFeatures)

    def gen():
        for batch in datasetArr:
          for f in batch:
            yield ({
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                }, f.label)

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape(None),
        ),
    ).batch(BATCH_SIZE)

cleaned_batch = []
validation_cleaned_batch = []
for batch in train_dataset:
  rawTextArr = list(map( lambda tweet: str(tweet), batch[0]['tweet'].numpy()))
  rawLabel = list(map( lambda tweet: int(tweet), batch[1]))
  cleaned_batch.append({ 'text': rawTextArr, 'label': rawLabel })

for batch in validation_dataset:
  rawTextArr = list(map( lambda tweet: str(tweet), batch[0]['tweet'].numpy()))
  rawLabel = list(map( lambda tweet: int(tweet), batch[1]))
  validation_cleaned_batch.append({ 'text': rawTextArr, 'label': rawLabel })

# our cleaned datasets, ready to be consumed by pre-trained BERT model instance
convert_dataset = convert_examples_to_tf_dataset(cleaned_batch, tokenizer)
convert_validation_dataset = convert_examples_to_tf_dataset(validation_cleaned_batch, tokenizer)


model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
)

history = model.fit(convert_dataset, epochs=NUM_EPOCHS, validation_data=convert_validation_dataset)

# summarize history for accuracy
plt.figure()
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig('training_results_accuracy.pdf')
# summarize history for loss
plt.figure()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig('training_results_loss.pdf')

plt.close()

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

# Need to "save" model output configuration
model.save_pretrained('fine_tuned_model')
tokenizer.save_pretrained('fine_tuned_model_tokenizer')

# tar and compress "saved" model
make_tarfile('fine_tuned_model.tar.gz', 'fine_tuned_model')
make_tarfile('fine_tuned_model_tokenizer.tar.gz', 'fine_tuned_model_tokenizer')
