Stephen Collins

Explaining the softmax activation function

In machine learning, the softmax activation function is used to normalize a model's output into a probability distribution over a discrete set of classes, which is especially useful for classification tasks like sentiment analysis. More about general activation functions here.
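
Concretely, softmax exponentiates each score and divides by the sum of all the exponentials, so every output lands between 0 and 1 and the outputs sum to 1:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Because of the exponential, larger scores receive a disproportionately larger share of the total.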

In this blog post we will cover how to use the softmax activation function with basic tensors. We will also explain how to apply it to the output logits of a pre-trained BERT model to perform sentiment analysis on text input, as a hands-on example of using the softmax activation function.

How to use the softmax activation function

We can use the softmax activation function to transform a range of numbers into a set of numbers that sum to 1. Consider the following example vector (1-dimensional array):

import tensorflow as tf

tf.nn.softmax([-2.0, 0.0, 2.0, 4.0]).numpy()
# array([0.00214401, 0.0158422 , 0.1170589 , 0.8649548], dtype=float32)

In this example, the largest number, 4.0, is closest to 1 (0.8649548), and the smallest number, -2.0, is closest to 0 (0.00214401). The normalization is based on each input's value relative to all of the other inputs, so the smallest number ends up closest to 0 and the largest closest to 1. In the next section we will use the softmax activation function to help us perform sentiment analysis with a BERT model designed for sequence classification tasks.
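
To see where those numbers come from, here is a minimal sketch that reproduces the result by hand, exponentiating each number and dividing by the sum of the exponentials (the values may differ from tf.nn.softmax in the last decimal places due to float32 rounding):

import tensorflow as tf

scores = tf.constant([-2.0, 0.0, 2.0, 4.0])

# exponentiate each score, then divide by the sum of all the exponentials
exps = tf.exp(scores)
probs = exps / tf.reduce_sum(exps)

print(probs.numpy())                 # ~[0.00214401 0.0158422 0.1170589 0.8649548]
print(tf.reduce_sum(probs).numpy())  # 1.0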

Softmax activation function usage in classification tasks

In our case, using the Hugging Face BERT model for sequence classification, the TFBertForSequenceClassification class, we can directly transform the logits of the model's output into normalized probabilities by applying the softmax activation function to the logits tensor. The logits are stored at index 0 of the TFBertForSequenceClassification model output.

We apply the softmax activation function to the model output's logits to normalize the prediction of which class the text sequence belongs to. The default configuration for TFBertForSequenceClassification is set up for classification between 2 classes: the first column of the output tensor is the probability that the sequence belongs to a negative sentiment class, and the second column is the probability that it belongs to a positive sentiment class. Normalizing the output is what lets us determine which class is predicted to be more likely (negative or positive sentiment). Consider the example below of running the BERT model and applying the softmax activation function to its output:

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

test_input = [
    'This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
    'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie',
    "I hate this car. I hate how ugly this truck is. The worst day of my life"
]

# tokenizing our raw input
tf_batch = tokenizer(test_input, max_length=128, padding=True, truncation=True, return_tensors='tf')
# storing the direct output of our model
tf_outputs = model(tf_batch)
# applying the softmax activation function to our model's output's logits stored in the 0 index
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

labels = ['Negative','Positive']
# argmax tells us which class ("Negative" or "Positive") the model
# predicts the input sequence most likely belongs to
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(test_input)):
  print(test_input[i], ": \n", labels[label[i]])

The predictions are not terribly accurate (without first fine-tuning BERT for sentiment analysis), but after applying the softmax activation function we can see that the logits in tf_outputs have been normalized, with the result stored in the new tf_predictions tensor. We then apply the argmax function to tf_predictions to find the index of the predicted class (index 0 for negative, index 1 for positive) for each sequence. The axis=1 argument tells TensorFlow to look across the columns of each row, so argmax returns the index of the more strongly predicted class, 0 or 1 (negative or positive, respectively), for each row corresponding to a text input (3 in this example). Finally, we iterate over test_input to print a human-friendly label representing the "negative" or "positive" class the model chose for each text input.
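
If the axis argument is still unclear, here is a small standalone sketch (with made-up probabilities rather than the actual model output) showing how tf.argmax with axis=1 picks the column with the highest probability for each row:

import tensorflow as tf

# one row per input sequence, one column per class: [Negative, Positive]
fake_predictions = tf.constant([
    [0.2, 0.8],  # "Positive" is more likely for the first sequence
    [0.9, 0.1],  # "Negative" is more likely for the second sequence
    [0.7, 0.3],  # "Negative" is more likely for the third sequence
])

# axis=1 looks across the columns of each row and returns the index of the
# largest value per row: 0 maps to "Negative", 1 maps to "Positive"
print(tf.argmax(fake_predictions, axis=1).numpy())  # [1 0 0]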

Conclusion

That pretty much sums up the basics of applying the softmax activation function to a machine learning model like BERT. Normalizing the output helps us to make more sense of what the model is telling us.

Questions or comments? Connect with me on Twitter or LinkedIn.
