DEV Community

Jakub Czakon

Posted on • Originally published at neptune.ai

Keras Metrics: Everything You Need To Know

This article was originally posted by Derrick Mwiti on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.


Keras metrics are functions used to evaluate the performance of your deep learning model. Choosing a good metric for your problem is usually a difficult task:

  • you need to understand which metrics are already available in Keras and tf.keras and how to use them,
  • in many situations you need to define your own custom metric because the metric you are looking for doesn’t ship with Keras,
  • sometimes you want to monitor model performance by looking at charts like the ROC curve or confusion matrix after every epoch.

Lucky for you, this article explains all that!

Keras metrics 101

In Keras, metrics are passed during the compile stage as shown below. You can pass several metrics by listing them, separated by commas, in the metrics argument.

from keras import metrics

model.compile(loss='mean_squared_error', optimizer='sgd',
              metrics=[metrics.mae,
                       metrics.categorical_accuracy])

How should you choose those evaluation metrics?

Some of them are available in Keras, others in tf.keras. Sometimes you need to implement your own custom metrics.

Let’s go over all of those situations.

Which metrics are available in Keras?

Keras provides a rich pool of inbuilt metrics. Depending on your problem, you’ll use different ones.

Let’s look at some of the problems you may be working on.

Binary classification

Binary classification metrics are used on computations that involve just two classes. A good example is building a deep learning model to tell cats from dogs. We have two classes to predict and the threshold determines the point of separation between them. binary_accuracy and accuracy are two such functions in Keras.

binary_accuracy, for example, computes the mean accuracy rate across all predictions for binary classification problems.

keras.metrics.binary_accuracy(y_true, y_pred, threshold=0.5)
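
To make the role of the threshold concrete, here is a plain-Python sketch of the calculation. It is not the Keras implementation; the helper name binary_accuracy_sketch and the sample values are made up for illustration.

```python
def binary_accuracy_sketch(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    # A probability strictly above the threshold counts as class 1.
    predicted = [1 if p > threshold else 0 for p in y_pred]
    matches = [int(t == p) for t, p in zip(y_true, predicted)]
    return sum(matches) / len(matches)


# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 0]
y_pred = [0.9, 0.4, 0.2, 0.6]

default_acc = binary_accuracy_sketch(y_true, y_pred)                # threshold 0.5
lower_acc = binary_accuracy_sketch(y_true, y_pred, threshold=0.3)   # threshold 0.3
print(default_acc, lower_acc)
```

Lowering the threshold turns the 0.4 prediction into a positive, so the same model output scores differently depending on where you draw the line.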

The accuracy metric computes the accuracy rate across all predictions. y_true represents the true labels while y_pred represents the predicted ones.

keras.metrics.accuracy(y_true, y_pred)

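A confusion matrix displays a table showing the true positives, true negatives, false positives, and false negatives. keras.metrics does not include a confusion_matrix function, so one option is to compute it with scikit-learn from the predicted class labels.

from sklearn.metrics import confusion_matrix

# y_pred must hold predicted class labels, not probabilities
confusion_matrix(y_test, y_pred)

[Image: example confusion matrix]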
In the above confusion matrix, the model made 3305 + 375 correct predictions and 106 + 714 wrong predictions.
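
The four counts quoted above are enough to recover the overall accuracy. A quick worked calculation, using only those numbers:

```python
correct = 3305 + 375   # predictions on the diagonal of the matrix
wrong = 106 + 714      # off-diagonal predictions
total = correct + wrong

accuracy = correct / total
print(total, round(accuracy, 4))
```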

You can also visualize it as a matplotlib chart which we will cover later.

Multiclass classification

These metrics are used for classification problems involving more than two classes. Extending our animal classification example you can have three animals, cats, dogs, and bears. Since we are classifying more than two animals, this is a multiclass classification problem.

The shape of y_pred is the number of entries by the number of classes, (n, c). The shape of y_true depends on how the labels are encoded: (n, c) for one-hot labels, or (n, 1) for integer class labels.

categorical_accuracy computes the mean accuracy rate across all predictions and expects one-hot encoded y_true.

keras.metrics.categorical_accuracy(y_true, y_pred)

sparse_categorical_accuracy is similar to categorical_accuracy, but it expects y_true as integer class indices rather than one-hot vectors. A great example of this is working with text in deep learning problems such as word2vec. In this case, one works with thousands of classes with the aim of predicting the next word. One-hot encoding would make y_true a huge matrix that is almost all zeros, so storing only the index of the correct class is far more compact.

keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
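
The difference between the two metrics is only in how the labels are encoded. The plain-Python sketch below (hypothetical helper names and data, not the Keras code) scores the same predictions against one-hot labels and against integer class ids:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def categorical_accuracy_sketch(y_true_one_hot, y_pred):
    # Labels are one-hot rows, so take argmax on both sides.
    hits = [argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_pred)]
    return sum(hits) / len(hits)


def sparse_categorical_accuracy_sketch(y_true_ids, y_pred):
    # Labels are already class indices, so only y_pred needs argmax.
    hits = [t == argmax(p) for t, p in zip(y_true_ids, y_pred)]
    return sum(hits) / len(hits)


# Three samples, three classes (cat=0, dog=1, bear=2).
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
one_hot = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
class_ids = [0, 1, 1]

dense_acc = categorical_accuracy_sketch(one_hot, y_pred)
sparse_acc = sparse_categorical_accuracy_sketch(class_ids, y_pred)
print(dense_acc, sparse_acc)
```

Both calls return the same value because they describe the same labels; the sparse form simply avoids building the one-hot matrix.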

top_k_categorical_accuracy computes the top-k categorical accuracy rate. We take the k classes with the highest predicted scores and check whether the correct class is among them. If it is, we count the prediction as correct.

keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=5)
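
As a rough illustration of the top-k idea, here is a plain-Python sketch. The function name and data are hypothetical, and for brevity it takes integer class ids, whereas the Keras function expects one-hot y_true.

```python
def top_k_accuracy_sketch(y_true_ids, y_pred, k=5):
    hits = 0
    for true_id, scores in zip(y_true_ids, y_pred):
        # Indices of the k highest-scoring classes for this sample.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += int(true_id in ranked[:k])
    return hits / len(y_true_ids)


# Two samples, three classes.
y_true_ids = [1, 0]
y_pred = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]

top1 = top_k_accuracy_sketch(y_true_ids, y_pred, k=1)
top2 = top_k_accuracy_sketch(y_true_ids, y_pred, k=2)
print(top1, top2)
```

Neither sample has the correct class ranked first, but the first sample has it ranked second, so accuracy rises as k grows.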

Regression

The metrics used in regression problems include Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error. These metrics are used when predicting numerical values such as sales and prices of houses. Check out this resource for a complete guide on regression metrics.

from keras import metrics

model.compile(loss='mse', optimizer='adam',
              metrics=[metrics.mean_squared_error,
                       metrics.mean_absolute_error,
                       metrics.mean_absolute_percentage_error])
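
For intuition, the three regression metrics can be written out in a few lines of plain Python. This is a sketch with made-up prices, and it ignores the small epsilon guard a library would add to avoid dividing by zero in the percentage error.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    # Expressed in percent; assumes no true value is zero.
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical house prices (in thousands).
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(round(mse, 2), round(mae, 2), round(mape, 2))
```

The squared error is dominated by the larger miss, while the percentage error treats a 10-unit miss on 100 and a 20-unit miss on 200 as equally bad.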

How to create a custom metric in Keras?

As we had mentioned earlier, Keras also allows you to define your own custom metrics.

The function you define has to take y_true and y_pred as arguments and must return a single tensor value. Both arguments are tensors, typically with float32 data type. For a single-output model the shape is the number of rows by 1. For example, if you have 4,500 entries the shape will be (4500, 1).

You can use the function by passing it at the compilation stage of your deep learning model.

model.compile(...metrics=[your_custom_metric])

How to calculate F1 score in Keras (precision, and recall as a bonus)?

Let’s see how you can compute the f1 score, precision and recall in Keras. We will create it for the multiclass scenario but you can also use it for binary classification.

The f1 score is the harmonic mean of precision and recall. So to calculate f1 we need to create functions that calculate precision and recall first. Note that in the multiclass scenario you need to look at all classes, not just the positive class (which is the case for binary classification).

from keras import backend as K

# These functions expect y_true as 0/1 values with the same shape as y_pred
# (binary labels or one-hot encoded classes).

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    # Distinct local names keep the precision/recall functions from being shadowed.
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
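
To see how the three quantities relate, here is a small worked example in plain Python. The counts are hypothetical and the helper name is made up; it only illustrates the formulas, not the Keras tensors above.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # share of predicted positives that are right
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=30)
print(p, r, round(f1, 4))
```

With precision 0.75 and recall 0.5 the f1 score comes out below their arithmetic mean of 0.625, which is the point of using a harmonic mean: the weaker of the two numbers pulls the score down.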

The next step is to use these functions at the compilation stage of our deep learning model. We are also adding the Keras accuracy metric that is available by default.

model.compile(...,metrics=['accuracy', f1_score, precision, recall])

Let’s now fit the model to the training set.

model.fit(x_train, y_train, epochs=5)

Now you can evaluate your model and access the metrics you have just created.

# Values come back in the order: loss, then the metrics passed to compile().
(loss,
 accuracy,
 f1, prec, rec) = model.evaluate(x_test, y_test, verbose=1)

Great, you now know how to create custom metrics in keras.

That said, sometimes you can use something that is already there, just in a different library like tf.keras 🙂

Which metrics are available in tf.keras?

Recently Keras has become a standard API in TensorFlow and there are a lot of useful metrics that you can use.

Let’s look at some of them.
Unlike in Keras where you just call the metrics using keras.metrics functions, in tf.keras you have to instantiate a Metric class.

For example:

tf.keras.metrics.Accuracy() 

There is quite a bit of overlap between keras metrics and tf.keras. However, there are some metrics that you can only find in tf.keras.

Let’s take a look at those.

tf.keras Classification Metrics

tf.keras.metrics.AUC computes the approximate AUC (Area under the curve) for ROC curve via the Riemann sum.

model.compile('sgd', loss='mse', metrics=[tf.keras.metrics.AUC()])
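
The Riemann-sum idea can be sketched in plain Python: sweep a grid of thresholds, record the false and true positive rates at each one, and add up the area of the trapezoids between consecutive points. This is an illustration under assumptions (function name, threshold grid, and data are made up) and only loosely mirrors what the library does.

```python
def roc_auc_sketch(y_true, y_score, num_thresholds=200):
    # Assumes both classes are present, otherwise the rates are undefined.
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Evenly spaced thresholds from high to low, padded slightly beyond [0, 1]
    # so the curve starts at (0, 0) and ends at (1, 1).
    step = 1.0 / (num_thresholds - 1)
    thresholds = [1.0 + 1e-7]
    thresholds += [i * step for i in range(num_thresholds - 2, 0, -1)]
    thresholds += [-1e-7]

    points = []
    for th in thresholds:
        tp = sum(1 for t, s in zip(y_true, y_score) if s > th and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s > th and t == 0)
        points.append((fp / negatives, tp / positives))

    # Trapezoidal sum over consecutive (FPR, TPR) points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_sketch(y_true, y_score)
print(round(auc, 4))
```

Because the thresholds form a fixed grid, the result is an approximation; two scores that fall between the same pair of thresholds cannot be told apart.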

You can use precision and recall that we have implemented before, out of the box in tf.keras.

model.compile('sgd', loss='mse', 
               metrics=[tf.keras.metrics.Precision(), 
                        tf.keras.metrics.Recall()])

tf.keras Segmentation Metrics

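tf.keras.metrics.MeanIoU – Mean Intersection-Over-Union is a metric used for the evaluation of semantic image segmentation models. We first calculate the IoU for each class and then average over all classes:

IoU = true_positives / (true_positives + false_positives + false_negatives)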

model.compile(... metrics=[tf.keras.metrics.MeanIoU(num_classes=2)])
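
A plain-Python sketch of the per-class calculation, with a made-up helper name and tiny hypothetical label lists standing in for flattened segmentation masks:

```python
def mean_iou_sketch(y_true, y_pred, num_classes):
    ious = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denominator = tp + fp + fn
        # Skip classes that appear in neither the labels nor the predictions.
        if denominator:
            ious.append(tp / denominator)
    return sum(ious) / len(ious)


# Four pixels, two classes.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 0, 1]
mean_iou = mean_iou_sketch(y_true, y_pred, num_classes=2)
print(round(mean_iou, 4))
```

Each class here has one correct pixel, one false positive, and one false negative, so both IoU values are 1/3 and so is their mean.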

tf.keras Regression Metrics

Just like Keras, tf.keras has similar regression metrics. We won’t dwell on them much but there is an interesting metric to highlight called MeanRelativeError.

MeanRelativeError takes the absolute error for an observation and divides it by a constant. This constant, the normalizer, can be the same for all observations or different for each sample.

Therefore, the mean relative error is the average of the relative errors.

tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])
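
Written out by hand, the calculation looks like the sketch below. The function name is made up and the predictions are hypothetical; only the normalizer values come from the line above.

```python
def mean_relative_error_sketch(y_true, y_pred, normalizer):
    # Absolute error of each observation divided by its own normalizer.
    relative_errors = [abs(t - p) / n
                       for t, p, n in zip(y_true, y_pred, normalizer)]
    return sum(relative_errors) / len(relative_errors)


normalizer = [1, 3, 2, 3]
y_true = [1, 3, 2, 3]
y_pred = [2, 4, 6, 8]

mre = mean_relative_error_sketch(y_true, y_pred, normalizer)
print(round(mre, 4))
```

The absolute errors are 1, 1, 4, and 5, so the relative errors are 1, 1/3, 2, and 5/3, which average to 1.25.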

How to create a custom metric in tf.keras?

In tf.keras you can create a custom metric by extending the keras.metrics.Metric class.
To do so you have to override the update_state, result, and reset_state functions:

  • update_state() updates the state variables with the values from each batch,
  • result() computes and returns the metric value from the state variables,
  • reset_state() resets the state variables to a predefined constant (typically 0) at the beginning of each epoch.

class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_state(self):
        # The state of the metric will be reset at the start of each epoch.
        # Older TensorFlow releases named this method reset_states.
        self.true_positives.assign(0.)

Then we simply pass it at compile stage:

model.compile(...,metrics=[MulticlassTruePositives()])
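
The lifecycle of a stateful metric is easier to follow without tensors. The class below is a plain-Python stand-in (not a Keras class, and the name is made up) that accumulates over batches, reports a result, and is then reset, roughly the way a training loop would drive it.

```python
class TruePositivesSketch:
    """Plain-Python stand-in for a stateful metric."""

    def __init__(self):
        self.true_positives = 0.0

    def update_state(self, y_true, y_pred_class):
        # Called once per batch: accumulate, never overwrite.
        self.true_positives += sum(
            1.0 for t, p in zip(y_true, y_pred_class) if t == p)

    def result(self):
        return self.true_positives

    def reset_state(self):
        self.true_positives = 0.0


metric = TruePositivesSketch()
metric.update_state([0, 1, 2], [0, 1, 1])   # batch 1: two correct
metric.update_state([2, 2], [2, 0])         # batch 2: one correct
epoch_total = metric.result()
metric.reset_state()                         # start of the next epoch
print(epoch_total, metric.result())
```

Keeping the running total in state is what lets the metric be correct over a whole epoch rather than being an average of per-batch values.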

Performance charts: ROC curve and Confusion Matrix in Keras

Sometimes the performance cannot be represented as one number but rather as a performance chart. Examples of such charts are ROC curve or confusion matrix. In those cases, you may want to log those charts somewhere for further inspection.

To do it you need to create a callback that will track the performance of your model on every epoch end. Then, you can take a look at the improvement in a folder or an experiment tracking tool.
So let’s do that.

First, we need a callback that creates ROC curve and confusion matrix at the end of each epoch.

import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc


class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs=None):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16,12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))
        plt.close(fig)  # release the figure so memory does not grow every epoch

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16,12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))
        plt.close(fig)

Now we simply pass it to the model.fit() callbacks argument.

performance_cbk = PerformanceVisualizationCallback(
                      model=model,
                      validation_data=validation_data,
                      image_dir='performance_visualizations')

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk])

You can have multiple callbacks if you want to.

Now you will be able to look at those visualizations as your model trains:

Note:

If you want to log everything to an experiment tracking tool like Neptune, your callback would look a bit different:

from keras.callbacks import Callback
import neptune
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
import matplotlib.pyplot as plt

neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')

class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]

        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)

Notice that you don’t need to create folders for images as the charts will be sent to your tool directly. On the flip side you have to create an experiment to start tracking your runs.
Once you have that it is business as usual.

neptune_logger=NeptuneLoggerCallback(model=model,
                                     validation_data=validation_data)

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[neptune_logger])

You can explore metrics and performance charts in the app.

How to plot Keras history object?

Whenever fit() is called, it returns a History object that can be used to visualize the training history. It contains a dictionary with loss and metric values at each epoch calculated both for training and validation datasets.

For example, let’s extract the ‘accuracy’ metric and use matplotlib to plot it.

import matplotlib.pyplot as plt

history = model.fit(x_train, y_train, 
                    validation_split=0.25, 
                    epochs=50, batch_size=16, verbose=1)

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
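
Since history.history is an ordinary dictionary of lists, you can also inspect it without plotting. The sketch below uses a hand-written dictionary with hypothetical values in the same layout to find the epoch with the best validation accuracy.

```python
# Hypothetical values laid out like history.history after four epochs.
history_dict = {
    'loss': [0.90, 0.60, 0.45, 0.40],
    'accuracy': [0.70, 0.80, 0.85, 0.88],
    'val_loss': [0.95, 0.70, 0.72, 0.80],
    'val_accuracy': [0.68, 0.78, 0.77, 0.75],
}

val_acc = history_dict['val_accuracy']
# Index of the epoch with the highest validation accuracy (0-based).
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__)
print(f'best epoch: {best_epoch + 1}, val_accuracy: {val_acc[best_epoch]}')
```

In this made-up run training accuracy keeps climbing while validation accuracy peaks early, which is the kind of gap the plot above is meant to reveal.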

Keras Metrics Example

Ok, so you’ve come a long way and learned a bunch. To refresh your memory let’s put it all together in a single example.
We’ll start by taking the MNIST dataset and creating a simple fully connected model:

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
validation_data = x_test, y_test

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

We’ll create a custom metric, multiclass f1 score in keras:

from keras import backend as K

def _one_hot_like(y_true, y_pred):
    # Helper added for this example: MNIST labels are integer class ids,
    # so one-hot encode them to match the shape of y_pred.
    labels = K.cast(K.flatten(y_true), 'int32')
    return K.one_hot(labels, K.int_shape(y_pred)[-1])

def recall(y_true, y_pred):
    y_true = _one_hot_like(y_true, y_pred)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    y_true = _one_hot_like(y_true, y_pred)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    # Distinct local names keep the precision/recall functions from being shadowed.
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))

We’ll create a custom tf.keras metric: MulticlassTruePositives to be exact:

class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_state(self):
        # The state of the metric will be reset at the start of each epoch.
        # Older TensorFlow releases named this method reset_states.
        self.true_positives.assign(0.)

We’ll compile the keras model with our metrics:

import keras

# MNIST labels are integer class ids, so the sparse variants of the
# built-in categorical metrics are used here.
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy',
                       keras.metrics.sparse_categorical_accuracy,
                       f1_score,
                       recall,
                       precision,
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5),
                       MulticlassTruePositives()])

We’ll implement a Keras callback that plots the ROC curve and confusion matrix to a folder:

import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc

class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs=None):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16,12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))
        plt.close(fig)  # release the figure so memory does not grow every epoch

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16,12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))
        plt.close(fig)

performance_viz_cbk = PerformanceVisualizationCallback(
                                       model=model,
                                       validation_data=validation_data,
                                       image_dir='performance_charts')

We’ll run training and monitor the performance:

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_viz_cbk])

We’ll visualize metrics from keras history object:

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

You can monitor and explore your experiments in a tool like TensorBoard or Neptune. You just need to add another callback or modify the one you have created before:

TensorBoard

from tensorflow.keras.callbacks import TensorBoard

tensorboard_cbk = TensorBoard(log_dir="logs/training-example/")

history = model.fit(..., callbacks=[performance_viz_cbk, 
                                    tensorboard_cbk])

With TensorBoard you need to start a local server and explore your runs in the browser.

tensorboard --logdir logs/training-example/

Neptune

neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')

class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]

        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)

neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(..., callbacks=[neptune_logger])

Check this example experiment run if you are interested:

Final Thoughts

Hopefully, this article gave you some background into model evaluation techniques in keras.

We’ve covered:

  • built-in methods in keras and tf.keras,
  • implementation of your own custom metrics,
  • how you can visualize custom performance charts as your model is training.

For more information check out the Keras Repository and TensorFlow Metrics documentation.

Happy training!
