The CamemBERT model is a state-of-the-art language model for French.
It is a RoBERTa model trained on a large corpus of French text that can easily be adapted to many tasks through finetuning.
Here we're going to finetune the model for sentence embedding.
Sentence-BERT
The output of a BERT model is an embedding vector for each token. To obtain an embedding of the text as a whole, we need to define a transformation strategy to go from individual token embeddings to an embedding vector for the sentence as a whole.
The simplest and most effective strategy is to take the average of the token embeddings.
This strategy is known as mean pooling.
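To make mean pooling concrete, here is a minimal sketch of the operation over a batch of token embeddings, assuming a (batch, sequence, hidden) embedding tensor and the usual 0/1 attention mask (the helper below is illustrative, not part of the sentence-transformers API):

import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, hidden), output of the transformer
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average only over real tokens, ignoring padding positions
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts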
If you'd like to find out more about the strategies that have been considered, take a look at this paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Finetuning a BERT model into a Sentence-BERT model
The authors of the paper mentioned above have built a Python library called sentence-transformers for working with Sentence-BERT models.
We'll use it to obtain a Sentence-CamemBERT model from a CamemBERT model available on Hugging Face.
Prerequisites
We will be using the following packages:
datasets
sentence-transformers
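Both are available on PyPI and can be installed with pip, for example: pip install datasets sentence-transformers.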
Training data
First of all, we're going to retrieve the training data.
We will use the French part of the STSb Multi MT dataset.
This is a dataset containing pairs of sentences and a score between 0 and 5 representing the similarity between the two sentences.
from datasets import load_dataset
sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
sts_dev_dataset = load_dataset("stsb_multi_mt", name="fr", split="dev")
sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
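Each row holds two sentences and their gold similarity score; you can inspect the first training example to check the column names used in the conversion below:

# Expected columns: sentence1, sentence2, similarity_score
print(sts_train_dataset[0])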
We'll then convert the retrieved data into InputExample objects that can be used for training.
from typing import List

from sentence_transformers import InputExample


def dataset_to_input_examples(dataset) -> List[InputExample]:
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            # Normalize the 0-5 similarity score to the [0, 1] range expected by the loss
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]
sts_train_examples = dataset_to_input_examples(sts_train_dataset)
sts_dev_examples = dataset_to_input_examples(sts_dev_dataset)
sts_test_examples = dataset_to_input_examples(sts_test_dataset)
We will use the CamemBERT model named almanach/camembert-base for finetuning:
from sentence_transformers import evaluation, losses, SentenceTransformer
from torch.utils.data import DataLoader

batch_size = 32

# Loading a plain CamemBERT checkpoint: sentence-transformers wraps it with a mean pooling layer by default
model = SentenceTransformer("almanach/camembert-base")

train_dataloader = DataLoader(sts_train_examples, shuffle=True, batch_size=batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
We use the cosine similarity loss objective to train the model.
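Conceptually, this objective computes the cosine similarity between the two sentence embeddings and pushes it towards the normalized gold score with a mean squared error. A minimal sketch of the idea (illustrative only, not the library's exact implementation):

import torch.nn.functional as F

def cosine_similarity_objective(embeddings1, embeddings2, gold_scores):
    # Predicted similarity: cosine between the two sentence embeddings
    predicted = F.cosine_similarity(embeddings1, embeddings2, dim=-1)
    # Regress the predicted similarity towards the gold score in [0, 1]
    return F.mse_loss(predicted, gold_scores)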
Finally, an evaluator is built to monitor the model's performance on the dev dataset during training.
sts_dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_dev_examples, name="sts-dev"
)
We can now start training the model:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_dev_evaluator,
    epochs=10,
    warmup_steps=500,
    # save_best_model only writes a checkpoint when an output_path is given
    # (the directory name here is just an example)
    output_path="sts-camembert-base",
    save_best_model=True,
)
Model evaluation
Once training is complete, you can measure the model's performance on the test dataset you held out:
sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)
sts_test_evaluator(model, ".")
I get a Pearson correlation of 0.837, which is on par with the Sentence-CamemBERT models I found on Hugging Face:
Model | Pearson Correlation | Parameters |
---|---|---|
h4c5/sts-camembert-base | 0.837 | 110M |
Lajavaness/sentence-camembert-base | 0.835 | 110M |
inokufu/flaubert-base-uncased-xnli-sts | 0.828 | 137M |
h4c5/sts-distilcamembert-base | 0.817 | 68M |
sentence-transformers/distiluse-base-multilingual-cased-v2 | 0.786 | 135M |
Distilled Sentence-BERT model
As you may have noticed in the table above, I've also trained a Sentence-CamemBERT model that's about half the size (68M parameters vs. 110M) and yet performs very well: h4c5/sts-distilcamembert-base.
This is in fact a model obtained by following the above procedure but starting from the distilled CamemBERT model: cmarkea/distilcamembert-base.
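Concretely, the only change to the recipe above is the checkpoint passed to SentenceTransformer:

# Same finetuning procedure, but starting from the distilled checkpoint
model = SentenceTransformer("cmarkea/distilcamembert-base")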
This so-called "distilled" model was obtained by removing half of the layers of the CamemBERT base model and training it to maintain its performance.
To find out more about the distillation process, please consult the following papers:
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- DistilCamemBERT: a distilled version of the French CamemBERT
Et voilà. You can find my two Sentence-CamemBERT models on Hugging Face: h4c5/sts-camembert-base and h4c5/sts-distilcamembert-base.
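Either one can be loaded and used for sentence embedding in a few lines (the example sentences below are just placeholders):

from sentence_transformers import SentenceTransformer

# Load the finetuned Sentence-CamemBERT model from the Hugging Face Hub
model = SentenceTransformer("h4c5/sts-camembert-base")

# Encode French sentences into fixed-size embedding vectors
embeddings = model.encode(["Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa."])
print(embeddings.shape)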