The CamemBERT model is a state-of-the-art language model for French.
It is a RoBERTa model trained on a large corpus of French text that can easily be adapted to many tasks through finetuning.
Here we're going to finetune the model for sentence embedding.
Sentence-BERT
The output of a BERT model is an embedding vector for each token. To obtain an embedding of the text as a whole, we need to define a transformation strategy to go from individual token embeddings to an embedding vector for the sentence as a whole.
The simplest and most effective strategy is to take the average of the token embeddings.
This strategy is known as mean pooling.
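To make mean pooling concrete, here is a minimal sketch of the operation over a batch of token embeddings, assuming a (batch, sequence, hidden) embedding tensor and the usual 0/1 attention mask (the helper below is illustrative, not part of the sentence-transformers API):

import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, hidden), output of the transformer
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average only over real tokens, ignoring padding positions
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts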
If you'd like to find out more about the strategies that have been considered, take a look at this paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Finetuning a BERT model into a Sentence-BERT model
The authors of the paper mentioned above have built a Python library called sentence-transformers for working with Sentence-BERT models.
We'll use it to obtain a Sentence-CamemBERT model from a CamemBERT model available on Hugging Face.
Prerequisites
We will be using the following packages:
datasets
sentence-transformers
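Both are available on PyPI and can be installed with pip, for example: pip install datasets sentence-transformers.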
Training data
First of all, we're going to retrieve the training data.
We will use the French part of the STSb Multi MT dataset.
This is a dataset containing pairs of sentences and a score between 0 and 5 representing the similarity between the two sentences.
from datasets import load_dataset
sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
sts_dev_dataset = load_dataset("stsb_multi_mt", name="fr", split="dev")
sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
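Each row holds two sentences and their gold similarity score; you can inspect the first training example to check the column names used in the conversion below:

# Expected columns: sentence1, sentence2, similarity_score
print(sts_train_dataset[0])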
We'll then convert the retrieved data into InputExample objects that can be used for training.
from typing import List

from sentence_transformers import InputExample


def dataset_to_input_examples(dataset) -> List[InputExample]:
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            # Normalize the 0-5 similarity score to the [0, 1] range expected by the loss
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]
sts_train_examples = dataset_to_input_examples(sts_train_dataset)
sts_dev_examples = dataset_to_input_examples(sts_dev_dataset)
sts_test_examples = dataset_to_input_examples(sts_test_dataset)
We will use the CamemBERT model named almanach/camembert-base for finetuning:
from sentence_transformers import evaluation, losses, SentenceTransformer
from torch.utils.data import DataLoader

batch_size = 32

# Loading a plain CamemBERT checkpoint: sentence-transformers wraps it with a mean pooling layer by default
model = SentenceTransformer("almanach/camembert-base")

train_dataloader = DataLoader(sts_train_examples, shuffle=True, batch_size=batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
We use the cosine similarity loss objective to train the model.
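Conceptually, this objective computes the cosine similarity between the two sentence embeddings and pushes it towards the normalized gold score with a mean squared error. A minimal sketch of the idea (illustrative only, not the library's exact implementation):

import torch.nn.functional as F

def cosine_similarity_objective(embeddings1, embeddings2, gold_scores):
    # Predicted similarity: cosine between the two sentence embeddings
    predicted = F.cosine_similarity(embeddings1, embeddings2, dim=-1)
    # Regress the predicted similarity towards the gold score in [0, 1]
    return F.mse_loss(predicted, gold_scores)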
Finally, an evaluator is built to monitor the model's performance on the dev dataset during training.
sts_dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_dev_examples, name="sts-dev"
)
We can now start training the model:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_dev_evaluator,
    epochs=10,
    warmup_steps=500,
    # save_best_model only writes a checkpoint when an output_path is given
    # (the directory name here is just an example)
    output_path="sts-camembert-base",
    save_best_model=True,
)
Model evaluation
Once training is complete, you can measure the model's performance on the test dataset you held out:
sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)
sts_test_evaluator(model, ".")
I get a Pearson correlation of 0.837, which is on par with the Sentence-CamemBERT models I found on Hugging Face:
Model | Pearson Correlation | Parameters |
---|---|---|
h4c5/sts-camembert-base | 0.837 | 110M |
Lajavaness/sentence-camembert-base | 0.835 | 110M |
inokufu/flaubert-base-uncased-xnli-sts | 0.828 | 137M |
h4c5/sts-distilcamembert-base | 0.817 | 68M |
sentence-transformers/distiluse-base-multilingual-cased-v2 | 0.786 | 135M |
Distilled Sentence-BERT model
As you may have noticed in the table above, I've also trained a Sentence-CamemBERT model that's about half the size (68M parameters vs. 110M) and yet performs very well: h4c5/sts-distilcamembert-base.
This is in fact a model obtained by following the above procedure but starting from the distilled CamemBERT model: cmarkea/distilcamembert-base.
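Concretely, the only change to the recipe above is the checkpoint passed to SentenceTransformer:

# Same finetuning procedure, but starting from the distilled checkpoint
model = SentenceTransformer("cmarkea/distilcamembert-base")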
This so-called "distilled" model was obtained by removing half of the layers of the CamemBERT base model and training it to maintain its performance.
To find out more about the distillation process, please consult the following papers:
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- DistilCamemBERT: a distilled version of the French CamemBERT
Et voilà. You can find my two Sentence-CamemBERT models on Hugging Face: h4c5/sts-camembert-base and h4c5/sts-distilcamembert-base.
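Either one can be loaded and used for sentence embedding in a few lines (the example sentences below are just placeholders):

from sentence_transformers import SentenceTransformer

# Load the finetuned Sentence-CamemBERT model from the Hugging Face Hub
model = SentenceTransformer("h4c5/sts-camembert-base")

# Encode French sentences into fixed-size embedding vectors
embeddings = model.encode(["Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa."])
print(embeddings.shape)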