
Naresh Nishad

Day 37: Named Entity Recognition (NER) with LLMs

Introduction

Named Entity Recognition (NER) is a core Natural Language Processing (NLP) task that identifies and classifies entities such as names, locations, organizations, and dates in text. With Large Language Models (LLMs), NER systems now reach high accuracy and adapt readily to new domains.

Why LLMs for NER?

  • Contextual Understanding: Models such as BERT and GPT capture the surrounding context, enabling accurate entity recognition even in ambiguous cases (see the short sketch after this list).
  • Transfer Learning: Pretrained models can be fine-tuned on domain-specific datasets for high-performance NER tasks.
  • Scalability: Minimal effort is required to adapt LLMs to new languages or datasets.
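
To see this contextual behavior without any training, a pretrained token-classification pipeline can be run directly. The sketch below is a minimal example; dslim/bert-base-NER is just one publicly available NER checkpoint on the Hugging Face Hub, and the sentences are made up for illustration.

# Minimal sketch: off-the-shelf NER with a pretrained checkpoint
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# "Jordan" is ambiguous (country vs. person); the surrounding context drives the prediction
print(ner("Jordan signed a trade agreement with Egypt last week."))
print(ner("Jordan scored 45 points for the Chicago Bulls in 1996."))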

Steps to Implement NER

1. Data Preparation

  • Collect labeled NER datasets (e.g., CoNLL-2003, OntoNotes); a quick look at the CoNLL-2003 structure is sketched after this list.
  • Preprocess data to match the input requirements of the chosen LLM.
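
Before any fine-tuning, it helps to inspect how the data is structured. The short sketch below loads CoNLL-2003 through the datasets library and prints its label set and one example; treat it as an exploratory snippet rather than part of the training script.

# Quick look at the CoNLL-2003 dataset structure
from datasets import load_dataset

dataset = load_dataset("conll2003")

# IOB2 label names, e.g. 'O', 'B-PER', 'I-PER', 'B-ORG', ...
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)

# Each example is a list of pre-tokenized words with one integer tag per word
sample = dataset["train"][0]
print(sample["tokens"])
print(sample["ner_tags"])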

2. Model Selection

  • Popular models: BERT, DistilBERT, RoBERTa, GPT, spaCy transformer-based pipelines.
  • Use the Hugging Face transformers library for quick implementation (a short sketch follows this list).
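
Because the transformers Auto classes share one API, swapping checkpoints is mostly a one-line change. A small sketch: num_labels=9 matches CoNLL-2003's nine IOB2 tags and is used here only as an example, and the freshly initialized classification heads still need fine-tuning.

# The same token-classification API works across different encoder checkpoints
from transformers import AutoTokenizer, AutoModelForTokenClassification

for checkpoint in ["bert-base-cased", "distilbert-base-cased", "roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)
    print(checkpoint, "->", model.config.num_labels, "labels")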

3. Fine-tuning

  • Fine-tune the model using frameworks like PyTorch or TensorFlow on labeled data.

4. Evaluation

  • Use entity-level metrics such as precision, recall, and F1-score to evaluate performance; a tiny worked example follows this list.
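
For NER these metrics are computed at the entity level rather than per token, which the seqeval package handles. A tiny worked example with made-up tag sequences:

# Entity-level precision/recall/F1 with seqeval (toy sequences for illustration)
import evaluate

seqeval = evaluate.load("seqeval")

references  = [["B-PER", "I-PER", "O", "B-LOC"]]   # two gold entities: one PER, one LOC
predictions = [["B-PER", "I-PER", "O", "O"]]       # the LOC entity was missed

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"], results["overall_f1"])
# 1 of 1 predicted entities is correct (precision 1.0); 1 of 2 gold entities found (recall 0.5); F1 ≈ 0.67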

Example: Fine-Tuning BERT for NER

Here’s an end-to-end example using Hugging Face's transformers, datasets, and evaluate libraries:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification, Trainer, TrainingArguments
from datasets import load_dataset
from evaluate import load

# Load dataset
dataset = load_dataset("conll2003")

# Load tokenizer and model
model_name = "bert-base-uncased"  # a cased checkpoint (e.g., "bert-base-cased") often works better for NER, since capitalization is a strong cue
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Get number of labels
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)

# Load model with proper configuration
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={i: label for i, label in enumerate(label_list)},
    label2id={label: i for i, label in enumerate(label_list)}
)

def tokenize_and_align_labels(examples):
    # Padding is left to the data collator below, so each batch is padded dynamically
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        max_length=512
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)
            elif word_id != previous_word_id:
                label_ids.append(label[word_id])
            else:
                # Label only the first sub-token of each word; ignore continuation tokens
                label_ids.append(-100)
            previous_word_id = word_id
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Process datasets
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# Load metric
metric = load("seqeval")

def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    predictions = predictions.argmax(axis=-1)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=2,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none",  # Disable external experiment tracking (e.g., wandb)
)

# Pad inputs and labels dynamically to the longest sequence in each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Check if CUDA is available (Trainer handles device placement automatically; this is informational only)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Train model
trainer.train()

# Evaluate model
results = trainer.evaluate()
print("\nEvaluation Results:")
print(results)

# Save the model
trainer.save_model("./final_model")
print("\nModel saved to ./final_model")
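
After training, the saved checkpoint can be loaded straight into a pipeline for inference. A minimal sketch, assuming the ./final_model directory produced above; the example sentence is arbitrary.

# Run inference with the fine-tuned model saved above
from transformers import pipeline

ner = pipeline("token-classification", model="./final_model", aggregation_strategy="simple")

text = "Barack Obama visited Microsoft headquarters in Seattle."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))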

Applications of NER

  • Healthcare: Extracting medical entities from patient records.
  • Finance: Identifying company names, financial events, and monetary amounts.
  • Customer Service: Enhancing chatbots by extracting names, dates, and product mentions from user messages.
  • Content Curation: Automating tagging of articles and media.

Challenges and Tips

  • Ambiguity: Use larger models or ensemble techniques for disambiguation.
  • Domain-Specific Entities: Fine-tune on domain-specific datasets.
  • Evaluation: Watch for class imbalance; report per-entity-type scores alongside overall metrics to avoid a skewed picture.

Conclusion

NER powered by LLMs has transformed information extraction, making it faster and more reliable across industries. By leveraging LLMs, we can unlock insights from unstructured text with ease.
