Sávio Santos

Posted on Oct 12

NoisOCR: A Python Library for Simulating Post-OCR Noisy Texts

#ocr #spellingcorrection #python #languagemodel

NoisOCR is a Python library for simulating the noise found in text produced by Optical Character Recognition (OCR). Such text often contains recognition errors or transcription annotations, especially when the source is a low-quality scan or a handwritten manuscript. The library simulates common post-OCR errors and splits text into sliding windows, with or without hyphenation. The resulting noisy text can be used to train neural network models for spelling correction.

GitHub Repository: NoisOCR

PyPI: NoisOCR on PyPI

Features

Sliding windows: Split long texts into smaller segments without breaking words.
Sliding windows with hyphenation: Use hyphenation to fit words within character limits.
Simulate text errors: Add random errors to simulate post-OCR low-accuracy texts.
Simulate text annotations: Insert annotations like those found in the BRESSAY dataset to mark words or phrases in the text.

Installation

Install NoisOCR with pip:

pip install noisocr

Usage Examples

1. Sliding Window

This function divides a text into segments of limited size, keeping the words intact.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]

2. Sliding Window with Hyphenation

With hyphenation, a word that would overflow the window is split at a syllable boundary: the part that fits stays in the current window with a trailing hyphen, and the rest is carried into the next window. Hyphenation supports multiple languages through the PyHyphen package.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',        
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]

3. Simulating Text Errors

The simulate_errors function adds random errors to the text, emulating issues commonly found in post-OCR output. The errors come from the typo library and include character swaps, missing spaces, extra characters, and more. The interactions argument controls how many rounds of errors are applied.

import noisocr

text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Possible output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Possible output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Possible output: fllo,w0rlr!

4. Simulating Text Annotations

The simulate_annotation function inserts markings into the text, drawn from a configurable set of annotations that includes those from the BRESSAY dataset. The probability argument sets the chance that a call annotates a word.

import noisocr

text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Possible output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Possible output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, probability=0.01)
# Possible output: Hello, world!

Code Overview

NoisOCR builds on two libraries: typo for simulating errors and hyphen (PyHyphen) for hyphenating words in different languages. The core functions are explained below.

1. `simulate_annotation` Function

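The simulate_annotation function picks a random word from the text and, with the given probability, replaces its first occurrence with an annotation from a defined set. Annotations that contain the placeholder text wrap the chosen word; the others replace it entirely. Texts with fewer than two words are returned unchanged.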

import random

annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@', '##--xxx--##', 
    '$$--xxx--$$', '--xxx--', '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()

    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text

    if random.random() < probability:
        annotation = random.choice(annotations)
        if 'text' in annotation:
            annotated_text = annotation.replace('text', target_word)
        else:
            annotated_text = annotation

        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text
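One detail of the listing above is that text.replace(target_word, annotated_text, 1) replaces the first occurrence of the chosen characters, which may sit inside an earlier word: for "cat a" with target word "a", the result is "c##a##t a". The sketch below shows one way to avoid that by annotating a word by position. It is a hypothetical variant, not part of NoisOCR; the name annotate_by_index and the rng parameter are assumptions made for illustration.

```python
import random

def annotate_by_index(text, annotations, probability=0.01, rng=random):
    """Hypothetical variant: annotate one whole word chosen by position."""
    words = text.split()
    # As in the listing above, short texts and unlucky draws are returned unchanged.
    if len(words) <= 1 or rng.random() >= probability:
        return text

    index = rng.randrange(len(words))
    annotation = rng.choice(annotations)
    # Templates containing 'text' wrap the word; the others replace it outright
    # (str.replace leaves a template without 'text' unchanged).
    words[index] = annotation.replace('text', words[index])
    return " ".join(words)
```

Because the words are rejoined with single spaces, this variant normalizes runs of whitespace in annotated texts, which the original does not. Passing a random.Random instance as rng makes the result reproducible.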

2. `simulate_errors` Function

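The simulate_errors function picks one error method at random from the typo library, applies it to the text, and then calls itself until interactions reaches zero. As written, the text therefore passes through interactions + 1 error methods.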

import random
import typo

def simulate_errors(text, interactions=3, seed=None):
    methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char", "skipped_space", "random_space", "repeated_char", "unichar"]

    if seed is not None:
        random.seed(seed)
    else:
        random.seed()

    instance = typo.StrErrer(text)
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result

    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)

    return text
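In the listing above, every recursive call re-seeds the global generator with the same seed, so with a fixed seed random.choice selects the same method on each pass. A loop that draws from a single generator avoids this. The following is a minimal sketch of that idea, not the library's implementation: to stay free of external dependencies it replaces the typo methods with three simple stand-in functions, and the names apply_errors, passes, and STAND_IN_METHODS are assumptions.

```python
import random

def swap_adjacent(text, rng):
    """Stand-in error: swap two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def drop_char(text, rng):
    """Stand-in error: delete one character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def repeat_char(text, rng):
    """Stand-in error: duplicate one character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]

# Stand-ins for the typo library's methods, used only in this sketch.
STAND_IN_METHODS = [swap_adjacent, drop_char, repeat_char]

def apply_errors(text, passes=3, seed=None):
    """Apply exactly `passes` random errors, drawing from one seeded generator."""
    rng = random.Random(seed)  # created once, so each pass draws a fresh value
    for _ in range(passes):
        method = rng.choice(STAND_IN_METHODS)
        text = method(text, rng)
    return text
```

With this structure, passes=0 returns the text unchanged, and the same seed always produces the same noisy text without touching the global random state.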

3. `sliding_window` and `sliding_window_with_hyphenation` Functions

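These functions split the text into sliding windows, with or without hyphenation. The listing below shows the hyphenating variant, which uses PyHyphen's Hyphenator to break a word into syllables when it does not fit in the current window.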

from hyphen import Hyphenator

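def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []
    remaining_word = ""  # carry-over from a word split at the end of the previous window

    for word in words:
        # Prepend the carry-over (it ends with a space) to the next word.
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""

        # The word fits: add it to the current window.
        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            current_window.append(word)
        else:
            # The word does not fit: add as many syllables as the window allows.
            syllables = hyphenator.syllables(word)
            temp_word = ""
            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        # Keep the part that fits, hyphenated, and carry the rest over.
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        # Not even the first syllable fits: carry the whole word over.
                        remaining_word = word + " "
                        break
            else:
                # No break: every syllable fit, so the whole word goes in this window.
                current_window.append(temp_word)
                remaining_word = ""

            # Close the current window and start a new one.
            windows.append(" ".join(current_window))
            current_window = []

    # Flush any carry-over and the last partial window.
    if remaining_word:
        current_window.append(remaining_word)
    if current_window:
        windows.append(" ".join(current_window))

    return windows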
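The variant without hyphenation is not listed in this article. As a rough illustration of the behaviour described in the usage section (segments of limited size with words kept intact), here is a minimal sketch of greedy word packing. It is an assumption about how such a function could work, not NoisOCR's actual code, and the name sliding_window_sketch is hypothetical.

```python
def sliding_window_sketch(text, window_size=80):
    """Greedy word packing: start a new window when the next word would overflow."""
    windows = []
    current = []       # words in the window being built
    current_len = 0    # length of " ".join(current)

    for word in text.split():
        # Joined length if this word is added (one extra space unless the window is empty).
        needed = len(word) if not current else current_len + 1 + len(word)
        if needed <= window_size or not current:
            # A single word longer than window_size still gets a window of its own.
            current.append(word)
            current_len = needed
        else:
            windows.append(" ".join(current))
            current = [word]
            current_len = len(word)

    if current:
        windows.append(" ".join(current))
    return windows

sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
print(sliding_window_sketch(sample, 50))
# ['Lorem Ipsum is simply dummy text of the printing', 'and typesetting industry.']
```

Since words are never split in this sketch, a word longer than the limit produces a window that exceeds it; the hyphenating function above exists to handle that case.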

Conclusion

NoisOCR provides tools for those working on post-OCR text correction, making it easier to simulate real-world scenarios where digitized texts are prone to errors and annotations. It can be used for automated testing, for developing text correction models, or for analyzing datasets like BRESSAY.

Check out the project on GitHub: NoisOCR and contribute to its improvement!
