Sávio Santos
NoisOCR: A Python Library for Simulating Post-OCR Noisy Texts

NoisOCR is a Python library that simulates the noise found in text produced by Optical Character Recognition (OCR). Such text may contain errors or annotations, reflecting the difficulty of running OCR on low-quality documents or manuscripts. The library simulates common post-OCR errors and partitions texts into sliding windows, with or without hyphenation. The resulting noisy text can help train neural network models for spelling correction.

GitHub Repository: NoisOCR

PyPI: NoisOCR on PyPI

Features

  • Sliding windows: Split long texts into smaller segments without breaking words.
  • Sliding windows with hyphenation: Use hyphenation to fit words within character limits.
  • Simulate text errors: Add random errors to simulate low-accuracy post-OCR texts.
  • Simulate text annotations: Insert annotations like those found in the BRESSAY dataset to mark words or phrases in the text.

Installation

You can easily install NoisOCR via pip:

pip install noisocr

Usage Examples

1. Sliding Window

This function divides a text into segments of limited size, keeping the words intact.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]

2. Sliding Window with Hyphenation

When using hyphenation, the function attempts to fit words that exceed the character limit per window by inserting hyphens as necessary. This functionality supports multiple languages through the PyHyphen package.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',        
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]

3. Simulating Text Errors

The simulate_errors function adds random errors to the text, emulating issues commonly found in post-OCR texts. The errors are generated with the typo library and include character swaps, missing spaces, extra characters, and more. Because the errors are random, the outputs shown below are examples and will differ between runs.

import noisocr

text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Example output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Example output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Example output: fllo,w0rlr!

4. Simulating Text Annotations

The annotation simulation feature allows the user to add custom markings to the text based on a set of annotations, including those from the BRESSAY dataset.

import noisocr

text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Example output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Example output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Example output (text usually left unchanged at this probability): Hello, world!

Code Overview

NoisOCR's core functions build on two libraries: typo for simulating errors and hyphen (from the PyHyphen package) for hyphenating words in different languages. The critical functions are explained below.

1. simulate_annotation Function

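The simulate_annotation function selects a random word from the text and, with the given probability, annotates it using a template drawn from a defined set of annotations. Texts with fewer than two words are returned unchanged.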

import random

# Annotation templates. Those containing 'text' wrap the chosen word;
# the others replace it outright.
annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@', '##--xxx--##', 
    '$$--xxx--$$', '--xxx--', '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()

    # Texts with fewer than two words are returned unchanged.
    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text

    # Annotate only with the given probability.
    if random.random() < probability:
        annotation = random.choice(annotations)
        if 'text' in annotation:
            annotated_text = annotation.replace('text', target_word)
        else:
            annotated_text = annotation

        # Replace only the first occurrence of the chosen word in the text.
        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text
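The listing above substitutes the annotation with text.replace(target_word, annotated_text, 1), which replaces the first occurrence of the chosen string. That occurrence can sit inside an earlier word: choosing "a" in "class a" would alter "class". A minimal sketch of a position-based variant follows. The name annotate_word_at_index and the rng parameter are hypothetical and not part of NoisOCR's API, and joining the words back with single spaces normalizes the original whitespace.

```python
import random

def annotate_word_at_index(text, annotations, probability=0.01, rng=None):
    """Annotate one randomly chosen word, addressed by its position.

    Hypothetical variant for illustration; not part of NoisOCR's API.
    """
    rng = rng or random.Random()
    words = text.split()

    # Same guard as the original: texts with fewer than two words are
    # returned unchanged, and annotation happens only with `probability`.
    if len(words) <= 1 or rng.random() >= probability:
        return text

    index = rng.randrange(len(words))
    annotation = rng.choice(annotations)
    # Templates containing 'text' wrap the word; for the others,
    # str.replace finds nothing and the template replaces the word.
    words[index] = annotation.replace('text', words[index])
    return " ".join(words)
```

Passing a seeded random.Random instance as rng makes a run reproducible without touching the global random state.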

2. simulate_errors Function

The simulate_errors function applies a series of errors to the text, each chosen at random from the methods provided by the typo library.

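import random
import typo

def simulate_errors(text, interactions=3, seed=None):
    # Names of the typo.StrErrer methods that can be applied.
    methods = [
        "char_swap", "missing_char", "extra_char", "nearby_char", "similar_char",
        "skipped_space", "random_space", "repeated_char", "unichar"
    ]

    # Seed the global generator: a given seed makes the choice repeatable,
    # while random.seed() with no argument reseeds from the system.
    if seed is not None:
        random.seed(seed)
    else:
        random.seed()

    # Pick one error method at random and apply it to the text.
    instance = typo.StrErrer(text)
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result

    # Recurse to apply further errors. One error is applied before this
    # check, so a call applies interactions + 1 errors in total.
    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)

    return text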
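To experiment with the same idea without installing typo, the sketch below applies a chosen number of random error operations using only the standard library. It is an illustration, not NoisOCR's implementation: the function names, the three operations, and the LOOKALIKES table are all assumptions made for this example. It also differs from the listing above in two ways. It loops instead of recursing, so exactly `rounds` operations run, and it creates one seeded generator up front, so a fixed seed does not restart the random sequence on every round.

```python
import random

# Hypothetical look-alike pairs, chosen for illustration only.
LOOKALIKES = {"o": "0", "l": "1", "e": "c", "s": "5", "m": "rn"}

def swap_chars(text, rng):
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def drop_char(text, rng):
    """Delete one character at a random position."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def lookalike_char(text, rng):
    """Replace one character with a visually similar one, if any qualifies."""
    candidates = [i for i, ch in enumerate(text) if ch in LOOKALIKES]
    if not candidates:
        return text
    i = rng.choice(candidates)
    return text[:i] + LOOKALIKES[text[i]] + text[i + 1:]

def add_noise(text, rounds=3, seed=None):
    """Apply exactly `rounds` randomly chosen error operations."""
    rng = random.Random(seed)  # seeded once, before the loop
    operations = [swap_chars, drop_char, lookalike_char]
    for _ in range(rounds):
        text = rng.choice(operations)(text, rng)
    return text
```

Calling add_noise("Hello, world!", rounds=2, seed=42) twice returns the same noisy string both times, which is convenient when a test set of noisy texts needs to be regenerated.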

3. sliding_window and sliding_window_with_hyphenation Functions

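These functions split the text into windows of limited size, with or without hyphenation. Only the hyphenating variant is listed here: it fills each window word by word and, when a word does not fit, places as many of its syllables as possible and carries the rest over to the next window.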

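from hyphen import Hyphenator

def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []   # words collected for the window being built
    remaining_word = ""   # part of a word carried over to the next window

    for word in words:
        # Prepend whatever was carried over from the previous window.
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""

        # The word fits: keep filling the current window.
        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            current_window.append(word)
        else:
            # The word does not fit: place as many syllables as possible.
            syllables = hyphenator.syllables(word)
            temp_word = ""
            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        # Close the window with the hyphenated head of the
                        # word and carry the remaining syllables over.
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        # Not even the first syllable fits: carry the whole word.
                        remaining_word = word + " "
                        break
            else:
                # The loop finished without a break.
                current_window.append(temp_word)
                remaining_word = ""

            # The current window is complete; start a new one.
            windows.append(" ".join(current_window))
            current_window = []

    # Flush anything left after the last word.
    if remaining_word:
        current_window.append(remaining_word)
    if current_window:
        windows.append(" ".join(current_window))

    return windows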
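The listing above covers only the hyphenating variant. For comparison, here is a minimal sketch of the non-hyphenating case, which greedily packs whole words into each window. It is not NoisOCR's implementation: the name sliding_window_sketch and the decision to give an over-long word a window of its own are assumptions made for this example.

```python
def sliding_window_sketch(text, window_size=80):
    """Greedily pack whole words into windows of at most window_size characters."""
    windows = []
    current = []       # words in the window being built
    current_len = 0    # length of " ".join(current)

    for word in text.split():
        # Window length if this word were appended (with a separating space).
        needed = len(word) if not current else current_len + 1 + len(word)
        if needed <= window_size or not current:
            # A word longer than window_size gets a window of its own
            # rather than being split (assumption of this sketch).
            current.append(word)
            current_len = needed
        else:
            # The word does not fit: close the window and start a new one.
            windows.append(" ".join(current))
            current = [word]
            current_len = len(word)

    if current:
        windows.append(" ".join(current))
    return windows


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
print(sliding_window_sketch(text, 50))
# ['Lorem Ipsum is simply dummy text of the printing', 'and typesetting industry.']
```

Since words are never split, rejoining the windows with single spaces reproduces the input whenever the input itself uses single spaces between words.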

Conclusion

NoisOCR provides tools for anyone working on post-OCR text correction, making it easier to simulate real-world scenarios in which digitized texts contain errors and annotations. It can support automated testing, the development of text correction models, and the analysis of datasets like BRESSAY.

Check out the project on GitHub: NoisOCR and contribute to its improvement!
