Fahad Ali Khan

Posted on Sep 21

Enhancing EnglishFormatter: A Journey into Open Source Contribution

#webdev #beginners #programming #tutorial

Introduction

In the world of software development, open-source contributions not only help improve projects but also provide invaluable learning experiences. In this blog post, I'll share my journey of contributing to the EnglishFormatter project—a command-line tool designed to format, summarize, and paraphrase text documents using advanced language models.
Watch some of the functionality here

Project Overview

EnglishFormatter is a C++ command-line application that uses large language models (LLMs) to process text files. It offers an interactive menu for selecting actions and supports customization through command-line flags.

Features:

Format Documents: Enhance the readability and structure of text.
Summarize Documents: Generate concise summaries.
Paraphrase Documents: Rephrase text while retaining the original meaning.
Customizable Models: Choose different language models via the --model flag.
Output Customization: Define custom suffixes for output files using the --output flag.
Token Usage Reporting: Display token usage information with the new --token-usage flag.

GitHub Repository: EnglishFormatter

The New Feature: Token Usage Reporting

Understanding token usage is crucial when working with LLMs due to context length limitations and cost considerations. I added a new command-line flag --token-usage (or -t) that, when used, reports detailed information about the number of tokens sent in the prompt and received in the response.

Why Token Usage?

Cost Management: Helps users estimate the cost of API calls.
Debugging: Assists in optimizing prompts to stay within token limits.
Performance Tuning: Enables users to fine-tune inputs for better responses.

Getting Started with EnglishFormatter

Prerequisites

A C++17 (or later) compiler: Required to build the project.
libcurl: For HTTP requests.
nlohmann/json: For JSON parsing.
dotenv-cpp: To manage environment variables.
An API Key: From the language model provider (e.g., OpenAI).

Installation

Clone the Repository

   git clone https://github.com/yourusername/EnglishFormatter.git
   cd EnglishFormatter

Install Dependencies

libcurl

Windows: Use the pre-built binaries.
Linux:

   sudo apt-get install libcurl4-openssl-dev

nlohmann/json

Download the json.hpp file from the official repository.

dotenv-cpp

 git clone https://github.com/motdotla/dotenv-cpp.git

Include dotenv.h and dotenv.cpp in your project.

Set Up Environment Variables

Create a .env file in the project root:

   API_KEY=your_api_key_here

Build the Project

   g++ -std=c++17 -o EnglishFormatter main.cpp eng_format.cpp display.cpp dotenv.cpp -lcurl -pthread

Configuration

API Key

Ensure your API key is set in the .env file or as an environment variable.
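
For the environment-variable route, a lookup can be sketched with the standard std::getenv call. The read_env helper below is a hypothetical name and is not taken from the EnglishFormatter source; loading the .env file through dotenv-cpp is not shown here.

```cpp
#include <cstdlib>
#include <string>

// Look up an environment variable by name.
// Returns the fallback when the variable is unset or empty.
std::string read_env(const char* name, const std::string& fallback = "") {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    return std::string(value);
}
```

Calling read_env("API_KEY") and checking for an empty result lets the tool fail early with a clear message instead of sending an unauthenticated request.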

Default Settings

Model: Default is llama3-8b-8192.
Output Suffix: Default is _modified.
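
One way a suffix could be applied to a file name is sketched below. This is an assumption: the post does not say whether the suffix goes before the extension or after the whole name, so the output_name helper and its behavior (inserting the suffix before the extension) are hypothetical and may not match the tool.

```cpp
#include <cstddef>
#include <string>

// Insert a suffix before the file extension, if there is one.
// Assumed behavior: "sample.txt" + "_modified" -> "sample_modified.txt".
std::string output_name(const std::string& input, const std::string& suffix) {
    const std::size_t slash = input.find_last_of("/\\");
    const std::size_t dot = input.find_last_of('.');

    // A dot counts as an extension separator only when it sits inside the
    // file name part and is not the first character of that name.
    const bool has_extension =
        dot != std::string::npos && dot != 0 &&
        (slash == std::string::npos || dot > slash + 1);

    if (!has_extension) {
        return input + suffix;
    }
    return input.substr(0, dot) + suffix + input.substr(dot);
}
```

Under this assumption, a name without an extension simply gets the suffix appended.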

Using EnglishFormatter

Run the application from the command line:

./EnglishFormatter [options]

Command-Line Options

Options:
  -h, --help           Show this help message and exit
  -v, --version        Show the tool's version and exit
  -t, --token-usage    Show token usage information
  -m, --model MODEL    Specify the model to use
  -o, --output NAME    Specify the output file suffix

Interactive Menu

Without any options, the tool launches an interactive menu:

./EnglishFormatter

Menu Options:

Format document
Summarize document
Paraphrase document
Exit

Navigate using the Up and Down arrow keys and select with Enter.

Example Usage

Formatting a Document with Token Usage Information

./EnglishFormatter --token-usage --model gpt-3.5-turbo --output _formatted

Select: Format document
Enter: sample.txt

Expected Output:

Token Usage for sample.txt:
  Prompt Tokens:     50
  Completion Tokens: 200
  Total Tokens:      250
sample.txt has been Format document.

Summarizing a Document

./EnglishFormatter --token-usage

Select: Summarize document
Enter: report.txt

Expected Output:

Token Usage for report.txt:
  Prompt Tokens:     100
  Completion Tokens: 30
  Total Tokens:      130
report.txt has been Summarize document.

Behind the Scenes: Code Modifications

Adding the Command-Line Flag

In cli.cpp, I introduced a new boolean variable showTokenUsage and updated the argument parsing loop:

bool showTokenUsage = false;

// Inside the argument parsing loop, where arg is the current argument:
if (arg == "--token-usage" || arg == "-t") {
    showTokenUsage = true;
}
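
For context, the fragment above could sit in a parsing loop like the following sketch. The Options struct and parse_args function are hypothetical names, not the project's actual code; the flag names and default values are the ones listed in this post. The --help and --version flags are left out for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for parsed options; defaults follow the post.
struct Options {
    bool showTokenUsage = false;
    std::string model = "llama3-8b-8192";
    std::string outputSuffix = "_modified";
};

// Walk the argument list once. Flags that take a value consume the next
// argument; anything this sketch does not recognize is ignored.
Options parse_args(const std::vector<std::string>& args) {
    Options opts;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "--token-usage" || arg == "-t") {
            opts.showTokenUsage = true;
        } else if ((arg == "--model" || arg == "-m") && i + 1 < args.size()) {
            opts.model = args[++i];
        } else if ((arg == "--output" || arg == "-o") && i + 1 < args.size()) {
            opts.outputSuffix = args[++i];
        }
    }
    return opts;
}
```

From main, this could be called as parse_args(std::vector<std::string>(argv + 1, argv + argc)), which skips the program name.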

Parsing Token Usage from the API Response

In eng_format.cpp, I updated the parse_response method to extract token usage:

struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
    int total_tokens = 0;
};

// Assumes `using json = nlohmann::json;` earlier in the file.
std::string eng_format::parse_response(const std::string& response, TokenUsage& tokenUsage) {
    json jsonResponse = json::parse(response);

    // The "usage" object may be absent; each count falls back to 0 if its key is missing.
    if (jsonResponse.contains("usage")) {
        const json& usage = jsonResponse["usage"];
        tokenUsage.prompt_tokens = usage.value("prompt_tokens", 0);
        tokenUsage.completion_tokens = usage.value("completion_tokens", 0);
        tokenUsage.total_tokens = usage.value("total_tokens", 0);
    }

    // Extract the assistant's reply...
}

Displaying Token Usage Information

In convert_file, I added a condition to output token usage:

if (showTokenUsage) {
    std::cerr << "Token Usage for " << filename << ":\n";
    std::cerr << "  Prompt Tokens:     " << tokenUsage.prompt_tokens << "\n";
    std::cerr << "  Completion Tokens: " << tokenUsage.completion_tokens << "\n";
    std::cerr << "  Total Tokens:      " << tokenUsage.total_tokens << "\n";
}

Passing the Flag Through Classes

I updated the constructors and methods in both the display and eng_format classes to accept the showTokenUsage flag.
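
The general pattern of threading a flag through constructors can be sketched as follows. The Menu and Formatter classes are hypothetical stand-ins for display and eng_format; the real classes have different members and signatures.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the eng_format class: it stores the flag so the
// code that prints the report can consult it later.
class Formatter {
public:
    Formatter(std::string model, bool showTokenUsage)
        : model_(std::move(model)), showTokenUsage_(showTokenUsage) {}

    bool shows_token_usage() const { return showTokenUsage_; }
    const std::string& model() const { return model_; }

private:
    std::string model_;
    bool showTokenUsage_;
};

// Hypothetical stand-in for the display class: it owns the formatter and
// forwards the flag it received from the command-line parser.
class Menu {
public:
    Menu(std::string model, bool showTokenUsage)
        : formatter_(std::move(model), showTokenUsage) {}

    const Formatter& formatter() const { return formatter_; }

private:
    Formatter formatter_;
};
```

Passing the flag through constructors keeps it out of global state, at the cost of touching every class on the path from the parser to the code that prints the report.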

Challenges and Learnings

Understanding Existing Code: It was crucial to grasp the original code structure to make seamless additions.
API Response Handling: Ensuring that the token usage information was correctly extracted required careful parsing.
Maintaining Code Style: Adhering to the original coding style was essential for consistency.
Collaboration: Communicating with the project owner helped refine the feature and fix issues.

Conclusion

Contributing to EnglishFormatter was an enriching experience that enhanced my understanding of open-source collaboration and LLM integrations. The addition of token usage reporting makes the tool more transparent and helpful for users mindful of API costs and token limitations.

Video Demo: EnglishFormatter Demo (Please watch the demo to see the tool in action.)

Future Work

Support for Multiple LLM Providers: Extend compatibility with other APIs.
Enhanced Error Handling: Improve feedback for network issues or invalid inputs.
Additional Features: Implement text translation or grammar checking.

Call to Action

If you're interested in improving text processing tools or learning more about LLMs, feel free to contribute to the EnglishFormatter project. Whether it's adding new features, fixing bugs, or enhancing documentation, your contributions are welcome!

Contribute Here: EnglishFormatter on GitHub

Thank you for reading! If you have any questions or suggestions, please leave a comment below or open an issue on GitHub.
