
Fahad Ali Khan


Enhancing EnglishFormatter: A Journey into Open Source Contribution

Introduction

Open-source contributions improve projects and are a valuable way to learn. In this post, I'll share my experience contributing to EnglishFormatter, a command-line tool that formats, summarizes, and paraphrases text documents using large language models.
Watch some of the functionality here

Project Overview

EnglishFormatter is a C++ command-line application that sends text files to large language models (LLMs) for processing. It offers an interactive menu for selecting actions and can be customized through command-line flags.

Features:

  • Format Documents: Enhance the readability and structure of text.
  • Summarize Documents: Generate concise summaries.
  • Paraphrase Documents: Rephrase text while retaining the original meaning.
  • Customizable Models: Choose different language models via the --model flag.
  • Output Customization: Define custom suffixes for output files using the --output flag.
  • Token Usage Reporting: Display token usage information with the new --token-usage flag.

GitHub Repository: EnglishFormatter

The New Feature: Token Usage Reporting

Token usage matters when working with LLMs because context length is limited and API cost typically scales with the number of tokens. I added a new command-line flag, --token-usage (or -t), that reports how many tokens were sent in the prompt and how many were received in the response.

Why Token Usage?

  • Cost Management: Helps users estimate the cost of API calls.
  • Debugging: Assists in optimizing prompts to stay within token limits.
  • Performance Tuning: Enables users to fine-tune inputs for better responses.
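
To make the cost-management point concrete, here is a minimal sketch of turning reported token counts into a rough cost estimate. The estimateCost helper is hypothetical (it is not part of EnglishFormatter), and the per-1,000-token prices are placeholders you would replace with your provider's actual rates.

```cpp
// Token counts as reported by the API (same field names the tool prints).
struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
};

// Hypothetical helper: prices are per 1,000 tokens and are placeholders,
// not real provider rates. Prompt and completion tokens are priced separately.
double estimateCost(const TokenUsage& usage,
                    double promptPricePer1k,
                    double completionPricePer1k) {
    return usage.prompt_tokens / 1000.0 * promptPricePer1k +
           usage.completion_tokens / 1000.0 * completionPricePer1k;
}
```

Separating the two prices reflects the fact that many providers charge differently for input and output tokens.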

Getting Started with EnglishFormatter

Prerequisites

  • C++17 or higher: Required for compiling the code.
  • libcurl: For HTTP requests.
  • nlohmann/json: For JSON parsing.
  • dotenv-cpp: To manage environment variables.
  • An API Key: From the language model provider (e.g., OpenAI).

Installation

  1. Clone the Repository
   git clone https://github.com/yourusername/EnglishFormatter.git
   cd EnglishFormatter
  2. Install Dependencies

     • libcurl
       • Windows: Use the pre-built binaries.
       • Linux:
           sudo apt-get install libcurl4-openssl-dev

     • nlohmann/json: Download the json.hpp file from the official repository.

     • dotenv-cpp:
           git clone https://github.com/motdotla/dotenv-cpp.git
       Include dotenv.h and dotenv.cpp in your project.

  3. Set Up Environment Variables

Create a .env file in the project root:

   API_KEY=your_api_key_here
  4. Build the Project
   g++ -std=c++17 -o EnglishFormatter main.cpp eng_format.cpp display.cpp dotenv.cpp -lcurl -pthread

Configuration

  • API Key

Ensure your API key is set in the .env file or as an environment variable.

  • Default Settings

    • Model: Default is llama3-8b-8192.
    • Output Suffix: Default is _modified.
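
As an illustration of the API key lookup, here is a minimal sketch that reads the key from the process environment and reports when it is missing. The readApiKey helper is a hypothetical name, and the sketch assumes the variable is called API_KEY as in the .env example; loading the .env file itself is left to the dotenv library and is not shown.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Returns the value of the given environment variable, or std::nullopt
// if it is unset or empty. "API_KEY" matches the .env example in this post.
std::optional<std::string> readApiKey(const char* name = "API_KEY") {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return std::nullopt;
    }
    return std::string(value);
}
```

A caller could check the result at startup and exit with a clear message instead of sending a request that is bound to fail.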

Using EnglishFormatter

Run the application from the command line:

./EnglishFormatter [options]

Command-Line Options

Options:
  -h, --help           Show this help message and exit
  -v, --version        Show the tool's version and exit
  -t, --token-usage    Show token usage information
  -m, --model MODEL    Specify the model to use
  -o, --output NAME    Specify the output file suffix

Interactive Menu

Without any options, the tool launches an interactive menu:

./EnglishFormatter

Menu Options:

  • Format document
  • Summarize document
  • Paraphrase document
  • Exit

Navigate using the Up and Down arrow keys and select with Enter.
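
The arrow-key navigation boils down to moving an index through the list of menu entries. Below is a minimal sketch of that logic, assuming the selection wraps around at both ends (the actual tool may stop at the first and last entry instead). The Key enum and moveSelection function are hypothetical names; reading raw key presses from the terminal is platform-specific and is omitted.

```cpp
#include <cstddef>

enum class Key { Up, Down, Enter, Other };

// Returns the new highlighted index after a key press.
// Wrapping at both ends is an assumption, not confirmed behavior of the tool.
std::size_t moveSelection(std::size_t current, Key key, std::size_t count) {
    if (count == 0) return 0;
    switch (key) {
        case Key::Up:   return (current + count - 1) % count;
        case Key::Down: return (current + 1) % count;
        default:        return current;  // Enter and other keys do not move the highlight
    }
}
```

Keeping this logic in a pure function makes it easy to test without a terminal.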

Example Usage

Formatting a Document with Token Usage Information

./EnglishFormatter --token-usage --model gpt-3.5-turbo --output _formatted
  • Select: Format document
  • Enter: sample.txt

Expected Output:

Token Usage for sample.txt:
  Prompt Tokens:     50
  Completion Tokens: 200
  Total Tokens:      250
sample.txt has been Format document.
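
This post does not show how the --output suffix is turned into a file name, so the following is only a sketch of one plausible scheme: insert the suffix between the file's base name and its extension. The outputName function is a hypothetical helper, not the project's code.

```cpp
#include <cstddef>
#include <string>

// Assumed naming scheme: "sample.txt" + "_formatted" -> "sample_formatted.txt".
// If the file has no extension, the suffix is simply appended.
std::string outputName(const std::string& input, const std::string& suffix) {
    const std::size_t slash = input.find_last_of("/\\");
    const std::size_t dot = input.rfind('.');
    // No extension if there is no dot, or the last dot belongs to a directory name.
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash)) {
        return input + suffix;
    }
    return input.substr(0, dot) + suffix + input.substr(dot);
}
```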

Summarizing a Document

./EnglishFormatter --token-usage
  • Select: Summarize document
  • Enter: report.txt

Expected Output:

Token Usage for report.txt:
  Prompt Tokens:     100
  Completion Tokens: 30
  Total Tokens:      130
report.txt has been Summarize document.

Behind the Scenes: Code Modifications

Adding the Command-Line Flag

In cli.cpp, I introduced a new boolean variable showTokenUsage and updated the argument parsing loop:

bool showTokenUsage = false;

// Inside the argument parsing loop, where arg holds the current argument:
if (arg == "--token-usage" || arg == "-t") {
    showTokenUsage = true;
}
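
For context, here is a self-contained sketch of how such a parsing loop might handle the boolean flag alongside the options that take a value. The Options struct and parseArgs function are hypothetical, the defaults are the ones listed in the Configuration section, and the help and version options are left out for brevity.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Options {
    bool showTokenUsage = false;
    std::string model = "llama3-8b-8192";   // default from the Configuration section
    std::string outputSuffix = "_modified"; // default from the Configuration section
};

// Hypothetical parser; args excludes the program name (argv[0]).
Options parseArgs(const std::vector<std::string>& args) {
    Options opts;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "--token-usage" || arg == "-t") {
            opts.showTokenUsage = true;
        } else if (arg == "--model" || arg == "-m" ||
                   arg == "--output" || arg == "-o") {
            // These options consume the next argument as their value.
            if (i + 1 >= args.size()) {
                throw std::invalid_argument("missing value for " + arg);
            }
            const std::string& value = args[++i];
            if (arg == "--model" || arg == "-m") {
                opts.model = value;
            } else {
                opts.outputSuffix = value;
            }
        } else {
            throw std::invalid_argument("unknown option: " + arg);
        }
    }
    return opts;
}
```

In main, the arguments could be collected with std::vector<std::string>(argv + 1, argv + argc) before calling parseArgs.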

Parsing Token Usage from the API Response

In eng_format.cpp, I updated the parse_response method to extract token usage:

struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
    int total_tokens = 0;
};

std::string eng_format::parse_response(const std::string& response, TokenUsage& tokenUsage) {
    // json is assumed to be an alias for nlohmann::json.
    json jsonResponse = json::parse(response);

    // The "usage" object may be absent, so each field falls back to 0.
    if (jsonResponse.contains("usage")) {
        const json& usage = jsonResponse["usage"];
        tokenUsage.prompt_tokens = usage.value("prompt_tokens", 0);
        tokenUsage.completion_tokens = usage.value("completion_tokens", 0);
        tokenUsage.total_tokens = usage.value("total_tokens", 0);
    }

    // Extract the assistant's reply...
}

Displaying Token Usage Information

In convert_file, I added a condition to output token usage:

if (showTokenUsage) {
    std::cerr << "Token Usage for " << filename << ":\n";
    std::cerr << "  Prompt Tokens:     " << tokenUsage.prompt_tokens << "\n";
    std::cerr << "  Completion Tokens: " << tokenUsage.completion_tokens << "\n";
    std::cerr << "  Total Tokens:      " << tokenUsage.total_tokens << "\n";
}

Passing the Flag Through Classes

I updated the constructors and methods in both the display and eng_format classes to accept the showTokenUsage flag.
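
A minimal sketch of this kind of flag forwarding is shown here. Menu and Formatter are simplified, hypothetical stand-ins for the project's display and eng_format classes, which have more members and responsibilities than this.

```cpp
// Simplified stand-in for the class that talks to the LLM and prints the report.
class Formatter {
public:
    explicit Formatter(bool showTokenUsage) : showTokenUsage_(showTokenUsage) {}
    bool showsTokenUsage() const { return showTokenUsage_; }

private:
    bool showTokenUsage_;
};

// Simplified stand-in for the interactive menu. It receives the flag from
// main() and forwards it to the formatter it owns.
class Menu {
public:
    explicit Menu(bool showTokenUsage) : formatter_(showTokenUsage) {}
    const Formatter& formatter() const { return formatter_; }

private:
    Formatter formatter_;
};
```

Passing the flag through constructors keeps it out of global state, at the cost of touching every class on the path from main() to the code that prints the report.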

Challenges and Learnings

  • Understanding Existing Code: It was crucial to grasp the original code structure to make seamless additions.
  • API Response Handling: Ensuring that the token usage information was correctly extracted required careful parsing.
  • Maintaining Code Style: Adhering to the original coding style was essential for consistency.
  • Collaboration: Communicating with the project owner helped refine the feature and fix issues.

Conclusion

Contributing to EnglishFormatter was an enriching experience that enhanced my understanding of open-source collaboration and LLM integrations. The addition of token usage reporting makes the tool more transparent and helpful for users mindful of API costs and token limitations.

GitHub Repository: EnglishFormatter

Video Demo: EnglishFormatter Demo (Please watch the demo to see the tool in action.)

Future Work

  • Support for Multiple LLM Providers: Extend compatibility with other APIs.
  • Enhanced Error Handling: Improve feedback for network issues or invalid inputs.
  • Additional Features: Implement text translation or grammar checking.

Call to Action

If you're interested in improving text processing tools or learning more about LLMs, feel free to contribute to the EnglishFormatter project. Whether it's adding new features, fixing bugs, or enhancing documentation, your contributions are welcome!

Contribute Here: EnglishFormatter on GitHub


Thank you for reading! If you have any questions or suggestions, please leave a comment below or open an issue on GitHub.
