francesco agati
Generate Q&A from Wikipedia Pages with Pydantic, Instructor, and Phi-3 LLM

Introduction

With a small language model like Phi-3 and a schema-driven validation library like Pydantic, we can generate question-and-answer pairs from Wikipedia pages. By combining Pydantic for data validation, the Instructor library for structured data extraction, and Microsoft's Phi-3 language models for efficient local inference, we can transform large blocks of text into informative Q&A pairs.

Exploring a Python Script for Generating Q&A Pairs from Text

Let's break down a Python script that uses several libraries to generate question and answer pairs from a given text, specifically from a Wikipedia page. The script involves error handling, data modeling, and API interaction. Here's a simple explanation of how the code works.

Importing Libraries

The script begins by importing the necessary libraries:

from openai import OpenAI
import instructor
import wikipedia
from typing import Iterable, List, Optional, Union
from pydantic import BaseModel, Field
  • OpenAI: Interacts with the OpenAI API.
  • Instructor: Patches the OpenAI client so that API responses are returned as structured, validated data.
  • Wikipedia: Fetches content from Wikipedia.
  • Typing: Provides type hints for better code clarity.
  • Pydantic: Helps in data validation and settings management.

Setting Up the OpenAI Client

Next, the script sets up a client to interact with the OpenAI API:

client = instructor.from_openai(
    OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # required, but unused
    ),
    mode=instructor.Mode.JSON,
)

This snippet initializes the client with a local base URL and a placeholder API key, using Ollama as the LLM server. The OpenAI client requires an api_key value, but Ollama never checks it.

Error Handling with a Decorator

The script defines a decorator function to handle errors:

def rescue_error(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(f"Error: {e}")
            return None
    return wrapper

The rescue_error decorator wraps around functions to catch and print errors, returning None if an error occurs.
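To see the decorator in action, here is a minimal, self-contained sketch. The divide function is a hypothetical example, not part of the article's script:

```python
def rescue_error(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(f"Error: {e}")
            return None
    return wrapper

@rescue_error
def divide(a, b):
    return a / b

print(divide(10, 2))  # 5.0
print(divide(10, 0))  # prints "Error: division by zero" and returns None
```

Because the wrapper swallows the exception and returns None, callers must be prepared to handle a None result, which is exactly how the script treats failed chunks later on.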

Defining Data Models with Pydantic

The script uses Pydantic to define data models for question and answer pairs:

class QA(BaseModel):
    question: str = Field(..., description="The question to ask.")
    answer: str = Field(
        ..., description="The answer that corresponds to the question with almost 100 characters."
    )
    tags: List[str] = Field(
        ..., description="The tags keywords for indexing the question and answer pair."
    )

class Qas(BaseModel):
    qas: List[QA] = Field(..., description="The list of question and answer pairs.")

Pydantic is a Python library for data validation and settings management using Python type annotations. Here are the key points about Pydantic:

  • Data Validation: Ensures data conforms to specified types, simplifying error handling and data analysis.
  • Models: Classes that inherit from BaseModel, analogous to types in statically typed languages. Pydantic guarantees that the fields of a resulting model instance conform to the declared types.
  • Type Hints: Uses Python type hints to control schema validation and serialization.
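A quick self-contained sketch of this validation behavior, reusing the article's QA model: valid input produces a typed instance, while invalid input raises a ValidationError instead of silently passing through.

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

class QA(BaseModel):
    question: str = Field(..., description="The question to ask.")
    answer: str = Field(..., description="The answer to the question.")
    tags: List[str] = Field(..., description="Keywords for indexing.")

# Valid input: every field matches its declared type.
qa = QA(
    question="What is Pydantic?",
    answer="A Python library for data validation.",
    tags=["python", "validation"],
)
print(qa.tags)

# Invalid input: missing required fields raises ValidationError.
try:
    QA(question="A question with no answer or tags")
except ValidationError:
    print("validation failed")
```

This is the guarantee Instructor relies on: if the LLM's JSON output does not match the schema, validation fails loudly rather than producing a malformed object.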

Instructor Library

Instructor is a lightweight Python library designed to extract structured data from large language models (LLMs) such as GPT-3.5, GPT-4, and open-source models. Here are the key points about Instructor:

  • Structured Outputs: Simplifies obtaining structured outputs from LLMs.
  • Built on Pydantic: Utilizes Pydantic for schema validation and prompt control through type annotations.
  • Customizable: Allows custom validators and error messages.
  • Broad Support: Compatible with various programming languages including Python, TypeScript, Ruby, Go, and Elixir.
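Under the hood, Instructor passes the JSON Schema derived from your Pydantic model to the LLM as the target output format. A small sketch of what that schema looks like for the article's models, assuming Pydantic v2:

```python
from typing import List
from pydantic import BaseModel, Field

class QA(BaseModel):
    question: str = Field(..., description="The question to ask.")
    answer: str = Field(..., description="The answer to the question.")
    tags: List[str] = Field(..., description="Keywords for indexing.")

class Qas(BaseModel):
    qas: List[QA] = Field(..., description="The list of question and answer pairs.")

schema = Qas.model_json_schema()
print(list(schema["properties"]))        # the top-level "qas" field
print(schema["$defs"]["QA"]["required"]) # ['question', 'answer', 'tags']
```

The field descriptions become part of the schema, so they double as prompt instructions to the model.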

Generating Q&A Pairs

The generate_qas function generates question and answer pairs from a given text:

@rescue_error
def generate_qas(text: str) -> Qas:
    response = client.chat.completions.create(
        model="phi3:medium",
        messages=[
            {
                "role": "user",
                "content": f"Generate 5 question and answer pairs from the following text: {text}",
            }
        ],
        response_model=Qas,
    )

    for qa in response.qas:
        print("question: ", qa.question)
        print("answer: ", qa.answer)
        print("tags: ", qa.tags)

    return response

This function sends a request to the OpenAI-compatible endpoint (Ollama, in this setup) to generate five question-and-answer pairs from the provided text, with response_model=Qas instructing the client to parse and validate the output against the Qas schema. The decorator ensures any errors are caught and handled gracefully.

Fetching and Processing Wikipedia Content

Finally, the script fetches content from a Wikipedia page and processes it:

topic = "buddhism history"
page: wikipedia.WikipediaPage = wikipedia.page(topic)
content = page.content
x = 4000
text_chunks = [content[i : i + x] for i in range(0, len(content), x)]
text_chunks_qas = [generate_qas(chunk) for chunk in text_chunks]
  • Fetching Content: Retrieves the content of a Wikipedia page on "buddhism history".
  • Chunking Text: Splits the content into chunks of 4000 characters.
  • Generating Q&A: Generates question and answer pairs for each chunk of text.
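The chunking step above can be illustrated with a small self-contained example, using a short string in place of the Wikipedia content and a chunk size of 8 instead of 4000:

```python
content = "abcdefghij" * 3  # 30 characters standing in for page.content
x = 8
text_chunks = [content[i : i + x] for i in range(0, len(content), x)]
print(len(text_chunks))  # 4 chunks
print(text_chunks[-1])   # the final chunk holds the 6 leftover characters
```

Note that fixed-size character chunking can cut a sentence in half at a chunk boundary; for a quick script that is an acceptable trade-off, but splitting on paragraph or sentence boundaries would give the model cleaner input.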

Phi-3: Compact and Powerful LLMs by Microsoft

Phi-3 is a family of small but highly capable language models developed by Microsoft. Here are the key points about Phi-3:

  • Model Variants: Includes Phi-3-mini (3.8B parameters), Phi-3-small (7B), Phi-3-medium (14B), and Phi-3-Vision (4.2B) for multimodal applications.
  • Performance: Despite their small size, Phi-3 models perform comparably to or better than much larger models, thanks to innovative training techniques.
  • Availability: Phi-3-mini is publicly available on platforms like Azure, Hugging Face, and Ollama, and can run locally on resource-limited devices like smartphones.
  • Advantages: Optimized for on-device, edge, and offline execution, offering cost, privacy, and accessibility benefits.
  • Applications: Suitable for tasks like document summarization, content generation, and support chatbots, especially on personal devices with limited computational resources.
  • Trends: Represents a shift towards more efficient and accessible AI models, democratizing AI usage across various devices.

This script demonstrates a practical application of Python for generating question and answer pairs from text. It uses error handling, data modeling, and external APIs to create a robust and efficient process for extracting meaningful information from large text sources. The use of Pydantic ensures data validation, while the Instructor library simplifies structured data extraction from LLMs. Additionally, the integration of Microsoft's Phi-3 models highlights the advancements in compact and efficient AI models, making sophisticated AI accessible even on resource-limited devices.
