Sebastian

LLM Fine-Tuning: Domain Embeddings with GPT-3

Starting in 2023, Large Language Models evolved into information retrieval systems, either standing alone or as one component. In such a system, domain knowledge is encoded in a special format; given a user query, the most relevant chunks from the knowledge base are determined, and an answer is formulated from them. For an LLM, the knowledge base is everything contained in its training material. However, given the learned vector representations of words, other content can be embedded into the same vector space. In this vector space, a similarity search between the user query and the stored knowledge identifies the context from which the LLM answers. This is an LLM retrieval system in a nutshell.
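As a mental model, the whole pipeline fits into a few lines of code. The following sketch uses hypothetical helpers - embed, cosine_distance and llm_answer - which the rest of this article implements with real APIs:

# Conceptual sketch of an LLM retrieval system; embed(), cosine_distance()
# and llm_answer() are hypothetical helpers, implemented later in this article.

def answer(query, knowledge_chunks):
    # 1. Embed the domain knowledge and the user query into the same vector space
    chunk_embeddings = [(chunk, embed(chunk)) for chunk in knowledge_chunks]
    query_embedding = embed(query)

    # 2. Rank all chunks by vector distance to the query
    ranked = sorted(
        chunk_embeddings,
        key=lambda item: cosine_distance(query_embedding, item[1]),
    )

    # 3. Let the LLM answer the question from the best-matching context
    best_chunk, _ = ranked[0]
    return llm_answer(context=best_chunk, question=query)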

This article shows how to use GPT-3 embeddings for designing a question-answering system - the third possible approach outlined in my previous article. GPT-3, initially released in 2020, showed astonishing capabilities to produce text that is hard to distinguish from human-written text. It is an advanced language model trained on billions of internet resources, like Wikipedia, books, and plain web crawls. Furthermore, a rich API exists for text generation as well as for creating embeddings. You will learn how to use this API to create embeddings, and how to use these embeddings for a similarity search given a user query.

The technical context of this article is Python v3.11, OpenAI's API wrapper openai v1.12.0, and the helper libraries scipy v1.12.0 and wikipedia-api v0.6.0. All instructions should work with newer versions too, but you might need to use another OpenAI model because older models are being phased out.

This article originally appeared at my blog admantium.com.

GPT-3 Model Overview

OpenAI provides different models via its API. When this article was originally written in early 2022, API access was only granted to selected companies. Later, individual developers were granted access, and since 2023, the API is open to every developer.

Another difference between starting this article in early 2022 and early 2024 is the set of available models. In essence, OpenAI deprecates older models and changes the provided API functions. Originally, the following GPT-3 models were available:

  • text-davinci-002
  • text-curie-001
  • text-babbage-001
  • text-ada-001

As of 2024, the list distinguishes models by their context window and general capabilities - see gpt-3-5-turbo for a full description.

  • gpt-3.5-turbo-0125: The most recent version, higher accuracy for output formatting, context window is 16,385 tokens
  • gpt-3.5-turbo-1106: Improved instruction following, context window is 16,385 tokens
  • gpt-3.5-turbo-instruct: Same capabilities as GPT-3 models, and a context window of 4,096 tokens

Required Python Libraries

The essential library for this project is openai, supported by two helper libraries. Install them with the poetry dependency manager as shown:

poetry init --quiet

poetry add openai scipy wikipedia-api
# Using version ^1.12.0 for openai
# Using version ^1.12.0 for scipy
# Using version ^0.6.0 for wikipedia-api

# Updating dependencies
# Resolving dependencies... (1.6s)

# Package operations: 7 installs, 0 updates, 0 removals

#   • Installing h11 (0.14.0)
#   • Installing httpcore (1.0.4)
#   • Installing distro (1.9.0)
#   • Installing httpx (0.27.0)
#   • Installing openai (1.12.0)
#   • Installing scipy (1.12.0)
#   • Installing wikipedia-api (0.6.0)

# Writing lock file

OpenAI Python Library

The quintessential input to language generation with GPT-3 is the prompt. It's not just a simple question: you can define the role of a message as user, system or assistant, as well as structure the prompt into different sections. This primes the GPT-3 model and leads to a more nuanced and accurate answer.
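For illustration, a message list that combines all three roles could look like this (a minimal sketch; the content strings are made up):

messages = [
    # system: sets the overall behavior of the assistant
    {"role": "system", "content": "You are a concise geography tutor."},
    # user: the actual question
    {"role": "user", "content": "What is the capital of Germany?"},
    # assistant: an earlier model answer, kept for multi-turn context
    {"role": "assistant", "content": "The capital of Germany is Berlin."},
    # a follow-up user message that builds on the conversation so far
    {"role": "user", "content": "And how many people live there?"},
]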

The OpenAI library provides several API endpoints for specific use cases, including working with text, audio and images, as well as one for embeddings. For text input, the chat completion endpoint is used.

Here is an example asking for a statistical fact:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

query = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "How many inhabitants are living in Berlin?",
        }
    ],
    model="gpt-3.5-turbo-instruct"
)

And the response object is this:

ChatCompletion(
  id='chatcmpl-8vmJ3K67ZApZrq5M0ZNiVJY7hld6i',
  choices=[Choice(finish_reason='stop',
  index=0,
  logprobs=None,
  message=ChatCompletionMessage(content='As of 2021,
  the population of Berlin is approximately 3.7 million inhabitants.',
  role='assistant',
  function_call=None,
  tool_calls=None))],
  created=1708781077,
  model='gpt-3.5-turbo-0125',
  object='chat.completion',
  system_fingerprint='fp_86156a94a0',
  usage=CompletionUsage(completion_tokens=19,
  prompt_tokens=15,
  total_tokens=34)
)

The object returned from the API contains meta information about the model, the consumed and generated tokens, and a Choice object that contains the answer. Interestingly, the answer clearly reflects that this GPT-3.5 model cannot access data after its 2021 training cutoff. Providing newer content to this model is another use case for our question-answering system design approach.
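The answer text and the token usage can be read directly from the object's attributes (using the query object from above):

# The generated answer is nested in the first choice's message
print(query.choices[0].message.content)
# As of 2021, the population of Berlin is approximately 3.7 million inhabitants.

# Token consumption, e.g. for cost tracking
print(query.usage.total_tokens)
# 34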

Question Answering with Plaintext

When the exact context containing the answer to a question is known, it can simply be added as-is to the prompt. The following example shows how to formulate the first two paragraphs of the Wikipedia article about NASA as the context for a user query.

# Source: Wikipedia, NASA, https://en.wikipedia.org/wiki/NASA
article_text = '''
The National Aeronautics and Space Administration (NASA /ˈnæsə/) is an independent agency of the U.S. federal government responsible for the civil space program, aeronautics research, and space research. Established in 1958, it succeeded the National Advisory Committee for Aeronautics (NACA) to give the U.S. space development effort a distinctly civilian orientation, emphasizing peaceful applications in space science. It has since led most American space exploration, including Project Mercury, Project Gemini, the 1968–1972 Apollo Moon landing missions, the Skylab space station, and the Space Shuttle. It currently supports the International Space Station and oversees the development of the Orion spacecraft and the Space Launch System for the crewed lunar Artemis program, the Commercial Crew spacecraft, and the planned Lunar Gateway space station.

NASA's science is focused on better understanding Earth through the Earth Observing System; advancing heliophysics through the efforts of the Science Mission Directorate's Heliophysics Research Program; exploring bodies throughout the Solar System with advanced robotic spacecraft such as New Horizons and planetary rovers such as Perseverance; and researching astrophysics topics, such as the Big Bang, through the James Webb Space Telescope, the Great Observatories and associated programs. The Launch Services Program oversees launch operations and countdown management for its uncrewed launches.
'''

question="What is NASA?"

client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a question-answering assistant. Answer truthfully. If you do not know the answer, say 'I don't have access to this information'",
        },
        {
            "role": "user",
            "content": f'''
            Context: {article_text}

            Question: {question}''',
        }
    ],
    model="gpt-3.5-turbo"
)

In this example, the message list starts with a system role message that sets the general behavior for all subsequent interactions. The second message structures the query into a context and a question section, providing structured information to the LLM.

Question Answering with Embeddings

To use embeddings for a question-answering system, several steps need to be considered:

  • Create embeddings for chunked texts
  • Store the embeddings
  • For a user query, perform a similarity search with the embeddings
  • Retrieve embeddings and formulate a prompt

Let’s detail and realize these steps with individual Python functions.

Step 1: Create Embeddings

The OpenAI embedding API can be used via the client library. Expected parameters are the embedding model and the input text. At the time of writing this article, three embedding models are available: text-embedding-3-small, text-embedding-3-large and text-embedding-ada-002.

Here is an example:

embedding_model = "text-embedding-3-small"
client.embeddings.create(input=[text], model=embedding_model)

Using the above defined article_text variable, which contains the first two paragraphs of the Wikipedia article about NASA, yields this embedding:

# CreateEmbeddingResponse(
#   data=[Embedding(embedding=[-0.02948867715895176, 0.014214463531970978, 0.059668492525815964, ...], index=0, object='embedding')],
#   model='text-embedding-3-small',
#   object='list',
#   usage=Usage(prompt_tokens=279, total_tokens=279))

The embedding itself can be accessed as res.data[0].embedding. And similar to the chat completion object, the response contains meta information about the processed tokens.
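As a quick sanity check: text-embedding-3-small returns 1,536-dimensional vectors by default, which can be verified on the response object:

res = client.embeddings.create(input=[article_text], model="text-embedding-3-small")

# Each input text yields one embedding vector
vector = res.data[0].embedding
print(len(vector))
# 1536

# Token meta information, analogous to the chat completion object
print(res.usage.total_tokens)
# 279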

Step 2: Store Embeddings

The next step is to load the Wikipedia article's content and split it into paragraphs of at least 100 characters (this also removes headings and empty paragraphs). For this, the handy wikipedia-api library will be used.

import wikipediaapi

user_agent = "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.36"
language = "en"

wiki = wikipediaapi.Wikipedia(user_agent, language)

nasa_page = wiki.page('NASA')

chunks = [chunk for chunk in nasa_page.text.split('\n') if len(chunk) > 100]

In total, that gives 100 chunks. The first three chunks are as follows:

print(chunks[0:3])
# ['The National Aeronautics and Space Administration (NASA ) is an independent agency of the U.S. federal government responsible for the civil space program, aeronautics research, and space research. Established in 1958, it succeeded the National Advisory Committee for Aeronautics (NACA) to give the U.S. space development effort a distinctly civilian orientation, emphasizing peaceful applications in space science. It has since led most American space exploration, including Project Mercury, Project Gemini, the 1968–1972 Apollo Moon landing missions, the Skylab space station, and the Space Shuttle. It currently supports the International Space Station and oversees the development of the Orion spacecraft and the Space Launch System for the crewed lunar Artemis program, the Commercial Crew spacecraft, and the planned Lunar Gateway space station.',
#  "NASA's science is focused on better understanding Earth through the Earth Observing System; advancing heliophysics through the efforts of the Science Mission Directorate's Heliophysics Research Program; exploring bodies throughout the Solar System with advanced robotic spacecraft such as New Horizons and planetary rovers such as Perseverance; and researching astrophysics topics, such as the Big Bang, through the James Webb Space Telescope, the Great Observatories and associated programs. The Launch Services Program oversees launch operations and countdown management for its uncrewed launches.",
#  "NASA traces its roots to the National Advisory Committee for Aeronautics (NACA). Despite being the birthplace of aviation, by 1914 the United States recognized that it was far behind Europe in aviation capability. Determined to regain American leadership in aviation, Congress created the Aviation Section of the U.S. Army Signal Corps in 1914 and established NACA in 1915 to foster aeronautical research and development. Over the next forty years NACA would conduct aeronautical research in support of the U.S. Air Force, its predecessors in the U.S. Army, the U.S. Navy, and the civil aviation sector. After the end of World War II, NACA became interested in the possibilities of guided missiles and supersonic aircraft, developing and testing the Bell X-1 in a joint program with the U.S. Air Force. NACA's interest in space grew out of its rocketry program at the Pilotless Aircraft Research Division."],

For these chunks, embeddings are calculated and stored as tuples containing the original text as well as the embeddings. Here is the relevant source code:

def embedd(client, text):
    embedding_model = "text-embedding-3-small"
    res = client.embeddings.create(input=[text], model=embedding_model)
    return res.data[0].embedding

embeddings = [(chunk, embedd(client,chunk)) for chunk in chunks]

The first item in this list is this:

embeddings[0]
# (
#   'The National Aeronautics and Space Administration (NASA ) is ...',
#   [-0.024839315563440323, 0.004018288571387529, 0.061975762248039246, ...]
# )
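In this article, the embeddings simply live in an in-memory list. As a minimal sketch of persisting them between runs - assuming plain JSON files are sufficient, whereas a production system would typically use a vector database - the tuples can be serialized like this:

import json

def save_embeddings(embeddings, path="embeddings.json"):
    # Store the (text, vector) tuples as JSON arrays
    with open(path, "w") as file:
        json.dump(embeddings, file)

def load_embeddings(path="embeddings.json"):
    # JSON has no tuple type, so restore the tuple structure on load
    with open(path) as file:
        return [(text, vector) for text, vector in json.load(file)]

save_embeddings(embeddings)
embeddings = load_embeddings()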

Step 3: Similarity Search for User Queries

The user query is passed to the embeddings API, and then a local similarity search is made against the stored embeddings. Note that scipy's cosine function computes the cosine distance, where smaller values mean closer matches. The result list is therefore sorted by ascending distance, ordering it from best to worst match. Each item in the list is a tuple of the form (distance, text, embedding).
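A small toy example illustrates this distance convention (the vectors are made up):

from scipy.spatial.distance import cosine

# Identical direction: distance 0.0 (perfect match)
print(cosine([1.0, 0.0], [2.0, 0.0]))
# 0.0

# Orthogonal vectors: distance 1.0 (no similarity)
print(cosine([1.0, 0.0], [0.0, 1.0]))
# 1.0

The search function itself: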

from scipy.spatial.distance import cosine

def similarity_search(client, query, embeddings):
  emq = embedd(client, query)
  similar_embeddings = [(cosine(emq, emc), text, emc) for text, emc in embeddings]

  # cosine() returns a distance, so the smallest value is the best match
  return sorted(similar_embeddings, key=lambda item: item[0])

Here is an example, asking about the founding year of NASA:

query = "When was NASA founded?"

similarity_search(client, query, embeddings)
# [(0.3510536381739262,
# 'The National Aeronautics and Space Administration (NASA ) is an independent agency of the U.S. federal government responsible for the civil space program, aeronautics research, and space research. Established in 1958 ...',
#  [-0.024839315563440323, 0.004018288571387529, ...]),
#   ...
# ]

Step 4: Prompt Generation

The final step is to formulate the prompt that contains the system role definition, the user query and its context.

The first method generates the prompt, using only the best match of the similarity search.

def generate_prompt(client, query, embeddings):
  matches = similarity_search(client, query, embeddings)
  _, context, _ = matches[0]

  question_prompt = {
    "role": "user",
    "content": f'''
    Context: {context}

    Question: {query}
    '''
  }

  return question_prompt

And the second method wraps the chat completion API call, combining the system message with the generated prompt.

def embedding_qa(client, query, embeddings):
  system_message = {
    "role": "system",
    "content": "You are a question-answering assistant. Answer truthfully. If you do not know the answer, say 'I don't have access to this information'",
  }
  question_prompt = generate_prompt(client, query, embeddings)

  response = client.chat.completions.create(
    messages= [system_message, question_prompt],
    model="gpt-3.5-turbo"
  )

  return response.choices[0].message

Question Answering with Domain Embeddings

With all methods implemented, we formulate a query about NASA's founding date, and use the embeddings created from the same article.

query = "When was NASA founded?"

embedding_qa(client, query, embeddings)
# ChatCompletionMessage(content='NASA was founded in 1958.', role='assistant', function_call=None, tool_calls=None)

The similarity search ordered all paragraphs and put the very first paragraph at the top. The GPT-3.5 model then processed this context and found the required information.

Comparing this approach to the two former approaches (linguistic fine-tuning and QA fine-tuning), several benefits emerge: a) scalability, since any textual data source can be embedded and used for similarity search; b) dynamicity, since the Python methods can be used in a production system to continuously add new embeddings or retrieve the most up-to-date version; c) quality, since Gen3 and Gen4 LLMs formulate answers themselves instead of just annotating parts of the context.
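Benefit b) can be sketched with the functions defined above: any new document is chunked and appended to the embedding store at runtime. The following add_document helper is hypothetical and merely reuses the embedd function from step 2:

def add_document(client, text, embeddings, min_chunk_len=100):
    # Split the new document into paragraphs, skipping headings and short lines
    new_chunks = [c for c in text.split('\n') if len(c) > min_chunk_len]
    # Embed each chunk and append it to the existing (text, vector) store
    embeddings.extend((chunk, embedd(client, chunk)) for chunk in new_chunks)
    return embeddings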

Conclusion

This article showed how to create a question-answering system using domain embeddings. The approach consists of four steps that were implemented as Python functions: a) load a text, separate it into chunks, and generate embeddings; b) store the embeddings; c) embed a user query and determine the most relevant embeddings by calculating the cosine distance; and d) formulate a prompt that distinguishes the system role and the user message, and in which the user message clearly mentions the context in which the LLM should find the answer. With this approach, several benefits are realized: better scalability to access textual information, dynamicity to use the most up-to-date information, and overall better quality when using Gen3 and Gen4 models. The next article shows how to fine-tune a model using instruction datasets.
