Large Language Models and Generative AI
Generative AI is a branch of machine learning concerned with models that create entirely new outputs after being trained on large numbers of examples. They can generate images, videos, music, and text. A Large Language Model (LLM) is a type of model that generates text; at a basic level, it predicts the next word in a sequence of words. The best-known examples are products like OpenAI's ChatGPT and Google's Bard.
The Silly Idea
An LLM is known to be large - it's in the name, after all. An AWS Lambda function is meant to be small; AWS Lambda is a Function-as-a-Service offering where code is executed in microVMs in response to events, with invocations typically running in milliseconds. These two things don't really go together.
Even so, I came across some open-source LLMs, such as GPT4All, that can run on a desktop PC using only a CPU (rather than a GPU). This made me wonder whether it was possible to run one inside a Lambda function. Lambda functions are not meant for complex processing tasks, but they do scale to zero, and you only pay for what you use. Having an LLM inside a Lambda function seemed like a fun experiment and a way to have a hosted model without a server running a long-lived process.
The Toolkit
GPT4All
First, we need a model. GPT4All is perfect because it runs on CPUs rather than GPUs, and the available models are between 3GB and 8GB. This is key for a Lambda function, which is capped at 10GB for both memory and package size and only runs on CPUs.
There are several models to choose from, but I went for ggml-model-gpt4all-falcon-q4_0.bin because it is a smaller model (4GB) that gives good responses. Other models should work, but they need to be small enough to fit within the Lambda memory limits.
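The model file itself has to be fetched so it can be baked into the container image at build time. A minimal sketch of doing that in Python - the URL below is a placeholder, and the real download link comes from the GPT4All model list:
import urllib.request

# Hypothetical URL: take the real link from the GPT4All model list
MODEL_URL = "https://example.com/models/ggml-model-gpt4all-falcon-q4_0.bin"
MODEL_PATH = "./ggml-model-gpt4all-falcon-q4_0.bin"

# Fetch the ~4GB model file so it can be copied into the image
urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)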
LangChain
LangChain is a framework written in Python and JavaScript that can be used to build applications related to LLMs. It's therefore a great way to build a basic application with the potential to extend later on with integrations and features around ChatGPT, DynamoDB, Web Searching, Caching, and much more.
One of LangChain's components provides integration with GPT4All, making it possible to load the model from a file location and pass in a prompt with very few lines of code, as sketched below.
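A rough sketch of what that looks like (the prompt is just an example):
from langchain.llms import GPT4All

# Load the model from a local file and run a single prompt
llm = GPT4All(model="./ggml-model-gpt4all-falcon-q4_0.bin")
print(llm("The player opens the old wooden door. What happens next?"))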
Python Container Image
Lambda function deployment packages are limited to 250MB, whilst a container image can be up to 10GB. Therefore, this could only work as a container image, which also allows customization of the operating system and installed dependencies.
I started off with the official AWS base image for Python, running the container locally and using curl to post invocation requests to it. Unfortunately, this didn't work: the version of GCC within the image wasn't compatible, and updating packages through commands in the Dockerfile didn't help.
I then tried the Amazon Linux 2 image for building custom runtimes, as well as the Amazon Linux 2023 preview image, but had similar issues. I therefore ended up with a non-AWS base image, the official python image, which meant installing the Lambda Runtime Interface Client and the Lambda Runtime Interface Emulator so it could run as a function and be tested locally.
After this, it started to work, but the function would crash due to memory issues, so I had to increase Docker's memory limit from 2GB to 8GB, after which it ran fine.
Please see GitHub for the Dockerfile.
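For local testing, the curl call can equally be scripted. A sketch, assuming the container is running with the Runtime Interface Emulator and port 9000 published (e.g. -p 9000:8080):
import json
import requests

# The Runtime Interface Emulator exposes the standard Lambda invoke path locally
url = "http://localhost:9000/2015-03-31/functions/function/invocations"

# Mimic the event a function URL would deliver: the HTTP body as a JSON string
event = {"body": json.dumps({"action": "Throw your sword at the angry orc"})}

response = requests.post(url, json=event)
print(response.json())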
Function Code
The Lambda function code doesn't need to be very complex. I used some example code from LangChain that loads the model and returns the output as a string. It would benefit from streaming the response, and adding state with some LangChain modules would make it more useful, but this is enough for a proof of concept.
Below is the function code with extra comments; the repository on GitHub has the full application with Dockerfile.
import json

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import GPT4All

# A default template that will be combined with the input to form the prompt
template = """The player performs the action: {action} What happens next?"""
prompt = PromptTemplate(template=template, input_variables=["action"])

# The location of the model to load
local_path = "./ggml-model-gpt4all-falcon-q4_0.bin"

# LangChain code to load the GPT4All model
llm = GPT4All(model=local_path, verbose=True)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# The Lambda function handler
def handler(event, context):
    # Load an "action" from the request body
    body = json.loads(event.get("body", "{}"))
    action = body.get("action")

    # Basic validation
    if action is None:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'No action was provided'})
        }

    # Where all the work happens
    response = llm_chain.run(action)

    # Return the full response in JSON format
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }
Deploying
Once I had a working container image that could be run locally, I pushed it to ECR and created a Lambda function from it. The function needed at least 8GB of memory, but I went for the maximum of 10GB to also increase the CPU allocation. It can take up to 2 minutes to return a response when warm, so I set the timeout to 8 minutes for testing (cold starts could take 5 minutes).
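For reference, the equivalent configuration can be sketched with boto3; the function name, role ARN, and image URI below are placeholders:
import boto3

lambda_client = boto3.client("lambda")

# Placeholders: substitute your own function name, execution role ARN, and ECR image URI
lambda_client.create_function(
    FunctionName="gpt4all-lambda",
    Role="arn:aws:iam::123456789012:role/lambda-execution-role",
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/gpt4all-lambda:latest"},
    MemorySize=10240,  # maximum memory, which also scales up the CPU allocation
    Timeout=480,       # 8 minutes, to leave room for cold starts
)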
The easiest way to test this was either through Lambda test events or by adding a Lambda function URL and invoking it remotely with a tool like Postman.
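The function URL can also be called from a short script; the URL below is a placeholder:
import requests

# Placeholder: replace with the function URL created for the Lambda function
function_url = "https://abc123.lambda-url.eu-west-1.on.aws/"

# The function URL passes the raw HTTP body through to the handler as event["body"]
response = requests.post(
    function_url,
    json={"action": "A wizard casts a spell on the player"},
    timeout=600,
)
print(response.json())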
Results
For this example, I set up the prompt template to frame a text-based RPG. I then tested it with requests containing actions, with the LLM acting as the storyteller.
Action: "Throw your sword at the angry orc"
"As the player throws their sword at the angry orc, it strikes true and lands squarely in its chest. The orc roars in pain and anger, charging towards the player with a fierce look in its eyes.
The player quickly dodges out of the way, but not before noticing something strange about the orc's eyes. They seem to be glowing with an otherworldly light, as if they are possessed by some dark force.
With the orc now dangerously close, the player decides to take a risk and try to reason with it. "Stop! I mean you no harm," they say, holding out their hands in a gesture of peace.
To their surprise, the orc seems to understand them and hesitates for a moment before continuing its attack. The player quickly grabs their sword and prepares to defend themselves if necessary.
As the fight continues, it becomes clear that this orc is not like any other they have faced before. Its movements are unnaturally quick and its strength is unmatched. But with each blow of the sword, the player gains a little more confidence in their abilities.
Finally, after what seems like an eternity, the orc falls to the ground, defeated."
Time: 1m 30s
Action: "A wizard casts a spell on the player and turns them into a pencil"
"As the wizard finishes casting his spell, the player suddenly finds themselves transformed into a pencil. They look down at their new form in confusion before noticing that they are no longer holding the wand they were using to cast spells.
The player tries to move around, but find that they are unable to do so without making a noise. They try to speak, but find that they cannot make any sounds with their new pencil body.
Suddenly, they hear a voice from behind them. "Well, well, well. Look who we have here." The player turns around to see the wizard who cast the spell on them.
The wizard chuckles at the player's transformation and begins to speak in a mocking tone. "I bet you didn't expect this when you decided to mess with magic, did you?"
The player tries to respond, but finds that they are unable to do so without making any noise. They feel helpless and frustrated as they try to come up with a plan to escape their new form.
What will the player do next?"
Time: 1m 1s
Cold vs Warm
From a few quick tests, the difference between a cold start and a warm start is quite dramatic: one run took around 5 minutes cold but only 39 seconds warm. This is a mixture of the large container size and the initialisation of the model into memory. It's this initialisation that takes the most time, which is why I put it outside the handler.
Loading the model from EFS instead of baking it into the image would help reduce the time to load the container package, but there is no obvious way of getting around the initial load of the model into memory.
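A sketch of that change - the mount path is hypothetical and would come from the function's EFS file system configuration:
import os

# Hypothetical EFS access point mounted on the function at /mnt/models
EFS_MODEL_PATH = "/mnt/models/ggml-model-gpt4all-falcon-q4_0.bin"

# Prefer the EFS copy if the mount is configured, otherwise use the bundled file
local_path = EFS_MODEL_PATH if os.path.exists(EFS_MODEL_PATH) else "./ggml-model-gpt4all-falcon-q4_0.bin"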
Conclusion
I'm surprised it works as well as it does; getting such a large response from a full LLM in less than a minute (when warm) is quite impressive considering it's running serverless inside a microVM.
While I wouldn't recommend this for a production system, especially in its current form, it goes to show how powerful Lambda functions can be. With some caching, optimization, and smaller, more specialized models, this could be an option for certain private LLM use cases.
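As a taste of the caching idea, a minimal sketch that builds on the function code above (llm_chain is the chain defined there), keeping responses in module scope so they survive warm invocations:
from functools import lru_cache

# Module-level cache: persists between warm invocations of the same container,
# so a repeated action skips the slow model call entirely
@lru_cache(maxsize=128)
def cached_run(action: str) -> str:
    return llm_chain.run(action)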
Please check out the code and have a play; it's a fun experiment at the very least.