Dhanush Reddy

Deploy Hugging Face Models on Serverless GPU

Hugging Face is a platform and community focused on making artificial intelligence and data science more accessible. By promoting open-source contributions, it helps spread AI knowledge and resources as these technologies become more widely used, and it gives AI and data science professionals a space to connect and share their work.

Hugging Face provides state-of-the-art machine learning models for many different tasks. It has a vast number of pre-trained models in categories such as:

  • Computer Vision (Image Segmentation, Image Classification, Image Generation, etc.)
  • Natural Language Processing (Text Classification, Summarization, Generation, Translation, etc.)
  • Audio (Speech Recognition, Text to Speech, etc.)
  • and much more

Hugging Face's models are easy to use. For example, for a text classification task, you can use the transformers library (maintained by Hugging Face) to load a pre-trained model and then use it to classify text.

An example of text classification using transformers:



from transformers import pipeline

sentiment_analysis = pipeline(
    "sentiment-analysis", model="siebert/sentiment-roberta-large-english"
)

print(sentiment_analysis("I like Transformers"))
# [{'label': 'POSITIVE', 'score': 0.9987214207649231}]

print(sentiment_analysis("I hate React"))
# [{'label': 'NEGATIVE', 'score': 0.9993581175804138}]



We can see how easy it is to do a text classification task in just 3 lines of code using the transformers library.

Similarly, you can find many other models on Hugging Face, each with its own model card describing how to use it.
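Switching to a different task is usually just a matter of changing the pipeline arguments. Here is a small sketch (the summarization model below is just one of many you could pick from the Hub; it is not from the original post):

from transformers import pipeline

# Any summarization checkpoint from the Hub would work here; this one is only an example.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Hugging Face hosts thousands of pre-trained models for vision, text and audio tasks, "
    "and each model is published with a model card that documents how to load and use it."
)
print(summarizer(text, max_length=30, min_length=10)[0]["summary_text"])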

One of the challenges of using Hugging Face models is that they can be computationally expensive to deploy, as most of them require GPUs. They are often large and complex and need a lot of computing power to run, and GPUs can be very expensive. For example, the hourly rate of an NVIDIA A10G instance (24 GB of GPU memory) on AWS is about $1.30, which works out to roughly $936 for a month of continuous use (1.30 × 24 × 30). That is how you can easily burn money on AWS 😅 for an ML model that barely gets 1000-2000 requests in a month.

An Image saying that there is a solution

Enter Serverless GPUs

Serverless GPUs are a type of cloud computing service that provides access to powerful GPUs on demand. This means that you only pay for the time that you use the GPUs, which can save you a significant amount of money if you only need to use them occasionally.

  • Pay-as-you-go pricing: Serverless architectures follow a pay-as-you-go pricing model, so you only pay for the actual resources consumed during model inference.
  • Ease of use: Serverless GPUs are very easy to use. You don't need to worry about managing or maintaining the hardware, which can save you a lot of time and hassle.
  • Scalability: Serverless GPUs can be scaled up or down as needed, which makes them ideal for applications that have fluctuating workloads.

A meme showing using servers vs serverless

There are a number of serverless GPU providers out there, such as Banana, Replicate, Beam, Modal and many more.

I would recommend checking out all of their websites before deploying your application. All of the providers mentioned above have a free usage limit.

Going back to the earlier cost calculation: for a machine with 16 CPUs and 32 GB of RAM attached to an A10G GPU, the price on Beam was just $0.00155419/second (almost all of the providers have similar pricing). Say your API gets 1500 requests in a month and the average inference time is 1 minute (60 seconds). The cost incurred is 0.00155419 × 60 × 1500 ≈ $140, roughly a 6-7x reduction 😎 compared to keeping the AWS instance running around the clock.
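To make the comparison concrete, here is the back-of-the-envelope math as a tiny script (the prices are the ones quoted above; the request volume and inference time are just this example's assumptions):

# Rough monthly cost comparison using the numbers quoted above
aws_hourly_rate = 1.30                      # USD/hour for an A10G instance on AWS
aws_monthly = aws_hourly_rate * 24 * 30     # running 24/7 for a month
print(f"AWS (always on): ${aws_monthly:.0f}")            # ~$936

beam_per_second = 0.00155419                # USD/second for 16 CPUs, 32 GB RAM, A10G on Beam
requests_per_month = 1500
seconds_per_request = 60
beam_monthly = beam_per_second * seconds_per_request * requests_per_month
print(f"Beam (pay per request): ${beam_monthly:.0f}")    # ~$140

print(f"Roughly {aws_monthly / beam_monthly:.1f}x cheaper")  # ~6.7x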

For this tutorial, I am going to use Beam to deploy dolly-v2-7b, an open-source large language model from Databricks that responds to instructions in a similar way to ChatGPT.

Of course, feel free to use any ML model with your choice of serverless GPU provider.

Deployment

Prerequisites

  • Install the Beam CLI:

curl https://raw.githubusercontent.com/slai-labs/get-beam/main/get-beam.sh -sSfL | sh

  • Configure Beam by entering:

beam configure

  • Install the Beam SDK:

pip install beam-sdk

Now you're ready to start using Beam to deploy your ML models.

As I said, I will be deploying [dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b) from Databricks. The code to run it is provided on Hugging Face, so copy it into a file named `run.py`:


import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16
)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)



There is one major problem with this code: the AutoModelForCausalLM and AutoTokenizer instantiations do not use a cache directory, so the model files will be downloaded every time the function is invoked. Of course, this does not happen when you run the code on your own device, because the default cache directory lives somewhere on your hard disk. But in the serverless world everything is stateless, so model files downloaded in one run will not persist across consecutive runs.

Digging through the Beam docs, we find an option called Shared Volumes, which lets model files persist between consecutive runs; the volume gets mounted like a normal hard disk whenever the model runs.

So for now, let's pass a cache_dir of "./mpt_weights". The modified code is:



import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

# cache_dir points at a directory that will later be backed by a Beam Shared Volume
cache_path = "./mpt_weights"

tokenizer = AutoTokenizer.from_pretrained(
    "databricks/dolly-v2-7b", padding_side="left", cache_dir=cache_path
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir=cache_path,
)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)


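As an optional extra (not part of the original post), you could also pre-download the weights into the shared volume once, so that even the first request after a cold start skips the download. A minimal sketch, assuming the huggingface_hub package is available and the volume is mounted at ./mpt_weights (recent transformers and huggingface_hub versions share the same cache layout, so from_pretrained with the same cache_dir should pick these files up):

# Hypothetical one-off warm-up script to fetch the weights into the shared volume
from huggingface_hub import snapshot_download

snapshot_download(repo_id="databricks/dolly-v2-7b", cache_dir="./mpt_weights")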

To define the configuration of the system where our ML model runs, we need to write a Beam app definition.

I have mostly copied this code from the Beam docs. Create a new file named app.py and add the following code:



import beam

app = beam.App(
    name="mpt-7b-chat",
    cpu=16,
    memory="32Gi",
    gpu="A10G",
    python_packages=[
        "accelerate>=0.16.0,<1",
        "transformers[torch]>=4.28.1,<5",
        "torch>=1.13.1,<2",
    ],
)

app.Trigger.RestAPI(
    inputs={"prompt": beam.Types.String()},
    outputs={"response": beam.Types.String()},
    handler="run.py:generate_text",
    keep_warm_seconds=60
)

app.Mount.SharedVolume(name="mpt_weights", path="./mpt_weights")



The above code creates a Beam application that uses 16 CPUs, 32 GB of memory, and an A10G GPU, and lists the pip packages required at runtime. The application is triggered by a REST API call and generates text in response to a prompt; keep_warm_seconds=60 keeps the app warm for a minute after a request, and the Shared Volume mounted at ./mpt_weights is where the model weights persist between runs.

Now let's modify run.py one last time before we deploy it. We need to wrap everything in a generate_text function that returns the response, so that Beam can send it back through the REST API.



import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

cache_path = "./mpt_weights"  # backed by the Shared Volume defined in app.py

def generate_text(**inputs):
    # REST API handler: Beam passes the request's "prompt" field as a keyword argument
    tokenizer = AutoTokenizer.from_pretrained(
        "databricks/dolly-v2-3b", padding_side="left", cache_dir=cache_path
    )
    model = AutoModelForCausalLM.from_pretrained(
        "databricks/dolly-v2-3b",
        device_map="auto",
        torch_dtype=torch.bfloat16,
        cache_dir=cache_path,
    )
    generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
    prompt = inputs["prompt"]
    response = generate_text(prompt)

    return {"response": response[0]["generated_text"]}


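If you want to sanity-check the handler locally before deploying (note that this will download the dolly-v2-3b weights into ./mpt_weights), you could temporarily add something like this to the bottom of run.py; the prompt is only an example:

if __name__ == "__main__":
    # Hypothetical local smoke test; Beam passes "prompt" the same way through the REST trigger
    output = generate_text(prompt="Explain to me the difference between nuclear fission and fusion.")
    print(output["response"])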

Please don't get intimidated by this code; I haven't modified much from the version on Hugging Face. I would highly suggest you refer to the respective docs before deploying anything to the cloud that a random dude on the internet suggests.

Also, one more thing: we need instruct_pipeline.py for dolly-v2-3b, as mentioned on Hugging Face, so copy-paste the code from here.

So now we have 3 files, namely app.py, run.py, and instruct_pipeline.py (the names may differ in your case).

Now deploy your application by entering:



beam deploy app.py



This will deploy your ML model as a serverless REST API which you can call from your frontend. Obviously, you can deploy the model as an asynchronous webhook instead of a REST API if your model inference takes a long time.
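As a rough illustration of what calling it from Python might look like (the URL and Authorization header below are placeholders; copy the real values from the CURL command on your Beam dashboard):

import requests

# Placeholders: take the real endpoint URL and Authorization header from your Beam dashboard
BEAM_ENDPOINT = "https://<your-beam-app-endpoint>"
HEADERS = {
    "Authorization": "Basic <your-api-token>",
    "Content-Type": "application/json",
}

resp = requests.post(BEAM_ENDPOINT, headers=HEADERS, json={"prompt": "Write a haiku about GPUs"})
print(resp.json()["response"])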

You can get the CURL command to call your app from the Beam app dashboard. Testing the app I deployed,

An Image showing ML inference on Serverless GPU

I got a response in under 8 seconds, which is not bad for an open-source version of ChatGPT :)

Cons

Everything has looked good so far with serverless, but let's weigh the cons.

  • Cold starts: When a serverless GPU is not in use, it is shut down. This means that the first time you use it, it needs to be booted up, which can take anywhere from a few seconds to a few minutes depending on your model size. This can be a problem if you are running a real-time application.
  • Cost: Serverless GPUs can be more expensive than traditional GPU-based solutions, especially if you are running a long-running application that stays active for most of the day.

Conclusion

In conclusion, we have seen how to deploy Hugging Face models on serverless GPUs. This can be a great way to get the performance benefits of GPUs without having to pay for, or worry about, the underlying hardware when it is not in use.

If you still have any questions regarding this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.

If you run an organization and want me to write for you, please connect with me on my Socials 🙃
