In recent weeks, I have been working on projects that use GPUs, and I have been exploring ways to optimize their usage. To understand where the capacity goes, I started by analyzing memory consumption and usage patterns with the nvidia-smi tool, which gives a per-process breakdown of GPU memory and utilization.
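If you want to poll those numbers programmatically instead of reading the nvidia-smi output by hand, a minimal sketch like the one below works. It simply shells out to nvidia-smi's CSV query interface, so it assumes nvidia-smi is on your PATH; the helper name is just for illustration.

```python
import subprocess


def gpu_stats():
    """Print per-GPU memory and utilization using nvidia-smi's CSV query interface."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    for line in out.stdout.strip().splitlines():
        index, mem_used, mem_total, util = [v.strip() for v in line.split(",")]
        print(f"GPU {index}: {mem_used}/{mem_total} MiB used, {util}% utilization")


if __name__ == "__main__":
    gpu_stats()
```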
One of the areas I have been focusing on is deploying our own LLMs. I noticed that a smaller LLM, such as a 7B-parameter model, was only consuming about 8 GB of memory and utilizing around 20% of an A100 GPU during inference. This observation led me to investigate running multiple LLM processes in parallel on a single GPU to make better use of the hardware.
To achieve this, I explored using Python's multiprocessing module and the spawn method to launch multiple processes concurrently. By doing so, I aimed to efficiently run multiple LLM inference tasks in parallel on a single GPU. The following code demonstrates the approach I used to set up and execute multiple LLMs on a single GPU.
```python
import multiprocessing
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_MODELS = 3


def load_model(model_name: str, device: str):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def inference(model, tokenizer, prompt: str):
    # Tokenize the prompt, generate up to 200 new tokens, and decode the result.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def process_task(task_queue, result_queue):
    # Each worker process loads its own copy of the model onto the same GPU.
    model, tokenizer = load_model("tiiuae/falcon-7b-instruct", device="cuda:0")
    while True:
        task = task_queue.get()
        if task is None:  # Sentinel value tells the worker to shut down.
            break
        prompt = task
        start = time.time()
        summary = inference(model, tokenizer, prompt)
        print(f"Completed inference in {time.time() - start}")
        result_queue.put(summary)


def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    prompt = ""  # The prompt you want to execute

    processes = []
    for _ in range(MAX_MODELS):
        process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
        process.start()
        processes.append(process)

    start = time.time()
    # I want to run this 3 times for each of the models
    for _ in range(MAX_MODELS * 3):
        task_queue.put(prompt)

    results = []
    for _ in range(MAX_MODELS * 3):
        result = result_queue.get()
        results.append(result)
    end = time.time()
    print(f"Completed {MAX_MODELS * 3} inferences in {end - start} seconds")

    # Tell each worker to exit and wait for all of them to finish.
    for _ in range(MAX_MODELS):
        task_queue.put(None)
    for process in processes:
        process.join()


if __name__ == "__main__":
    # "spawn" is required so each worker process gets its own CUDA context.
    multiprocessing.set_start_method("spawn")
    main()
```
The following is a quick summary of some of the tests that I ran.
| GPU | # of LLMs | GPU Memory | GPU Usage | Average Inference Time |
|---|---|---|---|---|
| A100 with 40GB | 1 | 8 GB | 20% | 12.8 seconds |
| A100 with 40GB | 2 | 16 GB | 95% | 16 seconds |
| A100 with 40GB | 3 | 32 GB | 100% | 23.2 seconds |
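One way to read these numbers is in terms of throughput rather than latency. The quick back-of-the-envelope calculation below is just arithmetic on the averages from the table, not an additional benchmark:

```python
# Rough throughput comparison using the average inference times from the table above.
configs = {1: 12.8, 2: 16.0, 3: 23.2}  # number of LLMs -> average inference time (seconds)

for n_models, avg_latency in configs.items():
    # Each model handles one request at a time, so total throughput scales with n_models.
    throughput = n_models / avg_latency  # requests per second across the whole GPU
    print(f"{n_models} LLM(s): {throughput:.3f} requests/sec at {avg_latency}s per request")
```

In other words, each individual request gets slower, but the GPU as a whole serves more requests per second (roughly 0.08 with one model versus 0.13 with three).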
Running multiple LLM instances on a single GPU can significantly reduce costs and increase availability by making fuller use of the available resources. However, it comes with a real performance trade-off: average inference time roughly doubles when running three LLMs concurrently instead of one. If you have any other ways of optimizing GPU usage or questions about how this works, feel free to reach out.
Thanks