Kaushal Powar

Run Mixtral 8x7B 🚀 (Mixtral of Experts) in free Colab

A few weeks after releasing the Mixtral 8x7B model, Mistral AI has published the accompanying paper, Mixtral of Experts.

What is Mixtral 8x7B?

It is a powerful language model that's a game-changer! Imagine Mistral 7B, but better: every layer has 8 expert blocks instead of a single feed-forward block. Here's the scoop: for every token, at each layer, a router picks two experts to do the heavy lifting, so each token has access to a whopping 47 billion parameters.
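To make the "pick two experts" idea concrete, here is a minimal toy sketch of top-2 routing for a single token in a single layer. It is not Mistral's implementation: the experts are plain Python functions acting on a number, and the router logits are made-up values, whereas the real model uses learned weights and vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe(token, router_logits, experts):
    """Combine the outputs of the two highest-scoring experts for one token."""
    # Indices of the two experts with the largest router logits
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the selected logits only, so the two weights sum to 1
    weights = softmax([router_logits[i] for i in top2])
    # Weighted sum of the selected experts' outputs; the other six never run
    output = sum(w * experts[i](token) for w, i in zip(weights, top2))
    return output, top2

# Eight toy "experts": expert i multiplies its input by (i + 1)
experts = [lambda x, k=i + 1: k * x for i in range(8)]
# Made-up router scores for one token
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.2]

output, chosen = top2_moe(1.0, router_logits, experts)
print("chosen experts:", chosen)
print("output:", output)
```

Only the two chosen experts are evaluated, which is why the compute per token stays far below what the total parameter count suggests.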
What's really cool is that Mixtral only uses 13 billion active parameters per token during inference, making it super efficient. Trained with a context size of 32,000 tokens, Mixtral outperforms or matches big names like Llama 2 70B and GPT-3.5 across the benchmarks evaluated in the paper, and it is especially strong in math, code generation, and multilingual tasks.
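If you are wondering where 47 billion and 13 billion come from, a rough back-of-the-envelope count gets close. The hyperparameters below are the ones I recall from the architecture table of the Mixtral of Experts paper, so treat them as assumptions; the count also ignores small pieces such as the normalization weights.

```python
# Hyperparameters assumed from the Mixtral of Experts paper
dim = 4096            # model (hidden) dimension
hidden_dim = 14336    # feed-forward dimension inside each expert
n_layers = 32
n_heads = 32
n_kv_heads = 8        # grouped-query attention uses fewer key/value heads
head_dim = 128
vocab_size = 32000
num_experts = 8
top_k = 2             # experts used per token in each layer

# Each expert is a feed-forward block with three weight matrices
expert_params = 3 * dim * hidden_dim

# Attention: query/output projections plus the smaller key/value projections
attn_params = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)

router_params = dim * num_experts          # one score per expert
embedding_params = 2 * vocab_size * dim    # input embeddings + output head

def total_params(experts_counted):
    per_layer = attn_params + router_params + experts_counted * expert_params
    return n_layers * per_layer + embedding_params

total = total_params(num_experts)   # every expert counted
active = total_params(top_k)        # only the two experts a token actually uses
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the script prints about 46.7B total and about 12.9B active, which lines up with the rounded 47B and 13B figures.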

Out of the box, this model will not fit on the T4 GPU that Google Colab provides for free, but I came across the mixtral-offloading GitHub repository, which solves the issue.

How will MoE work on a T4?

The four contributors achieved efficient inference of Mixtral-8x7B models through a combination of techniques:

  • Mixed quantization with HQQ. They apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.

  • MoE offloading strategy. Each expert per layer is offloaded separately and only brought back to the GPU when needed. They store active experts in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
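To give a feel for what the bit widths and group sizes in the first bullet mean, here is a simplified sketch of group-wise quantization. Note that this is plain min-max rounding, not HQQ's actual algorithm (HQQ optimizes the quantization parameters more carefully); the toy weights are made up, and only the `nbits` and `group_size` values mirror the config used later in this post.

```python
def quantize_group(weights, nbits):
    """Min-max quantize one group of floats to integers in [0, 2**nbits - 1]."""
    levels = 2 ** nbits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo   # lo acts as the zero point of the group

def dequantize_group(codes, scale, zero):
    return [c * scale + zero for c in codes]

def max_error(weights, nbits, group_size):
    """Quantize group by group, restore, and report the worst absolute error."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, zero = quantize_group(group, nbits)
        restored.extend(dequantize_group(codes, scale, zero))
    return max(abs(a - b) for a, b in zip(weights, restored))

# Made-up weights in [-1, 1]
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]

err_attn = max_error(weights, nbits=4, group_size=64)   # attention-style setting
err_ffn = max_error(weights, nbits=2, group_size=16)    # expert-style setting
print(f"4-bit, group 64: max error {err_attn:.4f}")
print(f"2-bit, group 16: max error {err_ffn:.4f}")
```

Fewer bits mean fewer levels per group, so the experts are stored more coarsely than the attention layers; smaller groups partly compensate because each group gets its own scale and zero point.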

You will need approximately 16 GB of VRAM and 11 GB of RAM.
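The LRU cache from the offloading strategy can be sketched in a few lines. This is only an illustration of the idea, not the repository's code: `ExpertLRUCache` and `load_fn` are hypothetical names, and "loading" here just builds a string instead of copying weights from CPU RAM to the GPU.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keeps at most `capacity` experts "on the GPU"; the rest stay in CPU RAM."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # hypothetical: moves an expert from CPU to GPU
        self.on_gpu = OrderedDict()   # expert id -> loaded expert, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.on_gpu:
            self.hits += 1
            self.on_gpu.move_to_end(expert_id)   # mark as most recently used
            return self.on_gpu[expert_id]
        self.misses += 1
        if len(self.on_gpu) >= self.capacity:
            self.on_gpu.popitem(last=False)      # evict the least recently used expert
        self.on_gpu[expert_id] = self.load_fn(expert_id)
        return self.on_gpu[expert_id]

cache = ExpertLRUCache(capacity=2, load_fn=lambda i: f"expert-{i}")

# Experts picked by the router for four consecutive tokens (made-up values)
for token_experts in [(0, 3), (0, 3), (3, 5), (0, 5)]:
    for expert_id in token_experts:
        cache.get(expert_id)

print("hits:", cache.hits, "misses:", cache.misses)
```

Adjacent tokens often reuse the same experts, so every hit is a CPU-to-GPU transfer that never has to happen.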

Code

Install and import libraries

#fixing numpy
!pip uninstall numpy --yes
!pip install numpy==1.24.4
# fix numpy in colab
import numpy
from IPython.display import clear_output

# fix triton in colab
!export LC_ALL="en_US.UTF-8"
!export LD_LIBRARY_PATH="/usr/lib64-nvidia"
!export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia

!git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
!cd mixtral-offloading && pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

clear_output()
import sys

sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

Initialize model

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)


model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)
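To see what `offload_per_layer` actually changes, here is a small sketch that reuses the two formulas from `OffloadConfig` above. The layer and expert counts (32 and 8) are assumed from the Mixtral config, and reading `main_size` as "experts kept on the GPU" and `offload_size` as "experts parked in CPU RAM" is my interpretation of those formulas.

```python
def expert_split(num_layers, num_experts, offload_per_layer):
    """Return (experts kept on GPU, experts offloaded to CPU) across all layers."""
    main_size = num_layers * (num_experts - offload_per_layer)
    offload_size = num_layers * offload_per_layer
    return main_size, offload_size

# Assumed Mixtral 8x7B values: 32 layers, 8 experts per layer
splits = {}
for offload_per_layer in (4, 5):
    splits[offload_per_layer] = expert_split(32, 8, offload_per_layer)
    on_gpu, on_cpu = splits[offload_per_layer]
    print(f"offload_per_layer={offload_per_layer}: {on_gpu} on GPU, {on_cpu} on CPU")
```

Going from 4 to 5 moves 32 more experts off the GPU, which is why that setting is suggested for 12 GB cards, at the cost of more CPU-to-GPU traffic.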

Run the model

!pip install langchain torch transformers sentence-transformers accelerate
import torch
import transformers
# In newer LangChain releases this class may need to be imported from langchain_community.llms instead
from langchain import HuggingFacePipeline

# The tokenizer comes from the original (unquantized) model repository
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap the offloaded model in a Hugging Face text-generation pipeline
pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    max_new_tokens=1024,
    device=device,
)

# Expose the pipeline to LangChain as an LLM
llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={"temperature": 0.5, "max_new_tokens": 1024},
)
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
task_template = """
Write your own template.
variable1 = {input_variable1}
variable2 = {input_variable2}
"""
# Every placeholder used in the template must be listed in input_variables
task_prompt_template = PromptTemplate(
    input_variables=["input_variable1", "input_variable2"], template=task_template
)
# output_key names the field that holds the chain's result
task_chain = LLMChain(
    llm=llm, prompt=task_prompt_template, output_key="structured_task"
)
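Before running the chain, it can help to see the text the model will actually receive. For simple placeholders like these, the default template format behaves much like Python's own `str.format`, so the sketch below previews the filled-in prompt without touching LangChain or the model. The example values are hypothetical.

```python
task_template = """
Write your own template.
variable1 = {input_variable1}
variable2 = {input_variable2}
"""

# Hypothetical example values, only for previewing the prompt
example_inputs = {
    "input_variable1": "Mixtral 8x7B",
    "input_variable2": "Google Colab T4",
}

# Fill every {placeholder} with the matching dictionary entry
prompt_text = task_template.format(**example_inputs)
print(prompt_text)
```

If a placeholder in the template has no matching key, `str.format` raises a `KeyError`, which is a quick way to catch a misspelled variable name.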
# input_variable1 and input_variable2 are your own values; define them before this cell
question = {"input_variable1": input_variable1, "input_variable2": input_variable2}
print(task_chain.run(question))

References
https://huggingface.co/docs/transformers/model_doc/mixtral

https://arxiv.org/abs/2401.04088

https://github.com/dvmazur/mixtral-offloading

Thank you for reading 😁.

I welcome constructive criticism and alternative viewpoints. If you have any thoughts or feedback on this post, please feel free to share them in the comments section below.
