The recently released Phi 3.5 model series includes a mixture-of-experts model with 16 expert networks of roughly 3.8 billion parameters each. It activates two experts per token, giving strong performance with only 6.6 billion parameters active per forward pass. I recently wanted to try running Phi 3.5 MoE on my MacBook, but my usual method was blocked while support is still being added to llama.cpp and then Ollama.
I decided to try out another library, mistral.rs, which is written in Rust and already supports these newer models. It required a little fiddling, but I did get it working and the model is reasonably responsive.
Getting Our Dependencies and Building mistral.rs
To get started you will need the Rust compiler toolchain installed on your MacBook, including `rustc` and `cargo`. The easiest way to do this is via Homebrew:
brew install rust
You'll also need to grab the code for the project:
git clone https://github.com/EricLBuehler/mistral.rs.git
Once you have both of these in place, we can build the project. Since we're running on a Mac, we want the compiler to make use of Apple Metal, which lets the model use the GPU capabilities of the M-series chip for acceleration.
cd mistral.rs
cargo install --path mistralrs-server --features metal
This command may take a couple of minutes to run. The compiled server will be saved in the `target/release` folder relative to your project folder.
Running the Model with Quantization
The default instructions in the project README work, but you might find they use a lot of memory and take a very long time to run. That's because, by default, mistral.rs does not apply any quantization, so running the model requires around 12GB of memory.
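To see why quantization helps, here is a rough back-of-envelope estimate of weight memory (an illustration only; the real footprint depends on runtime overheads, KV cache, and activations, and the 3.8B parameter count for Phi-3.5-mini is an assumption). Q4_0 stores 4-bit values plus a per-block scale, which works out to about 4.5 bits per weight:

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope weight memory in (decimal) gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed figure for illustration: ~3.8B parameters (Phi-3.5-mini)
fp16 = weight_gb(3.8e9, 16)    # 16-bit weights: 2 bytes each
q4_0 = weight_gb(3.8e9, 4.5)   # Q4_0: 4 bits + per-block scale ≈ 4.5 bits
print(f"fp16: {fp16:.1f} GB, Q4_0: {q4_0:.1f} GB")  # fp16: 7.6 GB, Q4_0: 2.1 GB
```

So a 4-bit quantized model needs roughly a quarter of the weight memory, which is why the quantized run below fits comfortably on a consumer MacBook.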
mistral.rs supports in-situ quantization (ISQ), which means the framework loads the model and quantizes it at run time (as opposed to requiring you to download a GGUF file that was already quantized). I recommend running the following:
./target/release/mistralrs-server --isq Q4_0 -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3
In this mode we use ISQ to quantize the model down to 4-bit (`--isq Q4_0`). You should then be able to chat with the model through the terminal.
Running as a Server
mistral.rs provides an HTTP API that is compatible with the OpenAI standard. To run in server mode, we remove the `-i` argument and replace it with a port number to run on, `--port 1234`:
./target/release/mistralrs-server --port 1234 --isq Q4_0 plain -m microsoft/Phi-3.5-mini-instruct -a phi3
You can then use an app like Postman or Bruno to interact with your model.
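If you prefer a script to a GUI client, a minimal sketch of a request against the OpenAI-compatible endpoint looks like this (the `/v1/chat/completions` path follows the OpenAI API shape, and the model name in the payload is an assumption; adjust it to whatever your server reports):

```python
import json
import urllib.request

# Assumed payload shape for an OpenAI-compatible chat completions endpoint
payload = {
    "model": "microsoft/Phi-3.5-mini-instruct",  # assumption: the served model id
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server from the command above is running:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client library should also work by pointing its base URL at `http://localhost:1234/v1`.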