A new one-file Rust implementation of Llama 2 is now available, thanks to Sasha Rush. It's a Rust port of Karpathy's llama2.c, and it already supports the following features:
- 4-bit GPT-Q quantization
- SIMD support for fast CPU inference
- Grouped-query attention (needed for the big Llamas)
- Memory mapping, which loads the 70B model instantly
- Static size checks and no raw pointers
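The last point is worth illustrating. In Rust, tensor dimensions can be baked into the type system with const generics, so a shape mismatch becomes a compile-time error rather than a runtime crash. The sketch below is a hypothetical illustration of the idea, not code from llama2.rs itself:

```rust
// Hypothetical sketch of static size checks via const generics
// (illustrative only; not the actual llama2.rs code).
struct Linear<const IN: usize, const OUT: usize> {
    // The dimensions are part of the type, so no runtime shape checks
    // (and no raw pointers) are needed.
    weight: [[f32; IN]; OUT],
}

impl<const IN: usize, const OUT: usize> Linear<IN, OUT> {
    fn forward(&self, x: &[f32; IN]) -> [f32; OUT] {
        let mut out = [0.0f32; OUT];
        for (o, row) in out.iter_mut().zip(self.weight.iter()) {
            *o = row.iter().zip(x.iter()).map(|(w, xi)| w * xi).sum();
        }
        out
    }
}

fn main() {
    // A 3-in, 2-out layer; calling forward with a wrongly sized
    // input array would not compile.
    let layer = Linear::<3, 2> {
        weight: [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    };
    let y = layer.forward(&[2.0, 3.0, 4.0]);
    assert_eq!(y, [2.0, 7.0]);
    println!("{:?}", y);
}
```

Passing a `&[f32; 4]` to this layer is rejected by the compiler, which is exactly the kind of guarantee the "static size checks" bullet refers to.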
While this project is clearly in an early development phase, it’s already very impressive. It achieves 7.9 tokens/sec for Llama 2 7B and 0.9 tokens/sec for Llama 2 70B, both quantized with GPTQ.
You can learn about GPTQ for LLama 2 here:
Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer
Llama 2 but 75% smaller
kaitchup.substack.com
Sasha claimed on X (formerly Twitter) that he could run the 70B version of Llama 2 using only his laptop's CPU. It is of course very slow (5 tokens/minute), but with an Intel i9 you can reach a much higher speed of 1 token/second.
If you understand Rust, I recommend reading the code. It gives a lot of ideas for efficiently handling the quantization and dequantization of LLMs.