Run Llama 2 on Your CPU with Rust

Sasha Rush has released a new one-file Rust implementation of Llama 2, a port of Andrej Karpathy's llama2.c. It already supports the following features:

  • 4-bit GPTQ quantization
  • SIMD support for fast CPU inference
  • Grouped-query attention (needed for the big Llamas)
  • Memory mapping, so even the 70B model loads instantly
  • Static size checks, no pointers
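
The "static size checks" idea can be illustrated with Rust's const generics: tensor shapes become part of the type, so a dimension mismatch fails at compile time instead of at runtime. The sketch below is illustrative only; the names and layout are not taken from the actual llama2.rs source.

```rust
// A row-major matrix whose shape (R x C) is part of the type.
struct Matrix<const R: usize, const C: usize> {
    data: Vec<f32>, // length R * C
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    fn zeros() -> Self {
        Matrix { data: vec![0.0; R * C] }
    }

    // Matrix-vector product: the input length C and output length R are
    // enforced by the type system, so passing a vector of the wrong size
    // is a compile error rather than a runtime panic.
    fn matvec(&self, x: &[f32; C]) -> [f32; R] {
        let mut out = [0.0f32; R];
        for r in 0..R {
            let row = &self.data[r * C..(r + 1) * C];
            out[r] = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum();
        }
        out
    }
}

fn main() {
    let w: Matrix<2, 3> = Matrix::zeros();
    let x = [1.0f32, 2.0, 3.0];
    let y = w.matvec(&x); // y: [f32; 2], shape checked at compile time
    println!("{:?}", y);
}
```

With fixed model dimensions known at build time, this style removes a whole class of shape bugs without any pointer arithmetic.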

While this project is clearly in an early development phase, it’s already very impressive. It achieves 7.9 tokens/sec for Llama 2 7B and 0.9 tokens/sec for Llama 2 70B, both quantized with GPTQ.

You can learn about GPTQ for Llama 2 here: Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer (kaitchup.substack.com).

Sasha reported on X (formerly Twitter) that he could run the 70B version of Llama 2 using only his laptop's CPU, though very slowly (5 tokens/min). With an Intel i9, the speed rises to about 1 token/sec.

If you understand Rust, I recommend reading the code. It offers many ideas for handling the quantization and dequantization of LLMs efficiently.
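
To give a flavor of what such code has to do, here is a simplified 4-bit group quantization/dequantization sketch: two signed nibbles are packed per byte, with one f32 scale per group. This is a hypothetical symmetric scheme for illustration, not the actual llama2.rs or GPTQ kernel.

```rust
/// Quantize a group of f32 weights to 4-bit signed integers (-8..=7),
/// packing two nibbles per byte, with one f32 scale for the whole group.
/// Real kernels typically use group sizes of 64 or 128.
fn quantize_group(weights: &[f32]) -> (Vec<u8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut packed = Vec::with_capacity(weights.len() / 2);
    for pair in weights.chunks(2) {
        let q = |w: f32| ((w / scale).round() as i32).clamp(-8, 7);
        let lo = (q(pair[0]) & 0x0F) as u8;
        let hi = (q(pair[1]) & 0x0F) as u8;
        packed.push(lo | (hi << 4));
    }
    (packed, scale)
}

/// Dequantize back to f32, sign-extending each nibble before scaling.
fn dequantize_group(packed: &[u8], scale: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        for nibble in [byte & 0x0F, byte >> 4] {
            // Sign-extend the 4-bit value from 0..=15 to -8..=7.
            let q = if nibble >= 8 { nibble as i32 - 16 } else { nibble as i32 };
            out.push(q as f32 * scale);
        }
    }
    out
}

fn main() {
    let w = [0.7f32, -0.3, 0.1, -0.9, 0.5, 0.0, -0.2, 0.8];
    let (packed, scale) = quantize_group(&w);
    let recovered = dequantize_group(&packed, scale);
    println!("packed {} bytes, scale {:.4}", packed.len(), scale);
    println!("{:?}", recovered);
}
```

The storage win is the point: eight f32 weights (32 bytes) shrink to four packed bytes plus one shared scale, and the inner inference loop only needs the cheap sign-extend-and-multiply to recover approximate weights.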


Tags: GPTQ Llama