A new one-file Rust implementation of Llama 2 is now available, thanks to Sasha Rush. It's a Rust port of Karpathy's llama2.c, and it already supports the following features:
- 4-bit GPT-Q quantization
- SIMD support for fast CPU inference
- Grouped-query attention (needed for the big Llamas)
- Memory mapping, which loads the 70B model instantly
- Static size checks and no raw pointers
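The last point is worth illustrating. In Rust, tensor dimensions can be baked into the type system with const generics, so a shape mismatch becomes a compile-time error rather than a runtime crash. The sketch below is a hypothetical illustration of the idea, not code from llama2.rs itself:

```rust
// Hypothetical sketch of static size checks via const generics
// (illustrative only; not the actual llama2.rs code).
struct Linear<const IN: usize, const OUT: usize> {
    // The dimensions are part of the type, so no runtime shape checks
    // (and no raw pointers) are needed.
    weight: [[f32; IN]; OUT],
}

impl<const IN: usize, const OUT: usize> Linear<IN, OUT> {
    fn forward(&self, x: &[f32; IN]) -> [f32; OUT] {
        let mut out = [0.0f32; OUT];
        for (o, row) in out.iter_mut().zip(self.weight.iter()) {
            *o = row.iter().zip(x.iter()).map(|(w, xi)| w * xi).sum();
        }
        out
    }
}

fn main() {
    // A 3-in, 2-out layer; calling forward with a wrongly sized
    // input array would not compile.
    let layer = Linear::<3, 2> {
        weight: [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    };
    let y = layer.forward(&[2.0, 3.0, 4.0]);
    assert_eq!(y, [2.0, 7.0]);
    println!("{:?}", y);
}
```

Passing a `&[f32; 4]` to this layer is rejected by the compiler, which is exactly the kind of guarantee the "static size checks" bullet refers to.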
While this project is clearly in an early development phase, it’s already very impressive. It achieves 7.9 tokens/sec for Llama 2 7B and 0.9 tokens/sec for Llama 2 70B, both quantized with GPTQ.
You can learn about GPTQ for LLama 2 here:
Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer
Llama 2 but 75% smaller
kaitchup.substack.com
Sasha claimed on X (formerly Twitter) that he could run the 70B version of Llama 2 using only his laptop's CPU. It is of course very slow (5 tokens/minute), but with an Intel i9 you can reach a much higher speed of 1 token/second.
If you understand Rust, I recommend reading the code. It gives a lot of ideas for efficiently handling the quantization and dequantization of LLMs.