The DeepSeek R1 model was released a few weeks ago, and Brian Roemmele claimed to run it locally on a Raspberry Pi at 200 tokens per second, promising to release a Raspberry Pi image “as soon as all tests are complete”. He further explained that the Raspberry Pi 5 was fitted with a few HATs, including a Hailo AI accelerator, but that’s about all the information we have so far, and I assume he used the distilled model with 1.5 billion parameters.
Jeff Geerling did his own tests with DeepSeek-R1 (Qwen 14B), but that was only on the CPU at 1.4 tokens/s, and he later installed an AMD W7700 graphics card for better performance. Other people made TinyZero models based on DeepSeek R1 and optimized for Raspberry Pi, but those are specific to countdown and multiplication tasks and still run on the CPU only. So I was happy to finally see Radxa release instructions to run DeepSeek R1 (Qwen2 1.5B) on an NPU, more exactly the 6 TOPS NPU accelerator of the Rockchip RK3588 SoC, using the RKLLM toolkit.
The full instructions explain how to compile the model yourself, but if you only want to try it quickly, Radxa offers a pre-compiled RKLLM model on ModelScope, which you can get with:
```
git clone https://www.modelscope.cn/radxa/DeepSeek-R1-Distill-Qwen-1.5B_RKLLM.git
```
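Note that the 1.9GB model file is most likely stored with Git LFS, as is common for large files on ModelScope, so if the clone only fetches a small pointer file, install git-lfs first (this is an assumption on my side, not part of Radxa’s instructions):

```
# Assumption: the large .rkllm file is tracked with Git LFS on ModelScope
sudo apt install git-lfs
git lfs install
```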
It has five files:
- configuration.json – Configuration file
- librkllmrt.so – RKLLM library
- llm_demo – Demo program
- DeepSeek-R1-Distill-Qwen-1.5B.rkllm (1.9GB) – DeepSeek R1 Qwen 1.5B compiled with RKLLM
- README.md
Run the test with:
```
export RKLLM_LOG_LEVEL=1
./llm_demo DeepSeek-R1-Distill-Qwen-1.5B.rkllm 10000 10000
```
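Going by Rockchip’s rkllm demo sources, the two numeric arguments should be the maximum number of new tokens and the maximum context length. Since llm_demo appears to link against the bundled librkllmrt.so, you may also need to make that library discoverable first (again an assumption, not spelled out in Radxa’s instructions):

```
# Assumption: llm_demo is dynamically linked against the librkllmrt.so shipped
# in the same directory, so add it to the loader's search path before running
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
```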
Radxa says the RK3588 achieves 14.93 tokens per second for the following math problem:
Solve the equations x+y=12, 2x+4y=34, find the values of x and y
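For reference, the correct answer is x = 7 and y = 5, since subtracting twice the first equation from the second leaves 2y = 10.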
The demo was tested on the Radxa ROCK 5B. I haven’t done it myself since I don’t have the board with me right now… It should also work on other Rockchip RK3588/RK3588S boards and even Rockchip RK3576 hardware platforms since they use the same NPU. Banana Pi also shared a post on X with a video showing DeepSeek R1 (Qwen 1.5B) running on the Banana Pi BPI-M7 board (RK3588).
#DeepSeek is perfectly adapted and operates efficiently on #BananaPi BPI-M7 (#Siger7) #Rockchip #RK3588 #SBC https://t.co/tlNXB2KjfN pic.twitter.com/W24zaW3OH5
— Banana pi Open Source Hardware (@sinovoip) February 8, 2025
The pure CPU one is faster on my Rock 5B (18 t/s). The thing is, the NPU doesn’t improve text generation speed, which is RAM bandwidth-limited; it improves the prompt eval time, which is CPU-limited:

```
llama_perf_context_print: prompt eval time =   843.41 ms /   27 tokens (  31.24 ms per token,  32.01 tokens per second)
llama_perf_context_print:        eval time = 63049.32 ms / 1135 runs   (  55.55 ms per token,  18.00 tokens per second)
```

This is visible above, where the prompt eval speed was 32 t/s in…
Rebuilt with gcc-10 (it was 9.5 previously) and quantized at Q4_0, it even reaches 85-89 t/s processing and 19.5-20 t/s generating.
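For readers who want to reproduce that kind of CPU-only figure, the usual llama.cpp workflow looks roughly like this (compiler version, file names, and thread count below are assumptions, not taken from the comment):

```
# Sketch of a CPU-only llama.cpp build and benchmark on RK3588 (details assumed)
cmake -B build -DCMAKE_C_COMPILER=gcc-10 -DCMAKE_CXX_COMPILER=g++-10
cmake --build build --config Release -j
# Quantize an F16 GGUF export of the model to Q4_0, then benchmark it
./build/bin/llama-quantize DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf Q4_0
./build/bin/llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf -t 4
```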
Well, you can’t really draw any conclusions from running a 14b model at 1.4 tokens per second vs a 1.5b model. Of course it would run quickly as it’s less computationally complex.
What this article should point out instead is that “you can run a small language model on the NPU of the RK3588 with reasonable speed”, as 1.5 billion parameters make it quite dumb. And on the CPU, you can run any model that you wish, provided it fits in your RAM.
Any more news on that image release mentioned at the start? Brian Roemmele’s post where he claimed to run it locally on a Raspberry Pi at 200 tokens per second was from almost three weeks ago.
No news. If you read the thread, it mentions it drops to 90 Tokens/s sometimes and they are testing four different models on the Raspberry Pi. But as far as I know, the image has not been released yet.
We also have to wait until 16GB RAM variants of Raspberry Pi 5 and Raspberry Pi Compute Module 5 become available.
Do we? Isn’t that one super overpriced like the RPI with 16GB? Also no TPU/NPU…
Just for the berries or why?
You can use the official Raspberry Pi AI HAT+, which contains a Hailo-8 AI accelerator NPU. Anyway, for 7B models, at least 8GB RAM is recommended, and for 13B models, at least 16GB RAM is recommended. So I’m thinking you might be able to run DeepSeek-R1-Distill-Qwen-14B on a 16GB Raspberry Pi 5 with a Raspberry Pi AI HAT+?
The 16GB Pi 5’s been out for weeks; I’ve heard from a number of people who’ve already ordered and received them, at least in the US and UK.
Have you tried running DeepSeek-R1-Distill-Qwen-14B on a 16GB Raspberry Pi 5 with a Raspberry Pi AI HAT+?
Sigh. Every HW vendor now says “runs DeepSeek” when only a distilled model is run. I only consider the 671B model the real “DeepSeek R1”, the distilled models are still quite restricted by their base models.
Restricted, yes, but still useful in some use cases, such as a specific LLM agent (for example, Home Assistant’s voice assistant LLM fallback).
Can these models process continuous speech on the RK3588?