DeepSeek shown to run on Rockchip RK3588 with AI acceleration at about 15 tokens/s

The DeepSeek R1 model was released a few weeks ago, and Brian Roemmele claimed to run it locally on a Raspberry Pi at 200 tokens per second, promising to release a Raspberry Pi image “as soon as all tests are complete”. He further explained that the Raspberry Pi 5 had a few HATs, including a Hailo AI accelerator, but that’s about all the information we have so far, and I assume he used the distilled model with 1.5 billion parameters.

Jeff Geerling did his own tests with DeepSeek-R1 (Qwen 14B), but that was only on the CPU at 1.4 tokens/s, and he later installed an AMD W7700 graphics card on the board for better performance. Other people made TinyZero models based on DeepSeek R1 optimized for Raspberry Pi, but those are specific to countdown and multiplication tasks and still run on the CPU only. So I was happy to finally see Radxa release instructions to run DeepSeek R1 (Qwen2 1.5B) on an NPU, more exactly the 6 TOPS NPU accelerator of the Rockchip RK3588 SoC, using the RKLLM toolkit.

Rockchip RK3588 DeepSeek R1 NPU acceleration

The full instructions explain how to compile the model yourself, but if you only want to try it quickly, Radxa offers a pre-compiled RKLLM from ModelScope which you can get with:
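A minimal sketch of the download, assuming the package is hosted in a Radxa namespace on ModelScope (the repository path below is a hypothetical placeholder, check Radxa’s instructions for the actual one):

# Sketch only: the ModelScope repository path is an assumption, not confirmed here
sudo apt install -y git git-lfs
git lfs install
git clone https://www.modelscope.cn/radxa/DeepSeek-R1-Distill-Qwen-1.5B_RKLLM.git
cd DeepSeek-R1-Distill-Qwen-1.5B_RKLLM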


It contains five files:

  • configuration.json – Configuration file
  • librkllmrt.so – RKLLM library
  • llm_demo – Demo program
  • DeepSeek-R1-Distill-Qwen-1.5B.rkllm (1.9GB) – DeepSeek R1 Qwen 1.5B compiled with RKLLM
  • README.md

Run the test with:
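A minimal sketch of the invocation, assuming llm_demo follows the usual RKLLM demo argument order of model path, maximum new tokens, and maximum context length (see the bundled README.md for the exact syntax):

# Sketch only: the argument order is an assumption based on typical RKLLM demo programs
export LD_LIBRARY_PATH=.   # make sure the bundled librkllmrt.so is found
chmod +x llm_demo
./llm_demo ./DeepSeek-R1-Distill-Qwen-1.5B.rkllm 2048 4096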


Radxa says the RK3588 achieves 14.93 tokens per second for the math problem:

Solve the equations x+y=12, 2x+4y=34, find the values of x and y
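(For reference, and not part of Radxa’s published output: the correct answer is x = 7 and y = 5, since substituting y = 12 − x into 2x + 4y = 34 gives 2x + 48 − 4x = 34, hence x = 7 and y = 5.)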

RK3588 DeepSeek Qwen 1.5B performance

The demo was tested on the Radxa ROCK 5B. I haven’t done it myself since I don’t have the board with me right now… It should also work on other Rockchip RK3588/RK3588S boards and even Rockchip RK3576 hardware platforms since they use the same NPU. Banana Pi also shared a post on X with a video showing DeepSeek R1 (Qwen 1.5B) running on the Banana Pi BPI-M7 board (RK3588).

 

16 Comments
Willy
1 day ago

The pure CPU one is faster on my Rock5B (18 t/s). The thing is, the NPU doesn’t improve text generation speed, which is RAM bandwidth-limited; it improves the prompt eval time, which is CPU-limited:

llama_perf_context_print: prompt eval time =    843.41 ms /   27 tokens (  31.24 ms per token,   32.01 tokens per second)
llama_perf_context_print:       eval time =  63049.32 ms / 1135 runs  (  55.55 ms per token,   18.00 tokens per second)

This is visible above where the prompt eval speed was 32 t/s in…

Willy
1 day ago

Rebuilt with gcc-10 (it was 9.5 previously), and quantized at Q4_0, it even reaches 85-89 t/s processing and 19.5-20 t/s generating:

urostor
1 day ago

Well, you can’t really draw any conclusions from running a 14b model at 1.4 tokens per second vs a 1.5b model. Of course it would run quickly as it’s less computationally complex.
What this article should point out instead is that “you can run a small language model on the NPU of the RK3588 with reasonable speed”, as the 1.5 billion parameters make it quite dumb. And on the CPU, you can run any model that you wish, provided it fits in your RAM.

Hedda
1 day ago

No more news on that first mentioned image release? Brian Roemmele’s post, where he claimed to run it locally on a Raspberry Pi at 200 tokens per second, was from almost three weeks ago.

Hedda
1 day ago

We also have to wait until 16GB RAM variants of Raspberry Pi 5 and Raspberry Pi Compute Module 5 become available.

Trait0r
1 day ago

Do we? Isn’t that one super overpriced like the RPI with 16GB? Also no TPU/NPU…
Just for the berries or why?

Hedda
19 hours ago

You can use the official Raspberry Pi AI HAT+, which contains a Hailo-8 AI accelerator NPU. Anyway, for 7B models, at least 8GB RAM is recommended. For 13B models, at least 16GB RAM is recommended. So I’m thinking you might be able to run DeepSeek-R1-Distill-Qwen-14B on a 16GB Raspberry Pi 5 with a Raspberry Pi AI HAT+?

Willy
10 hours ago

The problem is, RPi5 has only a 32-bit data path to DRAM. This makes the memory performance super low and that’s the only thing that matters for LLM inference, so by design it will necessarily be twice as slow as any RK3588 board, which themselves are already not that beefy on this. You can use whatever accelerator you want, it will not make the DRAM faster. The only option would be to have dedicated DRAM with a large bus on the accelerator itself.

Jeff Geerling
1 day ago

The 16GB Pi 5’s been out for weeks; I’ve heard from a number of people who’ve already ordered and received them, at least in the US and UK.

Hedda
19 hours ago

Have you tried running DeepSeek-R1-Distill-Qwen-14B on a 16GB Raspberry Pi 5 with a Raspberry Pi AI HAT+?

Willy
10 hours ago

As a very rough estimate based on my 1.5B test above on Rock5B, I guess that a Qwen-14B quantized at Q4_0 would deliver 20*(1.5/14) / 2 = 1.07 t/s on an RPi5. If you use a Q8 quantization it would fall down to 0.5 t/s.

Hedda
7 hours ago

But that is without any AI accelerator. The question is: what if you add an AI accelerator like the Raspberry Pi AI HAT+ with Hailo-8, or another AI accelerator via the PCIe interface (that is not a full-size GPU card)?

Icenowy Zheng
23 hours ago

Sigh. Every HW vendor now says “runs DeepSeek” when only a distilled model is run. I only consider the 671B model the real “DeepSeek R1”, the distilled models are still quite restricted by their base models.

Hedda
19 hours ago

Restricted, yes, but still useful in some use cases, like a specific LLM agent (for example Home Assistant’s voice assistant LLM fallback).

Jon Smirl
20 hours ago

Can these models process continuous speech on the RK3588?
