Testing AI and LLM on Rockchip RK3588 using Mixtile Blade 3 SBC with 32GB RAM

We were interested in testing artificial intelligence (AI), and specifically large language models (LLMs), on Rockchip RK3588 to see how the GPU and NPU could be leveraged to accelerate them and what kind of performance to expect. We had read that LLMs can be compute- and memory-intensive, so we looked for a Rockchip RK3588 SBC with 32GB of RAM, and Mixtile, a company that develops hardware solutions for various applications including IoT, AI, and industrial gateways, kindly offered us a sample of their Mixtile Blade 3 pico-ITX SBC with 32GB of RAM for this purpose.

While the review focuses on using the RKNPU2 SDK with computer vision samples running on the 6 TOPS NPU, and a GPU-accelerated LLM test (since the NPU implementation is not ready yet), we also went through an unboxing to check out the hardware and a quick guide showing how to get started with Ubuntu 22.04 on the Mixtile Blade 3.

Mixtile Blade 3 unboxing

The package that Mixtile sent contained two boxes. The first box was for the Mixtile Blade 3 single board computer and the second box was for the Mixtile Blade 3 Case.

Let’s have a look at the Mixtile Blade 3 SBC box first. We found the board quite heavy the very first time we picked it up. That’s because there’s a heatsink that completely covers the bottom of the PCB to ensure fanless operation by dissipating the heat from the RK3588 SoC.

The Mixtile Blade 3’s rear panel features two 2.5Gbps Ethernet ports, two HDMI ports (one for output and the other for input), and two USB Type-C ports. The board also comes with a 30-pin GPIO header, a mini PCIe connector, a MIPI-CSI camera connector, a microSD card socket, a fan connector, and a debug header for a USB to TTL board. There’s also a U.2 edge connector (SFF-8639) with 4-lane PCIe Gen3 and SATA 3.0 signals used to connect PCIe/NVMe devices or multiple Blade 3 boards together to form a cluster.

Let’s now check out the Mixtile Blade 3 case. It is a CNC aluminum enclosure that also ships with a U.2 to M.2 adapter for connecting an NVMe SSD or other PCIe device (like an AI accelerator), a power button, an LED to indicate the working status, a screw set, and a screwdriver.

Mixtile Blade 3 case assembly

We will now assemble the Mixtile Blade 3 board into the case. The first step is to remove the original heatsink, then attach the U.2 to M.2 adapter to the board and insert it into the case, and finish off the assembly by closing the cover with a silicone thermal pad, as the metal case itself will act as the heatsink cooling the Rockchip RK3588 CPU.

Heatsink that ships with the board (left), Blade 3 installed in the enclosure (middle), and top cover with thermal pad (right)

The kit does not include a power adapter, so you’ll have to bring your own to power up the Mixtile Blade 3 board. It requires a USB-C power adapter compatible with the PD 2.0/PD 3.0 standard. Read our previous article about the Mixtile Blade 3 SBC to get the full specifications of the board.

Ubuntu 22.04 on the Mixtile Blade 3

The Mixtile Blade 3 ships with an Ubuntu 22.04 image, so it can boot to Linux right out of the box. But if you want to install a new operating system or update the current image, you can use the same methods as with other single board computers based on Rockchip SoCs, namely the RKDevTool program or a microSD card.

Since the Mixtile Blade 3 only comes with two USB ports, and one is already connected to the power supply, we had to use a USB-C dock to connect a keyboard and a mouse to the board.

After this first boot, we’ll go through the Ubuntu OEM setup wizard, and once complete, we can access the usual Ubuntu 22.04 Desktop.

You can find the 128GB eMMC flash and the additional 256GB NVMe SSD we added through the U.2 to M.2 adapter with fdisk:


Our SBC does indeed come with 32GB of RAM:

Testing AI performance via RK3588’s NPU using the RKNPU2 toolkit

We will be testing the Mixtile Blade 3’s AI performance using the YOLOv5 sample and RKNN benchmark found in the RKNPU2 SDK, as we did with the Youyeeyoo YY3568 SBC powered by a Rockchip RK3568 with an entry-level 0.8 TOPS NPU.

After installing the RKNN 2 toolkit from GitHub, we can compile the YOLO5 example:


Then we can run the YOLO5 samples with a test image:


As expected, the Rockchip RK3588’s AI performance is much better than that of the Rockchip RK3568, as shown in the table below.

Board/CPU | First run | Average of 10 runs
Mixtile Blade 3 (RK3588) | 25.523000 ms | 18.620700 ms
YY3568 (RK3568) | 78.917000 ms | 69.709700 ms

The Mixtile Blade 3 board is roughly three to four times faster than the Rockchip RK3568-based board (about 3.1x on the first run and 3.7x on the 10-run average). Converting the 18.62 ms average to FPS shows the Mixtile Blade 3 can run YOLOv5 at about 54 FPS, which is fast enough for real-time applications.
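
The conversion from milliseconds to FPS and the speed-up over the RK3568 can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the latencies from the table above; the function and variable names are ours.

```python
# Per-inference latencies in milliseconds, copied from the table above.
LATENCY_MS = {
    "Mixtile Blade 3 (RK3588)": {"first": 25.523, "avg10": 18.6207},
    "YY3568 (RK3568)": {"first": 78.917, "avg10": 69.7097},
}


def ms_to_fps(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the board with the lower latency is."""
    return slow_ms / fast_ms


blade = LATENCY_MS["Mixtile Blade 3 (RK3588)"]
yy3568 = LATENCY_MS["YY3568 (RK3568)"]

blade_fps = ms_to_fps(blade["avg10"])
first_speedup = speedup(yy3568["first"], blade["first"])
avg_speedup = speedup(yy3568["avg10"], blade["avg10"])

print(f"Blade 3 average: {blade_fps:.1f} FPS")
print(f"Speed-up, first run: {first_speedup:.2f}x")
print(f"Speed-up, 10-run average: {avg_speedup:.2f}x")
```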

Here are the results from the RKNN Benchmark run 10 times on the Mixtile Blade 3:


The benchmark results show an average inference rate of 63.123 FPS (frames per second), which confirms the Mixtile Blade 3 board is suitable as an Edge AI computer.

Testing with still images is fine, but since the Mixtile Blade 3 is capable of real-time AI processing, we also decided to test YOLOv5 with a USB camera and stream the results over RTSP. The first step was to install the MediaMTX RTSP server on the Mixtile Blade 3 following the instructions on GitHub.

We also edited the mediamtx.yml configuration file to encode the webcam video output with H.264 and stream it at 640 x 640 resolution.
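
To give an idea of what such an encoding pipeline looks like, here is a minimal Python sketch that builds an FFmpeg command capturing a webcam, scaling it to 640 x 640, encoding it with H.264, and publishing it to a local RTSP server. It is an illustration rather than the configuration used in this review: the device path /dev/video0, the stream name "cam", and the encoder tuning options are all assumptions, and port 8554 is simply the default RTSP port of MediaMTX.

```python
import shlex


def build_ffmpeg_rtsp_command(
    device: str = "/dev/video0",                    # assumed webcam device node
    width: int = 640,
    height: int = 640,
    rtsp_url: str = "rtsp://localhost:8554/cam",    # assumed stream path
) -> list[str]:
    """Build an FFmpeg command line that publishes a webcam as H.264 over RTSP."""
    return [
        "ffmpeg",
        "-f", "v4l2", "-i", device,          # capture from a Video4Linux2 device
        "-vf", f"scale={width}:{height}",    # resize to the model input size
        "-c:v", "libx264",                   # software H.264 encoder
        "-preset", "ultrafast",              # assumed: favour low CPU usage
        "-tune", "zerolatency",              # assumed: favour low latency
        "-pix_fmt", "yuv420p",
        "-f", "rtsp", rtsp_url,              # publish to the RTSP server
    ]


cmd = build_ffmpeg_rtsp_command()
print(shlex.join(cmd))
```

The printed command can be run in a shell once the RTSP server is up, or passed as a list to subprocess.run() from Python.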


We can test the RTSP streaming on the board with the following command:


The detected objects are saved in the log file since OpenCV is not used in the test, and the video will just show boxes around the detected objects as you’ll see in the video below.


YouTube video player

The video clip above shows good AI processing performance with a high frame rate for object detection and tracking.

Testing LLM performance on Rockchip RK3588 (GPU)

The initial idea was to test large language models leveraging the 6 TOPS NPU on Rockchip RK3588 like we just did with the RKNPU2 above. But it turns out this is not implemented yet, and instead, people have been using the Arm Mali G610 GPU built into the Rockchip RK3588 SoC for this purpose.

We started with Haolin Zhang’s llm-rk3588 project on GitHub, but despite our efforts, we never managed to make it work on the Mixtile Blade 3 board. Eventually, we found instructions to run an LLM on RK3588 using docker that worked for us. The blog post shows how to use the models RedPajama-INCITE-Chat-3B-v1-q4f16_1 with 3 billion parameters and Llama-2-7b-chat-hf-q4f16_1 with 7 billion parameters, but we also tried the Llama-2-13b-chat-hf-q4f16_1 model with 13 billion parameters to test the performance and try to make full use of the 32GB of RAM at our disposal.

RedPajama-INCITE-Chat-3B-v1-q4f16_1 LLM model test

We ran the following command to start docker with the 3B LLM model:


We used the prompt “Explain why free electrons in an insulator cannot jump over the energy gap to the conduction band”:


The performance is good, and the top command shows that system memory usage when running the RedPajama-INCITE-Chat-3B-v1-q4f16_1 model is around 3.9GB.

Testing Llama-2-7b-chat-hf-q4f16_1 model

We restarted docker with the Llama 2 model with 7B parameters …


… and used the same prompt as above:


The performance is still good, and the system memory usage is now around 6.7GB.

Testing Llama-2-13b-chat-hf-q4f16_1 13B model on RK3588

For this test, we will use the docker.io/milas/mlc-llm:redpajama-3b image, and import the files related to the Llama-2-13b-chat-hf-q4f16_1 model in docker before reloading the model and running the prompt “Explain why free electrons in an insulator cannot jump over the energy gap to the conduction band”:


The performance is really slow with the text being slowly printed out in the terminal, and we noticed that one of the GPU-related lines shown in the other models is gone:


So it’s not 100% clear whether the GPU is used, although it loaded the file “/mlc-llm/dist/prebuilt/lib/Llama-2-13b-chat-hf-q4f16_1-mali.so”. Having said that, it does work, and there’s about 10.6GB of memory used when running the 13B parameter model in docker, which would imply 16GB RAM might be enough…
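
As a rough sanity check on these memory figures, the Python sketch below estimates the size of the weights alone for a 4-bit quantized model and compares it with the system memory usage we observed for the three models. It is a back-of-the-envelope estimate under stated assumptions: it uses nominal parameter counts (3, 7, and 13 billion), counts exactly 4 bits per weight, and ignores quantization scales, the KV cache, the runtime, and the operating system itself, all of which are included in the figures reported by top.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# (nominal parameters in billions, system memory usage observed in this review, in GB)
OBSERVED = {
    "RedPajama-INCITE-Chat-3B-v1-q4f16_1": (3, 3.9),
    "Llama-2-7b-chat-hf-q4f16_1": (7, 6.7),
    "Llama-2-13b-chat-hf-q4f16_1": (13, 10.6),
}

for model, (params_b, observed_gb) in OBSERVED.items():
    weights_gb = quantized_weight_gb(params_b)
    rest_gb = observed_gb - weights_gb
    print(f"{model}: ~{weights_gb:.1f} GB of weights, "
          f"{observed_gb} GB observed, ~{rest_gb:.1f} GB for everything else")
```

Under these assumptions, the weights of the 13B model account for roughly two thirds of the memory in use, which is consistent with the idea that a 16GB board could be enough.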

We have been told Rockchip is working on an LLM implementation leveraging the NPU and that it will be significantly faster than the GPU implementation. We’ll try to write another quick review once it is released.

Summary of LLM results on RK3588

We used the same prompt for each of the models, namely “Explain why free electrons in an insulator cannot jump over the energy gap to the conduction band”, and all could answer this question, but each performed at a different speed. The table below summarizes the prefill and decode speeds in tok/s (tokens per second), that is, how many tokens (words or subunits of words) were processed per second by each model.

Model | Prefill (tok/s) | Decode (tok/s)
RedPajama-INCITE-Chat-3B-v1-q4f16_1 | 4.6 | 5.1
Llama-2-7b-chat-hf-q4f16_1 | 4.8 | 2.8
Llama-2-13b-chat-hf-q4f16_1 | 2.4 | 1.2
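
To put these rates in perspective, the Python sketch below estimates how long a complete answer would take with each model: the prompt is processed at the prefill rate, and the answer is generated at the decode rate. The prompt and answer lengths (20 and 200 tokens) are hypothetical values picked for illustration, not measurements from our tests.

```python
# (prefill tok/s, decode tok/s) from the table above.
RATES = {
    "RedPajama-INCITE-Chat-3B-v1-q4f16_1": (4.6, 5.1),
    "Llama-2-7b-chat-hf-q4f16_1": (4.8, 2.8),
    "Llama-2-13b-chat-hf-q4f16_1": (2.4, 1.2),
}


def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimated time to read the prompt and then generate the answer, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


PROMPT_TOKENS = 20    # hypothetical prompt length
OUTPUT_TOKENS = 200   # hypothetical answer length

for model, (prefill_tps, decode_tps) in RATES.items():
    seconds = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, decode_tps)
    print(f"{model}: about {seconds:.0f} s for a {OUTPUT_TOKENS}-token answer")
```

With these assumed lengths, the decode rate dominates the total time, which is why the 13B model feels so much slower in the terminal than the 3B model.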

Finally, we used Google Gemini to evaluate the answers from the models tested above to help us decide which answer is the best:


Unsurprisingly, the quality of the answer improves as more parameters are included in the model.

Conclusion

After testing AI and LLM on the Rockchip RK3588-powered Mixtile Blade 3 board with 32GB RAM, we can conclude it performs well on workloads such as YOLOv5 object detection with real-time performance, and LLMs can also be run successfully on the Arm Mali-G610 GPU, but larger models would benefit from the NPU acceleration coming later this year.

The Mixtile Blade 3 SBC itself is offered with up to 256GB eMMC flash, supports NVMe storage, and is one of the rare RK3588 boards actually available with 32GB RAM suitable to run LLMs. The build quality of the metal case is excellent, and it is designed to support wireless use without signal degradation thanks to areas with plastic covers, but we found the fan to be quite noisy.

The documentation is well done, arranged in various sections, and fairly complete, which allows users to get started without much hassle. Depending on your use case, the board can be a bit cumbersome to use: for example, you’d need an mPCIe module for WiFi, and a USB-C dock is needed to connect a USB keyboard and mouse combo. But its low-profile design with heatsink and U.2 connector makes it ideal for clusters of boards, especially for applications needing a lot of memory, with each board supporting up to 32GB RAM. The company also provides software drivers to get started with cluster computing.

We’d like to thank Mixtile for sending the Blade 3 Rockchip RK3588 SBC with 32GB RAM for our AI and LLM experimentation. The board can be purchased on the Mixtile shop from $229 with 4GB RAM and 32GB flash up to $439 in the 32GB/256GB configuration tested here. You’ll also find it on Aliexpress, but the price is quite high, and the 32GB RAM model is not available there. Besides the Mixtile shop, you might be able to find the 32GB RAM model from other distributors.

CNXSoft: This article is a translation – with some edits – of the review on CNX Software Thailand by Arnon Thongtem and edited by Suthinee Kerdkaew.


12 Replies to “Testing AI and LLM on Rockchip RK3588 using Mixtile Blade 3 SBC with 32GB RAM”

  1. Pity the A311D has a 4Gb RAM limit, otherwise it would have been interesting to see how its open source driver 5Tops NPU fares against the RK3588’s 6Tops.

    Have I understood the article correctly, that we should soon see an open source driver for the RK3588’s NPU?

  2. Great review as per your usual, thanks!

    I would have liked it even more if you measured power usage with the NPU under load, as I’m quite curious about how much energy it consumes.

    Nice also to finally see a 32GB RK3588 that isn’t vaporware, unlike the OrangePi5 which has been out of stock everywhere since forever.

  3. Thanks for testing this! I tried to test with the GPU as well a month or two ago and never managed to make it work. Above it just seems that it’s more of a limiting factor than a help. At least it offloads some cores, but seems to slow down the whole thing. On my rock5b with 4G, I’m running mistral-7B at 4.1 tok/s prompt eval and 3.82 to generate the response. It’s 36% faster than what you got here. Note that I’m careful about only using the big cores, as using both big and little ones makes the whole thing advance at the speed of the small ones.

    1. I am same Willy
      I am not really sure about your test results as the newer Arm v8.2 mat/mul vector instructions on the four big cores are faster than the GPU which is only a G610 MC4 not a MC 20 that you might expect in a flagship Phone.

      I have not seen a framework that manages to use the OpenCL API for the Mali G610 and definitely haven’t seen one that is faster than the big cores running my usual test of https://github.com/ggerganov/llama.cpp

      Maybe it is using the GPU, as it’s approx. 25% slower than the big-core CPUs

      1. Note that RK3588 doesn’t have the MMLA extension, it’s only optional in 8.2. The Altras don’t have it either, but Graviton3 does. Anyway the memory bandwidth is a very important factor for such models and our boards do not exactly excel in this regard with only 64-bit total.

        1. the cpu report fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp

          ASIMD is multiply-accumulate instruction and it supports various numerical formats for the instruction.
          Not sure what MMLA extension is?
          Its why a rk3588 and Pi5 for ML is so much faster (x6) than a Pi4 for much ML.

          1. Sorry, MMLA was the define to build llama with Matrix Multiply support before it was able to detect it. It’s reported as "__ARM_FEATURE_MATMUL_INT8" by the compiler when the CPU supports it. You can see that here: https://developer.arm.com/documentation/101028/0012/13–Advanced-SIMD–Neon–intrinsics?

            However the MatMul extension is definitely not present on A76 nor on Neoverse-N1, it came with armv8.6-a and is available on Neoverse-V1 for example. You can easily check with gcc-10 -dM -xc -E /dev/null -mcpu=cortex-a76 |grep FEAT that does not report MATMUL. However you definitely have it if you build with -mcpu=cortex-a76+i8mm (but the code won’t work, I tried on llama).

            One thing the A76 has is the dot product that apparently is used quite a bit for some matrix operations.

            Also the differences must come from another flag than asimd since it was already present on A72 (and on rpi4). Here’s what my RPi4 has: fp asimd evtstrm crc32 cpuid. I think instead it should come from the FP16 arithmetics which do make a significant difference.

          2. Yeah, but whatever it is, the x6 speed-up you get running LLMs over a Pi4 is vastly more than the series and clock improvements, whatever instructions it uses.
            The CPU is faster than the GPU as far as I could make out running tests with ArmNN.
            I never had any probs with the Llama.cpp build just the x6 over the A73 was excellent on the A76.
            I never really fully worked out Apple Arm 1st Class Citizen but the A76 seems to take more advantage of the code than A73.

            What would be really good is if a model could be partitioned and run over CPU & GPU but not sure as much operation seems purely serial.

      2. I will have to check that repo out as they have got opencl going much better than llama.cpp going as I have been holding out for Vulkan and the new Panthor drivers.
        Prob 36% is about right, as just because there is a GPU it doesn’t mean it will be faster than the CPU, especially when the CPU has specific Mat/Mul vector instructions as the Cortex A76 has.
        What would be amazing is if the LLM can be partitioned in any way so that CPU & GPU can run in parallel.

  4. I finally succumbed to buying an RP5 8GB and then added an Orange PI5Plus 16Gb for good measure and comparison, with the idea of returning the lesser system…

    And the RAM price issue on these ARMs keeps me baffled, because on x86 getting 64, or with DDR5 even 128GB on an SBC is both relatively easy and cheap with SO-DIMMs: I got plenty of NUCs to prove it!

    But on these ARMs, RAM capacity is sold at Apple prices, which makes a 32GB very unattractive, even if it could be bought.

    So I wonder: are these SoCs simply unable to use DIMMs, lacking the amplifiers and IP blocks?

    And are the prices on the stacked RAM chips (there are only ever two packages on the boards, after all) the real limiting factor, because the number of dies keeps doubling inside the package?

    1. Large memory chips are expensive, those with lower capacities not. 32 GB RAM on a RK3588 device is made out of two expensive 128Gb chips while a 32 GB DIMM is made out of eight inexpensive 32 Gb chips.

      DIMMs don’t matter in the ‘Android e-waste world’ since all those SoCs lack the data lines to communicate with the DIMM’s SPD EEPROM (to retrieve timing data).

  5. 4-5 token/s sound like pure CPU based inference with these models to me and it’s pretty much what you get from the DRAM bandwidth, because that’s the limiting factor on LLMs.

    And it doesn’t much matter if you run this on a 22-core Broadwell Xeon or a laptop according to my tests, fitting it into your GPU’s VRAM and the bandwidth of that VRAM (together with the representational density of the weights) mostly determines the token generation speed. The cores (CPU and GPU) are really rather bored in LLMs (unless they are really undersized), an iGPU or even an NPU won’t be any faster, if all weights reside in shared DRAM.

    With smaller LLM models like 7B or 13B Llama-2, Mistral or similar at 4-bit quantizations I can get 40-60 token/s on my RTX 4090. But as soon as the host or the PCIe bus get involved, things go single digit/s.

    E.g. with many of these models and the proper framework you can split the layers into GPU and CPU layers in an attempt to fit them into what you have (70B models even at 4bit won’t fit into my RTX4090’s 24GB). But it’s rarely worth it, because it means data will have to cross the PCIe and host DRAM bottlenecks at quite below 100GB/s, while that GPU achieves nearly 1TB/s.

    Same with dual GPUs, which I’ve tried in my desperation, too: when the layers are too connected, you might as well not use a 2nd GPU at all, because it’s single digit token/s rates because of the PCIe bus between them.

    There are good reasons why NV-Link and HBM stacks are used by the pros…

    NPUs really target energy efficiency in dense and small image and sound models, they’re not any help with sparse and sequential LLMs.
