Allwinner V831 NPU (Neural Processor Unit) reverse-engineered

When Sipeed introduced the MAIX-II Dock AIoT vision development kit, they asked the community for help reverse-engineering Allwinner V831’s NPU in order to build an open-source AI toolchain based on NCNN.

Sipeed had already decoded the NPU registers, and Jasbir offered to help with the next step, receiving a free sample board to try it out. Good progress has been made, and it’s now possible to detect objects such as a boat using a cifar10 object recognition sample.

V831 NPU open-source toolchain

Allwinner V831’s NPU is based on a customized implementation of the open-source NVIDIA Deep Learning Accelerator (NVDLA) architecture, something that Allwinner (through Sipeed) asked us to remove from the initial announcement. After some reverse-engineering work, Jasbir determined the following key findings:

  1. The NPU clock defaults to 400 MHz, but can be set between 100 and 1200 MHz.
  2. The NPU is implemented with the nv_small configuration (NV Small Model) and relies on shared system memory for all data operations.
  3. int8 and int16 are supported, with int8 preferred for speed and because of the limited on-board memory (64 MB).
  4. 64 MACs (Atomic-C * Atomic-K).
  5. Memory-mapped registers are programmable from userspace (see the sketch after this list).
  6. Physical addresses are required when referencing weights and input/output data locations, meaning kernel memory needs to be allocated and its physical addresses retrieved when the NPU is driven from userspace.
  7. NPU weights and input/output data follow a layout similar to the NVDLA private formats, so common formats like NHWC or NCHW must be transformed before being fed to the NPU.
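
Since the registers are memory-mapped and programmable from userspace, one minimal way to experiment with them is to map the register window through /dev/mem. The sketch below is only illustrative: NPU_BASE and NPU_REG_SIZE are placeholder values, and the real base address and register offsets come from Sipeed’s register decoding and the v831-npu code, not from this article.

```c
/* Minimal sketch: map the V831 NPU register block from userspace via /dev/mem.
 * NPU_BASE and NPU_REG_SIZE are placeholders, not the real V831 values. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPU_BASE      0x01000000UL  /* placeholder base address */
#define NPU_REG_SIZE  0x10000UL     /* placeholder register window size */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    volatile uint32_t *npu = mmap(NULL, NPU_REG_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, NPU_BASE);
    if (npu == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Registers can now be read or written as 32-bit words, e.g. to poll a
     * status register or kick off a layer once descriptors are programmed. */
    printf("first register word: 0x%08x\n", (unsigned)npu[0]);

    munmap((void *)npu, NPU_REG_SIZE);
    close(fd);
    return 0;
}
```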

Those findings allowed him to adapt the cifar10 demo code from Arm’s CMSIS_5 NN library, removing all Allwinner closed-source binaries in the process. You’ll find the source code in the v831-npu repository on GitHub, and you can check out Jasbir’s post to find out how to try it out, provided you have an Allwinner V831 board on hand.

The current code supports direct convolutions, bias addition, ReLU/PReLU, element-wise operations, and max/average pooling, and there’s more work to be done, including the development of a weight and input/output data conversion utility (a rough sketch of such a repacking is shown below) and integration into an existing AI framework.
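
As an illustration of what such a conversion utility would need to do, the sketch below repacks an int8 NCHW tensor into a channel-grouped layout, with channels stored in groups of ATOMIC_C (assumed here to be 8 for int8 on the nv_small configuration). It only shows the general idea; the exact stride, padding, and alignment rules of the real NVDLA private formats are not reproduced.

```c
/* Illustrative repacking of an int8 NCHW tensor into a channel-grouped
 * layout (groups of ATOMIC_C channels), in the spirit of the NVDLA private
 * feature format. Not the exact NVDLA layout. */
#include <stdint.h>
#include <string.h>

#define ATOMIC_C 8  /* assumption: channel group size for int8 on nv_small */

/* dst must hold ceil(c / ATOMIC_C) * ATOMIC_C * h * w bytes. */
void nchw_to_grouped(const int8_t *src, int8_t *dst, int c, int h, int w)
{
    int groups = (c + ATOMIC_C - 1) / ATOMIC_C;
    memset(dst, 0, (size_t)groups * ATOMIC_C * h * w);  /* zero-pad unused channels */

    for (int ch = 0; ch < c; ch++) {
        int g = ch / ATOMIC_C;   /* which channel group */
        int k = ch % ATOMIC_C;   /* position within the group */
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                /* source: plane-major NCHW; destination: group, row, column,
                 * then channel within the group */
                dst[(((size_t)g * h + y) * w + x) * ATOMIC_C + k] =
                    src[((size_t)ch * h + y) * w + x];
            }
        }
    }
}
```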

The good news is the work should also benefit other platforms featuring an NVDLA-based AI accelerator, including the BeagleV SBC that has just started to find its way into the hands of developers in the last few days.
