Igalia engineer Tvrtko Ursulin has recently submitted a patch to the Linux kernel that adds a NUMA (Non-Uniform Memory Access) emulation implementation for arm64 platforms. It can boost the performance of 64-bit Arm targets by “splitting the physical RAM into chunks and utilizing an allocation policy to better utilize parallelism in physical memory chip organization”.
The NUMA emulation implementation was tested on a Raspberry Pi 5 SBC: after splitting the memory into four emulated NUMA nodes, the Geekbench 6 single-core score improved by 6% and the multi-core score rose by 18%. In other words, that’s like having a Broadcom BCM2712 CPU overclocked from 2.4 GHz up to 2.83 GHz.
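The overclocking comparison is simple arithmetic, and a minimal sketch makes the assumption behind it explicit: it only holds if the benchmark score scales linearly with clock frequency, which is an idealization. The function names below are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Clock a CPU would need to match a given benchmark gain, ASSUMING the
 * score scales linearly with frequency (an idealization, not a
 * measurement).
 */
double equivalent_clock_ghz(double base_ghz, double gain_percent)
{
	assert(base_ghz > 0.0);
	return base_ghz * (1.0 + gain_percent / 100.0);
}

/* Print the comparison for one benchmark result. */
void print_equivalent_clock(double base_ghz, double gain_percent)
{
	printf("+%.0f%% at %.2f GHz ~ %.2f GHz\n", gain_percent, base_ghz,
	       equivalent_clock_ghz(base_ghz, gain_percent));
}
```

Calling `print_equivalent_clock(2.4, 18.0)` reproduces the 2.83 GHz figure for the multi-core gain. Under the same assumption, the 6% single-core gain corresponds to roughly 2.54 GHz.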
The patch is actually quite short, around 100 lines, and the main C code file is about 60 lines long (with the SPDX header stripped):
```c
#include <linux/memblock.h>

#include "numa_emulation.h"

static unsigned int emu_nodes;

int __init numa_emu_cmdline(char *str)
{
	int ret;

	ret = kstrtouint(str, 10, &emu_nodes);
	if (ret)
		return ret;

	if (emu_nodes > MAX_NUMNODES) {
		pr_notice("numa=fake=%u too large, reducing to %u\n",
			  emu_nodes, MAX_NUMNODES);
		emu_nodes = MAX_NUMNODES;
	}

	return 0;
}

int __init numa_emu_init(void)
{
	phys_addr_t start, end;
	unsigned long size;
	unsigned int i;
	int ret;

	if (!emu_nodes)
		return -EINVAL;

	start = memblock_start_of_DRAM();
	end = memblock_end_of_DRAM() - 1;

	size = DIV_ROUND_DOWN_ULL(end - start + 1, emu_nodes);
	size = PAGE_ALIGN_DOWN(size);

	for (i = 0; i < emu_nodes; i++) {
		u64 s, e;

		s = start + i * size;
		e = s + size - 1;

		if (i == (emu_nodes - 1) && e != end)
			e = end;

		pr_info("Faking a node at [mem %pap-%pap]\n", &s, &e);
		ret = numa_add_memblk(i, s, e + 1);
		if (ret) {
			pr_err("Failed to add fake NUMA node %d!\n", i);
			break;
		}
	}

	return ret;
}
```
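To see what the loop in `numa_emu_init()` computes without booting a kernel, here is a minimal userspace sketch of the same splitting arithmetic. It is not part of the patch: `struct fake_node`, `split_dram()` and `print_nodes()` are hypothetical names, the page size is passed in as a parameter, and the zero-size guard is an addition for the sketch only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One emulated node: inclusive physical address range [start, end]. */
struct fake_node {
	uint64_t start;
	uint64_t end;
};

/*
 * Userspace re-statement of the split done in numa_emu_init().
 * page_size must be a power of two and nodes must be non-zero.
 * Returns the per-node size in bytes (the last node may be larger).
 */
uint64_t split_dram(uint64_t start, uint64_t end, unsigned int nodes,
		    uint64_t page_size, struct fake_node *out)
{
	uint64_t size;
	unsigned int i;

	assert(nodes > 0 && end >= start);
	assert(page_size && (page_size & (page_size - 1)) == 0);

	size = (end - start + 1) / nodes;	/* like DIV_ROUND_DOWN_ULL */
	size &= ~(page_size - 1);		/* like PAGE_ALIGN_DOWN */
	assert(size > 0);			/* sketch-only guard */

	for (i = 0; i < nodes; i++) {
		out[i].start = start + (uint64_t)i * size;
		out[i].end = out[i].start + size - 1;
	}
	/* The last node absorbs whatever the rounding left over. */
	out[nodes - 1].end = end;

	return size;
}

/* Print the layout in the spirit of the kernel's pr_info() line. */
void print_nodes(const struct fake_node *n, unsigned int nodes)
{
	unsigned int i;

	for (i = 0; i < nodes; i++)
		printf("node %u: [mem 0x%llx-0x%llx]\n", i,
		       (unsigned long long)n[i].start,
		       (unsigned long long)n[i].end);
}
```

For example, splitting a hypothetical contiguous 8 GiB range starting at address 0 into four nodes yields four 2 GiB chunks. When the total does not divide evenly into page-aligned pieces, only the last node grows to cover the remainder.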
The code can be enabled at build time using the new NUMA_EMULATION Kconfig option and then at boot time using the existing (shared with other platforms) numa=fake=<N> kernel boot argument. Users would also need to set up an interleaving allocation policy for the program under test with:
```sh
numactl --interleave=all COMMAND
```
That applies the policy to one specific program, but Tvrtko also explains that a system-wide policy could be configured via systemd.
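To give an intuition for what an interleaving policy does once the RAM is split into nodes, the sketch below models it as a plain round-robin: page N of an allocation goes to node N modulo the node count. This is a deliberately simplified model, not the kernel's actual policy code, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model of an interleave policy: page N of an allocation is
 * placed on node N % nodes. The real kernel policy is more involved.
 */
unsigned int interleave_node(size_t page_index, unsigned int nodes)
{
	assert(nodes > 0);
	return (unsigned int)(page_index % nodes);
}

/* Count how many pages of an allocation land on each node. */
void count_pages_per_node(size_t pages, unsigned int nodes, size_t *counts)
{
	unsigned int n;
	size_t p;

	for (n = 0; n < nodes; n++)
		counts[n] = 0;
	for (p = 0; p < pages; p++)
		counts[interleave_node(p, nodes)]++;
}
```

In this model, consecutive pages of a buffer end up in different chunks of physical RAM. That spreading of accesses is what the patch description credits for the speedup.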
Although there’s no guarantee that benchmark improvements translate into an overall system improvement, it’s great to have a “free” performance boost. The patch will still have to go through some iterations, and it’s still unclear whether it will be accepted, as Greg replied:
Why not just properly describe the numa topology in your bootloader or device tree and not need any such “fake” stuff at all?
Also, you are now asking me to maintain these new files, not something I’m comfortable doing at all sorry.
Time will tell.
Via Tom’s Hardware and Phoronix
Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.
> the Geekbench 6 single-core score improved by 6%
That’s the same ‘improvement’ you get on the RPi 5 by executing Geekbench 6 multiple times after booting w/o any modification at all. As for the multi score, the RPi 5 is a great target since it suffers from some sort of memory bottleneck anyway.
The quad core BCM2712 scores in GB6 multi only ~200% of the single score while SoCs with better memory interface get closer to 300% or beyond (e.g. only the A76 cores in RK3588)
> executing Geekbench 6 multiple times after booting w/o any modification at all
I remembered wrong. The 6% score improvement on RPi 5 is not due to multiple executions but uptime related. You get lower GB6 scores directly after (re)boot but if you wait 20 minutes or so the scores magically improve by 6%.
Amazing.
I wonder which of these ARM SBCs would work ok-ish for a desktop with Fedora and i3/Sway?
Would RK3588 do it?
> and other arm64 platforms
IMO both the title and the contents are misleading since the patch comment talks only about RPi 5 (massively being bottlenecked wrt memory access) and not arm64 in general and parts of the ‘speed improvements’ are most probably the result of ‘benchmarking gone wrong’.
It’s easy to reproduce: boot RPi 5 and execute GB6 immediately, then wait 20 minutes and execute it again: scores improve by ~6% anyway.
Is it Raspberry Pi-specific? I understand it could work on other 64-bit Arm hardware after reading:
The patch code is arm64 but the ‘speed improvements’ have only been tested on a single SBC in a flawed way using a) a benchmark that improves scores based on uptime and b) testing it solely on the arm64 SoC with the worst memory interface known.
Your readers now may think they will see GB6 scores (or even ‘real world performance’ – LMAO) improving by 18% on any arm64 with this patch set which is pretty unlikely.
Understood. I’ve made some edits.
Well, I have tried GeekBench6 on a RPi5 immediately after reboot and also 30 minutes later and they are identical. I have just the Raspberry Pi OS lite – no GUI, no HDMI connected, just SSH access.
When I apply the NUMA patch in question, I see an improvement of +4.6% and +15.6% for single-core and multi-core, respectively.
On RPi4 I see +1.0% and +7.1% for single-core and multi-core, respectively.
This is with THP enabled. THP itself brings about +1.8% and +1.1% (RPi5) and +2.7% and +1.9% (RPi4). I don’t know why THP is not compiled into the Raspberry Pi kernel.
I also ran an HPL test: with the NUMA patch I get +3.3% on RPi5, while on RPi4 I get -0.2%.
I’d like to try RK3588, but it uses an older kernel.