This is a guest post about Heterogeneous Multicore Processing (HMP), Real-Time Linux, and Xenomai to develop real-time Linux systems written by Guilherme Fernandes, Raul Muñoz, Leonardo Veiga, Brandon Shibley, all working for Toradex.
Introduction
Application processor usage continues to broaden. System-on-Chips, usually powered by ARM Cortex-A cores, are taking over several spaces where small ARM Cortex-M and other microcontroller devices have traditionally dominated. This trend is driven by several factors, such as:
- The strong requirements for connectivity, often related to IoT, not only from a hardware point of view but also in terms of software, protocols, and security;
- The need for highly interactive interfaces such as multi-touch, high-resolution screens, and elaborate graphical user interfaces;
- The decreasing price of SoCs, a consequence of their growing volumes and new production capabilities.
Typical cases exemplifying the statement above are the customers we see every day starting a product redesign, upgrading from a microcontroller to a microprocessor. This move brings new challenges, as the hardware design is more complicated and the operating system abstraction layer is much more complex. The difficulty of hardware design using an application processor is overcome by the use of reference designs and off-the-shelf alternatives like computer-on-modules or single board computers. On the operating system layer, the use of embedded Linux distributions is widespread in the industry. An immense world of open source tools is available, simplifying the development of complex and feature-rich embedded systems. Such development would be very complicated and time-consuming on microcontrollers. Despite all the benefits, the use of an operating system like Linux still raises a lot of questions and distrust when determinism and real-time control applications are addressed.
A common approach adopted by developers is the strategy of separating time-critical tasks and regular tasks onto different processors. Hence, a Cortex-A processor, or similar, is typically selected for multimedia and connectivity features while a microcontroller is still employed to handle real-time, determinism-critical tasks. The aim of this article is to present some options developers may consider when developing real-time systems with application processors. We present three possible solutions to provide real-time capability to application processor-based designs.
Heterogeneous Multicore Processing
The Heterogeneous Multicore Processing (HMP) approach is a hardware solution. Application processors like the NXP i.MX7 series, the NXP i.MX6SoloX and the upcoming NXP i.MX8 series combine a variety of cores with different purposes. The i.MX7S, for example, is a dual-core processor composed of a Cortex-A7 core at 800 MHz side by side with a Cortex-M4 core at 200 MHz. The basic idea is that the user interface and high-speed connectivity are implemented on an abstracted OS like Linux on the Cortex-A core while, independently and in parallel, control tasks execute on a real-time OS, like FreeRTOS, on the Cortex-M core. Both cores can share access to memory and peripherals, allowing flexibility and freedom when defining which tasks are allocated to each core/OS. Refer to Figure 1.
Some of the advantages of using the HMP approach are:
- Legacy software from microcontrollers can be more easily reused;
- Firmware update (M4 core) is simplified, as the firmware may be a file in the filesystem of the Cortex-A OS;
- Increased flexibility of choosing which peripherals will be handled by each core. Since it is software-defined, future changes can be made without changing hardware design.
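Because both cores can see the same memory, one common way to exchange data between them is a small queue placed in a shared region. The sketch below is a minimal single-producer, single-consumer ring buffer in C11. It is illustrative only: the names `ring_t`, `ring_init`, `ring_push`, `ring_pop` and `RING_SIZE` are hypothetical and are not taken from any vendor API, and a real design would also have to place the structure at an address both cores agree on and account for caches on each side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical capacity; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0u, "RING_SIZE must be a power of two");

/* Hypothetical layout of a queue placed in memory visible to both cores. */
typedef struct {
    _Atomic uint32_t head;            /* written only by the producer */
    _Atomic uint32_t tail;            /* written only by the consumer */
    uint32_t         slots[RING_SIZE];
} ring_t;

static void ring_init(ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side, e.g. the real-time core publishing a measurement. */
static bool ring_push(ring_t *r, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;                 /* queue is full */
    r->slots[head % RING_SIZE] = value;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side, e.g. the Linux application reading the measurement. */
static bool ring_pop(ring_t *r, uint32_t *value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;                 /* queue is empty */
    *value = r->slots[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Each index is written by exactly one side, so the queue needs only atomic loads and stores rather than locks, which keeps the real-time side from ever waiting on Linux.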
More information on developing applications for HMP-based processors is available in these two articles:
- A Balancing Robot Leveraging the Heterogeneous Asymmetric Architecture of i.MX 7 with FreeRTOS and Qt
- FreeRTOS on the Cortex-M4 of a Colibri iMX7.
Toradex, Antmicro, and The Qt Company collaboratively built a robot showcasing this concept. The robot, named TAQ, is an inverted pendulum balancing robot designed with the Toradex Computer on Module Colibri iMX7. The user interface is built upon Linux with the Qt framework running on the Cortex-A7, and the balancing/motor control is deployed on the Cortex-M4. Inter-core communication is used to remotely control the robot and animate its face, as seen in the short video below.
Real-Time Linux
The second approach we present in this article is software-related. Linux is not a real-time operating system, but there are some initiatives that have greatly improved the determinism and timeliness of Linux. One of these efforts is the Real-Time Linux project. Real-Time Linux is a series of patches (PREEMPT_RT) aimed at adding new preemption options to the Linux Kernel along with other features and tools to improve its suitability for real-time tasks. You can find documentation on applying the PREEMPT_RT patch to the Linux kernel and developing applications for it at the official Real-Time Linux Wiki (formerly here).
We ran some tests using the PREEMPT_RT patches on a Colibri iMX6DL to illustrate the improvement in real-time performance. The documentation on preparing the Toradex Linux image to deploy the PREEMPT_RT patch is available at this link. We developed a simple application that toggles a GPIO at 2.5 kHz (200 µs high / 200 µs low). The GPIO output is connected to an oscilloscope, where we measure the resulting square wave and evaluate the actual output timings. The histograms below compare a standard Linux kernel configured for Voluntary Preemption (top) with a PREEMPT_RT-patched Linux kernel configured for Real-time Preemption (bottom). The x-axis represents the measured period of the square wave and the y-axis represents the number of samples measured with that period. The table below the chart presents the worst-case and average data.
| Description | Samples | Smallest (µs) | Worst Case for 99% of Samples (µs) | Worst Case (µs) | Median (µs) | Average (µs) |
|---|---|---|---|---|---|---|
| Default Kernel | 694,780 | 36 | 415 | 4,635 | 400 | 400 |
| PREEMPT_RT Kernel | 683,593 | 369 | 407 | 431 | 400 | 400 |

Table 1: Comparison between Default Kernel and real-time Kernel when generating a square wave.
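The test application itself is not listed in this article. As a rough sketch of how such a periodic toggle is commonly written for a PREEMPT_RT kernel, the C fragment below requests a SCHED_FIFO priority and sleeps until absolute deadlines on CLOCK_MONOTONIC. The helper names (`timespec_add_ns`, `request_fifo_priority`, `run_square_wave`, `record_level`) are hypothetical, and `record_level` is only a stand-in for the real GPIO write, whose register or driver access depends on the board.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <time.h>

#define NSEC_PER_SEC   1000000000L
#define HALF_PERIOD_NS 200000L /* 200 us high + 200 us low = 2.5 kHz */

/* Advance an absolute deadline by ns, keeping tv_nsec below one second. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    assert(ns >= 0 && ns < NSEC_PER_SEC);
    t->tv_nsec += ns;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec += 1;
    }
}

/* Ask for SCHED_FIFO at the given priority.
 * Returns 0 on success, -1 if the process lacks the privileges. */
static int request_fifo_priority(int priority)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Stand-in for the real GPIO write (assumption: board-specific code goes here). */
static long edges_recorded;
static int  last_level;
static void record_level(int level)
{
    last_level = level;
    edges_recorded++;
}

typedef void (*set_level_fn)(int level);

/* Toggle the output every half_period_ns; returns the number of edges produced. */
static long run_square_wave(set_level_fn set_level, long half_period_ns, long toggles)
{
    struct timespec next;
    int level = 0;
    long done = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    while (done < toggles) {
        timespec_add_ns(&next, half_period_ns);
        /* Sleep to an absolute deadline; retry if a signal interrupts the sleep. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) == EINTR) {
        }
        level = !level;
        set_level(level);
        done++;
    }
    return done;
}
```

A caller would typically invoke `request_fifo_priority()` once at start-up and then `run_square_wave(record_level, HALF_PERIOD_NS, n)`. Sleeping until an absolute deadline, rather than for a relative interval, keeps individual wake-up latencies from accumulating as drift in the period.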
An example software system using the PREEMPT_RT patch is provided by Codesys Solutions. They rely on the Real-Time Linux kernel, together with the OSADL (Open Source Automation Development Lab), to deploy their software PLC solution, which is already widespread throughout the automation industry across thousands of devices. The video below presents the solution running on an Apalis iMX6Q.
Xenomai
Xenomai is another popular framework to make Linux a real-time system. Xenomai achieves this by adding a co-kernel to the Linux kernel. The co-kernel will handle time-critical operations and will have higher priority than the standard kernel. To use the real-time capabilities of Xenomai the real-time APIs (aka libcobalt) must be used to interface user-space applications with the Cobalt core, which is responsible for ensuring real-time performance.
Documentation on how to install Xenomai on your target device can be found at the Xenomai website. Additionally, there is a variety of Embedded Hardware that is known to work as indicated in the hardware reference list, which includes the whole NXP i.MX SoC series.
To validate the use of Xenomai on the i.MX6 SoC we also developed a simple experiment. The target device was the Colibri iMX6DL by Toradex. We ran the same test approach as described above for the Real-Time Linux extension. Some parts of the application code used to implement the test are presented below to highlight the use of Xenomai APIs.
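```c
/* Excerpt: blink_task (an RT_TASK), TIMESLEEP (the task period) and the
 * SET_G35 / CLR_G35 GPIO macros are defined elsewhere in the application. */

void blink(void *arg __attribute__((__unused__)))
{
    int iomask = 0;

    /* Make the calling task periodic, starting now. */
    rt_task_set_periodic(NULL, TM_NOW, TIMESLEEP);

    while (1) {
        /* Block until the next period starts. */
        rt_task_wait_period(NULL);

        if (iomask)
            SET_G35;
        else
            CLR_G35;

        iomask = 1 - iomask;
    }
}

int main(void)
{
    /* Task creation: default stack size, priority 99, no mode flags */
    rt_task_create(&blink_task, "blinkLed", 0, 99, 0);
    rt_task_start(&blink_task, &blink, NULL);

    /* Keep the task running until a key is pressed. */
    getchar();
    rt_task_delete(&blink_task);

    return 0;
}
```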
The results comparing Xenomai against a standard Linux kernel are presented in the chart below. Once again, the real-time solution provides a clear advantage – this time with even greater distinction – over the time-response of the standard Linux kernel.
| Description | Samples | Smallest (µs) | Worst Case for 99% of Samples (µs) | Worst Case (µs) | Median (µs) | Average (µs) |
|---|---|---|---|---|---|---|
| Default Kernel | 694,780 | 36 | 415 | 4,635 | 400 | 400 |
| Xenomai Implementation | 1,323,521 | 386 | 402 | 414 | 400 | 400 |

Table 2: Comparison between Default Kernel and Xenomai implementation when generating a square wave.
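The article does not describe how the columns in these tables were computed from the oscilloscope samples. As a sketch of one plausible way to derive them, the C code below sorts a set of measured periods and reads off the smallest, the 99th percentile, the worst case, the median and the average. The names `period_stats`, `percentile_sorted` and `summarize_periods` are hypothetical, and the nearest-rank percentile definition (also used for the median) is an assumption, not necessarily what the authors used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary mirroring the table columns, all in microseconds. */
typedef struct {
    uint32_t smallest_us;
    uint32_t p99_us;      /* "Worst Case for 99% of Samples" (assumed meaning) */
    uint32_t worst_us;
    uint32_t median_us;
    double   average_us;
} period_stats;

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile on an ascending array: the smallest sample
 * such that at least pct percent of all samples are less than or equal to it. */
static uint32_t percentile_sorted(const uint32_t *sorted, size_t n, unsigned pct)
{
    assert(n > 0 && pct >= 1 && pct <= 100);
    size_t rank = (n * pct + 99) / 100; /* ceil(n * pct / 100), 1-based */
    return sorted[rank - 1];
}

/* Sorts periods_us in place and returns the summary statistics. */
static period_stats summarize_periods(uint32_t *periods_us, size_t n)
{
    assert(n > 0);
    qsort(periods_us, n, sizeof periods_us[0], cmp_u32);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += periods_us[i];

    period_stats s;
    s.smallest_us = periods_us[0];
    s.p99_us      = percentile_sorted(periods_us, n, 99);
    s.worst_us    = periods_us[n - 1];
    s.median_us   = percentile_sorted(periods_us, n, 50);
    s.average_us  = sum / (double)n;
    return s;
}
```

With several hundred thousand samples, as in the tables above, the 99th percentile and the absolute worst case can differ widely, which is why both are worth reporting: the first describes typical jitter, the second the bound a control loop must tolerate.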
Conclusion
This article presented a brief overview of some solutions available to develop real-time systems on application processors running Linux as the target operating system. This is a starting point for developers who are aiming to use microprocessors and are concerned about real-time control and determinism.
We presented one hardware-based approach, using Heterogeneous Multicore Processing SoCs, and two software-based approaches, namely the Linux PREEMPT_RT patch and Xenomai. The results presented are not intended to compare operating systems or real-time techniques. Each of them has strong and weak points and may be more or less suitable depending on the use case.
The primary takeaway is that several feasible solutions exist for utilizing Linux with application processors in reliable real-time applications.
Dropping a microcontroller sub-core on die and propping it up with something like an RT Linux OS is just putting a band-aid on a system that was never designed for high-speed General-Purpose Input/Output (GPIO) in the first place. The problem has been rooted in the proprietary on-die buses (e.g., AMBA) that are almost impossible to traverse under user control. Even if the fabless (I.P.) designers/owners like ARM/SoftBank wanted to make the internal buses traversable (or even bypassable), doing so properly would require some changes in the MMU. We see other attempts to mitigate this problem in a roundabout way with the likes of the Programmable Realtime Unit (PRU) on TI’s ARM Cortex A8-based OMAP 3XXX series, for example. Other approaches attempt to achieve higher-speed real GPIO by hijacking DMA and the memory bus.
Ultimately, what the likes of ARM need to do is give access to high-speed GPIO by design. But many believe they do not want to, because once ARM loses control over truly fast GPIO, then OEMs (or maybe even users) can implement (or buy non-ARM I.P. for) all sorts of interfaces that they have to pay ARM for today.
Not directly related but IMO quite impressive: http://forum.armbian.com/index.php/topic/1901-patch-for-quick-interrupt-handling-on-the-h3-fast-gpio/
Interesting article, I didn’t know about Xenomai. Thanks for that. I was wondering why more obvious solution is not used: out of (say) 4 CPU cores available on Cortex Ax, Linux kernel will use the first 3 and will not touch the fourth. Then on the 4th will run time critical code (not under Linux control) that bit bangs the hardware and runs in its own loop/scheduler. Also a block of memory can be “reserved” not to be used by Linux kernel (but can be mmapped by the Linux user-space application), so it will host the 4th cpu (realtime) application and its data. Mmapped memory block will then allow transfer of results/data between the 2 systems. Is there any fundamental issue that prevents such solution (apart from not being tested and engineered)?
One more thing. It would be interesting to know how the tests (presented with graphs) mentioned in the article were conducted. Was the thread running the test code treated with maximum priority (via: schedParam.sched_priority = sched_get_priority_max(SCHED_FIFO) - 2; sched_setscheduler(0, SCHED_FIFO, &schedParam);)? If not, could they measure a difference between setting the scheduling priority and not setting it? From my experience, adding these few lines to the code makes a big difference, as the activity of other userspace tasks running in parallel with the application under test is suppressed (but not disabled; they still work as expected). Sure, in no way does one get real-time performance using these, but one may get just enough “execution precision” to overcome the time-critical issue.
@olin
I think Xenomai operates pretty much the way you described, with a high priority for the co-kernel handling I.O. The CPU cores still share many of the same buses, so it’s still possible that even that way it may not meet timing requirements.
@olin
It doesn’t care when using Xenomai – its supervisor just schedules the ordinary Linux with all drivers/apps after Xenomai RT tasks which can be in its turn pure Linux user-space ones. So any Linux driver or app running under any ordinary Linux scheduling policy can’t influence the RT task. Please note the _driver_ word 😉
Thank you for the interesting article, I was wondering which version of the Linux kernel was used in the experiments. Is it safe to assume that version 2.5 was used as described in the tutorial from Toradex?
http://developer.toradex.com/knowledge-base/real-time-linux
And do you expect any differences with newer versions of the Linux kernel?
@stvl
The kernel used for the article comparison was 3.14.28 based on the kernel provided by NXP. Currently they provide a kernel based on 4.1. Without doing further tests or even looking at what has been improved, I feel I can’t make any assumptions about what to expect, be it in terms of performance or reliability.
If you want to build the RT kernel, there is a recipe for Colibri iMX6 module at http://git.toradex.com/cgit/meta-toradex-nxp.git/tree/recipes-kernel/linux/linux-toradex-rt-4.1-2.0.x?h=morty and the instructions for building a Linux image using OpenEmbedded at http://developer.toradex.com/knowledge-base/board-support-package/openembedded-(core)