Strong Isolation and Cyber-Security for Mixed-Criticality Cyber-Physical Systems

A story heard thousands of times

“Software complexity is increasing.” I’m sure this is not the first time you have read this claim, nor will it be the last. It may sound rhetorical, but repeating it over time is nothing but a concrete observation of the incessant pace at which several technological domains are evolving thanks to software. Cyber-Physical Systems (CPS) are one of the most representative cases in which software plays a pivotal role in providing advanced functionality, such as autonomous driving, robotic surgery, and intelligent traffic control. To name a relevant example, in the automotive domain the trend of software complexity was already clear about ten years ago, when the typical software size of a high-end car reached 100 million lines of code, while those of Facebook and the control software of a Boeing 787 airplane were about 15 and 60 million lines of code, respectively.

To cope with software complexity while containing the costs, energy consumption, space, and weight of systems, integrating multiple functionalities of different nature and scope on the same computing platform is becoming almost mandatory.

Multi-domain systems: a pragmatic design paradigm for mixed-criticality systems

The design of multi-domain software systems is an increasingly established industrial practice to address this issue. Each domain is an execution context with its own set of applications, handled by a dedicated operating system, and with its own set of non-functional requirements. Domains are typically characterized by heterogeneous criticality and security levels: for this reason, these systems are also referred to as Multiple Independent Levels of Security/Safety (MILS) solutions.

For example, a multi-domain system typically employs a non-critical domain based on Linux, which executes high-performance software components, and a critical domain based on a Real-Time Operating System (RTOS), which hosts the execution of safety-critical software.

All chip vendors already provide customized Linux distributions and the corresponding support for their high-end embedded platforms (in particular, those that include processors with a memory management unit, e.g., of the Arm Cortex-A family). Linux is de facto increasingly used by embedded software designers due to the rich availability of device drivers, libraries, network stacks, middleware, etc., which enable the development of advanced functionality with limited effort, e.g., functionality using high-speed connectivity, high-performance sensors such as 3D cameras and lidars, and accelerated machine-learning algorithms. Furthermore, Linux is free and supported by a wide community of specialists.

All these peculiarities make Linux an excellent choice for reducing time to market, containing costs, and ensuring the feasibility of projects.

At the same time, due to its large codebase, its highly dynamic behavior, and the fact that it was designed as a general-purpose operating system, to date Linux cannot actually be used to realize high-integrity software domains that implement safety-critical functionality, notwithstanding the many and unceasing efforts to improve its safety and security (e.g., the ELISA project).

For this reason, it is common to couple Linux with an RTOS that supports the execution of high-criticality software in another domain. In this way, it is possible to certify at a high integrity level only one software subsystem, i.e., the one served by the RTOS, provided that the Hypervisor is also certified at the same integrity level. This paradigm can clearly be extended to cases with more than two domains and with even more heterogeneous criticality levels.

The need for strong isolation

In a multi-domain system, software domains with different levels of safety and security end up running together on the same hardware platform. It is hence of utmost importance to ensure that software domains are strongly isolated from one another, at least those of different criticality. To guarantee non-functional requirements and enable certification at a high integrity level, it is simply not acceptable for a high-criticality domain to be affected by what happens in a low-criticality one.

Most (if not all) Hypervisor technologies attempt to address this issue by implementing temporal and spatial isolation mechanisms at the level of processor cores, memory space, and peripherals. Cores are either entirely dedicated to a domain or shared by multiple domains by means of CPU-time reservation algorithms. The memory space is instead partitioned so that each domain has its own portion of memory, with the exception of possible shared-memory regions for inter-domain communication. Finally, peripherals are either assigned to one domain only or virtualized, with one of the domains (or the Hypervisor itself) regulating access to them.
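To make this scheme concrete, the following sketch shows what the configuration of a two-domain system could look like. It is purely illustrative: the types, fields, and values are hypothetical and do not correspond to the API of any specific hypervisor.

```c
/* Purely illustrative domain descriptors: the types and fields below are
 * hypothetical and do not correspond to any specific hypervisor API. */
#include <stddef.h>
#include <stdint.h>

struct mem_region {
    uint64_t phys_start;  /* start of the physical region assigned to the domain */
    uint64_t size;        /* region size in bytes */
    uint32_t flags;       /* e.g., read/write/execute, or shared for inter-domain comm. */
};

struct domain_config {
    const char       *name;
    uint64_t          cpu_mask;      /* cores dedicated to (or shared by) the domain */
    struct mem_region ram;           /* private RAM partition */
    struct mem_region shared;        /* optional shared-memory region for communication */
    const uint32_t   *assigned_irqs; /* interrupts of the peripherals assigned to it */
    size_t            num_irqs;
};

/* Example: a critical RTOS domain pinned to core 0 and a non-critical Linux
 * domain on cores 1-3, with disjoint RAM partitions and one shared region. */
static const uint32_t rtos_irqs[]  = { 45 };          /* e.g., a CAN controller */
static const uint32_t linux_irqs[] = { 79, 81, 112 }; /* e.g., Ethernet, USB, GPU */

static const struct domain_config domains[] = {
    { "rtos",  0x1, { 0x60000000, 0x04000000, 0 }, { 0x7F000000, 0x00100000, 1 },
      rtos_irqs,  1 },
    { "linux", 0xE, { 0x68000000, 0x10000000, 0 }, { 0x7F000000, 0x00100000, 1 },
      linux_irqs, 3 },
};
```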

This isolation scheme is definitely pragmatic, driven by common sense, and often believed to be enough to properly isolate domains. Unfortunately, this is not the case. Even if two domains run on different cores, access two disjoint memory address spaces, and use different sets of peripherals, their execution can still be mutually affected by unintended contention for both micro-architectural and memory resources. This contention can, unfortunately, lead to highly unpredictable interference among domains, due to at least four major phenomena (a simple way to observe their combined effect empirically is sketched right after the list).

  1. Contention at shared caches (Figure 1(a)). In multiprocessor platforms, it is common to have one or more cache levels shared by multiple cores. Furthermore, it is also common that the cache management policies are not under the control of the programmer (e.g., as for ARMv8.0-A platforms), so that it is not possible to decide which portions of cache memory are actually used by the various cores. In these cases, the data and instructions loaded into a shared cache as a consequence of the execution of a critical domain may be evicted by the data and instructions required by a non-critical domain. The critical domain will hence experience cache misses originating from a non-critical one, which will result in extra delays during the execution of critical software. Hence, even if the two domains run on different cores, the timing performance of a critical domain depends on the behavior of a non-critical one.
  2. Contention at the memory controller (Figure 1(b)). Commercial off-the-shelf memory controllers for DRAMs typically implement arbitration policies for memory transactions that aim at maximizing throughput rather than ensuring time predictability. For instance, mechanisms to explicitly prioritize certain memory transactions are often lacking. Memory transactions issued by non-critical domains can hence easily interfere with those issued by critical ones, even if directed to different memory areas and/or different DRAM banks. This phenomenon introduces memory-access-related delays in critical domains that depend on the behavior of non-critical domains, hence exacerbating the problems caused by resource contention.
  3. Contention of DRAM banks (Figure 2(a)). Again with the purpose of maximizing throughput in accessing memories, it is common to encounter memory layouts that are implicitly configured (e.g., by memory controllers) to exploit parallel memory accesses to different banks as much as possible. In these cases, the various domains will most likely end up accessing all DRAM banks, with the result that both critical and non-critical domains may contend for access to the same bank. This increases the contention experienced by memory transactions issued by critical domains, as the transactions directed to the same bank are forced to be serialized independently of the arbitration policy implemented by the memory controller.
  4. Memory contention generated by I/O peripherals (Figure 2(b)). Many I/O peripherals such as Ethernet controllers include direct memory access (DMA) modules to autonomously retrieve and place data from/to memories that are shared with the processor cores. As a matter of fact, they constitute other relevant actors that interact with the memory bus and hence contribute, alongside processor cores, to the contention scenarios mentioned above. Indeed, an I/O peripheral controlled by a non-critical domain can be a means for generating additional DRAM-related interference to critical domains. Similar issues also arise for hardware accelerators accessing shared DRAMs.
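The following minimal benchmark sketches how this kind of interference can be observed empirically on a Linux-based platform: it times a pointer-chasing walk over a buffer larger than the private caches, first with the other cores idle and then while a memory-intensive task runs on another core; the resulting slowdown is a rough measure of the contention channels listed above. The buffer size and iteration count are arbitrary assumptions and must be tuned to the platform under test.

```c
/* Minimal contention probe: time dependent loads over a large buffer.
 * Run it alone, then alongside a memory-hungry co-runner on another core,
 * and compare the reported average load latencies. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_WORDS (8 * 1024 * 1024 / sizeof(size_t)) /* 8 MiB: larger than a typical L2 */
#define ITERS     10000000L

int main(void)
{
    size_t *buf = malloc(BUF_WORDS * sizeof(*buf));
    if (!buf)
        return 1;

    /* Build a single pseudo-random cycle (Sattolo's algorithm) so that each
     * load depends on the previous one and the prefetcher is mostly defeated. */
    for (size_t i = 0; i < BUF_WORDS; i++)
        buf[i] = i;
    for (size_t i = BUF_WORDS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;  /* j < i guarantees one big cycle */
        size_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t idx = 0;
    for (long i = 0; i < ITERS; i++)
        idx = buf[idx];                 /* dependent loads: latency-bound walk */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns\n", ns / ITERS);
    free(buf);
    return 0;
}
```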
Figure 1: Contention (a) at shared caches and (b) at the memory controller.
Figure 2: Contention (a) of DRAM banks and (b) due to I/O-related memory accesses.

Do not overlook inter-domain security issues

When designing a multi-domain system, the threats originating from the execution of software domains with different safety and security levels are certainly not limited to resource contention. Another relevant aspect to consider is the capability of a system to thwart possible attacks launched from non-critical domains against critical ones, e.g., attacks that take control of them remotely or aim at stealing data. The software running within non-critical domains simply cannot be trusted: due to the large codebase of non-critical software (think of a rich set of Linux-based applications), exploitable vulnerabilities are extremely likely to be present, even in software modules that are deemed harmless and irrelevant. The exposure of non-critical domains to external connectivity could make these vulnerabilities remotely exploitable. For instance, a vulnerability in an image-processing library could be used as the starting point for a complex multi-stage attack that bypasses some countermeasures and then leverages another vulnerability (possibly only locally exploitable) to gain control of the non-critical domain with high privileges. At this point, the attack could proceed toward the critical domain, provided that the isolation boundary established by the Hypervisor is bypassed. This can, unfortunately, happen for several reasons, including the presence of vulnerabilities in the Hypervisor itself, a wrong configuration of the Hypervisor, malicious exploitation of peripheral devices under the control of the non-critical domain, and the usage of side channels.

The chip industry is, fortunately, looking at these issues

The good news is that solutions for all these issues are emerging. The need for strong isolation has evidently been perceived by Arm (notably one of the major players in processor architectures), which recently released Memory System Resource Partitioning And Monitoring (MPAM), an extension for ARMv8-A architectures that deals with the partitioning of caches and the control of memory traffic. MPAM-enabled chips allow partitioning shared caches so that critical domains can have private cache partitions, hence avoiding cache contention by construction. They also allow prioritizing the memory transactions directed to the memory controller or reserving a certain portion of the memory bandwidth, hence bounding memory-related interference.
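As a rough idea of how a hypervisor could use these mechanisms, the sketch below partitions a shared cache between two domains and caps the memory bandwidth of the non-critical one. The register names (MPAMCFG_PART_SEL, MPAMCFG_CPBM, MPAMCFG_MBW_MAX, MPAM2_EL2) come from the Arm MPAM specification, but the base address, register offsets, and field encodings used here are placeholders: the actual values are implementation-specific and must be taken from the SoC documentation.

```c
/* Hedged sketch of MPAM configuration by a hypervisor. Register names are
 * from the Arm MPAM specification; the base address, offsets, and values
 * below are placeholders to be replaced with the SoC's actual ones. */
#include <stdint.h>

#define MSC_BASE         0xA0000000UL  /* placeholder: cache MSC base address */
#define MPAMCFG_PART_SEL 0x0100        /* placeholder register offsets */
#define MPAMCFG_CPBM     0x1000
#define MPAMCFG_MBW_MAX  0x0208

static inline void msc_write32(uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(MSC_BASE + off) = val;
}

void configure_partitions(void)
{
    /* PARTID 1 (critical domain): cache portions 0-7, reserved exclusively. */
    msc_write32(MPAMCFG_PART_SEL, 1);
    msc_write32(MPAMCFG_CPBM, 0x00FF);

    /* PARTID 2 (non-critical domain): remaining portions, bandwidth capped. */
    msc_write32(MPAMCFG_PART_SEL, 2);
    msc_write32(MPAMCFG_CPBM, 0xFF00);
    msc_write32(MPAMCFG_MBW_MAX, 0x4000); /* fixed-point bandwidth fraction, ~25% */
}

/* Each time the hypervisor schedules a domain's vCPU, it would also install
 * that domain's PARTID in MPAM2_EL2, so that all memory transactions issued
 * by the core are labeled accordingly (AArch64 system-register write). */
static inline void set_partid_el2(uint64_t partid)
{
    __asm__ volatile("msr MPAM2_EL2, %0" :: "r"(partid));
}
```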

Although MPAM will likely become the reference solution for providing strong isolation, at the time of writing it still presents some relevant issues. First, the number of chips on the market with the MPAM extension is, to date, very limited. There are many systems that have just been deployed, or that will be deployed soon, that are not based on MPAM-enabled chips. Furthermore, even the newest chips that just entered the market, such as the Xilinx Versal ACAP, do not include the MPAM extension. As these chips are going to stay on the market for a long time, in the near future there will most likely be many systems deployed in disparate fields that do not feature this technology. Second, the current MPAM specification of the mechanisms that control memory traffic is still too vague, leaving a lot of room for interpretation by the vendor in charge of implementing it. This may make it difficult to bound the memory-related interference generated by non-critical domains, because the behavior of implementations of the MPAM policies is not well defined. Third, MPAM does not provide any mechanism to explicitly regulate the usage of DRAM banks, as it is concerned with on-chip resources only.

Regarding I/O-related memory contention, several commercial platforms already include QoS-400 regulators, hardware components designed by Arm that are placed along the bus path between the DMA modules of I/O peripherals and the memory controller. Fortunately, these regulators proved to be effective in predictably controlling the memory traffic issued by I/O peripherals, hence allowing designers to bound the interference such peripherals can generate on critical domains.

Security issues have received even more attention from the chip industry, as witnessed by many developments during the last decade, probably too many to name here. Most relevant to the security threats for multi-domain systems mentioned above are the efforts spent by Arm in developing the Arm Trusted Firmware (ATF), the Pointer Authentication Code (PAC) processor extension, and the TrustZone technology.

The role of software solutions

Software solutions implemented at the Hypervisor level can play a crucial role in providing strong isolation and advanced security capabilities. On one hand, they can ensure isolation and security either by compensating for the lack of certain hardware mechanisms (e.g., MPAM) or by smartly leveraging existing hardware features to reach the desired goal. For instance, hypervisor-controlled cache coloring and memory bandwidth reservation, if properly implemented and configured, can mitigate the problems of contention at shared caches and at the memory controller by leveraging double-stage memory management units (MMUs) and on-chip performance counters, two hardware components that were not explicitly conceived to solve those issues. Still leveraging the double-stage MMU and the PAC extension, it is possible to provide control-flow integrity and address-space layout randomization for the Hypervisor, hence strengthening the security capabilities of the trusted computing base of multi-domain systems.
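Cache coloring itself rests on a simple observation: in a physically indexed cache, some of the address bits that select the cache set lie above the page offset, so by constraining which physical pages back a domain's second-stage translation tables, the hypervisor constrains which cache sets that domain can occupy. The following minimal sketch computes the color of a physical page; the cache geometry used here is an assumption and must be replaced with the parameters of the target platform.

```c
/* Minimal sketch of the page-color computation behind cache coloring.
 * The cache geometry below is assumed for illustration only. */
#include <stdint.h>

#define PAGE_SHIFT 12                     /* 4 KiB pages */
#define CACHE_SIZE (2 * 1024 * 1024)      /* assumed 2 MiB shared L2 */
#define CACHE_WAYS 16
#define CACHE_LINE 64

/* sets = size / (ways * line); the color is given by the set-index bits
 * that lie above the page offset. */
#define CACHE_SETS (CACHE_SIZE / (CACHE_WAYS * CACHE_LINE))  /* 2048 sets  */
#define NUM_COLORS ((CACHE_SETS * CACHE_LINE) >> PAGE_SHIFT) /* 32 colors  */

/* Color of the physical page containing phys_addr. */
static inline unsigned page_color(uint64_t phys_addr)
{
    return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1);
}

/* First physical page of a given color at or after `from`: one simple way
 * a colored allocator can step through memory when building second-stage
 * translation tables for a domain. */
static inline uint64_t next_page_of_color(uint64_t from, unsigned color)
{
    uint64_t page = from >> PAGE_SHIFT;
    uint64_t base = page & ~(uint64_t)(NUM_COLORS - 1);
    uint64_t cand = base | color;
    return (cand >= page ? cand : cand + NUM_COLORS) << PAGE_SHIFT;
}
```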

On the other hand, software solutions can serve the important purpose of properly configuring the available hardware mechanisms, dealing with the complexity of modern platforms by means of automatic algorithms instead of leaving the burden to system designers, who are likely to make mistakes and find only sub-optimal configurations. Indeed, addressing the design optimization of multi-domain systems, exploring different tradeoffs, and matching the requirements of designers with the available hardware and software mechanisms is probably one of the major challenges of today’s heterogeneous platforms. To name some examples, the usage of cache coloring implies facing tradeoffs between isolation capabilities and cache and memory waste. The usage of memory bandwidth reservation algorithms alongside hardware regulators calls for the configuration of many parameters, which may also have a conflicting impact on multiple performance indexes. Finally, all such isolation mechanisms must co-exist with security mechanisms, ensuring that security capabilities are not compromised by the configurations employed for achieving strong isolation, and vice versa.
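To give a feeling for the parameters involved, the sketch below outlines the control loop behind software memory-bandwidth reservation in the style of MemGuard-like mechanisms: each core receives a per-period budget of memory transactions, a performance counter (e.g., counting last-level-cache refills) is programmed to overflow when the budget is exhausted, and the overflow handler throttles the core until the next replenishment. The period and budgets are arbitrary example values, and all functions prefixed with pmu_ and core_ are hypothetical placeholders for platform-specific code.

```c
/* Sketch of a MemGuard-style bandwidth-reservation loop. The pmu_/core_
 * functions are hypothetical placeholders for platform-specific code. */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform hooks (not a real API): */
extern void pmu_set_counter(int core, int64_t value);
extern void core_resume(int core);
extern void core_idle_until_next_period(int core);

#define PERIOD_US 1000                   /* regulation period: 1 ms */
static uint64_t budget_per_core[4] = {   /* memory transactions per period */
    1000000, 50000, 50000, 50000         /* e.g., core 0 runs the critical domain */
};
static bool throttled[4];

/* Called by a periodic timer: replenish budgets and resume throttled cores. */
void on_period_boundary(void)
{
    for (int c = 0; c < 4; c++) {
        /* Preload the counter with a negative value so that it overflows
         * (and raises an interrupt) after exactly `budget` events. */
        pmu_set_counter(c, -(int64_t)budget_per_core[c]);
        if (throttled[c]) {
            throttled[c] = false;
            core_resume(c);
        }
    }
}

/* Called when the PMU counter of core `c` overflows: budget exhausted. */
void on_pmu_overflow(int c)
{
    throttled[c] = true;
    core_idle_until_next_period(c);      /* e.g., park the core on a WFI loop */
}
```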

Building upon a string of research results and projects, Accelerat, a startup company, developed the CLARE software stack to address all these issues in a holistic fashion. CLARE is a hypervisor-centric software stack that simplifies the development of next-generation cyber-physical systems using heterogeneous computing platforms and offers a ready-to-use environment for deploying mixed-criticality applications (see Figure 3).

CLARE relies on its own type-1 hypervisor (CLARE-Hypervisor), which integrates cutting-edge safety, security, and real-time resource management mechanisms to offer strong isolation at all levels of the platform and advanced defenses against cyber-attacks. A middleware layer (CLARE-Middleware) is available for different operating systems to access the services offered by the stack and ensure proper levels of isolation and security. CLARE also comes with a powerful platform-aware toolkit (CLARE-Toolkit) that assembles a wide set of tools to automatically optimize the deployment of complex applications and configure the entire software stack. CLARE-Toolkit is accompanied by an intuitive graphical user interface with multiple perspectives for different user personas.

Figure 3: Overview of the workflow for developing mixed-criticality applications with CLARE.

Authors

Alessandro Biondi is a tenure-track Assistant Professor of Computer Engineering at the Scuola Superiore Sant’Anna of Pisa (Italy), where he works at the Real-Time Systems (ReTiS) Laboratory. In 2016, he was a visiting scholar at the Max Planck Institute for Software Systems (Germany). His main research interests include the design and implementation of real-time, safe, and secure cyber-physical systems, operating systems and hypervisors, synchronization protocols, design optimization for embedded systems, and safe and secure machine learning.

Disclaimer

Any views or opinions represented in this blog are personal, belong solely to the blog post authors, and do not represent those of ACM SIGBED or its parent organization, ACM.