Cyber-Physical Systems embody the battle between Self-Awareness and Context-Awareness
I like to think of research on CPS as a battle for the survival of the world as we know it, one being fought on multiple fronts. Much like in a Tolkienian world, the battlefield is lined with individuals, i.e., researchers, each wielding unique abilities: schedulability analysis brandished by Ents, Elves in charge of static WCET analysis, ranks of Hobbits skirmishing with experimental methods, and (bias alert) the systems Wizards coordinating a flanking maneuver. As you observe them waging this war, you begin to wonder what lies on the other side of the battlefield. The enemy’s side is lined with all sorts of monstrosities of different calibers, shapes, and forms: from the Orcs of scheduling anomalies, to the Trolls of over-pessimism, to the Uruk-hai of inter-core temporal interference. But what is the Sauron equivalent that we as a community are up against? In a word: complexity.
Complexity is a necessary evil. It emerges because of the expectations placed on modern cyber-physical systems. But complexity is a vague term; it is simply the result of what we demand our systems to be. That is, context-aware. Simply put, context-awareness refers to a system’s ability to understand and interpret its surroundings and, with that, to make informed actuation decisions. To use a buzzword, smart systems are expected to exhibit sharper context-awareness. It is precisely the desire for better context-awareness that has fueled research into better sensors, more capable and efficient processors, and higher-throughput (on-chip and off-chip) data channels. The holy grail of context-awareness has transformed not only the foundations of our systems but also our applications, as data-heavy AI workloads become commonplace in budget devices and appliances.
As we dream of machines capable of understanding their surroundings and of acting as autonomous “beings,” we suddenly realize that there is at least one more property we demand from our systems: safety. Safety can be decomposed into logical safety, i.e., “our cyborgs always make the right decisions,” and temporal safety, i.e., “our cyborgs always enact their decisions on time.” I will leave the discussion of logical safety to colleagues far more qualified than me on the subject and focus here on temporal safety.
What enables temporal safety? The temporal properties of any application are determined by all the layers of the software/hardware stack. Therefore, how cost-efficiently one can achieve temporal safety depends directly on one’s confidence about the interplay of the different functional blocks across the entire stack. Clearly, it is hard to build confidence in the unknown. Thus, a prerequisite for achieving confidence in the temporal behavior of a system is, once again, some form of awareness. But this time, the awareness we need is about the system itself, and I like to refer to it with the (perhaps overloaded) term self-awareness. Self-awareness in simple systems can be trivially achieved. Consider a simple system, say a not-so-smart LED light-bulb with basic functionality as depicted in Figure 1(a). The light-bulb senses the current battery level, detects the on/off state of the LED, and commands the LED to the ON state if sufficient charge is available. In this case, it is easy to enumerate the possible system states and transitions. It is also straightforward to analyze and predict the interaction between internal hardware components, as well as between software and hardware resources. It follows that a simple system can be self-aware because the necessary knowledge of component-to-component interactions can be grasped by the system designers and appropriately incorporated in the final system design.
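To make this concrete, the entire control logic of such a bulb fits in a handful of lines. The sketch below (in C, with a made-up charge threshold and field names) captures the full state space, namely battery level and LED state, together with the single transition rule:

```c
/* Minimal sketch of the not-so-smart light-bulb of Figure 1(a).
 * The threshold and field names are illustrative, not from any real design;
 * the point is that states and transitions are trivially enumerable. */

#define MIN_CHARGE_PCT 5        /* assumed minimum charge to keep the LED on */

enum led_state { LED_OFF, LED_ON };

struct bulb {
    int battery_pct;            /* sensed battery level, 0-100 */
    enum led_state led;         /* sensed LED on/off state */
};

/* One control step: command the LED ON only if sufficient charge remains. */
enum led_state control_step(const struct bulb *b)
{
    return (b->battery_pct >= MIN_CHARGE_PCT) ? LED_ON : LED_OFF;
}
```

With two state variables and one rule, a designer can exhaustively check every interaction on paper.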
But a truly smart light-bulb, as depicted in Figure 1(b), should take into account the external brightness level; be able to detect motion; perhaps provide interfaces for voice- and touch-based activation; cooperate with other nearby light-bulbs to create a unified scene; or take into account external temperature to better estimate battery discharge or to provide visual temperature feedback to its users. In other words, a truly smart light-bulb is expected to be context-aware. The resulting complexity explosion makes achieving strong self-awareness a challenge. Indeed, in complex and open systems, having static knowledge of the emerging interactions between hardware and software components is hard, if not impossible.
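For contrast, here is an equally hypothetical sketch of the state such a smart bulb would need to track (all names invented for illustration). The reachable state space is now the cross product of every one of these inputs, plus the state of every peer bulb, and enumerating states and transitions by hand stops being an option:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PEERS 16             /* illustrative bound on nearby bulbs */

struct smart_bulb {
    int      battery_pct;        /* internal charge level */
    bool     led_on;             /* current LED state */
    uint16_t ambient_lux;        /* external brightness level */
    bool     motion_detected;    /* motion sensor */
    bool     voice_cmd_pending;  /* voice-based activation front-end */
    bool     touch_event;        /* touch-based activation */
    int8_t   external_temp_c;    /* discharge estimation / visual feedback */
    uint8_t  num_peers;          /* nearby bulbs cooperating on a scene */
    uint32_t peer_scene[MAX_PEERS]; /* last scene state reported by each peer */
};
```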
Reconciling self- and context-awareness might be possible. Of course, when dealing with real-time systems, exact system knowledge is not always necessary as long as one can upper-bound an application’s temporal behavior. Right? Well, yes and no. The issue is that the ever-growing complexity of modern embedded systems makes their temporal characteristics abandon the realm of determinism and approach the doorsteps of chaos theory. Thus, reasoning about the worst case of a chaotic system might lead to over-estimations that grossly misrepresent its typical behavior, much like an overly aggressive rounding-up in a weather simulation can predict a downpour on a sun-kissed day.
From the discussion above, it emerges that there exists a fundamental tension between self-awareness and context-awareness in modern computing systems on which we place an expectation of (temporal) safety. But if complete static knowledge of component-to-component interactions is not possible, it becomes natural to wonder what other avenues are available. Together with the students in my group at Boston University, I set sail a few years ago to tackle precisely this question. Striking at the heart of the problem, component-to-component interactions occur as data exchanges. Hence, being able to observe and manipulate the flow of data between components in a programmatic manner is a promising approach to achieve self-awareness in spite of complexity.
And just about now, the time is ripe for the definition of a new breed of software-shaped platforms, or SOSH platforms for short. At the core of the SOSH paradigm, as depicted in Figure 2, is the idea of exposing direct control over the flow of data exchanged between hardware components in an (embedded) computing platform. Interestingly, SOSH platforms can already be implemented and tested today, leveraging commercially available SoCs that integrate an embedded processing system (PS) and a block of programmable logic (PL) with high-performance PS-PL communication interfaces (HPIs). How? In a nutshell, having PS+PL on the same chip enables rerouting through the PL all, or a selected portion, of the memory traffic originating from a CPU (or an I/O device) and targeting main memory. How exactly that can be achieved is described in detail in [1].

So we are able to re-route memory traffic through a block of programmable logic. But so what? Well, hear me out: we can now achieve a level of programmability over the payload and meta-data (e.g., timing, QoS signaling) of each individual memory transaction in our systems that was previously possible only in custom soft-cores. And we have achieved all of that on commercial off-the-shelf embedded platforms. The implications run very deep. SOSH platforms are empowered by fine-grained monitoring and control capabilities over data exchanges between the “compute” and the “store” side of a system-on-a-chip. Hence, SOSH platforms lay the groundwork for implementing high-integrity safety-critical systems, advanced counter-offensive measures against security threats, and access-pattern-aware data compression and re-organization, to name a few possible research avenues.
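To give a flavor of what “re-routing through the PL” can look like from software, here is a deliberately simplified user-space sketch. It assumes a PL design that exposes a physical address aperture and forwards accesses to DRAM over an HPI; the aperture base address and size below are hypothetical placeholders, and the actual mechanism is the one described in [1].

```c
/* Illustrative-only sketch: map a buffer over a (hypothetical) physical
 * aperture that a PL design forwards to DRAM via an HPI. Every load/store
 * to the buffer then takes the via-PL route (dashed line in Figure 2),
 * where the PL can observe and manipulate payload and meta-data. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PL_APERTURE_BASE 0x400000000ULL   /* hypothetical PL-forwarded window */
#define PL_APERTURE_SIZE (4UL << 20)      /* 4 MiB, for illustration only */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *buf = mmap(NULL, PL_APERTURE_SIZE,
                                  PROT_READ | PROT_WRITE, MAP_SHARED,
                                  fd, PL_APERTURE_BASE);
    if (buf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    buf[0] = 0xdeadbeef;                  /* this store traverses the PL */
    printf("read back: 0x%x\n", buf[0]);  /* ...and so does this load */

    munmap((void *)buf, PL_APERTURE_SIZE);
    close(fd);
    return 0;
}
```

In a real deployment the re-routing would presumably be set up transparently by the OS rather than by each application; the sketch only conveys the key point that traffic flowing through the PL becomes observable and programmable.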
Is there a catch? Yes: overhead. If the memory subsystem is the performance bottleneck in data-heavy applications, interposing a block of FPGA between the CPUs and memory will incur a performance hit, right? Once again, yes and no. This is better explained with numbers, so let us consider the current platform of reference, the Xilinx UltraScale+ MPSoC. In [2] we evaluated the total main memory read bandwidth for CPU applications going directly to memory (unmanaged route, solid line in Figure 2) to be about 5.4 GB/s when all the CPUs are concurrently active. The PL’s maximum frequency is 300 MHz, with multiple available HPIs that are 128 bits wide. So, theoretically, the via-PL route (dashed line in Figure 2) can sustain about 9.6 GB/s even if only two HPIs are used. Thus, on paper, the PL should not represent the bottleneck when memory traffic is re-routed through it. In practice, we have observed platform-specific design issues that prevent reaching those theoretical rates: from channel arbitration, to shallow FIFO queues, to clock-domain crossing delays. Indeed, if we specifically look at the current breed of UltraScale+ platforms, the top bandwidth that can be achieved going through the PL is only around 860 MB/s.
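For reference, the back-of-the-envelope arithmetic behind the 9.6 GB/s figure is simply the PL clock multiplied by the HPI width, summed over two interfaces, which on paper comfortably exceeds the measured 5.4 GB/s of the unmanaged route:

```latex
\[
  BW_{\mathrm{HPI}} = 300\,\mathrm{MHz} \times 128\,\mathrm{bit}
                    = 300 \times 10^{6}\,\mathrm{s^{-1}} \times 16\,\mathrm{B}
                    = 4.8\,\mathrm{GB/s},
  \qquad
  2 \times BW_{\mathrm{HPI}} = 9.6\,\mathrm{GB/s} \;>\; 5.4\,\mathrm{GB/s}.
\]
```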
But PS+PL platforms are by no means a finished product. They are the child prodigies of the computing world and are rapidly improving generation after generation. No need to take my word for it. Apart from the aforementioned UltraScale+ platforms, Xilinx has recently launched the Versal platforms. Intel has also entered the game with its Stratix 10 SoC devices. An honorable mention also goes to the Enzian project developed at ETH Zurich: a massive 48-core system where ultra-high-performance communication between PS and PL is one of the leading design principles. And lastly, a RISC-V-based PS+PL solution sold by Microsemi is now available, namely the PolarFire SoC. In light of this, it is reasonable to envision that the techniques enabling the SOSH paradigm will become as commonplace and efficient as features considered experimental only a few years ago, such as virtualization and non-volatile memory.
Author bio: Renato Mancuso is an assistant professor in the Computer Science Department at Boston University. Renato received his Ph.D. in computer science in 2017 from the University of Illinois at Urbana-Champaign. His research interests are at the intersection of high-performance cyber-physical systems and real-time operating systems. His specific focus is on techniques to perform accurate workload characterization and to enforce strong performance isolation and temporal predictability in multi-core heterogeneous systems. Renato has contributed to more than 30 peer-reviewed publications in the field and was the recipient of multiple research awards. He is a member of the IEEE and the Information Director of the ACM SIGBED. His work is supported by the National Science Foundation and by a number of industry partners that include Red Hat Inc. and Bosch GmbH.
Disclaimer: Any views or opinions represented in this blog are personal, belong solely to the blog post authors and do not represent those of ACM SIGBED or its parent organization, ACM.