Real-Time Edge Intelligence
This article argues that a key new frontier for the real-time systems research community lies in developing the architectural and algorithmic foundations of real-time artificial intelligence. As always, by “real-time” we do not mean fast (or “streaming”), but rather “with a capability to respond predictably to different urgency (and criticality) requirements”. A key challenge in modern AI is perception. Machine perception must create the right abstractions of the physical world so that other modules can perform their planning and reasoning. These abstractions usually live in a simplified, lower-dimensional space compared to the original stimulus. The reduction from the original environmental measurements to the abstraction is therefore where most of the computation is spent; downstream reasoning in the reduced space of the abstraction is comparatively fast. We henceforth refer to the module performing the reduction generically as the perception subsystem. Significant advances in machine intelligence have enhanced the functionality of the perception subsystem over the years. At the same time, the inputs processed by machine perception have become increasingly voluminous, from sound to pictures, to video, to streaming 3D LiDAR observations. This evolution marks the beginning of a real-time need!
Perception creates the required abstractions for downstream planning and reasoning, but not everything in the physical world is equally important to the plan. Consider an autonomous vehicle, for example. Some objects in the background will not impact the choice of route, but others will. Some objects are closer and may be more important to watch than others because they call for shorter reaction times. The idea that different elements of the environment call for different response times suggests the existence of a real-time resource management problem. However, this is not the only reason that real-time research is needed. The other reason has to do with computational resource availability: as AI is deployed more widely, mounting price pressure will squeeze computational resources into a smaller footprint. More specifically, as AI becomes more commoditized, everyday common objects will increasingly be endowed with machine intelligence capabilities. From an economic perspective, “commoditized” translates to “mass produced”, and mass produced means price-sensitive. For example, in 2020, Toyota sold 9.5 million cars worldwide, and GM sold 6.8 million cars. At this scale, saving $100 per item (for example, by using a smaller GPU for AI processing) means nearly a billion dollars in total cost savings (9.5 million × $100 = $950 million in Toyota’s case) and thus a corresponding increase in profit. The implications are profound. Commoditization (and therefore mass production) of AI will impose significant price pressure to get away with the smallest viable computational platforms for supporting on-board computation. This is precisely the condition where real-time solutions thrive. After all, if resources are abundant enough that no “queueing” ever happens, it does not matter what the resource management policy is. It is only when resources are relatively scarce that managing them in a manner responsive to latency requirements becomes important. Such is the case with resources supporting AI in an exploding variety of modern applications, from autonomous cars to delivery drones and future home robotics and assistants.
To support real-time AI, inspiration comes from biological evolution. Interestingly, evolution endowed us with a great (cognitive) resource management engine! When interacting with a complex environment, we use a set of built-in cues to decide where to focus. For example, when watching a movie on the large screen of an immersive IMAX theater, we do not pay the same attention to each pixel that the projector outputs on that screen. Rather, we “follow the action” by some intuitive heuristic definition of what action means. The same applies when driving. We might completely ignore some elements of the scene but focus on others in a manner that is context-sensitive and driven by perceptions of risk. Our ability to focus on what is important allows us to expend limited cognitive resources where they matter most. Yet that is not how current AI systems operate. Instead, a convolutional neural network, for example, will process all pixels of an input video frame at the same priority. Intuitively, this is equivalent to inspecting each IMAX pixel with the same level of attention. As a result, significant resources are wasted on processing useless (parts of) inputs. This is a problem that real-time research can help rectify to enable more cost-effective AI solutions.
There are several aspects to the real-time AI resource management problem. First, how might one decide what to focus machine attention on? The prioritization of input stimuli is a policy question that might be quite application-specific. In the case of autonomous cars, one might wish to enlist the help of a depth sensor and pay more attention to closer pixels or perhaps more quickly approaching ones. In other contexts, such as physical intrusion detection, one might choose to use motion near the perimeter as a cue to direct attention. Assuming an adequate policy for estimating the importance of different parts of a scene, next comes the challenge of developing the prioritization mechanism. Ideally, one might want to rewrite AI libraries such as TensorFlow and PyTorch in a manner that implements a notion of preferential treatment at the algorithmic level. Short of doing so, one may need to break down the input into smaller parts and schedule them for processing in the right priority and batching order. A recent paper (in RTSS 2020) espoused this approach. The problem is further complicated by the proliferation of different performance accelerators (e.g., GPUs and TPUs) that present additional idiosyncrasies to be accounted for in resource allocation.
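To make the “break down the input and schedule the parts” idea concrete, here is a minimal Python sketch. It is not the algorithm from the RTSS 2020 paper; it simply assumes that a frame has already been cut into tiles, that a depth sensor supplies the closest reading inside each tile, and that closer means more urgent. The names Tile, priority_order, and make_batches are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tile:
    """One rectangular part of an input frame (hypothetical structure)."""
    tile_id: int
    min_depth_m: float  # closest depth reading inside the tile, in meters


def priority_order(tiles: List[Tile]) -> List[Tile]:
    """Assumed policy: closer tiles are more urgent, so they come first.

    sorted() is stable, so tiles at equal depth keep their input order.
    """
    return sorted(tiles, key=lambda t: t.min_depth_m)


def make_batches(tiles: List[Tile], batch_size: int) -> List[List[Tile]]:
    """Group tiles into accelerator-sized batches, most urgent batch first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = priority_order(tiles)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Five tiles with made-up depth readings.
tiles = [Tile(0, 42.0), Tile(1, 3.5), Tile(2, 18.0), Tile(3, 7.2), Tile(4, 60.0)]
batches = make_batches(tiles, batch_size=2)
batch_ids = [[t.tile_id for t in batch] for batch in batches]
print(batch_ids)  # [[1, 3], [2, 0], [4]]
```

A real system would have to trade batch size against urgency: larger batches use an accelerator more efficiently, but they also delay the first result for the most urgent tile.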
A related challenge is that current AI algorithms suffer from limited preemptibility. These algorithms are usually executed outside the CPU (e.g., on a GPU). Moving data and computation in and out of a GPU is expensive. Thus, “context switching” (to support preemption) has a non-trivial overhead. One possibility might be to develop “anytime” AI inference architectures that offer partial utility from partial execution. Such an architecture would allow early termination of lower priority processing in favor of higher priority tasks, thereby reducing interference imposed by the lower-priority classes on the higher-priority ones, while at the same time limiting preemption-induced context-switching overhead. In general, performance isolation between different classes is an important concern.
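As a rough illustration of what “partial utility from partial execution” could look like, the sketch below models inference as a sequence of refinement stages and stops as soon as the next stage no longer fits the remaining budget. Everything here is an assumption made for illustration: the stages stand in for, say, successive groups of network layers with an output head after each group, costs are abstract time units rather than measured GPU time, and anytime_infer is a hypothetical name.

```python
from typing import Callable, List, Tuple

# Each stage is (cost, refine): `cost` is in abstract time units and `refine`
# maps the current estimate to a better one. Both are hypothetical stand-ins.
Stage = Tuple[int, Callable[[float], float]]


def anytime_infer(stages: List[Stage], initial: float, budget: int) -> Tuple[float, int]:
    """Run stages in order until the next one would exceed the budget.

    Returns the best estimate so far and the number of stages completed,
    so a caller can stop low-priority work early and still keep a result.
    """
    estimate, completed = initial, 0
    for cost, refine in stages:
        if cost > budget:
            break  # early termination: keep the partial result
        estimate = refine(estimate)
        budget -= cost
        completed += 1
    return estimate, completed


def halve_error(estimate: float) -> float:
    """Toy refinement: move halfway toward a 'true' answer of 1.0."""
    return estimate + (1.0 - estimate) / 2


stages = [(2, halve_error), (3, halve_error), (4, halve_error)]
print(anytime_infer(stages, 0.0, budget=9))  # (0.875, 3): all stages fit
print(anytime_infer(stages, 0.0, budget=5))  # (0.75, 2): cut short, still useful
```

Because termination happens only at stage boundaries, no intermediate state needs to be moved off the accelerator, which is the property that would keep preemption overhead low in such a design.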
Finally, one may consider the implementation of different levels of service that are commensurate with the criticality of content. For example, less important parts of a scene need not be inspected and identified with the same confidence as more important ones. Solutions to such performance differentiation within an AI subsystem (e.g., within a perception module) are yet to be developed.
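One way to picture such differentiated service is a table that maps each criticality class to the confidence it requires, and then to the least amount of processing that reaches that confidence. The sketch below is purely illustrative: the class names, the confidence figures, and the function stages_needed are all hypothetical, and a real perception module would have to measure its confidence per input rather than read it from a fixed list.

```python
from typing import Dict, List

# Hypothetical confidence reached after running 1, 2, 3, or 4 processing stages.
STAGE_CONFIDENCE: List[float] = [0.60, 0.80, 0.90, 0.97]

# Hypothetical service levels: the confidence each criticality class requires.
REQUIRED_CONFIDENCE: Dict[str, float] = {"low": 0.50, "medium": 0.80, "high": 0.95}


def stages_needed(criticality: str) -> int:
    """Smallest number of stages whose confidence meets the class requirement.

    Falls back to running every stage (best effort) if none is sufficient.
    """
    target = REQUIRED_CONFIDENCE[criticality]
    for count, confidence in enumerate(STAGE_CONFIDENCE, start=1):
        if confidence >= target:
            return count
    return len(STAGE_CONFIDENCE)


plan = {level: stages_needed(level) for level in ("low", "medium", "high")}
print(plan)  # {'low': 1, 'medium': 2, 'high': 4}
```

In this toy setting, a background region would receive a quarter of the processing given to a critical one, which is the kind of saving that performance differentiation is meant to deliver.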
To conclude, the current state of the art in designing AI components, such as neural network libraries, is reminiscent of what used to be called the cyclic executive in early operating systems literature. Cyclic executives, in contrast to priority-based real-time scheduling, processed all pieces of incoming computation at the same priority and quality. Similarly, given incoming data frames (e.g., multi-color images or 3D LiDAR point clouds), modern neural network algorithms process all data rows and columns at the same priority and quality, with no regard to cues from the physical environment that affect the time constraints and criticality of different parts of the scene. The real-time research community is in the best position to change this status quo.
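The contrast can be seen in a toy calculation. The sketch below processes three parts of a scene back to back on a single non-preemptive processor, once in plain scan order and once ordered by deadline. The item names, costs, and deadlines are made up for illustration, and the helper functions are hypothetical.

```python
from typing import Dict, List


def completion_times(cost: Dict[str, int], order: List[str]) -> Dict[str, int]:
    """Finish time of each item when processed back to back, without preemption."""
    clock = 0
    finished = {}
    for name in order:
        clock += cost[name]
        finished[name] = clock
    return finished


def missed(finished: Dict[str, int], deadline: Dict[str, int]) -> List[str]:
    """Names of items that finish after their deadline."""
    return sorted(name for name, t in finished.items() if t > deadline[name])


# Made-up processing costs and deadlines, in abstract time units.
cost = {"background": 4, "far_car": 3, "near_pedestrian": 2}
deadline = {"background": 20, "far_car": 10, "near_pedestrian": 3}

scan_order = ["background", "far_car", "near_pedestrian"]   # same priority for all
deadline_order = sorted(cost, key=lambda name: deadline[name])  # most urgent first

uniform = completion_times(cost, scan_order)
prioritized = completion_times(cost, deadline_order)
print(missed(uniform, deadline))      # ['near_pedestrian'] (finishes at 9, due at 3)
print(missed(prioritized, deadline))  # []
```

The total work is identical in both runs; only the order changes, and with it whether the most urgent part of the scene is handled in time.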
Acknowledgement: This article is a distillation of challenges introduced in the RTSS 2020 paper: “On Removing Algorithmic Priority Inversion from Mission-critical Machine Inference Pipelines” by Shengzhong Liu, Shuochao Yao, Xinzhe Fu, Rohan Tabish, Simon Yu, Ayoosh Bansal, Heechul Yun, Lui Sha and Tarek Abdelzaher. (Best Paper Award)
Author bio: Tarek Abdelzaher received his Ph.D. in Computer Science from the University of Michigan in 1999. He is currently a Professor and Willett Faculty Scholar at the Department of Computer Science, the University of Illinois at Urbana-Champaign. He has authored/coauthored more than 300 refereed publications in real-time computing, distributed systems, sensor networks, and control. He served as an Editor-in-Chief of the Journal of Real-Time Systems, and has served as Associate Editor of the IEEE Transactions on Mobile Computing, IEEE Transactions on Parallel and Distributed Systems, IEEE Embedded Systems Letters, the ACM Transactions on Sensor Networks, ACM Transactions on Internet Technology, ACM Transactions on Internet of Things, and the Ad Hoc Networks Journal. He chaired (as Program or General Chair) several conferences in his area including RTAS, RTSS, IPSN, SenSys, Infocom, MASS, SECON, DCoSS, ICDCS, and ICAC. Abdelzaher’s research interests lie broadly in understanding and influencing performance and temporal properties of networked embedded, social and software systems in the face of increasing complexity, distribution, and degree of interaction with an external physical environment. Tarek Abdelzaher is a recipient of the IEEE Outstanding Technical Achievement and Leadership Award in Real-time Systems (2012), the Xerox Award for Faculty Research (2011), as well as over ten best paper awards. He is a fellow of IEEE and ACM.
Disclaimer: Any views or opinions represented in this blog are personal, belong solely to the blog post authors and do not represent those of ACM SIGBED or its parent organization, ACM.