Lightweight Instrumentation for Accurate Performance Monitoring in RTOSes

Discussions about the accuracy of Performance Monitoring Counters (PMCs) are common in forums. These hardware-implemented registers track specific events within a processor’s microarchitecture, providing metrics for performance tuning or monitoring unusual activity. However, the accuracy of these counters can be compromised by system noise, other applications, the operating system, and asynchronous hardware events, especially in complex benchmarks on Intel and AMD cores. Tools like PAPI (Performance Application Programming Interface) access these counters while minimizing noise by using Operating System (OS) kernel extensions (such as Linux’s perf utility) to safeguard counter values during context switches and isolate performance data from other system activities.

In embedded systems, standard tools like PAPI or OS tools like perf are often unavailable. However, the hardware is usually much simpler and has fewer hidden behaviors, such as micro-ops and other sources of confusion for measurements. Therefore, this blog post will demonstrate how to achieve OS-level instrumentation in an embedded system using the RISC-V ISA as an example. We will provide necessary context abstractions, ensure low implementation overheads, and control application access to performance counters and their values.

Overview of this post

We will cover:

  1. The challenge of accurate performance monitoring in multi-context systems.
  2. How to let the OS handle performance measurements.
  3. Improvements in instrumentation for embedded systems.
  4. The inherent limitations of performance counter instrumentation.

The problem with multiple contexts

Performance Monitoring Counters (PMCs) are an invaluable tool for profiling and optimizing code in embedded systems. They provide a mechanism to count specific events such as instructions retired, clock cycles, and other micro-architectural events. In a straightforward usage scenario, you might initialize and read these counters as shown in the following code snippets:

[Code listings: C code implementation and the generated RISC-V assembly]

In this example, csr_write and csr_read are functions that interact with the Control and Status Registers (CSRs) to reset and read the performance counters, respectively. The assembly code that is generated is shown as well, to demonstrate what will actually be measured by the counters. When we read the counters, we are sampling the measurements. This simple approach works well in a single-context environment where no other code interrupts the execution of compute_something().
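As a rough illustration of this usage pattern, the following is a minimal sketch and not the original listing. The names csr_write, csr_read and compute_something come from the text above, but their definitions here are assumptions, as are measure, pmc_sample_t and SIM_TICK. On a RISC-V target the macros access the machine-mode mcycle and minstret CSRs (only the low 32 bits on RV32); on any other host a simulated counter stands in so the sketch can still be compiled and tried.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__riscv)
/* Assumed definitions: direct CSR access through inline assembly. */
#define csr_write(csr, val) \
    __asm__ volatile("csrw " #csr ", %0" : : "r"((unsigned long)(val)))
#define csr_read(csr) \
    ({ unsigned long v_; __asm__ volatile("csrr %0, " #csr : "=r"(v_)); v_; })
#define SIM_TICK(n) ((void)0)
#else
/* Host fallback (assumption): fake counters that advance once per loop step. */
static unsigned long sim_mcycle, sim_minstret;
#define csr_write(csr, val) ((void)(sim_##csr = (unsigned long)(val)))
#define csr_read(csr) (sim_##csr)
#define SIM_TICK(n) (sim_mcycle += (n), sim_minstret += (n))
#endif

/* One sample of the two counters used in this sketch. */
typedef struct {
    unsigned long cycles;
    unsigned long instret;
} pmc_sample_t;

/* Placeholder workload: sum of squares of 0..n-1. */
static uint32_t compute_something(uint32_t n) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += i * i;
        SIM_TICK(1); /* no-op on real hardware */
    }
    return acc;
}

/* Reset the counters, run the workload, then sample the counters. */
static pmc_sample_t measure(uint32_t n, uint32_t *result) {
    pmc_sample_t s;
    assert(result != 0);
    csr_write(mcycle, 0);
    csr_write(minstret, 0);
    *result = compute_something(n);
    s.cycles = csr_read(mcycle);
    s.instret = csr_read(minstret);
    return s;
}
```

On real hardware the sampled values also include the instructions of the counter reads themselves, which is one reason to inspect the generated assembly.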

However, embedded systems often run multiple tasks or processes and service interrupts from multiple sources, all of which lead to context switches. During a context switch, the state of the current task (including register values) is saved, and the state of the next task is restored. The performance counters, however, keep counting: if a context switch occurs during the execution of compute_something(), events from the other context are added to the measurement, making it inaccurate. Thus, the counters must be made aware of context switches, and the most efficient way to do that is to let the OS handle the measurements.

Letting the OS handle it

To mitigate the problem of context switches affecting performance counter accuracy, we need to save and restore the state of the performance counters during context switches, just as we do with general-purpose registers. This requires modifying the context switch code in the operating system (OS) or bare-metal scheduler to include the performance counters. This is essentially what Linux does with the perf_event system.
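The save-and-restore idea can be sketched in plain C as follows. This is a host-side simulation under stated assumptions, not FreeRTOS code: hw_instret stands in for a hardware counter such as minstret, and task_t, pmc_read, pmc_write, context_switch and run are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated hardware counter (assumption: stands in for e.g. minstret). */
static uint64_t hw_instret;

static uint64_t pmc_read(void) { return hw_instret; }
static void pmc_write(uint64_t v) { hw_instret = v; }

/* Hypothetical task control block holding the task's own counter value. */
typedef struct {
    const char *name;
    uint64_t saved_instret;
} task_t;

static task_t *current;

/* Save the outgoing task's counter, then restore the incoming task's one,
 * in the same way general-purpose registers are handled. */
static void context_switch(task_t *next) {
    assert(next != 0);
    if (current != 0) {
        current->saved_instret = pmc_read();
    }
    pmc_write(next->saved_instret);
    current = next;
}

/* Simulates the running task retiring n instructions. */
static void run(uint64_t n) { hw_instret += n; }
```

With two tasks A and B that run 100, 40 and 5 instructions in the order A, B, A, task A ends with 105 and task B with 40, so neither task sees the other's events. In this simulation the switch itself costs nothing; on real hardware the switch code also retires instructions and consumes cycles while the counters are live.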

To do this, the OS needs to sample the counters during every context switch. If sampling the counters involves any form of syscall to the kernel, the measurements can also be affected by the tool itself. This is often referred to as the probe effect, where the act of measuring alters the behavior of the system being measured. The following image shows the effects of adding the sampling to the OS context switch. Regardless of the source (exception or interrupt), there are overheads in the measurements. These overheads stem from the system calls and other code executed between the user code and the point where the counters are sampled.

The figure shows the actual metrics in gray, representing the metrics for the instrumented code if there were no interruptions from other contexts. It also shows the overheads introduced when the switch is triggered by an exception (e.g., a syscall) or by a timer interrupt, split into two types: Program-Specific (PS) overheads (in orange) and Core-State (CS) overheads (in red). PS overheads are those easily attributed to the executing code (e.g., number of instructions retired), while CS overheads are those that arise from complex interactions at the microarchitecture level. We return to this distinction at the end, where we show the limitations of performance counter instrumentation.

Overheads on PS-related events (a) and CSD-related events (b) from OS interference

The larger and more complicated the system calls and the kernel, the bigger these overheads are. Fortunately for us, RTOSes tend to be simpler and focused on limiting their overheads. So, we can integrate the counter measurements in a very accurate way.

Improving instrumentation for embedded systems

We can look at a simplified version of the FreeRTOS context switch code for the RISC-V ISA to see where we should add our instrumentation. This piece of code (when compiled with the missing macros) handles all context switches for FreeRTOS and serves as the entry point for any exceptions and interrupts in the system.

This is all the code we need to guarantee that the performance counters are affected as little as possible by the switch:

We can also save the counters to the OS stack, so we can instrument multiple contexts without giving user code direct access to the counters. Such direct access has security implications that are often overlooked. The final code would look something like this:
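One way to withhold direct counter access from user code on RISC-V is the mcounteren CSR defined in the privileged specification: when a bit is clear, a read of the corresponding counter (cycle, time, instret or hpmcounterN) from a less privileged mode raises an illegal-instruction exception. The sketch below only models that bitmask. It assumes a core that implements user mode and mcounteren, and the helper names (mcounteren_deny_all, mcounteren_allow, user_read_traps, mcounteren_apply) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in mcounteren, per the RISC-V privileged specification. */
#define MCOUNTEREN_CY (1u << 0)        /* cycle */
#define MCOUNTEREN_TM (1u << 1)        /* time */
#define MCOUNTEREN_IR (1u << 2)        /* instret */
#define MCOUNTEREN_HPM(n) (1u << (n))  /* hpmcounter n, for n = 3..31 */

/* Start from a mask that exposes no counter to lower privilege modes. */
static uint32_t mcounteren_deny_all(void) { return 0u; }

/* Expose selected counters by setting their bits in the mask. */
static uint32_t mcounteren_allow(uint32_t mask, uint32_t bits) {
    return mask | bits;
}

/* Model of the hardware check: a user-mode read of counter idx traps
 * unless the matching mcounteren bit is set. */
static bool user_read_traps(uint32_t mcounteren, unsigned idx) {
    assert(idx < 32);
    return ((mcounteren >> idx) & 1u) == 0u;
}

#if defined(__riscv)
/* On a real target the mask would be written from machine mode. */
static void mcounteren_apply(uint32_t mask) {
    __asm__ volatile("csrw mcounteren, %0" : : "r"((unsigned long)mask));
}
#endif
```

With every bit cleared, user code can only obtain counter values that the OS chooses to hand out, for example the per-task values it saved at the last context switch.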

Here, the OS tick becomes a natural instrumentation point that occurs regardless of user interaction. The same holds for system calls. If we can quantify the overheads described before and subtract them from each sample, we could, in theory, eliminate the probe effect and leave no trace of context switches in the measurements.

We measured the overheads for the instrumented context switch on a SiFive FE310 chip. Since the code is small, we can verify what the probe values would be manually. The results for all the relevant performance events are shown in the following table:

The measurements were repeated 10,000 times, and for the available counters, the overheads were constant, with no variation. This means we can systematically subtract them from each sample. However, this is not true for all counters nor for all systems. This is where things get complicated.
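When the overhead of an event is constant per instrumented switch, removing it from a sample is a single subtraction. The sketch below assumes the OS tracks how many instrumented switches contributed to a task's accumulated value. The pmc_record_t type, the pmc_corrected function and any overhead value passed to it are hypothetical and are not taken from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task record kept by the OS. */
typedef struct {
    uint64_t raw;      /* accumulated counter value, probe overhead included */
    uint32_t switches; /* instrumented context switches folded into raw */
} pmc_record_t;

/* Subtract a constant, previously measured overhead for every switch.
 * Clamps at zero so a mis-calibrated overhead cannot wrap the result. */
static uint64_t pmc_corrected(const pmc_record_t *r,
                              uint64_t overhead_per_switch) {
    assert(r != 0);
    uint64_t total = (uint64_t)r->switches * overhead_per_switch;
    return (r->raw > total) ? r->raw - total : 0;
}
```

For example, a raw count of 1000 gathered across 4 switches with an assumed overhead of 25 events per switch yields a corrected value of 900. This correction only makes sense for events whose overhead is constant.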

Limits to perfect instrumentation

Remember the program-specific and core-state overheads? What the table shows is that we can pretty much nullify the program-specific measurement overheads because they depend only on the state of the program at the specific time of measurement. The core-state overheads, on the other hand, are affected by hardware structures outside the control of the OS, such as the branch predictor or caches, if they are available. This is where attacks such as Spectre come into play. They use the system behavior (training a branch predictor) to cause a predictable effect (a branch miss) in another context (usually the kernel).

The following figures show that these structures leave an effect in the respective performance events that is impossible to remove, since software has no control over the state of these hardware structures and can’t maintain full separation between contexts.

Distribution of branch target miss
Distribution of branch direction miss

Fortunately, this does not affect all the performance counter events. Many counters, especially those that track more direct and less speculative activities (such as the number of instructions retired or cache hits), remain reliable despite these issues. However, the OS cannot fully control the system in which it runs, making perfect software-based instrumentation impossible. This inherent limitation means that some level of measurement inaccuracy will always be present, especially due to complex hardware interactions that software cannot entirely mitigate.


Bruno Endres Forlin is a PhD student, in the special interest group (Dependable Computing Systems, led by Marco Ottavi), within the group of Computer Architecture for Embedded Systems (CAES) at the University of Twente in the Netherlands.

Kuan-Hsun Chen is a tenured assistant professor, in the special interest group (Dependable Computing Systems, led by Marco Ottavi), within the group of Computer Architecture for Embedded Systems (CAES) at the University of Twente in the Netherlands.


The original paper was presented at DATE 2024.

Disclaimer: Any views or opinions represented in this blog are personal, belong solely to the blog post authors and do not represent those of ACM SIGBED or its parent organization, ACM.