[EmSoft’24 best paper] Thread Carefully: Preventing Starvation In The ROS 2 Multi-Threaded Executor

Introduction

From autonomous vehicles navigating city streets to robots managing packages in warehouses, robotics is reshaping industries across the globe. The Robot Operating System 2 (ROS 2) plays a pivotal role in this transformation as a powerful middleware framework that simplifies the development of safe, efficient, and scalable robotic systems. With its intuitive tools for creating seamlessly interconnected components, ROS 2 not only accelerates innovation in fields like autonomous driving and industrial logistics but also serves as a vital platform for advancing robotics research worldwide.

As ROS 2 has evolved over time, its design has largely focused on practical needs such as ease of use and flexibility – qualities that have contributed to its widespread adoption. However, this organic growth has left certain aspects underexplored, particularly when it comes to rigorous theoretical analyses and formal specification of system behavior. While scientific efforts have emerged to analyze ROS 2 systems – especially in the context of real-time communication and scheduling – the complexity of retrofitting formal properties into an established framework remains an ongoing challenge.

This blog tells the story of how we uncovered a subtle yet impactful issue in ROS 2’s scheduler – an issue that highlights the challenges of ensuring predictable behavior in complex systems and shows the need for formal methods and rigorous analysis to build reliable robotic frameworks.

Understanding executors

The executor is a fundamental component of ROS 2’s architecture, responsible for determining when tasks are executed. By managing available system resources, it is capable of handling diverse robotic systems without requiring developers to delve deeply into scheduling complexities.

Tasks in ROS 2 fall into two categories:

  • Time-triggered tasks (“timers”): These activate periodically based on a predefined period.
  • Event-triggered tasks: These activate upon receiving messages from other system components.

To manage these tasks, executors use a structure called the wait set, which temporarily stores eligible tasks before they are processed. The scheduling process alternates between two key phases:

  • Polling point: Activated task instances are collected into the wait set.
  • Processing window: Tasks stored in the wait set are executed sequentially or concurrently depending on the thread configuration of the executor.

Once all tasks have been processed – or no remaining tasks can be executed – a new polling point begins. By default, executors operate using a single thread, and developers can use multiple single-threaded executors to manually distribute workloads across the system. Alternatively, ROS 2 also offers a multi-threaded executor, where multiple threads collaborate to manage and execute tasks simultaneously, enabling parallel processing in complex systems.
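As a rough illustration, the alternation between polling points and processing windows can be sketched in a few lines of Python. This is an illustrative model with made-up `Task` and executor classes, not the actual rclcpp implementation:

```python
# Minimal sketch of the executor's two-phase cycle: a polling point
# collects activated tasks into the wait set, and a processing window
# executes them one by one.

from collections import deque

class Task:
    def __init__(self, name):
        self.name = name
        self.activated = False

    def run(self):
        self.activated = False
        return self.name

class SingleThreadedExecutor:
    def __init__(self, tasks):
        self.tasks = tasks
        self.wait_set = deque()

    def polling_point(self):
        # Collect all currently activated task instances into the wait set.
        self.wait_set = deque(t for t in self.tasks if t.activated)

    def processing_window(self):
        # Execute every task in the wait set sequentially.
        executed = []
        while self.wait_set:
            executed.append(self.wait_set.popleft().run())
        return executed

# Usage: two tasks activate; one executor cycle runs both.
a, b = Task("timer"), Task("subscriber")
a.activated = b.activated = True
ex = SingleThreadedExecutor([a, b])
ex.polling_point()
print(ex.processing_window())  # ['timer', 'subscriber']
```

In the multi-threaded variant, several threads draw tasks from this same wait set concurrently – which is where the complications described below arise.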

The executor is versatile and user-friendly by design, aiming to ensure that all activated tasks eventually execute. However, it has certain limitations – for example, if a timer has a period shorter than the length of processing windows, it may not be executed as frequently as configured. This can result in timing deviations that affect system behavior under specific configurations.

Design principles and limitations

Efficient workflow design in ROS 2 is essential for smooth scheduling and execution. A key principle is to avoid blocking executor threads by waiting synchronously for other tasks or external events.

Blocking behavior can delay ready-to-run tasks due to the executor’s design, reducing system responsiveness and performance. In extreme cases, tasks can block each other indefinitely, causing deadlocks that halt the system entirely.

One example of this challenge is found in robotics applications like autonomous navigation systems that use GPU accelerators for image processing. If a task sends data to the GPU and waits synchronously for results before completing, it prevents other tasks from being scheduled—creating a bottleneck that lowers system efficiency.

To solve this problem, ROS 2 promotes non-blocking designs where dependencies are handled asynchronously. For example, once the GPU finishes processing, another task should retrieve the results without occupying executors unnecessarily.
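The difference between the two styles can be sketched with Python’s standard `concurrent.futures` as a stand-in for a GPU offload. All names here are illustrative; this is not a ROS 2 API:

```python
# Blocking vs. non-blocking handling of an offloaded computation.
# ThreadPoolExecutor stands in for a GPU; gpu_process for a kernel.

from concurrent.futures import ThreadPoolExecutor

gpu = ThreadPoolExecutor(max_workers=1)

def gpu_process(image):
    return image[::-1]  # stand-in for an expensive GPU computation

# Blocking style: the callback waits for the result, occupying its
# executor thread for the entire GPU computation.
def blocking_callback(image):
    future = gpu.submit(gpu_process, image)
    return future.result()  # thread is stuck here until the GPU finishes

# Non-blocking style: the callback only submits the work; a completion
# callback hands the result to a later task instead of waiting.
results = []
def non_blocking_callback(image):
    future = gpu.submit(gpu_process, image)
    future.add_done_callback(lambda f: results.append(f.result()))

non_blocking_callback([1, 2, 3])
gpu.shutdown(wait=True)
print(results)  # [[3, 2, 1]]
```

The non-blocking variant returns immediately, so the executor thread stays available for other tasks while the GPU works.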

However, implementing non-blocking designs efficiently introduces new challenges:

  1. How can external accelerators like GPUs signal completion back to ROS 2 effectively?
  2. If signaling is not possible, how can we proactively poll results while minimizing wasted resources?

In our research, we started by addressing the second challenge: how to poll results efficiently without wasting resources. This led us to uncover the main problem in our work. Let’s first explore possible solutions to this challenge.

Solution 1: Using timers
Our initial approach leveraged ROS 2 timers as part of the executor’s workflow to periodically check whether GPU results were ready. Timers integrate seamlessly into ROS 2’s architecture by activating at predefined intervals without requiring additional signaling mechanisms.

While effective in theory, this solution introduced inherent delays:

  • Timers must wait until their period elapses before activation.
  • Even after activation, scheduling delays caused by other tasks can further postpone result retrieval.

To minimize these delays, we reduced the timer’s period to zero – ensuring that it was sampled at every polling point within the executor’s cycle.
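The polling pattern of this first solution can be sketched as follows. This is an illustrative stdlib simulation, not ROS 2 code, and `result_ready_at` is an assumed arrival time:

```python
# A zero-period timer polls for the GPU result at every polling point.
# The counters illustrate the inefficiency: most activations find no
# result and are wasted executions.

result_ready_at = 5   # assumption: the result arrives at the 5th polling point
checks = useful = 0

def poll_timer_callback(cycle):
    """Zero-period timer callback: activated at every polling point."""
    global checks, useful
    checks += 1
    if cycle >= result_ready_at:
        useful += 1
        return True   # result retrieved
    return False      # wasted execution: nothing to retrieve yet

cycle = 0
while not poll_timer_callback(cycle):
    cycle += 1

print(f"{checks} timer executions, {useful} useful")  # 6 timer executions, 1 useful
```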

However, this approach introduced its own inefficiencies:

  • Many instances of timer executions were unnecessary when results were not yet available.
  • Other high-priority tasks delayed result retrieval further during processing windows due to the prioritization of tasks within the wait set.

Recognizing these inefficiencies, we turned to an alternative approach – dedicating a thread specifically for result-checking.

Solution 2: Dedicated thread
To overcome these limitations, we explored an alternative solution using ROS 2’s multi-threaded executor design. Intentionally breaking the principle against blocking tasks, we dedicated one thread exclusively to checking GPU results, while other threads handled unrelated workload tasks concurrently.

This approach improved responsiveness but introduced inefficiencies:

  • Dedicating an entire thread solely for result-checking wastes valuable computational resources.
  • Parallel execution risks concurrent access conflicts when multiple threads interact with shared resources simultaneously.

While dedicating threads improves responsiveness, it also introduces concurrency challenges when shared resources are accessed simultaneously. To address these issues effectively, ROS 2 provides callback groups – a mechanism that prevents simultaneous execution of callbacks assigned to the same group. Specifically, if any task of a callback group is being executed by a thread, all other tasks of that callback group are blocked. By constraining parallelism through callback groups, we successfully avoided resource conflicts – but this led us toward discovering another significant issue with ROS 2’s multi-threaded executor: “starvation”.
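A mutually exclusive callback group can be modeled as a lock that a thread must take before running any task of the group. This is an illustrative model, not the rclcpp implementation:

```python
# Minimal model of a mutually exclusive callback group: if any task of
# the group is being executed, every other task of that group is blocked
# until it finishes.

import threading

class CallbackGroup:
    def __init__(self):
        self._lock = threading.Lock()

    def try_take(self):
        # A thread may run a task of this group only if no sibling task
        # of the same group is currently running.
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()

group = CallbackGroup()
assert group.try_take()        # thread 1 starts a task of the group
assert not group.try_take()    # thread 2 is blocked on a sibling task
group.release()                # thread 1 finishes its task
assert group.try_take()        # now the sibling task may run
```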

Uncovering the starvation problem

Building on our use of ROS 2’s multi-threaded executor and callback groups to prevent concurrent access, we began testing various configurations to integrate accelerators like GPUs into ROS 2 systems. Initially, we focused on a simple setup: two timer tasks sharing a single callback group. One task was responsible for checking accelerator result availability (with its period set to zero), while the other handled post-processing operations that accessed the results of GPU computations.

However, during testing, an unexpected behavior emerged – only the timer task with a period of zero executed repeatedly, while the post-processing task was completely ignored.

At first, we dismissed this as one of those “mysterious quirks” often attributed to ROS 2 and moved on without further investigation. A colleague urged us to dig deeper into the issue, but we opted for a quick fix instead: splitting the tasks into separate callback groups. This change temporarily resolved our problem; both tasks executed as expected again.

This seemingly plausible solution – separating tasks into different callback groups – proved problematic and reintroduced old challenges, such as race conditions when accessing shared data structures used by both tasks. These were precisely the problems callback groups were designed to prevent in ROS 2 systems. Faced with this trade-off, our only remaining option was to manually introduce mutexes – a solution that bypasses ROS 2’s framework and shifts concurrency management to developers, reintroducing older problems such as blocking callbacks unnecessarily and increasing the risk of deadlocks.

Ultimately, we reverted back to our original configuration where both tasks belonged to one callback group – and found ourselves back at square one: only one task (the timer with a period of zero) was being executed repeatedly while the post-processing task remained untouched.

This persistent issue marked a turning point in our investigation – we finally heeded the calls for a deeper analysis of ROS 2’s multi-threaded executor mechanisms.

Given the design of the single-threaded executor, all activated tasks are guaranteed to eventually execute – behavior that we have confirmed through our experience. Since the multi-threaded executor uses the same underlying structure as the single-threaded version, we expected it to behave similarly. However, our closer examination revealed something surprising: under certain configurations, some activated tasks were consistently ignored by the executor despite being ready for execution. This phenomenon is known as “starvation” – a condition where certain tasks fail to execute due to repeated prioritization of others.

The cause of starvation

To uncover why certain tasks were being starved in our tests, we developed a complete state machine model to analyze the core mechanisms of ROS 2’s executor (see figure below for an overview).

Complete state machine model of ROS 2’s executor

To focus on the most relevant aspects of this model, we created a simplified version that captures the key operations during its primary phases—polling points and processing windows—and their interaction with tasks in the wait set when callback groups are involved (see figure below).

Polling points occur when either the wait set is empty or all remaining tasks in the wait set are blocked by callback groups. During a polling point (illustrated on the left side of the figure), the executor clears the wait set, determines which tasks are activated, and adds those to the wait set. In the subsequent processing window (shown on the right side of the figure), tasks from this updated wait set are selected for execution, and their corresponding callback groups are marked as blocked during execution.

State machine illustrating task execution flow in ROS 2’s multi-threaded executor.

While these mechanisms generally ensure efficient scheduling, complications arise when multiple threads and callback groups interact – introducing unintended side effects under specific configurations.

To understand how polling points contribute to starvation, we revisit our earlier example involving a single callback group containing two tasks. One thread executes one task from this group while another thread remains idle with the post-processing task still in the wait set.

  • Idle thread initiates polling points: The idle thread triggers a polling point to update the contents of the wait set.
  • Actions performed during polling point:
    1. Clear wait set: All previously blocked tasks are removed entirely from the wait set.
    2. Fill wait set: Only unblocked tasks are added back into the wait set.
    3. Wait for work: Of these unblocked tasks, only activated ones remain in the wait set; if no activated tasks exist, the thread idles until either a task activates or another thread finishes its work.

This process typically works well but breaks down when previously blocked tasks belong to callback groups that remain marked as blocked during polling points.

The following sequence of events illustrates how starvation occurs:

  • Blocked tasks are removed during step 1 (Clear wait set).
  • These same blocked tasks cannot be re-added during step 2 (Fill wait set) because their callback group remains blocked.
  • Once blocking ends (after another thread finishes executing), both tasks within this callback group may activate again during subsequent polling points. However:
    • In each new polling point cycle, both tasks return to the wait set.
    • In each processing window that follows, only one task is executed repeatedly while its counterpart remains perpetually blocked.

This repetitive cycle causes starvation for certain tasks within systems configured with specific combinations of threads and callback groups – preventing them from ever being executed despite being activated.
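The faulty cycle described above can be reproduced with a small Python model of the clear-and-fill logic. This is an illustrative simulation, not the rclcpp code; `fast` is the zero-period timer and `post` the post-processing task, sharing one callback group:

```python
# Flawed polling-point logic: blocked tasks are dropped from the wait
# set (step 1) and not re-added while their group is blocked (step 2).

class Timer:
    def __init__(self, name):
        self.name = name

fast, post = Timer("fast"), Timer("post")
group_blocked = False
wait_set = []
executed = []

def polling_point(activated):
    global wait_set
    wait_set = []                   # step 1: clear wait set (blocked
                                    # tasks are dropped entirely!)
    if not group_blocked:           # step 2: re-add only unblocked tasks
        wait_set = list(activated)

for _ in range(4):                  # a few executor cycles
    polling_point([fast, post])     # both timers are activated
    task = wait_set.pop(0)          # thread 1 takes the first task...
    group_blocked = True            # ...blocking the shared group
    polling_point([fast, post])     # idle thread 2 triggers a polling
                                    # point: 'post' is cleared and, with
                                    # the group blocked, never re-added
    executed.append(task.name)
    group_blocked = False           # thread 1 finishes; cycle repeats

print([t.name for t in [fast, post]], "->", executed)
# ['fast', 'post'] -> ['fast', 'fast', 'fast', 'fast'] : 'post' starves
```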

The road to starvation freedom

Recognizing the critical impact of starvation in ROS 2’s multi-threaded executor, we set out to raise awareness and find solutions. Our first step was engaging with the ROS 2 community through a GitHub pull request – a direct attempt to notify developers and initiate discussions around our findings. Initial responses suggested that this behavior might either be an intentional design choice of the executor or simply a flaw specific to our system configuration.

Confident that this issue extended beyond our initial setup, we validated our observations by testing various system configurations – including underutilized setups where workloads were minimal and task conditions should not have caused problems. Yet consistently, we observed starvation behavior across our tests – reinforcing our belief that the problem lay deeper within the executor’s design. This raised an important question: was starvation indeed an intentional design choice of the executor, or was it an unexpected behavior?

Despite these results, spreading awareness proved challenging. We seemed to be the only ones encountering – or at least recognizing – this issue within our systems. Moreover, documentation for ROS 2’s executor neither mentioned starvation nor provided sufficient detail for developers to anticipate such behavior during implementation. This lack of visibility made detecting and diagnosing starvation particularly difficult.

Typically, tuning parameters like task priorities or splitting tasks into multiple callback groups can help optimize performance in ROS 2 systems; however, no amount of tuning could resolve this particular issue without fundamental changes in how tasks are managed by the multi-threaded executor:

  • Splitting tasks into separate callback groups bypasses ROS 2’s framework, leaving developers responsible for managing concurrency – often requiring mutexes that risk causing deadlocks.
  • Keeping tasks within a single callback group avoids concurrency issues but risks starvation under certain configurations.

This trade-off created a frustrating situation for developers – including us – working with ROS 2’s multi-threaded executor, forcing them either to accept inefficiencies or gamble with unpredictable behavior. We thus asked ourselves if this issue could be ignored or if it was too critical to overlook.

However, this issue could lead to catastrophic behavior in safety-critical systems, where missed task execution is unacceptable and compromises safe operation. For instance, if an autonomous vehicle fails to execute collision detection callbacks due to starvation, catastrophic outcomes could result. Given ROS 2’s widespread adoption across industries like autonomous driving and logistics, addressing this problem became even more urgent.

Internally within our research group, we introduced this problem as part of our research discussions – and while there were questions about whether fixing it was worthwhile as an academic pursuit (given that resolving it might only require minor code changes), we believed otherwise. When reviewing existing research on ROS 2’s multi-threaded executor, we did not find any discussion of starvation-related issues – even though existing analyses [1,2] implied that starved tasks would eventually execute, despite our evidence showing otherwise.

At this point, tackling this issue became not just worthwhile but necessary – not only from an academic perspective but also to ensure robust real-world deployment of ROS 2 systems across diverse industries. We were now convinced that there was a problem. The question was: how should we begin to fix it?

Achieving starvation-free scheduling

To address the starvation issue in ROS 2’s multi-threaded executor, we identified two key changes required to eliminate the problem. Each change tackles a specific aspect of the executor’s behavior that contributes to starvation:

  • Guarding callback groups from concurrent access
    We found that in all configurations exhibiting starvation, there were at least two tasks within one callback group – like in our previous example with a timer set to a period of zero and a post-processing task. This suggested an issue with how callback groups are handled. We asked ourselves: could it be due to the absence of mutex locks or some sort of hold-and-wait situation causing race conditions?

    After investigating, we discovered that callback groups were not protected from concurrent access by multiple threads, which could lead to race conditions and inconsistencies when determining whether a callback group was blocked or unblocked. During our efforts to fix this, we realized it was necessary to introduce mutexes to guard callback groups. By ensuring that only one thread can access the status of a callback group at any given time, we eliminated these risks and enabled safe management of tasks within the same callback group.
  • Modifying How the Wait Set Is Cleared
    While callback groups are now safe from concurrent access, this fix did not address the removal of jobs from the wait set. This raised an important question: Why are jobs lost, and how is the wait set cleared? Upon inspecting the structure of the executor, we identified that this issue originates from the ‘clear wait set’ method – the only method responsible for reducing the content of the wait set.

    To resolve this, we revisited how the wait set is cleared during polling points. Previously, tasks within blocked callback groups were removed entirely from the wait set and not re-added until their group became unblocked – directly contributing to starvation. Our solution ensures that once tasks are added to the wait set, they remain there until executed – even if their callback group is temporarily blocked by other executing tasks.

These two modifications effectively eliminate starvation across all system configurations, guaranteeing that every task in a callback group will eventually execute after activation.
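Both changes can be sketched on the two-timer example from earlier (one zero-period timer and one post-processing timer sharing a callback group). This is an illustrative model of the fix, not the patched rclcpp code:

```python
# Corrected behavior: once a task is in the wait set it stays there
# until executed, even while its callback group is blocked, and the
# group status itself is guarded by a mutex.

import threading

class CallbackGroup:
    def __init__(self):
        self.lock = threading.Lock()  # fix 1: guard the group status
        self.blocked = False

fast, post = "fast", "post"
group = CallbackGroup()
wait_set = []
executed = []

def polling_point(activated):
    # Fix 2: do NOT clear pending tasks of blocked groups; only add
    # newly activated tasks that are not already waiting.
    for task in activated:
        if task not in wait_set:
            wait_set.append(task)

def take_task():
    with group.lock:                  # consistent view of the group status
        if group.blocked or not wait_set:
            return None
        group.blocked = True
        return wait_set.pop(0)

for _ in range(4):
    polling_point([fast, post])       # both timers activated each cycle
    task = take_task()                # one thread runs a task of the group
    polling_point([fast, post])       # an idle thread polls: 'post' survives
    executed.append(task)
    with group.lock:
        group.blocked = False         # task done; group unblocked

print(executed)  # ['fast', 'post', 'fast', 'post'] - no task starves
```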

While achieving starvation freedom is critical for reliable scheduling, it is equally important to evaluate potential drawbacks introduced by our changes – particularly in terms of performance overhead. We assessed whether guarding callback groups with mutexes and modifying how the wait set is handled would significantly impact scheduling efficiency. Our evaluation showed negligible performance impact. The additional overhead introduced by these changes remains insignificant compared to the overall scheduling overhead of the executor. Developers can therefore adopt our solution without concerns about compromising system performance.

To validate our approach further, we focused on formally verifying that our proposed executor design is starvation-free. Additionally, we proved that the design is deadlock-free, as the newly introduced mutexes could theoretically lead to deadlocks. To achieve this, we created manual proofs validating both deadlock freedom and starvation freedom and complemented them with model checking using SPIN – a tool for verifying concurrent systems. For this purpose, we developed a custom state machine model representing the new executor design and verified its behavior using SPIN’s capabilities. This combination of manual proofs and automated model checking demonstrates that it is possible to formally verify specific properties of executors in ROS 2.

Beyond implementing these fixes internally, we reached out to authors of existing research on ROS 2’s multi-threaded executor to raise awareness about this issue and its resolution. At the time of writing:

  • Existing analyses have been updated based on our findings with only minor adaptations required.
  • Efforts are underway to integrate our proposed changes into ROS 2’s main stack – bringing starvation-free scheduling closer to becoming standard across all systems without requiring any adjustments from current users.

By addressing this critical flaw in ROS 2’s multi-threaded executor, we hope developers will benefit from more predictable and reliable task execution – ensuring robust deployments across industries like autonomous driving, logistics, and beyond.

Conclusion

Our work uncovered a subtle yet impactful issue in ROS 2’s scheduling mechanisms – an issue that highlights the challenges of ensuring predictable behavior in complex systems. By modifying how the wait set is managed and introducing safeguards for callback groups, we ensured starvation freedom as a property of ROS 2’s multi-threaded executor.

This journey underscores the clear need for rigorous theoretical foundations and clearly specified properties for ROS 2. Ensuring that fundamental assumptions about task scheduling hold true – and are rigorously verified – is critical as robotic systems grow increasingly complex and applications demand higher levels of reliability. Formal verification played a key role in validating our solution, providing strong assurances about its correctness and reliability across diverse configurations.

As ROS 2 continues to advance, we hope this approach becomes integral to its development, ensuring more predictable and robust deployments – particularly for safety-critical industries like autonomous driving and logistics. We also hope our findings inspire further collaboration within the ROS 2 community to advance its scheduling mechanisms through rigorous study of its properties – paving the way for innovative solutions that meet evolving needs for safety and reliability in modern robotics.

For those interested in exploring our work in greater detail, we invite you to read our full scientific paper, ‘Thread Carefully: Preventing Starvation in the ROS 2 Multi-Threaded Executor’, presented at EMSOFT 2024 [3], which includes all the technical details, formal proofs, performance evaluations, and insights into achieving starvation-free scheduling in ROS 2 systems.

Authors

Harun Teper is a research assistant in the Faculty of Computer Science at TU Dortmund University, working with the Design Automation for Embedded Systems Group.

Daniel Kuhse is a research assistant in the Faculty of Computer Science at TU Dortmund University, working with the Design Automation for Embedded Systems Group.

Mario Günzel is a postdoc researcher in the Faculty of Computer Science at TU Dortmund University, working with the Design Automation for Embedded Systems Group.

Georg von der Brüggen is a senior researcher in the Faculty of Computer Science at TU Dortmund University, working with the Design Automation for Embedded Systems Group.

Falk Howar is a professor in the Faculty of Computer Science at TU Dortmund University, leading the Automated Quality Assurance Group.

Jian-Jia Chen is a professor in the Faculty of Computer Science at TU Dortmund University and also works with the Design Automation for Embedded Systems Group. Additionally, he is affiliated with the Lamarr Institute for Machine Learning and Artificial Intelligence.

References
[1] H. Sobhani, H. Choi and H. Kim, “Timing Analysis and Priority-driven Enhancements of ROS 2 Multi-threaded Executors,” IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS), 2023.
[2] X. Jiang, D. Ji, N. Guan, R. Li, Y. Tang and Y. Wang, “Real-Time Scheduling and Analysis of Processing Chains on Multi-threaded Executor in ROS 2,” IEEE Real-Time Systems Symposium (RTSS), 2022.
[3] H. Teper, D. Kuhse, M. Günzel, G. von der Brüggen, F. Howar and J.-J. Chen, “Thread Carefully: Preventing Starvation in the ROS 2 Multithreaded Executor,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024.

Disclaimer: Any views or opinions represented in this blog are personal, belong solely to the blog post authors and do not represent those of ACM SIGBED or its parent organization, ACM.