Motion Planning for Autonomous Vehicles in the Presence of Uncertainty Using Reinforcement Learning

Kasra Rezaee, Peyman Yadmellat, Simon Chamorro


Abstract

This paper addresses the challenge of motion planning under uncertainty—a critical issue for autonomous driving where sensor limitations, occlusions, and restricted fields of view can compromise safety. Rather than following traditional reinforcement learning (RL) approaches that optimize the average expected reward, the authors propose a method that leverages distributional RL to optimize for the worst-case outcome. By integrating quantile regression with established RL algorithms (namely Soft Actor-Critic and DQN), the approach seeks to produce a more conservative and safe policy. Evaluated in two simulation scenarios using the SUMO traffic simulator, the proposed method demonstrates improved collision avoidance and a behavior that approximates human driving under challenging conditions.

Key Takeaways

1

Worst-Case Reward Optimization

By reformulating the RL objective to maximize the lower bound of the reward distribution rather than the average, the paper presents a safety-centric approach that minimizes risky behaviors in uncertain environments.

2

Handling Occlusion and Sensing Uncertainty

The proposed method directly addresses uncertainties stemming from limited sensor ranges and occlusions—common issues in real-world autonomous driving—by incorporating them into the motion planning framework.

3

Integration of Quantile Regression with RL Algorithms

The study adapts standard RL algorithms (Soft Actor-Critic and DQN) using quantile regression. This modification enables the estimation of a reward distribution and supports the selection of actions based on a conservative evaluation of outcomes.

4

Improved Safety and Performance in Simulation

Experimental results on both pedestrian crossing and curved road scenarios demonstrate that the conservative RL policies can reduce collision rates and achieve driving behaviors that are comparable to cautious human drivers, outperforming traditional RL and rule-based planners in critical metrics.

Introduction

Figure 1. The figure illustrates two driving scenarios where occlusion affects autonomous vehicle visibility: in the pedestrian crossing scenario, a parked yellow vehicle at the corner occludes potential pedestrians approaching the crossing, while in the curved road scenario, the bend creates an occluded area (highlighted in red) that limits sensor coverage and may conceal an oncoming vehicle.
  • Motion planning is defined as the task of determining a path for an autonomous vehicle to achieve its objectives.
  • The primary goal of motion planning is to ensure a safe trajectory.
  • The work concentrates on uncertainty arising from limitations in sensing and perception, specifically due to limited field of view, occlusion, and sensing range.
  • These uncertainties make achieving a safe trajectory a challenging task in the context of self-driving.
  • These points underscore the importance of robust motion planning algorithms that can handle real-world uncertainties to ensure the safety and reliability of autonomous vehicles. The paper will further delve into how RL can be leveraged to address these challenges.

    Methodology

    1. Distributional RL with Quantile Regression: Instead of predicting a single Q-value for a state-action pair, the algorithm predicts a set of quantiles representing the distribution of possible returns. This is done using Quantile Regression.
      • Quantile Regression: For a given probability level \(\tau\) (e.g., 0.1 for the 10th percentile), quantile regression estimates the value \(q_\tau\) below which a fraction \(\tau\) of the data falls. In this context, the algorithm learns to predict the quantiles of the return distribution by minimizing a loss that penalizes underestimation and overestimation asymmetrically according to the chosen quantile level \(\tau\). The quantile regression (pinball) loss for quantile \(\tau\) is
      • \[L_{\tau}(u) = \begin{cases} \tau\, u & \text{if } u \geq 0 \\ (\tau - 1)\, u & \text{if } u < 0 \end{cases}\]

        where \(u\) is the residual between the target return and the predicted quantile (target minus prediction). A minimal code sketch of this loss appears after this list.

    2. Conservative Policy Optimization: The key modification is in how the Q-value is used for policy optimization. Instead of using the mean of the predicted quantiles (as in standard QR-DQN or a comparable distributional SAC implementation), the algorithm uses the lowest quantile as the Q-value:

      \[Q(s, a) = q_1(s, a)\]

      This effectively forces the agent to optimize for the worst-case scenario, as it only considers the lowest possible return when evaluating a state-action pair.

    3. CQR-DQN:
      • The DQN algorithm is extended to predict \(N\) quantiles.
      • The target Q-value for the DQN update is calculated using the lowest quantile, i.e., the target is \(q_1(s, a)\).
      • The action selection (epsilon-greedy or similar) is also based on the Q-values derived from the lowest quantile.
    4. CQR-SAC:
      • The Q-networks in SAC are modified to predict \(N\) quantiles.
      • The distributional Bellman equation is used to update the critic: \(Z(s, a) := r(s, a) + \gamma\left(Z(s^{\prime}, a^{\prime}) - \log \pi(a^{\prime}|s^{\prime})\right)\), where \(Z\) represents the distribution of returns.
      • The key difference is how \(Z(s, a)\) is estimated: instead of a Gaussian distribution, the paper uses quantile regression to approximate the distribution.
      • The actor update keeps the original SAC rule, except that the Q-value it maximizes is derived from the lowest quantile, pushing the policy toward actions with high worst-case returns (a sketch of this update follows Figure 2).
    5. Trajectory Evaluation vs. Policy Evaluation: The paper distinguishes between two ways of evaluating the RL agent's performance:
      • Trajectory evaluation assumes that the vehicle will follow the same trajectory into the future and does not adapt the trajectory as more information becomes available.
      • Policy evaluation assumes that the agent chooses the next trajectory from its trained policy, thus allowing the vehicle to adapt the trajectory as more information becomes available.
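
    To make the conservative quantile machinery concrete, the sketch below implements the pinball loss from item 1 and the lowest-quantile Q-value from item 2 in plain NumPy. Function names such as `pinball_loss` and `conservative_q` are illustrative and not taken from the paper's implementation; the actual method trains neural quantile networks (CQR-DQN / CQR-SAC) rather than operating on fixed arrays.

```python
import numpy as np

def pinball_loss(u, tau):
    """Quantile regression (pinball) loss for residual u = target - prediction.
    For a low tau (e.g., 0.1), overestimating the return is penalized more
    heavily than underestimating it, which makes the learned quantile a
    conservative estimate of the return."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def conservative_q(quantiles):
    """Collapse N predicted return quantiles into a single conservative
    Q-value by taking the lowest quantile, Q(s, a) = q_1(s, a)."""
    return np.min(quantiles, axis=-1)

# Toy illustration: two actions, N = 5 predicted return quantiles each.
quantiles = np.array([
    [2.0, 4.0, 6.0, 8.0, 10.0],   # action 0: higher mean, wide spread
    [4.5, 5.0, 5.5, 6.0,  6.5],   # action 1: lower mean, narrow spread
])
print(quantiles.mean(axis=-1))      # mean-based Q prefers action 0
print(conservative_q(quantiles))    # worst-case Q prefers action 1
greedy_action = int(np.argmax(conservative_q(quantiles)))
```

    Under a mean-based Q-value the first action looks better, but its lowest quantile is far worse; the conservative criterion instead selects the action whose worst plausible return is highest, which is the behavior the paper aims for near occlusions.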
    Figure 2. The figure presents a variation of the classic Cliff Walk example from reinforcement learning. An agent starts at S and must reach the goal G while avoiding a hazardous cliff, shown as a gray region that inflicts a -20 penalty and terminates the episode if entered. Three paths are available: Path 1 is the shortest but most perilous because it runs alongside the cliff, Path 2 is a safer but longer route, and Path 3 offers a further alternative. With probability p, any chosen action may be randomly replaced by a downward movement, and the agent incurs a -1 reward for each step. The example illustrates the trade-off between maximizing expected reward and mitigating risk, and underscores why standard reinforcement learning can be inadequate for real-world, safety-critical problems, motivating the use of distributional RL to optimize for worst-case outcomes.
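
    As referenced in item 4 above, the following is a simplified sketch of how the CQR-SAC critic target and actor objective could look in PyTorch. It assumes a critic that returns \(N\) quantile estimates per state-action pair and introduces `alpha` for the standard SAC entropy temperature (omitted from the Bellman equation quoted above); target networks, twin critics, and the quantile regression loss itself are left out for brevity, so this is an illustrative approximation rather than the authors' implementation.

```python
import torch

def cqr_sac_critic_target(reward, next_quantiles, next_logp,
                          alpha=0.2, gamma=0.99):
    """Distributional Bellman target:
        Z(s, a) := r(s, a) + gamma * (Z(s', a') - alpha * log pi(a'|s')).
    Each of the N predicted quantiles of Z(s', a') is shifted by the entropy
    term and discounted, yielding N target quantiles per transition."""
    # Shapes: reward (batch, 1), next_quantiles (batch, N), next_logp (batch, 1)
    return reward + gamma * (next_quantiles - alpha * next_logp)

def cqr_sac_actor_loss(quantiles, logp, alpha=0.2):
    """Standard SAC actor loss, except the Q-value is the lowest predicted
    quantile q_1(s, a) rather than the mean, so the policy is pushed toward
    actions whose worst-case return is high."""
    q_conservative = quantiles.min(dim=-1, keepdim=True).values
    return (alpha * logp - q_conservative).mean()
```

    In a full implementation, the target quantiles returned by the first function would then be regressed onto the critic's predicted quantiles using the pinball loss sketched earlier.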

    Results

    Evaluated in two simulated scenarios using SUMO:

    • Pedestrian crossing with occlusion.
    • Curved road with occlusion.

    Compared CQR-SAC and CQR-DQN against standard SAC, DQN, QR-SAC, and QR-DQN, and rule-based planners (fixed speed, naive, aware).

    Results showed that CQR-SAC and CQR-DQN achieved lower collision rates compared to the standard RL algorithms, demonstrating the effectiveness of optimizing for the worst-case scenario. However, the performance of the algorithms depended on how the trajectory was evaluated.

    The conservative algorithms learned to drive more cautiously in the presence of occlusions, resulting in safer behavior.

    In the pedestrian crossing scenario, maximizing the lower bound of the reward distribution actually produced a higher overall reward, because the lower approach speed required milder braking, which incurs a smaller penalty.

    Figure 3. The figure depicts the training progress of various reinforcement learning algorithms (SAC, QR-SAC, CQR-SAC\(\pi\), CQR-SAC\(\tau\), DQN, QR-DQN, CQR-DQN\(\pi\), and CQR-DQN\(\tau\)) in two scenarios: a pedestrian crossing and a curved road. It shows the collision rate, average episode speed, and episode reward as functions of training steps. The collision rate decreases as training progresses, with the CQR-SAC\(\pi\) and CQR-DQN\(\pi\) algorithms generally achieving lower collision rates, in line with the objective of optimizing for the worst-case scenario to enhance safety. Meanwhile, the average episode speed increases as the agents become more proficient at navigating the environment, and the episode reward rises, indicating that the agents are learning policies that yield higher cumulative rewards. The shaded regions around each performance line represent the variability across multiple training runs, providing insight into each algorithm's robustness.
    Figure 4. This table summarizes the test performance of various algorithms in two scenarios—a curved road, which evaluates lane maintenance, speed, and collision avoidance, and a pedestrian crossing, which emphasizes safety and responsiveness. The metrics include average reward (r̄), collision rate (%), average speed (v̄ in m/s), and the 5th percentile acceleration (a in m/s²) that indicates deceleration intensity during critical events. The compared algorithms encompass traditional reinforcement learning approaches (SAC, DQN), their distributional RL extensions using quantile regression (QR-SAC, QR-DQN), and conservative variants (CQR-SACπ, CQR-DQNπ, CQR-SACτ, CQR-DQNτ), alongside rule-based planners (Fixed, Naive, Aware). Notably, the Aware planner achieves a 0% collision rate on the curved road, serving as a safety baseline, while the CQR-SACπ and CQR-DQNπ variants generally exhibit lower collision rates compared to SAC and DQN. Variations in average reward reflect different trade-offs among safety, comfort, and mobility, and the 5th percentile acceleration values reveal how aggressively the vehicle brakes during emergencies. The π variants assess the agent’s policy for greater flexibility, whereas the τ variants evaluate a specific trajectory for a more conservative approach. Overall, these results suggest that accounting for uncertainty and optimizing for the worst-case scenario—particularly with conservative QR-RL algorithms—can lead to safer motion planning in occluded scenarios, with rule-based planners providing a useful performance baseline.
    Figure 5. The figure compares the behaviors of two motion planners—Soft Actor-Critic (SAC) and Conservative QR-SAC (CQR-SACπ)—in a simulated driving scenario where vehicle positions across multiple episodes are shown as colored dots indicating speed. In the top panel, the SAC planner displays a more dispersed distribution of dots that extend further into the intersection, with a color gradient revealing that the vehicle maintains higher speeds near the pedestrian crossing. In contrast, the bottom panel shows the CQR-SACπ planner forming a denser cluster of dots positioned further from the intersection, with colors indicating a significant reduction in speed as the vehicle approaches the crosswalk. Notably, the SAC planner tends to approach the crossing at higher speeds, resulting in less consistency in vehicle positions, while the CQR-SACπ planner exhibits a more cautious approach by slowing down considerably, thereby forming a tighter cluster of positions. Both planners also demonstrate a tendency to shift slightly to the left near the crosswalk—a behavior learned to improve visibility behind occlusions. Overall, the CQR-SACπ planner embodies a safer, more conservative strategy by accounting for environmental uncertainty and optimizing for worst-case scenarios, in contrast to the conventional SAC approach that focuses on maximizing average expected reward, which may not be as robust in uncertain situations.

    Discussion

    The discussion provides a critical analysis of the proposed approach:

    • Trade-Offs in Reward Optimization:
    • The paper highlights that while maximizing the average reward might yield higher overall performance in benign conditions, it can lead to unsafe trajectories when unexpected events occur. By focusing on the worst-case outcome, the conservative RL policies offer a more robust alternative for safety-critical applications.

    • Effectiveness of Quantile Regression:
    • The integration of quantile regression is shown to be beneficial, although its impact varies between scenarios. In settings with significant uncertainty, the conservative approach clearly outperforms traditional methods, though further tuning might be needed for complex, multi-agent environments.

    • Limitations and Future Directions:
    • The reliance on simulation (using SUMO) raises questions about the direct transferability of the results to real-world systems. Additionally, while the conservative policies improve safety, they may sometimes lead to overly cautious behavior. The discussion suggests that future research could explore more adaptive methods, including meta-learning and enhanced sensor fusion techniques, to better balance safety and performance.

    Conclusion

    In conclusion, the paper makes a compelling case for rethinking traditional reinforcement learning objectives in the context of autonomous motion planning. By shifting the focus from average reward maximization to worst-case outcome optimization, the authors demonstrate that safer and more robust driving policies can be achieved. The experimental validation using the SUMO simulator in challenging scenarios shows that the proposed modifications—applied to both SAC and DQN frameworks—yield significant improvements in collision avoidance and overall driving behavior. While promising, the study also identifies avenues for further research, particularly in extending the approach to more diverse and complex real-world situations.

    Reviewer Notes

    Key Points

    1

    The paper introduces a novel RL framework that focuses on optimizing the worst-case outcome, which is crucial for safety in autonomous driving.

    2

    It successfully integrates quantile regression into both value-based and actor-critic RL methods to provide a conservative evaluation of actions.

    3

    Experimental results in simulation demonstrate that the conservative RL policies can significantly reduce collision rates in uncertain scenarios.

    Extended Analysis

    1

    The heavy reliance on simulated environments (SUMO) raises concerns about the method’s scalability and adaptability to real-world driving conditions.

    2

    While quantile regression is a creative approach, its performance benefits appear to be scenario-dependent, suggesting that further research is needed to generalize the method.

    3

    Future work should investigate the integration of additional sensory inputs and multi-agent dynamics to better capture the complexities of real autonomous driving environments.