Continuous action spaces in reinforcement learning (RL) are commonly defined as interval sets. While intervals usually reflect the action boundaries of a task well, they can be challenging for learning because the typically large global action space leads to frequent exploration of irrelevant actions. Yet, often little task knowledge suffices to identify significantly smaller, state-specific sets of relevant actions. Focusing learning on these relevant actions can substantially improve training efficiency and effectiveness. In this paper, we propose to focus learning on the set of relevant actions and introduce three continuous action masking methods that exactly map the action space to the state-dependent set of relevant actions. Our methods thus ensure that only relevant actions are executed, enhancing the predictability of the RL agent and enabling its use in safety-critical applications. We further derive the implications of the proposed methods for the policy gradient. Using Proximal Policy Optimization (PPO), we evaluate our methods on three control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than a baseline without action masking.
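A minimal sketch of the core idea behind one such masking method, assuming the relevant actions form a state-dependent box and using a simple linear rescaling (the function names and the box representation are illustrative, not the paper's implementation):

```python
import numpy as np

def mask_action(policy_action, low, high, rel_low, rel_high):
    """Map an action from the global box [low, high] onto a state-dependent
    box of relevant actions [rel_low, rel_high] (element-wise rescaling),
    so every executed action is guaranteed to be relevant."""
    t = (policy_action - low) / (high - low)   # normalize to [0, 1]
    return rel_low + t * (rel_high - rel_low)  # rescale into the relevant set

# Example: global action space [-1, 1], relevant set [0.2, 0.5] in this state.
print(mask_action(np.array([0.0]), np.array([-1.0]), np.array([1.0]),
                  np.array([0.2]), np.array([0.5])))  # [0.35]
```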
Learning Biomolecular Models using Signal Temporal Logic
Hanna Krasowski, Eric Palanques-Tost, Calin Belta, and Murat Arcak
Modeling dynamical biological systems is key for understanding, predicting, and controlling complex biological behaviors. Traditional methods for identifying governing equations, such as ordinary differential equations (ODEs), typically require extensive quantitative data, which is often scarce in biological systems due to experimental limitations. To address this challenge, we introduce an approach that determines biomolecular models from qualitative system behaviors expressed as Signal Temporal Logic (STL) statements, which are naturally suited to translate expert knowledge into computationally tractable specifications. Our method represents the biological network as a graph, whose edges denote interactions between species, and uses a genetic algorithm to identify the graph. To infer the parameters of the ODEs modeling the interactions, we propose a gradient-based algorithm. On a numerical example, we evaluate two loss functions based on STL robustness and analyze different initialization techniques to improve the convergence of the approach.
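A minimal sketch of how STL robustness can serve as a loss, for the illustrative specification F (x > c), i.e., "the species' concentration eventually exceeds a threshold"; the log-sum-exp surrogate is one common way to make the max differentiable and is an assumption here, not necessarily the paper's choice:

```python
import numpy as np

def robustness_eventually(signal, threshold):
    """Robustness of F (x > threshold) over a sampled signal: positive
    iff the signal exceeds the threshold at some time step."""
    return np.max(signal - threshold)

def smooth_robustness_eventually(signal, threshold, beta=10.0):
    """Smooth log-sum-exp surrogate of the max, usable in a
    gradient-based loss."""
    d = signal - threshold
    return np.log(np.sum(np.exp(beta * d))) / beta

x = np.array([0.1, 0.4, 0.9, 0.7])          # sampled trajectory of one species
print(robustness_eventually(x, 0.8))         # > 0: specification satisfied
print(smooth_robustness_eventually(x, 0.8))  # smooth approximation
```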
Predictive Safety Shield for Dyna-Q Reinforcement Learning
Integrating safety guarantees into reinforcement learning is a major challenge in making this method applicable to real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields use random sampling of safe actions or a fixed fallback controller, thereby disregarding the future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete spaces. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
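A minimal sketch of the shielding loop under strong assumptions (tabular Q-function, a safety oracle `is_safe`, and an environment model `model`, all hypothetical placeholders rather than the paper's implementation):

```python
import numpy as np

def shielded_action(q, state, actions, model, is_safe, horizon,
                    gamma=0.95, alpha=0.5):
    """Pick the highest-value verified-safe action after refining the
    tabular Q-function q[s, a] along a short simulated safe rollout.
    Assumes at least one safe action exists in `state`."""
    safe = [a for a in actions if is_safe(state, a)]
    for a in safe:
        s, act = state, a
        for _ in range(horizon):                   # Dyna-style local updates
            s2, r = model(s, act)                  # safe simulation step
            q[s, act] += alpha * (r + gamma * np.max(q[s2]) - q[s, act])
            safe_next = [b for b in actions if is_safe(s2, b)]
            if not safe_next:
                break
            s, act = s2, max(safe_next, key=lambda b: q[s2, b])
    return max(safe, key=lambda a: q[state, a])
```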
Maximizing Seaweed Growth on Autonomous Farms: A Dynamic Programming Approach for Underactuated Systems Navigating on Uncertain Ocean Currents
Matthias Killer*, Marius Wiggert*, Hanna Krasowski, Manan Doshi, Pierre F.J. Lermusiaux, and Claire J. Tomlin
Seaweed biomass offers significant potential for climate mitigation, but large-scale, autonomous open-ocean farms are required to fully exploit it. Such farms typically have low propulsion and are heavily influenced by ocean currents. We want to design a controller that maximizes seaweed growth over months by taking advantage of the non-linear time-varying ocean currents for reaching high-growth regions. The complex dynamics and underactuation make this challenging even when the currents are known; it is even harder when only short-term imperfect forecasts with increasing uncertainty are available. We propose a dynamic programming-based method to efficiently solve for the optimal growth value function when the true currents are known. We additionally present three extensions for the realistic case in which only forecasts are available: (1) the resulting value function can be used as a feedback policy to obtain the growth-optimal control for all states and times, enabling closed-loop control equivalent to re-planning at every time step and thus mitigating forecast errors; (2) a feedback policy for long-term optimal growth beyond the forecast horizon, using seasonal average current data as terminal reward; and (3) a discounted finite-time dynamic programming formulation that accounts for the increasing uncertainty of ocean current estimates. We evaluate our approach through 30-day simulations of floating seaweed farms in realistic Pacific Ocean current scenarios. Our method achieves 95.8% of the best possible growth using only 5-day forecasts. This confirms the feasibility of using low-power propulsion and optimal control to enhance seaweed growth on floating farms under real-world conditions.
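A toy backward-induction sketch of the dynamic programming idea on a 1-D grid (the growth model, current field, and action set are illustrative assumptions; the discount factor gamma mirrors extension (3), and using the value function V as a feedback policy mirrors extension (1)):

```python
import numpy as np

def growth_value_function(growth, currents, actions, T, gamma=1.0):
    """Backward induction for the optimal accumulated-growth value function
    on a 1-D grid of cells, a toy stand-in for the ocean domain.
    growth[i]      -- growth harvested in cell i per step
    currents[t][i] -- drift (in cells) acting on the farm at time t
    actions        -- admissible propulsion offsets, e.g. (-1, 0, 1)"""
    n = len(growth)
    V = np.zeros((T + 1, n))
    policy = np.zeros((T, n), dtype=int)
    for t in range(T - 1, -1, -1):              # backward in time
        for i in range(n):
            best = -np.inf
            for a in actions:
                j = int(np.clip(i + currents[t][i] + a, 0, n - 1))
                val = growth[j] + gamma * V[t + 1, j]
                if val > best:
                    best, policy[t, i] = val, a
            V[t, i] = best
    return V, policy  # V yields the growth-optimal control for all states/times

growth = np.array([0.1, 0.2, 1.0, 0.3])
currents = np.zeros((5, 4), dtype=int)          # still water for simplicity
V, pi = growth_value_function(growth, currents, actions=(-1, 0, 1), T=5)
print(V[0])                                     # optimal growth-to-go per start cell
```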
Published
2024
Provable Traffic Rule Compliance in Safe Reinforcement Learning on the Open Sea
For safe operation, autonomous vehicles have to obey traffic rules that are set forth in legal documents formulated in natural language. Temporal logic is a suitable concept to formalize such traffic rules. Still, temporal logic rules often result in constraints that are hard to solve with optimization-based motion planners. Reinforcement learning (RL) is a promising method to find motion plans for autonomous vehicles. However, vanilla RL algorithms are based on random exploration and do not automatically comply with traffic rules. Our approach accomplishes guaranteed rule compliance by integrating temporal logic specifications into RL. Specifically, we consider the application to vessels on the open sea, which must adhere to the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS). To efficiently synthesize rule-compliant actions, we combine predicates based on set-based prediction with a statechart representing our formalized rules and their priorities. Action masking then restricts the RL agent to this set of verified rule-compliant actions. In numerical evaluations of critical maritime traffic situations, our agent always complies with the formalized legal rules and never collides while achieving a high goal-reaching rate during training and deployment. In contrast, vanilla and traffic-rule-informed RL agents frequently violate traffic rules and collide even after training.
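A minimal sketch of the action masking step for a discrete action space, assuming a boolean compliance vector produced by the (here omitted) statechart and set-based prediction stage:

```python
import numpy as np

def masked_policy_distribution(logits, compliant):
    """Restrict a discrete policy to verified rule-compliant actions by
    assigning zero probability to all other actions."""
    masked = np.where(compliant, logits, -np.inf)
    p = np.exp(masked - np.max(masked))              # stable softmax over safe actions
    return p / p.sum()

logits = np.array([1.0, 2.0, 0.5, 1.5])
compliant = np.array([True, False, True, True])       # e.g., action 1 violates a rule
print(masked_policy_distribution(logits, compliant))  # zero mass on action 1
```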
2023
Provably Safe Reinforcement Learning via Action Projection Using Reachability Analysis and Polynomial Zonotopes
Niklas Kochdumper*, Hanna Krasowski*, Xiao Wang*, Stanley Bak, and Matthias Althoff
While reinforcement learning produces very promising results for many applications, its main disadvantage is the lack of safety guarantees, which prevents its use in safety-critical systems. In this work, we address this issue with a safety shield for nonlinear continuous systems that solve reach-avoid tasks. Our safety shield prevents applying potentially unsafe actions from a reinforcement learning agent by projecting the proposed action to the closest safe action. This approach is called action projection and is implemented via mixed-integer optimization. The safety constraints for action projection are obtained by applying parameterized reachability analysis using polynomial zonotopes, which makes it possible to accurately capture the nonlinear effects of the actions on the system. In contrast to other state-of-the-art approaches for action projection, our safety shield can efficiently handle input constraints and dynamic obstacles, eases incorporation of the spatial robot dimensions into the safety constraints, guarantees robust safety despite process noise and measurement errors, and is well suited for high-dimensional systems, as we demonstrate on several challenging benchmark systems.
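A simplified stand-in for the projection step: a convex QP that projects the proposed action onto a fixed polytopic safe set (the paper instead solves a mixed-integer program whose constraints come from reachability analysis with polynomial zonotopes):

```python
import cvxpy as cp
import numpy as np

def project_to_safe(a_rl, A, b):
    """Return the closest action to a_rl inside the safe set {a : A a <= b}."""
    a = cp.Variable(a_rl.shape[0])
    cp.Problem(cp.Minimize(cp.sum_squares(a - a_rl)), [A @ a <= b]).solve()
    return a.value

# Safe set: |a_i| <= 0.5 for a 2-D action.
A = np.vstack([np.eye(2), -np.eye(2)])
b = 0.5 * np.ones(4)
print(project_to_safe(np.array([0.9, -0.2]), A, b))  # approx. [0.5, -0.2]
```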
Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking
Hanna Krasowski*, Jakob Thumm*, Marlon Müller, Lukas Schäfer, Xiao Wang, and Matthias Althoff
Ensuring the safety of reinforcement learning (RL) algorithms is crucial to unlock their potential for many real-world tasks. However, vanilla RL and most safe RL approaches do not guarantee safety. In recent years, several methods have been proposed to provide hard safety guarantees for RL, which is essential for applications where unsafe actions could have disastrous consequences. Nevertheless, there is no comprehensive comparison of these provably safe RL methods. Therefore, we introduce a categorization of existing provably safe RL methods, present the conceptual foundations for both continuous and discrete action spaces, and empirically benchmark existing methods. We categorize the methods based on how they adapt the action: action replacement, action projection, and action masking. Our experiments on an inverted pendulum and a quadrotor stabilization task indicate that action replacement is the best-performing approach for these applications despite its comparatively simple realization. Furthermore, adding a reward penalty every time the safety verification is engaged improved training performance in our experiments. Finally, we provide practical guidance on selecting provably safe RL approaches depending on the safety specification, RL algorithm, and type of action space.
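A schematic view of the three categories as a single dispatch function; all callables are placeholders for a concrete verification backend, not an implementation from the paper:

```python
def provably_safe_step(a_rl, verify, sample_safe, project, mask, mode):
    """Return the action that is actually executed under each category."""
    if mode == "replacement":
        return a_rl if verify(a_rl) else sample_safe()  # swap in a safe action
    if mode == "projection":
        return a_rl if verify(a_rl) else project(a_rl)  # closest safe action
    if mode == "masking":
        return mask(a_rl)        # agent only ever chooses from safe actions
    raise ValueError(f"unknown mode: {mode}")
```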
Safe Reinforcement Learning with Probabilistic Guarantees Satisfying Temporal Logic Specifications in Continuous Action Spaces
Hanna Krasowski, Prithvi Akella, Aaron D. Ames, and Matthias Althoff
In Proc. of the IEEE Conference on Decision and Control (CDC), 2023
Vanilla Reinforcement Learning (RL) can efficiently solve complex tasks but does not provide any guarantees on system behavior. To bridge this gap, we propose a three-step safe RL procedure for continuous action spaces that provides probabilistic guarantees with respect to temporal logic specifications. First, our approach probabilistically verifies a candidate controller with respect to a temporal logic specification while randomizing the control inputs to the system within a bounded set. Second, we improve this probabilistically verified controller by adding an RL agent that optimizes it for performance within the same bounded set around the control input. Third, we verify probabilistic safety guarantees with respect to temporal logic specifications for the learned agent. Our approach is efficiently implementable for continuous action and state spaces. The separation of safety verification and performance improvement into two distinct steps realizes both explicit probabilistic safety guarantees and a straightforward RL setup that focuses on performance. We evaluate our approach on an evasion task where a robot has to reach a goal while evading a dynamic obstacle with a specific maneuver. Our results show that our safe RL approach leads to efficient learning while maintaining its probabilistic safety specification.
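A sketch of the statistical flavor of the verification steps, using a one-sided Hoeffding bound on the empirical satisfaction rate as an illustrative stand-in for the paper's actual probabilistic verification machinery:

```python
import numpy as np

def satisfaction_lower_bound(robustness_values, confidence=0.99):
    """Lower-bound the probability that the specification holds
    (robustness > 0), from i.i.d. randomized rollouts."""
    r = np.asarray(robustness_values)
    p_hat = np.mean(r > 0)                       # empirical satisfaction rate
    eps = np.sqrt(np.log(1.0 / (1.0 - confidence)) / (2.0 * len(r)))
    return max(0.0, p_hat - eps)                 # valid with prob. >= confidence

mock = np.random.default_rng(0).normal(0.3, 0.2, size=1000)  # mock robustness values
print(satisfaction_lower_bound(mock))
```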
Stranding Risk for Underactuated Vessels in Complex Ocean Currents: Analysis and Controllers
Andreas Doering*, Marius Wiggert*, Hanna Krasowski, Manan Doshi, Pierre F.J. Lermusiaux, and Claire J. Tomlin
In Proc. of the IEEE Conference on Decision and Control (CDC), 2023
Low-propulsion vessels can take advantage of powerful ocean currents to navigate towards a destination. Recent results demonstrated that vessels can reach their destination with high probability despite forecast errors. However, these results do not consider a critical aspect of the safety of such vessels: because their propulsion is much smaller than the magnitude of the surrounding currents, they might end up in currents that inevitably push them into unsafe areas such as shallow waters, garbage patches, and shipping lanes. In this work, we first investigate the risk of stranding for passively floating vessels in the Northeast Pacific. We find that at least 5.04% would strand within 90 days. Next, we encode the unsafe sets as hard constraints into Hamilton-Jacobi Multi-Time Reachability to synthesize a feedback policy that is equivalent to re-planning at each time step at low computational cost. While applying this policy guarantees safe operation when the currents are known, in realistic situations only imperfect forecasts are available. Hence, we demonstrate the safety of our approach empirically with large-scale realistic simulations of a vessel navigating in high-risk regions in the Northeast Pacific. We find that applying our policy closed-loop with daily re-planning as new forecasts become available reduces stranding to below 1% despite forecast errors often exceeding the maximal propulsion. Our method significantly improves safety over the baselines and still achieves a timely arrival of the vessel at the destination.
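A toy Monte Carlo estimate of the stranding risk for passive drifters (1-D drift model, placeholder current and unsafe-set functions; the paper uses realistic ocean current data):

```python
import numpy as np

def stranding_rate(starts, current, unsafe, steps=90):
    """Fraction of passively floating vessels that enter an unsafe region
    within `steps` daily drift updates."""
    stranded = 0
    for x in starts:
        for _ in range(steps):
            x = x + current(x)
            if unsafe(x):
                stranded += 1
                break
    return stranded / len(starts)

rng = np.random.default_rng(1)
rate = stranding_rate(rng.uniform(0.0, 10.0, size=500),
                      current=lambda x: 0.05 * np.sin(x) + rng.normal(0, 0.1),
                      unsafe=lambda x: x < 0.0)   # "shore" at x = 0
print(f"{100 * rate:.1f}% stranded")
```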
2022
CommonOcean: Composable Benchmarks for Motion Planning on Oceans
Hanna Krasowski, and Matthias Althoff
In Proc. of the IEEE Int. Conf. on Intelligent Transportation Systems (ITSC), 2022
Autonomous vessels can increase safety and reduce emissions compared to human-operated vessels. One important task for autonomous vessels is motion planning. Currently, there are no benchmarks for autonomous vessels to compare different motion planning methods. Thus, we introduce composable benchmarks for motion planning on oceans (CommonOcean), which is available at commonocean.cps.cit.tum.de. A CommonOcean benchmark consists of three elements: cost function, vessel model, and motion planning scenario. Benchmarks can be conveniently composed using unique identifiers for these elements, which are highly modular. CommonOcean is easy to use, because we provide meaningful parameters for vessel models, various motion planning scenarios, and comprehensive documentation. Furthermore, we developed a scenario generation tool, which allows one to effortlessly create new scenarios from marine traffic data. We believe that CommonOcean will lead to better reproducibility and comparability of research on motion planning for vessels.
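A minimal sketch of the composition idea: a benchmark as a triple of element identifiers (the identifier scheme below is made up for illustration, not the official CommonOcean format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Benchmark:
    """A benchmark composed of the three elements named above."""
    scenario_id: str
    vessel_model_id: str
    cost_function_id: str

    @property
    def benchmark_id(self) -> str:
        # Unique identifiers make benchmarks composable and reproducible.
        return f"{self.vessel_model_id}:{self.cost_function_id}:{self.scenario_id}"

print(Benchmark("SCN-0001", "VM-1", "CF-2").benchmark_id)  # VM-1:CF-2:SCN-0001
```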
Safe Reinforcement Learning for Urban Driving using Invariably Safe Braking Sets
Hanna Krasowski*, Yinqiang Zhang*, and Matthias Althoff
In Proc. of the IEEE Int. Conf. on Intelligent Transportation Systems (ITSC), 2022
Deep reinforcement learning (RL) has been widely applied to motion planning problems of autonomous vehicles in urban traffic. However, traditional deep RL algorithms cannot ensure safe trajectories throughout training and deployment. To address this, we propose a provably safe RL algorithm for urban autonomous driving. We add a novel safety layer to the RL process to verify the safety of high-level actions before they are performed. Our safety layer is based on invariably safe braking sets to constrain actions for safe lane changing and safe intersection crossing. We introduce a generalized discrete high-level action space, which can represent all high-level intersection driving maneuvers and various desired accelerations. Finally, we conduct extensive experiments on the inD dataset containing urban driving scenarios. Our analysis demonstrates that the safe agent never causes a collision and that the safety layer’s lane changing verification can even improve the goal-reaching performance compared to the unsafe baseline agent.
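A simplified flavor of the braking-based safety check behind such a safety layer, assuming constant maximal deceleration and an ego reaction time (all parameter values are illustrative, not the paper's):

```python
def braking_is_safe(ego_pos, ego_speed, lead_pos, lead_speed,
                    a_max=8.0, delta=0.3):
    """True if the ego vehicle can always come to a stop behind the leading
    vehicle, even if the leader brakes fully at the same time."""
    ego_stop = ego_pos + ego_speed * delta + ego_speed**2 / (2 * a_max)
    lead_stop = lead_pos + lead_speed**2 / (2 * a_max)
    return ego_stop < lead_stop

print(braking_is_safe(0.0, 20.0, 40.0, 15.0))  # True for this gap
```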
2021
CommonRoad-RL: A Configurable Reinforcement Learning Environment for Motion Planning of Autonomous Vehicles
Xiao Wang, Hanna Krasowski, and Matthias Althoff
In Proc. of the IEEE Int. Conf. on Intelligent Transportation Systems (ITSC), 2021
Reinforcement learning (RL) methods have gained popularity in the field of motion planning for autonomous vehicles due to their success in robotics and computer games. However, no existing work enables researchers to conveniently compare different underlying Markov decision processes (MDPs). To address this issue, we present CommonRoad-RL, an open-source toolbox to train and evaluate RL-based motion planners for autonomous vehicles. The configurability, modularity, and stability of CommonRoad-RL simplify comparing different MDPs. This is demonstrated by comparing agents trained with different rewards, action spaces, and vehicle models on a real-world highway dataset. Our toolbox is available at commonroad.in.tum.de.
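A sketch of the configuration-driven comparison such a toolbox enables; the configuration keys and the training stub below are hypothetical, not the actual CommonRoad-RL API:

```python
# Two MDP designs differing in reward, action space, and vehicle model.
configs = [
    {"reward": "sparse", "action_space": "continuous", "vehicle_model": "kinematic_bicycle"},
    {"reward": "dense",  "action_space": "discrete",   "vehicle_model": "point_mass"},
]

def train_and_evaluate(config):
    """Placeholder for training an RL motion planner in the configured
    environment and reporting metrics such as goal-reaching rate."""
    print("training with", config)

for cfg in configs:
    train_and_evaluate(cfg)
```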
Temporal Logic Formalization of Marine Traffic Rules
Hanna Krasowski, and Matthias Althoff
In Proc. of the IEEE Intelligent Vehicles Symposium (IV), 2021
Autonomous vessels have to adhere to marine traffic rules to ensure traffic safety and reduce the liability of manufacturers. However, autonomous systems can only evaluate rule compliance if rules are formulated in a precise and mathematical way. This paper formalizes marine traffic rules from the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS) using temporal logic. In particular, the collision prevention rules between two power-driven vessels are delineated. The formulation is based on modular predicates and adjustable parameters. We evaluate the formalized rules in three US coastal areas for over 1,200 vessels using real marine traffic data.
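An illustrative predicate in the spirit of the formalization: a head-on encounter between two power-driven vessels, built from adjustable parameters (the angle and distance values below are assumptions, not the paper's calibrated parameters):

```python
import numpy as np

def head_on(ego_heading, other_heading, ego_pos, other_pos,
            angle_tol=np.deg2rad(10.0), dist_thresh=5556.0):
    """True if the vessels are on nearly reciprocal courses within a
    distance threshold (5556 m is about 3 nautical miles)."""
    reciprocal = abs(abs(ego_heading - other_heading) - np.pi) < angle_tol
    close = np.linalg.norm(np.asarray(ego_pos) - np.asarray(other_pos)) < dist_thresh
    return reciprocal and close

print(head_on(0.0, np.pi, (0.0, 0.0), (1000.0, 0.0)))  # True
```

Temporal operators then wrap such predicates, e.g., requiring that a give-way maneuver always follows whenever the predicate holds.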
2020
Safe Reinforcement Learning for Autonomous Lane Changing Using Set-Based Prediction
Hanna Krasowski*, Xiao Wang*, and Matthias Althoff
In Proc. of the IEEE Int. Conf. on Intelligent Transportation Systems (ITSC), 2020
Machine learning approaches often lack safety guarantees, which are a key requirement in many real-world tasks. This paper addresses this shortcoming by extending reinforcement learning with a safety layer that restricts the action space to the subspace of safe actions. We demonstrate the proposed approach for lane changing in autonomous driving. To distinguish safe actions from unsafe ones, we compare planned motions with the set of possible occupancies of traffic participants generated by set-based predictions. In situations where no safe action exists, a verified fail-safe controller is executed. We used real-world highway traffic data to train and test the proposed approach. The evaluation results show that the proposed approach trains agents that do not cause collisions during training and deployment.
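A minimal sketch of the safety layer's decision logic, with placeholder callables standing in for the set-based prediction tool and the fail-safe planner:

```python
def plan_is_safe(planned_positions, occupancy_sets):
    """A plan is safe if, at every future step, the ego position lies
    outside the predicted occupancy of all traffic participants.
    occupancy_sets[t](pos) -> True iff pos is inside an occupancy at step t."""
    return all(not occupied(pos)
               for pos, occupied in zip(planned_positions, occupancy_sets))

def safe_step(candidate_plans, occupancy_sets, fail_safe_plan):
    """Restrict the agent to safe plans; execute the verified fail-safe
    controller when no candidate plan is safe."""
    safe = [p for p in candidate_plans if plan_is_safe(p, occupancy_sets)]
    return safe[0] if safe else fail_safe_plan
```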