Lyapunov Actor-Critic (LAC)
See also
This document assumes you are familiar with the Soft Actor-Critic (SAC) algorithm. It is not meant to be a comprehensive guide but mainly highlights the differences between the SAC and Lyapunov Actor-Critic (LAC) algorithms. For more information, readers are referred to the original papers of Haarnoja et al., 2019 (SAC) and Han et al., 2020 (LAC).
Important
The LAC algorithm only guarantees stability in mean cost when trained on environments with a positive definite cost function (i.e. environments in which the cost is minimized). The opt_type argument can be set to maximize when training in environments where the reward is maximized. However, because Lyapunov's stability conditions are then no longer satisfied, the LAC algorithm no longer guarantees stability in mean cost.
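As an illustration, the snippet below shows how the optimisation direction could be selected when calling the PyTorch trainer (a minimal sketch; the environment names are purely illustrative):

    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac

    # Cost-based (minimisation) environment: the default setting, for which the
    # stability-in-mean-cost guarantee holds.
    lac(lambda: gym.make("Oscillator-v1"), opt_type="minimize")

    # Reward-based (maximisation) environment: training works, but the
    # stability-in-mean-cost guarantee no longer applies.
    lac(lambda: gym.make("Pendulum-v1"), opt_type="maximize")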
Background
The Lyapunov Actor-Critic (LAC) algorithm can be seen as a direct successor of the SAC algorithm. Although the SAC algorithm achieves impressive performance in various robotic control tasks, it does not guarantee that its actions result in stable system behaviour. From a control-theoretic perspective, stability is the most critical property of any control system since it is closely related to a robotic system's safety, robustness, and reliability. Using Lyapunov's method, the LAC algorithm addresses this issue through a data-based stability theorem that guarantees the system stays stable in mean cost.
Lyapunov critic function
The concept of Lyapunov stability is a useful and general approach for studying the stability of robotic systems. In Lyapunov's (direct) method, a scalar "energy-like" function, called a Lyapunov function, is constructed to analyse a system's stability. According to Lyapunov's stability conditions, a dynamic autonomous system

    \dot{s} = f(s)

is said to be stable around an equilibrium point if an "energy" function $V(s)$ exists such that, in some neighbourhood of that equilibrium point:

1. $V(s)$ and its partial derivatives are continuous,
2. $V(s)$ is positive definite,
3. $\dot{V}(s)$ is negative semi-definite.

If, in addition, $\dot{V}(s)$ is negative definite in that neighbourhood, the equilibrium is asymptotically stable.
In classical control theory, this concept is often used to design controllers that ensure that the difference of a Lyapunov function along the state trajectory is always negative definite. This results in stable closed-loop system dynamics, as the state is guaranteed to decrease the Lyapunov function's value and eventually converge to the equilibrium. The biggest challenge with this approach is that finding such a function is difficult and quickly becomes impractical. In learning-based methods, for example, since we do not have complete information about the system, finding such a Lyapunov function would require trying out all possible consecutive data pairs in the state space, i.e., verifying an infinite number of inequalities. The LAC algorithm solves this by taking a data-based approach in which the controller/policy and a Lyapunov critic function, both parameterised by deep neural networks, are jointly learned. In this way, the actor learns to control the system while only choosing actions that are guaranteed to be stable in mean cost. This inherent stability makes the LAC algorithm particularly useful for stabilisation and tracking tasks on robotic systems.
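For reference, the mean-cost stability condition at the heart of this data-based theorem can be stated roughly as follows (paraphrased from Han et al., 2020; the exact formulation, including the sampling distribution $\mu_{\pi}$ and additional boundedness conditions on $L$, follows the original paper):

    \mathbb{E}_{s \sim \mu_{\pi}}\left[\mathbb{E}_{s' \sim P_{\pi}}\left[L(s')\right] - L(s)\right] \;\le\; -\alpha_{3}\, \mathbb{E}_{s \sim \mu_{\pi}}\left[c(s, a)\right]

In words: on average over the states visited by the policy, the learned Lyapunov value must decrease by at least a fraction $\alpha_{3}$ of the cost, which is what ties the learned critic to stability in mean cost.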
Differences with the SAC algorithm
Like its predecessor, the LAC algorithm uses entropy regularisation to increase exploration and combines a Gaussian actor with a value critic to derive the best action. The main difference lies in how the critic network and the policy objective function are defined.
Critic network definition
The LAC algorithm uses a single Lyapunov critic instead of the double Q-Critic used in the SAC algorithm. This new Lyapunov critic is similar to the Q-Critic, but a square output activation function is used instead of an identity output activation function. This is done to ensure that the network output is positive, such that condition (2) of Lyapunov's stability conditions holds.
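A minimal sketch of such a critic in PyTorch is shown below (the layer sizes and class name are illustrative, not the exact SLC implementation):

    import torch
    import torch.nn as nn

    class LyapunovCritic(nn.Module):
        """Q-style critic with a square output activation so that L(s, a) >= 0."""

        def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
            super().__init__()
            layers, in_dim = [], obs_dim + act_dim
            for size in hidden_sizes:
                layers += [nn.Linear(in_dim, size), nn.ReLU()]
                in_dim = size
            layers.append(nn.Linear(in_dim, 1))
            self.net = nn.Sequential(*layers)

        def forward(self, obs, act):
            out = self.net(torch.cat([obs, act], dim=-1))
            # Squaring replaces the identity output activation and keeps the
            # Lyapunov estimate non-negative; squeeze to return shape (batch,).
            return torch.square(out).squeeze(-1)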
Similar to SAC, during training $L_c$ is updated by mean-squared Bellman error (MSBE) minimisation using the following objective function:

    J(L_c) = \mathbb{E}_{(s, a, c, s') \sim \mathcal{D}}\left[\frac{1}{2}\left(L_{c}(s, a) - L_{target}(s, a)\right)^{2}\right]

Where $L_{target}(s, a) = c + \gamma\, \mathbb{E}_{a' \sim \pi}\left[L_{c}'(s', a')\right]$ is the approximation target received from the infinite-horizon discounted return value function (with $L_{c}'$ the target network) and $\mathcal{D}$ the set of collected transition pairs.
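In code, this critic update could look roughly as follows (a sketch that assumes l_critic, l_critic_target, and policy are callables with the interfaces shown; it is not the exact SLC implementation):

    import torch

    def lyapunov_critic_loss(l_critic, l_critic_target, policy, batch, gamma=0.99):
        """Mean-squared Bellman error for the Lyapunov critic."""
        obs, act = batch["obs"], batch["act"]
        cost, obs_next = batch["cost"], batch["obs_next"]
        with torch.no_grad():
            # Sample next actions from the current policy and evaluate the
            # *target* Lyapunov critic to build the Bellman target.
            act_next, _ = policy(obs_next)
            l_target = cost + gamma * l_critic_target(obs_next, act_next)
        l_value = l_critic(obs, act)
        return 0.5 * torch.mean((l_value - l_target) ** 2)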
Important
As explained by Han et al., 2020, the sum of costs over a finite time horizon can also be used as the approximation target (see Han et al., 2020, eq. (9)):

    L_{target}(s_t, a_t) = \mathbb{E}\left[\sum_{t'=t}^{t+N} c(s_{t'}, a_{t'})\right]
To use this Lyapunov candidate, supply the LAC algorithm with the horizon_length=N argument, where N is the length of the time horizon you want to use, as in the example below.
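A minimal sketch (the environment name is illustrative):

    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac

    # Use a finite-horizon (N = 5) sum of costs as the Lyapunov critic target
    # instead of the default infinite-horizon Bellman backup.
    lac(lambda: gym.make("Oscillator-v1"), horizon_length=5)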
See also
The SLC package also contains a LAC implementation that uses a double Q-Critic (i.e., a Lyapunov Twin Critic). For more information about this version, see the LAC Twin Critic documentation. This version can be used by specifying the latc algorithm in the CLI, by supplying the lac() function with the actor_critic=LyapunovActorTwinCritic argument, or by directly calling the latc() function, as sketched below.
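A hedged sketch of these three options (the CLI invocation and import paths are assumptions based on the patterns used elsewhere in this documentation, not verified commands):

    # Option 1: select the algorithm from the command line (assumed CLI pattern).
    #   python -m stable_learning_control.run latc --env Oscillator-v1

    # Option 2: pass the twin-critic actor-critic class to lac()
    # (the import path of LyapunovActorTwinCritic is an assumption).
    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac
    from stable_learning_control.algos.pytorch.policies import LyapunovActorTwinCritic

    lac(lambda: gym.make("Oscillator-v1"), actor_critic=LyapunovActorTwinCritic)

    # Option 3: call the latc() function directly (assumed module path).
    from stable_learning_control.algos.pytorch.latc import latc

    latc(lambda: gym.make("Oscillator-v1"))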
Policy function definition
In the LAC algorithm, the policy is optimised according to

    \min_{\theta}\; \mathbb{E}_{(s, a, c, s') \sim \mathcal{D},\, \epsilon \sim \mathcal{N}}\left[\lambda\left(L_{c}(s', f_{\theta}(\epsilon, s')) - L_{c}(s, a) + \alpha_{3} c\right) + \beta\left[\log \pi_{\theta}(f_{\theta}(\epsilon, s)\,|\,s) + \mathcal{H}_{target}\right]\right]

In this, $\pi_{\theta}$ represents the squashed Gaussian policy, $f_{\theta}(\epsilon, s)$ its reparameterised sample,
and $\mathcal{H}_{target}$ the desired minimum expected entropy. When comparing this objective with the policy loss used in the SAC algorithm, several differences stand out. First, the objective is minimised instead of maximised in the LAC algorithm. With the LAC algorithm, the goal is to stabilise a system or track a given reference. In these cases, instead of achieving a high return, we want to reduce the difference between the current state and a reference or equilibrium state. This leads us to the second difference: the term in the SAC algorithm that represents the Q-values, $Q(s, a)$, is in the LAC algorithm replaced by $\lambda\left(L_{c}(s', f_{\theta}(\epsilon, s')) - L_{c}(s, a) + \alpha_{3} c\right)$.
As a result, in the LAC algorithm, the loss function now increases the probability of actions that keep the system close to the equilibrium or reference value while decreasing the likelihood of actions that drive the system away from these values. The quadratic regularisation term ensures that the mean cost is positive. The $\lambda$ term represents the Lyapunov Lagrange multiplier and is responsible for weighting the relative importance of the stability condition. Similar to the entropy Lagrange multiplier used in the SAC algorithm, this term is updated by

    \lambda \leftarrow \max\left(0,\; \lambda + \delta\, \mathbb{E}_{(s, a, c, s') \sim \mathcal{D}}\left[L_{c}(s', f_{\theta}(\epsilon, s')) - L_{c}(s, a) + \alpha_{3} c\right]\right)

where $\delta$ is the learning rate. This is done to constrain the average Lyapunov value during training.
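A rough PyTorch sketch of this multiplier update (it assumes the multiplier is parameterised as labda = exp(log_labda) to keep it positive, which is a common implementation trick and an assumption here, not the confirmed SLC implementation):

    import torch

    log_labda = torch.zeros(1, requires_grad=True)  # labda = exp(log_labda) stays positive
    labda_optimizer = torch.optim.Adam([log_labda], lr=3e-4)

    def update_labda(l_value, l_value_next, cost, alpha3=0.2):
        """Gradient ascent on the Lyapunov constraint violation."""
        labda = log_labda.exp()
        # Constraint term: E[L_c(s', a') - L_c(s, a) + alpha3 * c]; detached
        # because only the multiplier is updated here.
        constraint = (l_value_next - l_value + alpha3 * cost).detach().mean()
        # Minimising -labda * constraint performs gradient *ascent* on labda.
        labda_loss = -(labda * constraint)
        labda_optimizer.zero_grad()
        labda_loss.backward()
        labda_optimizer.step()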
Quick Fact
LAC is an off-policy algorithm.
It is guaranteed to be stable in mean cost.
The version of LAC implemented here can only be used for environments with continuous action spaces.
An alternate version of LAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.
The SLC implementation of LAC does not support parallelisation.
Further Reading
For more information on the LAC algorithm, please check out the original paper of Han et al., 2020.
Pseudocode
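The pseudocode figure from the original documentation is not reproduced here; the outline below summarises the training loop in comment form (paraphrased from Han et al., 2020, with details such as the exact update order treated as assumptions):

    # LAC training loop (sketch)
    # Input: initial policy parameters theta, Lyapunov critic parameters phi,
    #        Lagrange multipliers beta (entropy) and lambda (Lyapunov),
    #        empty replay buffer D; set target parameters phi_targ <- phi.
    # repeat:
    #     Observe state s, sample a ~ pi_theta(.|s), step the environment, and
    #     store the transition (s, a, c, s') in D.
    #     if it is time to update:
    #         for each gradient step:
    #             Sample a minibatch B from D.
    #             Update the Lyapunov critic by minimising the MSBE objective on B.
    #             Update the policy by minimising the Lyapunov-constrained
    #             policy objective on B.
    #             Update the Lagrange multipliers lambda and beta from their
    #             respective constraint terms.
    #             Update the target network:
    #             phi_targ <- polyak * phi_targ + (1 - polyak) * phi.
    # until convergence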
Implementation
You Should Know
In what follows, we give documentation for the PyTorch and TensorFlow implementations of LAC in SLC. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.
Algorithm: PyTorch Version
- stable_learning_control.algos.pytorch.lac.lac(env_fn, actor_critic=None, ac_kwargs={'activation': <class 'torch.nn.modules.activation.ReLU'>, 'hidden_sizes': {'actor': [256, 256], 'critic': [256, 256]}, 'output_activation': <class 'torch.nn.modules.activation.ReLU'>}, opt_type='minimize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_labda_final=1e-10, lr_decay_type='linear', lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_labda_decay_type=None, lr_decay_ref='epoch', batch_size=256, replay_size=1000000, horizon_length=0, seed=None, device='cpu', logger_kwargs={}, save_freq=1, start_policy=None, export=False)[source]
Trains the LAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) –

The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

Call | Output Shape | Description
act | (batch, act_dim) | Numpy array of actions for each observation.
Q*/L | (batch,) | Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

Calling pi should return:

Symbol | Shape | Description
a | (batch, act_dim) | Tensor containing actions from policy given observations.
logp_pi | (batch,) | Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

Defaults to LyapunovActorCritic.
ac_kwargs (dict, optional) –

Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

Kwarg | Value
hidden_sizes_actor | 256 x 2
hidden_sizes_critic | 256 x 2
activation | torch.nn.ReLU
output_activation | torch.nn.ReLU
opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to minimize.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent. Defaults to 100.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.
start_steps (int, optional) – Number of steps for uniform-random action selection before running the real policy. Helps exploration. Defaults to 0.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to 1000.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/steps_per_update). Defaults to 100.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor. (Always between 0 and 1.) Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{targ} \leftarrow \rho\, \theta_{targ} + (1 - \rho)\, \theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated according to $\mathcal{H}_{target} = -\prod(\text{action\_space.shape})$.
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_labda_final (float, optional) – The final labda learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear, exponential, and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_labda_decay_type (str, optional) – The learning rate decay type that is used for the labda learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.
batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.
replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.
horizon_length (int, optional) – The length of the finite horizon used for the Lyapunov critic target. Defaults to 0, meaning the infinite-horizon Bellman backup is used.
seed (int) – Seed for random number generators. Defaults to None.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.
logger_kwargs (dict, optional) – Keyword args for EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model as a TorchScript such that it can be deployed on hardware. Defaults to False.
- Returns:
tuple containing:
policy (LAC): The trained actor-critic policy.
replay_buffer (Union[ReplayBuffer, FiniteHorizonReplayBuffer]): The replay buffer used during training.
- Return type:
(tuple)
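A minimal usage sketch of this function (the environment name and keyword values are illustrative; only parameters documented above are used):

    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac

    # Train LAC on a cost-based gymnasium environment.
    policy, replay_buffer = lac(
        lambda: gym.make("Oscillator-v1"),
        opt_type="minimize",
        epochs=50,
        gamma=0.99,
        lr_a=1e-4,
        lr_c=3e-4,
        device="cpu",
    )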
Saved Model Contents: PyTorch Version
The PyTorch version of the LAC algorithm is implemented by subclassing the torch.nn.Module class. As a result, the model weights are saved using the model_state dictionary (state_dict). These saved weights can be found in the torch_save/model_state.pt file. For an example of how to load a model using this file, see Experiment Outputs or the PyTorch documentation.
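A hedged sketch of loading such a checkpoint (it assumes the file holds a regular state_dict and that the same actor-critic module has been rebuilt; the construction step is omitted and the variable names are illustrative):

    import torch

    # Rebuild the same actor-critic architecture that was used during training,
    # e.g. ac = LyapunovActorCritic(observation_space, action_space).
    state_dict = torch.load("path/to/torch_save/model_state.pt", map_location="cpu")
    # ac.load_state_dict(state_dict)
    # ac.eval()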
Algorithm: TensorFlow Version
Attention
The TensorFlow version is still experimental. It is not guaranteed to work, and it is not guaranteed to be up-to-date with the PyTorch version.
- stable_learning_control.algos.tf2.lac.lac(env_fn, actor_critic=None, ac_kwargs={'activation': <function relu>, 'hidden_sizes': {'actor': [256, 256], 'critic': [256, 256]}, 'output_activation': <function relu>}, opt_type='minimize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_labda_final=1e-10, lr_decay_type='linear', lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_labda_decay_type=None, lr_decay_ref='epoch', batch_size=256, replay_size=1000000, seed=None, horizon_length=0, device='cpu', logger_kwargs={}, save_freq=1, start_policy=None, export=False)[source]
Trains the LAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (tf.Module, optional) –

The constructor method for a TensorFlow Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

Call | Output Shape | Description
act | (batch, act_dim) | Numpy array of actions for each observation.
Q*/L | (batch,) | Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

Calling pi should return:

Symbol | Shape | Description
a | (batch, act_dim) | Tensor containing actions from policy given observations.
logp_pi | (batch,) | Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

Defaults to LyapunovActorCritic.
ac_kwargs (dict, optional) –

Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

Kwarg | Value
hidden_sizes_actor | 256 x 2
hidden_sizes_critic | 256 x 2
activation | tf.nn.relu
output_activation | tf.nn.relu
opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to minimize.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent. Defaults to 100.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.
start_steps (int, optional) – Number of steps for uniform-random action selection before running the real policy. Helps exploration. Defaults to 0.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to 1000.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/steps_per_update). Defaults to 100.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor. (Always between 0 and 1.) Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{targ} \leftarrow \rho\, \theta_{targ} + (1 - \rho)\, \theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated according to $\mathcal{H}_{target} = -\prod(\text{action\_space.shape})$.
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_labda_final (float, optional) – The final labda learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear, exponential, and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_labda_decay_type (str, optional) – The learning rate decay type that is used for the labda learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.
batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.
replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.
horizon_length (int, optional) – The length of the finite horizon used for the Lyapunov critic target. Defaults to 0, meaning the infinite-horizon Bellman backup is used.
seed (int) – Seed for random number generators. Defaults to None.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.
logger_kwargs (dict, optional) – Keyword args for EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model in the SavedModel format such that it can be deployed to hardware. Defaults to False.
- Returns:
tuple containing:
policy (LAC): The trained actor-critic policy.
replay_buffer (Union[ReplayBuffer, FiniteHorizonReplayBuffer]): The replay buffer used during training.
- Return type:
(tuple)
Saved Model Contents: TensorFlow Version
The TensorFlow version of the LAC algorithm is implemented by subclassing the tf.keras.Model class. As a result, both the full model and the current model weights are saved. The complete model can be found in the saved_model.pb file, while the current weights checkpoints are found in the tf_safe/weights_checkpoint* file. For an example of using these two methods, see Experiment Outputs or the TensorFlow documentation.
References
Relevant Papers
Actor-Critic Reinforcement Learning for Control with Stability Guarantee, Han et al., 2020
The general problem of the stability of motion, J. Mawhin, 2005
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al., 2018
Soft Actor-Critic: Algorithms and Applications, Haarnoja et al., 2019
Acknowledgements
Parts of this documentation are directly copied, with the authors' consent, from the original paper of Han et al., 2020.