stable_learning_control.algos.tf2

Contains the TensorFlow 2.x implementations of the RL algorithms.

Classes

`LAC`	The Lyapunov (soft) Actor-Critic (LAC) algorithm.
`SAC`	The Soft Actor Critic algorithm.

Package Contents

class stable_learning_control.algos.tf2.LAC(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation=nn.relu, output_activation={'actor': nn.relu}), opt_type='minimize', alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, device='cpu', name='LAC')[source]

Bases: tf.keras.Model

The Lyapunov (soft) Actor-Critic (LAC) algorithm.

ac

The (lyapunov) actor critic module.

Type:: tf.Module

ac_targ

The (lyapunov) target actor critic module.

Type:: tf.Module

log_alpha

The temperature Lagrange multiplier.

Type:: tf.Variable

log_labda

The Lyapunov Lagrange multiplier.

Type:: tf.Variable

Initialise the LAC algorithm.

Parameters:

env (gym.env) – The gymnasium environment the LAC is training in. This is used to retrieve the action and observation space dimensions, which determine the network sizes. The environment must satisfy the gymnasium API.

actor_critic (tf.Module, optional) –

The constructor method for a TensorFlow Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

Call	Output Shape	Description
`act`	(batch, act_dim)	Numpy array of actions for each observation.
`Q*/L`	(batch,)	Tensor containing one current estimate of `Q*/L` for the provided observations and actions. (Critical: make sure to flatten this!)

Calling pi should return:

Symbol	Shape	Description
`a`	(batch, act_dim)	Tensor containing actions from policy given observations.
`logp_pi`	(batch,)	Tensor containing log probabilities of actions in `a`. Importantly: gradients should be able to flow back into `a`.

Defaults to LyapunovActorCritic

ac_kwargs (dict, optional) –
Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

Kwarg	Value
`hidden_sizes_actor`	`256 x 2`
`hidden_sizes_critic`	`256 x 2`
`activation`	`tf.nn.relu`
`output_activation`	`tf.nn.relu`

opt_type (str, optional) – The optimization type you want to use. Options maximize and minimize. Defaults to minimize.
alpha (float, optional) – Entropy regularization coefficient (Equivalent to inverse of reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99 per Haarnoja et al. 2018, not 0.995 as in Han et al. 2020.
polyak (float, optional) –
Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:

$\theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta$

where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) –
Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space. This can be calculated according to:

$-\prod_{i=0}^{n} \text{action\_dim}_{i}$
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum-entropy RL. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.

Attention

This class will behave differently when the actor_critic argument is set to the LyapunovActorTwinCritic. For more information see the LATC documentation.

_device

_setup_kwargs

_act_dim

_obs_dim

_adaptive_temperature

_opt_type

_polyak

_gamma

_alpha3

_lr_a

_lr_lag

_lr_c

_use_twin_critic = False

log_alpha

log_labda

actor_critic

ac

ac_targ

_pi_optimizer

_pi_params

_log_labda_optimizer

_c_params

_c_optimizer

call(s, deterministic=False)[source]

Wrapper around the get_action() method that enables users to also receive actions directly by invoking LAC(observations).

Parameters:

s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

get_action(s, deterministic=False)[source]

Returns the current action of the policy.

Parameters:

s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

update(data)[source]

Update the actor critic network using stochastic gradient descent.

Parameters:: data (dict) – Dictionary containing a batch of experiences.

save(path, checkpoint_name='checkpoint')[source]

Can be used to save the current model state.

Parameters:

path (str) – The path where you want to save the policy.
checkpoint_name (str) – The name you want to use for the checkpoint.

Raises:

Exception – Raises an exception if something goes wrong during saving.

Note

This function saves the model weights using the tf.keras.Model.save_weights() method (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#save_weights). The model should therefore be restored using the tf.keras.Model.load_weights() method (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#load_weights). If you want to deploy the full model, use the export() method instead.

restore(path, restore_lagrance_multipliers=False)[source]

Restores an already trained policy. Used for transfer learning.

Parameters:

path (str) – The path where the model state_dict of the policy is found.
restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.

Raises:

Exception – Raises an exception if something goes wrong during loading.

export(path)[source]

Can be used to export the model in the SavedModel format such that it can be deployed to hardware.

Parameters:: path (str) – The path where you want to export the policy to.

build()[source]: Function that can be used to build the full model structure such that it can be visualized using the tf.keras.Model.summary(). This is done by calling the build method of the parent class with the correct input shape.

Note

This is done by calling the build methods of the submodules.

summary()[source]: Small wrapper around the tf.keras.Model.summary() method used to apply a custom format to the model summary.

full_summary()[source]: Prints a full summary of all the layers of the TensorFlow model.

set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None, lr_labda=None)[source]

Adjusts the learning rates of the optimizers.

Parameters:

lr_a (float, optional) – The learning rate of the actor optimizer. Defaults to None.
lr_c (float, optional) – The learning rate of the (Lyapunov) Critic. Defaults to None.
lr_alpha (float, optional) – The learning rate of the temperature optimizer. Defaults to None.
lr_labda (float, optional) – The learning rate of the Lyapunov Lagrance multiplier optimizer. Defaults to None.

_init_targets()[source]: Updates the target network weights to the main network weights.

_update_targets()[source]: Updates the target networks based on an exponential moving average (Polyak averaging).
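The averaging rule given for the polyak parameter can be illustrated with a minimal, framework-free sketch. The helper name `polyak_update` and the use of plain Python lists are assumptions made for illustration only; the class itself works on TensorFlow variables and its internals may differ.

```python
def polyak_update(target, main, polyak=0.995):
    """Return Polyak-averaged target weights (hypothetical helper).

    Applies theta_targ <- rho * theta_targ + (1 - rho) * theta element-wise,
    where rho is the `polyak` factor.
    """
    if not 0.0 <= polyak <= 1.0:
        raise ValueError("polyak must lie in [0, 1]")
    return [polyak * t + (1.0 - polyak) * m for t, m in zip(target, main)]


target_weights = [0.0, 1.0]
main_weights = [1.0, 1.0]

# With polyak=0.5 the target moves halfway towards the main weights.
updated = polyak_update(target_weights, main_weights, polyak=0.5)
print(updated)  # [0.5, 1.0]
```

A polyak value of 1.0 leaves the target weights untouched, while 0.0 copies the main weights outright; values close to 1 give the slow-moving targets the parameter description refers to.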

property alpha
Property used to clip alpha to be greater than or equal to 0.0, which prevents it from becoming nan when log_alpha becomes -inf. No upper bound is applied to alpha.

property labda
Property used to clip labda to be greater than or equal to 0.0, which prevents it from becoming nan when log_labda becomes -inf. It is further clipped to be less than or equal to 1.0 to prevent labda from exploding when the hyperparameters are chosen badly.
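As a rough illustration of the clipping described for alpha and labda, the sketch below assumes each multiplier is recovered as the exponential of its log-variable. The attribute names log_alpha and log_labda suggest this, but it is not confirmed here. The function names are hypothetical and plain floats stand in for TensorFlow variables.

```python
import math


def clipped_alpha(log_alpha):
    """Assumed alpha = exp(log_alpha), lower-bounded at 0.0 (no upper bound)."""
    return max(math.exp(log_alpha), 0.0)


def clipped_labda(log_labda):
    """Assumed labda = exp(log_labda), clipped to the interval [0.0, 1.0]."""
    return min(max(math.exp(log_labda), 0.0), 1.0)


print(clipped_alpha(float("-inf")))  # 0.0
print(clipped_labda(2.0))            # 1.0 (exp(2.0) exceeds the upper bound)
```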

property target_entropy
The target entropy used while learning the entropy temperature alpha.
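The default described for the target_entropy parameter, the negative product of the action-space dimensions, can be sketched as follows. The function name `default_target_entropy` and the tuple-shaped argument are assumptions for illustration; how the class reads the shape from the environment is not shown here.

```python
import math


def default_target_entropy(action_shape):
    """Negative product of the action-space dimensions (hypothetical helper)."""
    return -float(math.prod(action_shape))


# A three-dimensional continuous action space.
print(default_target_entropy((3,)))  # -3.0
```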

property device
The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).

class stable_learning_control.algos.tf2.SAC(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.relu, 'critic': nn.relu}, output_activation={'actor': nn.relu, 'critic': None}), opt_type='maximize', alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, device='cpu', name='SAC')[source]

Bases: tf.keras.Model

The Soft Actor Critic algorithm.

ac

The (soft) actor critic module.

Type:: tf.Module

ac_targ

The (soft) target actor critic module.

Type:: tf.Module

log_alpha

The temperature Lagrange multiplier.

Type:: tf.Variable

Initialise the SAC algorithm.

Parameters:

env (gym.env) – The gymnasium environment the SAC is training in. This is used to retrieve the action and observation space dimensions, which determine the network sizes. The environment must satisfy the gymnasium API.

actor_critic –

The constructor method for a TensorFlow Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

Call	Output Shape	Description
`act`	(batch, act_dim)	Numpy array of actions for each observation.
`Q*/L`	(batch,)	Tensor containing one current estimate of `Q*/L` for the provided observations and actions. (Critical: make sure to flatten this!)

_device

_setup_kwargs

_act_dim

_obs_dim

_adaptive_temperature

_opt_type

_polyak

_gamma

_lr_a

_lr_c

log_alpha

actor_critic

ac

ac_targ

_pi_optimizer

_pi_params

_c_params

_c_optimizer

call(s, deterministic=False)[source]

Wrapper around the get_action() method that enables users to also receive actions directly by invoking SAC(observations).

Parameters:

s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

get_action(s, deterministic=False)[source]

Returns the current action of the policy.

Parameters:

s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

update(data)[source]

Update the actor critic network using stochastic gradient descent.

Parameters:: data (dict) – Dictionary containing a batch of experiences.

save(path, checkpoint_name='checkpoint')[source]

Can be used to save the current model state.

Parameters:

path (str) – The path where you want to save the policy.
checkpoint_name (str) – The name you want to use for the checkpoint.

Raises:

Exception – Raises an exception if something goes wrong during saving.

Note

This function saves the model weights using the tf.keras.Model.save_weights() method (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#save_weights). The model should therefore be restored using the tf.keras.Model.load_weights() method (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#load_weights). If you want to deploy the full model, use the export() method instead.

restore(path, restore_lagrance_multipliers=False)[source]

Restores an already trained policy. Used for transfer learning.

Parameters:

path (str) – The path where the model state_dict of the policy is found.
restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.

Raises:

Exception – Raises an exception if something goes wrong during loading.

export(path)[source]

Can be used to export the model in the SavedModel format such that it can be deployed to hardware.

Parameters:: path (str) – The path where you want to export the policy to.

build()[source]: Function that can be used to build the full model structure such that it can be visualized using the tf.keras.Model.summary(). This is done by calling the build method of the parent class with the correct input shape.

Note

This is done by calling the build methods of the submodules.

summary()[source]: Small wrapper around the tf.keras.Model.summary() method used to apply a custom format to the model summary.

full_summary()[source]: Prints a full summary of all the layers of the TensorFlow model.

set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None)[source]

Adjusts the learning rates of the optimizers.

Parameters:

lr_a (float, optional) – The learning rate of the actor optimizer. Defaults to None.
lr_c (float, optional) – The learning rate of the (soft) Critic. Defaults to None.
lr_alpha (float, optional) – The learning rate of the temperature optimizer. Defaults to None.

_init_targets()[source]: Updates the target network weights to the main network weights.

_update_targets()[source]: Updates the target networks based on an exponential moving average (Polyak averaging).

property alpha
Property used to clip alpha to be greater than or equal to 0.0, which prevents it from becoming nan when log_alpha becomes -inf. No upper bound is applied to alpha.

property target_entropy
The target entropy used while learning the entropy temperature alpha.

property device
The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).
