stable_learning_control.algos.tf2.sac

A Soft Actor Critic Agent.

Submodules

Classes

SAC

The Soft Actor Critic algorithm.

Functions

sac(env_fn[, actor_critic, ac_kwargs, opt_type, ...])

Trains the SAC algorithm in a given environment.

Package Contents

class stable_learning_control.algos.tf2.sac.SAC(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.relu, 'critic': nn.relu}, output_activation={'actor': nn.relu, 'critic': None}), opt_type='maximize', alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, device='cpu', name='SAC')[source]

Bases: tf.keras.Model

The Soft Actor Critic algorithm.

ac

The (soft) actor critic module.

Type:

tf.Module

ac_

The (soft) target actor critic module.

Type:

tf.Module

log_alpha

The temperature Lagrange multiplier.

Type:

tf.Variable

Initialise the SAC algorithm.

Parameters:
  • env (gym.env) – The gymnasium environment the SAC is training in. This is used to retrieve the action and observation space dimensions, which are used when creating the network sizes. The environment must satisfy the gymnasium API.

  • actor_critic

    The constructor method for a TensorFlow Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

    Call    Output Shape        Description
    act     (batch, act_dim)    NumPy array of actions for each observation.
    Q*/L    (batch,)            Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

_device
_setup_kwargs
_act_dim
_obs_dim
_adaptive_temperature
_opt_type
_polyak
_gamma
_lr_a
_lr_c
log_alpha
actor_critic
ac
ac_targ
_pi_optimizer
_pi_params
_c_params
_c_optimizer
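
A minimal construction sketch is shown below; it is hedged, the environment name is illustrative, and only a few of the documented keyword arguments are passed explicitly (the rest fall back to their defaults).

    import gymnasium as gym

    from stable_learning_control.algos.tf2.sac import SAC

    # Any gymnasium environment with a continuous (Box) action space; the name is illustrative.
    env = gym.make("Pendulum-v1")

    # Construct the agent; unspecified arguments keep the documented defaults.
    agent = SAC(env, gamma=0.99, polyak=0.995, lr_a=1e-4, lr_c=3e-4)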
call(s, deterministic=False)[source]

Wrapper around the get_action() method that enables users to also receive actions directly by invoking SAC(observations).

Parameters:
  • s (numpy.ndarray) – The current state.

  • deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

get_action(s, deterministic=False)[source]

Returns the current action of the policy.

Parameters:
  • s (numpy.ndarray) – The current state.

  • deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray
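
A hedged usage sketch of the two action interfaces described above; invoking the agent directly simply forwards to get_action().

    import gymnasium as gym

    from stable_learning_control.algos.tf2.sac import SAC

    env = gym.make("Pendulum-v1")  # illustrative environment
    agent = SAC(env)

    obs, _ = env.reset()

    # Stochastic action from the current policy.
    stochastic_action = agent.get_action(obs)

    # Deterministic action via the call() wrapper, i.e. by invoking the agent directly.
    deterministic_action = agent(obs, deterministic=True)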

update(data)[source]

Update the actor critic network using stochastic gradient descent.

Parameters:

data (dict) – Dictionary containing a batch of experiences.
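
The keys of the experience dictionary are not listed here; the sketch below assumes a SpinningUp-style batch layout ('obs', 'obs_next', 'act', 'rew', 'done'), which is an assumption rather than the documented interface, so treat it only as an illustration of the call.

    import gymnasium as gym
    import tensorflow as tf

    from stable_learning_control.algos.tf2.sac import SAC

    env = gym.make("Pendulum-v1")  # illustrative environment
    agent = SAC(env)

    batch_size = 256
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.shape[0]

    # Hypothetical batch of experiences; the real replay buffer defines the actual keys.
    data = {
        "obs": tf.random.normal((batch_size, obs_dim)),
        "obs_next": tf.random.normal((batch_size, obs_dim)),
        "act": tf.random.normal((batch_size, act_dim)),
        "rew": tf.random.normal((batch_size,)),
        "done": tf.zeros((batch_size,)),
    }

    # One stochastic gradient update of the actor, critic(s) and (optionally) the temperature.
    agent.update(data)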

save(path, checkpoint_name='checkpoint')[source]

Can be used to save the current model state.

Parameters:
  • path (str) – The path where you want to save the policy.

  • checkpoint_name (str) – The name you want to use for the checkpoint.

Raises:

Exception – Raises an exception if something goes wrong during saving.

Note

This function saves the model weights using the tf.keras.Model.save_weights() method (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#save_weights). The model should therefore be restored using the tf.keras.Model.load_weights() method (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#load_weights). If you want to deploy the full model, use the export() method instead.

restore(path, restore_lagrance_multipliers=False)[source]

Restores an already trained policy. Used for transfer learning.

Parameters:
  • path (str) – The path where the model state_dict of the policy is found.

  • restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.

Raises:

Exception – Raises an exception if something goes wrong during loading.
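
A hedged save/restore roundtrip based on the weight-saving mechanism described in the note above; the checkpoint path and name are illustrative.

    import gymnasium as gym

    from stable_learning_control.algos.tf2.sac import SAC

    env = gym.make("Pendulum-v1")  # illustrative environment
    agent = SAC(env)

    # Save the current weights (path and checkpoint name are illustrative).
    agent.save("./checkpoints/sac_pendulum", checkpoint_name="checkpoint")

    # Restore the weights into a freshly constructed agent, e.g. for transfer learning.
    new_agent = SAC(env)
    new_agent.restore("./checkpoints/sac_pendulum", restore_lagrance_multipliers=False)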

export(path)[source]

Can be used to export the model in the SavedModel format such that it can be deployed to hardware.

Parameters:

path (str) – The path where you want to export the policy to.
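
A hedged deployment sketch: it assumes the exported artifact can be read back with the standard tf.saved_model.load API, the export directory is illustrative, and the exact directory layout produced by export() may differ.

    import gymnasium as gym
    import tensorflow as tf

    from stable_learning_control.algos.tf2.sac import SAC

    env = gym.make("Pendulum-v1")  # illustrative environment
    agent = SAC(env)

    # Export the model in the SavedModel format (directory is illustrative).
    agent.export("./exported_policies/sac_pendulum")

    # The exported policy can then be loaded without the training code, e.g. on deployment hardware.
    deployed_policy = tf.saved_model.load("./exported_policies/sac_pendulum")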

build()[source]

Function that can be used to build the full model structure so that it can be visualized using tf.keras.Model.summary(). This is done by calling the build method of the parent class with the correct input shape.

Note

This is done by calling the build methods of the submodules.

summary()[source]

Small wrapper around the tf.keras.Model.summary() method used to apply a custom format to the model summary.

full_summary()[source]

Prints a full summary of all the layers of the TensorFlow model.
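
A short usage sketch of these inspection helpers (the environment is illustrative):

    import gymnasium as gym

    from stable_learning_control.algos.tf2.sac import SAC

    env = gym.make("Pendulum-v1")  # illustrative environment
    agent = SAC(env)

    # Build the model structure so the summaries can display the layer shapes.
    agent.build()

    # Custom-formatted summary, or a full per-layer breakdown of all submodules.
    agent.summary()
    agent.full_summary()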

set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None)[source]

Adjusts the learning rates of the optimizers.

Parameters:
  • lr_a (float, optional) – The learning rate of the actor optimizer. Defaults to None.

  • lr_c (float, optional) – The learning rate of the (soft) critic optimizer. Defaults to None.

  • lr_alpha (float, optional) – The learning rate of the temperature optimizer. Defaults to None.
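
For example, to lower the learning rates partway through training (the values and environment are illustrative):

    import gymnasium as gym

    from stable_learning_control.algos.tf2.sac import SAC

    env = gym.make("Pendulum-v1")  # illustrative environment
    agent = SAC(env)

    # Reduce the actor, critic and temperature learning rates; any argument left as
    # None keeps its current value.
    agent.set_learning_rates(lr_a=5e-5, lr_c=1e-4, lr_alpha=5e-5)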

_init_targets()[source]

Updates the target network weights to the main network weights.

_update_targets()[source]

Updates the target networks based on an exponential moving average (Polyak averaging).
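
This update corresponds to standard Polyak averaging, theta_targ <- rho * theta_targ + (1 - rho) * theta. Below is a minimal stand-alone sketch of that rule, not the class's internal code.

    import tensorflow as tf

    def polyak_update(target_vars, main_vars, polyak=0.995):
        """Exponential moving average (Polyak) update of the target network variables."""
        for targ_var, main_var in zip(target_vars, main_vars):
            targ_var.assign(polyak * targ_var + (1.0 - polyak) * main_var)

    # Example with two toy variables.
    main = tf.Variable([1.0, 2.0])
    target = tf.Variable([0.0, 0.0])
    polyak_update([target], [main], polyak=0.995)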

property alpha

Property used to clip alpha so that it is greater than or equal to 0.0, which prevents it from becoming NaN when log_alpha becomes -inf. No upper bound is applied to alpha.
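
A hedged sketch of the clipping behaviour this property describes (not necessarily the exact implementation):

    import numpy as np
    import tensorflow as tf

    log_alpha = tf.Variable(0.0, dtype=tf.float32)

    # Clip the temperature from below at 0.0; no upper bound is applied.
    alpha = tf.clip_by_value(tf.exp(log_alpha), clip_value_min=0.0, clip_value_max=np.inf)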
property target_entropy

The target entropy used while learning the entropy temperature alpha.
property device

The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).
stable_learning_control.algos.tf2.sac.sac(env_fn, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.relu, 'critic': nn.relu}, output_activation={'actor': nn.relu, 'critic': None}), opt_type='maximize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_decay_type=DEFAULT_DECAY_TYPE, lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_decay_ref=DEFAULT_DECAY_REFERENCE, batch_size=256, replay_size=int(1000000.0), seed=None, device='cpu', logger_kwargs=dict(), save_freq=1, start_policy=None, export=False)[source]

Trains the SAC algorithm in a given environment.

Parameters:
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.

  • actor_critic (tf.Module, optional) –

    The constructor method for a TensorFlow Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

    Call    Output Shape        Description
    act     (batch, act_dim)    NumPy array of actions for each observation.
    Q*/L    (batch,)            Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

    Calling pi should return:

    Symbol     Shape               Description
    a          (batch, act_dim)    Tensor containing actions from policy given observations.
    logp_pi    (batch,)            Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

    Defaults to SoftActorCritic.

  • ac_kwargs (dict, optional) –

    Any kwargs appropriate for the ActorCritic object you provided to SAC. Defaults to:

    Kwarg                  Value
    hidden_sizes_actor     256 x 2
    hidden_sizes_critic    256 x 2
    activation             tf.nn.relu
    output_activation      tf.nn.relu

  • opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to maximize.

  • max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.

  • epochs (int, optional) – Number of epochs to run and train agent. Defaults to 100.

  • steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.

  • start_steps (int, optional) – Number of steps for uniform-random action selection, before running real policy. Helps exploration. Defaults to 0.

  • update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.

  • update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures replay buffer is full enough for useful updates. Defaults to 1000.

  • steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every / steps_per_update). Defaults to 100.

  • num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.

  • alpha (float, optional) – Entropy regularization coefficient (Equivalent to inverse of reward scale in the original SAC paper). Defaults to 0.99.

  • gamma (float, optional) – Discount factor. (Always between 0 and 1.). Defaults to 0.99.

  • polyak (float, optional) –

    Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:

    \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1 - \rho) \theta

    where \rho is polyak (always between 0 and 1, usually close to 1). In some papers, \rho is defined as (1 - \tau), where \tau is the soft replacement factor. Defaults to 0.995.

  • target_entropy (float, optional) –

    Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space. This can be calculated according to:

    -\prod_{i=0}^{n} \text{action\_dim}_{i}

  • adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.

  • lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.

  • lr_c (float, optional) – Learning rate used for the (soft) critic. Defaults to 3e-4.

  • lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.

  • lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear, exponential, and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.

  • lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.

  • batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.

  • replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.

  • seed (int) – Seed for random number generators. Defaults to None.

  • device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.

  • logger_kwargs (dict, optional) – Keyword args for EpochLogger.

  • save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.

  • start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.

  • export (bool) – Whether you want to export the model in the SavedModel format such that it can be deployed to hardware. By default False.

Returns:

tuple containing:

Return type:

(tuple)
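
A hedged end-to-end training sketch follows; the environment name, hyperparameter values, and logger settings are illustrative (the output_dir key is an assumed EpochLogger kwarg), and only a subset of the documented keyword arguments is shown.

    import gymnasium as gym

    from stable_learning_control.algos.tf2.sac import sac

    # env_fn must be a zero-argument callable that returns a fresh copy of the environment.
    env_fn = lambda: gym.make("Pendulum-v1")

    sac(
        env_fn,
        epochs=10,                 # illustrative, not a recommended setting
        steps_per_epoch=2048,
        batch_size=256,
        gamma=0.99,
        polyak=0.995,
        lr_a=1e-4,
        lr_c=3e-4,
        seed=0,
        logger_kwargs=dict(output_dir="./data/sac_pendulum"),  # assumed EpochLogger kwarg
    )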