stable_learning_control.algos.pytorch.sac
A Soft Actor-Critic Agent.
Submodules
Functions
| Function | Description |
|---|---|
| `sac` | Trains the SAC algorithm in a given environment. |
Package Contents
- stable_learning_control.algos.pytorch.sac.sac(env_fn, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.ReLU, 'critic': nn.ReLU}, output_activation={'actor': nn.ReLU, 'critic': nn.Identity}), opt_type='maximize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_decay_type=DEFAULT_DECAY_TYPE, lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_decay_ref=DEFAULT_DECAY_REFERENCE, batch_size=256, replay_size=int(1000000.0), seed=None, device='cpu', logger_kwargs=dict(), save_freq=1, start_policy=None, export=False)[source]
Trains the SAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) – The constructor method for a Torch Module with an `act` method, a `pi` module and several `Q` or `L` modules. The `act` method and `pi` module should accept batches of observations as inputs, and the `Q*` and `L` modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

| Call | Output Shape | Description |
|---|---|---|
| `act` | (batch, act_dim) | Numpy array of actions for each observation. |
| `Q*`/`L` | (batch,) | Tensor containing one current estimate of `Q*`/`L` for the provided observations and actions. (Critical: make sure to flatten this!) |

Calling `pi` should return:

| Symbol | Shape | Description |
|---|---|---|
| `a` | (batch, act_dim) | Tensor containing actions from the policy given observations. |
| `logp_pi` | (batch,) | Tensor containing log probabilities of actions in `a`. Importantly: gradients should be able to flow back into `a`. |

Defaults to `SoftActorCritic`. A minimal sketch of a module satisfying this interface is shown after the parameter list.
ac_kwargs (dict, optional) – Any kwargs appropriate for the ActorCritic object you provided to SAC. Defaults to:

| Kwarg | Value |
|---|---|
| `hidden_sizes_actor` | 256 x 2 |
| `hidden_sizes_critic` | 256 x 2 |
| `activation` | `torch.nn.ReLU` |
| `output_activation` | `torch.nn.ReLU` (actor), `torch.nn.Identity` (critic) |
opt_type (str, optional) – The optimization type you want to use. Options are `maximize` and `minimize`. Defaults to `maximize`.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent for. Defaults to `100`.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) between the agent and the environment in each epoch. Defaults to `2048`.
start_steps (int, optional) – Number of steps for uniform-random action selection, before running the real policy. Helps exploration. Defaults to `0`.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to `100`.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to `1000`.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. `update_every`/`steps_per_update`). Defaults to `100`.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to `10`.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to `0.99`.
gamma (float, optional) – Discount factor. (Always between 0 and 1.) Defaults to `0.99`.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to \(\theta_{\text{targ}} \leftarrow \rho\, \theta_{\text{targ}} + (1 - \rho)\, \theta\), where \(\rho\) is polyak (always between 0 and 1, usually close to 1). In some papers \(\rho\) is defined as \(1 - \tau\), where \(\tau\) is the soft replacement factor. Defaults to `0.995`.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated as `target_entropy = -np.prod(env.action_space.shape)`.
adaptive_temperature (bool, optional) – Enables Automating Entropy Adjustment for Maximum Entropy RL (i.e. the entropy temperature alpha is learned during training). Defaults to `True`.
lr_a (float, optional) – Learning rate used for the actor. Defaults to `1e-4`.
lr_c (float, optional) – Learning rate used for the (soft) critic. Defaults to `3e-4`.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to `1e-4`.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to `1e-10`.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to `1e-10`.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to `1e-10`.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: `linear`, `exponential` and `constant`). Defaults to `linear`. Can be overridden by the specific learning rate decay types.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: `epoch` and `step`). Defaults to `epoch`.
batch_size (int, optional) – Minibatch size for SGD. Defaults to `256`.
replay_size (int, optional) – Maximum length of the replay buffer. Defaults to `1e6`.
seed (int) – Seed for the random number generators. Defaults to `None`.
device (str, optional) – The device the networks are placed on (options: `cpu`, `gpu`, `gpu:0`, `gpu:1`, etc.). Defaults to `cpu`.
logger_kwargs (dict, optional) – Keyword args for EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model as a `TorchScript` such that it can be deployed on hardware. Defaults to `False`.
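The snippet below is a minimal, illustrative sketch of a custom `actor_critic` module that follows the interface described for the `actor_critic` parameter above (an `act` method, a `pi` module returning `(a, logp_pi)`, and flattened `Q` outputs). The class and helper names (`CustomActorCritic`, `CustomGaussianActor`, `CustomQCritic`, `mlp`) are hypothetical and not part of the library; the default `SoftActorCritic` already implements this interface, so a custom module is only needed for a different architecture.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(sizes, activation, output_activation=nn.Identity):
    """Build a simple fully-connected network."""
    layers = []
    for i in range(len(sizes) - 1):
        act = activation if i < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)


class CustomQCritic(nn.Module):
    """Q-function: takes batches of (obs, act) and returns a flat (batch,) tensor."""

    def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
        super().__init__()
        self.q = mlp([obs_dim + act_dim, *hidden_sizes, 1], nn.ReLU)

    def forward(self, obs, act):
        q = self.q(torch.cat([obs, act], dim=-1))
        return torch.squeeze(q, -1)  # Critical: flatten the output to shape (batch,).


class CustomGaussianActor(nn.Module):
    """Squashed Gaussian policy: returns actions and their log probabilities."""

    def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
        super().__init__()
        self.net = mlp([obs_dim, *hidden_sizes], nn.ReLU, nn.ReLU)
        self.mu_layer = nn.Linear(hidden_sizes[-1], act_dim)
        self.log_std_layer = nn.Linear(hidden_sizes[-1], act_dim)

    def forward(self, obs, deterministic=False):
        net_out = self.net(obs)
        mu = self.mu_layer(net_out)
        std = torch.clamp(self.log_std_layer(net_out), -20, 2).exp()
        pi_dist = torch.distributions.Normal(mu, std)
        # rsample() keeps the sampling step differentiable so gradients can
        # flow back into the returned actions.
        pi_action = mu if deterministic else pi_dist.rsample()
        logp_pi = pi_dist.log_prob(pi_action).sum(dim=-1)
        # Tanh-squashing correction of the log probabilities (SAC paper, App. C).
        logp_pi -= (2 * (np.log(2) - pi_action - F.softplus(-2 * pi_action))).sum(dim=-1)
        return torch.tanh(pi_action), logp_pi


class CustomActorCritic(nn.Module):
    """Exposes the `act` method and `pi`/`Q1`/`Q2` modules expected by `sac`."""

    def __init__(self, observation_space, action_space, hidden_sizes=(256, 256), **kwargs):
        super().__init__()
        obs_dim = observation_space.shape[0]
        act_dim = action_space.shape[0]
        self.pi = CustomGaussianActor(obs_dim, act_dim, hidden_sizes)
        self.Q1 = CustomQCritic(obs_dim, act_dim, hidden_sizes)
        self.Q2 = CustomQCritic(obs_dim, act_dim, hidden_sizes)

    def act(self, obs, deterministic=False):
        with torch.no_grad():
            a, _ = self.pi(obs, deterministic)
            return a.cpu().numpy()  # Numpy array of actions, shape (batch, act_dim).
```

Whether the algorithm looks up the critics under the exact attribute names `Q1`/`Q2` depends on the library's internals; check the `SoftActorCritic` implementation before relying on this sketch.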
- Returns:
  tuple containing:
  - policy (`SAC`): The trained actor-critic policy.
  - replay_buffer (Union[`ReplayBuffer`, `FiniteHorizonReplayBuffer`]): The replay buffer used during training.
- Return type:
(tuple)
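As a usage reference, the following is a hedged sketch of how the function above might be called; the environment id (`Pendulum-v1`) and the `logger_kwargs` keys (`output_dir`, `exp_name`) are assumptions, not values prescribed by this API.

```python
import gymnasium as gym

from stable_learning_control.algos.pytorch.sac import sac


# Environment factory: `sac` calls this to create (copies of) the environment.
def env_fn():
    return gym.make("Pendulum-v1")  # Assumed environment id.


# Train for a few epochs on the CPU with the default SoftActorCritic architecture.
# `sac` returns the trained policy and the replay buffer used during training.
policy, replay_buffer = sac(
    env_fn,
    epochs=10,
    batch_size=256,
    seed=0,
    device="cpu",
    logger_kwargs=dict(output_dir="data/sac_pendulum", exp_name="sac_pendulum"),  # Assumed EpochLogger kwargs.
)
```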