Lyapunov Actor-Critic (LAC)
See also
This document assumes you are familiar with the Soft Actor-Critic (SAC) algorithm. It is not meant to be a comprehensive guide but mainly highlights the differences between the SAC and Lyapunov Actor-Critic (LAC) algorithms. For more information, readers are referred to the original papers of Haarnoja et al., 2019 (SAC) and Han et al., 2020 (LAC).
Important
The LAC algorithm only guarantees stability in mean cost when trained on environments with a positive definite cost function (i.e. environments in which the cost is minimized). The opt_type argument can be set to maximize when training in environments where the reward is maximized. However, because Lyapunov's stability conditions are then no longer satisfied, the LAC algorithm no longer guarantees stability in mean cost.
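As an illustration, the snippet below shows how the optimisation direction could be selected when calling the PyTorch trainer (a minimal sketch; the environment names are purely illustrative):

    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac

    # Cost-based (minimisation) environment: the default setting, for which the
    # stability-in-mean-cost guarantee holds.
    lac(lambda: gym.make("Oscillator-v1"), opt_type="minimize")

    # Reward-based (maximisation) environment: training works, but the
    # stability-in-mean-cost guarantee no longer applies.
    lac(lambda: gym.make("Pendulum-v1"), opt_type="maximize")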
Background
The Lyapunov Actor-Critic (LAC) algorithm can be seen as a direct successor of the SAC algorithm. Although the SAC algorithm achieves impressive performance in various robotic control tasks, it does not guarantee that its actions result in stable system behaviour. From a control-theoretic perspective, stability is the most critical property of any control system since it is closely related to a robotic system's safety, robustness, and reliability. Using Lyapunov's method, the LAC algorithm addresses this issue through a data-based stability theorem that guarantees the system stays stable in mean cost.
Lyapunov critic function
The concept of Lyapunov stability is a useful and general approach for studying the stability of robotic systems. In Lyapunov's (direct) method, a scalar "energy-like" function, called a Lyapunov function, is constructed to analyse a system's stability. According to Lyapunov's stability conditions, a dynamic autonomous system

    \dot{s} = f(s)

is said to be stable around an equilibrium point if an "energy" function $V(s)$ exists such that, in some neighbourhood of that equilibrium point:

1. $V(s)$ and its partial derivatives are continuous,
2. $V(s)$ is positive definite,
3. $\dot{V}(s)$ is negative semi-definite.

If, in addition, $\dot{V}(s)$ is negative definite in that neighbourhood, the equilibrium is asymptotically stable.
In classical control theory, this concept is often used to design controllers that ensure that the difference of a Lyapunov function along the state trajectory is always negative definite. This results in stable closed-loop system dynamics, as the state is guaranteed to decrease the Lyapunov function's value and eventually converge to the equilibrium. The biggest challenge with this approach is that finding such a function is difficult and quickly becomes impractical. In learning-based methods, for example, since we do not have complete information about the system, finding such a Lyapunov function would require trying out all possible consecutive data pairs in the state space, i.e., verifying an infinite number of inequalities. The LAC algorithm solves this by taking a data-based approach in which the controller/policy and a Lyapunov critic function, both parameterised by deep neural networks, are jointly learned. In this way, the actor learns to control the system while only choosing actions that are guaranteed to be stable in mean cost. This inherent stability makes the LAC algorithm particularly useful for stabilisation and tracking tasks on robotic systems.
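For reference, the mean-cost stability condition at the heart of this data-based theorem can be stated roughly as follows (paraphrased from Han et al., 2020; the exact formulation, including the sampling distribution $\mu_{\pi}$ and additional boundedness conditions on $L$, follows the original paper):

    \mathbb{E}_{s \sim \mu_{\pi}}\left[\mathbb{E}_{s' \sim P_{\pi}}\left[L(s')\right] - L(s)\right] \;\le\; -\alpha_{3}\, \mathbb{E}_{s \sim \mu_{\pi}}\left[c(s, a)\right]

In words: on average over the states visited by the policy, the learned Lyapunov value must decrease by at least a fraction $\alpha_{3}$ of the cost, which is what ties the learned critic to stability in mean cost.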
Differences with the SAC algorithm
Like its predecessor, the LAC algorithm uses entropy regularisation to increase exploration and combines a Gaussian actor with a value critic to derive the best action. The main difference lies in how the critic network and the policy objective function are defined.
Critic network definition
The LAC algorithm uses a single Lyapunov critic instead of the double Q-Critic used in the SAC algorithm. This new Lyapunov critic is similar to the Q-Critic, but a square output activation function is used instead of an identity output activation function. This is done to ensure that the network output is positive, such that condition (2) of Lyapunov's stability conditions holds.
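A minimal sketch of such a critic in PyTorch is shown below (the layer sizes and class name are illustrative, not the exact SLC implementation):

    import torch
    import torch.nn as nn

    class LyapunovCritic(nn.Module):
        """Q-style critic with a square output activation so that L(s, a) >= 0."""

        def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
            super().__init__()
            layers, in_dim = [], obs_dim + act_dim
            for size in hidden_sizes:
                layers += [nn.Linear(in_dim, size), nn.ReLU()]
                in_dim = size
            layers.append(nn.Linear(in_dim, 1))
            self.net = nn.Sequential(*layers)

        def forward(self, obs, act):
            out = self.net(torch.cat([obs, act], dim=-1))
            # Squaring replaces the identity output activation and keeps the
            # Lyapunov estimate non-negative; squeeze to return shape (batch,).
            return torch.square(out).squeeze(-1)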
Similar to SAC, during training $L_c$ is updated by mean-squared Bellman error (MSBE) minimisation using the following objective function:

    J(L_c) = \mathbb{E}_{(s, a, c, s') \sim \mathcal{D}}\left[\frac{1}{2}\left(L_{c}(s, a) - L_{target}(s, a)\right)^{2}\right]

Where $L_{target}(s, a) = c + \gamma\, \mathbb{E}_{a' \sim \pi}\left[L_{c}'(s', a')\right]$ is the approximation target received from the infinite-horizon discounted return value function (with $L_{c}'$ the target network) and $\mathcal{D}$ the set of collected transition pairs.
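In code, this critic update could look roughly as follows (a sketch that assumes l_critic, l_critic_target, and policy are callables with the interfaces shown; it is not the exact SLC implementation):

    import torch

    def lyapunov_critic_loss(l_critic, l_critic_target, policy, batch, gamma=0.99):
        """Mean-squared Bellman error for the Lyapunov critic."""
        obs, act = batch["obs"], batch["act"]
        cost, obs_next = batch["cost"], batch["obs_next"]
        with torch.no_grad():
            # Sample next actions from the current policy and evaluate the
            # *target* Lyapunov critic to build the Bellman target.
            act_next, _ = policy(obs_next)
            l_target = cost + gamma * l_critic_target(obs_next, act_next)
        l_value = l_critic(obs, act)
        return 0.5 * torch.mean((l_value - l_target) ** 2)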
Important
As explained by Han et al., 2020, the sum of costs over a finite time horizon can also be used as the approximation target (see Han et al., 2020, eq. (9)):

    L_{target}(s_t, a_t) = \mathbb{E}\left[\sum_{t'=t}^{t+N} c(s_{t'}, a_{t'})\right]
To use this Lyapunov candidate, supply the LAC algorithm with the horizon_length=N argument, where N is the length of the time horizon you want to use, as in the example below.
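A minimal sketch (the environment name is illustrative):

    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac

    # Use a finite-horizon (N = 5) sum of costs as the Lyapunov critic target
    # instead of the default infinite-horizon Bellman backup.
    lac(lambda: gym.make("Oscillator-v1"), horizon_length=5)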
See also
The SLC package also contains a LAC implementation that uses a double Q-Critic (i.e., a Lyapunov Twin Critic). For more information about this version, see the LAC Twin Critic documentation. This version can be used by specifying the latc algorithm in the CLI, by supplying the lac() function with the actor_critic=LyapunovActorTwinCritic argument, or by directly calling the latc() function, as sketched below.
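A hedged sketch of these three options (the CLI invocation and import paths are assumptions based on the patterns used elsewhere in this documentation, not verified commands):

    # Option 1: select the algorithm from the command line (assumed CLI pattern).
    #   python -m stable_learning_control.run latc --env Oscillator-v1

    # Option 2: pass the twin-critic actor-critic class to lac()
    # (the import path of LyapunovActorTwinCritic is an assumption).
    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac
    from stable_learning_control.algos.pytorch.policies import LyapunovActorTwinCritic

    lac(lambda: gym.make("Oscillator-v1"), actor_critic=LyapunovActorTwinCritic)

    # Option 3: call the latc() function directly (assumed module path).
    from stable_learning_control.algos.pytorch.latc import latc

    latc(lambda: gym.make("Oscillator-v1"))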
Policy function definition
In the LAC algorithm, the policy is optimised according to

    \min_{\theta}\; \mathbb{E}_{(s, a, c, s') \sim \mathcal{D},\, \epsilon \sim \mathcal{N}}\left[\lambda\left(L_{c}(s', f_{\theta}(\epsilon, s')) - L_{c}(s, a) + \alpha_{3} c\right) + \beta\left[\log \pi_{\theta}(f_{\theta}(\epsilon, s)\,|\,s) + \mathcal{H}_{target}\right]\right]

In this, $\pi_{\theta}$ represents the squashed Gaussian policy, $f_{\theta}(\epsilon, s)$ its reparameterised sample,
and $\mathcal{H}_{target}$ the desired minimum expected entropy. When comparing this objective with the policy loss used in the SAC algorithm, several differences stand out. First, the objective is minimised instead of maximised in the LAC algorithm. With the LAC algorithm, the goal is to stabilise a system or track a given reference. In these cases, instead of achieving a high return, we want to reduce the difference between the current state and a reference or equilibrium state. This leads us to the second difference: the term in the SAC algorithm that represents the Q-values, $Q(s, a)$, is in the LAC algorithm replaced by $\lambda\left(L_{c}(s', f_{\theta}(\epsilon, s')) - L_{c}(s, a) + \alpha_{3} c\right)$.
As a result, in the LAC algorithm, the loss function now increases the probability of actions that keep the system close to the equilibrium or reference value while decreasing the likelihood of actions that drive the system away from these values. The quadratic regularisation term ensures that the mean cost is positive. The $\lambda$ term represents the Lyapunov Lagrange multiplier and is responsible for weighting the relative importance of the stability condition. Similar to the entropy Lagrange multiplier used in the SAC algorithm, this term is updated by

    \lambda \leftarrow \max\left(0,\; \lambda + \delta\, \mathbb{E}_{(s, a, c, s') \sim \mathcal{D}}\left[L_{c}(s', f_{\theta}(\epsilon, s')) - L_{c}(s, a) + \alpha_{3} c\right]\right)

where $\delta$ is the learning rate. This is done to constrain the average Lyapunov value during training.
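A rough PyTorch sketch of this multiplier update (it assumes the multiplier is parameterised as labda = exp(log_labda) to keep it positive, which is a common implementation trick and an assumption here, not the confirmed SLC implementation):

    import torch

    log_labda = torch.zeros(1, requires_grad=True)  # labda = exp(log_labda) stays positive
    labda_optimizer = torch.optim.Adam([log_labda], lr=3e-4)

    def update_labda(l_value, l_value_next, cost, alpha3=0.2):
        """Gradient ascent on the Lyapunov constraint violation."""
        labda = log_labda.exp()
        # Constraint term: E[L_c(s', a') - L_c(s, a) + alpha3 * c]; detached
        # because only the multiplier is updated here.
        constraint = (l_value_next - l_value + alpha3 * cost).detach().mean()
        # Minimising -labda * constraint performs gradient *ascent* on labda.
        labda_loss = -(labda * constraint)
        labda_optimizer.zero_grad()
        labda_loss.backward()
        labda_optimizer.step()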
Quick Fact
LAC is an off-policy algorithm.
It is guaranteed to be stable in mean cost.
The version of LAC implemented here can only be used for environments with continuous action spaces.
An alternate version of LAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.
The SLC implementation of LAC does not support parallelisation.
Further Reading
For more information on the LAC algorithm, please check out the original paper of Han et al., 2020.
Pseudocode
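The pseudocode figure from the original documentation is not reproduced here; the outline below summarises the training loop in comment form (paraphrased from Han et al., 2020, with details such as the exact update order treated as assumptions):

    # LAC training loop (sketch)
    # Input: initial policy parameters theta, Lyapunov critic parameters phi,
    #        Lagrange multipliers beta (entropy) and lambda (Lyapunov),
    #        empty replay buffer D; set target parameters phi_targ <- phi.
    # repeat:
    #     Observe state s, sample a ~ pi_theta(.|s), step the environment, and
    #     store the transition (s, a, c, s') in D.
    #     if it is time to update:
    #         for each gradient step:
    #             Sample a minibatch B from D.
    #             Update the Lyapunov critic by minimising the MSBE objective on B.
    #             Update the policy by minimising the Lyapunov-constrained
    #             policy objective on B.
    #             Update the Lagrange multipliers lambda and beta from their
    #             respective constraint terms.
    #             Update the target network:
    #             phi_targ <- polyak * phi_targ + (1 - polyak) * phi.
    # until convergence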
Implementation
You Should Know
In what follows, we give documentation for the PyTorch and TensorFlow implementations of LAC in SLC. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.
Algorithm: PyTorch Version
- stable_learning_control.algos.pytorch.lac.lac(env_fn, actor_critic=None, ac_kwargs={'activation': <class 'torch.nn.modules.activation.ReLU'>, 'hidden_sizes': {'actor': [256, 256], 'critic': [256, 256]}, 'output_activation': <class 'torch.nn.modules.activation.ReLU'>}, opt_type='minimize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_labda_final=1e-10, lr_decay_type='linear', lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_labda_decay_type=None, lr_decay_ref='epoch', batch_size=256, replay_size=1000000, horizon_length=0, seed=None, device='cpu', logger_kwargs={}, save_freq=1, start_policy=None, export=False)[source]
Trains the LAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) –

The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

Call | Output Shape | Description
act | (batch, act_dim) | Numpy array of actions for each observation.
Q*/L | (batch,) | Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

Calling pi should return:

Symbol | Shape | Description
a | (batch, act_dim) | Tensor containing actions from policy given observations.
logp_pi | (batch,) | Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

Defaults to LyapunovActorCritic.
ac_kwargs (dict, optional) –

Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

Kwarg | Value
hidden_sizes_actor | 256 x 2
hidden_sizes_critic | 256 x 2
activation | torch.nn.ReLU
output_activation | torch.nn.ReLU
opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to minimize.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent. Defaults to 100.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.
start_steps (int, optional) – Number of steps for uniform-random action selection before running the real policy. Helps exploration. Defaults to 0.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to 1000.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/steps_per_update). Defaults to 100.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor. (Always between 0 and 1.) Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{targ} \leftarrow \rho\, \theta_{targ} + (1 - \rho)\, \theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated according to $\mathcal{H}_{target} = -\prod(\text{action\_space.shape})$.
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_labda_final (float, optional) – The final labda learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear, exponential, and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_labda_decay_type (str, optional) – The learning rate decay type that is used for the labda learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.
batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.
replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.
horizon_length (int, optional) – The length of the finite horizon used for the Lyapunov critic target. Defaults to 0, meaning the infinite-horizon Bellman backup is used.
seed (int) – Seed for random number generators. Defaults to None.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.
logger_kwargs (dict, optional) – Keyword args for EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model as a TorchScript such that it can be deployed on hardware. Defaults to False.
- Returns:
tuple containing:
policy (LAC): The trained actor-critic policy.
replay_buffer (Union[ReplayBuffer, FiniteHorizonReplayBuffer]): The replay buffer used during training.
- Return type:
(tuple)
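A minimal usage sketch of this function (the environment name and keyword values are illustrative; only parameters documented above are used):

    import gymnasium as gym

    from stable_learning_control.algos.pytorch.lac import lac

    # Train LAC on a cost-based gymnasium environment.
    policy, replay_buffer = lac(
        lambda: gym.make("Oscillator-v1"),
        opt_type="minimize",
        epochs=50,
        gamma=0.99,
        lr_a=1e-4,
        lr_c=3e-4,
        device="cpu",
    )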
Saved Model Contents: PyTorch Version
The PyTorch version of the LAC algorithm is implemented by subclassing the torch.nn.Module class. As a result, the model weights are saved using the model_state dictionary (state_dict). These saved weights can be found in the torch_save/model_state.pt file. For an example of how to load a model using this file, see Experiment Outputs or the PyTorch documentation.
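A hedged sketch of loading such a checkpoint (it assumes the file holds a regular state_dict and that the same actor-critic module has been rebuilt; the construction step is omitted and the variable names are illustrative):

    import torch

    # Rebuild the same actor-critic architecture that was used during training,
    # e.g. ac = LyapunovActorCritic(observation_space, action_space).
    state_dict = torch.load("path/to/torch_save/model_state.pt", map_location="cpu")
    # ac.load_state_dict(state_dict)
    # ac.eval()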
Algorithm: TensorFlow Version
Attention
The TensorFlow version is still experimental. It is not guaranteed to work, and it is not guaranteed to be up-to-date with the PyTorch version.
- stable_learning_control.algos.tf2.lac.lac(env_fn, actor_critic=None, ac_kwargs={'activation': <function relu>, 'hidden_sizes': {'actor': [256, 256], 'critic': [256, 256]}, 'output_activation': <function relu>}, opt_type='minimize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_labda_final=1e-10, lr_decay_type='linear', lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_labda_decay_type=None, lr_decay_ref='epoch', batch_size=256, replay_size=1000000, seed=None, horizon_length=0, device='cpu', logger_kwargs={}, save_freq=1, start_policy=None, export=False)[source]
Trains the LAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (tf.Module, optional) –

The constructor method for a TensorFlow Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

Call | Output Shape | Description
act | (batch, act_dim) | Numpy array of actions for each observation.
Q*/L | (batch,) | Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

Calling pi should return:

Symbol | Shape | Description
a | (batch, act_dim) | Tensor containing actions from policy given observations.
logp_pi | (batch,) | Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

Defaults to LyapunovActorCritic.
ac_kwargs (dict, optional) –

Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

Kwarg | Value
hidden_sizes_actor | 256 x 2
hidden_sizes_critic | 256 x 2
activation | tf.nn.relu
output_activation | tf.nn.relu
opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to minimize.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent. Defaults to 100.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.
start_steps (int, optional) – Number of steps for uniform-random action selection before running the real policy. Helps exploration. Defaults to 0.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to 1000.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/steps_per_update). Defaults to 100.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor. (Always between 0 and 1.) Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{targ} \leftarrow \rho\, \theta_{targ} + (1 - \rho)\, \theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated according to $\mathcal{H}_{target} = -\prod(\text{action\_space.shape})$.
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_labda_final (float, optional) – The final labda learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear, exponential, and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_labda_decay_type (str, optional) – The learning rate decay type that is used for the labda learning rate (options are: linear, exponential, and constant). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.
batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.
replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.
horizon_length (int, optional) – The length of the finite horizon used for the Lyapunov critic target. Defaults to 0, meaning the infinite-horizon Bellman backup is used.
seed (int) – Seed for random number generators. Defaults to None.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.
logger_kwargs (dict, optional) – Keyword args for EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model in the SavedModel format such that it can be deployed to hardware. Defaults to False.
- Returns:
tuple containing:
policy (LAC): The trained actor-critic policy.
replay_buffer (Union[ReplayBuffer, FiniteHorizonReplayBuffer]): The replay buffer used during training.
- Return type:
(tuple)
Saved Model Contents: TensorFlow Version
The TensorFlow version of the LAC algorithm is implemented by subclassing the tf.keras.Model class. As a result, both the full model and the current model weights are saved. The complete model can be found in the saved_model.pb file, while the current weights checkpoints are found in the tf_safe/weights_checkpoint* file. For an example of using these two methods, see Experiment Outputs or the TensorFlow documentation.
References
Relevant Papers
Actor-Critic Reinforcement Learning for Control with Stability Guarantee, Han et al., 2020
The general problem of the stability of motion, J. Mawhin, 2005
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al., 2018
Soft Actor-Critic: Algorithms and Applications, Haarnoja et al., 2019
Acknowledgements
Parts of this documentation are directly copied, with the authors' consent, from the original paper of Han et al., 2020.