stable_learning_control.algos.common.buffers
This module contains several replay buffers that are used in multiple PyTorch and TensorFlow algorithms.
Classes
- ReplayBuffer – A simple first-in-first-out (FIFO) experience replay buffer.
- FiniteHorizonReplayBuffer – A first-in-first-out (FIFO) experience replay buffer that also stores the expected cumulative finite-horizon reward.
- TrajectoryBuffer – A simple FIFO trajectory buffer. It can store trajectories of varying lengths for Monte Carlo or TD-N learning algorithms.
Module Contents
- class stable_learning_control.algos.common.buffers.ReplayBuffer(obs_dim, act_dim, size)[source]
A simple first-in-first-out (FIFO) experience replay buffer.
- done_buf[source]
Buffer containing information about whether the episode was terminated after the action was taken.
- Type: numpy.ndarray
Initialise the ReplayBuffer object.
- Parameters:
obs_dim (tuple) – The size of the observation space.
act_dim (tuple) – The size of the action space.
size (int) – The replay buffer size.
- store(obs, act, rew, next_obs, done)[source]
Add experience tuple to buffer.
- Parameters:
obs (numpy.ndarray) – Start state (observation).
act (numpy.ndarray) – Action.
rew (numpy.float64) – Reward.
next_obs (numpy.ndarray) – Next state (observation).
done (bool) – Boolean specifying whether the terminal state was reached.
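A minimal usage sketch. The observation/action shapes, buffer size, and transition values below are illustrative assumptions, not part of the documented API:

```python
import numpy as np

from stable_learning_control.algos.common.buffers import ReplayBuffer

# Hypothetical dimensions: 3-dimensional observations, 1-dimensional actions.
buffer = ReplayBuffer(obs_dim=(3,), act_dim=(1,), size=int(1e4))

# Store a single (made-up) experience tuple.
obs = np.zeros(3, dtype=np.float32)
act = np.array([0.5], dtype=np.float32)
next_obs = np.ones(3, dtype=np.float32)
buffer.store(obs, act, np.float64(1.0), next_obs, done=False)
```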
- class stable_learning_control.algos.common.buffers.FiniteHorizonReplayBuffer(obs_dim, act_dim, size, horizon_length)[source]
Bases: ReplayBuffer
A first-in-first-out (FIFO) experience replay buffer that also stores the expected cumulative finite-horizon reward.
Note
The expected cumulative finite-horizon reward is calculated using the following formula:

L(s_t, a_t) = \sum_{t'=t}^{t+N} r_{t'}

where N is the horizon_length.
Initialise the FiniteHorizonReplayBuffer object.
- Parameters:
obs_dim (tuple) – The size of the observation space.
act_dim (tuple) – The size of the action space.
size (int) – The replay buffer size.
horizon_length (int) – The length of the finite horizon.
- store(obs, act, rew, next_obs, done, truncated)[source]
Add experience tuple to buffer and calculate expected cumulative finite horizon reward if the episode is done or truncated.
- Parameters:
obs (numpy.ndarray) – Start state (observation).
act (numpy.ndarray) – Action.
rew (numpy.float64) – Reward.
next_obs (numpy.ndarray) – Next state (observation).
done (bool) – Boolean specifying whether the terminal state was reached.
truncated (bool) – Boolean specifying whether the episode was truncated.
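A usage sketch for the finite-horizon variant. The shapes, horizon length, and reward values are illustrative assumptions:

```python
import numpy as np

from stable_learning_control.algos.common.buffers import FiniteHorizonReplayBuffer

# Hypothetical dimensions and a 5-step finite horizon.
buffer = FiniteHorizonReplayBuffer(obs_dim=(3,), act_dim=(1,), size=int(1e4), horizon_length=5)

# Store a short (made-up) episode; the expected cumulative finite-horizon reward is
# calculated once the episode is done or truncated.
obs = np.zeros(3, dtype=np.float32)
for step in range(10):
    act = np.zeros(1, dtype=np.float32)
    next_obs = obs + 0.1
    done = step == 9   # Terminal state reached on the last step.
    truncated = False  # Would be True if the episode were cut off by a time limit.
    buffer.store(obs, act, np.float64(1.0), next_obs, done, truncated)
    obs = next_obs
```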
- class stable_learning_control.algos.common.buffers.TrajectoryBuffer(obs_dim, act_dim, size, preempt=False, min_trajectory_size=3, incomplete=False, gamma=0.99, lam=0.95)[source]
A simple FIFO trajectory buffer. It can store trajectories of varying lengths for Monte Carlo or TD-N learning algorithms.
- done_buf[source]
Buffer containing information about whether the episode was terminated after the action was taken.
- Type: numpy.ndarray
Warning
This buffer has not been rigorously tested and should therefore still be regarded as experimental.
Initialise the TrajectoryBuffer object.
- Parameters:
obs_dim (tuple) – The size of the observation space.
act_dim (tuple) – The size of the action space.
size (int) – The replay buffer size.
preempt (bool, optional) – Whether the buffer can be retrieved before it is full. Defaults to False.
min_trajectory_size (int, optional) – The minimum trajectory length that can be stored in the buffer. Defaults to 3.
incomplete (bool, optional) – Whether the buffer can store incomplete trajectories (i.e. trajectories that do not contain the final state). Defaults to False.
gamma (float, optional) – The Generalized Advantage Estimation (GAE) discount factor (always between 0 and 1). Defaults to 0.99.
lam (float, optional) – The GAE bias-variance trade-off factor (always between 0 and 1). Defaults to 0.95.
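A construction sketch using the documented defaults. The observation/action shapes and buffer size are illustrative assumptions:

```python
from stable_learning_control.algos.common.buffers import TrajectoryBuffer

# Hypothetical dimensions; the remaining arguments are the documented defaults.
buffer = TrajectoryBuffer(
    obs_dim=(3,),
    act_dim=(1,),
    size=int(1e4),
    preempt=False,
    min_trajectory_size=3,
    incomplete=False,
    gamma=0.99,
    lam=0.95,
)
```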
- store(obs, act, rew, next_obs, done, val=None, logp=None)[source]
Append one timestep of agent-environment interaction to the buffer.
- Parameters:
obs (numpy.ndarray) – Start state (observation).
act (numpy.ndarray) – Action.
rew (numpy.float64) – Reward.
next_obs (numpy.ndarray) – Next state (observation).
done (bool) – Boolean specifying whether the terminal state was reached.
val (numpy.ndarray, optional) – The (action) values. Defaults to None.
logp (numpy.ndarray, optional) – The log probabilities of the actions. Defaults to None.
- finish_path(last_val=0)[source]
Call this at the end of a trajectory, or when a trajectory gets cut off by the end of an epoch. This function increments the buffer pointers and calculates the advantages and rewards-to-go if the buffer contains (action) values.
Note
When (action) values are stored in the buffer, this function looks back in the buffer to where the trajectory started and uses rewards and value estimates from the whole trajectory to compute advantage estimates with GAE-Lambda and compute the rewards-to-go for each state to use as the targets for the value function.
The “last_val” argument should be 0 if the trajectory ended because the agent reached a terminal state (died), and otherwise should be V(s_T), the value function estimated for the last state. This allows us to bootstrap the reward-to-go calculation to account for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
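A sketch of the intended call pattern, using the TrajectoryBuffer constructed in the sketch above. All transition values, value estimates, and log probabilities are illustrative assumptions:

```python
import numpy as np

# Store a short (made-up) trajectory, including (action) values and log probabilities
# so that advantages and rewards-to-go can be computed.
obs = np.zeros(3, dtype=np.float32)
for step in range(5):
    act = np.zeros(1, dtype=np.float32)
    next_obs = obs + 0.1
    buffer.store(
        obs, act, np.float64(1.0), next_obs, False,
        val=np.array(0.5, dtype=np.float32),
        logp=np.array(-0.7, dtype=np.float32),
    )
    obs = next_obs

# The trajectory was cut off by the end of the epoch rather than ending in a terminal
# state, so bootstrap with the value estimate of the last state instead of 0.
buffer.finish_path(last_val=0.9)
```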
- get(flat=False)[source]
Retrieve the trajectory buffer.
Call this at the end of an epoch to get all of the data from the buffer. It also resets some pointers in the buffer.
- Parameters:
flat (bool, optional) – Retrieve a flat buffer (i.e. the trajectories are concatenated). Defaults to
False
.
- Returns:
The trajectory buffer.
- Return type:
dict
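Retrieving the data at the end of an epoch, continuing the sketch above:

```python
# Retrieve everything stored so far; with flat=True the trajectories are concatenated
# into flat arrays instead of being returned per trajectory.
data = buffer.get(flat=True)
```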