stable_learning_control.algos.common.buffers

This module contains several replay buffers that are used by multiple PyTorch and TensorFlow algorithms.

Module Contents

Classes

ReplayBuffer

A simple first-in-first-out (FIFO) experience replay buffer.

FiniteHorizonReplayBuffer

A first-in-first-out (FIFO) experience replay buffer that also stores the expected cumulative finite-horizon reward.

TrajectoryBuffer

A simple FIFO trajectory buffer. It can store trajectories of varying lengths for Monte Carlo or TD-N learning algorithms.

class stable_learning_control.algos.common.buffers.ReplayBuffer(obs_dim, act_dim, size)[source]

A simple first-in-first-out (FIFO) experience replay buffer.

obs_buf

Buffer containing the current state.

Type:

numpy.ndarray

obs_next_buf

Buffer containing the next state.

Type:

numpy.ndarray

act_buf

Buffer containing the current action.

Type:

numpy.ndarray

rew_buf

Buffer containing the current reward.

Type:

numpy.ndarray

done_buf

Buffer indicating whether the episode was terminated after the action was taken.

Type:

numpy.ndarray

ptr

The current buffer index.

Type:

int

Initialise the ReplayBuffer object.

Parameters:
  • obs_dim (tuple) – The size of the observation space.

  • act_dim (tuple) – The size of the action space.

  • size (int) – The replay buffer size.

store(obs, act, rew, next_obs, done)[source]

Add an experience tuple to the buffer.

Parameters:
  • obs (numpy.ndarray) – Start state (observation).

  • act (numpy.ndarray) – Action.

  • rew (numpy.float64) – Reward.

  • next_obs (numpy.ndarray) – Next state (observation).

  • done (bool) – Boolean specifying whether the terminal state was reached.

sample_batch(batch_size=32)[source]

Retrieve a batch of experiences from buffer.

Parameters:

batch_size (int, optional) – The batch size. Defaults to 32.

Returns:

A batch of experiences.

Return type:

dict
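
Below is a minimal usage sketch, assuming a Gymnasium-style environment API; the environment name, loop length, and buffer size are illustrative and not part of this module.

import gymnasium as gym

from stable_learning_control.algos.common.buffers import ReplayBuffer

env = gym.make("Pendulum-v1")  # illustrative environment
buffer = ReplayBuffer(
    obs_dim=env.observation_space.shape,
    act_dim=env.action_space.shape,
    size=int(1e4),
)

obs, _ = env.reset()
for _ in range(1000):
    act = env.action_space.sample()
    next_obs, rew, terminated, truncated, _ = env.step(act)
    buffer.store(obs, act, rew, next_obs, terminated)
    obs = env.reset()[0] if terminated or truncated else next_obs

batch = buffer.sample_batch(batch_size=64)  # dict with sampled experience arrays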

class stable_learning_control.algos.common.buffers.FiniteHorizonReplayBuffer(obs_dim, act_dim, size, horizon_length)[source]

Bases: ReplayBuffer

A first-in-first-out (FIFO) experience replay buffer that also stores the expected cumulative finite-horizon reward.

Note

The expected cumulative finite-horizon reward is calculated using the following formula:

L_{target}(s,a) = \sum_{t'=t}^{t+N} \mathbb{E}\left[c_{t'}\right]
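
As an illustration of this target (not the buffer's internal implementation), the sketch below sums the rewards over the next horizon_length steps of an episode, clipped at the episode end; the helper name and the exact boundary convention are assumptions.

import numpy as np

def finite_horizon_returns(rewards, horizon_length):
    # Illustrative helper: for every timestep, sum the rewards over the next
    # `horizon_length` steps (clipped at the end of the episode).
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.array(
        [rewards[t : t + horizon_length].sum() for t in range(len(rewards))]
    )

print(finite_horizon_returns([1, 1, 1, 1, 1, 1], horizon_length=3))
# [3. 3. 3. 3. 2. 1.]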

horizon_length

The length of the finite-horizon.

Type:

int

horizon_rew_buf

Buffer containing the expected cumulative finite-horizon reward.

Type:

numpy.ndarray

Initialise the FiniteHorizonReplayBuffer object.

Parameters:
  • obs_dim (tuple) – The size of the observation space.

  • act_dim (tuple) – The size of the action space.

  • size (int) – The replay buffer size.

  • horizon_length (int) – The length of the finite-horizon.

store(obs, act, rew, next_obs, done, truncated)[source]

Add an experience tuple to the buffer and calculate the expected cumulative finite-horizon reward if the episode is done or truncated.

Parameters:
  • obs (numpy.ndarray) – Start state (observation).

  • act (numpy.ndarray) – Action.

  • rew (numpy.float64) – Reward.

  • next_obs (numpy.ndarray) – Next state (observation).

  • done (bool) – Boolean specifying whether the terminal state was reached.

  • truncated (bool) – Boolean specifying whether the episode was truncated.

sample_batch(batch_size=32)[source]

Retrieve a batch of experiences and their expected cumulative finite-horizon reward from buffer.

Parameters:

batch_size (int, optional) – The batch size. Defaults to 32.

Returns:

A batch of experiences.

Return type:

dict
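
Usage mirrors ReplayBuffer, with the extra horizon_length argument and the truncated flag passed to store(); the sketch below assumes a Gymnasium-style environment API and uses illustrative values.

import gymnasium as gym

from stable_learning_control.algos.common.buffers import FiniteHorizonReplayBuffer

env = gym.make("Pendulum-v1")  # illustrative environment
buffer = FiniteHorizonReplayBuffer(
    obs_dim=env.observation_space.shape,
    act_dim=env.action_space.shape,
    size=int(1e4),
    horizon_length=5,
)

obs, _ = env.reset()
for _ in range(500):
    act = env.action_space.sample()
    next_obs, rew, terminated, truncated, _ = env.step(act)
    buffer.store(obs, act, rew, next_obs, terminated, truncated)
    obs = env.reset()[0] if terminated or truncated else next_obs

batch = buffer.sample_batch(batch_size=32)  # also contains the finite-horizon reward targets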

class stable_learning_control.algos.common.buffers.TrajectoryBuffer(obs_dim, act_dim, size, preempt=False, min_trajectory_size=3, incomplete=False, gamma=0.99, lam=0.95)[source]

A simple FIFO trajectory buffer. It can store trajectories of varying lengths for Monte Carlo or TD-N learning algorithms.

obs_buf

Buffer containing the current state.

Type:

numpy.ndarray

obs_next_buf

Buffer containing the next state.

Type:

numpy.ndarray

act_buf

Buffer containing the current action.

Type:

numpy.ndarray

rew_buf

Buffer containing the current reward.

Type:

numpy.ndarray

done_buf

Buffer indicating whether the episode was terminated after the action was taken.

Type:

numpy.ndarray

traj_lengths

List with the lengths of each trajectory in the buffer.

Type:

list

ptr

The current buffer index.

Type:

int

traj_ptr

The start index of the current trajectory.

Type:

int

traj_ptrs

The start indexes of each trajectory.

Type:

list

n_traj

The number of trajectories currently stored in the buffer.

Type:

int

Warning

This buffer has not been rigorously tested and should therefore still be regarded as experimental.

Initialise the TrajectoryBuffer object.

Parameters:
  • obs_dim (tuple) – The size of the observation space.

  • act_dim (tuple) – The size of the action space.

  • size (int) – The replay buffer size.

  • preempt (bool, optional) – Whether the buffer can be retrieved before it is full. Defaults to False.

  • min_trajectory_size (int, optional) – The minimum trajectory length that can be stored in the buffer. Defaults to 3.

  • incomplete (bool, optional) – Whether the buffer can store incomplete trajectories (i.e. trajectories which do not contain the final state). Defaults to False.

  • gamma (float, optional) – The Generalized Advantage Estimation (GAE) discount factor (always between 0 and 1). Defaults to 0.99.

  • lam (float, optional) – The GAE bias-variance trade-off factor (always between 0 and 1). Defaults to 0.95.

store(obs, act, rew, next_obs, done, val=None, logp=None)[source]

Append one timestep of agent-environment interaction to the buffer.

Parameters:
  • obs (numpy.ndarray) – Start state (observation).

  • act (numpy.ndarray) – Action.

  • rew (numpy.float64) – Reward.

  • next_obs (numpy.ndarray) – Next state (observation).

  • done (bool) – Boolean specifying whether the terminal state was reached.

  • val (numpy.ndarray, optional) – The (action) value. Defaults to None.

  • logp (numpy.ndarray, optional) – The log probability of the action. Defaults to None.

finish_path(last_val=0)[source]

Call this at the end of a trajectory, or when a trajectory gets cut off because an epoch ends. This function increments the buffer pointers and calculates the advantages and rewards-to-go if the buffer contains (action) values.

Note

When (action) values are stored in the buffer, this function looks back in the buffer to where the trajectory started and uses the rewards and value estimates from the whole trajectory to compute advantage estimates with GAE-Lambda, as well as the rewards-to-go for each state, which are used as the targets for the value function.

The “last_val” argument should be 0 if the trajectory ended because the agent reached a terminal state (died), and otherwise should be V(s_T), the value function estimated for the last state. This allows us to bootstrap the reward-to-go calculation to account for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
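
For reference, the sketch below shows the standard GAE-Lambda advantage and bootstrapped reward-to-go computation that the note above describes, applied to a single trajectory; it is an illustration, not the buffer's internal code, and the helper names are assumptions.

import numpy as np

def gae_and_rewards_to_go(rews, vals, last_val=0.0, gamma=0.99, lam=0.95):
    # Illustrative helper: standard GAE-Lambda advantages and bootstrapped
    # rewards-to-go for one trajectory.
    rews = np.append(np.asarray(rews, dtype=np.float64), last_val)
    vals = np.append(np.asarray(vals, dtype=np.float64), last_val)

    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    deltas = rews[:-1] + gamma * vals[1:] - vals[:-1]

    def discount_cumsum(x, discount):
        # Backwards pass computing y_t = x_t + discount * y_{t+1}.
        out = np.zeros_like(x)
        running = 0.0
        for i in reversed(range(len(x))):
            running = x[i] + discount * running
            out[i] = running
        return out

    advantages = discount_cumsum(deltas, gamma * lam)   # GAE-Lambda advantages
    rewards_to_go = discount_cumsum(rews, gamma)[:-1]   # value-function targets
    return advantages, rewards_to_go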

get(flat=False)[source]

Retrieve the trajectory buffer.

Call this at the end of an epoch to get all of the data from the buffer. This also resets the buffer pointers.

Parameters:

flat (bool, optional) – Retrieve a flat buffer (i.e. the trajectories are concatenated). Defaults to False.

Returns:

The trajectory buffer.

Return type:

dict
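
A minimal usage sketch, assuming a Gymnasium-style environment API (environment name illustrative) and using preempt=True so the buffer can be retrieved before it is full:

import gymnasium as gym

from stable_learning_control.algos.common.buffers import TrajectoryBuffer

env = gym.make("Pendulum-v1")  # illustrative environment
buffer = TrajectoryBuffer(
    obs_dim=env.observation_space.shape,
    act_dim=env.action_space.shape,
    size=2000,
    preempt=True,  # allow retrieval before the buffer is full
)

for _ in range(3):  # collect a few trajectories
    obs, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        act = env.action_space.sample()
        next_obs, rew, terminated, truncated, _ = env.step(act)
        buffer.store(obs, act, rew, next_obs, terminated)
        obs = next_obs
    buffer.finish_path()  # last_val=0 here; pass V(s_T) when bootstrapping a cut-off trajectory

data = buffer.get(flat=True)  # dict with the concatenated trajectories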