CartPoleTrackingCost gymnasium environment
An unactuated joint attaches a pole to a cart, which moves along a frictionless track. This environment is a modified version of the CartPole-v1 found in the Gymnasium package, with several key alterations:
The action space is continuous, contrasting with the original discrete setting.
An optional feature can confine actions within the defined action space, preventing the agent from exceeding the set boundaries when enabled.
The reward function is replaced with a (positive definite) cost function (negated reward), in line with Lyapunov stability theory. This cost is the difference between a state variable and a reference value (error).
The maximum cart force is increased from 10 to 20.
The episode length is reduced from 500 to 250.
A termination cost of \(c = 100\) is introduced for early episode termination, to promote cost minimization.
The terminal angle limit is expanded from the original 12 degrees to 20 degrees, enhancing recovery potential.
The terminal position limit is extended from 2.4 meters to 10 meters, broadening the recovery range.
The velocity limits are adjusted from ±∞ to ±50, accelerating training.
The angular velocity termination threshold is lowered from ±∞ to ±50, likely for improved training efficiency.
The random initial state range is modified from [-0.05, 0.05] to [-5, 5] for the cart position and [-0.2, 0.2] for all other states, allowing for expanded exploration.
Additional modifications in our implementation:
Unlike the original environment’s fixed cost threshold of 100, this version allows users to adjust the maximum cost threshold, improving training adaptability.
An extra termination criterion for cumulative costs over 100 is added to hasten training.
The gravity constant is adjusted back from 10 to the real-world value of 9.8, aligning it closer with the original CartPole environment.
The stabilization objective is replaced with a reference tracking task for enhanced control.
Two additional observations are introduced, facilitating reference tracking.
The info dictionary now provides extra information about the reference to be tracked.
The data types for the action and observation spaces are set to np.float64, diverging from the np.float32 used by Han et al. 2020. This aligns the Gymnasium implementation with the original CartPole environment.
These modifications were first described in Han et al. 2020 and further adapted in our version for enhanced training and exploration.
Observation space
By default, the environment returns the following observation:
\(x\) - Cart Position.
\(x_{dot}\) - Cart Velocity.
\(w\) - Pole Angle.
\(w_{dot}\) - Pole Angular Velocity.
\(x_{ref}\) - The cart position reference.
\(x_{ref\_error}\) - The reference tracking error.
The last two variables can be excluded from the observation space by setting the exclude_reference_from_observation and exclude_reference_error_from_observation environment arguments to True. Please note that when the reference signal is not constant, the environment needs either the reference or the reference error to be included in the observation space to function correctly. If both are excluded, the environment will raise an error.
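A minimal sketch of both configurations is shown below, assuming these environment arguments can be passed as keyword arguments through gym.make (which forwards them to the environment constructor):

```python
import gymnasium as gym

import stable_gym  # noqa: F401  # Importing the package registers its environments.

# Default observation: [x, x_dot, w, w_dot, x_ref, x_ref_error].
env = gym.make("stable_gym:CartPoleTrackingCost-v1")
print(env.observation_space.shape)  # (6,) with the full observation listed above.

# Drop the reference from the observation; the reference error is kept, so the
# environment can still be used with a non-constant reference signal.
env_no_ref = gym.make(
    "stable_gym:CartPoleTrackingCost-v1",
    exclude_reference_from_observation=True,
)
print(env_no_ref.observation_space.shape)  # One element fewer than the default.
```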
Action space
u1: The force applied to the cart along the x-axis.
Episode Termination
An episode is terminated when:
Pole Angle is more than 60 degrees.
Cart Position is more than 10 m (center of the cart reaches the edge of the display).
Episode length is greater than 200.
The cost is greater than a set threshold (100 by default). This threshold can be changed using the max_cost environment argument.
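As a sketch, the threshold can be adjusted by passing max_cost through gym.make (assuming keyword arguments are forwarded to the environment constructor):

```python
import gymnasium as gym

import stable_gym  # noqa: F401  # Registers the stable_gym environments.

# Lower the cost-based termination threshold from the default of 100 to 50.
env = gym.make("stable_gym:CartPoleTrackingCost-v1", max_cost=50.0)
```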
Environment goals
The goal is similar to that of the original CartPole-v1 environment: the pendulum starts upright, and the agent must prevent it from falling over by increasing and decreasing the cart’s control force. This version adds a secondary goal: tracking a cart position reference signal.
Cost function
The cost function of this environment is designed to minimize the error between the current cart position and a set reference position. It also includes a penalty on the pole angle:
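The exact terms and weights are defined in the environment’s source code and are not reproduced here; purely as an illustrative sketch, the cost can be thought of as a quadratic penalty on the reference tracking error plus a weighted quadratic penalty on the pole angle, where \(\alpha\) is a placeholder weight (an assumption, not the actual coefficient):

\[
c = (x - x_{ref})^2 + \alpha \, w^2
\]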
The cost is bounded between 0 and a set threshold value in both tasks, and the maximum cost is used when the episode is terminated early.
Environment step return
In addition to the observation, the cost, and the termination and truncation booleans, the environment also returns an info dictionary:
[observation, cost, termination, truncation, info_dict]
The info dictionary contains the following keys:
reference: The set cart position reference.
state_of_interest: The state that should track the reference (SOI).
reference_error: The error between SOI and the reference.
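A short sketch of reading these keys, using the standard Gymnasium step API:

```python
import gymnasium as gym

import stable_gym  # noqa: F401  # Registers the stable_gym environments.

env = gym.make("stable_gym:CartPoleTrackingCost-v1")
obs, info = env.reset(seed=0)

# Note that the second value returned by step() is a cost to be minimized,
# not a reward to be maximized.
obs, cost, terminated, truncated, info = env.step(env.action_space.sample())

print(info["reference"])          # The set cart position reference.
print(info["state_of_interest"])  # The state that should track the reference (SOI).
print(info["reference_error"])    # The error between the SOI and the reference.
```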
How to use
This environment is part of the Stable Gym package and is therefore registered as the stable_gym:CartPoleTrackingCost-v1 gymnasium environment when you import the package. If you want to use the environment in stand-alone mode, you can register it yourself.
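A minimal usage sketch, importing the package so its environments are registered and then running a random-action rollout:

```python
import gymnasium as gym

import stable_gym  # noqa: F401  # Importing the package registers its environments.

env = gym.make("stable_gym:CartPoleTrackingCost-v1")
obs, info = env.reset()

terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()
    obs, cost, terminated, truncated, info = env.step(action)

env.close()
```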
Warning
This environment is new and has not been thoroughly tested. As a result, it may contain bugs, and the cost function might be updated in the future.