README.md

pip install gym==0.21

import gym

# Create the environment
env = gym.make("CartPole-v1")

# Reset the environment and get the initial observation
observation = env.reset()

for _ in range(100):
    # Select a random action from the action space
    action = env.action_space.sample()
    # Apply the action to the environment
    # Returns next observation, reward, done signal (indicating
    # if the episode has ended), and an additional info dictionary
    observation, reward, done, info = env.step(action)
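    # Render the environment to visualize the agent's behavior
    env.render()
    # Start a new episode once the current one has ended
    if done:
        observation = env.reset()

# Release the rendering window and other resources
env.close()

Set up the CartPole environment
Set up the agent as a simple neural network with: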
    - One fully connected layer with 128 units and ReLU activation followed by a dropout layer
    - One fully connected layer followed by softmax activation
Repeat 500 times:
    Reset the environment
    Reset the buffer
    Repeat until the end of the episode:
        Compute action probabilities
        Sample the action based on the probabilities and store its probability in the buffer
        Step the environment with the action and store the reward in the buffer
    Compute the discounted return of each step using gamma=0.99 and store it in the buffer
    Normalize the returns
    Compute the policy loss as -sum(log(prob) * return)
    Update the policy using an Adam optimizer and a learning rate of 5e-3
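
The return, normalization, and loss steps of the pseudocode above can be sketched in plain Python, without a neural network or an optimizer. This is only an illustration under assumptions: the function names, the `eps` term, the use of the population standard deviation, and the example rewards and probabilities are not from the original code and are made up for the sketch. In a real training loop the probabilities would be tensors produced by the policy so that the loss can be differentiated.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation.

    Assumption: population standard deviation plus a small eps
    to avoid dividing by zero.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]


def policy_loss(action_probs, returns):
    """Policy loss as in the pseudocode: -sum(log(prob) * return)."""
    return -sum(math.log(p) * g for p, g in zip(action_probs, returns))


# Hypothetical three-step episode (CartPole gives a reward of 1 per step)
rewards = [1.0, 1.0, 1.0]
probs = [0.5, 0.6, 0.7]  # made-up probabilities of the sampled actions

returns = discounted_returns(rewards)
loss = policy_loss(probs, normalize(returns))
print(returns)
print(loss)
```

Computing the returns backwards keeps the work linear in the episode length, which is why the returns are filled in only after the episode has ended.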
pip install "stable-baselines3[extra]"