TD1-Reinforcement-learning
REINFORCE Implementation for CartPole
This repository contains an implementation of the REINFORCE algorithm (Monte Carlo Policy Gradient) to solve the CartPole-v1 environment from OpenAI Gym.
Implementation Details
The implementation consists of:
- A simple policy network with:
  - Input layer (4 units for state space)
  - Hidden layer (128 units with ReLU activation and dropout)
  - Output layer (2 units with softmax activation for action probabilities)
- REINFORCE algorithm features:
  - Uses PyTorch for neural network and automatic differentiation
  - Implements full episode Monte Carlo returns with discount factor γ=0.99
  - Uses Adam optimizer with learning rate 5e-3
  - Includes return normalization for training stability
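The return computation listed above can be sketched in plain Python. This is a minimal illustration rather than the code in reinforce_cartpole.py: the function names `discounted_returns` and `normalize_returns` are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption (the repository may normalize differently, for example with PyTorch tensors).

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Compute the Monte Carlo return G_t = r_t + gamma * G_{t+1} for every step.

    `rewards` is the list of rewards collected over one full episode.
    """
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize_returns(returns, eps=1e-8):
    """Shift returns to zero mean and scale to roughly unit variance.

    Assumption: population standard deviation; `eps` avoids division by zero
    when all returns are equal.
    """
    mean = sum(returns) / len(returns)
    variance = sum((g - mean) ** 2 for g in returns) / len(returns)
    std = math.sqrt(variance)
    return [(g - mean) / (std + eps) for g in returns]


if __name__ == "__main__":
    episode_rewards = [1.0, 1.0, 1.0]  # CartPole gives +1 per surviving step
    returns = discounted_returns(episode_rewards, gamma=0.99)
    print(returns)
    print(normalize_returns(returns))
```

In REINFORCE, each normalized return then weights the negative log-probability of the action taken at that step, and the sum over the episode is the loss that is minimized.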
Training Results
The agent was trained for 500 episodes. The total reward obtained in each episode during training is plotted in training_plot.png.
Files
- reinforce_cartpole.py: Contains the implementation of the policy network and REINFORCE algorithm
- reinforce_cartpole.pth: Saved model weights after training
- training_plot.png: Visualization of the training progress
Evaluation Results
After training, the agent was evaluated on 100 episodes:
- Success Rate: 100.00%
- Average Reward: 498.60
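As a rough sketch of how the two metrics above could be derived from a list of per-episode rewards: the helper `summarize_evaluation` and its `success_threshold` parameter are hypothetical, and the README does not state which success criterion was actually used. The default of 475 is only an assumption, chosen because it is the reward threshold registered for CartPole-v1 in Gym.

```python
def summarize_evaluation(episode_rewards, success_threshold=475.0):
    """Summarize evaluation episodes as a success rate (%) and an average reward.

    Assumption: an episode counts as a success when its total reward is at
    least `success_threshold`; the real criterion may differ.
    """
    if not episode_rewards:
        raise ValueError("episode_rewards must contain at least one episode")
    n_episodes = len(episode_rewards)
    n_success = sum(1 for total in episode_rewards if total >= success_threshold)
    return {
        "success_rate": 100.0 * n_success / n_episodes,
        "average_reward": sum(episode_rewards) / n_episodes,
    }


if __name__ == "__main__":
    # Illustrative rewards only, not the results reported above.
    summary = summarize_evaluation([500.0, 500.0, 400.0, 500.0])
    print(f"Success Rate: {summary['success_rate']:.2f}%")
    print(f"Average Reward: {summary['average_reward']:.2f}")
```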
HuggingFace Model
https://huggingface.co/SimRams/a2c_sb3_cartpole
Wandb link
https://wandb.ai/sim-ramos01-centrale-lyon/sb3/runs/bv67u8pe?nw=nwusersimramos01
Disclaimer about panda_gym
For an unknown reason, I could not download and use panda_gym. The corresponding code is provided in a2c_sb3_panda_reach.py, but I have not been able to test it.