Ramos Simon
TD1-Reinforcement-learning


REINFORCE Implementation for CartPole
This repository contains an implementation of the REINFORCE algorithm (Monte Carlo Policy Gradient) to solve the CartPole-v1 environment from OpenAI Gym.

Implementation Details
The implementation consists of:


A simple policy network with:

Input layer (4 units for state space)
Hidden layer (128 units with ReLU activation and dropout)
Output layer (2 units with softmax activation for action probabilities)
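The architecture listed above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the code in reinforce_cartpole.py: the class name `PolicyNetwork`, the helper `select_action`, the dropout probability, and the placement of dropout after the ReLU are all assumptions, since the README does not specify them.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Sketch of a 4 -> 128 -> 2 policy with softmax output."""

    def __init__(self, state_dim=4, hidden_dim=128, n_actions=2, dropout_p=0.5):
        # dropout_p is an assumption: the README does not give the dropout rate.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # action probabilities sum to 1
        )

    def forward(self, state):
        return self.net(state)


def select_action(policy, state):
    """Sample an action and return it together with its log-probability."""
    probs = policy(torch.as_tensor(state, dtype=torch.float32))
    dist = torch.distributions.Categorical(probs=probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
```

The log-probability returned by `select_action` is what a policy-gradient update needs later, so it is kept attached to the computation graph rather than converted to a plain float.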


REINFORCE algorithm features:

Uses PyTorch for neural network and automatic differentiation
Computes Monte Carlo returns over full episodes with discount factor γ=0.99
Uses the Adam optimizer with a learning rate of 5e-3
Includes return normalization for training stability
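The return computation and normalization listed above might look like the sketch below. The function names `discounted_returns` and `reinforce_loss` and the small constant `eps` are assumptions made for illustration; the actual script may organize this differently.

```python
import torch


def discounted_returns(rewards, gamma=0.99, eps=1e-8):
    """Compute normalized Monte Carlo returns G_t for one full episode."""
    returns = []
    g = 0.0
    # Walk backwards through the episode so that G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    # Normalize to zero mean and unit variance; eps avoids division by zero.
    # The standard deviation is only defined for episodes of two or more steps.
    return (returns - returns.mean()) / (returns.std() + eps)


def reinforce_loss(log_probs, returns):
    """Policy-gradient loss: the negative sum of log pi(a_t | s_t) * G_t."""
    return -(torch.stack(log_probs) * returns).sum()
```

With an optimizer such as `torch.optim.Adam(policy.parameters(), lr=5e-3)`, one update per episode would then be `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`.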


Training Results
The agent was trained for 500 episodes. The plot in training_plot.png shows the total reward obtained in each episode during training.


Files


reinforce_cartpole.py: Contains the implementation of the policy network and REINFORCE algorithm

reinforce_cartpole.pth: Saved model weights after training

training_plot.png: Visualization of the training progress


Evaluation Results
After training, the agent was evaluated on 100 episodes:

Success Rate: 100.00%
Average Reward: 498.60
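For reference, the two metrics above can be derived from a list of per-episode rewards as in this sketch. The README does not say what reward counts as a success, so `success_threshold` is left as a required parameter; the function name and the sample values are hypothetical.

```python
def summarize_evaluation(episode_rewards, success_threshold):
    """Return (success rate in percent, average reward) over evaluation episodes."""
    if not episode_rewards:
        raise ValueError("episode_rewards must not be empty")
    n = len(episode_rewards)
    # An episode counts as a success if its total reward reaches the threshold.
    successes = sum(1 for r in episode_rewards if r >= success_threshold)
    return 100.0 * successes / n, sum(episode_rewards) / n


# Hypothetical rewards and threshold, for illustration only.
rate, avg = summarize_evaluation([500.0, 500.0, 480.0, 300.0], success_threshold=475.0)
print(f"Success Rate: {rate:.2f}%")
print(f"Average Reward: {avg:.2f}")
```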


HuggingFace Model
https://huggingface.co/SimRams/a2c_sb3_cartpole

Wandb link
https://wandb.ai/sim-ramos01-centrale-lyon/sb3/runs/bv67u8pe?nw=nwusersimramos01

Disclaimer about Panda_gym
I could not install or use panda_gym, for a reason I was unable to identify. The code for that part is in a2c_sb3_panda_reach.py, but it is untested.