Hands-On Reinforcement Learning
This repository contains my implementations of several reinforcement learning algorithms and their application to different environments, developed as part of the MSO 3.4 Apprentissage Automatique course.
1. REINFORCE Implementation for CartPole
Algorithm Overview
I implemented the REINFORCE algorithm (Monte Carlo Policy Gradient) to solve the CartPole-v1 environment from OpenAI Gym. The algorithm directly optimizes the policy using gradient ascent on the expected return.
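Concretely, REINFORCE ascends the standard Monte Carlo estimate of the policy gradient, where $G_t$ is the discounted return from step $t$:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t, \qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$$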
Implementation Details
My implementation includes the following (a minimal sketch appears after the list):
- A simple policy network with:
  - Input layer (4 units for the state space)
  - Hidden layer (128 units with ReLU activation and dropout)
  - Output layer (2 units with softmax activation for action probabilities)
- REINFORCE algorithm features:
  - Uses PyTorch for the neural network and automatic differentiation
  - Computes full-episode Monte Carlo returns with discount factor γ = 0.99
  - Uses the Adam optimizer with learning rate 5e-3
  - Normalizes returns for training stability
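A minimal, hypothetical sketch of this setup; the actual code in `reinforce_cartpole.py` may differ in details such as layer names and the dropout rate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Sketch of the policy network described above."""
    def __init__(self, state_dim=4, hidden_dim=128, n_actions=2, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.drop = nn.Dropout(dropout)  # dropout rate is an assumption
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, x):
        x = self.drop(F.relu(self.fc1(x)))
        return F.softmax(self.fc2(x), dim=-1)  # action probabilities

def discounted_returns(rewards, gamma=0.99):
    """Full-episode Monte Carlo returns, normalized for stability."""
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    return (returns - returns.mean()) / (returns.std() + 1e-8)

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-3)

# Policy-gradient step: log-probabilities are collected during the episode via
# torch.distributions.Categorical(probs).log_prob(action); minimizing the
# negated sum of log_prob * return ascends the expected return.
def reinforce_update(log_probs, returns):
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```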
Training Results
The agent was trained for 500 episodes; the training plot tracks the total reward obtained in each episode.
Evaluation Results
After evaluation on 100 episodes, the agent achieved:
- Success Rate: 100.00%
- Average Reward: 498.60
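A hedged sketch of how such an evaluation can run; the actual script is `evaluate_reinforce_cartpole.py`, and both the greedy action selection and the 475-reward success threshold shown here are assumptions (it also assumes the Gymnasium API):

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy = Policy()  # the Policy class from the sketch above
policy.load_state_dict(torch.load("reinforce_cartpole.pth"))
policy.eval()  # disables dropout for evaluation

successes, rewards = 0, []
for _ in range(100):
    state, _ = env.reset()
    total, done = 0.0, False
    while not done:
        with torch.no_grad():
            probs = policy(torch.as_tensor(state, dtype=torch.float32))
        action = int(probs.argmax())  # greedy action at evaluation time
        state, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    rewards.append(total)
    successes += total >= 475  # assumed "solved" threshold for CartPole-v1

print(f"Success rate: {successes}%, average reward: {sum(rewards) / 100:.2f}")
```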
2. A2C Implementation with Stable-Baselines3
For the second part of the project, I used the Advantage Actor-Critic (A2C) algorithm from Stable-Baselines3 on both the CartPole environment and a robotic arm simulation.
CartPole with A2C
I used the A2C implementation from Stable-Baselines3 to train an agent on the CartPole-v1 environment. The implementation leverages the high-level API provided by SB3 for efficient training and evaluation.
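A minimal example of this workflow; the timestep budget below is a placeholder, not necessarily the value used in `a2c_sb3_cartpole.py`:

```python
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)  # placeholder training budget

# Evaluate over 100 episodes and save the trained model.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
model.save("a2c_sb3_cartpole")
```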
External Tools Integration
Hugging Face Hub
The trained A2C model for CartPole is available on Hugging Face: https://huggingface.co/SimRams/a2c_sb3_cartpole
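The model can be pulled back with the `huggingface_sb3` helper; the `filename` below is an assumption about how the checkpoint is named inside the Hub repo:

```python
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C

# Assumed checkpoint filename inside the Hub repository.
checkpoint = load_from_hub(repo_id="SimRams/a2c_sb3_cartpole",
                           filename="a2c_sb3_cartpole.zip")
model = A2C.load(checkpoint)
```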
Weights & Biases
Training progress was tracked using Weights & Biases: https://wandb.ai/sim-ramos01-centrale-lyon/sb3/runs/bv67u8pe?nw=nwusersimramos01
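A sketch of the tracking setup, assuming SB3's official W&B integration; the config values are placeholders:

```python
import gymnasium as gym
import wandb
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import A2C

config = {"env_id": "CartPole-v1", "total_timesteps": 100_000}  # placeholders
run = wandb.init(project="sb3", config=config, sync_tensorboard=True)

env = gym.make(config["env_id"])
model = A2C("MlpPolicy", env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(model_save_path=f"models/{run.id}", verbose=2),
)
run.finish()
```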
3. Robotic Arm Training
Panda-gym Implementation
I prepared code to train an A2C agent on the PandaReachJointsDense-v3 environment, which involves controlling a robotic arm to reach points in 3D space.
Note: I encountered technical issues with installing and running panda_gym. The implementation code is included in `a2c_sb3_panda_reach.py`, but I was unable to test it. The code, sketched after this list, is structured to:
- Set up the environment with the proper configuration
- Train the A2C model with 500k timesteps
- Track training with Weights & Biases
- Upload the trained model to Hugging Face Hub
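Since I could not run it, here is an untested, minimal sketch of that structure; the choice of `MultiInputPolicy` is an assumption based on panda-gym's Dict observation space:

```python
import gymnasium as gym
import panda_gym  # registers the Panda environments on import
from stable_baselines3 import A2C

env = gym.make("PandaReachJointsDense-v3")

# Dict observations (observation / achieved_goal / desired_goal)
# require SB3's MultiInputPolicy rather than MlpPolicy.
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("a2c_sb3_panda_reach")
```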
Files
- `reinforce_cartpole.py`: Implementation of the REINFORCE algorithm
- `reinforce_cartpole.pth`: Saved model weights after training
- `evaluate_reinforce_cartpole.py`: Script for evaluating the trained REINFORCE model
- `a2c_sb3_cartpole.py`: Implementation using Stable-Baselines3 A2C
- `wandb_cartpole.py`: Integration with Weights & Biases
- `a2c_sb3_panda_reach.py`: Code for training a robotic arm (not tested)