Hands-On Reinforcement Learning – TD 1
This repository contains my individual work for the Hands-On Reinforcement Learning project. The project explores reinforcement learning (RL) techniques applied to the CartPole and Panda-Gym robotic arm environments. The goal is to implement and evaluate RL models using both custom PyTorch implementations and high-level libraries like Stable-Baselines3.
1. REINFORCE on CartPole
Implementation
- File: reinforce_cartpole.ipynb
The REINFORCE (Vanilla Policy Gradient) algorithm was implemented using PyTorch. The model learns an optimal policy for solving the CartPole-v1 environment by updating the policy network using gradients computed from episode returns.
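Below is a minimal sketch of what such an update looks like, assuming a small two-layer policy network and normalized Monte-Carlo returns; the architecture and hyperparameters in the notebook may differ.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Illustrative two-layer policy network (the notebook's architecture may differ).
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-3)
env = gym.make("CartPole-v1")
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns, computed backwards over the episode, then normalized.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # REINFORCE loss: negative sum of log-probabilities weighted by the returns.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```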
Training Results
- The model was trained for 500 episodes, showing a steady increase in total rewards. The goal (total reward = 500) was reached consistently after 400 episodes, confirming successful learning.
- Training Plot: (Figure: Total rewards increase per episode, indicating successful learning.)
Model Saving
- The trained model is saved as reinforce_cartpole.pth.
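For completeness, a minimal sketch of saving and reloading the policy weights (using the hypothetical `policy` network from the sketch above):

```python
import torch

# Save the trained policy's parameters.
torch.save(policy.state_dict(), "reinforce_cartpole.pth")

# Later: rebuild the same architecture, then reload the weights for evaluation.
policy.load_state_dict(torch.load("reinforce_cartpole.pth"))
policy.eval()
```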
Evaluation
- File: evaluate_reinforce_cartpole.ipynb
The model was evaluated over 100 episodes, with the success criterion being a total reward of 500 (a sketch of such an evaluation loop follows this list).
- Evaluation Results:
- 100% of the episodes reached a total reward of 500, demonstrating the model’s reliability.
- Evaluation Plot: (Figure: The model consistently reaches a total reward of 500 over 100 evaluation episodes.)
- Example Video: REINFORCE CartPole Evaluation Video
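A minimal sketch of such an evaluation loop, again assuming the `policy` network from the training sketch above; the notebook may structure this differently:

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
n_episodes, successes = 100, 0

for _ in range(n_episodes):
    obs, _ = env.reset()
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.argmax(probs).item()  # greedy action at evaluation time
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    successes += total_reward >= 500  # success criterion: maximum reward of 500

print(f"Success rate: {successes / n_episodes:.0%}")
```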
2. A2C with Stable-Baselines3 on CartPole
Implementation
- File: a2c_sb3_cartpole.ipynb
Implemented Advantage Actor-Critic (A2C) using Stable-Baselines3; A2C combines a policy network (actor) with a learned value function (critic), blending policy-based and value-based RL methods.
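With Stable-Baselines3 the training loop reduces to a few calls; the sketch below uses default hyperparameters, so the exact settings in the notebook may differ.

```python
import gymnasium as gym
from stable_baselines3 import A2C

env = gym.make("CartPole-v1")
model = A2C("MlpPolicy", env, verbose=1)

# Train for 500,000 timesteps, as reported in the results below.
model.learn(total_timesteps=500_000)
model.save("a2c_cartpole")  # illustrative file name
```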
Training Results
- The model was trained for 500,000 timesteps and consistently reached a total reward of 500 after 400 episodes. Training continued up to 1,400 episodes, confirming stable convergence comparable to the REINFORCE approach.
- Training Plot: (Figure: A2C training performance over time.)
Evaluation
- The trained model was evaluated over 100 episodes, achieving 100% success, with every episode reaching a total reward of 500 (see the evaluation sketch after this list).
- Evaluation Plot: (Figure: A2C model consistently achieves perfect performance over 100 episodes.)
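Stable-Baselines3 ships a helper for exactly this kind of evaluation; a minimal sketch, assuming the illustrative a2c_cartpole checkpoint from the training sketch above:

```python
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")
model = A2C.load("a2c_cartpole")

# Mean episode return and its standard deviation over 100 evaluation episodes.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)
print(f"mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")
```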
Model Upload
- The trained A2C model is available on Hugging Face Hub: A2C CartPole Model
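A model published this way can be pulled back with the huggingface_sb3 helper; this is only a sketch, and the repository id and filename below are placeholders rather than the actual Hub entry:

```python
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C

# Placeholder repo id and filename: replace with the actual Hugging Face Hub entry.
checkpoint = load_from_hub(repo_id="<username>/a2c-CartPole-v1", filename="a2c_cartpole.zip")
model = A2C.load(checkpoint)
```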
3. Tracking with Weights & Biases (W&B) on CartPole
Training with W&B
- File: a2c_sb3_cartpole.ipynb
The A2C training process was tracked using Weights & Biases (W&B) to monitor performance metrics (see the integration sketch after this list).
- W&B Run: W&B Run for A2C CartPole
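W&B hooks into Stable-Baselines3 through a callback; a minimal sketch, where the project name is a placeholder and not necessarily the one used for this run:

```python
import gymnasium as gym
import wandb
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import A2C

# Placeholder project name; the actual W&B project may be named differently.
run = wandb.init(project="a2c-cartpole", sync_tensorboard=True, monitor_gym=True)

env = gym.make("CartPole-v1")
model = A2C("MlpPolicy", env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(total_timesteps=500_000, callback=WandbCallback(verbose=2))
run.finish()
```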
Training Analysis
- Observations:
- The training curve indicates that the A2C model stabilizes after 1,300 episodes.
- The model exhibits strong and consistent performance.
- Training Plot:
Model Upload
- The trained A2C model (tracked with W&B) is available on Hugging Face Hub: A2C CartPole (W&B) Model
Evaluation
- Evaluation Results:
- 100% of episodes reached a total reward of 500, confirming the model’s reliability.
- Evaluation Plot: (Figure: Evaluation results tracked using W&B.)
- Example Video: W&B Evaluation Video
Compared to the REINFORCE approach, the A2C model stabilizes the balancing task more efficiently and delivers more consistent performance.
4. Full Workflow with Panda-Gym
Implementation
- File: a2c_sb3_panda_reach.ipynb
Used Stable-Baselines3 to train an A2C model on the PandaReachJointsDense-v3 environment, in which a robotic arm must reach a target position in 3D space (a minimal training sketch follows this list).
- Training Duration: 500,000 timesteps
- Integrated Weights & Biases for tracking.
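A minimal sketch of setting up this environment and training; importing panda_gym registers the Panda environments with Gymnasium, and the hyperparameters are defaults that may differ from the notebook:

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (import registers the PandaReach* environments)
from stable_baselines3 import A2C

env = gym.make("PandaReachJointsDense-v3")

# The observation space is a dict (observation / achieved_goal / desired_goal),
# so the multi-input policy is used rather than the plain MLP policy.
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("a2c_panda_reach")  # illustrative file name
```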
Training Results
- W&B Run for Panda-Gym: Panda-Gym W&B Run
- Observations:
- The training curve shows consistent improvement over time.
- The model successfully learns to reach the target efficiently.
- It stabilizes after 2,500 episodes, with minor fluctuations in rewards.
- Training Plot: (Figure: The robotic arm’s learning progress over 500,000 timesteps.)
Model Upload and Evaluation
- The trained model is available on Hugging Face Hub: A2C Panda-Reach Model
Evaluation
- Evaluation Results:
- The total reward per episode ranged between -1 and 0; since the dense reward is the negative distance to the target, returns this close to 0 indicate stable, efficient control.
- 100% of episodes met the success criterion (a sketch of the success check follows this list).
- Evaluation Plot: (Figure: The robotic arm’s performance in the PandaReachJointsDense-v3 environment.)
- Example Video: Panda-Gym Evaluation Video
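A minimal sketch of how such a success rate can be checked; panda-gym reports goal completion through the is_success entry of the step info dict, and the checkpoint name reuses the illustrative one from the training sketch above:

```python
import gymnasium as gym
import panda_gym  # noqa: F401
from stable_baselines3 import A2C

env = gym.make("PandaReachJointsDense-v3")
model = A2C.load("a2c_panda_reach")

n_episodes, successes = 100, 0
for _ in range(n_episodes):
    obs, _ = env.reset()
    done, success = False, False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        success = success or bool(info.get("is_success", False))
        done = terminated or truncated
    successes += success

print(f"Success rate: {successes / n_episodes:.0%}")
```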
Conclusion
This project successfully applied reinforcement learning techniques to control both a CartPole system and a Panda-Gym robotic arm using REINFORCE and A2C algorithms. The experiments demonstrated that:
- REINFORCE efficiently learned an optimal policy for CartPole but required more episodes to stabilize.
- A2C (Stable-Baselines3) improved training stability and efficiency, reaching optimal performance faster.
- Weights & Biases (W&B) was valuable for tracking and analyzing training performance in real-time.
- The Panda-Gym experiment showed that A2C effectively trained the robotic arm to reach targets in 3D space.
These results confirm the effectiveness of policy-gradient-based RL methods for solving control and robotics problems, highlighting the advantages of actor-critic approaches in stabilizing learning. Future work could explore more advanced algorithms such as PPO or SAC for comparison and extend the experiments to more complex robotic tasks.