The **REINFORCE (Vanilla Policy Gradient)** algorithm was implemented using PyTorch. The model learns an optimal policy for solving the **CartPole-v1** environment by updating the policy network using gradients computed from episode returns.
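For reference, the sketch below shows a minimal REINFORCE update of this kind; the network architecture, learning rate, and return normalization are assumptions and may differ from the actual implementation in this repository.

```python
# Hedged sketch of a REINFORCE update on CartPole-v1 (not the exact code in this
# repository): architecture, learning rate and return normalization are assumptions.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(                  # small MLP mapping state -> action probabilities
    nn.Linear(4, 128), nn.ReLU(),
    nn.Linear(128, 2), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-3)
gamma = 0.99

for episode in range(500):               # 500 training episodes, as reported below
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns, computed backwards from the end of the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalization (assumed)

    # Policy gradient loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```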
### Training Results
- The model was trained for **500 episodes**, showing a steady increase in total rewards. The goal (total reward = 500) was reached consistently after **400 episodes**, confirming that the model successfully learned to balance the pole.
- **Training Plot:**

*(Figure: Total rewards increase per episode, indicating successful learning.)*
### Model Saving
- The trained model is saved as: `reinforce_cartpole.pth`.
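A hedged sketch of how these weights can be saved and reloaded (reusing the `policy` network from the sketch above; the notebook may instead save the full model object):

```python
import torch

# Save the trained policy weights under the file name used in this repository.
torch.save(policy.state_dict(), "reinforce_cartpole.pth")

# Later, e.g. in the evaluation notebook, restore them:
policy.load_state_dict(torch.load("reinforce_cartpole.pth"))
policy.eval()
```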
### Evaluation
- **File:** `evaluate_reinforce_cartpole.ipynb`
The model was evaluated over **100 episodes**, with the success criterion being a total reward of **500**.
- **Evaluation Results:**
- **100%** of the episodes reached a total reward of **500**, demonstrating the model’s reliability.
- **Evaluation Plot:**

*(Figure: The model consistently reaches a total reward of 500 over 100 evaluation episodes.)*
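A hedged sketch of this evaluation protocol (greedy action selection and the reuse of the `policy` network from the earlier sketch are assumptions):

```python
# Sketch of the 100-episode evaluation with the reward-500 success criterion.
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
successes = 0
for _ in range(100):
    obs, _ = env.reset()
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = int(torch.argmax(probs))               # greedy action (assumed)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    if total_reward >= 500:                             # success criterion from the report
        successes += 1

print(f"Success rate: {successes} / 100")
```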
### Implementation
- **File:** `a2c_sb3_cartpole.ipynb`
Implemented **Advantage Actor-Critic (A2C)** using **Stable-Baselines3**, which combines value-based and policy-based RL methods.
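A minimal sketch of this training setup, assuming SB3's default hyperparameters (the actual notebook may configure the model differently):

```python
# Sketch of A2C training on CartPole-v1 with Stable-Baselines3.
import gymnasium as gym
from stable_baselines3 import A2C

env = gym.make("CartPole-v1")
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)   # 500,000 timesteps, as reported below
model.save("a2c_sb3_cartpole")         # writes a2c_sb3_cartpole.zip
```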
### Training Results
- The model was trained for **500,000 timesteps**, reaching a total reward of **500** consistently after **400 episodes**. It continued training for **1,400 episodes**, confirming stable convergence similar to the REINFORCE approach.
- **Training Plot:**

*(Figure: A2C training performance over time.)*
### Evaluation
- The trained model was evaluated, achieving **100% success**, with all episodes reaching a total reward of **500**.
*(Figure: A2C model consistently achieves perfect performance over 100 episodes.)*
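A hedged sketch of how such an evaluation can be run with SB3's `evaluate_policy` helper (the exact evaluation code may differ):

```python
# Sketch of a 100-episode evaluation of the saved A2C model.
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")
model = A2C.load("a2c_sb3_cartpole", env=env)
episode_rewards, _ = evaluate_policy(model, env, n_eval_episodes=100,
                                     return_episode_rewards=True)
success_rate = sum(r >= 500 for r in episode_rewards) / len(episode_rewards)
print(f"Episodes reaching a total reward of 500: {success_rate:.0%}")
```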
### Model Upload
- The trained A2C model is available on Hugging Face Hub:
...
...
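For reference, one possible way to push the saved model to the Hugging Face Hub using `huggingface_hub` (the repo id below is a placeholder, not the actual repository, and the original upload may have relied on other helpers such as `huggingface_sb3`):

```python
# Sketch of a model upload; "<username>/a2c-sb3-cartpole" is a placeholder repo id.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="<username>/a2c-sb3-cartpole", exist_ok=True)
api.upload_file(
    path_or_fileobj="a2c_sb3_cartpole.zip",   # file produced by model.save()
    path_in_repo="a2c_sb3_cartpole.zip",
    repo_id="<username>/a2c-sb3-cartpole",
)
```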
### Training Analysis
- **Observations:**
- The training curve indicates that the **A2C model stabilizes after 1,300 episodes**.
- The **performance remains stable** after convergence, with no degradation, and the model exhibits strong and consistent results.
- **Training Plot:**

*(Figure: Training performance tracked using W&B.)*
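A hedged sketch of how the W&B tracking can be wired into the SB3 training run (the project name is a placeholder and the actual run configuration may differ):

```python
# Sketch of SB3 + Weights & Biases tracking via the official wandb SB3 callback.
import gymnasium as gym
import wandb
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import A2C

run = wandb.init(project="a2c-cartpole", sync_tensorboard=True)  # placeholder project name

env = gym.make("CartPole-v1")
model = A2C("MlpPolicy", env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(total_timesteps=500_000, callback=WandbCallback(verbose=2))
run.finish()
```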
### Model Upload
- The trained A2C model (tracked with W&B) is available on Hugging Face Hub:
...
...
### Evaluation
- **Evaluation Results:**
- **100%** of episodes reached a total reward of **500**, further confirming that **A2C is highly stable and performs consistently well**.
- **Evaluation Plot:**

*(Figure: Evaluation results tracked using W&B.)*
- **Example Video:**

Thanks to its superior performance, the A2C model stabilizes the balancing process more efficiently than the REINFORCE approach.
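For reference, a rollout video like the one above can be recorded with Gymnasium's `RecordVideo` wrapper; this is a sketch assuming the saved CartPole model, and the original video may have been produced differently.

```python
# Sketch: record one greedy rollout of the trained A2C agent to a video file.
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
from stable_baselines3 import A2C

env = RecordVideo(gym.make("CartPole-v1", render_mode="rgb_array"),
                  video_folder="videos")
model = A2C.load("a2c_sb3_cartpole", env=env)

obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(int(action))
    done = terminated or truncated
env.close()   # finalizes and writes the recorded video
```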
---
...
...
### Implementation
- **File:** `a2c_sb3_panda_reach.ipynb`
Used **Stable-Baselines3** to train an **A2C model** on the **PandaReachJointsDense-v3** environment, which involves controlling a robotic arm to reach a target in **3D space**.
- **Training Duration:** **500,000 timesteps**
- The code integrates **Weights & Biases** for tracking (a sketch is shown below).
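A minimal sketch of this setup, assuming the `panda_gym` package for the environment and default A2C hyperparameters (W&B tracking can be attached via the same `WandbCallback` pattern shown earlier):

```python
# Sketch of A2C training on PandaReachJointsDense-v3; the environment uses
# dictionary observations, hence the MultiInputPolicy.
import gymnasium as gym
import panda_gym  # noqa: F401  (importing registers the Panda environments)
from stable_baselines3 import A2C

env = gym.make("PandaReachJointsDense-v3")
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)   # 500,000 timesteps, as reported above
model.save("a2c_sb3_panda_reach")
```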
This project implemented and evaluated RL models on the **CartPole** (discrete control) and **Panda-Gym** (continuous robotic control) environments, using both a **custom PyTorch implementation (REINFORCE)** and **Stable-Baselines3 (A2C)**, with training tracked in **Weights & Biases**. The key takeaways include:
✅ **Custom RL Implementation (REINFORCE)**
- Demonstrated a **gradual learning process** over 500 episodes.
- Achieved **100% success rate** in evaluation.
✅ **Stable-Baselines3 (A2C)**
- Achieved optimal performance **very quickly** compared to REINFORCE.
- The model remained **stable across multiple evaluation runs**.
✅ **Tracking with Weights & Biases**
- Provided **real-time tracking** and performance analysis.
- Confirmed the **stability and consistency** of the trained models.
✅ **Robotic Control with Panda-Gym**
- Successfully trained an **A2C agent** to control a robotic arm in **3D space**.
- Achieved a **97% success rate** in evaluation.
This project highlights the efficiency of **A2C over REINFORCE**, the benefits of **W&B tracking**, and the feasibility of **reinforcement learning in robotic control applications**. 🚀
Further improvements could include testing **PPO or SAC algorithms** for comparison and expanding experiments to **more complex robotic tasks**.