Hands-On Reinforcement Learning – TD 1

    This repository contains my individual work for the Hands-On Reinforcement Learning project. The project explores reinforcement learning (RL) techniques applied to the CartPole and Panda-Gym robotic arm environments. The goal is to implement and evaluate RL models using both custom PyTorch implementations and high-level libraries like Stable-Baselines3.


    1. REINFORCE on CartPole

    Implementation

    • File: reinforce_cartpole.ipynb
      The REINFORCE (Vanilla Policy Gradient) algorithm was implemented using PyTorch. The model learns an optimal policy for solving the CartPole-v1 environment by updating the policy network using gradients computed from episode returns.
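
    The notebook itself is not reproduced here, but a minimal sketch of the REINFORCE update on CartPole-v1 might look as follows. The network size, learning rate, and episode count are illustrative assumptions, not necessarily the exact values used in reinforce_cartpole.ipynb.

    ```python
    import gymnasium as gym
    import torch
    import torch.nn as nn

    # Minimal REINFORCE sketch on CartPole-v1 (illustrative hyperparameters).
    env = gym.make("CartPole-v1")
    policy = nn.Sequential(
        nn.Linear(env.observation_space.shape[0], 128),
        nn.ReLU(),
        nn.Linear(128, env.action_space.n),
        nn.Softmax(dim=-1),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=5e-3)
    gamma = 0.99

    for episode in range(500):
        obs, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            probs = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated

        # Discounted return for each step of the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.as_tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize

        # Policy-gradient loss: -sum(log pi(a|s) * return).
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```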

    Training Results

    • The model was trained for 500 episodes, showing a steady increase in total rewards. The goal (total reward = 500) was reached consistently after 400 episodes, confirming successful learning.
    • Training Plot: (Figure: Total rewards increase per episode, indicating successful learning.)

    Model Saving

    • The trained model is saved as: reinforce_cartpole.pth.
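
    Continuing the training sketch above, the weights could be saved under this file name with torch.save (policy refers to the network from that sketch):

    ```python
    # Save the trained policy weights under the file name used in this repository.
    torch.save(policy.state_dict(), "reinforce_cartpole.pth")
    ```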

    Evaluation

    • File: evaluate_reinforce_cartpole.ipynb
      The model was evaluated over 100 episodes, with the success criterion being a total reward of 500 (a minimal evaluation sketch follows this list).

    • Evaluation Results:

      • 100% of the episodes reached a total reward of 500, demonstrating the model’s reliability.
    • Evaluation Plot: (Figure: The model consistently reaches a total reward of 500 over 100 evaluation episodes.)

    • Example Video: REINFORCE CartPole Evaluation Video
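
    As referenced above, a minimal sketch of this evaluation protocol, assuming the saved weights match the policy architecture from the training sketch:

    ```python
    import gymnasium as gym
    import torch
    import torch.nn as nn

    # Evaluation sketch: 100 episodes, success = total reward of 500.
    env = gym.make("CartPole-v1")
    policy = nn.Sequential(                      # same architecture as in training
        nn.Linear(env.observation_space.shape[0], 128),
        nn.ReLU(),
        nn.Linear(128, env.action_space.n),
        nn.Softmax(dim=-1),
    )
    policy.load_state_dict(torch.load("reinforce_cartpole.pth"))
    policy.eval()

    successes = 0
    for _ in range(100):
        obs, _ = env.reset()
        total_reward, done = 0.0, False
        while not done:
            with torch.no_grad():
                probs = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = torch.argmax(probs).item()  # greedy action at evaluation time
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        successes += int(total_reward >= 500)

    print(f"Success rate over 100 episodes: {successes}%")
    ```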


    2. A2C with Stable-Baselines3 on CartPole

    Implementation

    • File: a2c_sb3_cartpole.ipynb
      Implemented Advantage Actor-Critic (A2C) with Stable-Baselines3; A2C combines a value-based critic with a policy-based actor.
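
    A minimal sketch of this step (library defaults for all hyperparameters; the saved file name is illustrative, not necessarily the one used in the notebook):

    ```python
    import gymnasium as gym
    from stable_baselines3 import A2C

    # A2C training sketch with Stable-Baselines3 (500,000 timesteps as reported above).
    env = gym.make("CartPole-v1")
    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=500_000)
    model.save("a2c_sb3_cartpole")  # placeholder file name
    ```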

    Training Results

    • The model was trained for 500,000 timesteps, reaching a total reward of 500 consistently after 400 episodes; training continued for 1,400 episodes in total, confirming stable convergence similar to the REINFORCE approach.
    • Training Plot: (Figure: A2C training performance over time.)

    Evaluation

    • The trained model was evaluated and achieved 100% success: all episodes reached a total reward of 500 (see the evaluation sketch after this list).
    • Evaluation Plot: (Figure: A2C model consistently achieves perfect performance over 100 episodes.)
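
    As referenced above, a sketch of the evaluation step using Stable-Baselines3's built-in evaluate_policy helper (the model file name is illustrative):

    ```python
    import gymnasium as gym
    from stable_baselines3 import A2C
    from stable_baselines3.common.evaluation import evaluate_policy

    # Evaluate the trained A2C model over 100 episodes.
    env = gym.make("CartPole-v1")
    model = A2C.load("a2c_sb3_cartpole")  # placeholder file name
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
    print(f"Mean reward over 100 episodes: {mean_reward:.1f} +/- {std_reward:.1f}")
    ```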

    Model Upload
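
    The link to the uploaded model is not reproduced here. One common way to publish a Stable-Baselines3 model is the huggingface_sb3 helper, sketched below with a placeholder repository id and file name:

    ```python
    from huggingface_sb3 import push_to_hub

    # Upload the saved model archive to the Hugging Face Hub.
    # The repo_id and filename below are placeholders, not the actual repository used here.
    push_to_hub(
        repo_id="<username>/a2c-sb3-cartpole",
        filename="a2c_sb3_cartpole.zip",
        commit_message="Add trained A2C CartPole model",
    )
    ```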


    3. Tracking with Weights & Biases (W&B) on CartPole

    Training with W&B

    • File: a2c_sb3_cartpole.ipynb
      The A2C training process was tracked using Weights & Biases (W&B) to monitor performance metrics (a minimal integration sketch follows this list).
    • W&B Run: W&B Run for A2C CartPole
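
    A minimal sketch of the W&B integration around SB3 training (the project name and logging options are illustrative, not necessarily those used in the notebook):

    ```python
    import gymnasium as gym
    import wandb
    from wandb.integration.sb3 import WandbCallback
    from stable_baselines3 import A2C

    # Start a W&B run and sync SB3's TensorBoard metrics to it.
    run = wandb.init(project="a2c-cartpole", sync_tensorboard=True, monitor_gym=True)

    env = gym.make("CartPole-v1")
    model = A2C("MlpPolicy", env, verbose=1, tensorboard_log=f"runs/{run.id}")
    model.learn(total_timesteps=500_000, callback=WandbCallback(verbose=2))
    run.finish()
    ```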

    Training Analysis

    • Observations:
      • The training curve indicates that the A2C model stabilizes after 1,300 episodes.
      • The model exhibits strong and consistent performance.
    • Training Plot: W&B Training Plot

    Model Upload

    Evaluation

    • Evaluation Results:
      • 100% of episodes reached a total reward of 500, confirming the model’s reliability.
    • Evaluation Plot: (Figure: Evaluation results tracked using W&B.)
    • Example Video: W&B Evaluation Video
      Compared with the REINFORCE agent, the A2C model stabilizes the balancing process more efficiently.

    4. Full Workflow with Panda-Gym

    Implementation

    • File: a2c_sb3_panda_reach.ipynb
      Used Stable-Baselines3 to train an A2C model on the PandaReachJointsDense-v3 environment, controlling a robotic arm to reach a target in 3D space.
    • Training Duration: 500,000 timesteps
    • Integrated Weights & Biases for tracking (a minimal training sketch is shown below).
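
    A minimal sketch of this setup, assuming the dense-reward reach task and library defaults (the saved file name is a placeholder):

    ```python
    import gymnasium as gym
    import panda_gym  # registers the Panda environments with Gymnasium
    from stable_baselines3 import A2C

    # A2C on PandaReachJointsDense-v3 for 500,000 timesteps, as reported above.
    # "MultiInputPolicy" is used because the observation is a goal-conditioned dict.
    env = gym.make("PandaReachJointsDense-v3")
    model = A2C("MultiInputPolicy", env, verbose=1)
    model.learn(total_timesteps=500_000)
    model.save("a2c_panda_reach")  # placeholder file name
    ```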

    Training Results

    • W&B Run: Panda-Gym W&B Run
    • Observations:
      • The training curve shows consistent improvement over time.
      • The model successfully learns to reach the target efficiently.
      • It stabilizes after 2,500 episodes, with minor fluctuations in rewards.
    • Training Plot: (Figure: Total training rewards of the robotic arm over 500,000 timesteps.)

    Model Upload and Evaluation

    Evaluation

    • Evaluation Results:
      • The total reward ranged between -1 and 0 across episodes; since the dense reward penalizes the distance to the target, values close to 0 indicate stable, precise control.
      • 100% of episodes met the success criteria.
    • Evaluation Plot: (Figure: The robotic arm’s performance in the PandaReachJointsDense-v3 environment.)
    • Example Video: Panda-Gym Evaluation Video

    Conclusion

    This project successfully applied reinforcement learning techniques to control both a CartPole system and a Panda-Gym robotic arm using REINFORCE and A2C algorithms. The experiments demonstrated that:

    • REINFORCE efficiently learned an optimal policy for CartPole but required more episodes to stabilize.
    • A2C (Stable-Baselines3) improved training stability and efficiency, reaching optimal performance faster.
    • Weights & Biases (W&B) was valuable for tracking and analyzing training performance in real-time.
    • The Panda-Gym experiment showed that A2C effectively trained the robotic arm to reach targets in 3D space.

    These results confirm the effectiveness of policy-gradient-based RL methods for solving control and robotics problems, highlighting the advantages of actor-critic approaches in stabilizing learning. Future work could explore more advanced RL algorithms (e.g., PPO, SAC) and extend experiments to more complex robotic tasks.
