Hands-On Reinforcement Learning – TD 1

    This repository contains my individual work for the Hands-On Reinforcement Learning project. The project explores reinforcement learning (RL) techniques applied to the CartPole and Panda-Gym robotic arm environments. The goal is to implement and evaluate RL models using both custom PyTorch implementations and high-level libraries like Stable-Baselines3.


    1. REINFORCE on CartPole

    Implementation

    • File: reinforce_cartpole.ipynb
      The REINFORCE (Vanilla Policy Gradient) algorithm was implemented using PyTorch. The model learns an optimal policy for solving the CartPole-v1 environment by updating the policy network using gradients computed from episode returns.
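  Before the training details below, here is a minimal sketch of the core REINFORCE update, assuming a small MLP policy, the Gymnasium CartPole-v1 API, and illustrative hyperparameters (learning rate, discount factor); the notebook's actual network and settings may differ.

  ```python
  # Minimal REINFORCE sketch for CartPole-v1 (illustrative hyperparameters).
  import gymnasium as gym
  import torch
  import torch.nn as nn

  env = gym.make("CartPole-v1")
  policy = nn.Sequential(
      nn.Linear(4, 128), nn.ReLU(),
      nn.Linear(128, 2), nn.Softmax(dim=-1),
  )
  optimizer = torch.optim.Adam(policy.parameters(), lr=5e-3)
  gamma = 0.99

  for episode in range(500):
      obs, _ = env.reset()
      log_probs, rewards = [], []
      done = False
      while not done:
          probs = policy(torch.as_tensor(obs, dtype=torch.float32))
          dist = torch.distributions.Categorical(probs)
          action = dist.sample()
          log_probs.append(dist.log_prob(action))
          obs, reward, terminated, truncated, _ = env.step(action.item())
          rewards.append(reward)
          done = terminated or truncated

      # Discounted returns, computed backwards, then normalized.
      returns, G = [], 0.0
      for r in reversed(rewards):
          G = r + gamma * G
          returns.insert(0, G)
      returns = torch.tensor(returns)
      returns = (returns - returns.mean()) / (returns.std() + 1e-8)

      # Policy-gradient loss: maximize the sum of log-prob * return.
      loss = -(torch.stack(log_probs) * returns).sum()
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
  ```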

    Training Results

    • The model was trained for 500 episodes, showing a steady increase in total rewards. The goal (total reward = 500) was reached consistently after 400 episodes, confirming successful learning.
    • Training Plot: (Figure: Total rewards increase per episode, indicating successful learning.)

    Model Saving

    • The trained model is saved as: reinforce_cartpole.pth.
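  For reference, saving and reloading the weights with PyTorch would look roughly like this, continuing from the sketch above (the file name comes from the repository; the exact network class is whatever the notebook defines):

  ```python
  import torch

  # Save only the learned parameters (state_dict), the usual PyTorch convention.
  torch.save(policy.state_dict(), "reinforce_cartpole.pth")

  # Later: rebuild the same architecture, then load the weights for evaluation.
  policy.load_state_dict(torch.load("reinforce_cartpole.pth"))
  policy.eval()
  ```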

    Evaluation

    • File: evaluate_reinforce_cartpole.ipynb
      The model was evaluated over 100 episodes, with the success criterion being a total reward of 500; a minimal evaluation loop is sketched after this list.

    • Evaluation Results:

      • 100% of the episodes reached a total reward of 500, demonstrating the model’s reliability.
    • Evaluation Plot: (Figure: The model consistently reaches a total reward of 500 over 100 evaluation episodes.)

    • Example Video: REINFORCE CartPole evaluation video.
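  A hedged sketch of such an evaluation loop, continuing from the training sketch above (greedy action selection at evaluation time is one reasonable choice; the notebook may sample from the policy instead):

  ```python
  import gymnasium as gym
  import torch

  env = gym.make("CartPole-v1")
  policy.load_state_dict(torch.load("reinforce_cartpole.pth"))
  policy.eval()

  successes = 0
  for _ in range(100):
      obs, _ = env.reset()
      total_reward, done = 0.0, False
      while not done:
          with torch.no_grad():
              probs = policy(torch.as_tensor(obs, dtype=torch.float32))
          action = int(torch.argmax(probs))  # greedy action at evaluation time
          obs, reward, terminated, truncated, _ = env.step(action)
          total_reward += reward
          done = terminated or truncated
      successes += total_reward >= 500  # success criterion from this report

  print(f"Success rate: {successes}/100")
  ```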


    2. A2C with Stable-Baselines3 on CartPole

    Implementation

    • File: a2c_sb3_cartpole.ipynb
      Implemented Advantage Actor-Critic (A2C) using Stable-Baselines3, which combines value-based and policy-based RL methods.
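  A minimal version of the training call with Stable-Baselines3, assuming default A2C hyperparameters and the reported 500,000-timestep budget (the notebook may tune hyperparameters or wrap the environment differently):

  ```python
  import gymnasium as gym
  from stable_baselines3 import A2C

  env = gym.make("CartPole-v1")

  # Default MLP actor-critic policy; 500,000 timesteps as reported below.
  model = A2C("MlpPolicy", env, verbose=1)
  model.learn(total_timesteps=500_000)
  model.save("a2c_sb3_cartpole")  # illustrative file name
  ```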

    Training Results

    • The model was trained for 500,000 timesteps and consistently reached a total reward of 500 after roughly 400 episodes. Training continued up to about 1,400 episodes, confirming stable convergence comparable to the REINFORCE approach.
    • Training Plot: (Figure: A2C training performance over time.)

    Evaluation

    • The trained model was evaluated over 100 episodes, achieving 100% success: every episode reached a total reward of 500.
    • Evaluation Plot: (Figure: The A2C model consistently achieves perfect performance over 100 episodes.)
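  One way to reproduce this evaluation is Stable-Baselines3's built-in helper; a sketch assuming the model was saved under the illustrative name used above:

  ```python
  import gymnasium as gym
  from stable_baselines3 import A2C
  from stable_baselines3.common.evaluation import evaluate_policy

  model = A2C.load("a2c_sb3_cartpole")
  eval_env = gym.make("CartPole-v1")

  # Mean and standard deviation of the return over 100 evaluation episodes.
  mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)
  print(f"mean reward = {mean_reward:.1f} +/- {std_reward:.1f}")
  ```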

    Model Upload


    3. Tracking with Weights & Biases (W&B) on CartPole

    Training with W&B

    • File: a2c_sb3_cartpole.ipynb
      The A2C training process was tracked using Weights & Biases (W&B) to monitor performance metrics.
    • W&B Run: W&B run for A2C on CartPole.
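  W&B's Stable-Baselines3 integration is usually wired in through a callback; the sketch below assumes the official `WandbCallback` and uses placeholder project and path names:

  ```python
  import gymnasium as gym
  import wandb
  from wandb.integration.sb3 import WandbCallback
  from stable_baselines3 import A2C

  config = {"policy": "MlpPolicy", "total_timesteps": 500_000, "env": "CartPole-v1"}
  run = wandb.init(
      project="hands-on-rl",     # placeholder project name
      config=config,
      sync_tensorboard=True,     # forward SB3's tensorboard metrics to W&B
      monitor_gym=True,
  )

  env = gym.make(config["env"])
  model = A2C(config["policy"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
  model.learn(
      total_timesteps=config["total_timesteps"],
      callback=WandbCallback(model_save_path=f"models/{run.id}", verbose=2),
  )
  run.finish()
  ```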

    Training Analysis

    • Observations:
      • The training curve indicates that the A2C model stabilizes after 1,300 episodes.
      • The model exhibits strong and consistent performance.
    • Training Plot: (Figure: W&B training curve for the A2C run.)

    Model Upload

    Evaluation

    • Evaluation Results:
      • 100% of episodes reached a total reward of 500, confirming the model’s reliability.
    • Evaluation Plot: (Figure: Evaluation results tracked using W&B.)
    • Example Video: W&B evaluation video.
      Compared with the REINFORCE approach, the A2C model stabilizes the balancing process more efficiently.

    4. Full Workflow with Panda-Gym

    Implementation

    • File: a2c_sb3_panda_reach.ipynb
      Used Stable-Baselines3 to train an A2C model on the PandaReachJointsDense-v3 environment, controlling a robotic arm to reach a target in 3D space.
    • Training Duration: 500,000 timesteps
    • Integrated Weights & Biases for tracking.
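  A sketch of the corresponding training script; it assumes panda-gym 3.x (which registers its environments with Gymnasium on import) and SB3's MultiInputPolicy for the dictionary observations:

  ```python
  import gymnasium as gym
  import panda_gym  # noqa: F401  (importing registers the Panda environments)
  from stable_baselines3 import A2C

  env = gym.make("PandaReachJointsDense-v3")

  # Observations are dictionaries (observation / achieved_goal / desired_goal),
  # so the multi-input policy is required.
  model = A2C("MultiInputPolicy", env, verbose=1)
  model.learn(total_timesteps=500_000)
  model.save("a2c_sb3_panda_reach")  # illustrative file name
  ```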

    Training Results

    • W&B Run: Panda-Gym W&B run.
    • Observations:
      • The training curve shows consistent improvement over time.
      • The model successfully learns to reach the target efficiently.
      • It stabilizes after 2,500 episodes, with minor fluctuations in rewards.
    • Training Plot: (Figure: The robotic arm’s learning progress over 500,000 timesteps.)

    Model Upload and Evaluation

    Evaluation

    • Evaluation Results:
      • The total reward per episode ranged between -1 and 0, indicating stable, accurate control (with the dense reward, the return roughly accumulates the negative distance to the target, so values close to zero mean the arm reaches the target within a few steps).
      • 100% of episodes met the success criterion; a reproduction sketch follows this list.
    • Evaluation Plot: (Figure: The robotic arm’s performance in the PandaReachJointsDense-v3 environment.)
    • Example Video: Panda-Gym evaluation video.
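  A sketch of a manual evaluation loop; it assumes the Panda environments expose an `is_success` flag in `info` and reuses the illustrative model file name from the training sketch:

  ```python
  import gymnasium as gym
  import panda_gym  # noqa: F401
  from stable_baselines3 import A2C

  env = gym.make("PandaReachJointsDense-v3")
  model = A2C.load("a2c_sb3_panda_reach")

  successes, returns = 0, []
  for _ in range(100):
      obs, _ = env.reset()
      done, total_reward, info = False, 0.0, {}
      while not done:
          action, _ = model.predict(obs, deterministic=True)
          obs, reward, terminated, truncated, info = env.step(action)
          total_reward += reward
          done = terminated or truncated
      successes += int(info.get("is_success", False))
      returns.append(total_reward)

  print(f"success rate: {successes}/100, mean return: {sum(returns) / len(returns):.2f}")
  ```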

    Conclusion

    This project successfully applied reinforcement learning techniques to control both a CartPole system and a Panda-Gym robotic arm using REINFORCE and A2C algorithms. The experiments demonstrated that:

    • REINFORCE learned a successful policy for CartPole, but required more episodes than A2C to stabilize.
    • A2C (Stable-Baselines3) improved training stability and efficiency, reaching optimal performance faster.
    • Weights & Biases (W&B) was valuable for tracking and analyzing training performance in real-time.
    • The Panda-Gym experiment showed that A2C effectively trained the robotic arm to reach targets in 3D space.

    These results confirm the effectiveness of policy-gradient-based RL methods for solving control and robotics problems, highlighting the advantages of actor-critic approaches in stabilizing learning. Future work could explore more advanced RL algorithms (e.g., PPO, SAC) and extend experiments to more complex robotic tasks.