Commit 60a4ffb0 authored by Jules Coulon

Commit of the end of the course

parent 180efac4
Branches edgedg
@@ -2,226 +2,23 @@
MSO 3.4 Apprentissage Automatique
### Jules Coulon
In this hands-on project, we will first implement a simple RL algorithm and apply it to solve the CartPole-v1 environment. Once we become familiar with the basic workflow, we will learn to use various tools for machine learning model training, monitoring, and sharing, by applying these tools to train a robotic arm.
## To be handed in
This work must be done individually. The expected output is a repository named `hands-on-rl` on https://gitlab.ec-lyon.fr.
We assume that `git` is installed, and that you are familiar with the basic `git` commands. (Optionally, you can use GitHub Desktop.)
We also assume that you have access to the [ECL GitLab](https://gitlab.ec-lyon.fr/). If necessary, please consult [this tutorial](https://gitlab.ec-lyon.fr/edelland/inf_tc2/-/blob/main/Tutoriel_gitlab/tutoriel_gitlab.md).
Your repository must contain a `README.md` file that explains **briefly** the successive steps of the project. It must be private, so you need to add your teacher as "developer" member.
Throughout the subject, you will find a 🛠 symbol indicating that a specific production is expected.
The last commit is due before 11:59 pm on March 17, 2025. Subsequent commits will not be considered.
> ⚠️ **Warning**
> Ensure that you only commit the files that are requested. For example, your directory should not contain the generated `.zip` files, nor the `runs` folder... At the end, your repository must contain one `README.md`, three python scripts, and optionally image files for the plots.
## Before you start
Make sure you know the basics of Reinforcement Learning. In case of need, you can refer to the [introduction of the Hugging Face RL course](https://huggingface.co/blog/deep-rl-intro).
## Introduction to Gym
[Gym](https://gymnasium.farama.org/) is a framework for developing and evaluating reinforcement learning environments. It offers various environments, including classic control and toy text scenarios, to test RL algorithms.
### Installation
We recommend using Python virtual environments to install the required modules: https://docs.python.org/3/library/venv.html
First, install PyTorch: https://pytorch.org/get-started/locally.
Then install the following modules:
```sh
pip install gym==0.26.2
```
Also install pyglet for rendering.
```sh
pip install pyglet==2.0.10
```
```sh
pip install numpy==1.26.4
```
If needed
```sh
pip install pygame==2.5.2
```
```sh
pip install PyQt5
```
```sh
pip install opencv-python
```
### Usage
Here is an example of how to use Gym to interact with the `CartPole-v1` environment using random actions ([documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/)):
```python
import gym

# Create the environment
env = gym.make("CartPole-v1", render_mode="human")
# Reset the environment and get the initial observation
# (with gym 0.26, reset() returns an (observation, info) tuple)
observation, info = env.reset()
for _ in range(100):
    # Select a random action from the action space
    action = env.action_space.sample()
    # Apply the action to the environment
    # Returns next observation, reward, the terminated and truncated flags
    # (indicating if the episode has ended), and an additional info dictionary
    observation, reward, terminated, truncated, info = env.step(action)
    # Render the environment to visualize the agent's behavior
    env.render()
    if terminated or truncated:
        # Episode ended (failure or time limit) before the 100 steps
        break
env.close()
```
## REINFORCE
The REINFORCE algorithm (also known as Vanilla Policy Gradient) is a policy gradient method that optimizes the policy directly using gradient descent.
Training CartPole with REINFORCE for 500 episodes of at most 500 steps each gives the reward and loss curves below. The code is in this [python file](reinforce_cartpole.py).
![image1](./picture/reward.png)
![image2](./picture/loss.png)
The following is the pseudocode of the REINFORCE algorithm:
```txt
Setup the CartPole environment
Setup the agent as a simple neural network with:
- One fully connected layer with 128 units and ReLU activation followed by a dropout layer
- One fully connected layer followed by softmax activation
Repeat 500 times:
Reset the environment
Reset the buffer
Repeat until the end of the episode:
Compute action probabilities
Sample the action based on the probabilities and store its probability in the buffer
Step the environment with the action
Compute and store in the buffer the return using gamma=0.99
Normalize the return
Compute the policy loss as -sum(log(prob) * return)
Update the policy using an Adam optimizer and a learning rate of 5e-3
Save the model weights
```
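
The return, normalization, and loss steps of the pseudocode above can be sketched in plain Python, without PyTorch. This is only an illustration under stated assumptions: the helper names `discounted_returns`, `normalize`, and `policy_loss` are hypothetical, and "normalize" is read here as standardization to zero mean and unit variance, which the pseudocode does not specify.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each return reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize(values, eps=1e-8):
    """Standardize values (assumption: zero mean, unit variance; eps avoids dividing by zero)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(variance) + eps) for v in values]


def policy_loss(action_probs, returns):
    """REINFORCE loss -sum(log(prob) * return), using the natural logarithm."""
    return -sum(math.log(p) * g for p, g in zip(action_probs, returns))


# CartPole gives a reward of +1 at every step; here a 3-step episode.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards)
normalized = normalize(returns)
# Hypothetical probabilities of the actions that were sampled at each step.
loss = policy_loss([0.6, 0.5, 0.7], normalized)
print(returns, normalized, loss)
```

Computing the returns backwards takes one pass over the episode, instead of updating every earlier step each time a new reward arrives. In a PyTorch implementation the same three steps would operate on tensors so that the loss can be backpropagated.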
To learn more about REINFORCE, you can refer to [this unit](https://huggingface.co/learn/deep-rl-course/unit4/introduction).
> 🛠 **To be handed in**
> Use PyTorch to implement REINFORCE and solve the CartPole environment. Share the code in `reinforce_cartpole.py`, and share a plot showing the total reward across episodes in the `README.md`. Also, share a file `reinforce_cartpole.pth` containing the learned weights. For saving and loading PyTorch models, check [this tutorial](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference).
## Model Evaluation
Now that you have trained your model, it is time to evaluate its performance. Run it with rendering for a few trials and see if the policy is capable of completing the task.
> 🛠 **To be handed in**
> Implement a script that loads your saved model and uses it to solve the CartPole environment. Run 100 evaluations and share the final success rate across all evaluations in the `README.md`. Share the code in `evaluate_reinforce_cartpole.py`.
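
One possible way to compute the success rate is sketched below, independently of Gym and PyTorch. It assumes that an episode counts as a success when it reaches the time limit (`truncated`) without failing (`terminated`), which is only one reasonable reading of "success". `ToyEnv` and `success_rate` are hypothetical names: `ToyEnv` is a stand-in that mimics the gym 0.26 `reset`/`step` return values so the example runs on its own.

```python
class ToyEnv:
    """Hypothetical stand-in environment that follows the gym 0.26 reset/step API."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        # Any non-zero action makes the episode fail immediately.
        terminated = action != 0
        # Otherwise the episode is cut off once the time limit is reached.
        truncated = (not terminated) and self.t >= self.horizon
        return float(self.t), 1.0, terminated, truncated, {}


def success_rate(env, policy, n_episodes=100):
    """Fraction of episodes that reach the time limit without terminating."""
    successes = 0
    for _ in range(n_episodes):
        observation, _ = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            action = policy(observation)
            observation, _, terminated, truncated, _ = env.step(action)
        successes += int(truncated and not terminated)
    return successes / n_episodes


# A policy that always picks action 0 never fails in ToyEnv.
print(success_rate(ToyEnv(), policy=lambda observation: 0))
```

With a real setup, `env` would be the CartPole environment and `policy` a function that runs the loaded network on the observation and returns the most probable action.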
## Familiarization with a complete RL pipeline: Application to training a robotic arm
In this section, you will use the Stable-Baselines3 package to train a robotic arm using RL. You'll get familiar with several widely-used tools for training, monitoring and sharing machine learning models.
### Get familiar with Stable-Baselines3
Stable-Baselines3 (SB3) is a high-level RL library that provides various algorithms and integrated tools to easily train and test reinforcement learning models.
#### Installation
```sh
pip install stable-baselines3
pip install stable-baselines3[extra]
pip install moviepy
```
#### Usage
Use the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/) to implement the code to solve the CartPole environment with the Advantage Actor-Critic (A2C) algorithm.
> 🛠 **To be handed in**
> Store the code in `a2c_sb3_cartpole.py`. Unless otherwise stated, you'll work upon this file for the next sections.
### Get familiar with Hugging Face Hub
Hugging Face Hub is a platform for easy sharing and versioning of trained machine learning models. With Hugging Face Hub, you can quickly and easily share your models with others and make them usable through the API. For example, see the trained A2C agent for CartPole: https://huggingface.co/sb3/a2c-CartPole-v1. Hugging Face Hub provides an API to download and upload SB3 models.
#### Installation of `huggingface_sb3`
```sh
pip install huggingface-sb3==2.3.1
```
#### Upload the model on the Hub
Follow the [Hugging Face Hub documentation](https://huggingface.co/docs/hub/stable-baselines3) to upload the previously learned model to the Hub.
> 🛠 **To be handed in**
> Link the trained model in the `README.md` file.
> 📝 **Note**
> [RL-Zoo3](https://stable-baselines3.readthedocs.io/en/master/guide/rl_zoo.html) provides more advanced features to save hyperparameters, generate renderings and metrics. Feel free to try them.
### Get familiar with Weights & Biases
Weights & Biases (W&B) is a tool for machine learning experiment management. With W&B, you can track and compare your experiments, visualize your model training and performance.
#### Installation
You'll need to install both `wandb` and `tensorboard`.
```shell
pip install wandb tensorboard
```
Use the documentation of [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) and [Weights & Biases](https://docs.wandb.ai/guides/integrations/stable-baselines-3) to track the CartPole training. Make the run public.
> 🛠 **To be handed in**
> Share the link of the wandb run in the `README.md` file.
> ⚠️ **Warning**
> Make sure to make the run public!
### Full workflow with panda-gym
[Panda-gym](https://github.com/qgallouedec/panda-gym) is a collection of environments for robotic simulation and control. It provides a range of challenges for training robotic agents in a simulated environment. In this section, you will get familiar with one of the environments provided by panda-gym, the `PandaReachJointsDense-v3`. The objective is to learn how to reach any point in 3D space by directly controlling the robot's articulations.
#### Installation
```shell
pip install panda-gym==3.0.7
```
#### Train, track, and share
Use the Stable-Baselines3 package to train an A2C model on the `PandaReachJointsDense-v3` environment. 500k timesteps should be enough. Track the training with Weights & Biases. Once the training is over, upload the trained model to the Hub.
> 🛠 **To be handed in**
> Share all the code in `a2c_sb3_panda_reach.py`. Share the link of the wandb run and the trained model in the `README.md` file.
## Contribute
This tutorial may contain errors, inaccuracies, typos or areas for improvement. Feel free to contribute to its improvement by opening an issue.
## Author
Quentin Gallouédec
Updates by Bruno Machado, Léo Schneider, Emmanuel Dellandréa
## License
MIT
The REINFORCE model is evaluated with [evaluate_reinforce_cartpole](evaluate_reinforce_cartpole.py) and reaches a success rate of 100%.
### CartPole with SB3
This time, CartPole is trained with the SB3 A2C model in [a2c_sb3_cartpole](a2c_sb3_cartpole.py).
The model is available on Hugging Face at this [link](https://huggingface.co/JulesCoulon/A2C_CartPole/tree/main) and the training run is tracked with wandb at this [link](https://wandb.ai/julescoulon10-centrale-lyon/cartpole/runs/axqnijqu?nw=nwuserjulescoulon10).
### Panda Reach with SB3
PandaReach is trained with the SB3 A2C model for 500,000 timesteps in [a2c_sb3_panda_reach](a2c_sb3_panda_reach.py).
The model is available on Hugging Face at this [link](https://huggingface.co/JulesCoulon/A2C_CartPole/tree/main) and the training run is tracked with wandb at this [link](https://wandb.ai/julescoulon10-centrale-lyon/cartpole/runs/axqnijqu?nw=nwuserjulescoulon10).
\ No newline at end of file
import gym
from stable_baselines3 import A2C
import wandb
from wandb.integration.sb3 import WandbCallback
from stable_baselines3.common.env_util import make_vec_env
# start a new wandb run to track this script
config = {
    "policy_type": "MlpPolicy",
    "total_timesteps": 25000,
    "env_name": "CartPole-v1",
}
run = wandb.init(
    project="cartpole",
    config=config,
    sync_tensorboard=True,
    monitor_gym=True,
    save_code=True,
)
env = make_vec_env("CartPole-v1", n_envs=4)
# Train the model
model = A2C(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(),
)
run.finish()
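if False:  # set to True to upload the trained model to the Hugging Face Hub
    import os
    from huggingface_sb3 import push_to_hub
    from huggingface_hub import login

    # Save the model so that the file pushed below exists (SB3 appends ".zip")
    model.save("a2c_cartpole")
    # Read the access token from the environment instead of hard-coding it
    # (the variable name HF_TOKEN is an assumption)
    login(token=os.environ["HF_TOKEN"])
    push_to_hub(
        repo_id="JulesCoulon/A2C_CartPole",
        filename="a2c_cartpole.zip",
        commit_message="Added A2C model for CartPole with Stable Baselines3",
    )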
\ No newline at end of file
import gym
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
import wandb
from wandb.integration.sb3 import WandbCallback
config = {
    "policy_type": "MultiInputPolicy",
    "total_timesteps": 500000,
    "env_name": "PandaReachJointsDense-v3",
}
run = wandb.init(
    project="pandareach",
    config=config,
    sync_tensorboard=True,
    monitor_gym=True,
    save_code=True,
)


def make_env():
    env = gym.make(config["env_name"])
    env = Monitor(env)  # record stats such as returns
    return env


# Vectorized environment wrapped with Monitor
env = DummyVecEnv([make_env])
model = A2C(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(),
)
run.finish()
from tqdm import tqdm
import torch
import gym
from torch.distributions import Categorical
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        # First fully connected layer with 128 units and ReLU activation
        self.fc1 = nn.Linear(4, 128)  # input: the 4 observation values
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.25)
        # Second fully connected layer followed by a softmax activation
        self.fc2 = nn.Linear(128, 2)  # output: 2 actions in the action space
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.softmax(x)
        return x
model = SimpleNN()
model.load_state_dict(torch.load("reinforce_cartpole.pth"))
model.eval()
# Create the environment
env = gym.make("CartPole-v1", render_mode=None)
achievement = 0
for ep in tqdm(range(100)):
    observation = env.reset()[0]
    terminated = False
    for step in range(500):
        # Inference only: no gradients are needed
        with torch.no_grad():
            prob = model(torch.tensor(observation))
        # Choose the action with the highest probability
        action = torch.argmax(prob)
        observation, reward, terminated, truncated, info = env.step(action.item())
        if terminated:
            print("Episode failed at step ", step)
            break
    if not terminated:
        # The pole stayed up for the full 500 steps
        print("Episode succeeded after ", step + 1, " steps")
        achievement += 1
# 100 episodes are run, so the count is directly a percentage
print("Achievement rate : ", achievement, "%")
env.close()
\ No newline at end of file
import gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import torch.nn.functional as F
import matplotlib.pyplot as plt
env = gym.make("CartPole-v1", render_mode=None)
# Model definition
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        # First fully connected layer with 128 units and ReLU activation
        self.fc1 = nn.Linear(4, 128)  # input: the 4 observation values
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.25)
        # Second fully connected layer followed by a softmax activation
        self.fc2 = nn.Linear(128, 2)  # output: 2 actions in the action space
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.softmax(x)
        return x
# Instantiate the model
model = SimpleNN()
gamma = 0.99
# Define the optimizer (the loss is computed by hand in the training loop)
optimizer = optim.Adam(model.parameters(), lr=5e-3)
# Display the model
print(model)
nb_episode = 500
max_episode_steps = 500
total_rewards = []
total_loss = []
for ep in range(nb_episode):
    observation = env.reset()[0]
    buffer = torch.zeros(max_episode_steps + 1)  # discounted return of each step
    probs = torch.zeros(max_episode_steps + 1)   # probability of each sampled action
    step = 0
    terminated = False
    while not terminated and step < max_episode_steps:
        prob = model(torch.tensor(observation))
        m = Categorical(prob)
        action = m.sample()
        probs[step] = prob[action]
        observation, reward, terminated, truncated, info = env.step(action.item())
        # The reward received at `step` contributes to the return of every
        # earlier step i, discounted by gamma ** (step - i)
        for i in range(step + 1):
            buffer[i] += reward * gamma ** (step - i)
        step += 1
        # env.render()
    total_rewards.append(step)
    probs = probs[:step]
    buffer = buffer[:step]
    # Normalize the returns (F.normalize returns a new tensor)
    buffer = F.normalize(buffer, dim=0)
    # Compute the policy loss as -sum(log(prob) * return), with the natural log
    logs = torch.log(probs)
    loss = -torch.sum(logs * buffer)
    total_loss.append(loss.detach())
    # Print progress
    if (ep + 1) % 50 == 0:
        print("Episode : {} / {}".format(ep + 1, nb_episode))
        print("Reward : ", step)
        print("Loss : ", round(loss.item(), 2))
    # Perform gradient descent to update the neural network
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Save the learned weights so that the evaluation script can load them
torch.save(model.state_dict(), "reinforce_cartpole.pth")
# Plot the evolution of learning: reward and loss
# Reward
plt.plot(total_rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title("Reward per episode")
plt.show()
# Loss
total_losses = [loss.item() for loss in total_loss]
plt.plot(total_losses)
plt.xlabel('Episode')
plt.ylabel('Loss')
plt.title("Loss per episode")
plt.show()