
Compare revisions

Commits on Source (3)
......@@ -2,209 +2,40 @@
MSO 3.4 Apprentissage Automatique
#
# Lab objective
In this hands-on project, we will first implement a simple RL algorithm and apply it to solve the CartPole-v1 environment. Once we become familiar with the basic workflow, we will learn to use various tools for machine learning model training, monitoring, and sharing, by applying these tools to train a robotic arm.
In this hands-on project, we first implement a simple RL algorithm and apply it to solve the CartPole-v1 environment. Once familiar with the basic workflow, we learn to use various tools for training, monitoring, and sharing machine learning models, by applying them to the training of a robotic arm.
## To be handed in
This work must be done individually. The expected output is a repository named `hands-on-rl` on https://gitlab.ec-lyon.fr.
## Details of the Python files
# reinforce_cartpole.py: implementation of the RL algorithm using PyTorch
We assume that `git` is installed, and that you are familiar with the basic `git` commands. (Optionally, you can use GitHub Desktop.)
We also assume that you have access to the [ECL GitLab](https://gitlab.ec-lyon.fr/). If necessary, please consult [this tutorial](https://gitlab.ec-lyon.fr/edelland/inf_tc2/-/blob/main/Tutoriel_gitlab/tutoriel_gitlab.md).
PolicyNetwork class: Defines a simple neural network that serves as the policy. It is a feedforward network with one hidden layer and a ReLU activation function.
Your repository must contain a `README.md` file that explains **briefly** the successive steps of the project. It must be private, so you need to add your teacher as "developer" member.
compute_returns function: Computes the discounted returns for a given episode. Discounted returns are a crucial component of the REINFORCE algorithm. The function computes the sum of rewards, with future rewards discounted by a factor (gamma), then normalizes these returns.
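Written out, with $r_t$ the reward at step $t$ of an episode of length $T$ and $\gamma$ the discount factor, the computation reads as follows. The symbols $G_t$, $\hat{G}_t$, $\mu_G$, $\sigma_G$ and $\varepsilon$ are notation introduced here for illustration: $\mu_G$ and $\sigma_G$ are the mean and standard deviation of the returns over the episode, and $\varepsilon$ is a small constant that avoids division by zero.

```math
\begin{aligned}
G_t &= \sum_{k=0}^{T-1-t} \gamma^{k}\, r_{t+k} = r_t + \gamma\, G_{t+1}, \qquad G_T = 0, \\
\hat{G}_t &= \frac{G_t - \mu_G}{\sigma_G + \varepsilon}.
\end{aligned}
```

The recursive form is why the returns can be accumulated in a single backward pass over the rewards.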
Throughout the subject, you will find a 🛠 symbol indicating that a specific production is expected.
reinforce function: Implements the REINFORCE algorithm. It iterates over a specified number of episodes, interacts with the environment, collects experience (states, actions, rewards), computes the returns, and updates the policy network using the computed returns and the log-probabilities.
The last commit is due before 11:59 pm on March 5, 2024. Subsequent commits will not be considered.
Main block: Trains the policy network using the REINFORCE algorithm, then plots the total reward obtained at each episode during training. It saves the generated plot in a directory named 'plots'.
> ⚠️ **Warning**
> Ensure that you only commit the files that are requested. For example, your directory should not contain the generated `.zip` files, nor the `runs` folder... At the end, your repository must contain one `README.md`, three python scripts, and optionally image files for the plots.
Plotting: Plots the total reward obtained at each episode during training.
## Before you start
Saving the plot: Saves the generated plot as a PNG file in the 'plots' directory.
Make sure you know the basics of Reinforcement Learning. In case of need, you can refer to the [introduction of the Hugging Face RL course](https://huggingface.co/blog/deep-rl-intro).
# a2c_sb3_cartpole.py:
implementation of the RL algorithm using stable_baselines3 and the A2C algorithm.
## Introduction to Gym
# a2c_sb3_panda_reach.py:
implementation of the previous model with the PandaReachJointsDense-v2 environment.
[Gym](https://gymnasium.farama.org/) is a framework for developing and evaluating reinforcement learning environments. It offers various environments, including classic control and toy text scenarios, to test RL algorithms.
### Installation
We recommend using Python virtual environments to install the required modules: https://docs.python.org/3/library/venv.html
First, install PyTorch: https://pytorch.org/get-started/locally.
## Link to Hugging Face:
https://huggingface.co/MohamedKhalil
Then install the following modules :
The call to the push_to_hub function did not succeed, so the model could not be shared on Hugging Face.
```sh
pip install gym==0.26.2
```
Also install pyglet for rendering.
```sh
pip install pyglet==2.0.10
```
If needed:
```sh
pip install pygame==2.5.2
```
```sh
pip install PyQt5
```
### Usage
Here is an example of how to use Gym to run the `CartPole-v1` environment with random actions ([documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/)):
```python
import gym

# Create the environment
env = gym.make("CartPole-v1", render_mode="human")

# Reset the environment and get the initial observation
# (since Gym 0.26, reset() returns an (observation, info) pair)
observation, info = env.reset()

for _ in range(100):
    # Select a random action from the action space
    action = env.action_space.sample()
    # Apply the action to the environment
    # Returns next observation, reward, terminated and truncated signals
    # (indicating if the episode has ended), and an additional info dictionary
    observation, reward, terminated, truncated, info = env.step(action)
    # Render the environment to visualize the agent's behavior
    env.render()
    if terminated:
        # Terminated before max step
        break

env.close()
```
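With the five-value `step` API, an episode ends when either `terminated` (failure or goal state) or `truncated` (time limit) is true. The sketch below runs one full episode under that convention. `CountdownEnv`, `run_episode` and `random_policy` are hypothetical names written for this example so that it runs without Gym installed; the stand-in environment only mimics the return shapes of `reset` and `step`, not CartPole's dynamics.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in mimicking the Gym 0.26 reset/step return shapes."""

    def __init__(self, fail_at=None, max_steps=10):
        self.fail_at = fail_at      # step index at which the episode "fails"
        self.max_steps = max_steps  # time limit that triggers truncation
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}            # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = self.fail_at is not None and self.t >= self.fail_at
        truncated = not terminated and self.t >= self.max_steps
        # (observation, reward, terminated, truncated, info)
        return [float(self.t)], 1.0, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (total_reward, terminated, truncated)."""
    observation, info = env.reset()
    total_reward = 0.0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
    return total_reward, terminated, truncated


def random_policy(observation):
    # Two discrete actions, as in CartPole
    return random.choice([0, 1])
```

Swapping `CountdownEnv()` for an environment created with `gym.make` should leave `run_episode` unchanged, since it relies only on the `reset`/`step` return shapes.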
## REINFORCE
The REINFORCE algorithm (also known as Vanilla Policy Gradient) is a policy gradient method that optimizes the policy directly, by gradient ascent on the expected return (in practice, gradient descent on the negated objective). The following is the pseudocode of the REINFORCE algorithm:
```txt
Setup the CartPole environment
Setup the agent as a simple neural network with:
    - One fully connected layer with 128 units and ReLU activation followed by a dropout layer
    - One fully connected layer followed by softmax activation
Repeat 500 times:
    Reset the environment
    Reset the buffer
    Repeat until the end of the episode:
        Compute action probabilities
        Sample the action based on the probabilities and store its probability in the buffer
        Step the environment with the action
        Compute and store in the buffer the return using gamma=0.99
    Normalize the return
    Compute the policy loss as -sum(log(prob) * return)
    Update the policy using an Adam optimizer and a learning rate of 5e-3
```
To learn more about REINFORCE, you can refer to [this unit](https://huggingface.co/learn/deep-rl-course/unit4/introduction).
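The buffer-processing steps of the pseudocode (return computation, normalization, policy loss) can be sketched in plain Python, without autograd. The function names and the toy episode are illustrative assumptions, not part of the assignment; a PyTorch implementation would keep the log-probabilities as tensors so that the loss can be back-propagated.

```python
import math
import statistics


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the rewards backwards so each return reuses the next one
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return [(v - mean) / (std + eps) for v in values]


def policy_loss(action_probs, returns):
    """Compute -sum(log(prob) * return) over the episode."""
    return -sum(math.log(p) * g for p, g in zip(action_probs, returns))


# Toy episode: three steps, reward 1 each, gamma chosen for easy arithmetic
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = policy_loss([0.5, 0.5, 0.5], normalize(returns))
```

Here `returns` is `[1.75, 1.5, 1.0]`. Because the normalized returns sum to zero and every sampled action had the same probability, the toy loss is zero up to rounding; the gradient only becomes informative once probabilities differ across steps.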
> 🛠 **To be handed in**
> Use PyTorch to implement REINFORCE and solve the CartPole environment. Share the code in `reinforce_cartpole.py`, and share a plot showing the total reward across episodes in the `README.md`.
## Familiarization with a complete RL pipeline: Application to training a robotic arm
In this section, you will use the Stable-Baselines3 package to train a robotic arm using RL. You'll get familiar with several widely-used tools for training, monitoring and sharing machine learning models.
### Get familiar with Stable-Baselines3
Stable-Baselines3 (SB3) is a high-level RL library that provides various algorithms and integrated tools to easily train and test reinforcement learning models.
#### Installation
```sh
pip install stable-baselines3
pip install moviepy
```
#### Usage
Use the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/) to implement the code to solve the CartPole environment with the Advantage Actor-Critic (A2C) algorithm.
> 🛠 **To be handed in**
> Store the code in `a2c_sb3_cartpole.py`. Unless otherwise stated, you'll work on this file for the next sections.
### Get familiar with Hugging Face Hub
Hugging Face Hub is a platform for easy sharing and versioning of trained machine learning models. With Hugging Face Hub, you can quickly and easily share your models with others and make them usable through the API. For example, see the trained A2C agent for CartPole: https://huggingface.co/sb3/a2c-CartPole-v1. Hugging Face Hub provides an API to download and upload SB3 models.
#### Installation of `huggingface_sb3`
```sh
pip install huggingface-sb3==2.3.1
```
#### Upload the model on the Hub
Follow the [Hugging Face Hub documentation](https://huggingface.co/docs/hub/stable-baselines3) to upload the previously learned model to the Hub.
> 🛠 **To be handed in**
> Link the trained model in the `README.md` file.
> 📝 **Note**
> [RL-Zoo3](https://stable-baselines3.readthedocs.io/en/master/guide/rl_zoo.html) provides more advanced features to save hyperparameters, generate renderings and metrics. Feel free to try them.
### Get familiar with Weights & Biases
Weights & Biases (W&B) is a tool for machine learning experiment management. With W&B, you can track and compare your experiments, visualize your model training and performance.
#### Installation
You'll need to install both `wandb` and `tensorboard`.
```shell
pip install wandb tensorboard
```
Use the documentation of [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) and [Weights & Biases](https://docs.wandb.ai/guides/integrations/stable-baselines-3) to track the CartPole training. Make the run public.
🛠 Share the link of the wandb run in the `README.md` file.
> ⚠️ **Warning**
> Make sure to make the run public!
### Full workflow with panda-gym
[Panda-gym](https://github.com/qgallouedec/panda-gym) is a collection of environments for robotic simulation and control. It provides a range of challenges for training robotic agents in a simulated environment. In this section, you will get familiar with one of the environments provided by panda-gym, the `PandaReachJointsDense-v3`. The objective is to learn how to reach any point in 3D space by directly controlling the robot's articulations.
#### Installation
```shell
pip install panda-gym==3.0.7
```
#### Train, track, and share
Use the Stable-Baselines3 package to train an A2C model on the `PandaReachJointsDense-v3` environment. 500k timesteps should be enough. Track the training with Weights & Biases. Once the training is over, upload the trained model to the Hub.
> 🛠 **To be handed in**
> Share all the code in `a2c_sb3_panda_reach.py`. Share the link of the wandb run and the trained model in the `README.md` file.
## Contribute
This tutorial may contain errors, inaccuracies, typos or areas for improvement. Feel free to contribute to its improvement by opening an issue.
## Author
Quentin Gallouédec
Updates by Léo Schneider, Emmanuel Dellandréa
## License
MIT
import gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines3 import A2C
# push_to_hub for SB3 models is provided by huggingface_sb3, not transformers
from huggingface_sb3 import push_to_hub

# Create the CartPole environment
env = gym.make("CartPole-v1")

# Create the A2C model
model = A2C("MlpPolicy", env, verbose=1)

# Train the model
model.learn(total_timesteps=10000)

# Get the (vectorized) environment wrapped by the model
vec_env = model.get_env()

# Collect the per-step rewards of the trained agent
episode_rewards = []
obs = vec_env.reset()
for _ in range(500):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)
    # done = terminated or truncated
    # The vectorized env returns one reward per sub-environment; keep the first
    episode_rewards.append(reward[0])
    if done:
        obs = vec_env.reset()

# Plotting the rewards
plt.plot(np.arange(len(episode_rewards)), episode_rewards)
plt.xlabel('Time Steps')
plt.ylabel('Reward')
plt.title('Reward Across Time Steps')
plt.show()

# Save the model (SB3 appends the .zip extension)
model.save("cartpole_model_a2c_sb3")

# Upload the saved archive; filename must point to the file on disk
push_to_hub(
    repo_id="MohamedKhalil/Hands-on-rl",
    filename="cartpole_model_a2c_sb3.zip",
    commit_message="First push",
)
\ No newline at end of file
import gym
import wandb
import panda_gym  # imported for its side effect: registers the Panda environments
from stable_baselines3 import A2C
import matplotlib.pyplot as plt

# Start a new wandb run to track this script
wandb.init(project='Model2',
           config={
               "Algorithm": "A2C",
               "env": "PandaReachJointsDense-v2",
               "episodes": 500,
           })

# Define the environment
env = gym.make('PandaReachJointsDense-v2')

# Observations are dictionaries, hence MultiInputPolicy
model = A2C('MultiInputPolicy', env, verbose=1)

scores = []
for i in range(100):
    # Train for a chunk of timesteps, then evaluate on one episode
    model.learn(total_timesteps=1000)
    obs = env.reset()
    rewards = 0
    while True:
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        rewards += reward
        env.render()
        if done:
            break
    scores.append(rewards)
    # Log the total reward of this evaluation episode
    wandb.log({"rewards": rewards})

# Mark the run as finished
wandb.finish()

model.save("PandaReach")

plt.plot(scores)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.show()
File added
plots/reward_plot1.png

49 KiB

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import os


# Define a simple neural network as the policy
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, output_size, hidden_size=128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)


# Function to compute discounted returns
def compute_returns(rewards, gamma=0.99):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # Normalize
    return returns


# Function to train the policy network using REINFORCE algorithm
def reinforce(env_name="CartPole-v1", num_episodes=500, gamma=0.99, learning_rate=5e-3):
    # Initialize environment and policy network
    env = gym.make(env_name)
    policy = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
    optimizer = optim.Adam(policy.parameters(), lr=learning_rate)
    episode_rewards = []

    for episode in range(num_episodes):
        episode_reward = 0
        log_probs = []
        rewards = []
        state = env.reset()
        done = False

        while not done:
            state_array = state[0] if isinstance(state, tuple) else state  # Extract array representing the state
            state_tensor = torch.tensor(state_array, dtype=torch.float32).unsqueeze(0)  # Convert to tensor
            action_probs = policy(state_tensor)
            action = torch.multinomial(action_probs, num_samples=1).item()
            log_probs.append(torch.log(action_probs[0][action]))  # Access the first element of the tensor

            # Take action in the environment
            step_output = env.step(action)
            next_state, reward, terminated, truncated, info = step_output  # Unpack step output
            done = truncated or terminated
            rewards.append(reward)  # Collect rewards

            # Update the state for the next iteration
            state = next_state
            episode_reward += reward  # Accumulate rewards

        # Compute discounted returns
        returns = compute_returns(rewards, gamma)

        # Compute loss and update policy
        policy_loss = -torch.sum(torch.stack(log_probs) * returns)
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        episode_rewards.append(episode_reward)
        print(f"Episode {episode + 1}/{num_episodes}, Reward: {episode_reward}")

    env.close()
    return episode_rewards


if __name__ == "__main__":
    # Train the policy network
    episode_rewards = reinforce()

    # Plotting the total reward across episodes
    plt.plot(np.arange(len(episode_rewards)), episode_rewards)
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Total Reward Across Episodes')

    # Save the plot in a file named 'plots'
    plots_dir = 'plots'
    if not os.path.exists(plots_dir):
        os.makedirs(plots_dir)
    plot_file = os.path.join(plots_dir, 'reward_plot1.png')
    plt.savefig(plot_file)
    plt.show()