diff --git a/README.md b/README.md index 3f71fbbb2a5fbd6b83fb25e58055dd091ccc4008..263a74011204e9b246a4b6ab29e0154aba66aaa6 100644 --- a/README.md +++ b/README.md @@ -1,171 +1,57 @@ # Hands-On Reinforcement Learning -In this hands-on project, we will first implement a simple RL algorithm and apply it to solve the CartPole-v1 environment. Once we become familiar with the basic workflow, we will learn to use various tools for machine learning model training, monitoring, and sharing, by applying these tools to train a robotic arm. +In this hands-on project, we will first implement a simple RL algorithm and apply it to solve the CartPole-v1 +environment. Once we become familiar with the basic workflow, we will learn to use various tools for machine learning +model training, monitoring, and sharing, by applying these tools to train a robotic arm. -## To be handed in - -This work must be done individually. The expected output is a repository named `hands-on-rl` on https://gitlab.ec-lyon.fr. It must contain a `README.md` file that explains **briefly** the successive steps of the project. Throughout the subject, you will find a ๐ symbol indicating that a specific production is expected. -The last commit is due before 11:59 pm on Monday, February 13, 2023. Subsequent commits will not be considered. - -> โ ๏ธ **Warning** -> Ensure that you only commit the files that are requested. For example, your directory should not contain the generated `.zip` files, nor the `runs` folder... At the end, your repository must contain one `README.md`, three python scripts, and optionally image files for the plots. - -## Before you start - -Make sure you know the basics of Reinforcement Learning. In case of need, you can refer to the [introduction of the Hugging Face RL course](https://huggingface.co/blog/deep-rl-intro). - -## Introduction to Gym - -Gym is a framework for developing and evaluating reinforcement learning environments. It offers various environments, including classic control and toy text scenarios, to test RL algorithms. - -### Installation - -```sh -pip install gym==0.21 -``` - -Install also pyglet for the rendering. - -```sh -pip install pyglet==1.5.27 -``` - -### Usage - -Here is an example of how to use Gym to solve the `CartPole-v1` environment: - -```python -import gym - -# Create the environment -env = gym.make("CartPole-v1") - -# Reset the environment and get the initial observation -observation = env.reset() - -for _ in range(100): - # Select a random action from the action space - action = env.action_space.sample() - # Apply the action to the environment - # Returns next observation, reward, done signal (indicating - # if the episode has ended), and an additional info dictionary - observation, reward, done, info = env.step(action) - # Render the environment to visualize the agent's behavior - env.render() -``` ## REINFORCE -The REINFORCE algorithm (also known as Vanilla Policy Gradient) is a policy gradient method that optimizes the policy directly using gradient descent. 
The following is the pseudocode of the REINFORCE algorithm:
-
-```txt
-Setup the CartPole environment
-Setup the agent as a simple neural network with:
-  - One fully connected layer with 128 units and ReLU activation followed by a dropout layer
-  - One fully connected layer followed by softmax activation
-Repeat 500 times:
-  Reset the environment
-  Reset the buffer
-  Repeat until the end of the episode:
-    Compute action probabilities
-    Sample the action based on the probabilities and store its probability in the buffer
-    Step the environment with the action
-  Compute and store in the buffer the return using gamma=0.99
-  Normalize the return
-  Compute the policy loss as -sum(log(prob) * return)
-  Update the policy using an Adam optimizer and a learning rate of 5e-3
-```
-
-To learn more about REINFORCE, you can refer to [this unit](https://huggingface.co/blog/deep-rl-pg).
-
-> ๐ **To be handed in**
-> Use PyTorch to implement REINFORCE and solve the CartPole environement. Share the code in `reinforce_cartpole.py`, and share a plot showing the total reward accross episodes in the `README.md`.
+Here we implement the REINFORCE algorithm using PyTorch (code: [reinforce_cartpole.py](/reinforce_cartpole.py)).
+Here is the plot showing the total reward across episodes:

-## Familiarization with a complete RL pipeline: Application to training a robotic arm
-
-In this section, you will use the Stable-Baselines3 package to train a robotic arm using RL. You'll get familiar with several widely-used tools for training, monitoring and sharing machine learning models.
-
-### Get familiar with Stable-Baselines3
-
-Stable-Baselines3 (SB3) is a high-level RL library that provides various algorithms and integrated tools to easily train and test reinforcement learning models.
-
-#### Installation
-
-```sh
-pip install stable-baselines3
-```
-
-#### Usage
-
-Use the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/) to implement the code to solve the CartPole environment with the Advantage Actor-Critic (A2C) algorithm.
+![Total reward across episodes](/plots/reinforce_reward.png)
-
-> ๐ **To be handed in**
-> Store the code in `a2c_sb3_cartpole.py`. Unless otherwise stated, you'll work upon this file for the next sections.
-
-### Get familiar with Hugging Face Hub
-
-Hugging Face Hub is a platform for easy sharing and versioning of trained machine learning models. With Hugging Face Hub, you can quickly and easily share your models with others and make them usable through the API. For example, see the trained A2C agent for CartPole: https://huggingface.co/sb3/a2c-CartPole-v1. Hugging Face Hub provides an API to download and upload SB3 models.
-
-#### Installation of `huggingface_sb3`
-
-```sh
-pip install huggingface_sb3
-```
-
-#### Upload the model on the Hub
+## Familiarization with a complete RL pipeline: Application to training a robotic arm

-Follow the [Hugging Face Hub documentation](https://huggingface.co/docs/hub/index) to upload the previously learned model to the Hub.
+In this section, we will use the Stable-Baselines3 package to train a robotic arm using RL.

-> ๐ **To be handed in**
-> Link the trained model in the `README.md` file.
+We use the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/) to implement the
+Python script [a2c_sb3_cartpole.py](/a2c_sb3_cartpole.py), which solves the CartPole environment with the
+Advantage Actor-Critic (A2C) algorithm.
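+
+As a quick reference, here is a minimal sketch of the Stable-Baselines3 calls the script is built around (the actual
+script additionally wraps the environment in `Monitor`/`DummyVecEnv`, logs to TensorBoard, and tracks the run with a
+Weights & Biases callback):
+
+```python
+import gym
+from stable_baselines3 import A2C
+
+# Create the CartPole environment and an A2C agent with an MLP policy
+env = gym.make("CartPole-v1")
+model = A2C("MlpPolicy", env, verbose=1)
+
+# Train the agent for the same budget as in the script (25000 timesteps)
+model.learn(total_timesteps=25000)
+```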
-> ๐ **Note**
-> [RL-Zoo3](https://stable-baselines3.readthedocs.io/en/master/guide/rl_zoo.html) provides more advanced features to save hyperparameters, generate renderings and metrics. Feel free to try them.
-
-### Get familiar with Weights & Biases
-
-Weights & Biases (W&B) is a tool for machine learning experiment management. With W&B, you can track and compare your experiments, visualize your model training and performance.
-
-#### Installation
+### Weights & Biases repository
-
-You'll need to install both `wand` and `tensorboar`.
+
+Here is the [link](https://wandb.ai/antoine-lebtahi/cartpole-sb3_a2c) to the Weights & Biases project where the training run can be found.
-
-```shell
-pip install wandb tensorboard
-```
-
-Use the documentation of Stable-Baselines3 and [Weights & Biases](https://docs.wandb.ai) to track the CartPole training. Make the run public.
+
+### Hugging Face Hub repository
-
-๐ Share the link of the wandb run in the `README.md` file.
+
+The trained model can also be found in this Hugging Face repository: [link](https://huggingface.co/alebtahi/a2c_sb3).
-
-> โ ๏ธ **Warning**
-> Make sure to make the run public!

### Full workflow with panda-gym

-Panda-gym is a collection of environments for robotic simulation and control. It provides a range of challenges for training robotic agents in a simulated environment. In this section, you will get familiar with one of the environments provided by panda-gym, the `PandaReachJointsDense-v2`. The objective is to learn how to reach any point in 3D space by directly controlling the robot's articulations.
-
-#### Installation
-
-```shell
-pip install panda_gym==2.0.0
-```
-
-#### Train, track, and share
+In this section, we will get familiar with one of the environments provided by panda-gym, the
+`PandaReachJointsDense-v2`. The objective is to learn how to reach any point in 3D space by directly controlling
+the robot's articulations.

-Use the Stable-Baselines3 package to train A2C model on the `PandaReachJointsDense-v2` environment. 500k timesteps should be enough. Track the environment with Weights & Biases. Once the training is over, upload the trained model on the Hub.
+
+The resulting code is [a2c_sb3_panda_reach.py](/a2c_sb3_panda_reach.py).

-> ๐ **To be handed in**
-> Share all the code in `a2c_sb3_panda_reach.py`. Share the link of the wandb run and the trained model in the `README.md` file.
+
+The run is stored in this Weights & Biases project: [panda_reach-sb3_a2c](https://wandb.ai/antoine-lebtahi/panda_reach-sb3_a2c).

-## Contribute
+
+The model can be found in the same Hugging Face repository as previously: [link](https://huggingface.co/alebtahi/a2c_sb3).

-This tutorial may contain errors, inaccuracies, typos or areas for improvement. Feel free to contribute to its improvement by opening an issue.
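+
+To reuse one of the agents shared above, the checkpoint can be downloaded from the Hugging Face Hub with
+`huggingface_sb3` and reloaded with Stable-Baselines3. This is only a sketch: the repository id comes from the links
+above, but the checkpoint filename (`model.zip`) is an assumption and should be replaced by the actual file name
+stored in the repository.
+
+```python
+from huggingface_sb3 import load_from_hub
+from stable_baselines3 import A2C
+
+# Download the checkpoint from the Hub (filename is assumed, check the repository)
+checkpoint = load_from_hub(repo_id="alebtahi/a2c_sb3", filename="model.zip")
+
+# Reload the trained A2C agent from the downloaded checkpoint
+model = A2C.load(checkpoint)
+```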
## Author

-Quentin Gallouédec
+Antoine Lebtahi

## License

diff --git a/a2c_sb3_cartpole.py b/a2c_sb3_cartpole.py
new file mode 100644
index 0000000000000000000000000000000000000000..5b506e38ca6715c675efcd003373d4f13ad38b5d
--- /dev/null
+++ b/a2c_sb3_cartpole.py
@@ -0,0 +1,48 @@
+import gym
+import wandb
+
+from stable_baselines3 import A2C
+from stable_baselines3.common.monitor import Monitor
+from stable_baselines3.common.vec_env import DummyVecEnv
+from wandb.integration.sb3 import WandbCallback
+
+
+# Environment configuration:
+config = {
+    "policy_type": "MlpPolicy",
+    "total_timesteps": 25000,
+    "env_name": "CartPole-v1"}
+
+# WandB config:
+run = wandb.init(
+    project="cartpole-sb3_a2c",
+    config=config,
+    sync_tensorboard=True,
+    monitor_gym=True,
+    save_code=True,
+)
+
+
+# Environment definition:
+def make_env():
+    environment = gym.make(config["env_name"])
+    environment = Monitor(environment)  # record stats such as returns
+    return environment
+
+
+env = DummyVecEnv([make_env])
+
+# Model definition:
+model = A2C(config["policy_type"], env, verbose=1, tensorboard_log=f"./runs/{run.id}")
+
+# Model training:
+
+model.learn(
+    total_timesteps=config["total_timesteps"],
+    callback=WandbCallback(
+        gradient_save_freq=100,
+        model_save_path=f"./models/{run.id}",
+        verbose=2,
+    )
+)
+run.finish()
diff --git a/a2c_sb3_panda_reach.py b/a2c_sb3_panda_reach.py
new file mode 100644
index 0000000000000000000000000000000000000000..58da3ee418fc5aae0d1ec425de008e7f807d2361
--- /dev/null
+++ b/a2c_sb3_panda_reach.py
@@ -0,0 +1,48 @@
+import gym
+import panda_gym
+import wandb
+
+from stable_baselines3 import A2C
+from stable_baselines3.common.monitor import Monitor
+from stable_baselines3.common.vec_env import DummyVecEnv
+from wandb.integration.sb3 import WandbCallback
+
+
+# Environment configuration:
+config = {
+    "policy_type": "MultiInputPolicy",
+    "total_timesteps": 500000,
+    "env_name": "PandaReachJointsDense-v2"}
+
+# WandB config:
+run = wandb.init(
+
+    project="panda_reach-sb3_a2c",
+    config=config,
+    sync_tensorboard=True,
+    monitor_gym=True,
+    save_code=True,
+)
+
+
+# Environment definition:
+def make_env():
+    environment = gym.make(config["env_name"])
+    environment = Monitor(environment)  # record stats such as returns
+    return environment
+
+env = DummyVecEnv([make_env])
+
+# Model definition:
+model = A2C(config["policy_type"], env, verbose=1, tensorboard_log=f"./runsPanda/{run.id}")
+
+# Model training:
+model.learn(
+    total_timesteps=config["total_timesteps"],
+    callback=WandbCallback(
+        gradient_save_freq=1000,
+        model_save_path=f"./modelsPanda/{run.id}",
+        verbose=2,
+    )
+)
+run.finish()
diff --git a/plots/reinforce_reward.png b/plots/reinforce_reward.png
new file mode 100644
index 0000000000000000000000000000000000000000..2e17450bc435b93011783e778a9c84040173451c
Binary files /dev/null and b/plots/reinforce_reward.png differ
diff --git a/reinforce_cartpole.py b/reinforce_cartpole.py
new file mode 100644
index 0000000000000000000000000000000000000000..e216ea1815fe823dc62f1c9aa2b9d111d5f710a7
--- /dev/null
+++ b/reinforce_cartpole.py
@@ -0,0 +1,134 @@
+import gym
+import matplotlib.pyplot as plt
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+
+from torch.distributions import Categorical
+from collections import deque
+
+
+class Policy(nn.Module):
+    def __init__(self, s_size, a_size, h_size):
+        super(Policy, self).__init__()
+        self.fc1 = nn.Linear(s_size, h_size)
+        self.fc2 = nn.Linear(h_size, 
a_size) + self.drop = nn.Dropout(0.3) + + def forward(self, x): + x = F.relu(self.fc1(x)) + x = self.drop(x) + x = self.fc2(x) + return F.softmax(x, dim=1) + + def act(self, state): + device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + state = torch.from_numpy(state).float().unsqueeze(0).to(device) + probs = self.forward(state).cpu() + m = Categorical(probs) + action = m.sample() + return action.item(), m.log_prob(action) + + +def reinforce(env, policy, optimizer, n_training_episodes, max_t, gamma, print_every): + # Help us to calculate the score during the training + scores_deque = deque(maxlen=100) + scores = [] + + for i_episode in range(1, n_training_episodes + 1): + saved_log_probs = [] + # reset the environment + state = env.reset() + # reset the buffer + rewards = [] + + # repeat until the end of the episode of len = max_t + for t in range(max_t): + # compute action probabilities and sample action + action, log_prob = policy.act(state) + saved_log_probs.append(log_prob) + # Step the environment with the action + state, reward, done, _ = env.step(action) + # Store rewards in our buffers + rewards.append(reward) + if done: + break + + # compute sum of the rewards + scores_deque.append(sum(rewards)) + scores.append(sum(rewards)) + + # calculate the return + returns = deque(maxlen=max_t) + n_steps = len(rewards) + + for t in range(n_steps)[::-1]: + disc_return_t = (returns[0] if len(returns) > 0 else 0) + returns.appendleft(gamma * disc_return_t + rewards[t]) + + # standardization of the returns is employed to make training more stable + eps = np.finfo(np.float32).eps.item() + # eps is the smallest representable float, which is + # added to the standard deviation of the returns to avoid numerical instabilities + returns = torch.tensor(returns) + returns = (returns - returns.mean()) / (returns.std() + eps) + + # Compute the policy loss as -sum(log(prob) * return) + policy_loss = [] + for log_prob, disc_return in zip(saved_log_probs, returns): + policy_loss.append(-log_prob * disc_return) + policy_loss = torch.cat(policy_loss).sum() + + # Update the policy using an Adam optimizer: + optimizer.zero_grad() + policy_loss.backward() + optimizer.step() + + if i_episode % print_every == 0: + print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque))) + + return scores + + +if __name__ == "__main__": + device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + + env = gym.make("CartPole-v1") + state_size = env.observation_space.shape[0] + action_size = env.action_space.n + + cartpole_hyperparameters = {"h_size": 128, + "n_training_episodes": 500, + "n_evaluation_episodes": 10, + "max_t": 1000, + "gamma": 0.99, + "lr": 5e-3, + "state_space": state_size, + "action_space": action_size, + } + + cartpole_policy = Policy(cartpole_hyperparameters["state_space"], + cartpole_hyperparameters["action_space"], + cartpole_hyperparameters["h_size"] + ).to(device) + + cartpole_optimizer = optim.Adam(cartpole_policy.parameters(), + lr=cartpole_hyperparameters["lr"] + ) + + scores = reinforce(env, + cartpole_policy, + cartpole_optimizer, + cartpole_hyperparameters["n_training_episodes"], + cartpole_hyperparameters["max_t"], + cartpole_hyperparameters["gamma"], + print_every=100 + ) + + plt.plot(scores) + plt.xlabel("Episodes") + plt.ylabel("Total Reward") + plt.savefig("./plots/reinforce_reward.png") + plt.show() \ No newline at end of file diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 
0000000000000000000000000000000000000000..802d572ffa91f14bfc3c9f081b451d2c909389b4 Binary files /dev/null and b/requirements.txt differ