%% Cell type:markdown id: tags:
### **_Deep Learning - BSc Data Science for Responsible Business - Centrale Lyon_**
2024-2025
Emmanuel Dellandréa
%% Cell type:markdown id: tags:
# Practical Session 9 - Diffusion Models
Subject written by Bruno Machado
<p align="center">
<img height=300px src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d7da7-52db-4104-9742-a0b4555d8dd6_1300x387.png"/></p>
<p align="center"></p>
%% Cell type:markdown id: tags:
The objective of this tutorial is to discover Diffusion Models: probabilistic generative models that learn to generate data by iteratively refining random noise through a reverse diffusion process. Given a data sample, noise is progressively added in small steps until it becomes pure noise. A neural network is then trained to reverse this process and generate realistic data from noise.
Diffusion models have gained popularity due to their ability to generate high-quality, diverse, and detailed content, surpassing GANs in the quality of the generated images.
In this assignment we will focus on DDPMs (Denoising Diffusion Probabilistic Models), which were introduced in this [paper](https://arxiv.org/abs/2006.11239) and laid the foundation for generative diffusion models.
The notebook contains code cells with **"# TO DO"** comments. Your goal is to complete these cells and run the proposed experiments.
As the computation is heavy, particularly during training, we encourage you to use a GPU. If your laptop is not equipped with one, you may use one of these remote Jupyter servers, where you can select execution on a GPU:
1) [jupyter.mi90.ec-lyon.fr](https://jupyter.mi90.ec-lyon.fr/)
This server is accessible from within the campus network. From outside, you need to use a VPN. Before executing the notebook, select the kernel "Python PyTorch" to run it on a GPU and have access to the PyTorch module.
2) [Google Colaboratory](https://colab.research.google.com/)
Before executing the notebook, enable GPU execution: "Runtime" menu -> "Change runtime type" and select "T4 GPU".
%% Cell type:markdown id: tags:
# Part 1: Diffusion
%% Cell type:code id: tags:
``` python
# First import useful libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import torchvision.transforms as transforms
```
%% Cell type:code id: tags:
``` python
# device = torch.device("mps")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
%% Cell type:markdown id: tags:
As in the previous session, we will use the MNIST dataset.
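For reference, here is a minimal sketch of one possible way to do this, assuming torchvision's MNIST dataset resized to 64x64 (the batch size and normalization below are illustrative choices, not requirements):
``` python
# Illustrative sketch only -- names and values are example choices
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST

transform = transforms.Compose([
    transforms.Resize(64),                 # 28x28 -> 64x64
    transforms.ToTensor(),                 # [0, 255] -> [0, 1]
    transforms.Normalize((0.5,), (0.5,)),  # [0, 1] -> [-1, 1], matching the (image / 2 + 0.5) un-normalization used later
])
mnist_dataset = MNIST(root="data", train=True, download=True, transform=transform)
mnist_dataloader = DataLoader(mnist_dataset, batch_size=64, shuffle=True)
```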
%% Cell type:code id: tags:
``` python
# TO DO: your code here to load the MNIST dataset. The size of the images should be set to 64x64.
mnist_dataset =
mnist_dataloader =
```
%% Cell type:markdown id: tags:
Auxiliary functions for plotting images
%% Cell type:code id: tags:
``` python
def reverse_transform(image):
    image = image.numpy().transpose((1, 2, 0))
    image = np.clip(image, 0, 1)
    image = (image * 255).astype(np.uint8)
    return image

def plot1xNArray(images, labels):
    f, axarr = plt.subplots(1, len(images))
    for image, ax, label in zip(images, axarr, labels):
        ax.imshow(image, cmap='gray')
        ax.axis('off')
        ax.set_title(label)
```
%% Cell type:markdown id: tags:
In order to train the model with the diffusion process, we will use a noise scheduler, which is in charge of the forward diffusion process. The scheduler takes an image, a sample of random noise and a timestep, and returns a noisy image for the corresponding timestep. Noise is progressively added to the image at each timestep, therefore a noisy image at timestep 0 will have barely any noise, while a noisy image at the maximum timestep will be essentially pure noise.
Let's create a noise scheduler with 1000 max timesteps and visualize some noisy images.
We will use the diffusers library from Hugging Face, which provides several tools for training and using diffusion models.
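For reference, in the notation of the DDPM paper, the scheduler's `add_noise` follows the closed-form forward process: with a variance schedule $\beta_1, \dots, \beta_T$, $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, a noisy image at timestep $t$ is obtained directly from the clean image $x_0$ and a noise sample $\epsilon \sim \mathcal{N}(0, I)$ as

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon .$$

Since $\bar{\alpha}_t$ shrinks towards 0 as $t$ grows, $x_t$ stays close to $x_0$ for small $t$ and approaches pure noise for $t$ near the maximum timestep.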
%% Cell type:code id: tags:
``` python
from diffusers import DDPMScheduler
# TO DO: Create the scheduler
noise_scheduler =
image, _ = mnist_dataset[0]
# TO DO: Create a noise tensor sampled from a normal distribution with the same shape as the image
noise =
images, labels = [reverse_transform(image)], ["Original"]
for i in [100, 250, 400, 900]:
    timestep = torch.LongTensor([i])
    noisy_image = noise_scheduler.add_noise(image, noise, timestep)
    images.append(reverse_transform(noisy_image))
    labels.append(f"t={i}")
plot1xNArray(images, labels)
```
%% Cell type:markdown id: tags:
For the reverse diffusion process we will use a neural network. Given a noisy image and the corresponding timestep, the goal of the neural network is to predict the noise, which makes the denoising possible.
For the model, we will use an architecture similar to the one we used for the cGAN generator: a 2D UNet with a few modifications. The main difference is that we have to indicate to the model which timestep is currently being denoised. For that purpose, a timestep embedding is added, so the model has 2 inputs: the noisy image and the corresponding timestep.
In this exercise, we will use a UNet implementation from the diffusers library, which already includes the timestep embedding.
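To give an intuition of how the timestep is communicated to the network, here is an illustrative sketch of a sinusoidal timestep embedding, similar in spirit to the one built into `UNet2DModel` (the function name and dimension below are ours, not part of the diffusers API):
``` python
import math

def sinusoidal_timestep_embedding(timesteps, dim=128):
    # Map integer timesteps to vectors of sines and cosines at different frequencies,
    # so the network can condition its prediction on how noisy the input is.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# Example: embed timesteps 10 and 500 into 128-dimensional vectors
print(sinusoidal_timestep_embedding(torch.tensor([10, 500])).shape)  # torch.Size([2, 128])
```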
%% Cell type:code id: tags:
``` python
from diffusers import UNet2DModel
# TO DO: Complete the parameters
diffusion_backbone = UNet2DModel(
    block_out_channels=(64, 128, 256, 512),
    down_block_types=("DownBlock2D", "DownBlock2D", "DownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "UpBlock2D", "UpBlock2D", "UpBlock2D"),
    sample_size=,
    in_channels=,
    out_channels=,
).to(device)
# Optimizer
optimizer = torch.optim.AdamW(diffusion_backbone.parameters(), lr=1e-4)
print(diffusion_backbone)
```
%% Cell type:markdown id: tags:
### Now, let's train the model.
%% Cell type:code id: tags:
``` python
# ----------------
# Training Loop
# ----------------
torch.backends.cudnn.deterministic = True
losses = []
num_epochs = 5
print_every = 100
diffusion_backbone.train()
for epoch in range(num_epochs):
    for i, batch in enumerate(mnist_dataloader):
        # Zero the gradients
        optimizer.zero_grad()
        # Send input to device
        images = batch[0].to(device)
        # Generate noisy images, different timestep for each image in the batch
        timesteps = torch.randint(noise_scheduler.config.num_train_timesteps, (images.size(0),), device=device)
        # TO DO: Complete the code
        noise =
        noisy_images =
        # Forward pass
        residual = diffusion_backbone(noisy_images, timesteps).sample
        # TO DO: Compute the loss
        loss =
        loss.backward()
        optimizer.step()
        # Print stats
        if i % print_every == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}][{i}/{len(mnist_dataloader)}] | loss: {loss.item():6.4f}')
            losses.append(loss.item())
    torch.save(diffusion_backbone.state_dict(), f"diffusion_{epoch+1}.pth")
```
%% Cell type:markdown id: tags:
If the training takes too long, you can download an already trained model from [this link](https://partage.liris.cnrs.fr/index.php/s/AP2t6b3w8SM4Bp5) and use it for inference.
%% Cell type:code id: tags:
``` python
# TO DO: Add the path to the model checkpoint for loading the model
diffusion_backbone.load_state_dict(torch.load())
diffusion_backbone.eval()
```
%% Cell type:markdown id: tags:
### Time to generate some images.
During training, for each data sample, we take a random timestep and the corresponding noisy image to give it as input to our model. With sufficient training, the model should learn how to predict the noise in a noisy image for all possible timesteps.
During inference, to generate an image, we will start from pure noise and, step by step, predict the noise to go from one noisy image to the next, progressively denoising the image until we reach timestep 0, at which point we should have an image without any noise.
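In the notation of the DDPM paper, the training loop above minimizes the simplified objective

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\big)\big\|^2\Big],$$

and each call to `noise_scheduler.step` below applies, up to the choice of the variance $\sigma_t$, the reverse update

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$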
%% Cell type:code id: tags:
``` python
from tqdm import tqdm
# Start the image as random noise
image = torch.randn((10, 1, 64, 64)).to(device)
# Create a list of images and labels for visualization
images, labels = [(image / 2 + 0.5).clamp(0, 1).cpu().permute(0, 2, 3, 1).numpy()], ["Noise"]
# Use the scheduler to iterate over timesteps
noise_scheduler.set_timesteps(1000)
for timestep in tqdm(noise_scheduler.timesteps):
    with torch.no_grad():
        residual = diffusion_backbone(image, timestep).sample
    image = noise_scheduler.step(residual, timestep, image).prev_sample
    if timestep.item() % 200 == 0:
        images.append((image / 2 + 0.5).clamp(0, 1).cpu().permute(0, 2, 3, 1).numpy())
        labels.append(f"t={timestep.item()}")
for i in range(images[0].shape[0]):
    plot1xNArray([img[i] for img in images], labels)
```
%% Cell type:markdown id: tags:
The diffusers library also provides *Pipeline* classes, which are wrappers around the model that abstract the inference loop implemented above.
We can create a pipeline, giving it the trained model and the noise scheduler, and use it to generate images. In this case, we only have access to the final image, generated at the last timestep, but not to the intermediate images from the denoising process.
%% Cell type:code id: tags:
``` python
from diffusers import DDPMPipeline
pipeline = DDPMPipeline(diffusion_backbone, noise_scheduler)
generated_images = pipeline(10, output_type="np")
f, axarr = plt.subplots(1, len(generated_images["images"]))
for image, ax in zip(generated_images["images"], axarr):
    ax.imshow(image, cmap='gray')
    ax.axis('off')
```
%% Cell type:markdown id: tags:
# Part 2: What about those beautiful images?
%% Cell type:markdown id: tags:
<p align="center">
<img height=300px src="https://huggingface.co/stabilityai/stable-diffusion-3.5-large/media/main/sd3.5_large_demo.png"/></p>
<p align="center"></p>
In this exercise we achieved decent results on very simple datasets. But we are quite far from those beautiful AI-generated images we can find online. That is for 2 main reasons:
- Model size: due to computation and time constraints, we can't really train very large models
- Dataset size: due to the same constraints, we can't use very complex and large datasets, which would require larger models and longer training times.
Fortunately, even though we can't train those large models with the available hardware and time, we can at least use them for inference!
The goal of this part is to learn how to retrieve and use a pre-trained diffusion model, and also to get creative and come up with some nice prompts to generate outstanding images.
We are going to use Stable Diffusion 3.5, which is a state-of-the-art open-source text-conditioned model. It takes a prompt in natural language and uses it to guide the diffusion process. This type of model is trained on image-text pairs, but can generalize beyond the pairs seen during training and mix several different concepts into a single image.
In order to save memory, we will use quantization, which consists of converting the model weights from 16-bit floats to a 4-bit format. That simply means that each model weight will be stored using only 4 bits instead of 16, roughly dividing the memory needed for the weights by four. That allows us to run the model on GPUs with less VRAM and get faster inference, at the cost of a small drop in the quality of the results.
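As a rough, purely illustrative back-of-the-envelope check of what this buys us (the parameter count below is a hypothetical example, not the actual size of Stable Diffusion 3.5):
``` python
# Weights-only memory estimate, ignoring activations and quantization overhead
num_params = 2.0e9  # hypothetical parameter count, for illustration only
gib = 1024 ** 3
print(f"fp16: {num_params * 2 / gib:.1f} GiB")    # 16 bits = 2 bytes per weight
print(f"nf4:  {num_params * 0.5 / gib:.1f} GiB")  # 4 bits = 0.5 bytes per weight
```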
%% Cell type:markdown id: tags:
For this part of the assignment, restart the notebook kernel to be sure your GPU memory is empty. The memory usage can be verified with the command `nvidia-smi` in a terminal. If your GPU has 2 GB of VRAM or less, the model will probably not fit into memory even with quantization. In that case, use Google Colab for this part or use the smaller model indicated below. If you are not happy with the results and have plenty of VRAM available, feel free to relax the quantization to 8 bits or even load the model without quantization.
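If you prefer checking from within Python rather than with `nvidia-smi`, a small sketch like the following reports the memory of the visible GPU (note that `memory_allocated` only counts allocations made by the current PyTorch process):
``` python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total = props.total_memory / 1024**3
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    print(f"{props.name}: {allocated:.2f} GiB allocated by this process / {total:.2f} GiB total")
else:
    print("No CUDA GPU visible")
```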
%% Cell type:code id: tags:
``` python
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
model_id = "stabilityai/stable-diffusion-3.5-medium"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.float16
)
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
```
%% Cell type:code id: tags:
``` python
# TO DO: test different prompts and visualize the generated images
# Once you are happy with the results, you can save 3 different images as PNG files, with the corresponding prompts in a text file
prompt =
image = pipeline(
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.5,
    max_sequence_length=512
).images[0]
image.save("generated_image.png")
```
%% Cell type:markdown id: tags:
If, even with the quantization, you still run out of GPU memory and you can't use Google Colab, you can use the following code instead, which uses a much smaller model (the results won't be nearly as impressive, but it should be able to run even on a CPU, if you have a little patience).
%% Cell type:code id: tags:
``` python
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("OFA-Sys/small-stable-diffusion-v0").to(device)
```
%% Cell type:code id: tags:
``` python
# TO DO: test different prompts and visualize the generated images
prompt =
image = pipe(prompt).images[0]
image.save("not_as_good_generated_image.png")
```