# TD 2: GAN & Diffusion
## MSO 3.4 Machine Learning
### Overview
This project explores generative models for images, focusing on Generative Adversarial Networks (GANs) and Diffusion models. The objective is to understand their implementation, analyze specific architectures, and apply different training strategies for generating and denoising images, both with and without conditioning.
---
## Part 1: DC-GAN
In this section, we study the fundamentals of Generative Adversarial Networks through a Deep Convolutional GAN (DCGAN). We follow the tutorial: [DCGAN Tutorial](https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html).
We generate handwritten digits using the MNIST dataset available in the `torchvision` package: [MNIST Dataset](https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html#torchvision.datasets.MNIST).
### Implemented Modifications
- Adapted the tutorial's code to work with the MNIST dataset.
- Displayed loss curves for both the generator and the discriminator over training steps.
- Compared generated images with real MNIST dataset images.
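As a rough illustration of the dataset adaptation, the sketch below loads MNIST with `torchvision` and resizes the single-channel 28×28 digits to the 64×64 resolution the tutorial's DCGAN expects; the exact transforms and batch size in our notebook may differ.

```python
import torch
from torchvision import datasets, transforms

# MNIST digits are 28x28 grayscale; the DCGAN tutorial expects 64x64 inputs,
# so we resize and normalize to [-1, 1] (one channel instead of three).
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

dataset = datasets.MNIST(root="data", train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)

# The tutorial's generator and discriminator then only need nc=1 instead of nc=3.
```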
#### Examples of Generated Images:
![Example images of digits generated by DCGAN](images/generated_mnist1.png)
---
## Question: How to Control the Generated Digit?
To control which digit the generator produces, we implement a Conditional GAN (cGAN) with the following modifications:
### Generator Modifications
- Instead of using only random noise, we concatenate a class label (one-hot encoded or embedded) with the noise vector.
- This allows the generator to learn to produce specific digits based on the provided label.
### Discriminator Modifications
- Instead of just distinguishing real from fake, the discriminator is modified to classify images as digits (0-9) or as generated (fake).
- It outputs a probability distribution over 11 classes (10 digits + 1 for generated images).
### Training Process Update
- The generator is trained to fool the discriminator while generating images that match the correct class label.
- A categorical cross-entropy loss is used for the discriminator instead of a binary loss since it performs multi-class classification.
- The loss function encourages the generator to produce well-classified digits.
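A minimal sketch of the conditioning on the generator side, with layer sizes borrowed from the DCGAN tutorial rather than taken verbatim from our notebook: the digit label is embedded and concatenated with the noise vector before the first transposed convolution.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """DCGAN-style generator conditioned on a digit label (0-9)."""

    def __init__(self, nz=100, n_classes=10, ngf=64):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)  # learnable label embedding
        self.net = nn.Sequential(
            # input: (nz + n_classes) x 1 x 1
            nn.ConvTranspose2d(nz + n_classes, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 8x8
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 16x16
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 32x32
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 1, 4, 2, 1, bias=False),            # 64x64, 1 channel
            nn.Tanh(),
        )

    def forward(self, noise, labels):
        # Concatenate the noise vector with the embedded label, then treat it as a 1x1 feature map.
        x = torch.cat([noise, self.label_emb(labels)], dim=1)
        return self.net(x.unsqueeze(-1).unsqueeze(-1))

# Example: request a batch of sixteen images of the digit 3.
# g = ConditionalGenerator()
# fake = g(torch.randn(16, 100), torch.full((16,), 3, dtype=torch.long))
```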
---
## Implementing a cGAN with a Multi-Class Discriminator
To enhance image generation and reduce ambiguities between similar digits (e.g., 3 vs 7), we introduce a multi-class discriminator that classifies generated images into one of the 10 digit categories or as fake.
### Algorithm Comparison
| Model | Description | Result |
|--------|------------|----------|
| **cGAN** | The generator learns to produce images conditioned on the class label. The discriminator only distinguishes real from fake. | Generates realistic digits, but some outputs remain ambiguous (e.g., confusion between 3 and 7). |
| **cGAN with Multi-Class Discriminator** | The generator produces class-conditioned digits, and the discriminator learns to classify images into one of 10 digit categories or as fake. | Improves image quality and reduces digit ambiguity. |
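For the multi-class variant, the discriminator ends in an 11-way classification head trained with categorical cross-entropy. A sketch of the loss terms (class indices 0-9 for real digits, 10 for generated images; the helper names are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
FAKE_CLASS = 10  # index of the extra "generated" class

def discriminator_loss(d_logits_real, real_labels, d_logits_fake):
    """d_logits_* have shape (batch, 11). Real images keep their digit label,
    generated images are assigned the extra fake class."""
    fake_targets = torch.full((d_logits_fake.size(0),), FAKE_CLASS,
                              dtype=torch.long, device=d_logits_fake.device)
    return criterion(d_logits_real, real_labels) + criterion(d_logits_fake, fake_targets)

def generator_loss(d_logits_fake, intended_labels):
    """The generator is rewarded when its samples are classified as the intended digit."""
    return criterion(d_logits_fake, intended_labels)
```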
#### Examples of Digit (3) Generated by cGAN with Real/Fake Discriminator:
![Images generated for digit (3) by cGAN with Real/Fake Discriminator](images/generated_mnist_num3_1_.png)
#### Examples of Digit (3) Generated by cGAN with Multi-Class Discriminator:
![Images generated for digit (3) by cGAN with Multi-Class Discriminator](images/generated_mnist_num3_2_.png)
---
## Conclusion
- GANs enable the generation of realistic handwritten digits.
- Adding conditioning via a cGAN allows control over the generated digit.
- Using a multi-class discriminator improves digit differentiation and reduces ambiguities.
### References
- [DCGAN Tutorial](https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html)
- [MNIST Dataset](https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html#torchvision.datasets.MNIST)
---
## Part 2: Conditional GAN (cGAN) with U-Net
### **Generator**
In the cGAN architecture, the generator chosen is a U-Net.
#### **U-Net Overview:**
- A U-Net takes an image as input and outputs another image.
- It consists of two main parts: an encoder and a decoder.
- The encoder reduces the image dimension to extract main features.
- The decoder reconstructs the image using these features.
- Unlike a simple encoder-decoder model, U-Net has skip connections that link encoder layers to corresponding decoder layers. These allow the decoder to use both high-frequency and low-frequency information.
#### **Architecture & Implementation:**
The encoder takes a colored picture (3 channels: RGB), processes it through a series of convolutional layers, and encodes the features. The decoder then reconstructs the image using transposed convolutional layers, utilizing skip connections to enhance details.
![architecture Unet](images/unet_architecture.png)
### **Question:**
Knowing that the input and output images have a shape of 256x256 with 3 channels, what will be the dimension of the feature map "x8"?
**Answer:** The dimension of the feature map x8 is **[numBatch, 512, 32, 32]**.
### **Question:**
Why are skip connections important in the U-Net architecture?
#### **Explanation:**
Skip connections link encoder and decoder layers, improving the model in several ways:
- **Preserving Spatial Resolution:** Helps retain fine details that may be lost during encoding.
- **Preventing Information Loss:** Transfers important features from the encoder to the decoder.
- **Improving Performance:** Combines high-level and low-level features for better reconstruction.
- **Mitigating Vanishing Gradient:** Eases training by allowing gradient flow through deeper layers.
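In code, a skip connection amounts to concatenating an encoder feature map with the decoder feature map of matching spatial size before the next decoder block; a minimal sketch (names are illustrative):

```python
import torch

def decode_step(upsampled, encoder_feature, conv_block):
    """Concatenate the encoder feature map with the upsampled decoder feature
    along the channel dimension, then apply the decoder convolution block."""
    x = torch.cat([upsampled, encoder_feature], dim=1)  # channel count doubles here
    return conv_block(x)
```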
---
### **Discriminator**
In the cGAN architecture, we use a **PatchGAN** discriminator instead of a traditional binary classifier.
#### **PatchGAN Overview:**
- Instead of classifying the entire image as real or fake, PatchGAN classifies **N × N patches** of the image.
- The size **N** depends on the number of convolutional layers in the network:
| Layers | Patch Size |
|--------|------------|
| 1 | 16×16 |
| 2 | 34×34 |
| 3 | 70×70 |
| 4 | 142×142 |
| 5 | 286×286 |
| 6 | 574×574 |
For this project, we use a **70×70 PatchGAN**.
![patchGAN](images/patchGAN.png)
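The patch sizes in the table are the receptive field of a single output unit, which can be computed by walking backwards through the layers with RF = (RF − 1) × stride + kernel. The sketch below reproduces the 70×70 value under the standard PatchGAN assumption of 4×4 kernels, three stride-2 convolutions and two final stride-1 convolutions (the exact correspondence with the layer count in the table may differ slightly).

```python
def receptive_field(layers):
    """Receptive field of one output unit, walking backwards through
    (kernel, stride) pairs: rf = (rf - 1) * stride + kernel."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# 70x70 PatchGAN: three 4x4 stride-2 convolutions, one 4x4 stride-1 convolution,
# and the final 4x4 stride-1 output convolution.
layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(layers))  # -> 70
```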
### **Question:**
How many learnable parameters does this neural network have?
1. **conv1:**
- Input channels: 6
- Output channels: 64
- Kernel size: 4×4
- Parameters in conv1: **(4 × 4 × 6 + 1 bias) × 64 = 6208**
2. **conv2:**
- Weights: 4 × 4 × 64 × 128 = **131072**
- Biases: **128**
- BatchNorm: (scale + shift) for 128 channels: 2 × 128 = **256**
- Parameters in conv2: **131072 + 128 + 256 = 131456**
3. **conv3:**
- Weights: 4 × 4 × 128 × 256 = **524288**
- Biases: **256**
- BatchNorm: (scale + shift) for 256 channels: 2 × 256 = **512**
- Parameters in conv3: **524288 + 256 + 512 = 525056**
4. **conv4:**
- Weights: 4 × 4 × 256 × 512 = **2097152**
- Biases: **512**
- BatchNorm: (scale + shift) for 512 channels: 2 × 512 = **1024**
- Parameters in conv4: **2097152 + 512 + 1024 = 2098688**
5. **out:**
- Weights: 4 × 4 × 512 × 1 = **8192**
- Biases: **1**
- Parameters in out: **8192 + 1 = 8193**
**Total Learnable Parameters:**
**6208 + 131456 + 525056 + 2098688 + 8193 = 2,769,601**
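This total can be checked directly in PyTorch. The sketch below mirrors the layer shapes used in the count; strides and padding do not affect the number of parameters, so typical 70×70 PatchGAN values are assumed here.

```python
import torch.nn as nn

# Same layer shapes as in the count above; only kernel size and channel counts
# matter for the parameter total.
disc = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),                          # conv1
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # conv2
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),  # conv3
    nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),  # conv4
    nn.Conv2d(512, 1, 4, stride=1, padding=1),                                            # out
)

total = sum(p.numel() for p in disc.parameters() if p.requires_grad)
print(total)  # -> 2769601
```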
---
### **Results Comparison: 100 vs. 200 Epochs**
#### **1. Training Performance**
- **100 Epochs:**
- The generator produces images that resemble the target facades.
- Some fine details may be missing, and slight noise is present.
- **200 Epochs:**
- The generated facades have more details and refined structures.
- Improved high-frequency details make outputs closer to target images.
- Less noise, but minor artifacts may still exist.
#### **2. Evaluation Performance**
- **100 Epochs:**
- Struggles with realistic facades on unseen masks.
- Noticeable noise, but sometimes less than the 200-epoch model.
- Some structures exist but lack consistency.
- **200 Epochs:**
- Overfits to training data, leading to poor generalization.
- Instead of realistic facades, it reuses training patches, causing noisy outputs.
#### **Conclusion & Observations**
- **Improved Detail at 200 Epochs:** Facades generated from the training masks are sharper and more detailed.
- **Overfitting Issue:** Generalization to unseen masks degrades beyond 100 epochs.
- **Limited Dataset Size (378 Images):** Restricts the diversity and quality the model can achieve.
#### **Example image of training set at 100 and 200 epochs:**
![Example image for training set at 100 and 200 epochs](images/facades_trainingset_100_200_.png)
#### **Example images of evaluation set at 100 and 200 epochs:**
![Example images of evaluation set at 100 and 200 epochs](images/facades_valset_100_200_.png)
## Part 3: Diffusion Models
Diffusion models represent a cutting-edge approach in generative modeling, focusing on the transformation of random noise into realistic data through an iterative denoising process. The reverse diffusion process begins with a noisy image and gradually refines it into a high-quality image using a trained neural network. These models have gained significant attention due to their ability to outperform GANs in generating diverse and high-fidelity images.
### Overview of Diffusion Models
This project explores **Denoising Diffusion Probabilistic Models (DDPMs)**. The fundamental idea is to progressively apply noise to an image over multiple timesteps and then train a neural network to reverse this process: the model learns to predict and remove the noise at each timestep, effectively reconstructing a clean image.
### The Diffusion Process
- **Forward Diffusion Process**: Noise is incrementally added to an image over a series of timesteps. As the number of steps increases, the image becomes progressively noisier until it reaches a state of pure noise.
- **Reverse Diffusion Process**: A neural network is trained to undo the noise addition process by predicting and removing noise step by step, ultimately recovering the original image.
### Noise Scheduler
To regulate the diffusion process, we use a **noise scheduler**, which defines the amount of noise added at each timestep. The model is trained on the MNIST dataset, as introduced in Part 1 of this project.
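A minimal sketch of the forward (noising) step with the `diffusers` scheduler, assuming a batch of MNIST images already normalized to [-1, 1]; the scheduler settings are illustrative.

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)  # linear beta schedule by default

clean_images = torch.randn(16, 1, 32, 32)            # stand-in for a normalized MNIST batch
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.size(0),))

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
```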
### Architecture for Diffusion Model: **UNet2DModel (Diffusion Model)**
The UNet2DModel is specifically designed for diffusion-based denoising tasks. Key architectural features include:
- **Time Conditioning**: The `time_proj` and `time_embedding` modules encode timestep information, crucial for learning the progressive denoising process.
- **ResNet Blocks**: Instead of simple convolution layers, each downsampling and upsampling step integrates `ResNetBlock2D` with **GroupNorm** and **SiLU (Swish) activation**, enhancing robustness.
- **SiLU Activation**: Chosen over LeakyReLU or ReLU for smoother gradients.
- **GroupNorm**: Provides more stable training compared to BatchNorm in diffusion models.
### Training the Model
We train the diffusion model using the **diffusers** library on the MNIST dataset. Performance is assessed across different epochs to analyze the quality of generated images.
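A hedged sketch of one training step, assuming a small `UNet2DModel` configured for 32×32 single-channel inputs; the block sizes and learning rate are illustrative, not necessarily those used in our notebook.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

model = UNet2DModel(
    sample_size=32, in_channels=1, out_channels=1,
    layers_per_block=2, block_out_channels=(32, 64, 128),
    down_block_types=("DownBlock2D", "DownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "UpBlock2D", "UpBlock2D"),
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(clean_images):
    noise = torch.randn_like(clean_images)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.size(0),))
    noisy = scheduler.add_noise(clean_images, noise, t)
    noise_pred = model(noisy, t).sample    # the UNet predicts the added noise
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```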
**Bonus**: We also evaluate a standard **UNet Model** by inputting the timestep embedding along with the noisy image. This UNet follows an encoder-decoder structure and is typically used for image-to-image translation.
### UNet Model Architecture
The UNet Model, in contrast to the diffusion model, aims to predict and remove noise in a single step. Key features include:
- **Encoder-Decoder Structure**: Employs Conv2D + BatchNorm + LeakyReLU for downsampling and ConvTranspose2D + BatchNorm + ReLU for upsampling.
- **Skip Connections**: Each downsampling layer has a corresponding upsampling layer that concatenates feature maps.
- **Dropout Layers**: Enhance regularization to prevent overfitting.
- **LeakyReLU Activation**: Improves feature extraction in the downsampling process.
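A minimal sketch of the two building blocks described above (channel counts and dropout rate are illustrative):

```python
import torch.nn as nn

def down_block(in_ch, out_ch):
    # Strided convolution halves the spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up_block(in_ch, out_ch, dropout=0.0):
    # Transposed convolution doubles the spatial resolution; in_ch is doubled
    # upstream when the skip connection is concatenated.
    layers = [
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    ]
    if dropout > 0:
        layers.append(nn.Dropout(dropout))
    return nn.Sequential(*layers)
```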
### Comparison: UNet Model vs. Diffusion UNet (DDPM)
| Feature | UNet Model | Diffusion UNet (DDPM) |
| ----------------------- | ------------------------------------------ | -------------------------------- |
| **Primary Task** | Image-to-image translation | Image denoising via diffusion |
| **Downsampling** | Strided Conv2D + BatchNorm + LeakyReLU | ResNet Blocks + GroupNorm + SiLU |
| **Upsampling** | Transpose Conv2D | Interpolation + Conv2D |
| **Activation Function** | LeakyReLU (down), ReLU (up), Tanh (output) | SiLU (Swish) |
| **Normalization** | BatchNorm | GroupNorm |
| **Skip Connections** | Yes | Yes |
| **Time Embedding** | No | Yes |
### Results
**Diffusion UNet2D Model**
![Diffusion U-Net Example result](images/diffuse_denoise_mnist.png)
**UNet Model**
![UNet Example result](images/unet_denoise_mnist.png)
### Analysis: Noise Prediction and Image Denoising Performance
Both models leverage the UNet architecture but employ different strategies for noise removal:
- **Diffusion UNet2D Model**: Works iteratively, progressively removing noise in multiple steps. This allows it to generate high-quality images but at the cost of increased computational complexity.
- **UNet Model**: Predicts and removes noise in a single step, making it significantly faster. However, it struggles to accurately predict the noise pattern, leading to incomplete denoising and residual artifacts in the final output.
### Conclusion
The **UNet2DModel (Diffusion Model)** provides superior denoising quality due to its iterative refinement process, making it ideal for high-quality image generation. However, its computational cost is high, limiting its applicability in real-time scenarios. On the other hand, the **UNet Model** is computationally efficient, offering faster inference, but its denoising performance is subpar, resulting in images where numbers become unrecognizable due to residual noise.
## Part 4: What About Those Beautiful Images?
In this experiment, we compared the performance of two models: a large pre-trained model (Stable Diffusion 3.5 with quantization) and a smaller model (OFA-Sys small-stable-diffusion).
### Strong Model: Stable Diffusion 3.5 with Quantization
Using 4-bit quantization, this model produced high-quality and creative images from simple textual prompts, even with reduced memory requirements. We tested the model with prompts like:
- "Underwater wheeled bee"
- "Monster buys lollipop"
- "Imagine a world without eggs"
The results were visually appealing and imaginative, showcasing the model's capability to generate intricate and high-quality images.
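For reference, the quantized pipeline was set up along the lines of the `diffusers` documentation for Stable Diffusion 3.5 with 4-bit NF4 weights. The sketch below is indicative rather than a verbatim copy of our notebook; it requires the `bitsandbytes` package and access to the gated `stabilityai/stable-diffusion-3.5-large` checkpoint.

```python
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3.5-large"

# Load only the large transformer in 4-bit NF4 to fit in limited GPU memory.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SD3Transformer2DModel.from_pretrained(
    model_id, subfolder="transformer",
    quantization_config=nf4_config, torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("Underwater wheeled bee", num_inference_steps=28, guidance_scale=4.5).images[0]
image.save("generated1.jpeg")
```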
**Example Results from Stable Diffusion 3.5:**
![Stable Diffusion 3.5 Result 1](images/generated1.jpeg)
*Underwater wheeled bee from large model*
![Stable Diffusion 3.5 Result 2](images/generated2.jpeg)
*Monster buys lollipop from large model*
![Stable Diffusion 3.5 Result 3](images/generated3.jpeg)
*Imagine a world without eggs from large model*
### Smaller Model: OFA-Sys Small-Stable-Diffusion
While the smaller model generated images, the quality and creativity were noticeably lower. The images lacked the detail and originality seen with the larger model, confirming that smaller models are less capable of handling complex, creative prompts.
**Example Results from OFA-Sys Small-Stable-Diffusion:**
![OFA-Sys Small Result 1](images/not_as_good_generated_image1.png)
*Underwater wheeled bee from smaller model*
![OFA-Sys Small Result 2](images/not_as_good_generated_image2.png)
*Monster buys lollipop from smaller model*
![OFA-Sys Small Result 2](images/not_as_good_generated_image3.png)
*Imagine a world without eggs from smaller model*
### Conclusion
The experiment demonstrates that larger models, like Stable Diffusion 3.5, produce superior image quality and creativity. However, smaller models can still be useful in scenarios with hardware limitations, though they fall short in terms of detail and imagination when compared to their larger counterparts.
---