
    TD 2 : GAN & Diffusion

    MSO 3.4 Apprentissage Automatique

    Overview

This project explores generative models for images, focusing on Generative Adversarial Networks (GANs) and Diffusion models. The goal is to understand their implementation, analyze specific architectures, and apply different training strategies for image generation, with and without conditioning.

    Part 1: DC-GAN

In this part, we study the basics of Generative Adversarial Networks through a DCGAN, following the DCGAN Tutorial.

We generate handwritten digits using the MNIST dataset available in the torchvision package: MNIST Dataset.

Implemented Modifications

• Adapted the tutorial code to work with the MNIST dataset.
• Plotted the generator and discriminator loss curves as a function of gradient steps.
• Compared the generated images with images from the MNIST dataset.

Examples of generated images:

(Example images of digits generated by the DCGAN)

Question: How can we control which digit is generated?

To control which digit the generator produces, we implement a Conditional GAN (cGAN) with the following modifications:

Generator Modifications

• Instead of using only random noise, we concatenate a class label (one-hot encoded or embedded) with the noise vector, as sketched below.
• The generator thus learns to produce specific digits according to the provided label.
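A minimal sketch of this conditioning (the latent size, embedding size, and names below are illustrative assumptions, not the exact values used in the notebook):

```python
import torch
import torch.nn as nn

# Illustrative sketch: concatenate a label embedding with the noise vector.
latent_dim, n_classes, batch_size = 100, 10, 64
label_emb = nn.Embedding(n_classes, n_classes)        # could also be a one-hot encoding

z = torch.randn(batch_size, latent_dim)               # random noise
labels = torch.randint(0, n_classes, (batch_size,))   # requested digits
gen_input = torch.cat([z, label_emb(labels)], dim=1)  # shape [64, 110]
# gen_input is then fed to the first layer of the DCGAN generator.
```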

Discriminator Modifications

• Rather than only distinguishing real from fake, the discriminator is modified to classify images as one of the digits 0-9 or as generated (fake).
• It outputs a probability distribution over 11 classes (10 digits + 1 for generated images); see the sketch below.
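For illustration, the discriminator's output layer can be replaced by an 11-way classification head; the flattened feature size below is an assumption, not the exact one used in the project:

```python
import torch.nn as nn

# Illustrative 11-class head: 10 digit classes + 1 "fake" class.
class MultiClassHead(nn.Module):
    def __init__(self, in_features=512 * 4 * 4, n_classes=11):
        super().__init__()
        self.fc = nn.Linear(in_features, n_classes)

    def forward(self, features):
        # `features` is the flattened output of the convolutional trunk.
        return self.fc(features)  # raw logits over the 11 classes
```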

Training Process Update

• The generator is trained to fool the discriminator while producing images that match the requested class label.
• A categorical cross-entropy loss is used for the discriminator instead of a binary loss, since it performs multi-class classification.
• The loss function encourages the generator to produce digits that are classified correctly; a loss sketch follows this list.
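A hedged sketch of the corresponding losses, assuming the discriminator returns [batch, 11] logits with index 10 reserved for the "fake" class (all names are illustrative):

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
FAKE_CLASS = 10  # extra class for generated images

def discriminator_loss(logits_real, real_labels, logits_fake):
    # Real images must be classified as their digit (0-9),
    # generated images as the extra "fake" class.
    fake_targets = torch.full((logits_fake.size(0),), FAKE_CLASS, dtype=torch.long)
    return ce(logits_real, real_labels) + ce(logits_fake, fake_targets)

def generator_loss(logits_fake, requested_labels):
    # The generator is rewarded when its samples are classified
    # as the digit it was asked to produce (not as "fake").
    return ce(logits_fake, requested_labels)
```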

Implementation of a cGAN with a Multi-Class Discriminator

To improve image generation and avoid ambiguity between certain digits (e.g. 3 vs 7), we set up a multi-class discriminator that classifies generated images into one of the 10 digit categories or as a generated (fake) image.

Algorithm Comparison

| Model | Description | Result |
|---|---|---|
| cGAN | The generator learns to produce images conditioned on the class label; the discriminator only distinguishes real from fake. | Can generate realistic digits, but they are sometimes ambiguous (e.g. confusion between 3 and 7). |
| cGAN with multi-class discriminator | The generator produces digits conditioned on the label, and the discriminator learns to classify images into one of the 10 digit categories or as fake. | Improves the quality of the generated images and reduces ambiguity between digits. |

Examples of images of the digit 3 generated by the cGAN with a real/fake discriminator:

(Example images of generated digits)

Examples of images of the digit 3 generated by the cGAN with a multi-class discriminator:

(Example images of generated digits)

    Conclusion

• GANs make it possible to generate realistic handwritten digits.
• Adding conditioning via a cGAN makes it possible to control which digit is generated.
• Using a multi-class discriminator improves differentiation between digits and reduces ambiguity.


    Part 2: Conditional GAN (cGAN) with U-Net

    Generator

    In the cGAN architecture, the generator chosen is a U-Net.

    U-Net Overview:

    • A U-Net takes an image as input and outputs another image.
    • It consists of two main parts: an encoder and a decoder.
      • The encoder reduces the image dimension to extract main features.
      • The decoder reconstructs the image using these features.
    • Unlike a simple encoder-decoder model, U-Net has skip connections that link encoder layers to corresponding decoder layers. These allow the decoder to use both high-frequency and low-frequency information.

    Architecture & Implementation:

    The encoder takes a colored picture (3 channels: RGB), processes it through a series of convolutional layers, and encodes the features. The decoder then reconstructs the image using transposed convolutional layers, utilizing skip connections to enhance details.
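A toy two-level sketch of this encoder-decoder structure with a skip connection (channel counts and depth are illustrative; the actual U-Net used in the project is deeper):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        # The last decoder stage takes the upsampled features concatenated with
        # the matching encoder features (the skip connection).
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)                       # [B, 64, H/2, W/2]
        e2 = self.enc2(e1)                      # [B, 128, H/4, W/4]
        d1 = self.dec1(e2)                      # [B, 64, H/2, W/2]
        return self.dec2(torch.cat([d1, e1], 1))  # skip connection, output [B, 3, H, W]
```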

    (Insert U-Net architecture diagram)

    Question:

    Knowing that the input and output images have a shape of 256x256 with 3 channels, what will be the dimension of the feature map "x8"?

Answer: Each of the eight encoder stages halves the spatial resolution (256 → 128 → 64 → 32 → 16 → 8 → 4 → 2 → 1), so the feature map x8 has dimension [numBatch, 512, 1, 1].

    Question:

    Why are skip connections important in the U-Net architecture?

    Explanation:

    Skip connections link encoder and decoder layers, improving the model in several ways:

    • Preserving Spatial Resolution: Helps retain fine details that may be lost during encoding.
    • Preventing Information Loss: Transfers important features from the encoder to the decoder.
    • Improving Performance: Combines high-level and low-level features for better reconstruction.
    • Mitigating Vanishing Gradient: Eases training by allowing gradient flow through deeper layers.

    Discriminator

    In the cGAN architecture, we use a PatchGAN discriminator instead of a traditional binary classifier.

    PatchGAN Overview:

    • Instead of classifying the entire image as real or fake, PatchGAN classifies N × N patches of the image.
    • The size N depends on the number of convolutional layers in the network:
| Layers | Patch size |
|---|---|
| 1 | 16×16 |
| 2 | 34×34 |
| 3 | 70×70 |
| 4 | 142×142 |
| 5 | 286×286 |
| 6 | 574×574 |

    For this project, we use a 70×70 PatchGAN.
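For reference, a minimal pix2pix-style sketch of such a 70×70 PatchGAN (channel counts follow the common pix2pix configuration, which may differ from the exact implementation here):

```python
import torch
import torch.nn as nn

# The discriminator takes the condition image concatenated with the real or
# generated image (3 + 3 = 6 channels) and outputs one real/fake logit per patch.
def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

patch_discriminator = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # no norm on the first layer
    conv_block(64, 128, stride=2),
    conv_block(128, 256, stride=2),
    conv_block(256, 512, stride=1),
    nn.Conv2d(512, 1, 4, stride=1, padding=1),  # one logit per 70x70 receptive field
)

x = torch.randn(1, 6, 256, 256)
print(patch_discriminator(x).shape)  # torch.Size([1, 1, 30, 30])
```

Each of the 30×30 output logits scores one 70×70 receptive field of the input, which is where the "70×70 PatchGAN" name comes from.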

    (Insert PatchGAN architecture diagram)

    Results Comparison: 100 vs. 200 Epochs

    1. Training Performance

    • 100 Epochs:
      • The generator produces images that resemble the target facades.
      • Some fine details may be missing, and slight noise is present.
    • 200 Epochs:
      • The generated facades have more details and refined structures.
      • Improved high-frequency details make outputs closer to target images.
      • Less noise, but minor artifacts may still exist.

    2. Evaluation Performance

    • 100 Epochs:
• Struggles to produce realistic facades from unseen masks.
      • Noticeable noise, but sometimes less than the 200-epoch model.
      • Some structures exist but lack consistency.
    • 200 Epochs:
      • Overfits to training data, leading to poor generalization.
      • Instead of realistic facades, it reuses training patches, causing noisy outputs.

    Conclusion & Observations

• Improved Detail at 200 Epochs: Facades generated from the training masks are sharper and more detailed.
    • Overfitting Issue: Generalization is poor beyond 100 epochs.
    • Limited Dataset Size (378 Images): Restricts model’s diversity and quality.

    (Insert example images for training set at 100 and 200 epochs) (Insert example images for evaluation set at 100 and 200 epochs)

    Part 3: Diffusion Models

    Diffusion models are a fascinating category of generative models that focus on iteratively transforming random noise into realistic data. The reverse diffusion process starts from noisy data, and with the help of a trained neural network, it gradually denoises the data, ultimately generating high-quality, detailed images. These models have been gaining popularity due to their ability to surpass GANs in generating diverse and high-quality images.

    Overview of Diffusion Models

    In the context of this project, we will focus on DDPMs (Denoising Diffusion Probabilistic Models), which are widely used for generating images from noise. The key idea is to apply noise progressively over several timesteps, and then train a neural network to reverse this process. By doing so, the model learns to generate realistic images by denoising noisy samples.

    The Diffusion Process

• Forward Diffusion Process: Starting from a real image, noise is gradually added at each timestep. The amount of noise grows with each step, so the image becomes noisier as the timestep increases; at the final timestep, the image is essentially pure noise.

• Reverse Diffusion Process: A neural network is trained to reverse this process by predicting the noise added at each timestep. This lets the model denoise images step by step, eventually generating a clean image from random noise (see the sketch after this list).
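A minimal sketch of the forward noising step under a simple linear beta schedule (the schedule values are common defaults, used here only for illustration):

```python
import torch

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)       # t: tensor of timestep indices
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise  # the network is trained to predict `noise` from (xt, t)
```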

    Noise Scheduler

    To control the diffusion process, we will create a noise scheduler. This scheduler defines how much noise is added to the image at each timestep. We will train the model on the MNIST dataset, which is also used in Part 1 of this project.
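With the diffusers library (used later in this part), the same noising step can be written as follows; the timestep count is a default value, not necessarily the project's setting:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

clean_images = torch.randn(8, 1, 28, 28)   # a batch of MNIST-sized images
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (8,))

noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
```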

    Architecture for Diffusion Model: U-Net

    For both the generator in the Conditional GAN (cGAN) and the Diffusion model, we will utilize a U-Net architecture. This is a popular model in image-to-image tasks and is well-suited for tasks requiring pixel-level precision, such as image generation and denoising.

    U-Net Overview:

    • Encoder-Decoder Structure: The encoder progressively reduces the image size to extract features, while the decoder reconstructs the image from these features.
    • Skip Connections: U-Net has skip connections that link corresponding layers of the encoder and decoder. This allows the decoder to access both high-level and low-level features, improving the image reconstruction quality.

    For the Diffusion U-Net, the architecture will be slightly different:

• Time Conditioning: A key difference is that the model receives the current timestep as input, in addition to the noisy image. This lets the network adjust its denoising behaviour depending on how much noise is present.
• ResNet Blocks: In place of plain convolution layers, the model uses ResNet blocks built from GroupNorm and SiLU activations, which makes training more robust and stable (see the sketch below).
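A sketch of such a time-conditioned ResNet block (the group count, channel sizes, and the way the time embedding is injected are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels, time_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)   # channels assumed divisible by 8
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.conv1(self.act(self.norm1(x)))
        h = h + self.time_proj(t_emb)[:, :, None, None]   # inject timestep information
        h = self.conv2(self.act(self.norm2(h)))
        return x + h                                      # residual connection
```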

    PatchGAN Discriminator for Diffusion Model

    In contrast to traditional GANs, the PatchGAN discriminator works by classifying patches of the image rather than the whole image at once. This allows the model to focus on local details, leading to more precise image generation.

    Training the Model

    We will train the diffusion model on the MNIST dataset using the diffusers library, which provides tools for training and using diffusion models. We will compare the results of training for different epochs and assess the quality of the generated images.
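A hedged sketch of what one such training loop can look like with diffusers (model size, schedule, and hyperparameters are illustrative, not the project's exact configuration):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel
from torchvision import datasets, transforms

# MNIST resized from 28x28 to 32x32 and normalized to [-1, 1].
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
dataloader = torch.utils.data.DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    batch_size=64, shuffle=True)

model = UNet2DModel(
    sample_size=32, in_channels=1, out_channels=1,
    block_out_channels=(64, 128, 128),
    down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for images, _ in dataloader:
        noise = torch.randn_like(images)
        timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                                  (images.size(0),))
        noisy = scheduler.add_noise(images, noise, timesteps)

        noise_pred = model(noisy, timesteps).sample   # the U-Net predicts the added noise
        loss = F.mse_loss(noise_pred, noise)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```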

    Comparison of the UNet Architectures for cGAN and Diffusion Models

| Feature | cGAN UNet | Diffusion UNet (DDPM) |
|---|---|---|
| Task | Image-to-image translation | Image denoising (diffusion) |
| Downsampling | Strided Conv2D + BatchNorm + LeakyReLU | ResNet blocks + GroupNorm + SiLU |
| Upsampling | Transposed Conv2D | Interpolation + Conv2D |
| Activation | LeakyReLU (down), ReLU (up), Tanh (output) | SiLU (Swish) |
| Normalization | BatchNorm | GroupNorm |
| Skip connections | Yes | Yes |
| Time embedding | No | Yes |

    Conclusion

    In this section, we have outlined the architecture and training process for a Diffusion model using a U-Net. This model is trained to perform image denoising, progressively refining noisy images into clean ones. We compared it with the U-Net used in cGANs, highlighting the key differences and how they are tailored to their respective tasks.

    By leveraging diffusion models, we aim to generate highly detailed and diverse images that surpass traditional GANs, especially in the context of noisy data and image-to-image tasks. We will continue training the model and assess its performance in the following steps.


How to submit your work?

This work must be done individually. The expected output is a private repository named gan-diffusion on https://gitlab.ec-lyon.fr. It must contain your notebook (or Python files) and a README.md file that briefly explains the successive steps of the project. Don't forget to add your teacher as a developer member of the project. The last commit is due before 11:59 pm on Wednesday, April 9th, 2025. Subsequent commits will not be considered.