
    TD 2: GAN & Diffusion

    MSO 3.4 Machine Learning

    Overview

    This project explores generative models for images, focusing on Generative Adversarial Networks (GANs) and Diffusion models. The objective is to understand their implementation, analyze specific architectures, and apply different training strategies for generating and denoising images, both with and without conditioning.


    Part 1: DC-GAN

    In this section, we study the fundamentals of Generative Adversarial Networks through a Deep Convolutional GAN (DCGAN). We follow the tutorial: DCGAN Tutorial.

    We generate handwritten digits using the MNIST dataset available in the torchvision package: MNIST Dataset.

    Implemented Modifications

    • Adapted the tutorial's code to work with the MNIST dataset.
    • Displayed loss curves for both the generator and the discriminator over training steps.
    • Compared generated images with real MNIST dataset images.

    Examples of Generated Images:

    Example images of digits generated by DCGAN


    Question: How to Control the Generated Digit?

    To control which digit the generator produces, we implement a Conditional GAN (cGAN) with the following modifications:

    Generator Modifications

    • Instead of using only random noise, we concatenate a class label (one-hot encoded or embedded) with the noise vector.
    • This allows the generator to learn to produce specific digits based on the provided label.
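    A minimal sketch of this conditioning (illustrative layer sizes, assuming MNIST images resized to 32×32; the actual notebook may differ):

```python
import torch
import torch.nn as nn

# Illustrative conditional generator: an embedded class label is concatenated
# with the noise vector before the first transposed convolution.
class ConditionalGenerator(nn.Module):
    def __init__(self, nz=100, n_classes=10, ngf=64):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)  # one-hot-like embedding
        self.net = nn.Sequential(
            nn.ConvTranspose2d(nz + n_classes, ngf * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),           # 1x1 -> 4x4
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),           # 4x4 -> 8x8
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),               # 8x8 -> 16x16
            nn.ConvTranspose2d(ngf, 1, 4, 2, 1, bias=False),
            nn.Tanh(),                                        # 16x16 -> 32x32
        )

    def forward(self, noise, labels):
        # noise: [B, nz, 1, 1], labels: [B] integer class indices
        y = self.label_emb(labels)[:, :, None, None]          # [B, n_classes, 1, 1]
        return self.net(torch.cat([noise, y], dim=1))
```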

    Discriminator Modifications

    • Instead of just distinguishing real from fake, the discriminator is modified to classify images as digits (0-9) or as generated (fake).
    • It outputs a probability distribution over 11 classes (10 digits + 1 for generated images).

    Training Process Update

    • The generator is trained to fool the discriminator while generating images that match the correct class label.
    • A categorical cross-entropy loss is used for the discriminator instead of a binary loss since it performs multi-class classification.
    • The loss function encourages the generator to produce well-classified digits.
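    A sketch of the corresponding losses, assuming a discriminator D that outputs [batch, 11] logits and a generator as above (function and variable names are hypothetical):

```python
import torch
import torch.nn as nn

# Sketch of the loss computation, assuming a discriminator D that returns
# [batch, 11] logits (10 digit classes + 1 "fake" class); names are hypothetical.
FAKE_CLASS = 10
criterion = nn.CrossEntropyLoss()

def discriminator_loss(D, real_imgs, real_labels, fake_imgs):
    loss_real = criterion(D(real_imgs), real_labels)        # real digits keep labels 0-9
    fake_targets = torch.full((fake_imgs.size(0),), FAKE_CLASS,
                              dtype=torch.long, device=fake_imgs.device)
    loss_fake = criterion(D(fake_imgs.detach()), fake_targets)
    return loss_real + loss_fake

def generator_loss(D, fake_imgs, requested_labels):
    # The generator is rewarded when its samples are classified as the requested digit.
    return criterion(D(fake_imgs), requested_labels)
```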

    Implementing a cGAN with a Multi-Class Discriminator

    To enhance image generation and reduce ambiguities between similar digits (e.g., 3 vs 7), we introduce a multi-class discriminator that classifies generated images into one of the 10 digit categories or as fake.

    Algorithm Comparison

    | Model | Description | Result |
    | --- | --- | --- |
    | cGAN | The generator learns to produce images conditioned on the class label; the discriminator only distinguishes real from fake. | Can generate realistic digits, but sometimes ambiguous (e.g., confusion between 3 and 7). |
    | cGAN with Multi-Class Discriminator | The generator produces class-conditioned digits; the discriminator classifies images into one of the 10 digit categories or as fake. | Improves image quality and reduces digit ambiguity. |

    Examples of Digit (3) Generated by cGAN with Real/Fake Discriminator:

    Images generated for digit (3) by cGAN with Real/Fake Discriminator

    Examples of Digit (3) Generated by cGAN with Multi-Class Discriminator:

    Images generated for digit (3) by cGAN with Multi-Class Discriminator


    Conclusion

    • GANs enable the generation of realistic handwritten digits.
    • Adding conditioning via a cGAN allows control over the generated digit.
    • Using a multi-class discriminator improves digit differentiation and reduces ambiguities.


    Part 2: Conditional GAN (cGAN) with U-Net

    Generator

    In the cGAN architecture, the generator chosen is a U-Net.

    U-Net Overview:

    • A U-Net takes an image as input and outputs another image.
    • It consists of two main parts: an encoder and a decoder.
      • The encoder reduces the image dimension to extract main features.
      • The decoder reconstructs the image using these features.
    • Unlike a simple encoder-decoder model, U-Net has skip connections that link encoder layers to corresponding decoder layers. These allow the decoder to use both high-frequency and low-frequency information.

    Architecture & Implementation:

    The encoder takes a color image (3 channels: RGB), processes it through a series of convolutional layers, and encodes the features. The decoder then reconstructs the image with transposed convolutional layers, using skip connections to recover details.

    U-Net architecture diagram
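    For illustration, a reduced two-level version of this encoder/decoder with skip connections might look as follows (the actual generator uses seven downsampling levels, down1 to down7):

```python
import torch
import torch.nn as nn

# Reduced two-level U-Net sketch showing how skip connections are wired;
# the actual generator uses seven downsampling levels (down1 to down7).
class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2))      # 256 -> 128
        self.down2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1),
                                   nn.BatchNorm2d(128), nn.LeakyReLU(0.2))             # 128 -> 64
        self.bottleneck = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU())       # 64 -> 32
        self.up2 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1),
                                 nn.BatchNorm2d(128), nn.ReLU())                       # 32 -> 64
        self.up1 = nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, 2, 1),
                                 nn.BatchNorm2d(64), nn.ReLU())                        # 64 -> 128
        self.out = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, 2, 1), nn.Tanh())   # 128 -> 256

    def forward(self, x):
        d1 = self.down1(x)                        # [B, 64, 128, 128]
        d2 = self.down2(d1)                       # [B, 128, 64, 64]
        b = self.bottleneck(d2)                   # [B, 256, 32, 32]
        u2 = self.up2(b)                          # [B, 128, 64, 64]
        u1 = self.up1(torch.cat([u2, d2], 1))     # skip connection from down2
        return self.out(torch.cat([u1, d1], 1))   # skip connection from down1
```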

    Question:

    Knowing that the input and output images have a shape of 256x256 with 3 channels, what will be the dimension of the feature map "x8"?

    Answer: The encoder halves the spatial resolution at each of its eight downsampling steps, so the 256×256 input is reduced to 256 / 2⁸ = 1×1. The feature map x8 therefore has dimension [numBatch, 512, 1, 1].

    Question:

    Why are skip connections important in the U-Net architecture?

    Explanation:

    Skip connections link encoder and decoder layers, improving the model in several ways:

    • Preserving Spatial Resolution: Helps retain fine details that may be lost during encoding.
    • Preventing Information Loss: Transfers important features from the encoder to the decoder.
    • Improving Performance: Combines high-level and low-level features for better reconstruction.
    • Mitigating Vanishing Gradient: Eases training by allowing gradient flow through deeper layers.

    Discriminator

    In the cGAN architecture, we use a PatchGAN discriminator instead of a traditional binary classifier.

    PatchGAN Overview:

    • Instead of classifying the entire image as real or fake, PatchGAN classifies N × N patches of the image.
    • The size N depends on the number of convolutional layers in the network:

    | Layers | Patch size |
    | --- | --- |
    | 1 | 16×16 |
    | 2 | 34×34 |
    | 3 | 70×70 |
    | 4 | 142×142 |
    | 5 | 286×286 |
    | 6 | 574×574 |

    For this project, we use a 70×70 PatchGAN.

    PatchGAN discriminator architecture

    Question: How many learnable parameters does this neural network have?

    1. conv1:
    • Input channels: 6, output channels: 64, kernel size: 4×4
    • Parameters: (4×4×6 + 1 bias) × 64 = 6,208
    2. conv2:
    • Weights: 4×4×64×128 = 131,072
    • Biases: 128
    • BatchNorm (scale + shift for 128 channels): 2×128 = 256
    • Parameters: 131,072 + 128 + 256 = 131,456
    3. conv3:
    • Weights: 4×4×128×256 = 524,288
    • Biases: 256
    • BatchNorm (scale + shift for 256 channels): 2×256 = 512
    • Parameters: 524,288 + 256 + 512 = 525,056
    4. conv4:
    • Weights: 4×4×256×512 = 2,097,152
    • Biases: 512
    • BatchNorm (scale + shift for 512 channels): 2×512 = 1,024
    • Parameters: 2,097,152 + 512 + 1,024 = 2,098,688
    5. out:
    • Weights: 4×4×512×1 = 8,192
    • Biases: 1
    • Parameters: 8,192 + 1 = 8,193

    Total Learnable Parameters

    6,208 + 131,456 + 525,056 + 2,098,688 + 8,193 = 2,769,601
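    As a sanity check, a PyTorch sketch of this discriminator (kernel size 4 throughout; strides and padding follow the standard 70×70 PatchGAN and do not affect the count) reproduces the same total:

```python
import torch.nn as nn

# Layers as described above: conv1..conv4 plus the output convolution,
# with BatchNorm after conv2-conv4. Strides/padding are illustrative.
patchgan = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),        # conv1
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),                                                   # conv2
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),                                                   # conv3
    nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.BatchNorm2d(512),
    nn.LeakyReLU(0.2),                                                   # conv4
    nn.Conv2d(512, 1, 4, stride=1, padding=1),                           # out
)

print(sum(p.numel() for p in patchgan.parameters() if p.requires_grad))  # 2769601
```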

    Results Comparison: 100 vs. 200 Epochs

    1. Training Performance

    • 100 Epochs:
      • The generator produces images that resemble the target facades.
      • Some fine details may be missing, and slight noise is present.
    • 200 Epochs:
      • The generated facades have more details and refined structures.
      • Improved high-frequency details make outputs closer to target images.
      • Less noise, but minor artifacts may still exist.

    2. Evaluation Performance

    • 100 Epochs:
      • Struggles with realistic facades on unseen masks.
      • Noticeable noise, but sometimes less than the 200-epoch model.
      • Some structures exist but lack consistency.
    • 200 Epochs:
      • Overfits to training data, leading to poor generalization.
      • Instead of realistic facades, it reuses training patches, causing noisy outputs.

    Conclusion & Observations

    • Improved Detail at 200 Epochs: generation from the training masks is sharper and more refined.
    • Overfitting Issue: Generalization is poor beyond 100 epochs.
    • Limited Dataset Size (378 Images): Restricts model’s diversity and quality.

    Example image of training set at 100 and 200 epochs:

    Example image for training set at 100 and 200 epochs

    Example images of evaluation set at 100 and 200 epochs:

    Example images of evaluation set at 100 and 200 epochs

    Part 3: Diffusion Models

    Diffusion models are a fascinating category of generative models that focus on iteratively transforming random noise into realistic data. The reverse diffusion process starts from noisy data, and with the help of a trained neural network, it gradually denoises the data, ultimately generating high-quality, detailed images. These models have been gaining popularity due to their ability to surpass GANs in generating diverse and high-quality images.

    Overview of Diffusion Models

    In this project, we focus on DDPMs (Denoising Diffusion Probabilistic Models), which are widely used to predict the noise contained in images. The key idea is to apply noise progressively over several timesteps and then train a neural network to reverse this process. In doing so, the model learns to predict the noise in an image in order to denoise it.

    The Diffusion Process

    • Forward Diffusion Process: Starting with a real image, noise is gradually added to the image at each timestep. The amount of noise increases with each step, leading to a more noisy image as the timesteps increase. At the maximum timestep, the image is essentially pure noise.

    • Reverse Diffusion Process: In this step, a neural network is trained to reverse the noise process, effectively predicting the noise at each timestep. This allows the model to denoise images step-by-step, eventually generating a clean image from random noise.

    Noise Scheduler

    To control the diffusion process, we will create a noise scheduler. This scheduler defines how much noise is added to the image at each timestep. We will train the model on the MNIST dataset, which is also used in Part 1 of this project.
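    A sketch of the forward (noising) step with the diffusers scheduler; here `clean_images` is a placeholder batch standing in for MNIST images scaled to [-1, 1]:

```python
import torch
from diffusers import DDPMScheduler

# The scheduler defines the noise schedule over 1000 timesteps.
scheduler = DDPMScheduler(num_train_timesteps=1000)

clean_images = torch.randn(8, 1, 32, 32)   # placeholder batch; replace with real MNIST images
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.size(0),))

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
```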

    Architecture for Diffusion Model:

    UNet2DModel (Diffusion Model): this U-Net is designed for denoising diffusion probabilistic models (DDPMs), which progressively remove noise from images. Its distinguishing features are:

    • Time Conditioning: the time_proj and time_embedding modules encode timesteps, which diffusion models need in order to learn the progressive denoising process.
    • ResNet Blocks Instead of Simple Conv Layers: each downsampling and upsampling step uses ResnetBlock2D with GroupNorm + SiLU (Swish) activation, making it more robust than standard convolution layers.
    • SiLU (Swish) Activation: used instead of LeakyReLU/ReLU, offering smooth gradients.
    • GroupNorm Instead of BatchNorm: more stable for diffusion-based models.
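    A possible configuration with the diffusers library for 1-channel MNIST images padded/resized to 32×32 (block sizes and attention placement are illustrative, not necessarily the project's exact settings):

```python
from diffusers import UNet2DModel

# Illustrative UNet2DModel configuration for 32x32 grayscale images.
model = UNet2DModel(
    sample_size=32,
    in_channels=1,
    out_channels=1,
    layers_per_block=2,
    block_out_channels=(64, 128, 256),
    down_block_types=("DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "AttnUpBlock2D", "UpBlock2D"),
)
# model(noisy_images, timesteps).sample returns the predicted noise.
```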

    PatchGAN Discriminator for Diffusion Model

    In contrast to traditional GANs, the PatchGAN discriminator works by classifying patches of the image rather than the whole image at once. This allows the model to focus on local details, leading to more precise image generation.

    Training the Model

    We train the diffusion model on the MNIST dataset using the diffusers library, which provides tools for training and using diffusion models. We compare the results of training for different numbers of epochs and assess the quality of the generated images.
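    A sketch of one training step, reusing the `model` and `scheduler` sketches above (the real loop also handles devices, logging, learning-rate scheduling, etc.):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST resized to 32x32 and scaled to [-1, 1].
transform = transforms.Compose([transforms.Resize(32), transforms.ToTensor(),
                                transforms.Normalize([0.5], [0.5])])
train_loader = DataLoader(datasets.MNIST("data", train=True, download=True, transform=transform),
                          batch_size=64, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for clean_images, _ in train_loader:            # labels are unused for an unconditional DDPM
    noise = torch.randn_like(clean_images)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (clean_images.size(0),), device=clean_images.device)
    noisy_images = scheduler.add_noise(clean_images, noise, timesteps)

    noise_pred = model(noisy_images, timesteps).sample
    loss = F.mse_loss(noise_pred, noise)         # the network learns to predict the added noise

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```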

    Bonus: we also train the U-Net used as the generator of the Conditional GAN (cGAN) in Part 2, which is typically employed for image-to-image translation tasks. Its key characteristics are:

    • Encoder-Decoder Structure: downsampling (down1 to down7) with Conv2D + BatchNorm + LeakyReLU layers and upsampling (up7 to up1) with ConvTranspose2D + BatchNorm + ReLU.
    • Skip Connections: each downsampling layer has a corresponding upsampling layer that concatenates feature maps (e.g., up6 receives the output of down6).
    • Dropout in Some Layers: helps regularize training.
    • LeakyReLU Activation in Downsampling: helps learn stable representations.
    • No Explicit Time Embedding: since it is not designed for diffusion, it does not incorporate timestep embeddings.

    Comparison of the UNet Architecture for cGAN and UNet2DModel (Diffusion Model)

    To compare it with the diffusion U-Net:

    | Feature | cGAN U-Net | Diffusion U-Net (DDPM) |
    | --- | --- | --- |
    | Task | Image-to-image translation | Image denoising (diffusion) |
    | Downsampling | Strided Conv2D + BatchNorm + LeakyReLU | ResNet blocks + GroupNorm + SiLU |
    | Upsampling | Transposed Conv2D | Interpolation + Conv2D |
    | Activation | LeakyReLU (down), ReLU (up), Tanh (output) | SiLU (Swish) |
    | Normalization | BatchNorm | GroupNorm |
    | Skip Connections | Yes | Yes |
    | Time Embedding | No | Yes |

    Results

    Here are visual results from both models:

    Diffusion U-Net2D
    Diffusion U-Net Example result

    Conditional U-Net (cGAN)
    cGAN U-Net Example result

    Conclusion

    In this section, we have outlined the architecture and training process for a Diffusion model using a U-Net. This model is trained to perform image denoising, progressively refining noisy images into clean ones. We compared it with the U-Net used in cGANs, highlighting the key differences and how they are tailored to their respective tasks.

    By leveraging diffusion models, we aim to generate highly detailed and diverse images that surpass traditional GANs, especially in the context of noisy data and image-to-image tasks. We will continue training the model and assess its performance in the following steps.

    Part 4: What About Those Beautiful Images?

    In this experiment, we compared the performance of two models: a large pre-trained model (Stable Diffusion 3.5 with quantization) and a smaller model (OFA-Sys small-stable-diffusion).

    Strong Model: Stable Diffusion 3.5 with Quantization

    Using 4-bit quantization, this model produced high-quality and creative images from simple textual prompts, even with reduced memory requirements. We tested the model with prompts like:

    • "Underwater wheeled bee"
    • "Monster buys lollipop"
    • "Imagine a world without eggs"

    The results were visually appealing and imaginative, showcasing the model's capability to generate intricate and high-quality images.
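    A sketch of how such a pipeline can be loaded with 4-bit NF4 quantization, following the diffusers documentation (requires the bitsandbytes package; exact arguments may differ from the notebook used here):

```python
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

# 4-bit NF4 quantization of the SD3.5 transformer to fit in limited GPU memory.
model_id = "stabilityai/stable-diffusion-3.5-large"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SD3Transformer2DModel.from_pretrained(
    model_id, subfolder="transformer",
    quantization_config=nf4_config, torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("Underwater wheeled bee", num_inference_steps=28, guidance_scale=4.5).images[0]
image.save("underwater_wheeled_bee.png")
```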

    Example Results from Stable Diffusion 3.5:

    Stable Diffusion 3.5 Result 1
    Underwater wheeled bee from large model

    Stable Diffusion 3.5 Result 2
    Monster buys lollipop from large model

    Stable Diffusion 3.5 Result 3
    Imagine a world without eggs from large model

    Smaller Model: OFA-Sys Small-Stable-Diffusion

    While the smaller model generated images, the quality and creativity were noticeably lower. The images lacked the detail and originality seen with the larger model, confirming that smaller models are less capable of handling complex, creative prompts.

    Example Results from OFA-Sys Small-Stable-Diffusion:

    OFA-Sys Small Result 1
    Underwater wheeled bee from smaller model

    OFA-Sys Small Result 2
    Monster buys lollipop from smaller model

    OFA-Sys Small Result 3
    Imagine a world without eggs from smaller model

    Conclusion

    The experiment demonstrates that larger models, like Stable Diffusion 3.5, produce superior image quality and creativity. However, smaller models can still be useful in scenarios with hardware limitations, though they fall short in terms of detail and imagination when compared to their larger counterparts.


    How to Submit Your Work?

    This work must be done individually. The expected output is a private repository named gan-diffusion on https://gitlab.ec-lyon.fr. It must contain your notebook (or Python files) and a README.md file that briefly explains the successive steps of the project. Don't forget to add your teacher as a developer member of the project. The last commit is due before 11:59 pm on Wednesday, April 9th, 2025. Subsequent commits will not be considered.