Create TD3_Vision_Transformer_rendu.ipynb

2c8483f3 · Sucio · 98bb0c4f · 2c8483f3
Commit 2c8483f3 authored Jan 2, 2024 by Sucio
--- a/TD3_Vision_Transformer_rendu.ipynb
+++ b/TD3_Vision_Transformer_rendu.ipynb
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "dXwjVZp1c4wd"
+      },
+      "source": [
+        "# TD3: Vision Transformer (ViT)\n",
+        "\n",
+        "In this TD, you must modify this notebook to complete the code (**# TO DO comments**) and complete the **proposed experiments**. To do this,\n",
+        "\n",
+        "1. Fork this repository\n",
+        "2. Clone your forked repository on your local computer\n",
+        "3. Add your code and answer the questions\n",
+        "4. Commit and push regularly\n",
+        "\n",
+        "**The last commit is due on Sunday, 14th January 2024**. Later commits will not be taken into account.\n",
+        "\n",
+        "As the computation is heavy, particularly during training, we encourage you to use a GPU. If your laptop is not equipped with one, you may use one of these remote Jupyter servers, where you can select execution on GPU:\n",
+        "\n",
+        "1) [jupyter.mi90.ec-lyon.fr](https://jupyter.mi90.ec-lyon.fr/)\n",
+        "\n",
+        "This server is accessible from within the campus network. If you are outside, you need to use a VPN. Before executing the notebook, select the kernel \"Python PyTorch\" to run it on GPU and have access to the PyTorch module.\n",
+        "\n",
+        "2) [Google Colaboratory](https://colab.research.google.com/)\n",
+        "\n",
+        "Before executing the notebook, select execution on GPU: \"Runtime\" menu -> \"Change runtime type\" and select \"T4 GPU\"."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "6oHNXwKGc4wh"
+      },
+      "source": [
+        "### Goal of the TD\n",
+        "\n",
+        "Transformers were introduced by [Vaswani et al. in 2017](https://arxiv.org/abs/1706.03762) in the context of NLP (Natural Language Processing), and particularly for Machine Translation.\n",
+        "\n",
+        "Their great success has led to their adaptation to various applications, including image classification. Following this trend, [Dosovitskiy et al. in 2020](https://arxiv.org/abs/2010.11929) proposed Vision Transformers (ViT), which we will study and implement from scratch in this TD.\n",
+        "\n",
+        "The principle is illustrated in the following picture from this paper.\n",
+        "\n",
+        "![Vision Transformers](./figures/vit.png \"Vision Transformers\")\n",
+        "\n",
+        "First, an input image is “cut” into equally sized sub-images.\n",
+        "\n",
+        "Each such sub-image goes through a linear embedding. From then, each sub-image becomes a one-dimensional vector.\n",
+        "\n",
+        "A positional embedding is then added to these vectors (tokens). The positional embedding allows the network to know where each sub-image is positioned originally in the image. Without this information, the network would not be able to know where each such image would be placed, leading to potentially wrong predictions.\n",
+        "\n",
+        "These tokens are then passed, together with a special classification token, to the transformer encoder blocks, each of which is composed of: a Layer Normalization (LN), followed by a Multi-head Self Attention (MSA) and a residual connection; then a second LN, a Multi-Layer Perceptron (MLP), and again a residual connection. These blocks are connected back-to-back.\n",
+        "\n",
+        "Finally, a classification MLP head is used for the final classification only on the special classification token, which by the end of this process has global information about the picture.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "0NEnfO0Bc4wi"
+      },
+      "source": [
+        "### Implementation of the ViT model"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "kqJ6S1EFc4wj"
+      },
+      "source": [
+        "First, we import the required modules."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 115,
+      "metadata": {
+        "id": "wEmbaOA4Okuo"
+      },
+      "outputs": [],
+      "source": [
+        "# Import modules\n",
+        "import numpy as np\n",
+        "import torch\n",
+        "import torch.nn as nn\n",
+        "from torch.nn import CrossEntropyLoss\n",
+        "from torch.optim import Adam\n",
+        "from torch.utils.data import DataLoader\n",
+        "from torchvision.datasets.mnist import MNIST\n",
+        "from torchvision.transforms import ToTensor"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "eOXWp-N2c4wl"
+      },
+      "source": [
+        "For this first experiment, we will use the MNIST dataset, which contains 28x28 grayscale images of handwritten digits (0–9)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 116,
+      "metadata": {
+        "id": "crfmWV8uc4wm"
+      },
+      "outputs": [],
+      "source": [
+        "# Load data\n",
+        "transform = ToTensor()\n",
+        "\n",
+        "train_set = MNIST(\n",
+        "    root=\"datasets\", train=True, download=True, transform=transform\n",
+        ")\n",
+        "test_set = MNIST(\n",
+        "    root=\"datasets\", train=False, download=True, transform=transform\n",
+        ")\n",
+        "\n",
+        "train_loader = DataLoader(train_set, shuffle=True, batch_size=128)\n",
+        "test_loader = DataLoader(test_set, shuffle=False, batch_size=128)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import matplotlib.pyplot as plt\n",
+        "import random\n",
+        "from torchvision.transforms import ToPILImage\n",
+        "\n",
+        "to_pil = ToPILImage()\n",
+        "random_index = random.randint(0, len(train_set) - 1)\n",
+        "image, label = train_set[random_index]\n",
+        "\n",
+        "image_pil = to_pil(image)\n",
+        "plt.imshow(image_pil, cmap='gray')\n",
+        "plt.title(f\"Class: {label}\")\n",
+        "plt.axis('off')\n",
+        "plt.show()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 428
+        },
+        "id": "m6l69-vHAekF",
+        "outputId": "65e9e2e0-e5bf-449c-d0e6-d13c4280519c"
+      },
+      "execution_count": 117,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<Figure size 640x480 with 1 Axes>"
+            ],
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAGbCAYAAAAr/4yjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAARMUlEQVR4nO3cf4zXBf3A8dd5J+fB3am7BLzDQSBrcNlZ/ljzlCsNBdHcNKnW4kdquNGUiqvWMo0/XEpGjLIw2+wH9MO12gy1HVltSQPKfqBTM9AgmLemAaJAxL2/fzRfXwmUz/vj/SDv8djY9PN5vz7v14fNe/r+3N27piiKIgAgIo4b6gUAOHaIAgBJFABIogBAEgUAkigAkEQBgCQKACRRACCJAsesCRMmxLx584Z6DRhWRIFBt3nz5liwYEFMnDgxTjjhhGhubo7Ozs5Yvnx57N27d6jXGzK9vb2xYMGCaGtrixNOOCEmTJgQ11xzzVCvxTBTN9QLMLysWbMmrr766qivr485c+bEW9/61vjXv/4Vv/nNb6K7uzsee+yxuOuuu4Z6zUG3bdu26OzsjIiI66+/Ptra2mLHjh2xYcOGId6M4UYUGDRPP/10fOADH4jx48fHQw89FKeeemo+t3DhwvjrX/8aa9asGcINh86CBQuirq4uNm7cGC0tLUO9DsOYj48YNLfffnvs2bMnvvWtbx0ShJedfvrpceONN77q/PPPPx+LFy+OM844IxobG6O5uTlmzpwZf/rTnw47dsWKFdHe3h4jR46Mk08+Oc4+++xYvXp1Pv/CCy/EokWLYsKECVFfXx+jR4+O6dOnxyOPPHLI66xfvz5mzJgRJ554YowcOTK6urri4Ycfruj9bt26NZ544omjHvfEE0/EAw88EN3d3dHS0hL79u2LAwcOVHQO6G+iwKC57777YuLEiXHeeedVNb9ly5b46U9/Gpdddll8+ctfju7u7ti0aVN0dXXFjh078rhvfvObccMNN8TUqVPjK1/5SnzhC1+IM888M9avX5/HXH/99fH1r389rrrqqrjzzjtj8eLF0dDQEI8//nge89BDD8W0adNi9+7dcfPNN8ett94aO3fujAsvvLCij3XmzJkTU6ZMOepxa9eujYiIMWPGxEUXXRQNDQ3R0NAQM2fOjGeeeabE3xD0gwIGwa5du4qIKK644oqKZ8aPH1/MnTs3/33fvn3FwYMHDznm6aefLurr64slS5bkY1dccUXR3t7+mq994oknFgsXLnzV5/v6+orJkycXl1xySdHX15ePv/TSS8Wb3/zmYvr06Ufdv6urq6jkP7EbbrihiIiipaWlmDFjRvHDH/6wWLp0adHY2FhMmjSpePHFF4/6GtBffE+BQbF79+6IiGhqaqr6Nerr6/OfDx48GDt37ozGxsZ4y1vecsjHPieddFL8/e9/j40bN8Y555xzxNc66aSTYv369bFjx45obW097Pk//vGP8dRTT8XnPve5eO655w557qKLLorvfve70dfXF8cd9+oX27/61a8qel979uyJiIixY8fGmjVr8jXHjRsXH/zgB2P16tVx7bXXVvRa8Hr5+IhB0dzcHBH/+Sy/Wn19fbFs2bKYPHly1NfXx5ve9KY45ZRT4s9//nPs2rUrj/v0pz8djY2Nce6558bkyZNj4cKFh30f4Pbbb49HH300TjvttDj33HPjlltuiS1btuTzTz31VEREzJ07N0455ZRD/tx9992xf//+Q875ejQ0NERExOzZsw+JzNVXXx11dXWxbt26fjkPVEIUGBTNzc3R2toajz76aNWvceutt8YnPvGJmDZtWnzve9+Ln//859HT0xPt7e3R19eXx02ZMiWefPLJ+MEPfhDnn39+/PjHP47zzz8/br755jxm9uzZsWXLllixYkW0trbG0qVLo729PR544IGIiHy9pUuXRk9PzxH/NDY2Vv1eXunlK5UxY8Yc8nhtb
W20tLTEP//5z345D1RkqD+/Yvj46Ec/WkREsW7duoqO/+/vKXR0dBTvfve7Dzuura2t6OrqetXX2b9/fzFr1qyitra22Lt37xGP6e3tLdra2orOzs6iKIpiw4YNRUQUK1eurGjX1+PBBx8sIqK46aabDtu7tra2uO666wZ8B3iZKwUGzac+9akYNWpUXHvttdHb23vY85s3b47ly5e/6nxtbW0URXHIY/fee29s3779kMf++3sAI0aMiKlTp0ZRFHHgwIE4ePDgYR/9jB49OlpbW2P//v0REXHWWWfFpEmT4ktf+lJ+5v9K//jHP177zUblP5L6rne9K0aPHh2rVq2Kffv25eP33HNPHDx4MKZPn37U14D+4hvNDJpJkybF6tWr4/3vf39MmTLlkN9oXrduXdx7772vea+jyy67LJYsWRLz58+P8847LzZt2hSrVq2KiRMnHnLcxRdfHGPHjo3Ozs4YM2ZMPP744/HVr341Zs2aFU1NTbFz584YN25cvO9974uOjo5obGyMtWvXxsaNG+OOO+6IiIjjjjsu7r777pg5c2a0t7fH/Pnzo62tLbZv3x6//OUvo7m5Oe67777XfL9z5syJX//614eF7L/V19fH0qVLY+7cuTFt2rT48Ic/HFu3bo3ly5fHBRdcEFdeeWVlf8HQH4b4SoVh6C9/+Utx3XXXFRMmTChGjBhRNDU1FZ2dncWKFSuKffv25XFH+pHUT37yk8Wpp55aNDQ0FJ2dncVvf/vboqur65CPj1auXFlMmzataGlpKerr64tJkyYV3d3dxa5du4qi+M/HMt3d3UVHR0fR1NRUjBo1qujo6CjuvPPOw3b9wx/+UFx55ZX5WuPHjy9mz55d/OIXvzjq+6z0R1Jf9v3vf7/o6Ogo6uvrizFjxhQf+9jHit27d1c8D/2hpiiO8r8xAAwbvqcAQBIFAJIoAJBEAYAkCgAkUQAgVfzLazU1NQO5BwADrJLfQHClAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAECqG+oF4Fjxnve8p/RMT09P6ZlFixaVnomIWLFiRemZvr6+qs7F8OVKAYAkCgAkUQAgiQIASRQASKIAQBIFAJIoAJBEAYAkCgAkUQAgiQIASRQASDVFURQVHVhTM9C7QL+ZOnVq6Zn777+/9Mxpp51WeqZara2tpWd6e3sHYBP+V1Xy5d6VAgBJFABIogBAEgUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAUt1QLwBHc/zxx5eeue2220rPDObN7eBY5UoBgCQKACRRACCJAgBJFABIogBAEgUAkigAkEQBgCQKACRRACCJAgDJDfE45p111lmlZy699NIB2ATe+FwpAJBEAYAkCgAkUQAgiQIASRQASKIAQBIFAJIoAJBEAYAkCgAkUQAguSEeg6a2traquZtuuqmfNxlat912W1Vzzz//fD9vAodzpQBAEgUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkigAkEQBgFRTFEVR0YE1NQO9C29wI0aMqGpu7969/bxJ/3nmmWdKz3R2dlZ1rmeffbaqOXhZJV/uXSkAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkigAkEQBgCQKACRRACCJAgCpbqgX4H/T2WefXXpmyZIlA7BJ/9m2bVvpmYsvvrj0jLudcixzpQBAEgUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkigAkEQBgOSGeFTlwgsvLD1zySWXDMAm/eeee+4pPbN58+b+XwSGkCsFAJIoAJBEAYAkCgAkUQAgiQIASRQASKIAQBIFAJIoAJBEAYAkCgCkmqIoiooOrKkZ6F0YIu985ztLz/T09JSeGTlyZOmZam3atKn0zOWXX156Ztu2baVnYKhU8uXelQIAS
RQASKIAQBIFAJIoAJBEAYAkCgAkUQAgiQIASRQASKIAQBIFAFLdUC/A0Fu8eHHpmcG8ud2BAwdKz3zmM58pPePmduBKAYBXEAUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkigAkEQBgCQKACR3SX2DGT16dOmZt7/97QOwyeGKoqhq7uMf/3jpmQcffLCqc5U1duzY0jPz58+v6lyXXnppVXPHqq1bt1Y1t2zZstIzv/vd76o613DkSgGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAKmmqPAuZTU1NQO9C/3gRz/6UemZq666agA2OdwjjzxS1dw555zTz5sc2Tve8Y7SMz/5yU9Kz4wbN670DP/vhRdeKD1zxhlnlJ7Ztm1b6ZljXSVf7l0pAJBEAYAkCgAkUQAgiQIASRQASKIAQBIFAJIoAJBEAYAkCgAkUQAg1Q31AhzZmWeeWdXce9/73v5dpB/df//9g3auD33oQ6Vnli9fXnrm5JNPLj3D69PU1FR6ZsSIEQOwyRuTKwUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkigAkEQBgCQKACQ3xDtGjRo1qqq5448/vp83ObIDBw6Unlm2bFlV57rjjjtKz9x4442lZ2pqakrPDKZq/s57enpKz2zYsKH0zOzZs0vPTJ06tfQMA8+VAgBJFABIogBAEgUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkhviDYJqblL32c9+dgA26T8rV64sPTNr1qyqzrVo0aKq5gZDb29v6Znf//73VZ3ri1/8YumZxx57rPTMvHnzSs8M5s3tXnrppdIz//73vwdgkzcmVwoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEByl9RBMH78+NIzM2bMGIBN+k9jY2Ppme985zsDsMmR7d+/v/TM2rVrS89cc801pWdaWlpKz0REXHDBBaVnvvGNb5SeGaw7nu7Zs6equWru0Pu3v/2tqnMNR64UAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQaoqiKCo6sKZmoHd5wzr99NNLzzz55JMDsMnw8eKLL5aeefjhhwdgk8N1dnZWNTdq1Kh+3mRoffvb365q7iMf+Ug/bzJ8VPLl3pUCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSG+INgrFjx5ae2bhxY1Xnam1trWoOXo+f/exnpWfmzJlT1bl27dpV1RxuiAdASaIAQBIFAJIoAJBEAYAkCgAkUQAgiQIASRQASKIAQBIFAJIoAJDqhnqB4eDZZ58tPXPXXXdVda5bbrmlqjnemNatW1d6Zt68eaVnent7S8/s2bOn9AwDz5UCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQBSTVEURUUH1tQM9C68Ql1ddfcqfNvb3lZ65vOf/3zpmcsvv7z0zBvRqlWrSs9s3769qnN97WtfKz3z3HPPlZ7Zu3dv6Rn+N1Ty5d6VAgBJFABIogBAEgUAkigAkEQBgCQKACRRACCJAgBJFABIogBAEgUAkigAkNwlFWCYcJdUAEoRBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJFEAIIkCAEkUAEiiAEASBQCSKACQRAGAJAoAJ
FEAINVVemBRFAO5BwDHAFcKACRRACCJAgBJFABIogBAEgUAkigAkEQBgCQKAKT/A41pVEW0tQbgAAAAAElFTkSuQmCC\n"
+          },
+          "metadata": {}
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Af23-AO6c4wn"
+      },
+      "source": [
+        "### \"Patchification\"\n",
+        "The transformer encoder was originally developed with sequence data in mind, such as English sentences. However, as an image is not a sequence, we need to “sequencify” an image. To do this, we break it into multiple sub-images and map each sub-image to a vector.\n",
+        "\n",
+        "We do so by simply reshaping our input, which has size (N, C, H, W), where N is the batch size, C the number of channels and (H,W) the image dimension. In the case of MNIST, dimensions are (N, 1, 28, 28). The target dimension is (N, #Patches, Patch dimensionality), where the dimensionality of a patch is adjusted accordingly.\n",
+        "\n",
+        "In this example, we break each (1, 28, 28) image into 7x7 patches (hence, each of size 4x4). That is, we are going to obtain 7x7=49 sub-images out of a single image.\n",
+        "\n",
+        "Thus, we reshape input (N, 1, 28, 28) to (N, PxP, C x H/P x W/P) = (N, 49, 16)\n",
+        "\n",
+        "Notice that, while each patch is a picture of size 1x4x4, we flatten it to a 16-dimensional vector. Also, in this case, we only had a single color channel. If we had multiple color channels, those would also have been flattened into the vector."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 118,
+      "metadata": {
+        "id": "fxhHKKDFOoHp"
+      },
+      "outputs": [],
+      "source": [
+        "def patchify(images, n_patches):\n",
+        "    n, c, h, w = images.shape\n",
+        "\n",
+        "    assert h == w, \"Patchify method is implemented for square images only\"\n",
+        "\n",
+        "    patches = torch.zeros(n, n_patches**2, h * w * c // n_patches**2)\n",
+        "    patch_size = h // n_patches\n",
+        "\n",
+        "    for idx, image in enumerate(images):\n",
+        "        for i in range(n_patches):\n",
+        "            for j in range(n_patches):\n",
+        "                patch = image[\n",
+        "                    :,\n",
+        "                    i * patch_size : (i + 1) * patch_size,\n",
+        "                    j * patch_size : (j + 1) * patch_size,\n",
+        "                ]\n",
+        "                patches[idx, i * n_patches + j] = patch.flatten()\n",
+        "    return patches"
+      ]
+    },
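+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The nested loops above can also be expressed as a single reshape followed by a permutation of axes. The cell below is a minimal sketch of that idea, not part of the original TD: `patchify_vectorized` is a hypothetical helper name, and it assumes square images whose side is divisible by `n_patches`. The final assertion checks that it agrees with the loop-based `patchify` on a random batch."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import torch\n",
+        "\n",
+        "\n",
+        "def patchify_vectorized(images, n_patches):  # hypothetical helper, not part of the TD\n",
+        "    n, c, h, w = images.shape\n",
+        "    assert h == w, 'square images only'\n",
+        "    assert h % n_patches == 0, 'image side must be divisible by n_patches'\n",
+        "    p = h // n_patches  # patch side length\n",
+        "    # (N, C, H, W) -> (N, C, n_patches, p, n_patches, p)\n",
+        "    x = images.reshape(n, c, n_patches, p, n_patches, p)\n",
+        "    # Bring the two patch-grid axes forward: (N, n_patches, n_patches, C, p, p)\n",
+        "    x = x.permute(0, 2, 4, 1, 3, 5)\n",
+        "    # Flatten the grid into a sequence and each patch into a vector\n",
+        "    return x.reshape(n, n_patches**2, c * p * p)\n",
+        "\n",
+        "\n",
+        "# Sanity check against the loop-based patchify defined above\n",
+        "sample = torch.rand(2, 1, 28, 28)\n",
+        "assert torch.allclose(patchify(sample, 7), patchify_vectorized(sample, 7))"
+      ]
+    },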
+    {
+      "cell_type": "code",
+      "source": [
+        "def display_patches(patches, n_patches, image_size):\n",
+        "    fig, axes = plt.subplots(n_patches, n_patches, figsize=(8, 8))\n",
+        "    for i in range(n_patches):\n",
+        "        for j in range(n_patches):\n",
+        "            ax = axes[i, j]\n",
+        "            patch = patches[i * n_patches + j].reshape(image_size)\n",
+        "            ax.imshow(patch[0], cmap=\"gray\")  # Use channel 0 for grayscale\n",
+        "            ax.axis(\"off\")\n",
+        "    plt.show()\n",
+        "\n",
+        "n_patches = 4\n",
+        "image_plat = image.unsqueeze(0)\n",
+        "patches = patchify(image_plat, n_patches)\n",
+        "\n",
+        "# Display the patches\n",
+        "display_patches(patches[0], n_patches, (1, 7, 7))\n"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 653
+        },
+        "id": "_7IX7zJpBJah",
+        "outputId": "a1365c4e-98a7-4ea4-a89b-a741ea219a40"
+      },
+      "execution_count": 119,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<Figure size 800x800 with 16 Axes>"
+            ],
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAn8AAAJ8CAYAAACP2sdVAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQxUlEQVR4nO3dMYic1QKG4ZnrroKmkCAiwdhopVEIsdBGEANKUFEkhWgjYm3sxSjaiJBGi3QKEQsRMY0GxE4sgtionQFDUNbCgBgIIeJ/ayGXjHpmZve+z1MP3x42h9mX02Q+TdM0AwAg4T/rPgAAAKsj/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQsrHoB+fz+TLPwQ616v8gxj3kSlZ5D91BrsR3IdvBovfQyx8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEbKz7AAA7ycGDB4dtff7558O2jhw5Mmzr7bffHrY1m81mf/7559A94N/x8gcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELm0zRNC31wPl/2WdiBFrw+w7iH/z/uvPPOYVvff//9sK2rOXv27LCtvXv3Dtsaac+ePUP3fvnll6F725HvQraDRe+hlz8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQjbWfQBg59jc3By29dZbbw3bWqW9e/eu+wgA/4qXPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAELKx7gMAO8eBAweGbR06dGjYFgCL8/IHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCNtZ9AGC5rrnmmmFbR48eHbbFWG+++eawrfPnzw/bArYfL38AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBkPk3TtO5DAACwGl7+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQMjGoh+cz+fLPAc71DRNK/157uHfd+211w7bunTp0rAtZrMff/xx2Nb9998/bGtra2vYVoXvQraDRe+hlz8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABCyse4DAH917733Dt174403hu7VnTt3btjWwYMHh21tbW0N2wL+v3n5AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDA
AgRfwAAIRvrPgDwVw899NDQvYcffnjoXt277747bOvMmTPDtgAW5eUPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEzKdpmhb64Hy+7LOwAy14fYbZrvfwvvvuG7b1xRdfDNuazWaz66+/fujeKN9+++2wrbvvvnvY1tXcdtttw7bOnTs3bIv18l3IdrDoPfTyBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAEDIfJqmaaEPzufLPgs70ILXZ5jteg8//vjjYVtPPvnksK3RLl++PGzr8ccfH7b12WefDdu6mu16B1kv34VsB4veQy9/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgZGPdB4B1ufnmm4dt7d+/f9jWaNM0Ddt68cUXh22dOnVq2NZOdcsttwzbev7554dtHTp0aNgWy/HBBx8M2zp27Niwra+//nrYFsvj5Q8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAITMp2maFvrgfL7ss7ADLXh9hhl5Dz/66KNhW0899dSwrdG++eabYVsHDhwYtjXSKu/hyN/ByZMnh23deuutw7Zo+f3334dt3XXXXcO2zp07N2yrYtHvQi9/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgZD5N07TuQwAAsBpe/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQjYW/eB8Pl/mOdihpmla6c/bv3//sK3Tp08P29rc3By2Ndrrr78+bOuVV14ZtvXss88O2zpx4sSwras5f/78sK3du3cP24Lt4I477hi2debMmWFbFYv+TfbyBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQjbWfQD4O2644YZhW5ubm8O2Rrp8+fLQvWPHjm3LrSNHjgzbWqXdu3ev+whLN/oOnjp1atjWV199NWzrmWeeGba1b9++YVuwbF7+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAyMa6D8Dq3X777es+wj/28ssvr/sIS3f8+PGhe48++uiwrZdeemnYFrPZ1tbWsK3Tp08P2zp69OiwrdlsNvvuu++Gbb3wwgvDtvbt2zdsaye7ePHisK0//vhj2BbL4+UPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEbKz7AKzeDz/8sO4j/GOPPPLIuo+wdLt27Rq6d+LEiaF7o1y6dGnY1nXXXTds62pOnjw5bOvpp58etnXTTTcN23rwwQeHbc1mY+/gvn37hm3tZBcuXBi2dfz48WFbZ8+eHbbF8nj5AwAIEX8AA
CHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBkPk3TtNAH5/Nln4UVWfCfHJbqwoULw7Z27do1bOtqPv3005X9rL/jgQceGLa1yt8n/8x77703bOu5554btsV6Lfr33csfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAImU/TNC30wfl82WdhRT788MNhW4cPHx62tYiff/552NaePXuGbQE7zyeffDJs64knnhi2tYgbb7xx2NZvv/02bIv1WjDpvPwBAJSIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AICQ+TRN00IfnM+XfRZ2oAWvzzBHjx4dtvXaa68N2wL+ty+//HLY1uHDh4dtbW1tDdta9Xehv8lcyaL30MsfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAImU/TNK37EAAArIaXPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABCysegH5/P5Ms/BDjVN00p/3ubm5rCte+65Z9jWq6++OmzrscceG7bF9vb+++8P2/rpp5+Gbb3zzjvDtmaz2ezXX38dtnXx4sVhWyOt+rvQ32SuZNF76OUPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEzKdpmhb64Hy+7LOwAy14fYZxD7mSVd5Dd5Ar8V3IdrDoPfTyBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQsQfAECI+AMACBF/AAAh4g8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACBE/AEAhIg/AIAQ8QcAECL+AABCxB8AQIj4AwAIEX8AACHiDwAgRPwBAISIPwCAEPEHABAi/gAAQubTNE3rPgQAAKvh5Q8AIET8AQCEiD8AgBDxBwAQIv4AAELEHwBAiPgDAAgRfwAAIeIPACDkv4dFfAX0SOj9AAAAAElFTkSuQmCC\n"
+          },
+          "metadata": {}
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5mWw5mwDc4wo"
+      },
+      "source": [
+        "### Linear embedding\n",
+        "\n",
+        "Now that we have our flattened patches, we can map each of them through a Linear mapping. While each patch was a 4x4=16 dimensional vector, the linear mapping can map to any arbitrary vector size. Thus, we will use for this a parameter `hidden_d` for \"hidden dimension\".\n",
+        "\n",
+        "In this example, we will use a hidden dimension of 8, but in principle, any number can be put here. We will thus be mapping each 16-dimensional patch to an 8-dimensional patch.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
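+        "def features_embedding(patch, applatisseur, class_embaded):\n",
+        "    # Flatten to (n_patches, patch_dim) and use the argument instead of a global variable\n",
+        "    embedded_patches = applatisseur(patch.reshape(-1, patch.shape[-1]))\n",
+        "    # Prepend the class token to the embedded patches\n",
+        "    return torch.cat((class_embaded, embedded_patches), dim=0)"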
+      ],
+      "metadata": {
+        "id": "RhglaVPb59Ll"
+      },
+      "execution_count": 120,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "hidden_d = 8\n",
+        "n_tokens = patches.shape[1]  # number of patches of the single image in `patches`\n",
+        "flattened_patches = patches.view(n_tokens, -1)  # (n_patches**2, patch_dim)\n",
+        "linear_layer = nn.Linear(flattened_patches.size(1), hidden_d)\n",
+        "classe_embedded = torch.rand((1, hidden_d))  # random class token for this demo\n",
+        "\n",
+        "features_emb = features_embedding(flattened_patches, linear_layer, classe_embedded)\n",
+        "print(features_emb)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 0
+        },
+        "id": "ayIaewry62c-",
+        "outputId": "5c13d8d8-87f9-4353-ca93-08cf8aad095d"
+      },
+      "execution_count": 121,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "tensor([[ 0.0886,  0.1783,  0.7162,  0.6681,  0.3244,  0.4643,  0.1293,  0.5872],\n",
+            "        [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],\n",
+            "        [ 0.0683, -0.0921,  0.0700, -0.1282, -0.1309,  0.1637, -0.0534,  0.2885],\n",
+            "        [ 0.0614, -0.1821,  0.1938,  0.2289, -0.5469, -0.1450,  0.2406, -0.0501],\n",
+            "        [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],\n",
+            "        [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],\n",
+            "        [-0.2211,  0.7500,  0.1590, -0.0509, -0.2274, -0.3068, -0.3688,  0.3081],\n",
+            "        [-0.1137,  0.3398,  0.2008, -0.2193, -0.2279, -0.2736,  0.0489,  0.1655],\n",
+            "        [ 0.1418,  0.0633,  0.2066,  0.1069, -0.1668, -0.1283,  0.0725,  0.1086],\n",
+            "        [ 0.0876, -0.1036, -0.0232,  0.0168, -0.0834,  0.0253, -0.1026,  0.1292],\n",
+            "        [-0.1190,  0.6631,  0.4508,  0.1273, -0.8950, -0.2657, -0.1744,  0.1429],\n",
+            "        [-0.0139,  1.1310,  0.6801, -0.0420, -0.8295, -0.3431, -0.3398,  0.1292],\n",
+            "        [ 0.2009, -0.1070,  0.3390, -0.1786, -0.2455,  0.1952, -0.2296,  0.0448],\n",
+            "        [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],\n",
+            "        [ 0.0114,  0.2627,  0.1608,  0.1347, -0.0517, -0.1514, -0.2966,  0.0293],\n",
+            "        [ 0.0699,  0.1150,  0.0422,  0.0093, -0.1582,  0.0118, -0.2497,  0.1767],\n",
+            "        [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932]],\n",
+            "       grad_fn=<CatBackward0>)\n"
+          ]
+        }
+      ]
+    },
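+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The cell above embeds the patches of a single image. A batched variant could look like the following sketch. `PatchEmbedding` is a hypothetical module name that does not come from the TD, and storing the class token as a learnable `nn.Parameter` is an assumption made here in line with the ViT paper, not something this notebook does. The shapes in the check (16 patches of dimension 49, hidden dimension 8) mirror the example above."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import torch\n",
+        "import torch.nn as nn\n",
+        "\n",
+        "\n",
+        "class PatchEmbedding(nn.Module):  # hypothetical name, for illustration only\n",
+        "    def __init__(self, patch_dim, hidden_d):\n",
+        "        super().__init__()\n",
+        "        self.linear = nn.Linear(patch_dim, hidden_d)\n",
+        "        # Assumption: one learnable class token shared by all images\n",
+        "        self.class_token = nn.Parameter(torch.rand(1, 1, hidden_d))\n",
+        "\n",
+        "    def forward(self, patches):\n",
+        "        # patches: (N, n_tokens, patch_dim)\n",
+        "        tokens = self.linear(patches)  # (N, n_tokens, hidden_d)\n",
+        "        cls = self.class_token.expand(patches.shape[0], -1, -1)  # (N, 1, hidden_d)\n",
+        "        return torch.cat((cls, tokens), dim=1)  # (N, n_tokens + 1, hidden_d)\n",
+        "\n",
+        "\n",
+        "embedding = PatchEmbedding(patch_dim=49, hidden_d=8)\n",
+        "out = embedding(torch.rand(2, 16, 49))\n",
+        "assert out.shape == (2, 17, 8)"
+      ]
+    },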
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jyRw5vrcc4wp"
+      },
+      "source": [
+        "### Positional encoding\n",
+        "\n",
+        "Positional encoding allows the model to understand where each patch would be placed in the original image. While it is theoretically possible to learn such positional embeddings, previous work by [Vaswani et al. in 2017](https://arxiv.org/abs/1706.03762) suggests that we can just add sine and cosine waves.\n",
+        "\n",
+        "In particular, positional encoding adds high-frequency values to the first dimensions and lower-frequency values to the latter dimensions.\n",
+        "\n",
+        "In each sequence, for token i we add to its j-th coordinate the following value:\n",
+        "\n",
+        "$$p_{i,j}=\\left\\{\\begin{matrix}\n",
+        "\\sin\\left(\\frac{i}{10000^{\\frac{j}{d_{emb-dim}}}}\\right) & \\text{if } j \\text{ is even}\\\\\n",
+        "\\cos\\left(\\frac{i}{10000^{\\frac{j-1}{d_{emb-dim}}}}\\right) & \\text{if } j \\text{ is odd}\n",
+        "\\end{matrix}\\right.$$\n",
+        "\n",
+        "![Positional encoding](./figures/positional_encoding.png \"Positional encoding\").\n",
+        "\n",
+        "This positional embedding is a function of the number of elements in the sequence and the dimensionality of each element. Thus, it is always a 2-dimensional tensor or “rectangle”.\n",
+        "\n",
+        "Here is a simple function that, given the number of tokens and the dimensionality of each of them, outputs a matrix where each coordinate (i,j) is the value to be added to token i in dimension j.\n",
+        "\n",
+        "This positional encoding is added to our model after the linear mapping and the addition of the class token."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 122,
+      "metadata": {
+        "id": "bOaI_5SrO4vB"
+      },
+      "outputs": [],
+      "source": [
+        "def get_positional_embeddings(sequence_length, d):\n",
+        "    result = torch.ones(sequence_length, d)\n",
+        "    for i in range(sequence_length):\n",
+        "        for j in range(d):\n",
+        "            result[i][j] = (\n",
+        "                np.sin(i / (10000 ** (j / d)))\n",
+        "                if j % 2 == 0\n",
+        "                else np.cos(i / (10000 ** ((j - 1) / d)))\n",
+        "            )\n",
+        "    return result"
+      ]
+    },
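+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The double loop above fills the matrix one entry at a time. The following cell is a minimal sketch of the same computation using broadcasting. `get_positional_embeddings_vectorized` is a hypothetical name introduced here for illustration. The assertion compares it with `get_positional_embeddings` up to a small tolerance, since the loop version computes each value with NumPy in double precision before storing it in a float32 tensor."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import torch\n",
+        "\n",
+        "\n",
+        "def get_positional_embeddings_vectorized(sequence_length, d):  # hypothetical helper\n",
+        "    i = torch.arange(sequence_length, dtype=torch.float32).unsqueeze(1)  # (L, 1)\n",
+        "    j = torch.arange(d)  # (d,)\n",
+        "    # Even j uses exponent j/d; odd j reuses the exponent of the preceding even index\n",
+        "    exponent = (j - j % 2).to(torch.float32) / d\n",
+        "    angles = i / torch.pow(10000.0, exponent)  # broadcast to (L, d)\n",
+        "    # Sine on even coordinates, cosine on odd coordinates\n",
+        "    return torch.where(j % 2 == 0, torch.sin(angles), torch.cos(angles))\n",
+        "\n",
+        "\n",
+        "# Sanity check against the loop-based version defined above\n",
+        "reference = get_positional_embeddings(17, 8)\n",
+        "assert torch.allclose(reference, get_positional_embeddings_vectorized(17, 8), atol=1e-4)"
+      ]
+    },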
+    {
+      "cell_type": "code",
+      "source": [
+        "positional_emb=get_positional_embeddings(17,8)\n",
+        "print(positional_emb)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 0
+        },
+        "id": "PHvqYlLZFp6o",
+        "outputId": "583e9e21-81c9-481f-c235-de5180e88f98"
+      },
+      "execution_count": 123,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,\n",
+            "          1.0000e+00,  0.0000e+00,  1.0000e+00],\n",
+            "        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,\n",
+            "          9.9995e-01,  1.0000e-03,  1.0000e+00],\n",
+            "        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,\n",
+            "          9.9980e-01,  2.0000e-03,  1.0000e+00],\n",
+            "        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9996e-02,\n",
+            "          9.9955e-01,  3.0000e-03,  1.0000e+00],\n",
+            "        [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,\n",
+            "          9.9920e-01,  4.0000e-03,  9.9999e-01],\n",
+            "        [-9.5892e-01,  2.8366e-01,  4.7943e-01,  8.7758e-01,  4.9979e-02,\n",
+            "          9.9875e-01,  5.0000e-03,  9.9999e-01],\n",
+            "        [-2.7942e-01,  9.6017e-01,  5.6464e-01,  8.2534e-01,  5.9964e-02,\n",
+            "          9.9820e-01,  6.0000e-03,  9.9998e-01],\n",
+            "        [ 6.5699e-01,  7.5390e-01,  6.4422e-01,  7.6484e-01,  6.9943e-02,\n",
+            "          9.9755e-01,  6.9999e-03,  9.9998e-01],\n",
+            "        [ 9.8936e-01, -1.4550e-01,  7.1736e-01,  6.9671e-01,  7.9915e-02,\n",
+            "          9.9680e-01,  7.9999e-03,  9.9997e-01],\n",
+            "        [ 4.1212e-01, -9.1113e-01,  7.8333e-01,  6.2161e-01,  8.9879e-02,\n",
+            "          9.9595e-01,  8.9999e-03,  9.9996e-01],\n",
+            "        [-5.4402e-01, -8.3907e-01,  8.4147e-01,  5.4030e-01,  9.9833e-02,\n",
+            "          9.9500e-01,  9.9998e-03,  9.9995e-01],\n",
+            "        [-9.9999e-01,  4.4257e-03,  8.9121e-01,  4.5360e-01,  1.0978e-01,\n",
+            "          9.9396e-01,  1.1000e-02,  9.9994e-01],\n",
+            "        [-5.3657e-01,  8.4385e-01,  9.3204e-01,  3.6236e-01,  1.1971e-01,\n",
+            "          9.9281e-01,  1.2000e-02,  9.9993e-01],\n",
+            "        [ 4.2017e-01,  9.0745e-01,  9.6356e-01,  2.6750e-01,  1.2963e-01,\n",
+            "          9.9156e-01,  1.3000e-02,  9.9992e-01],\n",
+            "        [ 9.9061e-01,  1.3674e-01,  9.8545e-01,  1.6997e-01,  1.3954e-01,\n",
+            "          9.9022e-01,  1.4000e-02,  9.9990e-01],\n",
+            "        [ 6.5029e-01, -7.5969e-01,  9.9749e-01,  7.0737e-02,  1.4944e-01,\n",
+            "          9.8877e-01,  1.4999e-02,  9.9989e-01],\n",
+            "        [-2.8790e-01, -9.5766e-01,  9.9957e-01, -2.9200e-02,  1.5932e-01,\n",
+            "          9.8723e-01,  1.5999e-02,  9.9987e-01]])\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "input_vect=features_emb+positional_emb\n",
+        "print(input_vect.shape)\n",
+        "print(input_vect)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 0
+        },
+        "id": "L6sGNOSl-r8x",
+        "outputId": "726a15e7-958f-4273-f4eb-33867a8b314c"
+      },
+      "execution_count": 124,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "torch.Size([17, 8])\n",
+            "tensor([[ 0.0886,  1.1783,  0.7162,  1.6681,  0.3244,  1.4643,  0.1293,  1.5872],\n",
+            "        [ 0.9194,  0.4419,  0.0692,  1.0396, -0.0693,  1.0204, -0.1156,  1.0932],\n",
+            "        [ 0.9775, -0.5082,  0.2687,  0.8518, -0.1109,  1.1635, -0.0514,  1.2885],\n",
+            "        [ 0.2025, -1.1721,  0.4894,  1.1843, -0.5169,  0.8545,  0.2436,  0.9499],\n",
+            "        [-0.6789, -0.7520,  0.3588,  0.9656, -0.0393,  1.0197, -0.1126,  1.0932],\n",
+            "        [-0.8810,  0.1853,  0.4488,  0.9222, -0.0293,  1.0192, -0.1116,  1.0932],\n",
+            "        [-0.5005,  1.7102,  0.7236,  0.7744, -0.1674,  0.6914, -0.3628,  1.3081],\n",
+            "        [ 0.5433,  1.0937,  0.8450,  0.5456, -0.1580,  0.7240,  0.0559,  1.1655],\n",
+            "        [ 1.1312, -0.0822,  0.9239,  0.8036, -0.0868,  0.8685,  0.0805,  1.1085],\n",
+            "        [ 0.4997, -1.0147,  0.7601,  0.6384,  0.0065,  1.0213, -0.0936,  1.1292],\n",
+            "        [-0.6631, -0.1760,  1.2923,  0.6676, -0.7952,  0.7293, -0.1644,  1.1428],\n",
+            "        [-1.0139,  1.1354,  1.5714,  0.4116, -0.7198,  0.6509, -0.3288,  1.1291],\n",
+            "        [-0.3357,  0.7369,  1.2710,  0.1837, -0.1258,  1.1880, -0.2176,  1.0448],\n",
+            "        [ 0.4981,  0.8091,  0.9329,  0.3121,  0.0504,  1.0121, -0.1036,  1.0931],\n",
+            "        [ 1.0020,  0.3994,  1.1463,  0.3046,  0.0878,  0.8389, -0.2827,  1.0292],\n",
+            "        [ 0.7202, -0.6447,  1.0397,  0.0801, -0.0088,  1.0006, -0.2347,  1.1766],\n",
+            "        [-0.2100, -1.0560,  0.9689,  0.0154,  0.0800,  1.0077, -0.1006,  1.0931]],\n",
+            "       grad_fn=<AddBackward0>)\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "gSgUqiFbc4wp"
+      },
+      "source": [
+        "### Multi-Head Self-Attention\n",
+        "\n",
+        "The objective is now that, for a single image, each patch has to be updated based on some similarity measure with the other patches. We do so by linearly mapping each patch (that is now an 8-dimensional vector in our example) to 3 distinct vectors: q, k, and v (query, key, value).\n",
+        "\n",
+        "Then, for a single patch, we are going to compute the dot product between its q vector with all of the k vectors, divide by the square root of the dimensionality of these vectors (sqrt(8)), softmax these so-called attention cues, and finally multiply each attention cue with the v vectors associated with the different k vectors and sum all up.\n",
+        "\n",
+        "In this way, each patch assumes a new value that is based on its similarity (after the linear mapping to q, k, and v) with other patches. This whole procedure, however, is carried out H times on H sub-vectors of our current 8-dimensional patches, where H is the number of Heads.\n",
+        "\n",
+        "Once all results are obtained, they are concatenated together. Finally, the result is passed through a linear layer (for good measure).\n",
+        "\n",
+        "The intuitive idea behind attention is that it allows modeling the relationship between the inputs. What makes a ‘0’ a zero are not the individual pixel values, but how they relate to each other.\n",
+        "\n",
+        "This is implemented in the MSA class:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 125,
+      "metadata": {
+        "id": "CIoyR-QsOruC"
+      },
+      "outputs": [],
+      "source": [
+        "class MSA(nn.Module):\n",
+        "    def __init__(self, d, n_heads=2):\n",
+        "        super().__init__()\n",
+        "        self.d = d\n",
+        "        self.n_heads = n_heads\n",
+        "\n",
+        "        assert d % n_heads == 0, f\"Can't divide dimension {d} into {n_heads} heads\"\n",
+        "\n",
+        "        d_head = int(d / n_heads)\n",
+        "        self.q_mappings = nn.ModuleList(\n",
+        "            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]\n",
+        "        )\n",
+        "        self.k_mappings = nn.ModuleList(\n",
+        "            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]\n",
+        "        )\n",
+        "        self.v_mappings = nn.ModuleList(\n",
+        "            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]\n",
+        "        )\n",
+        "        self.d_head = d_head\n",
+        "        self.softmax = nn.Softmax(dim=-1)\n",
+        "\n",
+        "    def forward(self, sequences):\n",
+        "        # Sequences has shape (N, seq_length, token_dim)\n",
+        "        # We go into shape    (N, seq_length, n_heads, token_dim / n_heads)\n",
+        "        # And come back to    (N, seq_length, token_dim)  (through concatenation)\n",
+        "        result = []\n",
+        "        for sequence in sequences:\n",
+        "            seq_result = []\n",
+        "            for head in range(self.n_heads):\n",
+        "                q_mapping = self.q_mappings[head]\n",
+        "                k_mapping = self.k_mappings[head]\n",
+        "                v_mapping = self.v_mappings[head]\n",
+        "\n",
+        "                seq = sequence[:, head * self.d_head : (head + 1) * self.d_head]\n",
+        "                q, k, v = q_mapping(seq), k_mapping(seq), v_mapping(seq)\n",
+        "\n",
+        "                attention = torch.matmul(self.softmax(torch.matmul(q, k.t())/np.sqrt(int(self.d / self.n_heads))),v)\n",
+        "\n",
+        "                seq_result.append(attention)\n",
+        "\n",
+        "            result.append(torch.hstack(seq_result))\n",
+        "        return torch.cat([torch.unsqueeze(r, dim=0) for r in result])"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "vomINDMpc4wq"
+      },
+      "source": [
+        "Notice that, for each head, we create distinct Q, K, and V mapping functions (square matrices of size 4x4 in our example).\n",
+        "\n",
+        "$$\\left\\{\\begin{matrix}\n",
+        "Q=W^Q.I\n",
+        "\\\\\n",
+        "K=W^K.I\n",
+        "\\\\\n",
+        "V=W^V.I\n",
+        "\\\\\n",
+        "A=\\frac{K^T.Q}{\\sqrt{d_{head}}}\n",
+        "\\\\\n",
+        "\\hat{A}=softmax(A)\n",
+        "\\\\\n",
+        "O=V.\\hat{A}\n",
+        "\\end{matrix}\\right.$$\n",
+        "\n",
+        "Since our inputs will be sequences of size (N, 50, 8), and we only use 2 heads, we will at some point have an (N, 50, 2, 4) tensor, use a nn.Linear(4, 4) module on it, and then come back, after concatenation, to an (N, 50, 8) tensor.\n",
+        "\n",
+        "Also notice that using loops is not the most efficient way to compute the multi-head self-attention, but it makes the code much clearer for learning."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "gG4UKx-Ac4wr"
+      },
+      "source": [
+        "### Transformer Encoder Blocks\n",
+        "\n",
+        "The next step is to create the transformer encoder block class.\n",
+        "\n",
+        "Layer normalization (LN) is a popular block that, given an input, subtracts its mean and divides by the standard deviation. It is applied to the last dimension only. We can thus make each 8-dimensional token of our 50x8 matrices (each matrix representing a single sequence) have mean 0 and std 1. After we run our (N, 50, 8) tensor through LN, we still get the same dimensionality.\n",
+        "\n",
+        "We will also use residual connections, which consist in adding the original input to the result of some computation. Intuitively, this allows a network to become more powerful while also preserving the set of possible functions that the model can approximate.\n",
+        "\n",
+        "We will add a residual connection that will add our original (N, 50, 8) tensor to the (N, 50, 8) obtained after LN and MSA.\n",
+        "\n",
+        "Next, we add a second residual connection between what we already have and what we get after passing the current tensor through another LN and an MLP. The MLP is composed of two layers, where the hidden layer is typically four times as big (this is a parameter).\n",
+        "\n",
+        "The transformer encoder block class (which will be a component of the future ViT class) is thus as follows:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 126,
+      "metadata": {
+        "id": "sv8wnTx4OwP7"
+      },
+      "outputs": [],
+      "source": [
+        "class ViTBlock(nn.Module):\n",
+        "    def __init__(self, hidden_d, n_heads, mlp_ratio=4):\n",
+        "        super().__init__()\n",
+        "        self.hidden_d = hidden_d\n",
+        "        self.n_heads = n_heads\n",
+        "\n",
+        "        self.norm1 = nn.LayerNorm(hidden_d)\n",
+        "        self.mhsa = MSA(hidden_d, n_heads)\n",
+        "        self.norm2 = nn.LayerNorm(hidden_d)\n",
+        "        self.mlp = nn.Sequential(\n",
+        "            nn.Linear(hidden_d, mlp_ratio * hidden_d),\n",
+        "            nn.GELU(),\n",
+        "            nn.Linear(mlp_ratio * hidden_d, hidden_d),\n",
+        "        )\n",
+        "\n",
+        "    def forward(self, x):\n",
+        "        m1=x\n",
+        "        x = self.norm1(x)\n",
+        "        x = m1+self.mhsa(x)\n",
+        "        m2=x\n",
+        "        x = self.norm2(x)\n",
+        "        x = m2+self.mlp(x)\n",
+        "\n",
+        "        return x"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "58UObL4hc4wr"
+      },
+      "source": [
+        "### ViT model\n",
+        "\n",
+        "Now that the encoder block is ready, we just need to insert it in our bigger ViT model which is responsible for patchifying before the transformer blocks, and carrying out the classification after.\n",
+        "\n",
+        "To help classification, we will add a **classification token** to the input sequence. This is a special token whose role is to capture information about the other tokens, which happens in the MSA block. Once information about all other tokens is present in it, we will be able to classify the image using only this special token. The initial value of the special token (the one fed to the transformer encoder) is a parameter of the model that needs to be learned.\n",
+        "\n",
+        "Thus, we will add a parameter to our model and convert our (N, 49, 8) tokens tensor to an (N, 50, 8) tensor (we add the special token to each sequence).\n",
+        "\n",
+        "We could have an arbitrary number of transformer blocks. In this example, to keep it simple, we will use only 2. We also add a parameter that sets how many heads each encoder block will use.\n",
+        "\n",
+        "Finally, we can extract just the classification token (first token) out of our N sequences, and use each token to get N classifications.\n",
+        "\n",
+        "Since we decided that each token is an 8-dimensional vector, and since we have 10 possible digits, we can implement the classification MLP as a simple 8x10 matrix, activated with the SoftMax function.\n",
+        "\n",
+        "The output of our model should be an (N, 10) tensor."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 127,
+      "metadata": {
+        "id": "8Na9BTgnOy3o"
+      },
+      "outputs": [],
+      "source": [
+        "class ViT(nn.Module):\n",
+        "    def __init__(self, chw, n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10):\n",
+        "        # Super constructor\n",
+        "        super().__init__()\n",
+        "\n",
+        "        # Attributes\n",
+        "        self.chw = chw  # ( C , H , W )\n",
+        "        self.n_patches = n_patches\n",
+        "        self.n_blocks = n_blocks\n",
+        "        self.n_heads = n_heads\n",
+        "        self.hidden_d = hidden_d\n",
+        "\n",
+        "        # Input and patches sizes\n",
+        "        assert (\n",
+        "            chw[1] % n_patches == 0\n",
+        "        ), \"Input shape not entirely divisible by number of patches\"\n",
+        "        assert (\n",
+        "            chw[2] % n_patches == 0\n",
+        "        ), \"Input shape not entirely divisible by number of patches\"\n",
+        "        self.patch_size = (chw[1] // n_patches, chw[2] // n_patches)\n",
+        "\n",
+        "        # 1) Linear mapper\n",
+        "        self.input_d = int(chw[0] * self.patch_size[0] * self.patch_size[1])\n",
+        "        self.linear_mapper = nn.Linear(self.input_d, self.hidden_d)\n",
+        "\n",
+        "        # 2) Learnable classification token\n",
+        "        self.class_token = nn.Parameter(torch.rand(1, self.hidden_d))\n",
+        "\n",
+        "        # 3) Positional embedding\n",
+        "        self.register_buffer(\n",
+        "            \"positional_embeddings\",\n",
+        "            get_positional_embeddings(n_patches**2 + 1, hidden_d),\n",
+        "            persistent=False,\n",
+        "        )\n",
+        "\n",
+        "        # 4) Transformer encoder blocks\n",
+        "        self.blocks = nn.ModuleList(\n",
+        "            [ViTBlock(hidden_d, n_heads) for _ in range(n_blocks)]\n",
+        "        )\n",
+        "\n",
+        "        # 5) Classification MLP\n",
+        "        self.mlp = nn.Sequential(nn.Linear(self.hidden_d, out_d), nn.Softmax(dim=-1))\n",
+        "\n",
+        "    def forward(self, images):\n",
+        "\n",
+        "        # Dividing images into patches\n",
+        "        n, c, h, w = images.shape\n",
+        "        patches = patchify(images,self.n_patches)\n",
+        "        patches=patches.to(device)\n",
+        "\n",
+        "        # Running linear layer tokenization:\n",
+        "        # map the vector corresponding to each patch to the hidden size dimension\n",
+        "        tokens = self.linear_mapper(patches)\n",
+        "\n",
+        "        # Adding classification token to the tokens\n",
+        "        tokens = torch.cat((self.class_token.expand(n, 1, -1), tokens), dim=1)\n",
+        "\n",
+        "        # Adding positional embedding (once)\n",
+        "        out = tokens + self.positional_embeddings.repeat(n, 1, 1)\n",
+        "\n",
+        "        # Transformer Blocks\n",
+        "        for block in self.blocks:\n",
+        "            out = block(out)\n",
+        "\n",
+        "        # Getting the classification token only\n",
+        "        out = out[:, 0, :]\n",
+        "        # Map to output dimension, output category distribution\n",
+        "        out = self.mlp(out)\n",
+        "\n",
+        "        return out"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Ot_AvvyKc4ws"
+      },
+      "source": [
+        "### ViT training\n",
+        "\n",
+        "The ViT model being built, the next step is to train it on the MNIST dataset."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qf344oqhc4ws"
+      },
+      "source": [
+        "First, we initialize the model and the hyperparameters."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 128,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 0
+        },
+        "id": "FrHHUmi7c4wt",
+        "outputId": "331a8bef-d409-41cb-b95f-1f3b58d0f1f7"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Using device:  cuda (Tesla T4)\n"
+          ]
+        }
+      ],
+      "source": [
+        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+        "print(\n",
+        "    \"Using device: \",\n",
+        "    device,\n",
+        "    f\"({torch.cuda.get_device_name(device)})\" if torch.cuda.is_available() else \"\",\n",
+        ")\n",
+        "\n",
+        "model = ViT(\n",
+        "    (1, 28, 28), n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10\n",
+        ").to(device)\n",
+        "\n",
+        "N_EPOCHS = 5\n",
+        "LR = 0.005"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "RwNrVNzKc4wt"
+      },
+      "source": [
+        "Training of the ViT model:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 132,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 0
+        },
+        "id": "iz8ojGdZc4wu",
+        "outputId": "822854bb-cafd-476f-cbaf-999692e1dc22"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": []
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Epoch 1/5 loss: 1.83\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": []
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Epoch 2/5 loss: 1.78\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": []
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Epoch 3/5 loss: 1.74\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": []
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Epoch 4/5 loss: 1.70\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "                                                            "
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Epoch 5/5 loss: 1.68\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "\r"
+          ]
+        }
+      ],
+      "source": [
+        "from tqdm import tqdm  # Import tqdm for the progress bar\n",
+        "\n",
+        "optimizer = Adam(model.parameters(), lr=LR)\n",
+        "criterion = CrossEntropyLoss()\n",
+        "\n",
+        "for epoch in range(N_EPOCHS):\n",
+        "    train_loss = 0.0\n",
+        "    model.train()\n",
+        "\n",
+        "    # Use tqdm to display a progress bar\n",
+        "    for batch in tqdm(train_loader, desc=f\"Epoch {epoch + 1}/{N_EPOCHS}\", leave=False):\n",
+        "        x, y = batch\n",
+        "        x, y = x.to(device), y.to(device)\n",
+        "        optimizer.zero_grad()\n",
+        "        y_hat = model(x)\n",
+        "        loss = criterion(y_hat, y)\n",
+        "        loss.backward()\n",
+        "        optimizer.step()\n",
+        "        train_loss += loss.detach().cpu().item() / len(train_loader)\n",
+        "    print(f\"Epoch {epoch + 1}/{N_EPOCHS} loss: {train_loss:.2f}\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4jpbVBZXc4wu"
+      },
+      "source": [
+        "### ViT test\n",
+        "\n",
+        "Finally, let's test the trained model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 137,
+      "metadata": {
+        "id": "h55dVGGhOaPI",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 0
+        },
+        "outputId": "db5397d7-2402-4f30-c56a-d8e714a89b94"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "                                                                  "
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Test loss: 1.71\n",
+            "Test accuracy: 77.85%\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "\r"
+          ]
+        }
+      ],
+      "source": [
+        "with torch.no_grad():\n",
+        "    correct, total = 0, 0\n",
+        "    test_loss = 0.0\n",
+        "    model.eval()\n",
+        "    for batch in tqdm(test_loader, desc=\"Testing\", leave=False):\n",
+        "        x, y = batch\n",
+        "        x, y = x.to(device), y.to(device)\n",
+        "        y_hat = model(x)\n",
+        "        loss = criterion(y_hat, y)\n",
+        "        _, pred = torch.max(y_hat, 1)\n",
+        "        correct_tensor = pred.eq(y.data.view_as(pred))\n",
+        "        test_loss += loss.detach().cpu().item() / len(test_loader)\n",
+        "        # Count every sample in the batch (len(batch) is 2: inputs and labels)\n",
+        "        correct += correct_tensor.sum().item()\n",
+        "        total += len(y)\n",
+        "\n",
+        "\n",
+        "    print(f\"Test loss: {test_loss:.2f}\")\n",
+        "    print(f\"Test accuracy: {correct / total * 100:.2f}%\")\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# track test loss\n",
+        "test_loss = 0.0\n",
+        "class_correct_NET = list(0.0 for i in range(10))\n",
+        "class_total_NET = list(0.0 for i in range(10))\n",
+        "\n",
+        "import torch.optim as optim\n",
+        "\n",
+        "criterion = nn.CrossEntropyLoss()  # specify loss function\n",
+        "optimizer = optim.SGD(model.parameters(), lr=0.01)  # specify optimizer\n",
+        "\n",
+        "\n",
+        "model.eval()\n",
+        "# iterate over test data\n",
+        "for data, target in test_loader:\n",
+        "    # move tensors to the same device as the model\n",
+        "    data, target = data.to(device), target.to(device)\n",
+        "    # forward pass: compute predicted outputs by passing inputs to the model\n",
+        "    output = model(data)\n",
+        "    # calculate the batch loss\n",
+        "    loss = criterion(output, target)\n",
+        "    # update test loss\n",
+        "    test_loss += loss.item() * data.size(0)\n",
+        "    # convert output probabilities to predicted class\n",
+        "    _, pred = torch.max(output, 1)\n",
+        "    # compare predictions to true label\n",
+        "    correct_tensor = pred.eq(target.data.view_as(pred))\n",
+        "    # .cpu() is a no-op for tensors that already live on the CPU\n",
+        "    correct = np.squeeze(correct_tensor.cpu().numpy())\n",
+        "    # calculate test accuracy for each object class\n",
+        "    for i in range(target.size(0)):\n",
+        "        label = target.data[i]\n",
+        "        class_correct_NET[label] += correct[i].item()\n",
+        "        class_total_NET[label] += 1\n",
+        "\n",
+        "# average test loss\n",
+        "test_loss = test_loss / len(test_loader.dataset)\n",
+        "print(\"Test Loss: {:.6f}\\n\".format(test_loss))\n",
+        "\n",
+        "for i in range(10):\n",
+        "    if class_total_NET[i] > 0:\n",
+        "        print(\n",
+        "            \"Test Accuracy of %5s: %2d%% (%2d/%2d)\"\n",
+        "            % (\n",
+        "                classes[i],\n",
+        "                100 * class_correct_NET[i] / class_total_NET[i],\n",
+        "                np.sum(class_correct_NET[i]),\n",
+        "                np.sum(class_total_NET[i]),\n",
+        "            )\n",
+        "        )\n",
+        "    else:\n",
+        "        print(\"Test Accuracy of %5s: N/A (no test examples)\" % (classes[i]))\n",
+        "\n",
+        "print(\n",
+        "    \"\\nTest Accuracy (Overall): %2d%% (%2d/%2d)\"\n",
+        "    % (\n",
+        "        100.0 * np.sum(class_correct_NET) / np.sum(class_total_NET),\n",
+        "        np.sum(class_correct_NET),\n",
+        "        np.sum(class_total_NET),\n",
+        "    )\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "mVrKCa4vjWfy"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Jb76N2pyc4wv"
+      },
+      "source": [
+        "## Further experiments\n",
+        "\n",
+        "1. Adapt the code to apply the ViT model on CIFAR dataset.\n",
+        "2. Make use of a validation set to evaluate overfitting.\n",
+        "3. Evaluate the model with a dimension of 16 for the tokens and 4 encoder blocks."
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "gpuType": "T4",
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python PyTorch 1.7.0",
+      "language": "python",
+      "name": "pytorch-1.7.0"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.5"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file
+%% Cell type:markdown id: tags:
+
+# TD3: Vision Transformer (ViT)
+
+In this TD, you must modify this notebook to complete the code (**# TO DO comments**) and complete the **proposed experiments**. To do this,
+
+1. Fork this repository
+2. Clone your forked repository on your local computer
+3. Add your code and answer the questions
+4. Commit and push regularly
+
+**The last commit is due on Sunday, 14th January 2024**. Later commits will not be taken into account.
+
+As the computation is heavy, particularly during training, we encourage you to use a GPU. If your laptop is not equipped with one, you may use one of these remote Jupyter servers, where you can select execution on GPU:
+
+1) [jupyter.mi90.ec-lyon.fr](https://jupyter.mi90.ec-lyon.fr/)
+
+This server is accessible within the campus network. If outside, you need to use a VPN. Before executing the notebook, select the kernel "Python PyTorch" to run it on GPU and have access to the PyTorch module.
+
+2) [Google Colaboratory](https://colab.research.google.com/)
+
+Before executing the notebook, select the execution on GPU : "Exécution" Menu -> "Modifier le type d'exécution" and select "T4 GPU".
+
+%% Cell type:markdown id: tags:
+
+### Goal of the TD
+
+Transformers were introduced by [Vaswani et al. in 2017](https://arxiv.org/abs/1706.03762) in the context of NLP (Natural Language Processing), and particularly for Machine Translation.
+
+Its great success has led to its adaptation to various applications, including image classification. In this trend, [Dosovitskiy et al. in 2020](https://arxiv.org/abs/2010.11929) have proposed Vision Transformers (ViT) that we will study and implement from scratch in this TD.
+
+The principle is illustrated in the following picture from this paper.
+
+![Vision Transformers](./figures/vit.png "Vision Transformers")
+
+First, an input image is “cut” into sub-images equally sized.
+
+Each such sub-image goes through a linear embedding. From then, each sub-image becomes a one-dimensional vector.
+
+A positional embedding is then added to these vectors (tokens). The positional embedding allows the network to know where each sub-image is positioned originally in the image. Without this information, the network would not be able to know where each such image would be placed, leading to potentially wrong predictions.
+
+These tokens are then passed, together with a special classification token, to the transformer encoder blocks, where each block is composed of a Layer Normalization (LN), followed by a Multi-head Self Attention (MSA) and a residual connection, then a second LN, a Multi-Layer Perceptron (MLP), and again a residual connection. These blocks are connected back-to-back.
+
+Finally, a classification MLP head is used for the final classification only on the special classification token, which by the end of this process has global information about the picture.
+
+%% Cell type:markdown id: tags:
+
+### Implementation of the ViT model
+
+%% Cell type:markdown id: tags:
+
+First, we import the required modules.
+
+%% Cell type:code id: tags:
+
+``` python
+# Import modules
+import numpy as np
+import torch
+import torch.nn as nn
+from torch.nn import CrossEntropyLoss
+from torch.optim import Adam
+from torch.utils.data import DataLoader
+from torchvision.datasets.mnist import MNIST
+from torchvision.transforms import ToTensor
+```
+
+%% Cell type:markdown id: tags:
+
+For this first experiment, we will use the MNIST dataset, which contains 28x28 grayscale images of hand-written digits ([0–9]).
+
+%% Cell type:code id: tags:
+
+``` python
+# Load data
+transform = ToTensor()
+
+train_set = MNIST(
+    root="datasets", train=True, download=True, transform=transform
+)
+test_set = MNIST(
+    root="datasets", train=False, download=True, transform=transform
+)
+
+train_loader = DataLoader(train_set, shuffle=True, batch_size=128)
+test_loader = DataLoader(test_set, shuffle=False, batch_size=128)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+import matplotlib.pyplot as plt
+import random
+from torchvision.transforms import ToPILImage
+
+to_pil = ToPILImage()
+random_index = random.randint(0, len(train_set) - 1)
+image, label = train_set[random_index]
+
+image_pil = to_pil(image)
+plt.imshow(image_pil, cmap='gray')
+plt.title(f"Class: {label}")
+plt.axis('off')
+plt.show()
+```
+
+%% Output
+
+
+
+%% Cell type:markdown id: tags:
+
+### "Patchification"
+The transformer encoder was originally developed with sequence data in mind, such as English sentences. However, as an image is not a sequence, we need to “sequencify” an image. To do this, we break it into multiple sub-images and map each sub-image to a vector.
+
+We do so by simply reshaping our input, which has size (N, C, H, W), where N is the batch size, C the number of channels and (H,W) the image dimension. In the case of MNIST, dimensions are (N, 1, 28, 28). The target dimension is (N, #Patches, Patch dimensionality), where the dimensionality of a patch is adjusted accordingly.
+
+In this example, we break each (1, 28, 28) into 7x7 patches (hence, each of size 4x4). That is, we are going to obtain 7x7=49 sub-images out of a single image.
+
+Thus, we reshape input (N, 1, 28, 28) to (N, PxP, C x H/P x W/P) = (N, 49, 16)
+
+Notice that, while each patch is a picture of size 1x4x4, we flatten it to a 16-dimensional vector. Also, in this case, we only had a single color channel. If we had multiple color channels, those would also have been flattened into the vector.
+
+%% Cell type:code id: tags:
+
+``` python
+def patchify(images, n_patches):
+    n, c, h, w = images.shape
+
+    assert h == w, "Patchify method is implemented for square images only"
+
+    patches = torch.zeros(n, n_patches**2, h * w * c // n_patches**2)
+    patch_size = h // n_patches
+
+    for idx, image in enumerate(images):
+        for i in range(n_patches):
+            for j in range(n_patches):
+                patch = image[
+                    :,
+                    i * patch_size : (i + 1) * patch_size,
+                    j * patch_size : (j + 1) * patch_size,
+                ]
+                patches[idx, i * n_patches + j] = patch.flatten()
+    return patches
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def display_patches(patches, n_patches, image_size):
+    fig, axes = plt.subplots(n_patches, n_patches, figsize=(8, 8))
+    for i in range(n_patches):
+        for j in range(n_patches):
+            ax = axes[i, j]
+            patch = patches[i * n_patches + j].reshape(image_size)
+            ax.imshow(patch[0], cmap="gray")  # Use channel 0 for grayscale
+            ax.axis("off")
+    plt.show()
+
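+# Note: 4 patches per side here (16 patches of 7x7 pixels), not the 7x7 grid of 4x4 patches described above
+n_patches = 4
+image_plat = image.unsqueeze(0)  # add a batch dimension: (1, 1, 28, 28)
+patches = patchify(image_plat, n_patches)  # (1, 16, 49)
+
+# Display the patches
+display_patches(patches[0], n_patches, (1, 7, 7))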
+```
+
+%% Output
+
+
+
+%% Cell type:markdown id: tags:
+
+### Linear embedding
+
+Now that we have our flattened patches, we can map each of them through a linear mapping. While each patch is a 4x4=16-dimensional vector, the linear mapping can map to any arbitrary vector size. We therefore introduce a parameter `hidden_d`, for "hidden dimension".
+
+In this example, we will use a hidden dimension of 8, but in principle, any number can be put here. We will thus be mapping each 16-dimensional patch to an 8-dimensional token.
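
As a rough illustration of what the linear mapping computes, the sketch below applies y = W x + b in plain Python and counts the parameters of a 16-to-8 mapping. The helpers `linear` and `n_parameters` are hypothetical names introduced here; the weight layout (one row per output feature) mirrors how `nn.Linear` stores its weight matrix.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with weight stored as (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def n_parameters(in_features, out_features):
    # One weight per (input, output) pair plus one bias per output
    return in_features * out_features + out_features


# Toy mapping from a 4-dimensional "patch" to a 2-dimensional token
patch = [1.0, 2.0, 3.0, 4.0]
weight = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
bias = [0.5, 0.0]
token = linear(patch, weight, bias)
print(token)                # [1.5, 2.5]
print(n_parameters(16, 8))  # 136
```

The same weights are shared by every patch, so the parameter count does not depend on the number of patches.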
+
+%% Cell type:code id: tags:
+
+``` python
+def features_embedding(flat_patches, applatisseur, class_embaded):
+    # Project each flattened patch to the hidden dimension, then prepend the class token
+    embedded_patches = applatisseur(flat_patches)
+    return torch.cat((class_embaded, embedded_patches), dim=0)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+hidden_d = 8
+n_tokens = patches.shape[1]  # number of patches (16 here), each with 7*7 = 49 values
+flattened_patches = patches.view(n_tokens, -1)  # (16, 49)
+linear_layer = nn.Linear(flattened_patches.size(1), hidden_d)
+classe_embedded = torch.rand((1, hidden_d))  # stand-in for the learnable class token
+
+features_emb = features_embedding(flattened_patches, linear_layer, classe_embedded)
+print(features_emb)
+```
+
+%% Output
+
+    tensor([[ 0.0886,  0.1783,  0.7162,  0.6681,  0.3244,  0.4643,  0.1293,  0.5872],
+            [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],
+            [ 0.0683, -0.0921,  0.0700, -0.1282, -0.1309,  0.1637, -0.0534,  0.2885],
+            [ 0.0614, -0.1821,  0.1938,  0.2289, -0.5469, -0.1450,  0.2406, -0.0501],
+            [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],
+            [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],
+            [-0.2211,  0.7500,  0.1590, -0.0509, -0.2274, -0.3068, -0.3688,  0.3081],
+            [-0.1137,  0.3398,  0.2008, -0.2193, -0.2279, -0.2736,  0.0489,  0.1655],
+            [ 0.1418,  0.0633,  0.2066,  0.1069, -0.1668, -0.1283,  0.0725,  0.1086],
+            [ 0.0876, -0.1036, -0.0232,  0.0168, -0.0834,  0.0253, -0.1026,  0.1292],
+            [-0.1190,  0.6631,  0.4508,  0.1273, -0.8950, -0.2657, -0.1744,  0.1429],
+            [-0.0139,  1.1310,  0.6801, -0.0420, -0.8295, -0.3431, -0.3398,  0.1292],
+            [ 0.2009, -0.1070,  0.3390, -0.1786, -0.2455,  0.1952, -0.2296,  0.0448],
+            [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932],
+            [ 0.0114,  0.2627,  0.1608,  0.1347, -0.0517, -0.1514, -0.2966,  0.0293],
+            [ 0.0699,  0.1150,  0.0422,  0.0093, -0.1582,  0.0118, -0.2497,  0.1767],
+            [ 0.0779, -0.0984, -0.0306,  0.0446, -0.0793,  0.0205, -0.1166,  0.0932]],
+           grad_fn=<CatBackward0>)
+
+%% Cell type:markdown id: tags:
+
+### Positional encoding
+
+Positional encoding allows the model to understand where each patch would be placed in the original image. While it is theoretically possible to learn such positional embeddings, previous work by [Vaswani et al. in 2017](https://arxiv.org/abs/1706.03762) suggests that we can just add sine and cosine waves.
+
+In particular, positional encoding adds high-frequency values to the first dimensions and lower-frequency values to the later dimensions.
+
+In each sequence, for token i we add to its j-th coordinate (counting from 0) the following value:
+
+$$p_{i,j}=\left\{\begin{matrix}
+\sin\left(\frac{i}{10000^{\frac{j}{d_{emb-dim}}}}\right) & \text{if } j \text{ is even}\\
+\cos\left(\frac{i}{10000^{\frac{j-1}{d_{emb-dim}}}}\right) & \text{if } j \text{ is odd}
+\end{matrix}\right.$$
+
+![Positional encoding](./figures/positional_encoding.png "Positional encoding").
+
+This positional embedding is a function of the number of elements in the sequence and the dimensionality of each element. Thus, it is always a 2-dimensional tensor or “rectangle”.
+
+Here is a simple function that, given the number of tokens and the dimensionality of each of them, outputs a matrix where each coordinate (i,j) is the value to be added to token i in dimension j.
+
+This positional encoding is added to our model after the linear mapping and the addition of the class token.
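
The formula can be checked by hand with a small standard-library sketch. The function name `positional_encoding` is an assumption made for this example; it follows the same even/odd rule as the notebook's `get_positional_embeddings`, only with nested lists instead of a tensor.

```python
import math


def positional_encoding(sequence_length, d):
    """Return a sequence_length x d table (nested lists) of sinusoidal encodings."""
    table = []
    for i in range(sequence_length):
        row = []
        for j in range(d):
            if j % 2 == 0:
                row.append(math.sin(i / (10000 ** (j / d))))
            else:
                # Odd coordinates reuse the frequency of the previous even one
                row.append(math.cos(i / (10000 ** ((j - 1) / d))))
        table.append(row)
    return table


pe = positional_encoding(17, 8)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in pe[1][:4]])
```

Token 0 always receives the pattern (0, 1, 0, 1, ...), since sin(0) = 0 and cos(0) = 1, and the first coordinates of token 1 are sin(1), cos(1), sin(0.1), cos(0.1), which should agree with the second row of the tensor printed by `get_positional_embeddings(17, 8)`.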
+
+%% Cell type:code id: tags:
+
+``` python
+def get_positional_embeddings(sequence_length, d):
+    result = torch.ones(sequence_length, d)
+    for i in range(sequence_length):
+        for j in range(d):
+            result[i][j] = (
+                np.sin(i / (10000 ** (j / d)))
+                if j % 2 == 0
+                else np.cos(i / (10000 ** ((j - 1) / d)))
+            )
+    return result
+```
+
+%% Cell type:code id: tags:
+
+``` python
+positional_emb=get_positional_embeddings(17,8)
+print(positional_emb)
+```
+
+%% Output
+
+    tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
+              1.0000e+00,  0.0000e+00,  1.0000e+00],
+            [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
+              9.9995e-01,  1.0000e-03,  1.0000e+00],
+            [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
+              9.9980e-01,  2.0000e-03,  1.0000e+00],
+            [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9996e-02,
+              9.9955e-01,  3.0000e-03,  1.0000e+00],
+            [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,
+              9.9920e-01,  4.0000e-03,  9.9999e-01],
+            [-9.5892e-01,  2.8366e-01,  4.7943e-01,  8.7758e-01,  4.9979e-02,
+              9.9875e-01,  5.0000e-03,  9.9999e-01],
+            [-2.7942e-01,  9.6017e-01,  5.6464e-01,  8.2534e-01,  5.9964e-02,
+              9.9820e-01,  6.0000e-03,  9.9998e-01],
+            [ 6.5699e-01,  7.5390e-01,  6.4422e-01,  7.6484e-01,  6.9943e-02,
+              9.9755e-01,  6.9999e-03,  9.9998e-01],
+            [ 9.8936e-01, -1.4550e-01,  7.1736e-01,  6.9671e-01,  7.9915e-02,
+              9.9680e-01,  7.9999e-03,  9.9997e-01],
+            [ 4.1212e-01, -9.1113e-01,  7.8333e-01,  6.2161e-01,  8.9879e-02,
+              9.9595e-01,  8.9999e-03,  9.9996e-01],
+            [-5.4402e-01, -8.3907e-01,  8.4147e-01,  5.4030e-01,  9.9833e-02,
+              9.9500e-01,  9.9998e-03,  9.9995e-01],
+            [-9.9999e-01,  4.4257e-03,  8.9121e-01,  4.5360e-01,  1.0978e-01,
+              9.9396e-01,  1.1000e-02,  9.9994e-01],
+            [-5.3657e-01,  8.4385e-01,  9.3204e-01,  3.6236e-01,  1.1971e-01,
+              9.9281e-01,  1.2000e-02,  9.9993e-01],
+            [ 4.2017e-01,  9.0745e-01,  9.6356e-01,  2.6750e-01,  1.2963e-01,
+              9.9156e-01,  1.3000e-02,  9.9992e-01],
+            [ 9.9061e-01,  1.3674e-01,  9.8545e-01,  1.6997e-01,  1.3954e-01,
+              9.9022e-01,  1.4000e-02,  9.9990e-01],
+            [ 6.5029e-01, -7.5969e-01,  9.9749e-01,  7.0737e-02,  1.4944e-01,
+              9.8877e-01,  1.4999e-02,  9.9989e-01],
+            [-2.8790e-01, -9.5766e-01,  9.9957e-01, -2.9200e-02,  1.5932e-01,
+              9.8723e-01,  1.5999e-02,  9.9987e-01]])
+
+%% Cell type:code id: tags:
+
+``` python
+input_vect=features_emb+positional_emb
+print(input_vect.shape)
+print(input_vect)
+```
+
+%% Output
+
+    torch.Size([17, 8])
+    tensor([[ 0.0886,  1.1783,  0.7162,  1.6681,  0.3244,  1.4643,  0.1293,  1.5872],
+            [ 0.9194,  0.4419,  0.0692,  1.0396, -0.0693,  1.0204, -0.1156,  1.0932],
+            [ 0.9775, -0.5082,  0.2687,  0.8518, -0.1109,  1.1635, -0.0514,  1.2885],
+            [ 0.2025, -1.1721,  0.4894,  1.1843, -0.5169,  0.8545,  0.2436,  0.9499],
+            [-0.6789, -0.7520,  0.3588,  0.9656, -0.0393,  1.0197, -0.1126,  1.0932],
+            [-0.8810,  0.1853,  0.4488,  0.9222, -0.0293,  1.0192, -0.1116,  1.0932],
+            [-0.5005,  1.7102,  0.7236,  0.7744, -0.1674,  0.6914, -0.3628,  1.3081],
+            [ 0.5433,  1.0937,  0.8450,  0.5456, -0.1580,  0.7240,  0.0559,  1.1655],
+            [ 1.1312, -0.0822,  0.9239,  0.8036, -0.0868,  0.8685,  0.0805,  1.1085],
+            [ 0.4997, -1.0147,  0.7601,  0.6384,  0.0065,  1.0213, -0.0936,  1.1292],
+            [-0.6631, -0.1760,  1.2923,  0.6676, -0.7952,  0.7293, -0.1644,  1.1428],
+            [-1.0139,  1.1354,  1.5714,  0.4116, -0.7198,  0.6509, -0.3288,  1.1291],
+            [-0.3357,  0.7369,  1.2710,  0.1837, -0.1258,  1.1880, -0.2176,  1.0448],
+            [ 0.4981,  0.8091,  0.9329,  0.3121,  0.0504,  1.0121, -0.1036,  1.0931],
+            [ 1.0020,  0.3994,  1.1463,  0.3046,  0.0878,  0.8389, -0.2827,  1.0292],
+            [ 0.7202, -0.6447,  1.0397,  0.0801, -0.0088,  1.0006, -0.2347,  1.1766],
+            [-0.2100, -1.0560,  0.9689,  0.0154,  0.0800,  1.0077, -0.1006,  1.0931]],
+           grad_fn=<AddBackward0>)
+
+%% Cell type:markdown id: tags:
+
+### Multi-Head Self-Attention
+
+The objective is now that, for a single image, each patch has to be updated based on some similarity measure with the other patches. We do so by linearly mapping each patch (that is now an 8-dimensional vector in our example) to 3 distinct vectors: q, k, and v (query, key, value).
+
+Then, for a single patch, we compute the dot product between its q vector and all of the k vectors, divide by the square root of the dimensionality of these vectors (sqrt(8) with a single head), apply a softmax to these so-called attention cues, and finally multiply each attention cue by the v vector associated with the corresponding k vector and sum everything up.
+
+In this way, each patch assumes a new value that is based on its similarity (after the linear mapping to q, k, and v) with other patches. This whole procedure, however, is carried out H times on H sub-vectors of our current 8-dimensional patches, where H is the number of heads. Each head therefore works on vectors of dimension 8/H and scales its dot products by sqrt(8/H).
+
+Once all results are obtained, they are concatenated together. Finally, the result is usually passed through a linear layer (the `MSA` class below omits this last projection).
+
+The intuitive idea behind attention is that it allows modeling the relationship between the inputs. What makes a ‘0’ a zero are not the individual pixel values, but how they relate to each other.
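
The steps above can be traced on a tiny example with one head and two tokens. This is a plain-Python sketch under stated assumptions: the helpers `softmax` and `attention` are illustrative names, and the q, k, v vectors are given directly instead of being produced by learned linear mappings.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """q, k, v: lists of token vectors (seq_len x d). Returns (weights, outputs)."""
    d = len(q[0])
    weights = []
    for qi in q:
        # Scaled dot product of this query with every key
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights.append(softmax(scores))
    out_dim = len(v[0])
    # Each output token is the attention-weighted sum of the value vectors
    outputs = [
        [sum(w * vj[t] for w, vj in zip(row, v)) for t in range(out_dim)]
        for row in weights
    ]
    return weights, outputs


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
weights, outputs = attention(q, k, v)
print(weights[0])  # token 0 attends mostly to itself
print(outputs[0])
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors.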
+
+This is implemented in the MSA class:
+
+%% Cell type:code id: tags:
+
+``` python
+class MSA(nn.Module):
+    def __init__(self, d, n_heads=2):
+        super().__init__()
+        self.d = d
+        self.n_heads = n_heads
+
+        assert d % n_heads == 0, f"Can't divide dimension {d} into {n_heads} heads"
+
+        d_head = int(d / n_heads)
+        self.q_mappings = nn.ModuleList(
+            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
+        )
+        self.k_mappings = nn.ModuleList(
+            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
+        )
+        self.v_mappings = nn.ModuleList(
+            [nn.Linear(d_head, d_head) for _ in range(self.n_heads)]
+        )
+        self.d_head = d_head
+        self.softmax = nn.Softmax(dim=-1)
+
+    def forward(self, sequences):
+        # Sequences has shape (N, seq_length, token_dim)
+        # We go into shape    (N, seq_length, n_heads, token_dim / n_heads)
+        # And come back to    (N, seq_length, token_dim)  (through concatenation)
+        result = []
+        for sequence in sequences:
+            seq_result = []
+            for head in range(self.n_heads):
+                q_mapping = self.q_mappings[head]
+                k_mapping = self.k_mappings[head]
+                v_mapping = self.v_mappings[head]
+
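+                # Slice out this head's sub-vector of each token: (seq_length, d_head)
+                seq = sequence[:, head * self.d_head : (head + 1) * self.d_head]
+                q, k, v = q_mapping(seq), k_mapping(seq), v_mapping(seq)
+
+                # Scaled dot-product attention: softmax(q k^T / sqrt(d_head)) v
+                scores = torch.matmul(q, k.t()) / (self.d_head**0.5)
+                attention = torch.matmul(self.softmax(scores), v)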
+
+                seq_result.append(attention)
+
+            result.append(torch.hstack(seq_result))
+        return torch.cat([torch.unsqueeze(r, dim=0) for r in result])
+```
+
+%% Cell type:markdown id: tags:
+
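+Notice that, for each head, we create distinct Q, K, and V mapping functions (square matrices of size $d_{head} \times d_{head}$, i.e. 4x4 in our example). With the tokens of one head stored as the columns of $I$ and the softmax taken over each column of $A$ (the code uses the equivalent row-vector convention):
+
+$$\left\{\begin{matrix}
+Q=W^Q \cdot I
+\\
+K=W^K \cdot I
+\\
+V=W^V \cdot I
+\\
+A=\frac{K^T \cdot Q}{\sqrt{d_{head}}}
+\\
+\hat{A}=\mathrm{softmax}(A)
+\\
+O=V \cdot \hat{A}
+\end{matrix}\right.$$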
+
+Since our inputs will be sequences of size (N, 50, 8), and we only use 2 heads, we will at some point have an (N, 50, 2, 4) tensor, use a nn.Linear(4, 4) module on it, and then come back, after concatenation, to an (N, 50, 8) tensor.
+
+Also notice that using loops is not the most efficient way to compute the multi-head self-attention, but it makes the code much clearer for learning.
+
+%% Cell type:markdown id: tags:
+
+### Transformer Encoder Blocks
+
+The next step is to create the transformer encoder block class.
+
+Layer normalization (LN) is a popular block that, given an input, subtracts its mean and divides by its standard deviation (followed, in `nn.LayerNorm`, by a learnable scale and shift). It is applied to the last dimension only, so each of the 50 token vectors in a sequence is normalized independently to mean 0 and std 1. After we run our (N, 50, 8) tensor through LN, we still get the same dimensionality.
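
A minimal sketch of the normalization step alone, in plain Python, may help. The function name `layer_norm` is an assumption made for this example; it uses the biased variance and a small `eps` as `nn.LayerNorm` does by default, but leaves out the learnable scale and shift.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)  # biased variance
    return [(x - mean) / math.sqrt(var + eps) for x in token]


# Two tokens with very different scales
sequence = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
normalized = [layer_norm(token) for token in sequence]
for token in normalized:
    print([round(x, 3) for x in token])
```

Each token is normalized on its own, so the two rows end up on the same scale regardless of their original magnitudes.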
+
+We will also use residual connections, which add the original input to the result of some computation. Intuitively, this allows a network to become more powerful while preserving the set of functions that the model can already approximate.
+
+We will add a residual connection that will add our original (N, 50, 8) tensor to the (N, 50, 8) obtained after LN and MSA.
+
+The next step is to add a second residual connection between what we already have and what we get after passing the current tensor through another LN and an MLP. The MLP is composed of two layers, where the hidden layer is typically four times as wide as the input (this ratio is a parameter, `mlp_ratio`).
+
+The transformer encoder block class (which will be a component of the future ViT class) is thus as follows:
+
+%% Cell type:code id: tags:
+
+``` python
+class ViTBlock(nn.Module):
+    def __init__(self, hidden_d, n_heads, mlp_ratio=4):
+        super().__init__()
+        self.hidden_d = hidden_d
+        self.n_heads = n_heads
+
+        self.norm1 = nn.LayerNorm(hidden_d)
+        self.mhsa = MSA(hidden_d, n_heads)
+        self.norm2 = nn.LayerNorm(hidden_d)
+        self.mlp = nn.Sequential(
+            nn.Linear(hidden_d, mlp_ratio * hidden_d),
+            nn.GELU(),
+            nn.Linear(mlp_ratio * hidden_d, hidden_d),
+        )
+
+    def forward(self, x):
+        # Pre-norm residual connections: x + MSA(LN(x)), then x + MLP(LN(x))
+        x = x + self.mhsa(self.norm1(x))
+        x = x + self.mlp(self.norm2(x))
+        return x
+```
+
+%% Cell type:markdown id: tags:
+
+### ViT model
+
+Now that the encoder block is ready, we just need to insert it in our bigger ViT model which is responsible for patchifying before the transformer blocks, and carrying out the classification after.
+
+To help classification, we will add a **classification token** to the input sequence. This is a special token whose role is to capture information about the other tokens, which happens in the MSA blocks. Once information about all other tokens is present there, we will be able to classify the image using only this special token. The initial value of the special token (the one fed to the transformer encoder) is a parameter of the model that needs to be learned.
+
+Thus, we will add a parameter to our model and convert our (N, 49, 8) tokens tensor to an (N, 50, 8) tensor (we add the special token to each sequence).
+
+We could have an arbitrary number of transformer blocks. In this example, to keep it simple, we will use only 2. We also add a parameter that sets how many heads each encoder block uses.
+
+Finally, we can extract just the classification token (first token) out of our N sequences, and use each token to get N classifications.
+
+Since we decided that each token is an 8-dimensional vector, and since we have 10 possible digits, we can implement the classification MLP as a simple 8x10 matrix, activated with the SoftMax function.
+
+The output of our model should be an (N, 10) tensor.
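
One caveat is worth checking numerically: PyTorch's `CrossEntropyLoss` applies a log-softmax to its input, so when the model already ends with a SoftMax, the loss sees values confined to [0, 1]. The sketch below is a plain-Python re-implementation for a single sample (the function `cross_entropy` here is an illustrative stand-in, not the PyTorch class) and shows that, with 10 classes, even a perfect probability vector gives a loss of about 1.46.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed as -log(softmax(logits)[target])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# A "perfect" softmax output for class 0, passed as if it were logits
perfect_probs = [1.0] + [0.0] * 9
loss_on_probs = cross_entropy(perfect_probs, target=0)

# Confident raw logits for class 0
confident_logits = [20.0] + [0.0] * 9
loss_on_logits = cross_entropy(confident_logits, target=0)

print(round(loss_on_probs, 2))  # 1.46
print(loss_on_logits < 1e-6)    # True
```

This floor of log(1 + 9/e) is consistent with a training loss that decreases slowly and stays above it. A possible variant, not tested here, is to return raw logits from the model and apply the softmax only when probabilities are needed.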
+
+%% Cell type:code id: tags:
+
+``` python
+class ViT(nn.Module):
+    def __init__(self, chw, n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10):
+        # Super constructor
+        super().__init__()
+
+        # Attributes
+        self.chw = chw  # ( C , H , W )
+        self.n_patches = n_patches
+        self.n_blocks = n_blocks
+        self.n_heads = n_heads
+        self.hidden_d = hidden_d
+
+        # Input and patches sizes
+        assert (
+            chw[1] % n_patches == 0
+        ), "Input shape not entirely divisible by number of patches"
+        assert (
+            chw[2] % n_patches == 0
+        ), "Input shape not entirely divisible by number of patches"
+        self.patch_size = (chw[1] // n_patches, chw[2] // n_patches)
+
+        # 1) Linear mapper
+        self.input_d = int(chw[0] * self.patch_size[0] * self.patch_size[1])
+        self.linear_mapper = nn.Linear(self.input_d, self.hidden_d)
+
+        # 2) Learnable classification token
+        self.class_token = nn.Parameter(torch.rand(1, self.hidden_d))
+
+        # 3) Positional embedding
+        self.register_buffer(
+            "positional_embeddings",
+            get_positional_embeddings(n_patches**2 + 1, hidden_d),
+            persistent=False,
+        )
+
+        # 4) Transformer encoder blocks
+        self.blocks = nn.ModuleList(
+            [ViTBlock(hidden_d, n_heads) for _ in range(n_blocks)]
+        )
+
+        # 5) Classification MLP
+        self.mlp = nn.Sequential(nn.Linear(self.hidden_d, out_d), nn.Softmax(dim=-1))
+
+    def forward(self, images):
+
+        # Dividing images into patches
+        n, c, h, w = images.shape
+        patches = patchify(images, self.n_patches).to(images.device)
+
+        # Running linear layer tokenization:
+        # map the vector corresponding to each patch to the hidden size dimension
+        tokens = self.linear_mapper(patches)
+
+        # Adding classification token to the tokens
+        tokens = torch.cat((self.class_token.expand(n, 1, -1), tokens), dim=1)
+
+        # Adding positional embedding (once)
+        out = tokens + self.positional_embeddings.repeat(n, 1, 1)
+
+        # Transformer Blocks
+        for block in self.blocks:
+            out = block(out)
+
+        # Getting the classification token only
+        out = out[:, 0, :]
+        # Map to output dimension, output category distribution
+        out = self.mlp(out)
+
+        return out
+```
+
+%% Cell type:markdown id: tags:
+
+### ViT training
+
+Now that the ViT model is built, the next step is to train it on the MNIST dataset.
+
+%% Cell type:markdown id: tags:
+
+First, we initialize the model and the hyperparameters.
+
+%% Cell type:code id: tags:
+
+``` python
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(
+    "Using device: ",
+    device,
+    f"({torch.cuda.get_device_name(device)})" if torch.cuda.is_available() else "",
+)
+
+model = ViT(
+    (1, 28, 28), n_patches=7, n_blocks=2, hidden_d=8, n_heads=2, out_d=10
+).to(device)
+
+N_EPOCHS = 5
+LR = 0.005
+```
+
+%% Output
+
+    Using device:  cuda (Tesla T4)
+
+%% Cell type:markdown id: tags:
+
+Training of the ViT model:
+
+%% Cell type:code id: tags:
+
+``` python
+from tqdm import tqdm  # progress bar
+
+optimizer = Adam(model.parameters(), lr=LR)
+criterion = CrossEntropyLoss()
+
+for epoch in range(N_EPOCHS):
+    train_loss = 0.0
+    model.train()
+
+    # Use tqdm to display a progress bar
+    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}/{N_EPOCHS}", leave=False):
+        x, y = batch
+        x, y = x.to(device), y.to(device)
+        optimizer.zero_grad()
+        y_hat = model(x)
+        loss = criterion(y_hat, y)
+        loss.backward()
+        optimizer.step()
+        train_loss += loss.detach().cpu().item() / len(train_loader)
+    print(f"Epoch {epoch + 1}/{N_EPOCHS} loss: {train_loss:.2f}")
+```
+
+%% Output
+
+    Epoch 1/5 loss: 1.83
+    Epoch 2/5 loss: 1.78
+    Epoch 3/5 loss: 1.74
+    Epoch 4/5 loss: 1.70
+    Epoch 5/5 loss: 1.68
+
+%% Cell type:markdown id: tags:
+
+### ViT test
+
+Finally, let's test the trained model.
+
+%% Cell type:code id: tags:
+
+``` python
+with torch.no_grad():
+    correct, total = 0, 0
+    test_loss = 0.0
+    model.eval()
+    for batch in tqdm(test_loader, desc="Testing", leave=False):
+        x, y = batch
+        x, y = x.to(device), y.to(device)
+        y_hat = model(x)
+        loss = criterion(y_hat, y)
+        test_loss += loss.detach().cpu().item() / len(test_loader)
+
+        # Count every sample in the batch (len(batch) would be 2: the (x, y) pair)
+        pred = torch.argmax(y_hat, dim=1)
+        correct += (pred == y).sum().item()
+        total += y.size(0)
+
+    print(f"Test loss: {test_loss:.2f}")
+    print(f"Test accuracy: {correct / total * 100:.2f}%")
+```
+
+%% Output
+
+    Test loss: 1.71
+    Test accuracy: 77.85%
+
+%% Cell type:code id: tags:
+
+``` python
+# Per-class accuracy on the test set
+test_loss = 0.0
+n_classes = 10  # MNIST digits 0-9
+class_correct = [0] * n_classes
+class_total = [0] * n_classes
+
+criterion = nn.CrossEntropyLoss()  # loss function
+
+model.eval()
+with torch.no_grad():
+    # iterate over test data
+    for data, target in test_loader:
+        # move tensors to the same device as the model
+        data, target = data.to(device), target.to(device)
+        # forward pass: compute predicted outputs by passing inputs to the model
+        output = model(data)
+        # calculate the batch loss and accumulate it, weighted by the batch length
+        loss = criterion(output, target)
+        test_loss += loss.item() * data.size(0)
+        # convert output probabilities to predicted class
+        pred = torch.argmax(output, dim=1)
+        # compare predictions to true labels
+        correct = pred.eq(target).cpu()
+        # use the actual batch length: the last batch may be smaller
+        for i in range(target.size(0)):
+            label = target[i].item()
+            class_correct[label] += int(correct[i].item())
+            class_total[label] += 1
+
+# average test loss per sample
+test_loss = test_loss / len(test_loader.dataset)
+print("Test Loss: {:.6f}\n".format(test_loss))
+
+for i in range(n_classes):
+    if class_total[i] > 0:
+        print(
+            "Test Accuracy of %5s: %2d%% (%2d/%2d)"
+            % (i, 100 * class_correct[i] / class_total[i], class_correct[i], class_total[i])
+        )
+    else:
+        print("Test Accuracy of %5s: N/A (no test examples)" % i)
+
+print(
+    "\nTest Accuracy (Overall): %2d%% (%2d/%2d)"
+    % (
+        100.0 * sum(class_correct) / sum(class_total),
+        sum(class_correct),
+        sum(class_total),
+    )
+)
+
+%% Cell type:markdown id: tags:
+
+## Further experiments
+
+1. Adapt the code to apply the ViT model to the CIFAR dataset.
+2. Make use of a validation set to evaluate overfitting.
+3. Evaluate the model with a dimension of 16 for the tokens and 4 encoder blocks.
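
Before running these experiments, it can help to check the tensor sizes a given configuration implies. The sketch below is only shape arithmetic in plain Python. The helper `vit_shapes` is a hypothetical name, and the CIFAR settings shown (3x32x32 images, an 8x8 grid of patches, 2 heads) are assumptions chosen for illustration, not values prescribed by the TD.

```python
def vit_shapes(chw, n_patches, hidden_d, n_heads):
    """Return the tensor sizes implied by a ViT configuration (per image)."""
    c, h, w = chw
    assert h % n_patches == 0 and w % n_patches == 0, "image not divisible into patches"
    assert hidden_d % n_heads == 0, "hidden_d not divisible by n_heads"
    patch_dim = c * (h // n_patches) * (w // n_patches)
    return {
        "n_tokens": n_patches**2 + 1,   # patches plus the classification token
        "patch_dim": patch_dim,         # input size of the linear mapper
        "d_head": hidden_d // n_heads,  # per-head dimension inside MSA
    }


# MNIST setup used in this notebook
mnist_cfg = vit_shapes((1, 28, 28), n_patches=7, hidden_d=8, n_heads=2)
print(mnist_cfg)  # {'n_tokens': 50, 'patch_dim': 16, 'd_head': 4}

# Hypothetical CIFAR setup with 16-dimensional tokens
cifar_cfg = vit_shapes((3, 32, 32), n_patches=8, hidden_d=16, n_heads=2)
print(cifar_cfg)  # {'n_tokens': 65, 'patch_dim': 48, 'd_head': 8}
```

With three color channels, each 4x4 patch flattens to 48 values, so the linear mapper's input size changes even though the patch size in pixels stays the same.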