diff --git a/README.md b/README.md
index 724b594c2423f3a97b964a73432fcb2064b7900b..b149587a38ba18c2b5840f45f9e2b3a95b82c7e6 100644
--- a/README.md
+++ b/README.md
@@ -93,7 +93,7 @@ This database can be obtained at the address https://www.cs.toronto.edu/~kriz/ci
 ## k-nearest neighbors
 
 1. Create a Python file named `knn.py`. Write the function `distance_matrix` taking two matrices as parameters and returning `dists`, the L2 (Euclidean) distance matrix. The computation must be done using matrix manipulation only (no loops).
-    Hint: $(a-b)^2 = a^2 + b^2 - 2 ab$
+    Hint: $`(a-b)^2 = a^2 + b^2 - 2 ab`$
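
     One way to read this hint (a reminder of the reasoning, not the requested implementation): for a row $`a_i`$ of the first matrix and a row $`b_j`$ of the second, the squared distance expands as

     ```math
     d_{i,j}^2 = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2 \, a_i \cdot b_j
     ```

     so `dists` can be assembled from the row-wise squared norms of the two matrices and a single matrix product.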
 2. Write the function `knn_predict` taking as parameters:
       - `dists` the distance matrix between the train set and the test set, 
       - `labels_train` the training labels, and
@@ -116,18 +116,18 @@ The objective here is to develop a classifier based on a multilayer perceptron (
 First of all, let's focus on the backpropagation of the gradient using an example.
 Let's consider a network with one hidden layer.
 
-The weight matrix of the layer $L$ is denoted $W^{(L)}$. The bias vector of the layer $L$ is denoted $B^{(L)}$. We choose the sigmoid function, denoted $\sigma$, as the activation function. The output vector of the layer $L$ before activation is denoted $Z^{(L)}$. The output vector of the layer $L$ after activation is denoted $A^{(L)}$. By convention, we note $A^{(0)}$ the network input vector. Thus $Z^{(L+1)} = W^{(L+1)}A^{(L)} + B^{(L+1)}$ and $A^{(L+1)} = \sigma\left(Z^{(L+1)}\right)$. In our example, the output is $\hat{Y} = A^{(2)}$.
-Let $Y$ be the desired output. We use mean squared error (MSE) as the cost function. Thus, the cost is $C = \frac{1}{N_{out}}\sum_{i=1}^{N_{out}} (\hat{y_i} - y_i)^2$.
+The weight matrix of layer $`L`$ is denoted $`W^{(L)}`$. The bias vector of layer $`L`$ is denoted $`B^{(L)}`$. We choose the sigmoid function, denoted $`\sigma`$, as the activation function. The output vector of layer $`L`$ before activation is denoted $`Z^{(L)}`$. The output vector of layer $`L`$ after activation is denoted $`A^{(L)}`$. By convention, $`A^{(0)}`$ denotes the network input vector. Thus $`Z^{(L+1)} = W^{(L+1)}A^{(L)} + B^{(L+1)}`$ and $`A^{(L+1)} = \sigma\left(Z^{(L+1)}\right)`$. In our example, the output is $`\hat{Y} = A^{(2)}`$.
+Let $`Y`$ be the desired output. We use mean squared error (MSE) as the cost function. Thus, the cost is $`C = \frac{1}{N_{out}}\sum_{i=1}^{N_{out}} (\hat{y_i} - y_i)^2`$.
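
 The questions below all follow the same chain of dependencies: the cost $`C`$ depends on $`A^{(2)}`$, which depends on $`Z^{(2)}`$, which in turn depends on $`W^{(2)}`$, $`B^{(2)}`$ and $`A^{(1)}`$, and the same pattern repeats for the first layer. As a reminder (this is only the chain rule written element-wise, not one of the requested answers), for a single weight of the output layer:

 ```math
 \frac{\partial C}{\partial w^{(2)}_{i,j}} = \frac{\partial C}{\partial a^{(2)}_i} \, \frac{\partial a^{(2)}_i}{\partial z^{(2)}_i} \, \frac{\partial z^{(2)}_i}{\partial w^{(2)}_{i,j}}
 ```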
 
 1. Prove that $`\sigma' = \sigma \times (1-\sigma)`$
-2. Express $\frac{\partial C}{\partial A^{(2)}}$, i.e. the vector of $\frac{\partial C}{\partial a^{(2)}_i}$ as a function of $A^{(2)}$ and $Y$.
-3. Using the chaining rule, express $\frac{\partial C}{\partial Z^{(2)}}$, i.e. the vector of $\frac{\partial C}{\partial z^{(2)}_i}$ as a function of $\frac{\partial C}{\partial A^{(2)}}$ and $A^{(2)}$.
-4. Similarly, express $\frac{\partial C}{\partial W^{(2)}}$, i.e. the matrix of $\frac{\partial C}{\partial w^{(2)}_{i,j}}$ as a function of $\frac{\partial C}{\partial Z^{(2)}}$ and $A^{(1)}$.
-5. Similarly, express $\frac{\partial C}{\partial B^{(2)}}$ as a function of $\frac{\partial C}{\partial Z^{(2)}}$.
-6. Similarly, express $\frac{\partial C}{\partial A^{(1)}}$ as a function of $\frac{\partial C}{\partial Z^{(2)}}$ and $W^{(2)}$.
-7. Similarly, express $\frac{\partial C}{\partial Z^{(1)}}$ as a function of $\frac{\partial C}{\partial A^{(1)}}$ and $A^{(1)}$.
-8. Similarly, express $\frac{\partial C}{\partial W^{(1)}}$ as a function of $\frac{\partial C}{\partial Z^{(1)}}$ and $A^{(0)}$.
-9. Similarly, express $\frac{\partial C}{\partial B^{(1)}}$ as a function of $\frac{\partial C}{\partial Z^{(1)}}$.
+2. Express $`\frac{\partial C}{\partial A^{(2)}}`$, i.e. the vector of $`\frac{\partial C}{\partial a^{(2)}_i}`$ as a function of $`A^{(2)}`$ and $`Y`$.
+3. Using the chain rule, express $`\frac{\partial C}{\partial Z^{(2)}}`$, i.e. the vector of $`\frac{\partial C}{\partial z^{(2)}_i}`$ as a function of $`\frac{\partial C}{\partial A^{(2)}}`$ and $`A^{(2)}`$.
+4. Similarly, express $`\frac{\partial C}{\partial W^{(2)}}`$, i.e. the matrix of $`\frac{\partial C}{\partial w^{(2)}_{i,j}}`$ as a function of $`\frac{\partial C}{\partial Z^{(2)}}`$ and $`A^{(1)}`$.
+5. Similarly, express $`\frac{\partial C}{\partial B^{(2)}}`$ as a function of $`\frac{\partial C}{\partial Z^{(2)}}`$.
+6. Similarly, express $`\frac{\partial C}{\partial A^{(1)}}`$ as a function of $`\frac{\partial C}{\partial Z^{(2)}}`$ and $`W^{(2)}`$.
+7. Similarly, express $`\frac{\partial C}{\partial Z^{(1)}}`$ as a function of $`\frac{\partial C}{\partial A^{(1)}}`$ and $`A^{(1)}`$.
+8. Similarly, express $`\frac{\partial C}{\partial W^{(1)}}`$ as a function of $`\frac{\partial C}{\partial Z^{(1)}}`$ and $`A^{(0)}`$.
+9. Similarly, express $`\frac{\partial C}{\partial B^{(1)}}`$ as a function of $`\frac{\partial C}{\partial Z^{(1)}}`$.
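
 Once these expressions are implemented, a convenient sanity check is to compare them with a centered finite-difference estimate of the gradient. Below is a minimal, self-contained sketch of such a check; the helper name and its signature are illustrative and not part of the assignment.

 ```python
 import numpy as np

 def gradient_check(cost_fn, param, analytic_grad, eps=1e-6):
     """Hypothetical helper (not part of the assignment).

     Returns the largest absolute difference between `analytic_grad`
     (computed with the backpropagation formulas) and a centered
     finite-difference estimate of the gradient of `cost_fn` with
     respect to `param`, perturbed in place one entry at a time.
     """
     numeric_grad = np.zeros_like(param)
     it = np.nditer(param, flags=["multi_index"])
     for _ in it:
         idx = it.multi_index
         saved = param[idx]
         param[idx] = saved + eps
         cost_plus = cost_fn()
         param[idx] = saved - eps
         cost_minus = cost_fn()
         param[idx] = saved  # restore the original value
         numeric_grad[idx] = (cost_plus - cost_minus) / (2 * eps)
     return np.abs(numeric_grad - analytic_grad).max()
 ```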
 
 Below is Python code performing a forward pass and computing the cost in a network containing one hidden layer and using the sigmoid as its activation function:
 
@@ -167,7 +167,7 @@ print(loss)
 For classification tasks, we prefer to use a binary cross-entropy loss. We also want to replace the last activation layer of the network with a softmax layer.
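
 As a point of reference for the questions below, here is a minimal NumPy sketch of a softmax layer and a cross-entropy loss on one-hot targets; the function names and the (batch, classes) layout are illustrative, not imposed by the assignment.

 ```python
 import numpy as np

 def softmax(z):
     # subtract the row-wise maximum for numerical stability
     z = z - z.max(axis=1, keepdims=True)
     e = np.exp(z)
     return e / e.sum(axis=1, keepdims=True)

 def cross_entropy(predictions, targets_one_hot, eps=1e-12):
     # mean negative log-likelihood of the true classes
     p = np.clip(predictions, eps, 1.0)
     return -np.mean(np.sum(targets_one_hot * np.log(p), axis=1))

 # example with 2 samples and 3 classes
 logits = np.array([[2.0, 1.0, 0.1],
                    [0.5, 2.5, 0.3]])
 targets = np.array([[1, 0, 0],
                     [0, 1, 0]])  # one-hot labels
 print(cross_entropy(softmax(logits), targets))
 ```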
 
 10. Write the function `one_hot` taking an (n)-D array as parameter and returning the corresponding (n+1)-D one-hot matrix.
-11. Write a function `learn_once_cross_entropy` taking the the same parameters as `learn_once_mse` and returns the same outputs. The function must use a cross entropy loss and the last layer of the network must be a softmax. We admit that $\frac{\partial C}{\partial Z^{(2)}} = A^{(2)} - Y$. Where $Y$ is a one-hot vector encoding the label.
+11. Write a function `learn_once_cross_entropy` taking the same parameters as `learn_once_mse` and returning the same outputs. The function must use a cross-entropy loss, and the last layer of the network must be a softmax. We admit that $`\frac{\partial C}{\partial Z^{(2)}} = A^{(2)} - Y`$, where $`Y`$ is the one-hot vector encoding the label.
 12. Write the function `evaluate_mlp` taking as parameters:
       - `data_train`, `labels_train`, `data_test`, `labels_test`, the training and testing data,
       - `learning_rate` the learning rate,