%% Cell type:markdown id: tags:
### **_Deep Learning - BSc Data Science for Responsible Business - Centrale Lyon_**
2024-2025
Emmanuel Dellandréa
%% Cell type:markdown id: tags:
# Practical Session 7 – Large Language Models
The objective of this tutorial is to learn to work with LLMs for sentence generation and classification. The pretrained models and tokenizers will be obtained from the [Hugging Face platform](https://huggingface.co/).
This notebook contains 8 parts:
1. Using a Hugging Face text generation model
2. Using the Hugging Face pipeline for text classification
3. Using a pipeline with a specific model and tokenizer from Hugging Face
4. Experimenting with models from Hugging Face
5. Training an LLM for sentence classification using the **Trainer** class
6. Fine-tuning an LLM with a custom head
7. Sharing a model on the Hugging Face platform
8. Further experiments
Before going further into the experiments, your work is to understand the provided code, which gives an overview of using LLMs with Hugging Face.
**This code is intentionally not commented. It is your responsibility to add all the necessary comments to ensure your proper understanding of the code.**
You might frequently rely on [Hugging Face’s documentation](https://huggingface.co/docs).
---
As the computation can be heavy, particularly during training, we encourage you to use a GPU. If your laptop is not equipped with one, you may use one of these remote Jupyter servers, where you can select execution on GPU:
1) [jupyter.mi90.ec-lyon.fr](https://jupyter.mi90.ec-lyon.fr/)
This server is accessible within the campus network. From outside, you need to use a VPN. Before executing the notebook, select the kernel "Python PyTorch" to run it on GPU and have access to the PyTorch module.
2) [Google Colaboratory](https://colab.research.google.com/)
Before executing the notebook, select execution on GPU: "Runtime" -> "Change runtime type" -> "T4 GPU".
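%% Cell type:markdown id: tags:
Once the kernel is selected, a quick sanity check (a minimal sketch, assuming PyTorch is already available in the environment) confirms that an accelerator is actually visible:
%% Cell type:code id: tags:
``` python
import torch

# CUDA covers the remote Jupyter servers and Colab GPUs; MPS covers Apple Silicon laptops.
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
```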
%% Cell type:markdown id: tags:
### Installing required libraries
%% Cell type:code id: tags:
``` python
!pip install huggingface_hub
!pip install ipywidgets
!pip install transformers
!pip install datasets
!pip install accelerate
!pip install scikit-learn
```
%% Cell type:markdown id: tags:
### Then login to Hugging Face
%% Cell type:code id: tags:
``` python
from huggingface_hub import notebook_login
notebook_login()
```
%% Cell type:markdown id: tags:
### Part 1 - Using a Hugging Face text generation model
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer, AutoModelForCausalLM
# model_name = "mistralai/Mistral-7B"
# model_name = "deepseek-ai/DeepSeek-R1"
# model_name = "meta-llama/Llama-3.2-3B-Instruct"
# model_name = "homebrewltd/AlphaMaze-v0.2-1.5B"
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
input_text = "Hello. Who are you?"
encoded_input = tokenizer(input_text, return_tensors="pt")
output = model.generate(
    input_ids=encoded_input["input_ids"],
    attention_mask=encoded_input["attention_mask"],
    max_length=100,
    do_sample=True,  # required for temperature to have an effect
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS
)
```
%% Cell type:code id: tags:
``` python
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
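%% Cell type:markdown id: tags:
To see how decoding choices shape the output, here is a small comparison sketch (an illustrative addition, reusing the same `model`, `tokenizer` and `encoded_input` as above): greedy decoding is deterministic, while sampling with a temperature produces a different continuation on each run.
%% Cell type:code id: tags:
``` python
# Greedy decoding: always picks the most probable next token.
greedy = model.generate(**encoded_input, max_new_tokens=40, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))

# Sampling: a higher temperature flattens the distribution and increases diversity.
sampled = model.generate(**encoded_input, max_new_tokens=40, do_sample=True,
                         temperature=1.2, top_k=50,
                         pad_token_id=tokenizer.eos_token_id)
print("Sampled:", tokenizer.decode(sampled[0], skip_special_tokens=True))
```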
%% Cell type:markdown id: tags:
### Part 2 - Using the Hugging Face pipeline for text classification
%% Cell type:code id: tags:
``` python
from transformers import pipeline
classifier = pipeline("text-classification")
```
%% Cell type:code id: tags:
``` python
classifier("We are very happy to welcome you at Centrale Lyon.")
```
%% Cell type:code id: tags:
``` python
results = classifier(["We are very happy to welcome you at Centrale Lyon.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```
%% Cell type:markdown id: tags:
### Part 3 - Using a pipeline with a specific model and tokenizer from Hugging Face
%% Cell type:code id: tags:
``` python
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
```
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
classifier("We are very happy to present this incredible model to you.")
```
%% Cell type:markdown id: tags:
### Part 4 - Experimenting with models from Hugging Face
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
encoding = tokenizer("We are very happy to welcome you at Centrale Lyon.")
print(encoding)
```
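%% Cell type:markdown id: tags:
To make the encoding less opaque, this short sketch (an illustrative addition) maps the produced ids back to the subword tokens they stand for:
%% Cell type:code id: tags:
``` python
# Each input id corresponds to one subword token from the model's vocabulary.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# decode() reassembles the tokens into a string, including special tokens like [CLS] and [SEP].
print(tokenizer.decode(encoding["input_ids"]))
```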
%% Cell type:code id: tags:
``` python
batch = tokenizer(
    ["We are very happy to welcome you at Centrale Lyon.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(batch)
```
%% Cell type:code id: tags:
``` python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype="auto")
print(model)
```
%% Cell type:code id: tags:
``` python
outputs = model(**batch)
print(outputs)
```
%% Cell type:code id: tags:
``` python
from torch import nn
predictions = nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
```
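%% Cell type:markdown id: tags:
The softmax scores are easier to read once mapped to their label names, which the model ships in its configuration (a minimal sketch, an illustrative addition):
%% Cell type:code id: tags:
``` python
import torch

# id2label maps class indices to human-readable names (here, 1 to 5 stars).
best = torch.argmax(predictions, dim=-1)
for i, idx in enumerate(best):
    print(f"Sentence {i}: {model.config.id2label[idx.item()]} (p={predictions[i, idx].item():.3f})")
```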
%% Cell type:code id: tags:
``` python
save_directory = "./save_pretrained"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
```
%% Cell type:code id: tags:
``` python
loaded_model = AutoModelForSequenceClassification.from_pretrained("./save_pretrained")
```
%% Cell type:markdown id: tags:
### Part 5 - Training an LLM for sentence classification using the **Trainer** class
%% Cell type:code id: tags:
``` python
from transformers import AutoModelForSequenceClassification
model_name = "distilbert/distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype="auto")
```
%% Cell type:code id: tags:
``` python
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="save_folder/",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)
```
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
```
%% Cell type:code id: tags:
``` python
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"], truncation=True)
```
%% Cell type:code id: tags:
``` python
dataset = dataset.map(tokenize_dataset, batched=True)
```
%% Cell type:code id: tags:
``` python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
%% Cell type:code id: tags:
``` python
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
)
```
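%% Cell type:markdown id: tags:
By default the `Trainer` only reports the loss. If you also want accuracy on the evaluation set, you can pass a metrics function (a minimal sketch, an optional addition using scikit-learn, which is already installed; pass it as `compute_metrics=compute_metrics` when building the `Trainer`):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair provided by the Trainer at evaluation time.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}
```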
%% Cell type:code id: tags:
``` python
trainer.train()
```
%% Cell type:code id: tags:
``` python
save_directory = "./tomatoes_save_pretrained"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
```
%% Cell type:code id: tags:
``` python
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)
```
%% Cell type:code id: tags:
``` python
from transformers import pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
%% Cell type:code id: tags:
``` python
t = dataset['test'][345]
print(t)
classifier(t['text'])
```
%% Cell type:markdown id: tags:
### Part 6 - Fine-tuning an LLM with a custom head
%% Cell type:code id: tags:
``` python
from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertModel
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
```
%% Cell type:code id: tags:
``` python
dataset = load_dataset("imdb")
```
%% Cell type:code id: tags:
``` python
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
```
%% Cell type:code id: tags:
``` python
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8)
```
%% Cell type:code id: tags:
``` python
bert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
for param in bert_model.parameters():
    param.requires_grad = False
```
%% Cell type:code id: tags:
``` python
class CustomBERTModel(torch.nn.Module):
    def __init__(self, bert_model):
        super(CustomBERTModel, self).__init__()
        self.bert = bert_model
        self.custom_head = torch.nn.Sequential(
            torch.nn.Linear(self.bert.config.hidden_size, 128),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.1),
            torch.nn.Linear(128, 2)
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        outputs = self.custom_head(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token output
        return outputs
```
%% Cell type:code id: tags:
``` python
bert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
for param in bert_model.parameters():
    param.requires_grad = False

model = CustomBERTModel(bert_model)
# Pick whichever accelerator is available (CUDA on the remote servers / Colab, MPS on Apple Silicon).
device = torch.device("cuda" if torch.cuda.is_available()
                      else "mps" if torch.backends.mps.is_available()
                      else "cpu")
model.to(device)
```
%% Cell type:code id: tags:
``` python
optimizer = AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()
```
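%% Cell type:markdown id: tags:
Since the DistilBERT backbone is frozen, only the head's parameters actually receive gradients; passing just those to the optimizer is slightly leaner and makes the intent explicit (an optional variant, a minimal sketch):
%% Cell type:code id: tags:
``` python
# Equivalent in effect here, because frozen parameters get no gradient anyway.
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)
```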
%% Cell type:code id: tags:
``` python
def train_epoch(model, data_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)
```
%% Cell type:code id: tags:
``` python
def evaluate(model, data_loader, criterion, device):
    model.eval()
    total_loss = 0
    all_predictions = []
    all_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            predictions = torch.argmax(outputs, dim=-1)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    accuracy = accuracy_score(all_labels, all_predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_predictions, average="binary")
    return total_loss / len(data_loader), accuracy, precision, recall, f1
```
%% Cell type:code id: tags:
``` python
num_epochs = 3
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    print(f"Train Loss: {train_loss:.4f}")
    val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate(model, test_loader, criterion, device)
    print(f"Validation Loss: {val_loss:.4f}")
    print(f"Accuracy: {val_accuracy:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1 Score: {val_f1:.4f}")
    torch.save(model.state_dict(), f"custom_bert_epoch_{epoch + 1}.pth")

# (After 76 minutes of training)
# Epoch 1/3
# Train Loss: 0.6708
# Validation Loss: 0.6415
# Accuracy: 0.7917, Precision: 0.8218, Recall: 0.7450, F1 Score: 0.7815
# Epoch 2/3
# Train Loss: 0.6172
# Validation Loss: 0.5825
# Accuracy: 0.8051, Precision: 0.8142, Recall: 0.7907, F1 Score: 0.8023
# Epoch 3/3
# Train Loss: 0.5634
# Validation Loss: 0.5300
# Accuracy: 0.8098, Precision: 0.8339, Recall: 0.7738, F1 Score: 0.8027
```
%% Cell type:code id: tags:
``` python
model_save_path = "custom_bert_model.pth"
torch.save(model.state_dict(), model_save_path)
```
%% Cell type:code id: tags:
``` python
loaded_bert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
loaded_model = CustomBERTModel(loaded_bert_model)
loaded_model.load_state_dict(torch.load(model_save_path))
loaded_model.to(device)
```
%% Cell type:code id: tags:
``` python
batch = next(iter(test_loader))
ids = batch['input_ids'][0]
attention_mask = batch['attention_mask'][0]
label = batch['labels'][0]
ids = ids.to(device)
attention_mask = attention_mask.to(device)
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)
print(label)
loaded_model.eval()
with torch.no_grad():
    output = loaded_model(input_ids=ids.unsqueeze(0), attention_mask=attention_mask.unsqueeze(0))
output = output.squeeze(0)
print(output)
prediction = torch.argmax(output, dim=-1)
print(prediction)
print(label)
print(prediction.cpu() == label)
```
%% Cell type:markdown id: tags:
### Part 7 - Sharing a model on the Hugging Face platform
%% Cell type:code id: tags:
``` python
from transformers import DistilBertPreTrainedModel, DistilBertModel, DistilBertConfig
import torch.nn as nn

# A dedicated config class (with its own model_type) so that AutoConfig /
# AutoModel can later map a saved checkpoint back to the custom class.
class CustomDistilBertConfig(DistilBertConfig):
    model_type = "custom-distilbert"

class CustomDistilBERTModel(DistilBertPreTrainedModel):
    config_class = CustomDistilBertConfig

    def __init__(self, config, freeze_backbone=True):
        super().__init__(config)
        self.distilbert = DistilBertModel(config)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, config.num_labels)  # num_labels outputs (2 here: binary classification)
        )
        self.init_weights()
        # Freeze DistilBERT backbone if specified
        if freeze_backbone:
            for param in self.distilbert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.classifier(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token output
        return logits
```
%% Cell type:code id: tags:
``` python
from transformers import AutoConfig, AutoModel

AutoConfig.register("custom-distilbert", CustomDistilBertConfig)
AutoModel.register(CustomDistilBertConfig, CustomDistilBERTModel)
```
%% Cell type:code id: tags:
``` python
from transformers import DistilBertTokenizer

# Initialize the configuration with custom attributes
config = CustomDistilBertConfig.from_pretrained("distilbert-base-uncased", num_labels=2)
config.architectures = ["CustomDistilBERTModel"]

# Initialize the model (loading the pretrained backbone weights; the new head stays
# randomly initialized) and the tokenizer
model = CustomDistilBERTModel.from_pretrained("distilbert-base-uncased", config=config)
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Save locally
model.save_pretrained("custom_distilbert_model")
tokenizer.save_pretrained("custom_distilbert_model")
print("Custom model and tokenizer saved locally!")
```
%% Cell type:code id: tags:
``` python
device = torch.device("cuda" if torch.cuda.is_available()
                      else "mps" if torch.backends.mps.is_available()
                      else "cpu")
model = model.to(device)
```
%% Cell type:code id: tags:
``` python
optimizer = AdamW(model.parameters(), lr=2e-5)  # re-create the optimizer for the new model's parameters
num_epochs = 3
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    print(f"Train Loss: {train_loss:.4f}")
    val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate(model, test_loader, criterion, device)
    print(f"Validation Loss: {val_loss:.4f}")
    print(f"Accuracy: {val_accuracy:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1 Score: {val_f1:.4f}")
    torch.save(model.state_dict(), f"custom_bert_epoch_{epoch + 1}.pth")
```
%% Cell type:code id: tags:
``` python
model.push_to_hub("custom-distilbert-model")
tokenizer.push_to_hub("custom-distilbert-model")
```
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer, AutoModel
loaded_tokenizer = AutoTokenizer.from_pretrained("your_hf_id/custom-distilbert-model")
loaded_model = AutoModel.from_pretrained("your_hf_id/custom-distilbert-model")
```
%% Cell type:markdown id: tags:
### Part 8 - Further experiments
%% Cell type:markdown id: tags:
Now that you know the basics of manipulating LLMs through the Hugging Face platform, it is time to experiment with:
- different [NLP tasks](https://huggingface.co/tasks)
- different [models](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending)
- different [datasets](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=trending)
... and to share your finetuned models on the platform.
Besides, don't forget to monitor your trainings through [Weights & Biases](https://wandb.ai/home).
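%% Cell type:markdown id: tags:
With the `Trainer` API, Weights & Biases logging only requires pointing `report_to` at it (a minimal sketch, assuming you have a W&B account; the `wandb` package will prompt for your API key on first use):
%% Cell type:code id: tags:
``` python
!pip install wandb

from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="save_folder/",
    report_to="wandb",   # stream training metrics to Weights & Biases
    logging_steps=50,    # how often metrics are logged
)
```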