%% Cell type:markdown id: tags:
### **_Deep Learning - BSc Data Science for Responsible Business - Centrale Lyon_**
2024-2025
Emmanuel Dellandréa
%% Cell type:markdown id: tags:
# Practical Session 7 – Large Language Models
The objective of this tutorial is to learn to work with LLMs for sentence generation and classification. The pretrained models and tokenizers will be obtained from the [Hugging Face platform](https://huggingface.co/).
This notebook contains 8 parts:
1. Using a Hugging Face text generation model
2. Using a Hugging Face pipeline for text classification
3. Using a pipeline with a specific model and tokenizer from Hugging Face
4. Experimenting with models from Hugging Face
5. Training an LLM for sentence classification using the **Trainer** class
6. Fine-tuning an LLM with a custom head
7. Sharing a model on the Hugging Face platform
8. Further experiments
Before going further into the experiments, your work is to understand the provided code, which gives an overview of using LLMs with Hugging Face.
**This code is intentionally not commented. It is your responsibility to add all the necessary comments to ensure your proper understanding of the code.**
You might frequently rely on [Hugging Face’s documentation](https://huggingface.co/docs).
---
As the computation can be heavy, particularly during training, we encourage you to use a GPU. If your laptop is not equipped with one, you may use one of these remote Jupyter servers, where you can select execution on a GPU (a quick device check is provided in the cell right after this list):
1) [jupyter.mi90.ec-lyon.fr](https://jupyter.mi90.ec-lyon.fr/)
This server is accessible from within the campus network; from outside, you need to use a VPN. Before executing the notebook, select the "Python PyTorch" kernel to run on GPU and to have access to the PyTorch module.
2) [Google Colaboratory](https://colab.research.google.com/)
Before executing the notebook, select execution on GPU: "Runtime" -> "Change runtime type" -> "T4 GPU".
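%% Cell type:markdown id: tags:
Whichever environment you pick, you can quickly check which accelerator PyTorch actually sees. The cell below is a small sanity check (it assumes PyTorch is available on the selected kernel); it is not required by the rest of the notebook.
%% Cell type:code id: tags:
``` python
import torch

# Pick the best available accelerator: NVIDIA GPU, Apple Silicon GPU, or CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
```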
%% Cell type:markdown id: tags:
### Installing required libraries
%% Cell type:code id: tags:
``` python
!pip install huggingface_hub
!pip install ipywidgets
!pip install transformers
!pip install datasets
!pip install accelerate
!pip install scikit-learn
```
%% Cell type:markdown id: tags:
### Log in to Hugging Face
First, you need to create an account on the [Hugging Face platform](https://huggingface.co/join).
Then you can log in to your account directly from the notebook.
%% Cell type:code id: tags:
``` python
from huggingface_hub import notebook_login
notebook_login()
```
%% Cell type:markdown id: tags:
### Part 1 - Using a Hugging Face text generation model
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer, AutoModelForCausalLM
# model_name = "mistralai/Mistral-7B"
# model_name = "deepseek-ai/DeepSeek-R1"
# model_name = "meta-llama/Llama-3.2-3B-Instruct"
# model_name = "homebrewltd/AlphaMaze-v0.2-1.5B"
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
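%% Cell type:markdown id: tags:
Note that GPT-2 does not define a padding token, so `tokenizer.pad_token_id` is `None` at this point. A common workaround (an addition to the original code, shown here as a suggestion) is to reuse the end-of-sequence token for padding:
%% Cell type:code id: tags:
``` python
# GPT-2 has no pad token by default; reuse the EOS token so that
# `pad_token_id=tokenizer.pad_token_id` in the next cell is well defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```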
%% Cell type:code id: tags:
``` python
input_text = "Hello. Who are you?"
encoded_input = tokenizer(input_text, return_tensors="pt")
output = model.generate(
    input_ids=encoded_input["input_ids"],
    attention_mask=encoded_input["attention_mask"],
    max_length=100,
    temperature=0.8,
    pad_token_id=tokenizer.pad_token_id
)
```
%% Cell type:code id: tags:
``` python
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
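%% Cell type:markdown id: tags:
Keep in mind that `temperature` only influences generation when sampling is enabled. The variant below (a sketch; the `top_k` and `top_p` values are purely illustrative) turns sampling on so you can observe the effect of these parameters:
%% Cell type:code id: tags:
``` python
output = model.generate(
    input_ids=encoded_input["input_ids"],
    attention_mask=encoded_input["attention_mask"],
    max_length=100,
    do_sample=True,        # enable sampling so that temperature / top_k / top_p take effect
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```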
%% Cell type:markdown id: tags:
### Part 2 - Using a Hugging Face pipeline for text classification
%% Cell type:code id: tags:
``` python
from transformers import pipeline
classifier = pipeline("text-classification")
```
%% Cell type:code id: tags:
``` python
classifier("We are very happy to welcome you at Centrale Lyon.")
```
%% Cell type:code id: tags:
``` python
results = classifier(["We are very happy to welcome you at Centrale Lyon.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```
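%% Cell type:markdown id: tags:
When no model is specified, the pipeline falls back to a default English sentiment model and prints a warning inviting you to pin a model explicitly. If you want the scores of all classes rather than only the best one, you can pass `top_k=None` (a sketch; the exact argument may depend on your `transformers` version):
%% Cell type:code id: tags:
``` python
# Return the score of every label instead of only the top one
classifier("We are very happy to welcome you at Centrale Lyon.", top_k=None)
```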
%% Cell type:markdown id: tags:
### Part 3 - Using a pipeline with a specific model and tokenizer from Hugging Face
%% Cell type:code id: tags:
``` python
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
```
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
classifier("We are very happy to present you this incredible model.")
```
%% Cell type:markdown id: tags:
### Part 4 - Experimenting with models from Hugging Face
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
encoding = tokenizer("We are very happy to welcome you at Centrale Lyon.")
print(encoding)
```
%% Cell type:code id: tags:
``` python
batch = tokenizer(
    ["We are very happy to welcome you at Centrale Lyon.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(batch)
```
%% Cell type:code id: tags:
``` python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype="auto")
print(model)
```
%% Cell type:code id: tags:
``` python
outputs = model(**batch)
print(outputs)
```
%% Cell type:code id: tags:
``` python
from torch import nn
predictions = nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
```
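%% Cell type:markdown id: tags:
The columns of `predictions` correspond to the labels stored in the model configuration. The short sketch below (an addition for readability) maps each predicted class index back to its human-readable label:
%% Cell type:code id: tags:
``` python
predicted_classes = predictions.argmax(dim=-1)
for probs, class_id in zip(predictions, predicted_classes):
    print(model.config.id2label[class_id.item()], round(probs[class_id].item(), 4))
```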
%% Cell type:code id: tags:
``` python
save_directory = "./save_pretrained"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
```
%% Cell type:code id: tags:
``` python
loaded_model = AutoModelForSequenceClassification.from_pretrained("./save_pretrained")
```
%% Cell type:markdown id: tags:
### Part 5 - Training an LLM for sentence classification using the **Trainer** class
%% Cell type:code id: tags:
``` python
from transformers import AutoModelForSequenceClassification
model_name = "distilbert/distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype="auto")
```
%% Cell type:code id: tags:
``` python
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="save_folder/",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)
```
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
%% Cell type:code id: tags:
``` python
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
```
%% Cell type:code id: tags:
``` python
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])
```
%% Cell type:code id: tags:
``` python
dataset = dataset.map(tokenize_dataset, batched=True)
```
%% Cell type:code id: tags:
``` python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
%% Cell type:code id: tags:
``` python
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
)
```
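%% Cell type:markdown id: tags:
As configured above, the Trainer only reports the loss during evaluation. If you also want accuracy, a common pattern (a sketch, assuming scikit-learn is installed as in the install cell at the top) is to pass a `compute_metrics` function when building the Trainer:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair provided by the Trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# To use it, add `compute_metrics=compute_metrics` to the Trainer arguments above.
```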
%% Cell type:code id: tags:
``` python
trainer.train()
```
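%% Cell type:markdown id: tags:
After training, the metrics on the evaluation set can be obtained directly from the Trainer:
%% Cell type:code id: tags:
``` python
metrics = trainer.evaluate()
print(metrics)
```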
%% Cell type:code id: tags:
``` python
save_directory = "./tomatoes_save_pretrained"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
```
%% Cell type:code id: tags:
``` python
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)
```
%% Cell type:code id: tags:
``` python
from transformers import pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
%% Cell type:code id: tags:
``` python
t = dataset['test'][345]
print(t)
classifier(t['text'])
```
%% Cell type:markdown id: tags:
### Part 6 - Fine-tuning an LLM with a custom head
%% Cell type:code id: tags:
``` python
from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertModel
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
```
%% Cell type:code id: tags:
``` python
dataset = load_dataset("imdb")
```
%% Cell type:code id: tags:
``` python
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
```
%% Cell type:code id: tags:
``` python
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8)
```
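%% Cell type:markdown id: tags:
Training on the full IMDB dataset takes over an hour even with a frozen backbone (see the timings reported a few cells below). If you only want to verify that the whole pipeline runs, an optional shortcut (a sketch using the `shuffle`/`select` methods of the `datasets` library) is to rebuild the loaders on a small subset:
%% Cell type:code id: tags:
``` python
# Optional: work on a small random subset to get fast feedback
small_train = train_dataset.shuffle(seed=42).select(range(2000))
small_test = test_dataset.shuffle(seed=42).select(range(1000))
train_loader = DataLoader(small_train, batch_size=8, shuffle=True)
test_loader = DataLoader(small_test, batch_size=8)
```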
%% Cell type:code id: tags:
``` python
bert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
for param in bert_model.parameters():
    param.requires_grad = False
```
%% Cell type:code id: tags:
``` python
class CustomBERTModel(torch.nn.Module):
    def __init__(self, bert_model):
        super(CustomBERTModel, self).__init__()
        self.bert = bert_model
        self.custom_head = torch.nn.Sequential(
            torch.nn.Linear(self.bert.config.hidden_size, 128),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.1),
            torch.nn.Linear(128, 2)
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        outputs = self.custom_head(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token output
        return outputs
```
%% Cell type:code id: tags:
``` python
bert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
for param in bert_model.parameters():
    param.requires_grad = False

model = CustomBERTModel(bert_model)

device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)
```
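%% Cell type:markdown id: tags:
Before launching a full training run, a quick forward pass on a single sentence (a small sanity check added here, not part of the original flow) confirms that the custom head outputs one logit per class:
%% Cell type:code id: tags:
``` python
sample = tokenizer("A quick sanity check.", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(input_ids=sample["input_ids"], attention_mask=sample["attention_mask"])
print(logits.shape)  # expected: torch.Size([1, 2])
```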
%% Cell type:code id: tags:
``` python
optimizer = AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()
```
%% Cell type:code id: tags:
``` python
def train_epoch(model, data_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)
```
%% Cell type:code id: tags:
``` python
def evaluate(model, data_loader, criterion, device):
    model.eval()
    total_loss = 0
    all_predictions = []
    all_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            predictions = torch.argmax(outputs, dim=-1)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    accuracy = accuracy_score(all_labels, all_predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_predictions, average="binary")
    return total_loss / len(data_loader), accuracy, precision, recall, f1
```
%% Cell type:code id: tags:
``` python
num_epochs = 3
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    print(f"Train Loss: {train_loss:.4f}")
    val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate(model, test_loader, criterion, device)
    print(f"Validation Loss: {val_loss:.4f}")
    print(f"Accuracy: {val_accuracy:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1 Score: {val_f1:.4f}")
    torch.save(model.state_dict(), f"custom_bert_epoch_{epoch + 1}.pth")
# (After 76 minutes of training)
# Epoch 1/3
# Train Loss: 0.6708
# Validation Loss: 0.6415
# Accuracy: 0.7917, Precision: 0.8218, Recall: 0.7450, F1 Score: 0.7815
# Epoch 2/3
# Train Loss: 0.6172
# Validation Loss: 0.5825
# Accuracy: 0.8051, Precision: 0.8142, Recall: 0.7907, F1 Score: 0.8023
# Epoch 3/3
# Train Loss: 0.5634
# Validation Loss: 0.5300
# Accuracy: 0.8098, Precision: 0.8339, Recall: 0.7738, F1 Score: 0.8027
```
%% Cell type:code id: tags:
``` python
model_save_path = "custom_bert_model.pth"
torch.save(model.state_dict(), model_save_path)
```
%% Cell type:code id: tags:
``` python
loadedbert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
loaded_model = CustomBERTModel(loadedbert_model)
loaded_model.load_state_dict(torch.load(model_save_path, map_location=device))
loaded_model.to(device)
```
%% Cell type:code id: tags:
``` python
batch = next(iter(test_loader))
ids = batch['input_ids'][0]
attention_mask = batch['attention_mask'][0]
label = batch['labels'][0]
ids = ids.to(device)
attention_mask = attention_mask.to(device)
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)
print(label)
loaded_model.eval()
with torch.no_grad():
    output = loaded_model(input_ids=ids.unsqueeze(0), attention_mask=attention_mask.unsqueeze(0))
output = output.squeeze(0)
print(output)
prediction = torch.argmax(output, dim=-1)
print(prediction)
print(label)
print(prediction.cpu() == label)
```
%% Cell type:markdown id: tags:
### Part 7 - Sharing a model on the Hugging Face platform
%% Cell type:code id: tags:
``` python
from transformers import DistilBertPreTrainedModel, DistilBertModel
import torch.nn as nn
class CustomDistilBERTModel(DistilBertPreTrainedModel):
    def __init__(self, config, freeze_backbone=True):
        super().__init__(config)
        self.distilbert = DistilBertModel(config)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, config.num_labels)
        )
        if freeze_backbone:
            for param in self.distilbert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.classifier(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token output
        return logits
```
%% Cell type:code id: tags:
``` python
from transformers import AutoConfig, AutoModel, DistilBertConfig

# A dedicated config type is needed so that AutoConfig / AutoModel can map back to the custom model
class CustomDistilBertConfig(DistilBertConfig):
    model_type = "custom-distilbert"

CustomDistilBERTModel.config_class = CustomDistilBertConfig
AutoConfig.register("custom-distilbert", CustomDistilBertConfig)
AutoModel.register(CustomDistilBertConfig, CustomDistilBERTModel)
```
%% Cell type:code id: tags:
``` python
from transformers import DistilBertTokenizer
config = CustomDistilBertConfig.from_pretrained("distilbert-base-uncased", num_labels=2)
config.architectures = ["CustomDistilBERTModel"]
model = CustomDistilBERTModel(config)
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model.save_pretrained("custom_distilbert_model")
tokenizer.save_pretrained("custom_distilbert_model")
print("Custom model and tokenizer saved locally!")
```
%% Cell type:code id: tags:
``` python
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)
```
%% Cell type:code id: tags:
``` python
optimizer = AdamW(model.parameters(), lr=2e-5)  # new optimizer for the new model's parameters
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 3
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    print(f"Train Loss: {train_loss:.4f}")
    val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate(model, test_loader, criterion, device)
    print(f"Validation Loss: {val_loss:.4f}")
    print(f"Accuracy: {val_accuracy:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1 Score: {val_f1:.4f}")
    torch.save(model.state_dict(), f"custom_bert_epoch_{epoch + 1}.pth")
```
%% Cell type:code id: tags:
``` python
model.push_to_hub("custom-distilbert-model")
tokenizer.push_to_hub("custom-distilbert-model")
```
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer, AutoModel
loaded_tokenizer = AutoTokenizer.from_pretrained("your_hf_id/custom-distilbert-model")
loaded_model = AutoModel.from_pretrained("your_hf_id/custom-distilbert-model")
```
%% Cell type:markdown id: tags:
### Part 8 - Further experiments
%% Cell type:markdown id: tags:
Now that you know the basics of manipulating LLMs through the Hugging Face platform, it is time to experiment with:
- different [NLP tasks](https://huggingface.co/tasks)
- different [models](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending)
- different [datasets](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=trending)
... and to share your fine-tuned models on the platform.
Besides, don't forget to monitor your trainings through [Weights & Biases](https://wandb.ai/home).