Upload New File

6fd03e90 · Bouyahya Zied · a23b624b · 6fd03e90
Commit 6fd03e90 authored 4 months ago by Bouyahya Zied
--- a/WebScraping/webscraping.ipynb
+++ b/WebScraping/webscraping.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Web scraping tutorial\n",
+    "\n",
+    "This tutorial will guide you step-by-step on how to use Selenium and BeautifulSoup to scrape job listings from a website. We will also use Pandas to store the scraped data in a CSV file. This tutorial is designed for beginners, so every step will be explained in detail.\n",
+    "\n",
+    "#### 1. Introduction\n",
+    "**What is Web Scraping?**\n",
+    "\n",
+    "Web scraping is the process of extracting data from websites. It involves sending a request to a website, downloading the HTML content, and then parsing it to extract useful information.\n",
+    "\n",
+    "**Tools Used in This Tutorial**\n",
+    "- ```Selenium```: A tool for automating web browsers. It allows you to interact with web pages as if you were using a real browser.\n",
+    "\n",
+    "- ```BeautifulSoup```: A Python library for parsing HTML and XML documents. It makes it easy to extract data from web pages.\n",
+    "\n",
+    "- ```Pandas```: A Python library for data manipulation and analysis. We will use it to store the scraped data in a CSV file.\n",
+    "\n",
+    "- ```WebDriver```: A tool that allows Selenium to control a web browser (e.g., Chrome, Firefox).\n",
+    "\n",
+    "#### Important notes \n",
+    "- Respect the website's robots.txt file: Before scraping, check the website's robots.txt file (e.g., https://awebsite.com/robots.txt) to ensure you are allowed to scrape the site.\n",
+    "\n",
+    "- Avoid overloading the server: Add delays between requests to avoid overwhelming the website's server.\n",
+    "\n",
+    "- **Legal considerations: Ensure that your scraping activities comply with the website's terms of service and local laws.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 1: Setting Up Your Environment\n",
+    "Before we start coding, you need to set up your environment.\n",
+    "\n",
+    "1. Install Required Libraries\n",
+    "Run the following command in a notebook cell to install the required libraries (in a terminal or command prompt, drop the leading `%`):\n",
+    "```python \n",
+    "%pip install selenium beautifulsoup4 pandas\n",
+    "```\n",
+    "\n",
+    "2. Download WebDriver\n",
+    "Selenium requires a WebDriver to interact with the browser. For this tutorial, we will use ChromeDriver.\n",
+    "\n",
+    "Download the ChromeDriver version that matches your Chrome browser version from the following link:\n",
+    "https://sites.google.com/chromium.org/driver/\n",
+    "\n",
+    "Add the ```ChromeDriver``` executable to your system's PATH or place it in the same directory as your script."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 2: Import Required Libraries\n",
+    "- Selenium: Used to automate the browser.\n",
+    "\n",
+    "- BeautifulSoup: Used to parse HTML content.\n",
+    "\n",
+    "- Pandas: Used to store data in a CSV file.\n",
+    "\n",
+    "- Time: Used to add delays (e.g., waiting for the page to load)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from selenium import webdriver\n",
+    "from selenium.webdriver.common.by import By\n",
+    "from selenium.webdriver.common.keys import Keys\n",
+    "from selenium.webdriver.support.ui import WebDriverWait\n",
+    "from selenium.webdriver.support import expected_conditions as EC\n",
+    "from bs4 import BeautifulSoup\n",
+    "import pandas as pd\n",
+    "import time"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 3: Initialize the WebDriver\n",
+    "Make sure chromedriver is in your system's PATH or in the same directory as your script."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "driver = webdriver.Chrome()  # Ensure chromedriver is in your PATH"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 4: Define search keywords\n",
+    "This is the keyword we will use to search for jobs on the website."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "search_keyword = \"Data Science\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 5: Construct and Navigate to the URL\n",
+    "We have to construct the URL by replacing spaces in the search keyword with %20 (URL encoding for spaces).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "url = f'https://remote.co/remote-jobs/search?searchkeyword={search_keyword.replace(\" \", \"%20\")}'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next cell tells the WebDriver to open the constructed URL in the browser."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "driver.get(url)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "time.sleep(5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<selenium.webdriver.remote.webelement.WebElement (session=\"F0DF4FC9-1BC7-4191-AB9B-E7B3CB2EFDC3\", element=\"node-83A1A8FD-F3D8-4391-81A1-C6C6D2458FD6\")>"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "\n",
+    "WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'sc-jv5lm6-0')))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "soup = BeautifulSoup(driver.page_source, 'html.parser')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "job_listings = soup.find_all('div', class_='sc-jv5lm6-0')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "jobs = []\n",
+    "\n",
+    "for job in job_listings:\n",
+    "    title = job.find('a', class_='sc-jv5lm6-13')\n",
+    "    title = title.text.strip() if title else 'N/A'\n",
+    "\n",
+    "    company = job.find('span', class_='sc-jv5lm6-5')\n",
+    "    company = company.text.strip() if company else 'N/A'\n",
+    "\n",
+    "    location = job.find('span', class_='sc-jv5lm6-10')\n",
+    "    location = location.text.strip() if location else 'Remote'\n",
+    "\n",
+    "    link = job.find('a', class_='sc-jv5lm6-13')\n",
+    "    link = 'https://remote.co' + link['href'] if link else 'N/A'\n",
+    "\n",
+    "    salary = job.find('li', id=lambda x: x and x.startswith('salartRange-'))\n",
+    "    salary = salary.text.strip() if salary else 'N/A'\n",
+    "\n",
+    "    job_type = job.find('li', id=lambda x: x and x.startswith('jobTypes-'))\n",
+    "    job_type = job_type.text.strip() if job_type else 'N/A'\n",
+    "\n",
+    "    jobs.append({\n",
+    "        'Title': title,\n",
+    "        'Company': company,\n",
+    "        'Location': location,\n",
+    "        'Link': link,\n",
+    "        'Salary': salary,\n",
+    "        'Job Type': job_type\n",
+    "    })"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(50, 6)"
+      ]
+     },
+     "execution_count": 28,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df = pd.DataFrame(jobs)\n",
+    "df.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "driver.quit()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Title</th>\n",
+       "      <th>Company</th>\n",
+       "      <th>Location</th>\n",
+       "      <th>Link</th>\n",
+       "      <th>Salary</th>\n",
+       "      <th>Job Type</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Science Water Ecologist, Bureau of Environment...</td>\n",
+       "      <td></td>\n",
+       "      <td>Hybrid Remote in New York City, NY</td>\n",
+       "      <td>https://remote.co/job-details/science-water-ec...</td>\n",
+       "      <td>$49,653 - $57,101 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Data Science Manager, Marketing Science</td>\n",
+       "      <td></td>\n",
+       "      <td>Hybrid Remote in Seattle, WA, Palo Alto, CA, S...</td>\n",
+       "      <td>https://remote.co/job-details/data-science-man...</td>\n",
+       "      <td>$191,840 - $335,720 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Senior Data Scientist - Core Data Science</td>\n",
+       "      <td></td>\n",
+       "      <td>Remote, US National</td>\n",
+       "      <td>https://remote.co/job-details/senior-data-scie...</td>\n",
+       "      <td>$180,370 - $212,200 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Data Scientist, Product Data Science</td>\n",
+       "      <td></td>\n",
+       "      <td>Remote in Canada</td>\n",
+       "      <td>https://remote.co/job-details/data-scientist-p...</td>\n",
+       "      <td>N/A</td>\n",
+       "      <td>Employee</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Solutions Architecture Architect – Data Scienc...</td>\n",
+       "      <td></td>\n",
+       "      <td>Hybrid Remote in Herndon, VA</td>\n",
+       "      <td>https://remote.co/job-details/solutions-archit...</td>\n",
+       "      <td>$183,498 - $207,000 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                               Title Company  \\\n",
+       "0  Science Water Ecologist, Bureau of Environment...           \n",
+       "1            Data Science Manager, Marketing Science           \n",
+       "2          Senior Data Scientist - Core Data Science           \n",
+       "3               Data Scientist, Product Data Science           \n",
+       "4  Solutions Architecture Architect – Data Scienc...           \n",
+       "\n",
+       "                                            Location  \\\n",
+       "0                 Hybrid Remote in New York City, NY   \n",
+       "1  Hybrid Remote in Seattle, WA, Palo Alto, CA, S...   \n",
+       "2                                Remote, US National   \n",
+       "3                                   Remote in Canada   \n",
+       "4                       Hybrid Remote in Herndon, VA   \n",
+       "\n",
+       "                                                Link  \\\n",
+       "0  https://remote.co/job-details/science-water-ec...   \n",
+       "1  https://remote.co/job-details/data-science-man...   \n",
+       "2  https://remote.co/job-details/senior-data-scie...   \n",
+       "3  https://remote.co/job-details/data-scientist-p...   \n",
+       "4  https://remote.co/job-details/solutions-archit...   \n",
+       "\n",
+       "                         Salary  Job Type  \n",
+       "0    $49,653 - $57,101 Annually  Employee  \n",
+       "1  $191,840 - $335,720 Annually  Employee  \n",
+       "2  $180,370 - $212,200 Annually  Employee  \n",
+       "3                           N/A  Employee  \n",
+       "4  $183,498 - $207,000 Annually  Employee  "
+      ]
+     },
+     "execution_count": 30,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Title       object\n",
+       "Company     object\n",
+       "Location    object\n",
+       "Link        object\n",
+       "Salary      object\n",
+       "Job Type    object\n",
+       "dtype: object"
+      ]
+     },
+     "execution_count": 35,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.dtypes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 60,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO: Write a function called extract_salary_range(salary) that extracts, from the salary information in the dataset, the salary from, the salary to, and the range (\"annually\", \"monthly\")\n",
+    "\n",
+    "def extract_salary_range(salary):\n",
+    "    if pd.isna(salary) or salary == 'N/A':\n",
+    "        return None, None, None\n",
+    "    \n",
+    "    # Your Code HERE\n",
+    "    \n",
+    "    return salary_range, salary_from, salary_to\n",
+    "\n",
+    "df[['Salary Range', 'Salary From', 'Salary To']] = df['Salary'].apply(\n",
+    "    lambda x: pd.Series(extract_salary_range(x))\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Title</th>\n",
+       "      <th>Company</th>\n",
+       "      <th>Location</th>\n",
+       "      <th>Link</th>\n",
+       "      <th>Salary</th>\n",
+       "      <th>Job Type</th>\n",
+       "      <th>Salary Range</th>\n",
+       "      <th>Salary From</th>\n",
+       "      <th>Salary To</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Science Water Ecologist, Bureau of Environment...</td>\n",
+       "      <td></td>\n",
+       "      <td>Hybrid Remote in New York City, NY</td>\n",
+       "      <td>https://remote.co/job-details/science-water-ec...</td>\n",
+       "      <td>$49,653 - $57,101 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "      <td>$49,653 - $57,101</td>\n",
+       "      <td>49.653</td>\n",
+       "      <td>57.101</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Data Science Manager, Marketing Science</td>\n",
+       "      <td></td>\n",
+       "      <td>Hybrid Remote in Seattle, WA, Palo Alto, CA, S...</td>\n",
+       "      <td>https://remote.co/job-details/data-science-man...</td>\n",
+       "      <td>$191,840 - $335,720 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "      <td>$191,840 - $335,720</td>\n",
+       "      <td>191.840</td>\n",
+       "      <td>335.720</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Senior Data Scientist - Core Data Science</td>\n",
+       "      <td></td>\n",
+       "      <td>Remote, US National</td>\n",
+       "      <td>https://remote.co/job-details/senior-data-scie...</td>\n",
+       "      <td>$180,370 - $212,200 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "      <td>$180,370 - $212,200</td>\n",
+       "      <td>180.370</td>\n",
+       "      <td>212.200</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Data Scientist, Product Data Science</td>\n",
+       "      <td></td>\n",
+       "      <td>Remote in Canada</td>\n",
+       "      <td>https://remote.co/job-details/data-scientist-p...</td>\n",
+       "      <td>N/A</td>\n",
+       "      <td>Employee</td>\n",
+       "      <td>None</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Solutions Architecture Architect – Data Scienc...</td>\n",
+       "      <td></td>\n",
+       "      <td>Hybrid Remote in Herndon, VA</td>\n",
+       "      <td>https://remote.co/job-details/solutions-archit...</td>\n",
+       "      <td>$183,498 - $207,000 Annually</td>\n",
+       "      <td>Employee</td>\n",
+       "      <td>$183,498 - $207,000</td>\n",
+       "      <td>183.498</td>\n",
+       "      <td>207.000</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                               Title Company  \\\n",
+       "0  Science Water Ecologist, Bureau of Environment...           \n",
+       "1            Data Science Manager, Marketing Science           \n",
+       "2          Senior Data Scientist - Core Data Science           \n",
+       "3               Data Scientist, Product Data Science           \n",
+       "4  Solutions Architecture Architect – Data Scienc...           \n",
+       "\n",
+       "                                            Location  \\\n",
+       "0                 Hybrid Remote in New York City, NY   \n",
+       "1  Hybrid Remote in Seattle, WA, Palo Alto, CA, S...   \n",
+       "2                                Remote, US National   \n",
+       "3                                   Remote in Canada   \n",
+       "4                       Hybrid Remote in Herndon, VA   \n",
+       "\n",
+       "                                                Link  \\\n",
+       "0  https://remote.co/job-details/science-water-ec...   \n",
+       "1  https://remote.co/job-details/data-science-man...   \n",
+       "2  https://remote.co/job-details/senior-data-scie...   \n",
+       "3  https://remote.co/job-details/data-scientist-p...   \n",
+       "4  https://remote.co/job-details/solutions-archit...   \n",
+       "\n",
+       "                         Salary  Job Type         Salary Range  Salary From  \\\n",
+       "0    $49,653 - $57,101 Annually  Employee    $49,653 - $57,101       49.653   \n",
+       "1  $191,840 - $335,720 Annually  Employee  $191,840 - $335,720      191.840   \n",
+       "2  $180,370 - $212,200 Annually  Employee  $180,370 - $212,200      180.370   \n",
+       "3                           N/A  Employee                 None          NaN   \n",
+       "4  $183,498 - $207,000 Annually  Employee  $183,498 - $207,000      183.498   \n",
+       "\n",
+       "   Salary To  \n",
+       "0     57.101  \n",
+       "1    335.720  \n",
+       "2    212.200  \n",
+       "3        NaN  \n",
+       "4    207.000  "
+      ]
+     },
+     "execution_count": 61,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### ToDo\n",
+    "1. Find a website that allows scraping (check the robots.txt)\n",
+    "2. Scrape the relevant data\n",
+    "3. Pre-process the data and conduct an EDA\n",
+    "4. Submit the following:\n",
+    "   - The scraping notebook\n",
+    "   - The CSV file before processing\n",
+    "   - The EDA notebook and the processed CSV file\n",
+    "\n",
+    "**Deadline: April 5th, 2025 at 14:00**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+### Web scraping tutorial
+
+This tutorial will guide you step-by-step on how to use Selenium and BeautifulSoup to scrape job listings from a website. We will also use Pandas to store the scraped data in a CSV file. This tutorial is designed for beginners, so every step will be explained in detail.
+
+#### 1. Introduction
+**What is Web Scraping?**
+
+Web scraping is the process of extracting data from websites. It involves sending a request to a website, downloading the HTML content, and then parsing it to extract useful information.
+
+**Tools Used in This Tutorial**
+- ```Selenium```: A tool for automating web browsers. It allows you to interact with web pages as if you were using a real browser.
+
+- ```BeautifulSoup```: A Python library for parsing HTML and XML documents. It makes it easy to extract data from web pages.
+
+- ```Pandas```: A Python library for data manipulation and analysis. We will use it to store the scraped data in a CSV file.
+
+- ```WebDriver```: A tool that allows Selenium to control a web browser (e.g., Chrome, Firefox).
+
+#### Important notes
+- Respect the website's robots.txt file: Before scraping, check the website's robots.txt file (e.g., https://awebsite.com/robots.txt) to ensure you are allowed to scrape the site.
+
+- Avoid overloading the server: Add delays between requests to avoid overwhelming the website's server.
+
+- **Legal considerations: Ensure that your scraping activities comply with the website's terms of service and local laws.**
+
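The robots.txt check described in the notes above can be automated. The following is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the paths are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("*", "https://awebsite.com/jobs")
blocked = parser.can_fetch("*", "https://awebsite.com/private/data")
print(allowed, blocked)
```

Against a live site, `parser.set_url("https://awebsite.com/robots.txt")` followed by `parser.read()` would load the real file instead of the inline sample.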
+%% Cell type:markdown id: tags:
+
+### Step 1: Setting Up Your Environment
+Before we start coding, you need to set up your environment.
+
+1. Install Required Libraries
+Run the following command in a notebook cell to install the required libraries (in a terminal or command prompt, drop the leading `%`):
+```python
+%pip install selenium beautifulsoup4 pandas
+```
+
+2. Download WebDriver
+Selenium requires a WebDriver to interact with the browser. For this tutorial, we will use ChromeDriver.
+
+Download the ChromeDriver version that matches your Chrome browser version from the following link:
+https://sites.google.com/chromium.org/driver/
+
+Add the ```ChromeDriver``` executable to your system's PATH or place it in the same directory as your script.
+
+%% Cell type:markdown id: tags:
+
+### Step 2: Import Required Libraries
+- Selenium: Used to automate the browser.
+
+- BeautifulSoup: Used to parse HTML content.
+
+- Pandas: Used to store data in a CSV file.
+
+- Time: Used to add delays (e.g., waiting for the page to load).
+
+%% Cell type:code id: tags:
+
+``` python
+from selenium import webdriver
+from selenium.webdriver.common.by import By
+from selenium.webdriver.common.keys import Keys
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
+from bs4 import BeautifulSoup
+import pandas as pd
+import time
+```
+
+%% Cell type:markdown id: tags:
+
+### Step 3: Initialize the WebDriver
+Make sure chromedriver is in your system's PATH or in the same directory as your script.
+
+%% Cell type:code id: tags:
+
+``` python
+driver = webdriver.Chrome()  # Ensure chromedriver is in your PATH
+```
+
+%% Cell type:markdown id: tags:
+
+### Step 4: Define search keywords
+This is the keyword we will use to search for jobs on the website.
+
+%% Cell type:code id: tags:
+
+``` python
+search_keyword = "Data Science"
+```
+
+%% Cell type:markdown id: tags:
+
+### Step 5: Construct and Navigate to the URL
+We have to construct the URL by replacing spaces in the search keyword with %20 (URL encoding for spaces).
+
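Replacing spaces by hand covers this keyword, but other characters (such as `&` or `+`) also need encoding. The following is a minimal sketch using the standard library's `urllib.parse.quote`. The helper name `build_search_url` is an assumption made for illustration, and the endpoint is the one used in this notebook.

```python
from urllib.parse import quote

# Same search endpoint as in the notebook.
BASE_URL = "https://remote.co/remote-jobs/search"


def build_search_url(keyword):
    """Percent-encode the keyword and append it as the searchkeyword parameter."""
    # safe='' makes quote() encode every reserved character, including '/'.
    encoded = quote(keyword, safe="")
    return f"{BASE_URL}?searchkeyword={encoded}"


print(build_search_url("Data Science"))
```

For a plain keyword such as `"Data Science"` this produces the same URL as the `replace(" ", "%20")` approach.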
+%% Cell type:code id: tags:
+
+``` python
+url = f'https://remote.co/remote-jobs/search?searchkeyword={search_keyword.replace(" ", "%20")}'
+```
+
+%% Cell type:markdown id: tags:
+
+The next cell tells the WebDriver to open the constructed URL in the browser.
+
+%% Cell type:code id: tags:
+
+``` python
+driver.get(url)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+time.sleep(5)
+```
+
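The fixed `time.sleep(5)` gives the page time to load and keeps requests spaced out. When several pages are fetched in a loop, a slightly randomized pause is a common variation. The sketch below works under that assumption; `polite_pause` and its default values are hypothetical and not part of the notebook.

```python
import random
import time


def polite_pause(base_seconds=5.0, jitter_seconds=2.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    The sleep function is a parameter so the helper can be exercised
    without actually waiting.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay


# Example: record the delay instead of sleeping.
recorded = []
chosen = polite_pause(1.0, 0.5, sleep=recorded.append)
print(chosen, recorded)
```

Calling `polite_pause()` with no arguments between page loads would wait between 5 and 7 seconds each time.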
+%% Cell type:code id: tags:
+
+``` python
+
+WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'sc-jv5lm6-0')))
+```
+
+%% Output
+
+    <selenium.webdriver.remote.webelement.WebElement (session="F0DF4FC9-1BC7-4191-AB9B-E7B3CB2EFDC3", element="node-83A1A8FD-F3D8-4391-81A1-C6C6D2458FD6")>
+
+%% Cell type:code id: tags:
+
+``` python
+soup = BeautifulSoup(driver.page_source, 'html.parser')
+```
+
+%% Cell type:code id: tags:
+
+``` python
+job_listings = soup.find_all('div', class_='sc-jv5lm6-0')
+```
+
+%% Cell type:code id: tags:
+
+``` python
+jobs = []
+
+for job in job_listings:
+    title = job.find('a', class_='sc-jv5lm6-13')
+    title = title.text.strip() if title else 'N/A'
+
+    company = job.find('span', class_='sc-jv5lm6-5')
+    company = company.text.strip() if company else 'N/A'
+
+    location = job.find('span', class_='sc-jv5lm6-10')
+    location = location.text.strip() if location else 'Remote'
+
+    link = job.find('a', class_='sc-jv5lm6-13')
+    link = 'https://remote.co' + link['href'] if link else 'N/A'
+
+    salary = job.find('li', id=lambda x: x and x.startswith('salartRange-'))
+    salary = salary.text.strip() if salary else 'N/A'
+
+    job_type = job.find('li', id=lambda x: x and x.startswith('jobTypes-'))
+    job_type = job_type.text.strip() if job_type else 'N/A'
+
+    jobs.append({
+        'Title': title,
+        'Company': company,
+        'Location': location,
+        'Link': link,
+        'Salary': salary,
+        'Job Type': job_type
+    })
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.DataFrame(jobs)
+df.shape
+```
+
+%% Output
+
+    (50, 6)
+
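The introduction mentions storing the scraped data in a CSV file. The following is a minimal sketch of that step using the standard-library `csv` module on a list of dictionaries shaped like `jobs`. The sample row and the helper name `write_jobs_csv` are hypothetical.

```python
import csv
import io

# Column names match the keys used when building the jobs list.
FIELDNAMES = ["Title", "Company", "Location", "Link", "Salary", "Job Type"]


def write_jobs_csv(jobs, handle):
    """Write a list of job dictionaries to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(jobs)


# Hypothetical single row, shaped like the dictionaries appended to jobs.
sample_jobs = [{
    "Title": "Data Scientist",
    "Company": "N/A",
    "Location": "Remote, US National",
    "Link": "https://remote.co/job-details/example",
    "Salary": "N/A",
    "Job Type": "Employee",
}]

buffer = io.StringIO()  # stands in for open("jobs_raw.csv", "w", newline="")
write_jobs_csv(sample_jobs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With the DataFrame already built, `df.to_csv("jobs_raw.csv", index=False)` writes the same table in one call; the file name is a placeholder.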
+%% Cell type:code id: tags:
+
+``` python
+driver.quit()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df.head()
+```
+
+%% Output
+
+                                                   Title Company  \
+    0  Science Water Ecologist, Bureau of Environment...
+    1            Data Science Manager, Marketing Science
+    2          Senior Data Scientist - Core Data Science
+    3               Data Scientist, Product Data Science
+    4  Solutions Architecture Architect – Data Scienc...
+    
+                                                Location  \
+    0                 Hybrid Remote in New York City, NY
+    1  Hybrid Remote in Seattle, WA, Palo Alto, CA, S...
+    2                                Remote, US National
+    3                                   Remote in Canada
+    4                       Hybrid Remote in Herndon, VA
+    
+                                                    Link  \
+    0  https://remote.co/job-details/science-water-ec...
+    1  https://remote.co/job-details/data-science-man...
+    2  https://remote.co/job-details/senior-data-scie...
+    3  https://remote.co/job-details/data-scientist-p...
+    4  https://remote.co/job-details/solutions-archit...
+    
+                             Salary  Job Type
+    0    $49,653 - $57,101 Annually  Employee
+    1  $191,840 - $335,720 Annually  Employee
+    2  $180,370 - $212,200 Annually  Employee
+    3                           N/A  Employee
+    4  $183,498 - $207,000 Annually  Employee
+
+%% Cell type:code id: tags:
+
+``` python
+df.dtypes
+```
+
+%% Output
+
+    Title       object
+    Company     object
+    Location    object
+    Link        object
+    Salary      object
+    Job Type    object
+    dtype: object
+
+%% Cell type:code id: tags:
+
+``` python
+# TODO: Write a function called extract_salary_range(salary) that extracts, from the salary information in the dataset, the salary from, the salary to, and the range ("annually", "monthly")
+
+def extract_salary_range(salary):
+    if pd.isna(salary) or salary == 'N/A':
+        return None, None, None
+
+    # Your Code HERE
+
+    return salary_range, salary_from, salary_to
+
+df[['Salary Range', 'Salary From', 'Salary To']] = df['Salary'].apply(
+    lambda x: pd.Series(extract_salary_range(x))
+)
+```
+
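One possible way to approach the parsing is a regular expression over strings shaped like `$49,653 - $57,101 Annually`. The sketch below is not the intended solution to the exercise: `parse_salary_text` is a hypothetical helper, and its return order (low, high, period) is an assumption that differs from the exercise's signature.

```python
import re

# Two amounts separated by a dash, optionally followed by a period word.
_SALARY_RE = re.compile(
    r"\$?(\d[\d,]*(?:\.\d+)?)\s*-\s*\$?(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)?"
)


def parse_salary_text(text):
    """Return (low, high, period), or (None, None, None) if nothing matches."""
    match = _SALARY_RE.search(text or "")
    if match is None:
        return None, None, None
    # Remove thousands separators before converting to float.
    low = float(match.group(1).replace(",", ""))
    high = float(match.group(2).replace(",", ""))
    period = match.group(3).lower() if match.group(3) else None
    return low, high, period


print(parse_salary_text("$49,653 - $57,101 Annually"))
print(parse_salary_text("N/A"))
```

Because the thousands separator is removed before conversion, `$49,653` becomes `49653.0` here, whereas the sample output in this notebook shows `49.653`.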
+%% Cell type:code id: tags:
+
+``` python
+df.head()
+```
+
+%% Output
+
+                                                   Title Company  \
+    0  Science Water Ecologist, Bureau of Environment...
+    1            Data Science Manager, Marketing Science
+    2          Senior Data Scientist - Core Data Science
+    3               Data Scientist, Product Data Science
+    4  Solutions Architecture Architect – Data Scienc...
+    
+                                                Location  \
+    0                 Hybrid Remote in New York City, NY
+    1  Hybrid Remote in Seattle, WA, Palo Alto, CA, S...
+    2                                Remote, US National
+    3                                   Remote in Canada
+    4                       Hybrid Remote in Herndon, VA
+    
+                                                    Link  \
+    0  https://remote.co/job-details/science-water-ec...
+    1  https://remote.co/job-details/data-science-man...
+    2  https://remote.co/job-details/senior-data-scie...
+    3  https://remote.co/job-details/data-scientist-p...
+    4  https://remote.co/job-details/solutions-archit...
+    
+                             Salary  Job Type         Salary Range  Salary From  \
+    0    $49,653 - $57,101 Annually  Employee    $49,653 - $57,101       49.653
+    1  $191,840 - $335,720 Annually  Employee  $191,840 - $335,720      191.840
+    2  $180,370 - $212,200 Annually  Employee  $180,370 - $212,200      180.370
+    3                           N/A  Employee                 None          NaN
+    4  $183,498 - $207,000 Annually  Employee  $183,498 - $207,000      183.498
+    
+       Salary To
+    0     57.101
+    1    335.720
+    2    212.200
+    3        NaN
+    4    207.000
+
+%% Cell type:markdown id: tags:
+
+### ToDo
+1. Find a website that allows scraping (check the robots.txt)
+2. Scrape the relevant data
+3. Pre-process the data and conduct an EDA
+4. Submit the following:
+   - The scraping notebook
+   - The CSV file before processing
+   - The EDA notebook and the processed CSV file
+
+**Deadline: April 5th, 2025 at 14:00**
+
+%% Cell type:markdown id: tags:
+