diff --git a/WebScraping/webscraping.ipynb b/WebScraping/webscraping.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..3c76aa5421c69c4f72584ca8a60bbd18aee3ff86 --- /dev/null +++ b/WebScraping/webscraping.ipynb @@ -0,0 +1,627 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Web scraping tutorial\n", + "\n", + "This tutorial walks you step by step through using Selenium and BeautifulSoup to scrape job listings from a website. We will also use Pandas to store the scraped data in a CSV file. This tutorial is designed for beginners, so every step will be explained in detail.\n", + "\n", + "1. Introduction\n", + "What is Web Scraping?\n", + "Web scraping is the process of extracting data from websites. It involves sending a request to a website, downloading the HTML content, and then parsing it to extract useful information.\n", + "\n", + "Tools Used in This Tutorial\n", + "- ```Selenium```: A tool for automating web browsers. It allows you to interact with web pages as if you were using a real browser.\n", + "\n", + "- ```BeautifulSoup```: A Python library for parsing HTML and XML documents. It makes it easy to extract data from web pages.\n", + "\n", + "- ```Pandas```: A Python library for data manipulation and analysis. 
We will use it to store the scraped data in a CSV file.\n", + "\n", + "- ```WebDriver```: A tool that allows Selenium to control a web browser (e.g., Chrome, Firefox).\n", + "\n", + "#### Important notes\n", + "- Respect the website's robots.txt file: Before scraping, check the website's robots.txt file (e.g., https://awebsite.com/robots.txt) to ensure you are allowed to scrape the site.\n", + "\n", + "- Avoid overloading the server: Add delays between requests to avoid overwhelming the website's server.\n", + "\n", + "- **Legal considerations: Ensure that your scraping activities comply with the website's terms of service and local laws.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Setting Up Your Environment\n", + "Before we start coding, you need to set up your environment.\n", + "\n", + "1. Install Required Libraries\n", + "Run the following command in a notebook cell to install the required libraries (in a terminal, drop the leading `%`):\n", + "```python \n", + "%pip install selenium beautifulsoup4 pandas\n", + "```\n", + "\n", + "2. Download WebDriver\n", + "Selenium requires a WebDriver to interact with the browser. For this tutorial, we will use ChromeDriver.\n", + "\n", + "Download the ChromeDriver version that matches your Chrome browser version from the following link:\n", + "https://sites.google.com/chromium.org/driver/\n", + "\n", + "Add the ```ChromeDriver``` executable to your system's PATH or place it in the same directory as your script." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Import Required Libraries\n", + "- Selenium: Used to automate the browser.\n", + "\n", + "- BeautifulSoup: Used to parse HTML content.\n", + "\n", + "- Pandas: Used to store data in a CSV file.\n", + "\n", + "- time: Used to add delays (e.g., waiting for the page to load)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "from selenium import webdriver\n", + "from selenium.webdriver.common.by import By\n", + "from selenium.webdriver.common.keys import Keys\n", + "from selenium.webdriver.support.ui import WebDriverWait\n", + "from selenium.webdriver.support import expected_conditions as EC\n", + "from bs4 import BeautifulSoup\n", + "import pandas as pd\n", + "import time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 3: Initialize the WebDriver\n", + "Make sure chromedriver is in your system's PATH or in the same directory as your script." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "driver = webdriver.Chrome() # Ensure chromedriver is in your PATH" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 4: Define the search keyword\n", + "This is the keyword we will use to search for jobs on the website." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "search_keyword = \"Data Science\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 5: Construct and navigate to the URL\n", + "We construct the search URL by replacing spaces in the keyword with %20 (the URL encoding for a space).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "url = f'https://remote.co/remote-jobs/search?searchkeyword={search_keyword.replace(\" \", \"%20\")}'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This line tells the WebDriver to open the constructed URL in the browser." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "driver.get(url)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, 
"outputs": [], + "source": [ + "time.sleep(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<selenium.webdriver.remote.webelement.WebElement (session=\"F0DF4FC9-1BC7-4191-AB9B-E7B3CB2EFDC3\", element=\"node-83A1A8FD-F3D8-4391-81A1-C6C6D2458FD6\")>" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\n", + "WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'sc-jv5lm6-0')))" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "soup = BeautifulSoup(driver.page_source, 'html.parser')" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "job_listings = soup.find_all('div', class_='sc-jv5lm6-0')" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "jobs = []\n", + "\n", + "for job in job_listings:\n", + " title = job.find('a', class_='sc-jv5lm6-13')\n", + " title = title.text.strip() if title else 'N/A'\n", + "\n", + " company = job.find('span', class_='sc-jv5lm6-5')\n", + " company = company.text.strip() if company else 'N/A'\n", + "\n", + " location = job.find('span', class_='sc-jv5lm6-10')\n", + " location = location.text.strip() if location else 'Remote'\n", + "\n", + " link = job.find('a', class_='sc-jv5lm6-13')\n", + " link = 'https://remote.co' + link['href'] if link else 'N/A'\n", + "\n", + " salary = job.find('li', id=lambda x: x and x.startswith('salartRange-'))\n", + " salary = salary.text.strip() if salary else 'N/A'\n", + "\n", + " job_type = job.find('li', id=lambda x: x and x.startswith('jobTypes-'))\n", + " job_type = job_type.text.strip() if job_type else 'N/A'\n", + "\n", + " jobs.append({\n", + " 'Title': title,\n", + " 'Company': company,\n", + " 'Location': location,\n", + " 'Link': link,\n", + " 
'Salary': salary,\n", + " 'Job Type': job_type\n", + " })" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(50, 6)" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame(jobs)\n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "driver.quit()" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>Title</th>\n", + " <th>Company</th>\n", + " <th>Location</th>\n", + " <th>Link</th>\n", + " <th>Salary</th>\n", + " <th>Job Type</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>Science Water Ecologist, Bureau of Environment...</td>\n", + " <td></td>\n", + " <td>Hybrid Remote in New York City, NY</td>\n", + " <td>https://remote.co/job-details/science-water-ec...</td>\n", + " <td>$49,653 - $57,101 Annually</td>\n", + " <td>Employee</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>Data Science Manager, Marketing Science</td>\n", + " <td></td>\n", + " <td>Hybrid Remote in Seattle, WA, Palo Alto, CA, S...</td>\n", + " <td>https://remote.co/job-details/data-science-man...</td>\n", + " <td>$191,840 - $335,720 Annually</td>\n", + " <td>Employee</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>Senior Data Scientist - Core Data Science</td>\n", + " <td></td>\n", + " <td>Remote, US 
National</td>\n", + " <td>https://remote.co/job-details/senior-data-scie...</td>\n", + " <td>$180,370 - $212,200 Annually</td>\n", + " <td>Employee</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>Data Scientist, Product Data Science</td>\n", + " <td></td>\n", + " <td>Remote in Canada</td>\n", + " <td>https://remote.co/job-details/data-scientist-p...</td>\n", + " <td>N/A</td>\n", + " <td>Employee</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>Solutions Architecture Architect – Data Scienc...</td>\n", + " <td></td>\n", + " <td>Hybrid Remote in Herndon, VA</td>\n", + " <td>https://remote.co/job-details/solutions-archit...</td>\n", + " <td>$183,498 - $207,000 Annually</td>\n", + " <td>Employee</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " Title Company \\\n", + "0 Science Water Ecologist, Bureau of Environment... \n", + "1 Data Science Manager, Marketing Science \n", + "2 Senior Data Scientist - Core Data Science \n", + "3 Data Scientist, Product Data Science \n", + "4 Solutions Architecture Architect – Data Scienc... \n", + "\n", + " Location \\\n", + "0 Hybrid Remote in New York City, NY \n", + "1 Hybrid Remote in Seattle, WA, Palo Alto, CA, S... \n", + "2 Remote, US National \n", + "3 Remote in Canada \n", + "4 Hybrid Remote in Herndon, VA \n", + "\n", + " Link \\\n", + "0 https://remote.co/job-details/science-water-ec... \n", + "1 https://remote.co/job-details/data-science-man... \n", + "2 https://remote.co/job-details/senior-data-scie... \n", + "3 https://remote.co/job-details/data-scientist-p... \n", + "4 https://remote.co/job-details/solutions-archit... 
\n", + "\n", + " Salary Job Type \n", + "0 $49,653 - $57,101 Annually Employee \n", + "1 $191,840 - $335,720 Annually Employee \n", + "2 $180,370 - $212,200 Annually Employee \n", + "3 N/A Employee \n", + "4 $183,498 - $207,000 Annually Employee " + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Title object\n", + "Company object\n", + "Location object\n", + "Link object\n", + "Salary object\n", + "Job Type object\n", + "dtype: object" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: Write a function called extract_salary_range(salary) that extracts the salary from, the salary to, and the range (\"annually\", \"monthly\") from the salary information in the dataset\n", + "\n", + "def extract_salary_range(salary):\n", + " if pd.isna(salary) or salary == 'N/A':\n", + " return None, None, None\n", + " \n", + " # Your Code HERE\n", + " \n", + " return salary_range, salary_from, salary_to\n", + "\n", + "df[['Salary Range', 'Salary From', 'Salary To']] = df['Salary'].apply(\n", + " lambda x: pd.Series(extract_salary_range(x))\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " 
<th>Title</th>\n", + " <th>Company</th>\n", + " <th>Location</th>\n", + " <th>Link</th>\n", + " <th>Salary</th>\n", + " <th>Job Type</th>\n", + " <th>Salary Range</th>\n", + " <th>Salary From</th>\n", + " <th>Salary To</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>Science Water Ecologist, Bureau of Environment...</td>\n", + " <td></td>\n", + " <td>Hybrid Remote in New York City, NY</td>\n", + " <td>https://remote.co/job-details/science-water-ec...</td>\n", + " <td>$49,653 - $57,101 Annually</td>\n", + " <td>Employee</td>\n", + " <td>$49,653 - $57,101</td>\n", + " <td>49.653</td>\n", + " <td>57.101</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>Data Science Manager, Marketing Science</td>\n", + " <td></td>\n", + " <td>Hybrid Remote in Seattle, WA, Palo Alto, CA, S...</td>\n", + " <td>https://remote.co/job-details/data-science-man...</td>\n", + " <td>$191,840 - $335,720 Annually</td>\n", + " <td>Employee</td>\n", + " <td>$191,840 - $335,720</td>\n", + " <td>191.840</td>\n", + " <td>335.720</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>Senior Data Scientist - Core Data Science</td>\n", + " <td></td>\n", + " <td>Remote, US National</td>\n", + " <td>https://remote.co/job-details/senior-data-scie...</td>\n", + " <td>$180,370 - $212,200 Annually</td>\n", + " <td>Employee</td>\n", + " <td>$180,370 - $212,200</td>\n", + " <td>180.370</td>\n", + " <td>212.200</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>Data Scientist, Product Data Science</td>\n", + " <td></td>\n", + " <td>Remote in Canada</td>\n", + " <td>https://remote.co/job-details/data-scientist-p...</td>\n", + " <td>N/A</td>\n", + " <td>Employee</td>\n", + " <td>None</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>Solutions Architecture Architect – Data Scienc...</td>\n", + " <td></td>\n", + " <td>Hybrid Remote in Herndon, VA</td>\n", + " 
<td>https://remote.co/job-details/solutions-archit...</td>\n", + " <td>$183,498 - $207,000 Annually</td>\n", + " <td>Employee</td>\n", + " <td>$183,498 - $207,000</td>\n", + " <td>183.498</td>\n", + " <td>207.000</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " Title Company \\\n", + "0 Science Water Ecologist, Bureau of Environment... \n", + "1 Data Science Manager, Marketing Science \n", + "2 Senior Data Scientist - Core Data Science \n", + "3 Data Scientist, Product Data Science \n", + "4 Solutions Architecture Architect – Data Scienc... \n", + "\n", + " Location \\\n", + "0 Hybrid Remote in New York City, NY \n", + "1 Hybrid Remote in Seattle, WA, Palo Alto, CA, S... \n", + "2 Remote, US National \n", + "3 Remote in Canada \n", + "4 Hybrid Remote in Herndon, VA \n", + "\n", + " Link \\\n", + "0 https://remote.co/job-details/science-water-ec... \n", + "1 https://remote.co/job-details/data-science-man... \n", + "2 https://remote.co/job-details/senior-data-scie... \n", + "3 https://remote.co/job-details/data-scientist-p... \n", + "4 https://remote.co/job-details/solutions-archit... \n", + "\n", + " Salary Job Type Salary Range Salary From \\\n", + "0 $49,653 - $57,101 Annually Employee $49,653 - $57,101 49.653 \n", + "1 $191,840 - $335,720 Annually Employee $191,840 - $335,720 191.840 \n", + "2 $180,370 - $212,200 Annually Employee $180,370 - $212,200 180.370 \n", + "3 N/A Employee None NaN \n", + "4 $183,498 - $207,000 Annually Employee $183,498 - $207,000 183.498 \n", + "\n", + " Salary To \n", + "0 57.101 \n", + "1 335.720 \n", + "2 212.200 \n", + "3 NaN \n", + "4 207.000 " + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ToDo\n", + "1. Find a website that allows scraping (check the robots.txt)\n", + "2. Scrape the relevant data\n", + "3. 
Pre-process the data and conduct an EDA\n", + "4. Submit the following:\n", + "- The scraping notebook\n", + "- The CSV file before processing\n", + "- The EDA notebook and the processed CSV file\n", + "\n", + "**Deadline: April 5th, 2025 at 14:00**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
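
Two ideas from the notebook above, checking robots.txt before scraping and URL-encoding the search keyword, can be sketched with the Python standard library alone. This is a minimal sketch and not the notebook's own code: the helper names `build_search_url` and `is_allowed` and the sample robots.txt rules are hypothetical, chosen only for illustration. The base URL is the one the notebook uses. For a real site you would likely load the live file with `RobotFileParser.set_url(...)` followed by `read()` instead of parsing hard-coded lines.

```python
from urllib.parse import quote
from urllib.robotparser import RobotFileParser

# Base search URL taken from the notebook.
BASE_URL = "https://remote.co/remote-jobs/search?searchkeyword="

# Hypothetical robots.txt rules, used only so the example runs offline.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]


def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword; quote() turns a space into %20
    and also escapes characters such as '&' or '+'."""
    return BASE_URL + quote(keyword)


def is_allowed(robots_lines, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(build_search_url("Data Science"))
    # -> https://remote.co/remote-jobs/search?searchkeyword=Data%20Science
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/jobs"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

Using `quote` rather than `str.replace(" ", "%20")` keeps the URL valid when a keyword contains characters other than spaces that need escaping.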