Commit 6fd03e90 authored by Bouyahya Zied
Upload New File

parent a23b624b
%% Cell type:markdown id: tags:
### Web scraping tutorial
This tutorial will guide you step-by-step on how to use Selenium and BeautifulSoup to scrape job listings from a website. We will also use Pandas to store the scraped data in a CSV file. This tutorial is designed for beginners, so every step will be explained in detail.
#### 1. Introduction
**What is Web Scraping?**
Web scraping is the process of extracting data from websites. It involves sending a request to a website, downloading the HTML content, and then parsing it to extract useful information.
**Tools Used in This Tutorial**
- ```Selenium```: A tool for automating web browsers. It allows you to interact with web pages as if you were using a real browser.
- ```BeautifulSoup```: A Python library for parsing HTML and XML documents. It makes it easy to extract data from web pages.
- ```Pandas```: A Python library for data manipulation and analysis. We will use it to store the scraped data in a CSV file.
- ```WebDriver```: The browser-specific driver (e.g., ChromeDriver for Chrome, geckodriver for Firefox) that allows Selenium to control a web browser.
#### Important notes
- Respect the website's robots.txt file: Before scraping, check the website's robots.txt file (e.g., https://awebsite.com/robots.txt) to ensure you are allowed to scrape the site.
- Avoid overloading the server: Add delays between requests to avoid overwhelming the website's server.
- **Legal considerations: Ensure that your scraping activities comply with the website's terms of service and local laws.**
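
The robots.txt check described above can also be done in code. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `is_allowed` and the sample rules are assumptions made for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the lines of an already downloaded robots.txt file
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only to demonstrate the check
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://awebsite.com/jobs"))          # True
print(is_allowed(sample_rules, "https://awebsite.com/private/data"))  # False
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string. A robots.txt check does not replace reading the site's terms of service.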
%% Cell type:markdown id: tags:
### Step 1: Setting Up Your Environment
Before we start coding, you need to set up your environment.
1. Install the required libraries.
Run the following command in a notebook cell. In a terminal or command prompt, use ```pip install selenium beautifulsoup4 pandas``` instead (without the leading ```%```):
```python
%pip install selenium beautifulsoup4 pandas
```
2. Download a WebDriver.
Selenium requires a WebDriver to interact with the browser. For this tutorial, we will use ChromeDriver. Download the ChromeDriver version that matches your Chrome browser version from the following link:
https://sites.google.com/chromium.org/driver/
Recent Selenium releases (4.6 and later) include Selenium Manager, which can fetch a matching driver automatically, so the manual download may not be needed.
3. Add the ```ChromeDriver``` executable to your system's PATH or place it in the same directory as your script.
%% Cell type:markdown id: tags:
### Step 2: Import Required Libraries
- Selenium: Used to automate the browser.
- BeautifulSoup: Used to parse HTML content.
- Pandas: Used to store data in a CSV file.
- Time: Used to add delays (e.g., waiting for the page to load).
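
A fixed `time.sleep` always waits the full duration, whereas an explicit wait such as `WebDriverWait(...).until(...)` polls repeatedly and returns as soon as its condition holds. The pure-Python sketch below illustrates that polling idea without a browser; `wait_until` and `page_ready` are hypothetical names, and this is not how Selenium itself is implemented.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the page has finished loading": becomes true on the third check
attempts = {"count": 0}


def page_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_ready, timeout=2, poll_interval=0.01)
print(attempts["count"])  # 3
```

The polling approach returns as soon as the page is ready and fails loudly when it never becomes ready, which a fixed sleep cannot do.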
%% Cell type:code id: tags:
``` python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time
```
%% Cell type:markdown id: tags:
### Step 3: Initialize the WebDriver
Make sure chromedriver is in your system's PATH or in the same directory as your script.
%% Cell type:code id: tags:
``` python
driver = webdriver.Chrome()  # Ensure chromedriver is in your PATH (on macOS, webdriver.Safari() is an alternative)
```
%% Cell type:markdown id: tags:
### Step 4: Define search keywords
This is the keyword we will use to search for jobs on the website.
%% Cell type:code id: tags:
``` python
search_keyword = "Data Science"
```
%% Cell type:markdown id: tags:
### Step 5: Construct and navigate to the URL
We have to construct the URL by replacing spaces in the search keyword with %20 (URL encoding for spaces).
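
Replacing spaces by hand only handles one special character. As a hedged alternative, the standard-library function `urllib.parse.quote` percent-encodes spaces and other reserved characters (such as `&` or `+`) in a single call. The sketch below reuses the search URL from this tutorial.

```python
from urllib.parse import quote

search_keyword = "Data Science"

# quote() percent-encodes characters that are not safe in a URL component
encoded_keyword = quote(search_keyword)
url = f"https://remote.co/remote-jobs/search?searchkeyword={encoded_keyword}"

print(url)  # https://remote.co/remote-jobs/search?searchkeyword=Data%20Science
```

For a keyword such as `"C++ & R"`, a plain `replace(" ", "%20")` would leave `+` and `&` untouched, which can change the meaning of the query string.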
%% Cell type:code id: tags:
``` python
url = f'https://remote.co/remote-jobs/search?searchkeyword={search_keyword.replace(" ", "%20")}'
```
%% Cell type:markdown id: tags:
The next cell tells the WebDriver to open the constructed URL in the browser.
%% Cell type:code id: tags:
``` python
driver.get(url)
```
%% Cell type:code id: tags:
``` python
time.sleep(5)  # Fixed pause so the page can start rendering
```
%% Cell type:code id: tags:
``` python
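# Wait up to 10 seconds until at least one job card is present in the DOM.
# Class names such as 'sc-jv5lm6-0' look auto-generated and may change when the site is updated.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'sc-jv5lm6-0')))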
```
%% Output
<selenium.webdriver.remote.webelement.WebElement (session="F0DF4FC9-1BC7-4191-AB9B-E7B3CB2EFDC3", element="node-83A1A8FD-F3D8-4391-81A1-C6C6D2458FD6")>
%% Cell type:code id: tags:
``` python
# Parse the HTML of the page as currently rendered by the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
```
%% Cell type:code id: tags:
``` python
# Each job card is a <div> with this class
job_listings = soup.find_all('div', class_='sc-jv5lm6-0')
```
%% Cell type:code id: tags:
``` python
jobs = []
for job in job_listings:
    # The title anchor holds both the job title and the relative link
    title_tag = job.find('a', class_='sc-jv5lm6-13')
    title = title_tag.text.strip() if title_tag else 'N/A'
    link = 'https://remote.co' + title_tag['href'] if title_tag else 'N/A'

    company = job.find('span', class_='sc-jv5lm6-5')
    company = company.text.strip() if company else 'N/A'

    # Listings without an explicit location default to 'Remote'
    location = job.find('span', class_='sc-jv5lm6-10')
    location = location.text.strip() if location else 'Remote'

    # The 'salartRange-' spelling is kept from the original selector; the recorded
    # output contains salaries, so it presumably matches the page's markup
    salary = job.find('li', id=lambda x: x and x.startswith('salartRange-'))
    salary = salary.text.strip() if salary else 'N/A'

    job_type = job.find('li', id=lambda x: x and x.startswith('jobTypes-'))
    job_type = job_type.text.strip() if job_type else 'N/A'

    jobs.append({
        'Title': title,
        'Company': company,
        'Location': location,
        'Link': link,
        'Salary': salary,
        'Job Type': job_type
    })
```
%% Cell type:code id: tags:
``` python
df = pd.DataFrame(jobs)
df.shape
```
%% Output
(50, 6)
%% Cell type:code id: tags:
``` python
driver.quit()
```
%% Cell type:code id: tags:
``` python
df.head()
```
%% Output
Title Company \
0 Science Water Ecologist, Bureau of Environment...
1 Data Science Manager, Marketing Science
2 Senior Data Scientist - Core Data Science
3 Data Scientist, Product Data Science
4 Solutions Architecture Architect – Data Scienc...
Location \
0 Hybrid Remote in New York City, NY
1 Hybrid Remote in Seattle, WA, Palo Alto, CA, S...
2 Remote, US National
3 Remote in Canada
4 Hybrid Remote in Herndon, VA
Link \
0 https://remote.co/job-details/science-water-ec...
1 https://remote.co/job-details/data-science-man...
2 https://remote.co/job-details/senior-data-scie...
3 https://remote.co/job-details/data-scientist-p...
4 https://remote.co/job-details/solutions-archit...
Salary Job Type
0 $49,653 - $57,101 Annually Employee
1 $191,840 - $335,720 Annually Employee
2 $180,370 - $212,200 Annually Employee
3 N/A Employee
4 $183,498 - $207,000 Annually Employee
%% Cell type:code id: tags:
``` python
df.dtypes
```
%% Output
Title object
Company object
Location object
Link object
Salary object
Job Type object
dtype: object
%% Cell type:code id: tags:
``` python
# TODO: Write a function called extract_salary_range(salary) that extracts, from the
# salary information in the dataset, the salary from, the salary to, and the range
# (e.g. "annually", "monthly")
def extract_salary_range(salary):
    if pd.isna(salary) or salary == 'N/A':
        return None, None, None
    # Your Code HERE
    return salary_range, salary_from, salary_to


# Apply the function row by row and spread the three results over new columns
df[['Salary Range', 'Salary From', 'Salary To']] = df['Salary'].apply(
    lambda x: pd.Series(extract_salary_range(x))
)
```
%% Cell type:code id: tags:
``` python
df.head()
```
%% Output
Title Company \
0 Science Water Ecologist, Bureau of Environment...
1 Data Science Manager, Marketing Science
2 Senior Data Scientist - Core Data Science
3 Data Scientist, Product Data Science
4 Solutions Architecture Architect – Data Scienc...
Location \
0 Hybrid Remote in New York City, NY
1 Hybrid Remote in Seattle, WA, Palo Alto, CA, S...
2 Remote, US National
3 Remote in Canada
4 Hybrid Remote in Herndon, VA
Link \
0 https://remote.co/job-details/science-water-ec...
1 https://remote.co/job-details/data-science-man...
2 https://remote.co/job-details/senior-data-scie...
3 https://remote.co/job-details/data-scientist-p...
4 https://remote.co/job-details/solutions-archit...
Salary Job Type Salary Range Salary From \
0 $49,653 - $57,101 Annually Employee $49,653 - $57,101 49.653
1 $191,840 - $335,720 Annually Employee $191,840 - $335,720 191.840
2 $180,370 - $212,200 Annually Employee $180,370 - $212,200 180.370
3 N/A Employee None NaN
4 $183,498 - $207,000 Annually Employee $183,498 - $207,000 183.498
Salary To
0 57.101
1 335.720
2 212.200
3 NaN
4 207.000
%% Cell type:markdown id: tags:
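In the output above, values such as 49.653 for a salary of $49,653 suggest that the thousands separator was read as a decimal point. The following is a minimal sketch of one way to parse a single salary string, assuming the "$low - $high Period" format seen in the sample rows. `parse_salary` is a hypothetical helper, not the exercise solution, and it returns the values in a different order than `extract_salary_range`.

```python
import re

# Matches e.g. "$49,653 - $57,101 Annually"; the period word is optional
SALARY_PATTERN = re.compile(
    r"\$(?P<low>[\d,]+(?:\.\d+)?)\s*-\s*\$(?P<high>[\d,]+(?:\.\d+)?)\s*(?P<period>\w+)?"
)


def parse_salary(text):
    """Return (salary_from, salary_to, period) or (None, None, None) if not parseable."""
    if not text or text == "N/A":
        return None, None, None
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None, None, None
    # Remove thousands separators before converting to a number
    low = float(match.group("low").replace(",", ""))
    high = float(match.group("high").replace(",", ""))
    period = match.group("period")
    return low, high, period.lower() if period else None


print(parse_salary("$49,653 - $57,101 Annually"))  # (49653.0, 57101.0, 'annually')
print(parse_salary("N/A"))                         # (None, None, None)
```

Listings with a single amount or a different currency symbol are not matched by this pattern and come back as `(None, None, None)`.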
### ToDo
1. Find a website that allows scraping (check its robots.txt).
2. Scrape the relevant data.
3. Pre-process the data and conduct an exploratory data analysis (EDA).
4. Submit the following:
   - The scraping notebook
   - The CSV file before processing
   - The EDA notebook and the processed CSV file

**Deadline: April 5th, 2025 at 14:00**
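
For the CSV deliverables, a DataFrame can be written out with `DataFrame.to_csv`. The sketch below uses made-up rows and writes to an in-memory buffer so that it runs anywhere; the file name in the final comment is a hypothetical example.

```python
import io

import pandas as pd

# Made-up rows standing in for scraped job listings
jobs = [
    {"Title": "Example Data Scientist", "Company": "Example Corp", "Salary": "N/A"},
    {"Title": "Example Analyst", "Company": "Sample Ltd", "Salary": "$1,000 - $2,000 Monthly"},
]
df = pd.DataFrame(jobs)

# index=False leaves out the row index so the file contains only the data columns
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write an actual file instead: df.to_csv("jobs_raw.csv", index=False)
```

Fields containing commas, such as the salary text, are quoted automatically. When the file is read back with `pd.read_csv`, the string `N/A` is treated as a missing value by default; pass `keep_default_na=False` to keep it as text.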
%% Cell type:markdown id: tags: