"This tutorial will guide you step-by-step on how to use Selenium and BeautifulSoup to scrape job listings from a website. We will also use Pandas to store the scraped data in a CSV file. This tutorial is designed for beginners, so every step will be explained in detail.\n",
"\n",
"1. Introduction\n",
"What is Web Scraping?\n",
"Web scraping is the process of extracting data from websites. It involves sending a request to a website, downloading the HTML content, and then parsing it to extract useful information.\n",
"\n",
"Tools Used in This Tutorial\n",
"- ```Selenium:``` A tool for automating web browsers. It allows you to interact with web pages as if you were using a real browser.\n",
"\n",
"- ```BeautifulSoup```: A Python library for parsing HTML and XML documents. It makes it easy to extract data from web pages.\n",
"\n",
"- ```Pandas```: A Python library for data manipulation and analysis. We will use it to store the scraped data in a CSV file.\n",
"\n",
"- ```WebDriver```: A tool that allows Selenium to control a web browser (e.g., Chrome, Firefox).\n",
"\n",
"#### Important notes \n",
"- Respect the website's robots.txt file: Before scraping, check the website's robots.txt file (e.g., https://awebsite.com/robots.txt) to ensure you are allowed to scrape the site.\n",
"\n",
"- Avoid overloading the server: Add delays between requests to avoid overwhelming the website's server.\n",
"\n",
"- **Legal considerations: Ensure that your scraping activities comply with the website's terms of service and local laws.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Setting Up Your Environment\n",
"Before we start coding, you need to set up your environment.\n",
"\n",
"1. Install Required Libraries\n",
"Run the following commands in your terminal or command prompt to install the required libraries:\n",
"```python \n",
"%pip install selenium beautifulsoup4 pandas\n",
"```\n",
"\n",
"Download WebDriver\n",
"Selenium requires a WebDriver to interact with the browser. For this tutorial, we will use ChromeDriver.\n",
"\n",
"2. Download the ChromeDriver version that matches your Chrome browser version from the following link.\n",
"#TODO: Write a function called extract_salary_range(salary) that extracts from the salary information in the dataset the salary from, the salaray to and the range (\"annually\"), \"monthly\"\n",
"1. Find a website that allows scraping (check the robots.txt)\n",
"2. Scrape the relevant data\n",
"3. Pre-process the data and conduct an EDA\n",
"4. Submit the \n",
"- The scraping Notebook\n",
"- CSV file before processing\n",
"- The EDA notebook and the processed CSV file \n",
"\n",
"**Deadline : April 5th 2025 at 14:00**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
%% Cell type:markdown id: tags:
### Web scraping tutorial
%% Cell type:markdown id: tags:
Download the ChromeDriver version that matches your Chrome browser version from the following link:
https://sites.google.com/chromium.org/driver/
Add the `ChromeDriver` executable to your system's PATH, or place it in the same directory as your script.
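To confirm the driver is actually reachable before launching Selenium, you can check your PATH from Python using only the standard library. `chromedriver` is the executable name on macOS/Linux; on Windows it is `chromedriver.exe`:

```python
import shutil

# shutil.which looks an executable up on the system PATH and
# returns its full path if found, otherwise None.
for name in ("chromedriver", "chromedriver.exe"):
    path = shutil.which(name)
    if path:
        print(f"Found driver at: {path}")
        break
else:
    print("chromedriver not found on PATH - Selenium may fail to start Chrome.")
```

If this prints the "not found" message, revisit the PATH step above or pass the driver's location to Selenium explicitly.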
%% Cell type:markdown id: tags:
### Step 2: Import Required Libraries
- Selenium: Used to automate the browser.
- BeautifulSoup: Used to parse HTML content.
- Pandas: Used to store data in a CSV file.
- Time: Used to add delays (e.g., waiting for the page to load).
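As an aside on the `time` entry above: a common courtesy pattern is a small randomized pause between page loads so you do not hammer the server. The helper below is an illustrative sketch; the delay values are arbitrary, not prescribed by this tutorial:

```python
import random
import time

def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for `base` plus up to `jitter` extra seconds between requests.
    Randomizing the delay keeps the request pattern less robotic."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay

# Tiny values here purely so the demonstration finishes instantly.
d = polite_sleep(base=0.01, jitter=0.01)
print(f"slept for {d:.3f} s")
```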
#TODO: Write a function called extract_salary_range(salary) that extracts, from the salary information in the dataset, the lower salary bound, the upper salary bound, and the pay period ("annually" or "monthly").
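If you want a starting point for the TODO above, one possible sketch follows. It assumes salary strings look like `"$40,000 - $60,000 annually"`; the real dataset's format may differ, so treat the regex and return shape as placeholders:

```python
import re

def extract_salary_range(salary):
    """Sketch: parse a salary string into (salary_from, salary_to, period).

    Assumes formats such as '$40,000 - $60,000 annually' or '$3,500 monthly';
    these format assumptions are illustrative, not taken from the dataset.
    """
    if not isinstance(salary, str):
        return (None, None, None)
    # Detect the pay period keyword, if any.
    period = None
    for p in ("annually", "monthly"):
        if p in salary.lower():
            period = p
            break
    # Pull out numbers, tolerating thousands separators.
    nums = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*\.?\d*", salary)]
    salary_from = nums[0] if nums else None
    salary_to = nums[1] if len(nums) > 1 else salary_from
    return (salary_from, salary_to, period)

print(extract_salary_range("$40,000 - $60,000 annually"))  # -> (40000.0, 60000.0, 'annually')
```

A single-value string yields the same lower and upper bound, and non-string input returns `(None, None, None)` so the function is safe to apply over a pandas column.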