In this article, you’ll learn how to scrape Covid-19 data from the web using Python.
Prerequisites:
- Python Basics
- pandas Basics
- HTML Basics
- CSS Basics
- Beautiful Soup/bs4 module
- requests module
- lxml module
What is Web Scraping?
- Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website.
- This information is collected and then exported into a format that is more useful for the user, such as a spreadsheet or an API.
Two important points to be taken into consideration here:
- Always be respectful and try to get permission before scraping. Do not bombard a website with requests, or your IP address may get blocked!
- Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.
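One way to act on the first point is to check a site's robots.txt rules and to pause between requests. The sketch below uses Python's standard urllib.robotparser module. The rules, the example.com URLs, the `allowed_urls` helper, and the delay value are hypothetical and chosen for illustration; they are not Worldometer's actual policy.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
# For a real site, call set_url(".../robots.txt") and read() instead.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)


def allowed_urls(urls, user_agent="*", delay_seconds=1.0):
    """Yield only the URLs the rules allow, pausing after each one."""
    for url in urls:
        if parser.can_fetch(user_agent, url):
            yield url
            time.sleep(delay_seconds)  # be polite: space out requests


candidates = [
    "https://example.com/public/page",
    "https://example.com/private/page",
]
permitted = list(allowed_urls(candidates, delay_seconds=0.0))
print(permitted)
```

Only the first URL should pass the check, because the second falls under the disallowed /private/ path.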
The Process:
- Request a response from the webpage
- Parse and extract with the help of Beautiful Soup and lxml
- Export the data with pandas to a CSV file
Uses:
- Web scraping can serve several purposes. The most popular ones are investment decision making, competitor monitoring, news monitoring, market trend analysis, appraising property value, estimating rental yields, politics and campaigns, and many more.
Covid-19 Data Source
We will use the Worldometer website as our data source. We are interested in a table on the site that lists all the countries together with their currently reported coronavirus cases, new cases for the day, total deaths, new deaths for the day, and so on.


HTML
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1> Scraping </h1>
<p> Hello </p>
</body>
</html>
- <!DOCTYPE html>: HTML documents must start with a type declaration.
- The HTML document is contained between <html> and </html>.
- The meta and script declarations of the HTML document are between <head> and </head>.
- The visible part of the HTML document is between the <body> and </body> tags.
- Title headings are defined with the <h1> through <h6> tags.
- Paragraphs are defined with the <p> tag.
- Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
Install Necessary Modules:
Open your command prompt and run each of the following commands individually:
pip install requests
pip install lxml
pip install bs4
pip install pandas
Requests:
- Use the requests library to grab the page.
- This may fail if you have a firewall blocking Python/Jupyter.
- Sometimes you need to run this twice if it fails the first time.
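Since a request can fail on the first try, a small retry wrapper can save rerunning the code by hand. The following is a minimal sketch, not part of the requests library: the `fetch_with_retry` name, its parameters, and the default values are assumptions made for illustration. The `get` parameter exists only so the function can be tried without network access.

```python
import time

import requests


def fetch_with_retry(url, attempts=2, timeout=10, delay_seconds=1.0,
                     get=requests.get):
    """Return the page text, retrying when a request fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # treat HTTP error codes as failures
            return response.text
        except requests.RequestException as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before trying again
    raise last_error


# Example call (needs network access, so it is left commented out):
# html = fetch_with_retry("https://www.worldometers.info/coronavirus/country/us/")
```

Setting a timeout keeps the script from hanging indefinitely when a firewall silently drops the connection.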
Beautiful Soup:
- The BeautifulSoup library has many built-in tools and methods for grabbing information from a string of this nature (basically an HTML file).
- It is a Python library for pulling data out of HTML and XML files.
- Using BeautifulSoup we can create a “soup” object that contains all the “ingredients” of the webpage.
Once the modules are installed, we can import them in our Python code.
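To get a feel for the "soup" object before touching a live website, it can help to parse a small inline document first. The sketch below uses a hypothetical HTML snippet built from the tags described in the HTML section. It uses Python's built-in 'html.parser' so that it runs without lxml; passing 'lxml' instead should work the same way for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for illustration.
html = """
<html>
  <body>
    <h1>Scraping</h1>
    <table>
      <tr><td>Country A</td><td>100</td></tr>
      <tr><td>Country B</td><td>250</td></tr>
    </table>
  </body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' could be used instead.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag.
heading = soup.find("h1").get_text()

# Each <tr> is a row and each <td> is a cell within that row.
rows = [
    [cell.get_text() for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]

print(heading)
print(rows)
```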
Source Code:
Requests
# Import the necessary module
import requests

# Request the webpage
url = "https://www.worldometers.info/coronavirus/country/us/"
result = requests.get(url)
# result.text
Beautiful Soup
# Import the necessary module
import bs4

# Create the soup variable
# Pass two things here: the result.text string and the parser name 'lxml' as a string
soup = bs4.BeautifulSoup(result.text, 'lxml')
# soup
Extracting the data
Find the div
# Use the find_all method to get every div with the given class
cases = soup.find_all('div', class_='maincounter-number')
cases
Output:
[<div class="maincounter-number">
<span style="color:#aaa">35,185,064 </span>
</div>,
<div class="maincounter-number">
<span>626,717</span>
</div>,
<div class="maincounter-number" style="color:#8ACA2B ">
<span>29,507,148</span>
</div>]
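As an alternative to find_all, Beautiful Soup also accepts CSS selectors through its select() method. The sketch below is an illustration, not the approach this article follows: it parses an inline copy of the markup from the output above, so it runs without network access, and it uses 'html.parser' instead of 'lxml' to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup shown in the output above.
html = """
<div class="maincounter-number">
<span style="color:#aaa">35,185,064 </span>
</div>
<div class="maincounter-number">
<span>626,717</span>
</div>
<div class="maincounter-number" style="color:#8ACA2B ">
<span>29,507,148</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector: every <span> inside a div with this class.
spans = soup.select("div.maincounter-number span")

# get_text(strip=True) also trims the trailing space in the first value.
values = [span.get_text(strip=True) for span in spans]
print(values)
```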
Storing the data
# Python list to hold the extracted values
data = []

# Find the span in each div and get the data from it
for i in cases:
    # The number sits inside a span within the div
    span = i.find('span')
    # span.string returns the number as a string
    data.append(span.string)

data
Output:
['35,185,064 ', '626,717', '29,507,148']
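The extracted values are strings that contain thousands separators, and the first one has a trailing space. If you want to do arithmetic on them, they need to be converted to integers first. The following is a minimal sketch; the `to_int` helper is a hypothetical name introduced here and is not part of the original code.

```python
# The values as they appear in the output above.
data = ['35,185,064 ', '626,717', '29,507,148']


def to_int(text):
    """Strip whitespace and thousands separators, then convert to int."""
    return int(text.strip().replace(',', ''))


numbers = [to_int(value) for value in data]
print(numbers)  # [35185064, 626717, 29507148]
```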
Exporting the data
# Import the necessary module
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'CoronaData': data})
df
Output:
 | CoronaData |
---|---|
0 | 35,185,064 |
1 | 626,717 |
2 | 29,507,148 |
# Label the rows by setting the index
df.index = ["TotalCases", "TotalDeaths", "TotalRecovered"]
df
Output:
 | CoronaData |
---|---|
TotalCases | 35,185,064 |
TotalDeaths | 626,717 |
TotalRecovered | 29,507,148 |
# Export the DataFrame to a CSV file
df.to_csv('Covid-19_data.csv')
Output:
Covid-19_data.csv is created.
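To confirm that the export worked, the file can be read back with pandas. The sketch below rebuilds the DataFrame from the values shown earlier so that it runs on its own. Writing to a temporary folder is an assumption made to keep the example tidy; in the article's flow the file is written to the current working directory.

```python
import os
import tempfile

import pandas as pd

# Rebuild the DataFrame from the values shown earlier in the article.
df = pd.DataFrame(
    {'CoronaData': ['35,185,064 ', '626,717', '29,507,148']},
    index=['TotalCases', 'TotalDeaths', 'TotalRecovered'],
)

# Write to a temporary folder (an assumption made for this sketch).
path = os.path.join(tempfile.mkdtemp(), 'Covid-19_data.csv')
df.to_csv(path)

# index_col=0 restores the row labels; dtype=str keeps the commas intact.
loaded = pd.read_csv(path, index_col=0, dtype=str)
print(loaded.loc['TotalDeaths', 'CoronaData'])
```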