Web Scraping Covid-19 Data with Python (Full Source Code)

In this article, you’ll learn how to scrape Covid-19 data from the web using Python.

Prerequisites:

  1. Python Basics
  2. pandas Basics
  3. HTML Basics
  4. CSS Basics
  5. Beautiful Soup/bs4 module
  6. requests module
  7. lxml module

What is Web Scraping?

  • Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website.
  • This information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API.

Two important points to be taken into consideration here:

  1. Always be respectful and try to get permission to scrape. Do not bombard a website with scraping requests; otherwise, your IP address may get blocked!
  2. Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.
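One way to act on the first point is to check a site's robots.txt rules and pause between requests. The sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, the "my-scraper" user agent and the helper names allowed and polite_pause are all made up for illustration, and the rules are inlined so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="my-scraper"):
    """Return True if the parsed robots.txt rules permit fetching this URL."""
    return parser.can_fetch(user_agent, url)


def polite_pause(user_agent="my-scraper", default_seconds=1.0):
    """Sleep for the crawl delay the site asks for, or a default if none is set."""
    delay = parser.crawl_delay(user_agent)
    time.sleep(delay if delay is not None else default_seconds)


print(allowed("https://example.com/coronavirus/"))
print(allowed("https://example.com/private/data"))
```

In a real scraper you would call polite_pause() between consecutive requests. For a live site, RobotFileParser can also load the rules itself with set_url() followed by read().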

The Process:

  1. Request a response from the webpage
  2. Parse and extract with the help of Beautiful Soup and lxml
  3. Export the data with pandas to a CSV file

Uses:

  • Web scraping can serve several purposes. The most popular ones are investment decision making, competitor monitoring, news monitoring, market trend analysis, appraising property value, estimating rental yields, and politics and campaigns.

Covid-19 Data Source

We will fetch the data from the Worldometer website. Worldometer publishes a table that lists all countries together with their current reported coronavirus cases, new cases for the day, total deaths, new deaths for the day, and so on. The code in this article reads the three headline counters from Worldometer’s United States page.

HTML

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <h1> Scraping </h1>
    <p> Hello </p>
  </body>
</html>

  • <!DOCTYPE html>: HTML documents must start with a type declaration.
  • The HTML document is contained between <html> and </html>.
  • Metadata and script declarations for the HTML document go between <head> and </head>.
  • The visible part of the HTML document is between the <body> and </body> tags.
  • Headings are defined with the <h1> through <h6> tags.
  • Paragraphs are defined with the <p> tag.
  • Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
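To see how these tags map onto parsed objects, here is a small sketch that feeds a document shaped like the sample above to Beautiful Soup and reads the heading and the paragraph. It assumes bs4 is already installed, and it uses Python's built-in "html.parser" so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <h1> Scraping </h1>
    <p> Hello </p>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text(strip=True) removes
# the surrounding whitespace from the tag's text.
heading = soup.find("h1").get_text(strip=True)
paragraph = soup.find("p").get_text(strip=True)

print(heading)
print(paragraph)
```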

Install Necessary Modules:

Open your command prompt and run the following commands one at a time:

pip install requests
pip install lxml
pip install bs4
pip install pandas

Requests:

  • Use the requests library to grab the page.
  • This may fail if you have a firewall blocking Python/Jupyter.
  • Sometimes you need to run this twice if it fails the first time.
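Rather than rerunning the request by hand when it fails, the retry can be automated. The helper below is a minimal sketch, not part of the requests library: fetch_with_retry and its parameters are names invented for this example, and flaky_fetch is a stand-in for a real network call so that the code runs offline.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay_seconds=1.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that performs the request, for
    example: lambda: requests.get(url, timeout=10)
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # The exceptions raised by requests derive from OSError.
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


# Stand-in for a flaky network call: it fails once, then returns
# a fake page body.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, attempts=3, delay_seconds=0)
print(page)
```

Passing the request as a callable keeps the helper independent of any one HTTP library. A non-zero delay_seconds also helps avoid hammering the site between attempts.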

Beautiful Soup:

  • The BeautifulSoup library has many built-in tools and methods for extracting information from a string of HTML (basically an HTML file).
  • It is a Python library for pulling data out of HTML and XML files.
  • Using BeautifulSoup we can create a “soup” object that contains all the “ingredients” of the webpage.

Once the modules are installed, we can import them in our Python code.

Source Code:

Requests

# Import the necessary module!
import requests

# Request the webpage
url = "https://www.worldometers.info/coronavirus/country/us/"
result = requests.get(url)

# result.text holds the page's HTML as a string
#result.text

Beautiful Soup

# Import the necessary module!
import bs4

# Create the soup object
# Pass two arguments: the result.text string and the parser name 'lxml'
soup = bs4.BeautifulSoup(result.text, 'lxml')

#soup
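If lxml is not installed, Beautiful Soup raises bs4.FeatureNotFound when asked for the 'lxml' parser. The sketch below falls back to Python's built-in 'html.parser' in that case. The helper name make_soup is made up for this example, and the short HTML string stands in for result.text.

```python
import bs4


def make_soup(html):
    """Parse html with lxml when available, else with the built-in parser."""
    try:
        return bs4.BeautifulSoup(html, 'lxml')
    except bs4.FeatureNotFound:
        # lxml is missing, so use the parser from the standard library.
        return bs4.BeautifulSoup(html, 'html.parser')


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```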

Extracting the data

Find the div

# Find every div with the class 'maincounter-number'
cases = soup.find_all('div', class_='maincounter-number')
cases

Output:

[<div class="maincounter-number"> <span style="color:#aaa">35,185,064 </span> </div>, <div class="maincounter-number"> <span>626,717</span> </div>, <div class="maincounter-number" style="color:#8ACA2B "> <span>29,507,148</span> </div>]
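Because websites change often, it can help to test the extraction against a saved snippet and to fail loudly when the expected elements are missing. The sketch below reuses the output shown above as its input. The function name extract_counters and its expected parameter are made up for this example, and 'html.parser' is used so the code runs without lxml.

```python
from bs4 import BeautifulSoup

# Saved copy of the markup shown above, so no network access is needed.
SNIPPET = """
<div class="maincounter-number"> <span style="color:#aaa">35,185,064 </span> </div>
<div class="maincounter-number"> <span>626,717</span> </div>
<div class="maincounter-number" style="color:#8ACA2B "> <span>29,507,148</span> </div>
"""


def extract_counters(html, expected=3):
    """Return the text of each counter, or raise if the layout looks different."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_="maincounter-number")
    if len(divs) != expected:
        raise ValueError(
            f"expected {expected} counters, found {len(divs)}; "
            "the page layout may have changed"
        )
    # get_text(strip=True) also removes stray whitespace around the number.
    return [div.find("span").get_text(strip=True) for div in divs]


counters = extract_counters(SNIPPET)
print(counters)
```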

Storing the data

# Python list
data = []

# Find the span inside each div and collect its text
for case in cases:
    # The number sits in a span inside the div
    span = case.find('span')
    # span.string is the text inside the span
    data.append(span.string)
    
data

Output:

['35,185,064 ', '626,717', '29,507,148']
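The scraped values are strings with thousands separators, and the first one carries a trailing space, so they cannot be used in arithmetic yet. A minimal sketch of converting them to integers follows. The helper name to_int is made up for this example, and the list is copied from the output above.

```python
raw = ['35,185,064 ', '626,717', '29,507,148']


def to_int(text):
    """Strip whitespace and thousands separators, then convert to int."""
    return int(text.strip().replace(',', ''))


numbers = [to_int(item) for item in raw]
print(numbers)
```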

Exporting the data

# Import the necessary module!
import pandas as pd

# Create a dataframe
df = pd.DataFrame({'CoronaData': data})

df

Output:

     CoronaData
0    35,185,064
1    626,717
2    29,507,148

# Label the rows
df.index = ["TotalCases", "TotalDeaths", "TotalRecovered"]

df

Output:

                  CoronaData
TotalCases        35,185,064
TotalDeaths       626,717
TotalRecovered    29,507,148

# Export the data to a CSV file
df.to_csv('Covid-19_data.csv')

Output:

Covid-19_data.csv is created in the current working directory.
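To check what an export contains, you can write the CSV to an in-memory buffer and read it back. The sketch below assumes the counts have already been converted to integers. The index label "Metric" is a name chosen for this example, and the figures are the ones shown earlier in the article.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"CoronaData": [35185064, 626717, 29507148]},
    index=["TotalCases", "TotalDeaths", "TotalRecovered"],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index_label="Metric")

# Rewind the buffer and load the CSV back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Metric")
print(restored)
```

Giving the index a label means the first column of the CSV gets a header, which makes the file easier to read in a spreadsheet.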