Learning Outcomes
- Understand the benefits and use cases of web scraping.
- Learn how to parse the HTML content of a webpage using BeautifulSoup to extract specific elements.
- Learn how to scan the HTML for specific keywords.
- Learn how to scrape multiple web pages.
- Learn how to store your web scraped data into a pandas dataframe.
- Learn how to save the web scraped data as a local .csv file.
The following installation commands are written for a Jupyter Notebook. If you are using a command line instead, simply omit the ! symbol:
!pip install beautifulsoup4
!pip install requests
# Library imports
import pandas as pd
from bs4 import BeautifulSoup
import requests
Why Learn Web Scraping?
Learning web scraping is a useful skill, whether you work as a programmer, marketer or analyst.
It’s a fantastic way to analyse websites. Web scraping should never replace a tool such as ScreamingFrog. However, when you’re creating data pipelines with Python or JavaScript, you’ll likely want to write a custom scraper.
Because what’s the point of doing a website crawl if you only need a few pieces of information per page?
Once you have acquired advanced web scraping skills, you can:
- Accurately monitor your competitors.
- Create data pipelines that push fresh HTML data into a data warehouse such as BigQuery.
- Blend scraped data with other data sources such as Google Search Console or Google Analytics data.
- Create your own APIs for websites that don’t publicly expose an API.
There are many other reasons why web scraping is a powerful skill to possess.
Challenges of Web Scraping
Firstly, every website is different, which makes it difficult to build a robust web scraper that works on every website. You’ll likely need to create unique selectors for each website, which can be time-consuming.
Secondly, your scripts are likely to fail over time because websites change. Whenever a marketer, owner or developer makes changes to their website, your script could break. For larger projects it’s therefore essential to create a monitoring system so that you can fix these problems as they arise.
How To Web Scrape A Single HTML Page:
To scrape a web page in Python, or in any other programming language, we first need to download the HTML content.
The library that we’ll be using is requests.
url = 'https://www.indeed.co.uk/jobs?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411'
response = requests.get(url)
print(response)
As long as the status code is 200 (which means OK), we’ll be able to access the web page. You can always check the status code with:
print(response.status_code)
if response.status_code == 200:
    print(response)
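To make the status code check a little more concrete, here is a minimal sketch that uses Python's built-in http.HTTPStatus to translate a numeric code into its standard phrase. The helper name describe_status is hypothetical (it is not part of requests), and treating every 2xx code as safe to parse is an assumption made for this example.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable summary of a status code and whether it signals success."""
    # HTTPStatus raises ValueError for codes that it does not know about.
    phrase = HTTPStatus(status_code).phrase
    # Assumption for this sketch: any 2xx code means the page can be parsed.
    is_success = 200 <= status_code < 300
    return f'{status_code} {phrase}', is_success


for code in (200, 404, 503):
    summary, ok = describe_status(code)
    print(summary, '-> safe to parse' if ok else '-> skip this page')
```

In a real scraper you would pass response.status_code into a check like this before handing the HTML to a parser.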
To access the content of a request, simply use:
response.content
# This will store the HTML content as a stream of bytes:
html_content = response.content
# This will store the HTML content as a string:
html_content_string = response.text
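The difference between bytes and a string is easier to see with a small offline example. The sketch below does not call requests at all: html_bytes and html_text are stand-ins for response.content and response.text, and UTF-8 is assumed as the page encoding.

```python
# Stand-in for response.content: the raw bytes a server might send.
html_bytes = '<p>Café in London</p>'.encode('utf-8')

# Stand-in for response.text: the same payload decoded into a Python str.
html_text = html_bytes.decode('utf-8')

print(type(html_bytes).__name__, len(html_bytes))
print(type(html_text).__name__, len(html_text))

# The accented character takes two bytes in UTF-8 but counts as one character,
# so the bytes object is one element longer than the string.
print(len(html_bytes) - len(html_text))
```

BeautifulSoup accepts either form, so the choice mostly matters when you search or print the HTML yourself.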
Passing the HTML Content to a Parser
Simply downloading the HTML page is not enough, particularly if we would like to extract elements from it. Therefore we will use a Python package called BeautifulSoup. BeautifulSoup provides us with a large number of DOM (document object model) parsing methods.
In order to parse the DOM of a page, simply use:
soup = BeautifulSoup(html_content, 'html.parser')
help(soup)
We can now see that instead of an HTML byte string, we have a BeautifulSoup object, which has many methods available on it!
In our example, we’ll be web scraping Indeed.co.uk and extracting job information:
- The job will be: data scientist.
- The area will be London.
Investigate The URL
url = 'https://www.indeed.co.uk/jobs?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411'
There can be a lot of information inside of a URL.
It’s important to be able to identify the structure of URLs and to reverse engineer how they might have been created.
- The base URL is the path to the jobs functionality of the website, which in this case is: https://www.indeed.co.uk/jobs
- Query parameters are a way for the jobs search to be dynamic. In the above example they are: ?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411
Query parameters consist of:
- The start of the query string, which is marked by a question mark (?) and is followed by the first parameter, q.
- A key and value for each query parameter (e.g. l=london or start=40).
- A separator which is an ampersand symbol (&) that separates all of the key + value query parameters.
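The structure described above can be taken apart with Python's standard library. This is a minimal sketch using urllib.parse; the idea that start=50 would request a later page of results is an assumption about how the site paginates, not something confirmed here.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = ('https://www.indeed.co.uk/jobs?q=data%20scientist&l=london'
       '&start=40&advn=2102673149993430&vjk=40339845379bc411')

# Split the URL into its components (scheme, domain, path, query string).
parts = urlsplit(url)
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'

# parse_qs decodes %20 into a space and returns a list of values per key.
params = parse_qs(parts.query)
print(base_url)
print(params['q'], params['l'], params['start'])

# Rebuild a URL from a dictionary of parameters.
# Assumption: a larger start value asks for a later page of results.
next_page_query = urlencode({'q': 'data scientist', 'l': 'london', 'start': 50})
next_page_url = f'{base_url}?{next_page_query}'
print(next_page_url)
```

Building URLs from a dictionary like this tends to be less error-prone than concatenating strings by hand, because the encoding of spaces and special characters is handled for you.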
Visually Inspect The Webpage In Google Chrome Dev Tools
Before jumping straight into coding, it’s worthwhile visually inspecting the HTML page content within your browser. This will give you a sense of how the website is constructed and what repeating patterns you can see within the HTML.
Google Chrome Developer Tools is a freely available tool that allows you to visually inspect the HTML code.
Navigate to it by:
- Opening up Google Chrome.
- Right-clicking on a webpage.
- Clicking Inspect.
Find Element By HTML ID
It is possible to select a specific HTML element by its id attribute, which is the equivalent of using the #id CSS selector.
appPromoBanner = soup.find('div', {'id':'appPromoBanner'})
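Because the live page may change or block automated requests, here is a minimal offline sketch of id-based selection. The sample_html string is invented for this example; only the appPromoBanner id comes from the code above.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet that stands in for a downloaded page.
sample_html = '''
<div id="appPromoBanner"><p>Get the app</p></div>
<div id="results"><p>Job listings</p></div>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# Three ways of selecting the same element by its id:
banner = soup.find('div', {'id': 'appPromoBanner'})
same_banner = soup.find(id='appPromoBanner')
css_banner = soup.select_one('#appPromoBanner')
print(banner.get_text(strip=True))

# find() returns None when nothing matches, so check before using the result.
missing = soup.find('div', {'id': 'doesNotExist'})
print(missing is None)
```

Checking for None matters in practice, because calling .text on a missing element raises an AttributeError.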
Find Element By HTML Class Name
Alternatively, you can find elements by their class selector.
container_div = soup.find('div', class_='tab-container')
len(container_div)
10
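It is worth knowing what len() measures here: calling it on a single element counts that element's direct children, not the number of matching elements on the page. The sketch below illustrates this on an invented snippet; only the tab-container class name comes from the example above.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet with three links inside the container.
sample_html = ('<div class="tab-container">'
               '<a>Jobs</a><a>Reviews</a><a>Salaries</a>'
               '</div>')
soup = BeautifulSoup(sample_html, 'html.parser')

# class_ has a trailing underscore because class is a reserved word in Python.
container_div = soup.find('div', class_='tab-container')

# len() on an element counts its direct children.
print(len(container_div))

# Searching inside the container only looks at its descendants.
links = container_div.find_all('a')
print([link.text for link in links])
```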
How To Extract Text From HTML Elements
As well as selecting the entire HTML element, you can also easily extract the text using BeautifulSoup.
Let’s see how this might work whilst scraping a single job advertisement:
job_url = 'https://www.indeed.co.uk/viewjob?cmp=Crowd-Link-Consulting&t=Business+Intelligence+Engineer&jk=9129263166da1718&q=data+engineer&vjs=3'
resp = requests.get(job_url)
soup = BeautifulSoup(resp.content, 'html.parser')
Extracting The Title Tag
Firstly let’s extract the title tag and then use .text to obtain the text:
title_tag_text = soup.title.text
print(title_tag_text)
Business Intelligence Engineer - Woking - Indeed.co.uk
Or we can extract the first paragraph on the webpage, then get the text for that element:
first_paragraph = soup.find('p')
print(first_paragraph.text)
Business Intelligence Engineer – Woking, Surrey
How To Extract Multiple HTML Elements
Sometimes you’ll want to store multiple elements, for example if there is a list of job advertisements on the same page. The following method will return a list of elements rather than just the first element:
soup.find_all(some_element)
all_paragraphs = soup.find_all('p')
print(all_paragraphs[0:3])
[Business Intelligence Engineer – Woking, Surrey, Objective , This role needs to work closely with our client’s customers to turn data into critical information and knowledge that can be used to make sound business decisions. They provide data that is accurate, congruent, reliable and is easily accessible.]
If we wanted to extract the text of every paragraph element, we could use a list comprehension:
all_paragraphs_text = [paragraph.text.strip() for paragraph in all_paragraphs]
It’s also possible to drop paragraphs that contain only empty strings, by keeping only the paragraph texts that are truthy (non-empty).
# This will only return paragraphs that don't have empty strings!
full_paragraphs = [paragraph for paragraph in all_paragraphs_text if paragraph]
print(len(full_paragraphs))
12
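The two list comprehensions above can be tried without downloading anything. This is a minimal sketch on an invented HTML snippet that includes two empty paragraphs, so you can see them being filtered out.

```python
from bs4 import BeautifulSoup

# Invented HTML: two paragraphs with text, one with only spaces, one empty.
sample_html = '''
<p>Business Intelligence Engineer</p>
<p>   </p>
<p>Objective</p>
<p></p>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

all_paragraphs = soup.find_all('p')

# .strip() removes surrounding whitespace, so blank paragraphs become ''.
all_paragraphs_text = [paragraph.text.strip() for paragraph in all_paragraphs]

# Empty strings are falsy, so the condition drops them.
full_paragraphs = [text for text in all_paragraphs_text if text]

print(len(all_paragraphs), len(full_paragraphs))
print(full_paragraphs)
```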
How To Web Scrape Multiple HTML Pages:
If you’d like to web scrape multiple pages, simply create a for loop that builds a new BeautifulSoup object for each page.
The important things are:
- Have a results dictionary or list(s) that is outside of the loop.
- Extract either the result or a placeholder such as N/A or NaN (not a number). This is especially important when you’re using Python lists, as it ensures that all of your lists will always be the same length.
urls = ['https://website.understandingdata.com/',
        'https://website.understandingdata.com/about-me/',
        'https://website.understandingdata.com/contact/']

# 1. Create a results list to store all of the web scraped data:
results = []

for url in urls:
    # 2. Obtain the HTML response:
    response = requests.get(url)
    # 3. Create a BeautifulSoup object:
    soup = BeautifulSoup(response.content, 'html.parser')
    # 4. Extract the elements per URL:
    title_tag = soup.title
    results.append(title_tag.text)
print(results)
. . .
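To show why the N/A placeholder matters, here is a minimal offline sketch. The sample_pages dictionary and the extract_title helper are hypothetical stand-ins for real downloaded pages, and one of the sample pages deliberately has no title tag.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the HTML that requests.get(url).content would return.
sample_pages = {
    'https://example.com/': '<html><head><title>Home</title></head></html>',
    'https://example.com/about/': '<html><head></head><body>No title</body></html>',
}


def extract_title(html):
    """Return the page title, or 'N/A' when the page has no title tag."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.title
    if title_tag is None:
        return 'N/A'
    return title_tag.text.strip()


# Both lists live outside the loop and gain one entry per page,
# so they always stay the same length.
scraped_urls, titles = [], []
for url, html in sample_pages.items():
    scraped_urls.append(url)
    titles.append(extract_title(html))

print(list(zip(scraped_urls, titles)))
```

Without the None check, the second page would raise an AttributeError and the loop would stop partway through.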
How To Scan HTML Content For Specific Keywords
Particularly in a marketing context, if one of your web pages is ranking for 5 keywords it would be beneficial to know:
- If every keyword was on a given HTML page.
- Which keywords were present on, or missing from, the HTML page.
By writing a web scraper we can easily answer these questions at scale.
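Both questions can be answered with plain string checks once the HTML is available as text. The sketch below is one possible approach rather than a definitive method: keyword_report is a hypothetical helper, and the sample HTML and keywords are invented for illustration.

```python
def keyword_report(html_text, keywords):
    """Report which keywords appear in the HTML, ignoring letter case."""
    normalised_html = html_text.lower()
    found = [kw for kw in keywords if kw.lower() in normalised_html]
    missing = [kw for kw in keywords if kw.lower() not in normalised_html]
    return {'found': found, 'missing': missing, 'all_present': not missing}


# Invented HTML and keyword list for demonstration purposes.
sample_html = '<title>Understanding Data</title><p>Web scraping tutorial</p>'
report = keyword_report(sample_html, ['Understanding Data', 'web scraping', 'pandas'])
print(report)
```

Note that this is a simple substring match on the raw HTML, so it will also match keywords that only appear inside tag attributes or scripts.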
Let’s say that our keyword is Understanding Data. We will normalise it to lowercase with .lower():
url_dict = {}
keyword = 'Understanding Data'.lower()
for url in urls:
    # Creating a new item in the dictionary:
    url_dict[url] = {'in_title': False, 'in_html': False}
    # Obtaining the HTML page with python requests:
    response = requests.get(url)
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract the HTML content into a string and normalise it to be lowercase:
        cleaned_html_text = response.text.lower()
        # Extract the HTML elements using BeautifulSoup:
        title_tag = soup.title
        # Checking to see if the keyword is present in the HTML and the