Learning Outcomes
- Understand the benefits of using async + await compared with scraping pages one at a time with the requests library.
- Learn how to create an asynchronous web scraper from scratch in pure Python using asyncio and aiohttp.
- Practice downloading multiple webpages using aiohttp + asyncio and parsing the HTML content of each URL with BeautifulSoup.
The following installation commands are written for a Jupyter Notebook. If you are using a command line, simply drop the leading ! symbol:
!pip install beautifulsoup4
!pip install requests
!pip install aiohttp
# Library Imports
import aiohttp
from bs4 import BeautifulSoup
import pandas as pd
import requests
Note: nest_asyncio is only needed because this tutorial was written in a Jupyter notebook, which already runs its own event loop, so a plain asyncio.run() call would raise a RuntimeError there. If you write the same web scraper in a regular Python file, you don’t need to install or run the following code block:
!pip install nest-asyncio
import nest_asyncio
nest_asyncio.apply()
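To see why a notebook needs this patch, here is a minimal sketch using only the standard library. The names inner and outer are made up for illustration. It calls asyncio.run() from code that is already running inside an event loop, which is roughly the situation inside a Jupyter cell:

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() is already running inside an event loop, much like notebook code.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second, nested event loop is refused
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that never got to run
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the RuntimeError text complaining about an already running event loop. nest_asyncio.apply() patches asyncio so that this kind of nested call is allowed.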
Why Use Asynchronous Web Scraping?
Synchronous web scrapers are easier to write and the code is less complex; however, they’re slow, because most of the run time is spent waiting on the network.
Each request must wait for the previous one to finish, so only one request is in flight at any given time.
In contrast, asynchronous requests don’t have to wait for earlier requests in a queue or for loop to finish. While one request is waiting for its response, the others can be sent, so many requests are in flight concurrently.
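The difference can be sketched without touching the network. In the example below, fake_request is a hypothetical stand-in for an HTTP call that simply sleeps, and the delay values are arbitrary assumptions; the exact timings will vary by machine:

```python
import asyncio
import time


async def fake_request(delay):
    # Stand-in for a network call: just wait for `delay` seconds.
    await asyncio.sleep(delay)
    return delay


async def sequential(delays):
    # Each "request" starts only after the previous one has finished.
    return [await fake_request(d) for d in delays]


async def concurrent(delays):
    # All "requests" are started together and awaited as a group.
    return await asyncio.gather(*(fake_request(d) for d in delays))


def timed(coro_fn, delays):
    start = time.perf_counter()
    asyncio.run(coro_fn(delays))
    return time.perf_counter() - start


delays = [0.1, 0.1, 0.1]
sequential_seconds = timed(sequential, delays)
concurrent_seconds = timed(concurrent, delays)
print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

The sequential version takes roughly the sum of the delays, while the concurrent version takes roughly as long as the slowest single delay.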
How Is Asynchronous Web Scraping Different To Using Python Requests?
Instead of thinking about a for loop that sends N requests one after another, you need to think about an event loop. Node.js, for example, executes JavaScript in a single-threaded event loop by design.
In Python, however, we create the event loop ourselves with asyncio.
Inside the event loop you can schedule any number of tasks, and the loop switches between them whenever one of them is waiting on I/O, so they run asynchronously.
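As a rough illustration of tasks sharing one event loop, the sketch below schedules two tasks with asyncio.create_task() and records the order in which they start and finish. The worker function, the task names and the delays are assumptions made for this example:

```python
import asyncio


async def worker(name, delay, log):
    log.append(f"{name} started")
    await asyncio.sleep(delay)  # hands control back to the event loop
    log.append(f"{name} finished")
    return name


async def main():
    log = []
    # create_task() schedules each coroutine on the running event loop.
    tasks = [
        asyncio.create_task(worker("slow", 0.2, log)),
        asyncio.create_task(worker("fast", 0.05, log)),
    ]
    results = [await task for task in tasks]
    return log, results


log, results = asyncio.run(main())
print(log)
print(results)
```

Both tasks start before either finishes, and the fast task finishes first even though it was scheduled second, because the loop switches tasks at each await.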
How To Web Scrape A Single Web Page Using Aiohttp
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://python.org') as response:
            print("Status:", response.status)
            print("Content-type:", response.headers['content-type'])
            html = await response.text()
            print("Body:", html[:15], "...")

asyncio.run(main())
Firstly we define a client session with aiohttp:
async with aiohttp.ClientSession() as session:
Then, with our session, we send a GET request to a single URL:
async with session.get('http://python.org') as response:
Thirdly, notice how we use the await keyword in front of response.text(), because reading the response body is itself an asynchronous operation:
html = await response.text()
Also, note that every asynchronous function is defined with async def instead of def:
async def function_name():
Finally, we call asyncio.run(main()), which creates an event loop and runs main() inside it until it completes.
Once main() has returned, asyncio.run() closes the event loop automatically.
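The lifecycle of the event loop can be checked with a few lines of standard-library code. This is only a sketch; the variable names are illustrative:

```python
import asyncio


async def main():
    # Return the loop that is currently running this coroutine.
    return asyncio.get_running_loop()


first_loop = asyncio.run(main())
second_loop = asyncio.run(main())

print(first_loop.is_closed())     # the loop is closed once asyncio.run() returns
print(first_loop is second_loop)  # each asyncio.run() call builds a fresh loop
```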
How To Web Scrape Multiple Pages Using Aiohttp
When scraping multiple pages with asyncio and aiohttp, we’ll use the following pattern to create multiple tasks that are executed concurrently within an asyncio event loop:
tasks = []
for url in urls:
    tasks.append(some_function(session, url))
htmls = await asyncio.gather(*tasks)
To start with, we create an empty list. Then, for every URL, we call some_function with the aiohttp session and the URL and append the result to the list. Calling an async function does not run its body; it returns a coroutine object that only executes once the event loop runs it.
asyncio.gather(*tasks) schedules all of these coroutines on the event loop and waits until every one of them has completed. It returns a list of results that is always the same length as the list of coroutines and in the same order; a coroutine that returns nothing (for example, because it caught an exception) contributes None to that list.
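Here is a small network-free sketch of how asyncio.gather() treats its inputs. The fetch_stub and safe_fetch functions are hypothetical stand-ins for real requests, and the URLs are just labels:

```python
import asyncio


async def fetch_stub(url, delay, fail=False):
    # Pretend to download `url`; optionally fail like a broken request.
    await asyncio.sleep(delay)
    if fail:
        raise ValueError(f"could not fetch {url}")
    return url


async def safe_fetch(url, delay, fail=False):
    # Swallow the error and implicitly return None.
    try:
        return await fetch_stub(url, delay, fail)
    except ValueError:
        return None


async def main():
    coro = safe_fetch("a", 0.05)
    # Calling an async function only builds a coroutine object.
    is_coroutine = asyncio.iscoroutine(coro)
    results = await asyncio.gather(
        coro, safe_fetch("b", 0.01, fail=True), safe_fetch("c", 0.0)
    )
    # Alternative: let gather() hand back the exception objects themselves.
    raw = await asyncio.gather(
        fetch_stub("a", 0.0), fetch_stub("b", 0.0, fail=True),
        return_exceptions=True,
    )
    return is_coroutine, results, raw


is_coroutine, results, raw = asyncio.run(main())
print(results)
```

The results keep the input order regardless of which stub finished first, and the failed stub shows up as None rather than shortening the list.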
Now that we know how to create and execute multiple tasks, let’s see this in action:
class WebScraper(object):
    def __init__(self, urls):
        self.urls = urls
        # Global Place To Store The Data:
        self.all_data = []
        self.master_dict = {}
        # Run The Scraper:
        asyncio.run(self.main())

    async def fetch(self, session, url):
        try:
            async with session.get(url) as response:
                text = await response.text()
                return text, url
        except Exception as e:
            print(str(e))

    async def main(self):
        tasks = []
        headers = {
            "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
        async with aiohttp.ClientSession(headers=headers) as session:
            for url in self.urls:
                tasks.append(self.fetch(session, url))

            htmls = await asyncio.gather(*tasks)
            self.all_data.extend(htmls)

            # Storing the raw HTML data.
            for html in htmls:
                if html is not None:
                    url = html[1]
                    self.master_dict[url] = {'Raw Html': html[0]}
                else:
                    continue
# 1. Create a list of URLs for our scraper to get the data for:
urls = ['https://website.understandingdata.com/', 'http://twitter.com/']
# 2. Create the scraper class instance, this will automatically create a new event loop within the __init__ method:
scraper = WebScraper(urls = urls)
# 3. Notice how we have a list length of 2:
len(scraper.all_data)
Adding HTML Parsing Logic To The Aiohttp Web Scraper
As well as collecting the HTML response from multiple webpages, parsing each page can be useful for SEO and HTML content analysis.
Therefore, let’s create a second function that parses the HTML page and extracts the title tag.
class WebScraper(object):
    def __init__(self, urls):
        self.urls = urls
        # Global Place To Store The Data:
        self.all_data = []
        self.master_dict = {}
        # Run The Scraper:
        asyncio.run(self.main())

    async def fetch(self, session, url):
        try:
            async with session.get(url) as response:
                # 1. Extracting the Text:
                text = await response.text()
                # 2. Extracting the Tag:
                title_tag = await self.extract_title_tag(text)
                return text, url, title_tag
        except Exception as e:
            print(str(e))

    async def extract_title_tag(self, text):
        try:
            soup = BeautifulSoup(text, 'html.parser')
            return soup.title
        except Exception as e:
            print(str(e))

    async def main(self):
        tasks = []
        headers = {
            "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
        async with aiohttp.ClientSession(headers=headers) as session:
            for url in self.urls:
                tasks.append(self.fetch(session, url))

            htmls = await asyncio.gather(*tasks)
            self.all_data.extend(htmls)

            # Storing the raw HTML data.
            for html in htmls:
                if html is not None:
                    url = html[1]
                    self.master_dict[url] = {'Raw Html': html[0], 'Title': html[2]}
                else:
                    continue
scraper = WebScraper(urls = urls)
scraper.master_dict['https://website.understandingdata.com/']['Title']
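The Title value stored by the scraper is a BeautifulSoup Tag object (or None when a page has no title element), not a plain string. The offline sketch below parses a made-up HTML snippet to show one possible way of turning that tag into text; the sample markup and variable names are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# A made-up page used only for this illustration.
html = "<html><head><title>Example Page</title></head><body><h1>Hi</h1></body></html>"

soup = BeautifulSoup(html, "html.parser")
title_tag = soup.title  # a Tag object, the same kind of value the scraper stores
title_text = title_tag.get_text(strip=True) if title_tag else None

# A document without a <title> element yields None instead of a Tag.
missing_title = BeautifulSoup("<p>no title here</p>", "html.parser").title

print(title_tag)
print(title_text)
print(missing_title)
```

If you only need the text, converting the tag before storing it keeps master_dict easier to load into a table later.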
Conclusion
Asynchronous web scraping is more suitable when you have a larger number of URLs that need to be processed quickly.
Also, notice how easy it is to add an HTML parsing function with BeautifulSoup, allowing you to extract specific elements on a per-URL basis.