When performing content analysis at scale, you’ll need to automatically extract text content from web pages.
In this article you’ll learn how to extract the text content from single and multiple web pages using Python.
!pip install beautifulsoup4
!pip install numpy
!pip install requests
!pip install spacy
!pip install trafilatura
NB: If you’re writing this in a standard Python file, you won’t need to include the ! symbol. It’s only needed here because this tutorial was written in a Jupyter Notebook.
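Also, because we’ll use spaCy for tokenisation later on, you’ll need a language model installed. A minimal example, assuming the small English model en_core_web_sm (any installed spaCy model will do):
!python -m spacy download en_core_web_sm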
Firstly we’ll break the problem down into several stages:
- Use requests to download the HTML content of every page and store it in a Python dictionary.
- Pass every single HTML page to Trafilatura to parse the text content.
- Add error and exception handling so that if Trafilatura fails, we can still extract the content, albeit with a less accurate approach.
from bs4 import BeautifulSoup
import json
import numpy as np
import requests
from requests.models import MissingSchema
import spacy
import trafilatura
Collect The HTML Content From The Website
urls = ['https://website.understandingdata.com/',
        'https://sempioneer.com/',]
data = {}
for url in urls:
    # 1. Obtain the response:
    resp = requests.get(url)
    # 2. If the status code is 200 (OK), save the HTML content:
    if resp.status_code == 200:
        data[url] = resp.text
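In practice, web requests can hang or raise connection errors, so you may also want a timeout and basic exception handling around requests.get. A minimal sketch of a more defensive version of the loop above (the 10-second timeout is an arbitrary choice):
data = {}

for url in urls:
    try:
        # Abort requests that take longer than 10 seconds:
        resp = requests.get(url, timeout=10)
        # Only save the HTML content of successful responses:
        if resp.status_code == 200:
            data[url] = resp.text
    except requests.exceptions.RequestException:
        # Skip URLs that time out or fail to connect:
        continue

print(f"Collected HTML for {len(data)} of {len(urls)} URLs.")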
Extract The Text From A Single Web Page
After collecting all of the responses that returned a status_code of 200, we can make several attempts to extract the text content from each one.
First we’ll try trafilatura; if that library is unable to extract the text, we’ll fall back to BeautifulSoup4.
def beautifulsoup_extract_text_fallback(response_content):
    '''
    This is a fallback function, so that we can always return a value for text content,
    even when Trafilatura is unable to extract the text from a single URL.
    '''
    # Create the BeautifulSoup object:
    soup = BeautifulSoup(response_content, 'html.parser')
    # Find every text node in the parsed HTML:
    text = soup.find_all(text=True)
    # Remove unwanted tag elements:
    cleaned_text = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'style',
    ]
    # Loop over every text node and keep it only if its parent tag
    # is NOT in the blacklist:
    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)
    # Remove any tab separation and strip the text:
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()
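You can sanity-check this fallback on one of the HTML documents we stored earlier. A quick, hedged example, assuming the data dictionary was populated successfully in the previous step:
sample_html = data['https://sempioneer.com/']
print(beautifulsoup_extract_text_fallback(sample_html)[:300])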
def extract_text_from_single_web_page(url):
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments=False,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # Any response that isn't a 200 is treated as a failure:
                return np.nan
        # Handle any URLs that don't have the correct protocol:
        except MissingSchema:
            return np.nan
single_url = 'https://website.understandingdata.com/'
text = extract_text_from_single_web_page(url=single_url)
print(text)
Extract The Text From Multiple Web Pages
Let’s use a list comprehension with our extract_text_from_single_web_page function to easily extract the text from many web pages:
# Add a deliberately invalid URL so we can test the error handling:
urls = urls + ['fake_url']
text_content = [extract_text_from_single_web_page(url) for url in urls]
print(text_content[1])
print(text_content[-1:])
Notice how any URL that failed can easily be removed later, because we returned np.nan (not a number) for it.
Cleaning Our Raw Text From Multiple Web Pages
Now that we’ve successfully extracted the raw text documents, let’s remove any web pages that failed:
cleaned_textual_content = [text for text in text_content if str(text) != 'nan']
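To confirm that only the failed entry was dropped, you can compare the lengths of the two lists (an optional sanity check):
print(f"Kept {len(cleaned_textual_content)} of {len(text_content)} documents.")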
Also, you might want to clean the text for further analysis. For example, tokenising the text content allows you to analyse the sentiment, the sentence structure, semantic dependencies and also the word count.
nlp = spacy.load("en_core_web_sm")

for cleaned_text in cleaned_textual_content:
    # 1. Create an NLP document with spaCy:
    doc = nlp(cleaned_text)
    # 2. spaCy has tokenised the text content:
    print(f"This is a spaCy token: {doc[0]}")
    # 3. Extract the word count per text document:
    print(f"The estimated word count for this document is: {len(doc)}.")
    # 4. Extract the number of sentences:
    print(f"The estimated number of sentences in the document is: {len(list(doc.sents))}")
    print('\n')
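If you also need a cleaner list of words for downstream analysis (for example, word frequencies), you can filter out stop words, punctuation and whitespace with spaCy’s token attributes. A short sketch, assuming cleaned_textual_content contains at least one document:
doc = nlp(cleaned_textual_content[0])

# Keep lowercased lemmas of meaningful tokens only:
words = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
]

print(words[:20])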
Conclusion
Hopefully you can now easily extract text content from either a single URL or multiple URLs.
We’ve also included BeautifulSoup as a fallback function. This makes our code less fragile and able to handle:
- Invalid URLs.
- URLs that returned a failed status code (anything other than 200).
- Pages from which we were unable to extract any text content (these are removed at the end).