Knowing how to download all of the sitemap.xml files for a particular website is an incredibly useful skill for effectively analysing websites.
Fortunately, there are Python packages that allow us to easily download all of a site’s sitemap.xml files with brute force!
NB: If you’re using a standard Python environment, then simply exclude the ! symbol. The reason for using !pip install is because this guide is written in a Jupyter notebook.
!pip install ultimate-sitemap-parser
!pip install requests
from usp.tree import sitemap_tree_for_homepage
import requests
Download all of the sitemap.xml files based on the URL of the homepage:
After running the following function, we’ve found all of the sitemap files and saved them to a variable called tree:
tree = sitemap_tree_for_homepage('https://website.understandingdata.com/')
print(tree)
sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on a given website.
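Because the result is a tree, you can also inspect the hierarchy itself, not just the pages. The snippet below is a minimal sketch that walks the tree recursively; it assumes (per the usp object model) that index-style sitemaps expose a sub_sitemaps list of child sitemaps, while leaf sitemaps do not, hence the getattr fallback.
# A minimal sketch: recursively print each sitemap's URL, indented by depth.
# Assumes index sitemaps carry a `sub_sitemaps` list; leaf sitemaps don't,
# hence the getattr fallback.
def print_sitemap_tree(sitemap, depth=0):
    print('  ' * depth + sitemap.url)
    for sub_sitemap in getattr(sitemap, 'sub_sitemaps', []):
        print_sitemap_tree(sub_sitemap, depth + 1)

print_sitemap_tree(tree)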
To find all of the pages, we can simply do:
# all_pages() returns an Iterator of page objects
for page in tree.all_pages():
    print(page)
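Each yielded page is a SitemapPage object, so it carries more than just a URL. As a hedged sketch, the attribute names below (priority, last_modified) are taken from the usp page model; the sitemap entry may or may not populate them for any given page:
# Inspect the sitemap metadata attached to a single page.
# priority and last_modified come from the sitemap entry itself,
# so they may be defaults or empty.
first_page = next(tree.all_pages())
print(first_page.url)
print(first_page.priority)
print(first_page.last_modified)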
Also, you can save all of the URLs to a new variable via a list comprehension:
urls = [page.url for page in tree.all_pages()]
print(len(urls), urls[0:2])
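This is also where the requests import from earlier comes in handy. As a hedged sketch, the loop below checks the HTTP status of the first few discovered URLs; the 3-URL slice and 5-second timeout are illustrative choices, not requirements:
# Check that the first few discovered URLs actually resolve.
for url in urls[:3]:
    response = requests.get(url, timeout=5)
    print(response.status_code, url)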
Conclusion
Now you’ll hopefully be able to easily find all of a site’s sitemap.xml files and web pages in just a few lines of Python code!