How To Easily Find All Of The Sitemap.xml Files In Python

James Phoenix

To effectively analyse websites, knowing how to download all of the sitemap.xml files for a particular website is an incredibly useful skill.

Fortunately, there are Python packages that let us easily download all of a website’s sitemap.xml files with brute force!


NB: If you’re using a standard Python environment, simply drop the ! symbol. This guide uses !pip install because it was written in a Jupyter notebook.

!pip install ultimate-sitemap-parser
!pip install requests
from usp.tree import sitemap_tree_for_homepage
import requests

Download all of the Sitemap.xml files based upon the URL of the homepage:

Calling the following function finds all of the sitemap files and saves them to a variable called tree:

tree = sitemap_tree_for_homepage('https://website.understandingdata.com/')
print(tree)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on a given website.
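Because the result is a tree, it can be handy to list which sitemap files were actually discovered. The following is a minimal sketch rather than part of the library: iter_sitemap_urls is a hypothetical helper, and it assumes that every sitemap object has a url attribute and that index sitemaps list their children in sub_sitemaps.

```python
from typing import Iterator


def iter_sitemap_urls(sitemap) -> Iterator[str]:
    """Yield the URL of this sitemap and of every sitemap nested below it."""
    yield sitemap.url
    # Assumption: index sitemaps expose their children via `sub_sitemaps`;
    # objects without that attribute are treated as leaves.
    for child in getattr(sitemap, "sub_sitemaps", []):
        yield from iter_sitemap_urls(child)


# Usage with the tree from above (commented out, as it needs network access):
# for sitemap_url in iter_sitemap_urls(tree):
#     print(sitemap_url)
```

The walk is depth-first, so each index sitemap is printed before the sitemaps it points to.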


To find all of the pages we can simply do:

# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

Also, you can save all of the URLs to a new variable via a list comprehension:

urls = [page.url for page in tree.all_pages()]
print(len(urls), urls[0:2])
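If the same page is listed in more than one sitemap, the list above may contain repeated URLs. Here is a small sketch of one way to deduplicate while keeping the original order. unique_urls is a hypothetical helper name, and it only assumes that each page object has a url attribute, as used in the list comprehension.

```python
from typing import Iterable, List


def unique_urls(pages: Iterable) -> List[str]:
    """Return each page URL once, in first-seen order."""
    # dict.fromkeys keeps insertion order and silently drops repeated keys.
    return list(dict.fromkeys(page.url for page in pages))


# Usage with the tree from above (commented out, as it needs network access):
# urls = unique_urls(tree.all_pages())
# print(len(urls), urls[0:2])
```

A set would also remove duplicates, but it would lose the order in which the pages were found.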

Conclusion

Now you’ll hopefully be able to easily find all of the sitemap.xml files and the web pages in just a few lines of Python code!

Tagged: Python For SEO

