How To Easily Find All Of The Sitemap.xml Files In Python

James Phoenix
James Phoenix

To effectively analyse websites, knowing how to download all of the sitemap.xml files for a particular website is an incredibly useful skill.

Forunately, there are python packages that allow us to easily download all of sitemap.xml file’s with brute force!


NB: If you’re using a standard python environment, then simply exclude the ! symbol. The reason for using !pip install is because this guide is written in a jupyter notebook.

!pip install ultimate-sitemap-parser
!pip install requests
from usp.tree import sitemap_tree_for_homepage
import requests

Download all of the Sitemap.xml files based upon the URL of the homepage:

After running the following method, we’ve found all of the sitemap files and have saved them to a variable called tree:

tree = sitemap_tree_for_homepage('https://website.understandingdata.com/')
print(tree)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on a given website.

To find all of the pages we can simply do:

Leanpub Book

Read The Meta-Engineer

A practical book on building autonomous AI systems with Claude Code, context engineering, verification loops, and production harnesses.

Continuously updated
Claude Code + agentic systems
View Book
# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

Also, you can save of the URLs to a new variable via a list comprehension:

urls = [page.url for page in tree.all_pages()]
print(len(urls), urls[0:2])

Conclusion

Now you’ll hopefully be able to easily find all of the sitemap.xml files and the web pages in just a few lines of python code!

Topics
Python For SEO

Newsletter

Become a better AI engineer

Weekly deep dives on production AI systems, context engineering, and the patterns that compound. No fluff, no tutorials. Just what works.

Join 306K+ developers. No spam. Unsubscribe anytime.


More Insights

Cover Image for Fast Playwright E2E Without the Bloat

Fast Playwright E2E Without the Bloat

Most end-to-end suites die the same way. They start fast, then they bloat. A run creeps from forty seconds to four minutes. A few specs go flaky, so someone bumps the retries. The retries hide the fla

James Phoenix
James Phoenix
Cover Image for Write the Pseudocode, Let the LLM Type

Write the Pseudocode, Let the LLM Type

I asked Codex to “build a worker.”

James Phoenix
James Phoenix