I recommend Scrapy, a high-level Python framework for crawling websites and exporting the scraped data to files.


Use Scrapy to discover pages on a site


Example from the Spiders page of the docs.

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',  # placeholder: list the pages to start crawling from
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Scrape nested sitemaps

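Example from the SitemapSpider part of the docs. SitemapSpider supports nested sitemaps (sitemap index files) and can also discover sitemap URLs from robots.txt.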

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'example.com'  # placeholder; Scrapy requires every spider to have a name
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...