Scrapy
I recommend using scrapy, a high-level Python package for scraping websites and saving the output as files.
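For example, to run a spider from a plain Python script and write the scraped items straight to a JSON file, something like the sketch below should work. It is only a sketch: the spider, the CSS selectors, and the output filename are illustrative placeholders, and the FEEDS setting assumes Scrapy 2.1 or later.

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    # Placeholder spider scraping Scrapy's public demo site.
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }


process = CrawlerProcess(settings={
    # Write every scraped item to quotes.json (feed exports, Scrapy 2.1+).
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes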
Spidering
Use scrapy to discover pages on a site.
Basic
From the Spiders docs page.
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
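The example above only visits a fixed list of URLs. To actually discover pages by following links within a site, one option is a CrawlSpider with a LinkExtractor rule, sketched below; the spider name and callback are placeholders rather than anything from the Scrapy docs.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SiteSpider(CrawlSpider):
    name = 'site'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow every in-domain link and pass each fetched page to parse_page.
        # CrawlSpider reserves parse() for its own logic, so use a differently
        # named callback.
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        self.logger.info('Discovered %s', response.url)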
Scrape nested sitemaps
Example from the SitemapSpider part of the docs.
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass  # ... scrape item here ...
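As a rough sketch of what that parse callback might do for each page found in the sitemap, an item could be yielded per response; the CSS selector here is an assumption and would need adjusting to the site's actual markup.

from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        # Placeholder extraction: record the page URL and its <title> text.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }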