Get URLs on a page

Fetch the page with the requests package and parse the response HTML with BeautifulSoup4.

Install packages:

$ pip install requests beautifulsoup4
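
The core of the approach fits in a few lines: requests fetches the HTML, and BeautifulSoup pulls the href out of every <a> tag. A minimal sketch, with https://example.com standing in for any URL:

      import requests
      from bs4 import BeautifulSoup

      # Fetch the page; a timeout keeps the request from hanging forever.
      resp = requests.get("https://example.com", timeout=10)

      # Parse the HTML and collect the href of every <a> tag that has one.
      soup = BeautifulSoup(resp.text, "html.parser")
      print([a["href"] for a in soup.find_all("a", href=True)])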

Script to run:

  • main.py
      """
      Get all links on a page.
      """
      import sys
    
      import requests
      from bs4 import BeautifulSoup
    
      OUT_PATH = "links.txt"
    
    
      def get_links(url):
          # Time out rather than hang forever on an unresponsive host,
          # and fail loudly on an HTTP error instead of parsing an error page.
          resp = requests.get(url, timeout=10)
          resp.raise_for_status()
          soup = BeautifulSoup(resp.text, "html.parser")

          # href=True skips <a> tags that have no href attribute, which
          # would otherwise show up as None in the result list.
          return [link["href"] for link in soup.find_all("a", href=True)]
    
    
      def main(args):
          """
          Main command-line entry-point.
          """
          if not args:
              print(f"Usage: python {sys.argv[0]} URL")
    
              return
    
          url = args[0]
          if not url.startswith("http"):
              url = f"https://{url}"
    
          print(f"Requesting: {url}")
          links = get_links(url)
    
          print(f"Writing to {OUT_PATH}")
          with open(OUT_PATH, "w") as f_out:
              content = "\n".join(links)
              f_out.write(content)
    
    
      if __name__ == "__main__":
          main(sys.argv[1:])
    
    

Sample usage:

$ python3 main.py example.com
$ python3 main.py https://example.com

Sample output:

  • links.txt
      https://www.iana.org/domains/example
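
The script writes each href exactly as it appears in the HTML, so on other pages relative links such as /about stay relative. One way to resolve them against the page URL is urllib.parse.urljoin; a sketch of a modified get_links (an assumed variant, not part of the script above):

      from urllib.parse import urljoin

      def get_links(url):
          resp = requests.get(url, timeout=10)
          resp.raise_for_status()
          soup = BeautifulSoup(resp.text, "html.parser")

          # urljoin resolves relative hrefs against the page URL and
          # passes absolute hrefs through unchanged.
          return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]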