Install

Install BeautifulSoup

Use pip to install:

beautifulsoup4

Note: Don’t install beautifulsoup - that is the legacy Beautiful Soup 3 package. There is also a bs4 package, but it is just a wrapper pinned at version 0.0.1, so install beautifulsoup4 directly.

Install a parser

See Installing a parser in the docs.

Choose a parser. The linked guide compares them, and the main options are covered below.

If you get an invalid document error, try switching parsers, since they handle malformed markup differently.
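
For example, here is the same invalid fragment parsed three ways (a sketch assuming lxml and html5lib are installed, as covered below):

from bs4 import BeautifulSoup

print(BeautifulSoup("<a></p>", "html.parser"))
# <a></a>
print(BeautifulSoup("<a></p>", "lxml"))
# <html><body><a></a></body></html>
print(BeautifulSoup("<a></p>", "html5lib"))
# <html><head></head><body><a></a></body></html>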

Python’s parser

Not as fast as lxml and not as lenient as html5lib.

No installation needed.

html.parser

Use like this:

BeautifulSoup(markup, "html.parser")

LXML

Link: https://lxml.de/

Use pip to install:

lxml

Or:

apt-get install python3-lxml

Use like this:

  • HTML parsing
      BeautifulSoup(markup, "lxml")
    
  • XML parsing
      BeautifulSoup(markup, "lxml-xml")
      # OR
      BeautifulSoup(markup, "xml")
    
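For example, the XML parser does not wrap the document in html and body tags, and it adds an XML declaration (a sketch assuming lxml is installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<doc><item>A</item></doc>", "xml")
print(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <doc><item>A</item></doc>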

HTML5Lib

Described in the docs as “very slow”, so it is only covered briefly here.
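
Use pip to install:

html5lib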

Use like this:

BeautifulSoup(markup, "html5lib")

Import

from bs4 import BeautifulSoup

Make soup

Syntax

soup = BeautifulSoup(markup, features)

Markup as a string or a file object.

Example from the docs:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

That uses the built-in parser. It is slower than lxml for large or complex HTML documents and has limited support for advanced features such as XML namespaces. The trade-off is that lxml must be installed separately - see lxml on PyPI.

Example with sample HTML:

from bs4 import BeautifulSoup

PARSER = 'html.parser'

# Sample HTML markup
markup = """
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <h1>Welcome to my page</h1>
    <p>This is a paragraph.</p>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

# Parse the HTML markup
soup = BeautifulSoup(markup, PARSER)

# Print the HTML title
print(soup.title.string)  # Output: Example Page

# Find all list items
list_items = soup.find_all('li')
for item in list_items:
    print(item.string)
# Output:
# Item 1
# Item 2
# Item 3

Parameter info, from the docstring:

    :param markup: A string or a file-like object representing
     markup to be parsed.

    :param features: Desirable features of the parser to be
     used. This may be the name of a specific parser ("lxml",
     "lxml-xml", "html.parser", or "html5lib") or it may be the
     type of markup to be used ("html", "html5", "xml"). It's
     recommended that you name a specific parser, so that
     Beautiful Soup gives you the same results across platforms
     and virtual environments.

    :param builder: A TreeBuilder subclass to instantiate (or
     instance to use) instead of looking one up based on
     `features`. You only need to use this if you've implemented a
     custom TreeBuilder.

    :param parse_only: A SoupStrainer. Only parts of the document
     matching the SoupStrainer will be considered. This is useful
     when parsing part of a document that would otherwise be too
     large to fit into memory.

    :param from_encoding: A string indicating the encoding of the
     document to be parsed. Pass this in if Beautiful Soup is
     guessing wrongly about the document's encoding.

    :param exclude_encodings: A list of strings indicating
     encodings known to be wrong. Pass this in if you don't know
     the document's encoding but you know Beautiful Soup's guess is
     wrong.

    :param element_classes: A dictionary mapping BeautifulSoup
     classes like Tag and NavigableString, to other classes you'd
     like to be instantiated instead as the parse tree is
     built. This is useful for subclassing Tag or NavigableString
     to modify default behavior.

    :param kwargs: For backwards compatibility purposes, the
     constructor accepts certain keyword arguments used in
     Beautiful Soup 3. None of these arguments do anything in
     Beautiful Soup 4; they will result in a warning and then be
     ignored.

     Apart from this, any keyword arguments passed into the
     BeautifulSoup constructor are propagated to the TreeBuilder
     constructor. This makes it possible to configure a
     TreeBuilder by passing in arguments, not just by saying which
     one to use.
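
For example, parse_only takes a SoupStrainer to restrict parsing to matching elements. A minimal sketch, reusing the sample markup from above:

from bs4 import SoupStrainer

# Only li elements are parsed into the tree.
only_li = SoupStrainer("li")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_li)

print([tag.string for tag in soup.find_all("li")])
# ['Item 1', 'Item 2', 'Item 3']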

Read string

markup = "<div>Hello, world!</div>"
soup = BeautifulSoup(markup, 'lxml')

Read local text file

Note that you can pass the file object directly - you do not need to call f_in.read().

with open("index.html", "r") as f_in:
    soup = BeautifulSoup(f_in, 'lxml')

Read webpage

import requests

resp = requests.get(url)

if resp.ok:
    soup = BeautifulSoup(resp.text, 'lxml')
else:
    print(resp.status_code)

Find elements

Get by attribute

Access a tag name as an attribute on the soup to get the first matching element, or None if there is no match.

soup.html

soup.head
soup.h2.name
# h2
soup.h2.text
# My header

Use find

Get the first element matching the search, with optional attribute filters.

tag = soup.find(attrs={"name": "stainfo"})

tag = soup.find("ul", attrs={"id": "mylist"})
# OR
tag = soup.find("ul", id="mylist")

To narrow down a search, you can use an element in place of soup.

el.find(...)

Or use a nested CSS selector.
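
For example, assuming a ul with id mylist:

ul = soup.find("ul", id="mylist")
item = ul.find("li")

# Or as one nested CSS selector.
item = soup.select_one("ul#mylist li")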

Use find_all

Get all elements.

tags = soup.find_all("li"):

Use a list.

tags = soup.find_all(["h2", "p"])

Use a function.

def is_empty(tag):
    return tag.is_empty_element

tags = soup.find_all(is_empty)
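
You can also filter by attributes and cap the number of results. The class name here is just an assumption for illustration:

tags = soup.find_all("a", class_="external", limit=2)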

Use select_one

CSS selector.

tag = soup.select_one("#mylist")

Use select

CSS selector.

Get the third list item:

tags = soup.select("li:nth-of-type(3)")
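
Other selector examples:

tags = soup.select("ul > li")    # Direct children.
tags = soup.select("a[href]")    # Anchors with an href attribute.
tags = soup.select(".question")  # By class.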

Attributes

For string: '<div id="foo" class="question" name="python" x="something">Hello World</div>'

Get

el.get('id')
# => 'foo'

Attrs

el.attrs
# => {'id': 'foo', 'x': 'something', 'class': ['question'], 'name': 'python'}

el.attrs['id']
# => 'foo'

# Safer dict lookup: returns None if the key is missing instead of raising KeyError.
el.attrs.get('id')
# => 'foo'
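
Note that class is a multi-valued attribute, so it comes back as a list:

el.attrs['class']
# => ['question']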

Get all URLs

for link in soup.find_all('a'):
    print(link.get('href'))

Get all text on the page

soup.get_text()
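
You can also pass a separator and strip surrounding whitespace:

soup.get_text(separator=" ", strip=True)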

Traversing

Children

Direct descendants.

el.children

Descendants

All levels.

el.descendants

Parent

el.parent
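
For example, using the sample markup from earlier:

ul = soup.find("ul")

for child in ul.children:
    if child.name:
        print(child.name)
# li
# li
# li

print(ul.parent.name)
# body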

Traverse recursively

This is equivalent to iterating over soup.descendants; recursiveChildGenerator is the older name for it.

with open("index.html", "r") as f_in:
    soup = BeautifulSoup(f_in, 'lxml')

for child in soup.recursiveChildGenerator():
    if child.name:
        print(child.name)

Modify

Create

new_tag = soup.new_tag('li')
new_tag.string = 'Foo'

ul_tag = soup.ul

Append

ul_tag.append(new_tag)

Insert

Insert at the 3rd position:

ul_tag.insert(2, new_tag)

Replace

tag = soup.find(text="Windows")
tag.replace_with("OpenBSD")

Remove

tag.decompose()
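
Putting those steps together, a sketch against the sample markup from earlier:

new_tag = soup.new_tag('li')
new_tag.string = 'Item 4'

ul_tag = soup.ul
ul_tag.append(new_tag)

# Drop the first item, leaving Item 2, Item 3 and Item 4.
ul_tag.li.decompose()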

Prettify

print(soup.prettify())