BeautifulSoup
Resources
- Docs on Read the Docs
- Read the docs project page
- PyPI
- Tutorials
Install
Install BeautifulSoup
Use pip to install:
pip install beautifulsoup4
Note: don’t install beautifulsoup - that is the legacy Beautiful Soup 3 package. You might also see bs4 suggested, but that is just a wrapper pinned at version 0.0.1, so don’t use that either.
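To confirm which version got installed, a quick check (assumes beautifulsoup4 is already installed):
import bs4
print(bs4.__version__)  # e.g. '4.12.3'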
Install a parser
See Installing a parser in the docs.
Choose a parser. The linked guide compares them; the main options are also covered below.
If you get an invalid document error, try switching parsers, as each one handles malformed markup differently.
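For example, the parsers disagree on how to repair invalid markup. A minimal sketch, with output as shown in the docs’ parser comparison (lxml and html5lib must be installed for their lines to run):
from bs4 import BeautifulSoup

broken = "<a></p>"
# The builtin parser drops the stray closing tag.
print(BeautifulSoup(broken, "html.parser"))  # <a></a>
# lxml also drops it, but adds the html/body wrapper.
print(BeautifulSoup(broken, "lxml"))  # <html><body><a></a></body></html>
# html5lib repairs it the way a browser would, nesting an empty p inside the a.
print(BeautifulSoup(broken, "html5lib"))  # <html><head></head><body><a><p></p></a></body></html>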
Python’s parser
Not as fast as lxml and not as lenient as html5lib.
No installation needed.
html.parser
Use like this:
BeautifulSoup(markup, "html.parser")
LXML
Link: lxml.de/
Use pip to install:
pip install lxml
Or:
apt-get install python3-lxml
Use like this:
- HTML parsing
BeautifulSoup(markup, "lxml")
- XML parsing
BeautifulSoup(markup, "lxml-xml") # OR BeautifulSoup(markup, "xml")
HTML5Lib
Described in the docs as “very slow”, so only covered briefly here:
Use like this:
BeautifulSoup(markup, "html5lib")
Import
from bs4 import BeautifulSoup
Make soup
Syntax
soup = BeautifulSoup(markup, features)
Markup can be a string or a file object.
Example from the docs:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
That uses the builtin parser. It is slower than lxml for large or complex HTML documents and has limited support for some advanced features, such as XML namespaces and XML-based HTML. The trade-off is that lxml has to be installed - see lxml on PyPI.
Example with sample HTML:
from bs4 import BeautifulSoup
PARSER = 'html.parser'
# Sample HTML markup
markup = """
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Welcome to my page</h1>
<p>This is a paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
"""
# Parse the HTML markup
soup = BeautifulSoup(markup, PARSER)
# Print the HTML title
print(soup.title.string) # Output: Example Page
# Find all list items
list_items = soup.find_all('li')
for item in list_items:
    print(item.string)
# Output:
# Item 1
# Item 2
# Item 3
Parameter info, from the docstring:
:param markup: A string or a file-like object representing
markup to be parsed.
:param features: Desirable features of the parser to be
used. This may be the name of a specific parser ("lxml",
"lxml-xml", "html.parser", or "html5lib") or it may be the
type of markup to be used ("html", "html5", "xml"). It's
recommended that you name a specific parser, so that
Beautiful Soup gives you the same results across platforms
and virtual environments.
:param builder: A TreeBuilder subclass to instantiate (or
instance to use) instead of looking one up based on
`features`. You only need to use this if you've implemented a
custom TreeBuilder.
:param parse_only: A SoupStrainer. Only parts of the document
matching the SoupStrainer will be considered. This is useful
when parsing part of a document that would otherwise be too
large to fit into memory.
:param from_encoding: A string indicating the encoding of the
document to be parsed. Pass this in if Beautiful Soup is
guessing wrongly about the document's encoding.
:param exclude_encodings: A list of strings indicating
encodings known to be wrong. Pass this in if you don't know
the document's encoding but you know Beautiful Soup's guess is
wrong.
:param element_classes: A dictionary mapping BeautifulSoup
classes like Tag and NavigableString, to other classes you'd
like to be instantiated instead as the parse tree is
built. This is useful for subclassing Tag or NavigableString
to modify default behavior.
:param kwargs: For backwards compatibility purposes, the
constructor accepts certain keyword arguments used in
Beautiful Soup 3. None of these arguments do anything in
Beautiful Soup 4; they will result in a warning and then be
ignored.
Apart from this, any keyword arguments passed into the
BeautifulSoup constructor are propagated to the TreeBuilder
constructor. This makes it possible to configure a
TreeBuilder by passing in arguments, not just by saying which
one to use.
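For example, parse_only takes a SoupStrainer so that only matching elements end up in the tree. A sketch with made-up markup (note that SoupStrainer does not work with the html5lib parser):
from bs4 import BeautifulSoup, SoupStrainer

# Keep only li elements when building the tree.
only_li = SoupStrainer("li")
soup = BeautifulSoup("<ul><li>A</li><li>B</li></ul><p>skip</p>",
                     "html.parser", parse_only=only_li)
print(soup)  # <li>A</li><li>B</li>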
Read string
markup = "<div>Hello, world!</div>"
soup = BeautifulSoup(markup, 'lxml')
Read local text file
Note that you do not need to call f_in.read() - the constructor accepts a file object directly.
with open("index.html", "r") as f_in:
soup = BeautifulSoup(f_in, 'lxml')
Read webpage
import requests

resp = requests.get(url)
if resp.ok:
    soup = BeautifulSoup(resp.text, 'lxml')
else:
    print(resp.status_code)
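If the page’s encoding is being detected wrongly, one variant is to pass the raw bytes and let Beautiful Soup sniff the encoding itself, using from_encoding as an override. A sketch (url is a placeholder):
import requests
from bs4 import BeautifulSoup

resp = requests.get(url)
resp.raise_for_status()
# Passing bytes lets Beautiful Soup detect the encoding.
soup = BeautifulSoup(resp.content, 'lxml')
# If its guess is wrong, name the encoding explicitly:
# soup = BeautifulSoup(resp.content, 'lxml', from_encoding='iso-8859-1')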
Find elements
Get by attribute
Get the first element, or None if there is no match.
soup.html
soup.head
soup.h2.name
# h2
soup.h2.text
# My header
Use find
Get the first element matching the search, with optional attribute filters.
tag = soup.find(attrs={"name": "stainfo"})
soup.find("ul", attrs={"id": "mylist"})
# OR
soup.find("ul", id="mylist")
To narrow down a search, you can use an element in place of soup.
el.find(...)
Or use a nested CSS selector.
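A sketch of both approaches, with made-up markup:
from bs4 import BeautifulSoup

markup = '<ul id="mylist"><li>Inside</li></ul><p>Outside</p>'
soup = BeautifulSoup(markup, 'html.parser')

# Search within an element rather than the whole document.
ul = soup.find('ul', id='mylist')
print(ul.find('li'))  # <li>Inside</li>
# Equivalent nested CSS selector.
print(soup.select_one('#mylist li'))  # <li>Inside</li>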
Use find_all
Get all elements.
tags = soup.find_all("li"):
Use a list.
tags = soup.find_all(["h2", "p"])
Use a function.
def is_empty(tag):
    return tag.is_empty_element

tags = soup.find_all(is_empty)
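You can also pass a compiled regex to match tag names; a short sketch:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>A</h1><h2>B</h2><p>C</p>', 'html.parser')

# Match any heading tag: h1, h2, etc.
for tag in soup.find_all(re.compile(r'^h\d$')):
    print(tag.name, tag.string)
# Output:
# h1 A
# h2 B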
Use select_one
CSS selector.
tag = soup.select_one("#mylist")
Use select
CSS selector.
Get the third li element.
tags = soup.select("li:nth-of-type(3)")
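A few more selector patterns, sketched with made-up markup:
from bs4 import BeautifulSoup

markup = '<ul id="mylist"><li class="odd">1</li><li>2</li><li class="odd">3</li></ul>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.select('li.odd'))                 # by class
print(soup.select('#mylist > li'))           # direct children of the ul
print(soup.select_one('li:nth-of-type(2)'))  # <li>2</li>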
Attributes
For the string: '<div id="foo" class="question" name="python" x="something">Hello World</div>'
Get
el.get('id')
# => 'foo'
Attrs
el.attrs
# => {'id': 'foo', 'x': 'something', 'class': ['question'], 'name': 'python'}
el.attrs['id']
# => 'foo'
# Safer dict lookup, which returns None if the key is not found.
el.attrs.get('id')
# => 'foo'
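Note that class is a multi-valued attribute, so it comes back as a list. Indexing also works but raises KeyError for a missing key. A sketch using the string above:
from bs4 import BeautifulSoup

el = BeautifulSoup('<div id="foo" class="question" name="python" x="something">Hello World</div>',
                   'html.parser').div

print(el['id'])      # 'foo' - raises KeyError if the attribute is missing
print(el.get('id'))  # 'foo' - returns None if missing
print(el['class'])   # ['question'] - class is multi-valued, so a list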
Get all URLs
for link in soup.find_all('a'):
print(link.get('href'))
Get all text on the page
soup.get_text()
Traversing
Children
Direct descendants.
el.children
Descendants
All levels.
el.descendants
Parent
el.parent
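A sketch showing the difference on made-up markup (strings have no .name, so filtering on it keeps only tags):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>A</li><li><b>B</b></li></ul>', 'html.parser')
ul = soup.ul

print([c.name for c in ul.children])               # ['li', 'li'] - direct only
print([d.name for d in ul.descendants if d.name])  # ['li', 'li', 'b'] - all levels
print(ul.li.parent.name)                           # 'ul'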
Traverse recursively
with open("index.html", "r") as f_in:
contents = f_in.read()
soup = BeautifulSoup(contents, 'lxml')
for child in soup.recursiveChildGenerator():
if child.name:
print(child.name)
Modify
Create
new_tag = soup.new_tag('li')
new_tag.string = 'Foo'
ul_tag = soup.ul
Append
ul_tag.append(new_tag)
Insert
Insert at 3rd position:
ul_tag.insert(2, new_tag)
Replace
tag = soup.find(text="Windows")
tag.replace_with("OpenBSD")
Remove
tag.decompose()
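Putting the modify operations together, a sketch with made-up markup (string= is the modern name for the text= argument used above):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>One</li><li>Two</li></ul>', 'html.parser')
ul = soup.ul

new_li = soup.new_tag('li')
new_li.string = 'Three'
ul.append(new_li)  # add at the end

soup.find(string='Two').replace_with('2')  # replace the text node
ul.li.decompose()  # remove the first li entirely

print(soup)  # <ul><li>2</li><li>Three</li></ul>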
Prettify
print(soup.prettify())