Web Scraping in Python Step by Step Using BeautifulSoup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

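# Pass an open file handle; a bare "index.html" string would be parsed as literal markup.
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')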

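Beautiful Soup then parses the document with the parser named in the second argument, here Python's built-in html.parser. If you leave that argument out, it picks the best HTML parser that is installed, and it only uses an XML parser if you specifically tell it to.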
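As a minimal sketch of the two input forms mentioned above (a string and an open file handle), the following parses the same markup both ways. The SAMPLE_HTML content and the temporary file are assumptions made only for this illustration; they stand in for whatever document you are working with.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1) Parse markup that is already held in a string.
soup_from_string = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 2) Parse markup from an open file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(SAMPLE_HTML)
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_string.title.get_text())
print(soup_from_file.p.get_text())
```

Both soups describe the same document, so either form can be used with the navigation steps that follow.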

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

print(soup.prettify())

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it:

list(soup.children)

The above tells us that there are two items of interest at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag. There is a newline character (\n) in the list as well.
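To see why the list mixes different kinds of items, it can help to print the type of each top-level child. This is a small sketch using a hypothetical SAMPLE_HTML string that mimics the shape of the page described here; your own page may produce a different list.

```python
from bs4 import BeautifulSoup

# Hypothetical page with the same overall shape as the one discussed in the text.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the class name of every top-level child, in document order.
top_level_types = [type(child).__name__ for child in soup.children]
print(top_level_types)
```

With this sample, the doctype, the newline, and the html element each show up as a different class, which is why only the last of the three can be navigated further.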

We can now select the html tag and its children by taking the third item in the list:

html = list(soup.children)[2]

The html item we just selected is a Tag object, which supports the same navigation as the soup itself, so we can also use the children property on html. Text items in the list, such as the newline, are NavigableString objects instead.

Now, we can find the children inside the html tag:

list(html.children)

As you can see above, there are two tags here, head and body, with newline text items around them. We want to extract the text inside the p tag, so we'll dive into the body:

body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

list(body.children)

We can now isolate the p tag:

p = list(body.children)[1]

Once we've isolated the tag, we can use the get_text method to extract all of the text inside the tag:

p.get_text()
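Picking items by position works, but the indexes depend on where the whitespace happens to fall. One possible alternative, sketched here under the assumption of a hypothetical SAMPLE_HTML page and a hypothetical helper named tag_children, is to keep only the Tag children at each level before indexing.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical page with the same overall shape as the one discussed in the text.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""


def tag_children(node):
    """Return only the child tags of node, skipping text items such as newlines."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

html = tag_children(soup)[0]    # the doctype and newline are filtered out
head, body = tag_children(html)  # the two tags inside html
p = tag_children(body)[0]        # the only tag inside body

print(p.get_text())
```

Filtering by type means the indexes count tags only, so adding or removing blank lines in the page does not shift them.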

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract every instance of a tag, we can instead use the find_all method, which finds all the instances of that tag on a page. Here page is assumed to be a response object downloaded earlier, for example with the requests library.

soup = BeautifulSoup(page.content, 'html.parser')

soup.find_all('p')

find_all returns a list, so we'll have to loop through it, or use list indexing, to extract text:

soup.find_all('p')[0].get_text()
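The looping option can be sketched as a list comprehension over the find_all result. The SAMPLE_HTML string is a hypothetical stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs, used only for this illustration.
SAMPLE_HTML = """<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Extract the text of every p tag, in document order.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)
```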

If you instead only want to find the first instance of a tag, you can use the find method, which returns a single Tag object, or None if nothing matches:

soup.find('p')
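Because find gives back None when there is no match, it is usually worth checking the result before calling methods on it. A minimal sketch, using hypothetical markup that contains a p tag but no table tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph and no table.
soup = BeautifulSoup(
    "<html><body><p>Only paragraph.</p></body></html>", "html.parser"
)

first_p = soup.find("p")    # a Tag, because a p tag exists
missing = soup.find("table")  # None, because there is no table tag

# Guard against None before extracting text.
table_text = missing.get_text() if missing is not None else ""

print(first_p.get_text())
print(repr(table_text))
```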

Searching for tags by class and id

We can also use the find_all method to search for items by class or by id. The keyword argument is spelled class_ because class is a reserved word in Python. In the example below, we'll search for any p tag that has the class outer-text:

soup.find_all('p', class_='outer-text')
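The same call pattern covers ids as well. The sketch below runs both searches against a hypothetical SAMPLE_HTML string; the class names and ids in it are assumptions chosen to match the outer-text example, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with classes and ids, used only for this illustration.
SAMPLE_HTML = """<html><body>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Search by class: matches any p tag whose class list includes outer-text.
outer_paragraphs = soup.find_all("p", class_="outer-text")

# Search by id: matches any tag whose id attribute equals "first".
first_by_id = soup.find_all(id="first")

print([p.get_text() for p in outer_paragraphs])
print([tag.get_text() for tag in first_by_id])
```

A tag with several classes still matches when any one of them equals the value passed to class_, which is why the third paragraph is included in the class search.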