To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
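from bs4 import BeautifulSoup
# Open the file first: a string argument is treated as markup, not as a filename
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')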
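Beautiful Soup then parses the document. The second argument names the parser to use, here Python's built-in html.parser. If you leave it out, Beautiful Soup picks the best available parser that is installed, and it will use an HTML parser unless you specifically tell it to use an XML parser.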
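As a minimal sketch of the two input styles mentioned above, the following parses the same markup once from a string and once from a file-like object. The markup and variable names are made up for illustration, and io.StringIO stands in for a real file opened with open().

```python
import io

from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
markup = "<html><body><p>Hello</p></body></html>"

# 1) Pass the markup itself as a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) Pass an open file-like object; StringIO stands in for open("index.html").
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.p.get_text())
print(soup_from_file.p.get_text())
```

Both calls should produce the same tree. A string is always treated as markup rather than as a path, so a file on disk has to be opened before it is passed in.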
We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
print(soup.prettify())
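To see what prettify does, here is a small sketch on a one-line document. The markup is a made-up example, not the page used in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup, used only for illustration.
compact = "<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(compact, "html.parser")

# prettify() returns a string; tags and text are typically placed on
# separate lines and indented according to their nesting depth.
pretty = soup.prettify()
print(pretty)

# Strip the indentation to look at the individual lines.
lines = [line.strip() for line in pretty.splitlines()]
```

Because prettify adds whitespace, it is best suited to inspecting a document by eye rather than to producing output for further processing.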
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it:
list(soup.children)
The above tells us that there are two elements at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag. There is a newline character (\n) in the list as well.
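One way to check what each top-level item is, sketched here on a made-up minimal page rather than the article's page, is to print the class name of every child. With html.parser this should show a Doctype, a NavigableString for the newline, and a Tag.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page, not the one used in the article.
markup = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# .children is an iterator, so materialize it as a list first.
top_level = list(soup.children)

# Record the class of each top-level item.
type_names = [type(item).__name__ for item in top_level]
print(type_names)
```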
We can now select the html tag and its children by taking the third item in the list:
html = list(soup.children)[2]
Each tag in the list returned by the children property is a Tag object, which supports the same navigation as the BeautifulSoup object, so we can also use the children property on html.
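Since the children of a tag mix real tags with whitespace strings, picking items by position can be fragile. A possible alternative, sketched below on a made-up document, keeps only the Tag instances before looking at them.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the tags.
markup = "<html>\n<head><title>T</title></head>\n<body><p>Hello</p></body>\n</html>"
soup = BeautifulSoup(markup, "html.parser")

# The markup starts with <html>, so it is the first top-level child here.
html = list(soup.children)[0]

# Keep only real tags, dropping the whitespace strings between them.
child_tags = [child for child in html.children if isinstance(child, Tag)]
child_names = [tag.name for tag in child_tags]
print(child_names)
```

This is only one way to skip the newline entries; indexing as shown in the article works equally well when the layout of the page is known.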
Now, we can find the children inside the html tag:
list(html.children)
As you can see above, there are two tags here, head and body. We want to extract the text inside the p tag, so we'll dive into the body:
body = list(html.children)[3]
Now, we can get the p tag by finding the children of the body tag:
list(body.children)
We can now isolate the p tag:
p = list(body.children)[1]
Once we've isolated the tag, we can use the get_text method to extract all of the text inside the tag:
p.get_text()
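The steps above can be sketched end to end as follows. The markup is a made-up stand-in with the same shape as the page described in the text (a doctype, then html containing head and body, each on its own line), so the indices 2, 3 and 1 line up; a page laid out differently would need different indices.

```python
from bs4 import BeautifulSoup

# Hypothetical page with the same shape as the one described in the text.
markup = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head>\n<title>A simple example page</title>\n</head>\n"
    "<body>\n<p>Here is some simple content for this page.</p>\n</body>\n"
    "</html>"
)
soup = BeautifulSoup(markup, "html.parser")

html = list(soup.children)[2]  # items: doctype, newline, html
body = list(html.children)[3]  # items: newline, head, newline, body, newline
p = list(body.children)[1]     # items: newline, p, newline

# get_text() returns the text inside the tag as a plain string.
text = p.get_text()
print(text)
```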
Finding all instances of a tag at once
What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract every instance of a particular tag, we can instead use the find_all method, which will find all the instances of a tag on a page.
# page is assumed to be a response object (for example from requests) fetched earlier
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
find_all returns a list, so we'll have to loop through it, or use list indexing, to extract text:
soup.find_all('p')[0].get_text()
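As a short sketch of both options, the following loops over the result of find_all and also indexes into it. The markup is invented for illustration and includes a nested p tag to show that find_all searches the whole tree, not just the top level.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one top-level and one nested paragraph.
markup = "<body><p>First paragraph.</p><div><p>Second paragraph.</p></div></body>"
soup = BeautifulSoup(markup, "html.parser")

paragraphs = soup.find_all("p")

# Loop over every match and collect its text.
texts = [p.get_text() for p in paragraphs]
print(texts)

# Or index into the result to get one specific match.
first_text = paragraphs[0].get_text()
print(first_text)
```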
If you instead only want to find the first instance of a tag, you can use the find method, which will return a single Tag object, or None if nothing matches:
soup.find('p')
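Because find returns None when there is no match, calling get_text directly on its result can raise an AttributeError. A minimal sketch of guarding against that, using made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a p tag but no table tag.
soup = BeautifulSoup("<body><p>Only paragraph.</p></body>", "html.parser")

first_p = soup.find("p")      # a Tag, since a p tag exists
missing = soup.find("table")  # None, since no table tag exists

# Check for None before using the result.
p_text = first_p.get_text() if first_p is not None else ""
table_text = missing.get_text() if missing is not None else ""
print(repr(p_text), repr(table_text))
```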
Searching for tags by class and id
Now, we can use the find_all method to search for items by class or by id. In the example below, we'll search for any p tag that has the class outer-text:
soup.find_all('p', class_='outer-text')
We can also search for elements by id:
soup.find_all(id="first")
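To make the two searches concrete, here is a sketch on an invented document. The class names and ids follow the ones used in the text, but the markup itself is an assumption, not the article's page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class and id values come from the text.
markup = """
<body>
  <p class="inner-text first-item" id="first">First inner paragraph.</p>
  <p class="outer-text first-item" id="second">First outer paragraph.</p>
  <p class="outer-text">Second outer paragraph.</p>
</body>
"""
soup = BeautifulSoup(markup, "html.parser")

# class_ (with a trailing underscore, since class is a Python keyword)
# matches tags that have outer-text among their classes.
outer = soup.find_all("p", class_="outer-text")
outer_texts = [p.get_text() for p in outer]
print(outer_texts)

# Searching by id alone matches any tag with that id.
by_id = soup.find_all(id="first")
by_id_texts = [tag.get_text() for tag in by_id]
print(by_id_texts)
```

Note that the second paragraph matches even though it carries two classes, because a class search matches against each of a tag's classes individually.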