Web scraping and parsing with Beautiful Soup 4 Introduction

Welcome to a tutorial on web scraping with Beautiful Soup 4. Beautiful Soup is a Python library that helps programmers pull data out of the web pages they scrape.

To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4. Beautiful Soup also relies on a parser; this series uses lxml. You may already have it, but you should check (open IDLE and attempt to import lxml). If not, run $ pip install lxml or $ apt-get install python-lxml.
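
If you'd rather check from a script than from IDLE, a minimal sketch like this works (the printed messages are just for illustration):

try:
    import lxml  # we only need to know whether this import succeeds
    print('lxml is installed')
except ImportError:
    print('lxml is missing - install it with: pip install lxml')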

To begin, we need HTML. I have created an example page for us to work with.

Next, we import Beautiful Soup and urllib, and grab the page's source code:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
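
Note that urlopen raises an exception if the page can't be fetched (no network, a 404, and so on). A minimal sketch of guarding against that, assuming you just want a readable error message:

import urllib.error
import urllib.request

try:
    source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
except urllib.error.URLError as e:
    # URLError covers HTTP errors and network failures alike
    print('Could not fetch the page:', e)
    raise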

Then, we create the "soup." This is a Beautiful Soup object:

soup = bs.BeautifulSoup(source, 'lxml')
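
A quick sanity check shows what we are working with at this point:

print(type(source))  # <class 'bytes'> - the raw response data
print(type(soup))    # <class 'bs4.BeautifulSoup'> - a parsed object we can query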

If you print(soup) and print(source), they look the same, but the source is just the raw response data, while the soup is an object that we can actually interact with by tag, like so:

# title of the page
print(soup.title)

# just the name of the tag:
print(soup.title.name)

# just the text inside the tag:
print(soup.title.string)

# beginning navigation:
print(soup.title.parent.name)

# grabbing a specific tag (the first <p> on the page):
print(soup.p)
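
If what you want are a tag's actual HTML attributes (class, id, href, and so on), Beautiful Soup exposes them like a dictionary. A quick sketch, noting that the first <p> on the example page may or may not have any attributes set:

print(soup.p.attrs)     # every attribute of the first <p>, as a dict (possibly empty)
print(soup.p.get('id')) # a single attribute's value, or None if it isn't set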

Finding paragraph tags <p> is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?

print(soup.find_all('p'))
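
find_all() returns a list-like collection of matching tags, so you can count them or index into the results:

paragraphs = soup.find_all('p')
print(len(paragraphs))  # how many <p> tags the page contains
print(paragraphs[0])    # the first match - the same tag soup.p gives us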

We can also iterate through them:

for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(paragraph.text)

The difference between string and text is that .string gives you a NavigableString object, while .text is just plain unicode text. Note that if the paragraph we call .string on contains child tags, we get None back instead.
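
Here is a tiny illustration of that behavior, using a made-up snippet of markup rather than the example page:

snippet = bs.BeautifulSoup('<p>Hello <b>world</b></p>', 'lxml')
print(snippet.p.string)  # None - the <p> holds more than a single string
print(snippet.p.text)    # 'Hello world' - all of the text, child tags included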

Another common task is to grab links. For example:

for url in soup.find_all('a'):
    print(url.get('href'))

In this case, if we just grabbed the .text from the tag, we'd get the anchor text, but we actually want the link itself. That's why we use .get('href') to pull out the actual URL.
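
To make the distinction concrete, here is a small sketch with a made-up link (not one from the example page):

link = bs.BeautifulSoup('<a href="https://example.com">click here</a>', 'lxml').a
print(link.text)         # 'click here' - the visible anchor text
print(link.get('href'))  # 'https://example.com' - the URL behind it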

Finally, you may just want to grab text. You can use .get_text() on a Beautiful Soup object, including the full soup:

print(soup.get_text())
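
.get_text() also works on any individual tag, and it accepts optional separator and strip arguments if you want tidier output, for example:

print(soup.body.get_text(separator='\n', strip=True))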

This concludes the introduction to Beautiful Soup. In the next tutorial, we're going to cover navigating a page's elements to get more specifically the data you want.

The next tutorial:

  • Web scraping and parsing with Beautiful Soup 4 Introduction
  • Navigation with Beautiful Soup 4
  • Parsing tables and XML with Beautiful Soup 4
  • Scraping Dynamic Javascript Text