Introduction to web scraping in Python using Beautiful Soup

14.02.2019 - Jay M. Patel - Reading time ~5 Minutes

The first step for any web scraping project is getting the webpage you want to parse. There are many Python libraries for requesting pages over HTTP, such as urllib, urllib2, and urllib3, but none of them match the elegance of the requests library, which we have been using in earlier posts on REST APIs and will continue to use here. Before we get into the workings of Beautiful Soup, let us first get a basic understanding of HTML structure, common tags, and style sheets.

Introduction to HTML documents

Let us work through an example: go to the Wikipedia page for a machine learning algorithm called the support vector machine (SVM). Right-click anywhere on the page and click “Inspect” if you are using Google Chrome as your browser, or click “View Page Source” if you are using Mozilla Firefox.

If you prefer, you can use the requests module to fetch the same HTML page, as shown below.

import requests

test_url = "https://en.wikipedia.org/wiki/Support-vector_machine"
r = requests.get(test_url)
r.text

You will see a bunch of code; don’t be intimidated. Let us go through it and decode the HTML tags using the rules outlined below.

  • Every HTML document is enclosed in <html>...</html> tags, with <!DOCTYPE html> usually at the start of the document
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
...............
  • <h1>...</h1> to <h6>...</h6> tags are used for headings
<h1 id="firstHeading" class="firstHeading" lang="en">Support-vector machine</h1>

The other important tags are:

  • <div>...</div> to indicate a division in an HTML document, generally used to group a set of elements

  • <p>...</p> to enclose a paragraph

  • <br> to set a line break

  • <table>...</table> to start a table block

    • <tr>...</tr> is used for the rows
    • <td>...</td> is used for individual cells
  • <img> for images

  • <a>...</a> for hyperlinks;

  • <ul>...</ul>, <ol>...</ol> for unordered and ordered lists respectively; inside of these, <li>...</li> is used for each list item
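To see how these tags nest in practice, here is a small sketch using Python’s built-in html.parser module on an invented document assembled from the elements above; it simply records every opening tag it encounters.

```python
from html.parser import HTMLParser

# A small invented HTML document using the tags described above
doc = """<!DOCTYPE html>
<html>
<head><title>Example</title></head>
<body>
<h1>Heading</h1>
<div>
  <p>A paragraph with a <a href="https://example.com">link</a>.</p>
  <ul><li>first item</li><li>second item</li></ul>
</div>
</body>
</html>"""

class TagCollector(HTMLParser):
    """Collects the name of every opening tag encountered."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(doc)
print(collector.tags)
# → ['html', 'head', 'title', 'body', 'h1', 'div', 'p', 'a', 'ul', 'li', 'li']
```

Note that the <!DOCTYPE html> declaration is not a tag, so it does not show up in the list.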

Introduction to Cascading Style Sheets (CSS)

Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document such as layout, colors, and fonts written in a markup language like HTML.

In the section above, you may have noticed the terms “id” and “class”, such as in this h1 tag:

<h1 id="firstHeading" class="firstHeading" lang="en">Support-vector machine</h1>
  • id: a unique identifier representing a single tag within the document
  • class: an identifier that can annotate multiple elements in a document; its value is a space-separated list of CSS class names

Classes and ids are case-sensitive, must start with a letter, and can include alphanumeric characters, hyphens, and underscores. A class may apply to any number of elements, whereas an id may only be applied to a single element.

There are three ways to apply CSS styles to HTML pages.

  • Inline, inside a regular HTML tag, as in the fragment below. For example, you can apply a style to change the font color with <p style="color:green;">...</p>.
</tr><tr><td style="padding:0 0.1em 0.4em">
  • You can create a separate CSS file and link it with a <link> tag inside the main <head> of the HTML document; the browser will go out and request the CSS file whenever the page is loaded.

  • Style can also be applied inside of <style>...</style> tags, placed inside the <head> tag of a page.

CSS selectors define the patterns used to “select” the HTML elements you want to style. From a web scraping perspective, they are essential: they let you capture specific text enclosed within a matching element without taking in all the other boilerplate text.

  • tagname selects all elements with that tag name. For example, h2 matches all <h2> elements on a page.

  • .classname selects all elements having that class name.

  • #myid selects the element with the given id. Recall from the last section that an id is unique and applies to only one element, in contrast to tag names and class names.
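Beautiful Soup supports these selectors directly through its select() method. The following is a minimal sketch on an invented HTML fragment showing all three selector types side by side:

```python
from bs4 import BeautifulSoup

# Invented HTML fragment to demonstrate the three selector types
html = """
<h2 id="intro" class="section-title">Introduction</h2>
<h2 class="section-title">Methods</h2>
<p class="note">A note paragraph.</p>
"""
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('h2')))                # tagname: matches both h2 elements
print(len(soup.select('.section-title')))    # .classname: matches both headings
print(soup.select('#intro')[0].get_text())   # #myid: the single element with that id
# → 2
# → 2
# → Introduction
```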

Please check out this for more information on how to use CSS selectors.

Web scraping with Beautiful Soup

Beautiful Soup is a Python library designed to pull data out of HTML and XML files. Let us start with a simple example: requesting the same Wikipedia page on support vector machines as above and processing it with the Beautiful Soup library.

from bs4 import BeautifulSoup
import requests

test_url = "https://en.wikipedia.org/wiki/Support-vector_machine"
r = requests.get(test_url)
html_response = r.text
soup = BeautifulSoup(html_response,'html.parser')

print(soup.find('h1'))

print(soup.find('h1').get_text())
print(soup.find('h1').text)

print(soup.find('h1').contents)
print(soup.find('h1').attrs)

print(soup.find('h1').name)
Out:
<h1 class="firstHeading" id="firstHeading" lang="en">Support-vector machine</h1>
Support-vector machine
Support-vector machine
['Support-vector machine']
{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}
h1

In the example above, we used Python’s built-in html.parser, but Beautiful Soup also supports other parsers such as lxml, lxml-xml, and html5lib; each of them has different strengths and weaknesses, so please refer to the official documentation for a more complete discussion.

Once you have a soup object, you can filter the results using the find() method, and get only the text contained in a tag with .text or .get_text().
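find() always returns the first matching tag (or None if nothing matches), and it accepts attribute filters as well; because class is a reserved word in Python, the keyword is class_. A minimal sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

# Invented snippet: find() returns the first matching tag only
html = '<p class="lead">First</p><p>Second</p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p')                # first <p> in document order
lead = soup.find('p', class_='lead')  # filter by CSS class (note the trailing underscore)
print(first.get_text())                   # → First
print(lead is first)                      # → True
print(soup.find('p', class_='missing'))   # → None when nothing matches
```

Checking for None before calling .get_text() is a good habit, since scraped pages often lack the element you expect.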

Another very useful method is find_all(), which returns a list of all tags matching the given criteria. A quick example of how to use it is shown below.

for tag in soup.find_all(['h1', 'h2']):
    print(tag.get_text())
    
Out:
Support-vector machine
Contents
Motivation[edit]
Definition[edit]
Applications[edit]
History[edit]
Linear SVM[edit]
Nonlinear classification[edit]
Computing the SVM classifier[edit]
Empirical risk minimization[edit]
Properties[edit]
Extensions[edit]
Implementation[edit]
See also[edit]
References[edit]
Bibliography[edit]
External links[edit]
Navigation menu

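find_all() also accepts attribute filters, which is handy for tasks like collecting all the hyperlinks on a page. A minimal sketch on an invented fragment (the wiki-style paths are made up for illustration):

```python
from bs4 import BeautifulSoup

# Invented fragment with anchors, one of which lacks an href
html = """
<a href="/wiki/Kernel_method">Kernel method</a>
<a>anchor without href</a>
<a href="/wiki/Margin_(machine_learning)">Margin</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only anchors that actually carry an href attribute
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
# → ['/wiki/Kernel_method', '/wiki/Margin_(machine_learning)']
```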