Modern Python Web Scraping Using Multiple Libraries

In this post, we will talk about Python web scraping and how to scrape web pages using multiple libraries such as BeautifulSoup and Selenium, along with other tools like PhantomJS.

You’ll learn how to scrape static web pages, Ajax-loaded content, and iframes, how to handle cookies, and much more.

What is Web Scraping

Web scraping is the process of extracting data from the web so you can analyze it and pull out useful information.

Also, you can store the scraped data in a database or any kind of tabular format such as CSV or XLS, so you can access that information easily.
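For example, here is a minimal sketch of saving scraped rows to a CSV file with Python's standard csv module (the rows below are placeholders):

import csv

# hypothetical scraped data: (title, url) pairs
rows = [("Post title", "https://example.com/post")]

with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])  # header row
    writer.writerows(rows)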

The scraped data can be passed to a library like NLTK for further processing to understand what the page is talking about.

In short, web scraping is downloading the web data in a human-readable format so you can benefit from it.

Benefits of Web Scraping

You might wonder: why should I scrape the web when I have Google? Well, we aren’t reinventing the wheel here. Web scraping is not only for creating search engines.

You can scrape your competitors’ web pages and analyze the data to see what kind of products their clients are happy with, based on their responses. All this for FREE.

A successful SEO tool like Moz scrapes and crawls the entire web and processes the data for you, so you can see what people are interested in and how to compete with others in your field to stay on top.

These are just some simple uses of web scraping. Scraped data means money :).

How to Use BeautifulSoup

I assume that you have some background in Python basics, so let’s install our first Python web scraping library, which is BeautifulSoup.

To install BeautifulSoup, you can use pip or you can install it from source.

I’ll install it using pip like this:

$ pip install beautifulsoup4

To check if it’s installed or not, open your editor and type the following:

from bs4 import BeautifulSoup

Then run it:

$ python myfile.py

If it runs without errors, that means BeautifulSoup was installed successfully. Now, let’s see how to use it.

Your First Web Scraper

Take a look at this simple example; we will extract the page title using BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.python.org/")
res = BeautifulSoup(html.read(), "html5lib")
print(res.title)

The result is the page’s title tag, e.g. <title>Welcome to Python.org</title>.

We use the urlopen function to connect to the web page we want, then read the returned HTML using the html.read() method.

The returned HTML is transformed into a BeautifulSoup object, which has a hierarchical structure.

That means if you need to extract any HTML element, you just need to know its surrounding tags to get it, as we will see later.

Handling HTTP Exceptions

urlopen may return an error for any number of reasons: 404 if the page is not found, or 500 if there is an internal server error. So we need to keep the script from crashing by using exception handling like this:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://www.python.org/")
except HTTPError as e:
    print(e)
else:
    res = BeautifulSoup(html.read(), "html5lib")
    print(res.title)

Great, what if the server is down or you typed the domain incorrectly?

Handling URL Exceptions

We need to handle this kind of exception as well. The exception is URLError, so our code will be like this:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://www.python.org/")
except HTTPError as e:
    print(e)
except URLError:
    print("Server down or incorrect domain")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    print(res.title)

Well, the last thing we need to check for is the returned tag. If you type an incorrect tag name or try to find a tag that is not on the scraped page, BeautifulSoup returns None, so you need to check for a None object.

This can be done using a simple if statement like this:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://www.python.org/")
except HTTPError as e:
    print(e)
except URLError:
    print("Server down or incorrect domain")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    if res.title is None:
        print("Tag not found")
    else:
        print(res.title)

Great, our scraper is doing a good job. Now we are able to scrape the whole page or a specific tag.

But what about digging deeper?

BeautifulSoup Find by Class

Now let’s try to be selective by scraping some HTML elements based on their CSS classes.

The BeautifulSoup object has a function called findAll (an alias of find_all), which extracts or filters elements based on their attributes.

We can filter all h3 elements whose class is “post-title” like this:

tags = res.findAll("h3", {"class": "post-title"})

Then we can use a for loop to iterate over them and do whatever we want with them.

So our code will be like this:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://likegeeks.com/")
except HTTPError as e:
    print(e)
except URLError:
    print("Server down or incorrect domain")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    tags = res.findAll("h3", {"class": "post-title"})
    for tag in tags:
        print(tag.getText())

This code returns all h3 tags with the class post-title; these tags are the home page post titles.

We use the getText function to print only the inner text of the tag; if you don’t use getText, you’ll end up with the full tags and everything inside them.

Check the difference between the two: with getText() you get the plain titles, and without it you get the complete h3 tags.
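A quick sketch of the difference (the output in the comments is illustrative):

tag = tags[0]
print(tag.getText())  # e.g. Python Programming Basics with Examples
print(tag)            # e.g. <h3 class="post-title"><a href="...">Python Programming Basics with Examples</a></h3>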

Beautifulsoup findAll Examples

We saw how the findAll function filters tags by class, but that’s not everything.

To filter by a list of tags, replace the findAll line of the above example with the following line:

tags = res.findAll("span", "a" "img")

This code gets all span, anchor, and image tags from the scraped HTML.

Also, you can extract tags that have these classes:

tags = res.findAll("a", {"class": ["url", "readmorebtn"]})

This code extracts all anchor tags that have either the “url” or the “readmorebtn” class.

You can filter the content based on the inner text itself using the text argument like this:

tags = res.findAll(text="Python Programming Basics with Examples")

The findAll function returns all elements that match the specified attributes, but if you want to return one element only, you can use the limit parameter or use the find function, which returns the first element only.
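For example, a quick sketch of both approaches, reusing the tag and class from the earlier example:

# return at most two matching h3 tags
tags = res.findAll("h3", {"class": "post-title"}, limit=2)

# return only the first matching h3 tag (or None if nothing matches)
tag = res.find("h3", {"class": "post-title"})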

Find nth Child Using BeautifulSoup

The BeautifulSoup object has many powerful features; you can get child elements directly like this:

tags = res.span.findAll("a")

This line gets the first span element in the BeautifulSoup object, then scrapes all anchor elements under that span.

What if you need to get the nth child?

You can use the select function like this:

tag = res.find("nav", {"id": "site-navigation"}).select("a")[3]

This line gets the nav element with the id “site-navigation”, then grabs the fourth anchor tag from that nav element.

BeautifulSoup is a powerful library!!

Find Tags using Regex

In a previous post, we talked about regular expressions and saw how powerful it is to use regex to identify common patterns such as emails, URLs, and much more.

Luckily, BeautifulSoup has this feature; you can pass regex patterns to match specific tags.

Imagine that you want to scrape links that match a specific pattern, like internal links or specific external links, or to scrape images that reside in a specific path.

The regex engine makes it easy to achieve such jobs.

import re

tags = res.findAll("img", {"src": re.compile(r"\.\./uploads/photo_.*\.png")})

These lines scrape all PNG images that reside in ../uploads/ and whose file names start with photo_.

I love Python web scraping. This is just a simple example to show you the power of regular expressions combined with Beautifulsoup.

Scraping JavaScript

Suppose the page you need to scrape has a loading page that redirects you to the required page without changing the URL, or that some pieces of the page load their content using Ajax.

Our scraper won’t load any of this content, since the scraper doesn’t run the JavaScript required to load it.

Your browser runs JavaScript and loads that content normally, and that’s exactly what we will do using a Python library called Selenium.

The Selenium library doesn’t include its own browser; you need to install a third-party browser (and its web driver) for it to work.

You can choose from Chrome, Firefox, Safari, or Edge.

If you install one of these drivers, let’s say Chrome, Selenium will open an instance of the browser and load your page; then you can scrape or interact with the page.

Scrape Pages Using the Selenium Chrome Driver

First, install the selenium library like this:

$ pip install selenium

Then download the Chrome driver from here and add it to your system PATH.
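If you prefer not to modify your PATH, the Selenium versions this post targets also accept the driver location directly; the path below is a placeholder:

from selenium import webdriver

# hypothetical location; point this at wherever you saved chromedriver
browser = webdriver.Chrome(executable_path="/path/to/chromedriver")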

Now you can load your page like this:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.python.org/")
nav = browser.find_element_by_id("mainnav")
print(nav.text)

The output is the text of the page’s main navigation menu.

Pretty simple, right?

We didn’t interact with page elements, so we haven’t seen the power of Selenium yet; just wait for it.

Selenium Web Scraping

You might like working with browser drivers, but many more people prefer to run code in the background without seeing the browser in action.

For this purpose, there is an awesome tool called PhantomJS that loads your page and runs your code without opening any browser.

PhantomJS enables you to interact with the scraped page’s cookies and JavaScript without a headache.

Also, you can use it just like BeautifulSoup to scrape pages and the elements inside those pages.

Download PhantomJS from here and put it in your PATH so we can use it as a web driver with Selenium.

Now, let’s use Selenium with PhantomJS the same way we used the Chrome web driver.

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("https://www.python.org/")
print(browser.find_element_by_class_name("introduction").text)
browser.close()

The result is the introduction text from the Python home page.

Awesome!! It works very well.

You can access elements in many ways such as:

browser.find_element_by_id("id")
 
browser.find_element_by_css_selector("#id")
 
browser.find_element_by_link_text("Click Here")
 
browser.find_element_by_name("Home")

All of these functions return only one element; you can return multiple elements by using the find_elements variants like this:

browser.find_elements_by_id("id")
browser.find_elements_by_css_selector("#id")
browser.find_elements_by_link_text("Click Here")
browser.find_elements_by_name("Home")

You can use the power of BeautifulSoup on the content returned by Selenium by using page_source like this:

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.PhantomJS()
browser.get("https://www.python.org/")
page = BeautifulSoup(browser.page_source, "html5lib")
links = page.findAll("a")

for link in links:
    print(link)

browser.close()

The result is a list of all the anchor tags on the page.

Do you feel the power of Python web scraping? Let’s see more.

Scrape iframe Content Using Selenium

Your scraped page may contain an iframe that contains data.

If you try to scrape a page that contains an iframe, you won’t get the iframe content; you need to scrape the iframe source.

You can use Selenium to scrape iframes by switching to the frame you want to scrape.

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
iframe = browser.find_element_by_tag_name("iframe")
browser.switch_to.default_content()
browser.switch_to.frame(iframe)
iframe_source = browser.page_source

print(iframe_source)        # returns the iframe source
print(browser.current_url)  # returns the iframe URL

If you check the current URL, you’ll see that it’s the iframe URL, not the original page’s URL.
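When you’re done with the frame, you can switch back to the main page:

browser.switch_to.default_content()  # return to the top-level page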

Scrape iframe Content Using BeautifulSoup

You can get the URL of the iframe using the find function, then scrape that URL.

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
except HTTPError as e:
    print(e)
except URLError:
    print("Server down or incorrect domain")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    tag = res.find("iframe")
    print(tag['src'])  # URL of the iframe, ready for scraping
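From here, a minimal sketch of actually scraping that URL, placed inside the same else block (this assumes the src is an absolute URL; a relative src would need to be joined with the page URL first):

    iframe_html = urlopen(tag['src'])
    iframe_res = BeautifulSoup(iframe_html.read(), "html5lib")
    print(iframe_res.title)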

Awesome!! This is the power of Python web scraping. You have many options.

Handle Ajax Calls Using Selenium + PhantomJS

You can use Selenium to scrape content after making your Ajax calls.

For example, clicking a button that fetches the content you need to scrape. Check the following example:

from selenium import webdriver
import time

browser = webdriver.PhantomJS()
browser.get("https://www.w3schools.com/xml/ajax_intro.asp")
browser.find_element_by_tag_name("button").click()
time.sleep(2)  # hard-coded wait
browser.get_screenshot_as_file("image.png")
browser.close()

Here we scrape a page that contains a button; we click that button, which fires the Ajax call that fetches the text, and then we save a screenshot of the page.

There is one little thing here; it’s about the wait time.

We assumed the page takes no more than 2 seconds to fully load, but that’s not a good solution: the server could take more time, or your connection could be slow, among many other reasons.

Wait for Ajax Calls to Complete (Explicit Wait)

The best solution is to check for the existence of an HTML element on the final page; if it exists, that means the Ajax call finished successfully.

Check this example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS()
browser.get("https://resttesttest.com/")
browser.find_element_by_id("submitajax").click()

try:
    element = WebDriverWait(browser, 10).until(
        EC.text_to_be_present_in_element((By.ID, "statuspre"), "HTTP 200 OK"))
finally:
    browser.get_screenshot_as_file("image.png")

browser.close()

Here we click an Ajax button which makes a REST call and returns the JSON result.

We wait, with a 10-second timeout, for the element with id “statuspre” to contain the text “HTTP 200 OK”, then save the result page as an image as shown.

You can check for many things like:

A URL change, using EC.url_changes()

A newly opened window, using EC.new_window_is_opened()

A change in the title, using EC.title_is()

If your page has redirections, you can check for a change in title or URL to detect them.

There are many conditions you can check for; we just took these examples to show how much power you have.
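For instance, a minimal sketch of two of these conditions, reusing the browser object from the example above (the URL and title are placeholders):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the browser navigates away from the given URL (placeholder)
WebDriverWait(browser, 10).until(EC.url_changes("https://www.example.com/start"))

# wait until the page title is exactly the expected one (placeholder)
WebDriverWait(browser, 10).until(EC.title_is("Expected Page Title"))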

Cool!!

Handling Cookies

Sometimes, it’s very important to take care of cookies for the site you are scraping.

Maybe you need to delete the cookies, or maybe you need to save them in a file and use them for later connections.

There are a lot of scenarios out there, so let’s see how to handle cookies.

To retrieve cookies for the currently visited site, you can call the get_cookies() function like this:

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("https://likegeeks.com/")
print(browser.get_cookies())

The result is a list of the site’s cookies, each one as a dictionary.

To delete cookies, you can use the delete_all_cookies() function like this:

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("https://likegeeks.com/")
browser.delete_all_cookies()
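To save cookies in a file and reuse them for later connections, as mentioned above, here is a minimal sketch using the standard pickle module (the file name is a placeholder):

import pickle
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("https://likegeeks.com/")

# save the current cookies to a file
with open("cookies.pkl", "wb") as f:
    pickle.dump(browser.get_cookies(), f)

# later, in a new session: visit the domain first, then restore the cookies
browser.get("https://likegeeks.com/")
with open("cookies.pkl", "rb") as f:
    for cookie in pickle.load(f):
        browser.add_cookie(cookie)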

Scraping vs. Crawling

We’ve talked about Python web scraping and how to parse web pages; now, some people get confused about scraping versus crawling.

Web scraping is about parsing web pages and extracting data from them for any purpose, as we saw.

Web crawling is about harvesting every link you find and crawling every one of them without limit, for the purpose of indexing, like Google and other search engines do.

Python web scraping is a lot of fun, but before we end our discussion, there are some tricky obstacles that may prevent you from scraping, like Google reCaptcha.

Google reCaptcha has become much harder now; you can’t find a good solution to rely on.

I hope you find the post useful. Keep coming back.

Thank you.

Published on Web Code Geeks with permission by Mokhtar Ebrahim, partner at our WCG program. See the original article here: Modern Python Web Scraping Using Multiple Libraries

Opinions expressed by Web Code Geeks contributors are their own.
