Web Scraping Using Selenium And Beautifulsoup Python




This post is part 1 of the 'Advanced Scraping' series.

The Python documentation, Wikipedia, and most blogs (including this one) use static content. When we request the URL, the final HTML is returned to us. If that's the case, then a parser like BeautifulSoup is all you need. A short example of scraping a static page is demonstrated below. I have an overview of BeautifulSoup here.

A site with dynamic content is one where requesting the URL returns incomplete HTML. The HTML includes JavaScript for the browser to execute, and only once the JavaScript finishes running is the HTML in its final state. This is common for sites that update frequently. For example, weather.com would use JavaScript to look up the latest weather, and an Amazon webpage would use JavaScript to load the latest reviews from its database. If you use a parser on a dynamically generated page, you get a skeleton of the page with the unexecuted JavaScript still in it.

This post will outline different strategies for scraping dynamic pages.

Among the most popular approaches to web scraping is sending an HTTP request to a webpage, ordinarily via Requests, and then parsing the returned HTML (ordinarily using BeautifulSoup) to access the desired information. Typical use case: a standard web scraping problem.

An example of scraping a static page

We will use the following libraries:

  • Selenium: a web testing library used to automate browser activities.
  • BeautifulSoup: a Python package for parsing HTML and XML documents. It builds parse trees that make it easy to extract data.

Let's start with an example of scraping a static page. This code demonstrates how to get the Introduction section of the Python style guide, PEP8:
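A minimal sketch of that request-then-parse flow is given here. It assumes the requests and beautifulsoup4 packages are installed. The URL and the element id "introduction" are assumptions about how the PEP 8 page is laid out today, and the function names are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumed current location of PEP 8; adjust if the page has moved.
PEP8_URL = "https://peps.python.org/pep-0008/"


def extract_section(html, section_id):
    """Return the text of the element with the given id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find(id=section_id)
    return section.get_text() if section is not None else None


def fetch_introduction(url=PEP8_URL):
    """Download the page and return the text of its Introduction section."""
    # Imported here so extract_section() can be used without requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    # "introduction" is an assumed element id; inspect the page to confirm it.
    return extract_section(response.text, "introduction")
```

With network access, print(fetch_introduction()) performs the live request and shows the section text.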

This prints

Introduction
This document gives coding conventions for the Python code comprising
the standard library in the main Python distribution. Please see the
companion informational PEP describing style guidelines for the C code
in the C implementation of Python [1].
...

Voilà! If all you have is a static page, you are done!

The straightforward way to scrape a dynamic page


The easiest way of scraping a dynamic page is to actually execute the JavaScript and allow it to alter the HTML to finish the page. We can then pass the rendered (i.e. finalized) HTML to Python and use the same parsing techniques we used on static sites. The Python module Selenium allows us to control a browser directly from Python. The steps to parse a dynamic page using Selenium are:

  1. Initialize a driver (a Python object that controls a browser window)
  2. Direct the driver to the URL we want to scrape.
  3. Wait for the driver to finish executing the JavaScript and changing the HTML. The driver is typically a Chrome driver, so the page is treated the same way as if you were visiting it in Chrome.
  4. Use driver.page_source to get the HTML as it appears after JavaScript has rendered it.
  5. Use a parser on the returned HTML.

The website https://webscraper.io has some fake pages to test scraping on. Let's use Selenium on the page https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops to get the product name and the price for the six items listed on the first page. The listings are randomly generated; at the time of writing the products were an Asus VivoBook (295.99), two Prestigio SmartBs (299 each), an Acer Aspire ES1 (306.99), and two Lenovo V110s (322 and 356).

Once the HTML has been rendered by Selenium, each item has a div with class caption that contains the information we want. The product name is in a nested element with class title, and the price is in a nested element with the classes pull-right and price. Here is code for scraping the product names and prices:
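A sketch of how the five steps could look in practice follows. It assumes Selenium 4 with a local Chrome installation and the beautifulsoup4 package. The function names, the ten-second timeout, and the choice to wait for the first div.caption are illustrative assumptions rather than details from the original listing.

```python
from bs4 import BeautifulSoup

LAPTOPS_URL = "https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops"


def get_rendered_html(url, wait_css="div.caption", timeout=10):
    """Steps 1-4: open the page in Chrome and return the HTML after JavaScript has run."""
    # Selenium is imported here so parse_products() can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # step 1: initialize a driver
    try:
        driver.get(url)  # step 2: direct the driver to the URL
        # step 3: wait until at least one element matching wait_css is in the DOM
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source  # step 4: the rendered HTML
    finally:
        driver.quit()  # always close the browser window


def parse_products(html):
    """Step 5: return (name, price) pairs found in the rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for caption in soup.select("div.caption"):
        title = caption.select_one(".title")
        price = caption.select_one(".pull-right.price")
        if title is None or price is None:
            continue  # skip captions that do not match the assumed layout
        products.append((title.get_text(strip=True), price.get_text(strip=True)))
    return products
```

Calling parse_products(get_rendered_html(LAPTOPS_URL)) ties the two halves together. Keeping the parser separate from the browser code means it can be tried on saved HTML without launching Chrome.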

Trying to scrape a dynamic site using requests


What would happen if we tried to load this e-commerce site using requests? That is, what if we didn't know it was a dynamic site?

The HTML we get back can be a little difficult to read directly. If you are using a terminal, you can save the response body (for example r.text, if r is a requests response) to a file and then open that file in a browser. If you are using a Jupyter notebook, you can actually use a neat trick to render the output in your browser:

The output in the notebook is an empty list, because JavaScript hasn't generated the items yet.
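The empty result can be reproduced offline with a small sketch. The HTML string below is a made-up stand-in for what a plain HTTP request might return from a dynamic page, and find_captions is an illustrative helper, not code from the original post.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the unrendered response: an empty container plus a
# script tag, with no product markup yet.
UNRENDERED_HTML = """
<html>
  <body>
    <div id="products"></div>
    <script src="/js/load-products.js"></script>
  </body>
</html>
"""


def find_captions(html):
    """Return every div with class 'caption' found in the given HTML."""
    return BeautifulSoup(html, "html.parser").find_all("div", class_="caption")


captions = find_captions(UNRENDERED_HTML)
print(captions)  # prints [] because the product divs do not exist until the script runs
```

Against the live site, the same helper could be fed requests.get(url).text. In a Jupyter notebook, one way to view the raw response is IPython.display.HTML(r.text), which renders the markup inline.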

Using Selenium is an (almost) sure-fire way of being able to generate any of the dynamic content that you need, because the pages are actually visited by a browser (albeit one controlled by Python rather than you). If you can see it while browsing, Selenium will be able to see it as well.

There are some drawbacks to using Selenium over pure requests:

  • It's slow.

We have to wait for pages to render, rather than just grabbing the data we want.

  • We have to download images and assets, using bandwidth

Related to the previous point, even if we are just parsing for text, our browser will download all ads and images on the site.

  • Chrome takes a lot of memory

When scraping, we might want to have parallel scrapers running (e.g. one for each category of items on an e-commerce site) to allow us to finish faster. If we use Selenium, we will have to have enough memory to have multiple copies running.


  • We might not need to parse

Often sites will make API calls to get the data in a nicely formatted JSON object, which is then processed by JavaScript into HTML elements. When using a parser such as BeautifulSoup, we are reading in the HTML elements and trying to reconstruct the original data. It would be a lot slicker (and less error-prone) if we were able to get the JSON objects directly.

  • Selenium (like parsing) is often tedious and error-prone


The bad news for using the alternative methods is that there are so many different ways of loading data that no single technique is guaranteed to work. The biggest advantage Selenium has is that it uses a browser, and with enough care, should be indistinguishable from you browsing the web yourself.
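To illustrate the "we might not need to parse" point from the list above, here is a small sketch of reading product data straight from a JSON payload. The payload and its field names are hypothetical; a real site's API will have its own structure.

```python
import json

# Hypothetical response body from a product API; the field names are
# illustrative only and do not come from any real site.
API_RESPONSE = """
{"products": [
    {"name": "Asus VivoBook", "price": 295.99},
    {"name": "Acer Aspire ES1", "price": 306.99}
]}
"""


def products_from_json(body):
    """Return (name, price) pairs straight from a JSON payload."""
    data = json.loads(body)
    return [(item["name"], item["price"]) for item in data["products"]]


print(products_from_json(API_RESPONSE))
```

With a live endpoint, requests.get(api_url).json() would return the same kind of dictionary, so no HTML parsing is involved at all.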

Other techniques

This is the first in a series of articles that will look at other techniques to get data from dynamic webpages. Because scraping requires a custom approach to each site we scrape, each technique will be presented as a case study. The examples will be detailed enough to enable you to try the technique on other sites.

Technique | Description | Examples
Schema or OpenGraph metadata | OpenGraph is a standard for allowing sites like Facebook to easily find what your page is 'about'. We can scrape the relevant data directly from these tags. | ??? Need example ???
JSON for Linking Data | This is a standard for putting JSON inside script tags. | Yelp
XHR | Use the same API requests that the browser does to get the data. | Sephora lipsticks, Apple jobs
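As a rough preview of the first two techniques in the table, the sketch below pulls OpenGraph properties and JSON for Linking Data blocks out of a page with BeautifulSoup. The sample HTML is made up for illustration, and the helper names are assumptions; real pages may carry different properties or several JSON blocks.

```python
import json

from bs4 import BeautifulSoup

# Made-up page head carrying both kinds of metadata.
SAMPLE_HTML = """
<html><head>
  <meta property="og:title" content="Example Cafe">
  <meta property="og:type" content="restaurant">
  <script type="application/ld+json">
    {"@type": "Restaurant", "name": "Example Cafe", "telephone": "555-0100"}
  </script>
</head><body></body></html>
"""


def opengraph_tags(html):
    """Return a dict mapping OpenGraph property names to their content."""
    soup = BeautifulSoup(html, "html.parser")
    tags = {}
    for meta in soup.find_all("meta"):
        prop = meta.get("property", "")
        if prop.startswith("og:") and meta.has_attr("content"):
            tags[prop] = meta["content"]
    return tags


def json_ld_blocks(html):
    """Return the parsed contents of every JSON-LD script tag."""
    soup = BeautifulSoup(html, "html.parser")
    scripts = soup.find_all("script", attrs={"type": "application/ld+json"})
    return [json.loads(script.string) for script in scripts if script.string]
```

Both helpers work on the raw response, so a plain requests call is enough whenever a site embeds its data this way.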

Selenium summary


Here is the short list of pros and cons of using Selenium to scrape dynamic sites.


Pros:

  • Will work

Cons:

  • Slow
  • Bandwidth and memory intensive
  • Requires error-prone parsing



