Scraping web pages with Python and Selenium

As the amount of information on the Web grows, a flurry of applications (mobile, web, or otherwise) has come about to harness it all. There are many ways to harness information on the web, but one of the most ubiquitous is ‘scraping’. Most search engines employ scraping in some form: their ‘spiders’ crawl the web looking for metadata embedded in websites. Price-comparison sites use it to help users make purchase decisions, and there are probably a gazillion other uses that I cannot even imagine.

In this article, I will give you a real quick tutorial on how to write a simple scraping application in Python. That is also to say that I assume you have a basic understanding of programming in Python, and some level of comfort with HTML.

What is Web Scraping?

Wikipedia provides a pretty darn workable definition of web scraping:

“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.”

Speaking specifically in terms of web pages, web scraping allows you to extract the metadata (tags, attributes, etc.), and the primary information (the ‘text’ of the tags, paragraphs of text, contents of tables, etc.) embedded in the HTML of a page. Although I could give you more precise and specific information here using the Document Object Model (DOM) of standard web pages, it behooves me to keep things simple in this tutorial. So do not worry if you do not know what the DOM is.

For this tutorial, I’ll take YouTube as an example, and show you how to search for stuff and get a nice list of all the video links on the results page.

Preparation

There are a few things you must ensure in your development environment:

  1. Python 3.x: I assume you have Python 3.x installed on your system. My code examples use Python 3.x syntax. If something prohibits you from installing Python 3.x, you could perhaps modify the code to work with Python 2.x; that should be simple enough to do with some Googling. I also recommend that you install Anaconda instead of the stock Python installation. The advantage is that Anaconda comes with a large set of Python packages, so you do not have to install them individually. That’s not a strict requirement, though; I’ll talk about how to install the packages we need for this tutorial anyway.
  2. Selenium: This is basically the only extra package that you need to install for this tutorial. Whatever OS you are on, just type the following command on a terminal/command-line window:
    pip install selenium

    This should install the selenium package into your Python library folder.

  3. Chrome: I used Google Chrome for all examples while preparing this tutorial. You could use Firefox if you wish and I’ll explain how you could do that when we get to the code.
  4. ChromeDriver: This is the final element you need to install. Head over to this link, and download the latest ChromeDriver (it will be a .zip file) for your OS. Extract the downloaded zip file, and copy the contents to any directory you wish (as long as you do not move that directory around). Then add the path to that directory to your OS’s ‘PATH’ environment variable, and you’re good to go! If you are on Windows, launch a new instance of the command-line app; if you are on Linux, just open a new shell. The new session will pick up the modified PATH variable.
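
If you would like to confirm the setup before going further, here is a minimal sketch that checks the items in the list above. The function name check_environment is my own, and the sketch assumes the driver executable is called chromedriver; it only looks for things and does not launch a browser.

```python
import importlib.util
import shutil
import sys


def check_environment(driver_name="chromedriver"):
    """Report whether each prerequisite from the list above looks satisfied."""
    return {
        # The examples in this tutorial assume Python 3.x.
        "python3": sys.version_info.major >= 3,
        # find_spec returns None when the package cannot be imported.
        "selenium": importlib.util.find_spec("selenium") is not None,
        # shutil.which searches the directories listed in PATH.
        "driver_on_path": shutil.which(driver_name) is not None,
    }


for name, ok in check_environment().items():
    print("{}: {}".format(name, "ok" if ok else "missing"))
```

If driver_on_path comes back as missing right after you edited PATH, open a new terminal and run the check again.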

Let’s get to it

Ok, fire up Chrome, and open YouTube. Enter the query “guitar lessons” in the search field at the very top of the page and hit enter or return. Note down the URL of the results page; it should look like this:

https://www.youtube.com/results?search_query=guitar+lessons

We’ll use this URL as a basis to allow our Python script to search YouTube for anything we want to. Now let’s inspect the results page. Right-click on any of the results, and select ‘Inspect’.

This should open up a panel within your browser window similar to this:

[Screenshot: inspect element]

We are interested in the <div> element in the orange box in the image. Why? Chrome has highlighted the <a> tag, as it is the exact element being ‘inspected’. But the class attribute of that <a> tag is pretty lengthy, and most likely a bad candidate for the first element we need to search for. A better candidate is the <div> boxed in orange. Note down its class attribute, as we will use it in our code.

Now that we have our basic inputs, let’s look at the code.

The Code

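from selenium import webdriver

# Open a Chrome window controlled by ChromeDriver and load the results page.
driver = webdriver.Chrome()
driver.get("http://www.youtube.com/results?search_query=" + "guitar+lessons")

# Collect every <div> whose class attribute is exactly "yt-lockup-content".
# Note: the find_element(s)_by_xpath helpers were removed in Selenium 4.3;
# on newer releases use driver.find_elements(By.XPATH, ...) together with
# "from selenium.webdriver.common.by import By".
results = driver.find_elements_by_xpath('//div[@class="yt-lockup-content"]')

print(len(results))

for result in results:
    # Look up the <a> inside the <h3> of this result, relative to the <div>.
    video = result.find_element_by_xpath('.//h3/a')
    title = video.get_attribute('title')
    url = video.get_attribute('href')
    print("{} ({})".format(title, url))

# Close the browser window once all results are printed.
driver.quit()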

Yes, it is really just that small! Run this script, and do not close the Chrome window that opens up as this script runs; it will close automatically once all is done. Let me take you through the code, as that is the best way to understand what’s going on.

The first statement asks the Python interpreter to bring the ‘webdriver’ module (from the selenium package) into the namespace of our script. This allows us to instantiate a Chrome browser instance that the script can control. That’s exactly what happens in the second statement. (You could just as well use Firefox, by replacing webdriver.Chrome with webdriver.Firefox in the above piece of code.)

The driver.get("http://www.youtube.com/results?search_query=" + "guitar+lessons") line is where we ask Chrome to open the URL passed as a string to the get() method.
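
To search for something other than guitar lessons, the string passed to get() can be built from any query. Here is a small sketch using only the standard library; the names BASE_URL and build_search_url are mine, not part of Selenium or YouTube.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.youtube.com/results"


def build_search_url(query):
    # urlencode escapes special characters and turns spaces into "+".
    return BASE_URL + "?" + urlencode({"search_query": query})


print(build_search_url("guitar lessons"))
# → https://www.youtube.com/results?search_query=guitar+lessons
```

The returned string can be handed to driver.get() in place of the hard-coded URL.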

The next statement is where we make use of the information we got by inspecting one of the results in Chrome. The parameter to the find_elements_by_xpath() method tells the Chrome driver to get all <div> elements whose class attribute is set to exactly “yt-lockup-content”. Remember, this was our best guess, the hope being that this class is not used for <div> elements that serve other purposes on the page. The end result is that ‘results’ now contains a list of all page elements that met this criterion, i.e. those <div> elements which had their class attribute set to “yt-lockup-content”.

Note the “//” at the beginning of the argument string for find_elements_by_xpath(). The “//” means that we are looking for matching elements anywhere in the document, at any depth below its root, rather than at one fixed position in the page.

The next statement prints the number of such <div> elements found on the page. Next, we have a loop that iterates over all the <div>s we found. As you can see in the image above, we are actually interested in the attributes of the <a> tag, where all the information about the specific result lies (the video title in the title attribute, and the URL in the href attribute). So for each ‘result’, we get the embedded <a> tag by giving the relative path of that <a> as a parameter to the method find_element_by_xpath(). The leading “.” in the path makes the search start from the current <div> instead of the whole document. This gets us the <a> embedded in the <h3> embedded in the original <div> we found.
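
You can try the same two-step lookup without a browser. The sketch below uses the standard library’s xml.etree.ElementTree, which supports only a limited subset of XPath, on a tiny hand-written fragment. The fragment is made up for illustration and is not YouTube’s actual markup, and extract_links is a name of my own.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed fragment for illustration only.
SAMPLE = """
<html><body>
  <div class="yt-lockup-content">
    <h3><a href="/watch?v=aaa" title="Lesson 1">Lesson 1</a></h3>
  </div>
  <div class="sidebar">
    <h3><a href="/about" title="About">About</a></h3>
  </div>
  <div class="yt-lockup-content">
    <h3><a href="/watch?v=bbb" title="Lesson 2">Lesson 2</a></h3>
  </div>
</body></html>
"""


def extract_links(markup):
    root = ET.fromstring(markup)
    links = []
    # ".//div[...]" matches <div> elements at any depth below the root.
    for div in root.findall(".//div[@class='yt-lockup-content']"):
        # The leading "." keeps this search inside the current <div>.
        anchor = div.find(".//h3/a")
        if anchor is not None:
            links.append((anchor.get("title"), anchor.get("href")))
    return links


print(extract_links(SAMPLE))
# → [('Lesson 1', '/watch?v=aaa'), ('Lesson 2', '/watch?v=bbb')]
```

The <div> with class “sidebar” is skipped by the first lookup, so its link never reaches the second one.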

The next two statements in the loop extract the values of the two attributes we are interested in, title and href. The script then prints those values, and there you have it: the loop does the same for every <div> it found, and you get a nice output of all the first page’s results on your console!

[Screenshot: webscrape_output]
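
If you are on Selenium 4.3 or later, the find_element(s)_by_xpath helpers used in the script no longer exist. Here is a sketch of the same logic written with find_elements(By.XPATH, ...) instead. The function names scrape and format_result are mine. YouTube’s markup changes over time, so treat the “yt-lockup-content” class as an assumption carried over from the script rather than something guaranteed to match today.

```python
def format_result(title, url):
    # Same "title (url)" layout that the original script prints.
    return "{} ({})".format(title, url)


def scrape(url):
    # Imported inside the function so that format_result() can be used
    # even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumed class name, taken from the script in this article.
        results = driver.find_elements(
            By.XPATH, '//div[@class="yt-lockup-content"]')
        lines = []
        for result in results:
            video = result.find_element(By.XPATH, './/h3/a')
            lines.append(format_result(video.get_attribute('title'),
                                       video.get_attribute('href')))
        return lines
    finally:
        # Close the browser even if one of the lookups above raises.
        driver.quit()
```

Calling scrape("https://www.youtube.com/results?search_query=guitar+lessons") and printing each returned line should give the same kind of output as before, provided the class name still matches. The try/finally is a small addition of mine so that the Chrome window does not stay open if something goes wrong halfway.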

I’m sure that if you have reasonable expertise in Python, and a basic grasp of the HTML DOM, you can extend or twist or turn this little code snippet to satisfy all your scraping needs. Have a happy time scraping the web, and leave me a note in the comments if you have any questions.

 
