Web-Scraping for Social Research: A Tutorial

“Scraping, to state this quite formally, is a prominent technique for the automated collection of online data” (Marres and Weltevrede, 2013)

As online databases become increasingly ubiquitous and critical to infrastructure and basic services, researchers and activists have pushed back, attempting first to make these databases more transparent and second to show how they are being used. These efforts have become more urgent as private technology companies have become increasingly entangled with basic and expected services. Think of Airbnb’s use as a rental service in many cities, or Facebook’s increasing dominance in media and communications. In many cases, making digital infrastructure more transparent has become critical to understanding how 21st-century life is unfolding.

Accessing these databases can be difficult, as demonstrated by a recent ProPublica series called Black Box, which delved into four areas of our digital society and revealed just how complicated it can be to find out if companies and their data policies are racist, aiding hate groups to organise, or fixing prices based on race. Researchers have therefore developed tools to try to break through the opacity and reveal some of the mechanisms that companies and governments employ.

The most well-known of these tools is called ‘web-scraping’—a technique that, at its simplest, uses small snippets of code to automate all the normal actions that we do on websites: clicking on links, filling out forms, searching for text, and recording the results of these actions. This automation can allow researchers to examine and gather datasets and is a fantastic but still under-used technique.

One example of this type of web-scraping is the work of Murray Cox and Tom Slee on the Airbnb website. In 2015, Cox and Slee had been using a web-scraping script to collect datasets from Airbnb, which included listings for entire cities with the number of available days and the type of each listing (a whole apartment or a room, for example). When Airbnb first made data available about its business, it showed a ‘typical day’ snapshot of Airbnb’s presence in New York City; but Cox and Slee had proof that “the data was photoshopped.” Using their scraped data, they demonstrated that Airbnb carried out a “one-time targeted purge of over 1,000 listings” in the weeks before the ‘typical day’ that it presented as a snapshot of its data. Airbnb’s claims were based on misleading data. Now, thanks to Inside Airbnb, an independent source, Airbnb data for various cities, including Dublin, can be downloaded, and the data have been used in various exciting projects (some great research from the project can be found here).

There are quite a few ways to do web-scraping, and there are various tutorials on web-scraping methods for social research on the internet, some of which can be found here, here, and here. Against this backdrop, the following post provides instructions on how to scrape websites that are open to the public (or, if you have your own login information and permission, password-protected datasets) but from which it is difficult to gather data in a comprehensive manner. This difficulty is fairly typical of many websites today: they allow users to see one or two entries at a time, but it is frequently difficult to gather and analyse the entire dataset.

This tutorial uses R, a software environment for statistical computing, and the Selenium webdriver. R and the Selenium webdriver are particularly well-equipped for web-scraping for social research on otherwise difficult websites. The free and open-source R software makes these tools more accessible and makes scraping as a method of social research more rigorous and reproducible. Writing an R program to gather and use online data allows for dutiful record-keeping and increases reproducibility, since the work can be shared simply by sharing the R file. RSelenium, a third-party package for R, allows researchers to use the R environment to run the Selenium webdriver, which automates web browsers and simulates actual ‘clicks.’ Together these tools can gather information from websites where doing so with other tools would be difficult.

Scraping Daft.ie

This post provides instructions for downloading, installing, and running all the necessary programs. The tutorial does require writing and executing code, but includes a detailed guide to do this, even if you are unfamiliar with using command-line tools and writing scripts.

This tutorial uses Daft.ie, the main website in Ireland for listing properties for sale, rent and sharing, as its example. The website describes itself as “the number one destinations for property searchers and connects property professionals with a unique audience of over 2.5 million users each month” (link). Daft.ie also releases “in-depth quarterly market analysis by way of Ireland’s most read property report–The Daft Report.” This report is generally well-respected. However, web-scraping could allow us to see a different side to housing issues in Ireland, revealing extra detail that could be particularly useful amidst a prolonged housing crisis, when existing datasets are inadequate for solving Ireland’s housing problems.

At the end of this tutorial, you will be able to scrape the daft.ie website for its properties for sale, lodgings to rent, or lodgings to share, and you will have the tools you need to do web-scraping on other websites. Below is an example: a map of scraped daft.ie properties listed for sale in March 2017, by price.


A map of all properties advertised for sale on daft.ie from a date in March, 2017, by price. Click on the top-left icon for legend. Map by author.

Web-scraping tools in this tutorial

This tutorial uses the following software:

  • R – a free software environment for statistics and data science
  • RStudio – an integrated development environment (IDE) that makes R easier to use
  • RSelenium – an R package that allows R to control the Selenium web automation tool, or webdriver
  • Docker – a software container platform. With the Docker Toolbox, Docker runs a container inside a virtual machine on your computer. In this tutorial, this container will run a Firefox browser along with a VNC (Virtual Network Computing) server, and we can control and automate the browser from within RStudio.
  • A VNC viewer – a viewer that allows us to see the webpages, and to manually click or fill out forms when needed. VNC Viewer is a free option for Windows, and Mac OS comes preinstalled with a VNC viewer called Screen Sharing

Step 1 – Getting started – Installing the software

This section explains how to install the software needed to use R and RSelenium for web-scraping, with separate instructions for Windows and Mac OS where they differ. This process usually takes less than 20 minutes but can take longer with a slower internet connection. You will also need administrator privileges or the administrator password for your computer to install these programs.

1.1 – Installing R

Download R from here. The link also provides instructions for installing. If you are installing R for the first time, download the base package.

1.2 – Installing RStudio

After you have installed R, install RStudio from here. The free desktop version provides all the features we will need.

1.3  – Installing Docker

Docker is a “software container platform” that allows us to run a container, which behaves like a virtual computer running within your own computer. This is where we will run the browser. Install the Docker Toolbox here. At the link, you can download the installer for Docker. When installing Docker, make sure to do the ‘full installation’, and then make sure to tick the box for “Install VirtualBox with NDIS5 driver”. If you miss this step, Docker will not work and you will need to reinstall the program.

When installed, Docker provides a Docker Quickstart Terminal that opens the program in a command-line window. Later, this is where we will enter our code to start a VNC to which R can connect.

1.4 – Installing a VNC Viewer

Virtual Network Computing (VNC) viewers do what they say: they allow us to view what we are doing when we drive the browser from RStudio, and let us click manually when needed. VNC Viewer is one free program that lets you do this, and is available for Mac, Windows, and Linux from here. Mac OS also comes with a VNC viewer called Screen Sharing. The application is located in the Library folder, but it can also be opened by entering vnc:// into the Safari address bar. Once a VNC viewer is open, it will request an IP address. We will get the IP address to use from Docker.

Step 2 – Starting up a webscraper – Docker ports, VNCs, and RSelenium

2.1 – Starting Docker

Once the Docker Toolbox has been installed, the Docker Toolbox folder will appear with tools, including the Docker Quickstart Terminal. Open the quickstart terminal, and let it load until the whale logo appears.

On a Windows computer there is an extra step to running Docker. On Windows, Docker requires virtualization to be enabled, which can only be done in the BIOS. This requires starting your computer in the special BIOS mode. A step-by-step guide for enabling virtualization can be found here, and a guide to opening the BIOS settings on different Windows computers can be found here.

Once Docker has started and the whale logo is displayed, enter the following command to start a VNC. This will start a machine running a ‘debug’ version of Firefox, and will map the ports so that RStudio and the VNC viewer can reach them. The copy-paste tool may not work in your Docker terminal window, in which case it may be easiest to type out the full command:

docker run -d -p 4445:4444 -p 5901:5900 selenium/standalone-firefox-debug:2.53.1

It will take a few minutes to download the Firefox standalone browser the first time you run this command, and longer on a slower internet connection.

Next, use the following code in the command line to find the IP (Internet Protocol) address of your server:

docker-machine ip

We will use this IP number to connect to the VNC from other applications.

If you wish to later turn off the VNC, use this code:

docker stop $(docker ps -q)

2.2 – Using a VNC Viewer

Open the VNC viewer and enter the IP address from the previous step followed by :5901. On most computers the address will be the default shown here, but yours may differ:

192.168.99.100:5901

If this is working properly, the VNC viewer may ask you for a password. The default for this is usually “secret” and can be changed in your computer’s network system preferences. If the VNC is connected and the VNC viewer is working, a logo on a black background should display (see fig. 1). If this does not work, check the sharing permissions in settings.


Figure 1. RStudio, Docker, and a VNC viewer open on a mac, with Docker logo displayed on the VNC viewer.

2.3 RStudio – Installing the RSelenium package

To use RStudio we will be writing a script. To create a new script, in the RStudio application go to the menu File -> New File -> R Script. To install the packages we will be using, copy the following code to the script. To run code from a script, press ctrl+enter on Windows or command+enter on a Mac while the cursor is on the line of code you wish to run, or while the code you wish to run is highlighted.

install.packages("RSelenium")
install.packages("dplyr")

This may take a few minutes to install, but installing packages is only required once.
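
Because installation is only needed once, it can be handy to check which packages are still missing before calling install.packages() again. The following is a minimal sketch of that idea, not part of the original script; the helper name missing_packages is made up for illustration.

```r
# Return the names of packages that are not yet installed.
# 'missing_packages' is a hypothetical helper name used only in this sketch.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage: install only what is missing (left commented out so that
# nothing is downloaded just by running this sketch).
# to_install <- missing_packages(c("RSelenium", "dplyr"))
# if (length(to_install) > 0) install.packages(to_install)
```

Running the two commented lines on a machine that already has both packages should do nothing, which makes the script safe to re-run.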

Step 3 – Running a webdriver – navigation and scraping the results

3.1 – Starting a webdriver

Now you should have the VNC working and all the proper connections made. The next step is to start the webdriver. The best way to do this seems to be to use a split screen, with RStudio running on one half and the VNC viewer on the other. The following code will load the packages we will be using, create settings for our webdriver, start a webdriver, and navigate to a URL. (Note: the “#” character denotes the beginning of a comment on that line. Comments are text that R will ignore and are only messages for humans.)

library(dplyr)
library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",port=4445L) #change ip if it does not match yours
remDr$open() #change to remDr$open(silent=T) to open a webdriver without printing result
remDr$navigate("http://www.daft.ie/") #navigate to a site

Figure 2. After executing the RSelenium commands to open a browser and navigate to http://www.daft.ie, the Daft.ie homepage is displayed in the VNC viewer.

3.2 – Using HTML – Navigating to a page

A browser displays a website using an HTML file, which contains the content of the site, and CSS, which sets the style of the page (we are not including web applications for now). The source HTML of a web page is made up of elements. These elements are all the content on a page. Types of elements include text (<p> for paragraph, and <h1>, <h2>, <h3> for different headers), images (<img>), hyperlinks (<a>), etc. Elements can also be nested within other elements, and ‘container’ elements such as <div> (a division or a section) and <span>, meant only for organizing other elements, are also common.

The key to automating and scraping websites is to identify discrete and specific ways of identifying different elements.

Some browsers, including Chrome and Firefox, offer an ‘inspect’ tool. In a normal web browser, outside of our VNC, we can open the Daft.ie homepage or any other site we are scraping and ‘inspect’ the html source code.

After the <head> section, which describes metadata for the page, the <body> section begins, which contains all the elements to be displayed for the site. After some <script> elements, which run JavaScript, there is a <div> element with another <div> element nested inside that one, then an element for the header of the page, etc. By hovering the mouse over the elements in the ‘inspect’ mode, the corresponding elements on the page are highlighted.

The remDr$navigate("http://www.daft.ie/") command sent us to the Daft.ie page. From this page, let us navigate to the “For Sale” hyperlink.

3.3 – Finding Elements on a page

There are four general ways to find elements on a page using the findElement command in the RSelenium package (for more, see RSelenium Basics). Most of these use identifiers that the author of a website gives to elements:

  • Name, ID, Class: Some elements have a name, id, or class attribute within the first set of <> brackets to identify them.
  • CSS selector: This is a more general tool to select elements based on any identifier that they may have for the styling of the page. For more information, see the RSelenium Basics page.
  • XPath: This is the most complex way of finding specific elements, but will work when other ways do not. The xpath of an element is a reference path to the element from the beginning of the document. For example, a path like this: "//div[2]/p/a". This would reference a link a within a paragraph p within the second division div[2] on a page. The xpath can also be used for searches; for example, searching for a link within a paragraph with class "example" would look like this: "//p[contains(@class,'example')]/a".
  • Link Text: This is often the simplest way. This will search for links with the text specified.
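
The strategies in the list above map onto the using argument of findElement. The sketch below is an illustration only: it collects one example locator per strategy in a plain R list. The id "search-form" is a hypothetical placeholder, not something taken from the Daft.ie page, and find_by is a made-up helper name. remDr is assumed to be the open remote driver from section 3.1.

```r
# One example locator per strategy. The 'using' strings are the
# strategy names that RSelenium's findElement accepts.
locators <- list(
  by_id        = list(using = "id",           value = "search-form"),  # hypothetical id
  by_class     = list(using = "class name",   value = "price"),
  by_css       = list(using = "css selector", value = "strong.price"),
  by_xpath     = list(using = "xpath",        value = "//p[contains(@class,'example')]/a"),
  by_link_text = list(using = "link text",    value = "For Sale")
)

# Hypothetical helper: look up a single element with a stored locator.
find_by <- function(remDr, locator) {
  remDr$findElement(using = locator$using, value = locator$value)
}

# Usage, once a browser session is open:
# webElem <- find_by(remDr, locators$by_link_text)
```

Keeping locators together in one list makes them easier to update when a site changes its layout.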

3.4 – Finding Elements using text

One way for the RSelenium program to identify a link element is by searching for the link text. In this case we know that the text is “For Sale”, so we enter this into code.

webElem<-remDr$findElement(using = 'link text', value = 'For Sale')

This assigns the element to the variable webElem, which we can reference to perform actions on that element.

For example, we can highlight that element in the VNC viewer so that we know we have the right one.

webElem$highlightElement()

This will make the “For Sale” link flash yellow twice.


Figure 3. After selecting the link ‘For Sale’, the active link is highlighted using highlightElement()

And to click on this link, we can use another command.

webElem$clickElement()

This clicks on the “For Sale” link, ignoring the dropdown menu with more options.

We are now presented with a form. With this form we can make a search of all the properties for sale in the Daft database, and we can narrow down our search with various selections, including location, price, number of bedrooms, etc. For now, we are going to search for all properties in Dublin City.

While we could automate filling out this form, we will only be doing this once, so it is easier to fill it out and click “search” by hand in the VNC viewer. Now we have a page of the results of our search. Results 1-20 are displayed on the page. On one occasion when I conducted the search, there were a total of 3,488 properties. If we want to scrape all this information, we have to automate gathering the information from each of the 20 entries on the page, recording it to a file, and then navigating to the next page and repeating. To do this, we must first learn how to search for elements in new ways.


Figure 4. After navigating to the ‘For Sale’ search page, we can select our desired search within the VNC viewer.

Elements and data are structured differently on every website, so scraping each site presents different problems. Solving them always requires spending time understanding how the page is organised, using the web inspector or another tool.

3.5 Finding Elements on Daft Properties

For the Daft properties, we want the following entries for each property:

  1. The title text, which is the address of the house and then a description of the house (semi-detached, new home, or another descriptor). We will record this entire line and separate these two values in the data cleaning stage at the end
  2. The offer price
  3. Information about the house (type, # of beds, # of bath)
  4. The estate agent, if present

Looking through the web inspector, we can examine these elements and find a way to reference them specifically. We find that these elements have classes assigned to them. Classes are used mainly in the formatting of websites, as elements with a certain class can be given specific formatting in the CSS (the Cascading Style Sheets that determine the formatting and styles applied to the HTML to make a webpage).

We find that we can reference each part individually.

  1. A <div> element containing the address has a class of “search_result_title_box”
  2. The <strong> element (which makes text bold) with the price has the class “price”
  3. A <ul> (unordered list) element with the info of type, beds, and bath has a class of “info”
  4. A <ul> element that contains the Agent, when present, has a class “links”

Each of these elements occurs twenty times on the page, once for each listing, so we will find the elements that match, record the text from the first match to a variable, and print the variable to make sure that it has worked correctly. The code for finding each of these elements looks like this:

#Finding the Address
webElem<-remDr$findElements(using = 'class', value = 'search_result_title_box') #finds all elements with class="search_result_title_box"
webElem[[1]]$getElementText() -> address #copies the text of the first match to the variable 'address'
print(address)

#Price
webElem<-remDr$findElements(using = 'class', value = 'price')
webElem[[1]]$getElementText() -> price
print(price)

#Info
webElem<-remDr$findElements(using = 'class', value = 'info')
webElem[[1]]$getElementText() -> info
print(info)

#Agent
webElem<-remDr$findElements(using = 'class', value = 'links')
webElem[[1]]$getElementText() -> agent
print(agent)

3.6 – Writing to a text file

The next step for scraping the data from the first entry or property listing is to record it to a file. To do this, we will gather all the variables with the information we want and paste them to a file. Creating this file is the key moment for easing the process of data cleaning, which we will do later. Primarily, we need a symbol that we can use to denote a ‘column break’ in Excel, such as the comma in a csv (comma-separated values) file. However, a bare comma will not work because our dataset contains commas in the values.

Before doing these actions, we can set up our working directory in R to the location we want to save the file. The code is:

setwd("//*filepath*")

If you want to save a file in your user directory on a Mac, it will usually look something like this: “/Users/*username*/*filename*”. On Windows it will often look like this: “c:/docs/mydir”. On Windows, note that R uses “\” as an escape character, so use “/” instead.
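
One way to avoid the backslash problem altogether is to let R assemble the path. The sketch below is an assumption-laden illustration: it reuses the “c:/docs/mydir” example from above and only changes directory if that folder actually exists on your machine.

```r
# Build a path with forward slashes on any operating system.
# "c:", "docs" and "mydir" mirror the Windows example in the text.
target <- file.path("c:", "docs", "mydir")
print(target)

# Only change the working directory if the folder really exists.
if (dir.exists(target)) {
  setwd(target)
} else {
  message("Folder not found, staying in: ", getwd())
}
```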

The paste0 function in R will paste our variables together into a single string, and the collapse argument inserts “,” (a comma wrapped in double quotes) between each entry. This way, our file should be close to csv format, a comma-separated values file.

paste0(c(address,price,info,agent), collapse='","')%>%write.table(file='daft.txt', append=TRUE, col.names = FALSE)
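
An alternative worth considering is to write one row per listing with a tab as the column break, since tabs are unlikely to appear in addresses or prices. This is a sketch of that swapped-in approach, not the method used in the rest of this tutorial. The four values are hypothetical stand-ins for one scraped listing, and the file is written to a temporary location rather than daft.txt.

```r
# Hypothetical values standing in for one scraped listing.
address <- "1 Example Street, Dublin 8 - Terraced House"
price   <- "250,000"
info    <- "Terraced House | 3 Beds | 1 Bath"
agent   <- "Example Estates"

# Scraped text can contain line breaks or tabs; turn them into spaces
# so that each listing stays on one line with exactly four columns.
strip_breaks <- function(x) gsub("[\r\n\t]+", " ", x)

row <- data.frame(
  address = strip_breaks(address),
  price   = strip_breaks(price),
  info    = strip_breaks(info),
  agent   = strip_breaks(agent),
  stringsAsFactors = FALSE
)

# Append the row to a tab-separated file (a temporary file in this sketch).
out_file <- tempfile(fileext = ".txt")
write.table(row, file = out_file, sep = "\t", append = TRUE,
            col.names = FALSE, row.names = FALSE, quote = FALSE)

# Read the file back to confirm that the commas did not split any column.
back <- read.delim(out_file, header = FALSE, quote = "",
                   stringsAsFactors = FALSE)
print(back)
```

A file written this way opens directly in a spreadsheet as tab-delimited text, which may reduce the amount of find-and-replace cleaning needed later.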

3.7 – Data cleaning

Data cleaning on this single entry uses a few find and replace commands available in any simple text editor to get rid of some extra characters and to separate the columns.

Find: ^return
Replace: *blank*

Find: \
Replace: *blank*

Find: Add to saved ads|Agent:
Replace: *blank*

Find: |
Replace: “,”
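
The same four find-and-replace steps can also be sketched in R with gsub. This is an illustration under assumptions: the sample text is a hypothetical stand-in for one scraped entry, “*blank*” is read as “replace with nothing”, and clean_listing is a made-up helper name.

```r
# Apply the four find-and-replace steps from above to a character vector.
clean_listing <- function(x) {
  x <- gsub("[\r\n]+", "", x)                   # remove line breaks ("^return")
  x <- gsub("\\", "", x, fixed = TRUE)          # remove backslashes
  x <- gsub("Add to saved ads|Agent:", "", x)   # remove button and label text
  x <- gsub("|", '","', x, fixed = TRUE)        # turn "|" into a column break
  x
}

# Hypothetical raw text for one listing.
raw <- "Apartment | 2 Beds | 1 Bath\nAgent: Example Estates\nAdd to saved ads"
cleaned <- clean_listing(raw)
print(cleaned)
```

Because gsub works on whole vectors, the same function can be applied to every line of a file read in with readLines().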

Step 4 – Automation and creating a ‘for’ loop

We now want to automate this process for all 20 entries on the page, and then automate the process for all pages in our results (at 20 entries per page, a search returning 3,508 entries would span 176 pages). We will do this using a ‘for’ loop, a common coding construct that repeats a certain section of code, here 20 times, each time with a different value of the variable “i” counting up from 1 to 20.

The structure of a ‘for’ loop in R is as follows

for (i in 1:20){
     # run code using variable i
}

This ‘for’ loop runs the code inside of the braces 20 times, first with the variable i=1, then i=2, and so on up to i=20, and then finishes. This is perfect for us. We will run our code writing the variables of the first entry to a text file, and then run the code again on the second entry, and so on. This operation looks like this:

for(i in 1:20){
 #Address
 webElem<-remDr$findElements(using = 'class', value = 'search_result_title_box')
 webElem[[i]]$getElementText() -> address
 print(address)

 #Price
 webElem<-remDr$findElements(using = 'class', value = 'price')
 webElem[[i]]$getElementText() -> price
 print(price)

 #Info
 webElem<-remDr$findElements(using = 'class', value = 'info')
 webElem[[i]]$getElementText() -> info
 print(info)

 #Agent
 webElem<-remDr$findElements(using = 'class', value = 'links')
 webElem[[i]]$getElementText() -> agent
 print(agent)

 paste0(c(address,price,info,agent), collapse='","')%>%write.table(file='daft.txt', append=TRUE, col.names = FALSE)
}

This ‘for’ loop will enter each of the 20 listings to a text file with the name “daft.txt”. To do this for all the listings in our search results requires another ‘for’ loop, which contains this loop and a command to click on the ‘next’ link.
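
The loop above looks up each class again for every listing and assumes exactly 20 listings per page. A possible variation, sketched here under assumptions, looks up each class once per page and lets the number of matches decide how many rows there are. class_texts and scrape_page are hypothetical helper names; 'class name' is the full name of the strategy written as 'class' above; remDr is the open remote driver from section 3.1.

```r
# Collect the text of every element with a given class on the current page.
class_texts <- function(remDr, class_name) {
  elems <- remDr$findElements(using = "class name", value = class_name)
  vapply(elems, function(e) as.character(e$getElementText()[[1]]), character(1))
}

# Build one data frame row per listing on the current page.
scrape_page <- function(remDr) {
  address <- class_texts(remDr, "search_result_title_box")
  price   <- class_texts(remDr, "price")
  info    <- class_texts(remDr, "info")
  agent   <- class_texts(remDr, "links")

  # Use the shortest set so that every column has the same length.
  # If the counts differ, rows may be misaligned and need checking by hand.
  n <- min(length(address), length(price), length(info), length(agent))
  keep <- seq_len(n)

  data.frame(address = address[keep], price = price[keep],
             info = info[keep], agent = agent[keep],
             stringsAsFactors = FALSE)
}

# Usage, with the results page open in the browser:
# page <- scrape_page(remDr)
# print(nrow(page))
```

This version also copes with a final results page that holds fewer than 20 listings, where indexing up to 20 would stop with a ‘subscript out of bounds’ error.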

4.1 – Ethics of web-scraping

At this point it is important to have some consideration for the people who work at Daft and those who use the website. On a fast computer with a fast internet connection, running a scraping command that automatically visits all of the results pages of our search could put a bandwidth load on the Daft.ie website. Out of consideration for this, and so that our work does not affect the many people who use this site to find or list housing, we will use the Sys.sleep() command, which instructs R to suspend execution for a given number of seconds (five in the script below). This will lengthen the time our web-scraping takes, but it is the right thing to do. Please don’t skip this step.

for (j in 1:2){ #the j for loop goes to the next page after the scrape is done. Change the second number to the number of results pages.
 #Don’t run j unless ready for a lengthy scraping time!
 #alternatively you can just run the i for loop and see how it works on one page. Use for debugging
for(i in 1:20){ #goes through the 20 properties on the page
   #cycles through elements with class of address
   webElem<-remDr$findElements(using = 'class', value = 'search_result_title_box')
   webElem[[i]]$getElementText() -> addrs
   print(i)
   print(addrs)

   #cycles through the elements with class 'links'
   webElem<-remDr$findElements(using = 'class', value = 'links')
   webElem[[i]]$getElementText() -> agent
   print(agent)

   #cycles through price
   webElem<-remDr$findElements(using = 'class', value = 'price')
   webElem[[i]]$getElementText() -> price
   print(price)

   #cycles through info
   webElem<-remDr$findElements(using = 'class', value = 'info')
   webElem[[i]]$getElementText() -> info
   print(info)

   #write variables to a text file called 'daft.txt' in the working directory
   paste0(c(i,addrs,price,agent,info), collapse='","')%>%write.table(file='daft.txt', append=TRUE, col.names = FALSE)

 }
 #click on 'next page' button at the bottom of the page
 webElem<-remDr$findElement(using = 'class', value = 'next_page')
 webElem$clickElement()
 Sys.sleep(5) #sleep for 5 seconds
}
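
On the last results page there may be no ‘next page’ link, in which case findElement would stop the script with an error. The sketch below shows one way to handle this, on the assumption that findElements returns an empty list when nothing matches. go_to_next_page is a hypothetical helper name, and the five-second pause matches the script above.

```r
# Click the 'next page' link if there is one, then pause politely.
# Returns TRUE if a click happened and FALSE if no link was found.
go_to_next_page <- function(remDr, pause_seconds = 5) {
  next_links <- remDr$findElements(using = "class name", value = "next_page")
  if (length(next_links) == 0) {
    return(FALSE)  # no 'next' link: probably the last results page
  }
  next_links[[1]]$clickElement()
  Sys.sleep(pause_seconds)  # give the site a rest before the next request
  TRUE
}

# Usage: keep scraping until the helper reports that there is no next page.
# repeat {
#   # ... scrape the current page here ...
#   if (!go_to_next_page(remDr)) break
# }
```

With this pattern the number of results pages does not need to be typed in by hand, and the pause stays in place on every page change.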

4.2 – Final data cleaning

The final data cleaning of the entire scraping uses the same commands as for the smaller scraping of just one page. However, because of the size of the file, doing this can be too much for the computer. The options for getting around this are to divide up the file into parts and to do the find and replace commands on each file, or to use an online find and replace tool that is better for larger datasets, such as this one. The cleaned and formatted file can then be imported to the data software of your choice, or back into R.

Step 5 – Analysis

A map of the results can be put together quickly using the Google mapping tools. The following map is from a scraping of all homes for sale on daft.ie in March of 2017, and sorted by price. Some homes are not mapped because the Google mapping tool did not understand the address format (288 out of the total of about 3000), or did not list a price.
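
Before sorting or mapping by price, the scraped price text has to become a number. The following sketch assumes prices are scraped as text with a euro sign and thousands separators; the three sample values, including “Price on Application”, are hypothetical, and parse_price is a made-up helper name.

```r
# Convert scraped price text to numbers by keeping only the digits.
# Text with no digits at all becomes NA.
parse_price <- function(x) {
  digits <- gsub("[^0-9]", "", x)
  as.numeric(digits)
}

# Hypothetical sample values ("\u20ac" is the euro sign).
prices <- c("\u20ac250,000", "\u20ac1,200,000", "Price on Application")

listings <- data.frame(price_text = prices,
                       price = parse_price(prices),
                       stringsAsFactors = FALSE)

# Sort by price; listings without a price end up at the bottom.
sorted <- listings[order(listings$price), ]
print(sorted)
```

Counting the NA values in the price column gives a quick check on how many listings did not state a price.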

Will you do some web-scraping?

Thanks for reading all the way to the end! Hopefully, you now have all the tools you need to begin to do web-scraping for social research, and are beginning to understand how this tool can be used to enhance and add to your research. It would be great to hear how you found this tutorial. Please share your questions in the comments, and if you use this tutorial for a project, let me know!

You can contact me at Sasha.Brown.2016@mumail.ie

Sasha Marks Brown, Geography PhD Candidate, Maynooth University
Irish Research Council Postgraduate Scholar

Acknowledgments

Thank you to Prof. Chris Brunsdon for helping me learn web-scraping and to Dr. Alistair Fraser for consistent help editing this tutorial and encouragement.

 

 
