Simple data scraping using online tools

Scraping is a set of techniques required to extract information from various formats like the web, PDFs, or scanned images   into a file type that can be analyzed further, for example into table formats including comma-separated values (csv) or  Microsoft Excel (xls) files.

There are online tools/websites that enable users to extract data from files by converting them. Some of the web based software for simple scraping of PDFs include:

The common steps to converting your PDF file on any of the three platforms are:

a)      Upload your PDF file

b)      Enter a valid email address. You should be able to access this email address because the converted file will be sent there.

c)       Click on the convert button. The web service will momentarily process the file and on completion display a success dialogue box.

d)      Open your email address to access your converted document.

As a practical example, we will try to convert this document which is uploaded on Google Drive. You must have a Google account to access the file.  The data set is about projected health development budget estimates from 2011 to 2014. We want to calculate thesum of the total health development county budget from 2010 through to 2014. We will convert the document into an Excel file so that we can use Excels’ sum function to get the total.

1) Download the file from Google Drive.

2) From your browser navigate to http://www.pdftoexcelonline.com/

3) Click on the “Select  a File” button as shown below. Browse to your saved file and select it. Click open then enter a valid email address to which the converted document will be sent to. You can opt to use a junk email service like http://www.mailinator.com . It enables you to receive emails without signing up. Just enter a random name e.g. healthbudget123@mailinator.com and click on “Check it.”

Mailinator

Enter the same email address on pdftoexcelonline.com

First step

4)      Click on Convert it! Button.

Convert

5)  The browser will momentarily give a dialogue box to inform you that it is processing your document. Then on completion, you will get a screen like the one below:

Complete

6)  If you check your mailinator account, you should now have one email in your inbox. The email is from pdftoexcel. It contains the converted document.

7) Download your document by clicking on the link provided.

Get file

Save the file on your computer. The downloaded file is now in Excel format (.xls) and therefore by opening it in Microsoft Excel, you are able to perform calculations on the dataset.

8)  We can test our file by performing a simple sum calculation on the county budgets from 2010 to 2014. Open the file in Excel, then in column G header, input Total or Sum.  In cell G2 is where we will perform our calculation. Click on cell G2 then do a sum function like in the diagram below.

9)

Summation

10)  Hit Enter. You get a Total of county health development budgets from 2010 to 2014 per county. You can now fill down to get the rest of the values.

Fill Handle

By converting the data from a PDF format to an Excel format, we were able to add a computational column called Total/Sum. We would not have been able to do this in a PDF file. This is an example of data scraping.

Daniel Cheseret
About the Author: Daniel Cheseret
Few developers in Kenya would consider applying for a data journalism fellowship but not Daniel Cheseret. Having realized that the growing importance of data journalism has created new career opportunities in the newsroom, he applied for Internews Data and Health Journalism Fellowship. I will learn more about producing infographics, data mining and creating news applications,” says the software, developer and webmaster. Among his achievements is developing a news management software for Citizen TV. While working with Royal Media Services, the company that owns Citizen TV, he developed over 10 websites for the group. After five years he left to start a consultancy firm.
Author Website: https://internewske.org

Leave a Reply

Your email address will not be published. Required fields are marked *