AutoMate™ Web Scraping — A Solution for Murphy's Law of Data Extraction

by Jonathan Chan, in Data Extraction, posted 3/26/12
image for article

We currently live in the information age where everything is at our fingertips. Is Drew Brees the current NFL Passing Leader? Is houndstooth or herringbone more “in” this season? What will the weather be like this weekend? When asked these questions, most of us will hop on Google/Bing/Wikipedia and search for the answer. More and more information is being made available online—and the business world is no different. People are printing out less physical media.  Rather, the information is being made available in digital formats with various consumption methods such as laptops, tablets and smart phones. As more business information makes its way to the web, being able to interact and make sense of that data becomes essential.

Pulling data from a webpage is nothing new. Web-scraping and Web Data Extraction tools have been around forever.  The issue with these tools is the huge amount of processing and filtering required to retrieve usable information. And according to Murphy ’s Law, the moment you finish a filtering algorithm to process the data accurately, the web designer will change the site.

That is where our Extract Table activity comes into play. With the Extract Table action, AutoMate brings structured data extraction capabilities that easily integrate with other components and tools in your infrastructure. Extract Table keeps the information in table format so you can bypass all the processing and references by column/row.

We recently worked with a customer on a Web Interactivity Task that involved the issue described above. The client had to log into a website with username/password and retrieve a section of information.  They needed to pull all of the information from the table and then evaluate particular fields.  We could have grabbed each individual element, but that would have proved tedious and cumbersome. So we decided to use the Extract Table action. Once we have verified that the section of data was indeed an HTML Table Element, we used the Extract Table to grab everything into a dataset table. From there we could evaluate the contents and branch appropriately.

In this particular example, we will be using the stock tickers from Yahoo Finance, and are going to schedule this task to run at the close of market Eastern Time. Once we have retrieved the information, we are going to parse through the values. The values will be written to an Excel workbook: worksheet1 if the changes are positive; worksheet2 if they are negative.

Our process is going to look something like Figure1.

Fig. 1

Figure1

So first, let’s open up the Yahoo Extract Table Task to look at how to extract the table. The first thing you need to do is create a variable that contains the tickers. (Figure 1A_pt1, Figure 1A_pt2)

Fig. 1a

Figure 1A_pt1

Fig. 1a2 

Figure 1A_Pt2

Next, we need to open a new browser to Yahoo Finance. Select the open activity and specify finance.yahoo.com in the Page URL box. (Figure 1B)

Fig. 1b 

Next, select the appropriate box to set the values of the stock tickers.  (Figure 2A, Figure 2B)

Fig. 2a

Figure 2A

Fig. 2b

Figure2B

Once you select the box that you want, scroll down and specify the interaction. You want to place the content of the variable into the search box. (Figure 2C)

Fig. 2c

Figure 2C

Once that step is done, you want to click the “get quotes” button. Use the Click Action to select the button. (Figure 2D, Figure 2E)

Fig. 2d

Figure 2D

Fig. 2e

Figure 2E

Once you have finished that and run it once, your browser should be displaying something like this (Figure3):

Fig. 3

Figure 3

Drag over another web browser and then select the wait action. Once you’ve done that, drag over the final web browser action and select the Extract Table activity. (Figure 4)

Fig. 4

Figure 4

Select the browser by dragging over the magnifying glass. Then drag the hand over to the table. (Figure5)

Fig. 5

Figure 5

If you have done this successfully, there should be some table identifier information located in the details. (Figure 6)

Fig. 6

Figure 6

Scrolling down, be sure to enter the name of a dataset here. It stores the information from the table. (Figure 7)

Fig. 7

Figure 7

Once you have done this, close any IE browsers you have open and run the task. At the bottom you’ll see your dataset. (Figure 8)

Fig. 8

Figure  8

Right click on it and select Inspect. Then re-run your task one more time. You should be able to see all of this information in your dataset. (Figure 9)

Fig. 9

Figure 9

Now let’s write the information to Excel; it will create a new workbook. (Figure 10)

Fig. 10

Figure 10

Next we want to loop through our dataset to figure out which sheet we are going to write the information to. (Figure 11)

Fig, 11

Figure 11

Lastly we want to build in the logic that places Positive Gains (+) into worksheet 1 and Negative Gains (-) into worksheet 2; and create two variables: one that stores the current row of worksheet 1, and the other that stores the current row of worksheet 2. Then apply the following if-else logic. (Figure 12)

Fig. 12

Figure 12

Now that we have that all taken care of, lets run the task and verify the results. Figure 13 verifies that the positive values have been placed into worksheet1. Figure 14 verifies that the negative values have been placed into worksheet2.

 Fig. 13

Figure 13

Fig. 14 Figure 14

Now that you’ve discovered this powerful new tool, go out there and crunch some data!

Learn more about web scraping, screen scraping, and other data extraction here