How To AutoMate Web Extraction

by Donna Horton, in Data Extraction, posted 6/28/12
image for article

As more and more companies and governments make data available online, the demand and need to access and process Internet-based data is growing fast. Problematically, much of the data is available only through a front-end web-app that is designed for ad-hoc queries. If the need is to get bulk amounts of online data, automation becomes essential. Real world automation requires that any number of methods for extracting data from the Internet be made available to support the desired business process. This is made even more so by the variances of web systems design, e.g. some apps have Web Services and many do not.

AutoMate supports the ability to integrate with Web Services or to navigate through webpages, interact with web data, initiate searches and logins, enter data, click through links, and extract tables and entire webpage source information. It does this through the Action Library with actions for web browsers, HTTP, XML and Web Services, all which can be invoked without writing code.

Seeing can be believing, click on the link below to see our How To video on automating basic web scraping and table extraction automation.

How To AutoMate Web Extraction

Extracting a Table from a Website and importing the table into an Excel Spreadsheet  

The video shows in four short steps how we extract a table of sample tasks from the Network Automation website and import the table into Microsoft Excel as shown in the Task Builder, Figure 1.  AutoMate can also support loading the scraped data into any number of file formats including some of the most popular like Excel, Access, SQL, Exchange, SharePoint, CSV and PDF files. 

For easy reference, we will show the contents of the video below. To start, we utilize two sets of actions, the Web Browser and the Excel actions to create the task.    

 Task

(Figure 1)

Step 1

In the first step of the task, the “Open” Web Browser activity opens the website, specified in the Page URL, to mimic manually opening the Network Automation Sample Task webpage on our User Resource Center. 

Web Browser

(Figure 2)

Notice in Figure 2, there is a Browser dropdown arrow indicating multiple browser support (such as Internet Explorer, Firefox, Safari and Chrome). 

Step 2

In the second step of the task, we use the “Extract Table” web browser action. This requires selecting a browser, locating the HTML element and creating a dataset to populate the table information for importing into Excel.

After selecting your preferred web browser, use the Magnifying Glass icon to drag it to the opened webpage.  You can also manually enter the webpage link into the URL area under Select browser (as shown in Figure 3).    

Web Browser 2

(Figure 3)

Locating the HTML Elements requires using the Hand icon to point to the website containing the table for extraction. Drag the hand to the table to select the HTML Components. The controls and components collected from the HTML elements always identify the correct table.  Notice how the “Locate by HTML tag” and “Locate by attributes (case sensitive, all must match)” identifies the HTML location and discovery as shown in Figure 4.

 Extract Table

(Figure 4)

When locating HTML elements, AutoMate matches the locator element results with the criterea specified, as shown in Figure 5.   

 

Element Locator Results

(Figure 5)

Creating and populating the dataset under the Interaction area allows us to place a value on each of the columns in the table for integration with other applications. In this example, we will be using Excel.

 Interaction

(Figure 6)

Step 3

In the third step of the task, we use a “Create an Excel Workbook” action to create an Excel spreadsheet and establish an Excel session for interaction.  The only requirement with this action is to specify the path and a new spreadsheet name, such as C:\test\sampletask.xlsx.

 Create Workbook

(Figure 7)

Step 4

The last step in the task is the “Dataset to Cells” action. This is used to set the text of the selected cell range in the established Excel session with the values contained in the dataset created in the Web Browser Extract Table (Step 2).  Starting at Row 1, Column 1, the entire table is placed into the newly created Excel Spreadsheet starting at Row A1. And that’s it.

 Dataset to Cells

(Figure 8)

Click here to watch this short video on automating web extraction.