Where Data Goes to Die: Business Solutions for Scraping PDFs

by Kaitlin McCloughan, in Data Extraction, posted 7/14/16

At the recent HelpSystems User Conference, the keynote speaker was Ben Wellington, a quantitative researcher and the creator of I Quant NY, a data science and policy blog that focuses on insights drawn from New York City's public data and advocates for the expansion and improvement of that data. You can watch his TED talk here.  

Ben Wellington at the 2016 HelpSystems User Conference

In his User Conference session, Ben talked about how a lot of city data, while technically “open,” is hard to access. For example, the taxi GPS data he used to calculate rush hour in NYC was only made available when someone filed a Freedom of Information Law request with the Taxi and Limousine Commission and then brought a hard drive to the commission's office to receive the data.

Ben also pointed out another reason data can be difficult to work with: it’s often in PDF format. If you want to gain some insight from the data, you will either have to copy and paste it from hundreds of PDF files into a format you can manipulate, or you will need a tool that can do the work for you. That’s why, in his words, “PDFs are where data goes to die.”

While people like Ben Wellington are advocating for public data to be in more useful formats, you may be running into similar issues with the data used by your company. Do you repeatedly find yourself trying to copy and paste PDF data into an Excel spreadsheet? Or wishing for a faster way to combine the data from multiple PDFs?

Luckily, there are many tools that can give you access to your PDF data, including free software. You may have someone on your staff who can write a script to scrape the data you need. But if you have regular, repeated business processes that involve scraping data from unstructured sources like PDFs, email, or websites, here’s why it’s worth the investment in a more sophisticated business process automation (BPA) solution.

One Solution for All Your Data Extraction

While extracting data from PDFs is a common frustration, data can also be rendered inaccessible if it is behind a CAPTCHA on a website, on a legacy system, or written into an email. Writing individual scripts for each scenario uses valuable resources and can be risky for your business. The right business process automation software will be able to deal with these scenarios and more with no additional investment.
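To give a sense of what one such individual script involves, here is a minimal Python sketch. It assumes the text has already been pulled out of a PDF by some extraction tool, and it only turns that text into spreadsheet-ready rows. The sample text, the column layout, and the names `ROW_PATTERN`, `rows_from_text`, and `to_csv` are all hypothetical, and a real document would almost certainly need its own pattern.

```python
import csv
import io
import re

# Hypothetical text, as it might look after a PDF text extractor has run.
SAMPLE_TEXT = """\
Invoice Summary
Item            Qty    Price
Widget A        3      19.99
Widget B        12     4.50
Page 1 of 1
"""

# Assumed layout: item name, two or more spaces, a whole-number quantity,
# whitespace, then a price with two decimal places.
ROW_PATTERN = re.compile(
    r"^(?P<item>.+?)\s{2,}(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$"
)


def rows_from_text(text):
    """Keep only the lines that look like data rows; skip titles and footers."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(
                (match["item"], int(match["qty"]), float(match["price"]))
            )
    return rows


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "qty", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(rows_from_text(SAMPLE_TEXT)))
```

Even this small example is tied to one layout. A change in column order or spacing means revisiting the pattern, which is part of why maintaining a separate script for every source adds up.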

Do More with Your PDFs

So far we haven’t had many nice things to say about PDFs in this article. But maybe a PDF document is exactly what your business needs. One independent academic press used business process automation software to help convert rare texts to electronic format. The texts were first scanned into TIFF files, one file per page, and then converted into a single PDF with the help of BPA software. A good business process automation solution can convert files to and from PDF format, as well as compress, encrypt, split, or merge PDFs. It can also interface with other tools you may be using to manipulate your documents.

Integrate PDF Actions with Your Other Business Workflows

When your business needs to save time and resources by eliminating manual workloads, you will maximize efficiency by automating whole business processes rather than individual tasks. For example, perhaps you receive data as PDF email attachments, and that data needs to go into your ERP system. We’ve already covered how a script or a free tool can scrape data from the documents and write it into a report. That still leaves you to download the PDF attachments, run your script or free tool, and load the data into your ERP.

With business process automation, you don’t have to manually complete any of those steps. In fact, if the email attachments arrive at night, the whole process could be finished by the time you come into work in the morning. The emails trigger the automatic process to begin and each step is completed automatically.
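For readers curious what the first step of such a process looks like when scripted by hand, here is a minimal sketch using only Python's standard `email` package. It pulls the PDF attachments out of a raw email message. It does not connect to a mail server, scrape the PDF, or talk to an ERP system, and the addresses, file names, and function names (`pdf_attachments`, `build_sample_email`) are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def pdf_attachments(raw_message: bytes):
    """Return (filename, content bytes) for each PDF attached to a raw email."""
    message = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in message.iter_attachments():
        if part.get_content_type() == "application/pdf":
            found.append((part.get_filename(), part.get_content()))
    return found


def build_sample_email() -> bytes:
    """Build a stand-in for a message that would normally arrive overnight."""
    msg = EmailMessage()
    msg["Subject"] = "Nightly report"
    msg["From"] = "reports@example.com"
    msg["To"] = "erp-inbox@example.com"
    msg.set_content("Report attached.")
    # Placeholder bytes, not a valid PDF; enough to show the routing logic.
    msg.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    # A non-PDF attachment that the filter should ignore.
    msg.add_attachment("plain text notes", filename="notes.txt")
    return msg.as_bytes()


if __name__ == "__main__":
    for name, content in pdf_attachments(build_sample_email()):
        print(name, len(content), "bytes")
```

The scraping and loading steps would still have to be written, scheduled, and monitored separately, which is the gap a BPA tool is meant to close.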

Giving you better access to your data isn’t the only problem BPA software can solve. Read about five more here.