
5th December 2014 Posted by studioteam

How can I collect the web data that I need?

Almost every business receives data in different formats that need processing in different ways. Solutions are available to manage this, and they deliver significant benefits as a result.

But what if the data isn’t sent or delivered and has to be collected?

What’s the problem?

Pulling data from various internal and external sources is extremely time-consuming. Cutting and pasting, home-grown scripts, and applications that record a user’s actions can’t keep up with the pace of business. And over time, demand will grow not only for the quantity but also for the quality of information.

Lots of information is accessible via public websites, with more data often hidden behind firewalls and web portals. Accessing this data requires login credentials and the ability to navigate the site in order to extract the data. Valuable information is also embedded in PDFs, images, and graphics.

All businesses face this growing need

From start-ups to enterprise organisations, and across almost every industry, acquiring external data is critical. Whether you want to prove compliance, move ahead of the competition, or reach new markets, it all requires constant monitoring of web-based data.

Data needs to be extracted, transformed, and migrated into various reports, where it becomes the foundation for business decisions.

Can you do it piecemeal?

A web-scraping tool or a home-grown web-scraping approach can seem like a good option, since it looks like a quick and inexpensive way to harvest the data you require. Or is it? This is where the uneasy feeling in the back of your mind comes in.

Can my home-grown web-scraping approach or a web-scraping tool acquire the correct information I need? How do I know the data I receive is accurate and formatted correctly? And if management wants different reporting data, how is that handled?

The short answer: you don’t know.

The right answer begins with an evaluation of your specific data requirements and business needs.

1. How does web scraping acquire the data?
When you need to acquire external data, start by listing the actual websites you gather data from. The list should cover the various types of sites, including those built with HTML5, Flash, JavaScript, and AJAX. Be sure to include websites behind firewalls and sites that publish data as PDFs. The more scalable, reliable, and fast the web data extraction process is across these external websites, the better.
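To make the extraction step concrete, here is a minimal sketch in Python of pulling the rows out of a simple HTML table. It uses only the standard library. The sample page, the TableExtractor class and the extract_rows function are hypothetical names invented for this illustration, and the sketch assumes a static page: it does not handle logins, JavaScript-rendered content, Flash, or PDFs.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # the row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>£1,250.00</td></tr>
  <tr><td>Gadget</td><td>£99.50</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_PAGE):
        print(row)
```

Even this small example shows why the list of site types matters: a script like this works for one static table, and each additional kind of site needs its own handling.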

2. What does the data look like?
You have received some data using a web scraper tool, but now you spend all your time trying to transform it. You notice formatting and quality issues. If the extracted data is not accurately transformed and put into a usable format, such as Microsoft Excel, .csv files, or XML, it becomes unusable by applications that have specific integration requirements. Now you have lost half the value of your purchased or home-grown investment. Correcting specialised data often involves handling dates, currencies, calculations, and conditional expressions, plus removing duplicate data. These are all important considerations.
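As an illustration of the transformation work described above, the following Python sketch normalises dates and currency amounts, removes duplicates, and writes the result as .csv text. The field names, the list of accepted date formats, and the day-first reading of dates such as 05/12/2014 are assumptions made for this example; real data will need its own rules.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Assumed input formats; day comes before month (UK style).
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y")


def normalise_date(text):
    """Convert a date in any accepted format to ISO form (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")


def normalise_amount(text):
    """Strip a leading currency symbol and thousands separators."""
    cleaned = text.strip().lstrip("£$€").replace(",", "")
    return Decimal(cleaned)


def clean_records(raw_rows):
    """Normalise each (product, date, amount) row and drop duplicates."""
    seen = set()
    cleaned = []
    for name, date_text, amount_text in raw_rows:
        record = (
            name.strip(),
            normalise_date(date_text),
            normalise_amount(amount_text),
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def to_csv(records):
    """Render the cleaned records as .csv text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["product", "date", "amount"])
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    raw = [
        ("Widget ", "05/12/2014", "£1,250.00"),
        ("Widget", "2014-12-05", "£1250.00"),  # same record, different format
        ("Gadget", "5 December 2014", "99.50"),
    ]
    print(to_csv(clean_records(raw)))
```

Note that the two Widget rows only turn out to be duplicates after normalisation, which is why the clean-up has to happen before duplicate removal.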

3. How difficult is it to make changes?
What happens if a website changes, or if you need to monitor and extract data from new websites? Many web-scraping tools fail when websites change, which then requires intervention, using precious resources and in some cases requiring a developer to fix the problem. Unless you have a developer in-house to make these fixes, this adds time and expense, and the problem only grows as you monitor and extract data from hundreds or even thousands of websites. If scalability is important to you, be sure you know how the technology solution monitors and handles changes to a website, especially if you want to expand beyond your immediate data collection needs.
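One simple way to notice that a website has changed is to check every extraction against the layout you expect before the data goes any further. The Python sketch below is one possible approach, not a description of how any particular tool works. The expected column names and the check_layout function are hypothetical, chosen for this example.

```python
# Hypothetical layout that the extraction was originally built for.
EXPECTED_COLUMNS = ["Product", "Price"]


def check_layout(rows, expected_columns=EXPECTED_COLUMNS):
    """Return a list of problems; an empty list means the layout matches.

    `rows` is a list of rows, each a list of cell strings, with the
    header row first.
    """
    if not rows:
        return ["no rows extracted"]

    problems = []
    header = list(rows[0])
    if header != list(expected_columns):
        problems.append(
            f"header changed: expected {list(expected_columns)}, got {header}"
        )
    for index, row in enumerate(rows[1:], start=1):
        if len(row) != len(expected_columns):
            problems.append(
                f"row {index} has {len(row)} cells, "
                f"expected {len(expected_columns)}"
            )
    return problems


if __name__ == "__main__":
    # The site has added a "Stock" column and renamed "Product".
    changed_page = [["Item", "Price", "Stock"], ["Widget", "£99.50", "3"]]
    for problem in check_layout(changed_page):
        print("WARNING:", problem)
```

A check like this only reports the change. Someone still has to update the extraction, and that effort is what multiplies across hundreds or thousands of sites.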

Look at the bigger picture

Extracting and transforming web data takes more than purchasing a web-scraping tool. Think about the data you are collecting and how it’s tied to your business. In all likelihood, there’s a strong set of business drivers for collecting the data, and taking shortcuts will only compromise the success of your business goals. And you should never feel uneasy about the information you are collecting.

Look beyond the data that’s being extracted, and think about what you are doing with it. Are you improving customer experience, creating competitive advantage, or streamlining processes that rely on data from websites, portals, and online verification services?

In summary, you will most likely need to acquire, enhance and deliver information in a faster, smarter and more agile way than ever before, whilst automating the acquisition and integration of information, in particular from websites and web portals, into your business applications without the need for coding.

For more information on how to collect web data without all the pain, please contact us.


AAC Systems . No.1 Bell Street . Maidenhead . Berkshire . SL6 1BU . Telephone 01628 421 569