At its most basic, data scraping is the technique by which a piece of software extracts data from the human-readable output of another piece of software. The term is frequently used to refer to web scraping bots, which extract information from websites: web content for business intelligence, prices for comparison sites, or user-generated content pulled from social media sites for market research.
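To make the idea concrete, here is a minimal sketch of scraping in Python, using only the standard library. The HTML snippet, the class names (`product`, `name`, `price`) and the `PriceScraper` helper are all hypothetical, invented for illustration; a real scraper would fetch a live page and would have to cope with far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: markup written for human readers, not for programs.
PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span> <span class="price">$24.99</span></li>
  <li class="product"><span class="name">Toaster</span> <span class="price">$31.50</span></li>
</ul>
"""


class PriceScraper(HTMLParser):
    """Collects the name and price text from each product list item."""

    def __init__(self):
        super().__init__()
        self.field = None  # which span we are currently inside, if any
        self.rows = []     # one dict per product found so far

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self.field = css_class

    def handle_data(self, data):
        # Only keep text that appears inside a name or price span.
        if self.field and self.rows:
            self.rows[-1][self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None


def scrape(html):
    """Return (name, price) pairs extracted from the given HTML text."""
    parser = PriceScraper()
    parser.feed(html)
    parser.close()
    return [(row["name"], float(row["price"].lstrip("$"))) for row in parser.rows]


if __name__ == "__main__":
    for name, price in scrape(PAGE):
        print(name, price)
```

The same pattern, pointed at millions of public profile pages rather than a two-item product list, is essentially what large-scale scraping amounts to.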
Web scraping is not inherently good or bad. A price comparison website, for example, is a positive use of the technology since it helps users find the best possible deals on a product or service. A more negative use of web scraping, meanwhile, might be an automated agent that copies content it does not have the right to reproduce and posts it elsewhere.
In some cases, scraping may be legally challenging: certain data might be publicly available, but not intended to be collected and used for the scraper’s purpose. This is where good data security comes into play.
Spate of data scraping instances
A terrifying illustration of this last point is the recent spate of data-scraping security threats. These involve almost unfathomable amounts of data being scraped from major platforms and sold or posted on hacker forums.
In April, for example, it was reported that personal information belonging to 533 million Facebook users from 106 countries had been scraped from the platform and posted online. This information included names, birth dates, and phone numbers. While Facebook noted that it was at least a couple of years old, cyber security experts were nonetheless concerned that it could be used by bad actors for social engineering attacks, hacking, scams, and other nefarious activities.
Not long after, a similar incident was reported, this time involving the data of 500 million users on business networking service LinkedIn. Following this, a similar scraping of data was reported on the buzzy, invite-only voice chat app Clubhouse, with the technique used to garner data from approximately 1.3 million users. This data reportedly included names, user ID numbers, profile images, social media handles, referrer name (since new Clubhouse users must be recommended by an existing user), and more.
Not classical hacks
As with Facebook, these were not classical instances of hacking in the sense of breaking into a system. They nonetheless showcase how bots can be used to aggregate enormous amounts of publicly available user information.
Because this is not technically a data breach, it represents a challenging new frontier for users and companies to deal with. The likes of Facebook have suggested that users must think carefully about the information they post online, and carry out frequent “privacy check-ups” to ensure that they are properly protected. The spread of these incidents highlights one of the many challenges of social media: Users post content online with the tacit understanding that it will be used only in certain contexts.
Facebook, for instance, is structured around interactions with “friends” who are invited to interact with us on a social platform. LinkedIn is built around business interactions, primarily involving people we are connected to via professional links. In each case, users may have very different beliefs about what is and isn’t acceptable to share, based on who they think will be reading. A Facebook user might share their phone number because they expect it to be used only by friends, whereas that same person may not do so on LinkedIn for fear of being bombarded by recruiter messages. In both cases, they could be unaware that much of this information is publicly viewable.
Protecting against threats
The threat of large-scale web scraping isn’t necessarily that it is being carried out, but rather what the data could be used for in the wrong hands. Such information could be harnessed in phishing attacks, or used to try to brute-force entry into other accounts or systems in order to spread malware.
Facebook’s advice about being careful what users share online is sensible. However, it’s not the only data security measure organizations should employ. To protect against attackers potentially gaining entry to systems, companies should make sure that they use Identity and Access Management (IAM) frameworks to control which users are able to gain access to sensitive information. They must also consider two-factor and multi-factor authentication, which make it significantly tougher for bad actors to gain unauthorized access to information. On top of this, user behavior analytics, database firewalls, data encryption and data loss prevention (DLP) are all incredibly valuable when used properly.
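As an illustration of why a second factor raises the bar, the sketch below generates and checks a time-based one-time password (TOTP) in the style of RFC 6238, using only Python’s standard library. The function names `totp` and `verify` are illustrative choices rather than any particular product’s API, and a production system would more likely rely on a vetted authentication library with rate limiting and clock-drift handling.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Return the one-time code for the time window containing `at` (default: now)."""
    now = time.time() if at is None else at
    counter = int(now // step)  # number of whole time steps since the Unix epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low nibble picks the offset
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


def verify(secret: bytes, submitted: str, at: Optional[float] = None) -> bool:
    """Check a submitted code against the current window in constant time."""
    return hmac.compare_digest(totp(secret, at), submitted)


if __name__ == "__main__":
    shared_secret = b"12345678901234567890"  # example secret; never hard-code real ones
    print("current code:", totp(shared_secret))
```

Because the code depends on a secret shared only between the user’s device and the service, a name, phone number or email address harvested by a scraper is not enough on its own to pass this check.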
Data scraping is just one way that malicious actors try to gain information they can use as part of cyber attacks. The problem, unfortunately, isn’t going away. But by taking the right precautions you can greatly reduce the risks involved.