What is Web Scraping?

James Phoenix

Web scraping is the process of collecting information and data from websites and webpages. The internet is an incredibly rich data source for almost any subject or topic you can think of, and it has never been easier to harvest that data from the web and turn it into useful, predictive insights.

Web scrapers allow you to collect this data, whilst web crawlers help you to systematically mine the internet for online community insights and potential target websites to scrape data from.

Copying and pasting information from a website is essentially web scraping by hand. However, the internet now hosts approximately 1.7 billion websites, so collecting data at any real scale clearly calls for more advanced scraping techniques.


Crawling and Scraping

Whilst web scraping refers to the fundamental process of searching the internet for data and then logging it for use and analysis, data collection actually involves two distinct activities: crawling and scraping.

You can give a scraper a defined list of URLs to scrape and it will only scrape data from those URLs, but if you want it to travel through a website or set of websites then you’ll likely require a web crawler.

Crawling is the process of reading and following website information itself: following links and moving between sites, pages and their content. When you select and obtain elements (for example div elements) with CSS selectors or XPath selectors and organise that information into a structured format, this is web scraping, or data mining.
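To make the crawling half more concrete, here is a minimal sketch of a breadth-first crawl. It is an illustration rather than a production crawler: the `crawl` function, the `get_links` callback and the in-memory `SITE` dictionary are all hypothetical stand-ins, and a real crawler would download each page over HTTP instead.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> list[str]:
    """Breadth-first crawl: return pages in the order they were visited."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)          # a scraper would process this page here
        for link in get_links(url):
            if link not in seen:     # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site": each page maps to the links it contains.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": [],
}

pages = crawl("/", lambda url: SITE.get(url, []))
print(pages)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, and `max_pages` caps how far it wanders.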


The Web Crawler or Spider

At the heart of a good crawling process is the spider. A web crawler or spider is an automated bot designed to systematically browse the web. It follows the strands of the web through <a> links, hence its nickname ‘spider’, and builds indexes from the data it encounters whilst crawling. Googlebot is an example of a famous crawler that collects information from the web to build its own searchable index.
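As a rough sketch of how a spider finds the strands to follow, the snippet below pulls every <a> link out of a page using only Python's standard library. The `LinkExtractor` class, the `extract_links` helper, the sample markup and the example.com URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and URL, for illustration only.
sample = (
    '<p><a href="/pricing">Pricing</a> '
    '<a href="https://example.org/">Partner</a> '
    '<a>No href</a></p>'
)
links = extract_links(sample, "https://example.com/home")
print(links)
```

Anchors without an href are skipped, and relative links are turned into absolute URLs so they can be queued for a later visit.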


The Web Scraper or Parser

In web scraping, however, the crawler is simply the driving mechanism: it leads the scraper, which then scrapes content from the page. The two are complementary, as a scraper cannot navigate through websites and webpages by itself.

To scrape data, several actions need to be completed:

  1. Analysing information on the page to discover its meaning.
  2. Deciding based on set criteria whether or not to scrape that information.
  3. Copying this information into a spreadsheet or database for displaying, interpretation or further analysis.

The Scraping Process

Web scraping programs without crawlers always require a list of URLs.

Typically, scrapers will download and parse the HTML code for each URL, though more complex scrapers can load all content, including CSS and any JavaScript.

After the webpage has been downloaded and parsed with a parser, the scraper will use programmatic methods to discover and extract information relevant to your previously defined criteria.

Such methods for selecting and obtaining HTML elements include CSS selectors and XPath selectors.
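As a small illustration of element selection, the sketch below uses the limited XPath subset built into Python's `xml.etree.ElementTree`. It assumes the page is well-formed markup, which real-world HTML often is not, so in practice a more forgiving third-party parser is usually used instead. The `PAGE` markup, class names and field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="product"><h2>Kettle</h2><span class="price">24.99</span></div>
    <div class="product"><h2>Toaster</h2><span class="price">39.50</span></div>
    <div class="advert"><h2>Sponsored</h2></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

products = []
# ".//div[@class='product']" selects every <div class="product"> in the tree.
for div in root.findall(".//div[@class='product']"):
    name = div.findtext("h2")
    price = div.findtext("span[@class='price']")
    products.append({"name": name, "price": float(price)})

print(products)
```

The advert block is ignored because it does not match the selector, which is the "deciding based on set criteria" step in miniature.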


Your web scraper’s effectiveness will mostly depend on:

  • Clearly defining what elements you would like to extract from the web pages.
  • Being able to handle errors. For example, what happens when the element isn’t present? Will your code, software or program be able to handle that?
  • Outputting data in a readable format, the most basic being a .CSV file, spreadsheet or JSON file.
  • Creating a fixed place where you will store the data. This could be as simple as a Google Sheet or as advanced as a database, e.g. Google BigQuery.
  • Using proxies and random sleep delays whenever you need to avoid detection.
  • Clearly identifying how often you will scrape data and what you will do when that data is not present.
  • Remembering that websites and apps change, and including maintenance in your ongoing costs.
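Two of the points above, handling missing elements and producing readable output, can be sketched together. This is a minimal example under stated assumptions: the `extract_row` and `to_csv` helpers, the field names and the sample markup are hypothetical, and the markup is assumed to be well-formed so that the standard library can parse it.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["name", "price"]


def extract_row(item: ET.Element) -> dict:
    """Return one output row; a missing element becomes None, not a crash."""
    return {
        "name": item.findtext("h2"),
        "price": item.findtext("span[@class='price']"),
    }


def to_csv(rows: list[dict]) -> str:
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical markup: the second product has no price element.
PAGE = (
    "<body>"
    "<div><h2>Kettle</h2><span class='price'>24.99</span></div>"
    "<div><h2>Toaster</h2></div>"
    "</body>"
)

rows = [extract_row(div) for div in ET.fromstring(PAGE).findall("div")]
csv_text = to_csv(rows)
print(csv_text)
```

Because `findtext` returns None when nothing matches, the missing price ends up as an empty cell in the CSV rather than stopping the whole run. Whether an empty cell, a default value or a logged warning is the right response depends on your project.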

Third-party apps and software can be used to automatically organise scraped information into a coherent searchable database. 


To Recap:

  1. A web scraper either receives URLs to scrape or a crawler provides a list of links to it during the crawling process.
  2. The scraper parses the data from a URL and discerns what is important to store and what will be discarded.
  3. The scraper then outputs this interpreted data in a form that can be collected and stored for future use.

What Types of Web Scrapers Exist?

Broadly speaking, web scrapers fall into 2 general categories:

  1. Browser extension web scrapers.
  2. Software web scrapers.

But, ultimately, there are plenty of nuances to these rather complex bots. After all, the internet was designed for humans, not for machines, so the process of syntactically analysing webpages is very demanding. 


Browser Extension Web Scrapers

Web scrapers attached to your browser can automatically scrape any webpage you visit. This is perfect for collecting data semi-manually on relatively complex terms or themes that you want to screen yourself as you go.

However, if the scraping is tied to your own web browsing, then you’re fairly limited to downloading small datasets.


Software Web Scrapers

Web scrapers in the form of downloadable software usually come with lots of advanced features, e.g. rotating your scraping IP to allow you to more effectively scrape a site, and they can also run in the background separated from your browser. Some of these software scrapers have advanced UIs that allow you to output data in many formats, search your database, schedule scraping sessions and many other features.


Cloud and Local Scrapers

So where does the scraper operate from? Web scraping requires a lot of system resources, and crawling through hundreds, thousands or even millions of webpages is arduous. As a result, serious scraping tasks usually run off-site on a cloud-based service, where scraping companies run your scraper for you on their own servers. Local scrapers running off your own PC work fine for smaller tasks, but even then you may notice system drain during the process.


Use Cases For Web Scraping

Current Affairs, News and Content

Web scraping can be used to track the current trends around current affairs or news. This allows businesses to follow emerging stories on themselves, their industry or their competitors, allowing them to react in a timely fashion, hopefully before others.

Journalists can also scrape the internet for leads, story ideas or to-the-minute information on emerging stories. This can also inform content production, e.g. blogs and other social media content. Social media can be scraped to see what reactions are emerging to trends, how opinions tend to group around certain subjects and how current events are being gauged by the public.

  • Analyse responses to trends to time business reactions, e.g. climate change reports.
  • Make investment or buying decisions based on emerging news stories.
  • Monitor competitors.
  • Target campaigns, e.g. topical marketing or political campaigns.

Price Monitoring and Analysis

Monitoring prices is relatively simple as you’ll usually be scraping Amazon or eBay, and perhaps your competitor’s eCommerce store. You could also scrape prices from wholesalers like Alibaba. This allows you to optimise your own pricing structures and respond to general price movements within your industry.

  • Price your stock dynamically, e.g. to undercut competitors in trending markets whilst easing prices out elsewhere.
  • Analyse and monitor buying trends and competitors’ marketing strategies.
  • Comply with minimum advertised price (MAP) policies and other pricing legalities.

Market Research

Market research lies at the intersection of many of these use cases. Web scraped data can inform market research, helping you to make effective business decisions ranging from content strategies to social media and advertising campaigns.

Market research also extends to sentiment analysis. Businesses can scrape review data and other responses to their products and services to discover and itemise themes in the responses. These sentiments can be used to optimise products or improve on-boarding, customer service, etc.

  • Analyse existing and trending markets.
  • Choose the best time for product launch.
  • Monitor competitor strategies.
  • Monitor market price dynamics.

Finance and Investment

We’ve touched upon the potential of web scraping for finance, but now more than ever finance and investment companies are tapping into semantic data to help guide their investment decisions. Web scraping can help track emerging trends and analyse their potential impact (e.g. cryptocurrency impacts on traditional financial markets), allowing investors to act ahead of forecast impacts and mitigate loss.

  • Analyse company filings (e.g. SEC filings in the US or Companies House data in the UK).
  • Monitor public sentiment towards industries, e.g. using negative sentiment towards fossil fuel to encourage investment in green energy sources.
  • Develop social media and PR responses to emerging issues and public discontentment.

Real Estate and Property

Monitoring real estate trends using web scraping allows estate agents and realtors to assess competitors’ fee models whilst tracking market dynamics on a local, regional and national level. By scraping listings data, estate agents can improve on their competitors’ listings.

  • Monitor property value at multiple levels.
  • Analyse buying and selling trends.
  • Scrape property listings for actionable data.

Machine Learning

Web scraping can gather data on human behaviour and internet communication patterns which can, in turn, be used for machine learning projects. What people post or talk about, when and why they do it, what they engage with and what they do (chat, create, argue, etc.) is all contained on the internet, alongside a huge range of emotional and sentimental expression. Together, this data can inform AI on how we have come to use the internet and what it means to be a human interacting with it.

  • Web scraping provides AI with data on how humans communicate online.
  • Objective and subjective facts and opinions can be tallied up with emotional and sentimental expressions.
  • Web scraping provides clues as to how the internet is evolving.

Web Scraping FAQ

What is Web Scraping Used For?

Web scraping is useful for anything that requires data sourced from the internet. Data can be collected around themes or search terms and then organised in a database for analysis and implementation.

Is Web Scraping Legal?

Web scraping is extremely common, almost ubiquitous, amongst both small and large businesses. However, the legality surrounding it is extremely complex. Though web scraping in itself is generally not illegal, legal issues arise when it comes to scraping personal data or copyrighted materials.

Amazon: Scraping Amazon is not illegal and is already part of many businesses’ models, though the company does provide its own price analysis API.
Facebook & Instagram: Generally, scraping social media platforms is not illegal but the use of personal data is restricted.

Is Web Scraping Easy?

It depends directly on the specificity of the data you want to scrape. The more precise your data is, the harder it will be to scrape and the more likely you are to find errors in your scraped data, e.g. results that aren’t relevant to your terms. However, scraping generic data from HTML pages using software scrapers or browser scrapers is relatively easy.

Can I get Blacklisted by Web Scraping?

Yes. Bad web crawling practice puts a strain on the target site, and some sites obviously won’t actively permit web crawling and will kick a bot off the website if it doesn’t look human. Sites will often have a robots.txt file that contains instructions on how robots should treat a site when they access it. 
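As a hedged sketch of how a bot might respect robots.txt, Python's standard library includes `urllib.robotparser`. The rules, the example.com URLs and the `example-bot` user-agent name below are hypothetical. A live crawler would download the real file from the target site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a live crawler would fetch the real file
# from the target site (for example with set_url() and read()) instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "example-bot"  # hypothetical user-agent name

allowed = parser.can_fetch(AGENT, "https://example.com/blog/post-1")
blocked = parser.can_fetch(AGENT, "https://example.com/private/data")
print(allowed, blocked)
```

Checking `can_fetch` before every request is a cheap way to stay within the rules a site has published for robots.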

There is a bit of an underground war going on here. Many websites will try to boot off crawlers whose behaviour looks too repetitive or too demanding on resources. Sites may also have ‘honeypots’: links that only bots can see. If a bot accesses the link, it trips the trap!

To combat this, web crawlers are designed to appear more human, but this also slows them down. There’s a happy medium between reliability and effectiveness.
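One common way to look less mechanical, and to go easier on the target server, is to pause for a random interval between requests. The sketch below is only an illustration: `polite_fetch_all`, its parameters and the delay range are assumptions, and the `fetch` and `sleep` functions are passed in so the example can run without touching the network.

```python
import random
import time
from typing import Callable, Iterable


def polite_fetch_all(urls: Iterable[str],
                     fetch: Callable[[str], str],
                     min_delay: float = 1.0,
                     max_delay: float = 4.0,
                     sleep: Callable[[float], None] = time.sleep) -> list[str]:
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Irregular gaps look less mechanical than a fixed interval
            # and spread the load on the target server.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Dry run with a stand-in fetch function and no real sleeping.
pages = polite_fetch_all(
    ["/a", "/b"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=lambda seconds: None,
)
print(pages)
```

Longer delays make the scraper slower but gentler, which is exactly the trade-off described above.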

Is Web Scraping Ethical?

That’s a tricky one! Obviously, it depends on what you’re using the data for and whether you’re collecting personal information.

Web scrapers used for grey hat purposes might scrape emails to be added to email lists. This can contravene the law, e.g. the EU GDPR, which restricts personal data collection and unsolicited emails.

For scraping nameless or commercial data, it’d be hard to declare web scraping as unethical, though. After all, it is simply an automation of a process we can do ourselves, it’s just easier to get robots to do it for us…

How can I Web Scrape data without coding?

There are lots of software providers that offer a drag-and-drop interface for scraping data. This means that even if you can’t code, you can easily build a web scraper. One of my personal favourites is ParseHub!
