Web scraping is an effective and scalable method for automatically collecting data from websites and webpages. In comparison, manually copying and pasting information from the internet is incredibly cumbersome, error-prone and slow.
As scraping data can be highly beneficial for your business, there are now many web scraping tools to choose from, making it essential that you pick the right piece of software based upon your specific use case and technical ability.
Table of Contents
Choosing The Right Web Scraping Tool For The Job
A web scraper will load the HTML code of a webpage before parsing and extracting information that meets set criteria. More complex scrapers can also scrape content rendered by AJAX and JavaScript.
Often web scraping browser extensions have fewer advanced features and are more suited to collecting specific information from fewer URLs, whereas open-source programming frameworks are much more sophisticated, allowing you to scrape huge quantities of data efficiently and without interruption.
Scraping isn't illegal, but many sites do not want robots crawling around them and scraping their data, so a good web scraping application must be able to avoid detection.
Another important aspect of scraping is that it can be resource-intensive. Whilst smaller web scraping tools can be run effectively from within your browser, large suites of web scraping tools are more economical as standalone programs or web clients.
One step further still are full-service web scraping providers that offer advanced web scraping tools from dedicated cloud servers.
Web Scraper Browser Extensions
This type of web scraping tool acts as an extension for Google Chrome and Firefox, allowing you to control scraping tasks from within your browser as you search the internet. You can have the web scraper follow you as you search manually through some pages, essentially automatically copying and pasting data, or have it perform a more in-depth scrape of a set of URLs.
Webscraper.io Web Scraper Extension
Pricing: Free (browser) and paid for cloud crawling
Web Scraper from webscraper.io is a Chrome extension, enabling you to scrape locally from the browser using your own system resources. It's naturally limited in scope, but it does allow you to construct a sitemap of pages to scrape using a drag-and-drop interface. You can then scrape and intelligently categorise information before downloading data as a CSV file.
| Benefits | Ideal For |
| --- | --- |
| Particularly good at scraping detailed information from limited web pages (e.g. a few product categories or blog posts). | Smaller web-scraping projects. |
| Conveniently executes from a Chrome browser. | Collecting ideas from blogs and content. |
| Totally free. | Scraping data from small shop inventories. |
Dataminer.io
Pricing: Free up to 500 pages/month and paid plans for larger crawls
This convenient browser extension scraper enables you to efficiently scrape a wide array of data from modern webpages and compile it into CSV and XLS files. Data is easily converted into clear, well-structured tables, and semi-manual scraping controls allow you to be selective about what data you scrape or ignore.
Recipes
Dataminer also comes bundled with pre-built scripts/tasks called 'recipes': web scraping schematics developed by the community that instruct the scraper on what data to collect. These include recipes for scraping data from e-commerce sites such as eBay, Amazon and Alibaba, as well as from social media, news sites, etc.
| Benefits | Ideal For |
| --- | --- |
| User-friendly browser extension. | Smaller commercial projects and startups. |
| 'Recipes' provide readymade scraping queries optimised for popular scraping tasks. | Those who need to scrape specific or niche data without coding knowledge. |
| Scalable cloud-server run services for bigger projects or businesses. | Anyone looking for a streamlined scraper for collecting data from popular sites. |
Google Chrome Developer Tools
Pricing: Free
Google Chrome has a built-in Developer Tools section, making it easy to inspect HTML source code. I find this tool incredibly useful, especially when you're looking to locate certain HTML elements and would like to get either:
- The XPath selector.
- The CSS selector.
1. The copy functionality
Right-click any element in the Elements panel and choose Copy → Copy XPath to grab its XPath selector.
Example of an XPath query:
//*[@id="__next"]/div/div[2]/div[1]/div[2]
Alternatively, choose Copy → Copy selector to grab a CSS selector.
Example of a CSS selector query:
#root > div > article > div > section:nth-child(4) > div
2. The $x command
A useful command in Google Chrome Developer Tools is the $x command, which executes XPath queries within the Console section of the browser. It's a great way to quickly test and refine your XPath queries before using them within your code.
Also, thanks to the console's eager evaluation, $x previews the matched elements before you even press Enter, so you can easily refine the expression until you have the perfect query.
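For instance, to preview every link nested inside an h2 heading (an illustrative structure; swap in the elements of your target page), you could type the following into the Console.
Example of a $x query:
$x('//h2/a')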
3. Listening and copying network requests
Whilst web scraping you will often have to replicate complex network requests which may need to include specific:
- Headers.
- Cookies.
- Parameters (as a payload).
So even if content only loads after clicking a button or scrolling down a page, you can easily record and replay these requests via Chrome Developer Tools.
In order to do this, simply open the Network tab, click the XHR filter and enable 'Preserve log'. Once you've found the right request, right-click it and select 'Copy as cURL'.
Example of a cURL request:
curl 'https://collector-medium.lightstep.com/api/v0/reports' -X OPTIONS -H 'Access-Control-Request-Method: POST' -H 'Origin: https://medium.com' -H 'Referer: https://medium.com/@aris.pattakos/3-1-chrome-tools-you-can-use-to-scrape-websites-faster-7a1952a10b52' -H 'Sec-Fetch-Dest: empty' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36' -H 'Access-Control-Request-Headers: content-type,lightstep-access-token' --compressed
Afterwards, you can simply paste the cURL command into an online converter such as curlconverter.com, which will then output an equivalent request for your programming language of choice (with the relevant headers, cookies and parameters).
Example of the converted request in Python:
import requests

# Headers copied from the captured request in Chrome Developer Tools
headers = {
    'Access-Control-Request-Method': 'POST',
    'Origin': 'https://medium.com',
    'Referer': 'https://medium.com/@aris.pattakos/3-1-chrome-tools-you-can-use-to-scrape-websites-faster-7a1952a10b52',
    'Sec-Fetch-Dest': 'empty',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
    'Access-Control-Request-Headers': 'content-type,lightstep-access-token',
}

# Replay the same OPTIONS request programmatically
response = requests.options('https://collector-medium.lightstep.com/api/v0/reports', headers=headers)
| Benefits | Ideal For |
| --- | --- |
| Completely free. | Programmers and professional web scrapers. |
| Perfect for debugging and creating programmatic web scrapers. | Anyone with XPath and CSS selector knowledge. |
| Easily find the right elements for creating more complex scrapers. | |
Open Source Web Scraping Frameworks
Open source web scraping frameworks allow you to build your own scrapers that are optimised for your project's unique requirements.
These are suitable for demanding projects where you'll need to run multiple automated scraping tasks or large-volume niche archiving projects; however, it is also possible to use these frameworks for smaller web scraping projects.
Programming Language: Python
Scrapy
Pricing: Free
Scrapy is a free and open-source Python-based web scraping framework which provides customisable web scrapers for pretty much any task, whether small quantities of data need to be mined in intense detail or huge volumes need to be mined across multiple websites.
With a massive community, Scrapy has the backing of professional programmers, data scientists and enthusiasts alike. Scrapy also allows you to build your own spider and scraper before deploying it to Scrapy Cloud, or to your own server via Scrapyd.
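To give a flavour of what a Scrapy spider looks like, here is a minimal sketch that collects quotes and authors from quotes.toscrape.com (a public practice site for scrapers) and follows pagination links; the CSS selectors match that site's markup and would need adjusting for other targets.
Example of a minimal Scrapy spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, this can be run with scrapy runspider quotes_spider.py -o quotes.csv to output the scraped data as a CSV file.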
| Benefits | Ideal For |
| --- | --- |
| Integrated proxy management with Crawlera. | Large-scale / custom web scraping projects. |
| AutoExtract API for News, Product & Job Postings data. | Cost-effective scraping solutions that are expandable and adaptable. |
| Cheap, even when using the Scrapy Cloud. | Outputting scraped data for other automated uses. |
| Excellent documentation and a wide range of tutorials on Udemy and YouTube. | |
Requests
Pricing: Free
Requests is a simple yet elegant HTTP library for Python that makes web scraping easy, offering both HTTP GET and HTTP POST requests.
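As a quick illustration, here is a minimal sketch that fetches a page and prints part of its HTML, using httpbin.org (a public HTTP testing service) as a stand-in for a real target.
Example of a simple GET request with Requests:
import requests

# Fetch a page and make sure the request succeeded
response = requests.get('https://httpbin.org/html')
response.raise_for_status()

print(response.status_code)   # 200
print(response.text[:200])    # First 200 characters of the returned HTML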
| Benefits | Ideal For |
| --- | --- |
| Easy to use and beginner-friendly. | Simple web scraping applications. |
| Provides proxy support and there are plenty of useful code snippets on StackOverflow. | |
Selenium (Python & Java Versions Available)
Pricing: Free
Selenium WebDriver is a library of open-source APIs which is often used for automated testing of front-end web applications. However, as a programmable browser that can dynamically render JavaScript-created content, it is an incredibly powerful tool for web scraping more complex websites.
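For example, here is a minimal sketch (assuming Selenium 4+ with Chrome installed) that opens the JavaScript-rendered version of the quotes.toscrape.com practice site and prints the quotes once they have rendered.
Example of scraping a JavaScript-rendered page with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()   # Selenium 4.6+ can fetch the driver automatically
driver.implicitly_wait(5)     # Wait up to 5 seconds for elements to appear

# This page only renders its content via JavaScript
driver.get('https://quotes.toscrape.com/js/')

for quote in driver.find_elements(By.CSS_SELECTOR, 'div.quote span.text'):
    print(quote.text)

driver.quit()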
| Benefits | Ideal For |
| --- | --- |
| An effective framework for automated testing and advanced workflows. | Web scraping JavaScript-rich websites. |
| Support for multiple browsers including Google Chrome and Firefox. | Automating complex workflows on a website including: form submissions, logging into websites & reverse engineering AJAX calls. |
| Selenium Grid provides an effective way to run multiple web bots/scrapers at scale. | Advanced web scrapers and websites that require authentication such as LinkedIn, Facebook or Twitter. |
Programming Language: Javascript
Cheerio
Pricing: Free
Cheerio provides a quick, nimble and flexible implementation of jQuery designed specifically for server-side usage. The framework parses the HTML markup and provides you with an easy API for manipulating the DOM.
| Benefits | Ideal For |
| --- | --- |
| Beautiful syntax, as Cheerio implements a subset of jQuery without a lot of code bloat. | Similar to JavaScript's HTTP library Request, Cheerio is excellent at extracting data from static HTML web pages. |
| Also, Cheerio is incredibly fast due to having a very consistent, simple DOM model. Several benchmark experiments show that it is ~8 times faster than JSDOM. | |
Puppeteer
Puppeteer is JavaScript's answer to Selenium, providing a headless Chromium browser for automated testing. The framework is also an effective method for web scraping JavaScript-rich websites which only display content after user interaction.
Pricing: Free
| Benefits | Ideal For |
| --- | --- |
| Renders JavaScript content: as Puppeteer can render interactive content, it is perfect for crawling single-page applications. | Testing Google Chrome extensions. |
| Puppeteer can be used to generate pre-rendered content for server-side rendering. | Automating complex workflows on a website including: form submissions, logging into websites & reverse engineering AJAX calls. |
| Puppeteer features an excellent community and support for developers as the open-source technology is backed by Google. | |
Apify SDK
Apify SDK is a JavaScript web scraping framework which helps you to achieve scalable web crawling and scraping for medium to large projects. The tool utilises Puppeteer and Cheerio and provides an easy method to completely operationalise your web scraping.
Pricing: Free
| Benefits | Ideal For |
| --- | --- |
| Easy to manage lists & queues of URLs to crawl. | Web scraping professionals. |
| High performance due to being able to run crawlers in parallel. | Custom web scraping projects. |
| Utilises Puppeteer and Cheerio, so it's good for web scraping JavaScript-rich websites. | |
Programming Language: Java
Heritrix
Pricing: Free
Heritrix is a Java web crawler designed to archive internet data for historical and cultural purposes. This means it is slow and purposeful, built for volume rather than speed. One defining characteristic is that it respects the robots.txt file in a website's root directory. The framework is currently in use by many national archives and libraries.
| Benefits | Ideal For |
| --- | --- |
| Niche crawler for archiving. | Scraping and archiving huge volumes of data. |
| Utilises a REST API for creating, monitoring and accessing jobs. | Non-commercial informational use. |
| Incredibly useful for accessing archived crawled data whilst respecting robots.txt files. | |
Nutch
Used in a similar way to Heritrix, Nutch is an archive-quality crawler that slowly and purposefully scrapes and archives web data. It's been employed in some non-profit projects like the Common Crawl project that stores petabytes of web data every month. Nutch still has a large community of active enthusiasts, though, and it can easily be employed as an open-source Java web crawler for an array of uses thanks to its robust architecture and modular features.
| Benefits | Ideal For |
| --- | --- |
| Archival web crawler. | Archiving large volumes of web data. |
| Very active community. | Those looking to build a broad-scale crawler with Java. |
| Modular, extensible and scalable. | Hobbyist crawling projects. |
Programming Language: PHP
For an extensive list of PHP web scraping libraries, I'd recommend checking out the following resources and guides.
Programming Language: GoLang
For the most up-to-date GoLang web scraping libraries, I found these guides to be very thorough:
Paid Web Scraping Tools & Software and Clients
Non-browser-extension web scrapers run from their own downloadable software or web clients. These are usually more in-depth and professional, and come with tons of extra features that allow you to scrape complex data quickly and then output it to databases or APIs.
Mozenda
Pricing: Paid plans starting at ~$250 / month
Mozenda is a leader in the scraping & data mining industry, trusted by thousands of huge businesses worldwide including around a third of the USA's Fortune 500 companies. It is a simple way to achieve advanced, high-level data mining; however, the tool is quite costly at $250 per month for the lowest package.
Despite the cost, Mozenda is intuitive to use and the advanced packages come bundled with 8 hours of training lessons that show you how to get the most out of this impressive piece of software.
| Benefits | Ideal For |
| --- | --- |
| Incredibly fast and sophisticated web scraping software. | Collecting volumes of commercial data. |
| Tiered scraping solutions range from project level to massive multinational corporation level. | Automating scraped data with an API. |
| Intuitive software and easy to use even when scraping large datasets. | Complex pricing strategies and comparing data with competitors. |
| Trusted by some of the world's largest businesses. | |
Octoparse
Pricing: Free plan (limited to 10k rows per export) with a paid standard plan of ~$75 / month
Octoparse is a thoroughly modern and sophisticated web scraper that suits projects or businesses of virtually any size. The modern interface is very slick, clean and easy to use, even for someone with no scraping, coding or other technical knowledge.
The tool provides superb customer service and a large community that helps onboard those with limited knowledge. Even on Octoparse's generous free plan, users can scrape data via readymade scraping templates, with support for techniques such as IP rotation, advanced automation and JavaScript content rendering.
Beyond that, Octoparse also has a full-service managed solution where they scrape and deliver data straight to you.
| Benefits | Ideal For |
| --- | --- |
| Intelligent and scalable web scraping service. | Sophisticated automated web scraping. |
| Superb interface. | Those looking for in-depth scraping tools that require little technical knowledge. |
| Generous free version has 10 crawlers and allows for 2 simultaneous scraping tasks. | Start-ups, established businesses and enterprises. |
| Plenty of features in the advanced versions. | |
ParseHub
Pricing: Free plan available (200 pages per run) + a paid standard plan at ~$149 / month
ParseHub is a codeless, easy-to-use and intuitive web scraper that comes in a well-engineered and highly functional web application form. It can extract clean, structured data from sites running AJAX and JavaScript, log in to scrape data hidden behind authentication, move through complex site structures quickly and even scrape images and map data.
ParseHub's machine learning approach to web scraping ensures that even the most complex pages are turned into intelligible datasets that can be exported as Excel, CSV, JSON or through a custom API. It's an impressive app and the free version is generous, providing 200 pages of scraped data over 40 minutes.
| Benefits | Ideal For |
| --- | --- |
| Powerful app with impressive usability. | Those who need to scrape complex web data from modern web pages. |
| Scrapes super-modern pages with ease. | Those looking for in-depth scraping tools that require little technical knowledge. |
| Free version is pretty good for many projects. | Start-ups, established businesses and enterprises. |
| Scalable all the way up to enterprise. | Businesses and commercial entities that need to compare and calculate large volumes of data. |
| | Anyone who needs a feature-packed free web scraper for a few pages. |
Dexi
Pricing: Standard plan starting at ~$119 / month
Dexi goes toe-to-toe with Mozenda as another world-class data scraping and management service that goes way beyond the basics to provide something truly fit for scraping highly sophisticated, modern websites and webpages. Dexi, like Mozenda, is also used by some of the world's business titans, including Amazon, Samsung and Virgin.
Dexi has both complex, scalable functions and simpler, more intuitive ones, but its primary offering is bots, grouped into four main types:
- Crawlers.
- Extractors.
- Autobots.
- Pipes.
These work in tandem to scrape multiple layers of data and organise them for specific purposes. Pipes can push data through to database services like PostgreSQL, MySQL and Amazon S3, or to any number of custom APIs, allowing extracted data to be implemented automatically across sites or networks of sites.
| Benefits | Ideal For |
| --- | --- |
| World-class flexibility. | Extracting huge volumes of complex data quickly. |
| Data scraping that can obtain pretty much any level of granular detail. | Businesses and enterprises looking for highly-customisable but efficient web scrapers. |
| Managed solutions for businesses and enterprises. | Outputting scraped data to other programs or endpoints. |
ProWebScraper
Pricing: Free plan available (200 pages per run) + a paid standard plan at ~$149 / month
ProWebScraper is a strong contender for the best web scraping tool on the market. Its point-and-click functionality makes web scraping an effortless exercise. ProWebScraper claims to be able to scrape 90% of internet websites with its robust features, such as automatic IP rotation, scraping data from difficult websites and extracting HTML tables. What makes ProWebScraper stand out from the rest is its 'Free Scraper Setup' service, in which dedicated experts will build scrapers for users.
In short, it's a tool that can simply automate your process of scraping web data.
| Benefits | Ideal For |
| --- | --- |
| Easy-to-use interface (no coding required). | Startups that need large amounts of web data. |
| Highly scalable architecture to scrape data from thousands of websites. | Any business that wants to understand market trends, pricing and availability, and to monitor competitors. |
| Geo-based extraction. | Data scientists who need structured data to train their algorithms/models. |
| Lowest price on the market. | Non-technical people who want to extract data. |
| Custom scraper setup for free. | Sales or marketing executives extracting leads. |
Summary
Under the hood, web scrapers have the same basic functions, but how they execute those functions varies in sophistication, reliability, efficiency and speed, and the right choice also depends on your technical ability.
There are web scrapers for everything and everyone ranging from university or college students who need to actively collect data for reports and essays to multinational corporations who collect petabytes of data every month. And from looking at the price of advanced scrapers you can easily see how essential automatic data collection is to businesses and organisations.
Choosing the right one for you obviously depends on many factors relating to your project and data needs. Scrapers are getting bigger and better, though, so you'll always have options ranging from DIY scrapers with tried-and-tested functionality to super-modern scrapers that make even the most complex data tasks as simple and streamlined as possible.
Web Scraping Tools FAQ
Can Every Website Be Scraped?
Yes. Even if a website owner places specific bot requests inside their robots.txt file, this is only a suggestion to a crawler and can be ignored by web scraping applications. However, some websites are harder to scrape, such as LinkedIn, Twitter or Facebook, where you need to be actively logged in and specific rate limits are attached to your personal social media account.
Is Web Scraping Illegal?
As noted above, web scraping itself isn't illegal, and it has a huge range of uses, from simplifying academic or other personal research to scraping price data from major e-commerce sites and archiving volumes of data to preserve the internet for public access. Not every use is commercially motivated: many universities and libraries scrape web data all the time for their archives as well as for research.
Why is Python Primarily Used for Web Scraping?
Python is an easy programming language to learn, and it is also home to one of the biggest open-source web scraping projects: Scrapy.
Can you make money from Web Scraping?
Data can be very valuable, so yes, you can make money from web scraping. Lists of competitor information, e.g. what they're selling products for at any given time, allow other retailers to undercut them or beat them on stock levels. This is just one of many examples where scraped data is commercially valuable.
Is Web Scraping Easy?
Scraping wasn't always an easy task, but it is now. Scraping tools are numerous and there's something for everyone at any price or scale, ranging from personal micro-level uses to massive multinational corporate uses.
Is Web Scraping Ethical?
This depends fundamentally on what data is being scraped and what it's being used for. Scraping personal data and using it for certain ends is not always legal or ethical; for example, scraping emails published in public places in order to send fraudulent or spam mail is neither ethical nor legal in most countries. Web scraping price data from wholesale sites like Alibaba, or from Amazon, eBay, etc., is probably not unethical, as this is a competitive environment anyway and personal data is not involved.
What is the Difference Between Web Crawling and Web Scraping?
Web crawling is the process of moving through URLs and site elements methodically. A crawler follows links and menus whilst a scraper follows behind, downloading the code and parsing it to extract useful information based on any input terms or queries. A scraper without a crawler will need to be given a set of URLs to scrape via a semi-automatic process, whereas a scraper paired with a crawler will be led around appropriate websites automatically; they work as a pair, one leading and the other following. A minimal sketch of this pairing is shown below.
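Here is a minimal sketch of a crawler/scraper pair in Python, assuming the Requests library is installed; the start URL (the quotes.toscrape.com practice site) and the page limit are illustrative placeholders, and a real crawler would typically also restrict itself to a single domain.
Example of a minimal crawler/scraper pair:
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # The "scraper" half: parses downloaded HTML and collects link targets
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    # The "crawler" half: works through a queue of URLs, following links as it goes
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = requests.get(url)
        parser = LinkExtractor()
        parser.feed(response.text)
        for link in parser.links:
            queue.append(urljoin(url, link))
    return seen

print(crawl('https://quotes.toscrape.com/'))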