While web scraping small websites rarely leads to scraping issues, when you start web crawling on larger websites or even Google, you’ll often find your requests can be ignored or even blocked.
In this article we’ll look at several web scraping best practices that help prevent your web scrapers from being blocked.
1. Use IP Rotation
Sending repeated requests from the same IP address is a clear footprint that you’re automating HTTPS/HTTP requests. Website owners can detect and block your web scrapers by checking the IP address in their server log files.
Often there are automated rules; for example, if you make more than 100 requests per hour, your IP will be blocked.
To avoid that, use proxy servers or a virtual private network to send your requests through a series of different IP addresses. Your real IP will be hidden. Accordingly, you will be able to scrape most of the sites without an issue.
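As a minimal sketch of the rotation idea, the snippet below cycles through a small pool of proxies in round-robin order using only the standard library. The proxy addresses are placeholders from a documentation IP range, and the names `PROXIES`, `next_proxy` and `fetch` are assumptions made for illustration; substitute the endpoints your proxy provider gives you.

```python
from itertools import cycle
import urllib.request

# Placeholder proxy addresses (documentation IP range), not real servers.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# cycle() repeats the list forever, giving simple round-robin rotation.
_proxy_pool = cycle(PROXIES)


def next_proxy():
    """Return the next proxy address in round-robin order."""
    return next(_proxy_pool)


def fetch(url, timeout=10):
    """Fetch url through the next proxy in the pool and return the body bytes."""
    proxy = next_proxy()
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to `fetch` goes out through a different address, so no single IP accumulates all of the requests.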
There are many different types of web scraping proxy providers that you can trial. Just be sure to choose a reliable proxy provider such as Smartproxy. They’re also offering a discount on their residential proxy solutions with this discount code: UDRST15
2. Use Google Cloud Platform IPs
It can be beneficial to use Google Cloud Functions or AppEngine as the hosting platform for your web scrapers. This is because when combined with changing your user-agent to be GoogleBot, it can appear to website owners that you’re actually GoogleBot!
3. Set Additional Request Headers
Genuine web browsers send lots of different headers, any of which can be checked by websites to block your web scraper.
To make your web scraper appear more realistic, you can visit httpbin.org/anything in your browser and copy all of the headers it shows (these are the headers that your browser is currently sending).
For example, setting “Upgrade-Insecure-Requests”, “Accept”, “Accept-Encoding” and “Accept-Language” will make it look like your requests are coming from a real web browser.
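The following sketch shows one way to attach those four headers to a request with Python’s standard library. The header values are illustrative assumptions rather than values from any particular browser, and `BROWSER_HEADERS` and `build_request` are hypothetical names; copy the real values your own browser sends.

```python
import urllib.request

# Illustrative values only; replace them with the headers your browser sends.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}


def build_request(url):
    """Return a Request object that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)
```

Note that `urllib` does not decompress responses automatically, so if you advertise gzip in “Accept-Encoding” you will need to decompress the body yourself.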
4. Set A Referrer
The referrer header (spelled “Referer” in HTTP) is an HTTP request header that tells the website which page you arrived from. By setting it to https://www.google.co.uk, it looks like you’re arriving from the UK Google search engine.
You can also change this to match the Google domain of a different country.
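A small sketch of setting a country-specific referrer is shown below. The country codes and the `GOOGLE_REFERERS` mapping are examples chosen for illustration, and `request_with_referer` is a hypothetical helper name.

```python
import urllib.request

# Example mapping of country codes to Google domains; extend as needed.
GOOGLE_REFERERS = {
    "uk": "https://www.google.co.uk/",
    "de": "https://www.google.de/",
    "fr": "https://www.google.fr/",
}


def request_with_referer(url, country="uk"):
    """Build a request whose Referer header points at a country's Google domain."""
    # The HTTP header name is spelled "Referer", with a single "r" in the middle.
    headers = {"Referer": GOOGLE_REFERERS[country]}
    return urllib.request.Request(url, headers=headers)
```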
5. Learn To Web Scrape Slowly
When using web scraping services, it’s tempting to scrape the data as fast as possible. However, when a human browses a website, their speed is quite slow compared to crawlers.
Also, website owners can often detect your scrapers by analysing:
- How fast you scroll on pages.
- How often you click and navigate on the pages.
If you’re interacting with the pages too fast, the site is likely to block you.
Add In Random Sleep Delays And Actions
It’s best practice to fine tune your website crawlers and to:
- Add in random sleep delays between your HTTPS requests.
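A minimal sketch of random delays between requests might look like this. The default bounds of 2 and 8 seconds are arbitrary assumptions, and `fetch` stands for whatever function you already use to download a page.

```python
import random
import time


def random_delay(min_seconds=2.0, max_seconds=8.0):
    """Pick a random pause length, in seconds, between the two bounds."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, min_seconds=2.0, max_seconds=8.0):
    """Call fetch(url) for each URL, sleeping a random time between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random pause before the rest.
            time.sleep(random_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

Because every pause has a different length, the gaps between requests no longer form the fixed rhythm that a simple loop would produce.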
6. Pursue Different Scraping Patterns
A slow pace isn’t the only feature of human browsing activity. Humans skim websites in their own way, with varying view times and seemingly random clicks. Bots, however, tend to follow the same browsing pattern every time, and websites can easily identify scrapers when they find repetitive and similar browsing actions.
Therefore, you should vary your scraping patterns from time to time when extracting data, especially as some sites have more advanced anti-scraping mechanisms.
Consider combining clicks, mouse movements and other random events, shuffled into different orders, to make your scraper look like a human.
Some example activities for a LinkedIn bot might include:
- Scrolling the news feed.
- Taking a break to ‘go to the toilet’.
- Commenting on someone’s post.
- Liking someone’s post.
- Watching a video.
With the list above, you could create different combinations of activities such as:
- Scrolling Posts –> Break –> Liking Posts.
- Break –> Scrolling Posts –> Break.
To easily create the combinations, you can use the itertools module from Python’s standard library. This makes your web bots less rule-based and less deterministic.
```python
from itertools import combinations, permutations

# Get all of the permutations of [2, 4, 6]
perm_ = permutations([2, 4, 6])

# Print all of the permutations
for i in perm_:
    print(i)

# Get all combinations of [2, 4, 6] with a length of 2
comb_ = combinations([2, 4, 6], 2)

# Print all of the combinations
for i in comb_:
    print(i)
```
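Applied to the activities listed earlier, the same idea could be sketched as follows. The activity names are labels only; the functions that would actually perform each action are left out, and `ACTIVITIES` and `random_session` are hypothetical names.

```python
import random
from itertools import permutations

# Labels for the example activities; the real actions are not implemented here.
ACTIVITIES = ["scroll_feed", "take_break", "comment", "like_post", "watch_video"]


def random_session(length=3):
    """Return one randomly chosen ordering of `length` distinct activities."""
    all_orderings = list(permutations(ACTIVITIES, length))
    return random.choice(all_orderings)
```

Calling `random_session()` at the start of each run gives the bot a different sequence of actions instead of one fixed script.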
7. Web Scrape At Different Times Of Day
As well as randomising your actions, logging into the same website at different times of day can also reduce your footprint.
For example, instead of logging in at 8:00am every day:
- Log in at irregular times, such as 8:00, 8:05 or 8:30.
- Log in during the morning, afternoon and evening instead of just the morning.
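One way to sketch this scheduling is to add a random offset to a base hour. The session hours and the 45-minute jitter below are example values, and `jittered_start` and `daily_schedule` are hypothetical helper names.

```python
import random
from datetime import datetime, timedelta

# Example base hours for morning, afternoon and evening sessions.
SESSION_HOURS = [8, 13, 19]


def jittered_start(day, base_hour=8, max_jitter_minutes=45):
    """Return a start time on `day`, offset from base_hour by 0 to max_jitter_minutes."""
    base = datetime(day.year, day.month, day.day, base_hour)
    return base + timedelta(minutes=random.randint(0, max_jitter_minutes))


def daily_schedule(day, max_jitter_minutes=45):
    """Return one jittered start time for each session hour on `day`."""
    return [jittered_start(day, hour, max_jitter_minutes) for hour in SESSION_HOURS]
```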
8. Avoid Honeypot Traps
While scraping, you must avoid falling for honeypot traps, which are security mechanisms set up to identify web scrapers.
They are links placed in the HTML code that are invisible to human users.
Consequently, honeypot traps are only noticeable to web scrapers. When a web crawler follows such a link, the website can block all further requests from that client. Therefore, it is crucial to check for hidden links on a website when designing a scraper.
Ensure that the crawler only follows links that are properly visible, because some honeypot links are hidden by giving the link text the same colour as the page background.
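The sketch below filters out links that are hidden in the most obvious ways: an inline `display:none` or `visibility:hidden` style, or the HTML `hidden` attribute. These checks are assumptions about how a trap might be hidden. Links hidden through CSS classes, parent elements or matching text and background colours are not caught here and would need a rendering browser to detect.

```python
from html.parser import HTMLParser

# Inline style fragments that make an element invisible.
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # Normalise the inline style so "display: none" matches "display:none".
        style = (attributes.get("style") or "").replace(" ", "").lower()
        is_hidden = "hidden" in attributes or any(m in style for m in HIDDEN_MARKERS)
        if href and not is_hidden:
            self.links.append(href)


def visible_links(html):
    """Return the hrefs of links in html that pass the visibility checks."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```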
9. Use Real User Agents
A User-Agent request header contains a string that identifies the browser being used, its version, and the operating system. The web browser sends the User-Agent to the site with every request. Anti-scraping systems can detect bots if a substantial number of requests come from one user agent, and ultimately you will get blocked.
To prevent this, build a list of real user agents and change the user agent for each request; sites are reluctant to block strings used by genuine visitors. Using the user agent of a popular crawler such as Googlebot can also be helpful.
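A minimal sketch of rotating user agents per request is shown below. The strings in `USER_AGENTS` follow the usual browser format but are examples that may be out of date, so treat them as placeholders; `request_with_random_agent` is a hypothetical helper name.

```python
import random
import urllib.request

# Example user agent strings; refresh these with current browser versions.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def request_with_random_agent(url):
    """Build a request with a User-Agent picked at random from the list."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)
```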
10. Use Headless Browsers
Some websites are more difficult to scrape. They check everything from browser extensions and web fonts to browser cookies to determine whether a request is coming from a real user.
If you need to scrape such sites, you will need to use a headless browser. Tools such as Selenium and Puppeteer are rich in features, such as taking automatic screenshots and clicking on interactive buttons or content elements.
11. Detect Website Changes
Websites often have their own unique layouts and themes, which can cause your scrapers to break when the website owner decides to redesign the layout.
You will need to detect these changes with your web scraper and create an on-going monitoring solution to ensure that your web crawler is still functional. One method is to count the number of successful requests per web crawl.
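Counting successful requests could be sketched as below. Treating any 2xx status code as a success and alerting below a 90% success rate are both assumptions to tune for your own crawler, and the function names are hypothetical.

```python
def success_rate(status_codes):
    """Return the fraction of responses that have a 2xx status code."""
    if not status_codes:
        return 0.0
    successes = sum(1 for code in status_codes if 200 <= code < 300)
    return successes / len(status_codes)


def crawl_looks_healthy(status_codes, threshold=0.9):
    """Return True when the success rate is at or above the threshold."""
    return success_rate(status_codes) >= threshold
```

A sudden drop in the success rate between two crawls is a hint that the site has changed or has started blocking you.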
Alternatively you can create specific unit tests for different types of layouts:
If there is a reviews page or a product page, simply create a unit test for every type of page layout. You’ll then only need to send a few requests per day and check whether all of your unit tests pass to see whether the layout has changed.
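As a rough sketch of such a layout test, the example below checks that a product page still contains the CSS classes the scraper relies on. The sample HTML, the class names and the helper `layout_matches` are all hypothetical; a live check would download the page instead of using a saved string.

```python
import unittest
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record every CSS class name that appears in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def layout_matches(html, required_classes):
    """Return True if every required CSS class appears somewhere in html."""
    collector = ClassCollector()
    collector.feed(html)
    return set(required_classes) <= collector.classes


# Hypothetical saved sample of a product page.
PRODUCT_PAGE = (
    '<div class="product">'
    '<h1 class="product-title">Kettle</h1>'
    '<span class="price">20.00</span>'
    "</div>"
)


class ProductPageLayoutTest(unittest.TestCase):
    def test_expected_elements_present(self):
        self.assertTrue(layout_matches(PRODUCT_PAGE, {"product-title", "price"}))
```

You would write one such test case per page type and run them with `python -m unittest` on a schedule.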
12. Use a CAPTCHA Solving Service
Some websites will put CAPTCHA tests in place in order to detect bot traffic that is scraping their data. By using a CAPTCHA service, you can significantly reduce the chance that a website thinks you’re a web bot.
Several CAPTCHA solving services include:
However, it’s worth remembering that these types of services can be expensive and can add extra request time to your web scraping.
Therefore, you will need to consider whether the data collected outweighs the cost of an extra time delay.
13. Scrape Data From The Google Cache
If all else has failed, it is possible to scrape the data directly from Google’s cache.
This is a particularly useful workaround for obtaining information from web pages that change infrequently.
To access the cached copy of any web page, simply add the following to the front of the URL:
http://webcache.googleusercontent.com/search?q=cache:
For example: http://webcache.googleusercontent.com/search?q=cache:http://phoenixandpartners.co.uk/
However, this method does not always work, as large companies such as LinkedIn tell Google not to cache their content, making it inaccessible to scrapers.
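Building the cache address described above is a simple string operation, sketched here. The helper name `google_cache_url` is hypothetical, and whether the cache endpoint responds for a given page is an assumption you should verify before relying on it.

```python
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"


def google_cache_url(url):
    """Return the Google cache address for a page URL."""
    return CACHE_PREFIX + url
```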
Hopefully you’ve learned several new techniques on how to reduce the chance that your web scraping efforts will be blocked.
Generally, rotating your IPs and adding real HTTP request headers is more than enough for most use cases. However, sometimes you will have to utilise headless browsers or scrape from the Google cache to get the required data.