Disclaimer: Please note that all code and methodologies within this post are to be used at your own risk. I strongly recommend that you adhere to NinjaOutreach's rate limits.
What's NinjaOutreach?
NinjaOutreach is an Influencer and Blogger Marketing Outreach platform that provides:
- A rich prospecting platform and database for finding relevant influencers.
- In-built email software for contacting influencers and website owners.
- An email finder.
Why Web Scrape NinjaOutreach?
What's very interesting is how tools/platforms such as NinjaOutreach offer lifetime deals on special occasions. These lifetime deals are a perfect alternative to monthly pricing and offer a stable basis for building unique data acquisition pipelines! #LoveThatData
After purchasing a lifetime discount via their Black Friday deal, I wondered if it would be possible to automate some of the more tedious tasks on their platform with Python.
Web Scraping Sucked, But Their Network Request Data Is Very Rich
First, I attempted to scrape NinjaOutreach by logging in to the platform with a Selenium-driven web browser; however, it was impossible to scroll through the paginated prospecting results.
Therefore I turned to my next method: spoofing the HTTP network requests… and to my amusement, highly structured, clean JSON data was returned!
The next part of this blog post will show you how to easily access NinjaOutreach's prospecting section with Python.
How To Web Scrape Results From NinjaOutreach
Acquiring Authentication
As many software as a service (SaaS) tools require a logged-in session, the first thing we need to do is study the authentication techniques that the website uses in its HTTP request headers.
Authentication for logged-in sessions is generally found in:
- Cookies.
- HTTP headers.
Log in to your NinjaOutreach account, then proceed with the next steps.
1. Refresh Prospect Results With Google Chrome Developer Tools
- Open the Google Chrome developer tools, go to the Network tab, and tick Preserve log.
- Then visit the prospecting section, enter "Fashion bloggers" and click search.
- Click XHR in the developer tools to easily see the relevant AJAX requests made to NinjaOutreach's server.
2. Replicating The Authentication Request To Mirror A Logged-In Session
After seeing the new GET request, we can take a direct copy of this authenticated request and do the following:
- Right click on the request, then click Copy > Copy as cURL.
- Visit this website with your cURL command (it will convert the cURL command into a Python request).
Inspecting The Request
As we can see from inspecting the HTTP request, there are authentication tokens inside the cookies and headers of the GET request.
cookies = {
    'ASP.NET_SessionId': 'iihvtccjpcww4hsvr11p1pc0',
    '__RequestVerificationToken': 'gAjlSs8t61uT55pABYTv-vyXob7BpKhcI3lwH3wN7mOtqrDxgWkupJlWHxQBqFJTdT_PQdy5Lp-Ft68G07EJRNOYcVc1',
    '__cfduid': 'd5e2207aee0e800da26386bfa8494fe001582059533',
    'ajs_user_id': '%22jamesaphoenix%40googlemail.com%22',
    'ajs_group_id': 'null',
    'ajs_anonymous_id': '%221b0e0d03-2430-4e96-a3ca-4281a857abdd%64',
    '.AspNet.ApplicationCookie': 'g3WeZDFRjJd4TwTH8V89jxopHfgXujwByLCtBdGJRbYrI4EJqlUKOoSrS1NylcomAVWiM6-u5ZPLI4FEfSL_dtrXvBCAMyZc11fIB1bwOKB-NqW6kfckMWY76fKhLSoXV3N07nznkXHWCYqgphObViptnGv6FKU6b7mNi9Byi0foeVNl7FJyTG-VrsJnSFasMentXkXS08TSSSynTSzM07mockREpO8iHVlIJW8t-HCdvUxOE_cjZWOLU23rTMyPRvyzdqkTcByXG6O7eCXWcrf3Dmff-lmFamD8am8mEIUVz-RafpD01IZZGLmU5Fdt4aXmqgN_DB-mFFQN97z7UlHjvZxEeNikWrgR2Qx4ZndhrhfHaLry-5zOcoKEJLogrPEdYFoex4DPIvZzzIyhkX-HnmqbY9bdcGcnGV7RWwRbdiFQvx9YSu7vSyvIchfCJyHwzObWPB-8LWc22yRGubGgwFMQ_RaoWQ4y7A7X2cxbZoI9voSVzWFGUfHoJNVlUSm74g',
    '__ca__chat': 'S5HLLK7dBPaS',
}
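The request headers follow the same pattern as the cookies above. A minimal sketch of what the converted request's headers might look like is below; the values (and possibly some names) are placeholders, so copy the real ones from your own converted cURL command.

```python
# Placeholder values only -- copy the real headers from your own
# converted cURL request; the exact set of headers may differ.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://app.ninjaoutreach.com/Prospecting',
}
```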
3. Creating Additional Functionality With The Obtained Authentication
Now that we can pass the relevant authentication to NinjaOutreach, a closer look at the query parameters shows that we can pass in a different page and a different last ID.
params = (
    ('keywords', 'Fashion Bloggers'),
    ('page', '1'),
    ('isAndSearch', 'true'),
    ('type', 'instagram'),
    ('minFollowers', '0'),
    ('maxFollowers', ''),
    ('minEngagementRate', '0'),
    ('maxEngagementRate', '100'),
    ('country', ''),
    ('state', ''),
    ('city', ''),
    ('categoriesFilterMode', '1'),
    ('verified', 'All'),
    ('businessAccount', 'All'),
    ('sort', '1'),
    ('descending', 'true'),
    ('isDirectSearch', 'true'),
    ('lastId', ''),
    ('lastRank', ''),
    ('_', '1582059572320'),
)
The 'page' parameter will allow us to paginate through all of the pages with the previously obtained cookies and headers.
4. Achieving Pagination
Below you'll find a simple code example that paginates through a series of pages for one keyword, with a random time delay to kindly respect their API rate limits.
import random
import time

import requests

params = dict(params)  # convert the tuple of tuples into a mutable dict
results = []
for i in range(0, 10):
    params['page'] = i
    # Request the 50 results per page
    response = requests.get(
        'https://app.ninjaoutreach.com/Prospecting/GetList',
        headers=headers,
        params=params,
        cookies=cookies,
    )
    # Save the raw response for later processing
    results.append(response)
    # Random delay to respect their rate limits
    time.sleep(random.uniform(60, 120))
By decoding each response body as UTF-8 and loading it with json.loads(), we finally get access to the structured data!
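As a minimal sketch of that decoding step (the "Data" key below is an assumption; inspect your own responses to find the actual key that holds the prospect list):

```python
import json


def parse_prospects(raw_body):
    """Decode a raw response body (bytes) and pull out the prospect records.

    NOTE: the 'Data' key is an assumed structure -- check your own
    NinjaOutreach responses for the real key name.
    """
    payload = json.loads(raw_body.decode('utf-8'))
    return payload.get('Data', [])


# Example with a stand-in payload:
sample = b'{"Data": [{"Name": "Example Blogger", "Email": "hello@example.com"}]}'
records = parse_prospects(sample)
print(records[0]['Name'])  # Example Blogger
```

In practice you would call `parse_prospects(response.content)` for each response collected in the pagination loop.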
Bulk Cleaning And Removing Spam Emails
Now that we've gathered some email contact data, we need to clean it because we want to:
- Reduce our bounce rate.
- Improve our email deliverability.
@contact, @info and @support email addresses are generally poor email addresses, so let's create a simple Python list to remove all of these from our pandas DataFrame.
import pandas as pd

# Previously obtained email data
df = pd.read_csv('data.csv', delimiter='\t')

NEGATIVE_LOCAL_PARTS = [
    'team', 'contact', 'info', 'support', 'enquiries', 'enquiry', 'sales',
    'partner', 'editor', 'letschat', 'hello', 'quote', 'admin', 'training',
    'jobs', 'solutions', 'mail', 'marketing', 'careers', 'accounts',
    'feedback', 'recruitment', 'welcome',
]


def clean_email(x):
    # Flag role-based addresses (e.g. info@, support@) by their local part
    local_part = x.split('@')[0]
    if local_part in NEGATIVE_LOCAL_PARTS:
        return 'Bad Email'
    return x
We can now remove any poor-performing email addresses with the following pandas apply method.
df['Email'] = df['Email'].apply(clean_email)
# This returns only the emails which aren't bad (i.e. they're not from the list above).
test = df[df['Email'] != 'Bad Email']
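As an alternative to the row-wise apply above, the same filter can be done with vectorized pandas string methods. A sketch (the inline DataFrame and shortened prefix list are just for illustration):

```python
import pandas as pd

# Shortened illustration of the role-address list from above
bad_local_parts = {'contact', 'info', 'support', 'admin', 'sales'}

df = pd.DataFrame({'Email': ['contact@shop.com', 'jane@blog.com', 'info@site.org']})

# Keep rows whose local part (before the @) is not a generic role address
local_part = df['Email'].str.split('@').str[0]
filtered = df[~local_part.isin(bad_local_parts)]
print(filtered['Email'].tolist())  # ['jane@blog.com']
```

On large DataFrames the vectorized version avoids the per-row Python function call overhead of `apply`.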
What are the current rates / limits for NinjaOutreach?
After testing, the current rates for NinjaOutreach are approximately 50–60 page requests an hour, which is roughly 2,500 contacts or websites per hour. Therefore, as long as we stay within the range of 20–30 pages per hour (roughly 0.3–0.5 pages per minute), we'll hopefully be able to run the scraper for an extended period of time.
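Based on that 20–30 pages per hour budget, a small helper can pick a compliant sleep interval between requests (the function name and bounds are my own, derived from the numbers above):

```python
import random


def polite_delay(min_pages_per_hour=20, max_pages_per_hour=30):
    """Return a random sleep interval (in seconds) that keeps the scraper
    within a 20-30 pages per hour budget.

    30 pages/hour -> 120 s between requests; 20 pages/hour -> 180 s.
    """
    return random.uniform(3600 / max_pages_per_hour, 3600 / min_pages_per_hour)


delay = polite_delay()
print(f'Sleeping for {delay:.0f} seconds before the next page')
```

Call `time.sleep(polite_delay())` at the end of each loop iteration instead of a hard-coded delay.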
What's Next?
- Integrate our Instagram URLs from influencers into an automated Instagram outreach tactic with InstaPy.
- Create an automated data pipeline that runs 8–10 hours a day to build up email lists via Apache Airflow and BigQuery.
- Integrate the emails with a monthly list cleaning service to enhance deliverability.
Conclusion
As digital marketers, it's important for us to push the boundaries of what data we can collect, how we can sort it, and what we decide to do with it.
Thanks for reading, and now it's time for you to search for that competitive edge and get creative with your data acquisition pipelines!