Disclaimer: Please note that all code and methodologies within this post are to be used at your own risk. I strongly recommend that you adhere to NinjaOutreach's rate limits.
What's NinjaOutreach?
NinjaOutreach is an Influencer and Blogger Marketing Outreach platform that provides:
- A rich prospecting platform and database for finding relevant influencers.
- In-built email software for contacting influencers and website owners.
- An email finder.
Why Web Scrape NinjaOutreach?
What's very interesting is how tools/platforms such as NinjaOutreach offer lifetime deals on special occasions. These lifetime deals are a perfect alternative to monthly pricing and offer a stable basis for building unique data acquisition pipelines! #LoveThatData
After purchasing a lifetime discount via their Black Friday deal, I wondered if it would be possible to automate some of the more tedious tasks on their platform with Python.
Web Scraping Sucked, But Their Network Request Data Is Very Rich
First, I attempted to scrape NinjaOutreach by logging in to the platform with a Selenium-driven web browser; however, it was impossible to scroll through the paginated prospecting results.
Therefore I turned to my next method: spoofing the HTTP network requests… and to my amusement, highly structured, clean JSON data was returned!
The next part of this blog post will show you how to easily access NinjaOutreach's prospecting section with Python.
How To Web Scrape Results From NinjaOutreach
Acquiring Authentication
As many software as a service (SaaS) tools require a logged-in session, the first thing we need to do is study the authentication techniques that the website uses in its HTTP request headers.
Authentication for logged-in sessions is generally found in:
- Cookies.
- HTTP headers.
Log in to your NinjaOutreach account, then proceed with the next steps.
1. Refresh Prospect Results With Google Chrome Developer Tools
- Open the Google Chrome developer tools, go to the Network tab, and tick Preserve log.
- Then visit the prospecting section, enter "Fashion bloggers" and click search.
- Click XHR in the developer tools to easily see the relevant AJAX requests made to NinjaOutreach's server.
2. Replicating The Authentication Request To Mirror A Logged-In Session
After seeing the new GET request, we can take a direct copy of this authenticated request and do the following:
- Right click on the request, then click Copy > Copy as cURL.
- Visit this website with your cURL command (it will convert the cURL command into a Python request).
Inspecting The Request
As we can see from inspecting the HTTP request, there are authentication tokens inside the cookies and headers of the GET request.
cookies = {
    'ASP.NET_SessionId': 'iihvtccjpcww4hsvr11p1pc0',
    '__RequestVerificationToken': 'gAjlSs8t61uT55pABYTv-vyXob7BpKhcI3lwH3wN7mOtqrDxgWkupJlWHxQBqFJTdT_PQdy5Lp-Ft68G07EJRNOYcVc1',
    '__cfduid': 'd5e2207aee0e800da26386bfa8494fe001582059533',
    'ajs_user_id': '%22jamesaphoenix%40googlemail.com%22',
    'ajs_group_id': 'null',
    'ajs_anonymous_id': '%221b0e0d03-2430-4e96-a3ca-4281a857abdd%64',
    '.AspNet.ApplicationCookie': 'g3WeZDFRjJd4TwTH8V89jxopHfgXujwByLCtBdGJRbYrI4EJqlUKOoSrS1NylcomAVWiM6-u5ZPLI4FEfSL_dtrXvBCAMyZc11fIB1bwOKB-NqW6kfckMWY76fKhLSoXV3N07nznkXHWCYqgphObViptnGv6FKU6b7mNi9Byi0foeVNl7FJyTG-VrsJnSFasMentXkXS08TSSSynTSzM07mockREpO8iHVlIJW8t-HCdvUxOE_cjZWOLU23rTMyPRvyzdqkTcByXG6O7eCXWcrf3Dmff-lmFamD8am8mEIUVz-RafpD01IZZGLmU5Fdt4aXmqgN_DB-mFFQN97z7UlHjvZxEeNikWrgR2Qx4ZndhrhfHaLry-5zOcoKEJLogrPEdYFoex4DPIvZzzIyhkX-HnmqbY9bdcGcnGV7RWwRbdiFQvx9YSu7vSyvIchfCJyHwzObWPB-8LWc22yRGubGgwFMQ_RaoWQ4y7A7X2cxbZoI9voSVzWFGUfHoJNVlUSm74g',
    '__ca__chat': 'S5HLLK7dBPaS',
}
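The request headers follow the same pattern as the cookies above. A minimal sketch of what the converted request's headers might look like is below; the values (and possibly some names) are placeholders, so copy the real ones from your own converted cURL command.

```python
# Placeholder values only -- copy the real headers from your own
# converted cURL request; the exact set of headers may differ.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://app.ninjaoutreach.com/Prospecting',
}
```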
3. Creating Additional Functionality With The Obtained Authentication
Now that we can pass the relevant authentication to NinjaOutreach, a closer look at the query parameters shows that we can pass in a different page and a different last ID.
params = (
    ('keywords', 'Fashion Bloggers'),
    ('page', '1'),
    ('isAndSearch', 'true'),
    ('type', 'instagram'),
    ('minFollowers', '0'),
    ('maxFollowers', ''),
    ('minEngagementRate', '0'),
    ('maxEngagementRate', '100'),
    ('country', ''),
    ('state', ''),
    ('city', ''),
    ('categoriesFilterMode', '1'),
    ('verified', 'All'),
    ('businessAccount', 'All'),
    ('sort', '1'),
    ('descending', 'true'),
    ('isDirectSearch', 'true'),
    ('lastId', ''),
    ('lastRank', ''),
    ('_', '1582059572320'),
)
The 'page' parameter will allow us to paginate through all of the pages with the previously obtained cookies and headers.
4. Achieving Pagination
Below you'll find a simple code example that paginates through a series of pages for one keyword, with a random time delay to kindly respect their API rate limits.
import random
import time

import requests

params = dict(params)  # convert the tuple of tuples into a mutable dict
results = []
for i in range(0, 10):
    params['page'] = i
    # Request the 50 results per page
    response = requests.get(
        'https://app.ninjaoutreach.com/Prospecting/GetList',
        headers=headers,
        params=params,
        cookies=cookies,
    )
    # Save the raw response for later processing
    results.append(response)
    # Random delay to respect their rate limits
    time.sleep(random.uniform(60, 120))
By decoding each response body as UTF-8 and loading it with json.loads(), we finally get access to the structured data!
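As a minimal sketch of that decoding step (the "Data" key below is an assumption; inspect your own responses to find the actual key that holds the prospect list):

```python
import json


def parse_prospects(raw_body):
    """Decode a raw response body (bytes) and pull out the prospect records.

    NOTE: the 'Data' key is an assumed structure -- check your own
    NinjaOutreach responses for the real key name.
    """
    payload = json.loads(raw_body.decode('utf-8'))
    return payload.get('Data', [])


# Example with a stand-in payload:
sample = b'{"Data": [{"Name": "Example Blogger", "Email": "hello@example.com"}]}'
records = parse_prospects(sample)
print(records[0]['Name'])  # Example Blogger
```

In practice you would call `parse_prospects(response.content)` for each response collected in the pagination loop.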
Bulk Cleaning And Removing Spam Emails
Now that we've gathered some email contact data, we need to clean it because we want to:
- Reduce our bounce rate.
- Improve our email deliverability.
@contact, @info and @support email addresses are generally poor email addresses, so let's create a simple Python list to remove all of these from our pandas DataFrame.
import pandas as pd

# Previously obtained email data
df = pd.read_csv('data.csv', delimiter='\t')

NEGATIVE_LOCAL_PARTS = [
    'team', 'contact', 'info', 'support', 'enquiries', 'enquiry', 'sales',
    'partner', 'editor', 'letschat', 'hello', 'quote', 'admin', 'training',
    'jobs', 'solutions', 'mail', 'marketing', 'careers', 'accounts',
    'feedback', 'recruitment', 'welcome',
]


def clean_email(x):
    # Flag role-based addresses (e.g. info@, support@) by their local part
    local_part = x.split('@')[0]
    if local_part in NEGATIVE_LOCAL_PARTS:
        return 'Bad Email'
    return x
We can now remove any poor-performing email addresses with the following pandas apply method.
df['Email'] = df['Email'].apply(clean_email)
# This returns only the emails which aren't bad (i.e. they're not from the list above).
test = df[df['Email'] != 'Bad Email']
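As an alternative to the row-wise apply above, the same filter can be done with vectorized pandas string methods. A sketch (the inline DataFrame and shortened prefix list are just for illustration):

```python
import pandas as pd

# Shortened illustration of the role-address list from above
bad_local_parts = {'contact', 'info', 'support', 'admin', 'sales'}

df = pd.DataFrame({'Email': ['contact@shop.com', 'jane@blog.com', 'info@site.org']})

# Keep rows whose local part (before the @) is not a generic role address
local_part = df['Email'].str.split('@').str[0]
filtered = df[~local_part.isin(bad_local_parts)]
print(filtered['Email'].tolist())  # ['jane@blog.com']
```

On large DataFrames the vectorized version avoids the per-row Python function call overhead of `apply`.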
What are the current rates / limits for NinjaOutreach?
After testing, the current rates for NinjaOutreach are approximately 50–60 page requests an hour, which is roughly 2,500 contacts or websites per hour. Therefore, as long as we stay within the range of 20–30 pages per hour (roughly 0.3–0.5 pages per minute), we'll hopefully be able to run the scraper for an extended period of time.
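Based on that 20–30 pages per hour budget, a small helper can pick a compliant sleep interval between requests (the function name and bounds are my own, derived from the numbers above):

```python
import random


def polite_delay(min_pages_per_hour=20, max_pages_per_hour=30):
    """Return a random sleep interval (in seconds) that keeps the scraper
    within a 20-30 pages per hour budget.

    30 pages/hour -> 120 s between requests; 20 pages/hour -> 180 s.
    """
    return random.uniform(3600 / max_pages_per_hour, 3600 / min_pages_per_hour)


delay = polite_delay()
print(f'Sleeping for {delay:.0f} seconds before the next page')
```

Call `time.sleep(polite_delay())` at the end of each loop iteration instead of a hard-coded delay.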
What's Next?
- Integrate our Instagram URLs from influencers into an automated Instagram outreach tactic with InstaPy.
- Create an automated data pipeline that runs 8–10 hours a day to build up email lists via Apache Airflow and BigQuery.
- Integrate the emails with a monthly list cleaning service to enhance deliverability.
Conclusion
As digital marketers, it's important for us to push the boundaries of what data we can collect, how we can sort it, and what we decide to do with it.
Thanks for reading, and now it's time for you to search for that competitive edge and get creative with your data acquisition pipelines!