This guide aims to provide you with a detailed explanation of how to find good local SEO clients using data, Python, APIs and automation.
So the first stage of our scraping operation involves collecting ~2180 Google My Business categories and transforming them into a list. Thankfully, we can get a pre-made list of the GMB categories from this article.
We will use the Google My Business categories as keywords, then take the UK national search volume for these keywords as a proxy for demand on a local level.
Although this isn't completely accurate, it allows us to focus our search on profitable niches. Here's the process:
- Extract all of the GMB categories and enter the keywords into an Ahrefs keyword search.
- Sort the list by monthly search volume and select the top 300 keywords for closer inspection – here’s the UK data.
Data Cleaning For Profitable Niche + Location Keywords
After extracting the lists, I manually reviewed every keyword and removed any non-profitable GMB categories or organisations that would not take kindly to a pitch. These included GMB listings such as parks, churches or local monuments.
- Businesses with low relevance or a low CPC were removed.
- Download the keyword data from Ahrefs/SEMrush etc.
Now we’re simply left with a list of niche keywords that we can use to find prospective local SEO clients with Python.
import pandas as pd

df = pd.read_csv('keywords.csv')
df.head(15)
Let's save this list to be used later, when we will extract the SERP pages from geo-specific searches for every keyword via a cheap SERP API.
query_list = list(df['Keyword'])
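A minimal sketch of how the list could be saved with pickle (the 'query_list.pkl' filename is just an illustrative choice, not part of the original workflow):

import pickle

# Persist the keyword list so the SERP-querying steps further down can simply reload it.
with open('query_list.pkl', 'wb') as f:
    pickle.dump(query_list, f)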
Choosing A SERP API for Our Keywords
Now we need to query the SERP API with every niche keyword. Luckily, I will be using DataForSEO.com, where the cost of querying 1000 x SERP pages is approximately $0.75 – $1.00.
Creating A REST Client For DataForSEO.com
The code below simply allows us to create a RestClient for accessing DataForSEO's API. It's actually a very simple process and they provide extensive documentation on their website.
from base64 import b64encode
from http.client import HTTPSConnection
from json import dumps, loads


class RestClient:
    domain = "api.dataforseo.com"

    def __init__(self, username, password):
        self.username = username
        self.password = password

    def request(self, path, method, data=None):
        # Open an HTTPS connection and authenticate with HTTP basic auth.
        connection = HTTPSConnection(self.domain)
        try:
            base64_bytes = b64encode(
                ("%s:%s" % (self.username, self.password)).encode("ascii")
            ).decode("ascii")
            headers = {'Authorization': 'Basic %s' % base64_bytes}
            connection.request(method, path, headers=headers, body=data)
            response = connection.getresponse()
            return loads(response.read().decode())
        finally:
            connection.close()

    def get(self, path):
        return self.request(path, 'GET')

    def post(self, path, data):
        # Serialise dict payloads to JSON before sending.
        if isinstance(data, str):
            data_str = data
        else:
            data_str = dumps(data)
        return self.request(path, 'POST', data_str)
username = input()
password = input()
client = RestClient(username, password)
Finding The Right Geo-Code For Our Local Business Prospecting
As we'll be prospecting for local SEO clients, we will need to apply a geo-specific Google API search.
DataforSEO has native support for this and will allow us to pass the right parameters to make our Google search specific to a location i.e. Reading.
response = client.get("/v2/cmn_locations")
if response["status"] == "error":
    print("error. Code: %d Message: %s" % (response["error"]["code"], response["error"]["message"]))
else:
    locations = response["results"]
    pd.DataFrame(locations).to_csv('locations.csv')

location_data = pd.read_csv("locations.csv")
location_data.head(5)
reading_location = location_data[(location_data['loc_name_canonical'].str.contains("United Kingdom"))
& location_data['loc_name'].str.contains("RG4") ]
reading_location
1 x Query To Test The API
Before making a large number of requests and spending lots of credits, I decided to test the API with just one keyword “party wigs”.
post_data = dict()
post_data[30000000] = dict(
    priority=1,
    se_name="google.co.uk",
    se_language="English",
    loc_name_canonical="London,England,United Kingdom",
    key="party wigs"
)

response = client.post("/v2/srp_tasks_post", dict(data=post_data))
if response["status"] == "error":
    print("error. Code: %d Message: %s" % (response["error"]["code"], response["error"]["message"]))
else:
    print(response["results"])

response = client.get("/v2/srp_tasks_get")
if response["status"] == "error":
    print("error. Code: %d Message: %s" % (response["error"]["code"], response["error"]["message"]))
else:
    print(response["results"])
Querying The SERP API For 201 Keywords
Firstly, we will create 201 jobs for DataForSEO to perform; then we will query these completed jobs, which will return 201 SERP results with the top 100 domains for every keyword.
from collections import defaultdict

data = defaultdict(dict)

priority = 1
se_name = "google.co.uk"
se_language = "English"
loc_name_canonical = "RG4,England,United Kingdom"

# enumerate(..., start=1) gives every keyword a sequential task id without dropping the final keyword.
for i, keys in enumerate(query_list, start=1):
    # Creating the nested dictionary
    data[i] = {}
    data[i]['priority'] = priority
    data[i]['se_name'] = se_name
    data[i]['se_language'] = se_language
    data[i]['loc_name_canonical'] = loc_name_canonical
    # Dynamically adding the queries
    data[i]['key'] = keys
Submitting The Jobs To DataForSEO
response = client.post("/v2/srp_tasks_post", dict(data=data))
if response["status"] == "error":
    print("error. Code: %d Message: %s" % (response["error"]["code"], response["error"]["message"]))
else:
    print(response["results"])
Obtaining Our SERP Data
Now that all of our tasks have been sent to DataForSEO's API servers, we can simply call back all of the tasks and extract the results.
final_results = []

completed_tasks_response = client.get("/v2/srp_tasks_get")
if completed_tasks_response["status"] == "error":
    print("error. Code: %d Message: %s" % (completed_tasks_response["error"]["code"], completed_tasks_response["error"]["message"]))
else:
    results = completed_tasks_response["results"]
    print(results)
    for result in results:
        # Pull the full SERP for every completed task by its task_id.
        srp_response = client.get("/v2/srp_tasks_get/%d" % (result["task_id"]))
        if srp_response["status"] == "error":
            print("error. Code: %d Message: %s" % (srp_response["error"]["code"], srp_response["error"]["message"]))
        else:
            final_results.append(srp_response["results"])
Creating A New Data Structure + Data Munging The Arrays
This code just tidies up the underlying data structure so that it can be put into a pandas dataframe.
master_dict = {
    'post_key': [],
    'result_position': [],
    'result_url': [],
    'result_title': []
}

key_list = ['post_key', 'result_position', 'result_url', 'result_title']

# The order is returned as: post_key, position, URL, title, which matches the order of our master_dict.
# raw_data refers to the SERP results collected above (i.e. final_results from the previous step).
i = 0
while i < len(raw_data):
    # Query every keyword SERP page.
    for data in raw_data[i]['organic']:
        for keys, values in data.items():
            if keys in key_list:
                master_dict[str(keys)].append(values)
    # Iterate i
    i += 1
    print("{} x 100 Result SERP Completed".format(i / len(raw_data)))
The SERP Data After Data Extraction
After extracting all 201 keyword SERPs within the location of Reading, we can inspect the data.
We will also extract the root domain element for every URL. We will remove the protocol and any extensions such as .com or .co.uk to create brandable queries for every root domain, e.g. https://jamesphoenix.com becomes jamesphoenix.
Then we will use these brandable queries to search for knowledge panels via a different Python script.
df = pd.DataFrame(master_dict)
df.head(15)
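As a quick illustration of the root domain to brandable query transformation described above, here's a self-contained sketch (the helper name to_brand_query is illustrative rather than part of the original script):

from urllib.parse import urlparse

def to_brand_query(url):
    # Strip the protocol, any 'www.' prefix and the TLD, leaving only the brand name.
    netloc = urlparse(url).netloc
    if netloc.startswith("www."):
        netloc = netloc[4:]
    return netloc.split(".")[0]

print(to_brand_query("https://jamesphoenix.com"))  # jamesphoenix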
Data Cleaning
Now that we've extracted our 201 geo-SERP queries into a dataframe, we need to:
- Clean the data.
- Obtain the root domains.
So let’s get to it!
from urllib.parse import urlparse

def get_base_domain(x):
    # Drop the first label of the hostname (typically 'www.') and rejoin the rest as the root domain.
    parsed = urlparse(x)
    domain = parsed.netloc.split(".")[1:]
    host = ".".join(domain)
    return host
1. Extract The Root Domain
df['Root_Domain'] = df['result_url'].apply(lambda x: get_base_domain(x))
2. De-duplicate against Root Domain
df.drop_duplicates(subset='Root_Domain', inplace = True)
3. Removing results at position 36 or higher, simply to reduce the size of our dataset.
df = df[df['result_position'] < 36]
4. Removing Any Large Directory Websites
At this point we need to collect more competitor metrics.
These metrics will solely serve to act as a ‘proxy/approximation’ to the website’s authority, allowing us to easily remove bigger directory websites such as yell.com and checkatrade.com. Because let’s face it, they probably don’t need GMB optimisation and likely already have multiple GMB listings.
- I decided to use 16 x batch analysis tasks within Ahrefs and combined all of the CSVs using Python.
Combining 16 CSV Files Together With Python
The code below creates a new dataframe from every CSV and concatenates them together, combining all of the 16 x separate Ahrefs batch analysis files.
import os
import glob

os.chdir('batch_analysis_csvs')
csv_list = glob.glob('*.csv')

# Ahrefs batch analysis exports are UTF-16 encoded and tab separated.
all_batch_analysis_csvs = pd.concat([pd.read_csv(x, encoding='utf-16', delimiter='\t') for x in csv_list])
# Export the deduplicated root domains to a CSV (this is the list used for the Ahrefs batch analysis).
df['Root_Domain'].to_csv('domain_list.csv')
Merging The Ahrefs Batch Analysis + Google API SERP Data
Then we can merge the dataframes using the root domain as an identifying key. #Yay!
all_batch_analysis_csvs.rename(columns = {'Target': 'Root_Domain'}, inplace = True)
merged_df = pd.merge(df, all_batch_analysis_csvs, on='Root_Domain')
merged_df.head(6)
Let's change all of the Domain Rating NaNs (not a number) to 0:
merged_df['Domain Rating'].fillna(value = 0, inplace = True)
EDA (Exploratory Data Analysis)
1. Domain Rating (DR) Distribution
The domain rating distribution looks similar to that of a normal distribution.
2. Total Backlinks Distribution
- As expected, the total backlinks distribution is highly positively skewed and looks very similar to an exponential distribution.
- From this we can see that a large number of websites accumulate only a relatively small number of backlinks.
3. Dofollow Referring Domains Distribution
The referring domains distribution follows a similar pattern to the backlinks distribution graph.
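If you want to recreate these plots, a minimal matplotlib sketch could look like the following; 'Total Backlinks' and 'Ref domains Dofollow' are assumed Ahrefs column names, so adjust them to match your own export:

import matplotlib.pyplot as plt

# Plot a histogram for each authority metric (the non-DR column names are assumptions).
for column in ['Domain Rating', 'Total Backlinks', 'Ref domains Dofollow']:
    if column in merged_df.columns:
        merged_df[column].plot(kind='hist', bins=50, title='{} Distribution'.format(column))
        plt.xlabel(column)
        plt.show()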
merged_df.describe()
From investigating the .describe() method, we can see that the lowest 25% of our dataset has a DR rating of 27 or lower.
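You can also pull that cut-off directly rather than eyeballing the describe() table:

# 25th percentile of Domain Rating – should come out at roughly 27 for this dataset.
print(merged_df['Domain Rating'].quantile(0.25))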
As I will be doing a brand query Google search with a Python Selenium browser, I'm happy to start with ~750 root domains for prospecting. We will also remove any .gov extensions.
import numpy as np

business_data = merged_df[merged_df['Domain Rating'] < 27].copy()
business_data.drop_duplicates(subset='Root_Domain', inplace=True)

# Flag any .gov domains as NaN, then drop them so the brand name split below doesn't fail.
business_data['Root_Domain'] = business_data['Root_Domain'].apply(lambda x: np.nan if ".gov." in x else x)
business_data.dropna(subset=['Root_Domain'], inplace=True)
Now let’s extract the brand name from the Root Domain to obtain a brand-able query.
import pickle

query_list = list(business_data['Root_Domain'].apply(lambda x: x.split('.')[0]))
query_and_root_domain_data = list(zip(list(business_data['Root_Domain']), query_list))

pickle.dump(query_and_root_domain_data, open('query_data_for_prospector.pkl', 'wb'))
We have finally reached the query level. Now we can simply place every query into the GMB prospector that I made previously; you can find it here:
The End Result After Running A GMB Prospector On 715 Brand Queries
- I also decided to modify the existing script and added the GMB review score, review count and phone number to help with the prospecting process.
- The keyword in the dataframe below is the query, so technically we will perform 715 unique Google searches via the GMB Python prospector script.
Each time the script runs it will search for a knowledge panel on the right side of the HTML page.
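For reference, a stripped-down sketch of what that knowledge panel check might look like with Selenium is shown below; the '#rhs' selector (Google's right-hand results column) and the helper function are assumptions rather than the original prospector code, and Google's markup changes often:

from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

def has_knowledge_panel(query):
    # Run a brand query and check whether a right-hand panel ('#rhs') exists on the results page.
    driver.get("https://www.google.co.uk/search?q=" + quote_plus(query))
    return len(driver.find_elements(By.CSS_SELECTOR, "#rhs")) > 0

print(has_knowledge_panel("jamesphoenix"))
driver.quit()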
Here are the results! Happy Prospecting 🙂