Google Search Console is a free tool that shows how visible your websites are in organic search.

For every website, you receive a range of metrics for either pages or queries including:

  • Clicks.
  • Impressions.
  • Average Click Through Rate (%).
  • Average Position (what rank your web page is within organic search).


Optimising the click through rate (CTR %) of your website pages is an ongoing challenge, and it is by its very nature a machine learning problem.

If you ask any search engine marketer, they will tell you to optimise your <title> tags and meta-descriptions so that the page entices more users to click on your result instead of other results in the search engine results page (SERP).


The Current Process

Firstly let’s review the current, manual process for selecting pages to be optimised via Google Search Console data.

  1. Get Data: Download your Google Search Console data as a .CSV and sort it by impressions.
  2. Find Opportunities: Look at any pages that have a low CTR and decide that the <title> tag or meta-description is inadequate.
  3. Execute: Re-write either the <title> tag, meta-description or both.
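
The manual steps above can be sketched in pandas. This is only an illustration: the tiny `gsc` table is made-up data standing in for a real export, and the 2% cut-off (`ctr_threshold`) is an arbitrary assumption rather than a recommended value.

```python
import pandas as pd

# Hypothetical stand-in for a Search Console export (real exports have many more rows).
gsc = pd.DataFrame({
    "Page": ["/a", "/b", "/c", "/d"],
    "Clicks": [50, 5, 120, 2],
    "Impressions": [1000, 900, 800, 100],
    "CTR": [5.0, 0.56, 15.0, 2.0],
    "Position": [3.2, 9.8, 1.4, 7.5],
})

# Step 1: sort by impressions so the most visible pages come first.
by_impressions = gsc.sort_values("Impressions", ascending=False)

# Step 2: flag pages whose CTR falls below an arbitrary cut-off (assumed 2% here).
ctr_threshold = 2.0
candidates = by_impressions[by_impressions["CTR"] < ctr_threshold]

print(candidates[["Page", "Impressions", "CTR", "Position"]])
```

In this toy table the only flagged page sits at an average position of 9.8, so its low CTR may say more about its rank than about its title tag.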

So what’s the problem with this approach? It doesn’t account for the fact that CTR (%) changes when organic search position changes.


In other words, as the position of your web page gets higher on Google, its click through rate will increase, regardless of whether you had written a poor <title> tag or meta-description.


So filtering on only impression data and selecting pages with low CTR % is simply not a good way to select pages.


Optimise Your Title Tags & Meta Descriptions with ML

Identifying the type of machine learning problem

As we will be attempting to predict a continuous variable (click through rate – CTR %) for every page, it’s possible for us to frame this as a regression problem. Let’s import all of the relevant Python packages.

# Import needed libraries
import pandas as pd
import math, datetime
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
import pickle
import re
from sklearn.cluster import KMeans
style.use('ggplot')
%matplotlib inline

For this example we will take a 3 month sample of Google Search Console page URL data. Download it as a CSV and then read it into a pandas DataFrame.

df = pd.read_csv('data.csv')
df

Cleaning The Data

The only cleaning this data needs is to remove the % character and convert every value in CTR to a float.

# Cleaning The Data
df['CTR'] = df['CTR'].apply(lambda x: float(x.replace('%', '')))
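
As an alternative sketch, the same conversion can be done with pandas string methods instead of a lambda. The `sample` frame below is made-up data, and it assumes CTR is exported as strings such as "5.12%".

```python
import pandas as pd

# Made-up sample that mimics how CTR appears in the exported CSV.
sample = pd.DataFrame({"CTR": ["5.12%", "0.4%", "12%"]})

# Strip the trailing % sign, then cast the whole column to float in one step.
sample["CTR"] = sample["CTR"].str.rstrip("%").astype(float)

print(sample["CTR"].tolist())
```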

Model inputs, assigning a target variable and standardizing the data

The inputs for our machine learning model will be the following: clicks, impressions and average organic search position.

Essentially what we’re saying here is “given clicks, impressions and position what is my predicted CTR % for this given specific observation?”

  • X is defined as the predictor matrix (i.e. model inputs).
  • y is defined as the target variable (click through rate – CTR %).

X = df[['Clicks', 'Impressions', 'Position']]
y = df.pop('CTR')

Now let’s create a train-test split. This holds back a portion of the data so that we can evaluate the model on observations it has not seen during training.

We will also standardize the predictor matrix. The scaler is fitted on the training data only, so every feature column (clicks, impressions, position) in the training set will have a mean of zero and a standard deviation of 1, and the test set is transformed with those same training statistics. This is important because it removes the unit aspect from all of our model inputs.

# Train Test Split 
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)

# Standardise The Data 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns = X_train.columns, index = X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns, index = X_test.index)

[Figure: the standardization formula (z-scores)]
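
The z-score is z = (x - mean) / standard deviation, computed per column. The following minimal check, using made-up numbers in `values`, suggests that `StandardScaler` applies this formula with the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data, purely for illustration.
values = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual z-score: subtract the column mean, divide by the population standard deviation.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```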

Model 1 – Linear Regression

# Import different Algorithms to see differences between their predictions
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

Linear_regression = LinearRegression()
Linear_regression.fit(X_train, y_train)
# R² on the training set, then on the held-back test set
print(Linear_regression.score(X_train, y_train))
print(Linear_regression.score(X_test, y_test))

Model 2 – Random Forest

RandomForestRegressorModel = RandomForestRegressor(n_estimators=50)
RandomForestRegressorModel.fit(X_train, y_train)
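# R² on the training data (prediction_score) and on the held-back test data (test_score)
prediction_score = RandomForestRegressorModel.score(X_train, y_train)
test_score = RandomForestRegressorModel.score(X_test, y_test)
print(prediction_score, test_score)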
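
To compare the two models side by side, one option is to report both R² and RMSE on the held-back data. The sketch below runs on synthetic, made-up data (`X_demo`, `y_demo`) rather than the Search Console export, so the numbers it prints say nothing about real CTR data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a non-linear target with some noise (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 20, size=(300, 1))
y_demo = 30 / X_demo[:, 0] + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model on the training split and score it on the test split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    scores[name] = {
        "r2": r2_score(y_te, preds),
        "rmse": mean_squared_error(y_te, preds) ** 0.5,
    }
    print(name, scores[name])
```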

Let’s Combine The Predictions For X_train + X_test

X_list = [X_train, X_test]

# Fit once on the training data, then reuse the same fitted model for both splits
RandomForestRegressorModel = RandomForestRegressor(n_estimators=50)
RandomForestRegressorModel.fit(X_train, y_train)

data_dict = {
    'Indexes': [],
    'Predictions': []
}

# The loop variable is named X_part so that it does not overwrite the target variable y
for X_part in X_list:
    data_dict['Indexes'].extend(X_part.index)
    data_dict['Predictions'].extend(RandomForestRegressorModel.predict(X_part))

Merging X_train + X_test with Y_train + Y_test

merged_X = pd.concat([X_train, X_test])
merged_Y = pd.concat([y_train, y_test])
merged_df = pd.concat([merged_X, merged_Y], axis = 1)
merged_df.head(12)

Predicting Click Through Rate For All Pages

As the random forest model performed significantly better than a generic linear regression model, I used it to predict CTR % for every URL.


Merging The Original Dataframe to the predicted CTR (%)

Now we have the following:

  • An original Google Search Console dataset.
  • CTR predictions for every URL (web page).

It’s possible to match these dataframes by their original index. #yay
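
The code for this merge is not shown here, so the following is only one way it could look. It assumes `data_dict` and `merged_df` have the shapes built earlier (tiny made-up stand-ins are used below) and that the result is stored as `final_df` with a `Predicted_CTR_%` column, the names used in the next step.

```python
import pandas as pd

# Hypothetical stand-ins for the objects built earlier in the article.
merged_df = pd.DataFrame(
    {
        "Clicks": [0.5, -1.2, 0.7],
        "Impressions": [1.1, -0.3, -0.8],
        "Position": [-0.9, 1.4, 0.2],
        "CTR": [4.0, 1.5, 0.0],
    },
    index=[2, 0, 1],
)
data_dict = {"Indexes": [0, 2, 1], "Predictions": [1.0, 5.5, 0.3]}

# Turn the collected predictions into a Series keyed by the original row index.
predicted = pd.Series(
    data_dict["Predictions"], index=data_dict["Indexes"], name="Predicted_CTR_%")

# Join on the index so each prediction lands on the row it was made for,
# regardless of the order in which the rows were collected.
final_df = merged_df.join(predicted)

print(final_df)
```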


Creating a Calculated Metric To Rank Pages By CTR % Opportunity

Okay so far so good, we’ve got CTR predictions on our original dataset. But how can we rank these pages in terms of opportunity now that position is standardised?

final_df['CTR_Difference'] = final_df['Predicted_CTR_%'] - final_df['CTR']
# Now subset the data and remove any rows where the CTR was 0.00
final_df = final_df[final_df['CTR'] != 0.00]

final_df.sort_values(by='CTR_Difference', ascending=False)

Let’s do a final merge by index again:

final_df = final_df[['CTR', 'Predicted_CTR_%', 'CTR_Difference']]
original_data = df[['Page', 'Clicks', 'Impressions', 'Position']]
results = pd.merge(original_data, final_df, left_index=True, right_index=True)
results

How To Interpret CTR_Difference

Basically you can view it like this:

  • If the CTR_Difference is positive: The predicted CTR was higher than the actual CTR. In other words, the model expects the page to have a higher CTR given its clicks, its position and the number of impressions it received.
  • If the CTR_Difference is negative: The predicted CTR was lower than the actual CTR. In other words, the model expects the page to have a lower CTR, so the page is actually over-performing given the other features.

How To Use This:

  • Rank all of your URLs by CTR_Difference. Optimise the URLs with the highest CTR_Difference first.
  • Leave the pages with the most negative CTR_Difference alone, as these pages are already performing better than the model expects.
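
As a rough sketch of that ranking step, the snippet below picks a shortlist of pages with the largest CTR_Difference. The `results_demo` table and the `top_n` value are made up for illustration.

```python
import pandas as pd

# Made-up results table with the same column names as the article's output.
results_demo = pd.DataFrame({
    "Page": ["/a", "/b", "/c", "/d"],
    "CTR": [2.0, 6.0, 1.0, 4.0],
    "Predicted_CTR_%": [5.0, 4.0, 1.5, 4.2],
})
results_demo["CTR_Difference"] = results_demo["Predicted_CTR_%"] - results_demo["CTR"]

# Keep the pages where the model expects the most CTR headroom.
top_n = 2
shortlist = results_demo.nlargest(top_n, "CTR_Difference")

print(shortlist[["Page", "CTR_Difference"]])
```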

Limitations Of This Approach / Next Steps

  • The machine learning model treats every query as if it were the same.
  • The random forest model can only explain 73% of the variance within the test set, so there will be poor predictions on at least a few of the URLs.

Next on the list of activities to do would be to:

  • Add on-page HTML features for every URL.
  • Add SERP features for specific queries.
  • Perform a grid search (hyper-parameter tuning).
  • Test more models.
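
For the hyper-parameter tuning item, a minimal sketch with scikit-learn's `GridSearchCV` might look like the following. The data is synthetic and the parameter grid is an arbitrary assumption, so treat it as a starting point rather than a tuned setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with three feature columns (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 20, size=(120, 3))
y_demo = 30 / X_demo[:, 2] + rng.normal(0, 0.5, size=120)

# A deliberately small grid; a real search would usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,           # 3-fold cross validation on the data passed to fit
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```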