Google Search Console is a free tool that shows how visible your websites are in organic search.

For every website, it reports a range of metrics by page or by query, including:

  • Clicks.
  • Impressions.
  • Average Click Through Rate (%).
  • Average Position (what rank your web page is within organic search).

Optimising the click through rate (CTR %) of your website pages is an ongoing challenge, and one that lends itself naturally to machine learning.

If you ask any search engine marketer, they will tell you to optimise your <title> tags and meta-descriptions so that the page entices more users to click on your result instead of other results in the search engine results page (SERP).

The Current Process

First, let’s review the current, manual process for selecting pages to optimise using Google Search Console data.

  1. Get Data: Download your Google Search Console data as a .CSV and sort it by impressions.
  2. Find Opportunities: Look at any pages that have a low CTR and decide that the <title> tag or meta-description is inadequate.
  3. Execute: Re-write either the <title> tag, meta-description or both.

So what’s the problem with this approach? It doesn’t account for the fact that CTR (%) changes as organic search position changes.

In other words, as your web page ranks higher on Google, its click through rate tends to increase, regardless of how well the <title> tag or meta-description is written.

So sorting by impressions alone and picking pages with a low CTR % is not a reliable way to select pages.

Optimise Your Title Tags & Meta Descriptions with ML

Identifying the type of machine learning problem

As we will be attempting to predict a continuous variable (click through rate – CTR %) for every page, it’s possible for us to frame this as a regression problem. Let’s import all of the relevant Python packages.

For this example we will take a 3 month sample of Google Search Console page URL data. Download it as a CSV and then read it into a pandas DataFrame.
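As a minimal sketch of the loading step: the snippet below reads a small inline CSV in place of the real export, so it runs without a file. The column names (Page, Clicks, Impressions, CTR, Position) and all values are assumptions for illustration; check them against the header of your own export.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; in practice, pass the CSV path to pd.read_csv.
# The column names and values below are illustrative assumptions.
csv_text = """Page,Clicks,Impressions,CTR,Position
https://example.com/a,120,2400,5%,3.2
https://example.com/b,15,1800,0.83%,8.7
https://example.com/c,300,3000,10%,1.4
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```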

Cleaning The Data

The only cleaning this data needs is to remove the % character and convert every value in the CTR column to a float.
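One way to do this in pandas, assuming the column is called CTR and holds strings such as "5%" (the values here are toy data):

```python
import pandas as pd

# Toy CTR column as it might appear in the raw export (assumed format).
df = pd.DataFrame({"CTR": ["5%", "0.83%", "10%"]})

# Strip the trailing percent sign, then cast the remaining text to float.
df["CTR"] = df["CTR"].str.rstrip("%").astype(float)
print(df["CTR"].tolist())
```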

Model inputs, assigning a target variable and standardizing the data

The inputs for our machine learning model will be the following: clicks, impressions and average organic search position.

Essentially what we’re saying here is “given clicks, impressions and position, what is my predicted CTR % for this observation?”

  • X is defined as the predictor matrix (i.e. model inputs).
  • Y is defined as the target variable (click through rate – CTR %).
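The definitions above might be sketched like this. The DataFrame is toy data and the column names are assumptions carried over from a typical export.

```python
import pandas as pd

# Toy stand-in for the cleaned Search Console data (illustrative values).
df = pd.DataFrame({
    "Page": ["/a", "/b", "/c"],
    "Clicks": [120, 15, 300],
    "Impressions": [2400, 1800, 3000],
    "CTR": [5.0, 0.83, 10.0],
    "Position": [3.2, 8.7, 1.4],
})

feature_cols = ["Clicks", "Impressions", "Position"]
X = df[feature_cols]  # predictor matrix (model inputs)
y = df["CTR"]         # target variable (CTR %)
print(X.shape, y.shape)
```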

Now let’s create a train-test split. This holds back a portion of the data so the model can be evaluated on rows it has not seen; cross validation can then be run on the training portion.

We will also standardize the predictor matrix. This means that every feature column (clicks, impressions, position) will have a mean of zero and a standard deviation of 1. This is important because it removes the units from our model inputs and puts them on a common scale.
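A minimal sketch of the split using scikit-learn's train_test_split. The 25% test size, the random seed and the randomly generated data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated toy data standing in for the real export.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "Clicks": rng.integers(0, 500, n),
    "Impressions": rng.integers(500, 5000, n),
    "Position": rng.uniform(1, 20, n),
})
y = pd.Series(rng.uniform(0, 15, n), name="CTR")

# Hold back 25% of the rows for testing (the fraction is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))
```

Because X and y are pandas objects, each subset keeps its original row index, which is what makes the later merges by index possible.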

The standardization formula (z-score): z = (x - μ) / σ, where μ is the mean of the feature column and σ is its standard deviation.
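Here is a small sketch of standardization with scikit-learn's StandardScaler, checked against the z-score formula computed by hand. The arrays are toy values. Fitting the scaler on the training rows only, then reusing it on the test rows, is a common convention and an assumption about how the original was set up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature rows: clicks, impressions, position (illustrative values).
X_train = np.array([
    [120.0, 2400.0, 3.2],
    [15.0, 1800.0, 8.7],
    [300.0, 3000.0, 1.4],
    [40.0, 900.0, 12.0],
])
X_test = np.array([[60.0, 1500.0, 5.5]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from training rows
X_test_std = scaler.transform(X_test)        # reuse the training mean and std

# The same result by hand: z = (x - mean) / std, column by column.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_train_std, manual))
```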

Model 1 – Linear Regression
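A baseline might look like the sketch below. It fits scikit-learn's LinearRegression on standardized features and reports R² on the held-back test set. The data is synthetic (CTR is made to fall as position gets worse), so the score it prints says nothing about real Search Console data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: CTR (%) decays with position, plus noise (an assumption).
rng = np.random.default_rng(0)
n = 200
position = rng.uniform(1, 20, n)
impressions = rng.integers(100, 5000, n).astype(float)
ctr = np.clip(30 / position + rng.normal(0, 1.5, n), 0.1, None)
clicks = np.round(impressions * ctr / 100)

X = np.column_stack([clicks, impressions, position])
y = ctr

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

linreg = LinearRegression().fit(X_train_std, y_train)
linreg_r2 = linreg.score(X_test_std, y_test)  # R^2 on unseen rows
print(f"Linear regression R^2 on the test set: {linreg_r2:.3f}")
```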

Model 2 – Random Forest
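The random forest can be sketched the same way with scikit-learn's RandomForestRegressor. The number of trees and the seed are assumptions, and the data is again synthetic, so the printed score is only a smoke test.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: CTR (%) decays with position, plus noise (an assumption).
rng = np.random.default_rng(0)
n = 200
position = rng.uniform(1, 20, n)
impressions = rng.integers(100, 5000, n).astype(float)
ctr = np.clip(30 / position + rng.normal(0, 1.5, n), 0.1, None)
clicks = np.round(impressions * ctr / 100)

X = np.column_stack([clicks, impressions, position])
y = ctr

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_std, y_train)
rf_r2 = rf.score(X_test_std, y_test)  # R^2 on unseen rows
print(f"Random forest R^2 on the test set: {rf_r2:.3f}")
```

Tree-based models do not strictly need standardized inputs; the scaling is kept here only so both models see the same features.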

Let’s Combine The Predictions For X_train + X_test

Merging X_train + X_test with Y_train + Y_test
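One way to put the pieces back together, assuming X and y were pandas objects when they were split: concatenate the train and test parts, then join features and target on the row index. The data is randomly generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated toy data standing in for the real export.
rng = np.random.default_rng(0)
n = 40
X = pd.DataFrame({
    "Clicks": rng.integers(0, 500, n),
    "Impressions": rng.integers(500, 5000, n),
    "Position": rng.uniform(1, 20, n),
})
y = pd.Series(rng.uniform(0, 15, n), name="CTR")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Stack the two halves back together; the original row index is preserved.
X_all = pd.concat([X_train, X_test])
y_all = pd.concat([y_train, y_test])

# Join features and target on the index and restore the original row order.
combined = X_all.join(y_all).sort_index()
print(combined.head())
```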

Predicting Click Through Rate For All Pages

As the random forest model performed significantly better than a generic linear regression model, I used it to predict CTR % for every URL:
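A sketch of that step: fit the forest on the training rows, then predict for every row and keep the original index on the result. The column name Predicted_CTR, the hyper-parameters and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: CTR (%) decays with position, plus noise (an assumption).
rng = np.random.default_rng(0)
n = 200
position = rng.uniform(1, 20, n)
impressions = rng.integers(100, 5000, n).astype(float)
ctr = np.clip(30 / position + rng.normal(0, 1.5, n), 0.1, None)
X = pd.DataFrame({
    "Clicks": np.round(impressions * ctr / 100),
    "Impressions": impressions,
    "Position": position,
})
y = pd.Series(ctr, name="CTR")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler().fit(X_train)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(scaler.transform(X_train), y_train)

# Predict for every URL (train and test rows alike), keeping the row index.
predicted_ctr = pd.Series(
    rf.predict(scaler.transform(X)), index=X.index, name="Predicted_CTR"
)
print(predicted_ctr.head())
```

Note that predictions for rows the model was trained on will look better than predictions for unseen rows, so treat them with some caution.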

Merging The Original Dataframe With The Predicted CTR (%)

Now we have the following:

  • An original Google Search Console dataset.
  • CTR predictions for every URL (web page).

It’s possible to match these dataframes by their original index. #yay
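A minimal sketch of an index-based merge in pandas. Both frames are toy data, and the predictions are deliberately in a shuffled row order, as they would be after a train-test split.

```python
import pandas as pd

# Toy original dataset (illustrative values).
gsc = pd.DataFrame({
    "Page": ["/a", "/b", "/c"],
    "CTR": [5.0, 0.5, 10.0],
})

# Toy predictions in shuffled row order, labelled by the original index.
predictions = pd.DataFrame({"Predicted_CTR": [8.0, 6.5, 2.5]}, index=[2, 0, 1])

# Align on the row index rather than on a column.
merged = gsc.merge(predictions, left_index=True, right_index=True, how="left")
print(merged)
```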

Creating a Calculated Metric To Rank Pages By CTR % Opportunity

Okay, so far so good: we’ve got CTR predictions on our original dataset. But how can we rank these pages by opportunity now that position has been accounted for? One way is a calculated metric, CTR_Difference, defined as the predicted CTR minus the actual CTR.

Let’s do a final merge by index again:
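The calculated metric itself is a single subtraction. The sketch below assumes the merged frame has CTR and Predicted_CTR columns and takes the difference as predicted minus actual, which matches the interpretation that follows. All values are toy data.

```python
import pandas as pd

# Toy merged frame with actual and predicted CTR (illustrative values).
merged = pd.DataFrame({
    "Page": ["/a", "/b", "/c"],
    "CTR": [5.0, 0.5, 10.0],
    "Predicted_CTR": [6.5, 2.5, 8.0],
})

# Positive: the model expected a higher CTR than the page achieved.
merged["CTR_Difference"] = merged["Predicted_CTR"] - merged["CTR"]
print(merged)
```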

How To Interpret CTR_Difference

Basically you can view it like this:

  • If the CTR_Difference is positive: the predicted CTR was higher than the actual CTR. In other words, the model expects the page to earn a higher CTR given its clicks, position and impressions, so the page is under-performing.
  • If the CTR_Difference is negative: the predicted CTR was lower than the actual CTR. The model expects a lower CTR, so the page is actually over-performing given the other features.

How To Use This:

  • Rank all of your URLs by CTR_Difference. Optimise the URLs with the highest CTR_Difference.
  • Leave the pages with the most negative CTR_Difference alone, as these pages are already performing better than the model expects.
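The ranking step can be sketched as a descending sort on CTR_Difference, so the pages with the biggest gap between predicted and actual CTR come first. The frame is toy data.

```python
import pandas as pd

# Toy frame with the calculated metric (illustrative values).
pages = pd.DataFrame({
    "Page": ["/a", "/b", "/c"],
    "CTR_Difference": [1.5, 2.0, -2.0],
})

# Largest positive difference first: these are the best optimisation candidates.
ranked = pages.sort_values("CTR_Difference", ascending=False).reset_index(drop=True)
print(ranked)
```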

Limitations Of This Approach / Next Steps

  • The machine learning model treats every query as if it were the same.
  • The random forest model can only explain 73% of the variance within the test set, so there will be poor predictions on at least a few of the URLs.

The next steps would be to:

  • Add on-page HTML features for every URL.
  • Add SERP features for specific queries.
  • Perform a grid search (hyper-parameter tuning).
  • Test more models.
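As a rough idea of what the grid search step could look like, here is a sketch using scikit-learn's GridSearchCV. The parameter grid, the 3-fold cross validation and the synthetic data are all assumptions chosen to keep the example fast, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: CTR (%) decays with position, plus noise (an assumption).
rng = np.random.default_rng(0)
n = 120
position = rng.uniform(1, 20, n)
impressions = rng.integers(100, 5000, n).astype(float)
ctr = np.clip(30 / position + rng.normal(0, 1.5, n), 0.1, None)
clicks = np.round(impressions * ctr / 100)
X = np.column_stack([clicks, impressions, position])
y = ctr

# A deliberately tiny grid; real searches usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, cv=3, scoring="r2"
)
search.fit(X, y)
print(search.best_params_)
```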