Keyword research is a fundamental process that helps search engine marketers to understand where the market opportunity is and what searchers care about.

Tools such as Ahrefs, SEOMoz or SEMrush give you a list of either top pages or keywords that highlights the potential market opportunity, with metrics including monthly search volume, traffic value ($) and top keyword.


Marketers often use simple math formulas to prioritise the large volume of data these tools produce, then distil it into a recommended list of landing pages, blog posts or resources for content creation.


For example, Siegemedia recommends the following KOB score:

Looking at this formula, we would assume that a difficulty level of 75 is approximately three times as hard as a difficulty level of 25.

However, let’s check how Ahrefs calculates keyword difficulty. Because yes, we love you Ahrefs.


Investigating How Ahrefs Calculates Keyword Difficulty

Ahrefs’ keyword difficulty metric ranges from 0 – 100 and thankfully they’ve provided us with some numerical data showing how the metric is calculated.


As rightly stated by Ahrefs, this data is not linear. This means that treating a difficulty of 25 as one third of a difficulty of 75 is likely to be a flawed assumption.

However, we can’t quite use Ahrefs’ metric yet because the data is only given at the edges of each range (0 – 10 – 20).

So let’s see if we can calculate the values in-between these numbers to rescale the KOB traffic formula.


TLDR:


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats is available
import numpy as np
%matplotlib inline
df = pd.read_csv('ahrefs_keyword_difficulty.csv')
df.rename(columns={'Keyword Difficulty - Y':'Keyword_Difficulty', 'Referring Domains - X':'Referring_Domains'}, inplace=True) 
df
sns.scatterplot(data=df, x='Keyword_Difficulty', y='Referring_Domains')
plt.show()

This is the original relationship of Keyword Difficulty vs Referring Domains. When both scales are linear, Referring Domains (Y) appears to grow roughly exponentially with Keyword Difficulty (X). Put the other way round, Keyword Difficulty is roughly logarithmic in Referring Domains.

Now let’s apply a log(1 + y) transformation (np.log1p) to the Y-axis in an attempt to make the relationship between X and Y more linear.

df['Log_Referring_Domains'] = np.log1p(df['Referring_Domains'])   # log(1 + y), safe when y is 0
sns.scatterplot(data=df, x='Keyword_Difficulty', y='Log_Referring_Domains')
plt.show()

Okay great! We’ve got approximately a straight line that we can model our data against. It has some wiggles, but it’ll do.
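To see why the log transform straightens the curve, here is a minimal sketch on synthetic data (not the Ahrefs figures). The starting value of 3 and the growth rate of 0.06 are made-up numbers chosen only for illustration. If Y grows exponentially with X, then log(1 + Y) should sit much closer to a straight line in X than Y itself does.

```python
import numpy as np

# Synthetic, made-up data: y grows exponentially with x (NOT the Ahrefs figures).
x = np.arange(0, 101, 10)
y = 3.0 * np.exp(0.06 * x)

# log1p(y) = log(1 + y); unlike a plain log it is defined at y = 0.
log_y = np.log1p(y)

# Pearson correlation measures how close each relationship is to a straight line.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, log_y)[0, 1]
print(f"raw r = {r_raw:.3f}, log1p r = {r_log:.3f}")

# expm1 is the inverse of log1p, so the original values can be recovered.
recovered = np.expm1(log_y)
print(np.allclose(recovered, y))
```

The correlation on the transformed data should come out noticeably closer to 1, which is the same effect seen in the scatterplot above.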

Let’s apply a simple linear regression to the data using the least squares method to minimise the error.

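# Regression Function
def regress(x, y):
    """Return a tuple of predicted y values and the fitted linregress result."""
    p = sp.stats.linregress(x, y)        # least-squares fit of y on x
    b1, b0 = p.slope, p.intercept        # slope and intercept of the fitted line
    y_pred = np.polyval([b1, b0], x)     # np.polyval replaces sp.polyval, which SciPy no longer provides
    return y_pred, p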

# Plotting
x, y = df['Keyword_Difficulty'], df['Log_Referring_Domains']                      # transformed data
y_pred, _ = regress(x, y)

plt.plot(x, y, "mo", label="Data")
plt.plot(x, y_pred, "k--", label="Pred.")
plt.xlabel("Keyword Difficulty")
plt.ylabel("Log Referring Domains")                            # label axis
plt.legend()
plt.show()

Results

ŷ = b0 + b1x, where:

  • b0 is a constant (the intercept).
  • b1 is the slope (regression coefficient).
  • x is the value of the independent input variable.
  • ŷ is the predicted value of the dependent output variable.
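As a quick illustration of how b0 and b1 come out of a fit, here is a minimal sketch using scipy.stats.linregress on synthetic points. The line y = 1.5 + 0.25x is a made-up example, not the keyword difficulty data.

```python
import numpy as np
from scipy import stats

# Synthetic points lying exactly on y = 1.5 + 0.25 * x (made-up values).
x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
y = 1.5 + 0.25 * x

fit = stats.linregress(x, y)          # ordinary least-squares fit
b0, b1 = fit.intercept, fit.slope     # intercept and slope of the fitted line

# Predicted value for a new input, using y_hat = b0 + b1 * x.
x_new = 25.0
y_hat = b0 + b1 * x_new
print(b0, b1, y_hat)
```

Because the points sit exactly on a line, the fit should recover the intercept and slope they were generated from. On real data the fitted line only approximates the points.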

Insight

This means that each additional point of Keyword Difficulty increases log(1 + Referring Domains) by about 0.06.

Now that we can approximate the relationship between X and Y, we can fill in the gaps for every keyword difficulty value.
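Because the model is linear in log space, a fixed additive step in log(1 + Referring Domains) corresponds to a fixed multiplicative step in (1 + Referring Domains). The sketch below works that out from the fitted slope quoted in this post. The variable names are illustrative only.

```python
import numpy as np

slope = 0.06001338738238661   # fitted slope quoted in this post

# Adding 1 to Keyword Difficulty adds `slope` to log(1 + RD),
# which multiplies (1 + RD) by exp(slope).
per_point_factor = np.exp(slope)

# Ten points of difficulty compound the same factor ten times.
ten_point_factor = np.exp(slope * 10)

print(f"per point: x{per_point_factor:.3f}, per 10 points: x{ten_point_factor:.2f}")
```

Under this fit, each point of difficulty implies roughly 6% more (1 + Referring Domains), and the increase compounds rather than adding up linearly.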

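Slope = 0.06001338738238661
Intercept = 1.330338619969123
np.expm1(Slope + Intercept)   # predicted referring domains at Keyword Difficulty = 1

y_predictions = []
for i in range(2, 101):
    prediction = np.expm1(Intercept + (Slope * i))   # expm1 undoes the log1p transform
    y_predictions.append(prediction)
    print(prediction)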
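To tie this back to the earlier question about difficulty 25 versus 75, here is a minimal sketch that wraps the same calculation in a function and builds a lookup table. The function name predicted_referring_domains and the dictionary kd_to_rd are hypothetical helpers introduced for illustration, and the outputs are estimates from this regression, not figures published by Ahrefs.

```python
import numpy as np

SLOPE = 0.06001338738238661      # fitted slope from this post
INTERCEPT = 1.330338619969123    # fitted intercept from this post

def predicted_referring_domains(kd):
    """Invert the log1p transform to estimate referring domains for a difficulty score."""
    return np.expm1(INTERCEPT + SLOPE * kd)

# Hypothetical lookup table: one estimate per integer difficulty from 0 to 100.
kd_to_rd = {kd: float(predicted_referring_domains(kd)) for kd in range(0, 101)}

# Compare the two difficulty levels discussed earlier in the post.
ratio = kd_to_rd[75] / kd_to_rd[25]
print(f"KD 25 -> {kd_to_rd[25]:.0f} RD, KD 75 -> {kd_to_rd[75]:.0f} RD, ratio {ratio:.1f}")
```

Under this fit the ratio comes out far larger than 3, which is why plugging raw difficulty scores into a linear formula understates how much harder the high-difficulty keywords are.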

Conclusion

So in this guide:


Enjoy the rest of your week and thanks for reading!
