De-duplicating Keywords with Python, Pandas and FuzzyWuzzy

James Phoenix

Hey everyone, welcome to the first episode of the Python for SEO course. We’ll mainly be working up from the easier content to the more difficult exercises. The course content will be progressively rolled out on GitHub and YouTube 😃



Prerequisites:

  • Python variables.
  • Python dictionaries + lists.
  • Knowing how to use for loops and list comprehensions.
  • Knowing how to use Python if and else statements.
  • A working version of Anaconda (if you haven’t already installed Python, please follow my guide here).
  • git must be installed on your local computer.

How To Download The Course Material:

You will need to go to your command line, then navigate to a folder that you would like to store all of the course contents in.

Then simply type this command:

git clone https://github.com/jamesaphoenix/Python_For_SEO/

You’ll get access to all of the Jupyter notebooks: one contains the solution code, and the other contains the starter code. We’ll be using the starter notebook to practice typing out the commands, going through the code and explaining it line by line.


How To Access A Specific Jupyter Notebook

Once you’ve downloaded all of the course content by git cloning the repository, you can access the Jupyter notebook files by starting a local Jupyter server from your terminal/command line:

jupyter notebook

Learning Outcomes

  • Learn how to de-duplicate keyword lists using set operations.
  • Learn how to compare two lists of keywords using set operations.
  • Learn how to de-duplicate keyword lists using FuzzyWuzzy.
  • Learn how to open an Ahrefs keyword report .CSV file with Pandas.
  • Learn how to perform standard data analysis operations within Pandas on a keyword report from Ahrefs (including GroupBy objects, dataframe subsetting, the .drop_duplicates() method, the .apply() method and how to save your new dataframe to a .CSV file).
  • Learn how to create groups of keywords using FuzzyWuzzy + a custom keyword grouping function.
  • Learn how to use .lower() to improve deduplication.
  • Learn how to use NLP lemmatization to improve deduplication.

Keyword De-duplication + Comparing Keyword Lists

import pandas as pd

De-duplicating Keywords With Set Operations

 Now let’s investigate how we can use python lists and set operations to remove duplicates across both single and multiple python lists.

keyword_list_example = ['digital marketing', 'digital marketing', 'digital marketing services', 
                       'copywriting', 'seo copywriting', 'social media marketing', 'social media',
                       'digital marketing services']

Set operations allow us to easily de-duplicate a Python list which contains exact duplicates like so:

de_duplicated_set = set(keyword_list_example)

We can also transform the set back into a Python list and re-assign it back to the original variable keyword_list_example.

keyword_list_example = list(de_duplicated_set)
print(keyword_list_example)

['seo copywriting', 'digital marketing services', 'copywriting', 'social media', 'social media marketing', 'digital marketing']

The benefit of using this type of de-duplication is that it is natively supported within Python; however, it doesn’t allow us to capture partial matches. That’s where FuzzyWuzzy comes in!


How to Compare Two Keyword Lists With Set Operations

We might also want to compare two keyword lists to find where there are matches between them. For example, if we were to take Google Search Console data and paid search data, we would want to find the following:

  • Items that occur within both lists
  • Items that occur in list_a but not in list_b
  • Items that occur in list_b but not in list_a
  • All of the items
  • Are all of the items in list_a in list_b (boolean – True / False)?
  • Are all of the items in list_b in list_a (boolean – True / False)?

Using the power of python sets we can easily find exact matches between two python lists:

google_search_console_keywords = ['data science, machine learning', 'ml consulting', 'digital marketing', 'seo', 'social media', 'data science consultants', 'search engine optimisation']
paid_search_keywords = ['data science consultants', 'digital marketing', 'seo','search engine optimisation']

The .intersection() functionality allows us to find keywords that appear in both lists:

set(google_search_console_keywords).intersection(set(paid_search_keywords))

{'data science consultants',
 'digital marketing',
 'search engine optimisation',
 'seo'}

The example below shows all of the keywords within google_search_console_keywords that are not within the paid search keywords list:

set(google_search_console_keywords) - set(paid_search_keywords)
{'data science, machine learning', 'ml consulting', 'social media'}

The example below shows the exact opposite, all of the keywords that appear within the paid search keywords list that aren’t in the google search console keywords list (which turns out to be none):

set(paid_search_keywords) - set(google_search_console_keywords)
set() # Notice how this result means that we have all of the paid search keywords within our GSC keyword list.

We can also easily extract the combined set of unique keywords from both lists with the following set operation:

set(google_search_console_keywords).union(set(paid_search_keywords))
{'data science consultants',
 'data science, machine learning',
 'digital marketing',
 'ml consulting',
 'search engine optimisation',
 'seo',
 'social media'}

Now we’re going to look at some Boolean logic using the set method .issubset(). This is useful for checking whether all of the Google Search Console keywords are within the paid search keywords list, and vice versa.

all_gsc_keywords_in_paid_search = set(google_search_console_keywords).issubset(set(paid_search_keywords))
all_paid_search_keywords_in_gsc_list = set(paid_search_keywords).issubset(set(google_search_console_keywords))
print(all_gsc_keywords_in_paid_search, all_paid_search_keywords_in_gsc_list)
False True

De-duplicating Keywords With FuzzyWuzzy

We’ll now move on to a more advanced method of keyword de-duplication with a python package called FuzzyWuzzy, which is great for de-duplicating keywords that are near duplicates but not exact duplicates.

But first let’s see an example of where our previous set() operations wouldn’t be able to perform effective de-duplication:


partial_duplicates = ['digital marketing services', 'digital marketing service', 'digital marketing company']

Notice how the python list above has strings which are indeed unique but likely mean the exact same thing! This is where partial string de-duplication comes in!


So what’s FuzzyWuzzy? It’s a string matching python package that allows us to easily calculate the difference between several strings via Levenshtein Distance.

Firstly we’ll need to install the python package called fuzzywuzzy with:

pip install fuzzywuzzy

As a side-note, any time you install python packages you will need to restart the Python kernel to use them within a Jupyter Notebook (click Kernel at the top, then click Restart & Clear Output).


One of the most useful functions from the fuzzywuzzy package is the process.dedupe() function, so let’s see it in action!

from fuzzywuzzy import fuzz, process

print(process.dedupe(keyword_list_example, threshold=70))
dict_keys(['seo copywriting', 'digital marketing services', 'social media marketing'])

What’s important to notice here is that the longer phrases have been chosen as the de-duplicated keywords. Whilst this approach provides us with a de-duplicated list, it comes at the cost of losing track of which keywords were removed during the de-duplication.

To get the keywords back as an ordinary list, we can take the de-duplicated dict_keys and simply transform them into a list.

de_duplicated_keywords = list(process.dedupe(keyword_list_example, threshold=70))
print(de_duplicated_keywords, type(de_duplicated_keywords))
['seo copywriting', 'digital marketing services', 'social media marketing'] <class 'list'>

Also, it’s important to notice that there is a threshold argument which we can pass into process.dedupe:

process.dedupe(keyword_list, threshold=70)

This number can range from 0 to 100; the intuition behind it is that as the threshold increases, we only de-duplicate keywords that are more closely related to each other.

Therefore if we set a low threshold, we will get more de-duplication, however it might come at the cost of de-duplication quality.

There is a balance to using the threshold parameter, so I encourage you to try different threshold values. Let’s do just that! 🙂

We can use the following code to loop over a range of numbers (stepping up with increments of 10):


for i in range(10, 100, 10):
    print(i)
    # This will start at the number 10 and will increment in steps of 10 up to 90
threshold_dictionary = {}
for i in range(10, 100, 10):
    # Increasing the number of steps from 10 --> 90 by 10 at a time:
    print(f"This is the current threshold value: {i}")
    # Performing the de-duplication
    de_duplicated_items = process.dedupe(keyword_list_example, threshold=i)
    # Storing the de-duplicated items
    threshold_dictionary[i] = list(de_duplicated_items)
This is the current threshold value: 10
This is the current threshold value: 20
This is the current threshold value: 30
This is the current threshold value: 40
This is the current threshold value: 50
This is the current threshold value: 60
This is the current threshold value: 70
This is the current threshold value: 80
This is the current threshold value: 90

print(threshold_dictionary)

{10: ['digital marketing services'], 20: ['digital marketing services'], 30: ['digital marketing services', 'digital marketing'], 40: ['social media marketing', 'digital marketing services', 'seo copywriting'], 50: ['seo copywriting', 'digital marketing services', 'social media marketing'], 60: ['seo copywriting', 'digital marketing services', 'social media marketing'], 70: ['seo copywriting', 'digital marketing services', 'social media marketing'], 80: ['seo copywriting', 'digital marketing services', 'social media marketing'], 90: ['seo copywriting', 'digital marketing services', 'social media marketing']}
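
To see the threshold trade-off at a glance, we can count how many keywords survive de-duplication at each threshold. This is a small sketch built on the threshold_dictionary from above (not part of the original notebook):

surviving_counts = {threshold: len(keywords) for threshold, keywords in threshold_dictionary.items()}
print(surviving_counts)
# {10: 1, 20: 1, 30: 2, 40: 3, 50: 3, 60: 3, 70: 3, 80: 3, 90: 3}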

How To Open An Ahrefs CSV Keyword Report With Pandas

I’ve prepared a sample report from the Ahrefs Keywords Explorer for the term “digital marketing”. We’ll be using this to show you a few ways of doing keyword data analysis + grouping within pandas. Firstly, we can load the CSV file with the following syntax:


df = pd.read_csv('csv_file_path.csv')

Important Points:

  • All of the keyword reports from Ahrefs are tab-separated; this will need to be specified when we read the CSV file.
  • Depending upon how you downloaded the keyword .csv from Ahrefs, a different type of encoding may be required. Again, this can be specified inside of the pd.read_csv() function.

Examples:

df = pd.read_csv('data/digital-marketing-keyword-ideas.csv', encoding='UTF-16', delimiter='\t')
df = pd.read_csv('data/digital-marketing-keyword-ideas.csv', encoding='UTF-8', delimiter='\t')

df.info() # Allows us to inspect the dataframe and see how many np.nan (missing) values are inside each column


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
#                 1000 non-null int64
Keyword           1000 non-null object
Country           1000 non-null object
Difficulty        599 non-null float64
Volume            658 non-null float64
CPC               462 non-null float64
Clicks            188 non-null float64
CPS               188 non-null float64
Return Rate       188 non-null float64
Parent Keyword    602 non-null object
Last Update       597 non-null object
SERP Features     562 non-null object
dtypes: float64(6), int64(1), object(5)
memory usage: 93.9+ KB
df.dtypes # This method allows us to inspect the different data types of every column

Column Selection

Dataframes are great for automating your Google Sheets processes. Let’s now learn how to select single and multiple columns:


df['single_column'] --> This will return a pd.series object which is essentially a single column.
df[['column_one', 'column_two']] --> This will return a dataframe object, similar to the original df.
df['Keyword']
0                                coast
1                              hubspot
2                    digital marketing
3                              digital
4                      content meaning
                    ...               
603            results through digital
604    star digital marketing services
610         how to market b2b services
620           phone marketing services
641        campaign marketing services
Name: Keyword, Length: 602, dtype: object

You can also create mini dataframes by using double [[]] square brackets. This allows you to create a new pandas dataframe that could be easily assigned to a new variable.

df[['CPC', 'Clicks']]

# Let's store this dataframe within a temporary variable
temp_variable = df[['CPC', 'Clicks']]

temp_variable
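
If you’re ever unsure which of the two you’re working with, a quick type() check makes the difference obvious (a minimal sanity check using the dataframe loaded above):

print(type(df['CPC']))             # <class 'pandas.core.series.Series'>
print(type(df[['CPC', 'Clicks']])) # <class 'pandas.core.frame.DataFrame'>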

How To Index Specific Columns And Rows

There are two ways that you can index rows and columns: either with .loc or with .iloc:


.loc[] refers to the row and column labels (names)
.iloc[] refers to the integer positions of the rows and columns

Remember when it comes to indexing your dataframe the order is first ROWS, then COLUMNS

df.loc[0:5, ['Keyword', 'Country']].reindex()

  Keyword Country
0     NaN      gb
1     NaN      gb
2     NaN      gb
3     NaN      gb
4     NaN      gb
5     NaN      gb

df.iloc[0:5,0:2]

             Keyword Country
0              coast      gb
1            hubspot      gb
2  digital marketing      gb
3            digital      gb
4    content meaning      gb

Let’s look at some more advanced methods for selecting and indexing columns.

Selecting Specific Columns With A Python List

df.iloc[:, [0, 5, 3]] # This will return all of the rows, and columns 0, 5 and 3 (in that order).
df.loc[:, ['CPC','Keyword']] # This will return all of the rows, and the columns labelled CPC and Keyword.

Notice in the above example that we can wrap multiple values within a python list to get specific columns. This type of syntax can be applied to either rows or columns.


Selecting Increments Of Rows With Slicing

Let’s say that you wanted to get rows 0, 5, 10 and so on (every 5th row); this is how you could easily achieve that:

df.iloc[::5,:] # All of the rows and columns increasing in a step size of 5 at a time.

Sorting Dataframes By Column Values

Let’s rank the keywords by organic monthly search volume in descending order. Also notice that I’ve used inplace=True, which means that the pandas dataframe is permanently sorted by the search volume:


df.sort_values(by='column_name', ascending=Boolean, inplace=Boolean) # Replace the Boolean with either True or False
df.head(integer)

# The df.head() command allows us to easily to show the top N results.
df.sort_values(by='Volume', ascending=False, inplace=True)
df.head(6)

Modifying Your Dataframe With Re-assignment

You don’t always have to use the inplace=True parameter, you can choose to do any operation on a pandas dataframe or pandas series object (a single column), then you can simply re-assign the result back to the original variable.

For example the syntax would be:

df = df.sort_values(by='CPC') # This will save the result of the sort_values back, and is similar to inplace=True

Now let’s sort the dataframe based upon CPC. Also notice how using the .head(6) method here allows us to easily look at the top 6 rows.

df.sort_values(by='CPC', ascending=False).head(6)

If you wanted to look at the last 6 rows you can just use:

df.tail(6)

How To Drop Null (np.nan) Values From A Dataframe

Okay that’s great, but as Ahrefs provides a Parent Keyword column, let’s firstly remove any keywords that don’t have a value for this column:


.dropna(subset=['column_name']) - This command allows us to drop np.nan (not a number) values from the dataframe.

Also let’s remove the # column as it is unnecessary:

How To Remove A Column From A Dataframe

df.dropna(subset=['Parent Keyword'], inplace=True)
df.drop(columns=['#'], inplace=True)
df.head(6)

   Keyword            Country  Difficulty   Volume   CPC   Clicks   CPS   Return Rate  Parent Keyword     Last Update          SERP Features
0  coast              gb       42.0         70000.0  2.5   64715.0  0.93  1.67         coast              2020-05-09 23:55:39  Sitelinks
1  hubspot            gb       67.0         63000.0  5.0   59708.0  0.95  2.34         hubspot            2020-05-09 00:57:01  Adwords top, Sitelinks, Top stories, Thumbnail…
2  digital marketing  gb       74.0         19000.0  7.0   11033.0  0.57  1.41         digital marketing  2020-05-09 07:11:09  Adwords top, Sitelinks, People also ask, Top s…
3  digital            gb       89.0         16000.0  2.5   5912.0   0.38  1.19         digital            2020-05-09 02:26:10  Sitelinks, People also ask, Top stories, Thumb…
4  content meaning    gb       45.0         4400.0   17.0  622.0    0.14  1.18         content            2020-05-10 06:32:24  Knowledge card, People also ask, Top stories, …
5  digital media      gb       24.0         3600.0   3.0   1671.0   0.47  1.24         digital media      2020-05-10 14:14:41  Featured snippet, Thumbnails, People also ask,…

Utilising GroupBy Objects:

We can use a Pandas function called .groupby() which will allow us to group keywords based upon their Parent Keyword.

df.groupby('column_name')

This allows us to group multiple keywords that have the same parent keyword and to get aggregation metrics on the columns of our choosing. We will also save this groupby object to a variable so that we can reference it directly.

grouped_parent_keywords = df.groupby('Parent Keyword')
grouped_parent_keywords.count()
grouped_parent_keywords.mean()

Notice that as we are grouping a series of keywords by their Parent Keyword, we need to use aggregation functions/methods to summarise the grouped metrics. Common functions include the following:


.mean()
.count()
.sum()
.median()

However for our analysis we’ll want to use a custom .agg() function so that we can apply different summarisation techniques to unique columns:

grouped_parent_keywords = grouped_parent_keywords.agg({'Keyword': 'count',
                             'Difficulty': 'mean',
                             'Volume': 'sum',
                             'CPC': 'mean',
                             'CPS': 'mean',
                             'Return Rate': 'mean'})

We Do Section:

Now we can use similar techniques to those we learned earlier. Code along and sort the grouped dataframe by:

  • Difficulty
  • Volume

grouped_parent_keywords.sort_values(by='Volume', ascending=False, inplace=True)
grouped_parent_keywords.sort_values(by='Difficulty', ascending=False)

Removing Exact Duplicates In Pandas

Let’s now return to our original dataframe and practice some other useful methods. Firstly we can attempt to drop any duplicates within our keyword column:

df.drop_duplicates(subset=['Keyword'], inplace=True)

However as there are no duplicates inside of the keywords column, let’s get a de-duplicated list of parent keywords by using the following command:


.drop_duplicates(subset=['column_name'])

Then we will select the Parent Keyword column, and convert it into a python list with:


.tolist()
parent_keywords_df = df.drop_duplicates(subset=['Parent Keyword'])
parent_keywords_list = parent_keywords_df['Parent Keyword'].to_list()[0:10]
print(f"This is the first 10 items of a de-duplicated python list from the parent column: {parent_keywords_list}")
This is the first 10 items of a de-duplicated python list from the parent column: ['coast', 'hubspot', 'digital marketing', 'digital', 'content', 'digital media', 'digital marketing agency', 'digital uk', 'and digital', 'online marketing']

DataFrame Filtering (subsetting)

The easiest type of filtering is to use the == (equality) operator.

For example, we might want to find keywords that are equal to hubspot within our dataframe:

single_keyword_mask = df['Keyword'] == 'hubspot'
single_keyword_df = df[single_keyword_mask]
single_keyword_df
   Keyword  Country  Difficulty   Volume   CPC  Clicks   CPS   Return Rate  Parent Keyword  Last Update          SERP Features                                      Stemmed Keyword
1  hubspot  gb       67.0         63000.0  5.0  59708.0  0.95  2.34         hubspot         2020-05-09 00:57:01  Adwords top, Sitelinks, Top stories, Thumbnail…  hubspot

We could also write the same filter like so:

single_keyword_df = df[df['Keyword'] == 'hubspot']

Additionally, we might want to filter our dataframe to only find keywords that have a search volume greater than 50.

df['column_name'] > 50

The above command will return a pandas series object containing either True or False for every row. We can then save that as a mask, which can be applied to filter the original dataframe:

mask = df['Volume'] > 50
filtered_dataframe = df[mask]

Another shorthand way to accomplish the same thing would be:

filtered_dataframe  = df[df['Volume'] > 50]

How To Filter A Dataframe By Multiple Boolean Conditions

We can also filter the dataframe by several columns at once by chaining the boolean conditions together with the & (AND) operator:

mask_two = (df['Volume'] > 50) & (df['CPC'] > 2.0)
two_column_filtered_dataframe = df[mask_two]

This could also be written like this:


two_column_filtered_dataframe = df[(df['Volume'] > 50) & (df['CPC'] > 2.0)]

How To Do OR and AND Operations With Multiple DataFrame Subsets

We can also do OR operations with the pipe operator |

mask_three = (df['Volume'] > 50) | (df['CPC'] > 2.0)
or_dataframe_subset= df[mask_three]

Which could also be written like this:

two_column_filtered_dataframe = df[(df['Volume'] > 50) | (df['CPC'] > 2.0)]

Grouping De-duplicated Keywords With FuzzyWuzzy

As well as de-duplicating lists of keywords, it would also be useful to keep all of the keywords but bucket them into keyword groups based upon how similar each keyword is to every other keyword.

Without going into the specifics of how this function works, you can use it as a way to group keywords based upon their FuzzyWuzzy score:

def word_grouper(df, column_name=None, limit=6, threshold=85):
    # Create a de-duplicated list of candidate keywords to match against
    candidates = df.drop_duplicates(subset=column_name)[column_name].tolist()
    master_dict = {}
    processed_words = []
    no_matches = []

    for item in df[column_name]:
        # Pop the current keyword off the candidate list so we never match a keyword against itself
        try:
            candidates.pop(0)
        except IndexError as e:
            print(e)

        # Only loop over the keywords that aren't already grouped
        if item not in processed_words:
            try:
                # Extract the top N matches for the current keyword
                matches = process.extract(item, candidates, limit=limit)
                # Keep only the matches above the threshold, so that we never
                # cluster words together with a low match score
                matches = [match for match in matches if match[1] > threshold]
                matched_words = [match[0] for match in matches]
                # Save the matches to a dictionary, keyed by the current keyword
                master_dict[item] = matches
                # Record the matched words so they aren't processed again
                processed_words.extend(matched_words)
            except Exception:
                no_matches.append(item)
    return master_dict

So let’s run this function with our original dataframe. Notice how we must pass several arguments into this function, including:

  • df: The dataframe.
  • column_name: The name of the column that we’d like to group.
  • limit: The maximum number of matches per group.
  • threshold: The minimum match score (0 to 100, based upon Levenshtein distance) required for two keywords to be grouped together.

grouped_words = word_grouper(df, column_name='Keyword', limit=6, threshold=70)
test = pd.DataFrame.from_dict(grouped_words, orient='index')
test.iloc[0:20,:]

Additional Ways That We Can Improve Our Keyword De-Duplication Efforts

1. Using .lower()

Calling .lower() on a list of strings ensures that duplicates which differ only in casing, such as “digital marketing services” vs “Digital Marketing Services”, are normalised with this in-built python string method.

example = ['DIGITAL MARKETING SERVICES', 'digital marketing services', 'Digital Marketing Services']
example = [word.lower() for word in example]
print(example)
['digital marketing services', 'digital marketing services', 'digital marketing services']
de_duplicated_example = list(set(example))
print(f"This is the de_duplicated_example {de_duplicated_example}")
This is the de_duplicated_example ['digital marketing services']

2. NLP Stemming + Lemmatisation

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
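
The tutorial code below uses stemming, but as a quick illustration of lemmatization itself, here is a minimal sketch using NLTK’s WordNetLemmatizer (this assumes the wordnet corpus has been downloaded; it isn’t part of the course code):

import nltk
nltk.download('wordnet') # The WordNet corpus is a separate download
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('services')) # service
print(lemmatizer.lemmatize('meaning'))  # meaning (already in its dictionary form)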

Luckily I’ve already got some stemming + lemmatisation helper functions that we can utilise on our previous dataframe.

Firstly we’ll need to install the python package called NLTK (you will also need to download the punkt tokenizer models as a separate step):

pip install nltk

import nltk
nltk.download('punkt') # The punkt tokenizer models are a separate download
from nltk.util import ngrams
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

porter = PorterStemmer()
lancaster = LancasterStemmer()

# NLP Processing
def stem_sentence(sentence):
    # Tokenize the sentence into individual words
    token_words = word_tokenize(sentence)
    stemmed_words = []
    for word in token_words:
        stemmed_words.append(porter.stem(word))
        stemmed_words.append(" ")
    return "".join(stemmed_words)

# Rejoin words after stemming
def rejoin_words(row):
    return " ".join(row)

# Prepare a de-duplicated list of all of the words in a criteria report
def prepare_list(df, column):
    results = []
    for t in df[column]:
        for word in t.split():
            results.append(word)
    # Remove duplicates:
    return list(set(results))
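
Before applying these helpers to the dataframe, here is a quick sanity check of what they produce on small inputs (illustrative only; the stemmed output is what the Porter stemmer is expected to return):

print(stem_sentence('digital marketing services'))
# 'digit market servic ' (note the trailing space added by the join)
print(prepare_list(df, 'Keyword')[0:5])
# The first 5 items of a de-duplicated list of the individual words in the Keyword column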

The .apply() Method

def test(x):
    return x

Now we will use the .apply() method on the dataframe. On a pandas Series (a single column) it runs a function on every value, one at a time, whereas on a whole DataFrame you can choose the direction with axis=0 (apply the function to each column) or axis=1 (apply it to each row).

For example, let’s use the very simple function above (test). This will simply return every row:

df['Keyword'].apply(test)
0                                coast
1                              hubspot
2                    digital marketing
3                              digital
4                      content meaning
                    ...               
603            results through digital
604    star digital marketing services
610         how to market b2b services
620           phone marketing services
641        campaign marketing services
Name: Keyword, Length: 602, dtype: object

We could also have written the same example with a lambda function:

df['Keyword'].apply(lambda x: x) # This would simply just return the rows, one by one.
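
A slightly more useful lambda (a sketch, not from the original notebook) would be computing the word count of every keyword:

df['Keyword'].apply(lambda keyword: len(keyword.split())) # Returns a pd.Series with the word count of every keyword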

How To Create A New Column In Pandas

Now let’s apply some stemming to our keyword column with the .apply() method, then save the result as a new column by assigning it back to the original dataframe under a new name:

df['Stemmed Keyword'] = df['Keyword'].apply(stem_sentence)
df.head(6)
   Keyword            Country  Difficulty   Volume   CPC   Clicks   CPS   Return Rate  Parent Keyword     Last Update          SERP Features                                      Stemmed Keyword
0  coast              gb       42.0         70000.0  2.5   64715.0  0.93  1.67         coast              2020-05-09 23:55:39  Sitelinks                                        coast
1  hubspot            gb       67.0         63000.0  5.0   59708.0  0.95  2.34         hubspot            2020-05-09 00:57:01  Adwords top, Sitelinks, Top stories, Thumbnail…  hubspot
2  digital marketing  gb       74.0         19000.0  7.0   11033.0  0.57  1.41         digital marketing  2020-05-09 07:11:09  Adwords top, Sitelinks, People also ask, Top s…  digit market
3  digital            gb       89.0         16000.0  2.5   5912.0   0.38  1.19         digital            2020-05-09 02:26:10  Sitelinks, People also ask, Top stories, Thumb…  digit
4  content meaning    gb       45.0         4400.0   17.0  622.0    0.14  1.18         content            2020-05-10 06:32:24  Knowledge card, People also ask, Top stories, …  content mean
5  digital media      gb       24.0         3600.0   3.0   1671.0   0.47  1.24         digital media      2020-05-10 14:14:41  Featured snippet, Thumbnails, People also ask,…  digit media

We could then use these stemmed/lemmatized keywords instead of the original keywords whilst performing de-duplication with FuzzyWuzzy!

stemmed_keywords = df['Stemmed Keyword'].tolist() # Converting the stemmed keywords to a list
de_duplicated_stemmed_keywords = list(set(stemmed_keywords)) # De-duplicating any stemmed keywords via set(list())
stemmed_unique_keywords = list(process.dedupe(de_duplicated_stemmed_keywords, threshold=80)) # De-duplicating the stemmed keywords with FuzzyWuzzy

How To Filter Rows Via A Python List In Pandas

The code below shows how you can filter the dataframe to return only the rows whose value in a given column appears within a python list. You will need to use the .isin() method to accomplish this:

df[df['Stemmed Keyword'].isin(python_list)]
unique_stemmed_df = df[df['Stemmed Keyword'].isin(stemmed_unique_keywords)]

How To Reset The Index In A Pandas Dataframe

Also, you will notice that the index on the rows is no longer sequential after filtering:

unique_stemmed_df.head(6)

    Keyword                     Country  Difficulty   Volume   CPC   Clicks   CPS   Return Rate  Parent Keyword              Last Update          SERP Features                                      Stemmed Keyword
1   hubspot                     gb       67.0         63000.0  5.0   59708.0  0.95  2.34         hubspot                     2020-05-09 00:57:01  Adwords top, Sitelinks, Top stories, Thumbnail…  hubspot
4   content meaning             gb       45.0         4400.0   17.0  622.0    0.14  1.18         content                     2020-05-10 06:32:24  Knowledge card, People also ask, Top stories, …  content mean
10  hub spot                    gb       53.0         2600.0   7.0   1820.0   0.71  1.37         hubspot                     2020-05-05 05:59:21  Adwords top, Sitelinks, Top stories, Thumbnail…  hub spot
15  climb online                gb       3.0          1300.0   0.0   1228.0   0.95  1.20         climb online                2020-05-04 21:52:16  Sitelinks, Top stories, Thumbnails               climb onlin
18  company meaning             gb       55.0         1200.0   0.0   158.0    0.13  1.11         مؤسسة                       2020-05-08 16:52:22  Knowledge card, People also ask, Sitelinks, To…  compani mean
20  digital marketing strategy  gb       65.0         1200.0   9.0   1112.0   0.96  1.36         digital marketing strategy  2020-05-08 20:21:50  Featured snippet, Thumbnails, People also ask,…  digit market strategi

We can reset the index with:

unique_stemmed_df.reset_index(drop=True, inplace=True) # The drop=True removes the original index column.

How To Save A Pandas DataFrame To A CSV File

After you’ve performed any data analysis or scripting, you can easily save a pandas series object / pandas dataframe to CSV with:

df.to_csv('name_of_file.csv', index=True)
unique_stemmed_df.to_csv('stemmed_dataframe.csv', index=True)
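
If you would rather not write the numeric index out as an extra column, pass index=False instead (a minor variation on the line above):

unique_stemmed_df.to_csv('stemmed_dataframe.csv', index=False)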

Congratulations, you’ve made it to the end of the first tutorial!

We’ve introduced you to the following frameworks:

  • Python Sets
  • FuzzyWuzzy
  • Pandas
  • NLP Stemming & Lemmatization

It’s time to continue on our epic journey and learn more scripts which will help to automate your SEO life! ❤
