As this course is being progressively released, whenever a new article and video is released, after initially git cloning the repository. You will need to run this command within your command line / terminal (from the root directory of the course):
git pull origin master
This will pull any recent changes that have been made on the github.com version of the course and will allow you to easily get fresh content as it is added.
Learning Outcomes
- To learn what the pd.concat() method is and how it works
- Learn how to combine multiple csv files using Pandas
Firstly let’s say that we have 5, 10 or 100 .csv files. Combining all of these by hand can be incredibly tiring and definitely deserves to be automated. Therefore in today’s exercise, we’ll combine multiple csv files within only 8 lines of code.
For this tutorial, I’ve already prepared 5 top pages .csv reports from Ahrefs which can be found in the following directory:
'data/pbpython.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-24-30-2c4e06e2ea39cce8b2f514e24c929e51.csv'
'data/machinelearningmastery.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-25-38-b4088b6c6c987653748b10c0543a5713.csv'
One of the problems with automatically detecting csv files is that the names are dynamically generated. Therefore we will be using the .csv file extension name and a python package called glob to automatically detect all of the files ending with a .csv name within a specific working directory.
Import packages and set the working directory
You will need to change “/directory” to your specific directory.
import os
import glob
import pandas as pd
# os.chdir("/directory")
By writing pwd within the command line, we can identify the exact file path that these Ahrefs top page .csv files are located in:
pwd
/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations/data
Let’s now move into our desired working directory where the csv files are:
os.chdir("/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations/data")
Now let’s running !ls and !pwd just to show that we have changed directory:
!ls
!pwd
Pro-tip: using ! before a linux command allows you to run the unix/linux commands within a jupyter notebook file!
!pwd
/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations/data
!ls
hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv
machinelearningmastery.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-25-38-b4088b6c6c987653748b10c0543a5713.csv
pbpython.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-24-30-2c4e06e2ea39cce8b2f514e24c929e51.csv
towardsdatascience.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-15-48-03e00fb8d3976a642dd2db330422cef7.csv
www.datacamp.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-21-48-821c6875ceb05705bce7e0a5a032d622.csv
Step 2: Use Global To Match The Pattern ‘.csv’
We will now match the file pattern (‘.csv’) within all of the files located in the current working directory.
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
print(f"These are all of the filenames ending in .csv {all_filenames}.")
These are all of the filenames ending in .csv ['hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv', 'machinelearningmastery.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-25-38-b4088b6c6c987653748b10c0543a5713.csv', 'pbpython.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-24-30-2c4e06e2ea39cce8b2f514e24c929e51.csv', 'towardsdatascience.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-15-48-03e00fb8d3976a642dd2db330422cef7.csv', 'www.datacamp.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-21-48-821c6875ceb05705bce7e0a5a032d622.csv'].
Step 3: Let’s Combine All Of The Files Within The List And Export as a CSV
In the code below we will read all of the csv’s and will then use the pd.concat() method to stack every dataframe one on top of another.
But before we do that, let’s make sure that we can get one result within a pandas dataframe by adding the appropriate encoding:
- UTF-16 (This is a specific encoding type).
- \t (tab delimited data).
all_filenames[0]
'hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv'
df = pd.read_csv('hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv', delimiter='t', encoding='UTF-16')
print(df.shape)
(267, 11) # There are 267 rows and 11 columns and our pd.read_csv() was successful!
combined_csv_data = pd.concat([pd.read_csv(f, delimiter='t', encoding='UTF-16') for f in all_filenames])
Now let’s break down what the above line of code does, firstly we loop over all of the filenames and assign them one by one to the f variable. Each csv file is then read & converted into a pandas dataframe with:
pd.read_csv(f, delimiter='t', encoding='UTF-16')
Then we concatenate all of the dataframes together and stack them one on top of each other using:
pd.concat() # This command by default will stack dataframes vertically rather than horizontally.
That’s it, within 8 lines of code you’re now able to easily combine as many .csv files as you want!
- Remember that all of the csv files must have the same columns otherwise you will not be able to effectively concatenate them!
Step 4 Save Your New DataFrame To CSV
Let’s now use the os.chdir(‘..’) to go up one working directory before saving our data:
os.chdir('..')
pwd
'/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations'
combined_csv_data.to_csv('combined_csv_data.csv') #Saving our combined csv data as a new file!