How To Combine Multiple CSV Files In Python

James Phoenix
James Phoenix

As this course is being progressively released, whenever a new article and video is released, after initially git cloning the repository. You will need to run this command within your command line / terminal (from the root directory of the course):

git pull origin master

This will pull any recent changes that have been made on the github.com version of the course and will allow you to easily get fresh content as it is added.



Learning Outcomes

  • To learn what the pd.concat() method is and how it works
  • Learn how to combine multiple csv files using Pandas

Firstly let’s say that we have 5, 10 or 100 .csv files. Combining all of these by hand can be incredibly tiring and definitely deserves to be automated. Therefore in today’s exercise, we’ll combine multiple csv files within only 8 lines of code.

For this tutorial, I’ve already prepared 5 top pages .csv reports from Ahrefs which can be found in the following directory:


'data/pbpython.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-24-30-2c4e06e2ea39cce8b2f514e24c929e51.csv'
'data/machinelearningmastery.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-25-38-b4088b6c6c987653748b10c0543a5713.csv'

One of the problems with automatically detecting csv files is that the names are dynamically generated. Therefore we will be using the .csv file extension name and a python package called glob to automatically detect all of the files ending with a .csv name within a specific working directory.


Import packages and set the working directory

You will need to change “/directory” to your specific directory.

Unleash Your Potential with AI-Powered Prompt Engineering!

Dive into our comprehensive Udemy course and learn to craft compelling, AI-optimized prompts. Boost your skills and open up new possibilities with ChatGPT and Prompt Engineering.

Embark on Your AI Journey Now!
import os
import glob
import pandas as pd
# os.chdir("/directory")

By writing pwd within the command line, we can identify the exact file path that these Ahrefs top page .csv files are located in:


pwd
/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations/data

Let’s now move into our desired working directory where the csv files are:

os.chdir("/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations/data")

Now let’s running !ls and !pwd just to show that we have changed directory:

!ls 
!pwd

Pro-tip: using ! before a linux command allows you to run the unix/linux commands within a jupyter notebook file!


!pwd
/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations/data

!ls 
hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv
machinelearningmastery.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-25-38-b4088b6c6c987653748b10c0543a5713.csv
pbpython.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-24-30-2c4e06e2ea39cce8b2f514e24c929e51.csv
towardsdatascience.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-15-48-03e00fb8d3976a642dd2db330422cef7.csv
www.datacamp.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-21-48-821c6875ceb05705bce7e0a5a032d622.csv

Step 2: Use Global To Match The Pattern ‘.csv’

We will now match the file pattern (‘.csv’) within all of the files located in the current working directory.

file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]

print(f"These are all of the filenames ending in .csv {all_filenames}.")
These are all of the filenames ending in .csv ['hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv', 'machinelearningmastery.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-25-38-b4088b6c6c987653748b10c0543a5713.csv', 'pbpython.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-24-30-2c4e06e2ea39cce8b2f514e24c929e51.csv', 'towardsdatascience.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-15-48-03e00fb8d3976a642dd2db330422cef7.csv', 'www.datacamp.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-21-48-821c6875ceb05705bce7e0a5a032d622.csv'].

Step 3: Let’s Combine All Of The Files Within The List And Export as a CSV

In the code below we will read all of the csv’s and will then use the pd.concat() method to stack every dataframe one on top of another.

But before we do that, let’s make sure that we can get one result within a pandas dataframe by adding the appropriate encoding:

  • UTF-16 (This is a specific encoding type).
  • \t (tab delimited data).
all_filenames[0]
'hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv'
df = pd.read_csv('hackernoon.com-top-pages-by-traffic-subdomains-US-17-May-2020_20-20-38-3c2d0502092ca7b22c9527c370526f6b.csv', delimiter='t', encoding='UTF-16')

print(df.shape)
(267, 11) # There are 267 rows and 11 columns and our pd.read_csv() was successful! 

combined_csv_data = pd.concat([pd.read_csv(f, delimiter='t', encoding='UTF-16') for f in all_filenames])

Now let’s break down what the above line of code does, firstly we loop over all of the filenames and assign them one by one to the f variable. Each csv file is then read & converted into a pandas dataframe with:


pd.read_csv(f, delimiter='t', encoding='UTF-16')

Then we concatenate all of the dataframes together and stack them one on top of each other using:


pd.concat()  # This command by default will stack dataframes vertically rather than horizontally.

That’s it, within 8 lines of code you’re now able to easily combine as many .csv files as you want!

  • Remember that all of the csv files must have the same columns otherwise you will not be able to effectively concatenate them!

Step 4 Save Your New DataFrame To CSV

Let’s now use the os.chdir(‘..’) to go up one working directory before saving our data:

os.chdir('..')

pwd
'/Users/jamesphoenix/Desktop/Imran_And_James_Projects/Python_For_SEO/2_bulk_csv_operations'
combined_csv_data.to_csv('combined_csv_data.csv') #Saving our combined csv data as a new file!

TaggedPython For SEO


More Stories

Cover Image for Soft Skills for Programmers: Why They Matter and How to Develop Them

Soft Skills for Programmers: Why They Matter and How to Develop Them

Overview You need a variety of soft skills in addition to technical skills to succeed in the technology sector. Soft skills are used by software professionals to collaborate with their peers effectively and profitably. Finding out more about soft skills and how they are used in the workplace will help you get ready for the job if you are interested in a…

James Phoenix
James Phoenix
Cover Image for What Are Webhooks? And How Do They Relate to Data Engineering?

What Are Webhooks? And How Do They Relate to Data Engineering?

Webhooks are a simple and powerful method for receiving real-time notifications when certain events occur. They enable a whole host of automated and interconnected applications. Broadly speaking, your apps can communicate via two main ways: polling and webhooks.  Polling is like going to a shop and asking for pizza – you have to ask whenever you want it. Webhooks are almost the…

James Phoenix
James Phoenix