In modern business and research, data is almost never a "single file" affair. Imagine a retail chain that generates 365 separate CSV files—one for every day of the year. Or a medical researcher receiving patient stats from ten different clinics.
To find trends, perform calculations, or create charts, you first need to "stack" these files together. Merging CSV files in Python with the pandas library is a widely used way to do this. Excel caps a worksheet at 1,048,576 rows and can slow down well before that, whereas pandas can work with any dataset that fits in your computer's memory. In this tutorial, we'll break down every single line of code so that even if you've never written a script before, you'll understand exactly what's happening under the hood.
The "Filing Cabinet" Analogy
Think of your computer's hard drive as a massive filing cabinet. Each CSV file is a single sheet of paper inside that cabinet.
The Problem:
You need to read the data as one continuous list to find the total sales for the year.
The Solution:
Python is like a robotic assistant. You tell it to go to the cabinet, grab specific sheets, and tape them together into one long scroll (the DataFrame). This scroll can then be analyzed, edited, and eventually saved back into the cabinet as a new, master sheet.
Step 1: Merging Files One by One
Let's begin by loading and merging two CSV files into a single DataFrame. This is the simplest method where we explicitly load and combine each file.
Loading the Files
To load data from a CSV file, we use the pd.read_csv() function. This reads the contents of the CSV file and loads it into a pandas DataFrame, which is like a table or spreadsheet.
import pandas as pd
# Load the first CSV file
df1 = pd.read_csv('january_sales.csv')
# Load the second CSV file
df2 = pd.read_csv('february_sales.csv')
Behind the Code:
pd.read_csv('january_sales.csv'): The read_csv() function reads the contents of the CSV file and converts it into a pandas DataFrame, a table-like structure in Python. Here, df1 holds the data from January, and df2 holds the data from February.
Now, we have two DataFrames (df1 and df2) containing data from January and February.
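If you don't have sales files on hand, here is a small sketch you can run to try the loading step. It writes two tiny CSVs to a temporary folder and reads them back. The column names (date, product, amount) and the sample values are made up for illustration; only pd.read_csv() comes from the steps above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
january_text = "date,product,amount\n2023-01-05,Widget,120\n2023-01-09,Gadget,80\n"
february_text = "date,product,amount\n2023-02-02,Widget,95\n"

with tempfile.TemporaryDirectory() as folder:
    jan_path = Path(folder) / "january_sales.csv"
    feb_path = Path(folder) / "february_sales.csv"
    jan_path.write_text(january_text)
    feb_path.write_text(february_text)

    # read_csv loads the whole file into memory, so the DataFrames
    # remain usable after the temporary folder is deleted.
    df1 = pd.read_csv(jan_path)
    df2 = pd.read_csv(feb_path)

# shape is (rows, columns); the header line is not counted as a row
print(df1.shape)
print(df2.shape)
print(df1.head())
```

Checking shape and head() right after loading is a quick way to confirm that each file was read with the columns you expect.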
Merging the Files
Once the files are loaded, we can merge them into a single DataFrame using the pd.concat() function. This function stacks the data from the two DataFrames vertically.
# Concatenate the two DataFrames vertically
merged_df = pd.concat([df1, df2], ignore_index=True)
Behind the Code:
pd.concat([df1, df2], ignore_index=True): The pd.concat() function concatenates (joins) the DataFrames vertically. Since we are stacking data from January and February, this ensures that the rows from both DataFrames are placed one after the other.
ignore_index=True: This ensures that the row numbers are reset and sequential in the merged DataFrame. If you don't use this, pandas will keep the original row numbers, which may result in duplicate index values.
Now, merged_df contains the data from both January and February, stacked one below the other.
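To see what ignore_index actually changes, the sketch below concatenates the same two DataFrames with and without it. The DataFrames are built by hand with made-up values so the example runs without any files.

```python
import pandas as pd

# Small stand-in DataFrames (hypothetical values)
df1 = pd.DataFrame({"product": ["Widget", "Gadget"], "amount": [120, 80]})
df2 = pd.DataFrame({"product": ["Widget", "Gizmo"], "amount": [95, 60]})

kept = pd.concat([df1, df2])                      # original row labels are kept
reset = pd.concat([df1, df2], ignore_index=True)  # row labels are renumbered

print(list(kept.index))
print(list(reset.index))

# With duplicate labels, asking for row 0 returns one row from each file
print(kept.loc[0])
```

The duplicate labels in the first result are harmless for simple totals, but they can be confusing when you later select rows by label, which is why the tutorial resets them.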
Step 2: Handling Missing Data
In the merging process, there may be missing data, or "NaN" (Not a Number) values, especially if one CSV file has columns that the other doesn't. To handle this, we can use the fillna() method to replace any missing values.
# Replace all NaN values with 'No Data'
merged_df.fillna('No Data', inplace=True)
Behind the Code:
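merged_df.fillna('No Data', inplace=True): The fillna() method replaces all missing (NaN) values with the string 'No Data'. This ensures that there are no empty values in the final DataFrame. The inplace=True argument tells pandas to apply the change directly to merged_df instead of returning a new DataFrame. Keep in mind that a numeric column filled with a string is no longer purely numeric, so for columns you plan to add up, a number such as 0 is usually a safer filler.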
This step ensures that any missing values are appropriately filled before we move on.
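One possible refinement, shown here as a sketch rather than a required step, is to give each column its own filler so that numeric columns stay numeric. The column names region and discount are invented for this example; fillna() accepts a dictionary that maps column names to fill values.

```python
import pandas as pd

# Hypothetical frames: only the first has "region", only the second has "discount"
df1 = pd.DataFrame({"product": ["Widget"], "amount": [120], "region": ["North"]})
df2 = pd.DataFrame({"product": ["Gadget"], "amount": [80], "discount": [0.1]})

merged_df = pd.concat([df1, df2], ignore_index=True)
print(merged_df.isna().sum())  # number of missing values in each column

# Fill each column with a value of a matching type
filled = merged_df.fillna({"region": "No Data", "discount": 0.0})
print(filled.dtypes)
print(filled["discount"].sum())
```

Counting missing values with isna().sum() before filling also tells you which files are missing which columns, which is often worth knowing before you hide the gaps.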
Step 3: Saving the Merged File
After cleaning, we save the result as a new master CSV file. This is the final step of the workflow for small batches of files.
# Save the merged DataFrame to a new CSV file
merged_df.to_csv('annual_report_2023.csv', index=False)
Behind the Code:
merged_df.to_csv('annual_report_2023.csv', index=False): The to_csv() function writes the DataFrame to a new CSV file.
index=False: This ensures that the row numbers (index) are not saved as a separate column in the output CSV file. We only want the data.
Now, you have successfully merged two CSV files and saved the result in a new file.
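To make the effect of index=False visible, this sketch writes the same DataFrame twice and prints the resulting text. It writes to in-memory buffers (io.StringIO) instead of real files purely so you can inspect the output; the sample data is made up.

```python
import io

import pandas as pd

merged_df = pd.DataFrame({"product": ["Widget", "Gadget"], "amount": [120, 80]})

# In-memory text buffers stand in for files on disk
with_index = io.StringIO()
without_index = io.StringIO()

merged_df.to_csv(with_index)                   # default: the index is written too
merged_df.to_csv(without_index, index=False)   # only the data columns are written

print(with_index.getvalue())
print(without_index.getvalue())
```

In the first output, the header starts with an empty column name and every row starts with its row number. If you reload that file later, the row numbers come back as an extra unnamed column.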
Step 4: Automating the Process for Multiple Files with glob
Now, let's say you have a folder full of CSV files and you want to merge them all. Manually loading each file one by one would be tedious. That's where the glob module comes in.
The glob module helps you find files based on patterns (like *.csv for all CSV files). It saves you the trouble of manually listing each file and lets you automatically grab all the files you need.
Using glob to Find CSV Files
We can use glob to find all CSV files in a directory. This way, we don't have to specify each file manually.
import glob
# Use glob to find all CSV files in the directory
csv_files = glob.glob('path/to/your/files/*.csv')
Behind the Code:
glob.glob('path/to/your/files/*.csv'): This searches for all CSV files in the specified directory (path/to/your/files/). The *.csv pattern tells glob to find every file that ends with .csv.
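One detail worth knowing: glob.glob() returns matches in an arbitrary order, so if the order of rows in the merged file matters, you may want to sort the list. The sketch below creates a few empty placeholder files in a temporary folder (the file names are hypothetical) and shows that only the .csv files are matched.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files (hypothetical names)
    for name in ["sales_02.csv", "sales_01.csv", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()

    pattern = os.path.join(folder, "*.csv")
    csv_files = sorted(glob.glob(pattern))  # sort for a predictable order
    file_names = [os.path.basename(path) for path in csv_files]

print(file_names)
```

Sorting works as expected here because the names are zero-padded (01, 02). With names like sales_2.csv and sales_10.csv, a plain alphabetical sort would put 10 before 2.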
Loading and Merging All CSV Files
Once we have the list of files, we can use a list comprehension (a compact form of a loop) to load them all, then merge them.
# Load all CSV files into DataFrames
dfs = [pd.read_csv(file) for file in csv_files]
# Concatenate all DataFrames into one
merged_df = pd.concat(dfs, ignore_index=True)
Behind the Code:
[pd.read_csv(file) for file in csv_files]: This list comprehension reads each CSV file found by glob and loads it into a DataFrame.
pd.concat(dfs, ignore_index=True): This concatenates all the DataFrames in the list dfs into a single DataFrame, just like we did with two files earlier.
By using glob, we've automated the process of finding and loading all the CSV files in the folder. This is extremely useful if you have a large number of files and don't want to manually list each one.
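Putting the two pieces together, here is one way the whole folder merge could be wrapped in a function. This is a sketch under a few assumptions: the function name merge_folder and the extra source_file column are additions for illustration, not part of the steps above. The empty-folder check is there because pd.concat() raises a ValueError when it is given an empty list.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_folder(folder):
    """Read every CSV in `folder` and stack them into one DataFrame."""
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not csv_files:
        # pd.concat raises ValueError on an empty list; give a clearer message
        raise FileNotFoundError(f"No CSV files found in {folder}")

    dfs = []
    for file in csv_files:
        df = pd.read_csv(file)
        df["source_file"] = os.path.basename(file)  # remember where each row came from
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical sample files
    with open(os.path.join(folder, "day_01.csv"), "w") as f:
        f.write("product,amount\nWidget,120\n")
    with open(os.path.join(folder, "day_02.csv"), "w") as f:
        f.write("product,amount\nGadget,80\nGizmo,60\n")

    merged_df = merge_folder(folder)

print(merged_df)
```

Recording the source file is optional, but it makes it much easier to trace a strange value back to the file it came from once hundreds of files have been stacked together.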
Step 5: Data Cleaning & Handling Missing Info
As with the earlier approach, once we've merged the files, we can clean the data by replacing missing values.
# Replace all NaN values with 'No Data'
merged_df.fillna('No Data', inplace=True)
Step 6: Saving the Master File
Finally, after merging and cleaning, we save the master file just as we did earlier.
# Save to a new CSV file
merged_df.to_csv('annual_report_2023.csv', index=False)
Common Pitfalls: Troubleshooting Like a Pro
Even for experts, things go wrong. Here are the three most common errors you'll encounter:
FileNotFoundError
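Python can't find your CSV. A bare file name such as 'january_sales.csv' is looked up in the folder you run the script from (the current working directory), which is not always the folder the script is saved in. Check that the file is really there, or pass the full path to the file.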
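When this error appears, it helps to see exactly where Python looked. The helper below is a hypothetical wrapper (load_csv is not a pandas function) that checks for the file first and reports both the full path it tried and the current working directory.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV, reporting where Python actually looked if it is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path.resolve()} not found (working directory: {Path.cwd()})"
        )
    return pd.read_csv(path)


try:
    load_csv("definitely_missing_file.csv")  # deliberately missing, for the demo
except FileNotFoundError as error:
    message = str(error)
    print(message)
```

The printed message shows the full path that was tried, which usually makes the mistake obvious at a glance.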
Column Mismatch
If CSV A has "Price" and CSV B has "Cost", pd.concat() won't combine them into one column. It keeps both columns and fills the gaps with NaN. Rename the headers so they are identical before merging.
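The sketch below reproduces the mismatch with two hand-made DataFrames and then fixes it with rename(). The product names and prices are made up; the Price and Cost headers follow the example above.

```python
import pandas as pd

df_a = pd.DataFrame({"Product": ["Widget"], "Price": [9.99]})
df_b = pd.DataFrame({"Product": ["Gadget"], "Cost": [4.50]})

# Without renaming: Price and Cost stay separate, each with one NaN
mismatched = pd.concat([df_a, df_b], ignore_index=True)
print(mismatched)

# Rename the odd header so both frames use the same column name
df_b = df_b.rename(columns={"Cost": "Price"})
aligned = pd.concat([df_a, df_b], ignore_index=True)
print(aligned)
```

Note that pandas does not raise an error in the mismatched case, so the only warning sign is an unexpected extra column full of NaN values.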
Memory Error
Trying to merge 10GB of data on a 4GB laptop? Pass chunksize to read_csv() so the file is read and processed a fixed number of rows at a time instead of all at once.
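Here is a minimal sketch of chunked reading. A ten-row in-memory CSV stands in for a huge file, and the chunk size of 4 is deliberately tiny so you can see several chunks; for a real file you would pick a much larger number. With chunksize set, read_csv() returns an iterator of DataFrames rather than a single DataFrame.

```python
import io

import pandas as pd

# In-memory stand-in for a large file (hypothetical data)
csv_text = "product,amount\n" + "".join(f"item{i},{i}\n" for i in range(10))

total = 0
rows_seen = 0

# Each chunk is an ordinary DataFrame holding at most 4 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["amount"].sum()
    rows_seen += len(chunk)

print(rows_seen, total)
```

This only saves memory if you reduce each chunk to something small (a running total, a filtered subset) before moving on. Collecting every chunk in a list and concatenating them all would use as much memory as reading the file in one go.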
The Power of Automation
You have now completed a workflow that lets you process hundreds of files in seconds. The three steps we've covered (importing, loading, and concatenating) are the backbone of many modern data engineering pipelines. In fact, advanced learners are now using Generative AI to automate this entire workflow.
Summary Checklist
This concludes our guide on merging CSV files in Python. Enjoy automating your data workflows!
By first understanding how to merge files one by one, and then introducing glob to automate the process for multiple files, this guide ensures you have a solid foundation in merging data in Python.