How do I concatenate files that have multiple sheets with same column headers but randomly ordered in Python/Pandas?
I have 3 xls files that have 3 sheets per file. All sheets have same column headers but order as you see below is different
1.xls
Name Address Date City State Zip
2.xls
Address Date City Zip Name State
3.xls
City Zip Name Address Date State
I want my final xls file to concatenate all 3 files and sheets
Output.xls
Name Address Date City State Zip RowNumber SheetName
The rownumber should be the specific row number from each file and sheet the data comes from before concatenation.Sheetname should be the sheet it comes from within the xls file.
My attempt-
import os
import pandas as pd
#set src directory
os.chdir('C:/Users/hhh/Desktop/python/Concat')
def read_sheets(filename):
result = []
sheets = pd.read_excel(filename, sheet_name=None)
for name, sheet in sheets.items():
sheet['Sheetname'] = name
sheet['Row'] = sheet.index
result.append(sheet)
return pd.concat(result, ignore_index=True)
files = [file for file in os.listdir(folder_path) if file.endswith(".xls")]
dfoo = read_sheets(files)
But nothing happens and i just receive an assertion error saying assert content_or_path is not None. Is this because column orders don't match? is there a workaround? Number of columns are same in all files and sheets. Within each file sheets have same order. But if you compare 1.xls sheets to 2.xls the column order is different as you can see in my reprex above
Answer
The issue you're encountering (assert content_or_path is not None
) is likely because of how you're handling the file paths and passing them to pd.read_excel()
. The error suggests that files
is not being correctly used as a valid input path for pd.read_excel()
.
Issues:
-
File Path: You’re using
files
, which is just the list of file names, but you're not providing the full path when trying to read each file.os.listdir()
will return the filenames, and you need to prepend the full directory path to open the file correctly. -
Sheet Name Column Handling: The column order is different across files, and this may create issues when trying to merge them. A simple solution is to reorder columns after reading each sheet based on a standard column order (which can be derived from any one of the files).
-
Adding RowNumber: Instead of using
sheet.index
, it's better to usesheet.reset_index(drop=False)
to ensure the row number is preserved correctly, especially when dealing with files that may have non-sequential indices.
Fixing the Code:
Here’s how you can update your code to handle these issues:
import os
import pandas as pd
# Set the directory for your files
folder_path = 'C:/Users/hhh/Desktop/python/Concat'
# Define the expected column order
column_order = ['Name', 'Address', 'Date', 'City', 'State', 'Zip']
def read_sheets(filename):
result = []
sheets = pd.read_excel(filename, sheet_name=None) # Read all sheets
for name, sheet in sheets.items():
# Ensure the column order is standardized
sheet = sheet[column_order]
# Add Sheetname and Row number (starting from 1, you can adjust if needed)
sheet['Sheetname'] = name
sheet['Row'] = sheet.reset_index(drop=False)['index'] + 1 # Adjust for 1-based index
result.append(sheet)
return pd.concat(result, ignore_index=True)
# Get all files from the folder
files = [file for file in os.listdir(folder_path) if file.endswith(".xls")]
# Read and concatenate all the sheets from all the files
final_data = pd.concat([read_sheets(os.path.join(folder_path, file)) for file in files], ignore_index=True)
# Save the result to an output file
final_data.to_excel('Output.xls', index=False)
print("Output file has been created successfully.")
Key Changes:
- File Path Handling: We're using
os.path.join(folder_path, file)
to make sure we provide the correct path topd.read_excel()
. - Column Order Standardization: After reading each sheet, we reorder the columns explicitly to match the expected
column_order
list. - Row Number and Sheet Name: The row number is extracted from
reset_index()
and adjusted to be 1-based. - Output: The final output is saved to
Output.xls
usingto_excel()
.
Explanation:
- Standardizing Columns: The
sheet[column_order]
part ensures that the columns are ordered in the same way across all files and sheets, which should help avoid alignment issues. - Adding Row Numbers: By resetting the index (
reset_index(drop=False)
), you preserve the original row index, which you then use to create theRow
column. - Concatenating Files and Sheets: The list comprehension loops through all the files, reads each sheet, and then concatenates everything into a single dataframe.
This should resolve the issues and create the desired output file.