How do I concatenate files that have multiple sheets with same c

ghz 昨天 ⋅ 3 views

How do I concatenate files that have multiple sheets with same column headers but randomly ordered in Python/Pandas?

I have 3 xls files that have 3 sheets per file. All sheets have same column headers but order as you see below is different

1.xls
Name Address Date City State Zip

2.xls
Address Date City Zip Name State

3.xls
City Zip Name Address Date State

I want my final xls file to concatenate all 3 files and sheets

Output.xls
Name Address Date City State Zip RowNumber SheetName

The rownumber should be the specific row number from each file and sheet the data comes from before concatenation.Sheetname should be the sheet it comes from within the xls file.

My attempt-

import os
import pandas as pd
#set src directory
os.chdir('C:/Users/hhh/Desktop/python/Concat')
def read_sheets(filename):
    result = []
    sheets = pd.read_excel(filename, sheet_name=None)
    for name, sheet in sheets.items():
        sheet['Sheetname'] = name
        sheet['Row'] = sheet.index
        result.append(sheet)
    return pd.concat(result, ignore_index=True)
files = [file for file in os.listdir(folder_path) if file.endswith(".xls")] 
dfoo = read_sheets(files)

But nothing happens and i just receive an assertion error saying assert content_or_path is not None. Is this because column orders don't match? is there a workaround? Number of columns are same in all files and sheets. Within each file sheets have same order. But if you compare 1.xls sheets to 2.xls the column order is different as you can see in my reprex above

Answer

The issue you're encountering (assert content_or_path is not None) is likely because of how you're handling the file paths and passing them to pd.read_excel(). The error suggests that files is not being correctly used as a valid input path for pd.read_excel().

Issues:

  1. File Path: You’re using files, which is just the list of file names, but you're not providing the full path when trying to read each file. os.listdir() will return the filenames, and you need to prepend the full directory path to open the file correctly.

  2. Sheet Name Column Handling: The column order is different across files, and this may create issues when trying to merge them. A simple solution is to reorder columns after reading each sheet based on a standard column order (which can be derived from any one of the files).

  3. Adding RowNumber: Instead of using sheet.index, it's better to use sheet.reset_index(drop=False) to ensure the row number is preserved correctly, especially when dealing with files that may have non-sequential indices.

Fixing the Code:

Here’s how you can update your code to handle these issues:

import os
import pandas as pd

# Set the directory for your files
folder_path = 'C:/Users/hhh/Desktop/python/Concat'

# Define the expected column order
column_order = ['Name', 'Address', 'Date', 'City', 'State', 'Zip']

def read_sheets(filename):
    result = []
    sheets = pd.read_excel(filename, sheet_name=None)  # Read all sheets
    for name, sheet in sheets.items():
        # Ensure the column order is standardized
        sheet = sheet[column_order]
        # Add Sheetname and Row number (starting from 1, you can adjust if needed)
        sheet['Sheetname'] = name
        sheet['Row'] = sheet.reset_index(drop=False)['index'] + 1  # Adjust for 1-based index
        result.append(sheet)
    return pd.concat(result, ignore_index=True)

# Get all files from the folder
files = [file for file in os.listdir(folder_path) if file.endswith(".xls")]

# Read and concatenate all the sheets from all the files
final_data = pd.concat([read_sheets(os.path.join(folder_path, file)) for file in files], ignore_index=True)

# Save the result to an output file
final_data.to_excel('Output.xls', index=False)

print("Output file has been created successfully.")

Key Changes:

  1. File Path Handling: We're using os.path.join(folder_path, file) to make sure we provide the correct path to pd.read_excel().
  2. Column Order Standardization: After reading each sheet, we reorder the columns explicitly to match the expected column_order list.
  3. Row Number and Sheet Name: The row number is extracted from reset_index() and adjusted to be 1-based.
  4. Output: The final output is saved to Output.xls using to_excel().

Explanation:

  • Standardizing Columns: The sheet[column_order] part ensures that the columns are ordered in the same way across all files and sheets, which should help avoid alignment issues.
  • Adding Row Numbers: By resetting the index (reset_index(drop=False)), you preserve the original row index, which you then use to create the Row column.
  • Concatenating Files and Sheets: The list comprehension loops through all the files, reads each sheet, and then concatenates everything into a single dataframe.

This should resolve the issues and create the desired output file.