Looping through .xlsx files using pandas, only does first file

My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.

I thought the below code would suffice, but it only processes the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned; it just does not run the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.

I'm doing this with Anaconda on Windows 8.

import pandas as pd
import glob
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()             # creates my DataFrame

for f in f_list:                      # basic for loop to go through file list but doesn't
    df = pd.read_excel(f)             # reads .xlsx file
    all_data = all_data.append(df)    # appends file contents to DataFrame
all_data.to_excel("output.xlsx")      # creates new .xlsx

Edit with new information:

After trying some of the suggested changes, I noticed the output claims the files are empty, except for one of them, which is slightly larger than the others. If I read them into the DataFrame, it claims the DataFrame is empty. If I read them into a dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.

Answer

Two separate things are likely going on here: what pandas.read_excel() actually finds inside each file, and how the .append() method is used in the loop. A few points to address:

1. append() in a loop:

DataFrame.append() creates and returns a brand-new DataFrame on every call, so using it in a loop re-copies all previously accumulated rows at each iteration (quadratic work overall). It was also deprecated in pandas 1.4 and removed in pandas 2.0, where calling it raises an AttributeError.
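
On pandas 2.0+, the drop-in, per-iteration replacement for .append() is pd.concat(); it carries the same copying cost, which the list-based pattern in the solution below avoids. A minimal sketch, assuming the same folder path as your code:

import glob
import pandas as pd

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")
all_data = pd.DataFrame()
for f in f_list:
    df = pd.read_excel(f)
    # pd.concat returns a new DataFrame, just as .append() did,
    # so every pass re-copies all rows accumulated so far.
    all_data = pd.concat([all_data, df], ignore_index=True)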

2. File content issues:

If some files appear to be empty, the data may not be where read_excel() looks by default: it reads only the first sheet and treats the first row as the column header. A file whose data lives on another sheet, starts lower down, or is corrupted will come back as an empty (or nearly empty) DataFrame without raising any error.
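
One quick way to see what pandas actually finds in each file is to list every sheet and the shape it parses to. A minimal diagnostic sketch, again assuming the same folder path as your code:

import glob
import pandas as pd

for f in glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx"):
    xl = pd.ExcelFile(f)            # opens the workbook without parsing it
    for sheet in xl.sheet_names:    # every sheet, not just the first
        parsed = xl.parse(sheet)
        print(f"{f} / '{sheet}': {parsed.shape[0]} rows x {parsed.shape[1]} columns")

If the rows turn up on a sheet other than the first, pass sheet_name= to read_excel(), since it reads only the first sheet by default.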

3. Reading the files:

You can make the code both more robust and more efficient: verify that each file actually reads as non-empty, store the resulting DataFrames in a list, and concatenate them once at the end with pd.concat(). Each file's data is then copied a single time, instead of once per remaining iteration as with .append() inside the loop.

Solution:

import pandas as pd
import glob

# Get list of all the .xlsx files in the folder
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx") 

# List to store DataFrames
dfs = []

# Loop through the list of files
for f in f_list:
    try:
        # Read the .xlsx file
        df = pd.read_excel(f)
        
        # Check if the file is empty, skip if so
        if df.empty:
            print(f"Warning: The file {f} is empty and will be skipped.")
        else:
            # Append the DataFrame to the list
            dfs.append(df)
    except Exception as e:
        print(f"Error reading {f}: {e}")

# If the list is not empty, concatenate all DataFrames into one
if dfs:
    all_data = pd.concat(dfs, ignore_index=True)
    all_data.to_excel("output.xlsx", index=False)  # Save the merged DataFrame to a new Excel file
    print("Merging complete. Output saved to 'output.xlsx'.")
else:
    print("No valid data to merge.")

Key Improvements:

  1. Storing DataFrames in a list: Instead of calling .append() inside the loop, each DataFrame is added to a plain Python list (dfs), and pd.concat(dfs, ignore_index=True) builds the final DataFrame in one pass at the end. Each file's data is copied once rather than repeatedly.

  2. Handling empty files: We check if a file results in an empty DataFrame (df.empty) and skip it with a warning. This will help you avoid appending empty files that might be causing unexpected results.

  3. Error handling: A try-except block is added to catch any potential errors during the reading of files (e.g., if the file is corrupted or has an unexpected format).

  4. Concatenating and saving: If there are any valid DataFrames in the list, we concatenate them using pd.concat() and save the final result to an Excel file. If no valid DataFrames were found, a message will be printed.

Further Debugging (if needed):

If you're still facing issues with some files being read as empty or not being merged correctly, I recommend checking the following:

  • File formats: Ensure that the .xlsx files are properly formatted and not corrupted. You can try opening them manually in Excel to confirm.
  • Check the content of individual files: Before reading the files, you could inspect their content or size programmatically to ensure they have data.

Example:

import os
for f in f_list:   # f_list from the solution above
    print(f"File {f} has {os.path.getsize(f)} bytes.")

This will show whether the "empty" files really are smaller on disk than the one that reads correctly.
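
If the sizes look normal but the DataFrames still come back empty, another possibility is that the data does not start in row 1, so pandas consumes real rows while looking for a header. Reading with header=None shows every cell exactly as stored; a sketch along the same lines:

import glob
import pandas as pd

for f in glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx"):
    # sheet_name=None returns every sheet as a dict of DataFrames;
    # header=None keeps row 1 as data instead of column names.
    for sheet, raw in pd.read_excel(f, sheet_name=None, header=None).items():
        print(f"{f} / '{sheet}':")
        print(raw.head())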

Conclusion:

This improved approach should handle your files efficiently, skip any empty or problematic files, and merge the rest correctly into one output file. Let me know if you need further clarification!