ghz 12 hours ago ⋅ 3 views

Create multiple new columns for pandas dataframe with apply + function

I have a pandas dataframe df of the following shape: (763, 65)

I use the following code to create 4 new columns:

df[['col1', 'col2', 'col3','col4']] = df.apply(myFunc, axis=1)

def myFunc(row):
    #code to get some result from another dataframe
    return result1, result2, result3, result4

The shape of the dataframe which is returned in myFunc is (1, 4). The code runs into the following error:

ValueError: Shape of passed values is (763, 4), indices imply (763, 65)

I know that df has 65 columns and that the returned data from myFunc only has 4 columns. However, I only want to create the 4 new columns (that is, col1, col2, etc.), so in my opinion the code is correct when it only returns 4 columns in myFunc. What am I doing wrong?

Answer

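The error comes from how apply() combines the values returned by myFunc, not from the assignment to col1, col2, col3, and col4 itself. With axis=1, apply() calls myFunc once per row and then has to assemble the 763 results into one object. When the function returns a plain tuple or list, pandas does not necessarily expand it into columns. In older pandas versions it tries to fit the 4 values against the original 65 column labels, which is the mismatch the ValueError reports: values of shape (763, 4) against indices implying (763, 65).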

To fix this, you need to make sure that df.apply() returns a DataFrame with the exact shape that you expect, i.e., with 4 columns (the new columns you're creating). Here's how to resolve the problem:

Correct Approach:

Make sure apply() receives a pandas Series of 4 values for each row, either by returning a Series from myFunc() or by wrapping its tuple in pd.Series() at the call site. When the per-row result is a Series, apply() expands it into columns, so the overall result is a DataFrame of shape (763, 4) that can be assigned to the 4 new columns. Returning a bare tuple is not enough on its own, since that is what the code in the question already does.
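To see how the return type changes what apply() produces, here is a minimal sketch on a hypothetical three-row frame. The names small, as_tuple, and as_series are illustrative and not from the question, and the behavior described assumes a reasonably recent pandas (0.23 or later).

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real 763 x 65 one
small = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def as_tuple(row):
    # Returns a plain tuple, like myFunc in the question
    return row["a"], row["b"], row["a"] + row["b"], row["a"] * row["b"]

def as_series(row):
    # Same four values, wrapped in a Series
    return pd.Series(as_tuple(row))

tuple_result = small.apply(as_tuple, axis=1)
series_result = small.apply(as_series, axis=1)

# Compare the type and shape of the two results
print(type(tuple_result).__name__, tuple_result.shape)
print(type(series_result).__name__, series_result.shape)
```

On recent pandas versions the first call gives a single Series holding 3 tuples, while the second gives a DataFrame of shape (3, 4), which is the form needed for a multi-column assignment.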

Solution:

import pandas as pd

# Sample DataFrame with shape (763, 65). The existing columns are named
# x0..x64 so that they do not collide with the new col1..col4.
df = pd.DataFrame([[i] * 65 for i in range(763)], columns=[f"x{i}" for i in range(65)])

# Define your function to return a tuple of 4 values
def myFunc(row):
    # Placeholder logic: replace each line with the real lookup
    result1 = row.iloc[0] * 2
    result2 = row.iloc[1] * 3
    result3 = row.iloc[2] * 4
    result4 = row.iloc[3] * 5
    return result1, result2, result3, result4

# Wrap each tuple in a pd.Series so that apply() returns a (763, 4) DataFrame,
# then assign that DataFrame to the new columns
df[['col1', 'col2', 'col3', 'col4']] = df.apply(lambda row: pd.Series(myFunc(row)), axis=1)

# Check the result
print(df.head())

Explanation:

  1. myFunc(row): This function takes a row from the DataFrame and returns a tuple of 4 values, which you want to assign to the new columns col1, col2, col3, and col4.

  2. df.apply(): We use df.apply(lambda row: pd.Series(myFunc(row)), axis=1) to apply myFunc() to each row. The result of myFunc(row) is wrapped in a pd.Series() so that apply() can properly expand it into separate columns for each row.

    • The axis=1 argument tells apply() to operate row-wise (i.e., to call myFunc() once per row).
    • The tuple returned by myFunc(row) is converted into a pandas Series, and apply() stacks these Series into one row each of the result.

  3. Assigning to new columns: The output of df.apply() is a DataFrame with 4 columns (labeled 0 to 3 by default), which is assigned in order to df[['col1', 'col2', 'col3', 'col4']].
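The steps above can also be written so that the result of apply() already carries the final column names. The following is a minimal sketch under assumed names: the small frame, the placeholder arithmetic, and my_func_labeled are hypothetical and only stand in for the real lookups.

```python
import pandas as pd

# Hypothetical stand-in for the real frame
df = pd.DataFrame({"x0": [1, 2, 3], "x1": [4, 5, 6]})
new_names = ["col1", "col2", "col3", "col4"]

def my_func_labeled(row):
    # Placeholder logic in place of the lookups in another DataFrame
    values = (row["x0"] * 2, row["x1"] * 3, row["x0"] + row["x1"], row["x0"] - row["x1"])
    # The index of the returned Series becomes the column labels of the result
    return pd.Series(values, index=new_names)

new_cols = df.apply(my_func_labeled, axis=1)  # DataFrame with columns col1..col4
df = df.join(new_cols)                        # aligned on the row index
print(df)
```

Labeling the Series keeps the column names next to the values they describe, and join() aligns on the row index rather than relying on column order.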

This should resolve the ValueError and correctly add the 4 new columns to your DataFrame.
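As an alternative, and assuming pandas 0.23 or later, apply() accepts result_type="expand", which turns list-like return values into columns without building a Series for every row. The sketch below uses a hypothetical small frame and placeholder logic in place of the real myFunc.

```python
import pandas as pd

# Hypothetical stand-in for the real frame
df = pd.DataFrame({"x0": [1, 2, 3], "x1": [4, 5, 6]})

def my_func(row):
    # Plain tuple of 4 values; no pd.Series wrapper needed here
    return row["x0"] * 2, row["x1"] * 3, row["x0"] + row["x1"], row["x0"] - row["x1"]

# result_type="expand" spreads each tuple across columns 0..3
expanded = df.apply(my_func, axis=1, result_type="expand")
df[["col1", "col2", "col3", "col4"]] = expanded
print(df)
```

This keeps myFunc unchanged from the question, which may be convenient if the function is also used elsewhere.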