Create multiple new columns for pandas dataframe with apply + function
I have a pandas dataframe df of the following shape: (763, 65)
I use the following code to create 4 new columns:
df[['col1', 'col2', 'col3','col4']] = df.apply(myFunc, axis=1)
def myFunc(row):
    # code to get some result from another dataframe
    return result1, result2, result3, result4
The shape of the dataframe which is returned in myFunc is (1, 4). The code runs into the following error:
ValueError: Shape of passed values is (763, 4), indices imply (763, 65)
I know that df has 65 columns and that the returned data from myFunc only has 4 columns. However, I only want to create the 4 new columns (that is, col1, col2, etc.), so in my opinion the code is correct when it only returns 4 columns in myFunc. What am I doing wrong?
Answer
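The error you're encountering comes from how apply() assembles the per-row results, not from the assignment to col1, col2, col3, and col4 itself. The message says that pandas collected a 763 x 4 block of values and then tried to label it with the 65 columns of df. With axis=1, apply() infers the shape of its output from what myFunc returns, and depending on the pandas version and the type of the returned object it may try to reuse the original column labels instead of creating new ones.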
To fix this, make sure that df.apply() returns a DataFrame with exactly the shape you expect, i.e., with 4 columns (one for each new column you're creating). Here's how to resolve the problem:
Correct Approach:
You should ensure that myFunc() produces exactly 4 values for each row and that apply() expands them into 4 columns. Returning a pandas Series from myFunc(), or wrapping the returned tuple in pd.Series, does this. A bare tuple is not expanded by default in recent pandas versions: apply() then returns a single Series holding one tuple per row, which cannot be assigned to 4 columns directly.
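To see the difference between the two return types, here is a small sketch. The frame demo and the helpers as_tuple and as_series are made up for illustration, and the behavior described in the comments is what recent pandas versions do by default; older versions may differ.

```python
import pandas as pd

# Small stand-in frame; the column names a and b are made up for this demo
demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def as_tuple(row):
    # Returns a plain tuple of two values
    return row["a"] + row["b"], row["a"] * row["b"]

def as_series(row):
    # Same values, wrapped in a Series so apply() can expand them
    return pd.Series(as_tuple(row))

tuples = demo.apply(as_tuple, axis=1)     # a Series holding one tuple per row
expanded = demo.apply(as_series, axis=1)  # a DataFrame with one column per value

print(type(tuples).__name__, tuples.shape)
print(type(expanded).__name__, expanded.shape)
```

Only the second result has the two-dimensional shape needed for assigning to several columns at once.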
Solution:
import pandas as pd
# Sample DataFrame with shape (763, 65); the columns are named c0..c64 so that
# they do not collide with the new columns col1..col4
df = pd.DataFrame([[i] * 65 for i in range(763)], columns=[f"c{i}" for i in range(65)])
# Define your function to return a tuple of 4 values
def myFunc(row):
    # Placeholder logic; .iloc selects by position. Replace with your actual logic
    result1 = row.iloc[0] * 2
    result2 = row.iloc[1] * 3
    result3 = row.iloc[2] * 4
    result4 = row.iloc[3] * 5
    return result1, result2, result3, result4
# Apply the function to each row; pd.Series lets apply() expand the tuple into 4 columns
df[['col1', 'col2', 'col3', 'col4']] = df.apply(lambda row: pd.Series(myFunc(row)), axis=1)
# Check the result
print(df.head())
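As an alternative to wrapping the result in pd.Series, apply() accepts result_type="expand" together with axis=1, which asks pandas to turn each list-like result into columns. A minimal sketch, assuming a small stand-in frame with made-up columns c0 and c1 and placeholder logic in myFunc:

```python
import pandas as pd

# Hypothetical stand-in for the real 763 x 65 frame
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [4, 5, 6]})

def myFunc(row):
    # Placeholder logic returning a plain tuple of 4 values
    return (row["c0"] * 2, row["c1"] * 3,
            row["c0"] + row["c1"], row["c0"] - row["c1"])

# result_type="expand" turns each returned tuple into one row of a 4-column DataFrame
df[["col1", "col2", "col3", "col4"]] = df.apply(myFunc, axis=1, result_type="expand")

print(df)
```

This keeps myFunc unchanged: it can keep returning a plain tuple.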
Explanation:
- myFunc(row): This function takes a row from the DataFrame and returns a tuple of 4 values, which you want to assign to the new columns col1, col2, col3, and col4.
- df.apply(): We use df.apply(lambda row: pd.Series(myFunc(row)), axis=1) to apply myFunc() to each row. The result of myFunc(row) is wrapped in a pd.Series() so that apply() can properly expand it into separate columns for each row.
  - The axis=1 argument tells apply() to operate row-wise (i.e., apply myFunc() to each row).
  - The result from myFunc(row) (which is a tuple) is converted into a pandas Series object so that it can be correctly assigned to multiple columns.
- Assigning to new columns: The output of df.apply() is a DataFrame with 4 columns, which is assigned to df[['col1', 'col2', 'col3', 'col4']].
This should resolve the ValueError and correctly add the 4 new columns to your DataFrame.
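One more option, in case the pd.Series wrapper turns out to be slow: creating a Series object for every row adds overhead, so on larger frames it may be cheaper to collect the tuples first and build the new columns in one step. This is only a sketch under the same assumptions as before (a made-up frame with columns c0 and c1, placeholder logic in myFunc, and a recent pandas version in which apply() returns a Series of tuples); new_cols, rows and extra are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real frame
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [4, 5, 6]})

def myFunc(row):
    # Placeholder logic returning a plain tuple of 4 values
    return (row["c0"] * 2, row["c1"] * 3,
            row["c0"] + row["c1"], row["c0"] - row["c1"])

new_cols = ["col1", "col2", "col3", "col4"]

# apply() yields a Series of tuples here; tolist() turns it into a list of tuples
rows = df.apply(myFunc, axis=1).tolist()

# Build the 4 new columns at once, reusing df's index so the rows line up
extra = pd.DataFrame(rows, index=df.index, columns=new_cols)
df = df.join(extra)

print(df)
```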