Pandas Multiindex Groupby aggregate column with value from another column
I have a pandas dataframe with multiindex where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})
df.set_index(['A','B'])
        Q  S
A  B
A1 B1   6  0
   B1   4  5
   B2  10  0
   B2   6  5
A2 B1   2  0
   B1   5  3
   B1  17  5
   B2   4  0
and I would like to group this dataframe so that the Q values are summed and each group keeps the S value from its row with the largest Q, yielding this:
df2 = pd.DataFrame({'S': [0, 0, 5, 0],
                    'Q': [10, 16, 24, 4],
                    'A': ['A1', 'A1', 'A2', 'A2'],
                    'B': ['B1', 'B2', 'B1', 'B2']})
df2.set_index(['A','B'])
        Q  S
A  B
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
Any hints?
Answer
You are almost there with your approach, but the issue is with how you're trying to aggregate S. The expression df.S[df.Q.idxmax()] is evaluated once, on the whole DataFrame, before agg() ever runs, so agg() receives a single value rather than something it can apply to each group. agg() expects a function (or the name of one) for each column. You need to pass a function that finds the S value in the row with the maximum Q for each group.
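To see why the original attempt cannot work per group, here is a minimal sketch using the question's data. The name `eager_value` is only illustrative; it holds what the expression passed to agg() actually evaluates to.

```python
import pandas as pd

df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2'],
})

# This expression runs immediately, over the whole frame,
# before agg() is ever called.
eager_value = df.S[df.Q.idxmax()]

print(eager_value)            # a single scalar for the entire frame
print(callable(eager_value))  # False: nothing agg() could call per group
```

The scalar comes from the one row with the global maximum Q, so it carries no information about the individual (A, B) groups.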
Here’s how you can achieve your desired output:
Solution
- Sum the Q values within each group.
- Find the S value corresponding to the maximum Q value in each group.
You can use a custom aggregation function for the S column by finding the index of the maximum Q value for each group and then using that index to extract the corresponding S value.
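One thing to keep in mind with an index-based lookup like this: it relies on each row having its own label. The sketch below, using the question's data, compares the default RangeIndex with the (A, B) MultiIndex, where several rows share a key. The names `indexed`, `dup_label`, and `matches` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2'],
})

# Unique labels (default RangeIndex): idxmax names exactly one row.
label = df['Q'].idxmax()
single = df.loc[label, 'S']
print(single)

# Duplicate labels: with (A, B) as the index, the label returned by
# idxmax is shared by every row of that (A, B) pair.
indexed = df.set_index(['A', 'B'])
dup_label = indexed['Q'].idxmax()
matches = indexed['S'].loc[dup_label]
print(dup_label)
print(len(matches))  # more than one row comes back
```

For that reason it is usually simpler to group on A and B while they are still ordinary columns (or to call reset_index() first) and let the grouped result carry the MultiIndex.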
Here’s the code:
import pandas as pd
# Create the DataFrame. A and B stay as ordinary columns, so every row
# keeps a unique label in the default RangeIndex.
df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']
})
# Groupby and aggregate; the result is indexed by (A, B)
df2 = df.groupby(['A', 'B']).agg(
    Q=('Q', 'sum'),  # Sum the Q values
    S=('Q', lambda q: df.loc[q.idxmax(), 'S'])  # S from the row with the max Q
)
print(df2)
Explanation:
Q=('Q', 'sum'): This sums the Q values for each group.
S=('Q', lambda q: df.loc[q.idxmax(), 'S']): The lambda receives each group's Q values. q.idxmax() returns the row label of the largest Q in that group, and df.loc[..., 'S'] looks up the S value stored in that row. The lambda is applied to Q rather than S because Q decides which row to pick; applying idxmax() to S would return the largest S instead.
The lookup relies on unique row labels, which is why A and B are not set as the index before grouping. If several rows in a group tie for the largest Q, idxmax() picks the first one.
Output:
        Q  S
A  B
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0
This gives the result you're looking for, where the S values correspond to the rows with the maximum Q values for each group, and the Q values are summed within each group.
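If you would rather avoid a Python-level lambda, the same table can be assembled from two built-in group reductions. This is a minimal sketch, assuming the row labels are unique (the default RangeIndex here); `grouped`, `max_rows`, and `result` are illustrative names, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2'],
})

grouped = df.groupby(['A', 'B'])

# Label of the row holding the largest Q in each group (first one on ties)
max_rows = grouped['Q'].idxmax()

# Sum Q per group, then attach the S values taken from those rows.
# Both reductions come back in the same sorted group order, so the
# values can be assigned by position.
result = grouped[['Q']].sum()
result['S'] = df.loc[max_rows.to_numpy(), 'S'].to_numpy()

print(result)
```

On large frames this tends to be faster than a per-group lambda, since idxmax and sum both run as built-in group operations.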