ghz 11hours ago ⋅ 7 views

Pandas Multiindex Groupby aggregate column with value from another column

I have a pandas dataframe with multiindex where I want to aggregate the duplicate key rows as follows:

import numpy as np
import pandas as pd
df = pd.DataFrame({'S':[0,5,0,5,0,3,5,0],'Q':[6,4,10,6,2,5,17,4],'A':
                  ['A1','A1','A1','A1','A2','A2','A2','A2'],
                  'B':['B1','B1','B2','B2','B1','B1','B1','B2']})
df.set_index(['A','B'])

        Q  S
A  B        
A1 B1   6  0
   B1   4  5
   B2  10  0
   B2   6  5
A2 B1   2  0
   B1   5  3
   B1  17  5
   B2   4  0

and I would like to group this dataframe by its index, summing the Q values in each group and keeping the S value from the row that has the largest Q in that group, yielding this:

df2 = pd.DataFrame({'S':[0,0,5,0],'Q':[10,16,24,4],'A':
                   ['A1','A1','A2','A2'],
                  'B':['B1','B2','B1','B2']})
df2.set_index(['A','B'])

        Q  S
A  B        
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0

I tried the following, but it didn't work:

df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})

any hints?

Answer

You are almost there, but the problem is how S is aggregated. The expression df.S[df.Q.idxmax()] is evaluated once, on the whole DataFrame, before groupby runs, so agg() receives a single value rather than a function it can apply to each group. What you need is a function that, for each group, returns the S value from the row with the maximum Q.
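To see the problem concretely, this minimal sketch evaluates the expression from the question on its own, using the question's data and the default integer index. The variable name `value` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']
})

# This runs immediately, over all rows, with no knowledge of any groups
value = df.S[df.Q.idxmax()]
print(value)  # 5: the S on the row holding the global maximum Q (17)
```

Because the result is one number for the whole frame, it cannot vary from group to group.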

Here’s how you can achieve your desired output:

Solution

  1. Sum the Q values within each group.
  2. Find the S value corresponding to the maximum Q value in each group.

You can use a custom aggregation function for the S column by finding the index of the maximum Q value for each group and then using that index to extract the corresponding S value.

Here’s the code:

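import pandas as pd

# Create the DataFrame. Keep the default integer index: its labels are unique,
# so idxmax() identifies exactly one row per group.
df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']
})

# Groupby and aggregate
df2 = df.groupby(['A', 'B']).agg(
    Q=('Q', 'sum'),  # Sum the Q values
    # Look up Q for the rows in this group, then take S from the row where Q is largest
    S=('S', lambda s: s.loc[df.loc[s.index, 'Q'].idxmax()])
)

print(df2)

Explanation:

  • Q=('Q', 'sum'): This sums the Q values for each group.
  • S=('S', lambda s: s.loc[df.loc[s.index, 'Q'].idxmax()]): The lambda receives only the group's S values, so it cannot see Q directly. df.loc[s.index, 'Q'] fetches the Q values for the same rows, idxmax() returns the label of the row with the largest Q (the first one if there is a tie), and s.loc[...] retrieves the S value at that label.
  • The grouping is done on the A and B columns rather than on a MultiIndex. With set_index(['A', 'B']) applied first, the index labels repeat within each group, so idxmax() could no longer single out one row. If your DataFrame already has that index, call df.reset_index() before grouping.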

Output:

        Q  S
A  B        
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0

This gives the result you're looking for, where the S values correspond to the rows with the maximum Q values for each group, and the Q values are summed within each group.
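As an alternative, here is a sketch that avoids the per-group lambda by asking groupby for the row labels of the maximum Q directly. It uses the same data, assumes the default unique integer index, and the names `grouped`, `max_q_rows` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'S': [0, 5, 0, 5, 0, 3, 5, 0],
    'Q': [6, 4, 10, 6, 2, 5, 17, 4],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']
})

grouped = df.groupby(['A', 'B'])

# Row label of the largest Q in each group (first occurrence on ties)
max_q_rows = grouped['Q'].idxmax()

# Sum Q per group; the result is indexed by (A, B)
result = grouped[['Q']].sum()

# Take S from the selected rows. to_numpy() drops the row labels so the
# values are assigned by position, in the same group order as `result`.
result['S'] = df.loc[max_q_rows, 'S'].to_numpy()

print(result)
```

Both approaches should produce the same table. This one may be easier to extend if you need several columns from the max-Q row, since df.loc[max_q_rows] returns those rows whole.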