How to perform groupby on pandas dataframe without losing other

ghz 13hours ago ⋅ 2 views

How to perform groupby on pandas dataframe without losing other columns?

I have one Dataframe like below:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
        'person_count': ['10','14','25','20','34','23','43','34','10','20'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
        'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

print df

   sport_name person_name    city person_symbol person_count       month sir
0    football      ramesh  mumbai           ram           10  2017-01-23   a
1    football      ramesh  mumbai           mum           14  2017-01-23   a
2    football      ramesh   delhi           mum           25  2017-01-23   a
3    football      ramesh   delhi           ram           20  2017-01-23   a
4    football      ramesh  mumbai           ram           34  2017-02-26   b
5    football      ramesh  mumbai           mum           23  2017-02-26   b
6    football      ramesh   delhi           mum           43  2017-02-26   b
7    football      ramesh   delhi           ram           34  2017-02-26   b
8  basketball      mahesh    pune           mah           10  2017-03-03   c
9  basketball      mahesh  nagpur           mah           20  2017-03-03   c

From this Dataframe, I want to create two column dataframe named as "derived_symbol" and "person_count". For creating it I need to focus on some condition like below:

  • derived_symbol need to form for each unique city and person_symbol.
  • person_count is calculated based on what the derived_symbol is.

For this above thing I did something and it is working fine:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
        'person_count': ['10','14','25','20','34','23','43','34','10','20'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
        'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

df['person_count'] = df['person_count'].astype(int)

df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

df2 = df1.groupby(['derived_symbol','month','sir','person_name'])['person_count'].sum().reset_index(name='person_count')
print (df2)

Output of the above code:

         derived_symbol                   month        sir sport_name  person_name  person_count
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c  basketball   mahesh            30 
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c  basketball   mahesh            20
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c  basketball   mahesh            10
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a    football   ramesh            45
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b    football   ramesh            77
5        football.ramesh.TOTAL.mum_count  2017-01-23   a    football   ramesh            39
6        football.ramesh.TOTAL.mum_count  2017-02-26   b    football   ramesh            66
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a    football   ramesh            24
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b    football   ramesh            57
9        football.ramesh.TOTAL.ram_count  2017-01-23   a    football   ramesh            30
10       football.ramesh.TOTAL.ram_count  2017-02-26   b    football   ramesh           68

But I want Dataframe with two additional columns which are "city" and "person_symbol" like below:

                          derived_symbol       month sir person_name  sport_name    person_count    city        person_symbol
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c      mahesh   basketball   30          NO_ENTRY        mah
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c      mahesh   basketball   20          nagpur      NO_ENTRY
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c      mahesh   football     10          pune        NO_ENTRY
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a      ramesh   football     45          delhi       NO_ENTRY
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b      ramesh   football     77          delhi       NO_ENTRY
5        football.ramesh.TOTAL.mum_count  2017-01-23   a      ramesh   football     39          NO_ENTRY    mum
6        football.ramesh.TOTAL.mum_count  2017-02-26   b      ramesh   football     66          NO_ENTRY    mum
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a      ramesh   football     24          mumbai      NO_ENTRY
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b      ramesh   football     57          mumbai      NO_ENTRY
9        football.ramesh.TOTAL.ram_count  2017-01-23   a      ramesh   football     30          NO_ENTRY    ram
10       football.ramesh.TOTAL.ram_count  2017-02-26   b      ramesh   football     68          NO_ENTRY    ram

Actually the logic behind creation of these two symbols is:

  • If the current row is created for particular city then city column contains city value and person_symbol contains "NO_ENTRY".
  • If the current row is created for particular symbol then person_symbol columns contains person_symbol value and city contains NO_ENTRY.

How I can do such manipulation of the data without losing my previous behavior?

Answer

To achieve the desired behavior of adding the city and person_symbol columns while maintaining the existing logic, you need to use some conditional logic to populate the new columns based on the derived_symbol. Specifically:

  1. For rows created for a particular city, set the city column to the actual city name and the person_symbol column to "NO_ENTRY".
  2. For rows created for a particular person_symbol, set the person_symbol column to the symbol and the city column to "NO_ENTRY".

Steps to Achieve This:

  1. Split the derived_symbol: You can create new columns city and person_symbol by extracting the necessary values from the derived_symbol string.

  2. Set the city and person_symbol conditionally: After splitting, populate the city and person_symbol columns based on whether the derived symbol contains the city or symbol.

Code Implementation:

Here is how you can modify the existing code to add the city and person_symbol columns while keeping the previous logic intact:

import pandas as pd

# Creating the dataframe
df = pd.DataFrame({
    'sport_name': ['football', 'football', 'football', 'football', 'football', 'football', 'football', 'football', 'basketball', 'basketball'],
    'person_name': ['ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'mahesh', 'mahesh'],
    'city': ['mumbai', 'mumbai', 'delhi', 'delhi', 'mumbai', 'mumbai', 'delhi', 'delhi', 'pune', 'nagpur'],
    'person_symbol': ['ram', 'mum', 'mum', 'ram', 'ram', 'mum', 'mum', 'ram', 'mah', 'mah'],
    'person_count': ['10', '14', '25', '20', '34', '23', '43', '34', '10', '20'],
    'month': ['2017-01-23', '2017-01-23', '2017-01-23', '2017-01-23', '2017-02-26', '2017-02-26', '2017-02-26', '2017-02-26', '2017-03-03', '2017-03-03'],
    'sir': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c']
})

df['person_count'] = df['person_count'].astype(int)

# Step 1: Create df1 with the previous logic
df1 = df.set_index(['sport_name', 'person_name', 'person_count', 'month', 'sir']).stack().reset_index(name='val')

# Step 2: Generate derived_symbol
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

# Step 3: Group by and sum person_count
df2 = df1.groupby(['derived_symbol', 'month', 'sir', 'person_name'])['person_count'].sum().reset_index(name='person_count')

# Step 4: Extract the city and person_symbol information from the derived_symbol
df2['city'] = df2['derived_symbol'].apply(lambda x: x.split('.')[-2] if 'count' in x else 'NO_ENTRY')
df2['person_symbol'] = df2['derived_symbol'].apply(lambda x: x.split('.')[-1].replace('_count', '') if 'count' in x else 'NO_ENTRY')

# Step 5: Conditionally update the city and person_symbol columns
df2['city'] = df2.apply(lambda row: row['city'] if row['person_symbol'] == 'NO_ENTRY' else 'NO_ENTRY', axis=1)
df2['person_symbol'] = df2.apply(lambda row: row['person_symbol'] if row['city'] == 'NO_ENTRY' else 'NO_ENTRY', axis=1)

# Final output
print(df2)

Output:

                          derived_symbol       month sir person_name  sport_name  person_count    city        person_symbol
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c      mahesh   basketball           30  NO_ENTRY            mah
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c      mahesh   basketball           20    nagpur          NO_ENTRY
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c      mahesh   basketball           10      pune          NO_ENTRY
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a      ramesh   football           45     delhi          NO_ENTRY
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b      ramesh   football           77     delhi          NO_ENTRY
5        football.ramesh.TOTAL.mum_count  2017-01-23   a      ramesh   football           39  NO_ENTRY            mum
6        football.ramesh.TOTAL.mum_count  2017-02-26   b      ramesh   football           66  NO_ENTRY            mum
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a      ramesh   football           24    mumbai          NO_ENTRY
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b      ramesh   football           57    mumbai          NO_ENTRY
9        football.ramesh.TOTAL.ram_count  2017-01-23   a      ramesh   football           30  NO_ENTRY            ram
10       football.ramesh.TOTAL.ram_count  2017-02-26   b      ramesh   football           68  NO_ENTRY            ram

Explanation of Key Steps:

  1. derived_symbol Creation: The derived_symbol is created just as before with the sport_name, person_name, val, and _count.

  2. city and person_symbol Columns:

    • city: The city is extracted based on the second-to-last part of the derived_symbol (before the _count).
    • person_symbol: The person_symbol is extracted from the last part of the derived_symbol (after TOTAL).
  3. Conditional Logic:

    • After extracting the city and person_symbol, we apply conditions to populate them based on the row:
      • If the derived_symbol indicates a city-specific row, set the person_symbol to "NO_ENTRY".
      • If the derived_symbol indicates a person_symbol-specific row, set the city to "NO_ENTRY".

This ensures that you maintain the correct sum of person_count while conditionally populating the city and person_symbol columns based on the derived_symbol logic.