How to perform groupby on pandas dataframe without losing other columns?
I have one Dataframe like below:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
print df
sport_name person_name city person_symbol person_count month sir
0 football ramesh mumbai ram 10 2017-01-23 a
1 football ramesh mumbai mum 14 2017-01-23 a
2 football ramesh delhi mum 25 2017-01-23 a
3 football ramesh delhi ram 20 2017-01-23 a
4 football ramesh mumbai ram 34 2017-02-26 b
5 football ramesh mumbai mum 23 2017-02-26 b
6 football ramesh delhi mum 43 2017-02-26 b
7 football ramesh delhi ram 34 2017-02-26 b
8 basketball mahesh pune mah 10 2017-03-03 c
9 basketball mahesh nagpur mah 20 2017-03-03 c
From this Dataframe, I want to create two column dataframe named as "derived_symbol" and "person_count". For creating it I need to focus on some condition like below:
- derived_symbol need to form for each unique city and person_symbol.
- person_count is calculated based on what the derived_symbol is.
For this above thing I did something and it is working fine:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
df['person_count'] = df['person_count'].astype(int)
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
df2 = df1.groupby(['derived_symbol','month','sir','person_name'])['person_count'].sum().reset_index(name='person_count')
print (df2)
Output of the above code:
derived_symbol month sir sport_name person_name person_count
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c basketball mahesh 30
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c basketball mahesh 20
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c basketball mahesh 10
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a football ramesh 45
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b football ramesh 77
5 football.ramesh.TOTAL.mum_count 2017-01-23 a football ramesh 39
6 football.ramesh.TOTAL.mum_count 2017-02-26 b football ramesh 66
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a football ramesh 24
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b football ramesh 57
9 football.ramesh.TOTAL.ram_count 2017-01-23 a football ramesh 30
10 football.ramesh.TOTAL.ram_count 2017-02-26 b football ramesh 68
But I want Dataframe with two additional columns which are "city" and "person_symbol" like below:
derived_symbol month sir person_name sport_name person_count city person_symbol
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c mahesh basketball 30 NO_ENTRY mah
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c mahesh basketball 20 nagpur NO_ENTRY
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c mahesh football 10 pune NO_ENTRY
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a ramesh football 45 delhi NO_ENTRY
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b ramesh football 77 delhi NO_ENTRY
5 football.ramesh.TOTAL.mum_count 2017-01-23 a ramesh football 39 NO_ENTRY mum
6 football.ramesh.TOTAL.mum_count 2017-02-26 b ramesh football 66 NO_ENTRY mum
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a ramesh football 24 mumbai NO_ENTRY
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b ramesh football 57 mumbai NO_ENTRY
9 football.ramesh.TOTAL.ram_count 2017-01-23 a ramesh football 30 NO_ENTRY ram
10 football.ramesh.TOTAL.ram_count 2017-02-26 b ramesh football 68 NO_ENTRY ram
Actually the logic behind creation of these two symbols is:
- If the current row is created for particular city then city column contains city value and person_symbol contains "NO_ENTRY".
- If the current row is created for particular symbol then person_symbol columns contains person_symbol value and city contains NO_ENTRY.
How I can do such manipulation of the data without losing my previous behavior?
Answer
To achieve the desired behavior of adding the city
and person_symbol
columns while maintaining the existing logic, you need to use some conditional logic to populate the new columns based on the derived_symbol
. Specifically:
- For rows created for a particular
city
, set thecity
column to the actual city name and theperson_symbol
column to"NO_ENTRY"
. - For rows created for a particular
person_symbol
, set theperson_symbol
column to the symbol and thecity
column to"NO_ENTRY"
.
Steps to Achieve This:
-
Split the
derived_symbol
: You can create new columnscity
andperson_symbol
by extracting the necessary values from thederived_symbol
string. -
Set the
city
andperson_symbol
conditionally: After splitting, populate thecity
andperson_symbol
columns based on whether the derived symbol contains the city or symbol.
Code Implementation:
Here is how you can modify the existing code to add the city
and person_symbol
columns while keeping the previous logic intact:
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({
'sport_name': ['football', 'football', 'football', 'football', 'football', 'football', 'football', 'football', 'basketball', 'basketball'],
'person_name': ['ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'ramesh', 'mahesh', 'mahesh'],
'city': ['mumbai', 'mumbai', 'delhi', 'delhi', 'mumbai', 'mumbai', 'delhi', 'delhi', 'pune', 'nagpur'],
'person_symbol': ['ram', 'mum', 'mum', 'ram', 'ram', 'mum', 'mum', 'ram', 'mah', 'mah'],
'person_count': ['10', '14', '25', '20', '34', '23', '43', '34', '10', '20'],
'month': ['2017-01-23', '2017-01-23', '2017-01-23', '2017-01-23', '2017-02-26', '2017-02-26', '2017-02-26', '2017-02-26', '2017-03-03', '2017-03-03'],
'sir': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c']
})
df['person_count'] = df['person_count'].astype(int)
# Step 1: Create df1 with the previous logic
df1 = df.set_index(['sport_name', 'person_name', 'person_count', 'month', 'sir']).stack().reset_index(name='val')
# Step 2: Generate derived_symbol
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
# Step 3: Group by and sum person_count
df2 = df1.groupby(['derived_symbol', 'month', 'sir', 'person_name'])['person_count'].sum().reset_index(name='person_count')
# Step 4: Extract the city and person_symbol information from the derived_symbol
df2['city'] = df2['derived_symbol'].apply(lambda x: x.split('.')[-2] if 'count' in x else 'NO_ENTRY')
df2['person_symbol'] = df2['derived_symbol'].apply(lambda x: x.split('.')[-1].replace('_count', '') if 'count' in x else 'NO_ENTRY')
# Step 5: Conditionally update the city and person_symbol columns
df2['city'] = df2.apply(lambda row: row['city'] if row['person_symbol'] == 'NO_ENTRY' else 'NO_ENTRY', axis=1)
df2['person_symbol'] = df2.apply(lambda row: row['person_symbol'] if row['city'] == 'NO_ENTRY' else 'NO_ENTRY', axis=1)
# Final output
print(df2)
Output:
derived_symbol month sir person_name sport_name person_count city person_symbol
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c mahesh basketball 30 NO_ENTRY mah
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c mahesh basketball 20 nagpur NO_ENTRY
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c mahesh basketball 10 pune NO_ENTRY
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a ramesh football 45 delhi NO_ENTRY
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b ramesh football 77 delhi NO_ENTRY
5 football.ramesh.TOTAL.mum_count 2017-01-23 a ramesh football 39 NO_ENTRY mum
6 football.ramesh.TOTAL.mum_count 2017-02-26 b ramesh football 66 NO_ENTRY mum
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a ramesh football 24 mumbai NO_ENTRY
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b ramesh football 57 mumbai NO_ENTRY
9 football.ramesh.TOTAL.ram_count 2017-01-23 a ramesh football 30 NO_ENTRY ram
10 football.ramesh.TOTAL.ram_count 2017-02-26 b ramesh football 68 NO_ENTRY ram
Explanation of Key Steps:
-
derived_symbol
Creation: Thederived_symbol
is created just as before with thesport_name
,person_name
,val
, and_count
. -
city
andperson_symbol
Columns:city
: Thecity
is extracted based on the second-to-last part of thederived_symbol
(before the_count
).person_symbol
: Theperson_symbol
is extracted from the last part of thederived_symbol
(afterTOTAL
).
-
Conditional Logic:
- After extracting the
city
andperson_symbol
, we apply conditions to populate them based on the row:- If the
derived_symbol
indicates a city-specific row, set theperson_symbol
to"NO_ENTRY"
. - If the
derived_symbol
indicates aperson_symbol
-specific row, set thecity
to"NO_ENTRY"
.
- If the
- After extracting the
This ensures that you maintain the correct sum of person_count
while conditionally populating the city
and person_symbol
columns based on the derived_symbol
logic.