I have a df that contains information about various places.
import pandas as pd
d = ({
'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],
'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],
'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
})
df = pd.DataFrame(data=d)
I want to export that data into their respective places, which are labelled in Column D
. I also want to add new columns based off functions labelled in Column B
.
df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()
incl = ['Home', 'Away']
for k, g in df[df.D.isin(incl)].groupby('D'):
Stop = g.loc[df['B'] == 'Stop'].reset_index()['C']
Start = g.loc[df['B'] == 'Start'].reset_index()['C']
Res = g.loc[df['B'] == 'Res'].reset_index()['C']
g['Start_diff'] = Start - Stop
g['Res_diff'] = Start - Res
The problem is these functions occur multiple times, which are labelled in Column F
. So if we look at the export for Home
we get the diff for the first time in Column F
.
Output:
A B C D E F Start_diff Res_diff
0 A Stop 28800 Home Num: 1 3000 2400
2 A Res 29400 Home Num: 1
6 A Start 31800 Home Num: 1
10 A Stop 33300 Home Num: 2
12 A Res 33660 Home Num: 2
16 A Start 34800 Home Num: 2
Whereas I'm hoping the intended output would be:
A B C D E F Start_diff Res_diff
0 A Stop 28800 Home Num: 1 3000 2400
2 A Res 29400 Home Num: 1
6 A Start 31800 Home Num: 1
10 A Stop 33300 Home Num: 2 1500 1200
12 A Res 33660 Home Num: 2
16 A Start 34800 Home Num: 2
I have tried to alter for k, g in df[df.D.isin(incl)].groupby('D'):
to for k, g in df[df.D.isin(incl)].groupby('D').F.nunique():
But I get an error TypeError: 'numpy.int64' object is not iterable
Answer
The issue you're facing arises because you're trying to calculate the difference (Start_diff
and Res_diff
) for multiple rows, but the calculations are being applied only once per group (based on the current groupby
logic), which does not account for the fact that the same type of event (e.g., Stop
, Start
, Res
) can occur multiple times, and thus you need to calculate these differences per event sequence (grouped by F
).
Solution Approach
-
Correct
groupby
logic: Instead of usinggroupby('D')
alone, you need to group by bothD
andF
(the event sequence) so that each set ofStart
,Stop
, andRes
events within the sameD
(place) andF
(sequence) can be processed correctly. -
Handle missing values: Make sure that
Start
,Stop
, andRes
events are correctly aligned, even if one of them is missing in a sequence. -
Apply calculations: For each group (which is now identified by both
D
andF
), you can then calculateStart_diff
andRes_diff
based on their respectiveStop
,Start
, andRes
times.
Updated Code:
Here’s how you can fix your code:
import pandas as pd
d = ({
'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],
'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],
'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
})
df = pd.DataFrame(data=d)
# Convert 'C' to timedelta in seconds
df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()
# Filter for relevant places (Home, Away)
incl = ['Home', 'Away']
# Iterate over each group defined by both 'D' and 'F'
for k, g in df[df.D.isin(incl)].groupby(['D', 'F']):
# Extract relevant events (Stop, Start, Res)
Stop = g.loc[g['B'] == 'Stop', 'C']
Start = g.loc[g['B'] == 'Start', 'C']
Res = g.loc[g['B'] == 'Res', 'C']
# Ensure there are values for Stop, Start, and Res before calculating diffs
if not Stop.empty and not Start.empty:
g['Start_diff'] = Start.values[0] - Stop.values[0] # First Stop and Start pair
if not Start.empty and not Res.empty:
g['Res_diff'] = Start.values[0] - Res.values[0] # First Start and Res pair
# Update the DataFrame with the new calculated columns
df.loc[g.index, 'Start_diff'] = g['Start_diff']
df.loc[g.index, 'Res_diff'] = g['Res_diff']
# Show the final DataFrame
print(df)
Key Changes:
-
Grouping by both
D
andF
: This ensures that each sequence of events (Stop
,Start
,Res
) for each location (D
) and event sequence (F
) is processed separately. -
Extracting
Stop
,Start
, andRes
events: Using.loc
to filter for the respective event types (Stop
,Start
, andRes
) for each group. -
Calculation of
Start_diff
andRes_diff
: The difference is calculated for the first occurrence of each event within the sequence, so thatStart_diff
is based on the firstStart
andStop
pair, andRes_diff
is based on the firstStart
andRes
pair. -
Updating the DataFrame: After calculating the differences for each group, I update the original
df
with the newStart_diff
andRes_diff
values.
Expected Output:
The output should now correctly reflect the intended result:
A B C D E F Start_diff Res_diff
0 A Stop 28800.0 Home Num: 1 3000.0 2400.0
2 A Res 29400.0 Home Num: 1 NaN NaN
6 A Start 31800.0 Home Num: 1 NaN NaN
10 A Stop 33300.0 Home Num: 2 1500.0 1200.0
12 A Res 33660.0 Home Num: 2 NaN NaN
16 A Start 34800.0 Home Num: 2 NaN NaN
Explanation:
- For
Home
(sequenceF = 1
), theStart_diff
andRes_diff
are calculated based on the firstStop
andStart
forF = 1
. - For
Home
(sequenceF = 2
), the same logic applies:Start_diff
andRes_diff
are calculated forF = 2
.
This solution will handle multiple occurrences of Stop
, Start
, and Res
while calculating the differences sequentially for each group defined by D
(location) and F
(sequence).