Add new columns to a pandas df after filtering

ghz 18hours ago ⋅ 1 views

I have a df that contains information about various places.

import pandas as pd

d = ({
    'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
    'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
    'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
    'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],   
    'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],           
    'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
    })

df = pd.DataFrame(data=d)

I want to export that data into their respective places, which are labelled in Column D. I also want to add new columns based off functions labelled in Column B.

df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()

incl = ['Home', 'Away']    

for k, g in df[df.D.isin(incl)].groupby('D'):
    Stop = g.loc[df['B'] == 'Stop'].reset_index()['C']
    Start = g.loc[df['B'] == 'Start'].reset_index()['C']
    Res = g.loc[df['B'] == 'Res'].reset_index()['C']

    g['Start_diff'] = Start - Stop
    g['Res_diff'] = Start - Res

The problem is these functions occur multiple times, which are labelled in Column F. So if we look at the export for Home we get the diff for the first time in Column F.

Output:

    A   B       C       D       E       F   Start_diff  Res_diff
0   A   Stop    28800   Home    Num:    1   3000        2400
2   A   Res     29400   Home    Num:    1       
6   A   Start   31800   Home    Num:    1       
10  A   Stop    33300   Home    Num:    2       
12  A   Res     33660   Home    Num:    2       
16  A   Start   34800   Home    Num:    2       

Whereas I'm hoping the intended output would be:

    A   B       C       D       E       F   Start_diff  Res_diff
0   A   Stop    28800   Home    Num:    1   3000        2400
2   A   Res     29400   Home    Num:    1       
6   A   Start   31800   Home    Num:    1       
10  A   Stop    33300   Home    Num:    2   1500        1200    
12  A   Res     33660   Home    Num:    2       
16  A   Start   34800   Home    Num:    2       

I have tried to alter for k, g in df[df.D.isin(incl)].groupby('D'): to for k, g in df[df.D.isin(incl)].groupby('D').F.nunique():

But I get an error TypeError: 'numpy.int64' object is not iterable

Answer

The issue you're facing arises because you're trying to calculate the difference (Start_diff and Res_diff) for multiple rows, but the calculations are being applied only once per group (based on the current groupby logic), which does not account for the fact that the same type of event (e.g., Stop, Start, Res) can occur multiple times, and thus you need to calculate these differences per event sequence (grouped by F).

Solution Approach

  1. Correct groupby logic: Instead of using groupby('D') alone, you need to group by both D and F (the event sequence) so that each set of Start, Stop, and Res events within the same D (place) and F (sequence) can be processed correctly.

  2. Handle missing values: Make sure that Start, Stop, and Res events are correctly aligned, even if one of them is missing in a sequence.

  3. Apply calculations: For each group (which is now identified by both D and F), you can then calculate Start_diff and Res_diff based on their respective Stop, Start, and Res times.

Updated Code:

Here’s how you can fix your code:

import pandas as pd

d = ({
    'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
    'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
    'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
    'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],   
    'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],           
    'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
})

df = pd.DataFrame(data=d)

# Convert 'C' to timedelta in seconds
df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()

# Filter for relevant places (Home, Away)
incl = ['Home', 'Away']

# Iterate over each group defined by both 'D' and 'F'
for k, g in df[df.D.isin(incl)].groupby(['D', 'F']):
    # Extract relevant events (Stop, Start, Res)
    Stop = g.loc[g['B'] == 'Stop', 'C']
    Start = g.loc[g['B'] == 'Start', 'C']
    Res = g.loc[g['B'] == 'Res', 'C']

    # Ensure there are values for Stop, Start, and Res before calculating diffs
    if not Stop.empty and not Start.empty:
        g['Start_diff'] = Start.values[0] - Stop.values[0]  # First Stop and Start pair
    if not Start.empty and not Res.empty:
        g['Res_diff'] = Start.values[0] - Res.values[0]  # First Start and Res pair

    # Update the DataFrame with the new calculated columns
    df.loc[g.index, 'Start_diff'] = g['Start_diff']
    df.loc[g.index, 'Res_diff'] = g['Res_diff']

# Show the final DataFrame
print(df)

Key Changes:

  1. Grouping by both D and F: This ensures that each sequence of events (Stop, Start, Res) for each location (D) and event sequence (F) is processed separately.

  2. Extracting Stop, Start, and Res events: Using .loc to filter for the respective event types (Stop, Start, and Res) for each group.

  3. Calculation of Start_diff and Res_diff: The difference is calculated for the first occurrence of each event within the sequence, so that Start_diff is based on the first Start and Stop pair, and Res_diff is based on the first Start and Res pair.

  4. Updating the DataFrame: After calculating the differences for each group, I update the original df with the new Start_diff and Res_diff values.

Expected Output:

The output should now correctly reflect the intended result:

    A     B        C     D     E    F Start_diff  Res_diff
0   A   Stop  28800.0  Home  Num:   1      3000.0     2400.0
2   A    Res  29400.0  Home  Num:   1       NaN        NaN
6   A  Start  31800.0  Home  Num:   1       NaN        NaN
10  A   Stop  33300.0  Home  Num:   2      1500.0     1200.0
12  A    Res  33660.0  Home  Num:   2       NaN        NaN
16  A  Start  34800.0  Home  Num:   2       NaN        NaN

Explanation:

  1. For Home (sequence F = 1), the Start_diff and Res_diff are calculated based on the first Stop and Start for F = 1.
  2. For Home (sequence F = 2), the same logic applies: Start_diff and Res_diff are calculated for F = 2.

This solution will handle multiple occurrences of Stop, Start, and Res while calculating the differences sequentially for each group defined by D (location) and F (sequence).