How to loop pandas dataframe with subset of data (like group by)

ghz 昨天 ⋅ 4 views

I have a pandas dataframe after sorted, it looks like bellow (like few person working for shop as shift):

A   B   C   D 
1   1   1   Anna
2   3   1   Anna
3   1   2   Anna
4   3   2   Tom
5   3   2   Tom
6   3   2   Tom
7   3   2   Tom
8   1   1   Anna
9   3   1   Anna
10   1   2   Tom
...

I want to loop and split dataframe to subset of dataframe, then call my another function, eg:

first subset df would be

A   B   C   D 
1   1   1   Anna
2   3   1   Anna
3   1   2   Anna

second subset df would be

4   3   2   Tom
5   3   2   Tom
6   3   2   Tom
7   3   2   Tom

third subset df would be

8   1   1   Anna
9   3   1   Anna

Is there a good way to loop the main datafraem and split it?

for x in some_magic_here:
    sub_df = some_mage_here_too()
    my_fun(sub_df)

Answer

You can loop through your sorted DataFrame and split it based on consecutive rows that share the same value in column D (since that seems to be your identifier for different shifts, such as Anna's or Tom's shifts). This can be done using groupby or iterating through the rows to detect shifts in the value of column D. After identifying the different shifts, you can create the subsets and pass them to your function.

Here’s a method to split the DataFrame based on consecutive occurrences of the same value in column D:

Solution 1: Using groupby with shift

This method groups the DataFrame by consecutive values in column D. The idea is to create a new column that identifies groups of consecutive rows where the value in column D stays the same. Then you can group by this column and pass each group to your function.

import pandas as pd

# Sample data
data = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [1, 3, 1, 3, 3, 3, 3, 1, 3, 1],
    'C': [1, 1, 2, 2, 2, 2, 2, 1, 1, 2],
    'D': ['Anna', 'Anna', 'Anna', 'Tom', 'Tom', 'Tom', 'Tom', 'Anna', 'Anna', 'Tom']
}

df = pd.DataFrame(data)

# Add a group column to identify consecutive rows with the same value in column 'D'
df['group'] = (df['D'] != df['D'].shift()).cumsum()

# Now, you can iterate over each group
for _, sub_df in df.groupby('group'):
    # Call your function with the subset
    print(sub_df)
    # my_fun(sub_df)  # Uncomment to call your function

Explanation:

  1. df['group'] = (df['D'] != df['D'].shift()).cumsum(): This creates a new column group which will increment whenever the value in column D changes (i.e., when the shift changes). The shift function checks for consecutive values, and cumsum generates a unique group number for each block of consecutive values in column D.

  2. df.groupby('group'): This groups the DataFrame by the group column, effectively splitting the DataFrame into subsets based on consecutive occurrences of the same value in column D.

  3. Looping over groups: The for _, sub_df in df.groupby('group') loop allows you to process each subset (shift group) individually.

Sample Output:

For the input data above, the output will be:

   A  B  C     D  group
0  1  1  1  Anna      1
1  2  3  1  Anna      1
2  3  1  2  Anna      1
   A  B  C    D  group
3  4  3  2  Tom      2
4  5  3  2  Tom      2
5  6  3  2  Tom      2
6  7  3  2  Tom      2
   A  B  C     D  group
7  8  1  1  Anna      3
8  9  3  1  Anna      3
   A  B  C    D  group
9  10  1  2  Tom      4

You can then pass each sub_df to your function my_fun(sub_df) inside the loop.

Solution 2: Iterating Manually

If you want more control over the process or don't want to use groupby, you can manually iterate over the rows and create the subsets by checking for changes in column D:

import pandas as pd

# Sample data
data = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [1, 3, 1, 3, 3, 3, 3, 1, 3, 1],
    'C': [1, 1, 2, 2, 2, 2, 2, 1, 1, 2],
    'D': ['Anna', 'Anna', 'Anna', 'Tom', 'Tom', 'Tom', 'Tom', 'Anna', 'Anna', 'Tom']
}

df = pd.DataFrame(data)

# Initialize an empty list to hold subsets
subsets = []
current_subset = []

# Iterate through rows
for i, row in df.iterrows():
    if current_subset and row['D'] != current_subset[-1]['D']:
        # If the shift changes, save the current subset and start a new one
        subsets.append(pd.DataFrame(current_subset))
        current_subset = []
    
    # Add current row to the current subset
    current_subset.append(row)

# Don't forget to add the last subset
if current_subset:
    subsets.append(pd.DataFrame(current_subset))

# Now you can call your function on each subset
for sub_df in subsets:
    print(sub_df)
    # my_fun(sub_df)  # Uncomment to call your function

This method manually handles row iteration and checks for changes in the value of D. When a change is detected, it stores the current subset and starts building the next one.

Conclusion:

  • groupby method is cleaner and leverages pandas' built-in functionality to group data by consecutive values in a column.
  • Manual iteration method gives you more control, but it requires more code and careful handling of subset construction.

Both methods will achieve the same result, so you can choose the one that best fits your needs!