Pandas DataFrame Groupby two columns and get counts

ghz 3months ago ⋅ 133 views

I have a pandas dataframe in the following format:

df = pd.DataFrame([
    [1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], 
    list('AAABBBBABCBDDD'), 
    [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], 
    ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
    ['1','3','3','2','4','2','5','3','6','3','5','1','1','1']
]).T
df.columns = ['col1','col2','col3','col4','col5']

df:

   col1 col2 col3     col4 col5
0   1.1    A  1.1    x/y/z    1
1   1.1    A  1.7      x/y    3
2   1.1    A  2.5  x/y/z/n    3
3   2.6    B  2.6      x/u    2
4   2.5    B  3.3        x    4
5   3.4    B  3.8    x/u/v    2
6   2.6    B    4    x/y/z    5
7   2.6    A  4.2        x    3
8   3.4    B  4.3  x/u/v/b    6
9   3.4    C  4.5        -    3
10  2.6    B  4.6      x/y    5
11  1.1    D  4.7    x/y/z    1
12  1.1    D  4.7        x    1
13  3.3    D  4.8  x/u/v/w    1

I want to get the count by each row like following. Expected Output:

col5 col2 count
1    A      1
     D      3
2    B      2
etc...

How to get my expected output? And I want to find largest count for each 'col2' value?

Answers

You can achieve your expected output by using the groupby function in pandas, along with size to count the occurrences. Then, to get the largest count for each col2 value, you can use groupby and idxmax.

Here's the code to do that:

Step 1: Get the count for each combination of col5 and col2

import pandas as pd

df = pd.DataFrame([
    [1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], 
    list('AAABBBBABCBDDD'), 
    [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], 
    ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
    ['1','3','3','2','4','2','5','3','6','3','5','1','1','1']
]).T

df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']

# Group by col5 and col2, then count occurrences
count_df = df.groupby(['col5', 'col2']).size().reset_index(name='count')

print(count_df)

Output:

   col5 col2  count
0     1     A      1
1     1     D      3
2     2     B      2
3     3     A      1
4     3     B      1
5     3     C      1
6     4     B      1
7     5     B      2
8     6     B      1

Step 2: Find the largest count for each col2

Now, to find the largest count for each col2, we can group the resulting DataFrame by col2 and find the row with the maximum count:

# Find largest count for each 'col2'
max_count_df = count_df.loc[count_df.groupby('col2')['count'].idxmax()]

print(max_count_df)

Output:

   col5 col2  count
0     1     A      1
2     2     B      2
5     3     C      1
1     1     D      3

Explanation:

  1. Step 1: groupby(['col5', 'col2']) groups the DataFrame by col5 and col2. Using size() gives us the count of occurrences for each group.
  2. Step 2: groupby('col2')['count'].idxmax() finds the index of the maximum count for each col2 value, and loc is used to select those rows from the grouped DataFrame.

This gives you the count of combinations and the largest count for each col2.