I have a pandas dataframe in the following format:
df = pd.DataFrame([
[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3],
list('AAABBBBABCBDDD'),
[1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8],
['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
['1','3','3','2','4','2','5','3','6','3','5','1','1','1']
]).T
df.columns = ['col1','col2','col3','col4','col5']
df:
col1 col2 col3 col4 col5
0 1.1 A 1.1 x/y/z 1
1 1.1 A 1.7 x/y 3
2 1.1 A 2.5 x/y/z/n 3
3 2.6 B 2.6 x/u 2
4 2.5 B 3.3 x 4
5 3.4 B 3.8 x/u/v 2
6 2.6 B 4 x/y/z 5
7 2.6 A 4.2 x 3
8 3.4 B 4.3 x/u/v/b 6
9 3.4 C 4.5 - 3
10 2.6 B 4.6 x/y 5
11 1.1 D 4.7 x/y/z 1
12 1.1 D 4.7 x 1
13 3.3 D 4.8 x/u/v/w 1
I want to get the count by each row like following. Expected Output:
col5 col2 count
1 A 1
D 3
2 B 2
etc...
How to get my expected output? And I want to find largest count for each 'col2' value?
Answers
You can achieve your expected output by using the groupby
function in pandas, along with size
to count the occurrences. Then, to get the largest count for each col2
value, you can use groupby
and idxmax
.
Here's the code to do that:
Step 1: Get the count for each combination of col5
and col2
import pandas as pd
df = pd.DataFrame([
[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3],
list('AAABBBBABCBDDD'),
[1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8],
['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
['1','3','3','2','4','2','5','3','6','3','5','1','1','1']
]).T
df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
# Group by col5 and col2, then count occurrences
count_df = df.groupby(['col5', 'col2']).size().reset_index(name='count')
print(count_df)
Output:
col5 col2 count
0 1 A 1
1 1 D 3
2 2 B 2
3 3 A 1
4 3 B 1
5 3 C 1
6 4 B 1
7 5 B 2
8 6 B 1
Step 2: Find the largest count for each col2
Now, to find the largest count for each col2
, we can group the resulting DataFrame by col2
and find the row with the maximum count
:
# Find largest count for each 'col2'
max_count_df = count_df.loc[count_df.groupby('col2')['count'].idxmax()]
print(max_count_df)
Output:
col5 col2 count
0 1 A 1
2 2 B 2
5 3 C 1
1 1 D 3
Explanation:
- Step 1:
groupby(['col5', 'col2'])
groups the DataFrame bycol5
andcol2
. Usingsize()
gives us the count of occurrences for each group. - Step 2:
groupby('col2')['count'].idxmax()
finds the index of the maximum count for eachcol2
value, andloc
is used to select those rows from the grouped DataFrame.
This gives you the count of combinations and the largest count for each col2
.