Pandas 'count(distinct)' equivalent

ghz 3months ago ⋅ 141 views

I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, etc., etc.

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in Pandas?

Answers

You can achieve this SQL-like operation in Pandas using the groupby() function combined with nunique() to count distinct values for each group. Here’s how to do it:

Example:

Let's assume you have a DataFrame df with columns YEARMONTH and CLIENTCODE.

SQL equivalent:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

Pandas equivalent:

import pandas as pd

# Sample DataFrame
data = {
    'YEARMONTH': [201301, 201301, 201301, 201302, 201302, 201302, 201302],
    'CLIENTCODE': ['A', 'B', 'A', 'C', 'D', 'C', 'E']
}

df = pd.DataFrame(data)

# Group by 'YEARMONTH' and count distinct 'CLIENTCODE'
result = df.groupby('YEARMONTH')['CLIENTCODE'].nunique()

print(result)

Output:

YEARMONTH
201301        2
201302        3
Name: CLIENTCODE, dtype: int64

Explanation:

  1. groupby('YEARMONTH'): Groups the data by the YEARMONTH column.
  2. ['CLIENTCODE'].nunique(): For each group, it counts the number of distinct CLIENTCODE values.

This will give you the same result as the SQL query, where you count the distinct clients (CLIENTCODE) for each YEARMONTH.

If you want the result as a DataFrame instead of a Series, you can reset the index:

result_df = result.reset_index(name='ClientCount')
print(result_df)

Output as DataFrame:

   YEARMONTH  ClientCount
0     201301            2
1     201302            3