I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.
I have a table loaded in a DataFrame with some columns:
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
In SQL, to count the amount of different clients per year would be:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
And the result would be
201301 5000
201302 13245
How can I do that in Pandas?
Answers
You can achieve this SQL-like operation in Pandas using the groupby()
function combined with nunique()
to count distinct values for each group. Here’s how to do it:
Example:
Let's assume you have a DataFrame df
with columns YEARMONTH
and CLIENTCODE
.
SQL equivalent:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
Pandas equivalent:
import pandas as pd
# Sample DataFrame
data = {
'YEARMONTH': [201301, 201301, 201301, 201302, 201302, 201302, 201302],
'CLIENTCODE': ['A', 'B', 'A', 'C', 'D', 'C', 'E']
}
df = pd.DataFrame(data)
# Group by 'YEARMONTH' and count distinct 'CLIENTCODE'
result = df.groupby('YEARMONTH')['CLIENTCODE'].nunique()
print(result)
Output:
YEARMONTH
201301 2
201302 3
Name: CLIENTCODE, dtype: int64
Explanation:
groupby('YEARMONTH')
: Groups the data by theYEARMONTH
column.['CLIENTCODE'].nunique()
: For each group, it counts the number of distinctCLIENTCODE
values.
This will give you the same result as the SQL query, where you count the distinct clients (CLIENTCODE
) for each YEARMONTH
.
If you want the result as a DataFrame instead of a Series, you can reset the index:
result_df = result.reset_index(name='ClientCount')
print(result_df)
Output as DataFrame:
YEARMONTH ClientCount
0 201301 2
1 201302 3