Looping over a dataframe and fetching related data from another dataframe: pandas
I have a dataframe holding all transaction data for customers; the columns are mailid, txn_date, and city. I need to consider customers with transactions from 01 Jan 2016 onward, and for each such mailid, fetch their transaction data from the base file, restrict it to their last 12 months (txn_date between their last transaction date and 365 days before it), and then find the city they transacted in most often.
sample base dataframe:
#df
mailid  txn_date    city
satya   2015-07-21  a
satya   2015-08-11  b
satya   2016-05-11  c
xyz     2016-06-01  f
satya   2016-06-01  a
satya   2016-06-01  b
As I need customers from 2016-01-01 onward, I did:
d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']
Now, for each mailid in d, I have to fetch their last 12 months of transaction data from the base dataframe df and calculate their most-transacted city. For that I am using a for loop like:
from datetime import timedelta

x = d.groupby(['mailid'])['txn_date'].max().reset_index()  # last transacted date, to derive the 12-month-back date
x['max_city'] = 'N'  # default value
for idx, row in x.iterrows():
    g = row['txn_date'].date()
    h = g - timedelta(days=365)  # start of the last 12 months
    y = df[(df['mailid'] == row['mailid']) & (df['txn_date'] >= str(h))]
    y = y.sort_values(['txn_date'], ascending=True)  # sorted so the last txn wins when counts for two cities tie
    c = y.groupby(['mailid', 'city']).size().reset_index()
    v = c.groupby(['mailid'])[0].max().reset_index()
    dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
    dcb = pd.merge(dca, v, on=['mailid', 0])
    m = dcb.drop_duplicates(['mailid'], keep='last')
    x.loc[idx, 'max_city'] = m['city'].unique()[0]
o/p:
mailid  max_city
satya   b    # in the last 12 months (2015-06-01 to 2016-06-01), counts are a=2 and b=2; b's transaction comes last, so b is the max city
xyz     f
My code works for a small chunk of data (I am sure it is unorganised and uses no proper naming convention, as I was practicing), but the loop hits the main base dataframe df once for every customer present in dataframe x.

So my main concern is: if df has 100 million rows and x has 6 million rows, the loop will execute 6 million times, each time hitting df to fetch the matching mailid's data and compute its max transacted city. If it can calculate 3 mailids' max city per minute, then 6 million will take 2 million minutes, which would be a serious problem.

So I need suggestions on how to optimize this scenario, hitting the main base dataframe fewer times and using a more idiomatic pandas approach (which I have not managed yet).
Please suggest! Thanks in advance.
Answer
To optimize this operation and avoid scanning the entire dataset once per customer, you can take advantage of vectorized operations and efficient grouping in pandas. The key to making this faster is to avoid repeatedly filtering df for each customer inside the loop.
Here's a more optimized approach:
Optimized Plan:
- Precompute the latest transaction date for each customer in d and derive the corresponding date range (last 12 months).
- Filter the relevant transactions efficiently by merging on the customer ID and applying the date-range condition across the entire df at once, rather than doing it for each individual row of x.
- Group and count the transactions per city for each customer within the last 12 months, and pick the most frequently transacted city in one step, breaking ties by the most recent transaction as the question requires.
Optimized Code:
import pandas as pd
from datetime import timedelta

# Sample data (the same six rows as in the question)
data = {'mailid': ['satya', 'satya', 'satya', 'xyz', 'satya', 'satya'],
        'txn_date': ['2015-07-21', '2015-08-11', '2016-05-11', '2016-06-01', '2016-06-01', '2016-06-01'],
        'city': ['a', 'b', 'c', 'f', 'a', 'b']}
df = pd.DataFrame(data)
df['txn_date'] = pd.to_datetime(df['txn_date'])

# Step 1: Keep only transactions on or after 2016-01-01
d = df[df['txn_date'] >= '2016-01-01']

# Step 2: Last transaction date for each customer
last_txn_dates = d.groupby('mailid')['txn_date'].max().reset_index()

# Step 3: Start of each customer's 12-month window
last_txn_dates['start_date'] = last_txn_dates['txn_date'] - timedelta(days=365)

# Step 4: Merge the window start onto df once, then filter to the last 12 months
df_merged = pd.merge(df, last_txn_dates[['mailid', 'start_date']], on='mailid', how='inner')
df_merged = df_merged[df_merged['txn_date'] >= df_merged['start_date']]

# Step 5: Count transactions per (mailid, city) and record each city's most
# recent position, so ties can be broken by the latest transaction
df_merged = df_merged.sort_values('txn_date', kind='mergesort')  # stable sort keeps input order on equal dates
df_merged['order'] = range(len(df_merged))
city_counts = (df_merged.groupby(['mailid', 'city'])
                        .agg(txn_count=('order', 'size'), last_seen=('order', 'max'))
                        .reset_index())

# Step 6: Most frequently transacted city per mailid; ties go to the city
# whose transaction came last
most_frequent_city = (city_counts.sort_values(['txn_count', 'last_seen'])
                                 .drop_duplicates('mailid', keep='last'))

# Step 7: Merge this information back onto the customer list
result = pd.merge(last_txn_dates[['mailid']], most_frequent_city[['mailid', 'city']],
                  on='mailid', how='left')
result = result.rename(columns={'city': 'max_city'})

# Show the result
print(result)
Explanation of the Code:
- Step 1: Filter Transactions. We start by filtering df down to transactions on or after 2016-01-01, using df[df['txn_date'] >= '2016-01-01'].
- Step 2: Find the Last Transaction Date for Each Customer. Using groupby('mailid')['txn_date'].max(), we find the most recent transaction date for each customer.
- Step 3: Calculate the 12-Month Range. We compute each customer's window start by subtracting 365 days from their last transaction date.
- Step 4: Merge with Original Data. We merge df with last_txn_dates to add the start_date (12 months back) to each row, then keep only the rows that fall within the window (df_merged[df_merged['txn_date'] >= df_merged['start_date']]).
- Step 5: Count Transactions by City. We count each customer's transactions per city in the window with groupby(['mailid', 'city']), and also record the position of each city's latest transaction (last_seen) for tie-breaking.
- Step 6: Find the Most Frequent City. Sorting by (txn_count, last_seen) and keeping the last row per mailid picks the city with the highest transaction count; when counts tie, the city transacted most recently wins, which matches the question's requirement.
- Step 7: Merge the Most Frequent City Back. Finally, we merge the winning city for each customer back onto last_txn_dates and rename the city column to max_city.
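A quick aside on why Step 6 uses a sort plus drop_duplicates rather than a plain idxmax (a toy sketch with made-up counts; the ties frame below is hypothetical): Series.idxmax returns the first row that attains the maximum, so a tie would always resolve to whichever city appears first in the grouped output, regardless of which was transacted last.

import pandas as pd

# Hypothetical tied counts: both cities have 2 transactions for 'satya'
ties = pd.DataFrame({'mailid': ['satya', 'satya'],
                     'city': ['a', 'b'],
                     'txn_count': [2, 2]})

# idxmax keeps the first occurrence of the maximum -> city 'a' wins,
# even if 'b' was the more recent transaction
print(ties.loc[ties.groupby('mailid')['txn_count'].idxmax()])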
Output:
mailid max_city
0 satya b
1 xyz f
Key Optimizations:
- Avoiding iteration: instead of iterating over each customer and querying df repeatedly, everything is computed in bulk with vectorized operations.
- Efficient date filtering: df is joined with last_txn_dates once, and the 12-month filter is applied across the whole dataset in a single pass.
- Grouping and counting: groupby with a single aggregation counts the transactions per city for each customer in the last 12 months, with no nested loops.
Performance Consideration:
This approach replaces 6 million per-customer lookups with a handful of merges and groupbys, each a single pass over the data using pandas' efficient built-in operations. For large datasets (like 100 million rows), this will be dramatically faster than a Python-level for loop that filters the data once per customer.
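One further tweak worth considering at the 100-million-row scale (a sketch under the same column names, not benchmarked; starts and df_window are names introduced here): the Step 4 merge materialises an extra start_date column across the whole frame, whereas mapping the per-customer window start onto df through a mailid-indexed Series gives the same filter with a lighter intermediate.

# Alternative to the Step 4 merge: look up each row's window start by mailid
starts = last_txn_dates.set_index('mailid')['start_date']
mask = df['txn_date'] >= df['mailid'].map(starts)  # missing mailid -> NaT -> False
df_window = df[mask]  # same rows as df_merged, without the merged-in column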