Creating a new dataframe with calculations from a base dataframe where the indexes are independent (working code, but iterrows inside a for loop)
The dataset I am looking at is for animals that have entered my local animal shelter. The fields I am interested in are the intake_date and the outcome_date, which are when the animal entered the shelter and when it left.
I want to be able to plot the total number of animals in residence (i.e., animals that spent the night at the shelter) against the date. The logic I have is:
- if the outcome date is null then that record should count for every date from the intake date to the present date
- else the record should count for every day from the intake date until the day before the outcome date.
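The two rules above can be sketched for a single hypothetical record (dates chosen for illustration):

```python
import pandas as pd

# Hypothetical single record: which dates does it count toward?
intake = pd.Timestamp("2024-03-28")
outcome = pd.Timestamp("2024-04-03")  # use pd.NaT for an animal still in residence
today = pd.Timestamp("2024-04-04")

# Null outcome: count every date from intake through today.
# Otherwise: count from intake up to, but not including, the outcome date.
end = today if pd.isna(outcome) else outcome - pd.Timedelta(days=1)
counted_dates = pd.date_range(intake, end, freq="D")
print(list(counted_dates.strftime("%Y-%m-%d")))
# ['2024-03-28', '2024-03-29', '2024-03-30', '2024-03-31', '2024-04-01', '2024-04-02']
```

The full task is to compute, for every calendar date, how many records count toward it.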
I have some working code, but I want to speed it up, since I am essentially doing 90 million iterations.
Here is the code I have:
import pandas as pd
import datetime

def fetch_data():
    url = "https://raw.githubusercontent.com/Etharialle/shelter-streamlit/main/datasets/asio.csv"
    df = pd.read_csv(url, on_bad_lines='skip').replace("'", "", regex=True)
    return df

df_open_cases = fetch_data()

# Ensure the date columns are datetime types
df_open_cases["Intake Date"] = pd.to_datetime(df_open_cases["Intake Date"])
df_open_cases["Outcome Date"] = pd.to_datetime(df_open_cases["Outcome Date"])

# Animals in Residence by Date
start_date = df_open_cases["Intake Date"].min()
end_date = datetime.datetime.now()
date_list = {}
while start_date <= end_date:
    loop_date = start_date.strftime('%Y-%m-%d')
    date_list[loop_date] = 0
    start_date += datetime.timedelta(days=1)

for k, v in date_list.items():
    for index, rows in df_open_cases.iterrows():
        if pd.isnull(rows['Outcome Date']) and rows["Intake Date"].strftime('%Y-%m-%d') <= k:
            date_list[k] += 1
        elif rows["Intake Date"].strftime('%Y-%m-%d') <= k and rows['Outcome Date'].strftime('%Y-%m-%d') > k:
            date_list[k] += 1

df_dates = pd.DataFrame.from_dict(
    date_list, orient='index', columns=['Animals in Residence'])
This code does what I want, but there are 25,000+ records spanning over 10 years, which means I am running the if/elif/else 90+ million times; it took my laptop a few hours to run. I'm not sure how (or if) I can vectorize this. The only way I can think of to shorten the loop count is to filter the dataframe before iterrows, e.g. df_open_cases[df_open_cases["Intake Date"] >= k].iterrows()
example of the data set:
Intake Date Outcome Date
27312 2024-04-04 NaT
27313 2024-04-04 NaT
27330 2024-04-04 NaT
27331 2024-04-04 NaT
27332 2024-04-04 NaT
27315 2024-04-04 NaT
27316 2024-04-03 NaT
27317 2024-04-03 NaT
27318 2024-04-03 2024-04-03
27319 2024-04-03 2024-04-03
27320 2024-04-03 2024-04-03
27321 2024-04-03 NaT
27322 2024-04-03 NaT
27323 2024-03-28 NaT
27324 2024-03-28 2024-04-03
27325 2024-03-28 NaT
27326 2024-03-28 2024-04-03
27327 2024-03-28 NaT
27328 2024-04-03 NaT
27329 2024-04-03 2024-04-04
27493 2024-04-03 NaT
27494 2024-04-03 NaT
27300 2024-04-02 NaT
27495 2024-04-02 NaT
27496 2024-04-02 NaT
Answers
To optimize your code and avoid iterating over each row in the DataFrame for each date, you can use vectorized operations over a complete date range. Here's how you can do it:
import pandas as pd

def fetch_data():
    url = "https://raw.githubusercontent.com/Etharialle/shelter-streamlit/main/datasets/asio.csv"
    df = pd.read_csv(url, on_bad_lines='skip').replace("'", "", regex=True)
    return df

# Fetch data
df = fetch_data()

# Convert date columns to datetime
df["Intake Date"] = pd.to_datetime(df["Intake Date"])
df["Outcome Date"] = pd.to_datetime(df["Outcome Date"])

# All dates between the minimum intake date and today
all_dates = pd.date_range(start=df["Intake Date"].min(), end=pd.Timestamp.today().normalize(), freq="D")

# Create a DataFrame with all dates
date_df = pd.DataFrame({"Date": all_dates})

# Animals with a null Outcome Date are still in residence: each one counts
# from its intake date onward, so a cumulative sum of intake counts gives
# the running total for every date
open_intakes = df.loc[df["Outcome Date"].isna(), "Intake Date"].value_counts()
date_df["Animals in Residence"] = open_intakes.reindex(all_dates, fill_value=0).cumsum().values

# For animals with an Outcome Date, count each day from the intake date up to
# (but not including) the outcome date
outcome_dates = df.loc[df["Outcome Date"].notna(), ["Intake Date", "Outcome Date"]]
for index, row in outcome_dates.iterrows():
    date_range = pd.date_range(start=row["Intake Date"], end=row["Outcome Date"] - pd.Timedelta(days=1), freq="D")
    date_df.loc[date_df["Date"].isin(date_range), "Animals in Residence"] += 1

print(date_df)
In this optimized version:
- We create a DataFrame date_df containing every date between the minimum intake date and today.
- For animals with a null Outcome Date, we count them for every date from their intake date onward using vectorized operations.
- For animals with a non-null Outcome Date, we build the date range from the intake date up to the day before the outcome date and increment the count for each date in that range.
- The resulting DataFrame date_df contains the count of animals in residence for each date, which you can use for plotting or further analysis.

This approach avoids running the nested loop over every record for every date and significantly improves performance, especially for large datasets.
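The remaining per-row loop can also be eliminated entirely. A sketch of a fully vectorized, event-based alternative (same column names assumed, shown on a tiny made-up dataset): treat each intake as +1, each outcome as -1, and take a running total. An animal stops counting on its outcome date, so the -1 lands on the outcome date itself; NaT outcomes simply never subtract.

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values, not the real shelter data)
df = pd.DataFrame({
    "Intake Date": pd.to_datetime(["2024-03-28", "2024-04-01", "2024-04-01"]),
    "Outcome Date": pd.to_datetime(["2024-04-03", pd.NaT, "2024-04-02"]),
})

all_dates = pd.date_range(df["Intake Date"].min(), pd.Timestamp("2024-04-04"), freq="D")

# +1 the day an animal arrives, -1 the day it leaves; value_counts drops NaT,
# so animals still in residence are never subtracted
delta = (df["Intake Date"].value_counts().reindex(all_dates, fill_value=0)
         - df["Outcome Date"].value_counts().reindex(all_dates, fill_value=0))
in_residence = delta.cumsum()
print(in_residence)
```

With two intakes on 2024-04-01 and one still-open case, the running total peaks at 3 on 2024-04-01 and settles back to 1. This runs in O(n + d) rather than O(n × d), so on 25,000+ records over 10 years it should finish in well under a second.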