Creating a new dataframe with calculations from a base dataframe


I'm creating a new dataframe with calculations from a base dataframe, where the indexes are independent. I have working code, but it uses iterrows inside a for loop.

The dataset I am looking at is for animals that have entered my local animal shelter. The fields I am interested in are Intake Date and Outcome Date, which record when the animal entered the shelter and when it left.

I want to be able to plot the total number of animals in residence (i.e., animals that spent the night at the shelter) against the date. The logic that I have is:

  1. if the outcome date is null, the record should count for every date from the intake date to the present date;
  2. otherwise, the record should count for every day from the intake date until the day before the outcome date.
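
For example (a quick sketch with made-up dates): an animal that arrived on 2024-04-01 and left on 2024-04-03 should count on 2024-04-01 and 2024-04-02 only:

import pandas as pd

# Made-up example: intake 2024-04-01, outcome 2024-04-03. The animal
# counts for every day from intake up to the day before the outcome.
intake = pd.Timestamp("2024-04-01")
outcome = pd.Timestamp("2024-04-03")
nights = pd.date_range(intake, outcome - pd.Timedelta(days=1))
print(list(nights.strftime('%Y-%m-%d')))  # ['2024-04-01', '2024-04-02']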

I have some working code, but I want to speed it up, since I am essentially doing 90 million iterations.

Here is the code I have:

import pandas as pd
import time
import datetime


def fetch_data():
    url = "https://raw.githubusercontent.com/Etharialle/shelter-streamlit/main/datasets/asio.csv"
    df = pd.read_csv(url, on_bad_lines='skip').replace("'", "", regex=True)
    return df


df_open_cases = fetch_data()

# Ensure the date columns are datetime types
df_open_cases["Intake Date"] = pd.to_datetime(df_open_cases["Intake Date"])
df_open_cases["Outcome Date"] = pd.to_datetime(df_open_cases["Outcome Date"])

# Animals in Residence by Date
start_date = df_open_cases["Intake Date"].min()
end_date = datetime.datetime.now()
date_list = {}

while start_date <= end_date:
    loop_date = start_date.strftime('%Y-%m-%d')
    date_list[loop_date] = 0
    start_date += datetime.timedelta(days=1)

# For every calendar date, count each animal whose stay covers that date
for k in date_list:
    for index, row in df_open_cases.iterrows():
        if pd.isnull(row['Outcome Date']) and row["Intake Date"].strftime('%Y-%m-%d') <= k:
            date_list[k] += 1
        elif row["Intake Date"].strftime('%Y-%m-%d') <= k and row['Outcome Date'].strftime('%Y-%m-%d') > k:
            date_list[k] += 1
df_dates = pd.DataFrame.from_dict(
    date_list, orient='index', columns=['Animals in Residence'])

This code does what I want it to, but there are 25,000+ records spanning over 10 years, which means I am running the if/elif/else 90+ million times. It took my laptop a few hours to run. I'm not sure how/if I can vectorize this. The only way I can think of to shorten the loop count is to filter the dataframe before calling iterrows, e.g. df_open_cases[df_open_cases["Intake Date"] <= k].iterrows().
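
In full, that filtering idea would look something like this sketch (it drops rows whose intake is after date k, since those can never count toward k, but it still loops over every date):

for k in date_list:
    # keep only rows whose intake is on or before date k
    candidates = df_open_cases[df_open_cases["Intake Date"] <= k]
    for index, row in candidates.iterrows():
        if pd.isnull(row["Outcome Date"]) or row["Outcome Date"].strftime('%Y-%m-%d') > k:
            date_list[k] += 1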

Example of the dataset:

    Intake Date Outcome Date
27312   2024-04-04  NaT
27313   2024-04-04  NaT
27330   2024-04-04  NaT
27331   2024-04-04  NaT
27332   2024-04-04  NaT
27315   2024-04-04  NaT
27316   2024-04-03  NaT
27317   2024-04-03  NaT
27318   2024-04-03  2024-04-03
27319   2024-04-03  2024-04-03
27320   2024-04-03  2024-04-03
27321   2024-04-03  NaT
27322   2024-04-03  NaT
27323   2024-03-28  NaT
27324   2024-03-28  2024-04-03
27325   2024-03-28  NaT
27326   2024-03-28  2024-04-03
27327   2024-03-28  NaT
27328   2024-04-03  NaT
27329   2024-04-03  2024-04-04
27493   2024-04-03  NaT
27494   2024-04-03  NaT
27300   2024-04-02  NaT
27495   2024-04-02  NaT
27496   2024-04-02  NaT

Answers

To optimize your code and avoid iterating over every row of the DataFrame for every date, you can build the full calendar of dates once and do most of the counting with vectorized operations, leaving only a short loop over the closed cases. Here's how you can do it:

import pandas as pd

def fetch_data():
    url = "https://raw.githubusercontent.com/Etharialle/shelter-streamlit/main/datasets/asio.csv"
    df = pd.read_csv(url, on_bad_lines='skip').replace("'", "", regex=True)
    return df

# Fetch data
df = fetch_data()

# Convert date columns to datetime
df["Intake Date"] = pd.to_datetime(df["Intake Date"])
df["Outcome Date"] = pd.to_datetime(df["Outcome Date"])

# Get all unique dates between the minimum intake date and today
all_dates = pd.date_range(start=df["Intake Date"].min(), end=pd.Timestamp.today(), freq="D")

# Create a DataFrame with all dates
date_df = pd.DataFrame({"Date": all_dates})

# Animals with a null Outcome Date are still in residence: each one counts
# on every date from its intake date onward, so take a cumulative sum of
# the per-date intake counts across the calendar
in_residence = df.loc[df["Outcome Date"].isna(), "Intake Date"].value_counts()
date_df["Animals in Residence"] = date_df["Date"].map(in_residence).fillna(0).cumsum().astype(int)

# Animals with an Outcome Date count from the intake date up to the day
# before the outcome date (their last night at the shelter)
closed_cases = df.loc[df["Outcome Date"].notna(), ["Intake Date", "Outcome Date"]]
for index, row in closed_cases.iterrows():
    last_night = row["Outcome Date"] - pd.Timedelta(days=1)
    date_range = pd.date_range(start=row["Intake Date"], end=last_night, freq="D")
    date_df.loc[date_df["Date"].isin(date_range), "Animals in Residence"] += 1

print(date_df)

In this optimized version:

  1. We create a DataFrame date_df containing all unique dates between the minimum intake date and today.
  2. We count the number of animals in residence for each date using vectorized operations. For animals with a null Outcome Date, we directly count them based on the Intake Date. For animals with a non-null Outcome Date, we calculate the date range and increment the count for each date in that range.
  3. The resulting DataFrame date_df contains the count of animals in residence for each date, which you can use for plotting or further analysis. This approach avoids the need for nested loops and significantly improves performance, especially for large datasets.