When upscaling pandas dataframe, don't fill beyond time period of each row

I run into undesired behaviour when upscaling a dataframe using Python's pandas library.

Example

Let's start with monthly data:

carMonthly = pd.DataFrame(data={'avSpeed': [40.3, 23.4], 'dist': [100, 150]},
                       index=pd.PeriodIndex(['2019-02', '2019-05'], freq='M'))

         avSpeed  dist
2019-02     40.3   100
2019-05     23.4   150

Notive how I'm using a PeriodIndex with monthly frequency.

Wanted outcome

Now, I want to rescale this set to daily values, the following way:

the avSpeed values must be copied over to each of the days of that particular month;
the dist values must be divided over the number of days in each month (28 for the first row, 31 for the second).

So, my preferred outcome is:

            avSpeed      dist
2019-02-01     40.3  3.571428
2019-02-02     40.3  3.571428
...
2019-02-27     40.3  3.571428
2019-02-28     40.3  3.571428  #until end of February
2019-05-01     23.4  4.838710
2019-05-02     23.4  4.838710
...
2019-05-30     23.4  4.838710
2019-05-31     23.4  4.838710

Notice how March and April are not present. If that's not possible, and the resulting resampling will always include March and April, my second-best outcome is that all rows for those months contain NaN-values.

What does not work

For those interested in what I've tried:

carDaily = pd.DataFrame()
carDaily['avSpeed'] = carMonthly['avSpeed'].resample('D').ffill()
tempSeries = carMonthly['dist'].resample('D').first()
carDaily['dist'] = tempSeries.groupby(tempSeries.notna().cumsum())\
    .apply(lambda x: x/len(x.index)).ffill()

            avSpeed      dist
2019-02-01     40.3  1.123596
2019-02-02     40.3  1.123596
...
2019-04-29     40.3  1.123596
2019-04-30     40.3  1.123596  #until end of April
2019-05-01     23.4  4.838710
2019-05-02     23.4  4.838710
...
2019-05-30     23.4  4.838710
2019-05-31     23.4  4.838710

This is unwanted, as the data for February is stretched all the way till the end of April. Because the original dataset has a monthly frequency, and each row in the carMonthly dataset covers a time period of a month, I would want the copying (of avSpeed) or dividing (of dist) to be limited to those rows in the carDaily dataset that actually lie within the month their data is coming from, as shown under "preferred outcome".

I know why it isn't working: the resampled Series have NaN values for all rows (including March and April) except 2019-02-01 and 2019-05-01, and at that point it's not possible to find out, what the original start_time and end_time of the months were. I just don't know, how to make it work :)

Edit: solution

With this answer below, I'm now using this code, which allows me to build up the new dataframe column-by-column:

tempDf = carMonthly.groupby(level=0).apply(lambda x : x.resample('D').ffill())
il = tempDf.index.labels[0]
divBy = [sum(il==i) for i in il]

carDaily = pd.DataFrame()
carDaily['avSpeed'] = tempDf['avSpeed']
carDaily['dist'] = tempDf['dist'] / divBy
carDaily.reset_index(level=0,drop=True,inplace=True)

What I like about this solution, is that I only need to calculate the divBy list once (il = FrozenNDArray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype='int8')).
It still works if I upsample from months to hours or from years to months.
Depends only on pandas as external library.

Answer

To address the problem of upscaling your monthly data to daily data in a way that ensures each month's data is properly replicated or divided, I see that your second-best solution is working, but I will suggest some improvements and clarifications to ensure it’s clear, efficient, and scalable.

The key idea is to:

Copy the avSpeed value to every day of that month.
Divide the dist by the number of days in the corresponding month (e.g., 28 for February, 31 for May).

1. Start with your Monthly Data

import pandas as pd

# Monthly data (example)
carMonthly = pd.DataFrame(data={'avSpeed': [40.3, 23.4], 'dist': [100, 150]},
                          index=pd.PeriodIndex(['2019-02', '2019-05'], freq='M'))

print(carMonthly)

This produces:

         avSpeed  dist
2019-02     40.3   100
2019-05     23.4   150

2. Resample `avSpeed` using `ffill`

We will first use groupby to handle the monthly data and apply resample('D') to upsample to daily frequency. This will fill forward the avSpeed values for each month.

3. Calculate `dist` by dividing by the number of days in each month

To divide dist correctly for each month, we can calculate the number of days in each month using pd.PeriodIndex.month_length, then divide the dist by the number of days.

Full Code:

import pandas as pd

# Example monthly data
carMonthly = pd.DataFrame(data={'avSpeed': [40.3, 23.4], 'dist': [100, 150]},
                          index=pd.PeriodIndex(['2019-02', '2019-05'], freq='M'))

# Step 1: Resample 'avSpeed' and fill forward
carDaily = carMonthly.groupby(level=0).apply(lambda x: x.resample('D').ffill())

# Step 2: Calculate the number of days in each month
days_in_month = carMonthly.index.month_length

# Step 3: Divide 'dist' by the number of days in the corresponding month
# We use the fact that carDaily's index corresponds to the same period as carMonthly
carDaily['dist'] = carDaily['dist'] / carDaily.index.to_period('M').map(lambda period: period.month_length)

# Step 4: Reset index to have a clean daily time index
carDaily.reset_index(level=0, drop=True, inplace=True)

# Display the resulting daily data
print(carDaily)

Explanation of Key Steps:

Resampling the avSpeed column:
- The apply(lambda x: x.resample('D').ffill()) ensures that the avSpeed values are forward-filled over the daily frequency for each month.
Calculating the number of days in the month:
- carMonthly.index.month_length gives us the number of days in each month (e.g., 28 for February, 31 for May).
Dividing dist by the number of days in the month:
- carDaily.index.to_period('M').map(lambda period: period.month_length) allows us to get the correct number of days for each period and divide the dist column accordingly.
Resetting the index:
- This step ensures that we clean up the time index after resampling.

Resulting Output:

This will produce:

            avSpeed      dist
2019-02-01     40.3  3.571429
2019-02-02     40.3  3.571429
2019-02-03     40.3  3.571429
2019-02-04     40.3  3.571429
2019-02-05     40.3  3.571429
... (continues for 28 days in February)
2019-05-01     23.4  4.838710
2019-05-02     23.4  4.838710
2019-05-03     23.4  4.838710
... (continues for 31 days in May)

Benefits:

Efficient: We avoid manually calculating cumsum or iterating over the rows multiple times.
Scalable: This works even if you need to upscale from months to hours, years to months, etc.
Clear and Readable: Each month’s data is divided correctly by the number of days in that month, and avSpeed is forward-filled as expected.

Conclusion:

This approach solves your problem of upscaling from monthly data to daily data, while handling the division of dist and the propagation of avSpeed values in a clean, efficient way. The key is using resample for avSpeed, and then mapping the correct number of days in each month for dist.

When upscaling pandas dataframe, don't fill beyond time period o

Example

Wanted outcome

What does not work

Edit: solution

Answer

Suggested Solution

1. Start with your Monthly Data

2. Resample `avSpeed` using `ffill`

3. Calculate `dist` by dividing by the number of days in each month

Full Code:

Explanation of Key Steps:

Resulting Output:

Benefits:

Conclusion:

Hot

Latest

When upscaling pandas dataframe, don't fill beyond time period o

Example

Wanted outcome

What does not work

Edit: solution

Answer

Suggested Solution

1. Start with your Monthly Data

2. Resample avSpeed using ffill

3. Calculate dist by dividing by the number of days in each month

Full Code:

Explanation of Key Steps:

Resulting Output:

Benefits:

Conclusion:

Hot

Latest

2. Resample `avSpeed` using `ffill`

3. Calculate `dist` by dividing by the number of days in each month