I am trying to resample 1h bins of trading data into 4h bins. The problem is that my pandas code gives also output of partial bins that are not closed yet.
My Input:
close high low open symbol turnover \
timestamp
2018-05-08 03:00:00 9418.0 9449.0 9408.5 9412.5 XBTUSD 1091577940325
2018-05-08 04:00:00 9423.5 9435.0 9390.0 9418.0 XBTUSD 801492831858
2018-05-08 05:00:00 9414.0 9428.5 9393.5 9423.5 XBTUSD 445420257388
2018-05-08 06:00:00 9337.0 9414.0 9314.5 9414.0 XBTUSD 1349710247828
2018-05-08 07:00:00 9328.5 9359.5 9305.0 9337.0 XBTUSD 1103092129997
2018-05-08 08:00:00 9355.0 9359.5 9328.5 9328.5 XBTUSD 647813850343
2018-05-08 09:00:00 9376.0 9383.0 9355.0 9355.0 XBTUSD 597066647876
2018-05-08 10:00:00 9312.0 9376.5 9241.5 9376.0 XBTUSD 1933554301163
2018-05-08 11:00:00 9296.0 9338.0 9275.5 9312.0 XBTUSD 1318169059747
2018-05-08 12:00:00 9201.5 9305.0 9178.0 9296.0 XBTUSD 2058057970783
My Output:
open high low close volume vwap \
timestamp
2018-05-08 04:00:00 9418.0 9435.0 9305.0 9328.5 346736372 9380.972675
2018-05-08 08:00:00 9328.5 9383.0 9241.5 9296.0 419074812 9332.798550
2018-05-08 12:00:00 9296.0 9305.0 9178.0 9201.5 189922434 9228.497600
Note that the 4h interval from 12:00 till 16:00 contains the partial data from the 12:00 hourly interval of the source.
My Desired output should look like:
open high low close volume vwap \
timestamp
2018-05-08 04:00:00 9418.0 9435.0 9305.0 9328.5 346736372 9380.972675
2018-05-08 08:00:00 9328.5 9383.0 9241.5 9296.0 419074812 9332.798550
So that only the whole 12:00 interval must be closed in order to give data in the resampling process.
My Code so far:
outputData = srcData.resample('4H').agg({'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'sum',
'vwap': 'mean',
'turnover': 'sum',
'symbol': 'first'})
Is there a function in pandas that would help me or do I have to figure out a way to cut the partial interval off after resampling? Cheers Alex
Answer
In pandas, you can use the loffset
parameter to control how timestamps align after resampling, but that won't inherently remove partial bins. To exclude partial intervals, you need to filter out the incomplete bins after resampling. Here's how you can do it:
Approach:
- Resample your data as you are doing now.
- Check if the last timestamp of each resampled interval aligns with the original timestamps.
- Drop bins that do not have a complete set of intervals.
Here’s the modified code to achieve your desired output:
# Resample the data to 4-hour intervals
outputData = srcData.resample('4H').agg({
'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'sum',
'vwap': 'mean',
'turnover': 'sum',
'symbol': 'first'
})
# Determine the maximum timestamp in the original data
max_timestamp = srcData.index.max()
# Filter out partial intervals
# Keep only those intervals where the end of the bin does not exceed the max timestamp
outputData = outputData[outputData.index + pd.Timedelta(hours=4) - pd.Timedelta(hours=1) <= max_timestamp]
print(outputData)
Explanation:
- Resampling: The
resample('4H').agg(...)
block computes the aggregate statistics for the 4-hour bins. - Filter Partial Intervals:
- Calculate the effective end of each bin using
outputData.index + pd.Timedelta(hours=4) - pd.Timedelta(hours=1)
. - Compare these effective end times against the maximum timestamp of the original data (
max_timestamp
). - Retain only the bins where the interval fully fits within the original data.
- Calculate the effective end of each bin using
Output:
This will ensure that only bins fully contained within the original dataset are included, avoiding partial bins like the one starting at 12:00
.