pandas 0.22, resample only full bins, drop partials

ghz 2days ago ⋅ 4 views

I am trying to resample 1h bins of trading data into 4h bins. The problem is that my pandas code gives also output of partial bins that are not closed yet.

My Input:

                  close    high     low    open  symbol       turnover  \
timestamp
2018-05-08 03:00:00  9418.0  9449.0  9408.5  9412.5  XBTUSD  1091577940325
2018-05-08 04:00:00  9423.5  9435.0  9390.0  9418.0  XBTUSD   801492831858
2018-05-08 05:00:00  9414.0  9428.5  9393.5  9423.5  XBTUSD   445420257388
2018-05-08 06:00:00  9337.0  9414.0  9314.5  9414.0  XBTUSD  1349710247828
2018-05-08 07:00:00  9328.5  9359.5  9305.0  9337.0  XBTUSD  1103092129997
2018-05-08 08:00:00  9355.0  9359.5  9328.5  9328.5  XBTUSD   647813850343
2018-05-08 09:00:00  9376.0  9383.0  9355.0  9355.0  XBTUSD   597066647876
2018-05-08 10:00:00  9312.0  9376.5  9241.5  9376.0  XBTUSD  1933554301163
2018-05-08 11:00:00  9296.0  9338.0  9275.5  9312.0  XBTUSD  1318169059747
2018-05-08 12:00:00  9201.5  9305.0  9178.0  9296.0  XBTUSD  2058057970783

My Output:

                   open    high     low   close     volume         vwap  \
timestamp
2018-05-08 04:00:00  9418.0  9435.0  9305.0  9328.5  346736372  9380.972675
2018-05-08 08:00:00  9328.5  9383.0  9241.5  9296.0  419074812  9332.798550
2018-05-08 12:00:00  9296.0  9305.0  9178.0  9201.5  189922434  9228.497600

Note that the 4h interval from 12:00 till 16:00 contains the partial data from the 12:00 hourly interval of the source.

My Desired output should look like:

                     open    high     low   close     volume         vwap  \
timestamp
2018-05-08 04:00:00  9418.0  9435.0  9305.0  9328.5  346736372  9380.972675
2018-05-08 08:00:00  9328.5  9383.0  9241.5  9296.0  419074812  9332.798550

So that only the whole 12:00 interval must be closed in order to give data in the resampling process.

My Code so far:

outputData = srcData.resample('4H').agg({'open': 'first',
                                                'high': 'max',
                                                'low': 'min',
                                                'close': 'last',
                                                'volume': 'sum',
                                                'vwap': 'mean',
                                                'turnover': 'sum',
                                                'symbol': 'first'})

Is there a function in pandas that would help me or do I have to figure out a way to cut the partial interval off after resampling? Cheers Alex

Answer

In pandas, you can use the loffset parameter to control how timestamps align after resampling, but that won't inherently remove partial bins. To exclude partial intervals, you need to filter out the incomplete bins after resampling. Here's how you can do it:

Approach:

  1. Resample your data as you are doing now.
  2. Check if the last timestamp of each resampled interval aligns with the original timestamps.
  3. Drop bins that do not have a complete set of intervals.

Here’s the modified code to achieve your desired output:

# Resample the data to 4-hour intervals
outputData = srcData.resample('4H').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
    'vwap': 'mean',
    'turnover': 'sum',
    'symbol': 'first'
})

# Determine the maximum timestamp in the original data
max_timestamp = srcData.index.max()

# Filter out partial intervals
# Keep only those intervals where the end of the bin does not exceed the max timestamp
outputData = outputData[outputData.index + pd.Timedelta(hours=4) - pd.Timedelta(hours=1) <= max_timestamp]

print(outputData)

Explanation:

  1. Resampling: The resample('4H').agg(...) block computes the aggregate statistics for the 4-hour bins.
  2. Filter Partial Intervals:
    • Calculate the effective end of each bin using outputData.index + pd.Timedelta(hours=4) - pd.Timedelta(hours=1).
    • Compare these effective end times against the maximum timestamp of the original data (max_timestamp).
    • Retain only the bins where the interval fully fits within the original data.

Output:

This will ensure that only bins fully contained within the original dataset are included, avoiding partial bins like the one starting at 12:00.