This is my time series:
                          data    z_data   zp_data
time
2018-01-01 00:00:00  -0.045988       NaN       NaN
2018-01-01 00:01:00  -0.046024       NaN       NaN
2018-01-01 00:02:00  -0.044360       NaN       NaN
2018-01-01 00:03:00  -0.044722       NaN       NaN
2018-01-01 00:04:00  -0.043637       NaN       NaN
...                        ...       ...       ...
2018-12-12 23:55:00  11.454639  0.088124  1.631736
2018-12-12 23:56:00  11.498422  0.935382  2.551753
2018-12-12 23:57:00  11.521695  1.251496  1.223949
2018-12-12 23:58:00  11.476974  0.244583 -0.012273
2018-12-12 23:59:00  11.480120  0.278023  0.015562

[498240 rows x 3 columns]
I used the Augmented Dickey-Fuller test, which assesses whether or not a time series is stationary, but I get a MemoryError. How can I solve this issue?
Answer
When working with a large time series such as yours (498,240 rows), the Augmented Dickey-Fuller (ADF) test can run into memory issues, especially on machines with limited RAM. The test fits lagged regressions, and by default statsmodels also searches over many candidate lag lengths, so for large datasets the regression matrices can become a memory bottleneck.
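To get a feel for the scale, you can estimate the default lag order and the size of the resulting regression matrix. A rough back-of-the-envelope sketch, assuming the Schwert rule 12 * (nobs/100)**(1/4) that the adfuller documentation gives for its default maxlag:
import math

n = 498240  # number of observations in the series

# Default maximum lag per the adfuller docs (Schwert's rule)
maxlag = math.ceil(12.0 * (n / 100.0) ** 0.25)
print(maxlag)  # roughly 101 for ~500k rows

# Rough size of an (n x (maxlag + 2)) float64 regression matrix;
# this is a lower bound, since OLS fitting allocates working copies too
approx_mb = n * (maxlag + 2) * 8 / 1e6
print(f"~{approx_mb:.0f} MB")  # several hundred MB before any fitting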
Possible Solutions to the MemoryError
1. Use a Rolling Window Approach
Instead of applying the ADF test to the entire dataset at once, you can compute the test over successive windows. This significantly reduces memory usage, since you only ever work with a subset of the data at a time.
You can apply the ADF test to smaller chunks (e.g., windows of 1000 or 2000 rows) and then aggregate the results. Note that the windows below are consecutive, non-overlapping chunks rather than a sliding window.
Here's how you can do that:
from statsmodels.tsa.stattools import adfuller
import pandas as pd

# Function to perform the ADF test on successive windows
def adf_test_rolling(series, window_size):
    p_values = []  # List to store p-values from the ADF test
    for start in range(0, len(series) - window_size + 1, window_size):
        end = start + window_size
        # Subset the time series for the current window
        window_data = series.iloc[start:end]
        result = adfuller(window_data)
        p_values.append(result[1])  # index 1 is the p-value
    return p_values

# Apply the windowed ADF test
window_size = 1000  # Adjust this based on memory capacity
p_values = adf_test_rolling(df['data'], window_size)

# Show the p-values for each window
print(p_values)
This approach lets you work with smaller chunks of the data at a time, preventing memory overload.
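To aggregate the per-window results, one simple option is to count how many windows look stationary. A minimal follow-up sketch (the 0.05 threshold is just a common convention, not something the test mandates):
import numpy as np

p = np.array(p_values)
print(f"{(p < 0.05).mean():.1%} of windows reject the unit-root null at the 5% level")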
2. Downsample Your Data
If you don't need the full granularity of the data (e.g., if it's recorded at minute or second-level intervals), you could downsample your time series to a lower frequency (e.g., hourly or daily) before applying the ADF test. This will reduce the number of data points and ease memory consumption.
For example, you can resample the data to a daily frequency:
df_resampled = df.resample('D').mean()  # 'D' for daily; other pandas offset aliases work too

# Now perform the ADF test on the resampled data
# (drop empty bins, in case some days have no observations)
result = adfuller(df_resampled['data'].dropna())
print(result)
Alternatively, you could thin the series by keeping every k-th observation. Avoid a random sample (e.g., df.sample), since it destroys the temporal ordering the ADF test relies on; use a systematic subsample instead:
df_sampled = df.iloc[::10]  # Keep every 10th row, preserving time order
result = adfuller(df_sampled['data'])
print(result)
3. Use a Memory-Efficient ADF Implementation
The adfuller function from statsmodels can be memory-intensive, especially on large datasets. However, you can make it cheaper in a couple of ways:
- Reduce the number of lags in the ADF test. By default, adfuller searches over lag lengths up to Schwert's rule (roughly 101 lags for your series) using the AIC, which means fitting many regressions. Capping the lag order decreases memory usage (see also the sketch after this list). For example:
result = adfuller(df['data'], maxlag=20)  # Cap the number of lags
print(result)
- Use a statistical package that handles large time series more efficiently. While statsmodels is the common choice, other libraries such as tsfresh or pyculiarity target related tasks (time series feature extraction and anomaly detection, respectively), so they might require custom implementations to get an ADF-style result.
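If a fixed lag order is acceptable, you can also skip the lag search entirely, so only a single regression is fit. A minimal sketch using adfuller's documented autolag parameter:
from statsmodels.tsa.stattools import adfuller

# With autolag=None, exactly maxlag lags are used and no AIC search runs,
# so only one regression is fit instead of one per candidate lag
result = adfuller(df['data'], maxlag=20, autolag=None)
print(result[0], result[1])  # test statistic and p-value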
4. Use Parallel Processing
If you have access to multiple cores, you can parallelize the chunked ADF tests. Note that parallelism by itself does not lower memory use (each worker holds its own chunk and model); the memory savings come from the small chunks, while the parallelism cuts the runtime. This is helpful for very large series or when applying the test to multiple columns.
You can use Python's multiprocessing or libraries like joblib for parallel execution.
Here's a simple example using joblib:
from joblib import Parallel, delayed
from statsmodels.tsa.stattools import adfuller

# Function to compute the ADF test for a segment of the data
def adf_test_segment(segment):
    return adfuller(segment)[1]  # Return the p-value

# Split the data into chunks and process them in parallel
chunk_size = 1000
segments = [df['data'].iloc[i:i + chunk_size] for i in range(0, len(df['data']), chunk_size)]
p_values = Parallel(n_jobs=-1)(delayed(adf_test_segment)(seg) for seg in segments)
print(p_values)
This divides the data into smaller chunks and processes them in parallel: the chunking keeps each worker's memory footprint small, while the parallelism spreads the computation across cores.
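To interpret the output, it helps to label each p-value with the timestamp at which its chunk starts. A small follow-up sketch using the DataFrame's DatetimeIndex:
import pandas as pd

chunk_starts = df.index[::chunk_size]  # First timestamp of each chunk
p_series = pd.Series(p_values, index=chunk_starts, name='adf_p_value')
print(p_series.head())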
5. Increase Available Memory
If none of the above options are sufficient, you could increase the memory available to the process by:
- Increasing your system's RAM (if possible).
- Using a machine with more memory (for example, if you're running the code on a local machine, try running it on a cloud instance with more RAM).
If you are using a 64-bit system, it may also be worth ensuring that you're using 64-bit Python to utilize more memory.
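A quick way to check is sys.maxsize, which is the documented idiom for detecting a 64-bit build (a 32-bit Python can address at most a few GB, regardless of installed RAM):
import sys

# True on a 64-bit Python build, False on a 32-bit one
print(sys.maxsize > 2**32)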
Summary
To solve the memory issue while applying the Augmented Dickey-Fuller test on a large time series:
- Use a rolling window approach to apply the ADF test on smaller subsets of the data.
- Downsample the data to a lower frequency (e.g., from minute-level to hourly or daily).
- Use a memory-efficient implementation by reducing the number of lags in the ADF test or looking for more efficient libraries.
- Parallelize the ADF computation to distribute the workload and reduce memory pressure.
- If necessary, increase available memory by upgrading your system or moving to a machine with more RAM.
These steps should help reduce memory usage and allow you to perform the ADF test without encountering a MemoryError.