I have a question about eliminating outliers from two-time series. One time series includes spot market prices and the other includes power outputs. The two series are from 2012 to 2016 and are both CSV files with the with a timestamp and then a value. As example for the power output: 2012-01-01 00:00:00,2335.2152646951617 and for the price: 2012-01-01 00:00:00,17.2
Because the spot market prices are very volatile and have a lot of outliers, I have filtered them. For the second time series, I have to delete the values with the same timestamp, which were eliminated in the time series of the prices. I thought about generating a list with the deleted values and writing a loop to delete the values with the same timestamp in the second time series. But so far that has not worked and I'm not really on. Does anyone have an idea?
My python code looks as follow:
import pandas as pd
import matplotlib.pyplot as plt
power_output = pd.read_csv("./data/external/power_output.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(power_output.head())
plt.plot(power_output)
spotmarket = pd.read_csv("./data/external/spotmarket_dhp.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(spotmarket.head())
r = spotmarket['price'].pct_change().dropna() * 100
print(r)
plt.plot(r)
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1-2*(Q3-Q1)
q3 = Q3+2*(Q3-Q1)
a = r[r.between(q1, q3)]
print(a)
plt.plot(a)
Can somebody help me?
Answer
To handle outliers in two time series (power output and spot market prices) where you want to delete the values with the same timestamp that were filtered out from the first series, you can follow this approach more systematically:
1. Read the Data: You’ve already read in your CSVs with pd.read_csv()
and properly parsed the datetime column as the index, which is good.
2. Identify Outliers in the Spot Market Time Series: You’ve correctly computed the interquartile range (IQR) for the spot market prices and filtered out the outliers. This filtering seems fine.
3. Filter Power Output Using the Same Timestamps: Once you have filtered the outliers from the spot market series (based on the IQR), you can remove the corresponding rows from the power output series using their shared timestamp.
Here’s an updated version of your code that addresses the issue of deleting rows with the same timestamp in the power output data:
import pandas as pd
import matplotlib.pyplot as plt
# Read the CSVs
power_output = pd.read_csv("./data/external/power_output.csv", delimiter=",", parse_dates=[0], index_col=[0])
spotmarket = pd.read_csv("./data/external/spotmarket_dhp.csv", delimiter=",", parse_dates=[0], index_col=[0])
# Spot market price percentage change and filtering outliers based on IQR
r = spotmarket['price'].pct_change().dropna() * 100 # Compute percentage change
# Calculate the IQR for outlier detection
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
IQR = Q3 - Q1
q1 = Q1 - 2 * IQR
q3 = Q3 + 2 * IQR
# Filter out the values between Q1-2*IQR and Q3+2*IQR
filtered_prices = r[(r >= q1) & (r <= q3)]
# Now filter the spotmarket and power_output dataframes based on the filtered timestamps
# Getting the timestamps where prices were not outliers
valid_timestamps = filtered_prices.index
# Filter both dataframes to only include rows with valid timestamps
filtered_power_output = power_output.loc[valid_timestamps]
filtered_spotmarket = spotmarket.loc[valid_timestamps]
# Plot the filtered power output and the filtered spot market data
plt.figure(figsize=(10, 5))
plt.subplot(2, 1, 1)
plt.plot(filtered_power_output)
plt.title("Filtered Power Output")
plt.subplot(2, 1, 2)
plt.plot(filtered_spotmarket)
plt.title("Filtered Spot Market Prices")
plt.tight_layout()
plt.show()
Key Changes:
-
Filtering Out Outliers: In your original code, you're filtering out the outliers from
r
(the percentage change of prices) and storing the result ina
. In the updated version, I’m using thebetween()
method to filter the price percentage change values (this part remains largely the same). -
Extracting Timestamps: Once you’ve identified the rows with valid percentage changes in
r
, you extract the corresponding timestamps (valid_timestamps
) by indexingr.index
. These are the timestamps that should remain in the data. -
Using the Same Timestamps for Both DataFrames: Using the
valid_timestamps
, you filter both thepower_output
andspotmarket
DataFrames to only include the rows that have matching timestamps. This ensures that you remove the power output data points corresponding to the filtered out spot market prices.
What This Code Does:
- Identifies the outliers in the spot market data using IQR and filters out those timestamps from both the spot market and power output series.
- Ensures that both time series are aligned, with the same set of timestamps, after filtering.
Plotting:
- The code also shows how you can plot both the filtered power output and spot market prices together, so you can visually confirm the results.
Key Assumptions:
- The spot market prices (
spotmarket['price']
) are the main series where you're detecting outliers. - The power output series (
power_output
) is filtered by the same timestamps as the ones removed from the spot market data.
This approach should allow you to correctly remove outliers from both time series while maintaining the alignment of timestamps between them.