I have a dataset (a bunch of csv files) which contains (anonymized) datetimes in the following form:
3202-11-11 14:51:00 EST
The dates have been shifted by some random time for each entity. So differences in time for a given entity are still meaningful.
Trying to convert these with e.g. pd.to_datetime(['3202-11-11 14:51:00 EST'], format='%Y-%m-%d %H:%M:%S EST') results in an 'OutOfBoundsDatetime' error.
For my use case it would be ideal to specify a number of years by which to shift all dates when reading the csv files, so that they fall within the valid pandas datetime range.
Do you have an idea how this could be solved efficiently? I have to do this on ~40k entities/csv files, with 10 to 10k such dates per csv. (my non-efficient idea: Go through python datetime which works for years till 9999, shift dates there and then convert to pandas datetime)
EDIT: I also asked this question in IRC #pydata and got this answer (thanks jboy):
>>> from datetime import timedelta
>>> offset = timedelta(days=10000)
>>> df
time
0 3001-01-01 01:00:01
1 3001-01-01 01:00:02
2 3001-01-01 01:00:05
3 3001-01-01 01:00:09
>>> df['time'].map(lambda t: t - offset)
0 2973-08-15 01:00:01
1 2973-08-15 01:00:02
2 2973-08-15 01:00:05
3 2973-08-15 01:00:09
Name: time, dtype: object
The only thing I had to do differently was:
df['time'].map(lambda t: datetime.datetime.strptime(t, '%Y-%m-%d %H:%M:%S EST')-offset)
Because my time column was still str and not datetime.datetime.
Answer
Shifting the dates by a fixed amount so that they land inside the valid pandas datetime range is a sound approach, provided the shift is large enough for every date in your data. Here's how you can proceed, combining your own idea with the response you received in IRC.
Steps to solve:
- Read the dates: Read the dates as strings from the CSV files.
- Shift the dates: Convert the strings to Python datetime objects and apply the shift. Pandas' default nanosecond-resolution datetime range only covers 1677-09-21 to 2262-04-11, so every date must fall inside that range after the shift.
- Convert to pandas datetime: Once shifted, convert the values to pandas datetimes with pd.to_datetime for further processing.
Solution
Here's an example of how you can handle this using Python's datetime and timedelta to shift the dates:
    import pandas as pd
    from datetime import datetime, timedelta

    # 1200 Gregorian years = 3 * 146,097 days (adjust as needed).
    # A small offset such as 10,000 days (~27 years) would leave
    # the sample dates below out of range.
    offset = timedelta(days=438291)

    # Function to convert and shift the datetime
    def adjust_datetime(date_str):
        try:
            # Convert string to a Python datetime (valid up to year 9999)
            dt = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S EST')
            # Apply the offset to bring the date within range
            return dt - offset
        except (ValueError, TypeError) as e:
            print(f"Error processing date {date_str}: {e}")
            return None

    # Example dataframe
    data = {'time': ['3202-11-11 14:51:00 EST', '3001-01-01 01:00:01 EST']}
    df = pd.DataFrame(data)

    # Apply the adjustment function to the 'time' column
    df['adjusted_time'] = df['time'].map(adjust_datetime)

    # Convert the adjusted column into pandas datetime
    df['adjusted_time'] = pd.to_datetime(df['adjusted_time'])
    print(df)

Explanation:
- adjust_datetime(date_str): This function parses the datetime string and subtracts the offset. Unparseable values are reported and returned as None, which pd.to_datetime turns into NaT.
- timedelta(days=438291): The offset is exactly 1200 Gregorian years, which moves the two sample dates to 2002-11-11 and 1801-01-01. Adjust it to your dataset, but keep it large enough that every shifted date lands inside the pandas range; otherwise the final conversion raises OutOfBoundsDatetime again.
- map(adjust_datetime): We use map to apply the adjustment function to each date in the column.
- pd.to_datetime(df['adjusted_time']): After shifting, this converts the column of Python datetime objects to pandas' datetime64 dtype for further processing.
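One way to pick the offset is to shift by a whole multiple of 400 years. The Gregorian calendar repeats every 400 years (146,097 days, which is exactly 20,871 weeks), so month, day, time of day, and weekday are all preserved, and differences between two dates of the same entity are unchanged. The sketch below uses only the standard library; the helper name shift_cycles and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta

DAYS_PER_400_YEARS = 146_097  # length of one full Gregorian cycle


def shift_cycles(dt, cycles=3):
    """Move dt back by `cycles` full 400-year Gregorian cycles."""
    return dt - timedelta(days=cycles * DAYS_PER_400_YEARS)


a = datetime(3202, 11, 11, 14, 51, 0)
b = datetime(3204, 2, 29, 8, 0, 0)  # a leap day in the anonymized data

shifted_a = shift_cycles(a)
shifted_b = shift_cycles(b)

print(shifted_a)  # 2002-11-11 14:51:00
print(shifted_b)  # 2004-02-29 08:00:00
print((b - a) == (shifted_b - shifted_a))  # True
```

With three cycles the shift is 438,291 days, the same value used as the offset for dates in the 3000s.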
Efficient Processing for Large Datasets
Given your use case (~40k files with 10 to 10k dates each), the row-wise approach above is workable because each file is small. If it becomes a bottleneck, you can optimize it by:
- Batch processing: The files are independent of each other, so instead of processing them one by one, consider loading and processing them in parallel (e.g., using concurrent.futures or joblib).
- Adjusting while reading: pd.to_datetime() with the format argument is pandas' fastest parsing path, but it cannot be applied to the raw strings because they are out of bounds. You can instead hand the adjustment function to read_csv through the converters argument. For example: df = pd.read_csv('your_file.csv', converters={'your_date_column': adjust_datetime})
This way, you can read and shift the dates in one go; the column then holds Python datetime objects, which pd.to_datetime() can convert once they are in range. (The date_parser argument of read_csv is deprecated since pandas 2.0, so it is best avoided here.)
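If the per-row strptime calls turn out to be the slow part, a vectorized alternative is to rewrite the year directly in the string and then let pd.to_datetime parse the whole column at once. This is only a sketch under assumptions: every value has a 4-digit year at the start and the same ' EST' suffix, the column is called 'time', and a shift of 1200 years (a multiple of 400, so leap days stay valid) brings all dates into range. The name shift_and_parse is hypothetical.

```python
import io

import pandas as pd

YEAR_SHIFT = 1200  # assumption: multiple of 400 that brings all years in range


def shift_and_parse(col, year_shift=YEAR_SHIFT):
    """Subtract year_shift from the leading 4-digit year, then parse in bulk."""
    text = col.str.replace(' EST', '', regex=False)
    years = text.str[:4].astype(int) - year_shift
    shifted = years.astype(str).str.zfill(4) + text.str[4:]
    return pd.to_datetime(shifted, format='%Y-%m-%d %H:%M:%S')


# Stand-in for one of the CSV files.
csv_text = "time\n3202-11-11 14:51:00 EST\n3001-01-01 01:00:01 EST\n"
df = pd.read_csv(io.StringIO(csv_text))
df['time'] = shift_and_parse(df['time'])
print(df)
```

The time zone suffix is simply dropped here, which is only appropriate if all timestamps share the same zone.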
Additional Considerations:
- Handling Time Zones: If your dates include time zones (like EST), you can use pytz to handle time zone conversion if needed. However, if all timestamps carry the same EST suffix and only differences within an entity matter, you can drop it unless you need to keep the time zone information.
- Valid Range: The pandas nanosecond-resolution datetime range runs from 1677-09-21 to 2262-04-11. Ensure that the shifted dates fall within this range after applying the offset; any date that still falls outside it will raise OutOfBoundsDatetime during conversion, so check the minimum and maximum year of your data before choosing the offset.
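The range check can be folded into the choice of the shift itself. The sketch below looks for a multiple of 400 years that moves a given span of years inside conservative whole-year bounds (1678 to 2261, chosen to sit safely inside the pandas limits). The function name choose_year_shift and the bounds constants are hypothetical, and it assumes the anonymized years lie above the valid range.

```python
# Conservative whole-year bounds inside the pandas nanosecond range
# (1677-09-21 to 2262-04-11).
SAFE_MIN_YEAR = 1678
SAFE_MAX_YEAR = 2261


def choose_year_shift(min_year, max_year):
    """Return a multiple of 400 that moves [min_year, max_year] into the safe range."""
    for shift in range(0, 10_000, 400):
        if min_year - shift >= SAFE_MIN_YEAR and max_year - shift <= SAFE_MAX_YEAR:
            return shift
    raise ValueError(
        f"no multiple of 400 fits years {min_year}-{max_year} into "
        f"{SAFE_MIN_YEAR}-{SAFE_MAX_YEAR}"
    )


shift = choose_year_shift(3001, 3202)
print(shift)  # 1200
```

If the years of a single entity span more than the safe range allows, the function raises instead of silently producing out-of-bounds dates.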
By applying this solution, you'll be able to efficiently process your large datasets while handling out-of-bounds datetime values.