Python/Numpy: problems with type conversion in vectorize and item
I am writing a function to extract values from datetimes over arrays. I want the function to operate on a Pandas DataFrame or a numpy ndarray.
The values should be returned in the same way as the Python datetime properties, e.g.
from datetime import datetime
dt = datetime(2016, 10, 12, 13)
dt.year
=> 2016
dt.second
=> 0
For a DataFrame this is reasonably easy to handle using applymap() (although there may well be a better way). I tried the same approach for numpy ndarrays using vectorize(), and I'm running into problems. Instead of the values I was expecting, I end up with very large integers, sometimes negative.
This was pretty baffling at first, but I figured out what is happening: the vectorized function is using item instead of __get__ to get the values out of the ndarray. This seems to automatically convert each datetime64 object to a long:
nd[1][0]
=> numpy.datetime64('1986-01-15T12:00:00.000000000')
nd[1].item()
=> 506174400000000000L
The long seems to be the number of nanoseconds since epoch (1970-01-01T00:00:00). Somewhere along the line the values are converted to integers and they overflow, hence the negative numbers.
So that's the problem. Please can someone help me fix it? The only thing I can think of is doing the conversion manually, but this would effectively mean reimplementing a chunk of the datetime module.
Is there some alternative to vectorize that doesn't use item()?
Thanks!
Minimal code example:
## DataFrame works fine
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'dts': [datetime(1970, 1, 1, 1), datetime(1986, 1, 15, 12),
datetime(2016, 7, 15, 23)]})
exp = pd.DataFrame({'dts': [1, 15, 15]})
df_func = lambda x: x.day
out = df.applymap(df_func)
assert out.equals(exp)
## numpy ndarray is more difficult
from numpy import datetime64 as dt64, timedelta64 as td64, vectorize # for brevity
# The unary function is a little more complex, especially for days and months where the minimum value is 1
nd_func = lambda x: int((dt64(x, 'D') - dt64(x, 'M') + td64(1, 'D')) / td64(1, 'D'))
nd = df.as_matrix()
exp = exp.as_matrix()
=> array([[ 1],
[15],
[15]])
# The function works as expected on a single element...
assert nd_func(nd[1][0]) == 15
# ...but not on an ndarray
nd_vect = vectorize(nd_func)
out = nd_vect(nd)
=> array([[ -105972749999999],
[ 3546551532709551616],
[-6338201187830896640]])
Answer
The values you are seeing come from how numpy.datetime64 values are stored and converted. Each one is a 64-bit integer count of time units since the Unix epoch, and your array uses nanosecond resolution (datetime64[ns]), which is what pandas produces. Python's datetime only goes down to microseconds, so when vectorize converts its input to Python objects, or when you call item() yourself, a nanosecond value cannot become a datetime and comes back as a plain integer instead. Your function then passes that nanosecond count to dt64(x, 'D') and dt64(x, 'M'), which read it as a number of days or months, and that is where the huge, sometimes negative results come from.
To solve this, either cast to microsecond resolution or coarser before anything converts the values to Python objects, or skip the per-element function entirely and work on the whole array with NumPy's datetime64 arithmetic or pandas' .dt accessor and DatetimeIndex.
Solution: Using NumPy's datetime64 and timedelta64 functionality
NumPy has no per-component accessors such as .day on datetime64 values, and np.datetime64 and np.timedelta64 are scalar types rather than functions that operate on arrays. Components such as year, month and day are obtained either by casting the array to a coarser unit and subtracting, or by converting the values to Python datetime objects at a resolution that datetime can represent (microseconds or coarser).
Here's how you can extract the properties of datetime64 values from both NumPy ndarrays and Pandas DataFrames:
1. Using NumPy with datetime64:
The following example extracts the day of the month from a small datetime64 array:
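import numpy as np
# Example ndarray of datetime64 values at nanosecond resolution (what pandas produces)
dates = np.array(['1970-01-01T01:00', '1986-01-15T12:00', '2016-07-15T23:00'], dtype='datetime64[ns]')
# Python datetimes stop at microseconds, so cast down before converting to objects
as_objects = dates.astype('datetime64[us]').astype(object)
# Function to extract the day of the month from a datetime.datetime
def extract_day(dt):
    return dt.day
# Apply the function using vectorization
day_vect = np.vectorize(extract_day)
# Get the days for the converted array
out = day_vect(as_objects)
print(out) # Output: [ 1 15 15]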
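If you would rather not go through Python objects at all, the components can also be computed with array arithmetic: truncate to a coarser unit and subtract. The following is a minimal sketch of that idea, not the only way to do it; the variable names are illustrative, and it assumes the input is a datetime64 array of any resolution finer than a day.

```python
import numpy as np

dates = np.array(
    ['1970-01-01T01:00', '1986-01-15T12:00', '2016-07-15T23:00'],
    dtype='datetime64[ns]',
)

# Truncate to coarser units; each cast drops everything below that unit
years = dates.astype('datetime64[Y]')
months = dates.astype('datetime64[M]')
days = dates.astype('datetime64[D]')
hours = dates.astype('datetime64[h]')

# datetime64[Y] is stored as the number of years since 1970
year = years.astype('int64') + 1970
# Months elapsed since the start of the year, shifted to the range 1..12
month = (months - years.astype('datetime64[M]')).astype('int64') + 1
# Days elapsed since the start of the month, shifted to the range 1..31
day = (days - months.astype('datetime64[D]')).astype('int64') + 1
# Hours elapsed since midnight, already in the range 0..23
hour = (hours - days.astype('datetime64[h]')).astype('int64')

print(year, month, day, hour)
```

Everything here stays inside NumPy, so no value is ever handed to Python as a raw nanosecond count.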
2. Using Pandas DataFrame:
A datetime64 column of a DataFrame is a Series, and its .dt accessor exposes properties like .year, .month, .day, etc., which are computed for the whole column at once. A DatetimeIndex has the same properties directly. To cover every column of a DataFrame, pass a function that uses .dt to apply().
Here's how you can modify your DataFrame example:
import pandas as pd
# Create DataFrame with datetime objects
df = pd.DataFrame({'dts': pd.to_datetime(['1970-01-01 01:00', '1986-01-15 12:00', '2016-07-15 23:00'])})
# Extract day using .dt accessor for vectorized datetime operations
df['day'] = df['dts'].dt.day
# Check the output
print(df)
# Expected Output:
# dts day
# 0 1970-01-01 01:00:00 1
# 1 1986-01-15 12:00:00 15
# 2 2016-07-15 23:00:00 15
Summary of Key Changes:
- For NumPy: datetime64 values have no .day or .year attributes. Either cast to microsecond resolution or coarser before converting to Python datetime objects, or compute the component with array arithmetic on coarser units.
- For Pandas: Use the .dt accessor to perform operations directly on datetime64 columns, which avoids manual conversion and overflow issues.
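Since the question asks for one function that accepts either a DataFrame or an ndarray, the two points above can be combined roughly as follows. This is a sketch under assumptions: extract_field is a hypothetical helper name, every DataFrame column is assumed to be datetime-typed, and the ndarray is assumed to be castable to datetime64[ns].

```python
import numpy as np
import pandas as pd


def extract_field(data, field):
    """Return a datetime component such as 'day' or 'year' for every element.

    Hypothetical helper: accepts a DataFrame of datetime columns or an
    ndarray of datetime64 values and returns the same kind of container.
    """
    if isinstance(data, pd.DataFrame):
        # Column by column, via the vectorized .dt accessor
        return data.apply(lambda col: getattr(col.dt, field))
    arr = np.asarray(data, dtype='datetime64[ns]')
    # DatetimeIndex is one-dimensional, so flatten first and restore the shape after
    values = getattr(pd.DatetimeIndex(arr.ravel()), field)
    return np.asarray(values).reshape(arr.shape)


df = pd.DataFrame({'dts': pd.to_datetime(
    ['1970-01-01 01:00', '1986-01-15 12:00', '2016-07-15 23:00'])})
nd = df.to_numpy()  # shape (3, 1), dtype datetime64[ns]

print(extract_field(df, 'day'))
print(extract_field(nd, 'day'))
```

The integer dtype of the result may differ between pandas versions, so compare values rather than dtypes if you test against an expected frame.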
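Why item() causes the issue:
What .item() returns depends on the resolution of the datetime64 value. At microsecond resolution or coarser it returns a Python datetime (or a date for day resolution and coarser), but at nanosecond resolution it returns the raw integer count of nanoseconds since the Unix epoch, because datetime cannot represent nanoseconds. The same conversion happens when an array is cast to object dtype, which is what vectorize does to its inputs. Feeding those integers back into datetime64 with a different unit gives the negative or incorrect results. Operating on the datetime64 array directly with NumPy or Pandas avoids the conversion altogether.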
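The difference is easy to check on a single value. The short sketch below uses the 1986 timestamp from the question; the variable names are illustrative.

```python
import datetime

import numpy as np

stamp_ns = np.datetime64('1986-01-15T12:00:00', 'ns')
stamp_us = stamp_ns.astype('datetime64[us]')

# Nanosecond resolution: item() falls back to the raw integer count
as_int = stamp_ns.item()
# Microsecond resolution: item() can build a real datetime.datetime
as_datetime = stamp_us.item()

# Casting an array to object dtype applies the same rule element by element
from_object_array = np.array([stamp_ns]).astype(object)[0]

print(type(as_int).__name__, as_int)
print(type(as_datetime).__name__, as_datetime.day)
print(type(from_object_array).__name__)
```

The integer printed on the first line is the same 506174400000000000 shown in the question, without the Python 2 L suffix.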
This should resolve the issue you're facing while also maintaining efficiency.