Python/Numpy: problems with type conversion in vectorize and item

I am writing a function to extract values from datetimes stored in arrays. I want the function to operate on both a Pandas DataFrame and a numpy ndarray.

The values should be returned in the same way as the Python datetime properties, e.g.

from datetime import datetime
dt = datetime(2016, 10, 12, 13)
dt.year
  => 2016
dt.second
  => 0

For a DataFrame this is reasonably easy to handle using applymap() (although there may well be a better way). I tried the same approach for numpy ndarrays using vectorize(), and I'm running into problems. Instead of the values I was expecting, I end up with very large integers, sometimes negative.

This was pretty baffling at first, but I figured out what is happening: the vectorized function pulls values out of the ndarray with item() rather than by indexing (__getitem__), and for nanosecond-resolution datetime64 values item() returns a plain Python long:

nd[1][0]
  => numpy.datetime64('1986-01-15T12:00:00.000000000')
nd[1].item()
  => 506174400000000000L

The long is the number of nanoseconds since the epoch (1970-01-01T00:00:00). My function then operates on these huge integers instead of datetimes, and somewhere along the line the arithmetic overflows, hence the negative numbers.
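As a sanity check, the long converts back to the datetime I expect once it's scaled from nanoseconds to seconds:

from datetime import datetime, timedelta
datetime(1970, 1, 1) + timedelta(seconds=506174400000000000 // 10**9)
  => datetime.datetime(1986, 1, 15, 12, 0)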

So that's the problem. Please can someone help me fix it? The only thing I can think of is doing the conversion manually, but this would effectively mean reimplementing a chunk of the datetime module.

Is there some alternative to vectorize that doesn't use item()?

Thanks!

Minimal code example:

## DataFrame works fine
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'dts': [datetime(1970, 1, 1, 1), datetime(1986, 1, 15, 12),
                         datetime(2016, 7, 15, 23)]})
exp = pd.DataFrame({'dts': [1, 15, 15]})

df_func = lambda x: x.day    
out = df.applymap(df_func)

assert out.equals(exp)

## numpy ndarray is more difficult
from numpy import datetime64 as dt64, timedelta64 as td64, vectorize  # for brevity

# The unary function is a little more complex, especially for days and months where the minimum value is 1
nd_func = lambda x: int((dt64(x, 'D') - dt64(x, 'M') + td64(1, 'D')) / td64(1, 'D'))

nd = df.values   # as_matrix() in older pandas; .values / .to_numpy() return the same ndarray
exp = exp.values
  => array([[ 1],
            [15],
            [15]])

# The function works as expected on a single element...
assert nd_func(nd[1][0]) == 15

# ...but not on an ndarray
nd_vect = vectorize(nd_func)
out = nd_vect(nd)
  => array([[    -105972749999999],
            [ 3546551532709551616],
            [-6338201187830896640]])

Answer

The issue stems from how numpy.vectorize hands values to your function: it effectively extracts each element with item(), and for nanosecond-resolution datetime64 values item() returns a plain Python integer (the number of nanoseconds since the Unix epoch) rather than a datetime. Your nd_func then does its day/month arithmetic on those huge integers, and the results overflow, which is why you see enormous and negative values.

There are two ways out: keep the data at microsecond or coarser resolution, where the same conversion yields datetime.datetime objects with the usual .year/.month/.day attributes, or skip per-element extraction entirely and work on whole arrays with numpy's datetime64/timedelta64 arithmetic or pandas' DatetimeIndex, which is designed for exactly this kind of vectorized datetime access.

Solution: Using Numpy's datetime64 and timedelta64 functionality

Instead of letting the values pass through nanosecond integers, you can keep np.vectorize but feed it datetime64 data at a resolution that converts to real datetime objects, or drop vectorize entirely and compute components such as year, month and day with datetime64/timedelta64 arithmetic over the whole array. Either way the overflow disappears.

Here's how you can efficiently extract the properties of datetime64 objects from both Pandas DataFrames and NumPy ndarrays:

1. Using NumPy with datetime64:

The important detail is the unit. For datetime64 data with microsecond or coarser resolution, converting an element to a Python object (which is what item() and np.vectorize both do) gives a datetime.datetime, so the usual attributes (.year, .month, .day, ...) are available inside the vectorized function.

Here's how you can modify your code so that np.vectorize sees datetime objects instead of nanosecond integers:

import numpy as np

# Example ndarray of datetime64 objects
dates = np.array(['1970-01-01T01:00', '1986-01-15T12:00', '2016-07-15T23:00'], dtype='datetime64[m]')
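# Note: if your array has nanosecond resolution (as the one you pulled out of the
# DataFrame does), cast it to a coarser unit first, e.g. dates = nd.astype('datetime64[us]'),
# so that the elements convert to datetime.datetime rather than to integers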

# Function to extract the day of the month. Depending on how np.vectorize hands
# the element over, dt may be a np.datetime64 scalar or already a datetime.datetime,
# so cover both cases.
def extract_day(dt):
    if isinstance(dt, np.datetime64):
        dt = dt.item()  # datetime.datetime for microsecond or coarser resolution
    return dt.day

# Apply the function using vectorization
day_vect = np.vectorize(extract_day)

# Get the days for the datetime64 array
out = day_vect(dates)

print(out)  # Output: [1 15 15]
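If you'd rather avoid np.vectorize altogether, the same day-of-month arithmetic as your original nd_func can be applied to the whole array in one go; a minimal sketch using only datetime64/timedelta64 casts:

import numpy as np

dates = np.array(['1970-01-01T01:00', '1986-01-15T12:00', '2016-07-15T23:00'],
                 dtype='datetime64[ns]')

# Truncate to day and to month precision, subtract, and add 1 so that the
# first of the month comes out as 1 rather than 0
days = (dates.astype('datetime64[D]') - dates.astype('datetime64[M]')).astype(int) + 1

print(days)  # Output: [ 1 15 15]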

2. Using Pandas DataFrame:

Pandas datetime columns expose properties like .year, .month, and .day through the .dt accessor, and these are vectorized, so there is no need for applymap() or apply() at all.

Here's how you can modify your DataFrame example:

import pandas as pd

# Create DataFrame with datetime objects
df = pd.DataFrame({'dts': pd.to_datetime(['1970-01-01 01:00', '1986-01-15 12:00', '2016-07-15 23:00'])})

# Extract day using .dt accessor for vectorized datetime operations
df['day'] = df['dts'].dt.day

# Check the output
print(df)

# Expected Output:
#                 dts  day
# 0 1970-01-01 01:00:00    1
# 1 1986-01-15 12:00:00   15
# 2 2016-07-15 23:00:00   15
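The .dt accessor needs a Series or DataFrame column; if all you have is the bare ndarray, pandas' DatetimeIndex gives the same vectorized access. A sketch for your (3, 1) ndarray:

import numpy as np
import pandas as pd

nd = np.array(['1970-01-01T01:00', '1986-01-15T12:00', '2016-07-15T23:00'],
              dtype='datetime64[ns]').reshape(-1, 1)

# DatetimeIndex expects a 1-D input, so flatten, take .day, then restore the shape
days = pd.DatetimeIndex(nd.ravel()).day.values.reshape(nd.shape)

print(days)
# Output:
# [[ 1]
#  [15]
#  [15]]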

Summary of Key Changes:

  1. For NumPy: either compute the components with datetime64/timedelta64 arithmetic over the whole array, or keep np.vectorize but make sure the elements your function receives are datetime.datetime objects (microsecond or coarser resolution), not nanosecond integers.
  2. For Pandas: Use the .dt accessor to perform operations directly on datetime64 columns, which avoids manual conversion and overflow issues.

Why item() causes the issue:

For a nanosecond-resolution datetime64, .item() yields a plain Python integer (the number of nanoseconds since the Unix epoch) rather than a datetime. Your original nd_func then runs its day/month arithmetic on those huge integers and the results overflow, which is where the large negative numbers come from. At microsecond or coarser resolution the same conversion yields datetime.datetime objects, and operating on whole arrays with NumPy or Pandas never goes through the integer representation at all.
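A quick way to see the resolution dependence for yourself:

import numpy as np

ts_ns = np.datetime64('1986-01-15T12:00', 'ns')
ts_us = np.datetime64('1986-01-15T12:00', 'us')

print(repr(ts_ns.item()))  # 506174400000000000 -- a plain integer (nanoseconds since the epoch)
print(repr(ts_us.item()))  # datetime.datetime(1986, 1, 15, 12, 0) -- a real datetime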

This should resolve the issue you're facing while also maintaining efficiency.