SMOTE is giving array size / ValueError for all-categorical data

ghz 12hours ago ⋅ 5 views

I am using SMOTE-NC for oversampling my categorical data. I have only 1 feature and 10500 samples.

When I run the code below, I get this error:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-151-a261c423a6d8> in <module>()
     16 print(X_new.shape) # (10500, 1)
     17 print(X_new)
---> 18 sm.fit_sample(X_new, Y_new)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     81         )
     82 
---> 83         output = self._fit_resample(X, y)
     84 
     85         y_ = (label_binarize(output[1], np.unique(y))

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\over_sampling\_smote.py in _fit_resample(self, X, y)
    926 
    927         X_continuous = X[:, self.continuous_features_]
--> 928         X_continuous = check_array(X_continuous, accept_sparse=["csr", "csc"])
    929         X_minority = _safe_indexing(
    930             X_continuous, np.flatnonzero(y == class_minority)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    592                              " a minimum of %d is required%s."
    593                              % (n_features, array.shape, ensure_min_features,
--> 594                                 context))
    595 
    596     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.

Code:

import numpy as np
from imblearn.over_sampling import SMOTENC

sm = SMOTENC(random_state=27, categorical_features=[0])

X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())

print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)

X_new = np.reshape(X_new, (-1, 1))  # SMOTE requires a 2-D array, so reshape X_new to (n_samples, 1)

print(X_new.shape) # (10500, 1)
print(X_new)
sm.fit_sample(X_new, Y_new)

If I understand correctly, the shape of X_new should be (n_samples, n_features), which is 10500 x 1. I am not sure why the ValueError reports it as shape=(10500, 0).

Can someone please help me here?

Answer

The error (ValueError: Found array with 0 feature(s)) does not come from the shape of X_new. Your array really is (10500, 1), and converting with .values.tolist() followed by np.reshape is harmless. The zero-feature array is created inside SMOTENC itself.

The traceback shows the line responsible: X_continuous = X[:, self.continuous_features_]. SMOTENC splits the columns into categorical and continuous ones and validates the continuous part with check_array. With categorical_features=[0] and only one column, no continuous columns remain, so X_continuous has shape (10500, 0) and check_array raises the error. SMOTENC is meant for datasets that mix categorical and continuous features; it cannot be used when every feature is categorical.

To resolve this, use an oversampler that supports purely categorical data. imbalanced-learn 0.8 and later provide SMOTEN for this case. RandomOverSampler is another option and is also available in older versions.

Fixing the Code

Here is an updated version of your code that uses SMOTEN:

from imblearn.over_sampling import SMOTEN
import numpy as np

# Assuming X_train holds the single categorical feature (pandas Series or
# one-column DataFrame) and y_train holds the labels
X_new = np.asarray(X_train).reshape(-1, 1)  # 2D array: (n_samples, 1)
Y_new = np.asarray(y_train).ravel()  # 1D array: (n_samples,)

# Check the shapes before proceeding
print(X_new.shape)  # Should be (10500, 1)
print(Y_new.shape)  # Should be (10500,)

# SMOTEN works on data where every feature is categorical
sm = SMOTEN(random_state=27)

# Fit and resample
X_resampled, Y_resampled = sm.fit_resample(X_new, Y_new)

# Check the resampled data
print(X_resampled.shape)  # New shape after oversampling
print(Y_resampled.shape)

Key Changes:

  1. SMOTEN instead of SMOTENC: SMOTEN is designed for data where all features are categorical, so it takes no categorical_features parameter and does not need a continuous column.

  2. sm.fit_resample(X_new, Y_new): The method to call in imblearn is fit_resample. fit_sample is a deprecated alias that newer releases no longer provide.

  3. Reshaping: The reshape to (-1, 1) is still required. Your printed shape of (10500,) shows that the single feature converts to a 1D array, while samplers expect X as a 2D array of shape (n_samples, n_features).

  4. Version: SMOTEN requires imbalanced-learn 0.8 or later. If you cannot upgrade, RandomOverSampler(random_state=27) from imblearn.over_sampling can be dropped in instead. It duplicates existing minority samples rather than synthesizing new ones.

Explanation of the Shapes:

  • X_new.shape should be (10500, 1) because you have 10500 samples, each with one feature.
  • Y_new.shape should be (10500,) because y_train is a 1D array containing the target labels for each sample.
  • The (10500, 0) in the error message is the shape of the continuous-only slice that SMOTENC builds internally, not the shape of X_new.

With these changes, the oversampling should run on your all-categorical data.
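
The (10500, 0) shape reported in the traceback can be reproduced with NumPy alone. The sketch below is a simplified model of the slicing step visible in the traceback (X[:, self.continuous_features_]), not imbalanced-learn's actual code. The helper name continuous_columns and the category values are made up for illustration, and it assumes the continuous columns are whatever remains after the categorical indices are removed.

```python
import numpy as np


def continuous_columns(n_features, categorical_features):
    """Hypothetical helper: indices of the columns not listed as categorical."""
    return np.setdiff1d(np.arange(n_features), categorical_features)


# One categorical column with 10500 rows (illustrative values only).
X = np.array(["a", "b", "c"] * 3500, dtype=object).reshape(-1, 1)

# Same setting as in the question: the only column is marked categorical.
cont_idx = continuous_columns(X.shape[1], [0])

# Selecting the continuous columns leaves nothing behind.
X_continuous = X[:, cont_idx]

print(X.shape)             # (10500, 1)
print(X_continuous.shape)  # (10500, 0)
```

If X had a second, numeric column and categorical_features were still [0], cont_idx would contain index 1 and the slice would keep one column, which is the mixed-type situation SMOTENC expects.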