SMOTE is giving array size / ValueError for all-categorical dataset
I am using SMOTE-NC for oversampling my categorical data. I have only 1 feature and 10500 samples.
While running the below code, I am getting the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-151-a261c423a6d8> in <module>()
16 print(X_new.shape) # (10500, 1)
17 print(X_new)
---> 18 sm.fit_sample(X_new, Y_new)
~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
81 )
82
---> 83 output = self._fit_resample(X, y)
84
85 y_ = (label_binarize(output[1], np.unique(y))
~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\over_sampling\_smote.py in _fit_resample(self, X, y)
926
927 X_continuous = X[:, self.continuous_features_]
--> 928 X_continuous = check_array(X_continuous, accept_sparse=["csr", "csc"])
929 X_minority = _safe_indexing(
930 X_continuous, np.flatnonzero(y == class_minority)
~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
592 " a minimum of %d is required%s."
593 % (n_features, array.shape, ensure_min_features,
--> 594 context))
595
596 if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:
ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.
Code:
import numpy as np
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(random_state=27,categorical_features=[0,])
X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())
print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)
X_new = np.reshape(X_new, (-1, 1)) # SMOTE requires a 2-D array, hence changing the shape of X_new
print(X_new.shape) # (10500, 1)
print(X_new)
sm.fit_sample(X_new, Y_new)
If I understand correctly, the shape of X_new should be (n_samples, n_features), which is 10500 x 1. I am not sure why the ValueError reports it as shape=(10500, 0).
Can someone please help me here?
Answer
The error (ValueError: Found array with 0 feature(s) (shape=(10500, 0))) is not about the shape of X_new. Your array really is (10500, 1), and your np.reshape call did its job. The empty array is created inside SMOTENC, as the traceback shows:
X_continuous = X[:, self.continuous_features_]
SMOTENC treats every column that is not listed in categorical_features as continuous. With a single column and categorical_features=[0], no continuous columns are left, so this slice has shape (10500, 0) and check_array rejects it.
SMOTENC is designed for datasets that mix categorical and continuous features, and it needs at least one continuous feature to find nearest neighbours. It cannot be used on an all-categorical dataset, so no change to how X_new is built will make this call succeed.
Fixing the Code
For data that is entirely categorical, imbalanced-learn 0.8 and later provide SMOTEN, a SMOTE variant for nominal features. Here is an updated version of your code, assuming you can upgrade to a release that includes it:
import numpy as np
from imblearn.over_sampling import SMOTEN # requires imbalanced-learn >= 0.8
# X_train holds the single categorical feature, y_train the class labels
X_new = np.asarray(X_train).reshape(-1, 1) # 2D array, shape (n_samples, n_features)
Y_new = np.asarray(y_train).ravel() # 1D array of labels
# Check the shapes before proceeding
print(X_new.shape) # Should be (10500, 1) if you have 1 feature
print(Y_new.shape) # Should be (10500,)
# SMOTEN treats every column as categorical, so it takes no categorical_features argument
sm = SMOTEN(random_state=27)
# Fit and resample, keeping the returned arrays
X_resampled, Y_resampled = sm.fit_resample(X_new, Y_new)
# Check the resampled data
print(X_resampled.shape) # Should be the new shape after oversampling
print(Y_resampled.shape)
Key Changes:
- SMOTEN instead of SMOTENC: SMOTENC needs at least one continuous feature. SMOTEN handles datasets in which every feature is categorical, so categorical_features is no longer passed.
- sm.fit_resample(X_new, Y_new): fit_resample is the current method name in imblearn. fit_sample is a deprecated alias that later releases removed.
- Reshaping: the reshape to (-1, 1) is still needed, because a single feature taken from a pandas Series is 1D. Your original np.reshape call was already correct; going through .values.tolist() was only an unnecessary detour, not the cause of the error.
- Keeping the result: fit_resample returns the resampled arrays, so assign them (X_resampled, Y_resampled) instead of discarding the return value.
Explanation of the Shapes:
X_new.shape should be (10500, 1) because you have 10500 samples, each with one feature. Y_new.shape should be (10500,) because y_train is a 1D array containing the target label for each sample. The (10500, 0) in the error message is the shape of the continuous-feature slice that SMOTENC builds internally, not the shape of X_new.
If you cannot upgrade, RandomOverSampler from imblearn.over_sampling is a simpler alternative that duplicates existing minority-class rows and does not depend on the feature type. Keep in mind that with a single categorical feature, either approach can only produce category values that already occur in the data.
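The shape=(10500, 0) in the traceback can be reproduced with NumPy alone. The sketch below is not SMOTENC's actual code. It assumes, based on the X[:, self.continuous_features_] line in the traceback, that the continuous columns are whatever remains after removing the indices in categorical_features. The helper continuous_columns and the toy array X are hypothetical names used only for illustration.

```python
import numpy as np


def continuous_columns(n_features, categorical_features):
    """Return the column indices that are NOT listed as categorical."""
    return np.setdiff1d(np.arange(n_features), categorical_features)


# Toy stand-in for the asker's data: one categorical column, six samples.
X = np.array(["a", "b", "a", "c", "b", "a"], dtype=object).reshape(-1, 1)

# Same setting as in the question: the only column is marked categorical.
cont_idx = continuous_columns(X.shape[1], [0])
X_continuous = X[:, cont_idx]

print(X.shape)             # (6, 1)
print(cont_idx)            # []
print(X_continuous.shape)  # (6, 0)

# A simple guard that could run before choosing a sampler.
if cont_idx.size == 0:
    print("No continuous columns: SMOTENC is not applicable to this input.")
```

The input keeps its (n_samples, 1) shape throughout; only the slice of continuous columns is empty, which matches the array that check_array complains about.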