GridSearchCV on a working pipeline returns ValueError

ghz 11 hours ago ⋅ 6 views

I am using GridSearchCV in order to find the best parameters for my pipeline.

My pipeline seems to work well as I can apply:

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

And I get a decent result.

But GridSearchCV obviously doesn't like something, and I cannot figure it out.

My pipeline:

feats = FeatureUnion([('age', age),
                      ('education_num', education_num),
                      ('is_education_favo', is_education_favo),
                      ('is_marital_status_favo', is_marital_status_favo),
                      ('hours_per_week', hours_per_week),
                      ('capital_diff', capital_diff),
                      ('sex', sex),
                      ('race', race),
                      ('native_country', native_country)
                     ])

pipeline = Pipeline([
        ('adhocFC',AdHocFeaturesCreation()),
        ('imputers', KnnImputer(target = 'native-country', n_neighbors = 5)),
        ('features',feats),('clf',LogisticRegression())])

My GridSearch:

hyperparameters = {'imputers__n_neighbors' : [5,21,41], 'clf__C' : [1.0, 2.0]}

GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False

GSCV.fit(X_train, y_train)

I receive 11 similar warnings:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

and this is the error message:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:14: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-05f7c4f5167d> in <module>()
      3 GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' ,refit = False) #change n_jobs = 2, refit = False
      4 
----> 5 GSCV.fit(X_train, y_train)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
    943             train/test set.
    944         """
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
    946 
    947 

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in _fit(self, X, y, groups, parameter_iterable)
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
    565           for train, test in cv_iter)
    566 

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

Answer

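SettingWithCopyWarning is a warning, not the error: what stops the run is the ValueError at the end of the traceback. The warning is still worth fixing. It means that code in one of your custom steps assigns into a DataFrame that pandas believes was obtained by slicing another DataFrame, so pandas cannot tell whether you meant to change the slice or the original. The warnings point at lines 11, 12 and 14 of a notebook cell (ipykernel/__main__.py), that is, at code you wrote yourself.

The two problems are probably connected. pipeline.fit(X_train, y_train) hands your transformers the full X_train, whereas GridSearchCV hands them row subsets of X_train, one per fold. Those subsets are slices, and they keep their original row labels, so their index generally does not run from 0 to n-1. Custom code that works on the full frame can break on such a subset. The likeliest place to look is therefore AdHocFeaturesCreation, KnnImputer, or the transformers inside the FeatureUnion, rather than GridSearchCV itself.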

Let’s break down the potential issues and solutions:

1. SettingWithCopyWarning

This warning means that an assignment is being made to a DataFrame that pandas thinks was obtained by slicing another DataFrame. Pandas cannot guarantee whether such a slice is a view or a copy, so the assignment may or may not reach the original data. There are two common causes, with two different fixes. Chained indexing such as df[mask]['column'] = value should be replaced by a single .loc[] call. Assigning into a frame that is itself a slice (for example the X passed to a transformer during cross-validation) is fixed by taking an explicit copy first; in that case switching to .loc[] alone usually does not silence the warning.

For example:

# Instead of this (chained indexing; the assignment may land on a temporary copy):
df[mask]['column'] = new_values

# Use this (a single .loc call on the frame itself):
df.loc[mask, 'column'] = new_values

# If df is itself a slice of another frame, copy it before assigning:
df = df.copy()
df['column'] = df['column'].apply(some_function)

Check the code inside your custom transformers (AdHocFeaturesCreation, KnnImputer, and the members of the FeatureUnion): each transform method should start from X = X.copy() and avoid chained indexing.
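
As a rough illustration of the copy-first pattern, here is a small pandas-only sketch. The frame, the column names and the is_senior feature are made up for the example and are not taken from your data:

```python
import pandas as pd

# Toy frame; column names are illustrative, not the asker's real data.
df = pd.DataFrame({"age": [25, 40, 58, 33],
                   "hours_per_week": [40, 50, 20, 45]})

# A row subset, similar to what a transformer receives for one CV fold.
fold = df.iloc[[0, 2, 3]]

# Explicit copy: later assignments are unambiguous and cannot touch df.
fold = fold.copy()

# Plain column assignment is fine on an explicit copy.
fold["is_senior"] = fold["age"] > 50

# Conditional assignment goes through a single .loc call.
fold.loc[fold["hours_per_week"] > 40, "hours_per_week"] = 40

print(fold)
print(df)  # the original frame is unchanged
```

The point of the design is that the transformer owns the frame it modifies, so neither the warning nor accidental changes to X_train can occur.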

2. Error during GridSearchCV.fit()

The traceback you posted is cut off inside joblib's dispatch loop, before the line that states what the ValueError actually is. The visible frames only show GridSearchCV handing the individual fits to joblib, so the message at the very bottom of the full traceback is what identifies the cause. Typical candidates when a pipeline fits on the full data but fails on a fold are NaNs introduced by index misalignment in a custom transformer, a category that is present in one fold but absent from another (so the number of output columns differs between fit and transform), or a grid value that one of the steps cannot handle.

To narrow it down:

  1. Check the parameter grid. Each key must have the form <step name>__<parameter name>, and the parameter must be a constructor argument of that step. 'imputers__n_neighbors' and 'clf__C' match the step names in your pipeline, so the keys look right, provided KnnImputer exposes n_neighbors through get_params().

  2. Check that every step copes with a subset of X_train, not only with the full frame. With cv=3, each fit sees about two thirds of the rows and each scoring pass sees the remaining third, with the original row labels.
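
One way to check the grid keys is to compare them with the names the pipeline reports through get_params(). The sketch below uses a stand-in pipeline (a StandardScaler step named 'scale' replaces your custom steps, which are not available here); with your own pipeline object the same two lines apply:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; the 'scale' step replaces the asker's custom steps.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])

hyperparameters = {"clf__C": [1.0, 2.0]}

# Every key in the grid must be one of the names the pipeline exposes.
valid_names = set(pipeline.get_params().keys())
unknown = [name for name in hyperparameters if name not in valid_names]

print(sorted(n for n in valid_names if n.startswith("clf__")))
print("unknown grid keys:", unknown)
```

If 'imputers__n_neighbors' does not show up among the names for your pipeline, KnnImputer is not exposing its constructor arguments the way GridSearchCV needs.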

3. Debugging Step-by-Step

To help isolate the issue:

  • You have already confirmed that the pipeline runs on the full training set:

    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)
    print(preds)
    
  • Run the steps one at a time, feeding each step the output of the previous one, to see which step fails or produces unexpected output:

    # Each step receives the previous step's output, as in the pipeline:
    ad_hoc_features = AdHocFeaturesCreation()
    X_adhoc = ad_hoc_features.fit(X_train, y_train).transform(X_train)
    
    imputers = KnnImputer(target='native-country', n_neighbors=5)
    X_imputed = imputers.fit(X_adhoc, y_train).transform(X_adhoc)
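
Since the failure only appears under cross-validation, it may be more telling to run these checks on a fold-sized subset. The sketch below uses toy data and an invented rank column to show one way a fold can break code that works on the full frame: the fold keeps its original row labels, and assigning a freshly built Series aligns on labels, not on position. Whether this is what happens in your transformers is an assumption; the traceback does not say.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data standing in for X_train; the column name is illustrative.
X = pd.DataFrame({"age": [25, 40, 58, 33, 47, 61]})

# Take the training rows of the first fold, the way a CV splitter selects them.
train_idx, test_idx = next(KFold(n_splits=3).split(X))
X_fold = X.iloc[train_idx].copy()

# The fold keeps its original row labels, so the index no longer starts at 0.
print(X_fold.index.tolist())

# A fresh Series has labels 0..n-1; assignment aligns on labels, not position,
# so rows whose label is missing from the Series silently become NaN.
misaligned = X_fold.copy()
misaligned["rank"] = pd.Series(np.arange(len(misaligned)))
print(misaligned)

# Assigning raw values (no index attached) goes by position instead.
aligned = X_fold.copy()
aligned["rank"] = np.arange(len(aligned))
print(aligned)
```

NaNs created this way would reach LogisticRegression only during cross-validation, which would fit the symptom of a pipeline that works on the full training set but fails inside GridSearchCV.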
    

4. GridSearchCV Debugging

Try using verbose in GridSearchCV to get more information on what's going wrong:

GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring='roc_auc', refit=False, verbose=3)
GSCV.fit(X_train, y_train)

This will print more details during the fitting process and may give you more insight into what is causing the ValueError.
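
A related option is error_score. As far as I know, recent scikit-learn releases record a failed fit as NaN and only warn by default, while the version in your traceback appears to raise by default; passing error_score='raise' makes the behaviour explicit either way. The sketch below runs on synthetic data with a stand-in pipeline, since your custom steps are not available here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for X_train / y_train.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# Stand-in pipeline; 'scale' replaces the asker's custom steps.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])

search = GridSearchCV(
    pipeline,
    {"clf__C": [1.0, 2.0]},
    cv=3,
    scoring="roc_auc",
    refit=False,
    verbose=3,              # print one line per fit
    error_score="raise",    # surface the original exception from a failing fit
)
search.fit(X, y)

# One mean score per parameter combination.
print(search.cv_results_["mean_test_score"])
```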

5. Revisit the KnnImputer

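KnnImputer appears to be a custom transformer, so it has to follow the conventions GridSearchCV relies on. GridSearchCV clones the pipeline for every fit and sets the grid values through set_params(), which only works if __init__ stores each argument unchanged under the same name (inheriting from sklearn.base.BaseEstimator provides get_params and set_params). fit should learn only from the data it is given, and transform should work on a copy of X and must not assume that the index runs from 0 to n-1. Also check that the largest n_neighbors in the grid (41) is still valid for the number of rows in a single training fold.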
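
As a sketch of these conventions, here is a hypothetical ModeImputer. It is not your KnnImputer: it simply fills missing values with the most frequent value, and n_neighbors is kept only to mirror your parameter grid.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ModeImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for KnnImputer: fills one column with its mode."""

    def __init__(self, target, n_neighbors=5):
        # Store constructor arguments unchanged and under the same names,
        # so get_params / set_params / clone (used by GridSearchCV) work.
        self.target = target
        self.n_neighbors = n_neighbors  # unused here; mirrors the grid only

    def fit(self, X, y=None):
        # Learn state from the data passed in (one training fold at a time).
        self.fill_value_ = X[self.target].mode().iloc[0]
        return self

    def transform(self, X):
        X = X.copy()  # never write into the caller's frame
        X[self.target] = X[self.target].fillna(self.fill_value_)
        return X


# Non-contiguous index on purpose, as in a cross-validation fold.
X = pd.DataFrame({"native-country": ["US", None, "US", "FR"]},
                 index=[3, 7, 8, 11])

imputer = ModeImputer(target="native-country")
X_out = imputer.fit(X).transform(X)
print(X_out)

# What GridSearchCV does for each candidate: clone, then set the grid value.
candidate = clone(imputer).set_params(n_neighbors=21)
print(candidate.get_params())
```

If clone() or set_params() fails on your real KnnImputer, that alone can explain a pipeline that fits directly but not inside GridSearchCV.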


Revised Code Example:

For reference, here is your setup again with verbose output enabled. AdHocFeaturesCreation, KnnImputer, X_train and y_train are your own objects, and the FeatureUnion below is only a placeholder for yours:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assuming AdHocFeaturesCreation and KnnImputer are properly implemented

# Placeholder feature union. Note that each StandardScaler here receives the
# whole input, not a single column; keep your own per-column transformers.
feats = FeatureUnion([
    ('age', StandardScaler()),
    ('education_num', StandardScaler()),
    # Add other features here...
])

# Create the pipeline
pipeline = Pipeline([
    ('adhocFC', AdHocFeaturesCreation()),  # Custom feature creation
    ('imputers', KnnImputer(target='native-country', n_neighbors=5)),  # Custom imputer
    ('features', feats),
    ('clf', LogisticRegression())
])

# Define hyperparameters grid for GridSearchCV
hyperparameters = {
    'imputers__n_neighbors': [5, 21, 41],
    'clf__C': [1.0, 2.0]
}

# Use GridSearchCV to find the best parameters
GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring='roc_auc', refit=False, verbose=3)
GSCV.fit(X_train, y_train)

Final Thoughts:

  • Ensure your custom transformers (AdHocFeaturesCreation and KnnImputer) copy the DataFrame they receive before modifying it, use a single .loc[] call for conditional assignments, and do not rely on the index running from 0 to n-1.
  • Check that GridSearchCV parameters are properly configured and that your input data shapes match what the pipeline expects.

If the problem persists, please share the implementation details of AdHocFeaturesCreation and KnnImputer, as the issue may lie within how they handle the data.