I am using GridSearchCV in order to find the best parameters for my pipeline.
My pipeline seems to work well as I can apply:
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
And I get a decent result.
But GridSearchCV obviously doesn't like something, and I cannot figure it out.
My pipeline:
feats = FeatureUnion([('age', age),
('education_num', education_num),
('is_education_favo', is_education_favo),
('is_marital_status_favo', is_marital_status_favo),
('hours_per_week', hours_per_week),
('capital_diff', capital_diff),
('sex', sex),
('race', race),
('native_country', native_country)
])
pipeline = Pipeline([
('adhocFC',AdHocFeaturesCreation()),
('imputers', KnnImputer(target = 'native-country', n_neighbors = 5)),
('features',feats),('clf',LogisticRegression())])
My GridSearch:
hyperparameters = {'imputers__n_neighbors' : [5,21,41], 'clf__C' : [1.0, 2.0]}
GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False
GSCV.fit(X_train, y_train)
I receive 11 similar warnings:
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
and this is the error message:
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:12: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-05f7c4f5167d> in <module>()
3 GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' ,refit = False) #change n_jobs = 2, refit = False
4
----> 5 GSCV.fit(X_train, y_train)
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
943 train/test set.
944 """
--> 945 return self._fit(X, y, groups, ParameterGrid(self.param_grid))
946
947
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in _fit(self, X, y, groups, parameter_iterable)
562 return_times=True, return_parameters=True,
563 error_score=self.error_score)
--> 564 for parameters in parameter_iterable
565 for train, test in cv_iter)
566
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
Answer
SettingWithCopyWarning is a warning, not the error. pandas emits it when a value is assigned to a DataFrame that was derived from another DataFrame (for example by slicing or filtering), because it cannot tell whether the assignment will reach the original data or only a temporary copy. It most likely comes from assignments inside your custom transformers.
The actual failure is the ValueError raised inside GSCV.fit(). The traceback you posted is cut off before its message, so the exact cause cannot be read from it. Since plain pipeline.fit() works, the difference probably lies in what GridSearchCV does on top: it clones the pipeline, sets parameters on the clones, and fits them on row subsets of X_train.
Let’s break down the potential issues and solutions:
1. SettingWithCopyWarning
This warning means that a value is being assigned to a DataFrame that pandas believes was derived from another DataFrame, for example by slicing or boolean filtering. pandas cannot tell whether such an assignment will reach the original data or only a temporary copy, so it warns. The .loc[row_indexer, col_indexer] form suggested in the message avoids chained indexing such as df[mask]['column'] = value, but it does not help when the object you assign into is already a slice, which is the case for the row subsets a transformer receives during cross-validation.
To fix this, have each transformer take an explicit copy of its input before assigning to it. For example:
# This can warn when df is a slice of another DataFrame:
df['column'] = df['column'].apply(some_function)
# Copy first, then assign:
df = df.copy()
df['column'] = df['column'].apply(some_function)
Check the code inside your custom transformers (AdHocFeaturesCreation and KnnImputer) for assignments made directly on the X they receive.
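As a minimal sketch of the copy-before-assign pattern, the helper below derives a column on a row subset without touching the frame it was taken from. The column names capital_gain and capital_loss and the function name add_capital_diff are assumptions for illustration; the question only shows a feature called capital_diff.

```python
import pandas as pd


def add_capital_diff(X):
    """Return a new frame with a derived column; the input is left untouched."""
    # Explicit copy: later assignments cannot land on a slice of the caller's frame.
    X = X.copy()
    # Hypothetical source columns, assumed for this example only.
    X["capital_diff"] = X["capital_gain"] - X["capital_loss"]
    return X


df = pd.DataFrame({"capital_gain": [10, 0, 5], "capital_loss": [1, 2, 0]})
fold = df.iloc[[0, 2]]  # a row subset, similar to what cross-validation passes in
out = add_capital_diff(fold)
```

The same idea applies inside a transformer's transform() method: copy first, assign to the copy, return the copy.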
2. Error during GridSearchCV.fit()
The traceback ends inside joblib, before the ValueError message, so the cause is not visible from what you posted. Because pipeline.fit() succeeds on the full training set, the likely candidates are the things GridSearchCV does differently: it clones the pipeline for every parameter combination and fold, calls set_params() on each clone, and fits on row subsets of X_train.
To resolve this:
- Check the parameter names in the grid. 'imputers__n_neighbors' and 'clf__C' match the step names in your pipeline, but they only work if KnnImputer exposes n_neighbors through get_params() and set_params(). It gets both by inheriting from sklearn.base.BaseEstimator and storing its constructor arguments unchanged under the same attribute names.
- Check that X_train and y_train stay aligned through the pipeline. A step that drops rows, or that assumes the DataFrame index runs from 0 to n-1, can work on the full training set and still fail on a fold.
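One way to check that grid keys and custom transformers are compatible with GridSearchCV is to list the pipeline's parameter names and try clone() plus set_params() by hand. The sketch below uses a stand-in class, DemoImputer, which is hypothetical and only imitates the constructor signature of the KnnImputer from the question.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class DemoImputer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for KnnImputer; it does no real imputation."""

    def __init__(self, target="native-country", n_neighbors=5):
        # Store constructor arguments unchanged, under the same names,
        # so that get_params(), set_params() and clone() can find them.
        self.target = target
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.copy()


pipe = Pipeline([("imputers", DemoImputer()), ("clf", LogisticRegression())])

# Every key used in the hyperparameter grid should appear in this list.
param_names = sorted(pipe.get_params().keys())

# GridSearchCV does roughly this for each parameter combination.
cloned = clone(pipe).set_params(imputers__n_neighbors=21)
```

If a grid key is missing from param_names, or clone() raises, the transformer's __init__ is the first place to look.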
3. Debugging Step-by-Step
To help isolate the issue:
- Run the pipeline without GridSearchCV to verify it works:
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
print(preds)
- Test individual steps in your pipeline to check whether any of them is causing issues:
# Test individual steps:
ad_hoc_features = AdHocFeaturesCreation()
ad_hoc_features.fit(X_train, y_train)
imputers = KnnImputer(target='native-country', n_neighbors=5)
imputers.fit(X_train, y_train)
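Since the pipeline fits on the full training set but fails under cross-validation, it can also help to fit a clone on each fold by hand and record which fold raises. The following is a rough sketch of that idea, not a reproduction of GridSearchCV internals; fit_each_fold, the demo data and the demo pipeline are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_each_fold(estimator, X, y, n_splits=3):
    """Fit a fresh clone on every fold; return (fold, error-or-None) pairs."""
    results = []
    cv = StratifiedKFold(n_splits=n_splits)
    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        model = clone(estimator)
        try:
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
            model.predict_proba(X.iloc[test_idx])  # roc_auc needs scores
            results.append((fold, None))
        except Exception as exc:  # keep going so every fold is reported
            results.append((fold, repr(exc)))
    return results


# Synthetic stand-in data; replace with X_train, y_train and your pipeline.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    "age": rng.randint(18, 70, size=60),
    "hours_per_week": rng.randint(10, 60, size=60),
})
y_demo = pd.Series([0, 1] * 30)
demo_pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

fold_results = fit_each_fold(demo_pipeline, X_demo, y_demo)
```

Calling fit_each_fold(pipeline, X_train, y_train) on the real objects should show the full exception text for the failing fold, which the truncated traceback does not.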
4. GridSearchCV Debugging
Try setting verbose in GridSearchCV to see which parameter combination and fold is being fitted when the failure occurs:
GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring='roc_auc', refit=False, verbose=3)
GSCV.fit(X_train, y_train)
This prints a line for each fit, which narrows down the parameter combination and fold on which the ValueError is raised.
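Another option is the error_score parameter of GridSearchCV: with a numeric value such as np.nan, a failing fit is scored as NaN instead of aborting the search, so the failing parameter combinations can be read from cv_results_ afterwards. The sketch below uses a deliberately broken transformer, FragileStep, which is hypothetical and fails on purpose for large n_neighbors.

```python
import warnings

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class FragileStep(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that fails for n_neighbors > 20, to imitate a bug."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        if self.n_neighbors > 20:
            raise ValueError("demo failure for n_neighbors=%d" % self.n_neighbors)
        return self

    def transform(self, X):
        return X


X_demo, y_demo = make_classification(n_samples=90, n_features=5, random_state=0)
pipe = Pipeline([("imputers", FragileStep()), ("clf", LogisticRegression())])
search = GridSearchCV(
    pipe,
    {"imputers__n_neighbors": [5, 21]},
    cv=3,
    scoring="roc_auc",
    refit=False,
    error_score=np.nan,  # record failures as NaN instead of raising
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the fit-failure warnings in this demo
    search.fit(X_demo, y_demo)

# Parameter combinations whose fits failed.
failed = [
    params
    for params, score in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"])
    if np.isnan(score)
]
```

This only locates the failing combinations; to see the exception itself, fit the pipeline once with those parameters outside the search.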
5. Revisit the KnnImputer
KnnImputer appears to be a custom transformer. Make sure its transform() works on a copy of its input and returns the result, rather than assigning into the X it receives. During cross-validation that X is a row subset of X_train, which is exactly the situation SettingWithCopyWarning flags. Also check that it never modifies X_train or y_train in place.
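A quick way to test whether a transformer modifies its input is to compare the input before and after a fit/transform round trip. The helper and the two toy transformers below are hypothetical and only illustrate the check; they are not the KnnImputer from the question.

```python
import numpy as np
import pandas as pd


def leaves_input_unchanged(transformer, X, y=None):
    """Return True if fit + transform leave X exactly as it was."""
    before = X.copy(deep=True)
    transformer.fit(X, y).transform(X)
    return before.equals(X)


class InPlaceFill:
    """Toy transformer that assigns into the frame it receives."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X["native-country"] = X["native-country"].fillna("unknown")
        return X


class CopyFill(InPlaceFill):
    """Same logic, but applied to an explicit copy of the input."""

    def transform(self, X):
        return super().transform(X.copy())


def make_frame():
    return pd.DataFrame({"native-country": ["Cuba", np.nan, "India"]})


in_place_ok = leaves_input_unchanged(InPlaceFill(), make_frame())
copy_ok = leaves_input_unchanged(CopyFill(), make_frame())
```

Running leaves_input_unchanged on each custom step with a small sample of X_train shows which of them, if any, writes into its input.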
Revised Code Example:
If you're still facing issues, here's a simplified approach you can try:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# AdHocFeaturesCreation and KnnImputer are your own classes; they are assumed
# to inherit from BaseEstimator and to copy their input before modifying it.

# Mock feature union: each StandardScaler here sees the full input.
# Replace these entries with your own per-feature transformers.
feats = FeatureUnion([
    ('age', StandardScaler()),
    ('education_num', StandardScaler()),
    # Add other features here...
])

# Create the pipeline
pipeline = Pipeline([
    ('adhocFC', AdHocFeaturesCreation()),  # Custom feature creation
    ('imputers', KnnImputer(target='native-country', n_neighbors=5)),  # Custom imputer
    ('features', feats),
    ('clf', LogisticRegression())
])

# Define the hyperparameter grid; keys are <step name>__<parameter name>
hyperparameters = {
    'imputers__n_neighbors': [5, 21, 41],
    'clf__C': [1.0, 2.0]
}

# Run the search; verbose=3 reports each fit as it happens
GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring='roc_auc', refit=False, verbose=3)
GSCV.fit(X_train, y_train)
Final Thoughts:
- Ensure your custom transformers (AdHocFeaturesCreation and KnnImputer) do not assign into the DataFrame they receive. Copy the input first, then set values on the copy.
- Check that the GridSearchCV parameter names match parameters your pipeline steps actually expose, and that every step also works on a row subset of the training data.
If the problem persists, please share the full ValueError message and the implementations of AdHocFeaturesCreation and KnnImputer, as the issue may lie in how they handle the data.