python-3.x The correct way to use SMOTE() in a classification problem

kkbh8khc asked 8 months ago in Python

**What is the correct way to implement SMOTE() when building a classifier?** I am really confused about where SMOTE() should be applied. Suppose, as a beginner, I split the dataset into train and test sets like this:

from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Some dataset initialization
X = df.drop(['things'], axis = 1)
y = df['things']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# SMOTE() on the train dataset:
X_train_smote, y_train_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

After applying SMOTE() to the training set of the classification problem above, my questions are:
1. Should I apply SMOTE() **inside the pipeline** after splitting the dataset like this?:

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)), 
                                ('model', LogisticRegression(random_state = 42))])

# Then do model evaluation with Repeated Stratified KFold,
# Then do Grid Search for hyperparameter tuning
# Then do the actual model testing with unseen X_test (Like this): 

cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 42)

params = {'model__penalty': ['l1', 'l2'],
          'model__C':[0.001, 0.01, 0.1, 5, 10, 100]}
    
grid = GridSearchCV(estimator = pipeline,
                    param_grid = params,
                    scoring = 'roc_auc',
                    cv = cv,
                    n_jobs = -1)

grid.fit(X_train_smote, y_train_smote)
    

cv_score = grid.best_score_
test_score = grid.score(X_test, y_test)

print(f"Cross-validation score: {cv_score} \n Test Score: {test_score}")


2. Or, should I use the pipeline **without** calling SMOTE() at all, like this?:

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()), 
                                ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc...


3. Or, should I keep SMOTE() inside the pipeline like this, but fit on the original rather than the SMOTE'd training data?:

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)), 
                                ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc... 

# BUT: when we call grid.fit(), do we do this?:
grid.fit(X_train, y_train)


4. Or, resample the training data with SMOTE() beforehand and use sklearn's own Pipeline (without a SMOTE step), like this?:

X_train_smote, y_train_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

pipeline = Pipeline(steps = [('scale', StandardScaler()),
                             ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc... 

# BUT: when we call grid.fit(), do we do this?:
grid.fit(X_train_smote, y_train_smote)

lmyy7pcs1#

In general, you want to SMOTE the training data only, never the validation or test data. So if you want to use k-fold cross-validation, you must not SMOTE the data before handing it to that process.

1. No. You run SMOTE twice (once before the pipeline and once inside it), and you also end up with SMOTE'd points in the validation folds, which you don't want.
2. No, there will be SMOTE'd points in the validation folds.
3. Yes, this is the way to do it (see the sketch after this list).
4. No, there will be SMOTE'd points in the validation folds.
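
For reference, here is a minimal end-to-end sketch of that third approach, assembled from the snippets in your question. It is an illustration, not a drop-in solution: it substitutes a synthetic imbalanced dataset from make_classification for your df, and it sets solver='liblinear' on LogisticRegression so that both the 'l1' and 'l2' penalties in your grid are actually supported.

# Sketch of the recommended flow: SMOTE sits inside the imblearn pipeline,
# so it is re-fit on the training portion of each CV fold only, and the
# validation folds and the final test set stay untouched.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X = df.drop(...), y = df['things']
X, y = make_classification(n_samples = 2000, n_features = 20,
                           weights = [0.9, 0.1], random_state = 42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    stratify = y, random_state = 42)

pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)),
                                ('model', LogisticRegression(solver = 'liblinear',
                                                             random_state = 42))])

cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 42)

params = {'model__penalty': ['l1', 'l2'],
          'model__C': [0.001, 0.01, 0.1, 5, 10, 100]}

grid = GridSearchCV(estimator = pipeline,
                    param_grid = params,
                    scoring = 'roc_auc',
                    cv = cv,
                    n_jobs = -1)

# Fit on the original (non-resampled) training data: SMOTE runs per fold.
grid.fit(X_train, y_train)

print(f"Cross-validation score: {grid.best_score_}")
print(f"Test score: {grid.score(X_test, y_test)}")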
I'd also suggest looking at sklearn.metrics.roc_auc_score() (and whatever other metrics you are using), because it can reveal the problems caused by resampling the data on the wrong side of the split. (SMOTE'd points can be highly predictable yet not improve the AUC.)
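
As a rough illustration of that check, reusing the grid object fitted in the sketch above (and assuming a binary target):

from sklearn.metrics import roc_auc_score

# Probabilities for the positive class on the untouched test set;
# a large gap between this score and the CV score is a warning sign
# that resampled points leaked into the validation data.
test_proba = grid.predict_proba(X_test)[:, 1]
print(f"Held-out ROC AUC: {roc_auc_score(y_test, test_proba):.3f}")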
