python-3.x The correct way to use SMOTE() in a classification problem

kkbh8khc asked 8 months ago in Python

**What is the correct way to implement SMOTE() when building a classifier?** I am really confused about where SMOTE() should be applied. Suppose, as a beginner, I split the dataset into train and test sets like this:

from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Some dataset initialization
X = df.drop(['things'], axis = 1)
y = df['things']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# SMOTE() on the train dataset:
X_train_smote, y_train_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

After applying SMOTE() to the training set of the classification problem above, my questions are:
1. Should I apply SMOTE() **inside the pipeline** after splitting the dataset like this?:

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)), 
                                ('model', LogisticRegression(random_state = 42))])

# Then do model evaluation with Repeated Stratified KFold,
# Then do Grid Search for hyperparameter tuning
# Then do the actual model testing with unseen X_test (Like this): 

cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 42)

params = {'model__penalty': ['l1', 'l2'],
          'model__C':[0.001, 0.01, 0.1, 5, 10, 100]}
    
grid = GridSearchCV(estimator = pipeline,
                    param_grid = params,
                    scoring = 'roc_auc',
                    cv = cv,
                    n_jobs = -1)

grid.fit(X_train_smote, y_train_smote)
    

cv_score = grid.best_score_
test_score = grid.score(X_test, y_test)

print(f"Cross-validation score: {cv_score} \n Test Score: {test_score}")


2. Or, should I use the pipeline **without** calling SMOTE() at all, like this?:

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()), 
                                ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc...


3. Or, should I keep SMOTE() inside the pipeline like this, but fit on the original rather than the SMOTE'd training data?:

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)), 
                                ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc... 

# BUT: when we call grid.fit(), do we do this?:
grid.fit(X_train, y_train)


4. Or, resample the training data with SMOTE() beforehand and use sklearn's own Pipeline (without a SMOTE step), like this?:

X_train_smote, y_train_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

pipeline = Pipeline(steps = [('scale', StandardScaler()),
                             ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc... 

# BUT: when we call grid.fit(), do we do this?:
grid.fit(X_train_smote, y_train_smote)

lmyy7pcs1#

In general, you want to SMOTE the training data only, never the validation or test data. So if you want to use k-fold cross-validation, you must not SMOTE the data before handing it to that process.

1. No. You run SMOTE twice (once before the pipeline and once inside it), and you also end up with SMOTE'd points in the validation folds, which you don't want.
2. No, there will be SMOTE'd points in the validation folds.
3. Yes, this is the way to do it (see the sketch after this list).
4. No, there will be SMOTE'd points in the validation folds.
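
For reference, here is a minimal end-to-end sketch of that third approach, assembled from the snippets in your question. It is an illustration, not a drop-in solution: it substitutes a synthetic imbalanced dataset from make_classification for your df, and it sets solver='liblinear' on LogisticRegression so that both the 'l1' and 'l2' penalties in your grid are actually supported.

# Sketch of the recommended flow: SMOTE sits inside the imblearn pipeline,
# so it is re-fit on the training portion of each CV fold only, and the
# validation folds and the final test set stay untouched.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X = df.drop(...), y = df['things']
X, y = make_classification(n_samples = 2000, n_features = 20,
                           weights = [0.9, 0.1], random_state = 42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    stratify = y, random_state = 42)

pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)),
                                ('model', LogisticRegression(solver = 'liblinear',
                                                             random_state = 42))])

cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 42)

params = {'model__penalty': ['l1', 'l2'],
          'model__C': [0.001, 0.01, 0.1, 5, 10, 100]}

grid = GridSearchCV(estimator = pipeline,
                    param_grid = params,
                    scoring = 'roc_auc',
                    cv = cv,
                    n_jobs = -1)

# Fit on the original (non-resampled) training data: SMOTE runs per fold.
grid.fit(X_train, y_train)

print(f"Cross-validation score: {grid.best_score_}")
print(f"Test score: {grid.score(X_test, y_test)}")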
I'd also suggest looking at sklearn.metrics.roc_auc_score() (and whatever other metrics you are using), because it can reveal the problems caused by resampling the data on the wrong side of the split. (SMOTE'd points can be highly predictable yet not improve the AUC.)
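
As a rough illustration of that check, reusing the grid object fitted in the sketch above (and assuming a binary target):

from sklearn.metrics import roc_auc_score

# Probabilities for the positive class on the untouched test set;
# a large gap between this score and the CV score is a warning sign
# that resampled points leaked into the validation data.
test_proba = grid.predict_proba(X_test)[:, 1]
print(f"Held-out ROC AUC: {roc_auc_score(y_test, test_proba):.3f}")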
