NumPy: splitting a dataset into train, validation, and test sets by choosing validation and test indices

eit6fx6z  posted on 2023-03-18 in Other

To split a dataset into train, validation, and test sets, I tried using the following function:

X = df.drop("RISK DECISION", axis = 1).values
y = df["RISK DECISION"].values

def train_validation_test_split(X, y, validation_size = 0.1, test_size = 0.1, random_state = None):
    
    if random_state != None:
        np.random.seed(random_state)
    
    n = X.shape[0]
    
    validation_indices = np.random.choice(n, int(n*validation_size), replace = False)
    test_indices = np.random.choice(n, int(n*test_size), replace = False)
    all_indices = np.concatenate((validation_indices, test_indices))
    
    X_validation = X[validation_indices]
    y_validation = y[validation_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    X_train = np.delete(X, all_indices, axis = 0)
    y_train = np.delete(y, all_indices, axis = 0)
    
    return(X_train, X_validation, X_test, y_train, y_validation, y_test)

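The initial dataset has 34322 records.
After applying the function, the lengths of X_train, X_validation, and X_test add up to more than the initial length of the dataset.
The problem is probably caused by np.random.choice: validation_indices and test_indices are drawn independently, so they may contain some of the same indices.
How can I fix this without modifying the train_validation_test_split function too much?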
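A quick check along the following lines can confirm the suspicion. This is only a sketch: the seed is arbitrary, the 0.1 fractions are the function's defaults, and a plain index array stands in for the real data. It counts the indices shared by the two draws and shows that the combined set lengths exceed n by exactly that count, because np.delete drops each duplicated index only once.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed, only for reproducibility
n = 34322           # dataset length from the question
k = int(n * 0.1)    # default validation_size and test_size

# Two independent draws, as in the original function
validation_indices = np.random.choice(n, k, replace=False)
test_indices = np.random.choice(n, k, replace=False)

# Indices that ended up in both the validation and the test set
overlap = np.intersect1d(validation_indices, test_indices)

# Stand-in for X: deleting a duplicated index removes that row only once
all_indices = np.concatenate((validation_indices, test_indices))
n_train = np.delete(np.arange(n), all_indices).size

# The surplus over n equals the number of shared indices
print(overlap.size, n_train + 2 * k - n)
```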

8ehkhllq  #1

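Yes, the duplicate indices are indeed the problem. A simple solution is to draw all of the indices in a single call and then split the result into validation and test: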

n = X.shape[0]

n_validation = int(n*validation_size)
n_test = int(n*test_size)
all_indices = np.random.choice(n, n_validation+n_test, replace = False)
validation_indices = all_indices[:n_validation]
test_indices = all_indices[n_validation:]

X_validation = X[validation_indices]
y_validation = y[validation_indices]
X_test = X[test_indices]
y_test = y[test_indices]

X_train = np.delete(X, all_indices, axis = 0)
y_train = np.delete(y, all_indices, axis = 0)

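Since numpy.random.choice seems to return the indices in random order, this should be fine. You may just want to include an assert to make sure that validation_size + test_size < 1 (or, perhaps more robustly, that n_validation + n_test <= n).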
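Put together, the function from the question with this change and the suggested assert might look like the sketch below. The synthetic X and y at the end are assumptions for illustration only, not the asker's data, and the assert message is made up here.

```python
import numpy as np

def train_validation_test_split(X, y, validation_size=0.1, test_size=0.1, random_state=None):
    if random_state is not None:
        np.random.seed(random_state)

    n = X.shape[0]
    n_validation = int(n * validation_size)
    n_test = int(n * test_size)
    # Guard suggested in the answer: the two held-out sets must fit in the data
    assert n_validation + n_test <= n, "validation and test sets exceed the dataset"

    # A single draw without replacement, so the two index sets cannot overlap
    all_indices = np.random.choice(n, n_validation + n_test, replace=False)
    validation_indices = all_indices[:n_validation]
    test_indices = all_indices[n_validation:]

    X_train = np.delete(X, all_indices, axis=0)
    y_train = np.delete(y, all_indices, axis=0)

    return (X_train, X[validation_indices], X[test_indices],
            y_train, y[validation_indices], y[test_indices])

# Synthetic stand-in data (hypothetical): 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
X_train, X_validation, X_test, y_train, y_validation, y_test = \
    train_validation_test_split(X, y, random_state=42)

print(len(X_train), len(X_validation), len(X_test))  # 80 10 10
```

With disjoint index sets, the three lengths always add up to the original number of rows.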
