TensorFlow: cannot see the results of a trained HuggingFace model

Posted by fkvaft9z 7 months ago in Other

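I'm new to Hugging Face. After reading the documentation, I have been trying to fine-tune DNABERT-2 on my simple dataset.
Basically, I have some DNA sequences labeled "1" or "0", and I want to use the pretrained DNABERT-2 model to predict the labels.
Example: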

AATTGGC    1
TCTC       0 
TGTTA      1

I think I have all the steps in place, but the last line fails with an error: TypeError: '_TensorSliceDataset' object is not subscriptable
Here is all the code I used (steps adapted from https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb):

import pandas as pd
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True).to('cuda')

# Assuming 'sequences' is a list of DNA sequences and 'labels' is a list of binary labels (0 or 1)

# Split the data into training and testing sets
df = pd.read_csv('/content/dev.csv', nrows=10)
df1 = pd.read_csv('/content/test.csv', nrows=10)

train_sequences=df.iloc[:, 0].tolist()
test_sequences=df1.iloc[:, 0].tolist()
train_labels=df.iloc[:, 1].tolist()
test_labels =df1.iloc[:, 1].tolist()

# Tokenize and format the data
train_encodings = tokenizer(train_sequences, truncation=True, padding=True)
test_encodings = tokenizer(test_sequences, truncation=True, padding=True)

# convert encodings

import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")


import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)


Everything runs fine until I execute the last line:

trainer.train()


Answer #1 (wnvonmuf)

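There are a few issues in your case. First, the model you are trying to fine-tune for classification is a fill-mask model. You can use this course chapter to learn how to fine-tune such a model for a classification task. Or you can further fine-tune [...]. Second, you cannot pass a TensorFlow dataset to Trainer. Your train_dataset and test_dataset need to be replaced with: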

import torch


class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
# val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
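
To see why the original code raised "object is not subscriptable", it may help to look at the two access patterns side by side. The sketch below uses pure-Python stand-ins (IterableOnly and MapStyle are hypothetical names, not part of TensorFlow, PyTorch, or transformers): a map-style dataset like the class above is read by index, while an object that only supports iteration fails as soon as something calls dataset[0]. As far as I understand, that indexing step is what Trainer's data loading relies on for a dataset that defines __len__ and __getitem__.

```python
# Pure-Python stand-ins; neither TensorFlow nor PyTorch is needed to run this.


class IterableOnly:
    """Hypothetical stand-in for a tf.data.Dataset: can be iterated, not indexed."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        return iter(self._rows)


class MapStyle:
    """Hypothetical stand-in for the torch.utils.data.Dataset subclass above."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick the idx-th entry of every encoding field, then attach the label.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


def first_example(dataset):
    """Fetch one example by index, the way a map-style data loader would."""
    return dataset[0]


# Made-up token ids, for illustration only.
encodings = {
    "input_ids": [[1, 5, 2], [1, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [1, 0]

ok = first_example(MapStyle(encodings, labels))

try:
    first_example(IterableOnly(list(zip(encodings["input_ids"], labels))))
    error_message = None
except TypeError as exc:
    # Same kind of failure as the '_TensorSliceDataset' error in the question.
    error_message = str(exc)

print(ok)
print(error_message)
```

The first call returns a dict with input_ids, attention_mask, and labels for one example; the second fails with a TypeError because the object has no __getitem__.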

See this link for a better understanding.
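
On the first point (the checkpoint being loaded without a classification head), one possible direction is to load the model through AutoModelForSequenceClassification instead of AutoModel, so that Trainer receives a model that can compute a loss from the labels. This is only a sketch: load_classifier is a hypothetical helper, and it assumes that the remote code shipped with the zhihan1996/DNABERT-2-117M checkpoint exposes a sequence-classification class, which should be verified before relying on it.

```python
def load_classifier(checkpoint="zhihan1996/DNABERT-2-117M", num_labels=2):
    """Load a checkpoint with a sequence-classification head (hypothetical helper).

    Assumption: the checkpoint's remote code registers a class for
    AutoModelForSequenceClassification. If it does not, this call will fail.
    """
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=num_labels,      # two classes: label 0 and label 1
        trust_remote_code=True,     # the question's code already passes this flag
    )
```

A call such as model = load_classifier() would then replace the AutoModel.from_pretrained line in the question, with the rest of the Trainer setup unchanged.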
