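I am new to Hugging Face, and after reading the documentation I have been trying to fine-tune DNABERT 2 on my simple dataset.
Basically, I have some DNA sequences labeled "1" or "0", and I want to use the pretrained DNABERT 2 model to predict the labels.
Example: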
AATTGGC 1
TCTC 0
TGTTA 1
I think I have all the steps written down, but I hit an error on the last line: TypeError: '_TensorSliceDataset' object is not subscriptable.
Below is all the code I used (steps derived from here: https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb):
import pandas as pd
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True).to('cuda')
# Assuming 'sequences' is a list of DNA sequences and 'labels' is a list of binary labels (0 or 1)
# Split the data into training and testing sets
df = pd.read_csv('/content/dev.csv', nrows=10)
df1 = pd.read_csv('/content/test.csv', nrows=10)
train_sequences=df.iloc[:, 0].tolist()
test_sequences=df1.iloc[:, 0].tolist()
train_labels=df.iloc[:, 1].tolist()
test_labels =df1.iloc[:, 1].tolist()
# Tokenize and format the data
train_encodings = tokenizer(train_sequences, truncation=True, padding=True)
test_encodings = tokenizer(test_sequences, truncation=True, padding=True)
# convert encodings
import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")
import numpy as np
import evaluate
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
Everything goes fine until I run the last line:
trainer.train()
1 Answer
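There are a few issues in your case. First, the model you are trying to fine-tune for classification is a fill-mask model. You can use this course chapter to learn how to fine-tune such a model for a classification task, or you can fine-tune it further. Second, you cannot pass a TensorFlow dataset to the Trainer. It indexes the dataset directly, which is why you get the '_TensorSliceDataset' object is not subscriptable error. Your train_dataset and test_dataset need to be replaced with a dataset that supports indexing, such as a torch.utils.data.Dataset or a Hugging Face datasets.Dataset.
Check this link to better understand it.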
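As a rough illustration of the second point, here is a minimal sketch of an indexable (map-style) dataset that could stand in for the tf.data.Dataset objects. The class name SequenceDataset and the toy encodings are hypothetical, not taken from the question. The class is written in plain Python so it runs without any dependencies. In a real run you would more typically subclass torch.utils.data.Dataset, and this sketch assumes the Trainer's default collator turns the returned lists and ints into tensors.

```python
class SequenceDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, encodings, labels):
        # encodings: mapping such as the tokenizer output
        # (e.g. "input_ids", "attention_mask"), one list per example.
        self.encodings = dict(encodings)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Build one example: every encoding field at position idx,
        # plus the label under the "labels" key.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = int(self.labels[idx])
        return item


# Hypothetical stand-in for tokenizer output (already padded).
toy_encodings = {
    "input_ids": [[1, 5, 6, 2], [1, 7, 2, 0], [1, 8, 9, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [1, 0, 1]
toy_dataset = SequenceDataset(toy_encodings, toy_labels)
```

With the question's variables, this would be used as train_dataset = SequenceDataset(train_encodings, train_labels), and likewise for test_dataset. For the first point, a classification head is also needed. Loading the checkpoint with AutoModelForSequenceClassification and num_labels=2 instead of AutoModel is the usual approach, assuming the DNABERT-2 remote code provides that class.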