How to batch process millions of text data

Posted by rqqzpn5f on 2021-09-29 in Java


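I have a column named description in my dataset that contains a description of each food item. However, the dataset is incomplete, so I want to generate some synthetic text data. I am looking for a way to process this data in batches rather than as a single chunk (because that kills the kernel).
This is what my dataset looks like: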

branded_food_category                   description                 napcs

0   Ice Cream & Frozen Yogurt               mochi ice cream bonbons     3
1   Ketchup, Mustard, BBQ & Cheese Sauce    chipotle barbecue sauce     0
2   Ketchup, Mustard, BBQ & Cheese Sauce    hot spicy barbecue sauce    0
3   Ketchup, Mustard, BBQ & Cheese Sauce    barbecue sauce              0
4   Ketchup, Mustard, BBQ & Cheese Sauce    barbecue sauce              0

When I take the set of unique values in my description column, I get:

print(len(set(description)))

> 152398

Here I split the data into chunks of 51:

length_sentence = 50 + 1

lines = []

for i in range(length_sentence, len(description)):
  seq = description[i-length_sentence:i]
  line = seq
  lines.append(line)

  if i > 200000: # limit our dataset to 200000 words.
    break

print(len(lines))
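One way to avoid holding every window in memory at once is to produce the windows lazily and consume them in fixed-size batches. The following is a minimal sketch under stated assumptions, not the asker's code: `iter_windows` and `iter_batches` are hypothetical helper names, the input is assumed to be a plain Python list (for example `description.tolist()`), and the batch size is arbitrary.

```python
from itertools import islice


def iter_windows(items, window_size, limit=None):
    """Yield consecutive overlapping windows of `window_size` items.

    Windows are built one at a time, so only the current window is held
    in memory. `limit` optionally caps the number of windows produced.
    """
    n_windows = len(items) - window_size + 1
    if limit is not None:
        n_windows = min(n_windows, limit)
    for start in range(max(n_windows, 0)):
        yield items[start:start + window_size]


def iter_batches(iterable, batch_size):
    """Group any iterable into lists of at most `batch_size` elements."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-in for the `description` column (assumed data).
toy_description = ["item %d" % i for i in range(10)]
for batch in iter_batches(iter_windows(toy_description, window_size=4), batch_size=3):
    print(len(batch), batch[0])
```

Each batch can then be tokenized and passed on separately, so the full list of windows never has to exist at once. Unlike the loop above, this sketch also yields the final window that ends at the last element.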

Now I am preparing the data for the LSTM model:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer  # import path assumed; not shown in the original post

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)  # transforms each text into a sequence of integers

sequences = np.array(sequences)
x, y = sequences[:, :-1], sequences[:, -1]  # first 50 tokens of each line go to x, the last token to y

x[0]

> array([128280, 128278, 128276,     43,     43,   1483,   7803,   1968,
          247, 128273, 128271, 128269,   1967,    345,    420,     51,
           23,   3690, 128265,   1175, 128263, 128262, 128261, 128259,
        16737,  16736, 128257, 128255, 128254,    558,    195,    454,
         3689, 128250, 128248,   1964, 128247, 128245,    890, 128244,
          673,    890,    673, 128241,   7801,     64,      1,    557,
          557, 128239])

This is where the error is thrown: when I try to convert the data to categorical and assign it to y.


from tensorflow.keras.utils import to_categorical  # import path assumed; not shown in the original post

# tokenizer.word_index maps each unique word to its integer index,
# so its length plus 1 (index 0 is reserved) gives the vocabulary size.
vocab_size = len(tokenizer.word_index) + 1
y = to_categorical(y, num_classes=vocab_size)
seq_length = x.shape[1]
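The question does not show the error message, but the numbers above are consistent with memory exhaustion: the indices in `x[0]` imply a vocabulary of at least 128,281 entries, and with roughly 200,000 sequences a dense one-hot matrix would need on the order of 100 GB as float32. A common workaround (an assumption here, not something stated in the question) is to keep `y` as integer class indices and one-hot encode only one batch at a time. The sketch below uses only NumPy; `dense_one_hot_bytes` and `one_hot_batches` are hypothetical helpers, and the toy shapes and batch size are arbitrary.

```python
import numpy as np


def dense_one_hot_bytes(n_samples, vocab_size, dtype=np.float32):
    """Bytes needed for a dense (n_samples, vocab_size) one-hot matrix."""
    return n_samples * vocab_size * np.dtype(dtype).itemsize


def one_hot_batches(x, y, vocab_size, batch_size=128, dtype=np.float32):
    """Yield (x_batch, y_batch) pairs, one-hot encoding y per batch only.

    `y` holds integer class indices; at most `batch_size * vocab_size`
    cells are allocated for the targets at any time.
    """
    for start in range(0, len(y), batch_size):
        y_idx = y[start:start + batch_size]
        y_batch = np.zeros((len(y_idx), vocab_size), dtype=dtype)
        y_batch[np.arange(len(y_idx)), y_idx] = 1
        yield x[start:start + batch_size], y_batch


# Rough size of the full one-hot matrix for the figures in the question.
print(dense_one_hot_bytes(200_000, 128_281) / 1e9, "GB")

# Toy data standing in for the tokenized sequences (assumed shapes).
rng = np.random.default_rng(0)
toy_vocab_size = 20
toy_sequences = rng.integers(1, toy_vocab_size, size=(10, 6))
toy_x, toy_y = toy_sequences[:, :-1], toy_sequences[:, -1]

for x_batch, y_batch in one_hot_batches(toy_x, toy_y, toy_vocab_size, batch_size=4):
    print(x_batch.shape, y_batch.shape)
```

If the model is built with Keras, another option may be to skip one-hot encoding entirely: compile with `loss="sparse_categorical_crossentropy"` and pass the integer `y` directly.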

python nlp lstm

Source: https://stackoverflow.com/questions/68545931/how-to-batch-process-millions-of-text-data