Should PyTorch 8-bit quantization make Whisper inference faster on the GPU?

jxct1oxe  asked 6 months ago  in Other

I am running Whisper inference with Hugging Face transformers. The load_in_8bit quantization is provided by bitsandbytes.
If whisper-large-v3 is loaded in 8-bit mode on an NVIDIA T4 GPU, inference on a sample file takes much longer (about 5x). GPU utilization in nvidia-smi sits at 33%.
Shouldn't quantization improve inference speed on the GPU? https://pytorch.org/docs/stable/quantization.html
Similar questions:

import torch

from transformers import AutoModelForSpeechSeq2Seq, WhisperFeatureExtractor, WhisperTokenizerFast
from transformers.pipelines.audio_utils import ffmpeg_read

MODEL_NAME = "openai/whisper-large-v3"

tokenizer = WhisperTokenizerFast.from_pretrained(MODEL_NAME)
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)

# Load Whisper with bitsandbytes int8 weight quantization.
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    device_map='auto',
    load_in_8bit=True)

sample = "sample.mp3" #27s long

with torch.inference_mode():
    with open(sample, "rb") as f:
        inputs = f.read()
        inputs = ffmpeg_read(inputs, feature_extractor.sampling_rate)

        input_features = feature_extractor(inputs, sampling_rate=feature_extractor.sampling_rate, return_tensors='pt')['input_features']

        # Move the existing tensor to the GPU in half precision (avoids re-wrapping it with torch.tensor).
        input_features = input_features.to('cuda', dtype=torch.float16)

        forced_decoder_ids_output = model_8bit.generate(input_features=input_features, return_timestamps=False)

        out = tokenizer.decode(forced_decoder_ids_output.squeeze())
        print(out)

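For comparison, the non-quantized baseline this timing is presumably measured against would be loaded roughly as in the sketch below (torch_dtype=torch.float16 is an assumption for the fp16 baseline, not part of the original code):

import torch
from transformers import AutoModelForSpeechSeq2Seq

# Hypothetical fp16 baseline: same checkpoint, no bitsandbytes quantization.
model_fp16 = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,  # assumption: plain half precision instead of load_in_8bit
    device_map='auto')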

hts6caw3  1#

A model quantized to int8 is actually expected to be slower. Quantization adds extra operations to the model's forward pass: with bitsandbytes, the int8 weights have to be dequantized (with outlier features handled in fp16) at every matrix multiplication. You can read more about this in the int8 quantization paper, and some benchmarks here show the same behaviour.
The reason to use int8 quantization is to reduce the model's memory footprint, which allows larger models to be loaded on less hardware.
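As a rough sketch of that trade-off (assuming the same openai/whisper-large-v3 checkpoint and enough GPU memory to hold the fp16 copy), the memory footprints of the two loads can be compared with transformers' get_memory_footprint():

import torch
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "openai/whisper-large-v3"

# fp16 copy: faster matmuls on the T4, but the weights take roughly twice the memory of int8.
model_fp16 = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map='auto')
print("fp16 footprint (GB):", model_fp16.get_memory_footprint() / 1e9)

# int8 copy via bitsandbytes: smaller footprint, but extra dequantization work in every forward pass.
model_int8 = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME, load_in_8bit=True, device_map='auto')
print("int8 footprint (GB):", model_int8.get_memory_footprint() / 1e9)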
