I'm very new to LLM serving and quantization, so any pointers would be greatly appreciated. I'm trying to use AutoAWQ to quantize my model. I have the following packages installed:
Package Version
------------------ ------------
absl-py 2.0.0
accelerate 0.24.1
aiohttp 3.9.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 3.7.1
async-timeout 4.0.3
attributedict 0.3.0
attrs 23.1.0
autoawq 0.1.7
blessings 1.7
cachetools 5.3.2
certifi 2022.12.7
chardet 5.2.0
charset-normalizer 2.1.1
click 8.1.7
codecov 2.1.13
colorama 0.4.6
coloredlogs 15.0.1
colour-runner 0.1.1
coverage 7.3.2
DataProperty 1.0.1
datasets 2.15.0
deepdiff 6.7.1
dill 0.3.7
distlib 0.3.7
distro 1.8.0
exceptiongroup 1.1.3
filelock 3.9.0
frozenlist 1.4.0
fsspec 2023.4.0
h11 0.14.0
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.19.4
humanfriendly 10.0
idna 3.4
inspecta 0.1.3
Jinja2 3.1.2
joblib 1.3.2
jsonlines 4.0.0
lm-eval 0.3.0
MarkupSafe 2.1.3
mbstrdecoder 1.1.3
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
networkx 3.0
nltk 3.8.1
numexpr 2.8.6
numpy 1.24.1
openai 1.3.3
ordered-set 4.1.0
packaging 23.2
pandas 2.0.3
pathvalidate 3.2.0
Pillow 9.3.0
pip 19.3.1
platformdirs 4.0.0
pluggy 1.3.0
portalocker 2.8.2
protobuf 4.25.1
psutil 5.9.6
pyarrow 14.0.1
pyarrow-hotfix 0.5
pybind11 2.11.1
pycountry 22.3.5
pydantic 2.5.1
pydantic-core 2.14.3
pygments 2.17.1
pyproject-api 1.6.1
pytablewriter 1.2.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.10.3
requests 2.28.1
rootpath 0.1.1
rouge-score 0.1.2
sacrebleu 1.5.0
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.10.1
sentencepiece 0.1.99
setuptools 41.6.0
six 1.16.0
sniffio 1.3.0
sqlitedict 2.1.0
sympy 1.12
tabledata 1.3.3
tabulate 0.9.0
tcolorpy 0.1.4
termcolor 2.3.0
texttable 1.7.0
threadpoolctl 3.2.0
tokenizers 0.15.0
toml 0.10.2
tomli 2.0.1
torch 2.1.1+cu118
torchaudio 2.1.1+cu118
torchvision 0.16.1+cu118
tox 4.11.3
tqdm 4.66.1
tqdm-multiprocess 0.0.11
transformers 4.35.2
triton 2.1.0
typepy 1.3.2
typing-extensions 4.4.0
tzdata 2023.3
urllib3 1.26.13
virtualenv 20.24.6
xxhash 3.4.1
yarl 1.9.2
zstandard 0.22.0
I tried running the example code from https://github.com/casper-hansen/AutoAWQ:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
But I get the following error:
/usr/test3/lib64/python3.8/site-packages/huggingface_hub/utils/_runtime.py:184: UserWarning: Pydantic is installed but cannot be imported. Please check your installation. `huggingface_hub` will default to not using Pydantic. Error message: '{e}'
warnings.warn(
Traceback (most recent call last):
File "quant.py", line 1, in <module>
from awq import AutoAWQForCausalLM
File "/usr/test3/lib64/python3.8/site-packages/awq/__init__.py", line 2, in <module>
from awq.models.auto import AutoAWQForCausalLM
File "/usr/test3/lib64/python3.8/site-packages/awq/models/__init__.py", line 1, in <module>
from .mpt import MptAWQForCausalLM
File "/usr/test3/lib64/python3.8/site-packages/awq/models/mpt.py", line 1, in <module>
from .base import BaseAWQForCausalLM
File "/usr/test3/lib64/python3.8/site-packages/awq/models/base.py", line 12, in <module>
from awq.quantize.quantizer import AwqQuantizer
File "/usr/test3/lib64/python3.8/site-packages/awq/quantize/quantizer.py", line 11, in <module>
from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV
File "/usr/test3/lib64/python3.8/site-packages/awq/modules/linear.py", line 4, in <module>
import awq_inference_engine # with CUDA kernels
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
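For completeness, here is a quick way to check which CUDA runtime the installed PyTorch build expects (a minimal sketch; the values in the comments are inferred from the package list above, not captured output):

# torch 2.1.1+cu118 should report a CUDA 11.8 build, while the AWQ
# extension above is asking for libcudart.so.12, i.e. a CUDA 12.x runtime.
import torch

print(torch.__version__)         # expected: 2.1.1+cu118
print(torch.version.cuda)        # expected: 11.8
print(torch.cuda.is_available())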
Here is my NVIDIA configuration (nvidia-smi output):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 Off | 00000000:17:00.0 Off | 0 |
| 0% 40C P0 59W / 150W | 18106MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10 Off | 00000000:31:00.0 Off | 0 |
| 0% 28C P8 21W / 150W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10 Off | 00000000:B1:00.0 Off | 0 |
| 0% 26C P8 20W / 150W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10 Off | 00000000:CA:00.0 Off | 0 |
| 0% 26C P8 20W / 150W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
And here is the nvcc --version output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
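So nvcc reports the locally installed toolkit (11.2), while nvidia-smi reports the highest CUDA version the driver supports (11.8); neither provides the CUDA 12 runtime the wheel is asking for. A hedged way to check directly which libcudart sonames the process can resolve (assuming the usual soname conventions: libcudart.so.11.0 for CUDA 11.x, libcudart.so.12 for CUDA 12.x):

import ctypes
import torch  # importing torch first loads its bundled CUDA 11.8 runtime libraries

for soname in ("libcudart.so.12", "libcudart.so.11.0"):
    try:
        ctypes.CDLL(soname)
        print(soname, "-> loadable")
    except OSError as err:
        print(soname, "->", err)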
1 Answer
I recently used AWQ on Runpod and hit the same problem. There, nvidia-smi reported CUDA version 12.3 by default. The prebuilt kernels have to match the CUDA runtime that PyTorch was built against, so I solved it by installing the libraries with these commands:
!pip -q install --upgrade fschat accelerate autoawq vllm
!pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0 torchtext==0.16.0+cpu torchdata==0.7.0 --index-url https://download.pytorch.org/whl/cu121
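After reinstalling, a quick sanity check (a minimal sketch; awq_inference_engine is the extension module named in the traceback above):

# The CUDA version torch was built with should match the +cuXXX suffix
# of the wheels installed above.
import torch

print(torch.__version__)   # e.g. 2.1.0+cu121
print(torch.version.cuda)  # e.g. 12.1

# The kernel extension from the traceback should now import cleanly.
import awq_inference_engine
print("awq_inference_engine imported OK")

Note that the machine in the question reports a driver that supports at most CUDA 11.8 (NVIDIA-SMI 520.61.05), so the +cu121 route also requires a driver upgrade; the alternative is a torch/AutoAWQ combination built for CUDA 11.8.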
Link to the solution: https://github.com/vllm-project/vllm/issues/1718