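I get an error when I move a tensor to "cuda". The same thing happens when I move a tensor from "cuda" to the CPU.
I have checked the shapes and dtypes of my tensors and everything looks fine. Does anyone know what the problem might be?
My traceback: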
/opt/conda/conda-bld/pytorch_1678402411778/work/aten/src/ATen/native/cuda/Indexing.cu:1146:
indexSelectLargeIndex: block: [55,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
......
Traceback (most recent call last):
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
self._run_sanity_check()
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1062, in _run_sanity_check
val_loop.run()
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 134, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 391, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 403, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 1027, in validation_step
bs, logits, loss = self.forward(batch)
File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 934, in forward
indices, templates_candidates, templates_candidates_score = self.topk_candidates(
File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 1138, in topk_candidates
scores, indices = scores.cpu().detach().numpy(), indices.cpu().detach().numpy()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
1 Answer
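At the very start of the trace you pasted, I see:
indexSelectLargeIndex: block: [55,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
This is an indexing (shape) error. You are indexing into a tensor with an index that is not smaller than the tensor's size along the selected dimension, similar to an "IndexOutOfBoundsException". It has nothing to do with moving tensors between cuda and cpu.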
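One way to track down the offending values is to check the indices against the size of the dimension they select from, before the lookup runs. The sketch below is only an illustration in plain Python: `find_out_of_range`, `vocab_size` and `token_ids` are hypothetical names and values, not anything from the question's code, and it assumes negative indices are also invalid for the lookup.

```python
def find_out_of_range(indices, dim_size):
    """Return (position, value) pairs whose value is not a valid index
    into a dimension of length dim_size.

    Assumption: negative indices are treated as invalid as well.
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


# Hypothetical data: a lookup table with 10 rows and a batch of ids.
vocab_size = 10
token_ids = [3, 9, 10, 0, 12]

bad = find_out_of_range(token_ids, vocab_size)
print(bad)  # [(2, 10), (4, 12)]
```

With real tensors the same check could be fed with something like `indices.flatten().tolist()` and `table.size(0)`, where `indices` and `table` stand in for whatever index tensor and source tensor your model actually uses.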
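The stack trace points you to a different part of your code, but that is because, as the error message itself says:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.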
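To get a stack trace that points at the line that actually launched the failing kernel, the error message suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of setting it from inside the script follows. It assumes the variable has to be in place before CUDA is initialised, so setting it before `torch` is imported is the cautious choice. Exporting it in the shell before starting the script should work just as well.

```python
import os

# Make CUDA kernel launches synchronous so that the Python stack trace
# lines up with the call that really failed. Set this before CUDA is
# initialised; doing it before importing torch is the safe option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and the rest of the training code) after this point
```

Running the same batch on the CPU is another option: the out-of-range index is then reported synchronously, at the indexing call itself.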