PyTorch CUDA RuntimeError when moving a Tensor to a different device

swvgeqrz  posted 6 months ago in Other

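When I move a Tensor to "cuda", I get an error. The same thing happens when I move a Tensor from "cuda" to the CPU.
I have checked the shape and dtype of my Tensor, and everything looks fine. Does anyone know what the problem might be?
My traceback: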

/opt/conda/conda-bld/pytorch_1678402411778/work/aten/src/ATen/native/cuda/Indexing.cu:1146: 
indexSelectLargeIndex: block: [55,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
......
Traceback (most recent call last):
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self._run_sanity_check()
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1062, in _run_sanity_check
    val_loop.run()
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 391, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 403, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 1027, in validation_step
    bs, logits, loss = self.forward(batch)
  File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 934, in forward
    indices, templates_candidates, templates_candidates_score = self.topk_candidates(
  File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 1138, in topk_candidates
    scores, indices = scores.cpu().detach().numpy(), indices.cpu().detach().numpy()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


yrefmtwq  #1

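At the beginning of the trace you pasted, I see:
indexSelectLargeIndex: block: [55,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
This is a shape error: you are indexing into a Tensor with an index that is greater than or equal to the Tensor's size along the given dimension, much like an "IndexOutOfBoundsException". It has nothing to do with moving Tensors between cuda and cpu.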
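One way to surface this kind of error with a readable message is to validate the index tensor before it is used. The following is only a minimal sketch: `check_index_bounds` is a hypothetical helper (not part of PyTorch), the tensors are made-up data, and it assumes negative indices should also be rejected. It runs on the CPU, so no GPU is needed to try it.

```python
import torch


def check_index_bounds(index: torch.Tensor, dim_size: int, name: str = "index") -> None:
    """Raise a readable IndexError if any value of `index` is outside [0, dim_size - 1].

    Hypothetical debugging helper, not part of PyTorch.
    """
    if index.numel() == 0:
        return  # nothing to check
    lowest, highest = int(index.min()), int(index.max())
    if lowest < 0 or highest >= dim_size:
        raise IndexError(
            f"{name} contains values in [{lowest}, {highest}], "
            f"but the valid range is [0, {dim_size - 1}]"
        )


# Made-up data: a source tensor with 4 rows and one index that is too large.
source = torch.arange(12.0).reshape(4, 3)
good_index = torch.tensor([0, 3])
bad_index = torch.tensor([0, 4])  # 4 is out of range for a dimension of size 4

check_index_bounds(good_index, source.size(0))        # passes silently
selected = torch.index_select(source, 0, good_index)  # picks rows 0 and 3

try:
    check_index_bounds(bad_index, source.size(0), name="bad_index")
except IndexError as err:
    print(f"caught: {err}")
```

In the question's code, a check like this would belong wherever the index tensor is built or first used. The pasted traceback does not show which call that is, so the exact location is left open here.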
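The stack trace points you to a different part of your code, but that is because, as the error message itself says:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.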
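The error message also suggests passing `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so that the Python traceback should point at the call that actually failed. A minimal sketch of setting it from inside a script follows; `enable_blocking_cuda_launches` is a hypothetical helper name, and the sketch assumes the variable has to be in place before CUDA is initialized, so it is set before any `torch` import.

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Ask CUDA to run kernel launches synchronously (debugging only, slows training).

    Hypothetical helper; it only sets the environment variable named in the error message.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


# Call this at the very top of the entry script, before importing torch
# or anything else that may initialize CUDA.
enable_blocking_cuda_launches()

# ... import torch and start the training run only after this point ...
```

The same effect can be had without touching the code by setting the variable in the shell when launching the run, for example `CUDA_LAUNCH_BLOCKING=1 python train.py`, where `train.py` stands in for whatever the real entry script is called. Another option is to run the failing step on the CPU, where errors are raised synchronously.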
