How to train YOLO in the container? error: AssertionError: Torch not compiled with CUDA enabled #799

jiekechoo · 2025-03-31T02:04:13Z

Describe the issue

Run docker container

docker run -it     \
--device /dev/dri     \
-v /dev/dri/by-path:/dev/dri/by-path     \
--ipc=host     \
intel/intel-extension-for-pytorch:2.6.10-xpu

Python environment and error output:

root@d6253cd9f54c:/# pip list |grep "torch\|ultralytics"
intel_extension_for_pytorch 2.6.10+xpu
pytorch-triton-xpu          3.2.0
torch                       2.6.0+xpu
torchaudio                  2.6.0+xpu
torchvision                 0.21.0+xpu
ultralytics                 8.3.99
ultralytics-thop            2.0.14

root@d6253cd9f54c:/# python
Python 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ultralytics import YOLO
[W331 01:50:04.006781204 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
2025-03-31 01:50:05,421 - matplotlib.font_manager - INFO - generated new fontManager
Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
>>> [W331 01:50:05.832608465 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())

>>> import torch
>>> import intel_extension_for_pytorch as ipex 
>>> device = torch.device('xpu' if torch.xpu.is_available() else 'cpu') 
>>> device
device(type='xpu')

>>> model = YOLO("yolov8n.pt").to(device)
>>> model.info()
YOLOv8n summary: 129 layers, 3,157,200 parameters, 0 gradients, 8.9 GFLOPs
(129, 3157200, 0, 8.8575488)

>>> results = model.train(data="coco8.yaml", epochs=100, imgsz=640)
engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco8.yaml, epochs=100, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=xpu:0, workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=None, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train2

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 22        [15, 18, 21]  1    897664  ultralytics.nn.modules.head.Detect           [80, [64, 128, 256]]          
Model summary: 129 layers, 3,157,200 parameters, 3,157,184 gradients, 8.9 GFLOPs

Transferred 355/355 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py", line 791, in train
    self.trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py", line 211, in train
    self._do_train(world_size)
  File "/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py", line 327, in _do_train
    self._setup_train(world_size)
  File "/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py", line 269, in _setup_train
    self.amp = torch.tensor(check_amp(self.model), device=self.device)
  File "/usr/local/lib/python3.10/dist-packages/ultralytics/utils/checks.py", line 735, in check_amp
    gpu = torch.cuda.get_device_name(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 491, in get_device_name
    return get_device_properties(device).name
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 523, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 310, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I followed this article: https://community.intel.com/t5/GPU-Compute-Software/How-to-train-yolov8-using-intel-arc-770-gpu/m-p/1597471, but it did not work. Is there a guide for training YOLO?


ZailiWang · 2025-03-31T03:05:15Z

Hi, from the error message it looks like the target device is not being specified as expected. I'm looking into whether some device setup step is missing.

jiekechoo · 2025-03-31T03:16:25Z

Setting amp=False in the training Python file works on the Ubuntu host.

from ultralytics import YOLO

import torch
import intel_extension_for_pytorch as ipex
device = torch.device('xpu' if torch.xpu.is_available() else 'cpu')

print(device)

model = YOLO("yolov8n.pt").to(device)
results = model.train(
        data="coco8.yaml",
        epochs=100,
        imgsz=640,
        device=device,
        amp=False
        )
print(results)

(venv) root@A770:/opt# pip list
Package                     Version
--------------------------- -----------
certifi                     2025.1.31
charset-normalizer          3.4.1
contourpy                   1.3.1
cycler                      0.12.1
filelock                    3.18.0
fonttools                   4.56.0
fsspec                      2025.3.1
idna                        3.10
intel_extension_for_pytorch 2.5.0
Jinja2                      3.1.6
kiwisolver                  1.4.8
MarkupSafe                  3.0.2
matplotlib                  3.10.1
mpmath                      1.3.0
networkx                    3.4.2
numpy                       2.1.1
opencv-python               4.11.0.86
packaging                   24.2
pandas                      2.2.3
pillow                      11.0.0
pip                         25.0.1
psutil                      7.0.0
py-cpuinfo                  9.0.0
pyparsing                   3.2.3
python-dateutil             2.9.0.post0
pytorch-triton-xpu          3.1.0
pytz                        2025.2
PyYAML                      6.0.2
requests                    2.32.3
scipy                       1.15.2
seaborn                     0.13.2
setuptools                  59.6.0
six                         1.17.0
sympy                       1.13.1
torch                       2.5.1+xpu
torchaudio                  2.5.1+xpu
torchvision                 0.20.1+xpu
tqdm                        4.67.1
typing_extensions           4.13.0
tzdata                      2025.2
ultralytics                 8.3.72
ultralytics-thop            2.0.14
urllib3                     2.3.0
(venv) root@A770:/opt# python train.py 
xpu
New https://pypi.org/project/ultralytics/8.3.99 available 😃 Update with 'pip install -U ultralytics'
engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco8.yaml, epochs=100, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=xpu, workers=8, project=None, name=train10, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=False, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=None, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train10

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 22        [15, 18, 21]  1    897664  ultralytics.nn.modules.head.Detect           [80, [64, 128, 256]]          
Model summary: 225 layers, 3,157,200 parameters, 3,157,184 gradients, 8.9 GFLOPs

Transferred 355/355 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
train: Scanning /opt/datasets/coco8/labels/train.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
val: Scanning /opt/datasets/coco8/labels/val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
Plotting labels to runs/detect/train10/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000119, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train10
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100         0G       1.07      3.509      1.515         21        640: 100%|██████████| 1/1 [00:03<00:00,  3.25s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  1.38it/s]
                   all          4         17       0.62      0.877      0.888      0.618

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100         0G      1.132      2.786      1.441         36        640: 100%|██████████| 1/1 [00:00<00:00,  2.80it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  8.52it/s]
                   all          4         17      0.598      0.896      0.888      0.625

intel-gpu-top: 8086:56a0 @ /dev/dri/card0 - 2288/2397 MHz;   3% RC6;     6263 irqs/s

         ENGINES     BUSY                                                                                                                                                                         MI_SEMA MI_WAIT
       Render/3D    0.00% |                                                                                                                                                                     |      0%      0%
         Blitter   25.94% |██████████████████████████████████████████▉                                                                                                                          |     15%      0%
           Video    0.00% |                                                                                                                                                                     |      0%      0%
    VideoEnhance    0.00% |                                                                                                                                                                     |      0%      0%
       [unknown]   40.80% |███████████████████████████████████████████████████████████████████▍                                                                                                 |      0%      0%
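
The amp=False workaround above could be made conditional on the device type, so the same script keeps AMP on CUDA machines. This is only a minimal sketch: `train_overrides` is a hypothetical helper, not part of Ultralytics, and it assumes that AMP should be requested only on CUDA because the traceback in this thread shows the AMP check calling `torch.cuda.get_device_name()`.

```python
def train_overrides(device_type: str) -> dict:
    """Return keyword overrides for model.train() based on the device type.

    Hypothetical helper. Assumption: AMP is requested only on CUDA, since the
    AMP check in the traceback above queries torch.cuda unconditionally.
    """
    use_amp = device_type == "cuda"
    return {"amp": use_amp}


if __name__ == "__main__":
    # Show the override chosen for each device type.
    for name in ("cuda", "xpu", "cpu"):
        print(name, train_overrides(name))
```

In the training script this might be used as `model.train(data="coco8.yaml", epochs=100, imgsz=640, device=device, **train_overrides(device.type))`.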

jiekechoo · 2025-03-31T03:22:56Z

I got it.

The container has ultralytics 8.3.99, but the host has 8.3.72.

After downgrading to 8.3.72 in the container, training works.

root@d6253cd9f54c:/# pip list |grep ultralytics
ultralytics                 8.3.72
ultralytics-thop            2.0.14
root@d6253cd9f54c:/# more train.py 
from ultralytics import YOLO

import torch
import intel_extension_for_pytorch as ipex
device = torch.device('xpu' if torch.xpu.is_available() else 'cpu')

print(device)

model = YOLO("yolov8n.pt").to(device)
results = model.train(
        data="coco8.yaml",
        epochs=100,
        imgsz=640,
        device=device,
        amp=False
        )
print(results)
root@d6253cd9f54c:/# python train.py 
[W331 03:21:45.859557434 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
xpu
New https://pypi.org/project/ultralytics/8.3.99 available 😃 Update with 'pip install -U ultralytics'
[W331 03:21:47.697545964 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco8.yaml, epochs=100, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=xpu, workers=8, project=None, name=train7, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=False, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=None, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train7

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 22        [15, 18, 21]  1    897664  ultralytics.nn.modules.head.Detect           [80, [64, 128, 256]]          
Model summary: 225 layers, 3,157,200 parameters, 3,157,184 gradients, 8.9 GFLOPs

Transferred 355/355 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
train: Scanning /datasets/coco8/labels/train.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
val: Scanning /datasets/coco8/labels/val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
Plotting labels to runs/detect/train7/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000119, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train7
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100         0G       1.07      3.509      1.515         21        640: 100%|██████████| 1/1 [00:06<00:00,  6.83s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  1.05it/s]
                   all          4         17       0.62      0.877      0.888      0.618

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100         0G      1.132      2.786      1.441         36        640: 100%|██████████| 1/1 [00:00<00:00,  2.24it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  4.31it/s]
                   all          4         17      0.598      0.896      0.888      0.625
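
To compare the host and container environments mentioned above without relying on `pip list | grep`, the installed versions can be read from Python. This is a small sketch using only the standard library; `installed_versions` and `PACKAGES` are illustrative names, and the package list simply mirrors the ones discussed in this thread.

```python
from importlib import metadata

# Packages whose versions differed or mattered in this thread.
PACKAGES = ("torch", "intel_extension_for_pytorch", "ultralytics")


def installed_versions(names=PACKAGES) -> dict:
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:30} {version or 'not installed'}")
```

Running the same script on the host and inside the container gives two listings that can be compared line by line.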

ZailiWang · 2025-03-31T03:25:38Z

It works for both AMP & non-AMP after the versions got aligned?

jiekechoo · 2025-03-31T03:29:39Z

> It works for both AMP & non-AMP after the versions got aligned?

No, it still does not work when amp=True.

I still get:

AssertionError: Torch not compiled with CUDA enabled

jiekechoo · 2025-03-31T03:31:55Z

Here is the GPU info: 😄

Intel(R) Arc(TM) A770 Graphics

>>> import torch
>>> import sys
>>> print(torch.device('xpu'))
xpu
>>> import intel_extension_for_pytorch as ipex
>>> print(torch.xpu.has_xpu())
True
>>> if (not torch.xpu.is_available()):
...     print('Intel GPU not detected. Please install GPU with compatible drivers')
...     sys.exit(1)
... 
>>> print(torch.xpu.has_onemkl())
True
>>> print(torch.__version__); print(ipex.__version__)
2.6.0+xpu
2.6.10+xpu
>>> [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())]
[0]: _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32224+14', total_memory=15473MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=32, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
[None]

jiekechoo · 2025-03-31T07:52:24Z

When the training task finished and the new model had been created, I got the following error.
Do I have to modify the torch_utils.py source code?

100 epochs completed in 0.036 hours.
Optimizer stripped from runs/detect/train/weights/last.pt, 6.5MB
Optimizer stripped from runs/detect/train/weights/best.pt, 6.5MB

Validating runs/detect/train/weights/best.pt...
Ultralytics 8.3.72 🚀 Python-3.11.0rc1 torch-2.6.0+xpu 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 10
      7 print(device)
      9 model = YOLO("yolov8n.pt").to(device)
---> 10 results = model.train(
     11         data="coco8.yaml",
     12         epochs=100,
     13         imgsz=640,
     14         device=device,
     15         amp=False
     16         )

File /usr/local/lib/python3.11/dist-packages/ultralytics/engine/model.py:808, in Model.train(self, trainer, **kwargs)
    805     self.model = self.trainer.model
    807 self.trainer.hub_session = self.session  # attach optional HUB session
--> 808 self.trainer.train()
    809 # Update model and cfg after training
    810 if RANK in {-1, 0}:

File /usr/local/lib/python3.11/dist-packages/ultralytics/engine/trainer.py:207, in BaseTrainer.train(self)
    204         ddp_cleanup(self, str(file))
    206 else:
--> 207     self._do_train(world_size)

File /usr/local/lib/python3.11/dist-packages/ultralytics/engine/trainer.py:469, in BaseTrainer._do_train(self, world_size)
    467 seconds = time.time() - self.train_time_start
    468 LOGGER.info(f"\n{epoch - self.start_epoch + 1} epochs completed in {seconds / 3600:.3f} hours.")
--> 469 self.final_eval()
    470 if self.args.plots:
    471     self.plot_metrics()

File /usr/local/lib/python3.11/dist-packages/ultralytics/engine/trainer.py:687, in BaseTrainer.final_eval(self)
    685 LOGGER.info(f"\nValidating {f}...")
    686 self.validator.args.plots = self.args.plots
--> 687 self.metrics = self.validator(model=f)
    688 self.metrics.pop("fitness", None)
    689 self.run_callbacks("on_fit_epoch_end")

File /usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /usr/local/lib/python3.11/dist-packages/ultralytics/engine/validator.py:130, in BaseValidator.__call__(self, trainer, model)
    126     LOGGER.warning("WARNING ⚠️ validating an untrained model YAML will result in 0 mAP.")
    127 callbacks.add_integration_callbacks(self)
    128 model = AutoBackend(
    129     weights=model or self.args.model,
--> 130     device=select_device(self.args.device, self.args.batch),
    131     dnn=self.args.dnn,
    132     data=self.args.data,
    133     fp16=self.args.half,
    134 )
    135 # self.model = model
    136 self.device = model.device  # update device

File /usr/local/lib/python3.11/dist-packages/ultralytics/utils/torch_utils.py:188, in select_device(device, batch, newline, verbose)
    181         LOGGER.info(s)
    182         install = (
    183             "See https://pytorch.org/get-started/locally/ for up-to-date torch install instructions if no "
    184             "CUDA devices are seen by torch.\n"
    185             if torch.cuda.device_count() == 0
    186             else ""
    187         )
--> 188         raise ValueError(
    189             f"Invalid CUDA 'device={device}' requested."
    190             f" Use 'device=cpu' or pass valid CUDA device(s) if available,"
    191             f" i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.\n"
    192             f"\ntorch.cuda.is_available(): {torch.cuda.is_available()}"
    193             f"\ntorch.cuda.device_count(): {torch.cuda.device_count()}"
    194             f"\nos.environ['CUDA_VISIBLE_DEVICES']: {visible}\n"
    195             f"{install}"
    196         )
    198 if not cpu and not mps and torch.cuda.is_available():  # prefer GPU if available
    199     devices = device.split(",") if device else "0"  # i.e. "0,1" -> ["0", "1"]

ValueError: Invalid CUDA 'device=xpu' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.

torch.cuda.is_available(): False
torch.cuda.device_count(): 0
os.environ['CUDA_VISIBLE_DEVICES']: None
See https://pytorch.org/get-started/locally/ for up-to-date torch install instructions if no CUDA devices are seen by torch.

ZailiWang · 2025-03-31T11:08:42Z

Hi @jiekechoo, the errors you encountered occur because the ultralytics yolov8/v11 repo does not support Intel GPUs (the xpu device) natively. The maintainer stated here that Intel GPU devices can be supported easily, but some of the logic in the repository is still hard-coded for CUDA devices, and that is essentially the code your errors were thrown from.
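
To illustrate why device='xpu' is rejected, here is a simplified paraphrase of a CUDA-only device check. It is only a sketch and not the actual ultralytics code: the function name resolve_device and its cuda_available / cuda_count parameters are hypothetical stand-ins for the torch.cuda queries shown in the traceback.

```python
def resolve_device(device, cuda_available, cuda_count):
    """Simplified, hypothetical paraphrase of a CUDA-only device check."""
    device = str(device or "").lower().strip()
    if device in {"cpu", "mps"}:
        return device
    if not device:
        # Nothing requested: prefer the first CUDA device, else fall back to CPU.
        return "cuda:0" if cuda_available else "cpu"
    # Any other value is assumed to be a comma-separated list of CUDA indices,
    # so a name such as "xpu" can never pass this gate on a non-CUDA build.
    requested = device.split(",")
    if not (cuda_available and cuda_count >= len(requested)):
        raise ValueError(f"Invalid CUDA 'device={device}' requested.")
    return f"cuda:{requested[0]}"


try:
    resolve_device("xpu", cuda_available=False, cuda_count=0)
except ValueError as err:
    print(err)
```

Under this kind of check, "xpu" is treated as a CUDA index list, which is why the error message talks about CUDA even though an Intel GPU was requested.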

Since ultralytics is a pre-installed library, the easiest way to enable xpu is to edit the source code directly in the library path, removing the CUDA-related statements and checks that the training process goes through. I tried this and found a minimal set of changes, listed below. In the ipex 2.6.10 container, the installed ultralytics library should be at /usr/local/lib/python3.10/dist-packages/ultralytics. Assuming ultralytics==8.3.99 is installed and that folder is the current directory:

In utils/torch_utils.py, return torch.device('xpu') directly at the beginning of the select_device() function (L164);
In utils/checks.py, return True or False directly at the beginning of the check_amp() function (L723), depending on whether you are enabling AMP;
In engine/trainer.py, change torch.cuda.xxx to torch.xpu.xxx at L502-L504, and at L383 add the argument device='xpu' to the autocast() call.

With these changes, your training script works on my side. Please give it a try and check whether it solves your problem. Thanks!
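
If editing files under dist-packages is not desirable, the first two edits above could also be applied at run time. The following is only a sketch under assumptions, not something tested in this thread: rebind_everywhere and apply_xpu_workaround are hypothetical helper names, and it assumes select_device and check_amp are plain module-level functions in ultralytics. It does not cover the torch.cuda.xxx calls or the autocast() call in engine/trainer.py, which would still need the edit described above.

```python
import sys


def rebind_everywhere(original, replacement, prefix):
    """Point every module-level name under `prefix` that refers to `original` at `replacement`.

    Needed because `from module import func` copies the reference into the
    importing module, so patching only the defining module is not enough.
    """
    patched = []
    for mod_name, module in list(sys.modules.items()):
        if not (mod_name == prefix or mod_name.startswith(prefix + ".")):
            continue
        for attr, value in list(getattr(module, "__dict__", {}).items()):
            if value is original:
                setattr(module, attr, replacement)
                patched.append(f"{mod_name}.{attr}")
    return sorted(patched)


def apply_xpu_workaround(use_amp=False):
    """Hypothetical helper: call once, after importing ultralytics and before training."""
    import torch
    from ultralytics.utils import checks, torch_utils

    def select_device_xpu(*args, **kwargs):
        # Same effect as returning early at the top of select_device().
        return torch.device("xpu")

    def check_amp_fixed(*args, **kwargs):
        # Same effect as returning True/False at the top of check_amp().
        return use_amp

    rebind_everywhere(torch_utils.select_device, select_device_xpu, "ultralytics")
    rebind_everywhere(checks.check_amp, check_amp_fixed, "ultralytics")
```

Calling apply_xpu_workaround() before starting training would then stand in for the edits to utils/torch_utils.py and utils/checks.py, while leaving the installed package untouched.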

jiekechoo · 2025-03-31T13:08:24Z

@ZailiWang Thanks for your reply.
I think these HARD-CODED changes in the ipex container are not a good idea.
I want to use the YOLO framework to train custom models, but I need to modify its source code for my specific images.

How can I use IPEX to train models across multiple GPUs, either on a single host or in a Kubernetes cluster? Are there any official documents I can refer to for this?

ZailiWang · 2025-04-01T02:55:57Z

Let's sync up on this via email.

jiekechoo · 2025-04-01T07:50:24Z

Fixed after following the guide above. Thanks.

ramesh-dev-code · 2025-04-02T11:17:58Z

@ZailiWang Thanks for mentioning my workaround in ultralytics for training YOLO models on Intel GPUs.

jiekechoo · 2025-04-03T00:04:28Z

I’m trying to create a PR for this issue.

ZailiWang self-assigned this Mar 31, 2025

ZailiWang closed this as completed Apr 1, 2025

jiekechoo mentioned this issue Apr 1, 2025

Train YOLOv11/v8 on Intel Arc Discrete GPU ultralytics/ultralytics#19821

Closed

1 task

ZailiWang added the ARC (ARC GPU) and Ecosystem (PyTorch ecosystem related) labels Apr 1, 2025
