Talk:OpenCL

From Gentoo Wiki

Translation

Talk status
This discussion is done.

Hi,
I would like to translate this page to Japanese. Could the editor request translation?
Hisashi

amdgpu-pro-opencl instability

Talk status
This discussion is still ongoing.

Word of warning for other users: it's not kidding about the risk of mixing and matching drivers.

Running things sometimes works, but usually dmesg fills up with the dying gurgles of a broken driver, X stops working, and the computer follows soon after.

Here's one I suffered earlier:

user $ awk '/ERROR/,/end / { print }' /var/log/kernel/current
Mar 13 02:04:13 [kernel] [25091.704610] [drm:amdgpu_ttm_backend_bind] *ERROR* failed to pin userptr
Mar 13 02:04:13 [kernel] [25091.937866] [drm:amdgpu_ttm_backend_bind] *ERROR* failed to pin userptr
Mar 13 02:04:14 [kernel] [25092.903125] [drm:amdgpu_ttm_backend_bind] *ERROR* failed to pin userptr
Mar 13 02:04:14 [kernel] [25092.914144] [drm:amdgpu_ttm_backend_bind] *ERROR* failed to pin userptr
Mar 13 02:04:14 [kernel] [25092.961676] ------------[ cut here ]------------
Mar 13 02:04:14 [kernel] [25092.961685] WARNING: CPU: 18 PID: 55924 at drivers/iommu/dma-iommu.c:471 __iommu_dma_unmap+0xe1/0xf0
Mar 13 02:04:14 [kernel] [25092.961686] Modules linked in: ext4 mbcache jbd2 fuse sd_mod bnep bluetooth ecdh_generic ecc crc16 rfkill kvm_amd kvm ahci irqbypass libahci libata uas usb_storage input_leds cdc_acm hid_microsoft led_class scsi_mod
Mar 13 02:04:14 [kernel] [25092.961698] CPU: 18 PID: 55924 Comm: FahCore_22 Tainted: G      D           5.5.9-zen-01720-gb286bb50f #22
Mar 13 02:04:14 [kernel] [25092.961699] Hardware name: Gigabyte Technology Co., Ltd. X570 UD/X570 UD, BIOS F11 12/06/2019
Mar 13 02:04:14 [kernel] [25092.961701] RIP: 0010:__iommu_dma_unmap+0xe1/0xf0
Mar 13 02:04:14 [kernel] [25092.961703] Code: c0 74 0b 48 89 e6 4c 89 f7 e8 6b d1 76 00 48 c7 44 24 08 00 00 00 00 48 c7 44 24 10 00 00 00 00 48 c7 04 24 ff ff ff ff eb a2 <0f> 0b eb 94 e8 66 df bb ff 66 0f 1f 44 00 00 41 57 41 56 49 89 f7
Mar 13 02:04:14 [kernel] [25092.961704] RSP: 0018:ffff963f14a37bd0 EFLAGS: 00010206
Mar 13 02:04:14 [kernel] [25092.961705] RAX: 0000000040000000 RBX: 0000000000000001 RCX: ffff963f14a37b48
Mar 13 02:04:14 [kernel] [25092.961705] RDX: 0000000000000000 RSI: ffffffffc0000000 RDI: 0000000000000015
Mar 13 02:04:14 [kernel] [25092.961706] RBP: ffff940fd2d27000 R08: 0000000000000000 R09: 0000000000000000
Mar 13 02:04:14 [kernel] [25092.961706] R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000002000
Mar 13 02:04:14 [kernel] [25092.961707] R13: ffff9411078ea800 R14: ffff9411087d0e20 R15: ffff9410243c9e38
Mar 13 02:04:14 [kernel] [25092.961708] FS:  00007f885c38bf80(0000) GS:ffff94110ec80000(0000) knlGS:0000000000000000
Mar 13 02:04:14 [kernel] [25092.961709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 02:04:14 [kernel] [25092.961709] CR2: 00007f885c9f2270 CR3: 0000000e2c56c000 CR4: 0000000000340ee0
Mar 13 02:04:14 [kernel] [25092.961710] Call Trace:
Mar 13 02:04:14 [kernel] [25092.961715]  ttm_unmap_and_unpopulate_pages+0xa7/0x130
Mar 13 02:04:14 [kernel] [25092.961717]  ttm_tt_destroy.part.0+0x44/0x50
Mar 13 02:04:14 [kernel] [25092.961718]  ttm_bo_cleanup_memtype_use+0x2d/0x80
Mar 13 02:04:14 [kernel] [25092.961720]  ttm_bo_put+0x2ac/0x330
Mar 13 02:04:14 [kernel] [25092.961722]  amdgpu_bo_unref+0x15/0x20
Mar 13 02:04:14 [kernel] [25092.961724]  amdgpu_gem_object_free+0x2b/0x50
Mar 13 02:04:14 [kernel] [25092.961726]  drm_gem_object_release_handle+0x6b/0x90
Mar 13 02:04:14 [kernel] [25092.961728]  drm_gem_handle_delete+0x53/0x90
Mar 13 02:04:14 [kernel] [25092.961730]  ? drm_gem_handle_create+0x40/0x40
Mar 13 02:04:14 [kernel] [25092.961731]  drm_ioctl_kernel+0xa6/0xf0
Mar 13 02:04:14 [kernel] [25092.961733]  drm_ioctl+0x1fc/0x380
Mar 13 02:04:14 [kernel] [25092.961735]  ? drm_gem_handle_create+0x40/0x40
Mar 13 02:04:14 [kernel] [25092.961738]  ? tlb_finish_mmu+0x24/0x160
Mar 13 02:04:14 [kernel] [25092.961739]  ? unmap_region+0xd1/0x100
Mar 13 02:04:14 [kernel] [25092.961741]  amdgpu_drm_ioctl+0x44/0x80
Mar 13 02:04:14 [kernel] [25092.961744]  do_vfs_ioctl+0x449/0x6c0
Mar 13 02:04:14 [kernel] [25092.961745]  ? __do_munmap+0x27e/0x4a0
Mar 13 02:04:14 [kernel] [25092.961747]  ksys_ioctl+0x35/0x70
Mar 13 02:04:14 [kernel] [25092.961748]  __x64_sys_ioctl+0x11/0x20
Mar 13 02:04:14 [kernel] [25092.961750]  do_syscall_64+0x43/0x100
Mar 13 02:04:14 [kernel] [25092.961752]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 13 02:04:14 [kernel] [25092.961755] RIP: 0033:0x7f885c48e567
Mar 13 02:04:14 [kernel] [25092.961756] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 8d c8 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f9 e8 0c 00 f7 d8 64 89 01 48
Mar 13 02:04:14 [kernel] [25092.961757] RSP: 002b:00007ffedf620218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Mar 13 02:04:14 [kernel] [25092.961758] RAX: ffffffffffffffda RBX: 00007ffedf620260 RCX: 00007f885c48e567
Mar 13 02:04:14 [kernel] [25092.961759] RDX: 00007ffedf620260 RSI: 0000000040086409 RDI: 000000000000000f
Mar 13 02:04:14 [kernel] [25092.961760] RBP: 0000000040086409 R08: 0000000002e23012 R09: 0000000000000007
Mar 13 02:04:14 [kernel] [25092.961760] R10: 00007ffedf6204d0 R11: 0000000000000246 R12: 000000000319c970
Mar 13 02:04:14 [kernel] [25092.961761] R13: 000000000000000f R14: 0000000002f95cd0 R15: 0000000000000001
Mar 13 02:04:14 [kernel] [25092.961763] ---[ end trace a079ed0091d543ee ]---

I've encountered other poor folks with similar experiences, and given that something as simple as `clinfo` can cause it to detonate, I don't think it's a hardware issue.

I'd really like to get OpenCL working, but unfortunately this is the only option that produces any working result (ROCm just won't run period, Clover is uselessly out of date) and it's so risky as to be unusable for the job. - Ant P. (talk) 03:09, 13 March 2020 (UTC)

Update: apparently it's kernel 5.5 to blame and it's being fixed. Looks like I just picked an unlucky time to try it. - Ant P. (talk) 18:41, 24 March 2020 (UTC)
-----
Does this issue still occur 3 years later, or is the fix functional? If the issue is still valid, please open a Gentoo bug. Someone may fix it (or not), but at least the problem will be documented. A short note about this in the main article could be valuable information in case the problem still occurs.
--Admnd (talk) 13:47, 11 April 2023 (UTC)

Possible issue when installing a ROCm-based OpenCL system without PCIe atomics on some AMD GPUs

Talk status
This discussion is still ongoing.

I am opening this topic because ROCm itself requires PCIe with atomics for AMD GPUs from gfx803, as written at https://github.com/ROCm/ROCm.github.io/blob/master/hardware.md . From what I am testing on my own system (including clinfo showing 0 devices) and from forum messages, it seems this does indeed affect OpenCL.
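
For anyone who wants to check their own machine, here is a minimal sketch, not an official ROCm tool. It assumes pciutils' lspci is installed and that `lspci -vvv` prints an "AtomicOpsCap:" field in the DevCap2 section (usually only visible when run as root). The function name atomics_status is made up for illustration, and treating "32bit+ together with 64bit+" as "supported" is my own assumption about what counts as sufficient.

```shell
#!/bin/sh
# Sketch: decide whether lspci output advertises PCIe AtomicOps support.
# Assumption: the text on stdin comes from `lspci -vvv` and contains an
# "AtomicOpsCap:" line for devices that expose the capability.

# Reads lspci-style text on stdin and prints "yes", "no" or "unknown".
# If several devices are piped in, one supporting device is enough for "yes".
atomics_status() {
    awk '
        /AtomicOpsCap:/ {
            found = 1
            # A trailing "+" means the capability bit is set.
            if ($0 ~ /32bit[+]/ && $0 ~ /64bit[+]/) ok = 1
        }
        END {
            if (!found)  print "unknown"
            else if (ok) print "yes"
            else         print "no"
        }
    '
}
```

It could be used like `lspci -vvv -s 00:01.1 | atomics_status`, where 00:01.1 is only a placeholder for the address of the PCIe root port the GPU sits behind. "unknown" just means no AtomicOpsCap line was seen, which can also happen when lspci is run without enough privileges.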

The good news is that maybe rusticl from Mesa could support it.