From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 109978] Unprivileged user mode program can cause GPU reset
Date: Tue, 12 Mar 2019 13:56:07 +0000
List-Id: dri-devel@lists.freedesktop.org

https://bugs.freedesktop.org/show_bug.cgi?id=109978

Bug ID: 109978
Summary: Unprivileged user mode program can cause GPU reset
Product: DRI
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: major
Priority: medium
Component: DRM/amdkfd
Assignee: dri-devel@lists.freedesktop.org
Reporter: sudolskym@gmail.com

https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/72

Sample program which causes this (needs ROCm):

> #include <hc.hpp>
> int main()
> {
> 	parallel_for_each(hc::extent<1>(1), [=]() [[hc]]
> 	{
> 		asm("s_trap 2");
> 	});
> 	return 0;
> }

> hcc -hc main.cpp
> ./a.out

The process never terminates, and pressing Ctrl-C triggers a GPU reset that
breaks every other process using ROCm on that GPU. The trap handler seems to
expect a queue handle in s[0:1], which is set when __builtin_trap() is used;
without it, the trap handler raises further exceptions.

System logs:

[  247.428727] qcm fence wait loop timeout expired
[  247.428730] The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  247.428736] amdgpu 0000:0b:00.0: GPU reset begin!
[  247.619440] amdgpu 0000:0b:00.0: GPU reset
[  248.152762] [drm] psp mode1 reset succeed
[  248.279461] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[  248.279584] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[  248.279639] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[  248.279769] [drm] PSP is resuming...
[  248.428305] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[  248.472774] WARNING: CPU: 23 PID: 21634 at /build/linux-uQJ2um/linux-4.15.0/kernel/kthread.c:498 kthread_park+0x67/0x80
[  248.472775] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs
msr nls_utf8 cifs ccm fscache cmac bnep binfmt_misc nls_iso8859_1 edac_mce_amd
arc4 snd_hda_codec_realtek snd_hda_codec_generic kvm_amd snd_hda_codec_hdmi kvm
snd_seq_midi irqbypass snd_hda_intel snd_seq_midi_event snd_hda_codec btusb
snd_hda_core btrtl wmi_bmof snd_rawmidi iwlmvm snd_hwdep btbcm btintel snd_pcm
snd_seq bluetooth mac80211 snd_seq_device ecdh_generic snd_timer iwlwifi ccp
snd cfg80211 soundcore k10temp shpchp mac_hid sch_fq_codel ib_iser rdma_cm
iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
nct6775 hwmon_vid parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs
zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq libcrc32c raid1
[  248.472823]  multipath linear raid0 amdgpu(OE) amdchash(OE) amdttm(OE)
amd_sched(OE) mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc
aesni_intel aes_x86_64 amdkcl(OE) crypto_simd glue_helper amd_iommu_v2 cryptd
drm_kms_helper syscopyarea sysfillrect sysimgblt igb fb_sys_fops drm dca nvme
i2c_algo_bit i2c_piix4 nvme_core ptp ahci atlantic libahci pps_core gpio_amdpt
wmi gpio_generic
[  248.472846] CPU: 23 PID: 21634 Comm: a.out Tainted: G           OE    4.15.0-45-generic #48-Ubuntu
[  248.472847] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[  248.472849] RIP: 0010:kthread_park+0x67/0x80
[  248.472850] RSP: 0018:ffffb44fc7e27ad0 EFLAGS: 00010202
[  248.472852] RAX: 0000000000000004 RBX: ffff9ec63f49e480 RCX: 0000000000000000
[  248.472853] RDX: ffff9ec63c717198 RSI: ffff9ec63ea0c0c0 RDI: ffff9ec63dd38000
[  248.472854] RBP: ffffb44fc7e27ae0 R08: 0000000000000051 R09: 0000000000000000
[  248.472855] R10: 0000000000000000 R11: 0000000000000056 R12: ffff9ec63ea0c0c0
[  248.472855] R13: ffff9ec64f4f4200 R14: ffff9ec63c710000 R15: 0000000000000000
[  248.472857] FS:  00007fd52a286c00(0000) GS:ffff9ec65cdc0000(0000) knlGS:0000000000000000
[  248.472858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  248.472859] CR2: 00007f0c07687a98 CR3: 000000081b5b6000 CR4: 00000000003406e0
[  248.472860] Call Trace:
[  248.472865]  amddrm_sched_entity_fini+0x44/0x1b0 [amd_sched]
[  248.472868]  amddrm_sched_entity_destroy+0x1f/0x30 [amd_sched]
[  248.472907]  amdgpu_vm_fini+0xbb/0x4f0 [amdgpu]
[  248.472942]  amdgpu_driver_postclose_kms+0x15b/0x2b0 [amdgpu]
[  248.472952]  drm_release+0x26b/0x390 [drm]
[  248.472955]  __fput+0xea/0x220
[  248.472957]  ____fput+0xe/0x10
[  248.472959]  task_work_run+0x9d/0xc0
[  248.472961]  do_exit+0x2ec/0xb40
[  248.472963]  do_group_exit+0x43/0xb0
[  248.472965]  get_signal+0x27b/0x590
[  248.472968]  do_signal+0x37/0x730
[  248.472971]  ? __switch_to_asm+0x34/0x70
[  248.472973]  ? __switch_to_asm+0x40/0x70
[  248.472976]  ? do_vfs_ioctl+0xa8/0x630
[  248.472978]  ? __schedule+0x299/0x8a0
[  248.472980]  exit_to_usermode_loop+0x73/0xd0
[  248.472982]  do_syscall_64+0x115/0x130
[  248.472984]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  248.472986] RIP: 0033:0x7fd528bdd5d7
[  248.472987] RSP: 002b:00007ffe830d4778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  248.472988] RAX: fffffffffffffffc RBX: 0000000000000001 RCX: 00007fd528bdd5d7
[  248.472989] RDX: 00007ffe830d47d0 RSI: 00000000c0184b0c RDI: 0000000000000003
[  248.472990] RBP: 00007ffe830d47d0 R08: 00007ffe830d4890 R09: 0000000000000001
[  248.472990] R10: 0000000000c92010 R11: 0000000000000246 R12: 00000000c0184b0c
[  248.472991] R13: 0000000000000003 R14: 0000000000000000 R15: 00000000fffffffe
[  248.472992] Code: 0e e8 6e c0 00 00 48 8d 7b 18 e8 35 d2 8e 00 44 89 e0 5b 41 5c 5d c3 0f 0b 41 bc da ff ff ff 44 89 e0 5b 41 5c 5d c3 0f 0b eb af <0f> 0b 41 bc f0 ff ff ff eb da 0f 1f 44 00 00 66 2e 0f 1f 84 00
[  248.473020] ---[ end trace 19649ddd4a6314f7 ]---
[  248.648453] [drm] UVD and UVD ENC initialized successfully.
[  248.748509] [drm] VCE initialized successfully.
[  248.749616] [drm] recover vram bo from shadow start
[  248.749666] [drm] recover vram bo from shadow done
[  248.749680] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!

