From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 109692] deadlock occurs during GPU reset
Date: Wed, 20 Feb 2019 17:30:54 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=109692

Bug ID: 109692
Summary: deadlock occurs during GPU reset
Product: DRI
Version: XOrg git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: mikhail.v.gavrilov@gmail.com

Created attachment 143419 [details]
dmesg

Steps to reproduce:
1. $ git clone git://people.freedesktop.org/~agd5f/linux -b
amd-staging-drm-next
2. $ make bzImage && make modules
3. # make modules_install && make install
4. Launch "Shadow of the Tomb Raider"
--- Here the GPU hang occurs ---
and after some time
--- Here the GPU reset starts ---
--- Here the deadlock occurs ---

[  291.746741] amdgpu 0000:0b:00.0: [gfxhub] no-retry page fault (src_id:0
ring:158 vmid:7 pasid:32774, for process SOTTR.exe pid 5250 thread SOTTR.exe
pid 5250)
[  291.746750] amdgpu 0000:0b:00.0:   in page starting at address
0x0000000000002000 from 27
[  291.746754] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070113C
[  297.135183] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[  302.255032] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[  302.265813] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=13292, emitted seq=13293
[  302.265950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SOTTR.exe pid 5250 thread SOTTR.exe pid 5250
[  302.265974] amdgpu 0000:0b:00.0: GPU reset begin!

[  302.266337] ======================================================
[  302.266338] WARNING: possible circular locking dependency detected
[  302.266340] 5.0.0-rc1-drm-next-kernel+ #1 Tainted: G         C
[  302.266341] ------------------------------------------------------
[  302.266343] kworker/5:2/871 is trying to acquire lock:
[  302.266345] 000000000abbb16a (&(&ring->fence_drv.lock)->rlock){-.-.}, at:
dma_fence_remove_callback+0x1a/0x60
[  302.266352]
               but task is already holding lock:
[  302.266353] 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at:
drm_sched_stop+0x34/0x140 [gpu_sched]
[  302.266358]
               which lock already depends on the new lock.

[  302.266360]
               the existing dependency chain (in reverse order) is:
[  302.266361]
               -> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
[  302.266366]        drm_sched_process_job+0x4d/0x180 [gpu_sched]
[  302.266368]        dma_fence_signal+0x111/0x1a0
[  302.266414]        amdgpu_fence_process+0xa3/0x100 [amdgpu]
[  302.266470]        sdma_v4_0_process_trap_irq+0x6e/0xa0 [amdgpu]
[  302.266523]        amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
[  302.266576]        amdgpu_ih_process+0x84/0xf0 [amdgpu]
[  302.266628]        amdgpu_irq_handler+0x1b/0x50 [amdgpu]
[  302.266632]        __handle_irq_event_percpu+0x3f/0x290
[  302.266635]        handle_irq_event_percpu+0x31/0x80
[  302.266637]        handle_irq_event+0x34/0x51
[  302.266639]        handle_edge_irq+0x7c/0x1a0
[  302.266643]        handle_irq+0xbf/0x100
[  302.266646]        do_IRQ+0x61/0x120
[  302.266648]        ret_from_intr+0x0/0x22
[  302.266651]        cpuidle_enter_state+0xbf/0x470
[  302.266654]        do_idle+0x1ec/0x280
[  302.266657]        cpu_startup_entry+0x19/0x20
[  302.266660]        start_secondary+0x1b3/0x200
[  302.266663]        secondary_startup_64+0xa4/0xb0
[  302.266664]
               -> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
[  302.266668]        _raw_spin_lock_irqsave+0x49/0x83
[  302.266670]        dma_fence_remove_callback+0x1a/0x60
[  302.266673]        drm_sched_stop+0x59/0x140 [gpu_sched]
[  302.266717]        amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
[  302.266761]        amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
[  302.266822]        amdgpu_job_timedout+0x109/0x130 [amdgpu]
[  302.266827]        drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[  302.266831]        process_one_work+0x272/0x5d0
[  302.266834]        worker_thread+0x50/0x3b0
[  302.266836]        kthread+0x108/0x140
[  302.266839]        ret_from_fork+0x27/0x50
[  302.266840]
               other info that might help us debug this:

[  302.266841]  Possible unsafe locking scenario:

[  302.266842]        CPU0                    CPU1
[  302.266843]        ----                    ----
[  302.266844]   lock(&(&sched->job_list_lock)->rlock);
[  302.266846]                                lock(&(&ring->fence_drv.lock)->rlock);
[  302.266847]                                lock(&(&sched->job_list_lock)->rlock);
[  302.266849]   lock(&(&ring->fence_drv.lock)->rlock);
[  302.266850]
                *** DEADLOCK ***

[  302.266852] 5 locks held by kworker/5:2/871:
[  302.266853]  #0: 00000000d133fb6e ((wq_completion)"events"){+.+.}, at:
process_one_work+0x1e9/0x5d0
[  302.266857]  #1: 000000008a5c3f7e
((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at:
process_one_work+0x1e9/0x5d0
[  302.266862]  #2: 00000000b9b2c76f (&adev->lock_reset){+.+.}, at:
amdgpu_device_lock_adev+0x17/0x40 [amdgpu]
[  302.266908]  #3: 00000000ac637728 (&dqm->lock_hidden){+.+.}, at:
kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
[  302.266965]  #4: 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.},
at: drm_sched_stop+0x34/0x140 [gpu_sched]
[  302.266971]
               stack backtrace:
[  302.266975] CPU: 5 PID: 871 Comm: kworker/5:2 Tainted: G         C        5.0.0-rc1-drm-next-kernel+ #1
[  302.266976] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
[  302.266980] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  302.266982] Call Trace:
[  302.266987]  dump_stack+0x85/0xc0
[  302.266991]  print_circular_bug.isra.0.cold+0x15c/0x195
[  302.266994]  __lock_acquire+0x134c/0x1660
[  302.266998]  ? add_lock_to_list.isra.0+0x67/0xb0
[  302.267003]  lock_acquire+0xa2/0x1b0
[  302.267006]  ? dma_fence_remove_callback+0x1a/0x60
[  302.267011]  _raw_spin_lock_irqsave+0x49/0x83
[  302.267013]  ? dma_fence_remove_callback+0x1a/0x60
[  302.267016]  dma_fence_remove_callback+0x1a/0x60
[  302.267020]  drm_sched_stop+0x59/0x140 [gpu_sched]
[  302.267065]  amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
[  302.267110]  amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
[  302.267173]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[  302.267178]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[  302.267183]  process_one_work+0x272/0x5d0
[  302.267188]  worker_thread+0x50/0x3b0
[  302.267191]  kthread+0x108/0x140
[  302.267194]  ? process_one_work+0x5d0/0x5d0
[  302.267196]  ? kthread_park+0x90/0x90
[  302.267199]  ret_from_fork+0x27/0x50
[  302.692194] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]]
*ERROR* ring kiq_2.1.0 test failed (-110)
[  302.692234] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  302.768931] amdgpu 0000:0b:00.0: GPU BACO reset
[  303.278874] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[  303.279006] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[  303.279072] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[  303.279234] [drm] PSP is resuming...
[  303.426601] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[  303.572227] [drm] UVD and UVD ENC initialized successfully.
[  303.687727] [drm] VCE initialized successfully.
[  303.689585] [drm] recover vram bo from shadow start
[  303.722757] [drm] recover vram bo from shadow done
[  303.722761] [drm] Skip scheduling IBs!
[  303.722791] amdgpu 0000:0b:00.0: GPU reset(2) succeeded!
[  303.722811] [drm] Skip scheduling IBs!
[  303.722838] [drm] Skip scheduling IBs!
[  303.722846] [drm] Skip scheduling IBs!
[  303.722854] [drm] Skip scheduling IBs!
[  303.722863] [drm] Skip scheduling IBs!
[  303.722871] [drm] Skip scheduling IBs!
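
For reference, the cycle lockdep reports above can be illustrated outside the kernel. The sketch below is a hypothetical userspace illustration only, not driver code: pthread mutexes stand in for sched->job_list_lock and ring->fence_drv.lock, and both thread bodies are invented. It mirrors the two paths lockdep names: the reset worker (drm_sched_stop) holds the job list lock and then needs the fence lock for dma_fence_remove_callback, while the interrupt path (amdgpu_fence_process -> dma_fence_signal -> drm_sched_process_job) holds the fence lock and then needs the job list lock.

/* abba_sketch.c -- hypothetical userspace illustration of the AB-BA lock
 * inversion reported by lockdep above.  Not driver code: pthread mutexes
 * stand in for the kernel spinlocks, and all names are invented.
 * Build: gcc -pthread abba_sketch.c -o abba_sketch  (it usually hangs,
 * which is the point).
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t job_list_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ sched->job_list_lock */
static pthread_mutex_t fence_lock    = PTHREAD_MUTEX_INITIALIZER; /* ~ ring->fence_drv.lock */

/* Reset/timeout path: drm_sched_stop() walks the job list under
 * job_list_lock and calls dma_fence_remove_callback(), which needs the
 * fence lock. */
static void *reset_worker(void *unused)
{
	pthread_mutex_lock(&job_list_lock);
	usleep(1000);                        /* widen the race window */
	pthread_mutex_lock(&fence_lock);     /* blocks while irq_path holds it */
	pthread_mutex_unlock(&fence_lock);
	pthread_mutex_unlock(&job_list_lock);
	return NULL;
}

/* Interrupt path: amdgpu_fence_process() signals fences under the fence
 * lock, and the drm_sched_process_job() callback then needs job_list_lock. */
static void *irq_path(void *unused)
{
	pthread_mutex_lock(&fence_lock);
	usleep(1000);
	pthread_mutex_lock(&job_list_lock);  /* blocks while reset_worker holds it */
	pthread_mutex_unlock(&job_list_lock);
	pthread_mutex_unlock(&fence_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, reset_worker, NULL);
	pthread_create(&b, NULL, irq_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("no deadlock on this run\n"); /* with the sleeps, rarely reached */
	return 0;
}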


You are receiving this mail because:
  • You are the assignee for the bug.