From mboxrd@z Thu Jan  1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 106500] GPU Recovery + DC deadlock
Date: Sun, 13 May 2018 11:38:43 +0000
Message-ID: <bug-106500-502@http.bugs.freedesktop.org/>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0257762917=="
Return-path: <dri-devel-bounces@lists.freedesktop.org>
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org
 [IPv6:2610:10:20:722:a800:ff:fe98:4b55])
 by gabe.freedesktop.org (Postfix) with ESMTP id 28F526E086
 for <dri-devel@lists.freedesktop.org>; Sun, 13 May 2018 11:38:43 +0000 (UTC)
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org


--===============0257762917==
Content-Type: multipart/alternative; boundary="15262115230.86A81fD5.24799"
Content-Transfer-Encoding: 7bit


--15262115230.86A81fD5.24799
Date: Sun, 13 May 2018 11:38:43 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated

https://bugs.freedesktop.org/show_bug.cgi?id=106500

            Bug ID: 106500
           Summary: GPU Recovery + DC deadlock
           Product: DRI
           Version: unspecified
          Hardware: Other
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/AMDgpu
          Assignee: dri-devel@lists.freedesktop.org
          Reporter: bas@basnieuwenhuizen.nl
                CC: andrey.grodzovsky@amd.com

If you try to reset a GPU using

cat /sys/kernel/debug/dri/2/amdgpu_gpu_recovery

while the GPU is hung, the kernel deadlocks if that GPU is also driving a
display.

I found two causes. If I hang the GPU with the libdrm tests, I get a deadlock in

https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c?h=amd-staging-drm-next&id=da603c1d0aac505485490f5e0ba495d4e292e7b9#n1876

It looks like we disable DC during the reset, but disabling it changes the
clocks, and before changing them we wait until the GPU is idle
(amdgpu_fence_wait_empty, called from amdgpu_pm_compute_clocks in the trace
below). A hung GPU is of course never going to be idle without intervention.
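
As a rough userspace model of why this blocks (not driver code: the `Fence`
class and the `compute_clocks` function below are made up for illustration,
and the timeout is only there to show the difference, not a proposed fix):

```python
import threading


class Fence:
    """Toy stand-in for a GPU fence: signaled when the work finishes."""

    def __init__(self):
        self._done = threading.Event()

    def signal(self):
        self._done.set()

    def wait(self, timeout=None):
        # True if the fence signaled, False if the wait timed out.
        return self._done.wait(timeout)


def compute_clocks(pending_fences, timeout=None):
    """Model of a clock change that first waits for the GPU to go idle.

    With timeout=None this blocks forever if any fence never signals,
    which is the situation on a hung GPU.
    """
    for fence in pending_fences:
        if not fence.wait(timeout):
            return "timed out waiting for idle"
    return "clocks updated"


healthy = Fence()
healthy.signal()   # work completed normally
hung = Fence()     # never signaled: the GPU is hung

print(compute_clocks([healthy]))            # clocks updated
print(compute_clocks([hung], timeout=0.1))  # timed out waiting for idle
# compute_clocks([hung]) with no timeout would never return.
```

The last case is the one the recovery path ends up in: nothing will signal
the fence until the reset completes, and the reset is waiting on the fence.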

Supporting trace:

[ 1842.823262] INFO: task cat:3635 blocked for more than 120 seconds.
[ 1842.823268]       Tainted: G        W        4.16.0-rc7-g36031c0dfb2d #6
[ 1842.823270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1842.823271] cat             D    0  3635   3630 0x00000000
[ 1842.823275] Call Trace:
[ 1842.823285]  ? __schedule+0x23c/0x870
[ 1842.823288]  schedule+0x2f/0x90
[ 1842.823291]  schedule_timeout+0x1fc/0x460
[ 1842.823296]  ? __alloc_pages_nodemask+0x10f/0xfd0
[ 1842.823300]  dma_fence_default_wait+0x1eb/0x280
[ 1842.823303]  ? dma_fence_default_wait+0x280/0x280
[ 1842.823306]  dma_fence_wait_timeout+0x38/0x110
[ 1842.823331]  amdgpu_fence_wait_empty+0x98/0xd0 [amdgpu]
[ 1842.823356]  ? dc_remove_plane_from_context+0x202/0x240 [amdgpu]
[ 1842.823378]  amdgpu_pm_compute_clocks.part.8+0x70/0x590 [amdgpu]
[ 1842.823409]  dm_pp_apply_display_requirements+0x159/0x160 [amdgpu]
[ 1842.823433]  pplib_apply_display_requirements+0x197/0x1c0 [amdgpu]
[ 1842.823457]  dc_commit_state+0x23b/0x560 [amdgpu]
[ 1842.823481]  ? dce112_validate_bandwidth+0x1bd/0x230 [amdgpu]
[ 1842.823506]  ? dce112_validate_bandwidth+0x1c9/0x230 [amdgpu]
[ 1842.823535]  amdgpu_dm_atomic_commit_tail+0x27a/0xc70 [amdgpu]
[ 1842.823540]  ? __wake_up_common_lock+0x89/0xc0
[ 1842.823542]  ? wait_for_common+0x151/0x180
[ 1842.823545]  ? wait_for_common+0x151/0x180
[ 1842.823551]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 1842.823557]  drm_atomic_helper_commit+0xfc/0x110 [drm_kms_helper]
[ 1842.823562]  drm_atomic_helper_disable_all+0x158/0x1b0 [drm_kms_helper]
[ 1842.823567]  drm_atomic_helper_suspend+0xd6/0x130 [drm_kms_helper]
[ 1842.823587]  amdgpu_device_gpu_recover+0x60f/0x8b0 [amdgpu]
[ 1842.823591]  ? __kmalloc_node+0x204/0x2b0
[ 1842.823611]  amdgpu_debugfs_gpu_recover+0x30/0x40 [amdgpu]
[ 1842.823615]  seq_read+0xee/0x480
[ 1842.823619]  full_proxy_read+0x53/0x80
[ 1842.823624]  __vfs_read+0x36/0x150
[ 1842.823627]  vfs_read+0x91/0x130
[ 1842.823630]  SyS_read+0x52/0xc0
[ 1842.823634]  do_syscall_64+0x67/0x120
[ 1842.823637]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1842.823640] RIP: 0033:0x7f987c073701
[ 1842.823642] RSP: 002b:00007ffc0deeac08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1842.823644] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f987c073701
[ 1842.823646] RDX: 0000000000020000 RSI: 00007f987c549000 RDI: 0000000000000003
[ 1842.823647] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
[ 1842.823649] R10: 0000000000000022 R11: 0000000000000246 R12: 00007f987c549000
[ 1842.823650] R13: 0000000000000003 R14: 00007f987c54900f R15: 0000000000020000
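
Frames prefixed with "?" in a trace like this are addresses the unwinder
found on the stack but could not confirm as part of the call chain, so the
blocking path is easier to read with them filtered out. A small sketch for
doing that (the regular expression and the `reliable_frames` helper are my
own, written against the line format shown above, and may need adjusting for
other kernels):

```python
import re

# Matches e.g. "[ 1842.823300]  dma_fence_default_wait+0x1eb/0x280"
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace):
    """Return function names of frames not marked '?', innermost first."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line)
        if m and not m.group(1):
            frames.append(m.group(2))
    return frames


sample = """\
[ 1842.823285]  ? __schedule+0x23c/0x870
[ 1842.823288]  schedule+0x2f/0x90
[ 1842.823291]  schedule_timeout+0x1fc/0x460
[ 1842.823296]  ? __alloc_pages_nodemask+0x10f/0xfd0
[ 1842.823300]  dma_fence_default_wait+0x1eb/0x280
[ 1842.823331]  amdgpu_fence_wait_empty+0x98/0xd0 [amdgpu]
[ 1842.823378]  amdgpu_pm_compute_clocks.part.8+0x70/0x590 [amdgpu]
"""

for name in reliable_frames(sample):
    print(name)
```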

I managed to "fix" this by commenting out that code. Now a hang caused by a
libdrm test recovers successfully, though the display (even on the non-X
terminals) is garbled.

However, I recently had a game hang, and trying to recover from that still
gave a deadlock:

[127426.165215] INFO: task kworker/u256:0:77605 blocked for more than 120 seconds.
[127426.165221]       Tainted: G        W        4.16.0-rc7-gffd4abe7dbf9 #7
[127426.165222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[127426.165224] kworker/u256:0  D    0 77605      2 0x80000000
[127426.165236] Workqueue: events_unbound commit_work [drm_kms_helper]
[127426.165239] Call Trace:
[127426.165248]  ? __schedule+0x23c/0x870
[127426.165251]  schedule+0x2f/0x90
[127426.165254]  schedule_timeout+0x1fc/0x460
[127426.165283]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[127426.165308]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[127426.165312]  dma_fence_default_wait+0x1eb/0x280
[127426.165315]  ? dma_fence_default_wait+0x280/0x280
[127426.165317]  dma_fence_wait_timeout+0x38/0x110
[127426.165320]  reservation_object_wait_timeout_rcu+0x187/0x360
[127426.165350]  amdgpu_dm_do_flip+0x109/0x350 [amdgpu]
[127426.165382]  amdgpu_dm_atomic_commit_tail+0xa7c/0xc70 [amdgpu]
[127426.165386]  ? wait_for_common+0x151/0x180
[127426.165390]  ? pick_next_task_fair+0x48c/0x5a0
[127426.165393]  ? __switch_to+0x199/0x460
[127426.165399]  commit_tail+0x3d/0x70 [drm_kms_helper]
[127426.165403]  process_one_work+0x1ce/0x3f0
[127426.165405]  worker_thread+0x2b/0x3d0
[127426.165408]  ? process_one_work+0x3f0/0x3f0
[127426.165410]  kthread+0x113/0x130
[127426.165413]  ? kthread_create_on_node+0x70/0x70
[127426.165416]  ret_from_fork+0x22/0x40

which seems to be here:

https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=amd-staging-drm-next&id=da603c1d0aac505485490f5e0ba495d4e292e7b9#n3973

This is before the GPU reset itself happens, so either we somehow use a BO in
the disabled state, or this is an earlier flip.

Either way, that wait is never going to finish on a hung GPU.
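
To illustrate what "intervention" could look like for a waiter like this one,
here is a toy userspace model in which recovery fails every outstanding fence
before doing anything else, so a blocked flip wakes up instead of sleeping
forever. This is only a sketch of the idea: `Fence`, `do_flip` and `recover`
are hypothetical names, not driver functions, and I am not claiming this is
how the driver does or should handle it.

```python
import threading


class Fence:
    """Toy fence that can be signaled normally or completed with an error."""

    def __init__(self):
        self._done = threading.Event()
        self.error = None

    def signal(self, error=None):
        # Record the error before waking waiters so they always observe it.
        self.error = error
        self._done.set()

    def wait(self):
        self._done.wait()


def do_flip(fence, result):
    """Model of a page flip that waits on the buffer's fence first."""
    fence.wait()  # unbounded, like the wait in the trace above
    result.append("flip skipped" if fence.error else "flip done")


def recover(pending_fences):
    """Hypothetical recovery step: force-complete everything outstanding."""
    for fence in pending_fences:
        fence.signal(error="gpu reset")


fence = Fence()    # never signaled by the (hung) GPU
result = []
worker = threading.Thread(target=do_flip, args=(fence, result))
worker.start()

recover([fence])   # without this call the worker would block forever
worker.join(timeout=5)
print(result)      # ['flip skipped']
```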


--15262115230.86A81fD5.24799--

--===============0257762917==--