From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 106500] GPU Recovery + DC deadlock
Date: Sun, 13 May 2018 11:38:43 +0000
Message-ID: <bug-106500-502@http.bugs.freedesktop.org/>
https://bugs.freedesktop.org/show_bug.cgi?id=106500
Bug ID: 106500
Summary: GPU Recovery + DC deadlock
Product: DRI
Version: unspecified
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: bas@basnieuwenhuizen.nl
CC: andrey.grodzovsky@amd.com
If you try to reset a hung GPU using
cat /sys/kernel/debug/dri/2/amdgpu_gpu_recovery
the kernel deadlocks if that GPU is also driving a display.
I found two causes. The first: if I hang the GPU with the libdrm tests, I get a
deadlock in
https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c?h=amd-staging-drm-next&id=da603c1d0aac505485490f5e0ba495d4e292e7b9#n1876
It looks like we disable DC during the reset, but as part of disabling it we
change the clocks, and for that we wait until the GPU is idle. A hung GPU is of
course not going to become idle without intervention.
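To make the first failure mode concrete, here is a minimal user-space model of
it in Python. It is only a sketch: Fence and compute_clocks are hypothetical
names chosen for illustration and are not kernel or amdgpu APIs. It assumes the
clock change waits on one fence per ring, and shows that an unbounded wait on a
ring that never signals cannot return, while a bounded wait at least lets the
caller notice and carry on with the reset.

```python
import threading


class Fence:
    """User-space stand-in for a GPU fence (hypothetical, not a kernel API)."""

    def __init__(self):
        self._done = threading.Event()

    def signal(self):
        self._done.set()

    def wait(self, timeout=None):
        # Returns True if the fence signalled, False if the timeout expired.
        # With timeout=None this blocks until signalled, i.e. forever on a hang.
        return self._done.wait(timeout)


def compute_clocks(fences, timeout=None):
    """Model of 'wait for every ring to drain before changing clocks'."""
    for fence in fences:
        if not fence.wait(timeout):
            return False  # this ring never drained; do not block on it
    return True


healthy = Fence()
healthy.signal()      # work on this ring completed normally
hung = Fence()        # never signalled: models the hung ring

# A bounded wait reports the hang instead of blocking the recovery path.
drained = compute_clocks([healthy, hung], timeout=0.05)
print("rings drained:", drained)
```

Calling compute_clocks([healthy, hung]) with no timeout would block
indefinitely in this model, which is the analogue of the hung task in the
trace that follows.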
Supporting trace:
[ 1842.823262] INFO: task cat:3635 blocked for more than 120 seconds.
[ 1842.823268] Tainted: G W 4.16.0-rc7-g36031c0dfb2d #6
[ 1842.823270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1842.823271] cat D 0 3635 3630 0x00000000
[ 1842.823275] Call Trace:
[ 1842.823285] ? __schedule+0x23c/0x870
[ 1842.823288] schedule+0x2f/0x90
[ 1842.823291] schedule_timeout+0x1fc/0x460
[ 1842.823296] ? __alloc_pages_nodemask+0x10f/0xfd0
[ 1842.823300] dma_fence_default_wait+0x1eb/0x280
[ 1842.823303] ? dma_fence_default_wait+0x280/0x280
[ 1842.823306] dma_fence_wait_timeout+0x38/0x110
[ 1842.823331] amdgpu_fence_wait_empty+0x98/0xd0 [amdgpu]
[ 1842.823356] ? dc_remove_plane_from_context+0x202/0x240 [amdgpu]
[ 1842.823378] amdgpu_pm_compute_clocks.part.8+0x70/0x590 [amdgpu]
[ 1842.823409] dm_pp_apply_display_requirements+0x159/0x160 [amdgpu]
[ 1842.823433] pplib_apply_display_requirements+0x197/0x1c0 [amdgpu]
[ 1842.823457] dc_commit_state+0x23b/0x560 [amdgpu]
[ 1842.823481] ? dce112_validate_bandwidth+0x1bd/0x230 [amdgpu]
[ 1842.823506] ? dce112_validate_bandwidth+0x1c9/0x230 [amdgpu]
[ 1842.823535] amdgpu_dm_atomic_commit_tail+0x27a/0xc70 [amdgpu]
[ 1842.823540] ? __wake_up_common_lock+0x89/0xc0
[ 1842.823542] ? wait_for_common+0x151/0x180
[ 1842.823545] ? wait_for_common+0x151/0x180
[ 1842.823551] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 1842.823557] drm_atomic_helper_commit+0xfc/0x110 [drm_kms_helper]
[ 1842.823562] drm_atomic_helper_disable_all+0x158/0x1b0 [drm_kms_helper]
[ 1842.823567] drm_atomic_helper_suspend+0xd6/0x130 [drm_kms_helper]
[ 1842.823587] amdgpu_device_gpu_recover+0x60f/0x8b0 [amdgpu]
[ 1842.823591] ? __kmalloc_node+0x204/0x2b0
[ 1842.823611] amdgpu_debugfs_gpu_recover+0x30/0x40 [amdgpu]
[ 1842.823615] seq_read+0xee/0x480
[ 1842.823619] full_proxy_read+0x53/0x80
[ 1842.823624] __vfs_read+0x36/0x150
[ 1842.823627] vfs_read+0x91/0x130
[ 1842.823630] SyS_read+0x52/0xc0
[ 1842.823634] do_syscall_64+0x67/0x120
[ 1842.823637] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1842.823640] RIP: 0033:0x7f987c073701
[ 1842.823642] RSP: 002b:00007ffc0deeac08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1842.823644] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f987c073701
[ 1842.823646] RDX: 0000000000020000 RSI: 00007f987c549000 RDI: 0000000000000003
[ 1842.823647] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
[ 1842.823649] R10: 0000000000000022 R11: 0000000000000246 R12: 00007f987c549000
[ 1842.823650] R13: 0000000000000003 R14: 00007f987c54900f R15: 0000000000020000
I managed to "fix" this by commenting out that code. With that change, a hang
caused by a libdrm test recovers successfully, though the display (even on the
non-X terminals) is garbled.
The second cause: I recently had a game hang, and trying to recover from that
still gave a deadlock:
[127426.165215] INFO: task kworker/u256:0:77605 blocked for more than 120 seconds.
[127426.165221] Tainted: G W 4.16.0-rc7-gffd4abe7dbf9 #7
[127426.165222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[127426.165224] kworker/u256:0 D 0 77605 2 0x80000000
[127426.165236] Workqueue: events_unbound commit_work [drm_kms_helper]
[127426.165239] Call Trace:
[127426.165248] ? __schedule+0x23c/0x870
[127426.165251] schedule+0x2f/0x90
[127426.165254] schedule_timeout+0x1fc/0x460
[127426.165283] ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[127426.165308] ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[127426.165312] dma_fence_default_wait+0x1eb/0x280
[127426.165315] ? dma_fence_default_wait+0x280/0x280
[127426.165317] dma_fence_wait_timeout+0x38/0x110
[127426.165320] reservation_object_wait_timeout_rcu+0x187/0x360
[127426.165350] amdgpu_dm_do_flip+0x109/0x350 [amdgpu]
[127426.165382] amdgpu_dm_atomic_commit_tail+0xa7c/0xc70 [amdgpu]
[127426.165386] ? wait_for_common+0x151/0x180
[127426.165390] ? pick_next_task_fair+0x48c/0x5a0
[127426.165393] ? __switch_to+0x199/0x460
[127426.165399] commit_tail+0x3d/0x70 [drm_kms_helper]
[127426.165403] process_one_work+0x1ce/0x3f0
[127426.165405] worker_thread+0x2b/0x3d0
[127426.165408] ? process_one_work+0x3f0/0x3f0
[127426.165410] kthread+0x113/0x130
[127426.165413] ? kthread_create_on_node+0x70/0x70
[127426.165416] ret_from_fork+0x22/0x40
which seems to be here:
https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=amd-staging-drm-next&id=da603c1d0aac505485490f5e0ba495d4e292e7b9#n3973
This wait happens before the GPU reset itself, so either we somehow use a BO in
the disabled state or this is an earlier flip.
Either way, that wait is not going to finish while the GPU is hung.
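The ordering problem in this second trace can also be modelled in user space.
The Python sketch below is an illustration under stated assumptions, not a
proposed driver change: Fence, flip_worker and recover are hypothetical names,
and signalling the outstanding fence with an error before waiting for the
commit worker is only one conceivable way to unblock it.

```python
import threading


class Fence:
    """User-space stand-in for a buffer's fence (hypothetical, not a kernel API)."""

    def __init__(self):
        self._done = threading.Event()
        self.error = None

    def signal(self, error=None):
        self.error = error
        self._done.set()

    def wait(self, timeout=None):
        return self._done.wait(timeout)


def flip_worker(bo_fence, result):
    # Models the commit worker: it waits on the buffer's fence before flipping.
    bo_fence.wait()
    result.append("skipped" if bo_fence.error else "flipped")


def recover(pending_fences, worker):
    # Assumed ordering: complete outstanding fences with an error first,
    # and only then wait for the display commit to finish.
    for fence in pending_fences:
        fence.signal(error="reset")
    worker.join(timeout=1.0)
    return not worker.is_alive()


bo_fence = Fence()    # never signalled by the (hung) GPU
result = []
worker = threading.Thread(target=flip_worker, args=(bo_fence, result), daemon=True)
worker.start()

recovered = recover([bo_fence], worker)
print("recovered:", recovered, "flip:", result)
```

If recover() waited for the worker before signalling the fence, the worker
would stay blocked for as long as the GPU is hung, which matches the second
trace.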
--
You are receiving this mail because:
You are the assignee for the bug.