From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!
Date: Tue, 21 Aug 2018 08:41:52 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=102322

Comment #54 on bug 102322 from dwagner
(In reply to Andrey Grodzovsky from comment #53)
> Created attachment 141198 [details] [review]
> add_debug_info2.patch
>
> Try this patch instead, i might be missing some prints in the first one.

Can try that this evening.

> In the last log you attached I haven't seen any UMR dumps or GPU fault
> prints in dmesg. The GPU fault has to be in the log to compare the faulty
> address against the debug prints in the patch.

In the file attached above, "xz-compressed output of gpu_debug3.sh", there is
umr output from the time of the crash (238 seconds after the reboot):

----------------------------------------------
...
          mpv/vo-897   [005] ....   235.191542: dma_fence_wait_start:
driver=drm_sched timeline=gfx context=162 seqno=87
          mpv/vo-897   [005] d...   235.191548: dma_fence_enable_signal:
driver=drm_sched timeline=gfx context=162 seqno=87
     kworker/0:2-92    [000] ....   238.275988: dma_fence_signaled:
driver=amdgpu timeline=sdma1 context=11 seqno=210
     kworker/0:2-92    [000] ....   238.276004: dma_fence_signaled:
driver=amdgpu timeline=sdma1 context=11 seqno=211
[  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
signaled seq=32624, emitted seq=32626
[  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
[  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!

crash detected!

executing umr -O halt_waves -wa
No active waves!


executing umr -O verbose -R gfx[.]

polaris11.gfx.rptr == 1792
polaris11.gfx.wptr == 1792
polaris11.gfx.drv_wptr == 1792
polaris11.gfx.ring[1761] == 0xffff1000    ...
polaris11.gfx.ring[1762] == 0xffff1000    ...
polaris11.gfx.ring[1763] == 0xffff1000    ...
polaris11.gfx.ring[1764] == 0xffff1000    ...
polaris11.gfx.ring[1765] == 0xffff1000    ...
polaris11.gfx.ring[1766] == 0xffff1000    ...
polaris11.gfx.ring[1767] == 0xffff1000    ...
polaris11.gfx.ring[1768] == 0xffff1000    ...
polaris11.gfx.ring[1769] == 0xffff1000    ...
polaris11.gfx.ring[1770] == 0xffff1000    ...
polaris11.gfx.ring[1771] == 0xffff1000    ...
polaris11.gfx.ring[1772] == 0xffff1000    ...
polaris11.gfx.ring[1773] == 0xffff1000    ...
polaris11.gfx.ring[1774] == 0xffff1000    ...
polaris11.gfx.ring[1775] == 0xffff1000    ...
polaris11.gfx.ring[1776] == 0xffff1000    ...
polaris11.gfx.ring[1777] == 0xffff1000    ...
polaris11.gfx.ring[1778] == 0xffff1000    ...
polaris11.gfx.ring[1779] == 0xffff1000    ...
polaris11.gfx.ring[1780] == 0xffff1000    ...
polaris11.gfx.ring[1781] == 0xffff1000    ...
polaris11.gfx.ring[1782] == 0xffff1000    ...
polaris11.gfx.ring[1783] == 0xffff1000    ...
polaris11.gfx.ring[1784] == 0xffff1000    ...
polaris11.gfx.ring[1785] == 0xffff1000    ...
polaris11.gfx.ring[1786] == 0xffff1000    ...
polaris11.gfx.ring[1787] == 0xffff1000    ...
polaris11.gfx.ring[1788] == 0xffff1000    ...
polaris11.gfx.ring[1789] == 0xffff1000    ...
polaris11.gfx.ring[1790] == 0xffff1000    ...
polaris11.gfx.ring[1791] == 0xffff1000    ...
polaris11.gfx.ring[1792] == 0xc0032200    rwD

trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'

done after crash, flashing NUMLOCK LED.
     amdgpu_cs:0-799   [001] ....   286.852838: amdgpu_bo_list_set:
list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
     amdgpu_cs:0-799   [001] ....   286.852846: amdgpu_bo_list_set:
list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
...
----------------------------------------------
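
For reading the umr dump above, here is a minimal Python sketch (not part of gpu_debug3.sh; the function names are hypothetical). It pulls rptr/wptr out of the dump text and splits ring words into header fields, assuming the usual PM4 header layout (type in bits 31:30, count in bits 29:16, opcode in bits 15:8). Mapping opcode numbers to packet names is left out on purpose.

```python
import re

# Matches umr lines such as "polaris11.gfx.rptr == 1792"
# or "polaris11.gfx.ring[1792] == 0xc0032200    rwD".
LINE_RE = re.compile(
    r"^\S+?\.gfx\.(rptr|wptr|drv_wptr|ring\[(\d+)\])\s*==\s*(\S+)"
)


def parse_ring_dump(text):
    """Return (pointers, words) parsed from 'umr -R gfx[.]' style output."""
    pointers, words = {}, {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        name, index, value = m.groups()
        if index is not None:
            # Ring entries are printed as hex words.
            words[int(index)] = int(value, 16)
        else:
            # Pointer values are printed as decimal indices.
            pointers[name] = int(value)
    return pointers, words


def split_pm4_header(word):
    """Split a 32-bit word into header fields (assumed PM4 layout)."""
    return {
        "type": (word >> 30) & 0x3,
        "count": (word >> 16) & 0x3FFF,
        "opcode": (word >> 8) & 0xFF,
    }


sample = """\
polaris11.gfx.rptr == 1792
polaris11.gfx.wptr == 1792
polaris11.gfx.drv_wptr == 1792
polaris11.gfx.ring[1791] == 0xffff1000    ...
polaris11.gfx.ring[1792] == 0xc0032200    rwD
"""

pointers, words = parse_ring_dump(sample)
ring_idle = pointers["rptr"] == pointers["wptr"]
print("rptr == wptr:", ring_idle)
for idx in sorted(words):
    print(idx, hex(words[idx]), split_pm4_header(words[idx]))
```

In this dump rptr equals wptr, which suggests the gfx ring had nothing left to fetch at that moment; the timeout itself was reported on sdma0.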

But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages this
time. Sometimes they are emitted, sometimes not.
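
Since the fault message only shows up in some runs, a capture can be classified before anyone tries to compare addresses. Below is a minimal Python sketch, assuming the dmesg text is available as a string. The timeout pattern is modelled on the amdgpu_job_timedout line quoted above, the fault check only looks for the register name, and `summarize_dmesg` is a hypothetical helper name.

```python
import re

# Pattern modelled on the amdgpu_job_timedout line quoted in this comment.
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout,\s*signaled seq=(\d+), emitted seq=(\d+)"
)
FAULT_MARKER = "VM_CONTEXT1_PROTECTION_FAULT_ADDR"


def summarize_dmesg(text):
    """Collect ring timeouts and note whether any VM fault line is present."""
    timeouts = []
    for m in TIMEOUT_RE.finditer(text):
        signaled, emitted = int(m.group(2)), int(m.group(3))
        timeouts.append({
            "ring": m.group(1),
            "signaled": signaled,
            "emitted": emitted,
            # Fences emitted but not yet signaled when the timeout fired.
            "pending": emitted - signaled,
        })
    return {"timeouts": timeouts, "has_vm_fault": FAULT_MARKER in text}


sample = (
    "[  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
    "timeout, signaled seq=32624, emitted seq=32626\n"
    "[  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!\n"
)
summary = summarize_dmesg(sample)
print(summary)
```

A capture with a timeout but `has_vm_fault` false, like this one, has no faulty address to match against the debug prints.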

