From mboxrd@z Thu Jan 1 00:00:00 1970 From: bugzilla-daemon@freedesktop.org Subject: [Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung! Date: Tue, 21 Aug 2018 14:43:24 +0000 Message-ID: References: Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1360968901==" Return-path: Received: from culpepper.freedesktop.org (culpepper.freedesktop.org [131.252.210.165]) by gabe.freedesktop.org (Postfix) with ESMTP id A58776E1E5 for ; Tue, 21 Aug 2018 14:43:24 +0000 (UTC) In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" To: dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org --===============1360968901== Content-Type: multipart/alternative; boundary="15348626044.DA8A.2275" Content-Transfer-Encoding: 7bit --15348626044.DA8A.2275 Date: Tue, 21 Aug 2018 14:43:24 +0000 MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://bugs.freedesktop.org/ Auto-Submitted: auto-generated https://bugs.freedesktop.org/show_bug.cgi?id=3D102322 --- Comment #55 from Andrey Grodzovsky --- (In reply to dwagner from comment #54) > (In reply to Andrey Grodzovsky from comment #53) > > Created attachment 141198 [details] [review] [review] > > add_debug_info2.patch > >=20 > > Try this patch instead, i might be missing some prints in the first one. >=20 > Can try that this evening. >=20 > > In the last log you attached I haven't seen any UMR dumps or GPU fault > > prints in dmesg. THe GPU fault has to be in the log to compare the faul= ty > > address against the debug prints in the patch. >=20 > In above attached file "xz-compressed output of gpu_debug3.sh" there is u= mr > output at the time of the crash (238 seconds after the reboot): >=20 > ---------------------------------------------- > ... > mpv/vo-897 [005] .... 
235.191542: dma_fence_wait_start: > driver=3Ddrm_sched timeline=3Dgfx context=3D162 seqno=3D87 > mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal: > driver=3Ddrm_sched timeline=3Dgfx context=3D162 seqno=3D87 > kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled: > driver=3Damdgpu timeline=3Dsdma1 context=3D11 seqno=3D210 > kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled: > driver=3Damdgpu timeline=3Dsdma1 context=3D11 seqno=3D211 > [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 > timeout, signaled seq=3D32624, emitted seq=3D32626 > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin! >=20 > crash detected! >=20 > executing umr -O halt_waves -wa > No active waves! Did you use amdgpu.vm_fault_stop=3D2 parameter ? In case a fault happened t= hat should have froze GPUs compute units and hence the above command would prod= uce a lot of wave info. >=20 >=20 > executing umr -O verbose -R gfx[.] >=20 > polaris11.gfx.rptr =3D=3D 1792 > polaris11.gfx.wptr =3D=3D 1792 > polaris11.gfx.drv_wptr =3D=3D 1792 > polaris11.gfx.ring[1761] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1762] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1763] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1764] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1765] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1766] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1767] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1768] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1769] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1770] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1771] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1772] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1773] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1774] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1775] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1776] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1777] =3D=3D 0xffff1000 ...=20 > 
polaris11.gfx.ring[1778] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1779] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1780] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1781] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1782] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1783] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1784] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1785] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1786] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1787] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1788] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1789] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1790] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1791] =3D=3D 0xffff1000 ...=20 > polaris11.gfx.ring[1792] =3D=3D 0xc0032200 rwD=20 >=20 > trying to get ADR from dmesg output for 'umr -O verbose -vm ...' > trying to get VMID from dmesg output for 'umr -O verbose -vm ...' >=20 > done after crash, flashing NUMLOCK LED. > amdgpu_cs:0-799 [001] .... 286.852838: amdgpu_bo_list_set: > list=3D0000000099c16b5c, bo=3D000000001771c26f, bo_size=3D131072 > amdgpu_cs:0-799 [001] .... 286.852846: amdgpu_bo_list_set: > list=3D0000000099c16b5c, bo=3D0000000046bfd439, bo_size=3D131072 > ... > ---------------------------------------------- >=20 > But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages > this time. Sometimes such are emitted, sometimes not. --=20 You are receiving this mail because: You are the assignee for the bug.= --15348626044.DA8A.2275 Date: Tue, 21 Aug 2018 14:43:24 +0000 MIME-Version: 1.0 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://bugs.freedesktop.org/ Auto-Submitted: auto-generated

Comment #55 on bug 102322 from Andrey Grodzovsky
(In reply to dwagner from comment #54)
> (In reply to Andrey Grodzovsky from comment #53)
> > Created attachment 141198 [details] [review]
> > add_debug_info2.patch
> >
> > Try this patch instead, I might be missing some prints in the first one.
>
> Can try that this evening.
>
> > In the last log you attached I haven't seen any UMR dumps or GPU fault
> > prints in dmesg. The GPU fault has to be in the log to compare the faulty
> > address against the debug prints in the patch.
>
> In the above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> output at the time of the crash (238 seconds after the reboot):
>
> ----------------------------------------------
> ...
>           mpv/vo-897   [005] ....   235.191542: dma_fence_wait_start:
> driver=drm_sched timeline=gfx context=162 seqno=87
>           mpv/vo-897   [005] d...   235.191548: dma_fence_enable_signal:
> driver=drm_sched timeline=gfx context=162 seqno=87
>      kworker/0:2-92    [000] ....   238.275988: dma_fence_signaled:
> driver=amdgpu timeline=sdma1 context=11 seqno=210
>      kworker/0:2-92    [000] ....   238.276004: dma_fence_signaled:
> driver=amdgpu timeline=sdma1 context=11 seqno=211
> [  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=32624, emitted seq=32626
> [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
>
> crash detected!
>
> executing umr -O halt_waves -wa
> No active waves!

Did you use the amdgpu.vm_fault_stop=2 parameter? If a fault had happened, that
should have frozen the GPU's compute units, and the above command would then
produce a lot of wave info.
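
To rule this out before the next reproduction attempt, the value the loaded module is actually using can be checked. A minimal sketch, assuming the running kernel exports the amdgpu module parameters under /sys/module/amdgpu/parameters/ (the path and the helper name read_vm_fault_stop are assumptions, not something taken from the logs):

```python
from pathlib import Path
from typing import Optional

# Assumption: the amdgpu module exports its parameters under this sysfs
# directory on the running kernel; adjust the path if it does not.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/vm_fault_stop")


def read_vm_fault_stop(path: Path = PARAM_PATH) -> Optional[int]:
    """Return the current vm_fault_stop value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (module not loaded, parameter not exported) or not a number.
        return None


if __name__ == "__main__":
    value = read_vm_fault_stop()
    if value is None:
        print("vm_fault_stop not readable (module not loaded or parameter not exported?)")
    elif value == 2:
        print("vm_fault_stop=2 is set")
    else:
        print(f"vm_fault_stop={value}; boot with amdgpu.vm_fault_stop=2 before reproducing")
```

If this reports anything other than 2, an empty "umr -O halt_waves -wa" result says little about whether a fault occurred.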

>
>
> executing umr -O verbose -R gfx[.]
>
> polaris11.gfx.rptr == 1792
> polaris11.gfx.wptr == 1792
> polaris11.gfx.drv_wptr == 1792
> polaris11.gfx.ring[1761] == 0xffff1000    ...
> polaris11.gfx.ring[1762] == 0xffff1000    ...
> polaris11.gfx.ring[1763] == 0xffff1000    ...
> polaris11.gfx.ring[1764] == 0xffff1000    ...
> polaris11.gfx.ring[1765] == 0xffff1000    ...
> polaris11.gfx.ring[1766] == 0xffff1000    ...
> polaris11.gfx.ring[1767] == 0xffff1000    ...
> polaris11.gfx.ring[1768] == 0xffff1000    ...
> polaris11.gfx.ring[1769] == 0xffff1000    ...
> polaris11.gfx.ring[1770] == 0xffff1000    ...
> polaris11.gfx.ring[1771] == 0xffff1000    ...
> polaris11.gfx.ring[1772] == 0xffff1000    ...
> polaris11.gfx.ring[1773] == 0xffff1000    ...
> polaris11.gfx.ring[1774] == 0xffff1000    ...
> polaris11.gfx.ring[1775] == 0xffff1000    ...
> polaris11.gfx.ring[1776] == 0xffff1000    ...
> polaris11.gfx.ring[1777] == 0xffff1000    ...
> polaris11.gfx.ring[1778] == 0xffff1000    ...
> polaris11.gfx.ring[1779] == 0xffff1000    ...
> polaris11.gfx.ring[1780] == 0xffff1000    ...
> polaris11.gfx.ring[1781] == 0xffff1000    ...
> polaris11.gfx.ring[1782] == 0xffff1000    ...
> polaris11.gfx.ring[1783] == 0xffff1000    ...
> polaris11.gfx.ring[1784] == 0xffff1000    ...
> polaris11.gfx.ring[1785] == 0xffff1000    ...
> polaris11.gfx.ring[1786] == 0xffff1000    ...
> polaris11.gfx.ring[1787] == 0xffff1000    ...
> polaris11.gfx.ring[1788] == 0xffff1000    ...
> polaris11.gfx.ring[1789] == 0xffff1000    ...
> polaris11.gfx.ring[1790] == 0xffff1000    ...
> polaris11.gfx.ring[1791] == 0xffff1000    ...
> polaris11.gfx.ring[1792] == 0xc0032200    rwD
>
> trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
> trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
>
> done after crash, flashing NUMLOCK LED.
>      amdgpu_cs:0-799   [001] ....   286.852838: amdgpu_bo_list_set:
> list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
>      amdgpu_cs:0-799   [001] ....   286.852846: amdgpu_bo_list_set:
> list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
> ...
> ----------------------------------------------
>
> But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages
> this time. Sometimes such are emitted, sometimes not.
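
When comparing several such logs, it can help to pull the ring name and sequence numbers out of the timeout message automatically. A rough sketch, assuming the message format shown in the quoted dmesg excerpt above; RingTimeout and parse_ring_timeout are hypothetical names, not part of amdgpu or umr:

```python
import re
from typing import NamedTuple, Optional


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        # Fences emitted to the ring but not yet signaled at timeout.
        return self.emitted - self.signaled


# Assumption: the message keeps the wording seen in the quoted log.
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def parse_ring_timeout(text: str) -> Optional[RingTimeout]:
    """Find the first ring timeout message in (possibly mail-quoted) log text."""
    # Drop mail quote markers ("> ") and join lines so that a message
    # wrapped over two lines still matches.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    match = TIMEOUT_RE.search(flat)
    if match is None:
        return None
    return RingTimeout(match.group(1), int(match.group(2)), int(match.group(3)))


if __name__ == "__main__":
    sample = (
        "> [  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0\n"
        "> timeout, signaled seq=32624, emitted seq=32626\n"
    )
    info = parse_ring_timeout(sample)
    print(info.ring, info.pending)  # prints "sdma0 2"
```

For the excerpt above this reports ring sdma0 with two fences still outstanding when the timeout fired.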


You are receiving this mail because:
  • You are the assignee for the bug.
= --15348626044.DA8A.2275-- --===============1360968901== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KZHJpLWRldmVs IG1haWxpbmcgbGlzdApkcmktZGV2ZWxAbGlzdHMuZnJlZWRlc2t0b3Aub3JnCmh0dHBzOi8vbGlz dHMuZnJlZWRlc2t0b3Aub3JnL21haWxtYW4vbGlzdGluZm8vZHJpLWRldmVsCg== --===============1360968901==--