From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!"
/ [drm] IP block:sdma_v3_0 is hung!
Date: Tue, 21 Aug 2018 14:43:24 +0000
Message-ID:
References:
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1360968901=="
Return-path:
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org
[131.252.210.165])
by gabe.freedesktop.org (Postfix) with ESMTP id A58776E1E5
for ; Tue, 21 Aug 2018 14:43:24 +0000 (UTC)
In-Reply-To:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org
--===============1360968901==
Content-Type: multipart/alternative; boundary="15348626044.DA8A.2275"
Content-Transfer-Encoding: 7bit
--15348626044.DA8A.2275
Date: Tue, 21 Aug 2018 14:43:24 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated
https://bugs.freedesktop.org/show_bug.cgi?id=102322
--- Comment #55 from Andrey Grodzovsky ---
(In reply to dwagner from comment #54)
> (In reply to Andrey Grodzovsky from comment #53)
> > Created attachment 141198 [details] [review]
> > add_debug_info2.patch
> >
> > Try this patch instead, I might be missing some prints in the first one.
>
> Can try that this evening.
>
> > In the last log you attached I haven't seen any UMR dumps or GPU fault
> > prints in dmesg. The GPU fault has to be in the log to compare the faulty
> > address against the debug prints in the patch.
>
> In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> output at the time of the crash (238 seconds after the reboot):
>
> ----------------------------------------------
> ...
> mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start:
> driver=drm_sched timeline=gfx context=162 seqno=87
> mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal:
> driver=drm_sched timeline=gfx context=162 seqno=87
> kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled:
> driver=amdgpu timeline=sdma1 context=11 seqno=210
> kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled:
> driver=amdgpu timeline=sdma1 context=11 seqno=211
> [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=32624, emitted seq=32626
> [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
>
> crash detected!
>
> executing umr -O halt_waves -wa
> No active waves!
Did you use the amdgpu.vm_fault_stop=2 parameter? If a fault happened, that
should have frozen the GPU's compute units, and hence the above command would
produce a lot of wave info.
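A minimal sketch for checking which value the running kernel actually picked up, assuming the parameter is exported read-only under /sys/module/amdgpu/parameters/ (the path constant and the helper names read_vm_fault_stop and describe are illustrative assumptions, not something taken from the logs above):

```python
from pathlib import Path

# Assumed sysfs location of the module parameter; adjust if your kernel differs.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/vm_fault_stop")


def read_vm_fault_stop(path=PARAM_PATH):
    """Return the current vm_fault_stop value as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (amdgpu not loaded) or content is not an integer.
        return None


def describe(value):
    """Summarize whether wave state is expected to survive a VM fault."""
    if value is None:
        return "vm_fault_stop not readable (amdgpu not loaded?)"
    if value == 2:
        return "vm_fault_stop=2: GPU should stay halted after a VM fault"
    return f"vm_fault_stop={value}: boot with amdgpu.vm_fault_stop=2 before running umr"


if __name__ == "__main__":
    print(describe(read_vm_fault_stop()))
```

If this reports anything other than 2, the "No active waves!" result above would be expected even when a fault did occur.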
>
>
> executing umr -O verbose -R gfx[.]
>
> polaris11.gfx.rptr == 1792
> polaris11.gfx.wptr == 1792
> polaris11.gfx.drv_wptr == 1792
> polaris11.gfx.ring[1761] == 0xffff1000 ...
> polaris11.gfx.ring[1762] == 0xffff1000 ...
> polaris11.gfx.ring[1763] == 0xffff1000 ...
> polaris11.gfx.ring[1764] == 0xffff1000 ...
> polaris11.gfx.ring[1765] == 0xffff1000 ...
> polaris11.gfx.ring[1766] == 0xffff1000 ...
> polaris11.gfx.ring[1767] == 0xffff1000 ...
> polaris11.gfx.ring[1768] == 0xffff1000 ...
> polaris11.gfx.ring[1769] == 0xffff1000 ...
> polaris11.gfx.ring[1770] == 0xffff1000 ...
> polaris11.gfx.ring[1771] == 0xffff1000 ...
> polaris11.gfx.ring[1772] == 0xffff1000 ...
> polaris11.gfx.ring[1773] == 0xffff1000 ...
> polaris11.gfx.ring[1774] == 0xffff1000 ...
> polaris11.gfx.ring[1775] == 0xffff1000 ...
> polaris11.gfx.ring[1776] == 0xffff1000 ...
> polaris11.gfx.ring[1777] == 0xffff1000 ...
> polaris11.gfx.ring[1778] == 0xffff1000 ...
> polaris11.gfx.ring[1779] == 0xffff1000 ...
> polaris11.gfx.ring[1780] == 0xffff1000 ...
> polaris11.gfx.ring[1781] == 0xffff1000 ...
> polaris11.gfx.ring[1782] == 0xffff1000 ...
> polaris11.gfx.ring[1783] == 0xffff1000 ...
> polaris11.gfx.ring[1784] == 0xffff1000 ...
> polaris11.gfx.ring[1785] == 0xffff1000 ...
> polaris11.gfx.ring[1786] == 0xffff1000 ...
> polaris11.gfx.ring[1787] == 0xffff1000 ...
> polaris11.gfx.ring[1788] == 0xffff1000 ...
> polaris11.gfx.ring[1789] == 0xffff1000 ...
> polaris11.gfx.ring[1790] == 0xffff1000 ...
> polaris11.gfx.ring[1791] == 0xffff1000 ...
> polaris11.gfx.ring[1792] == 0xc0032200 rwD
>
> trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
> trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
>
> done after crash, flashing NUMLOCK LED.
> amdgpu_cs:0-799 [001] .... 286.852838: amdgpu_bo_list_set:
> list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
> amdgpu_cs:0-799 [001] .... 286.852846: amdgpu_bo_list_set:
> list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
> ...
> ----------------------------------------------
>
> But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages
> this time. Sometimes such are emitted, sometimes not.
-- 
You are receiving this mail because:
You are the assignee for the bug.
--15348626044.DA8A.2275--
--===============1360968901==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline
X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KZHJpLWRldmVs
IG1haWxpbmcgbGlzdApkcmktZGV2ZWxAbGlzdHMuZnJlZWRlc2t0b3Aub3JnCmh0dHBzOi8vbGlz
dHMuZnJlZWRlc2t0b3Aub3JnL21haWxtYW4vbGlzdGluZm8vZHJpLWRldmVsCg==
--===============1360968901==--