From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!
Date: Wed, 22 Aug 2018 00:24:35 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=102322

Comment #58 on bug 102322 from dwagner
Here comes another trace log, with your info2.patch applied.

Something must have changed since the last test, as it took quite a long time
to reproduce the crash this time. Could that have been caused by
https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8
now being part of the kernel?

However, the latest trace, attached below, is not much different from the
last one. xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will show you:

[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106
[ 1510.023117] [drm] GPU recovery disabled.
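
For anyone comparing several of these logs, here is a minimal sketch of how the timeout line could be parsed to see how many fences were still outstanding on the ring. The regular expression and the function name pending_fences are assumptions made for illustration. They match only the message format quoted in this report and are not taken from the amdgpu driver or from umr.

```python
import re

# Sample line copied from the dmesg output quoted in this comment.
LINE = ("[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring sdma0 timeout, signaled seq=475104, emitted seq=475106")

# Hypothetical pattern: it covers only the message format shown in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a timeout line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))


print(pending_fences(LINE))  # ('sdma0', 2)
```

For the line above, this reports two emitted but not yet signaled fences on sdma0.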

     amdgpu_cs:0-806   [012] ....  1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0

And later in the file you can find:
------------------------------------------------------
crash detected!

executing umr -O halt_waves -wa
No active waves!

executing umr -O verbose -R gfx[.]

polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
polaris11.gfx.ring[ 481] == 0xffff1000    ...
polaris11.gfx.ring[ 482] == 0xffff1000    ...
polaris11.gfx.ring[ 483] == 0xffff1000    ...
polaris11.gfx.ring[ 484] == 0xffff1000    ...
polaris11.gfx.ring[ 485] == 0xffff1000    ...
polaris11.gfx.ring[ 486] == 0xffff1000    ...
polaris11.gfx.ring[ 487] == 0xffff1000    ...
polaris11.gfx.ring[ 488] == 0xffff1000    ...
polaris11.gfx.ring[ 489] == 0xffff1000    ...
polaris11.gfx.ring[ 490] == 0xffff1000    ...
polaris11.gfx.ring[ 491] == 0xffff1000    ...
polaris11.gfx.ring[ 492] == 0xffff1000    ...
polaris11.gfx.ring[ 493] == 0xffff1000    ...
polaris11.gfx.ring[ 494] == 0xffff1000    ...
polaris11.gfx.ring[ 495] == 0xffff1000    ...
polaris11.gfx.ring[ 496] == 0xffff1000    ...
polaris11.gfx.ring[ 497] == 0xffff1000    ...
polaris11.gfx.ring[ 498] == 0xffff1000    ...
polaris11.gfx.ring[ 499] == 0xffff1000    ...
polaris11.gfx.ring[ 500] == 0xffff1000    ...
polaris11.gfx.ring[ 501] == 0xffff1000    ...
polaris11.gfx.ring[ 502] == 0xffff1000    ...
polaris11.gfx.ring[ 503] == 0xffff1000    ...
polaris11.gfx.ring[ 504] == 0xffff1000    ...
polaris11.gfx.ring[ 505] == 0xffff1000    ...
polaris11.gfx.ring[ 506] == 0xffff1000    ...
polaris11.gfx.ring[ 507] == 0xffff1000    ...
polaris11.gfx.ring[ 508] == 0xffff1000    ...
polaris11.gfx.ring[ 509] == 0xffff1000    ...
polaris11.gfx.ring[ 510] == 0xffff1000    ...
polaris11.gfx.ring[ 511] == 0xffff1000    ...
polaris11.gfx.ring[ 512] == 0xc0032200    rwD


trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'

done after crash.
-------------------------------------------
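
As a rough aid for reading dumps like the one above, the following sketch pulls the three pointer values out of the umr output and checks whether they agree. The pattern and the helper names ring_pointers and ring_drained are hypothetical and fit only the line layout shown in this report. Reading equal pointers as "nothing left unconsumed on this ring" assumes the usual ring-buffer convention, which the report itself does not spell out.

```python
import re

# Excerpt of the umr ring dump quoted above (ring entries omitted).
DUMP = """\
polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
"""

# Hypothetical pattern for the "<asic>.<ring>.<name> == <value>" lines shown here.
PTR_RE = re.compile(r"^\S+\.(rptr|wptr|drv_wptr) == (\d+)\s*$", re.MULTILINE)


def ring_pointers(dump):
    """Map pointer name to its integer value for every pointer line found."""
    return {name: int(value) for name, value in PTR_RE.findall(dump)}


def ring_drained(ptrs):
    """True if read, write and driver write pointers all agree (assumed idle)."""
    return ptrs["rptr"] == ptrs["wptr"] == ptrs["drv_wptr"]


ptrs = ring_pointers(DUMP)
print(ptrs, ring_drained(ptrs))
```

Note that this dump covers the gfx ring only, while the timeout in the dmesg line was reported for sdma0, so matching gfx pointers do not by themselves say anything about the state of the SDMA ring.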

So even without a GPU reset, there are still no "waves". And the error message
also does not state any VM fault address.

