From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
Date: Sun, 11 Aug 2019 23:44:16 +0000
List-Id: dri-devel@lists.freedesktop.org

https://bugs.freedesktop.org/show_bug.cgi?id=110674

Comment #75 on bug 110674 from ReddestDream
>Here's some additional investigation.

>[SetUclkToHightestDpmLevel] Set hard min uclk failed! Appears as one of the
>first errors in dmesg. This is from vega20_hwmgr.c:3354 and triggered by:

I agree that [SetUclkToHightestDpmLevel] is probably the key to all this as it
always seems to be the first thing that fails after dysregulation occurs. The
"Failed to send message 0x28, response 0x0" errors show that the driver is
sending wrong or at least wrongly timed commands to the GPU that eventually
cascade into complete failure.
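
To check which of these messages really shows up first in a given log, something like the following rough sketch might help. It only looks for the two error strings quoted in this thread; the function name `first_failures` is made up for illustration, and the exact prefix the kernel puts in front of each message is not assumed.

```python
import re

# The two error strings quoted in this thread. Anything else that should
# be tracked would have to be added here by hand.
PATTERNS = [
    re.compile(r"\[SetUclkToHightestDpmLevel\] Set hard min uclk failed!"),
    re.compile(r"Failed to send message 0x[0-9a-fA-F]+, response 0x[0-9a-fA-F]+"),
]


def first_failures(lines):
    """Return (line_number, text) for the first match of each pattern.

    The result is sorted by line number, so the first entry is the
    earliest of the known errors in the log.
    """
    seen = {}
    for number, line in enumerate(lines, start=1):
        for pattern in PATTERNS:
            if pattern.pattern not in seen and pattern.search(line):
                seen[pattern.pattern] = (number, line.rstrip("\n"))
    return sorted(seen.values())
```

It could be fed a saved log, for example `first_failures(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file holding the output of `dmesg`.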

>Again, it didn't help. I will note that this code is identical in 5.0.13

I have also been unable to find changed code since 5.0 that could be directly
connected to display detect/init/enumeration issues on Radeon VII/Vega 20. This
is why I've come to suspect the error is triggered indirectly in a way that
will probably not be obvious and by code that was likely flawed from the
beginning of Radeon VII/Vega 20 support.

This is also why I was hopeful that 5.3-rc2 would fix this issue since it has
commits that do seem to affect display detection on AMD GPUs. Alas, it did not.
:(

>If the GPU did not crash with dpm disabled as a whole, the proper way to
>proceed would be to start from there and step by step add dpm features and see
>when it starts crashing. It's not a small task since dpm code paths may be
>scattered all over the code.

Unfortunately, it does look like going through and slowly disabling features
and/or bisecting might be the only way to find how this issue got started. At
least if we could narrow it down, we might be in better shape. :/
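
The step-by-step approach from the quote could be organised roughly like this. This is only a sketch under assumptions: it presumes the baseline with everything disabled is stable and that the failure is reproducible. The names `first_bad_feature` and `crashes` are hypothetical; `crashes` stands in for the manual work of booting with a given set of features enabled and checking dmesg.

```python
from typing import Callable, Optional, Sequence


def first_bad_feature(
    features: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Enable features one at a time, in order.

    Returns the first feature whose addition makes `crashes` report a
    failure, or None if every feature could be enabled without one.
    """
    enabled = []
    for feature in features:
        enabled.append(feature)
        # `crashes` receives everything enabled so far, not just the
        # newest feature, since the problem may need a combination.
        if crashes(tuple(enabled)):
            return feature
    return None
```

A linear walk like this needs one test run per feature. If the culprit is assumed to be a single feature, a binary search over the list would need fewer runs, which is essentially what bisecting does over commits.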

I must admit I don't have much experience with graphics drivers and when I tell
other people about this issue, they immediately want to blame X or Mesa until I
explain that I can get these errors w/o starting any graphics at all. lol.

In any case, I really appreciate your testing, Tom B. And any advice you might
have on debugging, Sylvain BERTRAND, is greatly appreciated. :)

