From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
Date: Sun, 11 Aug 2019 23:44:16 +0000
Message-ID:
References:
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============2059111744=="
Return-path:
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org
[131.252.210.165])
by gabe.freedesktop.org (Postfix) with ESMTP id 0FBF189933
for ; Sun, 11 Aug 2019 23:44:16 +0000 (UTC)
In-Reply-To:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org
--===============2059111744==
Content-Type: multipart/alternative; boundary="15655670560.7CfE.19745"
Content-Transfer-Encoding: 7bit
--15655670560.7CfE.19745
Date: Sun, 11 Aug 2019 23:44:16 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated
https://bugs.freedesktop.org/show_bug.cgi?id=110674
--- Comment #75 from ReddestDream ---
>Here's some additional investigation.
>[SetUclkToHightestDpmLevel] Set hard min uclk failed! Appears as one of the first errors in dmesg. This is from vega20_hwmgr.c:3354 and triggered by:
I agree that [SetUclkToHightestDpmLevel] is probably the key to all this as it
always seems to be the first thing that fails after dysregulation occurs. The
"Failed to send message 0x28, response 0x0" errors show that the driver is
sending wrong or at least wrongly timed commands to the GPU that eventually
cascade into complete failure.
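One way to check which of the two errors appears first in a saved dmesg log is sketched below. This is only an illustration: the function names are hypothetical, the patterns are built from the two error strings quoted above, and the exact prefix the kernel prints before them is not assumed.

```python
import re

# Patterns built from the two error strings quoted in this thread.
# The names "uclk" and "smu_msg" are labels chosen for this sketch.
PATTERNS = {
    "uclk": re.compile(
        r"\[SetUclkToHightestDpmLevel\] Set hard min uclk failed"
    ),
    "smu_msg": re.compile(
        r"Failed to send message 0x[0-9a-fA-F]+, response 0x[0-9a-fA-F]+"
    ),
}


def first_errors(lines):
    """Map each pattern name to the index of the first line it matches."""
    first = {}
    for index, line in enumerate(lines):
        for name, pattern in PATTERNS.items():
            if name not in first and pattern.search(line):
                first[name] = index
    return first


def earliest(lines):
    """Return the name of the pattern that matches earliest, or None."""
    first = first_errors(lines)
    return min(first, key=first.get) if first else None
```

Passing the lines of a saved log (for example, the output of `dmesg` redirected to a file) to `earliest()` reports which error class shows up first on a given boot.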
>Again, it didn't help. I will note that this code is identical in 5.0.13
I have also been unable to find changed code since 5.0 that could be directly
connected to display detect/init/enumeration issues on Radeon VII/Vega 20. This
is why I've come to suspect the error is triggered indirectly in a way that
will probably not be obvious and by code that was likely flawed from the
beginning of Radeon VII/Vega 20 support.
This is also why I was hopeful that 5.3-rc2 would fix this issue since it has
commits that do seem to affect display detection on AMD GPUs. Alas, it did not.
:(
>If the GPU did not crash with dpm disabled as a whole, the proper way to
proceed would be to start from there and step by step add dpm features and see
when it starts crashing. It's not a small task since dpm code paths may be
scattered all over the code.
Unfortunately, it does look like going through and slowly disabling features
and/or bisecting might be the only way to find how this issue got started. At
least if we could narrow it down, we might be in better shape. :/
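The one-feature-at-a-time approach can be sketched as generating a series of bitmasks, each with a single feature bit cleared. This is a rough illustration under stated assumptions: it assumes the features are toggled through a bitmask kernel parameter such as amdgpu.ppfeaturemask, the helper names are hypothetical, and whether any of those bits actually covers the dpm code paths involved in this bug has not been established in this thread. The base mask must come from the running system; none is assumed here.

```python
def single_bit_off_masks(base_mask, width=32):
    """Yield (bit, mask) pairs where mask is base_mask with one set bit cleared."""
    for bit in range(width):
        if base_mask & (1 << bit):
            yield bit, base_mask & ~(1 << bit)


def format_param(mask):
    """Format a mask as a kernel command-line argument (assumed parameter name)."""
    return f"amdgpu.ppfeaturemask=0x{mask:08x}"
```

Each generated value would be tried on a separate boot, noting which cleared bit, if any, makes the errors disappear.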
I must admit I don't have much experience with graphics drivers and when I tell
other people about this issue, they immediately want to blame X or Mesa until I
explain that I can get these errors w/o starting any graphics at all. lol.
In any case, I really appreciate your testing Tom B. And any advice you might
have on debugging, Sylvain BERTRAND, is greatly appreciated. :)
-- 
You are receiving this mail because:
You are the assignee for the bug.
--15655670560.7CfE.19745--
--===============2059111744==--