From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 110674] Crashes / Resets From AMDGPU / Radeon VII
Date: Fri, 16 Aug 2019 23:19:25 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=110674

Comment #110 on bug 110674 from ReddestDream
> 1. The functions in vega20_ppt.c are used with this new patch so that answers my question from earlier, that's what this file is for and why it contains similar/identical functions.

I was hoping this was the case, since the duplicated functions were confusing me
too. Glad we got this figured out! :)

> I tried it, it didn't help the crashing issue and I was stuck at 30w. As soon as I started sddm the system froze. I've attached my dmesg from amdgpu.dpm=2 boot. It doesn't fix the issue but it does help answer a few questions I had:

This is disappointing, though. I was hoping that setting amdgpu.dpm=2 would
select the more "actively developed" code path and that this would fix the
issue. :/

> Given that two different versions of the code produce the same result, my hunch is that the problem is B. The card is not in a state where it's able to receive power changes.

I tend to agree, but it's still not clear why or how the card ends up in a bad
state where commands sent to it via smu_send_smc_msg_with_param suddenly stop
working. And given how many identical or near-identical functions
vega20_hwmgr.c and vega20_ppt.c share, it's hard to rule out A entirely.
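One way to pin down *when* the SMU stops responding would be to record every message and its parameter as it is sent, so the last few messages before the first failure can be inspected afterwards. The C sketch below is purely hypothetical: `traced_send`, `smc_trace_entry`, and the fake SMU are made up for illustration, the message ID is arbitrary, and the send-function signature is only loosely modeled on smu_send_smc_msg_with_param; it is not that function's real signature.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SMC_TRACE_DEPTH 8

/* Hypothetical record of one message sent to the SMU. */
struct smc_trace_entry {
    uint16_t msg;
    uint32_t param;
    int ret;
};

/* Loosely modeled on smu_send_smc_msg_with_param(); NOT its real signature. */
typedef int (*smc_send_fn)(uint16_t msg, uint32_t param);

static struct smc_trace_entry smc_trace[SMC_TRACE_DEPTH];
static unsigned smc_trace_count;
static long smc_first_failure = -1; /* index of first failed message, -1 if none */

/* Send a message through `send`, store it in a ring buffer, and remember the
 * index of the first failure so the messages leading up to it can be dumped. */
static int traced_send(smc_send_fn send, uint16_t msg, uint32_t param)
{
    int ret = send(msg, param);
    struct smc_trace_entry *e = &smc_trace[smc_trace_count % SMC_TRACE_DEPTH];

    e->msg = msg;
    e->param = param;
    e->ret = ret;
    if (ret != 0 && smc_first_failure < 0)
        smc_first_failure = (long)smc_trace_count;
    smc_trace_count++;
    return ret;
}

/* Print the most recent entries still held in the ring buffer, oldest first. */
static void dump_smc_trace(void)
{
    unsigned start = smc_trace_count > SMC_TRACE_DEPTH ?
                     smc_trace_count - SMC_TRACE_DEPTH : 0;

    for (unsigned i = start; i < smc_trace_count; i++) {
        const struct smc_trace_entry *e = &smc_trace[i % SMC_TRACE_DEPTH];
        printf("#%u msg=0x%x param=0x%x ret=%d\n",
               i, (unsigned)e->msg, (unsigned)e->param, e->ret);
    }
}

/* Fake SMU for illustration only: accepts three messages, then every call
 * fails, mimicking commands that "suddenly stop working". */
static int fake_smu_send(uint16_t msg, uint32_t param)
{
    static int calls;

    (void)msg;
    (void)param;
    return ++calls > 3 ? -1 : 0;
}
```

With the fake SMU, sending five messages leaves the first failure at index 3, and the dump shows exactly which message and parameter were the last to succeed.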

Since amdgpu.dpm=0 resolves the issue (albeit at the cost of being stuck at
the minimum clocks inherited from the VBIOS/GOP/UEFI/firmware), it seems that
the card starts out in a reasonable state and is later pushed into a bad state
by faulty driver code, specifically code in the DPM (Dynamic Power Management)
system. We are fairly confident that dpm_state.hard_min_level stays stable the
whole time, so that's probably not what pushes the card into a bad state. But
perhaps another value in the DPM table is . . .

It doesn't make intuitive sense that the soft min/max values would be
problematic, since they are presumably "more flexible," but it's possible that
they get calculated out of spec. Either way, it should be possible to log them
the same way dpm_state.hard_min_level was logged.
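Logging the soft limits could follow the same pattern as the hard_min_level logging, with a simple range check so suspicious values stand out. This is a standalone C sketch under stated assumptions: `struct dpm_state_sketch` is only assumed to mirror the soft/hard min/max fields of the driver's per-clock dpm_state, the helper names are hypothetical, and treating "in spec" as "soft_min <= soft_max, with all four values between the lowest and highest DPM table entries" is an assumption, not a documented SMU rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror the soft/hard min/max fields of the driver's per-clock
 * dpm_state; this standalone struct is an illustration, not the kernel type. */
struct dpm_state_sketch {
    uint32_t soft_min_level;
    uint32_t soft_max_level;
    uint32_t hard_min_level;
    uint32_t hard_max_level;
};

static bool in_range(uint32_t v, uint32_t lo, uint32_t hi)
{
    return v >= lo && v <= hi;
}

/* Hypothetical check: soft_min <= soft_max, and every value lies between the
 * lowest (table_min) and highest (table_max) entries of the DPM table. */
static bool dpm_state_in_spec(const struct dpm_state_sketch *s,
                              uint32_t table_min, uint32_t table_max)
{
    return s->soft_min_level <= s->soft_max_level &&
           in_range(s->soft_min_level, table_min, table_max) &&
           in_range(s->soft_max_level, table_min, table_max) &&
           in_range(s->hard_min_level, table_min, table_max) &&
           in_range(s->hard_max_level, table_min, table_max);
}

/* Print all four values on one line and flag states that fail the check.
 * Inside the driver this would be a kernel log call rather than printf(). */
static void log_dpm_state(const char *clk_name,
                          const struct dpm_state_sketch *s,
                          uint32_t table_min, uint32_t table_max)
{
    printf("%s: soft_min=%u soft_max=%u hard_min=%u hard_max=%u%s\n",
           clk_name,
           (unsigned)s->soft_min_level, (unsigned)s->soft_max_level,
           (unsigned)s->hard_min_level, (unsigned)s->hard_max_level,
           dpm_state_in_spec(s, table_min, table_max) ? "" : " (out of spec?)");
}
```

The numbers used to exercise it (e.g. a 300 to 1800 table) are arbitrary placeholders, not real Radeon VII clocks; the useful part is logging all four values together each time the DPM state is updated, so a value drifting out of range right before the hang would show up in dmesg.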
