From: Mario Limonciello <superm1@kernel.org>
To: Antheas Kapenekakis <lkml@antheas.dev>,
	Alex Deucher <alexdeucher@gmail.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Harry Wentland" <harry.wentland@amd.com>,
	"Rodrigo Siqueira" <siqueira@igalia.com>,
	"Peyton Lee" <peytolee@amd.com>, "Lang Yu" <lang.yu@amd.com>
Subject: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
Date: Mon, 25 Aug 2025 11:41:30 -0500
Message-ID: <425162fe-aeb7-4ff5-9a84-e7f6da20225e@kernel.org>
In-Reply-To: <CAGwozwFmfBrnZBO6JRZPnPyHLrKycdnoMRtOkK+KpwkdQ4Fw=w@mail.gmail.com>

On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote:
> On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>
>> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote:
>>>
>>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>>
>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>> not used, which happens 1 second after inactivity.
>>>>
>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>> ungate, followed by a gate in the resume process. Then,
>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>> causes an ungate. After that test, vpe_idle_work_handler is scheduled
>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>
>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>> that called the command being stuck processing it.
>>>>
>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>
>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>> ...
>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>> ...
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>
>>>> This is in addition to, e.g., kwin errors in journalctl. 0000:c4:00.0 is
>>>> the GPU. Interestingly, 0000:c4:00.6, which is another HDA block,
>>>> 0000:c4:00.5, a PCI controller, and 0000:c4:00.2 resume normally.
>>>> 0x00000032 is the PowerDownVpe(50) command, which is the common failure
>>>> point in all failed resumes.
>>>>
>>>> On a normal resume, we should get the following power gates:
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
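
For anyone matching the two logs above against each other, here is a rough
sketch of how the pending message can be read out of a "not done with your
previous command" line. It assumes SMN_C2PMSG_66 carries the message ID (which
is what the 0x32 == PowerDownVpe(50) observation implies); the helper name and
the table are illustrative only, with IDs copied from the "smu send message"
lines quoted here.

```python
import re

# Message IDs copied from the "smu send message" lines quoted in this thread.
SMU_MSG_NAMES = {
    50: "PowerDownVpe",
    33: "PowerDownJpeg0",
    38: "PowerDownJpeg1",
    4: "PowerDownVcn1",
    6: "PowerDownVcn0",
    7: "PowerUpVcn0",
    5: "PowerUpVcn1",
    34: "PowerUpJpeg0",
    39: "PowerUpJpeg1",
}

# Assumption: the hex value after SMN_C2PMSG_66 is the unfinished message ID.
BUSY_RE = re.compile(r"SMN_C2PMSG_66:0x([0-9a-fA-F]+)")


def pending_smu_message(log_line):
    """Return (id, name) of the message the SMU reports as unfinished, or None."""
    match = BUSY_RE.search(log_line)
    if match is None:
        return None
    msg_id = int(match.group(1), 16)  # 0x32 -> 50
    return msg_id, SMU_MSG_NAMES.get(msg_id, "unknown")
```

Running it over each failing dmesg line should show whether every hang is
stuck on the same message.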
>>>>
>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>> time of 12s sleep, 8s resume. The suspected reason is that once VPE
>>>> has been used, it needs a bit of time before it can be gated, and the
>>>> previous 1s delay was borderline and not enough for Strix Halo.
>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>> not seem to cause issues.
>>>
>>> This doesn't make much sense.  The VPE idle timeout is arbitrary.  The
>>> VPE idle work handler checks to see if the block is idle before it
>>> power gates the block.  If it's not idle, then the delayed work is
>>> rescheduled, so changing the timing should not make a difference.  We
>>> are not powering down VPE while it still has active jobs.  It sounds
>>> like there is some race condition somewhere else.
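
To make the check-then-reschedule behaviour described above concrete, here is
a small userspace model. It is only a sketch under stated assumptions: the
function and parameter names are hypothetical, time is a plain float, and it
is not the amdgpu implementation. It shows that with this pattern the gate
never lands while a job is outstanding, whatever the timeout value.

```python
def gate_time(busy_until, timeout, now=0.0):
    """Return the time at which the modelled block would be power gated.

    busy_until: time at which the last outstanding job finishes (hypothetical).
    timeout:    idle timeout, playing the role of VPE_IDLE_TIMEOUT.
    """
    t = now + timeout        # first run of the idle handler
    while t < busy_until:    # handler fires but the block is still busy
        t += timeout         # so it re-arms itself instead of gating
    return t                 # block is idle here, gating is allowed
```

For example, with a job finishing at 2.5s the gate lands at 3.0s for a 1s
timeout and at 4.0s for a 2s timeout; in both cases it is after the job is
done, so only the margin after completion changes.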
>>
>> On resume, the vpe is ungated and gated instantly, which does not
>> cause any crashes, then the delayed work is scheduled to run two
>> seconds later. Then, the tests run and finish, which start the gate
>> timer. After the timer lapses and the kernel tries to gate VPE, it
>> crashes. I logged all SMU commands and there is no difference between
>> the ones in a crash and the ones in a normal resume, other than the
>> fact that the VPE gate command failed, which becomes apparent when the
>> next command runs. I will also note that until the idle timer lapses,
>> the system is responsive.
>>
>> Since the VPE is ungated to run the tests, I assume that in my setup
>> it is not used close to resume.
> 
> I should also add that I forced a kernel panic and dumped all CPU
> backtraces in multiple logs. After the softlock, CPUs were either
> parked in the scheduler, powered off, or stuck executing an SMU
> command issued by, e.g., a userspace usage sensor graph. So it is not a
> deadlock.
> 

Can you please confirm whether you were on the absolute latest
linux-firmware when you reproduced this issue?

Can you please share the debugfs output for amdgpu_firmware_info?
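
If it helps, something along these lines should collect it. This is a sketch
that assumes debugfs is mounted at /sys/kernel/debug and that the file sits
under dri/<N>/; it needs root, and the helper name is just for illustration.

```python
import glob
import os


def read_firmware_info(debugfs_root="/sys/kernel/debug"):
    """Return {path: contents} for every amdgpu_firmware_info file found."""
    # Assumed layout: <debugfs_root>/dri/<N>/amdgpu_firmware_info
    pattern = os.path.join(debugfs_root, "dri", "*", "amdgpu_firmware_info")
    found = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", errors="replace") as handle:
            found[path] = handle.read()
    return found


if __name__ == "__main__":
    for path, text in read_firmware_info().items():
        print(f"== {path} ==")
        print(text)
```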

