* [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
@ 2025-08-24 8:53 Antheas Kapenekakis
2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-24 8:53 UTC (permalink / raw)
To: amd-gfx
Cc: dri-devel, linux-kernel, Alex Deucher, Christian König,
David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira,
Mario Limonciello, Peyton Lee, Lang Yu, Antheas Kapenekakis
On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of
suspend/resume cycles result in a soft lockup around 1 second after the
screen turns on (it freezes). This happens due to power gating VPE when
it is not in use, which is done 1 second after inactivity.
Specifically, the VPE gating after resume is as follows: an initial
ungate, followed by a gate in the resume process. Then,
amdgpu_device_delayed_init_work_handler is scheduled with a delay of 2s
to run tests, one of which tests VPE in vpe_ring_test_ib. This causes
an ungate. After that test, vpe_idle_work_handler is scheduled with
VPE_IDLE_TIMEOUT (1s).
When vpe_idle_work_handler runs and tries to gate VPE, it causes the
SMU to hang and partially freezes the GPU (around half of its IPs),
with the thread that issued the command stuck processing it.
Specifically, after that SMU command tries to run, we get the following:
snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
...
xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
...
amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
[drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
[drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
[drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
In addition, there are e.g. kwin errors in journalctl. 0000:c4:00.0 is
the GPU. Interestingly, 0000:c4:00.6 (another HDA block), 0000:c4:00.5
(a PCI controller), and 0000:c4:00.2 resume normally. 0x00000032 is the
PowerDownVpe(50) command, which is the common failure point in all
failed resumes.
On a normal resume, we should get the following power gates:
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
time of 12s sleep, 8s resume. The suspected reason is that when VPE has
just been used, it needs a bit of time before it can be gated, and the
previous 1s delay was borderline, which is not enough for Strix Halo.
When VPE is not used, such as on resume, gating it instantly does not
seem to cause issues.
Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
index 121ee17b522b..24f09e457352 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -34,8 +34,8 @@
/* VPE CSA resides in the 4th page of CSA */
#define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
-/* 1 second timeout */
-#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
+/* 2 second timeout */
+#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
#define VPE_MAX_DPM_LEVEL 4
#define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
--
2.50.1
* [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100
2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis
@ 2025-08-24 8:53 ` Antheas Kapenekakis
2025-08-24 11:29 ` kernel test robot
2025-08-24 19:33 ` Antheas Kapenekakis
2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello
2025-08-25 13:20 ` Alex Deucher
2 siblings, 2 replies; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-24 8:53 UTC (permalink / raw)
To: amd-gfx
Cc: dri-devel, linux-kernel, Alex Deucher, Christian König,
David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira,
Mario Limonciello, Peyton Lee, Lang Yu, Antheas Kapenekakis
Certain OLED devices malfunction on specific brightness levels.
Specifically, when DP_SOURCE_BACKLIGHT_LEVEL is written with the low
byte being 0x00 (and sometimes 0x01), the panel forcibly turns off
until the device sleeps again. This is an issue on multiple handhelds,
including the OneXPlayer F1 Pro and Ayaneo 3 (the panel is suspected to
be the same 1080p 7-inch OLED).
Below are some examples. This was found by iterating over brightness
ranges while printing DP_SOURCE_BACKLIGHT_LEVEL. The screen would
malfunction on specific values, and some of them were collected.
Broken:
86016: 10101000000000000
86272: 10101000100000000
87808: 10101011100000000
251648: 111101011100000000
251649: 111101011100000001
Working:
86144: 10101000010000000
87809: 10101011100000001
251650: 111101011100000010
The reason for this is that the range manipulation is too granular.
AUX is currently written to with a granularity of 1. Forcing 100,
which on the Ayaneo 3 OLED yields 400*10=4000 values, is plenty of
granularity and fixes this issue. Iterating over the values through
Python shows that the final byte is never 0x00, and testing over the
entire range with a cadence of 0.2s/it and 73 increments (to saturate
the range) shows no issues. Windows likewise shows no issues.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3803
Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
---
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 28 +++++++++++--------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index cd0e2976e268..bb16adcafb88 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -4739,7 +4739,8 @@ static void amdgpu_dm_update_backlight_caps(struct amdgpu_display_manager *dm,
}
static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps,
- unsigned int *min, unsigned int *max)
+ unsigned int *min, unsigned int *max,
+ unsigned int *multiple)
{
if (!caps)
return 0;
@@ -4748,10 +4749,12 @@ static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps,
// Firmware limits are in nits, DC API wants millinits.
*max = 1000 * caps->aux_max_input_signal;
*min = 1000 * caps->aux_min_input_signal;
+ *multiple = 100;
} else {
// Firmware limits are 8-bit, PWM control is 16-bit.
*max = 0x101 * caps->max_input_signal;
*min = 0x101 * caps->min_input_signal;
+ *multiple = 1;
}
return 1;
}
@@ -4813,23 +4816,25 @@ static void convert_custom_brightness(const struct amdgpu_dm_backlight_caps *cap
static u32 convert_brightness_from_user(const struct amdgpu_dm_backlight_caps *caps,
uint32_t brightness)
{
- unsigned int min, max;
+ unsigned int min, max, multiple;
- if (!get_brightness_range(caps, &min, &max))
+ if (!get_brightness_range(caps, &min, &max, &multiple))
return brightness;
convert_custom_brightness(caps, min, max, &brightness);
- // Rescale 0..max to min..max
- return min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max);
+ // Rescale 0..max to min..max rounding to nearest multiple
+ return rounddown(
+ min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max),
+ multiple);
}
static u32 convert_brightness_to_user(const struct amdgpu_dm_backlight_caps *caps,
uint32_t brightness)
{
- unsigned int min, max;
+ unsigned int min, max, multiple;
- if (!get_brightness_range(caps, &min, &max))
+ if (!get_brightness_range(caps, &min, &max, &multiple))
return brightness;
if (brightness < min)
@@ -4970,7 +4975,7 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector)
struct backlight_properties props = { 0 };
struct amdgpu_dm_backlight_caps *caps;
char bl_name[16];
- int min, max;
+ int min, max, multiple;
if (aconnector->bl_idx == -1)
return;
@@ -4983,15 +4988,16 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector)
}
caps = &dm->backlight_caps[aconnector->bl_idx];
- if (get_brightness_range(caps, &min, &max)) {
+ if (get_brightness_range(caps, &min, &max, &multiple)) {
if (power_supply_is_system_supplied() > 0)
props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->ac_level, 100);
else
props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->dc_level, 100);
/* min is zero, so max needs to be adjusted */
props.max_brightness = max - min;
- drm_dbg(drm, "Backlight caps: min: %d, max: %d, ac %d, dc %d\n", min, max,
- caps->ac_level, caps->dc_level);
+ drm_dbg(drm,
+ "Backlight caps: min: %d, max: %d, ac %d, dc %d, multiple: %d\n",
+ min, max, caps->ac_level, caps->dc_level, multiple);
} else
props.brightness = props.max_brightness = MAX_BACKLIGHT_LEVEL;
--
2.50.1
* Re: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100
2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
@ 2025-08-24 11:29 ` kernel test robot
2025-08-24 19:33 ` Antheas Kapenekakis
1 sibling, 0 replies; 21+ messages in thread
From: kernel test robot @ 2025-08-24 11:29 UTC (permalink / raw)
To: Antheas Kapenekakis, amd-gfx
Cc: oe-kbuild-all, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu,
Antheas Kapenekakis
Hi Antheas,
kernel test robot noticed the following build errors:
[auto build test ERROR on c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9]
url: https://github.com/intel-lab-lkp/linux/commits/Antheas-Kapenekakis/drm-amd-display-Adjust-AUX-brightness-to-be-a-granularity-of-100/20250824-165633
base: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
patch link: https://lore.kernel.org/r/20250824085351.454619-2-lkml%40antheas.dev
patch subject: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100
config: i386-buildonly-randconfig-002-20250824 (https://download.01.org/0day-ci/archive/20250824/202508241901.DJ851kiv-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250824/202508241901.DJ851kiv-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508241901.DJ851kiv-lkp@intel.com/
All errors (new ones prefixed by >>):
ld: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.o: in function `amdgpu_dm_backlight_set_level':
>> amdgpu_dm.c:(.text+0x8b89): undefined reference to `__umoddi3'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100
2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
2025-08-24 11:29 ` kernel test robot
@ 2025-08-24 19:33 ` Antheas Kapenekakis
2025-08-25 7:02 ` Philip Mueller
1 sibling, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-24 19:33 UTC (permalink / raw)
To: amd-gfx
Cc: dri-devel, linux-kernel, Alex Deucher, Christian König,
David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira,
Mario Limonciello, Peyton Lee, Lang Yu
On Sun, 24 Aug 2025 at 10:54, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>
> Certain OLED devices malfunction on specific brightness levels.
> Specifically, when DP_SOURCE_BACKLIGHT_LEVEL is written to with
> the minor byte being 0x00 and sometimes 0x01, the panel forcibly
> turns off until the device sleeps again. This is an issue on
> multiple handhelds, including OneXPlayer F1 Pro and Ayaneo 3
> (the panel is suspected to be the same-1080p 7in OLED).
>
> Below are some examples. This was found by iterating over brighness
> ranges while printing DP_SOURCE_BACKLIGHT_LEVEL. It was found that
> the screen would malfunction on specific values, and some of them
> were collected.
>
> Broken:
> 86016: 10101000000000000
> 86272: 10101000100000000
> 87808: 10101011100000000
> 251648: 111101011100000000
> 251649: 111101011100000001
>
> Working:
> 86144: 10101000010000000
> 87809: 10101011100000001
> 251650: 111101011100000010
>
> The reason for this is that the range manipulation is too granular.
> AUX is currently written to with a granularity of 1. Forcing 100,
> which on the Ayaneo 3 OLED yields 400*10=4000 values, is plenty of
> granularity and fixes this issue. Iterating over the values through
> Python shows that the final byte is never 0x00, and testing over the
> entire range with a cadence of 0.2s/it and 73 increments (to saturate
> the range) shows no issues. Windows likewise shows no issues.
Well, Phil managed to fall into the value 332800, which has a 0x00 low
byte. Unfortunate. In hindsight, every 64 hundreds (i.e. every 6400)
there would be a zero anyway.
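For reference, a quick standalone check (illustration only; it assumes
the value written is simply the rounded-down multiple of 100):

#include <stdio.h>

int main(void)
{
	/* Multiples of 100 whose low byte is 0x00 occur every
	 * lcm(100, 256) = 6400, e.g. 326400, 332800, 339200, ...
	 */
	for (unsigned int v = 0; v <= 400000; v += 100)
		if ((v & 0xff) == 0)
			printf("%u\n", v);
	return 0;
}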
Before I made this patch, I made a partial refactor of panel-quirks
where a quirk like this could go. But I would really prefer not to do
quirks. I'll send that too.
Antheas
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3803
> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> ---
> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 28 +++++++++++--------
> 1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index cd0e2976e268..bb16adcafb88 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -4739,7 +4739,8 @@ static void amdgpu_dm_update_backlight_caps(struct amdgpu_display_manager *dm,
> }
>
> static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps,
> - unsigned int *min, unsigned int *max)
> + unsigned int *min, unsigned int *max,
> + unsigned int *multiple)
> {
> if (!caps)
> return 0;
> @@ -4748,10 +4749,12 @@ static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps,
> // Firmware limits are in nits, DC API wants millinits.
> *max = 1000 * caps->aux_max_input_signal;
> *min = 1000 * caps->aux_min_input_signal;
> + *multiple = 100;
> } else {
> // Firmware limits are 8-bit, PWM control is 16-bit.
> *max = 0x101 * caps->max_input_signal;
> *min = 0x101 * caps->min_input_signal;
> + *multiple = 1;
> }
> return 1;
> }
> @@ -4813,23 +4816,25 @@ static void convert_custom_brightness(const struct amdgpu_dm_backlight_caps *cap
> static u32 convert_brightness_from_user(const struct amdgpu_dm_backlight_caps *caps,
> uint32_t brightness)
> {
> - unsigned int min, max;
> + unsigned int min, max, multiple;
>
> - if (!get_brightness_range(caps, &min, &max))
> + if (!get_brightness_range(caps, &min, &max, &multiple))
> return brightness;
>
> convert_custom_brightness(caps, min, max, &brightness);
>
> - // Rescale 0..max to min..max
> - return min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max);
> + // Rescale 0..max to min..max rounding to nearest multiple
> + return rounddown(
> + min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max),
> + multiple);
> }
>
> static u32 convert_brightness_to_user(const struct amdgpu_dm_backlight_caps *caps,
> uint32_t brightness)
> {
> - unsigned int min, max;
> + unsigned int min, max, multiple;
>
> - if (!get_brightness_range(caps, &min, &max))
> + if (!get_brightness_range(caps, &min, &max, &multiple))
> return brightness;
>
> if (brightness < min)
> @@ -4970,7 +4975,7 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector)
> struct backlight_properties props = { 0 };
> struct amdgpu_dm_backlight_caps *caps;
> char bl_name[16];
> - int min, max;
> + int min, max, multiple;
>
> if (aconnector->bl_idx == -1)
> return;
> @@ -4983,15 +4988,16 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector)
> }
>
> caps = &dm->backlight_caps[aconnector->bl_idx];
> - if (get_brightness_range(caps, &min, &max)) {
> + if (get_brightness_range(caps, &min, &max, &multiple)) {
> if (power_supply_is_system_supplied() > 0)
> props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->ac_level, 100);
> else
> props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->dc_level, 100);
> /* min is zero, so max needs to be adjusted */
> props.max_brightness = max - min;
> - drm_dbg(drm, "Backlight caps: min: %d, max: %d, ac %d, dc %d\n", min, max,
> - caps->ac_level, caps->dc_level);
> + drm_dbg(drm,
> + "Backlight caps: min: %d, max: %d, ac %d, dc %d, multiple: %d\n",
> + min, max, caps->ac_level, caps->dc_level, multiple);
> } else
> props.brightness = props.max_brightness = MAX_BACKLIGHT_LEVEL;
>
> --
> 2.50.1
>
>
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis
2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
@ 2025-08-24 20:16 ` Mario Limonciello
2025-08-24 20:46 ` Antheas Kapenekakis
2025-08-25 13:20 ` Alex Deucher
2 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-08-24 20:16 UTC (permalink / raw)
To: Antheas Kapenekakis, amd-gfx
Cc: dri-devel, linux-kernel, Alex Deucher, Christian König,
David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira,
Mario Limonciello, Peyton Lee, Lang Yu
On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> suspend resumes result in a soft lock around 1 second after the screen
> turns on (it freezes). This happens due to power gating VPE when it is
> not used, which happens 1 second after inactivity.
>
> Specifically, the VPE gating after resume is as follows: an initial
> ungate, followed by a gate in the resume process. Then,
> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> with VPE_IDLE_TIMEOUT (1s).
>
> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> SMU to hang and partially freezes half of the GPU IPs, with the thread
> that called the command being stuck processing it.
>
> Specifically, after that SMU command tries to run, we get the following:
>
> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> ...
> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> ...
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>
> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> PowerDownVpe(50) command which is the common failure point in all
> failed resumes.
>
> On a normal resume, we should get the following power gates:
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>
> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> time of 12s sleep, 8s resume.
When you say you reproduced with 12s sleep and 8s resume, was that
'amd-s2idle --duration 12 --wait 8'?
> The suspected reason here is that 1s that
> when VPE is used, it needs a bit of time before it can be gated and
> there was a borderline delay before, which is not enough for Strix Halo.
> When the VPE is not used, such as on resume, gating it instantly does
> not seem to cause issues.
>
> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> index 121ee17b522b..24f09e457352 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> @@ -34,8 +34,8 @@
> /* VPE CSA resides in the 4th page of CSA */
> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>
> -/* 1 second timeout */
> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> +/* 2 second timeout */
> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>
> #define VPE_MAX_DPM_LEVEL 4
> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>
> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
1s idle timeout has been used by other IPs for a long time.
For example JPEG, UVD, VCN all use 1s.
Can you please confirm both your AGESA and your SMU firmware versions?
In case you're not aware, you can get the AGESA version from the SMBIOS
string (DMI type 40).
❯ sudo dmidecode | grep AGESA
You can get SMU firmware version from this:
❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
Are you on the most up to date firmware for your system from the
manufacturer?
We haven't seen anything like this reported on Strix Halo thus far and
we do internal stress testing on s0i3 on reference hardware.
To me this seems likely to be a platform firmware bug; but I would like
to understand the timing of the gate vs ungate on good vs bad.
I.e., is it possible that the delayed work handler
amdgpu_device_delayed_init_work_handler() is causing a race with
vpe_ring_begin_use()?
This should be possible to check without extra instrumentation by using
ftrace and looking at the timing of the 2 ring functions and the init
work handler and checking good vs bad cycles.
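For context, the two ring hooks follow roughly this pattern (simplified
sketch, not the verbatim amdgpu_vpe.c code): vpe_ring_begin_use()
cancels any pending idle work before the ring is used, and
vpe_ring_end_use() re-arms it:

static void vpe_ring_begin_use(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	/* stop any pending idle/gate work before the ring is used */
	cancel_delayed_work_sync(&adev->vpe.idle_work);
	/* ... ungate / power up VPE here if it was gated ... */
}

static void vpe_ring_end_use(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	/* re-arm the idle work; it gates VPE once the ring stays idle */
	schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}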
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello
@ 2025-08-24 20:46 ` Antheas Kapenekakis
2025-08-25 1:38 ` Mario Limonciello
0 siblings, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-24 20:46 UTC (permalink / raw)
To: Mario Limonciello
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
>
>
>
> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
> > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> > suspend resumes result in a soft lock around 1 second after the screen
> > turns on (it freezes). This happens due to power gating VPE when it is
> > not used, which happens 1 second after inactivity.
> >
> > Specifically, the VPE gating after resume is as follows: an initial
> > ungate, followed by a gate in the resume process. Then,
> > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> > to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> > causes an ungate, After that test, vpe_idle_work_handler is scheduled
> > with VPE_IDLE_TIMEOUT (1s).
> >
> > When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> > SMU to hang and partially freezes half of the GPU IPs, with the thread
> > that called the command being stuck processing it.
> >
> > Specifically, after that SMU command tries to run, we get the following:
> >
> > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> > ...
> > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> > ...
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >
> > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> > a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> > PowerDownVpe(50) command which is the common failure point in all
> > failed resumes.
> >
> > On a normal resume, we should get the following power gates:
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >
> > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> > time of 12s sleep, 8s resume.
>
> When you say you reproduced with 12s sleep and 8s resume, was that
> 'amd-s2idle --duration 12 --wait 8'?
I did not use amd-s2idle. I essentially used the script below with a
12 on the wake alarm and 12 on the for loop. I also used pstore for
this testing.
for i in {1..200}; do
  echo "Suspend attempt $i"
  echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
  sudo sh -c 'echo mem > /sys/power/state'

  for j in {1..50}; do
    # Use repeating sleep in case echo mem returns early
    sleep 1
  done
done
> > The suspected reason here is that 1s that
> > when VPE is used, it needs a bit of time before it can be gated and
> > there was a borderline delay before, which is not enough for Strix Halo.
> > When the VPE is not used, such as on resume, gating it instantly does
> > not seem to cause issues.
> >
> > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > index 121ee17b522b..24f09e457352 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > @@ -34,8 +34,8 @@
> > /* VPE CSA resides in the 4th page of CSA */
> > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> >
> > -/* 1 second timeout */
> > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> > +/* 2 second timeout */
> > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> >
> > #define VPE_MAX_DPM_LEVEL 4
> > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> >
> > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
>
> 1s idle timeout has been used by other IPs for a long time.
> For example JPEG, UVD, VCN all use 1s.
>
> Can you please confirm both your AGESA and your SMU firmware version?
> In case you're not aware; you can get AGESA version from SMBIOS string
> (DMI type 40).
>
> ❯ sudo dmidecode | grep AGESA
String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
> You can get SMU firmware version from this:
>
> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
/sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
/sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
> Are you on the most up to date firmware for your system from the
> manufacturer?
I updated my BIOS, PD firmware, and USB device firmware in early
August, when I was doing this testing.
> We haven't seen anything like this reported on Strix Halo thus far and
> we do internal stress testing on s0i3 on reference hardware.
Can't find a reference for it on the bug tracker. I have four bug
reports on the Bazzite issue tracker, 2 about sleep/wake crashes and 2
about runtime crashes, where the culprit would be this, i.e. a runtime
gate of VPE causing a crash.
> To me this seems likely to be a platform firmware bug; but I would like
> to understand the timing of the gate vs ungate on good vs bad.
Perhaps it is. It is either something like that or silicon quality.
> IE is it possible the delayed work handler
> amdgpu_device_delayed_init_work_handler() is causing a race with
> vpe_ring_begin_use()?
I don't think so. There is only a single ungate. Also, the crash
happens on the gate. So what happens is: the device wakes up, the
screen turns on, the KDE clock works, then after a second it freezes,
there is a soft lockup, and the device hangs.
The failed command is always the VPE gate that is triggered after 1s in idle.
> This should be possible to check without extra instrumentation by using
> ftrace and looking at the timing of the 2 ring functions and the init
> work handler and checking good vs bad cycles.
I do not know how to use ftrace. I should also note that after the
device freezes, only around 1 in 5 cycles will sync the fs, so it is
also not a very easy thing to diagnose. The device just stops working.
A lot of the logs I got were in pstore, by forcing a kernel panic.
If you say that all IP blocks use 1s, perhaps an alternative solution
would be to desync the idle times so they do not happen
simultaneously. So 1000, 1200, 1400, etc.
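Something like the following (illustration only, arbitrary values;
where the VCN/JPEG timeouts actually live is omitted here), so the gate
requests do not reach the SMU at the same time:

#define VPE_IDLE_TIMEOUT	msecs_to_jiffies(1000)
#define VCN_IDLE_TIMEOUT	msecs_to_jiffies(1200)
#define JPEG_IDLE_TIMEOUT	msecs_to_jiffies(1400)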
Antheas
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-24 20:46 ` Antheas Kapenekakis
@ 2025-08-25 1:38 ` Mario Limonciello
2025-08-25 13:39 ` Antheas Kapenekakis
0 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-08-25 1:38 UTC (permalink / raw)
To: Antheas Kapenekakis
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
>>
>>
>>
>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>> suspend resumes result in a soft lock around 1 second after the screen
>>> turns on (it freezes). This happens due to power gating VPE when it is
>>> not used, which happens 1 second after inactivity.
>>>
>>> Specifically, the VPE gating after resume is as follows: an initial
>>> ungate, followed by a gate in the resume process. Then,
>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
>>> with VPE_IDLE_TIMEOUT (1s).
>>>
>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>> that called the command being stuck processing it.
>>>
>>> Specifically, after that SMU command tries to run, we get the following:
>>>
>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>> ...
>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>> ...
>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>
>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
>>> PowerDownVpe(50) command which is the common failure point in all
>>> failed resumes.
>>>
>>> On a normal resume, we should get the following power gates:
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>
>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>> time of 12s sleep, 8s resume.
>>
>> When you say you reproduced with 12s sleep and 8s resume, was that
>> 'amd-s2idle --duration 12 --wait 8'?
>
> I did not use amd-s2idle. I essentially used the script below with a
> 12 on the wake alarm and 12 on the for loop. I also used pstore for
> this testing.
>
> for i in {1..200}; do
> echo "Suspend attempt $i"
> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
> sudo sh -c 'echo mem > /sys/power/state'
>
> for j in {1..50}; do
> # Use repeating sleep in case echo mem returns early
> sleep 1
> done
> done
👍
>
>>> The suspected reason here is that 1s that
>>> when VPE is used, it needs a bit of time before it can be gated and
>>> there was a borderline delay before, which is not enough for Strix Halo.
>>> When the VPE is not used, such as on resume, gating it instantly does
>>> not seem to cause issues.
>>>
>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>> index 121ee17b522b..24f09e457352 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>> @@ -34,8 +34,8 @@
>>> /* VPE CSA resides in the 4th page of CSA */
>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>>>
>>> -/* 1 second timeout */
>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
>>> +/* 2 second timeout */
>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>>>
>>> #define VPE_MAX_DPM_LEVEL 4
>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>>>
>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
>>
>> 1s idle timeout has been used by other IPs for a long time.
>> For example JPEG, UVD, VCN all use 1s.
>>
>> Can you please confirm both your AGESA and your SMU firmware version?
>> In case you're not aware; you can get AGESA version from SMBIOS string
>> (DMI type 40).
>>
>> ❯ sudo dmidecode | grep AGESA
>
> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
>
>> You can get SMU firmware version from this:
>>
>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>
> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
>
Thanks, I'll get some folks to check whether we match this AGESA
version and whether we can also reproduce it on reference hardware the
same way you did.
>> Are you on the most up to date firmware for your system from the
>> manufacturer?
>
> I updated my bios, pd firmware, and USB device firmware early August,
> when I was doing this testing.
>
>> We haven't seen anything like this reported on Strix Halo thus far and
>> we do internal stress testing on s0i3 on reference hardware.
>
> Cant find a reference for it on the bug tracker. I have four bug
> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
> for runtime crashes, where the culprit would be this. IE runtime gates
> VPE and causes a crash.
All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
what software?
BTW - Strix and Kraken also have VPE.
>
>> To me this seems likely to be a platform firmware bug; but I would like
>> to understand the timing of the gate vs ungate on good vs bad.
>
> Perhaps it is. It is either something like that or silicon quality.
>
>> IE is it possible the delayed work handler
>> amdgpu_device_delayed_init_work_handler() is causing a race with
>> vpe_ring_begin_use()?
>
> I don't think so. There is only a single ungate. Also, the crash
> happens on the gate. So what happens is the device wakes up, the
> screen turns on, kde clock works, then after a second it freezes,
> there is a softlock, and the device hangs.
>
> The failed command is always the VPE gate that is triggered after 1s in idle.
>
>> This should be possible to check without extra instrumentation by using
>> ftrace and looking at the timing of the 2 ring functions and the init
>> work handler and checking good vs bad cycles.
>
> I do not know how to use ftrace. I should also note that after the
> device freezes around 1/5 cycles will sync the fs, so it is also not a
> very easy thing to diagnose. The device just stops working. A lot of
> the logs I got were in pstore by forcing a kernel panic.
Here's how you capture the timing of functions. Each time the function
is called there will be an event in the trace buffer.
❯ sudo trace-cmd record -p function \
    -l amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler \
    -l vpe_ring_begin_use -l vpe_ring_end_use \
    -l amdgpu_pmops_suspend -l amdgpu_pmops_resume
Here's how you would review the report:
❯ trace-cmd report
cpus=24
kworker/u97:37-18051 [001] ..... 13655.970108: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u97:21-18036 [002] ..... 13666.290715: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u97:21-18036 [015] ..... 13666.308295: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u97:21-18036 [015] ..... 13666.308298: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/15:1-12285 [015] ..... 13666.960191: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/15:1-12285 [015] ..... 13666.963970: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/15:1-12285 [015] ..... 13666.965481: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/15:4-16354 [015] ..... 13667.981394: function: vpe_idle_work_handler <-- process_one_work
I did this on a Strix system just now to capture that.
You can see that basically the ring gets used before the delayed init
work handler, and then again from the ring tests. My concern is if the
sequence ever looks different than the above. If it does, we do have a
driver race condition.
It would also be helpful to look at the function_graph tracer.
Here's some more documentation about ftrace and trace-cmd.
https://www.kernel.org/doc/html/latest/trace/ftrace.html
https://lwn.net/Articles/410200/
You can probably also get an LLM to help you with building commands if
you're not familiar with it.
But if you're hung so badly you can't flush to disk, that's going to be
a problem without a UART. A few ideas:
1) You can use CONFIG_PSTORE_FTRACE.
2) If you add "tp_printk" to the kernel command line, it should make the
trace ring buffer flush to the kernel log ring buffer. But be warned:
this is going to change the timing, so the issue might go away entirely
or have a different failure rate. So hopefully <1> works.
>
> If you say that all IP blocks use 1s, perhaps an alternative solution
> would be to desync the idle times so they do not happen
> simultaneously. So 1000, 1200, 1400, etc.
>
> Antheas
>
I don't doubt that your proposal of changing the timing works. I just
want to make sure it's the right solution, because otherwise we might
change the timing or sequence elsewhere in the driver two years from now
and re-introduce the problem unintentionally.
* Re: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100
2025-08-24 19:33 ` Antheas Kapenekakis
@ 2025-08-25 7:02 ` Philip Mueller
0 siblings, 0 replies; 21+ messages in thread
From: Philip Mueller @ 2025-08-25 7:02 UTC (permalink / raw)
To: Antheas Kapenekakis, amd-gfx
Cc: dri-devel, linux-kernel, Alex Deucher, Christian König,
David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira,
Mario Limonciello, Peyton Lee, Lang Yu
On Sun, 2025-08-24 at 21:33 +0200, Antheas Kapenekakis wrote:
> Well Phil managed to fall into the value 332800, which has a 0 minor
> bit. Unfortunate. In hindsight, every 256 hundreds there would be a
> zero anyway.
>
> Before I made this patch I made a partial refactor of panel-quirks
> where a quirk like this could go to. But I would really prefer not to
> do quirks. Ill send that too.
>
> Antheas
I have already been looking into that OLED issue for several weeks.
Changing the granularity might hide the root cause, and you might hit
the issue less frequently.
Currently checking [1], which changes the first byte to 3, since when
DP_SOURCE_BACKLIGHT_LEVEL is written with the first byte being 0x00
(and sometimes 0x01), the panel forcibly turns off until the device
sleeps again.
In the end the panel vendor has to fix it in firmware. If not, a quirk
specific to each panel vendor might be better.
I'm still not sure if that refactoring is needed, or whether a separate
quirk function is more logical to be accepted upstream.
[1]
https://lore.kernel.org/lkml/20250824200202.1744335-5-lkml@antheas.dev/T/#u
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis
2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello
@ 2025-08-25 13:20 ` Alex Deucher
2025-08-25 13:33 ` Antheas Kapenekakis
2 siblings, 1 reply; 21+ messages in thread
From: Alex Deucher @ 2025-08-25 13:20 UTC (permalink / raw)
To: Antheas Kapenekakis
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>
> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> suspend resumes result in a soft lock around 1 second after the screen
> turns on (it freezes). This happens due to power gating VPE when it is
> not used, which happens 1 second after inactivity.
>
> Specifically, the VPE gating after resume is as follows: an initial
> ungate, followed by a gate in the resume process. Then,
> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> with VPE_IDLE_TIMEOUT (1s).
>
> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> SMU to hang and partially freezes half of the GPU IPs, with the thread
> that called the command being stuck processing it.
>
> Specifically, after that SMU command tries to run, we get the following:
>
> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> ...
> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> ...
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>
> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> PowerDownVpe(50) command which is the common failure point in all
> failed resumes.
>
> On a normal resume, we should get the following power gates:
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>
> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> time of 12s sleep, 8s resume. The suspected reason here is that 1s that
> when VPE is used, it needs a bit of time before it can be gated and
> there was a borderline delay before, which is not enough for Strix Halo.
> When the VPE is not used, such as on resume, gating it instantly does
> not seem to cause issues.
This doesn't make much sense. The VPE idle timeout is arbitrary. The
VPE idle work handler checks to see if the block is idle before it
power gates the block. If it's not idle, then the delayed work is
rescheduled, so changing the timing should not make a difference. We
are not powering down VPE while it still has active jobs. It sounds
like there is some race condition somewhere else.
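For reference, the handler follows roughly this pattern (simplified
sketch, not the verbatim amdgpu_vpe.c code):

static void vpe_idle_work_handler(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, vpe.idle_work.work);

	if (amdgpu_fence_count_emitted(&adev->vpe.ring) == 0)
		/* no outstanding jobs: gate the block */
		amdgpu_device_ip_set_powergating_state(adev,
				AMD_IP_BLOCK_TYPE_VPE, AMD_PG_STATE_GATE);
	else
		/* still busy: check again after another idle timeout */
		schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}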
Alex
>
> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> index 121ee17b522b..24f09e457352 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> @@ -34,8 +34,8 @@
> /* VPE CSA resides in the 4th page of CSA */
> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>
> -/* 1 second timeout */
> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> +/* 2 second timeout */
> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>
> #define VPE_MAX_DPM_LEVEL 4
> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>
> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> --
> 2.50.1
>
>
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-25 13:20 ` Alex Deucher
@ 2025-08-25 13:33 ` Antheas Kapenekakis
2025-08-25 14:01 ` Antheas Kapenekakis
0 siblings, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-25 13:33 UTC (permalink / raw)
To: Alex Deucher
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
> >
> > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> > suspend resumes result in a soft lock around 1 second after the screen
> > turns on (it freezes). This happens due to power gating VPE when it is
> > not used, which happens 1 second after inactivity.
> >
> > Specifically, the VPE gating after resume is as follows: an initial
> > ungate, followed by a gate in the resume process. Then,
> > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> > to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> > causes an ungate, After that test, vpe_idle_work_handler is scheduled
> > with VPE_IDLE_TIMEOUT (1s).
> >
> > When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> > SMU to hang and partially freezes half of the GPU IPs, with the thread
> > that called the command being stuck processing it.
> >
> > Specifically, after that SMU command tries to run, we get the following:
> >
> > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> > ...
> > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> > ...
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >
> > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> > a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> > PowerDownVpe(50) command which is the common failure point in all
> > failed resumes.
> >
> > On a normal resume, we should get the following power gates:
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >
> > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> > time of 12s sleep, 8s resume. The suspected reason here is that 1s that
> > when VPE is used, it needs a bit of time before it can be gated and
> > there was a borderline delay before, which is not enough for Strix Halo.
> > When the VPE is not used, such as on resume, gating it instantly does
> > not seem to cause issues.
>
> This doesn't make much sense. The VPE idle timeout is arbitrary. The
> VPE idle work handler checks to see if the block is idle before it
> powers gates the block. If it's not idle, then the delayed work is
> rescheduled so changing the timing should not make a difference. We
> are no powering down VPE while it still has active jobs. It sounds
> like there is some race condition somewhere else.
On resume, the VPE is ungated and then gated immediately, which does
not cause any crashes; after that, the delayed init work is scheduled
to run two seconds later. The tests then run and finish, which starts
the gate timer. Once the timer lapses and the kernel tries to gate
VPE, it crashes. I logged all SMU commands and there is no difference
between a crashing and a non-crashing resume, other than the fact that
the VPE gate command failed, which only becomes apparent when the next
command runs. I will also note that until the idle timer lapses, the
system is responsive.
Since the VPE is ungated in order to run the tests, I assume that in
my setup it is not used close to resume.
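Side note: the ret = -62 in the log is -ETIME, i.e. the SMU never
acknowledged the message within the driver's poll window. A purely
hypothetical sketch of that failure shape (illustration only, not the
actual smu_cmn code):

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/types.h>

/*
 * Hypothetical helper: poll a response register until the firmware
 * replies, or give up with -ETIME (-62), which is what later surfaces
 * as "Dpm disable vpe failed, ret = -62".
 */
static int poll_smu_response(u32 (*read_resp)(void), unsigned int timeout_us)
{
        unsigned int i;

        for (i = 0; i < timeout_us; i++) {
                if (read_resp())        /* non-zero: the SMU answered */
                        return 0;
                udelay(1);
        }
        return -ETIME;
}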
Antheas
> Alex
>
> >
> > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > index 121ee17b522b..24f09e457352 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > @@ -34,8 +34,8 @@
> > /* VPE CSA resides in the 4th page of CSA */
> > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> >
> > -/* 1 second timeout */
> > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> > +/* 2 second timeout */
> > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> >
> > #define VPE_MAX_DPM_LEVEL 4
> > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> >
> > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> > --
> > 2.50.1
> >
> >
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-25 1:38 ` Mario Limonciello
@ 2025-08-25 13:39 ` Antheas Kapenekakis
2025-08-26 13:41 ` Alex Deucher
0 siblings, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-25 13:39 UTC (permalink / raw)
To: Mario Limonciello
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
>
>
>
> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
> > On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
> >>
> >>
> >>
> >> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
> >>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> >>> suspend resumes result in a soft lock around 1 second after the screen
> >>> turns on (it freezes). This happens due to power gating VPE when it is
> >>> not used, which happens 1 second after inactivity.
> >>>
> >>> Specifically, the VPE gating after resume is as follows: an initial
> >>> ungate, followed by a gate in the resume process. Then,
> >>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> >>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> >>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> >>> with VPE_IDLE_TIMEOUT (1s).
> >>>
> >>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> >>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> >>> that called the command being stuck processing it.
> >>>
> >>> Specifically, after that SMU command tries to run, we get the following:
> >>>
> >>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> >>> ...
> >>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> >>> ...
> >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> >>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> >>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> >>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> >>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >>>
> >>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> >>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> >>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> >>> PowerDownVpe(50) command which is the common failure point in all
> >>> failed resumes.
> >>>
> >>> On a normal resume, we should get the following power gates:
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >>>
> >>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> >>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> >>> time of 12s sleep, 8s resume.
> >>
> >> When you say you reproduced with 12s sleep and 8s resume, was that
> >> 'amd-s2idle --duration 12 --wait 8'?
> >
> > I did not use amd-s2idle. I essentially used the script below with a
> > 12 on the wake alarm and 12 on the for loop. I also used pstore for
> > this testing.
> >
> > for i in {1..200}; do
> > echo "Suspend attempt $i"
> > echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
> > sudo sh -c 'echo mem > /sys/power/state'
> >
> > for j in {1..50}; do
> > # Use repeating sleep in case echo mem returns early
> > sleep 1
> > done
> > done
>
> 👍
>
> >
> >>> The suspected reason here is that 1s that
> >>> when VPE is used, it needs a bit of time before it can be gated and
> >>> there was a borderline delay before, which is not enough for Strix Halo.
> >>> When the VPE is not used, such as on resume, gating it instantly does
> >>> not seem to cause issues.
> >>>
> >>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> >>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> >>> ---
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>> index 121ee17b522b..24f09e457352 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>> @@ -34,8 +34,8 @@
> >>> /* VPE CSA resides in the 4th page of CSA */
> >>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> >>>
> >>> -/* 1 second timeout */
> >>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> >>> +/* 2 second timeout */
> >>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> >>>
> >>> #define VPE_MAX_DPM_LEVEL 4
> >>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> >>>
> >>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> >>
> >> 1s idle timeout has been used by other IPs for a long time.
> >> For example JPEG, UVD, VCN all use 1s.
> >>
> >> Can you please confirm both your AGESA and your SMU firmware version?
> >> In case you're not aware; you can get AGESA version from SMBIOS string
> >> (DMI type 40).
> >>
> >> ❯ sudo dmidecode | grep AGESA
> >
> > String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
> >
> >> You can get SMU firmware version from this:
> >>
> >> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> >
> > grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
> > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
> >
>
> Thanks, I'll get some folks to see if we match this AGESA version if we
> can also reproduce it on reference hardware the same way you did.
>
> >> Are you on the most up to date firmware for your system from the
> >> manufacturer?
> >
> > I updated my bios, pd firmware, and USB device firmware early August,
> > when I was doing this testing.
> >
> >> We haven't seen anything like this reported on Strix Halo thus far and
> >> we do internal stress testing on s0i3 on reference hardware.
> >
> > Cant find a reference for it on the bug tracker. I have four bug
> > reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
> > for runtime crashes, where the culprit would be this. IE runtime gates
> > VPE and causes a crash.
>
> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
> what software?
>
> BTW - Strix and Kraken also have VPE.
All on the Z13. Not tied to VPE necessarily. I just know that I get
reports of crashes on the Z13, and with this patch they are fixed for
me. It will be part of the next bazzite version so I will get feedback
about it.
I don't think which software is using the VPE is relevant. Perhaps for
the runtime crashes it is, and this patch helps in that case as well.
But in my case, the crash is caused after the ungate performed for the
tests that the delayed handler runs on resume.
The Z13 also has some other quirks with spurious wakeups when
connected to a charger. So, if systemd is configured to e.g. sleep
after 20 minutes, this crash combined with those wakeups means that a
device left plugged in overnight has crashed by morning.
> >
> >> To me this seems likely to be a platform firmware bug; but I would like
> >> to understand the timing of the gate vs ungate on good vs bad.
> >
> > Perhaps it is. It is either something like that or silicon quality.
> >
> >> IE is it possible the delayed work handler
> >> amdgpu_device_delayed_init_work_handler() is causing a race with
> >> vpe_ring_begin_use()?
> >
> > I don't think so. There is only a single ungate. Also, the crash
> > happens on the gate. So what happens is the device wakes up, the
> > screen turns on, kde clock works, then after a second it freezes,
> > there is a softlock, and the device hangs.
> >
> > The failed command is always the VPE gate that is triggered after 1s in idle.
> >
> >> This should be possible to check without extra instrumentation by using
> >> ftrace and looking at the timing of the 2 ring functions and the init
> >> work handler and checking good vs bad cycles.
> >
> > I do not know how to use ftrace. I should also note that after the
> > device freezes around 1/5 cycles will sync the fs, so it is also not a
> > very easy thing to diagnose. The device just stops working. A lot of
> > the logs I got were in pstore by forcing a kernel panic.
>
> Here's how you capture the timing of functions. Each time the function
> is called there will be an event in the trace buffer.
>
> ❯ sudo trace-cmd record -p function -l
> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
> amdgpu_pmops_resume
>
> Here's how you would review the report:
>
> ❯ trace-cmd report
> cpus=24
> kworker/u97:37-18051 [001] ..... 13655.970108: function:
> amdgpu_pmops_suspend <-- pci_pm_suspend
> kworker/u97:21-18036 [002] ..... 13666.290715: function:
> amdgpu_pmops_resume <-- dpm_run_callback
> kworker/u97:21-18036 [015] ..... 13666.308295: function:
> vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/u97:21-18036 [015] ..... 13666.308298: function:
> vpe_ring_end_use <-- vpe_ring_test_ring
> kworker/15:1-12285 [015] ..... 13666.960191: function:
> amdgpu_device_delayed_init_work_handler <-- process_one_work
> kworker/15:1-12285 [015] ..... 13666.963970: function:
> vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/15:1-12285 [015] ..... 13666.965481: function:
> vpe_ring_end_use <-- amdgpu_ib_schedule
> kworker/15:4-16354 [015] ..... 13667.981394: function:
> vpe_idle_work_handler <-- process_one_work
>
> I did this on a Strix system just now to capture that.
>
> You can see that basically the ring gets used before the delayed init
> work handler, and then again from the ring tests. My concern is if the
> sequence ever looks different than the above. If it does; we do have a
> driver race condition.
>
> It would also be helpful to look at the function_graph tracer.
>
> Here's some more documentation about ftrace and trace-cmd.
> https://www.kernel.org/doc/html/latest/trace/ftrace.html
> https://lwn.net/Articles/410200/
>
> You can probably also get an LLM to help you with building commands if
> you're not familiar with it.
>
> But if you're hung so bad you can't flush to disk that's going to be a
> problem without a UART. A few ideas:
Sometimes it flushes to disk.
> 1) You can use CONFIG_PSTORE_FTRACE
I can look into that
> 2) If you add "tp_printk" to the kernel command line it should make the
> trace ring buffer flush to kernel log ring buffer. But be warned this
> is going to change the timing, the issue might go away entirely or have
> a different failure rate. So hopefully <1> works.
> >
> > If you say that all IP blocks use 1s, perhaps an alternative solution
> > would be to desync the idle times so they do not happen
> > simultaneously. So 1000, 1200, 1400, etc.
> >
> > Antheas
> >
>
> I don't dobut your your proposal of changing the timing works. I just
> want to make sure it's the right solution because otherwise we might
> change the timing or sequence elsewhere in the driver two years from now
> and re-introduce the problem unintentionally.
If there are other idle timers and only this one changes to 2s, I will
agree that it would be peculiar. Although 1s seems arbitrary in any
case.
Antheas
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-25 13:33 ` Antheas Kapenekakis
@ 2025-08-25 14:01 ` Antheas Kapenekakis
2025-08-25 16:41 ` Mario Limonciello
0 siblings, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-25 14:01 UTC (permalink / raw)
To: Alex Deucher
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>
> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
> > >
> > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> > > suspend resumes result in a soft lock around 1 second after the screen
> > > turns on (it freezes). This happens due to power gating VPE when it is
> > > not used, which happens 1 second after inactivity.
> > >
> > > Specifically, the VPE gating after resume is as follows: an initial
> > > ungate, followed by a gate in the resume process. Then,
> > > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> > > to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> > > causes an ungate, After that test, vpe_idle_work_handler is scheduled
> > > with VPE_IDLE_TIMEOUT (1s).
> > >
> > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> > > SMU to hang and partially freezes half of the GPU IPs, with the thread
> > > that called the command being stuck processing it.
> > >
> > > Specifically, after that SMU command tries to run, we get the following:
> > >
> > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> > > ...
> > > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> > > ...
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> > > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> > > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> > > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> > > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> > >
> > > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> > > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> > > a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> > > PowerDownVpe(50) command which is the common failure point in all
> > > failed resumes.
> > >
> > > On a normal resume, we should get the following power gates:
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> > >
> > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> > > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> > > time of 12s sleep, 8s resume. The suspected reason here is that 1s that
> > > when VPE is used, it needs a bit of time before it can be gated and
> > > there was a borderline delay before, which is not enough for Strix Halo.
> > > When the VPE is not used, such as on resume, gating it instantly does
> > > not seem to cause issues.
> >
> > This doesn't make much sense. The VPE idle timeout is arbitrary. The
> > VPE idle work handler checks to see if the block is idle before it
> > powers gates the block. If it's not idle, then the delayed work is
> > rescheduled so changing the timing should not make a difference. We
> > are no powering down VPE while it still has active jobs. It sounds
> > like there is some race condition somewhere else.
>
> On resume, the vpe is ungated and gated instantly, which does not
> cause any crashes, then the delayed work is scheduled to run two
> seconds later. Then, the tests run and finish, which start the gate
> timer. After the timer lapses and the kernel tries to gate VPE, it
> crashes. I logged all SMU commands and there is no difference between
> the ones in a crash and not, other than the fact the VPE gate command
> failed. Which becomes apparent when the next command runs. I will also
> note that until the idle timer lapses, the system is responsive
>
> Since the VPE is ungated to run the tests, I assume that in my setup
> it is not used close to resume.
I should also add that I forced a kernel panic and dumped all CPU
backtraces in multiple logs. After the softlock, CPUs were either
parked in the scheduler, powered off, or stuck executing an SMU
command issued by e.g. a userspace usage-sensor graph. So it is not a
deadlock.
Antheas
> Antheas
>
> > Alex
> >
> > >
> > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> > > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> > > ---
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > index 121ee17b522b..24f09e457352 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > @@ -34,8 +34,8 @@
> > > /* VPE CSA resides in the 4th page of CSA */
> > > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> > >
> > > -/* 1 second timeout */
> > > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> > > +/* 2 second timeout */
> > > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> > >
> > > #define VPE_MAX_DPM_LEVEL 4
> > > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> > >
> > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> > > --
> > > 2.50.1
> > >
> > >
> >
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-25 14:01 ` Antheas Kapenekakis
@ 2025-08-25 16:41 ` Mario Limonciello
2025-08-25 21:00 ` Antheas Kapenekakis
0 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-08-25 16:41 UTC (permalink / raw)
To: Antheas Kapenekakis, Alex Deucher
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Peyton Lee, Lang Yu
On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote:
> On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>
>> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote:
>>>
>>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>>
>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>> not used, which happens 1 second after inactivity.
>>>>
>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>> ungate, followed by a gate in the resume process. Then,
>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>
>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>> that called the command being stuck processing it.
>>>>
>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>
>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>> ...
>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>> ...
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>
>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
>>>> PowerDownVpe(50) command which is the common failure point in all
>>>> failed resumes.
>>>>
>>>> On a normal resume, we should get the following power gates:
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>>
>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>> time of 12s sleep, 8s resume. The suspected reason here is that 1s that
>>>> when VPE is used, it needs a bit of time before it can be gated and
>>>> there was a borderline delay before, which is not enough for Strix Halo.
>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>> not seem to cause issues.
>>>
>>> This doesn't make much sense. The VPE idle timeout is arbitrary. The
>>> VPE idle work handler checks to see if the block is idle before it
>>> powers gates the block. If it's not idle, then the delayed work is
>>> rescheduled so changing the timing should not make a difference. We
>>> are no powering down VPE while it still has active jobs. It sounds
>>> like there is some race condition somewhere else.
>>
>> On resume, the vpe is ungated and gated instantly, which does not
>> cause any crashes, then the delayed work is scheduled to run two
>> seconds later. Then, the tests run and finish, which start the gate
>> timer. After the timer lapses and the kernel tries to gate VPE, it
>> crashes. I logged all SMU commands and there is no difference between
>> the ones in a crash and not, other than the fact the VPE gate command
>> failed. Which becomes apparent when the next command runs. I will also
>> note that until the idle timer lapses, the system is responsive
>>
>> Since the VPE is ungated to run the tests, I assume that in my setup
>> it is not used close to resume.
>
> I should also add that I forced a kernel panic and dumped all CPU
> backtraces in multiple logs. After the softlock, CPUs were either
> parked in the scheduler, powered off, or stuck executing an SMU
> command by e.g., a userspace usage sensor graph. So it is not a
> deadlock.
>
Can you please confirm if you are on the absolute latest linux-firmware
when you reproduced this issue?
Can you please share the debugfs output for amdgpu_firmware_info?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-25 16:41 ` Mario Limonciello
@ 2025-08-25 21:00 ` Antheas Kapenekakis
0 siblings, 0 replies; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-25 21:00 UTC (permalink / raw)
To: Mario Limonciello
Cc: Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Peyton Lee, Lang Yu
On Mon, 25 Aug 2025 at 18:41, Mario Limonciello <superm1@kernel.org> wrote:
>
> On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote:
> > On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote:
> >>
> >> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote:
> >>>
> >>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
> >>>>
> >>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> >>>> suspend resumes result in a soft lock around 1 second after the screen
> >>>> turns on (it freezes). This happens due to power gating VPE when it is
> >>>> not used, which happens 1 second after inactivity.
> >>>>
> >>>> Specifically, the VPE gating after resume is as follows: an initial
> >>>> ungate, followed by a gate in the resume process. Then,
> >>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> >>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> >>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> >>>> with VPE_IDLE_TIMEOUT (1s).
> >>>>
> >>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> >>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> >>>> that called the command being stuck processing it.
> >>>>
> >>>> Specifically, after that SMU command tries to run, we get the following:
> >>>>
> >>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> >>>> ...
> >>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> >>>> ...
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> >>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> >>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> >>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> >>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >>>>
> >>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> >>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> >>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> >>>> PowerDownVpe(50) command which is the common failure point in all
> >>>> failed resumes.
> >>>>
> >>>> On a normal resume, we should get the following power gates:
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >>>>
> >>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> >>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> >>>> time of 12s sleep, 8s resume. The suspected reason here is that 1s that
> >>>> when VPE is used, it needs a bit of time before it can be gated and
> >>>> there was a borderline delay before, which is not enough for Strix Halo.
> >>>> When the VPE is not used, such as on resume, gating it instantly does
> >>>> not seem to cause issues.
> >>>
> >>> This doesn't make much sense. The VPE idle timeout is arbitrary. The
> >>> VPE idle work handler checks to see if the block is idle before it
> >>> powers gates the block. If it's not idle, then the delayed work is
> >>> rescheduled so changing the timing should not make a difference. We
> >>> are no powering down VPE while it still has active jobs. It sounds
> >>> like there is some race condition somewhere else.
> >>
> >> On resume, the vpe is ungated and gated instantly, which does not
> >> cause any crashes, then the delayed work is scheduled to run two
> >> seconds later. Then, the tests run and finish, which start the gate
> >> timer. After the timer lapses and the kernel tries to gate VPE, it
> >> crashes. I logged all SMU commands and there is no difference between
> >> the ones in a crash and not, other than the fact the VPE gate command
> >> failed. Which becomes apparent when the next command runs. I will also
> >> note that until the idle timer lapses, the system is responsive
> >>
> >> Since the VPE is ungated to run the tests, I assume that in my setup
> >> it is not used close to resume.
> >
> > I should also add that I forced a kernel panic and dumped all CPU
> > backtraces in multiple logs. After the softlock, CPUs were either
> > parked in the scheduler, powered off, or stuck executing an SMU
> > command by e.g., a userspace usage sensor graph. So it is not a
> > deadlock.
> >
>
> Can you please confirm if you are on the absolute latest linux-firmware
> when you reproduced this issue?
I was on the latest at the time, built from source. I think it was
commit 08ee93ff8ffa. There seems to have been an update today, though.
> Can you please share the debugfs output for amdgpu_firmware_info.
Here is the information from it:
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x0000001f
PFP feature version: 35, firmware version: 0x0000002c
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x11530505
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x11530505
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x0000001f
IMU feature version: 0, firmware version: 0x0b352300
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648366, firmware version: 0x210000ee
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000044
TA DTM feature version: 0x00000000, firmware version: 0x12000018
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00647000 (100.112.0)
SDMA0 feature version: 60, firmware version: 0x0000000e
VCN feature version: 0, firmware version: 0x0911800b
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x09002600
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007c
VPE feature version: 60, firmware version: 0x00000016
VBIOS version: 113-STRXLGEN-001
I see there was an update today, though.
Antheas
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-25 13:39 ` Antheas Kapenekakis
@ 2025-08-26 13:41 ` Alex Deucher
2025-08-26 19:19 ` Mario Limonciello
0 siblings, 1 reply; 21+ messages in thread
From: Alex Deucher @ 2025-08-26 13:41 UTC (permalink / raw)
To: Antheas Kapenekakis
Cc: Mario Limonciello, amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>
> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
> >
> >
> >
> > On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
> > > On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
> > >>
> > >>
> > >>
> > >> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
> > >>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> > >>> suspend resumes result in a soft lock around 1 second after the screen
> > >>> turns on (it freezes). This happens due to power gating VPE when it is
> > >>> not used, which happens 1 second after inactivity.
> > >>>
> > >>> Specifically, the VPE gating after resume is as follows: an initial
> > >>> ungate, followed by a gate in the resume process. Then,
> > >>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> > >>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> > >>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> > >>> with VPE_IDLE_TIMEOUT (1s).
> > >>>
> > >>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> > >>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> > >>> that called the command being stuck processing it.
> > >>>
> > >>> Specifically, after that SMU command tries to run, we get the following:
> > >>>
> > >>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> > >>> ...
> > >>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> > >>> ...
> > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> > >>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> > >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> > >>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> > >>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> > >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> > >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> > >>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> > >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> > >>>
> > >>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> > >>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> > >>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> > >>> PowerDownVpe(50) command which is the common failure point in all
> > >>> failed resumes.
> > >>>
> > >>> On a normal resume, we should get the following power gates:
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> > >>>
> > >>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> > >>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> > >>> time of 12s sleep, 8s resume.
> > >>
> > >> When you say you reproduced with 12s sleep and 8s resume, was that
> > >> 'amd-s2idle --duration 12 --wait 8'?
> > >
> > > I did not use amd-s2idle. I essentially used the script below with a
> > > 12 on the wake alarm and 12 on the for loop. I also used pstore for
> > > this testing.
> > >
> > > for i in {1..200}; do
> > > echo "Suspend attempt $i"
> > > echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
> > > sudo sh -c 'echo mem > /sys/power/state'
> > >
> > > for j in {1..50}; do
> > > # Use repeating sleep in case echo mem returns early
> > > sleep 1
> > > done
> > > done
> >
> > 👍
> >
> > >
> > >>> The suspected reason here is that 1s that
> > >>> when VPE is used, it needs a bit of time before it can be gated and
> > >>> there was a borderline delay before, which is not enough for Strix Halo.
> > >>> When the VPE is not used, such as on resume, gating it instantly does
> > >>> not seem to cause issues.
> > >>>
> > >>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> > >>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> > >>> ---
> > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> > >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > >>> index 121ee17b522b..24f09e457352 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > >>> @@ -34,8 +34,8 @@
> > >>> /* VPE CSA resides in the 4th page of CSA */
> > >>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> > >>>
> > >>> -/* 1 second timeout */
> > >>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> > >>> +/* 2 second timeout */
> > >>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> > >>>
> > >>> #define VPE_MAX_DPM_LEVEL 4
> > >>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> > >>>
> > >>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> > >>
> > >> 1s idle timeout has been used by other IPs for a long time.
> > >> For example JPEG, UVD, VCN all use 1s.
> > >>
> > >> Can you please confirm both your AGESA and your SMU firmware version?
> > >> In case you're not aware; you can get AGESA version from SMBIOS string
> > >> (DMI type 40).
> > >>
> > >> ❯ sudo dmidecode | grep AGESA
> > >
> > > String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
> > >
> > >> You can get SMU firmware version from this:
> > >>
> > >> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> > >
> > > grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> > > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
> > > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
> > >
> >
> > Thanks, I'll get some folks to see if we match this AGESA version if we
> > can also reproduce it on reference hardware the same way you did.
> >
> > >> Are you on the most up to date firmware for your system from the
> > >> manufacturer?
> > >
> > > I updated my bios, pd firmware, and USB device firmware early August,
> > > when I was doing this testing.
> > >
> > >> We haven't seen anything like this reported on Strix Halo thus far and
> > >> we do internal stress testing on s0i3 on reference hardware.
> > >
> > > Cant find a reference for it on the bug tracker. I have four bug
> > > reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
> > > for runtime crashes, where the culprit would be this. IE runtime gates
> > > VPE and causes a crash.
> >
> > All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
> > what software?
> >
> > BTW - Strix and Kraken also have VPE.
>
> All on the Z13. Not tied to VPE necessarily. I just know that I get
> reports of crashes on the Z13, and with this patch they are fixed for
> me. It will be part of the next bazzite version so I will get feedback
> about it.
>
> I don't think software that is using the VPE is relevant. Perhaps for
> the runtime crashes it is and this patch helps in that case as well.
> But in my case, the crash is caused after the ungate that runs the
> tests on resume on the delayed handler.
>
> The Z13 also has some other quirks with spurious wakeups when
> connected to a charger. So, if systemd is configured to e.g., sleep
> after 20m, combined with this crash if it stays plugged in overnight
> in the morning it has crashed.
>
> > >
> > >> To me this seems likely to be a platform firmware bug; but I would like
> > >> to understand the timing of the gate vs ungate on good vs bad.
> > >
> > > Perhaps it is. It is either something like that or silicon quality.
> > >
> > >> IE is it possible the delayed work handler
> > >> amdgpu_device_delayed_init_work_handler() is causing a race with
> > >> vpe_ring_begin_use()?
> > >
> > > I don't think so. There is only a single ungate. Also, the crash
> > > happens on the gate. So what happens is the device wakes up, the
> > > screen turns on, kde clock works, then after a second it freezes,
> > > there is a softlock, and the device hangs.
> > >
> > > The failed command is always the VPE gate that is triggered after 1s in idle.
> > >
> > >> This should be possible to check without extra instrumentation by using
> > >> ftrace and looking at the timing of the 2 ring functions and the init
> > >> work handler and checking good vs bad cycles.
> > >
> > > I do not know how to use ftrace. I should also note that after the
> > > device freezes around 1/5 cycles will sync the fs, so it is also not a
> > > very easy thing to diagnose. The device just stops working. A lot of
> > > the logs I got were in pstore by forcing a kernel panic.
> >
> > Here's how you capture the timing of functions. Each time the function
> > is called there will be an event in the trace buffer.
> >
> > ❯ sudo trace-cmd record -p function -l
> > amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
> > vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
> > amdgpu_pmops_resume
> >
> > Here's how you would review the report:
> >
> > ❯ trace-cmd report
> > cpus=24
> > kworker/u97:37-18051 [001] ..... 13655.970108: function:
> > amdgpu_pmops_suspend <-- pci_pm_suspend
> > kworker/u97:21-18036 [002] ..... 13666.290715: function:
> > amdgpu_pmops_resume <-- dpm_run_callback
> > kworker/u97:21-18036 [015] ..... 13666.308295: function:
> > vpe_ring_begin_use <-- amdgpu_ring_alloc
> > kworker/u97:21-18036 [015] ..... 13666.308298: function:
> > vpe_ring_end_use <-- vpe_ring_test_ring
> > kworker/15:1-12285 [015] ..... 13666.960191: function:
> > amdgpu_device_delayed_init_work_handler <-- process_one_work
> > kworker/15:1-12285 [015] ..... 13666.963970: function:
> > vpe_ring_begin_use <-- amdgpu_ring_alloc
> > kworker/15:1-12285 [015] ..... 13666.965481: function:
> > vpe_ring_end_use <-- amdgpu_ib_schedule
> > kworker/15:4-16354 [015] ..... 13667.981394: function:
> > vpe_idle_work_handler <-- process_one_work
> >
> > I did this on a Strix system just now to capture that.
> >
> > You can see that basically the ring gets used before the delayed init
> > work handler, and then again from the ring tests. My concern is if the
> > sequence ever looks different than the above. If it does; we do have a
> > driver race condition.
> >
> > It would also be helpful to look at the function_graph tracer.
> >
> > Here's some more documentation about ftrace and trace-cmd.
> > https://www.kernel.org/doc/html/latest/trace/ftrace.html
> > https://lwn.net/Articles/410200/
> >
> > You can probably also get an LLM to help you with building commands if
> > you're not familiar with it.
> >
> > But if you're hung so bad you can't flush to disk that's going to be a
> > problem without a UART. A few ideas:
>
> Some times it flushes to disk
>
> > 1) You can use CONFIG_PSTORE_FTRACE
>
> I can look into that
>
> > 2) If you add "tp_printk" to the kernel command line it should make the
> > trace ring buffer flush to kernel log ring buffer. But be warned this
> > is going to change the timing, the issue might go away entirely or have
> > a different failure rate. So hopefully <1> works.
> > >
> > > If you say that all IP blocks use 1s, perhaps an alternative solution
> > > would be to desync the idle times so they do not happen
> > > simultaneously. So 1000, 1200, 1400, etc.
> > >
> > > Antheas
> > >
> >
> > I don't dobut your your proposal of changing the timing works. I just
> > want to make sure it's the right solution because otherwise we might
> > change the timing or sequence elsewhere in the driver two years from now
> > and re-introduce the problem unintentionally.
>
> If there are other idle timers and only this one changes to 2s, I will
> agree and say that it would be peculiar. Although 1s seems arbitrary
> in any case.
All of these timers are arbitrary. They simply provide a future point
at which we can check whether the engine is idle. The idle work
handler will either power down the IP if it is idle, or re-schedule
itself and try again later if there is still work. Making the value
longer will use more power, as it waits longer before checking whether
the engine is idle. Making it shorter will save more power, but adds
extra overhead in that the engine will be powered up/down more often.
In most cases, the jobs should complete in a few ms. The timer is
there to avoid the overhead of powering the block up/down too
frequently when applications are using the engine.
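That up/down cycling is driven by the ring hooks that show up in the
trace above; roughly (a sketch of the vpe_ring_begin_use() /
vpe_ring_end_use() pattern, details approximate):

static void vpe_ring_begin_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        /* New work incoming: cancel any pending idle check and power up. */
        cancel_delayed_work_sync(&adev->vpe.idle_work);
        amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VPE,
                                               AMD_PG_STATE_UNGATE);
}

static void vpe_ring_end_use(struct amdgpu_ring *ring)
{
        struct amdgpu_device *adev = ring->adev;

        /* Work submitted: arm the idle check VPE_IDLE_TIMEOUT from now. */
        schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}

So a longer VPE_IDLE_TIMEOUT only delays when the idle check runs; it
does not change whether the block is considered idle at that point.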
Alex
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-26 13:41 ` Alex Deucher
@ 2025-08-26 19:19 ` Mario Limonciello
2025-08-26 19:21 ` Antheas Kapenekakis
0 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-08-26 19:19 UTC (permalink / raw)
To: Alex Deucher, Antheas Kapenekakis
Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On 8/26/2025 8:41 AM, Alex Deucher wrote:
> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>
>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
>>>
>>>
>>>
>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>>>> not used, which happens 1 second after inactivity.
>>>>>>
>>>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>>>> ungate, followed by a gate in the resume process. Then,
>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
>>>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>>>
>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>>>> that called the command being stuck processing it.
>>>>>>
>>>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>>>
>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>>>> ...
>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>>>> ...
>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>>>
>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
>>>>>> PowerDownVpe(50) command which is the common failure point in all
>>>>>> failed resumes.
>>>>>>
>>>>>> On a normal resume, we should get the following power gates:
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>>>>
>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>>>> time of 12s sleep, 8s resume.
>>>>>
>>>>> When you say you reproduced with 12s sleep and 8s resume, was that
>>>>> 'amd-s2idle --duration 12 --wait 8'?
>>>>
>>>> I did not use amd-s2idle. I essentially used the script below with a
>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for
>>>> this testing.
>>>>
>>>> for i in {1..200}; do
>>>> echo "Suspend attempt $i"
>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
>>>> sudo sh -c 'echo mem > /sys/power/state'
>>>>
>>>> for j in {1..50}; do
>>>> # Use repeating sleep in case echo mem returns early
>>>> sleep 1
>>>> done
>>>> done
>>>
>>> 👍
>>>
>>>>
>>>>>> The suspected reason here is that 1s that
>>>>>> when VPE is used, it needs a bit of time before it can be gated and
>>>>>> there was a borderline delay before, which is not enough for Strix Halo.
>>>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>>>> not seem to cause issues.
>>>>>>
>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>> index 121ee17b522b..24f09e457352 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>> @@ -34,8 +34,8 @@
>>>>>> /* VPE CSA resides in the 4th page of CSA */
>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>>>>>>
>>>>>> -/* 1 second timeout */
>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
>>>>>> +/* 2 second timeout */
>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>>>>>>
>>>>>> #define VPE_MAX_DPM_LEVEL 4
>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>>>>>>
>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
>>>>>
>>>>> 1s idle timeout has been used by other IPs for a long time.
>>>>> For example JPEG, UVD, VCN all use 1s.
>>>>>
>>>>> Can you please confirm both your AGESA and your SMU firmware version?
>>>>> In case you're not aware; you can get AGESA version from SMBIOS string
>>>>> (DMI type 40).
>>>>>
>>>>> ❯ sudo dmidecode | grep AGESA
>>>>
>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
>>>>
>>>>> You can get SMU firmware version from this:
>>>>>
>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>
>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
>>>>
>>>
>>> Thanks, I'll get some folks to see if we match this AGESA version if we
>>> can also reproduce it on reference hardware the same way you did.
>>>
>>>>> Are you on the most up to date firmware for your system from the
>>>>> manufacturer?
>>>>
>>>> I updated my bios, pd firmware, and USB device firmware early August,
>>>> when I was doing this testing.
>>>>
>>>>> We haven't seen anything like this reported on Strix Halo thus far and
>>>>> we do internal stress testing on s0i3 on reference hardware.
>>>>
>>>> Can't find a reference for it on the bug tracker. I have four bug
>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
>>>> for runtime crashes, where the culprit would be this. IE runtime gates
>>>> VPE and causes a crash.
>>>
>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
>>> what software?
>>>
>>> BTW - Strix and Kraken also have VPE.
>>
>> All on the Z13. Not tied to VPE necessarily. I just know that I get
>> reports of crashes on the Z13, and with this patch they are fixed for
>> me. It will be part of the next bazzite version so I will get feedback
>> about it.
>>
>> I don't think software that is using the VPE is relevant. Perhaps for
>> the runtime crashes it is and this patch helps in that case as well.
>> But in my case, the crash is caused after the ungate that runs the
>> tests on resume on the delayed handler.
>>
>> The Z13 also has some other quirks with spurious wakeups when
>> connected to a charger. So, if systemd is configured to e.g., sleep
>> after 20m, combined with this crash if it stays plugged in overnight
>> in the morning it has crashed.
>>
>>>>
>>>>> To me this seems likely to be a platform firmware bug; but I would like
>>>>> to understand the timing of the gate vs ungate on good vs bad.
>>>>
>>>> Perhaps it is. It is either something like that or silicon quality.
>>>>
>>>>> IE is it possible the delayed work handler
>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with
>>>>> vpe_ring_begin_use()?
>>>>
>>>> I don't think so. There is only a single ungate. Also, the crash
>>>> happens on the gate. So what happens is the device wakes up, the
>>>> screen turns on, kde clock works, then after a second it freezes,
>>>> there is a softlock, and the device hangs.
>>>>
>>>> The failed command is always the VPE gate that is triggered after 1s in idle.
>>>>
>>>>> This should be possible to check without extra instrumentation by using
>>>>> ftrace and looking at the timing of the 2 ring functions and the init
>>>>> work handler and checking good vs bad cycles.
>>>>
>>>> I do not know how to use ftrace. I should also note that after the
>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a
>>>> very easy thing to diagnose. The device just stops working. A lot of
>>>> the logs I got were in pstore by forcing a kernel panic.
>>>
>>> Here's how you capture the timing of functions. Each time the function
>>> is called there will be an event in the trace buffer.
>>>
>>> ❯ sudo trace-cmd record -p function -l
>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
>>> amdgpu_pmops_resume
>>>
>>> Here's how you would review the report:
>>>
>>> ❯ trace-cmd report
>>> cpus=24
>>> kworker/u97:37-18051 [001] ..... 13655.970108: function:
>>> amdgpu_pmops_suspend <-- pci_pm_suspend
>>> kworker/u97:21-18036 [002] ..... 13666.290715: function:
>>> amdgpu_pmops_resume <-- dpm_run_callback
>>> kworker/u97:21-18036 [015] ..... 13666.308295: function:
>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>> kworker/u97:21-18036 [015] ..... 13666.308298: function:
>>> vpe_ring_end_use <-- vpe_ring_test_ring
>>> kworker/15:1-12285 [015] ..... 13666.960191: function:
>>> amdgpu_device_delayed_init_work_handler <-- process_one_work
>>> kworker/15:1-12285 [015] ..... 13666.963970: function:
>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>> kworker/15:1-12285 [015] ..... 13666.965481: function:
>>> vpe_ring_end_use <-- amdgpu_ib_schedule
>>> kworker/15:4-16354 [015] ..... 13667.981394: function:
>>> vpe_idle_work_handler <-- process_one_work
>>>
>>> I did this on a Strix system just now to capture that.
>>>
>>> You can see that basically the ring gets used before the delayed init
>>> work handler, and then again from the ring tests. My concern is if the
>>> sequence ever looks different than the above. If it does; we do have a
>>> driver race condition.
>>>
>>> It would also be helpful to look at the function_graph tracer.
>>>
>>> Here's some more documentation about ftrace and trace-cmd.
>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html
>>> https://lwn.net/Articles/410200/
>>>
>>> You can probably also get an LLM to help you with building commands if
>>> you're not familiar with it.
>>>
>>> But if you're hung so bad you can't flush to disk that's going to be a
>>> problem without a UART. A few ideas:
>>
>> Sometimes it flushes to disk
>>
>>> 1) You can use CONFIG_PSTORE_FTRACE
>>
>> I can look into that
>>
>>> 2) If you add "tp_printk" to the kernel command line it should make the
>>> trace ring buffer flush to kernel log ring buffer. But be warned this
>>> is going to change the timing, the issue might go away entirely or have
>>> a different failure rate. So hopefully <1> works.
>>>>
>>>> If you say that all IP blocks use 1s, perhaps an alternative solution
>>>> would be to desync the idle times so they do not happen
>>>> simultaneously. So 1000, 1200, 1400, etc.
>>>>
>>>> Antheas
>>>>
>>>
>>> I don't doubt that your proposal of changing the timing works. I just
>>> want to make sure it's the right solution because otherwise we might
>>> change the timing or sequence elsewhere in the driver two years from now
>>> and re-introduce the problem unintentionally.
>>
>> If there are other idle timers and only this one changes to 2s, I will
>> agree and say that it would be peculiar. Although 1s seems arbitrary
>> in any case.
>
> All of these timers are arbitrary. Their point is just to provide a
> future point where we can check if the engine is idle. The idle work
> handler will either power down the IP if it is idle or re-schedule in
> the future and try again if there is still work. Making the value
> longer will use more power as it will wait longer before checking if
> the engine is idle. Making it shorter will save more power, but adds
> extra overhead in that the engine will be powered up/down more often.
> In most cases, the jobs should complete in a few ms. The timer is
> there to avoid the overhead of powering up/down the block too
> frequently when applications are using the engine.
>
> Alex
We gave it a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b
or 1001c AGESA on a reference system, but unfortunately didn't
reproduce the issue with a 200-cycle attempt on either kernel or
either BIOS (so we had 800 cycles total).
Was your base a bazzite kernel or was it an upstream kernel? I know
there are some other patches in bazzite especially relevant to suspend,
so I wonder if they could be influencing the timing.
Can you repro on 6.17-rc3?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-26 19:19 ` Mario Limonciello
@ 2025-08-26 19:21 ` Antheas Kapenekakis
2025-08-26 20:12 ` Matthew Schwartz
0 siblings, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-26 19:21 UTC (permalink / raw)
To: Mario Limonciello
Cc: Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher,
Christian König, David Airlie, Simona Vetter, Harry Wentland,
Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu
On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote:
>
> On 8/26/2025 8:41 AM, Alex Deucher wrote:
> > On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
> >>
> >> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
> >>>
> >>>
> >>>
> >>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
> >>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
> >>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> >>>>>> suspend resumes result in a soft lock around 1 second after the screen
> >>>>>> turns on (it freezes). This happens due to power gating VPE when it is
> >>>>>> not used, which happens 1 second after inactivity.
> >>>>>>
> >>>>>> Specifically, the VPE gating after resume is as follows: an initial
> >>>>>> ungate, followed by a gate in the resume process. Then,
> >>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> >>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> >>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> >>>>>> with VPE_IDLE_TIMEOUT (1s).
> >>>>>>
> >>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> >>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> >>>>>> that called the command being stuck processing it.
> >>>>>>
> >>>>>> Specifically, after that SMU command tries to run, we get the following:
> >>>>>>
> >>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> >>>>>> ...
> >>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> >>>>>> ...
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> >>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> >>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> >>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> >>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> >>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> >>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> >>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> >>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >>>>>>
> >>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> >>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> >>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> >>>>>> PowerDownVpe(50) command which is the common failure point in all
> >>>>>> failed resumes.
> >>>>>>
> >>>>>> On a normal resume, we should get the following power gates:
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >>>>>>
> >>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> >>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> >>>>>> time of 12s sleep, 8s resume.
> >>>>>
> >>>>> When you say you reproduced with 12s sleep and 8s resume, was that
> >>>>> 'amd-s2idle --duration 12 --wait 8'?
> >>>>
> >>>> I did not use amd-s2idle. I essentially used the script below with a
> >>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for
> >>>> this testing.
> >>>>
> >>>> for i in {1..200}; do
> >>>> echo "Suspend attempt $i"
> >>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
> >>>> sudo sh -c 'echo mem > /sys/power/state'
> >>>>
> >>>> for j in {1..50}; do
> >>>> # Use repeating sleep in case echo mem returns early
> >>>> sleep 1
> >>>> done
> >>>> done
> >>>
> >>> 👍
> >>>
> >>>>
> >>>>>> The suspected reason here is that 1s that
> >>>>>> when VPE is used, it needs a bit of time before it can be gated and
> >>>>>> there was a borderline delay before, which is not enough for Strix Halo.
> >>>>>> When the VPE is not used, such as on resume, gating it instantly does
> >>>>>> not seem to cause issues.
> >>>>>>
> >>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> >>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> >>>>>> ---
> >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> >>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>>>>> index 121ee17b522b..24f09e457352 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>>>>> @@ -34,8 +34,8 @@
> >>>>>> /* VPE CSA resides in the 4th page of CSA */
> >>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> >>>>>>
> >>>>>> -/* 1 second timeout */
> >>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> >>>>>> +/* 2 second timeout */
> >>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> >>>>>>
> >>>>>> #define VPE_MAX_DPM_LEVEL 4
> >>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> >>>>>>
> >>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> >>>>>
> >>>>> 1s idle timeout has been used by other IPs for a long time.
> >>>>> For example JPEG, UVD, VCN all use 1s.
> >>>>>
> >>>>> Can you please confirm both your AGESA and your SMU firmware version?
> >>>>> In case you're not aware; you can get AGESA version from SMBIOS string
> >>>>> (DMI type 40).
> >>>>>
> >>>>> ❯ sudo dmidecode | grep AGESA
> >>>>
> >>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
> >>>>
> >>>>> You can get SMU firmware version from this:
> >>>>>
> >>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> >>>>
> >>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> >>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
> >>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
> >>>>
> >>>
> >>> Thanks, I'll get some folks to see if we match this AGESA version if we
> >>> can also reproduce it on reference hardware the same way you did.
> >>>
> >>>>> Are you on the most up to date firmware for your system from the
> >>>>> manufacturer?
> >>>>
> >>>> I updated my bios, pd firmware, and USB device firmware early August,
> >>>> when I was doing this testing.
> >>>>
> >>>>> We haven't seen anything like this reported on Strix Halo thus far and
> >>>>> we do internal stress testing on s0i3 on reference hardware.
> >>>>
> >>>> Can't find a reference for it on the bug tracker. I have four bug
> >>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
> >>>> for runtime crashes, where the culprit would be this. IE runtime gates
> >>>> VPE and causes a crash.
> >>>
> >>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
> >>> what software?
> >>>
> >>> BTW - Strix and Kraken also have VPE.
> >>
> >> All on the Z13. Not tied to VPE necessarily. I just know that I get
> >> reports of crashes on the Z13, and with this patch they are fixed for
> >> me. It will be part of the next bazzite version so I will get feedback
> >> about it.
> >>
> >> I don't think software that is using the VPE is relevant. Perhaps for
> >> the runtime crashes it is and this patch helps in that case as well.
> >> But in my case, the crash is caused after the ungate that runs the
> >> tests on resume on the delayed handler.
> >>
> >> The Z13 also has some other quirks with spurious wakeups when
> >> connected to a charger. So, if systemd is configured to e.g., sleep
> >> after 20m, combined with this crash if it stays plugged in overnight
> >> in the morning it has crashed.
> >>
> >>>>
> >>>>> To me this seems likely to be a platform firmware bug; but I would like
> >>>>> to understand the timing of the gate vs ungate on good vs bad.
> >>>>
> >>>> Perhaps it is. It is either something like that or silicon quality.
> >>>>
> >>>>> IE is it possible the delayed work handler
> >>>>> amdgpu_device_delayed_init_work_handler() is causing a race with
> >>>>> vpe_ring_begin_use()?
> >>>>
> >>>> I don't think so. There is only a single ungate. Also, the crash
> >>>> happens on the gate. So what happens is the device wakes up, the
> >>>> screen turns on, kde clock works, then after a second it freezes,
> >>>> there is a softlock, and the device hangs.
> >>>>
> >>>> The failed command is always the VPE gate that is triggered after 1s in idle.
> >>>>
> >>>>> This should be possible to check without extra instrumentation by using
> >>>>> ftrace and looking at the timing of the 2 ring functions and the init
> >>>>> work handler and checking good vs bad cycles.
> >>>>
> >>>> I do not know how to use ftrace. I should also note that after the
> >>>> device freezes around 1/5 cycles will sync the fs, so it is also not a
> >>>> very easy thing to diagnose. The device just stops working. A lot of
> >>>> the logs I got were in pstore by forcing a kernel panic.
> >>>
> >>> Here's how you capture the timing of functions. Each time the function
> >>> is called there will be an event in the trace buffer.
> >>>
> >>> ❯ sudo trace-cmd record -p function -l
> >>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
> >>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
> >>> amdgpu_pmops_resume
> >>>
> >>> Here's how you would review the report:
> >>>
> >>> ❯ trace-cmd report
> >>> cpus=24
> >>> kworker/u97:37-18051 [001] ..... 13655.970108: function:
> >>> amdgpu_pmops_suspend <-- pci_pm_suspend
> >>> kworker/u97:21-18036 [002] ..... 13666.290715: function:
> >>> amdgpu_pmops_resume <-- dpm_run_callback
> >>> kworker/u97:21-18036 [015] ..... 13666.308295: function:
> >>> vpe_ring_begin_use <-- amdgpu_ring_alloc
> >>> kworker/u97:21-18036 [015] ..... 13666.308298: function:
> >>> vpe_ring_end_use <-- vpe_ring_test_ring
> >>> kworker/15:1-12285 [015] ..... 13666.960191: function:
> >>> amdgpu_device_delayed_init_work_handler <-- process_one_work
> >>> kworker/15:1-12285 [015] ..... 13666.963970: function:
> >>> vpe_ring_begin_use <-- amdgpu_ring_alloc
> >>> kworker/15:1-12285 [015] ..... 13666.965481: function:
> >>> vpe_ring_end_use <-- amdgpu_ib_schedule
> >>> kworker/15:4-16354 [015] ..... 13667.981394: function:
> >>> vpe_idle_work_handler <-- process_one_work
> >>>
> >>> I did this on a Strix system just now to capture that.
> >>>
> >>> You can see that basically the ring gets used before the delayed init
> >>> work handler, and then again from the ring tests. My concern is if the
> >>> sequence ever looks different than the above. If it does; we do have a
> >>> driver race condition.
> >>>
> >>> It would also be helpful to look at the function_graph tracer.
> >>>
> >>> Here's some more documentation about ftrace and trace-cmd.
> >>> https://www.kernel.org/doc/html/latest/trace/ftrace.html
> >>> https://lwn.net/Articles/410200/
> >>>
> >>> You can probably also get an LLM to help you with building commands if
> >>> you're not familiar with it.
> >>>
> >>> But if you're hung so bad you can't flush to disk that's going to be a
> >>> problem without a UART. A few ideas:
> >>
> >> Sometimes it flushes to disk
> >>
> >>> 1) You can use CONFIG_PSTORE_FTRACE
> >>
> >> I can look into that
> >>
> >>> 2) If you add "tp_printk" to the kernel command line it should make the
> >>> trace ring buffer flush to kernel log ring buffer. But be warned this
> >>> is going to change the timing, the issue might go away entirely or have
> >>> a different failure rate. So hopefully <1> works.
> >>>>
> >>>> If you say that all IP blocks use 1s, perhaps an alternative solution
> >>>> would be to desync the idle times so they do not happen
> >>>> simultaneously. So 1000, 1200, 1400, etc.
> >>>>
> >>>> Antheas
> >>>>
> >>>
> >>> I don't doubt that your proposal of changing the timing works. I just
> >>> want to make sure it's the right solution because otherwise we might
> >>> change the timing or sequence elsewhere in the driver two years from now
> >>> and re-introduce the problem unintentionally.
> >>
> >> If there are other idle timers and only this one changes to 2s, I will
> >> agree and say that it would be peculiar. Although 1s seems arbitrary
> >> in any case.
> >
> > All of these timers are arbitrary. Their point is just to provide a
> > future point where we can check if the engine is idle. The idle work
> > handler will either power down the IP if it is idle or re-schedule in
> > the future and try again if there is still work. Making the value
> > longer will use more power as it will wait longer before checking if
> > the engine is idle. Making it shorter will save more power, but adds
> > extra overhead in that the engine will be powered up/down more often.
> > In most cases, the jobs should complete in a few ms. The timer is
> > there to avoid the overhead of powering up/down the block too
> > frequently when applications are using the engine.
> >
> > Alex
>
> We gave it a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b
> or 1001c AGESA on a reference system, but unfortunately didn't
> reproduce the issue with a 200-cycle attempt on either kernel or
> either BIOS (so we had 800 cycles total).
I think I did 6.12, 6.15, and a stock 6.16-rc. I will have to come
back to you with 6.17-rc3.
> Was your base a bazzite kernel or was it an upstream kernel? I know
> there are some other patches in bazzite especially relevant to suspend,
> so I wonder if they could be influencing the timing.
>
> Can you repro on 6.17-rc3?
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-26 19:21 ` Antheas Kapenekakis
@ 2025-08-26 20:12 ` Matthew Schwartz
2025-08-26 20:58 ` Antheas Kapenekakis
0 siblings, 1 reply; 21+ messages in thread
From: Matthew Schwartz @ 2025-08-26 20:12 UTC (permalink / raw)
To: Antheas Kapenekakis
Cc: Mario Limonciello, Alex Deucher, amd-gfx, dri-devel, linux-kernel,
Alex Deucher, Christian König, David Airlie, Simona Vetter,
Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee,
Lang Yu
> On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>
> On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote:
>>
>> On 8/26/2025 8:41 AM, Alex Deucher wrote:
>>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>>
>>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
>>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
>>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>>>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>>>>>> not used, which happens 1 second after inactivity.
>>>>>>>>
>>>>>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>>>>>> ungate, followed by a gate in the resume process. Then,
>>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
>>>>>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>>>>>
>>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>>>>>> that called the command being stuck processing it.
>>>>>>>>
>>>>>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>>>>>
>>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>>>>>> ...
>>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>>>>>> ...
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>>>>>
>>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
>>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
>>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
>>>>>>>> PowerDownVpe(50) command which is the common failure point in all
>>>>>>>> failed resumes.
>>>>>>>>
>>>>>>>> On a normal resume, we should get the following power gates:
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>>>>>>
>>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>>>>>> time of 12s sleep, 8s resume.
>>>>>>>
>>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that
>>>>>>> 'amd-s2idle --duration 12 --wait 8'?
>>>>>>
>>>>>> I did not use amd-s2idle. I essentially used the script below with a
>>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for
>>>>>> this testing.
>>>>>>
>>>>>> for i in {1..200}; do
>>>>>> echo "Suspend attempt $i"
>>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
>>>>>> sudo sh -c 'echo mem > /sys/power/state'
>>>>>>
>>>>>> for j in {1..50}; do
>>>>>> # Use repeating sleep in case echo mem returns early
>>>>>> sleep 1
>>>>>> done
>>>>>> done
>>>>>
>>>>> 👍
>>>>>
>>>>>>
>>>>>>>> The suspected reason here is that 1s that
>>>>>>>> when VPE is used, it needs a bit of time before it can be gated and
>>>>>>>> there was a borderline delay before, which is not enough for Strix Halo.
>>>>>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>>>>>> not seem to cause issues.
>>>>>>>>
>>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>> index 121ee17b522b..24f09e457352 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>> @@ -34,8 +34,8 @@
>>>>>>>> /* VPE CSA resides in the 4th page of CSA */
>>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>>>>>>>>
>>>>>>>> -/* 1 second timeout */
>>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
>>>>>>>> +/* 2 second timeout */
>>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>>>>>>>>
>>>>>>>> #define VPE_MAX_DPM_LEVEL 4
>>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>>>>>>>>
>>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
>>>>>>>
>>>>>>> 1s idle timeout has been used by other IPs for a long time.
>>>>>>> For example JPEG, UVD, VCN all use 1s.
>>>>>>>
>>>>>>> Can you please confirm both your AGESA and your SMU firmware version?
>>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string
>>>>>>> (DMI type 40).
>>>>>>>
>>>>>>> ❯ sudo dmidecode | grep AGESA
>>>>>>
>>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
>>>>>>
>>>>>>> You can get SMU firmware version from this:
>>>>>>>
>>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>>>
>>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
>>>>>>
>>>>>
>>>>> Thanks, I'll get some folks to see if we match this AGESA version if we
>>>>> can also reproduce it on reference hardware the same way you did.
>>>>>
>>>>>>> Are you on the most up to date firmware for your system from the
>>>>>>> manufacturer?
>>>>>>
>>>>>> I updated my bios, pd firmware, and USB device firmware early August,
>>>>>> when I was doing this testing.
>>>>>>
>>>>>>> We haven't seen anything like this reported on Strix Halo thus far and
>>>>>>> we do internal stress testing on s0i3 on reference hardware.
>>>>>>
> >>>>>> Can't find a reference for it on the bug tracker. I have four bug
>>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
>>>>>> for runtime crashes, where the culprit would be this. IE runtime gates
>>>>>> VPE and causes a crash.
>>>>>
>>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
>>>>> what software?
>>>>>
>>>>> BTW - Strix and Kraken also have VPE.
>>>>
>>>> All on the Z13. Not tied to VPE necessarily. I just know that I get
>>>> reports of crashes on the Z13, and with this patch they are fixed for
>>>> me. It will be part of the next bazzite version so I will get feedback
>>>> about it.
>>>>
>>>> I don't think software that is using the VPE is relevant. Perhaps for
>>>> the runtime crashes it is and this patch helps in that case as well.
>>>> But in my case, the crash is caused after the ungate that runs the
>>>> tests on resume on the delayed handler.
>>>>
>>>> The Z13 also has some other quirks with spurious wakeups when
>>>> connected to a charger. So, if systemd is configured to e.g., sleep
>>>> after 20m, combined with this crash if it stays plugged in overnight
>>>> in the morning it has crashed.
>>>>
>>>>>>
>>>>>>> To me this seems likely to be a platform firmware bug; but I would like
>>>>>>> to understand the timing of the gate vs ungate on good vs bad.
>>>>>>
>>>>>> Perhaps it is. It is either something like that or silicon quality.
>>>>>>
>>>>>>> IE is it possible the delayed work handler
>>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with
>>>>>>> vpe_ring_begin_use()?
>>>>>>
>>>>>> I don't think so. There is only a single ungate. Also, the crash
>>>>>> happens on the gate. So what happens is the device wakes up, the
>>>>>> screen turns on, kde clock works, then after a second it freezes,
>>>>>> there is a softlock, and the device hangs.
>>>>>>
>>>>>> The failed command is always the VPE gate that is triggered after 1s in idle.
>>>>>>
>>>>>>> This should be possible to check without extra instrumentation by using
>>>>>>> ftrace and looking at the timing of the 2 ring functions and the init
>>>>>>> work handler and checking good vs bad cycles.
>>>>>>
>>>>>> I do not know how to use ftrace. I should also note that after the
>>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a
>>>>>> very easy thing to diagnose. The device just stops working. A lot of
>>>>>> the logs I got were in pstore by forcing a kernel panic.
>>>>>
>>>>> Here's how you capture the timing of functions. Each time the function
>>>>> is called there will be an event in the trace buffer.
>>>>>
>>>>> ❯ sudo trace-cmd record -p function -l
>>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
>>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
>>>>> amdgpu_pmops_resume
>>>>>
>>>>> Here's how you would review the report:
>>>>>
>>>>> ❯ trace-cmd report
>>>>> cpus=24
>>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function:
>>>>> amdgpu_pmops_suspend <-- pci_pm_suspend
>>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function:
>>>>> amdgpu_pmops_resume <-- dpm_run_callback
>>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function:
>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function:
>>>>> vpe_ring_end_use <-- vpe_ring_test_ring
>>>>> kworker/15:1-12285 [015] ..... 13666.960191: function:
>>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work
>>>>> kworker/15:1-12285 [015] ..... 13666.963970: function:
>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>>>> kworker/15:1-12285 [015] ..... 13666.965481: function:
>>>>> vpe_ring_end_use <-- amdgpu_ib_schedule
>>>>> kworker/15:4-16354 [015] ..... 13667.981394: function:
>>>>> vpe_idle_work_handler <-- process_one_work
>>>>>
>>>>> I did this on a Strix system just now to capture that.
>>>>>
>>>>> You can see that basically the ring gets used before the delayed init
>>>>> work handler, and then again from the ring tests. My concern is if the
>>>>> sequence ever looks different than the above. If it does; we do have a
>>>>> driver race condition.
>>>>>
>>>>> It would also be helpful to look at the function_graph tracer.
>>>>>
>>>>> Here's some more documentation about ftrace and trace-cmd.
>>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html
>>>>> https://lwn.net/Articles/410200/
>>>>>
>>>>> You can probably also get an LLM to help you with building commands if
>>>>> you're not familiar with it.
>>>>>
>>>>> But if you're hung so bad you can't flush to disk that's going to be a
>>>>> problem without a UART. A few ideas:
>>>>
> >>>> Sometimes it flushes to disk
>>>>
>>>>> 1) You can use CONFIG_PSTORE_FTRACE
>>>>
>>>> I can look into that
>>>>
>>>>> 2) If you add "tp_printk" to the kernel command line it should make the
>>>>> trace ring buffer flush to kernel log ring buffer. But be warned this
>>>>> is going to change the timing, the issue might go away entirely or have
>>>>> a different failure rate. So hopefully <1> works.
>>>>>>
>>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution
>>>>>> would be to desync the idle times so they do not happen
>>>>>> simultaneously. So 1000, 1200, 1400, etc.
>>>>>>
>>>>>> Antheas
>>>>>>
>>>>>
> >>>>> I don't doubt that your proposal of changing the timing works. I just
>>>>> want to make sure it's the right solution because otherwise we might
>>>>> change the timing or sequence elsewhere in the driver two years from now
>>>>> and re-introduce the problem unintentionally.
>>>>
>>>> If there are other idle timers and only this one changes to 2s, I will
>>>> agree and say that it would be peculiar. Although 1s seems arbitrary
>>>> in any case.
>>>
>>> All of these timers are arbitrary. Their point is just to provide a
>>> future point where we can check if the engine is idle. The idle work
>>> handler will either power down the IP if it is idle or re-schedule in
>>> the future and try again if there is still work. Making the value
>>> longer will use more power as it will wait longer before checking if
>>> the engine is idle. Making it shorter will save more power, but adds
>>> extra overhead in that the engine will be powered up/down more often.
>>> In most cases, the jobs should complete in a few ms. The timer is
>>> there to avoid the overhead of powering up/down the block too
>>> frequently when applications are using the engine.
>>>
>>> Alex
>>
>> We gave it a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b
>> or 1001c AGESA on a reference system, but unfortunately didn't
>> reproduce the issue with a 200-cycle attempt on either kernel or
>> either BIOS (so we had 800 cycles total).
>
> I think I did 6.12, 6.15, and a stock 6.16-rc. I will have to come
> back to you with 6.17-rc3.
I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13; it froze within 10 cycles with Antheas’ script. I will set up pstore to get logs from it, since nothing appears in my journal after force rebooting.
Matt
>
>> Was your base a bazzite kernel or was it an upstream kernel? I know
>> there are some other patches in bazzite especially relevant to suspend,
>> so I wonder if they could be influencing the timing.
>>
>> Can you repro on 6.17-rc3?
>>
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-26 20:12 ` Matthew Schwartz
@ 2025-08-26 20:58 ` Antheas Kapenekakis
2025-08-27 0:50 ` Matthew Schwartz
0 siblings, 1 reply; 21+ messages in thread
From: Antheas Kapenekakis @ 2025-08-26 20:58 UTC (permalink / raw)
To: Matthew Schwartz
Cc: Mario Limonciello, Alex Deucher, amd-gfx, dri-devel, linux-kernel,
Alex Deucher, Christian König, David Airlie, Simona Vetter,
Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee,
Lang Yu
On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz
<matthew.schwartz@linux.dev> wrote:
>
>
>
> > On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote:
> >
> > On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote:
> >>
> >> On 8/26/2025 8:41 AM, Alex Deucher wrote:
> >>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
> >>>>
> >>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
> >>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
> >>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> >>>>>>>> suspend resumes result in a soft lock around 1 second after the screen
> >>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is
> >>>>>>>> not used, which happens 1 second after inactivity.
> >>>>>>>>
> >>>>>>>> Specifically, the VPE gating after resume is as follows: an initial
> >>>>>>>> ungate, followed by a gate in the resume process. Then,
> >>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> >>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> >>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> >>>>>>>> with VPE_IDLE_TIMEOUT (1s).
> >>>>>>>>
> >>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> >>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> >>>>>>>> that called the command being stuck processing it.
> >>>>>>>>
> >>>>>>>> Specifically, after that SMU command tries to run, we get the following:
> >>>>>>>>
> >>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> >>>>>>>> ...
> >>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> >>>>>>>> ...
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> >>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> >>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> >>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> >>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> >>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> >>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> >>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> >>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >>>>>>>>
> >>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> >>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> >>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> >>>>>>>> PowerDownVpe(50) command which is the common failure point in all
> >>>>>>>> failed resumes.
> >>>>>>>>
> >>>>>>>> On a normal resume, we should get the following power gates:
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >>>>>>>>
> >>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> >>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> >>>>>>>> time of 12s sleep, 8s resume.
> >>>>>>>
> >>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that
> >>>>>>> 'amd-s2idle --duration 12 --wait 8'?
> >>>>>>
> >>>>>> I did not use amd-s2idle. I essentially used the script below with a
> >>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for
> >>>>>> this testing.
> >>>>>>
> >>>>>> for i in {1..200}; do
> >>>>>> echo "Suspend attempt $i"
> >>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
> >>>>>> sudo sh -c 'echo mem > /sys/power/state'
> >>>>>>
> >>>>>> for j in {1..50}; do
> >>>>>> # Use repeating sleep in case echo mem returns early
> >>>>>> sleep 1
> >>>>>> done
> >>>>>> done
> >>>>>
> >>>>> 👍
> >>>>>
> >>>>>>
> >>>>>>>> The suspected reason here is that 1s that
> >>>>>>>> when VPE is used, it needs a bit of time before it can be gated and
> >>>>>>>> there was a borderline delay before, which is not enough for Strix Halo.
> >>>>>>>> When the VPE is not used, such as on resume, gating it instantly does
> >>>>>>>> not seem to cause issues.
> >>>>>>>>
> >>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> >>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
> >>>>>>>> ---
> >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> >>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>>>>>>> index 121ee17b522b..24f09e457352 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> >>>>>>>> @@ -34,8 +34,8 @@
> >>>>>>>> /* VPE CSA resides in the 4th page of CSA */
> >>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> >>>>>>>>
> >>>>>>>> -/* 1 second timeout */
> >>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> >>>>>>>> +/* 2 second timeout */
> >>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> >>>>>>>>
> >>>>>>>> #define VPE_MAX_DPM_LEVEL 4
> >>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> >>>>>>>>
> >>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> >>>>>>>
> >>>>>>> 1s idle timeout has been used by other IPs for a long time.
> >>>>>>> For example JPEG, UVD, VCN all use 1s.
> >>>>>>>
> >>>>>>> Can you please confirm both your AGESA and your SMU firmware version?
> >>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string
> >>>>>>> (DMI type 40).
> >>>>>>>
> >>>>>>> ❯ sudo dmidecode | grep AGESA
> >>>>>>
> >>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
> >>>>>>
> >>>>>>> You can get SMU firmware version from this:
> >>>>>>>
> >>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> >>>>>>
> >>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
> >>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
> >>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
> >>>>>>
> >>>>>
> >>>>> Thanks, I'll get some folks to see if we match this AGESA version if we
> >>>>> can also reproduce it on reference hardware the same way you did.
> >>>>>
> >>>>>>> Are you on the most up to date firmware for your system from the
> >>>>>>> manufacturer?
> >>>>>>
> >>>>>> I updated my bios, pd firmware, and USB device firmware early August,
> >>>>>> when I was doing this testing.
> >>>>>>
> >>>>>>> We haven't seen anything like this reported on Strix Halo thus far and
> >>>>>>> we do internal stress testing on s0i3 on reference hardware.
> >>>>>>
> >>>>>> Can't find a reference for it on the bug tracker. I have four bug
> >>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
> >>>>>> for runtime crashes, where the culprit would be this. IE runtime gates
> >>>>>> VPE and causes a crash.
> >>>>>
> >>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
> >>>>> what software?
> >>>>>
> >>>>> BTW - Strix and Kraken also have VPE.
> >>>>
> >>>> All on the Z13. Not tied to VPE necessarily. I just know that I get
> >>>> reports of crashes on the Z13, and with this patch they are fixed for
> >>>> me. It will be part of the next bazzite version so I will get feedback
> >>>> about it.
> >>>>
> >>>> I don't think software that is using the VPE is relevant. Perhaps for
> >>>> the runtime crashes it is and this patch helps in that case as well.
> >>>> But in my case, the crash is caused after the ungate that runs the
> >>>> tests on resume on the delayed handler.
> >>>>
> >>>> The Z13 also has some other quirks with spurious wakeups when
> >>>> connected to a charger. So, if systemd is configured to e.g., sleep
> >>>> after 20m, combined with this crash if it stays plugged in overnight
> >>>> in the morning it has crashed.
> >>>>
> >>>>>>
> >>>>>>> To me this seems likely to be a platform firmware bug; but I would like
> >>>>>>> to understand the timing of the gate vs ungate on good vs bad.
> >>>>>>
> >>>>>> Perhaps it is. It is either something like that or silicon quality.
> >>>>>>
> >>>>>>> IE is it possible the delayed work handler
> >>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with
> >>>>>>> vpe_ring_begin_use()?
> >>>>>>
> >>>>>> I don't think so. There is only a single ungate. Also, the crash
> >>>>>> happens on the gate. So what happens is the device wakes up, the
> >>>>>> screen turns on, kde clock works, then after a second it freezes,
> >>>>>> there is a softlock, and the device hangs.
> >>>>>>
> >>>>>> The failed command is always the VPE gate that is triggered after 1s in idle.
> >>>>>>
> >>>>>>> This should be possible to check without extra instrumentation by using
> >>>>>>> ftrace and looking at the timing of the 2 ring functions and the init
> >>>>>>> work handler and checking good vs bad cycles.
> >>>>>>
> >>>>>> I do not know how to use ftrace. I should also note that after the
> >>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a
> >>>>>> very easy thing to diagnose. The device just stops working. A lot of
> >>>>>> the logs I got were in pstore by forcing a kernel panic.
> >>>>>
> >>>>> Here's how you capture the timing of functions. Each time the function
> >>>>> is called there will be an event in the trace buffer.
> >>>>>
> >>>>> ❯ sudo trace-cmd record -p function -l
> >>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
> >>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
> >>>>> amdgpu_pmops_resume
> >>>>>
> >>>>> Here's how you would review the report:
> >>>>>
> >>>>> ❯ trace-cmd report
> >>>>> cpus=24
> >>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function:
> >>>>> amdgpu_pmops_suspend <-- pci_pm_suspend
> >>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function:
> >>>>> amdgpu_pmops_resume <-- dpm_run_callback
> >>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function:
> >>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
> >>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function:
> >>>>> vpe_ring_end_use <-- vpe_ring_test_ring
> >>>>> kworker/15:1-12285 [015] ..... 13666.960191: function:
> >>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work
> >>>>> kworker/15:1-12285 [015] ..... 13666.963970: function:
> >>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
> >>>>> kworker/15:1-12285 [015] ..... 13666.965481: function:
> >>>>> vpe_ring_end_use <-- amdgpu_ib_schedule
> >>>>> kworker/15:4-16354 [015] ..... 13667.981394: function:
> >>>>> vpe_idle_work_handler <-- process_one_work
> >>>>>
> >>>>> I did this on a Strix system just now to capture that.
> >>>>>
> >>>>> You can see that basically the ring gets used before the delayed init
> >>>>> work handler, and then again from the ring tests. My concern is if the
> >>>>> sequence ever looks different than the above. If it does; we do have a
> >>>>> driver race condition.
> >>>>>
> >>>>> It would also be helpful to look at the function_graph tracer.
> >>>>>
> >>>>> Here's some more documentation about ftrace and trace-cmd.
> >>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html
> >>>>> https://lwn.net/Articles/410200/
> >>>>>
> >>>>> You can probably also get an LLM to help you with building commands if
> >>>>> you're not familiar with it.
> >>>>>
> >>>>> But if you're hung so bad you can't flush to disk that's going to be a
> >>>>> problem without a UART. A few ideas:
> >>>>
> >>>> Some times it flushes to disk
> >>>>
> >>>>> 1) You can use CONFIG_PSTORE_FTRACE
> >>>>
> >>>> I can look into that
> >>>>
> >>>>> 2) If you add "tp_printk" to the kernel command line it should make the
> >>>>> trace ring buffer flush to kernel log ring buffer. But be warned this
> >>>>> is going to change the timing, the issue might go away entirely or have
> >>>>> a different failure rate. So hopefully <1> works.
> >>>>>>
> >>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution
> >>>>>> would be to desync the idle times so they do not happen
> >>>>>> simultaneously. So 1000, 1200, 1400, etc.
> >>>>>>
> >>>>>> Antheas
> >>>>>>
> >>>>>
> >>>>> I don't dobut your your proposal of changing the timing works. I just
> >>>>> want to make sure it's the right solution because otherwise we might
> >>>>> change the timing or sequence elsewhere in the driver two years from now
> >>>>> and re-introduce the problem unintentionally.
> >>>>
> >>>> If there are other idle timers and only this one changes to 2s, I will
> >>>> agree and say that it would be peculiar. Although 1s seems arbitrary
> >>>> in any case.
> >>>
> >>> All of these timers are arbitrary. Their point is just to provide a
> >>> future point where we can check if the engine is idle. The idle work
> >>> handler will either power down the IP if it is idle or re-schedule in
> >>> the future and try again if there is still work. Making the value
> >>> longer will use more power as it will wait longer before checking if
> >>> the engine is idle. Making it shorter will save more power, but adds
> >>> extra overhead in that the engine will be powered up/down more often.
> >>> In most cases, the jobs should complete in a few ms. The timer is
> >>> there to avoid the overhead of powering up/down the block too
> >>> frequently when applications are using the engine.
> >>>
> >>> Alex
> >>
> >> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or
> >> 1001c AGESA on reference system but unfortunately didn't reproduce the
> >> issue with a 200 cycle attempt on either kernel or either BIOS (so we
> >> had 800 cycles total).
> >
> > I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back
> > to you with 6.17-rc3.
>
> I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting.
>
> Matt
Mine does not want to reproduce right now. I will have to try again later.
You will need these kernel arguments:
efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000
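
After rebooting with those, a quick sanity check that pstore is actually
enabled (the paths below are the usual ones and are an assumption; adjust
if your distro differs):

# Should print N once efi_pstore.pstore_disable=0 has taken effect
cat /sys/module/efi_pstore/parameters/pstore_disable
# pstore should be mounted at /sys/fs/pstore
mount | grep pstore
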
Here are some logging commands to run before the for loop:
# clear pstore
sudo bash -c "rm -rf /sys/fs/pstore/*"
# https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/
# Runtime logs
# echo 1 | sudo tee /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable
# echo 1 | sudo tee /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable
# echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on
# Enable panics on lockups
echo 255 | sudo tee /proc/sys/kernel/sysrq
echo 1 | sudo tee /proc/sys/kernel/softlockup_panic
echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic
echo 1 | sudo tee /proc/sys/kernel/panic_on_oops
echo 5 | sudo tee /proc/sys/kernel/panic
# echo 64 | sudo tee /proc/sys/kernel/panic_print
# Enable these for hangs; they dump backtraces for all CPUs on a hang
# echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace
# echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace
# Enable pstore logging on panics
# Needs kernel params:
# efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000
# The first enables pstore; the second sets the size so the dump fits all CPUs' backtraces in case of a panic
echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers
echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump
# Enable dynamic debug for various kernel components
sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF
file drivers/acpi/x86/s2idle.c +p
file drivers/pinctrl/pinctrl-amd.c +p
file drivers/platform/x86/amd/pmc.c +p
file drivers/pci/pci-driver.c +p
file drivers/input/serio/* +p
file drivers/gpu/drm/amd/pm/* +p
file drivers/gpu/drm/amd/pm/swsmu/* +p
EOF
# file drivers/acpi/ec.c +p
# file drivers/gpu/drm/amd/* +p
# file drivers/gpu/drm/amd/display/dc/core/* -p
# Additional debugging for suspend/resume
echo 1 | sudo tee /sys/power/pm_debug_messages
Here is how to reconstruct the log:
rm -rf crash && mkdir crash
sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash"
sudo bash -c "rm -rf /sys/fs/pstore/*"
cat $(find crash/ -name "dmesg-*" | tac) > crash.txt
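
And a quick way to check a reconstructed log for the failure signature
(the strings are the ones from the dmesg excerpts earlier in this thread):

grep -E "PowerDownVpe|Failed to power gate|SMN_C2PMSG_66" crash.txt
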
Antheas
> >
> >> Was your base a bazzite kernel or was it an upstream kernel? I know
> >> there are some other patches in bazzite especially relevant to suspend,
> >> so I wonder if they could be influencing the timing.
> >>
> >> Can you repo on 6.17-rc3?
> >>
> >
> >
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
2025-08-26 20:58 ` Antheas Kapenekakis
@ 2025-08-27 0:50 ` Matthew Schwartz
[not found] ` <MN2PR12MB43736AAF6E8166AD962843F48638A@MN2PR12MB4373.namprd12.prod.outlook.com>
0 siblings, 1 reply; 21+ messages in thread
From: Matthew Schwartz @ 2025-08-27 0:50 UTC (permalink / raw)
To: Antheas Kapenekakis
Cc: Mario Limonciello, Alex Deucher, amd-gfx, dri-devel, linux-kernel,
Alex Deucher, Christian König, David Airlie, Simona Vetter,
Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee,
Lang Yu
On 8/26/25 1:58 PM, Antheas Kapenekakis wrote:
> On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz
> <matthew.schwartz@linux.dev> wrote:
>>
>>
>>
>>> On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>
>>> On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote:
>>>>
>>>> On 8/26/2025 8:41 AM, Alex Deucher wrote:
>>>>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>>>>
>>>>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
>>>>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
>>>>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>>>>>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>>>>>>>> not used, which happens 1 second after inactivity.
>>>>>>>>>>
>>>>>>>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>>>>>>>> ungate, followed by a gate in the resume process. Then,
>>>>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
>>>>>>>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>>>>>>>
>>>>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>>>>>>>> that called the command being stuck processing it.
>>>>>>>>>>
>>>>>>>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>>>>>>>
>>>>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>>>>>>>> ...
>>>>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>>>>>>>> ...
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>>>>>>>
>>>>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
>>>>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
>>>>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
>>>>>>>>>> PowerDownVpe(50) command which is the common failure point in all
>>>>>>>>>> failed resumes.
>>>>>>>>>>
>>>>>>>>>> On a normal resume, we should get the following power gates:
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>>>>>>>>
>>>>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>>>>>>>> time of 12s sleep, 8s resume.
>>>>>>>>>
>>>>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that
>>>>>>>>> 'amd-s2idle --duration 12 --wait 8'?
>>>>>>>>
>>>>>>>> I did not use amd-s2idle. I essentially used the script below with a
>>>>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for
>>>>>>>> this testing.
>>>>>>>>
>>>>>>>> for i in {1..200}; do
>>>>>>>> echo "Suspend attempt $i"
>>>>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
>>>>>>>> sudo sh -c 'echo mem > /sys/power/state'
>>>>>>>>
>>>>>>>> for j in {1..50}; do
>>>>>>>> # Use repeating sleep in case echo mem returns early
>>>>>>>> sleep 1
>>>>>>>> done
>>>>>>>> done
>>>>>>>
>>>>>>> 👍
>>>>>>>
>>>>>>>>
>>>>>>>>>> The suspected reason here is that 1s that
>>>>>>>>>> when VPE is used, it needs a bit of time before it can be gated and
>>>>>>>>>> there was a borderline delay before, which is not enough for Strix Halo.
>>>>>>>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>>>>>>>> not seem to cause issues.
>>>>>>>>>>
>>>>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
>>>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>>>> index 121ee17b522b..24f09e457352 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>>>> @@ -34,8 +34,8 @@
>>>>>>>>>> /* VPE CSA resides in the 4th page of CSA */
>>>>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>>>>>>>>>>
>>>>>>>>>> -/* 1 second timeout */
>>>>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
>>>>>>>>>> +/* 2 second timeout */
>>>>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>>>>>>>>>>
>>>>>>>>>> #define VPE_MAX_DPM_LEVEL 4
>>>>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>>>>>>>>>>
>>>>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
>>>>>>>>>
>>>>>>>>> 1s idle timeout has been used by other IPs for a long time.
>>>>>>>>> For example JPEG, UVD, VCN all use 1s.
>>>>>>>>>
>>>>>>>>> Can you please confirm both your AGESA and your SMU firmware version?
>>>>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string
>>>>>>>>> (DMI type 40).
>>>>>>>>>
>>>>>>>>> ❯ sudo dmidecode | grep AGESA
>>>>>>>>
>>>>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
>>>>>>>>
>>>>>>>>> You can get SMU firmware version from this:
>>>>>>>>>
>>>>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>>>>>
>>>>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
>>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, I'll get some folks to see if we match this AGESA version if we
>>>>>>> can also reproduce it on reference hardware the same way you did.
>>>>>>>
>>>>>>>>> Are you on the most up to date firmware for your system from the
>>>>>>>>> manufacturer?
>>>>>>>>
>>>>>>>> I updated my bios, pd firmware, and USB device firmware early August,
>>>>>>>> when I was doing this testing.
>>>>>>>>
>>>>>>>>> We haven't seen anything like this reported on Strix Halo thus far and
>>>>>>>>> we do internal stress testing on s0i3 on reference hardware.
>>>>>>>>
>>>>>>>> Cant find a reference for it on the bug tracker. I have four bug
>>>>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
>>>>>>>> for runtime crashes, where the culprit would be this. IE runtime gates
>>>>>>>> VPE and causes a crash.
>>>>>>>
>>>>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
>>>>>>> what software?
>>>>>>>
>>>>>>> BTW - Strix and Kraken also have VPE.
>>>>>>
>>>>>> All on the Z13. Not tied to VPE necessarily. I just know that I get
>>>>>> reports of crashes on the Z13, and with this patch they are fixed for
>>>>>> me. It will be part of the next bazzite version so I will get feedback
>>>>>> about it.
>>>>>>
>>>>>> I don't think software that is using the VPE is relevant. Perhaps for
>>>>>> the runtime crashes it is and this patch helps in that case as well.
>>>>>> But in my case, the crash is caused after the ungate that runs the
>>>>>> tests on resume on the delayed handler.
>>>>>>
>>>>>> The Z13 also has some other quirks with spurious wakeups when
>>>>>> connected to a charger. So, if systemd is configured to e.g., sleep
>>>>>> after 20m, combined with this crash if it stays plugged in overnight
>>>>>> in the morning it has crashed.
>>>>>>
>>>>>>>>
>>>>>>>>> To me this seems likely to be a platform firmware bug; but I would like
>>>>>>>>> to understand the timing of the gate vs ungate on good vs bad.
>>>>>>>>
>>>>>>>> Perhaps it is. It is either something like that or silicon quality.
>>>>>>>>
>>>>>>>>> IE is it possible the delayed work handler
>>>>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with
>>>>>>>>> vpe_ring_begin_use()?
>>>>>>>>
>>>>>>>> I don't think so. There is only a single ungate. Also, the crash
>>>>>>>> happens on the gate. So what happens is the device wakes up, the
>>>>>>>> screen turns on, kde clock works, then after a second it freezes,
>>>>>>>> there is a softlock, and the device hangs.
>>>>>>>>
>>>>>>>> The failed command is always the VPE gate that is triggered after 1s in idle.
>>>>>>>>
>>>>>>>>> This should be possible to check without extra instrumentation by using
>>>>>>>>> ftrace and looking at the timing of the 2 ring functions and the init
>>>>>>>>> work handler and checking good vs bad cycles.
>>>>>>>>
>>>>>>>> I do not know how to use ftrace. I should also note that after the
>>>>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a
>>>>>>>> very easy thing to diagnose. The device just stops working. A lot of
>>>>>>>> the logs I got were in pstore by forcing a kernel panic.
>>>>>>>
>>>>>>> Here's how you capture the timing of functions. Each time the function
>>>>>>> is called there will be an event in the trace buffer.
>>>>>>>
>>>>>>> ❯ sudo trace-cmd record -p function -l
>>>>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
>>>>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
>>>>>>> amdgpu_pmops_resume
>>>>>>>
>>>>>>> Here's how you would review the report:
>>>>>>>
>>>>>>> ❯ trace-cmd report
>>>>>>> cpus=24
>>>>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function:
>>>>>>> amdgpu_pmops_suspend <-- pci_pm_suspend
>>>>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function:
>>>>>>> amdgpu_pmops_resume <-- dpm_run_callback
>>>>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function:
>>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>>>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function:
>>>>>>> vpe_ring_end_use <-- vpe_ring_test_ring
>>>>>>> kworker/15:1-12285 [015] ..... 13666.960191: function:
>>>>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work
>>>>>>> kworker/15:1-12285 [015] ..... 13666.963970: function:
>>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>>>>>> kworker/15:1-12285 [015] ..... 13666.965481: function:
>>>>>>> vpe_ring_end_use <-- amdgpu_ib_schedule
>>>>>>> kworker/15:4-16354 [015] ..... 13667.981394: function:
>>>>>>> vpe_idle_work_handler <-- process_one_work
>>>>>>>
>>>>>>> I did this on a Strix system just now to capture that.
>>>>>>>
>>>>>>> You can see that basically the ring gets used before the delayed init
>>>>>>> work handler, and then again from the ring tests. My concern is if the
>>>>>>> sequence ever looks different than the above. If it does; we do have a
>>>>>>> driver race condition.
>>>>>>>
>>>>>>> It would also be helpful to look at the function_graph tracer.
>>>>>>>
>>>>>>> Here's some more documentation about ftrace and trace-cmd.
>>>>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html
>>>>>>> https://lwn.net/Articles/410200/
>>>>>>>
>>>>>>> You can probably also get an LLM to help you with building commands if
>>>>>>> you're not familiar with it.
>>>>>>>
>>>>>>> But if you're hung so bad you can't flush to disk that's going to be a
>>>>>>> problem without a UART. A few ideas:
>>>>>>
>>>>>> Some times it flushes to disk
>>>>>>
>>>>>>> 1) You can use CONFIG_PSTORE_FTRACE
>>>>>>
>>>>>> I can look into that
>>>>>>
>>>>>>> 2) If you add "tp_printk" to the kernel command line it should make the
>>>>>>> trace ring buffer flush to kernel log ring buffer. But be warned this
>>>>>>> is going to change the timing, the issue might go away entirely or have
>>>>>>> a different failure rate. So hopefully <1> works.
>>>>>>>>
>>>>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution
>>>>>>>> would be to desync the idle times so they do not happen
>>>>>>>> simultaneously. So 1000, 1200, 1400, etc.
>>>>>>>>
>>>>>>>> Antheas
>>>>>>>>
>>>>>>>
>>>>>>> I don't dobut your your proposal of changing the timing works. I just
>>>>>>> want to make sure it's the right solution because otherwise we might
>>>>>>> change the timing or sequence elsewhere in the driver two years from now
>>>>>>> and re-introduce the problem unintentionally.
>>>>>>
>>>>>> If there are other idle timers and only this one changes to 2s, I will
>>>>>> agree and say that it would be peculiar. Although 1s seems arbitrary
>>>>>> in any case.
>>>>>
>>>>> All of these timers are arbitrary. Their point is just to provide a
>>>>> future point where we can check if the engine is idle. The idle work
>>>>> handler will either power down the IP if it is idle or re-schedule in
>>>>> the future and try again if there is still work. Making the value
>>>>> longer will use more power as it will wait longer before checking if
>>>>> the engine is idle. Making it shorter will save more power, but adds
>>>>> extra overhead in that the engine will be powered up/down more often.
>>>>> In most cases, the jobs should complete in a few ms. The timer is
>>>>> there to avoid the overhead of powering up/down the block too
>>>>> frequently when applications are using the engine.
>>>>>
>>>>> Alex
>>>>
>>>> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or
>>>> 1001c AGESA on reference system but unfortunately didn't reproduce the
>>>> issue with a 200 cycle attempt on either kernel or either BIOS (so we
>>>> had 800 cycles total).
>>>
>>> I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back
>>> to you with 6.17-rc3.
>>
>> I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting.
>>
>> Matt
>
> Mine does not want to get reproduced right now. I will have to try later.
>
> You will need these kernel arguments:
> efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000
>
> Here are some logging commands before the for loop
> # clear pstore
> sudo bash -c "rm -rf /sys/fs/pstore/*"
>
> # https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/
>
> # Runtime logs
> # echo 1 | sudo tee
> /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable
> # echo 1 | sudo tee
> /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable
> # echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on
>
> # Enable panics on lockups
> echo 255 | sudo tee /proc/sys/kernel/sysrq
> echo 1 | sudo tee /proc/sys/kernel/softlockup_panic
> echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic
> echo 1 | sudo tee /proc/sys/kernel/panic_on_oops
> echo 5 | sudo tee /proc/sys/kernel/panic
> # echo 64 | sudo tee /proc/sys/kernel/panic_print
>
> # Enable these for hangs, shows Thread on hangs
> # echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace
> # echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace
>
> # Enable pstore logging on panics
> # Needs kernel param:
> # efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000
> # First enables, second sets the size to fit all cpus in case of a panic
> echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers
> echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump
>
> # Enable dynamic debug for various kernel components
> sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF
> file drivers/acpi/x86/s2idle.c +p
> file drivers/pinctrl/pinctrl-amd.c +p
> file drivers/platform/x86/amd/pmc.c +p
> file drivers/pci/pci-driver.c +p
> file drivers/input/serio/* +p
> file drivers/gpu/drm/amd/pm/* +p
> file drivers/gpu/drm/amd/pm/swsmu/* +p
> EOF
> # file drivers/acpi/ec.c +p
> # file drivers/gpu/drm/amd/* +p
> # file drivers/gpu/drm/amd/display/dc/core/* -p
>
> # Additional debugging for suspend/resume
> echo 1 | sudo tee /sys/power/pm_debug_messages
So I ran the commands that you gave above while connected over ssh, and I could actually still interact with the system after the amdgpu failures started.
Your suspend script also kept running for a while because of this, and pstore was not necessary.
My dmesg looks very similar to the snippet you posted in the patch contents.
Full dmesg is here: https://gist.github.com/matte-schwartz/9ad4b925866d9228923e909618d045d9
I was able to run trace-cmd as Mario suggested, but nothing seemed out of order:
❯ trace-cmd report
kworker/22:6-9326 [022] ..... 4003.204988: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/22:6-9326 [022] ..... 4003.209383: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/22:6-9326 [022] ..... 4003.210152: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/22:6-9326 [022] ..... 4004.263841: function: vpe_idle_work_handler <-- process_one_work
kworker/u129:6-530 [001] ..... 4053.545634: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u129:18-4060 [002] ..... 4114.908515: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u129:18-4060 [023] ..... 4114.931055: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u129:18-4060 [023] ..... 4114.931057: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/7:5-5733 [007] ..... 4115.198936: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/7:5-5733 [007] ..... 4115.203185: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/7:5-5733 [007] ..... 4115.204141: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/7:0-7950 [007] ..... 4116.253971: function: vpe_idle_work_handler <-- process_one_work
kworker/u129:41-4083 [001] ..... 4165.539388: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u129:58-4100 [001] ..... 4226.906561: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u129:58-4100 [022] ..... 4226.927900: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u129:58-4100 [022] ..... 4226.927902: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/7:0-7950 [007] ..... 4227.193678: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/7:0-7950 [007] ..... 4227.197604: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/7:0-7950 [007] ..... 4227.201691: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/7:0-7950 [007] ..... 4228.240479: function: vpe_idle_work_handler <-- process_one_work
I have not tested the kernel patch yet, so that will be my next step.
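
Roughly what I plan to run for that, in case anyone wants to follow along
(the message-id is a placeholder for patch 1/2, and b4 is just my usual
workflow, not a requirement):

# '<message-id-of-this-patch>' is a placeholder; substitute the real one
b4 am -o - '<message-id-of-this-patch>' | git am
# then rebuild and install the kernel as usual
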
>
> Here is how to reconstruct the log:
> rm -rf crash && mkdir crash
> sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash"
> sudo bash -c "rm -rf /sys/fs/pstore/*"
> cat $(find crash/ -name "dmesg-*" | tac) > crash.txt
>
> Antheas
>>>
>>>> Was your base a bazzite kernel or was it an upstream kernel? I know
>>>> there are some other patches in bazzite especially relevant to suspend,
>>>> so I wonder if they could be influencing the timing.
>>>>
>>>> Can you repo on 6.17-rc3?
>>>>
>>>
>>>
>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
[not found] ` <MN2PR12MB43736AAF6E8166AD962843F48638A@MN2PR12MB4373.namprd12.prod.outlook.com>
@ 2025-08-27 15:42 ` Matthew Schwartz
0 siblings, 0 replies; 21+ messages in thread
From: Matthew Schwartz @ 2025-08-27 15:42 UTC (permalink / raw)
To: Lee, Peyton, Antheas Kapenekakis
Cc: Mario Limonciello, Alex Deucher, amd-gfx@lists.freedesktop.org,
dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
Deucher, Alexander, Koenig, Christian, David Airlie,
Simona Vetter, Wentland, Harry, Rodrigo Siqueira,
Limonciello, Mario, Yu, Lang
On 8/26/25 7:37 PM, Lee, Peyton wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> I recently encountered a similar issue on Strix.
>
> What I found was that the root cause was GFX failing during hw_init.
>
> Here’s the situation:
> Linux AMDGPU boot-up flow:
>
> 1.
> sw_init — This stage initializes the software for each IP block (GFX, VCN, VPE, etc.), and powers them on.
> 2.
> hw_init — This stage calls the hardware initialization of each IP block. At this point, VPE begins loading its firmware and configuring hardware.
>
> The issue:
> When the problem occurs, the GFX block fails during hw_init. As a result, it requests the SMU to power off all IP blocks.
> However, at that point, the VPE firmware hasn’t been loaded yet, so it cannot respond to the SMU's power-off request.
> This causes the system to hang during boot.
>
> Previously, my approach was to remove all the calls to VPE power off (both in the VPE driver and in the SMU deinit function) to help locate the issue. Maybe you could try the same.
Thanks, I tried both Antheas' patch and your suggestion, and I was still able to trigger the issue with both of them.
On Mario's suggestion, I updated from linux-firmware 20250808 to linux-firmware from git. With this new amdgpu_firmware_info:
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x0000001f
PFP feature version: 35, firmware version: 0x0000002c
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x11530506
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x11530506
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x0000001f
IMU feature version: 0, firmware version: 0x0b352300
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648371, firmware version: 0x210000f3
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000046
TA DTM feature version: 0x00000000, firmware version: 0x12000019
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00647000 (100.112.0)
SDMA0 feature version: 60, firmware version: 0x00000011
VCN feature version: 0, firmware version: 0x09118011
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x09002a00
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007e
VPE feature version: 60, firmware version: 0x00000017
VBIOS version: 113-STRXLGEN-001
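
(For reference, that listing is amdgpu's firmware-info file in debugfs; the
exact per-device directory name is an assumption and can differ between
kernel versions, hence the glob:)

sudo sh -c 'cat /sys/kernel/debug/dri/*/amdgpu_firmware_info'
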
I have not been able to trigger any amdgpu failures after 200 cycles on an unpatched 6.17-rc3 kernel.
Thanks,
Matt
>
>
>
> ________________________________
> From: Matthew Schwartz <matthew.schwartz@linux.dev>
> Sent: Wednesday, 27 August 2025 08:50
> To: Antheas Kapenekakis <lkml@antheas.dev>
> Cc: Mario Limonciello <superm1@kernel.org>; Alex Deucher <alexdeucher@gmail.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Wentland, Harry <Harry.Wentland@amd.com>; Rodrigo Siqueira <siqueira@igalia.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Lee, Peyton <Peyton.Lee@amd.com>; Yu, Lang <Lang.Yu@amd.com>
> Subject: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo
>
> On 8/26/25 1:58 PM, Antheas Kapenekakis wrote:
>> On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz
>> <matthew.schwartz@linux.dev> wrote:
>>>
>>>
>>>
>>>> On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>>
>>>> On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>
>>>>> On 8/26/2025 8:41 AM, Alex Deucher wrote:
>>>>>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote:
>>>>>>>
>>>>>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote:
>>>>>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote:
>>>>>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>>>>>>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>>>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>>>>>>>>> not used, which happens 1 second after inactivity.
>>>>>>>>>>>
>>>>>>>>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>>>>>>>>> ungate, followed by a gate in the resume process. Then,
>>>>>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>>>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>>>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
>>>>>>>>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>>>>>>>>
>>>>>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>>>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>>>>>>>>> that called the command being stuck processing it.
>>>>>>>>>>>
>>>>>>>>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>>>>>>>>
>>>>>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>>>>>>>>> ...
>>>>>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>>>>>>>>> ...
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>>>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>>>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>>>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>>>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>>>>>>>>
>>>>>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
>>>>>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
>>>>>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
>>>>>>>>>>> PowerDownVpe(50) command which is the common failure point in all
>>>>>>>>>>> failed resumes.
>>>>>>>>>>>
>>>>>>>>>>> On a normal resume, we should get the following power gates:
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>>>>>>>>>
>>>>>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>>>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>>>>>>>>> time of 12s sleep, 8s resume.
>>>>>>>>>>
>>>>>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that
>>>>>>>>>> 'amd-s2idle --duration 12 --wait 8'?
>>>>>>>>>
>>>>>>>>> I did not use amd-s2idle. I essentially used the script below with a
>>>>>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for
>>>>>>>>> this testing.
>>>>>>>>>
>>>>>>>>> for i in {1..200}; do
>>>>>>>>> echo "Suspend attempt $i"
>>>>>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm
>>>>>>>>> sudo sh -c 'echo mem > /sys/power/state'
>>>>>>>>>
>>>>>>>>> for j in {1..50}; do
>>>>>>>>> # Use repeating sleep in case echo mem returns early
>>>>>>>>> sleep 1
>>>>>>>>> done
>>>>>>>>> done
>>>>>>>>
>>>>>>>> 👍
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> The suspected reason here is that 1s that
>>>>>>>>>>> when VPE is used, it needs a bit of time before it can be gated and
>>>>>>>>>>> there was a borderline delay before, which is not enough for Strix Halo.
>>>>>>>>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>>>>>>>>> not seem to cause issues.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
>>>>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
>>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>>>>> index 121ee17b522b..24f09e457352 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
>>>>>>>>>>> @@ -34,8 +34,8 @@
>>>>>>>>>>> /* VPE CSA resides in the 4th page of CSA */
>>>>>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
>>>>>>>>>>>
>>>>>>>>>>> -/* 1 second timeout */
>>>>>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
>>>>>>>>>>> +/* 2 second timeout */
>>>>>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
>>>>>>>>>>>
>>>>>>>>>>> #define VPE_MAX_DPM_LEVEL 4
>>>>>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
>>>>>>>>>>>
>>>>>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
>>>>>>>>>>
>>>>>>>>>> 1s idle timeout has been used by other IPs for a long time.
>>>>>>>>>> For example JPEG, UVD, VCN all use 1s.
>>>>>>>>>>
>>>>>>>>>> Can you please confirm both your AGESA and your SMU firmware version?
>>>>>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string
>>>>>>>>>> (DMI type 40).
>>>>>>>>>>
>>>>>>>>>> ❯ sudo dmidecode | grep AGESA
>>>>>>>>>
>>>>>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c
>>>>>>>>>
>>>>>>>>>> You can get SMU firmware version from this:
>>>>>>>>>>
>>>>>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>>>>>>
>>>>>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_*
>>>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0
>>>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, I'll get some folks to see if we match this AGESA version if we
>>>>>>>> can also reproduce it on reference hardware the same way you did.
>>>>>>>>
>>>>>>>>>> Are you on the most up to date firmware for your system from the
>>>>>>>>>> manufacturer?
>>>>>>>>>
>>>>>>>>> I updated my bios, pd firmware, and USB device firmware early August,
>>>>>>>>> when I was doing this testing.
>>>>>>>>>
>>>>>>>>>> We haven't seen anything like this reported on Strix Halo thus far and
>>>>>>>>>> we do internal stress testing on s0i3 on reference hardware.
>>>>>>>>>
>>>>>>>>> Cant find a reference for it on the bug tracker. I have four bug
>>>>>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2
>>>>>>>>> for runtime crashes, where the culprit would be this. IE runtime gates
>>>>>>>>> VPE and causes a crash.
>>>>>>>>
>>>>>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By
>>>>>>>> what software?
>>>>>>>>
>>>>>>>> BTW - Strix and Kraken also have VPE.
>>>>>>>
>>>>>>> All on the Z13. Not tied to VPE necessarily. I just know that I get
>>>>>>> reports of crashes on the Z13, and with this patch they are fixed for
>>>>>>> me. It will be part of the next bazzite version so I will get feedback
>>>>>>> about it.
>>>>>>>
>>>>>>> I don't think software that is using the VPE is relevant. Perhaps for
>>>>>>> the runtime crashes it is and this patch helps in that case as well.
>>>>>>> But in my case, the crash is caused after the ungate that runs the
>>>>>>> tests on resume on the delayed handler.
>>>>>>>
>>>>>>> The Z13 also has some other quirks with spurious wakeups when
>>>>>>> connected to a charger. So, if systemd is configured to e.g., sleep
>>>>>>> after 20m, combined with this crash if it stays plugged in overnight
>>>>>>> in the morning it has crashed.
>>>>>>>
>>>>>>>>>
>>>>>>>>>> To me this seems likely to be a platform firmware bug; but I would like
>>>>>>>>>> to understand the timing of the gate vs ungate on good vs bad.
>>>>>>>>>
>>>>>>>>> Perhaps it is. It is either something like that or silicon quality.
>>>>>>>>>
>>>>>>>>>> IE is it possible the delayed work handler
>>>>>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with
>>>>>>>>>> vpe_ring_begin_use()?
>>>>>>>>>
>>>>>>>>> I don't think so. There is only a single ungate. Also, the crash
>>>>>>>>> happens on the gate. So what happens is the device wakes up, the
>>>>>>>>> screen turns on, kde clock works, then after a second it freezes,
>>>>>>>>> there is a softlock, and the device hangs.
>>>>>>>>>
>>>>>>>>> The failed command is always the VPE gate that is triggered after 1s in idle.
>>>>>>>>>
>>>>>>>>>> This should be possible to check without extra instrumentation by using
>>>>>>>>>> ftrace and looking at the timing of the 2 ring functions and the init
>>>>>>>>>> work handler and checking good vs bad cycles.
>>>>>>>>>
>>>>>>>>> I do not know how to use ftrace. I should also note that after the
>>>>>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a
>>>>>>>>> very easy thing to diagnose. The device just stops working. A lot of
>>>>>>>>> the logs I got were in pstore by forcing a kernel panic.
>>>>>>>>
>>>>>>>> Here's how you capture the timing of functions. Each time the function
>>>>>>>> is called there will be an event in the trace buffer.
>>>>>>>>
>>>>>>>> ❯ sudo trace-cmd record -p function -l
>>>>>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l
>>>>>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l
>>>>>>>> amdgpu_pmops_resume
>>>>>>>>
>>>>>>>> Here's how you would review the report:
>>>>>>>>
>>>>>>>> ❯ trace-cmd report
>>>>>>>> cpus=24
>>>>>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function:
>>>>>>>> amdgpu_pmops_suspend <-- pci_pm_suspend
>>>>>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function:
>>>>>>>> amdgpu_pmops_resume <-- dpm_run_callback
>>>>>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function:
>>>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>>>>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function:
>>>>>>>> vpe_ring_end_use <-- vpe_ring_test_ring
>>>>>>>> kworker/15:1-12285 [015] ..... 13666.960191: function:
>>>>>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work
>>>>>>>> kworker/15:1-12285 [015] ..... 13666.963970: function:
>>>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc
>>>>>>>> kworker/15:1-12285 [015] ..... 13666.965481: function:
>>>>>>>> vpe_ring_end_use <-- amdgpu_ib_schedule
>>>>>>>> kworker/15:4-16354 [015] ..... 13667.981394: function:
>>>>>>>> vpe_idle_work_handler <-- process_one_work
>>>>>>>>
>>>>>>>> I did this on a Strix system just now to capture that.
>>>>>>>>
>>>>>>>> You can see that basically the ring gets used before the delayed init
>>>>>>>> work handler, and then again from the ring tests. My concern is if the
>>>>>>>> sequence ever looks different than the above. If it does; we do have a
>>>>>>>> driver race condition.
>>>>>>>>
>>>>>>>> It would also be helpful to look at the function_graph tracer.
>>>>>>>>
>>>>>>>> Here's some more documentation about ftrace and trace-cmd.
>>>>>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html
>>>>>>>> https://lwn.net/Articles/410200/
>>>>>>>>
>>>>>>>> You can probably also get an LLM to help you with building commands if
>>>>>>>> you're not familiar with it.
>>>>>>>>
>>>>>>>> But if you're hung so bad you can't flush to disk that's going to be a
>>>>>>>> problem without a UART. A few ideas:
>>>>>>>
>>>>>>> Some times it flushes to disk
>>>>>>>
>>>>>>>> 1) You can use CONFIG_PSTORE_FTRACE
>>>>>>>
>>>>>>> I can look into that
>>>>>>>
>>>>>>>> 2) If you add "tp_printk" to the kernel command line it should make the
>>>>>>>> trace ring buffer flush to kernel log ring buffer. But be warned this
>>>>>>>> is going to change the timing, the issue might go away entirely or have
>>>>>>>> a different failure rate. So hopefully <1> works.
>>>>>>>>>
>>>>>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution
>>>>>>>>> would be to desync the idle times so they do not happen
>>>>>>>>> simultaneously. So 1000, 1200, 1400, etc.
>>>>>>>>>
>>>>>>>>> Antheas
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't dobut your your proposal of changing the timing works. I just
>>>>>>>> want to make sure it's the right solution because otherwise we might
>>>>>>>> change the timing or sequence elsewhere in the driver two years from now
>>>>>>>> and re-introduce the problem unintentionally.
>>>>>>>
>>>>>>> If there are other idle timers and only this one changes to 2s, I will
>>>>>>> agree and say that it would be peculiar. Although 1s seems arbitrary
>>>>>>> in any case.
>>>>>>
>>>>>> All of these timers are arbitrary. Their point is just to provide a
>>>>>> future point where we can check if the engine is idle. The idle work
>>>>>> handler will either power down the IP if it is idle or re-schedule in
>>>>>> the future and try again if there is still work. Making the value
>>>>>> longer will use more power as it will wait longer before checking if
>>>>>> the engine is idle. Making it shorter will save more power, but adds
>>>>>> extra overhead in that the engine will be powered up/down more often.
>>>>>> In most cases, the jobs should complete in a few ms. The timer is
>>>>>> there to avoid the overhead of powering up/down the block too
>>>>>> frequently when applications are using the engine.
>>>>>>
>>>>>> Alex
>>>>>
>>>>> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or
>>>>> 1001c AGESA on reference system but unfortunately didn't reproduce the
>>>>> issue with a 200 cycle attempt on either kernel or either BIOS (so we
>>>>> had 800 cycles total).
>>>>
>>>> I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back
>>>> to you with 6.17-rc3.
>>>
>>> I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting.
>>>
>>> Matt
>>
>> Mine does not want to get reproduced right now. I will have to try later.
>>
>> You will need these kernel arguments:
>> efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000
>>
>> Here are some logging commands before the for loop
>> # clear pstore
>> sudo bash -c "rm -rf /sys/fs/pstore/*"
>>
>> # https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/
>>
>> # Runtime logs
>> # echo 1 | sudo tee
>> /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable
>> # echo 1 | sudo tee
>> /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable
>> # echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on
>>
>> # Enable panics on lockups
>> echo 255 | sudo tee /proc/sys/kernel/sysrq
>> echo 1 | sudo tee /proc/sys/kernel/softlockup_panic
>> echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic
>> echo 1 | sudo tee /proc/sys/kernel/panic_on_oops
>> echo 5 | sudo tee /proc/sys/kernel/panic
>> # echo 64 | sudo tee /proc/sys/kernel/panic_print
>>
>> # Enable these for hangs, shows Thread on hangs
>> # echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace
>> # echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace
>>
>> # Enable pstore logging on panics
>> # Needs kernel param:
>> # efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000
>> # First enables, second sets the size to fit all cpus in case of a panic
>> echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers
>> echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump
>>
>> # Enable dynamic debug for various kernel components
>> sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF
>> file drivers/acpi/x86/s2idle.c +p
>> file drivers/pinctrl/pinctrl-amd.c +p
>> file drivers/platform/x86/amd/pmc.c +p
>> file drivers/pci/pci-driver.c +p
>> file drivers/input/serio/* +p
>> file drivers/gpu/drm/amd/pm/* +p
>> file drivers/gpu/drm/amd/pm/swsmu/* +p
>> EOF
>> # file drivers/acpi/ec.c +p
>> # file drivers/gpu/drm/amd/* +p
>> # file drivers/gpu/drm/amd/display/dc/core/* -p
>>
>> # Additional debugging for suspend/resume
>> echo 1 | sudo tee /sys/power/pm_debug_messages
>
> So I ran the commands that you gave above while connected over ssh, and I could actually still interact with the system after the amdgpu failures started.
> Your suspend script also kept running for a while because of this, and pstore was not necessary.
>
> My dmesg looks very similar to the snippet you posted in the patch contents.
> Full dmesg is here: https://gist.github.com/matte-schwartz/9ad4b925866d9228923e909618d045d9
>
> I was able to run trace-cmd as Mario suggested, but nothing seemed out of order:
>
> ❯ trace-cmd report
>
> kworker/22:6-9326 [022] ..... 4003.204988: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
> kworker/22:6-9326 [022] ..... 4003.209383: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/22:6-9326 [022] ..... 4003.210152: function: vpe_ring_end_use <-- amdgpu_ib_schedule
> kworker/22:6-9326 [022] ..... 4004.263841: function: vpe_idle_work_handler <-- process_one_work
> kworker/u129:6-530 [001] ..... 4053.545634: function: amdgpu_pmops_suspend <-- pci_pm_suspend
> kworker/u129:18-4060 [002] ..... 4114.908515: function: amdgpu_pmops_resume <-- dpm_run_callback
> kworker/u129:18-4060 [023] ..... 4114.931055: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/u129:18-4060 [023] ..... 4114.931057: function: vpe_ring_end_use <-- vpe_ring_test_ring
> kworker/7:5-5733 [007] ..... 4115.198936: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
> kworker/7:5-5733 [007] ..... 4115.203185: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/7:5-5733 [007] ..... 4115.204141: function: vpe_ring_end_use <-- amdgpu_ib_schedule
> kworker/7:0-7950 [007] ..... 4116.253971: function: vpe_idle_work_handler <-- process_one_work
> kworker/u129:41-4083 [001] ..... 4165.539388: function: amdgpu_pmops_suspend <-- pci_pm_suspend
> kworker/u129:58-4100 [001] ..... 4226.906561: function: amdgpu_pmops_resume <-- dpm_run_callback
> kworker/u129:58-4100 [022] ..... 4226.927900: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/u129:58-4100 [022] ..... 4226.927902: function: vpe_ring_end_use <-- vpe_ring_test_ring
> kworker/7:0-7950 [007] ..... 4227.193678: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
> kworker/7:0-7950 [007] ..... 4227.197604: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
> kworker/7:0-7950 [007] ..... 4227.201691: function: vpe_ring_end_use <-- amdgpu_ib_schedule
> kworker/7:0-7950 [007] ..... 4228.240479: function: vpe_idle_work_handler <-- process_one_work
>
> I have not tested the kernel patch yet, so that will be my next step.
>
>>
>> Here is how to reconstruct the log:
>> rm -rf crash && mkdir crash
>> sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash"
>> sudo bash -c "rm -rf /sys/fs/pstore/*"
>> cat $(find crash/ -name "dmesg-*" | tac) > crash.txt
>>
>> Antheas
>>>>
>>>>> Was your base a bazzite kernel or was it an upstream kernel? I know
>>>>> there are some other patches in bazzite especially relevant to suspend,
>>>>> so I wonder if they could be influencing the timing.
>>>>>
>>>>> Can you repo on 6.17-rc3?
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2025-08-27 15:43 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis
2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
2025-08-24 11:29 ` kernel test robot
2025-08-24 19:33 ` Antheas Kapenekakis
2025-08-25 7:02 ` Philip Mueller
2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello
2025-08-24 20:46 ` Antheas Kapenekakis
2025-08-25 1:38 ` Mario Limonciello
2025-08-25 13:39 ` Antheas Kapenekakis
2025-08-26 13:41 ` Alex Deucher
2025-08-26 19:19 ` Mario Limonciello
2025-08-26 19:21 ` Antheas Kapenekakis
2025-08-26 20:12 ` Matthew Schwartz
2025-08-26 20:58 ` Antheas Kapenekakis
2025-08-27 0:50 ` Matthew Schwartz
[not found] ` <MN2PR12MB43736AAF6E8166AD962843F48638A@MN2PR12MB4373.namprd12.prod.outlook.com>
2025-08-27 15:42 ` Matthew Schwartz
2025-08-25 13:20 ` Alex Deucher
2025-08-25 13:33 ` Antheas Kapenekakis
2025-08-25 14:01 ` Antheas Kapenekakis
2025-08-25 16:41 ` Mario Limonciello
2025-08-25 21:00 ` Antheas Kapenekakis