* [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo @ 2025-08-24 8:53 Antheas Kapenekakis 2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-24 8:53 UTC (permalink / raw) To: amd-gfx Cc: dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu, Antheas Kapenekakis On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of suspend/resume cycles result in a soft lock around 1 second after the screen turns on (it freezes). This happens due to power gating VPE when it is not used, which happens 1 second after inactivity. Specifically, the VPE gating after resume is as follows: an initial ungate, followed by a gate in the resume process. Then, amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled to run tests, one of which is testing VPE in vpe_ring_test_ib. This causes an ungate. After that test, vpe_idle_work_handler is scheduled with VPE_IDLE_TIMEOUT (1s). When vpe_idle_work_handler runs and tries to gate VPE, it causes the SMU to hang and partially freezes half of the GPU IPs, with the thread that called the command being stuck processing it. Specifically, after that SMU command tries to run, we get the following: snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot ... xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot ... amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! In addition, there are e.g. kwin errors in journalctl. 0000:c4:00.0 is the GPU. Interestingly, 0000:c4:00.6, which is another HDA block, 0000:c4:00.5, a PCI controller, and 0000:c4:00.2 resume normally. 0x00000032 is the PowerDownVpe(50) command, which is the common failure point in all failed resumes.
On a normal resume, we should get the following power gates: amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases reliability from a failure within 4-25 suspends to 200+ (tested) suspends with a cycle time of 12s sleep, 8s resume. The suspected reason is that when VPE has been used, it needs a bit of time before it can be gated; the previous 1s delay was borderline and is not enough for Strix Halo. When the VPE is not used, such as on resume, gating it instantly does not seem to cause issues. Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> --- drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c index 121ee17b522b..24f09e457352 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c @@ -34,8 +34,8 @@ /* VPE CSA resides in the 4th page of CSA */ #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) -/* 1 second timeout */ -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) +/* 2 second timeout */ +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) #define VPE_MAX_DPM_LEVEL 4 #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 -- 2.50.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
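For readers who want to see where the 1s window described in the commit message comes from: the vpe_ring_begin_use/vpe_ring_end_use hooks referenced later in this thread follow the usual amdgpu idle-work pattern. The snippet below is a simplified sketch of that pattern, not the exact amdgpu_vpe.c code; the struct layout and function bodies are assumptions for illustration, while cancel_delayed_work_sync(), schedule_delayed_work() and msecs_to_jiffies() are the regular kernel workqueue/jiffies APIs.

#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)	/* 1000 before this patch */

struct vpe_sketch {
	struct delayed_work idle_work;	/* runs the idle/gate handler */
};

/* A job is about to use the VPE ring (e.g. the IB test run by the delayed
 * init work ~2s after resume): drop any pending gate request first. */
static void ring_begin_use(struct vpe_sketch *vpe)
{
	cancel_delayed_work_sync(&vpe->idle_work);
	/* the real driver ungates VPE through the SMU here */
}

/* The job finished: re-arm the idle timer. If nothing else touches the
 * ring, the handler power-gates VPE one VPE_IDLE_TIMEOUT later, which is
 * where the failing PowerDownVpe(50) message is sent on Strix Halo. */
static void ring_end_use(struct vpe_sketch *vpe)
{
	schedule_delayed_work(&vpe->idle_work, VPE_IDLE_TIMEOUT);
}

With the timeout at 1s the gate lands roughly 3s after resume, right after the post-resume IB test; bumping it to 2s moves that gate further away from the resume path.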
* [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis @ 2025-08-24 8:53 ` Antheas Kapenekakis 2025-08-24 11:29 ` kernel test robot 2025-08-24 19:33 ` Antheas Kapenekakis 2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello 2025-08-25 13:20 ` Alex Deucher 2 siblings, 2 replies; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-24 8:53 UTC (permalink / raw) To: amd-gfx Cc: dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu, Antheas Kapenekakis Certain OLED devices malfunction on specific brightness levels. Specifically, when DP_SOURCE_BACKLIGHT_LEVEL is written to with the minor byte being 0x00 and sometimes 0x01, the panel forcibly turns off until the device sleeps again. This is an issue on multiple handhelds, including the OneXPlayer F1 Pro and Ayaneo 3 (the panel is suspected to be the same 1080p 7-inch OLED). Below are some examples. This was found by iterating over brightness ranges while printing DP_SOURCE_BACKLIGHT_LEVEL. It was found that the screen would malfunction on specific values, and some of them were collected. Broken: 86016: 10101000000000000 86272: 10101000100000000 87808: 10101011100000000 251648: 111101011100000000 251649: 111101011100000001 Working: 86144: 10101000010000000 87809: 10101011100000001 251650: 111101011100000010 The reason for this is that the range manipulation is too granular. AUX is currently written to with a granularity of 1. Forcing 100, which on the Ayaneo 3 OLED yields 400*10=4000 values, is plenty of granularity and fixes this issue. Iterating over the values through Python shows that the final byte is never 0x00, and testing over the entire range with a cadence of 0.2s/it and 73 increments (to saturate the range) shows no issues. Windows likewise shows no issues. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3803 Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 28 +++++++++++-------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index cd0e2976e268..bb16adcafb88 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -4739,7 +4739,8 @@ static void amdgpu_dm_update_backlight_caps(struct amdgpu_display_manager *dm, } static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps, - unsigned int *min, unsigned int *max) + unsigned int *min, unsigned int *max, + unsigned int *multiple) { if (!caps) return 0; @@ -4748,10 +4749,12 @@ static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps, // Firmware limits are in nits, DC API wants millinits. *max = 1000 * caps->aux_max_input_signal; *min = 1000 * caps->aux_min_input_signal; + *multiple = 100; } else { // Firmware limits are 8-bit, PWM control is 16-bit.
*max = 0x101 * caps->max_input_signal; *min = 0x101 * caps->min_input_signal; + *multiple = 1; } return 1; } @@ -4813,23 +4816,25 @@ static void convert_custom_brightness(const struct amdgpu_dm_backlight_caps *cap static u32 convert_brightness_from_user(const struct amdgpu_dm_backlight_caps *caps, uint32_t brightness) { - unsigned int min, max; + unsigned int min, max, multiple; - if (!get_brightness_range(caps, &min, &max)) + if (!get_brightness_range(caps, &min, &max, &multiple)) return brightness; convert_custom_brightness(caps, min, max, &brightness); - // Rescale 0..max to min..max - return min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max); + // Rescale 0..max to min..max rounding to nearest multiple + return rounddown( + min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max), + multiple); } static u32 convert_brightness_to_user(const struct amdgpu_dm_backlight_caps *caps, uint32_t brightness) { - unsigned int min, max; + unsigned int min, max, multiple; - if (!get_brightness_range(caps, &min, &max)) + if (!get_brightness_range(caps, &min, &max, &multiple)) return brightness; if (brightness < min) @@ -4970,7 +4975,7 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector) struct backlight_properties props = { 0 }; struct amdgpu_dm_backlight_caps *caps; char bl_name[16]; - int min, max; + int min, max, multiple; if (aconnector->bl_idx == -1) return; @@ -4983,15 +4988,16 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector) } caps = &dm->backlight_caps[aconnector->bl_idx]; - if (get_brightness_range(caps, &min, &max)) { + if (get_brightness_range(caps, &min, &max, &multiple)) { if (power_supply_is_system_supplied() > 0) props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->ac_level, 100); else props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->dc_level, 100); /* min is zero, so max needs to be adjusted */ props.max_brightness = max - min; - drm_dbg(drm, "Backlight caps: min: %d, max: %d, ac %d, dc %d\n", min, max, - caps->ac_level, caps->dc_level); + drm_dbg(drm, + "Backlight caps: min: %d, max: %d, ac %d, dc %d, multiple: %d\n", + min, max, caps->ac_level, caps->dc_level, multiple); } else props.brightness = props.max_brightness = MAX_BACKLIGHT_LEVEL; -- 2.50.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
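As a quick illustration of what the patch's rounddown-to-100 step does to the raw AUX values listed in the commit message, here is a small standalone userspace program (illustrative only, not part of the patch; the value list is taken from this thread, including 332800, which comes up later):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Raw values from the commit message above, plus 332800, a multiple
	 * of both 100 and 256 that is reported later in the thread. */
	const uint32_t raw[] = { 86016, 86272, 87808, 251648, 251649, 332800 };

	for (unsigned int i = 0; i < sizeof(raw) / sizeof(raw[0]); i++) {
		uint32_t rounded = raw[i] - raw[i] % 100;	/* rounddown(raw, 100) */
		printf("raw %6u -> rounded %6u, low byte 0x%02x\n",
		       (unsigned int)raw[i], (unsigned int)rounded,
		       (unsigned int)(rounded & 0xff));
	}
	return 0;
}

Most of the problem values end up with a non-zero low byte after rounding, but any multiple of lcm(100, 256) = 6400, such as 332800, still has a 0x00 low byte; that is the corner case discussed further down in the thread.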
* Re: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis @ 2025-08-24 11:29 ` kernel test robot 2025-08-24 19:33 ` Antheas Kapenekakis 1 sibling, 0 replies; 22+ messages in thread From: kernel test robot @ 2025-08-24 11:29 UTC (permalink / raw) To: Antheas Kapenekakis, amd-gfx Cc: oe-kbuild-all, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu, Antheas Kapenekakis Hi Antheas, kernel test robot noticed the following build errors: [auto build test ERROR on c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9] url: https://github.com/intel-lab-lkp/linux/commits/Antheas-Kapenekakis/drm-amd-display-Adjust-AUX-brightness-to-be-a-granularity-of-100/20250824-165633 base: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 patch link: https://lore.kernel.org/r/20250824085351.454619-2-lkml%40antheas.dev patch subject: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 config: i386-buildonly-randconfig-002-20250824 (https://download.01.org/0day-ci/archive/20250824/202508241901.DJ851kiv-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250824/202508241901.DJ851kiv-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202508241901.DJ851kiv-lkp@intel.com/ All errors (new ones prefixed by >>): ld: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.o: in function `amdgpu_dm_backlight_set_level': >> amdgpu_dm.c:(.text+0x8b89): undefined reference to `__umoddi3' -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 22+ messages in thread
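The undefined __umoddi3 reference is the libgcc helper for 64-bit unsigned modulo, which the kernel does not provide on 32-bit targets: rounddown() expands to a '%' on the u64 result of DIV_ROUND_CLOSEST_ULL(), and the call presumably surfaces in amdgpu_dm_backlight_set_level because the conversion helper gets inlined there. One possible way around it, sketched below with a hypothetical helper name (not a fix the thread agreed on), is to let div_u64_rem() from <linux/math64.h> perform the modulo; alternatively the value could be narrowed to u32 first, since the rescaled brightness fits.

#include <linux/math64.h>	/* div_u64_rem() */
#include <linux/kernel.h>	/* lower_32_bits() */

/* Hypothetical helper: round a 64-bit value down to a multiple without
 * emitting a libgcc __umoddi3 call on 32-bit builds. */
static inline u32 round_down_to_multiple(u64 value, u32 multiple)
{
	u32 rem;

	div_u64_rem(value, multiple, &rem);	/* remainder without '%' on u64 */
	return lower_32_bits(value - rem);
}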
* Re: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis 2025-08-24 11:29 ` kernel test robot @ 2025-08-24 19:33 ` Antheas Kapenekakis 2025-08-25 7:02 ` Philip Mueller 1 sibling, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-24 19:33 UTC (permalink / raw) To: amd-gfx Cc: dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Sun, 24 Aug 2025 at 10:54, Antheas Kapenekakis <lkml@antheas.dev> wrote: > > Certain OLED devices malfunction on specific brightness levels. > Specifically, when DP_SOURCE_BACKLIGHT_LEVEL is written to with > the minor byte being 0x00 and sometimes 0x01, the panel forcibly > turns off until the device sleeps again. This is an issue on > multiple handhelds, including OneXPlayer F1 Pro and Ayaneo 3 > (the panel is suspected to be the same-1080p 7in OLED). > > Below are some examples. This was found by iterating over brighness > ranges while printing DP_SOURCE_BACKLIGHT_LEVEL. It was found that > the screen would malfunction on specific values, and some of them > were collected. > > Broken: > 86016: 10101000000000000 > 86272: 10101000100000000 > 87808: 10101011100000000 > 251648: 111101011100000000 > 251649: 111101011100000001 > > Working: > 86144: 10101000010000000 > 87809: 10101011100000001 > 251650: 111101011100000010 > > The reason for this is that the range manipulation is too granular. > AUX is currently written to with a granularity of 1. Forcing 100, > which on the Ayaneo 3 OLED yields 400*10=4000 values, is plenty of > granularity and fixes this issue. Iterating over the values through > Python shows that the final byte is never 0x00, and testing over the > entire range with a cadence of 0.2s/it and 73 increments (to saturate > the range) shows no issues. Windows likewise shows no issues. Well Phil managed to fall into the value 332800, which has a 0 minor bit. Unfortunate. In hindsight, every 256 hundreds there would be a zero anyway. Before I made this patch I made a partial refactor of panel-quirks where a quirk like this could go to. But I would really prefer not to do quirks. Ill send that too. Antheas > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3803 > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > --- > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 28 +++++++++++-------- > 1 file changed, 17 insertions(+), 11 deletions(-) > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > index cd0e2976e268..bb16adcafb88 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > @@ -4739,7 +4739,8 @@ static void amdgpu_dm_update_backlight_caps(struct amdgpu_display_manager *dm, > } > > static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps, > - unsigned int *min, unsigned int *max) > + unsigned int *min, unsigned int *max, > + unsigned int *multiple) > { > if (!caps) > return 0; > @@ -4748,10 +4749,12 @@ static int get_brightness_range(const struct amdgpu_dm_backlight_caps *caps, > // Firmware limits are in nits, DC API wants millinits. > *max = 1000 * caps->aux_max_input_signal; > *min = 1000 * caps->aux_min_input_signal; > + *multiple = 100; > } else { > // Firmware limits are 8-bit, PWM control is 16-bit. 
> *max = 0x101 * caps->max_input_signal; > *min = 0x101 * caps->min_input_signal; > + *multiple = 1; > } > return 1; > } > @@ -4813,23 +4816,25 @@ static void convert_custom_brightness(const struct amdgpu_dm_backlight_caps *cap > static u32 convert_brightness_from_user(const struct amdgpu_dm_backlight_caps *caps, > uint32_t brightness) > { > - unsigned int min, max; > + unsigned int min, max, multiple; > > - if (!get_brightness_range(caps, &min, &max)) > + if (!get_brightness_range(caps, &min, &max, &multiple)) > return brightness; > > convert_custom_brightness(caps, min, max, &brightness); > > - // Rescale 0..max to min..max > - return min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max); > + // Rescale 0..max to min..max rounding to nearest multiple > + return rounddown( > + min + DIV_ROUND_CLOSEST_ULL((u64)(max - min) * brightness, max), > + multiple); > } > > static u32 convert_brightness_to_user(const struct amdgpu_dm_backlight_caps *caps, > uint32_t brightness) > { > - unsigned int min, max; > + unsigned int min, max, multiple; > > - if (!get_brightness_range(caps, &min, &max)) > + if (!get_brightness_range(caps, &min, &max, &multiple)) > return brightness; > > if (brightness < min) > @@ -4970,7 +4975,7 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector) > struct backlight_properties props = { 0 }; > struct amdgpu_dm_backlight_caps *caps; > char bl_name[16]; > - int min, max; > + int min, max, multiple; > > if (aconnector->bl_idx == -1) > return; > @@ -4983,15 +4988,16 @@ amdgpu_dm_register_backlight_device(struct amdgpu_dm_connector *aconnector) > } > > caps = &dm->backlight_caps[aconnector->bl_idx]; > - if (get_brightness_range(caps, &min, &max)) { > + if (get_brightness_range(caps, &min, &max, &multiple)) { > if (power_supply_is_system_supplied() > 0) > props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->ac_level, 100); > else > props.brightness = DIV_ROUND_CLOSEST((max - min) * caps->dc_level, 100); > /* min is zero, so max needs to be adjusted */ > props.max_brightness = max - min; > - drm_dbg(drm, "Backlight caps: min: %d, max: %d, ac %d, dc %d\n", min, max, > - caps->ac_level, caps->dc_level); > + drm_dbg(drm, > + "Backlight caps: min: %d, max: %d, ac %d, dc %d, multiple: %d\n", > + min, max, caps->ac_level, caps->dc_level, multiple); > } else > props.brightness = props.max_brightness = MAX_BACKLIGHT_LEVEL; > > -- > 2.50.1 > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 2025-08-24 19:33 ` Antheas Kapenekakis @ 2025-08-25 7:02 ` Philip Mueller 0 siblings, 0 replies; 22+ messages in thread From: Philip Mueller @ 2025-08-25 7:02 UTC (permalink / raw) To: Antheas Kapenekakis, amd-gfx Cc: dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Sun, 2025-08-24 at 21:33 +0200, Antheas Kapenekakis wrote: > Well Phil managed to fall into the value 332800, which has a 0 minor > bit. Unfortunate. In hindsight, every 256 hundreds there would be a > zero anyway. > > Before I made this patch I made a partial refactor of panel-quirks > where a quirk like this could go to. But I would really prefer not to > do quirks. Ill send that too. > > Antheas I was already looking into that OLED issue for several weeks. Changing the granularity might hide the root cause, so you might just hit the issue less frequently. Currently checking [1], which changes the first byte to 3, since when DP_SOURCE_BACKLIGHT_LEVEL is written to with the first byte being 0x00 (and sometimes 0x01), the panel forcibly turns off until the device sleeps again. In the end the panel vendor has to fix it in firmware. If not, a quirk specific to each panel vendor might be better. I'm still not sure if that refactoring is needed, or if a separate quirk function is the more logical route to be accepted upstream. [1] https://lore.kernel.org/lkml/20250824200202.1744335-5-lkml@antheas.dev/T/#u ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis 2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis @ 2025-08-24 20:16 ` Mario Limonciello 2025-08-24 20:46 ` Antheas Kapenekakis 2025-08-25 13:20 ` Alex Deucher 2 siblings, 1 reply; 22+ messages in thread From: Mario Limonciello @ 2025-08-24 20:16 UTC (permalink / raw) To: Antheas Kapenekakis, amd-gfx Cc: dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > suspend resumes result in a soft lock around 1 second after the screen > turns on (it freezes). This happens due to power gating VPE when it is > not used, which happens 1 second after inactivity. > > Specifically, the VPE gating after resume is as follows: an initial > ungate, followed by a gate in the resume process. Then, > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > to run tests, one of which is testing VPE in vpe_ring_test_ib. This > causes an ungate, After that test, vpe_idle_work_handler is scheduled > with VPE_IDLE_TIMEOUT (1s). > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the > SMU to hang and partially freezes half of the GPU IPs, with the thread > that called the command being stuck processing it. > > Specifically, after that SMU command tries to run, we get the following: > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > ... > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > ... > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > a PCI controller, and 0000:c4.00.2, resume normally. 
0x00000032 is the > PowerDownVpe(50) command which is the common failure point in all > failed resumes. > > On a normal resume, we should get the following power gates: > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > time of 12s sleep, 8s resume. When you say you reproduced with 12s sleep and 8s resume, was that 'amd-s2idle --duration 12 --wait 8'? > The suspected reason here is that 1s that > when VPE is used, it needs a bit of time before it can be gated and > there was a borderline delay before, which is not enough for Strix Halo. > When the VPE is not used, such as on resume, gating it instantly does > not seem to cause issues. > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > index 121ee17b522b..24f09e457352 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > @@ -34,8 +34,8 @@ > /* VPE CSA resides in the 4th page of CSA */ > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > > -/* 1 second timeout */ > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > +/* 2 second timeout */ > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > > #define VPE_MAX_DPM_LEVEL 4 > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 1s idle timeout has been used by other IPs for a long time. For example JPEG, UVD, VCN all use 1s. Can you please confirm both your AGESA and your SMU firmware version? In case you're not aware; you can get AGESA version from SMBIOS string (DMI type 40). ❯ sudo dmidecode | grep AGESA You can get SMU firmware version from this: ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* Are you on the most up to date firmware for your system from the manufacturer? We haven't seen anything like this reported on Strix Halo thus far and we do internal stress testing on s0i3 on reference hardware. To me this seems likely to be a platform firmware bug; but I would like to understand the timing of the gate vs ungate on good vs bad. IE is it possible the delayed work handler amdgpu_device_delayed_init_work_handler() is causing a race with vpe_ring_begin_use()? 
This should be possible to check without extra instrumentation by using ftrace and looking at the timing of the 2 ring functions and the init work handler and checking good vs bad cycles. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello @ 2025-08-24 20:46 ` Antheas Kapenekakis 2025-08-25 1:38 ` Mario Limonciello 0 siblings, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-24 20:46 UTC (permalink / raw) To: Mario Limonciello Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: > > > > On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > > suspend resumes result in a soft lock around 1 second after the screen > > turns on (it freezes). This happens due to power gating VPE when it is > > not used, which happens 1 second after inactivity. > > > > Specifically, the VPE gating after resume is as follows: an initial > > ungate, followed by a gate in the resume process. Then, > > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > > to run tests, one of which is testing VPE in vpe_ring_test_ib. This > > causes an ungate, After that test, vpe_idle_work_handler is scheduled > > with VPE_IDLE_TIMEOUT (1s). > > > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the > > SMU to hang and partially freezes half of the GPU IPs, with the thread > > that called the command being stuck processing it. > > > > Specifically, after that SMU command tries to run, we get the following: > > > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > > ... > > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > > ... > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > > > > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > > a PCI controller, and 0000:c4.00.2, resume normally. 
0x00000032 is the > > PowerDownVpe(50) command which is the common failure point in all > > failed resumes. > > > > On a normal resume, we should get the following power gates: > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > > > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > > time of 12s sleep, 8s resume. > > When you say you reproduced with 12s sleep and 8s resume, was that > 'amd-s2idle --duration 12 --wait 8'? I did not use amd-s2idle. I essentially used the script below with a 12 on the wake alarm and 12 on the for loop. I also used pstore for this testing. for i in {1..200}; do echo "Suspend attempt $i" echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm sudo sh -c 'echo mem > /sys/power/state' for j in {1..50}; do # Use repeating sleep in case echo mem returns early sleep 1 done done > > The suspected reason here is that 1s that > > when VPE is used, it needs a bit of time before it can be gated and > > there was a borderline delay before, which is not enough for Strix Halo. > > When the VPE is not used, such as on resume, gating it instantly does > > not seem to cause issues. > > > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > index 121ee17b522b..24f09e457352 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > @@ -34,8 +34,8 @@ > > /* VPE CSA resides in the 4th page of CSA */ > > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > > > > -/* 1 second timeout */ > > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > > +/* 2 second timeout */ > > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > > > > #define VPE_MAX_DPM_LEVEL 4 > > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > > > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > > 1s idle timeout has been used by other IPs for a long time. > For example JPEG, UVD, VCN all use 1s. > > Can you please confirm both your AGESA and your SMU firmware version? > In case you're not aware; you can get AGESA version from SMBIOS string > (DMI type 40). > > ❯ sudo dmidecode | grep AGESA String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c > You can get SMU firmware version from this: > > ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* grep . 
/sys/bus/platform/drivers/amd_pmc/*/smu_* /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 > Are you on the most up to date firmware for your system from the > manufacturer? I updated my bios, pd firmware, and USB device firmware early August, when I was doing this testing. > We haven't seen anything like this reported on Strix Halo thus far and > we do internal stress testing on s0i3 on reference hardware. Cant find a reference for it on the bug tracker. I have four bug reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 for runtime crashes, where the culprit would be this. IE runtime gates VPE and causes a crash. > To me this seems likely to be a platform firmware bug; but I would like > to understand the timing of the gate vs ungate on good vs bad. Perhaps it is. It is either something like that or silicon quality. > IE is it possible the delayed work handler > amdgpu_device_delayed_init_work_handler() is causing a race with > vpe_ring_begin_use()? I don't think so. There is only a single ungate. Also, the crash happens on the gate. So what happens is the device wakes up, the screen turns on, kde clock works, then after a second it freezes, there is a softlock, and the device hangs. The failed command is always the VPE gate that is triggered after 1s in idle. > This should be possible to check without extra instrumentation by using > ftrace and looking at the timing of the 2 ring functions and the init > work handler and checking good vs bad cycles. I do not know how to use ftrace. I should also note that after the device freezes around 1/5 cycles will sync the fs, so it is also not a very easy thing to diagnose. The device just stops working. A lot of the logs I got were in pstore by forcing a kernel panic. If you say that all IP blocks use 1s, perhaps an alternative solution would be to desync the idle times so they do not happen simultaneously. So 1000, 1200, 1400, etc. Antheas ^ permalink raw reply [flat|nested] 22+ messages in thread
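For reference, the desync idea suggested above would amount to something like the following. The values are made up purely to illustrate the suggestion (it has not been tested in this thread), and the JPEG/VCN macro names are assumed to mirror the existing per-IP 1s timeout defines Mario mentions.

/* Illustration only: stagger the per-IP idle timeouts so that the
 * power-gate requests from VPE, JPEG and VCN do not all hit the SMU in
 * the same window after the post-resume ring tests. */
#define VPE_IDLE_TIMEOUT	msecs_to_jiffies(1000)
#define JPEG_IDLE_TIMEOUT	msecs_to_jiffies(1200)
#define VCN_IDLE_TIMEOUT	msecs_to_jiffies(1400)

Whether staggering actually avoids the hang would still need the same suspend-stress testing that was done for the 2s bump.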
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-24 20:46 ` Antheas Kapenekakis @ 2025-08-25 1:38 ` Mario Limonciello 2025-08-25 13:39 ` Antheas Kapenekakis 0 siblings, 1 reply; 22+ messages in thread From: Mario Limonciello @ 2025-08-25 1:38 UTC (permalink / raw) To: Antheas Kapenekakis Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: > On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: >> >> >> >> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: >>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>> suspend resumes result in a soft lock around 1 second after the screen >>> turns on (it freezes). This happens due to power gating VPE when it is >>> not used, which happens 1 second after inactivity. >>> >>> Specifically, the VPE gating after resume is as follows: an initial >>> ungate, followed by a gate in the resume process. Then, >>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>> with VPE_IDLE_TIMEOUT (1s). >>> >>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>> that called the command being stuck processing it. >>> >>> Specifically, after that SMU command tries to run, we get the following: >>> >>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>> ... >>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>> ... >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! >>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>> >>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. >>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>> a PCI controller, and 0000:c4.00.2, resume normally. 
0x00000032 is the >>> PowerDownVpe(50) command which is the common failure point in all >>> failed resumes. >>> >>> On a normal resume, we should get the following power gates: >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>> >>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>> time of 12s sleep, 8s resume. >> >> When you say you reproduced with 12s sleep and 8s resume, was that >> 'amd-s2idle --duration 12 --wait 8'? > > I did not use amd-s2idle. I essentially used the script below with a > 12 on the wake alarm and 12 on the for loop. I also used pstore for > this testing. > > for i in {1..200}; do > echo "Suspend attempt $i" > echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm > sudo sh -c 'echo mem > /sys/power/state' > > for j in {1..50}; do > # Use repeating sleep in case echo mem returns early > sleep 1 > done > done 👍 > >>> The suspected reason here is that 1s that >>> when VPE is used, it needs a bit of time before it can be gated and >>> there was a borderline delay before, which is not enough for Strix Halo. >>> When the VPE is not used, such as on resume, gating it instantly does >>> not seem to cause issues. >>> >>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") >>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>> index 121ee17b522b..24f09e457352 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>> @@ -34,8 +34,8 @@ >>> /* VPE CSA resides in the 4th page of CSA */ >>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) >>> >>> -/* 1 second timeout */ >>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) >>> +/* 2 second timeout */ >>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) >>> >>> #define VPE_MAX_DPM_LEVEL 4 >>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 >>> >>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 >> >> 1s idle timeout has been used by other IPs for a long time. >> For example JPEG, UVD, VCN all use 1s. >> >> Can you please confirm both your AGESA and your SMU firmware version? >> In case you're not aware; you can get AGESA version from SMBIOS string >> (DMI type 40). >> >> ❯ sudo dmidecode | grep AGESA > > String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c > >> You can get SMU firmware version from this: >> >> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > > grep . 
/sys/bus/platform/drivers/amd_pmc/*/smu_* > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 > Thanks, I'll get some folks to see if we match this AGESA version if we can also reproduce it on reference hardware the same way you did. >> Are you on the most up to date firmware for your system from the >> manufacturer? > > I updated my bios, pd firmware, and USB device firmware early August, > when I was doing this testing. > >> We haven't seen anything like this reported on Strix Halo thus far and >> we do internal stress testing on s0i3 on reference hardware. > > Cant find a reference for it on the bug tracker. I have four bug > reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 > for runtime crashes, where the culprit would be this. IE runtime gates > VPE and causes a crash. All on Strix Halo and all tied to VPE? At runtime was VPE in use? By what software? BTW - Strix and Kraken also have VPE. > >> To me this seems likely to be a platform firmware bug; but I would like >> to understand the timing of the gate vs ungate on good vs bad. > > Perhaps it is. It is either something like that or silicon quality. > >> IE is it possible the delayed work handler >> amdgpu_device_delayed_init_work_handler() is causing a race with >> vpe_ring_begin_use()? > > I don't think so. There is only a single ungate. Also, the crash > happens on the gate. So what happens is the device wakes up, the > screen turns on, kde clock works, then after a second it freezes, > there is a softlock, and the device hangs. > > The failed command is always the VPE gate that is triggered after 1s in idle. > >> This should be possible to check without extra instrumentation by using >> ftrace and looking at the timing of the 2 ring functions and the init >> work handler and checking good vs bad cycles. > > I do not know how to use ftrace. I should also note that after the > device freezes around 1/5 cycles will sync the fs, so it is also not a > very easy thing to diagnose. The device just stops working. A lot of > the logs I got were in pstore by forcing a kernel panic. Here's how you capture the timing of functions. Each time the function is called there will be an event in the trace buffer. ❯ sudo trace-cmd record -p function -l amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l amdgpu_pmops_resume Here's how you would review the report: ❯ trace-cmd report cpus=24 kworker/u97:37-18051 [001] ..... 13655.970108: function: amdgpu_pmops_suspend <-- pci_pm_suspend kworker/u97:21-18036 [002] ..... 13666.290715: function: amdgpu_pmops_resume <-- dpm_run_callback kworker/u97:21-18036 [015] ..... 13666.308295: function: vpe_ring_begin_use <-- amdgpu_ring_alloc kworker/u97:21-18036 [015] ..... 13666.308298: function: vpe_ring_end_use <-- vpe_ring_test_ring kworker/15:1-12285 [015] ..... 13666.960191: function: amdgpu_device_delayed_init_work_handler <-- process_one_work kworker/15:1-12285 [015] ..... 13666.963970: function: vpe_ring_begin_use <-- amdgpu_ring_alloc kworker/15:1-12285 [015] ..... 13666.965481: function: vpe_ring_end_use <-- amdgpu_ib_schedule kworker/15:4-16354 [015] ..... 13667.981394: function: vpe_idle_work_handler <-- process_one_work I did this on a Strix system just now to capture that. You can see that basically the ring gets used before the delayed init work handler, and then again from the ring tests. 
My concern is if the sequence ever looks different than the above. If it does, we do have a driver race condition. It would also be helpful to look at the function_graph tracer. Here's some more documentation about ftrace and trace-cmd. https://www.kernel.org/doc/html/latest/trace/ftrace.html https://lwn.net/Articles/410200/ You can probably also get an LLM to help you with building commands if you're not familiar with it. But if you're hung so badly that you can't flush to disk, that's going to be a problem without a UART. A few ideas: 1) You can use CONFIG_PSTORE_FTRACE 2) If you add "tp_printk" to the kernel command line it should make the trace ring buffer flush to the kernel log ring buffer. But be warned: this is going to change the timing; the issue might go away entirely or have a different failure rate. So hopefully <1> works. > > If you say that all IP blocks use 1s, perhaps an alternative solution > would be to desync the idle times so they do not happen > simultaneously. So 1000, 1200, 1400, etc. > > Antheas > > I don't doubt that your proposal of changing the timing works. I just want to make sure it's the right solution because otherwise we might change the timing or sequence elsewhere in the driver two years from now and re-introduce the problem unintentionally. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-25 1:38 ` Mario Limonciello @ 2025-08-25 13:39 ` Antheas Kapenekakis 2025-08-26 13:41 ` Alex Deucher 0 siblings, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-25 13:39 UTC (permalink / raw) To: Mario Limonciello Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: > > > > On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: > > On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: > >> > >> > >> > >> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: > >>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > >>> suspend resumes result in a soft lock around 1 second after the screen > >>> turns on (it freezes). This happens due to power gating VPE when it is > >>> not used, which happens 1 second after inactivity. > >>> > >>> Specifically, the VPE gating after resume is as follows: an initial > >>> ungate, followed by a gate in the resume process. Then, > >>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > >>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This > >>> causes an ungate, After that test, vpe_idle_work_handler is scheduled > >>> with VPE_IDLE_TIMEOUT (1s). > >>> > >>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the > >>> SMU to hang and partially freezes half of the GPU IPs, with the thread > >>> that called the command being stuck processing it. > >>> > >>> Specifically, after that SMU command tries to run, we get the following: > >>> > >>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > >>> ... > >>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > >>> ... > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > >>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > >>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > >>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > >>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > >>> > >>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. 
> >>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > >>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > >>> PowerDownVpe(50) command which is the common failure point in all > >>> failed resumes. > >>> > >>> On a normal resume, we should get the following power gates: > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > >>> > >>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > >>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > >>> time of 12s sleep, 8s resume. > >> > >> When you say you reproduced with 12s sleep and 8s resume, was that > >> 'amd-s2idle --duration 12 --wait 8'? > > > > I did not use amd-s2idle. I essentially used the script below with a > > 12 on the wake alarm and 12 on the for loop. I also used pstore for > > this testing. > > > > for i in {1..200}; do > > echo "Suspend attempt $i" > > echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm > > sudo sh -c 'echo mem > /sys/power/state' > > > > for j in {1..50}; do > > # Use repeating sleep in case echo mem returns early > > sleep 1 > > done > > done > > 👍 > > > > >>> The suspected reason here is that 1s that > >>> when VPE is used, it needs a bit of time before it can be gated and > >>> there was a borderline delay before, which is not enough for Strix Halo. > >>> When the VPE is not used, such as on resume, gating it instantly does > >>> not seem to cause issues. > >>> > >>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > >>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>> index 121ee17b522b..24f09e457352 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>> @@ -34,8 +34,8 @@ > >>> /* VPE CSA resides in the 4th page of CSA */ > >>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > >>> > >>> -/* 1 second timeout */ > >>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > >>> +/* 2 second timeout */ > >>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > >>> > >>> #define VPE_MAX_DPM_LEVEL 4 > >>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > >>> > >>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > >> > >> 1s idle timeout has been used by other IPs for a long time. > >> For example JPEG, UVD, VCN all use 1s. > >> > >> Can you please confirm both your AGESA and your SMU firmware version? 
> >> In case you're not aware; you can get AGESA version from SMBIOS string > >> (DMI type 40). > >> > >> ❯ sudo dmidecode | grep AGESA > > > > String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c > > > >> You can get SMU firmware version from this: > >> > >> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > > > > grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 > > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 > > > > Thanks, I'll get some folks to see if we match this AGESA version if we > can also reproduce it on reference hardware the same way you did. > > >> Are you on the most up to date firmware for your system from the > >> manufacturer? > > > > I updated my bios, pd firmware, and USB device firmware early August, > > when I was doing this testing. > > > >> We haven't seen anything like this reported on Strix Halo thus far and > >> we do internal stress testing on s0i3 on reference hardware. > > > > Cant find a reference for it on the bug tracker. I have four bug > > reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 > > for runtime crashes, where the culprit would be this. IE runtime gates > > VPE and causes a crash. > > All on Strix Halo and all tied to VPE? At runtime was VPE in use? By > what software? > > BTW - Strix and Kraken also have VPE. All on the Z13. Not tied to VPE necessarily. I just know that I get reports of crashes on the Z13, and with this patch they are fixed for me. It will be part of the next bazzite version so I will get feedback about it. I don't think software that is using the VPE is relevant. Perhaps for the runtime crashes it is and this patch helps in that case as well. But in my case, the crash is caused after the ungate that runs the tests on resume on the delayed handler. The Z13 also has some other quirks with spurious wakeups when connected to a charger. So, if systemd is configured to e.g., sleep after 20m, combined with this crash if it stays plugged in overnight in the morning it has crashed. > > > >> To me this seems likely to be a platform firmware bug; but I would like > >> to understand the timing of the gate vs ungate on good vs bad. > > > > Perhaps it is. It is either something like that or silicon quality. > > > >> IE is it possible the delayed work handler > >> amdgpu_device_delayed_init_work_handler() is causing a race with > >> vpe_ring_begin_use()? > > > > I don't think so. There is only a single ungate. Also, the crash > > happens on the gate. So what happens is the device wakes up, the > > screen turns on, kde clock works, then after a second it freezes, > > there is a softlock, and the device hangs. > > > > The failed command is always the VPE gate that is triggered after 1s in idle. > > > >> This should be possible to check without extra instrumentation by using > >> ftrace and looking at the timing of the 2 ring functions and the init > >> work handler and checking good vs bad cycles. > > > > I do not know how to use ftrace. I should also note that after the > > device freezes around 1/5 cycles will sync the fs, so it is also not a > > very easy thing to diagnose. The device just stops working. A lot of > > the logs I got were in pstore by forcing a kernel panic. > > Here's how you capture the timing of functions. Each time the function > is called there will be an event in the trace buffer. 
> > ❯ sudo trace-cmd record -p function -l > amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l > vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l > amdgpu_pmops_resume > > Here's how you would review the report: > > ❯ trace-cmd report > cpus=24 > kworker/u97:37-18051 [001] ..... 13655.970108: function: > amdgpu_pmops_suspend <-- pci_pm_suspend > kworker/u97:21-18036 [002] ..... 13666.290715: function: > amdgpu_pmops_resume <-- dpm_run_callback > kworker/u97:21-18036 [015] ..... 13666.308295: function: > vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/u97:21-18036 [015] ..... 13666.308298: function: > vpe_ring_end_use <-- vpe_ring_test_ring > kworker/15:1-12285 [015] ..... 13666.960191: function: > amdgpu_device_delayed_init_work_handler <-- process_one_work > kworker/15:1-12285 [015] ..... 13666.963970: function: > vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/15:1-12285 [015] ..... 13666.965481: function: > vpe_ring_end_use <-- amdgpu_ib_schedule > kworker/15:4-16354 [015] ..... 13667.981394: function: > vpe_idle_work_handler <-- process_one_work > > I did this on a Strix system just now to capture that. > > You can see that basically the ring gets used before the delayed init > work handler, and then again from the ring tests. My concern is if the > sequence ever looks different than the above. If it does; we do have a > driver race condition. > > It would also be helpful to look at the function_graph tracer. > > Here's some more documentation about ftrace and trace-cmd. > https://www.kernel.org/doc/html/latest/trace/ftrace.html > https://lwn.net/Articles/410200/ > > You can probably also get an LLM to help you with building commands if > you're not familiar with it. > > But if you're hung so bad you can't flush to disk that's going to be a > problem without a UART. A few ideas: Some times it flushes to disk > 1) You can use CONFIG_PSTORE_FTRACE I can look into that > 2) If you add "tp_printk" to the kernel command line it should make the > trace ring buffer flush to kernel log ring buffer. But be warned this > is going to change the timing, the issue might go away entirely or have > a different failure rate. So hopefully <1> works. > > > > If you say that all IP blocks use 1s, perhaps an alternative solution > > would be to desync the idle times so they do not happen > > simultaneously. So 1000, 1200, 1400, etc. > > > > Antheas > > > > I don't dobut your your proposal of changing the timing works. I just > want to make sure it's the right solution because otherwise we might > change the timing or sequence elsewhere in the driver two years from now > and re-introduce the problem unintentionally. If there are other idle timers and only this one changes to 2s, I will agree and say that it would be peculiar. Although 1s seems arbitrary in any case. Antheas > ^ permalink raw reply [flat|nested] 22+ messages in thread
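For readers following along, the pattern being discussed looks roughly like the sketch below. This is illustrative only and not the actual amdgpu_vpe.c code: the "engine" type and its callbacks are invented for the sketch, and only the overall flow matches what the thread describes, namely that using the ring cancels any pending idle work and ungates the block, releasing the ring re-arms the timer, and the idle work gates the block once it has sat unused for VPE_IDLE_TIMEOUT (the value the patch raises from 1s to 2s).

#include <linux/workqueue.h>
#include <linux/jiffies.h>

/* Timeout under discussion: how long the engine must sit unused before
 * the idle work tries to power gate it. */
#define VPE_IDLE_TIMEOUT	msecs_to_jiffies(2000)

/* Invented container for the sketch; the real driver keeps this state
 * in its own device structures. */
struct engine {
	struct delayed_work idle_work;
	bool (*is_busy)(struct engine *e);                 /* e.g. unsignaled fences */
	int (*set_powergate)(struct engine *e, bool gate); /* SMU PowerUp/PowerDownVpe */
};

static void engine_idle_work_handler(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct engine *e = container_of(dwork, struct engine, idle_work);

	if (e->is_busy(e)) {
		/* Still active: check again one timeout from now. */
		schedule_delayed_work(&e->idle_work, VPE_IDLE_TIMEOUT);
		return;
	}

	/* Idle: gate the block. On the affected systems this is the
	 * PowerDownVpe(50) message the SMU never completes. */
	e->set_powergate(e, true);
}

static void engine_ring_begin_use(struct engine *e)
{
	/* A pending gate must not race with the job about to be submitted. */
	cancel_delayed_work_sync(&e->idle_work);
	e->set_powergate(e, false);	/* ungate */
}

static void engine_ring_end_use(struct engine *e)
{
	/* Idle time is counted from the end of the last use of the ring,
	 * e.g. from the vpe_ring_test_ib run by the delayed init work
	 * after resume. */
	schedule_delayed_work(&e->idle_work, VPE_IDLE_TIMEOUT);
}

static void engine_init(struct engine *e)
{
	INIT_DELAYED_WORK(&e->idle_work, engine_idle_work_handler);
}

In this shape, the ftrace capture suggested above is checking whether begin_use/end_use and the two work handlers ever interleave differently from the good cycle, which is exactly the race being asked about.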
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-25 13:39 ` Antheas Kapenekakis @ 2025-08-26 13:41 ` Alex Deucher 2025-08-26 19:19 ` Mario Limonciello 0 siblings, 1 reply; 22+ messages in thread From: Alex Deucher @ 2025-08-26 13:41 UTC (permalink / raw) To: Antheas Kapenekakis Cc: Mario Limonciello, amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > > On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: > > > > > > > > On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: > > > On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: > > >> > > >> > > >> > > >> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: > > >>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > > >>> suspend resumes result in a soft lock around 1 second after the screen > > >>> turns on (it freezes). This happens due to power gating VPE when it is > > >>> not used, which happens 1 second after inactivity. > > >>> > > >>> Specifically, the VPE gating after resume is as follows: an initial > > >>> ungate, followed by a gate in the resume process. Then, > > >>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > > >>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This > > >>> causes an ungate, After that test, vpe_idle_work_handler is scheduled > > >>> with VPE_IDLE_TIMEOUT (1s). > > >>> > > >>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the > > >>> SMU to hang and partially freezes half of the GPU IPs, with the thread > > >>> that called the command being stuck processing it. > > >>> > > >>> Specifically, after that SMU command tries to run, we get the following: > > >>> > > >>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > > >>> ... > > >>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > > >>> ... > > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > > >>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > > >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > > >>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > > >>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
> > >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > > >>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > > >>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > > >>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > > >>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > >>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > > >>> > > >>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > > >>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > > >>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > > >>> PowerDownVpe(50) command which is the common failure point in all > > >>> failed resumes. > > >>> > > >>> On a normal resume, we should get the following power gates: > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > > >>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > > >>> > > >>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > > >>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > > >>> time of 12s sleep, 8s resume. > > >> > > >> When you say you reproduced with 12s sleep and 8s resume, was that > > >> 'amd-s2idle --duration 12 --wait 8'? > > > > > > I did not use amd-s2idle. I essentially used the script below with a > > > 12 on the wake alarm and 12 on the for loop. I also used pstore for > > > this testing. > > > > > > for i in {1..200}; do > > > echo "Suspend attempt $i" > > > echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm > > > sudo sh -c 'echo mem > /sys/power/state' > > > > > > for j in {1..50}; do > > > # Use repeating sleep in case echo mem returns early > > > sleep 1 > > > done > > > done > > > > 👍 > > > > > > > >>> The suspected reason here is that 1s that > > >>> when VPE is used, it needs a bit of time before it can be gated and > > >>> there was a borderline delay before, which is not enough for Strix Halo. > > >>> When the VPE is not used, such as on resume, gating it instantly does > > >>> not seem to cause issues. 
> > >>> > > >>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > > >>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > > >>> --- > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > > >>> > > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > >>> index 121ee17b522b..24f09e457352 100644 > > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > >>> @@ -34,8 +34,8 @@ > > >>> /* VPE CSA resides in the 4th page of CSA */ > > >>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > > >>> > > >>> -/* 1 second timeout */ > > >>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > > >>> +/* 2 second timeout */ > > >>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > > >>> > > >>> #define VPE_MAX_DPM_LEVEL 4 > > >>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > > >>> > > >>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > > >> > > >> 1s idle timeout has been used by other IPs for a long time. > > >> For example JPEG, UVD, VCN all use 1s. > > >> > > >> Can you please confirm both your AGESA and your SMU firmware version? > > >> In case you're not aware; you can get AGESA version from SMBIOS string > > >> (DMI type 40). > > >> > > >> ❯ sudo dmidecode | grep AGESA > > > > > > String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c > > > > > >> You can get SMU firmware version from this: > > >> > > >> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > > > > > > grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > > > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 > > > /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 > > > > > > > Thanks, I'll get some folks to see if we match this AGESA version if we > > can also reproduce it on reference hardware the same way you did. > > > > >> Are you on the most up to date firmware for your system from the > > >> manufacturer? > > > > > > I updated my bios, pd firmware, and USB device firmware early August, > > > when I was doing this testing. > > > > > >> We haven't seen anything like this reported on Strix Halo thus far and > > >> we do internal stress testing on s0i3 on reference hardware. > > > > > > Cant find a reference for it on the bug tracker. I have four bug > > > reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 > > > for runtime crashes, where the culprit would be this. IE runtime gates > > > VPE and causes a crash. > > > > All on Strix Halo and all tied to VPE? At runtime was VPE in use? By > > what software? > > > > BTW - Strix and Kraken also have VPE. > > All on the Z13. Not tied to VPE necessarily. I just know that I get > reports of crashes on the Z13, and with this patch they are fixed for > me. It will be part of the next bazzite version so I will get feedback > about it. > > I don't think software that is using the VPE is relevant. Perhaps for > the runtime crashes it is and this patch helps in that case as well. > But in my case, the crash is caused after the ungate that runs the > tests on resume on the delayed handler. > > The Z13 also has some other quirks with spurious wakeups when > connected to a charger. So, if systemd is configured to e.g., sleep > after 20m, combined with this crash if it stays plugged in overnight > in the morning it has crashed. > > > > > > >> To me this seems likely to be a platform firmware bug; but I would like > > >> to understand the timing of the gate vs ungate on good vs bad. 
> > > > > > Perhaps it is. It is either something like that or silicon quality. > > > > > >> IE is it possible the delayed work handler > > >> amdgpu_device_delayed_init_work_handler() is causing a race with > > >> vpe_ring_begin_use()? > > > > > > I don't think so. There is only a single ungate. Also, the crash > > > happens on the gate. So what happens is the device wakes up, the > > > screen turns on, kde clock works, then after a second it freezes, > > > there is a softlock, and the device hangs. > > > > > > The failed command is always the VPE gate that is triggered after 1s in idle. > > > > > >> This should be possible to check without extra instrumentation by using > > >> ftrace and looking at the timing of the 2 ring functions and the init > > >> work handler and checking good vs bad cycles. > > > > > > I do not know how to use ftrace. I should also note that after the > > > device freezes around 1/5 cycles will sync the fs, so it is also not a > > > very easy thing to diagnose. The device just stops working. A lot of > > > the logs I got were in pstore by forcing a kernel panic. > > > > Here's how you capture the timing of functions. Each time the function > > is called there will be an event in the trace buffer. > > > > ❯ sudo trace-cmd record -p function -l > > amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l > > vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l > > amdgpu_pmops_resume > > > > Here's how you would review the report: > > > > ❯ trace-cmd report > > cpus=24 > > kworker/u97:37-18051 [001] ..... 13655.970108: function: > > amdgpu_pmops_suspend <-- pci_pm_suspend > > kworker/u97:21-18036 [002] ..... 13666.290715: function: > > amdgpu_pmops_resume <-- dpm_run_callback > > kworker/u97:21-18036 [015] ..... 13666.308295: function: > > vpe_ring_begin_use <-- amdgpu_ring_alloc > > kworker/u97:21-18036 [015] ..... 13666.308298: function: > > vpe_ring_end_use <-- vpe_ring_test_ring > > kworker/15:1-12285 [015] ..... 13666.960191: function: > > amdgpu_device_delayed_init_work_handler <-- process_one_work > > kworker/15:1-12285 [015] ..... 13666.963970: function: > > vpe_ring_begin_use <-- amdgpu_ring_alloc > > kworker/15:1-12285 [015] ..... 13666.965481: function: > > vpe_ring_end_use <-- amdgpu_ib_schedule > > kworker/15:4-16354 [015] ..... 13667.981394: function: > > vpe_idle_work_handler <-- process_one_work > > > > I did this on a Strix system just now to capture that. > > > > You can see that basically the ring gets used before the delayed init > > work handler, and then again from the ring tests. My concern is if the > > sequence ever looks different than the above. If it does; we do have a > > driver race condition. > > > > It would also be helpful to look at the function_graph tracer. > > > > Here's some more documentation about ftrace and trace-cmd. > > https://www.kernel.org/doc/html/latest/trace/ftrace.html > > https://lwn.net/Articles/410200/ > > > > You can probably also get an LLM to help you with building commands if > > you're not familiar with it. > > > > But if you're hung so bad you can't flush to disk that's going to be a > > problem without a UART. A few ideas: > > Some times it flushes to disk > > > 1) You can use CONFIG_PSTORE_FTRACE > > I can look into that > > > 2) If you add "tp_printk" to the kernel command line it should make the > > trace ring buffer flush to kernel log ring buffer. But be warned this > > is going to change the timing, the issue might go away entirely or have > > a different failure rate. 
So hopefully <1> works. > > > > > > If you say that all IP blocks use 1s, perhaps an alternative solution > > > would be to desync the idle times so they do not happen > > > simultaneously. So 1000, 1200, 1400, etc. > > > > > > Antheas > > > > > > > I don't dobut your your proposal of changing the timing works. I just > > want to make sure it's the right solution because otherwise we might > > change the timing or sequence elsewhere in the driver two years from now > > and re-introduce the problem unintentionally. > > If there are other idle timers and only this one changes to 2s, I will > agree and say that it would be peculiar. Although 1s seems arbitrary > in any case. All of these timers are arbitrary. Their point is just to provide a future point where we can check if the engine is idle. The idle work handler will either power down the IP if it is idle or re-schedule in the future and try again if there is still work. Making the value longer will use more power as it will wait longer before checking if the engine is idle. Making it shorter will save more power, but adds extra overhead in that the engine will be powered up/down more often. In most cases, the jobs should complete in a few ms. The timer is there to avoid the overhead of powering up/down the block too frequently when applications are using the engine. Alex ^ permalink raw reply [flat|nested] 22+ messages in thread
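As a concrete illustration of the "desync" alternative quoted above, and keeping in mind the point that each value is only a power-versus-churn trade-off, staggering would simply mean giving each IP a slightly different constant so the power-gate requests issued after a shared idle period do not reach the SMU back to back. VPE_IDLE_TIMEOUT is the macro from the patch; the JPEG and VCN names are written here by analogy with the per-IP timeouts mentioned earlier in the thread (all currently 1 second) and the offsets are hypothetical, not a proposed change:

/* Hypothetical staggering of the per-IP idle timeouts. Per the
 * discussion above these all use msecs_to_jiffies(1000) today; the
 * values below only illustrate the idea of offsetting them. */
#define VPE_IDLE_TIMEOUT	msecs_to_jiffies(1000)
#define JPEG_IDLE_TIMEOUT	msecs_to_jiffies(1200)
#define VCN_IDLE_TIMEOUT	msecs_to_jiffies(1400)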
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-26 13:41 ` Alex Deucher @ 2025-08-26 19:19 ` Mario Limonciello 2025-08-26 19:21 ` Antheas Kapenekakis 0 siblings, 1 reply; 22+ messages in thread From: Mario Limonciello @ 2025-08-26 19:19 UTC (permalink / raw) To: Alex Deucher, Antheas Kapenekakis Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On 8/26/2025 8:41 AM, Alex Deucher wrote: > On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: >> >> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: >>> >>> >>> >>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: >>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: >>>>> >>>>> >>>>> >>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: >>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>>>>> suspend resumes result in a soft lock around 1 second after the screen >>>>>> turns on (it freezes). This happens due to power gating VPE when it is >>>>>> not used, which happens 1 second after inactivity. >>>>>> >>>>>> Specifically, the VPE gating after resume is as follows: an initial >>>>>> ungate, followed by a gate in the resume process. Then, >>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>>>>> with VPE_IDLE_TIMEOUT (1s). >>>>>> >>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>>>>> that called the command being stuck processing it. >>>>>> >>>>>> Specifically, after that SMU command tries to run, we get the following: >>>>>> >>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>>>>> ... >>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>>>>> ... >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. >>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! >>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>>>>> >>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. >>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the >>>>>> PowerDownVpe(50) command which is the common failure point in all >>>>>> failed resumes. >>>>>> >>>>>> On a normal resume, we should get the following power gates: >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>>>>> >>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>>>>> time of 12s sleep, 8s resume. >>>>> >>>>> When you say you reproduced with 12s sleep and 8s resume, was that >>>>> 'amd-s2idle --duration 12 --wait 8'? >>>> >>>> I did not use amd-s2idle. I essentially used the script below with a >>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for >>>> this testing. >>>> >>>> for i in {1..200}; do >>>> echo "Suspend attempt $i" >>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm >>>> sudo sh -c 'echo mem > /sys/power/state' >>>> >>>> for j in {1..50}; do >>>> # Use repeating sleep in case echo mem returns early >>>> sleep 1 >>>> done >>>> done >>> >>> 👍 >>> >>>> >>>>>> The suspected reason here is that 1s that >>>>>> when VPE is used, it needs a bit of time before it can be gated and >>>>>> there was a borderline delay before, which is not enough for Strix Halo. >>>>>> When the VPE is not used, such as on resume, gating it instantly does >>>>>> not seem to cause issues. 
>>>>>> >>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") >>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- >>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>> index 121ee17b522b..24f09e457352 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>> @@ -34,8 +34,8 @@ >>>>>> /* VPE CSA resides in the 4th page of CSA */ >>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) >>>>>> >>>>>> -/* 1 second timeout */ >>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) >>>>>> +/* 2 second timeout */ >>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) >>>>>> >>>>>> #define VPE_MAX_DPM_LEVEL 4 >>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 >>>>>> >>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 >>>>> >>>>> 1s idle timeout has been used by other IPs for a long time. >>>>> For example JPEG, UVD, VCN all use 1s. >>>>> >>>>> Can you please confirm both your AGESA and your SMU firmware version? >>>>> In case you're not aware; you can get AGESA version from SMBIOS string >>>>> (DMI type 40). >>>>> >>>>> ❯ sudo dmidecode | grep AGESA >>>> >>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c >>>> >>>>> You can get SMU firmware version from this: >>>>> >>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>> >>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 >>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 >>>> >>> >>> Thanks, I'll get some folks to see if we match this AGESA version if we >>> can also reproduce it on reference hardware the same way you did. >>> >>>>> Are you on the most up to date firmware for your system from the >>>>> manufacturer? >>>> >>>> I updated my bios, pd firmware, and USB device firmware early August, >>>> when I was doing this testing. >>>> >>>>> We haven't seen anything like this reported on Strix Halo thus far and >>>>> we do internal stress testing on s0i3 on reference hardware. >>>> >>>> Cant find a reference for it on the bug tracker. I have four bug >>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 >>>> for runtime crashes, where the culprit would be this. IE runtime gates >>>> VPE and causes a crash. >>> >>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By >>> what software? >>> >>> BTW - Strix and Kraken also have VPE. >> >> All on the Z13. Not tied to VPE necessarily. I just know that I get >> reports of crashes on the Z13, and with this patch they are fixed for >> me. It will be part of the next bazzite version so I will get feedback >> about it. >> >> I don't think software that is using the VPE is relevant. Perhaps for >> the runtime crashes it is and this patch helps in that case as well. >> But in my case, the crash is caused after the ungate that runs the >> tests on resume on the delayed handler. >> >> The Z13 also has some other quirks with spurious wakeups when >> connected to a charger. So, if systemd is configured to e.g., sleep >> after 20m, combined with this crash if it stays plugged in overnight >> in the morning it has crashed. >> >>>> >>>>> To me this seems likely to be a platform firmware bug; but I would like >>>>> to understand the timing of the gate vs ungate on good vs bad. >>>> >>>> Perhaps it is. 
It is either something like that or silicon quality. >>>> >>>>> IE is it possible the delayed work handler >>>>> amdgpu_device_delayed_init_work_handler() is causing a race with >>>>> vpe_ring_begin_use()? >>>> >>>> I don't think so. There is only a single ungate. Also, the crash >>>> happens on the gate. So what happens is the device wakes up, the >>>> screen turns on, kde clock works, then after a second it freezes, >>>> there is a softlock, and the device hangs. >>>> >>>> The failed command is always the VPE gate that is triggered after 1s in idle. >>>> >>>>> This should be possible to check without extra instrumentation by using >>>>> ftrace and looking at the timing of the 2 ring functions and the init >>>>> work handler and checking good vs bad cycles. >>>> >>>> I do not know how to use ftrace. I should also note that after the >>>> device freezes around 1/5 cycles will sync the fs, so it is also not a >>>> very easy thing to diagnose. The device just stops working. A lot of >>>> the logs I got were in pstore by forcing a kernel panic. >>> >>> Here's how you capture the timing of functions. Each time the function >>> is called there will be an event in the trace buffer. >>> >>> ❯ sudo trace-cmd record -p function -l >>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l >>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l >>> amdgpu_pmops_resume >>> >>> Here's how you would review the report: >>> >>> ❯ trace-cmd report >>> cpus=24 >>> kworker/u97:37-18051 [001] ..... 13655.970108: function: >>> amdgpu_pmops_suspend <-- pci_pm_suspend >>> kworker/u97:21-18036 [002] ..... 13666.290715: function: >>> amdgpu_pmops_resume <-- dpm_run_callback >>> kworker/u97:21-18036 [015] ..... 13666.308295: function: >>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>> kworker/u97:21-18036 [015] ..... 13666.308298: function: >>> vpe_ring_end_use <-- vpe_ring_test_ring >>> kworker/15:1-12285 [015] ..... 13666.960191: function: >>> amdgpu_device_delayed_init_work_handler <-- process_one_work >>> kworker/15:1-12285 [015] ..... 13666.963970: function: >>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>> kworker/15:1-12285 [015] ..... 13666.965481: function: >>> vpe_ring_end_use <-- amdgpu_ib_schedule >>> kworker/15:4-16354 [015] ..... 13667.981394: function: >>> vpe_idle_work_handler <-- process_one_work >>> >>> I did this on a Strix system just now to capture that. >>> >>> You can see that basically the ring gets used before the delayed init >>> work handler, and then again from the ring tests. My concern is if the >>> sequence ever looks different than the above. If it does; we do have a >>> driver race condition. >>> >>> It would also be helpful to look at the function_graph tracer. >>> >>> Here's some more documentation about ftrace and trace-cmd. >>> https://www.kernel.org/doc/html/latest/trace/ftrace.html >>> https://lwn.net/Articles/410200/ >>> >>> You can probably also get an LLM to help you with building commands if >>> you're not familiar with it. >>> >>> But if you're hung so bad you can't flush to disk that's going to be a >>> problem without a UART. A few ideas: >> >> Some times it flushes to disk >> >>> 1) You can use CONFIG_PSTORE_FTRACE >> >> I can look into that >> >>> 2) If you add "tp_printk" to the kernel command line it should make the >>> trace ring buffer flush to kernel log ring buffer. But be warned this >>> is going to change the timing, the issue might go away entirely or have >>> a different failure rate. So hopefully <1> works. 
>>>> >>>> If you say that all IP blocks use 1s, perhaps an alternative solution >>>> would be to desync the idle times so they do not happen >>>> simultaneously. So 1000, 1200, 1400, etc. >>>> >>>> Antheas >>>> >>> >>> I don't dobut your your proposal of changing the timing works. I just >>> want to make sure it's the right solution because otherwise we might >>> change the timing or sequence elsewhere in the driver two years from now >>> and re-introduce the problem unintentionally. >> >> If there are other idle timers and only this one changes to 2s, I will >> agree and say that it would be peculiar. Although 1s seems arbitrary >> in any case. > > All of these timers are arbitrary. Their point is just to provide a > future point where we can check if the engine is idle. The idle work > handler will either power down the IP if it is idle or re-schedule in > the future and try again if there is still work. Making the value > longer will use more power as it will wait longer before checking if > the engine is idle. Making it shorter will save more power, but adds > extra overhead in that the engine will be powered up/down more often. > In most cases, the jobs should complete in a few ms. The timer is > there to avoid the overhead of powering up/down the block too > frequently when applications are using the engine. > > Alex We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or 1001c AGESA on reference system but unfortunately didn't reproduce the issue with a 200 cycle attempt on either kernel or either BIOS (so we had 800 cycles total). Was your base a bazzite kernel or was it an upstream kernel? I know there are some other patches in bazzite especially relevant to suspend, so I wonder if they could be influencing the timing. Can you repo on 6.17-rc3? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-26 19:19 ` Mario Limonciello @ 2025-08-26 19:21 ` Antheas Kapenekakis 2025-08-26 20:12 ` Matthew Schwartz 0 siblings, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-26 19:21 UTC (permalink / raw) To: Mario Limonciello Cc: Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote: > > On 8/26/2025 8:41 AM, Alex Deucher wrote: > > On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > >> > >> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: > >>> > >>> > >>> > >>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: > >>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: > >>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > >>>>>> suspend resumes result in a soft lock around 1 second after the screen > >>>>>> turns on (it freezes). This happens due to power gating VPE when it is > >>>>>> not used, which happens 1 second after inactivity. > >>>>>> > >>>>>> Specifically, the VPE gating after resume is as follows: an initial > >>>>>> ungate, followed by a gate in the resume process. Then, > >>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > >>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This > >>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled > >>>>>> with VPE_IDLE_TIMEOUT (1s). > >>>>>> > >>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the > >>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread > >>>>>> that called the command being stuck processing it. > >>>>>> > >>>>>> Specifically, after that SMU command tries to run, we get the following: > >>>>>> > >>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > >>>>>> ... > >>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > >>>>>> ... > >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > >>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > >>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > >>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > >>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
> >>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > >>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > >>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > >>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > >>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > >>>>>> > >>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > >>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > >>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > >>>>>> PowerDownVpe(50) command which is the common failure point in all > >>>>>> failed resumes. > >>>>>> > >>>>>> On a normal resume, we should get the following power gates: > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > >>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > >>>>>> > >>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > >>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > >>>>>> time of 12s sleep, 8s resume. > >>>>> > >>>>> When you say you reproduced with 12s sleep and 8s resume, was that > >>>>> 'amd-s2idle --duration 12 --wait 8'? > >>>> > >>>> I did not use amd-s2idle. I essentially used the script below with a > >>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for > >>>> this testing. > >>>> > >>>> for i in {1..200}; do > >>>> echo "Suspend attempt $i" > >>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm > >>>> sudo sh -c 'echo mem > /sys/power/state' > >>>> > >>>> for j in {1..50}; do > >>>> # Use repeating sleep in case echo mem returns early > >>>> sleep 1 > >>>> done > >>>> done > >>> > >>> 👍 > >>> > >>>> > >>>>>> The suspected reason here is that 1s that > >>>>>> when VPE is used, it needs a bit of time before it can be gated and > >>>>>> there was a borderline delay before, which is not enough for Strix Halo. > >>>>>> When the VPE is not used, such as on resume, gating it instantly does > >>>>>> not seem to cause issues. 
> >>>>>> > >>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > >>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > >>>>>> --- > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > >>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>>>>> index 121ee17b522b..24f09e457352 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>>>>> @@ -34,8 +34,8 @@ > >>>>>> /* VPE CSA resides in the 4th page of CSA */ > >>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > >>>>>> > >>>>>> -/* 1 second timeout */ > >>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > >>>>>> +/* 2 second timeout */ > >>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > >>>>>> > >>>>>> #define VPE_MAX_DPM_LEVEL 4 > >>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > >>>>>> > >>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > >>>>> > >>>>> 1s idle timeout has been used by other IPs for a long time. > >>>>> For example JPEG, UVD, VCN all use 1s. > >>>>> > >>>>> Can you please confirm both your AGESA and your SMU firmware version? > >>>>> In case you're not aware; you can get AGESA version from SMBIOS string > >>>>> (DMI type 40). > >>>>> > >>>>> ❯ sudo dmidecode | grep AGESA > >>>> > >>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c > >>>> > >>>>> You can get SMU firmware version from this: > >>>>> > >>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > >>>> > >>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > >>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 > >>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 > >>>> > >>> > >>> Thanks, I'll get some folks to see if we match this AGESA version if we > >>> can also reproduce it on reference hardware the same way you did. > >>> > >>>>> Are you on the most up to date firmware for your system from the > >>>>> manufacturer? > >>>> > >>>> I updated my bios, pd firmware, and USB device firmware early August, > >>>> when I was doing this testing. > >>>> > >>>>> We haven't seen anything like this reported on Strix Halo thus far and > >>>>> we do internal stress testing on s0i3 on reference hardware. > >>>> > >>>> Cant find a reference for it on the bug tracker. I have four bug > >>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 > >>>> for runtime crashes, where the culprit would be this. IE runtime gates > >>>> VPE and causes a crash. > >>> > >>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By > >>> what software? > >>> > >>> BTW - Strix and Kraken also have VPE. > >> > >> All on the Z13. Not tied to VPE necessarily. I just know that I get > >> reports of crashes on the Z13, and with this patch they are fixed for > >> me. It will be part of the next bazzite version so I will get feedback > >> about it. > >> > >> I don't think software that is using the VPE is relevant. Perhaps for > >> the runtime crashes it is and this patch helps in that case as well. > >> But in my case, the crash is caused after the ungate that runs the > >> tests on resume on the delayed handler. > >> > >> The Z13 also has some other quirks with spurious wakeups when > >> connected to a charger. So, if systemd is configured to e.g., sleep > >> after 20m, combined with this crash if it stays plugged in overnight > >> in the morning it has crashed. 
> >> > >>>> > >>>>> To me this seems likely to be a platform firmware bug; but I would like > >>>>> to understand the timing of the gate vs ungate on good vs bad. > >>>> > >>>> Perhaps it is. It is either something like that or silicon quality. > >>>> > >>>>> IE is it possible the delayed work handler > >>>>> amdgpu_device_delayed_init_work_handler() is causing a race with > >>>>> vpe_ring_begin_use()? > >>>> > >>>> I don't think so. There is only a single ungate. Also, the crash > >>>> happens on the gate. So what happens is the device wakes up, the > >>>> screen turns on, kde clock works, then after a second it freezes, > >>>> there is a softlock, and the device hangs. > >>>> > >>>> The failed command is always the VPE gate that is triggered after 1s in idle. > >>>> > >>>>> This should be possible to check without extra instrumentation by using > >>>>> ftrace and looking at the timing of the 2 ring functions and the init > >>>>> work handler and checking good vs bad cycles. > >>>> > >>>> I do not know how to use ftrace. I should also note that after the > >>>> device freezes around 1/5 cycles will sync the fs, so it is also not a > >>>> very easy thing to diagnose. The device just stops working. A lot of > >>>> the logs I got were in pstore by forcing a kernel panic. > >>> > >>> Here's how you capture the timing of functions. Each time the function > >>> is called there will be an event in the trace buffer. > >>> > >>> ❯ sudo trace-cmd record -p function -l > >>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l > >>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l > >>> amdgpu_pmops_resume > >>> > >>> Here's how you would review the report: > >>> > >>> ❯ trace-cmd report > >>> cpus=24 > >>> kworker/u97:37-18051 [001] ..... 13655.970108: function: > >>> amdgpu_pmops_suspend <-- pci_pm_suspend > >>> kworker/u97:21-18036 [002] ..... 13666.290715: function: > >>> amdgpu_pmops_resume <-- dpm_run_callback > >>> kworker/u97:21-18036 [015] ..... 13666.308295: function: > >>> vpe_ring_begin_use <-- amdgpu_ring_alloc > >>> kworker/u97:21-18036 [015] ..... 13666.308298: function: > >>> vpe_ring_end_use <-- vpe_ring_test_ring > >>> kworker/15:1-12285 [015] ..... 13666.960191: function: > >>> amdgpu_device_delayed_init_work_handler <-- process_one_work > >>> kworker/15:1-12285 [015] ..... 13666.963970: function: > >>> vpe_ring_begin_use <-- amdgpu_ring_alloc > >>> kworker/15:1-12285 [015] ..... 13666.965481: function: > >>> vpe_ring_end_use <-- amdgpu_ib_schedule > >>> kworker/15:4-16354 [015] ..... 13667.981394: function: > >>> vpe_idle_work_handler <-- process_one_work > >>> > >>> I did this on a Strix system just now to capture that. > >>> > >>> You can see that basically the ring gets used before the delayed init > >>> work handler, and then again from the ring tests. My concern is if the > >>> sequence ever looks different than the above. If it does; we do have a > >>> driver race condition. > >>> > >>> It would also be helpful to look at the function_graph tracer. > >>> > >>> Here's some more documentation about ftrace and trace-cmd. > >>> https://www.kernel.org/doc/html/latest/trace/ftrace.html > >>> https://lwn.net/Articles/410200/ > >>> > >>> You can probably also get an LLM to help you with building commands if > >>> you're not familiar with it. > >>> > >>> But if you're hung so bad you can't flush to disk that's going to be a > >>> problem without a UART. 
A few ideas: > >> > >> Some times it flushes to disk > >> > >>> 1) You can use CONFIG_PSTORE_FTRACE > >> > >> I can look into that > >> > >>> 2) If you add "tp_printk" to the kernel command line it should make the > >>> trace ring buffer flush to kernel log ring buffer. But be warned this > >>> is going to change the timing, the issue might go away entirely or have > >>> a different failure rate. So hopefully <1> works. > >>>> > >>>> If you say that all IP blocks use 1s, perhaps an alternative solution > >>>> would be to desync the idle times so they do not happen > >>>> simultaneously. So 1000, 1200, 1400, etc. > >>>> > >>>> Antheas > >>>> > >>> > >>> I don't dobut your your proposal of changing the timing works. I just > >>> want to make sure it's the right solution because otherwise we might > >>> change the timing or sequence elsewhere in the driver two years from now > >>> and re-introduce the problem unintentionally. > >> > >> If there are other idle timers and only this one changes to 2s, I will > >> agree and say that it would be peculiar. Although 1s seems arbitrary > >> in any case. > > > > All of these timers are arbitrary. Their point is just to provide a > > future point where we can check if the engine is idle. The idle work > > handler will either power down the IP if it is idle or re-schedule in > > the future and try again if there is still work. Making the value > > longer will use more power as it will wait longer before checking if > > the engine is idle. Making it shorter will save more power, but adds > > extra overhead in that the engine will be powered up/down more often. > > In most cases, the jobs should complete in a few ms. The timer is > > there to avoid the overhead of powering up/down the block too > > frequently when applications are using the engine. > > > > Alex > > We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or > 1001c AGESA on reference system but unfortunately didn't reproduce the > issue with a 200 cycle attempt on either kernel or either BIOS (so we > had 800 cycles total). I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back to you with 6.17-rc3. > Was your base a bazzite kernel or was it an upstream kernel? I know > there are some other patches in bazzite especially relevant to suspend, > so I wonder if they could be influencing the timing. > > Can you repo on 6.17-rc3? > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-26 19:21 ` Antheas Kapenekakis @ 2025-08-26 20:12 ` Matthew Schwartz 2025-08-26 20:58 ` Antheas Kapenekakis 0 siblings, 1 reply; 22+ messages in thread From: Matthew Schwartz @ 2025-08-26 20:12 UTC (permalink / raw) To: Antheas Kapenekakis Cc: Mario Limonciello, Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu > On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote: > > On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote: >> >> On 8/26/2025 8:41 AM, Alex Deucher wrote: >>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: >>>> >>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: >>>>> >>>>> >>>>> >>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: >>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: >>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>>>>>>> suspend resumes result in a soft lock around 1 second after the screen >>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is >>>>>>>> not used, which happens 1 second after inactivity. >>>>>>>> >>>>>>>> Specifically, the VPE gating after resume is as follows: an initial >>>>>>>> ungate, followed by a gate in the resume process. Then, >>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>>>>>>> with VPE_IDLE_TIMEOUT (1s). >>>>>>>> >>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>>>>>>> that called the command being stuck processing it. >>>>>>>> >>>>>>>> Specifically, after that SMU command tries to run, we get the following: >>>>>>>> >>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>>>>>>> ... >>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>>>>>>> ... >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. >>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! >>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>>>>>>> >>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. >>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the >>>>>>>> PowerDownVpe(50) command which is the common failure point in all >>>>>>>> failed resumes. >>>>>>>> >>>>>>>> On a normal resume, we should get the following power gates: >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>>>>>>> >>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>>>>>>> time of 12s sleep, 8s resume. >>>>>>> >>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that >>>>>>> 'amd-s2idle --duration 12 --wait 8'? >>>>>> >>>>>> I did not use amd-s2idle. I essentially used the script below with a >>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for >>>>>> this testing. >>>>>> >>>>>> for i in {1..200}; do >>>>>> echo "Suspend attempt $i" >>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm >>>>>> sudo sh -c 'echo mem > /sys/power/state' >>>>>> >>>>>> for j in {1..50}; do >>>>>> # Use repeating sleep in case echo mem returns early >>>>>> sleep 1 >>>>>> done >>>>>> done >>>>> >>>>> 👍 >>>>> >>>>>> >>>>>>>> The suspected reason here is that 1s that >>>>>>>> when VPE is used, it needs a bit of time before it can be gated and >>>>>>>> there was a borderline delay before, which is not enough for Strix Halo. >>>>>>>> When the VPE is not used, such as on resume, gating it instantly does >>>>>>>> not seem to cause issues. 
>>>>>>>> >>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") >>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- >>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>> index 121ee17b522b..24f09e457352 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>> @@ -34,8 +34,8 @@ >>>>>>>> /* VPE CSA resides in the 4th page of CSA */ >>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) >>>>>>>> >>>>>>>> -/* 1 second timeout */ >>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) >>>>>>>> +/* 2 second timeout */ >>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) >>>>>>>> >>>>>>>> #define VPE_MAX_DPM_LEVEL 4 >>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 >>>>>>>> >>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 >>>>>>> >>>>>>> 1s idle timeout has been used by other IPs for a long time. >>>>>>> For example JPEG, UVD, VCN all use 1s. >>>>>>> >>>>>>> Can you please confirm both your AGESA and your SMU firmware version? >>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string >>>>>>> (DMI type 40). >>>>>>> >>>>>>> ❯ sudo dmidecode | grep AGESA >>>>>> >>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c >>>>>> >>>>>>> You can get SMU firmware version from this: >>>>>>> >>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>> >>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 >>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 >>>>>> >>>>> >>>>> Thanks, I'll get some folks to see if we match this AGESA version if we >>>>> can also reproduce it on reference hardware the same way you did. >>>>> >>>>>>> Are you on the most up to date firmware for your system from the >>>>>>> manufacturer? >>>>>> >>>>>> I updated my bios, pd firmware, and USB device firmware early August, >>>>>> when I was doing this testing. >>>>>> >>>>>>> We haven't seen anything like this reported on Strix Halo thus far and >>>>>>> we do internal stress testing on s0i3 on reference hardware. >>>>>> >>>>>> Cant find a reference for it on the bug tracker. I have four bug >>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 >>>>>> for runtime crashes, where the culprit would be this. IE runtime gates >>>>>> VPE and causes a crash. >>>>> >>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By >>>>> what software? >>>>> >>>>> BTW - Strix and Kraken also have VPE. >>>> >>>> All on the Z13. Not tied to VPE necessarily. I just know that I get >>>> reports of crashes on the Z13, and with this patch they are fixed for >>>> me. It will be part of the next bazzite version so I will get feedback >>>> about it. >>>> >>>> I don't think software that is using the VPE is relevant. Perhaps for >>>> the runtime crashes it is and this patch helps in that case as well. >>>> But in my case, the crash is caused after the ungate that runs the >>>> tests on resume on the delayed handler. >>>> >>>> The Z13 also has some other quirks with spurious wakeups when >>>> connected to a charger. So, if systemd is configured to e.g., sleep >>>> after 20m, combined with this crash if it stays plugged in overnight >>>> in the morning it has crashed. 
>>>> >>>>>> >>>>>>> To me this seems likely to be a platform firmware bug; but I would like >>>>>>> to understand the timing of the gate vs ungate on good vs bad. >>>>>> >>>>>> Perhaps it is. It is either something like that or silicon quality. >>>>>> >>>>>>> IE is it possible the delayed work handler >>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with >>>>>>> vpe_ring_begin_use()? >>>>>> >>>>>> I don't think so. There is only a single ungate. Also, the crash >>>>>> happens on the gate. So what happens is the device wakes up, the >>>>>> screen turns on, kde clock works, then after a second it freezes, >>>>>> there is a softlock, and the device hangs. >>>>>> >>>>>> The failed command is always the VPE gate that is triggered after 1s in idle. >>>>>> >>>>>>> This should be possible to check without extra instrumentation by using >>>>>>> ftrace and looking at the timing of the 2 ring functions and the init >>>>>>> work handler and checking good vs bad cycles. >>>>>> >>>>>> I do not know how to use ftrace. I should also note that after the >>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a >>>>>> very easy thing to diagnose. The device just stops working. A lot of >>>>>> the logs I got were in pstore by forcing a kernel panic. >>>>> >>>>> Here's how you capture the timing of functions. Each time the function >>>>> is called there will be an event in the trace buffer. >>>>> >>>>> ❯ sudo trace-cmd record -p function -l >>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l >>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l >>>>> amdgpu_pmops_resume >>>>> >>>>> Here's how you would review the report: >>>>> >>>>> ❯ trace-cmd report >>>>> cpus=24 >>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function: >>>>> amdgpu_pmops_suspend <-- pci_pm_suspend >>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function: >>>>> amdgpu_pmops_resume <-- dpm_run_callback >>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function: >>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function: >>>>> vpe_ring_end_use <-- vpe_ring_test_ring >>>>> kworker/15:1-12285 [015] ..... 13666.960191: function: >>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work >>>>> kworker/15:1-12285 [015] ..... 13666.963970: function: >>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>> kworker/15:1-12285 [015] ..... 13666.965481: function: >>>>> vpe_ring_end_use <-- amdgpu_ib_schedule >>>>> kworker/15:4-16354 [015] ..... 13667.981394: function: >>>>> vpe_idle_work_handler <-- process_one_work >>>>> >>>>> I did this on a Strix system just now to capture that. >>>>> >>>>> You can see that basically the ring gets used before the delayed init >>>>> work handler, and then again from the ring tests. My concern is if the >>>>> sequence ever looks different than the above. If it does; we do have a >>>>> driver race condition. >>>>> >>>>> It would also be helpful to look at the function_graph tracer. >>>>> >>>>> Here's some more documentation about ftrace and trace-cmd. >>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html >>>>> https://lwn.net/Articles/410200/ >>>>> >>>>> You can probably also get an LLM to help you with building commands if >>>>> you're not familiar with it. >>>>> >>>>> But if you're hung so bad you can't flush to disk that's going to be a >>>>> problem without a UART. 
A few ideas: >>>> >>>> Some times it flushes to disk >>>> >>>>> 1) You can use CONFIG_PSTORE_FTRACE >>>> >>>> I can look into that >>>> >>>>> 2) If you add "tp_printk" to the kernel command line it should make the >>>>> trace ring buffer flush to kernel log ring buffer. But be warned this >>>>> is going to change the timing, the issue might go away entirely or have >>>>> a different failure rate. So hopefully <1> works. >>>>>> >>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution >>>>>> would be to desync the idle times so they do not happen >>>>>> simultaneously. So 1000, 1200, 1400, etc. >>>>>> >>>>>> Antheas >>>>>> >>>>> >>>>> I don't dobut your your proposal of changing the timing works. I just >>>>> want to make sure it's the right solution because otherwise we might >>>>> change the timing or sequence elsewhere in the driver two years from now >>>>> and re-introduce the problem unintentionally. >>>> >>>> If there are other idle timers and only this one changes to 2s, I will >>>> agree and say that it would be peculiar. Although 1s seems arbitrary >>>> in any case. >>> >>> All of these timers are arbitrary. Their point is just to provide a >>> future point where we can check if the engine is idle. The idle work >>> handler will either power down the IP if it is idle or re-schedule in >>> the future and try again if there is still work. Making the value >>> longer will use more power as it will wait longer before checking if >>> the engine is idle. Making it shorter will save more power, but adds >>> extra overhead in that the engine will be powered up/down more often. >>> In most cases, the jobs should complete in a few ms. The timer is >>> there to avoid the overhead of powering up/down the block too >>> frequently when applications are using the engine. >>> >>> Alex >> >> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or >> 1001c AGESA on reference system but unfortunately didn't reproduce the >> issue with a 200 cycle attempt on either kernel or either BIOS (so we >> had 800 cycles total). > > I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back > to you with 6.17-rc3. I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting. Matt > >> Was your base a bazzite kernel or was it an upstream kernel? I know >> there are some other patches in bazzite especially relevant to suspend, >> so I wonder if they could be influencing the timing. >> >> Can you repo on 6.17-rc3? >> > > ^ permalink raw reply [flat|nested] 22+ messages in thread
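For reference, a minimal sketch of the reproduction loop discussed above, with the 12 s wakealarm and 12 s wait that Antheas says he actually used substituted into his quoted script (rtc0, s2idle via /sys/power/state, and root privileges are assumed, exactly as in the original script):

for i in {1..200}; do
    echo "Suspend attempt $i"
    # program the RTC to fire 12 seconds from now, then enter suspend
    echo $(date '+%s' -d '+ 12 seconds') | sudo tee /sys/class/rtc/rtc0/wakealarm
    sudo sh -c 'echo mem > /sys/power/state'
    # stay awake ~12 seconds; repeated short sleeps in case 'echo mem' returns early
    for j in {1..12}; do
        sleep 1
    done
done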
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-26 20:12 ` Matthew Schwartz @ 2025-08-26 20:58 ` Antheas Kapenekakis 2025-08-27 0:50 ` Matthew Schwartz 0 siblings, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-26 20:58 UTC (permalink / raw) To: Matthew Schwartz Cc: Mario Limonciello, Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz <matthew.schwartz@linux.dev> wrote: > > > > > On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote: > > > > On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote: > >> > >> On 8/26/2025 8:41 AM, Alex Deucher wrote: > >>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > >>>> > >>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: > >>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: > >>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > >>>>>>>> suspend resumes result in a soft lock around 1 second after the screen > >>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is > >>>>>>>> not used, which happens 1 second after inactivity. > >>>>>>>> > >>>>>>>> Specifically, the VPE gating after resume is as follows: an initial > >>>>>>>> ungate, followed by a gate in the resume process. Then, > >>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > >>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This > >>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled > >>>>>>>> with VPE_IDLE_TIMEOUT (1s). > >>>>>>>> > >>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the > >>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread > >>>>>>>> that called the command being stuck processing it. > >>>>>>>> > >>>>>>>> Specifically, after that SMU command tries to run, we get the following: > >>>>>>>> > >>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > >>>>>>>> ... > >>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > >>>>>>>> ... > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > >>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > >>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > >>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > >>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. 
> >>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > >>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > >>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > >>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > >>>>>>>> > >>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > >>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > >>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > >>>>>>>> PowerDownVpe(50) command which is the common failure point in all > >>>>>>>> failed resumes. > >>>>>>>> > >>>>>>>> On a normal resume, we should get the following power gates: > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > >>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > >>>>>>>> > >>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > >>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > >>>>>>>> time of 12s sleep, 8s resume. > >>>>>>> > >>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that > >>>>>>> 'amd-s2idle --duration 12 --wait 8'? > >>>>>> > >>>>>> I did not use amd-s2idle. I essentially used the script below with a > >>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for > >>>>>> this testing. > >>>>>> > >>>>>> for i in {1..200}; do > >>>>>> echo "Suspend attempt $i" > >>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm > >>>>>> sudo sh -c 'echo mem > /sys/power/state' > >>>>>> > >>>>>> for j in {1..50}; do > >>>>>> # Use repeating sleep in case echo mem returns early > >>>>>> sleep 1 > >>>>>> done > >>>>>> done > >>>>> > >>>>> 👍 > >>>>> > >>>>>> > >>>>>>>> The suspected reason here is that 1s that > >>>>>>>> when VPE is used, it needs a bit of time before it can be gated and > >>>>>>>> there was a borderline delay before, which is not enough for Strix Halo. > >>>>>>>> When the VPE is not used, such as on resume, gating it instantly does > >>>>>>>> not seem to cause issues. 
> >>>>>>>> > >>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > >>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > >>>>>>>> --- > >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > >>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>>>>>>> index 121ee17b522b..24f09e457352 100644 > >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > >>>>>>>> @@ -34,8 +34,8 @@ > >>>>>>>> /* VPE CSA resides in the 4th page of CSA */ > >>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > >>>>>>>> > >>>>>>>> -/* 1 second timeout */ > >>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > >>>>>>>> +/* 2 second timeout */ > >>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > >>>>>>>> > >>>>>>>> #define VPE_MAX_DPM_LEVEL 4 > >>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > >>>>>>>> > >>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > >>>>>>> > >>>>>>> 1s idle timeout has been used by other IPs for a long time. > >>>>>>> For example JPEG, UVD, VCN all use 1s. > >>>>>>> > >>>>>>> Can you please confirm both your AGESA and your SMU firmware version? > >>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string > >>>>>>> (DMI type 40). > >>>>>>> > >>>>>>> ❯ sudo dmidecode | grep AGESA > >>>>>> > >>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c > >>>>>> > >>>>>>> You can get SMU firmware version from this: > >>>>>>> > >>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > >>>>>> > >>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* > >>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 > >>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 > >>>>>> > >>>>> > >>>>> Thanks, I'll get some folks to see if we match this AGESA version if we > >>>>> can also reproduce it on reference hardware the same way you did. > >>>>> > >>>>>>> Are you on the most up to date firmware for your system from the > >>>>>>> manufacturer? > >>>>>> > >>>>>> I updated my bios, pd firmware, and USB device firmware early August, > >>>>>> when I was doing this testing. > >>>>>> > >>>>>>> We haven't seen anything like this reported on Strix Halo thus far and > >>>>>>> we do internal stress testing on s0i3 on reference hardware. > >>>>>> > >>>>>> Cant find a reference for it on the bug tracker. I have four bug > >>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 > >>>>>> for runtime crashes, where the culprit would be this. IE runtime gates > >>>>>> VPE and causes a crash. > >>>>> > >>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By > >>>>> what software? > >>>>> > >>>>> BTW - Strix and Kraken also have VPE. > >>>> > >>>> All on the Z13. Not tied to VPE necessarily. I just know that I get > >>>> reports of crashes on the Z13, and with this patch they are fixed for > >>>> me. It will be part of the next bazzite version so I will get feedback > >>>> about it. > >>>> > >>>> I don't think software that is using the VPE is relevant. Perhaps for > >>>> the runtime crashes it is and this patch helps in that case as well. > >>>> But in my case, the crash is caused after the ungate that runs the > >>>> tests on resume on the delayed handler. > >>>> > >>>> The Z13 also has some other quirks with spurious wakeups when > >>>> connected to a charger. 
So, if systemd is configured to e.g., sleep > >>>> after 20m, combined with this crash if it stays plugged in overnight > >>>> in the morning it has crashed. > >>>> > >>>>>> > >>>>>>> To me this seems likely to be a platform firmware bug; but I would like > >>>>>>> to understand the timing of the gate vs ungate on good vs bad. > >>>>>> > >>>>>> Perhaps it is. It is either something like that or silicon quality. > >>>>>> > >>>>>>> IE is it possible the delayed work handler > >>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with > >>>>>>> vpe_ring_begin_use()? > >>>>>> > >>>>>> I don't think so. There is only a single ungate. Also, the crash > >>>>>> happens on the gate. So what happens is the device wakes up, the > >>>>>> screen turns on, kde clock works, then after a second it freezes, > >>>>>> there is a softlock, and the device hangs. > >>>>>> > >>>>>> The failed command is always the VPE gate that is triggered after 1s in idle. > >>>>>> > >>>>>>> This should be possible to check without extra instrumentation by using > >>>>>>> ftrace and looking at the timing of the 2 ring functions and the init > >>>>>>> work handler and checking good vs bad cycles. > >>>>>> > >>>>>> I do not know how to use ftrace. I should also note that after the > >>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a > >>>>>> very easy thing to diagnose. The device just stops working. A lot of > >>>>>> the logs I got were in pstore by forcing a kernel panic. > >>>>> > >>>>> Here's how you capture the timing of functions. Each time the function > >>>>> is called there will be an event in the trace buffer. > >>>>> > >>>>> ❯ sudo trace-cmd record -p function -l > >>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l > >>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l > >>>>> amdgpu_pmops_resume > >>>>> > >>>>> Here's how you would review the report: > >>>>> > >>>>> ❯ trace-cmd report > >>>>> cpus=24 > >>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function: > >>>>> amdgpu_pmops_suspend <-- pci_pm_suspend > >>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function: > >>>>> amdgpu_pmops_resume <-- dpm_run_callback > >>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function: > >>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc > >>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function: > >>>>> vpe_ring_end_use <-- vpe_ring_test_ring > >>>>> kworker/15:1-12285 [015] ..... 13666.960191: function: > >>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work > >>>>> kworker/15:1-12285 [015] ..... 13666.963970: function: > >>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc > >>>>> kworker/15:1-12285 [015] ..... 13666.965481: function: > >>>>> vpe_ring_end_use <-- amdgpu_ib_schedule > >>>>> kworker/15:4-16354 [015] ..... 13667.981394: function: > >>>>> vpe_idle_work_handler <-- process_one_work > >>>>> > >>>>> I did this on a Strix system just now to capture that. > >>>>> > >>>>> You can see that basically the ring gets used before the delayed init > >>>>> work handler, and then again from the ring tests. My concern is if the > >>>>> sequence ever looks different than the above. If it does; we do have a > >>>>> driver race condition. > >>>>> > >>>>> It would also be helpful to look at the function_graph tracer. > >>>>> > >>>>> Here's some more documentation about ftrace and trace-cmd. 
> >>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html > >>>>> https://lwn.net/Articles/410200/ > >>>>> > >>>>> You can probably also get an LLM to help you with building commands if > >>>>> you're not familiar with it. > >>>>> > >>>>> But if you're hung so bad you can't flush to disk that's going to be a > >>>>> problem without a UART. A few ideas: > >>>> > >>>> Some times it flushes to disk > >>>> > >>>>> 1) You can use CONFIG_PSTORE_FTRACE > >>>> > >>>> I can look into that > >>>> > >>>>> 2) If you add "tp_printk" to the kernel command line it should make the > >>>>> trace ring buffer flush to kernel log ring buffer. But be warned this > >>>>> is going to change the timing, the issue might go away entirely or have > >>>>> a different failure rate. So hopefully <1> works. > >>>>>> > >>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution > >>>>>> would be to desync the idle times so they do not happen > >>>>>> simultaneously. So 1000, 1200, 1400, etc. > >>>>>> > >>>>>> Antheas > >>>>>> > >>>>> > >>>>> I don't dobut your your proposal of changing the timing works. I just > >>>>> want to make sure it's the right solution because otherwise we might > >>>>> change the timing or sequence elsewhere in the driver two years from now > >>>>> and re-introduce the problem unintentionally. > >>>> > >>>> If there are other idle timers and only this one changes to 2s, I will > >>>> agree and say that it would be peculiar. Although 1s seems arbitrary > >>>> in any case. > >>> > >>> All of these timers are arbitrary. Their point is just to provide a > >>> future point where we can check if the engine is idle. The idle work > >>> handler will either power down the IP if it is idle or re-schedule in > >>> the future and try again if there is still work. Making the value > >>> longer will use more power as it will wait longer before checking if > >>> the engine is idle. Making it shorter will save more power, but adds > >>> extra overhead in that the engine will be powered up/down more often. > >>> In most cases, the jobs should complete in a few ms. The timer is > >>> there to avoid the overhead of powering up/down the block too > >>> frequently when applications are using the engine. > >>> > >>> Alex > >> > >> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or > >> 1001c AGESA on reference system but unfortunately didn't reproduce the > >> issue with a 200 cycle attempt on either kernel or either BIOS (so we > >> had 800 cycles total). > > > > I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back > > to you with 6.17-rc3. > > I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting. > > Matt Mine does not want to get reproduced right now. I will have to try later. 
You will need these kernel arguments:
efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000

Here are some logging commands before the for loop
# clear pstore
sudo bash -c "rm -rf /sys/fs/pstore/*"

# https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/

# Runtime logs
# echo 1 | sudo tee /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable
# echo 1 | sudo tee /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable
# echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on

# Enable panics on lockups
echo 255 | sudo tee /proc/sys/kernel/sysrq
echo 1 | sudo tee /proc/sys/kernel/softlockup_panic
echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic
echo 1 | sudo tee /proc/sys/kernel/panic_on_oops
echo 5 | sudo tee /proc/sys/kernel/panic
# echo 64 | sudo tee /proc/sys/kernel/panic_print

# Enable these for hangs, shows Thread on hangs
# echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace
# echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace

# Enable pstore logging on panics
# Needs kernel param:
# efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000
# First enables, second sets the size to fit all cpus in case of a panic
echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers
echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump

# Enable dynamic debug for various kernel components
sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF
file drivers/acpi/x86/s2idle.c +p
file drivers/pinctrl/pinctrl-amd.c +p
file drivers/platform/x86/amd/pmc.c +p
file drivers/pci/pci-driver.c +p
file drivers/input/serio/* +p
file drivers/gpu/drm/amd/pm/* +p
file drivers/gpu/drm/amd/pm/swsmu/* +p
EOF
# file drivers/acpi/ec.c +p
# file drivers/gpu/drm/amd/* +p
# file drivers/gpu/drm/amd/display/dc/core/* -p

# Additional debugging for suspend/resume
echo 1 | sudo tee /sys/power/pm_debug_messages

Here is how to reconstruct the log:
rm -rf crash && mkdir crash
sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash"
sudo bash -c "rm -rf /sys/fs/pstore/*"
cat $(find crash/ -name "dmesg-*" | tac) > crash.txt

Antheas
> >
> >> Was your base a bazzite kernel or was it an upstream kernel? I know
> >> there are some other patches in bazzite especially relevant to suspend,
> >> so I wonder if they could be influencing the timing.
> >>
> >> Can you repo on 6.17-rc3?
> >>
> >
> >

^ permalink raw reply [flat|nested] 22+ messages in thread
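As a quick usage sketch (assuming a hang was captured and crash.txt was rebuilt with the commands above), grepping the reconstructed log for the signatures seen in the failed resumes narrows a long run down to the bad cycles:

# look for the stuck PowerDownVpe gate and the follow-on SMU errors
grep -nE "PowerDownVpe|not done with your previous command|Failed to power gate" crash.txt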
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-26 20:58 ` Antheas Kapenekakis @ 2025-08-27 0:50 ` Matthew Schwartz 2025-08-27 2:37 ` Lee, Peyton 0 siblings, 1 reply; 22+ messages in thread From: Matthew Schwartz @ 2025-08-27 0:50 UTC (permalink / raw) To: Antheas Kapenekakis Cc: Mario Limonciello, Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On 8/26/25 1:58 PM, Antheas Kapenekakis wrote: > On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz > <matthew.schwartz@linux.dev> wrote: >> >> >> >>> On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote: >>> >>> On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote: >>>> >>>> On 8/26/2025 8:41 AM, Alex Deucher wrote: >>>>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: >>>>>> >>>>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: >>>>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: >>>>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>>>>>>>>> suspend resumes result in a soft lock around 1 second after the screen >>>>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is >>>>>>>>>> not used, which happens 1 second after inactivity. >>>>>>>>>> >>>>>>>>>> Specifically, the VPE gating after resume is as follows: an initial >>>>>>>>>> ungate, followed by a gate in the resume process. Then, >>>>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>>>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>>>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>>>>>>>>> with VPE_IDLE_TIMEOUT (1s). >>>>>>>>>> >>>>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>>>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>>>>>>>>> that called the command being stuck processing it. >>>>>>>>>> >>>>>>>>>> Specifically, after that SMU command tries to run, we get the following: >>>>>>>>>> >>>>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>>>>>>>>> ... >>>>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>>>>>>>>> ... >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>>>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. >>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>>>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! 
>>>>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. >>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>>>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>>>>>>>>> >>>>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. >>>>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>>>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the >>>>>>>>>> PowerDownVpe(50) command which is the common failure point in all >>>>>>>>>> failed resumes. >>>>>>>>>> >>>>>>>>>> On a normal resume, we should get the following power gates: >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> >>>>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>>>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>>>>>>>>> time of 12s sleep, 8s resume. >>>>>>>>> >>>>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that >>>>>>>>> 'amd-s2idle --duration 12 --wait 8'? >>>>>>>> >>>>>>>> I did not use amd-s2idle. I essentially used the script below with a >>>>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for >>>>>>>> this testing. >>>>>>>> >>>>>>>> for i in {1..200}; do >>>>>>>> echo "Suspend attempt $i" >>>>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm >>>>>>>> sudo sh -c 'echo mem > /sys/power/state' >>>>>>>> >>>>>>>> for j in {1..50}; do >>>>>>>> # Use repeating sleep in case echo mem returns early >>>>>>>> sleep 1 >>>>>>>> done >>>>>>>> done >>>>>>> >>>>>>> 👍 >>>>>>> >>>>>>>> >>>>>>>>>> The suspected reason here is that 1s that >>>>>>>>>> when VPE is used, it needs a bit of time before it can be gated and >>>>>>>>>> there was a borderline delay before, which is not enough for Strix Halo. >>>>>>>>>> When the VPE is not used, such as on resume, gating it instantly does >>>>>>>>>> not seem to cause issues. 
>>>>>>>>>> >>>>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") >>>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> >>>>>>>>>> --- >>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- >>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>> index 121ee17b522b..24f09e457352 100644 >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>> @@ -34,8 +34,8 @@ >>>>>>>>>> /* VPE CSA resides in the 4th page of CSA */ >>>>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) >>>>>>>>>> >>>>>>>>>> -/* 1 second timeout */ >>>>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) >>>>>>>>>> +/* 2 second timeout */ >>>>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) >>>>>>>>>> >>>>>>>>>> #define VPE_MAX_DPM_LEVEL 4 >>>>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 >>>>>>>>>> >>>>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 >>>>>>>>> >>>>>>>>> 1s idle timeout has been used by other IPs for a long time. >>>>>>>>> For example JPEG, UVD, VCN all use 1s. >>>>>>>>> >>>>>>>>> Can you please confirm both your AGESA and your SMU firmware version? >>>>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string >>>>>>>>> (DMI type 40). >>>>>>>>> >>>>>>>>> ❯ sudo dmidecode | grep AGESA >>>>>>>> >>>>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c >>>>>>>> >>>>>>>>> You can get SMU firmware version from this: >>>>>>>>> >>>>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>>>> >>>>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 >>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 >>>>>>>> >>>>>>> >>>>>>> Thanks, I'll get some folks to see if we match this AGESA version if we >>>>>>> can also reproduce it on reference hardware the same way you did. >>>>>>> >>>>>>>>> Are you on the most up to date firmware for your system from the >>>>>>>>> manufacturer? >>>>>>>> >>>>>>>> I updated my bios, pd firmware, and USB device firmware early August, >>>>>>>> when I was doing this testing. >>>>>>>> >>>>>>>>> We haven't seen anything like this reported on Strix Halo thus far and >>>>>>>>> we do internal stress testing on s0i3 on reference hardware. >>>>>>>> >>>>>>>> Cant find a reference for it on the bug tracker. I have four bug >>>>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 >>>>>>>> for runtime crashes, where the culprit would be this. IE runtime gates >>>>>>>> VPE and causes a crash. >>>>>>> >>>>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By >>>>>>> what software? >>>>>>> >>>>>>> BTW - Strix and Kraken also have VPE. >>>>>> >>>>>> All on the Z13. Not tied to VPE necessarily. I just know that I get >>>>>> reports of crashes on the Z13, and with this patch they are fixed for >>>>>> me. It will be part of the next bazzite version so I will get feedback >>>>>> about it. >>>>>> >>>>>> I don't think software that is using the VPE is relevant. Perhaps for >>>>>> the runtime crashes it is and this patch helps in that case as well. >>>>>> But in my case, the crash is caused after the ungate that runs the >>>>>> tests on resume on the delayed handler. >>>>>> >>>>>> The Z13 also has some other quirks with spurious wakeups when >>>>>> connected to a charger. 
So, if systemd is configured to e.g., sleep >>>>>> after 20m, combined with this crash if it stays plugged in overnight >>>>>> in the morning it has crashed. >>>>>> >>>>>>>> >>>>>>>>> To me this seems likely to be a platform firmware bug; but I would like >>>>>>>>> to understand the timing of the gate vs ungate on good vs bad. >>>>>>>> >>>>>>>> Perhaps it is. It is either something like that or silicon quality. >>>>>>>> >>>>>>>>> IE is it possible the delayed work handler >>>>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with >>>>>>>>> vpe_ring_begin_use()? >>>>>>>> >>>>>>>> I don't think so. There is only a single ungate. Also, the crash >>>>>>>> happens on the gate. So what happens is the device wakes up, the >>>>>>>> screen turns on, kde clock works, then after a second it freezes, >>>>>>>> there is a softlock, and the device hangs. >>>>>>>> >>>>>>>> The failed command is always the VPE gate that is triggered after 1s in idle. >>>>>>>> >>>>>>>>> This should be possible to check without extra instrumentation by using >>>>>>>>> ftrace and looking at the timing of the 2 ring functions and the init >>>>>>>>> work handler and checking good vs bad cycles. >>>>>>>> >>>>>>>> I do not know how to use ftrace. I should also note that after the >>>>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a >>>>>>>> very easy thing to diagnose. The device just stops working. A lot of >>>>>>>> the logs I got were in pstore by forcing a kernel panic. >>>>>>> >>>>>>> Here's how you capture the timing of functions. Each time the function >>>>>>> is called there will be an event in the trace buffer. >>>>>>> >>>>>>> ❯ sudo trace-cmd record -p function -l >>>>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l >>>>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l >>>>>>> amdgpu_pmops_resume >>>>>>> >>>>>>> Here's how you would review the report: >>>>>>> >>>>>>> ❯ trace-cmd report >>>>>>> cpus=24 >>>>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function: >>>>>>> amdgpu_pmops_suspend <-- pci_pm_suspend >>>>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function: >>>>>>> amdgpu_pmops_resume <-- dpm_run_callback >>>>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function: >>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function: >>>>>>> vpe_ring_end_use <-- vpe_ring_test_ring >>>>>>> kworker/15:1-12285 [015] ..... 13666.960191: function: >>>>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work >>>>>>> kworker/15:1-12285 [015] ..... 13666.963970: function: >>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>>>> kworker/15:1-12285 [015] ..... 13666.965481: function: >>>>>>> vpe_ring_end_use <-- amdgpu_ib_schedule >>>>>>> kworker/15:4-16354 [015] ..... 13667.981394: function: >>>>>>> vpe_idle_work_handler <-- process_one_work >>>>>>> >>>>>>> I did this on a Strix system just now to capture that. >>>>>>> >>>>>>> You can see that basically the ring gets used before the delayed init >>>>>>> work handler, and then again from the ring tests. My concern is if the >>>>>>> sequence ever looks different than the above. If it does; we do have a >>>>>>> driver race condition. >>>>>>> >>>>>>> It would also be helpful to look at the function_graph tracer. >>>>>>> >>>>>>> Here's some more documentation about ftrace and trace-cmd. 
>>>>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html >>>>>>> https://lwn.net/Articles/410200/ >>>>>>> >>>>>>> You can probably also get an LLM to help you with building commands if >>>>>>> you're not familiar with it. >>>>>>> >>>>>>> But if you're hung so bad you can't flush to disk that's going to be a >>>>>>> problem without a UART. A few ideas: >>>>>> >>>>>> Some times it flushes to disk >>>>>> >>>>>>> 1) You can use CONFIG_PSTORE_FTRACE >>>>>> >>>>>> I can look into that >>>>>> >>>>>>> 2) If you add "tp_printk" to the kernel command line it should make the >>>>>>> trace ring buffer flush to kernel log ring buffer. But be warned this >>>>>>> is going to change the timing, the issue might go away entirely or have >>>>>>> a different failure rate. So hopefully <1> works. >>>>>>>> >>>>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution >>>>>>>> would be to desync the idle times so they do not happen >>>>>>>> simultaneously. So 1000, 1200, 1400, etc. >>>>>>>> >>>>>>>> Antheas >>>>>>>> >>>>>>> >>>>>>> I don't dobut your your proposal of changing the timing works. I just >>>>>>> want to make sure it's the right solution because otherwise we might >>>>>>> change the timing or sequence elsewhere in the driver two years from now >>>>>>> and re-introduce the problem unintentionally. >>>>>> >>>>>> If there are other idle timers and only this one changes to 2s, I will >>>>>> agree and say that it would be peculiar. Although 1s seems arbitrary >>>>>> in any case. >>>>> >>>>> All of these timers are arbitrary. Their point is just to provide a >>>>> future point where we can check if the engine is idle. The idle work >>>>> handler will either power down the IP if it is idle or re-schedule in >>>>> the future and try again if there is still work. Making the value >>>>> longer will use more power as it will wait longer before checking if >>>>> the engine is idle. Making it shorter will save more power, but adds >>>>> extra overhead in that the engine will be powered up/down more often. >>>>> In most cases, the jobs should complete in a few ms. The timer is >>>>> there to avoid the overhead of powering up/down the block too >>>>> frequently when applications are using the engine. >>>>> >>>>> Alex >>>> >>>> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or >>>> 1001c AGESA on reference system but unfortunately didn't reproduce the >>>> issue with a 200 cycle attempt on either kernel or either BIOS (so we >>>> had 800 cycles total). >>> >>> I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back >>> to you with 6.17-rc3. >> >> I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting. >> >> Matt > > Mine does not want to get reproduced right now. I will have to try later. 
>
> You will need these kernel arguments:
> efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000
>
> Here are some logging commands before the for loop
> # clear pstore
> sudo bash -c "rm -rf /sys/fs/pstore/*"
>
> # https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/
>
> # Runtime logs
> # echo 1 | sudo tee
> /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable
> # echo 1 | sudo tee
> /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable
> # echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on
>
> # Enable panics on lockups
> echo 255 | sudo tee /proc/sys/kernel/sysrq
> echo 1 | sudo tee /proc/sys/kernel/softlockup_panic
> echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic
> echo 1 | sudo tee /proc/sys/kernel/panic_on_oops
> echo 5 | sudo tee /proc/sys/kernel/panic
> # echo 64 | sudo tee /proc/sys/kernel/panic_print
>
> # Enable these for hangs, shows Thread on hangs
> # echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace
> # echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace
>
> # Enable pstore logging on panics
> # Needs kernel param:
> # efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000
> # First enables, second sets the size to fit all cpus in case of a panic
> echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers
> echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump
>
> # Enable dynamic debug for various kernel components
> sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF
> file drivers/acpi/x86/s2idle.c +p
> file drivers/pinctrl/pinctrl-amd.c +p
> file drivers/platform/x86/amd/pmc.c +p
> file drivers/pci/pci-driver.c +p
> file drivers/input/serio/* +p
> file drivers/gpu/drm/amd/pm/* +p
> file drivers/gpu/drm/amd/pm/swsmu/* +p
> EOF
> # file drivers/acpi/ec.c +p
> # file drivers/gpu/drm/amd/* +p
> # file drivers/gpu/drm/amd/display/dc/core/* -p
>
> # Additional debugging for suspend/resume
> echo 1 | sudo tee /sys/power/pm_debug_messages

So I ran the commands that you gave above while connected over ssh, and I
could actually still interact with the system after the amdgpu failures
started. Your suspend script also kept running for a while because of this,
and pstore was not necessary. My dmesg looks very similar to the snippet you
posted in the patch contents. Full dmesg is here:
https://gist.github.com/matte-schwartz/9ad4b925866d9228923e909618d045d9

I was able to run trace-cmd as Mario suggested, but nothing seemed out of order:

❯ trace-cmd report
kworker/22:6-9326 [022] ..... 4003.204988: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/22:6-9326 [022] ..... 4003.209383: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/22:6-9326 [022] ..... 4003.210152: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/22:6-9326 [022] ..... 4004.263841: function: vpe_idle_work_handler <-- process_one_work
kworker/u129:6-530 [001] ..... 4053.545634: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u129:18-4060 [002] ..... 4114.908515: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u129:18-4060 [023] ..... 4114.931055: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u129:18-4060 [023] ..... 4114.931057: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/7:5-5733 [007] ..... 4115.198936: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/7:5-5733 [007] ..... 4115.203185: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/7:5-5733 [007] ..... 4115.204141: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/7:0-7950 [007] ..... 4116.253971: function: vpe_idle_work_handler <-- process_one_work
kworker/u129:41-4083 [001] ..... 4165.539388: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u129:58-4100 [001] ..... 4226.906561: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u129:58-4100 [022] ..... 4226.927900: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u129:58-4100 [022] ..... 4226.927902: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/7:0-7950 [007] ..... 4227.193678: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/7:0-7950 [007] ..... 4227.197604: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/7:0-7950 [007] ..... 4227.201691: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/7:0-7950 [007] ..... 4228.240479: function: vpe_idle_work_handler <-- process_one_work

I have not tested the kernel patch yet, so that will be my next step.

>
> Here is how to reconstruct the log:
> rm -rf crash && mkdir crash
> sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash"
> sudo bash -c "rm -rf /sys/fs/pstore/*"
> cat $(find crash/ -name "dmesg-*" | tac) > crash.txt
>
> Antheas
>>>
>>>> Was your base a bazzite kernel or was it an upstream kernel? I know
>>>> there are some other patches in bazzite especially relevant to suspend,
>>>> so I wonder if they could be influencing the timing.
>>>>
>>>> Can you repo on 6.17-rc3?
>>>>
>>>
>>>
>>
>>
>

^ permalink raw reply [flat|nested] 22+ messages in thread
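To follow up on the function_graph suggestion earlier in the thread, one possible capture is sketched below; it assumes the same symbols that the function tracer filtered on are also listed in /sys/kernel/debug/tracing/available_filter_functions:

# trace until interrupted, graphing only the two work handlers of interest
sudo trace-cmd record -p function_graph \
    -g amdgpu_device_delayed_init_work_handler \
    -g vpe_idle_work_handler
# press Ctrl-C after a few suspend/resume cycles, then inspect the timings
trace-cmd report | less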
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-27 0:50 ` Matthew Schwartz @ 2025-08-27 2:37 ` Lee, Peyton 2025-08-27 15:42 ` Matthew Schwartz 0 siblings, 1 reply; 22+ messages in thread From: Lee, Peyton @ 2025-08-27 2:37 UTC (permalink / raw) To: Matthew Schwartz, Antheas Kapenekakis Cc: Mario Limonciello, Alex Deucher, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, Deucher, Alexander, Koenig, Christian, David Airlie, Simona Vetter, Wentland, Harry, Rodrigo Siqueira, Limonciello, Mario, Yu, Lang

[-- Attachment #1: Type: text/plain, Size: 24051 bytes --]

[AMD Official Use Only - AMD Internal Distribution Only]

I recently encountered a similar issue on Strix. What I found was that the root cause was GFX failing during hw_init. Here’s the situation:

Linux AMDGPU boot-up flow:
1. sw_init — This stage initializes the software for each IP block (GFX, VCN, VPE, etc.), and powers them on.
2. hw_init — This stage calls the hardware initialization of each IP block. At this point, VPE begins loading its firmware and configuring hardware.

The issue: When the problem occurs, the GFX block fails during hw_init. As a result, it requests the SMU to power off all IP blocks. However, at that point, the VPE firmware hasn’t been loaded yet, so it cannot respond to the SMU's power-off request. This causes the system to hang during boot.

Previously, my approach was to remove all the calls to VPE power off (both in the VPE driver and in the SMU deinit function) to help locate the issue. Maybe you could try the same.

________________________________
From: Matthew Schwartz <matthew.schwartz@linux.dev>
Sent: Wednesday, August 27, 2025 08:50
To: Antheas Kapenekakis <lkml@antheas.dev>
Cc: Mario Limonciello <superm1@kernel.org>; Alex Deucher <alexdeucher@gmail.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Wentland, Harry <Harry.Wentland@amd.com>; Rodrigo Siqueira <siqueira@igalia.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Lee, Peyton <Peyton.Lee@amd.com>; Yu, Lang <Lang.Yu@amd.com>
Subject: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo

On 8/26/25 1:58 PM, Antheas Kapenekakis wrote: > On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz > <matthew.schwartz@linux.dev> wrote: >> >> >> >>> On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote: >>> >>> On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote: >>>> >>>> On 8/26/2025 8:41 AM, Alex Deucher wrote: >>>>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: >>>>>> >>>>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: >>>>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: >>>>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>>>>>>>>> suspend resumes result in a soft lock around 1 second after the screen >>>>>>>>>> turns on (it freezes).
This happens due to power gating VPE when it is >>>>>>>>>> not used, which happens 1 second after inactivity. >>>>>>>>>> >>>>>>>>>> Specifically, the VPE gating after resume is as follows: an initial >>>>>>>>>> ungate, followed by a gate in the resume process. Then, >>>>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>>>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>>>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>>>>>>>>> with VPE_IDLE_TIMEOUT (1s). >>>>>>>>>> >>>>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>>>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>>>>>>>>> that called the command being stuck processing it. >>>>>>>>>> >>>>>>>>>> Specifically, after that SMU command tries to run, we get the following: >>>>>>>>>> >>>>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>>>>>>>>> ... >>>>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>>>>>>>>> ... >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>>>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. >>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>>>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! >>>>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. >>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>>>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>>>>>>>>> >>>>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. >>>>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>>>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the >>>>>>>>>> PowerDownVpe(50) command which is the common failure point in all >>>>>>>>>> failed resumes. 
>>>>>>>>>> >>>>>>>>>> On a normal resume, we should get the following power gates: >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>>>>>>>>> >>>>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>>>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>>>>>>>>> time of 12s sleep, 8s resume. >>>>>>>>> >>>>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that >>>>>>>>> 'amd-s2idle --duration 12 --wait 8'? >>>>>>>> >>>>>>>> I did not use amd-s2idle. I essentially used the script below with a >>>>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for >>>>>>>> this testing. >>>>>>>> >>>>>>>> for i in {1..200}; do >>>>>>>> echo "Suspend attempt $i" >>>>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm >>>>>>>> sudo sh -c 'echo mem > /sys/power/state' >>>>>>>> >>>>>>>> for j in {1..50}; do >>>>>>>> # Use repeating sleep in case echo mem returns early >>>>>>>> sleep 1 >>>>>>>> done >>>>>>>> done >>>>>>> >>>>>>> 👍 >>>>>>> >>>>>>>> >>>>>>>>>> The suspected reason here is that 1s that >>>>>>>>>> when VPE is used, it needs a bit of time before it can be gated and >>>>>>>>>> there was a borderline delay before, which is not enough for Strix Halo. >>>>>>>>>> When the VPE is not used, such as on resume, gating it instantly does >>>>>>>>>> not seem to cause issues. >>>>>>>>>> >>>>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") >>>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> >>>>>>>>>> --- >>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- >>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>> index 121ee17b522b..24f09e457352 100644 >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>> @@ -34,8 +34,8 @@ >>>>>>>>>> /* VPE CSA resides in the 4th page of CSA */ >>>>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) >>>>>>>>>> >>>>>>>>>> -/* 1 second timeout */ >>>>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) >>>>>>>>>> +/* 2 second timeout */ >>>>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) >>>>>>>>>> >>>>>>>>>> #define VPE_MAX_DPM_LEVEL 4 >>>>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 >>>>>>>>>> >>>>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 >>>>>>>>> >>>>>>>>> 1s idle timeout has been used by other IPs for a long time. >>>>>>>>> For example JPEG, UVD, VCN all use 1s. 
>>>>>>>>> >>>>>>>>> Can you please confirm both your AGESA and your SMU firmware version? >>>>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string >>>>>>>>> (DMI type 40). >>>>>>>>> >>>>>>>>> ❯ sudo dmidecode | grep AGESA >>>>>>>> >>>>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c >>>>>>>> >>>>>>>>> You can get SMU firmware version from this: >>>>>>>>> >>>>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>>>> >>>>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 >>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 >>>>>>>> >>>>>>> >>>>>>> Thanks, I'll get some folks to see if we match this AGESA version if we >>>>>>> can also reproduce it on reference hardware the same way you did. >>>>>>> >>>>>>>>> Are you on the most up to date firmware for your system from the >>>>>>>>> manufacturer? >>>>>>>> >>>>>>>> I updated my bios, pd firmware, and USB device firmware early August, >>>>>>>> when I was doing this testing. >>>>>>>> >>>>>>>>> We haven't seen anything like this reported on Strix Halo thus far and >>>>>>>>> we do internal stress testing on s0i3 on reference hardware. >>>>>>>> >>>>>>>> Cant find a reference for it on the bug tracker. I have four bug >>>>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 >>>>>>>> for runtime crashes, where the culprit would be this. IE runtime gates >>>>>>>> VPE and causes a crash. >>>>>>> >>>>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By >>>>>>> what software? >>>>>>> >>>>>>> BTW - Strix and Kraken also have VPE. >>>>>> >>>>>> All on the Z13. Not tied to VPE necessarily. I just know that I get >>>>>> reports of crashes on the Z13, and with this patch they are fixed for >>>>>> me. It will be part of the next bazzite version so I will get feedback >>>>>> about it. >>>>>> >>>>>> I don't think software that is using the VPE is relevant. Perhaps for >>>>>> the runtime crashes it is and this patch helps in that case as well. >>>>>> But in my case, the crash is caused after the ungate that runs the >>>>>> tests on resume on the delayed handler. >>>>>> >>>>>> The Z13 also has some other quirks with spurious wakeups when >>>>>> connected to a charger. So, if systemd is configured to e.g., sleep >>>>>> after 20m, combined with this crash if it stays plugged in overnight >>>>>> in the morning it has crashed. >>>>>> >>>>>>>> >>>>>>>>> To me this seems likely to be a platform firmware bug; but I would like >>>>>>>>> to understand the timing of the gate vs ungate on good vs bad. >>>>>>>> >>>>>>>> Perhaps it is. It is either something like that or silicon quality. >>>>>>>> >>>>>>>>> IE is it possible the delayed work handler >>>>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with >>>>>>>>> vpe_ring_begin_use()? >>>>>>>> >>>>>>>> I don't think so. There is only a single ungate. Also, the crash >>>>>>>> happens on the gate. So what happens is the device wakes up, the >>>>>>>> screen turns on, kde clock works, then after a second it freezes, >>>>>>>> there is a softlock, and the device hangs. >>>>>>>> >>>>>>>> The failed command is always the VPE gate that is triggered after 1s in idle. >>>>>>>> >>>>>>>>> This should be possible to check without extra instrumentation by using >>>>>>>>> ftrace and looking at the timing of the 2 ring functions and the init >>>>>>>>> work handler and checking good vs bad cycles. >>>>>>>> >>>>>>>> I do not know how to use ftrace. 
I should also note that after the >>>>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a >>>>>>>> very easy thing to diagnose. The device just stops working. A lot of >>>>>>>> the logs I got were in pstore by forcing a kernel panic. >>>>>>> >>>>>>> Here's how you capture the timing of functions. Each time the function >>>>>>> is called there will be an event in the trace buffer. >>>>>>> >>>>>>> ❯ sudo trace-cmd record -p function -l >>>>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l >>>>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l >>>>>>> amdgpu_pmops_resume >>>>>>> >>>>>>> Here's how you would review the report: >>>>>>> >>>>>>> ❯ trace-cmd report >>>>>>> cpus=24 >>>>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function: >>>>>>> amdgpu_pmops_suspend <-- pci_pm_suspend >>>>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function: >>>>>>> amdgpu_pmops_resume <-- dpm_run_callback >>>>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function: >>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function: >>>>>>> vpe_ring_end_use <-- vpe_ring_test_ring >>>>>>> kworker/15:1-12285 [015] ..... 13666.960191: function: >>>>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work >>>>>>> kworker/15:1-12285 [015] ..... 13666.963970: function: >>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>>>> kworker/15:1-12285 [015] ..... 13666.965481: function: >>>>>>> vpe_ring_end_use <-- amdgpu_ib_schedule >>>>>>> kworker/15:4-16354 [015] ..... 13667.981394: function: >>>>>>> vpe_idle_work_handler <-- process_one_work >>>>>>> >>>>>>> I did this on a Strix system just now to capture that. >>>>>>> >>>>>>> You can see that basically the ring gets used before the delayed init >>>>>>> work handler, and then again from the ring tests. My concern is if the >>>>>>> sequence ever looks different than the above. If it does; we do have a >>>>>>> driver race condition. >>>>>>> >>>>>>> It would also be helpful to look at the function_graph tracer. >>>>>>> >>>>>>> Here's some more documentation about ftrace and trace-cmd. >>>>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html >>>>>>> https://lwn.net/Articles/410200/ >>>>>>> >>>>>>> You can probably also get an LLM to help you with building commands if >>>>>>> you're not familiar with it. >>>>>>> >>>>>>> But if you're hung so bad you can't flush to disk that's going to be a >>>>>>> problem without a UART. A few ideas: >>>>>> >>>>>> Some times it flushes to disk >>>>>> >>>>>>> 1) You can use CONFIG_PSTORE_FTRACE >>>>>> >>>>>> I can look into that >>>>>> >>>>>>> 2) If you add "tp_printk" to the kernel command line it should make the >>>>>>> trace ring buffer flush to kernel log ring buffer. But be warned this >>>>>>> is going to change the timing, the issue might go away entirely or have >>>>>>> a different failure rate. So hopefully <1> works. >>>>>>>> >>>>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution >>>>>>>> would be to desync the idle times so they do not happen >>>>>>>> simultaneously. So 1000, 1200, 1400, etc. >>>>>>>> >>>>>>>> Antheas >>>>>>>> >>>>>>> >>>>>>> I don't dobut your your proposal of changing the timing works. I just >>>>>>> want to make sure it's the right solution because otherwise we might >>>>>>> change the timing or sequence elsewhere in the driver two years from now >>>>>>> and re-introduce the problem unintentionally. 
>>>>>> >>>>>> If there are other idle timers and only this one changes to 2s, I will >>>>>> agree and say that it would be peculiar. Although 1s seems arbitrary >>>>>> in any case. >>>>> >>>>> All of these timers are arbitrary. Their point is just to provide a >>>>> future point where we can check if the engine is idle. The idle work >>>>> handler will either power down the IP if it is idle or re-schedule in >>>>> the future and try again if there is still work. Making the value >>>>> longer will use more power as it will wait longer before checking if >>>>> the engine is idle. Making it shorter will save more power, but adds >>>>> extra overhead in that the engine will be powered up/down more often. >>>>> In most cases, the jobs should complete in a few ms. The timer is >>>>> there to avoid the overhead of powering up/down the block too >>>>> frequently when applications are using the engine. >>>>> >>>>> Alex >>>> >>>> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or >>>> 1001c AGESA on reference system but unfortunately didn't reproduce the >>>> issue with a 200 cycle attempt on either kernel or either BIOS (so we >>>> had 800 cycles total). >>> >>> I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back >>> to you with 6.17-rc3. >> >> I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting. >> >> Matt > > Mine does not want to get reproduced right now. I will have to try later. > > You will need these kernel arguments: > efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000 > > Here are some logging commands before the for loop > # clear pstore > sudo bash -c "rm -rf /sys/fs/pstore/*" > > # https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/ > > # Runtime logs > # echo 1 | sudo tee > /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable > # echo 1 | sudo tee > /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable > # echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on > > # Enable panics on lockups > echo 255 | sudo tee /proc/sys/kernel/sysrq > echo 1 | sudo tee /proc/sys/kernel/softlockup_panic > echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic > echo 1 | sudo tee /proc/sys/kernel/panic_on_oops > echo 5 | sudo tee /proc/sys/kernel/panic > # echo 64 | sudo tee /proc/sys/kernel/panic_print > > # Enable these for hangs, shows Thread on hangs > # echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace > # echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace > > # Enable pstore logging on panics > # Needs kernel param: > # efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000 > # First enables, second sets the size to fit all cpus in case of a panic > echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers > echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump > > # Enable dynamic debug for various kernel components > sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF > file drivers/acpi/x86/s2idle.c +p > file drivers/pinctrl/pinctrl-amd.c +p > file drivers/platform/x86/amd/pmc.c +p > file drivers/pci/pci-driver.c +p > file drivers/input/serio/* +p > file drivers/gpu/drm/amd/pm/* +p > file drivers/gpu/drm/amd/pm/swsmu/* +p > EOF > # file drivers/acpi/ec.c +p > # file drivers/gpu/drm/amd/* +p > # file drivers/gpu/drm/amd/display/dc/core/* -p > > # Additional debugging for 
suspend/resume
> echo 1 | sudo tee /sys/power/pm_debug_messages

So I ran the commands that you gave above while connected over ssh, and I could
actually still interact with the system after the amdgpu failures started. Your
suspend script also kept running for a while because of this, and pstore was
not necessary.

My dmesg looks very similar to the snippet you posted in the patch contents.
Full dmesg is here:
https://gist.github.com/matte-schwartz/9ad4b925866d9228923e909618d045d9

I was able to run trace-cmd as Mario suggested, but nothing seemed out of order:

❯ trace-cmd report
kworker/22:6-9326 [022] ..... 4003.204988: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/22:6-9326 [022] ..... 4003.209383: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/22:6-9326 [022] ..... 4003.210152: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/22:6-9326 [022] ..... 4004.263841: function: vpe_idle_work_handler <-- process_one_work
kworker/u129:6-530 [001] ..... 4053.545634: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u129:18-4060 [002] ..... 4114.908515: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u129:18-4060 [023] ..... 4114.931055: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u129:18-4060 [023] ..... 4114.931057: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/7:5-5733 [007] ..... 4115.198936: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/7:5-5733 [007] ..... 4115.203185: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/7:5-5733 [007] ..... 4115.204141: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/7:0-7950 [007] ..... 4116.253971: function: vpe_idle_work_handler <-- process_one_work
kworker/u129:41-4083 [001] ..... 4165.539388: function: amdgpu_pmops_suspend <-- pci_pm_suspend
kworker/u129:58-4100 [001] ..... 4226.906561: function: amdgpu_pmops_resume <-- dpm_run_callback
kworker/u129:58-4100 [022] ..... 4226.927900: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/u129:58-4100 [022] ..... 4226.927902: function: vpe_ring_end_use <-- vpe_ring_test_ring
kworker/7:0-7950 [007] ..... 4227.193678: function: amdgpu_device_delayed_init_work_handler <-- process_one_work
kworker/7:0-7950 [007] ..... 4227.197604: function: vpe_ring_begin_use <-- amdgpu_ring_alloc
kworker/7:0-7950 [007] ..... 4227.201691: function: vpe_ring_end_use <-- amdgpu_ib_schedule
kworker/7:0-7950 [007] ..... 4228.240479: function: vpe_idle_work_handler <-- process_one_work

I have not tested the kernel patch yet, so that will be my next step.

>
> Here is how to reconstruct the log:
> rm -rf crash && mkdir crash
> sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash"
> sudo bash -c "rm -rf /sys/fs/pstore/*"
> cat $(find crash/ -name "dmesg-*" | tac) > crash.txt
>
> Antheas
>>>
>>>> Was your base a bazzite kernel or was it an upstream kernel? I know
>>>> there are some other patches in bazzite especially relevant to suspend,
>>>> so I wonder if they could be influencing the timing.
>>>>
>>>> Can you repo on 6.17-rc3?
>>>>
>>>
>>>
>>
>>
>

[-- Attachment #2: Type: text/html, Size: 41460 bytes --]

^ permalink raw reply [flat|nested] 22+ messages in thread
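For reference, the ordering in the trace above is the intended driver flow: amdgpu_pmops_resume does a quick VPE ring test, amdgpu_device_delayed_init_work_handler fires about two seconds later and runs the IB tests (which touch the VPE ring again), and vpe_idle_work_handler lands roughly one VPE_IDLE_TIMEOUT after the last ring use. A minimal sketch of the delayed init work, paraphrased rather than copied verbatim from amdgpu_device.c, looks like this:

static void amdgpu_device_delayed_init_work_handler(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, delayed_init_work.work);
	int r;

	/* Runs the per-ring IB tests, including vpe_ring_test_ib(); that test
	 * ungates VPE via vpe_ring_begin_use() and re-arms the VPE idle timer
	 * via vpe_ring_end_use() when it finishes. */
	r = amdgpu_ib_ring_tests(adev);
	if (r)
		dev_err(adev->dev, "ib ring test failed (%d).\n", r);
}

In the failing cycles described in this thread, the PowerDownVpe message is issued from the idle work that this test arms, not from the test itself.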
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-27 2:37 ` Lee, Peyton @ 2025-08-27 15:42 ` Matthew Schwartz 0 siblings, 0 replies; 22+ messages in thread From: Matthew Schwartz @ 2025-08-27 15:42 UTC (permalink / raw) To: Lee, Peyton, Antheas Kapenekakis Cc: Mario Limonciello, Alex Deucher, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, Deucher, Alexander, Koenig, Christian, David Airlie, Simona Vetter, Wentland, Harry, Rodrigo Siqueira, Limonciello, Mario, Yu, Lang On 8/26/25 7:37 PM, Lee, Peyton wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > I recently encountered a similar issue on Strix. > > What I found was that the root cause was GFX failing during hw_init. > > Here’s the situation: > Linux AMDGPU boot-up flow: > > 1. > sw_init — This stage initializes the software for each IP block (GFX, VCN, VPE, etc.), and powers them on. > 2. > hw_init — This stage calls the hardware initialization of each IP block. At this point, VPE begins loading its firmware and configuring hardware. > > The issue: > When the problem occurs, the GFX block fails during hw_init. As a result, it requests the SMU to power off all IP blocks. > However, at that point, the VPE firmware hasn’t been loaded yet, so it cannot respond to the SMU's power-off request. > This causes the system to hang during boot. > > Previously, my approach was to remove all the calls to VPE power off (both in the VPE driver and in the SMU deinit function) to help locate the issue. Maybe you could try the same. Thanks, I tried both Antheas' patch and your suggestion, and I was still able to trigger the issue with both of them. On Mario's suggestion, I updated from linux-firmware 20250808 to linux-firmware from git. 
With this new amdgpu_firmware_info:

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x0000001f
PFP feature version: 35, firmware version: 0x0000002c
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x11530506
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x11530506
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x0000001f
IMU feature version: 0, firmware version: 0x0b352300
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648371, firmware version: 0x210000f3
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000046
TA DTM feature version: 0x00000000, firmware version: 0x12000019
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00647000 (100.112.0)
SDMA0 feature version: 60, firmware version: 0x00000011
VCN feature version: 0, firmware version: 0x09118011
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x09002a00
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007e
VPE feature version: 60, firmware version: 0x00000017
VBIOS version: 113-STRXLGEN-001

I have not been able to trigger any amdgpu failures after 200 cycles on an
unpatched 6.17-rc3 kernel.
Thanks, Matt > > > > ________________________________ > 寄件者: Matthew Schwartz <matthew.schwartz@linux.dev> > 已傳送: 星期三, 2025 年 8 月 27 日 08:50 > 收件者: Antheas Kapenekakis <lkml@antheas.dev> > 副本: Mario Limonciello <superm1@kernel.org>; Alex Deucher <alexdeucher@gmail.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Wentland, Harry <Harry.Wentland@amd.com>; Rodrigo Siqueira <siqueira@igalia.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Lee, Peyton <Peyton.Lee@amd.com>; Yu, Lang <Lang.Yu@amd.com> > 主旨: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo > > On 8/26/25 1:58 PM, Antheas Kapenekakis wrote: >> On Tue, 26 Aug 2025 at 22:13, Matthew Schwartz >> <matthew.schwartz@linux.dev> wrote: >>> >>> >>> >>>> On Aug 26, 2025, at 12:21 PM, Antheas Kapenekakis <lkml@antheas.dev> wrote: >>>> >>>> On Tue, 26 Aug 2025 at 21:19, Mario Limonciello <superm1@kernel.org> wrote: >>>>> >>>>> On 8/26/2025 8:41 AM, Alex Deucher wrote: >>>>>> On Tue, Aug 26, 2025 at 3:49 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: >>>>>>> >>>>>>> On Mon, 25 Aug 2025 at 03:38, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 8/24/25 3:46 PM, Antheas Kapenekakis wrote: >>>>>>>>> On Sun, 24 Aug 2025 at 22:16, Mario Limonciello <superm1@kernel.org> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 8/24/25 3:53 AM, Antheas Kapenekakis wrote: >>>>>>>>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>>>>>>>>>> suspend resumes result in a soft lock around 1 second after the screen >>>>>>>>>>> turns on (it freezes). This happens due to power gating VPE when it is >>>>>>>>>>> not used, which happens 1 second after inactivity. >>>>>>>>>>> >>>>>>>>>>> Specifically, the VPE gating after resume is as follows: an initial >>>>>>>>>>> ungate, followed by a gate in the resume process. Then, >>>>>>>>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>>>>>>>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>>>>>>>>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>>>>>>>>>> with VPE_IDLE_TIMEOUT (1s). >>>>>>>>>>> >>>>>>>>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>>>>>>>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>>>>>>>>>> that called the command being stuck processing it. >>>>>>>>>>> >>>>>>>>>>> Specifically, after that SMU command tries to run, we get the following: >>>>>>>>>>> >>>>>>>>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>>>>>>>>>> ... >>>>>>>>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>>>>>>>>>> ... >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>>>>>>>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. 
>>>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>>>>>>>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! >>>>>>>>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. >>>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>>>>>>>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>>>>>>>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>>>>>>>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>>>>>>>>>> >>>>>>>>>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. >>>>>>>>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>>>>>>>>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the >>>>>>>>>>> PowerDownVpe(50) command which is the common failure point in all >>>>>>>>>>> failed resumes. >>>>>>>>>>> >>>>>>>>>>> On a normal resume, we should get the following power gates: >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>>>>>>>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>>>>>>>>>> >>>>>>>>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>>>>>>>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>>>>>>>>>> time of 12s sleep, 8s resume. >>>>>>>>>> >>>>>>>>>> When you say you reproduced with 12s sleep and 8s resume, was that >>>>>>>>>> 'amd-s2idle --duration 12 --wait 8'? >>>>>>>>> >>>>>>>>> I did not use amd-s2idle. I essentially used the script below with a >>>>>>>>> 12 on the wake alarm and 12 on the for loop. I also used pstore for >>>>>>>>> this testing. 
>>>>>>>>> >>>>>>>>> for i in {1..200}; do >>>>>>>>> echo "Suspend attempt $i" >>>>>>>>> echo `date '+%s' -d '+ 60 seconds'` | sudo tee /sys/class/rtc/rtc0/wakealarm >>>>>>>>> sudo sh -c 'echo mem > /sys/power/state' >>>>>>>>> >>>>>>>>> for j in {1..50}; do >>>>>>>>> # Use repeating sleep in case echo mem returns early >>>>>>>>> sleep 1 >>>>>>>>> done >>>>>>>>> done >>>>>>>> >>>>>>>> 👍 >>>>>>>> >>>>>>>>> >>>>>>>>>>> The suspected reason here is that 1s that >>>>>>>>>>> when VPE is used, it needs a bit of time before it can be gated and >>>>>>>>>>> there was a borderline delay before, which is not enough for Strix Halo. >>>>>>>>>>> When the VPE is not used, such as on resume, gating it instantly does >>>>>>>>>>> not seem to cause issues. >>>>>>>>>>> >>>>>>>>>>> Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") >>>>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> >>>>>>>>>>> --- >>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- >>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>>> index 121ee17b522b..24f09e457352 100644 >>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c >>>>>>>>>>> @@ -34,8 +34,8 @@ >>>>>>>>>>> /* VPE CSA resides in the 4th page of CSA */ >>>>>>>>>>> #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) >>>>>>>>>>> >>>>>>>>>>> -/* 1 second timeout */ >>>>>>>>>>> -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) >>>>>>>>>>> +/* 2 second timeout */ >>>>>>>>>>> +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) >>>>>>>>>>> >>>>>>>>>>> #define VPE_MAX_DPM_LEVEL 4 >>>>>>>>>>> #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 >>>>>>>>>>> >>>>>>>>>>> base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 >>>>>>>>>> >>>>>>>>>> 1s idle timeout has been used by other IPs for a long time. >>>>>>>>>> For example JPEG, UVD, VCN all use 1s. >>>>>>>>>> >>>>>>>>>> Can you please confirm both your AGESA and your SMU firmware version? >>>>>>>>>> In case you're not aware; you can get AGESA version from SMBIOS string >>>>>>>>>> (DMI type 40). >>>>>>>>>> >>>>>>>>>> ❯ sudo dmidecode | grep AGESA >>>>>>>>> >>>>>>>>> String: AGESA!V9 StrixHaloPI-FP11 1.0.0.0c >>>>>>>>> >>>>>>>>>> You can get SMU firmware version from this: >>>>>>>>>> >>>>>>>>>> ❯ grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>>>>> >>>>>>>>> grep . /sys/bus/platform/drivers/amd_pmc/*/smu_* >>>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_fw_version:100.112.0 >>>>>>>>> /sys/bus/platform/drivers/amd_pmc/AMDI000B:00/smu_program:0 >>>>>>>>> >>>>>>>> >>>>>>>> Thanks, I'll get some folks to see if we match this AGESA version if we >>>>>>>> can also reproduce it on reference hardware the same way you did. >>>>>>>> >>>>>>>>>> Are you on the most up to date firmware for your system from the >>>>>>>>>> manufacturer? >>>>>>>>> >>>>>>>>> I updated my bios, pd firmware, and USB device firmware early August, >>>>>>>>> when I was doing this testing. >>>>>>>>> >>>>>>>>>> We haven't seen anything like this reported on Strix Halo thus far and >>>>>>>>>> we do internal stress testing on s0i3 on reference hardware. >>>>>>>>> >>>>>>>>> Cant find a reference for it on the bug tracker. I have four bug >>>>>>>>> reports on the bazzite issue tracker, 2 about sleep wake crashes and 2 >>>>>>>>> for runtime crashes, where the culprit would be this. IE runtime gates >>>>>>>>> VPE and causes a crash. 
>>>>>>>> >>>>>>>> All on Strix Halo and all tied to VPE? At runtime was VPE in use? By >>>>>>>> what software? >>>>>>>> >>>>>>>> BTW - Strix and Kraken also have VPE. >>>>>>> >>>>>>> All on the Z13. Not tied to VPE necessarily. I just know that I get >>>>>>> reports of crashes on the Z13, and with this patch they are fixed for >>>>>>> me. It will be part of the next bazzite version so I will get feedback >>>>>>> about it. >>>>>>> >>>>>>> I don't think software that is using the VPE is relevant. Perhaps for >>>>>>> the runtime crashes it is and this patch helps in that case as well. >>>>>>> But in my case, the crash is caused after the ungate that runs the >>>>>>> tests on resume on the delayed handler. >>>>>>> >>>>>>> The Z13 also has some other quirks with spurious wakeups when >>>>>>> connected to a charger. So, if systemd is configured to e.g., sleep >>>>>>> after 20m, combined with this crash if it stays plugged in overnight >>>>>>> in the morning it has crashed. >>>>>>> >>>>>>>>> >>>>>>>>>> To me this seems likely to be a platform firmware bug; but I would like >>>>>>>>>> to understand the timing of the gate vs ungate on good vs bad. >>>>>>>>> >>>>>>>>> Perhaps it is. It is either something like that or silicon quality. >>>>>>>>> >>>>>>>>>> IE is it possible the delayed work handler >>>>>>>>>> amdgpu_device_delayed_init_work_handler() is causing a race with >>>>>>>>>> vpe_ring_begin_use()? >>>>>>>>> >>>>>>>>> I don't think so. There is only a single ungate. Also, the crash >>>>>>>>> happens on the gate. So what happens is the device wakes up, the >>>>>>>>> screen turns on, kde clock works, then after a second it freezes, >>>>>>>>> there is a softlock, and the device hangs. >>>>>>>>> >>>>>>>>> The failed command is always the VPE gate that is triggered after 1s in idle. >>>>>>>>> >>>>>>>>>> This should be possible to check without extra instrumentation by using >>>>>>>>>> ftrace and looking at the timing of the 2 ring functions and the init >>>>>>>>>> work handler and checking good vs bad cycles. >>>>>>>>> >>>>>>>>> I do not know how to use ftrace. I should also note that after the >>>>>>>>> device freezes around 1/5 cycles will sync the fs, so it is also not a >>>>>>>>> very easy thing to diagnose. The device just stops working. A lot of >>>>>>>>> the logs I got were in pstore by forcing a kernel panic. >>>>>>>> >>>>>>>> Here's how you capture the timing of functions. Each time the function >>>>>>>> is called there will be an event in the trace buffer. >>>>>>>> >>>>>>>> ❯ sudo trace-cmd record -p function -l >>>>>>>> amdgpu_device_delayed_init_work_handler -l vpe_idle_work_handler -l >>>>>>>> vpe_ring_begin_use -l vpe_ring_end_use -l amdgpu_pmops_suspend -l >>>>>>>> amdgpu_pmops_resume >>>>>>>> >>>>>>>> Here's how you would review the report: >>>>>>>> >>>>>>>> ❯ trace-cmd report >>>>>>>> cpus=24 >>>>>>>> kworker/u97:37-18051 [001] ..... 13655.970108: function: >>>>>>>> amdgpu_pmops_suspend <-- pci_pm_suspend >>>>>>>> kworker/u97:21-18036 [002] ..... 13666.290715: function: >>>>>>>> amdgpu_pmops_resume <-- dpm_run_callback >>>>>>>> kworker/u97:21-18036 [015] ..... 13666.308295: function: >>>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>>>>> kworker/u97:21-18036 [015] ..... 13666.308298: function: >>>>>>>> vpe_ring_end_use <-- vpe_ring_test_ring >>>>>>>> kworker/15:1-12285 [015] ..... 13666.960191: function: >>>>>>>> amdgpu_device_delayed_init_work_handler <-- process_one_work >>>>>>>> kworker/15:1-12285 [015] ..... 
13666.963970: function: >>>>>>>> vpe_ring_begin_use <-- amdgpu_ring_alloc >>>>>>>> kworker/15:1-12285 [015] ..... 13666.965481: function: >>>>>>>> vpe_ring_end_use <-- amdgpu_ib_schedule >>>>>>>> kworker/15:4-16354 [015] ..... 13667.981394: function: >>>>>>>> vpe_idle_work_handler <-- process_one_work >>>>>>>> >>>>>>>> I did this on a Strix system just now to capture that. >>>>>>>> >>>>>>>> You can see that basically the ring gets used before the delayed init >>>>>>>> work handler, and then again from the ring tests. My concern is if the >>>>>>>> sequence ever looks different than the above. If it does; we do have a >>>>>>>> driver race condition. >>>>>>>> >>>>>>>> It would also be helpful to look at the function_graph tracer. >>>>>>>> >>>>>>>> Here's some more documentation about ftrace and trace-cmd. >>>>>>>> https://www.kernel.org/doc/html/latest/trace/ftrace.html >>>>>>>> https://lwn.net/Articles/410200/ >>>>>>>> >>>>>>>> You can probably also get an LLM to help you with building commands if >>>>>>>> you're not familiar with it. >>>>>>>> >>>>>>>> But if you're hung so bad you can't flush to disk that's going to be a >>>>>>>> problem without a UART. A few ideas: >>>>>>> >>>>>>> Some times it flushes to disk >>>>>>> >>>>>>>> 1) You can use CONFIG_PSTORE_FTRACE >>>>>>> >>>>>>> I can look into that >>>>>>> >>>>>>>> 2) If you add "tp_printk" to the kernel command line it should make the >>>>>>>> trace ring buffer flush to kernel log ring buffer. But be warned this >>>>>>>> is going to change the timing, the issue might go away entirely or have >>>>>>>> a different failure rate. So hopefully <1> works. >>>>>>>>> >>>>>>>>> If you say that all IP blocks use 1s, perhaps an alternative solution >>>>>>>>> would be to desync the idle times so they do not happen >>>>>>>>> simultaneously. So 1000, 1200, 1400, etc. >>>>>>>>> >>>>>>>>> Antheas >>>>>>>>> >>>>>>>> >>>>>>>> I don't dobut your your proposal of changing the timing works. I just >>>>>>>> want to make sure it's the right solution because otherwise we might >>>>>>>> change the timing or sequence elsewhere in the driver two years from now >>>>>>>> and re-introduce the problem unintentionally. >>>>>>> >>>>>>> If there are other idle timers and only this one changes to 2s, I will >>>>>>> agree and say that it would be peculiar. Although 1s seems arbitrary >>>>>>> in any case. >>>>>> >>>>>> All of these timers are arbitrary. Their point is just to provide a >>>>>> future point where we can check if the engine is idle. The idle work >>>>>> handler will either power down the IP if it is idle or re-schedule in >>>>>> the future and try again if there is still work. Making the value >>>>>> longer will use more power as it will wait longer before checking if >>>>>> the engine is idle. Making it shorter will save more power, but adds >>>>>> extra overhead in that the engine will be powered up/down more often. >>>>>> In most cases, the jobs should complete in a few ms. The timer is >>>>>> there to avoid the overhead of powering up/down the block too >>>>>> frequently when applications are using the engine. >>>>>> >>>>>> Alex >>>>> >>>>> We had a try internally with both 6.17-rc2 and 6.17-rc3 and 1001b or >>>>> 1001c AGESA on reference system but unfortunately didn't reproduce the >>>>> issue with a 200 cycle attempt on either kernel or either BIOS (so we >>>>> had 800 cycles total). >>>> >>>> I think I did 6.12, 6.15, and a 6.16rc stock. I will have to come back >>>> to you with 6.17-rc3. 
>>> >>> I can reproduce the hang on a stock 6.17-rc3 kernel on my own Flow Z13, froze within 10 cycles with Antheas’ script. I will setup pstore to get logs from it since nothing appears in my journal after force rebooting. >>> >>> Matt >> >> Mine does not want to get reproduced right now. I will have to try later. >> >> You will need these kernel arguments: >> efi_pstore.pstore_disable=0 pstore.kmsg_bytes=200000 >> >> Here are some logging commands before the for loop >> # clear pstore >> sudo bash -c "rm -rf /sys/fs/pstore/*" >> >> # https://www.ais.com/understanding-pstore-linux-kernel-persistent-storage-file-system/ >> >> # Runtime logs >> # echo 1 | sudo tee >> /sys/kernel/debug/tracing/events/power/power_runtime_suspend/enable >> # echo 1 | sudo tee >> /sys/kernel/debug/tracing/events/power/power_runtime_resume/enable >> # echo 1 | sudo tee /sys/kernel/debug/tracing/tracing_on >> >> # Enable panics on lockups >> echo 255 | sudo tee /proc/sys/kernel/sysrq >> echo 1 | sudo tee /proc/sys/kernel/softlockup_panic >> echo 1 | sudo tee /proc/sys/kernel/hardlockup_panic >> echo 1 | sudo tee /proc/sys/kernel/panic_on_oops >> echo 5 | sudo tee /proc/sys/kernel/panic >> # echo 64 | sudo tee /proc/sys/kernel/panic_print >> >> # Enable these for hangs, shows Thread on hangs >> # echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace >> # echo 1 | sudo tee /proc/sys/kernel/hardlockup_all_cpu_backtrace >> >> # Enable pstore logging on panics >> # Needs kernel param: >> # efi_pstore.pstore_disable=0 pstore.kmsg_bytes=100000 >> # First enables, second sets the size to fit all cpus in case of a panic >> echo Y | sudo tee /sys/module/kernel/parameters/crash_kexec_post_notifiers >> echo Y | sudo tee /sys/module/printk/parameters/always_kmsg_dump >> >> # Enable dynamic debug for various kernel components >> sudo bash -c "cat > /sys/kernel/debug/dynamic_debug/control" << EOF >> file drivers/acpi/x86/s2idle.c +p >> file drivers/pinctrl/pinctrl-amd.c +p >> file drivers/platform/x86/amd/pmc.c +p >> file drivers/pci/pci-driver.c +p >> file drivers/input/serio/* +p >> file drivers/gpu/drm/amd/pm/* +p >> file drivers/gpu/drm/amd/pm/swsmu/* +p >> EOF >> # file drivers/acpi/ec.c +p >> # file drivers/gpu/drm/amd/* +p >> # file drivers/gpu/drm/amd/display/dc/core/* -p >> >> # Additional debugging for suspend/resume >> echo 1 | sudo tee /sys/power/pm_debug_messages > > So I ran the commands that you gave above while connected over ssh, and I could actually still interact with the system after the amdgpu failures started. > Your suspend script also kept running for a while because of this, and pstore was not necessary. > > My dmesg looks very similar to the snippet you posted in the patch contents. > Full dmesg is here: https://gist.github.com/matte-schwartz/9ad4b925866d9228923e909618d045d9 > > I was able to run trace-cmd as Mario suggested, but nothing seemed out of order: > > ❯ trace-cmd report > > kworker/22:6-9326 [022] ..... 4003.204988: function: amdgpu_device_delayed_init_work_handler <-- process_one_work > kworker/22:6-9326 [022] ..... 4003.209383: function: vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/22:6-9326 [022] ..... 4003.210152: function: vpe_ring_end_use <-- amdgpu_ib_schedule > kworker/22:6-9326 [022] ..... 4004.263841: function: vpe_idle_work_handler <-- process_one_work > kworker/u129:6-530 [001] ..... 4053.545634: function: amdgpu_pmops_suspend <-- pci_pm_suspend > kworker/u129:18-4060 [002] ..... 
4114.908515: function: amdgpu_pmops_resume <-- dpm_run_callback > kworker/u129:18-4060 [023] ..... 4114.931055: function: vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/u129:18-4060 [023] ..... 4114.931057: function: vpe_ring_end_use <-- vpe_ring_test_ring > kworker/7:5-5733 [007] ..... 4115.198936: function: amdgpu_device_delayed_init_work_handler <-- process_one_work > kworker/7:5-5733 [007] ..... 4115.203185: function: vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/7:5-5733 [007] ..... 4115.204141: function: vpe_ring_end_use <-- amdgpu_ib_schedule > kworker/7:0-7950 [007] ..... 4116.253971: function: vpe_idle_work_handler <-- process_one_work > kworker/u129:41-4083 [001] ..... 4165.539388: function: amdgpu_pmops_suspend <-- pci_pm_suspend > kworker/u129:58-4100 [001] ..... 4226.906561: function: amdgpu_pmops_resume <-- dpm_run_callback > kworker/u129:58-4100 [022] ..... 4226.927900: function: vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/u129:58-4100 [022] ..... 4226.927902: function: vpe_ring_end_use <-- vpe_ring_test_ring > kworker/7:0-7950 [007] ..... 4227.193678: function: amdgpu_device_delayed_init_work_handler <-- process_one_work > kworker/7:0-7950 [007] ..... 4227.197604: function: vpe_ring_begin_use <-- amdgpu_ring_alloc > kworker/7:0-7950 [007] ..... 4227.201691: function: vpe_ring_end_use <-- amdgpu_ib_schedule > kworker/7:0-7950 [007] ..... 4228.240479: function: vpe_idle_work_handler <-- process_one_work > > I have not tested the kernel patch yet, so that will be my next step. > >> >> Here is how to reconstruct the log: >> rm -rf crash && mkdir crash >> sudo bash -c "cp /sys/fs/pstore/dmesg-efi_pstore-* crash" >> sudo bash -c "rm -rf /sys/fs/pstore/*" >> cat $(find crash/ -name "dmesg-*" | tac) > crash.txt >> >> Antheas >>>> >>>>> Was your base a bazzite kernel or was it an upstream kernel? I know >>>>> there are some other patches in bazzite especially relevant to suspend, >>>>> so I wonder if they could be influencing the timing. >>>>> >>>>> Can you repo on 6.17-rc3? >>>>> >>>> >>>> >>> >>> >> > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-24 8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis 2025-08-24 8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis 2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello @ 2025-08-25 13:20 ` Alex Deucher 2025-08-25 13:33 ` Antheas Kapenekakis 2 siblings, 1 reply; 22+ messages in thread From: Alex Deucher @ 2025-08-25 13:20 UTC (permalink / raw) To: Antheas Kapenekakis Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > suspend resumes result in a soft lock around 1 second after the screen > turns on (it freezes). This happens due to power gating VPE when it is > not used, which happens 1 second after inactivity. > > Specifically, the VPE gating after resume is as follows: an initial > ungate, followed by a gate in the resume process. Then, > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > to run tests, one of which is testing VPE in vpe_ring_test_ib. This > causes an ungate, After that test, vpe_idle_work_handler is scheduled > with VPE_IDLE_TIMEOUT (1s). > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the > SMU to hang and partially freezes half of the GPU IPs, with the thread > that called the command being stuck processing it. > > Specifically, after that SMU command tries to run, we get the following: > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > ... > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > ... > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. 
> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > PowerDownVpe(50) command which is the common failure point in all > failed resumes. > > On a normal resume, we should get the following power gates: > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > time of 12s sleep, 8s resume. The suspected reason here is that 1s that > when VPE is used, it needs a bit of time before it can be gated and > there was a borderline delay before, which is not enough for Strix Halo. > When the VPE is not used, such as on resume, gating it instantly does > not seem to cause issues. This doesn't make much sense. The VPE idle timeout is arbitrary. The VPE idle work handler checks to see if the block is idle before it powers gates the block. If it's not idle, then the delayed work is rescheduled so changing the timing should not make a difference. We are no powering down VPE while it still has active jobs. It sounds like there is some race condition somewhere else. Alex > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > index 121ee17b522b..24f09e457352 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > @@ -34,8 +34,8 @@ > /* VPE CSA resides in the 4th page of CSA */ > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > > -/* 1 second timeout */ > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > +/* 2 second timeout */ > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > > #define VPE_MAX_DPM_LEVEL 4 > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > -- > 2.50.1 > > ^ permalink raw reply [flat|nested] 22+ messages in thread
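For reference, the check-then-reschedule behaviour Alex describes has roughly the following shape; this is a paraphrase of the common amdgpu idle-work pattern, not a verbatim copy of amdgpu_vpe.c:

static void vpe_idle_work_handler(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, vpe.idle_work.work);

	/* Gate only if no fences are outstanding on the VPE ring. */
	if (amdgpu_fence_count_emitted(&adev->vpe.ring) == 0)
		amdgpu_device_ip_set_powergating_state(adev,
						       AMD_IP_BLOCK_TYPE_VPE,
						       AMD_PG_STATE_GATE);
	else
		/* Still busy: check again one VPE_IDLE_TIMEOUT later. */
		schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}

So the patch does not change whether a gate can be sent while work is pending; it only delays the point at which an idle VPE is gated.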
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-25 13:20 ` Alex Deucher @ 2025-08-25 13:33 ` Antheas Kapenekakis 2025-08-25 14:01 ` Antheas Kapenekakis 0 siblings, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-25 13:33 UTC (permalink / raw) To: Alex Deucher Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote: > > On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > > > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > > suspend resumes result in a soft lock around 1 second after the screen > > turns on (it freezes). This happens due to power gating VPE when it is > > not used, which happens 1 second after inactivity. > > > > Specifically, the VPE gating after resume is as follows: an initial > > ungate, followed by a gate in the resume process. Then, > > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > > to run tests, one of which is testing VPE in vpe_ring_test_ib. This > > causes an ungate, After that test, vpe_idle_work_handler is scheduled > > with VPE_IDLE_TIMEOUT (1s). > > > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the > > SMU to hang and partially freezes half of the GPU IPs, with the thread > > that called the command being stuck processing it. > > > > Specifically, after that SMU command tries to run, we get the following: > > > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > > ... > > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > > ... > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > > > > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > > a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > > PowerDownVpe(50) command which is the common failure point in all > > failed resumes. 
> > > > On a normal resume, we should get the following power gates: > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > > > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > > time of 12s sleep, 8s resume. The suspected reason here is that 1s that > > when VPE is used, it needs a bit of time before it can be gated and > > there was a borderline delay before, which is not enough for Strix Halo. > > When the VPE is not used, such as on resume, gating it instantly does > > not seem to cause issues. > > This doesn't make much sense. The VPE idle timeout is arbitrary. The > VPE idle work handler checks to see if the block is idle before it > powers gates the block. If it's not idle, then the delayed work is > rescheduled so changing the timing should not make a difference. We > are no powering down VPE while it still has active jobs. It sounds > like there is some race condition somewhere else. On resume, the vpe is ungated and gated instantly, which does not cause any crashes, then the delayed work is scheduled to run two seconds later. Then, the tests run and finish, which start the gate timer. After the timer lapses and the kernel tries to gate VPE, it crashes. I logged all SMU commands and there is no difference between the ones in a crash and not, other than the fact the VPE gate command failed. Which becomes apparent when the next command runs. I will also note that until the idle timer lapses, the system is responsive Since the VPE is ungated to run the tests, I assume that in my setup it is not used close to resume. Antheas > Alex > > > > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > index 121ee17b522b..24f09e457352 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > @@ -34,8 +34,8 @@ > > /* VPE CSA resides in the 4th page of CSA */ > > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > > > > -/* 1 second timeout */ > > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > > +/* 2 second timeout */ > > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > > > > #define VPE_MAX_DPM_LEVEL 4 > > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > > > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > > -- > > 2.50.1 > > > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
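For reference, the ungate-before-use and delayed gate that Antheas describes are driven by the VPE ring's begin_use/end_use hooks, roughly as sketched below. This is a simplified paraphrase of the usual amdgpu pattern (the real vpe_ring_begin_use also handles DPM level selection), not a verbatim copy of amdgpu_vpe.c:

static void vpe_ring_begin_use(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	/* Cancel any pending gate and power the block up before submitting. */
	cancel_delayed_work_sync(&adev->vpe.idle_work);
	amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VPE,
					       AMD_PG_STATE_UNGATE);
}

static void vpe_ring_end_use(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	/* Arm the idle timer; in the bad cycles, the gate that fails is the
	 * one sent when this timer lapses after the post-resume IB test. */
	schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}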
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-25 13:33 ` Antheas Kapenekakis @ 2025-08-25 14:01 ` Antheas Kapenekakis 2025-08-25 16:41 ` Mario Limonciello 0 siblings, 1 reply; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-25 14:01 UTC (permalink / raw) To: Alex Deucher Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Mario Limonciello, Peyton Lee, Lang Yu On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote: > > On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote: > > > > On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > > > > > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > > > suspend resumes result in a soft lock around 1 second after the screen > > > turns on (it freezes). This happens due to power gating VPE when it is > > > not used, which happens 1 second after inactivity. > > > > > > Specifically, the VPE gating after resume is as follows: an initial > > > ungate, followed by a gate in the resume process. Then, > > > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > > > to run tests, one of which is testing VPE in vpe_ring_test_ib. This > > > causes an ungate, After that test, vpe_idle_work_handler is scheduled > > > with VPE_IDLE_TIMEOUT (1s). > > > > > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the > > > SMU to hang and partially freezes half of the GPU IPs, with the thread > > > that called the command being stuck processing it. > > > > > > Specifically, after that SMU command tries to run, we get the following: > > > > > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > > > ... > > > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > > > ... > > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > > > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > > > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > > > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > > > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! > > > > > > In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. 
> > > Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > > > a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > > > PowerDownVpe(50) command which is the common failure point in all > > > failed resumes. > > > > > > On a normal resume, we should get the following power gates: > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > > > > > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > > > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > > > time of 12s sleep, 8s resume. The suspected reason here is that 1s that > > > when VPE is used, it needs a bit of time before it can be gated and > > > there was a borderline delay before, which is not enough for Strix Halo. > > > When the VPE is not used, such as on resume, gating it instantly does > > > not seem to cause issues. > > > > This doesn't make much sense. The VPE idle timeout is arbitrary. The > > VPE idle work handler checks to see if the block is idle before it > > powers gates the block. If it's not idle, then the delayed work is > > rescheduled so changing the timing should not make a difference. We > > are no powering down VPE while it still has active jobs. It sounds > > like there is some race condition somewhere else. > > On resume, the vpe is ungated and gated instantly, which does not > cause any crashes, then the delayed work is scheduled to run two > seconds later. Then, the tests run and finish, which start the gate > timer. After the timer lapses and the kernel tries to gate VPE, it > crashes. I logged all SMU commands and there is no difference between > the ones in a crash and not, other than the fact the VPE gate command > failed. Which becomes apparent when the next command runs. I will also > note that until the idle timer lapses, the system is responsive > > Since the VPE is ungated to run the tests, I assume that in my setup > it is not used close to resume. I should also add that I forced a kernel panic and dumped all CPU backtraces in multiple logs. After the softlock, CPUs were either parked in the scheduler, powered off, or stuck executing an SMU command by e.g., a userspace usage sensor graph. So it is not a deadlock. 
Antheas > Antheas > > > Alex > > > > > > > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm") > > > Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > > index 121ee17b522b..24f09e457352 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c > > > @@ -34,8 +34,8 @@ > > > /* VPE CSA resides in the 4th page of CSA */ > > > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3) > > > > > > -/* 1 second timeout */ > > > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000) > > > +/* 2 second timeout */ > > > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000) > > > > > > #define VPE_MAX_DPM_LEVEL 4 > > > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8 > > > > > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9 > > > -- > > > 2.50.1 > > > > > > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
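The sequence Antheas lays out (ungate for the post-resume IB test, then a gate attempt one idle timeout after the test finishes) follows from how ring use arms the idle work. A sketch of that arming, again with names modeled on the amdgpu VPE/VCN pattern rather than copied from the driver:

static void vpe_ring_begin_use(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	/* A job (e.g. the vpe_ring_test_ib run from the delayed init work)
	 * is about to execute: cancel any pending gate and power up.
	 */
	cancel_delayed_work_sync(&adev->vpe.idle_work);
	amdgpu_device_ip_set_powergating_state(adev, AMD_IP_BLOCK_TYPE_VPE,
					       AMD_PG_STATE_UNGATE);
}

static void vpe_ring_end_use(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	/* Last use finished: start the countdown to the gate attempt that
	 * fails in the traces above.
	 */
	schedule_delayed_work(&adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}

Under this model, raising the timeout from 1 s to 2 s simply widens the window between the test IB completing and the PowerDownVpe(50) message being sent.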
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-25 14:01 ` Antheas Kapenekakis @ 2025-08-25 16:41 ` Mario Limonciello 2025-08-25 21:00 ` Antheas Kapenekakis 0 siblings, 1 reply; 22+ messages in thread From: Mario Limonciello @ 2025-08-25 16:41 UTC (permalink / raw) To: Antheas Kapenekakis, Alex Deucher Cc: amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Peyton Lee, Lang Yu On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote: > On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote: >> >> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote: >>> >>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: >>>> >>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the >>>> suspend resumes result in a soft lock around 1 second after the screen >>>> turns on (it freezes). This happens due to power gating VPE when it is >>>> not used, which happens 1 second after inactivity. >>>> >>>> Specifically, the VPE gating after resume is as follows: an initial >>>> ungate, followed by a gate in the resume process. Then, >>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled >>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This >>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled >>>> with VPE_IDLE_TIMEOUT (1s). >>>> >>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the >>>> SMU to hang and partially freezes half of the GPU IPs, with the thread >>>> that called the command being stuck processing it. >>>> >>>> Specifically, after that SMU command tries to run, we get the following: >>>> >>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot >>>> ... >>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot >>>> ... >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! >>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! >>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! >>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 >>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! >>>> >>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. 
>>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, >>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the >>>> PowerDownVpe(50) command which is the common failure point in all >>>> failed resumes. >>>> >>>> On a normal resume, we should get the following power gates: >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 >>>> >>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases >>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle >>>> time of 12s sleep, 8s resume. The suspected reason here is that 1s that >>>> when VPE is used, it needs a bit of time before it can be gated and >>>> there was a borderline delay before, which is not enough for Strix Halo. >>>> When the VPE is not used, such as on resume, gating it instantly does >>>> not seem to cause issues. >>> >>> This doesn't make much sense. The VPE idle timeout is arbitrary. The >>> VPE idle work handler checks to see if the block is idle before it >>> powers gates the block. If it's not idle, then the delayed work is >>> rescheduled so changing the timing should not make a difference. We >>> are no powering down VPE while it still has active jobs. It sounds >>> like there is some race condition somewhere else. >> >> On resume, the vpe is ungated and gated instantly, which does not >> cause any crashes, then the delayed work is scheduled to run two >> seconds later. Then, the tests run and finish, which start the gate >> timer. After the timer lapses and the kernel tries to gate VPE, it >> crashes. I logged all SMU commands and there is no difference between >> the ones in a crash and not, other than the fact the VPE gate command >> failed. Which becomes apparent when the next command runs. I will also >> note that until the idle timer lapses, the system is responsive >> >> Since the VPE is ungated to run the tests, I assume that in my setup >> it is not used close to resume. > > I should also add that I forced a kernel panic and dumped all CPU > backtraces in multiple logs. After the softlock, CPUs were either > parked in the scheduler, powered off, or stuck executing an SMU > command by e.g., a userspace usage sensor graph. So it is not a > deadlock. > Can you please confirm if you are on the absolute latest linux-firmware when you reproduced this issue? Can you please share the debugfs output for amdgpu_firmware_info. ^ permalink raw reply [flat|nested] 22+ messages in thread
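For anyone who wants to collect the output Mario asks for: the node lives in debugfs, so it needs root and a mounted debugfs, and reading it is equivalent to cat'ing the file. A small standalone reader, assuming DRI index 0 (the index varies per system):

#include <stdio.h>

int main(void)
{
	/* Assumed path for a single-GPU machine; adjust the DRI index and
	 * the debugfs mount point to match your system.
	 */
	const char *path = "/sys/kernel/debug/dri/0/amdgpu_firmware_info";
	FILE *f = fopen(path, "r");
	char line[256];

	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}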
* Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo 2025-08-25 16:41 ` Mario Limonciello @ 2025-08-25 21:00 ` Antheas Kapenekakis 0 siblings, 0 replies; 22+ messages in thread From: Antheas Kapenekakis @ 2025-08-25 21:00 UTC (permalink / raw) To: Mario Limonciello Cc: Alex Deucher, amd-gfx, dri-devel, linux-kernel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Rodrigo Siqueira, Peyton Lee, Lang Yu On Mon, 25 Aug 2025 at 18:41, Mario Limonciello <superm1@kernel.org> wrote: > > On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote: > > On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@antheas.dev> wrote: > >> > >> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@gmail.com> wrote: > >>> > >>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@antheas.dev> wrote: > >>>> > >>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the > >>>> suspend resumes result in a soft lock around 1 second after the screen > >>>> turns on (it freezes). This happens due to power gating VPE when it is > >>>> not used, which happens 1 second after inactivity. > >>>> > >>>> Specifically, the VPE gating after resume is as follows: an initial > >>>> ungate, followed by a gate in the resume process. Then, > >>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled > >>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This > >>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled > >>>> with VPE_IDLE_TIMEOUT (1s). > >>>> > >>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the > >>>> SMU to hang and partially freezes half of the GPU IPs, with the thread > >>>> that called the command being stuck processing it. > >>>> > >>>> Specifically, after that SMU command tries to run, we get the following: > >>>> > >>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot > >>>> ... > >>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot > >>>> ... > >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE! > >>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62. > >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out > >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG! > >>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62. > >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0! > >>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62. > >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3 > >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5 > >>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot > >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out > >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 > >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1! 
> >>>> > >>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU. > >>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5, > >>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the > >>>> PowerDownVpe(50) command which is the common failure point in all > >>>> failed resumes. > >>>> > >>>> On a normal resume, we should get the following power gates: > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001 > >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001 > >>>> > >>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases > >>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle > >>>> time of 12s sleep, 8s resume. The suspected reason here is that 1s that > >>>> when VPE is used, it needs a bit of time before it can be gated and > >>>> there was a borderline delay before, which is not enough for Strix Halo. > >>>> When the VPE is not used, such as on resume, gating it instantly does > >>>> not seem to cause issues. > >>> > >>> This doesn't make much sense. The VPE idle timeout is arbitrary. The > >>> VPE idle work handler checks to see if the block is idle before it > >>> powers gates the block. If it's not idle, then the delayed work is > >>> rescheduled so changing the timing should not make a difference. We > >>> are no powering down VPE while it still has active jobs. It sounds > >>> like there is some race condition somewhere else. > >> > >> On resume, the vpe is ungated and gated instantly, which does not > >> cause any crashes, then the delayed work is scheduled to run two > >> seconds later. Then, the tests run and finish, which start the gate > >> timer. After the timer lapses and the kernel tries to gate VPE, it > >> crashes. I logged all SMU commands and there is no difference between > >> the ones in a crash and not, other than the fact the VPE gate command > >> failed. Which becomes apparent when the next command runs. I will also > >> note that until the idle timer lapses, the system is responsive > >> > >> Since the VPE is ungated to run the tests, I assume that in my setup > >> it is not used close to resume. > > > > I should also add that I forced a kernel panic and dumped all CPU > > backtraces in multiple logs. After the softlock, CPUs were either > > parked in the scheduler, powered off, or stuck executing an SMU > > command by e.g., a userspace usage sensor graph. So it is not a > > deadlock. > > > > Can you please confirm if you are on the absolute latest linux-firmware > when you reproduced this issue? I was on the latest at the time built from source. I think it was commit 08ee93ff8ffa. There was an update today though it seems. 
> Can you please share the debugfs output for amdgpu_firmware_info.

Here is the information from it:

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x0000001f
PFP feature version: 35, firmware version: 0x0000002c
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x11530505
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x11530505
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x0000001f
IMU feature version: 0, firmware version: 0x0b352300
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648366, firmware version: 0x210000ee
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000044
TA DTM feature version: 0x00000000, firmware version: 0x12000018
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00647000 (100.112.0)
SDMA0 feature version: 60, firmware version: 0x0000000e
VCN feature version: 0, firmware version: 0x0911800b
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x09002600
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007c
VPE feature version: 60, firmware version: 0x00000016
VBIOS version: 113-STRXLGEN-001

I see there was an update today, though.

Antheas

>

^ permalink raw reply [flat|nested] 22+ messages in thread
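For context on the repeated "SMU: I'm not done with your previous command" lines quoted in this thread: before the driver writes a new message it polls the response register of the previous one, and a command that never completes makes every later request fail with -ETIME (-62), which matches the cascade of failed JPEG/VCN gates after PowerDownVpe(50). The sketch below is only an illustration of that flow; the sketch_* helpers are hypothetical stand-ins, not the real swsmu API.

/* Illustration only: the sketch_* helpers are hypothetical. */
static int smu_send_msg_sketch(struct smu_context *smu, u32 msg, u32 param)
{
	/* 1. The previous command must have posted a response. */
	if (sketch_poll_response_reg(smu) == 0) {
		dev_err(smu->adev->dev,
			"SMU: I'm not done with your previous command\n");
		return -ETIME;			/* surfaces as ret = -62 */
	}

	/* 2. Write the argument and message registers for the new command. */
	sketch_write_argument_reg(smu, param);
	sketch_write_message_reg(smu, msg);

	/* 3. Poll for the new command's response. */
	if (sketch_poll_response_reg(smu) == 0)
		return -ETIME;

	return 0;
}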
End of thread (newest message: ~2025-08-27 15:49 UTC).

Thread overview: 22+ messages

2025-08-24  8:53 [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Antheas Kapenekakis
2025-08-24  8:53 ` [PATCH v1 2/2] drm/amd/display: Adjust AUX brightness to be a granularity of 100 Antheas Kapenekakis
2025-08-24 11:29 ` kernel test robot
2025-08-24 19:33 ` Antheas Kapenekakis
2025-08-25  7:02 ` Philip Mueller
2025-08-24 20:16 ` [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix hang on Strix Halo Mario Limonciello
2025-08-24 20:46 ` Antheas Kapenekakis
2025-08-25  1:38 ` Mario Limonciello
2025-08-25 13:39 ` Antheas Kapenekakis
2025-08-26 13:41 ` Alex Deucher
2025-08-26 19:19 ` Mario Limonciello
2025-08-26 19:21 ` Antheas Kapenekakis
2025-08-26 20:12 ` Matthew Schwartz
2025-08-26 20:58 ` Antheas Kapenekakis
2025-08-27  0:50 ` Matthew Schwartz
2025-08-27  2:37 ` Lee, Peyton
2025-08-27 15:42 ` Matthew Schwartz
2025-08-25 13:20 ` Alex Deucher
2025-08-25 13:33 ` Antheas Kapenekakis
2025-08-25 14:01 ` Antheas Kapenekakis
2025-08-25 16:41 ` Mario Limonciello
2025-08-25 21:00 ` Antheas Kapenekakis