* [PATCH 1/2] drm: introduce page_flip_timeout() @ 2026-01-23 0:05 Hamza Mahfooz 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz 2026-01-23 13:52 ` [PATCH 1/2] drm: introduce page_flip_timeout() Christian König 0 siblings, 2 replies; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-23 0:05 UTC (permalink / raw) To: dri-devel Cc: Hamza Mahfooz, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel There should be a mechanism for drivers to respond to flip_done time outs. Since, as it stands it is possible for the display to stall indefinitely, necessitating a hard reset. So, introduce a new crtc callback that is called by drm_atomic_helper_wait_for_flip_done() to give drivers a shot at recovering from page flip timeouts. Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> --- drivers/gpu/drm/drm_atomic_helper.c | 6 +++++- include/drm/drm_crtc.h | 9 +++++++++ 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index 5840e9cc6f66..3a144c324b19 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev, continue; ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ); - if (ret == 0) + if (!ret) { drm_err(dev, "[CRTC:%d:%s] flip_done timed out\n", crtc->base.id, crtc->name); + + if (crtc->funcs->page_flip_timeout) + crtc->funcs->page_flip_timeout(crtc); + } } if (state->fake_commit) diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h index 66278ffeebd6..45dc5a76e915 100644 --- a/include/drm/drm_crtc.h +++ b/include/drm/drm_crtc.h @@ -609,6 +609,15 @@ struct drm_crtc_funcs { uint32_t flags, uint32_t target, struct drm_modeset_acquire_ctx *ctx); + /** + * @page_flip_timeout: + * + * This optional hook is called if &drm_crtc_commit.flip_done times out, + * and can be used by drivers to attempt to recover from a page flip + * timeout. + */ + void (*page_flip_timeout)(struct drm_crtc *crtc); + /** * @set_property: * -- 2.52.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 0:05 [PATCH 1/2] drm: introduce page_flip_timeout() Hamza Mahfooz @ 2026-01-23 0:05 ` Hamza Mahfooz 2026-01-23 11:20 ` Timur Kristóf ` (3 more replies) 2026-01-23 13:52 ` [PATCH 1/2] drm: introduce page_flip_timeout() Christian König 1 sibling, 4 replies; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-23 0:05 UTC (permalink / raw) To: dri-devel Cc: Hamza Mahfooz, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel We now have a means to respond to page flip timeouts. So, hook up support for the new page_flip_timeout() callback. Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> --- Hi, I have tested this on 7940HS system and it appears even a MODE2 reset will reset display firmware, so I don't think we need to force a full reset here. --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + .../drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 18 ++++++++++++++++++ 3 files changed, 21 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 28c4ad62f50e..bd63f0345984 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -343,6 +343,8 @@ void amdgpu_reset_get_desc(struct amdgpu_reset_context *rst_ctxt, char *buf, case AMDGPU_RESET_SRC_USERQ: strscpy(buf, "user queue trigger", len); break; + case AMDGPU_RESET_SRC_DISPLAY: + strscpy(buf, "display hang", len); default: strscpy(buf, "unknown", len); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h index 07b4d37f1db6..53b577062b11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h @@ -44,6 +44,7 @@ enum AMDGPU_RESET_SRCS { AMDGPU_RESET_SRC_HWS, AMDGPU_RESET_SRC_USER, AMDGPU_RESET_SRC_USERQ, + AMDGPU_RESET_SRC_DISPLAY, }; struct amdgpu_reset_context { diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c index 697e232acebf..2233e5b3b6a2 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c @@ -28,6 +28,7 @@ #include "dc.h" #include "amdgpu.h" +#include "amdgpu_reset.h" #include "amdgpu_dm_psr.h" #include "amdgpu_dm_replay.h" #include "amdgpu_dm_crtc.h" @@ -578,12 +579,29 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, } #endif +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) +{ + struct amdgpu_device *adev = drm_to_adev(crtc->dev); + struct amdgpu_reset_context reset_context = {0}; + + if (amdgpu_device_should_recover_gpu(adev)) { + memset(&reset_context, 0, sizeof(reset_context)); + + reset_context.method = AMD_RESET_METHOD_NONE; + reset_context.reset_req_dev = adev; + reset_context.src = AMDGPU_RESET_SRC_DISPLAY; + + amdgpu_device_gpu_recover(adev, NULL, &reset_context); + } +} + /* Implemented only the options currently available for the driver */ static const struct drm_crtc_funcs amdgpu_dm_crtc_funcs = { .reset = amdgpu_dm_crtc_reset_state, .destroy = amdgpu_dm_crtc_destroy, .set_config = drm_atomic_helper_set_config, .page_flip = drm_atomic_helper_page_flip, + .page_flip_timeout = amdgpu_dm_crtc_handle_timeout, .atomic_duplicate_state = amdgpu_dm_crtc_duplicate_state, .atomic_destroy_state = amdgpu_dm_crtc_destroy_state, .set_crc_source = amdgpu_dm_crtc_set_crc_source, -- 2.52.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz @ 2026-01-23 11:20 ` Timur Kristóf 2026-01-23 14:25 ` Hamza Mahfooz 2026-01-23 12:26 ` kernel test robot ` (2 subsequent siblings) 3 siblings, 1 reply; 47+ messages in thread From: Timur Kristóf @ 2026-01-23 11:20 UTC (permalink / raw) To: dri-devel, Hamza Mahfooz Cc: Hamza Mahfooz, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Friday, January 23, 2026 1:05:28 AM Central European Standard Time Hamza Mahfooz wrote: > We now have a means to respond to page flip timeouts. So, hook up > support for the new page_flip_timeout() callback. > > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > --- > Hi, > > I have tested this on 7940HS system and it appears even a MODE2 reset > will reset display firmware, so I don't think we need to force a full > reset here. > --- MODE2 reset _is_ a full reset on APUs, it resets everything but just doesn't lose the RAM contents. Also note that MODE2 reset is not supported on dedicated GPUs, so this will likely trigger a full reset for those. Can you say how you tested this? I'd be happy to test it myself too. > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 2 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + > .../drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 18 ++++++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index > 28c4ad62f50e..bd63f0345984 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > @@ -343,6 +343,8 @@ void amdgpu_reset_get_desc(struct amdgpu_reset_context > *rst_ctxt, char *buf, case AMDGPU_RESET_SRC_USERQ: > strscpy(buf, "user queue trigger", len); > break; > + case AMDGPU_RESET_SRC_DISPLAY: > + strscpy(buf, "display hang", len); > default: > strscpy(buf, "unknown", len); > } > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h index > 07b4d37f1db6..53b577062b11 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > @@ -44,6 +44,7 @@ enum AMDGPU_RESET_SRCS { > AMDGPU_RESET_SRC_HWS, > AMDGPU_RESET_SRC_USER, > AMDGPU_RESET_SRC_USERQ, > + AMDGPU_RESET_SRC_DISPLAY, > }; > > struct amdgpu_reset_context { > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c index > 697e232acebf..2233e5b3b6a2 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > @@ -28,6 +28,7 @@ > > #include "dc.h" > #include "amdgpu.h" > +#include "amdgpu_reset.h" > #include "amdgpu_dm_psr.h" > #include "amdgpu_dm_replay.h" > #include "amdgpu_dm_crtc.h" > @@ -578,12 +579,29 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc > *crtc, } > #endif > > +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) > +{ > + struct amdgpu_device *adev = drm_to_adev(crtc->dev); > + struct amdgpu_reset_context reset_context = {0}; > + > + if (amdgpu_device_should_recover_gpu(adev)) { > + memset(&reset_context, 0, sizeof(reset_context)); > + > + reset_context.method = AMD_RESET_METHOD_NONE; > + reset_context.reset_req_dev = adev; > + reset_context.src = AMDGPU_RESET_SRC_DISPLAY; > + > + amdgpu_device_gpu_recover(adev, NULL, &reset_context); > + } > +} > + > /* Implemented only the options currently available for the driver */ > static const struct drm_crtc_funcs amdgpu_dm_crtc_funcs = { > .reset = amdgpu_dm_crtc_reset_state, > .destroy = amdgpu_dm_crtc_destroy, > .set_config = drm_atomic_helper_set_config, > .page_flip = drm_atomic_helper_page_flip, > + .page_flip_timeout = amdgpu_dm_crtc_handle_timeout, > .atomic_duplicate_state = amdgpu_dm_crtc_duplicate_state, > .atomic_destroy_state = amdgpu_dm_crtc_destroy_state, > .set_crc_source = amdgpu_dm_crtc_set_crc_source, ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 11:20 ` Timur Kristóf @ 2026-01-23 14:25 ` Hamza Mahfooz 0 siblings, 0 replies; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-23 14:25 UTC (permalink / raw) To: Timur Kristóf Cc: dri-devel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Fri, Jan 23, 2026 at 12:20:30PM +0100, Timur Kristóf wrote: > MODE2 reset _is_ a full reset on APUs, it resets everything but just doesn't > lose the RAM contents. Also note that MODE2 reset is not supported on > dedicated GPUs, so this will likely trigger a full reset for those. > > Can you say how you tested this? I'd be happy to test it myself too. Unfortunately, I don't have a reliable way to reproduce timeouts. Though, that machine usually hangs if I leave it idle overnight and forget to turn off the display (I have DPMS disabled). So, I just let nature run its course, I guess you can say and checked dmesg in the morning to find that it had reset and was still usable. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz 2026-01-23 11:20 ` Timur Kristóf @ 2026-01-23 12:26 ` kernel test robot 2026-01-23 15:16 ` kernel test robot 2026-01-23 17:49 ` Alex Deucher 3 siblings, 0 replies; 47+ messages in thread From: kernel test robot @ 2026-01-23 12:26 UTC (permalink / raw) To: Hamza Mahfooz, dri-devel Cc: llvm, oe-kbuild-all, Hamza Mahfooz, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel Hi Hamza, kernel test robot noticed the following build warnings: [auto build test WARNING on drm-misc/drm-misc-next] [also build test WARNING on drm/drm-next drm-i915/for-linux-next drm-i915/for-linux-next-fixes drm-tip/drm-tip linus/master v6.19-rc6 next-20260122] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Hamza-Mahfooz/drm-amdgpu-implement-page_flip_timeout-support/20260123-085944 base: https://gitlab.freedesktop.org/drm/misc/kernel.git drm-misc-next patch link: https://lore.kernel.org/r/20260123000537.2450496-2-someguy%40effective-light.com patch subject: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support config: loongarch-defconfig (https://download.01.org/0day-ci/archive/20260123/202601232007.NAxTeBna-lkp@intel.com/config) compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260123/202601232007.NAxTeBna-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202601232007.NAxTeBna-lkp@intel.com/ All warnings (new ones prefixed by >>): >> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:348:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] 348 | default: | ^ drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:348:2: note: insert '__attribute__((fallthrough));' to silence this warning 348 | default: | ^ | __attribute__((fallthrough)); drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:348:2: note: insert 'break;' to avoid fall-through 348 | default: | ^ | break; 1 warning generated. vim +348 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 2656e1ce783a90 Eric Huang 2024-06-03 315 2656e1ce783a90 Eric Huang 2024-06-03 316 void amdgpu_reset_get_desc(struct amdgpu_reset_context *rst_ctxt, char *buf, 2656e1ce783a90 Eric Huang 2024-06-03 317 size_t len) 2656e1ce783a90 Eric Huang 2024-06-03 318 { 2656e1ce783a90 Eric Huang 2024-06-03 319 if (!buf || !len) 2656e1ce783a90 Eric Huang 2024-06-03 320 return; 2656e1ce783a90 Eric Huang 2024-06-03 321 2656e1ce783a90 Eric Huang 2024-06-03 322 switch (rst_ctxt->src) { 2656e1ce783a90 Eric Huang 2024-06-03 323 case AMDGPU_RESET_SRC_JOB: 2656e1ce783a90 Eric Huang 2024-06-03 324 if (rst_ctxt->job) { 7bed1df814cd61 Eric Huang 2024-06-06 325 snprintf(buf, len, "job hang on ring:%s", 7bed1df814cd61 Eric Huang 2024-06-06 326 rst_ctxt->job->base.sched->name); 2656e1ce783a90 Eric Huang 2024-06-03 327 } else { 2656e1ce783a90 Eric Huang 2024-06-03 328 strscpy(buf, "job hang", len); 2656e1ce783a90 Eric Huang 2024-06-03 329 } 2656e1ce783a90 Eric Huang 2024-06-03 330 break; 2656e1ce783a90 Eric Huang 2024-06-03 331 case AMDGPU_RESET_SRC_RAS: 2656e1ce783a90 Eric Huang 2024-06-03 332 strscpy(buf, "RAS error", len); 2656e1ce783a90 Eric Huang 2024-06-03 333 break; 2656e1ce783a90 Eric Huang 2024-06-03 334 case AMDGPU_RESET_SRC_MES: 2656e1ce783a90 Eric Huang 2024-06-03 335 strscpy(buf, "MES hang", len); 2656e1ce783a90 Eric Huang 2024-06-03 336 break; 2656e1ce783a90 Eric Huang 2024-06-03 337 case AMDGPU_RESET_SRC_HWS: 2656e1ce783a90 Eric Huang 2024-06-03 338 strscpy(buf, "HWS hang", len); 2656e1ce783a90 Eric Huang 2024-06-03 339 break; 2656e1ce783a90 Eric Huang 2024-06-03 340 case AMDGPU_RESET_SRC_USER: 2656e1ce783a90 Eric Huang 2024-06-03 341 strscpy(buf, "user trigger", len); 2656e1ce783a90 Eric Huang 2024-06-03 342 break; c5da9e9c023893 Alex Deucher 2025-04-16 343 case AMDGPU_RESET_SRC_USERQ: c5da9e9c023893 Alex Deucher 2025-04-16 344 strscpy(buf, "user queue trigger", len); c5da9e9c023893 Alex Deucher 2025-04-16 345 break; 7136aabf196b0a Hamza Mahfooz 2026-01-22 346 case AMDGPU_RESET_SRC_DISPLAY: 7136aabf196b0a Hamza Mahfooz 2026-01-22 347 strscpy(buf, "display hang", len); 2656e1ce783a90 Eric Huang 2024-06-03 @348 default: 2656e1ce783a90 Eric Huang 2024-06-03 349 strscpy(buf, "unknown", len); 2656e1ce783a90 Eric Huang 2024-06-03 350 } 2656e1ce783a90 Eric Huang 2024-06-03 351 } a86e0c0e94373a Lijo Lazar 2024-11-15 352 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz 2026-01-23 11:20 ` Timur Kristóf 2026-01-23 12:26 ` kernel test robot @ 2026-01-23 15:16 ` kernel test robot 2026-01-23 17:49 ` Alex Deucher 3 siblings, 0 replies; 47+ messages in thread From: kernel test robot @ 2026-01-23 15:16 UTC (permalink / raw) To: Hamza Mahfooz, dri-devel Cc: oe-kbuild-all, Hamza Mahfooz, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel Hi Hamza, kernel test robot noticed the following build warnings: [auto build test WARNING on drm-misc/drm-misc-next] [also build test WARNING on daeinki-drm-exynos/exynos-drm-next drm/drm-next drm-i915/for-linux-next drm-i915/for-linux-next-fixes drm-tip/drm-tip linus/master v6.19-rc6 next-20260122] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Hamza-Mahfooz/drm-amdgpu-implement-page_flip_timeout-support/20260123-085944 base: https://gitlab.freedesktop.org/drm/misc/kernel.git drm-misc-next patch link: https://lore.kernel.org/r/20260123000537.2450496-2-someguy%40effective-light.com patch subject: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support config: i386-buildonly-randconfig-002-20260123 (https://download.01.org/0day-ci/archive/20260123/202601232335.sf6DrrBF-lkp@intel.com/config) compiler: gcc-14 (Debian 14.2.0-19) 14.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260123/202601232335.sf6DrrBF-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202601232335.sf6DrrBF-lkp@intel.com/ All warnings (new ones prefixed by >>): In file included from arch/x86/include/asm/page_32.h:18, from arch/x86/include/asm/page.h:14, from arch/x86/include/asm/processor.h:20, from arch/x86/include/asm/timex.h:5, from include/linux/timex.h:67, from include/linux/time32.h:13, from include/linux/time.h:60, from include/linux/jiffies.h:10, from include/linux/ktime.h:25, from drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h:26, from drivers/gpu/drm/amd/amdgpu/amdgpu.h:43, from drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h:27, from drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:24: drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c: In function 'amdgpu_reset_get_desc': >> include/linux/string.h:83:9: warning: this statement may fall through [-Wimplicit-fallthrough=] 83 | sized_strscpy(dst, src, size + __must_be_cstr(dst) + __must_be_cstr(src)) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ include/linux/args.h:25:24: note: in expansion of macro '__strscpy1' 25 | #define __CONCAT(a, b) a ## b | ^ include/linux/args.h:26:27: note: in expansion of macro '__CONCAT' 26 | #define CONCATENATE(a, b) __CONCAT(a, b) | ^~~~~~~~ include/linux/string.h:114:9: note: in expansion of macro 'CONCATENATE' 114 | CONCATENATE(__strscpy, COUNT_ARGS(__VA_ARGS__))(dst, src, __VA_ARGS__) | ^~~~~~~~~~~ drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:347:17: note: in expansion of macro 'strscpy' 347 | strscpy(buf, "display hang", len); | ^~~~~~~ drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:348:9: note: here 348 | default: | ^~~~~~~ -- In file included from arch/x86/include/asm/page_32.h:18, from arch/x86/include/asm/page.h:14, from arch/x86/include/asm/processor.h:20, from arch/x86/include/asm/timex.h:5, from include/linux/timex.h:67, from include/linux/time32.h:13, from include/linux/time.h:60, from include/linux/jiffies.h:10, from include/linux/ktime.h:25, from amdgpu_ctx.h:26, from amdgpu.h:43, from amdgpu_reset.h:27, from amdgpu_reset.c:24: amdgpu_reset.c: In function 'amdgpu_reset_get_desc': >> include/linux/string.h:83:9: warning: this statement may fall through [-Wimplicit-fallthrough=] 83 | sized_strscpy(dst, src, size + __must_be_cstr(dst) + __must_be_cstr(src)) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ include/linux/args.h:25:24: note: in expansion of macro '__strscpy1' 25 | #define __CONCAT(a, b) a ## b | ^ include/linux/args.h:26:27: note: in expansion of macro '__CONCAT' 26 | #define CONCATENATE(a, b) __CONCAT(a, b) | ^~~~~~~~ include/linux/string.h:114:9: note: in expansion of macro 'CONCATENATE' 114 | CONCATENATE(__strscpy, COUNT_ARGS(__VA_ARGS__))(dst, src, __VA_ARGS__) | ^~~~~~~~~~~ amdgpu_reset.c:347:17: note: in expansion of macro 'strscpy' 347 | strscpy(buf, "display hang", len); | ^~~~~~~ amdgpu_reset.c:348:9: note: here 348 | default: | ^~~~~~~ vim +83 include/linux/string.h e6584c3964f2ff Kees Cook 2023-09-20 74 e6584c3964f2ff Kees Cook 2023-09-20 75 /* e6584c3964f2ff Kees Cook 2023-09-20 76 * The 2 argument style can only be used when dst is an array with a e6584c3964f2ff Kees Cook 2023-09-20 77 * known size. e6584c3964f2ff Kees Cook 2023-09-20 78 */ e6584c3964f2ff Kees Cook 2023-09-20 79 #define __strscpy0(dst, src, ...) \ 559048d156ff33 Kees Cook 2024-08-05 80 sized_strscpy(dst, src, sizeof(dst) + __must_be_array(dst) + \ 559048d156ff33 Kees Cook 2024-08-05 81 __must_be_cstr(dst) + __must_be_cstr(src)) 559048d156ff33 Kees Cook 2024-08-05 82 #define __strscpy1(dst, src, size) \ 559048d156ff33 Kees Cook 2024-08-05 @83 sized_strscpy(dst, src, size + __must_be_cstr(dst) + __must_be_cstr(src)) e6584c3964f2ff Kees Cook 2023-09-20 84 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz ` (2 preceding siblings ...) 2026-01-23 15:16 ` kernel test robot @ 2026-01-23 17:49 ` Alex Deucher 2026-01-23 18:10 ` Alex Deucher 3 siblings, 1 reply; 47+ messages in thread From: Alex Deucher @ 2026-01-23 17:49 UTC (permalink / raw) To: Hamza Mahfooz Cc: dri-devel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel On Fri, Jan 23, 2026 at 3:37 AM Hamza Mahfooz <someguy@effective-light.com> wrote: > > We now have a means to respond to page flip timeouts. So, hook up > support for the new page_flip_timeout() callback. > > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > --- > Hi, > > I have tested this on 7940HS system and it appears even a MODE2 reset > will reset display firmware, so I don't think we need to force a full > reset here. > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 2 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + > .../drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 18 ++++++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > index 28c4ad62f50e..bd63f0345984 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > @@ -343,6 +343,8 @@ void amdgpu_reset_get_desc(struct amdgpu_reset_context *rst_ctxt, char *buf, > case AMDGPU_RESET_SRC_USERQ: > strscpy(buf, "user queue trigger", len); > break; > + case AMDGPU_RESET_SRC_DISPLAY: > + strscpy(buf, "display hang", len); > default: > strscpy(buf, "unknown", len); > } > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > index 07b4d37f1db6..53b577062b11 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > @@ -44,6 +44,7 @@ enum AMDGPU_RESET_SRCS { > AMDGPU_RESET_SRC_HWS, > AMDGPU_RESET_SRC_USER, > AMDGPU_RESET_SRC_USERQ, > + AMDGPU_RESET_SRC_DISPLAY, > }; > > struct amdgpu_reset_context { > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > index 697e232acebf..2233e5b3b6a2 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > @@ -28,6 +28,7 @@ > > #include "dc.h" > #include "amdgpu.h" > +#include "amdgpu_reset.h" > #include "amdgpu_dm_psr.h" > #include "amdgpu_dm_replay.h" > #include "amdgpu_dm_crtc.h" > @@ -578,12 +579,29 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, > } > #endif > > +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) > +{ > + struct amdgpu_device *adev = drm_to_adev(crtc->dev); > + struct amdgpu_reset_context reset_context = {0}; > + > + if (amdgpu_device_should_recover_gpu(adev)) { > + memset(&reset_context, 0, sizeof(reset_context)); > + > + reset_context.method = AMD_RESET_METHOD_NONE; > + reset_context.reset_req_dev = adev; > + reset_context.src = AMDGPU_RESET_SRC_DISPLAY; > + > + amdgpu_device_gpu_recover(adev, NULL, &reset_context); > + } Rather than resetting the whole GPU here, does just suspending and resuming DC help? E.g., call dm_suspend() and dm_resume(), but force the reset path (the amdgpu_in_reset() case) in those functions. If that works, that should help narrow down where the problem is. Alex > +} > + > /* Implemented only the options currently available for the driver */ > static const struct drm_crtc_funcs amdgpu_dm_crtc_funcs = { > .reset = amdgpu_dm_crtc_reset_state, > .destroy = amdgpu_dm_crtc_destroy, > .set_config = drm_atomic_helper_set_config, > .page_flip = drm_atomic_helper_page_flip, > + .page_flip_timeout = amdgpu_dm_crtc_handle_timeout, > .atomic_duplicate_state = amdgpu_dm_crtc_duplicate_state, > .atomic_destroy_state = amdgpu_dm_crtc_destroy_state, > .set_crc_source = amdgpu_dm_crtc_set_crc_source, > -- > 2.52.0 > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support 2026-01-23 17:49 ` Alex Deucher @ 2026-01-23 18:10 ` Alex Deucher 0 siblings, 0 replies; 47+ messages in thread From: Alex Deucher @ 2026-01-23 18:10 UTC (permalink / raw) To: Hamza Mahfooz Cc: dri-devel, Alex Deucher, Christian König, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Lijo Lazar, Ce Sun, Ivan Lipski, Kenneth Feng, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel On Fri, Jan 23, 2026 at 12:49 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > On Fri, Jan 23, 2026 at 3:37 AM Hamza Mahfooz > <someguy@effective-light.com> wrote: > > > > We now have a means to respond to page flip timeouts. So, hook up > > support for the new page_flip_timeout() callback. > > > > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > > --- > > Hi, > > > > I have tested this on 7940HS system and it appears even a MODE2 reset > > will reset display firmware, so I don't think we need to force a full > > reset here. > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 2 ++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + > > .../drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 18 ++++++++++++++++++ > > 3 files changed, 21 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > > index 28c4ad62f50e..bd63f0345984 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > > @@ -343,6 +343,8 @@ void amdgpu_reset_get_desc(struct amdgpu_reset_context *rst_ctxt, char *buf, > > case AMDGPU_RESET_SRC_USERQ: > > strscpy(buf, "user queue trigger", len); > > break; > > + case AMDGPU_RESET_SRC_DISPLAY: > > + strscpy(buf, "display hang", len); > > default: > > strscpy(buf, "unknown", len); > > } > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > > index 07b4d37f1db6..53b577062b11 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h > > @@ -44,6 +44,7 @@ enum AMDGPU_RESET_SRCS { > > AMDGPU_RESET_SRC_HWS, > > AMDGPU_RESET_SRC_USER, > > AMDGPU_RESET_SRC_USERQ, > > + AMDGPU_RESET_SRC_DISPLAY, > > }; > > > > struct amdgpu_reset_context { > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > > index 697e232acebf..2233e5b3b6a2 100644 > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > > @@ -28,6 +28,7 @@ > > > > #include "dc.h" > > #include "amdgpu.h" > > +#include "amdgpu_reset.h" > > #include "amdgpu_dm_psr.h" > > #include "amdgpu_dm_replay.h" > > #include "amdgpu_dm_crtc.h" > > @@ -578,12 +579,29 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, > > } > > #endif > > > > +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) > > +{ > > + struct amdgpu_device *adev = drm_to_adev(crtc->dev); > > + struct amdgpu_reset_context reset_context = {0}; > > + > > + if (amdgpu_device_should_recover_gpu(adev)) { > > + memset(&reset_context, 0, sizeof(reset_context)); > > + > > + reset_context.method = AMD_RESET_METHOD_NONE; > > + reset_context.reset_req_dev = adev; > > + reset_context.src = AMDGPU_RESET_SRC_DISPLAY; > > + > > + amdgpu_device_gpu_recover(adev, NULL, &reset_context); > > + } > > Rather than resetting the whole GPU here, does just suspending and > resuming DC help? E.g., call dm_suspend() and dm_resume(), but force > the reset path (the amdgpu_in_reset() case) in those functions. If > that works, that should help narrow down where the problem is. Actually does just calling drm_crtc_send_vblank_event() for all of the relevant crtcs at this point help? Alex > > Alex > > > > +} > > + > > /* Implemented only the options currently available for the driver */ > > static const struct drm_crtc_funcs amdgpu_dm_crtc_funcs = { > > .reset = amdgpu_dm_crtc_reset_state, > > .destroy = amdgpu_dm_crtc_destroy, > > .set_config = drm_atomic_helper_set_config, > > .page_flip = drm_atomic_helper_page_flip, > > + .page_flip_timeout = amdgpu_dm_crtc_handle_timeout, > > .atomic_duplicate_state = amdgpu_dm_crtc_duplicate_state, > > .atomic_destroy_state = amdgpu_dm_crtc_destroy_state, > > .set_crc_source = amdgpu_dm_crtc_set_crc_source, > > -- > > 2.52.0 > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 0:05 [PATCH 1/2] drm: introduce page_flip_timeout() Hamza Mahfooz 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz @ 2026-01-23 13:52 ` Christian König 2026-01-23 14:44 ` Hamza Mahfooz 2026-01-23 14:44 ` Timur Kristóf 1 sibling, 2 replies; 47+ messages in thread From: Christian König @ 2026-01-23 13:52 UTC (permalink / raw) To: Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel On 1/23/26 01:05, Hamza Mahfooz wrote: > There should be a mechanism for drivers to respond to flip_done > time outs. I can only see two reasons why you could run into a timeout: 1. A dma_fence never signals. How that should be handled is already well documented and doesn't require any of this. 2. A coding error in the vblank or page flip handler leading to waiting forever. In that case calling back into the driver doesn't help either. So as far as I can see the whole approach doesn't make any sense at all. Regards, Christian. > Since, as it stands it is possible for the display > to stall indefinitely, necessitating a hard reset. So, introduce > a new crtc callback that is called by > drm_atomic_helper_wait_for_flip_done() to give drivers a shot > at recovering from page flip timeouts. > > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > --- > drivers/gpu/drm/drm_atomic_helper.c | 6 +++++- > include/drm/drm_crtc.h | 9 +++++++++ > 2 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 5840e9cc6f66..3a144c324b19 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev, > continue; > > ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ); > - if (ret == 0) > + if (!ret) { > drm_err(dev, "[CRTC:%d:%s] flip_done timed out\n", > crtc->base.id, crtc->name); > + > + if (crtc->funcs->page_flip_timeout) > + crtc->funcs->page_flip_timeout(crtc); > + } > } > > if (state->fake_commit) > diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h > index 66278ffeebd6..45dc5a76e915 100644 > --- a/include/drm/drm_crtc.h > +++ b/include/drm/drm_crtc.h > @@ -609,6 +609,15 @@ struct drm_crtc_funcs { > uint32_t flags, uint32_t target, > struct drm_modeset_acquire_ctx *ctx); > > + /** > + * @page_flip_timeout: > + * > + * This optional hook is called if &drm_crtc_commit.flip_done times out, > + * and can be used by drivers to attempt to recover from a page flip > + * timeout. > + */ > + void (*page_flip_timeout)(struct drm_crtc *crtc); > + > /** > * @set_property: > * ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 13:52 ` [PATCH 1/2] drm: introduce page_flip_timeout() Christian König @ 2026-01-23 14:44 ` Hamza Mahfooz 2026-01-23 16:12 ` Christian König 2026-01-23 19:41 ` Alex Deucher 2026-01-23 14:44 ` Timur Kristóf 1 sibling, 2 replies; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-23 14:44 UTC (permalink / raw) To: Christian König Cc: dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel On Fri, Jan 23, 2026 at 02:52:44PM +0100, Christian König wrote: > I can only see two reasons why you could run into a timeout: > > 1. A dma_fence never signals. > How that should be handled is already well documented and doesn't require any of this. > > 2. A coding error in the vblank or page flip handler leading to waiting forever. > In that case calling back into the driver doesn't help either. > > So as far as I can see the whole approach doesn't make any sense at all. It appears that resetting display firmware is able to put at least a subset of these systems back into a consistent (usable) state. Though, I don't have a reliable way to reproduce the issue that I'm seeing so I can't say for sure what it boils down to. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 14:44 ` Hamza Mahfooz @ 2026-01-23 16:12 ` Christian König 2026-01-23 19:41 ` Alex Deucher 1 sibling, 0 replies; 47+ messages in thread From: Christian König @ 2026-01-23 16:12 UTC (permalink / raw) To: Hamza Mahfooz Cc: dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel On 1/23/26 15:44, Hamza Mahfooz wrote: > On Fri, Jan 23, 2026 at 02:52:44PM +0100, Christian König wrote: >> I can only see two reasons why you could run into a timeout: >> >> 1. A dma_fence never signals. >> How that should be handled is already well documented and doesn't require any of this. >> >> 2. A coding error in the vblank or page flip handler leading to waiting forever. >> In that case calling back into the driver doesn't help either. >> >> So as far as I can see the whole approach doesn't make any sense at all. > > It appears that resetting display firmware is able to put at least a > subset of these systems back into a consistent (usable) state. Though, I > don't have a reliable way to reproduce the issue that I'm seeing so I > can't say for sure what it boils down to. Well there is no way to only reset the display firmware. So I'm not sure what you are testing here. What could be is that the DC code has bugs and a normal ASIC reset unblocked some endless loop or similar somehow, but that is absolutely not the right thing TODO. Regards, Christian. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 14:44 ` Hamza Mahfooz 2026-01-23 16:12 ` Christian König @ 2026-01-23 19:41 ` Alex Deucher 1 sibling, 0 replies; 47+ messages in thread From: Alex Deucher @ 2026-01-23 19:41 UTC (permalink / raw) To: Hamza Mahfooz Cc: Christian König, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, Timur Kristóf, amd-gfx, linux-kernel On Fri, Jan 23, 2026 at 9:52 AM Hamza Mahfooz <someguy@effective-light.com> wrote: > > On Fri, Jan 23, 2026 at 02:52:44PM +0100, Christian König wrote: > > I can only see two reasons why you could run into a timeout: > > > > 1. A dma_fence never signals. > > How that should be handled is already well documented and doesn't require any of this. > > > > 2. A coding error in the vblank or page flip handler leading to waiting forever. > > In that case calling back into the driver doesn't help either. > > > > So as far as I can see the whole approach doesn't make any sense at all. > > It appears that resetting display firmware is able to put at least a > subset of these systems back into a consistent (usable) state. Though, I > don't have a reliable way to reproduce the issue that I'm seeing so I > can't say for sure what it boils down to. I'm not at all an expert on KMS, but I took a quick look at the in and out fences in KMS, and I think I know what might be going on. The out fence is signalled by calling drm_crtc_send_vblank_event() from the interrupt handler for the vblank/pageflip interrupt. If that interrupt gets missed somehow, that never gets called and userspace will wait forever. As a safety measure, maybe add a worker thread that gets scheduled when the atomic commit happens and then in the interrupt handler we cancel the worker. If the interrupt never happens, the worker will eventually run and call drm_crtc_send_vblank_event() and get things unstuck. Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 13:52 ` [PATCH 1/2] drm: introduce page_flip_timeout() Christian König 2026-01-23 14:44 ` Hamza Mahfooz @ 2026-01-23 14:44 ` Timur Kristóf 2026-01-23 22:30 ` Mario Limonciello 2026-01-26 10:14 ` Christian König 1 sibling, 2 replies; 47+ messages in thread From: Timur Kristóf @ 2026-01-23 14:44 UTC (permalink / raw) To: Hamza Mahfooz, dri-devel, Christian König Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Harry Wentland, Mario Limonciello On Friday, January 23, 2026 2:52:44 PM Central European Standard Time Christian König wrote: > On 1/23/26 01:05, Hamza Mahfooz wrote: > > There should be a mechanism for drivers to respond to flip_done > > time outs. > I am adding Harry and Mario to this email as they are more familiar with this. > I can only see two reasons why you could run into a timeout: > > 1. A dma_fence never signals. > How that should be handled is already well documented and doesn't require > any of this. Page flip timeouts have nothing to do with fence timeouts. A page flip timeout can occur even when all fences of all job submissions complete correctly and on time. > > 2. A coding error in the vblank or page flip handler leading to waiting > forever. In that case calling back into the driver doesn't help either. At the moment, a page flip timeout will leave the whole system in a hung state and the driver does not even attempt to recover it in any way, it just stops doing anything, which is unacceptable and I'm pretty surprised that it was left like that for so long. Note that we have approximately a hundred bug reports open on the drm/amd bug tracker about "random" page flip timeouts. It affects a lot of users. > > So as far as I can see the whole approach doesn't make any sense at all. Actually this approach was proposed as a solution at XDC 2025 in Harry's presentation, "DRM calls driver callback to attempt recovery", see page 9 in this slide deck: https://indico.freedesktop.org/event/10/contributions/431/attachments/ 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf If you disagree with Harry, please make a counter-proposal. Thanks, Timur > > > Since, as it stands it is possible for the display > > to stall indefinitely, necessitating a hard reset. So, introduce > > a new crtc callback that is called by > > drm_atomic_helper_wait_for_flip_done() to give drivers a shot > > at recovering from page flip timeouts. > > > > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > > --- > > > > drivers/gpu/drm/drm_atomic_helper.c | 6 +++++- > > include/drm/drm_crtc.h | 9 +++++++++ > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c > > b/drivers/gpu/drm/drm_atomic_helper.c index 5840e9cc6f66..3a144c324b19 > > 100644 > > --- a/drivers/gpu/drm/drm_atomic_helper.c > > +++ b/drivers/gpu/drm/drm_atomic_helper.c > > @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct > > drm_device *dev,> > > continue; > > > > ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ); > > > > - if (ret == 0) > > + if (!ret) { > > > > drm_err(dev, "[CRTC:%d:%s] flip_done timed out\n", > > > > crtc->base.id, crtc->name); > > > > + > > + if (crtc->funcs->page_flip_timeout) > > + crtc->funcs- >page_flip_timeout(crtc); > > + } > > > > } > > > > if (state->fake_commit) > > > > diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h > > index 66278ffeebd6..45dc5a76e915 100644 > > --- a/include/drm/drm_crtc.h > > +++ b/include/drm/drm_crtc.h > > @@ -609,6 +609,15 @@ struct drm_crtc_funcs { > > > > uint32_t flags, uint32_t target, > > struct drm_modeset_acquire_ctx *ctx); > > > > + /** > > + * @page_flip_timeout: > > + * > > + * This optional hook is called if &drm_crtc_commit.flip_done times out, > > + * and can be used by drivers to attempt to recover from a page flip > > + * timeout. > > + */ > > + void (*page_flip_timeout)(struct drm_crtc *crtc); > > + > > > > /** > > > > * @set_property: > > * ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 14:44 ` Timur Kristóf @ 2026-01-23 22:30 ` Mario Limonciello 2026-01-24 18:32 ` Hamza Mahfooz 2026-01-26 10:14 ` Christian König 1 sibling, 1 reply; 47+ messages in thread From: Mario Limonciello @ 2026-01-23 22:30 UTC (permalink / raw) To: Timur Kristóf, Hamza Mahfooz, dri-devel, Christian König Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On 1/23/2026 8:44 AM, Timur Kristóf wrote: > On Friday, January 23, 2026 2:52:44 PM Central European Standard Time > Christian König wrote: >> On 1/23/26 01:05, Hamza Mahfooz wrote: >>> There should be a mechanism for drivers to respond to flip_done >>> time outs. >> > When there is a display hang, I think that resetting the GPU IP is really heavy handed. I second what Alex said - Why not instead just reset DCN? I would think move DCN into D3 and back out should be enough if trying to use something to recover. > I am adding Harry and Mario to this email as they are more familiar with this. > >> I can only see two reasons why you could run into a timeout: >> >> 1. A dma_fence never signals. >> How that should be handled is already well documented and doesn't > require >> any of this. > > Page flip timeouts have nothing to do with fence timeouts. > A page flip timeout can occur even when all fences of all job submissions > complete correctly and on time. > >> >> 2. A coding error in the vblank or page flip handler leading to waiting >> forever. In that case calling back into the driver doesn't help either. > > At the moment, a page flip timeout will leave the whole system in a hung state > and the driver does not even attempt to recover it in any way, it just stops > doing anything, which is unacceptable and I'm pretty surprised that it was > left like that for so long. > > Note that we have approximately a hundred bug reports open on the drm/amd bug > tracker about "random" page flip timeouts. It affects a lot of users. Yeah I would much rather leave some messages in the log that this happened and see a recovery occur than a hang. > >> >> So as far as I can see the whole approach doesn't make any sense at all. > > Actually this approach was proposed as a solution at XDC 2025 in Harry's > presentation, "DRM calls driver callback to attempt recovery", see page 9 in > this slide deck: > > https://indico.freedesktop.org/event/10/contributions/431/attachments/ > 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf > > If you disagree with Harry, please make a counter-proposal. Hamza - since you seem to have a "workload" that can run overnight and this series recovers, can you try what Alex said and do a dc_suspend() and dc_resume() for failure? Make sure you log a message so you can know it worked. > > Thanks, > Timur > > > >> >>> Since, as it stands it is possible for the display >>> to stall indefinitely, necessitating a hard reset. So, introduce >>> a new crtc callback that is called by >>> drm_atomic_helper_wait_for_flip_done() to give drivers a shot >>> at recovering from page flip timeouts. >>> >>> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> >>> --- >>> >>> drivers/gpu/drm/drm_atomic_helper.c | 6 +++++- >>> include/drm/drm_crtc.h | 9 +++++++++ >>> 2 files changed, 14 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c >>> b/drivers/gpu/drm/drm_atomic_helper.c index 5840e9cc6f66..3a144c324b19 >>> 100644 >>> --- a/drivers/gpu/drm/drm_atomic_helper.c >>> +++ b/drivers/gpu/drm/drm_atomic_helper.c >>> @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct >>> drm_device *dev,> >>> continue; >>> >>> ret = wait_for_completion_timeout(&commit->flip_done, 10 > * HZ); >>> >>> - if (ret == 0) >>> + if (!ret) { >>> >>> drm_err(dev, "[CRTC:%d:%s] flip_done timed > out\n", >>> >>> crtc->base.id, crtc->name); >>> >>> + >>> + if (crtc->funcs->page_flip_timeout) >>> + crtc->funcs- >> page_flip_timeout(crtc); >>> + } >>> >>> } >>> >>> if (state->fake_commit) >>> >>> diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h >>> index 66278ffeebd6..45dc5a76e915 100644 >>> --- a/include/drm/drm_crtc.h >>> +++ b/include/drm/drm_crtc.h >>> @@ -609,6 +609,15 @@ struct drm_crtc_funcs { >>> >>> uint32_t flags, uint32_t target, >>> struct drm_modeset_acquire_ctx > *ctx); >>> >>> + /** >>> + * @page_flip_timeout: >>> + * >>> + * This optional hook is called if &drm_crtc_commit.flip_done times > out, >>> + * and can be used by drivers to attempt to recover from a page > flip >>> + * timeout. >>> + */ >>> + void (*page_flip_timeout)(struct drm_crtc *crtc); >>> + >>> >>> /** >>> >>> * @set_property: >>> * > > > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 22:30 ` Mario Limonciello @ 2026-01-24 18:32 ` Hamza Mahfooz 2026-01-24 18:43 ` Mario Limonciello 2026-01-26 14:20 ` Alex Deucher 0 siblings, 2 replies; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-24 18:32 UTC (permalink / raw) To: Mario Limonciello Cc: Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Fri, Jan 23, 2026 at 04:30:12PM -0600, Mario Limonciello wrote: > Hamza - since you seem to have a "workload" that can run overnight and this > series recovers, can you try what Alex said and do a dc_suspend() and > dc_resume() for failure? > > Make sure you log a message so you can know it worked. Sure, I'll try something along the lines of: diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c index 492457c86393..bc7abd00f5f4 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c @@ -579,11 +579,28 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, } #endif -static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc, + struct drm_crtc_commit *commit) { struct amdgpu_device *adev = drm_to_adev(crtc->dev); struct amdgpu_reset_context reset_ctx; + struct amdgpu_ip_block *ip_block; + ip_block = amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_DCE); + if (!ip_block) + goto full_reset; + + ip_block->version->funcs->suspend(ip_block); + ip_block->version->funcs->resume(ip_block); + + if (drm_crtc_commit_wait(commit)) { + drm_err(crtc->dev, "suspend-resume failed!\n"); + goto full_reset; + } + + return; + +full_reset: if (amdgpu_device_should_recover_gpu(adev)) { memset(&reset_ctx, 0, sizeof(reset_ctx)); diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index 7175294ccb57..b38c4ee2fc95 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -1961,7 +1961,7 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev, crtc->base.id, crtc->name); if (crtc->funcs->page_flip_timeout) - crtc->funcs->page_flip_timeout(crtc); + crtc->funcs->page_flip_timeout(crtc, commit); } } diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h index 45dc5a76e915..47a34a05f6de 100644 --- a/include/drm/drm_crtc.h +++ b/include/drm/drm_crtc.h @@ -616,7 +616,8 @@ struct drm_crtc_funcs { * and can be used by drivers to attempt to recover from a page flip * timeout. */ - void (*page_flip_timeout)(struct drm_crtc *crtc); + void (*page_flip_timeout)(struct drm_crtc *crtc, + struct drm_crtc_commit *commit); /** * @set_property: ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-24 18:32 ` Hamza Mahfooz @ 2026-01-24 18:43 ` Mario Limonciello 2026-01-24 19:49 ` Hamza Mahfooz 2026-01-26 14:20 ` Alex Deucher 1 sibling, 1 reply; 47+ messages in thread From: Mario Limonciello @ 2026-01-24 18:43 UTC (permalink / raw) To: Hamza Mahfooz Cc: Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On 1/24/2026 12:32 PM, Hamza Mahfooz wrote: > On Fri, Jan 23, 2026 at 04:30:12PM -0600, Mario Limonciello wrote: >> Hamza - since you seem to have a "workload" that can run overnight and this >> series recovers, can you try what Alex said and do a dc_suspend() and >> dc_resume() for failure? >> >> Make sure you log a message so you can know it worked. > > Sure, I'll try something along the lines of: Generally speaking that looks good, but I'll leave a few comments. > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > index 492457c86393..bc7abd00f5f4 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > @@ -579,11 +579,28 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, > } > #endif > > -static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) > +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc, > + struct drm_crtc_commit *commit) > { > struct amdgpu_device *adev = drm_to_adev(crtc->dev); > struct amdgpu_reset_context reset_ctx; > + struct amdgpu_ip_block *ip_block; > > + ip_block = amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_DCE); > + if (!ip_block) > + goto full_reset; > + > + ip_block->version->funcs->suspend(ip_block); > + ip_block->version->funcs->resume(ip_block); Both of these can fail. Especially considering the page flip timeout could be a DCN hang, I think you should check return code for both of them sequentially and jump to the full reset condition if either fails. > + > + if (drm_crtc_commit_wait(commit)) { > + drm_err(crtc->dev, "suspend-resume failed!\n"); > + goto full_reset; > + } > + At least to prove "this worked" you should log a message "right here" that the reset occurred and you recovered. That "might not" be in the final version, but I think it's worth having for now. > + return; > + > +full_reset: > if (amdgpu_device_should_recover_gpu(adev)) { > memset(&reset_ctx, 0, sizeof(reset_ctx)); > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 7175294ccb57..b38c4ee2fc95 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -1961,7 +1961,7 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev, > crtc->base.id, crtc->name); > > if (crtc->funcs->page_flip_timeout) > - crtc->funcs->page_flip_timeout(crtc); > + crtc->funcs->page_flip_timeout(crtc, commit); > } > } > > diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h > index 45dc5a76e915..47a34a05f6de 100644 > --- a/include/drm/drm_crtc.h > +++ b/include/drm/drm_crtc.h > @@ -616,7 +616,8 @@ struct drm_crtc_funcs { > * and can be used by drivers to attempt to recover from a page flip > * timeout. > */ > - void (*page_flip_timeout)(struct drm_crtc *crtc); > + void (*page_flip_timeout)(struct drm_crtc *crtc, > + struct drm_crtc_commit *commit); > > /** > * @set_property: ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-24 18:43 ` Mario Limonciello @ 2026-01-24 19:49 ` Hamza Mahfooz 2026-01-27 22:44 ` Hamza Mahfooz 0 siblings, 1 reply; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-24 19:49 UTC (permalink / raw) To: Mario Limonciello Cc: Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Sat, Jan 24, 2026 at 12:43:02PM -0600, Mario Limonciello wrote: > > > On 1/24/2026 12:32 PM, Hamza Mahfooz wrote: > > On Fri, Jan 23, 2026 at 04:30:12PM -0600, Mario Limonciello wrote: > > > Hamza - since you seem to have a "workload" that can run overnight and this > > > series recovers, can you try what Alex said and do a dc_suspend() and > > > dc_resume() for failure? > > > > > > Make sure you log a message so you can know it worked. > > > > Sure, I'll try something along the lines of: > > Generally speaking that looks good, but I'll leave a few comments. > > > > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > > index 492457c86393..bc7abd00f5f4 100644 > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > > @@ -579,11 +579,28 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, > > } > > #endif > > > > -static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) > > +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc, > > + struct drm_crtc_commit *commit) > > { > > struct amdgpu_device *adev = drm_to_adev(crtc->dev); > > struct amdgpu_reset_context reset_ctx; > > + struct amdgpu_ip_block *ip_block; > > > > + ip_block = amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_DCE); > > + if (!ip_block) > > + goto full_reset; > > + > > + ip_block->version->funcs->suspend(ip_block); > > + ip_block->version->funcs->resume(ip_block); > > Both of these can fail. Especially considering the page flip timeout could > be a DCN hang, I think you should check return code for both of them > sequentially and jump to the full reset condition if either fails. > > > + > > + if (drm_crtc_commit_wait(commit)) { > > + drm_err(crtc->dev, "suspend-resume failed!\n"); > > + goto full_reset; > > + } > > + > > At least to prove "this worked" you should log a message "right here" that > the reset occurred and you recovered. That "might not" be in the final > version, but I think it's worth having for now. I have included all of the suggestions in my test run, fingers crossed that I don't have to wait too long for a repro though. > > > + return; > > + > > +full_reset: > > if (amdgpu_device_should_recover_gpu(adev)) { > > memset(&reset_ctx, 0, sizeof(reset_ctx)); > > > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > > index 7175294ccb57..b38c4ee2fc95 100644 > > --- a/drivers/gpu/drm/drm_atomic_helper.c > > +++ b/drivers/gpu/drm/drm_atomic_helper.c > > @@ -1961,7 +1961,7 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev, > > crtc->base.id, crtc->name); > > > > if (crtc->funcs->page_flip_timeout) > > - crtc->funcs->page_flip_timeout(crtc); > > + crtc->funcs->page_flip_timeout(crtc, commit); > > } > > } > > > > diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h > > index 45dc5a76e915..47a34a05f6de 100644 > > --- a/include/drm/drm_crtc.h > > +++ b/include/drm/drm_crtc.h > > @@ -616,7 +616,8 @@ struct drm_crtc_funcs { > > * and can be used by drivers to attempt to recover from a page flip > > * timeout. > > */ > > - void (*page_flip_timeout)(struct drm_crtc *crtc); > > + void (*page_flip_timeout)(struct drm_crtc *crtc, > > + struct drm_crtc_commit *commit); > > > > /** > > * @set_property: > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-24 19:49 ` Hamza Mahfooz @ 2026-01-27 22:44 ` Hamza Mahfooz 0 siblings, 0 replies; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-27 22:44 UTC (permalink / raw) To: Mario Limonciello Cc: Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel The normal suspend-resume path doesn't seem to be enough (I end up hitting [0] and [1] actually). However, if I force it to take the reset path (using amdgpu_device_lock_reset_domain()/amdgpu_device_unlock_reset_domain()), that does seem to do the trick. [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=v6.19-rc7#n3161 [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=v6.19-rc7#n3171 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-24 18:32 ` Hamza Mahfooz 2026-01-24 18:43 ` Mario Limonciello @ 2026-01-26 14:20 ` Alex Deucher 2026-01-27 22:52 ` Hamza Mahfooz 1 sibling, 1 reply; 47+ messages in thread From: Alex Deucher @ 2026-01-26 14:20 UTC (permalink / raw) To: Hamza Mahfooz Cc: Mario Limonciello, Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Mon, Jan 26, 2026 at 3:09 AM Hamza Mahfooz <someguy@effective-light.com> wrote: > > On Fri, Jan 23, 2026 at 04:30:12PM -0600, Mario Limonciello wrote: > > Hamza - since you seem to have a "workload" that can run overnight and this > > series recovers, can you try what Alex said and do a dc_suspend() and > > dc_resume() for failure? > > > > Make sure you log a message so you can know it worked. > > Sure, I'll try something along the lines of: > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > index 492457c86393..bc7abd00f5f4 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > @@ -579,11 +579,28 @@ amdgpu_dm_atomic_crtc_get_property(struct drm_crtc *crtc, > } > #endif > > -static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc) > +static void amdgpu_dm_crtc_handle_timeout(struct drm_crtc *crtc, > + struct drm_crtc_commit *commit) > { > struct amdgpu_device *adev = drm_to_adev(crtc->dev); > struct amdgpu_reset_context reset_ctx; > + struct amdgpu_ip_block *ip_block; > > + ip_block = amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_DCE); > + if (!ip_block) > + goto full_reset; > + > + ip_block->version->funcs->suspend(ip_block); > + ip_block->version->funcs->resume(ip_block); > + I suspect just calling drm_crtc_send_vblank_event() here on the relevant crtcs would be enough. Alex > + if (drm_crtc_commit_wait(commit)) { > + drm_err(crtc->dev, "suspend-resume failed!\n"); > + goto full_reset; > + } > + > + return; > + > +full_reset: > if (amdgpu_device_should_recover_gpu(adev)) { > memset(&reset_ctx, 0, sizeof(reset_ctx)); > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 7175294ccb57..b38c4ee2fc95 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -1961,7 +1961,7 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev, > crtc->base.id, crtc->name); > > if (crtc->funcs->page_flip_timeout) > - crtc->funcs->page_flip_timeout(crtc); > + crtc->funcs->page_flip_timeout(crtc, commit); > } > } > > diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h > index 45dc5a76e915..47a34a05f6de 100644 > --- a/include/drm/drm_crtc.h > +++ b/include/drm/drm_crtc.h > @@ -616,7 +616,8 @@ struct drm_crtc_funcs { > * and can be used by drivers to attempt to recover from a page flip > * timeout. > */ > - void (*page_flip_timeout)(struct drm_crtc *crtc); > + void (*page_flip_timeout)(struct drm_crtc *crtc, > + struct drm_crtc_commit *commit); > > /** > * @set_property: ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-26 14:20 ` Alex Deucher @ 2026-01-27 22:52 ` Hamza Mahfooz 2026-01-27 22:57 ` Alex Deucher 0 siblings, 1 reply; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-27 22:52 UTC (permalink / raw) To: Alex Deucher Cc: Mario Limonciello, Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Mon, Jan 26, 2026 at 09:20:55AM -0500, Alex Deucher wrote: > I suspect just calling drm_crtc_send_vblank_event() here on the > relevant crtcs would be enough. > Seems like an interesting idea, though I would imagine we would still want to attempt a reset (of some kind) assuming that the subsequent page flip also experiences a timeout. > Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-27 22:52 ` Hamza Mahfooz @ 2026-01-27 22:57 ` Alex Deucher 2026-01-28 10:39 ` Christian König 0 siblings, 1 reply; 47+ messages in thread From: Alex Deucher @ 2026-01-27 22:57 UTC (permalink / raw) To: Hamza Mahfooz Cc: Mario Limonciello, Timur Kristóf, dri-devel, Christian König, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Tue, Jan 27, 2026 at 5:53 PM Hamza Mahfooz <someguy@effective-light.com> wrote: > > On Mon, Jan 26, 2026 at 09:20:55AM -0500, Alex Deucher wrote: > > I suspect just calling drm_crtc_send_vblank_event() here on the > > relevant crtcs would be enough. > > > > Seems like an interesting idea, though I would imagine we would still > want to attempt a reset (of some kind) assuming that the subsequent page > flip also experiences a timeout. Is it actually a timeout or just missed interrupts? I'm wondering if some power feature races with the modeset and causes the interrupt to get missed from time to time. Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-27 22:57 ` Alex Deucher @ 2026-01-28 10:39 ` Christian König 2026-01-28 11:26 ` Michel Dänzer 0 siblings, 1 reply; 47+ messages in thread From: Christian König @ 2026-01-28 10:39 UTC (permalink / raw) To: Alex Deucher, Hamza Mahfooz Cc: Mario Limonciello, Timur Kristóf, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On 1/27/26 23:57, Alex Deucher wrote: > On Tue, Jan 27, 2026 at 5:53 PM Hamza Mahfooz > <someguy@effective-light.com> wrote: >> >> On Mon, Jan 26, 2026 at 09:20:55AM -0500, Alex Deucher wrote: >>> I suspect just calling drm_crtc_send_vblank_event() here on the >>> relevant crtcs would be enough. >>> >> >> Seems like an interesting idea, though I would imagine we would still >> want to attempt a reset (of some kind) assuming that the subsequent page >> flip also experiences a timeout. > > Is it actually a timeout or just missed interrupts? I'm wondering if > some power feature races with the modeset and causes the interrupt to > get missed from time to time. That is my strong suspicion as well. Even if we missed a vblank interrupt that thing is reoccurring, so the worst thing that can happen is that we delayed reporting back success by one frame. So something must have turned the CRTC fully off. BTW: suspending the DAL or event resetting the GPU in such a callback are just hacks, don't be surprised if you run into things like NULL pointer dereferences etc.... Christian. > > Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 10:39 ` Christian König @ 2026-01-28 11:26 ` Michel Dänzer 2026-01-28 12:14 ` Timur Kristóf 0 siblings, 1 reply; 47+ messages in thread From: Michel Dänzer @ 2026-01-28 11:26 UTC (permalink / raw) To: Christian König, Alex Deucher, Hamza Mahfooz Cc: Mario Limonciello, Timur Kristóf, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On 1/28/26 11:39, Christian König wrote: > On 1/27/26 23:57, Alex Deucher wrote: >> On Tue, Jan 27, 2026 at 5:53 PM Hamza Mahfooz >> <someguy@effective-light.com> wrote: >>> >>> On Mon, Jan 26, 2026 at 09:20:55AM -0500, Alex Deucher wrote: >>>> I suspect just calling drm_crtc_send_vblank_event() here on the >>>> relevant crtcs would be enough. >>>> >>> >>> Seems like an interesting idea, though I would imagine we would still >>> want to attempt a reset (of some kind) assuming that the subsequent page >>> flip also experiences a timeout. >> >> Is it actually a timeout or just missed interrupts? I'm wondering if >> some power feature races with the modeset and causes the interrupt to >> get missed from time to time. > > That is my strong suspicion as well. > > Even if we missed a vblank interrupt that thing is reoccurring, so the worst thing that can happen is that we delayed reporting back success by one frame. > > So something must have turned the CRTC fully off. Not sure that's a generally valid conclusion (do the gitlab issues talk about the display going black, or about it staying on but freezing?). AFAIR at least in some cases amdgpu uses a dedicated "page flip" interrupt instead of the vblank interrupt, in which case missing a single interrupt could cause a timeout. P.S. Completing the atomic commit and sending the completion event must work even if user space turns off any CRTCs as part of the commit[0]. So your hypothesis would be a kernel bug, accidentally turning off the CRTC and/or not handling a CRTC getting turned off correctly. [0] If any CRTC for which the commit has state is off both before and after the commit though, the commit fails with an error before it could result in a timeout. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 11:26 ` Michel Dänzer @ 2026-01-28 12:14 ` Timur Kristóf 2026-01-28 12:48 ` Christian König 0 siblings, 1 reply; 47+ messages in thread From: Timur Kristóf @ 2026-01-28 12:14 UTC (permalink / raw) To: Christian König, Alex Deucher, Hamza Mahfooz, Michel Dänzer Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On Wednesday, January 28, 2026 12:26:20 PM Central European Standard Time Michel Dänzer wrote: > On 1/28/26 11:39, Christian König wrote: > > On 1/27/26 23:57, Alex Deucher wrote: > >> On Tue, Jan 27, 2026 at 5:53 PM Hamza Mahfooz > >> > >> <someguy@effective-light.com> wrote: > >>> On Mon, Jan 26, 2026 at 09:20:55AM -0500, Alex Deucher wrote: > >>>> I suspect just calling drm_crtc_send_vblank_event() here on the > >>>> relevant crtcs would be enough. > >>> > >>> Seems like an interesting idea, though I would imagine we would still > >>> want to attempt a reset (of some kind) assuming that the subsequent page > >>> flip also experiences a timeout. > >> > >> Is it actually a timeout or just missed interrupts? I'm wondering if > >> some power feature races with the modeset and causes the interrupt to > >> get missed from time to time. > > > > That is my strong suspicion as well. > > > > Even if we missed a vblank interrupt that thing is reoccurring, so the > > worst thing that can happen is that we delayed reporting back success by > > one frame. > > > > So something must have turned the CRTC fully off. > > Not sure that's a generally valid conclusion (do the gitlab issues talk > about the display going black, or about it staying on but freezing?). In all the bug reports I've seen about page flip timeouts, and in all the timeouts I've seen on my machine, the screen remains on, but frozen. It doesn't go black and doesn't turn off. Christian, why would the CRTC be turned off? > AFAIR > at least in some cases amdgpu uses a dedicated "page flip" interrupt > instead of the vblank interrupt, That matches what I saw when I was digging in the code. > in which case missing a single interrupt > could cause a timeout. > > > P.S. Completing the atomic commit and sending the completion event must work > even if user space turns off any CRTCs as part of the commit[0]. So your > hypothesis would be a kernel bug, accidentally turning off the CRTC and/or > not handling a CRTC getting turned off correctly. > > [0] If any CRTC for which the commit has state is off both before and after > the commit though, the commit fails with an error before it could result in > a timeout. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 12:14 ` Timur Kristóf @ 2026-01-28 12:48 ` Christian König 2026-01-28 14:25 ` Michel Dänzer 0 siblings, 1 reply; 47+ messages in thread From: Christian König @ 2026-01-28 12:48 UTC (permalink / raw) To: Timur Kristóf, Alex Deucher, Hamza Mahfooz, Michel Dänzer Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel On 1/28/26 13:14, Timur Kristóf wrote: > On Wednesday, January 28, 2026 12:26:20 PM Central European Standard Time > Michel Dänzer wrote: >> On 1/28/26 11:39, Christian König wrote: >>> On 1/27/26 23:57, Alex Deucher wrote: >>>> On Tue, Jan 27, 2026 at 5:53 PM Hamza Mahfooz >>>> >>>> <someguy@effective-light.com> wrote: >>>>> On Mon, Jan 26, 2026 at 09:20:55AM -0500, Alex Deucher wrote: >>>>>> I suspect just calling drm_crtc_send_vblank_event() here on the >>>>>> relevant crtcs would be enough. >>>>> >>>>> Seems like an interesting idea, though I would imagine we would still >>>>> want to attempt a reset (of some kind) assuming that the subsequent page >>>>> flip also experiences a timeout. >>>> >>>> Is it actually a timeout or just missed interrupts? I'm wondering if >>>> some power feature races with the modeset and causes the interrupt to >>>> get missed from time to time. >>> >>> That is my strong suspicion as well. >>> >>> Even if we missed a vblank interrupt that thing is reoccurring, so the >>> worst thing that can happen is that we delayed reporting back success by >>> one frame. >>> >>> So something must have turned the CRTC fully off. >> >> Not sure that's a generally valid conclusion (do the gitlab issues talk >> about the display going black, or about it staying on but freezing?). > > In all the bug reports I've seen about page flip timeouts, and in all the > timeouts I've seen on my machine, the screen remains on, but frozen. > It doesn't go black and doesn't turn off. > > Christian, why would the CRTC be turned off? Exactly that's the question we need to answer. But from what you describe the CRTC keeps on, just doesn't send any more vblank events. >> AFAIR >> at least in some cases amdgpu uses a dedicated "page flip" interrupt >> instead of the vblank interrupt, Oh really good point! I haven't though about the dedicated page flip interrupt. But IIRC we already had problems with that one with radeon, so we stopped using it a long long time ago. > That matches what I saw when I was digging in the code. > >> in which case missing a single interrupt >> could cause a timeout. >> >> >> P.S. Completing the atomic commit and sending the completion event must work >> even if user space turns off any CRTCs as part of the commit[0]. Wait a second. What happens if we never complete that? So when the completion event is never signaled? Does the kernel then reject any new atomic commit as well? If yes then I think that is not defensive at all. In other words when you are right and the page flip interrupt is used and missed then we are stuck forever. >> So your >> hypothesis would be a kernel bug, accidentally turning off the CRTC and/or >> not handling a CRTC getting turned off correctly. I'm not arguing that it isn't a kernel bug, but the question is what is triggering it? In other words could it be that userspace does something illegal which the kernel fails to reject? Regards, Christian. >> [0] If any CRTC for which the commit has state is off both before and after >> the commit though, the commit fails with an error before it could result in >> a timeout. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 12:48 ` Christian König @ 2026-01-28 14:25 ` Michel Dänzer 2026-01-29 10:06 ` Michel Dänzer 0 siblings, 1 reply; 47+ messages in thread From: Michel Dänzer @ 2026-01-28 14:25 UTC (permalink / raw) To: Christian König, Timur Kristóf, Alex Deucher, Hamza Mahfooz Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 1/28/26 13:48, Christian König wrote: > On 1/28/26 13:14, Timur Kristóf wrote: >> On Wednesday, January 28, 2026 12:26:20 PM Central European Standard Time >> Michel Dänzer wrote: >>> On 1/28/26 11:39, Christian König wrote: >>>> >>>> Even if we missed a vblank interrupt that thing is reoccurring, so the >>>> worst thing that can happen is that we delayed reporting back success by >>>> one frame. >>>> >>>> So something must have turned the CRTC fully off. >>> >>> Not sure that's a generally valid conclusion (do the gitlab issues talk >>> about the display going black, or about it staying on but freezing?). >> >> In all the bug reports I've seen about page flip timeouts, and in all the >> timeouts I've seen on my machine, the screen remains on, but frozen. >> It doesn't go black and doesn't turn off. >> >> Christian, why would the CRTC be turned off? > > Exactly that's the question we need to answer. > > But from what you describe the CRTC keeps on, just doesn't send any more vblank events. The vblank interrupt source getting accidentally disabled might be one possible cause though. >>> P.S. Completing the atomic commit and sending the completion event must work >>> even if user space turns off any CRTCs as part of the commit[0]. > > Wait a second. What happens if we never complete that? So when the completion event is never signaled? > > Does the kernel then reject any new atomic commit as well? Fundamentally, current atomic KMS UAPI is that any attempt to do a commit before the previous one completes fails with EBUSY. (Another possible scenario is that the commit completes as far as the kernel is concerned, but the completion events for it are never sent to user space for some reason. In that case, user space would hang waiting for the completion events. That's not the scenario we're talking about here though, or there would be no timeout in the kernel) > If yes then I think that is not defensive at all. In other words when you are right and the page flip interrupt is used and missed then we are stuck forever. I guess it's basically up to the driver to prevent that from happening. Other drivers don't seem to have such issues. > In other words could it be that userspace does something illegal which the kernel fails to reject? That's possible in theory, we haven't ruled out all simpler explanations on the kernel side though. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 14:25 ` Michel Dänzer @ 2026-01-29 10:06 ` Michel Dänzer 2026-01-29 11:25 ` Timur Kristóf 2026-01-29 21:56 ` Xaver Hugl 0 siblings, 2 replies; 47+ messages in thread From: Michel Dänzer @ 2026-01-29 10:06 UTC (permalink / raw) To: Christian König, Timur Kristóf, Alex Deucher, Hamza Mahfooz Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 1/28/26 15:25, Michel Dänzer wrote: > On 1/28/26 13:48, Christian König wrote: >> On 1/28/26 13:14, Timur Kristóf wrote: >>> On Wednesday, January 28, 2026 12:26:20 PM Central European Standard Time >>> Michel Dänzer wrote: >>>> On 1/28/26 11:39, Christian König wrote: >>>>> >>>>> Even if we missed a vblank interrupt that thing is reoccurring, so the >>>>> worst thing that can happen is that we delayed reporting back success by >>>>> one frame. >>>>> >>>>> So something must have turned the CRTC fully off. >>>> >>>> Not sure that's a generally valid conclusion (do the gitlab issues talk >>>> about the display going black, or about it staying on but freezing?). >>> >>> In all the bug reports I've seen about page flip timeouts, and in all the >>> timeouts I've seen on my machine, the screen remains on, but frozen. >>> It doesn't go black and doesn't turn off. >>> >>> Christian, why would the CRTC be turned off? >> >> Exactly that's the question we need to answer. >> >> But from what you describe the CRTC keeps on, just doesn't send any more vblank events. > > The vblank interrupt source getting accidentally disabled might be one possible cause though. Another possibility is that test-only commits with the DRM_MODE_ATOMIC_TEST_ONLY flag (which can happen in parallel while the kernel is processing a "real" commit) accidentally have side effects on the current kernel state, resulting in the "real" commit failing to do something it should. There have been bugs like that in the amdgpu DM code before. Anyway, this is all speculation. Somebody just needs to dig in and get to the bottom of why the commits aren't getting completed. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 10:06 ` Michel Dänzer @ 2026-01-29 11:25 ` Timur Kristóf 2026-01-29 11:38 ` Christian König 2026-01-29 21:56 ` Xaver Hugl 1 sibling, 1 reply; 47+ messages in thread From: Timur Kristóf @ 2026-01-29 11:25 UTC (permalink / raw) To: Christian König, Alex Deucher, Hamza Mahfooz, Michel Dänzer Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On Thursday, January 29, 2026 11:06:11 AM Central European Standard Time Michel Dänzer wrote: > >>> > >>> Christian, why would the CRTC be turned off? > >> > >> Exactly that's the question we need to answer. > >> > >> But from what you describe the CRTC keeps on, just doesn't send any more > >> vblank events.> > > The vblank interrupt source getting accidentally disabled might be one > > possible cause though. > Another possibility is that test-only commits with the > DRM_MODE_ATOMIC_TEST_ONLY flag (which can happen in parallel while the > kernel is processing a "real" commit) accidentally have side effects on the > current kernel state, resulting in the "real" commit failing to do > something it should. There have been bugs like that in the amdgpu DM code > before. > > > Anyway, this is all speculation. Somebody just needs to dig in and get to > the bottom of why the commits aren't getting completed. Yes, I agree. However, just like we do with ring timeouts, we also need to be prepared for the situation where a page flip timeout happens and we should try to recover from it. And if it isn't recoverable, fall back to GPU reset. I strongly suspect that there are many different issues depending on the hardware generation and display configuration. There isn't going to be a silver bullet to fix all of them, and in case it cannot be fixed, I think a GPU reset is the right thing to do - it's drastic, but still better than letting the machine just freeze irrecoverably. One example of such a bug was fixed by 6cbe6e072c5d where DC was trying to use an interrupt that didn't exist on some hardware. This type of bug would be impossible for userspace to solve in any way, but a GPU reset would have helped to recover the machine into a usable state. Another example would be Strix Halo with adaptive sync and/or tearing updates enabled, which 100% reproduces a page flip timeout for me. I haven't had time to investigate that one just yet. Timur ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 11:25 ` Timur Kristóf @ 2026-01-29 11:38 ` Christian König 2026-01-29 12:06 ` Timur Kristóf 0 siblings, 1 reply; 47+ messages in thread From: Christian König @ 2026-01-29 11:38 UTC (permalink / raw) To: Timur Kristóf, Alex Deucher, Hamza Mahfooz, Michel Dänzer Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 1/29/26 12:25, Timur Kristóf wrote: > On Thursday, January 29, 2026 11:06:11 AM Central European Standard Time > Michel Dänzer wrote: >>>>> >>>>> Christian, why would the CRTC be turned off? >>>> >>>> Exactly that's the question we need to answer. >>>> >>>> But from what you describe the CRTC keeps on, just doesn't send any more >>>> vblank events.> >>> The vblank interrupt source getting accidentally disabled might be one >>> possible cause though. >> Another possibility is that test-only commits with the >> DRM_MODE_ATOMIC_TEST_ONLY flag (which can happen in parallel while the >> kernel is processing a "real" commit) accidentally have side effects on the >> current kernel state, resulting in the "real" commit failing to do >> something it should. There have been bugs like that in the amdgpu DM code >> before. >> >> >> Anyway, this is all speculation. Somebody just needs to dig in and get to >> the bottom of why the commits aren't getting completed. > > Yes, I agree. > > However, just like we do with ring timeouts, we also need to be prepared for > the situation where a page flip timeout happens and we should try to recover > from it. And if it isn't recoverable, fall back to GPU reset. No, that is clearly a bad idea. CRTCs are fixed function devices, that GPU reset helps here is just pure coincident. What we can certainly do is to improve the error handling, e.g. that the system doesn't sit there forever after a page flip timeout. > I strongly suspect that there are many different issues depending on the > hardware generation and display configuration. There isn't going to be a silver > bullet to fix all of them, and in case it cannot be fixed, I think a GPU reset > is the right thing to do - it's drastic, but still better than letting the > machine just freeze irrecoverably. > > One example of such a bug was fixed by 6cbe6e072c5d where DC was trying to use > an interrupt that didn't exist on some hardware. This type of bug would be > impossible for userspace to solve in any way, but a GPU reset would have > helped to recover the machine into a usable state. > > Another example would be Strix Halo with adaptive sync and/or tearing updates > enabled, which 100% reproduces a page flip timeout for me. I haven't had time > to investigate that one just yet. Let's maybe try a complete different approach. We force a page flip timeout, and see if the system can handle that or not. E.g. every 300 page flip we just fail to signal and see if things still work after the timeout. Regards, Christian. > Timur > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 11:38 ` Christian König @ 2026-01-29 12:06 ` Timur Kristóf 2026-01-29 12:59 ` Christian König 0 siblings, 1 reply; 47+ messages in thread From: Timur Kristóf @ 2026-01-29 12:06 UTC (permalink / raw) To: Alex Deucher, Hamza Mahfooz, Michel Dänzer, Christian König Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On Thursday, January 29, 2026 12:38:30 PM Central European Standard Time Christian König wrote: > > > > However, just like we do with ring timeouts, we also need to be prepared > > for the situation where a page flip timeout happens and we should try to > > recover from it. And if it isn't recoverable, fall back to GPU reset. > > No, that is clearly a bad idea. I don't see why it's "clearly" a bad idea. It's not clear to me at all, please clarify it for me. > CRTCs are fixed function devices that GPU > reset helps here is just pure coincident. Currently, the driver doesn't handle page flip timeouts at all, which means that if it happens, there is 0% chance of recovering from it. If the GPU reset improves that chance to non-zero, it's already an improvement, and already more than what AMD did to address this problem for the past few years. I just find it incredibly disrespectful towards the community that AMD proposes a solution that they neglect to implement, then when somebody from the community steps up to implement it, it's rejected. > What we can certainly do is to improve the error handling, e.g. that the > system doesn't sit there forever after a page flip timeout. Sure. > > Let's maybe try a complete different approach. We force a page flip timeout, > and see if the system can handle that or not. > > E.g. every 300 page flip we just fail to signal and see if things still work > after the timeout. How do you propose to do that? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 12:06 ` Timur Kristóf @ 2026-01-29 12:59 ` Christian König 2026-01-29 14:04 ` Hamza Mahfooz 2026-02-03 21:48 ` Timur Kristóf 0 siblings, 2 replies; 47+ messages in thread From: Christian König @ 2026-01-29 12:59 UTC (permalink / raw) To: Timur Kristóf, Alex Deucher, Hamza Mahfooz, Michel Dänzer Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 1/29/26 13:06, Timur Kristóf wrote: > On Thursday, January 29, 2026 12:38:30 PM Central European Standard Time > Christian König wrote: >>> >>> However, just like we do with ring timeouts, we also need to be prepared >>> for the situation where a page flip timeout happens and we should try to >>> recover from it. And if it isn't recoverable, fall back to GPU reset. >> >> No, that is clearly a bad idea. > > I don't see why it's "clearly" a bad idea. It's not clear to me at all, please > clarify it for me. The GPU resets are necessary because we allow Turing complete programs to be submitted by userspace and that in turn is then messing up the HW state and we need to reset it to get into a known working state again (e.g. classic reset signal in electronics). But in this case here when you see a frozen picture on the screen then that means the CRTC is still working, e.g. power is there, clocks are running, hblank, vblank is happening ... this doesn't looks like a HW failure at all. After the input from Michel I'm pretty sure that what we have here is just messed up SW state, e.g. the DC/DM code has no fallback handling and not only misses the HW event but also blocks all further page flip requests from userspace which would resolve the issue. >> CRTCs are fixed function devices that GPU >> reset helps here is just pure coincident. > > Currently, the driver doesn't handle page flip timeouts at all, which means > that if it happens, there is 0% chance of recovering from it. Yeah and I completely agree that this is the absolutely worse thing we can do. > If the GPU reset improves that chance to non-zero, it's already an > improvement, and already more than what AMD did to address this problem for > the past few years. I just find it incredibly disrespectful towards the > community that AMD proposes a solution that they neglect to implement, then > when somebody from the community steps up to implement it, it's rejected. Well, I've heard about this problem just a few days ago. >> What we can certainly do is to improve the error handling, e.g. that the >> system doesn't sit there forever after a page flip timeout. > > Sure. > >> >> Let's maybe try a complete different approach. We force a page flip timeout, >> and see if the system can handle that or not. >> >> E.g. every 300 page flip we just fail to signal and see if things still work >> after the timeout. > > How do you propose to do that? I need to dig a bit into the DAL/DC code and see how the signaling path actually goes. Going to give that a try tomorrow. Regards, Christian. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 12:59 ` Christian König @ 2026-01-29 14:04 ` Hamza Mahfooz 2026-01-29 14:24 ` Christian König 2026-02-03 21:48 ` Timur Kristóf 1 sibling, 1 reply; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-29 14:04 UTC (permalink / raw) To: Christian König Cc: Timur Kristóf, Alex Deucher, Michel Dänzer, Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On Thu, Jan 29, 2026 at 01:59:00PM +0100, Christian König wrote: > > How do you propose to do that? > > I need to dig a bit into the DAL/DC code and see how the signaling path actually goes. > > Going to give that a try tomorrow. > For recent ASICs, something along the lines of the following should do the trick: diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h index dc8d2f52c7d6..fac668c2fcfb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h @@ -510,6 +510,7 @@ struct amdgpu_crtc { bool wb_pending; bool wb_enabled; struct drm_writeback_connector *wb_conn; + int pflip_cnt; }; struct amdgpu_encoder_atom_dig { diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 740711ac1037..1c3b7fbab1c6 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -427,6 +427,18 @@ static inline bool update_planes_and_stream_adapter(struct dc *dc, stream_update); } +static inline bool update_pflip_cnt(struct amdgpu_crtc *acrtc) +{ + int cnt = acrtc->pflip_cnt++; + + if (cnt == 300) { + acrtc->pflip_cnt = 0; + return true; + } + + return false; +} + /** * dm_pflip_high_irq() - Handle pageflip interrupt * @interrupt_params: ignored @@ -454,6 +466,9 @@ static void dm_pflip_high_irq(void *interrupt_params) return; } + if (update_pflip_cnt(amdgpu_crtc)) + return; + spin_lock_irqsave(&adev_to_drm(adev)->event_lock, flags); if (amdgpu_crtc->pflip_status != AMDGPU_FLIP_SUBMITTED) { @@ -589,6 +604,9 @@ static void dm_vupdate_high_irq(void *interrupt_params) acrtc = get_crtc_by_otg_inst(adev, irq_params->irq_src - IRQ_TYPE_VUPDATE); if (acrtc) { + if (update_pflip_cnt(acrtc)) + return; + vrr_active = amdgpu_dm_crtc_vrr_active_irq(acrtc); drm_dev = acrtc->base.dev; vblank = drm_crtc_vblank_crtc(&acrtc->base); @@ -659,6 +677,9 @@ static void dm_crtc_high_irq(void *interrupt_params) if (!acrtc) return; + if (update_pflip_cnt(acrtc)) + return; + if (acrtc->wb_conn) { spin_lock_irqsave(&acrtc->wb_conn->job_lock, flags); ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 14:04 ` Hamza Mahfooz @ 2026-01-29 14:24 ` Christian König 2026-01-29 14:33 ` Hamza Mahfooz 0 siblings, 1 reply; 47+ messages in thread From: Christian König @ 2026-01-29 14:24 UTC (permalink / raw) To: Hamza Mahfooz Cc: Timur Kristóf, Alex Deucher, Michel Dänzer, Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 1/29/26 15:04, Hamza Mahfooz wrote: > On Thu, Jan 29, 2026 at 01:59:00PM +0100, Christian König wrote: >>> How do you propose to do that? >> >> I need to dig a bit into the DAL/DC code and see how the signaling path actually goes. >> >> Going to give that a try tomorrow. >> > > For recent ASICs, something along the lines of the following should do > the trick: Thanks a lot for that. What happens if you apply this? Can the higher level handling recover from that? E.g. continue after 10 second timeout. Thanks, Christian. > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h > index dc8d2f52c7d6..fac668c2fcfb 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h > @@ -510,6 +510,7 @@ struct amdgpu_crtc { > bool wb_pending; > bool wb_enabled; > struct drm_writeback_connector *wb_conn; > + int pflip_cnt; > }; > > struct amdgpu_encoder_atom_dig { > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > index 740711ac1037..1c3b7fbab1c6 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > @@ -427,6 +427,18 @@ static inline bool update_planes_and_stream_adapter(struct dc *dc, > stream_update); > } > > +static inline bool update_pflip_cnt(struct amdgpu_crtc *acrtc) > +{ > + int cnt = acrtc->pflip_cnt++; > + > + if (cnt == 300) { > + acrtc->pflip_cnt = 0; > + return true; > + } > + > + return false; > +} > + > /** > * dm_pflip_high_irq() - Handle pageflip interrupt > * @interrupt_params: ignored > @@ -454,6 +466,9 @@ static void dm_pflip_high_irq(void *interrupt_params) > return; > } > > + if (update_pflip_cnt(amdgpu_crtc)) > + return; > + > spin_lock_irqsave(&adev_to_drm(adev)->event_lock, flags); > > if (amdgpu_crtc->pflip_status != AMDGPU_FLIP_SUBMITTED) { > @@ -589,6 +604,9 @@ static void dm_vupdate_high_irq(void *interrupt_params) > acrtc = get_crtc_by_otg_inst(adev, irq_params->irq_src - IRQ_TYPE_VUPDATE); > > if (acrtc) { > + if (update_pflip_cnt(acrtc)) > + return; > + > vrr_active = amdgpu_dm_crtc_vrr_active_irq(acrtc); > drm_dev = acrtc->base.dev; > vblank = drm_crtc_vblank_crtc(&acrtc->base); > @@ -659,6 +677,9 @@ static void dm_crtc_high_irq(void *interrupt_params) > if (!acrtc) > return; > > + if (update_pflip_cnt(acrtc)) > + return; > + > if (acrtc->wb_conn) { > spin_lock_irqsave(&acrtc->wb_conn->job_lock, flags); ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 14:24 ` Christian König @ 2026-01-29 14:33 ` Hamza Mahfooz 2026-01-29 14:41 ` Christian König 0 siblings, 1 reply; 47+ messages in thread From: Hamza Mahfooz @ 2026-01-29 14:33 UTC (permalink / raw) To: Christian König Cc: Timur Kristóf, Alex Deucher, Michel Dänzer, Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel > Thanks a lot for that. What happens if you apply this? > > Can the higher level handling recover from that? E.g. continue after 10 second timeout. > Without my series applied it just stays stuck (i.e. the CRTC is still on the last rendered frame). ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 14:33 ` Hamza Mahfooz @ 2026-01-29 14:41 ` Christian König 0 siblings, 0 replies; 47+ messages in thread From: Christian König @ 2026-01-29 14:41 UTC (permalink / raw) To: Hamza Mahfooz Cc: Timur Kristóf, Alex Deucher, Michel Dänzer, Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 1/29/26 15:33, Hamza Mahfooz wrote: >> Thanks a lot for that. What happens if you apply this? >> >> Can the higher level handling recover from that? E.g. continue after 10 second timeout. >> > > Without my series applied it just stays stuck (i.e. the CRTC is still on > the last rendered frame). Perfect, exactly what I expected. Can you go into function amdgpu_dm_atomic_commit_tail() where drm_atomic_helper_wait_for_flip_done() is called and add some manual recovery to signal the atomic commit? Very roughly from the front of my mind: 1. Grab the event lock. 2. Double check if it didn't signaled between returning from drm_atomic_helper_wait_for_flip_done() and taking the lock. 3. If it is still not signaled, signal it manually just like the interrupt handler would do. 4. Drop the lock. Thanks, Christian. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 12:59 ` Christian König 2026-01-29 14:04 ` Hamza Mahfooz @ 2026-02-03 21:48 ` Timur Kristóf 1 sibling, 0 replies; 47+ messages in thread From: Timur Kristóf @ 2026-02-03 21:48 UTC (permalink / raw) To: Alex Deucher, Hamza Mahfooz, Michel Dänzer, Christian König Cc: Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel On 2026. január 29., csütörtök 13:59:00 közép-európai téli idő Christian König wrote: > On 1/29/26 13:06, Timur Kristóf wrote: > > On Thursday, January 29, 2026 12:38:30 PM Central European Standard Time > > > > Christian König wrote: > >>> However, just like we do with ring timeouts, we also need to be prepared > >>> for the situation where a page flip timeout happens and we should try to > >>> recover from it. And if it isn't recoverable, fall back to GPU reset. > >> > >> No, that is clearly a bad idea. > > > > I don't see why it's "clearly" a bad idea. It's not clear to me at all, > > please clarify it for me. > > The GPU resets are necessary because we allow Turing complete programs to be > submitted by userspace and that in turn is then messing up the HW state and > we need to reset it to get into a known working state again (e.g. classic > reset signal in electronics). > > But in this case here when you see a frozen picture on the screen then that > means the CRTC is still working, e.g. power is there, clocks are running, > hblank, vblank is happening ... this doesn't looks like a HW failure at > all. I don't see why that is an argument against performing a reset. Regardless of whether the display is Turing complete or not, it can freeze up, and resetting it will allow the system to recover. > > After the input from Michel I'm pretty sure that what we have here is just > messed up SW state, e.g. the DC/DM code has no fallback handling and not > only misses the HW event but also blocks all further page flip requests > from userspace which would resolve the issue. The display HW can hang in other ways, as already explained in this thread. Also the amdgpu_dm code already acknowledges that the DMU and SMU can hang. Those would be fixed by a reset. > >> CRTCs are fixed function devices that GPU > >> reset helps here is just pure coincident. > > > > Currently, the driver doesn't handle page flip timeouts at all, which > > means > > that if it happens, there is 0% chance of recovering from it. > > Yeah and I completely agree that this is the absolutely worse thing we can > do. > > If the GPU reset improves that chance to non-zero, it's already an > > improvement, and already more than what AMD did to address this problem > > for > > the past few years. I just find it incredibly disrespectful towards the > > community that AMD proposes a solution that they neglect to implement, > > then > > when somebody from the community steps up to implement it, it's rejected. > > Well, I've heard about this problem just a few days ago. Harry presented the problem and the proposed solution at XDC, and the display team already merged some patches which are intended to hook up to the GPU recovery, see: dm_helpers_dmu_timeout() dm_helpers_smu_timeout() Do you disagree with how these are handled as well? > > >> What we can certainly do is to improve the error handling, e.g. that the > >> system doesn't sit there forever after a page flip timeout. > > > > Sure. > > > >> Let's maybe try a complete different approach. We force a page flip > >> timeout, and see if the system can handle that or not. > >> > >> E.g. every 300 page flip we just fail to signal and see if things still > >> work after the timeout. > > > > How do you propose to do that? > > I need to dig a bit into the DAL/DC code and see how the signaling path > actually goes. > > Going to give that a try tomorrow. > Have you had any luck? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-29 10:06 ` Michel Dänzer 2026-01-29 11:25 ` Timur Kristóf @ 2026-01-29 21:56 ` Xaver Hugl 1 sibling, 0 replies; 47+ messages in thread From: Xaver Hugl @ 2026-01-29 21:56 UTC (permalink / raw) To: Michel Dänzer Cc: Christian König, Timur Kristóf, Alex Deucher, Hamza Mahfooz, Mario Limonciello, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel > Another possibility is that test-only commits with the DRM_MODE_ATOMIC_TEST_ONLY flag (which can happen in parallel while the kernel is processing a "real" commit) accidentally have side effects on the current kernel state, resulting in the "real" commit failing to do something it should. There have been bugs like that in the amdgpu DM code before. Some users reported that GPU resets on dGPUs happens way less often with legacy modesetting than atomic, which led me to the same conclusion of possibly missing locks in the driver. To test that theory, I recently gave some affected users a patch to lock KWin's commit thread(s) while doing atomic tests on the main thread, so it never does two atomic commits simultaneously. Testing on APUs showed that it did not help there, but as I haven't heard back from any dGPU users yet, it's still a possible factor. - Xaver ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-23 14:44 ` Timur Kristóf 2026-01-23 22:30 ` Mario Limonciello @ 2026-01-26 10:14 ` Christian König 2026-01-26 10:27 ` Michel Dänzer 1 sibling, 1 reply; 47+ messages in thread From: Christian König @ 2026-01-26 10:14 UTC (permalink / raw) To: Timur Kristóf, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/23/26 15:44, Timur Kristóf wrote: > On Friday, January 23, 2026 2:52:44 PM Central European Standard Time > Christian König wrote: >> On 1/23/26 01:05, Hamza Mahfooz wrote: >>> There should be a mechanism for drivers to respond to flip_done >>> time outs. >> > > I am adding Harry and Mario to this email as they are more familiar with this. > >> I can only see two reasons why you could run into a timeout: >> >> 1. A dma_fence never signals. >> How that should be handled is already well documented and doesn't > require >> any of this. > > Page flip timeouts have nothing to do with fence timeouts. A page flip can be associated with a dma_fence as well. This is used for example on Android to immediately re-use the BO not scanned out any more. That dma_fence has a well defined behavior, especially error handling and that it needs to signal in a finite amount of time even when driver doesn't confirm that the flip has completed. The purpose of of the page flip timeout is now to guarantee that this defined behavior is actually fulfilled even when the driver does something bad. > A page flip timeout can occur even when all fences of all job submissions > complete correctly and on time. But only when you have a major coding error in the driver. In other words a page flip can only timeout if there is no vblank or hblank signaling any more. And that in turn means that somebody turned the CRTC off or hot plugged or we missed an (re-occuring) interrupt.... >> 2. A coding error in the vblank or page flip handler leading to waiting>> forever. In that case calling back into the driver doesn't help either. > > At the moment, a page flip timeout will leave the whole system in a hung state > and the driver does not even attempt to recover it in any way, it just stops > doing anything Yeah and that is expected behavior. See there is no way the driver can recover from this on its own. When this condition happens you can't take locks or allocate memory, otherwise you have the full potential to run into a very low level deadlock. I'm not an expert for the atomic mode setting stuff but as far as I can see the only reasonable thing you can do is to set an error code on the page flip and return to userspace to signal that a flip has completed but with an error. > which is unacceptable and I'm pretty surprised that it was > left like that for so long. > Note that we have approximately a hundred bug reports open on the drm/amd bug > tracker about "random" page flip timeouts. It affects a lot of users. Wow that is a boomer, I completely agree that sounds like a serious problem. >> So as far as I can see the whole approach doesn't make any sense at all. > > Actually this approach was proposed as a solution at XDC 2025 in Harry's > presentation, "DRM calls driver callback to attempt recovery", see page 9 in > this slide deck: > > https://indico.freedesktop.org/event/10/contributions/431/attachments/ > 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf > > If you disagree with Harry, please make a counter-proposal. Well I must have missed that detail otherwise I would have objected. But looking at the slide Harry actually pointed out what immediately came to my mind as well, e.g. that the Compositor needs to issue a full modeset to re-program the CRTC. We need to get the atomic mode setting experts on this and some suggesting on how to fix the error handling. Regards, Christian. > > Thanks, > Timur > > > >> >>> Since, as it stands it is possible for the display >>> to stall indefinitely, necessitating a hard reset. So, introduce >>> a new crtc callback that is called by >>> drm_atomic_helper_wait_for_flip_done() to give drivers a shot >>> at recovering from page flip timeouts. >>> >>> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> >>> --- >>> >>> drivers/gpu/drm/drm_atomic_helper.c | 6 +++++- >>> include/drm/drm_crtc.h | 9 +++++++++ >>> 2 files changed, 14 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c >>> b/drivers/gpu/drm/drm_atomic_helper.c index 5840e9cc6f66..3a144c324b19 >>> 100644 >>> --- a/drivers/gpu/drm/drm_atomic_helper.c >>> +++ b/drivers/gpu/drm/drm_atomic_helper.c >>> @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct >>> drm_device *dev,> >>> continue; >>> >>> ret = wait_for_completion_timeout(&commit->flip_done, 10 > * HZ); >>> >>> - if (ret == 0) >>> + if (!ret) { >>> >>> drm_err(dev, "[CRTC:%d:%s] flip_done timed > out\n", >>> >>> crtc->base.id, crtc->name); >>> >>> + >>> + if (crtc->funcs->page_flip_timeout) >>> + crtc->funcs- >> page_flip_timeout(crtc); >>> + } >>> >>> } >>> >>> if (state->fake_commit) >>> >>> diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h >>> index 66278ffeebd6..45dc5a76e915 100644 >>> --- a/include/drm/drm_crtc.h >>> +++ b/include/drm/drm_crtc.h >>> @@ -609,6 +609,15 @@ struct drm_crtc_funcs { >>> >>> uint32_t flags, uint32_t target, >>> struct drm_modeset_acquire_ctx > *ctx); >>> >>> + /** >>> + * @page_flip_timeout: >>> + * >>> + * This optional hook is called if &drm_crtc_commit.flip_done times > out, >>> + * and can be used by drivers to attempt to recover from a page > flip >>> + * timeout. >>> + */ >>> + void (*page_flip_timeout)(struct drm_crtc *crtc); >>> + >>> >>> /** >>> >>> * @set_property: >>> * > > > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-26 10:14 ` Christian König @ 2026-01-26 10:27 ` Michel Dänzer 2026-01-26 13:00 ` Christian König 0 siblings, 1 reply; 47+ messages in thread From: Michel Dänzer @ 2026-01-26 10:27 UTC (permalink / raw) To: Christian König, Timur Kristóf, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/26/26 11:14, Christian König wrote: > On 1/23/26 15:44, Timur Kristóf wrote: >> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time >> Christian König wrote: >> >>> So as far as I can see the whole approach doesn't make any sense at all. >> >> Actually this approach was proposed as a solution at XDC 2025 in Harry's >> presentation, "DRM calls driver callback to attempt recovery", see page 9 in >> this slide deck: >> >> https://indico.freedesktop.org/event/10/contributions/431/attachments/ >> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf >> >> If you disagree with Harry, please make a counter-proposal. > > Well I must have missed that detail otherwise I would have objected. > > But looking at the slide Harry actually pointed out what immediately came to my mind as well, e.g. that the Compositor needs to issue a full modeset to re-program the CRTC. In principle, the kernel driver has all the information it needs to reprogram the HW by itself. Not sure why the compositor would need to be actively involved. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-26 10:27 ` Michel Dänzer @ 2026-01-26 13:00 ` Christian König 2026-01-26 14:31 ` Michel Dänzer ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Christian König @ 2026-01-26 13:00 UTC (permalink / raw) To: Michel Dänzer, Timur Kristóf, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/26/26 11:27, Michel Dänzer wrote: > On 1/26/26 11:14, Christian König wrote: >> On 1/23/26 15:44, Timur Kristóf wrote: >>> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time >>> Christian König wrote: >>> >>>> So as far as I can see the whole approach doesn't make any sense at all. >>> >>> Actually this approach was proposed as a solution at XDC 2025 in Harry's >>> presentation, "DRM calls driver callback to attempt recovery", see page 9 in >>> this slide deck: >>> >>> https://indico.freedesktop.org/event/10/contributions/431/attachments/ >>> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf >>> >>> If you disagree with Harry, please make a counter-proposal. >> >> Well I must have missed that detail otherwise I would have objected. >> >> But looking at the slide Harry actually pointed out what immediately came to my mind as well, e.g. that the Compositor needs to issue a full modeset to re-program the CRTC. > > In principle, the kernel driver has all the information it needs to reprogram the HW by itself. Not sure why the compositor would need to be actively involved. Well first of all I'm not sure if we can reprogram the HW even if all information are available. Please keep in mind that we are in a dma_fence timeout handler here with the usual rat tail of consequences. So no allocation of memory or taking locks under which memory is allocated or are part of preparing the page flip etc... I'm not so deep in the atomic code, so Alex, Sima and probably you as well can answer that much better than I do, but of hand it sounds questionable. On the other hand we could of course postpone reprogramming the CRTC into an async work item, but that might created more problems then it solves. Then second even if the kernel can do it I'm not sure if it should do it. I mean userspace asked for a quick page flip and not some expensive CRTC/PLL reprogramming. Stuff like that usually takes some time and by then the frame which should be displayed by the page flip might already be stale and it would be better to tell userspace that we couldn't display it on time and wait for a new frame to be generated. And third, there must be a root cause of the page flip not completing. My educated guess is that we have some atomic property change or even turning the CRTC off in parallel with the page flip. I mean HW rarely turns off its reoccurring vblank interrupt on its own. Returning an error to userspace might actually help identify the root cause. Regards, Christian. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-26 13:00 ` Christian König @ 2026-01-26 14:31 ` Michel Dänzer 2026-01-28 9:19 ` Timur Kristóf 2026-01-29 21:39 ` Xaver Hugl 2 siblings, 0 replies; 47+ messages in thread From: Michel Dänzer @ 2026-01-26 14:31 UTC (permalink / raw) To: Christian König, Timur Kristóf, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/26/26 14:00, Christian König wrote: > On 1/26/26 11:27, Michel Dänzer wrote: >> On 1/26/26 11:14, Christian König wrote: >>> On 1/23/26 15:44, Timur Kristóf wrote: >>>> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time >>>> Christian König wrote: >>>> >>>>> So as far as I can see the whole approach doesn't make any sense at all. >>>> >>>> Actually this approach was proposed as a solution at XDC 2025 in Harry's >>>> presentation, "DRM calls driver callback to attempt recovery", see page 9 in >>>> this slide deck: >>>> >>>> https://indico.freedesktop.org/event/10/contributions/431/attachments/ >>>> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf >>>> >>>> If you disagree with Harry, please make a counter-proposal. >>> >>> Well I must have missed that detail otherwise I would have objected. >>> >>> But looking at the slide Harry actually pointed out what immediately came to my mind as well, e.g. that the Compositor needs to issue a full modeset to re-program the CRTC. >> >> In principle, the kernel driver has all the information it needs to reprogram the HW by itself. Not sure why the compositor would need to be actively involved. > > Well first of all I'm not sure if we can reprogram the HW even if all information are available. > > Please keep in mind that we are in a dma_fence timeout handler here with the usual rat tail of consequences. So no allocation of memory or taking locks under which memory is allocated or are part of preparing the page flip etc... I'm not so deep in the atomic code, so Alex, Sima and probably you as well can answer that much better than I do, but of hand it sounds questionable. > > On the other hand we could of course postpone reprogramming the CRTC into an async work item, but that might created more problems then it solves. Seems doable offhand from a KMS UAPI PoV. The reprogramming just needs to be done before sending the atomic commit completion event(s) to user space. Not sure about the DMA fence angle though. (I consider OUT_FENCE_PTR problematic for other reasons, in particular, using it to get a release fence for clients is kind of laying a trap for them. And in the compositor I see no benefit vs completion events) > Then second even if the kernel can do it I'm not sure if it should do it. > > I mean userspace asked for a quick page flip and not some expensive CRTC/PLL reprogramming. More complex atomic commits can also hang, FWIW. In fact, they might be more likely to hang. > Stuff like that usually takes some time and by then the frame which should be displayed by the page flip might already be stale and it would be better to tell userspace that we couldn't display it on time and wait for a new frame to be generated. With my compositor developer hat on, I'd rather not spend effort generating a new frame if there is doubt that the kernel will actually be able to display it. The worst case of that would be constantly generating new frames, none of which are displayed. I'd rather try again with the same frame, which boils down to an "empty" (no actual state changes) commit with the DRM_MODE_ATOMIC_ALLOW_MODESET flag. Relying on user space for this can also be problematic, e.g. if user space dies and drops back to fbcon. > And third, there must be a root cause of the page flip not completing. > > My educated guess is that we have some atomic property change or even turning the CRTC off in parallel with the page flip. I mean HW rarely turns off its reoccurring vblank interrupt on its own. > > Returning an error to userspace might actually help identify the root cause. It seems pretty clear that the hangs plaguing KWin are amdgpu DC bugs. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-26 13:00 ` Christian König 2026-01-26 14:31 ` Michel Dänzer @ 2026-01-28 9:19 ` Timur Kristóf 2026-01-28 11:25 ` Christian König 2026-01-29 21:39 ` Xaver Hugl 2 siblings, 1 reply; 47+ messages in thread From: Timur Kristóf @ 2026-01-28 9:19 UTC (permalink / raw) To: Michel Dänzer, Hamza Mahfooz, dri-devel, Christian König Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 2026. január 26., hétfő 14:00:59 közép-európai téli idő Christian König wrote: > On 1/26/26 11:27, Michel Dänzer wrote: > > On 1/26/26 11:14, Christian König wrote: > >> On 1/23/26 15:44, Timur Kristóf wrote: > >>> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time > >>> > >>> Christian König wrote: > >>>> So as far as I can see the whole approach doesn't make any sense at > >>>> all. > >>> > >>> Actually this approach was proposed as a solution at XDC 2025 in Harry's > >>> presentation, "DRM calls driver callback to attempt recovery", see page > >>> 9 in this slide deck: > >>> > >>> https://indico.freedesktop.org/event/10/contributions/431/attachments/ > >>> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf > >>> > >>> If you disagree with Harry, please make a counter-proposal. > >> > >> Well I must have missed that detail otherwise I would have objected. > >> > >> But looking at the slide Harry actually pointed out what immediately came > >> to my mind as well, e.g. that the Compositor needs to issue a full > >> modeset to re-program the CRTC.> > > In principle, the kernel driver has all the information it needs to > > reprogram the HW by itself. Not sure why the compositor would need to be > > actively involved. > Well first of all I'm not sure if we can reprogram the HW even if all > information are available. > > Please keep in mind that we are in a dma_fence timeout handler here with the > usual rat tail of consequences. So no allocation of memory or taking locks > under which memory is allocated or are part of preparing the page flip > etc... I'm not so deep in the atomic code, so Alex, Sima and probably you > as well can answer that much better than I do, but of hand it sounds > questionable. > > On the other hand we could of course postpone reprogramming the CRTC into an > async work item, but that might created more problems then it solves. > > Then second even if the kernel can do it I'm not sure if it should do it. > > I mean userspace asked for a quick page flip and not some expensive CRTC/PLL > reprogramming. Stuff like that usually takes some time and by then the > frame which should be displayed by the page flip might already be stale and > it would be better to tell userspace that we couldn't display it on time > and wait for a new frame to be generated. I agree with Michel here. It's a kernel bug, it should be solved by the kernel. I don't like the tendency of pushing userspace to handle kernel bugs, especially if this is just needed for one vendor's buggy driver. (No offence.) > > And third, there must be a root cause of the page flip not completing. > > My educated guess is that we have some atomic property change or even > turning the CRTC off in parallel with the page flip. I mean HW rarely turns > off its reoccurring vblank interrupt on its own. I think a page flip timeout can many root causes, so it's unlikely that a single solution would solve all of it. See: https://gitlab.freedesktop.org/drm/amd/-/issues? label_name[]=page%20flip%20timeout Best regards, Timur ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 9:19 ` Timur Kristóf @ 2026-01-28 11:25 ` Christian König 2026-01-28 12:22 ` Timur Kristóf 2026-01-28 14:25 ` Michel Dänzer 0 siblings, 2 replies; 47+ messages in thread From: Christian König @ 2026-01-28 11:25 UTC (permalink / raw) To: Timur Kristóf, Michel Dänzer, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/28/26 10:19, Timur Kristóf wrote: > On 2026. január 26., hétfő 14:00:59 közép-európai téli idő Christian König > wrote: >> On 1/26/26 11:27, Michel Dänzer wrote: >>> On 1/26/26 11:14, Christian König wrote: >>>> On 1/23/26 15:44, Timur Kristóf wrote: >>>>> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time >>>>> >>>>> Christian König wrote: >>>>>> So as far as I can see the whole approach doesn't make any sense at >>>>>> all. >>>>> >>>>> Actually this approach was proposed as a solution at XDC 2025 in Harry's >>>>> presentation, "DRM calls driver callback to attempt recovery", see page >>>>> 9 in this slide deck: >>>>> >>>>> https://indico.freedesktop.org/event/10/contributions/431/attachments/ >>>>> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf >>>>> >>>>> If you disagree with Harry, please make a counter-proposal. >>>> >>>> Well I must have missed that detail otherwise I would have objected. >>>> >>>> But looking at the slide Harry actually pointed out what immediately came >>>> to my mind as well, e.g. that the Compositor needs to issue a full >>>> modeset to re-program the CRTC.> >>> In principle, the kernel driver has all the information it needs to >>> reprogram the HW by itself. Not sure why the compositor would need to be >>> actively involved. >> Well first of all I'm not sure if we can reprogram the HW even if all >> information are available. >> >> Please keep in mind that we are in a dma_fence timeout handler here with the >> usual rat tail of consequences. So no allocation of memory or taking locks >> under which memory is allocated or are part of preparing the page flip >> etc... I'm not so deep in the atomic code, so Alex, Sima and probably you >> as well can answer that much better than I do, but of hand it sounds >> questionable. >> >> On the other hand we could of course postpone reprogramming the CRTC into an >> async work item, but that might created more problems then it solves. >> >> Then second even if the kernel can do it I'm not sure if it should do it. >> >> I mean userspace asked for a quick page flip and not some expensive CRTC/PLL >> reprogramming. Stuff like that usually takes some time and by then the >> frame which should be displayed by the page flip might already be stale and >> it would be better to tell userspace that we couldn't display it on time >> and wait for a new frame to be generated. > > I agree with Michel here. It's a kernel bug, it should be solved by the > kernel. I don't like the tendency of pushing userspace to handle kernel bugs, > especially if this is just needed for one vendor's buggy driver. (No offence.) Well I strongly disagree. The kernel is not here to serve userspace, but to give userspace access to the HW in a generalized manner. If this is caused by a HW failure then reporting back to userspace is the most reasonable thing to do. >> >> And third, there must be a root cause of the page flip not completing. >> >> My educated guess is that we have some atomic property change or even >> turning the CRTC off in parallel with the page flip. I mean HW rarely turns >> off its reoccurring vblank interrupt on its own. > > I think a page flip timeout can many root causes, so it's unlikely that a > single solution would solve all of it. That's actually what I'm not sure about. I mean what seems to happen consistently is that somehow the CRTC is turned off, who is doing that and why? As far as I can see drm_atomic_helper_wait_for_flip_done() is called by amdgpu_dm_atomic_commit_tail() to wait for the page flip to finish. So the message comes from our DM/DC code calling into the DRM code. That in turn makes the callback completely superfluous, we could just signal an error back from drm_atomic_helper_wait_for_flip_done(). Additional to that I don't see much locking here. Who is preventing multiple atomic changes from happening at the same time? Regards, Christian. > > See: > https://gitlab.freedesktop.org/drm/amd/-/issues? > label_name[]=page%20flip%20timeout > > Best regards, > Timur > > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 11:25 ` Christian König @ 2026-01-28 12:22 ` Timur Kristóf 2026-01-28 14:25 ` Michel Dänzer 1 sibling, 0 replies; 47+ messages in thread From: Timur Kristóf @ 2026-01-28 12:22 UTC (permalink / raw) To: Michel Dänzer, Hamza Mahfooz, dri-devel, Christian König Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On Wednesday, January 28, 2026 12:25:31 PM Central European Standard Time Christian König wrote: > On 1/28/26 10:19, Timur Kristóf wrote: > > On 2026. január 26., hétfő 14:00:59 közép-európai téli idő Christian König > > > > wrote: > >> On 1/26/26 11:27, Michel Dänzer wrote: > >>> On 1/26/26 11:14, Christian König wrote: > >>>> On 1/23/26 15:44, Timur Kristóf wrote: > >>>>> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time > >>>>> > >>>>> Christian König wrote: > >>>>>> So as far as I can see the whole approach doesn't make any sense at > >>>>>> all. > >>>>> > >>>>> Actually this approach was proposed as a solution at XDC 2025 in > >>>>> Harry's > >>>>> presentation, "DRM calls driver callback to attempt recovery", see > >>>>> page > >>>>> 9 in this slide deck: > >>>>> > >>>>> https://indico.freedesktop.org/event/10/contributions/431/attachments/ > >>>>> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf > >>>>> > >>>>> If you disagree with Harry, please make a counter-proposal. > >>>> > >>>> Well I must have missed that detail otherwise I would have objected. > >>>> > >>>> But looking at the slide Harry actually pointed out what immediately > >>>> came > >>>> to my mind as well, e.g. that the Compositor needs to issue a full > >>>> modeset to re-program the CRTC.> > >>> > >>> In principle, the kernel driver has all the information it needs to > >>> reprogram the HW by itself. Not sure why the compositor would need to be > >>> actively involved. > >> > >> Well first of all I'm not sure if we can reprogram the HW even if all > >> information are available. > >> > >> Please keep in mind that we are in a dma_fence timeout handler here with > >> the usual rat tail of consequences. So no allocation of memory or taking > >> locks under which memory is allocated or are part of preparing the page > >> flip etc... I'm not so deep in the atomic code, so Alex, Sima and > >> probably you as well can answer that much better than I do, but of hand > >> it sounds questionable. > >> > >> On the other hand we could of course postpone reprogramming the CRTC into > >> an async work item, but that might created more problems then it solves. > >> > >> Then second even if the kernel can do it I'm not sure if it should do it. > >> > >> I mean userspace asked for a quick page flip and not some expensive > >> CRTC/PLL reprogramming. Stuff like that usually takes some time and by > >> then the frame which should be displayed by the page flip might already > >> be stale and it would be better to tell userspace that we couldn't > >> display it on time and wait for a new frame to be generated. > > > > I agree with Michel here. It's a kernel bug, it should be solved by the > > kernel. I don't like the tendency of pushing userspace to handle kernel > > bugs, especially if this is just needed for one vendor's buggy driver. > > (No offence.) > Well I strongly disagree. The kernel is not here to serve userspace, but to > give userspace access to the HW in a generalized manner. Isn't this why kernel mode setting was invented in favour of the mess that we used to have in the DDX drivers? > If this is caused by a HW failure then reporting back to userspace is the > most reasonable thing to do. Nothing wrong with reporting the problem back to userspace. But it isn't worth much, because userspace is extremely unlikely to be able to fix it. How would userspace fix a missed or broken interrupt, a firmware hang, or buggy programming of display engine registers? Also, even if it were possible, expecting userspace to fix it would just place extra burden on compositor maintainers, which in turn would put us in a similar situation where were with GPU recovery before queue reset was implemented. Only a small handful of compositors can handle it (only one of the major players and maybe a few smaller ones). That gives all other users a bad experience by default. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 11:25 ` Christian König 2026-01-28 12:22 ` Timur Kristóf @ 2026-01-28 14:25 ` Michel Dänzer 2026-01-28 14:35 ` Christian König 1 sibling, 1 reply; 47+ messages in thread From: Michel Dänzer @ 2026-01-28 14:25 UTC (permalink / raw) To: Christian König, Timur Kristóf, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/28/26 12:25, Christian König wrote: > On 1/28/26 10:19, Timur Kristóf wrote: >> On 2026. január 26., hétfő 14:00:59 közép-európai téli idő Christian König >> wrote: >>> On 1/26/26 11:27, Michel Dänzer wrote: >>>> On 1/26/26 11:14, Christian König wrote: >>>>> >>>>> But looking at the slide Harry actually pointed out what immediately came >>>>> to my mind as well, e.g. that the Compositor needs to issue a full >>>>> modeset to re-program the CRTC.> >>>> In principle, the kernel driver has all the information it needs to >>>> reprogram the HW by itself. Not sure why the compositor would need to be >>>> actively involved. >>> >>> [...] >>> >>> Then second even if the kernel can do it I'm not sure if it should do it. >>> >>> [...] >> >> I agree with Michel here. It's a kernel bug, it should be solved by the >> kernel. I don't like the tendency of pushing userspace to handle kernel bugs, >> especially if this is just needed for one vendor's buggy driver. (No offence.) > > Well I strongly disagree. The kernel is not here to serve userspace, [...] Can't say I agree with that statement. Anyway, user space certainly isn't here to kick the kernel back into gear after it hit a bug, or to tell it things it already knows. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-28 14:25 ` Michel Dänzer @ 2026-01-28 14:35 ` Christian König 0 siblings, 0 replies; 47+ messages in thread From: Christian König @ 2026-01-28 14:35 UTC (permalink / raw) To: Michel Dänzer, Timur Kristóf, Hamza Mahfooz, dri-devel Cc: Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello On 1/28/26 15:25, Michel Dänzer wrote: > On 1/28/26 12:25, Christian König wrote: >> On 1/28/26 10:19, Timur Kristóf wrote: >>> On 2026. január 26., hétfő 14:00:59 közép-európai téli idő Christian König >>> wrote: >>>> On 1/26/26 11:27, Michel Dänzer wrote: >>>>> On 1/26/26 11:14, Christian König wrote: >>>>>> >>>>>> But looking at the slide Harry actually pointed out what immediately came >>>>>> to my mind as well, e.g. that the Compositor needs to issue a full >>>>>> modeset to re-program the CRTC.> >>>>> In principle, the kernel driver has all the information it needs to >>>>> reprogram the HW by itself. Not sure why the compositor would need to be >>>>> actively involved. >>>> >>>> [...] >>>> >>>> Then second even if the kernel can do it I'm not sure if it should do it. >>>> >>>> [...] >>> >>> I agree with Michel here. It's a kernel bug, it should be solved by the >>> kernel. I don't like the tendency of pushing userspace to handle kernel bugs, >>> especially if this is just needed for one vendor's buggy driver. (No offence.) >> >> Well I strongly disagree. The kernel is not here to serve userspace, [...] > > Can't say I agree with that statement. > > Anyway, user space certainly isn't here to kick the kernel back into gear after it hit a bug, or to tell it things it already knows. That is a really good argument. But the kernel should also not hide that something went wrong from userspace. E.g. flaky HDP pins for example. Regards, Christian. > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] drm: introduce page_flip_timeout() 2026-01-26 13:00 ` Christian König 2026-01-26 14:31 ` Michel Dänzer 2026-01-28 9:19 ` Timur Kristóf @ 2026-01-29 21:39 ` Xaver Hugl 2 siblings, 0 replies; 47+ messages in thread From: Xaver Hugl @ 2026-01-29 21:39 UTC (permalink / raw) To: Christian König Cc: Michel Dänzer, Timur Kristóf, Hamza Mahfooz, dri-devel, Alex Deucher, David Airlie, Simona Vetter, Harry Wentland, Leo Li, Rodrigo Siqueira, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sunil Khatri, Ce Sun, Lijo Lazar, Kenneth Feng, Ivan Lipski, Alex Hung, Tom Chung, Melissa Wen, Michel Dänzer, Fangzhi Zuo, amd-gfx, linux-kernel, Mario Limonciello > Then second even if the kernel can do it I'm not sure if it should do it. > > I mean userspace asked for a quick page flip and not some expensive CRTC/PLL reprogramming. Stuff like that usually takes some time and by then the frame which should be displayed by the page flip might already be stale and it would be better to tell userspace that we couldn't display it on time and wait for a new frame to be generated. I would personally prefer a new "pageflip failed" event, which the compositor can react to appropriately. For compositors not opting into that new API, the kernel automatically fixing things would be nice though. Even pretending the pageflip completed and then failing the next one with EINVAL would be enough to trigger a modeset in the case of KWin. > And third, there must be a root cause of the page flip not completing. > > My educated guess is that we have some atomic property change or even turning the CRTC off in parallel with the page flip. I mean HW rarely turns off its reoccurring vblank interrupt on its own. > > Returning an error to userspace might actually help identify the root cause. There are two things I know that trigger pageflip timeouts frequently. On dedicated GPUs, most of them seem to happen when a game causes a GPU reset. In some cases, it seemed like the pageflip did complete, but the kernel never sent the pageflip event to userspace. It also rejected new atomic commits of the compositor with EBUSY - but a new instance of the compositor could still do atomic commits just fine. In other cases, triggering another GPU reset was necessary to recover, and in yet other cases it was just broken beyond repair. Presumably, all of them are caused by bugs in the GPU reset sequence. As another piece of information on that, KWin only does atomic commits once the fences of the involved buffers are signaled, and it does not use OUT_FENCE_FD. So fence signaling should not be relevant, at least not on the KMS uAPI level. On APUs, the vast majority are caused by PSR. I know many AMD laptop users that run with "amdgpu.dcdebugmask=0x10" to disable PSR as a workaround, and have never seen a pageflip timeout since setting that option. I have even seen a freeze *without* a pageflip timeout in testing PSR again on my laptop recently, so PSR seems to have even bigger issues. Pageflip timeout or not, manually triggering a GPU reset seems to be a reliable way to recover from it. IMO that one is bad and widespread enough that PSR should be disabled by default on the relevant AMD hardware until it no longer causes such problems. On the topic of whether or not this is just a thing the driver has to fix, this isn't as exclusive to amdgpu as it might seem. i915 has some pageflip timeout issues with apparently still unknown causes (https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14604), and the proprietary Nvidia driver had one some time ago that IIRC was caused by firmware bugs. Obviously, drivers still need to be fixed, but the bug is bad enough for the end user that a fallback would be very helpful. If userspace gets notified about it, we can still direct users to the relevant bug trackers to get the underlying bugs hopefully fixed either way. - Xaver ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2026-02-03 21:48 UTC | newest] Thread overview: 47+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-01-23 0:05 [PATCH 1/2] drm: introduce page_flip_timeout() Hamza Mahfooz 2026-01-23 0:05 ` [PATCH 2/2] drm/amdgpu: implement page_flip_timeout() support Hamza Mahfooz 2026-01-23 11:20 ` Timur Kristóf 2026-01-23 14:25 ` Hamza Mahfooz 2026-01-23 12:26 ` kernel test robot 2026-01-23 15:16 ` kernel test robot 2026-01-23 17:49 ` Alex Deucher 2026-01-23 18:10 ` Alex Deucher 2026-01-23 13:52 ` [PATCH 1/2] drm: introduce page_flip_timeout() Christian König 2026-01-23 14:44 ` Hamza Mahfooz 2026-01-23 16:12 ` Christian König 2026-01-23 19:41 ` Alex Deucher 2026-01-23 14:44 ` Timur Kristóf 2026-01-23 22:30 ` Mario Limonciello 2026-01-24 18:32 ` Hamza Mahfooz 2026-01-24 18:43 ` Mario Limonciello 2026-01-24 19:49 ` Hamza Mahfooz 2026-01-27 22:44 ` Hamza Mahfooz 2026-01-26 14:20 ` Alex Deucher 2026-01-27 22:52 ` Hamza Mahfooz 2026-01-27 22:57 ` Alex Deucher 2026-01-28 10:39 ` Christian König 2026-01-28 11:26 ` Michel Dänzer 2026-01-28 12:14 ` Timur Kristóf 2026-01-28 12:48 ` Christian König 2026-01-28 14:25 ` Michel Dänzer 2026-01-29 10:06 ` Michel Dänzer 2026-01-29 11:25 ` Timur Kristóf 2026-01-29 11:38 ` Christian König 2026-01-29 12:06 ` Timur Kristóf 2026-01-29 12:59 ` Christian König 2026-01-29 14:04 ` Hamza Mahfooz 2026-01-29 14:24 ` Christian König 2026-01-29 14:33 ` Hamza Mahfooz 2026-01-29 14:41 ` Christian König 2026-02-03 21:48 ` Timur Kristóf 2026-01-29 21:56 ` Xaver Hugl 2026-01-26 10:14 ` Christian König 2026-01-26 10:27 ` Michel Dänzer 2026-01-26 13:00 ` Christian König 2026-01-26 14:31 ` Michel Dänzer 2026-01-28 9:19 ` Timur Kristóf 2026-01-28 11:25 ` Christian König 2026-01-28 12:22 ` Timur Kristóf 2026-01-28 14:25 ` Michel Dänzer 2026-01-28 14:35 ` Christian König 2026-01-29 21:39 ` Xaver Hugl
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox