[PATCH 1/1] drm/amd/display: complete cursor vblank events immediately

public inbox for amd-gfx@lists.freedesktop.org

* [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
@ 2026-02-17 19:16 Michele Palazzi
  2026-02-23 15:27 ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-17 19:16 UTC (permalink / raw)
  To: amd-gfx
  Cc: harry.wentland, rodrigo.siqueira, sunpeng.li, alexander.deucher,
	christian.koenig, Michele Palazzi

Intermittent flip_done timeouts have been observed on AMD GPUs
since kernel 6.12.

Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
incorrectly consume events meant for plane flips during cursor-only
updates. This happens because cursor commits defer event delivery to
the vblank handler, which checks (pflip_status != SUBMITTED). Since
AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
event slot for subsequent plane flips, leading to timeouts.

The potential for a race was present since commit 473683a03495
("drm/amd/display: Create a file dedicated for CRTC"), then
commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
policy for older ASICs") made it happen by reducing vblank
off-delay and making disables happen much more frequently
between commits.

Fix this by sending cursor-only vblank events immediately in
amdgpu_dm_commit_planes(). Since cursor updates are committed to
hardware immediately, deferring the event is unnecessary and
creates race windows for event stealing or starvation if vblank
is disabled before the handler runs.

Tested on DCN 2.1, 3.2, and 3.5.

Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
Signed-off-by: Michele Palazzi <sysdadmin@m1k.cloud>
---
I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE first, 9070 XT now)
since kernel 6.12. The hang occurs during normal desktop usage but is much easier to
trigger under specific conditions involving cursor movements and plane updates.

Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787

Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
Dual DP monitors, 2560x1440, 144Hz
Desktop: KDE Plasma Wayland

The hang was initially observed while using Cisco Webex
(XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U), start a meeting
and screen share a window running Omnissa Horizon client. Then move the cursor
around between the two monitors and the shared window.
Under these conditions the hang usually occurs within a few hours.

Enabling drm.debug masks the issue entirely: the overhead
changes timing enough to close the race window.
So I added debug printks to amdgpu_dm.c and used a small bpftrace script to log the
pageflip lifecycle with per-thread tracking.

bpftrace script:

  config = { missing_probes = "warn" }
  BEGIN { printf("=== flip_done tracer started ===\n"); }
  kprobe:drm_crtc_vblank_on_config       { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
  kprobe:drm_vblank_disable_and_save     { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
  kprobe:dm_pflip_high_irq               { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
  kprobe:drm_crtc_send_vblank_event      { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
  kprobe:drm_vblank_put                  { printf("%lu drm_vblank_put\n", nsecs/1000000); }
  kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
  kprobe:manage_dm_interrupts            { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
  kprobe:drm_atomic_helper_wait_for_flip_done {
      @wait_start[tid] = nsecs;
      printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
  }
  kretprobe:drm_atomic_helper_wait_for_flip_done {
      $start = @wait_start[tid];
      $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
      if ($ms > 100) {
          printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
                 nsecs/1000000, $ms, tid);
      } else {
          printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
                 nsecs/1000000, $ms, tid);
      }
      delete(@wait_start[tid]);
  }
  interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
  END { printf("=== stopped ===\n"); clear(@wait_start); }

The timeout was captured at 17:35:41 CET. The trace timestamps
match dmesg exactly (9942110ms = dmesg 9942.110s).

dmesg output from the timeout:

  [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
  [ 9942.110380] [FLIP_DEBUG]  crtc:0 pflip_status=0 event=00000000a0636a23
                  vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
                  disable_immediate=0 active_planes=1

pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip was never completed
but the status was already reset to NONE. vblank was enabled, refcount was held, so vblank
IRQs were firing throughout the wait.

The bpftrace captured the exact sequence leading up to the hang. Here's the critical
timeline at ~17:35:31 (9931771), about 10 seconds before the timeout fired:

  9931755 drm_atomic_helper_commit_hw_done
  9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
  9931756 dm_pflip_high_irq                           <- normal plane flip, last good one
  9931756 drm_crtc_send_vblank_event
  9931756 drm_vblank_put
  9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
  9931771 drm_vblank_disable_and_save                 <- vblank timer fires
  9931771 drm_crtc_send_vblank_event                  <- event sent WITHOUT dm_pflip_high_irq
  9931771 drm_vblank_put
  9931771 drm_atomic_helper_commit_hw_done
  9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
  9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
  9931773 drm_atomic_helper_commit_hw_done
  9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- new commit
  9931777 dm_pflip_high_irq                           <- pflip fires, completes the wrong one
  9931777 drm_crtc_send_vblank_event
  9931777 drm_vblank_put
  9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
  9931781 drm_atomic_helper_commit_hw_done
  9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- THIS ONE HANGS
  ... 10328ms of silence ...
  9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]

The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq. This is
amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The problem is that the
cursor-only commit path in amdgpu_dm_commit_planes() stores the event in acrtc->event
and defers delivery to the vblank handler. This creates two race conditions:

- The vblank handler checks (pflip_status != SUBMITTED) which also
  matches NONE, so it can consume events meant for plane flips. The subsequent
  dm_pflip_high_irq finds no event, and the next commit hangs.

- If vblank is disabled by the off-delay timer before the handler
  runs, the PENDING cursor event is never delivered and the commit hangs.
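
As a rough illustration of the first race, here is a small userspace C model
of the two delivery paths. It is only a sketch based on the description
above: the names (model_crtc, model_handle_vblank, model_pflip_irq) are made
up for this example and none of it is the driver code. It shows that a
handler gated on (pflip_status != SUBMITTED) also delivers an event while the
status is NONE, after which the flip interrupt has nothing left to send. It
does not show how the driver gets into that state.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the flip states named in this message.
 * Hypothetical model types, not the kernel definitions. */
enum flip_status { FLIP_NONE, FLIP_PENDING, FLIP_SUBMITTED };

struct model_crtc {
	enum flip_status pflip_status;
	const char *event;	/* stands in for acrtc->event */
};

/* Models the vblank handler check: anything that is not SUBMITTED,
 * including NONE, lets the handler deliver the stored event. */
static const char *model_handle_vblank(struct model_crtc *c)
{
	const char *sent = NULL;

	if (c->pflip_status != FLIP_SUBMITTED && c->event != NULL) {
		sent = c->event;
		c->event = NULL;
	}
	return sent;
}

/* Models the page-flip interrupt: it only acts on a SUBMITTED flip and
 * delivers whatever event is still stored (possibly none). */
static const char *model_pflip_irq(struct model_crtc *c)
{
	const char *sent = NULL;

	if (c->pflip_status == FLIP_SUBMITTED) {
		sent = c->event;
		c->event = NULL;
		c->pflip_status = FLIP_NONE;
	}
	return sent;
}
```

With the status at NONE and an event stored, model_handle_vblank() returns
the event and clears the slot. A later model_pflip_irq() call with the status
set to SUBMITTED then returns NULL, which corresponds to the "finds no event"
case in the first bullet.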

The fix is to send cursor-only events immediately via drm_crtc_send_vblank_event()
in amdgpu_dm_commit_planes() instead of deferring to the vblank handler. The cursor
update is already committed to hardware at this point, so immediate delivery is correct.
This eliminates both race conditions by removing cursor events from the deferred
delivery path entirely:

- Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
- Cursor updates: sent immediately in commit_planes (no deferral, no races)

From git history the check in amdgpu_dm_crtc_handle_vblank() has been like this since
473683a03495 ("drm/amd/display: Create a file dedicated for CRTC", 2022)
which moved this code from amdgpu_dm.c, but it was practically impossible to trigger
because the default drm_vblank_offdelay was 5000ms.
Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") in 6.12
changed all ASICs to use drm_crtc_vblank_on_config() with a computed off-delay
of roughly 2 frames (~14ms at 144Hz).
This made drm_vblank_disable_and_save fire hundreds of times more often, turning
a theoretical race into reality. The bpftrace log is full of drm_vblank_disable_and_save
events interleaved with the commit sequence.
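
The numbers above can be checked with a few lines of C. This is just
arithmetic on the figures quoted in this message (2 frames, 144Hz, 5000ms);
the helper names are made up for the example.

```c
#include <assert.h>

/* Duration of `frames` refresh periods, in milliseconds, at `hz` Hz. */
static double frames_to_ms(double frames, double hz)
{
	return frames * 1000.0 / hz;
}

/* How many times shorter the new off-delay is than the old one. */
static double offdelay_ratio(double old_ms, double new_ms)
{
	return old_ms / new_ms;
}
```

frames_to_ms(2, 144) comes out just under 14ms, and
offdelay_ratio(5000, frames_to_ms(2, 144)) is about 360, which is in line
with "hundreds of times more often".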

This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5 (9070 XT).
Under high-frequency glxgears + cursor jiggling test the patch successfully intercepted
the race thousands of times without a single timeout.
Also running this on the main system without issues.

My earlier, rushed attempt at addressing this
(https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html)
is no longer needed.

Patch applies cleanly on top of tag v6.19.

 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index a8a59126b2d2..35987ce80c71 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 	} else if (cursor_update && acrtc_state->active_planes > 0) {
 		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
 		if (acrtc_attach->base.state->event) {
-			drm_crtc_vblank_get(pcrtc);
-			acrtc_attach->event = acrtc_attach->base.state->event;
+			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
 			acrtc_attach->base.state->event = NULL;
 		}
 		spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
-- 
2.53.0


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-17 19:16 [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately Michele Palazzi
@ 2026-02-23 15:27 ` Leo Li
  2026-02-27  8:53   ` Michele Palazzi
                     ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Leo Li @ 2026-02-23 15:27 UTC (permalink / raw)
  To: Michele Palazzi, amd-gfx
  Cc: harry.wentland, rodrigo.siqueira, alexander.deucher,
	christian.koenig



On 2026-02-17 14:16, Michele Palazzi wrote:
> Intermittent flip_done timeouts have been observed on AMD GPUs
> since kernel 6.12.
> 
> Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
> incorrectly consume events meant for plane flips during cursor-only
> updates. This happens because cursor commits defer event delivery to
> the vblank handler, which checks (pflip_status != SUBMITTED). Since
> AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
> event slot for subsequent plane flips, leading to timeouts.
> 
> The potential for a race was present since commit 473683a03495
> ("drm/amd/display: Create a file dedicated for CRTC"), then
> commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
> policy for older ASICs") made it happen by reducing vblank
> off-delay and making disables happen much more frequently
> between commits.
> 
> Fix this by sending cursor-only vblank events immediately in
> amdgpu_dm_commit_planes(). Since cursor updates are committed to
> hardware immediately, deferring the event is unnecessary and
> creates race windows for event stealing or starvation if vblank
> is disabled before the handler runs.
> 
> Tested on DCN 2.1, 3.2, and 3.5.
> 
> Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
> Signed-off-by: Michele Palazzi <sysdadmin@m1k.cloud>
> ---
> I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE first, 9070 XT now)
> since kernel 6.12. The hang occurs during normal desktop usage but is much easier to
> trigger under specific conditions involving cursor movements and plane updates.
> 
> Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787
> 
> Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
> Dual DP monitors, 2560x1440, 144Hz
> Desktop: KDE Plasma Wayland
> 
> The hang was initially observed while using Cisco Webex
> (XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U), start a meeting
> and screen share a window running Omnissa Horizon client. Then move the cursor
> around between the two monitors and the shared window.
> Under these conditions the hang usually occurs within a few hours.
> 
> Enabling drm.debug masks the issue entirely: the overhead
> changes timing enough to close the race window.
> So I added debug printks to amdgpu_dm.c and used a small bpftrace script to log the
> pageflip lifecycle with per-thread tracking.
> 
> bpftrace script:
> 
>   config = { missing_probes = "warn" }
>   BEGIN { printf("=== flip_done tracer started ===\n"); }
>   kprobe:drm_crtc_vblank_on_config       { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
>   kprobe:drm_vblank_disable_and_save     { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
>   kprobe:dm_pflip_high_irq               { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
>   kprobe:drm_crtc_send_vblank_event      { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
>   kprobe:drm_vblank_put                  { printf("%lu drm_vblank_put\n", nsecs/1000000); }
>   kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
>   kprobe:manage_dm_interrupts            { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
>   kprobe:drm_atomic_helper_wait_for_flip_done {
>       @wait_start[tid] = nsecs;
>       printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
>   }
>   kretprobe:drm_atomic_helper_wait_for_flip_done {
>       $start = @wait_start[tid];
>       $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
>       if ($ms > 100) {
>           printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
>                  nsecs/1000000, $ms, tid);
>       } else {
>           printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
>                  nsecs/1000000, $ms, tid);
>       }
>       delete(@wait_start[tid]);
>   }
>   interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
>   END { printf("=== stopped ===\n"); clear(@wait_start); }
> 
> The timeout was captured at 17:35:41 CET. The trace timestamps
> match dmesg exactly (9942110ms = dmesg 9942.110s).
> 
> dmesg output from the timeout:
> 
>   [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
>   [ 9942.110380] [FLIP_DEBUG]  crtc:0 pflip_status=0 event=00000000a0636a23
>                   vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
>                   disable_immediate=0 active_planes=1
> 
> pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip was never completed
> but the status was already reset to NONE. vblank was enabled, refcount was held, so vblank
> IRQs were firing throughout the wait.
> 
> The bpftrace captured the exact sequence leading up to the hang. Here's the critical
> timeline at ~17:35:31 (9931771), about 10 seconds before the timeout fired:
> 
>   9931755 drm_atomic_helper_commit_hw_done
>   9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
>   9931756 dm_pflip_high_irq                           <- normal plane flip, last good one
>   9931756 drm_crtc_send_vblank_event
>   9931756 drm_vblank_put
>   9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
>   9931771 drm_vblank_disable_and_save                 <- vblank timer fires
>   9931771 drm_crtc_send_vblank_event                  <- event sent WITHOUT dm_pflip_high_irq
>   9931771 drm_vblank_put
>   9931771 drm_atomic_helper_commit_hw_done
>   9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
>   9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
>   9931773 drm_atomic_helper_commit_hw_done
>   9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- new commit
>   9931777 dm_pflip_high_irq                           <- pflip fires, completes the wrong one
>   9931777 drm_crtc_send_vblank_event
>   9931777 drm_vblank_put
>   9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
>   9931781 drm_atomic_helper_commit_hw_done
>   9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- THIS ONE HANGS
>   ... 10328ms of silence ...
>   9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]
> 
> The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq. This is
> amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The problem is that the
> cursor-only commit path in amdgpu_dm_commit_planes() stores the event in acrtc->event
> and defers delivery to the vblank handler. This creates two race conditions:
> 
> - The vblank handler checks (pflip_status != SUBMITTED) which also
>   matches NONE, so it can consume events meant for plane flips. The subsequent
>   dm_pflip_high_irq finds no event, and the next commit hangs.
> 
> - If vblank is disabled by the off-delay timer before the handler
>   runs, the PENDING cursor event is never delivered and the commit hangs.
> 
> The fix is to send cursor-only events immediately via drm_crtc_send_vblank_event()
> in amdgpu_dm_commit_planes() instead of deferring to the vblank handler. The cursor
> update is already committed to hardware at this point, so immediate delivery is correct.
> This eliminates both race conditions by removing cursor events from the deferred
> delivery path entirely:
> 
> - Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
> - Cursor updates: sent immediately in commit_planes (no deferral, no races)
> 
> From git history the check in amdgpu_dm_crtc_handle_vblank() has been like this since
> 473683a03495 ("drm/amd/display: Create a file dedicated for CRTC", 2022)
> which moved this code from amdgpu_dm.c, but it was practically impossible to trigger
> because the default drm_vblank_offdelay was 5000ms.
> Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") in 6.12
> changed all ASICs to use drm_crtc_vblank_on_config() with a computed off-delay
> of roughly 2 frames (~14ms at 144Hz).
> This made drm_vblank_disable_and_save fire hundreds of times more often, turning
> a theoretical race into reality. The bpftrace log is full of drm_vblank_disable_and_save
> events interleaved with the commit sequence.
> 
> This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5 (9070 XT).
> Under high-frequency glxgears + cursor jiggling test the patch successfully intercepted
> the race thousands of times without a single timeout.
> Also running this on the main system without issues.
> 
> My earlier, rushed attempt at addressing this
> (https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html)
> is no longer needed.
> 
> Patch applies cleanly on top of tag v6.19.

Really nice debugging work, thanks for catching this!

Ideally, the cursor event should be delivered when hardware latches onto the new
cursor info and starts scanning it out. The latching event fires an interrupt
that should be handled by dm_crtc_high_irq().

dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
onto a new fb address; I don't think it actually fires when there's a
cursor-only update. I think if we really want to do it right, we can have
another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
the event in crtc_high_irq().

In any case, I don't foresee any major issues with delivering the event early.
And since it fixes an ongoing issue:

Reviewed-by: Leo Li <sunpeng.li@amd.com>

Thanks!
Leo

> 
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index a8a59126b2d2..35987ce80c71 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>  	} else if (cursor_update && acrtc_state->active_planes > 0) {
>  		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>  		if (acrtc_attach->base.state->event) {
> -			drm_crtc_vblank_get(pcrtc);
> -			acrtc_attach->event = acrtc_attach->base.state->event;
> +			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
>  			acrtc_attach->base.state->event = NULL;
>  		}
>  		spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);



* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-23 15:27 ` Leo Li
@ 2026-02-27  8:53   ` Michele Palazzi
  2026-02-27  8:58     ` Michele Palazzi
  2026-02-27 19:43   ` Alex Deucher
  2026-03-02  8:53   ` Michel Dänzer
  2 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-27  8:53 UTC (permalink / raw)
  To: Leo Li, amd-gfx
  Cc: harry.wentland, rodrigo.siqueira, alexander.deucher,
	christian.koenig

On 2/23/26 16:27, Leo Li wrote:
> 
> Really nice debugging work, thanks for catching this!
> 
> Ideally, the cursor event should be delivered when hardware latches onto the new
> cursor info and starts scanning it out. The latching event fires an interrupt
> that should be handled by dm_crtc_high_irq().
> 
> dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
> onto a new fb address; I don't think it actually fires when there's a
> cursor-only update. I think if we really want to do it right, we can have
> another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
> the event in crtc_high_irq().
> 
> In any case, I don't foresee any major issues with delivering the event early.
> And since it fixes an ongoing issue:
> 
> Reviewed-by: Leo Li <sunpeng.li@amd.com>
> 
> Thanks!
> Leo

Thanks for the review. Further testing confirms that both this patch and 
increasing the dGPU vblank offdelay (from 2 frames to ~50 frames) 
independently eliminate the flip timeouts in my testing. Both work by 
reducing the frequency of vblank disable/re-enable cycles, basically 
either could be an interim fix.

Your deferred vblank enable/disable series 
https://lore.kernel.org/amd-gfx/20260224212639.390768-1-sunpeng.li@amd.com/T/#t 
looks like it could be the proper solution going forward instead 
(haven't tested it).



* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-27  8:53   ` Michele Palazzi
@ 2026-02-27  8:58     ` Michele Palazzi
  2026-03-02 22:13       ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-27  8:58 UTC (permalink / raw)
  To: Leo Li, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig, siqueira

On 2/27/26 09:53, Michele Palazzi wrote:
> On 2/23/26 16:27, Leo Li wrote:
>>
>> Really nice debugging work, thanks for catching this!
>>
>> Ideally, the cursor event should be delivered when hardware latches 
>> onto the new
>> cursor info and starts scanning it out. The latching event fires an 
>> interrupt
>> that should be handled by dm_crtc_high_irq().
>>
>> dm_pflip_high_irq() handles an interrupt specifically for when 
>> hardware latches
>> onto a new fb address; I don't think it actually fires when there's a
>> cursor-only update. I think if we really want to do it right, we can have
>> another "acrtc_attach->cursor_event" just for cursor-only updates, and 
>> deliver
>> the event in crtc_high_irq().
>>
>> In any case, I don't foresee any major issues with delivering the 
>> event early.
>> And since it fixes an ongoing issue:
>>
>> Reviewed-by: Leo Li <sunpeng.li@amd.com>
>>
>> Thanks!
>> Leo
> 
> Thanks for the review. Further testing confirms that both this patch and 
> increasing the dGPU vblank offdelay (from 2 frames to ~50 frames) 
> independently eliminate the flip timeouts in my testing. Both work by 
> reducing the frequency of vblank disable/re-enable cycles, basically 
> either could be an interim fix.
> 
> Your deferred vblank enable/disable series
> https://lore.kernel.org/amd-gfx/20260224212639.390768-1-sunpeng.li@amd.com/T/#t
> looks like it could be the proper solution going forward instead
> (haven't tested it).
> 

fixed siqueira@igalia.com cc


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-27  8:58     ` Michele Palazzi
@ 2026-03-02 22:13       ` Leo Li
  2026-03-03  8:17         ` Shengyu Qu
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-02 22:13 UTC (permalink / raw)
  To: Michele Palazzi, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig, siqueira,
	Michel Dänzer

On 2026-02-27 03:58, Michele Palazzi wrote:
> On 2/27/26 09:53, Michele Palazzi wrote:
>> On 2/23/26 16:27, Leo Li wrote:
>>>
>>> Really nice debugging work, thanks for catching this!
>>>
>>> Ideally, the cursor event should be delivered when hardware latches onto the new
>>> cursor info and starts scanning it out. The latching event fires an interrupt
>>> that should be handled by dm_crtc_high_irq().
>>>
>>> dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
>>> onto a new fb address; I don't think it actually fires when there's a
>>> cursor-only update. I think if we really want to do it right, we can have
>>> another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
>>> the event in crtc_high_irq().
>>>
>>> In any case, I don't foresee any major issues with delivering the event early.
>>> And since it fixes an ongoing issue:
>>>
>>> Reviewed-by: Leo Li <sunpeng.li@amd.com>
>>>
>>> Thanks!
>>> Leo
>>
>> Thanks for the review. Further testing confirms that both this patch and increasing the dGPU vblank offdelay (from 2 frames to ~50 frames) independently eliminate the flip timeouts in my testing. Both work by reducing the frequency of vblank disable/re-enable cycles, basically either could be an interim fix.
>>
>> Your deferred vblank enable/disable series https://lore.kernel.org/amd-gfx/20260224212639.390768-1-sunpeng.li@amd.com/T/#t looks like it could be the proper solution going forward instead (haven't tested it).
>>

Looking at this a bit more, I'm not sure if we're understanding the trace
correctly.

Let's first assume the cursor update is not a legacy_cursor_update: In both
non-blocking and blocking atomic commits, there should be mechanisms in place
that limit the number of in-flight atomic_commit_tail()s per crtc to 1 (see
drm_atomic_helper_wait_for_dependencies()). IOW, after each independent cursor
**or** fb update, there should be one flip_done completion from
drm_crtc_send_vblank_event(), before the next update is allowed to continue.
Since the event is "armed" as part of atomic_commit_tail(), and "completed" in
either pflip_high_irq or crtc_high_irq, racing "arms" of acrtc->event should not
be possible.

A combined cursor **and** flip update should use a single event and flip_done
completion, since it's one atomic_commit_tail to update both.

Now if it is a legacy_cursor_update, DRM core first checks if the driver can
commit it asynchronously, and sets state->async_update=true if it can. If
async_update==true, drm_atomic_helper_commit() skips setting up the event
entirely. Otherwise, drm_atomic_helper_setup_commit() will check if
legacy_cursor_update==true. If it is, it completes flip_done early *and* skips
setting up the event. So either way, there's no event to send, nor flip_done to
wait on.
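
To spell out the three cases above, here is a toy C model of the decision as
described. It is only a sketch: the enum and function names are invented for
illustration and it does not reproduce the actual DRM helper code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three commit paths discussed above. */
enum commit_path {
	PATH_ASYNC_UPDATE,		/* legacy cursor, driver can commit async */
	PATH_LEGACY_CURSOR_SYNC,	/* legacy cursor, no async commit */
	PATH_REGULAR,			/* any non-legacy cursor and/or fb update */
};

/* Picks the path from the two conditions named in the text. */
static enum commit_path model_commit_path(bool legacy_cursor_update,
					  bool driver_can_async)
{
	if (legacy_cursor_update && driver_can_async)
		return PATH_ASYNC_UPDATE;
	if (legacy_cursor_update)
		return PATH_LEGACY_CURSOR_SYNC;
	return PATH_REGULAR;
}

/* Per the description, only the regular path arms an event and has a
 * flip_done completion left to wait on. */
static bool model_arms_event(enum commit_path path)
{
	return path == PATH_REGULAR;
}
```

In this model, both legacy cursor paths end with no event to send, so only
regular commits should ever touch acrtc->event.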

But evidently in the trace, something awry is going on. Though I'm not sure if
it's because of the race condition as described. It would be interesting to
trace the events at the point that they're created, armed, then completed, and
see if there's some mismatch going on.

Here's a patch that inserts a few trace events.
https://pastebin.com/dpLnVSbu

Could you try to reproduce the hang again while recording these trace events?
Using trace-cmd (with stack trace enabled '-T'):

    trace-cmd record -e amdgpu_dm_event_arm -e drm_vblank_dbg* -T
    trace-cmd report trace.dat

The timeout can be found by searching 'remaining_wait_ms=0'.

Regarding the deferred vblank patchset, if the issue is indeed racing writes of
amdgpu_crtc->event, then I don't imagine that patchset would help. It's
intended to solve a different race.

Thanks,
Leo 

> 
> fixed siqueira@igalia.com cc


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-02 22:13       ` Leo Li
@ 2026-03-03  8:17         ` Shengyu Qu
  2026-03-03 19:07           ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Shengyu Qu @ 2026-03-03  8:17 UTC (permalink / raw)
  To: Leo Li, Michele Palazzi, amd-gfx
  Cc: wiagn233, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer



> Here's a patch that inserts a few trace events.
> https://pastebin.com/dpLnVSbu
> 
> Could you try to reproduce the hang again while recording these trace events?
> Using trace-cmd (with stack trace enabled '-T'):

I think Michele said that the timeout issue would be masked by drm.debug 
due to overhead?

> 
>      trace-cmd record -e amdgpu_dm_event_arm -e drm_vblank_dbg* -T
>      trace-cmd report trace.dat
> 
> The timeout can be found by searching 'remaining_wait_ms=0'.
> 
> Regarding the deferred vblank patchset, if the issue is indeed racing writes of
> amdgpu_crtc->event, then I don't imagine that patchset would help. It's
> intended to solve a different race.
> 
> Thanks,
> Leo
> 
> 
>>
>> fixed siqueira@igalia.com cc
> 




* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-03  8:17         ` Shengyu Qu
@ 2026-03-03 19:07           ` Leo Li
  2026-03-04 14:00             ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-03 19:07 UTC (permalink / raw)
  To: Shengyu Qu, Michele Palazzi, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig, siqueira,
	Michel Dänzer



On 2026-03-03 03:17, Shengyu Qu wrote:
>> Here's a patch that inserts a few trace events.
>> https://pastebin.com/dpLnVSbu
>>
>> Could you try to reproduce the hang again while recording these trace events?
>> Using trace-cmd (with stack trace enabled '-T'):
> 
> I think Michele said that the timeout issue would be masked by drm.debug due to overhead?

I'm hoping that enabling a few tracepoints would have much less overhead than
enabling drm.debug. Depending on which debug flag, there can be a lot of dmesg
output.

If tracepoints end up masking the race condition, I wonder if there's a way for
bpftrace to probe the lines where the tracepoints are inserted, and print out
the event's pointer address. If so, that's a viable alternative.

- Leo

> 
>>
>>      trace-cmd record -e amdgpu_dm_event_arm -e drm_vblank_dbg* -T
>>      trace-cmd report trace.dat
>>
>> The timeout can be found by searching 'remaining_wait_ms=0'.
>>
>> Regarding the deferred vblank patchset, if the issue is indeed racing writes of
>> amdgpu_crtc->event, then I don't imagine that patchset would help. It's
>> intended to solve a different race.
>>
>> Thanks,
>> Leo
>>
>>
>>>
>>> fixed siqueira@igalia.com cc
>>
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-03 19:07           ` Leo Li
@ 2026-03-04 14:00             ` Michele Palazzi
  2026-03-04 14:20               ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-04 14:00 UTC (permalink / raw)
  To: Leo Li, Shengyu Qu, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig, siqueira,
	Michel Dänzer

On 3/3/26 20:07, Leo Li wrote:
> 
> 
> On 2026-03-03 03:17, Shengyu Qu wrote:
>>> Here's a patch that inserts a few trace events.
>>> https://pastebin.com/dpLnVSbu
>>>
>>> Could you try to reproduce the hang again while recording these trace events?
>>> Using trace-cmd (with stack trace enabled '-T'):
>>
>> I think Michele said that the timeout issue would be masked by drm.debug due to overhead?
> 
> I'm hoping that enabling a few tracepoints would have much less overhead than
> enabling drm.debug. Depending on which debug flag, there can be a lot of dmesg
> output.
> 
> If tracepoints ends up masking the race condition, I wonder if there's a way for
> bpftrace to probe the lines where the tracepoints are inserted, and print out
> the event's pointer address. If so, that's a viable alternative.
> 
> - Leo

So far I could not reproduce with your tracing patch applied. I could 
try to use bpftrace instead, but it will take some time; I am quite busy 
these days.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-04 14:00             ` Michele Palazzi
@ 2026-03-04 14:20               ` Leo Li
  2026-03-05 22:30                 ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-04 14:20 UTC (permalink / raw)
  To: Michele Palazzi, Shengyu Qu, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig, siqueira,
	Michel Dänzer



On 2026-03-04 09:00, Michele Palazzi wrote:
> On 3/3/26 20:07, Leo Li wrote:
>>
>>
>> On 2026-03-03 03:17, Shengyu Qu wrote:
>>>> Here's a patch that inserts a few trace events.
>>>> https://pastebin.com/dpLnVSbu
>>>>
>>>> Could you try to reproduce the hang again while recording these trace events?
>>>> Using trace-cmd (with stack trace enabled '-T'):
>>>
>>> I think Michele said that the timeout issue would be masked by drm.debug due to overhead?
>>
>> I'm hoping that enabling a few tracepoints would have much less overhead than
>> enabling drm.debug. Depending on which debug flag, there can be a lot of dmesg
>> output.
>>
>> If tracepoints ends up masking the race condition, I wonder if there's a way for
>> bpftrace to probe the lines where the tracepoints are inserted, and print out
>> the event's pointer address. If so, that's a viable alternative.
>>
>> - Leo
> So far i could not reproduce with your tracing patch applied, i could try to use bpftrace instead but it will take some time, i am quite busy these days.

Understood. If you haven't tried already, maybe dropping the stack trace (-T) flag is worth a shot.
- Leo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-04 14:20               ` Leo Li
@ 2026-03-05 22:30                 ` Leo Li
  2026-03-06  8:37                   ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-05 22:30 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu



On 2026-03-04 09:20, Leo Li wrote:
>> So far i could not reproduce with your tracing patch applied, i could try to use bpftrace instead but it will take some time, i am quite busy these days.
> Understood. If you haven't tried already, maybe dropping the stack trace (-T) flag is worth a shot.
> - Leo

Hi Michele,

Since this issue is fairly widespread, and we're heading towards *a* fix,
I sent a reworked version of this where we save cursor-only vblank events
in a separate member in struct amdgpu_crtc, and deliver it in
amdgpu_dm_crtc_handle_vblank():

https://lore.kernel.org/amd-gfx/20260305222131.160914-1-sunpeng.li@amd.com/

Let me know if you get a chance to try it out! I'll also add your
Co-developed-by when merging -- I forgot to add it when sending it out.

- Leo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-05 22:30                 ` Leo Li
@ 2026-03-06  8:37                   ` Michele Palazzi
  2026-03-09 16:49                     ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-06  8:37 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/5/26 23:30, Leo Li wrote:
> 
> 
> On 2026-03-04 09:20, Leo Li wrote:
>>> So far i could not reproduce with your tracing patch applied, i could try to use bpftrace instead but it will take some time, i am quite busy these days.
>> Understood. If you haven't tried already, maybe dropping the stack trace (-T) flag is worth a shot.
>> - Leo
> 
> Hi Michele,
> 
> Since this issue is fairly widespread, and we're heading towards *a* fix,
> I sent a reworked version of this where we save cursor-only vblank events
> in a separate member in struct amdgpu_crtc, and deliver it in
> amdgpu_dm_crtc_handle_vblank():
> 
> https://lore.kernel.org/amd-gfx/20260305222131.160914-1-sunpeng.li@amd.com/
> 
> Let me know if you get a chance to try it out! I'll also add your
> Co-developed-by when merging -- I forgot to add it when sending it out.
> 
> - Leo

Reproduced today with a minimal bpftrace script active, but prepare_flip_isr is 
inlined by the compiler, so I was unable to track the event ARM side (I still 
have to try the offset approach).

Trace around the timeout:

142642747  commit_hw_done [tid=778834]
142642747  WAIT_FLIP ENTER [tid=778834]
   10279ms silence for tid=778834 on CRTC 0
   CRTC 1 continues normally throughout
142653026  WAIT_FLIP TIMEOUT waited 10278ms [tid=778834]


No dm_pflip_high_irq or DELIVER for that specific commit. CRTC 1 kept 
flowing normally the entire time.

Your new patch is an approach I already tried, and in my previous 
testing I still had flip timeouts. So while I think separating the 
cursor events makes sense and is correct, the root cause could be 
different from what I initially assumed, and sending the cursor events 
immediately was masking it by relieving pressure.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-06  8:37                   ` Michele Palazzi
@ 2026-03-09 16:49                     ` Michele Palazzi
  2026-03-10 23:50                       ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-09 16:49 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/6/26 09:37, Michele Palazzi wrote:
> 
> Your new patch is an approach i already tried, and in my previous 
> testing i still had flip timeouts, so while i think separating the 
> cursor events makes sense and is correct, the root cause could be 
> different from what i initially assumed and sending the cursor events 
> immediately was masking it by relieving pressure.

Leo, I finally reproduced with a bpftrace script that tracks event ARM (flip vs 
cursor) and DELIVER using kprobe offsets into the inlined prepare_flip_isr.

The hung commit is a cursor-only update on CRTC 0:

31088420  dm_pflip_high_irq [tid=0]
31088420  DELIVER event=ffff8b519225c580 crtc=0 [tid=0]
31088420  WAIT_FLIP EXIT 2ms [tid=203071]
31088421  ARM flip event=ffff8b4f26184c00 acrtc=ffff8b4ed1ddd000 [tid=203071]
31088421  commit_hw_done [tid=203071]
31088421  WAIT_FLIP ENTER [tid=203071]
31088422  dm_pflip_high_irq [tid=0]
31088422  DELIVER event=ffff8b4f26184c00 crtc=1 [tid=0]
31088422  WAIT_FLIP EXIT 1ms [tid=203071]
31088425  ARM cursor event=ffff8b519225ce00 acrtc=ffff8b4ed1dde000 [tid=203071]
31088425  commit_hw_done [tid=203071]
31088425  WAIT_FLIP ENTER [tid=203071]
31088428  ARM flip event=ffff8b4f26184480 acrtc=ffff8b4ed1ddd000 [tid=208580]
31088428  commit_hw_done [tid=208580]
31088428  WAIT_FLIP ENTER [tid=208580]
31088429  dm_pflip_high_irq [tid=0]
31088429  DELIVER event=ffff8b4f26184480 crtc=1 [tid=0]
31088429  WAIT_FLIP EXIT 1ms [tid=208580]
            ...
            10036ms silence for tid=203071
            no dm_pflip_high_irq, no DELIVER, no drm_vblank_disable_and_save on CRTC 0
            CRTC 1 continues normally throughout
            ...
31098462  WAIT_FLIP !!!TIMEOUT!!! waited 10036ms [tid=203071]

acrtc ffff8b4ed1dde000 = CRTC 0 (confirmed from ARM+DELIVER correlation)
acrtc ffff8b4ed1ddd000 = CRTC 1

Event ffff8b519225ce00 was armed as cursor on CRTC 0 and never 
delivered. No dm_pflip_high_irq fired for CRTC 0 during the entire 10s 
wait, and vblank was not disabled (no drm_vblank_disable_and_save in 
that window). CRTC 1 kept flowing normally throughout.

The complete bpftrace script is here: https://pastebin.com/Xiju44Cy
Note that I did this on tag v6.19.
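
A minimal sketch of the ARM/DELIVER correlation described above, assuming the
trace line format shown in this message. The helper name undelivered_events
and the regular expressions are illustrative only; they are not taken from
the pastebin script, and the sample lines are a subset of the trace above.

```python
import re

# Sample lines copied from the trace in this message (subset).
TRACE = """\
31088420  DELIVER event=ffff8b519225c580 crtc=0 [tid=0]
31088421  ARM flip event=ffff8b4f26184c00 acrtc=ffff8b4ed1ddd000 [tid=203071]
31088422  DELIVER event=ffff8b4f26184c00 crtc=1 [tid=0]
31088425  ARM cursor event=ffff8b519225ce00 acrtc=ffff8b4ed1dde000 [tid=203071]
31088428  ARM flip event=ffff8b4f26184480 acrtc=ffff8b4ed1ddd000 [tid=208580]
31088429  DELIVER event=ffff8b4f26184480 crtc=1 [tid=0]
"""

# Hypothetical patterns matching the line format printed above.
ARM_RE = re.compile(r"^(\d+)\s+ARM (\w+) event=([0-9a-f]+) acrtc=([0-9a-f]+)")
DELIVER_RE = re.compile(r"^(\d+)\s+DELIVER event=([0-9a-f]+) crtc=(\d+)")


def undelivered_events(trace):
    """Return {event: (timestamp, kind, acrtc)} for ARMs with no later DELIVER."""
    pending = {}
    for line in trace.splitlines():
        m = ARM_RE.match(line)
        if m:
            ts, kind, event, acrtc = m.groups()
            pending[event] = (int(ts), kind, acrtc)
            continue
        m = DELIVER_RE.match(line)
        if m:
            # A DELIVER for an event that was never armed in this window
            # is ignored rather than treated as an error.
            pending.pop(m.group(2), None)
    return pending


if __name__ == "__main__":
    for event, (ts, kind, acrtc) in undelivered_events(TRACE).items():
        print(f"{ts} {kind} event={event} acrtc={acrtc} never delivered")
```

On the sample lines, the only event left pending is the cursor event armed on
acrtc ffff8b4ed1dde000, which matches the manual reading of the trace.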



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-09 16:49                     ` Michele Palazzi
@ 2026-03-10 23:50                       ` Leo Li
  2026-03-11 10:16                         ` Shengyu Qu
  2026-03-11 10:38                         ` Michele Palazzi
  0 siblings, 2 replies; 36+ messages in thread
From: Leo Li @ 2026-03-10 23:50 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu



On 2026-03-09 12:49, Michele Palazzi wrote:
> On 3/6/26 09:37, Michele Palazzi wrote:
>>
>> Your new patch is an approach i already tried, and in my previous testing i still had flip timeouts, so while i think separating the cursor events makes sense and is correct, the root cause could be different from what i initially assumed and sending the cursor events immediately was masking it by relieving pressure.
> 
> Leo i finally reproduced with a bpftrace that tracks event ARM (flip vs cursor) and DELIVER using kprobe offsets into the inlined prepare_flip_isr.
> 
> The hung commit is a cursor-only update on CRTC 0:
> 
> 31088420  dm_pflip_high_irq [tid=0]
> 31088420  DELIVER event=ffff8b519225c580 crtc=0 [tid=0]
> 31088420  WAIT_FLIP EXIT 2ms [tid=203071]
> 31088421  ARM flip event=ffff8b4f26184c00 acrtc=ffff8b4ed1ddd000 [tid=203071]
> 31088421  commit_hw_done [tid=203071]
> 31088421  WAIT_FLIP ENTER [tid=203071]
> 31088422  dm_pflip_high_irq [tid=0]
> 31088422  DELIVER event=ffff8b4f26184c00 crtc=1 [tid=0]
> 31088422  WAIT_FLIP EXIT 1ms [tid=203071]
> 31088425  ARM cursor event=ffff8b519225ce00 acrtc=ffff8b4ed1dde000 [tid=203071]
> 31088425  commit_hw_done [tid=203071]
> 31088425  WAIT_FLIP ENTER [tid=203071]
> 31088428  ARM flip event=ffff8b4f26184480 acrtc=ffff8b4ed1ddd000 [tid=208580]
> 31088428  commit_hw_done [tid=208580]
> 31088428  WAIT_FLIP ENTER [tid=208580]
> 31088429  dm_pflip_high_irq [tid=0]
> 31088429  DELIVER event=ffff8b4f26184480 crtc=1 [tid=0]
> 31088429  WAIT_FLIP EXIT 1ms [tid=208580]
>            ...
>            10036ms silence for tid=203071
>            no dm_pflip_high_irq, no DELIVER, no drm_vblank_disable_and_save on CRTC 0
>            CRTC 1 continues normally throughout
>            ...
> 31098462  WAIT_FLIP !!!TIMEOUT!!! waited 10036ms [tid=203071]
> acrtc ffff8b4ed1dde000 = CRTC 0 (confirmed from ARM+DELIVER correlation) acrtc ffff8b4ed1ddd000 = CRTC 1
> 
> Event ffff8b519225ce00 was armed as cursor on CRTC 0 and never delivered. No dm_pflip_high_irq fired for CRTC 0 during the entire 10s wait, and vblank was not disabled (no drm_vblank_disable_and_save in that window). CRTC 1 kept flowing normally throughout.

Hi Michele, no dm_pflip_high_irq firing makes sense, since there are no new fb
addresses being programmed on CRTC 0 due to the timeout.

Did you see any dm_crtc_high_irq() or dm_vupdate_high_irq() on crtc0 after the
timeout? An easy way to check would be to enable DRM vblank debug once you hit
the flip_done timeout. The drm_dbg_vbl prints will start outputting to dmesg:

    echo 0x20 > /sys/module/drm/parameters/debug
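
For reference, a minimal sketch of what that mask selects, assuming the debug
category bits of enum drm_debug_category in include/drm/drm_print.h are
unchanged on the kernel under test. The helper name decode_drm_debug is
illustrative and not a kernel or libdrm API.

```python
# Debug category bits as defined by enum drm_debug_category in
# include/drm/drm_print.h (assumed unchanged on the kernel under test).
DRM_DEBUG_CATEGORIES = {
    0x001: "CORE",
    0x002: "DRIVER",
    0x004: "KMS",
    0x008: "PRIME",
    0x010: "ATOMIC",
    0x020: "VBL",
    0x040: "STATE",
    0x080: "LEASE",
    0x100: "DP",
    0x200: "DRMRES",
}


def decode_drm_debug(mask):
    """Return the category names enabled by a drm.debug bitmask, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_CATEGORIES.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_drm_debug(0x20))
```

With 0x20 only the vblank category is enabled, which keeps the dmesg volume
well below a full drm.debug mask.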

I'm also curious what the acrtc->event and ->pflip_status end up being when the
timeout is hit. This debug diff should dump that without masking the issue:
https://pastebin.com/u7hGR7L4

Thanks,
Leo


> 
> The complete bpftrace is here https://pastebin.com/Xiju44Cy
> Note that i did this on tag v6.19
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-10 23:50                       ` Leo Li
@ 2026-03-11 10:16                         ` Shengyu Qu
  2026-03-11 10:38                         ` Michele Palazzi
  1 sibling, 0 replies; 36+ messages in thread
From: Shengyu Qu @ 2026-03-11 10:16 UTC (permalink / raw)
  To: Leo Li, Michele Palazzi
  Cc: wiagn233, amd-gfx, harry.wentland, alexander.deucher,
	christian.koenig, siqueira, Michel Dänzer


[-- Attachment #1.1.1: Type: text/plain, Size: 3251 bytes --]

Some test report: https://gitlab.freedesktop.org/drm/amd/-/issues/2950#note_3366575

On 2026/3/11 7:50, Leo Li wrote:
> 
> 
> On 2026-03-09 12:49, Michele Palazzi wrote:
>> On 3/6/26 09:37, Michele Palazzi wrote:
>>>
>>> Your new patch is an approach i already tried, and in my previous testing i still had flip timeouts, so while i think separating the cursor events makes sense and is correct, the root cause could be different from what i initially assumed and sending the cursor events immediately was masking it by relieving pressure.
>>
>> Leo i finally reproduced with a bpftrace that tracks event ARM (flip vs cursor) and DELIVER using kprobe offsets into the inlined prepare_flip_isr.
>>
>> The hung commit is a cursor-only update on CRTC 0:
>>
>> 31088420  dm_pflip_high_irq [tid=0]
>> 31088420  DELIVER event=ffff8b519225c580 crtc=0 [tid=0]
>> 31088420  WAIT_FLIP EXIT 2ms [tid=203071]
>> 31088421  ARM flip event=ffff8b4f26184c00 acrtc=ffff8b4ed1ddd000 [tid=203071]
>> 31088421  commit_hw_done [tid=203071]
>> 31088421  WAIT_FLIP ENTER [tid=203071]
>> 31088422  dm_pflip_high_irq [tid=0]
>> 31088422  DELIVER event=ffff8b4f26184c00 crtc=1 [tid=0]
>> 31088422  WAIT_FLIP EXIT 1ms [tid=203071]
>> 31088425  ARM cursor event=ffff8b519225ce00 acrtc=ffff8b4ed1dde000 [tid=203071]
>> 31088425  commit_hw_done [tid=203071]
>> 31088425  WAIT_FLIP ENTER [tid=203071]
>> 31088428  ARM flip event=ffff8b4f26184480 acrtc=ffff8b4ed1ddd000 [tid=208580]
>> 31088428  commit_hw_done [tid=208580]
>> 31088428  WAIT_FLIP ENTER [tid=208580]
>> 31088429  dm_pflip_high_irq [tid=0]
>> 31088429  DELIVER event=ffff8b4f26184480 crtc=1 [tid=0]
>> 31088429  WAIT_FLIP EXIT 1ms [tid=208580]
>>             ...
>>             10036ms silence for tid=203071
>>             no dm_pflip_high_irq, no DELIVER, no drm_vblank_disable_and_save on CRTC 0
>>             CRTC 1 continues normally throughout
>>             ...
>> 31098462  WAIT_FLIP !!!TIMEOUT!!! waited 10036ms [tid=203071]
>> acrtc ffff8b4ed1dde000 = CRTC 0 (confirmed from ARM+DELIVER correlation) acrtc ffff8b4ed1ddd000 = CRTC 1
>>
>> Event ffff8b519225ce00 was armed as cursor on CRTC 0 and never delivered. No dm_pflip_high_irq fired for CRTC 0 during the entire 10s wait, and vblank was not disabled (no drm_vblank_disable_and_save in that window). CRTC 1 kept flowing normally throughout.
> 
> Hi Michele, no dm_pflip_high_irq firing makes sense, since there's no new fb
> addresses being programmed on CRTC 0 due to the timeout.
> 
> Did you see any dm_crtc_high_irq() or dm_vupdate_high_irq() on crtc0 after the
> timeout? An easy way to check would be to enable DRM vblank debug once you hit
> the flip_done timeout. The drm_dbg_vbl prints will start outputting to dmesg:
> 
>      echo 0x20 > /sys/module/drm/parameters/debug
> 
> I'm also curious what the acrtc->event and ->pflip_status end up being when the
> timeout is hit. This debug diff should dump that without masking the issue:
> https://pastebin.com/u7hGR7L4
> 
> Thanks,
> Leo
> 
> 
>>
>> The complete bpftrace is here https://pastebin.com/Xiju44Cy
>> Note that i did this on tag v6.19
>>
>>
> 


[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 6977 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-10 23:50                       ` Leo Li
  2026-03-11 10:16                         ` Shengyu Qu
@ 2026-03-11 10:38                         ` Michele Palazzi
  2026-03-11 17:56                           ` Leo Li
  1 sibling, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-11 10:38 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/11/26 00:50, Leo Li wrote:
> 
> Hi Michele, no dm_pflip_high_irq firing makes sense, since there's no new fb
> addresses being programmed on CRTC 0 due to the timeout.
> 
> Did you see any dm_crtc_high_irq() or dm_vupdate_high_irq() on crtc0 after the
> timeout? An easy way to check would be to enable DRM vblank debug once you hit
> the flip_done timeout. The drm_dbg_vbl prints will start outputting to dmesg:
> 
>      echo 0x20 > /sys/module/drm/parameters/debug
> 
> I'm also curious what the acrtc->event and ->pflip_status end up being when the
> timeout is hit. This debug diff should dump that without masking the issue:
> https://pastebin.com/u7hGR7L4
> 
> Thanks,
> Leo


Applied your debug diff on clean v6.19, reproduced with bpftrace running 
(dm_crtc_high_irq and dm_vupdate_high_irq probes added).

dmesg:

[drm] *ERROR* [CRTC:283:crtc-0] flip_done timed out
[flip_done timeout] crtc-0 event 00000000baf6917e status 0
[flip_done timeout] crtc-1 event 0000000000000000 status 0
[flip_done timeout] crtc-2 event 0000000000000000 status 0
[flip_done timeout] crtc-3 event 0000000000000000 status 0

crtc-0 has a non-NULL event with pflip_status=0 (AMDGPU_FLIP_NONE).
Note: %p hashes the pointer, so this can't be directly correlated with the 
bpftrace output.

bpftrace:

8301644  dm_pflip_high_irq [tid=0]
8301644  DELIVER event=ffff8b87186d5a80 crtc=0 [tid=0]
8301644  WAIT_FLIP EXIT 1ms [tid=36993]
8301649  ARM cursor event=ffff8b87186d5480 acrtc=ffff8b84958f7000 [tid=176]
8301649  commit_hw_done [tid=176]
8301649  WAIT_FLIP ENTER [tid=176]
            ...
            10252ms, CRTC 1 continues normally
            ...
8311902  WAIT_FLIP !!!TIMEOUT!!! waited 10252ms [tid=176]

Between the ARM cursor at 8301649 and the TIMEOUT at 8311902:

- 692 dm_crtc_high_irq fired, all on CRTC 1 (zero DELIVER with crtc=0 in
  the window)
- 0 DELIVER for event ffff8b87186d5480
- 0 ARM or DELIVER referencing acrtc ffff8b84958f7000 (CRTC 0)
- drm_vblank_disable_and_save continued firing (on CRTC 1)
- no dm_vupdate_high_irq fired at all during the entire trace

acrtc ffff8b84958f7000 = CRTC 0

If this is not enough, I can retry with %px to get a proper pointer
correlation.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-11 10:38                         ` Michele Palazzi
@ 2026-03-11 17:56                           ` Leo Li
  2026-03-16 14:55                             ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-11 17:56 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu



On 2026-03-11 06:38, Michele Palazzi wrote:
> Applied your debug diff on clean v6.19, reproduced with bpftrace running (dm_crtc_high_irq and dm_vupdate_high_irq probes added).
> 
> dmesg:
> 
> [drm] *ERROR* [CRTC:283:crtc-0] flip_done timed out
> [flip_done timeout] crtc-0 event 00000000baf6917e status 0
> [flip_done timeout] crtc-1 event 0000000000000000 status 0
> [flip_done timeout] crtc-2 event 0000000000000000 status 0
> [flip_done timeout] crtc-3 event 0000000000000000 status 0
> 
> crtc-0 has a non-NULL event with pflip_status=0 (AMDGPU_FLIP_NONE).
> Note: %p hashes the pointer so can't directly correlate with the bpftrace output.
> 
> bpftrace:
> 
> 8301644  dm_pflip_high_irq [tid=0]
> 8301644  DELIVER event=ffff8b87186d5a80 crtc=0 [tid=0]
> 8301644  WAIT_FLIP EXIT 1ms [tid=36993]
> 8301649  ARM cursor event=ffff8b87186d5480 acrtc=ffff8b84958f7000 [tid=176]
> 8301649  commit_hw_done [tid=176]
> 8301649  WAIT_FLIP ENTER [tid=176]
>            ...
>            10252ms, CRTC 1 continues normally
>            ...
> 8311902  WAIT_FLIP !!!TIMEOUT!!! waited 10252ms [tid=176]
> Between the ARM cursor at 8301649 and the TIMEOUT at 8311902:
> 
> 692 dm_crtc_high_irq fired, all on CRTC 1 (zero DELIVER with crtc=0 in the window)
> 0 DELIVER for event ffff8b87186d5480
> 0 ARM or DELIVER referencing acrtc ffff8b84958f7000 (CRTC 0)
> drm_vblank_disable_and_save continued firing (on CRTC 1)
> no dm_vupdate_high_irq fired at all during the entire trace
> acrtc ffff8b84958f7000 = CRTC 0
> 
> If this is not enough i can retry to have the proper correlation using %px

dm_crtc_high_irq() not firing on CRTC 0 is quite strange. It suggests either
the interrupts were disabled (even though drm_vblank_disable_and_save() was
not called), or the timing generator in HW hung.

Could you dump the interrupt state registers once the timeout is hit? Using UMR:

# get the GPU instance for your 9070XT, it should be the one with "dcn401" under
# "IP Blocks:"
sudo umr -e

# Dump interrupt state, replacing --instance # with your 9070XT instance:
sudo umr --instance 1 -r '*.*.OTG_GLOBAL_SYNC_STATUS' -O bits

UMR is available on the AUR; building it from source is also straightforward:
https://aur.archlinux.org/packages/umr
https://gitlab.freedesktop.org/tomstdenis/umr

----

Another suspicion is that DGPU idle optimizations might hang the TG. If
force-disabling them fixes the issue, then it would support that suspicion:

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 2676865f6f943..eb4c5f13943e0 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2096,6 +2096,10 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
        /* Display Core create. */
        adev->dm.dc = dc_create(&init_data);
 
+       adev->dm.dc->debug.disable_idle_power_optimizations = true;
+       adev->dm.dc->debug.force_disable_subvp = true;
+       adev->dm.dc->debug.fams2_config.bits.enable = false;
+
        if (adev->dm.dc) {
                drm_info(adev_to_drm(adev), "Display Core v%s initialized on %s\n", DC_VER,
                         dce_version_to_string(adev->dm.dc->ctx->dce_version));

Thanks,
Leo

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-11 17:56                           ` Leo Li
@ 2026-03-16 14:55                             ` Michele Palazzi
  2026-03-16 15:17                               ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-16 14:55 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/11/26 18:56, Leo Li wrote:
> 
> dm_crtc_high_irq() not firing on CRTC 0 is quite strange. It suggests either
> the interrupts were disabled (even though drm_vblank_disable_and_save() was
> not called), or the timing generator in HW hanged.
> 
> Could you dump the interrupt state registers once the timeout is hit? Using UMR:
> 
> # get the GPU instance for your 9070XT, it should be the one with "dcn401" under
> # "IP Blocks:"
> sudo umr -e
> 
> # Dump interrupt state, replacing --instance # with your 9070XT instance:
> sudo umr --instance 1 -r '*.*.OTG_GLOBAL_SYNC_STATUS' -O bits
> 
> UMR is available on aur, building it is also straightforward:
> https://aur.archlinux.org/packages/umr
> https://gitlab.freedesktop.org/tomstdenis/umr
> 


It took me a while to get the umr output after the timeout (taken within 1 
second of the flip timeout):

https://pastebin.com/dz4tkfDV



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-16 14:55                             ` Michele Palazzi
@ 2026-03-16 15:17                               ` Michele Palazzi
  2026-03-16 18:39                                 ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-16 15:17 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/16/26 15:55, Michele Palazzi wrote:
> On 3/11/26 18:56, Leo Li wrote:
>>
>> dm_crtc_high_irq() not firing on CRTC 0 is quite strange. It suggests 
>> either
>> the interrupts were disabled (even though 
>> drm_vblank_disable_and_save() was
>> not called), or the timing generator in HW hanged.
>>
>> Could you dump the interrupt state registers once the timeout is hit? 
>> Using UMR:
>>
>> # get the GPU instance for your 9070XT, it should be the one with 
>> "dcn401" under
>> # "IP Blocks:"
>> sudo umr -e
>>
>> # Dump interrupt state, replacing --instance # with your 9070XT instance:
>> sudo umr --instance 1 -r '*.*.OTG_GLOBAL_SYNC_STATUS' -O bits
>>
>> UMR is available on aur, building it is also straightforward:
>> https://aur.archlinux.org/packages/umr
>> https://gitlab.freedesktop.org/tomstdenis/umr
>>
> 
> 
> took me a while to get the umr output after the timeout (taken within 1 
> second from the flip timeout)
> 
> https://pastebin.com/dz4tkfDV
> 
> 

Actually, there were 3 dumps in rapid succession; here are all 3 for 
completeness:

16 mar 15.33 umr_dump_20260316_153356.txt https://pastebin.com/LvYrjw5y
16 mar 15.35 umr_dump_20260316_153540.txt https://pastebin.com/SmSvCXva
16 mar 15.35 umr_dump_20260316_153550.txt https://pastebin.com/BbsWbbTN


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-16 15:17                               ` Michele Palazzi
@ 2026-03-16 18:39                                 ` Leo Li
  2026-03-16 18:48                                   ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-16 18:39 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu



On 2026-03-16 11:17, Michele Palazzi wrote:
>> took me a while to get the umr output after the timeout (taken within 1 second from the flip timeout)
>>
>> https://pastebin.com/dz4tkfDV
>>
>>
> 
> actually there were 3 dumps in rapid succession, here you have all 3 for completeness
> 
> 16 mar 15.33 umr_dump_20260316_153356.txt https://pastebin.com/LvYrjw5y
> 16 mar 15.35 umr_dump_20260316_153540.txt https://pastebin.com/SmSvCXva
> 16 mar 15.35 umr_dump_20260316_153550.txt https://pastebin.com/BbsWbbTN

Thanks for the dumps, looks like interrupts were disabled, which is surprising
given drm_vblank_disable_and_save() was not called. OTG0 seems to be active as
the FRAME_COUNT is incrementing.

Does force-enabling the VSTARTUP interrupt on OTG0 revive the hanging display
once the timeout happens?

sudo umr --instance 1 -wb '*.dcn410.regOTG0_OTG_GLOBAL_SYNC_STATUS.VSTARTUP_INT_EN' 1

- Leo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-16 18:39                                 ` Leo Li
@ 2026-03-16 18:48                                   ` Leo Li
  2026-03-18 11:36                                     ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-16 18:48 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu



On 2026-03-16 14:39, Leo Li wrote:
> 
> 
> On 2026-03-16 11:17, Michele Palazzi wrote:
>>> took me a while to get the umr output after the timeout (taken within 1 second from the flip timeout)
>>>
>>> https://pastebin.com/dz4tkfDV
>>>
>>>
>>
>> actually there were 3 dumps in rapid succession, here you have all 3 for completeness
>>
>> 16 mar 15.33 umr_dump_20260316_153356.txt https://pastebin.com/LvYrjw5y
>> 16 mar 15.35 umr_dump_20260316_153540.txt https://pastebin.com/SmSvCXva
>> 16 mar 15.35 umr_dump_20260316_153550.txt https://pastebin.com/BbsWbbTN
> 
> Thanks for the dumps, looks like interrupts were disabled, which is surprising
> given drm_vblank_disable_and_save() was not called. OTG0 seems to be active as
> the FRAME_COUNT is incrementing.
> 
> Does force-enabling the VSTARTUP interrupt on OTG0 revive the hanging display
> once the timeout happens?

Actually, if you manage to catch it and send the below umr command *before* the 10s
timeout expires, that'd be even more telling.
- Leo 

> 
> sudo umr --instance 1 -wb '*.dcn410.regOTG0_OTG_GLOBAL_SYNC_STATUS.VSTARTUP_INT_EN' 1
> 
> - Leo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-16 18:48                                   ` Leo Li
@ 2026-03-18 11:36                                     ` Michele Palazzi
  2026-03-20  0:52                                       ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-18 11:36 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/16/26 19:48, Leo Li wrote:
>>
>> Thanks for the dumps, looks like interrupts were disabled, which is surprising
>> given drm_vblank_disable_and_save() was not called. OTG0 seems to be active as
>> the FRAME_COUNT is incrementing.
>>
>> Does force-enabling the VSTARTUP interrupt on OTG0 revive the hanging display
>> once the timeout happens?
> 
> Actually, if you manage to catch it and send the below umr command *before* the 10s
> timeout expires, that'd be even more telling.
> - Leo
> 
>>
>> sudo umr --instance 1 -wb '*.dcn410.regOTG0_OTG_GLOBAL_SYNC_STATUS.VSTARTUP_INT_EN' 1
>>
>> - Leo
> 

OK, I managed to do exactly that: I sent your umr command before the 
timeout, and the frozen display did not recover (it recovered only after 
unplugging/replugging the DP cable, as always).

UMR output before/after force enable
https://pastebin.com/hhapxBev


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-18 11:36                                     ` Michele Palazzi
@ 2026-03-20  0:52                                       ` Leo Li
  2026-03-20  1:33                                         ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Leo Li @ 2026-03-20  0:52 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 2026-03-18 07:36, Michele Palazzi wrote:
> Ok i managed to do exactly that, before the timeout sent your umr command, frozen display did not recover (recovered only after unplugging/replugging the DP cable as always)
> 
> UMR output before/after force enable
> https://pastebin.com/hhapxBev

Hmm, is the "before" captured after the display hangs, but before the flip_done timeout error in dmesg?
And the "after" is captured after writing VSTARTUP_INT_EN=1, but also before flip_done timeout error in dmesg?

If so, it seems my previous idea that interrupts got disabled is wrong, since OTG0 has VSTARTUP enabled in "before".

Did you happen to try disabling some idle optimization features mentioned in a previous reply?
https://lore.kernel.org/amd-gfx/1356e93b-af76-47f3-afc5-29535a9518bb@amd.com/

Thanks,
Leo


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-20  0:52                                       ` Leo Li
@ 2026-03-20  1:33                                         ` Michele Palazzi
  2026-03-31 12:57                                           ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-03-20  1:33 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

On 3/20/26 01:52, Leo Li wrote:
> 
> Hmm, is the "before" captured after the display hangs, but before the flip_done timeout error in dmesg?

yes

> And the "after" is captured after writing VSTARTUP_INT_EN=1, but also before flip_done timeout error in dmesg?

also yes
Basically, I detected the timeout with bpftrace and took umr dumps
before and after setting VSTARTUP_INT_EN=1.

> Did you happen to try disabling some idle optimization features mentioned in a previous reply?
> https://lore.kernel.org/amd-gfx/1356e93b-af76-47f3-afc5-29535a9518bb@amd.com/

Only briefly; I need more time on it to assess it properly.


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-20  1:33                                         ` Michele Palazzi
@ 2026-03-31 12:57                                           ` Michele Palazzi
  0 siblings, 0 replies; 36+ messages in thread
From: Michele Palazzi @ 2026-03-31 12:57 UTC (permalink / raw)
  To: Leo Li
  Cc: amd-gfx, harry.wentland, alexander.deucher, christian.koenig,
	siqueira, Michel Dänzer, Shengyu Qu

> On 3/20/26 01:52, Leo Li wrote:

>> Did you happen to try disabling some idle optimization features
>> mentioned in a previous reply?
>> https://lore.kernel.org/amd-gfx/1356e93b-af76-47f3-afc5-29535a9518bb@amd.com/
> 


Sorry for the delay, I have been away on business travel. Since resuming
testing, I have not been able to reproduce the timeout with the idle
optimization features disabled.


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-23 15:27 ` Leo Li
  2026-02-27  8:53   ` Michele Palazzi
@ 2026-02-27 19:43   ` Alex Deucher
  2026-03-02  8:53   ` Michel Dänzer
  2 siblings, 0 replies; 36+ messages in thread
From: Alex Deucher @ 2026-02-27 19:43 UTC (permalink / raw)
  To: Leo Li
  Cc: Michele Palazzi, amd-gfx, harry.wentland, rodrigo.siqueira,
	alexander.deucher, christian.koenig

On Mon, Feb 23, 2026 at 11:04 AM Leo Li <sunpeng.li@amd.com> wrote:
>
>
>
> On 2026-02-17 14:16, Michele Palazzi wrote:
> > Intermittent flip_done timeouts have been observed on AMD GPUs
> > since kernel 6.12.
> >
> > Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
> > incorrectly consume events meant for plane flips during cursor-only
> > updates. This happens because cursor commits defer event delivery to
> > the vblank handler, which checks (pflip_status != SUBMITTED). Since
> > AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
> > event slot for subsequent plane flips, leading to timeouts.
> >
> > The potential for a race was present since commit 473683a03495
> > ("drm/amd/display: Create a file dedicated for CRTC"), then
> > commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
> > policy for older ASICs") made it happen by reducing vblank
> > off-delay and making disables happen much more frequently
> > between commits.
> >
> > Fix this by sending cursor-only vblank events immediately in
> > amdgpu_dm_commit_planes(). Since cursor updates are committed to
> > hardware immediately, deferring the event is unnecessary and
> > creates race windows for event stealing or starvation if vblank
> > is disabled before the handler runs.
> >
> > Tested on DCN 2.1, 3.2, and 3.5.
> >
> > Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
> > Signed-off-by: Michele Palazzi <sysdadmin@m1k.cloud>
> > ---
> > I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE first, 9070 XT now)
> > since kernel 6.12. The hang occurs during normal desktop usage but is much easier to
> > trigger under specific conditions involving cursor movements and plane updates.
> >
> > Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787
> >
> > Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
> > Dual DP monitors, 2560x1440, 144Hz
> > Desktop: KDE Plasma Wayland
> >
> > The hang was initially observed while using Cisco Webex
> > (XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U): start a meeting
> > and screen-share a window running the Omnissa Horizon client, then move the
> > cursor around between the two monitors and the shared window.
> > Under these conditions the hang usually occurs within a few hours.
> >
> > Enabling drm.debug masks the issue entirely: the overhead
> > changes the timing enough to close the race window.
> > So I added debug printks to amdgpu_dm.c and used a small bpftrace script to log the
> > pageflip lifecycle with per-thread tracking.
> >
> > bpftrace script:
> >
> >   config = { missing_probes = "warn" }
> >   BEGIN { printf("=== flip_done tracer started ===\n"); }
> >   kprobe:drm_crtc_vblank_on_config       { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
> >   kprobe:drm_vblank_disable_and_save     { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
> >   kprobe:dm_pflip_high_irq               { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
> >   kprobe:drm_crtc_send_vblank_event      { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
> >   kprobe:drm_vblank_put                  { printf("%lu drm_vblank_put\n", nsecs/1000000); }
> >   kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
> >   kprobe:manage_dm_interrupts            { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
> >   kprobe:drm_atomic_helper_wait_for_flip_done {
> >       @wait_start[tid] = nsecs;
> >       printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
> >   }
> >   kretprobe:drm_atomic_helper_wait_for_flip_done {
> >       $start = @wait_start[tid];
> >       $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
> >       if ($ms > 100) {
> >           printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
> >                  nsecs/1000000, $ms, tid);
> >       } else {
> >           printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
> >                  nsecs/1000000, $ms, tid);
> >       }
> >       delete(@wait_start[tid]);
> >   }
> >   interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
> >   END { printf("=== stopped ===\n"); clear(@wait_start); }
> >
> > The timeout was captured at 17:35:41 CET. The trace timestamps
> > match dmesg exactly (9942110ms = dmesg 9942.110s).
> >
> > dmesg output from the timeout:
> >
> >   [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
> >   [ 9942.110380] [FLIP_DEBUG]  crtc:0 pflip_status=0 event=00000000a0636a23
> >                   vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
> >                   disable_immediate=0 active_planes=1
> >
> > pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip was never completed
> > but the status was already reset to NONE. vblank was enabled, refcount was held, so vblank
> > IRQs were firing throughout the wait.
> >
> > The bpftrace captured the exact sequence leading up to the hang. Here's the critical
> > timeline at ~17:35:31 (9931771), about 10 seconds before the timeout fired:
> >
> >   9931755 drm_atomic_helper_commit_hw_done
> >   9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> >   9931756 dm_pflip_high_irq                           <- normal plane flip, last good one
> >   9931756 drm_crtc_send_vblank_event
> >   9931756 drm_vblank_put
> >   9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
> >   9931771 drm_vblank_disable_and_save                 <- vblank timer fires
> >   9931771 drm_crtc_send_vblank_event                  <- event sent WITHOUT dm_pflip_high_irq
> >   9931771 drm_vblank_put
> >   9931771 drm_atomic_helper_commit_hw_done
> >   9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> >   9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
> >   9931773 drm_atomic_helper_commit_hw_done
> >   9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- new commit
> >   9931777 dm_pflip_high_irq                           <- pflip fires, completes the wrong one
> >   9931777 drm_crtc_send_vblank_event
> >   9931777 drm_vblank_put
> >   9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
> >   9931781 drm_atomic_helper_commit_hw_done
> >   9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- THIS ONE HANGS
> >   ... 10328ms of silence ...
> >   9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]
> >
> > The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq. This is
> > amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The problem is that the
> > cursor-only commit path in amdgpu_dm_commit_planes() stores the event in acrtc->event
> > and defers delivery to the vblank handler. This creates two race conditions:
> >
> > - The vblank handler checks (pflip_status != SUBMITTED) which also
> >   matches NONE, so it can consume events meant for plane flips. The subsequent
> >   dm_pflip_high_irq finds no event, and the next commit hangs.
> >
> > - If vblank is disabled by the off-delay timer before the handler
> >   runs, the PENDING cursor event is never delivered and the commit hangs.
> >
> > The fix is to send cursor-only events immediately via drm_crtc_send_vblank_event()
> > in amdgpu_dm_commit_planes() instead of deferring to the vblank handler. The cursor
> > update is already committed to hardware at this point, so immediate delivery is correct.
> > This eliminates both race conditions by removing cursor events from the deferred
> > delivery path entirely:
> >
> > - Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
> > - Cursor updates: sent immediately in commit_planes (no deferral, no races)
> >
> > From git history the check in amdgpu_dm_crtc_handle_vblank() has been like this since
> > 473683a03495 ("drm/amd/display: Create a file dedicated for CRTC", 2022)
> > which moved this code from amdgpu_dm.c, but it was practically impossible to trigger
> > because the default drm_vblank_offdelay was 5000ms.
> > Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") in 6.12
> > changed all ASICs to use drm_crtc_vblank_on_config() with a computed off-delay
> > of roughly 2 frames (~14ms at 144Hz).
> > This made drm_vblank_disable_and_save fire hundreds of times more often, turning
> > a theoretical race into reality. The bpftrace log is full of drm_vblank_disable_and_save
> > events interleaved with the commit sequence.
> >
> > This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5 (9070 XT).
> > Under high-frequency glxgears + cursor jiggling test the patch successfully intercepted
> > the race thousands of times without a single timeout.
> > Also running this on the main system without issues.
> >
> > My earlier, rushed attempt at addressing this
> > (https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html)
> > is no longer needed.
> >
> > Patch applies cleanly on top of tag v6.19.
>
> Really nice debugging work, thanks for catching this!
>
> Ideally, the cursor event should be delivered when hardware latches onto the new
> cursor info and starts scanning it out. The latching event fires an interrupt
> that should be handled by dm_crtc_high_irq().
>
> dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
> onto a new fb address; I don't think it actually fires when there's a
> cursor-only update. I think if we really want to do it right, we can have
> another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
> the event in crtc_high_irq().
>
> In any case, I don't foresee any major issues with delivering the event early.
> And since it fixes an ongoing issue:
>
> Reviewed-by: Leo Li <sunpeng.li@amd.com>

Leo, I assume you are planning to push this?

Alex

>
> Thanks!
> Leo
>
> >
> >  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index a8a59126b2d2..35987ce80c71 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
> >       } else if (cursor_update && acrtc_state->active_planes > 0) {
> >               spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
> >               if (acrtc_attach->base.state->event) {
> > -                     drm_crtc_vblank_get(pcrtc);
> > -                     acrtc_attach->event = acrtc_attach->base.state->event;
> > +                     drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
> >                       acrtc_attach->base.state->event = NULL;
> >               }
> >               spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
>


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-23 15:27 ` Leo Li
  2026-02-27  8:53   ` Michele Palazzi
  2026-02-27 19:43   ` Alex Deucher
@ 2026-03-02  8:53   ` Michel Dänzer
  2026-03-02 22:14     ` Leo Li
  2 siblings, 1 reply; 36+ messages in thread
From: Michel Dänzer @ 2026-03-02  8:53 UTC (permalink / raw)
  To: Leo Li, Michele Palazzi, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig

On 2/23/26 16:27, Leo Li wrote:
> 
> Ideally, the cursor event should be delivered when hardware latches onto the new
> cursor info and starts scanning it out. The latching event fires an interrupt
> that should be handled by dm_crtc_high_irq().
> 
> dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
> onto a new fb address; I don't think it actually fires when there's a
> cursor-only update. I think if we really want to do it right, we can have
> another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
> the event in crtc_high_irq().
> 
> In any case, I don't foresee any major issues with delivering the event early.

If the event having wrong sequence & timestamp values isn't considered a "major issue", we might as well not bother and just put random values in there. ;)

Compositors actually make use of the timestamp for frame scheduling.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-03-02  8:53   ` Michel Dänzer
@ 2026-03-02 22:14     ` Leo Li
  0 siblings, 0 replies; 36+ messages in thread
From: Leo Li @ 2026-03-02 22:14 UTC (permalink / raw)
  To: Michel Dänzer, Michele Palazzi, amd-gfx
  Cc: harry.wentland, alexander.deucher, christian.koenig



On 2026-03-02 03:53, Michel Dänzer wrote:
> On 2/23/26 16:27, Leo Li wrote:
>>
>> Ideally, the cursor event should be delivered when hardware latches onto the new
>> cursor info and starts scanning it out. The latching event fires an interrupt
>> that should be handled by dm_crtc_high_irq().
>>
>> dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
>> onto a new fb address; I don't think it actually fires when there's a
>> cursor-only update. I think if we really want to do it right, we can have
>> another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
>> the event in crtc_high_irq().
>>
>> In any case, I don't foresee any major issues with delivering the event early.
> 
> If the event having wrong sequence & timestamp values isn't considered a "major issue", we might as well not bother and just put random values in there. ;)
> 
> Compositors actually make use of the timestamp for frame scheduling.

Yeah, point taken, I read the other thread too late :)
- Leo

> 
> 



* [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
@ 2026-02-18  0:31 Michele Palazzi
  2026-02-18  9:41 ` Michel Dänzer
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-18  0:31 UTC (permalink / raw)
  To: amd-gfx
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li,
	Michele Palazzi

Intermittent flip_done timeouts have been observed on AMD GPUs
since kernel 6.12.

Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
incorrectly consume events meant for plane flips during cursor-only
updates. This happens because cursor commits defer event delivery to
the vblank handler, which checks (pflip_status != SUBMITTED). Since
AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
event slot for subsequent plane flips, leading to timeouts.

The potential for a race was present since commit 473683a03495
("drm/amd/display: Create a file dedicated for CRTC"), then
commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
policy for older ASICs") made it happen by reducing vblank
off-delay and making disables happen much more frequently
between commits.

Fix this by sending cursor-only vblank events immediately in
amdgpu_dm_commit_planes(). Since cursor updates are committed to
hardware immediately, deferring the event is unnecessary and
creates race windows for event stealing or starvation if vblank
is disabled before the handler runs.

Tested on DCN 2.1, 3.2, and 3.5.

Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
Signed-off-by: Michele Palazzi <sysdadmin@m1k.cloud>
---
I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE first, 9070 XT now)
since kernel 6.12. The hang occurs during normal desktop usage but is much easier to
trigger under specific conditions involving cursor movements and plane updates.

Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787

Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
Dual DP monitors, 2560x1440, 144Hz
Desktop: KDE Plasma Wayland

The hang was initially observed while using Cisco Webex
(XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U): start a meeting
and screen-share a window running the Omnissa Horizon client, then move the
cursor around between the two monitors and the shared window.
Under these conditions the hang usually occurs within a few hours.

Enabling drm.debug masks the issue entirely: the overhead
changes the timing enough to close the race window.
So I added debug printks to amdgpu_dm.c and used a small bpftrace script to log the
pageflip lifecycle with per-thread tracking.

bpftrace script:

  config = { missing_probes = "warn" }
  BEGIN { printf("=== flip_done tracer started ===\n"); }
  kprobe:drm_crtc_vblank_on_config       { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
  kprobe:drm_vblank_disable_and_save     { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
  kprobe:dm_pflip_high_irq               { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
  kprobe:drm_crtc_send_vblank_event      { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
  kprobe:drm_vblank_put                  { printf("%lu drm_vblank_put\n", nsecs/1000000); }
  kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
  kprobe:manage_dm_interrupts            { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
  kprobe:drm_atomic_helper_wait_for_flip_done {
      @wait_start[tid] = nsecs;
      printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
  }
  kretprobe:drm_atomic_helper_wait_for_flip_done {
      $start = @wait_start[tid];
      $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
      if ($ms > 100) {
          printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
                 nsecs/1000000, $ms, tid);
      } else {
          printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
                 nsecs/1000000, $ms, tid);
      }
      delete(@wait_start[tid]);
  }
  interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
  END { printf("=== stopped ===\n"); clear(@wait_start); }

The timeout was captured at 17:35:41 CET. The trace timestamps
match dmesg exactly (9942110ms = dmesg 9942.110s).

dmesg output from the timeout:

  [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
  [ 9942.110380] [FLIP_DEBUG]  crtc:0 pflip_status=0 event=00000000a0636a23
                  vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
                  disable_immediate=0 active_planes=1

pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip was never completed
but the status was already reset to NONE. vblank was enabled, refcount was held, so vblank
IRQs were firing throughout the wait.

The bpftrace captured the exact sequence leading up to the hang. Here's the critical
timeline at ~17:35:31 (9931771), about 10 seconds before the timeout fired:

  9931755 drm_atomic_helper_commit_hw_done
  9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
  9931756 dm_pflip_high_irq                           <- normal plane flip, last good one
  9931756 drm_crtc_send_vblank_event
  9931756 drm_vblank_put
  9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
  9931771 drm_vblank_disable_and_save                 <- vblank timer fires
  9931771 drm_crtc_send_vblank_event                  <- event sent WITHOUT dm_pflip_high_irq
  9931771 drm_vblank_put
  9931771 drm_atomic_helper_commit_hw_done
  9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
  9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
  9931773 drm_atomic_helper_commit_hw_done
  9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- new commit
  9931777 dm_pflip_high_irq                           <- pflip fires, completes the wrong one
  9931777 drm_crtc_send_vblank_event
  9931777 drm_vblank_put
  9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
  9931781 drm_atomic_helper_commit_hw_done
  9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- THIS ONE HANGS
  ... 10328ms of silence ...
  9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]

The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq. This is
amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The problem is that the
cursor-only commit path in amdgpu_dm_commit_planes() stores the event in acrtc->event
and defers delivery to the vblank handler. This creates two race conditions:

- The vblank handler checks (pflip_status != SUBMITTED) which also
  matches NONE, so it can consume events meant for plane flips. The subsequent
  dm_pflip_high_irq finds no event, and the next commit hangs.

- If vblank is disabled by the off-delay timer before the handler
  runs, the PENDING cursor event is never delivered and the commit hangs.
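
The first race can be sketched as a small user-space model. This is an
illustration only, not driver code: the model_* names, the string events,
and the single shared slot are assumptions made for the sketch, and the
interleaving it allows (an event sitting in the slot while the status is
still NONE) is a hypothetical ordering chosen to show why the
(pflip_status != SUBMITTED) check is too broad.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; not the real amdgpu definitions. */
enum model_flip_status {
	MODEL_FLIP_NONE,
	MODEL_FLIP_PENDING,
	MODEL_FLIP_SUBMITTED
};

struct model_crtc {
	enum model_flip_status pflip_status;
	const char *event;	/* one shared slot, standing in for acrtc->event */
};

/* Store an event in the shared slot; the slot must be free. */
static void model_store_event(struct model_crtc *c, const char *ev)
{
	assert(c->event == NULL);
	c->event = ev;
}

/*
 * Vblank handler model: delivers whenever the status is not SUBMITTED,
 * which is also true for NONE. Returns the delivered event, or NULL.
 */
static const char *model_handle_vblank(struct model_crtc *c)
{
	const char *sent = NULL;

	if (c->pflip_status != MODEL_FLIP_SUBMITTED) {
		sent = c->event;
		c->event = NULL;
		c->pflip_status = MODEL_FLIP_NONE;
	}
	return sent;
}

/* Pageflip IRQ model: delivers only when a flip is SUBMITTED. */
static const char *model_handle_pflip(struct model_crtc *c)
{
	const char *sent = NULL;

	if (c->pflip_status == MODEL_FLIP_SUBMITTED) {
		sent = c->event;
		c->event = NULL;
		c->pflip_status = MODEL_FLIP_NONE;
	}
	return sent;
}
```

In this model, storing a plane-flip event while the status is NONE and then
calling model_handle_vblank() hands the event to the vblank path. A later
model_handle_pflip() with the status set to SUBMITTED returns NULL, which
corresponds to the pageflip IRQ finding no event to complete.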

The fix is to send cursor-only events immediately via drm_crtc_send_vblank_event()
in amdgpu_dm_commit_planes() instead of deferring to the vblank handler. The cursor
update is already committed to hardware at this point, so immediate delivery is correct.
This eliminates both race conditions by removing cursor events from the deferred
delivery path entirely:

- Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
- Cursor updates: sent immediately in commit_planes (no deferral, no races)

From git history the check in amdgpu_dm_crtc_handle_vblank() has been like this since
473683a03495 ("drm/amd/display: Create a file dedicated for CRTC", 2022)
which moved this code from amdgpu_dm.c, but it was practically impossible to trigger
because the default drm_vblank_offdelay was 5000ms.
Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") in 6.12
changed all ASICs to use drm_crtc_vblank_on_config() with a computed off-delay
of roughly 2 frames (~14ms at 144Hz).
This made drm_vblank_disable_and_save fire hundreds of times more often, turning
a theoretical race into reality. The bpftrace log is full of drm_vblank_disable_and_save
events interleaved with the commit sequence.
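
The arithmetic behind these figures can be checked with a short sketch. It
assumes the off-delay is exactly 2 frame periods, as stated above; the
helper names are hypothetical and the driver's actual computation may
differ.

```c
#include <assert.h>

/* Off-delay in milliseconds for a number of frames at a given refresh rate. */
static double model_offdelay_ms(unsigned int frames, double refresh_hz)
{
	assert(refresh_hz > 0.0);
	return frames * 1000.0 / refresh_hz;
}

/* How many times shorter the new off-delay is than the old one. */
static double model_offdelay_ratio(double old_ms, double new_ms)
{
	assert(new_ms > 0.0);
	return old_ms / new_ms;
}
```

With frames = 2 and refresh_hz = 144 the delay is about 13.9 ms, and the
old 5000 ms default is 360 times longer. That is consistent with the
"hundreds of times" estimate, although the actual disable frequency also
depends on how often commits re-enable vblank.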

This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5 (9070 XT).
Under high-frequency glxgears + cursor jiggling test the patch successfully intercepted
the race thousands of times without a single timeout.
Also running this on the main system without issues.

My earlier, rushed attempt at addressing this
(https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html)
is no longer needed.

Patch applies cleanly on top of tag v6.19.

 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index a8a59126b2d2..35987ce80c71 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 	} else if (cursor_update && acrtc_state->active_planes > 0) {
 		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
 		if (acrtc_attach->base.state->event) {
-			drm_crtc_vblank_get(pcrtc);
-			acrtc_attach->event = acrtc_attach->base.state->event;
+			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
 			acrtc_attach->base.state->event = NULL;
 		}
 		spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
-- 
2.53.0


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-18  0:31 Michele Palazzi
@ 2026-02-18  9:41 ` Michel Dänzer
  2026-02-18 10:09   ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michel Dänzer @ 2026-02-18  9:41 UTC (permalink / raw)
  To: Michele Palazzi, amd-gfx
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li

On 2/18/26 01:31, Michele Palazzi wrote:
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index a8a59126b2d2..35987ce80c71 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>  	} else if (cursor_update && acrtc_state->active_planes > 0) {
>  		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>  		if (acrtc_attach->base.state->event) {
> -			drm_crtc_vblank_get(pcrtc);
> -			acrtc_attach->event = acrtc_attach->base.state->event;
> +			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);

Can this code run before start of vblank? If yes, the event would have the wrong sequence number and timestamp.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-18  9:41 ` Michel Dänzer
@ 2026-02-18 10:09   ` Michele Palazzi
  2026-02-19 11:09     ` Michel Dänzer
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-18 10:09 UTC (permalink / raw)
  To: Michel Dänzer, amd-gfx
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li

Yes, but the original code path had the same problem.

Would drm_crtc_arm_vblank_event() be more appropriate here? The concern 
is that it reintroduces the starvation race if the vblank off-delay 
fires before the interrupt.

On 2/18/26 10:41, Michel Dänzer wrote:
> On 2/18/26 01:31, Michele Palazzi wrote:
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index a8a59126b2d2..35987ce80c71 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>   	} else if (cursor_update && acrtc_state->active_planes > 0) {
>>   		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>>   		if (acrtc_attach->base.state->event) {
>> -			drm_crtc_vblank_get(pcrtc);
>> -			acrtc_attach->event = acrtc_attach->base.state->event;
>> +			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
> 
> Can this code run before start of vblank? If yes, the event would have the wrong sequence number and timestamp.
> 
> 



* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-18 10:09   ` Michele Palazzi
@ 2026-02-19 11:09     ` Michel Dänzer
  2026-02-19 13:08       ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michel Dänzer @ 2026-02-19 11:09 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li, amd-gfx


[ Fixed up the top-posting, please don't ]

On 2/18/26 11:09, Michele Palazzi wrote:
> On 2/18/26 10:41, Michel Dänzer wrote:
>> On 2/18/26 01:31, Michele Palazzi wrote:
>>>
>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> index a8a59126b2d2..35987ce80c71 100644
>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>>       } else if (cursor_update && acrtc_state->active_planes > 0) {
>>>           spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>>>           if (acrtc_attach->base.state->event) {
>>> -            drm_crtc_vblank_get(pcrtc);
>>> -            acrtc_attach->event = acrtc_attach->base.state->event;
>>> +            drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
>>
>> Can this code run before start of vblank? If yes, the event would have the wrong sequence number and timestamp.
>
> Yes, but the original code path had the same problem.

Are you sure?

I'd expect the original code to send the event only when an interrupt fires during vblank, at which point the values are correct.


> Would drm_crtc_arm_vblank_event() be more appropriate here? The concern is that it reintroduces the starvation race if the vblank off-delay fires before the interrupt.

Not sure that could happen, some of the issues discussed in the comment above drm_crtc_arm_vblank_event might apply though.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-19 11:09     ` Michel Dänzer
@ 2026-02-19 13:08       ` Michele Palazzi
  2026-02-19 13:59         ` Michel Dänzer
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-19 13:08 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li, amd-gfx


On 2/19/26 12:09, Michel Dänzer wrote:
> 
> [ Fixed up the top-posting, please don't ]

Thanks, I won't.

> On 2/18/26 11:09, Michele Palazzi wrote:
>> On 2/18/26 10:41, Michel Dänzer wrote:
>>> On 2/18/26 01:31, Michele Palazzi wrote:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>> index a8a59126b2d2..35987ce80c71 100644
>>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>>>        } else if (cursor_update && acrtc_state->active_planes > 0) {
>>>>            spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>>>>            if (acrtc_attach->base.state->event) {
>>>> -            drm_crtc_vblank_get(pcrtc);
>>>> -            acrtc_attach->event = acrtc_attach->base.state->event;
>>>> +            drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
>>>
>>> Can this code run before start of vblank? If yes, the event would have the wrong sequence number and timestamp.
>>
>> Yes, but the original code path had the same problem.
> 
> Are you sure?
> 
> I'd expect the original code to send the event only when an interrupt fires during vblank, at which point the values are correct.

Actually, you are indeed right: this approach can deliver cursor events
slightly early, which is not noticeable but wrong nonetheless.

>> Would drm_crtc_arm_vblank_event() be more appropriate here? The concern is that it reintroduces the starvation race if the vblank off-delay fires before the interrupt.
> 
> Not sure that could happen, some of the issues discussed in the comment above drm_crtc_arm_vblank_event might apply though.

If I understand correctly, with drm_crtc_arm_vblank_event() we would
still get an incorrect sequence number, although delayed this time. If
so, I am not sure a v2 using it would be any better.

We could maybe add a dedicated flag to amdgpu_dm_crtc_handle_vblank()
instead, but that is something I already tried before submitting this,
and it produced the second race condition, so that alone is not enough.

If there is a suggested approach, I am willing to explore it.


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-19 13:08       ` Michele Palazzi
@ 2026-02-19 13:59         ` Michel Dänzer
  2026-02-19 15:56           ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michel Dänzer @ 2026-02-19 13:59 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li, amd-gfx

On 2/19/26 14:08, Michele Palazzi wrote:
> On 2/19/26 12:09, Michel Dänzer wrote:
>> On 2/18/26 11:09, Michele Palazzi wrote:
>>> On 2/18/26 10:41, Michel Dänzer wrote:
>>>> On 2/18/26 01:31, Michele Palazzi wrote:
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> index a8a59126b2d2..35987ce80c71 100644
>>>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>>>>        } else if (cursor_update && acrtc_state->active_planes > 0) {
>>>>>            spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>>>>>            if (acrtc_attach->base.state->event) {
>>>>> -            drm_crtc_vblank_get(pcrtc);
>>>>> -            acrtc_attach->event = acrtc_attach->base.state->event;
>>>>> +            drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
>>>>
>>>> Can this code run before start of vblank? If yes, the event would have the wrong sequence number and timestamp.
>>>
>>> Yes, but the original code path had the same problem.
>>
>> Are you sure?
>>
>> I'd expect the original code to send the event only when an interrupt fires during vblank, at which point the values are correct.
> 
> Actually you are indeed right, this approach potentially produces slightly anticipated cursor events, not noticeable but wrong nonetheless.

"not noticeable" by what? It might be noticeable e.g. for mutter's KMS thread deadline timer.


>>> Would drm_crtc_arm_vblank_event() be more appropriate here? The concern is that it reintroduces the starvation race if the vblank off-delay fires before the interrupt.
>>
>> Not sure that could happen, some of the issues discussed in the comment above drm_crtc_arm_vblank_event might apply though.
> 
> If i understand correctly, using drm_crtc_arm_vblank_event() we would still be having incorrect sequence, although delayed this time,

Not if used correctly, though per the comment above the function, there are various potential races as well.

> We could maybe add a dedicated flag to amdgpu_dm_crtc_handle_vblank() instead, but it's something i already tried before submitting this and it produced the second race condition, so that alone is not enough.
> 
> If there is a suggested approach i am willing to explore it

Can't the issue be solved by fixing the pflip_status handling in the vblank handler?
I guess that might also hit the second race condition:

> - If vblank is disabled by the off-delay timer before the handler
>   runs, the PENDING cursor event is never delivered and the commit hangs.

That sounds like the drm_crtc_vblank_get/put handling might be incorrect in amdgpu_dm.

In a nutshell, the vblank interrupt is kept enabled as long as there have been more 
drm_crtc_vblank_get calls than _put ones for the CRTC. I.e. amdgpu_dm needs to call the former under circumstances where it needs the interrupt to be on, and the latter only once it's no longer needed for those circumstances (in this case when sending the event).
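
The counting rule described above can be sketched as a toy user-space
model. This is only an illustration under stated assumptions, not
amdgpu_dm or DRM core code: every toy_* name is hypothetical, the real
drm_crtc_vblank_get() returns an error code that a caller has to check,
and the real disable path runs from a timer in the DRM core rather than
from a plain function call.

```c
/* Toy user-space model of vblank reference counting; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>

struct toy_crtc {
	int vblank_refcount;	/* get calls not yet matched by a put */
	bool irq_enabled;	/* models whether the vblank interrupt is on */
	bool event_pending;	/* an event is waiting for the vblank handler */
	int events_sent;
};

/* Models drm_crtc_vblank_get(): take a reference and turn the interrupt on. */
static void toy_vblank_get(struct toy_crtc *c)
{
	c->vblank_refcount++;
	c->irq_enabled = true;
}

/* Models drm_crtc_vblank_put(): drop a reference, the interrupt stays on. */
static void toy_vblank_put(struct toy_crtc *c)
{
	assert(c->vblank_refcount > 0);
	c->vblank_refcount--;
}

/* Models the off-delay timer: it may only disable an unreferenced interrupt. */
static void toy_off_delay_timer(struct toy_crtc *c)
{
	if (c->vblank_refcount == 0)
		c->irq_enabled = false;
}

/* Commit path: queue an event and hold a reference until it is sent. */
static void toy_queue_event(struct toy_crtc *c)
{
	toy_vblank_get(c);
	c->event_pending = true;
}

/* Vblank interrupt: returns false if the interrupt was off and never ran. */
static bool toy_vblank_irq(struct toy_crtc *c)
{
	if (!c->irq_enabled)
		return false;
	if (c->event_pending) {
		c->event_pending = false;
		c->events_sent++;
		toy_vblank_put(c);	/* reference dropped only after sending */
	}
	return true;
}
```

In this model, calling toy_off_delay_timer() between toy_queue_event()
and toy_vblank_irq() cannot turn the interrupt off, because the
reference taken at queue time is still held. A pending event that is
never delivered would therefore point at a missing get or an early put.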


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-19 13:59         ` Michel Dänzer
@ 2026-02-19 15:56           ` Michele Palazzi
  2026-02-19 16:02             ` Michel Dänzer
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-19 15:56 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li, amd-gfx


On 2/19/26 14:59, Michel Dänzer wrote:
> Can't the issue be solved by fixing the pflip_status handling in the vblank handler?
> I guess that might also hit the second race condition:
> 
>> - If vblank is disabled by the off-delay timer before the handler
>>    runs, the PENDING cursor event is never delivered and the commit hangs.
> 
> That sounds like the drm_crtc_vblank_get/put handling might be incorrect in amdgpu_dm.
> 
> In a nutshell, the vblank interrupt is kept enabled as long as there have been more
> drm_crtc_vblank_get calls than _put ones for the CRTC. I.e. amdgpu_dm needs to call the former under circumstances where it needs the interrupt to be on, and the latter only once it's no longer needed for those circumstances (in this case when sending the event).


The get/put pairing seems correct; the issue is that cursor and pflip
share the same acrtc->event slot, so the pflip_status check in the
vblank handler can race. Could adding a dedicated cursor_event field,
separate from the event used by pflip, solve the whole thing?


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-19 15:56           ` Michele Palazzi
@ 2026-02-19 16:02             ` Michel Dänzer
  2026-02-20 11:10               ` Michele Palazzi
  0 siblings, 1 reply; 36+ messages in thread
From: Michel Dänzer @ 2026-02-19 16:02 UTC (permalink / raw)
  To: Michele Palazzi
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li, amd-gfx

On 2/19/26 16:56, Michele Palazzi wrote:
> On 2/19/26 14:59, Michel Dänzer wrote:
>> Can't the issue be solved by fixing the pflip_status handling in the vblank handler?
>> I guess that might also hit the second race condition:
>>
>>> - If vblank is disabled by the off-delay timer before the handler
>>>    runs, the PENDING cursor event is never delivered and the commit hangs.
>>
>> That sounds like the drm_crtc_vblank_get/put handling might be incorrect in amdgpu_dm.
>>
>> In a nutshell, the vblank interrupt is kept enabled as long as there have been more
>> drm_crtc_vblank_get calls than _put ones for the CRTC. I.e. amdgpu_dm needs to call the former under circumstances where it needs the interrupt to be on, and the latter only once it's no longer needed for those circumstances (in this case when sending the event).
> 
> The get/put pairing seems correct,

"If vblank is disabled by the off-delay timer before the handler runs, the PENDING cursor event is never delivered" indicates otherwise. If the handling was correct, the vblank interrupt should never be disabled before the handler runs.


> the issue is that cursor and pflip share the same acrtc->event slot, so the pflip_status check in the vblank handler can race. Adding a dedicated cursor_event field, separate from event used by pflip, could maybe solve the whole thing?

Maybe? If the vblank event handler never needs to send an event for a flip.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast


* Re: [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
  2026-02-19 16:02             ` Michel Dänzer
@ 2026-02-20 11:10               ` Michele Palazzi
  0 siblings, 0 replies; 36+ messages in thread
From: Michele Palazzi @ 2026-02-20 11:10 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li, amd-gfx

On 2/19/26 17:02, Michel Dänzer wrote:
> "If vblank is disabled by the off-delay timer before the handler runs, the PENDING cursor event is never delivered" indicates otherwise. If the handling was correct, the vblank interrupt should never be disabled before the handler runs.

You are correct again: calling drm_crtc_vblank_get() ensures the
off-delay timer cannot disable the interrupt before the handler runs.
I will add a fallback that checks whether the call fails and sends the
event immediately only in that case; this would prevent hangs for
disabled CRTCs.

> Maybe? If the vblank event handler never needs to send an event for a flip.

This point would be covered by a dedicated cursor_event: the vblank
handler only touches cursor_event, never acrtc->event, so the condition
is met.
Unless advised against it, I will move in this direction and test this
approach, and if nothing unforeseen arises, send a v2 when ready.
Thanks for the guidance.
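
As a rough illustration of the direction described above, here is a toy
user-space model with two separate event slots and the immediate-send
fallback. It is a sketch under assumptions, not the v2 patch: all toy_*
names are hypothetical, vblank_available stands in for whether
drm_crtc_vblank_get() would succeed, and locking (event_lock) is left
out entirely.

```c
/* Toy user-space model of separate cursor and flip event slots; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_event {
	int id;
	bool sent;
};

struct toy_acrtc {
	struct toy_event *event;	/* slot used by page flips only */
	struct toy_event *cursor_event;	/* hypothetical dedicated cursor slot */
	bool vblank_available;		/* false models a failing vblank get */
	int vblank_refs;		/* references held for pending cursor events */
};

static void toy_send(struct toy_event *e)
{
	e->sent = true;
}

/* Cursor-only commit: defer to the vblank handler, or send now as a fallback. */
static void toy_commit_cursor(struct toy_acrtc *a, struct toy_event *e)
{
	if (!a->vblank_available) {
		toy_send(e);		/* fallback: no interrupt will arrive */
		return;
	}
	a->vblank_refs++;		/* keep the interrupt on until delivery */
	a->cursor_event = e;
}

/* Plane flip commit: uses its own slot, completed by the flip handler. */
static void toy_commit_flip(struct toy_acrtc *a, struct toy_event *e)
{
	a->event = e;
}

/* Vblank handler: touches cursor_event only, never the flip slot. */
static void toy_vblank_handler(struct toy_acrtc *a)
{
	if (a->cursor_event) {
		toy_send(a->cursor_event);
		a->cursor_event = NULL;
		a->vblank_refs--;
	}
}

/* Flip-done handler: touches the flip slot only. */
static void toy_pflip_handler(struct toy_acrtc *a)
{
	if (a->event) {
		toy_send(a->event);
		a->event = NULL;
	}
}
```

With separate slots, a vblank that arrives while both a cursor event and
a flip event are queued completes only the cursor event in this model,
and the flip event stays in place for the flip handler.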


end of thread, other threads:[~2026-03-31 12:57 UTC | newest]

Thread overview: 36+ messages
2026-02-17 19:16 [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately Michele Palazzi
2026-02-23 15:27 ` Leo Li
2026-02-27  8:53   ` Michele Palazzi
2026-02-27  8:58     ` Michele Palazzi
2026-03-02 22:13       ` Leo Li
2026-03-03  8:17         ` Shengyu Qu
2026-03-03 19:07           ` Leo Li
2026-03-04 14:00             ` Michele Palazzi
2026-03-04 14:20               ` Leo Li
2026-03-05 22:30                 ` Leo Li
2026-03-06  8:37                   ` Michele Palazzi
2026-03-09 16:49                     ` Michele Palazzi
2026-03-10 23:50                       ` Leo Li
2026-03-11 10:16                         ` Shengyu Qu
2026-03-11 10:38                         ` Michele Palazzi
2026-03-11 17:56                           ` Leo Li
2026-03-16 14:55                             ` Michele Palazzi
2026-03-16 15:17                               ` Michele Palazzi
2026-03-16 18:39                                 ` Leo Li
2026-03-16 18:48                                   ` Leo Li
2026-03-18 11:36                                     ` Michele Palazzi
2026-03-20  0:52                                       ` Leo Li
2026-03-20  1:33                                         ` Michele Palazzi
2026-03-31 12:57                                           ` Michele Palazzi
2026-02-27 19:43   ` Alex Deucher
2026-03-02  8:53   ` Michel Dänzer
2026-03-02 22:14     ` Leo Li
  -- strict thread matches above, loose matches on Subject: below --
2026-02-18  0:31 Michele Palazzi
2026-02-18  9:41 ` Michel Dänzer
2026-02-18 10:09   ` Michele Palazzi
2026-02-19 11:09     ` Michel Dänzer
2026-02-19 13:08       ` Michele Palazzi
2026-02-19 13:59         ` Michel Dänzer
2026-02-19 15:56           ` Michele Palazzi
2026-02-19 16:02             ` Michel Dänzer
2026-02-20 11:10               ` Michele Palazzi
