* [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
@ 2026-02-18  0:31 Michele Palazzi
  2026-02-18  9:41 ` Michel Dänzer
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-18  0:31 UTC (permalink / raw)
  To: amd-gfx
  Cc: harry.wentland, siqueira, alexander.deucher, sunpeng.li,
	Michele Palazzi

Intermittent flip_done timeouts have been observed on AMD GPUs
since kernel 6.12.

Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
incorrectly consume events meant for plane flips during cursor-only
updates. This happens because cursor commits defer event delivery to
the vblank handler, which checks (pflip_status != SUBMITTED). Since
AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
event slot for subsequent plane flips, leading to timeouts.

The potential for a race was present since commit 473683a03495
("drm/amd/display: Create a file dedicated for CRTC"), then
commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
policy for older ASICs") made it happen by reducing vblank
off-delay and making disables happen much more frequently
between commits.

Fix this by sending cursor-only vblank events immediately in
amdgpu_dm_commit_planes(). Since cursor updates are committed to
hardware immediately, deferring the event is unnecessary and
creates race windows for event stealing or starvation if vblank
is disabled before the handler runs.

Tested on DCN 2.1, 3.2, and 3.5.

Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
Signed-off-by: Michele Palazzi <sysdadmin@m1k.cloud>
---
I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE first, 9070 XT now)
since kernel 6.12. The hang occurs during normal desktop usage but is much easier to
trigger under specific conditions involving cursor movements and plane updates.

Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787

Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
Dual DP monitors, 2560x1440, 144Hz
Desktop: KDE Plasma Wayland

The hang was first observed while using Cisco Webex
(XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U). To reproduce, start a
meeting and screen share a window running the Omnissa Horizon client, then move
the cursor around between the two monitors and the shared window.
Under these conditions the hang usually occurs within a few hours.

Enabling drm.debug masks the issue entirely: the overhead
changes timing enough to close the race window.
So I added debug printks to amdgpu_dm.c and used a small bpftrace script to log the
pageflip lifecycle with per-thread tracking.

bpftrace script:

  config = { missing_probes = "warn" }
  BEGIN { printf("=== flip_done tracer started ===\n"); }
  kprobe:drm_crtc_vblank_on_config       { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
  kprobe:drm_vblank_disable_and_save     { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
  kprobe:dm_pflip_high_irq               { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
  kprobe:drm_crtc_send_vblank_event      { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
  kprobe:drm_vblank_put                  { printf("%lu drm_vblank_put\n", nsecs/1000000); }
  kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
  kprobe:manage_dm_interrupts            { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
  kprobe:drm_atomic_helper_wait_for_flip_done {
      @wait_start[tid] = nsecs;
      printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
  }
  kretprobe:drm_atomic_helper_wait_for_flip_done {
      $start = @wait_start[tid];
      $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
      if ($ms > 100) {
          printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
                 nsecs/1000000, $ms, tid);
      } else {
          printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
                 nsecs/1000000, $ms, tid);
      }
      delete(@wait_start[tid]);
  }
  interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
  END { printf("=== stopped ===\n"); clear(@wait_start); }

The timeout was captured at 17:35:41 CET. The trace timestamps
match dmesg exactly (9942110ms = dmesg 9942.110s).

dmesg output from the timeout:

  [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
  [ 9942.110380] [FLIP_DEBUG]  crtc:0 pflip_status=0 event=00000000a0636a23
                  vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
                  disable_immediate=0 active_planes=1

pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip was never completed
but the status was already reset to NONE. vblank was enabled, refcount was held, so vblank
IRQs were firing throughout the wait.

The bpftrace captured the exact sequence leading up to the hang. Here's the critical
timeline at ~17:35:31 (9931771), about 10 seconds before the timeout fired:

  9931755 drm_atomic_helper_commit_hw_done
  9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
  9931756 dm_pflip_high_irq                           <- normal plane flip, last good one
  9931756 drm_crtc_send_vblank_event
  9931756 drm_vblank_put
  9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
  9931771 drm_vblank_disable_and_save                 <- vblank timer fires
  9931771 drm_crtc_send_vblank_event                  <- event sent WITHOUT dm_pflip_high_irq
  9931771 drm_vblank_put
  9931771 drm_atomic_helper_commit_hw_done
  9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
  9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
  9931773 drm_atomic_helper_commit_hw_done
  9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- new commit
  9931777 dm_pflip_high_irq                           <- pflip fires, completes the wrong one
  9931777 drm_crtc_send_vblank_event
  9931777 drm_vblank_put
  9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
  9931781 drm_atomic_helper_commit_hw_done
  9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- THIS ONE HANGS
  ... 10328ms of silence ...
  9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]

The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq. This is
amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The problem is that the
cursor-only commit path in amdgpu_dm_commit_planes() stores the event in acrtc->event
and defers delivery to the vblank handler. This creates two race conditions:

- The vblank handler checks (pflip_status != SUBMITTED) which also
  matches NONE, so it can consume events meant for plane flips. The subsequent
  dm_pflip_high_irq finds no event, and the next commit hangs.

- If vblank is disabled by the off-delay timer before the handler
  runs, the PENDING cursor event is never delivered and the commit hangs.
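
As a rough illustration of the first race, here is a small user-space model
of the event slot. It is not the driver code: the model_* names are
hypothetical stand-ins, and the window in which an event sits in the slot
while the status is still NONE is an assumption taken from the description
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's flip states; names are illustrative. */
enum model_flip_status { MODEL_FLIP_NONE, MODEL_FLIP_PENDING, MODEL_FLIP_SUBMITTED };

struct model_crtc {
	enum model_flip_status pflip_status;
	const char *event;      /* stands in for acrtc->event */
	const char *delivered;  /* last event handed to userspace */
};

/* Vblank handler as described above: delivers whenever status != SUBMITTED. */
static bool model_handle_vblank(struct model_crtc *c)
{
	if (c->event == NULL || c->pflip_status == MODEL_FLIP_SUBMITTED)
		return false;
	c->delivered = c->event;
	c->event = NULL;
	return true;
}

/* Pageflip IRQ: only acts on SUBMITTED flips and needs an event in the slot. */
static bool model_pflip_irq(struct model_crtc *c)
{
	if (c->pflip_status != MODEL_FLIP_SUBMITTED)
		return false;
	c->pflip_status = MODEL_FLIP_NONE;
	if (c->event == NULL)
		return false;   /* nothing left to complete: the waiter times out */
	c->delivered = c->event;
	c->event = NULL;
	return true;
}

/* Replays the stealing sequence; returns true if the pflip IRQ found its event. */
static bool model_replay_steal(void)
{
	struct model_crtc c = { MODEL_FLIP_NONE, "plane-flip", NULL };

	/* Assumed window: event is in the slot, status not yet SUBMITTED. */
	bool stolen = model_handle_vblank(&c);
	assert(stolen);

	c.pflip_status = MODEL_FLIP_SUBMITTED;
	return model_pflip_irq(&c);
}
```

In this model model_replay_steal() returns false: the handler has already
emptied the slot, so the later pageflip IRQ has nothing to deliver.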

The fix is to send cursor-only events immediately via drm_crtc_send_vblank_event()
in amdgpu_dm_commit_planes() instead of deferring to the vblank handler. The cursor
update is already committed to hardware at this point, so immediate delivery is correct.
This eliminates both race conditions by removing cursor events from the deferred
delivery path entirely:

- Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
- Cursor updates: sent immediately in commit_planes (no deferral, no races)

From git history the check in amdgpu_dm_crtc_handle_vblank() has been like this since
473683a03495 ("drm/amd/display: Create a file dedicated for CRTC", 2022)
which moved this code from amdgpu_dm.c, but it was practically impossible to trigger
because the default drm_vblank_offdelay was 5000ms.
Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") in 6.12
changed all ASICs to use drm_crtc_vblank_on_config() with a computed off-delay
of roughly 2 frames (~14ms at 144Hz).
This made drm_vblank_disable_and_save fire hundreds of times more often, turning
a theoretical race into reality. The bpftrace log is full of drm_vblank_disable_and_save
events interleaved with the commit sequence.
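
The numbers above can be checked with a few lines of integer arithmetic.
This is only a sketch: the "2 frames" figure and the 5000ms default come
from the text above, the helper names are made up for illustration, and the
driver's actual off-delay computation may differ.

```c
#include <assert.h>

/* Frame period in microseconds for a given refresh rate in Hz. */
static unsigned long frame_period_us(unsigned int refresh_hz)
{
	assert(refresh_hz > 0);
	return 1000000UL / refresh_hz;
}

/* Off-delay of `frames` frames, in microseconds. */
static unsigned long offdelay_us(unsigned int refresh_hz, unsigned int frames)
{
	return frame_period_us(refresh_hz) * frames;
}

/* How many times shorter that delay is than the old 5000ms default. */
static unsigned long offdelay_ratio(unsigned int refresh_hz, unsigned int frames)
{
	return (5000UL * 1000UL) / offdelay_us(refresh_hz, frames);
}
```

For 144Hz and 2 frames, offdelay_us() gives 13888us (about 14ms) and
offdelay_ratio() gives 360, consistent with "hundreds of times more often".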

This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5 (9070 XT).
Under a high-frequency glxgears + cursor-jiggling test, the patch intercepted
the race thousands of times without a single timeout.
I am also running it on my main system without issues.

My earlier, rushed attempt at addressing this
(https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html)
is no longer needed.

Patch applies cleanly on top of tag v6.19.

 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index a8a59126b2d2..35987ce80c71 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 	} else if (cursor_update && acrtc_state->active_planes > 0) {
 		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
 		if (acrtc_attach->base.state->event) {
-			drm_crtc_vblank_get(pcrtc);
-			acrtc_attach->event = acrtc_attach->base.state->event;
+			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
 			acrtc_attach->base.state->event = NULL;
 		}
 		spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
-- 
2.53.0


* [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately
@ 2026-02-17 19:16 Michele Palazzi
  2026-02-23 15:27 ` Leo Li
  0 siblings, 1 reply; 36+ messages in thread
From: Michele Palazzi @ 2026-02-17 19:16 UTC (permalink / raw)
  To: amd-gfx
  Cc: harry.wentland, rodrigo.siqueira, sunpeng.li, alexander.deucher,
	christian.koenig, Michele Palazzi


end of thread, other threads:[~2026-03-31 12:57 UTC]

Thread overview: 36+ messages
2026-02-18  0:31 [PATCH 1/1] drm/amd/display: complete cursor vblank events immediately Michele Palazzi
2026-02-18  9:41 ` Michel Dänzer
2026-02-18 10:09   ` Michele Palazzi
2026-02-19 11:09     ` Michel Dänzer
2026-02-19 13:08       ` Michele Palazzi
2026-02-19 13:59         ` Michel Dänzer
2026-02-19 15:56           ` Michele Palazzi
2026-02-19 16:02             ` Michel Dänzer
2026-02-20 11:10               ` Michele Palazzi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-17 19:16 Michele Palazzi
2026-02-23 15:27 ` Leo Li
2026-02-27  8:53   ` Michele Palazzi
2026-02-27  8:58     ` Michele Palazzi
2026-03-02 22:13       ` Leo Li
2026-03-03  8:17         ` Shengyu Qu
2026-03-03 19:07           ` Leo Li
2026-03-04 14:00             ` Michele Palazzi
2026-03-04 14:20               ` Leo Li
2026-03-05 22:30                 ` Leo Li
2026-03-06  8:37                   ` Michele Palazzi
2026-03-09 16:49                     ` Michele Palazzi
2026-03-10 23:50                       ` Leo Li
2026-03-11 10:16                         ` Shengyu Qu
2026-03-11 10:38                         ` Michele Palazzi
2026-03-11 17:56                           ` Leo Li
2026-03-16 14:55                             ` Michele Palazzi
2026-03-16 15:17                               ` Michele Palazzi
2026-03-16 18:39                                 ` Leo Li
2026-03-16 18:48                                   ` Leo Li
2026-03-18 11:36                                     ` Michele Palazzi
2026-03-20  0:52                                       ` Leo Li
2026-03-20  1:33                                         ` Michele Palazzi
2026-03-31 12:57                                           ` Michele Palazzi
2026-02-27 19:43   ` Alex Deucher
2026-03-02  8:53   ` Michel Dänzer
2026-03-02 22:14     ` Leo Li
