public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports
@ 2026-04-29 19:53 Markus Fritsche
  2026-04-29 19:53 ` [PATCH RFC 1/3] media: videobuf2: add dma_resv release-fence helper Markus Fritsche
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Markus Fritsche @ 2026-04-29 19:53 UTC (permalink / raw)
  To: Tomasz Figa, Marek Szyprowski, Mauro Carvalho Chehab,
	Sumit Semwal, Christian König, Ezequiel Garcia,
	Philipp Zabel, Jacob Chen, Heiko Stuebner
  Cc: linux-media, linux-kernel, dri-devel, linaro-mm-sig,
	linux-rockchip, linux-arm-kernel

Hi,

This series proposes a small opt-in API in videobuf2-core that lets V4L2
drivers populate a dma_resv exclusive write fence on the dmabufs they
export to userspace, signalled when the buffer transitions to
VB2_BUF_STATE_DONE. Two example drivers (hantro, rockchip-rga) opt in
to demonstrate the call shape; the change is no-op for every other
driver.

Why
---
Modern Wayland compositors and any other userspace consumers that
import V4L2-produced dmabufs and want to do implicit synchronization
the spec-clean way (poll(POLLIN) on the dmabuf fd, or
DMA_BUF_IOCTL_EXPORT_SYNC_FILE for a sync_file) currently get either:

1. A stub fence from dma_buf_export_sync_file(), because the dmabuf's
   dma_resv has no fences populated. The kernel substitutes
   dma_fence_get_stub() which is permanently signalled. The compositor
   "successfully" waits on a fence that represents nothing real about
   the producer's state.
2. A poll(POLLIN) on the dmabuf fd that returns immediately for the
   same reason — dma_buf_poll_add_cb finds zero fences in the resv,
   triggers the wake callback inline, and reports POLLIN ready before
   the producer has actually said anything.

Today this works as a happy accident on most paths because clients
attach buffers after VIDIOC_DQBUF, which the userspace V4L2 contract
guarantees only returns a buffer after the producer is done. So the
implicit "the kernel's stub fence is fine because the buffer is
already complete by the time anyone polls it" assumption has held.

But:

- It's a contract gap. The kernel claims to expose implicit sync; it
  does not, for V4L2 producers.
- It paid latency for nothing. Every Wayland frame from a V4L2
  producer pays a DMA_BUF_IOCTL_EXPORT_SYNC_FILE round-trip for a
  fence that's stub-signalled. On Mali-class hardware (RK3566 Wayland
  chrome video playback), this contributed to compositor stalls.
  Removing the wait at the compositor level is a workaround, not a
  fix.
- It blocks downstream consumers from doing the right thing. A
  Wayland compositor that defensively waits on a sync_file gets a
  stub-fence pass-through with no actual gating; if the V4L2 driver
  ever has an out-of-band path that releases the buffer before
  finishing the write, there is no fence to gate on.

What
----
Patch 1 adds:

- struct dma_fence *release_fence to struct vb2_buffer
- u64 dma_resv_fence_context + atomic64_t dma_resv_fence_seqno +
  spinlock_t dma_resv_fence_lock to struct vb2_queue
- vb2_buffer_attach_release_fence(vb) — drivers call this from their
  buf_queue callback. Allocates a dma_fence on the queue's fence
  context, attaches it as DMA_RESV_USAGE_WRITE on each plane's
  dmabuf->resv. No-op for buffers without exported dmabufs.
- vb2_buffer_done() extended to signal+put the fence if attached,
  so the producer's completion signal lands in the resv synchronously
  with the userspace DQBUF wakeup.

Patches 2 and 3 add a single call to the helper from hantro_buf_queue
and rga_buf_queue respectively. Both are demonstration drivers; other
vb2 drivers can opt in incrementally with the same one-line change.

Tested on
---------
PineTab2 (RK3566 / Mali-G52 panfrost / mainline 6.19.10, this series
backported), playing 1080p30 H.264 in chromium under KDE Plasma 6.6.4
Wayland. The test harness is the chromium-fourier patch series at
https://github.com/marfrit/fourier — chromium plus a KWin patch
that *previously bypassed* Transaction::watchDmaBuf because the
kernel-side fence was stub-signalled. With this series applied, the
bypass becomes unnecessary; KWin's fence wait completes correctly
because the fence now signals when hantro completes the capture
buffer write.

End-to-end result before the kernel patch (chromium + Qt 6 patches +
KWin watchDmaBuf bypass): 1080p30 H.264 plays through, ~81% combined
chrome CPU, but the watchDmaBuf bypass weakens KWin's defenses against
misbehaving clients.

End-to-end result after the kernel patch (chromium + Qt 6 patches +
plain unmodified KWin): 1080p30 H.264 plays through with the same CPU
profile, KWin's watchDmaBuf wait completes within microseconds against
the now-real producer fence, no defenses weakened.

What's missing in this RFC
--------------------------
- Other vb2-using drivers don't opt in. Each maintainer should look
  at their driver and decide. The hantro + rga patches show the
  shape; copying it to other drivers should be straightforward.
- For drivers that have intermediate image-processor stages (e.g.
  CSI -> ISP -> user), the fence semantics across stage boundaries
  are out of scope here. This series only addresses the producer-to-
  userspace edge.
- No selftest. videobuf2 doesn't have a great in-tree selftest harness
  for dmabuf flows; the validation is end-to-end at the userspace
  consumer level (KWin, in our case).

Reviews especially welcome on:

- The decision to make this opt-in per driver vs. automatic for all
  vb2-CAPTURE queues. Auto-on would force every driver to be audited;
  opt-in is incremental and safer but leaves the contract gap for
  drivers nobody touches.
- Whether vb2_buffer_done is the right place to signal vs. an earlier
  hook (e.g. immediately after DMA-from-device finishes). For hantro
  the two are effectively the same; for drivers with asynchronous
  post-processing they may differ.
- The choice of DMA_RESV_USAGE_WRITE — we are emitting the producer's
  write completion, so WRITE matches dma-buf documentation, but a
  sanity check is welcome.

Cheers,
Markus


Markus Fritsche (3):
  media: videobuf2: add dma_resv release-fence helper
  media: hantro: attach dma_resv release fence at buf_queue
  media: rockchip-rga: attach dma_resv release fence at buf_queue

 .../media/common/videobuf2/videobuf2-core.c   | 95 +++++++++++++++++++
 drivers/media/platform/rockchip/rga/rga-buf.c | 10 ++
 .../media/platform/verisilicon/hantro_v4l2.c  | 12 +++
 include/media/videobuf2-core.h                | 29 ++++++
 4 files changed, 146 insertions(+)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH RFC 1/3] media: videobuf2: add dma_resv release-fence helper
  2026-04-29 19:53 [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Markus Fritsche
@ 2026-04-29 19:53 ` Markus Fritsche
  2026-04-29 19:53 ` [PATCH RFC 2/3] media: hantro: attach dma_resv release fence at buf_queue Markus Fritsche
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Markus Fritsche @ 2026-04-29 19:53 UTC (permalink / raw)
  To: Tomasz Figa, Marek Szyprowski, Mauro Carvalho Chehab,
	Sumit Semwal, Christian König, Ezequiel Garcia,
	Philipp Zabel, Jacob Chen, Heiko Stuebner
  Cc: linux-media, linux-kernel, dri-devel, linaro-mm-sig,
	linux-rockchip, linux-arm-kernel

Add an opt-in API that lets vb2 producers populate a dma_resv
exclusive write fence on the dmabufs they export to userspace,
signalled when the buffer transitions to VB2_BUF_STATE_DONE.

V4L2 producers historically don't propagate buffer-state-done into
the dmabuf's dma_resv exclusive fence. Userspace consumers that
import V4L2-produced dmabufs and wait on the dmabuf's implicit-sync
fence (poll(POLLIN), DMA_BUF_IOCTL_EXPORT_SYNC_FILE) currently see
either zero fences or a stub fence from dma_fence_get_stub(). This
is correct by accident for the common case (clients call DQBUF
before importing) but represents a contract gap.

Drivers opt in by calling vb2_buffer_attach_release_fence(vb) from
their buf_queue callback. The helper allocates a dma_fence on the
queue's fence context (set up at vb2_core_queue_init), attaches it
as DMA_RESV_USAGE_WRITE on each plane's dmabuf->resv, and stashes
it in vb->release_fence. vb2_buffer_done signals + puts the fence
as part of its state transition.

For drivers that don't opt in, vb->release_fence stays NULL and
the signal path is a no-op.

Skips planes whose vb2_plane.dbuf is NULL — buffers never exported
via VIDIOC_EXPBUF (or imported via V4L2_MEMORY_DMABUF) have no
dmabuf for userspace to wait on.

Signed-off-by: Markus Fritsche <mfritsche@reauktion.de>
---
 .../media/common/videobuf2/videobuf2-core.c   | 95 +++++++++++++++++++
 include/media/videobuf2-core.h                | 29 ++++++
 2 files changed, 124 insertions(+)

diff --git a/drivers/media/common/videobuf2/videobuf2-core.c b/drivers/media/common/videobuf2/videobuf2-core.c
index b0523fc23..ee766aae0 100644
--- a/drivers/media/common/videobuf2/videobuf2-core.c
+++ b/drivers/media/common/videobuf2/videobuf2-core.c
@@ -26,6 +26,9 @@
 #include <linux/freezer.h>
 #include <linux/kthread.h>
 
+#include <linux/dma-fence.h>
+#include <linux/dma-resv.h>
+#include <linux/dma-buf.h>
 #include <media/videobuf2-core.h>
 #include <media/v4l2-mc.h>
 
@@ -1179,6 +1182,86 @@ void *vb2_plane_cookie(struct vb2_buffer *vb, unsigned int plane_no)
 }
 EXPORT_SYMBOL_GPL(vb2_plane_cookie);
 
+/*
+ * dma_resv release-fence integration.
+ *
+ * V4L2 producers historically don't propagate buffer-state-done into
+ * the dmabuf's dma_resv exclusive fence. Userspace consumers that
+ * wait on that fence (e.g. wayland compositors via poll(POLLIN) or
+ * DMA_BUF_IOCTL_EXPORT_SYNC_FILE) currently see either no fences or
+ * a stub fence from dma_fence_get_stub(). The opt-in API below lets
+ * a driver attach a real producer fence at QBUF time and have it
+ * signalled by vb2_buffer_done().
+ */
+
+static const char *vb2_dma_resv_get_driver_name(struct dma_fence *fence)
+{
+	return "videobuf2";
+}
+
+static const char *vb2_dma_resv_get_timeline_name(struct dma_fence *fence)
+{
+	return "vb2-release-fence";
+}
+
+static const struct dma_fence_ops vb2_dma_resv_fence_ops = {
+	.get_driver_name = vb2_dma_resv_get_driver_name,
+	.get_timeline_name = vb2_dma_resv_get_timeline_name,
+};
+
+int vb2_buffer_attach_release_fence(struct vb2_buffer *vb)
+{
+	struct vb2_queue *q = vb->vb2_queue;
+	struct dma_fence *fence;
+	unsigned int plane;
+
+	if (WARN_ON(vb->release_fence))
+		return -EINVAL;
+
+	fence = kzalloc(sizeof(*fence), GFP_KERNEL);
+	if (!fence)
+		return -ENOMEM;
+
+	dma_fence_init(fence, &vb2_dma_resv_fence_ops, &q->dma_resv_fence_lock,
+		       q->dma_resv_fence_context,
+		       atomic64_inc_return(&q->dma_resv_fence_seqno));
+
+	for (plane = 0; plane < vb->num_planes; plane++) {
+		struct dma_buf *dbuf = vb->planes[plane].dbuf;
+
+		if (!dbuf)
+			continue;
+
+		dma_resv_lock(dbuf->resv, NULL);
+		dma_resv_add_fence(dbuf->resv, fence, DMA_RESV_USAGE_WRITE);
+		dma_resv_unlock(dbuf->resv);
+	}
+
+	/* One reference for the eventual signal in vb2_buffer_done. */
+	vb->release_fence = dma_fence_get(fence);
+
+	/* The dma_resv held its own reference per plane. Drop ours. */
+	dma_fence_put(fence);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vb2_buffer_attach_release_fence);
+
+static void vb2_buffer_signal_release_fence(struct vb2_buffer *vb,
+					    enum vb2_buffer_state state)
+{
+	struct dma_fence *fence = vb->release_fence;
+
+	if (!fence)
+		return;
+
+	if (state == VB2_BUF_STATE_ERROR)
+		dma_fence_set_error(fence, -EIO);
+	dma_fence_signal(fence);
+	dma_fence_put(fence);
+	vb->release_fence = NULL;
+}
+
 void vb2_buffer_done(struct vb2_buffer *vb, enum vb2_buffer_state state)
 {
 	struct vb2_queue *q = vb->vb2_queue;
@@ -1205,6 +1288,9 @@ void vb2_buffer_done(struct vb2_buffer *vb, enum vb2_buffer_state state)
 	if (state != VB2_BUF_STATE_QUEUED)
 		__vb2_buf_mem_finish(vb);
 
+	if (state != VB2_BUF_STATE_QUEUED)
+		vb2_buffer_signal_release_fence(vb, state);
+
 	spin_lock_irqsave(&q->done_lock, flags);
 	if (state == VB2_BUF_STATE_QUEUED) {
 		vb->state = VB2_BUF_STATE_QUEUED;
@@ -2652,6 +2738,15 @@ int vb2_core_queue_init(struct vb2_queue *q)
 	mutex_init(&q->mmap_lock);
 	init_waitqueue_head(&q->done_wq);
 
+	/*
+	 * Per-queue dma_resv release-fence context. Drivers opt-in via
+	 * vb2_buffer_attach_release_fence(); other drivers pay only the
+	 * cost of the unused fields.
+	 */
+	q->dma_resv_fence_context = dma_fence_context_alloc(1);
+	atomic64_set(&q->dma_resv_fence_seqno, 0);
+	spin_lock_init(&q->dma_resv_fence_lock);
+
 	q->memory = VB2_MEMORY_UNKNOWN;
 
 	if (q->buf_struct_size == 0)
diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h
index 9b02aeba4..2bf3272d4 100644
--- a/include/media/videobuf2-core.h
+++ b/include/media/videobuf2-core.h
@@ -288,6 +288,12 @@ struct vb2_buffer {
 	unsigned int		skip_cache_sync_on_finish:1;
 
 	struct vb2_plane	planes[VB2_MAX_PLANES];
+	/*
+	 * dma_resv release fence — set by vb2_buffer_attach_release_fence()
+	 * (driver opt-in from buf_queue), signalled and put by
+	 * vb2_buffer_done(). NULL for drivers that don't opt in.
+	 */
+	struct dma_fence	*release_fence;
 	struct list_head	queued_entry;
 	struct list_head	done_entry;
 #ifdef CONFIG_VIDEO_ADV_DEBUG
@@ -658,6 +664,15 @@ struct vb2_queue {
 	spinlock_t			done_lock;
 	wait_queue_head_t		done_wq;
 
+	/*
+	 * Per-queue dma_resv release-fence context. Drivers that opt
+	 * into vb2_buffer_attach_release_fence() use these to allocate
+	 * fences on a single per-queue timeline.
+	 */
+	u64				dma_resv_fence_context;
+	atomic64_t			dma_resv_fence_seqno;
+	spinlock_t			dma_resv_fence_lock;
+
 	unsigned int			streaming:1;
 	unsigned int			start_streaming_called:1;
 	unsigned int			error:1;
@@ -747,6 +762,20 @@ void *vb2_plane_cookie(struct vb2_buffer *vb, unsigned int plane_no);
  */
 void vb2_buffer_done(struct vb2_buffer *vb, enum vb2_buffer_state state);
 
+/**
+ * vb2_buffer_attach_release_fence() - opt-in dma_resv release fence.
+ * @vb: the buffer being queued to the producer.
+ *
+ * Drivers call this from their buf_queue callback to attach an
+ * exclusive write fence to each plane's dmabuf->resv. The fence
+ * is signalled and put by vb2_buffer_done() when the buffer
+ * transitions to VB2_BUF_STATE_DONE / _ERROR. Skips planes whose
+ * dbuf is NULL.
+ *
+ * Returns 0 on success, negative errno on allocation failure.
+ */
+int vb2_buffer_attach_release_fence(struct vb2_buffer *vb);
+
 /**
  * vb2_discard_done() - discard all buffers marked as DONE.
  * @q:		pointer to &struct vb2_queue with videobuf2 queue.
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH RFC 2/3] media: hantro: attach dma_resv release fence at buf_queue
  2026-04-29 19:53 [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Markus Fritsche
  2026-04-29 19:53 ` [PATCH RFC 1/3] media: videobuf2: add dma_resv release-fence helper Markus Fritsche
@ 2026-04-29 19:53 ` Markus Fritsche
  2026-04-29 19:53 ` [PATCH RFC 3/3] media: rockchip-rga: " Markus Fritsche
  2026-04-29 21:22 ` [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Nicolas Dufresne
  3 siblings, 0 replies; 6+ messages in thread
From: Markus Fritsche @ 2026-04-29 19:53 UTC (permalink / raw)
  To: Tomasz Figa, Marek Szyprowski, Mauro Carvalho Chehab,
	Sumit Semwal, Christian König, Ezequiel Garcia,
	Philipp Zabel, Jacob Chen, Heiko Stuebner
  Cc: linux-media, linux-kernel, dri-devel, linaro-mm-sig,
	linux-rockchip, linux-arm-kernel

Opt the hantro driver into the new vb2 release-fence helper.

When userspace QBUFs a buffer to hantro, the buffer is added to the
driver's m2m queue via v4l2_m2m_buf_queue. We additionally call
vb2_buffer_attach_release_fence() so each plane's dmabuf->resv
gets a real producer fence attached. The fence is signalled by
vb2_buffer_done when hantro completes the decode (via
v4l2_m2m_buf_done_and_job_finish in hantro_drv.c, which converges
on vb2_buffer_done).

Wayland compositors (and any other userspace) that import hantro
CAPTURE buffers and wait on the dmabuf's implicit-sync fence now
wait on a real fence representing the producer's actual completion,
not a stub. Validated end-to-end on PineTab2 (RK3566 / Mali-G52 /
mainline 6.19 with this series backported) playing 1080p30 H.264 in
chromium under stock KDE Plasma 6.6.4 Wayland: KWin's
Transaction::watchDmaBuf wait completes correctly the moment
hantro's IRQ fires, instead of falling back to a stub-resolved
poll.

Signed-off-by: Markus Fritsche <mfritsche@reauktion.de>
---
 drivers/media/platform/verisilicon/hantro_v4l2.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/media/platform/verisilicon/hantro_v4l2.c b/drivers/media/platform/verisilicon/hantro_v4l2.c
index 62d3962c1..e95a3433a 100644
--- a/drivers/media/platform/verisilicon/hantro_v4l2.c
+++ b/drivers/media/platform/verisilicon/hantro_v4l2.c
@@ -877,6 +877,18 @@ static void hantro_buf_queue(struct vb2_buffer *vb)
 	}
 
 	v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
+
+	/*
+	 * Opt in to vb2's dma_resv release-fence path. Userspace
+	 * consumers that imported this buffer's dmabuf and wait on
+	 * its implicit-sync fence (poll(POLLIN) or
+	 * DMA_BUF_IOCTL_EXPORT_SYNC_FILE) get a real producer fence
+	 * representing this device's completion, instead of the stub
+	 * fence dma_buf_export_sync_file substitutes when dma_resv
+	 * is empty. Best-effort: a fence-allocation failure means we
+	 * lose implicit-sync precision, no functional regression.
+	 */
+	(void)vb2_buffer_attach_release_fence(vb);
 }
 
 static bool hantro_vq_is_coded(struct vb2_queue *q)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH RFC 3/3] media: rockchip-rga: attach dma_resv release fence at buf_queue
  2026-04-29 19:53 [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Markus Fritsche
  2026-04-29 19:53 ` [PATCH RFC 1/3] media: videobuf2: add dma_resv release-fence helper Markus Fritsche
  2026-04-29 19:53 ` [PATCH RFC 2/3] media: hantro: attach dma_resv release fence at buf_queue Markus Fritsche
@ 2026-04-29 19:53 ` Markus Fritsche
  2026-04-29 21:22 ` [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Nicolas Dufresne
  3 siblings, 0 replies; 6+ messages in thread
From: Markus Fritsche @ 2026-04-29 19:53 UTC (permalink / raw)
  To: Tomasz Figa, Marek Szyprowski, Mauro Carvalho Chehab,
	Sumit Semwal, Christian König, Ezequiel Garcia,
	Philipp Zabel, Jacob Chen, Heiko Stuebner
  Cc: linux-media, linux-kernel, dri-devel, linaro-mm-sig,
	linux-rockchip, linux-arm-kernel

Opt the Rockchip RGA driver into the new vb2 release-fence helper.

Same shape as the hantro patch: rga_buf_queue enqueues the buffer
in the driver's m2m queue via v4l2_m2m_buf_queue and additionally
attaches a release fence to each plane's dmabuf->resv via
vb2_buffer_attach_release_fence(). vb2_buffer_done signals the
fence when RGA completes the M2M operation.

Userspace consumers of RGA-produced dmabufs (image-processing
pipelines, screen-rotation servers, gstreamer flows on Rockchip
boards) get spec-clean implicit-sync semantics, matching what
hantro now does in the same patch series.

Signed-off-by: Markus Fritsche <mfritsche@reauktion.de>
---
 drivers/media/platform/rockchip/rga/rga-buf.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/media/platform/rockchip/rga/rga-buf.c b/drivers/media/platform/rockchip/rga/rga-buf.c
index 70808049d..5557ca632 100644
--- a/drivers/media/platform/rockchip/rga/rga-buf.c
+++ b/drivers/media/platform/rockchip/rga/rga-buf.c
@@ -153,6 +153,16 @@ static void rga_buf_queue(struct vb2_buffer *vb)
 	struct rga_ctx *ctx = vb2_get_drv_priv(vb->vb2_queue);
 
 	v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, vbuf);
+
+	/*
+	 * Opt in to vb2's dma_resv release-fence path so userspace
+	 * consumers of RGA-produced dmabufs get a real producer fence
+	 * to wait on instead of the dma_buf core's stub fence. See
+	 * the leading patch in this series for rationale. Best-effort:
+	 * fence-allocation failure means we lose implicit-sync
+	 * precision but the m2m operation itself proceeds normally.
+	 */
+	(void)vb2_buffer_attach_release_fence(vb);
 }
 
 static void rga_buf_cleanup(struct vb2_buffer *vb)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports
  2026-04-29 19:53 [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Markus Fritsche
                   ` (2 preceding siblings ...)
  2026-04-29 19:53 ` [PATCH RFC 3/3] media: rockchip-rga: " Markus Fritsche
@ 2026-04-29 21:22 ` Nicolas Dufresne
  2026-04-30  6:51   ` Christian König
  3 siblings, 1 reply; 6+ messages in thread
From: Nicolas Dufresne @ 2026-04-29 21:22 UTC (permalink / raw)
  To: Markus Fritsche, Tomasz Figa, Marek Szyprowski,
	Mauro Carvalho Chehab, Sumit Semwal, Christian König,
	Ezequiel Garcia, Philipp Zabel, Jacob Chen, Heiko Stuebner
  Cc: linux-media, linux-kernel, dri-devel, linaro-mm-sig,
	linux-rockchip, linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 9830 bytes --]

Hi Markus,

Le mercredi 29 avril 2026 à 19:53 +0000, Markus Fritsche a écrit :
> Hi,
> 
> This series proposes a small opt-in API in videobuf2-core that lets V4L2
> drivers populate a dma_resv exclusive write fence on the dmabufs they
> export to userspace, signalled when the buffer transitions to
> VB2_BUF_STATE_DONE. Two example drivers (hantro, rockchip-rga) opt in
> to demonstrate the call shape; the change is no-op for every other
> driver.

Thanks for attempting again this feat. I see you went for implicit fencing, but
in the past we've been recommend to stay away from these and adopt an explicit
fencing model. Is this something you have started to think about, have you
reviewed past proposal in regard to fences ?

> 
> Why
> ---
> Modern Wayland compositors and any other userspace consumers that
> import V4L2-produced dmabufs and want to do implicit synchronization
> the spec-clean way (poll(POLLIN) on the dmabuf fd, or
> DMA_BUF_IOCTL_EXPORT_SYNC_FILE for a sync_file) currently get either:
> 
> 1. A stub fence from dma_buf_export_sync_file(), because the dmabuf's
>    dma_resv has no fences populated. The kernel substitutes
>    dma_fence_get_stub() which is permanently signalled. The compositor
>    "successfully" waits on a fence that represents nothing real about
>    the producer's state.
> 2. A poll(POLLIN) on the dmabuf fd that returns immediately for the
>    same reason — dma_buf_poll_add_cb finds zero fences in the resv,
>    triggers the wake callback inline, and reports POLLIN ready before
>    the producer has actually said anything.
> 
> Today this works as a happy accident on most paths because clients
> attach buffers after VIDIOC_DQBUF, which the userspace V4L2 contract
> guarantees only returns a buffer after the producer is done. So the
> implicit "the kernel's stub fence is fine because the buffer is
> already complete by the time anyone polls it" assumption has held.

There is no accident, just saying. Have you studied also the other side of
fences, the one that actually cause problem with Freedreno and Etnaviv ? To me
these would be higher priority since they are known to cause "back flash" kind
of bugs, specially for compositor that are not expecting GL driver to place
implicit fences on imported (v4l2 allocated) buffers.

> 
> But:
> 
> - It's a contract gap. The kernel claims to expose implicit sync; it
>   does not, for V4L2 producers.
> - It paid latency for nothing. Every Wayland frame from a V4L2
>   producer pays a DMA_BUF_IOCTL_EXPORT_SYNC_FILE round-trip for a
>   fence that's stub-signalled. On Mali-class hardware (RK3566 Wayland
>   chrome video playback), this contributed to compositor stalls.
>   Removing the wait at the compositor level is a workaround, not a
>   fix.
> - It blocks downstream consumers from doing the right thing. A
>   Wayland compositor that defensively waits on a sync_file gets a
>   stub-fence pass-through with no actual gating; if the V4L2 driver
>   ever has an out-of-band path that releases the buffer before
>   finishing the write, there is no fence to gate on.

Some things don't add up here. I think I want to remind that there is a contract
in regard to delivering a fence to userspace. One of the most important aspect
of fences is that they must in finit time be signaled, regardless what userspace
decided to do next. And for that reason, you shouldn't deliver a fence to
userspace if its not armed. In my reading, you are delivering that fence at
QBUF(capture) time, just like what Gustavo was trying to do previously. Its even
worse if you deliver it to your compositor allowing that compositor to hang
forever by not feeding any bitstream.

Let's take Hantro driver as an example. The right moment to deliver the fence is
either right before we set the DEC bit on the control register, or somewhere
before that when you have bitstream, parameters and request queued. At that
moment, you are guarantied that the decode will either finish or fail (yes, it
can fail, and its extremely common with live stream, or when application calls
streamoff, since in v4l2, we cancel work). Prior to that, user may starve the
OUTPUT queue (the bitstream) and cause the fence to hold forever. This would
break the contract I mention earlier.

Though, if you attach the fence at that moment, you will need to design how to
signal the fence readiness (rather then the data readiness). One idea would be
(with userspace opting-in) to signal the queue at that moment. But then you
can't do the memory management operation you would normally do in DQBUF. This of
course don't apply to hantro, which has no device cache, but we can't design
something in vb2 for the old HW. So we'd need to move memory management somwhere
else, maybe buffer_done, though you have to carefully make sure in which context
you do that, you can't sleep in an IRQ.

There is an obvious benefit of basing your solution on
DMA_BUF_IOCTL_EXPORT_SYNC_FILE, once you get there, you'll discover that there
is very little room in v4l2_buffer, and that was causing a lot of headache to
previous people attempting this. Though, if we look forward, we could also
consider this a feature of the media_request. Queuing a request could maybe
deliver a fence, assuming few pre-condition that guarantee execution (or
failure) are met. We've seen with DW100 recently that its rather easy to convert
an existing m2m driver to request. The media API is a much more open canvas to
design new mechanism. We could have a really simple ioctl that attach out fences
to request, and in a future hook it to our own depedency manager.

I'm simply throwing ideas, I could have missed few things in your PoC, let me
know.

Nicolas

> What
> ----
> Patch 1 adds:
> 
> - struct dma_fence *release_fence to struct vb2_buffer
> - u64 dma_resv_fence_context + atomic64_t dma_resv_fence_seqno +
>   spinlock_t dma_resv_fence_lock to struct vb2_queue
> - vb2_buffer_attach_release_fence(vb) — drivers call this from their
>   buf_queue callback. Allocates a dma_fence on the queue's fence
>   context, attaches it as DMA_RESV_USAGE_WRITE on each plane's
>   dmabuf->resv. No-op for buffers without exported dmabufs.
> - vb2_buffer_done() extended to signal+put the fence if attached,
>   so the producer's completion signal lands in the resv synchronously
>   with the userspace DQBUF wakeup.
> 
> Patches 2 and 3 add a single call to the helper from hantro_buf_queue
> and rga_buf_queue respectively. Both are demonstration drivers; other
> vb2 drivers can opt in incrementally with the same one-line change.
> 
> Tested on
> ---------
> PineTab2 (RK3566 / Mali-G52 panfrost / mainline 6.19.10, this series
> backported), playing 1080p30 H.264 in chromium under KDE Plasma 6.6.4
> Wayland. The test harness is the chromium-fourier patch series at
> https://github.com/marfrit/fourier — chromium plus a KWin patch
> that *previously bypassed* Transaction::watchDmaBuf because the
> kernel-side fence was stub-signalled. With this series applied, the
> bypass becomes unnecessary; KWin's fence wait completes correctly
> because the fence now signals when hantro completes the capture
> buffer write.
> 
> End-to-end result before the kernel patch (chromium + Qt 6 patches +
> KWin watchDmaBuf bypass): 1080p30 H.264 plays through, ~81% combined
> chrome CPU, but the watchDmaBuf bypass weakens KWin's defenses against
> misbehaving clients.
> 
> End-to-end result after the kernel patch (chromium + Qt 6 patches +
> plain unmodified KWin): 1080p30 H.264 plays through with the same CPU
> profile, KWin's watchDmaBuf wait completes within microseconds against
> the now-real producer fence, no defenses weakened.
> 
> What's missing in this RFC
> --------------------------
> - Other vb2-using drivers don't opt in. Each maintainer should look
>   at their driver and decide. The hantro + rga patches show the
>   shape; copying it to other drivers should be straightforward.
> - For drivers that have intermediate image-processor stages (e.g.
>   CSI -> ISP -> user), the fence semantics across stage boundaries
>   are out of scope here. This series only addresses the producer-to-
>   userspace edge.
> - No selftest. videobuf2 doesn't have a great in-tree selftest harness
>   for dmabuf flows; the validation is end-to-end at the userspace
>   consumer level (KWin, in our case).
> 
> Reviews especially welcome on:
> 
> - The decision to make this opt-in per driver vs. automatic for all
>   vb2-CAPTURE queues. Auto-on would force every driver to be audited;
>   opt-in is incremental and safer but leaves the contract gap for
>   drivers nobody touches.
> - Whether vb2_buffer_done is the right place to signal vs. an earlier
>   hook (e.g. immediately after DMA-from-device finishes). For hantro
>   the two are effectively the same; for drivers with asynchronous
>   post-processing they may differ.
> - The choice of DMA_RESV_USAGE_WRITE — we are emitting the producer's
>   write completion, so WRITE matches dma-buf documentation, but a
>   sanity check is welcome.
> 
> Cheers,
> Markus
> 
> 
> Markus Fritsche (3):
>   media: videobuf2: add dma_resv release-fence helper
>   media: hantro: attach dma_resv release fence at buf_queue
>   media: rockchip-rga: attach dma_resv release fence at buf_queue
> 
>  .../media/common/videobuf2/videobuf2-core.c   | 95 +++++++++++++++++++
>  drivers/media/platform/rockchip/rga/rga-buf.c | 10 ++
>  .../media/platform/verisilicon/hantro_v4l2.c  | 12 +++
>  include/media/videobuf2-core.h                | 29 ++++++
>  4 files changed, 146 insertions(+)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports
  2026-04-29 21:22 ` [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Nicolas Dufresne
@ 2026-04-30  6:51   ` Christian König
  0 siblings, 0 replies; 6+ messages in thread
From: Christian König @ 2026-04-30  6:51 UTC (permalink / raw)
  To: Nicolas Dufresne, Markus Fritsche, Tomasz Figa, Marek Szyprowski,
	Mauro Carvalho Chehab, Sumit Semwal, Ezequiel Garcia,
	Philipp Zabel, Jacob Chen, Heiko Stuebner
  Cc: linux-media, linux-kernel, dri-devel, linaro-mm-sig,
	linux-rockchip, linux-arm-kernel

On 4/29/26 23:22, Nicolas Dufresne wrote:
> Hi Markus,
> 
> Le mercredi 29 avril 2026 à 19:53 +0000, Markus Fritsche a écrit :
>> Hi,
>>
>> This series proposes a small opt-in API in videobuf2-core that lets V4L2
>> drivers populate a dma_resv exclusive write fence on the dmabufs they
>> export to userspace, signalled when the buffer transitions to
>> VB2_BUF_STATE_DONE. Two example drivers (hantro, rockchip-rga) opt in
>> to demonstrate the call shape; the change is no-op for every other
>> driver.
> 
> Thanks for attempting again this feat. I see you went for implicit fencing, but
> in the past we've been recommend to stay away from these and adopt an explicit
> fencing model. Is this something you have started to think about, have you
> reviewed past proposal in regard to fences ?

Yeah agree, I just wanted to note something similar.

Implicit fencing is basically just a workaround how the GPU HW used to work ~20years ago and only rarely makes sense today.

On the other hand putting it behind a flag might be acceptable when you want to interact with implicit synced HW, e.g. a GPU.

If you want to expose a dma_fence to userspace then using a sync_file as output from the IOCTL which kicks of the operation is usually the better approach.

>>
>> Why
>> ---
>> Modern Wayland compositors and any other userspace consumers that
>> import V4L2-produced dmabufs and want to do implicit synchronization
>> the spec-clean way (poll(POLLIN) on the dmabuf fd, or
>> DMA_BUF_IOCTL_EXPORT_SYNC_FILE for a sync_file) currently get either:
>>
>> 1. A stub fence from dma_buf_export_sync_file(), because the dmabuf's
>>    dma_resv has no fences populated. The kernel substitutes
>>    dma_fence_get_stub() which is permanently signalled. The compositor
>>    "successfully" waits on a fence that represents nothing real about
>>    the producer's state.
>> 2. A poll(POLLIN) on the dmabuf fd that returns immediately for the
>>    same reason — dma_buf_poll_add_cb finds zero fences in the resv,
>>    triggers the wake callback inline, and reports POLLIN ready before
>>    the producer has actually said anything.
>>
>> Today this works as a happy accident on most paths because clients
>> attach buffers after VIDIOC_DQBUF, which the userspace V4L2 contract
>> guarantees only returns a buffer after the producer is done. So the
>> implicit "the kernel's stub fence is fine because the buffer is
>> already complete by the time anyone polls it" assumption has held.
> 
> There is no accident, just saying. Have you studied also the other side of
> fences, the one that actually cause problem with Freedreno and Etnaviv ? To me
> these would be higher priority since they are known to cause "back flash" kind
> of bugs, specially for compositor that are not expecting GL driver to place
> implicit fences on imported (v4l2 allocated) buffers.
> 
>>
>> But:
>>
>> - It's a contract gap. The kernel claims to expose implicit sync; it
>>   does not, for V4L2 producers.
>> - It paid latency for nothing. Every Wayland frame from a V4L2
>>   producer pays a DMA_BUF_IOCTL_EXPORT_SYNC_FILE round-trip for a
>>   fence that's stub-signalled. On Mali-class hardware (RK3566 Wayland
>>   chrome video playback), this contributed to compositor stalls.
>>   Removing the wait at the compositor level is a workaround, not a
>>   fix.
>> - It blocks downstream consumers from doing the right thing. A
>>   Wayland compositor that defensively waits on a sync_file gets a
>>   stub-fence pass-through with no actual gating; if the V4L2 driver
>>   ever has an out-of-band path that releases the buffer before
>>   finishing the write, there is no fence to gate on.
> 
> Some things don't add up here. I think I want to remind that there is a contract
> in regard to delivering a fence to userspace. One of the most important aspect
> of fences is that they must in finit time be signaled, regardless what userspace
> decided to do next. And for that reason, you shouldn't deliver a fence to
> userspace if its not armed. In my reading, you are delivering that fence at
> QBUF(capture) time, just like what Gustavo was trying to do previously. Its even
> worse if you deliver it to your compositor allowing that compositor to hang
> forever by not feeding any bitstream.

+1

> Let's take Hantro driver as an example. The right moment to deliver the fence is
> either right before we set the DEC bit on the control register, or somewhere
> before that when you have bitstream, parameters and request queued. At that
> moment, you are guarantied that the decode will either finish or fail (yes, it
> can fail, and its extremely common with live stream, or when application calls
> streamoff, since in v4l2, we cancel work). Prior to that, user may starve the
> OUTPUT queue (the bitstream) and cause the fence to hold forever. This would
> break the contract I mention earlier.

Exactly that yes.

Regarding failed or canceled submission handling that is not so much of a problem since we also have proper error signaling on fences.

But what is usually problematic as well are things like dynamic memory management, e.g. when HW uses the PCIe PRI interface or similar functionality on other bus systems.

Getting the contract right is indeed really tricky but unfortunately mandatory because you otherwise run into deadlocks sooner or later.

> Though, if you attach the fence at that moment, you will need to design how to
> signal the fence readiness (rather then the data readiness). One idea would be
> (with userspace opting-in) to signal the queue at that moment. But then you
> can't do the memory management operation you would normally do in DQBUF. This of
> course don't apply to hantro, which has no device cache, but we can't design
> something in vb2 for the old HW. So we'd need to move memory management somwhere
> else, maybe buffer_done, though you have to carefully make sure in which context
> you do that, you can't sleep in an IRQ.
> 
> There is an obvious benefit of basing your solution on
> DMA_BUF_IOCTL_EXPORT_SYNC_FILE, once you get there, you'll discover that there
> is very little room in v4l2_buffer, and that was causing a lot of headache to
> previous people attempting this. Though, if we look forward, we could also
> consider this a feature of the media_request. Queuing a request could maybe
> deliver a fence, assuming few pre-condition that guarantee execution (or
> failure) are met. We've seen with DW100 recently that its rather easy to convert
> an existing m2m driver to request. The media API is a much more open canvas to
> design new mechanism. We could have a really simple ioctl that attach out fences
> to request, and in a future hook it to our own depedency manager.

My request with the DMA-buf maintainer hat on would be to correctly annotate your fence submission path the with lockdep primitives Sima wrote and then make a few test runs with lockdep enabled.

It's not a prove of correctness but this at least gives you a good hint if your locking order and memory allocation paths are correct.

Regards,
Christian.

> 
> I'm simply throwing ideas, I could have missed few things in your PoC, let me
> know.
> 
> Nicolas
> 
>> What
>> ----
>> Patch 1 adds:
>>
>> - struct dma_fence *release_fence to struct vb2_buffer
>> - u64 dma_resv_fence_context + atomic64_t dma_resv_fence_seqno +
>>   spinlock_t dma_resv_fence_lock to struct vb2_queue
>> - vb2_buffer_attach_release_fence(vb) — drivers call this from their
>>   buf_queue callback. Allocates a dma_fence on the queue's fence
>>   context, attaches it as DMA_RESV_USAGE_WRITE on each plane's
>>   dmabuf->resv. No-op for buffers without exported dmabufs.
>> - vb2_buffer_done() extended to signal+put the fence if attached,
>>   so the producer's completion signal lands in the resv synchronously
>>   with the userspace DQBUF wakeup.
>>
>> Patches 2 and 3 add a single call to the helper from hantro_buf_queue
>> and rga_buf_queue respectively. Both are demonstration drivers; other
>> vb2 drivers can opt in incrementally with the same one-line change.
>>
>> Tested on
>> ---------
>> PineTab2 (RK3566 / Mali-G52 panfrost / mainline 6.19.10, this series
>> backported), playing 1080p30 H.264 in chromium under KDE Plasma 6.6.4
>> Wayland. The test harness is the chromium-fourier patch series at
>> https://github.com/marfrit/fourier — chromium plus a KWin patch
>> that *previously bypassed* Transaction::watchDmaBuf because the
>> kernel-side fence was stub-signalled. With this series applied, the
>> bypass becomes unnecessary; KWin's fence wait completes correctly
>> because the fence now signals when hantro completes the capture
>> buffer write.
>>
>> End-to-end result before the kernel patch (chromium + Qt 6 patches +
>> KWin watchDmaBuf bypass): 1080p30 H.264 plays through, ~81% combined
>> chrome CPU, but the watchDmaBuf bypass weakens KWin's defenses against
>> misbehaving clients.
>>
>> End-to-end result after the kernel patch (chromium + Qt 6 patches +
>> plain unmodified KWin): 1080p30 H.264 plays through with the same CPU
>> profile, KWin's watchDmaBuf wait completes within microseconds against
>> the now-real producer fence, no defenses weakened.
>>
>> What's missing in this RFC
>> --------------------------
>> - Other vb2-using drivers don't opt in. Each maintainer should look
>>   at their driver and decide. The hantro + rga patches show the
>>   shape; copying it to other drivers should be straightforward.
>> - For drivers that have intermediate image-processor stages (e.g.
>>   CSI -> ISP -> user), the fence semantics across stage boundaries
>>   are out of scope here. This series only addresses the producer-to-
>>   userspace edge.
>> - No selftest. videobuf2 doesn't have a great in-tree selftest harness
>>   for dmabuf flows; the validation is end-to-end at the userspace
>>   consumer level (KWin, in our case).
>>
>> Reviews especially welcome on:
>>
>> - The decision to make this opt-in per driver vs. automatic for all
>>   vb2-CAPTURE queues. Auto-on would force every driver to be audited;
>>   opt-in is incremental and safer but leaves the contract gap for
>>   drivers nobody touches.
>> - Whether vb2_buffer_done is the right place to signal vs. an earlier
>>   hook (e.g. immediately after DMA-from-device finishes). For hantro
>>   the two are effectively the same; for drivers with asynchronous
>>   post-processing they may differ.
>> - The choice of DMA_RESV_USAGE_WRITE — we are emitting the producer's
>>   write completion, so WRITE matches dma-buf documentation, but a
>>   sanity check is welcome.
>>
>> Cheers,
>> Markus
>>
>>
>> Markus Fritsche (3):
>>   media: videobuf2: add dma_resv release-fence helper
>>   media: hantro: attach dma_resv release fence at buf_queue
>>   media: rockchip-rga: attach dma_resv release fence at buf_queue
>>
>>  .../media/common/videobuf2/videobuf2-core.c   | 95 +++++++++++++++++++
>>  drivers/media/platform/rockchip/rga/rga-buf.c | 10 ++
>>  .../media/platform/verisilicon/hantro_v4l2.c  | 12 +++
>>  include/media/videobuf2-core.h                | 29 ++++++
>>  4 files changed, 146 insertions(+)


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-04-30  6:51 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-29 19:53 [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Markus Fritsche
2026-04-29 19:53 ` [PATCH RFC 1/3] media: videobuf2: add dma_resv release-fence helper Markus Fritsche
2026-04-29 19:53 ` [PATCH RFC 2/3] media: hantro: attach dma_resv release fence at buf_queue Markus Fritsche
2026-04-29 19:53 ` [PATCH RFC 3/3] media: rockchip-rga: " Markus Fritsche
2026-04-29 21:22 ` [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports Nicolas Dufresne
2026-04-30  6:51   ` Christian König

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox