* [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations
[not found] <20260316043255.226352-1-matthew.brost@intel.com>
@ 2026-03-16 4:32 ` Matthew Brost
2026-03-25 15:59 ` Tejun Heo
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
` (2 subsequent siblings)
3 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-16 4:32 UTC (permalink / raw)
To: intel-xe; +Cc: dri-devel, Tejun Heo, Lai Jiangshan, linux-kernel
Drivers often use workqueues that run in reclaim paths (e.g., DRM
scheduler workqueues). It is useful to teach lockdep that memory
allocations which can recurse into reclaim (e.g., GFP_KERNEL) are not
allowed on these workqueues. Add an interface that taints a workqueue’s
lockdep state with reclaim.
Also add a helper to test whether a workqueue is reclaim annotated,
allowing drivers to enforce reclaim-safe behavior.
An example lockdep splat from such a violation:
[ 60.953095] =============================================
[ 73.023656] Console: switching to colour dummy device 80x25
[ 73.023684] [IGT] xe_exec_reset: executing
[ 73.038237] [IGT] xe_exec_reset: starting subtest gt-reset
[ 73.044163] xe 0000:03:00.0: [drm] Tile0: GT0: trying reset from force_reset_write [xe]
[ 73.044276] xe 0000:03:00.0: [drm] Tile0: GT0: reset queued
[ 73.045963] ======================================================
[ 73.052133] WARNING: possible circular locking dependency detected
[ 73.058302] 7.0.0-rc3-xe+ #31 Tainted: G U
[ 73.063866] ------------------------------------------------------
[ 73.070036] kworker/u64:5/158 is trying to acquire lock:
[ 73.075342] ffffffff829a87a0 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x39/0x420
[ 73.083791]
but task is already holding lock:
[ 73.089612] ffffc9000152fe60 ((work_completion)(gt->reset.worker)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x6a0
[ 73.099852]
which lock already depends on the new lock.
[ 73.108013]
the existing dependency chain (in reverse order) is:
[ 73.115481]
-> #2 ((work_completion)(gt->reset.worker)){+.+.}-{0:0}:
[ 73.123381] process_one_work+0x1ec/0x6a0
[ 73.127906] worker_thread+0x183/0x330
[ 73.132173] kthread+0xe2/0x120
[ 73.135833] ret_from_fork+0x289/0x2f0
[ 73.140101] ret_from_fork_asm+0x1a/0x30
[ 73.144540]
-> #1 ((wq_completion)gt-ordered-wq){+.+.}-{0:0}:
[ 73.151749] workqueue_warn_on_reclaim.part.0+0x32/0x50
[ 73.157487] alloc_workqueue_noprof+0xef/0x100
[ 73.162445] xe_gt_alloc+0x92/0x220 [xe]
[ 73.166954] xe_pci_probe+0x734/0x1660 [xe]
[ 73.171720] pci_device_probe+0x98/0x140
[ 73.176161] really_probe+0xcf/0x2c0
[ 73.180256] __driver_probe_device+0x6e/0x120
[ 73.185126] driver_probe_device+0x19/0x90
[ 73.189740] __driver_attach+0x89/0x140
[ 73.194091] bus_for_each_dev+0x79/0xd0
[ 73.198446] bus_add_driver+0xe6/0x210
[ 73.202712] driver_register+0x5b/0x110
[ 73.207064] 0xffffffffa00aa0db
[ 73.210724] do_one_initcall+0x59/0x2e0
[ 73.215077] do_init_module+0x5f/0x230
[ 73.219345] init_module_from_file+0xc7/0xe0
[ 73.224128] idempotent_init_module+0x176/0x270
[ 73.229175] __x64_sys_finit_module+0x61/0xb0
[ 73.234047] do_syscall_64+0x9b/0x540
[ 73.238228] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 73.243793]
-> #0 (fs_reclaim){+.+.}-{0:0}:
[ 73.249442] __lock_acquire+0x1496/0x2510
[ 73.253970] lock_acquire+0xbd/0x2f0
[ 73.258062] fs_reclaim_acquire+0x98/0xd0
[ 73.262586] __kmalloc_cache_noprof+0x39/0x420
[ 73.267545] gt_reset_worker+0x27/0x1f0 [xe]
[ 73.272385] process_one_work+0x213/0x6a0
[ 73.276910] worker_thread+0x183/0x330
[ 73.281178] kthread+0xe2/0x120
[ 73.284838] ret_from_fork+0x289/0x2f0
[ 73.289104] ret_from_fork_asm+0x1a/0x30
[ 73.293542]
other info that might help us debug this:
[ 73.301528] Chain exists of:
fs_reclaim --> (wq_completion)gt-ordered-wq --> (work_completion)(gt->reset.worker)
[ 73.314795] Possible unsafe locking scenario:
[ 73.320705] CPU0 CPU1
[ 73.325232] ---- ----
[ 73.329759] lock((work_completion)(gt->reset.worker));
[ 73.335148] lock((wq_completion)gt-ordered-wq);
[ 73.342359] lock((work_completion)(gt->reset.worker));
[ 73.350259] lock(fs_reclaim);
[ 73.353400]
*** DEADLOCK ***
v2:
- Add WQ flag to warn on reclaim violations (Tejun)
- Add a helper function to test if WQ is annotated
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
include/linux/workqueue.h | 3 +++
kernel/workqueue.c | 41 +++++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a4749f56398f..5ad3b92ddd75 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -403,6 +403,7 @@ enum wq_flags {
*/
WQ_POWER_EFFICIENT = 1 << 7,
WQ_PERCPU = 1 << 8, /* bound to a specific cpu */
+ WQ_MEM_WARN_ON_RECLAIM = 1 << 9, /* teach lockdep to warn on reclaim */
__WQ_DESTROYING = 1 << 15, /* internal: workqueue is destroying */
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
@@ -582,6 +583,8 @@ alloc_workqueue_lockdep_map(const char *fmt, unsigned int flags, int max_active,
extern void destroy_workqueue(struct workqueue_struct *wq);
+extern bool workqueue_is_reclaim_annotated(struct workqueue_struct *wq);
+
struct workqueue_attrs *alloc_workqueue_attrs_noprof(void);
#define alloc_workqueue_attrs(...) alloc_hooks(alloc_workqueue_attrs_noprof(__VA_ARGS__))
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641..9c2c3a503e2c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5872,6 +5872,45 @@ static struct workqueue_struct *__alloc_workqueue(const char *fmt,
return NULL;
}
+#ifdef CONFIG_LOCKDEP
+static void workqueue_warn_on_reclaim(struct workqueue_struct *wq)
+{
+ if (wq->flags & WQ_MEM_WARN_ON_RECLAIM) {
+ fs_reclaim_acquire(GFP_KERNEL);
+ lock_map_acquire(wq->lockdep_map);
+ lock_map_release(wq->lockdep_map);
+ fs_reclaim_release(GFP_KERNEL);
+ }
+}
+#else
+static void workqueue_warn_on_reclaim(struct workqueue_struct *wq)
+{
+}
+#endif
+
+/**
+ * workqueue_is_reclaim_annotated() - Test whether a workqueue is annotated for
+ * reclaim safety
+ * @wq: workqueue to test
+ *
+ * Returns true if @wq's flags have both %WQ_MEM_WARN_ON_RECLAIM and
+ * %WQ_MEM_RECLAIM set. A workqueue marked with these flags indicates that it
+ * participates in reclaim paths, and therefore must not perform memory
+ * allocations that can recurse into reclaim (e.g., GFP_KERNEL is not allowed).
+ *
+ * Drivers can use this helper to enforce reclaim-safe behavior on workqueues
+ * that are created or provided elsewhere in the code.
+ *
+ * Return:
+ * true if the workqueue is reclaim-annotated, false otherwise.
+ */
+bool workqueue_is_reclaim_annotated(struct workqueue_struct *wq)
+{
+ return (wq->flags & WQ_MEM_WARN_ON_RECLAIM) &&
+ (wq->flags & WQ_MEM_RECLAIM);
+}
+EXPORT_SYMBOL_GPL(workqueue_is_reclaim_annotated);
+
__printf(1, 4)
struct workqueue_struct *alloc_workqueue_noprof(const char *fmt,
unsigned int flags,
@@ -5887,6 +5926,7 @@ struct workqueue_struct *alloc_workqueue_noprof(const char *fmt,
return NULL;
wq_init_lockdep(wq);
+ workqueue_warn_on_reclaim(wq);
return wq;
}
@@ -5908,6 +5948,7 @@ alloc_workqueue_lockdep_map(const char *fmt, unsigned int flags,
return NULL;
wq->lockdep_map = lockdep_map;
+ workqueue_warn_on_reclaim(wq);
return wq;
}
--
2.34.1
* [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
[not found] <20260316043255.226352-1-matthew.brost@intel.com>
2026-03-16 4:32 ` [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations Matthew Brost
@ 2026-03-16 4:32 ` Matthew Brost
2026-03-16 9:16 ` Boris Brezillon
` (5 more replies)
2026-03-16 4:32 ` [RFC PATCH 11/12] accel/amdxdna: Convert to drm_dep scheduler layer Matthew Brost
2026-03-16 4:32 ` [RFC PATCH 12/12] drm/panthor: " Matthew Brost
3 siblings, 6 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-16 4:32 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Boris Brezillon, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
Diverging requirements between GPU drivers using firmware scheduling
and those using hardware scheduling have shown that drm_gpu_scheduler is
no longer sufficient for firmware-scheduled GPU drivers. The technical
debt, lack of memory-safety guarantees, absence of clear object-lifetime
rules, and numerous driver-specific hacks have rendered
drm_gpu_scheduler unmaintainable. It is time for a fresh design for
firmware-scheduled GPU drivers, one that addresses all of the
aforementioned shortcomings.
Add drm_dep, a lightweight GPU submission queue intended as a
replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
(e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
from the queue (drm_sched_entity) into two objects requiring external
coordination, drm_dep merges both roles into a single struct
drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
that is unnecessary for firmware schedulers which manage their own
run-lists internally.
Unlike drm_gpu_scheduler, which relies on external locking and lifetime
management by the driver, drm_dep uses reference counting (kref) on both
queues and jobs to guarantee object lifetime safety. A job holds a queue
reference from init until its last put, and the queue holds a job reference
from dispatch until the put_job worker runs. This makes use-after-free
impossible even when completion arrives from IRQ context or concurrent
teardown is in flight.
The core objects are:
struct drm_dep_queue - a per-context submission queue owning an
ordered submit workqueue, a TDR timeout workqueue, an SPSC job
queue, and a pending-job list. Reference counted; drivers can embed
it and provide a .release vfunc for RCU-safe teardown.
struct drm_dep_job - a single unit of GPU work. Drivers embed this
and provide a .release vfunc. Jobs carry an xarray of input
dma_fence dependencies and produce a drm_dep_fence as their
finished fence.
struct drm_dep_fence - a dma_fence subclass wrapping an optional
parent hardware fence. The finished fence is armed (sequence
number assigned) before submission and signals when the hardware
fence signals (or immediately on synchronous completion).
Job lifecycle:
1. drm_dep_job_init() - allocate and initialise; job acquires a
queue reference.
2. drm_dep_job_add_dependency() and friends - register input fences;
duplicates from the same context are deduplicated.
3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
4. drm_dep_job_push() - submit to queue.
Submission paths under queue lock:
- Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
SPSC queue is empty, no dependencies are pending, and credits are
available, the job is dispatched inline on the calling thread.
- Queued path: job is pushed onto the SPSC queue and the run_job
worker is kicked. The worker resolves remaining dependencies
(installing wakeup callbacks for unresolved fences) before calling
ops->run_job().
Credit-based throttling prevents hardware overflow: each job declares
a credit cost at init time; dispatch is deferred until sufficient
credits are available.
Timeout Detection and Recovery (TDR): a per-queue delayed work item
fires when the head pending job exceeds q->job.timeout jiffies, calling
ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
expiry for device teardown.
IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
allow drm_dep_job_done() to be called from hardirq context (e.g. a
dma_fence callback). Dependency cleanup is deferred to process context
after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
Zombie-state guard: workers use kref_get_unless_zero() on entry and
bail immediately if the queue refcount has already reached zero and
async teardown is in flight, preventing use-after-free.
Teardown is always deferred to a module-private workqueue (dep_free_wq)
so that destroy_workqueue() is never called from within one of the
queue's own workers. Each queue holds a drm_dev_get() reference on its
owning struct drm_device, released as the final step of teardown via
drm_dev_put(). This prevents the driver module from being unloaded
while any queue is still alive without requiring a separate drain API.
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Assisted-by: GitHub Copilot:claude-sonnet-4.6
---
drivers/gpu/drm/Kconfig | 4 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/dep/Makefile | 5 +
drivers/gpu/drm/dep/drm_dep_fence.c | 406 +++++++
drivers/gpu/drm/dep/drm_dep_fence.h | 25 +
drivers/gpu/drm/dep/drm_dep_job.c | 675 +++++++++++
drivers/gpu/drm/dep/drm_dep_job.h | 13 +
drivers/gpu/drm/dep/drm_dep_queue.c | 1647 +++++++++++++++++++++++++++
drivers/gpu/drm/dep/drm_dep_queue.h | 31 +
include/drm/drm_dep.h | 597 ++++++++++
10 files changed, 3404 insertions(+)
create mode 100644 drivers/gpu/drm/dep/Makefile
create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.c
create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.h
create mode 100644 drivers/gpu/drm/dep/drm_dep_job.c
create mode 100644 drivers/gpu/drm/dep/drm_dep_job.h
create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.c
create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.h
create mode 100644 include/drm/drm_dep.h
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 5386248e75b6..834f6e210551 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -276,6 +276,10 @@ config DRM_SCHED
tristate
depends on DRM
+config DRM_DEP
+ tristate
+ depends on DRM
+
# Separate option as not all DRM drivers use it
config DRM_PANEL_BACKLIGHT_QUIRKS
tristate
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index e97faabcd783..1ad87cc0e545 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -173,6 +173,7 @@ obj-y += clients/
obj-y += display/
obj-$(CONFIG_DRM_TTM) += ttm/
obj-$(CONFIG_DRM_SCHED) += scheduler/
+obj-$(CONFIG_DRM_DEP) += dep/
obj-$(CONFIG_DRM_RADEON)+= radeon/
obj-$(CONFIG_DRM_AMDGPU)+= amd/amdgpu/
obj-$(CONFIG_DRM_AMDGPU)+= amd/amdxcp/
diff --git a/drivers/gpu/drm/dep/Makefile b/drivers/gpu/drm/dep/Makefile
new file mode 100644
index 000000000000..335f1af46a7b
--- /dev/null
+++ b/drivers/gpu/drm/dep/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+drm_dep-y := drm_dep_queue.o drm_dep_job.o drm_dep_fence.o
+
+obj-$(CONFIG_DRM_DEP) += drm_dep.o
diff --git a/drivers/gpu/drm/dep/drm_dep_fence.c b/drivers/gpu/drm/dep/drm_dep_fence.c
new file mode 100644
index 000000000000..ae05b9077772
--- /dev/null
+++ b/drivers/gpu/drm/dep/drm_dep_fence.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+/**
+ * DOC: DRM dependency fence
+ *
+ * Each struct drm_dep_job has an associated struct drm_dep_fence that
+ * provides a single dma_fence (@finished) signalled when the hardware
+ * completes the job.
+ *
+ * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
+ * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
+ * is signalled once @parent signals (or immediately if run_job() returns
+ * NULL or an error).
+ *
+ * Drivers should expose @finished as the out-fence for GPU work since it is
+ * valid from the moment drm_dep_job_arm() returns, whereas the hardware fence
+ * could be a compound fence, which is disallowed when installed into
+ * drm_syncobjs or dma-resv.
+ *
+ * The fence uses the kernel's inline spinlock (NULL passed to dma_fence_init())
+ * so no separate lock allocation is required.
+ *
+ * Deadline propagation is supported: if a consumer sets a deadline via
+ * dma_fence_set_deadline(), it is forwarded to @parent when @parent is set.
+ * If @parent has not been set yet the deadline is stored in @deadline and
+ * forwarded at that point.
+ *
+ * Memory management: drm_dep_fence objects are allocated with kzalloc() and
+ * freed via kfree_rcu() once the fence is released, ensuring safety with
+ * RCU-protected fence accesses.
+ */
+
+#include <linux/slab.h>
+#include <drm/drm_dep.h>
+#include "drm_dep_fence.h"
+
+/**
+ * DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT - a fence deadline hint has been set
+ *
+ * Set by the deadline callback on the finished fence to indicate a deadline
+ * has been set which may need to be propagated to the parent hardware fence.
+ */
+#define DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT (DMA_FENCE_FLAG_USER_BITS + 1)
+
+/**
+ * struct drm_dep_fence - fence tracking the completion of a dep job
+ *
+ * Contains a single dma_fence (@finished) that is signalled when the
+ * hardware completes the job. The fence uses the kernel's inline_lock
+ * (no external spinlock required).
+ *
+ * This struct is private to the drm_dep module; external code interacts
+ * through the accessor functions declared in drm_dep_fence.h.
+ */
+struct drm_dep_fence {
+ /**
+ * @finished: signalled when the job completes on hardware.
+ *
+ * Drivers should use this fence as the out-fence for a job since it
+ * is available immediately upon drm_dep_job_arm().
+ */
+ struct dma_fence finished;
+
+ /**
+ * @deadline: deadline set on @finished which potentially needs to be
+ * propagated to @parent.
+ */
+ ktime_t deadline;
+
+ /**
+ * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
+ *
+ * @finished is signaled once @parent is signaled. The initial store is
+ * performed via smp_store_release to synchronize with deadline handling.
+ *
+ * All readers must access this under the fence lock and take a reference to
+ * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
+ * signals, and this drop also releases its internal reference.
+ */
+ struct dma_fence *parent;
+
+ /**
+ * @q: the queue this fence belongs to.
+ */
+ struct drm_dep_queue *q;
+};
+
+static const struct dma_fence_ops drm_dep_fence_ops;
+
+/**
+ * to_drm_dep_fence() - cast a dma_fence to its enclosing drm_dep_fence
+ * @f: dma_fence to cast
+ *
+ * Context: No context requirements (inline helper).
+ * Return: pointer to the enclosing &drm_dep_fence.
+ */
+static struct drm_dep_fence *to_drm_dep_fence(struct dma_fence *f)
+{
+ return container_of(f, struct drm_dep_fence, finished);
+}
+
+/**
+ * drm_dep_fence_set_parent() - store the hardware fence and propagate
+ * any deadline
+ * @dfence: dep fence
+ * @parent: hardware fence returned by &drm_dep_queue_ops.run_job, or NULL/error
+ *
+ * Stores @parent on @dfence under smp_store_release() so that a concurrent
+ * drm_dep_fence_set_deadline() call sees the parent before checking the
+ * deadline bit. If a deadline has already been set on @dfence->finished it is
+ * forwarded to @parent immediately. Does nothing if @parent is NULL or an
+ * error pointer.
+ *
+ * Context: Any context.
+ */
+void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
+ struct dma_fence *parent)
+{
+ if (IS_ERR_OR_NULL(parent))
+ return;
+
+ /*
+ * smp_store_release() to ensure a thread racing us in
+ * drm_dep_fence_set_deadline() sees the parent set before
+ * it calls test_bit(HAS_DEADLINE_BIT).
+ */
+ smp_store_release(&dfence->parent, dma_fence_get(parent));
+ if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT,
+ &dfence->finished.flags))
+ dma_fence_set_deadline(parent, dfence->deadline);
+}
+
+/**
+ * drm_dep_fence_finished() - signal the finished fence with a result
+ * @dfence: dep fence to signal
+ * @result: error code to set, or 0 for success
+ *
+ * Sets the fence error to @result if non-zero, then signals
+ * @dfence->finished. Also removes parent visibility under the fence lock
+ * and drops the parent reference. Dropping the parent here allows the
+ * DRM dep fence to be completely decoupled from the DRM dep module.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_fence_finished(struct drm_dep_fence *dfence, int result)
+{
+ struct dma_fence *parent;
+ unsigned long flags;
+
+ dma_fence_lock_irqsave(&dfence->finished, flags);
+ if (result)
+ dma_fence_set_error(&dfence->finished, result);
+ dma_fence_signal_locked(&dfence->finished);
+ parent = dfence->parent;
+ dfence->parent = NULL;
+ dma_fence_unlock_irqrestore(&dfence->finished, flags);
+
+ dma_fence_put(parent);
+}
+
+static const char *drm_dep_fence_get_driver_name(struct dma_fence *fence)
+{
+ return "drm_dep";
+}
+
+static const char *drm_dep_fence_get_timeline_name(struct dma_fence *f)
+{
+ struct drm_dep_fence *dfence = to_drm_dep_fence(f);
+
+ return dfence->q->name;
+}
+
+/**
+ * drm_dep_fence_get_parent() - get a reference to the parent hardware fence
+ * @dfence: dep fence to query
+ *
+ * Returns a new reference to @dfence->parent, or NULL if the parent has
+ * already been cleared (i.e. @dfence->finished has signalled and the parent
+ * reference was dropped under the fence lock).
+ *
+ * Uses smp_load_acquire() to pair with the smp_store_release() in
+ * drm_dep_fence_set_parent(), ensuring that if we race a concurrent
+ * drm_dep_fence_set_parent() call we observe the parent pointer only after
+ * the store is fully visible — before set_parent() tests
+ * %DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT.
+ *
+ * Caller must hold the fence lock on @dfence->finished.
+ *
+ * Context: Any context, fence lock on @dfence->finished must be held.
+ * Return: a new reference to the parent fence, or NULL.
+ */
+static struct dma_fence *drm_dep_fence_get_parent(struct drm_dep_fence *dfence)
+{
+ dma_fence_assert_held(&dfence->finished);
+
+ return dma_fence_get(smp_load_acquire(&dfence->parent));
+}
+
+/**
+ * drm_dep_fence_set_deadline() - dma_fence_ops deadline callback
+ * @f: fence on which the deadline is being set
+ * @deadline: the deadline hint to apply
+ *
+ * Stores the earliest deadline under the fence lock, then propagates
+ * it to the parent hardware fence via smp_load_acquire() to race
+ * safely with drm_dep_fence_set_parent().
+ *
+ * Context: Any context.
+ */
+static void drm_dep_fence_set_deadline(struct dma_fence *f, ktime_t deadline)
+{
+ struct drm_dep_fence *dfence = to_drm_dep_fence(f);
+ struct dma_fence *parent;
+ unsigned long flags;
+
+ dma_fence_lock_irqsave(f, flags);
+
+ /* If we already have an earlier deadline, keep it: */
+ if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
+ ktime_before(dfence->deadline, deadline)) {
+ dma_fence_unlock_irqrestore(f, flags);
+ return;
+ }
+
+ dfence->deadline = deadline;
+ set_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags);
+
+ parent = drm_dep_fence_get_parent(dfence);
+ dma_fence_unlock_irqrestore(f, flags);
+
+ if (parent)
+ dma_fence_set_deadline(parent, deadline);
+
+ dma_fence_put(parent);
+}
+
+static const struct dma_fence_ops drm_dep_fence_ops = {
+ .get_driver_name = drm_dep_fence_get_driver_name,
+ .get_timeline_name = drm_dep_fence_get_timeline_name,
+ .set_deadline = drm_dep_fence_set_deadline,
+};
+
+/**
+ * drm_dep_fence_alloc() - allocate a dep fence
+ *
+ * Allocates a &drm_dep_fence with kzalloc() without initialising the
+ * dma_fence. Call drm_dep_fence_init() to fully initialise it.
+ *
+ * Context: Process context.
+ * Return: new &drm_dep_fence on success, NULL on allocation failure.
+ */
+struct drm_dep_fence *drm_dep_fence_alloc(void)
+{
+ return kzalloc_obj(struct drm_dep_fence);
+}
+
+/**
+ * drm_dep_fence_init() - initialise the dma_fence inside a dep fence
+ * @dfence: dep fence to initialise
+ * @q: queue the owning job belongs to
+ *
+ * Initialises @dfence->finished using the context and sequence number from @q.
+ * Passes NULL as the lock so the fence uses its inline spinlock.
+ *
+ * Context: Any context.
+ */
+void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q)
+{
+ u32 seq = ++q->fence.seqno;
+
+ /*
+ * XXX: Inline fence hazard: currently all expected users of DRM dep
+ * hardware fences have a unique lockdep class. If that ever changes,
+ * we will need to assign a unique lockdep class here so lockdep knows
+ * this fence is allowed to nest with driver hardware fences.
+ */
+
+ dfence->q = q;
+ dma_fence_init(&dfence->finished, &drm_dep_fence_ops,
+ NULL, q->fence.context, seq);
+}
+
+/**
+ * drm_dep_fence_cleanup() - release a dep fence at job teardown
+ * @dfence: dep fence to clean up
+ *
+ * Called from drm_dep_job_fini(). If the dep fence was armed (refcount > 0)
+ * it is released via dma_fence_put() and will be freed by the RCU release
+ * callback once all waiters have dropped their references. If it was never
+ * armed it is freed directly with kfree().
+ *
+ * Context: Any context.
+ */
+void drm_dep_fence_cleanup(struct drm_dep_fence *dfence)
+{
+ if (drm_dep_fence_is_armed(dfence))
+ dma_fence_put(&dfence->finished);
+ else
+ kfree(dfence);
+}
+
+/**
+ * drm_dep_fence_is_armed() - check whether the fence has been armed
+ * @dfence: dep fence to check
+ *
+ * Returns true if drm_dep_job_arm() has been called, i.e. @dfence->finished
+ * has been initialised and its reference count is non-zero. Used by
+ * assertions to enforce correct job lifecycle ordering (arm before push,
+ * add_dependency before arm).
+ *
+ * Context: Any context.
+ * Return: true if the fence is armed, false otherwise.
+ */
+bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence)
+{
+ return !!kref_read(&dfence->finished.refcount);
+}
+
+/**
+ * drm_dep_fence_is_finished() - test whether the finished fence has signalled
+ * @dfence: dep fence to check
+ *
+ * Uses dma_fence_test_signaled_flag() to read %DMA_FENCE_FLAG_SIGNALED_BIT
+ * directly without invoking the fence's ->signaled() callback or triggering
+ * any signalling side-effects.
+ *
+ * Context: Any context.
+ * Return: true if @dfence->finished has been signalled, false otherwise.
+ */
+bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence)
+{
+ return dma_fence_test_signaled_flag(&dfence->finished);
+}
+
+/**
+ * drm_dep_fence_is_complete() - test whether the job has completed
+ * @dfence: dep fence to check
+ *
+ * Takes the fence lock on @dfence->finished and calls
+ * drm_dep_fence_get_parent() to safely obtain a reference to the parent
+ * hardware fence — or NULL if the parent has already been cleared after
+ * signalling. Calls dma_fence_is_signaled() on @parent outside the lock,
+ * which may invoke the fence's ->signaled() callback and trigger signalling
+ * side-effects if the fence has completed but the signalled flag has not yet
+ * been set. The finished fence is tested via dma_fence_test_signaled_flag(),
+ * without side-effects.
+ *
+ * May only be called on a stopped queue (see drm_dep_queue_is_stopped()).
+ *
+ * Context: Process context. The queue must be stopped before calling this.
+ * Return: true if the job is complete, false otherwise.
+ */
+bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence)
+{
+ struct dma_fence *parent;
+ unsigned long flags;
+ bool complete;
+
+ dma_fence_lock_irqsave(&dfence->finished, flags);
+ parent = drm_dep_fence_get_parent(dfence);
+ dma_fence_unlock_irqrestore(&dfence->finished, flags);
+
+ complete = (parent && dma_fence_is_signaled(parent)) ||
+ dma_fence_test_signaled_flag(&dfence->finished);
+
+ dma_fence_put(parent);
+
+ return complete;
+}
+
+/**
+ * drm_dep_fence_to_dma() - return the finished dma_fence for a dep fence
+ * @dfence: dep fence to query
+ *
+ * No reference is taken; the caller must hold its own reference to the owning
+ * &drm_dep_job for the duration of the access.
+ *
+ * Context: Any context.
+ * Return: the finished &dma_fence.
+ */
+struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence)
+{
+ return &dfence->finished;
+}
+
+/**
+ * drm_dep_fence_done() - signal the finished fence on job completion
+ * @dfence: dep fence to signal
+ * @result: job error code, or 0 on success
+ *
+ * Gets a temporary reference to @dfence->finished to guard against a racing
+ * last-put, signals the fence with @result, then drops the temporary
+ * reference. Called from drm_dep_job_done() in the queue core when a
+ * hardware completion callback fires or when run_job() returns immediately.
+ *
+ * Context: Any context.
+ */
+void drm_dep_fence_done(struct drm_dep_fence *dfence, int result)
+{
+ dma_fence_get(&dfence->finished);
+ drm_dep_fence_finished(dfence, result);
+ dma_fence_put(&dfence->finished);
+}
diff --git a/drivers/gpu/drm/dep/drm_dep_fence.h b/drivers/gpu/drm/dep/drm_dep_fence.h
new file mode 100644
index 000000000000..65a1582f858b
--- /dev/null
+++ b/drivers/gpu/drm/dep/drm_dep_fence.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _DRM_DEP_FENCE_H_
+#define _DRM_DEP_FENCE_H_
+
+#include <linux/dma-fence.h>
+
+struct drm_dep_fence;
+struct drm_dep_queue;
+
+struct drm_dep_fence *drm_dep_fence_alloc(void);
+void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q);
+void drm_dep_fence_cleanup(struct drm_dep_fence *dfence);
+void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
+ struct dma_fence *parent);
+void drm_dep_fence_done(struct drm_dep_fence *dfence, int result);
+bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence);
+bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence);
+bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence);
+struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence);
+
+#endif /* _DRM_DEP_FENCE_H_ */
diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
new file mode 100644
index 000000000000..2d012b29a5fc
--- /dev/null
+++ b/drivers/gpu/drm/dep/drm_dep_job.c
@@ -0,0 +1,675 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2015 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright © 2026 Intel Corporation
+ */
+
+/**
+ * DOC: DRM dependency job
+ *
+ * A struct drm_dep_job represents a single unit of GPU work associated with
+ * a struct drm_dep_queue. The lifecycle of a job is:
+ *
+ * 1. **Allocation**: the driver allocates memory for the job (typically by
+ * embedding struct drm_dep_job in a larger structure) and calls
+ * drm_dep_job_init() to initialise it. On success the job holds one
+ * kref reference and a reference to its queue.
+ *
+ * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
+ * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
+ * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
+ * that must be signalled before the job can run. Duplicate fences from the
+ * same fence context are deduplicated automatically.
+ *
+ * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
+ * consuming a sequence number from the queue. After arming,
+ * drm_dep_job_finished_fence() returns a valid fence that may be passed to
+ * userspace or used as a dependency by other jobs.
+ *
+ * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
+ * queue takes a reference that it holds until the job's finished fence
+ * signals and the job is freed by the put_job worker.
+ *
+ * 5. **Completion**: when the job's hardware work finishes, its finished fence
+ * is signalled and drm_dep_job_put() is called by the queue. The driver
+ * must release any driver-private resources in &drm_dep_job_ops.release.
+ *
+ * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
+ * internal drm_dep_job_fini() tears down the dependency xarray and fence
+ * objects before the driver's release callback is invoked.
+ */
+
+#include <linux/dma-resv.h>
+#include <linux/kref.h>
+#include <linux/slab.h>
+#include <drm/drm_dep.h>
+#include <drm/drm_file.h>
+#include <drm/drm_gem.h>
+#include <drm/drm_syncobj.h>
+#include "drm_dep_fence.h"
+#include "drm_dep_job.h"
+#include "drm_dep_queue.h"
+
+/**
+ * drm_dep_job_init() - initialise a dep job
+ * @job: dep job to initialise
+ * @args: initialisation arguments
+ *
+ * Initialises @job with the queue, ops and credit count from @args. Acquires
+ * a reference to @args->q via drm_dep_queue_get(); this reference is held for
+ * the lifetime of the job and released by drm_dep_job_release() when the last
+ * job reference is dropped.
+ *
+ * Resources are released automatically when the last reference is dropped
+ * via drm_dep_job_put(), which must be called to release the job; drivers
+ * must not free the job directly.
+ *
+ * Context: Process context. Allocates memory with GFP_KERNEL.
+ * Return: 0 on success, -%EINVAL if credits is 0,
+ * -%ENOMEM on fence allocation failure.
+ */
+int drm_dep_job_init(struct drm_dep_job *job,
+ const struct drm_dep_job_init_args *args)
+{
+ if (unlikely(!args->credits)) {
+ pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
+ return -EINVAL;
+ }
+
+ memset(job, 0, sizeof(*job));
+
+ job->dfence = drm_dep_fence_alloc();
+ if (!job->dfence)
+ return -ENOMEM;
+
+ job->ops = args->ops;
+ job->q = drm_dep_queue_get(args->q);
+ job->credits = args->credits;
+
+ kref_init(&job->refcount);
+ xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
+ INIT_LIST_HEAD(&job->pending_link);
+
+ return 0;
+}
+EXPORT_SYMBOL(drm_dep_job_init);
+
+/**
+ * drm_dep_job_drop_dependencies() - release all input dependency fences
+ * @job: dep job whose dependency xarray to drain
+ *
+ * Walks @job->dependencies, puts each fence, and destroys the xarray.
+ * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
+ * i.e. slots that were pre-allocated but never replaced — are silently
+ * skipped; the sentinel carries no reference. Called from
+ * drm_dep_queue_run_job() in process context immediately after
+ * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
+ * dependencies here — while still in process context — avoids calling
+ * xa_destroy() from IRQ context if the job's last reference is later
+ * dropped from a dma_fence callback.
+ *
+ * Context: Process context.
+ */
+void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
+{
+ struct dma_fence *fence;
+ unsigned long index;
+
+ xa_for_each(&job->dependencies, index, fence) {
+ if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
+ continue;
+ dma_fence_put(fence);
+ }
+ xa_destroy(&job->dependencies);
+}
+
+/**
+ * drm_dep_job_fini() - clean up a dep job
+ * @job: dep job to clean up
+ *
+ * Cleans up the dep fence and drops the queue reference held by @job.
+ *
+ * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
+ * the dependency xarray is also released here. For armed jobs the xarray
+ * has already been drained by drm_dep_job_drop_dependencies() in process
+ * context immediately after run_job(), so it is left untouched to avoid
+ * calling xa_destroy() from IRQ context.
+ *
+ * Warns if @job is still linked on the queue's pending list, which would
+ * indicate a bug in the teardown ordering.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_job_fini(struct drm_dep_job *job)
+{
+ bool armed = drm_dep_fence_is_armed(job->dfence);
+
+ WARN_ON(!list_empty(&job->pending_link));
+
+ drm_dep_fence_cleanup(job->dfence);
+ job->dfence = NULL;
+
+ /*
+ * Armed jobs have their dependencies drained by
+ * drm_dep_job_drop_dependencies() in process context after run_job().
+ * Skip here to avoid calling xa_destroy() from IRQ context.
+ */
+ if (!armed)
+ drm_dep_job_drop_dependencies(job);
+}
+
+/**
+ * drm_dep_job_get() - acquire a reference to a dep job
+ * @job: dep job to acquire a reference on, or NULL
+ *
+ * Context: Any context.
+ * Return: @job with an additional reference held, or NULL if @job is NULL.
+ */
+struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
+{
+ if (job)
+ kref_get(&job->refcount);
+ return job;
+}
+EXPORT_SYMBOL(drm_dep_job_get);
+
+/**
+ * drm_dep_job_release() - kref release callback for a dep job
+ * @kref: kref embedded in the dep job
+ *
+ * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
+ * otherwise frees @job with kfree(). Finally, releases the queue reference
+ * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
+ * queue put is performed last to ensure no queue state is accessed after
+ * the job memory is freed.
+ *
+ * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
+ * job's queue; otherwise process context only, as the release callback may
+ * sleep.
+ */
+static void drm_dep_job_release(struct kref *kref)
+{
+ struct drm_dep_job *job =
+ container_of(kref, struct drm_dep_job, refcount);
+ struct drm_dep_queue *q = job->q;
+
+ drm_dep_job_fini(job);
+
+ if (job->ops && job->ops->release)
+ job->ops->release(job);
+ else
+ kfree(job);
+
+ drm_dep_queue_put(q);
+}
+
+/**
+ * drm_dep_job_put() - release a reference to a dep job
+ * @job: dep job to release a reference on, or NULL
+ *
+ * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
+ * otherwise frees @job with kfree(). Does nothing if @job is NULL.
+ *
+ * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
+ * job's queue; otherwise process context only, as the release callback may
+ * sleep.
+ */
+void drm_dep_job_put(struct drm_dep_job *job)
+{
+ if (job)
+ kref_put(&job->refcount, drm_dep_job_release);
+}
+EXPORT_SYMBOL(drm_dep_job_put);
+
+/**
+ * drm_dep_job_arm() - arm a dep job for submission
+ * @job: dep job to arm
+ *
+ * Initialises the finished fence on @job->dfence, assigning
+ * it a sequence number from the job's queue. Must be called after
+ * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
+ * drm_dep_job_finished_fence() returns a valid fence that may be passed to
+ * userspace or used as a dependency by other jobs.
+ *
+ * Begins the DMA fence signalling path via dma_fence_begin_signalling().
+ * After this point, memory allocations that could trigger reclaim are
+ * forbidden; lockdep enforces this. arm() must always be paired with
+ * drm_dep_job_push(); lockdep also enforces this pairing.
+ *
+ * Warns if the job has already been armed.
+ *
+ * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
+ * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
+ * path.
+ */
+void drm_dep_job_arm(struct drm_dep_job *job)
+{
+ drm_dep_queue_push_job_begin(job->q);
+ WARN_ON(drm_dep_fence_is_armed(job->dfence));
+ drm_dep_fence_init(job->dfence, job->q);
+ job->signalling_cookie = dma_fence_begin_signalling();
+}
+EXPORT_SYMBOL(drm_dep_job_arm);
+
+/**
+ * drm_dep_job_push() - submit a job to its queue for execution
+ * @job: dep job to push
+ *
+ * Submits @job to the queue it was initialised with. Must be called after
+ * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
+ * held until the queue is fully done with it. The reference is released
+ * directly in the finished-fence dma_fence callback for queues with
+ * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
+ * from hardirq context), or via the put_job work item on the submit
+ * workqueue otherwise.
+ *
+ * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
+ * dma_fence_end_signalling(). This must be paired with arm(); lockdep
+ * enforces the pairing.
+ *
+ * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
+ * @job exactly once, even if the queue is killed or torn down before the
+ * job reaches the head of the queue. Drivers can use this guarantee to
+ * perform bookkeeping cleanup; the actual backend operation should be
+ * skipped when drm_dep_queue_is_killed() returns true.
+ *
+ * If the queue does not support the bypass path, the job is pushed directly
+ * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
+ * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
+ * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
+ * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
+ *
+ * Warns if the job has not been armed.
+ *
+ * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
+ * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
+ * path.
+ */
+void drm_dep_job_push(struct drm_dep_job *job)
+{
+ struct drm_dep_queue *q = job->q;
+
+ WARN_ON(!drm_dep_fence_is_armed(job->dfence));
+
+ drm_dep_job_get(job);
+
+ if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
+ drm_dep_queue_push_job(q, job);
+ dma_fence_end_signalling(job->signalling_cookie);
+ drm_dep_queue_push_job_end(job->q);
+ return;
+ }
+
+ scoped_guard(mutex, &q->sched.lock) {
+ if (drm_dep_queue_can_job_bypass(q, job))
+ drm_dep_queue_run_job(q, job);
+ else
+ drm_dep_queue_push_job(q, job);
+ }
+
+ dma_fence_end_signalling(job->signalling_cookie);
+ drm_dep_queue_push_job_end(job->q);
+}
+EXPORT_SYMBOL(drm_dep_job_push);
+
+/**
+ * drm_dep_job_add_dependency() - adds the fence as a job dependency
+ * @job: dep job to add the dependencies to
+ * @fence: the dma_fence to add to the list of dependencies, or
+ * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
+ *
+ * Note that @fence is consumed in both the success and error cases (except
+ * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
+ *
+ * Signalled fences and fences belonging to the same queue as @job (i.e. where
+ * fence->context matches the queue's finished fence context) are silently
+ * dropped; the job need not wait on its own queue's output.
+ *
+ * Warns if the job has already been armed (dependencies must be added before
+ * drm_dep_job_arm()).
+ *
+ * **Pre-allocation pattern**
+ *
+ * When multiple jobs across different queues must be prepared and submitted
+ * together in a single atomic commit — for example, where job A's finished
+ * fence is an input dependency of job B — all jobs must be armed and pushed
+ * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
+ * region. Once that region has started no memory allocation is permitted.
+ *
+ * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
+ * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
+ * to pre-allocate a slot in @job->dependencies. The slot index assigned by
+ * the underlying xarray must be tracked by the caller separately (e.g. it is
+ * always index 0 when the dependency array is empty, a property Xe relies on).
+ * After all jobs have been armed and the finished fences are available, call
+ * drm_dep_job_replace_dependency() with that index and the real fence.
+ * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
+ * called from atomic or signalling context.
+ *
+ * The sentinel slot is never skipped by the signalled-fence fast-path,
+ * ensuring a slot is always allocated even when the real fence is not yet
+ * known.
+ *
+ * **Example: bind job feeding TLB invalidation jobs**
+ *
+ * Consider a GPU with separate queues for page-table bind operations and for
+ * TLB invalidation. A single atomic commit must:
+ *
+ * 1. Run a bind job that modifies page tables.
+ * 2. Run one TLB-invalidation job per MMU that depends on the bind
+ * completing, so stale translations are flushed before the engines
+ * continue.
+ *
+ * Because all jobs must be armed and pushed inside a signalling region (where
+ * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
+ *
+ * // Phase 1 — process context, GFP_KERNEL allowed
+ * drm_dep_job_init(bind_job, bind_queue, ops);
+ * for_each_mmu(mmu) {
+ * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
+ * // Pre-allocate slot at index 0; real fence not available yet
+ * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
+ * }
+ *
+ * // Phase 2 — inside signalling region, no GFP_KERNEL
+ * dma_fence_begin_signalling();
+ * drm_dep_job_arm(bind_job);
+ * for_each_mmu(mmu) {
+ * // Swap sentinel for bind job's finished fence
+ * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
+ * dma_fence_get(bind_job->finished));
+ * drm_dep_job_arm(tlb_job[mmu]);
+ * }
+ * drm_dep_job_push(bind_job);
+ * for_each_mmu(mmu)
+ * drm_dep_job_push(tlb_job[mmu]);
+ * dma_fence_end_signalling();
+ *
+ * Context: Process context. May allocate memory with GFP_KERNEL.
+ * Return: if @fence is %DRM_DEP_JOB_FENCE_PREALLOC, the index of the
+ * pre-allocated slot on success; otherwise 0 on success. Negative error
+ * code on failure.
+ */
+int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
+{
+ struct drm_dep_queue *q = job->q;
+ struct dma_fence *entry;
+ unsigned long index;
+ u32 id = 0;
+ int ret;
+
+ WARN_ON(drm_dep_fence_is_armed(job->dfence));
+ might_alloc(GFP_KERNEL);
+
+ if (!fence)
+ return 0;
+
+ if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
+ goto add_fence;
+
+ /*
+ * Ignore signalled fences or fences from our own queue — finished
+ * fences use q->fence.context.
+ */
+ if (dma_fence_test_signaled_flag(fence) ||
+ fence->context == q->fence.context) {
+ dma_fence_put(fence);
+ return 0;
+ }
+
+ /*
+ * Deduplicate if we already depend on a fence from the same context.
+ * This lets the size of the array of deps scale with the number of
+ * engines involved, rather than the number of BOs.
+ */
+ xa_for_each(&job->dependencies, index, entry) {
+ if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
+ entry->context != fence->context)
+ continue;
+
+ if (dma_fence_is_later(fence, entry)) {
+ dma_fence_put(entry);
+ xa_store(&job->dependencies, index, fence, GFP_KERNEL);
+ } else {
+ dma_fence_put(fence);
+ }
+ return 0;
+ }
+
+add_fence:
+ ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
+ GFP_KERNEL);
+ if (ret != 0) {
+ if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
+ dma_fence_put(fence);
+ return ret;
+ }
+
+ return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
+}
+EXPORT_SYMBOL(drm_dep_job_add_dependency);
+
+/**
+ * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
+ * @job: dep job to update
+ * @index: xarray index of the slot to replace, as returned when the sentinel
+ * was originally inserted via drm_dep_job_add_dependency()
+ * @fence: the real dma_fence to store; its reference is always consumed
+ *
+ * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
+ * @job->dependencies with @fence. The slot must have been pre-allocated by
+ * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
+ * existing entry is asserted to be the sentinel.
+ *
+ * This is the second half of the pre-allocation pattern described in
+ * drm_dep_job_add_dependency(). It is intended to be called inside a
+ * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
+ * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
+ * internally so it is safe to call from atomic or signalling context, but
+ * since the slot has been pre-allocated no actual memory allocation occurs.
+ *
+ * If @fence is already signalled the slot is erased rather than storing a
+ * redundant dependency. The successful store is asserted — if the store
+ * fails it indicates a programming error (slot index out of range or
+ * concurrent modification).
+ *
+ * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
+ *
+ * Context: Any context. DMA fence signaling path.
+ */
+void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
+ struct dma_fence *fence)
+{
+ WARN_ON(xa_load(&job->dependencies, index) !=
+ DRM_DEP_JOB_FENCE_PREALLOC);
+
+ if (dma_fence_test_signaled_flag(fence)) {
+ xa_erase(&job->dependencies, index);
+ dma_fence_put(fence);
+ return;
+ }
+
+ if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
+ GFP_NOWAIT)))) {
+ dma_fence_put(fence);
+ return;
+ }
+}
+EXPORT_SYMBOL(drm_dep_job_replace_dependency);
+
+/**
+ * drm_dep_job_add_syncobj_dependency() - adds a syncobj's fence as a
+ * job dependency
+ * @job: dep job to add the dependencies to
+ * @file: drm file private pointer
+ * @handle: syncobj handle to lookup
+ * @point: timeline point
+ *
+ * This adds the fence matching the given syncobj to @job.
+ *
+ * Context: Process context.
+ * Return: 0 on success, or a negative error code.
+ */
+int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
+ struct drm_file *file, u32 handle,
+ u32 point)
+{
+ struct dma_fence *fence;
+ int ret;
+
+ ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
+ if (ret)
+ return ret;
+
+ return drm_dep_job_add_dependency(job, fence);
+}
+EXPORT_SYMBOL(drm_dep_job_add_syncobj_dependency);
+
+/**
+ * drm_dep_job_add_resv_dependencies() - add all fences from the resv to the job
+ * @job: dep job to add the dependencies to
+ * @resv: the dma_resv object to get the fences from
+ * @usage: the dma_resv_usage to use to filter the fences
+ *
+ * This adds all fences matching the given usage from @resv to @job.
+ * Must be called with the @resv lock held.
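+ *
+ * A typical call site (sketch; @bo is an illustrative GEM object pointer
+ * and the usage value is only an example)::
+ *
+ * dma_resv_lock(bo->resv, NULL);
+ * err = drm_dep_job_add_resv_dependencies(job, bo->resv,
+ * DMA_RESV_USAGE_KERNEL);
+ * dma_resv_unlock(bo->resv);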
+ *
+ * Context: Process context.
+ * Return: 0 on success, or a negative error code.
+ */
+int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
+ struct dma_resv *resv,
+ enum dma_resv_usage usage)
+{
+ struct dma_resv_iter cursor;
+ struct dma_fence *fence;
+ int ret;
+
+ dma_resv_assert_held(resv);
+
+ dma_resv_for_each_fence(&cursor, resv, usage, fence) {
+ /*
+ * As drm_dep_job_add_dependency always consumes the fence
+ * reference (even when it fails), and dma_resv_for_each_fence
+ * is not obtaining one, we need to grab one before calling.
+ */
+ ret = drm_dep_job_add_dependency(job, dma_fence_get(fence));
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(drm_dep_job_add_resv_dependencies);
+
+/**
+ * drm_dep_job_add_implicit_dependencies() - adds implicit dependencies
+ * as job dependencies
+ * @job: dep job to add the dependencies to
+ * @obj: the gem object to add new dependencies from.
+ * @write: whether the job might write the object (so we need to depend on
+ * shared fences in the reservation object).
+ *
+ * This should be called after drm_gem_lock_reservations() on your array of
+ * GEM objects used in the job but before updating the reservations with your
+ * own fences.
+ *
+ * Context: Process context.
+ * Return: 0 on success, or a negative error code.
+ */
+int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
+ struct drm_gem_object *obj,
+ bool write)
+{
+ return drm_dep_job_add_resv_dependencies(job, obj->resv,
+ dma_resv_usage_rw(write));
+}
+EXPORT_SYMBOL(drm_dep_job_add_implicit_dependencies);
+
+/**
+ * drm_dep_job_is_signaled() - check whether a dep job has completed
+ * @job: dep job to check
+ *
+ * Determines whether @job has signalled. The queue should be stopped before
+ * calling this to obtain a stable snapshot of state. Both the parent hardware
+ * fence and the finished software fence are checked.
+ *
+ * Context: Process context. The queue must be stopped before calling this.
+ * Return: true if the job is signalled, false otherwise.
+ */
+bool drm_dep_job_is_signaled(struct drm_dep_job *job)
+{
+ WARN_ON(!drm_dep_queue_is_stopped(job->q));
+ return drm_dep_fence_is_complete(job->dfence);
+}
+EXPORT_SYMBOL(drm_dep_job_is_signaled);
+
+/**
+ * drm_dep_job_is_finished() - test whether a dep job's finished fence has signalled
+ * @job: dep job to check
+ *
+ * Tests whether the job's software finished fence has been signalled, using
+ * dma_fence_test_signaled_flag() to avoid any signalling side-effects. Unlike
+ * drm_dep_job_is_signaled(), this does not require the queue to be stopped and
+ * does not check the parent hardware fence — it is a lightweight test of the
+ * finished fence only.
+ *
+ * Context: Any context.
+ * Return: true if the job's finished fence has been signalled, false otherwise.
+ */
+bool drm_dep_job_is_finished(struct drm_dep_job *job)
+{
+ return drm_dep_fence_is_finished(job->dfence);
+}
+EXPORT_SYMBOL(drm_dep_job_is_finished);
+
+/**
+ * drm_dep_job_invalidate_job() - increment the invalidation count for a job
+ * @job: dep job to invalidate
+ * @threshold: threshold above which the job is considered invalidated
+ *
+ * Increments @job->invalidate_count and returns true if it exceeds @threshold,
+ * indicating the job should be considered hung and discarded. The queue must
+ * be stopped before calling this function.
+ *
+ * Context: Process context. The queue must be stopped before calling this.
+ * Return: true if @job->invalidate_count exceeds @threshold, false otherwise.
+ */
+bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold)
+{
+ WARN_ON(!drm_dep_queue_is_stopped(job->q));
+ return ++job->invalidate_count > threshold;
+}
+EXPORT_SYMBOL(drm_dep_job_invalidate_job);
+
+/**
+ * drm_dep_job_finished_fence() - return the finished fence for a job
+ * @job: dep job to query
+ *
+ * No reference is taken on the returned fence; the caller must hold its own
+ * reference to @job for the duration of any access.
+ *
+ * Context: Any context.
+ * Return: the finished &dma_fence for @job.
+ */
+struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job)
+{
+ return drm_dep_fence_to_dma(job->dfence);
+}
+EXPORT_SYMBOL(drm_dep_job_finished_fence);
diff --git a/drivers/gpu/drm/dep/drm_dep_job.h b/drivers/gpu/drm/dep/drm_dep_job.h
new file mode 100644
index 000000000000..35c61d258fa1
--- /dev/null
+++ b/drivers/gpu/drm/dep/drm_dep_job.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _DRM_DEP_JOB_H_
+#define _DRM_DEP_JOB_H_
+
+struct drm_dep_queue;
+
+void drm_dep_job_drop_dependencies(struct drm_dep_job *job);
+
+#endif /* _DRM_DEP_JOB_H_ */
diff --git a/drivers/gpu/drm/dep/drm_dep_queue.c b/drivers/gpu/drm/dep/drm_dep_queue.c
new file mode 100644
index 000000000000..dac02d0d22c4
--- /dev/null
+++ b/drivers/gpu/drm/dep/drm_dep_queue.c
@@ -0,0 +1,1647 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2015 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright © 2026 Intel Corporation
+ */
+
+/**
+ * DOC: DRM dependency queue
+ *
+ * The drm_dep subsystem provides a lightweight GPU submission queue that
+ * combines the roles of drm_gpu_scheduler and drm_sched_entity into a
+ * single object (struct drm_dep_queue). Each queue owns its own ordered
+ * submit workqueue, timeout workqueue, and TDR delayed-work.
+ *
+ * **Job lifecycle**
+ *
+ * 1. Allocate and initialise a job with drm_dep_job_init().
+ * 2. Add dependency fences with drm_dep_job_add_dependency() and friends.
+ * 3. Arm the job with drm_dep_job_arm() to obtain its out-fences.
+ * 4. Submit with drm_dep_job_push().
+ *
+ * **Submission paths**
+ *
+ * drm_dep_job_push() decides between two paths under @q->sched.lock:
+ *
+ * - **Bypass path** (drm_dep_queue_can_job_bypass()): if
+ * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the queue is not stopped,
+ * the SPSC queue is empty, the job has no dependency fences, and credits
+ * are available, the job is submitted inline on the calling thread without
+ * touching the submit workqueue.
+ *
+ * - **Queued path** (drm_dep_queue_push_job()): the job is pushed onto an
+ * SPSC queue and the run_job worker is kicked. The run_job worker pops the
+ * job, resolves any remaining dependency fences (installing wakeup
+ * callbacks for unresolved ones), and calls drm_dep_queue_run_job().
+ *
+ * **Running a job**
+ *
+ * drm_dep_queue_run_job() accounts credits, appends the job to the pending
+ * list (starting the TDR timer only when the list was previously empty),
+ * calls @ops->run_job(), stores the returned hardware fence as the parent
+ * of the job's dep fence, then installs a callback on it. When the hardware
+ * fence fires (or the job completes synchronously), drm_dep_job_done()
+ * signals the finished fence, returns credits, and kicks the put_job worker
+ * to free the job.
+ *
+ * **Timeout detection and recovery (TDR)**
+ *
+ * A delayed work item fires when a job on the pending list takes longer than
+ * @q->job.timeout jiffies. It calls @ops->timedout_job() and acts on the
+ * returned status (%DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED or
+ * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB).
+ * drm_dep_queue_trigger_timeout() forces the timer to fire immediately (without
+ * changing the stored timeout), for example during device teardown.
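+ *
+ * A driver's timeout handler might look like the following sketch (the
+ * handler name and exact signature are illustrative; the real signature
+ * is defined by &drm_dep_queue_ops.timedout_job)::
+ *
+ * static enum drm_dep_timedout_stat my_timedout_job(struct drm_dep_job *job)
+ * {
+ * // the job may have completed between the timer firing and now
+ * if (drm_dep_job_is_finished(job))
+ * return DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED;
+ *
+ * // reset the backend, then ask for the job to be requeued
+ * return DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB;
+ * }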
+ *
+ * **Reference counting**
+ *
+ * Jobs and queues are both reference counted.
+ *
+ * A job holds a reference to its queue from drm_dep_job_init() until
+ * drm_dep_job_put() drops the job's last reference and its release callback
+ * runs. This ensures the queue remains valid for the entire lifetime of any
+ * job that was submitted to it.
+ *
+ * The queue holds its own reference to a job for as long as the job is
+ * internally tracked: from the moment the job is added to the pending list
+ * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
+ * worker, which calls drm_dep_job_put() to release that reference.
+ *
+ * **Hazard: use-after-free from within a worker**
+ *
+ * Because a job holds a queue reference, drm_dep_job_put() dropping the last
+ * job reference will also drop a queue reference via the job's release path.
+ * If that happens to be the last queue reference, drm_dep_queue_fini() can be
+ * called, which queues @q->free_work on dep_free_wq and returns immediately.
+ * free_work calls disable_work_sync() / disable_delayed_work_sync() on the
+ * queue's own workers before destroying its workqueues, so in practice a
+ * running worker always completes before the queue memory is freed.
+ *
+ * However, there is a secondary hazard: a worker can be queued while the
+ * queue is in a "zombie" state — refcount has already reached zero and async
+ * teardown is in flight, but the work item has not yet been disabled by
+ * free_work. To guard against this every worker uses
+ * drm_dep_queue_get_unless_zero() at entry; if the refcount is already zero
+ * the worker bails immediately without touching the queue state.
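+ *
+ * Each worker therefore opens with a guard of the following form (sketch;
+ * the worker function name is illustrative)::
+ *
+ * static void drm_dep_queue_run_job_work(struct work_struct *w)
+ * {
+ * struct drm_dep_queue *q =
+ * container_of(w, struct drm_dep_queue, sched.run_job);
+ *
+ * if (!drm_dep_queue_get_unless_zero(q))
+ * return; // zombie queue, teardown already in flight
+ *
+ * // ... pop and run jobs ...
+ *
+ * drm_dep_queue_put(q);
+ * }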
+ *
+ * Because all actual teardown (disable_*_sync, destroy_workqueue) runs on
+ * dep_free_wq — which is independent of the queue's own submit/timeout
+ * workqueues — there is no deadlock risk. Each queue holds a drm_dev_get()
+ * reference on its owning &drm_device, which is released as the last step of
+ * teardown. This ensures the driver module cannot be unloaded while any queue
+ * is still alive.
+ */
+
+#include <linux/dma-resv.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/overflow.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+#include <drm/drm_dep.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_print.h>
+#include "drm_dep_fence.h"
+#include "drm_dep_job.h"
+#include "drm_dep_queue.h"
+
+/*
+ * Dedicated workqueue for deferred drm_dep_queue teardown. Using a
+ * module-private WQ instead of system_percpu_wq keeps teardown isolated
+ * from unrelated kernel subsystems.
+ */
+static struct workqueue_struct *dep_free_wq;
+
+/**
+ * drm_dep_queue_flags_set() - set a flag on the queue under sched.lock
+ * @q: dep queue
+ * @flag: flag to set (one of &enum drm_dep_queue_flags)
+ *
+ * Sets @flag in @q->sched.flags. Must be called with @q->sched.lock
+ * held; the lockdep assertion enforces this.
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ */
+static void drm_dep_queue_flags_set(struct drm_dep_queue *q,
+ enum drm_dep_queue_flags flag)
+{
+ lockdep_assert_held(&q->sched.lock);
+ q->sched.flags |= flag;
+}
+
+/**
+ * drm_dep_queue_flags_clear() - clear a flag on the queue under sched.lock
+ * @q: dep queue
+ * @flag: flag to clear (one of &enum drm_dep_queue_flags)
+ *
+ * Clears @flag in @q->sched.flags. Must be called with @q->sched.lock
+ * held; the lockdep assertion enforces this.
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ */
+static void drm_dep_queue_flags_clear(struct drm_dep_queue *q,
+ enum drm_dep_queue_flags flag)
+{
+ lockdep_assert_held(&q->sched.lock);
+ q->sched.flags &= ~flag;
+}
+
+/**
+ * drm_dep_queue_has_credits() - check whether the queue has enough credits
+ * @q: dep queue
+ * @job: job requesting credits
+ *
+ * Checks whether the queue has enough available credits to dispatch
+ * @job. If @job->credits exceeds the queue's credit limit, it is
+ * clamped with a WARN.
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ * Return: true if available credits >= @job->credits, false otherwise.
+ */
+static bool drm_dep_queue_has_credits(struct drm_dep_queue *q,
+ struct drm_dep_job *job)
+{
+ u32 available;
+
+ lockdep_assert_held(&q->sched.lock);
+
+ if (job->credits > q->credit.limit) {
+ drm_warn(q->drm,
+ "Jobs may not exceed the credit limit, truncate.\n");
+ job->credits = q->credit.limit;
+ }
+
+ WARN_ON(check_sub_overflow(q->credit.limit,
+ atomic_read(&q->credit.count),
+ &available));
+
+ return available >= job->credits;
+}
+
+/**
+ * drm_dep_queue_run_job_queue() - kick the run-job worker
+ * @q: dep queue
+ *
+ * Queues @q->sched.run_job on @q->sched.submit_wq unless the queue is stopped
+ * or the job queue is empty. The empty-queue check avoids queueing a work item
+ * that would immediately return with nothing to do.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_queue_run_job_queue(struct drm_dep_queue *q)
+{
+ if (!drm_dep_queue_is_stopped(q) && spsc_queue_count(&q->job.queue))
+ queue_work(q->sched.submit_wq, &q->sched.run_job);
+}
+
+/**
+ * drm_dep_queue_put_job_queue() - kick the put-job worker
+ * @q: dep queue
+ *
+ * Queues @q->sched.put_job on @q->sched.submit_wq unless the queue
+ * is stopped.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_queue_put_job_queue(struct drm_dep_queue *q)
+{
+ if (!drm_dep_queue_is_stopped(q))
+ queue_work(q->sched.submit_wq, &q->sched.put_job);
+}
+
+/**
+ * drm_queue_start_timeout() - arm or re-arm the TDR delayed work
+ * @q: dep queue
+ *
+ * Arms the TDR delayed work with @q->job.timeout. No-op if
+ * @q->ops->timedout_job is NULL, the timeout is MAX_SCHEDULE_TIMEOUT,
+ * or the pending list is empty.
+ *
+ * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
+ */
+static void drm_queue_start_timeout(struct drm_dep_queue *q)
+{
+ lockdep_assert_held(&q->job.lock);
+
+ if (!q->ops->timedout_job ||
+ q->job.timeout == MAX_SCHEDULE_TIMEOUT ||
+ list_empty(&q->job.pending))
+ return;
+
+ mod_delayed_work(q->sched.timeout_wq, &q->sched.tdr, q->job.timeout);
+}
+
+/**
+ * drm_queue_start_timeout_unlocked() - arm TDR, acquiring job.lock
+ * @q: dep queue
+ *
+ * Acquires @q->job.lock with interrupts disabled and calls
+ * drm_queue_start_timeout().
+ *
+ * Context: Process context (workqueue).
+ */
+static void drm_queue_start_timeout_unlocked(struct drm_dep_queue *q)
+{
+ guard(spinlock_irq)(&q->job.lock);
+ drm_queue_start_timeout(q);
+}
+
+/**
+ * drm_dep_queue_remove_dependency() - clear the active dependency and wake
+ * the run-job worker
+ * @q: dep queue
+ * @f: the dependency fence being removed
+ *
+ * Stores @f into @q->dep.removed_fence via smp_store_release() so that the
+ * run-job worker can drop the reference to it in drm_dep_queue_is_ready(),
+ * paired with smp_load_acquire(). Clears @q->dep.fence and kicks the
+ * run-job worker.
+ *
+ * The fence reference is not dropped here; it is deferred to the run-job
+ * worker via @q->dep.removed_fence to keep this path suitable for dma_fence
+ * callback removal in drm_dep_queue_kill().
+ *
+ * Context: Any context.
+ */
+static void drm_dep_queue_remove_dependency(struct drm_dep_queue *q,
+ struct dma_fence *f)
+{
+ /* removed_fence must be visible to the reader before &q->dep.fence */
+ smp_store_release(&q->dep.removed_fence, f);
+
+ WRITE_ONCE(q->dep.fence, NULL);
+ drm_dep_queue_run_job_queue(q);
+}
+
+/**
+ * drm_dep_queue_wakeup() - dma_fence callback to wake the run-job worker
+ * @f: the signalled dependency fence
+ * @cb: callback embedded in the dep queue
+ *
+ * Called from dma_fence_signal() when the active dependency fence signals.
+ * Delegates to drm_dep_queue_remove_dependency() to clear @q->dep.fence and
+ * kick the run-job worker. The fence reference is not dropped here; it is
+ * deferred to the run-job worker via @q->dep.removed_fence.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_queue_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
+{
+ struct drm_dep_queue *q =
+ container_of(cb, struct drm_dep_queue, dep.cb);
+
+ drm_dep_queue_remove_dependency(q, f);
+}
+
+/**
+ * drm_dep_queue_is_ready() - check whether the queue has a dispatchable job
+ * @q: dep queue
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ * Return: true if the SPSC queue is non-empty and no dependency fence is
+ * pending, false otherwise.
+ */
+static bool drm_dep_queue_is_ready(struct drm_dep_queue *q)
+{
+ lockdep_assert_held(&q->sched.lock);
+
+ if (!spsc_queue_count(&q->job.queue))
+ return false;
+
+ if (READ_ONCE(q->dep.fence))
+ return false;
+
+ /* Paired with smp_store_release in drm_dep_queue_remove_dependency() */
+ dma_fence_put(smp_load_acquire(&q->dep.removed_fence));
+
+ q->dep.removed_fence = NULL;
+
+ return true;
+}
+
+/**
+ * drm_dep_queue_is_killed() - check whether a dep queue has been killed
+ * @q: dep queue to check
+ *
+ * Return: true if %DRM_DEP_QUEUE_FLAGS_KILLED is set on @q, false otherwise.
+ *
+ * Context: Any context.
+ */
+bool drm_dep_queue_is_killed(struct drm_dep_queue *q)
+{
+ return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_KILLED);
+}
+EXPORT_SYMBOL(drm_dep_queue_is_killed);
+
+/**
+ * drm_dep_queue_is_initialized() - check whether a dep queue has been initialized
+ * @q: dep queue to check
+ *
+ * A queue is considered initialized once its ops pointer has been set by a
+ * successful call to drm_dep_queue_init(). Drivers that embed a
+ * &drm_dep_queue inside a larger structure may call this before attempting any
+ * other queue operation to confirm that initialization has taken place.
+ * drm_dep_queue_put() must be called if this function returns true to drop the
+ * initialization reference from drm_dep_queue_init().
+ *
+ * Return: true if @q has been initialized, false otherwise.
+ *
+ * Context: Any context.
+ */
+bool drm_dep_queue_is_initialized(struct drm_dep_queue *q)
+{
+ return !!q->ops;
+}
+EXPORT_SYMBOL(drm_dep_queue_is_initialized);
+
+/**
+ * drm_dep_queue_set_stopped() - pre-mark a queue as stopped before first use
+ * @q: dep queue to mark
+ *
+ * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED directly on @q without going through the
+ * normal drm_dep_queue_stop() path. This is only valid during the driver-side
+ * queue initialisation sequence — i.e. after drm_dep_queue_init() returns but
+ * before the queue is made visible to other threads (e.g. before it is added
+ * to any lookup structures). Using this after the queue is live is a driver
+ * bug; use drm_dep_queue_stop() instead.
+ *
+ * Context: Process context, queue not yet visible to other threads.
+ */
+void drm_dep_queue_set_stopped(struct drm_dep_queue *q)
+{
+ q->sched.flags |= DRM_DEP_QUEUE_FLAGS_STOPPED;
+}
+EXPORT_SYMBOL(drm_dep_queue_set_stopped);
+
+/**
+ * drm_dep_queue_refcount() - read the current reference count of a queue
+ * @q: dep queue to query
+ *
+ * Returns the instantaneous kref value. The count may change immediately
+ * after this call; callers must not make safety decisions based solely on
+ * the returned value. Intended for diagnostic snapshots and debugfs output.
+ *
+ * Context: Any context.
+ * Return: current reference count.
+ */
+unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q)
+{
+ return kref_read(&q->refcount);
+}
+EXPORT_SYMBOL(drm_dep_queue_refcount);
+
+/**
+ * drm_dep_queue_timeout() - read the per-job TDR timeout for a queue
+ * @q: dep queue to query
+ *
+ * Returns the per-job timeout in jiffies as set at init time.
+ * %MAX_SCHEDULE_TIMEOUT means no timeout is configured.
+ *
+ * Context: Any context.
+ * Return: timeout in jiffies.
+ */
+long drm_dep_queue_timeout(const struct drm_dep_queue *q)
+{
+ return q->job.timeout;
+}
+EXPORT_SYMBOL(drm_dep_queue_timeout);
+
+/**
+ * drm_dep_queue_is_job_put_irq_safe() - test whether job-put from IRQ is allowed
+ * @q: dep queue
+ *
+ * Context: Any context.
+ * Return: true if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set,
+ * false otherwise.
+ */
+static bool drm_dep_queue_is_job_put_irq_safe(const struct drm_dep_queue *q)
+{
+ return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE);
+}
+
+/**
+ * drm_dep_queue_job_dependency() - get next unresolved dep fence
+ * @q: dep queue
+ * @job: job whose dependencies to advance
+ *
+ * Returns NULL immediately if the queue has been killed via
+ * drm_dep_queue_kill(), bypassing all dependency waits so that jobs
+ * drain through run_job as quickly as possible.
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ * Return: next unresolved &dma_fence with a new reference, or NULL
+ * when all dependencies have been consumed (or the queue is killed).
+ */
+static struct dma_fence *
+drm_dep_queue_job_dependency(struct drm_dep_queue *q,
+ struct drm_dep_job *job)
+{
+ struct dma_fence *f;
+
+ lockdep_assert_held(&q->sched.lock);
+
+ if (drm_dep_queue_is_killed(q))
+ return NULL;
+
+ f = xa_load(&job->dependencies, job->last_dependency);
+ if (f) {
+ job->last_dependency++;
+ if (WARN_ON(DRM_DEP_JOB_FENCE_PREALLOC == f))
+ return dma_fence_get_stub();
+ return dma_fence_get(f);
+ }
+
+ return NULL;
+}
+
+/**
+ * drm_dep_queue_add_dep_cb() - install wakeup callback on dep fence
+ * @q: dep queue
+ * @job: job whose dependency fence is stored in @q->dep.fence
+ *
+ * Installs a wakeup callback on @q->dep.fence. Returns true if the
+ * callback was installed (the queue must wait), false if the fence is
+ * already signalled or is a self-fence from the same queue context.
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ * Return: true if callback installed, false if fence already done.
+ */
+static bool drm_dep_queue_add_dep_cb(struct drm_dep_queue *q,
+ struct drm_dep_job *job)
+{
+ struct dma_fence *fence = q->dep.fence;
+
+ lockdep_assert_held(&q->sched.lock);
+
+ if (WARN_ON(fence->context == q->fence.context)) {
+ dma_fence_put(q->dep.fence);
+ q->dep.fence = NULL;
+ return false;
+ }
+
+ if (!dma_fence_add_callback(q->dep.fence, &q->dep.cb,
+ drm_dep_queue_wakeup))
+ return true;
+
+ dma_fence_put(q->dep.fence);
+ q->dep.fence = NULL;
+
+ return false;
+}
+
+/**
+ * drm_dep_queue_pop_job() - pop a dispatchable job from the SPSC queue
+ * @q: dep queue
+ *
+ * Peeks at the head of the SPSC queue and drains all resolved
+ * dependencies. If a dependency is still pending, installs a wakeup
+ * callback and returns NULL. On success pops the job and returns it.
+ *
+ * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
+ * Return: next dispatchable job, or NULL if a dep is still pending.
+ */
+static struct drm_dep_job *drm_dep_queue_pop_job(struct drm_dep_queue *q)
+{
+ struct spsc_node *node;
+ struct drm_dep_job *job;
+
+ lockdep_assert_held(&q->sched.lock);
+
+ node = spsc_queue_peek(&q->job.queue);
+ if (!node)
+ return NULL;
+
+ job = container_of(node, struct drm_dep_job, queue_node);
+
+ while ((q->dep.fence = drm_dep_queue_job_dependency(q, job))) {
+ if (drm_dep_queue_add_dep_cb(q, job))
+ return NULL;
+ }
+
+ spsc_queue_pop(&q->job.queue);
+
+ return job;
+}
+
+/**
+ * drm_dep_queue_get_unless_zero() - try to acquire a queue reference
+ * @q: dep queue to acquire a reference on
+ *
+ * Workers use this instead of drm_dep_queue_get() to guard against the zombie
+ * state: the queue's refcount has already reached zero (async teardown is in
+ * flight) but a work item was queued before free_work had a chance to cancel
+ * it. If kref_get_unless_zero() fails the caller must bail immediately.
+ *
+ * Context: Any context.
+ * Return: true if the reference was acquired, false if the queue is a zombie.
+ */
+bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q)
+{
+ return kref_get_unless_zero(&q->refcount);
+}
+EXPORT_SYMBOL(drm_dep_queue_get_unless_zero);
+
+/**
+ * drm_dep_queue_run_job_work() - run-job worker
+ * @work: work item embedded in the dep queue
+ *
+ * Acquires @q->sched.lock, checks stopped state, queue readiness and
+ * available credits, pops the next job via drm_dep_queue_pop_job(),
+ * dispatches it via drm_dep_queue_run_job(), then re-kicks itself.
+ *
+ * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
+ * queue is in zombie state (refcount already zero, async teardown in flight).
+ *
+ * Context: Process context (workqueue). DMA fence signaling path.
+ */
+static void drm_dep_queue_run_job_work(struct work_struct *work)
+{
+ struct drm_dep_queue *q =
+ container_of(work, struct drm_dep_queue, sched.run_job);
+ struct spsc_node *node;
+ struct drm_dep_job *job;
+ bool cookie = dma_fence_begin_signalling();
+
+ /* Bail if queue is zombie (refcount already zero, teardown in flight). */
+ if (!drm_dep_queue_get_unless_zero(q)) {
+ dma_fence_end_signalling(cookie);
+ return;
+ }
+
+ mutex_lock(&q->sched.lock);
+
+ if (drm_dep_queue_is_stopped(q))
+ goto put_queue;
+
+ if (!drm_dep_queue_is_ready(q))
+ goto put_queue;
+
+ /* Peek to check credits before committing to pop and dep resolution */
+ node = spsc_queue_peek(&q->job.queue);
+ if (!node)
+ goto put_queue;
+
+ job = container_of(node, struct drm_dep_job, queue_node);
+ if (!drm_dep_queue_has_credits(q, job))
+ goto put_queue;
+
+ job = drm_dep_queue_pop_job(q);
+ if (!job)
+ goto put_queue;
+
+ drm_dep_queue_run_job(q, job);
+ drm_dep_queue_run_job_queue(q);
+
+put_queue:
+ mutex_unlock(&q->sched.lock);
+ drm_dep_queue_put(q);
+ dma_fence_end_signalling(cookie);
+}
+
+/**
+ * drm_dep_queue_remove_job() - unlink a job from the pending list and reset TDR
+ * @q: dep queue owning @job
+ * @job: job to remove
+ *
+ * Splices @job out of @q->job.pending, cancels any pending TDR delayed work,
+ * and arms the timeout for the new list head (if any).
+ *
+ * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
+ */
+static void drm_dep_queue_remove_job(struct drm_dep_queue *q,
+ struct drm_dep_job *job)
+{
+ lockdep_assert_held(&q->job.lock);
+
+ list_del_init(&job->pending_link);
+ cancel_delayed_work(&q->sched.tdr);
+ drm_queue_start_timeout(q);
+}
+
+/**
+ * drm_dep_queue_get_finished_job() - dequeue a finished job
+ * @q: dep queue
+ *
+ * Under @q->job.lock checks the head of the pending list for a
+ * finished dep fence. If found, removes the job from the list,
+ * cancels the TDR, and re-arms it for the new head.
+ *
+ * Context: Process context (workqueue). DMA fence signaling path.
+ * Return: the finished &drm_dep_job, or NULL if none is ready.
+ */
+static struct drm_dep_job *
+drm_dep_queue_get_finished_job(struct drm_dep_queue *q)
+{
+ struct drm_dep_job *job;
+
+ guard(spinlock_irq)(&q->job.lock);
+
+ job = list_first_entry_or_null(&q->job.pending, struct drm_dep_job,
+ pending_link);
+ if (job && drm_dep_fence_is_finished(job->dfence))
+ drm_dep_queue_remove_job(q, job);
+ else
+ job = NULL;
+
+ return job;
+}
+
+/**
+ * drm_dep_queue_put_job_work() - put-job worker
+ * @work: work item embedded in the dep queue
+ *
+ * Drains all finished jobs by calling drm_dep_job_put() in a loop,
+ * then kicks the run-job worker.
+ *
+ * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
+ * queue is in zombie state (refcount already zero, async teardown in flight).
+ *
+ * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
+ * because the workqueue is shared with other work items in the fence
+ * signaling path.
+ *
+ * Context: Process context (workqueue). DMA fence signaling path.
+ */
+static void drm_dep_queue_put_job_work(struct work_struct *work)
+{
+ struct drm_dep_queue *q =
+ container_of(work, struct drm_dep_queue, sched.put_job);
+ struct drm_dep_job *job;
+ bool cookie = dma_fence_begin_signalling();
+
+ /* Bail if queue is zombie (refcount already zero, teardown in flight). */
+ if (!drm_dep_queue_get_unless_zero(q)) {
+ dma_fence_end_signalling(cookie);
+ return;
+ }
+
+ while ((job = drm_dep_queue_get_finished_job(q)))
+ drm_dep_job_put(job);
+
+ drm_dep_queue_run_job_queue(q);
+
+ drm_dep_queue_put(q);
+ dma_fence_end_signalling(cookie);
+}
+
+/**
+ * drm_dep_queue_tdr_work() - TDR worker
+ * @work: work item embedded in the delayed TDR work
+ *
+ * Removes the head job from the pending list under @q->job.lock,
+ * asserts @q->ops->timedout_job is non-NULL, calls it outside the lock,
+ * requeues the job if %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB, drops the
+ * queue's job reference on %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED, and always
+ * restarts the TDR timer after handling the job (a no-op if the pending list
+ * is empty or no timeout is configured).
+ * Any other return value triggers a WARN.
+ *
+ * The TDR is never armed when @q->ops->timedout_job is NULL, so firing
+ * this worker without a timedout_job callback is a driver bug.
+ *
+ * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
+ * queue is in zombie state (refcount already zero, async teardown in flight).
+ *
+ * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
+ * because timedout_job() is expected to signal the guilty job's fence as part
+ * of reset.
+ *
+ * Context: Process context (workqueue). DMA fence signaling path.
+ */
+static void drm_dep_queue_tdr_work(struct work_struct *work)
+{
+ struct drm_dep_queue *q =
+ container_of(work, struct drm_dep_queue, sched.tdr.work);
+ struct drm_dep_job *job;
+ bool cookie = dma_fence_begin_signalling();
+
+ /* Bail if queue is zombie (refcount already zero, teardown in flight). */
+ if (!drm_dep_queue_get_unless_zero(q)) {
+ dma_fence_end_signalling(cookie);
+ return;
+ }
+
+ scoped_guard(spinlock_irq, &q->job.lock) {
+ job = list_first_entry_or_null(&q->job.pending,
+ struct drm_dep_job,
+ pending_link);
+ if (job)
+ /*
+ * Remove from pending so it cannot be freed
+ * concurrently by drm_dep_queue_get_finished_job() or
+ * drm_dep_job_done().
+ */
+ list_del_init(&job->pending_link);
+ }
+
+ if (job) {
+ enum drm_dep_timedout_stat status;
+
+ if (WARN_ON(!q->ops->timedout_job)) {
+ drm_dep_job_put(job);
+ goto out;
+ }
+
+ status = q->ops->timedout_job(job);
+
+ switch (status) {
+ case DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB:
+ scoped_guard(spinlock_irq, &q->job.lock)
+ list_add(&job->pending_link, &q->job.pending);
+ drm_dep_queue_put_job_queue(q);
+ break;
+ case DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED:
+ drm_dep_job_put(job);
+ break;
+ default:
+ WARN(1, "invalid drm_dep_timedout_stat\n");
+ break;
+ }
+ }
+
+out:
+ drm_queue_start_timeout_unlocked(q);
+ drm_dep_queue_put(q);
+ dma_fence_end_signalling(cookie);
+}
+
+/**
+ * drm_dep_alloc_submit_wq() - allocate an ordered submit workqueue
+ * @name: name for the workqueue
+ * @flags: DRM_DEP_QUEUE_FLAGS_* flags
+ *
+ * Allocates an ordered workqueue for job submission with %WQ_MEM_RECLAIM and
+ * %WQ_MEM_WARN_ON_RECLAIM set, ensuring the workqueue is safe to use from
+ * memory reclaim context and properly annotated for lockdep taint tracking.
+ * Adds %WQ_HIGHPRI if %DRM_DEP_QUEUE_FLAGS_HIGHPRI is set. When
+ * CONFIG_LOCKDEP is enabled, uses a dedicated lockdep map for annotation.
+ *
+ * Context: Process context.
+ * Return: the new &workqueue_struct, or NULL on failure.
+ */
+static struct workqueue_struct *
+drm_dep_alloc_submit_wq(const char *name, enum drm_dep_queue_flags flags)
+{
+ unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
+
+ if (flags & DRM_DEP_QUEUE_FLAGS_HIGHPRI)
+ wq_flags |= WQ_HIGHPRI;
+
+#if IS_ENABLED(CONFIG_LOCKDEP)
+ static struct lockdep_map map = {
+ .name = "drm_dep_submit_lockdep_map"
+ };
+ return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
+#else
+ return alloc_ordered_workqueue(name, wq_flags);
+#endif
+}
+
+/**
+ * drm_dep_alloc_timeout_wq() - allocate an ordered TDR workqueue
+ * @name: name for the workqueue
+ *
+ * Allocates an ordered workqueue for timeout detection and recovery with
+ * %WQ_MEM_RECLAIM and %WQ_MEM_WARN_ON_RECLAIM set, ensuring consistent taint
+ * annotation with the submit workqueue. When CONFIG_LOCKDEP is enabled, uses
+ * a dedicated lockdep map for annotation.
+ *
+ * Context: Process context.
+ * Return: the new &workqueue_struct, or NULL on failure.
+ */
+static struct workqueue_struct *drm_dep_alloc_timeout_wq(const char *name)
+{
+ unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
+
+#if IS_ENABLED(CONFIG_LOCKDEP)
+ static struct lockdep_map map = {
+ .name = "drm_dep_timeout_lockdep_map"
+ };
+ return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
+#else
+ return alloc_ordered_workqueue(name, wq_flags);
+#endif
+}
+
+/**
+ * drm_dep_queue_init() - initialize a dep queue
+ * @q: dep queue to initialize
+ * @args: initialization arguments
+ *
+ * Initializes all fields of @q from @args. If @args->submit_wq is NULL an
+ * ordered workqueue is allocated and owned by the queue
+ * (%DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ). If @args->timeout_wq is NULL an
+ * ordered workqueue is allocated and owned by the queue
+ * (%DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ). On success the queue holds one kref
+ * reference and drm_dep_queue_put() must be called to drop this reference
+ * (i.e., drivers cannot directly free the queue).
+ *
+ * When CONFIG_LOCKDEP is enabled, @q->sched.lock is primed against the
+ * fs_reclaim pseudo-lock so that lockdep can detect any lock ordering
+ * inversion between @sched.lock and memory reclaim.
+ *
+ * Return: 0 on success, %-EINVAL when @args->credit_limit is zero, @args->ops
+ * is NULL, @args->drm is NULL, @args->ops->run_job is NULL, or when
+ * @args->submit_wq or @args->timeout_wq is non-NULL but was not allocated with
+ * %WQ_MEM_WARN_ON_RECLAIM; %-ENOMEM when workqueue allocation fails.
+ *
+ * Context: Process context. May allocate memory and create workqueues.
+ */
+int drm_dep_queue_init(struct drm_dep_queue *q,
+ const struct drm_dep_queue_init_args *args)
+{
+ if (!args->credit_limit || !args->drm || !args->ops ||
+ !args->ops->run_job)
+ return -EINVAL;
+
+ if (args->submit_wq && !workqueue_is_reclaim_annotated(args->submit_wq))
+ return -EINVAL;
+
+ if (args->timeout_wq &&
+ !workqueue_is_reclaim_annotated(args->timeout_wq))
+ return -EINVAL;
+
+ memset(q, 0, sizeof(*q));
+
+ q->name = args->name;
+ q->drm = args->drm;
+ q->credit.limit = args->credit_limit;
+ q->job.timeout = args->timeout ? args->timeout : MAX_SCHEDULE_TIMEOUT;
+
+ init_rcu_head(&q->rcu);
+ INIT_LIST_HEAD(&q->job.pending);
+ spin_lock_init(&q->job.lock);
+ spsc_queue_init(&q->job.queue);
+
+ mutex_init(&q->sched.lock);
+ if (IS_ENABLED(CONFIG_LOCKDEP)) {
+ fs_reclaim_acquire(GFP_KERNEL);
+ might_lock(&q->sched.lock);
+ fs_reclaim_release(GFP_KERNEL);
+ }
+
+ if (args->submit_wq) {
+ q->sched.submit_wq = args->submit_wq;
+ } else {
+ q->sched.submit_wq = drm_dep_alloc_submit_wq(args->name ?: "drm_dep",
+ args->flags);
+ if (!q->sched.submit_wq)
+ return -ENOMEM;
+
+ q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ;
+ }
+
+ if (args->timeout_wq) {
+ q->sched.timeout_wq = args->timeout_wq;
+ } else {
+ q->sched.timeout_wq = drm_dep_alloc_timeout_wq(args->name ?: "drm_dep");
+ if (!q->sched.timeout_wq)
+ goto err_submit_wq;
+
+ q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ;
+ }
+
+ q->sched.flags |= args->flags &
+ ~(DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ |
+ DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ);
+
+ INIT_DELAYED_WORK(&q->sched.tdr, drm_dep_queue_tdr_work);
+ INIT_WORK(&q->sched.run_job, drm_dep_queue_run_job_work);
+ INIT_WORK(&q->sched.put_job, drm_dep_queue_put_job_work);
+
+ q->fence.context = dma_fence_context_alloc(1);
+
+ kref_init(&q->refcount);
+ q->ops = args->ops;
+ drm_dev_get(q->drm);
+
+ return 0;
+
+err_submit_wq:
+ if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
+ destroy_workqueue(q->sched.submit_wq);
+ mutex_destroy(&q->sched.lock);
+
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(drm_dep_queue_init);
+
+#if IS_ENABLED(CONFIG_PROVE_LOCKING)
+/**
+ * drm_dep_queue_push_job_begin() - mark the start of an arm/push critical section
+ * @q: dep queue the job belongs to
+ *
+ * Called at the start of drm_dep_job_arm() and warns if the push context is
+ * already owned by another task, which would indicate concurrent arm/push on
+ * the same queue.
+ *
+ * No-op when CONFIG_PROVE_LOCKING is disabled.
+ *
+ * Context: Process context. DMA fence signaling path.
+ */
+void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
+{
+ WARN_ON(q->job.push.owner);
+ q->job.push.owner = current;
+}
+
+/**
+ * drm_dep_queue_push_job_end() - mark the end of an arm/push critical section
+ * @q: dep queue the job belongs to
+ *
+ * Called at the end of drm_dep_job_push() and warns if the push context is not
+ * owned by the current task, which would indicate a mismatched begin/end pair
+ * or a push from the wrong thread.
+ *
+ * No-op when CONFIG_PROVE_LOCKING is disabled.
+ *
+ * Context: Process context. DMA fence signaling path.
+ */
+void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
+{
+ WARN_ON(q->job.push.owner != current);
+ q->job.push.owner = NULL;
+}
+#endif
+
+/**
+ * drm_dep_queue_assert_teardown_invariants() - assert teardown invariants
+ * @q: dep queue being torn down
+ *
+ * Warns if the pending-job list, the SPSC submission queue, or the credit
+ * counter is non-zero when called, or if the queue still has a non-zero
+ * reference count.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_queue_assert_teardown_invariants(struct drm_dep_queue *q)
+{
+ WARN_ON(!list_empty(&q->job.pending));
+ WARN_ON(spsc_queue_count(&q->job.queue));
+ WARN_ON(atomic_read(&q->credit.count));
+ WARN_ON(drm_dep_queue_refcount(q));
+}
+
+/**
+ * drm_dep_queue_release() - final internal cleanup of a dep queue
+ * @q: dep queue to clean up
+ *
+ * Asserts teardown invariants and destroys internal resources allocated by
+ * drm_dep_queue_init() that cannot be torn down earlier in the teardown
+ * sequence. Currently this destroys @q->sched.lock.
+ *
+ * Drivers that implement &drm_dep_queue_ops.release **must** call this
+ * function after removing @q from any internal bookkeeping (e.g. lookup
+ * tables or lists) but before freeing the memory that contains @q. When
+ * &drm_dep_queue_ops.release is NULL, drm_dep follows the default teardown
+ * path and calls this function automatically.
+ *
+ * Context: Any context.
+ */
+void drm_dep_queue_release(struct drm_dep_queue *q)
+{
+ drm_dep_queue_assert_teardown_invariants(q);
+ mutex_destroy(&q->sched.lock);
+}
+EXPORT_SYMBOL(drm_dep_queue_release);
+
+/**
+ * drm_dep_queue_free() - final cleanup of a dep queue
+ * @q: dep queue to free
+ *
+ * Invokes &drm_dep_queue_ops.release if set, in which case the driver is
+ * responsible for calling drm_dep_queue_release() and freeing @q itself.
+ * If &drm_dep_queue_ops.release is NULL, calls drm_dep_queue_release()
+ * and then frees @q with kfree_rcu().
+ *
+ * In either case, releases the drm_dev_get() reference taken at init time
+ * via drm_dev_put(), allowing the owning &drm_device to be unloaded once
+ * all queues have been freed.
+ *
+ * Context: Process context (workqueue), reclaim safe.
+ */
+static void drm_dep_queue_free(struct drm_dep_queue *q)
+{
+ struct drm_device *drm = q->drm;
+
+ if (q->ops->release) {
+ q->ops->release(q);
+ } else {
+ drm_dep_queue_release(q);
+ kfree_rcu(q, rcu);
+ }
+ drm_dev_put(drm);
+}
+
+/**
+ * drm_dep_queue_free_work() - deferred queue teardown worker
+ * @work: free_work item embedded in the dep queue
+ *
+ * Runs on dep_free_wq. Disables all work items synchronously
+ * (preventing re-queue and waiting for in-flight instances),
+ * destroys any owned workqueues, then calls drm_dep_queue_free().
+ * Running on dep_free_wq ensures destroy_workqueue() is never
+ * called from within one of the queue's own workers (deadlock)
+ * and disable_*_sync() cannot deadlock either.
+ *
+ * Context: Process context (workqueue), reclaim safe.
+ */
+static void drm_dep_queue_free_work(struct work_struct *work)
+{
+ struct drm_dep_queue *q =
+ container_of(work, struct drm_dep_queue, free_work);
+
+ drm_dep_queue_assert_teardown_invariants(q);
+
+ disable_delayed_work_sync(&q->sched.tdr);
+ disable_work_sync(&q->sched.run_job);
+ disable_work_sync(&q->sched.put_job);
+
+ if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ)
+ destroy_workqueue(q->sched.timeout_wq);
+
+ if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
+ destroy_workqueue(q->sched.submit_wq);
+
+ drm_dep_queue_free(q);
+}
+
+/**
+ * drm_dep_queue_fini() - tear down a dep queue
+ * @q: dep queue to tear down
+ *
+ * Asserts teardown invariants and initiates teardown of @q by queuing the
+ * deferred free work onto the module-private dep_free_wq workqueue. The work
+ * item disables any pending TDR and run/put-job work synchronously, destroys
+ * any workqueues that were allocated by drm_dep_queue_init(), and then releases
+ * the queue memory.
+ *
+ * Running teardown from dep_free_wq ensures that destroy_workqueue() is never
+ * called from within one of the queue's own workers (e.g. via
+ * drm_dep_queue_put()), which would deadlock.
+ *
+ * Drivers can wait for all outstanding deferred work to complete by waiting
+ * for the last drm_dev_put() reference on their &drm_device, which is
+ * released as the final step of each queue's teardown.
+ *
+ * Drivers that implement &drm_dep_queue_ops.fini **must** call this
+ * function after removing @q from any device bookkeeping but before freeing the
+ * memory that contains @q. When &drm_dep_queue_ops.fini is NULL, drm_dep
+ * follows the default teardown path and calls this function automatically.
+ *
+ * Context: Any context.
+ */
+void drm_dep_queue_fini(struct drm_dep_queue *q)
+{
+ drm_dep_queue_assert_teardown_invariants(q);
+
+ INIT_WORK(&q->free_work, drm_dep_queue_free_work);
+ queue_work(dep_free_wq, &q->free_work);
+}
+EXPORT_SYMBOL(drm_dep_queue_fini);
+
+/**
+ * drm_dep_queue_get() - acquire a reference to a dep queue
+ * @q: dep queue to acquire a reference on, or NULL
+ *
+ * Return: @q with an additional reference held, or NULL if @q is NULL.
+ *
+ * Context: Any context.
+ */
+struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q)
+{
+ if (q)
+ kref_get(&q->refcount);
+ return q;
+}
+EXPORT_SYMBOL(drm_dep_queue_get);
+
+/**
+ * __drm_dep_queue_release() - kref release callback for a dep queue
+ * @kref: kref embedded in the dep queue
+ *
+ * Calls &drm_dep_queue_ops.fini if set, otherwise calls
+ * drm_dep_queue_fini() to initiate deferred teardown.
+ *
+ * Context: Any context.
+ */
+static void __drm_dep_queue_release(struct kref *kref)
+{
+ struct drm_dep_queue *q =
+ container_of(kref, struct drm_dep_queue, refcount);
+
+ if (q->ops->fini)
+ q->ops->fini(q);
+ else
+ drm_dep_queue_fini(q);
+}
+
+/**
+ * drm_dep_queue_put() - release a reference to a dep queue
+ * @q: dep queue to release a reference on, or NULL
+ *
+ * When the last reference is dropped, calls &drm_dep_queue_ops.fini if set,
+ * otherwise calls drm_dep_queue_fini(). Final memory release is handled by
+ * &drm_dep_queue_ops.release (which must call drm_dep_queue_release()) if set,
+ * or drm_dep_queue_release() followed by kfree_rcu() otherwise.
+ * Does nothing if @q is NULL.
+ *
+ * Context: Any context.
+ */
+void drm_dep_queue_put(struct drm_dep_queue *q)
+{
+ if (q)
+ kref_put(&q->refcount, __drm_dep_queue_release);
+}
+EXPORT_SYMBOL(drm_dep_queue_put);
+
+/**
+ * drm_dep_queue_stop() - stop a dep queue from processing new jobs
+ * @q: dep queue to stop
+ *
+ * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
+ * and @q->job.lock (spinlock_irq), making the flag safe to test from finished
+ * fenced signaling context. Then cancels any in-flight run_job and put_job work
+ * items. Once stopped, the bypass path and the submit workqueue will not
+ * dispatch further jobs nor will any jobs be removed from the pending list.
+ * Call drm_dep_queue_start() to resume processing.
+ *
+ * Context: Process context. Waits for in-flight workers to complete.
+ */
+void drm_dep_queue_stop(struct drm_dep_queue *q)
+{
+ scoped_guard(mutex, &q->sched.lock) {
+ scoped_guard(spinlock_irq, &q->job.lock)
+ drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
+ }
+ cancel_work_sync(&q->sched.run_job);
+ cancel_work_sync(&q->sched.put_job);
+}
+EXPORT_SYMBOL(drm_dep_queue_stop);
+
+/**
+ * drm_dep_queue_start() - resume a stopped dep queue
+ * @q: dep queue to start
+ *
+ * Clears %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
+ * and @q->job.lock (spinlock_irq), making the flag safe to test from IRQ
+ * context. Then re-queues the run_job and put_job work items so that any jobs
+ * pending since the queue was stopped are processed. Must only be called after
+ * drm_dep_queue_stop().
+ *
+ * Context: Process context.
+ */
+void drm_dep_queue_start(struct drm_dep_queue *q)
+{
+ scoped_guard(mutex, &q->sched.lock) {
+ scoped_guard(spinlock_irq, &q->job.lock)
+ drm_dep_queue_flags_clear(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
+ }
+ drm_dep_queue_run_job_queue(q);
+ drm_dep_queue_put_job_queue(q);
+}
+EXPORT_SYMBOL(drm_dep_queue_start);
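+
+Taken together with the TDR helpers below, a driver reset path might look
+roughly like the following sketch. Only the drm_dep_* calls are from this
+patch; the my_* names are hypothetical:
+
+```c
+/* Hypothetical reset flow: freeze the queue, quiesce the TDR, reset the
+ * hardware, then resume with a fresh timeout window. */
+static void my_device_reset(struct my_device *dev)
+{
+	struct drm_dep_queue *q = dev->q;
+
+	drm_dep_queue_stop(q);			/* no new dispatch, pending list frozen */
+	drm_dep_queue_cancel_tdr_sync(q);	/* no TDR racing the reset below */
+
+	my_hw_reset(dev);			/* driver-specific hardware reset */
+
+	drm_dep_queue_start(q);			/* re-kick run_job/put_job workers */
+	drm_dep_queue_resume_timeout(q);	/* fresh TDR window for pending jobs */
+}
+```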
+
+/**
+ * drm_dep_queue_trigger_timeout() - trigger the TDR immediately for
+ * all pending jobs
+ * @q: dep queue to trigger timeout on
+ *
+ * Sets @q->job.timeout to 1 and arms the TDR delayed work with a one-jiffy
+ * delay, causing it to fire almost immediately without hot-spinning at zero
+ * delay. This is used to force-expire any pending jobs on the queue, for
+ * example when the device is being torn down or has encountered an
+ * unrecoverable error.
+ *
+ * When this function is used, it is suggested that the first timedout_job
+ * call kick the queue off the hardware and signal all pending job fences, and
+ * that subsequent calls continue to signal all pending job fences.
+ *
+ * Has no effect if the pending list is empty.
+ *
+ * Context: Any context.
+ */
+void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q)
+{
+ guard(spinlock_irqsave)(&q->job.lock);
+ q->job.timeout = 1;
+ drm_queue_start_timeout(q);
+}
+EXPORT_SYMBOL(drm_dep_queue_trigger_timeout);
+
+/**
+ * drm_dep_queue_cancel_tdr_sync() - cancel any pending TDR and wait
+ * for it to finish
+ * @q: dep queue whose TDR to cancel
+ *
+ * Cancels the TDR delayed work item if it has not yet started, and waits for
+ * it to complete if it is already running. After this call returns, the TDR
+ * worker is guaranteed not to be executing and will not fire again until
+ * explicitly rearmed (e.g. via drm_dep_queue_resume_timeout() or by a new
+ * job being submitted).
+ *
+ * Useful during error recovery or queue teardown when the caller needs to
+ * know that no timeout handling races with its own reset logic.
+ *
+ * Context: Process context. May sleep waiting for the TDR worker to finish.
+ */
+void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q)
+{
+ cancel_delayed_work_sync(&q->sched.tdr);
+}
+EXPORT_SYMBOL(drm_dep_queue_cancel_tdr_sync);
+
+/**
+ * drm_dep_queue_resume_timeout() - restart the TDR timer with the
+ * configured timeout
+ * @q: dep queue to resume the timeout for
+ *
+ * Restarts the TDR delayed work using @q->job.timeout. Called after device
+ * recovery to give pending jobs a fresh full timeout window. Has no effect
+ * if the pending list is empty.
+ *
+ * Context: Any context.
+ */
+void drm_dep_queue_resume_timeout(struct drm_dep_queue *q)
+{
+ drm_queue_start_timeout_unlocked(q);
+}
+EXPORT_SYMBOL(drm_dep_queue_resume_timeout);
+
+/**
+ * drm_dep_queue_is_stopped() - check whether a dep queue is stopped
+ * @q: dep queue to check
+ *
+ * Return: true if %DRM_DEP_QUEUE_FLAGS_STOPPED is set on @q, false otherwise.
+ *
+ * Context: Any context.
+ */
+bool drm_dep_queue_is_stopped(struct drm_dep_queue *q)
+{
+ return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_STOPPED);
+}
+EXPORT_SYMBOL(drm_dep_queue_is_stopped);
+
+/**
+ * drm_dep_queue_kill() - kill a dep queue and flush all pending jobs
+ * @q: dep queue to kill
+ *
+ * Sets %DRM_DEP_QUEUE_FLAGS_KILLED on @q under @q->sched.lock. If a
+ * dependency fence is currently being waited on, its callback is removed and
+ * the run-job worker is kicked immediately so that the blocked job drains
+ * without waiting.
+ *
+ * Once killed, drm_dep_queue_job_dependency() returns NULL for all jobs,
+ * bypassing dependency waits so that every queued job drains through
+ * &drm_dep_queue_ops.run_job without blocking.
+ *
+ * The &drm_dep_queue_ops.run_job callback is guaranteed to be called for every
+ * job that was pushed before or after drm_dep_queue_kill(), even during queue
+ * teardown. Drivers should use this guarantee to perform any necessary
+ * bookkeeping cleanup without executing the actual backend operation when the
+ * queue is killed.
+ *
+ * Unlike drm_dep_queue_stop(), killing is one-way: there is no corresponding
+ * start function.
+ *
+ * **Driver safety requirement**
+ *
+ * drm_dep_queue_kill() must only be called once the driver can guarantee that
+ * no job in the queue will touch memory associated with any of its fences
+ * (i.e., the queue has been removed from the device and will never be put back
+ * on).
+ *
+ * Context: Process context.
+ */
+void drm_dep_queue_kill(struct drm_dep_queue *q)
+{
+ scoped_guard(mutex, &q->sched.lock) {
+ struct dma_fence *fence;
+
+ drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_KILLED);
+
+ /*
+ * Holding &q->sched.lock guarantees that the run-job work item
+ * cannot drop its reference to q->dep.fence concurrently, so
+ * reading q->dep.fence here is safe.
+ */
+ fence = READ_ONCE(q->dep.fence);
+ if (fence && dma_fence_remove_callback(fence, &q->dep.cb))
+ drm_dep_queue_remove_dependency(q, fence);
+ }
+}
+EXPORT_SYMBOL(drm_dep_queue_kill);
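+
+A final-teardown sequence satisfying the safety requirement above might be
+sketched as follows, assuming the queue has already been removed from the
+hardware (the my_* names are illustrative, not part of this patch):
+
+```c
+/* Hypothetical teardown: the queue is already off the hardware, so no job
+ * will touch fence memory again. Kill drains everything through run_job
+ * for bookkeeping; the last put then triggers fini/release. */
+my_remove_queue_from_hw(dev, q);	/* driver-specific */
+drm_dep_queue_kill(q);			/* cancel dependency waits, drain jobs */
+drm_dep_queue_put(q);			/* drop the driver's reference */
+```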
+
+/**
+ * drm_dep_queue_submit_wq() - retrieve the submit workqueue of a dep queue
+ * @q: dep queue whose workqueue to retrieve
+ *
+ * Drivers may use this to queue their own work items alongside the queue's
+ * internal run-job and put-job workers — for example to process incoming
+ * messages in the same serialisation domain.
+ *
+ * Prefer drm_dep_queue_work_enqueue() when the only need is to enqueue a
+ * work item, as it additionally checks the stopped state. Use this accessor
+ * when the workqueue itself is required (e.g. to reuse it in place of a
+ * separate alloc_ordered_workqueue() allocation, or for drain_workqueue()
+ * calls).
+ *
+ * Context: Any context.
+ * Return: the &workqueue_struct used by @q for job submission.
+ */
+struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q)
+{
+ return q->sched.submit_wq;
+}
+EXPORT_SYMBOL(drm_dep_queue_submit_wq);
+
+/**
+ * drm_dep_queue_timeout_wq() - retrieve the timeout workqueue of a dep queue
+ * @q: dep queue whose workqueue to retrieve
+ *
+ * Returns the workqueue used by @q to run TDR (timeout detection and recovery)
+ * work. Drivers may use this to queue their own timeout-domain work items, or
+ * to call drain_workqueue() when tearing down and needing to ensure all pending
+ * timeout callbacks have completed before proceeding.
+ *
+ * Context: Any context.
+ * Return: the &workqueue_struct used by @q for TDR work.
+ */
+struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q)
+{
+ return q->sched.timeout_wq;
+}
+EXPORT_SYMBOL(drm_dep_queue_timeout_wq);
+
+/**
+ * drm_dep_queue_work_enqueue() - queue work on the dep queue's submit workqueue
+ * @q: dep queue to enqueue work on
+ * @work: work item to enqueue
+ *
+ * Queues @work on @q->sched.submit_wq if the queue is not stopped. This
+ * allows drivers to schedule custom work items that run serialised with the
+ * queue's own run-job and put-job workers.
+ *
+ * Context: Any context.
+ *
+ * Return: true if the work was queued, false if the queue is stopped or the
+ * work item was already pending.
+ */
+bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
+ struct work_struct *work)
+{
+ if (drm_dep_queue_is_stopped(q))
+ return false;
+
+ return queue_work(q->sched.submit_wq, work);
+}
+EXPORT_SYMBOL(drm_dep_queue_work_enqueue);
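+
+For instance, a driver might hand a message-processing work item to the
+queue's submit workqueue (hypothetical sketch; the my_* and ctx names are
+not part of this patch):
+
+```c
+/* Runs serialised with the run-job/put-job workers on submit_wq. */
+if (!drm_dep_queue_work_enqueue(q, &ctx->msg_work))
+	my_defer_messages(ctx);	/* queue stopped or work already pending */
+```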
+
+/**
+ * drm_dep_queue_can_job_bypass() - test whether a job can skip the SPSC queue
+ * @q: dep queue
+ * @job: job to test
+ *
+ * A job may bypass the submit workqueue and run inline on the calling thread
+ * if all of the following hold:
+ *
+ * - %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set on the queue
+ * - the queue is not stopped
+ * - the SPSC submission queue is empty (no other jobs waiting)
+ * - the queue has enough credits for @job
+ * - @job has no unresolved dependency fences
+ *
+ * Must be called under @q->sched.lock.
+ *
+ * Context: Process context. Must hold @q->sched.lock (a mutex).
+ * Return: true if the job may be run inline, false otherwise.
+ */
+bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
+ struct drm_dep_job *job)
+{
+ lockdep_assert_held(&q->sched.lock);
+
+ return q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED &&
+ !drm_dep_queue_is_stopped(q) &&
+ !spsc_queue_count(&q->job.queue) &&
+ drm_dep_queue_has_credits(q, job) &&
+ xa_empty(&job->dependencies);
+}
+
+/**
+ * drm_dep_job_done() - mark a job as complete
+ * @job: the job that finished
+ * @result: error code to propagate, or 0 for success
+ *
+ * Subtracts @job->credits from the queue credit counter, then signals the
+ * job's dep fence with @result.
+ *
+ * When %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set (IRQ-safe path), a
+ * temporary extra reference is taken on @job before signalling the fence.
+ * This prevents a concurrent put-job worker — which may be woken by timeouts or
+ * queue starting — from freeing the job while this function still holds a
+ * pointer to it. The extra reference is released at the end of the function.
+ *
+ * After signalling, the IRQ-safe path removes the job from the pending list
+ * under @q->job.lock, provided the queue is not stopped. Removal is skipped
+ * when the queue is stopped so that drm_dep_queue_for_each_pending_job() can
+ * iterate the list without racing with the completion path. On successful
+ * removal, kicks the run-job worker so the next queued job can be dispatched
+ * immediately, then drops the job reference. If the job was already removed
+ * by TDR, or removal was skipped because the queue is stopped, kicks the
+ * put-job worker instead to allow the deferred put to complete.
+ *
+ * Context: Any context.
+ */
+static void drm_dep_job_done(struct drm_dep_job *job, int result)
+{
+ struct drm_dep_queue *q = job->q;
+ bool irq_safe = drm_dep_queue_is_job_put_irq_safe(q), removed = false;
+
+ /*
+ * Local ref to ensure the put worker—which may be woken by external
+ * forces (TDR, driver-side queue starting)—doesn't free the job behind
+ * this function's back after drm_dep_fence_done() while it is still on
+ * the pending list.
+ */
+ if (irq_safe)
+ drm_dep_job_get(job);
+
+ atomic_sub(job->credits, &q->credit.count);
+ drm_dep_fence_done(job->dfence, result);
+
+ /* Only safe to touch job after fence signal if we have a local ref. */
+
+ if (irq_safe) {
+ scoped_guard(spinlock_irqsave, &q->job.lock) {
+ removed = !list_empty(&job->pending_link) &&
+ !drm_dep_queue_is_stopped(q);
+
+ /* Guard against TDR operating on job */
+ if (removed)
+ drm_dep_queue_remove_job(q, job);
+ }
+ }
+
+ if (removed) {
+ drm_dep_queue_run_job_queue(q);
+ drm_dep_job_put(job);
+ } else {
+ drm_dep_queue_put_job_queue(q);
+ }
+
+ if (irq_safe)
+ drm_dep_job_put(job);
+}
+
+/**
+ * drm_dep_job_done_cb() - dma_fence callback to complete a job
+ * @f: the hardware fence that signalled
+ * @cb: fence callback embedded in the dep job
+ *
+ * Extracts the job from @cb and calls drm_dep_job_done() with
+ * @f->error as the result.
+ *
+ * Context: Any context, with IRQs disabled. May not sleep.
+ */
+static void drm_dep_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
+{
+ struct drm_dep_job *job = container_of(cb, struct drm_dep_job, cb);
+
+ drm_dep_job_done(job, f->error);
+}
+
+/**
+ * drm_dep_queue_run_job() - submit a job to hardware and set up
+ * completion tracking
+ * @q: dep queue
+ * @job: job to run
+ *
+ * Accounts @job->credits against the queue, appends the job to the pending
+ * list, then calls @q->ops->run_job(). The TDR timer is started only when
+ * @job is the first entry on the pending list; subsequent jobs added while
+ * a TDR is already in flight do not reset the timer (which would otherwise
+ * extend the deadline for the already-running head job). Stores the returned
+ * hardware fence as the parent of the job's dep fence, then installs
+ * drm_dep_job_done_cb() on it. If the hardware fence is already signalled
+ * (%-ENOENT from dma_fence_add_callback()) or run_job() returns NULL/error,
+ * the job is completed immediately. Must be called under @q->sched.lock.
+ *
+ * Context: Process context. Must hold @q->sched.lock (a mutex). DMA fence
+ * signaling path.
+ */
+void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job)
+{
+ struct dma_fence *fence;
+ int r;
+
+ lockdep_assert_held(&q->sched.lock);
+
+ drm_dep_job_get(job);
+ atomic_add(job->credits, &q->credit.count);
+
+ scoped_guard(spinlock_irq, &q->job.lock) {
+ bool first = list_empty(&q->job.pending);
+
+ list_add_tail(&job->pending_link, &q->job.pending);
+ if (first)
+ drm_queue_start_timeout(q);
+ }
+
+ fence = q->ops->run_job(job);
+ drm_dep_fence_set_parent(job->dfence, fence);
+
+ if (!IS_ERR_OR_NULL(fence)) {
+ r = dma_fence_add_callback(fence, &job->cb,
+ drm_dep_job_done_cb);
+ if (r == -ENOENT)
+ drm_dep_job_done(job, fence->error);
+ else if (r)
+ drm_err(q->drm, "fence add callback failed (%d)\n", r);
+ dma_fence_put(fence);
+ } else {
+ drm_dep_job_done(job, IS_ERR(fence) ? PTR_ERR(fence) : 0);
+ }
+
+ /*
+ * Drop all input dependency fences now, in process context, before the
+ * final job put. Once the job is on the pending list its last reference
+ * may be dropped from a dma_fence callback (IRQ context), where calling
+ * xa_destroy() would be unsafe.
+ */
+ drm_dep_job_drop_dependencies(job);
+ drm_dep_job_put(job);
+}
+
+/**
+ * drm_dep_queue_push_job() - enqueue a job on the SPSC submission queue
+ * @q: dep queue
+ * @job: job to push
+ *
+ * Pushes @job onto the SPSC queue. If the queue was previously empty
+ * (i.e. this is the first pending job), kicks the run_job worker so it
+ * processes the job promptly without waiting for the next wakeup.
+ * May be called with or without @q->sched.lock held.
+ *
+ * Context: Any context. DMA fence signaling path.
+ */
+void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job)
+{
+ /*
+ * spsc_queue_push() returns true if the queue was previously empty,
+ * i.e. this is the first pending job. Kick the run_job worker so it
+ * picks it up without waiting for the next wakeup.
+ */
+ if (spsc_queue_push(&q->job.queue, &job->queue_node))
+ drm_dep_queue_run_job_queue(q);
+}
+
+/**
+ * drm_dep_init() - module initialiser
+ *
+ * Allocates the module-private dep_free_wq unbound workqueue used for
+ * deferred queue teardown.
+ *
+ * Return: 0 on success, %-ENOMEM if workqueue allocation fails.
+ */
+static int __init drm_dep_init(void)
+{
+ dep_free_wq = alloc_workqueue("drm_dep_free", WQ_UNBOUND, 0);
+ if (!dep_free_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/**
+ * drm_dep_exit() - module exit
+ *
+ * Destroys the module-private dep_free_wq workqueue.
+ */
+static void __exit drm_dep_exit(void)
+{
+ destroy_workqueue(dep_free_wq);
+ dep_free_wq = NULL;
+}
+
+module_init(drm_dep_init);
+module_exit(drm_dep_exit);
+
+MODULE_DESCRIPTION("DRM dependency queue");
+MODULE_LICENSE("Dual MIT/GPL");
diff --git a/drivers/gpu/drm/dep/drm_dep_queue.h b/drivers/gpu/drm/dep/drm_dep_queue.h
new file mode 100644
index 000000000000..e5c217a3fab5
--- /dev/null
+++ b/drivers/gpu/drm/dep/drm_dep_queue.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _DRM_DEP_QUEUE_H_
+#define _DRM_DEP_QUEUE_H_
+
+#include <linux/types.h>
+
+struct drm_dep_job;
+struct drm_dep_queue;
+
+bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
+ struct drm_dep_job *job);
+void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job);
+void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job);
+
+#if IS_ENABLED(CONFIG_PROVE_LOCKING)
+void drm_dep_queue_push_job_begin(struct drm_dep_queue *q);
+void drm_dep_queue_push_job_end(struct drm_dep_queue *q);
+#else
+static inline void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
+{
+}
+static inline void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
+{
+}
+#endif
+
+#endif /* _DRM_DEP_QUEUE_H_ */
diff --git a/include/drm/drm_dep.h b/include/drm/drm_dep.h
new file mode 100644
index 000000000000..615926584506
--- /dev/null
+++ b/include/drm/drm_dep.h
@@ -0,0 +1,597 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright 2015 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright © 2026 Intel Corporation
+ */
+
+#ifndef _DRM_DEP_H_
+#define _DRM_DEP_H_
+
+#include <drm/spsc_queue.h>
+#include <linux/dma-fence.h>
+#include <linux/xarray.h>
+#include <linux/workqueue.h>
+
+enum dma_resv_usage;
+struct dma_resv;
+struct drm_dep_fence;
+struct drm_dep_job;
+struct drm_dep_queue;
+struct drm_file;
+struct drm_gem_object;
+
+/**
+ * enum drm_dep_timedout_stat - return value of &drm_dep_queue_ops.timedout_job
+ * @DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED: driver signaled the job's finished
+ * fence during reset; drm_dep may safely drop its reference to the job.
+ * @DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB: timeout was a false alarm; reinsert the
+ * job at the head of the pending list so it can complete normally.
+ */
+enum drm_dep_timedout_stat {
+ DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED,
+ DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB,
+};
+
+/**
+ * struct drm_dep_queue_ops - driver callbacks for a dep queue
+ */
+struct drm_dep_queue_ops {
+ /**
+ * @run_job: submit the job to hardware. Returns the hardware completion
+ * fence (with a reference held for the scheduler), or NULL/ERR_PTR on
+ * synchronous completion or error.
+ */
+ struct dma_fence *(*run_job)(struct drm_dep_job *job);
+
+ /**
+ * @timedout_job: called when the TDR fires for the head job. Must stop
+ * the hardware, then return %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
+ * job's fence was signalled during reset, or
+ * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB if the timeout was spurious or
+ * signalling was otherwise delayed, and the job should be re-inserted
+ * at the head of the pending list. Any other value triggers a WARN.
+ */
+ enum drm_dep_timedout_stat (*timedout_job)(struct drm_dep_job *job);
+
+ /**
+ * @release: called when the last kref on the queue is dropped and
+ * drm_dep_queue_fini() has completed. The driver is responsible for
+ * removing @q from any internal bookkeeping, calling
+ * drm_dep_queue_release(), and then freeing the memory containing @q
+ * (e.g. via kfree_rcu() using @q->rcu). If NULL, drm_dep calls
+ * drm_dep_queue_release() and frees @q automatically via kfree_rcu().
+ * Use this when the queue is embedded in a larger structure.
+ */
+ void (*release)(struct drm_dep_queue *q);
+
+ /**
+ * @fini: if set, called instead of drm_dep_queue_fini() when the last
+ * kref is dropped. The driver is responsible for calling
+ * drm_dep_queue_fini() itself after it is done with the queue. Use this
+ * when additional teardown logic must run before fini (e.g., cleanup
+ * firmware resources associated with the queue).
+ */
+ void (*fini)(struct drm_dep_queue *q);
+};
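+
+A minimal driver implementation of these callbacks might look like the
+following sketch (all my_* names are hypothetical, not part of this patch):
+
+```c
+static struct dma_fence *my_run_job(struct drm_dep_job *job)
+{
+	struct my_job *mjob = container_of(job, struct my_job, base);
+
+	/* Submit to hardware; return the HW completion fence with a
+	 * reference held for the dep queue. */
+	return my_hw_submit(mjob);
+}
+
+static enum drm_dep_timedout_stat my_timedout_job(struct drm_dep_job *job)
+{
+	struct my_job *mjob = container_of(job, struct my_job, base);
+
+	if (my_hw_reset_and_signal(mjob))
+		return DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED;
+
+	/* Spurious timeout: reinsert at the head of the pending list. */
+	return DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB;
+}
+
+static const struct drm_dep_queue_ops my_queue_ops = {
+	.run_job = my_run_job,
+	.timedout_job = my_timedout_job,
+	/* .release and .fini left NULL: drm_dep frees the queue itself */
+};
+```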
+
+/**
+ * enum drm_dep_queue_flags - flags for &drm_dep_queue and
+ * &drm_dep_queue_init_args
+ *
+ * Flags are divided into three categories:
+ *
+ * - **Private static**: set internally at init time and never changed.
+ * Drivers must not read or write these.
+ * %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ,
+ * %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ.
+ *
+ * - **Public dynamic**: toggled at runtime by drivers via accessors.
+ * Any modification must be performed under &drm_dep_queue.sched.lock.
+ * Accessor functions provide unstable reads.
+ * %DRM_DEP_QUEUE_FLAGS_STOPPED,
+ * %DRM_DEP_QUEUE_FLAGS_KILLED.
+ *
+ * - **Public static**: supplied by the driver in
+ * &drm_dep_queue_init_args.flags at queue creation time and not modified
+ * thereafter.
+ * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
+ * %DRM_DEP_QUEUE_FLAGS_HIGHPRI,
+ * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE.
+ *
+ * @DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ: (private, static) submit workqueue was
+ * allocated by drm_dep_queue_init() and will be destroyed by
+ * drm_dep_queue_fini().
+ * @DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ: (private, static) timeout workqueue
+ * was allocated by drm_dep_queue_init() and will be destroyed by
+ * drm_dep_queue_fini().
+ * @DRM_DEP_QUEUE_FLAGS_STOPPED: (public, dynamic) the queue is stopped and
+ * will not dispatch new jobs nor remove jobs from the pending list (which
+ * would drop the drm_dep-owned reference). Set by drm_dep_queue_stop(),
+ * cleared by
+ * drm_dep_queue_start().
+ * @DRM_DEP_QUEUE_FLAGS_KILLED: (public, dynamic) the queue has been killed
+ * via drm_dep_queue_kill(). Any active dependency wait is cancelled
+ * immediately. Jobs continue to flow through run_job for bookkeeping
+ * cleanup, but dependency waiting is skipped so that queued work drains
+ * as quickly as possible.
+ * @DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED: (public, static) the queue supports
+ * the bypass path where eligible jobs skip the SPSC queue and run inline.
+ * @DRM_DEP_QUEUE_FLAGS_HIGHPRI: (public, static) the submit workqueue owned
+ * by the queue is created with %WQ_HIGHPRI, causing run-job and put-job
+ * workers to execute at elevated priority. Only privileged clients (e.g.
+ * drivers managing time-critical or real-time GPU contexts) should request
+ * this flag; granting it to unprivileged userspace would allow priority
+ * inversion attacks.
+ * Has no effect if &drm_dep_queue_init_args.submit_wq is provided.
+ * @DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE: (public, static) when set,
+ * drm_dep_job_done() may be called from hardirq context (e.g. from a
+ * hardware-signalled dma_fence callback). drm_dep_job_done() will directly
+ * dequeue the job and call drm_dep_job_put() without deferring to a
+ * workqueue. The driver's &drm_dep_job_ops.release callback must therefore
+ * be safe to invoke from IRQ context.
+ */
+enum drm_dep_queue_flags {
+ DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ = BIT(0),
+ DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ = BIT(1),
+ DRM_DEP_QUEUE_FLAGS_STOPPED = BIT(2),
+ DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED = BIT(3),
+ DRM_DEP_QUEUE_FLAGS_HIGHPRI = BIT(4),
+ DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE = BIT(5),
+ DRM_DEP_QUEUE_FLAGS_KILLED = BIT(6),
+};
+
+/**
+ * struct drm_dep_queue - a dependency-tracked GPU submission queue
+ *
+ * Combines the role of &drm_gpu_scheduler and &drm_sched_entity into a single
+ * object. Each queue owns a submit workqueue (or borrows one), a timeout
+ * workqueue, an SPSC submission queue, and a pending-job list used for TDR.
+ *
+ * Initialise with drm_dep_queue_init(), tear down with drm_dep_queue_fini().
+ * Reference counted via drm_dep_queue_get() / drm_dep_queue_put().
+ *
+ * All fields are **opaque to drivers**. Do not read or write any field
+ * directly; use the provided helper functions instead. The sole exception
+ * is @rcu, which drivers may pass to kfree_rcu() when the queue is embedded
+ * inside a larger driver-managed structure and the &drm_dep_queue_ops.release
+ * vfunc performs an RCU-deferred free.
+ */
+struct drm_dep_queue {
+ /** @ops: driver callbacks, set at init time. */
+ const struct drm_dep_queue_ops *ops;
+ /** @name: human-readable name used for workqueue and fence naming. */
+ const char *name;
+ /** @drm: owning DRM device; a drm_dev_get() reference is held for the
+ * lifetime of the queue to prevent module unload while queues are live.
+ */
+ struct drm_device *drm;
+ /** @refcount: reference count; use drm_dep_queue_get/put(). */
+ struct kref refcount;
+ /**
+ * @free_work: deferred teardown work queued unconditionally by
+ * drm_dep_queue_fini() onto the module-private dep_free_wq. The work
+ * item disables pending workers synchronously and destroys any owned
+ * workqueues before releasing the queue memory and dropping the
+ * drm_dev_get() reference. Running on dep_free_wq ensures
+ * destroy_workqueue() is never called from within one of the queue's
+ * own workers.
+ */
+ struct work_struct free_work;
+ /**
+ * @rcu: RCU head for deferred freeing.
+ *
+ * This is the **only** field drivers may access directly. When the
+ * queue is embedded in a larger structure, implement
+ * &drm_dep_queue_ops.release, call drm_dep_queue_release() to destroy
+ * internal resources, then pass this field to kfree_rcu() so that any
+ * in-flight RCU readers referencing the queue's dma_fence timeline name
+ * complete before the memory is returned. All other fields must be
+ * accessed through the provided helpers.
+ */
+ struct rcu_head rcu;
+
+ /** @sched: scheduling and workqueue state. */
+ struct {
+ /** @sched.submit_wq: ordered workqueue for run/put-job work. */
+ struct workqueue_struct *submit_wq;
+ /** @sched.timeout_wq: workqueue for the TDR delayed work. */
+ struct workqueue_struct *timeout_wq;
+ /**
+ * @sched.run_job: work item that dispatches the next queued
+ * job.
+ */
+ struct work_struct run_job;
+ /** @sched.put_job: work item that frees finished jobs. */
+ struct work_struct put_job;
+ /** @sched.tdr: delayed work item for timeout/reset (TDR). */
+ struct delayed_work tdr;
+ /**
+ * @sched.lock: mutex serialising job dispatch, bypass
+ * decisions, stop/start, and flag updates.
+ */
+ struct mutex lock;
+ /**
+ * @sched.flags: bitmask of &enum drm_dep_queue_flags.
+ * Any modification after drm_dep_queue_init() must be
+ * performed under @sched.lock.
+ */
+ enum drm_dep_queue_flags flags;
+ } sched;
+
+ /** @job: pending-job tracking state. */
+ struct {
+ /**
+ * @job.pending: list of jobs that have been dispatched to
+ * hardware and not yet freed. Protected by @job.lock.
+ */
+ struct list_head pending;
+ /**
+ * @job.queue: SPSC queue of jobs waiting to be dispatched.
+ * Producers push via drm_dep_queue_push_job(); the run_job
+ * work item pops from the consumer side.
+ */
+ struct spsc_queue queue;
+ /**
+ * @job.lock: spinlock protecting @job.pending, TDR start, and
+ * the %DRM_DEP_QUEUE_FLAGS_STOPPED flag. Always acquired with
+ * irqsave (spin_lock_irqsave / spin_unlock_irqrestore) to
+ * support %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE queues where
+ * drm_dep_job_done() may run from hardirq context.
+ */
+ spinlock_t lock;
+ /**
+ * @job.timeout: per-job TDR timeout in jiffies.
+ * %MAX_SCHEDULE_TIMEOUT means no timeout.
+ */
+ long timeout;
+#if IS_ENABLED(CONFIG_PROVE_LOCKING)
+ /**
+ * @job.push: lockdep annotation tracking the arm-to-push
+ * critical section.
+ */
+ struct {
+ /*
+ * @job.push.owner: task that currently holds the push
+ * context, used to assert single-owner invariants.
+ * NULL when idle.
+ */
+ struct task_struct *owner;
+ } push;
+#endif
+ } job;
+
+ /** @credit: hardware credit accounting. */
+ struct {
+ /** @credit.limit: maximum credits the queue can hold. */
+ u32 limit;
+ /** @credit.count: credits currently in flight (atomic). */
+ atomic_t count;
+ } credit;
+
+ /** @dep: current blocking dependency for the head SPSC job. */
+ struct {
+ /**
+ * @dep.fence: fence being waited on before the head job can
+ * run. NULL when no dependency is pending.
+ */
+ struct dma_fence *fence;
+ /**
+ * @dep.removed_fence: dependency fence whose callback has been
+ * removed. The run-job worker must drop its reference to this
+ * fence before proceeding to call run_job.
+ */
+ struct dma_fence *removed_fence;
+ /** @dep.cb: callback installed on @dep.fence. */
+ struct dma_fence_cb cb;
+ } dep;
+
+ /** @fence: fence context and sequence number state. */
+ struct {
+ /**
+ * @fence.seqno: next sequence number to assign, incremented
+ * each time a job is armed.
+ */
+ u32 seqno;
+ /**
+ * @fence.context: base DMA fence context allocated at init
+ * time. Finished fences use this context.
+ */
+ u64 context;
+ } fence;
+};
+
+/**
+ * struct drm_dep_queue_init_args - arguments for drm_dep_queue_init()
+ */
+struct drm_dep_queue_init_args {
+ /** @ops: driver callbacks; must not be NULL. */
+ const struct drm_dep_queue_ops *ops;
+ /** @name: human-readable name for workqueues and fence timelines. */
+ const char *name;
+ /** @drm: owning DRM device. A drm_dev_get() reference is taken at
+ * queue init and released when the queue is freed, preventing module
+ * unload while any queue is still alive.
+ */
+ struct drm_device *drm;
+ /**
+ * @submit_wq: workqueue for job dispatch. If NULL, an ordered
+ * workqueue is allocated and owned by the queue. If non-NULL, the
+ * workqueue must have been allocated with %WQ_MEM_RECLAIM_TAINT;
+ * drm_dep_queue_init() returns %-EINVAL otherwise.
+ */
+ struct workqueue_struct *submit_wq;
+ /**
+ * @timeout_wq: workqueue for TDR. If NULL, an ordered workqueue
+ * is allocated and owned by the queue. If non-NULL, the workqueue
+ * must have been allocated with %WQ_MEM_RECLAIM_TAINT;
+ * drm_dep_queue_init() returns %-EINVAL otherwise.
+ */
+ struct workqueue_struct *timeout_wq;
+ /** @credit_limit: maximum hardware credits; must be non-zero. */
+ u32 credit_limit;
+ /**
+ * @timeout: per-job TDR timeout in jiffies. Zero means no timeout
+ * (%MAX_SCHEDULE_TIMEOUT is used internally).
+ */
+ long timeout;
+ /**
+ * @flags: initial queue flags. %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ
+ * and %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ are managed internally
+ * and will be ignored if set here. Setting
+ * %DRM_DEP_QUEUE_FLAGS_HIGHPRI requests a high-priority submit
+ * workqueue; drivers must only set this for privileged clients.
+ */
+ enum drm_dep_queue_flags flags;
+};
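+
+Filled in by a driver, the args might look like the following hedged sketch
+(the values and my_queue_ops are illustrative, not part of this patch):
+
+```c
+struct drm_dep_queue_init_args args = {
+	.ops = &my_queue_ops,		/* hypothetical driver callbacks */
+	.name = "my-queue",
+	.drm = drm,
+	.submit_wq = NULL,		/* let the queue own an ordered wq */
+	.timeout_wq = NULL,
+	.credit_limit = 32,
+	.timeout = 5 * HZ,		/* 5 second per-job TDR */
+	.flags = DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
+};
+int err = drm_dep_queue_init(q, &args);
+```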
+
+/**
+ * struct drm_dep_job_ops - driver callbacks for a dep job
+ */
+struct drm_dep_job_ops {
+ /**
+ * @release: called when the last reference to the job is dropped.
+ *
+ * If set, the driver is responsible for freeing the job. If NULL,
+ * drm_dep_job_put() will call kfree() on the job directly.
+ */
+ void (*release)(struct drm_dep_job *job);
+};
+
+/**
+ * struct drm_dep_job - a unit of work submitted to a dep queue
+ *
+ * All fields are **opaque to drivers**. Do not read or write any field
+ * directly; use the provided helper functions instead.
+ */
+struct drm_dep_job {
+ /** @ops: driver callbacks for this job. */
+ const struct drm_dep_job_ops *ops;
+ /** @refcount: reference count, managed by drm_dep_job_get/put(). */
+ struct kref refcount;
+ /**
+ * @dependencies: xarray of &dma_fence dependencies before the job can
+ * run.
+ */
+ struct xarray dependencies;
+ /** @q: the queue this job is submitted to. */
+ struct drm_dep_queue *q;
+ /** @queue_node: SPSC queue linkage for pending submission. */
+ struct spsc_node queue_node;
+ /**
+ * @pending_link: list entry in the queue's pending job list. Protected
+ * by @job.q->job.lock.
+ */
+ struct list_head pending_link;
+ /** @dfence: finished fence for this job. */
+ struct drm_dep_fence *dfence;
+ /** @cb: fence callback used to watch for dependency completion. */
+ struct dma_fence_cb cb;
+ /** @credits: number of credits this job consumes from the queue. */
+ u32 credits;
+ /**
+ * @last_dependency: index into @dependencies of the next fence to
+ * check. Advanced by drm_dep_queue_job_dependency() as each
+ * dependency is consumed.
+ */
+ u32 last_dependency;
+ /**
+ * @invalidate_count: number of times this job has been invalidated.
+ * Incremented by drm_dep_job_invalidate_job().
+ */
+ u32 invalidate_count;
+ /**
+ * @signalling_cookie: return value of dma_fence_begin_signalling()
+ * captured in drm_dep_job_arm() and consumed by drm_dep_job_push().
+ * Not valid outside the arm→push window.
+ */
+ bool signalling_cookie;
+};
+
+/**
+ * struct drm_dep_job_init_args - arguments for drm_dep_job_init()
+ */
+struct drm_dep_job_init_args {
+ /**
+ * @ops: driver callbacks for the job, or NULL for default behaviour.
+ */
+ const struct drm_dep_job_ops *ops;
+ /** @q: the queue to associate the job with. A reference is taken. */
+ struct drm_dep_queue *q;
+ /** @credits: number of credits this job consumes; must be non-zero. */
+ u32 credits;
+};
+
+/* Queue API */
+
+/**
+ * drm_dep_queue_sched_guard() - acquire the queue scheduler lock as a guard
+ * @__q: dep queue whose scheduler lock to acquire
+ *
+ * Acquires @__q->sched.lock as a scoped mutex guard (released automatically
+ * when the enclosing scope exits). This lock serialises all scheduler state
+ * transitions — stop/start/kill flag changes, bypass-path decisions, and the
+ * run-job worker — so it must be held when the driver needs to atomically
+ * inspect or modify queue state in relation to job submission.
+ *
+ * **When to use**
+ *
+ * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
+ * serialise their own submit work against the bypass path must acquire this
+ * guard. Without it, a concurrent caller of drm_dep_job_push() could take
+ * the bypass path and call ops->run_job() inline between the driver's
+ * eligibility check and its corresponding action, producing a race.
+ *
+ * **Constraint: only from submit_wq worker context**
+ *
+ * Drivers must only acquire this guard from a work item running on the
+ * queue's submit workqueue (@q->sched.submit_wq).
+ *
+ * Context: Process context only; drivers must call this from submit_wq
+ * work.
+ */
+#define drm_dep_queue_sched_guard(__q) \
+ guard(mutex)(&(__q)->sched.lock)
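
A minimal sketch of how a driver-side submit worker might use this guard when
%DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set; my_driver_queue, submit_work,
and my_hw_submit() are illustrative names, not part of this patch:

```c
/* Illustrative only: my_driver_queue and my_hw_submit() are hypothetical. */
static void my_driver_submit_worker(struct work_struct *work)
{
	struct my_driver_queue *mq =
		container_of(work, struct my_driver_queue, submit_work);

	/* Serialise against the bypass path taken by drm_dep_job_push(). */
	drm_dep_queue_sched_guard(mq->q);

	if (drm_dep_queue_is_stopped(mq->q))
		return;	/* guard released automatically on scope exit */

	my_hw_submit(mq);
}
```

The scoped guard means no explicit unlock is needed on any return path, which
is what makes the eligibility check and the submission atomic with respect to
the bypass path.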
+
+int drm_dep_queue_init(struct drm_dep_queue *q,
+ const struct drm_dep_queue_init_args *args);
+void drm_dep_queue_fini(struct drm_dep_queue *q);
+void drm_dep_queue_release(struct drm_dep_queue *q);
+struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q);
+bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q);
+void drm_dep_queue_put(struct drm_dep_queue *q);
+void drm_dep_queue_stop(struct drm_dep_queue *q);
+void drm_dep_queue_start(struct drm_dep_queue *q);
+void drm_dep_queue_kill(struct drm_dep_queue *q);
+void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q);
+void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q);
+void drm_dep_queue_resume_timeout(struct drm_dep_queue *q);
+bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
+ struct work_struct *work);
+bool drm_dep_queue_is_stopped(struct drm_dep_queue *q);
+bool drm_dep_queue_is_killed(struct drm_dep_queue *q);
+bool drm_dep_queue_is_initialized(struct drm_dep_queue *q);
+void drm_dep_queue_set_stopped(struct drm_dep_queue *q);
+unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q);
+long drm_dep_queue_timeout(const struct drm_dep_queue *q);
+struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q);
+struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q);
+
+/* Job API */
+
+/**
+ * DRM_DEP_JOB_FENCE_PREALLOC - sentinel value for pre-allocating a dependency slot
+ *
+ * Pass this to drm_dep_job_add_dependency() instead of a real fence to
+ * pre-allocate a slot in the job's dependency xarray during the preparation
+ * phase (where GFP_KERNEL is available). The returned xarray index identifies
+ * the slot. Call drm_dep_job_replace_dependency() later — inside a
+ * dma_fence_begin_signalling() region if needed — to swap in the real fence
+ * without further allocation.
+ *
+ * This sentinel is never treated as a dma_fence; it carries no reference count
+ * and must not be passed to dma_fence_put(). It is only valid as an argument
+ * to drm_dep_job_add_dependency() and as the expected stored value checked by
+ * drm_dep_job_replace_dependency().
+ */
+#define DRM_DEP_JOB_FENCE_PREALLOC ((struct dma_fence *)-1)
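
Assuming the API sketched in the kernel-doc above, the two-phase flow might
look like the following. Treating the non-negative return value of
drm_dep_job_add_dependency() as the reserved slot index is an assumption
drawn from the wording "The returned xarray index identifies the slot":

```c
/* Preparation phase: GFP_KERNEL allocations are still allowed here. */
int slot = drm_dep_job_add_dependency(job, DRM_DEP_JOB_FENCE_PREALLOC);

if (slot < 0)
	return slot;

/* ... later, possibly inside a dma_fence_begin_signalling() section ... */

/* No allocation happens here; the slot was reserved above. */
drm_dep_job_replace_dependency(job, slot, real_fence);
```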
+
+int drm_dep_job_init(struct drm_dep_job *job,
+ const struct drm_dep_job_init_args *args);
+struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job);
+void drm_dep_job_put(struct drm_dep_job *job);
+void drm_dep_job_arm(struct drm_dep_job *job);
+void drm_dep_job_push(struct drm_dep_job *job);
+int drm_dep_job_add_dependency(struct drm_dep_job *job,
+ struct dma_fence *fence);
+void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
+ struct dma_fence *fence);
+int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
+ struct drm_file *file, u32 handle,
+ u32 point);
+int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
+ struct dma_resv *resv,
+ enum dma_resv_usage usage);
+int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
+ struct drm_gem_object *obj,
+ bool write);
+bool drm_dep_job_is_signaled(struct drm_dep_job *job);
+bool drm_dep_job_is_finished(struct drm_dep_job *job);
+bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold);
+struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job);
+
+/**
+ * struct drm_dep_queue_pending_job_iter - iterator state for
+ * drm_dep_queue_for_each_pending_job()
+ * @q: queue being iterated
+ */
+struct drm_dep_queue_pending_job_iter {
+ struct drm_dep_queue *q;
+};
+
+/* Drivers should never call this directly */
+static inline struct drm_dep_queue_pending_job_iter
+__drm_dep_queue_pending_job_iter_begin(struct drm_dep_queue *q)
+{
+ struct drm_dep_queue_pending_job_iter iter = {
+ .q = q,
+ };
+
+ WARN_ON(!drm_dep_queue_is_stopped(q));
+ return iter;
+}
+
+/* Drivers should never call this directly */
+static inline void
+__drm_dep_queue_pending_job_iter_end(struct drm_dep_queue_pending_job_iter iter)
+{
+ WARN_ON(!drm_dep_queue_is_stopped(iter.q));
+}
+
+/* clang-format off */
+DEFINE_CLASS(drm_dep_queue_pending_job_iter,
+ struct drm_dep_queue_pending_job_iter,
+ __drm_dep_queue_pending_job_iter_end(_T),
+ __drm_dep_queue_pending_job_iter_begin(__q),
+ struct drm_dep_queue *__q);
+/* clang-format on */
+static inline void *
+class_drm_dep_queue_pending_job_iter_lock_ptr(
+ class_drm_dep_queue_pending_job_iter_t *_T)
+{ return _T; }
+#define class_drm_dep_queue_pending_job_iter_is_conditional false
+
+/**
+ * drm_dep_queue_for_each_pending_job() - iterate over all pending jobs
+ * in a queue
+ * @__job: loop cursor, a &struct drm_dep_job pointer
+ * @__q: &struct drm_dep_queue to iterate
+ *
+ * Iterates over every job currently on @__q->job.pending. The queue must be
+ * stopped (drm_dep_queue_stop() called) before using this iterator; a WARN_ON
+ * fires at the start and end of the scope if it is not.
+ *
+ * Context: Any context.
+ */
+#define drm_dep_queue_for_each_pending_job(__job, __q) \
+ scoped_guard(drm_dep_queue_pending_job_iter, (__q)) \
+ list_for_each_entry((__job), &(__q)->job.pending, pending_link)
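
Per the kernel-doc above, the iterator is only valid while the queue is
stopped, so a reset-style traversal would be bracketed like this
(my_requeue_job() is a hypothetical driver helper):

```c
struct drm_dep_job *job;

drm_dep_queue_stop(q);

/* The scoped guard WARNs at entry and exit if the queue is not stopped. */
drm_dep_queue_for_each_pending_job(job, q)
	my_requeue_job(job);

drm_dep_queue_start(q);
```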
+
+#endif
--
2.34.1
* [RFC PATCH 11/12] accel/amdxdna: Convert to drm_dep scheduler layer
[not found] <20260316043255.226352-1-matthew.brost@intel.com>
2026-03-16 4:32 ` [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations Matthew Brost
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
@ 2026-03-16 4:32 ` Matthew Brost
2026-03-16 4:32 ` [RFC PATCH 12/12] drm/panthor: " Matthew Brost
3 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-16 4:32 UTC (permalink / raw)
To: intel-xe; +Cc: dri-devel, Lizhi Hou, Min Ma, Oded Gabbay, linux-kernel
Replace drm_gpu_scheduler/drm_sched_entity with the drm_dep layer
(struct drm_dep_queue / struct drm_dep_job).
aie2_pci.h: struct aie2_hwctx_priv drops the inline struct
drm_gpu_scheduler and struct drm_sched_entity fields in favour of a
heap-allocated struct drm_dep_queue *q. The queue is allocated with
kzalloc_obj in aie2_ctx_init() and freed via drm_dep_queue_put() on
teardown and error paths.
amdxdna_ctx.h: struct amdxdna_sched_job drops struct drm_sched_job base
and struct kref refcnt in favour of struct drm_dep_job base. Job
reference counting is handled by drm_dep_job_get/put() rather than a
private kref; aie2_job_put() is replaced with drm_dep_job_put().
aie2_ctx.c:
- aie2_sched_job_run() updated to take struct drm_dep_job *; adds an
early return of NULL when the queue is killed via
drm_dep_queue_is_killed().
- aie2_sched_job_free() renamed to aie2_sched_job_release() and wired
as the .release vfunc on struct drm_dep_job_ops; the job free path
(cleanup, counter increment, wake_up, dma_fence_put, kfree) is
unchanged.
- aie2_sched_job_timedout() updated to return drm_dep_timedout_stat.
Stop/restart logic is guarded so it only runs once if the TDR fires
multiple times. Returns DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
job has already finished, DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB
otherwise (replacing the previous DRM_GPU_SCHED_STAT_RESET).
- drm_sched_stop/start replaced with drm_dep_queue_stop/start.
- drm_sched_entity_destroy replaced with drm_dep_queue_kill +
drm_dep_queue_put on the context destroy path.
- drm_sched_job_init replaced with drm_dep_job_init.
trace/amdxdna.h: trace event updated to accept struct drm_dep_job *
instead of struct drm_sched_job *.
Cc: dri-devel@lists.freedesktop.org
Cc: Lizhi Hou <lizhi.hou@amd.com>
Cc: Min Ma <mamin506@gmail.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Assisted-by: GitHub Copilot:claude-sonnet-4.6
---
drivers/accel/amdxdna/Kconfig | 2 +-
drivers/accel/amdxdna/aie2_ctx.c | 144 +++++++++++++++-------------
drivers/accel/amdxdna/aie2_pci.h | 4 +-
drivers/accel/amdxdna/amdxdna_ctx.c | 5 +-
drivers/accel/amdxdna/amdxdna_ctx.h | 4 +-
include/trace/events/amdxdna.h | 12 ++-
6 files changed, 88 insertions(+), 83 deletions(-)
diff --git a/drivers/accel/amdxdna/Kconfig b/drivers/accel/amdxdna/Kconfig
index f39d7a87296c..fdce1ef57cc0 100644
--- a/drivers/accel/amdxdna/Kconfig
+++ b/drivers/accel/amdxdna/Kconfig
@@ -6,7 +6,7 @@ config DRM_ACCEL_AMDXDNA
depends on DRM_ACCEL
depends on PCI && HAS_IOMEM
depends on X86_64
- select DRM_SCHED
+ select DRM_DEP
select DRM_GEM_SHMEM_HELPER
select FW_LOADER
select HMM_MIRROR
diff --git a/drivers/accel/amdxdna/aie2_ctx.c b/drivers/accel/amdxdna/aie2_ctx.c
index 202c7a3eef24..89d3d30a5ad0 100644
--- a/drivers/accel/amdxdna/aie2_ctx.c
+++ b/drivers/accel/amdxdna/aie2_ctx.c
@@ -4,6 +4,7 @@
*/
#include <drm/amdxdna_accel.h>
+#include <drm/drm_dep.h>
#include <drm/drm_device.h>
#include <drm/drm_gem.h>
#include <drm/drm_gem_shmem_helper.h>
@@ -29,31 +30,18 @@ MODULE_PARM_DESC(force_cmdlist, "Force use command list (Default true)");
#define HWCTX_MAX_TIMEOUT 60000 /* milliseconds */
-static void aie2_job_release(struct kref *ref)
-{
- struct amdxdna_sched_job *job;
-
- job = container_of(ref, struct amdxdna_sched_job, refcnt);
- amdxdna_sched_job_cleanup(job);
- atomic64_inc(&job->hwctx->job_free_cnt);
- wake_up(&job->hwctx->priv->job_free_wq);
- if (job->out_fence)
- dma_fence_put(job->out_fence);
- kfree(job);
-}
-
static void aie2_job_put(struct amdxdna_sched_job *job)
{
- kref_put(&job->refcnt, aie2_job_release);
+ drm_dep_job_put(&job->base);
}
/* The bad_job is used in aie2_sched_job_timedout, otherwise, set it to NULL */
static void aie2_hwctx_stop(struct amdxdna_dev *xdna, struct amdxdna_hwctx *hwctx,
- struct drm_sched_job *bad_job)
+ struct drm_dep_job *bad_job)
{
- drm_sched_stop(&hwctx->priv->sched, bad_job);
+ drm_dep_queue_stop(hwctx->priv->q);
aie2_destroy_context(xdna->dev_handle, hwctx);
- drm_sched_start(&hwctx->priv->sched, 0);
+ drm_dep_queue_start(hwctx->priv->q);
}
static int aie2_hwctx_restart(struct amdxdna_dev *xdna, struct amdxdna_hwctx *hwctx)
@@ -282,21 +270,24 @@ aie2_sched_cmdlist_resp_handler(void *handle, void __iomem *data, size_t size)
}
static struct dma_fence *
-aie2_sched_job_run(struct drm_sched_job *sched_job)
+aie2_sched_job_run(struct drm_dep_job *dep_job)
{
- struct amdxdna_sched_job *job = drm_job_to_xdna_job(sched_job);
+ struct amdxdna_sched_job *job = drm_job_to_xdna_job(dep_job);
struct amdxdna_gem_obj *cmd_abo = job->cmd_bo;
struct amdxdna_hwctx *hwctx = job->hwctx;
struct dma_fence *fence;
int ret;
+ if (drm_dep_queue_is_killed(hwctx->priv->q))
+ return NULL;
+
if (!hwctx->priv->mbox_chann)
return NULL;
if (!mmget_not_zero(job->mm))
return ERR_PTR(-ESRCH);
- kref_get(&job->refcnt);
+ drm_dep_job_get(&job->base);
fence = dma_fence_get(job->fence);
if (job->drv_cmd) {
@@ -330,46 +321,58 @@ aie2_sched_job_run(struct drm_sched_job *sched_job)
mmput(job->mm);
fence = ERR_PTR(ret);
}
- trace_xdna_job(sched_job, hwctx->name, "sent to device", job->seq);
+ trace_xdna_job(dep_job, hwctx->name, "sent to device", job->seq);
return fence;
}
-static void aie2_sched_job_free(struct drm_sched_job *sched_job)
+static void aie2_sched_job_release(struct drm_dep_job *dep_job)
{
- struct amdxdna_sched_job *job = drm_job_to_xdna_job(sched_job);
+ struct amdxdna_sched_job *job = drm_job_to_xdna_job(dep_job);
struct amdxdna_hwctx *hwctx = job->hwctx;
- trace_xdna_job(sched_job, hwctx->name, "job free", job->seq);
+ trace_xdna_job(dep_job, hwctx->name, "job free", job->seq);
if (!job->job_done)
up(&hwctx->priv->job_sem);
- drm_sched_job_cleanup(sched_job);
- aie2_job_put(job);
+ amdxdna_sched_job_cleanup(job);
+ atomic64_inc(&job->hwctx->job_free_cnt);
+ wake_up(&job->hwctx->priv->job_free_wq);
+ if (job->out_fence)
+ dma_fence_put(job->out_fence);
+ kfree(job);
}
-static enum drm_gpu_sched_stat
-aie2_sched_job_timedout(struct drm_sched_job *sched_job)
+static const struct drm_dep_job_ops job_ops = {
+ .release = aie2_sched_job_release,
+};
+
+static enum drm_dep_timedout_stat
+aie2_sched_job_timedout(struct drm_dep_job *dep_job)
{
- struct amdxdna_sched_job *job = drm_job_to_xdna_job(sched_job);
+ struct amdxdna_sched_job *job = drm_job_to_xdna_job(dep_job);
struct amdxdna_hwctx *hwctx = job->hwctx;
struct amdxdna_dev *xdna;
- xdna = hwctx->client->xdna;
- trace_xdna_job(sched_job, hwctx->name, "job timedout", job->seq);
- job->job_timeout = true;
- mutex_lock(&xdna->dev_lock);
- aie2_hwctx_stop(xdna, hwctx, sched_job);
+ if (!job->job_timeout) {
+ xdna = hwctx->client->xdna;
+ trace_xdna_job(dep_job, hwctx->name, "job timedout", job->seq);
+ job->job_timeout = true;
+ mutex_lock(&xdna->dev_lock);
+ aie2_hwctx_stop(xdna, hwctx, dep_job);
- aie2_hwctx_restart(xdna, hwctx);
- mutex_unlock(&xdna->dev_lock);
+ aie2_hwctx_restart(xdna, hwctx);
+ mutex_unlock(&xdna->dev_lock);
+ }
- return DRM_GPU_SCHED_STAT_RESET;
+ if (drm_dep_job_is_finished(dep_job))
+ return DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED;
+ else
+ return DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB;
}
-static const struct drm_sched_backend_ops sched_ops = {
+static const struct drm_dep_queue_ops sched_ops = {
.run_job = aie2_sched_job_run,
- .free_job = aie2_sched_job_free,
.timedout_job = aie2_sched_job_timedout,
};
@@ -534,15 +537,13 @@ int aie2_hwctx_init(struct amdxdna_hwctx *hwctx)
{
struct amdxdna_client *client = hwctx->client;
struct amdxdna_dev *xdna = client->xdna;
- const struct drm_sched_init_args args = {
+ const struct drm_dep_queue_init_args args = {
.ops = &sched_ops,
- .num_rqs = DRM_SCHED_PRIORITY_COUNT,
.credit_limit = HWCTX_MAX_CMDS,
.timeout = msecs_to_jiffies(HWCTX_MAX_TIMEOUT),
.name = "amdxdna_js",
- .dev = xdna->ddev.dev,
+ .drm = &xdna->ddev,
};
- struct drm_gpu_scheduler *sched;
struct amdxdna_hwctx_priv *priv;
struct amdxdna_gem_obj *heap;
int i, ret;
@@ -591,30 +592,29 @@ int aie2_hwctx_init(struct amdxdna_hwctx *hwctx)
priv->cmd_buf[i] = abo;
}
- sched = &priv->sched;
mutex_init(&priv->io_lock);
fs_reclaim_acquire(GFP_KERNEL);
might_lock(&priv->io_lock);
fs_reclaim_release(GFP_KERNEL);
- ret = drm_sched_init(sched, &args);
- if (ret) {
- XDNA_ERR(xdna, "Failed to init DRM scheduler. ret %d", ret);
+ priv->q = kzalloc_obj(*priv->q);
+ if (!priv->q) {
+ ret = -ENOMEM;
goto free_cmd_bufs;
}
- ret = drm_sched_entity_init(&priv->entity, DRM_SCHED_PRIORITY_NORMAL,
- &sched, 1, NULL);
+ ret = drm_dep_queue_init(priv->q, &args);
if (ret) {
- XDNA_ERR(xdna, "Failed to initial sched entiry. ret %d", ret);
- goto free_sched;
+ XDNA_ERR(xdna, "Failed to init dep queue. ret %d", ret);
+ kfree(priv->q);
+ goto free_cmd_bufs;
}
ret = aie2_hwctx_col_list(hwctx);
if (ret) {
XDNA_ERR(xdna, "Create col list failed, ret %d", ret);
- goto free_entity;
+ goto free_queue;
}
ret = amdxdna_pm_resume_get_locked(xdna);
@@ -654,10 +654,8 @@ int aie2_hwctx_init(struct amdxdna_hwctx *hwctx)
amdxdna_pm_suspend_put(xdna);
free_col_list:
kfree(hwctx->col_list);
-free_entity:
- drm_sched_entity_destroy(&priv->entity);
-free_sched:
- drm_sched_fini(&priv->sched);
+free_queue:
+ drm_dep_queue_put(priv->q);
free_cmd_bufs:
for (i = 0; i < ARRAY_SIZE(priv->cmd_buf); i++) {
if (!priv->cmd_buf[i])
@@ -683,12 +681,13 @@ void aie2_hwctx_fini(struct amdxdna_hwctx *hwctx)
aie2_hwctx_wait_for_idle(hwctx);
/* Request fw to destroy hwctx and cancel the rest pending requests */
- drm_sched_stop(&hwctx->priv->sched, NULL);
+ drm_dep_queue_stop(hwctx->priv->q);
aie2_release_resource(hwctx);
- drm_sched_start(&hwctx->priv->sched, 0);
+ drm_dep_queue_start(hwctx->priv->q);
mutex_unlock(&xdna->dev_lock);
- drm_sched_entity_destroy(&hwctx->priv->entity);
+ drm_dep_queue_kill(hwctx->priv->q);
+ drm_dep_queue_put(hwctx->priv->q);
/* Wait for all submitted jobs to be completed or canceled */
wait_event(hwctx->priv->job_free_wq,
@@ -696,7 +695,6 @@ void aie2_hwctx_fini(struct amdxdna_hwctx *hwctx)
atomic64_read(&hwctx->job_free_cnt));
mutex_lock(&xdna->dev_lock);
- drm_sched_fini(&hwctx->priv->sched);
aie2_ctx_syncobj_destroy(hwctx);
for (idx = 0; idx < ARRAY_SIZE(hwctx->priv->cmd_buf); idx++)
@@ -965,6 +963,7 @@ int aie2_cmd_submit(struct amdxdna_hwctx *hwctx, struct amdxdna_sched_job *job,
ret = down_interruptible(&hwctx->priv->job_sem);
if (ret) {
XDNA_ERR(xdna, "Grab job sem failed, ret %d", ret);
- return ret;
+ goto err_sem;
}
@@ -975,10 +974,13 @@ int aie2_cmd_submit(struct amdxdna_hwctx *hwctx, struct amdxdna_sched_job *job,
goto up_sem;
}
- ret = drm_sched_job_init(&job->base, &hwctx->priv->entity, 1, hwctx,
- hwctx->client->filp->client_id);
+ ret = drm_dep_job_init(&job->base, &(struct drm_dep_job_init_args){
+ .ops = &job_ops,
+ .q = hwctx->priv->q,
+ .credits = 1,
+ });
if (ret) {
- XDNA_ERR(xdna, "DRM job init failed, ret %d", ret);
+ XDNA_ERR(xdna, "DRM dep job init failed, ret %d", ret);
goto free_chain;
}
@@ -1020,13 +1022,12 @@ int aie2_cmd_submit(struct amdxdna_hwctx *hwctx, struct amdxdna_sched_job *job,
}
mutex_lock(&hwctx->priv->io_lock);
- drm_sched_job_arm(&job->base);
- job->out_fence = dma_fence_get(&job->base.s_fence->finished);
+ drm_dep_job_arm(&job->base);
+ job->out_fence = dma_fence_get(drm_dep_job_finished_fence(&job->base));
for (i = 0; i < job->bo_cnt; i++)
dma_resv_add_fence(job->bos[i]->resv, job->out_fence, DMA_RESV_USAGE_WRITE);
job->seq = hwctx->priv->seq++;
- kref_get(&job->refcnt);
- drm_sched_entity_push_job(&job->base);
+ drm_dep_job_push(&job->base);
*seq = job->seq;
drm_syncobj_add_point(hwctx->priv->syncobj, chain, job->out_fence, *seq);
@@ -1035,18 +1036,23 @@ int aie2_cmd_submit(struct amdxdna_hwctx *hwctx, struct amdxdna_sched_job *job,
up_read(&xdna->notifier_lock);
drm_gem_unlock_reservations(job->bos, job->bo_cnt, &acquire_ctx);
- aie2_job_put(job);
atomic64_inc(&hwctx->job_submit_cnt);
+ aie2_job_put(job);
return 0;
cleanup_job:
- drm_sched_job_cleanup(&job->base);
+ aie2_job_put(job);
+ return ret;
+
free_chain:
dma_fence_chain_free(chain);
up_sem:
up(&hwctx->priv->job_sem);
job->job_done = true;
+err_sem:
+ amdxdna_sched_job_cleanup(job);
+ kfree(job);
return ret;
}
diff --git a/drivers/accel/amdxdna/aie2_pci.h b/drivers/accel/amdxdna/aie2_pci.h
index 885ae7e6bfc7..63edcd7fb631 100644
--- a/drivers/accel/amdxdna/aie2_pci.h
+++ b/drivers/accel/amdxdna/aie2_pci.h
@@ -7,6 +7,7 @@
#define _AIE2_PCI_H_
#include <drm/amdxdna_accel.h>
+#include <drm/drm_dep.h>
#include <linux/limits.h>
#include <linux/semaphore.h>
@@ -165,8 +166,7 @@ struct amdxdna_hwctx_priv {
struct amdxdna_gem_obj *heap;
void *mbox_chann;
- struct drm_gpu_scheduler sched;
- struct drm_sched_entity entity;
+ struct drm_dep_queue *q;
struct mutex io_lock; /* protect seq and cmd order */
struct wait_queue_head job_free_wq;
diff --git a/drivers/accel/amdxdna/amdxdna_ctx.c b/drivers/accel/amdxdna/amdxdna_ctx.c
index 838430903a3e..a9dc1677db47 100644
--- a/drivers/accel/amdxdna/amdxdna_ctx.c
+++ b/drivers/accel/amdxdna/amdxdna_ctx.c
@@ -509,11 +509,10 @@ int amdxdna_cmd_submit(struct amdxdna_client *client,
ret = -ENOMEM;
goto unlock_srcu;
}
- kref_init(&job->refcnt);
ret = xdna->dev_info->ops->cmd_submit(hwctx, job, seq);
if (ret)
- goto put_fence;
+ return ret;
/*
* The amdxdna_hwctx_destroy_rcu() will release hwctx and associated
@@ -526,8 +525,6 @@ int amdxdna_cmd_submit(struct amdxdna_client *client,
return 0;
-put_fence:
- dma_fence_put(job->fence);
unlock_srcu:
srcu_read_unlock(&client->hwctx_srcu, idx);
amdxdna_pm_suspend_put(xdna);
diff --git a/drivers/accel/amdxdna/amdxdna_ctx.h b/drivers/accel/amdxdna/amdxdna_ctx.h
index fbdf9d000871..a92bd4d6f817 100644
--- a/drivers/accel/amdxdna/amdxdna_ctx.h
+++ b/drivers/accel/amdxdna/amdxdna_ctx.h
@@ -7,6 +7,7 @@
#define _AMDXDNA_CTX_H_
#include <linux/bitfield.h>
+#include <drm/drm_dep.h>
#include "amdxdna_gem.h"
@@ -123,8 +124,7 @@ struct amdxdna_drv_cmd {
};
struct amdxdna_sched_job {
- struct drm_sched_job base;
- struct kref refcnt;
+ struct drm_dep_job base;
struct amdxdna_hwctx *hwctx;
struct mm_struct *mm;
/* The fence to notice DRM scheduler that job is done by hardware */
diff --git a/include/trace/events/amdxdna.h b/include/trace/events/amdxdna.h
index c6cb2da7b706..798958edeb60 100644
--- a/include/trace/events/amdxdna.h
+++ b/include/trace/events/amdxdna.h
@@ -9,7 +9,7 @@
#if !defined(_TRACE_AMDXDNA_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_AMDXDNA_H
-#include <drm/gpu_scheduler.h>
+#include <drm/drm_dep.h>
#include <linux/tracepoint.h>
TRACE_EVENT(amdxdna_debug_point,
@@ -30,9 +30,9 @@ TRACE_EVENT(amdxdna_debug_point,
);
TRACE_EVENT(xdna_job,
- TP_PROTO(struct drm_sched_job *sched_job, const char *name, const char *str, u64 seq),
+ TP_PROTO(struct drm_dep_job *dep_job, const char *name, const char *str, u64 seq),
- TP_ARGS(sched_job, name, str, seq),
+ TP_ARGS(dep_job, name, str, seq),
TP_STRUCT__entry(__string(name, name)
__string(str, str)
@@ -42,8 +42,10 @@ TRACE_EVENT(xdna_job,
TP_fast_assign(__assign_str(name);
__assign_str(str);
- __entry->fence_context = sched_job->s_fence->finished.context;
- __entry->fence_seqno = sched_job->s_fence->finished.seqno;
+ __entry->fence_context =
+ drm_dep_job_finished_fence(dep_job)->context;
+ __entry->fence_seqno =
+ drm_dep_job_finished_fence(dep_job)->seqno;
__entry->seq = seq;),
TP_printk("fence=(context:%llu, seqno:%lld), %s seq#:%lld %s",
--
2.34.1
* [RFC PATCH 12/12] drm/panthor: Convert to drm_dep scheduler layer
[not found] <20260316043255.226352-1-matthew.brost@intel.com>
` (2 preceding siblings ...)
2026-03-16 4:32 ` [RFC PATCH 11/12] accel/amdxdna: Convert to drm_dep scheduler layer Matthew Brost
@ 2026-03-16 4:32 ` Matthew Brost
3 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-16 4:32 UTC (permalink / raw)
To: intel-xe
Cc: dri-devel, Boris Brezillon, Christian König, David Airlie,
Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Simona Vetter,
Steven Price, Sumit Semwal, Thomas Zimmermann, linux-kernel
Replace drm_gpu_scheduler/drm_sched_entity with the drm_dep layer
(struct drm_dep_queue / struct drm_dep_job) across all Panthor
submission paths: the CSF queue scheduler and the VM_BIND scheduler.
panthor_sched.c — CSF queue scheduler:
struct panthor_queue drops the inline struct drm_gpu_scheduler and
struct drm_sched_entity, replacing them with an embedded struct
drm_dep_queue q. The 1:1 scheduler:entity pairing that drm_sched
required collapses into the single queue object. queue_run_job() and
queue_timedout_job() updated to drm_dep signatures and return types.
queue_timedout_job() guards the reset path so it only fires once;
returns DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED. Stop/start on reset
updated to drm_dep_queue_stop/start. The timeout workqueue is
accessed via drm_dep_queue_timeout_wq() for mod_delayed_work() calls.
Queue teardown simplified from a multi-step disable_delayed_work_sync
/ entity_destroy / sched_fini sequence to drm_dep_queue_put().
struct panthor_job drops struct drm_sched_job base and struct kref
refcount in favour of struct drm_dep_job base. panthor_job_get/put()
updated to drm_dep_job_get/put(). job_release() becomes the .release
vfunc on struct drm_dep_job_ops; job_release() body moves to
job_cleanup() which is called from both the release vfunc and the
init error path. drm_sched_job_init replaced with drm_dep_job_init.
panthor_job_update_resvs() updated to use drm_dep_job_finished_fence()
instead of sched_job->s_fence->finished.
panthor_mmu.c — VM_BIND scheduler:
struct panthor_vm drops the inline struct drm_gpu_scheduler and struct
drm_sched_entity in favour of a heap-allocated struct drm_dep_queue *q.
The queue is allocated with kzalloc_obj and freed via
drm_dep_queue_put(). Queue init gains
DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE and
DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED flags appropriate for the VM_BIND
path. panthor_vm_bind_run_job() and panthor_vm_bind_timedout_job()
updated to drm_dep signatures; timedout returns
DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED. panthor_vm_bind_job_release()
becomes the .release vfunc; the old panthor_vm_bind_job_put() wrapper
is replaced with drm_dep_job_put(). VM stop/start on reset path and
destroy path updated to drm_dep_queue_stop/start/put.
drm_dep_job_finished_fence() used in place of s_fence->finished.
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Steven Price <steven.price@arm.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Assisted-by: GitHub Copilot:claude-sonnet-4.6
---
drivers/gpu/drm/panthor/Kconfig | 2 +-
drivers/gpu/drm/panthor/panthor_device.c | 5 +-
drivers/gpu/drm/panthor/panthor_device.h | 2 +-
drivers/gpu/drm/panthor/panthor_drv.c | 35 ++--
drivers/gpu/drm/panthor/panthor_mmu.c | 160 +++++++--------
drivers/gpu/drm/panthor/panthor_mmu.h | 14 +-
drivers/gpu/drm/panthor/panthor_sched.c | 242 +++++++++++------------
drivers/gpu/drm/panthor/panthor_sched.h | 12 +-
8 files changed, 223 insertions(+), 249 deletions(-)
diff --git a/drivers/gpu/drm/panthor/Kconfig b/drivers/gpu/drm/panthor/Kconfig
index 55b40ad07f3b..e22f7dc33dff 100644
--- a/drivers/gpu/drm/panthor/Kconfig
+++ b/drivers/gpu/drm/panthor/Kconfig
@@ -10,7 +10,7 @@ config DRM_PANTHOR
select DRM_EXEC
select DRM_GEM_SHMEM_HELPER
select DRM_GPUVM
- select DRM_SCHED
+ select DRM_DEP
select IOMMU_IO_PGTABLE_LPAE
select IOMMU_SUPPORT
select PM_DEVFREQ
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 54fbb1aa07c5..66a01f26a52b 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -11,6 +11,7 @@
#include <linux/regulator/consumer.h>
#include <linux/reset.h>
+#include <drm/drm_dep.h>
#include <drm/drm_drv.h>
#include <drm/drm_managed.h>
#include <drm/drm_print.h>
@@ -231,7 +232,9 @@ int panthor_device_init(struct panthor_device *ptdev)
*dummy_page_virt = 1;
INIT_WORK(&ptdev->reset.work, panthor_device_reset_work);
- ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq", 0);
+ ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq",
+ WQ_MEM_RECLAIM |
+ WQ_MEM_WARN_ON_RECLAIM);
if (!ptdev->reset.wq)
return -ENOMEM;
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index b6696f73a536..d2c05c1ee513 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -15,7 +15,7 @@
#include <drm/drm_device.h>
#include <drm/drm_mm.h>
-#include <drm/gpu_scheduler.h>
+#include <drm/drm_dep.h>
#include <drm/panthor_drm.h>
struct panthor_csf;
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 1bcec6a2e3e0..086f9f28c6be 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -23,7 +23,7 @@
#include <drm/drm_print.h>
#include <drm/drm_syncobj.h>
#include <drm/drm_utils.h>
-#include <drm/gpu_scheduler.h>
+#include <drm/drm_dep.h>
#include <drm/panthor_drm.h>
#include "panthor_devfreq.h"
@@ -269,8 +269,8 @@ struct panthor_sync_signal {
* struct panthor_job_ctx - Job context
*/
struct panthor_job_ctx {
- /** @job: The job that is about to be submitted to drm_sched. */
- struct drm_sched_job *job;
+ /** @job: The job that is about to be submitted to drm_dep. */
+ struct drm_dep_job *job;
/** @syncops: Array of sync operations. */
struct drm_panthor_sync_op *syncops;
@@ -452,7 +452,7 @@ panthor_submit_ctx_search_sync_signal(struct panthor_submit_ctx *ctx, u32 handle
*/
static int
panthor_submit_ctx_add_job(struct panthor_submit_ctx *ctx, u32 idx,
- struct drm_sched_job *job,
+ struct drm_dep_job *job,
const struct drm_panthor_obj_array *syncs)
{
int ret;
@@ -502,7 +502,7 @@ panthor_submit_ctx_update_job_sync_signal_fences(struct panthor_submit_ctx *ctx,
struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
struct panthor_device,
base);
- struct dma_fence *done_fence = &ctx->jobs[job_idx].job->s_fence->finished;
+ struct dma_fence *done_fence = drm_dep_job_finished_fence(ctx->jobs[job_idx].job);
const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
@@ -604,7 +604,7 @@ panthor_submit_ctx_add_sync_deps_to_job(struct panthor_submit_ctx *ctx,
struct panthor_device,
base);
const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
- struct drm_sched_job *job = ctx->jobs[job_idx].job;
+ struct drm_dep_job *job = ctx->jobs[job_idx].job;
u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
int ret = 0;
@@ -634,7 +634,7 @@ panthor_submit_ctx_add_sync_deps_to_job(struct panthor_submit_ctx *ctx,
return ret;
}
- ret = drm_sched_job_add_dependency(job, fence);
+ ret = drm_dep_job_add_dependency(job, fence);
if (ret)
return ret;
}
@@ -681,8 +681,11 @@ panthor_submit_ctx_add_deps_and_arm_jobs(struct panthor_submit_ctx *ctx)
if (ret)
return ret;
- drm_sched_job_arm(ctx->jobs[i].job);
+ drm_dep_job_arm(ctx->jobs[i].job);
+ /*
+ * XXX: Failing path hazard... per DRM dep this is not allowed
+ */
ret = panthor_submit_ctx_update_job_sync_signal_fences(ctx, i);
if (ret)
return ret;
@@ -699,11 +702,11 @@ panthor_submit_ctx_add_deps_and_arm_jobs(struct panthor_submit_ctx *ctx)
*/
static void
panthor_submit_ctx_push_jobs(struct panthor_submit_ctx *ctx,
- void (*upd_resvs)(struct drm_exec *, struct drm_sched_job *))
+ void (*upd_resvs)(struct drm_exec *, struct drm_dep_job *))
{
for (u32 i = 0; i < ctx->job_count; i++) {
upd_resvs(&ctx->exec, ctx->jobs[i].job);
- drm_sched_entity_push_job(ctx->jobs[i].job);
+ drm_dep_job_push(ctx->jobs[i].job);
/* Job is owned by the scheduler now. */
ctx->jobs[i].job = NULL;
@@ -743,7 +746,7 @@ static int panthor_submit_ctx_init(struct panthor_submit_ctx *ctx,
* @job_put: Job put callback.
*/
static void panthor_submit_ctx_cleanup(struct panthor_submit_ctx *ctx,
- void (*job_put)(struct drm_sched_job *))
+ void (*job_put)(struct drm_dep_job *))
{
struct panthor_sync_signal *sig_sync, *tmp;
unsigned long i;
@@ -1004,7 +1007,7 @@ static int panthor_ioctl_group_submit(struct drm_device *ddev, void *data,
/* Create jobs and attach sync operations */
for (u32 i = 0; i < args->queue_submits.count; i++) {
const struct drm_panthor_queue_submit *qsubmit = &jobs_args[i];
- struct drm_sched_job *job;
+ struct drm_dep_job *job;
job = panthor_job_create(pfile, args->group_handle, qsubmit,
file->client_id);
@@ -1032,11 +1035,11 @@ static int panthor_ioctl_group_submit(struct drm_device *ddev, void *data,
* dependency registration.
*
* This is solving two problems:
- * 1. drm_sched_job_arm() and drm_sched_entity_push_job() must be
+ * 1. drm_dep_job_arm() and drm_dep_job_push() must be
* protected by a lock to make sure no concurrent access to the same
- * entity get interleaved, which would mess up with the fence seqno
+ * queue gets interleaved, which would mess up the fence seqno
* ordering. Luckily, one of the resv being acquired is the VM resv,
- * and a scheduling entity is only bound to a single VM. As soon as
+ * and a dep queue is only bound to a single VM. As soon as
* we acquire the VM resv, we should be safe.
* 2. Jobs might depend on fences that were issued by previous jobs in
* the same batch, so we can't add dependencies on all jobs before
@@ -1232,7 +1235,7 @@ static int panthor_ioctl_vm_bind_async(struct drm_device *ddev,
for (u32 i = 0; i < args->ops.count; i++) {
struct drm_panthor_vm_bind_op *op = &jobs_args[i];
- struct drm_sched_job *job;
+ struct drm_dep_job *job;
job = panthor_vm_bind_job_create(file, vm, op);
if (IS_ERR(job)) {
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index f8c41e36afa4..45e5f0d71594 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -8,7 +8,7 @@
#include <drm/drm_gpuvm.h>
#include <drm/drm_managed.h>
#include <drm/drm_print.h>
-#include <drm/gpu_scheduler.h>
+#include <drm/drm_dep.h>
#include <drm/panthor_drm.h>
#include <linux/atomic.h>
@@ -232,19 +232,9 @@ struct panthor_vm {
struct drm_gpuvm base;
/**
- * @sched: Scheduler used for asynchronous VM_BIND request.
- *
- * We use a 1:1 scheduler here.
- */
- struct drm_gpu_scheduler sched;
-
- /**
- * @entity: Scheduling entity representing the VM_BIND queue.
- *
- * There's currently one bind queue per VM. It doesn't make sense to
- * allow more given the VM operations are serialized anyway.
+ * @q: Dep queue used for asynchronous VM_BIND request.
*/
- struct drm_sched_entity entity;
+ struct drm_dep_queue *q;
/** @ptdev: Device. */
struct panthor_device *ptdev;
@@ -262,7 +252,7 @@ struct panthor_vm {
* @op_lock: Lock used to serialize operations on a VM.
*
* The serialization of jobs queued to the VM_BIND queue is already
- * taken care of by drm_sched, but we need to serialize synchronous
+ * taken care of by drm_dep, but we need to serialize synchronous
* and asynchronous VM_BIND request. This is what this lock is for.
*/
struct mutex op_lock;
@@ -390,11 +380,8 @@ struct panthor_vm {
* struct panthor_vm_bind_job - VM bind job
*/
struct panthor_vm_bind_job {
- /** @base: Inherit from drm_sched_job. */
- struct drm_sched_job base;
-
- /** @refcount: Reference count. */
- struct kref refcount;
+ /** @base: Inherit from drm_dep_job. */
+ struct drm_dep_job base;
/** @cleanup_op_ctx_work: Work used to cleanup the VM operation context. */
struct work_struct cleanup_op_ctx_work;
@@ -821,12 +808,12 @@ u32 panthor_vm_page_size(struct panthor_vm *vm)
static void panthor_vm_stop(struct panthor_vm *vm)
{
- drm_sched_stop(&vm->sched, NULL);
+ drm_dep_queue_stop(vm->q);
}
static void panthor_vm_start(struct panthor_vm *vm)
{
- drm_sched_start(&vm->sched, 0);
+ drm_dep_queue_start(vm->q);
}
/**
@@ -1882,17 +1869,17 @@ static void panthor_vm_free(struct drm_gpuvm *gpuvm)
mutex_lock(&ptdev->mmu->vm.lock);
list_del(&vm->node);
- /* Restore the scheduler state so we can call drm_sched_entity_destroy()
- * and drm_sched_fini(). If get there, that means we have no job left
- * and no new jobs can be queued, so we can start the scheduler without
+ /* Restore the queue state so we can call drm_dep_queue_put().
+ * If we get here, that means we have no jobs left
+ * and no new jobs can be queued, so we can start the queue without
* risking interfering with the reset.
*/
if (ptdev->mmu->vm.reset_in_progress)
panthor_vm_start(vm);
mutex_unlock(&ptdev->mmu->vm.lock);
- drm_sched_entity_destroy(&vm->entity);
- drm_sched_fini(&vm->sched);
+ drm_dep_queue_kill(vm->q);
+ drm_dep_queue_put(vm->q);
mutex_lock(&vm->op_lock);
mutex_lock(&ptdev->mmu->as.slots_lock);
@@ -2319,14 +2306,14 @@ panthor_vm_exec_op(struct panthor_vm *vm, struct panthor_vm_op_ctx *op,
}
static struct dma_fence *
-panthor_vm_bind_run_job(struct drm_sched_job *sched_job)
+panthor_vm_bind_run_job(struct drm_dep_job *dep_job)
{
- struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+ struct panthor_vm_bind_job *job = container_of(dep_job, struct panthor_vm_bind_job, base);
bool cookie;
int ret;
- /* Not only we report an error whose result is propagated to the
- * drm_sched finished fence, but we also flag the VM as unusable, because
+ /* Not only do we report an error whose result is propagated to the
+ * drm_dep finished fence, but we also flag the VM as unusable, because
* a failure in the async VM_BIND results in an inconsistent state. VM needs
* to be destroyed and recreated.
*/
@@ -2337,38 +2324,24 @@ panthor_vm_bind_run_job(struct drm_sched_job *sched_job)
return ret ? ERR_PTR(ret) : NULL;
}
-static void panthor_vm_bind_job_release(struct kref *kref)
+static void panthor_vm_bind_job_cleanup(struct panthor_vm_bind_job *job)
{
- struct panthor_vm_bind_job *job = container_of(kref, struct panthor_vm_bind_job, refcount);
-
- if (job->base.s_fence)
- drm_sched_job_cleanup(&job->base);
-
panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
panthor_vm_put(job->vm);
kfree(job);
}
-/**
- * panthor_vm_bind_job_put() - Release a VM_BIND job reference
- * @sched_job: Job to release the reference on.
- */
-void panthor_vm_bind_job_put(struct drm_sched_job *sched_job)
+static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
{
struct panthor_vm_bind_job *job =
- container_of(sched_job, struct panthor_vm_bind_job, base);
+ container_of(work, struct panthor_vm_bind_job, cleanup_op_ctx_work);
- if (sched_job)
- kref_put(&job->refcount, panthor_vm_bind_job_release);
+ panthor_vm_bind_job_cleanup(job);
}
-static void
-panthor_vm_bind_free_job(struct drm_sched_job *sched_job)
+static void panthor_vm_bind_job_release(struct drm_dep_job *dep_job)
{
- struct panthor_vm_bind_job *job =
- container_of(sched_job, struct panthor_vm_bind_job, base);
-
- drm_sched_job_cleanup(sched_job);
+ struct panthor_vm_bind_job *job = container_of(dep_job, struct panthor_vm_bind_job, base);
/* Do the heavy cleanups asynchronously, so we're out of the
* dma-signaling path and can acquire dma-resv locks safely.
@@ -2376,16 +2349,29 @@ panthor_vm_bind_free_job(struct drm_sched_job *sched_job)
queue_work(panthor_cleanup_wq, &job->cleanup_op_ctx_work);
}
-static enum drm_gpu_sched_stat
-panthor_vm_bind_timedout_job(struct drm_sched_job *sched_job)
+static const struct drm_dep_job_ops panthor_vm_bind_job_ops = {
+ .release = panthor_vm_bind_job_release,
+};
+
+/**
+ * panthor_vm_bind_job_put() - Release a VM_BIND job reference
+ * @dep_job: Job to release the reference on.
+ */
+void panthor_vm_bind_job_put(struct drm_dep_job *dep_job)
+{
+ if (dep_job)
+ drm_dep_job_put(dep_job);
+}
+
+static enum drm_dep_timedout_stat
+panthor_vm_bind_timedout_job(struct drm_dep_job *dep_job)
{
WARN(1, "VM_BIND ops are synchronous for now, there should be no timeout!");
- return DRM_GPU_SCHED_STAT_RESET;
+ return DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED;
}
-static const struct drm_sched_backend_ops panthor_vm_bind_ops = {
+static const struct drm_dep_queue_ops panthor_vm_bind_ops = {
.run_job = panthor_vm_bind_run_job,
- .free_job = panthor_vm_bind_free_job,
.timedout_job = panthor_vm_bind_timedout_job,
};
@@ -2409,16 +2395,16 @@ panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
u32 pa_bits = GPU_MMU_FEATURES_PA_BITS(ptdev->gpu_info.mmu_features);
u64 full_va_range = 1ull << va_bits;
struct drm_gem_object *dummy_gem;
- struct drm_gpu_scheduler *sched;
- const struct drm_sched_init_args sched_args = {
+ const struct drm_dep_queue_init_args q_args = {
.ops = &panthor_vm_bind_ops,
.submit_wq = ptdev->mmu->vm.wq,
- .num_rqs = 1,
.credit_limit = 1,
/* Bind operations are synchronous for now, no timeout needed. */
.timeout = MAX_SCHEDULE_TIMEOUT,
.name = "panthor-vm-bind",
- .dev = ptdev->base.dev,
+ .drm = &ptdev->base,
+ .flags = DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE |
+ DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
};
struct io_pgtable_cfg pgtbl_cfg;
u64 mair, min_va, va_range;
@@ -2477,14 +2463,17 @@ panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
goto err_mm_takedown;
}
- ret = drm_sched_init(&vm->sched, &sched_args);
- if (ret)
+ vm->q = kzalloc_obj(*vm->q);
+ if (!vm->q) {
+ ret = -ENOMEM;
goto err_free_io_pgtable;
+ }
- sched = &vm->sched;
- ret = drm_sched_entity_init(&vm->entity, 0, &sched, 1, NULL);
- if (ret)
- goto err_sched_fini;
+ ret = drm_dep_queue_init(vm->q, &q_args);
+ if (ret) {
+ kfree(vm->q);
+ goto err_free_io_pgtable;
+ }
mair = io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg.arm_lpae_s1_cfg.mair;
vm->memattr = mair_to_memattr(mair, ptdev->coherent);
@@ -2492,7 +2481,7 @@ panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
mutex_lock(&ptdev->mmu->vm.lock);
list_add_tail(&vm->node, &ptdev->mmu->vm.list);
- /* If a reset is in progress, stop the scheduler. */
+ /* If a reset is in progress, stop the queue. */
if (ptdev->mmu->vm.reset_in_progress)
panthor_vm_stop(vm);
mutex_unlock(&ptdev->mmu->vm.lock);
@@ -2507,9 +2496,6 @@ panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
drm_gem_object_put(dummy_gem);
return vm;
-err_sched_fini:
- drm_sched_fini(&vm->sched);
-
err_free_io_pgtable:
free_io_pgtable_ops(vm->pgtbl_ops);
@@ -2578,14 +2564,6 @@ panthor_vm_bind_prepare_op_ctx(struct drm_file *file,
}
}
-static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
-{
- struct panthor_vm_bind_job *job =
- container_of(work, struct panthor_vm_bind_job, cleanup_op_ctx_work);
-
- panthor_vm_bind_job_put(&job->base);
-}
-
/**
* panthor_vm_bind_job_create() - Create a VM_BIND job
* @file: File.
@@ -2594,7 +2572,7 @@ static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
*
* Return: A valid pointer on success, an ERR_PTR() otherwise.
*/
-struct drm_sched_job *
+struct drm_dep_job *
panthor_vm_bind_job_create(struct drm_file *file,
struct panthor_vm *vm,
const struct drm_panthor_vm_bind_op *op)
@@ -2619,17 +2597,21 @@ panthor_vm_bind_job_create(struct drm_file *file,
}
INIT_WORK(&job->cleanup_op_ctx_work, panthor_vm_bind_job_cleanup_op_ctx_work);
- kref_init(&job->refcount);
job->vm = panthor_vm_get(vm);
- ret = drm_sched_job_init(&job->base, &vm->entity, 1, vm, file->client_id);
+ ret = drm_dep_job_init(&job->base,
+ &(struct drm_dep_job_init_args){
+ .ops = &panthor_vm_bind_job_ops,
+ .q = vm->q,
+ .credits = 1,
+ });
if (ret)
- goto err_put_job;
+ goto err_cleanup;
return &job->base;
-err_put_job:
- panthor_vm_bind_job_put(&job->base);
+err_cleanup:
+ panthor_vm_bind_job_cleanup(job);
return ERR_PTR(ret);
}
@@ -2645,9 +2627,9 @@ panthor_vm_bind_job_create(struct drm_file *file,
* Return: 0 on success, a negative error code otherwise.
*/
int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
- struct drm_sched_job *sched_job)
+ struct drm_dep_job *dep_job)
{
- struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+ struct panthor_vm_bind_job *job = container_of(dep_job, struct panthor_vm_bind_job, base);
int ret;
- /* Acquire the VM lock an reserve a slot for this VM bind job. */
+ /* Acquire the VM lock and reserve a slot for this VM bind job. */
@@ -2671,13 +2653,13 @@ int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
- * @sched_job: Job to update the resvs on.
+ * @dep_job: Job to update the resvs on.
*/
void panthor_vm_bind_job_update_resvs(struct drm_exec *exec,
- struct drm_sched_job *sched_job)
+ struct drm_dep_job *dep_job)
{
- struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+ struct panthor_vm_bind_job *job = container_of(dep_job, struct panthor_vm_bind_job, base);
/* Explicit sync => we just register our job finished fence as bookkeep. */
drm_gpuvm_resv_add_fence(&job->vm->base, exec,
- &sched_job->s_fence->finished,
+ drm_dep_job_finished_fence(dep_job),
DMA_RESV_USAGE_BOOKKEEP,
DMA_RESV_USAGE_BOOKKEEP);
}
@@ -2873,7 +2855,9 @@ int panthor_mmu_init(struct panthor_device *ptdev)
if (ret)
return ret;
- mmu->vm.wq = alloc_workqueue("panthor-vm-bind", WQ_UNBOUND, 0);
+ mmu->vm.wq = alloc_workqueue("panthor-vm-bind", WQ_MEM_RECLAIM |
+ WQ_MEM_WARN_ON_RECLAIM |
+ WQ_UNBOUND, 0);
if (!mmu->vm.wq)
return -ENOMEM;
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.h b/drivers/gpu/drm/panthor/panthor_mmu.h
index 0e268fdfdb2f..845f45ce7739 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.h
+++ b/drivers/gpu/drm/panthor/panthor_mmu.h
@@ -8,7 +8,7 @@
#include <linux/dma-resv.h>
struct drm_exec;
-struct drm_sched_job;
+struct drm_dep_job;
struct drm_memory_stats;
struct panthor_gem_object;
struct panthor_heap_pool;
@@ -50,9 +50,9 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec,
struct panthor_vm *vm,
u32 slot_count);
int panthor_vm_add_bos_resvs_deps_to_job(struct panthor_vm *vm,
- struct drm_sched_job *job);
+ struct drm_dep_job *job);
void panthor_vm_add_job_fence_to_bos_resvs(struct panthor_vm *vm,
- struct drm_sched_job *job);
+ struct drm_dep_job *job);
struct dma_resv *panthor_vm_resv(struct panthor_vm *vm);
struct drm_gem_object *panthor_vm_root_gem(struct panthor_vm *vm);
@@ -82,14 +82,14 @@ int panthor_vm_bind_exec_sync_op(struct drm_file *file,
struct panthor_vm *vm,
struct drm_panthor_vm_bind_op *op);
-struct drm_sched_job *
+struct drm_dep_job *
panthor_vm_bind_job_create(struct drm_file *file,
struct panthor_vm *vm,
const struct drm_panthor_vm_bind_op *op);
-void panthor_vm_bind_job_put(struct drm_sched_job *job);
+void panthor_vm_bind_job_put(struct drm_dep_job *job);
int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
- struct drm_sched_job *job);
-void panthor_vm_bind_job_update_resvs(struct drm_exec *exec, struct drm_sched_job *job);
+ struct drm_dep_job *job);
+void panthor_vm_bind_job_update_resvs(struct drm_exec *exec, struct drm_dep_job *job);
void panthor_vm_update_resvs(struct panthor_vm *vm, struct drm_exec *exec,
struct dma_fence *fence,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 2fe04d0f0e3a..040bea0688c3 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -6,7 +6,7 @@
#include <drm/drm_gem_shmem_helper.h>
#include <drm/drm_managed.h>
#include <drm/drm_print.h>
-#include <drm/gpu_scheduler.h>
+#include <drm/drm_dep.h>
#include <drm/panthor_drm.h>
#include <linux/build_bug.h>
@@ -61,13 +61,11 @@
* always gets consistent results (cache maintenance,
* synchronization, ...).
*
- * We rely on the drm_gpu_scheduler framework to deal with job
- * dependencies and submission. As any other driver dealing with a
- * FW-scheduler, we use the 1:1 entity:scheduler mode, such that each
- * entity has its own job scheduler. When a job is ready to be executed
- * (all its dependencies are met), it is pushed to the appropriate
- * queue ring-buffer, and the group is scheduled for execution if it
- * wasn't already active.
+ * We rely on the drm_dep framework to deal with job
+ * dependencies and submission. Each queue owns its own dep queue. When a job
+ * is ready to be executed (all its dependencies are met), it is pushed to the
+ * appropriate queue ring-buffer, and the group is scheduled for execution if
+ * it wasn't already active.
*
* Kernel-side group scheduling is timeslice-based. When we have less
* groups than there are slots, the periodic tick is disabled and we
@@ -83,7 +81,7 @@
* if userspace was in charge of the ring-buffer. That's also one of the
* reason we don't do 'cooperative' scheduling (encoding FW group slot
* reservation as dma_fence that would be returned from the
- * drm_gpu_scheduler::prepare_job() hook, and treating group rotation as
+ * drm_dep_queue_ops::prepare_job() hook, and treating group rotation as
* a queue of waiters, ordered by job submission order). This approach
* would work for kernel-mode queues, but would make user-mode queues a
* lot more complicated to retrofit.
@@ -147,11 +145,11 @@ struct panthor_scheduler {
/**
* @wq: Workqueue used by our internal scheduler logic and
- * drm_gpu_scheduler.
+ * drm_dep queues.
*
* Used for the scheduler tick, group update or other kind of FW
* event processing that can't be handled in the threaded interrupt
- * path. Also passed to the drm_gpu_scheduler instances embedded
+ * path. Also passed to the drm_dep_queue instances embedded
* in panthor_queue.
*/
struct workqueue_struct *wq;
@@ -347,13 +345,10 @@ struct panthor_syncobj_64b {
* struct panthor_queue - Execution queue
*/
struct panthor_queue {
- /** @scheduler: DRM scheduler used for this queue. */
- struct drm_gpu_scheduler scheduler;
+ /** @q: drm_dep queue used for this queue. */
+ struct drm_dep_queue q;
- /** @entity: DRM scheduling entity used for this queue. */
- struct drm_sched_entity entity;
-
- /** @name: DRM scheduler name for this queue. */
+ /** @name: dep queue name for this queue. */
char *name;
/** @timeout: Queue timeout related fields. */
@@ -461,7 +456,7 @@ struct panthor_queue {
*
* We return this fence when we get an empty command stream.
* This way, we are guaranteed that all earlier jobs have completed
- * when drm_sched_job::s_fence::finished without having to feed
+ * when the drm_dep finished fence is signaled without having to feed
* the CS ring buffer with a dummy job that only signals the fence.
*/
struct dma_fence *last_fence;
@@ -599,7 +594,7 @@ struct panthor_group {
* @timedout: True when a timeout occurred on any of the queues owned by
* this group.
*
- * Timeouts can be reported by drm_sched or by the FW. If a reset is required,
+ * Timeouts can be reported by drm_dep or by the FW. If a reset is required,
* and the group can't be suspended, this also leads to a timeout. In any case,
* any timeout situation is unrecoverable, and the group becomes useless. We
* simply wait for all references to be dropped so we can release the group
@@ -791,11 +786,8 @@ struct panthor_group_pool {
* struct panthor_job - Used to manage GPU job
*/
struct panthor_job {
- /** @base: Inherit from drm_sched_job. */
- struct drm_sched_job base;
-
- /** @refcount: Reference count. */
- struct kref refcount;
+ /** @base: Inherit from drm_dep_job. */
+ struct drm_dep_job base;
/** @group: Group of the queue this job will be pushed to. */
struct panthor_group *group;
@@ -915,27 +907,8 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
if (IS_ERR_OR_NULL(queue))
return;
- /* Disable the timeout before tearing down drm_sched components. */
- disable_delayed_work_sync(&queue->timeout.work);
-
- if (queue->entity.fence_context)
- drm_sched_entity_destroy(&queue->entity);
-
- if (queue->scheduler.ops)
- drm_sched_fini(&queue->scheduler);
-
- kfree(queue->name);
-
- panthor_queue_put_syncwait_obj(queue);
-
- panthor_kernel_bo_destroy(queue->ringbuf);
- panthor_kernel_bo_destroy(queue->iface.mem);
- panthor_kernel_bo_destroy(queue->profiling.slots);
-
- /* Release the last_fence we were holding, if any. */
- dma_fence_put(queue->fence_ctx.last_fence);
-
- kfree(queue);
+ if (queue->q.ops)
+ drm_dep_queue_put(&queue->q);
}
static void group_release_work(struct work_struct *work)
@@ -1098,7 +1071,7 @@ queue_reset_timeout_locked(struct panthor_queue *queue)
lockdep_assert_held(&queue->fence_ctx.lock);
if (!queue_timeout_is_suspended(queue)) {
- mod_delayed_work(queue->scheduler.timeout_wq,
+ mod_delayed_work(drm_dep_queue_timeout_wq(&queue->q),
&queue->timeout.work,
msecs_to_jiffies(JOB_TIMEOUT_MS));
}
@@ -1162,7 +1135,7 @@ queue_resume_timeout(struct panthor_queue *queue)
spin_lock(&queue->fence_ctx.lock);
if (queue_timeout_is_suspended(queue)) {
- mod_delayed_work(queue->scheduler.timeout_wq,
+ mod_delayed_work(drm_dep_queue_timeout_wq(&queue->q),
&queue->timeout.work,
queue->timeout.remaining);
@@ -2726,19 +2699,13 @@ static void queue_stop(struct panthor_queue *queue,
struct panthor_job *bad_job)
{
disable_delayed_work_sync(&queue->timeout.work);
- drm_sched_stop(&queue->scheduler, bad_job ? &bad_job->base : NULL);
+ drm_dep_queue_stop(&queue->q);
}
static void queue_start(struct panthor_queue *queue)
{
- struct panthor_job *job;
-
- /* Re-assign the parent fences. */
- list_for_each_entry(job, &queue->scheduler.pending_list, base.list)
- job->base.s_fence->parent = dma_fence_get(job->done_fence);
-
enable_delayed_work(&queue->timeout.work);
- drm_sched_start(&queue->scheduler, 0);
+ drm_dep_queue_start(&queue->q);
}
static void panthor_group_stop(struct panthor_group *group)
@@ -3293,9 +3260,9 @@ static u32 calc_job_credits(u32 profile_mask)
}
static struct dma_fence *
-queue_run_job(struct drm_sched_job *sched_job)
+queue_run_job(struct drm_dep_job *dep_job)
{
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+ struct panthor_job *job = container_of(dep_job, struct panthor_job, base);
struct panthor_group *group = job->group;
struct panthor_queue *queue = group->queues[job->queue_idx];
struct panthor_device *ptdev = group->ptdev;
@@ -3306,8 +3273,7 @@ queue_run_job(struct drm_sched_job *sched_job)
int ret;
/* Stream size is zero, nothing to do except making sure all previously
- * submitted jobs are done before we signal the
- * drm_sched_job::s_fence::finished fence.
+ * submitted jobs are done before we signal the drm_dep finished fence.
*/
if (!job->call_info.size) {
job->done_fence = dma_fence_get(queue->fence_ctx.last_fence);
@@ -3394,10 +3360,10 @@ queue_run_job(struct drm_sched_job *sched_job)
return done_fence;
}
-static enum drm_gpu_sched_stat
-queue_timedout_job(struct drm_sched_job *sched_job)
+static enum drm_dep_timedout_stat
+queue_timedout_job(struct drm_dep_job *dep_job)
{
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+ struct panthor_job *job = container_of(dep_job, struct panthor_job, base);
struct panthor_group *group = job->group;
struct panthor_device *ptdev = group->ptdev;
struct panthor_scheduler *sched = ptdev->scheduler;
@@ -3411,34 +3377,58 @@ queue_timedout_job(struct drm_sched_job *sched_job)
queue_stop(queue, job);
mutex_lock(&sched->lock);
- group->timedout = true;
- if (group->csg_id >= 0) {
- sched_queue_delayed_work(ptdev->scheduler, tick, 0);
- } else {
- /* Remove from the run queues, so the scheduler can't
- * pick the group on the next tick.
- */
- list_del_init(&group->run_node);
- list_del_init(&group->wait_node);
+ if (!group->timedout) {
+ group->timedout = true;
+ if (group->csg_id >= 0) {
+ sched_queue_delayed_work(ptdev->scheduler, tick, 0);
+ } else {
+ /* Remove from the run queues, so the scheduler can't
+ * pick the group on the next tick.
+ */
+ list_del_init(&group->run_node);
+ list_del_init(&group->wait_node);
- group_queue_work(group, term);
+ group_queue_work(group, term);
+ }
}
mutex_unlock(&sched->lock);
queue_start(queue);
- return DRM_GPU_SCHED_STAT_RESET;
+ if (drm_dep_job_is_finished(dep_job))
+ return DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED;
+ else
+ return DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB;
}
-static void queue_free_job(struct drm_sched_job *sched_job)
+static void job_release(struct drm_dep_job *dep_job);
+
+static const struct drm_dep_job_ops panthor_job_ops = {
+ .release = job_release,
+};
+
+static void panthor_queue_release(struct drm_dep_queue *q)
{
- drm_sched_job_cleanup(sched_job);
- panthor_job_put(sched_job);
+ struct panthor_queue *queue = container_of(q, typeof(*queue), q);
+
+ kfree(queue->name);
+
+ panthor_queue_put_syncwait_obj(queue);
+
+ panthor_kernel_bo_destroy(queue->ringbuf);
+ panthor_kernel_bo_destroy(queue->iface.mem);
+ panthor_kernel_bo_destroy(queue->profiling.slots);
+
+ /* Release the last_fence we were holding, if any. */
+ dma_fence_put(queue->fence_ctx.last_fence);
+
+ drm_dep_queue_release(q);
+ kfree_rcu(queue, q.rcu);
}
-static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
+static const struct drm_dep_queue_ops panthor_queue_ops = {
.run_job = queue_run_job,
.timedout_job = queue_timedout_job,
- .free_job = queue_free_job,
+ .release = panthor_queue_release,
};
static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
@@ -3476,7 +3466,7 @@ static void queue_timeout_work(struct work_struct *work)
progress = queue_check_job_completion(queue);
if (!progress)
- drm_sched_fault(&queue->scheduler);
+ drm_dep_queue_trigger_timeout(&queue->q);
}
static struct panthor_queue *
@@ -3484,10 +3474,9 @@ group_create_queue(struct panthor_group *group,
const struct drm_panthor_queue_create *args,
u64 drm_client_id, u32 gid, u32 qid)
{
- struct drm_sched_init_args sched_args = {
- .ops = &panthor_queue_sched_ops,
+ struct drm_dep_queue_init_args q_args = {
+ .ops = &panthor_queue_ops,
.submit_wq = group->ptdev->scheduler->wq,
- .num_rqs = 1,
/*
* The credit limit argument tells us the total number of
* instructions across all CS slots in the ringbuffer, with
@@ -3497,9 +3486,10 @@ group_create_queue(struct panthor_group *group,
.credit_limit = args->ringbuf_size / sizeof(u64),
.timeout = MAX_SCHEDULE_TIMEOUT,
.timeout_wq = group->ptdev->reset.wq,
- .dev = group->ptdev->base.dev,
+ .drm = &group->ptdev->base,
+ .flags = DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE |
+ DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
};
- struct drm_gpu_scheduler *drm_sched;
struct panthor_queue *queue;
int ret;
@@ -3580,14 +3570,9 @@ group_create_queue(struct panthor_group *group,
goto err_free_queue;
}
- sched_args.name = queue->name;
-
- ret = drm_sched_init(&queue->scheduler, &sched_args);
- if (ret)
- goto err_free_queue;
+ q_args.name = queue->name;
- drm_sched = &queue->scheduler;
- ret = drm_sched_entity_init(&queue->entity, 0, &drm_sched, 1, NULL);
+ ret = drm_dep_queue_init(&queue->q, &q_args);
if (ret)
goto err_free_queue;
@@ -3907,15 +3892,8 @@ panthor_fdinfo_gather_group_mem_info(struct panthor_file *pfile,
xa_unlock(&gpool->xa);
}
-static void job_release(struct kref *ref)
+static void job_cleanup(struct panthor_job *job)
{
- struct panthor_job *job = container_of(ref, struct panthor_job, refcount);
-
- drm_WARN_ON(&job->group->ptdev->base, !list_empty(&job->node));
-
- if (job->base.s_fence)
- drm_sched_job_cleanup(&job->base);
-
if (dma_fence_was_initialized(job->done_fence))
dma_fence_put(job->done_fence);
else
@@ -3926,33 +3904,36 @@ static void job_release(struct kref *ref)
kfree(job);
}
-struct drm_sched_job *panthor_job_get(struct drm_sched_job *sched_job)
+static void job_release(struct drm_dep_job *dep_job)
{
- if (sched_job) {
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
-
- kref_get(&job->refcount);
- }
+ struct panthor_job *job = container_of(dep_job, struct panthor_job, base);
- return sched_job;
+ drm_WARN_ON(&job->group->ptdev->base, !list_empty(&job->node));
+ job_cleanup(job);
}
-void panthor_job_put(struct drm_sched_job *sched_job)
+struct drm_dep_job *panthor_job_get(struct drm_dep_job *dep_job)
{
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+ if (dep_job)
+ drm_dep_job_get(dep_job);
+
+ return dep_job;
+}
- if (sched_job)
- kref_put(&job->refcount, job_release);
+void panthor_job_put(struct drm_dep_job *dep_job)
+{
+ if (dep_job)
+ drm_dep_job_put(dep_job);
}
-struct panthor_vm *panthor_job_vm(struct drm_sched_job *sched_job)
+struct panthor_vm *panthor_job_vm(struct drm_dep_job *dep_job)
{
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+ struct panthor_job *job = container_of(dep_job, struct panthor_job, base);
return job->group->vm;
}
-struct drm_sched_job *
+struct drm_dep_job *
panthor_job_create(struct panthor_file *pfile,
u16 group_handle,
const struct drm_panthor_queue_submit *qsubmit,
@@ -3984,7 +3965,6 @@ panthor_job_create(struct panthor_file *pfile,
if (!job)
return ERR_PTR(-ENOMEM);
- kref_init(&job->refcount);
job->queue_idx = qsubmit->queue_index;
job->call_info.size = qsubmit->stream_size;
job->call_info.start = qsubmit->stream_addr;
@@ -3994,18 +3974,18 @@ panthor_job_create(struct panthor_file *pfile,
job->group = group_from_handle(gpool, group_handle);
if (!job->group) {
ret = -EINVAL;
- goto err_put_job;
+ goto err_cleanup_job;
}
if (!group_can_run(job->group)) {
ret = -EINVAL;
- goto err_put_job;
+ goto err_cleanup_job;
}
if (job->queue_idx >= job->group->queue_count ||
!job->group->queues[job->queue_idx]) {
ret = -EINVAL;
- goto err_put_job;
+ goto err_cleanup_job;
}
/* Empty command streams don't need a fence, they'll pick the one from
@@ -4015,7 +3995,7 @@ panthor_job_create(struct panthor_file *pfile,
job->done_fence = kzalloc_obj(*job->done_fence);
if (!job->done_fence) {
ret = -ENOMEM;
- goto err_put_job;
+ goto err_cleanup_job;
}
}
@@ -4023,27 +4003,30 @@ panthor_job_create(struct panthor_file *pfile,
credits = calc_job_credits(job->profiling.mask);
if (credits == 0) {
ret = -EINVAL;
- goto err_put_job;
+ goto err_cleanup_job;
}
- ret = drm_sched_job_init(&job->base,
- &job->group->queues[job->queue_idx]->entity,
- credits, job->group, drm_client_id);
+ ret = drm_dep_job_init(&job->base,
+ &(struct drm_dep_job_init_args){
+ .ops = &panthor_job_ops,
+ .q = &job->group->queues[job->queue_idx]->q,
+ .credits = credits,
+ });
if (ret)
- goto err_put_job;
+ goto err_cleanup_job;
return &job->base;
-err_put_job:
- panthor_job_put(&job->base);
+err_cleanup_job:
+ job_cleanup(job);
return ERR_PTR(ret);
}
-void panthor_job_update_resvs(struct drm_exec *exec, struct drm_sched_job *sched_job)
+void panthor_job_update_resvs(struct drm_exec *exec, struct drm_dep_job *dep_job)
{
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+ struct panthor_job *job = container_of(dep_job, struct panthor_job, base);
- panthor_vm_update_resvs(job->group->vm, exec, &sched_job->s_fence->finished,
+ panthor_vm_update_resvs(job->group->vm, exec, drm_dep_job_finished_fence(dep_job),
DMA_RESV_USAGE_BOOKKEEP, DMA_RESV_USAGE_BOOKKEEP);
}
@@ -4171,7 +4154,8 @@ int panthor_sched_init(struct panthor_device *ptdev)
* system is running out of memory.
*/
sched->heap_alloc_wq = alloc_workqueue("panthor-heap-alloc", WQ_UNBOUND, 0);
- sched->wq = alloc_workqueue("panthor-csf-sched", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+ sched->wq = alloc_workqueue("panthor-csf-sched", WQ_MEM_RECLAIM |
+ WQ_MEM_WARN_ON_RECLAIM | WQ_UNBOUND, 0);
if (!sched->wq || !sched->heap_alloc_wq) {
panthor_sched_fini(&ptdev->base, sched);
drm_err(&ptdev->base, "Failed to allocate the workqueues");
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
index 9a8692de8ade..a7b8e2851f4b 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.h
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -8,7 +8,7 @@ struct drm_exec;
struct dma_fence;
struct drm_file;
struct drm_gem_object;
-struct drm_sched_job;
+struct drm_dep_job;
struct drm_memory_stats;
struct drm_panthor_group_create;
struct drm_panthor_queue_create;
@@ -27,15 +27,15 @@ int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle);
int panthor_group_get_state(struct panthor_file *pfile,
struct drm_panthor_group_get_state *get_state);
-struct drm_sched_job *
+struct drm_dep_job *
panthor_job_create(struct panthor_file *pfile,
u16 group_handle,
const struct drm_panthor_queue_submit *qsubmit,
u64 drm_client_id);
-struct drm_sched_job *panthor_job_get(struct drm_sched_job *job);
-struct panthor_vm *panthor_job_vm(struct drm_sched_job *sched_job);
-void panthor_job_put(struct drm_sched_job *job);
-void panthor_job_update_resvs(struct drm_exec *exec, struct drm_sched_job *job);
+struct drm_dep_job *panthor_job_get(struct drm_dep_job *job);
+struct panthor_vm *panthor_job_vm(struct drm_dep_job *dep_job);
+void panthor_job_put(struct drm_dep_job *job);
+void panthor_job_update_resvs(struct drm_exec *exec, struct drm_dep_job *job);
int panthor_group_pool_create(struct panthor_file *pfile);
void panthor_group_pool_destroy(struct panthor_file *pfile);
--
2.34.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
@ 2026-03-16 9:16 ` Boris Brezillon
2026-03-17 5:22 ` Matthew Brost
2026-03-16 10:25 ` Danilo Krummrich
` (4 subsequent siblings)
5 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-16 9:16 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
Hi Matthew,
On Sun, 15 Mar 2026 21:32:45 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> Diverging requirements between GPU drivers using firmware scheduling
> and those using hardware scheduling have shown that drm_gpu_scheduler is
> no longer sufficient for firmware-scheduled GPU drivers. The technical
> debt, lack of memory-safety guarantees, absence of clear object-lifetime
> rules, and numerous driver-specific hacks have rendered
> drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> firmware-scheduled GPU drivers—one that addresses all of the
> aforementioned shortcomings.
>
> Add drm_dep, a lightweight GPU submission queue intended as a
> replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> from the queue (drm_sched_entity) into two objects requiring external
> coordination, drm_dep merges both roles into a single struct
> drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> that is unnecessary for firmware schedulers which manage their own
> run-lists internally.
>
> Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> management by the driver, drm_dep uses reference counting (kref) on both
> queues and jobs to guarantee object lifetime safety. A job holds a queue
> reference from init until its last put, and the queue holds a job reference
> from dispatch until the put_job worker runs. This makes use-after-free
> impossible even when completion arrives from IRQ context or concurrent
> teardown is in flight.
>
> The core objects are:
>
> struct drm_dep_queue - a per-context submission queue owning an
> ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> queue, and a pending-job list. Reference counted; drivers can embed
> it and provide a .release vfunc for RCU-safe teardown.
First of, I like this idea, and actually think we should have done that
from the start rather than trying to bend drm_sched to meet our
FW-assisted scheduling model. That's also the direction me and Danilo
have been pushing for for the new JobQueue stuff in rust, so I'm glad
to see some consensus here.
Now, let's start with the usual naming nitpick :D => can't we find a
better prefix than "drm_dep"? I think I get where "dep" comes from (the
logic mostly takes care of job deps, and acts as a FIFO otherwise, no
real scheduling involved). It's kinda okay for drm_dep_queue, even
though, according to the description you've made, jobs seem to stay in
that queue even after their deps are met, which, IMHO, is a bit
confusing: dep_queue sounds like a queue in which jobs are placed until
their deps are met, and then the job moves to some other queue.
It gets worse for drm_dep_job, which sounds like a dep-only job, rather
than a job that's queued to the drm_dep_queue. Same goes for
drm_dep_fence, which I find super confusing. What this one does is just
proxy the driver fence to provide proper isolation between GPU drivers
and fence observers (other drivers).
Since this new model is primarily designed for hardware that have
FW-assisted scheduling, how about drm_fw_queue, drm_fw_job,
drm_fw_job_fence?
>
> struct drm_dep_job - a single unit of GPU work. Drivers embed this
> and provide a .release vfunc. Jobs carry an xarray of input
> dma_fence dependencies and produce a drm_dep_fence as their
> finished fence.
>
> struct drm_dep_fence - a dma_fence subclass wrapping an optional
> parent hardware fence. The finished fence is armed (sequence
> number assigned) before submission and signals when the hardware
> fence signals (or immediately on synchronous completion).
>
> Job lifecycle:
> 1. drm_dep_job_init() - allocate and initialise; job acquires a
> queue reference.
> 2. drm_dep_job_add_dependency() and friends - register input fences;
> duplicates from the same context are deduplicated.
> 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> 4. drm_dep_job_push() - submit to queue.
>
> Submission paths under queue lock:
> - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> SPSC queue is empty, no dependencies are pending, and credits are
> available, the job is dispatched inline on the calling thread.
I've yet to look at the code, but I must admit I'm less worried about
this fast path if it's part of a new model restricted to FW-assisted
scheduling. I keep thinking we're not entirely covered for so called
real-time GPU contexts that might have jobs that are not dep-free, and
if we're going for something new, I'd really like us to consider that
case from the start (maybe investigate if kthread_work[er] can be used
as a replacement for workqueues, if RT priority on workqueues is not an
option).
> - Queued path: job is pushed onto the SPSC queue and the run_job
> worker is kicked. The worker resolves remaining dependencies
> (installing wakeup callbacks for unresolved fences) before calling
> ops->run_job().
>
> Credit-based throttling prevents hardware overflow: each job declares
> a credit cost at init time; dispatch is deferred until sufficient
> credits are available.
>
> Timeout Detection and Recovery (TDR): a per-queue delayed work item
> fires when the head pending job exceeds q->job.timeout jiffies, calling
> ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> expiry for device teardown.
>
> IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> allow drm_dep_job_done() to be called from hardirq context (e.g. a
> dma_fence callback). Dependency cleanup is deferred to process context
> after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
>
> Zombie-state guard: workers use kref_get_unless_zero() on entry and
> bail immediately if the queue refcount has already reached zero and
> async teardown is in flight, preventing use-after-free.
>
> Teardown is always deferred to a module-private workqueue (dep_free_wq)
> so that destroy_workqueue() is never called from within one of the
> queue's own workers. Each queue holds a drm_dev_get() reference on its
> owning struct drm_device, released as the final step of teardown via
> drm_dev_put(). This prevents the driver module from being unloaded
> while any queue is still alive without requiring a separate drain API.
Thanks for posting this RFC. I'll try to have a closer look at the code
in the coming days, but given the diffstat, it might take me a bit of
time...
Regards,
Boris
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
2026-03-16 9:16 ` Boris Brezillon
@ 2026-03-16 10:25 ` Danilo Krummrich
2026-03-17 5:10 ` Matthew Brost
2026-03-17 2:47 ` Daniel Almeida
` (3 subsequent siblings)
5 siblings, 1 reply; 50+ messages in thread
From: Danilo Krummrich @ 2026-03-16 10:25 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon Mar 16, 2026 at 5:32 AM CET, Matthew Brost wrote:
> Diverging requirements between GPU drivers using firmware scheduling
> and those using hardware scheduling have shown that drm_gpu_scheduler is
> no longer sufficient for firmware-scheduled GPU drivers. The technical
> debt, lack of memory-safety guarantees, absence of clear object-lifetime
> rules, and numerous driver-specific hacks have rendered
> drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> firmware-scheduled GPU drivers—one that addresses all of the
> aforementioned shortcomings.
I think we all agree on this and I also think we all agree that this should have
been a separate component in the first place -- and just to be clear, I am
saying this in retrospective.
In fact, this is also the reason why I proposed building the Rust component
differently, i.e. start with a Joqueue (or drm_dep as called in this patch) and
expand as needed with a loosely coupled "orchestrator" for drivers with strictly
limited software/hardware queues later.
The reason I proposed a new component for Rust, is basically what you also wrote
in your cover letter, plus the fact that it prevents us having to build a Rust
abstraction layer to the DRM GPU scheduler.
The latter I identified as pretty questionable as building another abstraction
layer on top of some infrastructure is really something that you only want to do
when it is mature enough in terms of lifetime and ownership model.
I'm not saying it wouldn't be possible, but as mentioned in other threads, I
don't think it is a good idea building new features on top of something that has
known problems, even less when they are barely resolvable due to other existing
dependencies, such as some drivers relying on implementation details
historically, etc.
My point is, the justification for a new Jobqueue component in Rust I consider
given by the fact that it allows us to avoid building another abstraction layer
on top of DRM sched. Additionally, DRM moves to Rust and gathering experience
with building native Rust components seems like a good synergy in this context.
Having that said, the obvious question for me for this series is how drm_dep
fits into the bigger picture.
I.e. what is the maintainance strategy?
Do we want to support three components allowing users to do the same thing? What
happens to DRM sched for 1:1 entity / scheduler relationships?
Is it worth? Do we have enough C users to justify the maintainance of yet
another component? (Again, DRM moves into the direction of Rust drivers, so I
don't know how many new C drivers we will see.) I.e. having this component won't
get us rid of the majority of DRM sched users.
What are the expected improvements? Given the above, I'm not sure it will
actually decrease the maintainance burdon of DRM sched.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
2026-03-16 9:16 ` Boris Brezillon
2026-03-16 10:25 ` Danilo Krummrich
@ 2026-03-17 2:47 ` Daniel Almeida
2026-03-17 5:45 ` Matthew Brost
2026-03-17 12:31 ` Danilo Krummrich
2026-03-17 8:47 ` Christian König
` (2 subsequent siblings)
5 siblings, 2 replies; 50+ messages in thread
From: Daniel Almeida @ 2026-03-17 2:47 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
Danilo Krummrich, David Airlie, Maarten Lankhorst, Maxime Ripard,
Philipp Stanner, Simona Vetter, Sumit Semwal, Thomas Zimmermann,
linux-kernel, Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl,
Daniel Stone, Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
(+cc a few other people + Rust-for-Linux ML)
Hi Matthew,
I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
is going, it is much more ergonomic to add a Rust component with a nice C
interface than doing it the other way around.
> On 16 Mar 2026, at 01:32, Matthew Brost <matthew.brost@intel.com> wrote:
>
> Diverging requirements between GPU drivers using firmware scheduling
> and those using hardware scheduling have shown that drm_gpu_scheduler is
> no longer sufficient for firmware-scheduled GPU drivers. The technical
> debt, lack of memory-safety guarantees, absence of clear object-lifetime
> rules, and numerous driver-specific hacks have rendered
> drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> firmware-scheduled GPU drivers—one that addresses all of the
> aforementioned shortcomings.
>
> Add drm_dep, a lightweight GPU submission queue intended as a
> replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> from the queue (drm_sched_entity) into two objects requiring external
> coordination, drm_dep merges both roles into a single struct
> drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> that is unnecessary for firmware schedulers which manage their own
> run-lists internally.
>
> Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> management by the driver, drm_dep uses reference counting (kref) on both
> queues and jobs to guarantee object lifetime safety. A job holds a queue
In a domain that has been plagued by lifetime issues, we really should be
enforcing RAII for resource management instead of manual calls.
> reference from init until its last put, and the queue holds a job reference
> from dispatch until the put_job worker runs. This makes use-after-free
> impossible even when completion arrives from IRQ context or concurrent
> teardown is in flight.
It makes use-after-free impossible _if_ you’re careful. It is not a
property of the type system, and incorrect code will compile just fine.
>
> The core objects are:
>
> struct drm_dep_queue - a per-context submission queue owning an
> ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> queue, and a pending-job list. Reference counted; drivers can embed
> it and provide a .release vfunc for RCU-safe teardown.
>
> struct drm_dep_job - a single unit of GPU work. Drivers embed this
> and provide a .release vfunc. Jobs carry an xarray of input
> dma_fence dependencies and produce a drm_dep_fence as their
> finished fence.
>
> struct drm_dep_fence - a dma_fence subclass wrapping an optional
> parent hardware fence. The finished fence is armed (sequence
> number assigned) before submission and signals when the hardware
> fence signals (or immediately on synchronous completion).
>
> Job lifecycle:
> 1. drm_dep_job_init() - allocate and initialise; job acquires a
> queue reference.
> 2. drm_dep_job_add_dependency() and friends - register input fences;
> duplicates from the same context are deduplicated.
> 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> 4. drm_dep_job_push() - submit to queue.
You cannot enforce this sequence easily in C code. Once again, we are trusting
drivers that it is followed, but in Rust, you can simply reject code that does
not follow this order at compile time.
>
> Submission paths under queue lock:
> - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> SPSC queue is empty, no dependencies are pending, and credits are
> available, the job is dispatched inline on the calling thread.
> - Queued path: job is pushed onto the SPSC queue and the run_job
> worker is kicked. The worker resolves remaining dependencies
> (installing wakeup callbacks for unresolved fences) before calling
> ops->run_job().
>
> Credit-based throttling prevents hardware overflow: each job declares
> a credit cost at init time; dispatch is deferred until sufficient
> credits are available.
Why can’t we design an API where the driver can refuse jobs in
ops->run_job() if there are no resources to run it? This would do away with the
credit system that has been in place for quite a while. Has this approach been
tried in the past?
>
> Timeout Detection and Recovery (TDR): a per-queue delayed work item
> fires when the head pending job exceeds q->job.timeout jiffies, calling
> ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> expiry for device teardown.
>
> IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> allow drm_dep_job_done() to be called from hardirq context (e.g. a
> dma_fence callback). Dependency cleanup is deferred to process context
> after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
>
> Zombie-state guard: workers use kref_get_unless_zero() on entry and
> bail immediately if the queue refcount has already reached zero and
> async teardown is in flight, preventing use-after-free.
In rust, when you queue work, you have to pass a reference-counted pointer
(Arc<T>). We simply never have this problem in a Rust design. If there is work
queued, the queue is alive.
By the way, why can’t we simply require synchronous teardowns?
>
> Teardown is always deferred to a module-private workqueue (dep_free_wq)
> so that destroy_workqueue() is never called from within one of the
> queue's own workers. Each queue holds a drm_dev_get() reference on its
> owning struct drm_device, released as the final step of teardown via
> drm_dev_put(). This prevents the driver module from being unloaded
> while any queue is still alive without requiring a separate drain API.
>
> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Philipp Stanner <phasta@kernel.org>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> ---
> drivers/gpu/drm/Kconfig | 4 +
> drivers/gpu/drm/Makefile | 1 +
> drivers/gpu/drm/dep/Makefile | 5 +
> drivers/gpu/drm/dep/drm_dep_fence.c | 406 +++++++
> drivers/gpu/drm/dep/drm_dep_fence.h | 25 +
> drivers/gpu/drm/dep/drm_dep_job.c | 675 +++++++++++
> drivers/gpu/drm/dep/drm_dep_job.h | 13 +
> drivers/gpu/drm/dep/drm_dep_queue.c | 1647 +++++++++++++++++++++++++++
> drivers/gpu/drm/dep/drm_dep_queue.h | 31 +
> include/drm/drm_dep.h | 597 ++++++++++
> 10 files changed, 3404 insertions(+)
> create mode 100644 drivers/gpu/drm/dep/Makefile
> create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.h
> create mode 100644 drivers/gpu/drm/dep/drm_dep_job.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_job.h
> create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.h
> create mode 100644 include/drm/drm_dep.h
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 5386248e75b6..834f6e210551 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -276,6 +276,10 @@ config DRM_SCHED
> tristate
> depends on DRM
>
> +config DRM_DEP
> + tristate
> + depends on DRM
> +
> # Separate option as not all DRM drivers use it
> config DRM_PANEL_BACKLIGHT_QUIRKS
> tristate
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index e97faabcd783..1ad87cc0e545 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -173,6 +173,7 @@ obj-y += clients/
> obj-y += display/
> obj-$(CONFIG_DRM_TTM) += ttm/
> obj-$(CONFIG_DRM_SCHED) += scheduler/
> +obj-$(CONFIG_DRM_DEP) += dep/
> obj-$(CONFIG_DRM_RADEON)+= radeon/
> obj-$(CONFIG_DRM_AMDGPU)+= amd/amdgpu/
> obj-$(CONFIG_DRM_AMDGPU)+= amd/amdxcp/
> diff --git a/drivers/gpu/drm/dep/Makefile b/drivers/gpu/drm/dep/Makefile
> new file mode 100644
> index 000000000000..335f1af46a7b
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +drm_dep-y := drm_dep_queue.o drm_dep_job.o drm_dep_fence.o
> +
> +obj-$(CONFIG_DRM_DEP) += drm_dep.o
> diff --git a/drivers/gpu/drm/dep/drm_dep_fence.c b/drivers/gpu/drm/dep/drm_dep_fence.c
> new file mode 100644
> index 000000000000..ae05b9077772
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_fence.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency fence
> + *
> + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> + * provides a single dma_fence (@finished) signalled when the hardware
> + * completes the job.
> + *
> + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> + * is signalled once @parent signals (or immediately if run_job() returns
> + * NULL or an error).
I thought this fence proxy mechanism was going away due to recent work being
carried out by Christian?
> + *
> + * Drivers should expose @finished as the out-fence for GPU work since it is
> + * valid from the moment drm_dep_job_arm() returns, whereas the hardware fence
> + * could be a compound fence, which is disallowed when installed into
> + * drm_syncobjs or dma-resv.
> + *
> + * The fence uses the kernel's inline spinlock (NULL passed to dma_fence_init())
> + * so no separate lock allocation is required.
> + *
> + * Deadline propagation is supported: if a consumer sets a deadline via
> + * dma_fence_set_deadline(), it is forwarded to @parent when @parent is set.
> + * If @parent has not been set yet the deadline is stored in @deadline and
> + * forwarded at that point.
> + *
> + * Memory management: drm_dep_fence objects are allocated with kzalloc() and
> + * freed via kfree_rcu() once the fence is released, ensuring safety with
> + * RCU-protected fence accesses.
> + */
> +
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include "drm_dep_fence.h"
> +
> +/**
> + * DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT - a fence deadline hint has been set
> + *
> + * Set by the deadline callback on the finished fence to indicate a deadline
> + * has been set which may need to be propagated to the parent hardware fence.
> + */
> +#define DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT (DMA_FENCE_FLAG_USER_BITS + 1)
> +
> +/**
> + * struct drm_dep_fence - fence tracking the completion of a dep job
> + *
> + * Contains a single dma_fence (@finished) that is signalled when the
> + * hardware completes the job. The fence uses the kernel's inline_lock
> + * (no external spinlock required).
> + *
> + * This struct is private to the drm_dep module; external code interacts
> + * through the accessor functions declared in drm_dep_fence.h.
> + */
> +struct drm_dep_fence {
> + /**
> + * @finished: signalled when the job completes on hardware.
> + *
> + * Drivers should use this fence as the out-fence for a job since it
> + * is available immediately upon drm_dep_job_arm().
> + */
> + struct dma_fence finished;
> +
> + /**
> + * @deadline: deadline set on @finished which potentially needs to be
> + * propagated to @parent.
> + */
> + ktime_t deadline;
> +
> + /**
> + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> + *
> + * @finished is signaled once @parent is signaled. The initial store is
> + * performed via smp_store_release to synchronize with deadline handling.
> + *
> + * All readers must access this under the fence lock and take a reference to
> + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> + * signals, and this drop also releases its internal reference.
> + */
> + struct dma_fence *parent;
> +
> + /**
> + * @q: the queue this fence belongs to.
> + */
> + struct drm_dep_queue *q;
> +};
> +
> +static const struct dma_fence_ops drm_dep_fence_ops;
> +
> +/**
> + * to_drm_dep_fence() - cast a dma_fence to its enclosing drm_dep_fence
> + * @f: dma_fence to cast
> + *
> + * Context: No context requirements (inline helper).
> + * Return: pointer to the enclosing &drm_dep_fence.
> + */
> +static struct drm_dep_fence *to_drm_dep_fence(struct dma_fence *f)
> +{
> + return container_of(f, struct drm_dep_fence, finished);
> +}
> +
> +/**
> + * drm_dep_fence_set_parent() - store the hardware fence and propagate
> + * any deadline
> + * @dfence: dep fence
> + * @parent: hardware fence returned by &drm_dep_queue_ops.run_job, or NULL/error
> + *
> + * Stores @parent on @dfence under smp_store_release() so that a concurrent
> + * drm_dep_fence_set_deadline() call sees the parent before checking the
> + * deadline bit. If a deadline has already been set on @dfence->finished it is
> + * forwarded to @parent immediately. Does nothing if @parent is NULL or an
> + * error pointer.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> + struct dma_fence *parent)
> +{
> + if (IS_ERR_OR_NULL(parent))
> + return;
> +
> + /*
> + * smp_store_release() to ensure a thread racing us in
> + * drm_dep_fence_set_deadline() sees the parent set before
> + * it calls test_bit(HAS_DEADLINE_BIT).
> + */
> + smp_store_release(&dfence->parent, dma_fence_get(parent));
> + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT,
> + &dfence->finished.flags))
> + dma_fence_set_deadline(parent, dfence->deadline);
> +}
> +
> +/**
> + * drm_dep_fence_finished() - signal the finished fence with a result
> + * @dfence: dep fence to signal
> + * @result: error code to set, or 0 for success
> + *
> + * Sets the fence error to @result if non-zero, then signals
> + * @dfence->finished. Also removes parent visibility under the fence lock
> + * and drops the parent reference. Dropping the parent here allows the
> + * DRM dep fence to be completely decoupled from the DRM dep module.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_fence_finished(struct drm_dep_fence *dfence, int result)
> +{
> + struct dma_fence *parent;
> + unsigned long flags;
> +
> + dma_fence_lock_irqsave(&dfence->finished, flags);
> + if (result)
> + dma_fence_set_error(&dfence->finished, result);
> + dma_fence_signal_locked(&dfence->finished);
> + parent = dfence->parent;
> + dfence->parent = NULL;
> + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> +
> + dma_fence_put(parent);
> +}
We should really try to move away from manual locks and unlocks.
> +
> +static const char *drm_dep_fence_get_driver_name(struct dma_fence *fence)
> +{
> + return "drm_dep";
> +}
> +
> +static const char *drm_dep_fence_get_timeline_name(struct dma_fence *f)
> +{
> + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> +
> + return dfence->q->name;
> +}
> +
> +/**
> + * drm_dep_fence_get_parent() - get a reference to the parent hardware fence
> + * @dfence: dep fence to query
> + *
> + * Returns a new reference to @dfence->parent, or NULL if the parent has
> + * already been cleared (i.e. @dfence->finished has signalled and the parent
> + * reference was dropped under the fence lock).
> + *
> + * Uses smp_load_acquire() to pair with the smp_store_release() in
> + * drm_dep_fence_set_parent(), ensuring that if we race a concurrent
> + * drm_dep_fence_set_parent() call we observe the parent pointer only after
> + * the store is fully visible — before set_parent() tests
> + * %DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT.
> + *
> + * Caller must hold the fence lock on @dfence->finished.
> + *
> + * Context: Any context, fence lock on @dfence->finished must be held.
> + * Return: a new reference to the parent fence, or NULL.
> + */
> +static struct dma_fence *drm_dep_fence_get_parent(struct drm_dep_fence *dfence)
> +{
> + dma_fence_assert_held(&dfence->finished);
> +
> + return dma_fence_get(smp_load_acquire(&dfence->parent));
> +}
> +
> +/**
> + * drm_dep_fence_set_deadline() - dma_fence_ops deadline callback
> + * @f: fence on which the deadline is being set
> + * @deadline: the deadline hint to apply
> + *
> + * Stores the earliest deadline under the fence lock, then propagates
> + * it to the parent hardware fence via smp_load_acquire() to race
> + * safely with drm_dep_fence_set_parent().
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_fence_set_deadline(struct dma_fence *f, ktime_t deadline)
> +{
> + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> + struct dma_fence *parent;
> + unsigned long flags;
> +
> + dma_fence_lock_irqsave(f, flags);
> +
> + /* If we already have an earlier deadline, keep it: */
> + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
> + ktime_before(dfence->deadline, deadline)) {
> + dma_fence_unlock_irqrestore(f, flags);
> + return;
> + }
> +
> + dfence->deadline = deadline;
> + set_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags);
> +
> + parent = drm_dep_fence_get_parent(dfence);
> + dma_fence_unlock_irqrestore(f, flags);
> +
> + if (parent)
> + dma_fence_set_deadline(parent, deadline);
> +
> + dma_fence_put(parent);
> +}
> +
> +static const struct dma_fence_ops drm_dep_fence_ops = {
> + .get_driver_name = drm_dep_fence_get_driver_name,
> + .get_timeline_name = drm_dep_fence_get_timeline_name,
> + .set_deadline = drm_dep_fence_set_deadline,
> +};
> +
> +/**
> + * drm_dep_fence_alloc() - allocate a dep fence
> + *
> + * Allocates a &drm_dep_fence with kzalloc() without initialising the
> + * dma_fence. Call drm_dep_fence_init() to fully initialise it.
> + *
> + * Context: Process context.
> + * Return: new &drm_dep_fence on success, NULL on allocation failure.
> + */
> +struct drm_dep_fence *drm_dep_fence_alloc(void)
> +{
> + return kzalloc_obj(struct drm_dep_fence);
> +}
> +
> +/**
> + * drm_dep_fence_init() - initialise the dma_fence inside a dep fence
> + * @dfence: dep fence to initialise
> + * @q: queue the owning job belongs to
> + *
> + * Initialises @dfence->finished using the context and sequence number from @q.
> + * Passes NULL as the lock so the fence uses its inline spinlock.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q)
> +{
> + u32 seq = ++q->fence.seqno;
> +
> + /*
> + * XXX: Inline fence hazard: currently all expected users of DRM dep
> + * hardware fences have a unique lockdep class. If that ever changes,
> + * we will need to assign a unique lockdep class here so lockdep knows
> + * this fence is allowed to nest with driver hardware fences.
> + */
> +
> + dfence->q = q;
> + dma_fence_init(&dfence->finished, &drm_dep_fence_ops,
> + NULL, q->fence.context, seq);
> +}
> +
> +/**
> + * drm_dep_fence_cleanup() - release a dep fence at job teardown
> + * @dfence: dep fence to clean up
> + *
> + * Called from drm_dep_job_fini(). If the dep fence was armed (refcount > 0)
> + * it is released via dma_fence_put() and will be freed by the RCU release
> + * callback once all waiters have dropped their references. If it was never
> + * armed it is freed directly with kfree().
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence)
> +{
> + if (drm_dep_fence_is_armed(dfence))
> + dma_fence_put(&dfence->finished);
> + else
> + kfree(dfence);
> +}
> +
> +/**
> + * drm_dep_fence_is_armed() - check whether the fence has been armed
> + * @dfence: dep fence to check
> + *
> + * Returns true if drm_dep_job_arm() has been called, i.e. @dfence->finished
> + * has been initialised and its reference count is non-zero. Used by
> + * assertions to enforce correct job lifecycle ordering (arm before push,
> + * add_dependency before arm).
> + *
> + * Context: Any context.
> + * Return: true if the fence is armed, false otherwise.
> + */
> +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence)
> +{
> + return !!kref_read(&dfence->finished.refcount);
> +}
> +
> +/**
> + * drm_dep_fence_is_finished() - test whether the finished fence has signalled
> + * @dfence: dep fence to check
> + *
> + * Uses dma_fence_test_signaled_flag() to read %DMA_FENCE_FLAG_SIGNALED_BIT
> + * directly without invoking the fence's ->signaled() callback or triggering
> + * any signalling side-effects.
> + *
> + * Context: Any context.
> + * Return: true if @dfence->finished has been signalled, false otherwise.
> + */
> +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence)
> +{
> + return dma_fence_test_signaled_flag(&dfence->finished);
> +}
> +
> +/**
> + * drm_dep_fence_is_complete() - test whether the job has completed
> + * @dfence: dep fence to check
> + *
> + * Takes the fence lock on @dfence->finished and calls
> + * drm_dep_fence_get_parent() to safely obtain a reference to the parent
> + * hardware fence — or NULL if the parent has already been cleared after
> + * signalling. Calls dma_fence_is_signaled() on @parent outside the lock,
> + * which may invoke the fence's ->signaled() callback and trigger signalling
> + * side-effects if the fence has completed but the signalled flag has not yet
> + * been set. The finished fence is tested via dma_fence_test_signaled_flag(),
> + * without side-effects.
> + *
> + * May only be called on a stopped queue (see drm_dep_queue_is_stopped()).
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if the job is complete, false otherwise.
> + */
> +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence)
> +{
> + struct dma_fence *parent;
> + unsigned long flags;
> + bool complete;
> +
> + dma_fence_lock_irqsave(&dfence->finished, flags);
> + parent = drm_dep_fence_get_parent(dfence);
> + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> +
> + complete = (parent && dma_fence_is_signaled(parent)) ||
> + dma_fence_test_signaled_flag(&dfence->finished);
> +
> + dma_fence_put(parent);
> +
> + return complete;
> +}
> +
> +/**
> + * drm_dep_fence_to_dma() - return the finished dma_fence for a dep fence
> + * @dfence: dep fence to query
> + *
> + * No reference is taken; the caller must hold its own reference to the owning
> + * &drm_dep_job for the duration of the access.
> + *
> + * Context: Any context.
> + * Return: the finished &dma_fence.
> + */
> +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence)
> +{
> + return &dfence->finished;
> +}
> +
> +/**
> + * drm_dep_fence_done() - signal the finished fence on job completion
> + * @dfence: dep fence to signal
> + * @result: job error code, or 0 on success
> + *
> + * Gets a temporary reference to @dfence->finished to guard against a racing
> + * last-put, signals the fence with @result, then drops the temporary
> + * reference. Called from drm_dep_job_done() in the queue core when a
> + * hardware completion callback fires or when run_job() returns immediately.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result)
> +{
> + dma_fence_get(&dfence->finished);
> + drm_dep_fence_finished(dfence, result);
> + dma_fence_put(&dfence->finished);
> +}
Proper refcounting is automated (and enforced) in Rust.
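To make that concrete, a minimal userspace sketch (illustrative names only, not the actual kernel Rust abstractions): the release callback hangs off Drop, so it fires exactly once when the last reference goes away and cannot be skipped or double-run:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for drm_dep_fence; not real bindings.
struct DepFence {
    released: Arc<AtomicBool>,
}

impl Drop for DepFence {
    fn drop(&mut self) {
        // The "release callback" runs automatically when the last
        // reference is dropped; there is no cleanup call to forget.
        self.released.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let fence = Arc::new(DepFence { released: flag.clone() });
    let extra = fence.clone(); // kref_get() equivalent
    drop(fence);               // first put: object still alive
    assert!(!flag.load(Ordering::SeqCst));
    drop(extra);               // last put: Drop fires
    assert!(flag.load(Ordering::SeqCst));
}
```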
> diff --git a/drivers/gpu/drm/dep/drm_dep_fence.h b/drivers/gpu/drm/dep/drm_dep_fence.h
> new file mode 100644
> index 000000000000..65a1582f858b
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_fence.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_FENCE_H_
> +#define _DRM_DEP_FENCE_H_
> +
> +#include <linux/dma-fence.h>
> +
> +struct drm_dep_fence;
> +struct drm_dep_queue;
> +
> +struct drm_dep_fence *drm_dep_fence_alloc(void);
> +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q);
> +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence);
> +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> + struct dma_fence *parent);
> +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result);
> +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence);
> +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence);
> +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence);
> +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence);
> +
> +#endif /* _DRM_DEP_FENCE_H_ */
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> new file mode 100644
> index 000000000000..2d012b29a5fc
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency job
> + *
> + * A struct drm_dep_job represents a single unit of GPU work associated with
> + * a struct drm_dep_queue. The lifecycle of a job is:
> + *
> + * 1. **Allocation**: the driver allocates memory for the job (typically by
> + * embedding struct drm_dep_job in a larger structure) and calls
> + * drm_dep_job_init() to initialise it. On success the job holds one
> + * kref reference and a reference to its queue.
> + *
> + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> + * that must be signalled before the job can run. Duplicate fences from the
> + * same fence context are deduplicated automatically.
> + *
> + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> + * consuming a sequence number from the queue. After arming,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> + * queue takes a reference that it holds until the job's finished fence
> + * signals and the job is freed by the put_job worker.
> + *
> + * 5. **Completion**: when the job's hardware work finishes its finished fence
> + * is signalled and drm_dep_job_put() is called by the queue. The driver
> + * must release any driver-private resources in &drm_dep_job_ops.release.
> + *
> + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> + * objects before the driver's release callback is invoked.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
> +#include <drm/drm_syncobj.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/**
> + * drm_dep_job_init() - initialise a dep job
> + * @job: dep job to initialise
> + * @args: initialisation arguments
> + *
> + * Initialises @job with the queue, ops and credit count from @args. Acquires
> + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> + * the lifetime of the job and released by drm_dep_job_release() when the last
> + * job reference is dropped.
> + *
> + * Resources are released automatically when the last reference is dropped
> + * via drm_dep_job_put(), which must be called to release the job; drivers
> + * must not free the job directly.
Again, can’t enforce that in C.
> + *
> + * Context: Process context. Allocates memory with GFP_KERNEL.
> + * Return: 0 on success, -%EINVAL if credits is 0,
> + * -%ENOMEM on fence allocation failure.
> + */
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args)
> +{
> + if (unlikely(!args->credits)) {
> + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> + return -EINVAL;
> + }
> +
> + memset(job, 0, sizeof(*job));
> +
> + job->dfence = drm_dep_fence_alloc();
> + if (!job->dfence)
> + return -ENOMEM;
> +
> + job->ops = args->ops;
> + job->q = drm_dep_queue_get(args->q);
> + job->credits = args->credits;
> +
> + kref_init(&job->refcount);
> + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> + INIT_LIST_HEAD(&job->pending_link);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_init);
> +
> +/**
> + * drm_dep_job_drop_dependencies() - release all input dependency fences
> + * @job: dep job whose dependency xarray to drain
> + *
> + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> + * i.e. slots that were pre-allocated but never replaced — are silently
> + * skipped; the sentinel carries no reference. Called from
> + * drm_dep_queue_run_job() in process context immediately after
> + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> + * dependencies here — while still in process context — avoids calling
> + * xa_destroy() from IRQ context if the job's last reference is later
> + * dropped from a dma_fence callback.
> + *
> + * Context: Process context.
> + */
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + unsigned long index;
> +
> + xa_for_each(&job->dependencies, index, fence) {
> + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> + continue;
> + dma_fence_put(fence);
> + }
> + xa_destroy(&job->dependencies);
> +}
This is automated in Rust. You also can’t “forget” to call this.
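Roughly what I mean, as a self-contained sketch (`Fence` and `Job` are illustrative, not real bindings): tying the dependency drain to the job's destructor makes forgetting it impossible:

```rust
use std::cell::Cell;
use std::rc::Rc;

// Illustrative stand-ins, not the real abstractions.
struct Fence {
    drops: Rc<Cell<usize>>, // counts dma_fence_put() equivalents
}

impl Drop for Fence {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

struct Job {
    dependencies: Vec<Fence>, // drained automatically when Job drops
}

fn main() {
    let drops = Rc::new(Cell::new(0));
    let job = Job {
        dependencies: (0..3).map(|_| Fence { drops: drops.clone() }).collect(),
    };
    drop(job); // no drop_dependencies() call that could be forgotten
    assert_eq!(drops.get(), 3);
}
```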
> +
> +/**
> + * drm_dep_job_fini() - clean up a dep job
> + * @job: dep job to clean up
> + *
> + * Cleans up the dep fence and drops the queue reference held by @job.
> + *
> + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> + * the dependency xarray is also released here. For armed jobs the xarray
> + * has already been drained by drm_dep_job_drop_dependencies() in process
> + * context immediately after run_job(), so it is left untouched to avoid
> + * calling xa_destroy() from IRQ context.
> + *
> + * Warns if @job is still linked on the queue's pending list, which would
> + * indicate a bug in the teardown ordering.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_fini(struct drm_dep_job *job)
> +{
> + bool armed = drm_dep_fence_is_armed(job->dfence);
> +
> + WARN_ON(!list_empty(&job->pending_link));
> +
> + drm_dep_fence_cleanup(job->dfence);
> + job->dfence = NULL;
> +
> + /*
> + * Armed jobs have their dependencies drained by
> + * drm_dep_job_drop_dependencies() in process context after run_job().
> + * Skip here to avoid calling xa_destroy() from IRQ context.
> + */
> + if (!armed)
> + drm_dep_job_drop_dependencies(job);
> +}
Same here.
> +
> +/**
> + * drm_dep_job_get() - acquire a reference to a dep job
> + * @job: dep job to acquire a reference on, or NULL
> + *
> + * Context: Any context.
> + * Return: @job with an additional reference held, or NULL if @job is NULL.
> + */
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_get(&job->refcount);
> + return job;
> +}
> +EXPORT_SYMBOL(drm_dep_job_get);
> +
Same here.
> +/**
> + * drm_dep_job_release() - kref release callback for a dep job
> + * @kref: kref embedded in the dep job
> + *
> + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Finally, releases the queue reference
> + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> + * queue put is performed last to ensure no queue state is accessed after
> + * the job memory is freed.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +static void drm_dep_job_release(struct kref *kref)
> +{
> + struct drm_dep_job *job =
> + container_of(kref, struct drm_dep_job, refcount);
> + struct drm_dep_queue *q = job->q;
> +
> + drm_dep_job_fini(job);
> +
> + if (job->ops && job->ops->release)
> + job->ops->release(job);
> + else
> + kfree(job);
> +
> + drm_dep_queue_put(q);
> +}
Same here.
> +
> +/**
> + * drm_dep_job_put() - release a reference to a dep job
> + * @job: dep job to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +void drm_dep_job_put(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_put(&job->refcount, drm_dep_job_release);
> +}
> +EXPORT_SYMBOL(drm_dep_job_put);
> +
Same here.
> +/**
> + * drm_dep_job_arm() - arm a dep job for submission
> + * @job: dep job to arm
> + *
> + * Initialises the finished fence on @job->dfence, assigning
> + * it a sequence number from the job's queue. Must be called after
> + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> + * After this point, memory allocations that could trigger reclaim are
> + * forbidden; lockdep enforces this. arm() must always be paired with
> + * drm_dep_job_push(); lockdep also enforces this pairing.
> + *
> + * Warns if the job has already been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_arm(struct drm_dep_job *job)
> +{
> + drm_dep_queue_push_job_begin(job->q);
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + drm_dep_fence_init(job->dfence, job->q);
> + job->signalling_cookie = dma_fence_begin_signalling();
> +}
> +EXPORT_SYMBOL(drm_dep_job_arm);
> +
> +/**
> + * drm_dep_job_push() - submit a job to its queue for execution
> + * @job: dep job to push
> + *
> + * Submits @job to the queue it was initialised with. Must be called after
> + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> + * held until the queue is fully done with it. The reference is released
> + * directly in the finished-fence dma_fence callback for queues with
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> + * from hardirq context), or via the put_job work item on the submit
> + * workqueue otherwise.
> + *
> + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> + * enforces the pairing.
> + *
> + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> + * @job exactly once, even if the queue is killed or torn down before the
> + * job reaches the head of the queue. Drivers can use this guarantee to
> + * perform bookkeeping cleanup; the actual backend operation should be
> + * skipped when drm_dep_queue_is_killed() returns true.
> + *
> + * If the queue does not support the bypass path, the job is pushed directly
> + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> + *
> + * Warns if the job has not been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_push(struct drm_dep_job *job)
> +{
> + struct drm_dep_queue *q = job->q;
> +
> + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> +
> + drm_dep_job_get(job);
> +
> + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> + drm_dep_queue_push_job(q, job);
> + dma_fence_end_signalling(job->signalling_cookie);
Signaling is enforced in a more thorough way in Rust. I’ll expand on this later in this patch.
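As a teaser, the arm()/push() pairing can be made a compile-time property with typestates. A sketch with made-up types, not the real design:

```rust
// Typestate sketch: arm() consumes an unarmed job and returns an armed
// one; push() consumes the armed job. Pushing before arming or arming
// twice fails to compile; forgetting to push an armed job triggers a
// must_use warning.
struct UnarmedJob {
    seqno: u64,
}

#[must_use = "an armed job must be pushed"]
struct ArmedJob {
    seqno: u64,
}

fn arm(job: UnarmedJob) -> ArmedJob {
    ArmedJob { seqno: job.seqno + 1 }
}

fn push(job: ArmedJob) -> u64 {
    job.seqno // the queue now owns the job; the caller cannot touch it
}

fn main() {
    let job = UnarmedJob { seqno: 0 };
    let armed = arm(job);
    // arm(job);    // would not compile: `job` was moved
    assert_eq!(push(armed), 1);
    // push(armed); // would not compile: `armed` was moved
}
```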
> + drm_dep_queue_push_job_end(q);
> + return;
> + }
> +
> + scoped_guard(mutex, &q->sched.lock) {
> + if (drm_dep_queue_can_job_bypass(q, job))
> + drm_dep_queue_run_job(q, job);
> + else
> + drm_dep_queue_push_job(q, job);
> + }
> +
> + dma_fence_end_signalling(job->signalling_cookie);
> + drm_dep_queue_push_job_end(q);
> +}
> +EXPORT_SYMBOL(drm_dep_job_push);
> +
> +/**
> + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> + * @job: dep job to add the dependencies to
> + * @fence: the dma_fence to add to the list of dependencies, or
> + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> + *
> + * Note that @fence is consumed in both the success and error cases (except
> + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> + *
> + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> + * fence->context matches the queue's finished fence context) are silently
> + * dropped; the job need not wait on its own queue's output.
> + *
> + * Warns if the job has already been armed (dependencies must be added before
> + * drm_dep_job_arm()).
> + *
> + * **Pre-allocation pattern**
> + *
> + * When multiple jobs across different queues must be prepared and submitted
> + * together in a single atomic commit — for example, where job A's finished
> + * fence is an input dependency of job B — all jobs must be armed and pushed
> + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * region. Once that region has started no memory allocation is permitted.
> + *
> + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> + * the underlying xarray must be tracked by the caller separately (e.g. it is
> + * always index 0 when the dependency array is empty, a property Xe relies on).
> + * After all jobs have been armed and the finished fences are available, call
> + * drm_dep_job_replace_dependency() with that index and the real fence.
> + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> + * called from atomic or signalling context.
> + *
> + * The sentinel slot is never skipped by the signalled-fence fast-path,
> + * ensuring a slot is always allocated even when the real fence is not yet
> + * known.
> + *
> + * **Example: bind job feeding TLB invalidation jobs**
> + *
> + * Consider a GPU with separate queues for page-table bind operations and for
> + * TLB invalidation. A single atomic commit must:
> + *
> + * 1. Run a bind job that modifies page tables.
> + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> + * completing, so stale translations are flushed before the engines
> + * continue.
> + *
> + * Because all jobs must be armed and pushed inside a signalling region (where
> + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> + *
> + * // Phase 1 — process context, GFP_KERNEL allowed
> + * drm_dep_job_init(bind_job, bind_queue, ops);
> + * for_each_mmu(mmu) {
> + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> + * // Pre-allocate slot at index 0; real fence not available yet
> + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> + * }
> + *
> + * // Phase 2 — inside signalling region, no GFP_KERNEL
> + * dma_fence_begin_signalling();
> + * drm_dep_job_arm(bind_job);
> + * for_each_mmu(mmu) {
> + * // Swap sentinel for bind job's finished fence
> + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> + * dma_fence_get(bind_job->finished));
> + * drm_dep_job_arm(tlb_job[mmu]);
> + * }
> + * drm_dep_job_push(bind_job);
> + * for_each_mmu(mmu)
> + * drm_dep_job_push(tlb_job[mmu]);
> + * dma_fence_end_signalling();
> + *
> + * Context: Process context. May allocate memory with GFP_KERNEL.
> + * Return: the newly allocated slot index on success if @fence is
> + * %DRM_DEP_JOB_FENCE_PREALLOC; otherwise 0 on success, or a negative
> + * error code.
> + */
> +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> +{
> + struct drm_dep_queue *q = job->q;
> + struct dma_fence *entry;
> + unsigned long index;
> + u32 id = 0;
> + int ret;
> +
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + might_alloc(GFP_KERNEL);
> +
> + if (!fence)
> + return 0;
> +
> + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> + goto add_fence;
> +
> + /*
> + * Ignore signalled fences or fences from our own queue — finished
> + * fences use q->fence.context.
> + */
> + if (dma_fence_test_signaled_flag(fence) ||
> + fence->context == q->fence.context) {
> + dma_fence_put(fence);
> + return 0;
> + }
> +
> + /*
> + * Deduplicate if we already depend on a fence from the same context.
> + * This lets the size of the array of deps scale with the number of
> + * engines involved, rather than the number of BOs.
> + */
> + xa_for_each(&job->dependencies, index, entry) {
> + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> + entry->context != fence->context)
> + continue;
> +
> + if (dma_fence_is_later(fence, entry)) {
> + dma_fence_put(entry);
> + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> + } else {
> + dma_fence_put(fence);
> + }
> + return 0;
> + }
> +
> +add_fence:
> + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> + GFP_KERNEL);
> + if (ret != 0) {
> + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> + dma_fence_put(fence);
> + return ret;
> + }
> +
> + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> +
> +/**
> + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> + * @job: dep job to update
> + * @index: xarray index of the slot to replace, as returned when the sentinel
> + * was originally inserted via drm_dep_job_add_dependency()
> + * @fence: the real dma_fence to store; its reference is always consumed
> + *
> + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> + * @job->dependencies with @fence. The slot must have been pre-allocated by
> + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> + * existing entry is asserted to be the sentinel.
> + *
> + * This is the second half of the pre-allocation pattern described in
> + * drm_dep_job_add_dependency(). It is intended to be called inside a
> + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> + * internally so it is safe to call from atomic or signalling context, but
> + * since the slot has been pre-allocated no actual memory allocation occurs.
> + *
> + * If @fence is already signalled the slot is erased rather than storing a
> + * redundant dependency. The successful store is asserted — if the store
> + * fails it indicates a programming error (slot index out of range or
> + * concurrent modification).
> + *
> + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
Can’t enforce this in C. Also, how is the fence “consumed”? You can’t enforce
that the caller stops accessing the fence after this function returns, the way
Rust can at compile time.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence)
> +{
> + WARN_ON(xa_load(&job->dependencies, index) !=
> + DRM_DEP_JOB_FENCE_PREALLOC);
> +
> + if (dma_fence_test_signaled_flag(fence)) {
> + xa_erase(&job->dependencies, index);
> + dma_fence_put(fence);
> + return;
> + }
> +
> + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> + GFP_NOWAIT)))) {
> + dma_fence_put(fence);
> + return;
> + }
> +}
> +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> +
> +/**
> + * drm_dep_job_add_syncobj_dependency() - adds a syncobj's fence as a
> + * job dependency
> + * @job: dep job to add the dependencies to
> + * @file: drm file private pointer
> + * @handle: syncobj handle to lookup
> + * @point: timeline point
> + *
> + * This adds the fence matching the given syncobj to @job.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> + struct drm_file *file, u32 handle,
> + u32 point)
> +{
> + struct dma_fence *fence;
> + int ret;
> +
> + ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
> + if (ret)
> + return ret;
> +
> + return drm_dep_job_add_dependency(job, fence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_syncobj_dependency);
> +
> +/**
> + * drm_dep_job_add_resv_dependencies() - add all fences from the resv to the job
> + * @job: dep job to add the dependencies to
> + * @resv: the dma_resv object to get the fences from
> + * @usage: the dma_resv_usage to use to filter the fences
> + *
> + * This adds all fences matching the given usage from @resv to @job.
> + * Must be called with the @resv lock held.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> + struct dma_resv *resv,
> + enum dma_resv_usage usage)
> +{
> + struct dma_resv_iter cursor;
> + struct dma_fence *fence;
> + int ret;
> +
> + dma_resv_assert_held(resv);
> +
> + dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> + /*
> + * As drm_dep_job_add_dependency always consumes the fence
> + * reference (even when it fails), and dma_resv_for_each_fence
> + * is not obtaining one, we need to grab one before calling.
> + */
> + ret = drm_dep_job_add_dependency(job, dma_fence_get(fence));
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_resv_dependencies);
> +
> +/**
> + * drm_dep_job_add_implicit_dependencies() - adds implicit dependencies
> + * as job dependencies
> + * @job: dep job to add the dependencies to
> + * @obj: the gem object to add new dependencies from.
> + * @write: whether the job might write the object (so we need to depend on
> + * shared fences in the reservation object).
> + *
> + * This should be called after drm_gem_lock_reservations() on your array of
> + * GEM objects used in the job but before updating the reservations with your
> + * own fences.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> + struct drm_gem_object *obj,
> + bool write)
> +{
> + return drm_dep_job_add_resv_dependencies(job, obj->resv,
> + dma_resv_usage_rw(write));
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_implicit_dependencies);
> +
> +/**
> + * drm_dep_job_is_signaled() - check whether a dep job has completed
> + * @job: dep job to check
> + *
> + * Determines whether @job has signalled. The queue should be stopped before
> + * calling this to obtain a stable snapshot of state. Both the parent hardware
> + * fence and the finished software fence are checked.
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if the job is signalled, false otherwise.
> + */
> +bool drm_dep_job_is_signaled(struct drm_dep_job *job)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> + return drm_dep_fence_is_complete(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_is_signaled);
> +
> +/**
> + * drm_dep_job_is_finished() - test whether a dep job's finished fence has signalled
> + * @job: dep job to check
> + *
> + * Tests whether the job's software finished fence has been signalled, using
> + * dma_fence_test_signaled_flag() to avoid any signalling side-effects. Unlike
> + * drm_dep_job_is_signaled(), this does not require the queue to be stopped and
> + * does not check the parent hardware fence — it is a lightweight test of the
> + * finished fence only.
> + *
> + * Context: Any context.
> + * Return: true if the job's finished fence has been signalled, false otherwise.
> + */
> +bool drm_dep_job_is_finished(struct drm_dep_job *job)
> +{
> + return drm_dep_fence_is_finished(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_is_finished);
> +
> +/**
> + * drm_dep_job_invalidate_job() - increment the invalidation count for a job
> + * @job: dep job to invalidate
> + * @threshold: threshold above which the job is considered invalidated
> + *
> + * Increments @job->invalidate_count and returns true if it exceeds @threshold,
> + * indicating the job should be considered hung and discarded. The queue must
> + * be stopped before calling this function.
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if @job->invalidate_count exceeds @threshold, false otherwise.
> + */
> +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> + return ++job->invalidate_count > threshold;
> +}
> +EXPORT_SYMBOL(drm_dep_job_invalidate_job);
> +
> +/**
> + * drm_dep_job_finished_fence() - return the finished fence for a job
> + * @job: dep job to query
> + *
> + * No reference is taken on the returned fence; the caller must hold its own
> + * reference to @job for the duration of any access.
Can’t enforce this in C.
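“The caller must hold its own reference” is exactly what a borrow encodes. A sketch with made-up types (not real bindings): the returned fence borrows from the job, so dropping the job while the fence is still in use is a compile error:

```rust
// Illustrative types only.
struct DmaFence {
    seqno: u64,
}

struct Job {
    finished: DmaFence,
}

// The returned reference is tied to the borrow of `job`: the compiler
// guarantees the job outlives every use of the fence.
fn finished_fence(job: &Job) -> &DmaFence {
    &job.finished
}

fn main() {
    let job = Job { finished: DmaFence { seqno: 42 } };
    let f = finished_fence(&job);
    assert_eq!(f.seqno, 42);
    // drop(job); f.seqno; // would not compile: `job` still borrowed
}
```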
> + *
> + * Context: Any context.
> + * Return: the finished &dma_fence for @job.
> + */
> +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job)
> +{
> + return drm_dep_fence_to_dma(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_finished_fence);
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.h b/drivers/gpu/drm/dep/drm_dep_job.h
> new file mode 100644
> index 000000000000..35c61d258fa1
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_JOB_H_
> +#define _DRM_DEP_JOB_H_
> +
> +struct drm_dep_queue;
> +
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job);
> +
> +#endif /* _DRM_DEP_JOB_H_ */
> diff --git a/drivers/gpu/drm/dep/drm_dep_queue.c b/drivers/gpu/drm/dep/drm_dep_queue.c
> new file mode 100644
> index 000000000000..dac02d0d22c4
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_queue.c
> @@ -0,0 +1,1647 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency queue
> + *
> + * The drm_dep subsystem provides a lightweight GPU submission queue that
> + * combines the roles of drm_gpu_scheduler and drm_sched_entity into a
> + * single object (struct drm_dep_queue). Each queue owns its own ordered
> + * submit workqueue, timeout workqueue, and TDR delayed-work.
> + *
> + * **Job lifecycle**
> + *
> + * 1. Allocate and initialize a job with drm_dep_job_init().
> + * 2. Add dependency fences with drm_dep_job_add_dependency() and friends.
> + * 3. Arm the job with drm_dep_job_arm() to obtain its out-fences.
> + * 4. Submit with drm_dep_job_push().
> + *
> + * **Submission paths**
> + *
> + * drm_dep_job_push() decides between two paths under @q->sched.lock:
> + *
> + * - **Bypass path** (drm_dep_queue_can_job_bypass()): if
> + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the queue is not stopped,
> + * the SPSC queue is empty, the job has no dependency fences, and credits
> + * are available, the job is submitted inline on the calling thread without
> + * touching the submit workqueue.
> + *
> + * - **Queued path** (drm_dep_queue_push_job()): the job is pushed onto an
> + * SPSC queue and the run_job worker is kicked. The run_job worker pops the
> + * job, resolves any remaining dependency fences (installing wakeup
> + * callbacks for unresolved ones), and calls drm_dep_queue_run_job().
> + *
> + * **Running a job**
> + *
> + * drm_dep_queue_run_job() accounts credits, appends the job to the pending
> + * list (starting the TDR timer only when the list was previously empty),
> + * calls @ops->run_job(), stores the returned hardware fence as the parent
> + * of the job's dep fence, then installs a callback on it. When the hardware
> + * fence fires (or the job completes synchronously), drm_dep_job_done()
> + * signals the finished fence, returns credits, and kicks the put_job worker
> + * to free the job.
> + *
> + * **Timeout detection and recovery (TDR)**
> + *
> + * A delayed work item fires when a job on the pending list takes longer than
> + * @q->job.timeout jiffies. It calls @ops->timedout_job() and acts on the
> + * returned status (%DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED or
> + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB).
> + * drm_dep_queue_trigger_timeout() forces the timer to fire immediately (without
> + * changing the stored timeout), for example during device teardown.
> + *
> + * **Reference counting**
> + *
> + * Jobs and queues are both reference counted.
> + *
> + * A job holds a reference to its queue from drm_dep_job_init() until
> + * drm_dep_job_put() drops the job's last reference and its release callback
> + * runs. This ensures the queue remains valid for the entire lifetime of any
> + * job that was submitted to it.
> + *
> + * The queue holds its own reference to a job for as long as the job is
> + * internally tracked: from the moment the job is added to the pending list
> + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> + * worker, which calls drm_dep_job_put() to release that reference.
Why not simply record that the job has completed, instead of relinquishing
the reference? We could then release the reference once the job is cleaned
up (by the queue, using a worker) in process context.
> + *
> + * **Hazard: use-after-free from within a worker**
> + *
> + * Because a job holds a queue reference, drm_dep_job_put() dropping the last
> + * job reference will also drop a queue reference via the job's release path.
> + * If that happens to be the last queue reference, drm_dep_queue_fini() can be
> + * called, which queues @q->free_work on dep_free_wq and returns immediately.
> + * free_work calls disable_work_sync() / disable_delayed_work_sync() on the
> + * queue's own workers before destroying its workqueues, so in practice a
> + * running worker always completes before the queue memory is freed.
> + *
> + * However, there is a secondary hazard: a worker can be queued while the
> + * queue is in a "zombie" state — refcount has already reached zero and async
> + * teardown is in flight, but the work item has not yet been disabled by
> + * free_work. To guard against this every worker uses
> + * drm_dep_queue_get_unless_zero() at entry; if the refcount is already zero
> + * the worker bails immediately without touching the queue state.
Again, this problem is gone in Rust.
> + *
> + * Because all actual teardown (disable_*_sync, destroy_workqueue) runs on
> + * dep_free_wq — which is independent of the queue's own submit/timeout
> + * workqueues — there is no deadlock risk. Each queue holds a drm_dev_get()
> + * reference on its owning &drm_device, which is released as the last step of
> + * teardown. This ensures the driver module cannot be unloaded while any queue
> + * is still alive.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/module.h>
> +#include <linux/overflow.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/workqueue.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_print.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/*
> + * Dedicated workqueue for deferred drm_dep_queue teardown. Using a
> + * module-private WQ instead of system_percpu_wq keeps teardown isolated
> + * from unrelated kernel subsystems.
> + */
> +static struct workqueue_struct *dep_free_wq;
> +
> +/**
> + * drm_dep_queue_flags_set() - set a flag on the queue under sched.lock
> + * @q: dep queue
> + * @flag: flag to set (one of &enum drm_dep_queue_flags)
> + *
> + * Sets @flag in @q->sched.flags. Must be called with @q->sched.lock
> + * held; the lockdep assertion enforces this.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_flags_set(struct drm_dep_queue *q,
> + enum drm_dep_queue_flags flag)
> +{
> + lockdep_assert_held(&q->sched.lock);
We can enforce this in Rust at compile-time. The code does not compile if the
lock is not taken. Same here and everywhere else where the sched lock has
to be taken.
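For illustration, a minimal plain-Rust sketch of that compile-time enforcement
(all names hypothetical, not a real kernel API): the helper takes the
MutexGuard itself as proof the lock is held, so calling it unlocked simply
does not compile.

```rust
use std::sync::{Mutex, MutexGuard};

struct Sched {
    flags: u32,
}

// Hypothetical mirror of drm_dep_queue_flags_set(): the function takes the
// guard, so it cannot be called without holding the lock. The
// lockdep_assert_held() runtime check becomes a compile-time guarantee.
fn flags_set(sched: &mut MutexGuard<'_, Sched>, flag: u32) {
    sched.flags |= flag;
}

fn main() {
    let lock = Mutex::new(Sched { flags: 0 });
    let mut guard = lock.lock().unwrap(); // lock must be taken first
    flags_set(&mut guard, 0x1);
    assert_eq!(guard.flags, 0x1);
}
```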
> + q->sched.flags |= flag;
> +}
> +
> +/**
> + * drm_dep_queue_flags_clear() - clear a flag on the queue under sched.lock
> + * @q: dep queue
> + * @flag: flag to clear (one of &enum drm_dep_queue_flags)
> + *
> + * Clears @flag in @q->sched.flags. Must be called with @q->sched.lock
> + * held; the lockdep assertion enforces this.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_flags_clear(struct drm_dep_queue *q,
> + enum drm_dep_queue_flags flag)
> +{
> + lockdep_assert_held(&q->sched.lock);
> + q->sched.flags &= ~flag;
> +}
> +
> +/**
> + * drm_dep_queue_has_credits() - check whether the queue has enough credits
> + * @q: dep queue
> + * @job: job requesting credits
> + *
> + * Checks whether the queue has enough available credits to dispatch
> + * @job. If @job->credits exceeds the queue's credit limit, it is
> + * clamped with a WARN.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if available credits >= @job->credits, false otherwise.
> + */
> +static bool drm_dep_queue_has_credits(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + u32 available;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (job->credits > q->credit.limit) {
> + drm_warn(q->drm,
> + "Jobs may not exceed the credit limit, truncate.\n");
> + job->credits = q->credit.limit;
> + }
> +
> + WARN_ON(check_sub_overflow(q->credit.limit,
> + atomic_read(&q->credit.count),
> + &available));
> +
> + return available >= job->credits;
> +}
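As an aside, the check_sub_overflow() pattern above is what checked
arithmetic gives for free in Rust; a hypothetical standalone sketch (names
are mine, not from this patch):

```rust
// Hypothetical mirror of drm_dep_queue_has_credits(): checked_sub() returns
// None on underflow, folding the separate check_sub_overflow() + WARN_ON
// dance into the normal control flow.
fn has_credits(limit: u32, in_flight: u32, wanted: u32) -> bool {
    match limit.checked_sub(in_flight) {
        Some(available) => available >= wanted,
        None => false, // more credits in flight than the limit: refuse
    }
}

fn main() {
    assert!(has_credits(64, 32, 16));
    assert!(!has_credits(64, 60, 16));
}
```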
> +
> +/**
> + * drm_dep_queue_run_job_queue() - kick the run-job worker
> + * @q: dep queue
> + *
> + * Queues @q->sched.run_job on @q->sched.submit_wq unless the queue is stopped
> + * or the job queue is empty. The empty-queue check avoids queueing a work item
> + * that would immediately return with nothing to do.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_run_job_queue(struct drm_dep_queue *q)
> +{
> + if (!drm_dep_queue_is_stopped(q) && spsc_queue_count(&q->job.queue))
> + queue_work(q->sched.submit_wq, &q->sched.run_job);
> +}
> +
> +/**
> + * drm_dep_queue_put_job_queue() - kick the put-job worker
> + * @q: dep queue
> + *
> + * Queues @q->sched.put_job on @q->sched.submit_wq unless the queue
> + * is stopped.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_put_job_queue(struct drm_dep_queue *q)
> +{
> + if (!drm_dep_queue_is_stopped(q))
> + queue_work(q->sched.submit_wq, &q->sched.put_job);
> +}
> +
> +/**
> + * drm_queue_start_timeout() - arm or re-arm the TDR delayed work
> + * @q: dep queue
> + *
> + * Arms the TDR delayed work with @q->job.timeout. No-op if
> + * @q->ops->timedout_job is NULL, the timeout is MAX_SCHEDULE_TIMEOUT,
> + * or the pending list is empty.
> + *
> + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> + */
> +static void drm_queue_start_timeout(struct drm_dep_queue *q)
> +{
> + lockdep_assert_held(&q->job.lock);
> +
> + if (!q->ops->timedout_job ||
> + q->job.timeout == MAX_SCHEDULE_TIMEOUT ||
> + list_empty(&q->job.pending))
> + return;
> +
> + mod_delayed_work(q->sched.timeout_wq, &q->sched.tdr, q->job.timeout);
> +}
> +
> +/**
> + * drm_queue_start_timeout_unlocked() - arm TDR, acquiring job.lock
> + * @q: dep queue
> + *
> + * Acquires @q->job.lock with interrupts disabled and calls
> + * drm_queue_start_timeout().
> + *
> + * Context: Process context (workqueue).
> + */
> +static void drm_queue_start_timeout_unlocked(struct drm_dep_queue *q)
> +{
> + guard(spinlock_irq)(&q->job.lock);
> + drm_queue_start_timeout(q);
> +}
> +
> +/**
> + * drm_dep_queue_remove_dependency() - clear the active dependency and wake
> + * the run-job worker
> + * @q: dep queue
> + * @f: the dependency fence being removed
> + *
> + * Stores @f into @q->dep.removed_fence via smp_store_release() so that the
> + * run-job worker can drop the reference to it in drm_dep_queue_is_ready(),
> + * paired with smp_load_acquire(). Clears @q->dep.fence and kicks the
> + * run-job worker.
> + *
> + * The fence reference is not dropped here; it is deferred to the run-job
> + * worker via @q->dep.removed_fence to keep this path suitable for dma_fence
> + * callback removal in drm_dep_queue_kill().
This is a comment in C, but in Rust this is encoded directly in the type system.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_remove_dependency(struct drm_dep_queue *q,
> + struct dma_fence *f)
> +{
> + /* removed_fence must be visible to the reader before &q->dep.fence */
> + smp_store_release(&q->dep.removed_fence, f);
> +
> + WRITE_ONCE(q->dep.fence, NULL);
> + drm_dep_queue_run_job_queue(q);
> +}
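The smp_store_release() / smp_load_acquire() pairing above maps directly onto
Rust's Release/Acquire atomic orderings; a hypothetical standalone sketch of
the same handoff (all names mine):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical mirror of the removed_fence handoff: the writer publishes
// with Release ordering (the smp_store_release() side) and the reader takes
// ownership back with an Acquire swap (the smp_load_acquire() side), which
// also clears the slot the way the C code NULLs q->dep.removed_fence.
static REMOVED: AtomicPtr<i32> = AtomicPtr::new(ptr::null_mut());

fn remove_dependency(f: Box<i32>) {
    REMOVED.store(Box::into_raw(f), Ordering::Release);
}

fn reap_removed() -> Option<Box<i32>> {
    let p = REMOVED.swap(ptr::null_mut(), Ordering::Acquire);
    // SAFETY: p either came from Box::into_raw() above or is null.
    if p.is_null() { None } else { Some(unsafe { Box::from_raw(p) }) }
}

fn main() {
    remove_dependency(Box::new(42));
    assert_eq!(*reap_removed().unwrap(), 42);
    assert!(reap_removed().is_none()); // slot was cleared by the swap
}
```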
> +
> +/**
> + * drm_dep_queue_wakeup() - dma_fence callback to wake the run-job worker
> + * @f: the signalled dependency fence
> + * @cb: callback embedded in the dep queue
> + *
> + * Called from dma_fence_signal() when the active dependency fence signals.
> + * Delegates to drm_dep_queue_remove_dependency() to clear @q->dep.fence and
> + * kick the run-job worker. The fence reference is not dropped here; it is
> + * deferred to the run-job worker via @q->dep.removed_fence.
Same here.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct drm_dep_queue *q =
> + container_of(cb, struct drm_dep_queue, dep.cb);
> +
> + drm_dep_queue_remove_dependency(q, f);
> +}
> +
> +/**
> + * drm_dep_queue_is_ready() - check whether the queue has a dispatchable job
> + * @q: dep queue
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
Can’t call this in Rust if the lock is not taken.
> + * Return: true if SPSC queue non-empty and no dep fence pending,
> + * false otherwise.
> + */
> +static bool drm_dep_queue_is_ready(struct drm_dep_queue *q)
> +{
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (!spsc_queue_count(&q->job.queue))
> + return false;
> +
> + if (READ_ONCE(q->dep.fence))
> + return false;
> +
> + /* Paired with smp_store_release in drm_dep_queue_remove_dependency() */
> + dma_fence_put(smp_load_acquire(&q->dep.removed_fence));
> +
> + q->dep.removed_fence = NULL;
> +
> + return true;
> +}
> +
> +/**
> + * drm_dep_queue_is_killed() - check whether a dep queue has been killed
> + * @q: dep queue to check
> + *
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_KILLED is set on @q, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_killed(struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_KILLED);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_killed);
> +
> +/**
> + * drm_dep_queue_is_initialized() - check whether a dep queue has been initialized
> + * @q: dep queue to check
> + *
> + * A queue is considered initialized once its ops pointer has been set by a
> + * successful call to drm_dep_queue_init(). Drivers that embed a
> + * &drm_dep_queue inside a larger structure may call this before attempting any
> + * other queue operation to confirm that initialization has taken place.
> + * drm_dep_queue_put() must be called if this function returns true to drop the
> + * initialization reference from drm_dep_queue_init().
> + *
> + * Return: true if @q has been initialized, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q)
> +{
> + return !!q->ops;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_initialized);
> +
> +/**
> + * drm_dep_queue_set_stopped() - pre-mark a queue as stopped before first use
> + * @q: dep queue to mark
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED directly on @q without going through the
> + * normal drm_dep_queue_stop() path. This is only valid during the driver-side
> + * queue initialization sequence, i.e. after drm_dep_queue_init() returns but
> + * before the queue is made visible to other threads (e.g. before it is added
> + * to any lookup structures). Using this after the queue is live is a driver
> + * bug; use drm_dep_queue_stop() instead.
> + *
> + * Context: Process context, queue not yet visible to other threads.
> + */
> +void drm_dep_queue_set_stopped(struct drm_dep_queue *q)
> +{
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_STOPPED;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_set_stopped);
> +
> +/**
> + * drm_dep_queue_refcount() - read the current reference count of a queue
> + * @q: dep queue to query
> + *
> + * Returns the instantaneous kref value. The count may change immediately
> + * after this call; callers must not make safety decisions based solely on
> + * the returned value. Intended for diagnostic snapshots and debugfs output.
> + *
> + * Context: Any context.
> + * Return: current reference count.
> + */
> +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q)
> +{
> + return kref_read(&q->refcount);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_refcount);
> +
> +/**
> + * drm_dep_queue_timeout() - read the per-job TDR timeout for a queue
> + * @q: dep queue to query
> + *
> + * Returns the per-job timeout in jiffies as set at init time.
> + * %MAX_SCHEDULE_TIMEOUT means no timeout is configured.
> + *
> + * Context: Any context.
> + * Return: timeout in jiffies.
> + */
> +long drm_dep_queue_timeout(const struct drm_dep_queue *q)
> +{
> + return q->job.timeout;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_timeout);
> +
> +/**
> + * drm_dep_queue_is_job_put_irq_safe() - test whether job-put from IRQ is allowed
> + * @q: dep queue
> + *
> + * Context: Any context.
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set,
> + * false otherwise.
> + */
> +static bool drm_dep_queue_is_job_put_irq_safe(const struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE);
> +}
> +
> +/**
> + * drm_dep_queue_job_dependency() - get next unresolved dep fence
> + * @q: dep queue
> + * @job: job whose dependencies to advance
> + *
> + * Returns NULL immediately if the queue has been killed via
> + * drm_dep_queue_kill(), bypassing all dependency waits so that jobs
> + * drain through run_job as quickly as possible.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: next unresolved &dma_fence with a new reference, or NULL
> + * when all dependencies have been consumed (or the queue is killed).
> + */
> +static struct dma_fence *
> +drm_dep_queue_job_dependency(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + struct dma_fence *f;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (drm_dep_queue_is_killed(q))
> + return NULL;
> +
> + f = xa_load(&job->dependencies, job->last_dependency);
> + if (f) {
> + job->last_dependency++;
> + if (WARN_ON(DRM_DEP_JOB_FENCE_PREALLOC == f))
> + return dma_fence_get_stub();
> + return dma_fence_get(f);
> + }
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_dep_queue_add_dep_cb() - install wakeup callback on dep fence
> + * @q: dep queue
> + * @job: job whose dependency fence is stored in @q->dep.fence
> + *
> + * Installs a wakeup callback on @q->dep.fence. Returns true if the
> + * callback was installed (the queue must wait), false if the fence is
> + * already signalled or is a self-fence from the same queue context.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if callback installed, false if fence already done.
> + */
In Rust, we can encode the signaling path with a "token type": any section
that is part of the signaling path simply takes this token as an argument.
The type also ensures end_signaling() is called automatically when it goes
out of scope.
By the way, we can easily offer an irq handler type where we enforce this:
fn handle_threaded_irq(&self, device: &Device<Bound>) -> IrqReturn {
let _annotation = DmaFenceSignallingAnnotation::new(); // Calls begin_signaling()
self.driver.handle_threaded_irq(device)
// end_signaling() is called here automatically.
}
Same for workqueues:
fn work_fn(&self, device: &Device<Bound>) {
let _annotation = DmaFenceSignallingAnnotation::new(); // Calls begin_signaling()
self.driver.work_fn(device)
// end_signaling() is called here automatically.
}
This is not Rust-specific, of course, but it is more ergonomic to write in Rust.
> +static bool drm_dep_queue_add_dep_cb(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + struct dma_fence *fence = q->dep.fence;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (WARN_ON(fence->context == q->fence.context)) {
> + dma_fence_put(q->dep.fence);
> + q->dep.fence = NULL;
> + return false;
> + }
> +
> + if (!dma_fence_add_callback(q->dep.fence, &q->dep.cb,
> + drm_dep_queue_wakeup))
> + return true;
> +
> + dma_fence_put(q->dep.fence);
> + q->dep.fence = NULL;
> +
> + return false;
> +}
In Rust we can enforce that all callbacks automatically take a reference to
the fence. If the callback is "forgotten" in a buggy path, it is
automatically removed and the fence is signaled with -ECANCELED.
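The shape is an RAII guard; a hypothetical plain-Rust sketch (no real fence
API, just the drop behaviour):

```rust
// Hypothetical sketch of RAII callback registration: dropping the guard on
// any exit path (early return, panic unwind, normal fall-through)
// deregisters the callback, so a "forgotten" callback cannot leak. In the
// fence case the drop impl would also signal with -ECANCELED.
struct CallbackGuard<'a> {
    registered: &'a mut bool,
}

impl Drop for CallbackGuard<'_> {
    fn drop(&mut self) {
        *self.registered = false; // auto-removal on every exit path
    }
}

fn run_with_callback(registered: &mut bool, bail_early: bool) {
    *registered = true;
    let _guard = CallbackGuard { registered };
    if bail_early {
        return; // guard still drops here: callback is removed
    }
    // ... normal signaling work would go here ...
}

fn main() {
    let mut registered = false;
    run_with_callback(&mut registered, true);
    assert!(!registered); // removed even on the early-return path
}
```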
> +
> +/**
> + * drm_dep_queue_pop_job() - pop a dispatchable job from the SPSC queue
> + * @q: dep queue
> + *
> + * Peeks at the head of the SPSC queue and drains all resolved
> + * dependencies. If a dependency is still pending, installs a wakeup
> + * callback and returns NULL. On success pops the job and returns it.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: next dispatchable job, or NULL if a dep is still pending.
> + */
> +static struct drm_dep_job *drm_dep_queue_pop_job(struct drm_dep_queue *q)
> +{
> + struct spsc_node *node;
> + struct drm_dep_job *job;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + node = spsc_queue_peek(&q->job.queue);
> + if (!node)
> + return NULL;
> +
> + job = container_of(node, struct drm_dep_job, queue_node);
> +
> + while ((q->dep.fence = drm_dep_queue_job_dependency(q, job))) {
> + if (drm_dep_queue_add_dep_cb(q, job))
> + return NULL;
> + }
> +
> + spsc_queue_pop(&q->job.queue);
> +
> + return job;
> +}
> +
> +/*
> + * drm_dep_queue_get_unless_zero() - try to acquire a queue reference
> + *
> + * Workers use this instead of drm_dep_queue_get() to guard against the zombie
> + * state: the queue's refcount has already reached zero (async teardown is in
> + * flight) but a work item was queued before free_work had a chance to cancel
> + * it. If kref_get_unless_zero() fails the caller must bail immediately.
> + *
> + * Context: Any context.
> + * Returns true if the reference was acquired, false if the queue is a zombie.
> + */
Again, this function is totally gone in Rust.
> +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q)
> +{
> + return kref_get_unless_zero(&q->refcount);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_get_unless_zero);
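Concretely, the Rust analogue of kref_get_unless_zero() is Weak::upgrade(),
which returns None once the strong count has hit zero; a hypothetical sketch
of the worker-entry pattern (names mine):

```rust
use std::sync::{Arc, Weak};

struct Queue {
    name: &'static str,
}

// Hypothetical mirror of a worker entry point: the work item holds a Weak
// reference and upgrades it on entry. Once the last strong reference is
// gone, upgrade() fails -- the kref_get_unless_zero() bail-out, but with
// the dangling-pointer window removed by construction.
fn worker_entry(weak: &Weak<Queue>) -> bool {
    match weak.upgrade() {
        Some(q) => {
            let _ = q.name; // queue guaranteed alive while `q` is in scope
            true
        }
        None => false, // zombie: teardown already in flight, bail
    }
}

fn main() {
    let q = Arc::new(Queue { name: "dep0" });
    let weak = Arc::downgrade(&q);
    assert!(worker_entry(&weak));
    drop(q); // last strong reference gone
    assert!(!worker_entry(&weak));
}
```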
> +
> +/**
> + * drm_dep_queue_run_job_work() - run-job worker
> + * @work: work item embedded in the dep queue
> + *
> + * Acquires @q->sched.lock, checks stopped state, queue readiness and
> + * available credits, pops the next job via drm_dep_queue_pop_job(),
> + * dispatches it via drm_dep_queue_run_job(), then re-kicks itself.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_run_job_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.run_job);
> + struct spsc_node *node;
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + mutex_lock(&q->sched.lock);
> +
> + if (drm_dep_queue_is_stopped(q))
> + goto put_queue;
> +
> + if (!drm_dep_queue_is_ready(q))
> + goto put_queue;
> +
> + /* Peek to check credits before committing to pop and dep resolution */
> + node = spsc_queue_peek(&q->job.queue);
> + if (!node)
> + goto put_queue;
> +
> + job = container_of(node, struct drm_dep_job, queue_node);
> + if (!drm_dep_queue_has_credits(q, job))
> + goto put_queue;
> +
> + job = drm_dep_queue_pop_job(q);
> + if (!job)
> + goto put_queue;
> +
> + drm_dep_queue_run_job(q, job);
> + drm_dep_queue_run_job_queue(q);
> +
> +put_queue:
> + mutex_unlock(&q->sched.lock);
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/*
> + * drm_dep_queue_remove_job() - unlink a job from the pending list and reset TDR
> + * @q: dep queue owning @job
> + * @job: job to remove
> + *
> + * Splices @job out of @q->job.pending, cancels any pending TDR delayed work,
> + * and arms the timeout for the new list head (if any).
> + *
> + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_remove_job(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + lockdep_assert_held(&q->job.lock);
> +
> + list_del_init(&job->pending_link);
> + cancel_delayed_work(&q->sched.tdr);
> + drm_queue_start_timeout(q);
> +}
> +
> +/**
> + * drm_dep_queue_get_finished_job() - dequeue a finished job
> + * @q: dep queue
> + *
> + * Under @q->job.lock checks the head of the pending list for a
> + * finished dep fence. If found, removes the job from the list,
> + * cancels the TDR, and re-arms it for the new head.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + * Return: the finished &drm_dep_job, or NULL if none is ready.
> + */
> +static struct drm_dep_job *
> +drm_dep_queue_get_finished_job(struct drm_dep_queue *q)
> +{
> + struct drm_dep_job *job;
> +
> + guard(spinlock_irq)(&q->job.lock);
> +
> + job = list_first_entry_or_null(&q->job.pending, struct drm_dep_job,
> + pending_link);
> + if (job && drm_dep_fence_is_finished(job->dfence))
> + drm_dep_queue_remove_job(q, job);
> + else
> + job = NULL;
> +
> + return job;
> +}
> +
> +/**
> + * drm_dep_queue_put_job_work() - put-job worker
> + * @work: work item embedded in the dep queue
> + *
> + * Drains all finished jobs by calling drm_dep_job_put() in a loop,
> + * then kicks the run-job worker.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * because workqueue is shared with other items in the fence signaling path.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_put_job_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.put_job);
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + while ((job = drm_dep_queue_get_finished_job(q)))
> + drm_dep_job_put(job);
> +
> + drm_dep_queue_run_job_queue(q);
> +
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_queue_tdr_work() - TDR worker
> + * @work: work item embedded in the delayed TDR work
> + *
> + * Removes the head job from the pending list under @q->job.lock,
> + * asserts @q->ops->timedout_job is non-NULL, calls it outside the lock,
> + * requeues the job if %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB, drops the
> + * queue's job reference on %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED, and always
> + * restarts the TDR timer after handling the job.
> + * Any other return value triggers a WARN.
> + *
> + * The TDR is never armed when @q->ops->timedout_job is NULL, so firing
> + * this worker without a timedout_job callback is a driver bug.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * because timedout_job() is expected to signal the guilty job's fence as part
> + * of reset.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_tdr_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.tdr.work);
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + scoped_guard(spinlock_irq, &q->job.lock) {
> + job = list_first_entry_or_null(&q->job.pending,
> + struct drm_dep_job,
> + pending_link);
> + if (job)
> + /*
> + * Remove from pending so it cannot be freed
> + * concurrently by drm_dep_queue_get_finished_job() or
> + * .drm_dep_job_done().
> + */
> + list_del_init(&job->pending_link);
> + }
> +
> + if (job) {
> + enum drm_dep_timedout_stat status;
> +
> + if (WARN_ON(!q->ops->timedout_job)) {
> + drm_dep_job_put(job);
> + goto out;
> + }
> +
> + status = q->ops->timedout_job(job);
> +
> + switch (status) {
> + case DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB:
> + scoped_guard(spinlock_irq, &q->job.lock)
> + list_add(&job->pending_link, &q->job.pending);
> + drm_dep_queue_put_job_queue(q);
> + break;
> + case DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED:
> + drm_dep_job_put(job);
> + break;
> + default:
> + WARN(1, "invalid drm_dep_timedout_stat\n");
> + break;
> + }
> + }
> +
> +out:
> + drm_queue_start_timeout_unlocked(q);
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_alloc_submit_wq() - allocate an ordered submit workqueue
> + * @name: name for the workqueue
> + * @flags: DRM_DEP_QUEUE_FLAGS_* flags
> + *
> + * Allocates an ordered workqueue for job submission with %WQ_MEM_RECLAIM and
> + * %WQ_MEM_WARN_ON_RECLAIM set, ensuring the workqueue is safe to use from
> + * memory reclaim context and properly annotated for lockdep taint tracking.
> + * Adds %WQ_HIGHPRI if %DRM_DEP_QUEUE_FLAGS_HIGHPRI is set. When
> + * CONFIG_LOCKDEP is enabled, uses a dedicated lockdep map for annotation.
> + *
> + * Context: Process context.
> + * Return: the new &workqueue_struct, or NULL on failure.
> + */
> +static struct workqueue_struct *
> +drm_dep_alloc_submit_wq(const char *name, enum drm_dep_queue_flags flags)
> +{
> + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> +
> + if (flags & DRM_DEP_QUEUE_FLAGS_HIGHPRI)
> + wq_flags |= WQ_HIGHPRI;
> +
> +#if IS_ENABLED(CONFIG_LOCKDEP)
> + static struct lockdep_map map = {
> + .name = "drm_dep_submit_lockdep_map"
> + };
> + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> +#else
> + return alloc_ordered_workqueue(name, wq_flags);
> +#endif
> +}
> +
> +/**
> + * drm_dep_alloc_timeout_wq() - allocate an ordered TDR workqueue
> + * @name: name for the workqueue
> + *
> + * Allocates an ordered workqueue for timeout detection and recovery with
> + * %WQ_MEM_RECLAIM and %WQ_MEM_WARN_ON_RECLAIM set, ensuring consistent taint
> + * annotation with the submit workqueue. When CONFIG_LOCKDEP is enabled, uses
> + * a dedicated lockdep map for annotation.
> + *
> + * Context: Process context.
> + * Return: the new &workqueue_struct, or NULL on failure.
> + */
> +static struct workqueue_struct *drm_dep_alloc_timeout_wq(const char *name)
> +{
> + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> +
> +#if IS_ENABLED(CONFIG_LOCKDEP)
> + static struct lockdep_map map = {
> + .name = "drm_dep_timeout_lockdep_map"
> + };
> + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> +#else
> + return alloc_ordered_workqueue(name, wq_flags);
> +#endif
> +}
> +
> +/**
> + * drm_dep_queue_init() - initialize a dep queue
> + * @q: dep queue to initialize
> + * @args: initialization arguments
> + *
> + * Initializes all fields of @q from @args. If @args->submit_wq is NULL an
> + * ordered workqueue is allocated and owned by the queue
> + * (%DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ). If @args->timeout_wq is NULL an
> + * ordered workqueue is allocated and owned by the queue
> + * (%DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ). On success the queue holds one kref
> + * reference and drm_dep_queue_put() must be called to drop this reference
> + * (i.e., drivers cannot directly free the queue).
> + *
> + * When CONFIG_LOCKDEP is enabled, @q->sched.lock is primed against the
> + * fs_reclaim pseudo-lock so that lockdep can detect any lock ordering
> + * inversion between @q->sched.lock and memory reclaim.
> + *
> + * Return: 0 on success, %-EINVAL when @args->credit_limit is zero, @args->ops
> + * is NULL, @args->drm is NULL, @args->ops->run_job is NULL, or when
> + * @args->submit_wq or @args->timeout_wq is non-NULL but was not allocated with
> + * %WQ_MEM_WARN_ON_RECLAIM; %-ENOMEM when workqueue allocation fails.
> + *
> + * Context: Process context. May allocate memory and create workqueues.
> + */
> +int drm_dep_queue_init(struct drm_dep_queue *q,
> + const struct drm_dep_queue_init_args *args)
> +{
> + if (!args->credit_limit || !args->drm || !args->ops ||
> + !args->ops->run_job)
> + return -EINVAL;
> +
> + if (args->submit_wq && !workqueue_is_reclaim_annotated(args->submit_wq))
> + return -EINVAL;
> +
> + if (args->timeout_wq &&
> + !workqueue_is_reclaim_annotated(args->timeout_wq))
> + return -EINVAL;
> +
> + memset(q, 0, sizeof(*q));
> +
> + q->name = args->name;
> + q->drm = args->drm;
> + q->credit.limit = args->credit_limit;
> + q->job.timeout = args->timeout ? args->timeout : MAX_SCHEDULE_TIMEOUT;
> +
> + init_rcu_head(&q->rcu);
> + INIT_LIST_HEAD(&q->job.pending);
> + spin_lock_init(&q->job.lock);
> + spsc_queue_init(&q->job.queue);
> +
> + mutex_init(&q->sched.lock);
> + if (IS_ENABLED(CONFIG_LOCKDEP)) {
> + fs_reclaim_acquire(GFP_KERNEL);
> + might_lock(&q->sched.lock);
> + fs_reclaim_release(GFP_KERNEL);
> + }
> +
> + if (args->submit_wq) {
> + q->sched.submit_wq = args->submit_wq;
> + } else {
> + q->sched.submit_wq = drm_dep_alloc_submit_wq(args->name ?: "drm_dep",
> + args->flags);
> + if (!q->sched.submit_wq)
> + return -ENOMEM;
> +
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ;
> + }
> +
> + if (args->timeout_wq) {
> + q->sched.timeout_wq = args->timeout_wq;
> + } else {
> + q->sched.timeout_wq = drm_dep_alloc_timeout_wq(args->name ?: "drm_dep");
> + if (!q->sched.timeout_wq)
> + goto err_submit_wq;
> +
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ;
> + }
> +
> + q->sched.flags |= args->flags &
> + ~(DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ |
> + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ);
> +
> + INIT_DELAYED_WORK(&q->sched.tdr, drm_dep_queue_tdr_work);
> + INIT_WORK(&q->sched.run_job, drm_dep_queue_run_job_work);
> + INIT_WORK(&q->sched.put_job, drm_dep_queue_put_job_work);
> +
> + q->fence.context = dma_fence_context_alloc(1);
> +
> + kref_init(&q->refcount);
> + q->ops = args->ops;
> + drm_dev_get(q->drm);
> +
> + return 0;
> +
> +err_submit_wq:
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> + destroy_workqueue(q->sched.submit_wq);
> + mutex_destroy(&q->sched.lock);
> +
> + return -ENOMEM;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_init);
> +
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> +/**
> + * drm_dep_queue_push_job_begin() - mark the start of an arm/push critical section
> + * @q: dep queue the job belongs to
> + *
> + * Called at the start of drm_dep_job_arm() and warns if the push context is
> + * already owned by another task, which would indicate concurrent arm/push on
> + * the same queue.
> + *
> + * No-op when CONFIG_PROVE_LOCKING is disabled.
> + *
> + * Context: Process context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> +{
> + WARN_ON(q->job.push.owner);
> + q->job.push.owner = current;
> +}
> +
> +/**
> + * drm_dep_queue_push_job_end() - mark the end of an arm/push critical section
> + * @q: dep queue the job belongs to
> + *
> + * Called at the end of drm_dep_job_push() and warns if the push context is not
> + * owned by the current task, which would indicate a mismatched begin/end pair
> + * or a push from the wrong thread.
> + *
> + * No-op when CONFIG_PROVE_LOCKING is disabled.
> + *
> + * Context: Process context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> +{
> + WARN_ON(q->job.push.owner != current);
> + q->job.push.owner = NULL;
> +}
> +#endif
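For readers following along, the arm/push ownership check above reduces to a simple owner-handoff pattern. A userspace sketch (illustrative names only; `task` stands in for the kernel's `current`, and the functions return a flag instead of WARNing):

```c
#include <assert.h>
#include <stddef.h>

/* Owner-handoff check: begin records the calling task as owner of the
 * push context; end verifies the same task closes the section. A
 * non-zero return corresponds to the WARN_ON in the kernel code. */
struct push_ctx { const void *owner; };

static int push_begin(struct push_ctx *c, const void *task)
{
	int bad = (c->owner != NULL);	/* concurrent arm/push on same queue */

	c->owner = task;
	return bad;
}

static int push_end(struct push_ctx *c, const void *task)
{
	int bad = (c->owner != task);	/* mismatched begin/end or wrong thread */

	c->owner = NULL;
	return bad;
}
```
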
> +
> +/**
> + * drm_dep_queue_assert_teardown_invariants() - assert teardown invariants
> + * @q: dep queue being torn down
> + *
> + * Warns if the pending-job list, the SPSC submission queue, or the credit
> + * counter is non-zero when called, or if the queue still has a non-zero
> + * reference count.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_assert_teardown_invariants(struct drm_dep_queue *q)
> +{
> + WARN_ON(!list_empty(&q->job.pending));
> + WARN_ON(spsc_queue_count(&q->job.queue));
> + WARN_ON(atomic_read(&q->credit.count));
> + WARN_ON(drm_dep_queue_refcount(q));
> +}
> +
> +/**
> + * drm_dep_queue_release() - final internal cleanup of a dep queue
> + * @q: dep queue to clean up
> + *
> + * Asserts teardown invariants and destroys internal resources allocated by
> + * drm_dep_queue_init() that cannot be torn down earlier in the teardown
> + * sequence. Currently this destroys @q->sched.lock.
> + *
> + * Drivers that implement &drm_dep_queue_ops.release **must** call this
> + * function after removing @q from any internal bookkeeping (e.g. lookup
> + * tables or lists) but before freeing the memory that contains @q. When
> + * &drm_dep_queue_ops.release is NULL, drm_dep follows the default teardown
> + * path and calls this function automatically.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_release(struct drm_dep_queue *q)
> +{
> + drm_dep_queue_assert_teardown_invariants(q);
> + mutex_destroy(&q->sched.lock);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_release);
> +
> +/**
> + * drm_dep_queue_free() - final cleanup of a dep queue
> + * @q: dep queue to free
> + *
> + * Invokes &drm_dep_queue_ops.release if set, in which case the driver is
> + * responsible for calling drm_dep_queue_release() and freeing @q itself.
> + * If &drm_dep_queue_ops.release is NULL, calls drm_dep_queue_release()
> + * and then frees @q with kfree_rcu().
> + *
> + * In either case, releases the drm_dev_get() reference taken at init time
> + * via drm_dev_put(), allowing the owning &drm_device to be unloaded once
> + * all queues have been freed.
> + *
> + * Context: Process context (workqueue), reclaim safe.
> + */
> +static void drm_dep_queue_free(struct drm_dep_queue *q)
> +{
> + struct drm_device *drm = q->drm;
> +
> + if (q->ops->release) {
> + q->ops->release(q);
> + } else {
> + drm_dep_queue_release(q);
> + kfree_rcu(q, rcu);
> + }
> + drm_dev_put(drm);
> +}
> +
> +/**
> + * drm_dep_queue_free_work() - deferred queue teardown worker
> + * @work: free_work item embedded in the dep queue
> + *
> + * Runs on dep_free_wq. Disables all work items synchronously
> + * (preventing re-queue and waiting for in-flight instances),
> + * destroys any owned workqueues, then calls drm_dep_queue_free().
> + * Running on dep_free_wq ensures destroy_workqueue() is never
> + * called from within one of the queue's own workers (deadlock)
> + * and disable_*_sync() cannot deadlock either.
> + *
> + * Context: Process context (workqueue), reclaim safe.
> + */
> +static void drm_dep_queue_free_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, free_work);
> +
> + drm_dep_queue_assert_teardown_invariants(q);
> +
> + disable_delayed_work_sync(&q->sched.tdr);
> + disable_work_sync(&q->sched.run_job);
> + disable_work_sync(&q->sched.put_job);
> +
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ)
> + destroy_workqueue(q->sched.timeout_wq);
> +
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> + destroy_workqueue(q->sched.submit_wq);
> +
> + drm_dep_queue_free(q);
> +}
> +
> +/**
> + * drm_dep_queue_fini() - tear down a dep queue
> + * @q: dep queue to tear down
> + *
> + * Asserts teardown invariants and initiates teardown of @q by queuing the
> + * deferred free work onto the module-private dep_free_wq workqueue. The work
> + * item disables any pending TDR and run/put-job work synchronously, destroys
> + * any workqueues that were allocated by drm_dep_queue_init(), and then releases
> + * the queue memory.
> + *
> + * Running teardown from dep_free_wq ensures that destroy_workqueue() is never
> + * called from within one of the queue's own workers (e.g. via
> + * drm_dep_queue_put()), which would deadlock.
> + *
> + * Drivers can wait for all outstanding deferred work to complete by waiting
> + * for the last drm_dev_put() reference on their &drm_device, which is
> + * released as the final step of each queue's teardown.
> + *
> + * Drivers that implement &drm_dep_queue_ops.fini **must** call this
> + * function after removing @q from any device bookkeeping but before freeing the
> + * memory that contains @q. When &drm_dep_queue_ops.fini is NULL, drm_dep
> + * follows the default teardown path and calls this function automatically.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_fini(struct drm_dep_queue *q)
> +{
> + drm_dep_queue_assert_teardown_invariants(q);
> +
> + INIT_WORK(&q->free_work, drm_dep_queue_free_work);
> + queue_work(dep_free_wq, &q->free_work);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_fini);
> +
> +/**
> + * drm_dep_queue_get() - acquire a reference to a dep queue
> + * @q: dep queue to acquire a reference on, or NULL
> + *
> + * Return: @q with an additional reference held, or NULL if @q is NULL.
> + *
> + * Context: Any context.
> + */
> +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q)
> +{
> + if (q)
> + kref_get(&q->refcount);
> + return q;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_get);
> +
> +/**
> + * __drm_dep_queue_release() - kref release callback for a dep queue
> + * @kref: kref embedded in the dep queue
> + *
> + * Calls &drm_dep_queue_ops.fini if set, otherwise calls
> + * drm_dep_queue_fini() to initiate deferred teardown.
> + *
> + * Context: Any context.
> + */
> +static void __drm_dep_queue_release(struct kref *kref)
> +{
> + struct drm_dep_queue *q =
> + container_of(kref, struct drm_dep_queue, refcount);
> +
> + if (q->ops->fini)
> + q->ops->fini(q);
> + else
> + drm_dep_queue_fini(q);
> +}
> +
> +/**
> + * drm_dep_queue_put() - release a reference to a dep queue
> + * @q: dep queue to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_queue_ops.fini if set,
> + * otherwise calls drm_dep_queue_fini(). Final memory release is handled by
> + * &drm_dep_queue_ops.release (which must call drm_dep_queue_release()) if set,
> + * or drm_dep_queue_release() followed by kfree_rcu() otherwise.
> + * Does nothing if @q is NULL.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_put(struct drm_dep_queue *q)
> +{
> + if (q)
> + kref_put(&q->refcount, __drm_dep_queue_release);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_put);
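The get/put pair documented above follows the standard NULL-safe kref ownership pattern: the release callback fires exactly once, when the last reference drops. A userspace sketch with a plain atomic counter standing in for struct kref (illustrative names, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* NULL-safe refcounting analog of drm_dep_queue_get()/_put(). */
struct refq {
	atomic_int refcount;
	bool released;	/* stand-in for the deferred teardown path */
};

static void refq_release(struct refq *q)
{
	q->released = true;
}

static struct refq *refq_get(struct refq *q)
{
	if (q)
		atomic_fetch_add(&q->refcount, 1);
	return q;	/* returned with the extra reference held */
}

static void refq_put(struct refq *q)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref */
	if (q && atomic_fetch_sub(&q->refcount, 1) == 1)
		refq_release(q);
}
```
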
> +
> +/**
> + * drm_dep_queue_stop() - stop a dep queue from processing new jobs
> + * @q: dep queue to stop
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> + * and @q->job.lock (spinlock_irq), making the flag safe to test from finished
> + * fence signaling context. Then cancels any in-flight run_job and put_job work
> + * items. Once stopped, the bypass path and the submit workqueue will not
> + * dispatch further jobs, nor will any jobs be removed from the pending list.
> + * Call drm_dep_queue_start() to resume processing.
> + *
> + * Context: Process context. Waits for in-flight workers to complete.
> + */
> +void drm_dep_queue_stop(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + scoped_guard(spinlock_irq, &q->job.lock)
> + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> + }
> + cancel_work_sync(&q->sched.run_job);
> + cancel_work_sync(&q->sched.put_job);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_stop);
> +
> +/**
> + * drm_dep_queue_start() - resume a stopped dep queue
> + * @q: dep queue to start
> + *
> + * Clears %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> + * and @q->job.lock (spinlock_irq), making the flag safe to test from IRQ
> + * context. Then re-queues the run_job and put_job work items so that any jobs
> + * pending since the queue was stopped are processed. Must only be called after
> + * drm_dep_queue_stop().
> + *
> + * Context: Process context.
> + */
> +void drm_dep_queue_start(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + scoped_guard(spinlock_irq, &q->job.lock)
> + drm_dep_queue_flags_clear(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> + }
> + drm_dep_queue_run_job_queue(q);
> + drm_dep_queue_put_job_queue(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_start);
> +
> +/**
> + * drm_dep_queue_trigger_timeout() - trigger the TDR immediately for
> + * all pending jobs
> + * @q: dep queue to trigger timeout on
> + *
> + * Sets @q->job.timeout to 1 and arms the TDR delayed work with a one-jiffy
> + * delay, causing it to fire almost immediately without hot-spinning at zero
> + * delay. This is used to force-expire any pending jobs on the queue, for
> + * example when the device is being torn down or has encountered an
> + * unrecoverable error.
> + *
> + * It is suggested that when this function is used, the first timedout_job call
> + * causes the driver to kick the queue off the hardware and signal all pending
> + * job fences. Subsequent calls continue to signal all pending job fences.
> + *
> + * Has no effect if the pending list is empty.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q)
> +{
> + guard(spinlock_irqsave)(&q->job.lock);
> + q->job.timeout = 1;
> + drm_queue_start_timeout(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_trigger_timeout);
> +
> +/**
> + * drm_dep_queue_cancel_tdr_sync() - cancel any pending TDR and wait
> + * for it to finish
> + * @q: dep queue whose TDR to cancel
> + *
> + * Cancels the TDR delayed work item if it has not yet started, and waits for
> + * it to complete if it is already running. After this call returns, the TDR
> + * worker is guaranteed not to be executing and will not fire again until
> + * explicitly rearmed (e.g. via drm_dep_queue_resume_timeout() or by a new
> + * job being submitted).
> + *
> + * Useful during error recovery or queue teardown when the caller needs to
> + * know that no timeout handling races with its own reset logic.
> + *
> + * Context: Process context. May sleep waiting for the TDR worker to finish.
> + */
> +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q)
> +{
> + cancel_delayed_work_sync(&q->sched.tdr);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_cancel_tdr_sync);
> +
> +/**
> + * drm_dep_queue_resume_timeout() - restart the TDR timer with the
> + * configured timeout
> + * @q: dep queue to resume the timeout for
> + *
> + * Restarts the TDR delayed work using @q->job.timeout. Called after device
> + * recovery to give pending jobs a fresh full timeout window. Has no effect
> + * if the pending list is empty.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q)
> +{
> + drm_queue_start_timeout_unlocked(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_resume_timeout);
> +
> +/**
> + * drm_dep_queue_is_stopped() - check whether a dep queue is stopped
> + * @q: dep queue to check
> + *
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_STOPPED is set on @q, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_STOPPED);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_stopped);
> +
> +/**
> + * drm_dep_queue_kill() - kill a dep queue and flush all pending jobs
> + * @q: dep queue to kill
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_KILLED on @q under @q->sched.lock. If a
> + * dependency fence is currently being waited on, its callback is removed and
> + * the run-job worker is kicked immediately so that the blocked job drains
> + * without waiting.
> + *
> + * Once killed, drm_dep_queue_job_dependency() returns NULL for all jobs,
> + * bypassing dependency waits so that every queued job drains through
> + * &drm_dep_queue_ops.run_job without blocking.
> + *
> + * The &drm_dep_queue_ops.run_job callback is guaranteed to be called for every
> + * job that was pushed before or after drm_dep_queue_kill(), even during queue
> + * teardown. Drivers should use this guarantee to perform any necessary
> + * bookkeeping cleanup without executing the actual backend operation when the
> + * queue is killed.
> + *
> + * Unlike drm_dep_queue_stop(), killing is one-way: there is no corresponding
> + * start function.
> + *
> + * **Driver safety requirement**
> + *
> + * drm_dep_queue_kill() must only be called once the driver can guarantee that
> + * no job in the queue will touch memory associated with any of its fences
> + * (i.e., the queue has been removed from the device and will never be put back
> + * on).
> + *
> + * Context: Process context.
> + */
> +void drm_dep_queue_kill(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + struct dma_fence *fence;
> +
> + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_KILLED);
> +
> + /*
> + * Holding &q->sched.lock guarantees that the run-job work item
> + * cannot drop its reference to q->dep.fence concurrently, so
> + * reading q->dep.fence here is safe.
> + */
> + fence = READ_ONCE(q->dep.fence);
> + if (fence && dma_fence_remove_callback(fence, &q->dep.cb))
> + drm_dep_queue_remove_dependency(q, fence);
> + }
> +}
> +EXPORT_SYMBOL(drm_dep_queue_kill);
> +
> +/**
> + * drm_dep_queue_submit_wq() - retrieve the submit workqueue of a dep queue
> + * @q: dep queue whose workqueue to retrieve
> + *
> + * Drivers may use this to queue their own work items alongside the queue's
> + * internal run-job and put-job workers — for example to process incoming
> + * messages in the same serialisation domain.
> + *
> + * Prefer drm_dep_queue_work_enqueue() when the only need is to enqueue a
> + * work item, as it additionally checks the stopped state. Use this accessor
> + * when the workqueue itself is required (e.g. for alloc_ordered_workqueue
> + * replacement or drain_workqueue calls).
> + *
> + * Context: Any context.
> + * Return: the &workqueue_struct used by @q for job submission.
> + */
> +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q)
> +{
> + return q->sched.submit_wq;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_submit_wq);
> +
> +/**
> + * drm_dep_queue_timeout_wq() - retrieve the timeout workqueue of a dep queue
> + * @q: dep queue whose workqueue to retrieve
> + *
> + * Returns the workqueue used by @q to run TDR (timeout detection and recovery)
> + * work. Drivers may use this to queue their own timeout-domain work items, or
> + * to call drain_workqueue() when tearing down and needing to ensure all pending
> + * timeout callbacks have completed before proceeding.
> + *
> + * Context: Any context.
> + * Return: the &workqueue_struct used by @q for TDR work.
> + */
> +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q)
> +{
> + return q->sched.timeout_wq;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_timeout_wq);
> +
> +/**
> + * drm_dep_queue_work_enqueue() - queue work on the dep queue's submit workqueue
> + * @q: dep queue to enqueue work on
> + * @work: work item to enqueue
> + *
> + * Queues @work on @q->sched.submit_wq if the queue is not stopped. This
> + * allows drivers to schedule custom work items that run serialised with the
> + * queue's own run-job and put-job workers.
> + *
> + * Return: true if the work was queued, false if the queue is stopped or the
> + * work item was already pending.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> + struct work_struct *work)
> +{
> + if (drm_dep_queue_is_stopped(q))
> + return false;
> +
> + return queue_work(q->sched.submit_wq, work);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_work_enqueue);
> +
> +/**
> + * drm_dep_queue_can_job_bypass() - test whether a job can skip the SPSC queue
> + * @q: dep queue
> + * @job: job to test
> + *
> + * A job may bypass the submit workqueue and run inline on the calling thread
> + * if all of the following hold:
> + *
> + * - %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set on the queue
> + * - the queue is not stopped
> + * - the SPSC submission queue is empty (no other jobs waiting)
> + * - the queue has enough credits for @job
> + * - @job has no unresolved dependency fences
> + *
> + * Must be called under @q->sched.lock.
> + *
> + * Context: Process context. Must hold @q->sched.lock (a mutex).
> + * Return: true if the job may be run inline, false otherwise.
> + */
> +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + lockdep_assert_held(&q->sched.lock);
> +
> + return q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED &&
> + !drm_dep_queue_is_stopped(q) &&
> + !spsc_queue_count(&q->job.queue) &&
> + drm_dep_queue_has_credits(q, job) &&
> + xa_empty(&job->dependencies);
> +}
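The bypass predicate is just the conjunction of the five conditions listed in the kerneldoc. A userspace sketch, assuming (since the helpers are not shown in this hunk) that drm_dep_queue_has_credits() checks the job's credits against the remaining limit and that xa_empty() maps to a zero unresolved-dependency count:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the queue state consulted by
 * drm_dep_queue_can_job_bypass(); field names are not the kernel ones. */
struct bypass_state {
	bool bypass_supported;	/* DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED */
	bool stopped;		/* DRM_DEP_QUEUE_FLAGS_STOPPED */
	int queued_jobs;	/* spsc_queue_count() analog */
	int credit_count;	/* credits currently in flight */
	int credit_limit;
	int unresolved_deps;	/* non-zero => xa not empty */
};

static bool can_bypass(const struct bypass_state *s, int job_credits)
{
	return s->bypass_supported &&
	       !s->stopped &&
	       s->queued_jobs == 0 &&
	       s->credit_count + job_credits <= s->credit_limit &&
	       s->unresolved_deps == 0;
}
```
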
> +
> +/**
> + * drm_dep_job_done() - mark a job as complete
> + * @job: the job that finished
> + * @result: error code to propagate, or 0 for success
> + *
> + * Subtracts @job->credits from the queue credit counter, then signals the
> + * job's dep fence with @result.
> + *
> + * When %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set (IRQ-safe path), a
> + * temporary extra reference is taken on @job before signalling the fence.
> + * This prevents a concurrent put-job worker — which may be woken by timeouts or
> + * queue starting — from freeing the job while this function still holds a
> + * pointer to it. The extra reference is released at the end of the function.
> + *
> + * After signalling, the IRQ-safe path removes the job from the pending list
> + * under @q->job.lock, provided the queue is not stopped. Removal is skipped
> + * when the queue is stopped so that drm_dep_queue_for_each_pending_job() can
> + * iterate the list without racing with the completion path. On successful
> + * removal, kicks the run-job worker so the next queued job can be dispatched
> + * immediately, then drops the job reference. If the job was already removed
> + * by TDR, or removal was skipped because the queue is stopped, kicks the
> + * put-job worker instead to allow the deferred put to complete.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_done(struct drm_dep_job *job, int result)
> +{
> + struct drm_dep_queue *q = job->q;
> + bool irq_safe = drm_dep_queue_is_job_put_irq_safe(q), removed = false;
> +
> + /*
> + * Local ref to ensure the put worker—which may be woken by external
> + * forces (TDR, driver-side queue starting)—doesn't free the job behind
> + * this function's back after drm_dep_fence_done() while it is still on
> + * the pending list.
> + */
> + if (irq_safe)
> + drm_dep_job_get(job);
> +
> + atomic_sub(job->credits, &q->credit.count);
> + drm_dep_fence_done(job->dfence, result);
> +
> + /* Only safe to touch job after fence signal if we have a local ref. */
> +
> + if (irq_safe) {
> + scoped_guard(spinlock_irqsave, &q->job.lock) {
> + removed = !list_empty(&job->pending_link) &&
> + !drm_dep_queue_is_stopped(q);
> +
> + /* Guard against TDR operating on job */
> + if (removed)
> + drm_dep_queue_remove_job(q, job);
> + }
> + }
> +
> + if (removed) {
> + drm_dep_queue_run_job_queue(q);
> + drm_dep_job_put(job);
> + } else {
> + drm_dep_queue_put_job_queue(q);
> + }
> +
> + if (irq_safe)
> + drm_dep_job_put(job);
> +}
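The credit accounting here pairs with drm_dep_queue_run_job(): submission adds job->credits, completion subtracts them, so the counter is zero exactly when no submitted job is outstanding, which is the teardown invariant WARNed on in drm_dep_queue_assert_teardown_invariants(). A userspace sketch of the pairing (illustrative names only):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Credit counter pairing between submission and completion paths. */
struct credit_counter { atomic_int count; };

static void job_submitted(struct credit_counter *c, int credits)
{
	atomic_fetch_add(&c->count, credits);	/* run_job side */
}

static void job_completed(struct credit_counter *c, int credits)
{
	atomic_fetch_sub(&c->count, credits);	/* job_done side */
}

static bool queue_drained(struct credit_counter *c)
{
	return atomic_load(&c->count) == 0;	/* teardown invariant */
}
```
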
> +
> +/**
> + * drm_dep_job_done_cb() - dma_fence callback to complete a job
> + * @f: the hardware fence that signalled
> + * @cb: fence callback embedded in the dep job
> + *
> + * Extracts the job from @cb and calls drm_dep_job_done() with
> + * @f->error as the result.
> + *
> + * Context: Any context; called with IRQs disabled. May not sleep.
> + */
> +static void drm_dep_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct drm_dep_job *job = container_of(cb, struct drm_dep_job, cb);
> +
> + drm_dep_job_done(job, f->error);
> +}
> +
> +/**
> + * drm_dep_queue_run_job() - submit a job to hardware and set up
> + * completion tracking
> + * @q: dep queue
> + * @job: job to run
> + *
> + * Accounts @job->credits against the queue, appends the job to the pending
> + * list, then calls @q->ops->run_job(). The TDR timer is started only when
> + * @job is the first entry on the pending list; subsequent jobs added while
> + * a TDR is already in flight do not reset the timer (which would otherwise
> + * extend the deadline for the already-running head job). Stores the returned
> + * hardware fence as the parent of the job's dep fence, then installs
> + * drm_dep_job_done_cb() on it. If the hardware fence is already signalled
> + * (%-ENOENT from dma_fence_add_callback()) or run_job() returns NULL/error,
> + * the job is completed immediately. Must be called under @q->sched.lock.
> + *
> + * Context: Process context. Must hold @q->sched.lock (a mutex). DMA fence
> + * signaling path.
> + */
> +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + int r;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + drm_dep_job_get(job);
> + atomic_add(job->credits, &q->credit.count);
> +
> + scoped_guard(spinlock_irq, &q->job.lock) {
> + bool first = list_empty(&q->job.pending);
> +
> + list_add_tail(&job->pending_link, &q->job.pending);
> + if (first)
> + drm_queue_start_timeout(q);
> + }
> +
> + fence = q->ops->run_job(job);
> + drm_dep_fence_set_parent(job->dfence, fence);
> +
> + if (!IS_ERR_OR_NULL(fence)) {
> + r = dma_fence_add_callback(fence, &job->cb,
> + drm_dep_job_done_cb);
> + if (r == -ENOENT)
> + drm_dep_job_done(job, fence->error);
> + else if (r)
> + drm_err(q->drm, "fence add callback failed (%d)\n", r);
> + dma_fence_put(fence);
> + } else {
> + drm_dep_job_done(job, IS_ERR(fence) ? PTR_ERR(fence) : 0);
> + }
> +
> + /*
> + * Drop all input dependency fences now, in process context, before the
> + * final job put. Once the job is on the pending list its last reference
> + * may be dropped from a dma_fence callback (IRQ context), where calling
> + * xa_destroy() would be unsafe.
> + */
I assume that “pending” is the list of jobs that have been handed to the driver
via ops->run_job()?
Can’t this problem be solved by not doing anything inside a dma_fence callback
other than scheduling the queue worker?
> + drm_dep_job_drop_dependencies(job);
> + drm_dep_job_put(job);
> +}
> +
> +/**
> + * drm_dep_queue_push_job() - enqueue a job on the SPSC submission queue
> + * @q: dep queue
> + * @job: job to push
> + *
> + * Pushes @job onto the SPSC queue. If the queue was previously empty
> + * (i.e. this is the first pending job), kicks the run_job worker so it
> + * processes the job promptly without waiting for the next wakeup.
> + * May be called with or without @q->sched.lock held.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> +{
> + /*
> + * spsc_queue_push() returns true if the queue was previously empty,
> + * i.e. this is the first pending job. Kick the run_job worker so it
> + * picks it up without waiting for the next wakeup.
> + */
> + if (spsc_queue_push(&q->job.queue, &job->queue_node))
> + drm_dep_queue_run_job_queue(q);
> +}
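The "kick only on the empty-to-non-empty transition" pattern relies on spsc_queue_push() reporting whether the queue was previously empty. A userspace sketch using a plain singly linked FIFO (not the kernel's spsc_queue, and single-producer locking is assumed, not shown):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal FIFO whose push reports the empty-to-non-empty transition,
 * mirroring why drm_dep_queue_push_job() only kicks the run_job worker
 * for the first pending job. */
struct node { struct node *next; };
struct fifo { struct node *head, *tail; };

static bool fifo_push(struct fifo *f, struct node *n)
{
	bool was_empty = (f->head == NULL);

	n->next = NULL;
	if (was_empty)
		f->head = n;
	else
		f->tail->next = n;
	f->tail = n;
	return was_empty;	/* true: caller should kick the worker */
}
```
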
> +
> +/**
> + * drm_dep_init() - module initialiser
> + *
> + * Allocates the module-private dep_free_wq unbound workqueue used for
> + * deferred queue teardown.
> + *
> + * Return: 0 on success, %-ENOMEM if workqueue allocation fails.
> + */
> +static int __init drm_dep_init(void)
> +{
> + dep_free_wq = alloc_workqueue("drm_dep_free", WQ_UNBOUND, 0);
> + if (!dep_free_wq)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +/**
> + * drm_dep_exit() - module exit
> + *
> + * Destroys the module-private dep_free_wq workqueue.
> + */
> +static void __exit drm_dep_exit(void)
> +{
> + destroy_workqueue(dep_free_wq);
> + dep_free_wq = NULL;
> +}
> +
> +module_init(drm_dep_init);
> +module_exit(drm_dep_exit);
> +
> +MODULE_DESCRIPTION("DRM dependency queue");
> +MODULE_LICENSE("Dual MIT/GPL");
> diff --git a/drivers/gpu/drm/dep/drm_dep_queue.h b/drivers/gpu/drm/dep/drm_dep_queue.h
> new file mode 100644
> index 000000000000..e5c217a3fab5
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_queue.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_QUEUE_H_
> +#define _DRM_DEP_QUEUE_H_
> +
> +#include <linux/types.h>
> +
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +
> +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> + struct drm_dep_job *job);
> +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> +
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q);
> +void drm_dep_queue_push_job_end(struct drm_dep_queue *q);
> +#else
> +static inline void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> +{
> +}
> +static inline void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> +{
> +}
> +#endif
> +
> +#endif /* _DRM_DEP_QUEUE_H_ */
> diff --git a/include/drm/drm_dep.h b/include/drm/drm_dep.h
> new file mode 100644
> index 000000000000..615926584506
> --- /dev/null
> +++ b/include/drm/drm_dep.h
> @@ -0,0 +1,597 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_H_
> +#define _DRM_DEP_H_
> +
> +#include <drm/spsc_queue.h>
> +#include <linux/dma-fence.h>
> +#include <linux/xarray.h>
> +#include <linux/workqueue.h>
> +
> +enum dma_resv_usage;
> +struct dma_resv;
> +struct drm_dep_fence;
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +struct drm_file;
> +struct drm_gem_object;
> +
> +/**
> + * enum drm_dep_timedout_stat - return value of &drm_dep_queue_ops.timedout_job
> + * @DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED: driver signaled the job's finished
> + * fence during reset; drm_dep may safely drop its reference to the job.
> + * @DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB: timeout was a false alarm; reinsert the
> + * job at the head of the pending list so it can complete normally.
> + */
> +enum drm_dep_timedout_stat {
> + DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED,
> + DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB,
> +};
> +
> +/**
> + * struct drm_dep_queue_ops - driver callbacks for a dep queue
> + */
> +struct drm_dep_queue_ops {
> + /**
> + * @run_job: submit the job to hardware. Returns the hardware completion
> + * fence (with a reference held for the scheduler), or NULL/ERR_PTR on
> + * synchronous completion or error.
> + */
> + struct dma_fence *(*run_job)(struct drm_dep_job *job);
> +
> + /**
> + * @timedout_job: called when the TDR fires for the head job. Must stop
> + * the hardware, then return %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
> + * job's fence was signalled during reset, or
> + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB if the timeout was spurious or
> + * signalling was otherwise delayed, and the job should be re-inserted
> + * at the head of the pending list. Any other value triggers a WARN.
> + */
> + enum drm_dep_timedout_stat (*timedout_job)(struct drm_dep_job *job);
> +
> + /**
> + * @release: called when the last kref on the queue is dropped and
> + * drm_dep_queue_fini() has completed. The driver is responsible for
> + * removing @q from any internal bookkeeping, calling
> + * drm_dep_queue_release(), and then freeing the memory containing @q
> + * (e.g. via kfree_rcu() using @q->rcu). If NULL, drm_dep calls
> + * drm_dep_queue_release() and frees @q automatically via kfree_rcu().
> + * Use this when the queue is embedded in a larger structure.
> + */
> + void (*release)(struct drm_dep_queue *q);
> +
> + /**
> + * @fini: if set, called instead of drm_dep_queue_fini() when the last
> + * kref is dropped. The driver is responsible for calling
> + * drm_dep_queue_fini() itself after it is done with the queue. Use this
> + * when additional teardown logic must run before fini (e.g., cleanup
> + * firmware resources associated with the queue).
> + */
> + void (*fini)(struct drm_dep_queue *q);
> +};
> +
> +/**
> + * enum drm_dep_queue_flags - flags for &drm_dep_queue and
> + * &drm_dep_queue_init_args
> + *
> + * Flags are divided into three categories:
> + *
> + * - **Private static**: set internally at init time and never changed.
> + * Drivers must not read or write these.
> + * %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ,
> + * %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ.
> + *
> + * - **Public dynamic**: toggled at runtime by drivers via accessors.
> + * Any modification must be performed under &drm_dep_queue.sched.lock.
Can’t enforce that in C.
> + * Accessor functions provide lockless (and therefore possibly stale) reads.
> + * %DRM_DEP_QUEUE_FLAGS_STOPPED,
> + * %DRM_DEP_QUEUE_FLAGS_KILLED.
> + *
> + * - **Public static**: supplied by the driver in
> + * &drm_dep_queue_init_args.flags at queue creation time and not modified
> + * thereafter.
Same here.
> + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
> + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI,
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE.
> + *
> + * @DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ: (private, static) submit workqueue was
> + * allocated by drm_dep_queue_init() and will be destroyed by
> + * drm_dep_queue_fini().
> + * @DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ: (private, static) timeout workqueue
> + * was allocated by drm_dep_queue_init() and will be destroyed by
> + * drm_dep_queue_fini().
> + * @DRM_DEP_QUEUE_FLAGS_STOPPED: (public, dynamic) the queue is stopped and
> + * will not dispatch new jobs or remove jobs from the pending list, dropping
> + * the drm_dep-owned reference. Set by drm_dep_queue_stop(), cleared by
> + * drm_dep_queue_start().
> + * @DRM_DEP_QUEUE_FLAGS_KILLED: (public, dynamic) the queue has been killed
> + * via drm_dep_queue_kill(). Any active dependency wait is cancelled
> + * immediately. Jobs continue to flow through run_job for bookkeeping
> + * cleanup, but dependency waiting is skipped so that queued work drains
> + * as quickly as possible.
> + * @DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED: (public, static) the queue supports
> + * the bypass path where eligible jobs skip the SPSC queue and run inline.
> + * @DRM_DEP_QUEUE_FLAGS_HIGHPRI: (public, static) the submit workqueue owned
> + * by the queue is created with %WQ_HIGHPRI, causing run-job and put-job
> + * workers to execute at elevated priority. Only privileged clients (e.g.
> + * drivers managing time-critical or real-time GPU contexts) should request
> + * this flag; granting it to unprivileged userspace would allow priority
> + * inversion attacks.
> + * The flag is ignored when &drm_dep_queue_init_args.submit_wq is
> + * provided, since the queue then does not create its own workqueue.
> + * @DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE: (public, static) when set,
> + * drm_dep_job_done() may be called from hardirq context (e.g. from a
> + * hardware-signalled dma_fence callback). drm_dep_job_done() will directly
> + * dequeue the job and call drm_dep_job_put() without deferring to a
> + * workqueue. The driver's &drm_dep_job_ops.release callback must therefore
> + * be safe to invoke from IRQ context.
> + */
> +enum drm_dep_queue_flags {
> + DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ = BIT(0),
> + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ = BIT(1),
> + DRM_DEP_QUEUE_FLAGS_STOPPED = BIT(2),
> + DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED = BIT(3),
> + DRM_DEP_QUEUE_FLAGS_HIGHPRI = BIT(4),
> + DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE = BIT(5),
> + DRM_DEP_QUEUE_FLAGS_KILLED = BIT(6),
> +};
> +
> +/**
> + * struct drm_dep_queue - a dependency-tracked GPU submission queue
> + *
> + * Combines the role of &drm_gpu_scheduler and &drm_sched_entity into a single
> + * object. Each queue owns a submit workqueue (or borrows one), a timeout
> + * workqueue, an SPSC submission queue, and a pending-job list used for TDR.
> + *
> + * Initialise with drm_dep_queue_init(), tear down with drm_dep_queue_fini().
> + * Reference counted via drm_dep_queue_get() / drm_dep_queue_put().
> + *
> + * All fields are **opaque to drivers**. Do not read or write any field
Can’t enforce this in C.
> + * directly; use the provided helper functions instead. The sole exception
> + * is @rcu, which drivers may pass to kfree_rcu() when the queue is embedded
> + * inside a larger driver-managed structure and the &drm_dep_queue_ops.release
> + * vfunc performs an RCU-deferred free.
> + */
> +struct drm_dep_queue {
> + /** @ops: driver callbacks, set at init time. */
> + const struct drm_dep_queue_ops *ops;
> + /** @name: human-readable name used for workqueue and fence naming. */
> + const char *name;
> + /** @drm: owning DRM device; a drm_dev_get() reference is held for the
> + * lifetime of the queue to prevent module unload while queues are live.
> + */
> + struct drm_device *drm;
> + /** @refcount: reference count; use drm_dep_queue_get/put(). */
> + struct kref refcount;
> + /**
> + * @free_work: deferred teardown work queued unconditionally by
> + * drm_dep_queue_fini() onto the module-private dep_free_wq. The work
> + * item disables pending workers synchronously and destroys any owned
> + * workqueues before releasing the queue memory and dropping the
> + * drm_dev_get() reference. Running on dep_free_wq ensures
> + * destroy_workqueue() is never called from within one of the queue's
> + * own workers.
> + */
> + struct work_struct free_work;
> + /**
> + * @rcu: RCU head for deferred freeing.
> + *
> + * This is the **only** field drivers may access directly. When the
We can enforce this in Rust at compile time.
> + * queue is embedded in a larger structure, implement
> + * &drm_dep_queue_ops.release, call drm_dep_queue_release() to destroy
> + * internal resources, then pass this field to kfree_rcu() so that any
> + * in-flight RCU readers referencing the queue's dma_fence timeline name
> + * complete before the memory is returned. All other fields must be
> + * accessed through the provided helpers.
> + */
> + struct rcu_head rcu;
> +
> + /** @sched: scheduling and workqueue state. */
> + struct {
> + /** @sched.submit_wq: ordered workqueue for run/put-job work. */
> + struct workqueue_struct *submit_wq;
> + /** @sched.timeout_wq: workqueue for the TDR delayed work. */
> + struct workqueue_struct *timeout_wq;
> + /**
> + * @sched.run_job: work item that dispatches the next queued
> + * job.
> + */
> + struct work_struct run_job;
> + /** @sched.put_job: work item that frees finished jobs. */
> + struct work_struct put_job;
> + /** @sched.tdr: delayed work item for timeout/reset (TDR). */
> + struct delayed_work tdr;
> + /**
> + * @sched.lock: mutex serialising job dispatch, bypass
> + * decisions, stop/start, and flag updates.
> + */
> + struct mutex lock;
> + /**
> + * @sched.flags: bitmask of &enum drm_dep_queue_flags.
> + * Any modification after drm_dep_queue_init() must be
> + * performed under @sched.lock.
> + */
> + enum drm_dep_queue_flags flags;
> + } sched;
> +
> + /** @job: pending-job tracking state. */
> + struct {
> + /**
> + * @job.pending: list of jobs that have been dispatched to
> + * hardware and not yet freed. Protected by @job.lock.
> + */
> + struct list_head pending;
> + /**
> + * @job.queue: SPSC queue of jobs waiting to be dispatched.
> + * Producers push via drm_dep_queue_push_job(); the run_job
> + * work item pops from the consumer side.
> + */
> + struct spsc_queue queue;
> + /**
> + * @job.lock: spinlock protecting @job.pending, TDR start, and
> + * the %DRM_DEP_QUEUE_FLAGS_STOPPED flag. Always acquired with
> + * irqsave (spin_lock_irqsave / spin_unlock_irqrestore) to
> + * support %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE queues where
> + * drm_dep_job_done() may run from hardirq context.
> + */
> + spinlock_t lock;
> + /**
> + * @job.timeout: per-job TDR timeout in jiffies.
> + * %MAX_SCHEDULE_TIMEOUT means no timeout.
> + */
> + long timeout;
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> + /**
> + * @job.push: lockdep annotation tracking the arm-to-push
> + * critical section.
> + */
> + struct {
> + /**
> + * @job.push.owner: task that currently holds the push
> + * context, used to assert single-owner invariants.
> + * NULL when idle.
> + */
> + struct task_struct *owner;
> + } push;
> +#endif
> + } job;
> +
> + /** @credit: hardware credit accounting. */
> + struct {
> + /** @credit.limit: maximum credits the queue can hold. */
> + u32 limit;
> + /** @credit.count: credits currently in flight (atomic). */
> + atomic_t count;
> + } credit;
> +
> + /** @dep: current blocking dependency for the head SPSC job. */
> + struct {
> + /**
> + * @dep.fence: fence being waited on before the head job can
> + * run. NULL when no dependency is pending.
> + */
> + struct dma_fence *fence;
> + /**
> + * @dep.removed_fence: dependency fence whose callback has been
> + * removed. The run-job worker must drop its reference to this
> + * fence before proceeding to call run_job.
We can enforce this in Rust automatically.
> + */
> + struct dma_fence *removed_fence;
> + /** @dep.cb: callback installed on @dep.fence. */
> + struct dma_fence_cb cb;
> + } dep;
> +
> + /** @fence: fence context and sequence number state. */
> + struct {
> + /**
> + * @fence.seqno: next sequence number to assign, incremented
> + * each time a job is armed.
> + */
> + u32 seqno;
> + /**
> + * @fence.context: base DMA fence context allocated at init
> + * time. Finished fences use this context.
> + */
> + u64 context;
> + } fence;
> +};
> +
> +/**
> + * struct drm_dep_queue_init_args - arguments for drm_dep_queue_init()
> + */
> +struct drm_dep_queue_init_args {
> + /** @ops: driver callbacks; must not be NULL. */
> + const struct drm_dep_queue_ops *ops;
> + /** @name: human-readable name for workqueues and fence timelines. */
> + const char *name;
> + /** @drm: owning DRM device. A drm_dev_get() reference is taken at
> + * queue init and released when the queue is freed, preventing module
> + * unload while any queue is still alive.
> + */
> + struct drm_device *drm;
> + /**
> + * @submit_wq: workqueue for job dispatch. If NULL, an ordered
> + * workqueue is allocated and owned by the queue. If non-NULL, the
> + * workqueue must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> + * drm_dep_queue_init() returns %-EINVAL otherwise.
> + */
> + struct workqueue_struct *submit_wq;
> + /**
> + * @timeout_wq: workqueue for TDR. If NULL, an ordered workqueue
> + * is allocated and owned by the queue. If non-NULL, the workqueue
> + * must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> + * drm_dep_queue_init() returns %-EINVAL otherwise.
> + */
> + struct workqueue_struct *timeout_wq;
> + /** @credit_limit: maximum hardware credits; must be non-zero. */
> + u32 credit_limit;
> + /**
> + * @timeout: per-job TDR timeout in jiffies. Zero means no timeout
> + * (%MAX_SCHEDULE_TIMEOUT is used internally).
> + */
> + long timeout;
> + /**
> + * @flags: initial queue flags. %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ
> + * and %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ are managed internally
> + * and will be ignored if set here. Setting
> + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI requests a high-priority submit
> + * workqueue; drivers must only set this for privileged clients.
> + */
> + enum drm_dep_queue_flags flags;
> +};
> +
> +/**
> + * struct drm_dep_job_ops - driver callbacks for a dep job
> + */
> +struct drm_dep_job_ops {
> + /**
> + * @release: called when the last reference to the job is dropped.
> + *
> + * If set, the driver is responsible for freeing the job. If NULL,
And if they don’t?
By the way, we can also enforce this in Rust.
> + * drm_dep_job_put() will call kfree() on the job directly.
> + */
> + void (*release)(struct drm_dep_job *job);
> +};
> +
> +/**
> + * struct drm_dep_job - a unit of work submitted to a dep queue
> + *
> + * All fields are **opaque to drivers**. Do not read or write any field
> + * directly; use the provided helper functions instead.
> + */
> +struct drm_dep_job {
> + /** @ops: driver callbacks for this job. */
> + const struct drm_dep_job_ops *ops;
> + /** @refcount: reference count, managed by drm_dep_job_get/put(). */
> + struct kref refcount;
> + /**
> + * @dependencies: xarray of &dma_fence dependencies before the job can
> + * run.
> + */
> + struct xarray dependencies;
> + /** @q: the queue this job is submitted to. */
> + struct drm_dep_queue *q;
> + /** @queue_node: SPSC queue linkage for pending submission. */
> + struct spsc_node queue_node;
> + /**
> + * @pending_link: list entry in the queue's pending job list. Protected
> + * by @job.q->job.lock.
> + */
> + struct list_head pending_link;
> + /** @dfence: finished fence for this job. */
> + struct drm_dep_fence *dfence;
> + /** @cb: fence callback used to watch for dependency completion. */
> + struct dma_fence_cb cb;
> + /** @credits: number of credits this job consumes from the queue. */
> + u32 credits;
> + /**
> + * @last_dependency: index into @dependencies of the next fence to
> + * check. Advanced by drm_dep_queue_job_dependency() as each
> + * dependency is consumed.
> + */
> + u32 last_dependency;
> + /**
> + * @invalidate_count: number of times this job has been invalidated.
> + * Incremented by drm_dep_job_invalidate_job().
> + */
> + u32 invalidate_count;
> + /**
> + * @signalling_cookie: return value of dma_fence_begin_signalling()
> + * captured in drm_dep_job_arm() and consumed by drm_dep_job_push().
> + * Not valid outside the arm→push window.
> + */
> + bool signalling_cookie;
> +};
> +
> +/**
> + * struct drm_dep_job_init_args - arguments for drm_dep_job_init()
> + */
> +struct drm_dep_job_init_args {
> + /**
> + * @ops: driver callbacks for the job, or NULL for default behaviour.
> + */
> + const struct drm_dep_job_ops *ops;
> + /** @q: the queue to associate the job with. A reference is taken. */
> + struct drm_dep_queue *q;
> + /** @credits: number of credits this job consumes; must be non-zero. */
> + u32 credits;
> +};
> +
> +/* Queue API */
> +
> +/**
> + * drm_dep_queue_sched_guard() - acquire the queue scheduler lock as a guard
> + * @__q: dep queue whose scheduler lock to acquire
> + *
> + * Acquires @__q->sched.lock as a scoped mutex guard (released automatically
> + * when the enclosing scope exits). This lock serialises all scheduler state
> + * transitions — stop/start/kill flag changes, bypass-path decisions, and the
> + * run-job worker — so it must be held when the driver needs to atomically
> + * inspect or modify queue state in relation to job submission.
> + *
> + * **When to use**
> + *
> + * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
> + * serialise their own submit work against the bypass path must acquire this
> + * guard. Without it, a concurrent caller of drm_dep_job_push() could take
> + * the bypass path and call ops->run_job() inline between the driver's
> + * eligibility check and its corresponding action, producing a race.
So if you’re not careful, you have just introduced a race :/
> + *
> + * **Constraint: only from submit_wq worker context**
> + *
> + * This guard must only be acquired by drivers from a work item running
> + * on the queue's submit workqueue (@q->sched.submit_wq).
> + *
> + * Context: Process context only; must be called from submit_wq work by
> + * drivers.
> + */
> +#define drm_dep_queue_sched_guard(__q) \
> + guard(mutex)(&(__q)->sched.lock)
> +
> +int drm_dep_queue_init(struct drm_dep_queue *q,
> + const struct drm_dep_queue_init_args *args);
> +void drm_dep_queue_fini(struct drm_dep_queue *q);
> +void drm_dep_queue_release(struct drm_dep_queue *q);
> +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q);
> +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q);
> +void drm_dep_queue_put(struct drm_dep_queue *q);
> +void drm_dep_queue_stop(struct drm_dep_queue *q);
> +void drm_dep_queue_start(struct drm_dep_queue *q);
> +void drm_dep_queue_kill(struct drm_dep_queue *q);
> +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q);
> +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q);
> +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q);
> +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> + struct work_struct *work);
> +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q);
> +bool drm_dep_queue_is_killed(struct drm_dep_queue *q);
> +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q);
> +void drm_dep_queue_set_stopped(struct drm_dep_queue *q);
> +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q);
> +long drm_dep_queue_timeout(const struct drm_dep_queue *q);
> +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q);
> +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q);
> +
> +/* Job API */
> +
> +/**
> + * DRM_DEP_JOB_FENCE_PREALLOC - sentinel value for pre-allocating a dependency slot
> + *
> + * Pass this to drm_dep_job_add_dependency() instead of a real fence to
> + * pre-allocate a slot in the job's dependency xarray during the preparation
> + * phase (where GFP_KERNEL is available). The returned xarray index identifies
> + * the slot. Call drm_dep_job_replace_dependency() later — inside a
> + * dma_fence_begin_signalling() region if needed — to swap in the real fence
> + * without further allocation.
> + *
> + * This sentinel is never treated as a dma_fence; it carries no reference count
> + * and must not be passed to dma_fence_put(). It is only valid as an argument
> + * to drm_dep_job_add_dependency() and as the expected stored value checked by
> + * drm_dep_job_replace_dependency().
> + */
> +#define DRM_DEP_JOB_FENCE_PREALLOC ((struct dma_fence *)-1)
> +
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args);
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job);
> +void drm_dep_job_put(struct drm_dep_job *job);
> +void drm_dep_job_arm(struct drm_dep_job *job);
> +void drm_dep_job_push(struct drm_dep_job *job);
> +int drm_dep_job_add_dependency(struct drm_dep_job *job,
> + struct dma_fence *fence);
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence);
> +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> + struct drm_file *file, u32 handle,
> + u32 point);
> +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> + struct dma_resv *resv,
> + enum dma_resv_usage usage);
> +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> + struct drm_gem_object *obj,
> + bool write);
> +bool drm_dep_job_is_signaled(struct drm_dep_job *job);
> +bool drm_dep_job_is_finished(struct drm_dep_job *job);
> +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold);
> +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job);
> +
> +/**
> + * struct drm_dep_queue_pending_job_iter - iterator state for
> + * drm_dep_queue_for_each_pending_job()
> + * @q: queue being iterated
> + */
> +struct drm_dep_queue_pending_job_iter {
> + struct drm_dep_queue *q;
> +};
> +
> +/* Drivers should never call this directly */
Not enforceable in C.
> +static inline struct drm_dep_queue_pending_job_iter
> +__drm_dep_queue_pending_job_iter_begin(struct drm_dep_queue *q)
> +{
> + struct drm_dep_queue_pending_job_iter iter = {
> + .q = q,
> + };
> +
> + WARN_ON(!drm_dep_queue_is_stopped(q));
> + return iter;
> +}
> +
> +/* Drivers should never call this directly */
> +static inline void
> +__drm_dep_queue_pending_job_iter_end(struct drm_dep_queue_pending_job_iter iter)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(iter.q));
> +}
> +
> +/* clang-format off */
> +DEFINE_CLASS(drm_dep_queue_pending_job_iter,
> + struct drm_dep_queue_pending_job_iter,
> + __drm_dep_queue_pending_job_iter_end(_T),
> + __drm_dep_queue_pending_job_iter_begin(__q),
> + struct drm_dep_queue *__q);
> +/* clang-format on */
> +static inline void *
> +class_drm_dep_queue_pending_job_iter_lock_ptr(
> + class_drm_dep_queue_pending_job_iter_t *_T)
> +{ return _T; }
> +#define class_drm_dep_queue_pending_job_iter_is_conditional false
> +
> +/**
> + * drm_dep_queue_for_each_pending_job() - iterate over all pending jobs
> + * in a queue
> + * @__job: loop cursor, a &struct drm_dep_job pointer
> + * @__q: &struct drm_dep_queue to iterate
> + *
> + * Iterates over every job currently on @__q->job.pending. The queue must be
> + * stopped (drm_dep_queue_stop() called) before using this iterator; a WARN_ON
> + * fires at the start and end of the scope if it is not.
> + *
> + * Context: Any context.
> + */
> +#define drm_dep_queue_for_each_pending_job(__job, __q) \
> + scoped_guard(drm_dep_queue_pending_job_iter, (__q)) \
> + list_for_each_entry((__job), &(__q)->job.pending, pending_link)
> +
> +#endif
> --
> 2.34.1
>
By the way:
I invite you to have a look at this implementation [0]. It currently works on real
hardware, i.e., our downstream "Tyr" driver for Arm Mali is using it at the
moment. It is a mere prototype that we’ve put together to test different
approaches, so it’s not meant to be a “solution” at all. It’s a mere data point
for further discussion.
Philipp Stanner is working on this “Job Queue” concept too, but from an upstream
perspective.
[0]: https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/61
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 10:25 ` Danilo Krummrich
@ 2026-03-17 5:10 ` Matthew Brost
2026-03-17 12:19 ` Danilo Krummrich
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-17 5:10 UTC (permalink / raw)
To: Danilo Krummrich
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon, Mar 16, 2026 at 11:25:23AM +0100, Danilo Krummrich wrote:
> On Mon Mar 16, 2026 at 5:32 AM CET, Matthew Brost wrote:
> > Diverging requirements between GPU drivers using firmware scheduling
> > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > rules, and numerous driver-specific hacks have rendered
> > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > firmware-scheduled GPU drivers—one that addresses all of the
> > aforementioned shortcomings.
>
> I think we all agree on this and I also think we all agree that this should have
> been a separate component in the first place -- and just to be clear, I am
> saying this in retrospective.
Yes. Tvrtko actually suggested this years ago, and in my naïveté I
rejected it. I’m eating my hat here.
>
> In fact, this is also the reason why I proposed building the Rust component
> differently, i.e. start with a Joqueue (or drm_dep as called in this patch) and
> expand as needed with a loosely coupled "orchestrator" for drivers with strictly
> limited software/hardware queues later.
Yes, I actually have a hardware-scheduling layer built on top of drm_dep
[1] after hacking for several hours today. It’s very unlikely to
actually work since I’m typing blind without a test platform, but it
conceptually proves that this layer separation works and is clean.
[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966
>
> The reason I proposed a new component for Rust, is basically what you also wrote
> in your cover letter, plus the fact that it prevents us having to build a Rust
> abstraction layer to the DRM GPU scheduler.
>
> The latter I identified as pretty questionable as building another abstraction
> layer on top of some infrastructure is really something that you only want to do
> when it is mature enough in terms of lifetime and ownership model.
>
I personally don’t think the language matters that much. I care about
lifetime, ownership, and teardown semantics. I believe I’ve made this
clear in C, so the Rust bindings should be trivial.
> I'm not saying it wouldn't be possible, but as mentioned in other threads, I
> don't think it is a good idea building new features on top of something that has
> known problems, even less when they are barely resolvable due to other existing
> dependencies, such as some drivers relying on implementation details
> historically, etc.
>
It’s a new component, well thought out and without any baggage, so I
don’t understand the above statement. Invariants and annotations
everywhere (e.g., you cannot abuse this).
> My point is, the justification for a new Jobqueue component in Rust I consider
> given by the fact that it allows us to avoid building another abstraction layer
> on top of DRM sched. Additionally, DRM moves to Rust and gathering experience
> with building native Rust components seems like a good synergy in this context.
>
If I knew Rust off-hand, I would have written it in Rust :). Perhaps
this is an opportunity to learn. But I think the Rust vs. C holy war
isn’t in scope here. The real questions are what semantics we want, the
timeline, and maintainability. Certainly more people know C, and most
drivers are written in C, so having the common component in C makes more
sense at this point, in my opinion. If the objection is really about the
language, I’ll rewrite it in Rust.
> Having that said, the obvious question for me for this series is how drm_dep
> fits into the bigger picture.
>
> I.e. what is the maintenance strategy?
>
I will commit to maintaining code I believe in, and immediately write
the bindings on top of this so they’re maintained from day one.
> Do we want to support three components allowing users to do the same thing? What
> happens to DRM sched for 1:1 entity / scheduler relationships?
>
> Is it worth it? Do we have enough C users to justify the maintenance of yet
> another component? (Again, DRM moves into the direction of Rust drivers, so I
> don't know how many new C drivers we will see.) I.e. having this component won't
> get us rid of the majority of DRM sched users.
>
Actually, with [1], I’m fairly certain that pretty much every driver
could convert to this new code. Part of the problem, though, is that
when looking at this, multiple drivers clearly break dma-fencing rules,
so an annotated component like DRM dep would explode their drivers. Not
to mention the many driver-side hacks that each individual driver would
need to drop (e.g., I would not be receptive to any driver directly
touching drm_dep object structs).
> What are the expected improvements? Given the above, I'm not sure it will
Clear object model and lifetimes — therefore memory-safe. Bypass paths
for compositors, compute, and kernel page-fault handlers. No kicking a
worker just to drop a ref; asynchronous teardown (e.g., user hits Ctrl‑C
and returns). Reclaim-safe final puts of queues, and built‑in
driver-unload barriers. Maintainable, as I understand every single LOC,
with verbose documentation (generated with Copilot, but I’ve reviewed it
multiple times and it’s correct), etc.
Regardless, given all of the above, at a minimum my driver needs to move
on one way or another.
> actually decrease the maintenance burden of DRM sched.
We can deprecate DRM sched, which is now possible as of [1]. I can
commit to compile-testing most drivers, aside from the ones with the
horrible hacks.
Matt
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 9:16 ` Boris Brezillon
@ 2026-03-17 5:22 ` Matthew Brost
2026-03-17 8:48 ` Boris Brezillon
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-17 5:22 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon, Mar 16, 2026 at 10:16:01AM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Sun, 15 Mar 2026 21:32:45 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > Diverging requirements between GPU drivers using firmware scheduling
> > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > rules, and numerous driver-specific hacks have rendered
> > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > firmware-scheduled GPU drivers—one that addresses all of the
> > aforementioned shortcomings.
> >
> > Add drm_dep, a lightweight GPU submission queue intended as a
> > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > from the queue (drm_sched_entity) into two objects requiring external
> > coordination, drm_dep merges both roles into a single struct
> > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > that is unnecessary for firmware schedulers which manage their own
> > run-lists internally.
> >
> > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > management by the driver, drm_dep uses reference counting (kref) on both
> > queues and jobs to guarantee object lifetime safety. A job holds a queue
> > reference from init until its last put, and the queue holds a job reference
> > from dispatch until the put_job worker runs. This makes use-after-free
> > impossible even when completion arrives from IRQ context or concurrent
> > teardown is in flight.
> >
> > The core objects are:
> >
> > struct drm_dep_queue - a per-context submission queue owning an
> > ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> > queue, and a pending-job list. Reference counted; drivers can embed
> > it and provide a .release vfunc for RCU-safe teardown.
>
> First off, I like this idea, and actually think we should have done that
> from the start rather than trying to bend drm_sched to meet our
Yes. Tvrtko actually suggested this years ago, and in my naïveté I
rejected it. I’m eating my hat here.
> FW-assisted scheduling model. That's also the direction Danilo and I
> have been pushing for with the new JobQueue stuff in rust, so I'm glad
> to see some consensus here.
>
> Now, let's start with the usual naming nitpick :D => can't we find a
> better prefix than "drm_dep"? I think I get where "dep" comes from (the
> logic mostly takes care of job deps, and acts as a FIFO otherwise, no
> real scheduling involved). It's kinda okay for drm_dep_queue, even
> though, according to the description you've made, jobs seem to stay in
> that queue even after their deps are met, which, IMHO, is a bit
> confusing: dep_queue sounds like a queue in which jobs are placed until
> their deps are met, and then the job moves to some other queue.
>
> It gets worse for drm_dep_job, which sounds like a dep-only job, rather
> than a job that's queued to the drm_dep_queue. Same goes for
> drm_dep_fence, which I find super confusing. What this one does is just
> proxy the driver fence to provide proper isolation between GPU drivers
> and fence observers (other drivers).
>
> Since this new model is primarily designed for hardware that has
> FW-assisted scheduling, how about drm_fw_queue, drm_fw_job,
> drm_fw_job_fence?
We can bikeshed — I’m open to other names, but I believe hardware
scheduling can be built quite cleanly on top of this, so drm_fw_*
doesn’t really work either. Check out a hardware-scheduler PoC built
(today) on top of this in [1].
[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966
>
> >
> > struct drm_dep_job - a single unit of GPU work. Drivers embed this
> > and provide a .release vfunc. Jobs carry an xarray of input
> > dma_fence dependencies and produce a drm_dep_fence as their
> > finished fence.
> >
> > struct drm_dep_fence - a dma_fence subclass wrapping an optional
> > parent hardware fence. The finished fence is armed (sequence
> > number assigned) before submission and signals when the hardware
> > fence signals (or immediately on synchronous completion).
> >
> > Job lifecycle:
> > 1. drm_dep_job_init() - allocate and initialise; job acquires a
> > queue reference.
> > 2. drm_dep_job_add_dependency() and friends - register input fences;
> > duplicates from the same context are deduplicated.
> > 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> > 4. drm_dep_job_push() - submit to queue.
> >
> > Submission paths under queue lock:
> > - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> > SPSC queue is empty, no dependencies are pending, and credits are
> > available, the job is dispatched inline on the calling thread.
>
> I've yet to look at the code, but I must admit I'm less worried about
> this fast path if it's part of a new model restricted to FW-assisted
> scheduling. I keep thinking we're not entirely covered for so called
> real-time GPU contexts that might have jobs that are not dep-free, and
> if we're going for something new, I'd really like us to consider that
> case from the start (maybe investigate if kthread_work[er] can be used
> as a replacement for workqueues, if RT priority on workqueues is not an
> option).
>
I mostly agree, and I’ll look into whether kthread_work is better
suited; if that’s the right model, it should be adopted up front.
But can you give a use case for real-time GPU contexts that are not
dep-free? I personally don’t know of one.
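For reference, a rough sketch of what a kthread_worker-based submit path could
look like if RT priority is required. This is not from the patch; the wiring and
names (my_queue, my_run_job_work) are hypothetical, though kthread_create_worker(),
sched_set_fifo(), kthread_init_work(), and kthread_queue_work() are existing
kernel APIs:

```c
/* Hypothetical sketch only: kthread_worker in place of an ordered workqueue,
 * which allows RT scheduling on the backing task. Needs <linux/kthread.h>
 * and <linux/sched.h>. */
struct my_queue {
	struct kthread_worker *worker;
	struct kthread_work run_work;
};

static void my_run_job_work(struct kthread_work *w)
{
	/* resolve deps, call into the driver's run_job, etc. */
}

static int my_queue_init(struct my_queue *q, bool rt)
{
	q->worker = kthread_create_worker(0, "my-submit");
	if (IS_ERR(q->worker))
		return PTR_ERR(q->worker);

	if (rt)
		sched_set_fifo(q->worker->task); /* RT priority, unlike a workqueue */

	kthread_init_work(&q->run_work, my_run_job_work);
	return 0;
}

static void my_queue_kick(struct my_queue *q)
{
	kthread_queue_work(q->worker, &q->run_work);
}
```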
> > - Queued path: job is pushed onto the SPSC queue and the run_job
> > worker is kicked. The worker resolves remaining dependencies
> > (installing wakeup callbacks for unresolved fences) before calling
> > ops->run_job().
> >
> > Credit-based throttling prevents hardware overflow: each job declares
> > a credit cost at init time; dispatch is deferred until sufficient
> > credits are available.
> >
> > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > expiry for device teardown.
> >
> > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > dma_fence callback). Dependency cleanup is deferred to process context
> > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> >
> > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > bail immediately if the queue refcount has already reached zero and
> > async teardown is in flight, preventing use-after-free.
> >
> > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > so that destroy_workqueue() is never called from within one of the
> > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > owning struct drm_device, released as the final step of teardown via
> > drm_dev_put(). This prevents the driver module from being unloaded
> > while any queue is still alive without requiring a separate drain API.
>
> Thanks for posting this RFC. I'll try to have a closer look at the code
> in the coming days, but given the diffstat, it might take me a bit of
> time...
I understand — I’m a firehose when I get started. Hopefully a sane one,
though.
Matt
>
> Regards,
>
> Boris
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 2:47 ` Daniel Almeida
@ 2026-03-17 5:45 ` Matthew Brost
2026-03-17 7:17 ` Miguel Ojeda
2026-03-17 18:14 ` Matthew Brost
2026-03-17 12:31 ` Danilo Krummrich
1 sibling, 2 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-17 5:45 UTC (permalink / raw)
To: Daniel Almeida
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
Danilo Krummrich, David Airlie, Maarten Lankhorst, Maxime Ripard,
Philipp Stanner, Simona Vetter, Sumit Semwal, Thomas Zimmermann,
linux-kernel, Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl,
Daniel Stone, Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Mon, Mar 16, 2026 at 11:47:01PM -0300, Daniel Almeida wrote:
> (+cc a few other people + Rust-for-Linux ML)
>
> Hi Matthew,
>
> I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
> is going, it is much more ergonomic to add a Rust component with a nice C
> interface than doing it the other way around.
>
Holy war? See my reply to Danilo — I’ll write this in Rust if needed,
but it’s not my first choice since I’m not yet a native speaker.
> > On 16 Mar 2026, at 01:32, Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > Diverging requirements between GPU drivers using firmware scheduling
> > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > rules, and numerous driver-specific hacks have rendered
> > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > firmware-scheduled GPU drivers—one that addresses all of the
> > aforementioned shortcomings.
> >
> > Add drm_dep, a lightweight GPU submission queue intended as a
> > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > from the queue (drm_sched_entity) into two objects requiring external
> > coordination, drm_dep merges both roles into a single struct
> > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > that is unnecessary for firmware schedulers which manage their own
> > run-lists internally.
> >
> > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > management by the driver, drm_dep uses reference counting (kref) on both
> > queues and jobs to guarantee object lifetime safety. A job holds a queue
>
> In a domain that has been plagued by lifetime issues, we really should be
Yes, drm_sched is a mess. I’ve been suggesting we fix it for years and
have met pushback. This (drm_dep), however, isn’t plagued by lifetime
issues; avoiding them is the primary focus here.
> enforcing RAII for resource management instead of manual calls.
>
You can do RAII in C - see cleanup.h. Clear object lifetimes and
ownership are what matter. Disciplined coding is the only way to achieve
that regardless of language, and RAII doesn't help with a bad object /
ownership / lifetime model either.
I don't buy the "Rust solves everything" argument, but again, I'm not a
native speaker.
> > reference from init until its last put, and the queue holds a job reference
> > from dispatch until the put_job worker runs. This makes use-after-free
> > impossible even when completion arrives from IRQ context or concurrent
> > teardown is in flight.
>
> It makes use-after-free impossible _if_ you’re careful. It is not a
> property of the type system, and incorrect code will compile just fine.
>
Sure. If a driver puts a drm_dep object reference on a resource that
drm_dep owns, it will explode. That’s effectively putting a reference on
a resource the driver doesn’t own. A driver can write to any physical
memory and crash the system anyway, so I’m not really sure what we’re
talking about here. Rust doesn’t solve anything in this scenario — you
can always use an unsafe block and put a reference on a resource you
don’t own.
Object model, ownership, and lifetimes are what matter, and that is what
drm_dep is built around.
> >
> > The core objects are:
> >
> > struct drm_dep_queue - a per-context submission queue owning an
> > ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> > queue, and a pending-job list. Reference counted; drivers can embed
> > it and provide a .release vfunc for RCU-safe teardown.
> >
> > struct drm_dep_job - a single unit of GPU work. Drivers embed this
> > and provide a .release vfunc. Jobs carry an xarray of input
> > dma_fence dependencies and produce a drm_dep_fence as their
> > finished fence.
> >
> > struct drm_dep_fence - a dma_fence subclass wrapping an optional
> > parent hardware fence. The finished fence is armed (sequence
> > number assigned) before submission and signals when the hardware
> > fence signals (or immediately on synchronous completion).
> >
> > Job lifecycle:
> > 1. drm_dep_job_init() - allocate and initialise; job acquires a
> > queue reference.
> > 2. drm_dep_job_add_dependency() and friends - register input fences;
> > duplicates from the same context are deduplicated.
> > 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> > 4. drm_dep_job_push() - submit to queue.
>
> You cannot enforce this sequence easily in C code. Once again, we are trusting
> drivers that it is followed, but in Rust, you can simply reject code that does
> not follow this order at compile time.
>
I don’t know Rust, but yes, you can enforce this in C via lockdep and
annotations. It’s not compile-time, but all of this is strictly enforced
at runtime. e.g., write some code that doesn't follow this sequence and
report back if the kernel doesn't complain. It will, and if it doesn't,
I'll fix it so it does.
>
> >
> > Submission paths under queue lock:
> > - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> > SPSC queue is empty, no dependencies are pending, and credits are
> > available, the job is dispatched inline on the calling thread.
> > - Queued path: job is pushed onto the SPSC queue and the run_job
> > worker is kicked. The worker resolves remaining dependencies
> > (installing wakeup callbacks for unresolved fences) before calling
> > ops->run_job().
> >
> > Credit-based throttling prevents hardware overflow: each job declares
> > a credit cost at init time; dispatch is deferred until sufficient
> > credits are available.
>
> Why can’t we design an API where the driver can refuse jobs in
> ops->run_job() if there are no resources to run it? This would do away with the
> credit system that has been in place for quite a while. Has this approach been
> tried in the past?
>
That seems possible if this is the preferred option; -EAGAIN would be
the way to do it. I’m open to the idea, but we also need to weigh the
benefit against the cost of converting existing drivers.
Partial reply; I'll catch up on the rest later.
Appreciate the feedback.
Matt
>
> >
> > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > expiry for device teardown.
> >
> > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > dma_fence callback). Dependency cleanup is deferred to process context
> > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> >
> > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > bail immediately if the queue refcount has already reached zero and
> > async teardown is in flight, preventing use-after-free.
>
> In rust, when you queue work, you have to pass a reference-counted pointer
> (Arc<T>). We simply never have this problem in a Rust design. If there is work
> queued, the queue is alive.
>
> By the way, why can’t we simply require synchronous teardowns?
>
> >
> > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > so that destroy_workqueue() is never called from within one of the
> > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > owning struct drm_device, released as the final step of teardown via
> > drm_dev_put(). This prevents the driver module from being unloaded
> > while any queue is still alive without requiring a separate drain API.
> >
> > Cc: Boris Brezillon <boris.brezillon@collabora.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Danilo Krummrich <dakr@kernel.org>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Maxime Ripard <mripard@kernel.org>
> > Cc: Philipp Stanner <phasta@kernel.org>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > ---
> > drivers/gpu/drm/Kconfig | 4 +
> > drivers/gpu/drm/Makefile | 1 +
> > drivers/gpu/drm/dep/Makefile | 5 +
> > drivers/gpu/drm/dep/drm_dep_fence.c | 406 +++++++
> > drivers/gpu/drm/dep/drm_dep_fence.h | 25 +
> > drivers/gpu/drm/dep/drm_dep_job.c | 675 +++++++++++
> > drivers/gpu/drm/dep/drm_dep_job.h | 13 +
> > drivers/gpu/drm/dep/drm_dep_queue.c | 1647 +++++++++++++++++++++++++++
> > drivers/gpu/drm/dep/drm_dep_queue.h | 31 +
> > include/drm/drm_dep.h | 597 ++++++++++
> > 10 files changed, 3404 insertions(+)
> > create mode 100644 drivers/gpu/drm/dep/Makefile
> > create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.c
> > create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.h
> > create mode 100644 drivers/gpu/drm/dep/drm_dep_job.c
> > create mode 100644 drivers/gpu/drm/dep/drm_dep_job.h
> > create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.c
> > create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.h
> > create mode 100644 include/drm/drm_dep.h
> >
> > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > index 5386248e75b6..834f6e210551 100644
> > --- a/drivers/gpu/drm/Kconfig
> > +++ b/drivers/gpu/drm/Kconfig
> > @@ -276,6 +276,10 @@ config DRM_SCHED
> > tristate
> > depends on DRM
> >
> > +config DRM_DEP
> > + tristate
> > + depends on DRM
> > +
> > # Separate option as not all DRM drivers use it
> > config DRM_PANEL_BACKLIGHT_QUIRKS
> > tristate
> > diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> > index e97faabcd783..1ad87cc0e545 100644
> > --- a/drivers/gpu/drm/Makefile
> > +++ b/drivers/gpu/drm/Makefile
> > @@ -173,6 +173,7 @@ obj-y += clients/
> > obj-y += display/
> > obj-$(CONFIG_DRM_TTM) += ttm/
> > obj-$(CONFIG_DRM_SCHED) += scheduler/
> > +obj-$(CONFIG_DRM_DEP) += dep/
> > obj-$(CONFIG_DRM_RADEON)+= radeon/
> > obj-$(CONFIG_DRM_AMDGPU)+= amd/amdgpu/
> > obj-$(CONFIG_DRM_AMDGPU)+= amd/amdxcp/
> > diff --git a/drivers/gpu/drm/dep/Makefile b/drivers/gpu/drm/dep/Makefile
> > new file mode 100644
> > index 000000000000..335f1af46a7b
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +drm_dep-y := drm_dep_queue.o drm_dep_job.o drm_dep_fence.o
> > +
> > +obj-$(CONFIG_DRM_DEP) += drm_dep.o
> > diff --git a/drivers/gpu/drm/dep/drm_dep_fence.c b/drivers/gpu/drm/dep/drm_dep_fence.c
> > new file mode 100644
> > index 000000000000..ae05b9077772
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_fence.c
> > @@ -0,0 +1,406 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +/**
> > + * DOC: DRM dependency fence
> > + *
> > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > + * provides a single dma_fence (@finished) signalled when the hardware
> > + * completes the job.
> > + *
> > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > + * is signalled once @parent signals (or immediately if run_job() returns
> > + * NULL or an error).
>
> I thought this fence proxy mechanism was going away due to recent work being
> carried out by Christian?
>
> > + *
> > + * Drivers should expose @finished as the out-fence for GPU work since it is
> > + * valid from the moment drm_dep_job_arm() returns, whereas the hardware fence
> > + * could be a compound fence, which is disallowed when installed into
> > + * drm_syncobjs or dma-resv.
> > + *
> > + * The fence uses the kernel's inline spinlock (NULL passed to dma_fence_init())
> > + * so no separate lock allocation is required.
> > + *
> > + * Deadline propagation is supported: if a consumer sets a deadline via
> > + * dma_fence_set_deadline(), it is forwarded to @parent when @parent is set.
> > + * If @parent has not been set yet the deadline is stored in @deadline and
> > + * forwarded at that point.
> > + *
> > + * Memory management: drm_dep_fence objects are allocated with kzalloc() and
> > + * freed via kfree_rcu() once the fence is released, ensuring safety with
> > + * RCU-protected fence accesses.
> > + */
> > +
> > +#include <linux/slab.h>
> > +#include <drm/drm_dep.h>
> > +#include "drm_dep_fence.h"
> > +
> > +/**
> > + * DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT - a fence deadline hint has been set
> > + *
> > + * Set by the deadline callback on the finished fence to indicate a deadline
> > + * has been set which may need to be propagated to the parent hardware fence.
> > + */
> > +#define DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT (DMA_FENCE_FLAG_USER_BITS + 1)
> > +
> > +/**
> > + * struct drm_dep_fence - fence tracking the completion of a dep job
> > + *
> > + * Contains a single dma_fence (@finished) that is signalled when the
> > + * hardware completes the job. The fence uses the kernel's inline_lock
> > + * (no external spinlock required).
> > + *
> > + * This struct is private to the drm_dep module; external code interacts
> > + * through the accessor functions declared in drm_dep_fence.h.
> > + */
> > +struct drm_dep_fence {
> > + /**
> > + * @finished: signalled when the job completes on hardware.
> > + *
> > + * Drivers should use this fence as the out-fence for a job since it
> > + * is available immediately upon drm_dep_job_arm().
> > + */
> > + struct dma_fence finished;
> > +
> > + /**
> > + * @deadline: deadline set on @finished which potentially needs to be
> > + * propagated to @parent.
> > + */
> > + ktime_t deadline;
> > +
> > + /**
> > + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> > + *
> > + * @finished is signaled once @parent is signaled. The initial store is
> > + * performed via smp_store_release to synchronize with deadline handling.
> > + *
> > + * All readers must access this under the fence lock and take a reference to
> > + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> > + * signals, and this drop also releases its internal reference.
> > + */
> > + struct dma_fence *parent;
> > +
> > + /**
> > + * @q: the queue this fence belongs to.
> > + */
> > + struct drm_dep_queue *q;
> > +};
> > +
> > +static const struct dma_fence_ops drm_dep_fence_ops;
> > +
> > +/**
> > + * to_drm_dep_fence() - cast a dma_fence to its enclosing drm_dep_fence
> > + * @f: dma_fence to cast
> > + *
> > + * Context: No context requirements (inline helper).
> > + * Return: pointer to the enclosing &drm_dep_fence.
> > + */
> > +static struct drm_dep_fence *to_drm_dep_fence(struct dma_fence *f)
> > +{
> > + return container_of(f, struct drm_dep_fence, finished);
> > +}
> > +
> > +/**
> > + * drm_dep_fence_set_parent() - store the hardware fence and propagate
> > + * any deadline
> > + * @dfence: dep fence
> > + * @parent: hardware fence returned by &drm_dep_queue_ops.run_job, or NULL/error
> > + *
> > + * Stores @parent on @dfence under smp_store_release() so that a concurrent
> > + * drm_dep_fence_set_deadline() call sees the parent before checking the
> > + * deadline bit. If a deadline has already been set on @dfence->finished it is
> > + * forwarded to @parent immediately. Does nothing if @parent is NULL or an
> > + * error pointer.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> > + struct dma_fence *parent)
> > +{
> > + if (IS_ERR_OR_NULL(parent))
> > + return;
> > +
> > + /*
> > + * smp_store_release() to ensure a thread racing us in
> > + * drm_dep_fence_set_deadline() sees the parent set before
> > + * it calls test_bit(HAS_DEADLINE_BIT).
> > + */
> > + smp_store_release(&dfence->parent, dma_fence_get(parent));
> > + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT,
> > + &dfence->finished.flags))
> > + dma_fence_set_deadline(parent, dfence->deadline);
> > +}
> > +
> > +/**
> > + * drm_dep_fence_finished() - signal the finished fence with a result
> > + * @dfence: dep fence to signal
> > + * @result: error code to set, or 0 for success
> > + *
> > + * Sets the fence error to @result if non-zero, then signals
> > + * @dfence->finished. Also removes parent visibility under the fence lock
> > + * and drops the parent reference. Dropping the parent here allows the
> > + * DRM dep fence to be completely decoupled from the DRM dep module.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_fence_finished(struct drm_dep_fence *dfence, int result)
> > +{
> > + struct dma_fence *parent;
> > + unsigned long flags;
> > +
> > + dma_fence_lock_irqsave(&dfence->finished, flags);
> > + if (result)
> > + dma_fence_set_error(&dfence->finished, result);
> > + dma_fence_signal_locked(&dfence->finished);
> > + parent = dfence->parent;
> > + dfence->parent = NULL;
> > + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> > +
> > + dma_fence_put(parent);
> > +}
>
> We should really try to move away from manual locks and unlocks.
>
> > +
> > +static const char *drm_dep_fence_get_driver_name(struct dma_fence *fence)
> > +{
> > + return "drm_dep";
> > +}
> > +
> > +static const char *drm_dep_fence_get_timeline_name(struct dma_fence *f)
> > +{
> > + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> > +
> > + return dfence->q->name;
> > +}
> > +
> > +/**
> > + * drm_dep_fence_get_parent() - get a reference to the parent hardware fence
> > + * @dfence: dep fence to query
> > + *
> > + * Returns a new reference to @dfence->parent, or NULL if the parent has
> > + * already been cleared (i.e. @dfence->finished has signalled and the parent
> > + * reference was dropped under the fence lock).
> > + *
> > + * Uses smp_load_acquire() to pair with the smp_store_release() in
> > + * drm_dep_fence_set_parent(), ensuring that if we race a concurrent
> > + * drm_dep_fence_set_parent() call we observe the parent pointer only after
> > + * the store is fully visible — before set_parent() tests
> > + * %DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT.
> > + *
> > + * Caller must hold the fence lock on @dfence->finished.
> > + *
> > + * Context: Any context, fence lock on @dfence->finished must be held.
> > + * Return: a new reference to the parent fence, or NULL.
> > + */
> > +static struct dma_fence *drm_dep_fence_get_parent(struct drm_dep_fence *dfence)
> > +{
> > + dma_fence_assert_held(&dfence->finished);
>
> > +
> > + return dma_fence_get(smp_load_acquire(&dfence->parent));
> > +}
> > +
> > +/**
> > + * drm_dep_fence_set_deadline() - dma_fence_ops deadline callback
> > + * @f: fence on which the deadline is being set
> > + * @deadline: the deadline hint to apply
> > + *
> > + * Stores the earliest deadline under the fence lock, then propagates
> > + * it to the parent hardware fence via smp_load_acquire() to race
> > + * safely with drm_dep_fence_set_parent().
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_fence_set_deadline(struct dma_fence *f, ktime_t deadline)
> > +{
> > + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> > + struct dma_fence *parent;
> > + unsigned long flags;
> > +
> > + dma_fence_lock_irqsave(f, flags);
> > +
> > + /* If we already have an earlier deadline, keep it: */
> > + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
> > + ktime_before(dfence->deadline, deadline)) {
> > + dma_fence_unlock_irqrestore(f, flags);
> > + return;
> > + }
> > +
> > + dfence->deadline = deadline;
> > + set_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags);
> > +
> > + parent = drm_dep_fence_get_parent(dfence);
> > + dma_fence_unlock_irqrestore(f, flags);
> > +
> > + if (parent)
> > + dma_fence_set_deadline(parent, deadline);
> > +
> > + dma_fence_put(parent);
> > +}
> > +
> > +static const struct dma_fence_ops drm_dep_fence_ops = {
> > + .get_driver_name = drm_dep_fence_get_driver_name,
> > + .get_timeline_name = drm_dep_fence_get_timeline_name,
> > + .set_deadline = drm_dep_fence_set_deadline,
> > +};
> > +
> > +/**
> > + * drm_dep_fence_alloc() - allocate a dep fence
> > + *
> > + * Allocates a &drm_dep_fence with kzalloc() without initialising the
> > + * dma_fence. Call drm_dep_fence_init() to fully initialise it.
> > + *
> > + * Context: Process context.
> > + * Return: new &drm_dep_fence on success, NULL on allocation failure.
> > + */
> > +struct drm_dep_fence *drm_dep_fence_alloc(void)
> > +{
> > + return kzalloc_obj(struct drm_dep_fence);
> > +}
> > +
> > +/**
> > + * drm_dep_fence_init() - initialise the dma_fence inside a dep fence
> > + * @dfence: dep fence to initialise
> > + * @q: queue the owning job belongs to
> > + *
> > + * Initialises @dfence->finished using the context and sequence number from @q.
> > + * Passes NULL as the lock so the fence uses its inline spinlock.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q)
> > +{
> > + u32 seq = ++q->fence.seqno;
> > +
> > + /*
> > + * XXX: Inline fence hazard: currently all expected users of DRM dep
> > + * hardware fences have a unique lockdep class. If that ever changes,
> > + * we will need to assign a unique lockdep class here so lockdep knows
> > + * this fence is allowed to nest with driver hardware fences.
> > + */
> > +
> > + dfence->q = q;
> > + dma_fence_init(&dfence->finished, &drm_dep_fence_ops,
> > + NULL, q->fence.context, seq);
> > +}
> > +
> > +/**
> > + * drm_dep_fence_cleanup() - release a dep fence at job teardown
> > + * @dfence: dep fence to clean up
> > + *
> > + * Called from drm_dep_job_fini(). If the dep fence was armed (refcount > 0)
> > + * it is released via dma_fence_put() and will be freed by the RCU release
> > + * callback once all waiters have dropped their references. If it was never
> > + * armed it is freed directly with kfree().
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence)
> > +{
> > + if (drm_dep_fence_is_armed(dfence))
> > + dma_fence_put(&dfence->finished);
> > + else
> > + kfree(dfence);
> > +}
> > +
> > +/**
> > + * drm_dep_fence_is_armed() - check whether the fence has been armed
> > + * @dfence: dep fence to check
> > + *
> > + * Returns true if drm_dep_job_arm() has been called, i.e. @dfence->finished
> > + * has been initialised and its reference count is non-zero. Used by
> > + * assertions to enforce correct job lifecycle ordering (arm before push,
> > + * add_dependency before arm).
> > + *
> > + * Context: Any context.
> > + * Return: true if the fence is armed, false otherwise.
> > + */
> > +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence)
> > +{
> > + return !!kref_read(&dfence->finished.refcount);
> > +}
>
> > +
> > +/**
> > + * drm_dep_fence_is_finished() - test whether the finished fence has signalled
> > + * @dfence: dep fence to check
> > + *
> > + * Uses dma_fence_test_signaled_flag() to read %DMA_FENCE_FLAG_SIGNALED_BIT
> > + * directly without invoking the fence's ->signaled() callback or triggering
> > + * any signalling side-effects.
> > + *
> > + * Context: Any context.
> > + * Return: true if @dfence->finished has been signalled, false otherwise.
> > + */
> > +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence)
> > +{
> > + return dma_fence_test_signaled_flag(&dfence->finished);
> > +}
> > +
> > +/**
> > + * drm_dep_fence_is_complete() - test whether the job has completed
> > + * @dfence: dep fence to check
> > + *
> > + * Takes the fence lock on @dfence->finished and calls
> > + * drm_dep_fence_get_parent() to safely obtain a reference to the parent
> > + * hardware fence — or NULL if the parent has already been cleared after
> > + * signalling. Calls dma_fence_is_signaled() on @parent outside the lock,
> > + * which may invoke the fence's ->signaled() callback and trigger signalling
> > + * side-effects if the fence has completed but the signalled flag has not yet
> > + * been set. The finished fence is tested via dma_fence_test_signaled_flag(),
> > + * without side-effects.
> > + *
> > + * May only be called on a stopped queue (see drm_dep_queue_is_stopped()).
> > + *
> > + * Context: Process context. The queue must be stopped before calling this.
> > + * Return: true if the job is complete, false otherwise.
> > + */
> > +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence)
> > +{
> > + struct dma_fence *parent;
> > + unsigned long flags;
> > + bool complete;
> > +
> > + dma_fence_lock_irqsave(&dfence->finished, flags);
> > + parent = drm_dep_fence_get_parent(dfence);
> > + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> > +
> > + complete = (parent && dma_fence_is_signaled(parent)) ||
> > + dma_fence_test_signaled_flag(&dfence->finished);
> > +
> > + dma_fence_put(parent);
> > +
> > + return complete;
> > +}
> > +
> > +/**
> > + * drm_dep_fence_to_dma() - return the finished dma_fence for a dep fence
> > + * @dfence: dep fence to query
> > + *
> > + * No reference is taken; the caller must hold its own reference to the owning
> > + * &drm_dep_job for the duration of the access.
> > + *
> > + * Context: Any context.
> > + * Return: the finished &dma_fence.
> > + */
> > +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence)
> > +{
> > + return &dfence->finished;
> > +}
> > +
> > +/**
> > + * drm_dep_fence_done() - signal the finished fence on job completion
> > + * @dfence: dep fence to signal
> > + * @result: job error code, or 0 on success
> > + *
> > + * Gets a temporary reference to @dfence->finished to guard against a racing
> > + * last-put, signals the fence with @result, then drops the temporary
> > + * reference. Called from drm_dep_job_done() in the queue core when a
> > + * hardware completion callback fires or when run_job() returns immediately.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result)
> > +{
> > + dma_fence_get(&dfence->finished);
> > + drm_dep_fence_finished(dfence, result);
> > + dma_fence_put(&dfence->finished);
> > +}
>
> Proper refcounting is automated (and enforced) in Rust.
>
> > diff --git a/drivers/gpu/drm/dep/drm_dep_fence.h b/drivers/gpu/drm/dep/drm_dep_fence.h
> > new file mode 100644
> > index 000000000000..65a1582f858b
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_fence.h
> > @@ -0,0 +1,25 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#ifndef _DRM_DEP_FENCE_H_
> > +#define _DRM_DEP_FENCE_H_
> > +
> > +#include <linux/dma-fence.h>
> > +
> > +struct drm_dep_fence;
> > +struct drm_dep_queue;
> > +
> > +struct drm_dep_fence *drm_dep_fence_alloc(void);
> > +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q);
> > +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence);
> > +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> > + struct dma_fence *parent);
> > +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result);
> > +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence);
> > +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence);
> > +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence);
> > +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence);
> > +
> > +#endif /* _DRM_DEP_FENCE_H_ */
> > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > new file mode 100644
> > index 000000000000..2d012b29a5fc
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright 2015 Advanced Micro Devices, Inc.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > + * copy of this software and associated documentation files (the "Software"),
> > + * to deal in the Software without restriction, including without limitation
> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > + * and/or sell copies of the Software, and to permit persons to whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > + * OTHER DEALINGS IN THE SOFTWARE.
> > + *
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +/**
> > + * DOC: DRM dependency job
> > + *
> > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > + * a struct drm_dep_queue. The lifecycle of a job is:
> > + *
> > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > + * embedding struct drm_dep_job in a larger structure) and calls
> > + * drm_dep_job_init() to initialise it. On success the job holds one
> > + * kref reference and a reference to its queue.
> > + *
> > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > + * that must be signalled before the job can run. Duplicate fences from the
> > + * same fence context are deduplicated automatically.
> > + *
> > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > + * consuming a sequence number from the queue. After arming,
> > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > + * userspace or used as a dependency by other jobs.
> > + *
> > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > + * queue takes a reference that it holds until the job's finished fence
> > + * signals and the job is freed by the put_job worker.
> > + *
> > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > + *
> > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > + * objects before the driver's release callback is invoked.
> > + */
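Putting the five steps together, a driver-side submit path looks roughly like this (sketch only: the `my_job` wrapper, `my_ops` and the error handling are illustrative, not part of this patch):

```c
struct my_job {
	struct drm_dep_job base;	/* step 1: embed the dep job */
	/* driver-private state ... */
};

static int my_submit(struct my_job *mj, struct drm_dep_queue *q,
		     struct dma_fence *in_fence)
{
	struct drm_dep_job_init_args args = {
		.q = q,
		.ops = &my_ops,		/* driver's drm_dep_job_ops */
		.credits = 1,
	};
	int err;

	err = drm_dep_job_init(&mj->base, &args);	/* step 1 */
	if (err)
		return err;

	/* step 2: @in_fence is consumed even on error */
	err = drm_dep_job_add_dependency(&mj->base, in_fence);
	if (err)
		goto out_put;

	drm_dep_job_arm(&mj->base);	/* step 3: finished fence now valid */
	drm_dep_job_push(&mj->base);	/* step 4: queue takes its own ref */

	/*
	 * Step 5 happens asynchronously: completion signals the finished
	 * fence, the queue drops its reference and my_ops.release() runs.
	 */
out_put:
	drm_dep_job_put(&mj->base);	/* drop the submitter's reference */
	return err;
}
```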
> > +
> > +#include <linux/dma-resv.h>
> > +#include <linux/kref.h>
> > +#include <linux/slab.h>
> > +#include <drm/drm_dep.h>
> > +#include <drm/drm_file.h>
> > +#include <drm/drm_gem.h>
> > +#include <drm/drm_syncobj.h>
> > +#include "drm_dep_fence.h"
> > +#include "drm_dep_job.h"
> > +#include "drm_dep_queue.h"
> > +
> > +/**
> > + * drm_dep_job_init() - initialise a dep job
> > + * @job: dep job to initialise
> > + * @args: initialisation arguments
> > + *
> > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > + * job reference is dropped.
> > + *
> > + * Resources are released automatically when the last reference is dropped
> > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > + * must not free the job directly.
>
> Again, can’t enforce that in C.
>
> > + *
> > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > + * Return: 0 on success, -%EINVAL if @args->credits is 0,
> > + * -%ENOMEM on fence allocation failure.
> > + */
> > +int drm_dep_job_init(struct drm_dep_job *job,
> > + const struct drm_dep_job_init_args *args)
> > +{
> > + if (unlikely(!args->credits)) {
> > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > + return -EINVAL;
> > + }
> > +
> > + memset(job, 0, sizeof(*job));
> > +
> > + job->dfence = drm_dep_fence_alloc();
> > + if (!job->dfence)
> > + return -ENOMEM;
> > +
> > + job->ops = args->ops;
> > + job->q = drm_dep_queue_get(args->q);
> > + job->credits = args->credits;
> > +
> > + kref_init(&job->refcount);
> > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > + INIT_LIST_HEAD(&job->pending_link);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_init);
> > +
> > +/**
> > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > + * @job: dep job whose dependency xarray to drain
> > + *
> > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > + * i.e. slots that were pre-allocated but never replaced — are silently
> > + * skipped; the sentinel carries no reference. Called from
> > + * drm_dep_queue_run_job() in process context immediately after
> > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > + * dependencies here — while still in process context — avoids calling
> > + * xa_destroy() from IRQ context if the job's last reference is later
> > + * dropped from a dma_fence callback.
> > + *
> > + * Context: Process context.
> > + */
> > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > +{
> > + struct dma_fence *fence;
> > + unsigned long index;
> > +
> > + xa_for_each(&job->dependencies, index, fence) {
> > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > + continue;
> > + dma_fence_put(fence);
> > + }
> > + xa_destroy(&job->dependencies);
> > +}
>
> This is automated in Rust. You also can’t “forget” to call this.
>
> > +
> > +/**
> > + * drm_dep_job_fini() - clean up a dep job
> > + * @job: dep job to clean up
> > + *
> > + * Cleans up the dep fence and drops the queue reference held by @job.
> > + *
> > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > + * the dependency xarray is also released here. For armed jobs the xarray
> > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > + * context immediately after run_job(), so it is left untouched to avoid
> > + * calling xa_destroy() from IRQ context.
> > + *
> > + * Warns if @job is still linked on the queue's pending list, which would
> > + * indicate a bug in the teardown ordering.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > +{
> > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > +
> > + WARN_ON(!list_empty(&job->pending_link));
> > +
> > + drm_dep_fence_cleanup(job->dfence);
> > + job->dfence = NULL;
> > +
> > + /*
> > + * Armed jobs have their dependencies drained by
> > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > + */
> > + if (!armed)
> > + drm_dep_job_drop_dependencies(job);
> > +}
>
> Same here.
>
> > +
> > +/**
> > + * drm_dep_job_get() - acquire a reference to a dep job
> > + * @job: dep job to acquire a reference on, or NULL
> > + *
> > + * Context: Any context.
> > + * Return: @job with an additional reference held, or NULL if @job is NULL.
> > + */
> > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> > +{
> > + if (job)
> > + kref_get(&job->refcount);
> > + return job;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_get);
> > +
>
> Same here.
>
> > +/**
> > + * drm_dep_job_release() - kref release callback for a dep job
> > + * @kref: kref embedded in the dep job
> > + *
> > + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> > + * otherwise frees @job with kfree(). Finally, releases the queue reference
> > + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> > + * queue put is performed last to ensure no queue state is accessed after
> > + * the job memory is freed.
> > + *
> > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > + * job's queue; otherwise process context only, as the release callback may
> > + * sleep.
> > + */
> > +static void drm_dep_job_release(struct kref *kref)
> > +{
> > + struct drm_dep_job *job =
> > + container_of(kref, struct drm_dep_job, refcount);
> > + struct drm_dep_queue *q = job->q;
> > +
> > + drm_dep_job_fini(job);
> > +
> > + if (job->ops && job->ops->release)
> > + job->ops->release(job);
> > + else
> > + kfree(job);
> > +
> > + drm_dep_queue_put(q);
> > +}
>
> Same here.
>
> > +
> > +/**
> > + * drm_dep_job_put() - release a reference to a dep job
> > + * @job: dep job to release a reference on, or NULL
> > + *
> > + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> > + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> > + *
> > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > + * job's queue; otherwise process context only, as the release callback may
> > + * sleep.
> > + */
> > +void drm_dep_job_put(struct drm_dep_job *job)
> > +{
> > + if (job)
> > + kref_put(&job->refcount, drm_dep_job_release);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_put);
> > +
>
> Same here.
>
> > +/**
> > + * drm_dep_job_arm() - arm a dep job for submission
> > + * @job: dep job to arm
> > + *
> > + * Initialises the finished fence on @job->dfence, assigning
> > + * it a sequence number from the job's queue. Must be called after
> > + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > + * userspace or used as a dependency by other jobs.
> > + *
> > + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> > + * After this point, memory allocations that could trigger reclaim are
> > + * forbidden; lockdep enforces this. arm() must always be paired with
> > + * drm_dep_job_push(); lockdep also enforces this pairing.
> > + *
> > + * Warns if the job has already been armed.
> > + *
> > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence
> > + * signalling path.
> > + */
> > +void drm_dep_job_arm(struct drm_dep_job *job)
> > +{
> > + drm_dep_queue_push_job_begin(job->q);
> > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > + drm_dep_fence_init(job->dfence, job->q);
> > + job->signalling_cookie = dma_fence_begin_signalling();
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_arm);
> > +
> > +/**
> > + * drm_dep_job_push() - submit a job to its queue for execution
> > + * @job: dep job to push
> > + *
> > + * Submits @job to the queue it was initialised with. Must be called after
> > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > + * held until the queue is fully done with it. The reference is released
> > + * directly in the finished-fence dma_fence callback for queues with
> > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > + * from hardirq context), or via the put_job work item on the submit
> > + * workqueue otherwise.
> > + *
> > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > + * enforces the pairing.
> > + *
> > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > + * @job exactly once, even if the queue is killed or torn down before the
> > + * job reaches the head of the queue. Drivers can use this guarantee to
> > + * perform bookkeeping cleanup; the actual backend operation should be
> > + * skipped when drm_dep_queue_is_killed() returns true.
> > + *
> > + * If the queue does not support the bypass path, the job is pushed directly
> > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > + *
> > + * Warns if the job has not been armed.
> > + *
> > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence
> > + * signalling path.
> > + */
> > +void drm_dep_job_push(struct drm_dep_job *job)
> > +{
> > + struct drm_dep_queue *q = job->q;
> > +
> > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > +
> > + drm_dep_job_get(job);
> > +
> > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > + drm_dep_queue_push_job(q, job);
> > + dma_fence_end_signalling(job->signalling_cookie);
>
> Signaling is enforced in a more thorough way in Rust. I’ll expand on this later in this patch.
>
> > + drm_dep_queue_push_job_end(job->q);
> > + return;
> > + }
> > +
> > + scoped_guard(mutex, &q->sched.lock) {
> > + if (drm_dep_queue_can_job_bypass(q, job))
> > + drm_dep_queue_run_job(q, job);
> > + else
> > + drm_dep_queue_push_job(q, job);
> > + }
> > +
> > + dma_fence_end_signalling(job->signalling_cookie);
> > + drm_dep_queue_push_job_end(job->q);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_push);
> > +
> > +/**
> > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > + * @job: dep job to add the dependencies to
> > + * @fence: the dma_fence to add to the list of dependencies, or
> > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > + *
> > + * Note that @fence is consumed in both the success and error cases (except
> > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > + *
> > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > + * fence->context matches the queue's finished fence context) are silently
> > + * dropped; the job need not wait on its own queue's output.
> > + *
> > + * Warns if the job has already been armed (dependencies must be added before
> > + * drm_dep_job_arm()).
> > + *
> > + * **Pre-allocation pattern**
> > + *
> > + * When multiple jobs across different queues must be prepared and submitted
> > + * together in a single atomic commit — for example, where job A's finished
> > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > + * region. Once that region has started, no allocation that can recurse into
> > + * reclaim (e.g. GFP_KERNEL) is permitted.
> > + *
> > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > + * always index 0 when the dependency array is empty, a property Xe relies on).
> > + * After all jobs have been armed and the finished fences are available, call
> > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > + * called from atomic or signalling context.
> > + *
> > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > + * ensuring a slot is always allocated even when the real fence is not yet
> > + * known.
> > + *
> > + * **Example: bind job feeding TLB invalidation jobs**
> > + *
> > + * Consider a GPU with separate queues for page-table bind operations and for
> > + * TLB invalidation. A single atomic commit must:
> > + *
> > + * 1. Run a bind job that modifies page tables.
> > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > + * completing, so stale translations are flushed before the engines
> > + * continue.
> > + *
> > + * Because all jobs must be armed and pushed inside a signalling region (where
> > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > + *
> > + *	// Phase 1 — process context, GFP_KERNEL allowed
> > + *	// (the init args name each job's queue, ops and credit count)
> > + *	drm_dep_job_init(bind_job, &bind_args);
> > + *	for_each_mmu(mmu) {
> > + *		drm_dep_job_init(tlb_job[mmu], &tlb_args[mmu]);
> > + *		// Pre-allocate slot at index 0; real fence not available yet
> > + *		drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > + *	}
> > + *
> > + *	// Phase 2 — inside signalling region, no GFP_KERNEL
> > + *	cookie = dma_fence_begin_signalling();
> > + *	drm_dep_job_arm(bind_job);
> > + *	for_each_mmu(mmu) {
> > + *		// Swap sentinel for bind job's finished fence
> > + *		drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > + *			dma_fence_get(drm_dep_job_finished_fence(bind_job)));
> > + *		drm_dep_job_arm(tlb_job[mmu]);
> > + *	}
> > + *	drm_dep_job_push(bind_job);
> > + *	for_each_mmu(mmu)
> > + *		drm_dep_job_push(tlb_job[mmu]);
> > + *	dma_fence_end_signalling(cookie);
> > + *
> > + * Context: Process context. May allocate memory with GFP_KERNEL.
> > + * Return: the newly allocated slot index on success if @fence is
> > + * %DRM_DEP_JOB_FENCE_PREALLOC; otherwise 0 on success, or a negative error
> > + * code.
> > + */
>
> > +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> > +{
> > + struct drm_dep_queue *q = job->q;
> > + struct dma_fence *entry;
> > + unsigned long index;
> > + u32 id = 0;
> > + int ret;
> > +
> > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > + might_alloc(GFP_KERNEL);
> > +
> > + if (!fence)
> > + return 0;
> > +
> > + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> > + goto add_fence;
> > +
> > + /*
> > + * Ignore signalled fences or fences from our own queue — finished
> > + * fences use q->fence.context.
> > + */
> > + if (dma_fence_test_signaled_flag(fence) ||
> > + fence->context == q->fence.context) {
> > + dma_fence_put(fence);
> > + return 0;
> > + }
> > +
> > +	/*
> > +	 * Deduplicate if we already depend on a fence from the same context.
> > +	 * This lets the size of the array of deps scale with the number of
> > +	 * engines involved, rather than the number of BOs.
> > +	 */
> > + xa_for_each(&job->dependencies, index, entry) {
> > + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> > + entry->context != fence->context)
> > + continue;
> > +
> > + if (dma_fence_is_later(fence, entry)) {
> > + dma_fence_put(entry);
> > + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> > + } else {
> > + dma_fence_put(fence);
> > + }
> > + return 0;
> > + }
> > +
> > +add_fence:
> > + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> > + GFP_KERNEL);
> > + if (ret != 0) {
> > + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> > + dma_fence_put(fence);
> > + return ret;
> > + }
> > +
> > + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> > +
> > +/**
> > + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> > + * @job: dep job to update
> > + * @index: xarray index of the slot to replace, as returned when the sentinel
> > + * was originally inserted via drm_dep_job_add_dependency()
> > + * @fence: the real dma_fence to store; its reference is always consumed
> > + *
> > + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> > + * @job->dependencies with @fence. The slot must have been pre-allocated by
> > + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> > + * existing entry is asserted to be the sentinel.
> > + *
> > + * This is the second half of the pre-allocation pattern described in
> > + * drm_dep_job_add_dependency(). It is intended to be called inside a
> > + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> > + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> > + * internally so it is safe to call from atomic or signalling context, but
> > + * since the slot has been pre-allocated no actual memory allocation occurs.
> > + *
> > + * If @fence is already signalled the slot is erased rather than storing a
> > + * redundant dependency. The successful store is asserted — if the store
> > + * fails it indicates a programming error (slot index out of range or
> > + * concurrent modification).
> > + *
> > + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
>
> Can’t enforce this in C. Also, how is the fence “consumed”? You can’t enforce that
> the user can’t access the fence anymore after this function returns, like we can do
> at compile time in Rust.
>
> > + *
> > + * Context: Any context. DMA fence signalling path.
> > + */
> > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > + struct dma_fence *fence)
> > +{
> > + WARN_ON(xa_load(&job->dependencies, index) !=
> > + DRM_DEP_JOB_FENCE_PREALLOC);
> > +
> > + if (dma_fence_test_signaled_flag(fence)) {
> > + xa_erase(&job->dependencies, index);
> > + dma_fence_put(fence);
> > + return;
> > + }
> > +
> > + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> > + GFP_NOWAIT)))) {
> > + dma_fence_put(fence);
> > + return;
> > + }
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> > +
> > +/**
> > + * drm_dep_job_add_syncobj_dependency() - adds a syncobj's fence as a
> > + * job dependency
> > + * @job: dep job to add the dependencies to
> > + * @file: drm file private pointer
> > + * @handle: syncobj handle to lookup
> > + * @point: timeline point
> > + *
> > + * This adds the fence matching the given syncobj to @job.
> > + *
> > + * Context: Process context.
> > + * Return: 0 on success, or a negative error code.
> > + */
> > +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> > + struct drm_file *file, u32 handle,
> > + u32 point)
> > +{
> > + struct dma_fence *fence;
> > + int ret;
> > +
> > + ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
> > + if (ret)
> > + return ret;
> > +
> > + return drm_dep_job_add_dependency(job, fence);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_add_syncobj_dependency);
> > +
> > +/**
> > + * drm_dep_job_add_resv_dependencies() - add all fences from the resv to the job
> > + * @job: dep job to add the dependencies to
> > + * @resv: the dma_resv object to get the fences from
> > + * @usage: the dma_resv_usage to use to filter the fences
> > + *
> > + * This adds all fences matching the given usage from @resv to @job.
> > + * Must be called with the @resv lock held.
> > + *
> > + * Context: Process context.
> > + * Return: 0 on success, or a negative error code.
> > + */
> > +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> > + struct dma_resv *resv,
> > + enum dma_resv_usage usage)
> > +{
> > + struct dma_resv_iter cursor;
> > + struct dma_fence *fence;
> > + int ret;
> > +
> > + dma_resv_assert_held(resv);
> > +
> > + dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> > + /*
> > + * drm_dep_job_add_dependency() always consumes the fence
> > + * reference (even when it fails), and dma_resv_for_each_fence()
> > + * does not take one, so grab a reference before calling.
> > + */
> > + ret = drm_dep_job_add_dependency(job, dma_fence_get(fence));
> > + if (ret)
> > + return ret;
> > + }
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_add_resv_dependencies);
> > +
> > +/**
> > + * drm_dep_job_add_implicit_dependencies() - adds implicit dependencies
> > + * as job dependencies
> > + * @job: dep job to add the dependencies to
> > + * @obj: the gem object to add new dependencies from.
> > + * @write: whether the job might write the object (so we need to depend on
> > + * shared fences in the reservation object).
> > + *
> > + * This should be called after drm_gem_lock_reservations() on your array of
> > + * GEM objects used in the job but before updating the reservations with your
> > + * own fences.
> > + *
> > + * Context: Process context.
> > + * Return: 0 on success, or a negative error code.
> > + */
> > +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> > + struct drm_gem_object *obj,
> > + bool write)
> > +{
> > + return drm_dep_job_add_resv_dependencies(job, obj->resv,
> > + dma_resv_usage_rw(write));
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_add_implicit_dependencies);
> > +
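A typical caller looks roughly like this (illustrative sketch; the object array and holding the resv locks, e.g. via drm_exec, are driver-specific):

```c
/* Collect implicit fences for every GEM object touched by the job. */
static int my_collect_implicit_deps(struct drm_dep_job *job,
				    struct drm_gem_object **objs,
				    int count, bool write)
{
	int i, err;

	/* resv locks for all objects must already be held */
	for (i = 0; i < count; i++) {
		err = drm_dep_job_add_implicit_dependencies(job, objs[i],
							    write);
		if (err)
			return err;
	}
	return 0;
}
```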
> > +/**
> > + * drm_dep_job_is_signaled() - check whether a dep job has completed
> > + * @job: dep job to check
> > + *
> > + * Determines whether @job has signalled. The queue should be stopped before
> > + * calling this to obtain a stable snapshot of state. Both the parent hardware
> > + * fence and the finished software fence are checked.
> > + *
> > + * Context: Process context. The queue must be stopped before calling this.
> > + * Return: true if the job is signalled, false otherwise.
> > + */
> > +bool drm_dep_job_is_signaled(struct drm_dep_job *job)
> > +{
> > + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> > + return drm_dep_fence_is_complete(job->dfence);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_is_signaled);
> > +
> > +/**
> > + * drm_dep_job_is_finished() - test whether a dep job's finished fence has signalled
> > + * @job: dep job to check
> > + *
> > + * Tests whether the job's software finished fence has been signalled, using
> > + * dma_fence_test_signaled_flag() to avoid any signalling side-effects. Unlike
> > + * drm_dep_job_is_signaled(), this does not require the queue to be stopped and
> > + * does not check the parent hardware fence — it is a lightweight test of the
> > + * finished fence only.
> > + *
> > + * Context: Any context.
> > + * Return: true if the job's finished fence has been signalled, false otherwise.
> > + */
> > +bool drm_dep_job_is_finished(struct drm_dep_job *job)
> > +{
> > + return drm_dep_fence_is_finished(job->dfence);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_is_finished);
> > +
> > +/**
> > + * drm_dep_job_invalidate_job() - increment the invalidation count for a job
> > + * @job: dep job to invalidate
> > + * @threshold: threshold above which the job is considered invalidated
> > + *
> > + * Increments @job->invalidate_count and returns true if it exceeds @threshold,
> > + * indicating the job should be considered hung and discarded. The queue must
> > + * be stopped before calling this function.
> > + *
> > + * Context: Process context. The queue must be stopped before calling this.
> > + * Return: true if @job->invalidate_count exceeds @threshold, false otherwise.
> > + */
> > +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold)
> > +{
> > + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> > + return ++job->invalidate_count > threshold;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_invalidate_job);
> > +
> > +/**
> > + * drm_dep_job_finished_fence() - return the finished fence for a job
> > + * @job: dep job to query
> > + *
> > + * No reference is taken on the returned fence; the caller must hold its own
> > + * reference to @job for the duration of any access.
>
> Can’t enforce this in C.
>
> > + *
> > + * Context: Any context.
> > + * Return: the finished &dma_fence for @job.
> > + */
> > +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job)
> > +{
> > + return drm_dep_fence_to_dma(job->dfence);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_finished_fence);
> > diff --git a/drivers/gpu/drm/dep/drm_dep_job.h b/drivers/gpu/drm/dep/drm_dep_job.h
> > new file mode 100644
> > index 000000000000..35c61d258fa1
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_job.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#ifndef _DRM_DEP_JOB_H_
> > +#define _DRM_DEP_JOB_H_
> > +
> > +struct drm_dep_queue;
> > +
> > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job);
> > +
> > +#endif /* _DRM_DEP_JOB_H_ */
> > diff --git a/drivers/gpu/drm/dep/drm_dep_queue.c b/drivers/gpu/drm/dep/drm_dep_queue.c
> > new file mode 100644
> > index 000000000000..dac02d0d22c4
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_queue.c
> > @@ -0,0 +1,1647 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright 2015 Advanced Micro Devices, Inc.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > + * copy of this software and associated documentation files (the "Software"),
> > + * to deal in the Software without restriction, including without limitation
> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > + * and/or sell copies of the Software, and to permit persons to whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > + * OTHER DEALINGS IN THE SOFTWARE.
> > + *
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +/**
> > + * DOC: DRM dependency queue
> > + *
> > + * The drm_dep subsystem provides a lightweight GPU submission queue that
> > + * combines the roles of drm_gpu_scheduler and drm_sched_entity into a
> > + * single object (struct drm_dep_queue). Each queue owns its own ordered
> > + * submit workqueue, timeout workqueue, and TDR delayed-work.
> > + *
> > + * **Job lifecycle**
> > + *
> > + * 1. Allocate and initialise a job with drm_dep_job_init().
> > + * 2. Add dependency fences with drm_dep_job_add_dependency() and friends.
> > + * 3. Arm the job with drm_dep_job_arm() to obtain its out-fences.
> > + * 4. Submit with drm_dep_job_push().
> > + *
> > + * **Submission paths**
> > + *
> > + * drm_dep_job_push() decides between two paths under @q->sched.lock:
> > + *
> > + * - **Bypass path** (drm_dep_queue_can_job_bypass()): if
> > + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the queue is not stopped,
> > + * the SPSC queue is empty, the job has no dependency fences, and credits
> > + * are available, the job is submitted inline on the calling thread without
> > + * touching the submit workqueue.
> > + *
> > + * - **Queued path** (drm_dep_queue_push_job()): the job is pushed onto an
> > + * SPSC queue and the run_job worker is kicked. The run_job worker pops the
> > + * job, resolves any remaining dependency fences (installing wakeup
> > + * callbacks for unresolved ones), and calls drm_dep_queue_run_job().
> > + *
> > + * **Running a job**
> > + *
> > + * drm_dep_queue_run_job() accounts credits, appends the job to the pending
> > + * list (starting the TDR timer only when the list was previously empty),
> > + * calls @ops->run_job(), stores the returned hardware fence as the parent
> > + * of the job's dep fence, then installs a callback on it. When the hardware
> > + * fence fires (or the job completes synchronously), drm_dep_job_done()
> > + * signals the finished fence, returns credits, and kicks the put_job worker
> > + * to free the job.
> > + *
> > + * **Timeout detection and recovery (TDR)**
> > + *
> > + * A delayed work item fires when a job on the pending list takes longer than
> > + * @q->job.timeout jiffies. It calls @ops->timedout_job() and acts on the
> > + * returned status (%DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED or
> > + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB).
> > + * drm_dep_queue_trigger_timeout() forces the timer to fire immediately (without
> > + * changing the stored timeout), for example during device teardown.
> > + *
> > + * **Reference counting**
> > + *
> > + * Jobs and queues are both reference counted.
> > + *
> > + * A job holds a reference to its queue from drm_dep_job_init() until
> > + * drm_dep_job_put() drops the job's last reference and its release callback
> > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > + * job that was submitted to it.
> > + *
> > + * The queue holds its own reference to a job for as long as the job is
> > + * internally tracked: from the moment the job is added to the pending list
> > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > + * worker, which calls drm_dep_job_put() to release that reference.
>
> Why not simply keep track that the job was completed, instead of relinquishing
> the reference? We can then release the reference once the job is cleaned up
> (by the queue, using a worker) in process context.
>
>
> > + *
> > + * **Hazard: use-after-free from within a worker**
> > + *
> > + * Because a job holds a queue reference, drm_dep_job_put() dropping the last
> > + * job reference will also drop a queue reference via the job's release path.
> > + * If that happens to be the last queue reference, drm_dep_queue_fini() can be
> > + * called, which queues @q->free_work on dep_free_wq and returns immediately.
> > + * free_work calls disable_work_sync() / disable_delayed_work_sync() on the
> > + * queue's own workers before destroying its workqueues, so in practice a
> > + * running worker always completes before the queue memory is freed.
> > + *
> > + * However, there is a secondary hazard: a worker can be queued while the
> > + * queue is in a "zombie" state — refcount has already reached zero and async
> > + * teardown is in flight, but the work item has not yet been disabled by
> > + * free_work. To guard against this every worker uses
> > + * drm_dep_queue_get_unless_zero() at entry; if the refcount is already zero
> > + * the worker bails immediately without touching the queue state.
>
> Again, this problem is gone in Rust.
>
> > + *
> > + * Because all actual teardown (disable_*_sync, destroy_workqueue) runs on
> > + * dep_free_wq — which is independent of the queue's own submit/timeout
> > + * workqueues — there is no deadlock risk. Each queue holds a drm_dev_get()
> > + * reference on its owning &drm_device, which is released as the last step of
> > + * teardown. This ensures the driver module cannot be unloaded while any queue
> > + * is still alive.
> > + */
> > +
> > +#include <linux/dma-resv.h>
> > +#include <linux/kref.h>
> > +#include <linux/module.h>
> > +#include <linux/overflow.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/workqueue.h>
> > +#include <drm/drm_dep.h>
> > +#include <drm/drm_drv.h>
> > +#include <drm/drm_print.h>
> > +#include "drm_dep_fence.h"
> > +#include "drm_dep_job.h"
> > +#include "drm_dep_queue.h"
> > +
> > +/*
> > + * Dedicated workqueue for deferred drm_dep_queue teardown. Using a
> > + * module-private WQ instead of system_percpu_wq keeps teardown isolated
> > + * from unrelated kernel subsystems.
> > + */
> > +static struct workqueue_struct *dep_free_wq;
> > +
> > +/**
> > + * drm_dep_queue_flags_set() - set a flag on the queue under sched.lock
> > + * @q: dep queue
> > + * @flag: flag to set (one of &enum drm_dep_queue_flags)
> > + *
> > + * Sets @flag in @q->sched.flags. Must be called with @q->sched.lock
> > + * held; the lockdep assertion enforces this.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > + */
> > +static void drm_dep_queue_flags_set(struct drm_dep_queue *q,
> > + enum drm_dep_queue_flags flag)
> > +{
> > + lockdep_assert_held(&q->sched.lock);
>
> We can enforce this in Rust at compile-time. The code does not compile if the
> lock is not taken. Same here and everywhere else where the sched lock has
> to be taken.
>
>
> > + q->sched.flags |= flag;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_flags_clear() - clear a flag on the queue under sched.lock
> > + * @q: dep queue
> > + * @flag: flag to clear (one of &enum drm_dep_queue_flags)
> > + *
> > + * Clears @flag in @q->sched.flags. Must be called with @q->sched.lock
> > + * held; the lockdep assertion enforces this.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > + */
> > +static void drm_dep_queue_flags_clear(struct drm_dep_queue *q,
> > + enum drm_dep_queue_flags flag)
> > +{
> > + lockdep_assert_held(&q->sched.lock);
> > + q->sched.flags &= ~flag;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_has_credits() - check whether the queue has enough credits
> > + * @q: dep queue
> > + * @job: job requesting credits
> > + *
> > + * Checks whether the queue has enough available credits to dispatch
> > + * @job. If @job->credits exceeds the queue's credit limit, it is
> > + * clamped with a WARN.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > + * Return: true if available credits >= @job->credits, false otherwise.
> > + */
> > +static bool drm_dep_queue_has_credits(struct drm_dep_queue *q,
> > + struct drm_dep_job *job)
> > +{
> > + u32 available;
> > +
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + if (job->credits > q->credit.limit) {
> > + drm_warn(q->drm,
> > + "Jobs may not exceed the credit limit, truncate.\n");
> > + job->credits = q->credit.limit;
> > + }
> > +
> > + WARN_ON(check_sub_overflow(q->credit.limit,
> > + atomic_read(&q->credit.count),
> > + &available));
> > +
> > + return available >= job->credits;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_run_job_queue() - kick the run-job worker
> > + * @q: dep queue
> > + *
> > + * Queues @q->sched.run_job on @q->sched.submit_wq unless the queue is stopped
> > + * or the job queue is empty. The empty-queue check avoids queueing a work item
> > + * that would immediately return with nothing to do.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_queue_run_job_queue(struct drm_dep_queue *q)
> > +{
> > + if (!drm_dep_queue_is_stopped(q) && spsc_queue_count(&q->job.queue))
> > + queue_work(q->sched.submit_wq, &q->sched.run_job);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_put_job_queue() - kick the put-job worker
> > + * @q: dep queue
> > + *
> > + * Queues @q->sched.put_job on @q->sched.submit_wq unless the queue
> > + * is stopped.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_queue_put_job_queue(struct drm_dep_queue *q)
> > +{
> > + if (!drm_dep_queue_is_stopped(q))
> > + queue_work(q->sched.submit_wq, &q->sched.put_job);
> > +}
> > +
> > +/**
> > + * drm_queue_start_timeout() - arm or re-arm the TDR delayed work
> > + * @q: dep queue
> > + *
> > + * Arms the TDR delayed work with @q->job.timeout. No-op if
> > + * @q->ops->timedout_job is NULL, the timeout is MAX_SCHEDULE_TIMEOUT,
> > + * or the pending list is empty.
> > + *
> > + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> > + */
> > +static void drm_queue_start_timeout(struct drm_dep_queue *q)
> > +{
> > + lockdep_assert_held(&q->job.lock);
> > +
> > + if (!q->ops->timedout_job ||
> > + q->job.timeout == MAX_SCHEDULE_TIMEOUT ||
> > + list_empty(&q->job.pending))
> > + return;
> > +
> > + mod_delayed_work(q->sched.timeout_wq, &q->sched.tdr, q->job.timeout);
> > +}
> > +
> > +/**
> > + * drm_queue_start_timeout_unlocked() - arm TDR, acquiring job.lock
> > + * @q: dep queue
> > + *
> > + * Acquires @q->job.lock with interrupts disabled and calls
> > + * drm_queue_start_timeout().
> > + *
> > + * Context: Process context (workqueue).
> > + */
> > +static void drm_queue_start_timeout_unlocked(struct drm_dep_queue *q)
> > +{
> > + guard(spinlock_irq)(&q->job.lock);
> > + drm_queue_start_timeout(q);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_remove_dependency() - clear the active dependency and wake
> > + * the run-job worker
> > + * @q: dep queue
> > + * @f: the dependency fence being removed
> > + *
> > + * Stores @f into @q->dep.removed_fence via smp_store_release() so that the
> > + * run-job worker can drop the reference to it in drm_dep_queue_is_ready(),
> > + * paired with smp_load_acquire(). Clears @q->dep.fence and kicks the
> > + * run-job worker.
> > + *
> > + * The fence reference is not dropped here; it is deferred to the run-job
> > + * worker via @q->dep.removed_fence to keep this path suitable for
> > + * dma_fence callback removal in drm_dep_queue_kill().
>
> This is a comment in C, but in Rust this is encoded directly in the type system.
>
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_queue_remove_dependency(struct drm_dep_queue *q,
> > + struct dma_fence *f)
> > +{
> > + /* removed_fence must be visible to the reader before &q->dep.fence */
> > + smp_store_release(&q->dep.removed_fence, f);
> > +
> > + WRITE_ONCE(q->dep.fence, NULL);
> > + drm_dep_queue_run_job_queue(q);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_wakeup() - dma_fence callback to wake the run-job worker
> > + * @f: the signalled dependency fence
> > + * @cb: callback embedded in the dep queue
> > + *
> > + * Called from dma_fence_signal() when the active dependency fence signals.
> > + * Delegates to drm_dep_queue_remove_dependency() to clear @q->dep.fence and
> > + * kick the run-job worker. The fence reference is not dropped here; it is
> > + * deferred to the run-job worker via @q->dep.removed_fence.
>
> Same here.
>
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_queue_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
> > +{
> > + struct drm_dep_queue *q =
> > + container_of(cb, struct drm_dep_queue, dep.cb);
> > +
> > + drm_dep_queue_remove_dependency(q, f);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_is_ready() - check whether the queue has a dispatchable job
> > + * @q: dep queue
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
>
> Can’t call this in Rust if the lock is not taken.
>
> > + * Return: true if SPSC queue non-empty and no dep fence pending,
> > + * false otherwise.
> > + */
> > +static bool drm_dep_queue_is_ready(struct drm_dep_queue *q)
> > +{
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + if (!spsc_queue_count(&q->job.queue))
> > + return false;
> > +
> > + if (READ_ONCE(q->dep.fence))
> > + return false;
> > +
> > + /* Paired with smp_store_release in drm_dep_queue_remove_dependency() */
> > + dma_fence_put(smp_load_acquire(&q->dep.removed_fence));
> > +
> > + q->dep.removed_fence = NULL;
> > +
> > + return true;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_is_killed() - check whether a dep queue has been killed
> > + * @q: dep queue to check
> > + *
> > + * Return: true if %DRM_DEP_QUEUE_FLAGS_KILLED is set on @q, false otherwise.
> > + *
> > + * Context: Any context.
> > + */
> > +bool drm_dep_queue_is_killed(struct drm_dep_queue *q)
> > +{
> > + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_KILLED);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_is_killed);
> > +
> > +/**
> > + * drm_dep_queue_is_initialized() - check whether a dep queue has been initialized
> > + * @q: dep queue to check
> > + *
> > + * A queue is considered initialized once its ops pointer has been set by a
> > + * successful call to drm_dep_queue_init(). Drivers that embed a
> > + * &drm_dep_queue inside a larger structure may call this before attempting any
> > + * other queue operation to confirm that initialization has taken place.
> > + * drm_dep_queue_put() must be called if this function returns true to drop the
> > + * initialization reference from drm_dep_queue_init().
> > + *
> > + * Return: true if @q has been initialized, false otherwise.
> > + *
> > + * Context: Any context.
> > + */
> > +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q)
> > +{
> > + return !!q->ops;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_is_initialized);
> > +
> > +/**
> > + * drm_dep_queue_set_stopped() - pre-mark a queue as stopped before first use
> > + * @q: dep queue to mark
> > + *
> > + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED directly on @q without going through the
> > + * normal drm_dep_queue_stop() path. This is only valid during the driver-side
> > + * queue initialisation sequence — i.e. after drm_dep_queue_init() returns but
> > + * before the queue is made visible to other threads (e.g. before it is added
> > + * to any lookup structures). Using this after the queue is live is a driver
> > + * bug; use drm_dep_queue_stop() instead.
> > + *
> > + * Context: Process context, queue not yet visible to other threads.
> > + */
> > +void drm_dep_queue_set_stopped(struct drm_dep_queue *q)
> > +{
> > + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_STOPPED;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_set_stopped);
> > +
> > +/**
> > + * drm_dep_queue_refcount() - read the current reference count of a queue
> > + * @q: dep queue to query
> > + *
> > + * Returns the instantaneous kref value. The count may change immediately
> > + * after this call; callers must not make safety decisions based solely on
> > + * the returned value. Intended for diagnostic snapshots and debugfs output.
> > + *
> > + * Context: Any context.
> > + * Return: current reference count.
> > + */
> > +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q)
> > +{
> > + return kref_read(&q->refcount);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_refcount);
> > +
> > +/**
> > + * drm_dep_queue_timeout() - read the per-job TDR timeout for a queue
> > + * @q: dep queue to query
> > + *
> > + * Returns the per-job timeout in jiffies as set at init time.
> > + * %MAX_SCHEDULE_TIMEOUT means no timeout is configured.
> > + *
> > + * Context: Any context.
> > + * Return: timeout in jiffies.
> > + */
> > +long drm_dep_queue_timeout(const struct drm_dep_queue *q)
> > +{
> > + return q->job.timeout;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_timeout);
> > +
> > +/**
> > + * drm_dep_queue_is_job_put_irq_safe() - test whether job-put from IRQ is allowed
> > + * @q: dep queue
> > + *
> > + * Context: Any context.
> > + * Return: true if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set,
> > + * false otherwise.
> > + */
> > +static bool drm_dep_queue_is_job_put_irq_safe(const struct drm_dep_queue *q)
> > +{
> > + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_job_dependency() - get next unresolved dep fence
> > + * @q: dep queue
> > + * @job: job whose dependencies to advance
> > + *
> > + * Returns NULL immediately if the queue has been killed via
> > + * drm_dep_queue_kill(), bypassing all dependency waits so that jobs
> > + * drain through run_job as quickly as possible.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > + * Return: next unresolved &dma_fence with a new reference, or NULL
> > + * when all dependencies have been consumed (or the queue is killed).
> > + */
> > +static struct dma_fence *
> > +drm_dep_queue_job_dependency(struct drm_dep_queue *q,
> > + struct drm_dep_job *job)
> > +{
> > + struct dma_fence *f;
> > +
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + if (drm_dep_queue_is_killed(q))
> > + return NULL;
> > +
> > + f = xa_load(&job->dependencies, job->last_dependency);
> > + if (f) {
> > + job->last_dependency++;
> > + if (WARN_ON(DRM_DEP_JOB_FENCE_PREALLOC == f))
> > + return dma_fence_get_stub();
> > + return dma_fence_get(f);
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_add_dep_cb() - install wakeup callback on dep fence
> > + * @q: dep queue
> > + * @job: job whose dependency fence is stored in @q->dep.fence
> > + *
> > + * Installs a wakeup callback on @q->dep.fence. Returns true if the
> > + * callback was installed (the queue must wait), false if the fence is
> > + * already signalled or is a self-fence from the same queue context.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > + * Return: true if callback installed, false if fence already done.
> > + */
>
> In Rust, we can encode the signaling paths with a “token type”. So any
> sections that are part of the signaling path can simply take this token as an
> argument. This type also enforces that end_signaling() is called automatically when it
> goes out of scope.
>
> By the way, we can easily offer an irq handler type where we enforce this:
>
> fn handle_threaded_irq(&self, device: &Device<Bound>) -> IrqReturn {
> let _annotation = DmaFenceSignallingAnnotation::new(); // Calls begin_signaling()
> self.driver.handle_threaded_irq(device)
>
> // end_signaling() is called here automatically.
> }
>
> Same for workqueues:
>
> fn work_fn(&self, device: &Device<Bound>) {
> let _annotation = DmaFenceSignallingAnnotation::new(); // Calls begin_signaling()
> self.driver.work_fn(device)
>
> // end_signaling() is called here automatically.
> }
>
> This is not Rust-specific, of course, but it is more ergonomic to write in Rust.
>
> > +static bool drm_dep_queue_add_dep_cb(struct drm_dep_queue *q,
> > + struct drm_dep_job *job)
> > +{
> > + struct dma_fence *fence = q->dep.fence;
> > +
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + if (WARN_ON(fence->context == q->fence.context)) {
> > + dma_fence_put(q->dep.fence);
> > + q->dep.fence = NULL;
> > + return false;
> > + }
> > +
> > + if (!dma_fence_add_callback(q->dep.fence, &q->dep.cb,
> > + drm_dep_queue_wakeup))
> > + return true;
> > +
> > + dma_fence_put(q->dep.fence);
> > + q->dep.fence = NULL;
> > +
> > + return false;
> > +}
>
> In Rust we can enforce that all callbacks take a reference to the fence
> automatically. If the callback is “forgotten” in a buggy path, it is
> automatically removed, and the fence is automatically signaled with -ECANCELED.
>
> > +
> > +/**
> > + * drm_dep_queue_pop_job() - pop a dispatchable job from the SPSC queue
> > + * @q: dep queue
> > + *
> > + * Peeks at the head of the SPSC queue and drains all resolved
> > + * dependencies. If a dependency is still pending, installs a wakeup
> > + * callback and returns NULL. On success pops the job and returns it.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > + * Return: next dispatchable job, or NULL if a dep is still pending.
> > + */
> > +static struct drm_dep_job *drm_dep_queue_pop_job(struct drm_dep_queue *q)
> > +{
> > + struct spsc_node *node;
> > + struct drm_dep_job *job;
> > +
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + node = spsc_queue_peek(&q->job.queue);
> > + if (!node)
> > + return NULL;
> > +
> > + job = container_of(node, struct drm_dep_job, queue_node);
> > +
> > + while ((q->dep.fence = drm_dep_queue_job_dependency(q, job))) {
> > + if (drm_dep_queue_add_dep_cb(q, job))
> > + return NULL;
> > + }
> > +
> > + spsc_queue_pop(&q->job.queue);
> > +
> > + return job;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_get_unless_zero() - try to acquire a queue reference
> > + *
> > + * Workers use this instead of drm_dep_queue_get() to guard against the zombie
> > + * state: the queue's refcount has already reached zero (async teardown is in
> > + * flight) but a work item was queued before free_work had a chance to cancel
> > + * it. If kref_get_unless_zero() fails the caller must bail immediately.
> > + *
> > + * Context: Any context.
> > + * Return: true if the reference was acquired, false if the queue is a zombie.
> > + */
>
> Again, this function is totally gone in Rust.
>
> > +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q)
> > +{
> > + return kref_get_unless_zero(&q->refcount);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_get_unless_zero);
> > +
> > +/**
> > + * drm_dep_queue_run_job_work() - run-job worker
> > + * @work: work item embedded in the dep queue
> > + *
> > + * Acquires @q->sched.lock, checks stopped state, queue readiness and
> > + * available credits, pops the next job via drm_dep_queue_pop_job(),
> > + * dispatches it via drm_dep_queue_run_job(), then re-kicks itself.
> > + *
> > + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> > + * queue is in zombie state (refcount already zero, async teardown in flight).
> > + *
> > + * Context: Process context (workqueue). DMA fence signaling path.
> > + */
> > +static void drm_dep_queue_run_job_work(struct work_struct *work)
> > +{
> > + struct drm_dep_queue *q =
> > + container_of(work, struct drm_dep_queue, sched.run_job);
> > + struct spsc_node *node;
> > + struct drm_dep_job *job;
> > + bool cookie = dma_fence_begin_signalling();
> > +
> > + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> > + if (!drm_dep_queue_get_unless_zero(q)) {
> > + dma_fence_end_signalling(cookie);
> > + return;
> > + }
> > +
> > + mutex_lock(&q->sched.lock);
> > +
> > + if (drm_dep_queue_is_stopped(q))
> > + goto put_queue;
> > +
> > + if (!drm_dep_queue_is_ready(q))
> > + goto put_queue;
> > +
> > + /* Peek to check credits before committing to pop and dep resolution */
> > + node = spsc_queue_peek(&q->job.queue);
> > + if (!node)
> > + goto put_queue;
> > +
> > + job = container_of(node, struct drm_dep_job, queue_node);
> > + if (!drm_dep_queue_has_credits(q, job))
> > + goto put_queue;
> > +
> > + job = drm_dep_queue_pop_job(q);
> > + if (!job)
> > + goto put_queue;
> > +
> > + drm_dep_queue_run_job(q, job);
> > + drm_dep_queue_run_job_queue(q);
> > +
> > +put_queue:
> > + mutex_unlock(&q->sched.lock);
> > + drm_dep_queue_put(q);
> > + dma_fence_end_signalling(cookie);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_remove_job() - unlink a job from the pending list and reset TDR
> > + * @q: dep queue owning @job
> > + * @job: job to remove
> > + *
> > + * Splices @job out of @q->job.pending, cancels any pending TDR delayed work,
> > + * and arms the timeout for the new list head (if any).
> > + *
> > + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> > + */
> > +static void drm_dep_queue_remove_job(struct drm_dep_queue *q,
> > + struct drm_dep_job *job)
> > +{
> > + lockdep_assert_held(&q->job.lock);
> > +
> > + list_del_init(&job->pending_link);
> > + cancel_delayed_work(&q->sched.tdr);
> > + drm_queue_start_timeout(q);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_get_finished_job() - dequeue a finished job
> > + * @q: dep queue
> > + *
> > + * Under @q->job.lock checks the head of the pending list for a
> > + * finished dep fence. If found, removes the job from the list,
> > + * cancels the TDR, and re-arms it for the new head.
> > + *
> > + * Context: Process context (workqueue). DMA fence signaling path.
> > + * Return: the finished &drm_dep_job, or NULL if none is ready.
> > + */
> > +static struct drm_dep_job *
> > +drm_dep_queue_get_finished_job(struct drm_dep_queue *q)
> > +{
> > + struct drm_dep_job *job;
> > +
> > + guard(spinlock_irq)(&q->job.lock);
> > +
> > + job = list_first_entry_or_null(&q->job.pending, struct drm_dep_job,
> > + pending_link);
> > + if (job && drm_dep_fence_is_finished(job->dfence))
> > + drm_dep_queue_remove_job(q, job);
> > + else
> > + job = NULL;
> > +
> > + return job;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_put_job_work() - put-job worker
> > + * @work: work item embedded in the dep queue
> > + *
> > + * Drains all finished jobs by calling drm_dep_job_put() in a loop,
> > + * then kicks the run-job worker.
> > + *
> > + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> > + * queue is in zombie state (refcount already zero, async teardown in flight).
> > + *
> > + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> > + * because workqueue is shared with other items in the fence signaling path.
> > + *
> > + * Context: Process context (workqueue). DMA fence signaling path.
> > + */
> > +static void drm_dep_queue_put_job_work(struct work_struct *work)
> > +{
> > + struct drm_dep_queue *q =
> > + container_of(work, struct drm_dep_queue, sched.put_job);
> > + struct drm_dep_job *job;
> > + bool cookie = dma_fence_begin_signalling();
> > +
> > + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> > + if (!drm_dep_queue_get_unless_zero(q)) {
> > + dma_fence_end_signalling(cookie);
> > + return;
> > + }
> > +
> > + while ((job = drm_dep_queue_get_finished_job(q)))
> > + drm_dep_job_put(job);
> > +
> > + drm_dep_queue_run_job_queue(q);
> > +
> > + drm_dep_queue_put(q);
> > + dma_fence_end_signalling(cookie);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_tdr_work() - TDR worker
> > + * @work: work item embedded in the delayed TDR work
> > + *
> > + * Removes the head job from the pending list under @q->job.lock,
> > + * asserts @q->ops->timedout_job is non-NULL, calls it outside the lock,
> > + * requeues the job if %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB, drops the
> > + * queue's job reference on %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED, and always
> > + * restarts the TDR timer after handling the job (unless @q is stopping).
> > + * Any other return value triggers a WARN.
> > + *
> > + * The TDR is never armed when @q->ops->timedout_job is NULL, so firing
> > + * this worker without a timedout_job callback is a driver bug.
> > + *
> > + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> > + * queue is in zombie state (refcount already zero, async teardown in flight).
> > + *
> > + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> > + * because timedout_job() is expected to signal the guilty job's fence as part
> > + * of reset.
> > + *
> > + * Context: Process context (workqueue). DMA fence signaling path.
> > + */
> > +static void drm_dep_queue_tdr_work(struct work_struct *work)
> > +{
> > + struct drm_dep_queue *q =
> > + container_of(work, struct drm_dep_queue, sched.tdr.work);
> > + struct drm_dep_job *job;
> > + bool cookie = dma_fence_begin_signalling();
> > +
> > + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> > + if (!drm_dep_queue_get_unless_zero(q)) {
> > + dma_fence_end_signalling(cookie);
> > + return;
> > + }
> > +
> > + scoped_guard(spinlock_irq, &q->job.lock) {
> > + job = list_first_entry_or_null(&q->job.pending,
> > + struct drm_dep_job,
> > + pending_link);
> > + if (job)
> > + /*
> > + * Remove from pending so it cannot be freed
> > + * concurrently by drm_dep_queue_get_finished_job() or
> > + * drm_dep_job_done().
> > + */
> > + list_del_init(&job->pending_link);
> > + }
> > +
> > + if (job) {
> > + enum drm_dep_timedout_stat status;
> > +
> > + if (WARN_ON(!q->ops->timedout_job)) {
> > + drm_dep_job_put(job);
> > + goto out;
> > + }
> > +
> > + status = q->ops->timedout_job(job);
> > +
> > + switch (status) {
> > + case DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB:
> > + scoped_guard(spinlock_irq, &q->job.lock)
> > + list_add(&job->pending_link, &q->job.pending);
> > + drm_dep_queue_put_job_queue(q);
> > + break;
> > + case DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED:
> > + drm_dep_job_put(job);
> > + break;
> > + default:
> > + WARN(1, "invalid drm_dep_timedout_stat\n");
> > + break;
> > + }
> > + }
> > +
> > +out:
> > + drm_queue_start_timeout_unlocked(q);
> > + drm_dep_queue_put(q);
> > + dma_fence_end_signalling(cookie);
> > +}
> > +
> > +/**
> > + * drm_dep_alloc_submit_wq() - allocate an ordered submit workqueue
> > + * @name: name for the workqueue
> > + * @flags: DRM_DEP_QUEUE_FLAGS_* flags
> > + *
> > + * Allocates an ordered workqueue for job submission with %WQ_MEM_RECLAIM and
> > + * %WQ_MEM_WARN_ON_RECLAIM set, ensuring the workqueue is safe to use from
> > + * memory reclaim context and properly annotated for lockdep taint tracking.
> > + * Adds %WQ_HIGHPRI if %DRM_DEP_QUEUE_FLAGS_HIGHPRI is set. When
> > + * CONFIG_LOCKDEP is enabled, uses a dedicated lockdep map for annotation.
> > + *
> > + * Context: Process context.
> > + * Return: the new &workqueue_struct, or NULL on failure.
> > + */
> > +static struct workqueue_struct *
> > +drm_dep_alloc_submit_wq(const char *name, enum drm_dep_queue_flags flags)
> > +{
> > + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> > +
> > + if (flags & DRM_DEP_QUEUE_FLAGS_HIGHPRI)
> > + wq_flags |= WQ_HIGHPRI;
> > +
> > +#if IS_ENABLED(CONFIG_LOCKDEP)
> > + static struct lockdep_map map = {
> > + .name = "drm_dep_submit_lockdep_map"
> > + };
> > + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> > +#else
> > + return alloc_ordered_workqueue(name, wq_flags);
> > +#endif
> > +}
> > +
> > +/**
> > + * drm_dep_alloc_timeout_wq() - allocate an ordered TDR workqueue
> > + * @name: name for the workqueue
> > + *
> > + * Allocates an ordered workqueue for timeout detection and recovery with
> > + * %WQ_MEM_RECLAIM and %WQ_MEM_WARN_ON_RECLAIM set, ensuring consistent taint
> > + * annotation with the submit workqueue. When CONFIG_LOCKDEP is enabled, uses
> > + * a dedicated lockdep map for annotation.
> > + *
> > + * Context: Process context.
> > + * Return: the new &workqueue_struct, or NULL on failure.
> > + */
> > +static struct workqueue_struct *drm_dep_alloc_timeout_wq(const char *name)
> > +{
> > + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> > +
> > +#if IS_ENABLED(CONFIG_LOCKDEP)
> > + static struct lockdep_map map = {
> > + .name = "drm_dep_timeout_lockdep_map"
> > + };
> > + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> > +#else
> > + return alloc_ordered_workqueue(name, wq_flags);
> > +#endif
> > +}
> > +
> > +/**
> > + * drm_dep_queue_init() - initialize a dep queue
> > + * @q: dep queue to initialize
> > + * @args: initialization arguments
> > + *
> > + * Initializes all fields of @q from @args. If @args->submit_wq is NULL an
> > + * ordered workqueue is allocated and owned by the queue
> > + * (%DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ). If @args->timeout_wq is NULL an
> > + * ordered workqueue is allocated and owned by the queue
> > + * (%DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ). On success the queue holds one kref
> > + * reference and drm_dep_queue_put() must be called to drop this reference
> > + * (i.e., drivers cannot directly free the queue).
> > + *
> > + * When CONFIG_LOCKDEP is enabled, @q->sched.lock is primed against the
> > + * fs_reclaim pseudo-lock so that lockdep can detect any lock ordering
> > + * inversion between @sched.lock and memory reclaim.
> > + *
> > + * Return: 0 on success, %-EINVAL when @args->credit_limit is zero, @args->ops
> > + * is NULL, @args->drm is NULL, @args->ops->run_job is NULL, or when
> > + * @args->submit_wq or @args->timeout_wq is non-NULL but was not allocated with
> > + * %WQ_MEM_WARN_ON_RECLAIM; %-ENOMEM when workqueue allocation fails.
> > + *
> > + * Context: Process context. May allocate memory and create workqueues.
> > + */
> > +int drm_dep_queue_init(struct drm_dep_queue *q,
> > + const struct drm_dep_queue_init_args *args)
> > +{
> > + if (!args->credit_limit || !args->drm || !args->ops ||
> > + !args->ops->run_job)
> > + return -EINVAL;
> > +
> > + if (args->submit_wq && !workqueue_is_reclaim_annotated(args->submit_wq))
> > + return -EINVAL;
> > +
> > + if (args->timeout_wq &&
> > + !workqueue_is_reclaim_annotated(args->timeout_wq))
> > + return -EINVAL;
> > +
> > + memset(q, 0, sizeof(*q));
> > +
> > + q->name = args->name;
> > + q->drm = args->drm;
> > + q->credit.limit = args->credit_limit;
> > + q->job.timeout = args->timeout ? args->timeout : MAX_SCHEDULE_TIMEOUT;
> > +
> > + init_rcu_head(&q->rcu);
> > + INIT_LIST_HEAD(&q->job.pending);
> > + spin_lock_init(&q->job.lock);
> > + spsc_queue_init(&q->job.queue);
> > +
> > + mutex_init(&q->sched.lock);
> > + if (IS_ENABLED(CONFIG_LOCKDEP)) {
> > + fs_reclaim_acquire(GFP_KERNEL);
> > + might_lock(&q->sched.lock);
> > + fs_reclaim_release(GFP_KERNEL);
> > + }
> > +
> > + if (args->submit_wq) {
> > + q->sched.submit_wq = args->submit_wq;
> > + } else {
> > + q->sched.submit_wq = drm_dep_alloc_submit_wq(args->name ?: "drm_dep",
> > + args->flags);
> > + if (!q->sched.submit_wq)
> > + return -ENOMEM;
> > +
> > + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ;
> > + }
> > +
> > + if (args->timeout_wq) {
> > + q->sched.timeout_wq = args->timeout_wq;
> > + } else {
> > + q->sched.timeout_wq = drm_dep_alloc_timeout_wq(args->name ?: "drm_dep");
> > + if (!q->sched.timeout_wq)
> > + goto err_submit_wq;
> > +
> > + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ;
> > + }
> > +
> > + q->sched.flags |= args->flags &
> > + ~(DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ |
> > + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ);
> > +
> > + INIT_DELAYED_WORK(&q->sched.tdr, drm_dep_queue_tdr_work);
> > + INIT_WORK(&q->sched.run_job, drm_dep_queue_run_job_work);
> > + INIT_WORK(&q->sched.put_job, drm_dep_queue_put_job_work);
> > +
> > + q->fence.context = dma_fence_context_alloc(1);
> > +
> > + kref_init(&q->refcount);
> > + q->ops = args->ops;
> > + drm_dev_get(q->drm);
> > +
> > + return 0;
> > +
> > +err_submit_wq:
> > + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> > + destroy_workqueue(q->sched.submit_wq);
> > + mutex_destroy(&q->sched.lock);
> > +
> > + return -ENOMEM;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_init);
> > +
> > +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> > +/**
> > + * drm_dep_queue_push_job_begin() - mark the start of an arm/push critical section
> > + * @q: dep queue the job belongs to
> > + *
> > + * Called at the start of drm_dep_job_arm() and warns if the push context is
> > + * already owned by another task, which would indicate concurrent arm/push on
> > + * the same queue.
> > + *
> > + * No-op when CONFIG_PROVE_LOCKING is disabled.
> > + *
> > + * Context: Process context. DMA fence signaling path.
> > + */
> > +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> > +{
> > + WARN_ON(q->job.push.owner);
> > + q->job.push.owner = current;
> > +}
> > +
> > +/**
> > + * drm_dep_queue_push_job_end() - mark the end of an arm/push critical section
> > + * @q: dep queue the job belongs to
> > + *
> > + * Called at the end of drm_dep_job_push() and warns if the push context is not
> > + * owned by the current task, which would indicate a mismatched begin/end pair
> > + * or a push from the wrong thread.
> > + *
> > + * No-op when CONFIG_PROVE_LOCKING is disabled.
> > + *
> > + * Context: Process context. DMA fence signaling path.
> > + */
> > +void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> > +{
> > + WARN_ON(q->job.push.owner != current);
> > + q->job.push.owner = NULL;
> > +}
> > +#endif
> > +
> > +/**
> > + * drm_dep_queue_assert_teardown_invariants() - assert teardown invariants
> > + * @q: dep queue being torn down
> > + *
> > + * Warns if the pending-job list, the SPSC submission queue, or the credit
> > + * counter is non-zero when called, or if the queue still has a non-zero
> > + * reference count.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_queue_assert_teardown_invariants(struct drm_dep_queue *q)
> > +{
> > + WARN_ON(!list_empty(&q->job.pending));
> > + WARN_ON(spsc_queue_count(&q->job.queue));
> > + WARN_ON(atomic_read(&q->credit.count));
> > + WARN_ON(drm_dep_queue_refcount(q));
> > +}
> > +
> > +/**
> > + * drm_dep_queue_release() - final internal cleanup of a dep queue
> > + * @q: dep queue to clean up
> > + *
> > + * Asserts teardown invariants and destroys internal resources allocated by
> > + * drm_dep_queue_init() that cannot be torn down earlier in the teardown
> > + * sequence. Currently this destroys @q->sched.lock.
> > + *
> > + * Drivers that implement &drm_dep_queue_ops.release **must** call this
> > + * function after removing @q from any internal bookkeeping (e.g. lookup
> > + * tables or lists) but before freeing the memory that contains @q. When
> > + * &drm_dep_queue_ops.release is NULL, drm_dep follows the default teardown
> > + * path and calls this function automatically.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_queue_release(struct drm_dep_queue *q)
> > +{
> > + drm_dep_queue_assert_teardown_invariants(q);
> > + mutex_destroy(&q->sched.lock);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_release);
> > +
> > +/**
> > + * drm_dep_queue_free() - final cleanup of a dep queue
> > + * @q: dep queue to free
> > + *
> > + * Invokes &drm_dep_queue_ops.release if set, in which case the driver is
> > + * responsible for calling drm_dep_queue_release() and freeing @q itself.
> > + * If &drm_dep_queue_ops.release is NULL, calls drm_dep_queue_release()
> > + * and then frees @q with kfree_rcu().
> > + *
> > + * In either case, releases the drm_dev_get() reference taken at init time
> > + * via drm_dev_put(), allowing the owning &drm_device to be unloaded once
> > + * all queues have been freed.
> > + *
> > + * Context: Process context (workqueue), reclaim safe.
> > + */
> > +static void drm_dep_queue_free(struct drm_dep_queue *q)
> > +{
> > + struct drm_device *drm = q->drm;
> > +
> > + if (q->ops->release) {
> > + q->ops->release(q);
> > + } else {
> > + drm_dep_queue_release(q);
> > + kfree_rcu(q, rcu);
> > + }
> > + drm_dev_put(drm);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_free_work() - deferred queue teardown worker
> > + * @work: free_work item embedded in the dep queue
> > + *
> > + * Runs on dep_free_wq. Disables all work items synchronously
> > + * (preventing re-queue and waiting for in-flight instances),
> > + * destroys any owned workqueues, then calls drm_dep_queue_free().
> > + * Running on dep_free_wq ensures destroy_workqueue() is never
> > + * called from within one of the queue's own workers (which would
> > + * deadlock) and that disable_*_sync() cannot deadlock either.
> > + *
> > + * Context: Process context (workqueue), reclaim safe.
> > + */
> > +static void drm_dep_queue_free_work(struct work_struct *work)
> > +{
> > + struct drm_dep_queue *q =
> > + container_of(work, struct drm_dep_queue, free_work);
> > +
> > + drm_dep_queue_assert_teardown_invariants(q);
> > +
> > + disable_delayed_work_sync(&q->sched.tdr);
> > + disable_work_sync(&q->sched.run_job);
> > + disable_work_sync(&q->sched.put_job);
> > +
> > + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ)
> > + destroy_workqueue(q->sched.timeout_wq);
> > +
> > + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> > + destroy_workqueue(q->sched.submit_wq);
> > +
> > + drm_dep_queue_free(q);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_fini() - tear down a dep queue
> > + * @q: dep queue to tear down
> > + *
> > + * Asserts teardown invariants and initiates teardown of @q by queuing the
> > + * deferred free work onto the module-private dep_free_wq workqueue. The work
> > + * item disables any pending TDR and run/put-job work synchronously, destroys
> > + * any workqueues that were allocated by drm_dep_queue_init(), and then releases
> > + * the queue memory.
> > + *
> > + * Running teardown from dep_free_wq ensures that destroy_workqueue() is never
> > + * called from within one of the queue's own workers (e.g. via
> > + * drm_dep_queue_put()), which would deadlock.
> > + *
> > + * Drivers can wait for all outstanding deferred work to complete by waiting
> > + * for the last drm_dev_put() reference on their &drm_device, which is
> > + * released as the final step of each queue's teardown.
> > + *
> > + * Drivers that implement &drm_dep_queue_ops.fini **must** call this
> > + * function after removing @q from any device bookkeeping but before freeing the
> > + * memory that contains @q. When &drm_dep_queue_ops.fini is NULL, drm_dep
> > + * follows the default teardown path and calls this function automatically.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_queue_fini(struct drm_dep_queue *q)
> > +{
> > + drm_dep_queue_assert_teardown_invariants(q);
> > +
> > + INIT_WORK(&q->free_work, drm_dep_queue_free_work);
> > + queue_work(dep_free_wq, &q->free_work);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_fini);
> > +
> > +/**
> > + * drm_dep_queue_get() - acquire a reference to a dep queue
> > + * @q: dep queue to acquire a reference on, or NULL
> > + *
> > + * Return: @q with an additional reference held, or NULL if @q is NULL.
> > + *
> > + * Context: Any context.
> > + */
> > +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q)
> > +{
> > + if (q)
> > + kref_get(&q->refcount);
> > + return q;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_get);
> > +
> > +/**
> > + * __drm_dep_queue_release() - kref release callback for a dep queue
> > + * @kref: kref embedded in the dep queue
> > + *
> > + * Calls &drm_dep_queue_ops.fini if set, otherwise calls
> > + * drm_dep_queue_fini() to initiate deferred teardown.
> > + *
> > + * Context: Any context.
> > + */
> > +static void __drm_dep_queue_release(struct kref *kref)
> > +{
> > + struct drm_dep_queue *q =
> > + container_of(kref, struct drm_dep_queue, refcount);
> > +
> > + if (q->ops->fini)
> > + q->ops->fini(q);
> > + else
> > + drm_dep_queue_fini(q);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_put() - release a reference to a dep queue
> > + * @q: dep queue to release a reference on, or NULL
> > + *
> > + * When the last reference is dropped, calls &drm_dep_queue_ops.fini if set,
> > + * otherwise calls drm_dep_queue_fini(). Final memory release is handled by
> > + * &drm_dep_queue_ops.release (which must call drm_dep_queue_release()) if set,
> > + * or drm_dep_queue_release() followed by kfree_rcu() otherwise.
> > + * Does nothing if @q is NULL.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_queue_put(struct drm_dep_queue *q)
> > +{
> > + if (q)
> > + kref_put(&q->refcount, __drm_dep_queue_release);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_put);
> > +
> > +/**
> > + * drm_dep_queue_stop() - stop a dep queue from processing new jobs
> > + * @q: dep queue to stop
> > + *
> > + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> > + * and @q->job.lock (spinlock_irq), making the flag safe to test from finished
> > + * fence signaling context. Then cancels any in-flight run_job and put_job work
> > + * items. Once stopped, the bypass path and the submit workqueue will not
> > + * dispatch further jobs nor will any jobs be removed from the pending list.
> > + * Call drm_dep_queue_start() to resume processing.
> > + *
> > + * Context: Process context. Waits for in-flight workers to complete.
> > + */
> > +void drm_dep_queue_stop(struct drm_dep_queue *q)
> > +{
> > + scoped_guard(mutex, &q->sched.lock) {
> > + scoped_guard(spinlock_irq, &q->job.lock)
> > + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> > + }
> > + cancel_work_sync(&q->sched.run_job);
> > + cancel_work_sync(&q->sched.put_job);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_stop);
> > +
> > +/**
> > + * drm_dep_queue_start() - resume a stopped dep queue
> > + * @q: dep queue to start
> > + *
> > + * Clears %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> > + * and @q->job.lock (spinlock_irq), making the flag safe to test from IRQ
> > + * context. Then re-queues the run_job and put_job work items so that any jobs
> > + * pending since the queue was stopped are processed. Must only be called after
> > + * drm_dep_queue_stop().
> > + *
> > + * Context: Process context.
> > + */
> > +void drm_dep_queue_start(struct drm_dep_queue *q)
> > +{
> > + scoped_guard(mutex, &q->sched.lock) {
> > + scoped_guard(spinlock_irq, &q->job.lock)
> > + drm_dep_queue_flags_clear(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> > + }
> > + drm_dep_queue_run_job_queue(q);
> > + drm_dep_queue_put_job_queue(q);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_start);
> > +
> > +/**
> > + * drm_dep_queue_trigger_timeout() - trigger the TDR immediately for
> > + * all pending jobs
> > + * @q: dep queue to trigger timeout on
> > + *
> > + * Sets @q->job.timeout to 1 and arms the TDR delayed work with a one-jiffy
> > + * delay, causing it to fire almost immediately without hot-spinning at zero
> > + * delay. This is used to force-expire any pending jobs on the queue, for
> > + * example when the device is being torn down or has encountered an
> > + * unrecoverable error.
> > + *
> > + * When using this function, it is suggested that the first timedout_job call
> > + * kick the queue off the hardware and signal all pending job fences, and that
> > + * subsequent calls continue to signal all pending job fences.
> > + *
> > + * Has no effect if the pending list is empty.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q)
> > +{
> > + guard(spinlock_irqsave)(&q->job.lock);
> > + q->job.timeout = 1;
> > + drm_queue_start_timeout(q);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_trigger_timeout);
> > +
> > +/**
> > + * drm_dep_queue_cancel_tdr_sync() - cancel any pending TDR and wait
> > + * for it to finish
> > + * @q: dep queue whose TDR to cancel
> > + *
> > + * Cancels the TDR delayed work item if it has not yet started, and waits for
> > + * it to complete if it is already running. After this call returns, the TDR
> > + * worker is guaranteed not to be executing and will not fire again until
> > + * explicitly rearmed (e.g. via drm_dep_queue_resume_timeout() or by a new
> > + * job being submitted).
> > + *
> > + * Useful during error recovery or queue teardown when the caller needs to
> > + * know that no timeout handling races with its own reset logic.
> > + *
> > + * Context: Process context. May sleep waiting for the TDR worker to finish.
> > + */
> > +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q)
> > +{
> > + cancel_delayed_work_sync(&q->sched.tdr);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_cancel_tdr_sync);
> > +
> > +/**
> > + * drm_dep_queue_resume_timeout() - restart the TDR timer with the
> > + * configured timeout
> > + * @q: dep queue to resume the timeout for
> > + *
> > + * Restarts the TDR delayed work using @q->job.timeout. Called after device
> > + * recovery to give pending jobs a fresh full timeout window. Has no effect
> > + * if the pending list is empty.
> > + *
> > + * Context: Any context.
> > + */
> > +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q)
> > +{
> > + drm_queue_start_timeout_unlocked(q);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_resume_timeout);
> > +
> > +/**
> > + * drm_dep_queue_is_stopped() - check whether a dep queue is stopped
> > + * @q: dep queue to check
> > + *
> > + * Return: true if %DRM_DEP_QUEUE_FLAGS_STOPPED is set on @q, false otherwise.
> > + *
> > + * Context: Any context.
> > + */
> > +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q)
> > +{
> > + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_STOPPED);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_is_stopped);
> > +
> > +/**
> > + * drm_dep_queue_kill() - kill a dep queue and flush all pending jobs
> > + * @q: dep queue to kill
> > + *
> > + * Sets %DRM_DEP_QUEUE_FLAGS_KILLED on @q under @q->sched.lock. If a
> > + * dependency fence is currently being waited on, its callback is removed and
> > + * the run-job worker is kicked immediately so that the blocked job drains
> > + * without waiting.
> > + *
> > + * Once killed, drm_dep_queue_job_dependency() returns NULL for all jobs,
> > + * bypassing dependency waits so that every queued job drains through
> > + * &drm_dep_queue_ops.run_job without blocking.
> > + *
> > + * The &drm_dep_queue_ops.run_job callback is guaranteed to be called for every
> > + * job that was pushed before or after drm_dep_queue_kill(), even during queue
> > + * teardown. Drivers should use this guarantee to perform any necessary
> > + * bookkeeping cleanup without executing the actual backend operation when the
> > + * queue is killed.
> > + *
> > + * Unlike drm_dep_queue_stop(), killing is one-way: there is no corresponding
> > + * start function.
> > + *
> > + * **Driver safety requirement**
> > + *
> > + * drm_dep_queue_kill() must only be called once the driver can guarantee that
> > + * no job in the queue will touch memory associated with any of its fences
> > + * (i.e., the queue has been removed from the device and will never be put back
> > + * on).
> > + *
> > + * Context: Process context.
> > + */
> > +void drm_dep_queue_kill(struct drm_dep_queue *q)
> > +{
> > + scoped_guard(mutex, &q->sched.lock) {
> > + struct dma_fence *fence;
> > +
> > + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_KILLED);
> > +
> > + /*
> > + * Holding &q->sched.lock guarantees that the run-job work item
> > + * cannot drop its reference to q->dep.fence concurrently, so
> > + * reading q->dep.fence here is safe.
> > + */
> > + fence = READ_ONCE(q->dep.fence);
> > + if (fence && dma_fence_remove_callback(fence, &q->dep.cb))
> > + drm_dep_queue_remove_dependency(q, fence);
> > + }
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_kill);
> > +
> > +/**
> > + * drm_dep_queue_submit_wq() - retrieve the submit workqueue of a dep queue
> > + * @q: dep queue whose workqueue to retrieve
> > + *
> > + * Drivers may use this to queue their own work items alongside the queue's
> > + * internal run-job and put-job workers — for example to process incoming
> > + * messages in the same serialisation domain.
> > + *
> > + * Prefer drm_dep_queue_work_enqueue() when the only need is to enqueue a
> > + * work item, as it additionally checks the stopped state. Use this accessor
> > + * when the workqueue itself is required (e.g. for alloc_ordered_workqueue
> > + * replacement or drain_workqueue calls).
> > + *
> > + * Context: Any context.
> > + * Return: the &workqueue_struct used by @q for job submission.
> > + */
> > +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q)
> > +{
> > + return q->sched.submit_wq;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_submit_wq);
> > +
> > +/**
> > + * drm_dep_queue_timeout_wq() - retrieve the timeout workqueue of a dep queue
> > + * @q: dep queue whose workqueue to retrieve
> > + *
> > + * Returns the workqueue used by @q to run TDR (timeout detection and recovery)
> > + * work. Drivers may use this to queue their own timeout-domain work items, or
> > + * to call drain_workqueue() when tearing down and needing to ensure all pending
> > + * timeout callbacks have completed before proceeding.
> > + *
> > + * Context: Any context.
> > + * Return: the &workqueue_struct used by @q for TDR work.
> > + */
> > +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q)
> > +{
> > + return q->sched.timeout_wq;
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_timeout_wq);
> > +
> > +/**
> > + * drm_dep_queue_work_enqueue() - queue work on the dep queue's submit workqueue
> > + * @q: dep queue to enqueue work on
> > + * @work: work item to enqueue
> > + *
> > + * Queues @work on @q->sched.submit_wq if the queue is not stopped. This
> > + * allows drivers to schedule custom work items that run serialised with the
> > + * queue's own run-job and put-job workers.
> > + *
> > + * Return: true if the work was queued, false if the queue is stopped or the
> > + * work item was already pending.
> > + *
> > + * Context: Any context.
> > + */
> > +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> > + struct work_struct *work)
> > +{
> > + if (drm_dep_queue_is_stopped(q))
> > + return false;
> > +
> > + return queue_work(q->sched.submit_wq, work);
> > +}
> > +EXPORT_SYMBOL(drm_dep_queue_work_enqueue);
> > +
> > +/**
> > + * drm_dep_queue_can_job_bypass() - test whether a job can skip the SPSC queue
> > + * @q: dep queue
> > + * @job: job to test
> > + *
> > + * A job may bypass the submit workqueue and run inline on the calling thread
> > + * if all of the following hold:
> > + *
> > + * - %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set on the queue
> > + * - the queue is not stopped
> > + * - the SPSC submission queue is empty (no other jobs waiting)
> > + * - the queue has enough credits for @job
> > + * - @job has no unresolved dependency fences
> > + *
> > + * Must be called under @q->sched.lock.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock (a mutex).
> > + * Return: true if the job may be run inline, false otherwise.
> > + */
> > +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> > + struct drm_dep_job *job)
> > +{
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + return q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED &&
> > + !drm_dep_queue_is_stopped(q) &&
> > + !spsc_queue_count(&q->job.queue) &&
> > + drm_dep_queue_has_credits(q, job) &&
> > + xa_empty(&job->dependencies);
> > +}
> > +
> > +/**
> > + * drm_dep_job_done() - mark a job as complete
> > + * @job: the job that finished
> > + * @result: error code to propagate, or 0 for success
> > + *
> > + * Subtracts @job->credits from the queue credit counter, then signals the
> > + * job's dep fence with @result.
> > + *
> > + * When %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set (IRQ-safe path), a
> > + * temporary extra reference is taken on @job before signalling the fence.
> > + * This prevents a concurrent put-job worker — which may be woken by timeouts or
> > + * queue starting — from freeing the job while this function still holds a
> > + * pointer to it. The extra reference is released at the end of the function.
> > + *
> > + * After signalling, the IRQ-safe path removes the job from the pending list
> > + * under @q->job.lock, provided the queue is not stopped. Removal is skipped
> > + * when the queue is stopped so that drm_dep_queue_for_each_pending_job() can
> > + * iterate the list without racing with the completion path. On successful
> > + * removal, kicks the run-job worker so the next queued job can be dispatched
> > + * immediately, then drops the job reference. If the job was already removed
> > + * by TDR, or removal was skipped because the queue is stopped, kicks the
> > + * put-job worker instead to allow the deferred put to complete.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_job_done(struct drm_dep_job *job, int result)
> > +{
> > + struct drm_dep_queue *q = job->q;
> > + bool irq_safe = drm_dep_queue_is_job_put_irq_safe(q), removed = false;
> > +
> > + /*
> > + * Local ref to ensure the put worker—which may be woken by external
> > + * forces (TDR, driver-side queue starting)—doesn't free the job behind
> > + * this function's back after drm_dep_fence_done() while it is still on
> > + * the pending list.
> > + */
> > + if (irq_safe)
> > + drm_dep_job_get(job);
> > +
> > + atomic_sub(job->credits, &q->credit.count);
> > + drm_dep_fence_done(job->dfence, result);
> > +
> > + /* Only safe to touch job after fence signal if we have a local ref. */
> > +
> > + if (irq_safe) {
> > + scoped_guard(spinlock_irqsave, &q->job.lock) {
> > + removed = !list_empty(&job->pending_link) &&
> > + !drm_dep_queue_is_stopped(q);
> > +
> > + /* Guard against TDR operating on job */
> > + if (removed)
> > + drm_dep_queue_remove_job(q, job);
> > + }
> > + }
> > +
> > + if (removed) {
> > + drm_dep_queue_run_job_queue(q);
> > + drm_dep_job_put(job);
> > + } else {
> > + drm_dep_queue_put_job_queue(q);
> > + }
> > +
> > + if (irq_safe)
> > + drm_dep_job_put(job);
> > +}
> > +
> > +/**
> > + * drm_dep_job_done_cb() - dma_fence callback to complete a job
> > + * @f: the hardware fence that signalled
> > + * @cb: fence callback embedded in the dep job
> > + *
> > + * Extracts the job from @cb and calls drm_dep_job_done() with
> > + * @f->error as the result.
> > + *
> > + * Context: Any context, with IRQs disabled. May not sleep.
> > + */
> > +static void drm_dep_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> > +{
> > + struct drm_dep_job *job = container_of(cb, struct drm_dep_job, cb);
> > +
> > + drm_dep_job_done(job, f->error);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_run_job() - submit a job to hardware and set up
> > + * completion tracking
> > + * @q: dep queue
> > + * @job: job to run
> > + *
> > + * Accounts @job->credits against the queue, appends the job to the pending
> > + * list, then calls @q->ops->run_job(). The TDR timer is started only when
> > + * @job is the first entry on the pending list; subsequent jobs added while
> > + * a TDR is already in flight do not reset the timer (which would otherwise
> > + * extend the deadline for the already-running head job). Stores the returned
> > + * hardware fence as the parent of the job's dep fence, then installs
> > + * drm_dep_job_done_cb() on it. If the hardware fence is already signalled
> > + * (%-ENOENT from dma_fence_add_callback()) or run_job() returns NULL/error,
> > + * the job is completed immediately. Must be called under @q->sched.lock.
> > + *
> > + * Context: Process context. Must hold @q->sched.lock (a mutex). DMA fence
> > + * signaling path.
> > + */
> > +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> > +{
> > + struct dma_fence *fence;
> > + int r;
> > +
> > + lockdep_assert_held(&q->sched.lock);
> > +
> > + drm_dep_job_get(job);
> > + atomic_add(job->credits, &q->credit.count);
> > +
> > + scoped_guard(spinlock_irq, &q->job.lock) {
> > + bool first = list_empty(&q->job.pending);
> > +
> > + list_add_tail(&job->pending_link, &q->job.pending);
> > + if (first)
> > + drm_queue_start_timeout(q);
> > + }
> > +
> > + fence = q->ops->run_job(job);
> > + drm_dep_fence_set_parent(job->dfence, fence);
> > +
> > + if (!IS_ERR_OR_NULL(fence)) {
> > + r = dma_fence_add_callback(fence, &job->cb,
> > + drm_dep_job_done_cb);
> > + if (r == -ENOENT)
> > + drm_dep_job_done(job, fence->error);
> > + else if (r)
> > + drm_err(q->drm, "fence add callback failed (%d)\n", r);
> > + dma_fence_put(fence);
> > + } else {
> > + drm_dep_job_done(job, IS_ERR(fence) ? PTR_ERR(fence) : 0);
> > + }
> > +
> > + /*
> > + * Drop all input dependency fences now, in process context, before the
> > + * final job put. Once the job is on the pending list its last reference
> > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > + * xa_destroy() would be unsafe.
> > + */
>
> I assume that “pending” is the list of jobs that have been handed to the driver
> via ops->run_job()?
>
> Can’t this problem be solved by not doing anything inside a dma_fence callback
> other than scheduling the queue worker?
>
> > + drm_dep_job_drop_dependencies(job);
> > + drm_dep_job_put(job);
> > +}
> > +
> > +/**
> > + * drm_dep_queue_push_job() - enqueue a job on the SPSC submission queue
> > + * @q: dep queue
> > + * @job: job to push
> > + *
> > + * Pushes @job onto the SPSC queue. If the queue was previously empty
> > + * (i.e. this is the first pending job), kicks the run_job worker so it
> > + * processes the job promptly without waiting for the next wakeup.
> > + * May be called with or without @q->sched.lock held.
> > + *
> > + * Context: Any context. DMA fence signaling path.
> > + */
> > +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> > +{
> > + /*
> > + * spsc_queue_push() returns true if the queue was previously empty,
> > + * i.e. this is the first pending job. Kick the run_job worker so it
> > + * picks it up without waiting for the next wakeup.
> > + */
> > + if (spsc_queue_push(&q->job.queue, &job->queue_node))
> > + drm_dep_queue_run_job_queue(q);
> > +}
> > +
> > +/**
> > + * drm_dep_init() - module initialiser
> > + *
> > + * Allocates the module-private dep_free_wq unbound workqueue used for
> > + * deferred queue teardown.
> > + *
> > + * Return: 0 on success, %-ENOMEM if workqueue allocation fails.
> > + */
> > +static int __init drm_dep_init(void)
> > +{
> > + dep_free_wq = alloc_workqueue("drm_dep_free", WQ_UNBOUND, 0);
> > + if (!dep_free_wq)
> > + return -ENOMEM;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * drm_dep_exit() - module exit
> > + *
> > + * Destroys the module-private dep_free_wq workqueue.
> > + */
> > +static void __exit drm_dep_exit(void)
> > +{
> > + destroy_workqueue(dep_free_wq);
> > + dep_free_wq = NULL;
> > +}
> > +
> > +module_init(drm_dep_init);
> > +module_exit(drm_dep_exit);
> > +
> > +MODULE_DESCRIPTION("DRM dependency queue");
> > +MODULE_LICENSE("Dual MIT/GPL");
> > diff --git a/drivers/gpu/drm/dep/drm_dep_queue.h b/drivers/gpu/drm/dep/drm_dep_queue.h
> > new file mode 100644
> > index 000000000000..e5c217a3fab5
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_queue.h
> > @@ -0,0 +1,31 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#ifndef _DRM_DEP_QUEUE_H_
> > +#define _DRM_DEP_QUEUE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +struct drm_dep_job;
> > +struct drm_dep_queue;
> > +
> > +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> > + struct drm_dep_job *job);
> > +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> > +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> > +
> > +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> > +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q);
> > +void drm_dep_queue_push_job_end(struct drm_dep_queue *q);
> > +#else
> > +static inline void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> > +{
> > +}
> > +static inline void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> > +{
> > +}
> > +#endif
> > +
> > +#endif /* _DRM_DEP_QUEUE_H_ */
> > diff --git a/include/drm/drm_dep.h b/include/drm/drm_dep.h
> > new file mode 100644
> > index 000000000000..615926584506
> > --- /dev/null
> > +++ b/include/drm/drm_dep.h
> > @@ -0,0 +1,597 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright 2015 Advanced Micro Devices, Inc.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > + * copy of this software and associated documentation files (the "Software"),
> > + * to deal in the Software without restriction, including without limitation
> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > + * and/or sell copies of the Software, and to permit persons to whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > + * OTHER DEALINGS IN THE SOFTWARE.
> > + *
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +#ifndef _DRM_DEP_H_
> > +#define _DRM_DEP_H_
> > +
> > +#include <drm/spsc_queue.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/xarray.h>
> > +#include <linux/workqueue.h>
> > +
> > +enum dma_resv_usage;
> > +struct dma_resv;
> > +struct drm_dep_fence;
> > +struct drm_dep_job;
> > +struct drm_dep_queue;
> > +struct drm_file;
> > +struct drm_gem_object;
> > +
> > +/**
> > + * enum drm_dep_timedout_stat - return value of &drm_dep_queue_ops.timedout_job
> > + * @DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED: driver signaled the job's finished
> > + * fence during reset; drm_dep may safely drop its reference to the job.
> > + * @DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB: timeout was a false alarm; reinsert the
> > + * job at the head of the pending list so it can complete normally.
> > + */
> > +enum drm_dep_timedout_stat {
> > + DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED,
> > + DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB,
> > +};
> > +
> > +/**
> > + * struct drm_dep_queue_ops - driver callbacks for a dep queue
> > + */
> > +struct drm_dep_queue_ops {
> > + /**
> > + * @run_job: submit the job to hardware. Returns the hardware completion
> > + * fence (with a reference held for the scheduler), NULL on synchronous
> > + * completion, or an ERR_PTR on error.
> > + */
> > + struct dma_fence *(*run_job)(struct drm_dep_job *job);
> > +
> > + /**
> > + * @timedout_job: called when the TDR fires for the head job. Must stop
> > + * the hardware, then return %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
> > + * job's fence was signalled during reset, or
> > + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB if the timeout was spurious or
> > + * signalling was otherwise delayed, and the job should be re-inserted
> > + * at the head of the pending list. Any other value triggers a WARN.
> > + */
> > + enum drm_dep_timedout_stat (*timedout_job)(struct drm_dep_job *job);
> > +
> > + /**
> > + * @release: called when the last kref on the queue is dropped and
> > + * drm_dep_queue_fini() has completed. The driver is responsible for
> > + * removing @q from any internal bookkeeping, calling
> > + * drm_dep_queue_release(), and then freeing the memory containing @q
> > + * (e.g. via kfree_rcu() using @q->rcu). If NULL, drm_dep calls
> > + * drm_dep_queue_release() and frees @q automatically via kfree_rcu().
> > + * Use this when the queue is embedded in a larger structure.
> > + */
> > + void (*release)(struct drm_dep_queue *q);
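>
> For readers following the embedded-queue case, here is how I read the
> teardown contract as a sketch (the my_* names are made up for
> illustration):
>
>	struct my_exec_queue {
>		struct drm_dep_queue q;		/* embedded */
>		/* driver state ... */
>	};
>
>	static void my_queue_release(struct drm_dep_queue *q)
>	{
>		struct my_exec_queue *eq = container_of(q, typeof(*eq), q);
>
>		/* drop driver bookkeeping referencing eq first */
>		drm_dep_queue_release(q);	/* destroy internal resources */
>		kfree_rcu(eq, q.rcu);		/* @rcu is the one field we may touch */
>	}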
> > +
> > + /**
> > + * @fini: if set, called instead of drm_dep_queue_fini() when the last
> > + * kref is dropped. The driver is responsible for calling
> > + * drm_dep_queue_fini() itself after it is done with the queue. Use this
> > + * when additional teardown logic must run before fini (e.g., cleanup
> > + * firmware resources associated with the queue).
> > + */
> > + void (*fini)(struct drm_dep_queue *q);
> > +};
> > +
> > +/**
> > + * enum drm_dep_queue_flags - flags for &drm_dep_queue and
> > + * &drm_dep_queue_init_args
> > + *
> > + * Flags are divided into three categories:
> > + *
> > + * - **Private static**: set internally at init time and never changed.
> > + * Drivers must not read or write these.
> > + * %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ,
> > + * %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ.
> > + *
> > + * - **Public dynamic**: toggled at runtime by drivers via accessors.
> > + * Any modification must be performed under &drm_dep_queue.sched.lock.
>
> Can’t enforce that in C.
>
> > + * Accessor functions provide unlocked (possibly stale) reads.
> > + * %DRM_DEP_QUEUE_FLAGS_STOPPED,
> > + * %DRM_DEP_QUEUE_FLAGS_KILLED.
>
> > + *
> > + * - **Public static**: supplied by the driver in
> > + * &drm_dep_queue_init_args.flags at queue creation time and not modified
> > + * thereafter.
>
> Same here.
>
> > + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
> > + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI,
> > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE.
>
> > + *
> > + * @DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ: (private, static) submit workqueue was
> > + * allocated by drm_dep_queue_init() and will be destroyed by
> > + * drm_dep_queue_fini().
> > + * @DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ: (private, static) timeout workqueue
> > + * was allocated by drm_dep_queue_init() and will be destroyed by
> > + * drm_dep_queue_fini().
> > + * @DRM_DEP_QUEUE_FLAGS_STOPPED: (public, dynamic) the queue is stopped and
> > + * will not dispatch new jobs or remove jobs from the pending list, dropping
> > + * the drm_dep-owned reference. Set by drm_dep_queue_stop(), cleared by
> > + * drm_dep_queue_start().
> > + * @DRM_DEP_QUEUE_FLAGS_KILLED: (public, dynamic) the queue has been killed
> > + * via drm_dep_queue_kill(). Any active dependency wait is cancelled
> > + * immediately. Jobs continue to flow through run_job for bookkeeping
> > + * cleanup, but dependency waiting is skipped so that queued work drains
> > + * as quickly as possible.
> > + * @DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED: (public, static) the queue supports
> > + * the bypass path where eligible jobs skip the SPSC queue and run inline.
> > + * @DRM_DEP_QUEUE_FLAGS_HIGHPRI: (public, static) the submit workqueue owned
> > + * by the queue is created with %WQ_HIGHPRI, causing run-job and put-job
> > + * workers to execute at elevated priority. Only privileged clients (e.g.
> > + * drivers managing time-critical or real-time GPU contexts) should request
> > + * this flag; granting it to unprivileged userspace would allow priority
> > + * inversion attacks.
> > + * This flag is ignored when an external
> > + * &drm_dep_queue_init_args.submit_wq is provided.
> > + * @DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE: (public, static) when set,
> > + * drm_dep_job_done() may be called from hardirq context (e.g. from a
> > + * hardware-signalled dma_fence callback). drm_dep_job_done() will directly
> > + * dequeue the job and call drm_dep_job_put() without deferring to a
> > + * workqueue. The driver's &drm_dep_job_ops.release callback must therefore
> > + * be safe to invoke from IRQ context.
> > + */
> > +enum drm_dep_queue_flags {
> > + DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ = BIT(0),
> > + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ = BIT(1),
> > + DRM_DEP_QUEUE_FLAGS_STOPPED = BIT(2),
> > + DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED = BIT(3),
> > + DRM_DEP_QUEUE_FLAGS_HIGHPRI = BIT(4),
> > + DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE = BIT(5),
> > + DRM_DEP_QUEUE_FLAGS_KILLED = BIT(6),
> > +};
> > +
> > +/**
> > + * struct drm_dep_queue - a dependency-tracked GPU submission queue
> > + *
> > + * Combines the role of &drm_gpu_scheduler and &drm_sched_entity into a single
> > + * object. Each queue owns a submit workqueue (or borrows one), a timeout
> > + * workqueue, an SPSC submission queue, and a pending-job list used for TDR.
> > + *
> > + * Initialise with drm_dep_queue_init(), tear down with drm_dep_queue_fini().
> > + * Reference counted via drm_dep_queue_get() / drm_dep_queue_put().
> > + *
> > + * All fields are **opaque to drivers**. Do not read or write any field
>
> Can’t enforce this in C.
>
> > + * directly; use the provided helper functions instead. The sole exception
> > + * is @rcu, which drivers may pass to kfree_rcu() when the queue is embedded
> > + * inside a larger driver-managed structure and the &drm_dep_queue_ops.release
> > + * vfunc performs an RCU-deferred free.
>
> > + */
> > +struct drm_dep_queue {
> > + /** @ops: driver callbacks, set at init time. */
> > + const struct drm_dep_queue_ops *ops;
> > + /** @name: human-readable name used for workqueue and fence naming. */
> > + const char *name;
> > + /** @drm: owning DRM device; a drm_dev_get() reference is held for the
> > + * lifetime of the queue to prevent module unload while queues are live.
> > + */
> > + struct drm_device *drm;
> > + /** @refcount: reference count; use drm_dep_queue_get/put(). */
> > + struct kref refcount;
> > + /**
> > + * @free_work: deferred teardown work queued unconditionally by
> > + * drm_dep_queue_fini() onto the module-private dep_free_wq. The work
> > + * item disables pending workers synchronously and destroys any owned
> > + * workqueues before releasing the queue memory and dropping the
> > + * drm_dev_get() reference. Running on dep_free_wq ensures
> > + * destroy_workqueue() is never called from within one of the queue's
> > + * own workers.
> > + */
> > + struct work_struct free_work;
> > + /**
> > + * @rcu: RCU head for deferred freeing.
> > + *
> > + * This is the **only** field drivers may access directly. When the
>
> We can enforce this in Rust at compile time.
>
> > + * queue is embedded in a larger structure, implement
> > + * &drm_dep_queue_ops.release, call drm_dep_queue_release() to destroy
> > + * internal resources, then pass this field to kfree_rcu() so that any
> > + * in-flight RCU readers referencing the queue's dma_fence timeline name
> > + * complete before the memory is returned. All other fields must be
> > + * accessed through the provided helpers.
> > + */
> > + struct rcu_head rcu;
> > +
> > + /** @sched: scheduling and workqueue state. */
> > + struct {
> > + /** @sched.submit_wq: ordered workqueue for run/put-job work. */
> > + struct workqueue_struct *submit_wq;
> > + /** @sched.timeout_wq: workqueue for the TDR delayed work. */
> > + struct workqueue_struct *timeout_wq;
> > + /**
> > + * @sched.run_job: work item that dispatches the next queued
> > + * job.
> > + */
> > + struct work_struct run_job;
> > + /** @sched.put_job: work item that frees finished jobs. */
> > + struct work_struct put_job;
> > + /** @sched.tdr: delayed work item for timeout/reset (TDR). */
> > + struct delayed_work tdr;
> > + /**
> > + * @sched.lock: mutex serialising job dispatch, bypass
> > + * decisions, stop/start, and flag updates.
> > + */
> > + struct mutex lock;
> > + /**
> > + * @sched.flags: bitmask of &enum drm_dep_queue_flags.
> > + * Any modification after drm_dep_queue_init() must be
> > + * performed under @sched.lock.
> > + */
> > + enum drm_dep_queue_flags flags;
> > + } sched;
> > +
> > + /** @job: pending-job tracking state. */
> > + struct {
> > + /**
> > + * @job.pending: list of jobs that have been dispatched to
> > + * hardware and not yet freed. Protected by @job.lock.
> > + */
> > + struct list_head pending;
> > + /**
> > + * @job.queue: SPSC queue of jobs waiting to be dispatched.
> > + * Producers push via drm_dep_queue_push_job(); the run_job
> > + * work item pops from the consumer side.
> > + */
> > + struct spsc_queue queue;
> > + /**
> > + * @job.lock: spinlock protecting @job.pending, TDR start, and
> > + * the %DRM_DEP_QUEUE_FLAGS_STOPPED flag. Always acquired with
> > + * irqsave (spin_lock_irqsave / spin_unlock_irqrestore) to
> > + * support %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE queues where
> > + * drm_dep_job_done() may run from hardirq context.
> > + */
> > + spinlock_t lock;
> > + /**
> > + * @job.timeout: per-job TDR timeout in jiffies.
> > + * %MAX_SCHEDULE_TIMEOUT means no timeout.
> > + */
> > + long timeout;
> > +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> > + /**
> > + * @job.push: lockdep annotation tracking the arm-to-push
> > + * critical section.
> > + */
> > + struct {
> > + /**
> > + * @job.push.owner: task that currently holds the push
> > + * context, used to assert single-owner invariants.
> > + * NULL when idle.
> > + */
> > + struct task_struct *owner;
> > + } push;
> > +#endif
> > + } job;
> > +
> > + /** @credit: hardware credit accounting. */
> > + struct {
> > + /** @credit.limit: maximum credits the queue can hold. */
> > + u32 limit;
> > + /** @credit.count: credits currently in flight (atomic). */
> > + atomic_t count;
> > + } credit;
> > +
> > + /** @dep: current blocking dependency for the head SPSC job. */
> > + struct {
> > + /**
> > + * @dep.fence: fence being waited on before the head job can
> > + * run. NULL when no dependency is pending.
> > + */
> > + struct dma_fence *fence;
> > + /**
> > + * @dep.removed_fence: dependency fence whose callback has been
> > + * removed. The run-job worker must drop its reference to this
> > + * fence before proceeding to call run_job.
>
> We can enforce this in Rust automatically.
>
> > + */
> > + struct dma_fence *removed_fence;
> > + /** @dep.cb: callback installed on @dep.fence. */
> > + struct dma_fence_cb cb;
> > + } dep;
> > +
> > + /** @fence: fence context and sequence number state. */
> > + struct {
> > + /**
> > + * @fence.seqno: next sequence number to assign, incremented
> > + * each time a job is armed.
> > + */
> > + u32 seqno;
> > + /**
> > + * @fence.context: base DMA fence context allocated at init
> > + * time. Finished fences use this context.
> > + */
> > + u64 context;
> > + } fence;
> > +};
> > +
> > +/**
> > + * struct drm_dep_queue_init_args - arguments for drm_dep_queue_init()
> > + */
> > +struct drm_dep_queue_init_args {
> > + /** @ops: driver callbacks; must not be NULL. */
> > + const struct drm_dep_queue_ops *ops;
> > + /** @name: human-readable name for workqueues and fence timelines. */
> > + const char *name;
> > + /** @drm: owning DRM device. A drm_dev_get() reference is taken at
> > + * queue init and released when the queue is freed, preventing module
> > + * unload while any queue is still alive.
> > + */
> > + struct drm_device *drm;
> > + /**
> > + * @submit_wq: workqueue for job dispatch. If NULL, an ordered
> > + * workqueue is allocated and owned by the queue. If non-NULL, the
> > + * workqueue must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> > + * drm_dep_queue_init() returns %-EINVAL otherwise.
> > + */
> > + struct workqueue_struct *submit_wq;
> > + /**
> > + * @timeout_wq: workqueue for TDR. If NULL, an ordered workqueue
> > + * is allocated and owned by the queue. If non-NULL, the workqueue
> > + * must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> > + * drm_dep_queue_init() returns %-EINVAL otherwise.
> > + */
> > + struct workqueue_struct *timeout_wq;
> > + /** @credit_limit: maximum hardware credits; must be non-zero. */
> > + u32 credit_limit;
> > + /**
> > + * @timeout: per-job TDR timeout in jiffies. Zero means no timeout
> > + * (%MAX_SCHEDULE_TIMEOUT is used internally).
> > + */
> > + long timeout;
> > + /**
> > + * @flags: initial queue flags. %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ
> > + * and %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ are managed internally
> > + * and will be ignored if set here. Setting
> > + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI requests a high-priority submit
> > + * workqueue; drivers must only set this for privileged clients.
> > + */
> > + enum drm_dep_queue_flags flags;
> > +};
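>
> An init sketch for reference (values are arbitrary, only to show the
> shape of the args):
>
>	struct drm_dep_queue_init_args args = {
>		.ops = &my_queue_ops,
>		.name = "my-queue",
>		.drm = drm,
>		.submit_wq = NULL,	/* NULL: ordered wq allocated and owned */
>		.timeout_wq = NULL,
>		.credit_limit = 16,
>		.timeout = 5 * HZ,
>	};
>	int err = drm_dep_queue_init(q, &args);
>
> and, as documented above, any externally supplied submit_wq/timeout_wq
> must have been created with %WQ_MEM_RECLAIM_TAINT or init returns -EINVAL.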
> > +
> > +/**
> > + * struct drm_dep_job_ops - driver callbacks for a dep job
> > + */
> > +struct drm_dep_job_ops {
> > + /**
> > + * @release: called when the last reference to the job is dropped.
> > + *
> > + * If set, the driver is responsible for freeing the job. If NULL,
>
> And if they don’t?
>
> By the way, we can also enforce this in Rust.
>
> > + * drm_dep_job_put() will call kfree() on the job directly.
> > + */
> > + void (*release)(struct drm_dep_job *job);
> > +};
> > +
> > +/**
> > + * struct drm_dep_job - a unit of work submitted to a dep queue
> > + *
> > + * All fields are **opaque to drivers**. Do not read or write any field
> > + * directly; use the provided helper functions instead.
> > + */
> > +struct drm_dep_job {
> > + /** @ops: driver callbacks for this job. */
> > + const struct drm_dep_job_ops *ops;
> > + /** @refcount: reference count, managed by drm_dep_job_get/put(). */
> > + struct kref refcount;
> > + /**
> > + * @dependencies: xarray of &dma_fence dependencies before the job can
> > + * run.
> > + */
> > + struct xarray dependencies;
> > + /** @q: the queue this job is submitted to. */
> > + struct drm_dep_queue *q;
> > + /** @queue_node: SPSC queue linkage for pending submission. */
> > + struct spsc_node queue_node;
> > + /**
> > + * @pending_link: list entry in the queue's pending job list. Protected
> > + * by @job.q->job.lock.
> > + */
> > + struct list_head pending_link;
> > + /** @dfence: finished fence for this job. */
> > + struct drm_dep_fence *dfence;
> > + /** @cb: fence callback used to watch for dependency completion. */
> > + struct dma_fence_cb cb;
> > + /** @credits: number of credits this job consumes from the queue. */
> > + u32 credits;
> > + /**
> > + * @last_dependency: index into @dependencies of the next fence to
> > + * check. Advanced by drm_dep_queue_job_dependency() as each
> > + * dependency is consumed.
> > + */
> > + u32 last_dependency;
> > + /**
> > + * @invalidate_count: number of times this job has been invalidated.
> > + * Incremented by drm_dep_job_invalidate_job().
> > + */
> > + u32 invalidate_count;
> > + /**
> > + * @signalling_cookie: return value of dma_fence_begin_signalling()
> > + * captured in drm_dep_job_arm() and consumed by drm_dep_job_push().
> > + * Not valid outside the arm→push window.
> > + */
> > + bool signalling_cookie;
> > +};
> > +
> > +/**
> > + * struct drm_dep_job_init_args - arguments for drm_dep_job_init()
> > + */
> > +struct drm_dep_job_init_args {
> > + /**
> > + * @ops: driver callbacks for the job, or NULL for default behaviour.
> > + */
> > + const struct drm_dep_job_ops *ops;
> > + /** @q: the queue to associate the job with. A reference is taken. */
> > + struct drm_dep_queue *q;
> > + /** @credits: number of credits this job consumes; must be non-zero. */
> > + u32 credits;
> > +};
> > +
> > +/* Queue API */
> > +
> > +/**
> > + * drm_dep_queue_sched_guard() - acquire the queue scheduler lock as a guard
> > + * @__q: dep queue whose scheduler lock to acquire
> > + *
> > + * Acquires @__q->sched.lock as a scoped mutex guard (released automatically
> > + * when the enclosing scope exits). This lock serialises all scheduler state
> > + * transitions — stop/start/kill flag changes, bypass-path decisions, and the
> > + * run-job worker — so it must be held when the driver needs to atomically
> > + * inspect or modify queue state in relation to job submission.
> > + *
> > + * **When to use**
> > + *
> > + * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
> > + * serialise their own submit work against the bypass path must acquire this
> > + * guard. Without it, a concurrent caller of drm_dep_job_push() could take
> > + * the bypass path and call ops->run_job() inline between the driver's
> > + * eligibility check and its corresponding action, producing a race.
>
> So if you’re not careful, you have just introduced a race :/
>
> > + *
> > + * **Constraint: only from submit_wq worker context**
> > + *
> > + * Drivers must only acquire this guard from a work item running on the
> > + * queue's submit workqueue (@q->sched.submit_wq).
> > + *
> > + * Context: Process context only; must be called from submit_wq work by
> > + * drivers.
> > + */
> > +#define drm_dep_queue_sched_guard(__q) \
> > + guard(mutex)(&(__q)->sched.lock)
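>
> A usage sketch of the constraint above, assuming a driver-owned work
> item queued via drm_dep_queue_work_enqueue() (my_* names invented):
>
>	static void my_submit_work(struct work_struct *w)
>	{
>		struct my_exec_queue *eq =
>			container_of(w, typeof(*eq), submit_work);
>
>		drm_dep_queue_sched_guard(&eq->q);
>		/* From here to end of scope the bypass path cannot call
>		 * ops->run_job() inline, so the eligibility check and the
>		 * action below are atomic w.r.t. drm_dep_job_push(). */
>		if (drm_dep_queue_is_stopped(&eq->q))
>			return;
>		/* ... dispatch ... */
>	}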
> > +
> > +int drm_dep_queue_init(struct drm_dep_queue *q,
> > + const struct drm_dep_queue_init_args *args);
> > +void drm_dep_queue_fini(struct drm_dep_queue *q);
> > +void drm_dep_queue_release(struct drm_dep_queue *q);
> > +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q);
> > +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q);
> > +void drm_dep_queue_put(struct drm_dep_queue *q);
> > +void drm_dep_queue_stop(struct drm_dep_queue *q);
> > +void drm_dep_queue_start(struct drm_dep_queue *q);
> > +void drm_dep_queue_kill(struct drm_dep_queue *q);
> > +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q);
> > +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q);
> > +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q);
> > +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> > + struct work_struct *work);
> > +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q);
> > +bool drm_dep_queue_is_killed(struct drm_dep_queue *q);
> > +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q);
> > +void drm_dep_queue_set_stopped(struct drm_dep_queue *q);
> > +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q);
> > +long drm_dep_queue_timeout(const struct drm_dep_queue *q);
> > +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q);
> > +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q);
> > +
> > +/* Job API */
> > +
> > +/**
> > + * DRM_DEP_JOB_FENCE_PREALLOC - sentinel value for pre-allocating a dependency slot
> > + *
> > + * Pass this to drm_dep_job_add_dependency() instead of a real fence to
> > + * pre-allocate a slot in the job's dependency xarray during the preparation
> > + * phase (where GFP_KERNEL is available). The returned xarray index identifies
> > + * the slot. Call drm_dep_job_replace_dependency() later — inside a
> > + * dma_fence_begin_signalling() region if needed — to swap in the real fence
> > + * without further allocation.
> > + *
> > + * This sentinel is never treated as a dma_fence; it carries no reference count
> > + * and must not be passed to dma_fence_put(). It is only valid as an argument
> > + * to drm_dep_job_add_dependency() and as the expected stored value checked by
> > + * drm_dep_job_replace_dependency().
> > + */
> > +#define DRM_DEP_JOB_FENCE_PREALLOC ((struct dma_fence *)-1)
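>
> The intended two-phase flow, as I understand it (assuming the
> non-negative return of drm_dep_job_add_dependency() is the slot index):
>
>	/* preparation phase: GFP_KERNEL still allowed */
>	int slot = drm_dep_job_add_dependency(job, DRM_DEP_JOB_FENCE_PREALLOC);
>
>	if (slot < 0)
>		return slot;
>
>	/* later, inside the dma_fence signalling critical section */
>	drm_dep_job_replace_dependency(job, slot, fence);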
> > +
> > +int drm_dep_job_init(struct drm_dep_job *job,
> > + const struct drm_dep_job_init_args *args);
> > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job);
> > +void drm_dep_job_put(struct drm_dep_job *job);
> > +void drm_dep_job_arm(struct drm_dep_job *job);
> > +void drm_dep_job_push(struct drm_dep_job *job);
> > +int drm_dep_job_add_dependency(struct drm_dep_job *job,
> > + struct dma_fence *fence);
> > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > + struct dma_fence *fence);
> > +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> > + struct drm_file *file, u32 handle,
> > + u32 point);
> > +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> > + struct dma_resv *resv,
> > + enum dma_resv_usage usage);
> > +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> > + struct drm_gem_object *obj,
> > + bool write);
> > +bool drm_dep_job_is_signaled(struct drm_dep_job *job);
> > +bool drm_dep_job_is_finished(struct drm_dep_job *job);
> > +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold);
> > +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job);
> > +
> > +/**
> > + * struct drm_dep_queue_pending_job_iter - iterator state for
> > + * drm_dep_queue_for_each_pending_job()
> > + * @q: queue being iterated
> > + */
> > +struct drm_dep_queue_pending_job_iter {
> > + struct drm_dep_queue *q;
> > +};
> > +
> > +/* Drivers should never call this directly */
>
> Not enforceable in C.
>
> > +static inline struct drm_dep_queue_pending_job_iter
> > +__drm_dep_queue_pending_job_iter_begin(struct drm_dep_queue *q)
> > +{
> > + struct drm_dep_queue_pending_job_iter iter = {
> > + .q = q,
> > + };
> > +
> > + WARN_ON(!drm_dep_queue_is_stopped(q));
> > + return iter;
> > +}
> > +
> > +/* Drivers should never call this directly */
> > +static inline void
> > +__drm_dep_queue_pending_job_iter_end(struct drm_dep_queue_pending_job_iter iter)
> > +{
> > + WARN_ON(!drm_dep_queue_is_stopped(iter.q));
> > +}
> > +
> > +/* clang-format off */
> > +DEFINE_CLASS(drm_dep_queue_pending_job_iter,
> > + struct drm_dep_queue_pending_job_iter,
> > + __drm_dep_queue_pending_job_iter_end(_T),
> > + __drm_dep_queue_pending_job_iter_begin(__q),
> > + struct drm_dep_queue *__q);
> > +/* clang-format on */
> > +static inline void *
> > +class_drm_dep_queue_pending_job_iter_lock_ptr(
> > + class_drm_dep_queue_pending_job_iter_t *_T)
> > +{ return _T; }
> > +#define class_drm_dep_queue_pending_job_iter_is_conditional false
> > +
> > +/**
> > + * drm_dep_queue_for_each_pending_job() - iterate over all pending jobs
> > + * in a queue
> > + * @__job: loop cursor, a &struct drm_dep_job pointer
> > + * @__q: &struct drm_dep_queue to iterate
> > + *
> > + * Iterates over every job currently on @__q->job.pending. The queue must be
> > + * stopped (drm_dep_queue_stop() called) before using this iterator; a WARN_ON
> > + * fires at the start and end of the scope if it is not.
> > + *
> > + * Context: Any context.
> > + */
> > +#define drm_dep_queue_for_each_pending_job(__job, __q) \
> > + scoped_guard(drm_dep_queue_pending_job_iter, (__q)) \
> > + list_for_each_entry((__job), &(__q)->job.pending, pending_link)
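>
> So the expected reset-path pattern is something like (sketch):
>
>	drm_dep_queue_stop(q);		/* required, else WARN_ON fires */
>	drm_dep_queue_for_each_pending_job(job, q) {
>		/* e.g. re-emit the job or mark it for cleanup */
>	}
>	drm_dep_queue_start(q);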
> > +
> > +#endif
> > --
> > 2.34.1
> >
>
>
> By the way:
>
> I invite you to have a look at this implementation [0]. It currently works in real
> hardware i.e.: our downstream "Tyr" driver for Arm Mali is using that at the
> moment. It is a mere prototype that we’ve put together to test different
> approaches, so it’s not meant to be a “solution” at all. It’s a mere data point
> for further discussion.
>
> Philip Stanner is working on this “Job Queue” concept too, but from an upstream
> perspective.
>
> [0]: https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/61
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 5:45 ` Matthew Brost
@ 2026-03-17 7:17 ` Miguel Ojeda
2026-03-17 8:26 ` Matthew Brost
2026-03-17 18:14 ` Matthew Brost
1 sibling, 1 reply; 50+ messages in thread
From: Miguel Ojeda @ 2026-03-17 7:17 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Almeida, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux
On Tue, Mar 17, 2026 at 6:46 AM Matthew Brost <matthew.brost@intel.com> wrote:
>
> You can do RAII in C - see cleanup.h. Clear object lifetimes and
> ownership are what is important. Disciplined coding is the only way to do
> this regardless of language. RAII doesn't help with bad object
> models / ownership / lifetime models either.
"Ownership", "lifetimes" and being "disciplined" *is* what Rust helps
with. That is the whole point (even if there are other advantages).
Yes, the cleanup attribute is nice, but even the whole `CLASS` thing
is meant to simplify code. Simplifying code does reduce bugs in
general, but it doesn't solve anything fundamental. Even if we had C++
and full-fledged smart pointers and so on, it doesn't improve
meaningfully the situation -- one can still mess things up very easily
with them.
And yes, sanitizers and lockdep and runtime solutions that require to
trigger paths are amazing, but not anywhere close to enforcing
something statically.
The fact that `unsafe` exists doesn't mean "Rust doesn't solve
anything". Quite the opposite: the goal is to provide safe
abstractions where possible, i.e. we minimize the need for `unsafe`.
And for the cases where there is no other way around it, the toolchain
will force you to write an explanation for your `unsafe` usage. Then
maintainers and reviewers will have to agree with your argument for
it.
In particular, it is not something that gets routinely (and
implicitly) used every second line like we do in C.
Cheers,
Miguel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 7:17 ` Miguel Ojeda
@ 2026-03-17 8:26 ` Matthew Brost
2026-03-17 12:04 ` Daniel Almeida
2026-03-17 19:41 ` Miguel Ojeda
0 siblings, 2 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-17 8:26 UTC (permalink / raw)
To: Miguel Ojeda
Cc: Daniel Almeida, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux
On Tue, Mar 17, 2026 at 08:17:27AM +0100, Miguel Ojeda wrote:
> On Tue, Mar 17, 2026 at 6:46 AM Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > You can do RAII in C - see cleanup.h. Clear object lifetimes and
> > ownership are what is important. Disciplined coding is the only way to do
> > this regardless of language. RAII doesn't help with bad object
> > models / ownership / lifetime models either.
>
I hate the cut-off in threads.
> "Ownership", "lifetimes" and being "disciplined" *is* what Rust helps
> with. That is the whole point (even if there are other advantages).
>
I get it — you’re a Rust zealot. You can do this in C and enforce the
rules quite well.
RAII cannot describe ownership transfers of refs, nor can it express who
owns what in multi-threaded components, as far as I know. Ref-tracking
and ownership need to be explicit.
I’m not going to reply to Rust vs C comments in this thread. If you want
to talk about ownership, lifetimes, dma-fence enforcement, and teardown
guarantees, sure.
If you want to build on top of a component that’s been tested on a
production driver, great — please join in. If you want to figure out all
the pitfalls yourself, well… have fun.
Matt
> Yes, the cleanup attribute is nice, but even the whole `CLASS` thing
> is meant to simplify code. Simplifying code does reduce bugs in
> general, but it doesn't solve anything fundamental. Even if we had C++
> and full-fledged smart pointers and so on, it doesn't improve
> meaningfully the situation -- one can still mess things up very easily
> with them.
>
> And yes, sanitizers and lockdep and runtime solutions that require to
> trigger paths are amazing, but not anywhere close to enforcing
> something statically.
>
> The fact that `unsafe` exists doesn't mean "Rust doesn't solve
> anything". Quite the opposite: the goal is to provide safe
> abstractions where possible, i.e. we minimize the need for `unsafe`.
> And for the cases where there is no other way around it, the toolchain
> will force you to write an explanation for your `unsafe` usage. Then
> maintainers and reviewers will have to agree with your argument for
> it.
>
> In particular, it is not something that gets routinely (and
> implicitly) used every second line like we do in C.
>
> Cheers,
> Miguel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
` (2 preceding siblings ...)
2026-03-17 2:47 ` Daniel Almeida
@ 2026-03-17 8:47 ` Christian König
2026-03-17 14:55 ` Boris Brezillon
2026-03-17 16:30 ` Shashank Sharma
5 siblings, 0 replies; 50+ messages in thread
From: Christian König @ 2026-03-17 8:47 UTC (permalink / raw)
To: Matthew Brost, intel-xe
Cc: dri-devel, Boris Brezillon, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel
On 3/16/26 05:32, Matthew Brost wrote:
> Diverging requirements between GPU drivers using firmware scheduling
> and those using hardware scheduling have shown that drm_gpu_scheduler is
> no longer sufficient for firmware-scheduled GPU drivers. The technical
> debt, lack of memory-safety guarantees, absence of clear object-lifetime
> rules, and numerous driver-specific hacks have rendered
> drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> firmware-scheduled GPU drivers—one that addresses all of the
> aforementioned shortcomings.
>
> Add drm_dep, a lightweight GPU submission queue intended as a
> replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> from the queue (drm_sched_entity) into two objects requiring external
> coordination, drm_dep merges both roles into a single struct
> drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> that is unnecessary for firmware schedulers which manage their own
> run-lists internally.
Yeah, I can't count how often I've considered re-writing the GPU scheduler from scratch.
But if that is done, I completely agree that it should probably be done in Rust instead of C.
I've worked enough with safe languages to acknowledge the advantages they have.
Regards,
Christian.
> Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> management by the driver, drm_dep uses reference counting (kref) on both
> queues and jobs to guarantee object lifetime safety. A job holds a queue
> reference from init until its last put, and the queue holds a job reference
> from dispatch until the put_job worker runs. This makes use-after-free
> impossible even when completion arrives from IRQ context or concurrent
> teardown is in flight.
>
> The core objects are:
>
> struct drm_dep_queue - a per-context submission queue owning an
> ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> queue, and a pending-job list. Reference counted; drivers can embed
> it and provide a .release vfunc for RCU-safe teardown.
>
> struct drm_dep_job - a single unit of GPU work. Drivers embed this
> and provide a .release vfunc. Jobs carry an xarray of input
> dma_fence dependencies and produce a drm_dep_fence as their
> finished fence.
>
> struct drm_dep_fence - a dma_fence subclass wrapping an optional
> parent hardware fence. The finished fence is armed (sequence
> number assigned) before submission and signals when the hardware
> fence signals (or immediately on synchronous completion).
>
> Job lifecycle:
> 1. drm_dep_job_init() - allocate and initialise; job acquires a
> queue reference.
> 2. drm_dep_job_add_dependency() and friends - register input fences;
> duplicates from the same context are deduplicated.
> 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> 4. drm_dep_job_push() - submit to queue.
>
> Submission paths under queue lock:
> - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> SPSC queue is empty, no dependencies are pending, and credits are
> available, the job is dispatched inline on the calling thread.
> - Queued path: job is pushed onto the SPSC queue and the run_job
> worker is kicked. The worker resolves remaining dependencies
> (installing wakeup callbacks for unresolved fences) before calling
> ops->run_job().
>
> Credit-based throttling prevents hardware overflow: each job declares
> a credit cost at init time; dispatch is deferred until sufficient
> credits are available.
>
> Timeout Detection and Recovery (TDR): a per-queue delayed work item
> fires when the head pending job exceeds q->job.timeout jiffies, calling
> ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> expiry for device teardown.
>
> IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> allow drm_dep_job_done() to be called from hardirq context (e.g. a
> dma_fence callback). Dependency cleanup is deferred to process context
> after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
>
> Zombie-state guard: workers use kref_get_unless_zero() on entry and
> bail immediately if the queue refcount has already reached zero and
> async teardown is in flight, preventing use-after-free.
>
> Teardown is always deferred to a module-private workqueue (dep_free_wq)
> so that destroy_workqueue() is never called from within one of the
> queue's own workers. Each queue holds a drm_dev_get() reference on its
> owning struct drm_device, released as the final step of teardown via
> drm_dev_put(). This prevents the driver module from being unloaded
> while any queue is still alive without requiring a separate drain API.
>
> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Philipp Stanner <phasta@kernel.org>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> ---
> drivers/gpu/drm/Kconfig | 4 +
> drivers/gpu/drm/Makefile | 1 +
> drivers/gpu/drm/dep/Makefile | 5 +
> drivers/gpu/drm/dep/drm_dep_fence.c | 406 +++++++
> drivers/gpu/drm/dep/drm_dep_fence.h | 25 +
> drivers/gpu/drm/dep/drm_dep_job.c | 675 +++++++++++
> drivers/gpu/drm/dep/drm_dep_job.h | 13 +
> drivers/gpu/drm/dep/drm_dep_queue.c | 1647 +++++++++++++++++++++++++++
> drivers/gpu/drm/dep/drm_dep_queue.h | 31 +
> include/drm/drm_dep.h | 597 ++++++++++
> 10 files changed, 3404 insertions(+)
> create mode 100644 drivers/gpu/drm/dep/Makefile
> create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.h
> create mode 100644 drivers/gpu/drm/dep/drm_dep_job.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_job.h
> create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.h
> create mode 100644 include/drm/drm_dep.h
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 5386248e75b6..834f6e210551 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -276,6 +276,10 @@ config DRM_SCHED
> tristate
> depends on DRM
>
> +config DRM_DEP
> + tristate
> + depends on DRM
> +
> # Separate option as not all DRM drivers use it
> config DRM_PANEL_BACKLIGHT_QUIRKS
> tristate
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index e97faabcd783..1ad87cc0e545 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -173,6 +173,7 @@ obj-y += clients/
> obj-y += display/
> obj-$(CONFIG_DRM_TTM) += ttm/
> obj-$(CONFIG_DRM_SCHED) += scheduler/
> +obj-$(CONFIG_DRM_DEP) += dep/
> obj-$(CONFIG_DRM_RADEON)+= radeon/
> obj-$(CONFIG_DRM_AMDGPU)+= amd/amdgpu/
> obj-$(CONFIG_DRM_AMDGPU)+= amd/amdxcp/
> diff --git a/drivers/gpu/drm/dep/Makefile b/drivers/gpu/drm/dep/Makefile
> new file mode 100644
> index 000000000000..335f1af46a7b
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +drm_dep-y := drm_dep_queue.o drm_dep_job.o drm_dep_fence.o
> +
> +obj-$(CONFIG_DRM_DEP) += drm_dep.o
> diff --git a/drivers/gpu/drm/dep/drm_dep_fence.c b/drivers/gpu/drm/dep/drm_dep_fence.c
> new file mode 100644
> index 000000000000..ae05b9077772
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_fence.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency fence
> + *
> + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> + * provides a single dma_fence (@finished) signalled when the hardware
> + * completes the job.
> + *
> + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> + * is signalled once @parent signals (or immediately if run_job() returns
> + * NULL or an error).
> + *
> + * Drivers should expose @finished as the out-fence for GPU work since it is
> + * valid from the moment drm_dep_job_arm() returns, whereas the hardware fence
> + * could be a compound fence, which is disallowed when installed into
> + * drm_syncobjs or dma-resv.
> + *
> + * The fence uses the kernel's inline spinlock (NULL passed to dma_fence_init())
> + * so no separate lock allocation is required.
> + *
> + * Deadline propagation is supported: if a consumer sets a deadline via
> + * dma_fence_set_deadline(), it is forwarded to @parent when @parent is set.
> + * If @parent has not been set yet the deadline is stored in @deadline and
> + * forwarded at that point.
> + *
> + * Memory management: drm_dep_fence objects are allocated with kzalloc() and
> + * freed via kfree_rcu() once the fence is released, ensuring safety with
> + * RCU-protected fence accesses.
> + */
> +
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include "drm_dep_fence.h"
> +
> +/**
> + * DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT - a fence deadline hint has been set
> + *
> + * Set by the deadline callback on the finished fence to indicate a deadline
> + * has been set which may need to be propagated to the parent hardware fence.
> + */
> +#define DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT (DMA_FENCE_FLAG_USER_BITS + 1)
> +
> +/**
> + * struct drm_dep_fence - fence tracking the completion of a dep job
> + *
> + * Contains a single dma_fence (@finished) that is signalled when the
> + * hardware completes the job. The fence uses the kernel's inline_lock
> + * (no external spinlock required).
> + *
> + * This struct is private to the drm_dep module; external code interacts
> + * through the accessor functions declared in drm_dep_fence.h.
> + */
> +struct drm_dep_fence {
> + /**
> + * @finished: signalled when the job completes on hardware.
> + *
> + * Drivers should use this fence as the out-fence for a job since it
> + * is available immediately upon drm_dep_job_arm().
> + */
> + struct dma_fence finished;
> +
> + /**
> + * @deadline: deadline set on @finished which potentially needs to be
> + * propagated to @parent.
> + */
> + ktime_t deadline;
> +
> + /**
> + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> + *
> + * @finished is signaled once @parent is signaled. The initial store is
> + * performed via smp_store_release to synchronize with deadline handling.
> + *
> + * All readers must access this under the fence lock and take a reference to
> + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> + * signals, and this drop also releases its internal reference.
> + */
> + struct dma_fence *parent;
> +
> + /**
> + * @q: the queue this fence belongs to.
> + */
> + struct drm_dep_queue *q;
> +};
> +
> +static const struct dma_fence_ops drm_dep_fence_ops;
> +
> +/**
> + * to_drm_dep_fence() - cast a dma_fence to its enclosing drm_dep_fence
> + * @f: dma_fence to cast
> + *
> + * Context: No context requirements (inline helper).
> + * Return: pointer to the enclosing &drm_dep_fence.
> + */
> +static struct drm_dep_fence *to_drm_dep_fence(struct dma_fence *f)
> +{
> + return container_of(f, struct drm_dep_fence, finished);
> +}
> +
> +/**
> + * drm_dep_fence_set_parent() - store the hardware fence and propagate
> + * any deadline
> + * @dfence: dep fence
> + * @parent: hardware fence returned by &drm_dep_queue_ops.run_job, or NULL/error
> + *
> + * Stores @parent on @dfence under smp_store_release() so that a concurrent
> + * drm_dep_fence_set_deadline() call sees the parent before checking the
> + * deadline bit. If a deadline has already been set on @dfence->finished it is
> + * forwarded to @parent immediately. Does nothing if @parent is NULL or an
> + * error pointer.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> + struct dma_fence *parent)
> +{
> + if (IS_ERR_OR_NULL(parent))
> + return;
> +
> + /*
> + * smp_store_release() to ensure a thread racing us in
> + * drm_dep_fence_set_deadline() sees the parent set before
> + * it calls test_bit(HAS_DEADLINE_BIT).
> + */
> + smp_store_release(&dfence->parent, dma_fence_get(parent));
> + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT,
> + &dfence->finished.flags))
> + dma_fence_set_deadline(parent, dfence->deadline);
> +}
> +
> +/**
> + * drm_dep_fence_finished() - signal the finished fence with a result
> + * @dfence: dep fence to signal
> + * @result: error code to set, or 0 for success
> + *
> + * Sets the fence error to @result if non-zero, then signals
> + * @dfence->finished. Also removes parent visibility under the fence lock
> + * and drops the parent reference. Dropping the parent here allows the
> + * DRM dep fence to be completely decoupled from the DRM dep module.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_fence_finished(struct drm_dep_fence *dfence, int result)
> +{
> + struct dma_fence *parent;
> + unsigned long flags;
> +
> + dma_fence_lock_irqsave(&dfence->finished, flags);
> + if (result)
> + dma_fence_set_error(&dfence->finished, result);
> + dma_fence_signal_locked(&dfence->finished);
> + parent = dfence->parent;
> + dfence->parent = NULL;
> + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> +
> + dma_fence_put(parent);
> +}
> +
> +static const char *drm_dep_fence_get_driver_name(struct dma_fence *fence)
> +{
> + return "drm_dep";
> +}
> +
> +static const char *drm_dep_fence_get_timeline_name(struct dma_fence *f)
> +{
> + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> +
> + return dfence->q->name;
> +}
> +
> +/**
> + * drm_dep_fence_get_parent() - get a reference to the parent hardware fence
> + * @dfence: dep fence to query
> + *
> + * Returns a new reference to @dfence->parent, or NULL if the parent has
> + * already been cleared (i.e. @dfence->finished has signalled and the parent
> + * reference was dropped under the fence lock).
> + *
> + * Uses smp_load_acquire() to pair with the smp_store_release() in
> + * drm_dep_fence_set_parent(), ensuring that if we race a concurrent
> + * drm_dep_fence_set_parent() call we observe the parent pointer only after
> + * the store is fully visible — before set_parent() tests
> + * %DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT.
> + *
> + * Caller must hold the fence lock on @dfence->finished.
> + *
> + * Context: Any context, fence lock on @dfence->finished must be held.
> + * Return: a new reference to the parent fence, or NULL.
> + */
> +static struct dma_fence *drm_dep_fence_get_parent(struct drm_dep_fence *dfence)
> +{
> + dma_fence_assert_held(&dfence->finished);
> +
> + return dma_fence_get(smp_load_acquire(&dfence->parent));
> +}
> +
> +/**
> + * drm_dep_fence_set_deadline() - dma_fence_ops deadline callback
> + * @f: fence on which the deadline is being set
> + * @deadline: the deadline hint to apply
> + *
> + * Stores the earliest deadline under the fence lock, then propagates
> + * it to the parent hardware fence via smp_load_acquire() to race
> + * safely with drm_dep_fence_set_parent().
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_fence_set_deadline(struct dma_fence *f, ktime_t deadline)
> +{
> + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> + struct dma_fence *parent;
> + unsigned long flags;
> +
> + dma_fence_lock_irqsave(f, flags);
> +
> + /* If we already have an earlier deadline, keep it: */
> + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
> + ktime_before(dfence->deadline, deadline)) {
> + dma_fence_unlock_irqrestore(f, flags);
> + return;
> + }
> +
> + dfence->deadline = deadline;
> + set_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags);
> +
> + parent = drm_dep_fence_get_parent(dfence);
> + dma_fence_unlock_irqrestore(f, flags);
> +
> + if (parent)
> + dma_fence_set_deadline(parent, deadline);
> +
> + dma_fence_put(parent);
> +}
> +
> +static const struct dma_fence_ops drm_dep_fence_ops = {
> + .get_driver_name = drm_dep_fence_get_driver_name,
> + .get_timeline_name = drm_dep_fence_get_timeline_name,
> + .set_deadline = drm_dep_fence_set_deadline,
> +};
> +
> +/**
> + * drm_dep_fence_alloc() - allocate a dep fence
> + *
> + * Allocates a &drm_dep_fence with kzalloc() without initialising the
> + * dma_fence. Call drm_dep_fence_init() to fully initialise it.
> + *
> + * Context: Process context.
> + * Return: new &drm_dep_fence on success, NULL on allocation failure.
> + */
> +struct drm_dep_fence *drm_dep_fence_alloc(void)
> +{
> + return kzalloc_obj(struct drm_dep_fence);
> +}
> +
> +/**
> + * drm_dep_fence_init() - initialise the dma_fence inside a dep fence
> + * @dfence: dep fence to initialise
> + * @q: queue the owning job belongs to
> + *
> + * Initialises @dfence->finished using the context and sequence number from @q.
> + * Passes NULL as the lock so the fence uses its inline spinlock.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q)
> +{
> + u32 seq = ++q->fence.seqno;
> +
> + /*
> + * XXX: Inline fence hazard: currently all expected users of DRM dep
> + * hardware fences have a unique lockdep class. If that ever changes,
> + * we will need to assign a unique lockdep class here so lockdep knows
> + * this fence is allowed to nest with driver hardware fences.
> + */
> +
> + dfence->q = q;
> + dma_fence_init(&dfence->finished, &drm_dep_fence_ops,
> + NULL, q->fence.context, seq);
> +}
> +
> +/**
> + * drm_dep_fence_cleanup() - release a dep fence at job teardown
> + * @dfence: dep fence to clean up
> + *
> + * Called from drm_dep_job_fini(). If the dep fence was armed (refcount > 0)
> + * it is released via dma_fence_put() and will be freed by the RCU release
> + * callback once all waiters have dropped their references. If it was never
> + * armed it is freed directly with kfree().
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence)
> +{
> + if (drm_dep_fence_is_armed(dfence))
> + dma_fence_put(&dfence->finished);
> + else
> + kfree(dfence);
> +}
> +
> +/**
> + * drm_dep_fence_is_armed() - check whether the fence has been armed
> + * @dfence: dep fence to check
> + *
> + * Returns true if drm_dep_job_arm() has been called, i.e. @dfence->finished
> + * has been initialised and its reference count is non-zero. Used by
> + * assertions to enforce correct job lifecycle ordering (arm before push,
> + * add_dependency before arm).
> + *
> + * Context: Any context.
> + * Return: true if the fence is armed, false otherwise.
> + */
> +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence)
> +{
> + return !!kref_read(&dfence->finished.refcount);
> +}
> +
> +/**
> + * drm_dep_fence_is_finished() - test whether the finished fence has signalled
> + * @dfence: dep fence to check
> + *
> + * Uses dma_fence_test_signaled_flag() to read %DMA_FENCE_FLAG_SIGNALED_BIT
> + * directly without invoking the fence's ->signaled() callback or triggering
> + * any signalling side-effects.
> + *
> + * Context: Any context.
> + * Return: true if @dfence->finished has been signalled, false otherwise.
> + */
> +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence)
> +{
> + return dma_fence_test_signaled_flag(&dfence->finished);
> +}
> +
> +/**
> + * drm_dep_fence_is_complete() - test whether the job has completed
> + * @dfence: dep fence to check
> + *
> + * Takes the fence lock on @dfence->finished and calls
> + * drm_dep_fence_get_parent() to safely obtain a reference to the parent
> + * hardware fence — or NULL if the parent has already been cleared after
> + * signalling. Calls dma_fence_is_signaled() on @parent outside the lock,
> + * which may invoke the fence's ->signaled() callback and trigger signalling
> + * side-effects if the fence has completed but the signalled flag has not yet
> + * been set. The finished fence is tested via dma_fence_test_signaled_flag(),
> + * without side-effects.
> + *
> + * May only be called on a stopped queue (see drm_dep_queue_is_stopped()).
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if the job is complete, false otherwise.
> + */
> +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence)
> +{
> + struct dma_fence *parent;
> + unsigned long flags;
> + bool complete;
> +
> + dma_fence_lock_irqsave(&dfence->finished, flags);
> + parent = drm_dep_fence_get_parent(dfence);
> + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> +
> + complete = (parent && dma_fence_is_signaled(parent)) ||
> + dma_fence_test_signaled_flag(&dfence->finished);
> +
> + dma_fence_put(parent);
> +
> + return complete;
> +}
> +
> +/**
> + * drm_dep_fence_to_dma() - return the finished dma_fence for a dep fence
> + * @dfence: dep fence to query
> + *
> + * No reference is taken; the caller must hold its own reference to the owning
> + * &drm_dep_job for the duration of the access.
> + *
> + * Context: Any context.
> + * Return: the finished &dma_fence.
> + */
> +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence)
> +{
> + return &dfence->finished;
> +}
> +
> +/**
> + * drm_dep_fence_done() - signal the finished fence on job completion
> + * @dfence: dep fence to signal
> + * @result: job error code, or 0 on success
> + *
> + * Gets a temporary reference to @dfence->finished to guard against a racing
> + * last-put, signals the fence with @result, then drops the temporary
> + * reference. Called from drm_dep_job_done() in the queue core when a
> + * hardware completion callback fires or when run_job() returns immediately.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result)
> +{
> + dma_fence_get(&dfence->finished);
> + drm_dep_fence_finished(dfence, result);
> + dma_fence_put(&dfence->finished);
> +}
> diff --git a/drivers/gpu/drm/dep/drm_dep_fence.h b/drivers/gpu/drm/dep/drm_dep_fence.h
> new file mode 100644
> index 000000000000..65a1582f858b
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_fence.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_FENCE_H_
> +#define _DRM_DEP_FENCE_H_
> +
> +#include <linux/dma-fence.h>
> +
> +struct drm_dep_fence;
> +struct drm_dep_queue;
> +
> +struct drm_dep_fence *drm_dep_fence_alloc(void);
> +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q);
> +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence);
> +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> + struct dma_fence *parent);
> +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result);
> +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence);
> +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence);
> +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence);
> +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence);
> +
> +#endif /* _DRM_DEP_FENCE_H_ */
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> new file mode 100644
> index 000000000000..2d012b29a5fc
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency job
> + *
> + * A struct drm_dep_job represents a single unit of GPU work associated with
> + * a struct drm_dep_queue. The lifecycle of a job is:
> + *
> + * 1. **Allocation**: the driver allocates memory for the job (typically by
> + * embedding struct drm_dep_job in a larger structure) and calls
> + * drm_dep_job_init() to initialise it. On success the job holds one
> + * kref reference and a reference to its queue.
> + *
> + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> + * that must be signalled before the job can run. Duplicate fences from the
> + * same fence context are deduplicated automatically.
> + *
> + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> + * consuming a sequence number from the queue. After arming,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> + * queue takes a reference that it holds until the job's finished fence
> + * signals and the job is freed by the put_job worker.
> + *
> + * 5. **Completion**: when the job's hardware work finishes its finished fence
> + * is signalled and drm_dep_job_put() is called by the queue. The driver
> + * must release any driver-private resources in &drm_dep_job_ops.release.
> + *
> + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> + * objects before the driver's release callback is invoked.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
> +#include <drm/drm_syncobj.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/**
> + * drm_dep_job_init() - initialise a dep job
> + * @job: dep job to initialise
> + * @args: initialisation arguments
> + *
> + * Initialises @job with the queue, ops and credit count from @args. Acquires
> + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> + * the lifetime of the job and released by drm_dep_job_release() when the last
> + * job reference is dropped.
> + *
> + * Resources are released automatically when the last reference is dropped
> + * via drm_dep_job_put(), which must be called to release the job; drivers
> + * must not free the job directly.
> + *
> + * Context: Process context. Allocates memory with GFP_KERNEL.
> + * Return: 0 on success, -%EINVAL if credits is 0,
> + * -%ENOMEM on fence allocation failure.
> + */
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args)
> +{
> + if (unlikely(!args->credits)) {
> + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> + return -EINVAL;
> + }
> +
> + memset(job, 0, sizeof(*job));
> +
> + job->dfence = drm_dep_fence_alloc();
> + if (!job->dfence)
> + return -ENOMEM;
> +
> + job->ops = args->ops;
> + job->q = drm_dep_queue_get(args->q);
> + job->credits = args->credits;
> +
> + kref_init(&job->refcount);
> + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> + INIT_LIST_HEAD(&job->pending_link);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_init);
> +
> +/**
> + * drm_dep_job_drop_dependencies() - release all input dependency fences
> + * @job: dep job whose dependency xarray to drain
> + *
> + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> + * i.e. slots that were pre-allocated but never replaced — are silently
> + * skipped; the sentinel carries no reference. Called from
> + * drm_dep_queue_run_job() in process context immediately after
> + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> + * dependencies here — while still in process context — avoids calling
> + * xa_destroy() from IRQ context if the job's last reference is later
> + * dropped from a dma_fence callback.
> + *
> + * Context: Process context.
> + */
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + unsigned long index;
> +
> + xa_for_each(&job->dependencies, index, fence) {
> + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> + continue;
> + dma_fence_put(fence);
> + }
> + xa_destroy(&job->dependencies);
> +}
> +
> +/**
> + * drm_dep_job_fini() - clean up a dep job
> + * @job: dep job to clean up
> + *
> + * Cleans up the dep fence and drops the queue reference held by @job.
> + *
> + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> + * the dependency xarray is also released here. For armed jobs the xarray
> + * has already been drained by drm_dep_job_drop_dependencies() in process
> + * context immediately after run_job(), so it is left untouched to avoid
> + * calling xa_destroy() from IRQ context.
> + *
> + * Warns if @job is still linked on the queue's pending list, which would
> + * indicate a bug in the teardown ordering.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_fini(struct drm_dep_job *job)
> +{
> + bool armed = drm_dep_fence_is_armed(job->dfence);
> +
> + WARN_ON(!list_empty(&job->pending_link));
> +
> + drm_dep_fence_cleanup(job->dfence);
> + job->dfence = NULL;
> +
> + /*
> + * Armed jobs have their dependencies drained by
> + * drm_dep_job_drop_dependencies() in process context after run_job().
> + * Skip here to avoid calling xa_destroy() from IRQ context.
> + */
> + if (!armed)
> + drm_dep_job_drop_dependencies(job);
> +}
> +
> +/**
> + * drm_dep_job_get() - acquire a reference to a dep job
> + * @job: dep job to acquire a reference on, or NULL
> + *
> + * Context: Any context.
> + * Return: @job with an additional reference held, or NULL if @job is NULL.
> + */
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_get(&job->refcount);
> + return job;
> +}
> +EXPORT_SYMBOL(drm_dep_job_get);
> +
> +/**
> + * drm_dep_job_release() - kref release callback for a dep job
> + * @kref: kref embedded in the dep job
> + *
> + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Finally, releases the queue reference
> + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> + * queue put is performed last to ensure no queue state is accessed after
> + * the job memory is freed.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +static void drm_dep_job_release(struct kref *kref)
> +{
> + struct drm_dep_job *job =
> + container_of(kref, struct drm_dep_job, refcount);
> + struct drm_dep_queue *q = job->q;
> +
> + drm_dep_job_fini(job);
> +
> + if (job->ops && job->ops->release)
> + job->ops->release(job);
> + else
> + kfree(job);
> +
> + drm_dep_queue_put(q);
> +}
> +
> +/**
> + * drm_dep_job_put() - release a reference to a dep job
> + * @job: dep job to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +void drm_dep_job_put(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_put(&job->refcount, drm_dep_job_release);
> +}
> +EXPORT_SYMBOL(drm_dep_job_put);
> +
> +/**
> + * drm_dep_job_arm() - arm a dep job for submission
> + * @job: dep job to arm
> + *
> + * Initialises the finished fence on @job->dfence, assigning
> + * it a sequence number from the job's queue. Must be called after
> + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> + * After this point, memory allocations that could trigger reclaim are
> + * forbidden; lockdep enforces this. arm() must always be paired with
> + * drm_dep_job_push(); lockdep also enforces this pairing.
> + *
> + * Warns if the job has already been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_arm(struct drm_dep_job *job)
> +{
> + drm_dep_queue_push_job_begin(job->q);
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + drm_dep_fence_init(job->dfence, job->q);
> + job->signalling_cookie = dma_fence_begin_signalling();
> +}
> +EXPORT_SYMBOL(drm_dep_job_arm);
> +
> +/**
> + * drm_dep_job_push() - submit a job to its queue for execution
> + * @job: dep job to push
> + *
> + * Submits @job to the queue it was initialised with. Must be called after
> + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> + * held until the queue is fully done with it. The reference is released
> + * directly in the finished-fence dma_fence callback for queues with
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> + * from hardirq context), or via the put_job work item on the submit
> + * workqueue otherwise.
> + *
> + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> + * enforces the pairing.
> + *
> + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> + * @job exactly once, even if the queue is killed or torn down before the
> + * job reaches the head of the queue. Drivers can use this guarantee to
> + * perform bookkeeping cleanup; the actual backend operation should be
> + * skipped when drm_dep_queue_is_killed() returns true.
> + *
> + * If the queue does not support the bypass path, the job is pushed directly
> + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> + *
> + * Warns if the job has not been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_push(struct drm_dep_job *job)
> +{
> + struct drm_dep_queue *q = job->q;
> +
> + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> +
> + drm_dep_job_get(job);
> +
> + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> + drm_dep_queue_push_job(q, job);
> + dma_fence_end_signalling(job->signalling_cookie);
> + drm_dep_queue_push_job_end(job->q);
> + return;
> + }
> +
> + scoped_guard(mutex, &q->sched.lock) {
> + if (drm_dep_queue_can_job_bypass(q, job))
> + drm_dep_queue_run_job(q, job);
> + else
> + drm_dep_queue_push_job(q, job);
> + }
> +
> + dma_fence_end_signalling(job->signalling_cookie);
> + drm_dep_queue_push_job_end(job->q);
> +}
> +EXPORT_SYMBOL(drm_dep_job_push);
> +
> +/**
> + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> + * @job: dep job to add the dependencies to
> + * @fence: the dma_fence to add to the list of dependencies, or
> + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> + *
> + * Note that @fence is consumed in both the success and error cases (except
> + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> + *
> + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> + * fence->context matches the queue's finished fence context) are silently
> + * dropped; the job need not wait on its own queue's output.
> + *
> + * Warns if the job has already been armed (dependencies must be added before
> + * drm_dep_job_arm()).
> + *
> + * **Pre-allocation pattern**
> + *
> + * When multiple jobs across different queues must be prepared and submitted
> + * together in a single atomic commit — for example, where job A's finished
> + * fence is an input dependency of job B — all jobs must be armed and pushed
> + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * region. Once that region has started no memory allocation is permitted.
> + *
> + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> + * the underlying xarray must be tracked by the caller separately (e.g. it is
> + * always index 0 when the dependency array is empty, a property Xe relies on).
> + * After all jobs have been armed and the finished fences are available, call
> + * drm_dep_job_replace_dependency() with that index and the real fence.
> + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> + * called from atomic or signalling context.
> + *
> + * The sentinel slot is never skipped by the signalled-fence fast-path,
> + * ensuring a slot is always allocated even when the real fence is not yet
> + * known.
> + *
> + * **Example: bind job feeding TLB invalidation jobs**
> + *
> + * Consider a GPU with separate queues for page-table bind operations and for
> + * TLB invalidation. A single atomic commit must:
> + *
> + * 1. Run a bind job that modifies page tables.
> + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> + * completing, so stale translations are flushed before the engines
> + * continue.
> + *
> + * Because all jobs must be armed and pushed inside a signalling region (where
> + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> + *
> + * // Phase 1 — process context, GFP_KERNEL allowed
> + * drm_dep_job_init(bind_job, bind_queue, ops);
> + * for_each_mmu(mmu) {
> + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> + * // Pre-allocate slot at index 0; real fence not available yet
> + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> + * }
> + *
> + * // Phase 2 — inside signalling region, no GFP_KERNEL
> + * dma_fence_begin_signalling();
> + * drm_dep_job_arm(bind_job);
> + * for_each_mmu(mmu) {
> + * // Swap sentinel for bind job's finished fence
> + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> + * dma_fence_get(drm_dep_job_finished_fence(bind_job)));
> + * drm_dep_job_arm(tlb_job[mmu]);
> + * }
> + * drm_dep_job_push(bind_job);
> + * for_each_mmu(mmu)
> + * drm_dep_job_push(tlb_job[mmu]);
> + * dma_fence_end_signalling();
> + *
> + * Context: Process context. May allocate memory with GFP_KERNEL.
> + * Return: the allocated slot index if @fence is %DRM_DEP_JOB_FENCE_PREALLOC,
> + * 0 on success for any other fence, or a negative error code.
> + */
> +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> +{
> + struct drm_dep_queue *q = job->q;
> + struct dma_fence *entry;
> + unsigned long index;
> + u32 id = 0;
> + int ret;
> +
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + might_alloc(GFP_KERNEL);
> +
> + if (!fence)
> + return 0;
> +
> + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> + goto add_fence;
> +
> + /*
> + * Ignore signalled fences or fences from our own queue — finished
> + * fences use q->fence.context.
> + */
> + if (dma_fence_test_signaled_flag(fence) ||
> + fence->context == q->fence.context) {
> + dma_fence_put(fence);
> + return 0;
> + }
> +
> + /*
> + * Deduplicate if we already depend on a fence from the same context.
> + * This lets the size of the array of deps scale with the number of
> + * engines involved, rather than the number of BOs.
> + */
> + xa_for_each(&job->dependencies, index, entry) {
> + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> + entry->context != fence->context)
> + continue;
> +
> + if (dma_fence_is_later(fence, entry)) {
> + dma_fence_put(entry);
> + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> + } else {
> + dma_fence_put(fence);
> + }
> + return 0;
> + }
> +
> +add_fence:
> + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> + GFP_KERNEL);
> + if (ret != 0) {
> + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> + dma_fence_put(fence);
> + return ret;
> + }
> +
> + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> +
> +/**
> + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> + * @job: dep job to update
> + * @index: xarray index of the slot to replace, as returned when the sentinel
> + * was originally inserted via drm_dep_job_add_dependency()
> + * @fence: the real dma_fence to store; its reference is always consumed
> + *
> + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> + * @job->dependencies with @fence. The slot must have been pre-allocated by
> + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> + * existing entry is asserted to be the sentinel.
> + *
> + * This is the second half of the pre-allocation pattern described in
> + * drm_dep_job_add_dependency(). It is intended to be called inside a
> + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> + * internally so it is safe to call from atomic or signalling context, but
> + * since the slot has been pre-allocated no actual memory allocation occurs.
> + *
> + * If @fence is already signalled the slot is erased rather than storing a
> + * redundant dependency. The successful store is asserted — if the store
> + * fails it indicates a programming error (slot index out of range or
> + * concurrent modification).
> + *
> + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence)
> +{
> + WARN_ON(xa_load(&job->dependencies, index) !=
> + DRM_DEP_JOB_FENCE_PREALLOC);
> +
> + if (dma_fence_test_signaled_flag(fence)) {
> + xa_erase(&job->dependencies, index);
> + dma_fence_put(fence);
> + return;
> + }
> +
> + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> + GFP_NOWAIT)))) {
> + dma_fence_put(fence);
> + return;
> + }
> +}
> +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> +
> +/**
> + * drm_dep_job_add_syncobj_dependency() - adds a syncobj's fence as a
> + * job dependency
> + * @job: dep job to add the dependencies to
> + * @file: drm file private pointer
> + * @handle: syncobj handle to lookup
> + * @point: timeline point
> + *
> + * This adds the fence matching the given syncobj to @job.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> + struct drm_file *file, u32 handle,
> + u32 point)
> +{
> + struct dma_fence *fence;
> + int ret;
> +
> + ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
> + if (ret)
> + return ret;
> +
> + return drm_dep_job_add_dependency(job, fence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_syncobj_dependency);
> +
> +/**
> + * drm_dep_job_add_resv_dependencies() - add all fences from the resv to the job
> + * @job: dep job to add the dependencies to
> + * @resv: the dma_resv object to get the fences from
> + * @usage: the dma_resv_usage to use to filter the fences
> + *
> + * This adds all fences matching the given usage from @resv to @job.
> + * Must be called with the @resv lock held.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> + struct dma_resv *resv,
> + enum dma_resv_usage usage)
> +{
> + struct dma_resv_iter cursor;
> + struct dma_fence *fence;
> + int ret;
> +
> + dma_resv_assert_held(resv);
> +
> + dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> + /*
> + * As drm_dep_job_add_dependency always consumes the fence
> + * reference (even when it fails), and dma_resv_for_each_fence
> + * is not obtaining one, we need to grab one before calling.
> + */
> + ret = drm_dep_job_add_dependency(job, dma_fence_get(fence));
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_resv_dependencies);
> +
> +/**
> + * drm_dep_job_add_implicit_dependencies() - adds implicit dependencies
> + * as job dependencies
> + * @job: dep job to add the dependencies to
> + * @obj: the gem object to add new dependencies from.
> + * @write: whether the job might write the object (so we need to depend on
> + * shared fences in the reservation object).
> + *
> + * This should be called after drm_gem_lock_reservations() on your array of
> + * GEM objects used in the job but before updating the reservations with your
> + * own fences.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> + struct drm_gem_object *obj,
> + bool write)
> +{
> + return drm_dep_job_add_resv_dependencies(job, obj->resv,
> + dma_resv_usage_rw(write));
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_implicit_dependencies);
> +
> +/**
> + * drm_dep_job_is_signaled() - check whether a dep job has completed
> + * @job: dep job to check
> + *
> + * Determines whether @job has signalled. The queue should be stopped before
> + * calling this to obtain a stable snapshot of state. Both the parent hardware
> + * fence and the finished software fence are checked.
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if the job is signalled, false otherwise.
> + */
> +bool drm_dep_job_is_signaled(struct drm_dep_job *job)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> + return drm_dep_fence_is_complete(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_is_signaled);
> +
> +/**
> + * drm_dep_job_is_finished() - test whether a dep job's finished fence has signalled
> + * @job: dep job to check
> + *
> + * Tests whether the job's software finished fence has been signalled, using
> + * dma_fence_test_signaled_flag() to avoid any signalling side-effects. Unlike
> + * drm_dep_job_is_signaled(), this does not require the queue to be stopped and
> + * does not check the parent hardware fence — it is a lightweight test of the
> + * finished fence only.
> + *
> + * Context: Any context.
> + * Return: true if the job's finished fence has been signalled, false otherwise.
> + */
> +bool drm_dep_job_is_finished(struct drm_dep_job *job)
> +{
> + return drm_dep_fence_is_finished(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_is_finished);
> +
> +/**
> + * drm_dep_job_invalidate_job() - increment the invalidation count for a job
> + * @job: dep job to invalidate
> + * @threshold: threshold above which the job is considered invalidated
> + *
> + * Increments @job->invalidate_count and returns true if it exceeds @threshold,
> + * indicating the job should be considered hung and discarded. The queue must
> + * be stopped before calling this function.
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if @job->invalidate_count exceeds @threshold, false otherwise.
> + */
> +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> + return ++job->invalidate_count > threshold;
> +}
> +EXPORT_SYMBOL(drm_dep_job_invalidate_job);
> +
> +/**
> + * drm_dep_job_finished_fence() - return the finished fence for a job
> + * @job: dep job to query
> + *
> + * No reference is taken on the returned fence; the caller must hold its own
> + * reference to @job for the duration of any access.
> + *
> + * Context: Any context.
> + * Return: the finished &dma_fence for @job.
> + */
> +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job)
> +{
> + return drm_dep_fence_to_dma(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_finished_fence);
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.h b/drivers/gpu/drm/dep/drm_dep_job.h
> new file mode 100644
> index 000000000000..35c61d258fa1
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_JOB_H_
> +#define _DRM_DEP_JOB_H_
> +
> +struct drm_dep_queue;
> +
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job);
> +
> +#endif /* _DRM_DEP_JOB_H_ */
> diff --git a/drivers/gpu/drm/dep/drm_dep_queue.c b/drivers/gpu/drm/dep/drm_dep_queue.c
> new file mode 100644
> index 000000000000..dac02d0d22c4
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_queue.c
> @@ -0,0 +1,1647 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency queue
> + *
> + * The drm_dep subsystem provides a lightweight GPU submission queue that
> + * combines the roles of drm_gpu_scheduler and drm_sched_entity into a
> + * single object (struct drm_dep_queue). Each queue owns its own ordered
> + * submit workqueue, timeout workqueue, and TDR delayed-work.
> + *
> + * **Job lifecycle**
> + *
> + * 1. Allocate and initialise a job with drm_dep_job_init().
> + * 2. Add dependency fences with drm_dep_job_add_dependency() and friends.
> + * 3. Arm the job with drm_dep_job_arm() to obtain its out-fences.
> + * 4. Submit with drm_dep_job_push().
> + *
> + * **Submission paths**
> + *
> + * drm_dep_job_push() decides between two paths under @q->sched.lock:
> + *
> + * - **Bypass path** (drm_dep_queue_can_job_bypass()): if
> + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the queue is not stopped,
> + * the SPSC queue is empty, the job has no dependency fences, and credits
> + * are available, the job is submitted inline on the calling thread without
> + * touching the submit workqueue.
> + *
> + * - **Queued path** (drm_dep_queue_push_job()): the job is pushed onto an
> + * SPSC queue and the run_job worker is kicked. The run_job worker pops the
> + * job, resolves any remaining dependency fences (installing wakeup
> + * callbacks for unresolved ones), and calls drm_dep_queue_run_job().
> + *
> + * **Running a job**
> + *
> + * drm_dep_queue_run_job() accounts credits, appends the job to the pending
> + * list (starting the TDR timer only when the list was previously empty),
> + * calls @ops->run_job(), stores the returned hardware fence as the parent
> + * of the job's dep fence, then installs a callback on it. When the hardware
> + * fence fires (or the job completes synchronously), drm_dep_job_done()
> + * signals the finished fence, returns credits, and kicks the put_job worker
> + * to free the job.
> + *
> + * **Timeout detection and recovery (TDR)**
> + *
> + * A delayed work item fires when a job on the pending list takes longer than
> + * @q->job.timeout jiffies. It calls @ops->timedout_job() and acts on the
> + * returned status (%DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED or
> + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB).
> + * drm_dep_queue_trigger_timeout() forces the timer to fire immediately (without
> + * changing the stored timeout), for example during device teardown.
> + *
> + * **Reference counting**
> + *
> + * Jobs and queues are both reference counted.
> + *
> + * A job holds a reference to its queue from drm_dep_job_init() until
> + * drm_dep_job_put() drops the job's last reference and its release callback
> + * runs. This ensures the queue remains valid for the entire lifetime of any
> + * job that was submitted to it.
> + *
> + * The queue holds its own reference to a job for as long as the job is
> + * internally tracked: from the moment the job is added to the pending list
> + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> + * worker, which calls drm_dep_job_put() to release that reference.
> + *
> + * **Hazard: use-after-free from within a worker**
> + *
> + * Because a job holds a queue reference, drm_dep_job_put() dropping the last
> + * job reference will also drop a queue reference via the job's release path.
> + * If that happens to be the last queue reference, drm_dep_queue_fini() can be
> + * called, which queues @q->free_work on dep_free_wq and returns immediately.
> + * free_work calls disable_work_sync() / disable_delayed_work_sync() on the
> + * queue's own workers before destroying its workqueues, so in practice a
> + * running worker always completes before the queue memory is freed.
> + *
> + * However, there is a secondary hazard: a worker can be queued while the
> + * queue is in a "zombie" state — refcount has already reached zero and async
> + * teardown is in flight, but the work item has not yet been disabled by
> + * free_work. To guard against this, every worker uses
> + * drm_dep_queue_get_unless_zero() at entry; if the refcount is already
> + * zero, the worker bails immediately without touching the queue state.
> + *
> + * Because all actual teardown (disable_*_sync, destroy_workqueue) runs on
> + * dep_free_wq — which is independent of the queue's own submit/timeout
> + * workqueues — there is no deadlock risk. Each queue holds a drm_dev_get()
> + * reference on its owning &drm_device, which is released as the last step of
> + * teardown. This ensures the driver module cannot be unloaded while any queue
> + * is still alive.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/module.h>
> +#include <linux/overflow.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/workqueue.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_print.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/*
> + * Dedicated workqueue for deferred drm_dep_queue teardown. Using a
> + * module-private WQ instead of system_percpu_wq keeps teardown isolated
> + * from unrelated kernel subsystems.
> + */
> +static struct workqueue_struct *dep_free_wq;
> +
> +/**
> + * drm_dep_queue_flags_set() - set a flag on the queue under sched.lock
> + * @q: dep queue
> + * @flag: flag to set (one of &enum drm_dep_queue_flags)
> + *
> + * Sets @flag in @q->sched.flags. Must be called with @q->sched.lock
> + * held; the lockdep assertion enforces this.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_flags_set(struct drm_dep_queue *q,
> + enum drm_dep_queue_flags flag)
> +{
> + lockdep_assert_held(&q->sched.lock);
> + q->sched.flags |= flag;
> +}
> +
> +/**
> + * drm_dep_queue_flags_clear() - clear a flag on the queue under sched.lock
> + * @q: dep queue
> + * @flag: flag to clear (one of &enum drm_dep_queue_flags)
> + *
> + * Clears @flag in @q->sched.flags. Must be called with @q->sched.lock
> + * held; the lockdep assertion enforces this.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_flags_clear(struct drm_dep_queue *q,
> + enum drm_dep_queue_flags flag)
> +{
> + lockdep_assert_held(&q->sched.lock);
> + q->sched.flags &= ~flag;
> +}
> +
> +/**
> + * drm_dep_queue_has_credits() - check whether the queue has enough credits
> + * @q: dep queue
> + * @job: job requesting credits
> + *
> + * Checks whether the queue has enough available credits to dispatch
> + * @job. If @job->credits exceeds the queue's credit limit, it is
> + * clamped with a WARN.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if available credits >= @job->credits, false otherwise.
> + */
> +static bool drm_dep_queue_has_credits(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + u32 available;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (job->credits > q->credit.limit) {
> + drm_warn(q->drm,
> + "Jobs may not exceed the credit limit, truncate.\n");
> + job->credits = q->credit.limit;
> + }
> +
> + WARN_ON(check_sub_overflow(q->credit.limit,
> + atomic_read(&q->credit.count),
> + &available));
> +
> + return available >= job->credits;
> +}
> +
> +/**
> + * drm_dep_queue_run_job_queue() - kick the run-job worker
> + * @q: dep queue
> + *
> + * Queues @q->sched.run_job on @q->sched.submit_wq unless the queue is stopped
> + * or the job queue is empty. The empty-queue check avoids queueing a work item
> + * that would immediately return with nothing to do.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_run_job_queue(struct drm_dep_queue *q)
> +{
> + if (!drm_dep_queue_is_stopped(q) && spsc_queue_count(&q->job.queue))
> + queue_work(q->sched.submit_wq, &q->sched.run_job);
> +}
> +
> +/**
> + * drm_dep_queue_put_job_queue() - kick the put-job worker
> + * @q: dep queue
> + *
> + * Queues @q->sched.put_job on @q->sched.submit_wq unless the queue
> + * is stopped.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_put_job_queue(struct drm_dep_queue *q)
> +{
> + if (!drm_dep_queue_is_stopped(q))
> + queue_work(q->sched.submit_wq, &q->sched.put_job);
> +}
> +
> +/**
> + * drm_queue_start_timeout() - arm or re-arm the TDR delayed work
> + * @q: dep queue
> + *
> + * Arms the TDR delayed work with @q->job.timeout. No-op if
> + * @q->ops->timedout_job is NULL, the timeout is MAX_SCHEDULE_TIMEOUT,
> + * or the pending list is empty.
> + *
> + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> + */
> +static void drm_queue_start_timeout(struct drm_dep_queue *q)
> +{
> + lockdep_assert_held(&q->job.lock);
> +
> + if (!q->ops->timedout_job ||
> + q->job.timeout == MAX_SCHEDULE_TIMEOUT ||
> + list_empty(&q->job.pending))
> + return;
> +
> + mod_delayed_work(q->sched.timeout_wq, &q->sched.tdr, q->job.timeout);
> +}
> +
> +/**
> + * drm_queue_start_timeout_unlocked() - arm TDR, acquiring job.lock
> + * @q: dep queue
> + *
> + * Acquires @q->job.lock with interrupts disabled and calls
> + * drm_queue_start_timeout().
> + *
> + * Context: Process context (workqueue).
> + */
> +static void drm_queue_start_timeout_unlocked(struct drm_dep_queue *q)
> +{
> + guard(spinlock_irq)(&q->job.lock);
> + drm_queue_start_timeout(q);
> +}
> +
> +/**
> + * drm_dep_queue_remove_dependency() - clear the active dependency and wake
> + * the run-job worker
> + * @q: dep queue
> + * @f: the dependency fence being removed
> + *
> + * Stores @f into @q->dep.removed_fence via smp_store_release() so that the
> + * run-job worker can drop the reference to it in drm_dep_queue_is_ready(),
> + * paired with smp_load_acquire(). Clears @q->dep.fence and kicks the
> + * run-job worker.
> + *
> + * The fence reference is not dropped here; it is deferred to the run-job
> + * worker via @q->dep.removed_fence, keeping this path safe for use from a
> + * dma_fence callback and during callback removal in drm_dep_queue_kill().
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_remove_dependency(struct drm_dep_queue *q,
> + struct dma_fence *f)
> +{
> + /* removed_fence must be visible to the reader before &q->dep.fence */
> + smp_store_release(&q->dep.removed_fence, f);
> +
> + WRITE_ONCE(q->dep.fence, NULL);
> + drm_dep_queue_run_job_queue(q);
> +}
> +
> +/**
> + * drm_dep_queue_wakeup() - dma_fence callback to wake the run-job worker
> + * @f: the signalled dependency fence
> + * @cb: callback embedded in the dep queue
> + *
> + * Called from dma_fence_signal() when the active dependency fence signals.
> + * Delegates to drm_dep_queue_remove_dependency() to clear @q->dep.fence and
> + * kick the run-job worker. The fence reference is not dropped here; it is
> + * deferred to the run-job worker via @q->dep.removed_fence.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct drm_dep_queue *q =
> + container_of(cb, struct drm_dep_queue, dep.cb);
> +
> + drm_dep_queue_remove_dependency(q, f);
> +}
> +
> +/**
> + * drm_dep_queue_is_ready() - check whether the queue has a dispatchable job
> + * @q: dep queue
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if the SPSC queue is non-empty and no dependency fence is
> + * pending, false otherwise.
> + */
> +static bool drm_dep_queue_is_ready(struct drm_dep_queue *q)
> +{
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (!spsc_queue_count(&q->job.queue))
> + return false;
> +
> + if (READ_ONCE(q->dep.fence))
> + return false;
> +
> + /* Paired with smp_store_release in drm_dep_queue_remove_dependency() */
> + dma_fence_put(smp_load_acquire(&q->dep.removed_fence));
> +
> + q->dep.removed_fence = NULL;
> +
> + return true;
> +}
> +
> +/**
> + * drm_dep_queue_is_killed() - check whether a dep queue has been killed
> + * @q: dep queue to check
> + *
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_KILLED is set on @q, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_killed(struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_KILLED);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_killed);
> +
> +/**
> + * drm_dep_queue_is_initialized() - check whether a dep queue has been initialized
> + * @q: dep queue to check
> + *
> + * A queue is considered initialized once its ops pointer has been set by a
> + * successful call to drm_dep_queue_init(). Drivers that embed a
> + * &drm_dep_queue inside a larger structure may call this before attempting any
> + * other queue operation to confirm that initialization has taken place.
> + * If this function returns true, drm_dep_queue_put() must eventually be
> + * called to drop the initialization reference taken by drm_dep_queue_init().
> + *
> + * Return: true if @q has been initialized, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q)
> +{
> + return !!q->ops;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_initialized);
> +
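For drivers that embed the queue, the check-then-put pattern described in the kernel-doc above might look like this (the surrounding driver structure and names are hypothetical, just to illustrate the contract):

```c
/* Hypothetical driver structure embedding a dep queue. */
struct my_exec_ctx {
	struct drm_dep_queue q;
	/* ... driver state ... */
};

static void my_exec_ctx_destroy(struct my_exec_ctx *ctx)
{
	/*
	 * The queue may never have been initialized if setup failed
	 * early; only drop the init reference when it was.
	 */
	if (drm_dep_queue_is_initialized(&ctx->q))
		drm_dep_queue_put(&ctx->q);
}
```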
> +/**
> + * drm_dep_queue_set_stopped() - pre-mark a queue as stopped before first use
> + * @q: dep queue to mark
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED directly on @q without going through the
> + * normal drm_dep_queue_stop() path. This is only valid during the driver-side
> + * queue initialization sequence, i.e. after drm_dep_queue_init() returns but
> + * before the queue is made visible to other threads (e.g. before it is added
> + * to any lookup structures). Using this after the queue is live is a driver
> + * bug; use drm_dep_queue_stop() instead.
> + *
> + * Context: Process context, queue not yet visible to other threads.
> + */
> +void drm_dep_queue_set_stopped(struct drm_dep_queue *q)
> +{
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_STOPPED;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_set_stopped);
> +
> +/**
> + * drm_dep_queue_refcount() - read the current reference count of a queue
> + * @q: dep queue to query
> + *
> + * Returns the instantaneous kref value. The count may change immediately
> + * after this call; callers must not make safety decisions based solely on
> + * the returned value. Intended for diagnostic snapshots and debugfs output.
> + *
> + * Context: Any context.
> + * Return: current reference count.
> + */
> +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q)
> +{
> + return kref_read(&q->refcount);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_refcount);
> +
> +/**
> + * drm_dep_queue_timeout() - read the per-job TDR timeout for a queue
> + * @q: dep queue to query
> + *
> + * Returns the per-job timeout in jiffies as set at init time.
> + * %MAX_SCHEDULE_TIMEOUT means no timeout is configured.
> + *
> + * Context: Any context.
> + * Return: timeout in jiffies.
> + */
> +long drm_dep_queue_timeout(const struct drm_dep_queue *q)
> +{
> + return q->job.timeout;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_timeout);
> +
> +/**
> + * drm_dep_queue_is_job_put_irq_safe() - test whether job-put from IRQ is allowed
> + * @q: dep queue
> + *
> + * Context: Any context.
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set,
> + * false otherwise.
> + */
> +static bool drm_dep_queue_is_job_put_irq_safe(const struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE);
> +}
> +
> +/**
> + * drm_dep_queue_job_dependency() - get next unresolved dep fence
> + * @q: dep queue
> + * @job: job whose dependencies to advance
> + *
> + * Returns NULL immediately if the queue has been killed via
> + * drm_dep_queue_kill(), bypassing all dependency waits so that jobs
> + * drain through run_job as quickly as possible.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: next unresolved &dma_fence with a new reference, or NULL
> + * when all dependencies have been consumed (or the queue is killed).
> + */
> +static struct dma_fence *
> +drm_dep_queue_job_dependency(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + struct dma_fence *f;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (drm_dep_queue_is_killed(q))
> + return NULL;
> +
> + f = xa_load(&job->dependencies, job->last_dependency);
> + if (f) {
> + job->last_dependency++;
> + if (WARN_ON(DRM_DEP_JOB_FENCE_PREALLOC == f))
> + return dma_fence_get_stub();
> + return dma_fence_get(f);
> + }
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_dep_queue_add_dep_cb() - install wakeup callback on dep fence
> + * @q: dep queue
> + * @job: job whose dependency fence is stored in @q->dep.fence
> + *
> + * Installs a wakeup callback on @q->dep.fence. Returns true if the
> + * callback was installed (the queue must wait), false if the fence is
> + * already signalled or is a self-fence from the same queue context.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if callback installed, false if fence already done.
> + */
> +static bool drm_dep_queue_add_dep_cb(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + struct dma_fence *fence = q->dep.fence;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (WARN_ON(fence->context == q->fence.context)) {
> + dma_fence_put(q->dep.fence);
> + q->dep.fence = NULL;
> + return false;
> + }
> +
> + if (!dma_fence_add_callback(q->dep.fence, &q->dep.cb,
> + drm_dep_queue_wakeup))
> + return true;
> +
> + dma_fence_put(q->dep.fence);
> + q->dep.fence = NULL;
> +
> + return false;
> +}
> +
> +/**
> + * drm_dep_queue_pop_job() - pop a dispatchable job from the SPSC queue
> + * @q: dep queue
> + *
> + * Peeks at the head of the SPSC queue and drains all resolved
> + * dependencies. If a dependency is still pending, installs a wakeup
> + * callback and returns NULL. On success pops the job and returns it.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: next dispatchable job, or NULL if a dep is still pending.
> + */
> +static struct drm_dep_job *drm_dep_queue_pop_job(struct drm_dep_queue *q)
> +{
> + struct spsc_node *node;
> + struct drm_dep_job *job;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + node = spsc_queue_peek(&q->job.queue);
> + if (!node)
> + return NULL;
> +
> + job = container_of(node, struct drm_dep_job, queue_node);
> +
> + while ((q->dep.fence = drm_dep_queue_job_dependency(q, job))) {
> + if (drm_dep_queue_add_dep_cb(q, job))
> + return NULL;
> + }
> +
> + spsc_queue_pop(&q->job.queue);
> +
> + return job;
> +}
> +
> +/**
> + * drm_dep_queue_get_unless_zero() - try to acquire a queue reference
> + * @q: dep queue to take a reference on
> + *
> + * Workers use this instead of drm_dep_queue_get() to guard against the zombie
> + * state: the queue's refcount has already reached zero (async teardown is in
> + * flight) but a work item was queued before free_work had a chance to cancel
> + * it. If kref_get_unless_zero() fails the caller must bail immediately.
> + *
> + * Context: Any context.
> + * Return: true if the reference was acquired, false if the queue is zombie.
> + */
> +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q)
> +{
> + return kref_get_unless_zero(&q->refcount);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_get_unless_zero);
> +
> +/**
> + * drm_dep_queue_run_job_work() - run-job worker
> + * @work: work item embedded in the dep queue
> + *
> + * Acquires @q->sched.lock, checks stopped state, queue readiness and
> + * available credits, pops the next job via drm_dep_queue_pop_job(),
> + * dispatches it via drm_dep_queue_run_job(), then re-kicks itself.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_run_job_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.run_job);
> + struct spsc_node *node;
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + mutex_lock(&q->sched.lock);
> +
> + if (drm_dep_queue_is_stopped(q))
> + goto put_queue;
> +
> + if (!drm_dep_queue_is_ready(q))
> + goto put_queue;
> +
> + /* Peek to check credits before committing to pop and dep resolution */
> + node = spsc_queue_peek(&q->job.queue);
> + if (!node)
> + goto put_queue;
> +
> + job = container_of(node, struct drm_dep_job, queue_node);
> + if (!drm_dep_queue_has_credits(q, job))
> + goto put_queue;
> +
> + job = drm_dep_queue_pop_job(q);
> + if (!job)
> + goto put_queue;
> +
> + drm_dep_queue_run_job(q, job);
> + drm_dep_queue_run_job_queue(q);
> +
> +put_queue:
> + mutex_unlock(&q->sched.lock);
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_queue_remove_job() - unlink a job from the pending list and reset TDR
> + * @q: dep queue owning @job
> + * @job: job to remove
> + *
> + * Splices @job out of @q->job.pending, cancels any pending TDR delayed work,
> + * and arms the timeout for the new list head (if any).
> + *
> + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_remove_job(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + lockdep_assert_held(&q->job.lock);
> +
> + list_del_init(&job->pending_link);
> + cancel_delayed_work(&q->sched.tdr);
> + drm_queue_start_timeout(q);
> +}
> +
> +/**
> + * drm_dep_queue_get_finished_job() - dequeue a finished job
> + * @q: dep queue
> + *
> + * Under @q->job.lock checks the head of the pending list for a
> + * finished dep fence. If found, removes the job from the list,
> + * cancels the TDR, and re-arms it for the new head.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + * Return: the finished &drm_dep_job, or NULL if none is ready.
> + */
> +static struct drm_dep_job *
> +drm_dep_queue_get_finished_job(struct drm_dep_queue *q)
> +{
> + struct drm_dep_job *job;
> +
> + guard(spinlock_irq)(&q->job.lock);
> +
> + job = list_first_entry_or_null(&q->job.pending, struct drm_dep_job,
> + pending_link);
> + if (job && drm_dep_fence_is_finished(job->dfence))
> + drm_dep_queue_remove_job(q, job);
> + else
> + job = NULL;
> +
> + return job;
> +}
> +
> +/**
> + * drm_dep_queue_put_job_work() - put-job worker
> + * @work: work item embedded in the dep queue
> + *
> + * Drains all finished jobs by calling drm_dep_job_put() in a loop,
> + * then kicks the run-job worker.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * because the workqueue is shared with other items in the fence signaling path.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_put_job_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.put_job);
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + while ((job = drm_dep_queue_get_finished_job(q)))
> + drm_dep_job_put(job);
> +
> + drm_dep_queue_run_job_queue(q);
> +
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_queue_tdr_work() - TDR worker
> + * @work: work item embedded in the delayed TDR work
> + *
> + * Removes the head job from the pending list under @q->job.lock,
> + * asserts @q->ops->timedout_job is non-NULL, calls it outside the lock,
> + * requeues the job if %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB, drops the
> + * queue's job reference on %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED, and always
> + * restarts the TDR timer after handling the job (unless @q is stopping).
> + * Any other return value triggers a WARN.
> + *
> + * The TDR is never armed when @q->ops->timedout_job is NULL, so firing
> + * this worker without a timedout_job callback is a driver bug.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * because timedout_job() is expected to signal the guilty job's fence as part
> + * of reset.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_tdr_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.tdr.work);
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + scoped_guard(spinlock_irq, &q->job.lock) {
> + job = list_first_entry_or_null(&q->job.pending,
> + struct drm_dep_job,
> + pending_link);
> + if (job)
> + /*
> + * Remove from pending so it cannot be freed
> + * concurrently by drm_dep_queue_get_finished_job() or
> + * .drm_dep_job_done().
> + */
> + list_del_init(&job->pending_link);
> + }
> +
> + if (job) {
> + enum drm_dep_timedout_stat status;
> +
> + if (WARN_ON(!q->ops->timedout_job)) {
> + drm_dep_job_put(job);
> + goto out;
> + }
> +
> + status = q->ops->timedout_job(job);
> +
> + switch (status) {
> + case DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB:
> + scoped_guard(spinlock_irq, &q->job.lock)
> + list_add(&job->pending_link, &q->job.pending);
> + drm_dep_queue_put_job_queue(q);
> + break;
> + case DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED:
> + drm_dep_job_put(job);
> + break;
> + default:
> + WARN(1, "invalid drm_dep_timedout_stat\n");
> + break;
> + }
> + }
> +
> +out:
> + drm_queue_start_timeout_unlocked(q);
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_alloc_submit_wq() - allocate an ordered submit workqueue
> + * @name: name for the workqueue
> + * @flags: DRM_DEP_QUEUE_FLAGS_* flags
> + *
> + * Allocates an ordered workqueue for job submission with %WQ_MEM_RECLAIM and
> + * %WQ_MEM_WARN_ON_RECLAIM set, ensuring the workqueue is safe to use from
> + * memory reclaim context and properly annotated for lockdep taint tracking.
> + * Adds %WQ_HIGHPRI if %DRM_DEP_QUEUE_FLAGS_HIGHPRI is set. When
> + * CONFIG_LOCKDEP is enabled, uses a dedicated lockdep map for annotation.
> + *
> + * Context: Process context.
> + * Return: the new &workqueue_struct, or NULL on failure.
> + */
> +static struct workqueue_struct *
> +drm_dep_alloc_submit_wq(const char *name, enum drm_dep_queue_flags flags)
> +{
> + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> +
> + if (flags & DRM_DEP_QUEUE_FLAGS_HIGHPRI)
> + wq_flags |= WQ_HIGHPRI;
> +
> +#if IS_ENABLED(CONFIG_LOCKDEP)
> + static struct lockdep_map map = {
> + .name = "drm_dep_submit_lockdep_map"
> + };
> + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> +#else
> + return alloc_ordered_workqueue(name, wq_flags);
> +#endif
> +}
> +
> +/**
> + * drm_dep_alloc_timeout_wq() - allocate an ordered TDR workqueue
> + * @name: name for the workqueue
> + *
> + * Allocates an ordered workqueue for timeout detection and recovery with
> + * %WQ_MEM_RECLAIM and %WQ_MEM_WARN_ON_RECLAIM set, ensuring consistent taint
> + * annotation with the submit workqueue. When CONFIG_LOCKDEP is enabled, uses
> + * a dedicated lockdep map for annotation.
> + *
> + * Context: Process context.
> + * Return: the new &workqueue_struct, or NULL on failure.
> + */
> +static struct workqueue_struct *drm_dep_alloc_timeout_wq(const char *name)
> +{
> + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> +
> +#if IS_ENABLED(CONFIG_LOCKDEP)
> + static struct lockdep_map map = {
> + .name = "drm_dep_timeout_lockdep_map"
> + };
> + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> +#else
> + return alloc_ordered_workqueue(name, wq_flags);
> +#endif
> +}
> +
> +/**
> + * drm_dep_queue_init() - initialize a dep queue
> + * @q: dep queue to initialize
> + * @args: initialization arguments
> + *
> + * Initializes all fields of @q from @args. If @args->submit_wq is NULL an
> + * ordered workqueue is allocated and owned by the queue
> + * (%DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ). If @args->timeout_wq is NULL an
> + * ordered workqueue is allocated and owned by the queue
> + * (%DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ). On success the queue holds one kref
> + * reference and drm_dep_queue_put() must be called to drop this reference
> + * (i.e., drivers cannot directly free the queue).
> + *
> + * When CONFIG_LOCKDEP is enabled, @q->sched.lock is primed against the
> + * fs_reclaim pseudo-lock so that lockdep can detect any lock ordering
> + * inversion between @sched.lock and memory reclaim.
> + *
> + * Return: 0 on success, %-EINVAL when @args->credit_limit is zero, @args->ops
> + * is NULL, @args->drm is NULL, @args->ops->run_job is NULL, or when
> + * @args->submit_wq or @args->timeout_wq is non-NULL but was not allocated with
> + * %WQ_MEM_WARN_ON_RECLAIM; %-ENOMEM when workqueue allocation fails.
> + *
> + * Context: Process context. May allocate memory and create workqueues.
> + */
> +int drm_dep_queue_init(struct drm_dep_queue *q,
> + const struct drm_dep_queue_init_args *args)
> +{
> + if (!args->credit_limit || !args->drm || !args->ops ||
> + !args->ops->run_job)
> + return -EINVAL;
> +
> + if (args->submit_wq && !workqueue_is_reclaim_annotated(args->submit_wq))
> + return -EINVAL;
> +
> + if (args->timeout_wq &&
> + !workqueue_is_reclaim_annotated(args->timeout_wq))
> + return -EINVAL;
> +
> + memset(q, 0, sizeof(*q));
> +
> + q->name = args->name;
> + q->drm = args->drm;
> + q->credit.limit = args->credit_limit;
> + q->job.timeout = args->timeout ? args->timeout : MAX_SCHEDULE_TIMEOUT;
> +
> + init_rcu_head(&q->rcu);
> + INIT_LIST_HEAD(&q->job.pending);
> + spin_lock_init(&q->job.lock);
> + spsc_queue_init(&q->job.queue);
> +
> + mutex_init(&q->sched.lock);
> + if (IS_ENABLED(CONFIG_LOCKDEP)) {
> + fs_reclaim_acquire(GFP_KERNEL);
> + might_lock(&q->sched.lock);
> + fs_reclaim_release(GFP_KERNEL);
> + }
> +
> + if (args->submit_wq) {
> + q->sched.submit_wq = args->submit_wq;
> + } else {
> + q->sched.submit_wq = drm_dep_alloc_submit_wq(args->name ?: "drm_dep",
> + args->flags);
> + if (!q->sched.submit_wq)
> + return -ENOMEM;
> +
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ;
> + }
> +
> + if (args->timeout_wq) {
> + q->sched.timeout_wq = args->timeout_wq;
> + } else {
> + q->sched.timeout_wq = drm_dep_alloc_timeout_wq(args->name ?: "drm_dep");
> + if (!q->sched.timeout_wq)
> + goto err_submit_wq;
> +
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ;
> + }
> +
> + q->sched.flags |= args->flags &
> + ~(DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ |
> + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ);
> +
> + INIT_DELAYED_WORK(&q->sched.tdr, drm_dep_queue_tdr_work);
> + INIT_WORK(&q->sched.run_job, drm_dep_queue_run_job_work);
> + INIT_WORK(&q->sched.put_job, drm_dep_queue_put_job_work);
> +
> + q->fence.context = dma_fence_context_alloc(1);
> +
> + kref_init(&q->refcount);
> + q->ops = args->ops;
> + drm_dev_get(q->drm);
> +
> + return 0;
> +
> +err_submit_wq:
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> + destroy_workqueue(q->sched.submit_wq);
> + mutex_destroy(&q->sched.lock);
> +
> + return -ENOMEM;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_init);
> +
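As a usage illustration, a driver-side init call might look like the sketch below. The args field names follow the code above; the driver-side names (`my_queue_ops`, `my_run_job`, etc.) are hypothetical:

```c
static const struct drm_dep_queue_ops my_queue_ops = {
	.run_job = my_run_job,
	.timedout_job = my_timedout_job,
};

static int my_queue_create(struct my_device *mdev, struct drm_dep_queue *q)
{
	const struct drm_dep_queue_init_args args = {
		.name = "my-queue",
		.drm = &mdev->drm,
		.ops = &my_queue_ops,
		.credit_limit = 64,
		.timeout = msecs_to_jiffies(5000),
		/* submit_wq/timeout_wq left NULL: queue allocates and owns them */
	};

	/* On success the queue holds one kref; drop with drm_dep_queue_put(). */
	return drm_dep_queue_init(q, &args);
}
```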
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> +/**
> + * drm_dep_queue_push_job_begin() - mark the start of an arm/push critical section
> + * @q: dep queue the job belongs to
> + *
> + * Called at the start of drm_dep_job_arm() and warns if the push context is
> + * already owned by another task, which would indicate concurrent arm/push on
> + * the same queue.
> + *
> + * No-op when CONFIG_PROVE_LOCKING is disabled.
> + *
> + * Context: Process context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> +{
> + WARN_ON(q->job.push.owner);
> + q->job.push.owner = current;
> +}
> +
> +/**
> + * drm_dep_queue_push_job_end() - mark the end of an arm/push critical section
> + * @q: dep queue the job belongs to
> + *
> + * Called at the end of drm_dep_job_push() and warns if the push context is not
> + * owned by the current task, which would indicate a mismatched begin/end pair
> + * or a push from the wrong thread.
> + *
> + * No-op when CONFIG_PROVE_LOCKING is disabled.
> + *
> + * Context: Process context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> +{
> + WARN_ON(q->job.push.owner != current);
> + q->job.push.owner = NULL;
> +}
> +#endif
> +
> +/**
> + * drm_dep_queue_assert_teardown_invariants() - assert teardown invariants
> + * @q: dep queue being torn down
> + *
> + * Warns if the pending-job list, the SPSC submission queue, or the credit
> + * counter is non-zero when called, or if the queue still has a non-zero
> + * reference count.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_assert_teardown_invariants(struct drm_dep_queue *q)
> +{
> + WARN_ON(!list_empty(&q->job.pending));
> + WARN_ON(spsc_queue_count(&q->job.queue));
> + WARN_ON(atomic_read(&q->credit.count));
> + WARN_ON(drm_dep_queue_refcount(q));
> +}
> +
> +/**
> + * drm_dep_queue_release() - final internal cleanup of a dep queue
> + * @q: dep queue to clean up
> + *
> + * Asserts teardown invariants and destroys internal resources allocated by
> + * drm_dep_queue_init() that cannot be torn down earlier in the teardown
> + * sequence. Currently this destroys @q->sched.lock.
> + *
> + * Drivers that implement &drm_dep_queue_ops.release **must** call this
> + * function after removing @q from any internal bookkeeping (e.g. lookup
> + * tables or lists) but before freeing the memory that contains @q. When
> + * &drm_dep_queue_ops.release is NULL, drm_dep follows the default teardown
> + * path and calls this function automatically.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_release(struct drm_dep_queue *q)
> +{
> + drm_dep_queue_assert_teardown_invariants(q);
> + mutex_destroy(&q->sched.lock);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_release);
> +
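A hypothetical driver-side &drm_dep_queue_ops.release implementation following the contract above (the embedding structure and bookkeeping helper are illustrative only):

```c
static void my_queue_release(struct drm_dep_queue *q)
{
	struct my_exec_ctx *ctx = container_of(q, struct my_exec_ctx, q);

	/* Remove from driver bookkeeping before releasing the queue. */
	my_device_forget_ctx(ctx);

	/* Mandatory: assert invariants and destroy internal queue resources. */
	drm_dep_queue_release(q);

	/* Queue memory is embedded, so free the containing structure. */
	kfree_rcu(ctx, rcu);
}
```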
> +/**
> + * drm_dep_queue_free() - final cleanup of a dep queue
> + * @q: dep queue to free
> + *
> + * Invokes &drm_dep_queue_ops.release if set, in which case the driver is
> + * responsible for calling drm_dep_queue_release() and freeing @q itself.
> + * If &drm_dep_queue_ops.release is NULL, calls drm_dep_queue_release()
> + * and then frees @q with kfree_rcu().
> + *
> + * In either case, releases the drm_dev_get() reference taken at init time
> + * via drm_dev_put(), allowing the owning &drm_device to be unloaded once
> + * all queues have been freed.
> + *
> + * Context: Process context (workqueue), reclaim safe.
> + */
> +static void drm_dep_queue_free(struct drm_dep_queue *q)
> +{
> + struct drm_device *drm = q->drm;
> +
> + if (q->ops->release) {
> + q->ops->release(q);
> + } else {
> + drm_dep_queue_release(q);
> + kfree_rcu(q, rcu);
> + }
> + drm_dev_put(drm);
> +}
> +
> +/**
> + * drm_dep_queue_free_work() - deferred queue teardown worker
> + * @work: free_work item embedded in the dep queue
> + *
> + * Runs on dep_free_wq. Disables all work items synchronously
> + * (preventing re-queue and waiting for in-flight instances),
> + * destroys any owned workqueues, then calls drm_dep_queue_free().
> + * Running on dep_free_wq ensures destroy_workqueue() is never
> + * called from within one of the queue's own workers (deadlock)
> + * and disable_*_sync() cannot deadlock either.
> + *
> + * Context: Process context (workqueue), reclaim safe.
> + */
> +static void drm_dep_queue_free_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, free_work);
> +
> + drm_dep_queue_assert_teardown_invariants(q);
> +
> + disable_delayed_work_sync(&q->sched.tdr);
> + disable_work_sync(&q->sched.run_job);
> + disable_work_sync(&q->sched.put_job);
> +
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ)
> + destroy_workqueue(q->sched.timeout_wq);
> +
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> + destroy_workqueue(q->sched.submit_wq);
> +
> + drm_dep_queue_free(q);
> +}
> +
> +/**
> + * drm_dep_queue_fini() - tear down a dep queue
> + * @q: dep queue to tear down
> + *
> + * Asserts teardown invariants and initiates teardown of @q by queuing the
> + * deferred free work onto the module-private dep_free_wq workqueue. The work
> + * item disables any pending TDR and run/put-job work synchronously, destroys
> + * any workqueues that were allocated by drm_dep_queue_init(), and then releases
> + * the queue memory.
> + *
> + * Running teardown from dep_free_wq ensures that destroy_workqueue() is never
> + * called from within one of the queue's own workers (e.g. via
> + * drm_dep_queue_put()), which would deadlock.
> + *
> + * Drivers can wait for all outstanding deferred work to complete by waiting
> + * for the last drm_dev_put() reference on their &drm_device, which is
> + * released as the final step of each queue's teardown.
> + *
> + * Drivers that implement &drm_dep_queue_ops.fini **must** call this
> + * function after removing @q from any device bookkeeping but before freeing the
> + * memory that contains @q. When &drm_dep_queue_ops.fini is NULL, drm_dep
> + * follows the default teardown path and calls this function automatically.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_fini(struct drm_dep_queue *q)
> +{
> + drm_dep_queue_assert_teardown_invariants(q);
> +
> + INIT_WORK(&q->free_work, drm_dep_queue_free_work);
> + queue_work(dep_free_wq, &q->free_work);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_fini);
> +
> +/**
> + * drm_dep_queue_get() - acquire a reference to a dep queue
> + * @q: dep queue to acquire a reference on, or NULL
> + *
> + * Return: @q with an additional reference held, or NULL if @q is NULL.
> + *
> + * Context: Any context.
> + */
> +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q)
> +{
> + if (q)
> + kref_get(&q->refcount);
> + return q;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_get);
> +
> +/**
> + * __drm_dep_queue_release() - kref release callback for a dep queue
> + * @kref: kref embedded in the dep queue
> + *
> + * Calls &drm_dep_queue_ops.fini if set, otherwise calls
> + * drm_dep_queue_fini() to initiate deferred teardown.
> + *
> + * Context: Any context.
> + */
> +static void __drm_dep_queue_release(struct kref *kref)
> +{
> + struct drm_dep_queue *q =
> + container_of(kref, struct drm_dep_queue, refcount);
> +
> + if (q->ops->fini)
> + q->ops->fini(q);
> + else
> + drm_dep_queue_fini(q);
> +}
> +
> +/**
> + * drm_dep_queue_put() - release a reference to a dep queue
> + * @q: dep queue to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_queue_ops.fini if set,
> + * otherwise calls drm_dep_queue_fini(). Final memory release is handled by
> + * &drm_dep_queue_ops.release (which must call drm_dep_queue_release()) if set,
> + * or drm_dep_queue_release() followed by kfree_rcu() otherwise.
> + * Does nothing if @q is NULL.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_put(struct drm_dep_queue *q)
> +{
> + if (q)
> + kref_put(&q->refcount, __drm_dep_queue_release);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_put);
> +
> +/**
> + * drm_dep_queue_stop() - stop a dep queue from processing new jobs
> + * @q: dep queue to stop
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> + * and @q->job.lock (spinlock_irq), making the flag safe to test from finished
> + * fence signaling context. Then cancels any in-flight run_job and put_job work
> + * items. Once stopped, the bypass path and the submit workqueue will not
> + * dispatch further jobs nor will any jobs be removed from the pending list.
> + * Call drm_dep_queue_start() to resume processing.
> + *
> + * Context: Process context. Waits for in-flight workers to complete.
> + */
> +void drm_dep_queue_stop(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + scoped_guard(spinlock_irq, &q->job.lock)
> + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> + }
> + cancel_work_sync(&q->sched.run_job);
> + cancel_work_sync(&q->sched.put_job);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_stop);
> +
> +/**
> + * drm_dep_queue_start() - resume a stopped dep queue
> + * @q: dep queue to start
> + *
> + * Clears %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> + * and @q->job.lock (spinlock_irq), making the flag safe to test from IRQ
> + * context. Then re-queues the run_job and put_job work items so that any jobs
> + * pending since the queue was stopped are processed. Must only be called after
> + * drm_dep_queue_stop().
> + *
> + * Context: Process context.
> + */
> +void drm_dep_queue_start(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + scoped_guard(spinlock_irq, &q->job.lock)
> + drm_dep_queue_flags_clear(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> + }
> + drm_dep_queue_run_job_queue(q);
> + drm_dep_queue_put_job_queue(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_start);
> +
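Taken together with drm_dep_queue_stop(), the expected device-reset flow is something like the following sketch (driver-side functions are hypothetical):

```c
static void my_device_reset(struct my_device *mdev, struct drm_dep_queue *q)
{
	/* Quiesce submission; waits for in-flight workers to complete. */
	drm_dep_queue_stop(q);

	my_hw_reset(mdev);

	/* Resume: re-kicks run_job/put_job for anything queued meanwhile. */
	drm_dep_queue_start(q);
}
```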
> +/**
> + * drm_dep_queue_trigger_timeout() - trigger the TDR immediately for
> + * all pending jobs
> + * @q: dep queue to trigger timeout on
> + *
> + * Sets @q->job.timeout to 1 and arms the TDR delayed work with a one-jiffy
> + * delay, causing it to fire almost immediately without hot-spinning at zero
> + * delay. This is used to force-expire any pending jobs on the queue, for
> + * example when the device is being torn down or has encountered an
> + * unrecoverable error.
> + *
> + * It is suggested that when this function is used, the first timedout_job call
> + * causes the driver to kick the queue off the hardware and signal all pending
> + * job fences. Subsequent calls continue to signal all pending job fences.
> + *
> + * Has no effect if the pending list is empty.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q)
> +{
> + guard(spinlock_irqsave)(&q->job.lock);
> + q->job.timeout = 1;
> + drm_queue_start_timeout(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_trigger_timeout);
> +
> +/**
> + * drm_dep_queue_cancel_tdr_sync() - cancel any pending TDR and wait
> + * for it to finish
> + * @q: dep queue whose TDR to cancel
> + *
> + * Cancels the TDR delayed work item if it has not yet started, and waits for
> + * it to complete if it is already running. After this call returns, the TDR
> + * worker is guaranteed not to be executing and will not fire again until
> + * explicitly rearmed (e.g. via drm_dep_queue_resume_timeout() or by a new
> + * job being submitted).
> + *
> + * Useful during error recovery or queue teardown when the caller needs to
> + * know that no timeout handling races with its own reset logic.
> + *
> + * Context: Process context. May sleep waiting for the TDR worker to finish.
> + */
> +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q)
> +{
> + cancel_delayed_work_sync(&q->sched.tdr);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_cancel_tdr_sync);
> +
> +/**
> + * drm_dep_queue_resume_timeout() - restart the TDR timer with the
> + * configured timeout
> + * @q: dep queue to resume the timeout for
> + *
> + * Restarts the TDR delayed work using @q->job.timeout. Called after device
> + * recovery to give pending jobs a fresh full timeout window. Has no effect
> + * if the pending list is empty.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q)
> +{
> + drm_queue_start_timeout_unlocked(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_resume_timeout);
> +
> +/**
> + * drm_dep_queue_is_stopped() - check whether a dep queue is stopped
> + * @q: dep queue to check
> + *
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_STOPPED is set on @q, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_STOPPED);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_stopped);
> +
> +/**
> + * drm_dep_queue_kill() - kill a dep queue and flush all pending jobs
> + * @q: dep queue to kill
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_KILLED on @q under @q->sched.lock. If a
> + * dependency fence is currently being waited on, its callback is removed and
> + * the run-job worker is kicked immediately so that the blocked job drains
> + * without waiting.
> + *
> + * Once killed, drm_dep_queue_job_dependency() returns NULL for all jobs,
> + * bypassing dependency waits so that every queued job drains through
> + * &drm_dep_queue_ops.run_job without blocking.
> + *
> + * The &drm_dep_queue_ops.run_job callback is guaranteed to be called for every
> + * job that was pushed before or after drm_dep_queue_kill(), even during queue
> + * teardown. Drivers should use this guarantee to perform any necessary
> + * bookkeeping cleanup without executing the actual backend operation when the
> + * queue is killed.
> + *
> + * Unlike drm_dep_queue_stop(), killing is one-way: there is no corresponding
> + * start function.
> + *
> + * **Driver safety requirement**
> + *
> + * drm_dep_queue_kill() must only be called once the driver can guarantee that
> + * no job in the queue will touch memory associated with any of its fences
> + * (i.e., the queue has been removed from the device and will never be put back
> + * on).
> + *
> + * Context: Process context.
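> + *
> + * Example — a hypothetical teardown order satisfying the safety
> + * requirement above (driver helper names are illustrative)::
> + *
> + *   my_driver_remove_queue_from_hw(q);  /* no job can touch fences now */
> + *   drm_dep_queue_kill(q);              /* drain remaining jobs */
> + *   drm_dep_queue_put(q);               /* drop the driver's reference */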
> + */
> +void drm_dep_queue_kill(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + struct dma_fence *fence;
> +
> + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_KILLED);
> +
> + /*
> + * Holding &q->sched.lock guarantees that the run-job work item
> + * cannot drop its reference to q->dep.fence concurrently, so
> + * reading q->dep.fence here is safe.
> + */
> + fence = READ_ONCE(q->dep.fence);
> + if (fence && dma_fence_remove_callback(fence, &q->dep.cb))
> + drm_dep_queue_remove_dependency(q, fence);
> + }
> +}
> +EXPORT_SYMBOL(drm_dep_queue_kill);
> +
> +/**
> + * drm_dep_queue_submit_wq() - retrieve the submit workqueue of a dep queue
> + * @q: dep queue whose workqueue to retrieve
> + *
> + * Drivers may use this to queue their own work items alongside the queue's
> + * internal run-job and put-job workers — for example to process incoming
> + * messages in the same serialisation domain.
> + *
> + * Prefer drm_dep_queue_work_enqueue() when the only need is to enqueue a
> + * work item, as it additionally checks the stopped state. Use this accessor
> + * when the workqueue itself is required (e.g. to pass in place of a
> + * driver-allocated ordered workqueue, or for drain_workqueue() calls).
> + *
> + * Context: Any context.
> + * Return: the &workqueue_struct used by @q for job submission.
> + */
> +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q)
> +{
> + return q->sched.submit_wq;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_submit_wq);
> +
> +/**
> + * drm_dep_queue_timeout_wq() - retrieve the timeout workqueue of a dep queue
> + * @q: dep queue whose workqueue to retrieve
> + *
> + * Returns the workqueue used by @q to run TDR (timeout detection and recovery)
> + * work. Drivers may use this to queue their own timeout-domain work items, or
> + * to call drain_workqueue() when tearing down and needing to ensure all pending
> + * timeout callbacks have completed before proceeding.
> + *
> + * Context: Any context.
> + * Return: the &workqueue_struct used by @q for TDR work.
> + */
> +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q)
> +{
> + return q->sched.timeout_wq;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_timeout_wq);
> +
> +/**
> + * drm_dep_queue_work_enqueue() - queue work on the dep queue's submit workqueue
> + * @q: dep queue to enqueue work on
> + * @work: work item to enqueue
> + *
> + * Queues @work on @q->sched.submit_wq if the queue is not stopped. This
> + * allows drivers to schedule custom work items that run serialised with the
> + * queue's own run-job and put-job workers.
> + *
> + * Context: Any context.
> + *
> + * Return: true if the work was queued, false if the queue is stopped or the
> + * work item was already pending.
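> + *
> + * Example — a hypothetical message-processing work item (driver names are
> + * illustrative)::
> + *
> + *   INIT_WORK(&my_msg_work, my_msg_worker);
> + *   if (!drm_dep_queue_work_enqueue(q, &my_msg_work))
> + *           my_complete_msg_with_error(q);  /* queue is stopped */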
> + */
> +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> + struct work_struct *work)
> +{
> + if (drm_dep_queue_is_stopped(q))
> + return false;
> +
> + return queue_work(q->sched.submit_wq, work);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_work_enqueue);
> +
> +/**
> + * drm_dep_queue_can_job_bypass() - test whether a job can skip the SPSC queue
> + * @q: dep queue
> + * @job: job to test
> + *
> + * A job may bypass the submit workqueue and run inline on the calling thread
> + * if all of the following hold:
> + *
> + * - %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set on the queue
> + * - the queue is not stopped
> + * - the SPSC submission queue is empty (no other jobs waiting)
> + * - the queue has enough credits for @job
> + * - @job has no unresolved dependency fences
> + *
> + * Must be called under @q->sched.lock.
> + *
> + * Context: Process context. Must hold @q->sched.lock (a mutex).
> + * Return: true if the job may be run inline, false otherwise.
> + */
> +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + lockdep_assert_held(&q->sched.lock);
> +
> + return q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED &&
> + !drm_dep_queue_is_stopped(q) &&
> + !spsc_queue_count(&q->job.queue) &&
> + drm_dep_queue_has_credits(q, job) &&
> + xa_empty(&job->dependencies);
> +}
> +
> +/**
> + * drm_dep_job_done() - mark a job as complete
> + * @job: the job that finished
> + * @result: error code to propagate, or 0 for success
> + *
> + * Subtracts @job->credits from the queue credit counter, then signals the
> + * job's dep fence with @result.
> + *
> + * When %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set (IRQ-safe path), a
> + * temporary extra reference is taken on @job before signalling the fence.
> + * This prevents a concurrent put-job worker — which may be woken by timeouts or
> + * queue starting — from freeing the job while this function still holds a
> + * pointer to it. The extra reference is released at the end of the function.
> + *
> + * After signalling, the IRQ-safe path removes the job from the pending list
> + * under @q->job.lock, provided the queue is not stopped. Removal is skipped
> + * when the queue is stopped so that drm_dep_queue_for_each_pending_job() can
> + * iterate the list without racing with the completion path. On successful
> + * removal, kicks the run-job worker so the next queued job can be dispatched
> + * immediately, then drops the job reference. If the job was already removed
> + * by TDR, or removal was skipped because the queue is stopped, kicks the
> + * put-job worker instead to allow the deferred put to complete.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_done(struct drm_dep_job *job, int result)
> +{
> + struct drm_dep_queue *q = job->q;
> + bool irq_safe = drm_dep_queue_is_job_put_irq_safe(q), removed = false;
> +
> + /*
> + * Local ref to ensure the put worker—which may be woken by external
> + * forces (TDR, driver-side queue starting)—doesn't free the job behind
> + * this function's back after drm_dep_fence_done() while it is still on
> + * the pending list.
> + */
> + if (irq_safe)
> + drm_dep_job_get(job);
> +
> + atomic_sub(job->credits, &q->credit.count);
> + drm_dep_fence_done(job->dfence, result);
> +
> + /* Only safe to touch job after fence signal if we have a local ref. */
> +
> + if (irq_safe) {
> + scoped_guard(spinlock_irqsave, &q->job.lock) {
> + removed = !list_empty(&job->pending_link) &&
> + !drm_dep_queue_is_stopped(q);
> +
> + /* Guard against TDR operating on job */
> + if (removed)
> + drm_dep_queue_remove_job(q, job);
> + }
> + }
> +
> + if (removed) {
> + drm_dep_queue_run_job_queue(q);
> + drm_dep_job_put(job);
> + } else {
> + drm_dep_queue_put_job_queue(q);
> + }
> +
> + if (irq_safe)
> + drm_dep_job_put(job);
> +}
> +
> +/**
> + * drm_dep_job_done_cb() - dma_fence callback to complete a job
> + * @f: the hardware fence that signalled
> + * @cb: fence callback embedded in the dep job
> + *
> + * Extracts the job from @cb and calls drm_dep_job_done() with
> + * @f->error as the result.
> + *
> + * Context: Any context, with interrupts disabled. May not sleep.
> + */
> +static void drm_dep_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct drm_dep_job *job = container_of(cb, struct drm_dep_job, cb);
> +
> + drm_dep_job_done(job, f->error);
> +}
> +
> +/**
> + * drm_dep_queue_run_job() - submit a job to hardware and set up
> + * completion tracking
> + * @q: dep queue
> + * @job: job to run
> + *
> + * Accounts @job->credits against the queue, appends the job to the pending
> + * list, then calls @q->ops->run_job(). The TDR timer is started only when
> + * @job is the first entry on the pending list; subsequent jobs added while
> + * a TDR is already in flight do not reset the timer (which would otherwise
> + * extend the deadline for the already-running head job). Stores the returned
> + * hardware fence as the parent of the job's dep fence, then installs
> + * drm_dep_job_done_cb() on it. If the hardware fence is already signalled
> + * (%-ENOENT from dma_fence_add_callback()) or run_job() returns NULL/error,
> + * the job is completed immediately. Must be called under @q->sched.lock.
> + *
> + * Context: Process context. Must hold @q->sched.lock (a mutex). DMA fence
> + * signaling path.
> + */
> +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + int r;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + drm_dep_job_get(job);
> + atomic_add(job->credits, &q->credit.count);
> +
> + scoped_guard(spinlock_irq, &q->job.lock) {
> + bool first = list_empty(&q->job.pending);
> +
> + list_add_tail(&job->pending_link, &q->job.pending);
> + if (first)
> + drm_queue_start_timeout(q);
> + }
> +
> + fence = q->ops->run_job(job);
> + drm_dep_fence_set_parent(job->dfence, fence);
> +
> + if (!IS_ERR_OR_NULL(fence)) {
> + r = dma_fence_add_callback(fence, &job->cb,
> + drm_dep_job_done_cb);
> + if (r == -ENOENT)
> + drm_dep_job_done(job, fence->error);
> + else if (r)
> + drm_err(q->drm, "fence add callback failed (%d)\n", r);
> + dma_fence_put(fence);
> + } else {
> + drm_dep_job_done(job, IS_ERR(fence) ? PTR_ERR(fence) : 0);
> + }
> +
> + /*
> + * Drop all input dependency fences now, in process context, before the
> + * final job put. Once the job is on the pending list its last reference
> + * may be dropped from a dma_fence callback (IRQ context), where calling
> + * xa_destroy() would be unsafe.
> + */
> + drm_dep_job_drop_dependencies(job);
> + drm_dep_job_put(job);
> +}
> +
> +/**
> + * drm_dep_queue_push_job() - enqueue a job on the SPSC submission queue
> + * @q: dep queue
> + * @job: job to push
> + *
> + * Pushes @job onto the SPSC queue. If the queue was previously empty
> + * (i.e. this is the first pending job), kicks the run_job worker so it
> + * processes the job promptly without waiting for the next wakeup.
> + * May be called with or without @q->sched.lock held.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> +{
> + /*
> + * spsc_queue_push() returns true if the queue was previously empty,
> + * i.e. this is the first pending job. Kick the run_job worker so it
> + * picks it up without waiting for the next wakeup.
> + */
> + if (spsc_queue_push(&q->job.queue, &job->queue_node))
> + drm_dep_queue_run_job_queue(q);
> +}
> +
> +/**
> + * drm_dep_init() - module initialiser
> + *
> + * Allocates the module-private dep_free_wq unbound workqueue used for
> + * deferred queue teardown.
> + *
> + * Return: 0 on success, %-ENOMEM if workqueue allocation fails.
> + */
> +static int __init drm_dep_init(void)
> +{
> + dep_free_wq = alloc_workqueue("drm_dep_free", WQ_UNBOUND, 0);
> + if (!dep_free_wq)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +/**
> + * drm_dep_exit() - module exit
> + *
> + * Destroys the module-private dep_free_wq workqueue.
> + */
> +static void __exit drm_dep_exit(void)
> +{
> + destroy_workqueue(dep_free_wq);
> + dep_free_wq = NULL;
> +}
> +
> +module_init(drm_dep_init);
> +module_exit(drm_dep_exit);
> +
> +MODULE_DESCRIPTION("DRM dependency queue");
> +MODULE_LICENSE("Dual MIT/GPL");
> diff --git a/drivers/gpu/drm/dep/drm_dep_queue.h b/drivers/gpu/drm/dep/drm_dep_queue.h
> new file mode 100644
> index 000000000000..e5c217a3fab5
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_queue.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_QUEUE_H_
> +#define _DRM_DEP_QUEUE_H_
> +
> +#include <linux/types.h>
> +
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +
> +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> + struct drm_dep_job *job);
> +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> +
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q);
> +void drm_dep_queue_push_job_end(struct drm_dep_queue *q);
> +#else
> +static inline void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> +{
> +}
> +static inline void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> +{
> +}
> +#endif
> +
> +#endif /* _DRM_DEP_QUEUE_H_ */
> diff --git a/include/drm/drm_dep.h b/include/drm/drm_dep.h
> new file mode 100644
> index 000000000000..615926584506
> --- /dev/null
> +++ b/include/drm/drm_dep.h
> @@ -0,0 +1,597 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_H_
> +#define _DRM_DEP_H_
> +
> +#include <drm/spsc_queue.h>
> +#include <linux/dma-fence.h>
> +#include <linux/xarray.h>
> +#include <linux/workqueue.h>
> +
> +enum dma_resv_usage;
> +struct dma_resv;
> +struct drm_dep_fence;
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +struct drm_file;
> +struct drm_gem_object;
> +
> +/**
> + * enum drm_dep_timedout_stat - return value of &drm_dep_queue_ops.timedout_job
> + * @DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED: driver signaled the job's finished
> + * fence during reset; drm_dep may safely drop its reference to the job.
> + * @DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB: timeout was a false alarm; reinsert the
> + * job at the head of the pending list so it can complete normally.
> + */
> +enum drm_dep_timedout_stat {
> + DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED,
> + DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB,
> +};
> +
> +/**
> + * struct drm_dep_queue_ops - driver callbacks for a dep queue
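> + *
> + * Example — a hypothetical minimal implementation (the driver functions
> + * are illustrative)::
> + *
> + *   static const struct drm_dep_queue_ops my_queue_ops = {
> + *           .run_job = my_run_job,           /* returns the HW fence */
> + *           .timedout_job = my_timedout_job, /* stops HW, reports status */
> + *   };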
> + */
> +struct drm_dep_queue_ops {
> + /**
> + * @run_job: submit the job to hardware. Returns the hardware completion
> + * fence (with a reference held for the scheduler), or NULL/ERR_PTR on
> + * synchronous completion or error.
> + */
> + struct dma_fence *(*run_job)(struct drm_dep_job *job);
> +
> + /**
> + * @timedout_job: called when the TDR fires for the head job. Must stop
> + * the hardware, then return %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
> + * job's fence was signalled during reset, or
> + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB if the timeout was spurious or
> + * signalling was otherwise delayed, and the job should be re-inserted
> + * at the head of the pending list. Any other value triggers a WARN.
> + */
> + enum drm_dep_timedout_stat (*timedout_job)(struct drm_dep_job *job);
> +
> + /**
> + * @release: called when the last kref on the queue is dropped and
> + * drm_dep_queue_fini() has completed. The driver is responsible for
> + * removing @q from any internal bookkeeping, calling
> + * drm_dep_queue_release(), and then freeing the memory containing @q
> + * (e.g. via kfree_rcu() using @q->rcu). If NULL, drm_dep calls
> + * drm_dep_queue_release() and frees @q automatically via kfree_rcu().
> + * Use this when the queue is embedded in a larger structure.
> + */
> + void (*release)(struct drm_dep_queue *q);
> +
> + /**
> + * @fini: if set, called instead of drm_dep_queue_fini() when the last
> + * kref is dropped. The driver is responsible for calling
> + * drm_dep_queue_fini() itself after it is done with the queue. Use this
> + * when additional teardown logic must run before fini (e.g., cleanup
> + * firmware resources associated with the queue).
> + */
> + void (*fini)(struct drm_dep_queue *q);
> +};
> +
> +/**
> + * enum drm_dep_queue_flags - flags for &drm_dep_queue and
> + * &drm_dep_queue_init_args
> + *
> + * Flags are divided into three categories:
> + *
> + * - **Private static**: set internally at init time and never changed.
> + * Drivers must not read or write these.
> + * %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ,
> + * %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ.
> + *
> + * - **Public dynamic**: toggled at runtime by drivers via accessors.
> + * Any modification must be performed under &drm_dep_queue.sched.lock.
> + * Accessor functions provide lockless reads, which may therefore be stale.
> + * %DRM_DEP_QUEUE_FLAGS_STOPPED,
> + * %DRM_DEP_QUEUE_FLAGS_KILLED.
> + *
> + * - **Public static**: supplied by the driver in
> + * &drm_dep_queue_init_args.flags at queue creation time and not modified
> + * thereafter.
> + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
> + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI,
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE.
> + *
> + * @DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ: (private, static) submit workqueue was
> + * allocated by drm_dep_queue_init() and will be destroyed by
> + * drm_dep_queue_fini().
> + * @DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ: (private, static) timeout workqueue
> + * was allocated by drm_dep_queue_init() and will be destroyed by
> + * drm_dep_queue_fini().
> + * @DRM_DEP_QUEUE_FLAGS_STOPPED: (public, dynamic) the queue is stopped and
> + * will not dispatch new jobs or remove jobs from the pending list, dropping
> + * the drm_dep-owned reference. Set by drm_dep_queue_stop(), cleared by
> + * drm_dep_queue_start().
> + * @DRM_DEP_QUEUE_FLAGS_KILLED: (public, dynamic) the queue has been killed
> + * via drm_dep_queue_kill(). Any active dependency wait is cancelled
> + * immediately. Jobs continue to flow through run_job for bookkeeping
> + * cleanup, but dependency waiting is skipped so that queued work drains
> + * as quickly as possible.
> + * @DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED: (public, static) the queue supports
> + * the bypass path where eligible jobs skip the SPSC queue and run inline.
> + * @DRM_DEP_QUEUE_FLAGS_HIGHPRI: (public, static) the submit workqueue owned
> + * by the queue is created with %WQ_HIGHPRI, causing run-job and put-job
> + * workers to execute at elevated priority. Only privileged clients (e.g.
> + * drivers managing time-critical or real-time GPU contexts) should request
> + * this flag; granting it to unprivileged userspace would allow priority
> + * inversion attacks.
> + * Ignored when &drm_dep_queue_init_args.submit_wq is provided.
> + * @DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE: (public, static) when set,
> + * drm_dep_job_done() may be called from hardirq context (e.g. from a
> + * hardware-signalled dma_fence callback). drm_dep_job_done() will directly
> + * dequeue the job and call drm_dep_job_put() without deferring to a
> + * workqueue. The driver's &drm_dep_job_ops.release callback must therefore
> + * be safe to invoke from IRQ context.
> + */
> +enum drm_dep_queue_flags {
> + DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ = BIT(0),
> + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ = BIT(1),
> + DRM_DEP_QUEUE_FLAGS_STOPPED = BIT(2),
> + DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED = BIT(3),
> + DRM_DEP_QUEUE_FLAGS_HIGHPRI = BIT(4),
> + DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE = BIT(5),
> + DRM_DEP_QUEUE_FLAGS_KILLED = BIT(6),
> +};
> +
> +/**
> + * struct drm_dep_queue - a dependency-tracked GPU submission queue
> + *
> + * Combines the role of &drm_gpu_scheduler and &drm_sched_entity into a single
> + * object. Each queue owns a submit workqueue (or borrows one), a timeout
> + * workqueue, an SPSC submission queue, and a pending-job list used for TDR.
> + *
> + * Initialise with drm_dep_queue_init(), tear down with drm_dep_queue_fini().
> + * Reference counted via drm_dep_queue_get() / drm_dep_queue_put().
> + *
> + * All fields are **opaque to drivers**. Do not read or write any field
> + * directly; use the provided helper functions instead. The sole exception
> + * is @rcu, which drivers may pass to kfree_rcu() when the queue is embedded
> + * inside a larger driver-managed structure and the &drm_dep_queue_ops.release
> + * vfunc performs an RCU-deferred free.
> + */
> +struct drm_dep_queue {
> + /** @ops: driver callbacks, set at init time. */
> + const struct drm_dep_queue_ops *ops;
> + /** @name: human-readable name used for workqueue and fence naming. */
> + const char *name;
> + /**
> + * @drm: owning DRM device; a drm_dev_get() reference is held for the
> + * lifetime of the queue to prevent module unload while queues are live.
> + */
> + struct drm_device *drm;
> + /** @refcount: reference count; use drm_dep_queue_get/put(). */
> + struct kref refcount;
> + /**
> + * @free_work: deferred teardown work queued unconditionally by
> + * drm_dep_queue_fini() onto the module-private dep_free_wq. The work
> + * item disables pending workers synchronously and destroys any owned
> + * workqueues before releasing the queue memory and dropping the
> + * drm_dev_get() reference. Running on dep_free_wq ensures
> + * destroy_workqueue() is never called from within one of the queue's
> + * own workers.
> + */
> + struct work_struct free_work;
> + /**
> + * @rcu: RCU head for deferred freeing.
> + *
> + * This is the **only** field drivers may access directly. When the
> + * queue is embedded in a larger structure, implement
> + * &drm_dep_queue_ops.release, call drm_dep_queue_release() to destroy
> + * internal resources, then pass this field to kfree_rcu() so that any
> + * in-flight RCU readers referencing the queue's dma_fence timeline name
> + * complete before the memory is returned. All other fields must be
> + * accessed through the provided helpers.
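> + *
> + * Example — a hypothetical &drm_dep_queue_ops.release vfunc for a queue
> + * embedded in a driver structure (names are illustrative)::
> + *
> + *   static void my_queue_release(struct drm_dep_queue *q)
> + *   {
> + *           struct my_queue *mq = container_of(q, struct my_queue, base);
> + *
> + *           drm_dep_queue_release(q);
> + *           kfree_rcu(mq, base.rcu);
> + *   }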
> + */
> + struct rcu_head rcu;
> +
> + /** @sched: scheduling and workqueue state. */
> + struct {
> + /** @sched.submit_wq: ordered workqueue for run/put-job work. */
> + struct workqueue_struct *submit_wq;
> + /** @sched.timeout_wq: workqueue for the TDR delayed work. */
> + struct workqueue_struct *timeout_wq;
> + /**
> + * @sched.run_job: work item that dispatches the next queued
> + * job.
> + */
> + struct work_struct run_job;
> + /** @sched.put_job: work item that frees finished jobs. */
> + struct work_struct put_job;
> + /** @sched.tdr: delayed work item for timeout/reset (TDR). */
> + struct delayed_work tdr;
> + /**
> + * @sched.lock: mutex serialising job dispatch, bypass
> + * decisions, stop/start, and flag updates.
> + */
> + struct mutex lock;
> + /**
> + * @sched.flags: bitmask of &enum drm_dep_queue_flags.
> + * Any modification after drm_dep_queue_init() must be
> + * performed under @sched.lock.
> + */
> + enum drm_dep_queue_flags flags;
> + } sched;
> +
> + /** @job: pending-job tracking state. */
> + struct {
> + /**
> + * @job.pending: list of jobs that have been dispatched to
> + * hardware and not yet freed. Protected by @job.lock.
> + */
> + struct list_head pending;
> + /**
> + * @job.queue: SPSC queue of jobs waiting to be dispatched.
> + * Producers push via drm_dep_queue_push_job(); the run_job
> + * work item pops from the consumer side.
> + */
> + struct spsc_queue queue;
> + /**
> + * @job.lock: spinlock protecting @job.pending, TDR start, and
> + * the %DRM_DEP_QUEUE_FLAGS_STOPPED flag. Always acquired with
> + * irqsave (spin_lock_irqsave / spin_unlock_irqrestore) to
> + * support %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE queues where
> + * drm_dep_job_done() may run from hardirq context.
> + */
> + spinlock_t lock;
> + /**
> + * @job.timeout: per-job TDR timeout in jiffies.
> + * %MAX_SCHEDULE_TIMEOUT means no timeout.
> + */
> + long timeout;
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> + /**
> + * @job.push: lockdep annotation tracking the arm-to-push
> + * critical section.
> + */
> + struct {
> + /**
> + * @job.push.owner: task that currently holds the push
> + * context, used to assert single-owner invariants.
> + * NULL when idle.
> + */
> + struct task_struct *owner;
> + } push;
> +#endif
> + } job;
> +
> + /** @credit: hardware credit accounting. */
> + struct {
> + /** @credit.limit: maximum credits the queue can hold. */
> + u32 limit;
> + /** @credit.count: credits currently in flight (atomic). */
> + atomic_t count;
> + } credit;
> +
> + /** @dep: current blocking dependency for the head SPSC job. */
> + struct {
> + /**
> + * @dep.fence: fence being waited on before the head job can
> + * run. NULL when no dependency is pending.
> + */
> + struct dma_fence *fence;
> + /**
> + * @dep.removed_fence: dependency fence whose callback has been
> + * removed. The run-job worker must drop its reference to this
> + * fence before proceeding to call run_job.
> + */
> + struct dma_fence *removed_fence;
> + /** @dep.cb: callback installed on @dep.fence. */
> + struct dma_fence_cb cb;
> + } dep;
> +
> + /** @fence: fence context and sequence number state. */
> + struct {
> + /**
> + * @fence.seqno: next sequence number to assign, incremented
> + * each time a job is armed.
> + */
> + u32 seqno;
> + /**
> + * @fence.context: base DMA fence context allocated at init
> + * time. Finished fences use this context.
> + */
> + u64 context;
> + } fence;
> +};
> +
> +/**
> + * struct drm_dep_queue_init_args - arguments for drm_dep_queue_init()
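> + *
> + * Example — a hypothetical driver init (values and ops are illustrative)::
> + *
> + *   struct drm_dep_queue_init_args args = {
> + *           .ops = &my_queue_ops,
> + *           .name = "my-queue",
> + *           .drm = drm,
> + *           .credit_limit = 32,
> + *           .timeout = msecs_to_jiffies(5000),
> + *   };
> + *   int err = drm_dep_queue_init(q, &args);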
> + */
> +struct drm_dep_queue_init_args {
> + /** @ops: driver callbacks; must not be NULL. */
> + const struct drm_dep_queue_ops *ops;
> + /** @name: human-readable name for workqueues and fence timelines. */
> + const char *name;
> + /**
> + * @drm: owning DRM device. A drm_dev_get() reference is taken at
> + * queue init and released when the queue is freed, preventing module
> + * unload while any queue is still alive.
> + */
> + struct drm_device *drm;
> + /**
> + * @submit_wq: workqueue for job dispatch. If NULL, an ordered
> + * workqueue is allocated and owned by the queue. If non-NULL, the
> + * workqueue must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> + * drm_dep_queue_init() returns %-EINVAL otherwise.
> + */
> + struct workqueue_struct *submit_wq;
> + /**
> + * @timeout_wq: workqueue for TDR. If NULL, an ordered workqueue
> + * is allocated and owned by the queue. If non-NULL, the workqueue
> + * must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> + * drm_dep_queue_init() returns %-EINVAL otherwise.
> + */
> + struct workqueue_struct *timeout_wq;
> + /** @credit_limit: maximum hardware credits; must be non-zero. */
> + u32 credit_limit;
> + /**
> + * @timeout: per-job TDR timeout in jiffies. Zero means no timeout
> + * (%MAX_SCHEDULE_TIMEOUT is used internally).
> + */
> + long timeout;
> + /**
> + * @flags: initial queue flags. %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ
> + * and %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ are managed internally
> + * and will be ignored if set here. Setting
> + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI requests a high-priority submit
> + * workqueue; drivers must only set this for privileged clients.
> + */
> + enum drm_dep_queue_flags flags;
> +};
> +
> +/**
> + * struct drm_dep_job_ops - driver callbacks for a dep job
> + */
> +struct drm_dep_job_ops {
> + /**
> + * @release: called when the last reference to the job is dropped.
> + *
> + * If set, the driver is responsible for freeing the job. If NULL,
> + * drm_dep_job_put() will call kfree() on the job directly.
> + */
> + void (*release)(struct drm_dep_job *job);
> +};
> +
> +/**
> + * struct drm_dep_job - a unit of work submitted to a dep queue
> + *
> + * All fields are **opaque to drivers**. Do not read or write any field
> + * directly; use the provided helper functions instead.
> + */
> +struct drm_dep_job {
> + /** @ops: driver callbacks for this job. */
> + const struct drm_dep_job_ops *ops;
> + /** @refcount: reference count, managed by drm_dep_job_get/put(). */
> + struct kref refcount;
> + /**
> + * @dependencies: xarray of &dma_fence dependencies before the job can
> + * run.
> + */
> + struct xarray dependencies;
> + /** @q: the queue this job is submitted to. */
> + struct drm_dep_queue *q;
> + /** @queue_node: SPSC queue linkage for pending submission. */
> + struct spsc_node queue_node;
> + /**
> + * @pending_link: list entry in the queue's pending job list. Protected
> + * by @job.q->job.lock.
> + */
> + struct list_head pending_link;
> + /** @dfence: finished fence for this job. */
> + struct drm_dep_fence *dfence;
> + /** @cb: fence callback used to watch for dependency completion. */
> + struct dma_fence_cb cb;
> + /** @credits: number of credits this job consumes from the queue. */
> + u32 credits;
> + /**
> + * @last_dependency: index into @dependencies of the next fence to
> + * check. Advanced by drm_dep_queue_job_dependency() as each
> + * dependency is consumed.
> + */
> + u32 last_dependency;
> + /**
> + * @invalidate_count: number of times this job has been invalidated.
> + * Incremented by drm_dep_job_invalidate_job().
> + */
> + u32 invalidate_count;
> + /**
> + * @signalling_cookie: return value of dma_fence_begin_signalling()
> + * captured in drm_dep_job_arm() and consumed by drm_dep_job_push().
> + * Not valid outside the arm→push window.
> + */
> + bool signalling_cookie;
> +};
> +
> +/**
> + * struct drm_dep_job_init_args - arguments for drm_dep_job_init()
> + */
> +struct drm_dep_job_init_args {
> + /**
> + * @ops: driver callbacks for the job, or NULL for default behaviour.
> + */
> + const struct drm_dep_job_ops *ops;
> + /** @q: the queue to associate the job with. A reference is taken. */
> + struct drm_dep_queue *q;
> + /** @credits: number of credits this job consumes; must be non-zero. */
> + u32 credits;
> +};
> +
> +/* Queue API */
> +
> +/**
> + * drm_dep_queue_sched_guard() - acquire the queue scheduler lock as a guard
> + * @__q: dep queue whose scheduler lock to acquire
> + *
> + * Acquires @__q->sched.lock as a scoped mutex guard (released automatically
> + * when the enclosing scope exits). This lock serialises all scheduler state
> + * transitions — stop/start/kill flag changes, bypass-path decisions, and the
> + * run-job worker — so it must be held when the driver needs to atomically
> + * inspect or modify queue state in relation to job submission.
> + *
> + * **When to use**
> + *
> + * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
> + * serialise their own submit work against the bypass path must acquire this
> + * guard. Without it, a concurrent caller of drm_dep_job_push() could take
> + * the bypass path and call ops->run_job() inline between the driver's
> + * eligibility check and its corresponding action, producing a race.
> + *
> + * **Constraint: only from submit_wq worker context**
> + *
> + * Drivers must only acquire this guard from a work item running on the
> + * queue's submit workqueue (@q->sched.submit_wq).
> + *
> + * Context: Process context only; must be called from submit_wq work.
> + */
> +#define drm_dep_queue_sched_guard(__q) \
> + guard(mutex)(&(__q)->sched.lock)
> +
> +int drm_dep_queue_init(struct drm_dep_queue *q,
> + const struct drm_dep_queue_init_args *args);
> +void drm_dep_queue_fini(struct drm_dep_queue *q);
> +void drm_dep_queue_release(struct drm_dep_queue *q);
> +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q);
> +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q);
> +void drm_dep_queue_put(struct drm_dep_queue *q);
> +void drm_dep_queue_stop(struct drm_dep_queue *q);
> +void drm_dep_queue_start(struct drm_dep_queue *q);
> +void drm_dep_queue_kill(struct drm_dep_queue *q);
> +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q);
> +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q);
> +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q);
> +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> + struct work_struct *work);
> +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q);
> +bool drm_dep_queue_is_killed(struct drm_dep_queue *q);
> +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q);
> +void drm_dep_queue_set_stopped(struct drm_dep_queue *q);
> +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q);
> +long drm_dep_queue_timeout(const struct drm_dep_queue *q);
> +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q);
> +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q);
> +
> +/* Job API */
> +
> +/**
> + * DRM_DEP_JOB_FENCE_PREALLOC - sentinel value for pre-allocating a dependency slot
> + *
> + * Pass this to drm_dep_job_add_dependency() instead of a real fence to
> + * pre-allocate a slot in the job's dependency xarray during the preparation
> + * phase (where GFP_KERNEL is available). The returned xarray index identifies
> + * the slot. Call drm_dep_job_replace_dependency() later — inside a
> + * dma_fence_begin_signalling() region if needed — to swap in the real fence
> + * without further allocation.
> + *
> + * This sentinel is never treated as a dma_fence; it carries no reference count
> + * and must not be passed to dma_fence_put(). It is only valid as an argument
> + * to drm_dep_job_add_dependency() and as the expected stored value checked by
> + * drm_dep_job_replace_dependency().
> + */
> +#define DRM_DEP_JOB_FENCE_PREALLOC ((struct dma_fence *)-1)
> +
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args);
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job);
> +void drm_dep_job_put(struct drm_dep_job *job);
> +void drm_dep_job_arm(struct drm_dep_job *job);
> +void drm_dep_job_push(struct drm_dep_job *job);
> +int drm_dep_job_add_dependency(struct drm_dep_job *job,
> + struct dma_fence *fence);
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence);
> +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> + struct drm_file *file, u32 handle,
> + u32 point);
> +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> + struct dma_resv *resv,
> + enum dma_resv_usage usage);
> +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> + struct drm_gem_object *obj,
> + bool write);
> +bool drm_dep_job_is_signaled(struct drm_dep_job *job);
> +bool drm_dep_job_is_finished(struct drm_dep_job *job);
> +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold);
> +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job);
> +
> +/**
> + * struct drm_dep_queue_pending_job_iter - iterator state for
> + * drm_dep_queue_for_each_pending_job()
> + * @q: queue being iterated
> + */
> +struct drm_dep_queue_pending_job_iter {
> + struct drm_dep_queue *q;
> +};
> +
> +/* Drivers should never call this directly */
> +static inline struct drm_dep_queue_pending_job_iter
> +__drm_dep_queue_pending_job_iter_begin(struct drm_dep_queue *q)
> +{
> + struct drm_dep_queue_pending_job_iter iter = {
> + .q = q,
> + };
> +
> + WARN_ON(!drm_dep_queue_is_stopped(q));
> + return iter;
> +}
> +
> +/* Drivers should never call this directly */
> +static inline void
> +__drm_dep_queue_pending_job_iter_end(struct drm_dep_queue_pending_job_iter iter)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(iter.q));
> +}
> +
> +/* clang-format off */
> +DEFINE_CLASS(drm_dep_queue_pending_job_iter,
> + struct drm_dep_queue_pending_job_iter,
> + __drm_dep_queue_pending_job_iter_end(_T),
> + __drm_dep_queue_pending_job_iter_begin(__q),
> + struct drm_dep_queue *__q);
> +/* clang-format on */
> +static inline void *
> +class_drm_dep_queue_pending_job_iter_lock_ptr(
> + class_drm_dep_queue_pending_job_iter_t *_T)
> +{ return _T; }
> +#define class_drm_dep_queue_pending_job_iter_is_conditional false
> +
> +/**
> + * drm_dep_queue_for_each_pending_job() - iterate over all pending jobs
> + * in a queue
> + * @__job: loop cursor, a &struct drm_dep_job pointer
> + * @__q: &struct drm_dep_queue to iterate
> + *
> + * Iterates over every job currently on @__q->job.pending. The queue must be
> + * stopped (drm_dep_queue_stop() called) before using this iterator; a WARN_ON
> + * fires at the start and end of the scope if it is not.
> + *
> + * Context: Any context.
> + */
> +#define drm_dep_queue_for_each_pending_job(__job, __q) \
> + scoped_guard(drm_dep_queue_pending_job_iter, (__q)) \
> + list_for_each_entry((__job), &(__q)->job.pending, pending_link)
> +
> +#endif
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 5:22 ` Matthew Brost
@ 2026-03-17 8:48 ` Boris Brezillon
0 siblings, 0 replies; 50+ messages in thread
From: Boris Brezillon @ 2026-03-17 8:48 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
Hi Matthew,
On Mon, 16 Mar 2026 22:22:30 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> On Mon, Mar 16, 2026 at 10:16:01AM +0100, Boris Brezillon wrote:
> > Hi Matthew,
> >
> > On Sun, 15 Mar 2026 21:32:45 -0700
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > > Diverging requirements between GPU drivers using firmware scheduling
> > > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > > rules, and numerous driver-specific hacks have rendered
> > > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > > firmware-scheduled GPU drivers—one that addresses all of the
> > > aforementioned shortcomings.
> > >
> > > Add drm_dep, a lightweight GPU submission queue intended as a
> > > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > > from the queue (drm_sched_entity) into two objects requiring external
> > > coordination, drm_dep merges both roles into a single struct
> > > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > > that is unnecessary for firmware schedulers which manage their own
> > > run-lists internally.
> > >
> > > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > > management by the driver, drm_dep uses reference counting (kref) on both
> > > queues and jobs to guarantee object lifetime safety. A job holds a queue
> > > reference from init until its last put, and the queue holds a job reference
> > > from dispatch until the put_job worker runs. This makes use-after-free
> > > impossible even when completion arrives from IRQ context or concurrent
> > > teardown is in flight.
> > >
> > > The core objects are:
> > >
> > > struct drm_dep_queue - a per-context submission queue owning an
> > > ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> > > queue, and a pending-job list. Reference counted; drivers can embed
> > > it and provide a .release vfunc for RCU-safe teardown.
> >
> > First off, I like this idea, and I actually think we should have done that
> > from the start rather than trying to bend drm_sched to meet our
>
> Yes. Tvrtko actually suggested this years ago, and in my naïveté I
> rejected it. I’m eating my hat here.
>
> > FW-assisted scheduling model. That's also the direction me and Danilo
> > have been pushing for for the new JobQueue stuff in rust, so I'm glad
> > to see some consensus here.
> >
> > Now, let's start with the usual naming nitpick :D => can't we find a
> > better prefix than "drm_dep"? I think I get where "dep" comes from (the
> > logic mostly takes care of job deps, and acts as a FIFO otherwise, no
> > real scheduling involved). It's kinda okay for drm_dep_queue, even
> > though, according to the description you've made, jobs seem to stay in
> > that queue even after their deps are met, which, IMHO, is a bit
> > confusing: dep_queue sounds like a queue in which jobs are placed until
> > their deps are met, and then the job moves to some other queue.
> >
> > It gets worse for drm_dep_job, which sounds like a dep-only job, rather
> > than a job that's queued to the drm_dep_queue. Same goes for
> > drm_dep_fence, which I find super confusing. What this one does is just
> > proxy the driver fence to provide proper isolation between GPU drivers
> > and fence observers (other drivers).
> >
> > Since this new model is primarily designed for hardware that has
> > FW-assisted scheduling, how about drm_fw_queue, drm_fw_job,
> > drm_fw_job_fence?
>
> We can bikeshed — I’m open to other names, but I believe hardware
> scheduling can be built quite cleanly on top of this, so drm_fw_*
> doesn’t really work either.
I agree drm_fw_ is not great either. I think Philipp's name, JobQueue,
is a better fit.
> Check out a hardware-scheduler PoC built
> (today) on top of this in [1].
Yeah, I'm pretty sure we can add a layer on top to deal with HW GPU
queues (Danilo and Philipp proposed it already), I'm just not sure it
should be our primary focus, at least not until we have something
usable for the drivers that are currently written in rust. Doesn't mean
we shouldn't think about it and design the thing so it can be added
later, of course, but most of the time, perfect is the enemy of good.
>
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966
>
> >
> > >
> > > struct drm_dep_job - a single unit of GPU work. Drivers embed this
> > > and provide a .release vfunc. Jobs carry an xarray of input
> > > dma_fence dependencies and produce a drm_dep_fence as their
> > > finished fence.
> > >
> > > struct drm_dep_fence - a dma_fence subclass wrapping an optional
> > > parent hardware fence. The finished fence is armed (sequence
> > > number assigned) before submission and signals when the hardware
> > > fence signals (or immediately on synchronous completion).
> > >
> > > Job lifecycle:
> > > 1. drm_dep_job_init() - allocate and initialise; job acquires a
> > > queue reference.
> > > 2. drm_dep_job_add_dependency() and friends - register input fences;
> > > duplicates from the same context are deduplicated.
> > > 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> > > 4. drm_dep_job_push() - submit to queue.
> > >
> > > Submission paths under queue lock:
> > > - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> > > SPSC queue is empty, no dependencies are pending, and credits are
> > > available, the job is dispatched inline on the calling thread.
> >
> > I've yet to look at the code, but I must admit I'm less worried about
> > this fast path if it's part of a new model restricted to FW-assisted
> > scheduling. I keep thinking we're not entirely covered for so called
> > real-time GPU contexts that might have jobs that are not dep-free, and
> > if we're going for something new, I'd really like us to consider that
> > case from the start (maybe investigate if kthread_work[er] can be used
> > as a replacement for workqueues, if RT priority on workqueues is not an
> > option).
> >
>
> I mostly agree, and I’ll look into whether kthread_work is better
> suited—if that’s the right model, it should be done up front.
>
> But can you give a use case for real-time GPU contexts that are not
> dep-free? I personally don’t know of one.
Let alone real-time GPU contexts, you still have the concept of context
priority in both GL and Vulkan, on which jobs with deps are likely to
be queued, and if the dequeuing thread doesn't take this priority into
consideration, there's nothing differentiating a low-prio from a
high-prio context, thus partially defeating the priority you assign to
your FW queue (you can execute fast, but if the queuing to the FW queue
is slow, it's pointless).
As for real-time contexts, yes, right now the only use case I can think
of is compositors, and apparently those would pass dep-less submissions,
but claiming that this is the only use case we'll ever have for those RT
contexts is a bit of a risk I'd rather not take. Also, if we solve the
problem with kthreads with prios, it might be that we don't even need
this fastpath in the first place.
>
> > > - Queued path: job is pushed onto the SPSC queue and the run_job
> > > worker is kicked. The worker resolves remaining dependencies
> > > (installing wakeup callbacks for unresolved fences) before calling
> > > ops->run_job().
> > >
> > > Credit-based throttling prevents hardware overflow: each job declares
> > > a credit cost at init time; dispatch is deferred until sufficient
> > > credits are available.
> > >
> > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > expiry for device teardown.
> > >
> > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > dma_fence callback). Dependency cleanup is deferred to process context
> > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > >
> > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > bail immediately if the queue refcount has already reached zero and
> > > async teardown is in flight, preventing use-after-free.
> > >
> > > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > > so that destroy_workqueue() is never called from within one of the
> > > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > > owning struct drm_device, released as the final step of teardown via
> > > drm_dev_put(). This prevents the driver module from being unloaded
> > > while any queue is still alive without requiring a separate drain API.
> >
> > Thanks for posting this RFC. I'll try to have a closer look at the code
> > in the coming days, but given the diffstat, it might take me a bit of
> > time...
>
>
> I understand — I’m a firehose when I get started. Hopefully a sane one,
> though.
One last note: I've seen a bunch of discussions that start to look like
some pissing contest between C and rust advocates. I've only started to
look at rust recently, in the context of Tyr (the rust re-implementation
of Panthor), and there's niceties in the language that I think make it
a good fit for the new JobQueue thing we've been discussing (Daniel
listed a bunch of them). I don't really mind who's going to end up
winning the trophy of this contest to be honest, but what I certainly
don't want is for us to be stuck with no solution at all because each
camp claims that their solution is best and none of them give up on
their ideas (basically what happened with the AGX driver). So please,
please, let's all take a step back, have a look at what each side has
done, keep the discussion constructive, and hopefully we'll end up with
a mix of ideas that makes the final solution better.
Regards,
Boris
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 8:26 ` Matthew Brost
@ 2026-03-17 12:04 ` Daniel Almeida
2026-03-17 19:41 ` Miguel Ojeda
1 sibling, 0 replies; 50+ messages in thread
From: Daniel Almeida @ 2026-03-17 12:04 UTC (permalink / raw)
To: Matthew Brost
Cc: Miguel Ojeda, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux
Matthew,
> I get it — you’re a Rust zealot. You can do this in C and enforce the
> rules quite well.
>
> RAII cannot describe ownership transfers of refs, nor can it express who
> owns what in multi-threaded components, as far as I know. Ref-tracking
> and ownership need to be explicit.
>
> I’m not going to reply to Rust vs C comments in this thread. If you want
> to talk about ownership, lifetimes, dma-fence enforcement, and teardown
> guarantees, sure.
>
> If you want to build on top of a component that’s been tested on a
> production driver, great — please join in. If you want to figure out all
> the pitfalls yourself, well… have fun.
>
> Matt
>
It is not about being a Rust zealot. I pointed out that your code has issues.
Every time you access the queue you have to use a special function because the
queue might be gone; how is this not a problem?
+ * However, there is a secondary hazard: a worker can be queued while the
+ * queue is in a "zombie" state — refcount has already reached zero and async
+ * teardown is in flight, but the work item has not yet been disabled by
+ * free_work. To guard against this every worker uses
+ * drm_dep_queue_get_unless_zero() at entry; if the refcount is already zero
+ * the worker bails immediately without touching the queue state.
At various points you document requirements that are simply comments. Resource
management is scattered all over the place, and it’s sometimes even shared
with drivers, whom you have no control over.
+ * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
+ * serialise their own submit work against the bypass path must acquire this
+ * guard. Without it, a concurrent caller of drm_dep_job_push() could take
+ * the bypass path and call ops->run_job() inline between the driver's
+ * eligibility check and its corresponding action, producing a race.
How is this not a problem? Again, you’re not in control of driver code.
+ * If set, the driver is responsible for freeing the job. If NULL,
Same here.
Even if we take Rust out of the equation, how do you plan to solve these things? Or
do you consider them solved as is?
I worry that we will find ourselves again at XDC in yet another scheduler
workshop to address the issues that will invariably come up with your new
design in a few years.
> If you want to build on top of a component that’s been tested on a
> production driver, great — please join in. If you want to figure out all
> the pitfalls yourself, well… have fun.
Note that I didn’t show up with a low-effort “hey, how about we rewrite
this in Rust?”. Instead, I linked to an actual Rust implementation that I
spent weeks painstakingly debugging, not to mention the time it took to write
it. Again, I suggest that you guys have a look, like I did with your code. You
might find things you end up liking there.
— Daniel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 5:10 ` Matthew Brost
@ 2026-03-17 12:19 ` Danilo Krummrich
2026-03-18 23:02 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Danilo Krummrich @ 2026-03-17 12:19 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Miguel Ojeda
On Tue Mar 17, 2026 at 6:10 AM CET, Matthew Brost wrote:
> On Mon, Mar 16, 2026 at 11:25:23AM +0100, Danilo Krummrich wrote:
>> The reason I proposed a new component for Rust, is basically what you also wrote
>> in your cover letter, plus the fact that it prevents us having to build a Rust
>> abstraction layer to the DRM GPU scheduler.
>>
>> The latter I identified as pretty questionable as building another abstraction
>> layer on top of some infrastructure is really something that you only want to do
>> when it is mature enough in terms of lifetime and ownership model.
>>
>
> I personally don’t think the language matters that much. I care about
> lifetime, ownership, and teardown semantics. I believe I’ve made this
> clear in C, so the Rust bindings should be trivial.
No, they won't be trivial -- in fact, in the case of the Jobqueue they may even
end up being more complicated than a native implementation.
We still want to build the object model around it that allows us to catch most
of the pitfalls at compile time rather than runtime.
For instance, there has been a proposal of having specific work and workqueue
types that ensure not to violate DMA fence rules, which the Jobqueue can adopt.
We can also use Klint to ensure correctness for those types at compile time.
So, I can easily see this becoming more complicated when we have to go through
an FFI layer that makes us lose additional type information / guarantees.
Anyways, I don't want to argue about this. I don't know why the whole thread
took this direction.
This is not about C vs. Rust, and I see the Rust component to be added
regardless of this effort.
The question for me is whether we want to have a second component besides the
GPU scheduler on the C side or not.
If we can replace the existing scheduler entirely and rework all the drivers
that'd be great and you absolutely have my blessing.
But, I don't want to end up in a situation where this is landed, one or two
drivers are converted, and everything else is left behind in terms of
maintenance / maintainer commitment.
>> My point is, the justification for a new Jobqueue component in Rust I consider
>> given by the fact that it allows us to avoid building another abstraction layer
>> on top of DRM sched. Additionally, DRM moves to Rust and gathering experience
>> with building native Rust components seems like a good synergy in this context.
>>
>
> If I knew Rust off-hand, I would have written it in Rust :). Perhaps
> this is an opportunity to learn. But I think the Rust vs. C holy war
> isn’t in scope here. The real questions are what semantics we want, the
> timeline, and maintainability. Certainly more people know C, and most
> drivers are written in C, so having the common component in C makes more
> sense at this point, in my opinion. If the objection is really about the
> language, I’ll rewrite it in Rust.
Again, I'm not talking about Rust vs. C. I'm talking about why a new Rust
component is much easier to justify maintenance-wise than a new C component is.
That is, the existing infrastructure has problems we don't want to build on top
of and the abstraction ends up being of a similar magnitude as a native
implementation.
A new C implementation alongside the existing one is a whole different question.
>> Having that said, the obvious question for me for this series is how drm_dep
>> fits into the bigger picture.
>>
>> I.e. what is the maintenance strategy?
>>
>
> I will commit to maintaining code I believe in, and immediately write
> the bindings on top of this so they’re maintained from day one.
This I am sure about, but what about the existing scheduler infrastructure? Are
you going to keep this up as well?
Who keeps supporting it for all the drivers that can't switch (due to not having
firmware queues) or simply did not switch yet?
>> Do we want to support three components allowing users to do the same thing? What
>> happens to DRM sched for 1:1 entity / scheduler relationships?
>>
>> Is it worth it? Do we have enough C users to justify the maintenance of yet
>> another component? (Again, DRM moves into the direction of Rust drivers, so I
>> don't know how many new C drivers we will see.) I.e. having this component won't
>> get us rid of the majority of DRM sched users.
>>
>
> Actually, with [1], I’m fairly certain that pretty much every driver
> could convert to this new code. Part of the problem, though, is that
> when looking at this, multiple drivers clearly break dma-fencing rules,
> so an annotated component like DRM dep would explode their drivers. Not
> to mention the many driver-side hacks that each individual driver would
> need to drop (e.g., I would not be receptive to any driver directly
> touching drm_dep object structs).
I thought the API can't be abused? :) How would you prevent drivers from doing
this in practice? They need to have the struct definition, and once they have
it, you can't do anything about them peeking at internals, if not caught
through review.
> Maintainable, as I understand every single LOC, with verbose documentation
> (generated with Copilot, but I’ve reviewed it multiple times and it’s
> correct), etc.
I'm not sure this is the core criteria for evaluating whether something is
maintainable or not.
To be honest, this does not sound very community focused.
> Regardless, given all of the above, at a minimum my driver needs to move
> on one way or another.
Your driver? What do you mean with "it has to move"?
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 2:47 ` Daniel Almeida
2026-03-17 5:45 ` Matthew Brost
@ 2026-03-17 12:31 ` Danilo Krummrich
2026-03-17 14:25 ` Daniel Almeida
1 sibling, 1 reply; 50+ messages in thread
From: Danilo Krummrich @ 2026-03-17 12:31 UTC (permalink / raw)
To: Daniel Almeida
Cc: Matthew Brost, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, David Airlie, Maarten Lankhorst,
Maxime Ripard, Philipp Stanner, Simona Vetter, Sumit Semwal,
Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux, Miguel Ojeda
On Tue Mar 17, 2026 at 3:47 AM CET, Daniel Almeida wrote:
> I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
> is going, it is much more ergonomic to add a Rust component with a nice C
> interface than doing it the other way around.
This is not exactly what I said. I was talking about the maintenance aspects
and that a Rust Jobqueue implementation (for the reasons explained in my initial
reply) is easily justifiable in this aspect, whereas another C implementation,
that does *not* replace the existing DRM scheduler entirely, is much harder to
justify from a maintenance perspective.
I'm also not sure whether a C interface from the Rust side is easy to establish.
We don't want to limit ourselves in terms of language capabilities for this and
passing through all the additional information Rust carries in the type system
might not be straight forward.
It would be an experiment, and it was one of the ideas behind the Rust Jobqueue
to see how it turns out if we try. Always with the fallback of having C
infrastructure as an alternative when it doesn't work out well.
Having this said, I don't see an issue with the drm_dep thing going forward if
there is a path to replacing DRM sched entirely.
The Rust component should remain independent from this for the reasons mentioned
in [1].
[1] https://lore.kernel.org/dri-devel/DH51W6XRQXYX.3M30IRYIWZLFG@kernel.org/
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 12:31 ` Danilo Krummrich
@ 2026-03-17 14:25 ` Daniel Almeida
2026-03-17 14:33 ` Danilo Krummrich
0 siblings, 1 reply; 50+ messages in thread
From: Daniel Almeida @ 2026-03-17 14:25 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Matthew Brost, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, David Airlie, Maarten Lankhorst,
Maxime Ripard, Philipp Stanner, Simona Vetter, Sumit Semwal,
Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux, Miguel Ojeda
> On 17 Mar 2026, at 09:31, Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Tue Mar 17, 2026 at 3:47 AM CET, Daniel Almeida wrote:
>> I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
>> is going, it is much more ergonomic to add a Rust component with a nice C
>> interface than doing it the other way around.
>
> This is not exactly what I said. I was talking about the maintenance aspects
> and that a Rust Jobqueue implementation (for the reasons explained in my initial
> reply) is easily justifiable in this aspect, whereas another C implementation,
> that does *not* replace the existing DRM scheduler entirely, is much harder to
> justify from a maintenance perspective.
Ok, I misunderstood your point a bit.
>
> I'm also not sure whether a C interface from the Rust side is easy to establish.
> We don't want to limit ourselves in terms of language capabilities for this and
> passing through all the additional information Rust carries in the type system
> might not be straight forward.
>
> It would be an experiment, and it was one of the ideas behind the Rust Jobqueue
> to see how it turns out if we try. Always with the fallback of having C
> infrastructure as an alternative when it doesn't work out well.
From previous experience in doing Rust to C FFI in NVK, I don’t see, at
first, why this can’t work. But I agree with you, there may very well be
unanticipated things here and this part is indeed an experiment. No argument
from me here.
>
> Having this said, I don't see an issue with the drm_dep thing going forward if
> there is a path to replacing DRM sched entirely.
The issues I pointed out remain. Even if the plan is to have drm_dep + JobQueue
(and no drm_sched). I feel that my point of considering doing it in Rust remains.
>
> The Rust component should remain independent from this for the reasons mentioned
> in [1].
>
> [1] https://lore.kernel.org/dri-devel/DH51W6XRQXYX.3M30IRYIWZLFG@kernel.org/
Ok
— Daniel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 14:25 ` Daniel Almeida
@ 2026-03-17 14:33 ` Danilo Krummrich
2026-03-18 22:50 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Danilo Krummrich @ 2026-03-17 14:33 UTC (permalink / raw)
To: Daniel Almeida
Cc: Matthew Brost, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, David Airlie, Maarten Lankhorst,
Maxime Ripard, Philipp Stanner, Simona Vetter, Sumit Semwal,
Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux, Miguel Ojeda
On Tue Mar 17, 2026 at 3:25 PM CET, Daniel Almeida wrote:
>
>
>> On 17 Mar 2026, at 09:31, Danilo Krummrich <dakr@kernel.org> wrote:
>>
>> On Tue Mar 17, 2026 at 3:47 AM CET, Daniel Almeida wrote:
>>> I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
>>> is going, it is much more ergonomic to add a Rust component with a nice C
>>> interface than doing it the other way around.
>>
>> This is not exactly what I said. I was talking about the maintenance aspects
>> and that a Rust Jobqueue implementation (for the reasons explained in my initial
>> reply) is easily justifiable in this aspect, whereas another C implementation,
>> that does *not* replace the existing DRM scheduler entirely, is much harder to
>> justify from a maintenance perspective.
>
> Ok, I misunderstood your point a bit.
>
>>
>> I'm also not sure whether a C interface from the Rust side is easy to establish.
>> We don't want to limit ourselves in terms of language capabilities for this and
>> passing through all the additional information Rust carries in the type system
>> might not be straightforward.
>>
>> It would be an experiment, and it was one of the ideas behind the Rust Jobqueue
>> to see how it turns out if we try. Always with the fallback of having C
>> infrastructure as an alternative when it doesn't work out well.
>
> From previous experience in doing Rust to C FFI in NVK, I don’t see, at
> first, why this can’t work. But I agree with you, there may very well be
> unanticipated things here and this part is indeed an experiment. No argument
> from me here.
>
>>
>> Having this said, I don't see an issue with the drm_dep thing going forward if
>> there is a path to replacing DRM sched entirely.
>
> The issues I pointed out remain even if the plan is to have drm_dep + JobQueue
> (and no drm_sched). I feel that my point about considering doing it in Rust remains.
I mean, as mentioned below, we should have a Rust Jobqueue as an independent
component. Or are you saying you'd consider having only a Rust component with a
C API eventually? If so, that'd be way too early to consider for various
reasons.
>> The Rust component should remain independent from this for the reasons mentioned
>> in [1].
>>
>> [1] https://lore.kernel.org/dri-devel/DH51W6XRQXYX.3M30IRYIWZLFG@kernel.org/
>
> Ok
>
> — Daniel
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
` (3 preceding siblings ...)
2026-03-17 8:47 ` Christian König
@ 2026-03-17 14:55 ` Boris Brezillon
2026-03-18 23:28 ` Matthew Brost
2026-03-17 16:30 ` Shashank Sharma
5 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-17 14:55 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Sun, 15 Mar 2026 21:32:45 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> +/**
> + * struct drm_dep_fence - fence tracking the completion of a dep job
> + *
> + * Contains a single dma_fence (@finished) that is signalled when the
> + * hardware completes the job. The fence uses the kernel's inline_lock
> + * (no external spinlock required).
> + *
> + * This struct is private to the drm_dep module; external code interacts
> + * through the accessor functions declared in drm_dep_fence.h.
> + */
> +struct drm_dep_fence {
> + /**
> + * @finished: signalled when the job completes on hardware.
> + *
> + * Drivers should use this fence as the out-fence for a job since it
> + * is available immediately upon drm_dep_job_arm().
> + */
> + struct dma_fence finished;
> +
> + /**
> + * @deadline: deadline set on @finished which potentially needs to be
> + * propagated to @parent.
> + */
> + ktime_t deadline;
> +
> + /**
> + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> + *
> + * @finished is signaled once @parent is signaled. The initial store is
> + * performed via smp_store_release to synchronize with deadline handling.
> + *
> + * All readers must access this under the fence lock and take a reference to
> + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> + * signals, and this drop also releases its internal reference.
> + */
> + struct dma_fence *parent;
> +
> + /**
> + * @q: the queue this fence belongs to.
> + */
> + struct drm_dep_queue *q;
> +};
As Daniel pointed out already, with Christian's recent changes to
dma_fence (the ones that reset dma_fence::ops after ::signal()), the
fence proxy that existed in drm_sched_fence is no longer required:
drivers and their implementations can safely vanish even if some fences
they have emitted are still referenced by other subsystems. All we need
is:
- fence must be signaled for dma_fence::ops to be set back to NULL
- no .cleanup and no .wait implementation
There might be an interest in having HW submission fences reflecting
when the job is passed to the FW/HW queue, but that can be done as a
separate fence implementation using a different fence timeline/context.
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> new file mode 100644
> index 000000000000..2d012b29a5fc
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency job
> + *
> + * A struct drm_dep_job represents a single unit of GPU work associated with
> + * a struct drm_dep_queue. The lifecycle of a job is:
> + *
> + * 1. **Allocation**: the driver allocates memory for the job (typically by
> + * embedding struct drm_dep_job in a larger structure) and calls
> + * drm_dep_job_init() to initialise it. On success the job holds one
> + * kref reference and a reference to its queue.
> + *
> + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> + * that must be signalled before the job can run. Duplicate fences from the
> + * same fence context are deduplicated automatically.
> + *
> + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> + * consuming a sequence number from the queue. After arming,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> + * queue takes a reference that it holds until the job's finished fence
> + * signals and the job is freed by the put_job worker.
> + *
> + * 5. **Completion**: when the job's hardware work finishes its finished fence
> + * is signalled and drm_dep_job_put() is called by the queue. The driver
> + * must release any driver-private resources in &drm_dep_job_ops.release.
> + *
> + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> + * objects before the driver's release callback is invoked.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
> +#include <drm/drm_syncobj.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/**
> + * drm_dep_job_init() - initialise a dep job
> + * @job: dep job to initialise
> + * @args: initialisation arguments
> + *
> + * Initialises @job with the queue, ops and credit count from @args. Acquires
> + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> + * the lifetime of the job and released by drm_dep_job_release() when the last
> + * job reference is dropped.
> + *
> + * Resources are released automatically when the last reference is dropped
> + * via drm_dep_job_put(), which must be called to release the job; drivers
> + * must not free the job directly.
> + *
> + * Context: Process context. Allocates memory with GFP_KERNEL.
> + * Return: 0 on success, -%EINVAL if credits is 0,
> + * -%ENOMEM on fence allocation failure.
> + */
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args)
> +{
> + if (unlikely(!args->credits)) {
> + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> + return -EINVAL;
> + }
> +
> + memset(job, 0, sizeof(*job));
> +
> + job->dfence = drm_dep_fence_alloc();
> + if (!job->dfence)
> + return -ENOMEM;
> +
> + job->ops = args->ops;
> + job->q = drm_dep_queue_get(args->q);
> + job->credits = args->credits;
> +
> + kref_init(&job->refcount);
> + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> + INIT_LIST_HEAD(&job->pending_link);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_init);
> +
> +/**
> + * drm_dep_job_drop_dependencies() - release all input dependency fences
> + * @job: dep job whose dependency xarray to drain
> + *
> + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> + * i.e. slots that were pre-allocated but never replaced — are silently
> + * skipped; the sentinel carries no reference. Called from
> + * drm_dep_queue_run_job() in process context immediately after
> + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> + * dependencies here — while still in process context — avoids calling
> + * xa_destroy() from IRQ context if the job's last reference is later
> + * dropped from a dma_fence callback.
> + *
> + * Context: Process context.
> + */
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + unsigned long index;
> +
> + xa_for_each(&job->dependencies, index, fence) {
> + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> + continue;
> + dma_fence_put(fence);
> + }
> + xa_destroy(&job->dependencies);
> +}
> +
> +/**
> + * drm_dep_job_fini() - clean up a dep job
> + * @job: dep job to clean up
> + *
> + * Cleans up the dep fence and drops the queue reference held by @job.
> + *
> + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> + * the dependency xarray is also released here. For armed jobs the xarray
> + * has already been drained by drm_dep_job_drop_dependencies() in process
> + * context immediately after run_job(), so it is left untouched to avoid
> + * calling xa_destroy() from IRQ context.
> + *
> + * Warns if @job is still linked on the queue's pending list, which would
> + * indicate a bug in the teardown ordering.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_fini(struct drm_dep_job *job)
> +{
> + bool armed = drm_dep_fence_is_armed(job->dfence);
> +
> + WARN_ON(!list_empty(&job->pending_link));
> +
> + drm_dep_fence_cleanup(job->dfence);
> + job->dfence = NULL;
> +
> + /*
> + * Armed jobs have their dependencies drained by
> + * drm_dep_job_drop_dependencies() in process context after run_job().
Just want to clear the confusion and make sure I get this right at the
same time. To me, "process context" means a user thread entering some
syscall(). What you call "process context" is more a "thread context" to
me. I'm actually almost certain it's always a kernel thread (a workqueue
worker thread to be accurate) that executes the drop_deps() after a
run_job().
> + * Skip here to avoid calling xa_destroy() from IRQ context.
> + */
> + if (!armed)
> + drm_dep_job_drop_dependencies(job);
Why do we need to make a difference here? Can't we just assume that the
whole drm_dep_job_fini() call is unsafe in atomic context, and have a
work item embedded in the job to defer its destruction when _put() is
called in a context where the destruction is not allowed?
> +}
> +
> +/**
> + * drm_dep_job_get() - acquire a reference to a dep job
> + * @job: dep job to acquire a reference on, or NULL
> + *
> + * Context: Any context.
> + * Return: @job with an additional reference held, or NULL if @job is NULL.
> + */
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_get(&job->refcount);
> + return job;
> +}
> +EXPORT_SYMBOL(drm_dep_job_get);
> +
> +/**
> + * drm_dep_job_release() - kref release callback for a dep job
> + * @kref: kref embedded in the dep job
> + *
> + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Finally, releases the queue reference
> + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> + * queue put is performed last to ensure no queue state is accessed after
> + * the job memory is freed.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +static void drm_dep_job_release(struct kref *kref)
> +{
> + struct drm_dep_job *job =
> + container_of(kref, struct drm_dep_job, refcount);
> + struct drm_dep_queue *q = job->q;
> +
> + drm_dep_job_fini(job);
> +
> + if (job->ops && job->ops->release)
> + job->ops->release(job);
> + else
> + kfree(job);
> +
> + drm_dep_queue_put(q);
> +}
> +
> +/**
> + * drm_dep_job_put() - release a reference to a dep job
> + * @job: dep job to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +void drm_dep_job_put(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_put(&job->refcount, drm_dep_job_release);
> +}
> +EXPORT_SYMBOL(drm_dep_job_put);
> +
> +/**
> + * drm_dep_job_arm() - arm a dep job for submission
> + * @job: dep job to arm
> + *
> + * Initialises the finished fence on @job->dfence, assigning
> + * it a sequence number from the job's queue. Must be called after
> + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> + * After this point, memory allocations that could trigger reclaim are
> + * forbidden; lockdep enforces this. arm() must always be paired with
> + * drm_dep_job_push(); lockdep also enforces this pairing.
> + *
> + * Warns if the job has already been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_arm(struct drm_dep_job *job)
> +{
> + drm_dep_queue_push_job_begin(job->q);
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + drm_dep_fence_init(job->dfence, job->q);
> + job->signalling_cookie = dma_fence_begin_signalling();
I'd really like DMA-signalling-path annotation to be something that
doesn't leak to the job object. The way I see it, in the submit path,
it should be some sort of block initializing an opaque token, and
drm_dep_job_arm() should expect a valid token to be passed, thus
guaranteeing that anything between arm and push, and more generally
anything in that section is safe.
struct drm_job_submit_context submit_ctx;

// Do all the prep stuff, pre-alloc, resv setup, ...

// Non-fallible section of the submit starts here.
// This is properly annotated with
// dma_fence_{begin,end}_signalling() to ensure we're
// not taking locks or doing allocations forbidden in
// the signalling path
drm_job_submit_non_fallible_section(&submit_ctx) {
        for_each_job() {
                drm_dep_job_arm(&submit_ctx, &job);

                // pass the armed fence around, if needed

                drm_dep_job_push(&submit_ctx, &job);
        }
}
With the current solution, there's no control that
drm_dep_job_{arm,push}() calls are balanced, with the risk of leaving a
DMA-signalling annotation behind.
> +}
> +EXPORT_SYMBOL(drm_dep_job_arm);
> +
> +/**
> + * drm_dep_job_push() - submit a job to its queue for execution
> + * @job: dep job to push
> + *
> + * Submits @job to the queue it was initialised with. Must be called after
> + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> + * held until the queue is fully done with it. The reference is released
> + * directly in the finished-fence dma_fence callback for queues with
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> + * from hardirq context), or via the put_job work item on the submit
> + * workqueue otherwise.
> + *
> + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> + * enforces the pairing.
> + *
> + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> + * @job exactly once, even if the queue is killed or torn down before the
> + * job reaches the head of the queue. Drivers can use this guarantee to
> + * perform bookkeeping cleanup; the actual backend operation should be
> + * skipped when drm_dep_queue_is_killed() returns true.
> + *
> + * If the queue does not support the bypass path, the job is pushed directly
> + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> + *
> + * Warns if the job has not been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_push(struct drm_dep_job *job)
> +{
> + struct drm_dep_queue *q = job->q;
> +
> + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> +
> + drm_dep_job_get(job);
> +
> + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> + drm_dep_queue_push_job(q, job);
> + dma_fence_end_signalling(job->signalling_cookie);
> + drm_dep_queue_push_job_end(job->q);
> + return;
> + }
> +
> + scoped_guard(mutex, &q->sched.lock) {
> + if (drm_dep_queue_can_job_bypass(q, job))
> + drm_dep_queue_run_job(q, job);
> + else
> + drm_dep_queue_push_job(q, job);
> + }
> +
> + dma_fence_end_signalling(job->signalling_cookie);
> + drm_dep_queue_push_job_end(job->q);
> +}
> +EXPORT_SYMBOL(drm_dep_job_push);
> +
> +/**
> + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> + * @job: dep job to add the dependencies to
> + * @fence: the dma_fence to add to the list of dependencies, or
> + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> + *
> + * Note that @fence is consumed in both the success and error cases (except
> + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> + *
> + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> + * fence->context matches the queue's finished fence context) are silently
> + * dropped; the job need not wait on its own queue's output.
> + *
> + * Warns if the job has already been armed (dependencies must be added before
> + * drm_dep_job_arm()).
> + *
> + * **Pre-allocation pattern**
> + *
> + * When multiple jobs across different queues must be prepared and submitted
> + * together in a single atomic commit — for example, where job A's finished
> + * fence is an input dependency of job B — all jobs must be armed and pushed
> + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * region. Once that region has started no memory allocation is permitted.
> + *
> + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> + * the underlying xarray must be tracked by the caller separately (e.g. it is
> + * always index 0 when the dependency array is empty, a property Xe relies on).
> + * After all jobs have been armed and the finished fences are available, call
> + * drm_dep_job_replace_dependency() with that index and the real fence.
> + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> + * called from atomic or signalling context.
> + *
> + * The sentinel slot is never skipped by the signalled-fence fast-path,
> + * ensuring a slot is always allocated even when the real fence is not yet
> + * known.
> + *
> + * **Example: bind job feeding TLB invalidation jobs**
> + *
> + * Consider a GPU with separate queues for page-table bind operations and for
> + * TLB invalidation. A single atomic commit must:
> + *
> + * 1. Run a bind job that modifies page tables.
> + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> + * completing, so stale translations are flushed before the engines
> + * continue.
> + *
> + * Because all jobs must be armed and pushed inside a signalling region (where
> + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> + *
> + * // Phase 1 — process context, GFP_KERNEL allowed
> + * drm_dep_job_init(bind_job, bind_queue, ops);
> + * for_each_mmu(mmu) {
> + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> + * // Pre-allocate slot at index 0; real fence not available yet
> + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> + * }
> + *
> + * // Phase 2 — inside signalling region, no GFP_KERNEL
> + * dma_fence_begin_signalling();
> + * drm_dep_job_arm(bind_job);
> + * for_each_mmu(mmu) {
> + * // Swap sentinel for bind job's finished fence
> + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> + * dma_fence_get(bind_job->finished));
Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
submit path. If we were to add it, it would be around the
panthor_submit_ctx_push_jobs() call, which might seem broken. In
practice I don't think it is because we don't expose fences to the
outside world until all jobs have been pushed. So what happens is that
a job depending on a previous job in the same batch-submit has the
armed-but-not-yet-pushed fence in its deps, and that's the only place
where this fence is present. If something fails on a subsequent job
preparation in the next batch submit, the rollback logic will just drop
the jobs on the floor, and release the armed-but-not-pushed-fence,
meaning we're not leaking a fence that will never be signalled. I'm in
no way saying this design is sane, just trying to explain why it's
currently safe and works fine.
In general, I wonder if we should distinguish between "armed" and
"publicly exposed" to help deal with this intra-batch dep thing without
resorting to reservation and other tricks like that.
> + * drm_dep_job_arm(tlb_job[mmu]);
> + * }
> + * drm_dep_job_push(bind_job);
> + * for_each_mmu(mmu)
> + * drm_dep_job_push(tlb_job[mmu]);
> + * dma_fence_end_signalling();
> + *
> + * Context: Process context. May allocate memory with GFP_KERNEL.
> + * Return: if @fence is %DRM_DEP_JOB_FENCE_PREALLOC, the index of the
> + * allocated slot on success; otherwise 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> +{
> + struct drm_dep_queue *q = job->q;
> + struct dma_fence *entry;
> + unsigned long index;
> + u32 id = 0;
> + int ret;
> +
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + might_alloc(GFP_KERNEL);
> +
> + if (!fence)
> + return 0;
> +
> + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> + goto add_fence;
> +
> + /*
> + * Ignore signalled fences or fences from our own queue — finished
> + * fences use q->fence.context.
> + */
> + if (dma_fence_test_signaled_flag(fence) ||
> + fence->context == q->fence.context) {
> + dma_fence_put(fence);
> + return 0;
> + }
> +
> + /* Deduplicate if we already depend on a fence from the same context.
> + * This lets the size of the array of deps scale with the number of
> + * engines involved, rather than the number of BOs.
> + */
> + xa_for_each(&job->dependencies, index, entry) {
> + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> + entry->context != fence->context)
> + continue;
> +
> + if (dma_fence_is_later(fence, entry)) {
> + dma_fence_put(entry);
> + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> + } else {
> + dma_fence_put(fence);
> + }
> + return 0;
> + }
> +
> +add_fence:
> + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> + GFP_KERNEL);
> + if (ret != 0) {
> + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> + dma_fence_put(fence);
> + return ret;
> + }
> +
> + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> +
> +/**
> + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> + * @job: dep job to update
> + * @index: xarray index of the slot to replace, as returned when the sentinel
> + * was originally inserted via drm_dep_job_add_dependency()
> + * @fence: the real dma_fence to store; its reference is always consumed
> + *
> + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> + * @job->dependencies with @fence. The slot must have been pre-allocated by
> + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> + * existing entry is asserted to be the sentinel.
> + *
> + * This is the second half of the pre-allocation pattern described in
> + * drm_dep_job_add_dependency(). It is intended to be called inside a
> + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> + * internally so it is safe to call from atomic or signalling context, but
> + * since the slot has been pre-allocated no actual memory allocation occurs.
> + *
> + * If @fence is already signalled the slot is erased rather than storing a
> + * redundant dependency. The successful store is asserted — if the store
> + * fails it indicates a programming error (slot index out of range or
> + * concurrent modification).
> + *
> + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence)
> +{
> + WARN_ON(xa_load(&job->dependencies, index) !=
> + DRM_DEP_JOB_FENCE_PREALLOC);
> +
> + if (dma_fence_test_signaled_flag(fence)) {
> + xa_erase(&job->dependencies, index);
> + dma_fence_put(fence);
> + return;
> + }
> +
> + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> + GFP_NOWAIT)))) {
> + dma_fence_put(fence);
> + return;
> + }
You don't seem to go for the
replace-if-earlier-fence-on-same-context-exists optimization that we
have in drm_dep_job_add_dependency(). Any reason not to?
> +}
> +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> +
I'm going to stop here for today.
Regards,
Boris
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
` (4 preceding siblings ...)
2026-03-17 14:55 ` Boris Brezillon
@ 2026-03-17 16:30 ` Shashank Sharma
5 siblings, 0 replies; 50+ messages in thread
From: Shashank Sharma @ 2026-03-17 16:30 UTC (permalink / raw)
To: Matthew Brost, intel-xe@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
Danilo Krummrich, David Airlie, Maarten Lankhorst, Maxime Ripard,
Philipp Stanner, Simona Vetter, Sumit Semwal, Thomas Zimmermann,
linux-kernel@vger.kernel.org
On 16.03.26 05:32, Matthew Brost wrote:
>
> Diverging requirements between GPU drivers using firmware scheduling
> and those using hardware scheduling have shown that drm_gpu_scheduler is
> no longer sufficient for firmware-scheduled GPU drivers. The technical
> debt, lack of memory-safety guarantees, absence of clear object-lifetime
> rules, and numerous driver-specific hacks have rendered
> drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> firmware-scheduled GPU drivers—one that addresses all of the
> aforementioned shortcomings.
>
> Add drm_dep, a lightweight GPU submission queue intended as a
> replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> from the queue (drm_sched_entity) into two objects requiring external
> coordination, drm_dep merges both roles into a single struct
> drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> that is unnecessary for firmware schedulers which manage their own
> run-lists internally.
>
> Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> management by the driver, drm_dep uses reference counting (kref) on both
> queues and jobs to guarantee object lifetime safety. A job holds a queue
> reference from init until its last put, and the queue holds a job reference
> from dispatch until the put_job worker runs. This makes use-after-free
> impossible even when completion arrives from IRQ context or concurrent
> teardown is in flight.
>
> The core objects are:
>
> struct drm_dep_queue - a per-context submission queue owning an
> ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> queue, and a pending-job list. Reference counted; drivers can embed
> it and provide a .release vfunc for RCU-safe teardown.
>
> struct drm_dep_job - a single unit of GPU work. Drivers embed this
> and provide a .release vfunc. Jobs carry an xarray of input
> dma_fence dependencies and produce a drm_dep_fence as their
> finished fence.
>
> struct drm_dep_fence - a dma_fence subclass wrapping an optional
> parent hardware fence. The finished fence is armed (sequence
> number assigned) before submission and signals when the hardware
> fence signals (or immediately on synchronous completion).
>
> Job lifecycle:
> 1. drm_dep_job_init() - allocate and initialise; job acquires a
> queue reference.
> 2. drm_dep_job_add_dependency() and friends - register input fences;
> duplicates from the same context are deduplicated.
> 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> 4. drm_dep_job_push() - submit to queue.
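The init-to-last-put / dispatch-to-put_job reference contract described in the cover letter can be modeled in a few lines of plain C. This is a userspace sketch with bare ints, not the kernel API and not real atomics:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the drm_dep refcount contract (not the kernel API). */
struct model_queue { int refs; };
struct model_job { int refs; struct model_queue *q; };

static void job_init(struct model_job *job, struct model_queue *q)
{
	job->refs = 1;		/* caller's reference */
	job->q = q;
	q->refs++;		/* job holds a queue reference until last put */
}

static void job_push(struct model_job *job)
{
	job->refs++;		/* queue's reference, held until put_job runs */
}

static bool job_put(struct model_job *job)
{
	if (--job->refs == 0) {
		job->q->refs--;	/* release: drop the queue ref last */
		return true;	/* job freed */
	}
	return false;
}
```

With this shape, neither IRQ-context completion nor concurrent teardown can free the queue out from under a live job, since the queue outlives every job referencing it.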
>
> Submission paths under queue lock:
> - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> SPSC queue is empty, no dependencies are pending, and credits are
> available, the job is dispatched inline on the calling thread.
> - Queued path: job is pushed onto the SPSC queue and the run_job
> worker is kicked. The worker resolves remaining dependencies
> (installing wakeup callbacks for unresolved fences) before calling
> ops->run_job().
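If I read the bypass description right, the gate reduces to a pure predicate over queue state; a simplified standalone model (names are mine, not from the series):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * All four conditions must hold for inline dispatch on the calling
 * thread; otherwise the job goes through the SPSC queue and the
 * run_job worker. Standalone model, not the kernel implementation.
 */
static bool can_bypass(bool bypass_supported, int queued_jobs,
		       int pending_deps, int credits_avail, int job_credits)
{
	return bypass_supported &&
	       queued_jobs == 0 &&
	       pending_deps == 0 &&
	       credits_avail >= job_credits;
}
```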
>
> Credit-based throttling prevents hardware overflow: each job declares
> a credit cost at init time; dispatch is deferred until sufficient
> credits are available.
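A minimal standalone model of the credit accounting as I understand it (hypothetical names, no locking):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Dispatch consumes the job's declared credit cost; completion returns
 * it. Dispatch is deferred (returns false) when not enough credits
 * remain, which is what prevents hardware ring overflow.
 */
struct credit_pool { int avail; };

static bool credits_try_dispatch(struct credit_pool *p, int cost)
{
	if (p->avail < cost)
		return false;	/* defer until a completion frees credits */
	p->avail -= cost;
	return true;
}

static void credits_complete(struct credit_pool *p, int cost)
{
	p->avail += cost;
}
```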
>
> Timeout Detection and Recovery (TDR): a per-queue delayed work item
> fires when the head pending job exceeds q->job.timeout jiffies, calling
> ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> expiry for device teardown.
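The expiry test presumably follows the usual wraparound-safe jiffies pattern; a standalone model (my names, using the same signed-subtraction trick as the kernel's time_after()):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Whether the head pending job has exceeded its timeout, in jiffy-like
 * ticks. The signed subtraction keeps the comparison correct across
 * counter wraparound. Standalone model, not the posted code.
 */
static bool tdr_expired(unsigned long now, unsigned long head_dispatched,
			unsigned long timeout)
{
	return (long)(now - (head_dispatched + timeout)) >= 0;
}
```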
>
> IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> allow drm_dep_job_done() to be called from hardirq context (e.g. a
> dma_fence callback). Dependency cleanup is deferred to process context
> after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
>
> Zombie-state guard: workers use kref_get_unless_zero() on entry and
> bail immediately if the queue refcount has already reached zero and
> async teardown is in flight, preventing use-after-free.
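The worker-entry guard can be modeled as below (plain non-atomic model; the real kref_get_unless_zero() is of course an atomic cmpxchg loop):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A worker may only proceed if it can take a reference while the
 * refcount is still non-zero. Once the count has hit zero, teardown
 * owns the object and the worker must bail immediately.
 */
static bool get_unless_zero(int *refcount)
{
	if (*refcount == 0)
		return false;	/* zombie: teardown in flight */
	(*refcount)++;
	return true;
}
```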
>
> Teardown is always deferred to a module-private workqueue (dep_free_wq)
> so that destroy_workqueue() is never called from within one of the
> queue's own workers. Each queue holds a drm_dev_get() reference on its
> owning struct drm_device, released as the final step of teardown via
> drm_dev_put(). This prevents the driver module from being unloaded
> while any queue is still alive without requiring a separate drain API.
Given that workload scheduling for FW-based drivers was meant to be
simpler than legacy hardware scheduling, replacing the N:1
entity-to-scheduler mapping with a unified, refcounted drm_dep_queue
makes a lot of sense.
One architectural question regarding TTM-based drivers: does drm_dep
intend to handle buffer evictions and page-table (PT) update events?
If yes, we might also need something like 'eviction fences' to halt
execution, evict BOs, update the PT, and resume. Since the firmware
manages the actual hardware queue, do you envision drm_dep_queue
providing generic hooks/flags to temporarily pause job dispatching (or
inject high-priority PT update jobs) when a memory manager needs to
perform an eviction? It would be good to ensure there is a provision
for this type of pipeline stall, as any TTM-based or dynamic-VM driver
will inevitably need it.
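Something along these lines is what I have in mind (a purely hypothetical sketch; none of these names exist in the posted series):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical pause/resume gating for eviction: dispatch is refused
 * while any eviction holds the queue paused. The count allows nested
 * or concurrent evictors.
 */
struct paused_queue { int pause_count; };

static void queue_pause(struct paused_queue *q)
{
	q->pause_count++;	/* memory manager starts an eviction */
}

static void queue_resume(struct paused_queue *q)
{
	q->pause_count--;	/* eviction done, PT updated */
}

static bool queue_can_dispatch(const struct paused_queue *q)
{
	return q->pause_count == 0;
}
```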
- Shashank
>
> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Philipp Stanner <phasta@kernel.org>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> ---
> drivers/gpu/drm/Kconfig | 4 +
> drivers/gpu/drm/Makefile | 1 +
> drivers/gpu/drm/dep/Makefile | 5 +
> drivers/gpu/drm/dep/drm_dep_fence.c | 406 +++++++
> drivers/gpu/drm/dep/drm_dep_fence.h | 25 +
> drivers/gpu/drm/dep/drm_dep_job.c | 675 +++++++++++
> drivers/gpu/drm/dep/drm_dep_job.h | 13 +
> drivers/gpu/drm/dep/drm_dep_queue.c | 1647 +++++++++++++++++++++++++++
> drivers/gpu/drm/dep/drm_dep_queue.h | 31 +
> include/drm/drm_dep.h | 597 ++++++++++
> 10 files changed, 3404 insertions(+)
> create mode 100644 drivers/gpu/drm/dep/Makefile
> create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.h
> create mode 100644 drivers/gpu/drm/dep/drm_dep_job.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_job.h
> create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.c
> create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.h
> create mode 100644 include/drm/drm_dep.h
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 5386248e75b6..834f6e210551 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -276,6 +276,10 @@ config DRM_SCHED
> tristate
> depends on DRM
>
> +config DRM_DEP
> + tristate
> + depends on DRM
> +
> # Separate option as not all DRM drivers use it
> config DRM_PANEL_BACKLIGHT_QUIRKS
> tristate
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index e97faabcd783..1ad87cc0e545 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -173,6 +173,7 @@ obj-y += clients/
> obj-y += display/
> obj-$(CONFIG_DRM_TTM) += ttm/
> obj-$(CONFIG_DRM_SCHED) += scheduler/
> +obj-$(CONFIG_DRM_DEP) += dep/
> obj-$(CONFIG_DRM_RADEON)+= radeon/
> obj-$(CONFIG_DRM_AMDGPU)+= amd/amdgpu/
> obj-$(CONFIG_DRM_AMDGPU)+= amd/amdxcp/
> diff --git a/drivers/gpu/drm/dep/Makefile b/drivers/gpu/drm/dep/Makefile
> new file mode 100644
> index 000000000000..335f1af46a7b
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +drm_dep-y := drm_dep_queue.o drm_dep_job.o drm_dep_fence.o
> +
> +obj-$(CONFIG_DRM_DEP) += drm_dep.o
> diff --git a/drivers/gpu/drm/dep/drm_dep_fence.c b/drivers/gpu/drm/dep/drm_dep_fence.c
> new file mode 100644
> index 000000000000..ae05b9077772
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_fence.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency fence
> + *
> + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> + * provides a single dma_fence (@finished) signalled when the hardware
> + * completes the job.
> + *
> + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> + * is signalled once @parent signals (or immediately if run_job() returns
> + * NULL or an error).
> + *
> + * Drivers should expose @finished as the out-fence for GPU work since it is
> + * valid from the moment drm_dep_job_arm() returns, whereas the hardware fence
> + * could be a compound fence, which is disallowed when installed into
> + * drm_syncobjs or dma-resv.
> + *
> + * The fence uses the kernel's inline spinlock (NULL passed to dma_fence_init())
> + * so no separate lock allocation is required.
> + *
> + * Deadline propagation is supported: if a consumer sets a deadline via
> + * dma_fence_set_deadline(), it is forwarded to @parent when @parent is set.
> + * If @parent has not been set yet the deadline is stored in @deadline and
> + * forwarded at that point.
> + *
> + * Memory management: drm_dep_fence objects are allocated with kzalloc() and
> + * freed via kfree_rcu() once the fence is released, ensuring safety with
> + * RCU-protected fence accesses.
> + */
> +
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include "drm_dep_fence.h"
> +
> +/**
> + * DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT - a fence deadline hint has been set
> + *
> + * Set by the deadline callback on the finished fence to indicate a deadline
> + * has been set which may need to be propagated to the parent hardware fence.
> + */
> +#define DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT (DMA_FENCE_FLAG_USER_BITS + 1)
> +
> +/**
> + * struct drm_dep_fence - fence tracking the completion of a dep job
> + *
> + * Contains a single dma_fence (@finished) that is signalled when the
> + * hardware completes the job. The fence uses the kernel's inline
> + * spinlock (no external spinlock required).
> + *
> + * This struct is private to the drm_dep module; external code interacts
> + * through the accessor functions declared in drm_dep_fence.h.
> + */
> +struct drm_dep_fence {
> + /**
> + * @finished: signalled when the job completes on hardware.
> + *
> + * Drivers should use this fence as the out-fence for a job since it
> + * is available immediately upon drm_dep_job_arm().
> + */
> + struct dma_fence finished;
> +
> + /**
> + * @deadline: deadline set on @finished which potentially needs to be
> + * propagated to @parent.
> + */
> + ktime_t deadline;
> +
> + /**
> + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> + *
> + * @finished is signaled once @parent is signaled. The initial store is
> + * performed via smp_store_release to synchronize with deadline handling.
> + *
> + * All readers must access this under the fence lock and take a reference to
> + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> + * signals, and this drop also releases its internal reference.
> + */
> + struct dma_fence *parent;
> +
> + /**
> + * @q: the queue this fence belongs to.
> + */
> + struct drm_dep_queue *q;
> +};
> +
> +static const struct dma_fence_ops drm_dep_fence_ops;
> +
> +/**
> + * to_drm_dep_fence() - cast a dma_fence to its enclosing drm_dep_fence
> + * @f: dma_fence to cast
> + *
> + * Context: No context requirements (inline helper).
> + * Return: pointer to the enclosing &drm_dep_fence.
> + */
> +static struct drm_dep_fence *to_drm_dep_fence(struct dma_fence *f)
> +{
> + return container_of(f, struct drm_dep_fence, finished);
> +}
> +
> +/**
> + * drm_dep_fence_set_parent() - store the hardware fence and propagate
> + * any deadline
> + * @dfence: dep fence
> + * @parent: hardware fence returned by &drm_dep_queue_ops.run_job, or NULL/error
> + *
> + * Stores @parent on @dfence under smp_store_release() so that a concurrent
> + * drm_dep_fence_set_deadline() call sees the parent before checking the
> + * deadline bit. If a deadline has already been set on @dfence->finished it is
> + * forwarded to @parent immediately. Does nothing if @parent is NULL or an
> + * error pointer.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> + struct dma_fence *parent)
> +{
> + if (IS_ERR_OR_NULL(parent))
> + return;
> +
> + /*
> + * smp_store_release() to ensure a thread racing us in
> + * drm_dep_fence_set_deadline() sees the parent set before
> + * it calls test_bit(HAS_DEADLINE_BIT).
> + */
> + smp_store_release(&dfence->parent, dma_fence_get(parent));
> + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT,
> + &dfence->finished.flags))
> + dma_fence_set_deadline(parent, dfence->deadline);
> +}
> +
> +/**
> + * drm_dep_fence_finished() - signal the finished fence with a result
> + * @dfence: dep fence to signal
> + * @result: error code to set, or 0 for success
> + *
> + * Sets the fence error to @result if non-zero, then signals
> + * @dfence->finished. Also removes parent visibility under the fence lock
> + * and drops the parent reference. Dropping the parent here allows the
> + * DRM dep fence to be completely decoupled from the DRM dep module.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_fence_finished(struct drm_dep_fence *dfence, int result)
> +{
> + struct dma_fence *parent;
> + unsigned long flags;
> +
> + dma_fence_lock_irqsave(&dfence->finished, flags);
> + if (result)
> + dma_fence_set_error(&dfence->finished, result);
> + dma_fence_signal_locked(&dfence->finished);
> + parent = dfence->parent;
> + dfence->parent = NULL;
> + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> +
> + dma_fence_put(parent);
> +}
> +
> +static const char *drm_dep_fence_get_driver_name(struct dma_fence *fence)
> +{
> + return "drm_dep";
> +}
> +
> +static const char *drm_dep_fence_get_timeline_name(struct dma_fence *f)
> +{
> + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> +
> + return dfence->q->name;
> +}
> +
> +/**
> + * drm_dep_fence_get_parent() - get a reference to the parent hardware fence
> + * @dfence: dep fence to query
> + *
> + * Returns a new reference to @dfence->parent, or NULL if the parent has
> + * already been cleared (i.e. @dfence->finished has signalled and the parent
> + * reference was dropped under the fence lock).
> + *
> + * Uses smp_load_acquire() to pair with the smp_store_release() in
> + * drm_dep_fence_set_parent(), ensuring that if we race a concurrent
> + * drm_dep_fence_set_parent() call we observe the parent pointer only after
> + * the store is fully visible — before set_parent() tests
> + * %DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT.
> + *
> + * Caller must hold the fence lock on @dfence->finished.
> + *
> + * Context: Any context, fence lock on @dfence->finished must be held.
> + * Return: a new reference to the parent fence, or NULL.
> + */
> +static struct dma_fence *drm_dep_fence_get_parent(struct drm_dep_fence *dfence)
> +{
> + dma_fence_assert_held(&dfence->finished);
> +
> + return dma_fence_get(smp_load_acquire(&dfence->parent));
> +}
> +
> +/**
> + * drm_dep_fence_set_deadline() - dma_fence_ops deadline callback
> + * @f: fence on which the deadline is being set
> + * @deadline: the deadline hint to apply
> + *
> + * Stores the earliest deadline under the fence lock, then propagates
> + * it to the parent hardware fence via smp_load_acquire() to race
> + * safely with drm_dep_fence_set_parent().
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_fence_set_deadline(struct dma_fence *f, ktime_t deadline)
> +{
> + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> + struct dma_fence *parent;
> + unsigned long flags;
> +
> + dma_fence_lock_irqsave(f, flags);
> +
> + /* If we already have an earlier deadline, keep it: */
> + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
> + ktime_before(dfence->deadline, deadline)) {
> + dma_fence_unlock_irqrestore(f, flags);
> + return;
> + }
> +
> + dfence->deadline = deadline;
> + set_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags);
> +
> + parent = drm_dep_fence_get_parent(dfence);
> + dma_fence_unlock_irqrestore(f, flags);
> +
> + if (parent)
> + dma_fence_set_deadline(parent, deadline);
> +
> + dma_fence_put(parent);
> +}
> +
> +static const struct dma_fence_ops drm_dep_fence_ops = {
> + .get_driver_name = drm_dep_fence_get_driver_name,
> + .get_timeline_name = drm_dep_fence_get_timeline_name,
> + .set_deadline = drm_dep_fence_set_deadline,
> +};
> +
> +/**
> + * drm_dep_fence_alloc() - allocate a dep fence
> + *
> + * Allocates a &drm_dep_fence with kzalloc() without initialising the
> + * dma_fence. Call drm_dep_fence_init() to fully initialise it.
> + *
> + * Context: Process context.
> + * Return: new &drm_dep_fence on success, NULL on allocation failure.
> + */
> +struct drm_dep_fence *drm_dep_fence_alloc(void)
> +{
> + return kzalloc_obj(struct drm_dep_fence);
> +}
> +
> +/**
> + * drm_dep_fence_init() - initialise the dma_fence inside a dep fence
> + * @dfence: dep fence to initialise
> + * @q: queue the owning job belongs to
> + *
> + * Initialises @dfence->finished using the context and sequence number from @q.
> + * Passes NULL as the lock so the fence uses its inline spinlock.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q)
> +{
> + u32 seq = ++q->fence.seqno;
> +
> + /*
> + * XXX: Inline fence hazard: currently all expected users of DRM dep
> + * hardware fences have a unique lockdep class. If that ever changes,
> + * we will need to assign a unique lockdep class here so lockdep knows
> + * this fence is allowed to nest with driver hardware fences.
> + */
> +
> + dfence->q = q;
> + dma_fence_init(&dfence->finished, &drm_dep_fence_ops,
> + NULL, q->fence.context, seq);
> +}
> +
> +/**
> + * drm_dep_fence_cleanup() - release a dep fence at job teardown
> + * @dfence: dep fence to clean up
> + *
> + * Called from drm_dep_job_fini(). If the dep fence was armed (refcount > 0)
> + * it is released via dma_fence_put() and will be freed by the RCU release
> + * callback once all waiters have dropped their references. If it was never
> + * armed it is freed directly with kfree().
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence)
> +{
> + if (drm_dep_fence_is_armed(dfence))
> + dma_fence_put(&dfence->finished);
> + else
> + kfree(dfence);
> +}
> +
> +/**
> + * drm_dep_fence_is_armed() - check whether the fence has been armed
> + * @dfence: dep fence to check
> + *
> + * Returns true if drm_dep_job_arm() has been called, i.e. @dfence->finished
> + * has been initialised and its reference count is non-zero. Used by
> + * assertions to enforce correct job lifecycle ordering (arm before push,
> + * add_dependency before arm).
> + *
> + * Context: Any context.
> + * Return: true if the fence is armed, false otherwise.
> + */
> +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence)
> +{
> + return !!kref_read(&dfence->finished.refcount);
> +}
> +
> +/**
> + * drm_dep_fence_is_finished() - test whether the finished fence has signalled
> + * @dfence: dep fence to check
> + *
> + * Uses dma_fence_test_signaled_flag() to read %DMA_FENCE_FLAG_SIGNALED_BIT
> + * directly without invoking the fence's ->signaled() callback or triggering
> + * any signalling side-effects.
> + *
> + * Context: Any context.
> + * Return: true if @dfence->finished has been signalled, false otherwise.
> + */
> +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence)
> +{
> + return dma_fence_test_signaled_flag(&dfence->finished);
> +}
> +
> +/**
> + * drm_dep_fence_is_complete() - test whether the job has completed
> + * @dfence: dep fence to check
> + *
> + * Takes the fence lock on @dfence->finished and calls
> + * drm_dep_fence_get_parent() to safely obtain a reference to the parent
> + * hardware fence — or NULL if the parent has already been cleared after
> + * signalling. Calls dma_fence_is_signaled() on @parent outside the lock,
> + * which may invoke the fence's ->signaled() callback and trigger signalling
> + * side-effects if the fence has completed but the signalled flag has not yet
> + * been set. The finished fence is tested via dma_fence_test_signaled_flag(),
> + * without side-effects.
> + *
> + * May only be called on a stopped queue (see drm_dep_queue_is_stopped()).
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if the job is complete, false otherwise.
> + */
> +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence)
> +{
> + struct dma_fence *parent;
> + unsigned long flags;
> + bool complete;
> +
> + dma_fence_lock_irqsave(&dfence->finished, flags);
> + parent = drm_dep_fence_get_parent(dfence);
> + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> +
> + complete = (parent && dma_fence_is_signaled(parent)) ||
> + dma_fence_test_signaled_flag(&dfence->finished);
> +
> + dma_fence_put(parent);
> +
> + return complete;
> +}
> +
> +/**
> + * drm_dep_fence_to_dma() - return the finished dma_fence for a dep fence
> + * @dfence: dep fence to query
> + *
> + * No reference is taken; the caller must hold its own reference to the owning
> + * &drm_dep_job for the duration of the access.
> + *
> + * Context: Any context.
> + * Return: the finished &dma_fence.
> + */
> +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence)
> +{
> + return &dfence->finished;
> +}
> +
> +/**
> + * drm_dep_fence_done() - signal the finished fence on job completion
> + * @dfence: dep fence to signal
> + * @result: job error code, or 0 on success
> + *
> + * Gets a temporary reference to @dfence->finished to guard against a racing
> + * last-put, signals the fence with @result, then drops the temporary
> + * reference. Called from drm_dep_job_done() in the queue core when a
> + * hardware completion callback fires or when run_job() returns immediately.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result)
> +{
> + dma_fence_get(&dfence->finished);
> + drm_dep_fence_finished(dfence, result);
> + dma_fence_put(&dfence->finished);
> +}
> diff --git a/drivers/gpu/drm/dep/drm_dep_fence.h b/drivers/gpu/drm/dep/drm_dep_fence.h
> new file mode 100644
> index 000000000000..65a1582f858b
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_fence.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_FENCE_H_
> +#define _DRM_DEP_FENCE_H_
> +
> +#include <linux/dma-fence.h>
> +
> +struct drm_dep_fence;
> +struct drm_dep_queue;
> +
> +struct drm_dep_fence *drm_dep_fence_alloc(void);
> +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q);
> +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence);
> +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> + struct dma_fence *parent);
> +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result);
> +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence);
> +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence);
> +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence);
> +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence);
> +
> +#endif /* _DRM_DEP_FENCE_H_ */
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> new file mode 100644
> index 000000000000..2d012b29a5fc
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency job
> + *
> + * A struct drm_dep_job represents a single unit of GPU work associated with
> + * a struct drm_dep_queue. The lifecycle of a job is:
> + *
> + * 1. **Allocation**: the driver allocates memory for the job (typically by
> + * embedding struct drm_dep_job in a larger structure) and calls
> + * drm_dep_job_init() to initialise it. On success the job holds one
> + * kref reference and a reference to its queue.
> + *
> + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> + * that must be signalled before the job can run. Duplicate fences from the
> + * same fence context are deduplicated automatically.
> + *
> + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> + * consuming a sequence number from the queue. After arming,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> + * queue takes a reference that it holds until the job's finished fence
> + * signals and the job is freed by the put_job worker.
> + *
> + * 5. **Completion**: when the job's hardware work finishes its finished fence
> + * is signalled and drm_dep_job_put() is called by the queue. The driver
> + * must release any driver-private resources in &drm_dep_job_ops.release.
> + *
> + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> + * objects before the driver's release callback is invoked.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/slab.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
> +#include <drm/drm_syncobj.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/**
> + * drm_dep_job_init() - initialise a dep job
> + * @job: dep job to initialise
> + * @args: initialisation arguments
> + *
> + * Initialises @job with the queue, ops and credit count from @args. Acquires
> + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> + * the lifetime of the job and released by drm_dep_job_release() when the last
> + * job reference is dropped.
> + *
> + * Resources are released automatically when the last reference is dropped
> + * via drm_dep_job_put(), which must be called to release the job; drivers
> + * must not free the job directly.
> + *
> + * Context: Process context. Allocates memory with GFP_KERNEL.
> + * Return: 0 on success, -%EINVAL if credits is 0,
> + * -%ENOMEM on fence allocation failure.
> + */
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args)
> +{
> + if (unlikely(!args->credits)) {
> + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> + return -EINVAL;
> + }
> +
> + memset(job, 0, sizeof(*job));
> +
> + job->dfence = drm_dep_fence_alloc();
> + if (!job->dfence)
> + return -ENOMEM;
> +
> + job->ops = args->ops;
> + job->q = drm_dep_queue_get(args->q);
> + job->credits = args->credits;
> +
> + kref_init(&job->refcount);
> + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> + INIT_LIST_HEAD(&job->pending_link);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_init);
> +
> +/**
> + * drm_dep_job_drop_dependencies() - release all input dependency fences
> + * @job: dep job whose dependency xarray to drain
> + *
> + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> + * i.e. slots that were pre-allocated but never replaced — are silently
> + * skipped; the sentinel carries no reference. Called from
> + * drm_dep_queue_run_job() in process context immediately after
> + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> + * dependencies here — while still in process context — avoids calling
> + * xa_destroy() from IRQ context if the job's last reference is later
> + * dropped from a dma_fence callback.
> + *
> + * Context: Process context.
> + */
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + unsigned long index;
> +
> + xa_for_each(&job->dependencies, index, fence) {
> + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> + continue;
> + dma_fence_put(fence);
> + }
> + xa_destroy(&job->dependencies);
> +}
> +
> +/**
> + * drm_dep_job_fini() - clean up a dep job
> + * @job: dep job to clean up
> + *
> + * Cleans up the dep fence and drops the queue reference held by @job.
> + *
> + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> + * the dependency xarray is also released here. For armed jobs the xarray
> + * has already been drained by drm_dep_job_drop_dependencies() in process
> + * context immediately after run_job(), so it is left untouched to avoid
> + * calling xa_destroy() from IRQ context.
> + *
> + * Warns if @job is still linked on the queue's pending list, which would
> + * indicate a bug in the teardown ordering.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_fini(struct drm_dep_job *job)
> +{
> + bool armed = drm_dep_fence_is_armed(job->dfence);
> +
> + WARN_ON(!list_empty(&job->pending_link));
> +
> + drm_dep_fence_cleanup(job->dfence);
> + job->dfence = NULL;
> +
> + /*
> + * Armed jobs have their dependencies drained by
> + * drm_dep_job_drop_dependencies() in process context after run_job().
> + * Skip here to avoid calling xa_destroy() from IRQ context.
> + */
> + if (!armed)
> + drm_dep_job_drop_dependencies(job);
> +}
> +
> +/**
> + * drm_dep_job_get() - acquire a reference to a dep job
> + * @job: dep job to acquire a reference on, or NULL
> + *
> + * Context: Any context.
> + * Return: @job with an additional reference held, or NULL if @job is NULL.
> + */
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_get(&job->refcount);
> + return job;
> +}
> +EXPORT_SYMBOL(drm_dep_job_get);
> +
> +/**
> + * drm_dep_job_release() - kref release callback for a dep job
> + * @kref: kref embedded in the dep job
> + *
> + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Finally, releases the queue reference
> + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> + * queue put is performed last to ensure no queue state is accessed after
> + * the job memory is freed.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +static void drm_dep_job_release(struct kref *kref)
> +{
> + struct drm_dep_job *job =
> + container_of(kref, struct drm_dep_job, refcount);
> + struct drm_dep_queue *q = job->q;
> +
> + drm_dep_job_fini(job);
> +
> + if (job->ops && job->ops->release)
> + job->ops->release(job);
> + else
> + kfree(job);
> +
> + drm_dep_queue_put(q);
> +}
> +
> +/**
> + * drm_dep_job_put() - release a reference to a dep job
> + * @job: dep job to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> + *
> + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> + * job's queue; otherwise process context only, as the release callback may
> + * sleep.
> + */
> +void drm_dep_job_put(struct drm_dep_job *job)
> +{
> + if (job)
> + kref_put(&job->refcount, drm_dep_job_release);
> +}
> +EXPORT_SYMBOL(drm_dep_job_put);
> +
> +/**
> + * drm_dep_job_arm() - arm a dep job for submission
> + * @job: dep job to arm
> + *
> + * Initialises the finished fence on @job->dfence, assigning
> + * it a sequence number from the job's queue. Must be called after
> + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> + * userspace or used as a dependency by other jobs.
> + *
> + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> + * After this point, memory allocations that could trigger reclaim are
> + * forbidden; lockdep enforces this. arm() must always be paired with
> + * drm_dep_job_push(); lockdep also enforces this pairing.
> + *
> + * Warns if the job has already been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_arm(struct drm_dep_job *job)
> +{
> + drm_dep_queue_push_job_begin(job->q);
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + drm_dep_fence_init(job->dfence, job->q);
> + job->signalling_cookie = dma_fence_begin_signalling();
> +}
> +EXPORT_SYMBOL(drm_dep_job_arm);
> +
> +/**
> + * drm_dep_job_push() - submit a job to its queue for execution
> + * @job: dep job to push
> + *
> + * Submits @job to the queue it was initialised with. Must be called after
> + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> + * held until the queue is fully done with it. The reference is released
> + * directly in the finished-fence dma_fence callback for queues with
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> + * from hardirq context), or via the put_job work item on the submit
> + * workqueue otherwise.
> + *
> + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> + * enforces the pairing.
> + *
> + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> + * @job exactly once, even if the queue is killed or torn down before the
> + * job reaches the head of the queue. Drivers can use this guarantee to
> + * perform bookkeeping cleanup; the actual backend operation should be
> + * skipped when drm_dep_queue_is_killed() returns true.
> + *
> + * If the queue does not support the bypass path, the job is pushed directly
> + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> + *
> + * Warns if the job has not been armed.
> + *
> + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> + * path.
> + */
> +void drm_dep_job_push(struct drm_dep_job *job)
> +{
> + struct drm_dep_queue *q = job->q;
> +
> + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> +
> + drm_dep_job_get(job);
> +
> + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> + drm_dep_queue_push_job(q, job);
> + dma_fence_end_signalling(job->signalling_cookie);
> +		drm_dep_queue_push_job_end(q);
> + return;
> + }
> +
> + scoped_guard(mutex, &q->sched.lock) {
> + if (drm_dep_queue_can_job_bypass(q, job))
> + drm_dep_queue_run_job(q, job);
> + else
> + drm_dep_queue_push_job(q, job);
> + }
> +
> + dma_fence_end_signalling(job->signalling_cookie);
> +	drm_dep_queue_push_job_end(q);
> +}
> +EXPORT_SYMBOL(drm_dep_job_push);
> +
> +/**
> + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> + * @job: dep job to add the dependencies to
> + * @fence: the dma_fence to add to the list of dependencies, or
> + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> + *
> + * Note that @fence is consumed in both the success and error cases (except
> + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> + *
> + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> + * fence->context matches the queue's finished fence context) are silently
> + * dropped; the job need not wait on its own queue's output.
> + *
> + * Warns if the job has already been armed (dependencies must be added before
> + * drm_dep_job_arm()).
> + *
> + * **Pre-allocation pattern**
> + *
> + * When multiple jobs across different queues must be prepared and submitted
> + * together in a single atomic commit — for example, where job A's finished
> + * fence is an input dependency of job B — all jobs must be armed and pushed
> + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * region. Once that region has started no memory allocation is permitted.
> + *
> + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> + * the underlying xarray must be tracked by the caller separately (e.g. it is
> + * always index 0 when the dependency array is empty, a property Xe relies on).
> + * After all jobs have been armed and the finished fences are available, call
> + * drm_dep_job_replace_dependency() with that index and the real fence.
> + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> + * called from atomic or signalling context.
> + *
> + * The sentinel slot is never skipped by the signalled-fence fast-path,
> + * ensuring a slot is always allocated even when the real fence is not yet
> + * known.
> + *
> + * **Example: bind job feeding TLB invalidation jobs**
> + *
> + * Consider a GPU with separate queues for page-table bind operations and for
> + * TLB invalidation. A single atomic commit must:
> + *
> + * 1. Run a bind job that modifies page tables.
> + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> + * completing, so stale translations are flushed before the engines
> + * continue.
> + *
> + * Because all jobs must be armed and pushed inside a signalling region (where
> + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> + *
> + * // Phase 1 — process context, GFP_KERNEL allowed
> + * drm_dep_job_init(bind_job, bind_queue, ops);
> + * for_each_mmu(mmu) {
> + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> + * // Pre-allocate slot at index 0; real fence not available yet
> + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> + * }
> + *
> + * // Phase 2 — inside signalling region, no GFP_KERNEL
> + * dma_fence_begin_signalling();
> + * drm_dep_job_arm(bind_job);
> + * for_each_mmu(mmu) {
> + * // Swap sentinel for bind job's finished fence
> + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> + * dma_fence_get(bind_job->finished));
> + * drm_dep_job_arm(tlb_job[mmu]);
> + * }
> + * drm_dep_job_push(bind_job);
> + * for_each_mmu(mmu)
> + * drm_dep_job_push(tlb_job[mmu]);
> + * dma_fence_end_signalling();
> + *
> + * Context: Process context. May allocate memory with GFP_KERNEL.
> + * Return: the allocated slot index if @fence is %DRM_DEP_JOB_FENCE_PREALLOC,
> + * 0 on success otherwise, or a negative error code.
> + */
> +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> +{
> + struct drm_dep_queue *q = job->q;
> + struct dma_fence *entry;
> + unsigned long index;
> + u32 id = 0;
> + int ret;
> +
> + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> + might_alloc(GFP_KERNEL);
> +
> + if (!fence)
> + return 0;
> +
> + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> + goto add_fence;
> +
> + /*
> + * Ignore signalled fences or fences from our own queue — finished
> + * fences use q->fence.context.
> + */
> + if (dma_fence_test_signaled_flag(fence) ||
> + fence->context == q->fence.context) {
> + dma_fence_put(fence);
> + return 0;
> + }
> +
> +	/*
> +	 * Deduplicate if we already depend on a fence from the same context.
> +	 * This lets the size of the array of deps scale with the number of
> +	 * engines involved, rather than the number of BOs.
> +	 */
> + xa_for_each(&job->dependencies, index, entry) {
> + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> + entry->context != fence->context)
> + continue;
> +
> + if (dma_fence_is_later(fence, entry)) {
> + dma_fence_put(entry);
> + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> + } else {
> + dma_fence_put(fence);
> + }
> + return 0;
> + }
> +
> +add_fence:
> + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> + GFP_KERNEL);
> +	if (ret) {
> + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> + dma_fence_put(fence);
> + return ret;
> + }
> +
> + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> +
> +/**
> + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> + * @job: dep job to update
> + * @index: xarray index of the slot to replace, as returned when the sentinel
> + * was originally inserted via drm_dep_job_add_dependency()
> + * @fence: the real dma_fence to store; its reference is always consumed
> + *
> + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> + * @job->dependencies with @fence. The slot must have been pre-allocated by
> + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> + * existing entry is asserted to be the sentinel.
> + *
> + * This is the second half of the pre-allocation pattern described in
> + * drm_dep_job_add_dependency(). It is intended to be called inside a
> + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> + * internally so it is safe to call from atomic or signalling context, but
> + * since the slot has been pre-allocated no actual memory allocation occurs.
> + *
> + * If @fence is already signalled the slot is erased rather than storing a
> + * redundant dependency. The successful store is asserted — if the store
> + * fails it indicates a programming error (slot index out of range or
> + * concurrent modification).
> + *
> + * Must be called before drm_dep_job_arm() on @job. @fence is consumed in all
> + * cases.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence)
> +{
> + WARN_ON(xa_load(&job->dependencies, index) !=
> + DRM_DEP_JOB_FENCE_PREALLOC);
> +
> + if (dma_fence_test_signaled_flag(fence)) {
> + xa_erase(&job->dependencies, index);
> + dma_fence_put(fence);
> + return;
> + }
> +
> + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> + GFP_NOWAIT)))) {
> + dma_fence_put(fence);
> + return;
> + }
> +}
> +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> +
> +/**
> + * drm_dep_job_add_syncobj_dependency() - adds a syncobj's fence as a
> + * job dependency
> + * @job: dep job to add the dependencies to
> + * @file: drm file private pointer
> + * @handle: syncobj handle to lookup
> + * @point: timeline point
> + *
> + * This adds the fence matching the given syncobj to @job.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> + struct drm_file *file, u32 handle,
> + u32 point)
> +{
> + struct dma_fence *fence;
> + int ret;
> +
> + ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
> + if (ret)
> + return ret;
> +
> + return drm_dep_job_add_dependency(job, fence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_syncobj_dependency);
> +
> +/**
> + * drm_dep_job_add_resv_dependencies() - add all fences from the resv to the job
> + * @job: dep job to add the dependencies to
> + * @resv: the dma_resv object to get the fences from
> + * @usage: the dma_resv_usage to use to filter the fences
> + *
> + * This adds all fences matching the given usage from @resv to @job.
> + * Must be called with the @resv lock held.
> + *
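> + * Typical use during submission, where @obj is an illustrative
> + * &drm_gem_object (a minimal sketch)::
> + *
> + *	dma_resv_lock(obj->resv, NULL);
> + *	ret = drm_dep_job_add_resv_dependencies(job, obj->resv,
> + *						dma_resv_usage_rw(write));
> + *	dma_resv_unlock(obj->resv);
> + *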
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> + struct dma_resv *resv,
> + enum dma_resv_usage usage)
> +{
> + struct dma_resv_iter cursor;
> + struct dma_fence *fence;
> + int ret;
> +
> + dma_resv_assert_held(resv);
> +
> + dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> + /*
> + * As drm_dep_job_add_dependency always consumes the fence
> + * reference (even when it fails), and dma_resv_for_each_fence
> + * is not obtaining one, we need to grab one before calling.
> + */
> + ret = drm_dep_job_add_dependency(job, dma_fence_get(fence));
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_resv_dependencies);
> +
> +/**
> + * drm_dep_job_add_implicit_dependencies() - adds implicit dependencies
> + * as job dependencies
> + * @job: dep job to add the dependencies to
> + * @obj: the gem object to add new dependencies from.
> + * @write: whether the job might write the object (so we need to depend on
> + * shared fences in the reservation object).
> + *
> + * This should be called after drm_gem_lock_reservations() on your array of
> + * GEM objects used in the job but before updating the reservations with your
> + * own fences.
> + *
> + * Context: Process context.
> + * Return: 0 on success, or a negative error code.
> + */
> +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> + struct drm_gem_object *obj,
> + bool write)
> +{
> + return drm_dep_job_add_resv_dependencies(job, obj->resv,
> + dma_resv_usage_rw(write));
> +}
> +EXPORT_SYMBOL(drm_dep_job_add_implicit_dependencies);
> +
> +/**
> + * drm_dep_job_is_signaled() - check whether a dep job has completed
> + * @job: dep job to check
> + *
> + * Determines whether @job has signalled. The queue should be stopped before
> + * calling this to obtain a stable snapshot of state. Both the parent hardware
> + * fence and the finished software fence are checked.
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if the job is signalled, false otherwise.
> + */
> +bool drm_dep_job_is_signaled(struct drm_dep_job *job)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> + return drm_dep_fence_is_complete(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_is_signaled);
> +
> +/**
> + * drm_dep_job_is_finished() - test whether a dep job's finished fence has signalled
> + * @job: dep job to check
> + *
> + * Tests whether the job's software finished fence has been signalled, using
> + * dma_fence_test_signaled_flag() to avoid any signalling side-effects. Unlike
> + * drm_dep_job_is_signaled(), this does not require the queue to be stopped and
> + * does not check the parent hardware fence — it is a lightweight test of the
> + * finished fence only.
> + *
> + * Context: Any context.
> + * Return: true if the job's finished fence has been signalled, false otherwise.
> + */
> +bool drm_dep_job_is_finished(struct drm_dep_job *job)
> +{
> + return drm_dep_fence_is_finished(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_is_finished);
> +
> +/**
> + * drm_dep_job_invalidate_job() - increment the invalidation count for a job
> + * @job: dep job to invalidate
> + * @threshold: threshold above which the job is considered invalidated
> + *
> + * Increments @job->invalidate_count and returns true if it exceeds @threshold,
> + * indicating the job should be considered hung and discarded. The queue must
> + * be stopped before calling this function.
> + *
> + * Context: Process context. The queue must be stopped before calling this.
> + * Return: true if @job->invalidate_count exceeds @threshold, false otherwise.
> + */
> +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> + return ++job->invalidate_count > threshold;
> +}
> +EXPORT_SYMBOL(drm_dep_job_invalidate_job);
> +
> +/**
> + * drm_dep_job_finished_fence() - return the finished fence for a job
> + * @job: dep job to query
> + *
> + * No reference is taken on the returned fence; the caller must hold its own
> + * reference to @job for the duration of any access.
> + *
> + * Context: Any context.
> + * Return: the finished &dma_fence for @job.
> + */
> +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job)
> +{
> + return drm_dep_fence_to_dma(job->dfence);
> +}
> +EXPORT_SYMBOL(drm_dep_job_finished_fence);
> diff --git a/drivers/gpu/drm/dep/drm_dep_job.h b/drivers/gpu/drm/dep/drm_dep_job.h
> new file mode 100644
> index 000000000000..35c61d258fa1
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_job.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_JOB_H_
> +#define _DRM_DEP_JOB_H_
> +
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +
> +void drm_dep_job_drop_dependencies(struct drm_dep_job *job);
> +
> +#endif /* _DRM_DEP_JOB_H_ */
> diff --git a/drivers/gpu/drm/dep/drm_dep_queue.c b/drivers/gpu/drm/dep/drm_dep_queue.c
> new file mode 100644
> index 000000000000..dac02d0d22c4
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_queue.c
> @@ -0,0 +1,1647 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +/**
> + * DOC: DRM dependency queue
> + *
> + * The drm_dep subsystem provides a lightweight GPU submission queue that
> + * combines the roles of drm_gpu_scheduler and drm_sched_entity into a
> + * single object (struct drm_dep_queue). Each queue owns its own ordered
> + * submit workqueue, timeout workqueue, and TDR delayed-work.
> + *
> + * **Job lifecycle**
> + *
> + * 1. Allocate and initialise a job with drm_dep_job_init().
> + * 2. Add dependency fences with drm_dep_job_add_dependency() and friends.
> + * 3. Arm the job with drm_dep_job_arm() to obtain its out-fences.
> + * 4. Submit with drm_dep_job_push().
> + *
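> + * A minimal single-job submission following these steps, with error handling
> + * elided (ops and in_fence are illustrative)::
> + *
> + *	drm_dep_job_init(job, q, ops);
> + *	drm_dep_job_add_dependency(job, in_fence);
> + *	drm_dep_job_arm(job);
> + *	out_fence = dma_fence_get(drm_dep_job_finished_fence(job));
> + *	drm_dep_job_push(job);
> + *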
> + * **Submission paths**
> + *
> + * drm_dep_job_push() decides between two paths under @q->sched.lock:
> + *
> + * - **Bypass path** (drm_dep_queue_can_job_bypass()): if
> + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the queue is not stopped,
> + * the SPSC queue is empty, the job has no dependency fences, and credits
> + * are available, the job is submitted inline on the calling thread without
> + * touching the submit workqueue.
> + *
> + * - **Queued path** (drm_dep_queue_push_job()): the job is pushed onto an
> + * SPSC queue and the run_job worker is kicked. The run_job worker pops the
> + * job, resolves any remaining dependency fences (installing wakeup
> + * callbacks for unresolved ones), and calls drm_dep_queue_run_job().
> + *
> + * **Running a job**
> + *
> + * drm_dep_queue_run_job() accounts credits, appends the job to the pending
> + * list (starting the TDR timer only when the list was previously empty),
> + * calls @ops->run_job(), stores the returned hardware fence as the parent
> + * of the job's dep fence, then installs a callback on it. When the hardware
> + * fence fires (or the job completes synchronously), drm_dep_job_done()
> + * signals the finished fence, returns credits, and kicks the put_job worker
> + * to free the job.
> + *
> + * **Timeout detection and recovery (TDR)**
> + *
> + * A delayed work item fires when a job on the pending list takes longer than
> + * @q->job.timeout jiffies. It calls @ops->timedout_job() and acts on the
> + * returned status (%DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED or
> + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB).
> + * drm_dep_queue_trigger_timeout() forces the timer to fire immediately (without
> + * changing the stored timeout), for example during device teardown.
> + *
> + * **Reference counting**
> + *
> + * Jobs and queues are both reference counted.
> + *
> + * A job holds a reference to its queue from drm_dep_job_init() until
> + * drm_dep_job_put() drops the job's last reference and its release callback
> + * runs. This ensures the queue remains valid for the entire lifetime of any
> + * job that was submitted to it.
> + *
> + * The queue holds its own reference to a job for as long as the job is
> + * internally tracked: from the moment the job is added to the pending list
> + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> + * worker, which calls drm_dep_job_put() to release that reference.
> + *
> + * **Hazard: use-after-free from within a worker**
> + *
> + * Because a job holds a queue reference, drm_dep_job_put() dropping the last
> + * job reference will also drop a queue reference via the job's release path.
> + * If that happens to be the last queue reference, drm_dep_queue_fini() can be
> + * called, which queues @q->free_work on dep_free_wq and returns immediately.
> + * free_work calls disable_work_sync() / disable_delayed_work_sync() on the
> + * queue's own workers before destroying its workqueues, so in practice a
> + * running worker always completes before the queue memory is freed.
> + *
> + * However, there is a secondary hazard: a worker can be queued while the
> + * queue is in a "zombie" state — refcount has already reached zero and async
> + * teardown is in flight, but the work item has not yet been disabled by
> + * free_work. To guard against this every worker uses
> + * drm_dep_queue_get_unless_zero() at entry; if the refcount is already zero
> + * the worker bails immediately without touching the queue state.
> + *
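> + * The guard at worker entry looks like this (my_worker is an illustrative
> + * stand-in for the internal run_job/put_job workers)::
> + *
> + *	static void my_worker(struct work_struct *w)
> + *	{
> + *		struct drm_dep_queue *q =
> + *			container_of(w, struct drm_dep_queue, sched.run_job);
> + *
> + *		if (!drm_dep_queue_get_unless_zero(q))
> + *			return;	// zombie queue, async teardown in flight
> + *
> + *		// ... safe to touch queue state ...
> + *		drm_dep_queue_put(q);
> + *	}
> + *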
> + * Because all actual teardown (disable_*_sync, destroy_workqueue) runs on
> + * dep_free_wq — which is independent of the queue's own submit/timeout
> + * workqueues — there is no deadlock risk. Each queue holds a drm_dev_get()
> + * reference on its owning &drm_device, which is released as the last step of
> + * teardown. This ensures the driver module cannot be unloaded while any queue
> + * is still alive.
> + */
> +
> +#include <linux/dma-resv.h>
> +#include <linux/kref.h>
> +#include <linux/module.h>
> +#include <linux/overflow.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/workqueue.h>
> +#include <drm/drm_dep.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_print.h>
> +#include "drm_dep_fence.h"
> +#include "drm_dep_job.h"
> +#include "drm_dep_queue.h"
> +
> +/*
> + * Dedicated workqueue for deferred drm_dep_queue teardown. Using a
> + * module-private WQ instead of system_percpu_wq keeps teardown isolated
> + * from unrelated kernel subsystems.
> + */
> +static struct workqueue_struct *dep_free_wq;
> +
> +/**
> + * drm_dep_queue_flags_set() - set a flag on the queue under sched.lock
> + * @q: dep queue
> + * @flag: flag to set (one of &enum drm_dep_queue_flags)
> + *
> + * Sets @flag in @q->sched.flags. Must be called with @q->sched.lock
> + * held; the lockdep assertion enforces this.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_flags_set(struct drm_dep_queue *q,
> + enum drm_dep_queue_flags flag)
> +{
> + lockdep_assert_held(&q->sched.lock);
> + q->sched.flags |= flag;
> +}
> +
> +/**
> + * drm_dep_queue_flags_clear() - clear a flag on the queue under sched.lock
> + * @q: dep queue
> + * @flag: flag to clear (one of &enum drm_dep_queue_flags)
> + *
> + * Clears @flag in @q->sched.flags. Must be called with @q->sched.lock
> + * held; the lockdep assertion enforces this.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_flags_clear(struct drm_dep_queue *q,
> + enum drm_dep_queue_flags flag)
> +{
> + lockdep_assert_held(&q->sched.lock);
> + q->sched.flags &= ~flag;
> +}
> +
> +/**
> + * drm_dep_queue_has_credits() - check whether the queue has enough credits
> + * @q: dep queue
> + * @job: job requesting credits
> + *
> + * Checks whether the queue has enough available credits to dispatch
> + * @job. If @job->credits exceeds the queue's credit limit, it is
> + * clamped with a WARN.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if available credits >= @job->credits, false otherwise.
> + */
> +static bool drm_dep_queue_has_credits(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + u32 available;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (job->credits > q->credit.limit) {
> + drm_warn(q->drm,
> +			 "Jobs may not exceed the credit limit, truncating.\n");
> + job->credits = q->credit.limit;
> + }
> +
> + WARN_ON(check_sub_overflow(q->credit.limit,
> + atomic_read(&q->credit.count),
> + &available));
> +
> + return available >= job->credits;
> +}
> +
> +/**
> + * drm_dep_queue_run_job_queue() - kick the run-job worker
> + * @q: dep queue
> + *
> + * Queues @q->sched.run_job on @q->sched.submit_wq unless the queue is stopped
> + * or the job queue is empty. The empty-queue check avoids queueing a work item
> + * that would immediately return with nothing to do.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_run_job_queue(struct drm_dep_queue *q)
> +{
> + if (!drm_dep_queue_is_stopped(q) && spsc_queue_count(&q->job.queue))
> + queue_work(q->sched.submit_wq, &q->sched.run_job);
> +}
> +
> +/**
> + * drm_dep_queue_put_job_queue() - kick the put-job worker
> + * @q: dep queue
> + *
> + * Queues @q->sched.put_job on @q->sched.submit_wq unless the queue
> + * is stopped.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_put_job_queue(struct drm_dep_queue *q)
> +{
> + if (!drm_dep_queue_is_stopped(q))
> + queue_work(q->sched.submit_wq, &q->sched.put_job);
> +}
> +
> +/**
> + * drm_queue_start_timeout() - arm or re-arm the TDR delayed work
> + * @q: dep queue
> + *
> + * Arms the TDR delayed work with @q->job.timeout. No-op if
> + * @q->ops->timedout_job is NULL, the timeout is MAX_SCHEDULE_TIMEOUT,
> + * or the pending list is empty.
> + *
> + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> + */
> +static void drm_queue_start_timeout(struct drm_dep_queue *q)
> +{
> + lockdep_assert_held(&q->job.lock);
> +
> + if (!q->ops->timedout_job ||
> + q->job.timeout == MAX_SCHEDULE_TIMEOUT ||
> + list_empty(&q->job.pending))
> + return;
> +
> + mod_delayed_work(q->sched.timeout_wq, &q->sched.tdr, q->job.timeout);
> +}
> +
> +/**
> + * drm_queue_start_timeout_unlocked() - arm TDR, acquiring job.lock
> + * @q: dep queue
> + *
> + * Acquires @q->job.lock with interrupts disabled and calls
> + * drm_queue_start_timeout().
> + *
> + * Context: Process context (workqueue).
> + */
> +static void drm_queue_start_timeout_unlocked(struct drm_dep_queue *q)
> +{
> + guard(spinlock_irq)(&q->job.lock);
> + drm_queue_start_timeout(q);
> +}
> +
> +/**
> + * drm_dep_queue_remove_dependency() - clear the active dependency and wake
> + * the run-job worker
> + * @q: dep queue
> + * @f: the dependency fence being removed
> + *
> + * Stores @f into @q->dep.removed_fence via smp_store_release() so that the
> + * run-job worker can drop the reference to it in drm_dep_queue_is_ready(),
> + * paired with smp_load_acquire(). Clears @q->dep.fence and kicks the
> + * run-job worker.
> + *
> + * The fence reference is not dropped here; it is deferred to the run-job
> + * worker via @q->dep.removed_fence to keep this path suitable for dma_fence
> + * callback removal in drm_dep_queue_kill().
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_remove_dependency(struct drm_dep_queue *q,
> + struct dma_fence *f)
> +{
> + /* removed_fence must be visible to the reader before &q->dep.fence */
> + smp_store_release(&q->dep.removed_fence, f);
> +
> + WRITE_ONCE(q->dep.fence, NULL);
> + drm_dep_queue_run_job_queue(q);
> +}
> +
> +/**
> + * drm_dep_queue_wakeup() - dma_fence callback to wake the run-job worker
> + * @f: the signalled dependency fence
> + * @cb: callback embedded in the dep queue
> + *
> + * Called from dma_fence_signal() when the active dependency fence signals.
> + * Delegates to drm_dep_queue_remove_dependency() to clear @q->dep.fence and
> + * kick the run-job worker. The fence reference is not dropped here; it is
> + * deferred to the run-job worker via @q->dep.removed_fence.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct drm_dep_queue *q =
> + container_of(cb, struct drm_dep_queue, dep.cb);
> +
> + drm_dep_queue_remove_dependency(q, f);
> +}
> +
> +/**
> + * drm_dep_queue_is_ready() - check whether the queue has a dispatchable job
> + * @q: dep queue
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if the SPSC queue is non-empty and no dependency fence is
> + * pending, false otherwise.
> + */
> +static bool drm_dep_queue_is_ready(struct drm_dep_queue *q)
> +{
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (!spsc_queue_count(&q->job.queue))
> + return false;
> +
> + if (READ_ONCE(q->dep.fence))
> + return false;
> +
> + /* Paired with smp_store_release in drm_dep_queue_remove_dependency() */
> + dma_fence_put(smp_load_acquire(&q->dep.removed_fence));
> +
> + q->dep.removed_fence = NULL;
> +
> + return true;
> +}
> +
> +/**
> + * drm_dep_queue_is_killed() - check whether a dep queue has been killed
> + * @q: dep queue to check
> + *
> + * Context: Any context.
> + *
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_KILLED is set on @q, false otherwise.
> + */
> +bool drm_dep_queue_is_killed(struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_KILLED);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_killed);
> +
> +/**
> + * drm_dep_queue_is_initialized() - check whether a dep queue has been initialized
> + * @q: dep queue to check
> + *
> + * A queue is considered initialized once its ops pointer has been set by a
> + * successful call to drm_dep_queue_init(). Drivers that embed a
> + * &drm_dep_queue inside a larger structure may call this before attempting any
> + * other queue operation to confirm that initialization has taken place.
> + * During teardown, if this returns true, drm_dep_queue_put() must be called
> + * to drop the initialization reference taken by drm_dep_queue_init().
> + *
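> + * Conditional teardown sketch for an embedded queue::
> + *
> + *	if (drm_dep_queue_is_initialized(q))
> + *		drm_dep_queue_put(q);
> + *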
> + * Context: Any context.
> + *
> + * Return: true if @q has been initialized, false otherwise.
> + */
> +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q)
> +{
> + return !!q->ops;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_initialized);
> +
> +/**
> + * drm_dep_queue_set_stopped() - pre-mark a queue as stopped before first use
> + * @q: dep queue to mark
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED directly on @q without going through the
> + * normal drm_dep_queue_stop() path. This is only valid during the driver-side
> + * queue initialisation sequence — i.e. after drm_dep_queue_init() returns but
> + * before the queue is made visible to other threads (e.g. before it is added
> + * to any lookup structures). Using this after the queue is live is a driver
> + * bug; use drm_dep_queue_stop() instead.
> + *
> + * Context: Process context, queue not yet visible to other threads.
> + */
> +void drm_dep_queue_set_stopped(struct drm_dep_queue *q)
> +{
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_STOPPED;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_set_stopped);
> +
> +/**
> + * drm_dep_queue_refcount() - read the current reference count of a queue
> + * @q: dep queue to query
> + *
> + * Returns the instantaneous kref value. The count may change immediately
> + * after this call; callers must not make safety decisions based solely on
> + * the returned value. Intended for diagnostic snapshots and debugfs output.
> + *
> + * Context: Any context.
> + * Return: current reference count.
> + */
> +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q)
> +{
> + return kref_read(&q->refcount);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_refcount);
> +
> +/**
> + * drm_dep_queue_timeout() - read the per-job TDR timeout for a queue
> + * @q: dep queue to query
> + *
> + * Returns the per-job timeout in jiffies as set at init time.
> + * %MAX_SCHEDULE_TIMEOUT means no timeout is configured.
> + *
> + * Context: Any context.
> + * Return: timeout in jiffies.
> + */
> +long drm_dep_queue_timeout(const struct drm_dep_queue *q)
> +{
> + return q->job.timeout;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_timeout);
> +
> +/**
> + * drm_dep_queue_is_job_put_irq_safe() - test whether job-put from IRQ is allowed
> + * @q: dep queue
> + *
> + * Context: Any context.
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set,
> + * false otherwise.
> + */
> +static bool drm_dep_queue_is_job_put_irq_safe(const struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE);
> +}
> +
> +/**
> + * drm_dep_queue_job_dependency() - get next unresolved dep fence
> + * @q: dep queue
> + * @job: job whose dependencies to advance
> + *
> + * Returns NULL immediately if the queue has been killed via
> + * drm_dep_queue_kill(), bypassing all dependency waits so that jobs
> + * drain through run_job as quickly as possible.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: next unresolved &dma_fence with a new reference, or NULL
> + * when all dependencies have been consumed (or the queue is killed).
> + */
> +static struct dma_fence *
> +drm_dep_queue_job_dependency(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + struct dma_fence *f;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (drm_dep_queue_is_killed(q))
> + return NULL;
> +
> + f = xa_load(&job->dependencies, job->last_dependency);
> + if (f) {
> + job->last_dependency++;
> + if (WARN_ON(DRM_DEP_JOB_FENCE_PREALLOC == f))
> + return dma_fence_get_stub();
> + return dma_fence_get(f);
> + }
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_dep_queue_add_dep_cb() - install wakeup callback on dep fence
> + * @q: dep queue
> + * @job: job whose dependency fence is stored in @q->dep.fence
> + *
> + * Installs a wakeup callback on @q->dep.fence. Returns true if the
> + * callback was installed (the queue must wait), false if the fence is
> + * already signalled or is a self-fence from the same queue context.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: true if callback installed, false if fence already done.
> + */
> +static bool drm_dep_queue_add_dep_cb(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + struct dma_fence *fence = q->dep.fence;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + if (WARN_ON(fence->context == q->fence.context)) {
> + dma_fence_put(q->dep.fence);
> + q->dep.fence = NULL;
> + return false;
> + }
> +
> + if (!dma_fence_add_callback(q->dep.fence, &q->dep.cb,
> + drm_dep_queue_wakeup))
> + return true;
> +
> + dma_fence_put(q->dep.fence);
> + q->dep.fence = NULL;
> +
> + return false;
> +}
> +
> +/**
> + * drm_dep_queue_pop_job() - pop a dispatchable job from the SPSC queue
> + * @q: dep queue
> + *
> + * Peeks at the head of the SPSC queue and drains all resolved
> + * dependencies. If a dependency is still pending, installs a wakeup
> + * callback and returns NULL. On success pops the job and returns it.
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + * Return: next dispatchable job, or NULL if a dep is still pending.
> + */
> +static struct drm_dep_job *drm_dep_queue_pop_job(struct drm_dep_queue *q)
> +{
> + struct spsc_node *node;
> + struct drm_dep_job *job;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + node = spsc_queue_peek(&q->job.queue);
> + if (!node)
> + return NULL;
> +
> + job = container_of(node, struct drm_dep_job, queue_node);
> +
> + while ((q->dep.fence = drm_dep_queue_job_dependency(q, job))) {
> + if (drm_dep_queue_add_dep_cb(q, job))
> + return NULL;
> + }
> +
> + spsc_queue_pop(&q->job.queue);
> +
> + return job;
> +}
> +
> +/*
> + * drm_dep_queue_get_unless_zero() - try to acquire a queue reference
> + *
> + * Workers use this instead of drm_dep_queue_get() to guard against the zombie
> + * state: the queue's refcount has already reached zero (async teardown is in
> + * flight) but a work item was queued before free_work had a chance to cancel
> + * it. If kref_get_unless_zero() fails the caller must bail immediately.
> + *
> + * Context: Any context.
> + * Returns true if the reference was acquired, false if the queue is zombie.
> + */
> +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q)
> +{
> + return kref_get_unless_zero(&q->refcount);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_get_unless_zero);
> +
> +/**
> + * drm_dep_queue_run_job_work() - run-job worker
> + * @work: work item embedded in the dep queue
> + *
> + * Acquires @q->sched.lock, checks stopped state, queue readiness and
> + * available credits, pops the next job via drm_dep_queue_pop_job(),
> + * dispatches it via drm_dep_queue_run_job(), then re-kicks itself.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_run_job_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.run_job);
> + struct spsc_node *node;
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + mutex_lock(&q->sched.lock);
> +
> + if (drm_dep_queue_is_stopped(q))
> + goto put_queue;
> +
> + if (!drm_dep_queue_is_ready(q))
> + goto put_queue;
> +
> + /* Peek to check credits before committing to pop and dep resolution */
> + node = spsc_queue_peek(&q->job.queue);
> + if (!node)
> + goto put_queue;
> +
> + job = container_of(node, struct drm_dep_job, queue_node);
> + if (!drm_dep_queue_has_credits(q, job))
> + goto put_queue;
> +
> + job = drm_dep_queue_pop_job(q);
> + if (!job)
> + goto put_queue;
> +
> + drm_dep_queue_run_job(q, job);
> + drm_dep_queue_run_job_queue(q);
> +
> +put_queue:
> + mutex_unlock(&q->sched.lock);
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/*
> + * drm_dep_queue_remove_job() - unlink a job from the pending list and reset TDR
> + * @q: dep queue owning @job
> + * @job: job to remove
> + *
> + * Splices @job out of @q->job.pending, cancels any pending TDR delayed work,
> + * and arms the timeout for the new list head (if any).
> + *
> + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> + */
> +static void drm_dep_queue_remove_job(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + lockdep_assert_held(&q->job.lock);
> +
> + list_del_init(&job->pending_link);
> + cancel_delayed_work(&q->sched.tdr);
> + drm_queue_start_timeout(q);
> +}
> +
> +/**
> + * drm_dep_queue_get_finished_job() - dequeue a finished job
> + * @q: dep queue
> + *
> + * Under @q->job.lock checks the head of the pending list for a
> + * finished dep fence. If found, removes the job from the list,
> + * cancels the TDR, and re-arms it for the new head.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + * Return: the finished &drm_dep_job, or NULL if none is ready.
> + */
> +static struct drm_dep_job *
> +drm_dep_queue_get_finished_job(struct drm_dep_queue *q)
> +{
> + struct drm_dep_job *job;
> +
> + guard(spinlock_irq)(&q->job.lock);
> +
> + job = list_first_entry_or_null(&q->job.pending, struct drm_dep_job,
> + pending_link);
> + if (job && drm_dep_fence_is_finished(job->dfence))
> + drm_dep_queue_remove_job(q, job);
> + else
> + job = NULL;
> +
> + return job;
> +}
> +
> +/**
> + * drm_dep_queue_put_job_work() - put-job worker
> + * @work: work item embedded in the dep queue
> + *
> + * Drains all finished jobs by calling drm_dep_job_put() in a loop,
> + * then kicks the run-job worker.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * because workqueue is shared with other items in the fence signaling path.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_put_job_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.put_job);
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + while ((job = drm_dep_queue_get_finished_job(q)))
> + drm_dep_job_put(job);
> +
> + drm_dep_queue_run_job_queue(q);
> +
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_queue_tdr_work() - TDR worker
> + * @work: work item embedded in the delayed TDR work
> + *
> + * Removes the head job from the pending list under @q->job.lock,
> + * asserts @q->ops->timedout_job is non-NULL, calls it outside the lock,
> + * requeues the job if %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB, drops the
> + * queue's job reference on %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED, and always
> + * restarts the TDR timer after handling the job (unless @q is stopping).
> + * Any other return value triggers a WARN.
> + *
> + * The TDR is never armed when @q->ops->timedout_job is NULL, so firing
> + * this worker without a timedout_job callback is a driver bug.
> + *
> + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> + * queue is in zombie state (refcount already zero, async teardown in flight).
> + *
> + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> + * because timedout_job() is expected to signal the guilty job's fence as part
> + * of reset.
> + *
> + * Context: Process context (workqueue). DMA fence signaling path.
> + */
> +static void drm_dep_queue_tdr_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, sched.tdr.work);
> + struct drm_dep_job *job;
> + bool cookie = dma_fence_begin_signalling();
> +
> + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> + if (!drm_dep_queue_get_unless_zero(q)) {
> + dma_fence_end_signalling(cookie);
> + return;
> + }
> +
> + scoped_guard(spinlock_irq, &q->job.lock) {
> + job = list_first_entry_or_null(&q->job.pending,
> + struct drm_dep_job,
> + pending_link);
> + if (job)
> + /*
> + * Remove from pending so it cannot be freed
> + * concurrently by drm_dep_queue_get_finished_job() or
> + * .drm_dep_job_done().
> + */
> + list_del_init(&job->pending_link);
> + }
> +
> + if (job) {
> + enum drm_dep_timedout_stat status;
> +
> + if (WARN_ON(!q->ops->timedout_job)) {
> + drm_dep_job_put(job);
> + goto out;
> + }
> +
> + status = q->ops->timedout_job(job);
> +
> + switch (status) {
> + case DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB:
> + scoped_guard(spinlock_irq, &q->job.lock)
> + list_add(&job->pending_link, &q->job.pending);
> + drm_dep_queue_put_job_queue(q);
> + break;
> + case DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED:
> + drm_dep_job_put(job);
> + break;
> + default:
> + WARN_ON("invalid drm_dep_timedout_stat");
> + break;
> + }
> + }
> +
> +out:
> + drm_queue_start_timeout_unlocked(q);
> + drm_dep_queue_put(q);
> + dma_fence_end_signalling(cookie);
> +}
> +
> +/**
> + * drm_dep_alloc_submit_wq() - allocate an ordered submit workqueue
> + * @name: name for the workqueue
> + * @flags: DRM_DEP_QUEUE_FLAGS_* flags
> + *
> + * Allocates an ordered workqueue for job submission with %WQ_MEM_RECLAIM and
> + * %WQ_MEM_WARN_ON_RECLAIM set, ensuring the workqueue is safe to use from
> + * memory reclaim context and properly annotated for lockdep taint tracking.
> + * Adds %WQ_HIGHPRI if %DRM_DEP_QUEUE_FLAGS_HIGHPRI is set. When
> + * CONFIG_LOCKDEP is enabled, uses a dedicated lockdep map for annotation.
> + *
> + * Context: Process context.
> + * Return: the new &workqueue_struct, or NULL on failure.
> + */
> +static struct workqueue_struct *
> +drm_dep_alloc_submit_wq(const char *name, enum drm_dep_queue_flags flags)
> +{
> + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> +
> + if (flags & DRM_DEP_QUEUE_FLAGS_HIGHPRI)
> + wq_flags |= WQ_HIGHPRI;
> +
> +#if IS_ENABLED(CONFIG_LOCKDEP)
> + static struct lockdep_map map = {
> + .name = "drm_dep_submit_lockdep_map"
> + };
> + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> +#else
> + return alloc_ordered_workqueue(name, wq_flags);
> +#endif
> +}
> +
> +/**
> + * drm_dep_alloc_timeout_wq() - allocate an ordered TDR workqueue
> + * @name: name for the workqueue
> + *
> + * Allocates an ordered workqueue for timeout detection and recovery with
> + * %WQ_MEM_RECLAIM and %WQ_MEM_WARN_ON_RECLAIM set, ensuring consistent taint
> + * annotation with the submit workqueue. When CONFIG_LOCKDEP is enabled, uses
> + * a dedicated lockdep map for annotation.
> + *
> + * Context: Process context.
> + * Return: the new &workqueue_struct, or NULL on failure.
> + */
> +static struct workqueue_struct *drm_dep_alloc_timeout_wq(const char *name)
> +{
> + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> +
> +#if IS_ENABLED(CONFIG_LOCKDEP)
> + static struct lockdep_map map = {
> + .name = "drm_dep_timeout_lockdep_map"
> + };
> + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> +#else
> + return alloc_ordered_workqueue(name, wq_flags);
> +#endif
> +}
> +
> +/**
> + * drm_dep_queue_init() - initialize a dep queue
> + * @q: dep queue to initialize
> + * @args: initialization arguments
> + *
> + * Initializes all fields of @q from @args. If @args->submit_wq is NULL an
> + * ordered workqueue is allocated and owned by the queue
> + * (%DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ). If @args->timeout_wq is NULL an
> + * ordered workqueue is allocated and owned by the queue
> + * (%DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ). On success the queue holds one kref
> + * reference and drm_dep_queue_put() must be called to drop this reference
> + * (i.e., drivers cannot directly free the queue).
> + *
> + * When CONFIG_LOCKDEP is enabled, @q->sched.lock is primed against the
> + * fs_reclaim pseudo-lock so that lockdep can detect any lock ordering
> + * inversion between @sched.lock and memory reclaim.
> + *
> + * Return: 0 on success, %-EINVAL when @args->credit_limit is zero, @args->ops
> + * is NULL, @args->drm is NULL, @args->ops->run_job is NULL, or when
> + * @args->submit_wq or @args->timeout_wq is non-NULL but was not allocated with
> + * %WQ_MEM_WARN_ON_RECLAIM; %-ENOMEM when workqueue allocation fails.
> + *
> + * Context: Process context. May allocate memory and create workqueues.
> + */
> +int drm_dep_queue_init(struct drm_dep_queue *q,
> + const struct drm_dep_queue_init_args *args)
> +{
> + if (!args->credit_limit || !args->drm || !args->ops ||
> + !args->ops->run_job)
> + return -EINVAL;
> +
> + if (args->submit_wq && !workqueue_is_reclaim_annotated(args->submit_wq))
> + return -EINVAL;
> +
> + if (args->timeout_wq &&
> + !workqueue_is_reclaim_annotated(args->timeout_wq))
> + return -EINVAL;
> +
> + memset(q, 0, sizeof(*q));
> +
> + q->name = args->name;
> + q->drm = args->drm;
> + q->credit.limit = args->credit_limit;
> + q->job.timeout = args->timeout ? args->timeout : MAX_SCHEDULE_TIMEOUT;
> +
> + init_rcu_head(&q->rcu);
> + INIT_LIST_HEAD(&q->job.pending);
> + spin_lock_init(&q->job.lock);
> + spsc_queue_init(&q->job.queue);
> +
> + mutex_init(&q->sched.lock);
> + if (IS_ENABLED(CONFIG_LOCKDEP)) {
> + fs_reclaim_acquire(GFP_KERNEL);
> + might_lock(&q->sched.lock);
> + fs_reclaim_release(GFP_KERNEL);
> + }
> +
> + if (args->submit_wq) {
> + q->sched.submit_wq = args->submit_wq;
> + } else {
> + q->sched.submit_wq = drm_dep_alloc_submit_wq(args->name ?: "drm_dep",
> + args->flags);
> + if (!q->sched.submit_wq)
> + return -ENOMEM;
> +
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ;
> + }
> +
> + if (args->timeout_wq) {
> + q->sched.timeout_wq = args->timeout_wq;
> + } else {
> + q->sched.timeout_wq = drm_dep_alloc_timeout_wq(args->name ?: "drm_dep");
> + if (!q->sched.timeout_wq)
> + goto err_submit_wq;
> +
> + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ;
> + }
> +
> + q->sched.flags |= args->flags &
> + ~(DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ |
> + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ);
> +
> + INIT_DELAYED_WORK(&q->sched.tdr, drm_dep_queue_tdr_work);
> + INIT_WORK(&q->sched.run_job, drm_dep_queue_run_job_work);
> + INIT_WORK(&q->sched.put_job, drm_dep_queue_put_job_work);
> +
> + q->fence.context = dma_fence_context_alloc(1);
> +
> + kref_init(&q->refcount);
> + q->ops = args->ops;
> + drm_dev_get(q->drm);
> +
> + return 0;
> +
> +err_submit_wq:
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> + destroy_workqueue(q->sched.submit_wq);
> + mutex_destroy(&q->sched.lock);
> +
> + return -ENOMEM;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_init);
> +
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> +/**
> + * drm_dep_queue_push_job_begin() - mark the start of an arm/push critical section
> + * @q: dep queue the job belongs to
> + *
> + * Called at the start of drm_dep_job_arm() and warns if the push context is
> + * already owned by another task, which would indicate concurrent arm/push on
> + * the same queue.
> + *
> + * No-op when CONFIG_PROVE_LOCKING is disabled.
> + *
> + * Context: Process context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> +{
> + WARN_ON(q->job.push.owner);
> + q->job.push.owner = current;
> +}
> +
> +/**
> + * drm_dep_queue_push_job_end() - mark the end of an arm/push critical section
> + * @q: dep queue the job belongs to
> + *
> + * Called at the end of drm_dep_job_push() and warns if the push context is not
> + * owned by the current task, which would indicate a mismatched begin/end pair
> + * or a push from the wrong thread.
> + *
> + * No-op when CONFIG_PROVE_LOCKING is disabled.
> + *
> + * Context: Process context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> +{
> + WARN_ON(q->job.push.owner != current);
> + q->job.push.owner = NULL;
> +}
> +#endif
> +
> +/**
> + * drm_dep_queue_assert_teardown_invariants() - assert teardown invariants
> + * @q: dep queue being torn down
> + *
> + * Warns if the pending-job list, the SPSC submission queue, or the credit
> + * counter is non-zero when called, or if the queue still has a non-zero
> + * reference count.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_queue_assert_teardown_invariants(struct drm_dep_queue *q)
> +{
> + WARN_ON(!list_empty(&q->job.pending));
> + WARN_ON(spsc_queue_count(&q->job.queue));
> + WARN_ON(atomic_read(&q->credit.count));
> + WARN_ON(drm_dep_queue_refcount(q));
> +}
> +
> +/**
> + * drm_dep_queue_release() - final internal cleanup of a dep queue
> + * @q: dep queue to clean up
> + *
> + * Asserts teardown invariants and destroys internal resources allocated by
> + * drm_dep_queue_init() that cannot be torn down earlier in the teardown
> + * sequence. Currently this destroys @q->sched.lock.
> + *
> + * Drivers that implement &drm_dep_queue_ops.release **must** call this
> + * function after removing @q from any internal bookkeeping (e.g. lookup
> + * tables or lists) but before freeing the memory that contains @q. When
> + * &drm_dep_queue_ops.release is NULL, drm_dep follows the default teardown
> + * path and calls this function automatically.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_release(struct drm_dep_queue *q)
> +{
> + drm_dep_queue_assert_teardown_invariants(q);
> + mutex_destroy(&q->sched.lock);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_release);
> +
> +/**
> + * drm_dep_queue_free() - final cleanup of a dep queue
> + * @q: dep queue to free
> + *
> + * Invokes &drm_dep_queue_ops.release if set, in which case the driver is
> + * responsible for calling drm_dep_queue_release() and freeing @q itself.
> + * If &drm_dep_queue_ops.release is NULL, calls drm_dep_queue_release()
> + * and then frees @q with kfree_rcu().
> + *
> + * In either case, releases the drm_dev_get() reference taken at init time
> + * via drm_dev_put(), allowing the owning &drm_device to be unloaded once
> + * all queues have been freed.
> + *
> + * Context: Process context (workqueue), reclaim safe.
> + */
> +static void drm_dep_queue_free(struct drm_dep_queue *q)
> +{
> + struct drm_device *drm = q->drm;
> +
> + if (q->ops->release) {
> + q->ops->release(q);
> + } else {
> + drm_dep_queue_release(q);
> + kfree_rcu(q, rcu);
> + }
> + drm_dev_put(drm);
> +}
> +
> +/**
> + * drm_dep_queue_free_work() - deferred queue teardown worker
> + * @work: free_work item embedded in the dep queue
> + *
> + * Runs on dep_free_wq. Disables all work items synchronously
> + * (preventing re-queue and waiting for in-flight instances),
> + * destroys any owned workqueues, then calls drm_dep_queue_free().
> + * Running on dep_free_wq ensures destroy_workqueue() is never
> + * called from within one of the queue's own workers (deadlock)
> + * and disable_*_sync() cannot deadlock either.
> + *
> + * Context: Process context (workqueue), reclaim safe.
> + */
> +static void drm_dep_queue_free_work(struct work_struct *work)
> +{
> + struct drm_dep_queue *q =
> + container_of(work, struct drm_dep_queue, free_work);
> +
> + drm_dep_queue_assert_teardown_invariants(q);
> +
> + disable_delayed_work_sync(&q->sched.tdr);
> + disable_work_sync(&q->sched.run_job);
> + disable_work_sync(&q->sched.put_job);
> +
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ)
> + destroy_workqueue(q->sched.timeout_wq);
> +
> + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> + destroy_workqueue(q->sched.submit_wq);
> +
> + drm_dep_queue_free(q);
> +}
> +
> +/**
> + * drm_dep_queue_fini() - tear down a dep queue
> + * @q: dep queue to tear down
> + *
> + * Asserts teardown invariants and nitiates teardown of @q by queuing the
> + * deferred free work onto tht module-private dep_free_wq workqueue. The work
> + * item disables any pending TDR and run/put-job work synchronously, destroys
> + * any workqueues that were allocated by drm_dep_queue_init(), and then releases
> + * the queue memory.
> + *
> + * Running teardown from dep_free_wq ensures that destroy_workqueue() is never
> + * called from within one of the queue's own workers (e.g. via
> + * drm_dep_queue_put()), which would deadlock.
> + *
> + * Drivers can wait for all outstanding deferred work to complete by waiting
> + * for the last drm_dev_put() reference on their &drm_device, which is
> + * released as the final step of each queue's teardown.
> + *
> + * Drivers that implement &drm_dep_queue_ops.fini **must** call this
> + * function after removing @q from any device bookkeeping but before freeing the
> + * memory that contains @q. When &drm_dep_queue_ops.fini is NULL, drm_dep
> + * follows the default teardown path and calls this function automatically.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_fini(struct drm_dep_queue *q)
> +{
> + drm_dep_queue_assert_teardown_invariants(q);
> +
> + INIT_WORK(&q->free_work, drm_dep_queue_free_work);
> + queue_work(dep_free_wq, &q->free_work);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_fini);
> +
> +/**
> + * drm_dep_queue_get() - acquire a reference to a dep queue
> + * @q: dep queue to acquire a reference on, or NULL
> + *
> + * Return: @q with an additional reference held, or NULL if @q is NULL.
> + *
> + * Context: Any context.
> + */
> +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q)
> +{
> + if (q)
> + kref_get(&q->refcount);
> + return q;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_get);
> +
> +/**
> + * __drm_dep_queue_release() - kref release callback for a dep queue
> + * @kref: kref embedded in the dep queue
> + *
> + * Calls &drm_dep_queue_ops.fini if set, otherwise calls
> + * drm_dep_queue_fini() to initiate deferred teardown.
> + *
> + * Context: Any context.
> + */
> +static void __drm_dep_queue_release(struct kref *kref)
> +{
> + struct drm_dep_queue *q =
> + container_of(kref, struct drm_dep_queue, refcount);
> +
> + if (q->ops->fini)
> + q->ops->fini(q);
> + else
> + drm_dep_queue_fini(q);
> +}
> +
> +/**
> + * drm_dep_queue_put() - release a reference to a dep queue
> + * @q: dep queue to release a reference on, or NULL
> + *
> + * When the last reference is dropped, calls &drm_dep_queue_ops.fini if set,
> + * otherwise calls drm_dep_queue_fini(). Final memory release is handled by
> + * &drm_dep_queue_ops.release (which must call drm_dep_queue_release()) if set,
> + * or drm_dep_queue_release() followed by kfree_rcu() otherwise.
> + * Does nothing if @q is NULL.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_put(struct drm_dep_queue *q)
> +{
> + if (q)
> + kref_put(&q->refcount, __drm_dep_queue_release);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_put);
> +
> +/**
> + * drm_dep_queue_stop() - stop a dep queue from processing new jobs
> + * @q: dep queue to stop
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> + * and @q->job.lock (spinlock_irq), making the flag safe to test from finished
> + * fenced signaling context. Then cancels any in-flight run_job and put_job work
> + * items. Once stopped, the bypass path and the submit workqueue will not
> + * dispatch further jobs nor will any jobs be removed from the pending list.
> + * Call drm_dep_queue_start() to resume processing.
> + *
> + * Context: Process context. Waits for in-flight workers to complete.
> + */
> +void drm_dep_queue_stop(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + scoped_guard(spinlock_irq, &q->job.lock)
> + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> + }
> + cancel_work_sync(&q->sched.run_job);
> + cancel_work_sync(&q->sched.put_job);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_stop);
> +
> +/**
> + * drm_dep_queue_start() - resume a stopped dep queue
> + * @q: dep queue to start
> + *
> + * Clears %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> + * and @q->job.lock (spinlock_irq), making the flag safe to test from IRQ
> + * context. Then re-queues the run_job and put_job work items so that any jobs
> + * pending since the queue was stopped are processed. Must only be called after
> + * drm_dep_queue_stop().
> + *
> + * Context: Process context.
> + */
> +void drm_dep_queue_start(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + scoped_guard(spinlock_irq, &q->job.lock)
> + drm_dep_queue_flags_clear(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> + }
> + drm_dep_queue_run_job_queue(q);
> + drm_dep_queue_put_job_queue(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_start);
> +
> +/**
> + * drm_dep_queue_trigger_timeout() - trigger the TDR immediately for
> + * all pending jobs
> + * @q: dep queue to trigger timeout on
> + *
> + * Sets @q->job.timeout to 1 and arms the TDR delayed work with a one-jiffy
> + * delay, causing it to fire almost immediately without hot-spinning at zero
> + * delay. This is used to force-expire any pendind jobs on the queue, for
> + * example when the device is being torn down or has encountered an
> + * unrecoverable error.
> + *
> + * It is suggested that when this function is used, the first timedout_job call
> + * causes the driver to kick the queue off the hardware and signal all pending
> + * job fences. Subsequent calls continue to signal all pending job fences.
> + *
> + * Has no effect if the pending list is empty.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q)
> +{
> + guard(spinlock_irqsave)(&q->job.lock);
> + q->job.timeout = 1;
> + drm_queue_start_timeout(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_trigger_timeout);
> +
> +/**
> + * drm_dep_queue_cancel_tdr_sync() - cancel any pending TDR and wait
> + * for it to finish
> + * @q: dep queue whose TDR to cancel
> + *
> + * Cancels the TDR delayed work item if it has not yet started, and waits for
> + * it to complete if it is already running. After this call returns, the TDR
> + * worker is guaranteed not to be executing and will not fire again until
> + * explicitly rearmed (e.g. via drm_dep_queue_resume_timeout() or by a new
> + * job being submitted).
> + *
> + * Useful during error recovery or queue teardown when the caller needs to
> + * know that no timeout handling races with its own reset logic.
> + *
> + * Context: Process context. May sleep waiting for the TDR worker to finish.
> + */
> +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q)
> +{
> + cancel_delayed_work_sync(&q->sched.tdr);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_cancel_tdr_sync);
> +
> +/**
> + * drm_dep_queue_resume_timeout() - restart the TDR timer with the
> + * configured timeout
> + * @q: dep queue to resume the timeout for
> + *
> + * Restarts the TDR delayed work using @q->job.timeout. Called after device
> + * recovery to give pending jobs a fresh full timeout window. Has no effect
> + * if the pending list is empty.
> + *
> + * Context: Any context.
> + */
> +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q)
> +{
> + drm_queue_start_timeout_unlocked(q);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_resume_timeout);
> +
> +/**
> + * drm_dep_queue_is_stopped() - check whether a dep queue is stopped
> + * @q: dep queue to check
> + *
> + * Return: true if %DRM_DEP_QUEUE_FLAGS_STOPPED is set on @q, false otherwise.
> + *
> + * Context: Any context.
> + */
> +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q)
> +{
> + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_STOPPED);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_is_stopped);
> +
> +/**
> + * drm_dep_queue_kill() - kill a dep queue and flush all pending jobs
> + * @q: dep queue to kill
> + *
> + * Sets %DRM_DEP_QUEUE_FLAGS_KILLED on @q under @q->sched.lock. If a
> + * dependency fence is currently being waited on, its callback is removed and
> + * the run-job worker is kicked immediately so that the blocked job drains
> + * without waiting.
> + *
> + * Once killed, drm_dep_queue_job_dependency() returns NULL for all jobs,
> + * bypassing dependency waits so that every queued job drains through
> + * &drm_dep_queue_ops.run_job without blocking.
> + *
> + * The &drm_dep_queue_ops.run_job callback is guaranteed to be called for every
> + * job that was pushed before or after drm_dep_queue_kill(), even during queue
> + * teardown. Drivers should use this guarantee to perform any necessary
> + * bookkeeping cleanup without executing the actual backend operation when the
> + * queue is killed.
> + *
> + * Unlike drm_dep_queue_stop(), killing is one-way: there is no corresponding
> + * start function.
> + *
> + * **Driver safety requirement**
> + *
> + * drm_dep_queue_kill() must only be called once the driver can guarantee that
> + * no job in the queue will touch memory associated with any of its fences
> + * (i.e., the queue has been removed from the device and will never be put back
> + * on).
> + *
> + * Context: Process context.
> + */
> +void drm_dep_queue_kill(struct drm_dep_queue *q)
> +{
> + scoped_guard(mutex, &q->sched.lock) {
> + struct dma_fence *fence;
> +
> + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_KILLED);
> +
> + /*
> + * Holding &q->sched.lock guarantees that the run-job work item
> + * cannot drop its reference to q->dep.fence concurrently, so
> + * reading q->dep.fence here is safe.
> + */
> + fence = READ_ONCE(q->dep.fence);
> + if (fence && dma_fence_remove_callback(fence, &q->dep.cb))
> + drm_dep_queue_remove_dependency(q, fence);
> + }
> +}
> +EXPORT_SYMBOL(drm_dep_queue_kill);
> +
> +/**
> + * drm_dep_queue_submit_wq() - retrieve the submit workqueue of a dep queue
> + * @q: dep queue whose workqueue to retrieve
> + *
> + * Drivers may use this to queue their own work items alongside the queue's
> + * internal run-job and put-job workers — for example to process incoming
> + * messages in the same serialisation domain.
> + *
> + * Prefer drm_dep_queue_work_enqueue() when the only need is to enqueue a
> + * work item, as it additionally checks the stopped state. Use this accessor
> + * when the workqueue itself is required (e.g. to reuse it in place of a
> + * separate alloc_ordered_workqueue() allocation, or for drain_workqueue()
> + * calls).
> + *
> + * Context: Any context.
> + * Return: the &workqueue_struct used by @q for job submission.
> + */
> +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q)
> +{
> + return q->sched.submit_wq;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_submit_wq);
> +
> +/**
> + * drm_dep_queue_timeout_wq() - retrieve the timeout workqueue of a dep queue
> + * @q: dep queue whose workqueue to retrieve
> + *
> + * Returns the workqueue used by @q to run TDR (timeout detection and recovery)
> + * work. Drivers may use this to queue their own timeout-domain work items, or
> + * to call drain_workqueue() during teardown to ensure that all pending
> + * timeout callbacks have completed before proceeding.
> + *
> + * Context: Any context.
> + * Return: the &workqueue_struct used by @q for TDR work.
> + */
> +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q)
> +{
> + return q->sched.timeout_wq;
> +}
> +EXPORT_SYMBOL(drm_dep_queue_timeout_wq);
> +
> +/**
> + * drm_dep_queue_work_enqueue() - queue work on the dep queue's submit workqueue
> + * @q: dep queue to enqueue work on
> + * @work: work item to enqueue
> + *
> + * Queues @work on @q->sched.submit_wq if the queue is not stopped. This
> + * allows drivers to schedule custom work items that run serialised with the
> + * queue's own run-job and put-job workers.
> + *
> + * Context: Any context.
> + *
> + * Return: true if the work was queued, false if the queue is stopped or the
> + * work item was already pending.
> + */
> +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> + struct work_struct *work)
> +{
> + if (drm_dep_queue_is_stopped(q))
> + return false;
> +
> + return queue_work(q->sched.submit_wq, work);
> +}
> +EXPORT_SYMBOL(drm_dep_queue_work_enqueue);
> +
> +/**
> + * drm_dep_queue_can_job_bypass() - test whether a job can skip the SPSC queue
> + * @q: dep queue
> + * @job: job to test
> + *
> + * A job may bypass the submit workqueue and run inline on the calling thread
> + * if all of the following hold:
> + *
> + * - %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set on the queue
> + * - the queue is not stopped
> + * - the SPSC submission queue is empty (no other jobs waiting)
> + * - the queue has enough credits for @job
> + * - @job has no unresolved dependency fences
> + *
> + * Must be called under @q->sched.lock.
> + *
> + * Context: Process context. Must hold @q->sched.lock (a mutex).
> + * Return: true if the job may be run inline, false otherwise.
> + */
> +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> + struct drm_dep_job *job)
> +{
> + lockdep_assert_held(&q->sched.lock);
> +
> + return q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED &&
> + !drm_dep_queue_is_stopped(q) &&
> + !spsc_queue_count(&q->job.queue) &&
> + drm_dep_queue_has_credits(q, job) &&
> + xa_empty(&job->dependencies);
> +}
> +
> +/**
> + * drm_dep_job_done() - mark a job as complete
> + * @job: the job that finished
> + * @result: error code to propagate, or 0 for success
> + *
> + * Subtracts @job->credits from the queue credit counter, then signals the
> + * job's dep fence with @result.
> + *
> + * When %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set (IRQ-safe path), a
> + * temporary extra reference is taken on @job before signalling the fence.
> + * This prevents a concurrent put-job worker — which may be woken by timeouts or
> + * queue starting — from freeing the job while this function still holds a
> + * pointer to it. The extra reference is released at the end of the function.
> + *
> + * After signalling, the IRQ-safe path removes the job from the pending list
> + * under @q->job.lock, provided the queue is not stopped. Removal is skipped
> + * when the queue is stopped so that drm_dep_queue_for_each_pending_job() can
> + * iterate the list without racing with the completion path. On successful
> + * removal, kicks the run-job worker so the next queued job can be dispatched
> + * immediately, then drops the job reference. If the job was already removed
> + * by TDR, or removal was skipped because the queue is stopped, kicks the
> + * put-job worker instead to allow the deferred put to complete.
> + *
> + * Context: Any context.
> + */
> +static void drm_dep_job_done(struct drm_dep_job *job, int result)
> +{
> + struct drm_dep_queue *q = job->q;
> + bool irq_safe = drm_dep_queue_is_job_put_irq_safe(q), removed = false;
> +
> + /*
> + * Local ref to ensure the put worker—which may be woken by external
> + * forces (TDR, driver-side queue starting)—doesn't free the job behind
> + * this function's back after drm_dep_fence_done() while it is still on
> + * the pending list.
> + */
> + if (irq_safe)
> + drm_dep_job_get(job);
> +
> + atomic_sub(job->credits, &q->credit.count);
> + drm_dep_fence_done(job->dfence, result);
> +
> + /* Only safe to touch job after fence signal if we have a local ref. */
> +
> + if (irq_safe) {
> + scoped_guard(spinlock_irqsave, &q->job.lock) {
> + removed = !list_empty(&job->pending_link) &&
> + !drm_dep_queue_is_stopped(q);
> +
> + /* Guard against TDR operating on job */
> + if (removed)
> + drm_dep_queue_remove_job(q, job);
> + }
> + }
> +
> + if (removed) {
> + drm_dep_queue_run_job_queue(q);
> + drm_dep_job_put(job);
> + } else {
> + drm_dep_queue_put_job_queue(q);
> + }
> +
> + if (irq_safe)
> + drm_dep_job_put(job);
> +}
> +
> +/**
> + * drm_dep_job_done_cb() - dma_fence callback to complete a job
> + * @f: the hardware fence that signalled
> + * @cb: fence callback embedded in the dep job
> + *
> + * Extracts the job from @cb and calls drm_dep_job_done() with
> + * @f->error as the result.
> + *
> + * Context: Any context, but with IRQs disabled. May not sleep.
> + */
> +static void drm_dep_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct drm_dep_job *job = container_of(cb, struct drm_dep_job, cb);
> +
> + drm_dep_job_done(job, f->error);
> +}
> +
> +/**
> + * drm_dep_queue_run_job() - submit a job to hardware and set up
> + * completion tracking
> + * @q: dep queue
> + * @job: job to run
> + *
> + * Accounts @job->credits against the queue, appends the job to the pending
> + * list, then calls @q->ops->run_job(). The TDR timer is started only when
> + * @job is the first entry on the pending list; jobs added while the timer
> + * is already running do not reset it (which would otherwise extend the
> + * deadline for the already-running head job). Stores the returned
> + * hardware fence as the parent of the job's dep fence, then installs
> + * drm_dep_job_done_cb() on it. If the hardware fence is already signalled
> + * (%-ENOENT from dma_fence_add_callback()) or run_job() returns NULL/error,
> + * the job is completed immediately. Must be called under @q->sched.lock.
> + *
> + * Context: Process context. Must hold @q->sched.lock (a mutex). DMA fence
> + * signaling path.
> + */
> +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> +{
> + struct dma_fence *fence;
> + int r;
> +
> + lockdep_assert_held(&q->sched.lock);
> +
> + drm_dep_job_get(job);
> + atomic_add(job->credits, &q->credit.count);
> +
> + scoped_guard(spinlock_irq, &q->job.lock) {
> + bool first = list_empty(&q->job.pending);
> +
> + list_add_tail(&job->pending_link, &q->job.pending);
> + if (first)
> + drm_queue_start_timeout(q);
> + }
> +
> + fence = q->ops->run_job(job);
> + drm_dep_fence_set_parent(job->dfence, fence);
> +
> + if (!IS_ERR_OR_NULL(fence)) {
> + r = dma_fence_add_callback(fence, &job->cb,
> + drm_dep_job_done_cb);
> + if (r == -ENOENT)
> + drm_dep_job_done(job, fence->error);
> + else if (r)
> + drm_err(q->drm, "fence add callback failed (%d)\n", r);
> + dma_fence_put(fence);
> + } else {
> + drm_dep_job_done(job, IS_ERR(fence) ? PTR_ERR(fence) : 0);
> + }
> +
> + /*
> + * Drop all input dependency fences now, in process context, before the
> + * final job put. Once the job is on the pending list its last reference
> + * may be dropped from a dma_fence callback (IRQ context), where calling
> + * xa_destroy() would be unsafe.
> + */
> + drm_dep_job_drop_dependencies(job);
> + drm_dep_job_put(job);
> +}
> +
> +/**
> + * drm_dep_queue_push_job() - enqueue a job on the SPSC submission queue
> + * @q: dep queue
> + * @job: job to push
> + *
> + * Pushes @job onto the SPSC queue. If the queue was previously empty
> + * (i.e. this is the first pending job), kicks the run_job worker so it
> + * processes the job promptly without waiting for the next wakeup.
> + * May be called with or without @q->sched.lock held.
> + *
> + * Context: Any context. DMA fence signaling path.
> + */
> +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> +{
> + /*
> + * spsc_queue_push() returns true if the queue was previously empty,
> + * i.e. this is the first pending job. Kick the run_job worker so it
> + * picks it up without waiting for the next wakeup.
> + */
> + if (spsc_queue_push(&q->job.queue, &job->queue_node))
> + drm_dep_queue_run_job_queue(q);
> +}
> +
> +/**
> + * drm_dep_init() - module initialiser
> + *
> + * Allocates the module-private dep_free_wq unbound workqueue used for
> + * deferred queue teardown.
> + *
> + * Return: 0 on success, %-ENOMEM if workqueue allocation fails.
> + */
> +static int __init drm_dep_init(void)
> +{
> + dep_free_wq = alloc_workqueue("drm_dep_free", WQ_UNBOUND, 0);
> + if (!dep_free_wq)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +/**
> + * drm_dep_exit() - module exit
> + *
> + * Destroys the module-private dep_free_wq workqueue.
> + */
> +static void __exit drm_dep_exit(void)
> +{
> + destroy_workqueue(dep_free_wq);
> + dep_free_wq = NULL;
> +}
> +
> +module_init(drm_dep_init);
> +module_exit(drm_dep_exit);
> +
> +MODULE_DESCRIPTION("DRM dependency queue");
> +MODULE_LICENSE("Dual MIT/GPL");
> diff --git a/drivers/gpu/drm/dep/drm_dep_queue.h b/drivers/gpu/drm/dep/drm_dep_queue.h
> new file mode 100644
> index 000000000000..e5c217a3fab5
> --- /dev/null
> +++ b/drivers/gpu/drm/dep/drm_dep_queue.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_QUEUE_H_
> +#define _DRM_DEP_QUEUE_H_
> +
> +#include <linux/types.h>
> +
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +
> +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> + struct drm_dep_job *job);
> +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> +
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q);
> +void drm_dep_queue_push_job_end(struct drm_dep_queue *q);
> +#else
> +static inline void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> +{
> +}
> +static inline void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> +{
> +}
> +#endif
> +
> +#endif /* _DRM_DEP_QUEUE_H_ */
> diff --git a/include/drm/drm_dep.h b/include/drm/drm_dep.h
> new file mode 100644
> index 000000000000..615926584506
> --- /dev/null
> +++ b/include/drm/drm_dep.h
> @@ -0,0 +1,597 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright 2015 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + * Copyright © 2026 Intel Corporation
> + */
> +
> +#ifndef _DRM_DEP_H_
> +#define _DRM_DEP_H_
> +
> +#include <drm/spsc_queue.h>
> +#include <linux/dma-fence.h>
> +#include <linux/xarray.h>
> +#include <linux/workqueue.h>
> +
> +enum dma_resv_usage;
> +struct dma_resv;
> +struct drm_dep_fence;
> +struct drm_dep_job;
> +struct drm_dep_queue;
> +struct drm_file;
> +struct drm_gem_object;
> +
> +/**
> + * enum drm_dep_timedout_stat - return value of &drm_dep_queue_ops.timedout_job
> + * @DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED: driver signaled the job's finished
> + * fence during reset; drm_dep may safely drop its reference to the job.
> + * @DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB: timeout was a false alarm; reinsert the
> + * job at the head of the pending list so it can complete normally.
> + */
> +enum drm_dep_timedout_stat {
> + DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED,
> + DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB,
> +};
> +
> +/**
> + * struct drm_dep_queue_ops - driver callbacks for a dep queue
> + */
> +struct drm_dep_queue_ops {
> + /**
> + * @run_job: submit the job to hardware. Returns the hardware completion
> + * fence (with a reference held for the scheduler), or NULL/ERR_PTR on
> + * synchronous completion or error.
> + */
> + struct dma_fence *(*run_job)(struct drm_dep_job *job);
> +
> + /**
> + * @timedout_job: called when the TDR fires for the head job. Must stop
> + * the hardware, then return %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
> + * job's fence was signalled during reset, or
> + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB if the timeout was spurious or
> + * signalling was otherwise delayed, and the job should be re-inserted
> + * at the head of the pending list. Any other value triggers a WARN.
> + */
> + enum drm_dep_timedout_stat (*timedout_job)(struct drm_dep_job *job);
> +
> + /**
> + * @release: called when the last kref on the queue is dropped and
> + * drm_dep_queue_fini() has completed. The driver is responsible for
> + * removing @q from any internal bookkeeping, calling
> + * drm_dep_queue_release(), and then freeing the memory containing @q
> + * (e.g. via kfree_rcu() using @q->rcu). If NULL, drm_dep calls
> + * drm_dep_queue_release() and frees @q automatically via kfree_rcu().
> + * Use this when the queue is embedded in a larger structure.
> + */
> + void (*release)(struct drm_dep_queue *q);
> +
> + /**
> + * @fini: if set, called instead of drm_dep_queue_fini() when the last
> + * kref is dropped. The driver is responsible for calling
> + * drm_dep_queue_fini() itself after it is done with the queue. Use this
> + * when additional teardown logic must run before fini (e.g., cleanup
> + * firmware resources associated with the queue).
> + */
> + void (*fini)(struct drm_dep_queue *q);
> +};
> +
> +/**
> + * enum drm_dep_queue_flags - flags for &drm_dep_queue and
> + * &drm_dep_queue_init_args
> + *
> + * Flags are divided into three categories:
> + *
> + * - **Private static**: set internally at init time and never changed.
> + * Drivers must not read or write these.
> + * %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ,
> + * %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ.
> + *
> + * - **Public dynamic**: toggled at runtime by drivers via accessors.
> + * Any modification must be performed under &drm_dep_queue.sched.lock.
> + * Accessor functions provide unstable reads.
> + * %DRM_DEP_QUEUE_FLAGS_STOPPED,
> + * %DRM_DEP_QUEUE_FLAGS_KILLED.
> + *
> + * - **Public static**: supplied by the driver in
> + * &drm_dep_queue_init_args.flags at queue creation time and not modified
> + * thereafter.
> + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
> + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI,
> + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE.
> + *
> + * @DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ: (private, static) submit workqueue was
> + * allocated by drm_dep_queue_init() and will be destroyed by
> + * drm_dep_queue_fini().
> + * @DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ: (private, static) timeout workqueue
> + * was allocated by drm_dep_queue_init() and will be destroyed by
> + * drm_dep_queue_fini().
> + * @DRM_DEP_QUEUE_FLAGS_STOPPED: (public, dynamic) the queue is stopped and
> + * will not dispatch new jobs, nor remove jobs from the pending list and
> + * drop the drm_dep-owned reference. Set by drm_dep_queue_stop(), cleared by
> + * drm_dep_queue_start().
> + * @DRM_DEP_QUEUE_FLAGS_KILLED: (public, dynamic) the queue has been killed
> + * via drm_dep_queue_kill(). Any active dependency wait is cancelled
> + * immediately. Jobs continue to flow through run_job for bookkeeping
> + * cleanup, but dependency waiting is skipped so that queued work drains
> + * as quickly as possible.
> + * @DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED: (public, static) the queue supports
> + * the bypass path where eligible jobs skip the SPSC queue and run inline.
> + * @DRM_DEP_QUEUE_FLAGS_HIGHPRI: (public, static) the submit workqueue owned
> + * by the queue is created with %WQ_HIGHPRI, causing run-job and put-job
> + * workers to execute at elevated priority. Only privileged clients (e.g.
> + * drivers managing time-critical or real-time GPU contexts) should request
> + * this flag; granting it to unprivileged userspace would allow priority
> + * inversion attacks.
> + * This flag is ignored when &drm_dep_queue_init_args.submit_wq is provided.
> + * @DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE: (public, static) when set,
> + * drm_dep_job_done() may be called from hardirq context (e.g. from a
> + * hardware-signalled dma_fence callback). drm_dep_job_done() will directly
> + * dequeue the job and call drm_dep_job_put() without deferring to a
> + * workqueue. The driver's &drm_dep_job_ops.release callback must therefore
> + * be safe to invoke from IRQ context.
> + */
> +enum drm_dep_queue_flags {
> + DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ = BIT(0),
> + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ = BIT(1),
> + DRM_DEP_QUEUE_FLAGS_STOPPED = BIT(2),
> + DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED = BIT(3),
> + DRM_DEP_QUEUE_FLAGS_HIGHPRI = BIT(4),
> + DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE = BIT(5),
> + DRM_DEP_QUEUE_FLAGS_KILLED = BIT(6),
> +};
> +
> +/**
> + * struct drm_dep_queue - a dependency-tracked GPU submission queue
> + *
> + * Combines the role of &drm_gpu_scheduler and &drm_sched_entity into a single
> + * object. Each queue owns a submit workqueue (or borrows one), a timeout
> + * workqueue, an SPSC submission queue, and a pending-job list used for TDR.
> + *
> + * Initialise with drm_dep_queue_init(), tear down with drm_dep_queue_fini().
> + * Reference counted via drm_dep_queue_get() / drm_dep_queue_put().
> + *
> + * All fields are **opaque to drivers**. Do not read or write any field
> + * directly; use the provided helper functions instead. The sole exception
> + * is @rcu, which drivers may pass to kfree_rcu() when the queue is embedded
> + * inside a larger driver-managed structure and the &drm_dep_queue_ops.release
> + * vfunc performs an RCU-deferred free.
> + */
> +struct drm_dep_queue {
> + /** @ops: driver callbacks, set at init time. */
> + const struct drm_dep_queue_ops *ops;
> + /** @name: human-readable name used for workqueue and fence naming. */
> + const char *name;
> + /**
> + * @drm: owning DRM device; a drm_dev_get() reference is held for the
> + * lifetime of the queue to prevent module unload while queues are live.
> + */
> + struct drm_device *drm;
> + /** @refcount: reference count; use drm_dep_queue_get/put(). */
> + struct kref refcount;
> + /**
> + * @free_work: deferred teardown work queued unconditionally by
> + * drm_dep_queue_fini() onto the module-private dep_free_wq. The work
> + * item disables pending workers synchronously and destroys any owned
> + * workqueues before releasing the queue memory and dropping the
> + * drm_dev_get() reference. Running on dep_free_wq ensures
> + * destroy_workqueue() is never called from within one of the queue's
> + * own workers.
> + */
> + struct work_struct free_work;
> + /**
> + * @rcu: RCU head for deferred freeing.
> + *
> + * This is the **only** field drivers may access directly. When the
> + * queue is embedded in a larger structure, implement
> + * &drm_dep_queue_ops.release, call drm_dep_queue_release() to destroy
> + * internal resources, then pass this field to kfree_rcu() so that any
> + * in-flight RCU readers referencing the queue's dma_fence timeline name
> + * complete before the memory is returned. All other fields must be
> + * accessed through the provided helpers.
> + */
> + struct rcu_head rcu;
> +
> + /** @sched: scheduling and workqueue state. */
> + struct {
> + /** @sched.submit_wq: ordered workqueue for run/put-job work. */
> + struct workqueue_struct *submit_wq;
> + /** @sched.timeout_wq: workqueue for the TDR delayed work. */
> + struct workqueue_struct *timeout_wq;
> + /**
> + * @sched.run_job: work item that dispatches the next queued
> + * job.
> + */
> + struct work_struct run_job;
> + /** @sched.put_job: work item that frees finished jobs. */
> + struct work_struct put_job;
> + /** @sched.tdr: delayed work item for timeout/reset (TDR). */
> + struct delayed_work tdr;
> + /**
> + * @sched.lock: mutex serialising job dispatch, bypass
> + * decisions, stop/start, and flag updates.
> + */
> + struct mutex lock;
> + /**
> + * @sched.flags: bitmask of &enum drm_dep_queue_flags.
> + * Any modification after drm_dep_queue_init() must be
> + * performed under @sched.lock.
> + */
> + enum drm_dep_queue_flags flags;
> + } sched;
> +
> + /** @job: pending-job tracking state. */
> + struct {
> + /**
> + * @job.pending: list of jobs that have been dispatched to
> + * hardware and not yet freed. Protected by @job.lock.
> + */
> + struct list_head pending;
> + /**
> + * @job.queue: SPSC queue of jobs waiting to be dispatched.
> + * Producers push via drm_dep_queue_push_job(); the run_job
> + * work item pops from the consumer side.
> + */
> + struct spsc_queue queue;
> + /**
> + * @job.lock: spinlock protecting @job.pending, TDR start, and
> + * the %DRM_DEP_QUEUE_FLAGS_STOPPED flag. Always acquired with
> + * irqsave (spin_lock_irqsave / spin_unlock_irqrestore) to
> + * support %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE queues where
> + * drm_dep_job_done() may run from hardirq context.
> + */
> + spinlock_t lock;
> + /**
> + * @job.timeout: per-job TDR timeout in jiffies.
> + * %MAX_SCHEDULE_TIMEOUT means no timeout.
> + */
> + long timeout;
> +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> + /**
> + * @job.push: lockdep annotation tracking the arm-to-push
> + * critical section.
> + */
> + struct {
> + /**
> + * @job.push.owner: task that currently holds the push
> + * context, used to assert single-owner invariants.
> + * NULL when idle.
> + */
> + struct task_struct *owner;
> + } push;
> +#endif
> + } job;
> +
> + /** @credit: hardware credit accounting. */
> + struct {
> + /** @credit.limit: maximum credits the queue can hold. */
> + u32 limit;
> + /** @credit.count: credits currently in flight (atomic). */
> + atomic_t count;
> + } credit;
> +
> + /** @dep: current blocking dependency for the head SPSC job. */
> + struct {
> + /**
> + * @dep.fence: fence being waited on before the head job can
> + * run. NULL when no dependency is pending.
> + */
> + struct dma_fence *fence;
> + /**
> + * @dep.removed_fence: dependency fence whose callback has been
> + * removed. The run-job worker must drop its reference to this
> + * fence before proceeding to call run_job.
> + */
> + struct dma_fence *removed_fence;
> + /** @dep.cb: callback installed on @dep.fence. */
> + struct dma_fence_cb cb;
> + } dep;
> +
> + /** @fence: fence context and sequence number state. */
> + struct {
> + /**
> + * @fence.seqno: next sequence number to assign, incremented
> + * each time a job is armed.
> + */
> + u32 seqno;
> + /**
> + * @fence.context: base DMA fence context allocated at init
> + * time. Finished fences use this context.
> + */
> + u64 context;
> + } fence;
> +};
> +
> +/**
> + * struct drm_dep_queue_init_args - arguments for drm_dep_queue_init()
> + */
> +struct drm_dep_queue_init_args {
> + /** @ops: driver callbacks; must not be NULL. */
> + const struct drm_dep_queue_ops *ops;
> + /** @name: human-readable name for workqueues and fence timelines. */
> + const char *name;
> + /**
> + * @drm: owning DRM device. A drm_dev_get() reference is taken at
> + * queue init and released when the queue is freed, preventing module
> + * unload while any queue is still alive.
> + */
> + struct drm_device *drm;
> + /**
> + * @submit_wq: workqueue for job dispatch. If NULL, an ordered
> + * workqueue is allocated and owned by the queue. If non-NULL, the
> + * workqueue must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> + * drm_dep_queue_init() returns %-EINVAL otherwise.
> + */
> + struct workqueue_struct *submit_wq;
> + /**
> + * @timeout_wq: workqueue for TDR. If NULL, an ordered workqueue
> + * is allocated and owned by the queue. If non-NULL, the workqueue
> + * must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> + * drm_dep_queue_init() returns %-EINVAL otherwise.
> + */
> + struct workqueue_struct *timeout_wq;
> + /** @credit_limit: maximum hardware credits; must be non-zero. */
> + u32 credit_limit;
> + /**
> + * @timeout: per-job TDR timeout in jiffies. Zero means no timeout
> + * (%MAX_SCHEDULE_TIMEOUT is used internally).
> + */
> + long timeout;
> + /**
> + * @flags: initial queue flags. %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ
> + * and %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ are managed internally
> + * and will be ignored if set here. Setting
> + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI requests a high-priority submit
> + * workqueue; drivers must only set this for privileged clients.
> + */
> + enum drm_dep_queue_flags flags;
> +};
> +
> +/**
> + * struct drm_dep_job_ops - driver callbacks for a dep job
> + */
> +struct drm_dep_job_ops {
> + /**
> + * @release: called when the last reference to the job is dropped.
> + *
> + * If set, the driver is responsible for freeing the job. If NULL,
> + * drm_dep_job_put() will call kfree() on the job directly.
> + */
> + void (*release)(struct drm_dep_job *job);
> +};
> +
> +/**
> + * struct drm_dep_job - a unit of work submitted to a dep queue
> + *
> + * All fields are **opaque to drivers**. Do not read or write any field
> + * directly; use the provided helper functions instead.
> + */
> +struct drm_dep_job {
> + /** @ops: driver callbacks for this job. */
> + const struct drm_dep_job_ops *ops;
> + /** @refcount: reference count, managed by drm_dep_job_get/put(). */
> + struct kref refcount;
> + /**
> + * @dependencies: xarray of &dma_fence dependencies before the job can
> + * run.
> + */
> + struct xarray dependencies;
> + /** @q: the queue this job is submitted to. */
> + struct drm_dep_queue *q;
> + /** @queue_node: SPSC queue linkage for pending submission. */
> + struct spsc_node queue_node;
> + /**
> + * @pending_link: list entry in the queue's pending job list. Protected
> + * by @job.q->job.lock.
> + */
> + struct list_head pending_link;
> + /** @dfence: finished fence for this job. */
> + struct drm_dep_fence *dfence;
> + /** @cb: fence callback used to watch for dependency completion. */
> + struct dma_fence_cb cb;
> + /** @credits: number of credits this job consumes from the queue. */
> + u32 credits;
> + /**
> + * @last_dependency: index into @dependencies of the next fence to
> + * check. Advanced by drm_dep_queue_job_dependency() as each
> + * dependency is consumed.
> + */
> + u32 last_dependency;
> + /**
> + * @invalidate_count: number of times this job has been invalidated.
> + * Incremented by drm_dep_job_invalidate_job().
> + */
> + u32 invalidate_count;
> + /**
> + * @signalling_cookie: return value of dma_fence_begin_signalling()
> + * captured in drm_dep_job_arm() and consumed by drm_dep_job_push().
> + * Not valid outside the arm→push window.
> + */
> + bool signalling_cookie;
> +};
> +
> +/**
> + * struct drm_dep_job_init_args - arguments for drm_dep_job_init()
> + */
> +struct drm_dep_job_init_args {
> + /**
> + * @ops: driver callbacks for the job, or NULL for default behaviour.
> + */
> + const struct drm_dep_job_ops *ops;
> + /** @q: the queue to associate the job with. A reference is taken. */
> + struct drm_dep_queue *q;
> + /** @credits: number of credits this job consumes; must be non-zero. */
> + u32 credits;
> +};
> +
> +/* Queue API */
> +
> +/**
> + * drm_dep_queue_sched_guard() - acquire the queue scheduler lock as a guard
> + * @__q: dep queue whose scheduler lock to acquire
> + *
> + * Acquires @__q->sched.lock as a scoped mutex guard (released automatically
> + * when the enclosing scope exits). This lock serialises all scheduler state
> + * transitions — stop/start/kill flag changes, bypass-path decisions, and the
> + * run-job worker — so it must be held when the driver needs to atomically
> + * inspect or modify queue state in relation to job submission.
> + *
> + * **When to use**
> + *
> + * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
> + * serialise their own submit work against the bypass path must acquire this
> + * guard. Without it, a concurrent caller of drm_dep_job_push() could take
> + * the bypass path and call ops->run_job() inline between the driver's
> + * eligibility check and its corresponding action, producing a race.
> + *
> + * **Constraint: only from submit_wq worker context**
> + *
> + * Drivers must only acquire this guard from a work item running on the
> + * queue's submit workqueue (@__q->sched.submit_wq).
> + *
> + * Context: Process context only; drivers must call this from submit_wq
> + * work.
> + */
> +#define drm_dep_queue_sched_guard(__q) \
> + guard(mutex)(&(__q)->sched.lock)
> +
> +int drm_dep_queue_init(struct drm_dep_queue *q,
> + const struct drm_dep_queue_init_args *args);
> +void drm_dep_queue_fini(struct drm_dep_queue *q);
> +void drm_dep_queue_release(struct drm_dep_queue *q);
> +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q);
> +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q);
> +void drm_dep_queue_put(struct drm_dep_queue *q);
> +void drm_dep_queue_stop(struct drm_dep_queue *q);
> +void drm_dep_queue_start(struct drm_dep_queue *q);
> +void drm_dep_queue_kill(struct drm_dep_queue *q);
> +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q);
> +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q);
> +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q);
> +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> + struct work_struct *work);
> +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q);
> +bool drm_dep_queue_is_killed(struct drm_dep_queue *q);
> +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q);
> +void drm_dep_queue_set_stopped(struct drm_dep_queue *q);
> +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q);
> +long drm_dep_queue_timeout(const struct drm_dep_queue *q);
> +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q);
> +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q);
> +
> +/* Job API */
> +
> +/**
> + * DRM_DEP_JOB_FENCE_PREALLOC - sentinel value for pre-allocating a dependency slot
> + *
> + * Pass this to drm_dep_job_add_dependency() instead of a real fence to
> + * pre-allocate a slot in the job's dependency xarray during the preparation
> + * phase (where GFP_KERNEL is available). The returned xarray index identifies
> + * the slot. Call drm_dep_job_replace_dependency() later — inside a
> + * dma_fence_begin_signalling() region if needed — to swap in the real fence
> + * without further allocation.
> + *
> + * This sentinel is never treated as a dma_fence; it carries no reference count
> + * and must not be passed to dma_fence_put(). It is only valid as an argument
> + * to drm_dep_job_add_dependency() and as the expected stored value checked by
> + * drm_dep_job_replace_dependency().
> + */
> +#define DRM_DEP_JOB_FENCE_PREALLOC ((struct dma_fence *)-1)
> +
> +int drm_dep_job_init(struct drm_dep_job *job,
> + const struct drm_dep_job_init_args *args);
> +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job);
> +void drm_dep_job_put(struct drm_dep_job *job);
> +void drm_dep_job_arm(struct drm_dep_job *job);
> +void drm_dep_job_push(struct drm_dep_job *job);
> +int drm_dep_job_add_dependency(struct drm_dep_job *job,
> + struct dma_fence *fence);
> +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> + struct dma_fence *fence);
> +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> + struct drm_file *file, u32 handle,
> + u32 point);
> +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> + struct dma_resv *resv,
> + enum dma_resv_usage usage);
> +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> + struct drm_gem_object *obj,
> + bool write);
> +bool drm_dep_job_is_signaled(struct drm_dep_job *job);
> +bool drm_dep_job_is_finished(struct drm_dep_job *job);
> +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold);
> +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job);
> +
> +/**
> + * struct drm_dep_queue_pending_job_iter - iterator state for
> + * drm_dep_queue_for_each_pending_job()
> + * @q: queue being iterated
> + */
> +struct drm_dep_queue_pending_job_iter {
> + struct drm_dep_queue *q;
> +};
> +
> +/* Drivers should never call this directly */
> +static inline struct drm_dep_queue_pending_job_iter
> +__drm_dep_queue_pending_job_iter_begin(struct drm_dep_queue *q)
> +{
> + struct drm_dep_queue_pending_job_iter iter = {
> + .q = q,
> + };
> +
> + WARN_ON(!drm_dep_queue_is_stopped(q));
> + return iter;
> +}
> +
> +/* Drivers should never call this directly */
> +static inline void
> +__drm_dep_queue_pending_job_iter_end(struct drm_dep_queue_pending_job_iter iter)
> +{
> + WARN_ON(!drm_dep_queue_is_stopped(iter.q));
> +}
> +
> +/* clang-format off */
> +DEFINE_CLASS(drm_dep_queue_pending_job_iter,
> + struct drm_dep_queue_pending_job_iter,
> + __drm_dep_queue_pending_job_iter_end(_T),
> + __drm_dep_queue_pending_job_iter_begin(__q),
> + struct drm_dep_queue *__q);
> +/* clang-format on */
> +static inline void *
> +class_drm_dep_queue_pending_job_iter_lock_ptr(
> + class_drm_dep_queue_pending_job_iter_t *_T)
> +{ return _T; }
> +#define class_drm_dep_queue_pending_job_iter_is_conditional false
> +
> +/**
> + * drm_dep_queue_for_each_pending_job() - iterate over all pending jobs
> + * in a queue
> + * @__job: loop cursor, a &struct drm_dep_job pointer
> + * @__q: &struct drm_dep_queue to iterate
> + *
> + * Iterates over every job currently on @__q->job.pending. The queue must be
> + * stopped (drm_dep_queue_stop() called) before using this iterator; a WARN_ON
> + * fires at the start and end of the scope if it is not.
> + *
> + * Context: Any context.
> + */
> +#define drm_dep_queue_for_each_pending_job(__job, __q) \
> + scoped_guard(drm_dep_queue_pending_job_iter, (__q)) \
> + list_for_each_entry((__job), &(__q)->job.pending, pending_link)
> +
> +#endif
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 5:45 ` Matthew Brost
2026-03-17 7:17 ` Miguel Ojeda
@ 2026-03-17 18:14 ` Matthew Brost
2026-03-17 19:48 ` Daniel Almeida
2026-03-17 20:43 ` Boris Brezillon
1 sibling, 2 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-17 18:14 UTC (permalink / raw)
To: Daniel Almeida
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
Danilo Krummrich, David Airlie, Maarten Lankhorst, Maxime Ripard,
Philipp Stanner, Simona Vetter, Sumit Semwal, Thomas Zimmermann,
linux-kernel, Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl,
Daniel Stone, Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Mon, Mar 16, 2026 at 10:45:33PM -0700, Matthew Brost wrote:
> On Mon, Mar 16, 2026 at 11:47:01PM -0300, Daniel Almeida wrote:
> > (+cc a few other people + Rust-for-Linux ML)
> >
> > Hi Matthew,
> >
> > I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
> > is going, it is much more ergonomic to add a Rust component with a nice C
> > interface than doing it the other way around.
> >
>
> Holy war? See my reply to Danilo — I’ll write this in Rust if needed,
> but it’s not my first choice since I’m not yet a native speaker.
>
> > > On 16 Mar 2026, at 01:32, Matthew Brost <matthew.brost@intel.com> wrote:
> > >
> > > Diverging requirements between GPU drivers using firmware scheduling
> > > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > > rules, and numerous driver-specific hacks have rendered
> > > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > > firmware-scheduled GPU drivers—one that addresses all of the
> > > aforementioned shortcomings.
> > >
> > > Add drm_dep, a lightweight GPU submission queue intended as a
> > > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > > from the queue (drm_sched_entity) into two objects requiring external
> > > coordination, drm_dep merges both roles into a single struct
> > > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > > that is unnecessary for firmware schedulers which manage their own
> > > run-lists internally.
> > >
> > > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > > management by the driver, drm_dep uses reference counting (kref) on both
> > > queues and jobs to guarantee object lifetime safety. A job holds a queue
> >
> > In a domain that has been plagued by lifetime issues, we really should be
>
> Yes, drm sched is a mess. I’ve been suggesting we fix it for years and
> have met pushback. This, however (drm dep), isn’t plagued by lifetime
> issues — that’s the primary focus here.
>
> > enforcing RAII for resource management instead of manual calls.
> >
>
> You can do RAII in C - see cleanup.h. Clear object lifetimes and
> ownership are what is important. Disciplined coding is the only way to
> do this regardless of language. RAII doesn't help with bad object
> models / ownership / lifetime models either.
>
> I don't buy the Rust-solves-everything argument, but again, non-native
> speaker.
>
> > > reference from init until its last put, and the queue holds a job reference
> > > from dispatch until the put_job worker runs. This makes use-after-free
> > > impossible even when completion arrives from IRQ context or concurrent
> > > teardown is in flight.
> >
> > It makes use-after-free impossible _if_ you’re careful. It is not a
> > property of the type system, and incorrect code will compile just fine.
> >
>
> Sure. If a driver puts a drm_dep object reference on a resource that
> drm_dep owns, it will explode. That’s effectively putting a reference on
> a resource the driver doesn’t own. A driver can write to any physical
> memory and crash the system anyway, so I’m not really sure what we’re
> talking about here. Rust doesn’t solve anything in this scenario — you
> can always use an unsafe block and put a reference on a resource you
> don’t own.
>
> Object model, ownership, and lifetimes are what is important and that is
> what drm dep is built around.
>
> > >
> > > The core objects are:
> > >
> > > struct drm_dep_queue - a per-context submission queue owning an
> > > ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> > > queue, and a pending-job list. Reference counted; drivers can embed
> > > it and provide a .release vfunc for RCU-safe teardown.
> > >
> > > struct drm_dep_job - a single unit of GPU work. Drivers embed this
> > > and provide a .release vfunc. Jobs carry an xarray of input
> > > dma_fence dependencies and produce a drm_dep_fence as their
> > > finished fence.
> > >
> > > struct drm_dep_fence - a dma_fence subclass wrapping an optional
> > > parent hardware fence. The finished fence is armed (sequence
> > > number assigned) before submission and signals when the hardware
> > > fence signals (or immediately on synchronous completion).
> > >
> > > Job lifecycle:
> > > 1. drm_dep_job_init() - allocate and initialise; job acquires a
> > > queue reference.
> > > 2. drm_dep_job_add_dependency() and friends - register input fences;
> > > duplicates from the same context are deduplicated.
> > > 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> > > 4. drm_dep_job_push() - submit to queue.
> >
> > You cannot enforce this sequence easily in C code. Once again, we are trusting
> > drivers that it is followed, but in Rust, you can simply reject code that does
> > not follow this order at compile time.
> >
>
> I don’t know Rust, but yes — you can enforce this in C. It’s called
> lockdep and annotations. It’s not compile-time, but all of this is
> strictly enforced. e.g., write some code that doesn't follow this and
> report back if the kernel doesn't explode. It will, if doesn't I'll fix
> it to complain.
>
> >
> > >
> > > Submission paths under queue lock:
> > > - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> > > SPSC queue is empty, no dependencies are pending, and credits are
> > > available, the job is dispatched inline on the calling thread.
> > > - Queued path: job is pushed onto the SPSC queue and the run_job
> > > worker is kicked. The worker resolves remaining dependencies
> > > (installing wakeup callbacks for unresolved fences) before calling
> > > ops->run_job().
> > >
> > > Credit-based throttling prevents hardware overflow: each job declares
> > > a credit cost at init time; dispatch is deferred until sufficient
> > > credits are available.
> >
> > Why can’t we design an API where the driver can refuse jobs in
> > ops->run_job() if there are no resources to run it? This would do away with the
> > credit system that has been in place for quite a while. Has this approach been
> > tried in the past?
> >
>
> That seems possible if this is the preferred option. -EAGAIN is the way
> to do this. I’m open to the idea, but we also need to weigh the cost of
> converting drivers against the number of changes required.
>
> Partial reply - will catch up on the rest later.
>
> Appreciate the feedback.
Picking up replies.
>
> Matt
>
> >
> > >
> > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > expiry for device teardown.
> > >
> > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > dma_fence callback). Dependency cleanup is deferred to process context
> > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > >
> > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > bail immediately if the queue refcount has already reached zero and
> > > async teardown is in flight, preventing use-after-free.
> >
> > In rust, when you queue work, you have to pass a reference-counted pointer
> > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > queued, the queue is alive.
> >
> > By the way, why can’t we simply require synchronous teardowns?
Consider the case where the DRM dep queue’s refcount drops to zero, but
the device firmware still holds references to the associated queue.
These are resources that must be torn down asynchronously. In Xe, I need
to send two asynchronous firmware commands before I can safely remove
the memory associated with the queue (faulting on this kind of global
memory will take down the device) and recycle the firmware ID tied to
the queue. These async commands are issued on the driver side, on the
DRM dep queue’s workqueue as well.
Now consider a scenario where something goes wrong and those firmware
commands never complete, and a device reset is required to recover. The
driver’s per-queue tracking logic stops all queues (including zombie
ones), determines which commands were lost, cleans up the side effects
of that lost state, and then restarts all queues. That is how we would
end up in this work item with a zombie queue. The restart logic could
probably be made smart enough to avoid queueing work for zombie queues,
but in my opinion it’s safe enough to use kref_get_unless_zero() in the
work items.
It should also be clear that a DRM dep queue is primarily intended to be
embedded inside the driver’s own queue object, even though it is valid
to use it as a standalone object. The async teardown flows are also
optional features.
Let’s also consider a case where you do not need the async firmware
flows described above, but the DRM dep queue is still embedded in a
driver-side object that owns memory via dma-resv. The final queue put
may occur in IRQ context (as an opt-in, DRM dep avoids kicking a worker
just to drop a ref), or in the reclaim path (any scheduler workqueue is
in the reclaim path). In either case, you cannot take a dma-resv lock to
free memory there, which is why all DRM dep queues ultimately free their
resources in a work item outside of reclaim. Many drivers already follow
this pattern, but in DRM dep this behavior is built-in.
So I don’t think Rust natively solves these types of problems, although
I’ll concede that it does make refcounting a bit more sane.
> >
> > >
> > > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > > so that destroy_workqueue() is never called from within one of the
> > > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > > owning struct drm_device, released as the final step of teardown via
> > > drm_dev_put(). This prevents the driver module from being unloaded
> > > while any queue is still alive without requiring a separate drain API.
> > >
> > > Cc: Boris Brezillon <boris.brezillon@collabora.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Maxime Ripard <mripard@kernel.org>
> > > Cc: Philipp Stanner <phasta@kernel.org>
> > > Cc: Simona Vetter <simona@ffwll.ch>
> > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > ---
> > > drivers/gpu/drm/Kconfig | 4 +
> > > drivers/gpu/drm/Makefile | 1 +
> > > drivers/gpu/drm/dep/Makefile | 5 +
> > > drivers/gpu/drm/dep/drm_dep_fence.c | 406 +++++++
> > > drivers/gpu/drm/dep/drm_dep_fence.h | 25 +
> > > drivers/gpu/drm/dep/drm_dep_job.c | 675 +++++++++++
> > > drivers/gpu/drm/dep/drm_dep_job.h | 13 +
> > > drivers/gpu/drm/dep/drm_dep_queue.c | 1647 +++++++++++++++++++++++++++
> > > drivers/gpu/drm/dep/drm_dep_queue.h | 31 +
> > > include/drm/drm_dep.h | 597 ++++++++++
> > > 10 files changed, 3404 insertions(+)
> > > create mode 100644 drivers/gpu/drm/dep/Makefile
> > > create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.c
> > > create mode 100644 drivers/gpu/drm/dep/drm_dep_fence.h
> > > create mode 100644 drivers/gpu/drm/dep/drm_dep_job.c
> > > create mode 100644 drivers/gpu/drm/dep/drm_dep_job.h
> > > create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.c
> > > create mode 100644 drivers/gpu/drm/dep/drm_dep_queue.h
> > > create mode 100644 include/drm/drm_dep.h
> > >
> > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > > index 5386248e75b6..834f6e210551 100644
> > > --- a/drivers/gpu/drm/Kconfig
> > > +++ b/drivers/gpu/drm/Kconfig
> > > @@ -276,6 +276,10 @@ config DRM_SCHED
> > > tristate
> > > depends on DRM
> > >
> > > +config DRM_DEP
> > > + tristate
> > > + depends on DRM
> > > +
> > > # Separate option as not all DRM drivers use it
> > > config DRM_PANEL_BACKLIGHT_QUIRKS
> > > tristate
> > > diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> > > index e97faabcd783..1ad87cc0e545 100644
> > > --- a/drivers/gpu/drm/Makefile
> > > +++ b/drivers/gpu/drm/Makefile
> > > @@ -173,6 +173,7 @@ obj-y += clients/
> > > obj-y += display/
> > > obj-$(CONFIG_DRM_TTM) += ttm/
> > > obj-$(CONFIG_DRM_SCHED) += scheduler/
> > > +obj-$(CONFIG_DRM_DEP) += dep/
> > > obj-$(CONFIG_DRM_RADEON)+= radeon/
> > > obj-$(CONFIG_DRM_AMDGPU)+= amd/amdgpu/
> > > obj-$(CONFIG_DRM_AMDGPU)+= amd/amdxcp/
> > > diff --git a/drivers/gpu/drm/dep/Makefile b/drivers/gpu/drm/dep/Makefile
> > > new file mode 100644
> > > index 000000000000..335f1af46a7b
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/Makefile
> > > @@ -0,0 +1,5 @@
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +
> > > +drm_dep-y := drm_dep_queue.o drm_dep_job.o drm_dep_fence.o
> > > +
> > > +obj-$(CONFIG_DRM_DEP) += drm_dep.o
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_fence.c b/drivers/gpu/drm/dep/drm_dep_fence.c
> > > new file mode 100644
> > > index 000000000000..ae05b9077772
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_fence.c
> > > @@ -0,0 +1,406 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +/**
> > > + * DOC: DRM dependency fence
> > > + *
> > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > + * completes the job.
> > > + *
> > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > + * NULL or an error).
> >
> > I thought this fence proxy mechanism was going away due to recent work being
> > carried out by Christian?
> >
Consider the case where a driver’s hardware fence is implemented as a
dma-fence-array or dma-fence-chain. You cannot install these types of
fences into a dma-resv or into syncobjs, so a proxy fence is useful
here. One example is when a single job submits work to multiple rings
that are flipped in hardware at the same time.
Another case is late arming of hardware fences in run_job (which many
drivers do). The proxy fence is immediately available at arm time and
can be installed into dma-resv or syncobjs even though the actual
hardware fence is not yet available. I think most drivers could be
refactored to make the hardware fence immediately available at run_job,
though.
> > > + *
> > > + * Drivers should expose @finished as the out-fence for GPU work since it is
> > > + * valid from the moment drm_dep_job_arm() returns, whereas the hardware fence
> > > + * could be a compound fence, which is disallowed when installed into
> > > + * drm_syncobjs or dma-resv.
> > > + *
> > > + * The fence uses the kernel's inline spinlock (NULL passed to dma_fence_init())
> > > + * so no separate lock allocation is required.
> > > + *
> > > + * Deadline propagation is supported: if a consumer sets a deadline via
> > > + * dma_fence_set_deadline(), it is forwarded to @parent when @parent is set.
> > > + * If @parent has not been set yet the deadline is stored in @deadline and
> > > + * forwarded at that point.
> > > + *
> > > + * Memory management: drm_dep_fence objects are allocated with kzalloc() and
> > > + * freed via kfree_rcu() once the fence is released, ensuring safety with
> > > + * RCU-protected fence accesses.
> > > + */
> > > +
> > > +#include <linux/slab.h>
> > > +#include <drm/drm_dep.h>
> > > +#include "drm_dep_fence.h"
> > > +
> > > +/**
> > > + * DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT - a fence deadline hint has been set
> > > + *
> > > + * Set by the deadline callback on the finished fence to indicate a deadline
> > > + * has been set which may need to be propagated to the parent hardware fence.
> > > + */
> > > +#define DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT (DMA_FENCE_FLAG_USER_BITS + 1)
> > > +
> > > +/**
> > > + * struct drm_dep_fence - fence tracking the completion of a dep job
> > > + *
> > > + * Contains a single dma_fence (@finished) that is signalled when the
> > > + * hardware completes the job. The fence uses the kernel's inline_lock
> > > + * (no external spinlock required).
> > > + *
> > > + * This struct is private to the drm_dep module; external code interacts
> > > + * through the accessor functions declared in drm_dep_fence.h.
> > > + */
> > > +struct drm_dep_fence {
> > > + /**
> > > + * @finished: signalled when the job completes on hardware.
> > > + *
> > > + * Drivers should use this fence as the out-fence for a job since it
> > > + * is available immediately upon drm_dep_job_arm().
> > > + */
> > > + struct dma_fence finished;
> > > +
> > > + /**
> > > + * @deadline: deadline set on @finished which potentially needs to be
> > > + * propagated to @parent.
> > > + */
> > > + ktime_t deadline;
> > > +
> > > + /**
> > > + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> > > + *
> > > + * @finished is signaled once @parent is signaled. The initial store is
> > > + * performed via smp_store_release to synchronize with deadline handling.
> > > + *
> > > + * All readers must access this under the fence lock and take a reference to
> > > + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> > > + * signals, and this drop also releases its internal reference.
> > > + */
> > > + struct dma_fence *parent;
> > > +
> > > + /**
> > > + * @q: the queue this fence belongs to.
> > > + */
> > > + struct drm_dep_queue *q;
> > > +};
> > > +
> > > +static const struct dma_fence_ops drm_dep_fence_ops;
> > > +
> > > +/**
> > > + * to_drm_dep_fence() - cast a dma_fence to its enclosing drm_dep_fence
> > > + * @f: dma_fence to cast
> > > + *
> > > + * Context: No context requirements (inline helper).
> > > + * Return: pointer to the enclosing &drm_dep_fence.
> > > + */
> > > +static struct drm_dep_fence *to_drm_dep_fence(struct dma_fence *f)
> > > +{
> > > + return container_of(f, struct drm_dep_fence, finished);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_set_parent() - store the hardware fence and propagate
> > > + * any deadline
> > > + * @dfence: dep fence
> > > + * @parent: hardware fence returned by &drm_dep_queue_ops.run_job, or NULL/error
> > > + *
> > > + * Stores @parent on @dfence under smp_store_release() so that a concurrent
> > > + * drm_dep_fence_set_deadline() call sees the parent before checking the
> > > + * deadline bit. If a deadline has already been set on @dfence->finished it is
> > > + * forwarded to @parent immediately. Does nothing if @parent is NULL or an
> > > + * error pointer.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> > > + struct dma_fence *parent)
> > > +{
> > > + if (IS_ERR_OR_NULL(parent))
> > > + return;
> > > +
> > > + /*
> > > + * smp_store_release() to ensure a thread racing us in
> > > + * drm_dep_fence_set_deadline() sees the parent set before
> > > + * it calls test_bit(HAS_DEADLINE_BIT).
> > > + */
> > > + smp_store_release(&dfence->parent, dma_fence_get(parent));
> > > + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT,
> > > + &dfence->finished.flags))
> > > + dma_fence_set_deadline(parent, dfence->deadline);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_finished() - signal the finished fence with a result
> > > + * @dfence: dep fence to signal
> > > + * @result: error code to set, or 0 for success
> > > + *
> > > + * Sets the fence error to @result if non-zero, then signals
> > > + * @dfence->finished. Also removes parent visibility under the fence lock
> > > + * and drops the parent reference. Dropping the parent here allows the
> > > + * DRM dep fence to be completely decoupled from the DRM dep module.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_fence_finished(struct drm_dep_fence *dfence, int result)
> > > +{
> > > + struct dma_fence *parent;
> > > + unsigned long flags;
> > > +
> > > + dma_fence_lock_irqsave(&dfence->finished, flags);
> > > + if (result)
> > > + dma_fence_set_error(&dfence->finished, result);
> > > + dma_fence_signal_locked(&dfence->finished);
> > > + parent = dfence->parent;
> > > + dfence->parent = NULL;
> > > + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> > > +
> > > + dma_fence_put(parent);
> > > +}
> >
> > We should really try to move away from manual locks and unlocks.
> >
I agree. Let's see if we can get dma_fence scoped guard in.
> > > +
> > > +static const char *drm_dep_fence_get_driver_name(struct dma_fence *fence)
> > > +{
> > > + return "drm_dep";
> > > +}
> > > +
> > > +static const char *drm_dep_fence_get_timeline_name(struct dma_fence *f)
> > > +{
> > > + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> > > +
> > > + return dfence->q->name;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_get_parent() - get a reference to the parent hardware fence
> > > + * @dfence: dep fence to query
> > > + *
> > > + * Returns a new reference to @dfence->parent, or NULL if the parent has
> > > + * already been cleared (i.e. @dfence->finished has signalled and the parent
> > > + * reference was dropped under the fence lock).
> > > + *
> > > + * Uses smp_load_acquire() to pair with the smp_store_release() in
> > > + * drm_dep_fence_set_parent(), ensuring that if we race a concurrent
> > > + * drm_dep_fence_set_parent() call we observe the parent pointer only after
> > > + * the store is fully visible — before set_parent() tests
> > > + * %DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT.
> > > + *
> > > + * Caller must hold the fence lock on @dfence->finished.
> > > + *
> > > + * Context: Any context, fence lock on @dfence->finished must be held.
> > > + * Return: a new reference to the parent fence, or NULL.
> > > + */
> > > +static struct dma_fence *drm_dep_fence_get_parent(struct drm_dep_fence *dfence)
> > > +{
> > > + dma_fence_assert_held(&dfence->finished);
> >
> > > +
> > > + return dma_fence_get(smp_load_acquire(&dfence->parent));
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_set_deadline() - dma_fence_ops deadline callback
> > > + * @f: fence on which the deadline is being set
> > > + * @deadline: the deadline hint to apply
> > > + *
> > > + * Stores the earliest deadline under the fence lock, then propagates
> > > + * it to the parent hardware fence via smp_load_acquire() to race
> > > + * safely with drm_dep_fence_set_parent().
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_fence_set_deadline(struct dma_fence *f, ktime_t deadline)
> > > +{
> > > + struct drm_dep_fence *dfence = to_drm_dep_fence(f);
> > > + struct dma_fence *parent;
> > > + unsigned long flags;
> > > +
> > > + dma_fence_lock_irqsave(f, flags);
> > > +
> > > + /* If we already have an earlier deadline, keep it: */
> > > + if (test_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
> > > + ktime_before(dfence->deadline, deadline)) {
> > > + dma_fence_unlock_irqrestore(f, flags);
> > > + return;
> > > + }
> > > +
> > > + dfence->deadline = deadline;
> > > + set_bit(DRM_DEP_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags);
> > > +
> > > + parent = drm_dep_fence_get_parent(dfence);
> > > + dma_fence_unlock_irqrestore(f, flags);
> > > +
> > > + if (parent)
> > > + dma_fence_set_deadline(parent, deadline);
> > > +
> > > + dma_fence_put(parent);
> > > +}
> > > +
> > > +static const struct dma_fence_ops drm_dep_fence_ops = {
> > > + .get_driver_name = drm_dep_fence_get_driver_name,
> > > + .get_timeline_name = drm_dep_fence_get_timeline_name,
> > > + .set_deadline = drm_dep_fence_set_deadline,
> > > +};
> > > +
> > > +/**
> > > + * drm_dep_fence_alloc() - allocate a dep fence
> > > + *
> > > + * Allocates a &drm_dep_fence with kzalloc() without initialising the
> > > + * dma_fence. Call drm_dep_fence_init() to fully initialise it.
> > > + *
> > > + * Context: Process context.
> > > + * Return: new &drm_dep_fence on success, NULL on allocation failure.
> > > + */
> > > +struct drm_dep_fence *drm_dep_fence_alloc(void)
> > > +{
> > > + return kzalloc_obj(struct drm_dep_fence);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_init() - initialise the dma_fence inside a dep fence
> > > + * @dfence: dep fence to initialise
> > > + * @q: queue the owning job belongs to
> > > + *
> > > + * Initialises @dfence->finished using the context and sequence number from @q.
> > > + * Passes NULL as the lock so the fence uses its inline spinlock.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q)
> > > +{
> > > + u32 seq = ++q->fence.seqno;
> > > +
> > > + /*
> > > + * XXX: Inline fence hazard: currently all expected users of DRM dep
> > > + * hardware fences have a unique lockdep class. If that ever changes,
> > > + * we will need to assign a unique lockdep class here so lockdep knows
> > > + * this fence is allowed to nest with driver hardware fences.
> > > + */
> > > +
> > > + dfence->q = q;
> > > + dma_fence_init(&dfence->finished, &drm_dep_fence_ops,
> > > + NULL, q->fence.context, seq);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_cleanup() - release a dep fence at job teardown
> > > + * @dfence: dep fence to clean up
> > > + *
> > > + * Called from drm_dep_job_fini(). If the dep fence was armed (refcount > 0)
> > > + * it is released via dma_fence_put() and will be freed by the RCU release
> > > + * callback once all waiters have dropped their references. If it was never
> > > + * armed it is freed directly with kfree().
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence)
> > > +{
> > > + if (drm_dep_fence_is_armed(dfence))
> > > + dma_fence_put(&dfence->finished);
> > > + else
> > > + kfree(dfence);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_is_armed() - check whether the fence has been armed
> > > + * @dfence: dep fence to check
> > > + *
> > > + * Returns true if drm_dep_job_arm() has been called, i.e. @dfence->finished
> > > + * has been initialised and its reference count is non-zero. Used by
> > > + * assertions to enforce correct job lifecycle ordering (arm before push,
> > > + * add_dependency before arm).
> > > + *
> > > + * Context: Any context.
> > > + * Return: true if the fence is armed, false otherwise.
> > > + */
> > > +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence)
> > > +{
> > > + return !!kref_read(&dfence->finished.refcount);
> > > +}
> >
> > > +
> > > +/**
> > > + * drm_dep_fence_is_finished() - test whether the finished fence has signalled
> > > + * @dfence: dep fence to check
> > > + *
> > > + * Uses dma_fence_test_signaled_flag() to read %DMA_FENCE_FLAG_SIGNALED_BIT
> > > + * directly without invoking the fence's ->signaled() callback or triggering
> > > + * any signalling side-effects.
> > > + *
> > > + * Context: Any context.
> > > + * Return: true if @dfence->finished has been signalled, false otherwise.
> > > + */
> > > +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence)
> > > +{
> > > + return dma_fence_test_signaled_flag(&dfence->finished);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_is_complete() - test whether the job has completed
> > > + * @dfence: dep fence to check
> > > + *
> > > + * Takes the fence lock on @dfence->finished and calls
> > > + * drm_dep_fence_get_parent() to safely obtain a reference to the parent
> > > + * hardware fence — or NULL if the parent has already been cleared after
> > > + * signalling. Calls dma_fence_is_signaled() on @parent outside the lock,
> > > + * which may invoke the fence's ->signaled() callback and trigger signalling
> > > + * side-effects if the fence has completed but the signalled flag has not yet
> > > + * been set. The finished fence is tested via dma_fence_test_signaled_flag(),
> > > + * without side-effects.
> > > + *
> > > + * May only be called on a stopped queue (see drm_dep_queue_is_stopped()).
> > > + *
> > > + * Context: Process context. The queue must be stopped before calling this.
> > > + * Return: true if the job is complete, false otherwise.
> > > + */
> > > +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence)
> > > +{
> > > + struct dma_fence *parent;
> > > + unsigned long flags;
> > > + bool complete;
> > > +
> > > + dma_fence_lock_irqsave(&dfence->finished, flags);
> > > + parent = drm_dep_fence_get_parent(dfence);
> > > + dma_fence_unlock_irqrestore(&dfence->finished, flags);
> > > +
> > > + complete = (parent && dma_fence_is_signaled(parent)) ||
> > > + dma_fence_test_signaled_flag(&dfence->finished);
> > > +
> > > + dma_fence_put(parent);
> > > +
> > > + return complete;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_to_dma() - return the finished dma_fence for a dep fence
> > > + * @dfence: dep fence to query
> > > + *
> > > + * No reference is taken; the caller must hold its own reference to the owning
> > > + * &drm_dep_job for the duration of the access.
> > > + *
> > > + * Context: Any context.
> > > + * Return: the finished &dma_fence.
> > > + */
> > > +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence)
> > > +{
> > > + return &dfence->finished;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_fence_done() - signal the finished fence on job completion
> > > + * @dfence: dep fence to signal
> > > + * @result: job error code, or 0 on success
> > > + *
> > > + * Gets a temporary reference to @dfence->finished to guard against a racing
> > > + * last-put, signals the fence with @result, then drops the temporary
> > > + * reference. Called from drm_dep_job_done() in the queue core when a
> > > + * hardware completion callback fires or when run_job() returns immediately.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result)
> > > +{
> > > + dma_fence_get(&dfence->finished);
> > > + drm_dep_fence_finished(dfence, result);
> > > + dma_fence_put(&dfence->finished);
> > > +}
> >
> > Proper refcounting is automated (and enforced) in Rust.
> >
That is a nice feature.
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_fence.h b/drivers/gpu/drm/dep/drm_dep_fence.h
> > > new file mode 100644
> > > index 000000000000..65a1582f858b
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_fence.h
> > > @@ -0,0 +1,25 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _DRM_DEP_FENCE_H_
> > > +#define _DRM_DEP_FENCE_H_
> > > +
> > > +#include <linux/dma-fence.h>
> > > +
> > > +struct drm_dep_fence;
> > > +struct drm_dep_queue;
> > > +
> > > +struct drm_dep_fence *drm_dep_fence_alloc(void);
> > > +void drm_dep_fence_init(struct drm_dep_fence *dfence, struct drm_dep_queue *q);
> > > +void drm_dep_fence_cleanup(struct drm_dep_fence *dfence);
> > > +void drm_dep_fence_set_parent(struct drm_dep_fence *dfence,
> > > + struct dma_fence *parent);
> > > +void drm_dep_fence_done(struct drm_dep_fence *dfence, int result);
> > > +bool drm_dep_fence_is_armed(struct drm_dep_fence *dfence);
> > > +bool drm_dep_fence_is_finished(struct drm_dep_fence *dfence);
> > > +bool drm_dep_fence_is_complete(struct drm_dep_fence *dfence);
> > > +struct dma_fence *drm_dep_fence_to_dma(struct drm_dep_fence *dfence);
> > > +
> > > +#endif /* _DRM_DEP_FENCE_H_ */
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > new file mode 100644
> > > index 000000000000..2d012b29a5fc
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > @@ -0,0 +1,675 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice shall be included in
> > > + * all copies or substantial portions of the Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > + *
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +/**
> > > + * DOC: DRM dependency job
> > > + *
> > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > + *
> > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > + * kref reference and a reference to its queue.
> > > + *
> > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > + * that must be signalled before the job can run. Duplicate fences from the
> > > + * same fence context are deduplicated automatically.
> > > + *
> > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > + * consuming a sequence number from the queue. After arming,
> > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > + * userspace or used as a dependency by other jobs.
> > > + *
> > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > + * queue takes a reference that it holds until the job's finished fence
> > > + * signals and the job is freed by the put_job worker.
> > > + *
> > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > + *
> > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > + * objects before the driver's release callback is invoked.
> > > + */
> > > +
> > > +#include <linux/dma-resv.h>
> > > +#include <linux/kref.h>
> > > +#include <linux/slab.h>
> > > +#include <drm/drm_dep.h>
> > > +#include <drm/drm_file.h>
> > > +#include <drm/drm_gem.h>
> > > +#include <drm/drm_syncobj.h>
> > > +#include "drm_dep_fence.h"
> > > +#include "drm_dep_job.h"
> > > +#include "drm_dep_queue.h"
> > > +
> > > +/**
> > > + * drm_dep_job_init() - initialise a dep job
> > > + * @job: dep job to initialise
> > > + * @args: initialisation arguments
> > > + *
> > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > + * job reference is dropped.
> > > + *
> > > + * Resources are released automatically when the last reference is dropped
> > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > + * must not free the job directly.
> >
> > Again, can’t enforce that in C.
> >
I agree. A driver could just kfree() the job after init… but in this
design the driver unload would hang.
> > > + *
> > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > + * Return: 0 on success, -%EINVAL if @args->credits is 0,
> > > + * -%ENOMEM on fence allocation failure.
> > > + */
> > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > + const struct drm_dep_job_init_args *args)
> > > +{
> > > + if (unlikely(!args->credits)) {
> > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + memset(job, 0, sizeof(*job));
> > > +
> > > + job->dfence = drm_dep_fence_alloc();
> > > + if (!job->dfence)
> > > + return -ENOMEM;
> > > +
> > > + job->ops = args->ops;
> > > + job->q = drm_dep_queue_get(args->q);
> > > + job->credits = args->credits;
> > > +
> > > + kref_init(&job->refcount);
> > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > + INIT_LIST_HEAD(&job->pending_link);
> > > +
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > +
> > > +/**
> > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > + * @job: dep job whose dependency xarray to drain
> > > + *
> > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > + * skipped; the sentinel carries no reference. Called from
> > > + * drm_dep_queue_run_job() in process context immediately after
> > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > + * dependencies here — while still in process context — avoids calling
> > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > + * dropped from a dma_fence callback.
> > > + *
> > > + * Context: Process context.
> > > + */
> > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > +{
> > > + struct dma_fence *fence;
> > > + unsigned long index;
> > > +
> > > + xa_for_each(&job->dependencies, index, fence) {
> > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > + continue;
> > > + dma_fence_put(fence);
> > > + }
> > > + xa_destroy(&job->dependencies);
> > > +}
> >
> > This is automated in Rust. You also can’t “forget” to call this.
Driver code can’t call this function—note the lack of an export. DRM dep
owns this call, and it always invokes it. But as discussed, a driver
could still kfree() the job or forget to drop its creation reference.
> >
> > > +
> > > +/**
> > > + * drm_dep_job_fini() - clean up a dep job
> > > + * @job: dep job to clean up
> > > + *
> > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > + *
> > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > + * context immediately after run_job(), so it is left untouched to avoid
> > > + * calling xa_destroy() from IRQ context.
> > > + *
> > > + * Warns if @job is still linked on the queue's pending list, which would
> > > + * indicate a bug in the teardown ordering.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > +{
> > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > +
> > > + WARN_ON(!list_empty(&job->pending_link));
> > > +
> > > + drm_dep_fence_cleanup(job->dfence);
> > > + job->dfence = NULL;
> > > +
> > > + /*
> > > + * Armed jobs have their dependencies drained by
> > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > + */
> > > + if (!armed)
> > > + drm_dep_job_drop_dependencies(job);
> > > +}
> >
> > Same here.
> >
> > > +
> > > +/**
> > > + * drm_dep_job_get() - acquire a reference to a dep job
> > > + * @job: dep job to acquire a reference on, or NULL
> > > + *
> > > + * Context: Any context.
> > > + * Return: @job with an additional reference held, or NULL if @job is NULL.
> > > + */
> > > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> > > +{
> > > + if (job)
> > > + kref_get(&job->refcount);
> > > + return job;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_get);
> > > +
> >
> > Same here.
> >
> > > +/**
> > > + * drm_dep_job_release() - kref release callback for a dep job
> > > + * @kref: kref embedded in the dep job
> > > + *
> > > + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> > > + * otherwise frees @job with kfree(). Finally, releases the queue reference
> > > + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> > > + * queue put is performed last to ensure no queue state is accessed after
> > > + * the job memory is freed.
> > > + *
> > > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > > + * job's queue; otherwise process context only, as the release callback may
> > > + * sleep.
> > > + */
> > > +static void drm_dep_job_release(struct kref *kref)
> > > +{
> > > + struct drm_dep_job *job =
> > > + container_of(kref, struct drm_dep_job, refcount);
> > > + struct drm_dep_queue *q = job->q;
> > > +
> > > + drm_dep_job_fini(job);
> > > +
> > > + if (job->ops && job->ops->release)
> > > + job->ops->release(job);
> > > + else
> > > + kfree(job);
> > > +
> > > + drm_dep_queue_put(q);
> > > +}
> >
> > Same here.
> >
> > > +
> > > +/**
> > > + * drm_dep_job_put() - release a reference to a dep job
> > > + * @job: dep job to release a reference on, or NULL
> > > + *
> > > + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> > > + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> > > + *
> > > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > > + * job's queue; otherwise process context only, as the release callback may
> > > + * sleep.
> > > + */
> > > +void drm_dep_job_put(struct drm_dep_job *job)
> > > +{
> > > + if (job)
> > > + kref_put(&job->refcount, drm_dep_job_release);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_put);
> > > +
> >
> > Same here.
> >
> > > +/**
> > > + * drm_dep_job_arm() - arm a dep job for submission
> > > + * @job: dep job to arm
> > > + *
> > > + * Initialises the finished fence on @job->dfence, assigning
> > > + * it a sequence number from the job's queue. Must be called after
> > > + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > + * userspace or used as a dependency by other jobs.
> > > + *
> > > + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> > > + * After this point, memory allocations that could trigger reclaim are
> > > + * forbidden; lockdep enforces this. arm() must always be paired with
> > > + * drm_dep_job_push(); lockdep also enforces this pairing.
> > > + *
> > > + * Warns if the job has already been armed.
> > > + *
> > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > + * path.
> > > + */
> > > +void drm_dep_job_arm(struct drm_dep_job *job)
> > > +{
> > > + drm_dep_queue_push_job_begin(job->q);
> > > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > > + drm_dep_fence_init(job->dfence, job->q);
> > > + job->signalling_cookie = dma_fence_begin_signalling();
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_arm);
> > > +
> > > +/**
> > > + * drm_dep_job_push() - submit a job to its queue for execution
> > > + * @job: dep job to push
> > > + *
> > > + * Submits @job to the queue it was initialised with. Must be called after
> > > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > > + * held until the queue is fully done with it. The reference is released
> > > + * directly in the finished-fence dma_fence callback for queues with
> > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > > + * from hardirq context), or via the put_job work item on the submit
> > > + * workqueue otherwise.
> > > + *
> > > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > > + * enforces the pairing.
> > > + *
> > > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > > + * @job exactly once, even if the queue is killed or torn down before the
> > > + * job reaches the head of the queue. Drivers can use this guarantee to
> > > + * perform bookkeeping cleanup; the actual backend operation should be
> > > + * skipped when drm_dep_queue_is_killed() returns true.
> > > + *
> > > + * If the queue does not support the bypass path, the job is pushed directly
> > > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > > + *
> > > + * Warns if the job has not been armed.
> > > + *
> > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > + * path.
> > > + */
> > > +void drm_dep_job_push(struct drm_dep_job *job)
> > > +{
> > > + struct drm_dep_queue *q = job->q;
> > > +
> > > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > > +
> > > + drm_dep_job_get(job);
> > > +
> > > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > > + drm_dep_queue_push_job(q, job);
> > > + dma_fence_end_signalling(job->signalling_cookie);
> >
> > Signaling is enforced in a more thorough way in Rust. I’ll expand on this later in this patch.
> >
> > > + drm_dep_queue_push_job_end(job->q);
> > > + return;
> > > + }
> > > +
> > > + scoped_guard(mutex, &q->sched.lock) {
> > > + if (drm_dep_queue_can_job_bypass(q, job))
> > > + drm_dep_queue_run_job(q, job);
> > > + else
> > > + drm_dep_queue_push_job(q, job);
> > > + }
> > > +
> > > + dma_fence_end_signalling(job->signalling_cookie);
> > > + drm_dep_queue_push_job_end(job->q);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_push);
> > > +
> > > +/**
> > > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > > + * @job: dep job to add the dependencies to
> > > + * @fence: the dma_fence to add to the list of dependencies, or
> > > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > > + *
> > > + * Note that @fence is consumed in both the success and error cases (except
> > > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > > + *
> > > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > > + * fence->context matches the queue's finished fence context) are silently
> > > + * dropped; the job need not wait on its own queue's output.
> > > + *
> > > + * Warns if the job has already been armed (dependencies must be added before
> > > + * drm_dep_job_arm()).
> > > + *
> > > + * **Pre-allocation pattern**
> > > + *
> > > + * When multiple jobs across different queues must be prepared and submitted
> > > + * together in a single atomic commit — for example, where job A's finished
> > > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > + * region. Once that region has started no memory allocation is permitted.
> > > + *
> > > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > > + * always index 0 when the dependency array is empty, a behaviour Xe relies on).
> > > + * After all jobs have been armed and the finished fences are available, call
> > > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > > + * called from atomic or signalling context.
> > > + *
> > > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > > + * ensuring a slot is always allocated even when the real fence is not yet
> > > + * known.
> > > + *
> > > + * **Example: bind job feeding TLB invalidation jobs**
> > > + *
> > > + * Consider a GPU with separate queues for page-table bind operations and for
> > > + * TLB invalidation. A single atomic commit must:
> > > + *
> > > + * 1. Run a bind job that modifies page tables.
> > > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > > + * completing, so stale translations are flushed before the engines
> > > + * continue.
> > > + *
> > > + * Because all jobs must be armed and pushed inside a signalling region (where
> > > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > > + *
> > > + * // Phase 1 — process context, GFP_KERNEL allowed
> > > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > > + * for_each_mmu(mmu) {
> > > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > > + * // Pre-allocate slot at index 0; real fence not available yet
> > > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > > + * }
> > > + *
> > > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > > + * dma_fence_begin_signalling();
> > > + * drm_dep_job_arm(bind_job);
> > > + * for_each_mmu(mmu) {
> > > + * // Swap sentinel for bind job's finished fence
> > > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > > + * dma_fence_get(bind_job->finished));
> > > + * drm_dep_job_arm(tlb_job[mmu]);
> > > + * }
> > > + * drm_dep_job_push(bind_job);
> > > + * for_each_mmu(mmu)
> > > + * drm_dep_job_push(tlb_job[mmu]);
> > > + * dma_fence_end_signalling();
> > > + *
> > > + * Context: Process context. May allocate memory with GFP_KERNEL.
> > > + * Return: the allocated slot index if @fence is
> > > + * %DRM_DEP_JOB_FENCE_PREALLOC, otherwise 0 on success, or a negative
> > > + * error code.
> > > + */
> >
> > > +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> > > +{
> > > + struct drm_dep_queue *q = job->q;
> > > + struct dma_fence *entry;
> > > + unsigned long index;
> > > + u32 id = 0;
> > > + int ret;
> > > +
> > > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > > + might_alloc(GFP_KERNEL);
> > > +
> > > + if (!fence)
> > > + return 0;
> > > +
> > > + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> > > + goto add_fence;
> > > +
> > > + /*
> > > + * Ignore signalled fences or fences from our own queue — finished
> > > + * fences use q->fence.context.
> > > + */
> > > + if (dma_fence_test_signaled_flag(fence) ||
> > > + fence->context == q->fence.context) {
> > > + dma_fence_put(fence);
> > > + return 0;
> > > + }
> > > +
> > > + /*
> > > + * Deduplicate if we already depend on a fence from the same context.
> > > + * This lets the size of the array of deps scale with the number of
> > > + * engines involved, rather than the number of BOs.
> > > + */
> > > + xa_for_each(&job->dependencies, index, entry) {
> > > + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> > > + entry->context != fence->context)
> > > + continue;
> > > +
> > > + if (dma_fence_is_later(fence, entry)) {
> > > + dma_fence_put(entry);
> > > + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> > > + } else {
> > > + dma_fence_put(fence);
> > > + }
> > > + return 0;
> > > + }
> > > +
> > > +add_fence:
> > > + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> > > + GFP_KERNEL);
> > > + if (ret != 0) {
> > > + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> > > + dma_fence_put(fence);
> > > + return ret;
> > > + }
> > > +
> > > + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> > > +
> > > +/**
> > > + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> > > + * @job: dep job to update
> > > + * @index: xarray index of the slot to replace, as returned when the sentinel
> > > + * was originally inserted via drm_dep_job_add_dependency()
> > > + * @fence: the real dma_fence to store; its reference is always consumed
> > > + *
> > > + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> > > + * @job->dependencies with @fence. The slot must have been pre-allocated by
> > > + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> > > + * existing entry is asserted to be the sentinel.
> > > + *
> > > + * This is the second half of the pre-allocation pattern described in
> > > + * drm_dep_job_add_dependency(). It is intended to be called inside a
> > > + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> > > + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> > > + * internally so it is safe to call from atomic or signalling context, but
> > > + * since the slot has been pre-allocated no actual memory allocation occurs.
> > > + *
> > > + * If @fence is already signalled the slot is erased rather than storing a
> > > + * redundant dependency. The successful store is asserted — if the store
> > > + * fails it indicates a programming error (slot index out of range or
> > > + * concurrent modification).
> > > + *
> > > + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
> >
> > Can’t enforce this in C. Also, how is the fence “consumed”? You can’t enforce that
> > the user can’t access the fence anymore after this function returns, like we can do
> > at compile time in Rust.
> >
I agree—you can’t enforce correct usage at compile time. The best you
can do is document the rules and annotate them. DRM dep will complain
when those rules are violated.
> > > + *
> > > + * Context: Any context. DMA fence signaling path.
> > > + */
> > > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > > + struct dma_fence *fence)
> > > +{
> > > + WARN_ON(xa_load(&job->dependencies, index) !=
> > > + DRM_DEP_JOB_FENCE_PREALLOC);
> > > +
> > > + if (dma_fence_test_signaled_flag(fence)) {
> > > + xa_erase(&job->dependencies, index);
> > > + dma_fence_put(fence);
> > > + return;
> > > + }
> > > +
> > > + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> > > + GFP_NOWAIT)))) {
> > > + dma_fence_put(fence);
> > > + return;
> > > + }
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> > > +
> > > +/**
> > > + * drm_dep_job_add_syncobj_dependency() - adds a syncobj's fence as a
> > > + * job dependency
> > > + * @job: dep job to add the dependencies to
> > > + * @file: drm file private pointer
> > > + * @handle: syncobj handle to lookup
> > > + * @point: timeline point
> > > + *
> > > + * This adds the fence matching the given syncobj to @job.
> > > + *
> > > + * Context: Process context.
> > > + * Return: 0 on success, or a negative error code.
> > > + */
> > > +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> > > + struct drm_file *file, u32 handle,
> > > + u32 point)
> > > +{
> > > + struct dma_fence *fence;
> > > + int ret;
> > > +
> > > + ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + return drm_dep_job_add_dependency(job, fence);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_add_syncobj_dependency);
> > > +
> > > +/**
> > > + * drm_dep_job_add_resv_dependencies() - add all fences from the resv to the job
> > > + * @job: dep job to add the dependencies to
> > > + * @resv: the dma_resv object to get the fences from
> > > + * @usage: the dma_resv_usage to use to filter the fences
> > > + *
> > > + * This adds all fences matching the given usage from @resv to @job.
> > > + * Must be called with the @resv lock held.
> > > + *
> > > + * Context: Process context.
> > > + * Return: 0 on success, or a negative error code.
> > > + */
> > > +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> > > + struct dma_resv *resv,
> > > + enum dma_resv_usage usage)
> > > +{
> > > + struct dma_resv_iter cursor;
> > > + struct dma_fence *fence;
> > > + int ret;
> > > +
> > > + dma_resv_assert_held(resv);
> > > +
> > > + dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> > > + /*
> > > + * As drm_dep_job_add_dependency always consumes the fence
> > > + * reference (even when it fails), and dma_resv_for_each_fence
> > > + * is not obtaining one, we need to grab one before calling.
> > > + */
> > > + ret = drm_dep_job_add_dependency(job, dma_fence_get(fence));
> > > + if (ret)
> > > + return ret;
> > > + }
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_add_resv_dependencies);
> > > +
> > > +/**
> > > + * drm_dep_job_add_implicit_dependencies() - adds implicit dependencies
> > > + * as job dependencies
> > > + * @job: dep job to add the dependencies to
> > > + * @obj: the gem object to add new dependencies from.
> > > + * @write: whether the job might write the object (so we need to depend on
> > > + * shared fences in the reservation object).
> > > + *
> > > + * This should be called after drm_gem_lock_reservations() on your array of
> > > + * GEM objects used in the job but before updating the reservations with your
> > > + * own fences.
> > > + *
> > > + * Context: Process context.
> > > + * Return: 0 on success, or a negative error code.
> > > + */
> > > +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> > > + struct drm_gem_object *obj,
> > > + bool write)
> > > +{
> > > + return drm_dep_job_add_resv_dependencies(job, obj->resv,
> > > + dma_resv_usage_rw(write));
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_add_implicit_dependencies);
> > > +
> > > +/**
> > > + * drm_dep_job_is_signaled() - check whether a dep job has completed
> > > + * @job: dep job to check
> > > + *
> > > + * Determines whether @job has signalled. The queue should be stopped before
> > > + * calling this to obtain a stable snapshot of state. Both the parent hardware
> > > + * fence and the finished software fence are checked.
> > > + *
> > > + * Context: Process context. The queue must be stopped before calling this.
> > > + * Return: true if the job is signalled, false otherwise.
> > > + */
> > > +bool drm_dep_job_is_signaled(struct drm_dep_job *job)
> > > +{
> > > + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> > > + return drm_dep_fence_is_complete(job->dfence);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_is_signaled);
> > > +
> > > +/**
> > > + * drm_dep_job_is_finished() - test whether a dep job's finished fence has signalled
> > > + * @job: dep job to check
> > > + *
> > > + * Tests whether the job's software finished fence has been signalled, using
> > > + * dma_fence_test_signaled_flag() to avoid any signalling side-effects. Unlike
> > > + * drm_dep_job_is_signaled(), this does not require the queue to be stopped and
> > > + * does not check the parent hardware fence — it is a lightweight test of the
> > > + * finished fence only.
> > > + *
> > > + * Context: Any context.
> > > + * Return: true if the job's finished fence has been signalled, false otherwise.
> > > + */
> > > +bool drm_dep_job_is_finished(struct drm_dep_job *job)
> > > +{
> > > + return drm_dep_fence_is_finished(job->dfence);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_is_finished);
> > > +
> > > +/**
> > > + * drm_dep_job_invalidate_job() - increment the invalidation count for a job
> > > + * @job: dep job to invalidate
> > > + * @threshold: threshold above which the job is considered invalidated
> > > + *
> > > + * Increments @job->invalidate_count and returns true if it exceeds @threshold,
> > > + * indicating the job should be considered hung and discarded. The queue must
> > > + * be stopped before calling this function.
> > > + *
> > > + * Context: Process context. The queue must be stopped before calling this.
> > > + * Return: true if @job->invalidate_count exceeds @threshold, false otherwise.
> > > + */
> > > +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold)
> > > +{
> > > + WARN_ON(!drm_dep_queue_is_stopped(job->q));
> > > + return ++job->invalidate_count > threshold;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_invalidate_job);
> > > +
> > > +/**
> > > + * drm_dep_job_finished_fence() - return the finished fence for a job
> > > + * @job: dep job to query
> > > + *
> > > + * No reference is taken on the returned fence; the caller must hold its own
> > > + * reference to @job for the duration of any access.
> >
> > Can’t enforce this in C.
> >
> > > + *
> > > + * Context: Any context.
> > > + * Return: the finished &dma_fence for @job.
> > > + */
> > > +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job)
> > > +{
> > > + return drm_dep_fence_to_dma(job->dfence);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_finished_fence);
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.h b/drivers/gpu/drm/dep/drm_dep_job.h
> > > new file mode 100644
> > > index 000000000000..35c61d258fa1
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_job.h
> > > @@ -0,0 +1,13 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _DRM_DEP_JOB_H_
> > > +#define _DRM_DEP_JOB_H_
> > > +
> > > +struct drm_dep_queue;
> > > +
> > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job);
> > > +
> > > +#endif /* _DRM_DEP_JOB_H_ */
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_queue.c b/drivers/gpu/drm/dep/drm_dep_queue.c
> > > new file mode 100644
> > > index 000000000000..dac02d0d22c4
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_queue.c
> > > @@ -0,0 +1,1647 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice shall be included in
> > > + * all copies or substantial portions of the Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > + *
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +/**
> > > + * DOC: DRM dependency queue
> > > + *
> > > + * The drm_dep subsystem provides a lightweight GPU submission queue that
> > > + * combines the roles of drm_gpu_scheduler and drm_sched_entity into a
> > > + * single object (struct drm_dep_queue). Each queue owns its own ordered
> > > + * submit workqueue, timeout workqueue, and TDR delayed-work.
> > > + *
> > > + * **Job lifecycle**
> > > + *
> > > + * 1. Allocate and initialise a job with drm_dep_job_init().
> > > + * 2. Add dependency fences with drm_dep_job_add_dependency() and friends.
> > > + * 3. Arm the job with drm_dep_job_arm() to obtain its out-fences.
> > > + * 4. Submit with drm_dep_job_push().
> > > + *
> > > + * **Submission paths**
> > > + *
> > > + * drm_dep_job_push() decides between two paths under @q->sched.lock:
> > > + *
> > > + * - **Bypass path** (drm_dep_queue_can_job_bypass()): if
> > > + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the queue is not stopped,
> > > + * the SPSC queue is empty, the job has no dependency fences, and credits
> > > + * are available, the job is submitted inline on the calling thread without
> > > + * touching the submit workqueue.
> > > + *
> > > + * - **Queued path** (drm_dep_queue_push_job()): the job is pushed onto an
> > > + * SPSC queue and the run_job worker is kicked. The run_job worker pops the
> > > + * job, resolves any remaining dependency fences (installing wakeup
> > > + * callbacks for unresolved ones), and calls drm_dep_queue_run_job().
> > > + *
> > > + * **Running a job**
> > > + *
> > > + * drm_dep_queue_run_job() accounts credits, appends the job to the pending
> > > + * list (starting the TDR timer only when the list was previously empty),
> > > + * calls @ops->run_job(), stores the returned hardware fence as the parent
> > > + * of the job's dep fence, then installs a callback on it. When the hardware
> > > + * fence fires (or the job completes synchronously), drm_dep_job_done()
> > > + * signals the finished fence, returns credits, and kicks the put_job worker
> > > + * to free the job.
> > > + *
> > > + * **Timeout detection and recovery (TDR)**
> > > + *
> > > + * A delayed work item fires when a job on the pending list takes longer than
> > > + * @q->job.timeout jiffies. It calls @ops->timedout_job() and acts on the
> > > + * returned status (%DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED or
> > > + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB).
> > > + * drm_dep_queue_trigger_timeout() forces the timer to fire immediately (without
> > > + * changing the stored timeout), for example during device teardown.
> > > + *
> > > + * **Reference counting**
> > > + *
> > > + * Jobs and queues are both reference counted.
> > > + *
> > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > + * job that was submitted to it.
> > > + *
> > > + * The queue holds its own reference to a job for as long as the job is
> > > + * internally tracked: from the moment the job is added to the pending list
> > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > + * worker, which calls drm_dep_job_put() to release that reference.
> >
> > Why not simply keep track that the job was completed, instead of relinquishing
> > the reference? We can then release the reference once the job is cleaned up
> > (by the queue, using a worker) in process context.
I think that’s what I’m doing, while also allowing an opt-in path to
drop the job reference when it signals (in IRQ context) so we avoid
switching to a work item just to drop a ref. That seems like a
significant win in terms of CPU cycles.
> >
> >
> > > + *
> > > + * **Hazard: use-after-free from within a worker**
> > > + *
> > > + * Because a job holds a queue reference, drm_dep_job_put() dropping the last
> > > + * job reference will also drop a queue reference via the job's release path.
> > > + * If that happens to be the last queue reference, drm_dep_queue_fini() can be
> > > + * called, which queues @q->free_work on dep_free_wq and returns immediately.
> > > + * free_work calls disable_work_sync() / disable_delayed_work_sync() on the
> > > + * queue's own workers before destroying its workqueues, so in practice a
> > > + * running worker always completes before the queue memory is freed.
> > > + *
> > > + * However, there is a secondary hazard: a worker can be queued while the
> > > + * queue is in a "zombie" state — refcount has already reached zero and async
> > > + * teardown is in flight, but the work item has not yet been disabled by
> > > + * free_work. To guard against this every worker uses
> > > + * drm_dep_queue_get_unless_zero() at entry; if the refcount is already zero
> > > + * the worker bails immediately without touching the queue state.
> >
> > Again, this problem is gone in Rust.
> >
I answered this one above.
> > > + *
> > > + * Because all actual teardown (disable_*_sync, destroy_workqueue) runs on
> > > + * dep_free_wq — which is independent of the queue's own submit/timeout
> > > + * workqueues — there is no deadlock risk. Each queue holds a drm_dev_get()
> > > + * reference on its owning &drm_device, which is released as the last step of
> > > + * teardown. This ensures the driver module cannot be unloaded while any queue
> > > + * is still alive.
> > > + */
> > > +
> > > +#include <linux/dma-resv.h>
> > > +#include <linux/kref.h>
> > > +#include <linux/module.h>
> > > +#include <linux/overflow.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/wait.h>
> > > +#include <linux/workqueue.h>
> > > +#include <drm/drm_dep.h>
> > > +#include <drm/drm_drv.h>
> > > +#include <drm/drm_print.h>
> > > +#include "drm_dep_fence.h"
> > > +#include "drm_dep_job.h"
> > > +#include "drm_dep_queue.h"
> > > +
> > > +/*
> > > + * Dedicated workqueue for deferred drm_dep_queue teardown. Using a
> > > + * module-private WQ instead of system_percpu_wq keeps teardown isolated
> > > + * from unrelated kernel subsystems.
> > > + */
> > > +static struct workqueue_struct *dep_free_wq;
> > > +
> > > +/**
> > > + * drm_dep_queue_flags_set() - set a flag on the queue under sched.lock
> > > + * @q: dep queue
> > > + * @flag: flag to set (one of &enum drm_dep_queue_flags)
> > > + *
> > > + * Sets @flag in @q->sched.flags. Must be called with @q->sched.lock
> > > + * held; the lockdep assertion enforces this.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > > + */
> > > +static void drm_dep_queue_flags_set(struct drm_dep_queue *q,
> > > + enum drm_dep_queue_flags flag)
> > > +{
> > > + lockdep_assert_held(&q->sched.lock);
> >
> > We can enforce this in Rust at compile-time. The code does not compile if the
> > lock is not taken. Same here and everywhere else where the sched lock has
> > to be taken.
> >
I do understand that part of Rust and agree it is a nice feature.
> >
> > > + q->sched.flags |= flag;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_flags_clear() - clear a flag on the queue under sched.lock
> > > + * @q: dep queue
> > > + * @flag: flag to clear (one of &enum drm_dep_queue_flags)
> > > + *
> > > + * Clears @flag in @q->sched.flags. Must be called with @q->sched.lock
> > > + * held; the lockdep assertion enforces this.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > > + */
> > > +static void drm_dep_queue_flags_clear(struct drm_dep_queue *q,
> > > + enum drm_dep_queue_flags flag)
> > > +{
> > > + lockdep_assert_held(&q->sched.lock);
> > > + q->sched.flags &= ~flag;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_has_credits() - check whether the queue has enough credits
> > > + * @q: dep queue
> > > + * @job: job requesting credits
> > > + *
> > > + * Checks whether the queue has enough available credits to dispatch
> > > + * @job. If @job->credits exceeds the queue's credit limit, it is
> > > + * clamped with a WARN.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > > + * Return: true if available credits >= @job->credits, false otherwise.
> > > + */
> > > +static bool drm_dep_queue_has_credits(struct drm_dep_queue *q,
> > > + struct drm_dep_job *job)
> > > +{
> > > + u32 available;
> > > +
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + if (job->credits > q->credit.limit) {
> > > + drm_warn(q->drm,
> > > + "Jobs may not exceed the credit limit, truncate.\n");
> > > + job->credits = q->credit.limit;
> > > + }
> > > +
> > > + WARN_ON(check_sub_overflow(q->credit.limit,
> > > + atomic_read(&q->credit.count),
> > > + &available));
> > > +
> > > + return available >= job->credits;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_run_job_queue() - kick the run-job worker
> > > + * @q: dep queue
> > > + *
> > > + * Queues @q->sched.run_job on @q->sched.submit_wq unless the queue is stopped
> > > + * or the job queue is empty. The empty-queue check avoids queueing a work item
> > > + * that would immediately return with nothing to do.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_queue_run_job_queue(struct drm_dep_queue *q)
> > > +{
> > > + if (!drm_dep_queue_is_stopped(q) && spsc_queue_count(&q->job.queue))
> > > + queue_work(q->sched.submit_wq, &q->sched.run_job);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_put_job_queue() - kick the put-job worker
> > > + * @q: dep queue
> > > + *
> > > + * Queues @q->sched.put_job on @q->sched.submit_wq unless the queue
> > > + * is stopped.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_queue_put_job_queue(struct drm_dep_queue *q)
> > > +{
> > > + if (!drm_dep_queue_is_stopped(q))
> > > + queue_work(q->sched.submit_wq, &q->sched.put_job);
> > > +}
> > > +
> > > +/**
> > > + * drm_queue_start_timeout() - arm or re-arm the TDR delayed work
> > > + * @q: dep queue
> > > + *
> > > + * Arms the TDR delayed work with @q->job.timeout. No-op if
> > > + * @q->ops->timedout_job is NULL, the timeout is MAX_SCHEDULE_TIMEOUT,
> > > + * or the pending list is empty.
> > > + *
> > > + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> > > + */
> > > +static void drm_queue_start_timeout(struct drm_dep_queue *q)
> > > +{
> > > + lockdep_assert_held(&q->job.lock);
> > > +
> > > + if (!q->ops->timedout_job ||
> > > + q->job.timeout == MAX_SCHEDULE_TIMEOUT ||
> > > + list_empty(&q->job.pending))
> > > + return;
> > > +
> > > + mod_delayed_work(q->sched.timeout_wq, &q->sched.tdr, q->job.timeout);
> > > +}
> > > +
> > > +/**
> > > + * drm_queue_start_timeout_unlocked() - arm TDR, acquiring job.lock
> > > + * @q: dep queue
> > > + *
> > > + * Acquires @q->job.lock with interrupts disabled and calls
> > > + * drm_queue_start_timeout().
> > > + *
> > > + * Context: Process context (workqueue).
> > > + */
> > > +static void drm_queue_start_timeout_unlocked(struct drm_dep_queue *q)
> > > +{
> > > + guard(spinlock_irq)(&q->job.lock);
> > > + drm_queue_start_timeout(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_remove_dependency() - clear the active dependency and wake
> > > + * the run-job worker
> > > + * @q: dep queue
> > > + * @f: the dependency fence being removed
> > > + *
> > > + * Stores @f into @q->dep.removed_fence via smp_store_release() so that the
> > > + * run-job worker can drop the reference to it in drm_dep_queue_is_ready(),
> > > + * paired with smp_load_acquire(). Clears @q->dep.fence and kicks the
> > > + * run-job worker.
> > > + *
> > > + * The fence reference is not dropped here; it is deferred to the run-job
> > > + * worker via @q->dep.removed_fence to keep this path suitable for
> > > + * dma_fence callback removal in drm_dep_queue_kill().
> >
> > This is a comment in C, but in Rust this is encoded directly in the type system.
> >
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_queue_remove_dependency(struct drm_dep_queue *q,
> > > + struct dma_fence *f)
> > > +{
> > > + /* removed_fence must be visible to the reader before &q->dep.fence */
> > > + smp_store_release(&q->dep.removed_fence, f);
> > > +
> > > + WRITE_ONCE(q->dep.fence, NULL);
> > > + drm_dep_queue_run_job_queue(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_wakeup() - dma_fence callback to wake the run-job worker
> > > + * @f: the signalled dependency fence
> > > + * @cb: callback embedded in the dep queue
> > > + *
> > > + * Called from dma_fence_signal() when the active dependency fence signals.
> > > + * Delegates to drm_dep_queue_remove_dependency() to clear @q->dep.fence and
> > > + * kick the run-job worker. The fence reference is not dropped here; it is
> > > + * deferred to the run-job worker via @q->dep.removed_fence.
> >
> > Same here.
> >
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_queue_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
> > > +{
> > > + struct drm_dep_queue *q =
> > > + container_of(cb, struct drm_dep_queue, dep.cb);
> > > +
> > > + drm_dep_queue_remove_dependency(q, f);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_is_ready() - check whether the queue has a dispatchable job
> > > + * @q: dep queue
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> >
> > Can’t call this in Rust if the lock is not taken.
> >
> > > + * Return: true if SPSC queue non-empty and no dep fence pending,
> > > + * false otherwise.
> > > + */
> > > +static bool drm_dep_queue_is_ready(struct drm_dep_queue *q)
> > > +{
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + if (!spsc_queue_count(&q->job.queue))
> > > + return false;
> > > +
> > > + if (READ_ONCE(q->dep.fence))
> > > + return false;
> > > +
> > > + /* Paired with smp_store_release in drm_dep_queue_remove_dependency() */
> > > + dma_fence_put(smp_load_acquire(&q->dep.removed_fence));
> > > +
> > > + q->dep.removed_fence = NULL;
> > > +
> > > + return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_is_killed() - check whether a dep queue has been killed
> > > + * @q: dep queue to check
> > > + *
> > > + * Return: true if %DRM_DEP_QUEUE_FLAGS_KILLED is set on @q, false otherwise.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +bool drm_dep_queue_is_killed(struct drm_dep_queue *q)
> > > +{
> > > + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_KILLED);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_is_killed);
> > > +
> > > +/**
> > > + * drm_dep_queue_is_initialized() - check whether a dep queue has been initialized
> > > + * @q: dep queue to check
> > > + *
> > > + * A queue is considered initialized once its ops pointer has been set by a
> > > + * successful call to drm_dep_queue_init(). Drivers that embed a
> > > + * &drm_dep_queue inside a larger structure may call this before attempting any
> > > + * other queue operation to confirm that initialization has taken place.
> > > + * drm_dep_queue_put() must be called if this function returns true to drop the
> > > + * initialization reference from drm_dep_queue_init().
> > > + *
> > > + * Return: true if @q has been initialized, false otherwise.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q)
> > > +{
> > > + return !!q->ops;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_is_initialized);
> > > +
> > > +/**
> > > + * drm_dep_queue_set_stopped() - pre-mark a queue as stopped before first use
> > > + * @q: dep queue to mark
> > > + *
> > > + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED directly on @q without going through the
> > > + * normal drm_dep_queue_stop() path. This is only valid during the driver-side
> > > + * queue initialisation sequence — i.e. after drm_dep_queue_init() returns but
> > > + * before the queue is made visible to other threads (e.g. before it is added
> > > + * to any lookup structures). Using this after the queue is live is a driver
> > > + * bug; use drm_dep_queue_stop() instead.
> > > + *
> > > + * Context: Process context, queue not yet visible to other threads.
> > > + */
> > > +void drm_dep_queue_set_stopped(struct drm_dep_queue *q)
> > > +{
> > > + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_STOPPED;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_set_stopped);
> > > +
> > > +/**
> > > + * drm_dep_queue_refcount() - read the current reference count of a queue
> > > + * @q: dep queue to query
> > > + *
> > > + * Returns the instantaneous kref value. The count may change immediately
> > > + * after this call; callers must not make safety decisions based solely on
> > > + * the returned value. Intended for diagnostic snapshots and debugfs output.
> > > + *
> > > + * Context: Any context.
> > > + * Return: current reference count.
> > > + */
> > > +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q)
> > > +{
> > > + return kref_read(&q->refcount);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_refcount);
> > > +
> > > +/**
> > > + * drm_dep_queue_timeout() - read the per-job TDR timeout for a queue
> > > + * @q: dep queue to query
> > > + *
> > > + * Returns the per-job timeout in jiffies as set at init time.
> > > + * %MAX_SCHEDULE_TIMEOUT means no timeout is configured.
> > > + *
> > > + * Context: Any context.
> > > + * Return: timeout in jiffies.
> > > + */
> > > +long drm_dep_queue_timeout(const struct drm_dep_queue *q)
> > > +{
> > > + return q->job.timeout;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_timeout);
> > > +
> > > +/**
> > > + * drm_dep_queue_is_job_put_irq_safe() - test whether job-put from IRQ is allowed
> > > + * @q: dep queue
> > > + *
> > > + * Context: Any context.
> > > + * Return: true if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set,
> > > + * false otherwise.
> > > + */
> > > +static bool drm_dep_queue_is_job_put_irq_safe(const struct drm_dep_queue *q)
> > > +{
> > > + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_job_dependency() - get next unresolved dep fence
> > > + * @q: dep queue
> > > + * @job: job whose dependencies to advance
> > > + *
> > > + * Returns NULL immediately if the queue has been killed via
> > > + * drm_dep_queue_kill(), bypassing all dependency waits so that jobs
> > > + * drain through run_job as quickly as possible.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > > + * Return: next unresolved &dma_fence with a new reference, or NULL
> > > + * when all dependencies have been consumed (or the queue is killed).
> > > + */
> > > +static struct dma_fence *
> > > +drm_dep_queue_job_dependency(struct drm_dep_queue *q,
> > > + struct drm_dep_job *job)
> > > +{
> > > + struct dma_fence *f;
> > > +
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + if (drm_dep_queue_is_killed(q))
> > > + return NULL;
> > > +
> > > + f = xa_load(&job->dependencies, job->last_dependency);
> > > + if (f) {
> > > + job->last_dependency++;
> > > + if (WARN_ON(DRM_DEP_JOB_FENCE_PREALLOC == f))
> > > + return dma_fence_get_stub();
> > > + return dma_fence_get(f);
> > > + }
> > > +
> > > + return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_add_dep_cb() - install wakeup callback on dep fence
> > > + * @q: dep queue
> > > + * @job: job whose dependency fence is stored in @q->dep.fence
> > > + *
> > > + * Installs a wakeup callback on @q->dep.fence. Returns true if the
> > > + * callback was installed (the queue must wait), false if the fence is
> > > + * already signalled or is a self-fence from the same queue context.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > > + * Return: true if callback installed, false if fence already done.
> > > + */
> >
> > In Rust, we can encode the signaling paths with a “token type”. So any
> > sections that are part of the signaling path can simply take this token as an
> > argument. This type also enforces that end_signaling() is called automatically when it
> > goes out of scope.
> >
> > By the way, we can easily offer an irq handler type where we enforce this:
> >
> >     fn handle_threaded_irq(&self, device: &Device<Bound>) -> IrqReturn {
> >         let _annotation = DmaFenceSignallingAnnotation::new(); // Calls begin_signaling()
> >         self.driver.handle_threaded_irq(device)
> >
> >         // end_signaling() is called here automatically.
> >     }
> >
> > Same for workqueues:
> >
> >     fn work_fn(&self, device: &Device<Bound>) {
> >         let _annotation = DmaFenceSignallingAnnotation::new(); // Calls begin_signaling()
> >         self.driver.work_fn(device)
> >
> >         // end_signaling() is called here automatically.
> >     }
> >
> > This is not Rust-specific, of course, but it is more ergonomic to write in Rust.
> >
Yes, I agree this is a nice feature, and properly annotating C code
requires discipline.
> > > +static bool drm_dep_queue_add_dep_cb(struct drm_dep_queue *q,
> > > + struct drm_dep_job *job)
> > > +{
> > > + struct dma_fence *fence = q->dep.fence;
> > > +
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + if (WARN_ON(fence->context == q->fence.context)) {
> > > + dma_fence_put(q->dep.fence);
> > > + q->dep.fence = NULL;
> > > + return false;
> > > + }
> > > +
> > > + if (!dma_fence_add_callback(q->dep.fence, &q->dep.cb,
> > > + drm_dep_queue_wakeup))
> > > + return true;
> > > +
> > > + dma_fence_put(q->dep.fence);
> > > + q->dep.fence = NULL;
> > > +
> > > + return false;
> > > +}
> >
> > In rust we can enforce that all callbacks take a reference to the fence
> > automatically. If the callback is “forgotten” in a buggy path, it is
> > automatically removed, and the fence is automatically signaled with -ECANCELED.
> >
> > > +
> > > +/**
> > > + * drm_dep_queue_pop_job() - pop a dispatchable job from the SPSC queue
> > > + * @q: dep queue
> > > + *
> > > + * Peeks at the head of the SPSC queue and drains all resolved
> > > + * dependencies. If a dependency is still pending, installs a wakeup
> > > + * callback and returns NULL. On success pops the job and returns it.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock. DMA fence signaling path.
> > > + * Return: next dispatchable job, or NULL if a dep is still pending.
> > > + */
> > > +static struct drm_dep_job *drm_dep_queue_pop_job(struct drm_dep_queue *q)
> > > +{
> > > + struct spsc_node *node;
> > > + struct drm_dep_job *job;
> > > +
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + node = spsc_queue_peek(&q->job.queue);
> > > + if (!node)
> > > + return NULL;
> > > +
> > > + job = container_of(node, struct drm_dep_job, queue_node);
> > > +
> > > + while ((q->dep.fence = drm_dep_queue_job_dependency(q, job))) {
> > > + if (drm_dep_queue_add_dep_cb(q, job))
> > > + return NULL;
> > > + }
> > > +
> > > + spsc_queue_pop(&q->job.queue);
> > > +
> > > + return job;
> > > +}
> > > +
> > > +/*
> > > + * drm_dep_queue_get_unless_zero() - try to acquire a queue reference
> > > + *
> > > + * Workers use this instead of drm_dep_queue_get() to guard against the zombie
> > > + * state: the queue's refcount has already reached zero (async teardown is in
> > > + * flight) but a work item was queued before free_work had a chance to cancel
> > > + * it. If kref_get_unless_zero() fails the caller must bail immediately.
> > > + *
> > > + * Context: Any context.
> > > + * Returns true if the reference was acquired, false if the queue is zombie.
> > > + */
> >
> > Again, this function is totally gone in Rust.
> >
See above. I don't think it is, given the async teardown flow design.
> > > +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q)
> > > +{
> > > + return kref_get_unless_zero(&q->refcount);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_get_unless_zero);
> > > +
> > > +/**
> > > + * drm_dep_queue_run_job_work() - run-job worker
> > > + * @work: work item embedded in the dep queue
> > > + *
> > > + * Acquires @q->sched.lock, checks stopped state, queue readiness and
> > > + * available credits, pops the next job via drm_dep_queue_pop_job(),
> > > + * dispatches it via drm_dep_queue_run_job(), then re-kicks itself.
> > > + *
> > > + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> > > + * queue is in zombie state (refcount already zero, async teardown in flight).
> > > + *
> > > + * Context: Process context (workqueue). DMA fence signaling path.
> > > + */
> > > +static void drm_dep_queue_run_job_work(struct work_struct *work)
> > > +{
> > > + struct drm_dep_queue *q =
> > > + container_of(work, struct drm_dep_queue, sched.run_job);
> > > + struct spsc_node *node;
> > > + struct drm_dep_job *job;
> > > + bool cookie = dma_fence_begin_signalling();
> > > +
> > > + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> > > + if (!drm_dep_queue_get_unless_zero(q)) {
> > > + dma_fence_end_signalling(cookie);
> > > + return;
> > > + }
> > > +
> > > + mutex_lock(&q->sched.lock);
> > > +
> > > + if (drm_dep_queue_is_stopped(q))
> > > + goto put_queue;
> > > +
> > > + if (!drm_dep_queue_is_ready(q))
> > > + goto put_queue;
> > > +
> > > + /* Peek to check credits before committing to pop and dep resolution */
> > > + node = spsc_queue_peek(&q->job.queue);
> > > + if (!node)
> > > + goto put_queue;
> > > +
> > > + job = container_of(node, struct drm_dep_job, queue_node);
> > > + if (!drm_dep_queue_has_credits(q, job))
> > > + goto put_queue;
> > > +
> > > + job = drm_dep_queue_pop_job(q);
> > > + if (!job)
> > > + goto put_queue;
> > > +
> > > + drm_dep_queue_run_job(q, job);
> > > + drm_dep_queue_run_job_queue(q);
> > > +
> > > +put_queue:
> > > + mutex_unlock(&q->sched.lock);
> > > + drm_dep_queue_put(q);
> > > + dma_fence_end_signalling(cookie);
> > > +}
> > > +
> > > +/*
> > > + * drm_dep_queue_remove_job() - unlink a job from the pending list and reset TDR
> > > + * @q: dep queue owning @job
> > > + * @job: job to remove
> > > + *
> > > + * Splices @job out of @q->job.pending, cancels any pending TDR delayed work,
> > > + * and arms the timeout for the new list head (if any).
> > > + *
> > > + * Context: Process context. Must hold @q->job.lock. DMA fence signaling path.
> > > + */
> > > +static void drm_dep_queue_remove_job(struct drm_dep_queue *q,
> > > + struct drm_dep_job *job)
> > > +{
> > > + lockdep_assert_held(&q->job.lock);
> > > +
> > > + list_del_init(&job->pending_link);
> > > + cancel_delayed_work(&q->sched.tdr);
> > > + drm_queue_start_timeout(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_get_finished_job() - dequeue a finished job
> > > + * @q: dep queue
> > > + *
> > > + * Under @q->job.lock checks the head of the pending list for a
> > > + * finished dep fence. If found, removes the job from the list,
> > > + * cancels the TDR, and re-arms it for the new head.
> > > + *
> > > + * Context: Process context (workqueue). DMA fence signaling path.
> > > + * Return: the finished &drm_dep_job, or NULL if none is ready.
> > > + */
> > > +static struct drm_dep_job *
> > > +drm_dep_queue_get_finished_job(struct drm_dep_queue *q)
> > > +{
> > > + struct drm_dep_job *job;
> > > +
> > > + guard(spinlock_irq)(&q->job.lock);
> > > +
> > > + job = list_first_entry_or_null(&q->job.pending, struct drm_dep_job,
> > > + pending_link);
> > > + if (job && drm_dep_fence_is_finished(job->dfence))
> > > + drm_dep_queue_remove_job(q, job);
> > > + else
> > > + job = NULL;
> > > +
> > > + return job;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_put_job_work() - put-job worker
> > > + * @work: work item embedded in the dep queue
> > > + *
> > > + * Drains all finished jobs by calling drm_dep_job_put() in a loop,
> > > + * then kicks the run-job worker.
> > > + *
> > > + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> > > + * queue is in zombie state (refcount already zero, async teardown in flight).
> > > + *
> > > + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > + * because workqueue is shared with other items in the fence signaling path.
> > > + *
> > > + * Context: Process context (workqueue). DMA fence signaling path.
> > > + */
> > > +static void drm_dep_queue_put_job_work(struct work_struct *work)
> > > +{
> > > + struct drm_dep_queue *q =
> > > + container_of(work, struct drm_dep_queue, sched.put_job);
> > > + struct drm_dep_job *job;
> > > + bool cookie = dma_fence_begin_signalling();
> > > +
> > > + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> > > + if (!drm_dep_queue_get_unless_zero(q)) {
> > > + dma_fence_end_signalling(cookie);
> > > + return;
> > > + }
> > > +
> > > + while ((job = drm_dep_queue_get_finished_job(q)))
> > > + drm_dep_job_put(job);
> > > +
> > > + drm_dep_queue_run_job_queue(q);
> > > +
> > > + drm_dep_queue_put(q);
> > > + dma_fence_end_signalling(cookie);
> > > +}
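
The put-job worker drains only *finished* jobs, and only from the head of the pending list, so an unfinished head blocks later finished jobs until it completes. A userspace sketch of that drain loop on a singly linked list (hypothetical names; a model, not the kernel implementation):

```c
/*
 * Sketch of the put-job drain loop: a job is dequeued only when it is
 * both finished and at the head of the pending list.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pending_job {
	bool finished;
	struct pending_job *next;
};

/* Analogue of drm_dep_queue_get_finished_job() on a linked list. */
static struct pending_job *get_finished_head(struct pending_job **head)
{
	struct pending_job *job = *head;

	if (!job || !job->finished)
		return NULL;

	*head = job->next;	/* unlink, as the remove path does */
	return job;
}

/* The worker body: drain every finished head job, then stop. */
static int drain(struct pending_job **head)
{
	int put = 0;

	while (get_finished_head(head))
		put++;		/* stands in for drm_dep_job_put() */

	return put;
}
```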
> > > +
> > > +/**
> > > + * drm_dep_queue_tdr_work() - TDR worker
> > > + * @work: work item embedded in the delayed TDR work
> > > + *
> > > + * Removes the head job from the pending list under @q->job.lock,
> > > + * asserts @q->ops->timedout_job is non-NULL, calls it outside the lock,
> > > + * requeues the job if %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB, drops the
> > > + * queue's job reference on %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED, and always
> > > + * restarts the TDR timer after handling the job (unless @q is stopping).
> > > + * Any other return value triggers a WARN.
> > > + *
> > > + * The TDR is never armed when @q->ops->timedout_job is NULL, so firing
> > > + * this worker without a timedout_job callback is a driver bug.
> > > + *
> > > + * Uses drm_dep_queue_get_unless_zero() at entry and bails immediately if the
> > > + * queue is in zombie state (refcount already zero, async teardown in flight).
> > > + *
> > > + * Wraps execution in dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > + * because timedout_job() is expected to signal the guilty job's fence as part
> > > + * of reset.
> > > + *
> > > + * Context: Process context (workqueue). DMA fence signaling path.
> > > + */
> > > +static void drm_dep_queue_tdr_work(struct work_struct *work)
> > > +{
> > > + struct drm_dep_queue *q =
> > > + container_of(work, struct drm_dep_queue, sched.tdr.work);
> > > + struct drm_dep_job *job;
> > > + bool cookie = dma_fence_begin_signalling();
> > > +
> > > + /* Bail if queue is zombie (refcount already zero, teardown in flight). */
> > > + if (!drm_dep_queue_get_unless_zero(q)) {
> > > + dma_fence_end_signalling(cookie);
> > > + return;
> > > + }
> > > +
> > > + scoped_guard(spinlock_irq, &q->job.lock) {
> > > + job = list_first_entry_or_null(&q->job.pending,
> > > + struct drm_dep_job,
> > > + pending_link);
> > > + if (job)
> > > + /*
> > > + * Remove from pending so it cannot be freed
> > > + * concurrently by drm_dep_queue_get_finished_job() or
> > > + * drm_dep_job_done().
> > > + */
> > > + list_del_init(&job->pending_link);
> > > + }
> > > +
> > > + if (job) {
> > > + enum drm_dep_timedout_stat status;
> > > +
> > > + if (WARN_ON(!q->ops->timedout_job)) {
> > > + drm_dep_job_put(job);
> > > + goto out;
> > > + }
> > > +
> > > + status = q->ops->timedout_job(job);
> > > +
> > > + switch (status) {
> > > + case DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB:
> > > + scoped_guard(spinlock_irq, &q->job.lock)
> > > + list_add(&job->pending_link, &q->job.pending);
> > > + drm_dep_queue_put_job_queue(q);
> > > + break;
> > > + case DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED:
> > > + drm_dep_job_put(job);
> > > + break;
> > > + default:
> > > + WARN(1, "invalid drm_dep_timedout_stat\n");
> > > + break;
> > > + }
> > > + }
> > > +
> > > +out:
> > > + drm_queue_start_timeout_unlocked(q);
> > > + drm_dep_queue_put(q);
> > > + dma_fence_end_signalling(cookie);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_alloc_submit_wq() - allocate an ordered submit workqueue
> > > + * @name: name for the workqueue
> > > + * @flags: DRM_DEP_QUEUE_FLAGS_* flags
> > > + *
> > > + * Allocates an ordered workqueue for job submission with %WQ_MEM_RECLAIM and
> > > + * %WQ_MEM_WARN_ON_RECLAIM set, ensuring the workqueue is safe to use from
> > > + * memory reclaim context and properly annotated for lockdep taint tracking.
> > > + * Adds %WQ_HIGHPRI if %DRM_DEP_QUEUE_FLAGS_HIGHPRI is set. When
> > > + * CONFIG_LOCKDEP is enabled, uses a dedicated lockdep map for annotation.
> > > + *
> > > + * Context: Process context.
> > > + * Return: the new &workqueue_struct, or NULL on failure.
> > > + */
> > > +static struct workqueue_struct *
> > > +drm_dep_alloc_submit_wq(const char *name, enum drm_dep_queue_flags flags)
> > > +{
> > > + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> > > +
> > > + if (flags & DRM_DEP_QUEUE_FLAGS_HIGHPRI)
> > > + wq_flags |= WQ_HIGHPRI;
> > > +
> > > +#if IS_ENABLED(CONFIG_LOCKDEP)
> > > + static struct lockdep_map map = {
> > > + .name = "drm_dep_submit_lockdep_map"
> > > + };
> > > + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> > > +#else
> > > + return alloc_ordered_workqueue(name, wq_flags);
> > > +#endif
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_alloc_timeout_wq() - allocate an ordered TDR workqueue
> > > + * @name: name for the workqueue
> > > + *
> > > + * Allocates an ordered workqueue for timeout detection and recovery with
> > > + * %WQ_MEM_RECLAIM and %WQ_MEM_WARN_ON_RECLAIM set, ensuring consistent taint
> > > + * annotation with the submit workqueue. When CONFIG_LOCKDEP is enabled, uses
> > > + * a dedicated lockdep map for annotation.
> > > + *
> > > + * Context: Process context.
> > > + * Return: the new &workqueue_struct, or NULL on failure.
> > > + */
> > > +static struct workqueue_struct *drm_dep_alloc_timeout_wq(const char *name)
> > > +{
> > > + unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_MEM_WARN_ON_RECLAIM;
> > > +
> > > +#if IS_ENABLED(CONFIG_LOCKDEP)
> > > + static struct lockdep_map map = {
> > > + .name = "drm_dep_timeout_lockdep_map"
> > > + };
> > > + return alloc_ordered_workqueue_lockdep_map(name, wq_flags, &map);
> > > +#else
> > > + return alloc_ordered_workqueue(name, wq_flags);
> > > +#endif
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_init() - initialize a dep queue
> > > + * @q: dep queue to initialize
> > > + * @args: initialization arguments
> > > + *
> > > + * Initializes all fields of @q from @args. If @args->submit_wq is NULL an
> > > + * ordered workqueue is allocated and owned by the queue
> > > + * (%DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ). If @args->timeout_wq is NULL an
> > > + * ordered workqueue is allocated and owned by the queue
> > > + * (%DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ). On success the queue holds one kref
> > > + * reference and drm_dep_queue_put() must be called to drop this reference
> > > + * (i.e., drivers cannot directly free the queue).
> > > + *
> > > + * When CONFIG_LOCKDEP is enabled, @q->sched.lock is primed against the
> > > + * fs_reclaim pseudo-lock so that lockdep can detect any lock ordering
> > > + * inversion between @sched.lock and memory reclaim.
> > > + *
> > > + * Return: 0 on success, %-EINVAL when @args->credit_limit is zero, @args->ops
> > > + * is NULL, @args->drm is NULL, @args->ops->run_job is NULL, or when
> > > + * @args->submit_wq or @args->timeout_wq is non-NULL but was not allocated with
> > > + * %WQ_MEM_WARN_ON_RECLAIM; %-ENOMEM when workqueue allocation fails.
> > > + *
> > > + * Context: Process context. May allocate memory and create workqueues.
> > > + */
> > > +int drm_dep_queue_init(struct drm_dep_queue *q,
> > > + const struct drm_dep_queue_init_args *args)
> > > +{
> > > + if (!args->credit_limit || !args->drm || !args->ops ||
> > > + !args->ops->run_job)
> > > + return -EINVAL;
> > > +
> > > + if (args->submit_wq && !workqueue_is_reclaim_annotated(args->submit_wq))
> > > + return -EINVAL;
> > > +
> > > + if (args->timeout_wq &&
> > > + !workqueue_is_reclaim_annotated(args->timeout_wq))
> > > + return -EINVAL;
> > > +
> > > + memset(q, 0, sizeof(*q));
> > > +
> > > + q->name = args->name;
> > > + q->drm = args->drm;
> > > + q->credit.limit = args->credit_limit;
> > > + q->job.timeout = args->timeout ? args->timeout : MAX_SCHEDULE_TIMEOUT;
> > > +
> > > + init_rcu_head(&q->rcu);
> > > + INIT_LIST_HEAD(&q->job.pending);
> > > + spin_lock_init(&q->job.lock);
> > > + spsc_queue_init(&q->job.queue);
> > > +
> > > + mutex_init(&q->sched.lock);
> > > + if (IS_ENABLED(CONFIG_LOCKDEP)) {
> > > + fs_reclaim_acquire(GFP_KERNEL);
> > > + might_lock(&q->sched.lock);
> > > + fs_reclaim_release(GFP_KERNEL);
> > > + }
> > > +
> > > + if (args->submit_wq) {
> > > + q->sched.submit_wq = args->submit_wq;
> > > + } else {
> > > + q->sched.submit_wq = drm_dep_alloc_submit_wq(args->name ?: "drm_dep",
> > > + args->flags);
> > > + if (!q->sched.submit_wq)
> > > + return -ENOMEM;
> > > +
> > > + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ;
> > > + }
> > > +
> > > + if (args->timeout_wq) {
> > > + q->sched.timeout_wq = args->timeout_wq;
> > > + } else {
> > > + q->sched.timeout_wq = drm_dep_alloc_timeout_wq(args->name ?: "drm_dep");
> > > + if (!q->sched.timeout_wq)
> > > + goto err_submit_wq;
> > > +
> > > + q->sched.flags |= DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ;
> > > + }
> > > +
> > > + q->sched.flags |= args->flags &
> > > + ~(DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ |
> > > + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ);
> > > +
> > > + INIT_DELAYED_WORK(&q->sched.tdr, drm_dep_queue_tdr_work);
> > > + INIT_WORK(&q->sched.run_job, drm_dep_queue_run_job_work);
> > > + INIT_WORK(&q->sched.put_job, drm_dep_queue_put_job_work);
> > > +
> > > + q->fence.context = dma_fence_context_alloc(1);
> > > +
> > > + kref_init(&q->refcount);
> > > + q->ops = args->ops;
> > > + drm_dev_get(q->drm);
> > > +
> > > + return 0;
> > > +
> > > +err_submit_wq:
> > > + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> > > + destroy_workqueue(q->sched.submit_wq);
> > > + mutex_destroy(&q->sched.lock);
> > > +
> > > + return -ENOMEM;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_init);
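
Note how init merges caller-supplied flags while masking out the ownership bits: OWN_SUBMIT_WQ / OWN_TIMEDOUT_WQ are derived solely from whether the queue allocated its own workqueues, and a caller cannot forge them. A minimal model of that masking (flag values here are hypothetical):

```c
/*
 * Model of the ownership-flag masking at the end of init: caller flags
 * are merged, but the OWN_*_WQ bits are always set internally and never
 * copied from the caller.
 */
#include <assert.h>

#define M_FLAGS_HIGHPRI		(1u << 0)
#define M_FLAGS_OWN_SUBMIT_WQ	(1u << 1)
#define M_FLAGS_OWN_TIMEDOUT_WQ	(1u << 2)

static unsigned int merge_flags(unsigned int internal, unsigned int caller)
{
	/* Strip ownership bits from the caller's flags before merging. */
	return internal | (caller & ~(M_FLAGS_OWN_SUBMIT_WQ |
				      M_FLAGS_OWN_TIMEDOUT_WQ));
}
```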
> > > +
> > > +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> > > +/**
> > > + * drm_dep_queue_push_job_begin() - mark the start of an arm/push critical section
> > > + * @q: dep queue the job belongs to
> > > + *
> > > + * Called at the start of drm_dep_job_arm() and warns if the push context is
> > > + * already owned by another task, which would indicate concurrent arm/push on
> > > + * the same queue.
> > > + *
> > > + * No-op when CONFIG_PROVE_LOCKING is disabled.
> > > + *
> > > + * Context: Process context. DMA fence signaling path.
> > > + */
> > > +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> > > +{
> > > + WARN_ON(q->job.push.owner);
> > > + q->job.push.owner = current;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_push_job_end() - mark the end of an arm/push critical section
> > > + * @q: dep queue the job belongs to
> > > + *
> > > + * Called at the end of drm_dep_job_push() and warns if the push context is not
> > > + * owned by the current task, which would indicate a mismatched begin/end pair
> > > + * or a push from the wrong thread.
> > > + *
> > > + * No-op when CONFIG_PROVE_LOCKING is disabled.
> > > + *
> > > + * Context: Process context. DMA fence signaling path.
> > > + */
> > > +void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> > > +{
> > > + WARN_ON(q->job.push.owner != current);
> > > + q->job.push.owner = NULL;
> > > +}
> > > +#endif
> > > +
> > > +/**
> > > + * drm_dep_queue_assert_teardown_invariants() - assert teardown invariants
> > > + * @q: dep queue being torn down
> > > + *
> > > + * Warns if the pending-job list, the SPSC submission queue, or the credit
> > > + * counter is non-zero when called, or if the queue still has a non-zero
> > > + * reference count.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_queue_assert_teardown_invariants(struct drm_dep_queue *q)
> > > +{
> > > + WARN_ON(!list_empty(&q->job.pending));
> > > + WARN_ON(spsc_queue_count(&q->job.queue));
> > > + WARN_ON(atomic_read(&q->credit.count));
> > > + WARN_ON(drm_dep_queue_refcount(q));
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_release() - final internal cleanup of a dep queue
> > > + * @q: dep queue to clean up
> > > + *
> > > + * Asserts teardown invariants and destroys internal resources allocated by
> > > + * drm_dep_queue_init() that cannot be torn down earlier in the teardown
> > > + * sequence. Currently this destroys @q->sched.lock.
> > > + *
> > > + * Drivers that implement &drm_dep_queue_ops.release **must** call this
> > > + * function after removing @q from any internal bookkeeping (e.g. lookup
> > > + * tables or lists) but before freeing the memory that contains @q. When
> > > + * &drm_dep_queue_ops.release is NULL, drm_dep follows the default teardown
> > > + * path and calls this function automatically.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_queue_release(struct drm_dep_queue *q)
> > > +{
> > > + drm_dep_queue_assert_teardown_invariants(q);
> > > + mutex_destroy(&q->sched.lock);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_release);
> > > +
> > > +/**
> > > + * drm_dep_queue_free() - final cleanup of a dep queue
> > > + * @q: dep queue to free
> > > + *
> > > + * Invokes &drm_dep_queue_ops.release if set, in which case the driver is
> > > + * responsible for calling drm_dep_queue_release() and freeing @q itself.
> > > + * If &drm_dep_queue_ops.release is NULL, calls drm_dep_queue_release()
> > > + * and then frees @q with kfree_rcu().
> > > + *
> > > + * In either case, releases the drm_dev_get() reference taken at init time
> > > + * via drm_dev_put(), allowing the owning &drm_device to be unloaded once
> > > + * all queues have been freed.
> > > + *
> > > + * Context: Process context (workqueue), reclaim safe.
> > > + */
> > > +static void drm_dep_queue_free(struct drm_dep_queue *q)
> > > +{
> > > + struct drm_device *drm = q->drm;
> > > +
> > > + if (q->ops->release) {
> > > + q->ops->release(q);
> > > + } else {
> > > + drm_dep_queue_release(q);
> > > + kfree_rcu(q, rcu);
> > > + }
> > > + drm_dev_put(drm);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_free_work() - deferred queue teardown worker
> > > + * @work: free_work item embedded in the dep queue
> > > + *
> > > + * Runs on dep_free_wq. Disables all work items synchronously
> > > + * (preventing re-queue and waiting for in-flight instances),
> > > + * destroys any owned workqueues, then calls drm_dep_queue_free().
> > > + * Running on dep_free_wq ensures destroy_workqueue() is never
> > > + * called from within one of the queue's own workers (deadlock)
> > > + * and disable_*_sync() cannot deadlock either.
> > > + *
> > > + * Context: Process context (workqueue), reclaim safe.
> > > + */
> > > +static void drm_dep_queue_free_work(struct work_struct *work)
> > > +{
> > > + struct drm_dep_queue *q =
> > > + container_of(work, struct drm_dep_queue, free_work);
> > > +
> > > + drm_dep_queue_assert_teardown_invariants(q);
> > > +
> > > + disable_delayed_work_sync(&q->sched.tdr);
> > > + disable_work_sync(&q->sched.run_job);
> > > + disable_work_sync(&q->sched.put_job);
> > > +
> > > + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ)
> > > + destroy_workqueue(q->sched.timeout_wq);
> > > +
> > > + if (q->sched.flags & DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ)
> > > + destroy_workqueue(q->sched.submit_wq);
> > > +
> > > + drm_dep_queue_free(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_fini() - tear down a dep queue
> > > + * @q: dep queue to tear down
> > > + *
> > > + * Asserts teardown invariants and initiates teardown of @q by queuing the
> > > + * deferred free work onto the module-private dep_free_wq workqueue. The work
> > > + * item disables any pending TDR and run/put-job work synchronously, destroys
> > > + * any workqueues that were allocated by drm_dep_queue_init(), and then releases
> > > + * the queue memory.
> > > + *
> > > + * Running teardown from dep_free_wq ensures that destroy_workqueue() is never
> > > + * called from within one of the queue's own workers (e.g. via
> > > + * drm_dep_queue_put()), which would deadlock.
> > > + *
> > > + * Drivers can wait for all outstanding deferred work to complete by waiting
> > > + * for the last drm_dev_put() reference on their &drm_device, which is
> > > + * released as the final step of each queue's teardown.
> > > + *
> > > + * Drivers that implement &drm_dep_queue_ops.fini **must** call this
> > > + * function after removing @q from any device bookkeeping but before freeing the
> > > + * memory that contains @q. When &drm_dep_queue_ops.fini is NULL, drm_dep
> > > + * follows the default teardown path and calls this function automatically.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_queue_fini(struct drm_dep_queue *q)
> > > +{
> > > + drm_dep_queue_assert_teardown_invariants(q);
> > > +
> > > + INIT_WORK(&q->free_work, drm_dep_queue_free_work);
> > > + queue_work(dep_free_wq, &q->free_work);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_fini);
> > > +
> > > +/**
> > > + * drm_dep_queue_get() - acquire a reference to a dep queue
> > > + * @q: dep queue to acquire a reference on, or NULL
> > > + *
> > > + * Return: @q with an additional reference held, or NULL if @q is NULL.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q)
> > > +{
> > > + if (q)
> > > + kref_get(&q->refcount);
> > > + return q;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_get);
> > > +
> > > +/**
> > > + * __drm_dep_queue_release() - kref release callback for a dep queue
> > > + * @kref: kref embedded in the dep queue
> > > + *
> > > + * Calls &drm_dep_queue_ops.fini if set, otherwise calls
> > > + * drm_dep_queue_fini() to initiate deferred teardown.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void __drm_dep_queue_release(struct kref *kref)
> > > +{
> > > + struct drm_dep_queue *q =
> > > + container_of(kref, struct drm_dep_queue, refcount);
> > > +
> > > + if (q->ops->fini)
> > > + q->ops->fini(q);
> > > + else
> > > + drm_dep_queue_fini(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_put() - release a reference to a dep queue
> > > + * @q: dep queue to release a reference on, or NULL
> > > + *
> > > + * When the last reference is dropped, calls &drm_dep_queue_ops.fini if set,
> > > + * otherwise calls drm_dep_queue_fini(). Final memory release is handled by
> > > + * &drm_dep_queue_ops.release (which must call drm_dep_queue_release()) if set,
> > > + * or drm_dep_queue_release() followed by kfree_rcu() otherwise.
> > > + * Does nothing if @q is NULL.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_queue_put(struct drm_dep_queue *q)
> > > +{
> > > + if (q)
> > > + kref_put(&q->refcount, __drm_dep_queue_release);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_put);
> > > +
> > > +/**
> > > + * drm_dep_queue_stop() - stop a dep queue from processing new jobs
> > > + * @q: dep queue to stop
> > > + *
> > > + * Sets %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> > > + * and @q->job.lock (spinlock_irq), making the flag safe to test from fence
> > > + * signaling context. Then cancels any in-flight run_job and put_job work
> > > + * items. Once stopped, the bypass path and the submit workqueue will not
> > > + * dispatch further jobs nor will any jobs be removed from the pending list.
> > > + * Call drm_dep_queue_start() to resume processing.
> > > + *
> > > + * Context: Process context. Waits for in-flight workers to complete.
> > > + */
> > > +void drm_dep_queue_stop(struct drm_dep_queue *q)
> > > +{
> > > + scoped_guard(mutex, &q->sched.lock) {
> > > + scoped_guard(spinlock_irq, &q->job.lock)
> > > + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> > > + }
> > > + cancel_work_sync(&q->sched.run_job);
> > > + cancel_work_sync(&q->sched.put_job);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_stop);
> > > +
> > > +/**
> > > + * drm_dep_queue_start() - resume a stopped dep queue
> > > + * @q: dep queue to start
> > > + *
> > > + * Clears %DRM_DEP_QUEUE_FLAGS_STOPPED on @q under both @q->sched.lock (mutex)
> > > + * and @q->job.lock (spinlock_irq), making the flag safe to test from IRQ
> > > + * context. Then re-queues the run_job and put_job work items so that any jobs
> > > + * pending since the queue was stopped are processed. Must only be called after
> > > + * drm_dep_queue_stop().
> > > + *
> > > + * Context: Process context.
> > > + */
> > > +void drm_dep_queue_start(struct drm_dep_queue *q)
> > > +{
> > > + scoped_guard(mutex, &q->sched.lock) {
> > > + scoped_guard(spinlock_irq, &q->job.lock)
> > > + drm_dep_queue_flags_clear(q, DRM_DEP_QUEUE_FLAGS_STOPPED);
> > > + }
> > > + drm_dep_queue_run_job_queue(q);
> > > + drm_dep_queue_put_job_queue(q);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_start);
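
The stop/start pair gates all enqueue paths on the STOPPED flag and re-kicks the workers on start so that anything submitted while stopped is picked up. A tiny userspace model of that gating (hypothetical names, userspace only):

```c
/*
 * Model of the STOPPED gating: while the flag is set no new work is
 * accepted; start clears the flag and re-kicks the workers, like the
 * run_job/put_job requeue in drm_dep_queue_start().
 */
#include <assert.h>
#include <stdbool.h>

struct model_q {
	bool stopped;
	int queued;	/* number of work items accepted */
};

static bool model_work_enqueue(struct model_q *q)
{
	if (q->stopped)
		return false;
	q->queued++;
	return true;
}

static void model_stop(struct model_q *q)
{
	q->stopped = true;
}

static void model_start(struct model_q *q)
{
	q->stopped = false;
	model_work_enqueue(q);	/* re-kick so pending work is processed */
}
```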
> > > +
> > > +/**
> > > + * drm_dep_queue_trigger_timeout() - trigger the TDR immediately for
> > > + * all pending jobs
> > > + * @q: dep queue to trigger timeout on
> > > + *
> > > + * Sets @q->job.timeout to 1 and arms the TDR delayed work with a one-jiffy
> > > + * delay, causing it to fire almost immediately without hot-spinning at zero
> > > + * delay. This is used to force-expire any pendind jobs on the queue, for
> > > + * example when the device is being torn down or has encountered an
> > > + * unrecoverable error.
> > > + *
> > > + * It is suggested that when this function is used, the first timedout_job call
> > > + * causes the driver to kick the queue off the hardware and signal all pending
> > > + * job fences. Subsequent calls continue to signal all pending job fences.
> > > + *
> > > + * Has no effect if the pending list is empty.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q)
> > > +{
> > > + guard(spinlock_irqsave)(&q->job.lock);
> > > + q->job.timeout = 1;
> > > + drm_queue_start_timeout(q);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_trigger_timeout);
> > > +
> > > +/**
> > > + * drm_dep_queue_cancel_tdr_sync() - cancel any pending TDR and wait
> > > + * for it to finish
> > > + * @q: dep queue whose TDR to cancel
> > > + *
> > > + * Cancels the TDR delayed work item if it has not yet started, and waits for
> > > + * it to complete if it is already running. After this call returns, the TDR
> > > + * worker is guaranteed not to be executing and will not fire again until
> > > + * explicitly rearmed (e.g. via drm_dep_queue_resume_timeout() or by a new
> > > + * job being submitted).
> > > + *
> > > + * Useful during error recovery or queue teardown when the caller needs to
> > > + * know that no timeout handling races with its own reset logic.
> > > + *
> > > + * Context: Process context. May sleep waiting for the TDR worker to finish.
> > > + */
> > > +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q)
> > > +{
> > > + cancel_delayed_work_sync(&q->sched.tdr);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_cancel_tdr_sync);
> > > +
> > > +/**
> > > + * drm_dep_queue_resume_timeout() - restart the TDR timer with the
> > > + * configured timeout
> > > + * @q: dep queue to resume the timeout for
> > > + *
> > > + * Restarts the TDR delayed work using @q->job.timeout. Called after device
> > > + * recovery to give pending jobs a fresh full timeout window. Has no effect
> > > + * if the pending list is empty.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q)
> > > +{
> > > + drm_queue_start_timeout_unlocked(q);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_resume_timeout);
> > > +
> > > +/**
> > > + * drm_dep_queue_is_stopped() - check whether a dep queue is stopped
> > > + * @q: dep queue to check
> > > + *
> > > + * Return: true if %DRM_DEP_QUEUE_FLAGS_STOPPED is set on @q, false otherwise.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q)
> > > +{
> > > + return !!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_STOPPED);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_is_stopped);
> > > +
> > > +/**
> > > + * drm_dep_queue_kill() - kill a dep queue and flush all pending jobs
> > > + * @q: dep queue to kill
> > > + *
> > > + * Sets %DRM_DEP_QUEUE_FLAGS_KILLED on @q under @q->sched.lock. If a
> > > + * dependency fence is currently being waited on, its callback is removed and
> > > + * the run-job worker is kicked immediately so that the blocked job drains
> > > + * without waiting.
> > > + *
> > > + * Once killed, drm_dep_queue_job_dependency() returns NULL for all jobs,
> > > + * bypassing dependency waits so that every queued job drains through
> > > + * &drm_dep_queue_ops.run_job without blocking.
> > > + *
> > > + * The &drm_dep_queue_ops.run_job callback is guaranteed to be called for every
> > > + * job that was pushed before or after drm_dep_queue_kill(), even during queue
> > > + * teardown. Drivers should use this guarantee to perform any necessary
> > > + * bookkeeping cleanup without executing the actual backend operation when the
> > > + * queue is killed.
> > > + *
> > > + * Unlike drm_dep_queue_stop(), killing is one-way: there is no corresponding
> > > + * start function.
> > > + *
> > > + * **Driver safety requirement**
> > > + *
> > > + * drm_dep_queue_kill() must only be called once the driver can guarantee that
> > > + * no job in the queue will touch memory associated with any of its fences
> > > + * (i.e., the queue has been removed from the device and will never be put back
> > > + * on).
> > > + *
> > > + * Context: Process context.
> > > + */
> > > +void drm_dep_queue_kill(struct drm_dep_queue *q)
> > > +{
> > > + scoped_guard(mutex, &q->sched.lock) {
> > > + struct dma_fence *fence;
> > > +
> > > + drm_dep_queue_flags_set(q, DRM_DEP_QUEUE_FLAGS_KILLED);
> > > +
> > > + /*
> > > + * Holding &q->sched.lock guarantees that the run-job work item
> > > + * cannot drop its reference to q->dep.fence concurrently, so
> > > + * reading q->dep.fence here is safe.
> > > + */
> > > + fence = READ_ONCE(q->dep.fence);
> > > + if (fence && dma_fence_remove_callback(fence, &q->dep.cb))
> > > + drm_dep_queue_remove_dependency(q, fence);
> > > + }
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_kill);
> > > +
> > > +/**
> > > + * drm_dep_queue_submit_wq() - retrieve the submit workqueue of a dep queue
> > > + * @q: dep queue whose workqueue to retrieve
> > > + *
> > > + * Drivers may use this to queue their own work items alongside the queue's
> > > + * internal run-job and put-job workers — for example to process incoming
> > > + * messages in the same serialisation domain.
> > > + *
> > > + * Prefer drm_dep_queue_work_enqueue() when the only need is to enqueue a
> > > + * work item, as it additionally checks the stopped state. Use this accessor
> > > + * when the workqueue itself is required (e.g. for alloc_ordered_workqueue
> > > + * replacement or drain_workqueue calls).
> > > + *
> > > + * Context: Any context.
> > > + * Return: the &workqueue_struct used by @q for job submission.
> > > + */
> > > +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q)
> > > +{
> > > + return q->sched.submit_wq;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_submit_wq);
> > > +
> > > +/**
> > > + * drm_dep_queue_timeout_wq() - retrieve the timeout workqueue of a dep queue
> > > + * @q: dep queue whose workqueue to retrieve
> > > + *
> > > + * Returns the workqueue used by @q to run TDR (timeout detection and recovery)
> > > + * work. Drivers may use this to queue their own timeout-domain work items, or
> > > + * to call drain_workqueue() when tearing down and needing to ensure all pending
> > > + * timeout callbacks have completed before proceeding.
> > > + *
> > > + * Context: Any context.
> > > + * Return: the &workqueue_struct used by @q for TDR work.
> > > + */
> > > +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q)
> > > +{
> > > + return q->sched.timeout_wq;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_timeout_wq);
> > > +
> > > +/**
> > > + * drm_dep_queue_work_enqueue() - queue work on the dep queue's submit workqueue
> > > + * @q: dep queue to enqueue work on
> > > + * @work: work item to enqueue
> > > + *
> > > + * Queues @work on @q->sched.submit_wq if the queue is not stopped. This
> > > + * allows drivers to schedule custom work items that run serialised with the
> > > + * queue's own run-job and put-job workers.
> > > + *
> > > + * Return: true if the work was queued, false if the queue is stopped or the
> > > + * work item was already pending.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> > > + struct work_struct *work)
> > > +{
> > > + if (drm_dep_queue_is_stopped(q))
> > > + return false;
> > > +
> > > + return queue_work(q->sched.submit_wq, work);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_queue_work_enqueue);
> > > +
> > > +/**
> > > + * drm_dep_queue_can_job_bypass() - test whether a job can skip the SPSC queue
> > > + * @q: dep queue
> > > + * @job: job to test
> > > + *
> > > + * A job may bypass the submit workqueue and run inline on the calling thread
> > > + * if all of the following hold:
> > > + *
> > > + * - %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set on the queue
> > > + * - the queue is not stopped
> > > + * - the SPSC submission queue is empty (no other jobs waiting)
> > > + * - the queue has enough credits for @job
> > > + * - @job has no unresolved dependency fences
> > > + *
> > > + * Must be called under @q->sched.lock.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock (a mutex).
> > > + * Return: true if the job may be run inline, false otherwise.
> > > + */
> > > +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> > > + struct drm_dep_job *job)
> > > +{
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + return (q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED) &&
> > > + !drm_dep_queue_is_stopped(q) &&
> > > + !spsc_queue_count(&q->job.queue) &&
> > > + drm_dep_queue_has_credits(q, job) &&
> > > + xa_empty(&job->dependencies);
> > > +}
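
Since the bypass predicate is a pure conjunction of queue state, it can be modeled directly: every condition must hold or the job takes the normal SPSC path. A userspace sketch (the field names are hypothetical stand-ins for the state checked above):

```c
/*
 * Model of the bypass predicate: a job may run inline only when the
 * queue supports bypass, is not stopped, has no queued jobs ahead,
 * has credits for this job, and the job has no unresolved deps.
 */
#include <assert.h>
#include <stdbool.h>

struct bypass_state {
	bool bypass_supported;
	bool stopped;
	int  queued_jobs;	/* stands in for spsc_queue_count() */
	bool has_credits;
	bool deps_empty;	/* stands in for xa_empty(&job->dependencies) */
};

static bool can_bypass(const struct bypass_state *s)
{
	return s->bypass_supported && !s->stopped && !s->queued_jobs &&
	       s->has_credits && s->deps_empty;
}
```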
> > > +
> > > +/**
> > > + * drm_dep_job_done() - mark a job as complete
> > > + * @job: the job that finished
> > > + * @result: error code to propagate, or 0 for success
> > > + *
> > > + * Subtracts @job->credits from the queue credit counter, then signals the
> > > + * job's dep fence with @result.
> > > + *
> > > + * When %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set (IRQ-safe path), a
> > > + * temporary extra reference is taken on @job before signalling the fence.
> > > + * This prevents a concurrent put-job worker — which may be woken by timeouts or
> > > + * queue starting — from freeing the job while this function still holds a
> > > + * pointer to it. The extra reference is released at the end of the function.
> > > + *
> > > + * After signalling, the IRQ-safe path removes the job from the pending list
> > > + * under @q->job.lock, provided the queue is not stopped. Removal is skipped
> > > + * when the queue is stopped so that drm_dep_queue_for_each_pending_job() can
> > > + * iterate the list without racing with the completion path. On successful
> > > + * removal, kicks the run-job worker so the next queued job can be dispatched
> > > + * immediately, then drops the job reference. If the job was already removed
> > > + * by TDR, or removal was skipped because the queue is stopped, kicks the
> > > + * put-job worker instead to allow the deferred put to complete.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_job_done(struct drm_dep_job *job, int result)
> > > +{
> > > + struct drm_dep_queue *q = job->q;
> > > + bool irq_safe = drm_dep_queue_is_job_put_irq_safe(q), removed = false;
> > > +
> > > + /*
> > > + * Local ref to ensure the put worker—which may be woken by external
> > > + * forces (TDR, driver-side queue starting)—doesn't free the job behind
> > > + * this function's back after drm_dep_fence_done() while it is still on
> > > + * the pending list.
> > > + */
> > > + if (irq_safe)
> > > + drm_dep_job_get(job);
> > > +
> > > + atomic_sub(job->credits, &q->credit.count);
> > > + drm_dep_fence_done(job->dfence, result);
> > > +
> > > + /* Only safe to touch job after fence signal if we have a local ref. */
> > > +
> > > + if (irq_safe) {
> > > + scoped_guard(spinlock_irqsave, &q->job.lock) {
> > > + removed = !list_empty(&job->pending_link) &&
> > > + !drm_dep_queue_is_stopped(q);
> > > +
> > > + /* Guard against TDR operating on job */
> > > + if (removed)
> > > + drm_dep_queue_remove_job(q, job);
> > > + }
> > > + }
> > > +
> > > + if (removed) {
> > > + drm_dep_queue_run_job_queue(q);
> > > + drm_dep_job_put(job);
> > > + } else {
> > > + drm_dep_queue_put_job_queue(q);
> > > + }
> > > +
> > > + if (irq_safe)
> > > + drm_dep_job_put(job);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_job_done_cb() - dma_fence callback to complete a job
> > > + * @f: the hardware fence that signalled
> > > + * @cb: fence callback embedded in the dep job
> > > + *
> > > + * Extracts the job from @cb and calls drm_dep_job_done() with
> > > + * @f->error as the result.
> > > + *
> > > + * Context: Any context, but with IRQs disabled. May not sleep.
> > > + */
> > > +static void drm_dep_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> > > +{
> > > + struct drm_dep_job *job = container_of(cb, struct drm_dep_job, cb);
> > > +
> > > + drm_dep_job_done(job, f->error);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_run_job() - submit a job to hardware and set up
> > > + * completion tracking
> > > + * @q: dep queue
> > > + * @job: job to run
> > > + *
> > > + * Accounts @job->credits against the queue, appends the job to the pending
> > > + * list, then calls @q->ops->run_job(). The TDR timer is started only when
> > > + * @job is the first entry on the pending list; subsequent jobs added while
> > > + * a TDR is already in flight do not reset the timer (which would otherwise
> > > + * extend the deadline for the already-running head job). Stores the returned
> > > + * hardware fence as the parent of the job's dep fence, then installs
> > > + * drm_dep_job_done_cb() on it. If the hardware fence is already signalled
> > > + * (%-ENOENT from dma_fence_add_callback()) or run_job() returns NULL/error,
> > > + * the job is completed immediately. Must be called under @q->sched.lock.
> > > + *
> > > + * Context: Process context. Must hold @q->sched.lock (a mutex). DMA fence
> > > + * signaling path.
> > > + */
> > > +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> > > +{
> > > + struct dma_fence *fence;
> > > + int r;
> > > +
> > > + lockdep_assert_held(&q->sched.lock);
> > > +
> > > + drm_dep_job_get(job);
> > > + atomic_add(job->credits, &q->credit.count);
> > > +
> > > + scoped_guard(spinlock_irq, &q->job.lock) {
> > > + bool first = list_empty(&q->job.pending);
> > > +
> > > + list_add_tail(&job->pending_link, &q->job.pending);
> > > + if (first)
> > > + drm_queue_start_timeout(q);
> > > + }
> > > +
> > > + fence = q->ops->run_job(job);
> > > + drm_dep_fence_set_parent(job->dfence, fence);
> > > +
> > > + if (!IS_ERR_OR_NULL(fence)) {
> > > + r = dma_fence_add_callback(fence, &job->cb,
> > > + drm_dep_job_done_cb);
> > > + if (r == -ENOENT)
> > > + drm_dep_job_done(job, fence->error);
> > > + else if (r)
> > > + drm_err(q->drm, "fence add callback failed (%d)\n", r);
> > > + dma_fence_put(fence);
> > > + } else {
> > > + drm_dep_job_done(job, IS_ERR(fence) ? PTR_ERR(fence) : 0);
> > > + }
> > > +
> > > + /*
> > > + * Drop all input dependency fences now, in process context, before the
> > > + * final job put. Once the job is on the pending list its last reference
> > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > + * xa_destroy() would be unsafe.
> > > + */
> >
> > I assume that “pending” is the list of jobs that have been handed to the driver
> > via ops->run_job()?
> >
> > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > other than scheduling the queue worker?
> >
Yes, "pending" is the list of jobs that have been handed to the driver
via ops->run_job(). And yes, this code is required to support dropping
job refs directly in the dma-fence callback (an opt-in feature). Again,
this seems like a significant win in terms of CPU cycles, although I
haven't collected data yet.
I could drop this, but conceptually it still feels like the right
approach.
> > > + drm_dep_job_drop_dependencies(job);
> > > + drm_dep_job_put(job);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_queue_push_job() - enqueue a job on the SPSC submission queue
> > > + * @q: dep queue
> > > + * @job: job to push
> > > + *
> > > + * Pushes @job onto the SPSC queue. If the queue was previously empty
> > > + * (i.e. this is the first pending job), kicks the run_job worker so it
> > > + * processes the job promptly without waiting for the next wakeup.
> > > + * May be called with or without @q->sched.lock held.
> > > + *
> > > + * Context: Any context. DMA fence signaling path.
> > > + */
> > > +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job)
> > > +{
> > > + /*
> > > + * spsc_queue_push() returns true if the queue was previously empty,
> > > + * i.e. this is the first pending job. Kick the run_job worker so it
> > > + * picks it up without waiting for the next wakeup.
> > > + */
> > > + if (spsc_queue_push(&q->job.queue, &job->queue_node))
> > > + drm_dep_queue_run_job_queue(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_init() - module initialiser
> > > + *
> > > + * Allocates the module-private dep_free_wq unbound workqueue used for
> > > + * deferred queue teardown.
> > > + *
> > > + * Return: 0 on success, %-ENOMEM if workqueue allocation fails.
> > > + */
> > > +static int __init drm_dep_init(void)
> > > +{
> > > + dep_free_wq = alloc_workqueue("drm_dep_free", WQ_UNBOUND, 0);
> > > + if (!dep_free_wq)
> > > + return -ENOMEM;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_exit() - module exit
> > > + *
> > > + * Destroys the module-private dep_free_wq workqueue.
> > > + */
> > > +static void __exit drm_dep_exit(void)
> > > +{
> > > + destroy_workqueue(dep_free_wq);
> > > + dep_free_wq = NULL;
> > > +}
> > > +
> > > +module_init(drm_dep_init);
> > > +module_exit(drm_dep_exit);
> > > +
> > > +MODULE_DESCRIPTION("DRM dependency queue");
> > > +MODULE_LICENSE("Dual MIT/GPL");
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_queue.h b/drivers/gpu/drm/dep/drm_dep_queue.h
> > > new file mode 100644
> > > index 000000000000..e5c217a3fab5
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_queue.h
> > > @@ -0,0 +1,31 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _DRM_DEP_QUEUE_H_
> > > +#define _DRM_DEP_QUEUE_H_
> > > +
> > > +#include <linux/types.h>
> > > +
> > > +struct drm_dep_job;
> > > +struct drm_dep_queue;
> > > +
> > > +bool drm_dep_queue_can_job_bypass(struct drm_dep_queue *q,
> > > + struct drm_dep_job *job);
> > > +void drm_dep_queue_run_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> > > +void drm_dep_queue_push_job(struct drm_dep_queue *q, struct drm_dep_job *job);
> > > +
> > > +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> > > +void drm_dep_queue_push_job_begin(struct drm_dep_queue *q);
> > > +void drm_dep_queue_push_job_end(struct drm_dep_queue *q);
> > > +#else
> > > +static inline void drm_dep_queue_push_job_begin(struct drm_dep_queue *q)
> > > +{
> > > +}
> > > +static inline void drm_dep_queue_push_job_end(struct drm_dep_queue *q)
> > > +{
> > > +}
> > > +#endif
> > > +
> > > +#endif /* _DRM_DEP_QUEUE_H_ */
> > > diff --git a/include/drm/drm_dep.h b/include/drm/drm_dep.h
> > > new file mode 100644
> > > index 000000000000..615926584506
> > > --- /dev/null
> > > +++ b/include/drm/drm_dep.h
> > > @@ -0,0 +1,597 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice shall be included in
> > > + * all copies or substantial portions of the Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > + *
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _DRM_DEP_H_
> > > +#define _DRM_DEP_H_
> > > +
> > > +#include <drm/spsc_queue.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/xarray.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +enum dma_resv_usage;
> > > +struct dma_resv;
> > > +struct drm_dep_fence;
> > > +struct drm_dep_job;
> > > +struct drm_dep_queue;
> > > +struct drm_file;
> > > +struct drm_gem_object;
> > > +
> > > +/**
> > > + * enum drm_dep_timedout_stat - return value of &drm_dep_queue_ops.timedout_job
> > > + * @DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED: driver signaled the job's finished
> > > + * fence during reset; drm_dep may safely drop its reference to the job.
> > > + * @DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB: timeout was a false alarm; reinsert the
> > > + * job at the head of the pending list so it can complete normally.
> > > + */
> > > +enum drm_dep_timedout_stat {
> > > + DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED,
> > > + DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB,
> > > +};
> > > +
> > > +/**
> > > + * struct drm_dep_queue_ops - driver callbacks for a dep queue
> > > + */
> > > +struct drm_dep_queue_ops {
> > > + /**
> > > + * @run_job: submit the job to hardware. Returns the hardware completion
> > > + * fence (with a reference held for the scheduler), or NULL/ERR_PTR on
> > > + * synchronous completion or error.
> > > + */
> > > + struct dma_fence *(*run_job)(struct drm_dep_job *job);
> > > +
> > > + /**
> > > + * @timedout_job: called when the TDR fires for the head job. Must stop
> > > + * the hardware, then return %DRM_DEP_TIMEDOUT_STAT_JOB_SIGNALED if the
> > > + * job's fence was signalled during reset, or
> > > + * %DRM_DEP_TIMEDOUT_STAT_REQUEUE_JOB if the timeout was spurious or
> > > + * signalling was otherwise delayed, and the job should be re-inserted
> > > + * at the head of the pending list. Any other value triggers a WARN.
> > > + */
> > > + enum drm_dep_timedout_stat (*timedout_job)(struct drm_dep_job *job);
> > > +
> > > + /**
> > > + * @release: called when the last kref on the queue is dropped and
> > > + * drm_dep_queue_fini() has completed. The driver is responsible for
> > > + * removing @q from any internal bookkeeping, calling
> > > + * drm_dep_queue_release(), and then freeing the memory containing @q
> > > + * (e.g. via kfree_rcu() using @q->rcu). If NULL, drm_dep calls
> > > + * drm_dep_queue_release() and frees @q automatically via kfree_rcu().
> > > + * Use this when the queue is embedded in a larger structure.
> > > + */
> > > + void (*release)(struct drm_dep_queue *q);
> > > +
> > > + /**
> > > + * @fini: if set, called instead of drm_dep_queue_fini() when the last
> > > + * kref is dropped. The driver is responsible for calling
> > > + * drm_dep_queue_fini() itself after it is done with the queue. Use this
> > > + * when additional teardown logic must run before fini (e.g., cleanup
> > > + * firmware resources associated with the queue).
> > > + */
> > > + void (*fini)(struct drm_dep_queue *q);
> > > +};
> > > +
> > > +/**
> > > + * enum drm_dep_queue_flags - flags for &drm_dep_queue and
> > > + * &drm_dep_queue_init_args
> > > + *
> > > + * Flags are divided into three categories:
> > > + *
> > > + * - **Private static**: set internally at init time and never changed.
> > > + * Drivers must not read or write these.
> > > + * %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ,
> > > + * %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ.
> > > + *
> > > + * - **Public dynamic**: toggled at runtime by drivers via accessors.
> > > + * Any modification must be performed under &drm_dep_queue.sched.lock.
> >
> > Can’t enforce that in C.
> >
I agree. There are no true “private” fields in C if the object lives in
a shared header file. I’d love to make drm_dep_queue and drm_dep_job
private (defined in a C file), but then you can’t embed these objects
inside driver objects—which is the primary use case. The best we can do
is simply refuse to accept drivers that touch fields they shouldn’t, so
the code can remain maintainable.
I did, however, make drm_dep_fence a private object—notice it’s defined
in drm_dep_fence.c, so no one can abuse it.
> > > + * Accessor functions provide unstable reads.
> > > + * %DRM_DEP_QUEUE_FLAGS_STOPPED,
> > > + * %DRM_DEP_QUEUE_FLAGS_KILLED.
> >
> > > + *
> > > + * - **Public static**: supplied by the driver in
> > > + * &drm_dep_queue_init_args.flags at queue creation time and not modified
> > > + * thereafter.
> >
> > Same here.
> >
> > > + * %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED,
> > > + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI,
> > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE.
> >
> > > + *
> > > + * @DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ: (private, static) submit workqueue was
> > > + * allocated by drm_dep_queue_init() and will be destroyed by
> > > + * drm_dep_queue_fini().
> > > + * @DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ: (private, static) timeout workqueue
> > > + * was allocated by drm_dep_queue_init() and will be destroyed by
> > > + * drm_dep_queue_fini().
> > > + * @DRM_DEP_QUEUE_FLAGS_STOPPED: (public, dynamic) the queue is stopped and
> > > + * will not dispatch new jobs or remove jobs from the pending list, dropping
> > > + * the drm_dep-owned reference. Set by drm_dep_queue_stop(), cleared by
> > > + * drm_dep_queue_start().
> > > + * @DRM_DEP_QUEUE_FLAGS_KILLED: (public, dynamic) the queue has been killed
> > > + * via drm_dep_queue_kill(). Any active dependency wait is cancelled
> > > + * immediately. Jobs continue to flow through run_job for bookkeeping
> > > + * cleanup, but dependency waiting is skipped so that queued work drains
> > > + * as quickly as possible.
> > > + * @DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED: (public, static) the queue supports
> > > + * the bypass path where eligible jobs skip the SPSC queue and run inline.
> > > + * @DRM_DEP_QUEUE_FLAGS_HIGHPRI: (public, static) the submit workqueue owned
> > > + * by the queue is created with %WQ_HIGHPRI, causing run-job and put-job
> > > + * workers to execute at elevated priority. Only privileged clients (e.g.
> > > + * drivers managing time-critical or real-time GPU contexts) should request
> > > + * this flag; granting it to unprivileged userspace would allow priority
> > > + * inversion attacks.
> > > + * Ignored when @drm_dep_queue_init_args.submit_wq is provided.
> > > + * @DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE: (public, static) when set,
> > > + * drm_dep_job_done() may be called from hardirq context (e.g. from a
> > > + * hardware-signalled dma_fence callback). drm_dep_job_done() will directly
> > > + * dequeue the job and call drm_dep_job_put() without deferring to a
> > > + * workqueue. The driver's &drm_dep_job_ops.release callback must therefore
> > > + * be safe to invoke from IRQ context.
> > > + */
> > > +enum drm_dep_queue_flags {
> > > + DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ = BIT(0),
> > > + DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ = BIT(1),
> > > + DRM_DEP_QUEUE_FLAGS_STOPPED = BIT(2),
> > > + DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED = BIT(3),
> > > + DRM_DEP_QUEUE_FLAGS_HIGHPRI = BIT(4),
> > > + DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE = BIT(5),
> > > + DRM_DEP_QUEUE_FLAGS_KILLED = BIT(6),
> > > +};
> > > +
> > > +/**
> > > + * struct drm_dep_queue - a dependency-tracked GPU submission queue
> > > + *
> > > + * Combines the roles of &drm_gpu_scheduler and &drm_sched_entity into a single
> > > + * object. Each queue owns a submit workqueue (or borrows one), a timeout
> > > + * workqueue, an SPSC submission queue, and a pending-job list used for TDR.
> > > + *
> > > + * Initialise with drm_dep_queue_init(), tear down with drm_dep_queue_fini().
> > > + * Reference counted via drm_dep_queue_get() / drm_dep_queue_put().
> > > + *
> > > + * All fields are **opaque to drivers**. Do not read or write any field
> >
> > Can’t enforce this in C.
> >
Just answered above; agree.
> > > + * directly; use the provided helper functions instead. The sole exception
> > > + * is @rcu, which drivers may pass to kfree_rcu() when the queue is embedded
> > > + * inside a larger driver-managed structure and the &drm_dep_queue_ops.release
> > > + * vfunc performs an RCU-deferred free.
> >
> > > + */
> > > +struct drm_dep_queue {
> > > + /** @ops: driver callbacks, set at init time. */
> > > + const struct drm_dep_queue_ops *ops;
> > > + /** @name: human-readable name used for workqueue and fence naming. */
> > > + const char *name;
> > > + /** @drm: owning DRM device; a drm_dev_get() reference is held for the
> > > + * lifetime of the queue to prevent module unload while queues are live.
> > > + */
> > > + struct drm_device *drm;
> > > + /** @refcount: reference count; use drm_dep_queue_get/put(). */
> > > + struct kref refcount;
> > > + /**
> > > + * @free_work: deferred teardown work queued unconditionally by
> > > + * drm_dep_queue_fini() onto the module-private dep_free_wq. The work
> > > + * item disables pending workers synchronously and destroys any owned
> > > + * workqueues before releasing the queue memory and dropping the
> > > + * drm_dev_get() reference. Running on dep_free_wq ensures
> > > + * destroy_workqueue() is never called from within one of the queue's
> > > + * own workers.
> > > + */
> > > + struct work_struct free_work;
> > > + /**
> > > + * @rcu: RCU head for deferred freeing.
> > > + *
> > > + * This is the **only** field drivers may access directly. When the
> >
> > We can enforce this in Rust at compile time.
> >
That is nice.
> > > + * queue is embedded in a larger structure, implement
> > > + * &drm_dep_queue_ops.release, call drm_dep_queue_release() to destroy
> > > + * internal resources, then pass this field to kfree_rcu() so that any
> > > + * in-flight RCU readers referencing the queue's dma_fence timeline name
> > > + * complete before the memory is returned. All other fields must be
> > > + * accessed through the provided helpers.
> > > + */
> > > + struct rcu_head rcu;
> > > +
> > > + /** @sched: scheduling and workqueue state. */
> > > + struct {
> > > + /** @sched.submit_wq: ordered workqueue for run/put-job work. */
> > > + struct workqueue_struct *submit_wq;
> > > + /** @sched.timeout_wq: workqueue for the TDR delayed work. */
> > > + struct workqueue_struct *timeout_wq;
> > > + /**
> > > + * @sched.run_job: work item that dispatches the next queued
> > > + * job.
> > > + */
> > > + struct work_struct run_job;
> > > + /** @sched.put_job: work item that frees finished jobs. */
> > > + struct work_struct put_job;
> > > + /** @sched.tdr: delayed work item for timeout/reset (TDR). */
> > > + struct delayed_work tdr;
> > > + /**
> > > + * @sched.lock: mutex serialising job dispatch, bypass
> > > + * decisions, stop/start, and flag updates.
> > > + */
> > > + struct mutex lock;
> > > + /**
> > > + * @sched.flags: bitmask of &enum drm_dep_queue_flags.
> > > + * Any modification after drm_dep_queue_init() must be
> > > + * performed under @sched.lock.
> > > + */
> > > + enum drm_dep_queue_flags flags;
> > > + } sched;
> > > +
> > > + /** @job: pending-job tracking state. */
> > > + struct {
> > > + /**
> > > + * @job.pending: list of jobs that have been dispatched to
> > > + * hardware and not yet freed. Protected by @job.lock.
> > > + */
> > > + struct list_head pending;
> > > + /**
> > > + * @job.queue: SPSC queue of jobs waiting to be dispatched.
> > > + * Producers push via drm_dep_queue_push_job(); the run_job
> > > + * work item pops from the consumer side.
> > > + */
> > > + struct spsc_queue queue;
> > > + /**
> > > + * @job.lock: spinlock protecting @job.pending, TDR start, and
> > > + * the %DRM_DEP_QUEUE_FLAGS_STOPPED flag. Always acquired with
> > > + * irqsave (spin_lock_irqsave / spin_unlock_irqrestore) to
> > > + * support %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE queues where
> > > + * drm_dep_job_done() may run from hardirq context.
> > > + */
> > > + spinlock_t lock;
> > > + /**
> > > + * @job.timeout: per-job TDR timeout in jiffies.
> > > + * %MAX_SCHEDULE_TIMEOUT means no timeout.
> > > + */
> > > + long timeout;
> > > +#if IS_ENABLED(CONFIG_PROVE_LOCKING)
> > > + /**
> > > + * @job.push: lockdep annotation tracking the arm-to-push
> > > + * critical section.
> > > + */
> > > + struct {
> > > + /**
> > > + * @job.push.owner: task that currently holds the push
> > > + * context, used to assert single-owner invariants.
> > > + * NULL when idle.
> > > + */
> > > + struct task_struct *owner;
> > > + } push;
> > > +#endif
> > > + } job;
> > > +
> > > + /** @credit: hardware credit accounting. */
> > > + struct {
> > > + /** @credit.limit: maximum credits the queue can hold. */
> > > + u32 limit;
> > > + /** @credit.count: credits currently in flight (atomic). */
> > > + atomic_t count;
> > > + } credit;
> > > +
> > > + /** @dep: current blocking dependency for the head SPSC job. */
> > > + struct {
> > > + /**
> > > + * @dep.fence: fence being waited on before the head job can
> > > + * run. NULL when no dependency is pending.
> > > + */
> > > + struct dma_fence *fence;
> > > + /**
> > > + * @dep.removed_fence: dependency fence whose callback has been
> > > + * removed. The run-job worker must drop its reference to this
> > > + * fence before proceeding to call run_job.
> >
> > We can enforce this in Rust automatically.
> >
> > > + */
> > > + struct dma_fence *removed_fence;
> > > + /** @dep.cb: callback installed on @dep.fence. */
> > > + struct dma_fence_cb cb;
> > > + } dep;
> > > +
> > > + /** @fence: fence context and sequence number state. */
> > > + struct {
> > > + /**
> > > + * @fence.seqno: next sequence number to assign, incremented
> > > + * each time a job is armed.
> > > + */
> > > + u32 seqno;
> > > + /**
> > > + * @fence.context: base DMA fence context allocated at init
> > > + * time. Finished fences use this context.
> > > + */
> > > + u64 context;
> > > + } fence;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_dep_queue_init_args - arguments for drm_dep_queue_init()
> > > + */
> > > +struct drm_dep_queue_init_args {
> > > + /** @ops: driver callbacks; must not be NULL. */
> > > + const struct drm_dep_queue_ops *ops;
> > > + /** @name: human-readable name for workqueues and fence timelines. */
> > > + const char *name;
> > > + /** @drm: owning DRM device. A drm_dev_get() reference is taken at
> > > + * queue init and released when the queue is freed, preventing module
> > > + * unload while any queue is still alive.
> > > + */
> > > + struct drm_device *drm;
> > > + /**
> > > + * @submit_wq: workqueue for job dispatch. If NULL, an ordered
> > > + * workqueue is allocated and owned by the queue. If non-NULL, the
> > > + * workqueue must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> > > + * drm_dep_queue_init() returns %-EINVAL otherwise.
> > > + */
> > > + struct workqueue_struct *submit_wq;
> > > + /**
> > > + * @timeout_wq: workqueue for TDR. If NULL, an ordered workqueue
> > > + * is allocated and owned by the queue. If non-NULL, the workqueue
> > > + * must have been allocated with %WQ_MEM_RECLAIM_TAINT;
> > > + * drm_dep_queue_init() returns %-EINVAL otherwise.
> > > + */
> > > + struct workqueue_struct *timeout_wq;
> > > + /** @credit_limit: maximum hardware credits; must be non-zero. */
> > > + u32 credit_limit;
> > > + /**
> > > + * @timeout: per-job TDR timeout in jiffies. Zero means no timeout
> > > + * (%MAX_SCHEDULE_TIMEOUT is used internally).
> > > + */
> > > + long timeout;
> > > + /**
> > > + * @flags: initial queue flags. %DRM_DEP_QUEUE_FLAGS_OWN_SUBMIT_WQ
> > > + * and %DRM_DEP_QUEUE_FLAGS_OWN_TIMEDOUT_WQ are managed internally
> > > + * and will be ignored if set here. Setting
> > > + * %DRM_DEP_QUEUE_FLAGS_HIGHPRI requests a high-priority submit
> > > + * workqueue; drivers must only set this for privileged clients.
> > > + */
> > > + enum drm_dep_queue_flags flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_dep_job_ops - driver callbacks for a dep job
> > > + */
> > > +struct drm_dep_job_ops {
> > > + /**
> > > + * @release: called when the last reference to the job is dropped.
> > > + *
> > > + * If set, the driver is responsible for freeing the job. If NULL,
> >
> > And if they don’t?
> >
They leak memory.
> > By the way, we can also enforce this in Rust.
> >
> > > + * drm_dep_job_put() will call kfree() on the job directly.
> > > + */
> > > + void (*release)(struct drm_dep_job *job);
> > > +};
> > > +
> > > +/**
> > > + * struct drm_dep_job - a unit of work submitted to a dep queue
> > > + *
> > > + * All fields are **opaque to drivers**. Do not read or write any field
> > > + * directly; use the provided helper functions instead.
> > > + */
> > > +struct drm_dep_job {
> > > + /** @ops: driver callbacks for this job. */
> > > + const struct drm_dep_job_ops *ops;
> > > + /** @refcount: reference count, managed by drm_dep_job_get/put(). */
> > > + struct kref refcount;
> > > + /**
> > > + * @dependencies: xarray of &dma_fence dependencies before the job can
> > > + * run.
> > > + */
> > > + struct xarray dependencies;
> > > + /** @q: the queue this job is submitted to. */
> > > + struct drm_dep_queue *q;
> > > + /** @queue_node: SPSC queue linkage for pending submission. */
> > > + struct spsc_node queue_node;
> > > + /**
> > > + * @pending_link: list entry in the queue's pending job list. Protected
> > > + * by @job.q->job.lock.
> > > + */
> > > + struct list_head pending_link;
> > > + /** @dfence: finished fence for this job. */
> > > + struct drm_dep_fence *dfence;
> > > + /** @cb: fence callback used to watch for dependency completion. */
> > > + struct dma_fence_cb cb;
> > > + /** @credits: number of credits this job consumes from the queue. */
> > > + u32 credits;
> > > + /**
> > > + * @last_dependency: index into @dependencies of the next fence to
> > > + * check. Advanced by drm_dep_queue_job_dependency() as each
> > > + * dependency is consumed.
> > > + */
> > > + u32 last_dependency;
> > > + /**
> > > + * @invalidate_count: number of times this job has been invalidated.
> > > + * Incremented by drm_dep_job_invalidate_job().
> > > + */
> > > + u32 invalidate_count;
> > > + /**
> > > + * @signalling_cookie: return value of dma_fence_begin_signalling()
> > > + * captured in drm_dep_job_arm() and consumed by drm_dep_job_push().
> > > + * Not valid outside the arm→push window.
> > > + */
> > > + bool signalling_cookie;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_dep_job_init_args - arguments for drm_dep_job_init()
> > > + */
> > > +struct drm_dep_job_init_args {
> > > + /**
> > > + * @ops: driver callbacks for the job, or NULL for default behaviour.
> > > + */
> > > + const struct drm_dep_job_ops *ops;
> > > + /** @q: the queue to associate the job with. A reference is taken. */
> > > + struct drm_dep_queue *q;
> > > + /** @credits: number of credits this job consumes; must be non-zero. */
> > > + u32 credits;
> > > +};
> > > +
> > > +/* Queue API */
> > > +
> > > +/**
> > > + * drm_dep_queue_sched_guard() - acquire the queue scheduler lock as a guard
> > > + * @__q: dep queue whose scheduler lock to acquire
> > > + *
> > > + * Acquires @__q->sched.lock as a scoped mutex guard (released automatically
> > > + * when the enclosing scope exits). This lock serialises all scheduler state
> > > + * transitions — stop/start/kill flag changes, bypass-path decisions, and the
> > > + * run-job worker — so it must be held when the driver needs to atomically
> > > + * inspect or modify queue state in relation to job submission.
> > > + *
> > > + * **When to use**
> > > + *
> > > + * Drivers that set %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED and wish to
> > > + * serialise their own submit work against the bypass path must acquire this
> > > + * guard. Without it, a concurrent caller of drm_dep_job_push() could take
> > > + * the bypass path and call ops->run_job() inline between the driver's
> > > + * eligibility check and its corresponding action, producing a race.
> >
> > So if you’re not careful, you have just introduced a race :/
> >
Luckily I’m careful. The use case here is compositors, compute
workloads, or servicing KMD page faults—none of which have input
dependencies and all of which require very low latency and minimal
jitter.
> > > + *
> > > + * **Constraint: only from submit_wq worker context**
> > > + *
> > > + * This guard must only be acquired from a work item running on the queue's
> > > + * submit workqueue (@q->sched.submit_wq) by drivers.
> > > + *
> > > + * Context: Process context only; must be called from submit_wq work by
> > > + * drivers.
> > > + */
> > > +#define drm_dep_queue_sched_guard(__q) \
> > > + guard(mutex)(&(__q)->sched.lock)
> > > +
> > > +int drm_dep_queue_init(struct drm_dep_queue *q,
> > > + const struct drm_dep_queue_init_args *args);
> > > +void drm_dep_queue_fini(struct drm_dep_queue *q);
> > > +void drm_dep_queue_release(struct drm_dep_queue *q);
> > > +struct drm_dep_queue *drm_dep_queue_get(struct drm_dep_queue *q);
> > > +bool drm_dep_queue_get_unless_zero(struct drm_dep_queue *q);
> > > +void drm_dep_queue_put(struct drm_dep_queue *q);
> > > +void drm_dep_queue_stop(struct drm_dep_queue *q);
> > > +void drm_dep_queue_start(struct drm_dep_queue *q);
> > > +void drm_dep_queue_kill(struct drm_dep_queue *q);
> > > +void drm_dep_queue_trigger_timeout(struct drm_dep_queue *q);
> > > +void drm_dep_queue_cancel_tdr_sync(struct drm_dep_queue *q);
> > > +void drm_dep_queue_resume_timeout(struct drm_dep_queue *q);
> > > +bool drm_dep_queue_work_enqueue(struct drm_dep_queue *q,
> > > + struct work_struct *work);
> > > +bool drm_dep_queue_is_stopped(struct drm_dep_queue *q);
> > > +bool drm_dep_queue_is_killed(struct drm_dep_queue *q);
> > > +bool drm_dep_queue_is_initialized(struct drm_dep_queue *q);
> > > +void drm_dep_queue_set_stopped(struct drm_dep_queue *q);
> > > +unsigned int drm_dep_queue_refcount(const struct drm_dep_queue *q);
> > > +long drm_dep_queue_timeout(const struct drm_dep_queue *q);
> > > +struct workqueue_struct *drm_dep_queue_submit_wq(struct drm_dep_queue *q);
> > > +struct workqueue_struct *drm_dep_queue_timeout_wq(struct drm_dep_queue *q);
> > > +
> > > +/* Job API */
> > > +
> > > +/**
> > > + * DRM_DEP_JOB_FENCE_PREALLOC - sentinel value for pre-allocating a dependency slot
> > > + *
> > > + * Pass this to drm_dep_job_add_dependency() instead of a real fence to
> > > + * pre-allocate a slot in the job's dependency xarray during the preparation
> > > + * phase (where GFP_KERNEL is available). The returned xarray index identifies
> > > + * the slot. Call drm_dep_job_replace_dependency() later — inside a
> > > + * dma_fence_begin_signalling() region if needed — to swap in the real fence
> > > + * without further allocation.
> > > + *
> > > + * This sentinel is never treated as a dma_fence; it carries no reference count
> > > + * and must not be passed to dma_fence_put(). It is only valid as an argument
> > > + * to drm_dep_job_add_dependency() and as the expected stored value checked by
> > > + * drm_dep_job_replace_dependency().
> > > + */
> > > +#define DRM_DEP_JOB_FENCE_PREALLOC ((struct dma_fence *)-1)
> > > +
> > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > + const struct drm_dep_job_init_args *args);
> > > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job);
> > > +void drm_dep_job_put(struct drm_dep_job *job);
> > > +void drm_dep_job_arm(struct drm_dep_job *job);
> > > +void drm_dep_job_push(struct drm_dep_job *job);
> > > +int drm_dep_job_add_dependency(struct drm_dep_job *job,
> > > + struct dma_fence *fence);
> > > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > > + struct dma_fence *fence);
> > > +int drm_dep_job_add_syncobj_dependency(struct drm_dep_job *job,
> > > + struct drm_file *file, u32 handle,
> > > + u32 point);
> > > +int drm_dep_job_add_resv_dependencies(struct drm_dep_job *job,
> > > + struct dma_resv *resv,
> > > + enum dma_resv_usage usage);
> > > +int drm_dep_job_add_implicit_dependencies(struct drm_dep_job *job,
> > > + struct drm_gem_object *obj,
> > > + bool write);
> > > +bool drm_dep_job_is_signaled(struct drm_dep_job *job);
> > > +bool drm_dep_job_is_finished(struct drm_dep_job *job);
> > > +bool drm_dep_job_invalidate_job(struct drm_dep_job *job, int threshold);
> > > +struct dma_fence *drm_dep_job_finished_fence(struct drm_dep_job *job);
> > > +
> > > +/**
> > > + * struct drm_dep_queue_pending_job_iter - iterator state for
> > > + * drm_dep_queue_for_each_pending_job()
> > > + * @q: queue being iterated
> > > + */
> > > +struct drm_dep_queue_pending_job_iter {
> > > + struct drm_dep_queue *q;
> > > +};
> > > +
> > > +/* Drivers should never call this directly */
> >
> > Not enforceable in C.
> >
> > > +static inline struct drm_dep_queue_pending_job_iter
> > > +__drm_dep_queue_pending_job_iter_begin(struct drm_dep_queue *q)
> > > +{
> > > + struct drm_dep_queue_pending_job_iter iter = {
> > > + .q = q,
> > > + };
> > > +
> > > + WARN_ON(!drm_dep_queue_is_stopped(q));
> > > + return iter;
> > > +}
> > > +
> > > +/* Drivers should never call this directly */
> > > +static inline void
> > > +__drm_dep_queue_pending_job_iter_end(struct drm_dep_queue_pending_job_iter iter)
> > > +{
> > > + WARN_ON(!drm_dep_queue_is_stopped(iter.q));
> > > +}
> > > +
> > > +/* clang-format off */
> > > +DEFINE_CLASS(drm_dep_queue_pending_job_iter,
> > > + struct drm_dep_queue_pending_job_iter,
> > > + __drm_dep_queue_pending_job_iter_end(_T),
> > > + __drm_dep_queue_pending_job_iter_begin(__q),
> > > + struct drm_dep_queue *__q);
> > > +/* clang-format on */
> > > +static inline void *
> > > +class_drm_dep_queue_pending_job_iter_lock_ptr(
> > > + class_drm_dep_queue_pending_job_iter_t *_T)
> > > +{ return _T; }
> > > +#define class_drm_dep_queue_pending_job_iter_is_conditional false
> > > +
> > > +/**
> > > + * drm_dep_queue_for_each_pending_job() - iterate over all pending jobs
> > > + * in a queue
> > > + * @__job: loop cursor, a &struct drm_dep_job pointer
> > > + * @__q: &struct drm_dep_queue to iterate
> > > + *
> > > + * Iterates over every job currently on @__q->job.pending. The queue must be
> > > + * stopped (drm_dep_queue_stop() called) before using this iterator; a WARN_ON
> > > + * fires at the start and end of the scope if it is not.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +#define drm_dep_queue_for_each_pending_job(__job, __q) \
> > > + scoped_guard(drm_dep_queue_pending_job_iter, (__q)) \
> > > + list_for_each_entry((__job), &(__q)->job.pending, pending_link)
> > > +
> > > +#endif
> > > --
> > > 2.34.1
> > >
> >
> >
> > By the way:
> >
> > I invite you to have a look at this implementation [0]. It currently works in real
> > hardware i.e.: our downstream "Tyr" driver for Arm Mali is using that at the
> > moment. It is a mere prototype that we’ve put together to test different
> > approaches, so it’s not meant to be a “solution” at all. It’s a mere data point
> > for further discussion.
I think some of the things I pointed out—async teardown, bypass paths,
and dropping job refs in IRQ context—would still need to be added,
though.
> >
> > Philip Stanner is working on this “Job Queue” concept too, but from an upstream
> > perspective.
> >
> > [0]: https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/61
I scanned [0]; it looks significantly better than what was posted upstream.
Let me dig in a bit more.
Matt
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 8:26 ` Matthew Brost
2026-03-17 12:04 ` Daniel Almeida
@ 2026-03-17 19:41 ` Miguel Ojeda
2026-03-23 17:31 ` Matthew Brost
1 sibling, 1 reply; 50+ messages in thread
From: Miguel Ojeda @ 2026-03-17 19:41 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Almeida, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux
On Tue, Mar 17, 2026 at 9:27 AM Matthew Brost <matthew.brost@intel.com> wrote:
>
> I hate cut-offs in threads.
>
> I get it — you’re a Rust zealot.
Cut off? Zealot?
Look, I got the email in my inbox, so I skimmed it to understand why I
got it and why the Rust list was Cc'd. I happened to notice your
(quite surprising) claims about Rust, so I decided to reply to a
couple of those, since I proposed Rust for the kernel.
How is that a cut off and how does that make a maintainer a zealot?
Anyway, my understanding is that we agreed that the cleanup attribute
in C doesn't enforce much of anything. We also agreed that it is
important to think about ownership and lifetimes and to enforce the
rules and to be disciplined. All good so far.
Now, what I said is simply that Rust fundamentally improves the
situation -- C "RAII" not doing so is not comparable. For instance,
that statically enforcing things is a meaningful improvement over
runtime approaches (which generally require to trigger an issue, and
which in some cases are not suitable for production settings).
Really, I just said Rust would help with things you already stated you
care about. And nobody claims "Rust solves everything" as you stated.
So I don't see zealots here, and insulting others doesn't help your
argument.
Cheers,
Miguel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 18:14 ` Matthew Brost
@ 2026-03-17 19:48 ` Daniel Almeida
2026-03-17 20:43 ` Boris Brezillon
1 sibling, 0 replies; 50+ messages in thread
From: Daniel Almeida @ 2026-03-17 19:48 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
Danilo Krummrich, David Airlie, Maarten Lankhorst, Maxime Ripard,
Philipp Stanner, Simona Vetter, Sumit Semwal, Thomas Zimmermann,
linux-kernel, Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl,
Daniel Stone, Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
I still need to digest your answers above, there's quite a bit of information.
Thanks for that. I'll do a pass on it tomorrow.
>>> By the way:
>>>
>>> I invite you to have a look at this implementation [0]. It currently works in real
>>> hardware i.e.: our downstream "Tyr" driver for Arm Mali is using that at the
>>> moment. It is a mere prototype that we’ve put together to test different
>>> approaches, so it’s not meant to be a “solution” at all. It’s a mere data point
>>> for further discussion.
>
> I think some of the things I pointed out—async teardown, bypass paths,
> and dropping job refs in IRQ context—would still need to be added,
> though.
That’s ok, I suppose we can find a way to add these things if they’re
needed in order to support other FW scheduling GPUs (i.e.: other than Mali,
which is the only thing I tested on, and Nova, which I assume has very similar
requirements).
>
>>>
>>> Philip Stanner is working on this “Job Queue” concept too, but from an upstream
>>> perspective.
>>>
>>> [0]: https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/61
>
> I scanned [0]; it looks significantly better than what was posted upstream.
> Let me dig in a bit more.
>
> Matt
> Matt
One thing that is missing is that, at the moment, submit() is fallible and
there is no preallocation. This can be added to the current design rather
easily (i.e. by splitting into two different steps, a fallible prepare() where
rollback is possible, and an infallible commit(), or whatever names get
chosen).
Perhaps we can also split this into two types too, AtomicJobQueue and JobQueue,
where only the first one allows refs to be dropped in IRQ context; i.e.: since
we do not need this in Tyr, and not allowing this makes the design of the
"non-atomic" version much simpler. Or perhaps we can figure out a way to ensure
that we don't drop the last ref in IRQ context. I am just brainstorming some
ideas at this point, and again, I still need to go through your explanations above.
— Daniel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 18:14 ` Matthew Brost
2026-03-17 19:48 ` Daniel Almeida
@ 2026-03-17 20:43 ` Boris Brezillon
2026-03-18 22:40 ` Matthew Brost
1 sibling, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-17 20:43 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
Hi Matthew,
Just a few drive-by comments.
On Tue, 17 Mar 2026 11:14:36 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > > expiry for device teardown.
> > > >
> > > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > > dma_fence callback). Dependency cleanup is deferred to process context
> > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > >
> > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > > bail immediately if the queue refcount has already reached zero and
> > > > async teardown is in flight, preventing use-after-free.
> > >
> > > In rust, when you queue work, you have to pass a reference-counted pointer
> > > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > > queued, the queue is alive.
> > >
> > > By the way, why can’t we simply require synchronous teardowns?
>
> Consider the case where the DRM dep queue’s refcount drops to zero, but
> the device firmware still holds references to the associated queue.
> These are resources that must be torn down asynchronously. In Xe, I need
> to send two asynchronous firmware commands before I can safely remove
> the memory associated with the queue (faulting on this kind of global
> memory will take down the device) and recycle the firmware ID tied to
> the queue. These async commands are issued on the driver side, on the
> DRM dep queue’s workqueue as well.
Asynchronous teardown is okay, but I'm not too sure using the refcnt to
know that the queue is no longer usable is the way to go. To me the
refcnt is what determines when the SW object is no longer referenced by
any other item in the code, and a work item acting on the queue counts
as one owner of this queue. If you want to cancel the work in order to
speed up the destruction of the queue, you can call
{cancel,disable}_work[_sync](), and have the ref dropped if the
cancel/disable was effective. Multi-step teardown is also an option,
but again, the state of the queue shouldn't be determined from its
refcnt IMHO.
>
> Now consider a scenario where something goes wrong and those firmware
> commands never complete, and a device reset is required to recover. The
> driver’s per-queue tracking logic stops all queues (including zombie
> ones), determines which commands were lost, cleans up the side effects
> of that lost state, and then restarts all queues. That is how we would
> end up in this work item with a zombie queue. The restart logic could
> probably be made smart enough to avoid queueing work for zombie queues,
> but in my opinion it’s safe enough to use kref_get_unless_zero() in the
> work items.
Well, that only works for single-step teardown, or when you enter the
last step. At which point, I'm not too sure it's significantly better
than encoding the state of the queue through a separate field, and have
the job queue logic reject new jobs if the queue is no longer usable
(shouldn't even be exposed to userland at this point though).
>
> It should also be clear that a DRM dep queue is primarily intended to be
> embedded inside the driver’s own queue object, even though it is valid
> to use it as a standalone object. The async teardown flows are also
> optional features.
>
> Let’s also consider a case where you do not need the async firmware
> flows described above, but the DRM dep queue is still embedded in a
> driver-side object that owns memory via dma-resv. The final queue put
> may occur in IRQ context (as an opt-in, DRM dep avoids kicking a worker just
> to drop a ref), or in the reclaim path (any scheduler workqueue is in the
> reclaim path). In either case, you cannot free memory there while taking a
> dma-resv lock, which is why all DRM dep queues ultimately free their
> resources in a work item outside of reclaim. Many drivers already follow
> this pattern, but in DRM dep this behavior is built-in.
I agree deferred cleanup is the way to go.
>
> So I don’t think Rust natively solves these types of problems, although
> I’ll concede that it does make refcounting a bit more sane.
Rust won't magically defer the cleanup, nor will it dictate how you want
to do the queue teardown, those are things you need to implement. But it
should give visibility about object lifetimes, and guarantee that an
object that's still visible to some owners is usable (the notion of
usable is highly dependent on the object implementation).
Just a purely theoretical example of a multi-step queue teardown that
might be possible to encode in rust:
- MyJobQueue<Usable>: The job queue is currently exposed and usable.
There's a ::destroy() method consuming 'self' and returning a
MyJobQueue<Destroyed> object
- MyJobQueue<Destroyed>: The user asked for the workqueue to be
destroyed. No new job can be pushed. Existing jobs that didn't make
it to the FW queue are cancelled, jobs that are in-flight are
cancelled if they can, or are just waited upon if they can't. When
the whole destruction step is done, ::destroyed() is called, it
consumes 'self' and returns a MyJobQueue<Inactive> object.
- MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
any resources on this queue). It's ready to be cleaned up.
::cleanup() (or just ::drop()) defers the cleanup of some inner
object that has been passed around between the various
MyJobQueue<State> wrappers.
Each of the state transition can happen asynchronously. A state
transition consumes the object in one state, and returns a new object
in its new state. None of the transition involves dropping a refcnt,
ownership is just transferred. The final MyJobQueue<Inactive> object is
the object we'll defer cleanup on.
It's a very high-level view of one way this can be implemented (I'm
sure there are others, probably better than my suggestion) in order to
make sure the object doesn't go away without the compiler enforcing
proper state transitions.
> > > > +/**
> > > > + * DOC: DRM dependency fence
> > > > + *
> > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > + * completes the job.
> > > > + *
> > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > + * NULL or an error).
> > >
> > > I thought this fence proxy mechanism was going away due to recent work being
> > > carried out by Christian?
> > >
>
> Consider the case where a driver’s hardware fence is implemented as a
> dma-fence-array or dma-fence-chain. You cannot install these types of
> fences into a dma-resv or into syncobjs, so a proxy fence is useful
> here.
Hm, so that's a driver returning a dma_fence_array/chain through
::run_job()? Why would we not want to have them directly exposed and
split up into singular fence objects at resv insertion time (I don't
think syncobjs care, but I might be wrong). I mean, one of the points
behind the container extraction is so fences coming from the same
context/timeline can be detected and merged. If you insert the
container through a proxy, you're defeating the whole fence merging
optimization.
The second thing is that I'm not sure drivers were ever supposed to
return fence containers in the first place, because the whole idea
behind a fence context is that fences are emitted/signalled in
seqno-order, and if the fence is encoding the state of multiple
timelines that progress at their own pace, it becomes tricky to control
that. I guess if it's always the same set of timelines that are
combined, that would work.
> One example is when a single job submits work to multiple rings
> that are flipped in hardware at the same time.
We do have that in Panthor, but that's all explicit: in a single
SUBMIT, you can have multiple jobs targeting different queues, each of
them having their own set of deps/signal ops. The combination of all the
signal ops into a container is left to the UMD. It could be automated
kernel side, but that would be a flag on the SIGNAL op leading to the
creation of a fence_array containing fences from multiple submitted
jobs, rather than the driver combining stuff in the fence it returns in
::run_job().
>
> Another case is late arming of hardware fences in run_job (which many
> drivers do). The proxy fence is immediately available at arm time and
> can be installed into dma-resv or syncobjs even though the actual
> hardware fence is not yet available. I think most drivers could be
> refactored to make the hardware fence immediately available at run_job,
> though.
Yep, I also think we can arm the driver fence early in the case of
JobQueue. The reason it couldn't be done before is because the
scheduler was in the middle, deciding which entity to pull the next job
from, which was changing the seqno a job driver-fence would be assigned
(you can't guess that at queue time in that case).
[...]
> > > > + * **Reference counting**
> > > > + *
> > > > + * Jobs and queues are both reference counted.
> > > > + *
> > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > + * job that was submitted to it.
> > > > + *
> > > > + * The queue holds its own reference to a job for as long as the job is
> > > > + * internally tracked: from the moment the job is added to the pending list
> > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > >
> > > Why not simply keep track that the job was completed, instead of relinquishing
> > > the reference? We can then release the reference once the job is cleaned up
> > > (by the queue, using a worker) in process context.
>
> I think that’s what I’m doing, while also allowing an opt-in path to
> drop the job reference when it signals (in IRQ context)
Did you mean in !IRQ (or !atomic) context here? Feels weird to not
defer the cleanup when you're in an IRQ/atomic context, but defer it
when you're in a thread context.
> so we avoid
> switching to a work item just to drop a ref. That seems like a
> significant win in terms of CPU cycles.
Well, the cleanup path is probably not where latency matters the most.
It's adding scheduling overhead, sure, but given all the stuff we defer
already, I'm not sure saving a few cycles by doing the cleanup
immediately matters much. What's important to have is a way to signal
fences in an atomic context, because this has an impact on latency.
[...]
> > > > + /*
> > > > + * Drop all input dependency fences now, in process context, before the
> > > > + * final job put. Once the job is on the pending list its last reference
> > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > + * xa_destroy() would be unsafe.
> > > > + */
> > >
> > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > via ops->run_job()?
> > >
> > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > other than scheduling the queue worker?
> > >
>
> Yes, this code is required to support dropping job refs directly in the
> dma-fence callback (an opt-in feature). Again, this seems like a
> significant win in terms of CPU cycles, although I haven’t collected
> data yet.
If it significantly hurts the perf, I'd like to understand why, because
to me it looks like pure-cleanup (no signaling involved), and thus no
other process waiting for us to do the cleanup. The only thing that
might have an impact is how fast you release the resources, and given
it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
like to understand which part of the immediate cleanup is causing
contention (basically, which kinds of resources the system is starving for).
Regards,
Boris
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 20:43 ` Boris Brezillon
@ 2026-03-18 22:40 ` Matthew Brost
2026-03-19 9:57 ` Boris Brezillon
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-18 22:40 UTC (permalink / raw)
To: Boris Brezillon
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Tue, Mar 17, 2026 at 09:43:20PM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> Just a few drive-by comments.
>
> On Tue, 17 Mar 2026 11:14:36 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > > > expiry for device teardown.
> > > > >
> > > > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > > > dma_fence callback). Dependency cleanup is deferred to process context
> > > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > > >
> > > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > > > bail immediately if the queue refcount has already reached zero and
> > > > > async teardown is in flight, preventing use-after-free.
> > > >
> > > > In rust, when you queue work, you have to pass a reference-counted pointer
> > > > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > > > queued, the queue is alive.
> > > >
> > > > By the way, why can’t we simply require synchronous teardowns?
> >
> > Consider the case where the DRM dep queue’s refcount drops to zero, but
> > the device firmware still holds references to the associated queue.
> > These are resources that must be torn down asynchronously. In Xe, I need
> > to send two asynchronous firmware commands before I can safely remove
> > the memory associated with the queue (faulting on this kind of global
> > memory will take down the device) and recycle the firmware ID tied to
> > the queue. These async commands are issued on the driver side, on the
> > DRM dep queue’s workqueue as well.
>
> Asynchronous teardown is okay, but I'm not too sure using the refcnt to
> know that the queue is no longer usable is the way to go. To me the
> refcnt is what determines when the SW object is no longer referenced by
> any other item in the code, and a work item acting on the queue counts
> as one owner of this queue. If you want to cancel the work in order to
> speed up the destruction of the queue, you can call
> {cancel,disable}_work[_sync](), and have the ref dropped if the
> cancel/disable was effective. Multi-step teardown is also an option,
> but again, the state of the queue shouldn't be determined from its
> refcnt IMHO.
>
> >
> > Now consider a scenario where something goes wrong and those firmware
> > commands never complete, and a device reset is required to recover. The
> > driver’s per-queue tracking logic stops all queues (including zombie
> > ones), determines which commands were lost, cleans up the side effects
> > of that lost state, and then restarts all queues. That is how we would
> > end up in this work item with a zombie queue. The restart logic could
> > probably be made smart enough to avoid queueing work for zombie queues,
> > but in my opinion it’s safe enough to use kref_get_unless_zero() in the
> > work items.
>
> Well, that only works for single-step teardown, or when you enter the
> last step. At which point, I'm not too sure it's significantly better
> than encoding the state of the queue through a separate field, and have
> the job queue logic reject new jobs if the queue is no longer usable
> (shouldn't even be exposed to userland at this point though).
>
'shouldn't even be exposed to userland at this point though' - Yes.
The philosophy of the refcounting design is roughly:
- When the queue is created by userland, call drm_dep_queue_init().
- All jobs hold a ref to the drm_dep_queue.
- When userland closes the queue, remove it from the FD and call
drm_dep_queue_put() + initiate teardown (I'd recommend just setting the TDR
to fire immediately, kicking the queue off the device on the first fire +
signaling all fences).
- When the queue refcount goes to zero, optionally implement
drm_dep_queue_ops.fini to keep the drm_dep_queue (and the object it is
embedded in) around a bit longer if additional firmware / device-side
resources are still around; call drm_dep_queue_fini() when that part
completes. If drm_dep_queue_ops.fini isn't implemented, the core
implementation just calls drm_dep_queue_fini().
- A work item releases the drm_dep_queue outside the dma-fence signaling path
for safe memory release (e.g., taking dma-resv locks).
> >
> > It should also be clear that a DRM dep queue is primarily intended to be
> > embedded inside the driver’s own queue object, even though it is valid
> > to use it as a standalone object. The async teardown flows are also
> > optional features.
> >
> > Let’s also consider a case where you do not need the async firmware
> > flows described above, but the DRM dep queue is still embedded in a
> > driver-side object that owns memory via dma-resv. The final queue put
> > may occur in IRQ context (DRM dep avoids kicking a worker just to drop a
> > refi as opt in), or in the reclaim path (any scheduler workqueue is the
> > reclaim path). In either case, you cannot free memory there taking a
> > dma-resv lock, which is why all DRM dep queues ultimately free their
> > resources in a work item outside of reclaim. Many drivers already follow
> > this pattern, but in DRM dep this behavior is built-in.
>
> I agree deferred cleanup is the way to go.
>
+1. Yes. I've spotted a bunch of drivers that open-code this part on the
driver side, Xe included.
> >
> > So I don’t think Rust natively solves these types of problems, although
> > I’ll concede that it does make refcounting a bit more sane.
>
> Rust won't magically defer the cleanup, nor will it dictate how you want
> to do the queue teardown, those are things you need to implement. But it
> should give visibility about object lifetimes, and guarantee that an
> object that's still visible to some owners is usable (the notion of
> usable is highly dependent on the object implementation).
>
> Just a purely theoretical example of a multi-step queue teardown that
> might be possible to encode in rust:
>
> - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> There's a ::destroy() method consuming 'self' and returning a
> MyJobQueue<Destroyed> object
> - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> destroyed. No new job can be pushed. Existing jobs that didn't make
> it to the FW queue are cancelled, jobs that are in-flight are
> cancelled if they can, or are just waited upon if they can't. When
> the whole destruction step is done, ::destroyed() is called, it
> consumes 'self' and returns a MyJobQueue<Inactive> object.
> - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> any resources on this queue). It's ready to be cleaned up.
> ::cleanup() (or just ::drop()) defers the cleanup of some inner
> object that has been passed around between the various
> MyJobQueue<State> wrappers.
>
> Each of the state transition can happen asynchronously. A state
> transition consumes the object in one state, and returns a new object
> in its new state. None of the transition involves dropping a refcnt,
> ownership is just transferred. The final MyJobQueue<Inactive> object is
> the object we'll defer cleanup on.
>
> It's a very high-level view of one way this can be implemented (I'm
> sure there are others, probably better than my suggestion) in order to
> make sure the object doesn't go away without the compiler enforcing
> proper state transitions.
>
I'm sure Rust can implement this. My point about Rust is that it doesn't
magically solve hard software architecture problems, but I will admit the
ownership model and the way it can enforce locking at compile time are
pretty cool.
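For the C side of this discussion, the same idea can at least be
approximated with one struct type per state, so out-of-order transitions
become compile errors. A userspace sketch with hypothetical names (unlike
Rust, C cannot stop you from reusing a consumed pointer):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * One struct type per queue state; a transition consumes the old object
 * and returns a new one, so calling transitions out of order is a type
 * error at compile time.
 */
struct queue_usable { int inflight; };		/* jobs can still be pushed */
struct queue_destroyed { int inflight; };	/* draining, no new jobs */
struct queue_inactive { int unused; };		/* HW holds no resources */

/* Usable -> Destroyed: the user asked for teardown. */
static struct queue_destroyed *queue_destroy(struct queue_usable *q)
{
	struct queue_destroyed *d = malloc(sizeof(*d));

	d->inflight = q->inflight;
	free(q);
	return d;
}

/* Destroyed -> Inactive: legal only once all in-flight jobs are done. */
static struct queue_inactive *queue_destroyed_done(struct queue_destroyed *d)
{
	assert(d->inflight == 0);
	free(d);
	return calloc(1, sizeof(struct queue_inactive));
}

/* Inactive is the only state whose cleanup may be deferred. */
static void queue_cleanup(struct queue_inactive *i)
{
	free(i);
}
```

Each transition frees (consumes) the previous state's object, mirroring
the ownership transfer described above; the final Inactive object is the
one whose cleanup would be deferred.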
> > > > > +/**
> > > > > + * DOC: DRM dependency fence
> > > > > + *
> > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > + * completes the job.
> > > > > + *
> > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > + * NULL or an error).
> > > >
> > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > carried out by Christian?
> > > >
> >
> > Consider the case where a driver’s hardware fence is implemented as a
> > dma-fence-array or dma-fence-chain. You cannot install these types of
> > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > here.
>
> Hm, so that's a driver returning a dma_fence_array/chain through
> ::run_job()? Why would we not want to have them directly exposed and
> split up into singular fence objects at resv insertion time (I don't
> think syncobjs care, but I might be wrong). I mean, one of the point
You can stick dma-fence-arrays in syncobjs, but not chains.
Neither dma-fence-arrays nor dma-fence-chains can go into a dma-resv.
Hence, decoupling a job's finished fence from its hardware fence is IMO a
good idea to keep, as it gives drivers flexibility in their hardware
fences. e.g., if this design didn't have a job's finished fence, I'd have
to open-code one on the Xe side.
> behind the container extraction is so fences coming from the same
> context/timeline can be detected and merged. If you insert the
> container through a proxy, you're defeating the whole fence merging
> optimization.
Right. Finished fences have a single timeline too...
>
> The second thing is that I'm not sure drivers were ever supposed to
> return fence containers in the first place, because the whole idea
> behind a fence context is that fences are emitted/signalled in
> seqno-order, and if the fence is encoding the state of multiple
> timelines that progress at their own pace, it becomes tricky to control
> that. I guess if it's always the same set of timelines that are
> combined, that would work.
Xe does this and it definitely works. We submit to multiple rings; when
all rings signal a seqno, the chain or array signals -> the finished
fence signals. The queues used in this manner can only submit multi-ring
jobs, so the finished fence timeline stays intact. If you could mix a
multi-ring submission with a single-ring submission on the same queue,
yes, this could break.
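To illustrate the decoupling being argued for here, a tiny userspace
model (this is not the kernel dma_fence API; all names are made up): the
finished fence is handed out immediately and simply chained to whatever
parent fence run_job() eventually returns, whether that is a plain fence
or a composite.

```c
#include <assert.h>
#include <stddef.h>

/* A fence here is just a signaled flag plus a single callback slot. */
struct fence {
	int signaled;
	void (*cb)(struct fence *parent, void *data);
	void *cb_data;
};

static void fence_add_callback(struct fence *f,
			       void (*cb)(struct fence *, void *), void *data)
{
	if (f->signaled) {
		cb(f, data);		/* already signaled: fire immediately */
		return;
	}
	f->cb = cb;
	f->cb_data = data;
}

static void fence_signal(struct fence *f)
{
	f->signaled = 1;
	if (f->cb)
		f->cb(f, f->cb_data);
}

/* Models the done callback: finished signals once parent signals. */
static void job_done_cb(struct fence *parent, void *finished)
{
	(void)parent;
	fence_signal(finished);
}

static void chain_finished_to_parent(struct fence *finished,
				     struct fence *parent)
{
	if (!parent) {			/* run_job() returned NULL or an error */
		fence_signal(finished);
		return;
	}
	fence_add_callback(parent, job_done_cb, finished);
}
```

The finished fence can be installed into a resv or syncobj before the
parent even exists, and the parent can be anything that eventually
signals, which is what makes composite hardware fences workable behind
it.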
>
> > One example is when a single job submits work to multiple rings
> > that are flipped in hardware at the same time.
>
> We do have that in Panthor, but that's all explicit: in a single
> SUBMIT, you can have multiple jobs targeting different queues, each of
> them having their own set of deps/signal ops. The combination of all the
> signal ops into a container is left to the UMD. It could be automated
> kernel side, but that would be a flag on the SIGNAL op leading to the
> creation of a fence_array containing fences from multiple submitted
> jobs, rather than the driver combining stuff in the fence it returns in
> ::run_job().
See above. We have a dedicated queue type for these types of submissions
and a single job that submits to all the rings. We had multiple queues /
jobs in the i915 to implement this, but it turns out it is much cleaner
with a single queue / single job / multiple rings model.
>
> >
> > Another case is late arming of hardware fences in run_job (which many
> > drivers do). The proxy fence is immediately available at arm time and
> > can be installed into dma-resv or syncobjs even though the actual
> > hardware fence is not yet available. I think most drivers could be
> > refactored to make the hardware fence immediately available at run_job,
> > though.
>
> Yep, I also think we can arm the driver fence early in the case of
> JobQueue. The reason it couldn't be done before is because the
> scheduler was in the middle, deciding which entity to pull the next job
> from, which was changing the seqno a job driver-fence would be assigned
> (you can't guess that at queue time in that case).
>
Xe doesn't need late arming, but it looks like multiple drivers
implement late arming, which may be required (?).
> [...]
>
> > > > > + * **Reference counting**
> > > > > + *
> > > > > + * Jobs and queues are both reference counted.
> > > > > + *
> > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > + * job that was submitted to it.
> > > > > + *
> > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > >
> > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > the reference? We can then release the reference once the job is cleaned up
> > > > (by the queue, using a worker) in process context.
> >
> > I think that’s what I’m doing, while also allowing an opt-in path to
> > drop the job reference when it signals (in IRQ context)
>
> Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> defer the cleanup when you're in an IRQ/atomic context, but defer it
> when you're in a thread context.
>
The put of a job in this design can happen from an IRQ context (an
opt-in feature). xa_destroy() blows up if it is called from an IRQ
context, although maybe that could be worked around.
> > so we avoid
> > switching to a work item just to drop a ref. That seems like a
> > significant win in terms of CPU cycles.
>
> Well, the cleanup path is probably not where latency matters the most.
Agree. But I do think avoiding a CPU context switch (work item) for a
very lightweight job cleanup (usually just dropping refs) will save CPU
cycles, and thus also things like power, etc...
> It's adding scheduling overhead, sure, but given all the stuff we defer
> already, I'm not too sure we're at saving a few cycles to get the
> cleanup done immediately. What's important to have is a way to signal
> fences in an atomic context, because this has an impact on latency.
>
Yes. The signaling happens first, then drm_dep_job_put() if the IRQ
opt-in is set.
> [...]
>
> > > > > + /*
> > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > + * xa_destroy() would be unsafe.
> > > > > + */
> > > >
> > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > via ops->run_job()?
> > > >
> > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > other than scheduling the queue worker?
> > > >
> >
> > Yes, this code is required to support dropping job refs directly in the
> > dma-fence callback (an opt-in feature). Again, this seems like a
> > significant win in terms of CPU cycles, although I haven’t collected
> > data yet.
>
> If it significantly hurts the perf, I'd like to understand why, because
> to me it looks like pure-cleanup (no signaling involved), and thus no
> other process waiting for us to do the cleanup. The only thing that
> might have an impact is how fast you release the resources, and given
> it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> like to understand which part of the immediate cleanup is causing a
> contention (basically which kind of resources the system is starving of)
>
It was more that, once we moved to a refcounted model, it is pretty
trivial to allow drm_dep_job_put() from the fence signaling path. It
doesn't really add any complexity either, which is why I added it.
Matt
> Regards,
>
> Boris
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 14:33 ` Danilo Krummrich
@ 2026-03-18 22:50 ` Matthew Brost
0 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-18 22:50 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Daniel Almeida, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, David Airlie, Maarten Lankhorst,
Maxime Ripard, Philipp Stanner, Simona Vetter, Sumit Semwal,
Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux, Miguel Ojeda
On Tue, Mar 17, 2026 at 03:33:13PM +0100, Danilo Krummrich wrote:
> On Tue Mar 17, 2026 at 3:25 PM CET, Daniel Almeida wrote:
> >
> >
> >> On 17 Mar 2026, at 09:31, Danilo Krummrich <dakr@kernel.org> wrote:
> >>
> >> On Tue Mar 17, 2026 at 3:47 AM CET, Daniel Almeida wrote:
> >>> I agree with what Danilo said below, i.e.: IMHO, with the direction that DRM
> >>> is going, it is much more ergonomic to add a Rust component with a nice C
> >>> interface than doing it the other way around.
> >>
> >> This is not exactly what I said. I was talking about the maintainance aspects
> >> and that a Rust Jobqueue implementation (for the reasons explained in my initial
> >> reply) is easily justifiable in this aspect, whereas another C implementation,
> >> that does *not* replace the existing DRM scheduler entirely, is much harder to
> >> justify from a maintainance perspective.
> >
> > Ok, I misunderstood your point a bit.
> >
> >>
> >> I'm also not sure whether a C interface from the Rust side is easy to establish.
> >> We don't want to limit ourselves in terms of language capabilities for this and
> >> passing through all the additional information Rust carries in the type system
> >> might not be straight forward.
> >>
> >> It would be an experiment, and it was one of the ideas behind the Rust Jobqueue
> >> to see how it turns if we try. Always with the fallback of having C
> >> infrastructure as an alternative when it doesn't work out well.
> >
> > From previous experience in doing Rust to C FFI in NVK, I don’t see, at
> > first, why this can’t work. But I agree with you, there may very well be
> > unanticipated things here and this part is indeed an experiment. No argument
> > from me here.
> >
> >>
> >> Having this said, I don't see an issue with the drm_dep thing going forward if
> >> there is a path to replacing DRM sched entirely.
The only weird case I haven't wrapped my head around quite yet is the
ganged submissions that rely on the scheduled fence (PVR and AMDGPU do
this). Pretty much every other driver seems like it could be converted
with what I have in place in this series + local work to provide a
hardware scheduler...
> >
> > The issues I pointed out remain. Even if the plan is to have drm_dep + JobQueue
> > (and no drm_sched). I feel that my point of considering doing it in Rust remains.
>
> I mean, as mentioned below, we should have a Rust Jobqueue as independent
> component. Or are you saying you'd consider having only a Rust component with a
> C API eventually? If so, that'd be way too early to consider for various
> reasons.
>
We need some C story one way or another, as we have C drivers and DRM
sched is not cutting it, nor is it maintainable.
> >> The Rust component should remain independent from this for the reasons mentioned
> >> in [1].
> >>
> >> [1] https://lore.kernel.org/dri-devel/DH51W6XRQXYX.3M30IRYIWZLFG@kernel.org/
Fair enough. I read through [1], let me respond there.
Matt
> >
> > Ok
> >
> > — Daniel
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 12:19 ` Danilo Krummrich
@ 2026-03-18 23:02 ` Matthew Brost
0 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-18 23:02 UTC (permalink / raw)
To: Danilo Krummrich
Cc: intel-xe, dri-devel, Boris Brezillon, Tvrtko Ursulin,
Rodrigo Vivi, Thomas Hellström, Christian König,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Miguel Ojeda
On Tue, Mar 17, 2026 at 01:19:47PM +0100, Danilo Krummrich wrote:
> On Tue Mar 17, 2026 at 6:10 AM CET, Matthew Brost wrote:
> > On Mon, Mar 16, 2026 at 11:25:23AM +0100, Danilo Krummrich wrote:
> >> The reason I proposed a new component for Rust, is basically what you also wrote
> >> in your cover letter, plus the fact that it prevents us having to build a Rust
> >> abstraction layer to the DRM GPU scheduler.
> >>
> >> The latter I identified as pretty questionable as building another abstraction
> >> layer on top of some infrastructure is really something that you only want to do
> >> when it is mature enough in terms of lifetime and ownership model.
> >>
> >
> > I personally don’t think the language matters that much. I care about
> > lifetime, ownership, and teardown semantics. I believe I’ve made this
> > clear in C, so the Rust bindings should be trivial.
>
> No, they won't be trivial -- in fact, in the case of the Jobqueue they may even
> end up being more complicated than a native implementation.
>
> We still want to build the object model around it that allows us to catch most
> of the pitfalls at compile time rather than runtime.
>
I do understand this line of thinking.
> For instance, there has been a proposal of having specific work and workqueue
> types that ensure not to violate DMA fence rules, which the Jobqueue can adopt.
>
Yes, I want to add that for C too (ofc this can't be enforced at compile time).
I've started on this here [1]
https://patchwork.freedesktop.org/patch/711929/?series=163245&rev=1
> We can also use Klint to ensure correctness for those types at compile time.
>
> So, I can easily see this becoming more complicated when we have to go through
> an FFI layer that makes us loose additional type information / guarantees.
>
> Anyways, I don't want to argue about this. I don't know why the whole thread
> took this direction.
Yea, my bad. I apologize. This shouldn't be about whether Rust or C is better.
>
> This is not about C vs. Rust, and I see the Rust component to be added
> regardless of this effort.
+1.
>
> The question for me is whether we want to have a second component besides the
> GPU scheduler on the C side or not.
>
> If we can replace the existing scheduler entirely and rework all the drivers
> that'd be great and you absolutely have my blessing.
>
I'll copy paste my reply from [2].
The only weird case I haven't wrapped my head around quite yet is the
ganged submissions that rely on the scheduled fence (PVR and AMDGPU do
this). Pretty much every other driver seems like it could be converted
with what I have in place in this series + local work to provide a
hardware scheduler...
[2] https://patchwork.freedesktop.org/patch/711933/?series=163245&rev=1#comment_1311418
> But, I don't want to end up in a situation where this is landed, one or two
> drivers are converted, and everything else is left behind in terms of
> maintainance / maintainer commitment.
>
I can get other drivers to compile, but I can't do things like test
these changes for other vendors. Also, if existing drivers break
dma-fencing rules, those would need to be fixed when converting over too.
> >> My point is, the justification for a new Jobqueue component in Rust I consider
> >> given by the fact that it allows us to avoid building another abstraction layer
> >> on top of DRM sched. Additionally, DRM moves to Rust and gathering experience
> >> with building native Rust components seems like a good synergy in this context.
> >>
> >
> > If I knew Rust off-hand, I would have written it in Rust :). Perhaps
> > this is an opportunity to learn. But I think the Rust vs. C holy war
> > isn’t in scope here. The real questions are what semantics we want, the
> > timeline, and maintainability. Certainly more people know C, and most
> > drivers are written in C, so having the common component in C makes more
> > sense at this point, in my opinion. If the objection is really about the
> > language, I’ll rewrite it in Rust.
>
> Again, I'm not talking about Rust vs. C. I'm talking about why a new Rust
> component is much easier to justify maintainance wise than a new C component is.
>
> That is, the existing infrastructure has problems we don't want to build on top
> of and the abstraction ends up being of a similar magnitude as a native
> implementation.
>
> A new C implementation alongside the existing one is a whole different question.
>
> >> Having that said, the obvious question for me for this series is how drm_dep
> >> fits into the bigger picture.
> >>
> >> I.e. what is the maintainance strategy?
> >>
> >
> > I will commit to maintaining code I believe in, and immediately write
> > the bindings on top of this so they’re maintained from day one.
>
> This I am sure about, but what about the existing scheduler infrastructure? Are
+1
> you going to keep this up as well?
The fact is we can't maintain DRM sched as it is now; the technical debt
is just too high.
>
> Who keeps supporting it for all the drivers that can't switch (due to not having
> firmware queues) or simply did not switch yet?
>
I'd say I agree with your no-new-features statement for DRM sched; if you
want new features in C, fix your driver to use what I have here.
Ofc, if bugs pop up in DRM sched, I'm happy to help fix those.
> >> Do we want to support three components allowing users to do the same thing? What
> >> happens to DRM sched for 1:1 entity / scheduler relationships?
> >>
> >> Is it worth? Do we have enough C users to justify the maintainance of yet
> >> another component? (Again, DRM moves into the direction of Rust drivers, so I
> >> don't know how many new C drivers we will see.) I.e. having this component won't
> >> get us rid of the majority of DRM sched users.
> >>
> >
> > Actually, with [1], I’m fairly certain that pretty much every driver
> > could convert to this new code. Part of the problem, though, is that
> > when looking at this, multiple drivers clearly break dma-fencing rules,
> > so an annotated component like DRM dep would explode their drivers. Not
> > to mention the many driver-side hacks that each individual driver would
> > need to drop (e.g., I would not be receptive to any driver directly
> > touching drm_dep object structs).
>
> I thought the API can't be abused? :) How would you prevent drivers doing this
> in practice? They need to have the struct definition, and once they have it, you
> can't do anything about them peeking internals, if not caught through review.
>
I agree with this statement. So it would require disciplined reviews...
e.g., you are touching this struct directly - why? Is drm dep missing
something in the API your driver needs? If it is a real issue, we add it
to drm dep.
> > Maintainable, as I understand every single LOC, with verbose documentation
> > (generated with Copilot, but I’ve reviewed it multiple times and it’s
> > correct), etc.
>
> I'm not sure this is the core criteria for evaluating whether something is
> maintainable or not.
>
> To be honest, this does not sound very community focused.
>
That is not my intent. I did write this as a common component for
drivers, to be used by the community.
> > Regardless, given all of the above, at a minimum my driver needs to move
> > on one way or another.
>
> Your driver? What do you mean with "it has to move"?
DRM sched is not maintainable, thus it is not going to fit the needs of
Xe in the long term. Other drivers are facing similar issues.
Matt
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 14:55 ` Boris Brezillon
@ 2026-03-18 23:28 ` Matthew Brost
2026-03-19 9:11 ` Boris Brezillon
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-18 23:28 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Tue, Mar 17, 2026 at 03:55:12PM +0100, Boris Brezillon wrote:
> On Sun, 15 Mar 2026 21:32:45 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > +/**
> > + * struct drm_dep_fence - fence tracking the completion of a dep job
> > + *
> > + * Contains a single dma_fence (@finished) that is signalled when the
> > + * hardware completes the job. The fence uses the kernel's inline_lock
> > + * (no external spinlock required).
> > + *
> > + * This struct is private to the drm_dep module; external code interacts
> > + * through the accessor functions declared in drm_dep_fence.h.
> > + */
> > +struct drm_dep_fence {
> > + /**
> > + * @finished: signalled when the job completes on hardware.
> > + *
> > + * Drivers should use this fence as the out-fence for a job since it
> > + * is available immediately upon drm_dep_job_arm().
> > + */
> > + struct dma_fence finished;
> > +
> > + /**
> > + * @deadline: deadline set on @finished which potentially needs to be
> > + * propagated to @parent.
> > + */
> > + ktime_t deadline;
> > +
> > + /**
> > + * @parent: The hardware fence returned by &drm_dep_queue_ops.run_job.
> > + *
> > + * @finished is signaled once @parent is signaled. The initial store is
> > + * performed via smp_store_release to synchronize with deadline handling.
> > + *
> > + * All readers must access this under the fence lock and take a reference to
> > + * it, as @parent is set to NULL under the fence lock when the drm_dep_fence
> > + * signals, and this drop also releases its internal reference.
> > + */
> > + struct dma_fence *parent;
> > +
> > + /**
> > + * @q: the queue this fence belongs to.
> > + */
> > + struct drm_dep_queue *q;
> > +};
>
> As Daniel pointed out already, with Christian's recent changes to
> dma_fence (the ones that reset dma_fence::ops after ::signal()), the
> fence proxy that existed in drm_sched_fence is no longer required:
> drivers and their implementations can safely vanish even if some fences
> they have emitted are still referenced by other subsystems. All we need
> is:
>
I believe the late arming and dma-fence-array / chain cases would still
need to be addressed. I've replied in detail in another fork of this
thread already, so I will not cover it here.
> - fence must be signaled for dma_fence::ops to be set back to NULL
> - no .cleanup and no .wait implementation
>
> There might be an interest in having HW submission fences reflecting
> when the job is passed to the FW/HW queue, but that can done as a
> separate fence implementation using a different fence timeline/context.
>
Yes, I removed the scheduled side of the drm sched fence as I figured
that could be implemented driver side (or as an optional API in drm dep).
Only AMDGPU / PVR use it for ganged submissions, which I need to wrap
my head around. My initial thought is both implementations could likely
be simplified.
> > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > new file mode 100644
> > index 000000000000..2d012b29a5fc
> > --- /dev/null
> > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright 2015 Advanced Micro Devices, Inc.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > + * copy of this software and associated documentation files (the "Software"),
> > + * to deal in the Software without restriction, including without limitation
> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > + * and/or sell copies of the Software, and to permit persons to whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > + * OTHER DEALINGS IN THE SOFTWARE.
> > + *
> > + * Copyright © 2026 Intel Corporation
> > + */
> > +
> > +/**
> > + * DOC: DRM dependency job
> > + *
> > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > + * a struct drm_dep_queue. The lifecycle of a job is:
> > + *
> > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > + * embedding struct drm_dep_job in a larger structure) and calls
> > + * drm_dep_job_init() to initialise it. On success the job holds one
> > + * kref reference and a reference to its queue.
> > + *
> > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > + * that must be signalled before the job can run. Duplicate fences from the
> > + * same fence context are deduplicated automatically.
> > + *
> > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > + * consuming a sequence number from the queue. After arming,
> > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > + * userspace or used as a dependency by other jobs.
> > + *
> > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > + * queue takes a reference that it holds until the job's finished fence
> > + * signals and the job is freed by the put_job worker.
> > + *
> > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > + *
> > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > + * objects before the driver's release callback is invoked.
> > + */
> > +
> > +#include <linux/dma-resv.h>
> > +#include <linux/kref.h>
> > +#include <linux/slab.h>
> > +#include <drm/drm_dep.h>
> > +#include <drm/drm_file.h>
> > +#include <drm/drm_gem.h>
> > +#include <drm/drm_syncobj.h>
> > +#include "drm_dep_fence.h"
> > +#include "drm_dep_job.h"
> > +#include "drm_dep_queue.h"
> > +
> > +/**
> > + * drm_dep_job_init() - initialise a dep job
> > + * @job: dep job to initialise
> > + * @args: initialisation arguments
> > + *
> > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > + * job reference is dropped.
> > + *
> > + * Resources are released automatically when the last reference is dropped
> > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > + * must not free the job directly.
> > + *
> > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > + * Return: 0 on success, -%EINVAL if credits is 0,
> > + * -%ENOMEM on fence allocation failure.
> > + */
> > +int drm_dep_job_init(struct drm_dep_job *job,
> > + const struct drm_dep_job_init_args *args)
> > +{
> > + if (unlikely(!args->credits)) {
> > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > + return -EINVAL;
> > + }
> > +
> > + memset(job, 0, sizeof(*job));
> > +
> > + job->dfence = drm_dep_fence_alloc();
> > + if (!job->dfence)
> > + return -ENOMEM;
> > +
> > + job->ops = args->ops;
> > + job->q = drm_dep_queue_get(args->q);
> > + job->credits = args->credits;
> > +
> > + kref_init(&job->refcount);
> > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > + INIT_LIST_HEAD(&job->pending_link);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_init);
> > +
> > +/**
> > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > + * @job: dep job whose dependency xarray to drain
> > + *
> > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > + * i.e. slots that were pre-allocated but never replaced — are silently
> > + * skipped; the sentinel carries no reference. Called from
> > + * drm_dep_queue_run_job() in process context immediately after
> > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > + * dependencies here — while still in process context — avoids calling
> > + * xa_destroy() from IRQ context if the job's last reference is later
> > + * dropped from a dma_fence callback.
> > + *
> > + * Context: Process context.
> > + */
> > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > +{
> > + struct dma_fence *fence;
> > + unsigned long index;
> > +
> > + xa_for_each(&job->dependencies, index, fence) {
> > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > + continue;
> > + dma_fence_put(fence);
> > + }
> > + xa_destroy(&job->dependencies);
> > +}
> > +
> > +/**
> > + * drm_dep_job_fini() - clean up a dep job
> > + * @job: dep job to clean up
> > + *
> > + * Cleans up the dep fence and drops the queue reference held by @job.
> > + *
> > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > + * the dependency xarray is also released here. For armed jobs the xarray
> > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > + * context immediately after run_job(), so it is left untouched to avoid
> > + * calling xa_destroy() from IRQ context.
> > + *
> > + * Warns if @job is still linked on the queue's pending list, which would
> > + * indicate a bug in the teardown ordering.
> > + *
> > + * Context: Any context.
> > + */
> > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > +{
> > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > +
> > + WARN_ON(!list_empty(&job->pending_link));
> > +
> > + drm_dep_fence_cleanup(job->dfence);
> > + job->dfence = NULL;
> > +
> > + /*
> > + * Armed jobs have their dependencies drained by
> > + * drm_dep_job_drop_dependencies() in process context after run_job().
>
> Just want to clear the confusion and make sure I get this right at the
> same time. To me, "process context" means a user thread entering some
> syscall(). What you call "process context" is more a "thread context" to
> me. I'm actually almost certain it's always a kernel thread (a workqueue
> worker thread to be accurate) that executes the drop_deps() after a
> run_job().
Some of the context comments could likely be cleaned up. 'Process
context' here means either user context (the bypass path) or the
run_job work item.
>
> > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > + */
> > + if (!armed)
> > + drm_dep_job_drop_dependencies(job);
>
> Why do we need to make a difference here. Can't we just assume that the
> hole drm_dep_job_fini() call is unsafe in atomic context, and have a
> work item embedded in the job to defer its destruction when _put() is
> called in a context where the destruction is not allowed?
>
We already touched on this, but the design currently allows the last job
put from the dma-fence signaling path (IRQ). If we dropped that, then
yes, this could change. The reason for the if statement is that the user
may be building a job and need to abort prior to calling arm() (e.g., a
memory allocation fails) via drm_dep_job_put().
Once arm() is called, there is a guarantee the run_job path is called
either via the bypass or the run_job work item.
> > +}
> > +
> > +/**
> > + * drm_dep_job_get() - acquire a reference to a dep job
> > + * @job: dep job to acquire a reference on, or NULL
> > + *
> > + * Context: Any context.
> > + * Return: @job with an additional reference held, or NULL if @job is NULL.
> > + */
> > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> > +{
> > + if (job)
> > + kref_get(&job->refcount);
> > + return job;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_get);
> > +
> > +/**
> > + * drm_dep_job_release() - kref release callback for a dep job
> > + * @kref: kref embedded in the dep job
> > + *
> > + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> > + * otherwise frees @job with kfree(). Finally, releases the queue reference
> > + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> > + * queue put is performed last to ensure no queue state is accessed after
> > + * the job memory is freed.
> > + *
> > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > + * job's queue; otherwise process context only, as the release callback may
> > + * sleep.
> > + */
> > +static void drm_dep_job_release(struct kref *kref)
> > +{
> > + struct drm_dep_job *job =
> > + container_of(kref, struct drm_dep_job, refcount);
> > + struct drm_dep_queue *q = job->q;
> > +
> > + drm_dep_job_fini(job);
> > +
> > + if (job->ops && job->ops->release)
> > + job->ops->release(job);
> > + else
> > + kfree(job);
> > +
> > + drm_dep_queue_put(q);
> > +}
> > +
> > +/**
> > + * drm_dep_job_put() - release a reference to a dep job
> > + * @job: dep job to release a reference on, or NULL
> > + *
> > + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> > + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> > + *
> > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > + * job's queue; otherwise process context only, as the release callback may
> > + * sleep.
> > + */
> > +void drm_dep_job_put(struct drm_dep_job *job)
> > +{
> > + if (job)
> > + kref_put(&job->refcount, drm_dep_job_release);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_put);
> > +
> > +/**
> > + * drm_dep_job_arm() - arm a dep job for submission
> > + * @job: dep job to arm
> > + *
> > + * Initialises the finished fence on @job->dfence, assigning
> > + * it a sequence number from the job's queue. Must be called after
> > + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > + * userspace or used as a dependency by other jobs.
> > + *
> > + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> > + * After this point, memory allocations that could trigger reclaim are
> > + * forbidden; lockdep enforces this. arm() must always be paired with
> > + * drm_dep_job_push(); lockdep also enforces this pairing.
> > + *
> > + * Warns if the job has already been armed.
> > + *
> > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > + * path.
> > + */
> > +void drm_dep_job_arm(struct drm_dep_job *job)
> > +{
> > + drm_dep_queue_push_job_begin(job->q);
> > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > + drm_dep_fence_init(job->dfence, job->q);
> > + job->signalling_cookie = dma_fence_begin_signalling();
>
> I'd really like DMA-signalling-path annotation to be something that
> doesn't leak to the job object. The way I see it, in the submit path,
> it should be some sort of block initializing an opaque token, and
> drm_dep_job_arm() should expect a valid token to be passed, thus
> guaranteeing that anything between arm and push, and more generally
> anything in that section is safe.
>
Yes. drm_dep_queue_push_job_begin() internally creates a token (current)
that is paired with drm_dep_queue_push_job_end(). If you ever have an
imbalance between arm() and push() you will get complaints.
> struct drm_job_submit_context submit_ctx;
>
> // Do all the prep stuff, pre-alloc, resv setup, ...
>
> // Non-faillible section of the submit starts here.
> // This is properly annotated with
> // dma_fence_{begin,end}_signalling() to ensure we're
> // not taking locks or doing allocations forbidden in
> // the signalling path
> drm_job_submit_non_faillible_section(&submit_ctx) {
> for_each_job() {
> drm_dep_job_arm(&submit_ctx, &job);
>
> // pass the armed fence around, if needed
>
> drm_dep_job_push(&submit_ctx, &job);
> }
> }
>
> With the current solution, there's no control that
> drm_dep_job_{arm,push}() calls are balanced, with the risk of leaving a
> DMA-signalling annotation behind.
See above, that is what drm_dep_queue_push_job_begin/end do.
>
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_arm);
> > +
> > +/**
> > + * drm_dep_job_push() - submit a job to its queue for execution
> > + * @job: dep job to push
> > + *
> > + * Submits @job to the queue it was initialised with. Must be called after
> > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > + * held until the queue is fully done with it. The reference is released
> > + * directly in the finished-fence dma_fence callback for queues with
> > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > + * from hardirq context), or via the put_job work item on the submit
> > + * workqueue otherwise.
> > + *
> > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > + * enforces the pairing.
> > + *
> > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > + * @job exactly once, even if the queue is killed or torn down before the
> > + * job reaches the head of the queue. Drivers can use this guarantee to
> > + * perform bookkeeping cleanup; the actual backend operation should be
> > + * skipped when drm_dep_queue_is_killed() returns true.
> > + *
> > + * If the queue does not support the bypass path, the job is pushed directly
> > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > + *
> > + * Warns if the job has not been armed.
> > + *
> > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > + * path.
> > + */
> > +void drm_dep_job_push(struct drm_dep_job *job)
> > +{
> > + struct drm_dep_queue *q = job->q;
> > +
> > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > +
> > + drm_dep_job_get(job);
> > +
> > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > + drm_dep_queue_push_job(q, job);
> > + dma_fence_end_signalling(job->signalling_cookie);
> > + drm_dep_queue_push_job_end(job->q);
> > + return;
> > + }
> > +
> > + scoped_guard(mutex, &q->sched.lock) {
> > + if (drm_dep_queue_can_job_bypass(q, job))
> > + drm_dep_queue_run_job(q, job);
> > + else
> > + drm_dep_queue_push_job(q, job);
> > + }
> > +
> > + dma_fence_end_signalling(job->signalling_cookie);
> > + drm_dep_queue_push_job_end(job->q);
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_push);
> > +
> > +/**
> > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > + * @job: dep job to add the dependencies to
> > + * @fence: the dma_fence to add to the list of dependencies, or
> > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > + *
> > + * Note that @fence is consumed in both the success and error cases (except
> > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > + *
> > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > + * fence->context matches the queue's finished fence context) are silently
> > + * dropped; the job need not wait on its own queue's output.
> > + *
> > + * Warns if the job has already been armed (dependencies must be added before
> > + * drm_dep_job_arm()).
> > + *
> > + * **Pre-allocation pattern**
> > + *
> > + * When multiple jobs across different queues must be prepared and submitted
> > + * together in a single atomic commit — for example, where job A's finished
> > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > + * region. Once that region has started no memory allocation is permitted.
> > + *
> > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > + * always index 0 when the dependency array is empty, as Xe relies on).
> > + * After all jobs have been armed and the finished fences are available, call
> > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > + * called from atomic or signalling context.
> > + *
> > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > + * ensuring a slot is always allocated even when the real fence is not yet
> > + * known.
> > + *
> > + * **Example: bind job feeding TLB invalidation jobs**
> > + *
> > + * Consider a GPU with separate queues for page-table bind operations and for
> > + * TLB invalidation. A single atomic commit must:
> > + *
> > + * 1. Run a bind job that modifies page tables.
> > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > + * completing, so stale translations are flushed before the engines
> > + * continue.
> > + *
> > + * Because all jobs must be armed and pushed inside a signalling region (where
> > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > + *
> > + * // Phase 1 — process context, GFP_KERNEL allowed
> > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > + * for_each_mmu(mmu) {
> > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > + * // Pre-allocate slot at index 0; real fence not available yet
> > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > + * }
> > + *
> > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > + * dma_fence_begin_signalling();
> > + * drm_dep_job_arm(bind_job);
> > + * for_each_mmu(mmu) {
> > + * // Swap sentinel for bind job's finished fence
> > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > + * dma_fence_get(bind_job->finished));
>
> Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
> submit path. If we were to add it, it would be around the
> panthor_submit_ctx_push_jobs() call, which might seem broken. In
Yes, I noticed that. I put an XXX comment in my port [1] around this.
[1] https://patchwork.freedesktop.org/patch/711952/?series=163245&rev=1
> practice I don't think it is because we don't expose fences to the
> outside world until all jobs have been pushed. So what happens is that
> a job depending on a previous job in the same batch-submit has the
> armed-but-not-yet-pushed fence in its deps, and that's the only place
> where this fence is present. If something fails on a subsequent job
> preparation in the next batch submit, the rollback logic will just drop
> the jobs on the floor, and release the armed-but-not-pushed-fence,
> meaning we're not leaking a fence that will never be signalled. I'm in
> no way saying this design is sane, just trying to explain why it's
> currently safe and works fine.
Yep, I think it would be better to have no failure points between arm and
push, which again I do my best to enforce via lockdep/warnings.
>
> In general, I wonder if we should distinguish between "armed" and
> "publicly exposed" to help deal with this intra-batch dep thing without
> resorting to reservation and other tricks like that.
>
I'm not exactly sure what you're suggesting, but I'm always open to ideas.
> > + * drm_dep_job_arm(tlb_job[mmu]);
> > + * }
> > + * drm_dep_job_push(bind_job);
> > + * for_each_mmu(mmu)
> > + * drm_dep_job_push(tlb_job[mmu]);
> > + * dma_fence_end_signalling();
> > + *
> > + * Context: Process context. May allocate memory with GFP_KERNEL.
> > + * Return: If fence == DRM_DEP_JOB_FENCE_PREALLOC index of allocation on
> > + * success, else 0 on success, or a negative error code.
> > + */
> > +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> > +{
> > + struct drm_dep_queue *q = job->q;
> > + struct dma_fence *entry;
> > + unsigned long index;
> > + u32 id = 0;
> > + int ret;
> > +
> > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > + might_alloc(GFP_KERNEL);
> > +
> > + if (!fence)
> > + return 0;
> > +
> > + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> > + goto add_fence;
> > +
> > + /*
> > + * Ignore signalled fences or fences from our own queue — finished
> > + * fences use q->fence.context.
> > + */
> > + if (dma_fence_test_signaled_flag(fence) ||
> > + fence->context == q->fence.context) {
> > + dma_fence_put(fence);
> > + return 0;
> > + }
> > +
> > + /* Deduplicate if we already depend on a fence from the same context.
> > + * This lets the size of the array of deps scale with the number of
> > + * engines involved, rather than the number of BOs.
> > + */
> > + xa_for_each(&job->dependencies, index, entry) {
> > + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> > + entry->context != fence->context)
> > + continue;
> > +
> > + if (dma_fence_is_later(fence, entry)) {
> > + dma_fence_put(entry);
> > + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> > + } else {
> > + dma_fence_put(fence);
> > + }
> > + return 0;
> > + }
> > +
> > +add_fence:
> > + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> > + GFP_KERNEL);
> > + if (ret != 0) {
> > + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> > + dma_fence_put(fence);
> > + return ret;
> > + }
> > +
> > + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> > +
> > +/**
> > + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> > + * @job: dep job to update
> > + * @index: xarray index of the slot to replace, as returned when the sentinel
> > + * was originally inserted via drm_dep_job_add_dependency()
> > + * @fence: the real dma_fence to store; its reference is always consumed
> > + *
> > + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> > + * @job->dependencies with @fence. The slot must have been pre-allocated by
> > + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> > + * existing entry is asserted to be the sentinel.
> > + *
> > + * This is the second half of the pre-allocation pattern described in
> > + * drm_dep_job_add_dependency(). It is intended to be called inside a
> > + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> > + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> > + * internally so it is safe to call from atomic or signalling context, but
> > + * since the slot has been pre-allocated no actual memory allocation occurs.
> > + *
> > + * If @fence is already signalled the slot is erased rather than storing a
> > + * redundant dependency. The successful store is asserted — if the store
> > + * fails it indicates a programming error (slot index out of range or
> > + * concurrent modification).
> > + *
> > + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
> > + *
> > + * Context: Any context. DMA fence signaling path.
> > + */
> > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > + struct dma_fence *fence)
> > +{
> > + WARN_ON(xa_load(&job->dependencies, index) !=
> > + DRM_DEP_JOB_FENCE_PREALLOC);
> > +
> > + if (dma_fence_test_signaled_flag(fence)) {
> > + xa_erase(&job->dependencies, index);
> > + dma_fence_put(fence);
> > + return;
> > + }
> > +
> > + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> > + GFP_NOWAIT)))) {
> > + dma_fence_put(fence);
> > + return;
> > + }
>
> You don't seem to go for the
> replace-if-earlier-fence-on-same-context-exists optimization that we
> have in drm_dep_job_add_dependency(). Any reason not to?
>
No, that could be added in. My reasoning for omitting it was that if you
are pre-allocating a slot you likely know the same timeline hasn't already
been added, but maybe that is a bad assumption.
Matt
> > +}
> > +EXPORT_SYMBOL(drm_dep_job_replace_dependency);
> > +
>
> I'm going to stop here for today.
>
> Regards,
>
> Boris
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-18 23:28 ` Matthew Brost
@ 2026-03-19 9:11 ` Boris Brezillon
2026-03-23 4:50 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-19 9:11 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
Hi Matthew,
On Wed, 18 Mar 2026 16:28:13 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> > - fence must be signaled for dma_fence::ops to be set back to NULL
> > - no .cleanup and no .wait implementation
> >
> > There might be an interest in having HW submission fences reflecting
> > when the job is passed to the FW/HW queue, but that can done as a
> > separate fence implementation using a different fence timeline/context.
> >
>
> Yes, I removed the scheduled side of the drm sched fence as I figured
> that could be implemented driver side (or as an optional API in drm dep).
> Only AMDGPU / PVR use these for ganged submissions, which I need to wrap
> my head around. My initial thought is both implementations likely could
> be simplified.
IIRC, PVR was also relying on it to allow native FW waits: when a job has
deps that are backed by fences emitted by the same driver, they are
detected and lowered to waits on the "scheduled" fence; the wait on the
finished fence is done FW-side.
>
> > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > new file mode 100644
> > > index 000000000000..2d012b29a5fc
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > @@ -0,0 +1,675 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice shall be included in
> > > + * all copies or substantial portions of the Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > + *
> > > + * Copyright © 2026 Intel Corporation
> > > + */
> > > +
> > > +/**
> > > + * DOC: DRM dependency job
> > > + *
> > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > + *
> > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > + * kref reference and a reference to its queue.
> > > + *
> > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > + * that must be signalled before the job can run. Duplicate fences from the
> > > + * same fence context are deduplicated automatically.
> > > + *
> > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > + * consuming a sequence number from the queue. After arming,
> > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > + * userspace or used as a dependency by other jobs.
> > > + *
> > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > + * queue takes a reference that it holds until the job's finished fence
> > > + * signals and the job is freed by the put_job worker.
> > > + *
> > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > + *
> > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > + * objects before the driver's release callback is invoked.
> > > + */
> > > +
> > > +#include <linux/dma-resv.h>
> > > +#include <linux/kref.h>
> > > +#include <linux/slab.h>
> > > +#include <drm/drm_dep.h>
> > > +#include <drm/drm_file.h>
> > > +#include <drm/drm_gem.h>
> > > +#include <drm/drm_syncobj.h>
> > > +#include "drm_dep_fence.h"
> > > +#include "drm_dep_job.h"
> > > +#include "drm_dep_queue.h"
> > > +
> > > +/**
> > > + * drm_dep_job_init() - initialise a dep job
> > > + * @job: dep job to initialise
> > > + * @args: initialisation arguments
> > > + *
> > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > + * job reference is dropped.
> > > + *
> > > + * Resources are released automatically when the last reference is dropped
> > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > + * must not free the job directly.
> > > + *
> > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > + * -%ENOMEM on fence allocation failure.
> > > + */
> > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > + const struct drm_dep_job_init_args *args)
> > > +{
> > > + if (unlikely(!args->credits)) {
> > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + memset(job, 0, sizeof(*job));
> > > +
> > > + job->dfence = drm_dep_fence_alloc();
> > > + if (!job->dfence)
> > > + return -ENOMEM;
> > > +
> > > + job->ops = args->ops;
> > > + job->q = drm_dep_queue_get(args->q);
> > > + job->credits = args->credits;
> > > +
> > > + kref_init(&job->refcount);
> > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > + INIT_LIST_HEAD(&job->pending_link);
> > > +
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > +
> > > +/**
> > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > + * @job: dep job whose dependency xarray to drain
> > > + *
> > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > + * skipped; the sentinel carries no reference. Called from
> > > + * drm_dep_queue_run_job() in process context immediately after
> > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > + * dependencies here — while still in process context — avoids calling
> > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > + * dropped from a dma_fence callback.
> > > + *
> > > + * Context: Process context.
> > > + */
> > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > +{
> > > + struct dma_fence *fence;
> > > + unsigned long index;
> > > +
> > > + xa_for_each(&job->dependencies, index, fence) {
> > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > + continue;
> > > + dma_fence_put(fence);
> > > + }
> > > + xa_destroy(&job->dependencies);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_job_fini() - clean up a dep job
> > > + * @job: dep job to clean up
> > > + *
> > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > + *
> > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > + * context immediately after run_job(), so it is left untouched to avoid
> > > + * calling xa_destroy() from IRQ context.
> > > + *
> > > + * Warns if @job is still linked on the queue's pending list, which would
> > > + * indicate a bug in the teardown ordering.
> > > + *
> > > + * Context: Any context.
> > > + */
> > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > +{
> > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > +
> > > + WARN_ON(!list_empty(&job->pending_link));
> > > +
> > > + drm_dep_fence_cleanup(job->dfence);
> > > + job->dfence = NULL;
> > > +
> > > + /*
> > > + * Armed jobs have their dependencies drained by
> > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> >
> > Just want to clear the confusion and make sure I get this right at the
> > same time. To me, "process context" means a user thread entering some
> > syscall(). What you call "process context" is more a "thread context" to
> > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > worker thread to be accurate) that executes the drop_deps() after a
> > run_job().
>
> Some of the context comments likely could be cleaned up. 'Process context'
> here means either user context (bypass path) or the run_job work item.
>
> >
> > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > + */
> > > + if (!armed)
> > > + drm_dep_job_drop_dependencies(job);
> >
> > Why do we need to make a difference here? Can't we just assume that the
> > whole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > work item embedded in the job to defer its destruction when _put() is
> > called in a context where the destruction is not allowed?
> >
>
> We already touched on this, but the design currently allows the last job
> put from dma-fence signaling path (IRQ).
It's not much about the last _put and more about what happens in the
_release() you pass to kref_put(). My point being, if you assume
something in _release() is not safe to be done in an atomic context,
and _put() is assumed to be called from any context, you might as well
just defer the cleanup (AKA the stuff you currently have in _release())
so everything is always cleaned up in a thread context. Yes, there's
scheduling overhead and extra latency, but it's also simpler, because
there's just one path. So, if the latency and the overhead is not
proven to be a problem (and it rarely is for cleanup operations), I'm
still convinced this makes for an easier design to just defer the
cleanup all the time.
> If we dropped that, then yes,
> this could change. The reason for the if statement is that currently a
> user building a job may need to abort prior to calling arm() (e.g., a
> memory allocation fails) via drm_dep_job_put().
But even in that context, it could still be deferred and work just
fine, no?
>
> Once arm() is called there is a guarantee the run_job path is called,
> either via bypass or the run_job work item.
Sure.
>
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_job_get() - acquire a reference to a dep job
> > > + * @job: dep job to acquire a reference on, or NULL
> > > + *
> > > + * Context: Any context.
> > > + * Return: @job with an additional reference held, or NULL if @job is NULL.
> > > + */
> > > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> > > +{
> > > + if (job)
> > > + kref_get(&job->refcount);
> > > + return job;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_get);
> > > +
> > > +/**
> > > + * drm_dep_job_release() - kref release callback for a dep job
> > > + * @kref: kref embedded in the dep job
> > > + *
> > > + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> > > + * otherwise frees @job with kfree(). Finally, releases the queue reference
> > > + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> > > + * queue put is performed last to ensure no queue state is accessed after
> > > + * the job memory is freed.
> > > + *
> > > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > > + * job's queue; otherwise process context only, as the release callback may
> > > + * sleep.
> > > + */
> > > +static void drm_dep_job_release(struct kref *kref)
> > > +{
> > > + struct drm_dep_job *job =
> > > + container_of(kref, struct drm_dep_job, refcount);
> > > + struct drm_dep_queue *q = job->q;
> > > +
> > > + drm_dep_job_fini(job);
> > > +
> > > + if (job->ops && job->ops->release)
> > > + job->ops->release(job);
> > > + else
> > > + kfree(job);
> > > +
> > > + drm_dep_queue_put(q);
> > > +}
> > > +
> > > +/**
> > > + * drm_dep_job_put() - release a reference to a dep job
> > > + * @job: dep job to release a reference on, or NULL
> > > + *
> > > + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> > > + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> > > + *
> > > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > > + * job's queue; otherwise process context only, as the release callback may
> > > + * sleep.
> > > + */
> > > +void drm_dep_job_put(struct drm_dep_job *job)
> > > +{
> > > + if (job)
> > > + kref_put(&job->refcount, drm_dep_job_release);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_put);
> > > +
> > > +/**
> > > + * drm_dep_job_arm() - arm a dep job for submission
> > > + * @job: dep job to arm
> > > + *
> > > + * Initialises the finished fence on @job->dfence, assigning
> > > + * it a sequence number from the job's queue. Must be called after
> > > + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > + * userspace or used as a dependency by other jobs.
> > > + *
> > > + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> > > + * After this point, memory allocations that could trigger reclaim are
> > > + * forbidden; lockdep enforces this. arm() must always be paired with
> > > + * drm_dep_job_push(); lockdep also enforces this pairing.
> > > + *
> > > + * Warns if the job has already been armed.
> > > + *
> > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > + * path.
> > > + */
> > > +void drm_dep_job_arm(struct drm_dep_job *job)
> > > +{
> > > + drm_dep_queue_push_job_begin(job->q);
> > > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > > + drm_dep_fence_init(job->dfence, job->q);
> > > + job->signalling_cookie = dma_fence_begin_signalling();
> >
> > I'd really like DMA-signalling-path annotation to be something that
> > doesn't leak to the job object. The way I see it, in the submit path,
> > it should be some sort of block initializing an opaque token, and
> > drm_dep_job_arm() should expect a valid token to be passed, thus
> > guaranteeing that anything between arm and push, and more generally
> > anything in that section is safe.
> >
>
> Yes. drm_dep_queue_push_job_begin internally creates a token (current)
> that is paired with drm_dep_queue_push_job_end(). If you ever have an
> imbalance
> between arm() and push() you will get complaints.
>
> > struct drm_job_submit_context submit_ctx;
> >
> > // Do all the prep stuff, pre-alloc, resv setup, ...
> >
> > // Non-faillible section of the submit starts here.
> > // This is properly annotated with
> > // dma_fence_{begin,end}_signalling() to ensure we're
> > // not taking locks or doing allocations forbidden in
> > // the signalling path
> > drm_job_submit_non_faillible_section(&submit_ctx) {
> > for_each_job() {
> > drm_dep_job_arm(&submit_ctx, &job);
> >
> > // pass the armed fence around, if needed
> >
> > drm_dep_job_push(&submit_ctx, &job);
> > }
> > }
> >
> > With the current solution, there's no control that
> > drm_dep_job_{arm,push}() calls are balanced, with the risk of leaving a
> > DMA-signalling annotation behind.
>
> See above, that is what drm_dep_queue_push_job_begin/end do.
That's still error-prone, and the kind of errors you only detect at
runtime. Let alone the fact that you might not even notice if the
imbalance is caused by error paths that are rarely tested. I'm
proposing something that's designed so you can't make those mistakes
unless you really want to:
- drm_job_submit_non_faillible_section() is a block-like macro
with a clear scope before/after which the token is invalid
- drm_job_submit_non_faillible_section() is the only place that can
produce a valid token (not enforceable in C, but with an
__drm_dep_queue_create_submit_token() and a proper disclaimer, I guess
we can discourage people from inadvertently using it)
- drm_dep_job_{arm,push}() calls require a valid token to work, and
with the two points mentioned above, that means you can't call
drm_dep_job_{arm,push}() outside a
drm_job_submit_non_faillible_section() block
It's not quite the compile-time checks rust would enforce, but it's a
model that forces people to do it the right way, with extra runtime
checks for the case where they still got it wrong (like, putting the
_arm() and _push() in two different
drm_job_submit_non_faillible_section() blocks).
>
> >
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_arm);
> > > +
> > > +/**
> > > + * drm_dep_job_push() - submit a job to its queue for execution
> > > + * @job: dep job to push
> > > + *
> > > + * Submits @job to the queue it was initialised with. Must be called after
> > > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > > + * held until the queue is fully done with it. The reference is released
> > > + * directly in the finished-fence dma_fence callback for queues with
> > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > > + * from hardirq context), or via the put_job work item on the submit
> > > + * workqueue otherwise.
> > > + *
> > > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > > + * enforces the pairing.
> > > + *
> > > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > > + * @job exactly once, even if the queue is killed or torn down before the
> > > + * job reaches the head of the queue. Drivers can use this guarantee to
> > > + * perform bookkeeping cleanup; the actual backend operation should be
> > > + * skipped when drm_dep_queue_is_killed() returns true.
> > > + *
> > > + * If the queue does not support the bypass path, the job is pushed directly
> > > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > > + *
> > > + * Warns if the job has not been armed.
> > > + *
> > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > + * path.
> > > + */
> > > +void drm_dep_job_push(struct drm_dep_job *job)
> > > +{
> > > + struct drm_dep_queue *q = job->q;
> > > +
> > > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > > +
> > > + drm_dep_job_get(job);
> > > +
> > > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > > + drm_dep_queue_push_job(q, job);
> > > + dma_fence_end_signalling(job->signalling_cookie);
> > > + drm_dep_queue_push_job_end(job->q);
> > > + return;
> > > + }
> > > +
> > > + scoped_guard(mutex, &q->sched.lock) {
> > > + if (drm_dep_queue_can_job_bypass(q, job))
> > > + drm_dep_queue_run_job(q, job);
> > > + else
> > > + drm_dep_queue_push_job(q, job);
> > > + }
> > > +
> > > + dma_fence_end_signalling(job->signalling_cookie);
> > > + drm_dep_queue_push_job_end(job->q);
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_push);
> > > +
> > > +/**
> > > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > > + * @job: dep job to add the dependencies to
> > > + * @fence: the dma_fence to add to the list of dependencies, or
> > > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > > + *
> > > + * Note that @fence is consumed in both the success and error cases (except
> > > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > > + *
> > > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > > + * fence->context matches the queue's finished fence context) are silently
> > > + * dropped; the job need not wait on its own queue's output.
> > > + *
> > > + * Warns if the job has already been armed (dependencies must be added before
> > > + * drm_dep_job_arm()).
> > > + *
> > > + * **Pre-allocation pattern**
> > > + *
> > > + * When multiple jobs across different queues must be prepared and submitted
> > > + * together in a single atomic commit — for example, where job A's finished
> > > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > + * region. Once that region has started no memory allocation is permitted.
> > > + *
> > > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > > + * always index 0 when the dependency array is empty, as Xe relies on).
> > > + * After all jobs have been armed and the finished fences are available, call
> > > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > > + * called from atomic or signalling context.
> > > + *
> > > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > > + * ensuring a slot is always allocated even when the real fence is not yet
> > > + * known.
> > > + *
> > > + * **Example: bind job feeding TLB invalidation jobs**
> > > + *
> > > + * Consider a GPU with separate queues for page-table bind operations and for
> > > + * TLB invalidation. A single atomic commit must:
> > > + *
> > > + * 1. Run a bind job that modifies page tables.
> > > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > > + * completing, so stale translations are flushed before the engines
> > > + * continue.
> > > + *
> > > + * Because all jobs must be armed and pushed inside a signalling region (where
> > > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > > + *
> > > + * // Phase 1 — process context, GFP_KERNEL allowed
> > > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > > + * for_each_mmu(mmu) {
> > > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > > + * // Pre-allocate slot at index 0; real fence not available yet
> > > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > > + * }
> > > + *
> > > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > > + * dma_fence_begin_signalling();
> > > + * drm_dep_job_arm(bind_job);
> > > + * for_each_mmu(mmu) {
> > > + * // Swap sentinel for bind job's finished fence
> > > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > > + * dma_fence_get(bind_job->finished));
> >
> > Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
> > submit path. If we were to add it, it would be around the
> > panthor_submit_ctx_push_jobs() call, which might seem broken. In
>
> Yes, I noticed that. I put an XXX comment in my port [1] around this.
>
> [1] https://patchwork.freedesktop.org/patch/711952/?series=163245&rev=1
>
> > practice I don't think it is because we don't expose fences to the
> > outside world until all jobs have been pushed. So what happens is that
> > a job depending on a previous job in the same batch-submit has the
> > armed-but-not-yet-pushed fence in its deps, and that's the only place
> > where this fence is present. If something fails on a subsequent job
> > preparation in the next batch submit, the rollback logic will just drop
> > the jobs on the floor, and release the armed-but-not-pushed-fence,
> > meaning we're not leaking a fence that will never be signalled. I'm in
> > no way saying this design is sane, just trying to explain why it's
> > currently safe and works fine.
>
> Yep, I think it would be better to have no failure points between arm and
> push, which again I do my best to enforce via lockdep/warnings.
I'm still not entirely convinced by that. To me _arm() is not quite the
moment you make your fence public, and I'm not sure the extra complexity
added for intra-batch dependencies (one job in a SUBMIT depending on a
previous job in the same SUBMIT) is justified, because what really
matters is not that we leave dangling/unsignalled dma_fence objects
around, the problem is when you do so on an object that has been
exposed publicly (syncobj, dma_resv, sync_file, ...).
>
> >
> > In general, I wonder if we should distinguish between "armed" and
> > "publicly exposed" to help deal with this intra-batch dep thing without
> > resorting to reservation and other tricks like that.
> >
>
> I'm not exactly sure what you're suggesting, but always open to ideas.
Right now _arm() is what does the dma_fence_init(). But there's an
extra step between initializing the fence object and making it
visible to the outside world. In order for the dep to be added to the
job, you need the fence to be initialized, but that's not quite
external visibility, because the job is still very much a driver
object, and if something fails, the rollback mechanism makes it so all
the deps are dropped on the floor along the job that's being destroyed.
So we won't really wait on this fence that's never going to be
signalled.
I see what's appealing in pretending that _arm() == externally-visible,
but it's also forcing us to do extra pre-alloc (or other pre-init)
operations that would otherwise not be required in the submit path. Not
a hill I'm willing to die on, but I just thought I'd mention the fact I
find it weird that we put extra constraints on ourselves that are not
strictly needed, just because we fail to properly flag the dma_fence
visibility transitions.
On the rust side it would be directly described through the type
system (see the Visibility attribute in Daniel's branch[1]). On C side,
this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
name you want to give it). Any operation pushing the fence to public
container (dma_resv, syncobj, sync_file, ...) would be rejected when
that flag is set. At _push() time, we'd clear that flag with a
dma_fence_set_active() helper, which would reflect the fact the fence
can now be observed and exposed to the outside world.
>
> > > + * drm_dep_job_arm(tlb_job[mmu]);
> > > + * }
> > > + * drm_dep_job_push(bind_job);
> > > + * for_each_mmu(mmu)
> > > + * drm_dep_job_push(tlb_job[mmu]);
> > > + * dma_fence_end_signalling();
> > > + *
> > > + * Context: Process context. May allocate memory with GFP_KERNEL.
> > > + * Return: If @fence is %DRM_DEP_JOB_FENCE_PREALLOC, the index of the
> > > + * allocated slot on success; otherwise 0 on success. Negative error
> > > + * code on failure.
> > > + */
> > > +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> > > +{
> > > + struct drm_dep_queue *q = job->q;
> > > + struct dma_fence *entry;
> > > + unsigned long index;
> > > + u32 id = 0;
> > > + int ret;
> > > +
> > > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > > + might_alloc(GFP_KERNEL);
> > > +
> > > + if (!fence)
> > > + return 0;
> > > +
> > > + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> > > + goto add_fence;
> > > +
> > > + /*
> > > + * Ignore signalled fences or fences from our own queue — finished
> > > + * fences use q->fence.context.
> > > + */
> > > + if (dma_fence_test_signaled_flag(fence) ||
> > > + fence->context == q->fence.context) {
> > > + dma_fence_put(fence);
> > > + return 0;
> > > + }
> > > +
> > > + /* Deduplicate if we already depend on a fence from the same context.
> > > + * This lets the size of the array of deps scale with the number of
> > > + * engines involved, rather than the number of BOs.
> > > + */
> > > + xa_for_each(&job->dependencies, index, entry) {
> > > + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> > > + entry->context != fence->context)
> > > + continue;
> > > +
> > > + if (dma_fence_is_later(fence, entry)) {
> > > + dma_fence_put(entry);
> > > + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> > > + } else {
> > > + dma_fence_put(fence);
> > > + }
> > > + return 0;
> > > + }
> > > +
> > > +add_fence:
> > > + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> > > + GFP_KERNEL);
> > > + if (ret != 0) {
> > > + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> > > + dma_fence_put(fence);
> > > + return ret;
> > > + }
> > > +
> > > + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> > > +}
> > > +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> > > +
> > > +/**
> > > + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> > > + * @job: dep job to update
> > > + * @index: xarray index of the slot to replace, as returned when the sentinel
> > > + * was originally inserted via drm_dep_job_add_dependency()
> > > + * @fence: the real dma_fence to store; its reference is always consumed
> > > + *
> > > + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> > > + * @job->dependencies with @fence. The slot must have been pre-allocated by
> > > + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> > > + * existing entry is asserted to be the sentinel.
> > > + *
> > > + * This is the second half of the pre-allocation pattern described in
> > > + * drm_dep_job_add_dependency(). It is intended to be called inside a
> > > + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> > > + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> > > + * internally so it is safe to call from atomic or signalling context, but
> > > + * since the slot has been pre-allocated no actual memory allocation occurs.
> > > + *
> > > + * If @fence is already signalled the slot is erased rather than storing a
> > > + * redundant dependency. The successful store is asserted — if the store
> > > + * fails it indicates a programming error (slot index out of range or
> > > + * concurrent modification).
> > > + *
> > > + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
> > > + *
> > > + * Context: Any context. DMA fence signaling path.
> > > + */
> > > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > > + struct dma_fence *fence)
> > > +{
> > > + WARN_ON(xa_load(&job->dependencies, index) !=
> > > + DRM_DEP_JOB_FENCE_PREALLOC);
> > > +
> > > + if (dma_fence_test_signaled_flag(fence)) {
> > > + xa_erase(&job->dependencies, index);
> > > + dma_fence_put(fence);
> > > + return;
> > > + }
> > > +
> > > + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> > > + GFP_NOWAIT)))) {
> > > + dma_fence_put(fence);
> > > + return;
> > > + }
> >
> > You don't seem to go for the
> > replace-if-earlier-fence-on-same-context-exists optimization that we
> > have in drm_dep_job_add_dependency(). Any reason not to?
> >
>
> No, that could be added in. My reasoning for omitting it was that if you
> are pre-allocating a slot you likely know that the same timeline hasn't
> already been added, but maybe that is a bad assumption.
Hm, in Panthor that would mean extra checks driver side, because at the
moment we don't check where deps come from. I'd be tempted to say, the
more we can automate the better, dunno.
Regards,
Boris
[1] https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/61/diffs#a5a71f917ff65cfe4c1a341fa7e55ae149d22863_300_693
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-18 22:40 ` Matthew Brost
@ 2026-03-19 9:57 ` Boris Brezillon
2026-03-22 6:43 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-19 9:57 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Wed, 18 Mar 2026 15:40:35 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> > >
> > > So I don’t think Rust natively solves these types of problems, although
> > > I’ll concede that it does make refcounting a bit more sane.
> >
> > Rust won't magically defer the cleanup, nor will it dictate how you want
> > to do the queue teardown, those are things you need to implement. But it
> > should give visibility about object lifetimes, and guarantee that an
> > object that's still visible to some owners is usable (the notion of
> > usable is highly dependent on the object implementation).
> >
> > Just a purely theoretical example of a multi-step queue teardown that
> > might be possible to encode in rust:
> >
> > - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> > There's a ::destroy() method consuming 'self' and returning a
> > MyJobQueue<Destroyed> object
> > - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> > destroyed. No new job can be pushed. Existing jobs that didn't make
> > it to the FW queue are cancelled, jobs that are in-flight are
> > cancelled if they can, or are just waited upon if they can't. When
> > the whole destruction step is done, ::destroyed() is called, it
> > consumes 'self' and returns a MyJobQueue<Inactive> object.
> > - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> > any resources on this queue). It's ready to be cleaned up.
> > ::cleanup() (or just ::drop()) defers the cleanup of some inner
> > object that has been passed around between the various
> > MyJobQueue<State> wrappers.
> >
> > Each of the state transition can happen asynchronously. A state
> > transition consumes the object in one state, and returns a new object
> > in its new state. None of the transition involves dropping a refcnt,
> > ownership is just transferred. The final MyJobQueue<Inactive> object is
> > the object we'll defer cleanup on.
> >
> > It's a very high-level view of one way this can be implemented (I'm
> > sure there are others, probably better than my suggestion) in order to
> > make sure the object doesn't go away without the compiler enforcing
> > proper state transitions.
> >
>
> I'm sure Rust can implement this. My point about Rust is that it doesn't
> magically solve hard software architecture problems, but I will admit the
> ownership model and the way it can enforce locking at compile time are
> pretty cool.
It's not quite about rust directly solving those problems for you, it's
about rust forcing you to think about those problems in the first
place. So no, rust won't magically solve your multi-step teardown with
crazy CPU <-> Device synchronization etc, but it allows you to clearly
identify those steps, and think about how you want to represent them
without abusing other concepts, like object refcounting/ownership.
Everything I described, you can code it in C BTW, it's just that C is so
lax that you can also abuse other stuff to get to your ends, which might
or might not be safe, but more importantly, will very likely obfuscate
the code (even with good docs).
>
> > > > > > +/**
> > > > > > + * DOC: DRM dependency fence
> > > > > > + *
> > > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > > + * completes the job.
> > > > > > + *
> > > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > > + * NULL or an error).
> > > > >
> > > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > > carried out by Christian?
> > > > >
> > >
> > > Consider the case where a driver’s hardware fence is implemented as a
> > > dma-fence-array or dma-fence-chain. You cannot install these types of
> > > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > > here.
> >
> > Hm, so that's a driver returning a dma_fence_array/chain through
> > ::run_job()? Why would we not want to have them directly exposed and
> > split up into singular fence objects at resv insertion time (I don't
> > think syncobjs care, but I might be wrong). I mean, one of the points
>
> You can stick dma-fence-arrays in syncobjs, but not chains.
Yeah, kinda makes sense, since timeline syncobjs use chains, and if the
chain reject inner chains, it won't work.
>
> Neither dma-fence-arrays/chain can go into dma-resv.
They can't go directly in it, but those can be split into individual
fences and be inserted, which would achieve the same goal.
>
> Hence why disconnecting a job's finished fence from the hardware fence
> is, IMO, a good idea to keep, as it gives drivers flexibility on the
> hardware fences.
The thing is, I'm not sure drivers were ever meant to expose containers
through ::run_job().
> e.g., If this design didn't have a job's finished fence, I'd have to
> open-code one on the Xe side.
There might be other reasons we'd like to keep the
drm_sched_fence-like proxy that I'm missing. But if it's the only one,
and the fence-combining pattern you're describing is common to multiple
drivers, we can provide a container implementation that's not a
fence_array, so you can use it to insert driver fences into other
containers. This way we wouldn't force the proxy model to all drivers,
but we would keep the code generic/re-usable.
>
> > behind the container extraction is so fences coming from the same
> > context/timeline can be detected and merged. If you insert the
> > container through a proxy, you're defeating the whole fence merging
> > optimization.
>
> Right. Finished fences have a single timeline too...
Aren't you faking a single timeline though if you combine fences from
different engines running at their own pace into a container?
>
> >
> > The second thing is that I'm not sure drivers were ever supposed to
> > return fence containers in the first place, because the whole idea
> > behind a fence context is that fences are emitted/signalled in
> > seqno-order, and if the fence is encoding the state of multiple
> > timelines that progress at their own pace, it becomes tricky to control
> > that. I guess if it's always the same set of timelines that are
> > combined, that would work.
>
> Xe does this and it definitely works. We submit to multiple rings; when
> all rings signal a seqno, a chain or array signals -> the finished fence
> signals. The queues used in this manner can only submit multiple-ring
> jobs, so the finished fence timeline stays intact. If you could mix a
> multiple-ring submission followed by a single-ring submission on the
> same queue, yes, this could break.
Okay, I had the same understanding, thanks for confirming.
>
> >
> > > One example is when a single job submits work to multiple rings
> > > that are flipped in hardware at the same time.
> >
> > We do have that in Panthor, but that's all explicit: in a single
> > SUBMIT, you can have multiple jobs targeting different queues, each of
> > them having their own set of deps/signal ops. The combination of all the
> > signal ops into a container is left to the UMD. It could be automated
> > kernel side, but that would be a flag on the SIGNAL op leading to the
> > creation of a fence_array containing fences from multiple submitted
> > jobs, rather than the driver combining stuff in the fence it returns in
> > ::run_job().
>
> See above. We have a dedicated queue type for these types of submissions
> and a single job that submits to all the rings. We had multiple queues /
> jobs in i915 to implement this, but it turns out it is much cleaner
> with a single queue / single job / multiple rings model.
Hm, okay. It didn't turn into a mess in Panthor, but Xe is likely an
order of magnitude more complicated than Mali, so I'll refrain from
judging this design decision.
>
> >
> > >
> > > Another case is late arming of hardware fences in run_job (which many
> > > drivers do). The proxy fence is immediately available at arm time and
> > > can be installed into dma-resv or syncobjs even though the actual
> > > hardware fence is not yet available. I think most drivers could be
> > > refactored to make the hardware fence immediately available at run_job,
> > > though.
> >
> > Yep, I also think we can arm the driver fence early in the case of
> > JobQueue. The reason it couldn't be done before is because the
> > scheduler was in the middle, deciding which entity to pull the next job
> > from, which was changing the seqno a job driver-fence would be assigned
> > (you can't guess that at queue time in that case).
> >
>
> Xe doesn't need late arming, but it looks like multiple drivers
> implement late arming, which may be required (?).
As I said, it's mostly a problem when you have a
single-HW-queue:multiple-contexts model, which is exactly what
drm_sched was designed for. I suspect early arming is not an issue for
any of the HW supporting FW-based scheduling (PVR, Mali, NVidia,
...). If you want to use drm_dep for all drivers currently using
drm_sched (I'm still not convinced this is a good idea just yet,
because then you're going to pull in a lot of the complexity we're
trying to get rid of), then you need late arming of driver fences.
>
> > [...]
> >
> > > > > > + * **Reference counting**
> > > > > > + *
> > > > > > + * Jobs and queues are both reference counted.
> > > > > > + *
> > > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > > + * job that was submitted to it.
> > > > > > + *
> > > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > > >
> > > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > > the reference? We can then release the reference once the job is cleaned up
> > > > > (by the queue, using a worker) in process context.
> > >
> > > I think that’s what I’m doing, while also allowing an opt-in path to
> > > drop the job reference when it signals (in IRQ context)
> >
> > Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> > defer the cleanup when you're in an IRQ/atomic context, but defer it
> > when you're in a thread context.
> >
>
> The put of a job in this design can be from an IRQ context (an opt-in
> feature). xa_destroy() blows up if it is called from an IRQ context,
> although maybe that could be worked around.
Making it so _put() in IRQ context is safe is fine, what I'm saying is
that instead of doing a partial immediate cleanup, and the rest in a
worker, we can just defer everything: that is, have some
_deref_release() function called by kref_put() that would queue a work
item from which the actual release is done.
>
> > > so we avoid
> > > switching to a work item just to drop a ref. That seems like a
> > > significant win in terms of CPU cycles.
> >
> > Well, the cleanup path is probably not where latency matters the most.
>
> Agree. But I do think avoiding a CPU context switch (work item) for a
> very lightweight job cleanup (usually just dropping refs) will save CPU
> cycles, and thus also things like power, etc...
That's the sort of statement I'd like to see backed by actual
numbers/scenarios proving that it actually makes a difference. The
mixed model where things are partially freed immediately and partially
deferred, sometimes even with conditionals for whether the deferral
happens or not, just makes building a mental model of this thing a
nightmare, which in turn usually leads to subtle bugs.
>
> > It's adding scheduling overhead, sure, but given all the stuff we defer
> > already, I'm not too sure we're at saving a few cycles to get the
> > cleanup done immediately. What's important to have is a way to signal
> > fences in an atomic context, because this has an impact on latency.
> >
>
> Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in.
>
> > [...]
> >
> > > > > > + /*
> > > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > > + * xa_destroy() would be unsafe.
> > > > > > + */
> > > > >
> > > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > > via ops->run_job()?
> > > > >
> > > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > > other than scheduling the queue worker?
> > > > >
> > >
> > > Yes, this code is required to support dropping job refs directly in the
> > > dma-fence callback (an opt-in feature). Again, this seems like a
> > > significant win in terms of CPU cycles, although I haven’t collected
> > > data yet.
> >
> > If it significantly hurts the perf, I'd like to understand why, because
> > to me it looks like pure-cleanup (no signaling involved), and thus no
> > other process waiting for us to do the cleanup. The only thing that
> > might have an impact is how fast you release the resources, and given
> > it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> > like to understand which part of the immediate cleanup is causing a
> > contention (basically which kind of resources the system is starving of)
> >
>
> It was more that once we moved to a ref-counted model, it is pretty
> trivial to allow drm_dep_job_put() when the fence is signaling. It doesn't
> really add any complexity either, which is why I added it.
It's not the refcount model I'm complaining about, it's the "part of it
is always freed immediately, part of it is deferred, but not always ..."
that happens in drm_dep_job_release() I'm questioning. I'd really
prefer something like:
static void drm_dep_job_release()
{
	// do it all unconditionally
}

static void drm_dep_job_defer_release()
{
	queue_work(&job->cleanup_work);
}

static void drm_dep_job_put()
{
	kref_put(job, drm_dep_job_defer_release);
}
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-19 9:57 ` Boris Brezillon
@ 2026-03-22 6:43 ` Matthew Brost
2026-03-23 7:58 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-22 6:43 UTC (permalink / raw)
To: Boris Brezillon
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Thu, Mar 19, 2026 at 10:57:29AM +0100, Boris Brezillon wrote:
> On Wed, 18 Mar 2026 15:40:35 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > >
> > > > So I don’t think Rust natively solves these types of problems, although
> > > > I’ll concede that it does make refcounting a bit more sane.
> > >
> > > Rust won't magically defer the cleanup, nor will it dictate how you want
> > > to do the queue teardown, those are things you need to implement. But it
> > > should give visibility about object lifetimes, and guarantee that an
> > > object that's still visible to some owners is usable (the notion of
> > > usable is highly dependent on the object implementation).
> > >
> > > Just a purely theoretical example of a multi-step queue teardown that
> > > might be possible to encode in rust:
> > >
> > > - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> > > There's a ::destroy() method consuming 'self' and returning a
> > > MyJobQueue<Destroyed> object
> > > - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> > > destroyed. No new job can be pushed. Existing jobs that didn't make
> > > it to the FW queue are cancelled, jobs that are in-flight are
> > > cancelled if they can, or are just waited upon if they can't. When
> > > the whole destruction step is done, ::destroyed() is called, it
> > > consumes 'self' and returns a MyJobQueue<Inactive> object.
> > > - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> > > any resources on this queue). It's ready to be cleaned up.
> > > ::cleanup() (or just ::drop()) defers the cleanup of some inner
> > > object that has been passed around between the various
> > > MyJobQueue<State> wrappers.
> > >
> > > Each of the state transition can happen asynchronously. A state
> > > transition consumes the object in one state, and returns a new object
> > > in its new state. None of the transition involves dropping a refcnt,
> > > ownership is just transferred. The final MyJobQueue<Inactive> object is
> > > the object we'll defer cleanup on.
> > >
> > > It's a very high-level view of one way this can be implemented (I'm
> > > sure there are others, probably better than my suggestion) in order to
> > > make sure the object doesn't go away without the compiler enforcing
> > > proper state transitions.
> > >
> >
> > I'm sure Rust can implement this. My point about Rust is it doesn't
> > magically solve hard software arch problems, but I will admit the
> > ownership model, and the way it can enforce locking at compile time, is
> > pretty cool.
>
> It's not quite about rust directly solving those problems for you, it's
> about rust forcing you to think about those problems in the first
> place. So no, rust won't magically solve your multi-step teardown with
> crazy CPU <-> Device synchronization etc, but it allows you to clearly
> identify those steps, and think about how you want to represent them
> without abusing other concepts, like object refcounting/ownership.
> Everything I described, you can code it in C BTW, it's just that C is so
> lax that you can also abuse other stuff to get to your ends, which might
> or might not be safe, but more importantly, will very likely obfuscate
> the code (even with good docs).
>
This is very well put, and I completely agree. Sorry—I get annoyed by
the Rust comments. It solves some classes of problems, but it doesn’t
magically solve complex software architecture issues that need to be
thoughtfully designed.
> >
> > > > > > > +/**
> > > > > > > + * DOC: DRM dependency fence
> > > > > > > + *
> > > > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > > > + * completes the job.
> > > > > > > + *
> > > > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > > > + * NULL or an error).
> > > > > >
> > > > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > > > carried out by Christian?
> > > > > >
> > > >
> > > > Consider the case where a driver’s hardware fence is implemented as a
> > > > dma-fence-array or dma-fence-chain. You cannot install these types of
> > > > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > > > here.
> > >
> > > Hm, so that's a driver returning a dma_fence_array/chain through
> > > ::run_job()? Why would we not want to have them directly exposed and
> > > split up into singular fence objects at resv insertion time (I don't
> > > think syncobjs care, but I might be wrong). I mean, one of the point
> >
> > You can stick dma-fence-arrays in syncobjs, but not chains.
>
> Yeah, kinda makes sense, since timeline syncobjs use chains, and if the
> chain reject inner chains, it won't work.
>
+1, Exactly.
> >
> > Neither dma-fence-arrays/chain can go into dma-resv.
>
> They can't go directly in it, but those can be split into individual
> fences and be inserted, which would achieve the same goal.
>
Yes, but now it becomes a driver problem (maybe only mine) rather than
having an opaque job fence that can simply be inserted. In my opinion, it’s
best to keep the job vs. hardware fence abstraction.
> >
> > Hence why disconnecting a job's finished fence from the hardware fence
> > is, IMO, a good idea to keep, as it gives drivers flexibility on the
> > hardware fences.
>
> The thing is, I'm not sure drivers were ever meant to expose containers
> through ::run_job().
>
Well, there haven’t been any rules...
> > e.g., If this design didn't have a job's finished fence, I'd have to
> > open code one Xe side.
>
> There might be other reasons we'd like to keep the
> drm_sched_fence-like proxy that I'm missing. But if it's the only one,
> and the fence-combining pattern you're describing is common to multiple
> drivers, we can provide a container implementation that's not a
> fence_array, so you can use it to insert driver fences into other
> containers. This way we wouldn't force the proxy model to all drivers,
> but we would keep the code generic/re-usable.
>
> >
> > > behind the container extraction is so fences coming from the same
> > > context/timeline can be detected and merged. If you insert the
> > > container through a proxy, you're defeating the whole fence merging
> > > optimization.
> >
> > Right. Finished fences have a single timeline too...
>
> Aren't you faking a single timeline though if you combine fences from
> different engines running at their own pace into a container?
>
> >
> > >
> > > The second thing is that I'm not sure drivers were ever supposed to
> > > return fence containers in the first place, because the whole idea
> > > behind a fence context is that fences are emitted/signalled in
> > > seqno-order, and if the fence is encoding the state of multiple
> > > timelines that progress at their own pace, it becomes tricky to control
> > > that. I guess if it's always the same set of timelines that are
> > > combined, that would work.
> >
> > Xe does this and it definitely works. We submit to multiple rings; when
> > all rings signal a seqno, a chain or array signals -> the finished fence
> > signals. The queues used in this manner can only submit multiple-ring
> > jobs, so the finished fence timeline stays intact. If you could mix a
> > multiple-ring submission with a single-ring submission on the same queue,
> > yes, this could break.
>
> Okay, I had the same understanding, thanks for confirming.
>
I think the last three comments are resolved here—it’s a queue timeline.
As long as the queue has consistent rules (i.e., it submits to a consistent
set of rings), does this whole approach make sense?
> >
> > >
> > > > One example is when a single job submits work to multiple rings
> > > > that are flipped in hardware at the same time.
> > >
> > > We do have that in Panthor, but that's all explicit: in a single
> > > SUBMIT, you can have multiple jobs targeting different queues, each of
> > > them having their own set of deps/signal ops. The combination of all the
> > > signal ops into a container is left to the UMD. It could be automated
> > > kernel side, but that would be a flag on the SIGNAL op leading to the
> > > creation of a fence_array containing fences from multiple submitted
> > > jobs, rather than the driver combining stuff in the fence it returns in
> > > ::run_job().
> >
> > See above. We have a dedicated queue type for these types of submissions
> > and a single job that submits to all the rings. We had multiple queues /
> > jobs in i915 to implement this, but it turns out it is much cleaner
> > with a single queue / single job / multiple rings model.
>
> Hm, okay. It didn't turn into a mess in Panthor, but Xe is likely an
> order of magnitude more complicated than Mali, so I'll refrain from
> judging this design decision.
>
Yes, Xe is a beast, but we tend to build complexity into components and
layers to manage it. That is what I’m attempting to do here.
> >
> > >
> > > >
> > > > Another case is late arming of hardware fences in run_job (which many
> > > > drivers do). The proxy fence is immediately available at arm time and
> > > > can be installed into dma-resv or syncobjs even though the actual
> > > > hardware fence is not yet available. I think most drivers could be
> > > > refactored to make the hardware fence immediately available at run_job,
> > > > though.
> > >
> > > Yep, I also think we can arm the driver fence early in the case of
> > > JobQueue. The reason it couldn't be done before is because the
> > > scheduler was in the middle, deciding which entity to pull the next job
> > > from, which was changing the seqno a job driver-fence would be assigned
> > > (you can't guess that at queue time in that case).
> > >
> >
> > Xe doesn't need late arming, but it looks like multiple drivers
> > implement late arming, which may be required (?).
>
> As I said, it's mostly a problem when you have a
> single-HW-queue:multiple-contexts model, which is exactly what
> drm_sched was designed for. I suspect early arming is not an issue for
> any of the HW supporting FW-based scheduling (PVR, Mali, NVidia,
> ...). If you want to use drm_dep for all drivers currently using
> drm_sched (I'm still not convinced this is a good idea to do that
> just yet, because then you're going to pull a lot of the complexity
> we're trying to get rid of), then you need late arming of driver fences.
>
Yes, even the hardware scheduling component [1] I hacked together relied
on no late arming. But even then, you can arm a dma-fence early and
assign a hardware seqno later in run_job()—those are two different
things.
[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966#line_7c49af3ee_A319
> >
> > > [...]
> > >
> > > > > > > + * **Reference counting**
> > > > > > > + *
> > > > > > > + * Jobs and queues are both reference counted.
> > > > > > > + *
> > > > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > > > + * job that was submitted to it.
> > > > > > > + *
> > > > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > > > >
> > > > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > > > the reference? We can then release the reference once the job is cleaned up
> > > > > > (by the queue, using a worker) in process context.
> > > >
> > > > I think that’s what I’m doing, while also allowing an opt-in path to
> > > > drop the job reference when it signals (in IRQ context)
> > >
> > > Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> > > defer the cleanup when you're in an IRQ/atomic context, but defer it
> > > when you're in a thread context.
> > >
> >
> > The put of a job in this design can be from an IRQ context (opt-in)
> > feature. xa_destroy blows up if it is called from an IRQ context,
> > although maybe that could be workaround.
>
> Making it so _put() in IRQ context is safe is fine, what I'm saying is
> that instead of doing a partial immediate cleanup, and the rest in a
> worker, we can just defer everything: that is, have some
> _deref_release() function called by kref_put() that would queue a work
> item from which the actual release is done.
>
See below.
> >
> > > > so we avoid
> > > > switching to a work item just to drop a ref. That seems like a
> > > > significant win in terms of CPU cycles.
> > >
> > > Well, the cleanup path is probably not where latency matters the most.
> >
> > Agree. But I do think avoiding a CPU context switch (work item) for a
> very lightweight job cleanup (usually just drop refs) will save CPU
> > cycles, thus also things like power, etc...
>
> That's the sort of statements I'd like to be backed by actual
> numbers/scenarios proving that it actually makes a difference. The
I disagree. This is not a locking micro-optimization, for example. It is
a software architecture choice that says “do not trigger a CPU context
switch to free a job,” where the switch costs thousands of cycles. This
will have an effect on CPU utilization and, thus, power.
> mixed model where things are partially freed immediately/partially
> deferred, and sometimes even with conditionals for whether the deferral
> happens or not, just makes building a mental model of this thing a
> nightmare, which in turn usually leads to subtle bugs.
>
See above—managing complexity in components. This works in both modes. I
refactored Xe so it also works in IRQ context. If it would make you feel
better, I can ask my company to commit CI resources so non-IRQ mode
consistently works too—it’s just a single API flag on the queue. But
then maybe other companies should also commit to public CI.
> >
> > > It's adding scheduling overhead, sure, but given all the stuff we defer
> > > already, I'm not too sure we're at saving a few cycles to get the
> > > cleanup done immediately. What's important to have is a way to signal
> > > fences in an atomic context, because this has an impact on latency.
> > >
> >
> > Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in.
> >
> > > [...]
> > >
> > > > > > > + /*
> > > > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > > > + * xa_destroy() would be unsafe.
> > > > > > > + */
> > > > > >
> > > > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > > > via ops->run_job()?
> > > > > >
> > > > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > > > other than scheduling the queue worker?
> > > > > >
> > > >
> > > > Yes, this code is required to support dropping job refs directly in the
> > > > dma-fence callback (an opt-in feature). Again, this seems like a
> > > > significant win in terms of CPU cycles, although I haven’t collected
> > > > data yet.
> > >
> > > If it significantly hurts the perf, I'd like to understand why, because
> > > to me it looks like pure-cleanup (no signaling involved), and thus no
> > > other process waiting for us to do the cleanup. The only thing that
> > > might have an impact is how fast you release the resources, and given
> > > it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> > > like to understand which part of the immediate cleanup is causing a
> > > contention (basically which kind of resources the system is starving of)
> > >
> >
> > It was more that once we moved to a ref-counted model, it is pretty
> > trivial to allow drm_dep_job_put() when the fence is signaling. It doesn't
> > really add any complexity either, which is why I added it.
>
> It's not the refcount model I'm complaining about, it's the "part of it
> is always freed immediately, part of it is deferred, but not always ..."
> that happens in drm_dep_job_release() I'm questioning. I'd really
> prefer something like:
>
You are completely missing the point here.
Here is what I’ve reduced my job put to:
	xe_sched_job_free_fences(job);
	dma_fence_put(job->fence);
	job_free(job);
	atomic_dec(&q->job_cnt);
	xe_pm_runtime_put(xe);
These are lightweight (IRQ-safe) operations that never need to be done
in a work item—so why kick one?
Matt
> static void drm_dep_job_release()
> {
> 	// do it all unconditionally
> }
>
> static void drm_dep_job_defer_release()
> {
> 	queue_work(&job->cleanup_work);
> }
>
> static void drm_dep_job_put()
> {
> 	kref_put(job, drm_dep_job_defer_release);
> }
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-19 9:11 ` Boris Brezillon
@ 2026-03-23 4:50 ` Matthew Brost
2026-03-23 9:55 ` Boris Brezillon
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-23 4:50 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Thu, Mar 19, 2026 at 10:11:53AM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Wed, 18 Mar 2026 16:28:13 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > - fence must be signaled for dma_fence::ops to be set back to NULL
> > > - no .cleanup and no .wait implementation
> > >
> > > There might be an interest in having HW submission fences reflecting
> > > when the job is passed to the FW/HW queue, but that can done as a
> > > separate fence implementation using a different fence timeline/context.
> > >
> >
> > Yes, I removed the scheduled side of the drm sched fence as I figured that
> > could be implemented driver-side (or as an optional API in drm dep). Only
> > AMDGPU / PVR use these for ganged submissions, which I need to wrap
> > my head around. My initial thought is both implementations likely
> > could be simplified.
>
> IIRC, PVR was also relying on it to allow native FW waits: when we have
> a job that has deps that are backed by fences emitted by the same
> driver, they are detected and lowered to waits on the "scheduled"
> fence, the wait on the finished fence is done FW side.
Ah, ok. We can build in a scheduling concept if needed, but I’d likely
insist it be an opt-in feature. These are the types of driver-side
requirements I’d need help with.
>
> >
> > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > new file mode 100644
> > > > index 000000000000..2d012b29a5fc
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > @@ -0,0 +1,675 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > + *
> > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > + * copy of this software and associated documentation files (the "Software"),
> > > > + * to deal in the Software without restriction, including without limitation
> > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > + * Software is furnished to do so, subject to the following conditions:
> > > > + *
> > > > + * The above copyright notice and this permission notice shall be included in
> > > > + * all copies or substantial portions of the Software.
> > > > + *
> > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > + *
> > > > + * Copyright © 2026 Intel Corporation
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: DRM dependency job
> > > > + *
> > > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > + *
> > > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > > + * kref reference and a reference to its queue.
> > > > + *
> > > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > > + * that must be signalled before the job can run. Duplicate fences from the
> > > > + * same fence context are deduplicated automatically.
> > > > + *
> > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > > + * consuming a sequence number from the queue. After arming,
> > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > + * userspace or used as a dependency by other jobs.
> > > > + *
> > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > > + * queue takes a reference that it holds until the job's finished fence
> > > > + * signals and the job is freed by the put_job worker.
> > > > + *
> > > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > > + *
> > > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > > + * objects before the driver's release callback is invoked.
> > > > + */
> > > > +
> > > > +#include <linux/dma-resv.h>
> > > > +#include <linux/kref.h>
> > > > +#include <linux/slab.h>
> > > > +#include <drm/drm_dep.h>
> > > > +#include <drm/drm_file.h>
> > > > +#include <drm/drm_gem.h>
> > > > +#include <drm/drm_syncobj.h>
> > > > +#include "drm_dep_fence.h"
> > > > +#include "drm_dep_job.h"
> > > > +#include "drm_dep_queue.h"
> > > > +
> > > > +/**
> > > > + * drm_dep_job_init() - initialise a dep job
> > > > + * @job: dep job to initialise
> > > > + * @args: initialisation arguments
> > > > + *
> > > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > > + * job reference is dropped.
> > > > + *
> > > > + * Resources are released automatically when the last reference is dropped
> > > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > > + * must not free the job directly.
> > > > + *
> > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > + * -%ENOMEM on fence allocation failure.
> > > > + */
> > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > + const struct drm_dep_job_init_args *args)
> > > > +{
> > > > + if (unlikely(!args->credits)) {
> > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + memset(job, 0, sizeof(*job));
> > > > +
> > > > + job->dfence = drm_dep_fence_alloc();
> > > > + if (!job->dfence)
> > > > + return -ENOMEM;
> > > > +
> > > > + job->ops = args->ops;
> > > > + job->q = drm_dep_queue_get(args->q);
> > > > + job->credits = args->credits;
> > > > +
> > > > + kref_init(&job->refcount);
> > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > +
> > > > +/**
> > > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > > + * @job: dep job whose dependency xarray to drain
> > > > + *
> > > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > > + * skipped; the sentinel carries no reference. Called from
> > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > > + * dependencies here — while still in process context — avoids calling
> > > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > > + * dropped from a dma_fence callback.
> > > > + *
> > > > + * Context: Process context.
> > > > + */
> > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > +{
> > > > + struct dma_fence *fence;
> > > > + unsigned long index;
> > > > +
> > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > + continue;
> > > > + dma_fence_put(fence);
> > > > + }
> > > > + xa_destroy(&job->dependencies);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_dep_job_fini() - clean up a dep job
> > > > + * @job: dep job to clean up
> > > > + *
> > > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > > + *
> > > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > > + * context immediately after run_job(), so it is left untouched to avoid
> > > > + * calling xa_destroy() from IRQ context.
> > > > + *
> > > > + * Warns if @job is still linked on the queue's pending list, which would
> > > > + * indicate a bug in the teardown ordering.
> > > > + *
> > > > + * Context: Any context.
> > > > + */
> > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > +{
> > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > +
> > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > +
> > > > + drm_dep_fence_cleanup(job->dfence);
> > > > + job->dfence = NULL;
> > > > +
> > > > + /*
> > > > + * Armed jobs have their dependencies drained by
> > > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > >
> > > Just want to clear the confusion and make sure I get this right at the
> > > same time. To me, "process context" means a user thread entering some
> > > syscall(). What you call "process context" is more a "thread context" to
> > > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > > worker thread to be accurate) that executes the drop_deps() after a
> > > run_job().
> >
> > Some of the context comments likely could be cleaned up. 'process context'
> > here is either user context (bypass path) or the run-job work item.
> >
> > >
> > > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > + */
> > > > + if (!armed)
> > > > + drm_dep_job_drop_dependencies(job);
> > >
> > > Why do we need to make a difference here. Can't we just assume that the
> > > hole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > work item embedded in the job to defer its destruction when _put() is
> > > called in a context where the destruction is not allowed?
> > >
> >
> > We already touched on this, but the design currently allows the last job
> > put from dma-fence signaling path (IRQ).
>
> It's not much about the last _put and more about what happens in the
> _release() you pass to kref_put(). My point being, if you assume
> something in _release() is not safe to be done in an atomic context,
> and _put() is assumed to be called from any context, you might as well
No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
put (including release) is IRQ-safe. If the documentation isn’t clear, I
can clean that up. Some of my comments here [1] try to explain this
further.
Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
dma-fence whose release must be IRQ-safe, so there is precedent for
this. I didn’t want to unilaterally require that all job releases be
IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
the flag.
The difference between non-IRQ-safe and IRQ-safe job release is only
about 12 lines of code. I figured that if we’re going to invest the time
and effort to replace DRM sched, we should aim for the best possible
implementation. Any driver can opt in here and immediately get lower CPU
utilization and power savings. I will try to figure out how to measure
this and get some numbers here.
[1] https://patchwork.freedesktop.org/patch/711933/?series=163245&rev=1#comment_1312648
> just defer the cleanup (AKA the stuff you currently have in _release())
> so everything is always cleaned up in a thread context. Yes, there's
> scheduling overhead and extra latency, but it's also simpler, because
> there's just one path. So, if the latency and the overhead is not
The deferred-cleanup path isn't bolted on; it's built in throughout. I
can assure you that either mode works. I'll likely add a debug Kconfig
option to Xe that toggles DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE on each
queue creation for CI runs, to ensure both paths work reliably and
receive continuous testing.
> proven to be a problem (and it rarely is for cleanup operations), I'm
> still convinced this makes for an easier design to just defer the
> cleanup all the time.
>
> > If we dropped that, then yes
> > this could change. The reason the if statement currently exists is that
> > the user is building a job and needs to abort prior to calling arm()
> > (e.g., a memory allocation fails) via drm_dep_job_put().
>
> But even in that context, it could still be deferred and work just
> fine, no?
>
A work item context switch is thousands, if not tens of thousands, of
cycles. If your job release is only ~20 instructions, this is a massive
imbalance and an overall huge waste. Jobs are lightweight objects—they
should really be thought of as an extension of fences. Fence release
must be IRQ-safe per the documentation, so it follows that jobs can opt
in to the same release rules.
In contrast, queues are heavyweight objects, typically with associated
memory that also needs to be released. Here, a work item absolutely
makes sense—hence the design in DRM dep.
> >
> > Once arm() is called there is a guarantee the run_job path is called
> > either via bypass or the run_job work item.
>
> Sure.
>
Let’s not gloss over this—this is actually a huge difference from DRM
sched. One of the biggest problems I found with DRM sched is that if you
call arm(), run_job() may or may not be called. Without this guarantee,
you can’t do driver-side bookkeeping in arm() that is later released in
run_job(), which would otherwise simplify the driver design.
In Xe, we artificially enforce this rule through our own usage of DRM
sched, but in DRM dep this is now an API-level contract. That allows
drivers to embrace this semantic directly, simplifying their designs.
> >
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_dep_job_get() - acquire a reference to a dep job
> > > > + * @job: dep job to acquire a reference on, or NULL
> > > > + *
> > > > + * Context: Any context.
> > > > + * Return: @job with an additional reference held, or NULL if @job is NULL.
> > > > + */
> > > > +struct drm_dep_job *drm_dep_job_get(struct drm_dep_job *job)
> > > > +{
> > > > + if (job)
> > > > + kref_get(&job->refcount);
> > > > + return job;
> > > > +}
> > > > +EXPORT_SYMBOL(drm_dep_job_get);
> > > > +
> > > > +/**
> > > > + * drm_dep_job_release() - kref release callback for a dep job
> > > > + * @kref: kref embedded in the dep job
> > > > + *
> > > > + * Calls drm_dep_job_fini(), then invokes &drm_dep_job_ops.release if set,
> > > > + * otherwise frees @job with kfree(). Finally, releases the queue reference
> > > > + * that was acquired by drm_dep_job_init() via drm_dep_queue_put(). The
> > > > + * queue put is performed last to ensure no queue state is accessed after
> > > > + * the job memory is freed.
> > > > + *
> > > > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > > > + * job's queue; otherwise process context only, as the release callback may
> > > > + * sleep.
> > > > + */
> > > > +static void drm_dep_job_release(struct kref *kref)
> > > > +{
> > > > + struct drm_dep_job *job =
> > > > + container_of(kref, struct drm_dep_job, refcount);
> > > > + struct drm_dep_queue *q = job->q;
> > > > +
> > > > + drm_dep_job_fini(job);
> > > > +
> > > > + if (job->ops && job->ops->release)
> > > > + job->ops->release(job);
> > > > + else
> > > > + kfree(job);
> > > > +
> > > > + drm_dep_queue_put(q);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_dep_job_put() - release a reference to a dep job
> > > > + * @job: dep job to release a reference on, or NULL
> > > > + *
> > > > + * When the last reference is dropped, calls &drm_dep_job_ops.release if set,
> > > > + * otherwise frees @job with kfree(). Does nothing if @job is NULL.
> > > > + *
> > > > + * Context: Any context if %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE is set on the
> > > > + * job's queue; otherwise process context only, as the release callback may
> > > > + * sleep.
> > > > + */
> > > > +void drm_dep_job_put(struct drm_dep_job *job)
> > > > +{
> > > > + if (job)
> > > > + kref_put(&job->refcount, drm_dep_job_release);
> > > > +}
> > > > +EXPORT_SYMBOL(drm_dep_job_put);
> > > > +
> > > > +/**
> > > > + * drm_dep_job_arm() - arm a dep job for submission
> > > > + * @job: dep job to arm
> > > > + *
> > > > + * Initialises the finished fence on @job->dfence, assigning
> > > > + * it a sequence number from the job's queue. Must be called after
> > > > + * drm_dep_job_init() and before drm_dep_job_push(). Once armed,
> > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > + * userspace or used as a dependency by other jobs.
> > > > + *
> > > > + * Begins the DMA fence signalling path via dma_fence_begin_signalling().
> > > > + * After this point, memory allocations that could trigger reclaim are
> > > > + * forbidden; lockdep enforces this. arm() must always be paired with
> > > > + * drm_dep_job_push(); lockdep also enforces this pairing.
> > > > + *
> > > > + * Warns if the job has already been armed.
> > > > + *
> > > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > > + * path.
> > > > + */
> > > > +void drm_dep_job_arm(struct drm_dep_job *job)
> > > > +{
> > > > + drm_dep_queue_push_job_begin(job->q);
> > > > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > > > + drm_dep_fence_init(job->dfence, job->q);
> > > > + job->signalling_cookie = dma_fence_begin_signalling();
> > >
> > > I'd really like DMA-signalling-path annotation to be something that
> > > doesn't leak to the job object. The way I see it, in the submit path,
> > > it should be some sort of block initializing an opaque token, and
> > > drm_dep_job_arm() should expect a valid token to be passed, thus
> > > guaranteeing that anything between arm and push, and more generally
> > > anything in that section is safe.
> > >
> >
> > Yes. drm_dep_queue_push_job_begin internally creates a token (current)
> > that is paired with drm_dep_queue_push_job_end. If you ever have an
> > imbalance between arm() and push() you will get complaints.
> >
> > > struct drm_job_submit_context submit_ctx;
> > >
> > > // Do all the prep stuff, pre-alloc, resv setup, ...
> > >
> > > // Non-faillible section of the submit starts here.
> > > // This is properly annotated with
> > > // dma_fence_{begin,end}_signalling() to ensure we're
> > > // not taking locks or doing allocations forbidden in
> > > // the signalling path
> > > drm_job_submit_non_faillible_section(&submit_ctx) {
> > > for_each_job() {
> > > drm_dep_job_arm(&submit_ctx, &job);
> > >
> > > // pass the armed fence around, if needed
> > >
> > > drm_dep_job_push(&submit_ctx, &job);
> > > }
> > > }
> > >
> > > With the current solution, there's no control that
> > > drm_dep_job_{arm,push}() calls are balanced, with the risk of leaving a
> > > DMA-signalling annotation behind.
> >
> > See above, that is what drm_dep_queue_push_job_begin/end do.
>
> That's still error-prone, and the kind of errors you only detect at
> runtime. Let alone the fact you might not even notice if the unbalanced
I agree that this is only detectable at runtime, but it would complain
immediately.
> symptoms are caused by error paths that are rarely tested. I'm
Yes, this is an unfortunate truth.
> proposing something that's designed so you can't make those mistakes
> unless you really want to:
>
> - drm_job_submit_non_faillible_section() is a block-like macro
> with a clear scope before/after which the token is invalid
> - drm_job_submit_non_faillible_section() is the only place that can
> produce a valid token (not enforceable in C, but with an
> __drm_dep_queue_create_submit_token() and proper disclaimer, I guess
> we can discourage people to inadvertently use it)
> - drm_dep_job_{arm,push}() calls requires a valid token to work, and
> with the two points mentioned above, that means you can't call
> drm_dep_job_{arm,push}() outside a
> drm_job_submit_non_faillible_section() block
Ok, let me think about whether I can harden this semantic. I believe
what I have in place already enforces it quite well, but I’m a big
believer in using asserts to enforce behavior, and if we can do better,
let’s do it.
>
> It's not quite the compile-time checks rust would enforce, but it's a
> model that forces people to do it the right way, with extra runtime
> checks for the case where they still got it wrong (like, putting the
> _arm() and _push() in two different
> drm_job_submit_non_faillible_section() blocks).
>
> >
> > >
> > > > +}
> > > > +EXPORT_SYMBOL(drm_dep_job_arm);
> > > > +
> > > > +/**
> > > > + * drm_dep_job_push() - submit a job to its queue for execution
> > > > + * @job: dep job to push
> > > > + *
> > > > + * Submits @job to the queue it was initialised with. Must be called after
> > > > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > > > + * held until the queue is fully done with it. The reference is released
> > > > + * directly in the finished-fence dma_fence callback for queues with
> > > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > > > + * from hardirq context), or via the put_job work item on the submit
> > > > + * workqueue otherwise.
> > > > + *
> > > > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > > > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > > > + * enforces the pairing.
> > > > + *
> > > > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > > > + * @job exactly once, even if the queue is killed or torn down before the
> > > > + * job reaches the head of the queue. Drivers can use this guarantee to
> > > > + * perform bookkeeping cleanup; the actual backend operation should be
> > > > + * skipped when drm_dep_queue_is_killed() returns true.
> > > > + *
> > > > + * If the queue does not support the bypass path, the job is pushed directly
> > > > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > > > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > > > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > > > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > > > + *
> > > > + * Warns if the job has not been armed.
> > > > + *
> > > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > > + * path.
> > > > + */
> > > > +void drm_dep_job_push(struct drm_dep_job *job)
> > > > +{
> > > > + struct drm_dep_queue *q = job->q;
> > > > +
> > > > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > > > +
> > > > + drm_dep_job_get(job);
> > > > +
> > > > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > > > + drm_dep_queue_push_job(q, job);
> > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > + drm_dep_queue_push_job_end(job->q);
> > > > + return;
> > > > + }
> > > > +
> > > > + scoped_guard(mutex, &q->sched.lock) {
> > > > + if (drm_dep_queue_can_job_bypass(q, job))
> > > > + drm_dep_queue_run_job(q, job);
> > > > + else
> > > > + drm_dep_queue_push_job(q, job);
> > > > + }
> > > > +
> > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > + drm_dep_queue_push_job_end(job->q);
> > > > +}
> > > > +EXPORT_SYMBOL(drm_dep_job_push);
> > > > +
> > > > +/**
> > > > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > > > + * @job: dep job to add the dependencies to
> > > > + * @fence: the dma_fence to add to the list of dependencies, or
> > > > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > > > + *
> > > > + * Note that @fence is consumed in both the success and error cases (except
> > > > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > > > + *
> > > > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > > > + * fence->context matches the queue's finished fence context) are silently
> > > > + * dropped; the job need not wait on its own queue's output.
> > > > + *
> > > > + * Warns if the job has already been armed (dependencies must be added before
> > > > + * drm_dep_job_arm()).
> > > > + *
> > > > + * **Pre-allocation pattern**
> > > > + *
> > > > + * When multiple jobs across different queues must be prepared and submitted
> > > > + * together in a single atomic commit — for example, where job A's finished
> > > > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > > > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > > + * region. Once that region has started no memory allocation is permitted.
> > > > + *
> > > > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > > > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > > > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > > > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > > > + * always index 0 when the dependency array is empty, as Xe relies on).
> > > > + * After all jobs have been armed and the finished fences are available, call
> > > > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > > > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > > > + * called from atomic or signalling context.
> > > > + *
> > > > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > > > + * ensuring a slot is always allocated even when the real fence is not yet
> > > > + * known.
> > > > + *
> > > > + * **Example: bind job feeding TLB invalidation jobs**
> > > > + *
> > > > + * Consider a GPU with separate queues for page-table bind operations and for
> > > > + * TLB invalidation. A single atomic commit must:
> > > > + *
> > > > + * 1. Run a bind job that modifies page tables.
> > > > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > > > + * completing, so stale translations are flushed before the engines
> > > > + * continue.
> > > > + *
> > > > + * Because all jobs must be armed and pushed inside a signalling region (where
> > > > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > > > + *
> > > > + * // Phase 1 — process context, GFP_KERNEL allowed
> > > > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > > > + * for_each_mmu(mmu) {
> > > > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > > > + * // Pre-allocate slot at index 0; real fence not available yet
> > > > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > > > + * }
> > > > + *
> > > > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > > > + * dma_fence_begin_signalling();
> > > > + * drm_dep_job_arm(bind_job);
> > > > + * for_each_mmu(mmu) {
> > > > + * // Swap sentinel for bind job's finished fence
> > > > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > > > + * dma_fence_get(bind_job->finished));
> > >
> > > Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
> > > submit path. If we were to add it, it would be around the
> > > panthor_submit_ctx_push_jobs() call, which might seem broken. In
> >
> > Yes, I noticed that. I put XXX comment in my port [1] around this.
> >
> > [1] https://patchwork.freedesktop.org/patch/711952/?series=163245&rev=1
> >
> > > practice I don't think it is because we don't expose fences to the
> > > outside world until all jobs have been pushed. So what happens is that
> > > a job depending on a previous job in the same batch-submit has the
> > > armed-but-not-yet-pushed fence in its deps, and that's the only place
> > > where this fence is present. If something fails on a subsequent job
> > > preparation in the next batch submit, the rollback logic will just drop
> > > the jobs on the floor, and release the armed-but-not-pushed-fence,
> > > meaning we're not leaking a fence that will never be signalled. I'm in
> > > no way saying this design is sane, just trying to explain why it's
> > > currently safe and works fine.
> >
> > Yep, I think it would be better to have no failure points between arm()
> > and push(), which again I do my best to enforce via lockdep/warnings.
>
> I'm still not entirely convinced by that. To me _arm() is not quite the
> moment you make your fence public, and I'm not sure the extra complexity
> added for intra-batch dependencies (one job in a SUBMIT depending on a
> previous job in the same SUBMIT) is justified, because what really
> matters is not that we leave dangling/unsignalled dma_fence objects
> around, the problem is when you do so on an object that has been
> exposed publicly (syncobj, dma_resv, sync_file, ...).
>
Let me give you an example of why a failure between arm() and push() is
a huge problem:
arm()
dma_resv_install(fence_from_arm)
fail
How does one unwind this? Signal the fence from arm()? What if the fence
from arm() is on a timeline currently being used by the device? The
memory can move, and the device then can corrupt memory.
In my opinion, it’s best and safest to enforce a no-failure policy
between arm() and push().
FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
which had the exact issue I described above. I pointed it out and got a
reply saying, “well, this is an API issue, right?”—and they were
correct, it is an API issue.
> >
> > >
> > > In general, I wonder if we should distinguish between "armed" and
> > > "publicly exposed" to help deal with this intra-batch dep thing without
> > > resorting to reservation and other tricks like that.
> > >
> >
> > I'm not exactly sure what you're suggesting, but I'm always open to ideas.
>
> Right now _arm() is what does the dma_fence_init(). But there's an
> extra step between initializing the fence object and making it
> visible to the outside world. In order for the dep to be added to the
> job, you need the fence to be initialized, but that's not quite
> external visibility, because the job is still very much a driver
> object, and if something fails, the rollback mechanism makes it so all
> the deps are dropped on the floor along the job that's being destroyed.
> So we won't really wait on this fence that's never going to be
> signalled.
>
> I see what's appealing in pretending that _arm() == externally-visible,
> but it's also forcing us to do extra pre-alloc (or other pre-init)
> operations that would otherwise not be required in the submit path. Not
> a hill I'm willing to die on, but I just thought I'd mention the fact I
> find it weird that we put extra constraints on ourselves that are not
> strictly needed, just because we fail to properly flag the dma_fence
> visibility transitions.
See the dma-resv example above. I’m not willing to die on this hill
either, but again, in my opinion, for safety and as an API-level
contract, enforcing arm() as a no-failure point makes sense. It prevents
drivers from doing anything dangerous like the dma-resv example, which
is an extremely subtle bug.
>
> On the rust side it would be directly described through the type
> system (see the Visibility attribute in Daniel's branch[1]). On C side,
> this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> name you want to give it). Any operation pushing the fence to public
> container (dma_resv, syncobj, sync_file, ...) would be rejected when
> that flag is set. At _push() time, we'd clear that flag with a
> dma_fence_set_active() helper, which would reflect the fact the fence
> can now be observed and exposed to the outside world.
>
Timeline squashing is problematic with a DMA_FENCE_FLAG_INACTIVE flag.
When a fence is added to a dma-resv, fences that belong to the same
timeline are immediately squashed. A later transition of the fence state
completely breaks this behavior. From dma_resv_add_fence():
void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
			enum dma_resv_usage usage)
{
	struct dma_resv_list *fobj;
	struct dma_fence *old;
	unsigned int i, count;

	dma_fence_get(fence);

	dma_resv_assert_held(obj);

	/* Drivers should not add containers here, instead add each fence
	 * individually.
	 */
	WARN_ON(dma_fence_is_container(fence));

	fobj = dma_resv_fences_list(obj);
	count = fobj->num_fences;

	for (i = 0; i < count; ++i) {
		enum dma_resv_usage old_usage;

		dma_resv_list_entry(fobj, i, obj, &old, &old_usage);
		if ((old->context == fence->context && old_usage >= usage &&
		     dma_fence_is_later_or_same(fence, old)) ||
		    dma_fence_is_signaled(old)) {
			dma_resv_list_set(fobj, i, fence, usage);
			dma_fence_put(old);
			return;
		}
	}
I imagine syncobjs have similar squashing, but I don't know that offhand.
> >
> > > > + * drm_dep_job_arm(tlb_job[mmu]);
> > > > + * }
> > > > + * drm_dep_job_push(bind_job);
> > > > + * for_each_mmu(mmu)
> > > > + * drm_dep_job_push(tlb_job[mmu]);
> > > > + * dma_fence_end_signalling();
> > > > + *
> > > > + * Context: Process context. May allocate memory with GFP_KERNEL.
> > > > + * Return: If fence == DRM_DEP_JOB_FENCE_PREALLOC index of allocation on
> > > > + * success, else 0 on success, or a negative error code.
> > > > + */
> > > > +int drm_dep_job_add_dependency(struct drm_dep_job *job, struct dma_fence *fence)
> > > > +{
> > > > + struct drm_dep_queue *q = job->q;
> > > > + struct dma_fence *entry;
> > > > + unsigned long index;
> > > > + u32 id = 0;
> > > > + int ret;
> > > > +
> > > > + WARN_ON(drm_dep_fence_is_armed(job->dfence));
> > > > + might_alloc(GFP_KERNEL);
> > > > +
> > > > + if (!fence)
> > > > + return 0;
> > > > +
> > > > + if (fence == DRM_DEP_JOB_FENCE_PREALLOC)
> > > > + goto add_fence;
> > > > +
> > > > + /*
> > > > + * Ignore signalled fences or fences from our own queue — finished
> > > > + * fences use q->fence.context.
> > > > + */
> > > > + if (dma_fence_test_signaled_flag(fence) ||
> > > > + fence->context == q->fence.context) {
> > > > + dma_fence_put(fence);
> > > > + return 0;
> > > > + }
> > > > +
> > > > + /* Deduplicate if we already depend on a fence from the same context.
> > > > + * This lets the size of the array of deps scale with the number of
> > > > + * engines involved, rather than the number of BOs.
> > > > + */
> > > > + xa_for_each(&job->dependencies, index, entry) {
> > > > + if (entry == DRM_DEP_JOB_FENCE_PREALLOC ||
> > > > + entry->context != fence->context)
> > > > + continue;
> > > > +
> > > > + if (dma_fence_is_later(fence, entry)) {
> > > > + dma_fence_put(entry);
> > > > + xa_store(&job->dependencies, index, fence, GFP_KERNEL);
> > > > + } else {
> > > > + dma_fence_put(fence);
> > > > + }
> > > > + return 0;
> > > > + }
> > > > +
> > > > +add_fence:
> > > > + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> > > > + GFP_KERNEL);
> > > > + if (ret != 0) {
> > > > + if (fence != DRM_DEP_JOB_FENCE_PREALLOC)
> > > > + dma_fence_put(fence);
> > > > + return ret;
> > > > + }
> > > > +
> > > > + return (fence == DRM_DEP_JOB_FENCE_PREALLOC) ? id : 0;
> > > > +}
> > > > +EXPORT_SYMBOL(drm_dep_job_add_dependency);
> > > > +
> > > > +/**
> > > > + * drm_dep_job_replace_dependency() - replace a pre-allocated dependency slot
> > > > + * @job: dep job to update
> > > > + * @index: xarray index of the slot to replace, as returned when the sentinel
> > > > + * was originally inserted via drm_dep_job_add_dependency()
> > > > + * @fence: the real dma_fence to store; its reference is always consumed
> > > > + *
> > > > + * Replaces the %DRM_DEP_JOB_FENCE_PREALLOC sentinel at @index in
> > > > + * @job->dependencies with @fence. The slot must have been pre-allocated by
> > > > + * passing %DRM_DEP_JOB_FENCE_PREALLOC to drm_dep_job_add_dependency(); the
> > > > + * existing entry is asserted to be the sentinel.
> > > > + *
> > > > + * This is the second half of the pre-allocation pattern described in
> > > > + * drm_dep_job_add_dependency(). It is intended to be called inside a
> > > > + * dma_fence_begin_signalling() / dma_fence_end_signalling() region where
> > > > + * memory allocation with GFP_KERNEL is forbidden. It uses GFP_NOWAIT
> > > > + * internally so it is safe to call from atomic or signalling context, but
> > > > + * since the slot has been pre-allocated no actual memory allocation occurs.
> > > > + *
> > > > + * If @fence is already signalled the slot is erased rather than storing a
> > > > + * redundant dependency. The successful store is asserted — if the store
> > > > + * fails it indicates a programming error (slot index out of range or
> > > > + * concurrent modification).
> > > > + *
> > > > + * Must be called before drm_dep_job_arm(). @fence is consumed in all cases.
> > > > + *
> > > > + * Context: Any context. DMA fence signaling path.
> > > > + */
> > > > +void drm_dep_job_replace_dependency(struct drm_dep_job *job, u32 index,
> > > > + struct dma_fence *fence)
> > > > +{
> > > > + WARN_ON(xa_load(&job->dependencies, index) !=
> > > > + DRM_DEP_JOB_FENCE_PREALLOC);
> > > > +
> > > > + if (dma_fence_test_signaled_flag(fence)) {
> > > > + xa_erase(&job->dependencies, index);
> > > > + dma_fence_put(fence);
> > > > + return;
> > > > + }
> > > > +
> > > > + if (WARN_ON(xa_is_err(xa_store(&job->dependencies, index, fence,
> > > > + GFP_NOWAIT)))) {
> > > > + dma_fence_put(fence);
> > > > + return;
> > > > + }
> > >
> > > You don't seem to go for the
> > > replace-if-earlier-fence-on-same-context-exists optimization that we
> > > have in drm_dep_job_add_dependency(). Any reason not to?
> > >
> >
> > No, that could be added in. My reasoning for omitting it was that if
> > you are pre-allocating a slot, you likely know the same timeline hasn't
> > already been added, but maybe that is a bad assumption.
>
> Hm, in Panthor that would mean extra checks driver side, because at the
> moment we don't check where deps come from. I'd be tempted to say, the
> more we can automate the better, dunno.
>
In my example of TLB invalidations this is a non-issue. We can always
circle back to squashing here if needed or just do it now. Always open
to ideas.
Matt
> Regards,
>
> Boris
>
> [1]https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/61/diffs#a5a71f917ff65cfe4c1a341fa7e55ae149d22863_300_693
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-22 6:43 ` Matthew Brost
@ 2026-03-23 7:58 ` Matthew Brost
2026-03-23 10:06 ` Boris Brezillon
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-23 7:58 UTC (permalink / raw)
To: Boris Brezillon
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Sat, Mar 21, 2026 at 11:43:12PM -0700, Matthew Brost wrote:
> On Thu, Mar 19, 2026 at 10:57:29AM +0100, Boris Brezillon wrote:
> > On Wed, 18 Mar 2026 15:40:35 -0700
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > > > >
> > > > > So I don’t think Rust natively solves these types of problems, although
> > > > > I’ll concede that it does make refcounting a bit more sane.
> > > >
> > > > Rust won't magically defer the cleanup, nor will it dictate how you want
> > > > to do the queue teardown, those are things you need to implement. But it
> > > > should give visibility about object lifetimes, and guarantee that an
> > > > object that's still visible to some owners is usable (the notion of
> > > > usable is highly dependent on the object implementation).
> > > >
> > > > Just a purely theoretical example of a multi-step queue teardown that
> > > > might be possible to encode in rust:
> > > >
> > > > - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> > > > There's a ::destroy() method consuming 'self' and returning a
> > > > MyJobQueue<Destroyed> object
> > > > - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> > > > destroyed. No new job can be pushed. Existing jobs that didn't make
> > > > it to the FW queue are cancelled, jobs that are in-flight are
> > > > cancelled if they can, or are just waited upon if they can't. When
> > > > the whole destruction step is done, ::destroyed() is called, it
> > > > consumes 'self' and returns a MyJobQueue<Inactive> object.
> > > > - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> > > > any resources on this queue). It's ready to be cleaned up.
> > > > ::cleanup() (or just ::drop()) defers the cleanup of some inner
> > > > object that has been passed around between the various
> > > > MyJobQueue<State> wrappers.
> > > >
> > > > Each of the state transition can happen asynchronously. A state
> > > > transition consumes the object in one state, and returns a new object
> > > > in its new state. None of the transition involves dropping a refcnt,
> > > > ownership is just transferred. The final MyJobQueue<Inactive> object is
> > > > the object we'll defer cleanup on.
> > > >
> > > > It's a very high-level view of one way this can be implemented (I'm
> > > > sure there are others, probably better than my suggestion) in order to
> > > > make sure the object doesn't go away without the compiler enforcing
> > > > proper state transitions.
> > > >
> > >
> > > I'm sure Rust can implement this. My point about Rust is it doesn't
> > > magically solve hard software arch problems, but I will admit the
> > > ownership model and the way it can enforce locking at compile time
> > > are pretty cool.
> >
> > It's not quite about rust directly solving those problems for you, it's
> > about rust forcing you to think about those problems in the first
> > place. So no, rust won't magically solve your multi-step teardown with
> > crazy CPU <-> Device synchronization etc, but it allows you to clearly
> > identify those steps, and think about how you want to represent them
> > without abusing other concepts, like object refcounting/ownership.
> > Everything I described, you can code it in C BTW, it's just that C is so
> > lax that you can also abuse other stuff to get to your ends, which might
> > or might not be safe, but more importantly, will very likely obfuscate
> > the code (even with good docs).
> >
>
> This is very well put, and I completely agree. Sorry—I get annoyed by
> the Rust comments. It solves some classes of problems, but it doesn’t
> magically solve complex software architecture issues that need to be
> thoughtfully designed.
>
> > >
> > > > > > > > +/**
> > > > > > > > + * DOC: DRM dependency fence
> > > > > > > > + *
> > > > > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > > > > + * completes the job.
> > > > > > > > + *
> > > > > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > > > > + * NULL or an error).
> > > > > > >
> > > > > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > > > > carried out by Christian?
> > > > > > >
> > > > >
> > > > > Consider the case where a driver’s hardware fence is implemented as a
> > > > > dma-fence-array or dma-fence-chain. You cannot install these types of
> > > > > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > > > > here.
> > > >
> > > > Hm, so that's a driver returning a dma_fence_array/chain through
> > > > ::run_job()? Why would we not want to have them directly exposed and
> > > > split up into singular fence objects at resv insertion time (I don't
> > > > think syncobjs care, but I might be wrong). I mean, one of the points
> > >
> > > You can stick dma-fence-arrays in syncobjs, but not chains.
> >
> > Yeah, kinda makes sense: since timeline syncobjs use chains, and a
> > chain rejects inner chains, it won't work.
> >
>
> +1, Exactly.
>
> > >
> > > Neither dma-fence-arrays nor dma-fence-chains can go into dma-resv.
> >
> > They can't go directly in it, but those can be split into individual
> > fences and be inserted, which would achieve the same goal.
> >
>
> Yes, but now it becomes a driver problem (maybe only mine) rather than
> an opaque job fence that can be inserted. In my opinion, it’s best to
> keep the job vs. hardware fence abstraction.
>
> > >
> > > Hence why disconnecting a job's finished fence from its hardware fence
> > > is IMO a good idea to keep, as it gives drivers flexibility on the
> > > hardware fences.
> >
> > The thing is, I'm not sure drivers were ever meant to expose containers
> > through ::run_job().
> >
>
> Well there haven't been any rules...
>
> > > e.g., If this design didn't have a job's finished fence, I'd have to
> > > open code one Xe side.
> >
> > There might be other reasons we'd like to keep the
> > drm_sched_fence-like proxy that I'm missing. But if it's the only one,
> > and the fence-combining pattern you're describing is common to multiple
> > drivers, we can provide a container implementation that's not a
> > fence_array, so you can use it to insert driver fences into other
> > containers. This way we wouldn't force the proxy model to all drivers,
> > but we would keep the code generic/re-usable.
> >
> > >
> > > > behind the container extraction is so fences coming from the same
> > > > context/timeline can be detected and merged. If you insert the
> > > > container through a proxy, you're defeating the whole fence merging
> > > > optimization.
> > >
> > > Right. Finished fences have a single timeline too...
> >
> > Aren't you faking a single timeline though if you combine fences from
> > different engines running at their own pace into a container?
> >
> > >
> > > >
> > > > The second thing is that I'm not sure drivers were ever supposed to
> > > > return fence containers in the first place, because the whole idea
> > > > behind a fence context is that fences are emitted/signalled in
> > > > seqno-order, and if the fence is encoding the state of multiple
> > > > timelines that progress at their own pace, it becomes tricky to control
> > > > that. I guess if it's always the same set of timelines that are
> > > > combined, that would work.
> > >
> > > Xe does this and it definitely works. We submit to multiple rings; when
> > > all rings signal a seqno, a chain or array signals -> the finished fence
> > > signals. The queues used in this manner can only submit multiple-ring
> > > jobs, so the finished fence timeline stays intact. If you could mix a
> > > multiple-ring submission followed by a single-ring submission on the
> > > same queue, yes, this could break.
> >
> > Okay, I had the same understanding, thanks for confirming.
> >
>
> I think the last three comments are resolved here—it’s a queue timeline.
> As long as the queue has consistent rules (i.e., submits to a consistent
> set of rings), this whole approach makes sense?
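To make the merging point above concrete, here is a userspace toy model
(plain C, no kernel APIs; all `toy_*` names are hypothetical) of why
splitting a container into individual fences at resv-insertion time
preserves per-context merging, while a proxy fence on a fresh context
would not:

```c
#include <stddef.h>

/* Toy stand-in for a dma_fence: a timeline (context) plus a seqno. */
struct toy_fence {
	unsigned int context;
	unsigned int seqno;
};

#define TOY_RESV_SLOTS 16

/*
 * Toy stand-in for a dma_resv: keeps at most one fence per context,
 * preferring the later seqno. This models the merging optimization.
 */
struct toy_resv {
	struct toy_fence fences[TOY_RESV_SLOTS];
	size_t count;
};

static void toy_resv_add(struct toy_resv *resv, struct toy_fence fence)
{
	size_t i;

	for (i = 0; i < resv->count; i++) {
		if (resv->fences[i].context == fence.context) {
			if (fence.seqno > resv->fences[i].seqno)
				resv->fences[i] = fence;
			return; /* merged, no new slot consumed */
		}
	}
	resv->fences[resv->count++] = fence;
}

/*
 * Splitting a container at insertion time: each inner fence is added
 * individually, so the per-context merge above still applies. A proxy
 * fence would instead occupy one fresh context per job, defeating it.
 */
static void toy_resv_add_split(struct toy_resv *resv,
			       const struct toy_fence *inner, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		toy_resv_add(resv, inner[i]);
}
```

A sketch only; it models the observable behavior being discussed, not
the actual dma_resv implementation.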
>
> > >
> > > >
> > > > > One example is when a single job submits work to multiple rings
> > > > > that are flipped in hardware at the same time.
> > > >
> > > > We do have that in Panthor, but that's all explicit: in a single
> > > > SUBMIT, you can have multiple jobs targeting different queues, each of
> > > > them having their own set of deps/signal ops. The combination of all the
> > > > signal ops into a container is left to the UMD. It could be automated
> > > > kernel side, but that would be a flag on the SIGNAL op leading to the
> > > > creation of a fence_array containing fences from multiple submitted
> > > > jobs, rather than the driver combining stuff in the fence it returns in
> > > > ::run_job().
> > >
> > > See above. We have a dedicated queue type for this type of submission
> > > and a single job that submits to all rings. We used multiple queues /
> > > jobs in i915 to implement this, but it turns out it is much cleaner
> > > with a single queue / single job / multiple rings model.
> >
> > Hm, okay. It didn't turn into a mess in Panthor, but Xe is likely an
> > order of magnitude more complicated than Mali, so I'll refrain from
> > judging this design decision.
> >
>
> Yes, Xe is a beast, but we tend to build complexity into components and
> layers to manage it. That is what I’m attempting to do here.
>
> > >
> > > >
> > > > >
> > > > > Another case is late arming of hardware fences in run_job (which many
> > > > > drivers do). The proxy fence is immediately available at arm time and
> > > > > can be installed into dma-resv or syncobjs even though the actual
> > > > > hardware fence is not yet available. I think most drivers could be
> > > > > refactored to make the hardware fence immediately available at run_job,
> > > > > though.
> > > >
> > > > Yep, I also think we can arm the driver fence early in the case of
> > > > JobQueue. The reason it couldn't be done before is because the
> > > > scheduler was in the middle, deciding which entity to pull the next job
> > > > from, which was changing the seqno a job driver-fence would be assigned
> > > > (you can't guess that at queue time in that case).
> > > >
> > >
> > > Xe doesn't need late arming, but it looks like multiple drivers do
> > > implement late arming, so it may be required (?).
> >
> > As I said, it's mostly a problem when you have a
> > single-HW-queue:multiple-contexts model, which is exactly what
> > drm_sched was designed for. I suspect early arming is not an issue for
> > any of the HW supporting FW-based scheduling (PVR, Mali, NVidia,
> > ...). If you want to use drm_dep for all drivers currently using
> > drm_sched (I'm still not convinced this is a good idea just yet,
> > because then you're going to pull in a lot of the complexity
> > we're trying to get rid of), then you need late arming of driver fences.
> >
>
> Yes, even the hardware scheduling component [1] I hacked together relied
> on no late arming. But even then, you can arm a dma-fence early and
> assign a hardware seqno later in run_job()—those are two different
> things.
>
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966#line_7c49af3ee_A319
>
> > >
> > > > [...]
> > > >
> > > > > > > > + * **Reference counting**
> > > > > > > > + *
> > > > > > > > + * Jobs and queues are both reference counted.
> > > > > > > > + *
> > > > > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > > > > + * job that was submitted to it.
> > > > > > > > + *
> > > > > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > > > > >
> > > > > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > > > > the reference? We can then release the reference once the job is cleaned up
> > > > > > > (by the queue, using a worker) in process context.
> > > > >
> > > > > I think that’s what I’m doing, while also allowing an opt-in path to
> > > > > drop the job reference when it signals (in IRQ context)
> > > >
> > > > Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> > > > defer the cleanup when you're in an IRQ/atomic context, but defer it
> > > > when you're in a thread context.
> > > >
> > >
> > > The put of a job in this design can be from an IRQ context (an opt-in
> > > feature). xa_destroy() blows up if it is called from an IRQ context,
> > > although maybe that could be worked around.
> >
> > Making _put() safe to call from IRQ context is fine; what I'm saying is
> > that instead of doing a partial immediate cleanup, and the rest in a
> > worker, we can just defer everything: that is, have some
> > _deref_release() function called by kref_put() that would queue a work
> > item from which the actual release is done.
> >
>
> See below.
>
> > >
> > > > > so we avoid
> > > > > switching to a work item just to drop a ref. That seems like a
> > > > > significant win in terms of CPU cycles.
> > > >
> > > > Well, the cleanup path is probably not where latency matters the most.
> > >
> > > Agree. But I do think avoiding a CPU context switch (work item) for a
> > > very lightweight job cleanup (usually just dropping refs) will save CPU
> > > cycles, and thus also things like power, etc...
> >
> > That's the sort of statement I'd like to see backed by actual
> > numbers/scenarios proving that it actually makes a difference. The
>
> I disagree. This is not a locking micro-optimization, for example. It is
> a software architecture choice that says “do not trigger a CPU context
> switch to free a job,” which costs thousands of cycles. This will have
> an effect on CPU utilization and, thus, power.
>
> > mixed model where things are partially freed immediately/partially
> > deferred, and sometimes even with conditionals for whether the deferral
> > happens or not, it just makes building a mental model of this thing a
> > nightmare, which in turn usually leads to subtle bugs.
> >
>
> See above—managing complexity in components. This works in both modes. I
> refactored Xe so it also works in IRQ context. If it would make you feel
> better, I can ask my company to commit CI resources so the non-IRQ mode
> consistently works too—it’s just a single API flag on the queue. But
> then maybe other companies should also commit to public CI.
>
> > >
> > > > It's adding scheduling overhead, sure, but given all the stuff we defer
> > > > already, I'm not too sure we're at saving a few cycles to get the
> > > > cleanup done immediately. What's important to have is a way to signal
> > > > fences in an atomic context, because this has an impact on latency.
> > > >
> > >
> > > Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in.
> > >
> > > > [...]
> > > >
> > > > > > > > + /*
> > > > > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > > > > + * xa_destroy() would be unsafe.
> > > > > > > > + */
> > > > > > >
> > > > > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > > > > via ops->run_job()?
> > > > > > >
> > > > > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > > > > other than scheduling the queue worker?
> > > > > > >
> > > > >
> > > > > Yes, this code is required to support dropping job refs directly in the
> > > > > dma-fence callback (an opt-in feature). Again, this seems like a
> > > > > significant win in terms of CPU cycles, although I haven’t collected
> > > > > data yet.
> > > >
> > > > If it significantly hurts the perf, I'd like to understand why, because
> > > > to me it looks like pure-cleanup (no signaling involved), and thus no
> > > > other process waiting for us to do the cleanup. The only thing that
> > > > might have an impact is how fast you release the resources, and given
> > > > it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> > > > like to understand which part of the immediate cleanup is causing a
> > > > contention (basically which kind of resources the system is starving of)
> > > >
> > >
> > > It was more that, once we moved to a refcounted model, it is pretty
> > > trivial to allow drm_dep_job_put() while the fence is signaling. It
> > > doesn't really add any complexity either, which is why I added it.
> >
> > It's not the refcount model I'm complaining about, it's the "part of it
> > is always freed immediately, part of it is deferred, but not always ..."
> > that happens in drm_dep_job_release() I'm questioning. I'd really
> > prefer something like:
> >
>
> You are completely missing the point here.
>
Let me rephrase this — I realize this may come across as rude, which is
not my intent. I believe there is simply a disconnect in understanding
the constraints.
In my example below, the job release completes within bounded time
constraints, which makes it suitable for direct release in IRQ context,
bypassing the need for a work item that would otherwise incur a costly
CPU context switch.
Matt
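As a userspace sketch of the two modes being debated (plain C, toy
names; the real code uses kref and a workqueue, and the flag name below
mirrors DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE): an IRQ-safe queue drops
the job inline from the final put, while the non-IRQ-safe mode only
hands the job to a worker that performs the release later.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy flag mirroring DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE. */
#define TOY_PUT_IRQ_SAFE	(1 << 0)

struct toy_job {
	int refcount;
	unsigned int queue_flags;
	bool released;			/* release has run */
	struct toy_job *defer_next;
};

/* Stand-in for the put_job worker's input list. */
static struct toy_job *toy_defer_list;

static void toy_job_release(struct toy_job *job)
{
	/* Bounded-time cleanup only: drop fences, decrement counters... */
	job->released = true;
}

/*
 * Final put: either release inline (safe from IRQ context because the
 * release is bounded) or hand the job off for deferred release.
 */
static void toy_job_put(struct toy_job *job)
{
	if (--job->refcount)
		return;

	if (job->queue_flags & TOY_PUT_IRQ_SAFE) {
		toy_job_release(job);
	} else {
		job->defer_next = toy_defer_list;
		toy_defer_list = job;	/* queue_work() in the real code */
	}
}

/* The deferred path: what the put_job worker would do. */
static void toy_put_worker(void)
{
	while (toy_defer_list) {
		struct toy_job *job = toy_defer_list;

		toy_defer_list = job->defer_next;
		toy_job_release(job);
	}
}
```

This is only a model of the control flow under discussion, not the
proposed implementation; it exists to show that the inline path skips
exactly one deferral, nothing more.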
> Here is what I’ve reduced my job put to:
>
>         xe_sched_job_free_fences(job);
>         dma_fence_put(job->fence);
>         job_free(job);
>         atomic_dec(&q->job_cnt);
>         xe_pm_runtime_put(xe);
>
> These are lightweight (IRQ-safe) operations that never need to be done
> in a work item—so why kick one?
>
> Matt
>
> > static void drm_dep_job_release(struct kref *ref)
> > {
> > 	struct drm_dep_job *job =
> > 		container_of(ref, struct drm_dep_job, refcount);
> >
> > 	/* do it all unconditionally */
> > }
> >
> > static void drm_dep_job_defer_release(struct kref *ref)
> > {
> > 	struct drm_dep_job *job =
> > 		container_of(ref, struct drm_dep_job, refcount);
> >
> > 	queue_work(system_wq, &job->cleanup_work);
> > }
> >
> > static void drm_dep_job_put(struct drm_dep_job *job)
> > {
> > 	kref_put(&job->refcount, drm_dep_job_defer_release);
> > }
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 4:50 ` Matthew Brost
@ 2026-03-23 9:55 ` Boris Brezillon
2026-03-23 17:08 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-23 9:55 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
Hi Matthew,
On Sun, 22 Mar 2026 21:50:07 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > new file mode 100644
> > > > > index 000000000000..2d012b29a5fc
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > @@ -0,0 +1,675 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > > + *
> > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > + * to deal in the Software without restriction, including without limitation
> > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > + *
> > > > > + * The above copyright notice and this permission notice shall be included in
> > > > > + * all copies or substantial portions of the Software.
> > > > > + *
> > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > > + *
> > > > > + * Copyright © 2026 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * DOC: DRM dependency job
> > > > > + *
> > > > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > > + *
> > > > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > > > + * kref reference and a reference to its queue.
> > > > > + *
> > > > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > > > + * that must be signalled before the job can run. Duplicate fences from the
> > > > > + * same fence context are deduplicated automatically.
> > > > > + *
> > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > > > + * consuming a sequence number from the queue. After arming,
> > > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > > + * userspace or used as a dependency by other jobs.
> > > > > + *
> > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > > > + * queue takes a reference that it holds until the job's finished fence
> > > > > + * signals and the job is freed by the put_job worker.
> > > > > + *
> > > > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > > > + *
> > > > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > > > + * objects before the driver's release callback is invoked.
> > > > > + */
> > > > > +
> > > > > +#include <linux/dma-resv.h>
> > > > > +#include <linux/kref.h>
> > > > > +#include <linux/slab.h>
> > > > > +#include <drm/drm_dep.h>
> > > > > +#include <drm/drm_file.h>
> > > > > +#include <drm/drm_gem.h>
> > > > > +#include <drm/drm_syncobj.h>
> > > > > +#include "drm_dep_fence.h"
> > > > > +#include "drm_dep_job.h"
> > > > > +#include "drm_dep_queue.h"
> > > > > +
> > > > > +/**
> > > > > + * drm_dep_job_init() - initialise a dep job
> > > > > + * @job: dep job to initialise
> > > > > + * @args: initialisation arguments
> > > > > + *
> > > > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > > > + * job reference is dropped.
> > > > > + *
> > > > > + * Resources are released automatically when the last reference is dropped
> > > > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > > > + * must not free the job directly.
> > > > > + *
> > > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > > + * -%ENOMEM on fence allocation failure.
> > > > > + */
> > > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > > + const struct drm_dep_job_init_args *args)
> > > > > +{
> > > > > + if (unlikely(!args->credits)) {
> > > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + memset(job, 0, sizeof(*job));
> > > > > +
> > > > > + job->dfence = drm_dep_fence_alloc();
> > > > > + if (!job->dfence)
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + job->ops = args->ops;
> > > > > + job->q = drm_dep_queue_get(args->q);
> > > > > + job->credits = args->credits;
> > > > > +
> > > > > + kref_init(&job->refcount);
> > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > > +
> > > > > +/**
> > > > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > > > + * @job: dep job whose dependency xarray to drain
> > > > > + *
> > > > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > > > + * skipped; the sentinel carries no reference. Called from
> > > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > > > + * dependencies here — while still in process context — avoids calling
> > > > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > > > + * dropped from a dma_fence callback.
> > > > > + *
> > > > > + * Context: Process context.
> > > > > + */
> > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > > +{
> > > > > + struct dma_fence *fence;
> > > > > + unsigned long index;
> > > > > +
> > > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > > + continue;
> > > > > + dma_fence_put(fence);
> > > > > + }
> > > > > + xa_destroy(&job->dependencies);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_dep_job_fini() - clean up a dep job
> > > > > + * @job: dep job to clean up
> > > > > + *
> > > > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > > > + *
> > > > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > > > + * context immediately after run_job(), so it is left untouched to avoid
> > > > > + * calling xa_destroy() from IRQ context.
> > > > > + *
> > > > > + * Warns if @job is still linked on the queue's pending list, which would
> > > > > + * indicate a bug in the teardown ordering.
> > > > > + *
> > > > > + * Context: Any context.
> > > > > + */
> > > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > > +{
> > > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > > +
> > > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > > +
> > > > > + drm_dep_fence_cleanup(job->dfence);
> > > > > + job->dfence = NULL;
> > > > > +
> > > > > + /*
> > > > > + * Armed jobs have their dependencies drained by
> > > > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > > >
> > > > Just want to clear the confusion and make sure I get this right at the
> > > > same time. To me, "process context" means a user thread entering some
> > > > syscall(). What you call "process context" is more a "thread context" to
> > > > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > > > worker thread to be accurate) that executes the drop_deps() after a
> > > > run_job().
> > >
> > > Some of the context comments could likely be cleaned up. 'Process
> > > context' here means either user context (bypass path) or the run_job
> > > work item.
> > >
> > > >
> > > > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > > + */
> > > > > + if (!armed)
> > > > > + drm_dep_job_drop_dependencies(job);
> > > >
> > > > Why do we need to make a difference here? Can't we just assume that the
> > > > whole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > > work item embedded in the job to defer its destruction when _put() is
> > > > called in a context where the destruction is not allowed?
> > > >
> > >
> > > We already touched on this, but the design currently allows the last
> > > job put from the dma-fence signaling path (IRQ).
> >
> > It's not much about the last _put and more about what happens in the
> > _release() you pass to kref_put(). My point being, if you assume
> > something in _release() is not safe to be done in an atomic context,
> > and _put() is assumed to be called from any context, you might as well
>
> No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
> put (including release) is IRQ-safe. If the documentation isn’t clear, I
> can clean that up. Some of my comments here [1] try to explain this
> further.
>
> Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
> dma-fence whose release must be IRQ-safe, so there is precedent for
> this. I didn’t want to unilaterally require that all job releases be
> IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
> the flag.
>
> The difference between non-IRQ-safe and IRQ-safe job release is only
> about 12 lines of code.
It's not just about the number of lines of code added to the core to
deal with that case, but also the complexity of the API that results
from these various modes.
> I figured that if we’re going to invest the time
> and effort to replace DRM sched, we should aim for the best possible
> implementation. Any driver can opt in here and immediately get lower CPU
> utilization and power savings. I will try to figure out how to measure
> this and get some numbers here.
That's key here. My gut feeling is that we have so much deferred
already that adding one more work item to the workqueue is not going to
hurt in terms of scheduling overhead (no context switch if it's
scheduled on the same workqueue). Job cleanup is just the phase
following the job_done() event, which also requires a deferred work to
check progress on the queue anyway. And if you move the entirety of the
job cleanup to job_release() instead of doing part of it in
drm_dep_job_fini(), it makes for simpler design, where jobs are just
cleaned up when their refcnt drops to zero.
IMHO, that's exactly the kind of premature optimization that led us to
where we are with drm_sched: we think we need the optimization so we
add the complexity upfront without actual numbers to back this
theory (like real GPU workloads that lead to actual differences in terms
of power consumption, speed, ...), and the complexity just piles up as
you keep adding more and more of those flags.
>
> > just defer the cleanup (AKA the stuff you currently have in _release())
> > so everything is always cleaned up in a thread context. Yes, there's
> > scheduling overhead and extra latency, but it's also simpler, because
> > there's just one path. So, if the latency and the overhead is not
>
> This isn’t bolted on—it’s a built-in feature throughout. I can assure
> you that either mode works. I’ll likely add a debug Kconfig option to Xe
> that toggles DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE on each queue creation
> for CI runs, to ensure both paths work reliably and receive continuous
> testing.
I'm not claiming this doesn't work, I just want to make sure we're not
taking the same path drm_sched took with those premature optimizations.
If you have numbers that prove the extra power-consumption or the fact
the extra latency makes a difference in practice, that's a different
story, but otherwise, I still think it's preferable to start with a
smaller scope/simpler design, and add optimized cleanup path when we
have a proof it makes a difference.
>
> > proven to be a problem (and it rarely is for cleanup operations), I'm
> > still convinced this makes for an easier design to just defer the
> > cleanup all the time.
> >
> > > If we dropped that, then yes,
> > > this could change. The reason for the if statement is the case where a
> > > user is building a job and needs to abort prior to calling arm() (e.g.,
> > > a memory allocation fails) via a drm_dep_job_put().
> >
> > But even in that context, it could still be deferred and work just
> > fine, no?
> >
>
> A work item context switch is thousands, if not tens of thousands, of
> cycles.
We won't have a full context switch just caused by the cleanup in the
normal execution case though. The context switch, you already have it
to check progress on the job queue anyway, so adding an extra job
cleanup is pretty cheap at this point. I'm not saying it's free either:
there's still the extra work insertion, the dequeuing, etc.
> If your job release is only ~20 instructions, this is a massive
> imbalance and an overall huge waste. Jobs are lightweight objects—they
> should really be thought of as an extension of fences. Fence release
> must be IRQ-safe per the documentation, so it follows that jobs can opt
> in to the same release rules.
>
> In contrast, queues are heavyweight objects, typically with associated
> memory that also needs to be released. Here, a work item absolutely
> makes sense—hence the design in DRM dep.
And that's kinda my point: a job being reported as done will cause a
work item to be scheduled to check progress on the queue, so you're
already paying the price of a context switch anyway. At this point, all
you'll gain by fast-tracking the job cleanup and allowing for IRQ-safe
cleanups is just latency. If you have numbers/workloads saying
otherwise, I'm fine reconsidering the extra complexity, but I'd like to
see those first.
>
> > >
> > > Once arm() is called, there is a guarantee the run_job path is called,
> > > either via bypass or the run_job work item.
> >
> > Sure.
> >
>
> Let’s not gloss over this—this is actually a huge difference from DRM
> sched. One of the biggest problems I found with DRM sched is that if you
> call arm(), run_job() may or may not be called. Without this guarantee,
> you can’t do driver-side bookkeeping in arm() that is later released in
> run_job(), which would otherwise simplify the driver design.
You can do driver-side book-keeping after all jobs have been
successfully initialized, which includes arming their fences. The key
turning point is when you start exposing those armed fences, not
when you arm them. See below.
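That ordering can be sketched as a small userspace model (plain C, all
`toy_*` names hypothetical): every job in a batch is initialized and
armed first, and only once the whole batch has succeeded are the armed
fences exposed; a mid-batch failure unwinds without any fence ever
becoming visible.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_batch_job {
	bool armed;
	bool exposed;	/* fence installed in a resv/syncobj */
};

/* Pretend job i fails to initialize when fail_at == i. */
static bool toy_job_init_and_arm(struct toy_batch_job *job, size_t i,
				 size_t fail_at)
{
	if (i == fail_at)
		return false;
	job->armed = true;
	return true;
}

/*
 * Driver-side ordering: fence exposure happens only after ALL jobs have
 * been initialized and armed. Returns true if the batch was exposed,
 * false if it was unwound before any fence became visible.
 */
static bool toy_submit_batch(struct toy_batch_job *jobs, size_t n,
			     size_t fail_at)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!toy_job_init_and_arm(&jobs[i], i, fail_at)) {
			/* Unwind: nothing was exposed yet. */
			while (i--)
				jobs[i].armed = false;
			return false;
		}
	}

	/* The turning point: armed fences become visible only now. */
	for (i = 0; i < n; i++)
		jobs[i].exposed = true;
	return true;
}
```

Again just a control-flow model of the argument, not kernel code: the
interesting property is that `exposed` never becomes true on the
failure path.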
>
> In Xe, we artificially enforce this rule through our own usage of DRM
> sched, but in DRM dep this is now an API-level contract. That allows
> drivers to embrace this semantic directly, simplifying their designs.
>
[...]
> > >
> > > >
> > > > > +}
> > > > > +EXPORT_SYMBOL(drm_dep_job_arm);
> > > > > +
> > > > > +/**
> > > > > + * drm_dep_job_push() - submit a job to its queue for execution
> > > > > + * @job: dep job to push
> > > > > + *
> > > > > + * Submits @job to the queue it was initialised with. Must be called after
> > > > > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > > > > + * held until the queue is fully done with it. The reference is released
> > > > > + * directly in the finished-fence dma_fence callback for queues with
> > > > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > > > > + * from hardirq context), or via the put_job work item on the submit
> > > > > + * workqueue otherwise.
> > > > > + *
> > > > > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > > > > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > > > > + * enforces the pairing.
> > > > > + *
> > > > > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > > > > + * @job exactly once, even if the queue is killed or torn down before the
> > > > > + * job reaches the head of the queue. Drivers can use this guarantee to
> > > > > + * perform bookkeeping cleanup; the actual backend operation should be
> > > > > + * skipped when drm_dep_queue_is_killed() returns true.
> > > > > + *
> > > > > + * If the queue does not support the bypass path, the job is pushed directly
> > > > > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > > > > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > > > > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > > > > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > > > > + *
> > > > > + * Warns if the job has not been armed.
> > > > > + *
> > > > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > > > + * path.
> > > > > + */
> > > > > +void drm_dep_job_push(struct drm_dep_job *job)
> > > > > +{
> > > > > + struct drm_dep_queue *q = job->q;
> > > > > +
> > > > > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > > > > +
> > > > > + drm_dep_job_get(job);
> > > > > +
> > > > > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > > > > + drm_dep_queue_push_job(q, job);
> > > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > > + drm_dep_queue_push_job_end(job->q);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + scoped_guard(mutex, &q->sched.lock) {
> > > > > + if (drm_dep_queue_can_job_bypass(q, job))
> > > > > + drm_dep_queue_run_job(q, job);
> > > > > + else
> > > > > + drm_dep_queue_push_job(q, job);
> > > > > + }
> > > > > +
> > > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > > + drm_dep_queue_push_job_end(job->q);
> > > > > +}
> > > > > +EXPORT_SYMBOL(drm_dep_job_push);
> > > > > +
> > > > > +/**
> > > > > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > > > > + * @job: dep job to add the dependencies to
> > > > > + * @fence: the dma_fence to add to the list of dependencies, or
> > > > > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > > > > + *
> > > > > + * Note that @fence is consumed in both the success and error cases (except
> > > > > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > > > > + *
> > > > > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > > > > + * fence->context matches the queue's finished fence context) are silently
> > > > > + * dropped; the job need not wait on its own queue's output.
> > > > > + *
> > > > > + * Warns if the job has already been armed (dependencies must be added before
> > > > > + * drm_dep_job_arm()).
> > > > > + *
> > > > > + * **Pre-allocation pattern**
> > > > > + *
> > > > > + * When multiple jobs across different queues must be prepared and submitted
> > > > > + * together in a single atomic commit — for example, where job A's finished
> > > > > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > > > > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > > > + * region. Once that region has started no memory allocation is permitted.
> > > > > + *
> > > > > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > > > > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > > > > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > > > > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > > > > + * always index 0 when the dependency array is empty, as Xe relies on).
> > > > > + * After all jobs have been armed and the finished fences are available, call
> > > > > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > > > > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > > > > + * called from atomic or signalling context.
> > > > > + *
> > > > > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > > > > + * ensuring a slot is always allocated even when the real fence is not yet
> > > > > + * known.
> > > > > + *
> > > > > + * **Example: bind job feeding TLB invalidation jobs**
> > > > > + *
> > > > > + * Consider a GPU with separate queues for page-table bind operations and for
> > > > > + * TLB invalidation. A single atomic commit must:
> > > > > + *
> > > > > + * 1. Run a bind job that modifies page tables.
> > > > > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > > > > + * completing, so stale translations are flushed before the engines
> > > > > + * continue.
> > > > > + *
> > > > > + * Because all jobs must be armed and pushed inside a signalling region (where
> > > > > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > > > > + *
> > > > > + * // Phase 1 — process context, GFP_KERNEL allowed
> > > > > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > > > > + * for_each_mmu(mmu) {
> > > > > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > > > > + * // Pre-allocate slot at index 0; real fence not available yet
> > > > > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > > > > + * }
> > > > > + *
> > > > > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > > > > + * dma_fence_begin_signalling();
> > > > > + * drm_dep_job_arm(bind_job);
> > > > > + * for_each_mmu(mmu) {
> > > > > + * // Swap sentinel for bind job's finished fence
> > > > > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > > > > + * dma_fence_get(bind_job->finished));
> > > >
> > > > Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
> > > > submit path. If we were to add it, it would be around the
> > > > panthor_submit_ctx_push_jobs() call, which might seem broken. In
> > >
> > > Yes, I noticed that. I put XXX comment in my port [1] around this.
> > >
> > > [1] https://patchwork.freedesktop.org/patch/711952/?series=163245&rev=1
> > >
> > > > practice I don't think it is because we don't expose fences to the
> > > > outside world until all jobs have been pushed. So what happens is that
> > > > a job depending on a previous job in the same batch-submit has the
> > > > armed-but-not-yet-pushed fence in its deps, and that's the only place
> > > > where this fence is present. If something fails on a subsequent job
> > > > preparation in the next batch submit, the rollback logic will just drop
> > > > the jobs on the floor, and release the armed-but-not-pushed-fence,
> > > > meaning we're not leaking a fence that will never be signalled. I'm in
> > > > no way saying this design is sane, just trying to explain why it's
> > > > currently safe and works fine.
> > >
> > > Yep, I think it would be better to have no failure points between arm and
> > > push, which again I do my best to enforce via lockdep/warnings.
> >
> > I'm still not entirely convinced by that. To me _arm() is not quite the
> > moment you make your fence public, and I'm not sure the extra complexity
> > added for intra-batch dependencies (one job in a SUBMIT depending on a
> > previous job in the same SUBMIT) is justified, because what really
> > matters is not that we leave dangling/unsignalled dma_fence objects
> > around, the problem is when you do so on an object that has been
> > exposed publicly (syncobj, dma_resv, sync_file, ...).
> >
>
> Let me give you an example of why a failure between arm() and push() is
> a huge problem:
>
> arm()
> dma_resv_install(fence_from_arm)
> fail
That's not what Panthor does. What we do is:
for_each_job_in_batch() {
ret = fallible_stuff()
if (ret)
goto rollback;
arm(job)
}
// Nothing can fail after this point
for_each_job_in_batch() {
update_resvs(job->done_fence);
push(job)
}
update_submit_syncobjs();
As you can see, an armed job doesn't mean the job fence is public, it
only becomes public after we've updated the resv of the BOs that might
be touched by this job.
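The two-phase flow above can be sketched as a self-contained mock (all names here — fallible_stuff, update_resvs, the struct fields — are placeholders for illustration, not real kernel API). The point it demonstrates is that an armed-but-unpublished fence can be dropped on rollback without ever being observable:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock job: the fence only becomes "public" in phase 2. */
struct job {
	bool armed;      /* fence initialised (seqno consumed)  */
	bool published;  /* fence installed in a resv/syncobj   */
};

/* Phase 1: all fallible work happens here; arm() itself cannot fail. */
static int prepare_batch(struct job *jobs, int n, int fail_at)
{
	for (int i = 0; i < n; i++) {
		if (i == fail_at)
			return -1;        /* fallible step failed */
		jobs[i].armed = true;     /* arm(): infallible    */
	}
	return 0;
}

/* Phase 2: nothing can fail; only now do fences become public. */
static void publish_batch(struct job *jobs, int n)
{
	for (int i = 0; i < n; i++) {
		assert(jobs[i].armed);
		jobs[i].published = true; /* update_resvs() + push() */
	}
}

static bool submit(struct job *jobs, int n, int fail_at)
{
	if (prepare_batch(jobs, n, fail_at)) {
		/* Rollback: armed-but-unpublished fences are simply
		 * dropped; nothing outside the driver observed them. */
		for (int i = 0; i < n; i++)
			assert(!jobs[i].published);
		return false;
	}
	publish_batch(jobs, n);
	return true;
}
```

Under this structure the "public" transition is the resv/syncobj update, not arm(), which is the distinction being argued here.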
>
> How does one unwind this? Signal the fence from arm()?
That, or we just ignore the fact it's not been signalled. If the job
that created the fence has never been submitted, and the fence has
vanished before hitting any public container, it doesn't matter.
> What if the fence
> from arm() is on a timeline currently being used by the device? The
> memory can move, and the device then can corrupt memory.
What? No, the seqno is just consumed, but there's nothing attached to
it, the previous job on this timeline (N-1) is still valid, and the next
one will have a seqno of N+1, which will force an implicit dep on N-1
on the same timeline. That's all.
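The seqno-gap argument can be made concrete with a simplified, wrap-safe "later than" comparison in the style of dma_fence_is_later() (a stand-in for illustration, not the kernel implementation). A consumed-but-never-signalled seqno N leaves N+1 strictly later than N-1, so ordering on the timeline is unaffected:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is later than b" on a fence timeline
 * (simplified stand-in for dma_fence_is_later()). */
static bool seqno_is_later(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}
```

With prev = N-1 and next = N+1, seqno_is_later(next, prev) still holds even though seqno N was consumed by an aborted job and never signalled; the next job's implicit dependency on the same timeline simply lands on N-1.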
>
> In my opinion, it’s best and safest to enforce a no-failure policy
> between arm() and push().
I don't think it's safer, it's just the semantics that have been
defined by drm_sched/dma_fence and that we keep forcing ourselves
into. I'd rather have a well defined dma_fence state that says "that's
it, I'm exposed, you have to signal me now", than this half-enforced
arm()+push() model.
>
> FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
> which had the exact issue I described above. I pointed it out and got a
> reply saying, “well, this is an API issue, right?”—and they were
> correct, it is an API issue.
>
> > >
> > > >
> > > > In general, I wonder if we should distinguish between "armed" and
> > > > "publicly exposed" to help deal with this intra-batch dep thing without
> > > > resorting to reservation and other tricks like that.
> > > >
> > >
> > > I'm not exactly sure what you're suggesting, but I'm always open to ideas.
> >
> > Right now _arm() is what does the dma_fence_init(). But there's an
> > extra step between initializing the fence object and making it
> > visible to the outside world. In order for the dep to be added to the
> > job, you need the fence to be initialized, but that's not quite
> > external visibility, because the job is still very much a driver
> > object, and if something fails, the rollback mechanism makes it so all
> > the deps are dropped on the floor along the job that's being destroyed.
> > So we won't really wait on this fence that's never going to be
> > signalled.
> >
> > I see what's appealing in pretending that _arm() == externally-visible,
> > but it's also forcing us to do extra pre-alloc (or other pre-init)
> > operations that would otherwise not be required in the submit path. Not
> > a hill I'm willing to die on, but I just thought I'd mention the fact I
> > find it weird that we put extra constraints on ourselves that are not
> > strictly needed, just because we fail to properly flag the dma_fence
> > visibility transitions.
>
> See the dma-resv example above. I’m not willing to die on this hill
> either, but again, in my opinion, for safety and as an API-level
> contract, enforcing arm() as a no-failure point makes sense. It prevents
> drivers from doing anything dangerous like the dma-resv example, which
> is an extremely subtle bug.
That's a valid point, but you're not really enforcing things at
compile/run-time; it's just "don't do this/that" in the docs. If you
encode the is_active() state at the dma_fence level, properly change
the fence state anytime it's about to be added to a public container,
and make it so an active fence that's released without being signalled
triggers a WARN_ON(), you've achieved more. Once you've done that, you
can also relax the rule that says that "an armed fence has to be
signalled" to "a fence that's active has to be signalled". With this,
the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is
no longer required, because you would be able to store inactive fences
there, as long as they become active before the job is pushed.
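The proposed inactive-fence state can be sketched as a minimal mock (the flag name, helpers, and return conventions are hypothetical, modelled on the DMA_FENCE_FLAG_INACTIVE idea above, not existing dma_fence API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fence visibility state. */
struct mock_fence {
	bool active;     /* clear until the fence may be observed */
	bool signalled;
};

/* Adding to a public container (resv/syncobj/sync_file) is rejected
 * while the fence is inactive; in the kernel this would be a
 * WARN_ON() plus an error return. */
static int container_add(const struct mock_fence *f)
{
	return f->active ? 0 : -1;
}

/* At _push() time the fence becomes observable and must now signal. */
static void fence_set_active(struct mock_fence *f)
{
	f->active = true;
}

/* Release rule: dropping an *active* unsignalled fence is a bug;
 * an inactive one may be dropped on the floor during rollback. */
static bool fence_release_ok(const struct mock_fence *f)
{
	return !f->active || f->signalled;
}
```

This is the relaxation described above: "a fence that's active has to be signalled" replaces "an armed fence has to be signalled", so inactive fences can sit in a job's dependency xarray without a pre-allocated sentinel.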
>
> >
> > On the rust side it would be directly described through the type
> > system (see the Visibility attribute in Daniel's branch[1]). On C side,
> > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> > name you want to give it). Any operation pushing the fence to public
> > container (dma_resv, syncobj, sync_file, ...) would be rejected when
> > that flag is set. At _push() time, we'd clear that flag with a
> > dma_fence_set_active() helper, which would reflect the fact the fence
> > can now be observed and exposed to the outside world.
> >
>
> Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE
> flag. When adding a fence to dma-resv, fences that belong to the same
> timeline are immediately squashed. A later transition of the fence state
> completely breaks this behavior.
That's exactly my point: as soon as you want to insert the fence to a
public container, you have to make it "active", so it will never be
rolled back to the previous entry in the resv. Similarly, a
wait/add_callback() on an inactive fence should be rejected.
>
> void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
> 			enum dma_resv_usage usage)
> {
> 	struct dma_resv_list *fobj;
> 	struct dma_fence *old;
> 	unsigned int i, count;
>
> 	dma_fence_get(fence);
>
> 	dma_resv_assert_held(obj);
>
> 	/* Drivers should not add containers here, instead add each fence
> 	 * individually.
> 	 */
> 	WARN_ON(dma_fence_is_container(fence));
>
> 	fobj = dma_resv_fences_list(obj);
> 	count = fobj->num_fences;
>
> 	for (i = 0; i < count; ++i) {
> 		enum dma_resv_usage old_usage;
>
> 		dma_resv_list_entry(fobj, i, obj, &old, &old_usage);
> 		if ((old->context == fence->context && old_usage >= usage &&
> 		     dma_fence_is_later_or_same(fence, old)) ||
> 		    dma_fence_is_signaled(old)) {
> 			dma_resv_list_set(fobj, i, fence, usage);
> 			dma_fence_put(old);
> 			return;
> 		}
> 	}
>
> I imagine syncobjs have similar squashing, but I don't know that offhand.
Same goes for syncobjs.
Regards,
Boris
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 7:58 ` Matthew Brost
@ 2026-03-23 10:06 ` Boris Brezillon
2026-03-23 17:11 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-23 10:06 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Mon, 23 Mar 2026 00:58:51 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> > > It's not the refcount model I'm complaining about, it's the "part of it
> > > is always freed immediately, part of it is deferred, but not always ..."
> > > that happens in drm_dep_job_release() I'm questioning. I'd really
> > > prefer something like:
> > >
> >
> > You are completely missing the point here.
> >
>
> Let me rephrase this — I realize this may come across as rude, which is
> not my intent.
No offense taken ;-).
> I believe there is simply a disconnect in understanding
> the constraints.
>
> In my example below, the job release completes within bounded time
> constraints, which makes it suitable for direct release in IRQ context,
> bypassing the need for a work item that would otherwise incur a costly
> CPU context switch.
In the other thread, I've explained in more details why I think
deferred cleanup of jobs is not as bad as you make it sound (context
switch amortized by the fact it's already there for queue progress
checking). But let's assume it is, I'd prefer a model where we say
"ops->job_release() has to be IRQ-safe" and have implementations defer
their cleanup if they have to, than this mixed approach with a flag. Of
course, I'd still like to have numbers proving that this job cleanup
deferral actually makes a difference in practice :P.
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 9:55 ` Boris Brezillon
@ 2026-03-23 17:08 ` Matthew Brost
2026-03-23 18:38 ` Matthew Brost
2026-03-24 8:49 ` Boris Brezillon
0 siblings, 2 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-23 17:08 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon, Mar 23, 2026 at 10:55:04AM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Sun, 22 Mar 2026 21:50:07 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..2d012b29a5fc
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > @@ -0,0 +1,675 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > > > + *
> > > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > > + * to deal in the Software without restriction, including without limitation
> > > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > + *
> > > > > > + * The above copyright notice and this permission notice shall be included in
> > > > > > + * all copies or substantial portions of the Software.
> > > > > > + *
> > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > > > + *
> > > > > > + * Copyright © 2026 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: DRM dependency job
> > > > > > + *
> > > > > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > > > + *
> > > > > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > > > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > > > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > > > > + * kref reference and a reference to its queue.
> > > > > > + *
> > > > > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > > > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > > > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > > > > + * that must be signalled before the job can run. Duplicate fences from the
> > > > > > + * same fence context are deduplicated automatically.
> > > > > > + *
> > > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > > > > + * consuming a sequence number from the queue. After arming,
> > > > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > > > + * userspace or used as a dependency by other jobs.
> > > > > > + *
> > > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > > > > + * queue takes a reference that it holds until the job's finished fence
> > > > > > + * signals and the job is freed by the put_job worker.
> > > > > > + *
> > > > > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > > > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > > > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > > > > + *
> > > > > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > > > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > > > > + * objects before the driver's release callback is invoked.
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/dma-resv.h>
> > > > > > +#include <linux/kref.h>
> > > > > > +#include <linux/slab.h>
> > > > > > +#include <drm/drm_dep.h>
> > > > > > +#include <drm/drm_file.h>
> > > > > > +#include <drm/drm_gem.h>
> > > > > > +#include <drm/drm_syncobj.h>
> > > > > > +#include "drm_dep_fence.h"
> > > > > > +#include "drm_dep_job.h"
> > > > > > +#include "drm_dep_queue.h"
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_dep_job_init() - initialise a dep job
> > > > > > + * @job: dep job to initialise
> > > > > > + * @args: initialisation arguments
> > > > > > + *
> > > > > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > > > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > > > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > > > > + * job reference is dropped.
> > > > > > + *
> > > > > > + * Resources are released automatically when the last reference is dropped
> > > > > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > > > > + * must not free the job directly.
> > > > > > + *
> > > > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > > > + * -%ENOMEM on fence allocation failure.
> > > > > > + */
> > > > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > > > + const struct drm_dep_job_init_args *args)
> > > > > > +{
> > > > > > + if (unlikely(!args->credits)) {
> > > > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > > > + return -EINVAL;
> > > > > > + }
> > > > > > +
> > > > > > + memset(job, 0, sizeof(*job));
> > > > > > +
> > > > > > + job->dfence = drm_dep_fence_alloc();
> > > > > > + if (!job->dfence)
> > > > > > + return -ENOMEM;
> > > > > > +
> > > > > > + job->ops = args->ops;
> > > > > > + job->q = drm_dep_queue_get(args->q);
> > > > > > + job->credits = args->credits;
> > > > > > +
> > > > > > + kref_init(&job->refcount);
> > > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > > > +
> > > > > > + return 0;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > > > > + * @job: dep job whose dependency xarray to drain
> > > > > > + *
> > > > > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > > > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > > > > + * skipped; the sentinel carries no reference. Called from
> > > > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > > > > + * dependencies here — while still in process context — avoids calling
> > > > > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > > > > + * dropped from a dma_fence callback.
> > > > > > + *
> > > > > > + * Context: Process context.
> > > > > > + */
> > > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > > > +{
> > > > > > + struct dma_fence *fence;
> > > > > > + unsigned long index;
> > > > > > +
> > > > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > > > + continue;
> > > > > > + dma_fence_put(fence);
> > > > > > + }
> > > > > > + xa_destroy(&job->dependencies);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_dep_job_fini() - clean up a dep job
> > > > > > + * @job: dep job to clean up
> > > > > > + *
> > > > > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > > > > + *
> > > > > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > > > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > > > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > > > > + * context immediately after run_job(), so it is left untouched to avoid
> > > > > > + * calling xa_destroy() from IRQ context.
> > > > > > + *
> > > > > > + * Warns if @job is still linked on the queue's pending list, which would
> > > > > > + * indicate a bug in the teardown ordering.
> > > > > > + *
> > > > > > + * Context: Any context.
> > > > > > + */
> > > > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > > > +{
> > > > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > > > +
> > > > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > > > +
> > > > > > + drm_dep_fence_cleanup(job->dfence);
> > > > > > + job->dfence = NULL;
> > > > > > +
> > > > > > + /*
> > > > > > + * Armed jobs have their dependencies drained by
> > > > > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > > > >
> > > > > Just want to clear the confusion and make sure I get this right at the
> > > > > same time. To me, "process context" means a user thread entering some
> > > > > syscall(). What you call "process context" is more a "thread context" to
> > > > > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > > > > worker thread to be accurate) that executes the drop_deps() after a
> > > > > run_job().
> > > >
> > > > Some of the context comments likely could be cleaned up. 'Process context'
> > > > here means either user context (bypass path) or the run-job work item.
> > > >
> > > > >
> > > > > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > > > + */
> > > > > > + if (!armed)
> > > > > > + drm_dep_job_drop_dependencies(job);
> > > > >
> > > > > Why do we need to make a difference here? Can't we just assume that the
> > > > > whole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > > > work item embedded in the job to defer its destruction when _put() is
> > > > > called in a context where the destruction is not allowed?
> > > > >
> > > >
> > > > We already touched on this, but the design currently allows the last job
> > > > put from dma-fence signaling path (IRQ).
> > >
> > > It's not much about the last _put and more about what happens in the
> > > _release() you pass to kref_put(). My point being, if you assume
> > > something in _release() is not safe to be done in an atomic context,
> > > and _put() is assumed to be called from any context, you might as well
> >
> > No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
> > put (including release) is IRQ-safe. If the documentation isn’t clear, I
> > can clean that up. Some of my comments here [1] try to explain this
> > further.
> >
> > Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
> > dma-fence whose release must be IRQ-safe, so there is precedent for
> > this. I didn’t want to unilaterally require that all job releases be
> > IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
> > the flag.
> >
> > The difference between non-IRQ-safe and IRQ-safe job release is only
> > about 12 lines of code.
>
> It's not just about the number of lines of code added to the core to
> deal with that case, but also complexity of the API that results from
> these various modes.
>
Fair enough.
> > I figured that if we’re going to invest the time
> > and effort to replace DRM sched, we should aim for the best possible
> > implementation. Any driver can opt in here and immediately get lower CPU
> > utilization and power savings. I will try to figure out how to measure
> > this and get some numbers here.
>
> That's key here. My gut feeling is that we have so much deferred
> already that adding one more work to the workqueue is not going to
> hurt in term of scheduling overhead (no context switch if it's
> scheduled on the same workqueue). Job cleanup is just the phase
Signaling of fences in many drivers occurs in hard IRQ context rather
than in a work queue. I agree that if you are signaling fences from a
work queue, the overhead of another work item is minimal.
> following the job_done() event, which also requires a deferred work to
> check progress on the queue anyway. And if you move the entirety of the
Yes, I see Panthor signals fences from a work queue by looking at the
seqnos, but again, in many drivers this flow is IRQ-driven for fence
signaling latency reasons.
> job cleanup to job_release() instead of doing part of it in
> drm_dep_job_fini(), it makes for simpler design, where jobs are just
> cleaned up when their refcnt drops to zero.
>
> IMHO, that's exactly the kind of premature optimization that led us to
> where we are with drm_sched: we think we need the optimization so we
> add the complexity upfront without actual numbers to back this
> theory (like, real GPU workloads to lead to actual differences in term
> on power consumption, speed, ...), and the complexity just piles up as
> you keep adding more and more of those flags.
>
Fair enough. Let me measure the CPU utilization to get some data here.
It’s not a huge deal to drop this—as I said, it’s a minimal change.
> >
> > > just defer the cleanup (AKA the stuff you currently have in _release())
> > > so everything is always cleaned up in a thread context. Yes, there's
> > > scheduling overhead and extra latency, but it's also simpler, because
> > > there's just one path. So, if the latency and the overhead is not
> >
> > This isn’t bolted on—it’s a built-in feature throughout. I can assure
> > you that either mode works. I’ll likely add a debug Kconfig option to Xe
> > that toggles DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE on each queue creation
> > for CI runs, to ensure both paths work reliably and receive continuous
> > testing.
>
> I'm not claiming this doesn't work, I just want to make sure we're not
> taking the same path drm_sched took with those premature optimizations.
> If you have numbers that prove the extra power-consumption or the fact
> the extra latency makes a difference in practice, that's a different
> story, but otherwise, I still think it's preferable to start with a
+1, will drop this unless I have solid numbers to back this up.
> smaller scope/simpler design, and add optimized cleanup path when we
> have a proof it makes a difference.
>
> >
> > > proven to be a problem (and it rarely is for cleanup operations), I'm
> > > still convinced this makes for an easier design to just defer the
> > > cleanup all the time.
> > >
> > > > If we dropped that, then yes,
> > > > this could change. The reason for the if statement is that a user
> > > > building a job may need to abort prior to calling arm() (e.g., a memory
> > > > allocation fails) via drm_dep_job_put().
> > >
> > > But even in that context, it could still be deferred and work just
> > > fine, no?
> > >
> >
> > A work item context switch is thousands, if not tens of thousands, of
> > cycles.
>
> We won't have a full context switch just caused by the cleanup in the
> normal execution case though. The context switch, you already have it
> to check progress on the job queue anyway, so adding an extra job
> cleanup is pretty cheap at this point. I'm not saying free either,
> there's still the extra work insertion, the dequeuing, etc.
>
See above. I believe this really changes depending on whether a driver
signals fences from IRQ context or from a workqueue.
> > If your job release is only ~20 instructions, this is a massive
> > imbalance and an overall huge waste. Jobs are lightweight objects—they
> > should really be thought of as an extension of fences. Fence release
> > must be IRQ-safe per the documentation, so it follows that jobs can opt
> > in to the same release rules.
> >
> > In contrast, queues are heavyweight objects, typically with associated
> > memory that also needs to be released. Here, a work item absolutely
> > makes sense—hence the design in DRM dep.
>
> And that's kinda my point: a job being reported as done will cause a
> work item to be scheduled to check progress on the queue, so you're
> already paying the price of a context switch anyway. At this point, all
> you'll gain by fast-tracking the job cleanup and allowing for IRQ-safe
> cleanups is just latency. If you have numbers/workloads saying
> otherwise, I'm fine reconsidering the extra complexity, but I'd like to see
> those first.
>
Yep, agree on getting data here.
> >
> > > >
> > > > Once arm() is called there is a guarantee the run_job path is called,
> > > > either via bypass or the run_job work item.
> > >
> > > Sure.
> > >
> >
> > Let’s not gloss over this—this is actually a huge difference from DRM
> > sched. One of the biggest problems I found with DRM sched is that if you
> > call arm(), run_job() may or may not be called. Without this guarantee,
> > you can’t do driver-side bookkeeping in arm() that is later released in
> > run_job(), which would otherwise simplify the driver design.
>
> You can do driver-side book-keeping after all jobs have been
> successfully initialized, which include arming their fences. The key
> turning point is when you start exposing those armed fences, not
> when you arm them. See below.
>
There is still the seqno critical section, which starts at arm() and is
closed at push() or when the fence is dropped.
> >
> > In Xe, we artificially enforce this rule through our own usage of DRM
> > sched, but in DRM dep this is now an API-level contract. That allows
> > drivers to embrace this semantic directly, simplifying their designs.
> >
>
> [...]
>
> > > >
> > > > >
> > > > > > +}
> > > > > > +EXPORT_SYMBOL(drm_dep_job_arm);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_dep_job_push() - submit a job to its queue for execution
> > > > > > + * @job: dep job to push
> > > > > > + *
> > > > > > + * Submits @job to the queue it was initialised with. Must be called after
> > > > > > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > > > > > + * held until the queue is fully done with it. The reference is released
> > > > > > + * directly in the finished-fence dma_fence callback for queues with
> > > > > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > > > > > + * from hardirq context), or via the put_job work item on the submit
> > > > > > + * workqueue otherwise.
> > > > > > + *
> > > > > > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > > > > > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > > > > > + * enforces the pairing.
> > > > > > + *
> > > > > > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > > > > > + * @job exactly once, even if the queue is killed or torn down before the
> > > > > > + * job reaches the head of the queue. Drivers can use this guarantee to
> > > > > > + * perform bookkeeping cleanup; the actual backend operation should be
> > > > > > + * skipped when drm_dep_queue_is_killed() returns true.
> > > > > > + *
> > > > > > + * If the queue does not support the bypass path, the job is pushed directly
> > > > > > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > > > > > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > > > > > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > > > > > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > > > > > + *
> > > > > > + * Warns if the job has not been armed.
> > > > > > + *
> > > > > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > > > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > > > > + * path.
> > > > > > + */
> > > > > > +void drm_dep_job_push(struct drm_dep_job *job)
> > > > > > +{
> > > > > > + struct drm_dep_queue *q = job->q;
> > > > > > +
> > > > > > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > > > > > +
> > > > > > + drm_dep_job_get(job);
> > > > > > +
> > > > > > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > > > > > + drm_dep_queue_push_job(q, job);
> > > > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > > > + drm_dep_queue_push_job_end(job->q);
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > + scoped_guard(mutex, &q->sched.lock) {
> > > > > > + if (drm_dep_queue_can_job_bypass(q, job))
> > > > > > + drm_dep_queue_run_job(q, job);
> > > > > > + else
> > > > > > + drm_dep_queue_push_job(q, job);
> > > > > > + }
> > > > > > +
> > > > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > > > + drm_dep_queue_push_job_end(job->q);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL(drm_dep_job_push);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > > > > > + * @job: dep job to add the dependencies to
> > > > > > + * @fence: the dma_fence to add to the list of dependencies, or
> > > > > > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > > > > > + *
> > > > > > + * Note that @fence is consumed in both the success and error cases (except
> > > > > > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > > > > > + *
> > > > > > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > > > > > + * fence->context matches the queue's finished fence context) are silently
> > > > > > + * dropped; the job need not wait on its own queue's output.
> > > > > > + *
> > > > > > + * Warns if the job has already been armed (dependencies must be added before
> > > > > > + * drm_dep_job_arm()).
> > > > > > + *
> > > > > > + * **Pre-allocation pattern**
> > > > > > + *
> > > > > > + * When multiple jobs across different queues must be prepared and submitted
> > > > > > + * together in a single atomic commit — for example, where job A's finished
> > > > > > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > > > > > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > > > > + * region. Once that region has started no memory allocation is permitted.
> > > > > > + *
> > > > > > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > > > > > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > > > > > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > > > > > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > > > > > + * always index 0 when the dependency array is empty, as Xe relies on).
> > > > > > + * After all jobs have been armed and the finished fences are available, call
> > > > > > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > > > > > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > > > > > + * called from atomic or signalling context.
> > > > > > + *
> > > > > > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > > > > > + * ensuring a slot is always allocated even when the real fence is not yet
> > > > > > + * known.
> > > > > > + *
> > > > > > + * **Example: bind job feeding TLB invalidation jobs**
> > > > > > + *
> > > > > > + * Consider a GPU with separate queues for page-table bind operations and for
> > > > > > + * TLB invalidation. A single atomic commit must:
> > > > > > + *
> > > > > > + * 1. Run a bind job that modifies page tables.
> > > > > > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > > > > > + * completing, so stale translations are flushed before the engines
> > > > > > + * continue.
> > > > > > + *
> > > > > > + * Because all jobs must be armed and pushed inside a signalling region (where
> > > > > > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > > > > > + *
> > > > > > + * // Phase 1 — process context, GFP_KERNEL allowed
> > > > > > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > > > > > + * for_each_mmu(mmu) {
> > > > > > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > > > > > + * // Pre-allocate slot at index 0; real fence not available yet
> > > > > > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > > > > > + * }
> > > > > > + *
> > > > > > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > > > > > + * dma_fence_begin_signalling();
> > > > > > + * drm_dep_job_arm(bind_job);
> > > > > > + * for_each_mmu(mmu) {
> > > > > > + * // Swap sentinel for bind job's finished fence
> > > > > > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > > > > > + * dma_fence_get(bind_job->finished));
> > > > >
> > > > > Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
> > > > > submit path. If we were to add it, it would be around the
> > > > > panthor_submit_ctx_push_jobs() call, which might seem broken. In
> > > >
> > > > Yes, I noticed that. I put an XXX comment in my port [1] around this.
> > > >
> > > > [1] https://patchwork.freedesktop.org/patch/711952/?series=163245&rev=1
> > > >
> > > > > practice I don't think it is because we don't expose fences to the
> > > > > outside world until all jobs have been pushed. So what happens is that
> > > > > a job depending on a previous job in the same batch-submit has the
> > > > > armed-but-not-yet-pushed fence in its deps, and that's the only place
> > > > > where this fence is present. If something fails on a subsequent job
> > > > > preparation in the next batch submit, the rollback logic will just drop
> > > > > the jobs on the floor, and release the armed-but-not-pushed-fence,
> > > > > meaning we're not leaking a fence that will never be signalled. I'm in
> > > > > no way saying this design is sane, just trying to explain why it's
> > > > > currently safe and works fine.
> > > >
> > > > Yep, I think it would be better to have no failure points between arm and
> > > > push, which again I do my best to enforce via lockdep/warnings.
> > >
> > > I'm still not entirely convinced by that. To me _arm() is not quite the
> > > moment you make your fence public, and I'm not sure the extra complexity
> > > added for intra-batch dependencies (one job in a SUBMIT depending on a
> > > previous job in the same SUBMIT) is justified, because what really
> > > matters is not that we leave dangling/unsignalled dma_fence objects
> > > around, the problem is when you do so on an object that has been
> > > exposed publicly (syncobj, dma_resv, sync_file, ...).
> > >
> >
> > Let me give you an example of why a failure between arm() and push() is
> > a huge problem:
> >
> > arm()
> > dma_resv_install(fence_from_arm)
> > fail
>
> That's not what Panthor does. What we do is:
>
That's good, as that would be a bug; I was just using it as an example of
a possible hazard.
> for_each_job_in_batch() {
> ret = faillible_stuf()
> if (ret)
> goto rollback;
>
> arm(job)
> }
>
> // Nothing can fail after this point
>
> for_each_job_in_batch() {
> update_resvs(job->done_fence);
> push(job)
> }
>
> update_submit_syncobjs();
>
> As you can see, an armed job doesn't mean the job fence is public, it
> only becomes public after we've updated the resv of the BOs that might
> be touched by this job.
>
> >
> > How does one unwind this? Signal the fence from arm()?
>
> That, or we just ignore the fact it's not been signalled. If the job
> that created the fence has never been submitted, and the fence has
> vanished before hitting any public container, it doesn't matter.
>
Ah, this is actually a source of my confusion. I thought the dma-fence
API would complain if you made a fence disappear before it was signaled,
but it looks like it only complains when the fence is unsignaled and
callbacks are attached, i.e. once it has been made public.
> > What if the fence
> > from arm() is on a timeline currently being used by the device? The
> > memory can move, and the device then can corrupt memory.
>
> What? No, the seqno is just consumed, but there's nothing attached to
> it, the previous job on this timeline (N-1) is still valid, and the next
> one will have a seqno of N+1, which will force an implicit dep on N-1
> on the same timeline. That's all.
>
Ok.
> >
> > In my opinion, it’s best and safest to enforce a no-failure policy
> > between arm() and push().
>
> I don't think it's safer, it's just the semantics that have been
> defined by drm_sched/dma_fence and that we keep forcing ourselves
> into. I'd rather have a well defined dma_fence state that says "that's
> it, I'm exposed, you have to signal me now", than this half-enforced
> arm()+push() model.
>
So what is the suggestion here — move the asserts I have from arm() to
something like begin_push()? We could add a dma-fence state toggle there
as well if we can get that part merged into dma-fence. Or should we just
drop the asserts/lockdep checks between arm() and push() completely? I’m
open to either approach here.
> >
> > FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
> > which had the exact issue I described above. I pointed it out and got a
> > reply saying, “well, this is an API issue, right?”—and they were
> > correct, it is an API issue.
> >
> > > >
> > > > >
> > > > > In general, I wonder if we should distinguish between "armed" and
> > > > > "publicly exposed" to help deal with this intra-batch dep thing without
> > > > > resorting to reservation and other tricks like that.
> > > > >
> > > >
> > > > I'm not exactly sure what you're suggesting, but I'm always open to ideas.
> > >
> > > Right now _arm() is what does the dma_fence_init(). But there's an
> > > extra step between initializing the fence object and making it
> > > visible to the outside world. In order for the dep to be added to the
> > > job, you need the fence to be initialized, but that's not quite
> > > external visibility, because the job is still very much a driver
> > > object, and if something fails, the rollback mechanism makes it so all
> > > the deps are dropped on the floor along the job that's being destroyed.
> > > So we won't really wait on this fence that's never going to be
> > > signalled.
> > >
> > > I see what's appealing in pretending that _arm() == externally-visible,
> > > but it's also forcing us to do extra pre-alloc (or other pre-init)
> > > operations that would otherwise not be required in the submit path. Not
> > > a hill I'm willing to die on, but I just thought I'd mention the fact I
> > > find it weird that we put extra constraints on ourselves that are not
> > > strictly needed, just because we fail to properly flag the dma_fence
> > > visibility transitions.
> >
> > See the dma-resv example above. I’m not willing to die on this hill
> > either, but again, in my opinion, for safety and as an API-level
> > contract, enforcing arm() as a no-failure point makes sense. It prevents
> > drivers from doing anything dangerous like the dma-resv example, which
> > is an extremely subtle bug.
>
> That's a valid point, but you're not really enforcing things at
> compile/run-time it's just "don't do this/that" in the docs. If you
> encode the is_active() state at the dma_fence level, properly change
> the fence state anytime it's about to be added to a public container,
> and make it so an active fence that's released without being signalled
> triggers a WARN_ON(), you've achieved more. Once you've done that, you
> can also relax the rule that says that "an armed fence has to be
> signalled" to "a fence that's active has to be signalled". With this,
> the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is
> no longer required, because you would be able to store inactive fences
I wouldn’t go that far or say it’s that simple. This would require a
fairly large refactor of Xe’s VM bind pipeline to call arm() earlier,
and I’m not even sure it would be possible. Between arm() and push(),
the seqno critical section still remains and requires locking; in
particular, the tricky case is kernel binds (e.g., page fault handling),
which use the same queue. Multiple threads can issue kernel binds
concurrently, as our page fault handler is multi-threaded, similar
to the CPU page fault handler, so the critical section between arm() and
push() sits very late in the pipeline, tightly protected by a lock.
> there, as long as they become active before the job is pushed.
>
> >
> > >
> > > On the rust side it would be directly described through the type
> > > system (see the Visibility attribute in Daniel's branch[1]). On C side,
> > > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> > > name you want to give it). Any operation pushing the fence to public
> > > container (dma_resv, syncobj, sync_file, ...) would be rejected when
> > > that flag is set. At _push() time, we'd clear that flag with a
> > > dma_fence_set_active() helper, which would reflect the fact the fence
> > > can now be observed and exposed to the outside world.
> > >
> >
> > Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE
> > flag. When adding a fence to dma-resv, fences that belong to the same
> > timeline are immediately squashed. A later transition of the fence state
> > completely breaks this behavior.
>
> That's exactly my point: as soon as you want to insert the fence to a
> public container, you have to make it "active", so it will never be
> rolled back to the previous entry in the resv. Similarly, a
> wait/add_callback() on an inactive fence should be rejected.
>
This is a bit bigger dma-fence / treewide-level change, but in general I
believe it is a good idea.
Matt
> >
> > 287 void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
> > 288 enum dma_resv_usage usage)
> > 289 {
> > 290 struct dma_resv_list *fobj;
> > 291 struct dma_fence *old;
> > 292 unsigned int i, count;
> > 293
> > 294 dma_fence_get(fence);
> > 295
> > 296 dma_resv_assert_held(obj);
> > 297
> > 298 /* Drivers should not add containers here, instead add each fence
> > 299 * individually.
> > 300 */
> > 301 WARN_ON(dma_fence_is_container(fence));
> > 302
> > 303 fobj = dma_resv_fences_list(obj);
> > 304 count = fobj->num_fences;
> > 305
> > 306 for (i = 0; i < count; ++i) {
> > 307 enum dma_resv_usage old_usage;
> > 308
> > 309 dma_resv_list_entry(fobj, i, obj, &old, &old_usage);
> > 310 if ((old->context == fence->context && old_usage >= usage &&
> > 311 dma_fence_is_later_or_same(fence, old)) ||
> > 312 dma_fence_is_signaled(old)) {
> > 313 dma_resv_list_set(fobj, i, fence, usage);
> > 314 dma_fence_put(old);
> > 315 return;
> > 316 }
> > 317 }
> >
> > I imagine syncobjs have similar squashing, but I don't know that offhand.
>
> Same goes for syncobjs.
>
> Regards,
>
> Boris
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 10:06 ` Boris Brezillon
@ 2026-03-23 17:11 ` Matthew Brost
0 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-23 17:11 UTC (permalink / raw)
To: Boris Brezillon
Cc: Daniel Almeida, intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel,
Sami Tolvanen, Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone,
Alexandre Courbot, John Hubbard, shashanks, jajones,
Eliot Courtney, Joel Fernandes, rust-for-linux
On Mon, Mar 23, 2026 at 11:06:13AM +0100, Boris Brezillon wrote:
> On Mon, 23 Mar 2026 00:58:51 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > > It's not the refcount model I'm complaining about, it's the "part of it
> > > > is always freed immediately, part of it is deferred, but not always ..."
> > > > that happens in drm_dep_job_release() I'm questioning. I'd really
> > > > prefer something like:
> > > >
> > >
> > > You are completely missing the point here.
> > >
> >
> > Let me rephrase this — I realize this may come across as rude, which is
> > not my intent.
>
> No offense taken ;-).
>
> > I believe there is simply a disconnect in understanding
> > the constraints.
> >
> > In my example below, the job release completes within bounded time
> > constraints, which makes it suitable for direct release in IRQ context,
> > bypassing the need for a work item that would otherwise incur a costly
> > CPU context switch.
>
> In the other thread, I've explained in more details why I think
> deferred cleanup of jobs is not as bad as you make it sound (context
> switch amortized by the fact it's already there for queue progress
> checking). But let's assume it is, I'd prefer a model where we say
> "ops->job_release() has to be IRQ-safe" and have implementations defer
> their cleanup if they have to, than this mixed approach with a flag. Of
> course, I'd still like to have numbers proving that this job cleanup
> deferral actually makes a difference in practice :P.
Yes, I replied there that I will either drop this or have solid numbers
showing the CPU utilization makes this worthwhile.
Matt
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-17 19:41 ` Miguel Ojeda
@ 2026-03-23 17:31 ` Matthew Brost
2026-03-23 17:42 ` Miguel Ojeda
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-23 17:31 UTC (permalink / raw)
To: Miguel Ojeda
Cc: Daniel Almeida, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux
On Tue, Mar 17, 2026 at 08:41:24PM +0100, Miguel Ojeda wrote:
> On Tue, Mar 17, 2026 at 9:27 AM Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > I hate cut-offs in threads.
> >
> > I get it — you’re a Rust zealot.
>
> Cut off? Zealot?
>
I apologize here; I shouldn't type when I get annoyed. This is the 2nd
comment pointing out differences between C and Rust, which really
wasn't the direction I was hoping this thread would take.
> Look, I got the email in my inbox, so I skimmed it to understand why I
> got it and why the Rust list was Cc'd. I happened to notice your
> (quite surprising) claims about Rust, so I decided to reply to a
> couple of those, since I proposed Rust for the kernel.
>
Again, my mistake.
> How is that a cut off and how does that make a maintainer a zealot?
>
> Anyway, my understanding is that we agreed that the cleanup attribute
> in C doesn't enforce much of anything. We also agreed that it is
> important to think about ownership and lifetimes and to enforce the
> rules and to be disciplined. All good so far.
>
> Now, what I said is simply that Rust fundamentally improves the
> situation -- C "RAII" not doing so is not comparable. For instance,
> that statically enforcing things is a meaningful improvement over
> runtime approaches (which generally require to trigger an issue, and
> which in some cases are not suitable for production settings).
>
I agree the static checking in Rust is a very nice feature.
> Really, I just said Rust would help with things you already stated you
> care about. And nobody claims "Rust solves everything" as you stated.
> So I don't see zealots here, and insulting others doesn't help your
> argument.
I know, I apologize.
Matt
>
> Cheers,
> Miguel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 17:31 ` Matthew Brost
@ 2026-03-23 17:42 ` Miguel Ojeda
0 siblings, 0 replies; 50+ messages in thread
From: Miguel Ojeda @ 2026-03-23 17:42 UTC (permalink / raw)
To: Matthew Brost
Cc: Daniel Almeida, intel-xe, dri-devel, Boris Brezillon,
Tvrtko Ursulin, Rodrigo Vivi, Thomas Hellström,
Christian König, Danilo Krummrich, David Airlie,
Maarten Lankhorst, Maxime Ripard, Philipp Stanner, Simona Vetter,
Sumit Semwal, Thomas Zimmermann, linux-kernel, Sami Tolvanen,
Jeffrey Vander Stoep, Alice Ryhl, Daniel Stone, Alexandre Courbot,
John Hubbard, shashanks, jajones, Eliot Courtney, Joel Fernandes,
rust-for-linux
On Mon, Mar 23, 2026 at 6:31 PM Matthew Brost <matthew.brost@intel.com> wrote:
>
> I apologize here; I shouldn't type when I get annoyed. This is the 2nd
> comment pointing out differences between C and Rust, which really
> wasn't the direction I was hoping this thread would take.
No worries, it happens to everyone from time to time.
Thanks!
Cheers,
Miguel
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 17:08 ` Matthew Brost
@ 2026-03-23 18:38 ` Matthew Brost
2026-03-24 9:23 ` Boris Brezillon
2026-03-24 8:49 ` Boris Brezillon
1 sibling, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-23 18:38 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon, Mar 23, 2026 at 10:08:53AM -0700, Matthew Brost wrote:
> On Mon, Mar 23, 2026 at 10:55:04AM +0100, Boris Brezillon wrote:
> > Hi Matthew,
> >
> > On Sun, 22 Mar 2026 21:50:07 -0700
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > > > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..2d012b29a5fc
> > > > > > > --- /dev/null
> > > > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > @@ -0,0 +1,675 @@
> > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > +/*
> > > > > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > > > > + *
> > > > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > > > + * to deal in the Software without restriction, including without limitation
> > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > > + *
> > > > > > > + * The above copyright notice and this permission notice shall be included in
> > > > > > > + * all copies or substantial portions of the Software.
> > > > > > > + *
> > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > > > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > > > > + *
> > > > > > > + * Copyright © 2026 Intel Corporation
> > > > > > > + */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * DOC: DRM dependency job
> > > > > > > + *
> > > > > > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > > > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > > > > + *
> > > > > > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > > > > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > > > > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > > > > > + * kref reference and a reference to its queue.
> > > > > > > + *
> > > > > > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > > > > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > > > > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > > > > > + * that must be signalled before the job can run. Duplicate fences from the
> > > > > > > + * same fence context are deduplicated automatically.
> > > > > > > + *
> > > > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > > > > > + * consuming a sequence number from the queue. After arming,
> > > > > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > > > > + * userspace or used as a dependency by other jobs.
> > > > > > > + *
> > > > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > > > > > + * queue takes a reference that it holds until the job's finished fence
> > > > > > > + * signals and the job is freed by the put_job worker.
> > > > > > > + *
> > > > > > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > > > > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > > > > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > > > > > + *
> > > > > > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > > > > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > > > > > + * objects before the driver's release callback is invoked.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <linux/dma-resv.h>
> > > > > > > +#include <linux/kref.h>
> > > > > > > +#include <linux/slab.h>
> > > > > > > +#include <drm/drm_dep.h>
> > > > > > > +#include <drm/drm_file.h>
> > > > > > > +#include <drm/drm_gem.h>
> > > > > > > +#include <drm/drm_syncobj.h>
> > > > > > > +#include "drm_dep_fence.h"
> > > > > > > +#include "drm_dep_job.h"
> > > > > > > +#include "drm_dep_queue.h"
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_init() - initialise a dep job
> > > > > > > + * @job: dep job to initialise
> > > > > > > + * @args: initialisation arguments
> > > > > > > + *
> > > > > > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > > > > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > > > > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > > > > > + * job reference is dropped.
> > > > > > > + *
> > > > > > > + * Resources are released automatically when the last reference is dropped
> > > > > > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > > > > > + * must not free the job directly.
> > > > > > > + *
> > > > > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > > > > + * -%ENOMEM on fence allocation failure.
> > > > > > > + */
> > > > > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > > > > + const struct drm_dep_job_init_args *args)
> > > > > > > +{
> > > > > > > + if (unlikely(!args->credits)) {
> > > > > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > > > > + return -EINVAL;
> > > > > > > + }
> > > > > > > +
> > > > > > > + memset(job, 0, sizeof(*job));
> > > > > > > +
> > > > > > > + job->dfence = drm_dep_fence_alloc();
> > > > > > > + if (!job->dfence)
> > > > > > > + return -ENOMEM;
> > > > > > > +
> > > > > > > + job->ops = args->ops;
> > > > > > > + job->q = drm_dep_queue_get(args->q);
> > > > > > > + job->credits = args->credits;
> > > > > > > +
> > > > > > > + kref_init(&job->refcount);
> > > > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > > > > > + * @job: dep job whose dependency xarray to drain
> > > > > > > + *
> > > > > > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > > > > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > > > > > + * skipped; the sentinel carries no reference. Called from
> > > > > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > > > > > + * dependencies here — while still in process context — avoids calling
> > > > > > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > > > > > + * dropped from a dma_fence callback.
> > > > > > > + *
> > > > > > > + * Context: Process context.
> > > > > > > + */
> > > > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + struct dma_fence *fence;
> > > > > > > + unsigned long index;
> > > > > > > +
> > > > > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > > > > + continue;
> > > > > > > + dma_fence_put(fence);
> > > > > > > + }
> > > > > > > + xa_destroy(&job->dependencies);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_fini() - clean up a dep job
> > > > > > > + * @job: dep job to clean up
> > > > > > > + *
> > > > > > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > > > > > + *
> > > > > > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > > > > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > > > > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > > > > > + * context immediately after run_job(), so it is left untouched to avoid
> > > > > > > + * calling xa_destroy() from IRQ context.
> > > > > > > + *
> > > > > > > + * Warns if @job is still linked on the queue's pending list, which would
> > > > > > > + * indicate a bug in the teardown ordering.
> > > > > > > + *
> > > > > > > + * Context: Any context.
> > > > > > > + */
> > > > > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > > > > +
> > > > > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > > > > +
> > > > > > > + drm_dep_fence_cleanup(job->dfence);
> > > > > > > + job->dfence = NULL;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Armed jobs have their dependencies drained by
> > > > > > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > > > > >
> > > > > > Just want to clear the confusion and make sure I get this right at the
> > > > > > same time. To me, "process context" means a user thread entering some
> > > > > > syscall(). What you call "process context" is more a "thread context" to
> > > > > > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > > > > > worker thread to be accurate) that executes the drop_deps() after a
> > > > > > run_job().
> > > > >
> > > > > Some of the context comments likely could be cleaned up. 'Process context'
> > > > > here means either user context (bypass path) or the run_job work item.
> > > > >
> > > > > >
> > > > > > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > > > > + */
> > > > > > > + if (!armed)
> > > > > > > + drm_dep_job_drop_dependencies(job);
> > > > > >
> > > > > > Why do we need to make a difference here? Can't we just assume that the
> > > > > > whole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > > > > work item embedded in the job to defer its destruction when _put() is
> > > > > > called in a context where the destruction is not allowed?
> > > > > >
> > > > >
> > > > > We already touched on this, but the design currently allows the last job
> > > > > put from dma-fence signaling path (IRQ).
> > > >
> > > > It's not much about the last _put and more about what happens in the
> > > > _release() you pass to kref_put(). My point being, if you assume
> > > > something in _release() is not safe to be done in an atomic context,
> > > > and _put() is assumed to be called from any context, you might as well
> > >
> > > No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
> > > put (including release) is IRQ-safe. If the documentation isn’t clear, I
> > > can clean that up. Some of my comments here [1] try to explain this
> > > further.
> > >
> > > Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
> > > dma-fence whose release must be IRQ-safe, so there is precedent for
> > > this. I didn’t want to unilaterally require that all job releases be
> > > IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
> > > the flag.
> > >
> > > The difference between non-IRQ-safe and IRQ-safe job release is only
> > > about 12 lines of code.
> >
> > It's not just about the number of lines of code added to the core to
> > deal with that case, but also complexity of the API that results from
> > these various modes.
> >
>
> Fair enough.
>
> > > I figured that if we’re going to invest the time
> > > and effort to replace DRM sched, we should aim for the best possible
> > > implementation. Any driver can opt in here and immediately get lower CPU
> > > utilization and power savings. I will try to figure out how to measure
> > > this and get some numbers here.
> >
> > That's key here. My gut feeling is that we have so much deferred
> > already that adding one more work to the workqueue is not going to
> > hurt in term of scheduling overhead (no context switch if it's
> > scheduled on the same workqueue). Job cleanup is just the phase
>
> Signaling of fences in many drivers occurs in hard IRQ context rather
> than in a work queue. I agree that if you are signaling fences from a
> work queue, the overhead of another work item is minimal.
>
> > following the job_done() event, which also requires a deferred work to
> > check progress on the queue anyway. And if you move the entirety of the
>
> Yes, I see Panthor signals fences from a work queue by looking at the
> seqnos, but again, in many drivers this flow is IRQ-driven for fence
> signaling latency reasons.
>
> > job cleanup to job_release() instead of doing part of it in
> > drm_dep_job_fini(), it makes for simpler design, where jobs are just
> > cleaned up when their refcnt drops to zero.
> >
> > IMHO, that's exactly the kind of premature optimization that led us to
> > where we are with drm_sched: we think we need the optimization so we
> > add the complexity upfront without actual numbers to back this
> > theory (like, real GPU workloads to lead to actual differences in term
> > on power consumption, speed, ...), and the complexity just piles up as
> > you keep adding more and more of those flags.
> >
>
> Fair enough. Let me measure the CPU utilization to get some data here.
> It’s not a huge deal to drop this—as I said, it’s a minimal change.
>
> > >
> > > > just defer the cleanup (AKA the stuff you currently have in _release())
> > > > so everything is always cleaned up in a thread context. Yes, there's
> > > > scheduling overhead and extra latency, but it's also simpler, because
> > > > there's just one path. So, if the latency and the overhead is not
> > >
> > > This isn’t bolted on—it’s a built-in feature throughout. I can assure
> > > you that either mode works. I’ll likely add a debug Kconfig option to Xe
> > > that toggles DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE on each queue creation
> > > for CI runs, to ensure both paths work reliably and receive continuous
> > > testing.
> >
> > I'm not claiming this doesn't work, I just want to make sure we're not
> > taking the same path drm_sched took with those premature optimizations.
> > If you have numbers proving that the extra power consumption or the
> > extra latency makes a difference in practice, that's a different
> > story, but otherwise, I still think it's preferable to start with a
>
> +1, will drop this unless I have solid numbers to back this up.
>
Ok, getting stats is easier than I thought...
./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
This test creates one thread per engine instance (7 instances on this BMG
device) and submits 1k exec IOCTLs per thread, each performing a DW
write. Each exec IOCTL typically does not have unsignaled input dependencies.
With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
8,449 context-switches
412 cpu-migrations
2,531.43 msec task-clock
1,847,846,588 cpu_atom/cycles/
1,847,856,947 cpu_core/cycles/
<not supported> cpu_atom/instructions/
460,744,020 cpu_core/instructions/
With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
8,655 context-switches
229 cpu-migrations
2,571.33 msec task-clock
855,900,607 cpu_atom/cycles/
855,900,272 cpu_core/cycles/
<not supported> cpu_atom/instructions/
403,651,469 cpu_core/instructions/
With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
5,361 context-switches
169 cpu-migrations
2,577.44 msec task-clock
685,769,153 cpu_atom/cycles/
685,768,407 cpu_core/cycles/
<not supported> cpu_atom/instructions/
321,336,297 cpu_core/instructions/
Yes, this is a very synthetic test case and the results are a bit noisy,
but it seems to point to this entire design being worthwhile. I'd really
like to get customer input (e.g., Google androidOS) here too, but that is
completely out of my control at this point in time.
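For a quick back-of-the-envelope read on the cpu_core cycle counts above
(single perf runs, noisy numbers, so treat the percentages as ballpark only),
the relative savings work out to roughly:

```python
# Rough post-processing of the cpu_core cycle counts quoted above.
# Single perf runs, noisy data -- percentages are ballpark only.
baseline_cycles = 1_847_846_588  # drm_dep_queue_flags = 0
bypass_cycles   =   855_900_607  # BYPASS_SUPPORTED
irq_put_cycles  =   685_769_153  # BYPASS_SUPPORTED | JOB_PUT_IRQ_SAFE

def savings(old, new):
    """Fractional cycle reduction going from 'old' to 'new'."""
    return 1.0 - new / old

print(f"bypass vs none:    {savings(baseline_cycles, bypass_cycles):.1%}")
print(f"irq put vs bypass: {savings(bypass_cycles, irq_put_cycles):.1%}")
print(f"irq put vs none:   {savings(baseline_cycles, irq_put_cycles):.1%}")
```

i.e. roughly half the cycles saved by the bypass path alone, and another
~20% of the remainder from the IRQ-safe put, on this one workload.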
Matt
> > smaller scope/simpler design, and add optimized cleanup path when we
> > have a proof it makes a difference.
> >
> > >
> > > > proven to be a problem (and it rarely is for cleanup operations), I'm
> > > > still convinced this makes for an easier design to just defer the
> > > > cleanup all the time.
> > > >
> > > > > If we dropped that, then yes
> > > > > this could change. The reason for the if statement is that currently a
> > > > > user building a job may need to abort prior to calling arm() (e.g., if a
> > > > > memory allocation fails) via a drm_dep_job_put().
> > > >
> > > > But even in that context, it could still be deferred and work just
> > > > fine, no?
> > > >
> > >
> > > A work item context switch is thousands, if not tens of thousands, of
> > > cycles.
> >
> > We won't have a full context switch just caused by the cleanup in the
> > normal execution case though. The context switch, you already have it
> > to check progress on the job queue anyway, so adding an extra job
> > cleanup is pretty cheap at this point. I'm not saying it's free either;
> > there's still the extra work insertion, the dequeuing, etc.
> >
>
> See above. I believe this really changes depending on whether a driver
> signals fences in an IRQ context or from a work queue.
>
> > > If your job release is only ~20 instructions, this is a massive
> > > imbalance and an overall huge waste. Jobs are lightweight objects—they
> > > should really be thought of as an extension of fences. Fence release
> > > must be IRQ-safe per the documentation, so it follows that jobs can opt
> > > in to the same release rules.
> > >
> > > In contrast, queues are heavyweight objects, typically with associated
> > > memory that also needs to be released. Here, a work item absolutely
> > > makes sense—hence the design in DRM dep.
> >
> > And that's kinda my point: a job being reported as done will cause a
> > work item to be scheduled to check progress on the queue, so you're
> > already paying the price of a context switch anyway. At this point, all
> > you'll gain by fast-tracking the job cleanup and allowing for IRQ-safe
> > cleanups is just latency. If you have numbers/workloads saying
> > otherwise, I'm fine reconsidering the extra complexity, but I'd like to
> > see those first.
> >
>
> Yep, agree on getting data here.
>
> > >
> > > > >
> > > > > Once arm() is called there is a guarantee the run_job path is called
> > > > > either via bypass or the run_job work item.
> > > >
> > > > Sure.
> > > >
> > >
> > > Let’s not gloss over this—this is actually a huge difference from DRM
> > > sched. One of the biggest problems I found with DRM sched is that if you
> > > call arm(), run_job() may or may not be called. Without this guarantee,
> > > you can’t do driver-side bookkeeping in arm() that is later released in
> > > run_job(), which would otherwise simplify the driver design.
> >
> > You can do driver-side book-keeping after all jobs have been
> > successfully initialized, which includes arming their fences. The key
> > turning point is when you start exposing those armed fences, not
> > when you arm them. See below.
> >
>
> There is still the seqno critical section, which starts at arm() and
> closes at push() or when the fence is dropped.
>
> > >
> > > In Xe, we artificially enforce this rule through our own usage of DRM
> > > sched, but in DRM dep this is now an API-level contract. That allows
> > > drivers to embrace this semantic directly, simplifying their designs.
> > >
> >
> > [...]
> >
> > > > >
> > > > > >
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(drm_dep_job_arm);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_push() - submit a job to its queue for execution
> > > > > > > + * @job: dep job to push
> > > > > > > + *
> > > > > > > + * Submits @job to the queue it was initialised with. Must be called after
> > > > > > > + * drm_dep_job_arm(). Acquires a reference on @job on behalf of the queue,
> > > > > > > + * held until the queue is fully done with it. The reference is released
> > > > > > > + * directly in the finished-fence dma_fence callback for queues with
> > > > > > > + * %DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE (where drm_dep_job_done() may run
> > > > > > > + * from hardirq context), or via the put_job work item on the submit
> > > > > > > + * workqueue otherwise.
> > > > > > > + *
> > > > > > > + * Ends the DMA fence signalling path begun by drm_dep_job_arm() via
> > > > > > > + * dma_fence_end_signalling(). This must be paired with arm(); lockdep
> > > > > > > + * enforces the pairing.
> > > > > > > + *
> > > > > > > + * Once pushed, &drm_dep_queue_ops.run_job is guaranteed to be called for
> > > > > > > + * @job exactly once, even if the queue is killed or torn down before the
> > > > > > > + * job reaches the head of the queue. Drivers can use this guarantee to
> > > > > > > + * perform bookkeeping cleanup; the actual backend operation should be
> > > > > > > + * skipped when drm_dep_queue_is_killed() returns true.
> > > > > > > + *
> > > > > > > + * If the queue does not support the bypass path, the job is pushed directly
> > > > > > > + * onto the SPSC submission queue via drm_dep_queue_push_job() without holding
> > > > > > > + * @q->sched.lock. Otherwise, @q->sched.lock is taken and the job is either
> > > > > > > + * run immediately via drm_dep_queue_run_job() if it qualifies for bypass, or
> > > > > > > + * enqueued via drm_dep_queue_push_job() for dispatch by the run_job work item.
> > > > > > > + *
> > > > > > > + * Warns if the job has not been armed.
> > > > > > > + *
> > > > > > > + * Context: Process context if %DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set
> > > > > > > + * (takes @q->sched.lock, a mutex); any context otherwise. DMA fence signaling
> > > > > > > + * path.
> > > > > > > + */
> > > > > > > +void drm_dep_job_push(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + struct drm_dep_queue *q = job->q;
> > > > > > > +
> > > > > > > + WARN_ON(!drm_dep_fence_is_armed(job->dfence));
> > > > > > > +
> > > > > > > + drm_dep_job_get(job);
> > > > > > > +
> > > > > > > + if (!(q->sched.flags & DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED)) {
> > > > > > > + drm_dep_queue_push_job(q, job);
> > > > > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > > > > + drm_dep_queue_push_job_end(job->q);
> > > > > > > + return;
> > > > > > > + }
> > > > > > > +
> > > > > > > + scoped_guard(mutex, &q->sched.lock) {
> > > > > > > + if (drm_dep_queue_can_job_bypass(q, job))
> > > > > > > + drm_dep_queue_run_job(q, job);
> > > > > > > + else
> > > > > > > + drm_dep_queue_push_job(q, job);
> > > > > > > + }
> > > > > > > +
> > > > > > > + dma_fence_end_signalling(job->signalling_cookie);
> > > > > > > + drm_dep_queue_push_job_end(job->q);
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(drm_dep_job_push);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_add_dependency() - adds the fence as a job dependency
> > > > > > > + * @job: dep job to add the dependencies to
> > > > > > > + * @fence: the dma_fence to add to the list of dependencies, or
> > > > > > > + * %DRM_DEP_JOB_FENCE_PREALLOC to reserve a slot for later.
> > > > > > > + *
> > > > > > > + * Note that @fence is consumed in both the success and error cases (except
> > > > > > > + * when @fence is %DRM_DEP_JOB_FENCE_PREALLOC, which carries no reference).
> > > > > > > + *
> > > > > > > + * Signalled fences and fences belonging to the same queue as @job (i.e. where
> > > > > > > + * fence->context matches the queue's finished fence context) are silently
> > > > > > > + * dropped; the job need not wait on its own queue's output.
> > > > > > > + *
> > > > > > > + * Warns if the job has already been armed (dependencies must be added before
> > > > > > > + * drm_dep_job_arm()).
> > > > > > > + *
> > > > > > > + * **Pre-allocation pattern**
> > > > > > > + *
> > > > > > > + * When multiple jobs across different queues must be prepared and submitted
> > > > > > > + * together in a single atomic commit — for example, where job A's finished
> > > > > > > + * fence is an input dependency of job B — all jobs must be armed and pushed
> > > > > > > + * within a single dma_fence_begin_signalling() / dma_fence_end_signalling()
> > > > > > > + * region. Once that region has started no memory allocation is permitted.
> > > > > > > + *
> > > > > > > + * To handle this, pass %DRM_DEP_JOB_FENCE_PREALLOC during the preparation
> > > > > > > + * phase (before arming any job, while GFP_KERNEL allocation is still allowed)
> > > > > > > + * to pre-allocate a slot in @job->dependencies. The slot index assigned by
> > > > > > > + * the underlying xarray must be tracked by the caller separately (e.g. it is
> > > > > > > + * always index 0 when the dependency array is empty, as Xe relies on).
> > > > > > > + * After all jobs have been armed and the finished fences are available, call
> > > > > > > + * drm_dep_job_replace_dependency() with that index and the real fence.
> > > > > > > + * drm_dep_job_replace_dependency() uses GFP_NOWAIT internally and may be
> > > > > > > + * called from atomic or signalling context.
> > > > > > > + *
> > > > > > > + * The sentinel slot is never skipped by the signalled-fence fast-path,
> > > > > > > + * ensuring a slot is always allocated even when the real fence is not yet
> > > > > > > + * known.
> > > > > > > + *
> > > > > > > + * **Example: bind job feeding TLB invalidation jobs**
> > > > > > > + *
> > > > > > > + * Consider a GPU with separate queues for page-table bind operations and for
> > > > > > > + * TLB invalidation. A single atomic commit must:
> > > > > > > + *
> > > > > > > + * 1. Run a bind job that modifies page tables.
> > > > > > > + * 2. Run one TLB-invalidation job per MMU that depends on the bind
> > > > > > > + * completing, so stale translations are flushed before the engines
> > > > > > > + * continue.
> > > > > > > + *
> > > > > > > + * Because all jobs must be armed and pushed inside a signalling region (where
> > > > > > > + * GFP_KERNEL is forbidden), pre-allocate slots before entering the region::
> > > > > > > + *
> > > > > > > + * // Phase 1 — process context, GFP_KERNEL allowed
> > > > > > > + * drm_dep_job_init(bind_job, bind_queue, ops);
> > > > > > > + * for_each_mmu(mmu) {
> > > > > > > + * drm_dep_job_init(tlb_job[mmu], tlb_queue[mmu], ops);
> > > > > > > + * // Pre-allocate slot at index 0; real fence not available yet
> > > > > > > + * drm_dep_job_add_dependency(tlb_job[mmu], DRM_DEP_JOB_FENCE_PREALLOC);
> > > > > > > + * }
> > > > > > > + *
> > > > > > > + * // Phase 2 — inside signalling region, no GFP_KERNEL
> > > > > > > + * dma_fence_begin_signalling();
> > > > > > > + * drm_dep_job_arm(bind_job);
> > > > > > > + * for_each_mmu(mmu) {
> > > > > > > + * // Swap sentinel for bind job's finished fence
> > > > > > > + * drm_dep_job_replace_dependency(tlb_job[mmu], 0,
> > > > > > > + * dma_fence_get(bind_job->finished));
> > > > > >
> > > > > > Just FYI, Panthor doesn't have this {begin,end}_signalling() in the
> > > > > > submit path. If we were to add it, it would be around the
> > > > > > panthor_submit_ctx_push_jobs() call, which might seem broken. In
> > > > >
> > > > > Yes, I noticed that. I put XXX comment in my port [1] around this.
> > > > >
> > > > > [1] https://patchwork.freedesktop.org/patch/711952/?series=163245&rev=1
> > > > >
> > > > > > practice I don't think it is because we don't expose fences to the
> > > > > > outside world until all jobs have been pushed. So what happens is that
> > > > > > a job depending on a previous job in the same batch-submit has the
> > > > > > armed-but-not-yet-pushed fence in its deps, and that's the only place
> > > > > > where this fence is present. If something fails on a subsequent job
> > > > > > preparation in the next batch submit, the rollback logic will just drop
> > > > > > the jobs on the floor, and release the armed-but-not-pushed-fence,
> > > > > > meaning we're not leaking a fence that will never be signalled. I'm in
> > > > > > no way saying this design is sane, just trying to explain why it's
> > > > > > currently safe and works fine.
> > > > >
> > > > > Yep, I think it would be better to have no failure points between arm
> > > > > and push, which again I do my best to enforce via lockdep/warnings.
> > > >
> > > > I'm still not entirely convinced by that. To me _arm() is not quite the
> > > > moment you make your fence public, and I'm not sure the extra complexity
> > > > added for intra-batch dependencies (one job in a SUBMIT depending on a
> > > > previous job in the same SUBMIT) is justified, because what really
> > > > matters is not that we leave dangling/unsignalled dma_fence objects
> > > > around, the problem is when you do so on an object that has been
> > > > exposed publicly (syncobj, dma_resv, sync_file, ...).
> > > >
> > >
> > > Let me give you an example of why a failure between arm() and push() is
> > > a huge problem:
> > >
> > > arm()
> > > dma_resv_install(fence_from_arm)
> > > fail
> >
> > That's not what Panthor does. What we do is:
> >
>
> That's good, as it would be a bug; I was just using this as an example of
> a possible hazard.
>
> > for_each_job_in_batch() {
> > 	ret = fallible_stuff()
> > if (ret)
> > goto rollback;
> >
> > arm(job)
> > }
> >
> > // Nothing can fail after this point
> >
> > for_each_job_in_batch() {
> > update_resvs(job->done_fence);
> > push(job)
> > }
> >
> > update_submit_syncobjs();
> >
> > As you can see, an armed job doesn't mean the job fence is public, it
> > only becomes public after we've updated the resv of the BOs that might
> > be touched by this job.
> >
> > >
> > > How does one unwind this? Signal the fence from arm()?
> >
> > That, or we just ignore the fact it's not been signalled. If the job
> > that created the fence has never been submitted, and the fence has
> > vanished before hitting any public container, it doesn't matter.
> >
>
> Ah, this is actually a source of my confusion. I thought the dma-fence
> API would complain if you made a fence disappear before it was signaled,
> but it looks like it only complains when the fence is unsignaled and
> callbacks are attached, i.e. once it has been made public.
>
> > > What if the fence
> > > from arm() is on a timeline currently being used by the device? The
> > > memory can move, and the device then can corrupt memory.
> >
> > What? No, the seqno is just consumed, but there's nothing attached to
> > it, the previous job on this timeline (N-1) is still valid, and the next
> > one will have a seqno of N+1, which will force an implicit dep on N-1
> > on the same timeline. That's all.
> >
>
> Ok.
>
> > >
> > > In my opinion, it’s best and safest to enforce a no-failure policy
> > > between arm() and push().
> >
> > I don't think it's safer, it's just the semantics that have been
> > defined by drm_sched/dma_fence and that we keep forcing ourselves
> > into. I'd rather have a well defined dma_fence state that says "that's
> > it, I'm exposed, you have to signal me now", than this half-enforced
> > arm()+push() model.
> >
>
> So what is the suggestion here — move the asserts I have from arm() to
> something like begin_push()? We could add a dma-fence state toggle there
> as well if we can get that part merged into dma-fence. Or should we just
> drop the asserts/lockdep checks between arm() and push() completely? I’m
> open to either approach here.
>
> > >
> > > FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
> > > which had the exact issue I described above. I pointed it out and got a
> > > reply saying, “well, this is an API issue, right?”—and they were
> > > correct, it is an API issue.
> > >
> > > > >
> > > > > >
> > > > > > In general, I wonder if we should distinguish between "armed" and
> > > > > > "publicly exposed" to help deal with this intra-batch dep thing without
> > > > > > resorting to reservation and other tricks like that.
> > > > > >
> > > > >
> > > > > I'm not exactly sure what you suggesting but always open to ideas.
> > > >
> > > > Right now _arm() is what does the dma_fence_init(). But there's an
> > > > extra step between initializing the fence object and making it
> > > > visible to the outside world. In order for the dep to be added to the
> > > > job, you need the fence to be initialized, but that's not quite
> > > > external visibility, because the job is still very much a driver
> > > > object, and if something fails, the rollback mechanism makes it so all
> > > > the deps are dropped on the floor along the job that's being destroyed.
> > > > So we won't really wait on this fence that's never going to be
> > > > signalled.
> > > >
> > > > I see what's appealing in pretending that _arm() == externally-visible,
> > > > but it's also forcing us to do extra pre-alloc (or other pre-init)
> > > > operations that would otherwise not be required in the submit path. Not
> > > > a hill I'm willing to die on, but I just thought I'd mention the fact I
> > > > find it weird that we put extra constraints on ourselves that are not
> > > > strictly needed, just because we fail to properly flag the dma_fence
> > > > visibility transitions.
> > >
> > > See the dma-resv example above. I’m not willing to die on this hill
> > > either, but again, in my opinion, for safety and as an API-level
> > > contract, enforcing arm() as a no-failure point makes sense. It prevents
> > > drivers from doing anything dangerous like the dma-resv example, which
> > > is an extremely subtle bug.
> >
> > That's a valid point, but you're not really enforcing things at
> > compile/run-time; it's just "don't do this/that" in the docs. If you
> > encode the is_active() state at the dma_fence level, properly change
> > the fence state anytime it's about to be added to a public container,
> > and make it so an active fence that's released without being signalled
> > triggers a WARN_ON(), you've achieved more. Once you've done that, you
> > can also relax the rule that says that "an armed fence has to be
> > signalled" to "a fence that's active has to be signalled". With this,
> > the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is
> > no longer required, because you would be able to store inactive fences
>
> I wouldn’t go that far or say it’s that simple. This would require a
> fairly large refactor of Xe’s VM bind pipeline to call arm() earlier,
> and I’m not even sure it would be possible. Between arm() and push(),
> the seqno critical section still remains and requires locking; in
> particular, the tricky case is kernel binds (e.g., page fault handling),
> which use the same queue. Multiple threads can issue kernel binds
> concurrently, as our page fault handler is multi-threaded, similar
> to the CPU page fault handler, so the critical section between arm() and
> push() is very late in the pipeline and tightly protected by a lock.
>
> > there, as long as they become active before the job is pushed.
> >
> > >
> > > >
> > > > On the rust side it would be directly described through the type
> > > > system (see the Visibility attribute in Daniel's branch[1]). On C side,
> > > > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> > > > name you want to give it). Any operation pushing the fence to public
> > > > container (dma_resv, syncobj, sync_file, ...) would be rejected when
> > > > that flag is set. At _push() time, we'd clear that flag with a
> > > > dma_fence_set_active() helper, which would reflect the fact the fence
> > > > can now be observed and exposed to the outside world.
> > > >
> > >
> > > Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE
> > > flag. When adding a fence to dma-resv, fences that belong to the same
> > > timeline are immediately squashed. A later transition of the fence state
> > > completely breaks this behavior.
> >
> > That's exactly my point: as soon as you want to insert the fence to a
> > public container, you have to make it "active", so it will never be
> > rolled back to the previous entry in the resv. Similarly, a
> > wait/add_callback() on an inactive fence should be rejected.
> >
>
> This is a bit bigger dma-fence / treewide-level change, but in general I
> believe this is a good idea.
>
> Matt
>
> > >
> > > 287 void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
> > > 288 enum dma_resv_usage usage)
> > > 289 {
> > > 290 struct dma_resv_list *fobj;
> > > 291 struct dma_fence *old;
> > > 292 unsigned int i, count;
> > > 293
> > > 294 dma_fence_get(fence);
> > > 295
> > > 296 dma_resv_assert_held(obj);
> > > 297
> > > 298 /* Drivers should not add containers here, instead add each fence
> > > 299 * individually.
> > > 300 */
> > > 301 WARN_ON(dma_fence_is_container(fence));
> > > 302
> > > 303 fobj = dma_resv_fences_list(obj);
> > > 304 count = fobj->num_fences;
> > > 305
> > > 306 for (i = 0; i < count; ++i) {
> > > 307 enum dma_resv_usage old_usage;
> > > 308
> > > 309 dma_resv_list_entry(fobj, i, obj, &old, &old_usage);
> > > 310 if ((old->context == fence->context && old_usage >= usage &&
> > > 311 dma_fence_is_later_or_same(fence, old)) ||
> > > 312 dma_fence_is_signaled(old)) {
> > > 313 dma_resv_list_set(fobj, i, fence, usage);
> > > 314 dma_fence_put(old);
> > > 315 return;
> > > 316 }
> > > 317 }
> > >
> > > I imagine syncobjs have similar squashing, but I don't know that offhand.
> >
> > Same goes for syncobjs.
> >
> > Regards,
> >
> > Boris
> >
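[Editor's note: the same-timeline "squash" behavior quoted above can be modeled in a few lines of userspace C. All `model_*` names below are stand-ins, not the real dma_resv API; this is only a sketch of the replace-or-append logic, showing why an entry that is later rolled back would break the container (the old fence is already gone once squashed).]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a fence: a timeline (context) and a position on it */
struct model_fence {
	unsigned long long context;
	unsigned long long seqno;
};

#define MODEL_MAX_FENCES 8

struct model_resv {
	struct model_fence *fences[MODEL_MAX_FENCES];
	unsigned int num_fences;
};

/*
 * Mirrors the replace-or-append loop of dma_resv_add_fence(): a fence
 * that is later-or-same on an existing entry's timeline overwrites that
 * entry instead of growing the list.
 */
void model_resv_add_fence(struct model_resv *obj, struct model_fence *fence)
{
	unsigned int i;

	for (i = 0; i < obj->num_fences; i++) {
		struct model_fence *old = obj->fences[i];

		if (old->context == fence->context &&
		    fence->seqno >= old->seqno) {
			obj->fences[i] = fence; /* squash: old entry dropped */
			return;
		}
	}
	assert(obj->num_fences < MODEL_MAX_FENCES);
	obj->fences[obj->num_fences++] = fence;
}
```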
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 17:08 ` Matthew Brost
2026-03-23 18:38 ` Matthew Brost
@ 2026-03-24 8:49 ` Boris Brezillon
2026-03-24 16:51 ` Matthew Brost
1 sibling, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-24 8:49 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon, 23 Mar 2026 10:08:53 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
> On Mon, Mar 23, 2026 at 10:55:04AM +0100, Boris Brezillon wrote:
> > Hi Matthew,
> >
> > On Sun, 22 Mar 2026 21:50:07 -0700
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > > > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..2d012b29a5fc
> > > > > > > --- /dev/null
> > > > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > @@ -0,0 +1,675 @@
> > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > +/*
> > > > > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > > > > + *
> > > > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > > > + * to deal in the Software without restriction, including without limitation
> > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > > + *
> > > > > > > + * The above copyright notice and this permission notice shall be included in
> > > > > > > + * all copies or substantial portions of the Software.
> > > > > > > + *
> > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > > > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > > > > + *
> > > > > > > + * Copyright © 2026 Intel Corporation
> > > > > > > + */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * DOC: DRM dependency job
> > > > > > > + *
> > > > > > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > > > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > > > > + *
> > > > > > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > > > > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > > > > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > > > > > + * kref reference and a reference to its queue.
> > > > > > > + *
> > > > > > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > > > > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > > > > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > > > > > + * that must be signalled before the job can run. Duplicate fences from the
> > > > > > > + * same fence context are deduplicated automatically.
> > > > > > > + *
> > > > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > > > > > + * consuming a sequence number from the queue. After arming,
> > > > > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > > > > + * userspace or used as a dependency by other jobs.
> > > > > > > + *
> > > > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > > > > > + * queue takes a reference that it holds until the job's finished fence
> > > > > > > + * signals and the job is freed by the put_job worker.
> > > > > > > + *
> > > > > > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > > > > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > > > > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > > > > > + *
> > > > > > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > > > > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > > > > > + * objects before the driver's release callback is invoked.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <linux/dma-resv.h>
> > > > > > > +#include <linux/kref.h>
> > > > > > > +#include <linux/slab.h>
> > > > > > > +#include <drm/drm_dep.h>
> > > > > > > +#include <drm/drm_file.h>
> > > > > > > +#include <drm/drm_gem.h>
> > > > > > > +#include <drm/drm_syncobj.h>
> > > > > > > +#include "drm_dep_fence.h"
> > > > > > > +#include "drm_dep_job.h"
> > > > > > > +#include "drm_dep_queue.h"
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_init() - initialise a dep job
> > > > > > > + * @job: dep job to initialise
> > > > > > > + * @args: initialisation arguments
> > > > > > > + *
> > > > > > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > > > > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > > > > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > > > > > + * job reference is dropped.
> > > > > > > + *
> > > > > > > + * Resources are released automatically when the last reference is dropped
> > > > > > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > > > > > + * must not free the job directly.
> > > > > > > + *
> > > > > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > > > > + * -%ENOMEM on fence allocation failure.
> > > > > > > + */
> > > > > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > > > > + const struct drm_dep_job_init_args *args)
> > > > > > > +{
> > > > > > > + if (unlikely(!args->credits)) {
> > > > > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > > > > + return -EINVAL;
> > > > > > > + }
> > > > > > > +
> > > > > > > + memset(job, 0, sizeof(*job));
> > > > > > > +
> > > > > > > + job->dfence = drm_dep_fence_alloc();
> > > > > > > + if (!job->dfence)
> > > > > > > + return -ENOMEM;
> > > > > > > +
> > > > > > > + job->ops = args->ops;
> > > > > > > + job->q = drm_dep_queue_get(args->q);
> > > > > > > + job->credits = args->credits;
> > > > > > > +
> > > > > > > + kref_init(&job->refcount);
> > > > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > > > > > + * @job: dep job whose dependency xarray to drain
> > > > > > > + *
> > > > > > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > > > > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > > > > > + * skipped; the sentinel carries no reference. Called from
> > > > > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > > > > > + * dependencies here — while still in process context — avoids calling
> > > > > > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > > > > > + * dropped from a dma_fence callback.
> > > > > > > + *
> > > > > > > + * Context: Process context.
> > > > > > > + */
> > > > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + struct dma_fence *fence;
> > > > > > > + unsigned long index;
> > > > > > > +
> > > > > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > > > > + continue;
> > > > > > > + dma_fence_put(fence);
> > > > > > > + }
> > > > > > > + xa_destroy(&job->dependencies);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_fini() - clean up a dep job
> > > > > > > + * @job: dep job to clean up
> > > > > > > + *
> > > > > > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > > > > > + *
> > > > > > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > > > > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > > > > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > > > > > + * context immediately after run_job(), so it is left untouched to avoid
> > > > > > > + * calling xa_destroy() from IRQ context.
> > > > > > > + *
> > > > > > > + * Warns if @job is still linked on the queue's pending list, which would
> > > > > > > + * indicate a bug in the teardown ordering.
> > > > > > > + *
> > > > > > > + * Context: Any context.
> > > > > > > + */
> > > > > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > > > > +
> > > > > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > > > > +
> > > > > > > + drm_dep_fence_cleanup(job->dfence);
> > > > > > > + job->dfence = NULL;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Armed jobs have their dependencies drained by
> > > > > > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > > > > >
> > > > > > Just want to clear the confusion and make sure I get this right at the
> > > > > > same time. To me, "process context" means a user thread entering some
> > > > > > syscall(). What you call "process context" is more a "thread context" to
> > > > > > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > > > > > worker thread to be accurate) that executes the drop_deps() after a
> > > > > > run_job().
> > > > >
> > > > > Some of the context comments could likely be cleaned up. 'process
> > > > > context' here means either user context (bypass path) or the run-job
> > > > > work item.
> > > > >
> > > > > >
> > > > > > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > > > > + */
> > > > > > > + if (!armed)
> > > > > > > + drm_dep_job_drop_dependencies(job);
> > > > > >
> > > > > > Why do we need to make a difference here? Can't we just assume that the
> > > > > > whole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > > > > work item embedded in the job to defer its destruction when _put() is
> > > > > > called in a context where the destruction is not allowed?
> > > > > >
> > > > >
> > > > > We already touched on this, but the design currently allows the last job
> > > > > put from dma-fence signaling path (IRQ).
> > > >
> > > > It's not much about the last _put and more about what happens in the
> > > > _release() you pass to kref_put(). My point being, if you assume
> > > > something in _release() is not safe to be done in an atomic context,
> > > > and _put() is assumed to be called from any context, you might as well
> > >
> > > No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
> > > put (including release) is IRQ-safe. If the documentation isn’t clear, I
> > > can clean that up. Some of my comments here [1] try to explain this
> > > further.
> > >
> > > Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
> > > dma-fence whose release must be IRQ-safe, so there is precedent for
> > > this. I didn’t want to unilaterally require that all job releases be
> > > IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
> > > the flag.
> > >
> > > The difference between non-IRQ-safe and IRQ-safe job release is only
> > > about 12 lines of code.
> >
> > It's not just about the number of lines of code added to the core to
> > deal with that case, but also complexity of the API that results from
> > these various modes.
> >
>
> Fair enough.
>
> > > I figured that if we’re going to invest the time
> > > and effort to replace DRM sched, we should aim for the best possible
> > > implementation. Any driver can opt in here and immediately get lower CPU
> > > utilization and power savings. I will try to figure out how to measure
> > > this and get some numbers here.
> >
> > That's key here. My gut feeling is that we have so much deferred
> > already that adding one more work to the workqueue is not going to
> > hurt in term of scheduling overhead (no context switch if it's
> > scheduled on the same workqueue). Job cleanup is just the phase
>
> Signaling of fences in many drivers occurs in hard IRQ context rather
> than in a work queue. I agree that if you are signaling fences from a
> work queue, the overhead of another work item is minimal.
I'm talking about the drm_dep_queue_run_job_queue() call which in turn
calls queue_work() when a job gets reported as done. That, I think, is
the most likely path, isn't it?
>
> > following the job_done() event, which also requires a deferred work to
> > check progress on the queue anyway. And if you move the entirety of the
>
> Yes, I see Panthor signals fences from a work queue by looking at the
> seqnos, but again, in many drivers this flow is IRQ-driven for fence
> signaling latency reasons.
I'm not talking about the signalling of the done_fence, but the work
that's used to check progress on a job queue
(drm_dep_queue_run_job_queue()).
> > >
> > > > >
> > > > > Once arm() is called there is a guarantee the run_job path is called
> > > > > either via bypass or run job work item.
> > > >
> > > > Sure.
> > > >
> > >
> > > Let’s not gloss over this—this is actually a huge difference from DRM
> > > sched. One of the biggest problems I found with DRM sched is that if you
> > > call arm(), run_job() may or may not be called. Without this guarantee,
> > > you can’t do driver-side bookkeeping in arm() that is later released in
> > > run_job(), which would otherwise simplify the driver design.
> >
> > You can do driver-side book-keeping after all jobs have been
> > successfully initialized, which include arming their fences. The key
> > turning point is when you start exposing those armed fences, not
> > when you arm them. See below.
> >
>
> There is still the seqno critical section, which starts at arm() and
> closes at push() or at the drop of the fence.
That's orthogonal to the rule that says nothing after _arm() can
fail, I think. To guarantee proper job ordering, you need extra locking
(at the moment, we rely on the VM resv lock to serialize this in
Panthor).
> > >
> > > In my opinion, it’s best and safest to enforce a no-failure policy
> > > between arm() and push().
> >
> > I don't think it's safer, it's just the semantics that have been
> > defined by drm_sched/dma_fence and that we keep forcing ourselves
> > into. I'd rather have a well defined dma_fence state that says "that's
> > it, I'm exposed, you have to signal me now", than this half-enforced
> > arm()+push() model.
> >
>
> So what is the suggestion here — move the asserts I have from arm() to
> something like begin_push()? We could add a dma-fence state toggle there
> as well if we can get that part merged into dma-fence. Or should we just
> drop the asserts/lockdep checks between arm() and push() completely? I’m
> open to either approach here.
If we can have that INACTIVE flag added, and the associated
dma_fence_init[64]_inactive() variants, I would say, we call
dma_fence_init[64]_inactive() in _arm(), and we call
dma_fence_set_active() in _push(). It'd still be valuable to have some
sort of delimitation for the submission through some block-like macro
with an associated context to which we can attach states and allow for
more (optional?) runtime-checks.
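[Editor's note: the inactive-fence lifecycle proposed above can be sketched in plain userspace C. The dma_fence_init[64]_inactive() / dma_fence_set_active() helpers do not exist yet; all `model_*` names below are hypothetical stand-ins that only illustrate the intended state transitions: inactive at _arm(), exposure rejected while inactive, active at _push().]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag mirroring the proposed DMA_FENCE_FLAG_INACTIVE */
#define MODEL_FENCE_FLAG_INACTIVE (1u << 0)

struct model_fence {
	unsigned int flags;
	unsigned long long seqno;
};

/* _arm(): initialise the fence but keep it invisible to the outside world */
void model_fence_init_inactive(struct model_fence *f, unsigned long long seqno)
{
	f->seqno = seqno;
	f->flags = MODEL_FENCE_FLAG_INACTIVE;
}

/* _push(): from this point on the fence is public and must be signalled */
void model_fence_set_active(struct model_fence *f)
{
	f->flags &= ~MODEL_FENCE_FLAG_INACTIVE;
}

/* Waiters / callbacks reject inactive fences; the kernel version would
 * presumably WARN_ON and return an error here. */
bool model_fence_add_callback(struct model_fence *f)
{
	if (f->flags & MODEL_FENCE_FLAG_INACTIVE)
		return false;
	return true;
}
```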
>
> > >
> > > FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
> > > which had the exact issue I described above. I pointed it out and got a
> > > reply saying, “well, this is an API issue, right?”—and they were
> > > correct, it is an API issue.
> > >
> > > > >
> > > > > >
> > > > > > In general, I wonder if we should distinguish between "armed" and
> > > > > > "publicly exposed" to help deal with this intra-batch dep thing without
> > > > > > resorting to reservation and other tricks like that.
> > > > > >
> > > > >
> > > > > I'm not exactly sure what you're suggesting, but I'm always open to ideas.
> > > >
> > > > Right now _arm() is what does the dma_fence_init(). But there's an
> > > > extra step between initializing the fence object and making it
> > > > visible to the outside world. In order for the dep to be added to the
> > > > job, you need the fence to be initialized, but that's not quite
> > > > external visibility, because the job is still very much a driver
> > > > object, and if something fails, the rollback mechanism makes it so all
> > > > the deps are dropped on the floor along the job that's being destroyed.
> > > > So we won't really wait on this fence that's never going to be
> > > > signalled.
> > > >
> > > > I see what's appealing in pretending that _arm() == externally-visible,
> > > > but it's also forcing us to do extra pre-alloc (or other pre-init)
> > > > operations that would otherwise not be required in the submit path. Not
> > > > a hill I'm willing to die on, but I just thought I'd mention the fact I
> > > > find it weird that we put extra constraints on ourselves that are not
> > > > strictly needed, just because we fail to properly flag the dma_fence
> > > > visibility transitions.
> > >
> > > See the dma-resv example above. I’m not willing to die on this hill
> > > either, but again, in my opinion, for safety and as an API-level
> > > contract, enforcing arm() as a no-failure point makes sense. It prevents
> > > drivers from doing anything dangerous like the dma-resv example, which
> > > is an extremely subtle bug.
> >
> > That's a valid point, but you're not really enforcing things at
> > compile/run-time it's just "don't do this/that" in the docs. If you
> > encode the is_active() state at the dma_fence level, properly change
> > the fence state anytime it's about to be added to a public container,
> > and make it so an active fence that's released without being signalled
> > triggers a WARN_ON(), you've achieved more. Once you've done that, you
> > can also relax the rule that says that "an armed fence has to be
> > signalled" to "a fence that's active has to be signalled". With this,
> > the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is
> > no longer required, because you would be able to store inactive fences
>
> I wouldn’t go that far or say it’s that simple. This would require a
> fairly large refactor of Xe’s VM bind pipeline to call arm() earlier,
> and I’m not even sure it would be possible. Between arm() and push(),
> the seqno critical section still remains and requires locking; in
> particular, the tricky case is kernel binds (e.g., page fault handling)
> which use the same queue. Multiple threads can issue kernel binds
> concurrently, as our page fault handler is multi-threaded, similar
> to the CPU page fault handler, so the critical section between arm() and
> push() is very late in the pipeline, tightly protected by a lock.
This sounds like a different issue to me. That's the constraint that
says _arm() and _push() ordering needs to be preserved to guarantee
that jobs are properly ordered on the job queue. But that's orthogonal
to the rule that says nothing between _arm() and _push() on a given job
can fail. Let's take the Panthor case as an example:
for_each_job_in_batch() {
	// This acquires the VM resv lock, and all BO locks.
	// Because queues target a specific VM and all jobs
	// in a SUBMIT must target the same VM, this
	// guarantees that seqno allocation happening further
	// down (when _arm() is called) won't be interleaved
	// with other concurrent submissions to the same queues.
	lock_and_prepare_resvs()
	<--- Seqno critical section starts here
}
for_each_job_in_batch() {
	// If something fails here, we drop all the jobs that
	// are part of this SUBMIT, and the resv locks are
	// released as part of the rollback. This means we
	// consumed but didn't use the seqnos, thus creating
	// a hole on the timeline, which is harmless, as long
	// as those seqnos are not recycled.
	ret = fallible_stuff()
	if (ret)
		goto rollback;
	arm(job)
}
// Nothing can fail after this point
for_each_job_in_batch() {
	// resv locks are released here, unblocking other
	// concurrent submissions
	update_resvs(job->done_fence)
	<--- Seqno critical section ends here in case of success
	push(job)
}
update_submit_syncobjs();
rollback:
	unlock_resvs()
	<--- Seqno critical section ends here in case of failure
...
How wide your critical seqno section is is up to each driver, really.
>
> > there, as long as they become active before the job is pushed.
> >
> > >
> > > >
> > > > On the rust side it would be directly described through the type
> > > > system (see the Visibility attribute in Daniel's branch[1]). On C side,
> > > > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> > > > name you want to give it). Any operation pushing the fence to public
> > > > container (dma_resv, syncobj, sync_file, ...) would be rejected when
> > > > that flag is set. At _push() time, we'd clear that flag with a
> > > > dma_fence_set_active() helper, which would reflect the fact the fence
> > > > can now be observed and exposed to the outside world.
> > > >
> > >
> > > Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE
> > > flag. When adding a fence to dma-resv, fences that belong to the same
> > > timeline are immediately squashed. A later transition of the fence state
> > > completely breaks this behavior.
> >
> > That's exactly my point: as soon as you want to insert the fence to a
> > public container, you have to make it "active", so it will never be
> > rolled back to the previous entry in the resv. Similarly, a
> > wait/add_callback() on an inactive fence should be rejected.
> >
>
> This is bit bigger dma-fence / treewide level change but in general I
> believe this is a good idea.
I agree it's a bit more work. It implies patching containers to reject
insertion when the INACTIVE flag is set. If we keep !INACTIVE as the
default (__dma_fence_init(INACTIVE) being an opt-in), fence emitters can
be moved to this model progressively though.
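[Editor's note: what "patching containers to reject insertion when the INACTIVE flag is set" could look like, modeled in userspace C. All names are hypothetical stand-ins; in the kernel this would presumably be a WARN_ON plus early return in dma_resv_add_fence(), drm_syncobj_replace_fence(), and the like, with !INACTIVE kept as the default so existing fence emitters are unaffected.]

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag mirroring the proposed DMA_FENCE_FLAG_INACTIVE */
#define MODEL_FENCE_FLAG_INACTIVE (1u << 0)

struct model_fence {
	unsigned int flags;
};

#define MODEL_MAX_FENCES 4

struct model_container {
	struct model_fence *fences[MODEL_MAX_FENCES];
	unsigned int num_fences;
};

/* Insertion is refused while the fence is still inactive, so an
 * inactive fence can never become externally visible by accident. */
int model_container_add(struct model_container *c, struct model_fence *f)
{
	if (f->flags & MODEL_FENCE_FLAG_INACTIVE)
		return -EINVAL; /* kernel version would WARN_ON here */
	if (c->num_fences >= MODEL_MAX_FENCES)
		return -ENOMEM;
	c->fences[c->num_fences++] = f;
	return 0;
}
```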
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-23 18:38 ` Matthew Brost
@ 2026-03-24 9:23 ` Boris Brezillon
2026-03-24 16:06 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Boris Brezillon @ 2026-03-24 9:23 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Mon, 23 Mar 2026 11:38:06 -0700
Matthew Brost <matthew.brost@intel.com> wrote:
>
> Ok, getting stats is easier than I thought...
>
> ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
>
> This test creates one thread per engine instance (7 instances on this BMG
> device) and submits 1k exec IOCTLs per thread, each performing a DW
> write. Each exec IOCTL typically does not have unsignaled input dependencies.
>
> With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
>
> 8,449 context-switches
> 412 cpu-migrations
> 2,531.43 msec task-clock
> 1,847,846,588 cpu_atom/cycles/
> 1,847,856,947 cpu_core/cycles/
> <not supported> cpu_atom/instructions/
> 460,744,020 cpu_core/instructions/
>
> With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
>
> 8,655 context-switches
> 229 cpu-migrations
> 2,571.33 msec task-clock
> 855,900,607 cpu_atom/cycles/
> 855,900,272 cpu_core/cycles/
> <not supported> cpu_atom/instructions/
> 403,651,469 cpu_core/instructions/
>
> With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
>
> 5,361 context-switches
> 169 cpu-migrations
> 2,577.44 msec task-clock
> 685,769,153 cpu_atom/cycles/
> 685,768,407 cpu_core/cycles/
> <not supported> cpu_atom/instructions/
> 321,336,297 cpu_core/instructions/
Thanks for sharing those numbers. For completeness, can you also add the
"With IRQ putting of jobs on + no bypass" case?
I'm a bit surprised by the difference in number of context switches
given I'd expect the local-CPU to be picked in priority, and so queuing
work items on the same wq from another work item to be almost free in
terms of scheduling. But I guess there's some load-balancing happening
when you execute jobs at such a high rate.
Also, I don't know if that's just noise or if it's reproducible, but
task-clock seems to be ~40msec lower with the deferred cleanup and
no-bypass (higher throughput because you're not blocking the dequeuing
of the next job on the cleanup of the previous one, I suspect).
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-24 9:23 ` Boris Brezillon
@ 2026-03-24 16:06 ` Matthew Brost
2026-03-25 2:33 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-24 16:06 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Tue, Mar 24, 2026 at 10:23:45AM +0100, Boris Brezillon wrote:
> On Mon, 23 Mar 2026 11:38:06 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> >
> > Ok, getting stats is easier than I thought...
> >
> > ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
> >
> > This test creates one thread per engine instance (7 instances on this BMG
> > device) and submits 1k exec IOCTLs per thread, each performing a DW
> > write. Each exec IOCTL typically does not have unsignaled input dependencies.
> >
> > With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
> >
> > 8,449 context-switches
> > 412 cpu-migrations
> > 2,531.43 msec task-clock
> > 1,847,846,588 cpu_atom/cycles/
> > 1,847,856,947 cpu_core/cycles/
> > <not supported> cpu_atom/instructions/
> > 460,744,020 cpu_core/instructions/
> >
> > With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
> >
> > 8,655 context-switches
> > 229 cpu-migrations
> > 2,571.33 msec task-clock
> > 855,900,607 cpu_atom/cycles/
> > 855,900,272 cpu_core/cycles/
> > <not supported> cpu_atom/instructions/
> > 403,651,469 cpu_core/instructions/
> >
> > With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> > DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
> >
> > 5,361 context-switches
> > 169 cpu-migrations
> > 2,577.44 msec task-clock
> > 685,769,153 cpu_atom/cycles/
> > 685,768,407 cpu_core/cycles/
> > <not supported> cpu_atom/instructions/
> > 321,336,297 cpu_core/instructions/
>
> Thanks for sharing those numbers. For completeness, can you also add the
> "With IRQ putting of jobs on + no bypass" case?
>
Yes, I will also share a DRM sched baseline. I also figured out that
power can be measured - initial results confirm what I expected: less
power.
I'm putting together a doc based on running glxgears and another
benchmark on top of Ubuntu 24.10 + Wayland, which has explicit sync
(linux-drm-syncobj; behaves like SurfaceFlinger when the rendering flag
is set to not pass fences in to draw jobs).
Almost have all the data. Will share here once I have it.
> I'm a bit surprised by the difference in number of context switches
> given I'd expect the local-CPU to be picked in priority, and so queuing
> work items on the same wq from another work item to be almost free in
> terms of scheduling. But I guess there's some load-balancing happening
> when you execute jobs at such a high rate.
>
> Also, I don't know if that's just noise or if it's reproducible, but
> task-clock seems to be ~40msec lower with the deferred cleanup and
> no-bypass (higher throughput because you're not blocking the dequeuing
> of the next job on the cleanup of the previous one, I suspect).
I think that is just noise from what the test is doing in user space -
that bounces around a bit.
Matt
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-24 8:49 ` Boris Brezillon
@ 2026-03-24 16:51 ` Matthew Brost
0 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-24 16:51 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Tue, Mar 24, 2026 at 09:49:57AM +0100, Boris Brezillon wrote:
> On Mon, 23 Mar 2026 10:08:53 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > On Mon, Mar 23, 2026 at 10:55:04AM +0100, Boris Brezillon wrote:
> > > Hi Matthew,
> > >
> > > On Sun, 22 Mar 2026 21:50:07 -0700
> > > Matthew Brost <matthew.brost@intel.com> wrote:
> > >
> > > > > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..2d012b29a5fc
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > > @@ -0,0 +1,675 @@
> > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > +/*
> > > > > > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > > > > > + *
> > > > > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > > > > + * to deal in the Software without restriction, including without limitation
> > > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > > > + *
> > > > > > > > + * The above copyright notice and this permission notice shall be included in
> > > > > > > > + * all copies or substantial portions of the Software.
> > > > > > > > + *
> > > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > > > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > > > > > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > > > > > + *
> > > > > > > > + * Copyright © 2026 Intel Corporation
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * DOC: DRM dependency job
> > > > > > > > + *
> > > > > > > > + * A struct drm_dep_job represents a single unit of GPU work associated with
> > > > > > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > > > > > + *
> > > > > > > > + * 1. **Allocation**: the driver allocates memory for the job (typically by
> > > > > > > > + * embedding struct drm_dep_job in a larger structure) and calls
> > > > > > > > + * drm_dep_job_init() to initialise it. On success the job holds one
> > > > > > > > + * kref reference and a reference to its queue.
> > > > > > > > + *
> > > > > > > > + * 2. **Dependency collection**: the driver calls drm_dep_job_add_dependency(),
> > > > > > > > + * drm_dep_job_add_syncobj_dependency(), drm_dep_job_add_resv_dependencies(),
> > > > > > > > + * or drm_dep_job_add_implicit_dependencies() to register dma_fence objects
> > > > > > > > + * that must be signalled before the job can run. Duplicate fences from the
> > > > > > > > + * same fence context are deduplicated automatically.
> > > > > > > > + *
> > > > > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's finished fence,
> > > > > > > > + * consuming a sequence number from the queue. After arming,
> > > > > > > > + * drm_dep_job_finished_fence() returns a valid fence that may be passed to
> > > > > > > > + * userspace or used as a dependency by other jobs.
> > > > > > > > + *
> > > > > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the queue. The
> > > > > > > > + * queue takes a reference that it holds until the job's finished fence
> > > > > > > > + * signals and the job is freed by the put_job worker.
> > > > > > > > + *
> > > > > > > > + * 5. **Completion**: when the job's hardware work finishes its finished fence
> > > > > > > > + * is signalled and drm_dep_job_put() is called by the queue. The driver
> > > > > > > > + * must release any driver-private resources in &drm_dep_job_ops.release.
> > > > > > > > + *
> > > > > > > > + * Reference counting uses drm_dep_job_get() / drm_dep_job_put(). The
> > > > > > > > + * internal drm_dep_job_fini() tears down the dependency xarray and fence
> > > > > > > > + * objects before the driver's release callback is invoked.
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#include <linux/dma-resv.h>
> > > > > > > > +#include <linux/kref.h>
> > > > > > > > +#include <linux/slab.h>
> > > > > > > > +#include <drm/drm_dep.h>
> > > > > > > > +#include <drm/drm_file.h>
> > > > > > > > +#include <drm/drm_gem.h>
> > > > > > > > +#include <drm/drm_syncobj.h>
> > > > > > > > +#include "drm_dep_fence.h"
> > > > > > > > +#include "drm_dep_job.h"
> > > > > > > > +#include "drm_dep_queue.h"
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * drm_dep_job_init() - initialise a dep job
> > > > > > > > + * @job: dep job to initialise
> > > > > > > > + * @args: initialisation arguments
> > > > > > > > + *
> > > > > > > > + * Initialises @job with the queue, ops and credit count from @args. Acquires
> > > > > > > > + * a reference to @args->q via drm_dep_queue_get(); this reference is held for
> > > > > > > > + * the lifetime of the job and released by drm_dep_job_release() when the last
> > > > > > > > + * job reference is dropped.
> > > > > > > > + *
> > > > > > > > + * Resources are released automatically when the last reference is dropped
> > > > > > > > + * via drm_dep_job_put(), which must be called to release the job; drivers
> > > > > > > > + * must not free the job directly.
> > > > > > > > + *
> > > > > > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > > > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > > > > > + * -%ENOMEM on fence allocation failure.
> > > > > > > > + */
> > > > > > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > > > > > + const struct drm_dep_job_init_args *args)
> > > > > > > > +{
> > > > > > > > + if (unlikely(!args->credits)) {
> > > > > > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > > > > > + return -EINVAL;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + memset(job, 0, sizeof(*job));
> > > > > > > > +
> > > > > > > > + job->dfence = drm_dep_fence_alloc();
> > > > > > > > + if (!job->dfence)
> > > > > > > > + return -ENOMEM;
> > > > > > > > +
> > > > > > > > + job->ops = args->ops;
> > > > > > > > + job->q = drm_dep_queue_get(args->q);
> > > > > > > > + job->credits = args->credits;
> > > > > > > > +
> > > > > > > > + kref_init(&job->refcount);
> > > > > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > > > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > > > > > +
> > > > > > > > + return 0;
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * drm_dep_job_drop_dependencies() - release all input dependency fences
> > > > > > > > + * @job: dep job whose dependency xarray to drain
> > > > > > > > + *
> > > > > > > > + * Walks @job->dependencies, puts each fence, and destroys the xarray.
> > > > > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC sentinel —
> > > > > > > > + * i.e. slots that were pre-allocated but never replaced — are silently
> > > > > > > > + * skipped; the sentinel carries no reference. Called from
> > > > > > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > > > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). Releasing
> > > > > > > > + * dependencies here — while still in process context — avoids calling
> > > > > > > > + * xa_destroy() from IRQ context if the job's last reference is later
> > > > > > > > + * dropped from a dma_fence callback.
> > > > > > > > + *
> > > > > > > > + * Context: Process context.
> > > > > > > > + */
> > > > > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > > > > > +{
> > > > > > > > + struct dma_fence *fence;
> > > > > > > > + unsigned long index;
> > > > > > > > +
> > > > > > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > > > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > > > > > + continue;
> > > > > > > > + dma_fence_put(fence);
> > > > > > > > + }
> > > > > > > > + xa_destroy(&job->dependencies);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * drm_dep_job_fini() - clean up a dep job
> > > > > > > > + * @job: dep job to clean up
> > > > > > > > + *
> > > > > > > > + * Cleans up the dep fence and drops the queue reference held by @job.
> > > > > > > > + *
> > > > > > > > + * If the job was never armed (e.g. init failed before drm_dep_job_arm()),
> > > > > > > > + * the dependency xarray is also released here. For armed jobs the xarray
> > > > > > > > + * has already been drained by drm_dep_job_drop_dependencies() in process
> > > > > > > > + * context immediately after run_job(), so it is left untouched to avoid
> > > > > > > > + * calling xa_destroy() from IRQ context.
> > > > > > > > + *
> > > > > > > > + * Warns if @job is still linked on the queue's pending list, which would
> > > > > > > > + * indicate a bug in the teardown ordering.
> > > > > > > > + *
> > > > > > > > + * Context: Any context.
> > > > > > > > + */
> > > > > > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > > > > > +{
> > > > > > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > > > > > +
> > > > > > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > > > > > +
> > > > > > > > + drm_dep_fence_cleanup(job->dfence);
> > > > > > > > + job->dfence = NULL;
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Armed jobs have their dependencies drained by
> > > > > > > > + * drm_dep_job_drop_dependencies() in process context after run_job().
> > > > > > >
> > > > > > > Just want to clear the confusion and make sure I get this right at the
> > > > > > > same time. To me, "process context" means a user thread entering some
> > > > > > > syscall(). What you call "process context" is more a "thread context" to
> > > > > > > me. I'm actually almost certain it's always a kernel thread (a workqueue
> > > > > > > worker thread to be accurate) that executes the drop_deps() after a
> > > > > > > run_job().
> > > > > >
> > > > > > Some of the context comments could likely be cleaned up. 'Process
> > > > > > context' here means either user context (bypass path) or the run-job
> > > > > > work item.
> > > > > >
> > > > > > >
> > > > > > > > + * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > > > > > + */
> > > > > > > > + if (!armed)
> > > > > > > > + drm_dep_job_drop_dependencies(job);
> > > > > > >
> > > > > > > Why do we need to make a difference here? Can't we just assume that the
> > > > > > > whole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > > > > > work item embedded in the job to defer its destruction when _put() is
> > > > > > > called in a context where the destruction is not allowed?
> > > > > > >
> > > > > >
> > > > > > We already touched on this, but the design currently allows the last job
> > > > > > put from dma-fence signaling path (IRQ).
> > > > >
> > > > > It's not much about the last _put and more about what happens in the
> > > > > _release() you pass to kref_put(). My point being, if you assume
> > > > > something in _release() is not safe to be done in an atomic context,
> > > > > and _put() is assumed to be called from any context, you might as well
> > > >
> > > > No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
> > > > put (including release) is IRQ-safe. If the documentation isn’t clear, I
> > > > can clean that up. Some of my comments here [1] try to explain this
> > > > further.
> > > >
> > > > Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
> > > > dma-fence whose release must be IRQ-safe, so there is precedent for
> > > > this. I didn’t want to unilaterally require that all job releases be
> > > > IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
> > > > the flag.
> > > >
> > > > The difference between non-IRQ-safe and IRQ-safe job release is only
> > > > about 12 lines of code.
> > >
> > > It's not just about the number of lines of code added to the core to
> > > deal with that case, but also complexity of the API that results from
> > > these various modes.
> > >
> >
> > Fair enough.
> >
> > > > I figured that if we’re going to invest the time
> > > > and effort to replace DRM sched, we should aim for the best possible
> > > > implementation. Any driver can opt in here and immediately get lower CPU
> > > > utilization and power savings. I will try to figure out how to measure
> > > > this and get some numbers here.
> > >
> > > That's key here. My gut feeling is that we have so much deferred
> > > already that adding one more work to the workqueue is not going to
> > > hurt in term of scheduling overhead (no context switch if it's
> > > scheduled on the same workqueue). Job cleanup is just the phase
> >
> > Signaling of fences in many drivers occurs in hard IRQ context rather
> > than in a work queue. I agree that if you are signaling fences from a
> > work queue, the overhead of another work item is minimal.
>
> I'm talking about the drm_dep_queue_run_job_queue() call which in turn
> calls queue_work() when a job gets reported as done. That, I think, is
> the most likely path, isn't it?
>
Oh, I think I understand what you are getting at...
drm_dep_queue_run_job_queue() is called less blindly in DRM dep compared
to DRM sched. When a fence signals, drm_dep_queue_run_job_queue() only
kicks a worker if the SPSC queue is non-empty and the signal may have
freed up the credits needed to call ->run_job(). This could likely be
optimized further to skip the work item if the run_job path is blocked
on a dependency, though I’d need to think through the races.
With the way credits work in Xe (we have many per queue), it is quite
likely this won’t be triggered. Therefore, the extra worker kick to put
the job incurs an otherwise unnecessary context switch. If
drm_dep_queue_run_job_queue() always kicked a work item, the additional
work item would indeed be less costly.
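To make that concrete, here is a tiny userspace C model of the kick check (entirely hypothetical names, not the actual drm_dep API): the fence-signal path only kicks the run_job worker when doing so could make progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a drm_dep queue's credit accounting. */
struct model_queue {
	int credits_avail;	/* credits freed as fences signal */
	int head_credits;	/* credits the head job needs; 0 if queue empty */
};

/*
 * Kick the run_job worker only when it could make progress: the SPSC
 * queue is non-empty and the available credits cover the head job.
 * Otherwise the fence-signal path returns without an extra context
 * switch.
 */
static bool should_kick_worker(const struct model_queue *q)
{
	return q->head_credits > 0 && q->credits_avail >= q->head_credits;
}
```

With many credits per queue (as in Xe), the "credits cover the head job" test usually passes early, so the common case is that signaling never needs to kick the worker at all.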
> >
> > > following the job_done() event, which also requires a deferred work to
> > > check progress on the queue anyway. And if you move the entirety of the
> >
> > Yes, I see Panthor signals fences from a work queue by looking at the
> > seqnos, but again, in many drivers this flow is IRQ-driven for fence
> > signaling latency reasons.
>
> I'm not talking about the signalling of the done_fence, but the work
> that's used to check progress on a job queue
> (drm_dep_queue_run_job_queue()).
>
See above.
> > > >
> > > > > >
> > > > > > Once arm() is called, there is a guarantee that the run_job path is
> > > > > > called, either via bypass or the run-job work item.
> > > > >
> > > > > Sure.
> > > > >
> > > >
> > > > Let’s not gloss over this—this is actually a huge difference from DRM
> > > > sched. One of the biggest problems I found with DRM sched is that if you
> > > > call arm(), run_job() may or may not be called. Without this guarantee,
> > > > you can’t do driver-side bookkeeping in arm() that is later released in
> > > > run_job(), which would otherwise simplify the driver design.
> > >
> > > You can do driver-side book-keeping after all jobs have been
> > > successfully initialized, which include arming their fences. The key
> > > turning point is when you start exposing those armed fences, not
> > > when you arm them. See below.
> > >
> >
> > There is still the seqno critical section, which starts at arm() and
> > is closed at push() or when the fence is dropped.
>
> That's orthogonal to the rule that says nothing after _arm() can
> fail, I think. To guarantee proper job ordering, you need extra locking
> (at the moment, we rely on the VM resv lock to serialize this in
> Panthor).
Yes, you need a lock. The VM resv lock is typically how we handle this
as well, but of course in Xe it gets more complicated for kernel-issued
binds, which share a queue across VMs, or in compute cases where we do
not take resv locks in the exec ioctl paths.
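For illustration, the invariant that lock protects, namely that seqnos consumed at arm() must reach push() in allocation order, can be modeled in a few lines of userspace C (all names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the arm()..push() seqno critical section. */
struct model_timeline {
	uint64_t next_seqno;	/* next seqno handed out at arm() */
	uint64_t last_pushed;	/* last seqno seen by push() */
};

/*
 * arm(): consume a seqno. In a real driver this runs under the
 * submission lock (e.g. the VM resv lock) so concurrent submitters
 * cannot interleave their arm()/push() pairs.
 */
static uint64_t model_alloc_seqno(struct model_timeline *tl)
{
	return tl->next_seqno++;
}

/* push(): seqnos must arrive in strictly increasing order. */
static void model_push(struct model_timeline *tl, uint64_t seqno)
{
	assert(seqno > tl->last_pushed);	/* ordering invariant */
	tl->last_pushed = seqno;
}
```

Holding the lock from arm() through push() is what makes the `seqno > last_pushed` assertion unfalsifiable; where exactly that section begins and ends is up to each driver.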
>
> > > >
> > > > In my opinion, it’s best and safest to enforce a no-failure policy
> > > > between arm() and push().
> > >
> > > I don't think it's safer, it's just the semantics that have been
> > > defined by drm_sched/dma_fence and that we keep forcing ourselves
> > > into. I'd rather have a well defined dma_fence state that says "that's
> > > it, I'm exposed, you have to signal me now", than this half-enforced
> > > arm()+push() model.
> > >
> >
> > So what is the suggestion here — move the asserts I have from arm() to
> > something like begin_push()? We could add a dma-fence state toggle there
> > as well if we can get that part merged into dma-fence. Or should we just
> > drop the asserts/lockdep checks between arm() and push() completely? I’m
> > open to either approach here.
>
> If we can have that INACTIVE flag added, and the associated
> dma_fence_init[64]_inactive() variants, I would say, we call
> dma_fence_init[64]_inactive() in _arm(), and we call
> dma_fence_set_active() in _push(). It'd still be valuable to have some
I think it is a valid use case to add an armed fence to dma-resv or sync
objs before calling push(), though. Xe doesn’t do this, nor does
Panthor, but IIRC other drivers do, and I believe that is completely
valid. Of course, as discussed, once that is done it becomes the
no-failure point. Hence my reasoning was to make arm() the no-failure
point...
How about I add a make_active() call? I can start by moving the lockdep
checks I have in place from arm() to make_active(), and we can
incorporate the dma-fence suggestions you mentioned into that logic in a
follow-up, since that is a broader change and would need buy-in from dma-resv
maintainers.
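A rough userspace C sketch of the state split I'm proposing (all names hypothetical, not an existing dma-fence API): arm() consumes the seqno but keeps the fence driver-private, make_active() is the no-failure point, and public containers only accept active fences.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fence visibility states for an arm()/make_active()/push()
 * split; not an existing dma-fence API. */
enum model_fence_state {
	MODEL_FENCE_UNARMED,
	MODEL_FENCE_ARMED,	/* seqno consumed, still driver-private */
	MODEL_FENCE_ACTIVE,	/* externally visible, must now signal */
	MODEL_FENCE_SIGNALED,
};

struct model_fence {
	enum model_fence_state state;
};

static void model_arm(struct model_fence *f)
{
	assert(f->state == MODEL_FENCE_UNARMED);
	f->state = MODEL_FENCE_ARMED;
}

/* The no-failure point moves here: past this, the fence must signal. */
static void model_make_active(struct model_fence *f)
{
	assert(f->state == MODEL_FENCE_ARMED);
	f->state = MODEL_FENCE_ACTIVE;
}

/* Adding to a public container (dma-resv, syncobj, ...) requires ACTIVE,
 * so timeline squashing never sees a fence that could be rolled back. */
static bool model_container_add_ok(const struct model_fence *f)
{
	return f->state >= MODEL_FENCE_ACTIVE;
}
```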
> sort of delimitation for the submission through some block-like macro
> with an associated context to which we can attach states and allow for
> more (optional?) runtime-checks.
>
> >
> > > >
> > > > FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
> > > > which had the exact issue I described above. I pointed it out and got a
> > > > reply saying, “well, this is an API issue, right?”—and they were
> > > > correct, it is an API issue.
> > > >
> > > > > >
> > > > > > >
> > > > > > > In general, I wonder if we should distinguish between "armed" and
> > > > > > > "publicly exposed" to help deal with this intra-batch dep thing without
> > > > > > > resorting to reservation and other tricks like that.
> > > > > > >
> > > > > >
> > > > > > I'm not exactly sure what you are suggesting, but I'm always open to ideas.
> > > > >
> > > > > Right now _arm() is what does the dma_fence_init(). But there's an
> > > > > extra step between initializing the fence object and making it
> > > > > visible to the outside world. In order for the dep to be added to the
> > > > > job, you need the fence to be initialized, but that's not quite
> > > > > external visibility, because the job is still very much a driver
> > > > > object, and if something fails, the rollback mechanism makes it so all
> > > > > the deps are dropped on the floor along the job that's being destroyed.
> > > > > So we won't really wait on this fence that's never going to be
> > > > > signalled.
> > > > >
> > > > > I see what's appealing in pretending that _arm() == externally-visible,
> > > > > but it's also forcing us to do extra pre-alloc (or other pre-init)
> > > > > operations that would otherwise not be required in the submit path. Not
> > > > > a hill I'm willing to die on, but I just thought I'd mention the fact I
> > > > > find it weird that we put extra constraints on ourselves that are not
> > > > > strictly needed, just because we fail to properly flag the dma_fence
> > > > > visibility transitions.
> > > >
> > > > See the dma-resv example above. I’m not willing to die on this hill
> > > > either, but again, in my opinion, for safety and as an API-level
> > > > contract, enforcing arm() as a no-failure point makes sense. It prevents
> > > > drivers from doing anything dangerous like the dma-resv example, which
> > > > is an extremely subtle bug.
> > >
> > > That's a valid point, but you're not really enforcing things at
> > > compile/run-time it's just "don't do this/that" in the docs. If you
> > > encode the is_active() state at the dma_fence level, properly change
> > > the fence state anytime it's about to be added to a public container,
> > > and make it so an active fence that's released without being signalled
> > > triggers a WARN_ON(), you've achieved more. Once you've done that, you
> > > can also relax the rule that says that "an armed fence has to be
> > > signalled" to "a fence that's active has to be signalled". With this,
> > > the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is
> > > no longer required, because you would be able to store inactive fences
> >
> > I wouldn’t go that far or say it’s that simple. This would require a
> > fairly large refactor of Xe’s VM bind pipeline to call arm() earlier,
> > and I’m not even sure it would be possible. Between arm() and push(),
> > the seqno critical section still remains and requires locking; in
> > particular, the tricky case is kernel binds (e.g., page fault handling),
> > which share the same queue. Multiple threads can issue kernel binds
> > concurrently, as our page fault handler is multi-threaded, similar
> > to the CPU page fault handler, so the critical section between arm() and
> > push() sits very late in the pipeline, tightly protected by a lock.
>
> This sounds like a different issue to me. That's the constraint that
> says _arm() and _push() ordering needs to be preserved to guarantee
> that jobs are properly ordered on the job queue. But that's orthogonal
> to the rule that says nothing between _arm() and _push() on a given job
> can fail. Let's take the Panthor case as an example:
>
> for_each_job_in_batch() {
> // This acquires the VM resv lock, and all BO locks
> // Because queues target a specific VM and all jobs
> // in a SUBMIT must target the same VM, this
> // guarantees that seqno allocation happening further
> // down (when _arm() is called) won't be interleaved
> // with other concurrent submissions to the same queues.
> lock_and_prepare_resvs()
>
> <--- Seqno critical section starts here
> }
>
> for_each_job_in_batch() {
> // If something fails here, we drop all the jobs that
> // are part of this SUBMIT, and the resv locks are
> // released as part of the rollback. This means we
> // consumed but didn't use the seqnos, thus creating
> // a hole on the timeline, which is harmless, as long
> // as those seqnos are not recycled.
> ret = fallible_stuff()
> if (ret)
> goto rollback;
>
> arm(job)
> }
>
> // Nothing can fail after this point
>
> for_each_job_in_batch() {
With my suggestion above...
make_active(job);
> // resv locks are released here, unblocking other
> // concurrent submissions
> update_resvs(job->done_fence)
>
> <--- Seqno critical section ends here in case of success
>
> push(job)
> }
>
> update_submit_syncobjs();
>
> rollback:
> unlock_resvs()
> <--- Seqno critical section ends here in case of failure
> ...
>
> How wide your critical seqno section is is up to each driver, really.
>
I agree with this. The same logic follows: by adding make_active(), a
driver can decide the extent of its non-failing critical section.
Another thing I could add is an option for a driver to register a
lockdep class with a DRM dep queue that asserts the lock is held in
arm(), make_active(), and push(). GPUVM and GPUSVM have similar
interfaces, for example in functions that must be protected by a
driver-side lock.
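Roughly what I have in mind, modeled in userspace C (in the kernel this would be a lockdep map registered with the queue plus lockdep_assert_held(); all names here are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a driver lock; in the kernel this would be a lockdep map. */
struct model_lock {
	bool held;
};

struct model_dep_queue {
	struct model_lock *submit_lock;	/* NULL if driver opted out */
};

/* Opt-in check, analogous to lockdep_assert_held() on a registered class. */
static void model_assert_submit_locked(const struct model_dep_queue *q)
{
	if (q->submit_lock)
		assert(q->submit_lock->held);
}

/* arm(), make_active() and push() would each perform the check: */
static void model_queue_arm(struct model_dep_queue *q)
{
	model_assert_submit_locked(q);
	/* ... seqno allocation would go here ... */
}
```

The check costs nothing for drivers that don't register a lock, and catches mis-ordered concurrent submissions for drivers that do.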
> >
> > > there, as long as they become active before the job is pushed.
> > >
> > > >
> > > > >
> > > > > On the rust side it would be directly described through the type
> > > > > system (see the Visibility attribute in Daniel's branch[1]). On C side,
> > > > > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> > > > > name you want to give it). Any operation pushing the fence to public
> > > > > container (dma_resv, syncobj, sync_file, ...) would be rejected when
> > > > > that flag is set. At _push() time, we'd clear that flag with a
> > > > > dma_fence_set_active() helper, which would reflect the fact the fence
> > > > > can now be observed and exposed to the outside world.
> > > > >
> > > >
> > > > Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE
> > > > flag. When adding a fence to dma-resv, fences that belong to the same
> > > > timeline are immediately squashed. A later transition of the fence state
> > > > completely breaks this behavior.
> > >
> > > That's exactly my point: as soon as you want to insert the fence to a
> > > public container, you have to make it "active", so it will never be
> > > rolled back to the previous entry in the resv. Similarly, a
> > > wait/add_callback() on an inactive fence should be rejected.
> > >
> >
> > This is a bigger dma-fence / treewide-level change, but in general I
> > believe it is a good idea.
>
> I agree it's a bit more work. It implies patching containers to reject
> insertion when the INACTIVE flag is set. If we keep !INACTIVE as the
> default (__dma_fence_init(INACTIVE) being an opt-in), fence emitters can
> be moved to this model progressively though.
See above. I think we should start with the make_active() split and then
see if we can get dma-resv/dma-fence updated with this semantic.
Matt
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
2026-03-24 16:06 ` Matthew Brost
@ 2026-03-25 2:33 ` Matthew Brost
0 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-25 2:33 UTC (permalink / raw)
To: Boris Brezillon
Cc: intel-xe, dri-devel, Tvrtko Ursulin, Rodrigo Vivi,
Thomas Hellström, Christian König, Danilo Krummrich,
David Airlie, Maarten Lankhorst, Maxime Ripard, Philipp Stanner,
Simona Vetter, Sumit Semwal, Thomas Zimmermann, linux-kernel
On Tue, Mar 24, 2026 at 09:06:02AM -0700, Matthew Brost wrote:
> On Tue, Mar 24, 2026 at 10:23:45AM +0100, Boris Brezillon wrote:
> > On Mon, 23 Mar 2026 11:38:06 -0700
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > >
> > > Ok, getting stats is easier than I thought...
> > >
> > > ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
> > >
> > > This test creates one thread per engine instance (7 instances on this BMG
> > > device) and submits 1k exec IOCTLs per thread, each performing a DW
> > > write. Each exec IOCTL typically does not have unsignaled input dependencies.
> > >
> > > With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
> > >
> > > 8,449 context-switches
> > > 412 cpu-migrations
> > > 2,531.43 msec task-clock
> > > 1,847,846,588 cpu_atom/cycles/
> > > 1,847,856,947 cpu_core/cycles/
> > > <not supported> cpu_atom/instructions/
> > > 460,744,020 cpu_core/instructions/
> > >
> > > With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> > > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
> > >
> > > 8,655 context-switches
> > > 229 cpu-migrations
> > > 2,571.33 msec task-clock
> > > 855,900,607 cpu_atom/cycles/
> > > 855,900,272 cpu_core/cycles/
> > > <not supported> cpu_atom/instructions/
> > > 403,651,469 cpu_core/instructions/
> > >
> > > With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> > > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> > > DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
> > >
> > > 5,361 context-switches
> > > 169 cpu-migrations
> > > 2,577.44 msec task-clock
> > > 685,769,153 cpu_atom/cycles/
> > > 685,768,407 cpu_core/cycles/
> > > <not supported> cpu_atom/instructions/
> > > 321,336,297 cpu_core/instructions/
> >
> > Thanks for sharing those numbers. For completeness, can you also add the
> > "With IRQ putting of jobs on + no bypass" case?
> >
>
> Yes, I will also share a DRM sched baseline, and I figured out that
> power can be measured as well; initial results confirm what I expected:
> less power.
>
> I'm putting together a doc based on running glxgears and another
> benchmark on top of Ubuntu 24.10 + Wayland, which has explicit sync
> (linux-drm-syncobj; behaves like SurfaceFlinger when the rendering flag
> is set to not pass fences to draw jobs).
>
> Almost have all the data. Will share here once I have it.
>
Here are some numbers based on glxgears and weston-simple-egl.
5 configurations tested:
DRM sched
DRM dep (no opt flags)
DRM dep + bypass flag
DRM dep + IRQ-safe flag
DRM dep + bypass + IRQ-safe flags
Each configuration was run 3× on both glxgears and weston-simple-egl.
Raptor Lake CPU, BMG G21.
Summary:
DRM dep reduces power usage, CPU cycles, and context switches. Enabling
both the bypass and IRQ-safe flags further reduces all of these metrics.
I’d say this test case best models something like scrolling on a phone
or using a laptop for non-GPU-intensive workloads where the screen still
needs to refresh.
I’ve run more intensive benchmarks—glmark2 and Unigine Heaven as well.
The results are somewhat noisy between boots, but I think the same
conclusion holds.
Raw numbers (a bit of a firehose):
DRM sched:
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
303 frames in 5.0 seconds = 60.565 FPS
300 frames in 5.0 seconds = 60.000 FPS
301 frames in 5.0 seconds = 60.001 FPS
Performance counter stats for 'system wide':
71,548 context-switches
1,466 cpu-migrations
320,440.96 msec task-clock
9,140,249,815 cpu_atom/cycles/
9,140,253,058 cpu_core/cycles/
<not supported> cpu_atom/instructions/
7,071,794,806 cpu_core/instructions/
168.76 Joules power/energy-pkg/
57.78 Joules power/energy-cores/
20.029126614 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.642 FPS
300 frames in 5.0 seconds = 59.988 FPS
301 frames in 5.0 seconds = 60.001 FPS
Performance counter stats for 'system wide':
71,720 context-switches
1,581 cpu-migrations
320,530.64 msec task-clock
8,990,313,521 cpu_atom/cycles/
8,990,315,400 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,988,827,285 cpu_core/instructions/
172.15 Joules power/energy-pkg/
58.33 Joules power/energy-cores/
20.034862844 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.741 FPS
299 frames in 5.0 seconds = 59.798 FPS
299 frames in 5.0 seconds = 59.799 FPS
Performance counter stats for 'system wide':
70,871 context-switches
1,980 cpu-migrations
320,558.82 msec task-clock
8,861,481,467 cpu_atom/cycles/
8,861,485,448 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,665,294,516 cpu_core/instructions/
167.82 Joules power/energy-pkg/
56.97 Joules power/energy-cores/
20.035713155 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
27,398 context-switches
678 cpu-migrations
160,255.17 msec task-clock
5,002,546,782 cpu_atom/cycles/
5,002,549,920 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,498,672,077 cpu_core/instructions/
93.41 Joules power/energy-pkg/
23.91 Joules power/energy-cores/
10.017552274 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
300 frames in 5 seconds: 60.000000 fps
Performance counter stats for 'system wide':
27,322 context-switches
580 cpu-migrations
160,307.12 msec task-clock
4,783,734,059 cpu_atom/cycles/
4,783,737,645 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,224,510,206 cpu_core/instructions/
91.89 Joules power/energy-pkg/
23.28 Joules power/energy-cores/
10.020629190 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
300 frames in 5 seconds: 60.000000 fps
Performance counter stats for 'system wide':
27,356 context-switches
573 cpu-migrations
160,362.30 msec task-clock
5,112,653,847 cpu_atom/cycles/
5,112,658,503 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,395,873,668 cpu_core/instructions/
94.40 Joules power/energy-pkg/
24.58 Joules power/energy-cores/
10.023979647 seconds time elapsed
No opt (drm_dep_queue_flags = 0):
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
303 frames in 5.0 seconds = 60.597 FPS
300 frames in 5.0 seconds = 59.989 FPS
297 frames in 5.0 seconds = 59.232 FPS
Performance counter stats for 'system wide':
66,233 context-switches
1,820 cpu-migrations
320,586.39 msec task-clock
9,028,164,726 cpu_atom/cycles/
9,028,178,052 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,541,478,243 cpu_core/instructions/
178.47 Joules power/energy-pkg/
44.18 Joules power/energy-cores/
20.036849235 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.691 FPS
297 frames in 5.0 seconds = 59.393 FPS
300 frames in 5.0 seconds = 59.803 FPS
Performance counter stats for 'system wide':
68,389 context-switches
2,034 cpu-migrations
320,457.18 msec task-clock
8,736,092,056 cpu_atom/cycles/
8,736,096,958 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,511,630,145 cpu_core/instructions/
183.23 Joules power/energy-pkg/
47.43 Joules power/energy-cores/
20.031469459 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
303 frames in 5.0 seconds = 60.458 FPS
299 frames in 5.0 seconds = 59.606 FPS
298 frames in 5.0 seconds = 59.590 FPS
Performance counter stats for 'system wide':
67,692 context-switches
1,877 cpu-migrations
320,524.05 msec task-clock
8,837,946,224 cpu_atom/cycles/
8,837,949,628 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,018,812,170 cpu_core/instructions/
187.63 Joules power/energy-pkg/
46.76 Joules power/energy-cores/
20.034428856 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
27,259 context-switches
313 cpu-migrations
160,538.29 msec task-clock
5,079,653,975 cpu_atom/cycles/
5,079,657,432 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,166,877,411 cpu_core/instructions/
90.72 Joules power/energy-pkg/
21.70 Joules power/energy-cores/
10.034716719 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
300 frames in 5 seconds: 60.000000 fps
Performance counter stats for 'system wide':
26,933 context-switches
449 cpu-migrations
160,334.74 msec task-clock
4,851,027,105 cpu_atom/cycles/
4,851,054,678 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,042,177,215 cpu_core/instructions/
87.33 Joules power/energy-pkg/
21.85 Joules power/energy-cores/
10.021873082 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
27,101 context-switches
351 cpu-migrations
160,333.98 msec task-clock
4,903,047,240 cpu_atom/cycles/
4,903,055,111 cpu_core/cycles/
<not supported> cpu_atom/instructions/
2,884,284,727 cpu_core/instructions/
87.68 Joules power/energy-pkg/
21.36 Joules power/energy-cores/
10.021938190 seconds time elapsed
Bypass (drm_dep_queue_flags = DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.718 FPS
299 frames in 5.0 seconds = 59.615 FPS
299 frames in 5.0 seconds = 59.795 FPS
Performance counter stats for 'system wide':
56,788 context-switches
2,576 cpu-migrations
320,610.02 msec task-clock
9,056,383,522 cpu_atom/cycles/
9,056,385,629 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,285,652,796 cpu_core/instructions/
164.29 Joules power/energy-pkg/
44.70 Joules power/energy-cores/
20.041318795 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.734 FPS
300 frames in 5.0 seconds = 59.983 FPS
300 frames in 5.0 seconds = 60.000 FPS
Performance counter stats for 'system wide':
56,388 context-switches
2,326 cpu-migrations
320,581.07 msec task-clock
8,789,215,827 cpu_atom/cycles/
8,789,217,484 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,251,346,200 cpu_core/instructions/
162.67 Joules power/energy-pkg/
44.30 Joules power/energy-cores/
20.037648324 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
305 frames in 5.0 seconds = 60.950 FPS
300 frames in 5.0 seconds = 59.993 FPS
300 frames in 5.0 seconds = 59.806 FPS
Performance counter stats for 'system wide':
56,167 context-switches
2,434 cpu-migrations
320,594.69 msec task-clock
8,700,873,664 cpu_atom/cycles/
8,700,877,150 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,405,556,662 cpu_core/instructions/
162.55 Joules power/energy-pkg/
43.33 Joules power/energy-cores/
20.038448851 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
24,747 context-switches
1,254 cpu-migrations
160,543.42 msec task-clock
5,047,832,024 cpu_atom/cycles/
5,047,823,996 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,124,591,155 cpu_core/instructions/
80.28 Joules power/energy-pkg/
21.49 Joules power/energy-cores/
10.034654628 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
300 frames in 5 seconds: 60.000000 fps
Performance counter stats for 'system wide':
24,953 context-switches
921 cpu-migrations
160,375.32 msec task-clock
5,197,283,835 cpu_atom/cycles/
5,197,287,623 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,393,363,950 cpu_core/instructions/
83.36 Joules power/energy-pkg/
21.92 Joules power/energy-cores/
10.024899366 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
298 frames in 5 seconds: 59.599998 fps
Performance counter stats for 'system wide':
24,576 context-switches
966 cpu-migrations
160,339.37 msec task-clock
4,915,705,971 cpu_atom/cycles/
4,915,709,503 cpu_core/cycles/
<not supported> cpu_atom/instructions/
2,968,947,722 cpu_core/instructions/
79.96 Joules power/energy-pkg/
21.08 Joules power/energy-cores/
10.022743041 seconds time elapsed
IRQ (drm_dep_queue_flags = DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.643 FPS
298 frames in 5.0 seconds = 59.599 FPS
295 frames in 5.0 seconds = 58.998 FPS
Performance counter stats for 'system wide':
60,305 context-switches
1,994 cpu-migrations
320,528.79 msec task-clock
8,518,549,937 cpu_atom/cycles/
8,518,573,906 cpu_core/cycles/
<not supported> cpu_atom/instructions/
5,813,890,066 cpu_core/instructions/
184.52 Joules power/energy-pkg/
40.79 Joules power/energy-cores/
20.032795872 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.759 FPS
299 frames in 5.0 seconds = 59.790 FPS
301 frames in 5.0 seconds = 60.003 FPS
Performance counter stats for 'system wide':
59,401 context-switches
2,256 cpu-migrations
320,475.03 msec task-clock
8,581,759,828 cpu_atom/cycles/
8,581,763,986 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,748,269,548 cpu_core/instructions/
179.76 Joules power/energy-pkg/
40.66 Joules power/energy-cores/
20.029861532 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.653 FPS
298 frames in 5.0 seconds = 59.404 FPS
300 frames in 5.0 seconds = 59.990 FPS
Performance counter stats for 'system wide':
59,381 context-switches
1,800 cpu-migrations
320,616.35 msec task-clock
8,829,473,025 cpu_atom/cycles/
8,829,477,019 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,505,926,710 cpu_core/instructions/
180.38 Joules power/energy-pkg/
40.86 Joules power/energy-cores/
20.040016190 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
298 frames in 5 seconds: 59.599998 fps
Performance counter stats for 'system wide':
27,341 context-switches
786 cpu-migrations
160,478.01 msec task-clock
4,681,440,843 cpu_atom/cycles/
4,681,443,905 cpu_core/cycles/
<not supported> cpu_atom/instructions/
2,969,039,615 cpu_core/instructions/
91.74 Joules power/energy-pkg/
20.84 Joules power/energy-cores/
10.031116623 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
24,626 context-switches
429 cpu-migrations
160,367.44 msec task-clock
4,828,015,355 cpu_atom/cycles/
4,828,019,887 cpu_core/cycles/
<not supported> cpu_atom/instructions/
2,675,419,833 cpu_core/instructions/
90.35 Joules power/energy-pkg/
21.10 Joules power/energy-cores/
10.024476921 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
300 frames in 5 seconds: 60.000000 fps
Performance counter stats for 'system wide':
24,679 context-switches
340 cpu-migrations
160,303.90 msec task-clock
4,500,129,961 cpu_atom/cycles/
4,500,132,697 cpu_core/cycles/
<not supported> cpu_atom/instructions/
2,766,150,592 cpu_core/instructions/
88.01 Joules power/energy-pkg/
19.76 Joules power/energy-cores/
10.019653353 seconds time elapsed
IRQ plus bypass (drm_dep_queue_flags = DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED | DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
305 frames in 5.0 seconds = 60.958 FPS
299 frames in 5.0 seconds = 59.607 FPS
299 frames in 5.0 seconds = 59.603 FPS
Performance counter stats for 'system wide':
46,934 context-switches
1,558 cpu-migrations
320,569.83 msec task-clock
7,976,414,449 cpu_atom/cycles/
7,976,417,934 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,126,973,947 cpu_core/instructions/
178.36 Joules power/energy-pkg/
40.10 Joules power/energy-cores/
20.037681420 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
304 frames in 5.0 seconds = 60.696 FPS
299 frames in 5.0 seconds = 59.616 FPS
299 frames in 5.0 seconds = 59.781 FPS
Performance counter stats for 'system wide':
47,691 context-switches
1,994 cpu-migrations
320,602.83 msec task-clock
8,270,567,663 cpu_atom/cycles/
8,270,572,484 cpu_core/cycles/
<not supported> cpu_atom/instructions/
4,361,204,861 cpu_core/instructions/
181.56 Joules power/energy-pkg/
40.16 Joules power/energy-cores/
20.038511163 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 20s glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
305 frames in 5.0 seconds = 60.911 FPS
298 frames in 5.0 seconds = 59.597 FPS
300 frames in 5.0 seconds = 59.803 FPS
Performance counter stats for 'system wide':
47,129 context-switches
1,921 cpu-migrations
320,491.09 msec task-clock
8,054,513,204 cpu_atom/cycles/
8,054,518,711 cpu_core/cycles/
<not supported> cpu_atom/instructions/
6,131,796,639 cpu_core/instructions/
178.54 Joules power/energy-pkg/
40.08 Joules power/energy-cores/
20.032444923 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
21,991 context-switches
286 cpu-migrations
160,343.73 msec task-clock
4,497,475,288 cpu_atom/cycles/
4,497,477,011 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,042,007,163 cpu_core/instructions/
89.14 Joules power/energy-pkg/
20.09 Joules power/energy-cores/
10.021642254 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
300 frames in 5 seconds: 60.000000 fps
Performance counter stats for 'system wide':
22,366 context-switches
225 cpu-migrations
160,386.68 msec task-clock
4,398,432,348 cpu_atom/cycles/
4,398,435,205 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,086,156,274 cpu_core/instructions/
89.07 Joules power/energy-pkg/
19.68 Joules power/energy-cores/
10.024827902 seconds time elapsed
root@DUT6235BMGFRD:mbrost# ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions,power/energy-pkg/,power/energy-cores/ timeout 10 weston-simple-egl -f
Using config: r8g8b8a8
has EGL_EXT_buffer_age and EGL_EXT_swap_buffers_with_damage
has EGL_EXT_surface_compression
299 frames in 5 seconds: 59.799999 fps
Performance counter stats for 'system wide':
22,515 context-switches
286 cpu-migrations
160,481.91 msec task-clock
4,447,740,222 cpu_atom/cycles/
4,447,743,314 cpu_core/cycles/
<not supported> cpu_atom/instructions/
3,217,285,071 cpu_core/instructions/
90.15 Joules power/energy-pkg/
19.65 Joules power/energy-cores/
10.029135743 seconds time elapsed
Matt
> > I'm a bit surprised by the difference in number of context switches
> > given I'd expect the local-CPU to be picked in priority, and so queuing
> > work items on the same wq from another work item to be almost free in
> > terms of scheduling. But I guess there's some load-balancing happening
> > when you execute jobs at such a high rate.
> >
> > Also, I don't know if that's just noise or if it's reproducible, but
> > task-clock seems to be ~40usec lower with the deferred cleanup and
> > no-bypass (higher throughput because you're not blocking the dequeuing
> > of the next job on the cleanup of the previous one, I suspect).
>
> I think that is just noise from what the test is doing in user space -
> that bounces around a bit.
>
> Matt
>
> >
* Re: [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations
2026-03-16 4:32 ` [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations Matthew Brost
@ 2026-03-25 15:59 ` Tejun Heo
2026-03-26 1:49 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2026-03-25 15:59 UTC (permalink / raw)
To: Matthew Brost; +Cc: intel-xe, dri-devel, Lai Jiangshan, linux-kernel
Sorry about the tardiness. Traveling during spring break. Getting more than
I can catch up with each day.
On Sun, Mar 15, 2026 at 09:32:44PM -0700, Matthew Brost wrote:
> @@ -403,6 +403,7 @@ enum wq_flags {
> */
> WQ_POWER_EFFICIENT = 1 << 7,
> WQ_PERCPU = 1 << 8, /* bound to a specific cpu */
> + WQ_MEM_WARN_ON_RECLAIM = 1 << 9, /* teach lockdep to warn on reclaim */
Shouldn't this require WQ_MEM_RECLAIM?
> +/**
> + * workqueue_is_reclaim_annotated() - Test whether a workqueue is annotated for
> + * reclaim safety
> + * @wq: workqueue to test
> + *
> + * Returns true if @wq's flags have both %WQ_MEM_WARN_ON_RECLAIM and
> + * %WQ_MEM_RECLAIM set. A workqueue marked with these flags indicates that it
> + * participates in reclaim paths, and therefore must not perform memory
> + * allocations that can recurse into reclaim (e.g., GFP_KERNEL is not allowed).
> + *
> + * Drivers can use this helper to enforce reclaim-safe behavior on workqueues
> + * that are created or provided elsewhere in the code.
> + *
> + * Return:
> + * true if the workqueue is reclaim-annotated, false otherwise.
> + */
> +bool workqueue_is_reclaim_annotated(struct workqueue_struct *wq)
> +{
> + return (wq->flags & WQ_MEM_WARN_ON_RECLAIM) &&
> + (wq->flags & WQ_MEM_RECLAIM);
> +}
> +EXPORT_SYMBOL_GPL(workqueue_is_reclaim_annotated);
Why is this function necessary? It feels rather odd to use wq as the source
of this information. Shouldn't that be an innate knowledge of the code
that's using this?
Thanks.
--
tejun
* Re: [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations
2026-03-25 15:59 ` Tejun Heo
@ 2026-03-26 1:49 ` Matthew Brost
2026-03-26 2:19 ` Tejun Heo
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Brost @ 2026-03-26 1:49 UTC (permalink / raw)
To: Tejun Heo; +Cc: intel-xe, dri-devel, Lai Jiangshan, linux-kernel
On Wed, Mar 25, 2026 at 05:59:54AM -1000, Tejun Heo wrote:
> Sorry about the tardiness. Traveling during spring break. Getting more than
> I can catch up with each day.
>
> On Sun, Mar 15, 2026 at 09:32:44PM -0700, Matthew Brost wrote:
> > @@ -403,6 +403,7 @@ enum wq_flags {
> > */
> > WQ_POWER_EFFICIENT = 1 << 7,
> > WQ_PERCPU = 1 << 8, /* bound to a specific cpu */
> > + WQ_MEM_WARN_ON_RECLAIM = 1 << 9, /* teach lockdep to warn on reclaim */
>
> Shouldn't this require WQ_MEM_RECLAIM?
>
Yes, so what is the suggestion here? If WQ_MEM_WARN_ON_RECLAIM is set
without WQ_MEM_RECLAIM, fail the WQ creation with -EINVAL?
> > +/**
> > + * workqueue_is_reclaim_annotated() - Test whether a workqueue is annotated for
> > + * reclaim safety
> > + * @wq: workqueue to test
> > + *
> > + * Returns true if @wq's flags have both %WQ_MEM_WARN_ON_RECLAIM and
> > + * %WQ_MEM_RECLAIM set. A workqueue marked with these flags indicates that it
> > + * participates in reclaim paths, and therefore must not perform memory
> > + * allocations that can recurse into reclaim (e.g., GFP_KERNEL is not allowed).
> > + *
> > + * Drivers can use this helper to enforce reclaim-safe behavior on workqueues
> > + * that are created or provided elsewhere in the code.
> > + *
> > + * Return:
> > + * true if the workqueue is reclaim-annotated, false otherwise.
> > + */
> > +bool workqueue_is_reclaim_annotated(struct workqueue_struct *wq)
> > +{
> > + return (wq->flags & WQ_MEM_WARN_ON_RECLAIM) &&
> > + (wq->flags & WQ_MEM_RECLAIM);
> > +}
> > +EXPORT_SYMBOL_GPL(workqueue_is_reclaim_annotated);
>
> Why is this function necessary? It feels rather odd to use wq as the source
> of this information. Shouldn't that be an innate knowledge of the code
> that's using this?
This, for example, would be used in DRM sched (the existing scheduler)
or DRM dep (the proposed replacement) to ensure that driver-allocated
WQs passed into the layers are created with these flags. DRM sched or
DRM dep has strict DMA-fencing, thus reclaim rule that we expect DRM
drivers to follow. Historically, DRM drivers have broken these rules
quite often, and we no longer want to give them the opportunity to do
so—lockdep should enforce them.
Matt
>
> Thanks.
>
> --
> tejun
* Re: [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations
2026-03-26 1:49 ` Matthew Brost
@ 2026-03-26 2:19 ` Tejun Heo
2026-03-27 4:33 ` Matthew Brost
0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2026-03-26 2:19 UTC (permalink / raw)
To: Matthew Brost; +Cc: intel-xe, dri-devel, Lai Jiangshan, linux-kernel
Hello,
On Wed, Mar 25, 2026 at 06:49:59PM -0700, Matthew Brost wrote:
> On Wed, Mar 25, 2026 at 05:59:54AM -1000, Tejun Heo wrote:
> > Sorry about the tardiness. Traveling during spring break. Getting more than
> > I can catch up with each day.
> >
> > On Sun, Mar 15, 2026 at 09:32:44PM -0700, Matthew Brost wrote:
> > > @@ -403,6 +403,7 @@ enum wq_flags {
> > > */
> > > WQ_POWER_EFFICIENT = 1 << 7,
> > > WQ_PERCPU = 1 << 8, /* bound to a specific cpu */
> > > + WQ_MEM_WARN_ON_RECLAIM = 1 << 9, /* teach lockdep to warn on reclaim */
> >
> > Shouldn't this require WQ_MEM_RECLAIM?
>
> Yes, so what is the suggestion here? If WQ_MEM_WARN_ON_RECLAIM is set
> without WQ_MEM_RECLAIM, fail the WQ creation with -EINVAL?
Yes.
> > Why is this function necessary? It feels rather odd to use wq as the source
> > of this information. Shouldn't that be an innate knowledge of the code
> > that's using this?
>
> This, for example, would be used in DRM sched (the existing scheduler)
> or DRM dep (the proposed replacement) to ensure that driver-allocated
> WQs passed into the layers are created with these flags. DRM sched or
> DRM dep has strict DMA-fencing, thus reclaim rule that we expect DRM
> drivers to follow. Historically, DRM drivers have broken these rules
> quite often, and we no longer want to give them the opportunity to do
> so—lockdep should enforce them.
I see. Yeah, that makes sense. Please feel free to add
Acked-by: Tejun Heo <tj@kernel.org>
Please let me know how you wanna route the patch.
Thanks.
--
tejun
* Re: [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations
2026-03-26 2:19 ` Tejun Heo
@ 2026-03-27 4:33 ` Matthew Brost
0 siblings, 0 replies; 50+ messages in thread
From: Matthew Brost @ 2026-03-27 4:33 UTC (permalink / raw)
To: Tejun Heo; +Cc: intel-xe, dri-devel, Lai Jiangshan, linux-kernel
On Wed, Mar 25, 2026 at 04:19:44PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 25, 2026 at 06:49:59PM -0700, Matthew Brost wrote:
> > On Wed, Mar 25, 2026 at 05:59:54AM -1000, Tejun Heo wrote:
> > > Sorry about the tardiness. Traveling during spring break. Getting more than
> > > I can catch up with each day.
> > >
> > > On Sun, Mar 15, 2026 at 09:32:44PM -0700, Matthew Brost wrote:
> > > > @@ -403,6 +403,7 @@ enum wq_flags {
> > > > */
> > > > WQ_POWER_EFFICIENT = 1 << 7,
> > > > WQ_PERCPU = 1 << 8, /* bound to a specific cpu */
> > > > + WQ_MEM_WARN_ON_RECLAIM = 1 << 9, /* teach lockdep to warn on reclaim */
> > >
> > > Shouldn't this require WQ_MEM_RECLAIM?
> >
> > Yes, so what is the suggestion here? If WQ_MEM_WARN_ON_RECLAIM is set
> > without WQ_MEM_RECLAIM, fail the WQ creation with -EINVAL?
>
> Yes.
>
> > > Why is this function necessary? It feels rather odd to use wq as the source
> > > of this information. Shouldn't that be an innate knowledge of the code
> > > that's using this?
> >
> > This, for example, would be used in DRM sched (the existing scheduler)
> > or DRM dep (the proposed replacement) to ensure that driver-allocated
> > WQs passed into the layers are created with these flags. DRM sched or
> > DRM dep has strict DMA-fencing, thus reclaim rule that we expect DRM
> > drivers to follow. Historically, DRM drivers have broken these rules
> > quite often, and we no longer want to give them the opportunity to do
> > so—lockdep should enforce them.
>
> I see. Yeah, that makes sense. Please feel free to add
>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> Please let me know how you wanna route the patch.
>
Can I send an independent patch addressing the WQ_MEM_WARN_ON_RECLAIM
usage above for you to pull into 7.01? I’m still working through a few
issues on my driver (Xe) side to enable this, and the larger DRM-level
changes will take some time since pretty much every driver is doing
something wrong.
Alternatively, we could defer this to 7.02 and merge it through DRM,
since the 7.01 DRM cycle is closing today. Either option works for me.
Matt
> Thanks.
>
> --
> tejun
end of thread, other threads:[~2026-03-27 4:33 UTC | newest]
Thread overview: 50+ messages
[not found] <20260316043255.226352-1-matthew.brost@intel.com>
2026-03-16 4:32 ` [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations Matthew Brost
2026-03-25 15:59 ` Tejun Heo
2026-03-26 1:49 ` Matthew Brost
2026-03-26 2:19 ` Tejun Heo
2026-03-27 4:33 ` Matthew Brost
2026-03-16 4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
2026-03-16 9:16 ` Boris Brezillon
2026-03-17 5:22 ` Matthew Brost
2026-03-17 8:48 ` Boris Brezillon
2026-03-16 10:25 ` Danilo Krummrich
2026-03-17 5:10 ` Matthew Brost
2026-03-17 12:19 ` Danilo Krummrich
2026-03-18 23:02 ` Matthew Brost
2026-03-17 2:47 ` Daniel Almeida
2026-03-17 5:45 ` Matthew Brost
2026-03-17 7:17 ` Miguel Ojeda
2026-03-17 8:26 ` Matthew Brost
2026-03-17 12:04 ` Daniel Almeida
2026-03-17 19:41 ` Miguel Ojeda
2026-03-23 17:31 ` Matthew Brost
2026-03-23 17:42 ` Miguel Ojeda
2026-03-17 18:14 ` Matthew Brost
2026-03-17 19:48 ` Daniel Almeida
2026-03-17 20:43 ` Boris Brezillon
2026-03-18 22:40 ` Matthew Brost
2026-03-19 9:57 ` Boris Brezillon
2026-03-22 6:43 ` Matthew Brost
2026-03-23 7:58 ` Matthew Brost
2026-03-23 10:06 ` Boris Brezillon
2026-03-23 17:11 ` Matthew Brost
2026-03-17 12:31 ` Danilo Krummrich
2026-03-17 14:25 ` Daniel Almeida
2026-03-17 14:33 ` Danilo Krummrich
2026-03-18 22:50 ` Matthew Brost
2026-03-17 8:47 ` Christian König
2026-03-17 14:55 ` Boris Brezillon
2026-03-18 23:28 ` Matthew Brost
2026-03-19 9:11 ` Boris Brezillon
2026-03-23 4:50 ` Matthew Brost
2026-03-23 9:55 ` Boris Brezillon
2026-03-23 17:08 ` Matthew Brost
2026-03-23 18:38 ` Matthew Brost
2026-03-24 9:23 ` Boris Brezillon
2026-03-24 16:06 ` Matthew Brost
2026-03-25 2:33 ` Matthew Brost
2026-03-24 8:49 ` Boris Brezillon
2026-03-24 16:51 ` Matthew Brost
2026-03-17 16:30 ` Shashank Sharma
2026-03-16 4:32 ` [RFC PATCH 11/12] accel/amdxdna: Convert to drm_dep scheduler layer Matthew Brost
2026-03-16 4:32 ` [RFC PATCH 12/12] drm/panthor: " Matthew Brost