Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Matthew Brost <matthew.brost@intel.com>
To: Boris Brezillon <boris.brezillon@collabora.com>
Cc: "Daniel Almeida" <daniel.almeida@collabora.com>,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Philipp Stanner" <phasta@kernel.org>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	linux-kernel@vger.kernel.org,
	"Sami Tolvanen" <samitolvanen@google.com>,
	"Jeffrey Vander Stoep" <jeffv@google.com>,
	"Alice Ryhl" <aliceryhl@google.com>,
	"Daniel Stone" <daniels@collabora.com>,
	"Alexandre Courbot" <acourbot@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	shashanks@nvidia.com, jajones@nvidia.com,
	"Eliot Courtney" <ecourtney@nvidia.com>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	rust-for-linux <rust-for-linux@vger.kernel.org>
Subject: Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
Date: Wed, 18 Mar 2026 15:40:35 -0700	[thread overview]
Message-ID: <absp4z1Op0BrXReP@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20260317214320.74e6c130@fedora>

On Tue, Mar 17, 2026 at 09:43:20PM +0100, Boris Brezillon wrote:
> Hi Matthew,
> 
> Just a few drive-by comments.
> 
> On Tue, 17 Mar 2026 11:14:36 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > > > expiry for device teardown.
> > > > > 
> > > > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > > > dma_fence callback). Dependency cleanup is deferred to process context
> > > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > > > 
> > > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > > > bail immediately if the queue refcount has already reached zero and
> > > > > async teardown is in flight, preventing use-after-free.  
> > > > 
> > > > In rust, when you queue work, you have to pass a reference-counted pointer
> > > > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > > > queued, the queue is alive.
> > > > 
> > > > By the way, why can’t we simply require synchronous teardowns?  
> > 
> > Consider the case where the DRM dep queue’s refcount drops to zero, but
> > the device firmware still holds references to the associated queue.
> > These are resources that must be torn down asynchronously. In Xe, I need
> > to send two asynchronous firmware commands before I can safely remove
> > the memory associated with the queue (faulting on this kind of global
> > memory will take down the device) and recycle the firmware ID tied to
> > the queue. These async commands are issued on the driver side, on the
> > DRM dep queue’s workqueue as well.
> 
> Asynchronous teardown is okay, but I'm not too sure using the refcnt to
> know that the queue is no longer usable is the way to go. To me the
> refcnt is what determines when the SW object is no longer referenced by
> any other item in the code, and a work item acting on the queue counts
> as one owner of this queue. If you want to cancel the work in order to
> speed up the destruction of the queue, you can call
> {cancel,disable}_work[_sync](), and have the ref dropped if the
> cancel/disable was effective. Multi-step teardown is also an option,
> but again, the state of the queue shouldn't be determined from its
> refcnt IMHO.
> 
> > 
> > Now consider a scenario where something goes wrong and those firmware
> > commands never complete, and a device reset is required to recover. The
> > driver’s per-queue tracking logic stops all queues (including zombie
> > ones), determines which commands were lost, cleans up the side effects
> > of that lost state, and then restarts all queues. That is how we would
> > end up in this work item with a zombie queue. The restart logic could
> > probably be made smart enough to avoid queueing work for zombie queues,
> > but in my opinion it’s safe enough to use kref_get_unless_zero() in the
> > work items.
> 
> Well, that only works for single-step teardown, or when you enter the
> last step. At which point, I'm not too sure it's significantly better
> than encoding the state of the queue through a separate field, and have
> the job queue logic reject new jobs if the queue is no longer usable
> (shouldn't even be exposed to userland at this point though).
> 

'shouldn't even be exposed to userland at this point though' - Yes.

The philosophy of ref counting design is roughly:

- When queue is created by userland call drm_dep_queue_init
- All jobs hold a ref to drm_dep_queue
- When userland close queue, remove it from the FD, call
  drm_dep_queue_put + initiatie teardown (I'd recommned just setting TDR
  to immediately fire, kick off queue from device in first fire + signal
  all fences).
- Queue refcount goes to zero optionality implement
  drm_dep_queue_ops.fini to keep the drm_dep_queue (and object it is
  embedded) around for a bit longer if additional firmware / device side
  resources are still around, call drm_dep_queue_fini when this part
  completes. If drm_dep_queue_ops.fini isn't implement the core
  implementation just calls drm_dep_queue_fini.
- Work item releases drm_dep_queue out dma-fence signaling path for safe
  memory release (e.g., take dma-resv locks).


> > 
> > It should also be clear that a DRM dep queue is primarily intended to be
> > embedded inside the driver’s own queue object, even though it is valid
> > to use it as a standalone object. The async teardown flows are also
> > optional features.
> > 
> > Let’s also consider a case where you do not need the async firmware
> > flows described above, but the DRM dep queue is still embedded in a
> > driver-side object that owns memory via dma-resv. The final queue put
> > may occur in IRQ context (DRM dep avoids kicking a worker just to drop a
> > refi as opt in), or in the reclaim path (any scheduler workqueue is the
> > reclaim path). In either case, you cannot free memory there taking a
> > dma-resv lock, which is why all DRM dep queues ultimately free their
> > resources in a work item outside of reclaim. Many drivers already follow
> > this pattern, but in DRM dep this behavior is built-in.
> 
> I agree deferred cleanup is the way to go.
> 

+1. Yes. I spot a bunch of drivers that open code this part driver side,
Xe included.

> > 
> > So I don’t think Rust natively solves these types of problems, although
> > I’ll concede that it does make refcounting a bit more sane.
> 
> Rust won't magically defer the cleanup, nor will it dictate how you want
> to do the queue teardown, those are things you need to implement. But it
> should give visibility about object lifetimes, and guarantee that an
> object that's still visible to some owners is usable (the notion of
> usable is highly dependent on the object implementation).
> 
> Just a purely theoretical example of a multi-step queue teardown that
> might be possible to encode in rust:
> 
> - MyJobQueue<Usable>: The job queue is currently exposed and usable.
>   There's a ::destroy() method consuming 'self' and returning a
>   MyJobQueue<Destroyed> object
> - MyJobQueue<Destroyed>: The user asked for the workqueue to be
>   destroyed. No new job can be pushed. Existing jobs that didn't make
>   it to the FW queue are cancelled, jobs that are in-flight are
>   cancelled if they can, or are just waited upon if they can't. When
>   the whole destruction step is done, ::destroyed() is called, it
>   consumes 'self' and returns a MyJobQueue<Inactive> object.
> - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
>   any resources on this queue). It's ready to be cleaned up.
>   ::cleanup() (or just ::drop()) defers the cleanup of some inner
>   object that has been passed around between the various
>   MyJobQueue<State> wrappers.
> 
> Each of the state transition can happen asynchronously. A state
> transition consumes the object in one state, and returns a new object
> in its new state. None of the transition involves dropping a refcnt,
> ownership is just transferred. The final MyJobQueue<Inactive> object is
> the object we'll defer cleanup on.
> 
> It's a very high-level view of one way this can be implemented (I'm
> sure there are others, probably better than my suggestion) in order to
> make sure the object doesn't go away without the compiler enforcing
> proper state transitions.
> 

I'm sure Rust can implement this. My point about Rust is it doesn't
magically solve hard software arch probles, but I will admit the
ownership model, way it can enforce locking at compile time is pretty
cool.

> > > > > +/**
> > > > > + * DOC: DRM dependency fence
> > > > > + *
> > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > + * completes the job.
> > > > > + *
> > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > + * NULL or an error).  
> > > > 
> > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > carried out by Christian?
> > > >   
> > 
> > Consider the case where a driver’s hardware fence is implemented as a
> > dma-fence-array or dma-fence-chain. You cannot install these types of
> > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > here.
> 
> Hm, so that's a driver returning a dma_fence_array/chain through
> ::run_job()? Why would we not want to have them directly exposed and
> split up into singular fence objects at resv insertion time (I don't
> think syncobjs care, but I might be wrong). I mean, one of the point

You can stick dma-fence-arrays in syncobjs, but not chains.

Neither dma-fence-arrays/chain can go into dma-resv.

Hence why disconnecting a job's finished fence from hardware fence IMO
is good idea to keep so gives drivers flexiblity on the hardware fences.
e.g., If this design didn't have a job's finished fence, I'd have to
open code one Xe side.

> behind the container extraction is so fences coming from the same
> context/timeline can be detected and merged. If you insert the
> container through a proxy, you're defeating the whole fence merging
> optimization.

Right. Finished fences have single timeline too...

> 
> The second thing is that I'm not sure drivers were ever supposed to
> return fence containers in the first place, because the whole idea
> behind a fence context is that fences are emitted/signalled in
> seqno-order, and if the fence is encoding the state of multiple
> timelines that progress at their own pace, it becomes tricky to control
> that. I guess if it's always the same set of timelines that are
> combined, that would work.

Xe does this is definitely works. We submit to multiple rings, when all
rings signal a seqno, a chain or array signals -> finished fence
signals. The queues used in this manor can only submit multiple ring
jobs so the finished fence timeline stays intact. If you could a
multiple rings followed by a single ring submission on the same queue,
yes this could break.

> 
> > One example is when a single job submits work to multiple rings
> > that are flipped in hardware at the same time.
> 
> We do have that in Panthor, but that's all explicit: in a single
> SUBMIT, you can have multiple jobs targeting different queues, each of
> them having their own set of deps/signal ops. The combination of all the
> signal ops into a container is left to the UMD. It could be automated
> kernel side, but that would be a flag on the SIGNAL op leading to the
> creation of a fence_array containing fences from multiple submitted
> jobs, rather than the driver combining stuff in the fence it returns in
> ::run_job().

See above. We have a dedicated queue type for these type of submissions
and single job that submits to the all rings. We had multiple queue /
jobs in the i915 to implemented this but it turns out it is much cleaner
with a single queue / singler job / multiple rings model.

> 
> > 
> > Another case is late arming of hardware fences in run_job (which many
> > drivers do). The proxy fence is immediately available at arm time and
> > can be installed into dma-resv or syncobjs even though the actual
> > hardware fence is not yet available. I think most drivers could be
> > refactored to make the hardware fence immediately available at run_job,
> > though.
> 
> Yep, I also think we can arm the driver fence early in the case of
> JobQueue. The reason it couldn't be done before is because the
> scheduler was in the middle, deciding which entity to pull the next job
> from, which was changing the seqno a job driver-fence would be assigned
> (you can't guess that at queue time in that case).
> 

Xe doesn't need to late arming, but it look like multiple drivers to
implement the late arming which may be required (?).

> [...]
> 
> > > > > + * **Reference counting**
> > > > > + *
> > > > > + * Jobs and queues are both reference counted.
> > > > > + *
> > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > + * job that was submitted to it.
> > > > > + *
> > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > + * worker, which calls drm_dep_job_put() to release that reference.  
> > > > 
> > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > the reference? We can then release the reference once the job is cleaned up
> > > > (by the queue, using a worker) in process context.  
> > 
> > I think that’s what I’m doing, while also allowing an opt-in path to
> > drop the job reference when it signals (in IRQ context)
> 
> Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> defer the cleanup when you're in an IRQ/atomic context, but defer it
> when you're in a thread context.
> 

The put of a job in this design can be from an IRQ context (opt-in)
feature. xa_destroy blows up if it is called from an IRQ context,
although maybe that could be workaround.

> > so we avoid
> > switching to a work item just to drop a ref. That seems like a
> > significant win in terms of CPU cycles.
> 
> Well, the cleanup path is probably not where latency matters the most.

Agree. But I do think avoiding a CPU context switch (work item) for a
very lightweight job cleanup (usually just drop refs) will save of CPU
cycles, thus also things like power, etc...

> It's adding scheduling overhead, sure, but given all the stuff we defer
> already, I'm not too sure we're at saving a few cycles to get the
> cleanup done immediately. What's important to have is a way to signal
> fences in an atomic context, because this has an impact on latency.
> 

Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in.

> [...]
> 
> > > > > + /*
> > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > + * xa_destroy() would be unsafe.
> > > > > + */  
> > > > 
> > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > via ops->run_job()?
> > > > 
> > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > other than scheduling the queue worker?
> > > >   
> > 
> > Yes, this code is required to support dropping job refs directly in the
> > dma-fence callback (an opt-in feature). Again, this seems like a
> > significant win in terms of CPU cycles, although I haven’t collected
> > data yet.
> 
> If it significantly hurts the perf, I'd like to understand why, because
> to me it looks like pure-cleanup (no signaling involved), and thus no
> other process waiting for us to do the cleanup. The only thing that
> might have an impact is how fast you release the resources, and given
> it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> like to understand which part of the immediate cleanup is causing a
> contention (basically which kind of resources the system is starving of)
> 

It was more of once we moved to a ref counted model, it is pretty
trivial allow drm_dep_job_put when the fence is signaling. It doesn't
really add any complexity either, thus why I added it is.

Matt

> Regards,
> 
> Boris

next prev parent reply	other threads:[~2026-03-18 22:40 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-16  4:32 [RFC PATCH 00/12] Introduce DRM dep queue Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations Matthew Brost
2026-03-25 15:59   ` Tejun Heo
2026-03-26  1:49     ` Matthew Brost
2026-03-26  2:19       ` Tejun Heo
2026-03-27  4:33         ` Matthew Brost
2026-03-27 17:25           ` Tejun Heo
2026-03-16  4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
2026-03-16  9:16   ` Boris Brezillon
2026-03-17  5:22     ` Matthew Brost
2026-03-17  8:48       ` Boris Brezillon
2026-03-16 10:25   ` Danilo Krummrich
2026-03-17  5:10     ` Matthew Brost
2026-03-17 12:19       ` Danilo Krummrich
2026-03-18 23:02         ` Matthew Brost
2026-03-17  2:47   ` Daniel Almeida
2026-03-17  5:45     ` Matthew Brost
2026-03-17  7:17       ` Miguel Ojeda
2026-03-17  8:26         ` Matthew Brost
2026-03-17 12:04           ` Daniel Almeida
2026-03-17 19:41           ` Miguel Ojeda
2026-03-23 17:31             ` Matthew Brost
2026-03-23 17:42               ` Miguel Ojeda
2026-03-17 18:14       ` Matthew Brost
2026-03-17 19:48         ` Daniel Almeida
2026-03-17 20:43         ` Boris Brezillon
2026-03-18 22:40           ` Matthew Brost [this message]
2026-03-19  9:57             ` Boris Brezillon
2026-03-22  6:43               ` Matthew Brost
2026-03-23  7:58                 ` Matthew Brost
2026-03-23 10:06                   ` Boris Brezillon
2026-03-23 17:11                     ` Matthew Brost
2026-03-17 12:31     ` Danilo Krummrich
2026-03-17 14:25       ` Daniel Almeida
2026-03-17 14:33         ` Danilo Krummrich
2026-03-18 22:50           ` Matthew Brost
2026-03-17  8:47   ` Christian König
2026-03-17 14:55   ` Boris Brezillon
2026-03-18 23:28     ` Matthew Brost
2026-03-19  9:11       ` Boris Brezillon
2026-03-23  4:50         ` Matthew Brost
2026-03-23  9:55           ` Boris Brezillon
2026-03-23 17:08             ` Matthew Brost
2026-03-23 18:38               ` Matthew Brost
2026-03-24  9:23                 ` Boris Brezillon
2026-03-24 16:06                   ` Matthew Brost
2026-03-25  2:33                     ` Matthew Brost
2026-03-24  8:49               ` Boris Brezillon
2026-03-24 16:51                 ` Matthew Brost
2026-03-17 16:30   ` Shashank Sharma
2026-03-16  4:32 ` [RFC PATCH 03/12] drm/xe: Use WQ_MEM_WARN_ON_RECLAIM on all workqueues in the reclaim path Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 04/12] drm/xe: Issue GGTT invalidation under lock in ggtt_node_remove Matthew Brost
2026-03-26  5:45   ` Bhadane, Dnyaneshwar
2026-03-16  4:32 ` [RFC PATCH 05/12] drm/xe: Return fence from xe_sched_job_arm and adjust job references Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 06/12] drm/xe: Convert to DRM dep queue scheduler layer Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 07/12] drm/xe: Make scheduler message lock IRQ-safe Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 08/12] drm/xe: Rework exec queue object on top of DRM dep Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 09/12] drm/xe: Enable IRQ job put in " Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 10/12] drm/xe: Use DRM dep queue kill semantics Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 11/12] accel/amdxdna: Convert to drm_dep scheduler layer Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 12/12] drm/panthor: " Matthew Brost
2026-03-16  4:52 ` ✗ CI.checkpatch: warning for Introduce DRM dep queue Patchwork
2026-03-16  4:53 ` ✓ CI.KUnit: success " Patchwork
2026-03-16  5:28 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-16  8:09 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=absp4z1Op0BrXReP@lstrano-desk.jf.intel.com \
    --to=matthew.brost@intel.com \
    --cc=acourbot@nvidia.com \
    --cc=airlied@gmail.com \
    --cc=aliceryhl@google.com \
    --cc=boris.brezillon@collabora.com \
    --cc=christian.koenig@amd.com \
    --cc=dakr@kernel.org \
    --cc=daniel.almeida@collabora.com \
    --cc=daniels@collabora.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=ecourtney@nvidia.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=jajones@nvidia.com \
    --cc=jeffv@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=joelagnelf@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=mripard@kernel.org \
    --cc=phasta@kernel.org \
    --cc=rodrigo.vivi@intel.com \
    --cc=rust-for-linux@vger.kernel.org \
    --cc=samitolvanen@google.com \
    --cc=shashanks@nvidia.com \
    --cc=simona@ffwll.ch \
    --cc=sumit.semwal@linaro.org \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=tvrtko.ursulin@igalia.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.