From: Matthew Brost <matthew.brost@intel.com>
To: Boris Brezillon <boris.brezillon@collabora.com>
Cc: "Daniel Almeida" <daniel.almeida@collabora.com>,
intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Christian König" <christian.koenig@amd.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"David Airlie" <airlied@gmail.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Philipp Stanner" <phasta@kernel.org>,
"Simona Vetter" <simona@ffwll.ch>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
linux-kernel@vger.kernel.org,
"Sami Tolvanen" <samitolvanen@google.com>,
"Jeffrey Vander Stoep" <jeffv@google.com>,
"Alice Ryhl" <aliceryhl@google.com>,
"Daniel Stone" <daniels@collabora.com>,
"Alexandre Courbot" <acourbot@nvidia.com>,
"John Hubbard" <jhubbard@nvidia.com>,
shashanks@nvidia.com, jajones@nvidia.com,
"Eliot Courtney" <ecourtney@nvidia.com>,
"Joel Fernandes" <joelagnelf@nvidia.com>,
rust-for-linux <rust-for-linux@vger.kernel.org>
Subject: Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
Date: Wed, 18 Mar 2026 15:40:35 -0700 [thread overview]
Message-ID: <absp4z1Op0BrXReP@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20260317214320.74e6c130@fedora>
On Tue, Mar 17, 2026 at 09:43:20PM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> Just a few drive-by comments.
>
> On Tue, 17 Mar 2026 11:14:36 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > > > expiry for device teardown.
> > > > >
> > > > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > > > dma_fence callback). Dependency cleanup is deferred to process context
> > > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > > >
> > > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > > > bail immediately if the queue refcount has already reached zero and
> > > > > async teardown is in flight, preventing use-after-free.
> > > >
> > > > In rust, when you queue work, you have to pass a reference-counted pointer
> > > > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > > > queued, the queue is alive.
> > > >
> > > > By the way, why can’t we simply require synchronous teardowns?
> >
> > Consider the case where the DRM dep queue’s refcount drops to zero, but
> > the device firmware still holds references to the associated queue.
> > These are resources that must be torn down asynchronously. In Xe, I need
> > to send two asynchronous firmware commands before I can safely remove
> > the memory associated with the queue (faulting on this kind of global
> > memory will take down the device) and recycle the firmware ID tied to
> > the queue. These async commands are issued on the driver side, on the
> > DRM dep queue’s workqueue as well.
>
> Asynchronous teardown is okay, but I'm not too sure using the refcnt to
> know that the queue is no longer usable is the way to go. To me the
> refcnt is what determines when the SW object is no longer referenced by
> any other item in the code, and a work item acting on the queue counts
> as one owner of this queue. If you want to cancel the work in order to
> speed up the destruction of the queue, you can call
> {cancel,disable}_work[_sync](), and have the ref dropped if the
> cancel/disable was effective. Multi-step teardown is also an option,
> but again, the state of the queue shouldn't be determined from its
> refcnt IMHO.
>
> >
> > Now consider a scenario where something goes wrong and those firmware
> > commands never complete, and a device reset is required to recover. The
> > driver’s per-queue tracking logic stops all queues (including zombie
> > ones), determines which commands were lost, cleans up the side effects
> > of that lost state, and then restarts all queues. That is how we would
> > end up in this work item with a zombie queue. The restart logic could
> > probably be made smart enough to avoid queueing work for zombie queues,
> > but in my opinion it’s safe enough to use kref_get_unless_zero() in the
> > work items.
>
> Well, that only works for single-step teardown, or when you enter the
> last step. At which point, I'm not too sure it's significantly better
> than encoding the state of the queue through a separate field, and have
> the job queue logic reject new jobs if the queue is no longer usable
> (shouldn't even be exposed to userland at this point though).
>
'shouldn't even be exposed to userland at this point though' - Yes.
The philosophy of the ref counting design is roughly:
- When the queue is created by userland, call drm_dep_queue_init.
- All jobs hold a ref to the drm_dep_queue.
- When userland closes the queue, remove it from the FD and call
  drm_dep_queue_put + initiate teardown (I'd recommend just setting the
  TDR to fire immediately, kicking the queue off the device in the first
  fire + signaling all fences).
- When the queue refcount goes to zero, optionally implement
  drm_dep_queue_ops.fini to keep the drm_dep_queue (and the object it is
  embedded in) around a bit longer if additional firmware / device side
  resources are still live, calling drm_dep_queue_fini when that part
  completes. If drm_dep_queue_ops.fini isn't implemented, the core
  implementation just calls drm_dep_queue_fini.
- A work item releases the drm_dep_queue outside the dma-fence signaling
  path for safe memory release (e.g., taking dma-resv locks).
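The lifecycle above can be sketched in userspace Rust, with Arc standing
in for the kref and Drop standing in for the refcount-zero path into
drm_dep_queue_fini(). All names here are illustrative stand-ins, not the
real kernel API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for drm_dep_queue: the shared flag records when the
// refcount-zero teardown (the drm_dep_queue_fini() analogue) has run.
struct DepQueue {
    fini_flag: Arc<AtomicBool>,
}

impl Drop for DepQueue {
    // Runs when the last Arc is dropped -- the refcount-zero point
    // where the core (or ops.fini) ends up in drm_dep_queue_fini().
    fn drop(&mut self) {
        self.fini_flag.store(true, Ordering::SeqCst);
    }
}

// Every job pins the queue for its whole lifetime, per the second
// bullet above.
#[allow(dead_code)]
struct Job {
    queue: Arc<DepQueue>,
}
```

Dropping the userland (FD) reference while a job is still alive does not
run teardown; only the last reference does.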
> >
> > It should also be clear that a DRM dep queue is primarily intended to be
> > embedded inside the driver’s own queue object, even though it is valid
> > to use it as a standalone object. The async teardown flows are also
> > optional features.
> >
> > Let’s also consider a case where you do not need the async firmware
> > flows described above, but the DRM dep queue is still embedded in a
> > driver-side object that owns memory via dma-resv. The final queue put
> > may occur in IRQ context (DRM dep avoids kicking a worker just to drop a
> > ref as an opt-in), or in the reclaim path (any scheduler workqueue is the
> > reclaim path). In either case, you cannot free memory there while taking a
> > dma-resv lock, which is why all DRM dep queues ultimately free their
> > resources in a work item outside of reclaim. Many drivers already follow
> > this pattern, but in DRM dep this behavior is built-in.
>
> I agree deferred cleanup is the way to go.
>
+1. Yes. I spotted a bunch of drivers that open code this part on the
driver side, Xe included.
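A rough userspace sketch of that open-coded pattern, with a std thread +
channel standing in for the kernel workqueue (names are made up for
illustration):

```rust
use std::sync::mpsc;
use std::thread;

// Resources that must not be freed on the signaling/reclaim path.
struct QueueResources(u32);

// The final-put path just hands the object off; the worker is the only
// place the resources are actually dropped, where it is safe to take
// dma-resv-like locks.
fn deferred_release() -> u32 {
    let (tx, rx) = mpsc::channel::<QueueResources>();
    let worker = thread::spawn(move || {
        let res = rx.recv().unwrap();
        res.0 // the actual release happens here, in "process context"
    });
    // "Final put" side (could be IRQ/reclaim): hand off, free nothing.
    tx.send(QueueResources(42)).unwrap();
    worker.join().unwrap()
}
```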
> >
> > So I don’t think Rust natively solves these types of problems, although
> > I’ll concede that it does make refcounting a bit more sane.
>
> Rust won't magically defer the cleanup, nor will it dictate how you want
> to do the queue teardown, those are things you need to implement. But it
> should give visibility about object lifetimes, and guarantee that an
> object that's still visible to some owners is usable (the notion of
> usable is highly dependent on the object implementation).
>
> Just a purely theoretical example of a multi-step queue teardown that
> might be possible to encode in rust:
>
> - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> There's a ::destroy() method consuming 'self' and returning a
> MyJobQueue<Destroyed> object
> - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> destroyed. No new job can be pushed. Existing jobs that didn't make
> it to the FW queue are cancelled, jobs that are in-flight are
> cancelled if they can, or are just waited upon if they can't. When
> the whole destruction step is done, ::destroyed() is called, it
> consumes 'self' and returns a MyJobQueue<Inactive> object.
> - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> any resources on this queue). It's ready to be cleaned up.
> ::cleanup() (or just ::drop()) defers the cleanup of some inner
> object that has been passed around between the various
> MyJobQueue<State> wrappers.
>
> Each of the state transition can happen asynchronously. A state
> transition consumes the object in one state, and returns a new object
> in its new state. None of the transition involves dropping a refcnt,
> ownership is just transferred. The final MyJobQueue<Inactive> object is
> the object we'll defer cleanup on.
>
> It's a very high-level view of one way this can be implemented (I'm
> sure there are others, probably better than my suggestion) in order to
> make sure the object doesn't go away without the compiler enforcing
> proper state transitions.
>
I'm sure Rust can implement this. My point about Rust is that it doesn't
magically solve hard software architecture problems, but I will admit
the ownership model and the way it can enforce locking at compile time
are pretty cool.
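For what it's worth, the typestate sketch above roughly compiles as-is
in userspace Rust. The state and method names follow the example given;
everything else (fw_id, push_job) is hypothetical:

```rust
use std::marker::PhantomData;

struct Usable;
struct Destroyed;
struct Inactive;

// The inner resource (e.g. a firmware ID) is handed between states
// without ever dropping a refcount -- ownership is just transferred.
struct MyJobQueue<State> {
    fw_id: u32,
    _state: PhantomData<State>,
}

impl MyJobQueue<Usable> {
    fn new(fw_id: u32) -> Self {
        Self { fw_id, _state: PhantomData }
    }
    fn push_job(&self) { /* only expressible while Usable */ }
    // Consumes self: push_job() on this queue no longer compiles.
    fn destroy(self) -> MyJobQueue<Destroyed> {
        MyJobQueue { fw_id: self.fw_id, _state: PhantomData }
    }
}

impl MyJobQueue<Destroyed> {
    // Called once in-flight jobs are cancelled or drained.
    fn destroyed(self) -> MyJobQueue<Inactive> {
        MyJobQueue { fw_id: self.fw_id, _state: PhantomData }
    }
}

impl MyJobQueue<Inactive> {
    // Deferred final cleanup; returns the fw_id for recycling.
    fn cleanup(self) -> u32 {
        self.fw_id
    }
}
```

The compiler then enforces the transition order: calling push_job()
after destroy() is a type error rather than a runtime race.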
> > > > > +/**
> > > > > + * DOC: DRM dependency fence
> > > > > + *
> > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > + * completes the job.
> > > > > + *
> > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > + * NULL or an error).
> > > >
> > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > carried out by Christian?
> > > >
> >
> > Consider the case where a driver’s hardware fence is implemented as a
> > dma-fence-array or dma-fence-chain. You cannot install these types of
> > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > here.
>
> Hm, so that's a driver returning a dma_fence_array/chain through
> ::run_job()? Why would we not want to have them directly exposed and
> split up into singular fence objects at resv insertion time (I don't
> think syncobjs care, but I might be wrong). I mean, one of the point
You can stick dma-fence-arrays in syncobjs, but not chains.
Neither dma-fence-arrays nor chains can go into a dma-resv.
Hence why, IMO, decoupling a job's finished fence from its hardware
fence is a good idea to keep, as it gives drivers flexibility on their
hardware fences. e.g., if this design didn't have a job's finished
fence, I'd have to open code one on the Xe side.
> behind the container extraction is so fences coming from the same
> context/timeline can be detected and merged. If you insert the
> container through a proxy, you're defeating the whole fence merging
> optimization.
Right. Finished fences have a single timeline too...
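A userspace sketch of the decoupling being discussed: plain structs
stand in for dma_fence / dma_fence_array, and signal-on-poll replaces
the real drm_dep_job_done_cb() callback. All names are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// One-shot fence stand-in.
#[derive(Default)]
struct Fence(AtomicBool);

impl Fence {
    fn signal(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn signaled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

// Stand-in for dma_fence_array: signaled when all members are.
struct FenceArray(Vec<Arc<Fence>>);

impl FenceArray {
    fn signaled(&self) -> bool {
        self.0.iter().all(|f| f.signaled())
    }
}

struct Job {
    parent: FenceArray,   // hardware fence (possibly a container)
    finished: Arc<Fence>, // the single fence everyone else sees
}

impl Job {
    // Models the parent fence's completion callback: the finished
    // fence only signals once the whole container has signaled.
    fn poll_done(&self) {
        if self.parent.signaled() {
            self.finished.signal();
        }
    }
}
```

The finished fence is what would be installed into a dma-resv or
syncobj; the container never escapes the job.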
>
> The second thing is that I'm not sure drivers were ever supposed to
> return fence containers in the first place, because the whole idea
> behind a fence context is that fences are emitted/signalled in
> seqno-order, and if the fence is encoding the state of multiple
> timelines that progress at their own pace, it becomes tricky to control
> that. I guess if it's always the same set of timelines that are
> combined, that would work.
Xe does this and it definitely works. We submit to multiple rings;
when all rings signal a seqno, a chain or array signals -> the finished
fence signals. The queues used in this manner can only submit
multi-ring jobs, so the finished fence timeline stays intact. If you
could mix a multi-ring submission followed by a single-ring submission
on the same queue, yes, this could break.
>
> > One example is when a single job submits work to multiple rings
> > that are flipped in hardware at the same time.
>
> We do have that in Panthor, but that's all explicit: in a single
> SUBMIT, you can have multiple jobs targeting different queues, each of
> them having their own set of deps/signal ops. The combination of all the
> signal ops into a container is left to the UMD. It could be automated
> kernel side, but that would be a flag on the SIGNAL op leading to the
> creation of a fence_array containing fences from multiple submitted
> jobs, rather than the driver combining stuff in the fence it returns in
> ::run_job().
See above. We have a dedicated queue type for these types of
submissions and a single job that submits to all the rings. We had
multiple queues / jobs in i915 to implement this, but it turns out it
is much cleaner with a single queue / single job / multiple rings
model.
>
> >
> > Another case is late arming of hardware fences in run_job (which many
> > drivers do). The proxy fence is immediately available at arm time and
> > can be installed into dma-resv or syncobjs even though the actual
> > hardware fence is not yet available. I think most drivers could be
> > refactored to make the hardware fence immediately available at run_job,
> > though.
>
> Yep, I also think we can arm the driver fence early in the case of
> JobQueue. The reason it couldn't be done before is because the
> scheduler was in the middle, deciding which entity to pull the next job
> from, which was changing the seqno a job driver-fence would be assigned
> (you can't guess that at queue time in that case).
>
Xe doesn't need the late arming, but it looks like multiple drivers
implement late arming, which may be required (?).
> [...]
>
> > > > > + * **Reference counting**
> > > > > + *
> > > > > + * Jobs and queues are both reference counted.
> > > > > + *
> > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > + * job that was submitted to it.
> > > > > + *
> > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > >
> > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > the reference? We can then release the reference once the job is cleaned up
> > > > (by the queue, using a worker) in process context.
> >
> > I think that’s what I’m doing, while also allowing an opt-in path to
> > drop the job reference when it signals (in IRQ context)
>
> Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> defer the cleanup when you're in an IRQ/atomic context, but defer it
> when you're in a thread context.
>
The put of a job in this design can happen from an IRQ context (an
opt-in feature). xa_destroy() blows up if it is called from an IRQ
context, although maybe that could be worked around.
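A userspace sketch of the ordering this implies: the job's input
dependency table (the xarray in the real code) is drained while still in
process context, so the final put -- possibly from a fence callback in
IRQ context -- only drops refs. Names here are stand-ins:

```rust
use std::collections::HashMap;

// Stand-in for the job's input-dependency xarray.
struct Job {
    deps: HashMap<u64, &'static str>,
}

impl Job {
    // Process context, just before the job goes on the pending list:
    // drain the table so nothing heavyweight is left for the final put
    // (this is where the xa_destroy()-equivalent work happens).
    fn drop_deps(&mut self) {
        self.deps.clear();
    }

    // Models the final put, which may run from a fence callback in
    // IRQ context: with deps already drained it only drops refs.
    // Returns true when no IRQ-unsafe cleanup remains.
    fn final_put(self) -> bool {
        self.deps.is_empty()
    }
}
```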
> > so we avoid
> > switching to a work item just to drop a ref. That seems like a
> > significant win in terms of CPU cycles.
>
> Well, the cleanup path is probably not where latency matters the most.
Agree. But I do think avoiding a CPU context switch (work item) for a
very lightweight job cleanup (usually just dropping refs) will save
CPU cycles, and thus also things like power, etc...
> It's adding scheduling overhead, sure, but given all the stuff we defer
> already, I'm not too sure we're at saving a few cycles to get the
> cleanup done immediately. What's important to have is a way to signal
> fences in an atomic context, because this has an impact on latency.
>
Yes. The signaling happens first, then drm_dep_job_put() if the IRQ
opt-in is set.
> [...]
>
> > > > > + /*
> > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > + * xa_destroy() would be unsafe.
> > > > > + */
> > > >
> > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > via ops->run_job()?
> > > >
> > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > other than scheduling the queue worker?
> > > >
> >
> > Yes, this code is required to support dropping job refs directly in the
> > dma-fence callback (an opt-in feature). Again, this seems like a
> > significant win in terms of CPU cycles, although I haven’t collected
> > data yet.
>
> If it significantly hurts the perf, I'd like to understand why, because
> to me it looks like pure-cleanup (no signaling involved), and thus no
> other process waiting for us to do the cleanup. The only thing that
> might have an impact is how fast you release the resources, and given
> it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> like to understand which part of the immediate cleanup is causing a
> contention (basically which kind of resources the system is starving of)
>
It was more that once we moved to a ref counted model, it was pretty
trivial to allow drm_dep_job_put() while the fence is signaling. It
doesn't really add any complexity either, which is why I added it.
Matt
> Regards,
>
> Boris