public inbox for linux-kernel@vger.kernel.org
From: Boris Brezillon <boris.brezillon@collabora.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: "Daniel Almeida" <daniel.almeida@collabora.com>,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Philipp Stanner" <phasta@kernel.org>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	linux-kernel@vger.kernel.org,
	"Sami Tolvanen" <samitolvanen@google.com>,
	"Jeffrey Vander Stoep" <jeffv@google.com>,
	"Alice Ryhl" <aliceryhl@google.com>,
	"Daniel Stone" <daniels@collabora.com>,
	"Alexandre Courbot" <acourbot@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	shashanks@nvidia.com, jajones@nvidia.com,
	"Eliot Courtney" <ecourtney@nvidia.com>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	rust-for-linux <rust-for-linux@vger.kernel.org>
Subject: Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
Date: Tue, 17 Mar 2026 21:43:20 +0100	[thread overview]
Message-ID: <20260317214320.74e6c130@fedora> (raw)
In-Reply-To: <abmaDElHmbQkYy57@lstrano-desk.jf.intel.com>

Hi Matthew,

Just a few drive-by comments.

On Tue, 17 Mar 2026 11:14:36 -0700
Matthew Brost <matthew.brost@intel.com> wrote:

> > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > > expiry for device teardown.
> > > > 
> > > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > > dma_fence callback). Dependency cleanup is deferred to process context
> > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > > 
> > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > > bail immediately if the queue refcount has already reached zero and
> > > > async teardown is in flight, preventing use-after-free.  
> > > 
> > > In rust, when you queue work, you have to pass a reference-counted pointer
> > > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > > queued, the queue is alive.
> > > 
> > > By the way, why can’t we simply require synchronous teardowns?  
> 
> Consider the case where the DRM dep queue’s refcount drops to zero, but
> the device firmware still holds references to the associated queue.
> These are resources that must be torn down asynchronously. In Xe, I need
> to send two asynchronous firmware commands before I can safely remove
> the memory associated with the queue (faulting on this kind of global
> memory will take down the device) and recycle the firmware ID tied to
> the queue. These async commands are issued on the driver side, on the
> DRM dep queue’s workqueue as well.

Asynchronous teardown is okay, but I'm not too sure using the refcnt to
know that the queue is no longer usable is the way to go. To me the
refcnt is what determines when the SW object is no longer referenced by
any other item in the code, and a work item acting on the queue counts
as one owner of this queue. If you want to cancel the work in order to
speed up the destruction of the queue, you can call
{cancel,disable}_work[_sync](), and have the ref dropped if the
cancel/disable was effective. Multi-step teardown is also an option,
but again, the state of the queue shouldn't be determined from its
refcnt IMHO.
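
In Arc terms, the ownership model described above (a pending work item
holds its own reference to the queue; a successful cancel drops it)
could be sketched roughly like this — purely illustrative, all names
made up:

```rust
// Illustrative sketch: a queued work item owns one strong reference
// (Arc) to the queue, so the queue cannot go away while the work is
// pending. Cancelling the work (the analogue of a successful
// cancel_work()) just drops that reference.
use std::sync::Arc;

struct Queue {
    name: &'static str,
}

struct Work {
    // Some(..) while the work is pending; the pending work owns
    // one reference to the queue.
    queue: Option<Arc<Queue>>,
}

impl Work {
    fn queue_work(queue: Arc<Queue>) -> Self {
        Work { queue: Some(queue) }
    }

    // Returns true if the work was still pending (i.e. the cancel
    // was effective); in that case its queue reference is dropped.
    fn cancel(&mut self) -> bool {
        self.queue.take().is_some()
    }
}
```

The point being that "is the queue still alive" falls out of ownership
(there is a live reference), while "is the queue still usable" would be
tracked separately.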

> 
> Now consider a scenario where something goes wrong and those firmware
> commands never complete, and a device reset is required to recover. The
> driver’s per-queue tracking logic stops all queues (including zombie
> ones), determines which commands were lost, cleans up the side effects
> of that lost state, and then restarts all queues. That is how we would
> end up in this work item with a zombie queue. The restart logic could
> probably be made smart enough to avoid queueing work for zombie queues,
> but in my opinion it’s safe enough to use kref_get_unless_zero() in the
> work items.

Well, that only works for a single-step teardown, or when you enter the
last step. At that point, I'm not too sure it's significantly better
than encoding the state of the queue through a separate field, and
having the job queue logic reject new jobs if the queue is no longer
usable (it shouldn't even be exposed to userland at that point, though).
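
A minimal sketch of that separate-state-field idea, in Rust for
consistency with the rest of this discussion (all names hypothetical):

```rust
// Illustrative sketch: queue usability is tracked in an explicit
// state field (under a lock), independent of the refcount, so job
// submission can be rejected once teardown has started even while
// references to the queue are still live.
use std::sync::Mutex;

#[derive(PartialEq)]
enum QueueState {
    Usable,
    TearingDown,
}

struct JobQueue {
    state: Mutex<QueueState>,
}

impl JobQueue {
    fn new() -> Self {
        JobQueue { state: Mutex::new(QueueState::Usable) }
    }

    // Reject new jobs as soon as the queue is no longer usable,
    // regardless of how many references to it still exist.
    fn push_job(&self) -> Result<(), &'static str> {
        if *self.state.lock().unwrap() != QueueState::Usable {
            return Err("queue is being torn down");
        }
        // ... actually queue the job ...
        Ok(())
    }

    fn start_teardown(&self) {
        *self.state.lock().unwrap() = QueueState::TearingDown;
    }
}
```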

> 
> It should also be clear that a DRM dep queue is primarily intended to be
> embedded inside the driver’s own queue object, even though it is valid
> to use it as a standalone object. The async teardown flows are also
> optional features.
> 
> Let’s also consider a case where you do not need the async firmware
> flows described above, but the DRM dep queue is still embedded in a
> driver-side object that owns memory via dma-resv. The final queue put
> may occur in IRQ context (as an opt-in, DRM dep avoids kicking a worker
> just to drop a ref), or in the reclaim path (any scheduler workqueue is
> in the reclaim path). In either case, you cannot free memory there while
> taking a dma-resv lock, which is why all DRM dep queues ultimately free
> their resources in a work item outside of reclaim. Many drivers already
> follow this pattern, but in DRM dep this behavior is built-in.

I agree deferred cleanup is the way to go.

> 
> So I don’t think Rust natively solves these types of problems, although
> I’ll concede that it does make refcounting a bit more sane.

Rust won't magically defer the cleanup, nor will it dictate how you want
to do the queue teardown; those are things you need to implement. But it
should give visibility into object lifetimes, and guarantee that an
object that's still visible to some owners is usable (the notion of
usable being highly dependent on the object implementation).

Just a purely theoretical example of a multi-step queue teardown that
might be possible to encode in rust:

- MyJobQueue<Usable>: The job queue is currently exposed and usable.
  There's a ::destroy() method consuming 'self' and returning a
  MyJobQueue<Destroyed> object
- MyJobQueue<Destroyed>: The user asked for the queue to be
  destroyed. No new job can be pushed. Existing jobs that didn't make
  it to the FW queue are cancelled; jobs that are in-flight are
  cancelled if they can be, or just waited upon if they can't. When
  the whole destruction step is done, ::destroyed() is called; it
  consumes 'self' and returns a MyJobQueue<Inactive> object.
- MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
  any resources on this queue). It's ready to be cleaned up.
  ::cleanup() (or just ::drop()) defers the cleanup of some inner
  object that has been passed around between the various
  MyJobQueue<State> wrappers.

Each of the state transitions can happen asynchronously. A state
transition consumes the object in one state and returns a new object
in its new state. None of the transitions involves dropping a refcnt;
ownership is just transferred. The final MyJobQueue<Inactive> object is
the object we'll defer cleanup on.
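
A compilable sketch of the typestate idea above (hypothetical names,
eliding all the actual queue machinery):

```rust
// Illustrative typestate sketch: the teardown stage is encoded in the
// type, and each transition consumes the previous state, so the
// compiler rejects pushing jobs to a destroyed queue or performing
// teardown steps out of order.
use std::marker::PhantomData;

struct Usable;
struct Destroyed;
struct Inactive;

// Inner object passed along between the state wrappers; this is
// what the final deferred cleanup would act on.
struct QueueInner {
    id: u32,
}

struct MyJobQueue<State> {
    inner: QueueInner,
    _state: PhantomData<State>,
}

impl MyJobQueue<Usable> {
    fn new(id: u32) -> Self {
        MyJobQueue { inner: QueueInner { id }, _state: PhantomData }
    }

    // Only possible while the queue is Usable.
    fn push_job(&self) {}

    // Consumes the usable queue; after this, push_job() no longer
    // exists on the returned type.
    fn destroy(self) -> MyJobQueue<Destroyed> {
        // ... cancel queued jobs, wait for in-flight ones ...
        MyJobQueue { inner: self.inner, _state: PhantomData }
    }
}

impl MyJobQueue<Destroyed> {
    // Called once HW/FW no longer references the queue.
    fn destroyed(self) -> MyJobQueue<Inactive> {
        MyJobQueue { inner: self.inner, _state: PhantomData }
    }
}

impl MyJobQueue<Inactive> {
    fn id(&self) -> u32 {
        self.inner.id
    }
    // A Drop impl (or an explicit ::cleanup()) would defer freeing
    // self.inner from here.
}
```

Trying to call push_job() after destroy(), or destroyed() before
destroy(), is a compile error rather than a runtime bug.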

It's a very high-level view of one way this can be implemented (I'm
sure there are others, probably better than my suggestion) in order to
make sure the object doesn't go away prematurely, with the compiler
enforcing proper state transitions.

> > > > +/**
> > > > + * DOC: DRM dependency fence
> > > > + *
> > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > + * completes the job.
> > > > + *
> > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > + * NULL or an error).  
> > > 
> > > I thought this fence proxy mechanism was going away due to recent work being
> > > carried out by Christian?
> > >   
> 
> Consider the case where a driver’s hardware fence is implemented as a
> dma-fence-array or dma-fence-chain. You cannot install these types of
> fences into a dma-resv or into syncobjs, so a proxy fence is useful
> here.

Hm, so that's a driver returning a dma_fence_array/chain through
::run_job()? Why would we not want to have them directly exposed and
split up into singular fence objects at resv insertion time? (I don't
think syncobjs care, but I might be wrong.) I mean, one of the points
behind the container extraction is that fences coming from the same
context/timeline can be detected and merged. If you insert the
container through a proxy, you're defeating the whole fence-merging
optimization.

The second thing is that I'm not sure drivers were ever supposed to
return fence containers in the first place, because the whole idea
behind a fence context is that fences are emitted/signalled in
seqno-order, and if the fence is encoding the state of multiple
timelines that progress at their own pace, it becomes tricky to control
that. I guess if it's always the same set of timelines that are
combined, that would work.

> One example is when a single job submits work to multiple rings
> that are flipped in hardware at the same time.

We do have that in Panthor, but that's all explicit: in a single
SUBMIT, you can have multiple jobs targeting different queues, each of
them having their own set of deps/signal ops. The combination of all the
signal ops into a container is left to the UMD. It could be automated
kernel side, but that would be a flag on the SIGNAL op leading to the
creation of a fence_array containing fences from multiple submitted
jobs, rather than the driver combining stuff in the fence it returns in
::run_job().

> 
> Another case is late arming of hardware fences in run_job (which many
> drivers do). The proxy fence is immediately available at arm time and
> can be installed into dma-resv or syncobjs even though the actual
> hardware fence is not yet available. I think most drivers could be
> refactored to make the hardware fence immediately available at run_job,
> though.

Yep, I also think we can arm the driver fence early in the JobQueue
case. The reason it couldn't be done before is that the scheduler was
in the middle, deciding which entity to pull the next job from, which
changed the seqno a job's driver fence would be assigned (you can't
guess that at queue time in that case).

[...]

> > > > + * **Reference counting**
> > > > + *
> > > > + * Jobs and queues are both reference counted.
> > > > + *
> > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > + * job that was submitted to it.
> > > > + *
> > > > + * The queue holds its own reference to a job for as long as the job is
> > > > + * internally tracked: from the moment the job is added to the pending list
> > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > + * worker, which calls drm_dep_job_put() to release that reference.  
> > > 
> > > Why not simply keep track that the job was completed, instead of relinquishing
> > > the reference? We can then release the reference once the job is cleaned up
> > > (by the queue, using a worker) in process context.  
> 
> I think that’s what I’m doing, while also allowing an opt-in path to
> drop the job reference when it signals (in IRQ context)

Did you mean in !IRQ (or !atomic) context here? Feels weird to not
defer the cleanup when you're in an IRQ/atomic context, but defer it
when you're in a thread context.

> so we avoid
> switching to a work item just to drop a ref. That seems like a
> significant win in terms of CPU cycles.

Well, the cleanup path is probably not where latency matters the most.
It adds scheduling overhead, sure, but given all the stuff we already
defer, I'm not sure saving a few cycles by doing the cleanup
immediately is worth it. What's important to have is a way to signal
fences in an atomic context, because that has an impact on latency.

[...]

> > > > + /*
> > > > + * Drop all input dependency fences now, in process context, before the
> > > > + * final job put. Once the job is on the pending list its last reference
> > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > + * xa_destroy() would be unsafe.
> > > > + */  
> > > 
> > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > via ops->run_job()?
> > > 
> > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > other than scheduling the queue worker?
> > >   
> 
> Yes, this code is required to support dropping job refs directly in the
> dma-fence callback (an opt-in feature). Again, this seems like a
> significant win in terms of CPU cycles, although I haven’t collected
> data yet.

If it significantly hurts perf, I'd like to understand why, because to
me it looks like pure cleanup (no signaling involved), and thus there's
no other process waiting for us to do the cleanup. The only thing that
might have an impact is how fast you release the resources, and given
it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
like to understand which part of the immediate cleanup is causing
contention (basically, which kind of resources the system is starving
for).

Regards,

Boris

Thread overview: 50+ messages
     [not found] <20260316043255.226352-1-matthew.brost@intel.com>
2026-03-16  4:32 ` [RFC PATCH 01/12] workqueue: Add interface to teach lockdep to warn on reclaim violations Matthew Brost
2026-03-25 15:59   ` Tejun Heo
2026-03-26  1:49     ` Matthew Brost
2026-03-26  2:19       ` Tejun Heo
2026-03-27  4:33         ` Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer Matthew Brost
2026-03-16  9:16   ` Boris Brezillon
2026-03-17  5:22     ` Matthew Brost
2026-03-17  8:48       ` Boris Brezillon
2026-03-16 10:25   ` Danilo Krummrich
2026-03-17  5:10     ` Matthew Brost
2026-03-17 12:19       ` Danilo Krummrich
2026-03-18 23:02         ` Matthew Brost
2026-03-17  2:47   ` Daniel Almeida
2026-03-17  5:45     ` Matthew Brost
2026-03-17  7:17       ` Miguel Ojeda
2026-03-17  8:26         ` Matthew Brost
2026-03-17 12:04           ` Daniel Almeida
2026-03-17 19:41           ` Miguel Ojeda
2026-03-23 17:31             ` Matthew Brost
2026-03-23 17:42               ` Miguel Ojeda
2026-03-17 18:14       ` Matthew Brost
2026-03-17 19:48         ` Daniel Almeida
2026-03-17 20:43         ` Boris Brezillon [this message]
2026-03-18 22:40           ` Matthew Brost
2026-03-19  9:57             ` Boris Brezillon
2026-03-22  6:43               ` Matthew Brost
2026-03-23  7:58                 ` Matthew Brost
2026-03-23 10:06                   ` Boris Brezillon
2026-03-23 17:11                     ` Matthew Brost
2026-03-17 12:31     ` Danilo Krummrich
2026-03-17 14:25       ` Daniel Almeida
2026-03-17 14:33         ` Danilo Krummrich
2026-03-18 22:50           ` Matthew Brost
2026-03-17  8:47   ` Christian König
2026-03-17 14:55   ` Boris Brezillon
2026-03-18 23:28     ` Matthew Brost
2026-03-19  9:11       ` Boris Brezillon
2026-03-23  4:50         ` Matthew Brost
2026-03-23  9:55           ` Boris Brezillon
2026-03-23 17:08             ` Matthew Brost
2026-03-23 18:38               ` Matthew Brost
2026-03-24  9:23                 ` Boris Brezillon
2026-03-24 16:06                   ` Matthew Brost
2026-03-25  2:33                     ` Matthew Brost
2026-03-24  8:49               ` Boris Brezillon
2026-03-24 16:51                 ` Matthew Brost
2026-03-17 16:30   ` Shashank Sharma
2026-03-16  4:32 ` [RFC PATCH 11/12] accel/amdxdna: Convert to drm_dep scheduler layer Matthew Brost
2026-03-16  4:32 ` [RFC PATCH 12/12] drm/panthor: " Matthew Brost
