From: Boris Brezillon <boris.brezillon@collabora.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Philipp Stanner" <phasta@kernel.org>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
Date: Tue, 17 Mar 2026 09:48:24 +0100	[thread overview]
Message-ID: <20260317094824.58cc34e4@fedora> (raw)
In-Reply-To: <abjlFshfpJXCz8z8@lstrano-desk.jf.intel.com>

Hi Matthew,

On Mon, 16 Mar 2026 22:22:30 -0700
Matthew Brost <matthew.brost@intel.com> wrote:

> On Mon, Mar 16, 2026 at 10:16:01AM +0100, Boris Brezillon wrote:
> > Hi Matthew,
> > 
> > On Sun, 15 Mar 2026 21:32:45 -0700
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >   
> > > Diverging requirements between GPU drivers using firmware scheduling
> > > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > > rules, and numerous driver-specific hacks have rendered
> > > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > > firmware-scheduled GPU drivers—one that addresses all of the
> > > aforementioned shortcomings.
> > > 
> > > Add drm_dep, a lightweight GPU submission queue intended as a
> > > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > > from the queue (drm_sched_entity) into two objects requiring external
> > > coordination, drm_dep merges both roles into a single struct
> > > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > > that is unnecessary for firmware schedulers which manage their own
> > > run-lists internally.
> > > 
> > > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > > management by the driver, drm_dep uses reference counting (kref) on both
> > > queues and jobs to guarantee object lifetime safety. A job holds a queue
> > > reference from init until its last put, and the queue holds a job reference
> > > from dispatch until the put_job worker runs. This makes use-after-free
> > > impossible even when completion arrives from IRQ context or concurrent
> > > teardown is in flight.
> > > 
> > > The core objects are:
> > > 
> > >   struct drm_dep_queue - a per-context submission queue owning an
> > >     ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> > >     queue, and a pending-job list. Reference counted; drivers can embed
> > >     it and provide a .release vfunc for RCU-safe teardown.  
> > 
> > First off, I like this idea, and actually think we should have done that
> > from the start rather than trying to bend drm_sched to meet our  
> 
> Yes. Tvrtko actually suggested this years ago, and in my naïveté I
> rejected it. I’m eating my hat here.
> 
> > FW-assisted scheduling model. That's also the direction me and Danilo
> > have been pushing for for the new JobQueue stuff in rust, so I'm glad
> > to see some consensus here.
> > 
> > Now, let's start with the usual naming nitpick :D => can't we find a
> > better prefix than "drm_dep"? I think I get where "dep" comes from (the
> > logic mostly takes care of job deps, and acts as a FIFO otherwise, no
> > real scheduling involved). It's kinda okay for drm_dep_queue, even
> > though, according to the description you've made, jobs seem to stay in
> > that queue even after their deps are met, which, IMHO, is a bit
> > confusing: dep_queue sounds like a queue in which jobs are placed until
> > their deps are met, and then the job moves to some other queue.
> > 
> > It gets worse for drm_dep_job, which sounds like a dep-only job, rather
> > than a job that's queued to the drm_dep_queue. Same goes for
> > drm_dep_fence, which I find super confusing. What this one does is just
> > proxy the driver fence to provide proper isolation between GPU drivers
> > and fence observers (other drivers).
> > 
> > Since this new model is primarily designed for hardware that has
> > FW-assisted scheduling, how about drm_fw_queue, drm_fw_job,
> > drm_fw_job_fence?  
> 
> We can bikeshed — I’m open to other names, but I believe hardware
> scheduling can be built quite cleanly on top of this, so drm_fw_*
> doesn’t really work either.

I agree drm_fw_ is not great either. I think Philipp's name, JobQueue,
is a better fit.

> Check out a hardware-scheduler PoC built
> (today) on top of this in [1].

Yeah, I'm pretty sure we can add a layer on top to deal with HW GPU
queues (Danilo and Philipp already proposed it); I'm just not sure it
should be our primary focus, at least not until we have something
usable for the drivers currently being written in Rust. That doesn't
mean we shouldn't think about it and design things so it can be added
later, of course, but most of the time, perfect is the enemy of good.

> 
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966
> 
> >   
> > > 
> > >   struct drm_dep_job - a single unit of GPU work. Drivers embed this
> > >     and provide a .release vfunc. Jobs carry an xarray of input
> > >     dma_fence dependencies and produce a drm_dep_fence as their
> > >     finished fence.
> > > 
> > >   struct drm_dep_fence - a dma_fence subclass wrapping an optional
> > >     parent hardware fence. The finished fence is armed (sequence
> > >     number assigned) before submission and signals when the hardware
> > >     fence signals (or immediately on synchronous completion).
> > > 
> > > Job lifecycle:
> > >   1. drm_dep_job_init() - allocate and initialise; job acquires a
> > >      queue reference.
> > >   2. drm_dep_job_add_dependency() and friends - register input fences;
> > >      duplicates from the same context are deduplicated.
> > >   3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> > >   4. drm_dep_job_push() - submit to queue.
> > > 
> > > Submission paths under queue lock:
> > >   - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> > >     SPSC queue is empty, no dependencies are pending, and credits are
> > >     available, the job is dispatched inline on the calling thread.  
> > 
> > I've yet to look at the code, but I must admit I'm less worried about
> > this fast path if it's part of a new model restricted to FW-assisted
> > scheduling. I keep thinking we're not entirely covered for so called
> > real-time GPU contexts that might have jobs that are not dep-free, and
> > if we're going for something new, I'd really like us to consider that
> > case from the start (maybe investigate if kthread_work[er] can be used
> > as a replacement for workqueues, if RT priority on workqueues is not an
> > option).
> >   
> 
> I mostly agree, and I’ll look into whether kthread_work is better
> suited—if that’s the right model, it should be done up front.
> 
> But can you give a use case for real-time GPU contexts that are not
> dep-free? I personally don’t know of one.

Real-time GPU contexts aside, you still have the concept of context
priority in both GL and Vulkan, and jobs with deps are likely to be
queued on such contexts. If the dequeuing thread doesn't take this
priority into account, nothing differentiates a low-prio from a
high-prio context, which partially defeats the priority you assign to
your FW queue (execution can be fast, but that's pointless if queuing
to the FW queue is slow).

As for real-time contexts, yes, the only use case I can think of right
now is compositors, and apparently those would pass dep-less
submissions, but claiming that this is the only use case we'll ever
have for RT contexts is a risk I'd rather not take. Also, if we solve
the problem with priority-assigned kthreads, we might not even need
this fastpath in the first place.

> 
> > >   - Queued path: job is pushed onto the SPSC queue and the run_job
> > >     worker is kicked. The worker resolves remaining dependencies
> > >     (installing wakeup callbacks for unresolved fences) before calling
> > >     ops->run_job().
> > > 
> > > Credit-based throttling prevents hardware overflow: each job declares
> > > a credit cost at init time; dispatch is deferred until sufficient
> > > credits are available.
> > > 
> > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > expiry for device teardown.
> > > 
> > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > dma_fence callback). Dependency cleanup is deferred to process context
> > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > 
> > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > bail immediately if the queue refcount has already reached zero and
> > > async teardown is in flight, preventing use-after-free.
> > > 
> > > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > > so that destroy_workqueue() is never called from within one of the
> > > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > > owning struct drm_device, released as the final step of teardown via
> > > drm_dev_put(). This prevents the driver module from being unloaded
> > > while any queue is still alive without requiring a separate drain API.  
> > 
> > Thanks for posting this RFC. I'll try to have a closer look at the code
> > in the coming days, but given the diffstat, it might take me a bit of
> > time...  
> 
> 
> I understand — I’m a firehose when I get started. Hopefully a sane one,
> though.

One last note: I've seen a bunch of discussions that are starting to
look like a pissing contest between C and Rust advocates. I've only
started looking at Rust recently, in the context of Tyr (the Rust
re-implementation of Panthor), and there are niceties in the language
that I think make it a good fit for the new JobQueue thing we've been
discussing (Daniel listed a bunch of them). Honestly, I don't really
mind who ends up winning the trophy in this contest, but what I
certainly don't want is for us to be stuck with no solution at all
because each camp claims its solution is best and neither gives up on
its ideas (basically what happened with the AGX driver). So please,
please, let's all take a step back, look at what each side has done,
keep the discussion constructive, and hopefully we'll end up with a
mix of ideas that makes the final solution better.

Regards,

Boris
