From: Boris Brezillon <boris.brezillon@collabora.com>
To: Philipp Stanner <phasta@mailbox.org>
Cc: phasta@kernel.org, "Danilo Krummrich" <dakr@kernel.org>,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	kernel-dev@igalia.com, intel-xe@lists.freedesktop.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	"Christian König" <christian.koenig@amd.com>,
	"Leo Liu" <Leo.Liu@amd.com>, "Maíra Canal" <mcanal@igalia.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Michel Dänzer" <michel.daenzer@mailbox.org>,
	"Pierre-Eric Pelloux-Prayer" <pierre-eric.pelloux-prayer@amd.com>,
	"Rob Clark" <robdclark@gmail.com>, "Tejun Heo" <tj@kernel.org>,
	"Alexandre Courbot" <acourbot@nvidia.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Timur Tabi" <ttabi@nvidia.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Lucas De Marchi" <lucas.demarchi@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Rob Herring" <robh@kernel.org>,
	"Steven Price" <steven.price@arm.com>,
	"Liviu Dudau" <liviu.dudau@arm.com>,
	"Daniel Almeida" <daniel.almeida@collabora.com>,
	"Alice Ryhl" <aliceryhl@google.com>,
	"Boqun Feng" <boqunf@netflix.com>,
	"Grégoire Péan" <gpean@netflix.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	airlied@gmail.com
Subject: Re: [RFC v8 00/21] DRM scheduling cgroup controller
Date: Tue, 30 Sep 2025 12:12:29 +0200
Message-ID: <20250930121229.4f265e0c@fedora>
In-Reply-To: <4453e5989b38e99588efd53af674b69016b2c420.camel@mailbox.org>

Hi all,

On Tue, 30 Sep 2025 11:00:00 +0200
Philipp Stanner <phasta@mailbox.org> wrote:

> +Cc Sima, Dave
> 
> On Mon, 2025-09-29 at 16:07 +0200, Danilo Krummrich wrote:
> > On Wed Sep 3, 2025 at 5:23 PM CEST, Tvrtko Ursulin wrote:  
> > > This is another respin of this old work^1 which since v7 is a total rewrite and
> > > completely changes how the control is done.  
> > 
> > I only received some of the patches in the series; can you please send all of
> > them for subsequent submissions? You may also want to consider resending if
> > you're not getting much feedback because of that. :)
> >   
> > > On the userspace interface side of things it is the same as before. We have
> > > drm.weight as an interface, taking integers from 1 to 10000, the same as CPU and
> > > IO cgroup controllers.  
> > 
> > In general, I think it would be good to get GPU vendors to speak up about what
> > kind of interfaces they're heading towards with firmware schedulers and
> > potential firmware APIs to control scheduling, especially given that this will
> > be a uAPI.
> > 
> > (Adding a couple of folks to Cc.)
> > 
> > That said, I think the basic drm.weight interface is fine and should work in
> > any case, i.e. with the existing DRM GPU scheduler in both modes and the
> > upcoming DRM Jobqueue efforts, and it should be generic enough to work with
> > potential firmware interfaces we may see in the future.
> > 
> > Philipp should be talking about the DRM Jobqueue component at XDC (probably just
> > in this moment).
> > 
> > --
> > 
> > Some more thoughts on the DRM Jobqueue and scheduling:
> > 
> > The idea behind the DRM Jobqueue is to be, as the name suggests, a component
> > that receives jobs from userspace, handles the dependencies (i.e. dma fences),
> > and executes the job, e.g. by writing to a firmware managed software ring.
> > 
> > It basically does what the GPU scheduler does in 1:1 entity-scheduler mode,
> > just without all the additional complexity of moving job ownership from one
> > component to another (i.e. from entity to scheduler, etc.).
> > 
> > With just that, there is no scheduling outside the GPU's firmware scheduler of
> > course. However, additional scheduler capabilities, e.g. to support hardware
> > rings, or manage firmware schedulers that only support a limited number of
> > software rings (like some Mali GPUs), can be layered on top of that:
> > 
> > In contrast to the existing GPU scheduler, the idea would be to keep letting the
> > DRM Jobqueue handle jobs submitted by userspace from end to end (i.e. let it
> > push to the hardware (or software) ring buffer), but have an additional
> > component whose only purpose is to orchestrate the DRM Jobqueues by managing
> > when they are allowed to push to a ring and which ring they should push to.
> > 
> > This way we get rid of one of the issues with the existing GPU scheduler,
> > namely that it moves job ownership between components of different lifetimes
> > (entity and scheduler), which is one of the fundamental hassles to deal with.
> 
> 
> So just a few minutes ago I had a long chat with Sima.
> 
> Sima thinks (and I think I agree) that the very few GPUs that have a
> reasonably low limit on firmware rings should just resource-limit
> userspace once that limit is reached.
> 
> Basically like with VRAM.
> 
> Apparently Sima had suggested that to Panthor in the past? But Panthor
> still seems to have implemented yet another scheduler mechanism on top
> of the 1:1 entity-scheduler drm_sched setup?
> 
> @Boris: Why was that done?

So, the primary reason was that the layer of scheduling we have doesn't
operate at the job or queue level, but at a higher level called a group,
which is basically a collection of queues that interact closely (a group
backs a VkQueue, and on Mali a VkQueue has a vertex subqueue, a fragment
subqueue and a compute subqueue). There's also some fairness involved in
our scheduling, where we rotate the priority of groups over time so it's
not always the same group that gets to execute its workload. I tried to
build a mental model of Sima's suggestion at the time, but I never
managed to reconcile the job-level scheduling (forcing a limit on the
number of jobs that can be queued per subqueue) with the group-level
scheduling we have here. It also didn't seem like this extra layer of
scheduling was a big deal, because ultimately it doesn't get in the way
of the single-entity scheduling provided by drm_sched; it's just
something on top.
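
To make the group model a bit more concrete, here is a minimal,
userspace-style Rust sketch of the structure I have in mind, purely for
illustration (the Group/Subqueue/GroupScheduler names are made up; this
is not Panthor code):

// Hypothetical sketch of group-level scheduling with priority rotation.
// Illustrative only, not actual driver code.

use std::collections::VecDeque;

/// One submission stream inside a group (e.g. vertex, fragment, compute).
struct Subqueue {
    name: &'static str,
    pending_jobs: VecDeque<u64>, // job ids, stand-in for real job objects
}

/// A group backs a VkQueue: a set of subqueues scheduled as one unit.
struct Group {
    id: u32,
    subqueues: Vec<Subqueue>,
}

/// Rotates group priority over time so the same group doesn't always run
/// first (the "fairness by rotation" mentioned above).
struct GroupScheduler {
    /// Front of the deque is the highest-priority group this tick.
    groups: VecDeque<Group>,
}

impl GroupScheduler {
    /// Called on every scheduling tick: the group that just had the
    /// highest priority moves to the back of the rotation.
    fn rotate_priorities(&mut self) {
        if let Some(g) = self.groups.pop_front() {
            self.groups.push_back(g);
        }
    }

    /// Pick the highest-priority group that has something to run.
    fn pick_next(&self) -> Option<&Group> {
        self.groups
            .iter()
            .find(|g| g.subqueues.iter().any(|q| !q.pending_jobs.is_empty()))
    }
}

The point being that the schedulable unit is the group, not the
individual job, which is why a per-job limit never mapped cleanly onto
it.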

The other reason is that, even if we found a way to reconcile the two
scheduling models (job vs group) with some resource-limiting algorithm,
it would get in the way of usermode queues, because then the job
delimitation becomes blurry. In that case you no longer manipulate jobs
but execution contexts, which have to be scheduled in/out to provide
some kind of fairness; at that point the resource becomes GPU time, and
you're back to the timeslice-based scheduling we have right now.

> 
> So far I tend to prefer Sima's proposal because I'm currently very
> unsure how we could deal with shared firmware rings – because then we'd
> need to resubmit jobs, and the currently intended Rust ownership model
> would then be in danger, because the Jobqueue would need a
> pending_list.

So, my take on this is that what we ultimately want is to have the
functionality provided by drm_sched split into different components
that can be used in isolation, or combined to provide advanced
scheduling.

JobQueue:
 - allows you to queue jobs with their deps
 - dequeues jobs once their deps are met
I'm not too sure whether we want a push or a pull model for the job
dequeuing, but the idea is that once a job is dequeued, ownership is
passed to the SW entity that dequeued it. Note that I intentionally
didn't add timeout handling here, because dequeuing a job doesn't
necessarily mean it starts immediately. If you're dealing with HW
queues, you might have to wait for a slot to become available. If
you're dealing with something like Mali-CSF, where the number of FW
slots is limited, you have to wait for your execution context to be
passed to the FW for scheduling. And the final situation is full-fledged
FW scheduling, where you want things to start as soon as you have space
in your FW queue (AKA ring buffer?).
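
Purely as an illustration of that ownership hand-off, a plain-Rust
sketch assuming a pull model (none of these types exist anywhere, the
names are hypothetical):

// Hypothetical JobQueue sketch: jobs wait on their dependencies, and a
// ready job is handed over by value to whoever dequeues it.
// Plain-std illustration, not an actual DRM/kernel API.

use std::collections::VecDeque;

/// Stand-in for a dma_fence style dependency.
trait Dependency {
    fn is_signalled(&self) -> bool;
}

struct Job<D: Dependency> {
    id: u64,
    deps: Vec<D>,
}

struct JobQueue<D: Dependency> {
    pending: VecDeque<Job<D>>,
}

impl<D: Dependency> JobQueue<D> {
    /// Submission side: queue a job together with its dependencies.
    fn queue(&mut self, job: Job<D>) {
        self.pending.push_back(job);
    }

    /// Pull model: the caller (HW dispatcher, driver, ...) asks for the
    /// next job whose dependencies are all met. Ownership of the Job
    /// moves to the caller, so there is no shared pending list with
    /// fuzzy lifetimes.
    fn dequeue_ready(&mut self) -> Option<Job<D>> {
        let pos = self
            .pending
            .iter()
            .position(|j| j.deps.iter().all(|d| d.is_signalled()))?;
        self.pending.remove(pos)
    }
}

The important bit is that dequeue_ready() returns the job by value: at
any point in time the job is owned by exactly one component.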

JobHWDispatcher: (not sure about the name, I'm bad at naming things)
This object basically pulls ready jobs from one or multiple JobQueues
into its own queue and waits for a HW slot to become available. If you
go for the push model, the job gets pushed to the HW dispatcher queue
and waits there until a HW slot becomes available.
That's where timeouts should be handled, because a job only becomes
active when it gets pushed to a HW slot. I guess if we want a resubmit
mechanism, it would have to take place here, but given how tricky this
has been, I'd be tempted to leave that to drivers, that is, let them
requeue the non-faulty jobs directly to their JobHWDispatcher
implementation after a reset.
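
Again a rough illustration only, this time assuming the push model and
a fixed set of HW slots (all names hypothetical):

// Hypothetical JobHWDispatcher sketch: it owns the ready jobs it was
// handed, parks them until a HW slot frees up, and only arms the
// timeout once a job actually starts running on a slot.
// Illustrative only, not an existing kernel interface.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct Job {
    id: u64,
}

struct HwSlot {
    running: Option<(Job, Instant)>, // job + time it was pushed to HW
}

struct JobHwDispatcher {
    /// Jobs that are ready (deps met) but waiting for a free slot.
    ready: VecDeque<Job>,
    slots: Vec<HwSlot>,
    timeout: Duration,
}

impl JobHwDispatcher {
    /// Push model: a JobQueue hands a ready job over to us.
    fn push_ready(&mut self, job: Job) {
        self.ready.push_back(job);
        self.kick();
    }

    /// Move waiting jobs onto free HW slots; the timeout clock starts
    /// here, not at queue time, because only now is the job executing.
    fn kick(&mut self) {
        for slot in self.slots.iter_mut().filter(|s| s.running.is_none()) {
            match self.ready.pop_front() {
                Some(job) => slot.running = Some((job, Instant::now())),
                None => break,
            }
        }
    }

    /// Timeout check; what to do on timeout (reset, requeue the
    /// non-faulty jobs) would stay driver-specific, as said above.
    fn timed_out_jobs(&self) -> Vec<u64> {
        self.slots
            .iter()
            .filter_map(|s| s.running.as_ref())
            .filter(|(_, start)| start.elapsed() > self.timeout)
            .map(|(job, _)| job.id)
            .collect()
    }
}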

FWExecutionContextScheduler: (again, pick a different name if you want)
This scheduler doesn't know about jobs, meaning there's a
driver-specific entity that needs to dequeue jobs from the JobQueue and
push them to the relevant ring buffer. Once a FWExecutionContext has
something to execute, it becomes a candidate for the
FWExecutionContextScheduler, which gets to decide which set of
FWExecutionContexts gets a chance to be scheduled by the FW.
That one is for the Mali-CSF case I described above, and I'm not too
sure we want it to be generic, at least not until we have another GPU
driver needing the same kind of scheduling. Again, you want to defer
the timeout handling to this component, because the timer should only
start/resume when the FWExecutionContext gets scheduled, and it should
be paused as soon as the context gets evicted.
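
Roughly, and assuming a trivial rotation policy over the limited FW
slots (all types hypothetical, not actual Panthor/CSF code):

// Hypothetical FWExecutionContextScheduler sketch: it never sees
// individual jobs, it only decides which execution contexts currently
// occupy the limited FW slots. Illustrative only.

use std::collections::VecDeque;

struct FwExecutionContext {
    id: u32,
    /// Set by the driver-specific code that writes jobs into the
    /// context's ring buffer; having work makes it a candidate.
    has_work: bool,
    /// True while the context occupies a FW slot; the timeout clock is
    /// meant to run only while this is set and pause on eviction.
    on_fw_slot: bool,
}

struct FwContextScheduler {
    /// Round-robin rotation of candidate contexts.
    candidates: VecDeque<FwExecutionContext>,
    /// Contexts currently handed to the firmware, bounded by fw_slots.
    resident: Vec<FwExecutionContext>,
    fw_slots: usize,
}

impl FwContextScheduler {
    /// One timeslice tick: evict everything, rotate, and re-fill the FW
    /// slots with the next contexts that actually have work queued.
    fn tick(&mut self) {
        // Evict: residents go to the back of the rotation and their
        // timeout clock pauses.
        for mut ctx in self.resident.drain(..) {
            ctx.on_fw_slot = false;
            self.candidates.push_back(ctx);
        }
        // Re-fill the limited FW slots from the front of the rotation.
        let mut idle = VecDeque::new();
        while self.resident.len() < self.fw_slots {
            match self.candidates.pop_front() {
                Some(mut ctx) if ctx.has_work => {
                    ctx.on_fw_slot = true;
                    self.resident.push(ctx);
                }
                Some(ctx) => idle.push_back(ctx),
                None => break,
            }
        }
        // Idle contexts keep their place in the rotation for next time.
        self.candidates.extend(idle);
    }
}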

TL;DR: I think the main problem we had with drm_sched is that it had
this clear drm_sched_entity/drm_gpu_scheduler separation, but those two
components were tightly tied together, with no way to use a
drm_sched_entity alone for instance, and this led to the weird
lifetime/ownership issues that the Rust effort made more apparent. If
we get to design something new, I think we should try hard to get clear
isolation between each of these components so they can be used alone or
combined, with a clear job ownership model.
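
To show what I mean by a clear job ownership model, a trivial sketch
with placeholder types only: the job is owned by exactly one component
at a time and moves by value between them, which the compiler enforces.

// Minimal illustration of move-based job ownership between components.
// All types are placeholders, not proposed kernel interfaces.

struct Job(u64);

struct JobQueue(Vec<Job>);
struct Dispatcher(Vec<Job>);

impl JobQueue {
    /// Handing a job out removes it from the queue: no shared pending
    /// list with a second owner.
    fn dequeue(&mut self) -> Option<Job> {
        self.0.pop()
    }
}

impl Dispatcher {
    /// The dispatcher now owns the job until it retires or requeues it.
    fn accept(&mut self, job: Job) {
        self.0.push(job);
    }
}

fn main() {
    let mut queue = JobQueue(vec![Job(1), Job(2)]);
    let mut dispatcher = Dispatcher(Vec::new());

    // Ownership moves queue -> dispatcher; the queue can no longer
    // touch the job it handed over.
    if let Some(job) = queue.dequeue() {
        dispatcher.accept(job);
    }
}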

Regards,

Boris

Thread overview: 32+ messages
2025-09-03 15:23 [RFC v8 00/21] DRM scheduling cgroup controller Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 01/21] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 02/21] drm/sched: Add some more " Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 03/21] drm/sched: Implement RR via FIFO Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 04/21] drm/sched: Consolidate entity run queue management Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 05/21] drm/sched: Move run queue related code into a separate file Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 06/21] drm/sched: Free all finished jobs at once Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 07/21] drm/sched: Account entity GPU time Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 08/21] drm/sched: Remove idle entity from tree Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 09/21] drm/sched: Add fair scheduling policy Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 10/21] drm/sched: Break submission patterns with some randomness Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 11/21] drm/sched: Remove FIFO and RR and simplify to a single run queue Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 12/21] drm/sched: Embed run queue singleton into the scheduler Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 13/21] cgroup: Add the DRM cgroup controller Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 14/21] cgroup/drm: Track DRM clients per cgroup Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 15/21] cgroup/drm: Add scheduling weight callback Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 16/21] cgroup/drm: Introduce weight based scheduling control Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 17/21] drm/sched: Add helper for tracking entities per client Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 18/21] drm/sched: Add helper for DRM cgroup controller weight notifications Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 19/21] drm/amdgpu: Register with the DRM scheduling cgroup controller Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 20/21] drm/xe: Allow changing GuC scheduling priority Tvrtko Ursulin
2025-09-03 15:23 ` [RFC 21/21] drm/xe: Register with the DRM scheduling cgroup controller Tvrtko Ursulin
2025-09-04 12:08   ` Tvrtko Ursulin
2025-09-29 14:07 ` [RFC v8 00/21] " Danilo Krummrich
2025-09-30  9:00   ` Philipp Stanner
2025-09-30  9:28     ` DRM Jobqueue design (was "[RFC v8 00/21] DRM scheduling cgroup controller") Danilo Krummrich
2025-09-30 10:12     ` Boris Brezillon [this message]
2025-09-30 10:58       ` [RFC v8 00/21] DRM scheduling cgroup controller Danilo Krummrich
2025-09-30 11:57         ` Boris Brezillon
2025-10-07 14:44           ` Danilo Krummrich
2025-10-07 15:44             ` Boris Brezillon
2025-10-23 11:18   ` Tvrtko Ursulin
