public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* drm_sched run_job and scheduling latency
@ 2026-03-04 22:51 Chia-I Wu
  2026-03-05  2:04 ` Matthew Brost
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Chia-I Wu @ 2026-03-04 22:51 UTC (permalink / raw)
  To: ML dri-devel, intel-xe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list

Hi,

Our system compositor (surfaceflinger on android) submits gpu jobs
from a SCHED_FIFO thread to an RT gpu queue. However, because
workqueue threads are SCHED_NORMAL, the scheduling latency from submit
to run_job can sometimes cause frame misses. We are seeing this on
panthor and xe, but the issue should be common to all drm_sched users.

Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
meet future android requirements). It seems either workqueue needs to
gain RT support, or drm_sched needs to support kthread_worker.
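
For reference, what we do today with the workaround is roughly this
(existing APIs; WQ_HIGHPRI only moves the pool workers to a low nice
value, they remain SCHED_NORMAL):

```c
/* WQ_HIGHPRI runs pool workers at nice -20, but they stay
 * SCHED_NORMAL, so submit-to-run_job latency is still unbounded. */
submit_wq = alloc_workqueue("gpu-submit-hi", WQ_HIGHPRI | WQ_UNBOUND, 0);
/* ... handed to drm_sched as the submit workqueue at init time */
```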

I know drm_sched switched from kthread_worker to workqueue for better
scaling when xe was introduced. But if drm_sched can support either
workqueue or kthread_worker during drm_sched_init, drivers can
selectively use kthread_worker only for RT gpu queues. And because
drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
scaling issues.
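
Roughly, I imagine something like this on the driver side
(kthread_create_worker() and sched_set_fifo() exist today; the
drm_sched_init() variant taking a kthread_worker is the hypothetical
part):

```c
/* Hypothetical sketch: RT submission worker for an RT GPU queue */
struct kthread_worker *worker;

worker = kthread_create_worker(0, "gpu-rt-submit");
if (IS_ERR(worker))
	return PTR_ERR(worker);

/* boost the worker thread so run_job isn't delayed behind
 * SCHED_NORMAL tasks */
sched_set_fifo(worker->task);

/* imagined init variant consuming a kthread_worker instead of a wq */
ret = drm_sched_init_kworker(&sched, &ops, worker, ...);
```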

Thoughts? Or perhaps this becomes less of an issue if all drm_sched
users have concrete plans for userspace submissions..

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu
@ 2026-03-05  2:04 ` Matthew Brost
  2026-03-05  8:27   ` Boris Brezillon
  2026-03-05  8:35 ` Tvrtko Ursulin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2026-03-05  2:04 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: ML dri-devel, intel-xe, Boris Brezillon, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> Hi,
> 
> Our system compositor (surfaceflinger on android) submits gpu jobs
> from a SCHED_FIFO thread to an RT gpu queue. However, because
> workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> to run_job can sometimes cause frame misses. We are seeing this on
> panthor and xe, but the issue should be common to all drm_sched users.
> 

I'm going to assume that since this is a compositor, you do not pass
input dependencies to the page-flip job. Is that correct?

If so, I believe we could fairly easily build an opt-in DRM sched path
that directly calls run_job in the exec IOCTL context (I assume this is
SCHED_FIFO) if the job has no dependencies.

This would likely break some of Xe’s submission-backend assumptions
around mutual exclusion and ordering based on the workqueue, but that
seems workable. I don’t know how the Panthor code is structured or
whether they have similar issues.

I can try to hack together a quick PoC to see what this would look like
and give you something to test.

> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> meet future android requirements). It seems either workqueue needs to
> gain RT support, or drm_sched needs to support kthread_worker.

+Tejun to see if RT workqueue is in the plans.

> 
> I know drm_sched switched from kthread_worker to workqueue for better
> scaling when xe was introduced. But if drm_sched can support either
> workqueue or kthread_worker during drm_sched_init, drivers can
> selectively use kthread_worker only for RT gpu queues. And because
> drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> scaling issues.
> 

I don’t think having two paths will ever be acceptable, nor do I think
supporting a kthread would be all that easy. For example, in Xe we queue
additional work items outside of the scheduler on the queue for ordering
reasons — we’d have to move all of that code down into DRM sched or
completely redesign our submission model to avoid this. I’m not sure if
other drivers also do this, but it is allowed.

> Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> users have concrete plans for userspace submissions..

Maybe some day....

Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  2:04 ` Matthew Brost
@ 2026-03-05  8:27   ` Boris Brezillon
  2026-03-05  8:38     ` Philipp Stanner
  2026-03-05 10:09     ` Matthew Brost
  0 siblings, 2 replies; 26+ messages in thread
From: Boris Brezillon @ 2026-03-05  8:27 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

Hi Matthew,

On Wed, 4 Mar 2026 18:04:25 -0800
Matthew Brost <matthew.brost@intel.com> wrote:

> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > Hi,
> > 
> > Our system compositor (surfaceflinger on android) submits gpu jobs
> > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > to run_job can sometimes cause frame misses. We are seeing this on
> > panthor and xe, but the issue should be common to all drm_sched users.
> >   
> 
> I'm going to assume that since this is a compositor, you do not pass
> input dependencies to the page-flip job. Is that correct?
> 
> If so, I believe we could fairly easily build an opt-in DRM sched path
> that directly calls run_job in the exec IOCTL context (I assume this is
> SCHED_FIFO) if the job has no dependencies.

I guess by ::run_job() you mean something slightly more involved that
checks if:

- other jobs are pending
- enough credits (AKA ringbuf space) is available
- and probably other stuff I forgot about

> 
> This would likely break some of Xe’s submission-backend assumptions
> around mutual exclusion and ordering based on the workqueue, but that
> seems workable. I don’t know how the Panthor code is structured or
> whether they have similar issues.

Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
you're describing. There's just so many things we can forget that would
lead to races/ordering issues that will end up being hard to trigger and
debug. Besides, it doesn't solve the problem where your gfx pipeline is
fully stuffed and the kernel has to dequeue things asynchronously. I do
believe we want RT-prio support in that case too.

> 
> I can try to hack together a quick PoC to see what this would look like
> and give you something to test.
> 
> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > meet future android requirements). It seems either workqueue needs to
> > gain RT support, or drm_sched needs to support kthread_worker.  
> 
> +Tejun to see if RT workqueue is in the plans.

Dunno how feasible that is, but that would be my preferred option.

> 
> > 
> > I know drm_sched switched from kthread_worker to workqueue for better
> > scaling when xe was introduced. But if drm_sched can support either
> > workqueue or kthread_worker during drm_sched_init, drivers can
> > selectively use kthread_worker only for RT gpu queues. And because
> > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > scaling issues.
> >   
> 
> I don’t think having two paths will ever be acceptable, nor do I think
> supporting a kthread would be all that easy. For example, in Xe we queue
> additional work items outside of the scheduler on the queue for ordering
> reasons — we’d have to move all of that code down into DRM sched or
> completely redesign our submission model to avoid this. I’m not sure if
> other drivers also do this, but it is allowed.

Panthor doesn't rely on the serialization provided by the single-threaded
workqueue; Panfrost might rely on it (I don't remember). I agree that
maintaining both thread-based and workqueue-based scheduling paths is
not ideal, though.

> 
> > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > users have concrete plans for userspace submissions..  
> 
> Maybe some day....

I've yet to see a solution for graphics workloads that doesn't involve
dma_fence-based signaling though (IIRC, Arm's solution still needs the
kernel for that). Until that happens, we'll still need the kernel to
signal fences asynchronously when the job is done, which I suspect will
cause the same kind of latency issue...

Regards,

Boris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu
  2026-03-05  2:04 ` Matthew Brost
@ 2026-03-05  8:35 ` Tvrtko Ursulin
  2026-03-05  9:40   ` Boris Brezillon
  2026-03-05  9:23 ` Boris Brezillon
  2026-03-05 23:09 ` Hillf Danton
  3 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2026-03-05  8:35 UTC (permalink / raw)
  To: Chia-I Wu, ML dri-devel, intel-xe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list


On 04/03/2026 22:51, Chia-I Wu wrote:
> Hi,
> 
> Our system compositor (surfaceflinger on android) submits gpu jobs
> from a SCHED_FIFO thread to an RT gpu queue. However, because
> workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> to run_job can sometimes cause frame misses. We are seeing this on
> panthor and xe, but the issue should be common to all drm_sched users.
> 
> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> meet future android requirements). It seems either workqueue needs to
> gain RT support, or drm_sched needs to support kthread_worker.
> 
> I know drm_sched switched from kthread_worker to workqueue for better

From a plain kthread actually. Anyway, I suggested trying the
kthread_worker approach a few times in the past but never got round to
implementing it. Not dual paths, but simply replacing the workqueues with
kthread_workers.

What is your thinking regarding how the priority would be configured? In
terms of the default, and the mechanism for selecting a higher-priority
scheduling class.

Regards,

Tvrtko

> scaling when xe was introduced. But if drm_sched can support either
> workqueue or kthread_worker during drm_sched_init, drivers can
> selectively use kthread_worker only for RT gpu queues. And because
> drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> scaling issues.
> 
> Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> users have concrete plans for userspace submissions..


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  8:27   ` Boris Brezillon
@ 2026-03-05  8:38     ` Philipp Stanner
  2026-03-05  9:10       ` Matthew Brost
  2026-03-05 10:09     ` Matthew Brost
  1 sibling, 1 reply; 26+ messages in thread
From: Philipp Stanner @ 2026-03-05  8:38 UTC (permalink / raw)
  To: Boris Brezillon, Matthew Brost
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> Hi Matthew,
> 
> On Wed, 4 Mar 2026 18:04:25 -0800
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > Hi,
> > > 
> > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > to run_job can sometimes cause frame misses. We are seeing this on
> > > panthor and xe, but the issue should be common to all drm_sched users.
> > >   
> > 
> > I'm going to assume that since this is a compositor, you do not pass
> > input dependencies to the page-flip job. Is that correct?
> > 
> > If so, I believe we could fairly easily build an opt-in DRM sched path
> > that directly calls run_job in the exec IOCTL context (I assume this is
> > SCHED_FIFO) if the job has no dependencies.
> 
> I guess by ::run_job() you mean something slightly more involved that
> checks if:
> 
> - other jobs are pending
> - enough credits (AKA ringbuf space) is available
> - and probably other stuff I forgot about
> 
> > 
> > This would likely break some of Xe’s submission-backend assumptions
> > around mutual exclusion and ordering based on the workqueue, but that
> > seems workable. I don’t know how the Panthor code is structured or
> > whether they have similar issues.
> 
> Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> you're describing. There's just so many things we can forget that would
> lead to races/ordering issues that will end up being hard to trigger and
> debug.
> 

+1

I'm not thrilled either. More like the opposite of thrilled actually.

Even if we could get that to work, this is more of a maintainability
issue.

The scheduler is full of insane performance hacks for this or that
driver. Lockless accesses, a special lockless queue only used by that
one party in the kernel (a lockless queue which is nowadays, after N
reworks, being used with a lock. Ah well).

In the past discussions Danilo and I made it clear that more major
features in _new_ patch series aimed at getting merged into drm/sched
must be preceded by cleanup work to address some of the scheduler's
major problems.

That's especially true if it's features aimed at performance buffs.



P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  8:38     ` Philipp Stanner
@ 2026-03-05  9:10       ` Matthew Brost
  2026-03-05  9:47         ` Philipp Stanner
                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Matthew Brost @ 2026-03-05  9:10 UTC (permalink / raw)
  To: phasta
  Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > Hi Matthew,
> > 
> > On Wed, 4 Mar 2026 18:04:25 -0800
> > Matthew Brost <matthew.brost@intel.com> wrote:
> > 
> > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > > Hi,
> > > > 
> > > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > >   
> > > 
> > > I'm going to assume that since this is a compositor, you do not pass
> > > input dependencies to the page-flip job. Is that correct?
> > > 
> > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > SCHED_FIFO) if the job has no dependencies.
> > 
> > I guess by ::run_job() you mean something slightly more involved that
> > checks if:
> > 
> > - other jobs are pending

Yes.

> > - enough credits (AKA ringbuf space) is available

Yes.

> > - and probably other stuff I forgot about

Also checking that the scheduler is not stopped; the bypass path would
have to be serialized against scheduler stop/start.
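
As pseudocode (helper names other than run_job and
drm_sched_entity_push_job are illustrative, not existing API):

```c
/* opt-in fast path, called from the exec IOCTL context */
if (!job_has_dependencies(job) &&
    has_enough_credits(sched, job) &&
    !other_jobs_pending(sched) &&
    !scheduler_stopped(sched)) {
	/* run directly in the submitting SCHED_FIFO thread */
	fence = sched->ops->run_job(job);
} else {
	/* fall back to the normal async submission path */
	drm_sched_entity_push_job(job);
}
```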

> > 
> > > 
> > > This would likely break some of Xe’s submission-backend assumptions
> > > around mutual exclusion and ordering based on the workqueue, but that
> > > seems workable. I don’t know how the Panthor code is structured or
> > > whether they have similar issues.
> > 
> > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > you're describing. There's just so many things we can forget that would
> > lead to races/ordering issues that will end up being hard to trigger and
> > debug.
> > 
> 
> +1
> 
> I'm not thrilled either. More like the opposite of thrilled actually.
> 
> Even if we could get that to work, this is more of a maintainability
> issue.
> 
> The scheduler is full of insane performance hacks for this or that
> driver. Lockless accesses, a special lockless queue only used by that
> one party in the kernel (a lockless queue which is nowadays, after N
> reworks, being used with a lock. Ah well).
> 

This is not relevant to this discussion—see below. In general, I agree
that the lockless tricks in the scheduler are not great, nor is the fact
that the scheduler became a dumping ground for driver-specific features.
But again, that is not what we’re talking about here—see below.

> In the past discussions Danilo and I made it clear that more major
> features in _new_ patch series aimed at getting merged into drm/sched
> must be preceded by cleanup work to address some of the scheduler's
> major problems.

Ah, we've moved to dictatorship quickly. Noted.

> 

I can't say I agree with either of you here.

In about an hour, I seemingly have a bypass path working in DRM sched +
Xe, and my diff is:

108 insertions(+), 31 deletions(-)

About 40 lines of the insertions are kernel-doc, so I'm not buying that
this is a maintenance issue or a major feature - it is literally a
single new function.

I understand a bypass path can create issues—for example, on certain
queues in Xe I definitely can't use the bypass path, so Xe simply
wouldn’t use it in those cases. It is the driver's choice whether to use
it. If a driver doesn't know how to use the scheduler, well, that’s on
the driver. Providing a simple, documented function as a fast path
really isn't some crazy idea.

The alternative—asking for RT workqueues or changing the design to use
kthread_worker—actually is.

> That's especially true if it's features aimed at performance buffs.
> 

With the above mindset, I'm actually very confused why this series [1]
would even be considered, as it is an order of magnitude greater in
complexity than my suggestion here.

Matt

[1] https://patchwork.freedesktop.org/series/159025/ 

> 
> 
> P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu
  2026-03-05  2:04 ` Matthew Brost
  2026-03-05  8:35 ` Tvrtko Ursulin
@ 2026-03-05  9:23 ` Boris Brezillon
  2026-03-06  5:33   ` Chia-I Wu
  2026-03-05 23:09 ` Hillf Danton
  3 siblings, 1 reply; 26+ messages in thread
From: Boris Brezillon @ 2026-03-05  9:23 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list

On Wed, 4 Mar 2026 14:51:39 -0800
Chia-I Wu <olvaffe@gmail.com> wrote:

> Hi,
> 
> Our system compositor (surfaceflinger on android) submits gpu jobs
> from a SCHED_FIFO thread to an RT gpu queue. However, because
> workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> to run_job can sometimes cause frame misses. We are seeing this on
> panthor and xe, but the issue should be common to all drm_sched users.
> 
> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> meet future android requirements). It seems either workqueue needs to
> gain RT support, or drm_sched needs to support kthread_worker.
> 
> I know drm_sched switched from kthread_worker to workqueue for better
> scaling when xe was introduced.

Actually, it went from a plain kthread with open-coded "work" support to
workqueues. The kthread_worker+kthread_work model looks closer to what
workqueues provide, so transitioning drivers to it shouldn't be too
hard. The scalability issue you mentioned (one thread per GPU context
doesn't scale) doesn't apply, because we can pretty easily share the
same kthread_worker across all drm_gpu_scheduler instances, just like we
can share the same workqueue across all drm_gpu_scheduler instances
today. Luckily, it seems that no one so far has been using WQ_PERCPU
workqueues, so that's one less thing we need to worry about.

The last remaining drawback of a kthread_work[er]-based solution is
that workqueues can adjust the number of worker threads on demand
based on the load. If we really need this flexibility (a non-static
number of threads per prio level, per driver), that's something we'll
have to add support for.

For Panthor, the way I see it, we could start with one thread per group
priority, and then pick the worker thread to use at drm_sched_init()
time based on the group prio. If we need something with a thread pool,
then drm_sched will have to know about those threads and do some load
balancing when queueing the works...

Note that someone at Collabora is working on dynamic context priority
support, meaning we'll have to be able to change a drm_gpu_scheduler's
kthread_worker at runtime.

TL;DR: All of this is doable, but it's more work (for us, DRM devs) than
asking for RT prio support to be added to workqueues.

> But if drm_sched can support either
> workqueue or kthread_worker during drm_sched_init, drivers can
> selectively use kthread_worker only for RT gpu queues. And because
> drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> scaling issues.

I think, whatever we choose to go for, we probably don't want to keep
both models around, because that's going to be a pain to maintain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  8:35 ` Tvrtko Ursulin
@ 2026-03-05  9:40   ` Boris Brezillon
  2026-03-27  9:19     ` Tvrtko Ursulin
  0 siblings, 1 reply; 26+ messages in thread
From: Boris Brezillon @ 2026-03-05  9:40 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list

Hi Tvrtko,

On Thu, 5 Mar 2026 08:35:33 +0000
Tvrtko Ursulin <tursulin@ursulin.net> wrote:

> On 04/03/2026 22:51, Chia-I Wu wrote:
> > Hi,
> > 
> > Our system compositor (surfaceflinger on android) submits gpu jobs
> > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > to run_job can sometimes cause frame misses. We are seeing this on
> > panthor and xe, but the issue should be common to all drm_sched users.
> > 
> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > meet future android requirements). It seems either workqueue needs to
> > gain RT support, or drm_sched needs to support kthread_worker.
> > 
> > I know drm_sched switched from kthread_worker to workqueue for better  
> 
>  From a plain kthread actually.

Oops, sorry, I hadn't seen your reply before posting mine. I basically
said the same.

> Anyway, I suggested trying the 
> kthread_worker approach a few times in the past but never got round 
> implementing it. Not dual paths but simply replacing the workqueues with 
> kthread_workers.
> 
> What is your thinking regarding how would the priority be configured? In 
> terms of the default and mechanism to select a higher priority 
> scheduling class.

If we follow the same model that exists today, where the
workqueue can be passed at drm_sched_init() time, it becomes the
driver's responsibility to create a worker of its own with the right
prio set (using sched_setscheduler()). There's still the case where the
worker is NULL, in which case the drm_sched code can probably create
its own worker and leave it at the default prio, just like what existed
before the transition to workqueues.
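
Something like this on the driver side (sched_setscheduler() and
kthread_create_worker() exist today; passing the worker through
drm_sched_init() is the hypothetical part):

```c
struct sched_param param = { .sched_priority = 50 }; /* example prio */
struct kthread_worker *worker;

worker = kthread_create_worker(0, "panthor-sched-rt");
if (IS_ERR(worker))
	return PTR_ERR(worker);

/* RT prio for the submission worker */
sched_setscheduler(worker->task, SCHED_FIFO, &param);

/* hypothetical: a kthread_worker in place of the workqueue;
 * NULL would let drm_sched create a default-prio worker itself */
ret = drm_sched_init(&sched, &ops, worker /* or NULL */, ...);
```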

It's a whole different story if you want to deal with worker pools and
do some load balancing though...

Regards,

Boris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:10       ` Matthew Brost
@ 2026-03-05  9:47         ` Philipp Stanner
  2026-03-16  4:05           ` Matthew Brost
  2026-03-05 10:19         ` Boris Brezillon
  2026-03-05 12:27         ` Danilo Krummrich
  2 siblings, 1 reply; 26+ messages in thread
From: Philipp Stanner @ 2026-03-05  9:47 UTC (permalink / raw)
  To: Matthew Brost, phasta
  Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > 
> > > 

[…]

> > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > you're describing. There's just so many things we can forget that would
> > > lead to races/ordering issues that will end up being hard to trigger and
> > > debug.
> > > 
> > 
> > +1
> > 
> > I'm not thrilled either. More like the opposite of thrilled actually.
> > 
> > Even if we could get that to work, this is more of a maintainability
> > issue.
> > 
> > The scheduler is full of insane performance hacks for this or that
> > driver. Lockless accesses, a special lockless queue only used by that
> > one party in the kernel (a lockless queue which is nowadays, after N
> > reworks, being used with a lock. Ah well).
> > 
> 
> This is not relevant to this discussion—see below. In general, I agree
> that the lockless tricks in the scheduler are not great, nor is the fact
> that the scheduler became a dumping ground for driver-specific features.
> But again, that is not what we’re talking about here—see below.
> 
> > In the past discussions Danilo and I made it clear that more major
> > features in _new_ patch series aimed at getting merged into drm/sched
> > must be preceded by cleanup work to address some of the scheduler's
> > major problems.
> 
> Ah, we've moved to dictatorship quickly. Noted.

I prefer the term "benevolent presidency" /s

Or even better: s/dictatorship/accountability enforcement.

How come everyone is here and ready so quickly when it comes to new
use cases and features, yet I never saw anyone except Tvrtko and
Maíra invest even 15 minutes in writing a simple patch to
address some of the *various* significant issues in that code base?

You were on CC on all the discussions we've had here over the last few
years, afair, but I rarely saw you participate. And you know what it's
like: who doesn't speak up silently agrees in open source.

But tell me one thing, if you can be so kind:

What is your theory as to why drm/sched came to be in such horrible
shape? What circumstances, what human behavioral patterns have caused
this?

The DRM subsystem has a bad reputation regarding stability among Linux
users, as far as I have sensed. How can we do better?

> 
> > 
> 
> I can't say I agree with either of you here.
> 
> In about an hour, I seemingly have a bypass path working in DRM sched +
> Xe, and my diff is:
> 
> 108 insertions(+), 31 deletions(-)

LOC is a bad metric for complexity.

> 
> About 40 lines of the insertions are kernel-doc, so I'm not buying that
> this is a maintenance issue or a major feature - it is literally a
> single new function.
> 
> I understand a bypass path can create issues—for example, on certain
> queues in Xe I definitely can't use the bypass path, so Xe simply
> wouldn’t use it in those cases. This is the driver's choice to use or
> not. If a driver doesn't know how to use the scheduler, well, that’s on
> the driver. Providing a simple, documented function as a fast path
> really isn't some crazy idea.

We're effectively talking about a deviation from the default submission
mechanism, and all of that for what seems to be a luxury feature.

Then you end up with two submission mechanisms, whose correctness in
the future relies on someone remembering what the background was, why
it was added, and what the rules are.

The current scheduler rules are / were often not even documented, and
sometimes even Christian took a few weeks to remember again why
something had been added – and whether it can now be removed again or
not.

> 
> The alternative—asking for RT workqueues or changing the design to use
> kthread_worker—actually is.
> 
> > That's especially true if it's features aimed at performance buffs.
> > 
> 
> With the above mindset, I'm actually very confused why this series [1]
> would even be considered as this order of magnitude greater in
> complexity than my suggestion here.
> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/159025/ 

The discussions about Tvrtko's CFS series were precisely the point
where Danilo brought up that, once it is merged, future rework of
the scheduler must focus on addressing some of the pending fundamental
issues.

The background is that Tvrtko has worked on that series already for
well over a year, it actually simplifies some things in the sense of
removing unused code (obviously it's a complex series, no argument
about that), and we agreed on XDC that this can be merged. So this is a
question of fairness to the contributor.

But at some point you have to finally draw a line. No one will ever
address major scheduler issues unless we demand it. Even very
experienced devs usually prefer to hack around the central design
issues in their drivers instead of fixing the shared infrastructure.


P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  8:27   ` Boris Brezillon
  2026-03-05  8:38     ` Philipp Stanner
@ 2026-03-05 10:09     ` Matthew Brost
  2026-03-05 10:52       ` Boris Brezillon
  1 sibling, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2026-03-05 10:09 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:

I addressed most of your comments in a chained reply to Philipp, but I
guess he dropped parts of your email and thus missed those. Responding
below.

> Hi Matthew,
> 
> On Wed, 4 Mar 2026 18:04:25 -0800
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > Hi,
> > > 
> > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > to run_job can sometimes cause frame misses. We are seeing this on
> > > panthor and xe, but the issue should be common to all drm_sched users.
> > >   
> > 
> > I'm going to assume that since this is a compositor, you do not pass
> > input dependencies to the page-flip job. Is that correct?
> > 
> > If so, I believe we could fairly easily build an opt-in DRM sched path
> > that directly calls run_job in the exec IOCTL context (I assume this is
> > SCHED_FIFO) if the job has no dependencies.
> 
> I guess by ::run_job() you mean something slightly more involved that
> checks if:
> 
> - other jobs are pending
> - enough credits (AKA ringbuf space) is available
> - and probably other stuff I forgot about
> 
> > 
> > This would likely break some of Xe’s submission-backend assumptions
> > around mutual exclusion and ordering based on the workqueue, but that
> > seems workable. I don’t know how the Panthor code is structured or
> > whether they have similar issues.
> 
> Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> you're describing. There's just so many things we can forget that would
> lead to races/ordering issues that will end up being hard to trigger and
> debug. Besides, it doesn't solve the problem where your gfx pipeline is
> fully stuffed and the kernel has to dequeue things asynchronously. I do
> believe we want RT-prio support in that case too.
> 

My understanding of SurfaceFlinger is that it never waits on input
dependencies from rendering applications, since those may not signal in
time for a page flip. Because of that, you can’t have the job(s) that
draw to the screen accept input dependencies. Maybe I have that
wrong—but I've spoken to the Google team several times about issues with
SurfaceFlinger, and that was my takeaway.

So I don't think the kernel should ever have to dequeue things
asynchronously, at least for SurfaceFlinger. If there is another RT use
case that requires input dependencies plus the kernel dequeuing things
asynchronously, I agree this wouldn’t help—but my suggestion also isn’t
mutually exclusive with other RT rework either.
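To make the bypass idea concrete, here is a minimal userspace model of the decision it would have to make before running a job in the submitting ioctl context. All names (`model_sched`, `model_job`, `can_bypass`) are hypothetical, not the real drm_sched API; the checks mirror the ones Boris lists (no unresolved deps, no pending jobs ahead for ordering, enough ring credits):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the proposed opt-in bypass: a job may skip the
 * scheduler workqueue and run directly in the submitting (SCHED_FIFO)
 * ioctl context only when nothing forces asynchronous dequeuing. */

struct model_sched {
	unsigned int pending_jobs;  /* jobs already queued; they must run first */
	unsigned int credits_free;  /* free ring-buffer space */
};

struct model_job {
	unsigned int num_deps;      /* unresolved input dependencies */
	unsigned int credits;       /* ring space this job needs */
};

/* True only if the job can be programmed to the ring immediately
 * without breaking ordering or overflowing the ring buffer. */
static bool can_bypass(const struct model_sched *s, const struct model_job *j)
{
	return j->num_deps == 0 &&
	       s->pending_jobs == 0 &&
	       j->credits <= s->credits_free;
}
```

When any check fails, submission would fall back to the existing workqueue path, which is why the bypass alone cannot fix the priority problem for a full pipeline.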

> > 
> > I can try to hack together a quick PoC to see what this would look like
> > and give you something to test.
> > 
> > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > > meet future android requirements). It seems either workqueue needs to
> > > gain RT support, or drm_sched needs to support kthread_worker.  
> > 
> > +Tejun to see if RT workqueue is in the plans.
> 
> Dunno how feasible that is, but that would be my preferred option.
> 
> > 
> > > 
> > > I know drm_sched switched from kthread_worker to workqueue for better
> > > scaling when xe was introduced. But if drm_sched can support either
> > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > selectively use kthread_worker only for RT gpu queues. And because
> > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > > scaling issues.
> > >   
> > 
> > I don’t think having two paths will ever be acceptable, nor do I think
> > supporting a kthread would be all that easy. For example, in Xe we queue
> > additional work items outside of the scheduler on the queue for ordering
> > reasons — we’d have to move all of that code down into DRM sched or
> > completely redesign our submission model to avoid this. I’m not sure if
> > other drivers also do this, but it is allowed.
> 
> Panthor doesn't rely on the serialization provided by the single-thread
> workqueue, Panfrost might rely on it though (I don't remember). I agree
> that maintaining a thread and workqueue based scheduling is not ideal
> though.
> 
> > 
> > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > > users have concrete plans for userspace submissions..  
> > 
> > Maybe some day....
> 
> I've yet to see a solution where no dma_fence-based signalization is
> involved in graphics workloads though (IIRC, Arm's solution still
> needs the kernel for that). Until that happens, we'll still need the
> kernel to signal fences asynchronously when the job is done, which I
> suspect will cause the same kind of latency issue...
> 

I don't think that is the problem here. Doesn’t the job that draws the
frame actually draw it, or does the display wait on the draw job’s fence
to signal and then do something else? (Sorry—I know next to nothing
about display.) Either way, fences should be signaled in IRQ handlers,
which presumably don’t have the same latency issues as workqueues, but I
could be mistaken.

Matt

> Regards,
> 
> Boris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:10       ` Matthew Brost
  2026-03-05  9:47         ` Philipp Stanner
@ 2026-03-05 10:19         ` Boris Brezillon
  2026-03-05 12:27         ` Danilo Krummrich
  2 siblings, 0 replies; 26+ messages in thread
From: Boris Brezillon @ 2026-03-05 10:19 UTC (permalink / raw)
  To: Matthew Brost
  Cc: phasta, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, 5 Mar 2026 01:10:06 -0800
Matthew Brost <matthew.brost@intel.com> wrote:

> I can't say I agree with either of you here.
> 
> In about an hour, I seemingly have a bypass path working in DRM sched +
> Xe, and my diff is:
> 
> 108 insertions(+), 31 deletions(-)

First of all, I'm not blindly rejecting the approach, see how I said
"I'm not thrilled" not "No way!". So yeah, if you have something to
propose, feel free to post the diff here or as an RFC on the ML.

Secondly, I keep thinking the fast-path approach doesn't quite fix
the problem at hand, where we actually want queuing/dequeuing operations
to match the priority of the HW/FW context: if your HW context
is high prio but you're struggling to fill the HW queue, it's not truly
high prio. Note that this is a problem that was made more evident by FW
scheduling (and the 1:1 entity:sched association); before that, we just
had one thread that was dequeuing from entities and pushing to HW
queues based on entity priorities, so priority was somewhat better
enforced.

So yeah, even ignoring the discrepancy that might emerge from this new
fast-path-run_job (and the potential maintenance burden we mentioned),
saying "you'll get proper queueing/dequeuing priority enforcement only
if you have no deps, and the pipeline is not full" is kinda limited
IMHO. I'd rather we think about a solution that solves the entire
problem, which both the kthread_work[er] and workqueue(RT) proposals
do.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05 10:09     ` Matthew Brost
@ 2026-03-05 10:52       ` Boris Brezillon
  2026-03-05 20:51         ` Matthew Brost
  0 siblings, 1 reply; 26+ messages in thread
From: Boris Brezillon @ 2026-03-05 10:52 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, 5 Mar 2026 02:09:16 -0800
Matthew Brost <matthew.brost@intel.com> wrote:

> On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
> 
> I addressed most of your comments in a chained reply to Phillip, but I
> guess he dropped some of your email and thus missed those. Responding
> below.
> 
> > Hi Matthew,
> > 
> > On Wed, 4 Mar 2026 18:04:25 -0800
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >   
> > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:  
> > > > Hi,
> > > > 
> > > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > >     
> > > 
> > > I'm going to assume that since this is a compositor, you do not pass
> > > input dependencies to the page-flip job. Is that correct?
> > > 
> > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > SCHED_FIFO) if the job has no dependencies.  
> > 
> > I guess by ::run_job() you mean something slightly more involved that
> > checks if:
> > 
> > - other jobs are pending
> > - enough credits (AKA ringbuf space) is available
> > - and probably other stuff I forgot about
> >   
> > > 
> > > This would likely break some of Xe’s submission-backend assumptions
> > > around mutual exclusion and ordering based on the workqueue, but that
> > > seems workable. I don’t know how the Panthor code is structured or
> > > whether they have similar issues.  
> > 
> > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > you're describing. There's just so many things we can forget that would
> > lead to races/ordering issues that will end up being hard to trigger and
> > debug. Besides, it doesn't solve the problem where your gfx pipeline is
> > fully stuffed and the kernel has to dequeue things asynchronously. I do
> > believe we want RT-prio support in that case too.
> >   
> 
> My understanding of SurfaceFlinger is that it never waits on input
> dependencies from rendering applications, since those may not signal in
> time for a page flip. Because of that, you can’t have the job(s) that
> draw to the screen accept input dependencies. Maybe I have that
> wrong—but I've spoken to the Google team several times about issues with
> SurfaceFlinger, and that was my takeaway.
> 
> So I don't think the kernel should ever have to dequeue things
> asynchronously, at least for SurfaceFlinger.

There's still the contention coming from the ring buffer size, which can
prevent jobs from being queued directly to the HW. Though, admittedly,
if the HW is not capable of compositing frames faster than the
refresh rate and guaranteeing an almost-always-empty ring buffer, fixing
the scheduling prio is probably pointless.

> If there is another RT use
> case that requires input dependencies plus the kernel dequeuing things
> asynchronously, I agree this wouldn’t help—but my suggestion also isn’t
> mutually exclusive with other RT rework either.

Yeah, dunno. It just feels like another hack on top of the already quite
convoluted design that drm_sched has become.

> 
> > > 
> > > I can try to hack together a quick PoC to see what this would look like
> > > and give you something to test.
> > >   
> > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > > > meet future android requirements). It seems either workqueue needs to
> > > > gain RT support, or drm_sched needs to support kthread_worker.    
> > > 
> > > +Tejun to see if RT workqueue is in the plans.  
> > 
> > Dunno how feasible that is, but that would be my preferred option.
> >   
> > >   
> > > > 
> > > > I know drm_sched switched from kthread_worker to workqueue for better
> > > > scaling when xe was introduced. But if drm_sched can support either
> > > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > > selectively use kthread_worker only for RT gpu queues. And because
> > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > > > scaling issues.
> > > >     
> > > 
> > > I don’t think having two paths will ever be acceptable, nor do I think
> > > supporting a kthread would be all that easy. For example, in Xe we queue
> > > additional work items outside of the scheduler on the queue for ordering
> > > reasons — we’d have to move all of that code down into DRM sched or
> > > completely redesign our submission model to avoid this. I’m not sure if
> > > other drivers also do this, but it is allowed.  
> > 
> > Panthor doesn't rely on the serialization provided by the single-thread
> > workqueue, Panfrost might rely on it though (I don't remember). I agree
> > that maintaining a thread and workqueue based scheduling is not ideal
> > though.
> >   
> > >   
> > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > > > users have concrete plans for userspace submissions..    
> > > 
> > > Maybe some day....  
> > 
> > I've yet to see a solution where no dma_fence-based signalization is
> > involved in graphics workloads though (IIRC, Arm's solution still
> > needs the kernel for that). Until that happens, we'll still need the
> > kernel to signal fences asynchronously when the job is done, which I
> > suspect will cause the same kind of latency issue...
> >   
> 
> I don't think that is the problem here. Doesn’t the job that draws the
> frame actually draw it, or does the display wait on the draw job’s fence
> to signal and then do something else?

I know close to nothing about SurfaceFlinger and very little about
compositors in general, so I'll let Chia-I answer that one. What's sure
is that, on regular page-flips (I don't remember what async page-flips
do), the display drivers wait on the fences attached to the buffer to
signal before doing the flip.

> (Sorry—I know next to nothing
> about display.) Either way, fences should be signaled in IRQ handlers,

In Panthor they are not, but that's probably something for us to
address.

> which presumably don’t have the same latency issues as workqueues, but I
> could be mistaken.

Might have to do with the mental model I had of this "reconcile
usermode queues with dma_fence signaling" idea, where I was imagining
a SW job queue (based on drm_sched too) that would wait on HW fences to
be signaled and would, as a result, signal the dma_fence attached to the
job. So the queueing/dequeuing of these jobs would still happen through
drm_sched, with the same scheduling prio issue. This being said, those
jobs would likely be dependency-less, so more likely to hit your
fast-path-run-job.
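For what it's worth, that SW-queue mental model can be sketched in userspace as a dedicated worker thread that waits on a job's HW fence and then signals its dma_fence. All names here are hypothetical (this is not kernel code), but the point is that such a worker is a plain thread, so it could be given RT priority via pthread_setschedparam()/SCHED_FIFO, which is exactly what workqueues cannot offer today:

```c
#include <pthread.h>
#include <stdbool.h>

/* A toy fence: a flag protected by a mutex/condvar pair. */
struct model_fence {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool signaled;
};

static void fence_init(struct model_fence *f)
{
	pthread_mutex_init(&f->lock, NULL);
	pthread_cond_init(&f->cond, NULL);
	f->signaled = false;
}

static void fence_signal(struct model_fence *f)
{
	pthread_mutex_lock(&f->lock);
	f->signaled = true;
	pthread_cond_broadcast(&f->cond);
	pthread_mutex_unlock(&f->lock);
}

static void fence_wait(struct model_fence *f)
{
	pthread_mutex_lock(&f->lock);
	while (!f->signaled)
		pthread_cond_wait(&f->cond, &f->lock);
	pthread_mutex_unlock(&f->lock);
}

struct model_job {
	struct model_fence hw_fence;   /* signaled by the "hardware" */
	struct model_fence dma_fence;  /* what the rest of the stack waits on */
};

/* Worker body: waits for HW completion, then signals the dma_fence.
 * The queueing/dequeuing discussed above would live in a loop here,
 * with the same scheduling-priority concerns as today's workqueue. */
static void *fence_worker(void *arg)
{
	struct model_job *job = arg;

	fence_wait(&job->hw_fence);
	fence_signal(&job->dma_fence);
	return NULL;
}
```

The latency question in this thread is then about how quickly this worker gets CPU time between `fence_wait()` returning and `fence_signal()` running.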

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:10       ` Matthew Brost
  2026-03-05  9:47         ` Philipp Stanner
  2026-03-05 10:19         ` Boris Brezillon
@ 2026-03-05 12:27         ` Danilo Krummrich
  2 siblings, 0 replies; 26+ messages in thread
From: Danilo Krummrich @ 2026-03-05 12:27 UTC (permalink / raw)
  To: Matthew Brost
  Cc: phasta, Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe,
	Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu Mar 5, 2026 at 10:10 AM CET, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
>> In the past discussions Danilo and I made it clear that more major
>> features in _new_ patch series aimed at getting merged into drm/sched
>> must be preceded by cleanup work to address some of the scheduler's
>> major problems.
>
> Ah, we've moved to dictatorship quickly. Noted.

While Philipp and I generally share concerns about the scheduler, I
prefer to speak for myself here, as my position is a bit more nuanced than that.

I shared my view on this in detail in [1], so I will keep it very brief here.

From a maintenance perspective, the concern is less about whether a particular
change is correct or small in isolation, but about whether it moves the overall
design in a direction that makes the existing issues harder to resolve
subsequently.

I.e. I think we should try to avoid accumulating new features or special paths
on top of known design issues.

(Please also note that these are general considerations; they are not meant to
imply anything about this specific topic, not least because I did not get to
read through the whole thread yet.)

Thanks,
Danilo

[1] https://lore.kernel.org/all/DFPK5HIP7G9C.2LJ6AOH2UPLEO@kernel.org/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05 10:52       ` Boris Brezillon
@ 2026-03-05 20:51         ` Matthew Brost
  2026-03-06  5:13           ` Chia-I Wu
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2026-03-05 20:51 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote:
> On Thu, 5 Mar 2026 02:09:16 -0800
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
> > 
> > I addressed most of your comments in a chained reply to Phillip, but I
> > guess he dropped some of your email and thus missed those. Responding
> > below.
> > 
> > > Hi Matthew,
> > > 
> > > On Wed, 4 Mar 2026 18:04:25 -0800
> > > Matthew Brost <matthew.brost@intel.com> wrote:
> > >   
> > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:  
> > > > > Hi,
> > > > > 
> > > > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > > >     
> > > > 
> > > > I'm going to assume that since this is a compositor, you do not pass
> > > > input dependencies to the page-flip job. Is that correct?
> > > > 
> > > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > > SCHED_FIFO) if the job has no dependencies.  
> > > 
> > > I guess by ::run_job() you mean something slightly more involved that
> > > checks if:
> > > 
> > > - other jobs are pending
> > > - enough credits (AKA ringbuf space) is available
> > > - and probably other stuff I forgot about
> > >   
> > > > 
> > > > This would likely break some of Xe’s submission-backend assumptions
> > > > around mutual exclusion and ordering based on the workqueue, but that
> > > > seems workable. I don’t know how the Panthor code is structured or
> > > > whether they have similar issues.  
> > > 
> > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > you're describing. There's just so many things we can forget that would
> > > lead to races/ordering issues that will end up being hard to trigger and
> > > debug. Besides, it doesn't solve the problem where your gfx pipeline is
> > > fully stuffed and the kernel has to dequeue things asynchronously. I do
> > > believe we want RT-prio support in that case too.
> > >   
> > 
> > My understanding of SurfaceFlinger is that it never waits on input
> > dependencies from rendering applications, since those may not signal in
> > time for a page flip. Because of that, you can’t have the job(s) that
> > draw to the screen accept input dependencies. Maybe I have that
> > wrong—but I've spoken to the Google team several times about issues with
> > SurfaceFlinger, and that was my takeaway.
> > 
> > So I don't think the kernel should ever have to dequeue things
> > asynchronously, at least for SurfaceFlinger.
> 
> There's still the contention coming from the ring buffer size, which can
> prevent jobs from being queued directly to the HW, though, admittedly,
> if the HW is not capable of compositing the frame faster than the
> refresh rate, and guarantee an almost always empty ringbuffer, fixing
> the scheduling prio is probably pointless.
> 
> > If there is another RT use
> > case that requires input dependencies plus the kernel dequeuing things
> > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t
> > mutually exclusive with other RT rework either.
> 
> Yeah, dunno. It just feels like another hack on top of the already quite
> convoluted design that drm_sched has become.
> 

I agree we wouldn't want this to become some wild hack.

I could actually see this helping in other very timing-sensitive
paths—for example, page-fault paths where a copy job needs to be issued
as part of the fault resolution to a dedicated kernel queue. I’ve seen
noise in fault profiling caused by delays in the scheduler workqueue,
which needs to program the job to the device. In paths like this, every
microsecond matters, as even minor improvements have real-world impacts
on performance numbers. This will become even more noticeable as
CPU<->GPU bus speeds increase. In this case, typically copy jobs have
no input dependencies, thus the desire is to program the ring as quickly
as possible.

> > 
> > > > 
> > > > I can try to hack together a quick PoC to see what this would look like
> > > > and give you something to test.
> > > >   
> > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > > > > meet future android requirements). It seems either workqueue needs to
> > > > > gain RT support, or drm_sched needs to support kthread_worker.    
> > > > 
> > > > +Tejun to see if RT workqueue is in the plans.  
> > > 
> > > Dunno how feasible that is, but that would be my preferred option.
> > >   
> > > >   
> > > > > 
> > > > > I know drm_sched switched from kthread_worker to workqueue for better
> > > > > scaling when xe was introduced. But if drm_sched can support either
> > > > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > > > selectively use kthread_worker only for RT gpu queues. And because
> > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > > > > scaling issues.
> > > > >     
> > > > 
> > > > I don’t think having two paths will ever be acceptable, nor do I think
> > > > supporting a kthread would be all that easy. For example, in Xe we queue
> > > > additional work items outside of the scheduler on the queue for ordering
> > > > reasons — we’d have to move all of that code down into DRM sched or
> > > > completely redesign our submission model to avoid this. I’m not sure if
> > > > other drivers also do this, but it is allowed.  
> > > 
> > > Panthor doesn't rely on the serialization provided by the single-thread
> > > workqueue, Panfrost might rely on it though (I don't remember). I agree
> > > that maintaining a thread and workqueue based scheduling is not ideal
> > > though.
> > >   
> > > >   
> > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > > > > users have concrete plans for userspace submissions..    
> > > > 
> > > > Maybe some day....  
> > > 
> > > I've yet to see a solution where no dma_fence-based signalization is
> > > involved in graphics workloads though (IIRC, Arm's solution still
> > > needs the kernel for that). Until that happens, we'll still need the
> > > kernel to signal fences asynchronously when the job is done, which I
> > > suspect will cause the same kind of latency issue...
> > >   
> > 
> > I don't think that is the problem here. Doesn’t the job that draws the
> > frame actually draw it, or does the display wait on the draw job’s fence
> > to signal and then do something else?
> 
> I know close to nothing about SurfaceFlinger and very little about
> compositors in general, so I'll let Chia answer that one. What's sure

I think Chia-I's input would be good, as if SurfaceFlinger jobs have input
dependencies this entire suggestion doesn't make any sense.

> is that, on regular page-flips (don't remember what async page-flips
> do), the display drivers wait on the fences attached to the buffer to
> signal before doing the flip.

I think SurfaceFlinger is different from Wayland/X11 use cases,
as maintaining a steady framerate is the priority above everything else
(think phone screens, which never freeze, whereas desktops do all the
time). So I believe SurfaceFlinger decides when it will submit the job
to draw a frame, without directly passing application dependencies
into the buffer/job being drawn. Again, my understanding here may be
incorrect...

> 
> > (Sorry—I know next to nothing
> > about display.) Either way, fences should be signaled in IRQ handlers,
> 
> In Panthor they are not, but that's probably something for us to
> address.
> 
> > which presumably don’t have the same latency issues as workqueues, but I
> > could be mistaken.
> 
> Might have to do with the mental model I had of this "reconcile
> Usermode queues with dma_fence signaling" model, where I was imagining
> a SW job queue (based on drm_sched too) that would wait on HW fences to
> be signal and would as a result signal the dma_fence attached to the
> job. So the queueing/dequeuing of these jobs would still happen through
> drm_sched, with the same scheduling prio issue. This being said, those

Yes, if jobs have unmet dependencies, the bypass path doesn’t help with
the DRM scheduler workqueue context switches being slow, as that path
still needs to be taken in those cases.

Also, to bring up something insane we certainly wouldn’t want to do:
calling run_job when dependencies are resolved in the fence callback,
since we could be in an IRQ handler.

Matt

> jobs would likely be dependency less, so more likely to hit your
> fast-path-run-job.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu
                   ` (2 preceding siblings ...)
  2026-03-05  9:23 ` Boris Brezillon
@ 2026-03-05 23:09 ` Hillf Danton
  2026-03-06  5:46   ` Chia-I Wu
  3 siblings, 1 reply; 26+ messages in thread
From: Hillf Danton @ 2026-03-05 23:09 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Matthew Brost, DRI, intel-xe, Danilo Krummrich, Philipp Stanner,
	Boris Brezillon, LKML

On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> Hi,
> 
> Our system compositor (surfaceflinger on android) submits gpu jobs
> from a SCHED_FIFO thread to an RT gpu queue. However, because
> workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> to run_job can sometimes cause frame misses. We are seeing this on
> panthor and xe, but the issue should be common to all drm_sched users.
> 
> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> meet future android requirements). It seems either workqueue needs to
> gain RT support, or drm_sched needs to support kthread_worker.
>
As RT means, in general and to some extent, that the EEVDF game is played in
__userspace__, and you are not PeterZ, an issue like a frame miss is
understandably expected.
Who made the workqueue worker a victim if CPU cycles are not tight?
Who is the new victim of an RT kthread worker?
And since RT is not free, what did you pay for it, given how few RT successes
there are on the market?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05 20:51         ` Matthew Brost
@ 2026-03-06  5:13           ` Chia-I Wu
  2026-03-06  7:21             ` Matthew Brost
  2026-03-06  9:36             ` Michel Dänzer
  0 siblings, 2 replies; 26+ messages in thread
From: Chia-I Wu @ 2026-03-06  5:13 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote:
>
> On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote:
> > On Thu, 5 Mar 2026 02:09:16 -0800
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
> > >
> > > I addressed most of your comments in a chained reply to Phillip, but I
> > > guess he dropped some of your email and thus missed those. Responding
> > > below.
> > >
> > > > Hi Matthew,
> > > >
> > > > On Wed, 4 Mar 2026 18:04:25 -0800
> > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > >
> > > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > > > >
> > > > >
> > > > > I'm going to assume that since this is a compositor, you do not pass
> > > > > input dependencies to the page-flip job. Is that correct?
> > > > >
> > > > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > > > SCHED_FIFO) if the job has no dependencies.
> > > >
> > > > I guess by ::run_job() you mean something slightly more involved that
> > > > checks if:
> > > >
> > > > - other jobs are pending
> > > > - enough credits (AKA ringbuf space) is available
> > > > - and probably other stuff I forgot about
> > > >
> > > > >
> > > > > This would likely break some of Xe’s submission-backend assumptions
> > > > > around mutual exclusion and ordering based on the workqueue, but that
> > > > > seems workable. I don’t know how the Panthor code is structured or
> > > > > whether they have similar issues.
> > > >
> > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > > you're describing. There's just so many things we can forget that would
> > > > lead to races/ordering issues that will end up being hard to trigger and
> > > > debug. Besides, it doesn't solve the problem where your gfx pipeline is
> > > > fully stuffed and the kernel has to dequeue things asynchronously. I do
> > > > believe we want RT-prio support in that case too.
> > > >
> > >
> > > My understanding of SurfaceFlinger is that it never waits on input
> > > dependencies from rendering applications, since those may not signal in
> > > time for a page flip. Because of that, you can’t have the job(s) that
> > > draw to the screen accept input dependencies. Maybe I have that
> > > wrong—but I've spoken to the Google team several times about issues with
> > > SurfaceFlinger, and that was my takeaway.
> > >
> > > So I don't think the kernel should ever have to dequeue things
> > > asynchronously, at least for SurfaceFlinger.
> >
> > There's still the contention coming from the ring buffer size, which can
> > prevent jobs from being queued directly to the HW, though, admittedly,
> > if the HW is not capable of compositing the frame faster than the
> > refresh rate, and guarantee an almost always empty ringbuffer, fixing
> > the scheduling prio is probably pointless.
> >
> > > If there is another RT use
> > > case that requires input dependencies plus the kernel dequeuing things
> > > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t
> > > mutually exclusive with other RT rework either.
> >
> > Yeah, dunno. It just feels like another hack on top of the already quite
> > convoluted design that drm_sched has become.
> >
>
> I agree we wouldn't want this to become some wild hack.
>
> I could actually see this helping in other very timing-sensitive
> paths—for example, page-fault paths where a copy job needs to be issued
> as part of the fault resolution to a dedicated kernel queue. I’ve seen
> noise in fault profiling caused by delays in the scheduler workqueue,
> which needs to program the job to the device. In paths like this, every
> microsecond matters, as even minor improvements have real-world impacts
> on performance numbers. This will become even more noticeable as
> CPU<->GPU bus speeds increase. In this case, typically copy jobs have
> no input dependencies, thus the desire is to program the ring as quickly
> as possible.
>
> > >
> > > > >
> > > > > I can try to hack together a quick PoC to see what this would look like
> > > > > and give you something to test.
> > > > >
> > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > > > > > meet future android requirements). It seems either workqueue needs to
> > > > > > gain RT support, or drm_sched needs to support kthread_worker.
> > > > >
> > > > > +Tejun to see if RT workqueue is in the plans.
> > > >
> > > > Dunno how feasible that is, but that would be my preferred option.
> > > >
> > > > >
> > > > > >
> > > > > > I know drm_sched switched from kthread_worker to workqueue for better
> > > > > > scaling when xe was introduced. But if drm_sched can support either
> > > > > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > > > > selectively use kthread_worker only for RT gpu queues. And because
> > > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > > > > > scaling issues.
> > > > > >
> > > > >
> > > > > I don’t think having two paths will ever be acceptable, nor do I think
> > > > > supporting a kthread would be all that easy. For example, in Xe we queue
> > > > > additional work items outside of the scheduler on the queue for ordering
> > > > > reasons — we’d have to move all of that code down into DRM sched or
> > > > > completely redesign our submission model to avoid this. I’m not sure if
> > > > > other drivers also do this, but it is allowed.
> > > >
> > > > Panthor doesn't rely on the serialization provided by the single-thread
> > > > workqueue, Panfrost might rely on it though (I don't remember). I agree
> > > > that maintaining a thread and workqueue based scheduling is not ideal
> > > > though.
> > > >
> > > > >
> > > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > > > > > users have concrete plans for userspace submissions..
> > > > >
> > > > > Maybe some day....
> > > >
> > > > I've yet to see a solution where no dma_fence-based signalization is
> > > > involved in graphics workloads though (IIRC, Arm's solution still
> > > > needs the kernel for that). Until that happens, we'll still need the
> > > > kernel to signal fences asynchronously when the job is done, which I
> > > > suspect will cause the same kind of latency issue...
> > > >
> > >
> > > I don't think that is the problem here. Doesn’t the job that draws the
> > > frame actually draw it, or does the display wait on the draw job’s fence
> > > to signal and then do something else?
> >
> > I know close to nothing about SurfaceFlinger and very little about
> > compositors in general, so I'll let Chia answer that one. What's sure
>
> I think Chia's input would be good, as if SurfaceFlinger jobs have input
> dependencies this entire suggestion doesn't make any sense.
>
> > is that, on regular page-flips (don't remember what async page-flips
> > do), the display drivers wait on the fences attached to the buffer to
> > signal before doing the flip.
>
> I think SurfaceFlinger is different compared to Wayland/X11 use cases,
> as maintaining a steady framerate is the priority above everything else
> (think phone screens, which never freeze, whereas desktops do all the
> time). So I believe SurfaceFlinger decides when it will submit the job
> to draw a frame, without directly passing in application dependencies
> into the buffer/job being drawn. Again, my understanding here may be
> incorrect...
That is correct. SurfaceFlinger only ever latches buffers whose
associated fences have signaled, and sends the buffers down to the gpu
for composition or to the display for direct scanout. That might also be
how modern wayland compositors work nowadays? It sounds bad to let a
low-fps app slow down system composition.

In theory, the gpu driver should not see input dependencies ever. I
will need to check if there are corner cases.
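For the curious, the latching check on the userspace side is basically a
zero-timeout poll on the buffer's acquire sync_file fd; a sync_file fd
becomes readable once its fence signals. A minimal sketch (the helper
name is mine, not real SurfaceFlinger code):

```c
#include <poll.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if the fence behind this sync_file
 * fd has already signaled. poll() with a zero timeout never blocks; a
 * sync_file fd reports POLLIN once its fence is signaled. */
static bool fence_signaled(int sync_file_fd)
{
	struct pollfd pfd = { .fd = sync_file_fd, .events = POLLIN };

	return poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN);
}
```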


>
> >
> > > (Sorry—I know next to nothing
> > > about display.) Either way, fences should be signaled in IRQ handlers,
> >
> > In Panthor they are not, but that's probably something for us to
> > address.
Yeah, I am also looking into signaling fences from the (threaded) irq handler.
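Roughly what I have in mind, as a sketch (all structure and register
names are made up, not actual panthor code; the point is just calling
dma_fence_signal() directly from the threaded handler instead of punting
to a workqueue):

```c
/* Hypothetical threaded IRQ handler for a job-done interrupt. */
static irqreturn_t my_job_irq_thread(int irq, void *data)
{
	struct my_queue *queue = data;
	struct my_job *job, *tmp;
	unsigned long flags;
	u64 done = readq(queue->regs + JOB_DONE_SEQNO);
	LIST_HEAD(signaled);

	/* Collect completed jobs under the queue lock... */
	spin_lock_irqsave(&queue->lock, flags);
	list_for_each_entry_safe(job, tmp, &queue->in_flight, node) {
		if (job->seqno > done)
			break;
		list_move_tail(&job->node, &signaled);
	}
	spin_unlock_irqrestore(&queue->lock, flags);

	/* ...then signal outside it. dma_fence_signal() runs the fence
	 * callbacks in this (RT-capable) thread context. */
	list_for_each_entry_safe(job, tmp, &signaled, node)
		dma_fence_signal(job->done_fence);

	return IRQ_HANDLED;
}
```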

> >
> > > which presumably don’t have the same latency issues as workqueues, but I
> > > could be mistaken.
> >
> > Might have to do with the mental model I had of this "reconcile
> > Usermode queues with dma_fence signaling" model, where I was imagining
> > a SW job queue (based on drm_sched too) that would wait on HW fences to
> > be signaled and would as a result signal the dma_fence attached to the
> > job. So the queueing/dequeuing of these jobs would still happen through
> > drm_sched, with the same scheduling prio issue. This being said, those
>
> Yes, if jobs have unmet dependencies, the bypass path doesn’t help with
> the DRM scheduler workqueue context switches being slow as that path
> needs to be taken in these cases.
>
> Also, to bring up something insane we certainly wouldn’t want to do:
> calling run_job when dependencies are resolved in the fence callback,
> since we could be in an IRQ handler.
>
> Matt
>
> > jobs would likely be dependency less, so more likely to hit your
> > fast-path-run-job.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:23 ` Boris Brezillon
@ 2026-03-06  5:33   ` Chia-I Wu
  2026-03-06  7:36     ` Matthew Brost
  0 siblings, 1 reply; 26+ messages in thread
From: Chia-I Wu @ 2026-03-06  5:33 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list

On Thu, Mar 5, 2026 at 1:23 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Wed, 4 Mar 2026 14:51:39 -0800
> Chia-I Wu <olvaffe@gmail.com> wrote:
>
> > Hi,
> >
> > Our system compositor (surfaceflinger on android) submits gpu jobs
> > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > to run_job can sometimes cause frame misses. We are seeing this on
> > panthor and xe, but the issue should be common to all drm_sched users.
> >
> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > meet future android requirements). It seems either workqueue needs to
> > gain RT support, or drm_sched needs to support kthread_worker.
> >
> > I know drm_sched switched from kthread_worker to workqueue for better
> > scaling when xe was introduced.
>
> Actually, it went from a plain kthread with open-coded "work" support to
> workqueues. The kthread_worker+kthread_work model looks closer to what
> workqueues provide, so transitioning drivers to it shouldn't be too
> hard. The scalability issue you mentioned (one thread per GPU context
> doesn't scale) doesn't apply, because we can pretty easily share the
> same kthread_worker for all drm_gpu_scheduler instances, just like we
> can share the same workqueue for all drm_gpu_scheduler instances today.
> Luckily, it seems that no one so far has been using
> WQ_PERCPU-workqueues, so that's one less thing we need to worry about.
> The last remaining drawback with a kthread_work[er] based solution is
> the fact workqueues can adjust the number of worker threads on demand
> based on the load. If we really need this flexibility (a non static
> number of threads per-prio level per-driver), that's something we'll
> have to add support for.
Wait, I thought this was the exact scaling issue that the switch to
workqueues solved for xe and panthor? We needed to execute run_job for N
drm_gpu_scheduler instances, where N is entirely under userspace
control. We didn't want to serialize those executions onto a single
thread.

Granted, panthor holds a lock in its run_job callback and does not
benefit from a workqueue. I don't know how xe's run_job behaves, though.

>
> For Panthor, the way I see it, we could start with one thread per-group
> priority, and then pick the worker thread to use at drm_sched_init()
> based on the group prio. If we need something with a thread pool, then
> drm_sched will have to know about those threads, and do some load
> balancing when queueing the works...
>
> Note that someone at Collabora is working on dynamic context priority
> support, meaning we'll have to be able to change the drm_gpu_scheduler
> kthread_worker at runtime.
>
> TLDR; All of this is doable, but it's more work (for us, DRM devs) than
> asking RT prio support to be added to workqueues.

It looks like WQ_RT was last brought up in

  https://lore.kernel.org/all/aPJdrqSiuijOcaPE@slm.duckdns.org/

Maybe adding some form of bring-your-own-worker-pool support to
workqueue will be acceptable?

>
> > But if drm_sched can support either
> > workqueue or kthread_worker during drm_sched_init, drivers can
> > selectively use kthread_worker only for RT gpu queues. And because
> > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > scaling issues.
>
> I think, whatever we choose to go for, we probably don't want to keep
> both models around, because that's going to be a pain to maintain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05 23:09 ` Hillf Danton
@ 2026-03-06  5:46   ` Chia-I Wu
  2026-03-06 11:58     ` Hillf Danton
  0 siblings, 1 reply; 26+ messages in thread
From: Chia-I Wu @ 2026-03-06  5:46 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Matthew Brost, DRI, intel-xe, Danilo Krummrich, Philipp Stanner,
	Boris Brezillon, LKML

On Thu, Mar 5, 2026 at 3:10 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > Hi,
> >
> > Our system compositor (surfaceflinger on android) submits gpu jobs
> > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > to run_job can sometimes cause frame misses. We are seeing this on
> > panthor and xe, but the issue should be common to all drm_sched users.
> >
> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > meet future android requirements). It seems either workqueue needs to
> > gain RT support, or drm_sched needs to support kthread_worker.
> >
> As RT means (in general) to some extent that the game of eevdf is played in
> __userspace__, but you are not PeterZ, so any issue like frame miss is
> understandably expected.
> Who made the workqueue worker a victim if the CPU cycles are not tight?
> Who is the new victim of a RT kthread worker?
> As RT is not free, what did you pay for it, given fewer RT success on market?
That is a deliberate decision for android: avoiding frame misses is
a top priority.

Also, I think most drm drivers already signal their fences from irq
handlers or rt threads for a similar reason. And the reasoning applies
to submissions as well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-06  5:13           ` Chia-I Wu
@ 2026-03-06  7:21             ` Matthew Brost
  2026-03-06  9:36             ` Michel Dänzer
  1 sibling, 0 replies; 26+ messages in thread
From: Matthew Brost @ 2026-03-06  7:21 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, Mar 05, 2026 at 09:13:44PM -0800, Chia-I Wu wrote:
> On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote:
> > > On Thu, 5 Mar 2026 02:09:16 -0800
> > > Matthew Brost <matthew.brost@intel.com> wrote:
> > >
> > > > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
> > > >
> > > > I addressed most of your comments in a chained reply to Phillip, but I
> > > > guess he dropped some of your email and thus missed those. Responding
> > > > below.
> > > >
> > > > > Hi Matthew,
> > > > >
> > > > > On Wed, 4 Mar 2026 18:04:25 -0800
> > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > >
> > > > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > > > > >
> > > > > >
> > > > > > I'm going to assume that since this is a compositor, you do not pass
> > > > > > input dependencies to the page-flip job. Is that correct?
> > > > > >
> > > > > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > > > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > > > > SCHED_FIFO) if the job has no dependencies.
> > > > >
> > > > > I guess by ::run_job() you mean something slightly more involved that
> > > > > checks if:
> > > > >
> > > > > - other jobs are pending
> > > > > - enough credits (AKA ringbuf space) is available
> > > > > - and probably other stuff I forgot about
> > > > >
> > > > > >
> > > > > > This would likely break some of Xe’s submission-backend assumptions
> > > > > > around mutual exclusion and ordering based on the workqueue, but that
> > > > > > seems workable. I don’t know how the Panthor code is structured or
> > > > > > whether they have similar issues.
> > > > >
> > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > > > you're describing. There's just so many things we can forget that would
> > > > > lead to races/ordering issues that will end up being hard to trigger and
> > > > > debug. Besides, it doesn't solve the problem where your gfx pipeline is
> > > > > fully stuffed and the kernel has to dequeue things asynchronously. I do
> > > > > believe we want RT-prio support in that case too.
> > > > >
> > > >
> > > > My understanding of SurfaceFlinger is that it never waits on input
> > > > dependencies from rendering applications, since those may not signal in
> > > > time for a page flip. Because of that, you can’t have the job(s) that
> > > > draw to the screen accept input dependencies. Maybe I have that
> > > > wrong—but I've spoken to the Google team several times about issues with
> > > > SurfaceFlinger, and that was my takeaway.
> > > >
> > > > So I don't think the kernel should ever have to dequeue things
> > > > asynchronously, at least for SurfaceFlinger.
> > >
> > > There's still the contention coming from the ring buffer size, which can
> > > prevent jobs from being queued directly to the HW, though, admittedly,
> > > if the HW is not capable of compositing the frame faster than the
> > > refresh rate, and guarantee an almost always empty ringbuffer, fixing
> > > the scheduling prio is probably pointless.
> > >
> > > > If there is another RT use
> > > > case that requires input dependencies plus the kernel dequeuing things
> > > > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t
> > > > mutually exclusive with other RT rework either.
> > >
> > > Yeah, dunno. It just feels like another hack on top of the already quite
> > > convoluted design that drm_sched has become.
> > >
> >
> > I agree we wouldn't want this to become some wild hack.
> >
> > I could actually see this helping in other very timing-sensitive
> > paths—for example, page-fault paths where a copy job needs to be issued
> > as part of the fault resolution to a dedicated kernel queue. I’ve seen
> > noise in fault profiling caused by delays in the scheduler workqueue,
> > which needs to program the job to the device. In paths like this, every
> > microsecond matters, as even minor improvements have real-world impacts
> > on performance numbers. This will become even more noticeable as
> > CPU<->GPU bus speeds increase. In this case, typically copy jobs have
> > no input dependencies, thus the desire is to program the ring as quickly
> > as possible.
> >
> > > >
> > > > > >
> > > > > > I can try to hack together a quick PoC to see what this would look like
> > > > > > and give you something to test.
> > > > > >
> > > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > > > > > > meet future android requirements). It seems either workqueue needs to
> > > > > > > gain RT support, or drm_sched needs to support kthread_worker.
> > > > > >
> > > > > > +Tejun to see if RT workqueue is in the plans.
> > > > >
> > > > > Dunno how feasible that is, but that would be my preferred option.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > I know drm_sched switched from kthread_worker to workqueue for better
> > > > > > > scaling when xe was introduced. But if drm_sched can support either
> > > > > > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > > > > > selectively use kthread_worker only for RT gpu queues. And because
> > > > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > > > > > > scaling issues.
> > > > > > >
> > > > > >
> > > > > > I don’t think having two paths will ever be acceptable, nor do I think
> > > > > > supporting a kthread would be all that easy. For example, in Xe we queue
> > > > > > additional work items outside of the scheduler on the queue for ordering
> > > > > > reasons — we’d have to move all of that code down into DRM sched or
> > > > > > completely redesign our submission model to avoid this. I’m not sure if
> > > > > > other drivers also do this, but it is allowed.
> > > > >
> > > > > Panthor doesn't rely on the serialization provided by the single-thread
> > > > > workqueue, Panfrost might rely on it though (I don't remember). I agree
> > > > > that maintaining a thread and workqueue based scheduling is not ideal
> > > > > though.
> > > > >
> > > > > >
> > > > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > > > > > > users have concrete plans for userspace submissions..
> > > > > >
> > > > > > Maybe some day....
> > > > >
> > > > > I've yet to see a solution where no dma_fence-based signalization is
> > > > > involved in graphics workloads though (IIRC, Arm's solution still
> > > > > needs the kernel for that). Until that happens, we'll still need the
> > > > > kernel to signal fences asynchronously when the job is done, which I
> > > > > suspect will cause the same kind of latency issue...
> > > > >
> > > >
> > > > I don't think that is the problem here. Doesn’t the job that draws the
> > > > frame actually draw it, or does the display wait on the draw job’s fence
> > > > to signal and then do something else?
> > >
> > > I know close to nothing about SurfaceFlinger and very little about
> > > compositors in general, so I'll let Chia answer that one. What's sure
> >
> > I think Chia's input would be good, as if SurfaceFlinger jobs have input
> > dependencies this entire suggestion doesn't make any sense.
> >
> > > is that, on regular page-flips (don't remember what async page-flips
> > > do), the display drivers wait on the fences attached to the buffer to
> > > signal before doing the flip.
> >
> > I think SurfaceFlinger is different compared to Wayland/X11 use cases,
> > as maintaining a steady framerate is the priority above everything else
> > (think phone screens, which never freeze, whereas desktops do all the
> > time). So I believe SurfaceFlinger decides when it will submit the job
> > to draw a frame, without directly passing in application dependencies
> > into the buffer/job being drawn. Again, my understanding here may be
> > incorrect...
> That is correct. SurfaceFlinger only ever latches buffers whose
> associated fences have signaled, and sends down the buffers to gpu for
> composition or to the display for direct scanout. That might also be
> how modern wayland compositors work nowadays? It sounds bad to let a

I don't know wayland, but let me follow up on that.

> low fps app slow down system composition.
> 
> In theory, the gpu driver should not see input dependencies ever. I
> will need to check if there are corner cases.
> 

Thanks — this matches my understanding from my conversations with Google
about SurfaceFlinger and the lack of dependencies. If you can also check
any corner cases, that would be good to understand as well. The kernel
can technically introduce dependencies if it moves memory around, but
something like that shouldn’t happen in practice.

I'd strongly suggest a bypass path as a solution. I mentioned this to
Boris — this approach is not mutually exclusive with other RT rework
either, and in any case it is likely the most performant and stable
path (i.e. no jitter).
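
To make the idea concrete, the shape I'm playing with looks roughly
like this. Everything except drm_sched_entity_push_job() is an invented
name for illustration, not actual DRM sched API, and the real version
needs to handle the ordering/mutual-exclusion concerns Boris raised:

```c
/* Sketch of an opt-in direct-submit path in the exec IOCTL context. */
static void my_exec_submit(struct my_exec_queue *q,
			   struct drm_sched_job *job)
{
	/*
	 * Fast path: no unmet dependencies, and the scheduler is idle
	 * with ring space available. my_sched_try_claim_direct() would
	 * atomically check both and reserve the credits, so the slow
	 * path can't race in underneath us. The HW programming then
	 * happens in the submitting (SCHED_FIFO) thread.
	 */
	if (my_job_has_no_deps(job) &&
	    my_sched_try_claim_direct(&q->sched, job)) {
		my_backend_run_job(q, job); /* same backend run_job uses */
		return;
	}

	/* Slow path: normal drm_sched queuing via the workqueue. */
	drm_sched_entity_push_job(job);
}
```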

> 
> >
> > >
> > > > (Sorry—I know next to nothing
> > > > about display.) Either way, fences should be signaled in IRQ handlers,
> > >
> > > In Panthor they are not, but that's probably something for us to
> > > address.
> Yeah, I am also looking into signaling fences from the (threaded) irq handler.
> 

I would suggest that you do. The Xe implementation is in xe_hw_fence.c
if you want a design reference.

Matt

> > >
> > > > which presumably don’t have the same latency issues as workqueues, but I
> > > > could be mistaken.
> > >
> > > Might have to do with the mental model I had of this "reconcile
> > > Usermode queues with dma_fence signaling" model, where I was imagining
> > > a SW job queue (based on drm_sched too) that would wait on HW fences to
> > > be signaled and would as a result signal the dma_fence attached to the
> > > job. So the queueing/dequeuing of these jobs would still happen through
> > > drm_sched, with the same scheduling prio issue. This being said, those
> >
> > Yes, if jobs have unmet dependencies, the bypass path doesn’t help with
> > the DRM scheduler workqueue context switches being slow as that path
> > needs to be taken in these cases.
> >
> > Also, to bring up something insane we certainly wouldn’t want to do:
> > calling run_job when dependencies are resolved in the fence callback,
> > since we could be in an IRQ handler.
> >
> > Matt
> >
> > > jobs would likely be dependency less, so more likely to hit your
> > > fast-path-run-job.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-06  5:33   ` Chia-I Wu
@ 2026-03-06  7:36     ` Matthew Brost
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Brost @ 2026-03-06  7:36 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list

On Thu, Mar 05, 2026 at 09:33:36PM -0800, Chia-I Wu wrote:
> On Thu, Mar 5, 2026 at 1:23 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > On Wed, 4 Mar 2026 14:51:39 -0800
> > Chia-I Wu <olvaffe@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > to run_job can sometimes cause frame misses. We are seeing this on
> > > panthor and xe, but the issue should be common to all drm_sched users.
> > >
> > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > > meet future android requirements). It seems either workqueue needs to
> > > gain RT support, or drm_sched needs to support kthread_worker.
> > >
> > > I know drm_sched switched from kthread_worker to workqueue for better
> > > scaling when xe was introduced.
> >
> > Actually, it went from a plain kthread with open-coded "work" support to
> > workqueues. The kthread_worker+kthread_work model looks closer to what
> > workqueues provide, so transitioning drivers to it shouldn't be too
> > hard. The scalability issue you mentioned (one thread per GPU context
> > doesn't scale) doesn't apply, because we can pretty easily share the
> > same kthread_worker for all drm_gpu_scheduler instances, just like we
> > can share the same workqueue for all drm_gpu_scheduler instances today.
> > Luckily, it seems that no one so far has been using
> > WQ_PERCPU-workqueues, so that's one less thing we need to worry about.
> > The last remaining drawback with a kthread_work[er] based solution is
> > the fact workqueues can adjust the number of worker threads on demand
> > based on the load. If we really need this flexibility (a non static
> > number of threads per-prio level per-driver), that's something we'll
> > have to add support for.
> Wait, I thought this was the exact scaling issue that workqueue solved
> for xe and panthor? We needed to execute run_jobs for N
> drm_gpu_scheduler instances, where N is in total control of the
> userspace. We didn't want to serialize the executions to a single
> thread.
> 

I honestly doubt more threads help here. In Xe, the time to push a job
(run_job) to the hardware is maybe 1µs. Individual workqueues
are mostly for our compute use cases, where we sometimes need to sleep
inside the work item and don’t want that sleep to interfere with other
clients. For 3D, I suspect we could use a shared workqueue (still with a
dedicated scheduler instance per user queue) among all clients and not
see a noticeable change in performance - it might actually be better. At
one point I converted Xe to do this, but I lost track of the patches in
the stack.
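
For reference, this is roughly what I mean by sharing a submit
workqueue across scheduler instances. The exact drm_sched_init()
signature has changed across kernel versions, so treat this as a sketch
of the submit_wq plumbing rather than exact code; "drv-submit",
ring_credits, etc. are placeholders:

```c
/* One submission workqueue shared by every scheduler instance; each
 * user queue still gets its own drm_gpu_scheduler. */
struct workqueue_struct *shared_submit_wq;

shared_submit_wq = alloc_workqueue("drv-submit", WQ_HIGHPRI, 0);

/* Per user queue, at queue-creation time: pass the shared wq as
 * submit_wq instead of allocating one per queue. */
drm_sched_init(&q->sched, &my_sched_ops,
	       shared_submit_wq,	/* submit_wq: shared, not per-queue */
	       1,			/* num_rqs */
	       ring_credits,		/* credit_limit */
	       0,			/* hang_limit */
	       MAX_SCHEDULE_TIMEOUT,	/* timeout */
	       NULL, NULL,		/* timeout_wq, score */
	       q->name, dev);
```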

> Granted, panthor holds a lock in its run_job callback and does not
> benefit from a workqueue. I don't know how xe's run_job does though.
> 

We grab a shared mutex for the firmware queue push, but it is a very
tight path and likely within the window where the mutex is still
spinning.

> >
> > For Panthor, the way I see it, we could start with one thread per-group
> > priority, and then pick the worker thread to use at drm_sched_init()
> > based on the group prio. If we need something with a thread pool, then
> > drm_sched will have to know about those threads, and do some load
> > balancing when queueing the works...
> >
> > Note that someone at Collabora is working on dynamic context priority
> > support, meaning we'll have to be able to change the drm_gpu_scheduler
> > kthread_worker at runtime.
> >
> > TLDR; All of this is doable, but it's more work (for us, DRM devs) than
> > asking RT prio support to be added to workqueues.
> 
> It looks like WQ_RT was last brought up in
> 
>   https://lore.kernel.org/all/aPJdrqSiuijOcaPE@slm.duckdns.org/
> 

Tejun says hard no on WQ_RT.

> Maybe adding some form of bring-your-own-worker-pool support to
> workqueue will be acceptable?
>

Before doing anything too crazy, I think we should consider a direct
submit path, given that you’ve confirmed SurfaceFlinger does not have
input dependencies. I’m fairly close to having something I feel good
about posting. If you could test it out and report back, I think that
would be a good place to start — then we can duke it out among the
maintainers over whether this is acceptable.

Matt

> >
> > > But if drm_sched can support either
> > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > selectively use kthread_worker only for RT gpu queues. And because
> > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > > scaling issues.
> >
> > I think, whatever we choose to go for, we probably don't want to keep
> > both models around, because that's going to be a pain to maintain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-06  5:13           ` Chia-I Wu
  2026-03-06  7:21             ` Matthew Brost
@ 2026-03-06  9:36             ` Michel Dänzer
  2026-03-06  9:40               ` Michel Dänzer
  1 sibling, 1 reply; 26+ messages in thread
From: Michel Dänzer @ 2026-03-06  9:36 UTC (permalink / raw)
  To: Chia-I Wu, Matthew Brost
  Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On 3/6/26 06:13, Chia-I Wu wrote:
> On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote:
>> On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote:
>>> On Thu, 5 Mar 2026 02:09:16 -0800
>>> Matthew Brost <matthew.brost@intel.com> wrote:
>>>> On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
>>>>> On Wed, 4 Mar 2026 18:04:25 -0800
>>>>> Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
>>>>>>>
>>>>>>> Thoughts? Or perhaps this becomes less of an issue if all drm_sched
>>>>>>> users have concrete plans for userspace submissions..
>>>>>>
>>>>>> Maybe some day....
>>>>>
>>>>> I've yet to see a solution where no dma_fence-based signalization is
>>>>> involved in graphics workloads though (IIRC, Arm's solution still
>>>>> needs the kernel for that). Until that happens, we'll still need the
>>>>> kernel to signal fences asynchronously when the job is done, which I
>>>>> suspect will cause the same kind of latency issue...
>>>>>
>>>>
>>>> I don't think that is the problem here. Doesn’t the job that draws the
>>>> frame actually draw it, or does the display wait on the draw job’s fence
>>>> to signal and then do something else?
>>>
>>> I know close to nothing about SurfaceFlinger and very little about
>>> compositors in general, so I'll let Chia answer that one. What's sure
>>
>> I think Chia's input would be good, as if SurfaceFlinger jobs have input
>> dependencies this entire suggestion doesn't make any sense.
>>
>>> is that, on regular page-flips (don't remember what async page-flips
>>> do), the display drivers wait on the fences attached to the buffer to
>>> signal before doing the flip.
>>
>> I think SurfaceFlinger is different compared to Wayland/X11 use cases,
>> as maintaining a steady framerate is the priority above everything else
>> (think phone screens, which never freeze, whereas desktops do all the
>> time). So I believe SurfaceFlinger decides when it will submit the job
>> to draw a frame, without directly passing in application dependencies
>> into the buffer/job being drawn. Again, my understanding here may be
>> incorrect...
> That is correct. SurfaceFlinger only ever latches buffers whose
> associated fences have signaled, and sends down the buffers to gpu for
> composition or to the display for direct scanout. That might also be
> how modern wayland compositors work nowadays?

Many (most of the major ones?) do, yes. (Weston being a notable exception AFAIK, though since it supports the Wayland syncobj protocol now, switching to this model should be easy)


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-06  9:36             ` Michel Dänzer
@ 2026-03-06  9:40               ` Michel Dänzer
  0 siblings, 0 replies; 26+ messages in thread
From: Michel Dänzer @ 2026-03-06  9:40 UTC (permalink / raw)
  To: Chia-I Wu, Matthew Brost
  Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On 3/6/26 10:36, Michel Dänzer wrote:
> On 3/6/26 06:13, Chia-I Wu wrote:
>> On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote:
>>> On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote:
>>>> On Thu, 5 Mar 2026 02:09:16 -0800
>>>> Matthew Brost <matthew.brost@intel.com> wrote:
>>>>> On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
>>>>>> On Wed, 4 Mar 2026 18:04:25 -0800
>>>>>> Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>>> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
>>>>>>>>
>>>>>>>> Thoughts? Or perhaps this becomes less of an issue if all drm_sched
>>>>>>>> users have concrete plans for userspace submissions..
>>>>>>>
>>>>>>> Maybe some day....
>>>>>>
>>>>>> I've yet to see a solution where no dma_fence-based signalization is
>>>>>> involved in graphics workloads though (IIRC, Arm's solution still
>>>>>> needs the kernel for that). Until that happens, we'll still need the
>>>>>> kernel to signal fences asynchronously when the job is done, which I
>>>>>> suspect will cause the same kind of latency issue...
>>>>>>
>>>>>
>>>>> I don't think that is the problem here. Doesn’t the job that draws the
>>>>> frame actually draw it, or does the display wait on the draw job’s fence
>>>>> to signal and then do something else?
>>>>
>>>> I know close to nothing about SurfaceFlinger and very little about
>>>> compositors in general, so I'll let Chia answer that one. What's sure
>>>
>>> I think Chia's input would be good, as if SurfaceFlinger jobs have input
>>> dependencies this entire suggestion doesn't make any sense.
>>>
>>>> is that, on regular page-flips (don't remember what async page-flips
>>>> do), the display drivers wait on the fences attached to the buffer to
>>>> signal before doing the flip.
>>>
>>> I think SurfaceFlinger is different compared to Wayland/X11 use cases,
>>> as maintaining a steady framerate is the priority above everything else
>>> (think phone screens, which never freeze, whereas desktops do all the
>>> time). So I believe SurfaceFlinger decides when it will submit the job
>>> to draw a frame, without directly passing in application dependencies
>>> into the buffer/job being drawn. Again, my understanding here may be
>>> incorrect...
>> That is correct. SurfaceFlinger only ever latches buffers whose
>> associated fences have signaled, and sends down the buffers to gpu for
>> composition or to the display for direct scanout. That might also be
>> how modern wayland compositors work nowadays?
> 
> Many (most of the major ones?) do, yes. (Weston being a notable exception AFAIK, though since it supports the Wayland syncobj protocol now, switching to this model should be easy)

Err, I meant the commit-timing protocol, Weston doesn't support the syncobj protocol yet AFAICT.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-06  5:46   ` Chia-I Wu
@ 2026-03-06 11:58     ` Hillf Danton
  0 siblings, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2026-03-06 11:58 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Matthew Brost, DRI, intel-xe, Danilo Krummrich, Philipp Stanner,
	Boris Brezillon, LKML

On Thu, 5 Mar 2026 21:46:21 -0800 Chia-I Wu wrote:
>On Thu, Mar 5, 2026 at 3:10 PM Hillf Danton <hdanton@sina.com> wrote:
>> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
>> > Hi,
>> >
>> > Our system compositor (surfaceflinger on android) submits gpu jobs
>> > from a SCHED_FIFO thread to an RT gpu queue. However, because
>> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
>> > to run_job can sometimes cause frame misses. We are seeing this on
>> > panthor and xe, but the issue should be common to all drm_sched users.
>> >
>> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
>> > meet future android requirements). It seems either workqueue needs to
>> > gain RT support, or drm_sched needs to support kthread_worker.
>> >
>> As RT means (in general) to some extent that the game of eevdf is played in
>> __userspace__, but you are not PeterZ, so any issue like frame miss is
>> understandably expected.
>> Who made the workqueue worker a victim if the CPU cycles are not tight?
>> Who is the new victim of a RT kthread worker?
>> As RT is not free, what did you pay for it, given fewer RT success on market?
>>
> That is a deliberate decision for Android: avoiding frame misses
> is a top priority.
> 
> Also, I think most drm drivers already signal their fences from irq
> handlers or rt threads for a similar reason. And the reasoning applies
> to submissions as well.
> 
If RT submission alone works for you then your CPU cycles are tight.
And if your workloads are sanely correct then making workqueue and/or kthread
worker RT barely makes sense, because the right option is to buy a CPU
with higher capacity.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:47         ` Philipp Stanner
@ 2026-03-16  4:05           ` Matthew Brost
  2026-03-16  4:14             ` Matthew Brost
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2026-03-16  4:05 UTC (permalink / raw)
  To: phasta
  Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Thu, Mar 05, 2026 at 10:47:32AM +0100, Philipp Stanner wrote:

Off the list... I don’t think airing our personal attacks publicly is a
good look. I’m going to be blunt here in an effort to help you.

> On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> > On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > > 
> > > > 
> 
> […]
> 
> > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > > you're describing. There's just so many things we can forget that would
> > > > lead to races/ordering issues that will end up being hard to trigger and
> > > > debug.
> > > > 
> > > 
> > > +1
> > > 
> > > I'm not thrilled either. More like the opposite of thrilled actually.
> > > 
> > > Even if we could get that to work. This is more of a maintainability
> > > issue.
> > > 
> > > The scheduler is full of insane performance hacks for this or that
> > > driver. Lockless accesses, a special lockless queue only used by that
> > > one party in the kernel (a lockless queue which is nowadays, after N
> > > reworks, being used with a lock. Ah well).
> > > 
> > 
> > This is not relevant to this discussion—see below. In general, I agree
> > that the lockless tricks in the scheduler are not great, nor is the fact
> > that the scheduler became a dumping ground for driver-specific features.
> > But again, that is not what we’re talking about here—see below.
> > 
> > > In the past discussions Danilo and I made it clear that more major
> > > features in _new_ patch series aimed at getting merged into drm/sched
> > > must be preceded by cleanup work to address some of the scheduler's
> > > major problems.
> > 
> > Ah, we've moved to dictatorship quickly. Noted.
> 
> I prefer the term "benevolent presidency" /s
> 
> Or even better: s/dictatorship/accountability enforcement.
> 

It’s very hard to take this seriously when I reply to threads saying
something breaks dma-fence rules and the response is, “what are
dma-fence rules?” Or I read through the jobqueue thread and see you
asking why a dma-fence would come from anywhere other than your own
driver — that’s the entire point of dma-fence; it’s a cross-driver
contract. I could go on, but I’d encourage you to take a hard look at
your understanding of DRM, and whether your responses — to me and to
others — are backed by the necessary technical knowledge.

Even better — what first annoyed me was your XDC presentation. You gave
an example of my driver modifying the pending list without a lock while
scheduling was stopped, and claimed you fixed a bug. That was not a bug
- Xe would explode if it were, as we test our code. The pending list can
be modified without a lock if scheduling is stopped. I almost grabbed
the mic to correct you. Yes, it’s a layering violation, but presenting
it as a bug shows a clear lack of understanding.

> How does it come that everyone is here and ready so quickly when it

I’ve suggested ideas to fix DRM sched (refcounting, clear teardown
flows), but they were immediately met with resistance — typically from
Christian with you agreeing. My willingness to fight with Christian is
low; I really don’t need another person to argue with.

> comes to new use cases and features, yet I never saw anyone except for
> Tvrtko and Maíra investing even 15 minutes to write a simple patch to
> address some of the *various* significant issues in that code base?
> 
> You were on CC on all discussions we've had here for the last years
> afair, but I rarely saw you participate. And you know what it's like:

I’ll admit I’m busy with many other things, so my bandwidth is limited.
But again, if I chime in and explain how I solved something in Xe (e.g.,
refcounting) and it’s met with resistance, I’ll likely move on — I’ve
already solved it, and I’ll just let you fail (see cancel_job).

> who doesn't speak up silently agrees in open source.
> 
> But tell me one thing, if you can be so kind:
> 

I'm glad you asked this, and it inspired me to fix this, more below [1].

> What is your theory why drm/sched came to be in such horrible shape?

drm/sched was ported from AMDGPU into common code. It carried many
AMDGPU-specific hacks, had no object-lifetime model thought out as a
common component, and included teardown nightmares that “worked,” but
other drivers immediately had to work around. With Christian involved —
who is notoriously hostile — everyone did their best to paper over
issues driver-side rather than get into fights and fix things properly.
Asahi Linux publicly aired grievances about this situation years ago.

> What circumstances, what human behavioral patterns have caused this?
> 

See above.

> The DRM subsystem has a bad reputation regarding stability among Linux
> users, as far as I have sensed. How can we do better?
> 

Write sane code and test it. fwiw, Google shared a doc with me
indicating that Xe has unprecedented stability, and to be honest, when I
first wrote Xe I barely knew what I was doing — but I did know how to
test. I’ve since cleaned up most of my mistakes though.

So how can we do better... We can [1].

I started on [1] after you asked what the problems in DRM sched are, which
got me thinking about what it would look like if we took the good parts
(stop/start control plane, dependency tracking, ordering, finished
fences, etc.), dropped the bad parts (no object-lifetime model, no
refcounting, overly complex queue teardown, messy fence manipulation,
hardware-scheduling baggage, lack of annotations, etc.), and wrote
something that addresses all of these problems from the start
specifically for firmware-scheduling models.

It turns out pretty good.

Main patch [2].

Xe is fully converted, tested, and working. AMDXDNA and Panthor are
compiling. Nouveau and PVR seem like good candidates to convert as well.
Rust bindings are also possible given the clear object model with
refcounting and well-defined object lifetimes.

Thinking further, hardware schedulers should be able to be implemented
on top of this by embedding the objects in [2] and layering a
backend/API on top.

Let me know if you have any feedback (off-list) before I share this
publicly. So far, Dave, Sima, Danilo, and the other Xe maintainers have
been looped in.

Matt

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/tree/local_dev/new_scheduler.post?ref_type=heads
[2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/0538a3bc2a3b562dc0427a5922958189e0be8271

> > 
> > > 
> > 
> > I can't say I agree with either of you here.
> > 
> > In about an hour, I seemingly have a bypass path working in DRM sched +
> > Xe, and my diff is:
> > 
> > 108 insertions(+), 31 deletions(-)
> 
> LOC is a bad metric for complexity.
> 
> > 
> > About 40 lines of the insertions are kernel-doc, so I'm not buying that
> > this is a maintenance issue or a major feature - it is literally a
> > single new function.
> > 
> > I understand a bypass path can create issues—for example, on certain
> > queues in Xe I definitely can't use the bypass path, so Xe simply
> > wouldn’t use it in those cases. This is the driver's choice to use or
> > not. If a driver doesn't know how to use the scheduler, well, that’s on
> > the driver. Providing a simple, documented function as a fast path
> > really isn't some crazy idea.
> 
> We're effectively talking about a deviation from the default submission
> mechanism, and all that seems to be desired for a luxury feature.
> 
> Then you end up with two submission mechanisms, whose correctness in
> the future relies on someone remembering what the background was, why
> it was added, and what the rules are..
> 
> The current scheduler rules are / were often not even documented, and
> sometimes even Christian took a few weeks to remember again why
> something had been added – and whether it can now be removed again or
> not.
> 
> > 
> > The alternative—asking for RT workqueues or changing the design to use
> > kthread_worker—actually is.
> > 
> > > That's especially true if it's features aimed at performance buffs.
> > > 
> > 
> > With the above mindset, I'm actually very confused why this series [1]
> > would even be considered as this order of magnitude greater in
> > complexity than my suggestion here.
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/159025/ 
> 
> The discussions about Tvrtko's CFS series were precisely the point
> where Danilo brought up that after this can be merged, future rework of
> the scheduler must focus on addressing some of the pending fundamental
> issues.
> 
> The background is that Tvrtko has worked on that series already for
> well over a year, it actually simplifies some things in the sense of
> removing unused code (obviously it's a complex series, no argument
> about that), and we agreed on XDC that this can be merged. So this is a
> question of fairness to the contributor.
> 
> But at one point you have to finally draw a line. No one will ever
> address major scheduler issues unless we demand it. Even very
> experienced devs usually prefer to hack around the central design
> issues in their drivers instead of fixing the shared infrastructure.
> 
> 
> P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-16  4:05           ` Matthew Brost
@ 2026-03-16  4:14             ` Matthew Brost
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Brost @ 2026-03-16  4:14 UTC (permalink / raw)
  To: phasta
  Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list, tj

On Sun, Mar 15, 2026 at 09:05:20PM -0700, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 10:47:32AM +0100, Philipp Stanner wrote:
> 

Obviously this was intended as a private communication — I hit the wrong
button. I apologize to anyone I offended here.

Matt

> Off the list... I don’t think airing our personal attacks publicly is a
> good look. I’m going to be blunt here in an effort to help you.
> 
> > On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> > > On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > > > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > > > 
> > > > > 
> > 
> > […]
> > 
> > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > > > you're describing. There's just so many things we can forget that would
> > > > > lead to races/ordering issues that will end up being hard to trigger and
> > > > > debug.
> > > > > 
> > > > 
> > > > +1
> > > > 
> > > > I'm not thrilled either. More like the opposite of thrilled actually.
> > > > 
> > > > Even if we could get that to work. This is more of a maintainability
> > > > issue.
> > > > 
> > > > The scheduler is full of insane performance hacks for this or that
> > > > driver. Lockless accesses, a special lockless queue only used by that
> > > > one party in the kernel (a lockless queue which is nowadays, after N
> > > > reworks, being used with a lock. Ah well).
> > > > 
> > > 
> > > This is not relevant to this discussion—see below. In general, I agree
> > > that the lockless tricks in the scheduler are not great, nor is the fact
> > > that the scheduler became a dumping ground for driver-specific features.
> > > But again, that is not what we’re talking about here—see below.
> > > 
> > > > In the past discussions Danilo and I made it clear that more major
> > > > features in _new_ patch series aimed at getting merged into drm/sched
> > > > must be preceded by cleanup work to address some of the scheduler's
> > > > major problems.
> > > 
> > > Ah, we've moved to dictatorship quickly. Noted.
> > 
> > I prefer the term "benevolent presidency" /s
> > 
> > Or even better: s/dictatorship/accountability enforcement.
> > 
> 
> It’s very hard to take this seriously when I reply to threads saying
> something breaks dma-fence rules and the response is, “what are
> dma-fence rules?” Or I read through the jobqueue thread and see you
> asking why a dma-fence would come from anywhere other than your own
> driver — that’s the entire point of dma-fence; it’s a cross-driver
> contract. I could go on, but I’d encourage you to take a hard look at
> your understanding of DRM, and whether your responses — to me and to
> others — are backed by the necessary technical knowledge.
> 
> Even better — what first annoyed me was your XDC presentation. You gave
> an example of my driver modifying the pending list without a lock while
> scheduling was stopped, and claimed you fixed a bug. That was not a bug
> - Xe would explode if it were, as we test our code. The pending list can
> be modified without a lock if scheduling is stopped. I almost grabbed
> the mic to correct you. Yes, it’s a layering violation, but presenting
> it as a bug shows a clear lack of understanding.
> 
> > How does it come that everyone is here and ready so quickly when it
> 
> I’ve suggested ideas to fix DRM sched (refcounting, clear teardown
> flows), but they were immediately met with resistance — typically from
> Christian with you agreeing. My willingness to fight with Christian is
> low; I really don’t need another person to argue with.
> 
> > comes to new use cases and features, yet I never saw anyone except for
> > Tvrtko and Maíra investing even 15 minutes to write a simple patch to
> > address some of the *various* significant issues in that code base?
> > 
> > You were on CC on all discussions we've had here for the last years
> > afair, but I rarely saw you participate. And you know what it's like:
> 
> I’ll admit I’m busy with many other things, so my bandwidth is limited.
> But again, if I chime in and explain how I solved something in Xe (e.g.,
> refcounting) and it’s met with resistance, I’ll likely move on — I’ve
> already solved it, and I’ll just let you fail (see cancel_job).
> 
> > who doesn't speak up silently agrees in open source.
> > 
> > But tell me one thing, if you can be so kind:
> > 
> 
> I'm glad you asked this, and it inspired me to fix this, more below [1].
> 
> > What is your theory why drm/sched came to be in such horrible shape?
> 
> drm/sched was ported from AMDGPU into common code. It carried many
> AMDGPU-specific hacks, had no object-lifetime model thought out as a
> common component, and included teardown nightmares that “worked,” but
> other drivers immediately had to work around. With Christian involved —
> who is notoriously hostile — everyone did their best to paper over
> issues driver-side rather than get into fights and fix things properly.
> Asahi Linux publicly aired grievances about this situation years ago.
> 
> > What circumstances, what human behavioral patterns have caused this?
> > 
> 
> See above.
> 
> > The DRM subsystem has a bad reputation regarding stability among Linux
> > users, as far as I have sensed. How can we do better?
> > 
> 
> Write sane code and test it. fwiw, Google shared a doc with me
> indicating that Xe has unprecedented stability, and to be honest, when I
> first wrote Xe I barely knew what I was doing — but I did know how to
> test. I’ve since cleaned up most of my mistakes though.
> 
> So how can we do better... We can [1].
> 
> I started on [1] after you asked what the problems in DRM sched are, which
> got me thinking about what it would look like if we took the good parts
> (stop/start control plane, dependency tracking, ordering, finished
> fences, etc.), dropped the bad parts (no object-lifetime model, no
> refcounting, overly complex queue teardown, messy fence manipulation,
> hardware-scheduling baggage, lack of annotations, etc.), and wrote
> something that addresses all of these problems from the start
> specifically for firmware-scheduling models.
> 
> It turns out pretty good.
> 
> Main patch [2].
> 
> Xe is fully converted, tested, and working. AMDXDNA and Panthor are
> compiling. Nouveau and PVR seem like good candidates to convert as well.
> Rust bindings are also possible given the clear object model with
> refcounting and well-defined object lifetimes.
> 
> Thinking further, hardware schedulers should be able to be implemented
> on top of this by embedding the objects in [2] and layering a
> backend/API on top.
> 
> Let me know if you have any feedback (off-list) before I share this
> publicly. So far, Dave, Sima, Danilo, and the other Xe maintainers have
> been looped in.
> 
> Matt
> 
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/tree/local_dev/new_scheduler.post?ref_type=heads
> [2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/0538a3bc2a3b562dc0427a5922958189e0be8271
> 
> > > 
> > > > 
> > > 
> > > I can't say I agree with either of you here.
> > > 
> > > In about an hour, I seemingly have a bypass path working in DRM sched +
> > > Xe, and my diff is:
> > > 
> > > 108 insertions(+), 31 deletions(-)
> > 
> > LOC is a bad metric for complexity.
> > 
> > > 
> > > About 40 lines of the insertions are kernel-doc, so I'm not buying that
> > > this is a maintenance issue or a major feature - it is literally a
> > > single new function.
> > > 
> > > I understand a bypass path can create issues—for example, on certain
> > > queues in Xe I definitely can't use the bypass path, so Xe simply
> > > wouldn’t use it in those cases. This is the driver's choice to use or
> > > not. If a driver doesn't know how to use the scheduler, well, that’s on
> > > the driver. Providing a simple, documented function as a fast path
> > > really isn't some crazy idea.
> > 
> > We're effectively talking about a deviation from the default submission
> > mechanism, and all that seems to be desired for a luxury feature.
> > 
> > Then you end up with two submission mechanisms, whose correctness in
> > the future relies on someone remembering what the background was, why
> > it was added, and what the rules are..
> > 
> > The current scheduler rules are / were often not even documented, and
> > sometimes even Christian took a few weeks to remember again why
> > something had been added – and whether it can now be removed again or
> > not.
> > 
> > > 
> > > The alternative—asking for RT workqueues or changing the design to use
> > > kthread_worker—actually is.
> > > 
> > > > That's especially true if it's features aimed at performance buffs.
> > > > 
> > > 
> > > With the above mindset, I'm actually very confused why this series [1]
> > > would even be considered as this order of magnitude greater in
> > > complexity than my suggestion here.
> > > 
> > > Matt
> > > 
> > > [1] https://patchwork.freedesktop.org/series/159025/ 
> > 
> > The discussions about Tvrtko's CFS series were precisely the point
> > where Danilo brought up that after this can be merged, future rework of
> > the scheduler must focus on addressing some of the pending fundamental
> > issues.
> > 
> > The background is that Tvrtko has worked on that series already for
> > well over a year, it actually simplifies some things in the sense of
> > removing unused code (obviously it's a complex series, no argument
> > about that), and we agreed on XDC that this can be merged. So this is a
> > question of fairness to the contributor.
> > 
> > But at one point you have to finally draw a line. No one will ever
> > address major scheduler issues unless we demand it. Even very
> > experienced devs usually prefer to hack around the central design
> > issues in their drivers instead of fixing the shared infrastructure.
> > 
> > 
> > P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:40   ` Boris Brezillon
@ 2026-03-27  9:19     ` Tvrtko Ursulin
  0 siblings, 0 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2026-03-27  9:19 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi,
	open list


On 05/03/2026 09:40, Boris Brezillon wrote:
> Hi Tvrtko,
> 
> On Thu, 5 Mar 2026 08:35:33 +0000
> Tvrtko Ursulin <tursulin@ursulin.net> wrote:
> 
>> On 04/03/2026 22:51, Chia-I Wu wrote:
>>> Hi,
>>>
>>> Our system compositor (surfaceflinger on android) submits gpu jobs
>>> from a SCHED_FIFO thread to an RT gpu queue. However, because
>>> workqueue threads are SCHED_NORMAL, the scheduling latency from submit
>>> to run_job can sometimes cause frame misses. We are seeing this on
>>> panthor and xe, but the issue should be common to all drm_sched users.
>>>
>>> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
>>> meet future android requirements). It seems either workqueue needs to
>>> gain RT support, or drm_sched needs to support kthread_worker.
>>>
>>> I know drm_sched switched from kthread_worker to workqueue for better
>>
>>   From a plain kthread actually.
> 
> Oops, sorry, I hadn't seen your reply before posting mine. I basically
> said the same.
> 
>> Anyway, I suggested trying the
>> kthread_worker approach a few times in the past but never got round
>> implementing it. Not dual paths but simply replacing the workqueues with
>> kthread_workers.
>>
>> What is your thinking regarding how would the priority be configured? In
>> terms of the default and mechanism to select a higher priority
>> scheduling class.
> 
> If we follow the same model that exists today, where the
> workqueue can be passed at drm_sched_init() time, it becomes the
> driver's responsibility to create a worker of his own with the right
> prio set (using sched_setscheduler()). There's still the case where the
> worker is NULL, in which case the drm_sched code can probably create
> his own worker and leave it with the default prio, just like existed
> before the transition to workqueues.
> 
> It's a whole different story if you want to deal with worker pools and
> do some load balancing though...

I prototyped this in xe in the mean time and it is looking plausible 
that latency can be significantly reduced.

First to say that I did not go as far as worker pools, because at the 
moment I don't see a use case for it. At least not for xe.

When 1:1 entity-to-scheduler drivers appeared, kthreads were undesirable 
because they ended up creating an effectively unbounded number of kernel 
threads. There was no benefit to that, only downsides. Workqueues were 
good since they manage the thread pool under the hood, but that is just 
a handy coincidence; the design still fails to express the optimal 
number of CPU threads required to feed a GPU engine. For example with 
xe, on a 4096-CPU machine with 4096 user contexts feeding the same GPU 
engine, the optimal number of CPU threads to feed it is really more like 
one, rather than however many the wq management decided to run in 
parallel. They all end up hammering on the same lock to let the firmware 
know there is something to schedule.

For this reason, in my prototype I create a kthread_worker per hardware 
execution engine. (For xe even that could potentially be too much; maybe 
I should try one kthread_worker per GuC CT.)

This creates a requirement for 1:1 drivers not to use the "worker" 
auto-create mode of the DRM scheduler, so TBD whether that is okay.

Anyway, onto the numbers. Well, actually, first onto a benchmark I 
hacked up. I took xe_blt from IGT and modified it heavily to be more 
reasonable. What it essentially does is emit a constant stream of 
synchronous blit operations and measure the variance of the time each 
took to complete, as observed by the submitting process. In parallel it 
spawns a number of CPU hog threads to oversubscribe the system. And it 
can run the submitting thread at either normal priority, re-niced to -1, 
or at SCHED_FIFO. This is to simulate a typical compositor use case.

Now onto the numbers.

			normal	nice	FIFO
wq			100%	76%	1%
kthread_worker		100%	73%	1.2%
  └─relative to wq:	50.5%	48.5%	58.9%

Median "jitter" (variance in observed job submissions) is normalised and 
shows how changing the CPU priority changes the jitter observed by the 
submission thread. First two rows are the current wq implementation and 
the kthread_worker conversion. They show roughly similar scaling.

The third row is the kthread_worker results normalised against wq, and 
it shows roughly half the jitter. So a meaningful improvement.

Then I went a step further, to better address Chia-I's analysis of the 
problem, and tackled the priority inversion: loosely track the CPU 
priorities of the currently active entities submitting to each scheduler 
(and in turn each kthread_worker). This in turn 
further improved the latency numbers for the SCHED_FIFO case, albeit 
there is a strange anomaly with re-nice which I will come to later. It 
looks like this:

			normal	nice	FIFO
kworker_follow_prio	100%	277%	0.66%
  └─relative to wq:	60%	222%	37.8%

This effectively means that with a SCHED_FIFO compositor the submission 
round-trip latency could be around a third of what the current scheduler 
can do.

Now the re-nice anomaly. This is something I have yet to investigate. 
The issue may be related to the fact that, as I said, the 
kthread_workers only loosely track the submission thread priority: if 
they detect a negative nice they do not follow the exact nice level but 
go to the minimum nice, while my test program was using the mildest 
negative nice level (-19 vs -1). Perhaps that causes some strange effect 
in the CPU scheduler. I do not know yet, but it is very interesting that 
it appears repeatable.

It is also important to view my numbers with some margin of error. I 
have tried to remove the effect of intel_pstate, CPU turbo, and thermal 
management to a large extent, but I do not think I have fully succeeded 
yet. My gut feeling is there may be some +/- 5% or so in the results.

Also important to say is that the prototype depends on my other DRM 
scheduler series (the fair scheduler one), since I needed the nicer 
sched_rq abstraction with better tracking of active entities to 
implement priority inheritance, so I am unlikely to post it all as RFC 
since Philipp would possibly get a heart attack if I did. :)

To close, I think this is interesting to explore further. I could look 
at converting panthor next, and then we could run more experiments.

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-03-27  9:19 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu
2026-03-05  2:04 ` Matthew Brost
2026-03-05  8:27   ` Boris Brezillon
2026-03-05  8:38     ` Philipp Stanner
2026-03-05  9:10       ` Matthew Brost
2026-03-05  9:47         ` Philipp Stanner
2026-03-16  4:05           ` Matthew Brost
2026-03-16  4:14             ` Matthew Brost
2026-03-05 10:19         ` Boris Brezillon
2026-03-05 12:27         ` Danilo Krummrich
2026-03-05 10:09     ` Matthew Brost
2026-03-05 10:52       ` Boris Brezillon
2026-03-05 20:51         ` Matthew Brost
2026-03-06  5:13           ` Chia-I Wu
2026-03-06  7:21             ` Matthew Brost
2026-03-06  9:36             ` Michel Dänzer
2026-03-06  9:40               ` Michel Dänzer
2026-03-05  8:35 ` Tvrtko Ursulin
2026-03-05  9:40   ` Boris Brezillon
2026-03-27  9:19     ` Tvrtko Ursulin
2026-03-05  9:23 ` Boris Brezillon
2026-03-06  5:33   ` Chia-I Wu
2026-03-06  7:36     ` Matthew Brost
2026-03-05 23:09 ` Hillf Danton
2026-03-06  5:46   ` Chia-I Wu
2026-03-06 11:58     ` Hillf Danton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox