* drm_sched run_job and scheduling latency
@ 2026-03-04 22:51 Chia-I Wu
2026-03-05 2:04 ` Matthew Brost
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Chia-I Wu @ 2026-03-04 22:51 UTC (permalink / raw)
To: ML dri-devel, intel-xe
Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Thomas Hellström, Rodrigo Vivi,
open list
Hi,
Our system compositor (surfaceflinger on android) submits gpu jobs
from a SCHED_FIFO thread to an RT gpu queue. However, because
workqueue threads are SCHED_NORMAL, the scheduling latency from submit
to run_job can sometimes cause frame misses. We are seeing this on
panthor and xe, but the issue should be common to all drm_sched users.
Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
meet future android requirements). It seems either workqueue needs to
gain RT support, or drm_sched needs to support kthread_worker.
I know drm_sched switched from kthread_worker to workqueue for better
scaling when xe was introduced. But if drm_sched can support either
workqueue or kthread_worker during drm_sched_init, drivers can
selectively use kthread_worker only for RT gpu queues. And because
drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
scaling issues.
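To make the suggestion concrete, a hedged sketch of what a driver-side RT worker could look like. kthread_create_worker() and sched_setscheduler_nocheck() are real in-kernel APIs; the drm_sched_init() variant taking a kthread_worker is hypothetical, as is the priority value:

```c
/* Sketch only: hand drm_sched an RT kthread_worker instead of a
 * workqueue for an RT gpu queue. Not mainline API. */
struct kthread_worker *worker;
struct sched_param param = { .sched_priority = 10 }; /* illustrative */

worker = kthread_create_worker(0, "drm-sched-rt");
if (IS_ERR(worker))
	return PTR_ERR(worker);
sched_setscheduler_nocheck(worker->task, SCHED_FIFO, &param);

/* Hypothetical drm_sched_init() variant accepting a kthread_worker;
 * today the init args only take a workqueue. */
ret = drm_sched_init_kthread(&sched, &ops, worker, ...);
```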
Thoughts? Or perhaps this becomes less of an issue if all drm_sched
users have concrete plans for userspace submissions..
^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: drm_sched run_job and scheduling latency
  2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu
@ 2026-03-05  2:04 ` Matthew Brost
  2026-03-05  8:27   ` Boris Brezillon
  2026-03-05  8:35   ` Tvrtko Ursulin
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2026-03-05 2:04 UTC (permalink / raw)
To: Chia-I Wu
Cc: ML dri-devel, intel-xe, Boris Brezillon, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> Hi,
>
> Our system compositor (surfaceflinger on android) submits gpu jobs
> from a SCHED_FIFO thread to an RT gpu queue. However, because
> workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> to run_job can sometimes cause frame misses. We are seeing this on
> panthor and xe, but the issue should be common to all drm_sched users.
>

I'm going to assume that since this is a compositor, you do not pass
input dependencies to the page-flip job. Is that correct?

If so, I believe we could fairly easily build an opt-in DRM sched path
that directly calls run_job in the exec IOCTL context (I assume this is
SCHED_FIFO) if the job has no dependencies.

This would likely break some of Xe’s submission-backend assumptions
around mutual exclusion and ordering based on the workqueue, but that
seems workable. I don’t know how the Panthor code is structured or
whether they have similar issues.

I can try to hack together a quick PoC to see what this would look like
and give you something to test.

> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> meet future android requirements). It seems either workqueue needs to
> gain RT support, or drm_sched needs to support kthread_worker.

+Tejun to see if RT workqueue is in the plans.
>
> I know drm_sched switched from kthread_worker to workqueue for better
> scaling when xe was introduced. But if drm_sched can support either
> workqueue or kthread_worker during drm_sched_init, drivers can
> selectively use kthread_worker only for RT gpu queues. And because
> drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> scaling issues.
>

I don’t think having two paths will ever be acceptable, nor do I think
supporting a kthread would be all that easy. For example, in Xe we queue
additional work items outside of the scheduler on the queue for ordering
reasons — we’d have to move all of that code down into DRM sched or
completely redesign our submission model to avoid this. I’m not sure if
other drivers also do this, but it is allowed.

> Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> users have concrete plans for userspace submissions..

Maybe some day....

Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency
  2026-03-05  2:04 ` Matthew Brost
@ 2026-03-05  8:27   ` Boris Brezillon
  2026-03-05  8:38     ` Philipp Stanner
  2026-03-05 10:09     ` Matthew Brost
  0 siblings, 2 replies; 26+ messages in thread
From: Boris Brezillon @ 2026-03-05 8:27 UTC (permalink / raw)
To: Matthew Brost
Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

Hi Matthew,

On Wed, 4 Mar 2026 18:04:25 -0800
Matthew Brost <matthew.brost@intel.com> wrote:

> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > Hi,
> >
> > Our system compositor (surfaceflinger on android) submits gpu jobs
> > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > to run_job can sometimes cause frame misses. We are seeing this on
> > panthor and xe, but the issue should be common to all drm_sched users.
> >
>
> I'm going to assume that since this is a compositor, you do not pass
> input dependencies to the page-flip job. Is that correct?
>
> If so, I believe we could fairly easily build an opt-in DRM sched path
> that directly calls run_job in the exec IOCTL context (I assume this is
> SCHED_FIFO) if the job has no dependencies.

I guess by ::run_job() you mean something slightly more involved that
checks if:

- other jobs are pending
- enough credits (AKA ringbuf space) is available
- and probably other stuff I forgot about

>
> This would likely break some of Xe’s submission-backend assumptions
> around mutual exclusion and ordering based on the workqueue, but that
> seems workable. I don’t know how the Panthor code is structured or
> whether they have similar issues.

Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
you're describing.
There's just so many things we can forget that would lead to
races/ordering issues that will end up being hard to trigger and debug.

Besides, it doesn't solve the problem where your gfx pipeline is fully
stuffed and the kernel has to dequeue things asynchronously. I do
believe we want RT-prio support in that case too.

>
> I can try to hack together a quick PoC to see what this would look like
> and give you something to test.
>
> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
> > meet future android requirements). It seems either workqueue needs to
> > gain RT support, or drm_sched needs to support kthread_worker.
>
> +Tejun to see if RT workqueue is in the plans.

Dunno how feasible that is, but that would be my preferred option.

>
> >
> > I know drm_sched switched from kthread_worker to workqueue for better
> > scaling when xe was introduced. But if drm_sched can support either
> > workqueue or kthread_worker during drm_sched_init, drivers can
> > selectively use kthread_worker only for RT gpu queues. And because
> > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause
> > scaling issues.
> >
>
> I don’t think having two paths will ever be acceptable, nor do I think
> supporting a kthread would be all that easy. For example, in Xe we queue
> additional work items outside of the scheduler on the queue for ordering
> reasons — we’d have to move all of that code down into DRM sched or
> completely redesign our submission model to avoid this. I’m not sure if
> other drivers also do this, but it is allowed.

Panthor doesn't rely on the serialization provided by the single-thread
workqueue, Panfrost might rely on it though (I don't remember). I agree
that maintaining a thread and workqueue based scheduling is not ideal
though.

>
> > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > users have concrete plans for userspace submissions..
>
> Maybe some day....
I've yet to see a solution where no dma_fence-based signaling is
involved in graphics workloads though (IIRC, Arm's solution still needs
the kernel for that). Until that happens, we'll still need the kernel
to signal fences asynchronously when the job is done, which I suspect
will cause the same kind of latency issue...

Regards,

Boris

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency
  2026-03-05  8:27 ` Boris Brezillon
@ 2026-03-05  8:38   ` Philipp Stanner
  2026-03-05  9:10     ` Matthew Brost
  2026-03-05 10:09   ` Matthew Brost
  1 sibling, 1 reply; 26+ messages in thread
From: Philipp Stanner @ 2026-03-05 8:38 UTC (permalink / raw)
To: Boris Brezillon, Matthew Brost
Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Danilo Krummrich, Philipp Stanner,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Wed, 4 Mar 2026 18:04:25 -0800
> Matthew Brost <matthew.brost@intel.com> wrote:
>
> > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > Hi,
> > >
> > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > to run_job can sometimes cause frame misses. We are seeing this on
> > > panthor and xe, but the issue should be common to all drm_sched users.
> > >
> >
> > I'm going to assume that since this is a compositor, you do not pass
> > input dependencies to the page-flip job. Is that correct?
> >
> > If so, I believe we could fairly easily build an opt-in DRM sched path
> > that directly calls run_job in the exec IOCTL context (I assume this is
> > SCHED_FIFO) if the job has no dependencies.
>
> I guess by ::run_job() you mean something slightly more involved that
> checks if:
>
> - other jobs are pending
> - enough credits (AKA ringbuf space) is available
> - and probably other stuff I forgot about
>
> >
> > This would likely break some of Xe’s submission-backend assumptions
> > around mutual exclusion and ordering based on the workqueue, but that
> > seems workable. I don’t know how the Panthor code is structured or
> > whether they have similar issues.
>
> Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> you're describing. There's just so many things we can forget that would
> lead to races/ordering issues that will end up being hard to trigger and
> debug.
>

+1

I'm not thrilled either. More like the opposite of thrilled actually.

Even if we could get that to work. This is more of a maintainability
issue.

The scheduler is full of insane performance hacks for this or that
driver. Lockless accesses, a special lockless queue only used by that
one party in the kernel (a lockless queue which is nowadays, after N
reworks, being used with a lock. Ah well).

In the past discussions Danilo and I made it clear that more major
features in _new_ patch series aimed at getting merged into drm/sched
must be preceded by cleanup work to address some of the scheduler's
major problems.

That's especially true if it's features aimed at performance buffs.

P.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency
  2026-03-05  8:38 ` Philipp Stanner
@ 2026-03-05  9:10   ` Matthew Brost
  2026-03-05  9:47     ` Philipp Stanner
  ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Matthew Brost @ 2026-03-05 9:10 UTC (permalink / raw)
To: phasta
Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > Hi Matthew,
> >
> > On Wed, 4 Mar 2026 18:04:25 -0800
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > > Hi,
> > > >
> > > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > >
> > >
> > > I'm going to assume that since this is a compositor, you do not pass
> > > input dependencies to the page-flip job. Is that correct?
> > >
> > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > SCHED_FIFO) if the job has no dependencies.
> >
> > I guess by ::run_job() you mean something slightly more involved that
> > checks if:
> >
> > - other jobs are pending

Yes.

> > - enough credits (AKA ringbuf space) is available

Yes.

> > - and probably other stuff I forgot about

The scheduler is not stopped; serialize the bypass path with scheduler
stop/start.
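Putting those answers together, the opt-in bypass under discussion might look roughly like the sketch below. This is not compilable and not mainline drm_sched API; every identifier except run_job is illustrative:

```c
/* Sketch only - names are illustrative, not existing drm_sched API. */
bool drm_sched_try_direct_run(struct drm_gpu_scheduler *sched,
			      struct drm_sched_job *job)
{
	/* Serialize against scheduler stop/start. */
	if (sched_is_stopped(sched))
		return false;
	/* Ordering: nothing may already be queued ahead of this job. */
	if (jobs_pending(sched))
		return false;
	/* Credits: enough ring buffer space must be available. */
	if (!credits_available(sched, job))
		return false;
	/* Safe to run inline from the exec IOCTL (SCHED_FIFO) context. */
	job->s_fence->parent = sched->ops->run_job(job);
	return true;
}
```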
> >
> > > This would likely break some of Xe’s submission-backend assumptions
> > > around mutual exclusion and ordering based on the workqueue, but that
> > > seems workable. I don’t know how the Panthor code is structured or
> > > whether they have similar issues.
> >
> > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > you're describing. There's just so many things we can forget that would
> > lead to races/ordering issues that will end up being hard to trigger and
> > debug.
> >
>
> +1
>
> I'm not thrilled either. More like the opposite of thrilled actually.
>
> Even if we could get that to work. This is more of a maintainability
> issue.
>
> The scheduler is full of insane performance hacks for this or that
> driver. Lockless accesses, a special lockless queue only used by that
> one party in the kernel (a lockless queue which is nowadays, after N
> reworks, being used with a lock. Ah well).
>

This is not relevant to this discussion—see below. In general, I agree
that the lockless tricks in the scheduler are not great, nor is the fact
that the scheduler became a dumping ground for driver-specific features.
But again, that is not what we’re talking about here—see below.

> In the past discussions Danilo and I made it clear that more major
> features in _new_ patch series aimed at getting merged into drm/sched
> must be preceded by cleanup work to address some of the scheduler's
> major problems.

Ah, we've moved to dictatorship quickly. Noted.

>

I can't say I agree with either of you here.

In about an hour, I seemingly have a bypass path working in DRM sched +
Xe, and my diff is:

108 insertions(+), 31 deletions(-)

About 40 lines of the insertions are kernel-doc, so I'm not buying that
this is a maintenance issue or a major feature - it is literally a
single new function.
I understand a bypass path can create issues—for example, on certain
queues in Xe I definitely can't use the bypass path, so Xe simply
wouldn’t use it in those cases. This is the driver's choice to use or
not. If a driver doesn't know how to use the scheduler, well, that’s on
the driver. Providing a simple, documented function as a fast path
really isn't some crazy idea.

The alternative—asking for RT workqueues or changing the design to use
kthread_worker—actually is.

> That's especially true if it's features aimed at performance buffs.
>

With the above mindset, I'm actually very confused why this series [1]
would even be considered as this order of magnitude greater in
complexity than my suggestion here.

Matt

[1] https://patchwork.freedesktop.org/series/159025/

>
>
> P.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:10 ` Matthew Brost
@ 2026-03-05  9:47   ` Philipp Stanner
  2026-03-16  4:05     ` Matthew Brost
  2026-03-05 10:19   ` Boris Brezillon
  2026-03-05 12:27   ` Danilo Krummrich
  2 siblings, 1 reply; 26+ messages in thread
From: Philipp Stanner @ 2026-03-05 9:47 UTC (permalink / raw)
To: Matthew Brost, phasta
Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > >
> […]
>
> > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > you're describing. There's just so many things we can forget that would
> > > lead to races/ordering issues that will end up being hard to trigger and
> > > debug.
> > >
> >
> > +1
> >
> > I'm not thrilled either. More like the opposite of thrilled actually.
> >
> > Even if we could get that to work. This is more of a maintainability
> > issue.
> >
> > The scheduler is full of insane performance hacks for this or that
> > driver. Lockless accesses, a special lockless queue only used by that
> > one party in the kernel (a lockless queue which is nowadays, after N
> > reworks, being used with a lock. Ah well).
> >
>
> This is not relevant to this discussion—see below. In general, I agree
> that the lockless tricks in the scheduler are not great, nor is the fact
> that the scheduler became a dumping ground for driver-specific features.
> But again, that is not what we’re talking about here—see below.
> > In the past discussions Danilo and I made it clear that more major
> > features in _new_ patch series aimed at getting merged into drm/sched
> > must be preceded by cleanup work to address some of the scheduler's
> > major problems.
>
> Ah, we've moved to dictatorship quickly. Noted.

I prefer the term "benevolent presidency" /s

Or even better: s/dictatorship/accountability enforcement.

How does it come that everyone is here and ready so quickly when it
comes to new use cases and features, yet I never saw anyone except for
Tvrtko and Maíra investing even 15 minutes to write a simple patch to
address some of the *various* significant issues in that code base?

You were on CC on all discussions we've had here for the last years
afair, but I rarely saw you participate. And you know what it's like:
who doesn't speak up silently agrees in open source.

But tell me one thing, if you can be so kind:

What is your theory why drm/sched came to be in such horrible shape?
What circumstances, what human behavioral patterns have caused this?

The DRM subsystem has a bad reputation regarding stability among Linux
users, as far as I have sensed. How can we do better?

>
> I can't say I agree with either of you here.
>
> In about an hour, I seemingly have a bypass path working in DRM sched +
> Xe, and my diff is:
>
> 108 insertions(+), 31 deletions(-)

LOC is a bad metric for complexity.

>
> About 40 lines of the insertions are kernel-doc, so I'm not buying that
> this is a maintenance issue or a major feature - it is literally a
> single new function.
>
> I understand a bypass path can create issues—for example, on certain
> queues in Xe I definitely can't use the bypass path, so Xe simply
> wouldn’t use it in those cases. This is the driver's choice to use or
> not. If a driver doesn't know how to use the scheduler, well, that’s on
> the driver. Providing a simple, documented function as a fast path
> really isn't some crazy idea.
We're effectively talking about a deviation from the default submission
mechanism, and all that seems to be desired for a luxury feature.

Then you end up with two submission mechanisms, whose correctness in
the future relies on someone remembering what the background was, why
it was added, and what the rules are..

The current scheduler rules are / were often not even documented, and
sometimes even Christian took a few weeks to remember again why
something had been added – and whether it can now be removed again or
not.

>
> The alternative—asking for RT workqueues or changing the design to use
> kthread_worker—actually is.
>
> > That's especially true if it's features aimed at performance buffs.
> >
>
> With the above mindset, I'm actually very confused why this series [1]
> would even be considered as this order of magnitude greater in
> complexity than my suggestion here.
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/159025/

The discussions about Tvrtko's CFS series were precisely the point
where Danilo brought up that after this can be merged, future rework of
the scheduler must focus on addressing some of the pending fundamental
issues.

The background is that Tvrtko has worked on that series already for
well over a year, it actually simplifies some things in the sense of
removing unused code (obviously it's a complex series, no argument
about that), and we agreed on XDC that this can be merged. So this is a
question of fairness to the contributor.
* Re: drm_sched run_job and scheduling latency
  2026-03-05  9:47 ` Philipp Stanner
@ 2026-03-16  4:05   ` Matthew Brost
  2026-03-16  4:14     ` Matthew Brost
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Brost @ 2026-03-16 4:05 UTC (permalink / raw)
To: phasta
Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

On Thu, Mar 05, 2026 at 10:47:32AM +0100, Philipp Stanner wrote:

Off the list... I don’t think airing our personal attacks publicly is a
good look. I’m going to be blunt here in an effort to help you.

> On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> > On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > > >
> > […]
> >
> > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > > you're describing. There's just so many things we can forget that would
> > > > lead to races/ordering issues that will end up being hard to trigger and
> > > > debug.
> > > >
> > >
> > > +1
> > >
> > > I'm not thrilled either. More like the opposite of thrilled actually.
> > >
> > > Even if we could get that to work. This is more of a maintainability
> > > issue.
> > >
> > > The scheduler is full of insane performance hacks for this or that
> > > driver. Lockless accesses, a special lockless queue only used by that
> > > one party in the kernel (a lockless queue which is nowadays, after N
> > > reworks, being used with a lock. Ah well).
> > >
> >
> > This is not relevant to this discussion—see below.
> > > In the past discussions Danilo and I made it clear that more major
> > > features in _new_ patch series aimed at getting merged into drm/sched
> > > must be preceded by cleanup work to address some of the scheduler's
> > > major problems.
> >
> > Ah, we've moved to dictatorship quickly. Noted.
>
> I prefer the term "benevolent presidency" /s
>
> Or even better: s/dictatorship/accountability enforcement.
>

It’s very hard to take this seriously when I reply to threads saying
something breaks dma-fence rules and the response is, “what are
dma-fence rules?” Or I read through the jobqueue thread and see you
asking why a dma-fence would come from anywhere other than your own
driver — that’s the entire point of dma-fence; it’s a cross-driver
contract. I could go on, but I’d encourage you to take a hard look at
your understanding of DRM, and whether your responses — to me and to
others — are backed by the necessary technical knowledge.

Even better — what first annoyed me was your XDC presentation. You gave
an example of my driver modifying the pending list without a lock while
scheduling was stopped, and claimed you fixed a bug. That was not a bug
- Xe would explode if it was as we test our code. The pending list can
be modified without a lock if scheduling is stopped. I almost grabbed
the mic to correct you. Yes, it’s a layering violation, but presenting
it as a bug shows a clear lack of understanding.

> How does it come that everyone is here and ready so quickly when it

I’ve suggested ideas to fix DRM sched (refcounting, clear teardown
flows), but they were immediately met with resistance — typically from
Christian with you agreeing. My willingness to fight with Christian is
low; I really don’t need another person to argue with.

> comes to new use cases and features, yet I never saw anyone except for
> Tvrtko and Maíra investing even 15 minutes to write a simple patch to
> address some of the *various* significant issues in that code base?
>
> You were on CC on all discussions we've had here for the last years
> afair, but I rarely saw you participate. And you know what it's like:

I’ll admit I’m busy with many other things, so my bandwidth is limited.
But again, if I chime in and explain how I solved something in Xe (e.g.,
refcounting) and it’s met with resistance, I’ll likely move on — I’ve
already solved it, and I’ll just let you fail (see cancel_job).

> who doesn't speak up silently agrees in open source.
>
> But tell me one thing, if you can be so kind:
>

I'm glad you asked this, and it inspired me to fix this, more below [1].

> What is your theory why drm/sched came to be in such horrible shape?

drm/sched was ported from AMDGPU into common code. It carried many
AMDGPU-specific hacks, had no object-lifetime model thought out as a
common component, and included teardown nightmares that “worked,” but
other drivers immediately had to work around. With Christian involved —
who is notoriously hostile — everyone did their best to paper over
issues driver-side rather than get into fights and fix things properly.
Asahi Linux publicly aired grievances about this situation years ago.

> What circumstances, what human behavioral patterns have caused this?
>

See above.

> The DRM subsystem has a bad reputation regarding stability among Linux
> users, as far as I have sensed. How can we do better?
>

Write sane code and test it. fwiw, Google shared a doc with me
indicating that Xe has unprecedented stability, and to be honest, when I
first wrote Xe I barely knew what I was doing — but I did know how to
test. I’ve since cleaned up most of my mistakes though.

So how can we do better... We can [1].
I started on [1] after you asked what the problems in DRM sched are,
which got me thinking about what it would look like if we took the good
parts (stop/start control plane, dependency tracking, ordering, finished
fences, etc.), dropped the bad parts (no object-lifetime model, no
refcounting, overly complex queue teardown, messy fence manipulation,
hardware-scheduling baggage, lack of annotations, etc.), and wrote
something that addresses all of these problems from the start
specifically for firmware-scheduling models.

It turns out pretty good. Main patch [2]. Xe is fully converted, tested,
and working. AMDXDNA and Panthor are compiling. Nouveau and PVR seem
like good candidates to convert as well. Rust bindings are also possible
given the clear object model with refcounting and well-defined object
lifetimes.

Thinking further, hardware schedulers should be able to be implemented
on top of this by embedding the objects in [2] and layering a
backend/API on top.

Let me know if you have any feedback (off-list) before I share this
publicly. So far, Dave, Sima, Danilo, and the other Xe maintainers have
been looped in.

Matt

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/tree/local_dev/new_scheduler.post?ref_type=heads
[2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/0538a3bc2a3b562dc0427a5922958189e0be8271

>
> >
> > I can't say I agree with either of you here.
> >
> > In about an hour, I seemingly have a bypass path working in DRM sched +
> > Xe, and my diff is:
> >
> > 108 insertions(+), 31 deletions(-)
>
> LOC is a bad metric for complexity.
>
> > About 40 lines of the insertions are kernel-doc, so I'm not buying that
> > this is a maintenance issue or a major feature - it is literally a
> > single new function.
> > I understand a bypass path can create issues—for example, on certain
> > queues in Xe I definitely can't use the bypass path, so Xe simply
> > wouldn’t use it in those cases. This is the driver's choice to use or
> > not. If a driver doesn't know how to use the scheduler, well, that’s on
> > the driver. Providing a simple, documented function as a fast path
> > really isn't some crazy idea.
>
> We're effectively talking about a deviation from the default submission
> mechanism, and all that seems to be desired for a luxury feature.
>
> Then you end up with two submission mechanisms, whose correctness in
> the future relies on someone remembering what the background was, why
> it was added, and what the rules are..
>
> The current scheduler rules are / were often not even documented, and
> sometimes even Christian took a few weeks to remember again why
> something had been added – and whether it can now be removed again or
> not.
>
> >
> > The alternative—asking for RT workqueues or changing the design to use
> > kthread_worker—actually is.
> >
> > > That's especially true if it's features aimed at performance buffs.
> > >
> >
> > With the above mindset, I'm actually very confused why this series [1]
> > would even be considered as this order of magnitude greater in
> > complexity than my suggestion here.
> >
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/series/159025/
>
> The discussions about Tvrtko's CFS series were precisely the point
> where Danilo brought up that after this can be merged, future rework of
> the scheduler must focus on addressing some of the pending fundamental
> issues.
>
> The background is that Tvrtko has worked on that series already for
> well over a year, it actually simplifies some things in the sense of
> removing unused code (obviously it's a complex series, no argument
> about that), and we agreed on XDC that this can be merged. So this is a
> question of fairness to the contributor.
>
> But at one point you have to finally draw a line. No one will ever
> address major scheduler issues unless we demand it. Even very
> experienced devs usually prefer to hack around the central design
> issues in their drivers instead of fixing the shared infrastructure.
>
>
> P.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency
  2026-03-16  4:05 ` Matthew Brost
@ 2026-03-16  4:14   ` Matthew Brost
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Brost @ 2026-03-16 4:14 UTC (permalink / raw)
To: phasta
Cc: Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price,
	Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Danilo Krummrich,
	Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj

On Sun, Mar 15, 2026 at 09:05:20PM -0700, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 10:47:32AM +0100, Philipp Stanner wrote:
>

Obviously this was intended as a private communication — I hit the
wrong button. I apologize to anyone I offended here.

Matt

> Off the list... I don’t think airing our personal attacks publicly is a
> good look. I’m going to be blunt here in an effort to help you.
>
> > On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> > > On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > > > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > > > >
> > > […]
> > >
> > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > > > you're describing. There's just so many things we can forget that would
> > > > > lead to races/ordering issues that will end up being hard to trigger and
> > > > > debug.
> > > > >
> > > >
> > > > +1
> > > >
> > > > I'm not thrilled either. More like the opposite of thrilled actually.
> > > >
> > > > Even if we could get that to work. This is more of a maintainability
> > > > issue.
> > > >
> > > > The scheduler is full of insane performance hacks for this or that
> > > > driver. Lockless accesses, a special lockless queue only used by that
> > > > one party in the kernel (a lockless queue which is nowadays, after N
> > > > reworks, being used with a lock. Ah well).
> > > >
> > >
> > > This is not relevant to this discussion—see below.
In general, I agree > > > that the lockless tricks in the scheduler are not great, nor is the fact > > > that the scheduler became a dumping ground for driver-specific features. > > > But again, that is not what we’re talking about here—see below. > > > > > > > In the past discussions Danilo and I made it clear that more major > > > > features in _new_ patch series aimed at getting merged into drm/sched > > > > must be preceded by cleanup work to address some of the scheduler's > > > > major problems. > > > > > > Ah, we've moved to dictatorship quickly. Noted. > > > > I prefer the term "benevolent presidency" /s > > > > Or even better: s/dictatorship/accountability enforcement. > > > > It’s very hard to take this seriously when I reply to threads saying > something breaks dma-fence rules and the response is, “what are > dma-fence rules?” Or I read through the jobqueue thread and see you > asking why a dma-fence would come from anywhere other than your own > driver — that’s the entire point of dma-fence; it’s a cross-driver > contract. I could go on, but I’d encourage you to take a hard look at > your understanding of DRM, and whether your responses — to me and to > others — are backed by the necessary technical knowledge. > > Even better — what first annoyed me was your XDC presentation. You gave > an example of my driver modifying the pending list without a lock while > scheduling was stopped, and claimed you fixed a bug. That was not a bug - Xe would explode if it were, as we test our code. The pending list can > be modified without a lock if scheduling is stopped. I almost grabbed > the mic to correct you. Yes, it’s a layering violation, but presenting > it as a bug shows a clear lack of understanding. > > How is it that everyone is here and ready so quickly when it > I’ve suggested ideas to fix DRM sched (refcounting, clear teardown > flows), but they were immediately met with resistance — typically from > Christian with you agreeing.
My willingness to fight with Christian is > low; I really don’t need another person to argue with. > > > comes to new use cases and features, yet I never saw anyone except for > > Tvrtko and Maíra investing even 15 minutes to write a simple patch to > > address some of the *various* significant issues in that code base? > > > > You were on CC on all discussions we've had here for the last years > > afair, but I rarely saw you participate. And you know what it's like: > > I’ll admit I’m busy with many other things, so my bandwidth is limited. > But again, if I chime in and explain how I solved something in Xe (e.g., > refcounting) and it’s met with resistance, I’ll likely move on — I’ve > already solved it, and I’ll just let you fail (see cancel_job). > > > who doesn't speak up silently agrees in open source. > > > > But tell me one thing, if you can be so kind: > > > > I'm glad you asked this, and it inspired me to fix this, more below [1]. > > > What is your theory why drm/sched came to be in such horrible shape? > > drm/sched was ported from AMDGPU into common code. It carried many > AMDGPU-specific hacks, had no object-lifetime model thought out as a > common component, and included teardown nightmares that “worked,” but > other drivers immediately had to work around. With Christian involved — > who is notoriously hostile — everyone did their best to paper over > issues driver-side rather than get into fights and fix things properly. > Asahi Linux publicly aired grievances about this situation years ago. > > > What circumstances, what human behavioral patterns have caused this? > > > > See above. > > > The DRM subsystem has a bad reputation regarding stability among Linux > > users, as far as I have sensed. How can we do better? > > > > Write sane code and test it. fwiw, Google shared a doc with me > indicating that Xe has unprecedented stability, and to be honest, when I > first wrote Xe I barely knew what I was doing — but I did know how to > test. 
I’ve since cleaned up most of my mistakes though. > > So how can we do better... We can [1]. > > I started on [1] after you asked what the problems in DRM sched are - which > got me thinking about what it would look like if we took the good parts > (stop/start control plane, dependency tracking, ordering, finished > fences, etc.), dropped the bad parts (no object-lifetime model, no > refcounting, overly complex queue teardown, messy fence manipulation, > hardware-scheduling baggage, lack of annotations, etc.), and wrote > something that addresses all of these problems from the start > specifically for firmware-scheduling models. > > It turns out pretty good. > > Main patch [2]. > > Xe is fully converted, tested, and working. AMDXDNA and Panthor are > compiling. Nouveau and PVR seem like good candidates to convert as well. > Rust bindings are also possible given the clear object model with > refcounting and well-defined object lifetimes. > > Thinking further, hardware schedulers should be able to be implemented > on top of this by embedding the objects in [2] and layering a > backend/API on top. > > Let me know if you have any feedback (off-list) before I share this > publicly. So far, Dave, Sima, Danilo, and the other Xe maintainers have > been looped in. > > Matt > > [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/tree/local_dev/new_scheduler.post?ref_type=heads > [2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/0538a3bc2a3b562dc0427a5922958189e0be8271 > > > > > > > > > > > > > > I can't say I agree with either of you here. > > > > > > In about an hour, I seemingly have a bypass path working in DRM sched + > > > Xe, and my diff is: > > > > > > 108 insertions(+), 31 deletions(-) > > > > LOC is a bad metric for complexity.
> > > > > > > > About 40 lines of the insertions are kernel-doc, so I'm not buying that > > > this is a maintenance issue or a major feature - it is literally a > > > single new function. > > > > > > I understand a bypass path can create issues—for example, on certain > > > queues in Xe I definitely can't use the bypass path, so Xe simply > > > wouldn’t use it in those cases. This is the driver's choice to use or > > > not. If a driver doesn't know how to use the scheduler, well, that’s on > > > the driver. Providing a simple, documented function as a fast path > > > really isn't some crazy idea. > > > > We're effectively talking about a deviation from the default submission > > mechanism, and all that seems to be desired for a luxury feature. > > > > Then you end up with two submission mechanisms, whose correctness in > > the future relies on someone remembering what the background was, why > > it was added, and what the rules are.. > > > > The current scheduler rules are / were often not even documented, and > > sometimes even Christian took a few weeks to remember again why > > something had been added – and whether it can now be removed again or > > not. > > > > > > > > The alternative—asking for RT workqueues or changing the design to use > > > kthread_worker—actually is. > > > > > > > That's especially true if it's features aimed at performance buffs. > > > > > > > > > > With the above mindset, I'm actually very confused why this series [1] > > > would even be considered as this order of magnitude greater in > > > complexity than my suggestion here. > > > > > > Matt > > > > > > [1] https://patchwork.freedesktop.org/series/159025/ > > > > The discussions about Tvrtko's CFS series were precisely the point > > where Danilo brought up that after this can be merged, future rework of > > the scheduler must focus on addressing some of the pending fundamental > > issues. 
> > > > The background is that Tvrtko has worked on that series already for > > well over a year, it actually simplifies some things in the sense of > > removing unused code (obviously it's a complex series, no argument > > about that), and we agreed on XDC that this can be merged. So this is a > > question of fairness to the contributor. > > > > But at one point you have to finally draw a line. No one will ever > > address major scheduler issues unless we demand it. Even very > > experienced devs usually prefer to hack around the central design > > issues in their drivers instead of fixing the shared infrastructure. > > > > > > P.
* Re: drm_sched run_job and scheduling latency From: Boris Brezillon @ 2026-03-05 10:19 UTC To: Matthew Brost Cc: phasta, Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu, 5 Mar 2026 01:10:06 -0800 Matthew Brost <matthew.brost@intel.com> wrote: > I can't say I agree with either of you here. > > In about an hour, I seemingly have a bypass path working in DRM sched + > Xe, and my diff is: > > 108 insertions(+), 31 deletions(-) First of all, I'm not blindly rejecting the approach, see how I said "I'm not thrilled" not "No way!". So yeah, if you have something to propose, feel free to post the diff here or as an RFC on the ML. Secondly, I keep thinking the fast-path approach doesn't quite fix the problem at hand, where we actually want queuing/dequeuing operations to match the priority of the HW/FW context, because if your HW context is high prio but you're struggling to fill the HW queue, it's not truly high prio. Note that it's a problem that was made more evident with FW scheduling (and the 1:1 entity:sched association), before that we just had one thread that was dequeuing from entities and pushing to HW queues based on entity priorities, so priority was somehow better enforced. So yeah, even ignoring the discrepancy that might emerge from this new fast-path-run_job (and the potential maintenance burden we mentioned), saying "you'll get proper queueing/dequeuing priority enforcement only if you have no deps, and the pipeline is not full" is kinda limited IMHO.
I'd rather we think about a solution that solves the entire problem, which both the kthread_work[er] and workqueue(RT) proposals do.
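[For readers following along: the kthread_worker option Boris refers to would give the submit path a real RT thread. A minimal sketch of what the driver-side setup could look like, assuming drm_sched grew a way to accept a kthread_worker at init time. kthread_create_worker() and sched_set_fifo() are existing kernel APIs; everything about how drm_sched would consume the worker is hypothetical and only illustrates the proposal, not current code.]

```c
/* Sketch only (not an existing drm_sched interface): create a dedicated
 * worker and promote it to SCHED_FIFO, mirroring the RT policy of the
 * submitting userspace thread. The idea is that drm_sched would then
 * queue its submit work on this worker instead of a workqueue.
 * CAP_SYS_NICE is already required to create an RT GPU queue, so this
 * does not hand RT scheduling to arbitrary userspace.
 */
struct kthread_worker *worker;

worker = kthread_create_worker(0, "drm-sched-rt");
if (IS_ERR(worker))
	return PTR_ERR(worker);

sched_set_fifo(worker->task);	/* SCHED_FIFO at the default RT priority */
```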
* Re: drm_sched run_job and scheduling latency From: Danilo Krummrich @ 2026-03-05 12:27 UTC To: Matthew Brost Cc: phasta, Boris Brezillon, Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu Mar 5, 2026 at 10:10 AM CET, Matthew Brost wrote: > On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote: >> In the past discussions Danilo and I made it clear that more major >> features in _new_ patch series aimed at getting merged into drm/sched >> must be preceded by cleanup work to address some of the scheduler's >> major problems. > > Ah, we've moved to dictatorship quickly. Noted. While Philipp and I generally share concerns about the scheduler, I prefer to speak for myself here, as my position is a bit more nuanced than that. I shared my view on this in detail in [1], so I will keep it very brief here. From a maintenance perspective the concern is less about whether a particular change is correct or small in isolation, but about whether it moves the overall design in a direction that makes the existing issues harder to resolve subsequently. I.e. I think we should try to avoid accumulating new features or special paths on top of known design issues. (Please also note that those are general considerations; they are not meant to imply anything about this specific topic. Not least because I did not get to read through the whole thread yet.) Thanks, Danilo [1] https://lore.kernel.org/all/DFPK5HIP7G9C.2LJ6AOH2UPLEO@kernel.org/
* Re: drm_sched run_job and scheduling latency From: Matthew Brost @ 2026-03-05 10:09 UTC To: Boris Brezillon Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: I addressed most of your comments in a chained reply to Philipp, but I guess he dropped some of your email and thus missed those. Responding below. > Hi Matthew, > > On Wed, 4 Mar 2026 18:04:25 -0800 > Matthew Brost <matthew.brost@intel.com> wrote: > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > > > Hi, > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > > to run_job can sometimes cause frame misses. We are seeing this on > > > panthor and xe, but the issue should be common to all drm_sched users. > > > > If so, I believe we could fairly easily build an opt-in DRM sched path > > that directly calls run_job in the exec IOCTL context (I assume this is > > SCHED_FIFO) if the job has no dependencies.
> > I guess by ::run_job() you mean something slightly more involved that > checks if: > > - other jobs are pending > - enough credits (AKA ringbuf space) is available > - and probably other stuff I forgot about > > > > > This would likely break some of Xe’s submission-backend assumptions > > around mutual exclusion and ordering based on the workqueue, but that > > seems workable. I don’t know how the Panthor code is structured or > > whether they have similar issues. > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea > you're describing. There's just so many things we can forget that would > lead to races/ordering issues that will end up being hard to trigger and > debug. Besides, it doesn't solve the problem where your gfx pipeline is > fully stuffed and the kernel has to dequeue things asynchronously. I do > believe we want RT-prio support in that case too. > My understanding of SurfaceFlinger is that it never waits on input dependencies from rendering applications, since those may not signal in time for a page flip. Because of that, you can’t have the job(s) that draw to the screen accept input dependencies. Maybe I have that wrong—but I've spoken to the Google team several times about issues with SurfaceFlinger, and that was my takeaway. So I don't think the kernel should ever have to dequeue things asynchronously, at least for SurfaceFlinger. If there is another RT use case that requires input dependencies plus the kernel dequeuing things asynchronously, I agree this wouldn’t help—but my suggestion also isn’t mutually exclusive with other RT rework either. > > > > I can try to hack together a quick PoC to see what this would look like > > and give you something to test. > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > > meet future android requirements). It seems either workqueue needs to > > > gain RT support, or drm_sched needs to support kthread_worker. 
> > > > +Tejun to see if RT workqueue is in the plans. > > Dunno how feasible that is, but that would be my preferred option. > > > > > > > > > I know drm_sched switched from kthread_worker to workqueue for better > > > scaling when xe was introduced. But if drm_sched can support either > > > workqueue or kthread_worker during drm_sched_init, drivers can > > > selectively use kthread_worker only for RT gpu queues. And because > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > > > scaling issues. > > > > > > > I don’t think having two paths will ever be acceptable, nor do I think > > supporting a kthread would be all that easy. For example, in Xe we queue > > additional work items outside of the scheduler on the queue for ordering > > reasons — we’d have to move all of that code down into DRM sched or > > completely redesign our submission model to avoid this. I’m not sure if > > other drivers also do this, but it is allowed. > > Panthor doesn't rely on the serialization provided by the single-thread > workqueue, Panfrost might rely on it though (I don't remember). I agree > that maintaining a thread and workqueue based scheduling is not ideal > though. > > > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched > > > users have concrete plans for userspace submissions.. > > > > Maybe some day.... > > I've yet to see a solution where no dma_fence-based signalization is > involved in graphics workloads though (IIRC, Arm's solution still > needs the kernel for that). Until that happens, we'll still need the > kernel to signal fences asynchronously when the job is done, which I > suspect will cause the same kind of latency issue... > I don't think that is the problem here. Doesn’t the job that draws the frame actually draw it, or does the display wait on the draw job’s fence to signal and then do something else? (Sorry—I know next to nothing about display.) 
Either way, fences should be signaled in IRQ handlers, which presumably don’t have the same latency issues as workqueues, but I could be mistaken. Matt > Regards, > > Boris
* Re: drm_sched run_job and scheduling latency From: Boris Brezillon @ 2026-03-05 10:52 UTC To: Matthew Brost Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu, 5 Mar 2026 02:09:16 -0800 Matthew Brost <matthew.brost@intel.com> wrote: > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: > > I addressed most of your comments in a chained reply to Philipp, but I > guess he dropped some of your email and thus missed those. Responding > below. > > > Hi Matthew, > > > > On Wed, 4 Mar 2026 18:04:25 -0800 > > Matthew Brost <matthew.brost@intel.com> wrote: > > > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > > > > Hi, > > > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > > > to run_job can sometimes cause frame misses. We are seeing this on > > > > panthor and xe, but the issue should be common to all drm_sched users. > > > > > > > > > > I'm going to assume that since this is a compositor, you do not pass > > > input dependencies to the page-flip job. Is that correct? > > > > > > If so, I believe we could fairly easily build an opt-in DRM sched path > > > that directly calls run_job in the exec IOCTL context (I assume this is > > > SCHED_FIFO) if the job has no dependencies.
> > > > I guess by ::run_job() you mean something slightly more involved that > > checks if: > > > > - other jobs are pending > > - enough credits (AKA ringbuf space) is available > > - and probably other stuff I forgot about > > > > > > > > This would likely break some of Xe’s submission-backend assumptions > > > around mutual exclusion and ordering based on the workqueue, but that > > > seems workable. I don’t know how the Panthor code is structured or > > > whether they have similar issues. > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea > > you're describing. There's just so many things we can forget that would > > lead to races/ordering issues that will end up being hard to trigger and > > debug. Besides, it doesn't solve the problem where your gfx pipeline is > > fully stuffed and the kernel has to dequeue things asynchronously. I do > > believe we want RT-prio support in that case too. > > > > My understanding of SurfaceFlinger is that it never waits on input > dependencies from rendering applications, since those may not signal in > time for a page flip. Because of that, you can’t have the job(s) that > draw to the screen accept input dependencies. Maybe I have that > wrong—but I've spoken to the Google team several times about issues with > SurfaceFlinger, and that was my takeaway. > > So I don't think the kernel should ever have to dequeue things > asynchronously, at least for SurfaceFlinger. There's still the contention coming from the ring buffer size, which can prevent jobs from being queued directly to the HW, though, admittedly, if the HW is not capable of compositing the frame faster than the refresh rate and guaranteeing an almost always empty ringbuffer, fixing the scheduling prio is probably pointless.
> If there is another RT use > case that requires input dependencies plus the kernel dequeuing things > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t > mutually exclusive with other RT rework either. Yeah, dunno. It just feels like another hack on top of the already quite convoluted design that drm_sched has become. > > > > > > > I can try to hack together a quick PoC to see what this would look like > > > and give you something to test. > > > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > > > meet future android requirements). It seems either workqueue needs to > > > > gain RT support, or drm_sched needs to support kthread_worker. > > > > > > +Tejun to see if RT workqueue is in the plans. > > > > Dunno how feasible that is, but that would be my preferred option. > > > > > > > > > > > > > I know drm_sched switched from kthread_worker to workqueue for better > > > > scaling when xe was introduced. But if drm_sched can support either > > > > workqueue or kthread_worker during drm_sched_init, drivers can > > > > selectively use kthread_worker only for RT gpu queues. And because > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > > > > scaling issues. > > > > > > > > > > I don’t think having two paths will ever be acceptable, nor do I think > > > supporting a kthread would be all that easy. For example, in Xe we queue > > > additional work items outside of the scheduler on the queue for ordering > > > reasons — we’d have to move all of that code down into DRM sched or > > > completely redesign our submission model to avoid this. I’m not sure if > > > other drivers also do this, but it is allowed. > > > > Panthor doesn't rely on the serialization provided by the single-thread > > workqueue, Panfrost might rely on it though (I don't remember). I agree > > that maintaining a thread and workqueue based scheduling is not ideal > > though. > > > > > > > > > Thoughts? 
Or perhaps this becomes less of an issue if all drm_sched > > > > users have concrete plans for userspace submissions.. > > > > > > Maybe some day.... > > > > I've yet to see a solution where no dma_fence-based signalization is > > involved in graphics workloads though (IIRC, Arm's solution still > > needs the kernel for that). Until that happens, we'll still need the > > kernel to signal fences asynchronously when the job is done, which I > > suspect will cause the same kind of latency issue... > > > > I don't think that is the problem here. Doesn’t the job that draws the > frame actually draw it, or does the display wait on the draw job’s fence > to signal and then do something else? I know close to nothing about SurfaceFlinger and very little about compositors in general, so I'll let Chia answer that one. What's sure is that, on regular page-flips (don't remember what async page-flips do), the display drivers wait on the fences attached to the buffer to signal before doing the flip. > (Sorry—I know next to nothing > about display.) Either way, fences should be signaled in IRQ handlers, In Panthor they are not, but that's probably something for us to address. > which presumably don’t have the same latency issues as workqueues, but I > could be mistaken. Might have to do with the mental model I had of this "reconcile Usermode queues with dma_fence signaling" model, where I was imagining a SW job queue (based on drm_sched too) that would wait on HW fences to signal and would as a result signal the dma_fence attached to the job. So the queueing/dequeuing of these jobs would still happen through drm_sched, with the same scheduling prio issue. This being said, those jobs would likely be dependency-less, so more likely to hit your fast-path-run-job.
* Re: drm_sched run_job and scheduling latency From: Matthew Brost @ 2026-03-05 20:51 UTC To: Boris Brezillon Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote: > On Thu, 5 Mar 2026 02:09:16 -0800 > Matthew Brost <matthew.brost@intel.com> wrote: > > > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: > > > > I addressed most of your comments in a chained reply to Philipp, but I > > guess he dropped some of your email and thus missed those. Responding > > below. > > > > > Hi Matthew, > > > > > > On Wed, 4 Mar 2026 18:04:25 -0800 > > > Matthew Brost <matthew.brost@intel.com> wrote: > > > > > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > > > > > Hi, > > > > > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > > > > to run_job can sometimes cause frame misses. We are seeing this on > > > > > panthor and xe, but the issue should be common to all drm_sched users. > > > > > > > > > > > > > I'm going to assume that since this is a compositor, you do not pass > > > > input dependencies to the page-flip job. Is that correct? > > > > > > > > If so, I believe we could fairly easily build an opt-in DRM sched path > > > > that directly calls run_job in the exec IOCTL context (I assume this is > > > > SCHED_FIFO) if the job has no dependencies.
> > > > > > I guess by ::run_job() you mean something slightly more involved that > > > checks if: > > > > > > - other jobs are pending > > > - enough credits (AKA ringbuf space) is available > > > - and probably other stuff I forgot about > > > > > > > > > > > This would likely break some of Xe’s submission-backend assumptions > > > > around mutual exclusion and ordering based on the workqueue, but that > > > > seems workable. I don’t know how the Panthor code is structured or > > > > whether they have similar issues. > > > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea > > > you're describing. There's just so many things we can forget that would > > > lead to races/ordering issues that will end up being hard to trigger and > > > debug. Besides, it doesn't solve the problem where your gfx pipeline is > > > fully stuffed and the kernel has to dequeue things asynchronously. I do > > > believe we want RT-prio support in that case too. > > > > > > > My understanding of SurfaceFlinger is that it never waits on input > > dependencies from rendering applications, since those may not signal in > > time for a page flip. Because of that, you can’t have the job(s) that > > draw to the screen accept input dependencies. Maybe I have that > > wrong—but I've spoken to the Google team several times about issues with > > SurfaceFlinger, and that was my takeaway. > > > > So I don't think the kernel should ever have to dequeue things > > asynchronously, at least for SurfaceFlinger. > > There's still the contention coming from the ring buffer size, which can > prevent jobs from being queued directly to the HW, though, admittedly, > if the HW is not capable of compositing the frame faster than the > refresh rate, and guarantee an almost always empty ringbuffer, fixing > the scheduling prio is probably pointless. 
> > > If there is another RT use > > case that requires input dependencies plus the kernel dequeuing things > > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t > > mutually exclusive with other RT rework either. > > Yeah, dunno. It just feels like another hack on top of the already quite > convoluted design that drm_sched has become. > I agree we wouldn't want this to become some wild hack. I could actually see this helping in other very timing-sensitive paths—for example, page-fault paths where a copy job needs to be issued as part of the fault resolution to a dedicated kernel queue. I’ve seen noise in fault profiling caused by delays in the scheduler workqueue, which needs to program the job to the device. In paths like this, every microsecond matters, as even minor improvements have real-world impacts on performance numbers. This will become even more noticeable as CPU<->GPU bus speeds increase. In this case, typically copy jobs have no input dependencies, thus the desire is to program the ring as quickly as possible. > > > > > > > > > > I can try to hack together a quick PoC to see what this would look like > > > > and give you something to test. > > > > > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > > > > meet future android requirements). It seems either workqueue needs to > > > > > gain RT support, or drm_sched needs to support kthread_worker. > > > > > > > > +Tejun to see if RT workqueue is in the plans. > > > > > > Dunno how feasible that is, but that would be my preferred option. > > > > > > > > > > > > > > > > > I know drm_sched switched from kthread_worker to workqueue for better > > > > > scaling when xe was introduced. But if drm_sched can support either > > > > > workqueue or kthread_worker during drm_sched_init, drivers can > > > > > selectively use kthread_worker only for RT gpu queues. 
And because > > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > > > > > scaling issues. > > > > > > > > > > > > > I don’t think having two paths will ever be acceptable, nor do I think > > > > supporting a kthread would be all that easy. For example, in Xe we queue > > > > additional work items outside of the scheduler on the queue for ordering > > > > reasons — we’d have to move all of that code down into DRM sched or > > > > completely redesign our submission model to avoid this. I’m not sure if > > > > other drivers also do this, but it is allowed. > > > > > > Panthor doesn't rely on the serialization provided by the single-thread > > > workqueue, Panfrost might rely on it though (I don't remember). I agree > > > that maintaining a thread and workqueue based scheduling is not ideal > > > though. > > > > > > > > > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched > > > > > users have concrete plans for userspace submissions.. > > > > > > > > Maybe some day.... > > > > > > I've yet to see a solution where no dma_fence-based signalization is > > > involved in graphics workloads though (IIRC, Arm's solution still > > > needs the kernel for that). Until that happens, we'll still need the > > > kernel to signal fences asynchronously when the job is done, which I > > > suspect will cause the same kind of latency issue... > > > > > > > I don't think that is the problem here. Doesn’t the job that draws the > > frame actually draw it, or does the display wait on the draw job’s fence > > to signal and then do something else? > > I know close to nothing about SurfaceFlinger and very little about > compositors in general, so I'll let Chia answer that one. What's sure I think Chia's input would be good, as if SurfaceFlinger jobs have input dependencies this entire suggestion doesn't make any sense.
> is that, on regular page-flips (don't remember what async page-flips > do), the display drivers wait on the fences attached to the buffer to > signal before doing the flip. I think SurfaceFlinger is different compared to Wayland/X11 use cases, as maintaining a steady framerate is the priority above everything else (think phone screens, which never freeze, whereas desktops do all the time). So I believe SurfaceFlinger decides when it will submit the job to draw a frame, without directly passing in application dependencies into the buffer/job being drawn. Again, my understanding here may be incorrect... > > > (Sorry—I know next to nothing > > about display.) Either way, fences should be signaled in IRQ handlers, > > In Panthor they are not, but that's probably something for us to > address. > > > which presumably don’t have the same latency issues as workqueues, but I > > could be mistaken. > > Might have to do with the mental model I had of this "reconcile > Usermode queues with dma_fence signaling" model, where I was imagining > a SW job queue (based on drm_sched too) that would wait on HW fences to > be signal and would as a result signal the dma_fence attached to the > job. So the queueing/dequeuing of these jobs would still happen through > drm_sched, with the same scheduling prio issue. This being said, those Yes, if jobs have unmet dependencies, the bypass path doesn’t help with the DRM scheduler workqueue context switches being slow, as that path still needs to be taken in those cases. Also, to bring up something insane we certainly wouldn’t want to do: calling run_job when dependencies are resolved in the fence callback, since we could be in an IRQ handler. Matt > jobs would likely be dependency less, so more likely to hit your > fast-path-run-job. ^ permalink raw reply [flat|nested] 26+ messages in thread
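[Editor's note: to make the bypass idea above concrete, here is a minimal, self-contained sketch of the decision logic being discussed — call the run_job equivalent inline in the submitter's (possibly SCHED_FIFO) context only when nothing forces asynchronous dequeuing. All names (sched_ring, sched_job, can_bypass, submit_job) are hypothetical; this is not actual drm_sched or Xe code.]

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch of the opt-in bypass path discussed above. None
 * of these names are real drm_sched API.
 */

struct sched_job {
	size_t num_deps;	/* unresolved input dependencies */
	unsigned int credits;	/* ring-buffer space this job needs */
};

struct sched_ring {
	unsigned int credits_free;	/* remaining ring-buffer space */
	unsigned int queued;		/* jobs already in the SW queue */
};

/*
 * The fast path is only safe when the job has no unmet dependencies,
 * nothing is queued ahead of it (ordering), and the ring has credits.
 */
static bool can_bypass(const struct sched_ring *ring,
		       const struct sched_job *job)
{
	return job->num_deps == 0 &&
	       ring->queued == 0 &&
	       job->credits <= ring->credits_free;
}

/*
 * Returns true if the job was programmed inline (the run_job equivalent
 * runs in the submitter's context, inheriting its RT priority); false
 * means it fell back to the ordinary workqueue path.
 */
static bool submit_job(struct sched_ring *ring, struct sched_job *job)
{
	if (can_bypass(ring, job)) {
		ring->credits_free -= job->credits;
		/* ->run_job() would be called here, inline */
		return true;
	}
	ring->queued++;		/* hand off to the workqueue path */
	return false;
}
```

The point Boris raises holds here too: once the ring is contended or anything is queued, every subsequent job takes the slow path, so the bypass only helps when the queue is normally empty.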
* Re: drm_sched run_job and scheduling latency 2026-03-05 20:51 ` Matthew Brost @ 2026-03-06 5:13 ` Chia-I Wu 2026-03-06 7:21 ` Matthew Brost 2026-03-06 9:36 ` Michel Dänzer 0 siblings, 2 replies; 26+ messages in thread From: Chia-I Wu @ 2026-03-06 5:13 UTC (permalink / raw) To: Matthew Brost Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote: > > On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote: > > On Thu, 5 Mar 2026 02:09:16 -0800 > > Matthew Brost <matthew.brost@intel.com> wrote: > > > > > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: > > > > > > I addressed most of your comments in a chained reply to Phillip, but I > > > guess he dropped some of your email and thus missed those. Responding > > > below. > > > > > > > Hi Matthew, > > > > > > > > On Wed, 4 Mar 2026 18:04:25 -0800 > > > > Matthew Brost <matthew.brost@intel.com> wrote: > > > > > > > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > > > > > > Hi, > > > > > > > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > > > > > to run_job can sometimes cause frame misses. We are seeing this on > > > > > > panthor and xe, but the issue should be common to all drm_sched users. > > > > > > > > > > > > > > > > I'm going to assume that since this is a compositor, you do not pass > > > > > input dependencies to the page-flip job. Is that correct? 
> > > > > > > > > > If so, I believe we could fairly easily build an opt-in DRM sched path > > > > > that directly calls run_job in the exec IOCTL context (I assume this is > > > > > SCHED_FIFO) if the job has no dependencies. > > > > > > > > I guess by ::run_job() you mean something slightly more involved that > > > > checks if: > > > > > > > > - other jobs are pending > > > > - enough credits (AKA ringbuf space) is available > > > > - and probably other stuff I forgot about > > > > > > > > > > > > > > This would likely break some of Xe’s submission-backend assumptions > > > > > around mutual exclusion and ordering based on the workqueue, but that > > > > > seems workable. I don’t know how the Panthor code is structured or > > > > > whether they have similar issues. > > > > > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea > > > > you're describing. There's just so many things we can forget that would > > > > lead to races/ordering issues that will end up being hard to trigger and > > > > debug. Besides, it doesn't solve the problem where your gfx pipeline is > > > > fully stuffed and the kernel has to dequeue things asynchronously. I do > > > > believe we want RT-prio support in that case too. > > > > > > > > > > My understanding of SurfaceFlinger is that it never waits on input > > > dependencies from rendering applications, since those may not signal in > > > time for a page flip. Because of that, you can’t have the job(s) that > > > draw to the screen accept input dependencies. Maybe I have that > > > wrong—but I've spoken to the Google team several times about issues with > > > SurfaceFlinger, and that was my takeaway. > > > > > > So I don't think the kernel should ever have to dequeue things > > > asynchronously, at least for SurfaceFlinger. 
> > > > There's still the contention coming from the ring buffer size, which can > > prevent jobs from being queued directly to the HW, though, admittedly, > > if the HW is not capable of compositing the frame faster than the > > refresh rate, and guarantee an almost always empty ringbuffer, fixing > > the scheduling prio is probably pointless. > > > > > If there is another RT use > > > case that requires input dependencies plus the kernel dequeuing things > > > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t > > > mutually exclusive with other RT rework either. > > > > Yeah, dunno. It just feels like another hack on top of the already quite > > convoluted design that drm_sched has become. > > > > I agree we wouldn't want this to become some wild hack. > > I could actually see this helping in other very timing-sensitive > paths—for example, page-fault paths where a copy job needs to be issued > as part of the fault resolution to a dedicated kernel queue. I’ve seen > noise in fault profiling caused by delays in the scheduler workqueue, > which needs to program the job to the device. In paths like this, every > microsecond matters, as even minor improvements have real-world impacts > on performance numbers. This will become even more noticeable as > CPU<->GPU bus speeds increase. In this case, typically copy jobs have > no input dependencies, thus the desire is to program the ring as quickly > as possible. > > > > > > > > > > > > > > I can try to hack together a quick PoC to see what this would look like > > > > > and give you something to test. > > > > > > > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > > > > > meet future android requirements). It seems either workqueue needs to > > > > > > gain RT support, or drm_sched needs to support kthread_worker. > > > > > > > > > > +Tejun to see if RT workqueue is in the plans. > > > > > > > > Dunno how feasible that is, but that would be my preferred option. 
> > > > > > > > > > > > > > > > > > > > > I know drm_sched switched from kthread_worker to workqueue for better > > > > > > scaling when xe was introduced. But if drm_sched can support either > > > > > > workqueue or kthread_worker during drm_sched_init, drivers can > > > > > > selectively use kthread_worker only for RT gpu queues. And because > > > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > > > > > > scaling issues. > > > > > > > > > > > > > > > > I don’t think having two paths will ever be acceptable, nor do I think > > > > > supporting a kthread would be all that easy. For example, in Xe we queue > > > > > additional work items outside of the scheduler on the queue for ordering > > > > > reasons — we’d have to move all of that code down into DRM sched or > > > > > completely redesign our submission model to avoid this. I’m not sure if > > > > > other drivers also do this, but it is allowed. > > > > > > > > Panthor doesn't rely on the serialization provided by the single-thread > > > > workqueue, Panfrost might rely on it though (I don't remember). I agree > > > > that maintaining a thread and workqueue based scheduling is not ideal > > > > though. > > > > > > > > > > > > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched > > > > > > users have concrete plans for userspace submissions.. > > > > > > > > > > Maybe some day.... > > > > > > > > I've yet to see a solution where no dma_fence-based signalization is > > > > involved in graphics workloads though (IIRC, Arm's solution still > > > > needs the kernel for that). Until that happens, we'll still need the > > > > kernel to signal fences asynchronously when the job is done, which I > > > > suspect will cause the same kind of latency issue... > > > > > > > > > > I don't think that is the problem here. Doesn’t the job that draws the > > > frame actually draw it, or does the display wait on the draw job’s fence > > > to signal and then do something else? 
> > > > I know close to nothing about SurfaceFlinger and very little about > > compositors in general, so I'll let Chia answer that one. What's sure > > I think Chia input would good, as if SurfaceFlinger jobs have input > dependencies this entire suggestion doesn't make any sense. > > > is that, on regular page-flips (don't remember what async page-flips > > do), the display drivers wait on the fences attached to the buffer to > > signal before doing the flip. > > I think SurfaceFlinger is different compared to Wayland/X11 use cases, > as maintaining a steady framerate is the priority above everything else > (think phone screens, which never freeze, whereas desktops do all the > time). So I believe SurfaceFlinger decides when it will submit the job > to draw a frame, without directly passing in application dependencies > into the buffer/job being drawn. Again, my understanding here may be > incorrect... That is correct. SurfaceFlinger only ever latches buffers whose associated fences have signaled, and sends down the buffers to gpu for composition or to the display for direct scanout. That might also be how modern wayland compositors work nowadays? It sounds bad to let a low fps app slow down system composition. In theory, the gpu driver should not see input dependencies ever. I will need to check if there are corner cases. > > > > > > (Sorry—I know next to nothing > > > about display.) Either way, fences should be signaled in IRQ handlers, > > > > In Panthor they are not, but that's probably something for us to > > address. Yeah, I am also looking into signaling fences from the (threaded) irq handler. > > > > > which presumably don’t have the same latency issues as workqueues, but I > > > could be mistaken. 
> > > > Might have to do with the mental model I had of this "reconcile > > Usermode queues with dma_fence signaling" model, where I was imagining > > a SW job queue (based on drm_sched too) that would wait on HW fences to > > be signal and would as a result signal the dma_fence attached to the > > job. So the queueing/dequeuing of these jobs would still happen through > > drm_sched, with the same scheduling prio issue. This being said, those > > Yes, if jobs have unmet dependencies, the bypass path doesn’t help with > the DRM scheduler workqueue context switches being slow as that path > needs to be taken in taken in this cases. > > Also, to bring up something insane we certainly wouldn’t want to do: > calling run_job when dependencies are resolved in the fence callback, > since we could be in an IRQ handler. > > Matt > > > jobs would likely be dependency less, so more likely to hit your > > fast-path-run-job. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-06 5:13 ` Chia-I Wu @ 2026-03-06 7:21 ` Matthew Brost 2026-03-06 9:36 ` Michel Dänzer 1 sibling, 0 replies; 26+ messages in thread From: Matthew Brost @ 2026-03-06 7:21 UTC (permalink / raw) To: Chia-I Wu Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On Thu, Mar 05, 2026 at 09:13:44PM -0800, Chia-I Wu wrote: > On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote: > > > > On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote: > > > On Thu, 5 Mar 2026 02:09:16 -0800 > > > Matthew Brost <matthew.brost@intel.com> wrote: > > > > > > > On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: > > > > > > > > I addressed most of your comments in a chained reply to Phillip, but I > > > > guess he dropped some of your email and thus missed those. Responding > > > > below. > > > > > > > > > Hi Matthew, > > > > > > > > > > On Wed, 4 Mar 2026 18:04:25 -0800 > > > > > Matthew Brost <matthew.brost@intel.com> wrote: > > > > > > > > > > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > > > > > > > Hi, > > > > > > > > > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > > > > > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > > > > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > > > > > > to run_job can sometimes cause frame misses. We are seeing this on > > > > > > > panthor and xe, but the issue should be common to all drm_sched users. > > > > > > > > > > > > > > > > > > > I'm going to assume that since this is a compositor, you do not pass > > > > > > input dependencies to the page-flip job. Is that correct? 
> > > > > > > > > > > > If so, I believe we could fairly easily build an opt-in DRM sched path > > > > > > that directly calls run_job in the exec IOCTL context (I assume this is > > > > > > SCHED_FIFO) if the job has no dependencies. > > > > > > > > > > I guess by ::run_job() you mean something slightly more involved that > > > > > checks if: > > > > > > > > > > - other jobs are pending > > > > > - enough credits (AKA ringbuf space) is available > > > > > - and probably other stuff I forgot about > > > > > > > > > > > > > > > > > This would likely break some of Xe’s submission-backend assumptions > > > > > > around mutual exclusion and ordering based on the workqueue, but that > > > > > > seems workable. I don’t know how the Panthor code is structured or > > > > > > whether they have similar issues. > > > > > > > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea > > > > > you're describing. There's just so many things we can forget that would > > > > > lead to races/ordering issues that will end up being hard to trigger and > > > > > debug. Besides, it doesn't solve the problem where your gfx pipeline is > > > > > fully stuffed and the kernel has to dequeue things asynchronously. I do > > > > > believe we want RT-prio support in that case too. > > > > > > > > > > > > > My understanding of SurfaceFlinger is that it never waits on input > > > > dependencies from rendering applications, since those may not signal in > > > > time for a page flip. Because of that, you can’t have the job(s) that > > > > draw to the screen accept input dependencies. Maybe I have that > > > > wrong—but I've spoken to the Google team several times about issues with > > > > SurfaceFlinger, and that was my takeaway. > > > > > > > > So I don't think the kernel should ever have to dequeue things > > > > asynchronously, at least for SurfaceFlinger. 
> > > > > > There's still the contention coming from the ring buffer size, which can > > > prevent jobs from being queued directly to the HW, though, admittedly, > > > if the HW is not capable of compositing the frame faster than the > > > refresh rate, and guarantee an almost always empty ringbuffer, fixing > > > the scheduling prio is probably pointless. > > > > > > > If there is another RT use > > > > case that requires input dependencies plus the kernel dequeuing things > > > > asynchronously, I agree this wouldn’t help—but my suggestion also isn’t > > > > mutually exclusive with other RT rework either. > > > > > > Yeah, dunno. It just feels like another hack on top of the already quite > > > convoluted design that drm_sched has become. > > > > > > > I agree we wouldn't want this to become some wild hack. > > > > I could actually see this helping in other very timing-sensitive > > paths—for example, page-fault paths where a copy job needs to be issued > > as part of the fault resolution to a dedicated kernel queue. I’ve seen > > noise in fault profiling caused by delays in the scheduler workqueue, > > which needs to program the job to the device. In paths like this, every > > microsecond matters, as even minor improvements have real-world impacts > > on performance numbers. This will become even more noticeable as > > CPU<->GPU bus speeds increase. In this case, typically copy jobs have > > no input dependencies, thus the desire is to program the ring as quickly > > as possible. > > > > > > > > > > > > > > > > > > I can try to hack together a quick PoC to see what this would look like > > > > > > and give you something to test. > > > > > > > > > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > > > > > > meet future android requirements). It seems either workqueue needs to > > > > > > > gain RT support, or drm_sched needs to support kthread_worker. > > > > > > > > > > > > +Tejun to see if RT workqueue is in the plans. 
> > > > > > > > > > Dunno how feasible that is, but that would be my preferred option. > > > > > > > > > > > > > > > > > > > > > > > > > I know drm_sched switched from kthread_worker to workqueue for better > > > > > > > scaling when xe was introduced. But if drm_sched can support either > > > > > > > workqueue or kthread_worker during drm_sched_init, drivers can > > > > > > > selectively use kthread_worker only for RT gpu queues. And because > > > > > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > > > > > > > scaling issues. > > > > > > > > > > > > > > > > > > > I don’t think having two paths will ever be acceptable, nor do I think > > > > > > supporting a kthread would be all that easy. For example, in Xe we queue > > > > > > additional work items outside of the scheduler on the queue for ordering > > > > > > reasons — we’d have to move all of that code down into DRM sched or > > > > > > completely redesign our submission model to avoid this. I’m not sure if > > > > > > other drivers also do this, but it is allowed. > > > > > > > > > > Panthor doesn't rely on the serialization provided by the single-thread > > > > > workqueue, Panfrost might rely on it though (I don't remember). I agree > > > > > that maintaining a thread and workqueue based scheduling is not ideal > > > > > though. > > > > > > > > > > > > > > > > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched > > > > > > > users have concrete plans for userspace submissions.. > > > > > > > > > > > > Maybe some day.... > > > > > > > > > > I've yet to see a solution where no dma_fence-based signalization is > > > > > involved in graphics workloads though (IIRC, Arm's solution still > > > > > needs the kernel for that). Until that happens, we'll still need the > > > > > kernel to signal fences asynchronously when the job is done, which I > > > > > suspect will cause the same kind of latency issue... 
> > > > > > > > > > > > > I don't think that is the problem here. Doesn’t the job that draws the > > > > frame actually draw it, or does the display wait on the draw job’s fence > > > > to signal and then do something else? > > > > > > I know close to nothing about SurfaceFlinger and very little about > > > compositors in general, so I'll let Chia answer that one. What's sure > > > > I think Chia input would good, as if SurfaceFlinger jobs have input > > dependencies this entire suggestion doesn't make any sense. > > > > > is that, on regular page-flips (don't remember what async page-flips > > > do), the display drivers wait on the fences attached to the buffer to > > > signal before doing the flip. > > > > I think SurfaceFlinger is different compared to Wayland/X11 use cases, > > as maintaining a steady framerate is the priority above everything else > > (think phone screens, which never freeze, whereas desktops do all the > > time). So I believe SurfaceFlinger decides when it will submit the job > > to draw a frame, without directly passing in application dependencies > > into the buffer/job being drawn. Again, my understanding here may be > > incorrect... > That is correct. SurfaceFlinger only ever latches buffers whose > associated fences have signaled, and sends down the buffers to gpu for > composition or to the display for direct scanout. That might also be > how modern wayland compositors work nowadays? It sounds bad to let a Don't know wayland but let me follow up on that. > low fps app slow down system composition. > > In theory, the gpu driver should not see input dependencies ever. I > will need to check if there are corner cases. > Thanks — this matches my understanding from my conversations with Google about SurfaceFlinger and the lack of dependencies. If you can also check any corner cases, that would be good to understand as well. 
The kernel can technically introduce dependencies if it moves memory around, but something like that shouldn’t happen in practice. I'd strongly suggest a bypass path as a solution. I mentioned this to Boris — this approach is not mutually exclusive with other RT rework either, and in any case it is likely the most performant and stable path (i.e. no jitter). > > > > > > > > > > (Sorry—I know next to nothing > > > > about display.) Either way, fences should be signaled in IRQ handlers, > > > > > > In Panthor they are not, but that's probably something for us to > > > address. > Yeah, I am also looking into signaling fences from the (threaded) irq handler. > I would suggest that you do. The Xe implementation is in xe_hw_fence.c if you want a design reference. Matt > > > > > > > which presumably don’t have the same latency issues as workqueues, but I > > > > could be mistaken. > > > > > > Might have to do with the mental model I had of this "reconcile > > > Usermode queues with dma_fence signaling" model, where I was imagining > > > a SW job queue (based on drm_sched too) that would wait on HW fences to > > > be signal and would as a result signal the dma_fence attached to the > > > job. So the queueing/dequeuing of these jobs would still happen through > > > drm_sched, with the same scheduling prio issue. This being said, those > > > > Yes, if jobs have unmet dependencies, the bypass path doesn’t help with > > the DRM scheduler workqueue context switches being slow as that path > > needs to be taken in taken in this cases. > > > > Also, to bring up something insane we certainly wouldn’t want to do: > > calling run_job when dependencies are resolved in the fence callback, > > since we could be in an IRQ handler. > > > > Matt > > > > > jobs would likely be dependency less, so more likely to hit your > > > fast-path-run-job. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-06 5:13 ` Chia-I Wu 2026-03-06 7:21 ` Matthew Brost @ 2026-03-06 9:36 ` Michel Dänzer 2026-03-06 9:40 ` Michel Dänzer 1 sibling, 1 reply; 26+ messages in thread From: Michel Dänzer @ 2026-03-06 9:36 UTC (permalink / raw) To: Chia-I Wu, Matthew Brost Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On 3/6/26 06:13, Chia-I Wu wrote: > On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote: >> On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote: >>> On Thu, 5 Mar 2026 02:09:16 -0800 >>> Matthew Brost <matthew.brost@intel.com> wrote: >>>> On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: >>>>> On Wed, 4 Mar 2026 18:04:25 -0800 >>>>> Matthew Brost <matthew.brost@intel.com> wrote: >>>>>> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: >>>>>>> >>>>>>> Thoughts? Or perhaps this becomes less of an issue if all drm_sched >>>>>>> users have concrete plans for userspace submissions.. >>>>>> >>>>>> Maybe some day.... >>>>> >>>>> I've yet to see a solution where no dma_fence-based signalization is >>>>> involved in graphics workloads though (IIRC, Arm's solution still >>>>> needs the kernel for that). Until that happens, we'll still need the >>>>> kernel to signal fences asynchronously when the job is done, which I >>>>> suspect will cause the same kind of latency issue... >>>>> >>>> >>>> I don't think that is the problem here. Doesn’t the job that draws the >>>> frame actually draw it, or does the display wait on the draw job’s fence >>>> to signal and then do something else? >>> >>> I know close to nothing about SurfaceFlinger and very little about >>> compositors in general, so I'll let Chia answer that one. 
What's sure >> >> I think Chia input would good, as if SurfaceFlinger jobs have input >> dependencies this entire suggestion doesn't make any sense. >> >>> is that, on regular page-flips (don't remember what async page-flips >>> do), the display drivers wait on the fences attached to the buffer to >>> signal before doing the flip. >> >> I think SurfaceFlinger is different compared to Wayland/X11 use cases, >> as maintaining a steady framerate is the priority above everything else >> (think phone screens, which never freeze, whereas desktops do all the >> time). So I believe SurfaceFlinger decides when it will submit the job >> to draw a frame, without directly passing in application dependencies >> into the buffer/job being drawn. Again, my understanding here may be >> incorrect... > That is correct. SurfaceFlinger only ever latches buffers whose > associated fences have signaled, and sends down the buffers to gpu for > composition or to the display for direct scanout. That might also be > how modern wayland compositors work nowadays? Many (most of the major ones?) do, yes. (Weston being a notable exception AFAIK, though since it supports the Wayland syncobj protocol now, switching to this model should be easy) -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-06 9:36 ` Michel Dänzer @ 2026-03-06 9:40 ` Michel Dänzer 0 siblings, 0 replies; 26+ messages in thread From: Michel Dänzer @ 2026-03-06 9:40 UTC (permalink / raw) To: Chia-I Wu, Matthew Brost Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list, tj On 3/6/26 10:36, Michel Dänzer wrote: > On 3/6/26 06:13, Chia-I Wu wrote: >> On Thu, Mar 5, 2026 at 12:52 PM Matthew Brost <matthew.brost@intel.com> wrote: >>> On Thu, Mar 05, 2026 at 11:52:01AM +0100, Boris Brezillon wrote: >>>> On Thu, 5 Mar 2026 02:09:16 -0800 >>>> Matthew Brost <matthew.brost@intel.com> wrote: >>>>> On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote: >>>>>> On Wed, 4 Mar 2026 18:04:25 -0800 >>>>>> Matthew Brost <matthew.brost@intel.com> wrote: >>>>>>> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: >>>>>>>> >>>>>>>> Thoughts? Or perhaps this becomes less of an issue if all drm_sched >>>>>>>> users have concrete plans for userspace submissions.. >>>>>>> >>>>>>> Maybe some day.... >>>>>> >>>>>> I've yet to see a solution where no dma_fence-based signalization is >>>>>> involved in graphics workloads though (IIRC, Arm's solution still >>>>>> needs the kernel for that). Until that happens, we'll still need the >>>>>> kernel to signal fences asynchronously when the job is done, which I >>>>>> suspect will cause the same kind of latency issue... >>>>>> >>>>> >>>>> I don't think that is the problem here. Doesn’t the job that draws the >>>>> frame actually draw it, or does the display wait on the draw job’s fence >>>>> to signal and then do something else? >>>> >>>> I know close to nothing about SurfaceFlinger and very little about >>>> compositors in general, so I'll let Chia answer that one. 
What's sure >>> >>> I think Chia input would good, as if SurfaceFlinger jobs have input dependencies this entire suggestion doesn't make any sense. >>> >>>> is that, on regular page-flips (don't remember what async page-flips >>>> do), the display drivers wait on the fences attached to the buffer to >>>> signal before doing the flip. >>> >>> I think SurfaceFlinger is different compared to Wayland/X11 use cases, >>> as maintaining a steady framerate is the priority above everything else >>> (think phone screens, which never freeze, whereas desktops do all the >>> time). So I believe SurfaceFlinger decides when it will submit the job >>> to draw a frame, without directly passing in application dependencies >>> into the buffer/job being drawn. Again, my understanding here may be >>> incorrect... >> That is correct. SurfaceFlinger only ever latches buffers whose >> associated fences have signaled, and sends down the buffers to gpu for >> composition or to the display for direct scanout. That might also be >> how modern wayland compositors work nowadays? > > Many (most of the major ones?) do, yes. (Weston being a notable exception AFAIK, though since it supports the Wayland syncobj protocol now, switching to this model should be easy) Err, I meant the commit-timing protocol, Weston doesn't support the syncobj protocol yet AFAICT. -- Earthling Michel Dänzer \ GNOME / Xwayland / Mesa developer https://redhat.com \ Libre software enthusiast ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu 2026-03-05 2:04 ` Matthew Brost @ 2026-03-05 8:35 ` Tvrtko Ursulin 2026-03-05 9:40 ` Boris Brezillon 2026-03-05 9:23 ` Boris Brezillon 2026-03-05 23:09 ` Hillf Danton 3 siblings, 1 reply; 26+ messages in thread From: Tvrtko Ursulin @ 2026-03-05 8:35 UTC (permalink / raw) To: Chia-I Wu, ML dri-devel, intel-xe Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list On 04/03/2026 22:51, Chia-I Wu wrote: > Hi, > > Our system compositor (surfaceflinger on android) submits gpu jobs > from a SCHED_FIFO thread to an RT gpu queue. However, because > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > to run_job can sometimes cause frame misses. We are seeing this on > panthor and xe, but the issue should be common to all drm_sched users. > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > meet future android requirements). It seems either workqueue needs to > gain RT support, or drm_sched needs to support kthread_worker. > > I know drm_sched switched from kthread_worker to workqueue for better From a plain kthread actually. Anyway, I suggested trying the kthread_worker approach a few times in the past but never got round to implementing it. Not dual paths but simply replacing the workqueues with kthread_workers. What is your thinking regarding how the priority would be configured? In terms of the default and mechanism to select a higher priority scheduling class. Regards, Tvrtko > scaling when xe was introduced. But if drm_sched can support either > workqueue or kthread_worker during drm_sched_init, drivers can > selectively use kthread_worker only for RT gpu queues. 
And because > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > scaling issues. > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched > users have concrete plans for userspace submissions.. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-05 8:35 ` Tvrtko Ursulin @ 2026-03-05 9:40 ` Boris Brezillon 2026-03-27 9:19 ` Tvrtko Ursulin 0 siblings, 1 reply; 26+ messages in thread From: Boris Brezillon @ 2026-03-05 9:40 UTC (permalink / raw) To: Tvrtko Ursulin Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list Hi Tvrtko, On Thu, 5 Mar 2026 08:35:33 +0000 Tvrtko Ursulin <tursulin@ursulin.net> wrote: > On 04/03/2026 22:51, Chia-I Wu wrote: > > Hi, > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > to run_job can sometimes cause frame misses. We are seeing this on > > panthor and xe, but the issue should be common to all drm_sched users. > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > meet future android requirements). It seems either workqueue needs to > > gain RT support, or drm_sched needs to support kthread_worker. > > > > I know drm_sched switched from kthread_worker to workqueue for better > > From a plain kthread actually. Oops, sorry, I hadn't seen your reply before posting mine. I basically said the same. > Anyway, I suggested trying the > kthread_worker approach a few times in the past but never got round > implementing it. Not dual paths but simply replacing the workqueues with > kthread_workers. > > What is your thinking regarding how would the priority be configured? In > terms of the default and mechanism to select a higher priority > scheduling class. 
If we follow the same model that exists today, where the workqueue can be passed at drm_sched_init() time, it becomes the driver's responsibility to create a worker of its own with the right prio set (using sched_setscheduler()). There's still the case where the worker is NULL, in which case the drm_sched code can probably create its own worker and leave it with the default prio, just like what existed before the transition to workqueues. It's a whole different story if you want to deal with worker pools and do some load balancing though... Regards, Boris ^ permalink raw reply	[flat|nested] 26+ messages in thread
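As a pseudocode-level sketch of the driver-side setup Boris describes: `kthread_create_worker()` and `sched_set_fifo()` are real kernel helpers, but drm_sched has no kthread_worker support today, so the `drm_sched_init_with_worker()` hookup below is an invented name for the part under discussion.

```c
/*
 * Hypothetical sketch only: drm_sched does not accept a kthread_worker
 * today, and drm_sched_init_with_worker() is an invented name. The
 * kthread_worker creation and the SCHED_FIFO promotion use existing
 * kernel APIs.
 */
static struct kthread_worker *rt_worker;

static int mydrv_create_rt_sched(struct drm_gpu_scheduler *sched)
{
	rt_worker = kthread_create_worker(0, "mydrv-rt-submit");
	if (IS_ERR(rt_worker))
		return PTR_ERR(rt_worker);

	/* Promote the submit thread to SCHED_FIFO (mid priority). */
	sched_set_fifo(rt_worker->task);

	/* Invented API: hand the worker to drm_sched instead of a wq. */
	return drm_sched_init_with_worker(sched, rt_worker);
}
```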
* Re: drm_sched run_job and scheduling latency 2026-03-05 9:40 ` Boris Brezillon @ 2026-03-27 9:19 ` Tvrtko Ursulin 0 siblings, 0 replies; 26+ messages in thread From: Tvrtko Ursulin @ 2026-03-27 9:19 UTC (permalink / raw) To: Boris Brezillon Cc: Chia-I Wu, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list On 05/03/2026 09:40, Boris Brezillon wrote: > Hi Tvrtko, > > On Thu, 5 Mar 2026 08:35:33 +0000 > Tvrtko Ursulin <tursulin@ursulin.net> wrote: > >> On 04/03/2026 22:51, Chia-I Wu wrote: >>> Hi, >>> >>> Our system compositor (surfaceflinger on android) submits gpu jobs >>> from a SCHED_FIFO thread to an RT gpu queue. However, because >>> workqueue threads are SCHED_NORMAL, the scheduling latency from submit >>> to run_job can sometimes cause frame misses. We are seeing this on >>> panthor and xe, but the issue should be common to all drm_sched users. >>> >>> Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't >>> meet future android requirements). It seems either workqueue needs to >>> gain RT support, or drm_sched needs to support kthread_worker. >>> >>> I know drm_sched switched from kthread_worker to workqueue for better >> >> From a plain kthread actually. > > Oops, sorry, I hadn't seen your reply before posting mine. I basically > said the same. > >> Anyway, I suggested trying the >> kthread_worker approach a few times in the past but never got round >> implementing it. Not dual paths but simply replacing the workqueues with >> kthread_workers. >> >> What is your thinking regarding how would the priority be configured? In >> terms of the default and mechanism to select a higher priority >> scheduling class. 
> > If we follow the same model that exists today, where the > workqueue can be passed at drm_sched_init() time, it becomes the > driver's responsibility to create a worker of his own with the right > prio set (using sched_setscheduler()). There's still the case where the > worker is NULL, in which case the drm_sched code can probably create > his own worker and leave it with the default prio, just like existed > before the transition to workqueues. > > It's a whole different story if you want to deal with worker pools and > do some load balancing though... I prototyped this in xe in the meantime and it is looking plausible that latency can be significantly reduced. First to say that I did not go as far as worker pools because at the moment I don't see a use case for it. At least not for xe. When 1:1 entity-to-scheduler drivers appeared, kthreads were undesirable just because they were ending up with an effectively unbounded number of kernel threads. There was no benefit to that, only downsides. Workqueues were good since they manage the thread pool under the hood, but that is just a handy coincidence; the design still fails to express the optimal number of CPU threads required to feed a GPU engine. For example with xe, if there was a 4096-CPU machine with 4096 user contexts feeding the same GPU engine, the optimal number of CPU threads to feed it is really more like one, rather than however many the wq management decides to run in parallel. They all end up hammering on the same lock to let the firmware know there is something to schedule. For this reason, in my prototype I create a kthread_worker per hardware execution engine. (For xe even that could potentially be too much, maybe I should even try one kthread_worker per GuC CT.) This creates a requirement for 1:1 drivers to not use the "worker" auto-create mode of the DRM scheduler, so TBD if that is okay. Anyway, onto the numbers. Well, actually, first onto a benchmark I hacked up.. 
I took xe_blt from IGT and modified it heavily to be more reasonable. What it essentially does is emit a constant stream of synchronous blit operations and measure the variance of the time each took to complete, as observed by the submitting process. In parallel it spawns a number of CPU hog threads to oversubscribe the system. And it can run the submitting thread at either normal priority, re-niced to -1, or at SCHED_FIFO. This is to simulate a typical compositor use case. Now onto the numbers.

                     normal  nice   FIFO
 wq                  100%    76%    1%
 kthread_worker      100%    73%    1.2%
  └─relative to wq:  50.5%   48.5%  58.9%

Median "jitter" (variance in observed job submissions) is normalised and shows how changing the CPU priority changes the jitter observed by the submission thread. The first two rows are the current wq implementation and the kthread_worker conversion; they show roughly similar scaling. The third row is the kthread_worker results normalised against wq, and that shows roughly half the jitter. So a meaningful improvement. Then I went a step further to even better address the problem analysis done by Chia-I, solving the priority inversion problem. That is, to loosely track the CPU priorities of the currently active entities submitting to each scheduler (and in turn kthread_worker). This in turn further improved the latency numbers for the SCHED_FIFO case, albeit there is a strange anomaly with re-nice which I will come to later. It looks like this:

                       normal  nice   FIFO
 kworker_follow_prio   100%    277%   0.66%
  └─relative to wq:    60%     222%   37.8%

This effectively means that with a SCHED_FIFO compositor the submission round-trip latency could be around a third of what the current scheduler can do. Now the re-nice anomaly.. This is something I have yet to investigate. The issue may be related to the fact that, as I said, the kthread_workers loosely track the submission thread priority. 
Loosely meaning that if they detect a negative nice they do not follow the exact nice level but go to minimum nice, while my test program was using the least extreme nice level (-19 vs -1). Perhaps that causes some strange effect in the CPU scheduler. I do not know yet, but it is very interesting that it appears repeatable. It is also important to view my numbers with some margin of error. I have tried to remove the effect of intel_pstate, CPU turbo, and thermal management to a large extent, but I do not think I fully succeeded yet. My gut feeling is there may be some +/- 5% or so in the results. Also important to say is that the prototype depends on my other DRM scheduler series (the fair scheduler one), since I needed the nicer sched_rq abstraction with better tracking of active entities to implement priority inheritance, so I am unlikely to post it all as RFC since Philipp would possibly get a heart attack if I did. :) To close, I think this is interesting to check out further; we could look at converting panthor next and then run more experiments. Regards, Tvrtko ^ permalink raw reply	[flat|nested] 26+ messages in thread
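The "median jitter" metric Tvrtko describes (spread of per-job completion times, with each configuration expressed relative to the wq baseline) could be computed along these lines — a sketch with invented sample data, not the actual modified xe_blt code:

```python
# Sketch of a normalised-jitter metric: take per-job round-trip times,
# measure their spread around the median (median absolute deviation),
# and express each configuration as a percentage of a baseline.
from statistics import median

def jitter(samples_us):
    """Median absolute deviation of per-job completion times (us)."""
    m = median(samples_us)
    return median(abs(s - m) for s in samples_us)

def normalise(results, baseline):
    """Express each configuration's jitter as a % of the baseline's."""
    base = jitter(results[baseline])
    return {name: 100.0 * jitter(t) / base for name, t in results.items()}

results = {
    "wq":             [1000, 1100, 950, 1300, 1050],  # invented samples
    "kthread_worker": [1000, 1050, 980, 1150, 1020],
}
rel = normalise(results, "wq")
print(rel["wq"])  # → 100.0, by construction
```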
* Re: drm_sched run_job and scheduling latency 2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu 2026-03-05 2:04 ` Matthew Brost 2026-03-05 8:35 ` Tvrtko Ursulin @ 2026-03-05 9:23 ` Boris Brezillon 2026-03-06 5:33 ` Chia-I Wu 2026-03-05 23:09 ` Hillf Danton 3 siblings, 1 reply; 26+ messages in thread From: Boris Brezillon @ 2026-03-05 9:23 UTC (permalink / raw) To: Chia-I Wu Cc: ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list On Wed, 4 Mar 2026 14:51:39 -0800 Chia-I Wu <olvaffe@gmail.com> wrote: > Hi, > > Our system compositor (surfaceflinger on android) submits gpu jobs > from a SCHED_FIFO thread to an RT gpu queue. However, because > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > to run_job can sometimes cause frame misses. We are seeing this on > panthor and xe, but the issue should be common to all drm_sched users. > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > meet future android requirements). It seems either workqueue needs to > gain RT support, or drm_sched needs to support kthread_worker. > > I know drm_sched switched from kthread_worker to workqueue for better > scaling when xe was introduced. Actually, it went from a plain kthread with open-coded "work" support to workqueues. The kthread_worker+kthread_work model looks closer to what workqueues provide, so transitioning drivers to it shouldn't be too hard. The scalability issue you mentioned (one thread per GPU context doesn't scale) doesn't apply, because we can pretty easily share the same kthread_worker for all drm_gpu_scheduler instances, just like we can share the same workqueue for all drm_gpu_scheduler instances today. 
Luckily, it seems that no one so far has been using WQ_PERCPU-workqueues, so that's one less thing we need to worry about. The last remaining drawback with a kthread_work[er] based solution is the fact that workqueues can adjust the number of worker threads on demand based on the load. If we really need this flexibility (a non-static number of threads per prio level per driver), that's something we'll have to add support for. For Panthor, the way I see it, we could start with one thread per group priority, and then pick the worker thread to use at drm_sched_init() based on the group prio. If we need something with a thread pool, then drm_sched will have to know about those threads, and do some load balancing when queueing the works... Note that someone at Collabora is working on dynamic context priority support, meaning we'll have to be able to change the drm_gpu_scheduler kthread_worker at runtime. TLDR; All of this is doable, but it's more work (for us, DRM devs) than asking for RT prio support to be added to workqueues. > But if drm_sched can support either > workqueue or kthread_worker during drm_sched_init, drivers can > selectively use kthread_worker only for RT gpu queues. And because > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > scaling issues. I think, whatever we choose to go for, we probably don't want to keep both models around, because that's going to be a pain to maintain. ^ permalink raw reply	[flat|nested] 26+ messages in thread
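The per-priority model Boris sketches for Panthor — one worker thread per group priority, selected when the scheduler instance is created, and re-selected if the context priority changes at runtime — could be shaped roughly like this (Python as neutral pseudocode; all names invented):

```python
# Sketch of per-priority workers: one dedicated worker per group
# priority, picked at scheduler-init time. The "realtime" worker is the
# one a driver would run as SCHED_FIFO. All names are invented.
PRIORITIES = ("low", "medium", "high", "realtime")

class WorkerPool:
    def __init__(self):
        # Stand-ins for kthread_workers, one per priority level.
        self.workers = {p: f"submit-worker/{p}" for p in PRIORITIES}

    def pick(self, group_prio):
        return self.workers[group_prio]

class Scheduler:
    def __init__(self, pool, group_prio):
        self.pool = pool
        self.worker = pool.pick(group_prio)  # chosen at init time

    def set_priority(self, new_prio):
        # A dynamic priority change means migrating to another worker.
        self.worker = self.pool.pick(new_prio)

pool = WorkerPool()
sched = Scheduler(pool, "medium")
sched.set_priority("realtime")
print(sched.worker)  # → submit-worker/realtime
```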
* Re: drm_sched run_job and scheduling latency 2026-03-05 9:23 ` Boris Brezillon @ 2026-03-06 5:33 ` Chia-I Wu 2026-03-06 7:36 ` Matthew Brost 0 siblings, 1 reply; 26+ messages in thread From: Chia-I Wu @ 2026-03-06 5:33 UTC (permalink / raw) To: Boris Brezillon Cc: ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Matthew Brost, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list On Thu, Mar 5, 2026 at 1:23 AM Boris Brezillon <boris.brezillon@collabora.com> wrote: > > On Wed, 4 Mar 2026 14:51:39 -0800 > Chia-I Wu <olvaffe@gmail.com> wrote: > > > Hi, > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > to run_job can sometimes cause frame misses. We are seeing this on > > panthor and xe, but the issue should be common to all drm_sched users. > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > meet future android requirements). It seems either workqueue needs to > > gain RT support, or drm_sched needs to support kthread_worker. > > > > I know drm_sched switched from kthread_worker to workqueue for better > > scaling when xe was introduced. > > Actually, it went from a plain kthread with open-coded "work" support to > workqueues. The kthread_worker+kthread_work model looks closer to what > workqueues provide, so transitioning drivers to it shouldn't be too > hard. The scalability issue you mentioned (one thread per GPU context > doesn't scale) doesn't apply, because we can pretty easily share the > same kthread_worker for all drm_gpu_scheduler instances, just like we > can share the same workqueue for all drm_gpu_scheduler instances today. 
> Luckily, it seems that no one so far has been using > WQ_PERCPU-workqueues, so that's one less thing we need to worry about. > The last remaining drawback with a kthread_work[er] based solution is > the fact workqueues can adjust the number of worker threads on demand > based on the load. If we really need this flexibility (a non static > number of threads per-prio level per-driver), that's something we'll > have to add support for. Wait, I thought this was the exact scaling issue that workqueue solved for xe and panthor? We needed to execute run_jobs for N drm_gpu_scheduler instances, where N is entirely under userspace's control. We didn't want to serialize the executions to a single thread. Granted, panthor holds a lock in its run_job callback and does not benefit from a workqueue. I don't know how xe's run_job behaves though. > > For Panthor, the way I see it, we could start with one thread per-group > priority, and then pick the worker thread to use at drm_sched_init() > based on the group prio. If we need something with a thread pool, then > drm_sched will have to know about those threads, and do some load > balancing when queueing the works... > > Note that someone at Collabora is working on dynamic context priority > support, meaning we'll have to be able to change the drm_gpu_scheduler > kthread_worker at runtime. > > TLDR; All of this is doable, but it's more work (for us, DRM devs) than > asking RT prio support to be added to workqueues. It looks like WQ_RT was last brought up in https://lore.kernel.org/all/aPJdrqSiuijOcaPE@slm.duckdns.org/ Maybe adding some form of bring-your-own-worker-pool support to workqueue will be acceptable? 
> > I think, whatever we choose to go for, we probably don't want to keep > both models around, because that's going to be a pain to maintain. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-06 5:33 ` Chia-I Wu @ 2026-03-06 7:36 ` Matthew Brost 0 siblings, 0 replies; 26+ messages in thread From: Matthew Brost @ 2026-03-06 7:36 UTC (permalink / raw) To: Chia-I Wu Cc: Boris Brezillon, ML dri-devel, intel-xe, Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König, Thomas Hellström, Rodrigo Vivi, open list On Thu, Mar 05, 2026 at 09:33:36PM -0800, Chia-I Wu wrote: > On Thu, Mar 5, 2026 at 1:23 AM Boris Brezillon > <boris.brezillon@collabora.com> wrote: > > > > On Wed, 4 Mar 2026 14:51:39 -0800 > > Chia-I Wu <olvaffe@gmail.com> wrote: > > > > > Hi, > > > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > > to run_job can sometimes cause frame misses. We are seeing this on > > > panthor and xe, but the issue should be common to all drm_sched users. > > > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > > meet future android requirements). It seems either workqueue needs to > > > gain RT support, or drm_sched needs to support kthread_worker. > > > > > > I know drm_sched switched from kthread_worker to workqueue for better > > > scaling when xe was introduced. > > > > Actually, it went from a plain kthread with open-coded "work" support to > > workqueues. The kthread_worker+kthread_work model looks closer to what > > workqueues provide, so transitioning drivers to it shouldn't be too > > hard. The scalability issue you mentioned (one thread per GPU context > > doesn't scale) doesn't apply, because we can pretty easily share the > > same kthread_worker for all drm_gpu_scheduler instances, just like we > > can share the same workqueue for all drm_gpu_scheduler instances today. 
> > Luckily, it seems that no one so far has been using > > WQ_PERCPU-workqueues, so that's one less thing we need to worry about. > > The last remaining drawback with a kthread_work[er] based solution is > > the fact workqueues can adjust the number of worker threads on demand > > based on the load. If we really need this flexibility (a non static > > number of threads per-prio level per-driver), that's something we'll > > have to add support for. > Wait, I thought this was the exact scaling issue that workqueue solved > for xe and panthor? We needed to execute run_jobs for N > drm_gpu_scheduler instances, where N is in total control of the > userspace. We didn't want to serialize the executions to a single > thread. > I honestly doubt more threads help here. In Xe, the time to push a job (run_job) to the hardware is maybe 1µs. In Xe, individual workqueues are mostly for our compute use cases, where we sometimes need to sleep inside the work item and don’t want that sleep to interfere with other clients. For 3D, I suspect we could use a shared workqueue (still with a dedicated scheduler instance per user queue) among all clients and not see a noticeable change in performance - it might actually be better. At one point I converted Xe to do this, but I lost track of the patches in the stack. > Granted, panthor holds a lock in its run_job callback and does not > benefit from a workqueue. I don't know how xe's run_job does though. > We grab a shared mutex for the firmware queue push, but it is a very tight path and likely within the window where the mutex is still spinning. > > > > For Panthor, the way I see it, we could start with one thread per-group > > priority, and then pick the worker thread to use at drm_sched_init() > > based on the group prio. If we need something with a thread pool, then > > drm_sched will have to know about those threads, and do some load > > balancing when queueing the works... 
> > > > Note that someone at Collabora is working on dynamic context priority > > support, meaning we'll have to be able to change the drm_gpu_scheduler > > kthread_worker at runtime. > > > > TLDR; All of this is doable, but it's more work (for us, DRM devs) than > > asking RT prio support to be added to workqueues. > > It looks like WQ_RT was last brought up in > > https://lore.kernel.org/all/aPJdrqSiuijOcaPE@slm.duckdns.org/ > Tejun says hard no on WQ_RT. > Maybe adding some form of bring-your-own-worker-pool support to > workqueue will be acceptable? > Before doing anything too crazy, I think we should consider a direct submit path, given that you’ve confirmed SurfaceFlinger does not have input dependencies. I’m fairly close to having something I feel good about posting. If you could test it out and report back, I think that would be a good place to start — then we can duke it out among the maintainers if this is acceptable. Matt > > > > > But if drm_sched can support either > > > workqueue or kthread_worker during drm_sched_init, drivers can > > > selectively use kthread_worker only for RT gpu queues. And because > > > drivers require CAP_SYS_NICE for RT gpu queues, this should not cause > > > scaling issues. > > > > I think, whatever we choose to go for, we probably don't want to keep > > both models around, because that's going to be a pain to maintain. ^ permalink raw reply [flat|nested] 26+ messages in thread
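Matthew's proposed opt-in direct-submit path — run the job inline in the submitting (possibly SCHED_FIFO) context when it has no dependencies, falling back to the worker otherwise — could be shaped roughly like this (Python as neutral pseudocode; all names invented, not the actual Xe patches):

```python
# Sketch (invented names) of an opt-in direct-submit fast path: a job
# with no input dependencies runs inline in the submitter's context,
# inheriting its (possibly SCHED_FIFO) priority; everything else goes
# through the ordinary deferred worker path.
def submit(job, *, direct_submit_ok, queue_to_worker, run_job):
    if direct_submit_ok and not job["deps"]:
        return run_job(job)       # inline, in the caller's context
    return queue_to_worker(job)   # normal deferred path

log = []
run = lambda j: log.append(("inline", j["name"]))
queue = lambda j: log.append(("queued", j["name"]))

submit({"name": "flip", "deps": []}, direct_submit_ok=True,
       queue_to_worker=queue, run_job=run)
submit({"name": "render", "deps": ["fence0"]}, direct_submit_ok=True,
       queue_to_worker=queue, run_job=run)
print(log)  # → [('inline', 'flip'), ('queued', 'render')]
```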
* Re: drm_sched run_job and scheduling latency 2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu ` (2 preceding siblings ...) 2026-03-05 9:23 ` Boris Brezillon @ 2026-03-05 23:09 ` Hillf Danton 2026-03-06 5:46 ` Chia-I Wu 3 siblings, 1 reply; 26+ messages in thread From: Hillf Danton @ 2026-03-05 23:09 UTC (permalink / raw) To: Chia-I Wu Cc: Matthew Brost, DRI, intel-xe, Danilo Krummrich, Philipp Stanner, Boris Brezillon, LKML On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > Hi, > > Our system compositor (surfaceflinger on android) submits gpu jobs > from a SCHED_FIFO thread to an RT gpu queue. However, because > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > to run_job can sometimes cause frame misses. We are seeing this on > panthor and xe, but the issue should be common to all drm_sched users. > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > meet future android requirements). It seems either workqueue needs to > gain RT support, or drm_sched needs to support kthread_worker. > As RT means (in general) to some extent that the game of eevdf is played in __userspace__, but you are not PeterZ, so any issue like frame miss is understandably expected. Who made the workqueue worker a victim if the CPU cycles are not tight? Who is the new victim of a RT kthread worker? As RT is not free, what did you pay for it, given fewer RT success on market? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-05 23:09 ` Hillf Danton @ 2026-03-06 5:46 ` Chia-I Wu 2026-03-06 11:58 ` Hillf Danton 0 siblings, 1 reply; 26+ messages in thread From: Chia-I Wu @ 2026-03-06 5:46 UTC (permalink / raw) To: Hillf Danton Cc: Matthew Brost, DRI, intel-xe, Danilo Krummrich, Philipp Stanner, Boris Brezillon, LKML On Thu, Mar 5, 2026 at 3:10 PM Hillf Danton <hdanton@sina.com> wrote: > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: > > Hi, > > > > Our system compositor (surfaceflinger on android) submits gpu jobs > > from a SCHED_FIFO thread to an RT gpu queue. However, because > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit > > to run_job can sometimes cause frame misses. We are seeing this on > > panthor and xe, but the issue should be common to all drm_sched users. > > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't > > meet future android requirements). It seems either workqueue needs to > > gain RT support, or drm_sched needs to support kthread_worker. > > > As RT means (in general) to some extent that the game of eevdf is played in > __userspace__, but you are not PeterZ, so any issue like frame miss is > understandably expected. > Who made the workqueue worker a victim if the CPU cycles are not tight? > Who is the new victim of a RT kthread worker? > As RT is not free, what did you pay for it, given fewer RT success on market? That is a deliberate decision for android, that avoiding frame misses is a top priority. Also, I think most drm drivers already signal their fences from irq handlers or rt threads for a similar reason. And the reasoning applies to submissions as well. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: drm_sched run_job and scheduling latency 2026-03-06 5:46 ` Chia-I Wu @ 2026-03-06 11:58 ` Hillf Danton 0 siblings, 0 replies; 26+ messages in thread From: Hillf Danton @ 2026-03-06 11:58 UTC (permalink / raw) To: Chia-I Wu Cc: Matthew Brost, DRI, intel-xe, Danilo Krummrich, Philipp Stanner, Boris Brezillon, LKML On Thu, 5 Mar 2026 21:46:21 -0800 Chia-I Wu wrote: >On Thu, Mar 5, 2026 at 3:10 PM Hillf Danton <hdanton@sina.com> wrote: >> On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote: >> > Hi, >> > >> > Our system compositor (surfaceflinger on android) submits gpu jobs >> > from a SCHED_FIFO thread to an RT gpu queue. However, because >> > workqueue threads are SCHED_NORMAL, the scheduling latency from submit >> > to run_job can sometimes cause frame misses. We are seeing this on >> > panthor and xe, but the issue should be common to all drm_sched users. >> > >> > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't >> > meet future android requirements). It seems either workqueue needs to >> > gain RT support, or drm_sched needs to support kthread_worker. >> > >> As RT means (in general) to some extent that the game of eevdf is played in >> __userspace__, but you are not PeterZ, so any issue like frame miss is >> understandably expected. >> Who made the workqueue worker a victim if the CPU cycles are not tight? >> Who is the new victim of a RT kthread worker? >> As RT is not free, what did you pay for it, given fewer RT success on market? >> > That is a deliberate decision for android, that avoiding frame misses > is a top priority. > > Also, I think most drm drivers already signal their fences from irq > handlers or rt threads for a similar reason. And the reasoning applies > to submissions as well. > If RT submission alone works for you then your CPU cycles are tight. 
And if your workloads are sanely correct then making workqueue and/or kthread worker RT barely makes sense because the right option is to buy CPU with higher capacity. ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2026-03-27 9:19 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-03-04 22:51 drm_sched run_job and scheduling latency Chia-I Wu 2026-03-05 2:04 ` Matthew Brost 2026-03-05 8:27 ` Boris Brezillon 2026-03-05 8:38 ` Philipp Stanner 2026-03-05 9:10 ` Matthew Brost 2026-03-05 9:47 ` Philipp Stanner 2026-03-16 4:05 ` Matthew Brost 2026-03-16 4:14 ` Matthew Brost 2026-03-05 10:19 ` Boris Brezillon 2026-03-05 12:27 ` Danilo Krummrich 2026-03-05 10:09 ` Matthew Brost 2026-03-05 10:52 ` Boris Brezillon 2026-03-05 20:51 ` Matthew Brost 2026-03-06 5:13 ` Chia-I Wu 2026-03-06 7:21 ` Matthew Brost 2026-03-06 9:36 ` Michel Dänzer 2026-03-06 9:40 ` Michel Dänzer 2026-03-05 8:35 ` Tvrtko Ursulin 2026-03-05 9:40 ` Boris Brezillon 2026-03-27 9:19 ` Tvrtko Ursulin 2026-03-05 9:23 ` Boris Brezillon 2026-03-06 5:33 ` Chia-I Wu 2026-03-06 7:36 ` Matthew Brost 2026-03-05 23:09 ` Hillf Danton 2026-03-06 5:46 ` Chia-I Wu 2026-03-06 11:58 ` Hillf Danton