* Re: sched_ext: Partial mode priority and fallthrough to EEVDF
2026-03-10 18:27 ` Tejun Heo
@ 2026-03-10 18:46 ` Andrea Righi
2026-03-11 11:22 ` Matt Fleming
2026-03-11 11:10 ` Matt Fleming
1 sibling, 1 reply; 5+ messages in thread
From: Andrea Righi @ 2026-03-10 18:46 UTC
To: Tejun Heo
Cc: Matt Fleming, sched-ext, kernel-team, void, changwoo, peterz,
linux-kernel
On Tue, Mar 10, 2026 at 08:27:00AM -1000, Tejun Heo wrote:
> Hello, Matt.
>
> On Tue, Mar 10, 2026 at 02:52:13PM +0000, Matt Fleming wrote:
> > At Cloudflare we're experimenting with inverting the priority of the
> > ext_sched_class and fair_sched_class to allow us to pick SCHED_EXT
> > tasks to run before SCHED_NORMAL. This gives us better scheduling
> > decisions for those SCHED_EXT tasks where we can embed business logic
> > into the BPF program and prevents them being starved by the larger
> > number of SCHED_NORMAL tasks under CPU contention. There are a couple
> > of reasons we took this route:
> >
> > 1. Our workloads are heterogeneous and complex and we can't move entire
> > systems to SCHED_EXT in one shot. We want to experiment with running
> > SCHED_EXT in partial mode as we progressively onboard more and more
> > services (we run multiple services on single machines).
> >
> > 2. There's no way today (AFAIK) to run in "full-mode" and have BPF
> > schedulers fall through to EEVDF.
> >
> > In an ideal world, 2 is what we'd want to do. Is anyone else interested
> > in this problem or currently working on it? Is there anything coming in
> > the future that would make it easier for those of us slowly
> > transitioning to SCHED_EXT?
>
> Hmm... I have a bit of a hard time following how that's different from partial
> mode. If you want the scheduler to decide whether a task should be in SCX or
> fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
> you mean that you want to switch dynamically on each scheduling event, I
> don't think that's a good idea given that each hop would be a full sched_class
> switch.
>
> As for the ordering between the two, I don't know. How are you using partial
> mode? No matter how you order them, the behaviors on pathological cases are
> pretty bad and I've been thinking that most would use partial mode to
> partition the system so that some CPUs are managed by SCX and others by fair
> in which case the ordering doesn't matter that much. If you're mixing the
> two classes on the same CPUs, I wonder whether this is something which can
> be better dealt with by the deadline servers. Andrea, what do you think?
I think you can model your scenario using the ext deadline server. For
instance, if you run:
# echo 500000000 | tee /sys/kernel/debug/sched/ext_server/cpu*/runtime
This would guarantee sched_ext tasks 50% of the bandwidth on every CPU
(the default is 5%), even if there are runnable tasks in higher-priority
sched classes.
Would this approach work for your needs?
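To make the arithmetic behind that value explicit, here is a minimal sketch.
It assumes the runtime file takes nanoseconds and that the server period is
1 second (an assumption, chosen because it is consistent with 500000000
giving 50%). The helper names are hypothetical and nothing here touches
debugfs; it only computes the number you would write.

```python
NSEC_PER_SEC = 1_000_000_000  # assumed server period: 1 second, in ns


def server_runtime_ns(fraction, period_ns=NSEC_PER_SEC):
    """Return the runtime (ns) that yields `fraction` of each period."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(round(fraction * period_ns))


def server_fraction(runtime_ns, period_ns=NSEC_PER_SEC):
    """Return the share of each period that `runtime_ns` represents."""
    return runtime_ns / period_ns


# 50% of a 1 s period is the value used in the command above.
print(server_runtime_ns(0.50))      # 500000000
# A 5% share under the same assumed period.
print(server_runtime_ns(0.05))      # 50000000
print(server_fraction(500_000_000))  # 0.5
```

If the period on your kernel differs, pass it as period_ns rather than
relying on the assumed default.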
-Andrea
* Re: sched_ext: Partial mode priority and fallthrough to EEVDF
2026-03-10 18:27 ` Tejun Heo
2026-03-10 18:46 ` Andrea Righi
@ 2026-03-11 11:10 ` Matt Fleming
1 sibling, 0 replies; 5+ messages in thread
From: Matt Fleming @ 2026-03-11 11:10 UTC
To: Tejun Heo
Cc: sched-ext, kernel-team, arighi, void, changwoo, peterz,
linux-kernel
On Tue, Mar 10, 2026 at 08:27:00AM -1000, Tejun Heo wrote:
>
> Hmm... I have a bit of a hard time following how that's different from partial
> mode. If you want the scheduler to decide whether a task should be in SCX or
> fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
> you mean that you want to switch dynamically on each scheduling event, I
> don't think that's a good idea given that each hop would be a full sched_class
> switch.
Oh no, I don't want to switch dynamically at runtime. Doing the
classification once at BPF program load time is fine, but AFAIU
p->scx.disallow still gives us two scheduling classes (SCHED_EXT and
SCHED_NORMAL) where tasks in the fair class get chosen first.
> As for the ordering between the two, I don't know. How are you using partial
> mode? No matter how you order them, the behaviors on pathological cases are
> pretty bad and I've been thinking that most would use partial mode to
> partition the system so that some CPUs are managed by SCX and others by fair
> in which case the ordering doesn't matter that much. If you're mixing the
> two classes on the same CPUs, I wonder whether this is something which can
> be better dealt with by the deadline servers. Andrea, what do you think?
I want to use SCHED_EXT to schedule the most latency-critical tasks
because a custom BPF scheduler allows me to make better CPU placement
and preemption decisions. Doing it with partial mode allows me to
progressively switch services over to SCHED_EXT without needing to take
on a mass migration for 100+ services in one go (something I'm trying
my hardest to avoid :) ).
To clarify my "fallthrough to EEVDF" comment: if I could run in
full-mode, use disallow to keep most tasks on EEVDF, and have SCHED_EXT
tasks scheduled with higher priority than SCHED_NORMAL, then this would
tick all the boxes.
I have experimented with isolating CPUs where all tasks running are
SCHED_EXT while other CPUs run the SCHED_NORMAL workloads, so that's a
possibility. But not all our servers are configured that way and, given
that we run heterogeneous workloads on single machines, it's a steep
price to pay capacity-wise if we can't fully utilise those isolated
CPUs at all times.
And to limit the pathological case in my experiments so far I'm using
cpu.max to cap CPU bandwidth (thanks to scx_lavd's bandwidth support).
All our services are systemd services, so we can set limits to guard
against complete meltdowns.
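For anyone checking what a given cap amounts to, here is a minimal sketch
that interprets a cgroup v2 cpu.max line ("$MAX $PERIOD", where the first
field is either a quota in microseconds or the literal "max"). The function
name is hypothetical and the sample strings are illustrative, not values
from our systems.

```python
def cpu_max_limit(line):
    """Parse a cgroup v2 cpu.max line into a limit in units of CPUs.

    Returns None when the quota is "max" (no bandwidth cap).
    """
    quota, period = line.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


# Illustrative inputs only.
print(cpu_max_limit("max 100000"))     # None
print(cpu_max_limit("50000 100000"))   # 0.5
print(cpu_max_limit("200000 100000"))  # 2.0
```

A result of 0.5 means the group can use at most half of one CPU per period,
and 2.0 means up to two CPUs' worth of time.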
Thanks for the tip on the DL server. This looks promising and might
solve my problem nicely. I'll reply in more detail to Andrea's post.