sched_ext: Partial mode priority and fallthrough to EEVDF

public inbox for linux-kernel@vger.kernel.org

* sched_ext: Partial mode priority and fallthrough to EEVDF
@ 2026-03-10 14:52 Matt Fleming
  2026-03-10 18:27 ` Tejun Heo
  0 siblings, 1 reply; 5+ messages in thread
From: Matt Fleming @ 2026-03-10 14:52 UTC
  To: sched-ext; +Cc: kernel-team, tj, arighi, void, changwoo, peterz, linux-kernel

Hi,

At Cloudflare we're experimenting with inverting the priority of the
ext_sched_class and fair_sched_class to allow us to pick SCHED_EXT
tasks to run before SCHED_NORMAL. This gives us better scheduling
decisions for those SCHED_EXT tasks where we can embed business logic
into the BPF program and prevents them being starved by the larger
number of SCHED_NORMAL tasks under CPU contention. There are a couple
of reasons we took this route:

 1. Our workloads are heterogeneous and complex and we can't move entire
 systems to SCHED_EXT in one shot. We want to experiment with running
 SCHED_EXT in partial mode as we progressively onboard more and more
 services (we run multiple services on single machines).

 2. There's no way today (AFAIK) to run in "full-mode" and have BPF
 schedulers fall through to EEVDF.

In an ideal world, 2 is what we'd want to do. Is anyone else interested
in this problem or currently working on it? Is there anything coming in
the future that would make it easier for those of us slowly
transitioning to SCHED_EXT?

Thanks,
Matt

* Re: sched_ext: Partial mode priority and fallthrough to EEVDF
  2026-03-10 14:52 sched_ext: Partial mode priority and fallthrough to EEVDF Matt Fleming
@ 2026-03-10 18:27 ` Tejun Heo
  2026-03-10 18:46   ` Andrea Righi
  2026-03-11 11:10   ` Matt Fleming
  0 siblings, 2 replies; 5+ messages in thread
From: Tejun Heo @ 2026-03-10 18:27 UTC
  To: Matt Fleming
  Cc: sched-ext, kernel-team, arighi, void, changwoo, peterz,
	linux-kernel

Hello, Matt.

On Tue, Mar 10, 2026 at 02:52:13PM +0000, Matt Fleming wrote:
> At Cloudflare we're experimenting with inverting the priority of the
> ext_sched_class and fair_sched_class to allow us to pick SCHED_EXT
> tasks to run before SCHED_NORMAL. This gives us better scheduling
> decisions for those SCHED_EXT tasks where we can embed business logic
> into the BPF program and prevents them being starved by the larger
> number of SCHED_NORMAL tasks under CPU contention. There are a couple
> of reasons we took this route:
> 
>  1. Our workloads are heterogeneous and complex and we can't move entire
>  systems to SCHED_EXT in one shot. We want to experiment with running
>  SCHED_EXT in partial mode as we progressively onboard more and more
>  services (we run multiple services on single machines).
> 
>  2. There's no way today (AFAIK) to run in "full-mode" and have BPF
>  schedulers fall through to EEVDF.
> 
> In an ideal world, 2 is what we'd want to do. Is anyone else interested
> in this problem or currently working on it? Is there anything coming in
> the future that would make it easier for those of us slowly
> transitioning to SCHED_EXT?

Hmm... I have a bit of a hard time following how that's different from partial
mode. If you want the scheduler to decide whether a task should be in SCX or
fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
you mean that you want to switch dynamically on each scheduling event, I
don't think that's a good idea given that each hop would be a full sched_class
switch.

As for the ordering between the two, I don't know. How are you using partial
mode? No matter how you order them, the behaviors on pathological cases are
pretty bad and I've been thinking that most would use partial mode to
partition the system so that some CPUs are managed by SCX and others by fair
in which case the ordering doesn't matter that much. If you're mixing the
two classes on the same CPUs, I wonder whether this is something which can
be better dealt with by the deadline servers. Andrea, what do you think?

Thanks.

-- 
tejun

* Re: sched_ext: Partial mode priority and fallthrough to EEVDF
  2026-03-10 18:27 ` Tejun Heo
@ 2026-03-10 18:46   ` Andrea Righi
  2026-03-11 11:22     ` Matt Fleming
  2026-03-11 11:10   ` Matt Fleming
  1 sibling, 1 reply; 5+ messages in thread
From: Andrea Righi @ 2026-03-10 18:46 UTC
  To: Tejun Heo
  Cc: Matt Fleming, sched-ext, kernel-team, void, changwoo, peterz,
	linux-kernel

On Tue, Mar 10, 2026 at 08:27:00AM -1000, Tejun Heo wrote:
> Hello, Matt.
> 
> On Tue, Mar 10, 2026 at 02:52:13PM +0000, Matt Fleming wrote:
> > At Cloudflare we're experimenting with inverting the priority of the
> > ext_sched_class and fair_sched_class to allow us to pick SCHED_EXT
> > tasks to run before SCHED_NORMAL. This gives us better scheduling
> > decisions for those SCHED_EXT tasks where we can embed business logic
> > into the BPF program and prevents them being starved by the larger
> > number of SCHED_NORMAL tasks under CPU contention. There are a couple
> > of reasons we took this route:
> > 
> >  1. Our workloads are heterogeneous and complex and we can't move entire
> >  systems to SCHED_EXT in one shot. We want to experiment with running
> >  SCHED_EXT in partial mode as we progressively onboard more and more
> >  services (we run multiple services on single machines).
> > 
> >  2. There's no way today (AFAIK) to run in "full-mode" and have BPF
> >  schedulers fall through to EEVDF.
> > 
> > In an ideal world, 2 is what we'd want to do. Is anyone else interested
> > in this problem or currently working on it? Is there anything coming in
> > the future that would make it easier for those of us slowly
> > transitioning to SCHED_EXT?
> 
> Hmm... I have a bit of a hard time following how that's different from partial
> mode. If you want the scheduler to decide whether a task should be in SCX or
> fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
> you mean that you want to switch dynamically on each scheduling event, I
> don't think that's a good idea given that each hop would be a full sched_class
> switch.
> 
> As for the ordering between the two, I don't know. How are you using partial
> mode? No matter how you order them, the behaviors on pathological cases are
> pretty bad and I've been thinking that most would use partial mode to
> partition the system so that some CPUs are managed by SCX and others by fair
> in which case the ordering doesn't matter that much. If you're mixing the
> two classes on the same CPUs, I wonder whether this is something which can
> be better dealt with by the deadline servers. Andrea, what do you think?

I think you can model your scenario using the ext deadline server. For
instance, if you run:

 # echo 500000000 | tee /sys/kernel/debug/sched/ext_server/cpu*/runtime

This would give sched_ext tasks a guaranteed 50% bandwidth on all CPUs
(default is 5%), even if there are tasks running at higher sched classes.
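The 50% figure is the server runtime divided by its period. A minimal
sketch of that arithmetic follows, assuming a 1 s (1000000000 ns) period,
which is consistent with the 5% default and the 500000000 value above but
is not stated explicitly here. The function names are hypothetical helpers,
not existing tools:

```shell
# Convert a target bandwidth percentage into a deadline-server runtime in ns.
# ASSUMPTION: the server period is 1 s unless a second argument says otherwise.
runtime_for_percent() {
    percent=$1
    period_ns=${2:-1000000000}
    echo $(( period_ns * percent / 100 ))
}

# Write that runtime to every CPU's ext_server knob (needs root and debugfs).
# Files that are missing or not writable are skipped silently.
set_ext_server_percent() {
    runtime_ns=$(runtime_for_percent "$1")
    for f in /sys/kernel/debug/sched/ext_server/cpu*/runtime; do
        if [ -w "$f" ]; then
            echo "$runtime_ns" > "$f"
        fi
    done
}
```

Under that assumption, `set_ext_server_percent 50` writes the same value as
the tee command above.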

Would this approach work for your needs?

-Andrea

* Re: sched_ext: Partial mode priority and fallthrough to EEVDF
  2026-03-10 18:46   ` Andrea Righi
@ 2026-03-11 11:22     ` Matt Fleming
  0 siblings, 0 replies; 5+ messages in thread
From: Matt Fleming @ 2026-03-11 11:22 UTC
  To: Andrea Righi
  Cc: Tejun Heo, sched-ext, kernel-team, void, changwoo, peterz,
	linux-kernel

On Tue, Mar 10, 2026 at 07:46:00PM +0100, Andrea Righi wrote:
> 
> I think you can model your scenario using the ext deadline server. For
> instance, if you run:
> 
>  # echo 500000000 | tee /sys/kernel/debug/sched/ext_server/cpu*/runtime
> 
> This would give sched_ext tasks a guaranteed 50% bandwidth on all CPUs
> (default is 5%), even if there are tasks running at higher sched classes.
> 
> Would this approach work for your needs?

It looks like it would, yes. Thanks! I'll start experimenting and report back.

Are there any plans to backport this to 6.18 LTS?

* Re: sched_ext: Partial mode priority and fallthrough to EEVDF
  2026-03-10 18:27 ` Tejun Heo
  2026-03-10 18:46   ` Andrea Righi
@ 2026-03-11 11:10   ` Matt Fleming
  1 sibling, 0 replies; 5+ messages in thread
From: Matt Fleming @ 2026-03-11 11:10 UTC
  To: Tejun Heo
  Cc: sched-ext, kernel-team, arighi, void, changwoo, peterz,
	linux-kernel

On Tue, Mar 10, 2026 at 08:27:00AM -1000, Tejun Heo wrote:
> 
> Hmm... I have a bit of a hard time following how that's different from partial
> mode. If you want the scheduler to decide whether a task should be in SCX or
> fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
> you mean that you want to switch dynamically on each scheduling event, I
> don't think that's a good idea given that each hop would be a full sched_class
> switch.

Oh no, I don't want to switch dynamically at runtime. Doing the
classification once at BPF program load time is fine, but AFAIU
p->scx.disallow still gives us two scheduling classes (SCHED_EXT and
SCHED_NORMAL) where tasks in the fair class get chosen first.

> As for the ordering between the two, I don't know. How are you using partial
> mode? No matter how you order them, the behaviors on pathological cases are
> pretty bad and I've been thinking that most would use partial mode to
> partition the system so that some CPUs are managed by SCX and others by fair
> in which case the ordering doesn't matter that much. If you're mixing the
> two classes on the same CPUs, I wonder whether this is something which can
> be better dealt with by the deadline servers. Andrea, what do you think?

I want to use SCHED_EXT to schedule the most latency-critical tasks
because a custom BPF scheduler allows me to make better CPU placement
and preemption decisions. Doing it with partial mode allows me to
progressively switch services over to SCHED_EXT without needing to take
on a mass migration for 100+ services in one go (something I'm trying
my hardest to avoid :) ).

To clarify my "fallthrough to EEVDF" comment: if I could run in
full-mode, use disallow to keep most tasks EEVDF, and have SCHED_EXT
tasks scheduled with higher priority than SCHED_NORMAL then this would
tick all the boxes.

I have experimented with isolating CPUs where all tasks running are
SCHED_EXT while other CPUs run the SCHED_NORMAL workloads, so that's a
possibility. But not all our servers are configured that way and given
that we run heterogeneous workloads on single machines, it's a steep
price to pay capacity-wise if we can't fully utilise those isolated
CPUs at all times.

And to limit the pathological case in my experiments so far I'm using
cpu.max to cap CPU bandwidth (thanks to scx_lavd's bandwidth support).
All our services are systemd services, so we can set limits to guard
against complete meltdowns.
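
For reference, a cgroup v2 cpu.max entry is a "<quota_us> <period_us>"
pair. Here is a minimal sketch of how a CPU percentage maps to it,
assuming the default 100000 us period. The function name is a hypothetical
helper, not an existing tool:

```shell
# Print the cgroup v2 cpu.max line for a CPU percentage (100 = one full CPU).
# ASSUMPTION: period_us defaults to 100000 (100 ms) unless given as $2.
cpu_max_for_percent() {
    percent=$1
    period_us=${2:-100000}
    echo "$(( period_us * percent / 100 )) $period_us"
}
```

With systemd the equivalent limit is normally set through the unit's
CPUQuota= property (for example "systemctl set-property <unit>
CPUQuota=200%") rather than by writing cpu.max directly.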

Thanks for the tip on the DL server. This looks promising and might
solve my problem nicely. I'll reply in more detail to Andrea's post.
