* [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: David Vernet @ 2024-01-26 21:59 UTC (permalink / raw)
To: lsf-pc
Cc: bpf, joel, htejun, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

Hello,

A few more use cases have emerged for sched_ext that are not yet supported
that I wanted to discuss in the BPF track. Specifically:

- EAS: Energy Aware Scheduling

  While firmware ultimately controls the frequency of a core, the kernel does
  provide frequency scaling knobs such as EPP. It could be useful for BPF
  schedulers to have control over these knobs to e.g. hint that certain cores
  should keep a lower frequency and operate as E cores. This could have
  applications in battery-aware devices, or in other contexts where
  applications have e.g. latency-sensitive, compute-intensive workloads.

- Componentized schedulers

  Scheduler implementations today largely have to reinvent the wheel. For
  example, if you want to implement a load balancer in Rust, you need to add
  the necessary fields to the BPF program for tracking load / duty cycle, and
  then parse and consume them from the Rust side. That's pretty suboptimal,
  as the actual load balancing algorithm itself is essentially the same in
  each case. The challenge here is that the feature requires both BPF and
  user space components to work together. It's not enough to ship a Rust
  crate -- you also need to ship a BPF object file that your program can link
  against. And what should the API look like on both ends? Should Rust / BPF
  have to call into functions to get load balancing? Or should it be
  automatically packaged and implemented?

  There are a lot of ways that we can approach this, and it probably warrants
  discussing in some more detail.
If anybody else has ideas on things they'd like to discuss, whether
sched_ext features that are missing or scheduling ideas that we could try to
implement but just haven't yet, please feel free to share.

Thanks,
David

^ permalink raw reply [flat|nested] 9+ messages in thread
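To make the shape of such a reusable component concrete, here is a rough sketch of a load-balancing core that could be factored out and shared between schedulers. Everything here is invented for illustration (the function name, the load units, the greedy strategy); it is not an existing scx API, and a real version would be split across a BPF object file and a user-space crate:

```python
# Hypothetical reusable load balancer. All names and the strategy below
# are invented for illustration; nothing here is an existing scx API.

def balance(domain_loads, tolerance=0.1):
    """Suggest migrations that bring per-domain load within `tolerance`
    of the mean. Returns a list of (src_domain, dst_domain, amount)."""
    avg = sum(domain_loads.values()) / len(domain_loads)
    # Domains above/below the tolerance band around the mean.
    overloaded = {d: l for d, l in domain_loads.items() if l > avg * (1 + tolerance)}
    underloaded = {d: l for d, l in domain_loads.items() if l < avg * (1 - tolerance)}
    migrations = []
    for src, src_load in sorted(overloaded.items(), key=lambda kv: -kv[1]):
        for dst in sorted(underloaded, key=lambda d: underloaded[d]):
            # Move just enough load to pull both domains toward the mean.
            amount = min(src_load - avg, avg - underloaded[dst])
            if amount <= 0:
                continue
            migrations.append((src, dst, amount))
            src_load -= amount
            underloaded[dst] += amount
    return migrations
```

A scheduler-specific front end would feed this with whatever load metric it tracks, and translate the returned migrations into writes to its shared BPF maps.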
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Joel Fernandes @ 2024-01-29 22:41 UTC (permalink / raw)
To: David Vernet, lsf-pc
Cc: bpf, htejun, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

On 1/26/2024 4:59 PM, David Vernet wrote:
> Hello,
>
> A few more use cases have emerged for sched_ext that are not yet
> supported that I wanted to discuss in the BPF track. Specifically:
>
> - EAS: Energy Aware Scheduling
>
> While firmware ultimately controls the frequency of a core, the kernel
> does provide frequency scaling knobs such as EPP. It could be useful for
> BPF schedulers to have control over these knobs to e.g. hint that
> certain cores should keep a lower frequency and operate as E cores.
> This could have applications in battery-aware devices, or in other
> contexts where applications have e.g. latency-sensitive
> compute-intensive workloads.

This is a great topic. I think integrating/merging such a mechanism with the
NEST scheduler could be useful too? You mentioned there is a sched_ext
implementation of NEST already? One reason that's interesting to me is that
task-packing and less-spreading may have power benefits; this is exactly
what EAS on ARM does, but it also uses an energy model to know when packing
is a bad idea. Since we don't have fine-grained control of frequency on
Intel, I wonder what else we can do to know when the scheduler should pack
and when to spread. Maybe something simple which does not need an energy
model but packs based on some other signal/heuristic would be great in the
short term.
Maybe a signal can be the "quality of service" (QoS) approach, where tasks
with lower QoS are packed more aggressively and tasks with higher QoS are
spread more (?).

> - Componentized schedulers
>
> Scheduler implementations today largely have to reinvent the wheel. For
> example, if you want to implement a load balancer in rust, you need to
> add the necessary fields to the BPF program for tracking load / duty
> cycle, and then parse and consume them from the rust side. That's pretty
> suboptimal though, as the actual load balancing algorithm itself is
> essentially the exact same. The challenge here is that the feature
> requires both BPF and user space components to work together. It's not
> enough to ship a rust crate -- you need to also ship a BPF object file

Maybe I am confused, but why does Rust userspace code need to link to BPF
objects? The BPF object is loaded into the kernel, right?

> that your program can link against. And what should the API look like on
> both ends? Should rust / BPF have to call into functions to get load
> balancing? Or should it be automatically packaged and implemented?
>
> There are a lot of ways that we can approach this, and it probably
> warrants discussing in some more detail

But I get the gist of the issue; it would be interesting to discuss.

thanks,

- Joel
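A toy model of that QoS signal might look like the following. To be clear, this is a sketch added for illustration, not code from any scx scheduler; the 0.5 QoS cutoff and the 0.8 saturation threshold are arbitrary values chosen for the example:

```python
def pick_cpu(cpu_util, qos, pack_threshold=0.8):
    """Toy QoS-biased CPU selection; qos is in [0.0, 1.0].

    Low-QoS tasks are packed onto the most-utilized core that still has
    headroom; high-QoS tasks are spread onto the least-utilized core.
    The 0.5 cutoff and the threshold are invented for illustration."""
    if qos < 0.5:
        # Pack: busiest core still below the saturation threshold, if any.
        candidates = [c for c, u in enumerate(cpu_util) if u < pack_threshold]
        if candidates:
            return max(candidates, key=lambda c: cpu_util[c])
    # Spread (also the fallback when every core is saturated).
    return min(range(len(cpu_util)), key=lambda c: cpu_util[c])
```

The interesting question a real implementation would have to answer is where the QoS value comes from (cgroup attribute, nice value, an explicit per-task hint), which is exactly the kind of signal being discussed here.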
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Joel Fernandes @ 2024-01-29 22:42 UTC (permalink / raw)
To: David Vernet, lsf-pc, Tejun Heo
Cc: bpf, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

Tejun's address bounced, so I am adding the correct one. Thanks.

On 1/29/2024 5:41 PM, Joel Fernandes wrote:
> [full verbatim quote of the previous message trimmed]
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: David Vernet @ 2024-01-30 0:15 UTC (permalink / raw)
To: Joel Fernandes
Cc: lsf-pc, Tejun Heo, bpf, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

On Mon, Jan 29, 2024 at 05:42:54PM -0500, Joel Fernandes wrote:
> Tejun's address bounced so I am adding the correct one. Thanks.

Ah, thanks, my mistake.

> [...]
> This is a great topic. I think integrating/merging such mechanism with the NEST
> scheduler could be useful too? You mentioned there is sched_ext implementation
> of NEST already? One reason that's interesting to me is the task-packing and

Correct -- it's called scx_nest [0].

[0]: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.bpf.c

> less-spreading may have power benefits, this is exactly what EAS on ARM does,
> but it also uses an energy model to know when packing is a bad idea.
> Since we
> don't have fine grained control of frequency on Intel, I wonder what else can we
> do to know when the scheduler should pack and when to spread. Maybe something
> simple which does not need an energy model but packs based on some other
> signal/heuristic would be great in the short term.

Makes sense. What kinds of signals were you thinking? We can have user space
query for whatever we'd need, and then communicate that to the kernel via
shared maps. Or, probably even more ideal, if we could get the information we
need from tracepoints or kprobes, then we could possibly avoid having to deal
with that and just keep everything in the kernel.

Note that we don't necessarily have to track just public APIs if we did all
of this in the kernel. If we can access a struct in a tracepoint or a kprobe,
we can read from it, and use that in the scheduler however we want. Of
course, none of this comes with any kind of ABI stability guarantees, but
that's one of the features of sched_ext: because the actual scheduler itself
is a _kernel_ program that runs in kernel space, we can experiment with and
implement things without tying anyone's hands to fully supporting it in the
kernel forever.

The user space portion communicates with the BPF scheduler over maps that
are UAPI (part of BPF UAPI), but the actual scheduler itself is just a
kernel program, and therefore is free to interact with the rest of the
system without making anything UAPI or adding ABI stability requirements.
The contents of what's passed over those maps are not UAPI, in the same
manner that the contents sent over the communication channels set up by KVM
per your other thread [1] would not be UAPI.

[1]: https://lore.kernel.org/all/653c2448-614e-48d6-af31-c5920d688f3e@joelfernandes.org/

> Maybe a signal can be the "Quality of service (QoS)" approach where tasks
> with lower QoS are packed more aggressively and higher QoS are spread more
> (?).
> >> - Componentized schedulers
> >> [...]
> >> It's not
> >> enough to ship a rust crate -- you need to also ship a BPF object file
>
> Maybe I am confused but why does rust userspace code need to link to BPF
> objects? The BPF object is loaded into the kernel right?

So there are a few pieces at play here:

1. You're correct that the BPF program is loaded into kernel space, but the
   actual BPF bytecode itself is linked statically into the application, and
   the application is what actually makes the syscalls (via libbpf) to load
   the BPF program into the kernel.

   Here's a high-level overview of the workflow for loading a scheduler:

   - Open the scheduler: this involves libbpf parsing the BPF object file
     passed by the application, and discovering its maps, progs, etc. which
     should be created. At this phase user space can still update any maps
     in the program, including e.g. read-only maps such as .rodata. This
     allows user space to do things like set the max # of CPUs on the
     system, set debug flags if they were requested by the user, etc.

   - Load the scheduler: libbpf creates BPF maps, does relocations for CO-RE
     [2], and verifies and loads the scheduler into the kernel. At this
     point, the program is loaded into the kernel, but the scheduler is not
     actively running yet. User space can no longer write read-only maps in
     the BPF program, but it can still read and write _writeable_ maps, and
     it can in fact do so indefinitely throughout the runtime of the
     scheduler.
     As described below, this is why we need both a user space portion and a
     BPF object file portion for such features.

   - Attach the scheduler: this actually calls into ext.c to update the
     currently running scheduler to use the BPF sched_ext scheduler.

   [2]: https://nakryiko.com/posts/bpf-core-reference-guide/

2. As alluded to above, the user space program that loaded the scheduler can
   interact with the scheduler in real time by reading and writing to its
   writeable maps. This allows user space to e.g. read some procfs values to
   determine utilization for each core in the system, do some load balancing
   math with floating point numbers based on that data and on task weight /
   duty cycle, and then notify the BPF scheduler that it should migrate
   tasks by writing to shared maps.

   This is exactly what we do in scx_rusty [3]. We track duty cycles and
   load in kernel space (soon we'll only track duty cycles and do all load
   scaling in user space), and then periodically we'll do a load balancing
   pass in the user-space portion of the scheduler where we read those
   values, use floats, and then signal to the kernel if and where it should
   migrate tasks by writing to maps. This is all done async from the
   perspective of the kernel, so the kernel will check the maps to see if
   there's an update on e.g. enqueue paths.

   [3]: https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rusty/src

So to summarize -- the Rust portion isn't running in the kernel, but it is
influencing the kernel scheduler's decisions by communicating with it via
these shared maps (and the kernel can similarly communicate with user space
in the opposite direction). That's the reason that it needs to have both the
user space portion and the kernel portion available to implement these
features. Neither makes sense without the other.

Note that not every scheduler we've implemented has a robust user space
portion, but every scheduler does have _some_ user space counterpart which
is responsible for loading it.
scx_nest.c [4], for example, doesn't really do anything in user space other
than periodically print out some data that's exported to it from the kernel
scheduler via a shared map. If we wanted to add user-space load balancing to
scx_nest, the same requirements would apply as for schedulers with a Rust
user-space component: we'd need both a user space portion and a kernel-space
portion.

[4]: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.c#L195

> >> that your program can link against. And what should the API look like on
> >> both ends? Should rust / BPF have to call into functions to get load
> >> balancing? Or should it be automatically packaged and implemented?
> >>
> >> There are a lot of ways that we can approach this, and it probably
> >> warrants discussing in some more detail
> >
> > But I get the gist of the issue, would be interesting to discuss.

Sounds great, thanks for reading this over.

- David
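The division of labor described above (the BPF side exports per-task duty cycles over a shared map; user space scales them into loads with floating point and writes migration decisions back) can be pictured with a small stand-in. This is not scx_rusty's actual code, just a simplified illustration with invented names:

```python
def scale_loads(duty_cycles, weights):
    """Scale per-task duty cycle (fraction of time spent runnable) by
    scheduler weight (100 = default), the floating-point math a
    user-space balancer can do but BPF cannot easily."""
    return {tid: duty_cycles[tid] * (weights[tid] / 100.0) for tid in duty_cycles}

def pick_migrations(task_domain, loads, n_domains):
    """Greedily move the heaviest task from the most-loaded domain to
    the least-loaded one. The returned {tid: new_domain} dict stands in
    for what would be written into a shared BPF map, for the kernel side
    to apply on e.g. the enqueue path."""
    dom_load = [0.0] * n_domains
    for tid, dom in task_domain.items():
        dom_load[dom] += loads[tid]
    src = max(range(n_domains), key=lambda d: dom_load[d])
    dst = min(range(n_domains), key=lambda d: dom_load[d])
    if src == dst:
        return {}
    movable = [t for t, d in task_domain.items() if d == src]
    victim = max(movable, key=lambda t: loads[t])
    return {victim: dst}
```

In the real split, `scale_loads` and `pick_migrations` would live in the user-space (Rust) half, and only the final dict would cross the map boundary into the kernel.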
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Tejun Heo @ 2024-01-30 1:50 UTC (permalink / raw)
To: Joel Fernandes
Cc: David Vernet, lsf-pc, bpf, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

Hello, Joel.

On Mon, Jan 29, 2024 at 05:42:54PM -0500, Joel Fernandes wrote:
> > [...]
> > Maybe a signal can be the "Quality of service (QoS)" approach where tasks
> > with lower QoS are packed more aggressively and higher QoS are spread
> > more (?).

This was done for a different purpose (improving tail latencies on a
latency-critical workload) but it uses soft-affinity based packing which
maybe can translate to power-aware scheduling:

https://github.com/sched-ext/scx/blob/case-studies/case-studies/scx_layered.md

I have a raptor lake-H laptop which has E and P cores, and by default the
threads are being spread across all CPUs, which probably isn't best for
power consumption.
I was thinking about writing a scheduler which uses a similar strategy to
scx_layered - pack the cores one by one, overflowing to the next core from E
to P when the average utilization crosses a set threshold. Most of the logic
is already in scx_layered, so maybe it can just be a part of that. I'm
curious whether and how much power can be saved with a generic approach like
that.

Thanks.

--
tejun
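The overflow-packing strategy described here reduces to a small utilization calculation. The following is only a sketch of the idea, independent of scx_layered's actual implementation; the 0.75 threshold and the core names are made up, and real utilization tracking is of course dynamic rather than a one-shot assignment:

```python
def pack_cores(task_utils, core_order, threshold=0.75):
    """Assign tasks (given as utilization fractions) to cores one by
    one, overflowing to the next core in `core_order` (e.g. E cores
    first, then P cores) once the current core's summed utilization
    would cross `threshold`. The last core absorbs any remainder.

    Returns {core: [task utilizations]}; purely illustrative."""
    assignment = {c: [] for c in core_order}
    idx = 0
    for util in task_utils:
        # Overflow to the next core when this one would cross the threshold.
        while idx < len(core_order) - 1 and \
                sum(assignment[core_order[idx]]) + util > threshold:
            idx += 1
        assignment[core_order[idx]].append(util)
    return assignment
```

Measuring how much power such a policy actually saves versus spreading is the open question raised above; the heuristic itself is cheap.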
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Joel Fernandes @ 2024-02-19 9:25 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, lsf-pc, bpf, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

On 1/29/2024 8:50 PM, Tejun Heo wrote:
> On Mon, Jan 29, 2024 at 05:42:54PM -0500, Joel Fernandes wrote:
>> [...]
>
> This was done for a different purpose (improving tail latencies on latency
> critical workload) but it uses soft-affinity based packing which maybe can
> translate to power-aware scheduling:
>
> https://github.com/sched-ext/scx/blob/case-studies/case-studies/scx_layered.md

Thanks! I am looking more into this (scx_layered) for the latency benefits
as well. David kindly gave me an introduction to it last week. It seems
quite similar to our approach of using RT (round-robin) for the higher tier
(that is, having a higher tier of tasks that is scheduled ahead of a lower,
fair-scheduled one).
There is the issue of starvation though (a higher tier/layer starves a lower
one), so we're incorporating the DL server to help with that:

https://lore.kernel.org/all/cover.1699095159.git.bristot@kernel.org/
https://lore.kernel.org/all/20240216183108.1564958-1-joel@joelfernandes.org/

Interesting note on the soft-affinity feature; yeah, that could help save
power and might be a better approach than, say, our usage of RT.

> I have a raptor lake-H laptop which has E and P cores and by default the
> threads are being spread across all CPUs which probably isn't best for power
> consumption. I was thinking about writing a scheduler which uses a similar
> strategy as scx_layered - pack the cores one by one overflowing to the next
> core from E to P when the average utilization crosses a set threshold. Most
> of the logic is already in scx_layered, so maybe it can just be a part of
> that. I'm curious whether and how much power can be saved with a generic
> approach like that.

Can the scx NEST scheduler be reused for this? AFAIR, it does similar task
packing, though that is more to keep cores idle than to pack tasks onto a
certain type of core, if I remember Julia's presentation correctly.

thanks,

- Joel
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Muhammad Usama Anjum @ 2024-02-19 8:48 UTC (permalink / raw)
To: David Vernet, lsf-pc
Cc: bpf, joel, htejun, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

On Fri, 2024-01-26 at 15:59 -0600, David Vernet wrote:
> Hello,
>
> A few more use cases have emerged for sched_ext that are not yet
> supported that I wanted to discuss in the BPF track. Specifically:
>
> - EAS: Energy Aware Scheduling
>
> While firmware ultimately controls the frequency of a core, the kernel
> does provide frequency scaling knobs such as EPP. It could be useful for
> BPF schedulers to have control over these knobs to e.g. hint that
> certain cores should keep a lower frequency and operate as E cores.
> This could have applications in battery-aware devices, or in other
> contexts where applications have e.g. latency-sensitive
> compute-intensive workloads.

The current scheduler must already be using the frequency scaling knobs. Can
sched_ext use those knobs directly, with hints from userspace, easily?

> - Componentized schedulers
>
> Scheduler implementations today largely have to reinvent the wheel. For
> example, if you want to implement a load balancer in rust, you need to
> add the necessary fields to the BPF program for tracking load / duty
> cycle, and then parse and consume them from the rust side. That's pretty
> suboptimal though, as the actual load balancing algorithm itself is
> essentially the exact same. The challenge here is that the feature
> requires both BPF and user space components to work together. It's not
> enough to ship a rust crate -- you need to also ship a BPF object file
> that your program can link against.
> And what should the API look like on
> both ends? Should rust / BPF have to call into functions to get load
> balancing? Or should it be automatically packaged and implemented?

This seems like a really nice idea. If we build a kind of library where the
different components of a scheduler are already available, researchers can
just focus on one component and improve it. This could bring long-term
benefits to schedulers based on sched_ext. This flexibility wasn't possible
for the scheduler before.

> There are a lot of ways that we can approach this, and it probably
> warrants discussing in some more detail.
>
> If anybody else has ideas on things they'd like to discuss; either
> sched_ext features that are missing, or scheduling ideas that we could
> try to implement but just haven't yet, please feel free to share.
>
> Thanks,
> David
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Joel Fernandes @ 2024-02-19 9:11 UTC (permalink / raw)
To: Muhammad Usama Anjum, David Vernet, lsf-pc
Cc: bpf, htejun, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

On 2/19/2024 3:48 AM, Muhammad Usama Anjum wrote:
> On Fri, 2024-01-26 at 15:59 -0600, David Vernet wrote:
>> [...]
> The current scheduler must already be using the frequency scaling
> knobs. Can sched_ext use those knobs directly with hint from userspace
> easily?

With regard to the current way of doing things, it depends. On Intel
platforms, if HWP (Hardware-Controlled Performance States) is enabled, which
it is on almost all Intel platforms I've seen, then the selection of the
individual performance states (P-states) is done by the hardware, not the
OS. My understanding is that the benefit of HWP is the responsiveness of the
state selection. So the only things the OS can control are Turbo boost and
EPP. Unfortunately, this hinders using an energy model and doing energy
calculations (e.g.,
if I place a task on this core instead of that one, then the total system
power is such and such, because the P-state on this core is this) the way
EAS on ARM does. But maybe we can do something simple with what is available
and reap some benefits.

On ARM platforms, there is finer-grained OS control of the different
operating performance points (what they call OPPs).

Thanks.
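For context, the kind of calculation an EAS-style energy model enables can be caricatured in a few lines: estimate each performance domain's energy from its summed utilization and the power cost of the lowest operating point able to serve it, then place the task where the energy increase is smallest. This is a heavy simplification of what the kernel's EAS actually computes, with invented numbers:

```python
def domain_energy(util_sum, opps):
    """Energy estimate for one performance domain. `opps` is a list of
    (capacity_at_opp, power_at_opp) pairs sorted by capacity. Pick the
    lowest OPP able to serve the summed utilization, then scale its
    power by how busy the domain would be at that OPP."""
    for cap, power in opps:
        if cap >= util_sum:
            return power * util_sum / cap
    cap, power = opps[-1]  # saturated: stay at the highest OPP
    return power * util_sum / cap

def cheapest_cpu(task_util, domains):
    """`domains`: {cpu: (current_util, opps)}, one CPU per domain for
    simplicity. Return the CPU whose domain energy increases least if
    the task is placed there -- the core of an EAS-style decision."""
    def delta(cpu):
        util, opps = domains[cpu]
        return domain_energy(util + task_util, opps) - domain_energy(util, opps)
    return min(domains, key=delta)
```

On platforms where HWP hides the P-state choice, the `opps` table above is exactly the information the OS lacks, which is why only simpler heuristics are on the table there.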
* Re: [LSF/MM/BPF TOPIC] Discuss more features + use cases for sched_ext
From: Joel Fernandes @ 2024-02-19 9:14 UTC (permalink / raw)
To: Muhammad Usama Anjum, David Vernet, lsf-pc, Tejun Heo
Cc: bpf, schatzberg.dan, andrea.righi, davemarchevsky, changwoo, julia.lawall, himadrispandya

Fixing with Tejun's correct email address again. ;-)

On 2/19/2024 4:11 AM, Joel Fernandes wrote:
> [full verbatim quote of the previous message trimmed]