* [PATCH 0/2] sched/fair: expose cpu.max.runtime for credit injection
From: Fernand Sieber @ 2026-05-25 19:36 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
    Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel,
    nh-open-source, Fahad Mubeen, Hendrik Borghorst, David Woodhouse,
    Fernand Sieber

This series adds a cpu.max.runtime cgroup v2 interface that allows
userspace to set the CFS bandwidth controller's runtime directly,
enabling CPU credit injection. An orchestrator writes a runtime budget
which the cgroup consumes naturally through the existing bandwidth
enforcement mechanism.

Each period, the task consumes runtime and the refill restores only
quota (capped at quota + burst), so the injected credits drain until
runtime falls below the cap, after which the cgroup returns to its
steady-state quota allocation.

The series also relaxes the burst validation: burst is no longer
required to be <= quota, only that burst + quota does not overflow.
This allows configuring burst > quota so that the runtime cap can reach
up to one full period, enabling 100% utilization while credits last.

A selftest (test_cpucg_max_runtime) validates the credit injection
mechanism by configuring a cgroup with minimal quota but large burst,
injecting credits via cpu.max.runtime, and verifying that the resulting
CPU usage matches the injected budget.

Patch 1 adds the core interface and selftest.
Patch 2 adds sched_ext integration: an ops callback for BPF scheduler
notification when runtime credits are injected.

Fernand Sieber (2):
  sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
  sched/ext: add cgroup_set_runtime ops callback

 include/linux/sched/ext.h                 |  1 +
 kernel/sched/core.c                       | 46 ++++++++++++++++-
 kernel/sched/ext.c                        | 17 +++++++
 kernel/sched/ext.h                        |  2 +
 kernel/sched/ext_internal.h               | 12 +++++
 tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
 6 files changed, 138 insertions(+), 2 deletions(-)

-- 
2.47.3

Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
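
The drain behaviour described in the cover letter can be sketched with a small userspace model. This is not kernel code: it assumes a single CPU-bound task that can use at most one full period of CPU per period, a runtime that starts at one quota, and no slice-level accounting. The function name simulate_hog and its parameters are hypothetical; only the refill rule (add quota, clamp at quota + burst) and the set-not-add write come from the series description.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Simplified model of one CPU-bound task in a bandwidth-limited cgroup.
 * All values are in microseconds. Returns the total CPU time consumed.
 *
 * Assumptions (not taken from the patch): the task uses at most one
 * period of CPU per period, runtime starts at one quota, and
 * slice-level accounting is ignored.
 */
static uint64_t simulate_hog(uint64_t quota, uint64_t burst, uint64_t period,
			     uint64_t injected, unsigned int inject_at,
			     unsigned int nr_periods)
{
	uint64_t cap = quota + burst;
	uint64_t runtime = quota;
	uint64_t used = 0;

	assert(period > 0 && injected <= cap);

	for (unsigned int p = 0; p < nr_periods; p++) {
		uint64_t consumed;

		/* Models a write to cpu.max.runtime: set, not add. */
		if (p == inject_at)
			runtime = injected;

		consumed = min_u64(runtime, period);
		runtime -= consumed;
		used += consumed;

		/* Period refill: add quota, clamp at quota + burst. */
		runtime = min_u64(runtime + quota, cap);
	}

	return used;
}
```

With the selftest's parameters (quota 1000, burst 5000000, period 100000, 2500000 injected at the second period, 50 periods), this model returns 2549000 us, which is inside the 10% tolerance the selftest allows around 2.5 s. Without any injection, the same configuration yields only 50000 us.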

* [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
From: Fernand Sieber @ 2026-05-25 19:36 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
    Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel,
    nh-open-source, Fahad Mubeen, Hendrik Borghorst, David Woodhouse,
    Fernand Sieber

Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
the CFS bandwidth controller's runtime directly. This enables CPU credit
injection: an orchestrator writes a runtime budget which the cgroup
consumes naturally through the existing bandwidth enforcement mechanism.

The write sets cfs_b->runtime directly. Each period, the task consumes
runtime and the refill restores only quota (capped at quota + burst), so
the injected credits drain until runtime falls below the cap, after which
the cgroup returns to its steady-state quota allocation.

Writes are rejected if the value exceeds quota + burst (the per-period
runtime cap) or exceeds the maximum bandwidth limit.

Also relax the burst validation: remove the burst <= quota constraint,
requiring only that burst + quota does not overflow. This allows
configuring burst > quota so that the runtime cap (quota + burst) can
reach up to one full period, enabling 100% utilization while credits last.

The interface uses microseconds, consistent with cpu.max quota/period.

Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
 kernel/sched/core.c                       | 44 +++++++++++++++-
 tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d..d92e5840b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
 	if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
 		return -EINVAL;
 
-	if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
-					burst_us + quota_us > max_bw_runtime_us))
+	if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
 		return -EINVAL;
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
 	tg_bandwidth(tg, &period_us, &quota_us, NULL);
 	return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
 }
+
+static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cftype, u64 runtime_us)
+{
+	struct task_group *tg = css_tg(css);
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+	if (runtime_us > max_bw_runtime_us)
+		return -EINVAL;
+
+	raw_spin_lock_irq(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
+		raw_spin_unlock_irq(&cfs_b->lock);
+		return -EINVAL;
+	}
+	cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	return 0;
+}
+
+static u64 cpu_runtime_read_u64(struct cgroup_subsys_state *css,
+				struct cftype *cftype)
+{
+	struct task_group *tg = css_tg(css);
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	u64 runtime_ns;
+
+	raw_spin_lock_irq(&cfs_b->lock);
+	runtime_ns = cfs_b->runtime;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	return runtime_ns / NSEC_PER_USEC;
+}
 #endif /* CONFIG_GROUP_SCHED_BANDWIDTH */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -10498,6 +10532,12 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_burst_read_u64,
 		.write_u64 = cpu_burst_write_u64,
 	},
+	{
+		.name = "max.runtime",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_runtime_read_u64,
+		.write_u64 = cpu_runtime_write_u64,
+	},
 #endif /* CONFIG_CFS_BANDWIDTH */
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index c83f05438..df151702b 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -776,6 +776,67 @@ static int test_cpucg_max_nested(const char *root)
 	return ret;
 }
 
+static int test_cpucg_max_runtime(const char *root)
+{
+	int ret = KSFT_FAIL;
+	long quota_usec = 1000;		/* 1ms (minimum) */
+	long period_usec = 100000;	/* 100ms */
+	long burst_usec = 5000000;	/* 5s, so cap = 5001ms */
+	long runtime_usec = 2500000;	/* 2500ms = half of 5s run */
+	long duration_sec = 5;
+	long expected_usec = duration_sec * USEC_PER_SEC / 2; /* 50% */
+	long usage_usec;
+	char *cpucg;
+	char buf[64];
+	int pid;
+
+	cpucg = cg_name(root, "cpucg_runtime_test");
+	if (!cpucg)
+		goto cleanup;
+
+	if (cg_create(cpucg))
+		goto cleanup;
+
+	snprintf(buf, sizeof(buf), "%ld %ld", quota_usec, period_usec);
+	if (cg_write(cpucg, "cpu.max", buf))
+		goto cleanup;
+	if (cg_write_numeric(cpucg, "cpu.max.burst", burst_usec))
+		goto cleanup;
+
+	/* Start burner, let it settle, then inject credits */
+	struct cpu_hog_func_param param = {
+		.nprocs = 1,
+		.ts = { .tv_sec = duration_sec, .tv_nsec = 0 },
+		.clock_type = CPU_HOG_CLOCK_WALL,
+	};
+	pid = cg_run_nowait(cpucg, hog_cpus_timed, (void *)&param);
+	if (pid < 0)
+		goto cleanup;
+
+	usleep(100000);
+	if (cg_write_numeric(cpucg, "cpu.max.runtime", runtime_usec)) {
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		goto cleanup;
+	}
+
+	waitpid(pid, NULL, 0);
+
+	usage_usec = cg_read_key_long(cpucg, "cpu.stat", "usage_usec");
+	if (usage_usec <= 0)
+		goto cleanup;
+
+	if (!values_close_report(usage_usec, expected_usec, 10))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	cg_destroy(cpucg);
+	free(cpucg);
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct cpucg_test {
 	int (*fn)(const char *root);
@@ -790,6 +851,7 @@ struct cpucg_test {
 	T(test_cpucg_nested_weight_underprovisioned),
 	T(test_cpucg_max),
 	T(test_cpucg_max_nested),
+	T(test_cpucg_max_runtime),
 };
 #undef T
 
-- 
2.47.3
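
For readers who want to see the write sequence from the orchestrator's side, here is a minimal userspace sketch that mirrors what the selftest in patch 1 does: set quota and period through cpu.max, set the cap through cpu.max.burst, then set the credit level through cpu.max.runtime. The helper names cg_write_str and inject_credits are hypothetical. cpu.max.runtime is the file proposed by this series, so on a kernel without the patch the last write is expected to fail because the file does not exist.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write a string to <cgroup_dir>/<file>. Returns 0 on success, -1 on error. */
static int cg_write_str(const char *cgroup_dir, const char *file,
			const char *val)
{
	char path[4096];
	size_t len;
	ssize_t written;
	int fd, n;

	assert(cgroup_dir && file && val);

	n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	/* No O_CREAT: control files must already exist in the cgroup. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = strlen(val);
	written = write(fd, val, len);
	close(fd);

	return written == (ssize_t)len ? 0 : -1;
}

/*
 * Configure the accumulation rate (quota per period) and the cap
 * (quota + burst), then set the credit level. All values are in
 * microseconds, as in cpu.max.
 */
static int inject_credits(const char *cgroup_dir, uint64_t quota_us,
			  uint64_t period_us, uint64_t burst_us,
			  uint64_t runtime_us)
{
	char buf[64];

	snprintf(buf, sizeof(buf), "%" PRIu64 " %" PRIu64, quota_us, period_us);
	if (cg_write_str(cgroup_dir, "cpu.max", buf))
		return -1;

	snprintf(buf, sizeof(buf), "%" PRIu64, burst_us);
	if (cg_write_str(cgroup_dir, "cpu.max.burst", buf))
		return -1;

	/* Proposed file: only present with this series applied. */
	snprintf(buf, sizeof(buf), "%" PRIu64, runtime_us);
	return cg_write_str(cgroup_dir, "cpu.max.runtime", buf);
}
```

The order matters under the validation in the patch: the runtime write is rejected when it exceeds quota + burst, so the burst has to be raised before the credits are written.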

* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
From: Benjamin Segall @ 2026-05-26 20:52 UTC
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Tejun Heo,
    David Vernet, Andrea Righi, Changwoo Min, Dietmar Eggemann,
    Mel Gorman, linux-kernel, nh-open-source, Fahad Mubeen,
    Hendrik Borghorst, David Woodhouse

Fernand Sieber <sieberf@amazon.com> writes:

> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.
>
> The write sets cfs_b->runtime directly. Each period, the task consumes
> runtime and the refill restores only quota (capped at quota + burst), so
> the injected credits drain until runtime falls below the cap, after which
> the cgroup returns to its steady-state quota allocation.
>
> Writes are rejected if the value exceeds quota + burst (the per-period
> runtime cap) or exceeds the maximum bandwidth limit.
>
> Also relax the burst validation: remove the burst <= quota constraint,
> requiring only that burst + quota does not overflow. This allows
> configuring burst > quota so that the runtime cap (quota + burst) can
> reach up to one full period, enabling 100% utilization while credits last.
>
> The interface uses microseconds, consistent with cpu.max quota/period.

I don't necessarily object to supporting this design of userspace
program/bpf for dynamic quota decisions that gets to make use of the
inline cfs bandwidth touch points for the performance sensitive runtime
consumption bits, given how minimal it is.

However the existing APIs give something very close to this - any write
to max/max.burst will also add the new quota to the runtime, and reading
max.runtime (beyond using it to construct a += on runtime) can be done
with cpuacct. Is the overhead of tg_set_cfs_bandwidth (which admittedly
isn't really designed to be fast) too much, or is setting max.runtime
rather than adding to it important, or something else?

>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>
> ---
>  kernel/sched/core.c                       | 44 +++++++++++++++-
>  tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
>  2 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d..d92e5840b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
>  	if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
>  		return -EINVAL;
>
> -	if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
> -					burst_us + quota_us > max_bw_runtime_us))
> +	if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
>  		return -EINVAL;

I'm fine with this in general, but we should keep a check for
burst_us > max_bw_runtime_us as well, to avoid burst_us + quota_us being
able to overflow and avoid the second check.

>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
>  	tg_bandwidth(tg, &period_us, &quota_us, NULL);
>  	return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
>  }
> +
> +static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cftype, u64 runtime_us)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +
> +	if (runtime_us > max_bw_runtime_us)
> +		return -EINVAL;
> +
> +	raw_spin_lock_irq(&cfs_b->lock);
> +	if (cfs_b->quota != RUNTIME_INF &&
> +	    (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
> +		raw_spin_unlock_irq(&cfs_b->lock);
> +		return -EINVAL;
> +	}
> +	cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
> +	raw_spin_unlock_irq(&cfs_b->lock);
> +
> +	return 0;
> +}

The details of this feel very odd on two fronts:

First, while setting runtime rather than adding to it gives more power
to the controlling userspace, it also forces it to be racy if it wants
to add runtime. But the original design of cfs bandwidth didn't have
burst anyways, and it's not a disaster if it does race, even if the
orchestrator thread manages to get preempted or such. So I don't exactly
object to this design, but I do want to check in on the idea.

More importantly, I think it should definitely call
distribute_cfs_runtime (or an equivalent), to immediately let throttled
tasks start running again. As it is, that will be delayed until the
period timer runs, which is entirely desynchronized from userspace, even
if userspace uses the same period for its timers, along with
inconsistencies with any newly waking cpus which will run immediately.

* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
From: Fernand Sieber @ 2026-05-28 7:25 UTC
To: bsegall
Cc: arighi, changwoo, dietmar.eggemann, dwmw, fmubeen, hborghor,
    juri.lelli, linux-kernel, mgorman, mingo, nh-open-source, peterz,
    sieberf, tj, vincent.guittot, void

Hi Ben,

On Tue, May 26, 2026 at 01:52:56PM -0700, Benjamin Segall wrote:
> I don't necessarily object to supporting this design of userspace
> program/bpf for dynamic quota decisions that gets to make use of the
> inline cfs bandwidth touch points for the performance sensitive
> runtime consumption bits, given how minimal it is.
>
> However the existing APIs give something very close to this - any
> write to max/max.burst will also add the new quota to the runtime,
> and reading max.runtime (beyond using it to construct a += on
> runtime) can be done with cpuacct. Is the overhead of
> tg_set_cfs_bandwidth (which admittedly isn't really designed to be
> fast) too much, or is setting max.runtime rather than adding to it
> important, or something else?

I've detailed our CPU credits for VMs use case in my reply to Tejun:
https://lore.kernel.org/all/20260528065428.69225-1-sieberf@amazon.com/

We need both primitives: one to control the credit accumulation rate
(quota) and one to control the level of credits (runtime). Controlling
the level of credits is somewhat rare, as it corresponds to specific
events in the lifecycle of the VM.

If I understand correctly what you are saying, we can already
approximate that by temporarily setting quota to the delta runtime we
need to adjust, and then setting it back to the normal accumulation
rate. While possible, this seems quite awkward and blunt to me.
Moreover, operations that might need a negative delta (e.g. credit
transfer) would be even more awkward to implement (I suppose we would
need to temporarily reduce the burst limit to force a hit on the
runtime cap and then set it back).

> I'm fine with this in general, but we should keep a check for
> burst_us > max_bw_runtime_us as well, to avoid burst_us + quota_us
> being able to overflow and avoid the second check.

Noted. Will address in the next revision.

> The details of this feel very odd on two fronts:
>
> First, while setting runtime rather than adding to it gives more
> power to the controlling userspace, it also forces it to be racy
> if it wants to add runtime. But the original design of cfs
> bandwidth didn't have burst anyways, and it's not a disaster if it
> does race, even if the orchestrator thread manages to get preempted
> or such. So I don't exactly object to this design, but I do want
> to check in on the idea.

It was also my reasoning that races were non-critical here, so I opted
for an API that was consistent with the other interfaces. However, we
could also replace/complement it with a delta API if we think it's more
useful. I chose to keep the API simple for now but I don't mind
changing it.

> More importantly, I think it should definitely call
> distribute_cfs_runtime (or an equivalent), to immediately let
> throttled tasks start running again. As it is, that will be delayed
> until the period timer runs, which is entirely desynchronized from
> userspace, even if userspace uses the same period for its timers,
> along with inconsistencies with any newly waking cpus which will
> run immediately.

Fair point. I will update that in the next revision.

Thanks.
Fernand
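
The overflow concern acknowledged above can be illustrated with a short standalone sketch. Once the burst_us <= quota_us constraint is gone, burst_us is an arbitrary 64-bit value, and unsigned addition wraps modulo 2^64, so a sum-only check can be passed by a huge burst. The constant DEMO_MAX_BW_RUNTIME_US is a stand-in chosen for illustration, not the kernel's actual max_bw_runtime_us, and both function names are hypothetical. The RUNTIME_INF special case from the real code is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's max_bw_runtime_us; the value is an assumption. */
#define DEMO_MAX_BW_RUNTIME_US ((UINT64_C(1) << 44) - 1)

/* Sum-only check, as in the posted hunk: the addition can wrap around. */
static bool bandwidth_ok_sum_only(uint64_t quota_us, uint64_t burst_us)
{
	if (quota_us > DEMO_MAX_BW_RUNTIME_US)
		return false;

	return burst_us + quota_us <= DEMO_MAX_BW_RUNTIME_US;
}

/* With the extra bound on burst_us suggested in review. */
static bool bandwidth_ok_bounded(uint64_t quota_us, uint64_t burst_us)
{
	if (quota_us > DEMO_MAX_BW_RUNTIME_US ||
	    burst_us > DEMO_MAX_BW_RUNTIME_US)
		return false;

	/* Both operands are below 2^44, so the sum cannot wrap. */
	assert(burst_us + quota_us >= burst_us);

	return burst_us + quota_us <= DEMO_MAX_BW_RUNTIME_US;
}
```

In this sketch a quota of 1000 with a burst of UINT64_MAX wraps to a sum of 999, which the sum-only check accepts and the bounded check rejects.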

* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
From: Tejun Heo @ 2026-05-27 19:04 UTC
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    David Vernet, Andrea Righi, Changwoo Min, Dietmar Eggemann,
    Ben Segall, Mel Gorman, linux-kernel, nh-open-source, Fahad Mubeen,
    Hendrik Borghorst, David Woodhouse

On Mon, May 25, 2026 at 09:36:21PM +0200, Fernand Sieber wrote:
> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.

Can you detail the use case? What problem is it solving how?

Thanks.

-- 
tejun

* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
From: Fernand Sieber @ 2026-05-28 6:54 UTC
To: tj
Cc: arighi, bsegall, changwoo, dietmar.eggemann, dwmw, fmubeen,
    hborghor, juri.lelli, linux-kernel, mgorman, mingo, nh-open-source,
    peterz, sieberf, vincent.guittot, void

Hi Tejun,

On Wed, May 27, 2026 at 09:04:37AM -1000, Tejun Heo wrote:
> On Mon, May 25, 2026 at 09:36:21PM +0200, Fernand Sieber wrote:
> > Add a cpu.max.runtime cgroup v2 interface that allows userspace to
> > set the CFS bandwidth controller's runtime directly. This enables
> > CPU credit injection: an orchestrator writes a runtime budget which
> > the cgroup consumes naturally through the existing bandwidth
> > enforcement mechanism.
>
> Can you detail the use case? What problem is it solving how?

Our use case is managing CPU credits for VMs.

Product spec defines credits accumulation rate (quota), credits
limit (burst), and initial level of credits at launch (runtime).

Controlling runtime is also necessary for preserving credits across
live update (kexec) and live migration.

It is possible to approximate this behavior with existing kernel
primitives. However this requires setting up awkward parallel
accounting/control logic from userspace which must be periodically
synced up with the kernel. Instead, we propose minimal changes to
the cpu bw primitives to facilitate this use case.

Thanks.
Fernand

* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
From: Tejun Heo @ 2026-05-28 14:37 UTC
To: Fernand Sieber
Cc: arighi, bsegall, changwoo, dietmar.eggemann, dwmw, fmubeen,
    hborghor, juri.lelli, linux-kernel, mgorman, mingo, nh-open-source,
    peterz, vincent.guittot, void

Hello,

On Thu, May 28, 2026 at 08:54:28AM +0200, Fernand Sieber wrote:
> Hi Tejun,
>
> On Wed, May 27, 2026 at 09:04:37AM -1000, Tejun Heo wrote:
> > On Mon, May 25, 2026 at 09:36:21PM +0200, Fernand Sieber wrote:
> > > Add a cpu.max.runtime cgroup v2 interface that allows userspace to
> > > set the CFS bandwidth controller's runtime directly. This enables
> > > CPU credit injection: an orchestrator writes a runtime budget which
> > > the cgroup consumes naturally through the existing bandwidth
> > > enforcement mechanism.
> >
> > Can you detail the use case? What problem is it solving how?
>
> Our use case is managing CPU credits for VMs.
>
> Product spec defines credits accumulation rate (quota), credits
> limit (burst), and initial level of credits at launch (runtime).
>
> Controlling runtime is also necessary for preserving credits across
> live update (kexec) and live migration.
>
> It is possible to approximate this behavior with existing kernel
> primitives. However this requires setting up awkward parallel
> accounting/control logic from userspace which must be periodically
> synced up with the kernel. Instead, we propose minimal changes to
> the cpu bw primitives to facilitate this use case.

Can you please go into more details, preferably a lot more?

From cgroup POV, there is a precedent for this sort of direct-ish low
level control in memory.reclaim; however, things like this can create a
lot of implementation detail exposure, so I think the bar for use case
justification should be pretty high - the use cases should make general
sense and there should be no other reasonable way to achieve the same
without adding the proposed interface. I don't think the above
description achieves that.

Thanks.

-- 
tejun

* [PATCH 2/2] sched/ext: add cgroup_set_runtime ops callback
From: Fernand Sieber @ 2026-05-25 19:36 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
    Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel,
    nh-open-source, Fahad Mubeen, Hendrik Borghorst, David Woodhouse,
    Fernand Sieber

Add a sched_ext_ops callback that is invoked when userspace writes to
cpu.max.runtime. This allows BPF schedulers to be notified when runtime
credits are injected into a cgroup, enabling SCX-side credit tracking.

The callback includes change detection (only fires when the value
changes) and caches the value in tg->scx.bw_runtime_us.

Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
 include/linux/sched/ext.h   |  1 +
 kernel/sched/core.c         |  2 ++
 kernel/sched/ext.c          | 17 +++++++++++++++++
 kernel/sched/ext.h          |  2 ++
 kernel/sched/ext_internal.h | 12 ++++++++++++
 5 files changed, 34 insertions(+)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 2129e18ad..591801a50 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -273,6 +273,7 @@ struct scx_task_group {
 	u64			bw_period_us;
 	u64			bw_quota_us;
 	u64			bw_burst_us;
+	u64			bw_runtime_us;
 	bool			idle;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d92e5840b..369dd03d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10165,6 +10165,8 @@ static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
 	cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
 	raw_spin_unlock_irq(&cfs_b->lock);
+
+	scx_group_set_runtime(tg, runtime_us);
 
 	return 0;
 }
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 827a96e39..2ce505ad8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4488,6 +4488,23 @@ void scx_group_set_bandwidth(struct task_group *tg,
 
 	percpu_up_read(&scx_cgroup_ops_rwsem);
 }
+
+void scx_group_set_runtime(struct task_group *tg, u64 runtime_us)
+{
+	struct scx_sched *sch;
+
+	percpu_down_read(&scx_cgroup_ops_rwsem);
+	sch = scx_root;
+
+	if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_runtime) &&
+	    tg->scx.bw_runtime_us != runtime_us)
+		SCX_CALL_OP(sch, cgroup_set_runtime, NULL,
+			    tg_cgrp(tg), runtime_us);
+
+	tg->scx.bw_runtime_us = runtime_us;
+
+	percpu_up_read(&scx_cgroup_ops_rwsem);
+}
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 
 #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 0b7fc46ae..00103ec3d 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -81,6 +81,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
 void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
 void scx_group_set_idle(struct task_group *tg, bool idle);
 void scx_group_set_bandwidth(struct task_group *tg, u64 period_us, u64 quota_us, u64 burst_us);
+void scx_group_set_runtime(struct task_group *tg, u64 runtime_us);
 #else	/* CONFIG_EXT_GROUP_SCHED */
 static inline void scx_tg_init(struct task_group *tg) {}
 static inline int scx_tg_online(struct task_group *tg) { return 0; }
@@ -91,5 +92,6 @@ static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
 static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
 static inline void scx_group_set_idle(struct task_group *tg, bool idle) {}
 static inline void scx_group_set_bandwidth(struct task_group *tg, u64 period_us, u64 quota_us, u64 burst_us) {}
+static inline void scx_group_set_runtime(struct task_group *tg, u64 runtime_us) {}
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index a075732d4..21e6ab7af 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -739,6 +739,18 @@ struct sched_ext_ops {
 	 */
 	void (*cgroup_set_idle)(struct cgroup *cgrp, bool idle);
 
+	/**
+	 * @cgroup_set_runtime: A cgroup's runtime is being set directly
+	 * @cgrp: cgroup whose runtime is being set
+	 * @runtime_us: runtime in microseconds
+	 *
+	 * Update @cgrp's available runtime. This is from the cpu.max.runtime
+	 * cgroup interface. @runtime_us is the total runtime budget that the
+	 * cgroup may consume. The BPF scheduler should track this value and
+	 * throttle tasks in @cgrp once the budget is exhausted.
+	 */
+	void (*cgroup_set_runtime)(struct cgroup *cgrp, u64 runtime_us);
+
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 
 	/**
-- 
2.47.3
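
The change detection described in patch 2 can be modelled in plain C to show when the callback would fire. This is a userspace sketch, not sched_ext code: struct demo_group, demo_set_runtime and count_calls are hypothetical names, and the cached field merely plays the role of tg->scx.bw_runtime_us. It assumes the cache starts at zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*runtime_cb_t)(uint64_t runtime_us, void *ctx);

struct demo_group {
	uint64_t cached_runtime_us;	/* role of tg->scx.bw_runtime_us */
};

/* Notify the callback only when the written value differs from the cache. */
static void demo_set_runtime(struct demo_group *grp, uint64_t runtime_us,
			     runtime_cb_t cb, void *ctx)
{
	assert(grp != NULL);

	if (cb && grp->cached_runtime_us != runtime_us)
		cb(runtime_us, ctx);

	grp->cached_runtime_us = runtime_us;
}

/* Example callback: count how many notifications were delivered. */
static void count_calls(uint64_t runtime_us, void *ctx)
{
	(void)runtime_us;
	(*(unsigned int *)ctx)++;
}
```

In this model the cache records the last written value rather than the remaining budget, so writing 2500000 twice in a row delivers one notification, even if the group has consumed credits between the two writes. Whether that matters for a BPF scheduler that tracks credits is a question for the posted patch, not something this sketch settles.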

Thread overview: 8+ messages
2026-05-25 19:36 [PATCH 0/2] sched/fair: expose cpu.max.runtime for credit injection Fernand Sieber
2026-05-25 19:36 ` [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly Fernand Sieber
2026-05-26 20:52   ` Benjamin Segall
2026-05-28  7:25     ` Fernand Sieber
2026-05-27 19:04   ` Tejun Heo
2026-05-28  6:54     ` Fernand Sieber
2026-05-28 14:37       ` Tejun Heo
2026-05-25 19:36 ` [PATCH 2/2] sched/ext: add cgroup_set_runtime ops callback Fernand Sieber