* [PATCH 0/2] sched/fair: expose cpu.max.runtime for credit injection
@ 2026-05-25 19:36 Fernand Sieber
2026-05-25 19:36 ` [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly Fernand Sieber
2026-05-25 19:36 ` [PATCH 2/2] sched/ext: add cgroup_set_runtime ops callback Fernand Sieber
0 siblings, 2 replies; 8+ messages in thread
From: Fernand Sieber @ 2026-05-25 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel,
nh-open-source, Fahad Mubeen, Hendrik Borghorst, David Woodhouse,
Fernand Sieber
This series adds a cpu.max.runtime cgroup v2 interface that allows
userspace to set the CFS bandwidth controller's runtime directly,
enabling CPU credit injection.
An orchestrator writes a runtime budget which the cgroup consumes
naturally through the existing bandwidth enforcement mechanism. Each
period, the task consumes runtime and the refill restores only quota
(capped at quota + burst), so the injected credits drain until runtime
falls below the cap, after which the cgroup returns to its steady-state
quota allocation.
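The drain behaviour described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel implementation: `struct bw_model`, `bw_model_period()` and `bw_model_full_periods()` are hypothetical names, all values are in microseconds, and a single task is assumed to want `demand_us` of CPU in every period.

```c
#include <stdint.h>

/* Toy model of the bandwidth pool (hypothetical, not struct cfs_bandwidth). */
struct bw_model {
	uint64_t quota_us;	/* runtime added at each period refill */
	uint64_t burst_us;	/* headroom allowed on top of quota */
	uint64_t runtime_us;	/* runtime currently available */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Run one period: consume up to demand_us, then refill by quota and
 * clamp to quota + burst. Returns the runtime actually consumed.
 */
static uint64_t bw_model_period(struct bw_model *m, uint64_t demand_us)
{
	uint64_t used = min_u64(demand_us, m->runtime_us);

	m->runtime_us -= used;
	m->runtime_us = min_u64(m->runtime_us + m->quota_us,
				m->quota_us + m->burst_us);
	return used;
}

/* Count the periods in which the full demand could still be satisfied. */
static unsigned int bw_model_full_periods(struct bw_model *m,
					  uint64_t demand_us,
					  unsigned int max_periods)
{
	unsigned int n = 0;

	while (n < max_periods && m->runtime_us >= demand_us) {
		bw_model_period(m, demand_us);
		n++;
	}
	return n;
}
```

With a 1ms quota, a 5s burst and 2500ms of injected runtime, a task demanding a full 100ms period drains 99ms net per period in this model, so the credits cover 25 full periods before the group falls back to its 1ms quota.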
The series also relaxes the burst validation: burst is no longer
required to be <= quota, only that burst + quota does not overflow.
This allows configuring burst > quota so that the runtime cap can reach
up to one full period, enabling 100% utilization while credits last.
A selftest (test_cpucg_max_runtime) validates the credit injection
mechanism by configuring a cgroup with minimal quota but large burst,
injecting credits via cpu.max.runtime, and verifying that the resulting
CPU usage matches the injected budget.
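As a rough check of what the selftest should observe, the expected usage can be estimated with a few lines of arithmetic. This is a sketch under simplifying assumptions: credits are injected at t = 0 (the 100ms settle delay is ignored), one always-runnable task is assumed, and `model_usage_us()` and `within_percent()` are hypothetical helpers rather than selftest functions.

```c
#include <stdint.h>

/*
 * Toy estimate of the CPU time used by a single always-runnable task
 * over nr_periods periods, starting with injected_us of runtime.
 */
static uint64_t model_usage_us(uint64_t quota_us, uint64_t burst_us,
			       uint64_t period_us, uint64_t injected_us,
			       unsigned int nr_periods)
{
	uint64_t cap_us = quota_us + burst_us;
	uint64_t runtime_us = injected_us;
	uint64_t total_us = 0;
	unsigned int i;

	for (i = 0; i < nr_periods; i++) {
		/* One task can burn at most one period of CPU per period. */
		uint64_t used = runtime_us < period_us ? runtime_us : period_us;

		total_us += used;
		runtime_us -= used;

		/* Period refill: add quota, clamp to quota + burst. */
		runtime_us += quota_us;
		if (runtime_us > cap_us)
			runtime_us = cap_us;
	}
	return total_us;
}

/* True if actual is within pct percent of expected (hypothetical helper). */
static int within_percent(uint64_t actual, uint64_t expected, unsigned int pct)
{
	uint64_t diff = actual > expected ? actual - expected : expected - actual;

	return diff * 100 <= expected * pct;
}
```

For quota = 1ms, burst = 5s, period = 100ms, 2500ms injected and 50 periods (5s), the model yields 2549000us: the injected credits plus the small per-period quota. That is within 10% of the 2500000us expectation, but not within 1%.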
Patch 1 adds the core interface and selftest.
Patch 2 adds sched_ext integration: an ops callback for BPF scheduler
notification when runtime credits are injected.
Fernand Sieber (2):
sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
sched/ext: add cgroup_set_runtime ops callback
include/linux/sched/ext.h | 1 +
kernel/sched/core.c | 46 ++++++++++++++++-
kernel/sched/ext.c | 17 +++++++
kernel/sched/ext.h | 2 +
kernel/sched/ext_internal.h | 12 +++++
tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
6 files changed, 138 insertions(+), 2 deletions(-)
--
2.47.3
Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
* [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
2026-05-25 19:36 [PATCH 0/2] sched/fair: expose cpu.max.runtime for credit injection Fernand Sieber
@ 2026-05-25 19:36 ` Fernand Sieber
2026-05-26 20:52 ` Benjamin Segall
2026-05-27 19:04 ` Tejun Heo
2026-05-25 19:36 ` [PATCH 2/2] sched/ext: add cgroup_set_runtime ops callback Fernand Sieber
1 sibling, 2 replies; 8+ messages in thread
From: Fernand Sieber @ 2026-05-25 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel,
nh-open-source, Fahad Mubeen, Hendrik Borghorst, David Woodhouse,
Fernand Sieber
Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
the CFS bandwidth controller's runtime directly. This enables CPU credit
injection: an orchestrator writes a runtime budget which the cgroup
consumes naturally through the existing bandwidth enforcement mechanism.
The write sets cfs_b->runtime directly. Each period, the task consumes
runtime and the refill restores only quota (capped at quota + burst), so
the injected credits drain until runtime falls below the cap, after which
the cgroup returns to its steady-state quota allocation.
Writes are rejected if the value exceeds quota + burst (the per-period
runtime cap) or exceeds the maximum bandwidth limit.
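The rejection rule can be illustrated with a standalone model of the check. This is a sketch, not the patch code: `model_check_runtime_write()`, `MODEL_NSEC_PER_USEC` and `MODEL_RUNTIME_INF` are hypothetical stand-ins, and the maximum is passed in as a parameter instead of using the kernel's limit.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_NSEC_PER_USEC	1000ULL
#define MODEL_RUNTIME_INF	UINT64_MAX	/* stands in for "no quota" */

/*
 * Return 0 if a write of runtime_us would be accepted, -EINVAL otherwise.
 * quota_ns and burst_ns are in nanoseconds, as the pool stores them;
 * the written value is in microseconds, as the interface takes it.
 */
static int model_check_runtime_write(uint64_t runtime_us, uint64_t quota_ns,
				     uint64_t burst_ns, uint64_t max_runtime_us)
{
	/* Checking the maximum first keeps the multiplication in range. */
	if (runtime_us > max_runtime_us)
		return -EINVAL;

	/* With a finite quota, runtime may not exceed quota + burst. */
	if (quota_ns != MODEL_RUNTIME_INF &&
	    runtime_us * MODEL_NSEC_PER_USEC > quota_ns + burst_ns)
		return -EINVAL;

	return 0;
}
```

With a 1ms quota and a 5s burst the cap is 5001000us, so a write of 5001000 passes in this model while 5001001 is rejected.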
Also relax the burst validation: remove the burst <= quota constraint,
requiring only that burst + quota does not overflow. This allows
configuring burst > quota so that the runtime cap (quota + burst) can
reach up to one full period, enabling 100% utilization while credits last.
The interface uses microseconds, consistent with cpu.max quota/period.
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
kernel/sched/core.c | 44 +++++++++++++++-
tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
2 files changed, 104 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d..d92e5840b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
return -EINVAL;
- if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
- burst_us + quota_us > max_bw_runtime_us))
+ if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
return -EINVAL;
#ifdef CONFIG_CFS_BANDWIDTH
@@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
tg_bandwidth(tg, &period_us, &quota_us, NULL);
return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
}
+
+static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 runtime_us)
+{
+ struct task_group *tg = css_tg(css);
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+ if (runtime_us > max_bw_runtime_us)
+ return -EINVAL;
+
+ raw_spin_lock_irq(&cfs_b->lock);
+ if (cfs_b->quota != RUNTIME_INF &&
+ (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
+ raw_spin_unlock_irq(&cfs_b->lock);
+ return -EINVAL;
+ }
+ cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
+ raw_spin_unlock_irq(&cfs_b->lock);
+
+ return 0;
+}
+
+static u64 cpu_runtime_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype)
+{
+ struct task_group *tg = css_tg(css);
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 runtime_ns;
+
+ raw_spin_lock_irq(&cfs_b->lock);
+ runtime_ns = cfs_b->runtime;
+ raw_spin_unlock_irq(&cfs_b->lock);
+
+ return runtime_ns / NSEC_PER_USEC;
+}
#endif /* CONFIG_GROUP_SCHED_BANDWIDTH */
#ifdef CONFIG_RT_GROUP_SCHED
@@ -10498,6 +10532,12 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_burst_read_u64,
.write_u64 = cpu_burst_write_u64,
},
+ {
+ .name = "max.runtime",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_runtime_read_u64,
+ .write_u64 = cpu_runtime_write_u64,
+ },
#endif /* CONFIG_CFS_BANDWIDTH */
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index c83f05438..df151702b 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -776,6 +776,67 @@ static int test_cpucg_max_nested(const char *root)
return ret;
}
+static int test_cpucg_max_runtime(const char *root)
+{
+ int ret = KSFT_FAIL;
+ long quota_usec = 1000; /* 1ms (minimum) */
+ long period_usec = 100000; /* 100ms */
+ long burst_usec = 5000000; /* 5s, so cap = 5001ms */
+ long runtime_usec = 2500000; /* 2500ms = half of 5s run */
+ long duration_sec = 5;
+ long expected_usec = duration_sec * USEC_PER_SEC / 2; /* 50% */
+ long usage_usec;
+ char *cpucg;
+ char buf[64];
+ int pid;
+
+ cpucg = cg_name(root, "cpucg_runtime_test");
+ if (!cpucg)
+ goto cleanup;
+
+ if (cg_create(cpucg))
+ goto cleanup;
+
+ snprintf(buf, sizeof(buf), "%ld %ld", quota_usec, period_usec);
+ if (cg_write(cpucg, "cpu.max", buf))
+ goto cleanup;
+ if (cg_write_numeric(cpucg, "cpu.max.burst", burst_usec))
+ goto cleanup;
+
+ /* Start burner, let it settle, then inject credits */
+ struct cpu_hog_func_param param = {
+ .nprocs = 1,
+ .ts = { .tv_sec = duration_sec, .tv_nsec = 0 },
+ .clock_type = CPU_HOG_CLOCK_WALL,
+ };
+ pid = cg_run_nowait(cpucg, hog_cpus_timed, (void *)&param);
+ if (pid < 0)
+ goto cleanup;
+
+ usleep(100000);
+ if (cg_write_numeric(cpucg, "cpu.max.runtime", runtime_usec)) {
+ kill(pid, SIGKILL);
+ waitpid(pid, NULL, 0);
+ goto cleanup;
+ }
+
+ waitpid(pid, NULL, 0);
+
+ usage_usec = cg_read_key_long(cpucg, "cpu.stat", "usage_usec");
+ if (usage_usec <= 0)
+ goto cleanup;
+
+ if (!values_close_report(usage_usec, expected_usec, 10))
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ cg_destroy(cpucg);
+ free(cpucg);
+ return ret;
+}
+
#define T(x) { x, #x }
struct cpucg_test {
int (*fn)(const char *root);
@@ -790,6 +851,7 @@ struct cpucg_test {
T(test_cpucg_nested_weight_underprovisioned),
T(test_cpucg_max),
T(test_cpucg_max_nested),
+ T(test_cpucg_max_runtime),
};
#undef T
--
2.47.3
* [PATCH 2/2] sched/ext: add cgroup_set_runtime ops callback
2026-05-25 19:36 [PATCH 0/2] sched/fair: expose cpu.max.runtime for credit injection Fernand Sieber
2026-05-25 19:36 ` [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly Fernand Sieber
@ 2026-05-25 19:36 ` Fernand Sieber
1 sibling, 0 replies; 8+ messages in thread
From: Fernand Sieber @ 2026-05-25 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel,
nh-open-source, Fahad Mubeen, Hendrik Borghorst, David Woodhouse,
Fernand Sieber
Add a sched_ext_ops callback that is invoked when userspace writes to
cpu.max.runtime. This allows BPF schedulers to be notified when runtime
credits are injected into a cgroup, enabling SCX-side credit tracking.
The callback includes change detection (only fires when the value
changes) and caches the value in tg->scx.bw_runtime_us.
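The change-detection behaviour can be sketched in plain C as follows. This is a userspace illustration only: `struct group_model`, `group_model_set_runtime()` and `count_calls()` are hypothetical names, and locking is omitted entirely.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*runtime_cb)(void *ctx, uint64_t runtime_us);

/* Toy stand-in for the per-group state that caches the last value. */
struct group_model {
	uint64_t cached_runtime_us;	/* last value passed in */
	runtime_cb on_set_runtime;	/* optional callback, may be NULL */
	void *ctx;			/* opaque argument for the callback */
};

/* Notify only when a callback exists and the value changed; always cache. */
static void group_model_set_runtime(struct group_model *g, uint64_t runtime_us)
{
	if (g->on_set_runtime && g->cached_runtime_us != runtime_us)
		g->on_set_runtime(g->ctx, runtime_us);

	g->cached_runtime_us = runtime_us;
}

/* Example callback: count how many notifications were delivered. */
static void count_calls(void *ctx, uint64_t runtime_us)
{
	(void)runtime_us;
	(*(unsigned int *)ctx)++;
}
```

In this model, writing the same value twice notifies only once, even if the underlying runtime has drained between the two writes, because the comparison is against the cached written value rather than the live pool.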
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
include/linux/sched/ext.h | 1 +
kernel/sched/core.c | 2 ++
kernel/sched/ext.c | 17 +++++++++++++++++
kernel/sched/ext.h | 2 ++
kernel/sched/ext_internal.h | 12 ++++++++++++
5 files changed, 34 insertions(+)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 2129e18ad..591801a50 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -273,6 +273,7 @@ struct scx_task_group {
u64 bw_period_us;
u64 bw_quota_us;
u64 bw_burst_us;
+ u64 bw_runtime_us;
bool idle;
#endif
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d92e5840b..369dd03d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10165,6 +10165,8 @@ static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
raw_spin_unlock_irq(&cfs_b->lock);
+
+ scx_group_set_runtime(tg, runtime_us);
return 0;
}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 827a96e39..2ce505ad8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4488,6 +4488,23 @@ void scx_group_set_bandwidth(struct task_group *tg,
percpu_up_read(&scx_cgroup_ops_rwsem);
}
+
+void scx_group_set_runtime(struct task_group *tg, u64 runtime_us)
+{
+ struct scx_sched *sch;
+
+ percpu_down_read(&scx_cgroup_ops_rwsem);
+ sch = scx_root;
+
+ if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_runtime) &&
+ tg->scx.bw_runtime_us != runtime_us)
+ SCX_CALL_OP(sch, cgroup_set_runtime, NULL,
+ tg_cgrp(tg), runtime_us);
+
+ tg->scx.bw_runtime_us = runtime_us;
+
+ percpu_up_read(&scx_cgroup_ops_rwsem);
+}
#endif /* CONFIG_EXT_GROUP_SCHED */
#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 0b7fc46ae..00103ec3d 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -81,6 +81,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
void scx_group_set_idle(struct task_group *tg, bool idle);
void scx_group_set_bandwidth(struct task_group *tg, u64 period_us, u64 quota_us, u64 burst_us);
+void scx_group_set_runtime(struct task_group *tg, u64 runtime_us);
#else /* CONFIG_EXT_GROUP_SCHED */
static inline void scx_tg_init(struct task_group *tg) {}
static inline int scx_tg_online(struct task_group *tg) { return 0; }
@@ -91,5 +92,6 @@ static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
static inline void scx_group_set_idle(struct task_group *tg, bool idle) {}
static inline void scx_group_set_bandwidth(struct task_group *tg, u64 period_us, u64 quota_us, u64 burst_us) {}
+static inline void scx_group_set_runtime(struct task_group *tg, u64 runtime_us) {}
#endif /* CONFIG_EXT_GROUP_SCHED */
#endif /* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index a075732d4..21e6ab7af 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -739,6 +739,18 @@ struct sched_ext_ops {
*/
void (*cgroup_set_idle)(struct cgroup *cgrp, bool idle);
+ /**
+ * @cgroup_set_runtime: A cgroup's runtime is being set directly
+ * @cgrp: cgroup whose runtime is being set
+ * @runtime_us: runtime in microseconds
+ *
+ * Update @cgrp's available runtime. This is from the cpu.max.runtime
+ * cgroup interface. @runtime_us is the total runtime budget that the
+ * cgroup may consume. The BPF scheduler should track this value and
+ * throttle tasks in @cgrp once the budget is exhausted.
+ */
+ void (*cgroup_set_runtime)(struct cgroup *cgrp, u64 runtime_us);
+
#endif /* CONFIG_EXT_GROUP_SCHED */
/**
--
2.47.3
* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
2026-05-25 19:36 ` [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly Fernand Sieber
@ 2026-05-26 20:52 ` Benjamin Segall
2026-05-28 7:25 ` Fernand Sieber
2026-05-27 19:04 ` Tejun Heo
1 sibling, 1 reply; 8+ messages in thread
From: Benjamin Segall @ 2026-05-26 20:52 UTC (permalink / raw)
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Tejun Heo, David Vernet, Andrea Righi, Changwoo Min,
Dietmar Eggemann, Mel Gorman, linux-kernel, nh-open-source,
Fahad Mubeen, Hendrik Borghorst, David Woodhouse
Fernand Sieber <sieberf@amazon.com> writes:
> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.
>
> The write sets cfs_b->runtime directly. Each period, the task consumes
> runtime and the refill restores only quota (capped at quota + burst), so
> the injected credits drain until runtime falls below the cap, after which
> the cgroup returns to its steady-state quota allocation.
>
> Writes are rejected if the value exceeds quota + burst (the per-period
> runtime cap) or exceeds the maximum bandwidth limit.
>
> Also relax the burst validation: remove the burst <= quota constraint,
> requiring only that burst + quota does not overflow. This allows
> configuring burst > quota so that the runtime cap (quota + burst) can
> reach up to one full period, enabling 100% utilization while credits last.
>
> The interface uses microseconds, consistent with cpu.max quota/period.
I don't necessarily object to supporting this design of userspace
program/bpf for dynamic quota decisions that gets to make use of the
inline cfs bandwidth touch points for the performance sensitive runtime
consumption bits, given how minimal it is.
However the existing APIs give something very close to this - any write
to max/max.burst will also add the new quota to the runtime, and reading
max.runtime (beyond using it to construct a += on runtime) can be done
with cpuacct. Is the overhead of tg_set_cfs_bandwidth (which admittedly isn't
really designed to be fast) too much, or is setting max.runtime rather
than adding to it important, or something else?
>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>
> ---
> kernel/sched/core.c | 44 +++++++++++++++-
> tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
> 2 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d..d92e5840b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
> if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
> return -EINVAL;
>
> - if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
> - burst_us + quota_us > max_bw_runtime_us))
> + if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
> return -EINVAL;
I'm fine with this in general, but we should keep a check for burst_us >
max_bw_runtime_us as well, so that burst_us + quota_us cannot overflow
and slip past the second check.
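The overflow concern can be demonstrated with a small standalone model. This is a sketch, not the kernel check: the function names are hypothetical, the maximum is a parameter, and the RUNTIME_INF special case is left out for brevity.

```c
#include <errno.h>
#include <stdint.h>

/* Relaxed check as posted: only quota and the sum are compared to max. */
static int model_check_unguarded(uint64_t quota_us, uint64_t burst_us,
				 uint64_t max_us)
{
	if (quota_us > max_us)
		return -EINVAL;
	if (burst_us + quota_us > max_us)	/* the sum may wrap around */
		return -EINVAL;
	return 0;
}

/*
 * Same check with an explicit bound on burst. If both operands are
 * <= max_us and max_us <= UINT64_MAX / 2, the sum cannot wrap.
 */
static int model_check_guarded(uint64_t quota_us, uint64_t burst_us,
			       uint64_t max_us)
{
	if (quota_us > max_us || burst_us > max_us)
		return -EINVAL;
	if (burst_us + quota_us > max_us)
		return -EINVAL;
	return 0;
}
```

With quota = 1000 and burst = UINT64_MAX - 500, the sum wraps to 499, so the unguarded variant accepts the value while the guarded one rejects it.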
>
> #ifdef CONFIG_CFS_BANDWIDTH
> @@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
> tg_bandwidth(tg, &period_us, &quota_us, NULL);
> return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
> }
> +
> +static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cftype, u64 runtime_us)
> +{
> + struct task_group *tg = css_tg(css);
> + struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +
> + if (runtime_us > max_bw_runtime_us)
> + return -EINVAL;
> +
> + raw_spin_lock_irq(&cfs_b->lock);
> + if (cfs_b->quota != RUNTIME_INF &&
> + (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
> + raw_spin_unlock_irq(&cfs_b->lock);
> + return -EINVAL;
> + }
> + cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
> + raw_spin_unlock_irq(&cfs_b->lock);
> +
> + return 0;
> +}
The details of this feel very odd on two fronts:
First, while setting runtime rather than adding to it gives more power
to the controlling userspace, it also forces it to be racy if it wants
to add runtime. But the original design of cfs bandwidth didn't have
burst anyways, and it's not a disaster if it does race, even if the
orchestrator thread manages to get preempted or such. So I don't exactly
object to this design, but I do want to check in on the idea.
More importantly, I think it should definitely call
distribute_cfs_runtime (or an equivalent), to immediately let throttled
tasks start running again. As it is, that will be delayed until the
period timer runs, which is entirely desynchronized from userspace, even
if userspace uses the same period for its timers, along with
inconsistencies with any newly waking cpus which will run immediately.
* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
2026-05-25 19:36 ` [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly Fernand Sieber
2026-05-26 20:52 ` Benjamin Segall
@ 2026-05-27 19:04 ` Tejun Heo
2026-05-28 6:54 ` Fernand Sieber
1 sibling, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2026-05-27 19:04 UTC (permalink / raw)
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
David Vernet, Andrea Righi, Changwoo Min, Dietmar Eggemann,
Ben Segall, Mel Gorman, linux-kernel, nh-open-source,
Fahad Mubeen, Hendrik Borghorst, David Woodhouse
On Mon, May 25, 2026 at 09:36:21PM +0200, Fernand Sieber wrote:
> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.
Can you detail the use case? What problem is it solving, and how?
Thanks.
--
tejun
* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
2026-05-27 19:04 ` Tejun Heo
@ 2026-05-28 6:54 ` Fernand Sieber
2026-05-28 14:37 ` Tejun Heo
0 siblings, 1 reply; 8+ messages in thread
From: Fernand Sieber @ 2026-05-28 6:54 UTC (permalink / raw)
To: tj
Cc: arighi, bsegall, changwoo, dietmar.eggemann, dwmw, fmubeen,
hborghor, juri.lelli, linux-kernel, mgorman, mingo,
nh-open-source, peterz, sieberf, vincent.guittot, void
Hi Tejun,
On Wed, May 27, 2026 at 09:04:37AM -1000, Tejun Heo wrote:
> On Mon, May 25, 2026 at 09:36:21PM +0200, Fernand Sieber wrote:
> > Add a cpu.max.runtime cgroup v2 interface that allows userspace to
> > set the CFS bandwidth controller's runtime directly. This enables
> > CPU credit injection: an orchestrator writes a runtime budget which
> > the cgroup consumes naturally through the existing bandwidth
> > enforcement mechanism.
>
> Can you detail the use case? What problem is it solving, and how?
Our use case is managing CPU credits for VMs.
The product spec defines a credit accumulation rate (quota), a credit
limit (burst), and an initial credit level at launch (runtime).
Controlling runtime is also necessary for preserving credits across
live update (kexec) and live migration.
It is possible to approximate this behavior with existing kernel
primitives. However, this requires setting up awkward parallel
accounting/control logic in userspace, which must be periodically
synced with the kernel. Instead, we propose minimal changes to
the cpu bw primitives to facilitate this use case.
Thanks.
Fernand
* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
2026-05-26 20:52 ` Benjamin Segall
@ 2026-05-28 7:25 ` Fernand Sieber
0 siblings, 0 replies; 8+ messages in thread
From: Fernand Sieber @ 2026-05-28 7:25 UTC (permalink / raw)
To: bsegall
Cc: arighi, changwoo, dietmar.eggemann, dwmw, fmubeen, hborghor,
juri.lelli, linux-kernel, mgorman, mingo, nh-open-source, peterz,
sieberf, tj, vincent.guittot, void
Hi Ben,
On Tue, May 26, 2026 at 01:52:56PM -0700, Benjamin Segall wrote:
> I don't necessarily object to supporting this design of userspace
> program/bpf for dynamic quota decisions that gets to make use of the
> inline cfs bandwidth touch points for the performance sensitive
> runtime consumption bits, given how minimal it is.
>
> However the existing APIs give something very close to this - any
> write to max/max.burst will also add the new quota to the runtime,
> and reading max.runtime (beyond using it to construct a += on
> runtime) can be done with cpuacct. Is the overhead of
> tg_set_cfs_bandwidth (which admittedly isn't really designed to be
> fast) too much, or is setting max.runtime rather than adding to it
> important, or something else?
I've detailed our CPU-credits-for-VMs use case in my reply to Tejun:
https://lore.kernel.org/all/20260528065428.69225-1-sieberf@amazon.com/
We need both primitives to control credits accumulation rate (quota)
and level of credits (runtime). Controlling level of credits is
somewhat rare as it corresponds to specific events in the lifecycle
of the VM.
If I understand correctly what you are saying, we can already
approximate that by temporarily setting quota to the delta runtime
we need to adjust, and then setting it back to the normal
accumulation rate.
While possible, this seems quite awkward and blunt to me. Moreover,
operations that might need a negative delta (e.g. credit transfer)
would be even more awkward to implement (I suppose we would need to
temporarily reduce the burst limit so that runtime is clamped to the
lower cap, and then set it back).
> I'm fine with this in general, but we should keep a check for
> burst_us > max_bw_runtime_us as well, so that burst_us + quota_us
> cannot overflow and slip past the second check.
Noted. Will address in the next revision.
> The details of this feel very odd on two fronts:
>
> First, while setting runtime rather than adding to it gives more
> power to the controlling userspace, it also forces it to be racy
> if it wants to add runtime. But the original design of cfs
> bandwidth didn't have burst anyways, and it's not a disaster if it
> does race, even if the orchestrator thread manages to get preempted
> or such. So I don't exactly object to this design, but I do want
> to check in on the idea.
It was also my reasoning that races were non-critical here, so I
opted for an API that was consistent with the other interfaces.
However, we could also replace/complement it with a delta API if we
think it's more useful. I chose to keep the API simple for now but
I don't mind changing it.
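The race under discussion can be sketched with a single-threaded model of one interleaving. This is only an illustration: both functions are hypothetical, the delta variant does not exist in the posted patch, and `consumed_us` is assumed not to exceed the starting runtime.

```c
#include <stdint.h>

/*
 * Orchestrator emulates "add" with read + set: it snapshots the runtime,
 * the group consumes consumed_us in the meantime, and the write then
 * overwrites the pool with snapshot + delta.
 */
static uint64_t model_add_via_set(uint64_t runtime_us, uint64_t consumed_us,
				  uint64_t delta_us)
{
	uint64_t snapshot_us = runtime_us;	/* orchestrator reads */

	runtime_us -= consumed_us;		/* group keeps running */
	runtime_us = snapshot_us + delta_us;	/* orchestrator writes */
	return runtime_us;
}

/* A hypothetical delta interface applies the change to the live value. */
static uint64_t model_add_via_delta(uint64_t runtime_us, uint64_t consumed_us,
				    uint64_t delta_us)
{
	runtime_us -= consumed_us;	/* group keeps running */
	runtime_us += delta_us;		/* delta is applied atomically */
	return runtime_us;
}
```

Starting from 1000us with 300us consumed between the read and the write, adding 500us via set leaves 1500us while a delta would leave 1200us. The error is bounded by what the group consumed in that window, which matches the view above that the race is non-critical.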
> More importantly, I think it should definitely call
> distribute_cfs_runtime (or an equivalent), to immediately let
> throttled tasks start running again. As it is, that will be delayed
> until the period timer runs, which is entirely desynchronized from
> userspace, even if userspace uses the same period for its timers,
> along with inconsistencies with any newly waking cpus which will
> run immediately.
Fair point. I will update that in the next revision.
Thanks.
Fernand
* Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
2026-05-28 6:54 ` Fernand Sieber
@ 2026-05-28 14:37 ` Tejun Heo
0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-05-28 14:37 UTC (permalink / raw)
To: Fernand Sieber
Cc: arighi, bsegall, changwoo, dietmar.eggemann, dwmw, fmubeen,
hborghor, juri.lelli, linux-kernel, mgorman, mingo,
nh-open-source, peterz, vincent.guittot, void
Hello,
On Thu, May 28, 2026 at 08:54:28AM +0200, Fernand Sieber wrote:
> Hi Tejun,
>
> On Wed, May 27, 2026 at 09:04:37AM -1000, Tejun Heo wrote:
> > On Mon, May 25, 2026 at 09:36:21PM +0200, Fernand Sieber wrote:
> > > Add a cpu.max.runtime cgroup v2 interface that allows userspace to
> > > set the CFS bandwidth controller's runtime directly. This enables
> > > CPU credit injection: an orchestrator writes a runtime budget which
> > > the cgroup consumes naturally through the existing bandwidth
> > > enforcement mechanism.
> >
> > Can you detail the use case? What problem is it solving, and how?
>
> Our use case is managing CPU credits for VMs.
>
> The product spec defines a credit accumulation rate (quota), a credit
> limit (burst), and an initial credit level at launch (runtime).
>
> Controlling runtime is also necessary for preserving credits across
> live update (kexec) and live migration.
>
> It is possible to approximate this behavior with existing kernel
> primitives. However, this requires setting up awkward parallel
> accounting/control logic in userspace, which must be periodically
> synced with the kernel. Instead, we propose minimal changes to
> the cpu bw primitives to facilitate this use case.
Can you please go into more detail, preferably a lot more? From cgroup POV,
there is a precedent for this sort of direct-ish low level control in
memory.reclaim; however, as things like this can expose a lot of
implementation detail, I think the bar for use case justification should be
pretty high - the use cases should make general sense and there should be no
other reasonable way to achieve the same without adding the proposed
interface. I don't think the above description achieves that.
Thanks.
--
tejun