From: Fernand Sieber <sieberf@amazon.com>
To: <bsegall@google.com>
Cc: <arighi@nvidia.com>, <changwoo@igalia.com>,
<dietmar.eggemann@arm.com>, <dwmw@amazon.co.uk>,
<fmubeen@amazon.de>, <hborghor@amazon.de>,
<juri.lelli@redhat.com>, <linux-kernel@vger.kernel.org>,
<mgorman@suse.de>, <mingo@redhat.com>,
<nh-open-source@amazon.com>, <peterz@infradead.org>,
<sieberf@amazon.com>, <tj@kernel.org>,
<vincent.guittot@linaro.org>, <void@manifault.com>
Subject: Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
Date: Thu, 28 May 2026 09:25:14 +0200 [thread overview]
Message-ID: <20260528072514.76326-1-sieberf@amazon.com> (raw)
In-Reply-To: <xm26tsrtnb9z.fsf@google.com>
Hi Ben,
On Tue, May 26, 2026 at 01:52:56PM -0700, Benjamin Segall wrote:
> I don't necessarily object to supporting this design of userspace
> program/bpf for dynamic quota decisions that gets to make use of the
> inline cfs bandwidth touch points for the performance sensitive
> runtime consumption bits, given how minimal it is.
>
> However the existing APIs give something very close to this - any
> write to max/max.burst will also add the new quota to the runtime,
> and reading max.runtime (beyond using it to construct a += on
> runtime) can be done with cpuacct. Is the overhead of
> tg_set_cfs_bandwidth (which admittedly isn't really designed to be
> fast) too much, or is setting max.runtime rather than adding to it
> important, or something else?
I've detailed our CPU credits for VM use case in Tejun's reply:
https://lore.kernel.org/all/20260528065428.69225-1-sieberf@amazon.com/
We need both primitives to control credits accumulation rate (quota)
and level of credits (runtime). Controlling level of credits is
somewhat rare as it corresponds to specific events in the lifecycle
of the VM.
If I understand correctly what you are saying, we can already
approximate that by temporarily setting quota to the delta runtime
we need to adjust, and then setting it back to the normal
accumulation rate.
While possible, this seems quite awkward and blunt to me. Moreover
operations that might need a negative delta (e.g credit transfer)
would be even more awkward to implement (I suppose we would need to
temporarily reduce the burst limit to force hit the runtime cap and
then set it back).
> I'm fine with this in general, but we should keep a check for
> burst_us > max_bw_runtime_us as well, to avoid burst_us + quota_us
> being able to overflow and avoid the second check.
Noted. Will address in the next revision.
> The details of this feel very odd on two fronts:
>
> First, while setting runtime rather than adding to it gives more
> power to the controlling userspace, it also forces it to be racy
> if it wants to add runtime. But the original design of cfs
> bandwidth didn't have burst anyways, and it's not a disaster if it
> does race, even if the orchestrator thread manages to get preempted
> or such. So I don't exactly object to this design, but I do want
> to check in on the idea.
It was also my reasoning that races were non-critical here, so I
opted for an API that was consistent with the other interfaces.
However, we could also replace/complement it with a delta API if we
think it's more useful. I chose to keep the API simple for now but
I don't mind changing it.
> More importantly, I think it should definitely call
> distribute_cfs_runtime (or an equivalent), to immediately let
> throttled tasks start running again. As it is, that will be delayed
> until the period timer runs, which is entirely desynchronized from
> userspace, even if userspace uses the same period for its timers,
> along with inconsistencies with any newly waking cpus which will
> run immediately.
Fair point. I will update that in the next revision.
Thanks.
Fernand
Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
next prev parent reply other threads:[~2026-05-28 7:25 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-25 19:36 [PATCH 0/2] sched/fair: expose cpu.max.runtime for credit injection Fernand Sieber
2026-05-25 19:36 ` [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly Fernand Sieber
2026-05-26 20:52 ` Benjamin Segall
2026-05-28 7:25 ` Fernand Sieber [this message]
2026-05-27 19:04 ` Tejun Heo
2026-05-28 6:54 ` Fernand Sieber
2026-05-28 14:37 ` Tejun Heo
2026-05-25 19:36 ` [PATCH 2/2] sched/ext: add cgroup_set_runtime ops callback Fernand Sieber
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260528072514.76326-1-sieberf@amazon.com \
--to=sieberf@amazon.com \
--cc=arighi@nvidia.com \
--cc=bsegall@google.com \
--cc=changwoo@igalia.com \
--cc=dietmar.eggemann@arm.com \
--cc=dwmw@amazon.co.uk \
--cc=fmubeen@amazon.de \
--cc=hborghor@amazon.de \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=nh-open-source@amazon.com \
--cc=peterz@infradead.org \
--cc=tj@kernel.org \
--cc=vincent.guittot@linaro.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.