From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail-dy1-f176.google.com (mail-dy1-f176.google.com [74.125.82.176])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D51DE26C3BD
	for <linux-kernel@vger.kernel.org>; Tue, 26 May 2026 20:53:00 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.176
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779828782; cv=none; b=XjhwplEwdWCr7IMpob9Fgpcs25aH9wG3PIEusgZyMvxUT7v6YCLtolk9jAC8/341rxQRYUKbeSeLGZHQZy65TC+NLGpxHuEospI71+jda3MfoGRY8HYIyvJLD0snIDRtq+p0T5QWHD4S71Y7dYFF4PegRaXnCw75BVk2K1IGgko=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779828782; c=relaxed/simple;
	bh=56HMrUedGPcq+tNT8Z/a/6Egn5bsuUsBDkAvcOZlCZ4=;
	h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID:
	 MIME-Version:Content-Type; b=bwI/jv/uHJtmxeZxmJUkzwisXs0jllPEAQNTjyJsR7o5x3x+aKEEmJ36FgEacGLs7qkuReyqkXIXruiWTMbGX14hE9IbBO/z5nzJ/OM7KiPF9pgQ80p+fF94DoGEjogVV1HVTEbCk/MSWq/BLAltNLMwd0m0ERE+1ha0SCB6lzk=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=qqhDu08E; arc=none smtp.client-ip=74.125.82.176
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="qqhDu08E"
Received: by mail-dy1-f176.google.com with SMTP id 5a478bee46e88-2ef2a1cc06dso3351786eec.0
        for <linux-kernel@vger.kernel.org>; Tue, 26 May 2026 13:53:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20251104; t=1779828780; x=1780433580; darn=vger.kernel.org;
        h=mime-version:user-agent:message-id:date:references:in-reply-to
         :subject:cc:to:from:from:to:cc:subject:date:message-id:reply-to;
        bh=QYOw6PfUxg7AlTeBfExBnHm3Y8CYFzZtCGfKHN8bFLs=;
        b=qqhDu08Ege+LcCXZqUj9qonCB7wAMRNuqRXv+h3E70oHb5zPRvc7UQU9TWt24cfBlU
         Tgw8TqkJrU80yvZR2rLY5maf5EqluI9uJknEKXRNBy8cyLmJTjpCtpncRc0YxLyraEgH
         IhATH70mdvdXCiEI7K1rc40Q+AHIGbzyHnAdOVhcyWnzB9DPRbKaKPSoX8p71qofuizm
         6YK+7m7VYRLtCJGomiybVA2LYb47eF+p3WWwPaiViRDITTeY+Cf4FeGE5CSH3/2Kok7m
         FyWbc+yOKbxSpsGMdwPbhCQ7xBqaqxs6nVwzZqoe0gXGGD/VBl850R+pkiGPh8xQ44W8
         vsew==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1779828780; x=1780433580;
        h=mime-version:user-agent:message-id:date:references:in-reply-to
         :subject:cc:to:from:x-gm-gg:x-gm-message-state:from:to:cc:subject
         :date:message-id:reply-to;
        bh=QYOw6PfUxg7AlTeBfExBnHm3Y8CYFzZtCGfKHN8bFLs=;
        b=Iv4ByZc70p4Skg5gITcotDMVy5cTvWZAGAgugU0aWzDGnQyE1p06UDlazjGvCOWudQ
         nYOMQIsYpiRE9HwGOYrsPLO0j7sJXFupi4WKxW6q5jis08iQyaYAYzl4+7r4SiKCsdvW
         Aux1yocKmlh9FytpmosxI7eaRlmkIWaCJ6SaLn1vaATH0KY5ARPKmo+jQlOy0FX8HtVi
         T1uqWlFQxPt7b9Dp3FgtNu4YYjgKjg2JW68GtbZyAE45wC0/Ewxd5K1zf4kNtHCxDhxV
         WUydqlQycpuSkpTbVRsbAun5JQB2lYzYLrDgAeo2ISIij6Jt8WrRjSfoOagXyHTwd8xB
         CaKQ==
X-Forwarded-Encrypted: i=1; AFNElJ92dX61GUyaKWPFFw8LcLqVroOdii3/8IScMVy+onb8XfejmTuBR8QxsQE2qNgquid/gzw8rzMp17YDQOQ=@vger.kernel.org
X-Gm-Message-State: AOJu0YyorYP/n3wNrH5fqnnWPAEPhGXQDzPpEYi13Ou8bpddMRB3Tql8
	ymJfapYuGT2BpwFy99UVpwK0Nw2kDExsWEuE8yAFF7sjduU54mNkzQcO4/IOjm3NVw7bcdZGRN8
	3kuDaEDiyAf0=
X-Gm-Gg: Acq92OEpW/1wDQFzoLj1+031QQ8vErE5XGhJXFc75Eup12FwJnFi4yRC9lpjgs4IQ1J
	uPsXvjyzyd6bMOggKdstRHK7VNVTjAa9mZ6YkQelrKpHnMzCNQMlkMx4IfnqN69HcWVmRcO0Efn
	0db/VFBJAl0kyeYIKEMHWASuDIxg/yCHfTiB7j1SqENkZrM7TGzqjSkHgqD4tz7AZMNAJpIjFZt
	a0jIxufciogvKUpNcT/hOUrUEXlnrYAgTRKq1o0atNmRtQuTBoJYie4yc5YS8a4uM4gXfxLGFU9
	G2EVEA8JYgO7gQxjzxJ7OkGLEgdNxJ/okL8ESTsfpeEH3pfyUK8xoVwQcKKAiICANPY9c1jZ9KP
	kHDzhilZcdSsBReDI6pL/T4OT29QEEurG2otZvcKDOUMOvvrQC4EI88nqyvWI0dVE0MIpMxbnX8
	KK+PQd5n/5lHdgQRopMxAOj0T1xOHw1OhWtk05OxsTCDrDb4ZPoxh0cL/IN7wxmK5Tbr9C356hz
	NKtOr91A/OTbRo=
X-Received: by 2002:a05:7301:ea7:b0:2f5:5dd3:1fcf with SMTP id 5a478bee46e88-3044905f44cmr8353237eec.10.1779828779391;
        Tue, 26 May 2026 13:52:59 -0700 (PDT)
Received: from bsegall27.localhost ([2a00:79e0:2ed2:d:a75:54f5:401a:ba50])
        by smtp.gmail.com with ESMTPSA id 5a478bee46e88-3045223103bsm10982064eec.16.2026.05.26.13.52.58
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 26 May 2026 13:52:58 -0700 (PDT)
From: Benjamin Segall <bsegall@google.com>
To: Fernand Sieber <sieberf@amazon.com>
Cc: Ingo Molnar <mingo@redhat.com>,  Peter Zijlstra <peterz@infradead.org>,
  Juri Lelli <juri.lelli@redhat.com>,  Vincent Guittot
 <vincent.guittot@linaro.org>,  Tejun Heo <tj@kernel.org>,  David Vernet
 <void@manifault.com>,  Andrea Righi <arighi@nvidia.com>,  Changwoo Min
 <changwoo@igalia.com>,  Dietmar Eggemann <dietmar.eggemann@arm.com>,  Mel
 Gorman <mgorman@suse.de>,  <linux-kernel@vger.kernel.org>,
  <nh-open-source@amazon.com>,  Fahad Mubeen <fmubeen@amazon.de>,  "Hendrik
 Borghorst" <hborghor@amazon.de>,  David Woodhouse <dwmw@amazon.co.uk>
Subject: Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth
 runtime directly
In-Reply-To: <20260525193622.70282-2-sieberf@amazon.com> (Fernand Sieber's
	message of "Mon, 25 May 2026 21:36:21 +0200")
References: <20260525193622.70282-1-sieberf@amazon.com>
	<20260525193622.70282-2-sieberf@amazon.com>
Date: Tue, 26 May 2026 13:52:56 -0700
Message-ID: <xm26tsrtnb9z.fsf@google.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain

Fernand Sieber <sieberf@amazon.com> writes:

> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.
>
> The write sets cfs_b->runtime directly. Each period, the task consumes
> runtime and the refill restores only quota (capped at quota + burst), so
> the injected credits drain until runtime falls below the cap, after which
> the cgroup returns to its steady-state quota allocation.
>
> Writes are rejected if the value exceeds quota + burst (the per-period
> runtime cap) or exceeds the maximum bandwidth limit.
>
> Also relax the burst validation: remove the burst <= quota constraint,
> requiring only that burst + quota does not overflow. This allows
> configuring burst > quota so that the runtime cap (quota + burst) can
> reach up to one full period, enabling 100% utilization while credits last.
>
> The interface uses microseconds, consistent with cpu.max quota/period.


I don't necessarily object to supporting this design, where a userspace
program or bpf makes the dynamic quota decisions and reuses the inline
cfs bandwidth touch points for the performance-sensitive runtime
consumption bits, given how minimal the change is.

However, the existing APIs already give something very close to this:
any write to max/max.burst also adds the new quota to the runtime, and
reading max.runtime (other than to construct a += on runtime) can be
done with cpuacct. Is the overhead of tg_set_cfs_bandwidth (which
admittedly isn't really designed to be fast) too much, is setting
max.runtime rather than adding to it important, or is it something else?

>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>
> ---
>  kernel/sched/core.c                       | 44 +++++++++++++++-
>  tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
>  2 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d..d92e5840b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
>  	if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
>  		return -EINVAL;
>  
> -	if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
> -					burst_us + quota_us > max_bw_runtime_us))
> +	if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
>  		return -EINVAL;

I'm fine with this in general, but we should also keep a check for
burst_us > max_bw_runtime_us, so that burst_us + quota_us cannot
overflow and thereby slip past the second check.
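
To illustrate the overflow concern, here is a minimal userspace sketch,
not the kernel code. MAX_BW_RUNTIME_US and RUNTIME_INF_US are
hypothetical stand-ins for max_bw_runtime_us and RUNTIME_INF with
illustrative values, and both function names are made up. It only
mirrors the two checks quoted above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real kernel values may differ. */
#define MAX_BW_RUNTIME_US	(1000ULL * 1000 * 1000)
#define RUNTIME_INF_US		UINT64_MAX

/* Checks only the sum, as in the patched hunk. */
static bool sum_only_valid(uint64_t quota_us, uint64_t burst_us)
{
	if (quota_us != RUNTIME_INF_US && quota_us > MAX_BW_RUNTIME_US)
		return false;
	/* Unsigned addition wraps, so a huge burst_us can yield a small sum. */
	if (quota_us != RUNTIME_INF_US &&
	    burst_us + quota_us > MAX_BW_RUNTIME_US)
		return false;
	return true;
}

/* Bounds burst_us on its own first, so the sum can no longer wrap. */
static bool checked_valid(uint64_t quota_us, uint64_t burst_us)
{
	if (quota_us != RUNTIME_INF_US && quota_us > MAX_BW_RUNTIME_US)
		return false;
	if (quota_us != RUNTIME_INF_US &&
	    (burst_us > MAX_BW_RUNTIME_US ||
	     burst_us + quota_us > MAX_BW_RUNTIME_US))
		return false;
	return true;
}
```

With quota_us = 1 and burst_us = UINT64_MAX the sum wraps to 0, so
sum_only_valid() accepts the pair while checked_valid() rejects it.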

>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
>  	tg_bandwidth(tg, &period_us, &quota_us, NULL);
>  	return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
>  }
> +
> +static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cftype, u64 runtime_us)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +
> +	if (runtime_us > max_bw_runtime_us)
> +		return -EINVAL;
> +
> +	raw_spin_lock_irq(&cfs_b->lock);
> +	if (cfs_b->quota != RUNTIME_INF &&
> +	    (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
> +		raw_spin_unlock_irq(&cfs_b->lock);
> +		return -EINVAL;
> +	}
> +	cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
> +	raw_spin_unlock_irq(&cfs_b->lock);
> +
> +	return 0;
> +}

The details of this feel very odd on two fronts:

First, while setting runtime rather than adding to it gives more power
to the controlling userspace, it also forces userspace into a racy
read-modify-write if it wants to add runtime. But the original design of
cfs bandwidth didn't have burst anyway, and it's not a disaster if it
does race, even if the orchestrator thread manages to get preempted or
such. So I don't exactly object to this design, but I do want to check
in on the idea.
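
As a rough illustration of that race, here is a toy userspace model,
not kernel code. struct toy_bw and the toy_* helpers are hypothetical
names, and the quota + burst cap is ignored for brevity:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the global runtime pool of a cgroup. */
struct toy_bw {
	uint64_t runtime_ns;
};

/* The cgroup's tasks consume runtime; saturates at zero. */
static void toy_consume(struct toy_bw *b, uint64_t ns)
{
	b->runtime_ns = ns > b->runtime_ns ? 0 : b->runtime_ns - ns;
}

/* "Set" semantics: whatever was in the pool is overwritten. */
static void toy_set(struct toy_bw *b, uint64_t value_ns)
{
	b->runtime_ns = value_ns;
}

/* "Add" semantics: applied to the current value, so nothing is lost. */
static void toy_add(struct toy_bw *b, uint64_t delta_ns)
{
	b->runtime_ns += delta_ns;
}
```

If the orchestrator reads 1000, the cgroup then consumes 300, and the
orchestrator writes back 1000 + 500 with toy_set(), the pool ends up at
1500. With toy_add() it would be 1200, so the stale read hands back the
300 that was already consumed.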

More importantly, I think it should definitely call
distribute_cfs_runtime (or an equivalent) so that throttled tasks can
start running again immediately. As it is, they stay throttled until the
period timer runs, which is entirely desynchronized from userspace even
if userspace uses the same period for its timers. That is also
inconsistent with any newly waking cpus, which will run immediately.
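
To make the suggestion concrete, here is a toy userspace model of
"set the runtime, then distribute right away". It is only loosely
modeled on distribute_cfs_runtime: the struct layouts, the toy_* names
and the lack of locking are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-cpu runqueue: throttled once its local runtime is used up. */
struct toy_cfs_rq {
	bool throttled;
	int64_t runtime_remaining;	/* <= 0 while throttled */
};

/* Toy global pool plus the runqueues that draw from it. */
struct toy_cfs_b {
	uint64_t runtime;
	struct toy_cfs_rq *rqs;
	size_t nr;
};

/* Give each throttled rq just enough to go positive, then unthrottle it. */
static size_t toy_distribute(struct toy_cfs_b *b)
{
	size_t unthrottled = 0;

	for (size_t i = 0; i < b->nr && b->runtime > 0; i++) {
		struct toy_cfs_rq *rq = &b->rqs[i];
		uint64_t need;

		if (!rq->throttled)
			continue;

		need = (uint64_t)(-rq->runtime_remaining) + 1;
		if (need > b->runtime)
			need = b->runtime;

		b->runtime -= need;
		rq->runtime_remaining += (int64_t)need;
		if (rq->runtime_remaining > 0) {
			rq->throttled = false;
			unthrottled++;
		}
	}
	return unthrottled;
}

/* Write handler in this model: set the pool, then distribute at once. */
static size_t toy_set_runtime(struct toy_cfs_b *b, uint64_t runtime_ns)
{
	b->runtime = runtime_ns;
	return toy_distribute(b);
}
```

In this model a write unthrottles waiting runqueues in the same call
rather than leaving them parked until the next period timer fires.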