From: Benjamin Segall
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Tejun Heo,
    David Vernet, Andrea Righi, Changwoo Min, Dietmar Eggemann, Mel Gorman,
    Fahad Mubeen, Hendrik Borghorst, David Woodhouse
Subject: Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
In-Reply-To: <20260525193622.70282-2-sieberf@amazon.com> (Fernand Sieber's message of "Mon, 25 May 2026 21:36:21 +0200")
References: <20260525193622.70282-1-sieberf@amazon.com> <20260525193622.70282-2-sieberf@amazon.com>
Date: Tue, 26 May 2026 13:52:56 -0700
User-Agent: Gnus/5.13 (Gnus v5.13)
X-Mailing-List: linux-kernel@vger.kernel.org

Fernand Sieber writes:

> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.
>
> The write sets cfs_b->runtime directly. Each period, the task consumes
> runtime and the refill restores only quota (capped at quota + burst), so
> the injected credits drain until runtime falls below the cap, after which
> the cgroup returns to its steady-state quota allocation.
>
> Writes are rejected if the value exceeds quota + burst (the per-period
> runtime cap) or exceeds the maximum bandwidth limit.
>
> Also relax the burst validation: remove the burst <= quota constraint,
> requiring only that burst + quota does not overflow. This allows
> configuring burst > quota so that the runtime cap (quota + burst) can
> reach up to one full period, enabling 100% utilization while credits last.
>
> The interface uses microseconds, consistent with cpu.max quota/period.

I don't necessarily object to supporting this design, where a userspace
program/bpf makes the dynamic quota decisions and gets to use the inline
cfs bandwidth touch points for the performance-sensitive runtime
consumption bits, given how minimal it is. However, the existing APIs give
something very close to this: any write to max/max.burst will also add the
new quota to the runtime, and reading max.runtime (beyond using it to
construct a += on runtime) can be done with cpuacct. Is the overhead of
tg_set_cfs_bandwidth (which admittedly isn't really designed to be fast)
too much, or is setting max.runtime rather than adding to it important, or
something else?
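
For reference, the drain behaviour described in the quoted commit message (consume, then refill by quota, capped at quota + burst) can be modelled in a few lines of userspace C. This is only a rough sketch under stated assumptions, not the kernel code: struct bw_model and bw_model_period() are hypothetical names invented here, all values share one arbitrary time unit, and each period is treated as a single consume-then-refill step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of one bandwidth pool; not a kernel struct. */
struct bw_model {
	uint64_t quota;   /* runtime added at every period refill */
	uint64_t burst;   /* extra runtime allowed to accumulate */
	uint64_t runtime; /* runtime currently available */
};

/*
 * Model one period: consume up to `demand`, then refill by quota and
 * clamp to quota + burst. Returns the amount actually consumed.
 */
static uint64_t bw_model_period(struct bw_model *b, uint64_t demand)
{
	uint64_t used, cap;

	assert(b != NULL);

	used = demand < b->runtime ? demand : b->runtime;
	cap = b->quota + b->burst;

	b->runtime -= used;       /* consumption during the period */
	b->runtime += b->quota;   /* refill restores only quota */
	if (b->runtime > cap)     /* never exceed the per-period cap */
		b->runtime = cap;

	return used;
}
```

With quota = 50, burst = 50 and an injected runtime of 100, a group that wants 70 units per period gets 70, 70, 60 and then 50 from this model, i.e. the injected credits drain and the group settles back to its quota.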
>
> Signed-off-by: Fernand Sieber
> ---
>  kernel/sched/core.c                       | 44 +++++++++++++++-
>  tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
>  2 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d..d92e5840b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
>  	if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
>  		return -EINVAL;
>
> -	if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
> -					burst_us + quota_us > max_bw_runtime_us))
> +	if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
>  		return -EINVAL;

I'm fine with this in general, but we should keep a check for
burst_us > max_bw_runtime_us as well, so that burst_us + quota_us cannot
overflow and slip past the second check.

>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
>  	tg_bandwidth(tg, &period_us, &quota_us, NULL);
>  	return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
>  }
> +
> +static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cftype, u64 runtime_us)
> +{
> +	struct task_group *tg = css_tg(css);
> +	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +
> +	if (runtime_us > max_bw_runtime_us)
> +		return -EINVAL;
> +
> +	raw_spin_lock_irq(&cfs_b->lock);
> +	if (cfs_b->quota != RUNTIME_INF &&
> +	    (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
> +		raw_spin_unlock_irq(&cfs_b->lock);
> +		return -EINVAL;
> +	}
> +	cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
> +	raw_spin_unlock_irq(&cfs_b->lock);
> +
> +	return 0;
> +}

The details of this feel very odd on two fronts:

First, while setting runtime rather than adding to it gives more power to
the controlling userspace, it also forces userspace to be racy if it wants
to add runtime.
But the original design of cfs bandwidth didn't have burst anyway, and
it's not a disaster if it does race, even if the orchestrator thread
manages to get preempted or such. So I don't exactly object to this
design, but I do want to check in on the idea.

More importantly, I think it should definitely call distribute_cfs_runtime
(or an equivalent) to immediately let throttled tasks start running again.
As it is, that will be delayed until the period timer runs, which is
entirely desynchronized from userspace even if userspace uses the same
period for its timers. It is also inconsistent with any newly waking cpus,
which will run immediately.
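
As a footnote to the overflow concern raised on the tg_set_bandwidth() hunk, the wraparound is easy to reproduce in plain userspace C. The sketch below is an illustration under assumptions, not a proposed patch: MODEL_MAX_BW_US is a made-up stand-in for max_bw_runtime_us, both helper names are hypothetical, and quota_us is assumed to have already passed its own upper-bound check, as in the quoted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit for this sketch only; the kernel derives its own. */
#define MODEL_MAX_BW_US (UINT64_C(1) << 44)

/*
 * Sum-only validation, as in the quoted "+" line: a huge burst_us makes
 * the unsigned 64-bit sum wrap to a small value, so the check passes.
 */
static bool burst_ok_sum_only(uint64_t quota_us, uint64_t burst_us)
{
	return !(burst_us + quota_us > MODEL_MAX_BW_US);
}

/*
 * Bounding burst_us first keeps both operands at or below the limit,
 * so their sum cannot wrap before the second comparison.
 */
static bool burst_ok_bounded(uint64_t quota_us, uint64_t burst_us)
{
	/* Assumed precondition: quota_us was range-checked by the caller. */
	assert(quota_us <= MODEL_MAX_BW_US);

	if (burst_us > MODEL_MAX_BW_US)
		return false;
	return burst_us + quota_us <= MODEL_MAX_BW_US;
}
```

In this model, burst_ok_sum_only(1000, UINT64_MAX) accepts the value because the sum wraps to 999, while burst_ok_bounded() rejects it and still allows burst > quota for in-range values.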