From: Fernand Sieber
Subject: Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly
Date: Thu, 28 May 2026 09:25:14 +0200
Message-ID: <20260528072514.76326-1-sieberf@amazon.com>

Hi Ben,

On Tue, May 26, 2026 at 01:52:56PM -0700, Benjamin Segall wrote:
> I don't necessarily object to supporting this design of userspace
> program/bpf for dynamic quota decisions that gets to make use of the
> inline cfs bandwidth touch points for the performance sensitive
> runtime consumption bits, given how minimal it is.
>
> However the existing APIs give something very close to this - any
> write to max/max.burst will also add the new quota to the runtime,
> and reading max.runtime (beyond using it to construct a += on
> runtime) can be done with cpuacct. Is the overhead of
> tg_set_cfs_bandwidth (which admittedly isn't really designed to be
> fast) too much, or is setting max.runtime rather than adding to it
> important, or something else?

I've detailed our CPU credits for VMs use case in my reply to Tejun:
https://lore.kernel.org/all/20260528065428.69225-1-sieberf@amazon.com/

We need both primitives: one to control the credit accumulation rate
(quota) and one to control the credit level (runtime). Adjusting the
credit level is relatively rare, as it corresponds to specific events
in the lifecycle of the VM.

If I understand you correctly, we can already approximate that by
temporarily setting quota to the runtime delta we need to apply, and
then setting it back to the normal accumulation rate. While possible,
this seems quite awkward and blunt to me. Moreover, operations that
need a negative delta (e.g. credit transfer) would be even more
awkward to implement (I suppose we would need to temporarily reduce
the burst limit to force the runtime to hit the cap, and then set it
back).

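The workaround described above can be sketched with a small toy model.
This is not kernel code: ToyBandwidth, write_max and
add_credits_via_quota are hypothetical names used only for
illustration, and the refill rule (add the quota, then clamp to
quota + burst) is an assumption drawn from the description in this
thread, not a statement of what tg_set_cfs_bandwidth does in detail.

```python
from dataclasses import dataclass


@dataclass
class ToyBandwidth:
    """Toy model of a bandwidth pool (hypothetical, not struct cfs_bandwidth)."""

    quota: int    # credits added on each refill, in microseconds
    burst: int    # headroom allowed above quota, in microseconds
    runtime: int  # credits currently available, in microseconds

    def _refill(self) -> None:
        # Assumed rule: add the quota, then clamp to quota + burst.
        self.runtime = min(self.runtime + self.quota, self.quota + self.burst)

    def write_max(self, quota: int) -> None:
        # Assumption taken from the quoted text: a write to max also
        # adds the newly written quota to the runtime.
        self.quota = quota
        self._refill()


def add_credits_via_quota(bw: ToyBandwidth, delta: int, normal_quota: int) -> None:
    """Approximate 'runtime += delta' using only writes to max."""
    bw.write_max(delta)         # temporarily use the delta as the quota
    bw.write_max(normal_quota)  # restore the normal accumulation rate
```

Starting from runtime=200 with quota=100 and burst=1000, a delta of 50
leaves runtime at 350 in this model, because the restoring write adds
the normal quota as well. The caller has to account for that extra
refill, which is part of what makes the approach awkward.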

> I'm fine with this in general, but we should keep a check for
> burst_us > max_bw_runtime_us as well, to avoid burst_us + quota_us
> being able to overflow and avoid the second check.

Noted. Will address in the next revision.

> The details of this feel very odd on two fronts:
>
> First, while setting runtime rather than adding to it gives more
> power to the controlling userspace, it also forces it to be racy
> if it wants to add runtime. But the original design of cfs
> bandwidth didn't have burst anyways, and it's not a disaster if it
> does race, even if the orchestrator thread manages to get preempted
> or such. So I don't exactly object to this design, but I do want
> to check in on the idea.

My reasoning was also that races are non-critical here, so I opted for
an API consistent with the other interfaces. That said, we could
replace or complement it with a delta API if we think that is more
useful. I kept the API simple for now, but I don't mind changing it.

> More importantly, I think it should definitely call
> distribute_cfs_runtime (or an equivalent), to immediately let
> throttled tasks start running again. As it is, that will be delayed
> until the period timer runs, which is entirely desynchronized from
> userspace, even if userspace uses the same period for its timers,
> along with inconsistencies with any newly waking cpus which will
> run immediately.

Fair point. I will update that in the next revision.

Thanks.
Fernand

Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07