From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 183A7258CE5; Tue, 12 May 2026 08:10:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778573425; cv=none; b=COIgvjdDmJVV1IDtPe7lEWNyXosgUI4mt/raBA+Yvlne0+2Ycvrt12pexpTkieo2aecjcli8HL7LJrilZy127uLgxtkckvfmHRBuXd4XOnzrZCI3gCiu1CGn5kJDo/8QMF9RIhU9UTgLYUs8jS+1xARVHOn5Ri7KEjcUK5v0ZpE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778573425; c=relaxed/simple; bh=JdtqZMJDJnp28NUJSCgytxWm0x493TYcezg01A+7jSU=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=iyYXKLvzXaaCxTHE0FTaEVafmBoQOvsNcaxHvMyv6vYjO/1RnDq5jWPf1kVELwZS7h4Iqp4QbR8wntV5Nldre60Il0pyQGWLYygfEQxs28UyqLrWTbK01jFJ89rXURMrAvwzhDz4nSpRU16U3qc5iUwkzscOGEAdOGdrvxQJ+NY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=YL0cslZb; arc=none smtp.client-ip=90.155.50.34 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="YL0cslZb" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=19TGBtKQmhIuXSKyT51dH+VsGeOUq2ku3XDV08szO4s=; b=YL0cslZbE6ZZ/7Tz1GBQdwXMc5 zJQEaDoFZsjfFSnJUL72wbh2GRnZ4P5uqReFiFdFvl0505cn0lZ+ni3gsrvItJGiLR6oPY+rsqoRJ ukKxsYiFGCgv5dqK+A8DoHhp0/4Kb8/P3iTy5Bw7DDzgiMYSu5mDG09nYSbssx+sBVxsD6ROkBO3s tdMpPifDHo14ivNO+wV3Zpb+glGEZplZeFjhGx+A3WgPvpQUqL/Qlr777w1BNBR+pXRENg0DY8I4y uMJZBS/nF1Iwz+jnSpVYsjLEWJsNfcMq9ryFtdW+fmwdUKNR4y1DAJ/3TjCiOPtHD3l2952B5fP9O wm9K4BmA==; Received: from 2001-1c00-8d85-4b00-266e-96ff-fe07-7dcc.cable.dynamic.v6.ziggo.nl ([2001:1c00:8d85:4b00:266e:96ff:fe07:7dcc] helo=noisy.programming.kicks-ass.net) by casper.infradead.org with esmtpsa (Exim 4.99.1 #2 (Red Hat Linux)) id 1wMiBl-00000009PjW-2YTr; Tue, 12 May 2026 08:10:01 +0000 Received: by noisy.programming.kicks-ass.net (Postfix, from userid 1000) id 00A7E3007E1; Tue, 12 May 2026 10:10:00 +0200 (CEST) Date: Tue, 12 May 2026 10:10:00 +0200 From: Peter Zijlstra To: Tejun Heo Cc: mingo@kernel.org, longman@redhat.com, chenridong@huaweicloud.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, jstultz@google.com, kprateek.nayak@amd.com, qyousef@layalina.io Subject: Re: [PATCH v2 00/10] sched: Flatten the pick Message-ID: <20260512081000.GL3102624@noisy.programming.kicks-ass.net> References: <20260511113104.563854162@infradead.org> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Mon, May 11, 2026 at 09:23:45AM -1000, Tejun Heo wrote: > Hello, Peter. > > On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote: > > So cgroup scheduling has always been a pain in the arse. The problems start > > with weight distribution and end with hierachical picks and it all sucks. > > > > The problems with weight distribution are related to that infernal global > > fraction: > > > > tg->w * grq_i->w > > ge_i->w = ---------------- > > \Sum_j grq_j->w > > > > which we've approximated reasonably well by now. However, the immediate > > consequence of this fraction is that the total group weight (tg->w) gets > > fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup > > weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine > > with the fact that 256 CPU systems are relatively common these days, this > > becomes painful. > > > > The common 'solution' is to inflate the group weight by 'nr_cpus'; the > > immediate problem with that is that when all load of a group gets concentrated > > on a single CPU, the per-cpu cgroup weight becomes insanely large, easily > > exceeding nice -20. > > > > Additionally there are numerical limits on the max weight you can have before > > the math starts suffering overflows. As such there is a definite limit on the > > total group weight. Which has annoyed people ;-) > > > > The first few patches add a knob /debug/sched/cgroup_mode and a few different > > options on how to deal with this. My favourite is 'concur', but obviously that > > is also the most expensive one :-/ It adds a tg->tasks counter which makes the > > update_tg_load_avg() thing more expensive. > > Ignoring fixed math accuracy problems, isn't the root problem here that > every thread in the root cgroup competes as if each is its own cgroup? ie. > Isn't the canonical solution here to create an enveloping group, at least > for share calculation purposes, for root threads and then assign them some > weight so that they compete in the same way that other cgroups do? Then, the > different modes go away or rather whatever the user wants can be expressed > via root's weight if that's to be made configurable. As long as the total group weight is a fraction; and it sorta has to be. You can run into trouble by stacking that fraction. Take 256 CPUs and a group weight of 1024. Then each CPU gets a weight of 1/256 or 4. Even if we increase the internal accuracy to 20 bits (we do on 64bit) then this becomes 4096, do this for 2 more levels in the hierarchy and you're down to scraping the barrel again. So if each level runs at a fraction f of the level above, then level n runs at f^n. Moving root into a phantom group at level 1, only solves the problem against other tasks at level 1, but then you have the same problem again at level 2 and below. Both the numerical problems and the scale problem of the root group can be avoided if we can get the average/nominal fraction to be near 1. The 'normal' way around this is to ensure the group weight is nr_cpus * 1024, then, when everybody is running, the per CPU weight is 1024 or 1 and the continued fraction is also 1-ish. This is why people like to increase the max group weight. Trouble is of course that if not all CPUs are busy, with the extreme being only a single CPU carrying that weight of nr_cpus*1024, this then causes trouble because that one CPU gets overloaded. One of the options is to simply put a max on the single CPU load; which is the crudest option to just make it 'work'. The one I favour though is the one where we scale the group weight by: 'min(cpumas, nr_tasks)'. Anyway, this is why I've been looking at these alternative weight schemes, to get the nominal fraction near 1 and make these problems go away. It is both the numerical issues and the disparity between levels (with root being at level 0 being the most obvious). Does that make sense?