Date: Mon, 11 May 2026 09:23:45 -1000
From: Tejun Heo
To: Peter Zijlstra
Cc: mingo@kernel.org, longman@redhat.com, chenridong@huaweicloud.com,
    juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
    rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
    hannes@cmpxchg.org, mkoutny@suse.com, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, jstultz@google.com, kprateek.nayak@amd.com,
    qyousef@layalina.io
Subject: Re: [PATCH v2 00/10] sched: Flatten the pick
Message-ID:
References: <20260511113104.563854162@infradead.org>
In-Reply-To: <20260511113104.563854162@infradead.org>

Hello, Peter.

On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote:
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.
>
> The problems with weight distribution are related to that infernal global
> fraction:
>
>              tg->w * grq_i->w
>   ge_i->w = ------------------
>              \Sum_j grq_j->w
>
> which we've approximated reasonably well by now. However, the immediate
> consequence of this fraction is that the total group weight (tg->w) gets
> fragmented across all your CPUs. And at 64 CPUs that means your per-cpu
> cgroup weight ends up being a nice 19 task's worth. And more CPUs, more
> tiny. Combined with the fact that 256-CPU systems are relatively common
> these days, this becomes painful.
>
> The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> immediate problem with that is that when all load of a group gets
> concentrated on a single CPU, the per-cpu cgroup weight becomes insanely
> large, easily exceeding nice -20.
>
> Additionally, there are numerical limits on the max weight you can have
> before the math starts suffering overflows. As such there is a definite
> limit on the total group weight. Which has annoyed people ;-)
>
> The first few patches add a knob /debug/sched/cgroup_mode and a few
> different options on how to deal with this. My favourite is 'concur', but
> obviously that is also the most expensive one :-/ It adds a tg->tasks
> counter which makes the update_tg_load_avg() thing more expensive.

Ignoring fixed-point math accuracy problems, isn't the root problem here
that every thread in the root cgroup competes as if each were its own
cgroup? I.e., isn't the canonical solution here to create an enveloping
group, at least for share-calculation purposes, for root threads and then
assign it some weight so that they compete in the same way that other
cgroups do? Then the different modes go away, or rather whatever the user
wants can be expressed via root's weight, if that's to be made
configurable.

Thanks.

-- 
tejun
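The fragmentation Peter describes can be made concrete with a toy model of the quoted fraction (plain Python, not kernel code; the weights 1024 for nice 0 and 15 for nice 19 are from the kernel's sched_prio_to_weight table, everything else is illustrative):

```python
# Toy model of the per-CPU group-entity weight fraction:
#   ge_i->w = tg->w * grq_i->w / \Sum_j grq_j->w
# Weights 1024 (nice 0) and 15 (nice 19) come from the kernel's
# sched_prio_to_weight table; the scenario itself is illustrative.
NICE_0_WEIGHT = 1024
NICE_19_WEIGHT = 15

def effective_weight(tg_weight, grq_weights, i):
    """Weight CPU i's group entity receives under the global fraction."""
    return tg_weight * grq_weights[i] / sum(grq_weights)

# A nice-0-weight group with one nice-0 task queued on each of 64 CPUs:
ncpus = 64
grq = [NICE_0_WEIGHT] * ncpus
per_cpu = effective_weight(NICE_0_WEIGHT, grq, 0)
print(per_cpu)  # 16.0 -- roughly a nice 19 task's worth (weight 15)
```

With the load spread evenly, each CPU's slice of the group weight is tg->w / nr_cpus, which is where the "nice 19 task worth" at 64 CPUs comes from.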
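The same toy fraction also shows the failure mode of the 'inflate by nr_cpus' workaround: once all of the group's load concentrates on one CPU, the per-CPU weight far exceeds nice -20 (weight 88761 in sched_prio_to_weight). Again a sketch with illustrative numbers, not kernel code:

```python
# Sketch of inflating tg->w by nr_cpus and its failure mode: with all
# load concentrated on one CPU, the per-CPU weight blows well past
# nice -20 (weight 88761 in sched_prio_to_weight).
NICE_0_WEIGHT = 1024
NICE_MINUS_20_WEIGHT = 88761

def effective_weight(tg_weight, grq_weights, i):
    return tg_weight * grq_weights[i] / sum(grq_weights)

ncpus = 256
inflated_tg_w = NICE_0_WEIGHT * ncpus        # 262144
grq = [NICE_0_WEIGHT] + [0] * (ncpus - 1)    # all load on CPU 0
w = effective_weight(inflated_tg_w, grq, 0)
print(w)  # 262144.0, nearly 3x the nice -20 weight
```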
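Tejun's 'enveloping group' suggestion can be sketched the same way: rather than each root thread competing at top level as its own entity, wrap them (for share-calculation purposes) in one synthetic entity with a configurable weight. A hypothetical illustration, not anything in the patch series:

```python
# Sketch of the 'enveloping group' idea: N root threads vs. one cgroup.
NICE_0_WEIGHT = 1024

def shares(weights):
    """CPU share of each top-level competitor."""
    total = sum(weights)
    return [w / total for w in weights]

nr_root_threads = 62
cgroup_w = NICE_0_WEIGHT

# Today: every root thread is its own top-level competitor, so a single
# nice-0-weight cgroup gets only 1/63rd of the CPU.
flat = shares([NICE_0_WEIGHT] * nr_root_threads + [cgroup_w])
print(round(flat[-1], 4))   # 0.0159

# Enveloped: root threads compete as one entity carrying root's weight,
# so the cgroup gets half, as it would against a sibling cgroup.
enveloped = shares([NICE_0_WEIGHT, cgroup_w])
print(enveloped[-1])        # 0.5
```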