From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CDE423C1F54; Wed, 27 May 2026 09:42:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.92.199 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779874951; cv=none; b=nD0G5xyyo06y9yrKWaEf4YL+yafVB6xtXT/qRDKQrIaD2LQoaULFrLTj+CK9IWBKwK/3toiK799yLUt+Km+ERCmcN83dXKUnYndf2ZYCwPq1dy1D0n+YrF2Rw54IH0ixa1rMO99L+8wfv8IYNmfANzb3NwxYc6JigfBCoXX6Pr8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779874951; c=relaxed/simple; bh=+vGMVXR+NPWKz8cGfQUhXwbph1cZUCYSuedasX9zqwY=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=a86AJIBfFTz8mIC6xvJMwixIdvRDKT4rJQjv0OlWaciRaI9OdMciEu8FS+LNlT4GM2cYj8UESrctLGo0bln/fSpWidHfzHs2FE0ndKGY3pp9aSeEfqeuqGCtGldErC0saGssBDmI19EuYs8GBx5ggF9I2kq5Xuuk4Z/BJpok73k= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=DMQnz1bc; arc=none smtp.client-ip=90.155.92.199 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="DMQnz1bc" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=KYAyspqN8xegJ0uQzjnAhIaAXlOqaGb8oyevRGnRuJk=; b=DMQnz1bcANVv4GPs1wfKUZZvQE J10pS9tUyO4i3pspB2ZospXCmKj2047ml3JN5g1hIilwTcS8jMLdfXYx1/alVsDtt+cnFBVaEXESP It9xyRoo8HUJ09mxodPFZkToLLVwyyhrXYHqOK7CVW9UdEwlIKzmflpy/oFfsZlsDcs9mCChWM8+t vWI4rtfmfrZp0mgZkAcm0cT5cmB4GKPZiWFLqo1FXBsy0TsIzmNGHe8c7zkfIUWF0d4Pp8bT9r4cl 4AufbjdP/j+nSSwA0z1S7IONlk7L9YkNWjIh6SObBnufkIh03Ik7al4dEv1J0o03C7DlQMjnFT1be sFDS+JAg==; Received: from 2001-1c00-8d85-4b00-266e-96ff-fe07-7dcc.cable.dynamic.v6.ziggo.nl ([2001:1c00:8d85:4b00:266e:96ff:fe07:7dcc] helo=noisy.programming.kicks-ass.net) by desiato.infradead.org with esmtpsa (Exim 4.99.1 #2 (Red Hat Linux)) id 1wSAm1-0000000DS3V-0tgH; Wed, 27 May 2026 09:42:01 +0000 Received: by noisy.programming.kicks-ass.net (Postfix, from userid 1000) id B153930057C; Wed, 27 May 2026 11:41:59 +0200 (CEST) Date: Wed, 27 May 2026 11:41:59 +0200 From: Peter Zijlstra To: Tejun Heo Cc: mingo@kernel.org, longman@redhat.com, chenridong@huaweicloud.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, jstultz@google.com, kprateek.nayak@amd.com, qyousef@layalina.io Subject: Re: [PATCH v2 00/10] sched: Flatten the pick Message-ID: <20260527094159.GS3126523@noisy.programming.kicks-ass.net> References: <20260511113104.563854162@infradead.org> <20260512081000.GL3102624@noisy.programming.kicks-ass.net> <20260518071456.GO3102624@noisy.programming.kicks-ass.net> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Mon, May 18, 2026 at 09:11:03AM -1000, Tejun Heo wrote: > Hello, Peter. > > On Mon, May 18, 2026 at 09:14:56AM +0200, Peter Zijlstra wrote: > ... > > So the current scheme will inflate the part of A to be double the weight > > (of B), giving them 2 out of 3 parts on the contended CPUs, but then B > > will still get complete / uncontested access to those extra 128 CPUs, > > resulting in a 2:4 weight distribution. > > > > Which also isn't as straight forward as one might think. > > Right, the current behavior isn't quite what people would expect intuitively > either. > > ... > > So for the one contended CPU A gets 256 out of 257 parts, while B gets > > the full CPU for the remaining 255 CPUs, for a: > > > > 256 1 257 > > --- : --- + 255*--- = 256:65535 ~ 1:256 > > 257 257 257 > > > > distribution. While with the new scheme it would be: > > > > 1 1 2 > > - : - + 255*- = 1:511 > > 2 2 2 > > > > Which, realistically isn't all that different, except the old scheme has > > this really large weight to deal with. > > > > So from where I'm sitting, yes different, but it behaves better. FWIW if the workload was single threads per CPU; the above is also the exact behaviour we'd have without cgroups. > I see. Thread cardinality and affinity problems make weight based > distribution such a pain. I wonder whether this can be better solved by > turning it into a two-layer allocation problem - groups to CPUs and then > timeshare on CPUs as necessary. That comes with a lot of its own problems > but it can, aspirationally at least, approximate global weight distribution > and would have better locality properties. If people want, they can already do this today. I don't see a reason to mandate something like that. That is, combine cpuset and cpu in a v2 hierarchy and you get this. The main problem with doing something like that is of course that it isn't always clear how many CPUs will be needed for a particular 'job'. So assigning groups to CPUs isn't a straight forward thing. If I remember, Meta was actually doing some of this. It was dynamically resizing cpusets based on load predictions and the like in order to separate various worloads on the same large machine, right? Anyway, while it is somewhat tedious to change behaviour, I do think it is worth doing in this case.