Message-ID: <83c14639f1d6baa1665ad367c308676e5f951ff7.camel@linux.intel.com>
Subject: Re: [Patch v4 00/22] Cache aware scheduling
From: Tim Chen
To: Qais Yousef, "Chen, Yu C"
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy,
 Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
 Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
 Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
 Josh Don, Gavin Guo, Libo Chen, linux-kernel@vger.kernel.org
Date: Tue, 21 Apr 2026 13:57:39 -0700
In-Reply-To: <20260421003438.whnn2gvv4gkfcmx5@airbuntu>
References: <20260416002749.muyrcycmtabksav4@airbuntu>
	 <20260421003438.whnn2gvv4gkfcmx5@airbuntu>

On Tue, 2026-04-21 at 01:34 +0100, Qais Yousef wrote:
> On 04/20/26 17:01, Chen, Yu C wrote:
> > On 4/16/2026 8:27 AM, Qais Yousef wrote:
> > > On 04/01/26 14:52, Tim Chen wrote:
> >
> > [ ... ]
> >
> > >
> > > I posted the schedqos announcement yesterday, which I think (hope) would be
> > > the right way to address these concerns about tagging tasks.
> > >
> > > https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
> > >

I think that's great. It will be a nice way to tag tasks that should be
grouped and aggregated together.

> > Thanks, I'll take a look at this.
> >
> > > It would be trivial to add an experimental branch to add a new QoS flavour,
> > > to say NUMA_SENSITIVE etc.
> > > I am still trying to think of a generic description to address a number
> > > of use cases (see Execution Profiles in README.md), not just this
> > > particular numa sensitive one, but the experimental branch should help
> > > iterate and drive the kernel development for the wake up path + push lb
> > > instead of using load balance, which I really doubt will work well in
> > > practice since it is slow to react, and you're relying on overcommitting
> > > the system by default by making every task of every process data
> > > dependent and requiring it to be co-located.
> >
> > I am not certain which strategy is preferable, as it largely depends
> > on the use case and workload. We intend to evaluate push-based load
> > balancing on top of the existing lb-based cache-aware placement logic.
>
> I'll defer to Vincent here, but I would have thought the lb-based approach
> can go away.
>
> >
> > > I think in practice admins will care about specific applications to
> > > be kept within a single LLC, and if they are willing to spend the
> > > effort, they can tag specific tasks of a specific application.
> > >
> >
> > It seems to me that there are multiple use cases. In one scenario,
> > the administrator (including daemons) is responsible for tagging
> > workloads. In another, users prefer the OS to handle automatic
> > placement without any userspace involvement.
>
> How do you define this automatic placement? AFAICS you're just grouping all
> tasks of a specific process to stay within the same LLC and hitting
> overcommit issues which you're working around with this load-balancer-only
> approach?

The LLC chosen for aggregation (the preferred LLC) is the one with the most
occupancy by tasks of the process. However, aggregation needs to be done with
the load of both the target LLC and the current LLC in mind. It is better to
keep a task in its current LLC, if plenty of idle CPUs are available there,
than to move it to an LLC where most of its sibling threads run but the CPUs
are frequently busy. This is the main reason why we put the migration logic
in the load balancer, where accurate load information is available and we can
implement a load-aware migration policy.

It is fine to migrate tasks in the wake-up path, but we need to resolve the
issue of over-aggregation, where multiple CPUs may push tasks to an LLC
independently of each other. Over-aggregating and then having to migrate
tasks back out of the LLC makes things worse, with tasks bouncing frequently.
We encountered such issues in our earlier implementations that did task
migrations in the wake-up path.

> I think in practice there will be many corner cases where the state is not
> optimal, and we'd end up with heuristics to 'balance' things out, and
> sensitivity to independent changes disturbing this fragile balance causing
> weird regressions, and us slowly having less flexibility to move and
> shuffle code (okay, maybe too much doom and gloom, but we've been bitten by
> this in the past :)).
>
> I am not sure how many of these tests stressed the system with multiple
> critical processes running concurrently?
>
> By making it a userspace problem, they have to figure out the right balance
> and we can focus on providing the right mechanism.
>
> >
> > > Also, QoS IMHO should be viewed as a scarce resource. For best effort
> > > delivery (which is the best we can do in reality, this is not a hard
> > > real time system), it is easier to provide good best effort when the
> > > average noise level is low, ie: few tasks are required to be kept
> > > within the same LLC. If we overcommit often, we will crumble often. So
> > > IMHO the key is to delegate to userspace to tag, and
> >
> > I suppose there are two scenarios. The first is enabling/disabling
> > aggregation for a group of tasks, and the second is task tagging.
> > For the first scenario, this can be applied either process-wide or
> > cgroup-wide by providing a flag,
>
> Cgroup-wide tagging doesn't make sense IMO. Process-wide yes.
>

I think this depends on the usage scenario. In a private discussion, Vern
from Tencent mentioned that such cgroup-based tagging is useful for them.

Tim

> What does it mean to group all processes in the same cgroup from a cache
> locality PoV? It just seems a random setup based on something specific in
> userspace on how these cgroups are set up that assumes one process per
> group? I don't think we can generalize if that's the case.
>
> Admins can use cpuset to statically partition based on cgroup if they want
> to ensure a group of processes is confined to the same LLC?
>
> > without requiring users to explicitly tag individual tasks. The second
> > scenario is an enhancement to support fine-grained control over a
> > specific task. If schedqos only supports scenario 2, the user has to tag
> > every task to support scenario 1.
>
> You can do tagging process-wide (as I mentioned in the announcement, I
> think it's a poor man's way for quick tagging until people learn to do
> better), not just per-task, eg:
>
> {
>     "PostgreSQL": {
>         "qos": ["QOS_USER_INTERACTIVE", "QOS_NUMA_SENSITIVE"]
>     }
> }
>
> which will tag every task forked by this binary with a cookie, and the QoS
> data dependency tag tells the scheduler that all tasks with the same cookie
> need to stay within the same LLC.
>
> We can bikeshed naming and the tagging details, but the actual
> implementation principle should be the same: keep tasks with the same
> cookie and data dep tag on the same LLC at wake up, and let push lb handle
> the occasional strayed task.
>
> If users want to be smart and tag specific tasks only, the implementation
> would be identical, it's just that there are fewer tasks tagged.