From: Tim Chen <tim.c.chen@linux.intel.com>
To: Qais Yousef <qyousef@layalina.io>, "Chen, Yu C" <yu.c.chen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
Hillf Danton <hdanton@sina.com>,
Shrikanth Hegde <sshegde@linux.ibm.com>,
Jianyong Wu <jianyong.wu@outlook.com>,
Yangyu Chen <cyy@cyyself.name>,
Tingyin Duan <tingyin.duan@gmail.com>,
Vern Hao <vernhao@tencent.com>, Vern Hao <haoxing990@gmail.com>,
Len Brown <len.brown@intel.com>, Aubrey Li <aubrey.li@intel.com>,
Zhao Liu <zhao1.liu@intel.com>, Chen Yu <yu.chen.surf@gmail.com>,
Adam Li <adamli@os.amperecomputing.com>,
Aaron Lu <ziqianlu@bytedance.com>,
Tim Chen <tim.c.chen@intel.com>, Josh Don <joshdon@google.com>,
Gavin Guo <gavinguo@igalia.com>,
Libo Chen <libchen@purestorage.com>,
linux-kernel@vger.kernel.org
Subject: Re: [Patch v4 00/22] Cache aware scheduling
Date: Tue, 21 Apr 2026 13:57:39 -0700 [thread overview]
Message-ID: <83c14639f1d6baa1665ad367c308676e5f951ff7.camel@linux.intel.com> (raw)
In-Reply-To: <20260421003438.whnn2gvv4gkfcmx5@airbuntu>
On Tue, 2026-04-21 at 01:34 +0100, Qais Yousef wrote:
> On 04/20/26 17:01, Chen, Yu C wrote:
> > On 4/16/2026 8:27 AM, Qais Yousef wrote:
> > > On 04/01/26 14:52, Tim Chen wrote:
> >
> > [ ... ]
> >
> > >
> > > I posted schedqos announcement yesterday, which I think (hope) would be the
> > > right way to address these concerns about tagging tasks.
> > >
> > > https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
> > >
> >
I think that's great. It will be a nice way to tag tasks that should be
grouped and aggregated together.
> > Thanks, I'll take a look at this.
> >
> > > It would be trivial to add experimental branch to add new QoS flavour to say
> > > NUMA_SENSITIVE etc. I am still trying to think of a generic description to
> > > address a number of use cases (see Execution Profiles in README.md), not just
> > > this particular numa sensitive one, but the experimental branch should help
> > > iterate and drive the kernel development for wake up path + push lb instead of
> > > using load balance which I really doubt will work well in practice since this
> > > is slow to react, and you're relying on overcommitting the system by default by
> > > making every task of every process data dependent and require it to be
> > > co-located.
> >
> > I am not certain which strategy is preferable, as it largely depends
> > on the use case and workload. We intend to evaluate push-based load
> > balancing on top of the existing lb-based cache-aware placement logic.
>
> I'll defer to Vincent here, but I would have thought lb-based approach can go
> away.
>
> >
> > > I think in practice admins will care about specific applications to
> > > be kept within a single LLC, and if they are willing to spend the effort, they
> > > can tag specific tasks of a specific application.
> > >
> >
> > It seems to me that there are multiple use cases. In one scenario,
> > the administrator (including daemons) is responsible for tagging
> > workloads. In another, users prefer the OS to handle automatic
> > placement without any userspace involvement.
>
> How do you define this automatic placement? AFAICS you're just grouping all
> tasks of a specific process to stay within the same LLC and hitting overcommit
> issues which you're working around with this load balancer only based approach?
The LLC chosen for aggregation (the preferred LLC) is the one with the highest
occupancy by tasks of the process.
However, aggregation needs to take the load on both the target LLC and the
current one into account. It is better to keep a task in its current LLC when
plenty of idle CPUs are available there than to move it to an LLC where most of
the threads are but the CPUs are frequently busy. This is the main reason we
put the migration logic in the load balancer, where accurate load information
is available and a load-aware migration policy can be applied.
It is fine to migrate tasks in the wake-up path, but we need to resolve the
issue of over-aggregation, where multiple CPUs may push tasks to an LLC
independently of each other. Over-aggregation makes things worse through
frequent task bouncing, as tasks then have to be migrated out of the LLC again.
We encountered such issues in our earlier implementations that performed task
migrations in the wake-up path.
>
> I think in practice there will be many corner cases where state is not optimal
> and we'd end up with heuristics to 'balance' things out and sensitivity to
> independent changes disturbing this fragile balance causing weird regressions
> and us slowly having less flexibility to move and shuffle code (okay, maybe too
> much doom and gloom, but we've been by this in the past :)).
>
> I am not sure how many of these tests stressed the system with multiple
> critical processes running concurrently?
>
> By making it a userspace problem they have to figure out the right balance and
> we can focus on providing the right mechanism.
>
> >
> > > Also QoS IMHO should be viewed as a scarce resource. For best effort delivery
> > > (which is the best we can do in reality, this is not hard real time system), it
> > > is easier to provide good best effort when the average noise level is low, ie:
> > > few tasks are required to be kept within the same LLC. If we overcommit often,
> > > we will crumble often. So IMHO the key is to delegate to userspace to tag, and
> >
> > I suppose there are two scenarios. The first is enabling/disabling
> > aggregation
> > for a group of tasks, and the second is task tagging. For the first
> > scenario,
> > this can be applied either process-wide or cgroup-wide by providing a flag,
>
> Cgroup-wide tagging doesn't make sense IMO. Process-wide yes.
>
I think this depends on the usage scenario. In a private discussion, Vern
from Tencent mentioned that such cgroup-based tagging is useful for them.
Tim
> What does it mean to group all processes in the same cgroup from cache locality
> PoV? It just seems random setup based on something specific in userspace on how
> these cgroups are setup that assumes one process per group? I don't think we
> can generalize if that's the case.
>
> Admins can use cpuset to statically partition based on cgroup if they want to
> ensure a group of processes are confined to the same LLC?
>
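For reference, the static cpuset partitioning mentioned above would look
roughly like this with cgroup v2 (a configuration sketch; the group name and
the CPU range 0-7 standing in for one LLC are placeholders, and the real
topology should be queried first):

```shell
# Find which CPUs share an LLC (L3) with CPU 0:
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

# Confine a group of processes to those CPUs via cgroup v2 cpuset
# (assumes cgroup2 is mounted at /sys/fs/cgroup and we run as root):
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/llc0grp
echo "0-7" > /sys/fs/cgroup/llc0grp/cpuset.cpus
echo $$ > /sys/fs/cgroup/llc0grp/cgroup.procs   # move this shell in
```

This gives hard static confinement, at the cost of losing any spare capacity
outside the chosen LLC.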
> > without requiring users to explicitly tag individual tasks. The second
> > scenario
> > is an enhancement to support fine-grained control over a specific task. If
> > schedqos only supports scenario2, the user has to tag every task to support
> > scenario1.
>
> You can do tagging process-wide (as I mentioned in the announcement, I think
> it's a poor man's way for quick tagging until people learn to do better), not
> just per-task eg:
>
> {
> "PostgreSQL": {
> "qos": ["QOS_USER_INTERACTIVE", "QOS_NUMA_SENSITIVE"]
> }
> }
>
> which will tag every task forked by this binary with a cookie and a QoS data
> dependency tag, telling the scheduler that all tasks with the same cookie need
> to stay within the same LLC.
>
> We can bikeshed naming and the tagging details, but the actual implementation
> principle should be the same: keep tasks with the same cookie and data dep tag
> on the same LLC at wake up; and let push lb handle occasional strayed task.
>
> If users want to be smart and tag specific tasks only, the implementation would
> be identical, it's just there are fewer tasks tagged.
2026-04-01 21:52 [Patch v4 00/22] Cache aware scheduling Tim Chen
2026-04-01 21:52 ` [Patch v4 01/22] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
2026-04-09 12:41 ` Peter Zijlstra
2026-04-09 19:21 ` Tim Chen
2026-04-09 23:00 ` Peter Zijlstra
2026-04-10 6:30 ` Chen, Yu C
2026-04-15 2:06 ` Vern Hao
2026-04-15 3:34 ` Chen, Yu C
2026-04-01 21:52 ` [Patch v4 02/22] sched/cache: Limit the scan number of CPUs when calculating task occupancy Tim Chen
2026-04-09 13:17 ` Luo Gengkun
2026-04-09 13:41 ` Peter Zijlstra
2026-04-10 10:12 ` Luo Gengkun
2026-04-10 7:29 ` Chen, Yu C
2026-04-10 10:20 ` Luo Gengkun
2026-04-10 17:12 ` Tim Chen
2026-04-10 17:27 ` Chen, Yu C
2026-04-13 7:23 ` [RFC PATCH] sched/fair: dynamically scale the period of cache work Jianyong Wu
2026-04-13 8:38 ` Chen, Yu C
2026-04-13 11:27 ` Jianyong Wu
2026-04-15 3:31 ` Chen, Yu C
2026-04-16 3:39 ` Jianyong Wu
2026-04-15 17:22 ` Tim Chen
2026-04-16 6:50 ` Jianyong Wu
2026-04-14 15:07 ` [PATCH v2] sched/cache: Reduce the overhead of task_cache_work by only scan the visisted cpus Luo Gengkun
2026-04-15 3:10 ` Chen, Yu C
2026-04-18 9:01 ` Luo Gengkun
2026-04-20 7:53 ` Chen, Yu C
2026-04-23 8:54 ` [PATCH v3] " Luo Gengkun
2026-04-01 21:52 ` [Patch v4 03/22] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions Tim Chen
2026-04-01 21:52 ` [Patch v4 04/22] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
2026-04-01 21:52 ` [Patch v4 05/22] sched/cache: Make LLC id continuous Tim Chen
2026-04-01 21:52 ` [Patch v4 06/22] sched/cache: Assign preferred LLC ID to processes Tim Chen
2026-04-01 21:52 ` [Patch v4 07/22] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
2026-04-01 21:52 ` [Patch v4 08/22] sched/cache: Introduce per CPU's tasks LLC preference counter Tim Chen
2026-04-01 21:52 ` [Patch v4 09/22] sched/cache: Calculate the percpu sd task LLC preference Tim Chen
2026-04-01 21:52 ` [Patch v4 10/22] sched/cache: Count tasks prefering destination LLC in a sched group Tim Chen
2026-04-01 21:52 ` [Patch v4 11/22] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
2026-04-01 21:52 ` [Patch v4 12/22] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
2026-04-01 21:52 ` [Patch v4 13/22] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
2026-04-01 21:52 ` [Patch v4 14/22] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
2026-04-01 21:52 ` [Patch v4 15/22] sched/cache: Respect LLC preference in task migration and detach Tim Chen
2026-04-01 21:52 ` [Patch v4 16/22] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
2026-04-09 12:43 ` Peter Zijlstra
2026-04-09 19:27 ` Tim Chen
2026-04-01 21:52 ` [Patch v4 17/22] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2026-04-09 12:46 ` Peter Zijlstra
2026-04-09 12:55 ` Peter Zijlstra
2026-04-10 8:59 ` Chen, Yu C
2026-04-10 9:20 ` Peter Zijlstra
2026-04-01 21:52 ` [Patch v4 18/22] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node Tim Chen
2026-04-09 13:37 ` Peter Zijlstra
2026-04-09 19:39 ` Tim Chen
2026-04-01 21:52 ` [Patch v4 19/22] sched/cache: Allow the user space to turn on and off cache aware scheduling Tim Chen
2026-04-01 21:52 ` [Patch v4 20/22] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling Tim Chen
2026-04-01 21:52 ` [Patch v4 21/22] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
2026-04-01 21:52 ` [Patch v4 22/22] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen
2026-04-09 13:54 ` [Patch v4 00/22] Cache aware scheduling Peter Zijlstra
2026-04-09 20:02 ` Tim Chen
2026-04-14 3:20 ` Duan Tingyin
2026-04-15 17:35 ` Tim Chen
2026-04-16 0:27 ` Qais Yousef
2026-04-20 9:01 ` Chen, Yu C
2026-04-21 0:34 ` Qais Yousef
2026-04-21 20:57 ` Tim Chen [this message]