From: Tim Chen <tim.c.chen@linux.intel.com>
To: Vern Hao <haoxing990@gmail.com>, "Chen, Yu C" <yu.c.chen@intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>, Len Brown <len.brown@intel.com>,
	Aubrey Li <aubrey.li@intel.com>, Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>
Subject: Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
Date: Thu, 15 Jan 2026 13:47:06 -0800
Message-ID: <d188e11a610cd652ad18a83cf325db54f4938537.camel@linux.intel.com>
In-Reply-To: <7d5bb7c4-abc5-470e-84fe-72a3b1d3a2f4@gmail.com>

On Wed, 2025-12-17 at 09:17 +0800, Vern Hao wrote:
> On 2025/12/16 14:12, Chen, Yu C wrote:
> > On 12/11/2025 5:03 PM, Vern Hao wrote:
> > > Hi, Peter, Chen Yu and Tim:
> > > 
> > > On 2025/12/4 07:07, Tim Chen wrote:
> > > > From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> > > > 
> > > > Adds infrastructure to enable cache-aware load balancing,
> > > > which improves cache locality by grouping tasks that share resources
> > > > within the same cache domain. This reduces cache misses and improves
> > > > overall data access efficiency.
> > > > 
> > > > In this initial implementation, threads belonging to the same process
> > > > are treated as entities that likely share working sets. The mechanism
> > > > tracks per-process CPU occupancy across cache domains and attempts to
> > > > migrate threads toward cache-hot domains where their process already
> > > > has active threads, thereby enhancing locality.
> > > > 
> > > > This provides a basic model for cache affinity. While the current code
> > > > targets the last-level cache (LLC), the approach could be extended to
> > > > other domain types such as clusters (L2) or node-internal groupings.
> > > > 
> > > > At present, the mechanism selects the CPU within an LLC that has the
> > > > highest recent runtime. Subsequent patches in this series will use this
> > > > information in the load-balancing path to guide task placement toward
> > > > preferred LLCs.
> > > > 
> > > > In the future, more advanced policies could be integrated through NUMA
> > > > > balancing, for example, migrating a task to its preferred LLC when spare
> > > > > capacity exists, or swapping tasks across LLCs to improve cache
> > > > > affinity.
> > > > Grouping of tasks could also be generalized from that of a process
> > > > to be that of a NUMA group, or be user configurable.
> > > > 
> > > > Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > ---
> > > > 
> > > > Notes:
> > > >      v1->v2:
> > > >         Restore the original CPU scan to cover all online CPUs,
> > > >         rather than scanning within the preferred NUMA node.
> > > >         (Peter Zijlstra)
> > > >         Use rq->curr instead of rq->donor. (K Prateek Nayak)
> > > >         Minor fix in task_tick_cache() to use
> > > >         if (mm->mm_sched_epoch >= rq->cpu_epoch)
> > > >         to avoid mm_sched_epoch going backwards.
> > > > 
> > > >   include/linux/mm_types.h |  44 +++++++
> > > >   include/linux/sched.h    |  11 ++
> > > >   init/Kconfig             |  11 ++
> > > >   kernel/fork.c            |   6 +
> > > >   kernel/sched/core.c      |   6 +
> > > > >   kernel/sched/fair.c      | 258 +++++++++++++++++++++++++++++++++++++++
> > > >   kernel/sched/sched.h     |   8 ++
> > > >   7 files changed, 344 insertions(+)
> > > > 
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 90e5790c318f..1ea16ef90566 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -939,6 +939,11 @@ typedef struct {
> > > >       DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
> > > >   } __private mm_flags_t;
> > > > +struct mm_sched {
> > > > +    u64 runtime;
> > > > +    unsigned long epoch;
> > > > +};
> > > > +
> > > >   struct kioctx_table;
> > > >   struct iommu_mm_data;
> > > >   struct mm_struct {
> > > > @@ -1029,6 +1034,17 @@ struct mm_struct {
> > > >            */
> > > >           raw_spinlock_t cpus_allowed_lock;
> > > >   #endif
> > > > +#ifdef CONFIG_SCHED_CACHE
> > > > +        /*
> > > > > +         * Track per-cpu-per-process occupancy as a proxy for cache residency.
> > > > +         * See account_mm_sched() and ...
> > > > +         */
> > > > +        struct mm_sched __percpu *pcpu_sched;
> > > > +        raw_spinlock_t mm_sched_lock;
> > > > +        unsigned long mm_sched_epoch;
> > > > +        int mm_sched_cpu;
> > > > As we discussed earlier, I still believe that dedicating
> > > > 'mm_sched_cpu' to the aggregated hotspot of all threads is
> > > > inappropriate, as the threads in our real application are not
> > > > necessarily correlated.
> > > > 
> > > > So I was wondering if we could move this variable into struct
> > > > task_struct. That would let us track the hotspot CPU of each
> > > > thread, although some details need consideration.
> > > 
> > 
> > I suppose you are suggesting fine-grained control for a set of tasks.
> > Process-scope aggregation could be a start as the default strategy
> > (conservative, benefits multi-threaded workloads that share data per
> > process, does not introduce regressions).
> 
> Yes, in our real-world business scenarios at Tencent, I have indeed
> encountered this issue: the threads are divided into several
> categories that handle different transactions, so they do not share
> hot data and 'mm_sched_cpu' does not represent all of the tasks.
> Adding a control interface, such as cgroup or something similar, would
> be a good idea.
> 

Yes, grouping and aggregating tasks by process will not cover
your usage scenario. Chen Yu and I have discussed this at some length,
and here are our thoughts.

In the initial version of cache-aware scheduling, process-based aggregation
is a sensible default. Once this basic option is merged in mainline, we will
consider adding other options for task grouping, for example, setting a flag
in the cgroup cpu controller to indicate that tasks in a cgroup could benefit
from being consolidated in an LLC.

We think you could put the threads of each category into a cgroup of
their own.  Would that meet your needs?
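
To make the difference concrete, here is a rough userspace sketch (not
kernel code, and not taken from the patch set) of picking a preferred LLC
from accumulated runtime, either over the whole process or per group.
The names struct sample, preferred_llc() and NR_LLCS are made up for
illustration, and "most accumulated runtime wins" is a simplifying
assumption.

```c
#include <assert.h>
#include <stddef.h>

#define NR_LLCS 4	/* assumed number of LLC domains in this model */

/* One runtime sample: a thread of `group` ran for `runtime_ns` on `llc`. */
struct sample {
	int group;
	int llc;
	unsigned long long runtime_ns;
};

/*
 * Return the LLC with the most accumulated runtime for @group, or -1 if
 * no sample matches. Passing group == -1 aggregates over every sample,
 * which models the process-wide default.
 */
int preferred_llc(const struct sample *s, size_t n, int group)
{
	unsigned long long occ[NR_LLCS] = { 0 };
	int best = -1;
	size_t i;
	int llc;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (s[i].llc < 0 || s[i].llc >= NR_LLCS)
			continue;	/* ignore samples outside the model */
		if (group >= 0 && s[i].group != group)
			continue;	/* sample belongs to another group */
		occ[s[i].llc] += s[i].runtime_ns;
	}

	for (llc = 0; llc < NR_LLCS; llc++) {
		if (occ[llc] && (best < 0 || occ[llc] > occ[best]))
			best = llc;
	}
	return best;
}
```

With samples where group 0 runs mostly on LLC 0 and group 1 runs mostly
on LLC 1, the process-wide call can return LLC 0 for everyone, while the
per-group calls keep each category on its own LLC.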

Fields like mm_sched_cpu will be abstracted out: the grouping structure
currently in mm will become a cache_group, so we will have something like
cache_group_sched_cpu instead of mm_sched_cpu.
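
As a very rough sketch of what that abstraction might look like (a
userspace model only; the struct layouts, the cache_aware flag and the
helper names below are assumptions for illustration, not the planned
kernel interface):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping-agnostic state; today these fields live in mm_struct. */
struct cache_group {
	int sched_cpu;		/* was mm_sched_cpu; -1 means no preference */
	unsigned long epoch;	/* was mm_sched_epoch */
};

/* Simplified stand-ins for the kernel structures, for illustration only. */
struct mm_model {
	struct cache_group cg;
};

struct cgroup_model {
	int cache_aware;	/* assumed opt-in flag on the cgroup */
	struct cache_group cg;
};

struct task_model {
	struct mm_model *mm;
	struct cgroup_model *cgrp;
};

void cache_group_init(struct cache_group *cg)
{
	assert(cg != NULL);
	cg->sched_cpu = -1;
	cg->epoch = 0;
}

/*
 * Pick the group a task is aggregated under: its cgroup if that cgroup
 * opted in, otherwise the process (mm), which stays the default.
 */
struct cache_group *task_cache_group(struct task_model *t)
{
	assert(t != NULL);
	if (t->cgrp && t->cgrp->cache_aware)
		return &t->cgrp->cg;
	return t->mm ? &t->mm->cg : NULL;
}

/* Preferred CPU of whichever group the task currently belongs to. */
int cache_group_sched_cpu(struct task_model *t)
{
	struct cache_group *cg = task_cache_group(t);

	return cg ? cg->sched_cpu : -1;
}
```

The point of the indirection is that the rest of the scheduler code would
only see a cache_group and would not care whether it came from a process
or from a cgroup.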

Tim

> > 
> > On top of that, I wonder if we could provide task-scope control like
> > sched_setattr(), similar to the core-scheduling cookie mechanism, for
> > users that want aggressive aggregation. But before doing that, we need a
> > mechanism that leverages a monitoring system (like the PMU) to figure out
> That may be a problem if the environment is running in a VM. We
> could use tags to differentiate these tasks and run some tests to verify
> the performance difference between unifying |mm_sched_cpu| and not
> unifying it.
> > if putting these tasks together would bring a benefit (if I understood
> > Steven's suggestion at LPC correctly), or to detect tasks that share
> > resources, and then maybe leverage QoS interfaces to enable cache-aware
> > aggregation (something Qias mentioned at LPC).
> > 
> > thanks,
> > Chenyu
> > 

