Re: [RFC patch v3 00/20] Cache aware scheduling

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
To: "Chen, Yu C" <yu.c.chen@intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Libo Chen <libo.chen@oracle.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	Hillf Danton <hdanton@sina.com>, Len Brown <len.brown@intel.com>,
	linux-kernel@vger.kernel.org,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Date: Tue, 24 Jun 2025 23:17:43 +0530	[thread overview]
Message-ID: <02a4da67-f681-425d-b3dd-3ddf10265a64@linux.ibm.com> (raw)
In-Reply-To: <c9328e19-3b18-4ea3-a692-9cb02534e5c9@intel.com>

Hi Chen,

On 22/06/25 06:09, Chen, Yu C wrote:
> On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
>> Hi Tim,
>>
>> On 18/06/25 23:57, Tim Chen wrote:
>>> This is the third revision of the cache aware scheduling patches,
>>> based on the original patch proposed by Peter[1].
>>>   The goal of the patch series is to aggregate tasks sharing data
>>> to the same cache domain, thereby reducing cache bouncing and
>>> cache misses, and improve data access efficiency. In the current
>>> implementation, threads within the same process are considered
>>> as entities that potentially share resources.
>>>   In previous versions, aggregation of tasks were done in the
>>> wake up path, without making load balancing paths aware of
>>> LLC (Last-Level-Cache) preference. This led to the following
>>> problems:
>>>
>>> 1) Aggregation of tasks during wake up led to load imbalance
>>>     between LLCs
>>> 2) Load balancing tried to even out the load between LLCs
>>> 3) Wake up tasks aggregation happened at a faster rate and
>>>     load balancing moved tasks in opposite directions, leading
>>>     to continuous and excessive task migrations and regressions
>>>     in benchmarks like schbench.
>>>
>>> In this version, load balancing is made cache-aware. The main
>>> idea of cache-aware load balancing consists of two parts:
>>>
>>> 1) Identify tasks that prefer to run on their hottest LLC and
>>>     move them there.
>>> 2) Prevent generic load balancing from moving a task out of
>>>     its hottest LLC.
>>>
>>> By default, LLC task aggregation during wake-up is disabled.
>>> Conversely, cache-aware load balancing is enabled by default.
>>> For easier comparison, two scheduler features are introduced:
>>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>>> wake up and cache-aware load balancing, respectively. By default,
>>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
>>> is only done on load balancing.
>>
>> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
>> LLC on this platform spans 4 threads.
>>
>> schbench:
>>                          baseline (sd%)        baseline+cacheaware (sd%)      %change
>> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
>> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
>> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
>> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
>>
>> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
>> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
>> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
>> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
>>
>> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
>> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
>> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
>> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
>>
>> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
>> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
>> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
>> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
>>
>> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
>> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
>> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
>> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
>>
>> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
>> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
>> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
>> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
>>
>> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
>> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
>> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
>> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
>>
>> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
>> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
>> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
>> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
>>
>> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
>> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
>> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
>> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
>>
>> The above data shows mostly regression both in the lesser and
>> higher load cases.
>>
>>
>> Hackbench pipe:
>>
>> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
>> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
>> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
>> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
>> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
>> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
>> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
>> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
>> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
>>
>> A lot of run to run variation is seen in hackbench runs. So hard to tell
>> on the performance but looks better than schbench.
> 
> May I know if the cpu frequency was set at a fixed level and deep
> cpu idle states were disabled(I assume on power system it is called
> stop states?)

Deep cpu idle state is called 'cede' in PowerVM LPAR. I have not disabled
it.

> 
>>
>> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
>> when compared to platforms like sapphire rapids and Milan. Didn't go
>> through this series yet. Will go through and try to understand why
>> schbench is not happy on Power systems.
>>
>> Meanwhile, Wanted to know your thoughts on how does smaller LLC
>> size get impacted with this patch?
>>
> 
> task aggregation on smaller LLC domain(both in terms of the
> number of CPUs and the size of LLC) might bring cache contention
> and hurt performance IMO. May I know what is the cache size on
> your system:
> lscpu | grep "L3 cache"

L3 cache: 224 MiB (56 instances)

> 
> May I know if you tested it with:
> echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features
> 
> vs
> 
> echo SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features
> 

I have tested with and without this patch series. Didn't change
any sched feature. So, the patched kernel was running with the default
settings:
SCHED_CACHE, NO_SCHED_CACHE_WAKE, and SCHED_CACHE_LB.


> And could you help check if setting /sys/kernel/debug/sched/llc_aggr_cap
> from 50 to some smaller values(25, etc) would help?

Will give it a try.

Thanks,
Madadi Vineeth Reddy

> 
> thanks,
> Chenyu
> 
>> Thanks,
>> Madadi Vineeth Reddy
>>
>>
>>>
>>> With above default settings, task migrations occur less frequently
>>> and no longer happen in the latency-sensitive wake-up path.
>>>
>>
>> [..snip..]
>>
>>>
>>> Chen Yu (3):
>>>    sched: Several fixes for cache aware scheduling
>>>    sched: Avoid task migration within its preferred LLC
>>>    sched: Save the per LLC utilization for better cache aware scheduling
>>>
>>> K Prateek Nayak (1):
>>>    sched: Avoid calculating the cpumask if the system is overloaded
>>>
>>> Peter Zijlstra (1):
>>>    sched: Cache aware load-balancing
>>>
>>> Tim Chen (15):
>>>    sched: Add hysteresis to switch a task's preferred LLC
>>>    sched: Add helper function to decide whether to allow cache aware
>>>      scheduling
>>>    sched: Set up LLC indexing
>>>    sched: Introduce task preferred LLC field
>>>    sched: Calculate the number of tasks that have LLC preference on a
>>>      runqueue
>>>    sched: Introduce per runqueue task LLC preference counter
>>>    sched: Calculate the total number of preferred LLC tasks during load
>>>      balance
>>>    sched: Tag the sched group as llc_balance if it has tasks prefer other
>>>      LLC
>>>    sched: Introduce update_llc_busiest() to deal with groups having
>>>      preferred LLC tasks
>>>    sched: Introduce a new migration_type to track the preferred LLC load
>>>      balance
>>>    sched: Consider LLC locality for active balance
>>>    sched: Consider LLC preference when picking tasks from busiest queue
>>>    sched: Do not migrate task if it is moving out of its preferred LLC
>>>    sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>>    sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>>      up
>>>
>>>   include/linux/mm_types.h       |  44 ++
>>>   include/linux/sched.h          |   8 +
>>>   include/linux/sched/topology.h |   3 +
>>>   init/Kconfig                   |   4 +
>>>   init/init_task.c               |   3 +
>>>   kernel/fork.c                  |   5 +
>>>   kernel/sched/core.c            |  25 +-
>>>   kernel/sched/debug.c           |   4 +
>>>   kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>>>   kernel/sched/features.h        |   3 +
>>>   kernel/sched/sched.h           |  23 +
>>>   kernel/sched/topology.c        |  29 ++
>>>   12 files changed, 982 insertions(+), 28 deletions(-)
>>>
>>

next prev parent reply	other threads:[~2025-06-24 17:48 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
2025-06-26 12:23   ` Jianyong Wu
2025-06-26 13:32     ` Chen, Yu C
2025-06-27  0:10       ` Tim Chen
2025-06-27  2:13         ` Jianyong Wu
2025-07-03 19:29   ` Shrikanth Hegde
2025-07-04  8:40     ` Chen, Yu C
2025-07-04  8:45       ` Peter Zijlstra
2025-07-04  8:54         ` Shrikanth Hegde
2025-07-07 19:57     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
2025-07-03 19:33   ` Shrikanth Hegde
2025-07-07 21:02     ` Tim Chen
2025-07-08  1:15   ` Libo Chen
2025-07-08  7:54     ` Chen, Yu C
2025-07-08 15:47       ` Libo Chen
2025-06-18 18:27 ` [RFC patch v3 03/20] sched: Avoid task migration within its preferred LLC Tim Chen
2025-06-18 18:27 ` [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded Tim Chen
2025-07-03 19:39   ` Shrikanth Hegde
2025-07-07 14:57     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC Tim Chen
2025-07-02  6:47   ` Madadi Vineeth Reddy
2025-07-02 21:47     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 06/20] sched: Save the per LLC utilization for better cache aware scheduling Tim Chen
2025-06-18 18:27 ` [RFC patch v3 07/20] sched: Add helper function to decide whether to allow " Tim Chen
2025-07-08  0:41   ` Libo Chen
2025-07-08  8:29     ` Chen, Yu C
2025-07-08 17:22       ` Libo Chen
2025-07-09 14:41         ` Chen, Yu C
2025-07-09 21:31           ` Libo Chen
2025-07-08 21:59     ` Tim Chen
2025-07-09 21:22       ` Libo Chen
2025-06-18 18:27 ` [RFC patch v3 08/20] sched: Set up LLC indexing Tim Chen
2025-07-03 19:44   ` Shrikanth Hegde
2025-07-04  9:36     ` Chen, Yu C
2025-06-18 18:27 ` [RFC patch v3 09/20] sched: Introduce task preferred LLC field Tim Chen
2025-06-18 18:27 ` [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue Tim Chen
2025-07-03 19:45   ` Shrikanth Hegde
2025-07-04 15:00     ` Chen, Yu C
2025-06-18 18:27 ` [RFC patch v3 11/20] sched: Introduce per runqueue task LLC preference counter Tim Chen
2025-06-18 18:28 ` [RFC patch v3 12/20] sched: Calculate the total number of preferred LLC tasks during load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 13/20] sched: Tag the sched group as llc_balance if it has tasks prefer other LLC Tim Chen
2025-06-18 18:28 ` [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks Tim Chen
2025-07-03 19:52   ` Shrikanth Hegde
2025-07-05  2:26     ` Chen, Yu C
2025-06-18 18:28 ` [RFC patch v3 15/20] sched: Introduce a new migration_type to track the preferred LLC load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 16/20] sched: Consider LLC locality for active balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 17/20] sched: Consider LLC preference when picking tasks from busiest queue Tim Chen
2025-06-18 18:28 ` [RFC patch v3 18/20] sched: Do not migrate task if it is moving out of its preferred LLC Tim Chen
2025-06-18 18:28 ` [RFC patch v3 19/20] sched: Introduce SCHED_CACHE_LB to control cache aware load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 20/20] sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake up Tim Chen
2025-06-19  6:39 ` [RFC patch v3 00/20] Cache aware scheduling Yangyu Chen
2025-06-19 13:21   ` Chen, Yu C
2025-06-19 14:12     ` Yangyu Chen
2025-06-20 19:25 ` Madadi Vineeth Reddy
2025-06-22  0:39   ` Chen, Yu C
2025-06-24 17:47     ` Madadi Vineeth Reddy [this message]
2025-06-23 16:45   ` Tim Chen
2025-06-24  5:00 ` K Prateek Nayak
2025-06-24 12:16   ` Chen, Yu C
2025-06-25  4:19     ` K Prateek Nayak
2025-06-25  0:30   ` Tim Chen
2025-06-25  4:30     ` K Prateek Nayak
2025-07-03 20:00   ` Shrikanth Hegde
2025-07-04 10:09     ` Chen, Yu C
2025-07-09 19:39 ` Madadi Vineeth Reddy
2025-07-10  3:33   ` Chen, Yu C

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=02a4da67-f681-425d-b3dd-3ddf10265a64@linux.ibm.com \
    --to=vineethr@linux.ibm.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=gautham.shenoy@amd.com \
    --cc=hdanton@sina.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=len.brown@intel.com \
    --cc=libo.chen@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tim.c.chen@intel.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=wuyun.abel@bytedance.com \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).