From: "Chen, Yu C" <yu.c.chen@intel.com>
To: Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
"Ingo Molnar" <mingo@redhat.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
"Gautham R . Shenoy" <gautham.shenoy@amd.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Tim Chen <tim.c.chen@intel.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Libo Chen <libo.chen@oracle.com>,
Abel Wu <wuyun.abel@bytedance.com>,
Hillf Danton <hdanton@sina.com>, Len Brown <len.brown@intel.com>,
<linux-kernel@vger.kernel.org>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Date: Sun, 22 Jun 2025 08:39:38 +0800
Message-ID: <c9328e19-3b18-4ea3-a692-9cb02534e5c9@intel.com>
In-Reply-To: <8c98fff7-fef3-494a-98a3-4b6d4cc2e6d1@linux.ibm.com>

On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
> Hi Tim,
>
> On 18/06/25 23:57, Tim Chen wrote:
>> This is the third revision of the cache aware scheduling patches,
>> based on the original patch proposed by Peter[1].
>>
>> The goal of the patch series is to aggregate tasks sharing data
>> to the same cache domain, thereby reducing cache bouncing and
>> cache misses, and improve data access efficiency. In the current
>> implementation, threads within the same process are considered
>> as entities that potentially share resources.
>>
>> In previous versions, aggregation of tasks was done in the
>> wake up path, without making load balancing paths aware of
>> LLC (Last-Level-Cache) preference. This led to the following
>> problems:
>>
>> 1) Aggregation of tasks during wake up led to load imbalance
>> between LLCs
>> 2) Load balancing tried to even out the load between LLCs
>> 3) Wake-up task aggregation happened at a faster rate and
>> load balancing moved tasks in opposite directions, leading
>> to continuous and excessive task migrations and regressions
>> in benchmarks like schbench.
>>
>> In this version, load balancing is made cache-aware. The main
>> idea of cache-aware load balancing consists of two parts:
>>
>> 1) Identify tasks that prefer to run on their hottest LLC and
>> move them there.
>> 2) Prevent generic load balancing from moving a task out of
>> its hottest LLC.
>>
>> By default, LLC task aggregation during wake-up is disabled.
>> Conversely, cache-aware load balancing is enabled by default.
>> For easier comparison, two scheduler features are introduced:
>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>> wake up and cache-aware load balancing, respectively. By default,
>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
>> is only done during load balancing.
>
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.
>
> schbench:
> baseline (sd%) baseline+cacheaware (sd%) %change
> Lat 50.0th-worker-1 6.33 (24.12%) 6.00 (28.87%) 5.21%
> Lat 90.0th-worker-1 7.67 ( 7.53%) 7.67 (32.83%) 0.00%
> Lat 99.0th-worker-1 8.67 ( 6.66%) 9.33 (37.63%) -7.61%
> Lat 99.9th-worker-1 21.33 (63.99%) 12.33 (28.47%) 42.19%
>
> Lat 50.0th-worker-2 4.33 (13.32%) 5.67 (10.19%) -30.95%
> Lat 90.0th-worker-2 5.67 (20.38%) 7.67 ( 7.53%) -35.27%
> Lat 99.0th-worker-2 7.33 ( 7.87%) 8.33 ( 6.93%) -13.64%
> Lat 99.9th-worker-2 11.67 (24.74%) 10.33 (11.17%) 11.48%
>
> Lat 50.0th-worker-4 5.00 ( 0.00%) 7.00 ( 0.00%) -40.00%
> Lat 90.0th-worker-4 7.00 ( 0.00%) 9.67 ( 5.97%) -38.14%
> Lat 99.0th-worker-4 8.00 ( 0.00%) 11.33 (13.48%) -41.62%
> Lat 99.9th-worker-4 10.33 ( 5.59%) 14.00 ( 7.14%) -35.53%
>
> Lat 50.0th-worker-8 4.33 (13.32%) 5.67 (10.19%) -30.95%
> Lat 90.0th-worker-8 6.33 (18.23%) 8.67 ( 6.66%) -36.99%
> Lat 99.0th-worker-8 7.67 ( 7.53%) 10.33 ( 5.59%) -34.69%
> Lat 99.9th-worker-8 10.00 (10.00%) 12.33 ( 4.68%) -23.30%
>
> Lat 50.0th-worker-16 4.00 ( 0.00%) 5.00 ( 0.00%) -25.00%
> Lat 90.0th-worker-16 6.33 ( 9.12%) 7.67 ( 7.53%) -21.21%
> Lat 99.0th-worker-16 8.00 ( 0.00%) 10.33 ( 5.59%) -29.13%
> Lat 99.9th-worker-16 12.00 ( 8.33%) 13.33 ( 4.33%) -11.08%
>
> Lat 50.0th-worker-32 5.00 ( 0.00%) 5.33 (10.83%) -6.60%
> Lat 90.0th-worker-32 7.00 ( 0.00%) 8.67 (17.63%) -23.86%
> Lat 99.0th-worker-32 10.67 (14.32%) 12.67 ( 4.56%) -18.75%
> Lat 99.9th-worker-32 14.67 ( 3.94%) 19.00 (13.93%) -29.49%
>
> Lat 50.0th-worker-64 5.33 (10.83%) 6.67 ( 8.66%) -25.14%
> Lat 90.0th-worker-64 10.00 (17.32%) 14.33 ( 4.03%) -43.30%
> Lat 99.0th-worker-64 14.00 ( 7.14%) 16.67 ( 3.46%) -19.07%
> Lat 99.9th-worker-64 55.00 (56.69%) 47.00 (61.92%) 14.55%
>
> Lat 50.0th-worker-128 8.00 ( 0.00%) 8.67 (13.32%) -8.38%
> Lat 90.0th-worker-128 13.33 ( 4.33%) 14.33 ( 8.06%) -7.50%
> Lat 99.0th-worker-128 16.00 ( 0.00%) 20.00 ( 8.66%) -25.00%
> Lat 99.9th-worker-128 2258.33 (83.80%) 2974.67 (21.82%) -31.72%
>
> Lat 50.0th-worker-256 47.67 ( 2.42%) 45.33 ( 3.37%) 4.91%
> Lat 90.0th-worker-256 3470.67 ( 1.88%) 3558.67 ( 0.47%) -2.54%
> Lat 99.0th-worker-256 9040.00 ( 2.76%) 9050.67 ( 0.41%) -0.12%
> Lat 99.9th-worker-256 13824.00 (20.07%) 13104.00 ( 6.84%) 5.21%
>
> The above data shows mostly regressions in both the lower and
> higher load cases.
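For anyone re-deriving the table: the schbench %change column appears
consistent with (baseline - patched) / baseline * 100, so a negative value
means higher (worse) latency with the patches. A minimal sketch of that
arithmetic, assuming this formula; the helper name pct_change is made up
for illustration:

```shell
#!/bin/sh
# Sketch: percentage change of a latency value relative to baseline.
# Assumed formula: (baseline - patched) / baseline * 100.
# Positive = latency went down (better), negative = latency went up (worse).
pct_change() {
    # $1 = baseline value, $2 = patched value
    awk -v b="$1" -v p="$2" 'BEGIN { printf "%.2f%%\n", (b - p) / b * 100 }'
}
```

For example, `pct_change 6.33 6.00` prints 5.21%, matching the
Lat 50.0th-worker-1 row.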
>
>
> Hackbench pipe:
>
> Pairs Baseline Avg (s) (Std%) Patched Avg (s) (Std%) % Change
> 2 2.987 (1.19%) 2.414 (17.99%) 24.06%
> 4 7.702 (12.53%) 7.228 (18.37%) 6.16%
> 8 14.141 (1.32%) 13.109 (1.46%) 7.29%
> 15 27.571 (6.53%) 29.460 (8.71%) -6.84%
> 30 65.118 (4.49%) 61.352 (4.00%) 5.78%
> 45 105.086 (9.75%) 97.970 (4.26%) 6.77%
> 60 149.221 (6.91%) 154.176 (4.17%) -3.32%
> 75 199.278 (1.21%) 198.680 (1.37%) 0.30%
>
> A lot of run-to-run variation is seen in the hackbench runs, so it is
> hard to draw a conclusion on performance, but it looks better than schbench.
May I know if the CPU frequency was set to a fixed level and deep
CPU idle states were disabled (I assume on Power systems these are
called stop states)?
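In case it helps, something like the following could report the current
settings for one CPU. This is only a sketch, assuming the standard cpufreq
and cpuidle sysfs layout; the function name show_cpu_pm_state and the
CPU_SYSFS override are hypothetical, not part of the series:

```shell
#!/bin/sh
# Sketch: print the cpufreq governor and the cpuidle states of one CPU.
# CPU_SYSFS is an illustrative override so the function can be pointed
# at a different tree; it defaults to the usual sysfs location.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

show_cpu_pm_state() {
    cpu="${1:-0}"
    gov="$CPU_SYSFS/cpu$cpu/cpufreq/scaling_governor"
    if [ -r "$gov" ]; then
        echo "cpu$cpu governor: $(cat "$gov")"
    else
        echo "cpu$cpu governor: unavailable"
    fi
    # Each stateN directory exposes its name and a 'disable' flag (1 = disabled).
    for st in "$CPU_SYSFS/cpu$cpu"/cpuidle/state*; do
        [ -r "$st/name" ] || continue
        echo "cpu$cpu $(basename "$st"): $(cat "$st/name") disable=$(cat "$st/disable")"
    done
}
```

Running `show_cpu_pm_state 0` before each benchmark would show whether the
governor and idle-state settings were the same for the baseline and the
patched runs.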
>
> On Power10 and Power11, the LLC is relatively small (4 CPUs)
> compared to platforms like Sapphire Rapids and Milan. I haven't gone
> through this series yet. I will go through it and try to understand why
> schbench is not happy on Power systems.
>
> Meanwhile, I wanted to know your thoughts on how a smaller LLC
> size is impacted by this patch.
>
IMO, task aggregation on a smaller LLC domain (both in terms of the
number of CPUs and the size of the LLC) might bring cache contention
and hurt performance. May I know what the cache size is on your
system?
    lscpu | grep "L3 cache"
May I know if you tested it with:
    echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
    echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
    echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features
vs
    echo SCHED_CACHE > /sys/kernel/debug/sched/features
    echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
    echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features
And could you help check whether lowering /sys/kernel/debug/sched/llc_aggr_cap
from 50 to a smaller value (25, etc.) would help?
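The comparison above could be wrapped in a small helper so both
configurations are applied the same way on every run. This is only a
sketch: the function names and the SCHED_DEBUGFS override are made up for
illustration, while the feature names and the llc_aggr_cap path are the
ones quoted above. It needs root and a mounted debugfs.

```shell
#!/bin/sh
# Sketch: switch between the two configurations to compare.
# SCHED_DEBUGFS is an illustrative override; it defaults to the
# debugfs directory used in the commands above.
SCHED_DEBUGFS="${SCHED_DEBUGFS:-/sys/kernel/debug/sched}"

set_sched_cache() {
    case "$1" in
        # Everything off: behaves like the baseline scheduler.
        off) feats="NO_SCHED_CACHE NO_SCHED_CACHE_WAKE NO_SCHED_CACHE_LB" ;;
        # Series default: cache-aware load balancing only, no wake-up aggregation.
        lb)  feats="SCHED_CACHE NO_SCHED_CACHE_WAKE SCHED_CACHE_LB" ;;
        *)   echo "usage: set_sched_cache off|lb" >&2; return 1 ;;
    esac
    # The features file takes one feature name per write.
    for f in $feats; do
        echo "$f" > "$SCHED_DEBUGFS/features" || return 1
    done
}

set_llc_aggr_cap() {
    # $1 = new cap value, e.g. 25
    echo "$1" > "$SCHED_DEBUGFS/llc_aggr_cap"
}
```

A run could then look like `set_sched_cache off`, benchmark,
`set_sched_cache lb`, benchmark, and finally `set_llc_aggr_cap 25` followed
by one more benchmark run.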
thanks,
Chenyu
> Thanks,
> Madadi Vineeth Reddy
>
>
>>
>> With above default settings, task migrations occur less frequently
>> and no longer happen in the latency-sensitive wake-up path.
>>
>
> [..snip..]
>
>>
>> Chen Yu (3):
>>   sched: Several fixes for cache aware scheduling
>>   sched: Avoid task migration within its preferred LLC
>>   sched: Save the per LLC utilization for better cache aware scheduling
>>
>> K Prateek Nayak (1):
>>   sched: Avoid calculating the cpumask if the system is overloaded
>>
>> Peter Zijlstra (1):
>>   sched: Cache aware load-balancing
>>
>> Tim Chen (15):
>>   sched: Add hysteresis to switch a task's preferred LLC
>>   sched: Add helper function to decide whether to allow cache aware
>>     scheduling
>>   sched: Set up LLC indexing
>>   sched: Introduce task preferred LLC field
>>   sched: Calculate the number of tasks that have LLC preference on a
>>     runqueue
>>   sched: Introduce per runqueue task LLC preference counter
>>   sched: Calculate the total number of preferred LLC tasks during load
>>     balance
>>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>>     LLC
>>   sched: Introduce update_llc_busiest() to deal with groups having
>>     preferred LLC tasks
>>   sched: Introduce a new migration_type to track the preferred LLC load
>>     balance
>>   sched: Consider LLC locality for active balance
>>   sched: Consider LLC preference when picking tasks from busiest queue
>>   sched: Do not migrate task if it is moving out of its preferred LLC
>>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>     up
>>
>> include/linux/mm_types.h | 44 ++
>> include/linux/sched.h | 8 +
>> include/linux/sched/topology.h | 3 +
>> init/Kconfig | 4 +
>> init/init_task.c | 3 +
>> kernel/fork.c | 5 +
>> kernel/sched/core.c | 25 +-
>> kernel/sched/debug.c | 4 +
>> kernel/sched/fair.c | 859 ++++++++++++++++++++++++++++++++-
>> kernel/sched/features.h | 3 +
>> kernel/sched/sched.h | 23 +
>> kernel/sched/topology.c | 29 ++
>> 12 files changed, 982 insertions(+), 28 deletions(-)
>>
>