From: Tim Chen <tim.c.chen@linux.intel.com>
To: Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Libo Chen <libo.chen@oracle.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	Hillf Danton <hdanton@sina.com>, Len Brown <len.brown@intel.com>,
	linux-kernel@vger.kernel.org, Chen Yu <yu.c.chen@intel.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Date: Mon, 23 Jun 2025 09:45:54 -0700	[thread overview]
Message-ID: <946aec3cd4fe308b45d0cff661d80a3d75109b7b.camel@linux.intel.com> (raw)
In-Reply-To: <8c98fff7-fef3-494a-98a3-4b6d4cc2e6d1@linux.ibm.com>

On Sat, 2025-06-21 at 00:55 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 18/06/25 23:57, Tim Chen wrote:
> > This is the third revision of the cache aware scheduling patches,
> > based on the original patch proposed by Peter[1].
> >  
> > The goal of the patch series is to aggregate tasks sharing data
> > to the same cache domain, thereby reducing cache bouncing and
> > cache misses and improving data access efficiency. In the current
> > implementation, threads within the same process are considered
> > entities that potentially share resources.
> >  
> > In previous versions, aggregation of tasks was done in the
> > wake-up path, without making the load balancing paths aware of
> > LLC (Last-Level Cache) preference. This led to the following
> > problems:
> > 
> > 1) Aggregation of tasks during wake up led to load imbalance
> >    between LLCs
> > 2) Load balancing tried to even out the load between LLCs
> > 3) Wake-up task aggregation happened at a faster rate, while
> >    load balancing moved tasks in the opposite direction, leading
> >    to continuous and excessive task migrations and regressions
> >    in benchmarks like schbench.
> > 
> > In this version, load balancing is made cache-aware. The main
> > idea of cache-aware load balancing consists of two parts:
> > 
> > 1) Identify tasks that prefer to run on their hottest LLC and
> >    move them there.
> > 2) Prevent generic load balancing from moving a task out of
> >    its hottest LLC.
> > 
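As a conceptual aside: part 2) above boils down to a veto in the
migration path, roughly like the sketch below. This is illustrative
only -- the function and its arguments are made up here and are not
the actual code in the series.

#include <stdbool.h>

/*
 * Illustrative sketch: refuse a migration that would pull a task out
 * of the LLC it prefers.  A negative pref_llc means the task has no
 * preference, so the check never vetoes in that case.
 */
static bool leaves_preferred_llc(int pref_llc, int src_llc, int dst_llc)
{
        return pref_llc >= 0 &&       /* task has a preferred LLC */
               src_llc == pref_llc && /* it currently runs there  */
               dst_llc != pref_llc;   /* and would be moved away  */
}

A generic load-balance pass would then skip such a task instead of
migrating it.
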
> > By default, LLC task aggregation during wake-up is disabled.
> > Conversely, cache-aware load balancing is enabled by default.
> > For easier comparison, two scheduler features are introduced:
> > SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> > wake up and cache-aware load balancing, respectively. By default,
> > NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
> > is only done during load balancing.
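
(For comparison runs, both features can also be flipped at run time via
the scheduler's debugfs feature interface, e.g.
"echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features", assuming
debugfs is mounted in the usual place.)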
> 
> I tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> The LLC on this platform spans 4 threads.

Hi Madadi,

Thank you for testing this patch series.

If I understand correctly, the Power11 system you tested has 8 threads per core.
My suspicion is that, in this case, we benefit much more from utilizing more
cores than from aggregating the load onto fewer cores that share a cache.


> 
> schbench:
>                         baseline (sd%)        baseline+cacheaware (sd%)      %change
> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
> 
> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
> 
> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
> 
> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
> 
> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
> 
> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
> 
> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
> 
> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
> 
> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
> 
> The above data shows mostly regressions, in both the lower and the
> higher load cases.
> 
> 
> Hackbench pipe:
> 
> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
> 
> There is a lot of run-to-run variation in the hackbench runs, so it is hard
> to judge the performance, but it looks better than schbench.
> 
> On Power10 and Power11, the LLC is relatively small (4 CPUs) compared to
> platforms like Sapphire Rapids and Milan. I haven't gone through this series
> yet. I will go through it and try to understand why schbench is not happy
> on Power systems.

My guess is that with 8 threads per core, LLC aggregation may have been too
aggressive in consolidating tasks onto fewer cores, leaving some CPU cycles
unused. Running the experiments with one thread per core on Power11 may give
us some insight into whether this conjecture is true.
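
(One way to do that on Power would be "ppc64_cpu --smt=1" before the runs,
or, if the generic SMT control is available on that kernel,
"echo off > /sys/devices/system/cpu/smt/control".)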

> 
> Meanwhile, I wanted to know your thoughts on how platforms with a smaller
> LLC size are affected by this patch.
> 

This patch series is currently tuned for systems with single-threaded cores,
many cores per LLC, and a large cache per LLC.

With only 4 cores and 32 threads per LLC, as on Power11, we run out of cores
quickly and get more cache contention among the consolidated tasks. We may
have to set the aggregation threshold (sysctl_llc_aggr_cap) below the default
of 50% utilization, so that we consolidate less aggressively and spread the
tasks much sooner.
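
Something like the sketch below (purely illustrative -- the helper and
its arguments are made up and are not the actual code in the series)
is what such a cap amounts to:

#include <stdbool.h>

/*
 * Illustrative sketch: allow further aggregation into a preferred LLC
 * only while that LLC's utilization stays under a percentage cap of
 * its capacity, which is the role sysctl_llc_aggr_cap plays here.
 */
static bool llc_aggr_allowed(unsigned long llc_util,
                             unsigned long llc_capacity,
                             unsigned int aggr_cap_pct)
{
        /*
         * aggr_cap_pct == 50 means we stop consolidating once the LLC
         * is 50% utilized; a smaller value spreads tasks sooner.
         */
        return llc_util * 100 < llc_capacity * aggr_cap_pct;
}

Lowering the cap on a platform like Power11 would make the scheduler
give up on consolidation earlier and fall back to spreading across
cores.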


Tim

Thread overview: 68+ messages
2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
2025-06-26 12:23   ` Jianyong Wu
2025-06-26 13:32     ` Chen, Yu C
2025-06-27  0:10       ` Tim Chen
2025-06-27  2:13         ` Jianyong Wu
2025-07-03 19:29   ` Shrikanth Hegde
2025-07-04  8:40     ` Chen, Yu C
2025-07-04  8:45       ` Peter Zijlstra
2025-07-04  8:54         ` Shrikanth Hegde
2025-07-07 19:57     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
2025-07-03 19:33   ` Shrikanth Hegde
2025-07-07 21:02     ` Tim Chen
2025-07-08  1:15   ` Libo Chen
2025-07-08  7:54     ` Chen, Yu C
2025-07-08 15:47       ` Libo Chen
2025-06-18 18:27 ` [RFC patch v3 03/20] sched: Avoid task migration within its preferred LLC Tim Chen
2025-06-18 18:27 ` [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded Tim Chen
2025-07-03 19:39   ` Shrikanth Hegde
2025-07-07 14:57     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC Tim Chen
2025-07-02  6:47   ` Madadi Vineeth Reddy
2025-07-02 21:47     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 06/20] sched: Save the per LLC utilization for better cache aware scheduling Tim Chen
2025-06-18 18:27 ` [RFC patch v3 07/20] sched: Add helper function to decide whether to allow " Tim Chen
2025-07-08  0:41   ` Libo Chen
2025-07-08  8:29     ` Chen, Yu C
2025-07-08 17:22       ` Libo Chen
2025-07-09 14:41         ` Chen, Yu C
2025-07-09 21:31           ` Libo Chen
2025-07-08 21:59     ` Tim Chen
2025-07-09 21:22       ` Libo Chen
2025-06-18 18:27 ` [RFC patch v3 08/20] sched: Set up LLC indexing Tim Chen
2025-07-03 19:44   ` Shrikanth Hegde
2025-07-04  9:36     ` Chen, Yu C
2025-06-18 18:27 ` [RFC patch v3 09/20] sched: Introduce task preferred LLC field Tim Chen
2025-06-18 18:27 ` [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue Tim Chen
2025-07-03 19:45   ` Shrikanth Hegde
2025-07-04 15:00     ` Chen, Yu C
2025-06-18 18:27 ` [RFC patch v3 11/20] sched: Introduce per runqueue task LLC preference counter Tim Chen
2025-06-18 18:28 ` [RFC patch v3 12/20] sched: Calculate the total number of preferred LLC tasks during load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 13/20] sched: Tag the sched group as llc_balance if it has tasks prefer other LLC Tim Chen
2025-06-18 18:28 ` [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks Tim Chen
2025-07-03 19:52   ` Shrikanth Hegde
2025-07-05  2:26     ` Chen, Yu C
2025-06-18 18:28 ` [RFC patch v3 15/20] sched: Introduce a new migration_type to track the preferred LLC load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 16/20] sched: Consider LLC locality for active balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 17/20] sched: Consider LLC preference when picking tasks from busiest queue Tim Chen
2025-06-18 18:28 ` [RFC patch v3 18/20] sched: Do not migrate task if it is moving out of its preferred LLC Tim Chen
2025-06-18 18:28 ` [RFC patch v3 19/20] sched: Introduce SCHED_CACHE_LB to control cache aware load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 20/20] sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake up Tim Chen
2025-06-19  6:39 ` [RFC patch v3 00/20] Cache aware scheduling Yangyu Chen
2025-06-19 13:21   ` Chen, Yu C
2025-06-19 14:12     ` Yangyu Chen
2025-06-20 19:25 ` Madadi Vineeth Reddy
2025-06-22  0:39   ` Chen, Yu C
2025-06-24 17:47     ` Madadi Vineeth Reddy
2025-06-23 16:45   ` Tim Chen [this message]
2025-06-24  5:00 ` K Prateek Nayak
2025-06-24 12:16   ` Chen, Yu C
2025-06-25  4:19     ` K Prateek Nayak
2025-06-25  0:30   ` Tim Chen
2025-06-25  4:30     ` K Prateek Nayak
2025-07-03 20:00   ` Shrikanth Hegde
2025-07-04 10:09     ` Chen, Yu C
2025-07-09 19:39 ` Madadi Vineeth Reddy
2025-07-10  3:33   ` Chen, Yu C
