public inbox for linux-kernel@vger.kernel.org
From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Peter Zijlstra <peterz@infradead.org>,
	"Chen, Yu C" <yu.c.chen@intel.com>
Cc: <juri.lelli@redhat.com>, <vincent.guittot@linaro.org>,
	<dietmar.eggemann@arm.com>, <rostedt@goodmis.org>,
	<bsegall@google.com>, <mgorman@suse.de>, <vschneid@redhat.com>,
	<linux-kernel@vger.kernel.org>, <tim.c.chen@linux.intel.com>,
	<tglx@linutronix.de>, <len.brown@intel.com>,
	<gautham.shenoy@amd.com>, <mingo@kernel.org>,
	<yu.chen.surf@foxmail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
Date: Wed, 26 Mar 2025 11:48:11 +0530	[thread overview]
Message-ID: <17d0e2be-6130-496f-9a80-49a67a749d01@amd.com> (raw)
In-Reply-To: <20250325184429.GA31413@noisy.programming.kicks-ass.net>

Hello Peter, Chenyu,

On 3/26/2025 12:14 AM, Peter Zijlstra wrote:
> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
>>
>> Hi Peter,
>>
>> Thanks for sending this out,
>>
>> On 3/25/2025 8:09 PM, Peter Zijlstra wrote:
>>> Hi all,
>>>
>>> One of the many things on the eternal todo list has been finishing the
>>> below hackery.
>>>
>>> It is an attempt at modelling cache affinity -- and while the patch
>>> really only targets LLC, it could very well be extended to also apply to
>>> clusters (L2). Specifically any case of multiple cache domains inside a
>>> node.
>>>
>>> Anyway, I wrote this about a year ago, and I mentioned this at the
>>> recent OSPM conf where Gautham and Prateek expressed interest in playing
>>> with this code.
>>>
>>> So here goes, very rough and largely unproven code ahead :-)
>>>
>>> It applies to current tip/master, but I know it will fail the __percpu
>>> validation that sits in -next, although that shouldn't be terribly hard
>>> to fix up.
>>>
>>> As is, it only computes a CPU inside the LLC that has the highest recent
>>> runtime, this CPU is then used in the wake-up path to steer towards this
>>> LLC and in task_hot() to limit migrations away from it.
>>>
>>> More elaborate things could be done, notably there is an XXX in there
>>> somewhere about finding the best LLC inside a NODE (interaction with
>>> NUMA_BALANCING).
>>>
>>
>> Besides the control provided by CONFIG_SCHED_CACHE, could we also introduce
>> sched_feat(SCHED_CACHE) to manage this feature, facilitating dynamic
>> adjustments? Similarly we can also introduce other sched_feats for load
>> balancing and NUMA balancing for fine-grain control.
> 
> We can do all sorts, but the very first thing is determining if this is
> worth it at all. Because if we can't make this work at all, all those
> things are a waste of time.
> 
> This patch is not meant to be merged, it is meant for testing and
> development. We need to first make it actually improve workloads. If it
> then turns out it regresses workloads (likely, things always do), then
> we can look at how to best do that.
> 

Thank you for sharing the patch, and thanks to Chenyu for the initial
review pointing out issues that need fixing. I'll try to take a good
look at it this week and see if I can improve some of the trivial
benchmarks that currently regress with the RFC as is.

In its current form, I think this suffers from the same problem as
SIS_NODE, where wakeups are redirected to the same set of CPUs and a
good deal of additional work is done without any benefit.

I'll leave the results from my initial testing on a 3rd Generation
EPYC platform below and will dig into what is making the benchmarks
unhappy. I'll return with more data once some of these benchmarks
are not as unhappy as they are now.

Thank you both for the RFC and the initial feedback. Following are
the initial results for the RFC as is:

   ==================================================================
   Test          : hackbench
   Units         : Normalized time in seconds
   Interpretation: Lower is better
   Statistic     : AMean
   ==================================================================
   Case:           tip[pct imp](CV)      sched_cache[pct imp](CV)
    1-groups     1.00 [ -0.00](10.12)     1.01 [ -0.89]( 2.84)
    2-groups     1.00 [ -0.00]( 6.92)     1.83 [-83.15]( 1.61)
    4-groups     1.00 [ -0.00]( 3.14)     3.00 [-200.21]( 3.13)
    8-groups     1.00 [ -0.00]( 1.35)     3.44 [-243.75]( 2.20)
   16-groups     1.00 [ -0.00]( 1.32)     2.59 [-158.98]( 4.29)


   ==================================================================
   Test          : tbench
   Units         : Normalized throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:    tip[pct imp](CV)     sched_cache[pct imp](CV)
       1     1.00 [  0.00]( 0.43)     0.96 [ -3.54]( 0.56)
       2     1.00 [  0.00]( 0.58)     0.99 [ -1.32]( 1.40)
       4     1.00 [  0.00]( 0.54)     0.98 [ -2.34]( 0.78)
       8     1.00 [  0.00]( 0.49)     0.96 [ -3.91]( 0.54)
      16     1.00 [  0.00]( 1.06)     0.97 [ -3.22]( 1.82)
      32     1.00 [  0.00]( 1.27)     0.95 [ -4.74]( 2.05)
      64     1.00 [  0.00]( 1.54)     0.93 [ -6.65]( 0.63)
     128     1.00 [  0.00]( 0.38)     0.93 [ -6.91]( 1.18)
     256     1.00 [  0.00]( 1.85)     0.99 [ -0.50]( 1.34)
     512     1.00 [  0.00]( 0.31)     0.98 [ -2.47]( 0.14)
    1024     1.00 [  0.00]( 0.19)     0.97 [ -3.06]( 0.39)


   ==================================================================
   Test          : stream-10
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
    Copy     1.00 [  0.00](11.31)     0.34 [-65.89](72.77)
   Scale     1.00 [  0.00]( 6.62)     0.32 [-68.09](72.49)
     Add     1.00 [  0.00]( 7.06)     0.34 [-65.56](70.56)
   Triad     1.00 [  0.00]( 8.91)     0.34 [-66.47](72.70)


   ==================================================================
   Test          : stream-100
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
    Copy     1.00 [  0.00]( 2.01)     0.83 [-16.96](24.55)
   Scale     1.00 [  0.00]( 1.49)     0.79 [-21.40](24.10)
     Add     1.00 [  0.00]( 2.67)     0.79 [-21.33](25.39)
   Triad     1.00 [  0.00]( 2.19)     0.81 [-19.19](25.55)


   ==================================================================
   Test          : netperf
   Units         : Normalized Throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:         tip[pct imp](CV)     sched_cache[pct imp](CV)
    1-clients     1.00 [  0.00]( 1.43)     0.98 [ -2.22]( 0.26)
    2-clients     1.00 [  0.00]( 1.02)     0.97 [ -2.55]( 0.89)
    4-clients     1.00 [  0.00]( 0.83)     0.98 [ -2.27]( 0.46)
    8-clients     1.00 [  0.00]( 0.73)     0.98 [ -2.45]( 0.80)
   16-clients     1.00 [  0.00]( 0.97)     0.97 [ -2.90]( 0.88)
   32-clients     1.00 [  0.00]( 0.88)     0.95 [ -5.29]( 1.69)
   64-clients     1.00 [  0.00]( 1.49)     0.91 [ -8.70]( 1.95)
   128-clients    1.00 [  0.00]( 1.05)     0.92 [ -8.39]( 4.25)
   256-clients    1.00 [  0.00]( 3.85)     0.92 [ -8.33]( 2.45)
   512-clients    1.00 [  0.00](59.63)     0.92 [ -7.83](51.19)


   ==================================================================
   Test          : schbench
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)       sched_cache[pct imp](CV)
     1     1.00 [ -0.00]( 6.67)      0.38 [ 62.22]    ( 5.88)
     2     1.00 [ -0.00](10.18)      0.43 [ 56.52]    ( 2.94)
     4     1.00 [ -0.00]( 4.49)      0.60 [ 40.43]    ( 5.52)
     8     1.00 [ -0.00]( 6.68)    113.96 [-11296.23] (12.91)
    16     1.00 [ -0.00]( 1.87)    359.34 [-35834.43] (20.02)
    32     1.00 [ -0.00]( 4.01)    217.67 [-21667.03] ( 5.48)
    64     1.00 [ -0.00]( 3.21)     97.43 [-9643.02]  ( 4.61)
   128     1.00 [ -0.00](44.13)     41.36 [-4036.10]  ( 6.92)
   256     1.00 [ -0.00](14.46)      2.69 [-169.31]   ( 1.86)
   512     1.00 [ -0.00]( 1.95)      1.89 [-89.22]    ( 2.24)


   ==================================================================
   Test          : new-schbench-requests-per-second
   Units         : Normalized Requests per second
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)      sched_cache[pct imp](CV)
     1     1.00 [  0.00]( 0.46)     0.96 [ -4.14]( 0.00)
     2     1.00 [  0.00]( 0.15)     0.95 [ -5.27]( 2.29)
     4     1.00 [  0.00]( 0.15)     0.88 [-12.01]( 0.46)
     8     1.00 [  0.00]( 0.15)     0.55 [-45.47]( 1.23)
    16     1.00 [  0.00]( 0.00)     0.54 [-45.62]( 0.50)
    32     1.00 [  0.00]( 3.40)     0.63 [-37.48]( 6.37)
    64     1.00 [  0.00]( 7.09)     0.67 [-32.73]( 0.59)
   128     1.00 [  0.00]( 0.00)     0.99 [ -0.76]( 0.34)
   256     1.00 [  0.00]( 1.12)     1.06 [  6.32]( 1.55)
   512     1.00 [  0.00]( 0.22)     1.06 [  6.08]( 0.92)


   ==================================================================
   Test          : new-schbench-wakeup-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)       sched_cache[pct imp](CV)
     1     1.00 [ -0.00](19.72)     0.85  [ 15.38]    ( 8.13)
     2     1.00 [ -0.00](15.96)     1.09  [ -9.09]    (18.20)
     4     1.00 [ -0.00]( 3.87)     1.00  [ -0.00]    ( 0.00)
     8     1.00 [ -0.00]( 8.15)    118.17 [-11716.67] ( 0.58)
    16     1.00 [ -0.00]( 3.87)    146.62 [-14561.54] ( 4.64)
    32     1.00 [ -0.00](12.99)    141.60 [-14060.00] ( 5.64)
    64     1.00 [ -0.00]( 6.20)    78.62  [-7762.50]  ( 1.79)
   128     1.00 [ -0.00]( 0.96)    11.36  [-1036.08]  ( 3.41)
   256     1.00 [ -0.00]( 2.76)     1.11  [-11.22]    ( 3.28)
   512     1.00 [ -0.00]( 0.20)     1.21  [-20.81]    ( 0.91)


   ==================================================================
   Test          : new-schbench-request-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)      sched_cache[pct imp](CV)
     1     1.00 [ -0.00]( 1.07)     1.11 [-10.66]  ( 2.76)
     2     1.00 [ -0.00]( 0.14)     1.20 [-20.40]  ( 1.73)
     4     1.00 [ -0.00]( 1.39)     2.04 [-104.20] ( 0.96)
     8     1.00 [ -0.00]( 0.36)     3.94 [-294.20] ( 2.85)
    16     1.00 [ -0.00]( 1.18)     4.56 [-356.16] ( 1.19)
    32     1.00 [ -0.00]( 8.42)     3.02 [-201.67] ( 8.93)
    64     1.00 [ -0.00]( 4.85)     1.51 [-51.38]  ( 0.80)
   128     1.00 [ -0.00]( 0.28)     1.83 [-82.77]  ( 1.21)
   256     1.00 [ -0.00](10.52)     1.43 [-43.11]  (10.67)
   512     1.00 [ -0.00]( 0.69)     1.25 [-24.96]  ( 6.24)


   ==================================================================
   Test          : Various longer running benchmarks
   Units         : %diff in throughput reported
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   Benchmarks:                 %diff
   ycsb-cassandra             -10.70%
   ycsb-mongodb               -13.66%

   deathstarbench-1x           13.87%
   deathstarbench-2x            1.70%
   deathstarbench-3x           -8.44%
   deathstarbench-6x           -3.12%

   hammerdb+mysql 16VU        -33.50%
   hammerdb+mysql 64VU        -33.22%

---

I'm planning to take hackbench and schbench as the two extreme cases
for throughput and tail latency, and to later look at Stream from a
"high bandwidth, don't consolidate" standpoint. I hope that once those
cases aren't as deep in the red, the larger benchmarks will be happier
too.

-- 
Thanks and Regards,
Prateek



Thread overview: 23+ messages
2025-03-25 12:09 [RFC][PATCH] sched: Cache aware load-balancing Peter Zijlstra
2025-03-25 15:19 ` Chen, Yu C
2025-03-25 18:44   ` Peter Zijlstra
2025-03-26  6:18     ` K Prateek Nayak [this message]
2025-03-26  9:15       ` Chen, Yu C
2025-03-26  9:42         ` Peter Zijlstra
2025-03-27  8:10           ` Chen, Yu C
2025-03-26  9:38   ` Peter Zijlstra
2025-03-26 10:25     ` Peter Zijlstra
2025-03-26 10:42       ` Peter Zijlstra
2025-03-26 10:46       ` Peter Zijlstra
     [not found]       ` <20250327112059.3661-1-hdanton@sina.com>
2025-03-31  6:25         ` Chen, Yu C
2025-03-27  2:48     ` Chen, Yu C
2025-03-27  2:43 ` Madadi Vineeth Reddy
2025-03-27 11:14   ` Chen, Yu C
2025-03-31 20:17     ` Madadi Vineeth Reddy
2025-03-28 13:57 ` Abel Wu
2025-03-29 15:06   ` Chen, Yu C
2025-03-30  8:46     ` Abel Wu
2025-03-31  5:25       ` Chen, Yu C
2025-03-31  8:04         ` Abel Wu
2025-03-31 21:06 ` Tim Chen
2025-04-02  1:52 ` Libo Chen
