Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Chen Yu <yu.c.chen@intel.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Tim Chen <tim.c.chen@intel.com>, Aaron Lu <aaron.lu@intel.com>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	"Daniel Bristot de Oliveira" <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	<linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>
Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Date: Thu, 14 Sep 2023 19:01:21 +0800	[thread overview]
Message-ID: <ZQLoAQcQJDCrdOGd@chenyu5-mobl2.ccr.corp.intel.com> (raw)
In-Reply-To: <86f761a4-9805-c704-9c23-ec96065fa389@amd.com>

Hi Prateek,

thanks for the test,

On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/13/2023 8:27 AM, Chen Yu wrote:
> > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/12/2023 6:02 PM, Chen Yu wrote:
> >>> [..snip..]
> >>>
> >>>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
> >>>>> on the waker's CPU, rather than the wakee's previous CPU, because
> >>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> >>>>> wakee's previous CPU. We can only restrict that other wakee does not
> >>>>> occupy the previous CPU, but do not enhance the possibility that
> >>>>> wake_affine_idle() chooses the previous CPU.
> >>>>
> >>>> Correct me if I'm wrong here,
> >>>>
> >>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> >>>> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
> >>>> and restrict any wakeup happening until the "cache_hot_timeout" is
> >>>> crossed. Let us assume a perfect world where the task wakes up before
> >>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> >>>> this while for the short sleeper but since the wakeup bears WF_SYNC
> >>>> flag, the whole reservation is ignored and waker's LLC is explored.
> >>>>
> >>>
> >>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> >>> forces the short sleeping wakee to be woken up on waker's CPU rather the
> >>> wakee's previous CPU, but wakee's previous has been marked as cache hot
> >>> for nothing?
> >>
> >> Precisely :)
> >>
> >>>
> >>>> Should the timeout be cleared if the wakeup decides to not target the
> >>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >>>> small enough to curb any side effect that could possibly show here but
> >>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >>>> a larger value, the wakeup path might be affected where lot of idle
> >>>> targets are overlooked since the CPUs are marked cache-hot forr longer
> >>>> duration)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>
> >>> This makes sense. In theory the above logic can be added in
> >>> select_idle_sibling(), if target CPU is chosen rather than
> >>> the previous CPU, the previous CPU's cache hot flag should be
> >>> cleared.
> >>>
> >>> But this might bring overhead. Because we need to grab the rq
> >>> lock and write to other CPU's rq, which could be costly. It
> >>> seems to be a trade-off of current implementation.
> >>
> >> I agree, it will not be pretty. Maybe the other way is to have a
> >> history of the type of wakeup the task experiences (similar to
> >> wakee_flips but for sync and non-syn wakeups) and only reserve
> >> the CPU if the task wakes up more via non-sync wakeups? Thinking
> >> out loud here.
> >>
> > 
> > This looks good to consider the task's attribute, or maybe something
> > like this:
> > 
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > 	p->burst_sleep_avg >>= 1;
> > So the duration of reservation could be shrinked.
> 
> That seems like a good approach.
> 
> Meanwhile, here is result for the current series without any
> modifications:
> 
> tl;dr
> 
> - There seems to be a noticeable increase in hackbench runtime with a
>   single group but big gains beyond that. The regression could possibly
>   be because of added searching but let me do some digging to confirm
>   that. 

Ah OK. May I have the command to run 1 group hackbench?

> 
> - Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
>   and DeathStarBench (High Utilization)
> 
> - Other benchmarks are more of less perf neutral with the changes.
> 
> More information below:
> 
> o System information
> 
>   - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
>   - NPS1 mode (each socket is a NUMA node)
>   - Boost Enabled
>   - C2 disabled (MWAIT based C1 is still enabled)
> 
> 
> o Kernel information
> 
> base		:   tip:sched/core at commit b41bbb33cf75 ("Merge branch
> 		    'sched/eevdf' into sched/core")
> 		  + cheery-pick commit 63304558ba5d ("sched/eevdf: Curb
> 		    wakeup-preemption")
> 
> SIS_CACHE	:   base
> 		  + this series as is
> 
> 
> o Benchmark results
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 1.89)     1.10 [-10.28]( 2.03)
>  2-groups     1.00 [ -0.00]( 2.04)     0.98 [  1.57]( 2.04)
>  4-groups     1.00 [ -0.00]( 2.38)     0.95 [  4.70]( 0.88)
>  8-groups     1.00 [ -0.00]( 1.52)     0.93 [  7.18]( 0.76)
> 16-groups     1.00 [ -0.00]( 3.44)     0.90 [  9.76]( 1.04)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>     1     1.00 [  0.00]( 0.18)     0.98 [ -1.61]( 0.27)
>     2     1.00 [  0.00]( 0.63)     0.98 [ -1.58]( 0.09)
>     4     1.00 [  0.00]( 0.86)     0.99 [ -0.52]( 0.42)
>     8     1.00 [  0.00]( 0.22)     0.98 [ -1.77]( 0.65)
>    16     1.00 [  0.00]( 1.99)     1.00 [ -0.10]( 1.55)
>    32     1.00 [  0.00]( 4.29)     0.98 [ -1.73]( 1.55)
>    64     1.00 [  0.00]( 1.71)     0.97 [ -2.77]( 3.74)
>   128     1.00 [  0.00]( 0.65)     1.00 [ -0.14]( 0.88)
>   256     1.00 [  0.00]( 0.19)     0.97 [ -2.65]( 0.49)
>   512     1.00 [  0.00]( 0.20)     0.99 [ -1.10]( 0.33)
>  1024     1.00 [  0.00]( 0.29)     0.99 [ -0.70]( 0.16)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  Copy     1.00 [  0.00]( 4.32)     0.90 [ -9.82](10.72)
> Scale     1.00 [  0.00]( 5.21)     1.01 [  0.59]( 1.83)
>   Add     1.00 [  0.00]( 6.25)     0.99 [ -0.91]( 4.49)
> Triad     1.00 [  0.00](10.74)     1.02 [  2.28]( 6.07)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.70)     0.98 [ -1.79]( 2.26)
> Scale     1.00 [  0.00]( 6.55)     1.03 [  2.80]( 0.74)
>   Add     1.00 [  0.00]( 6.53)     1.02 [  2.05]( 1.82)
> Triad     1.00 [  0.00]( 6.66)     1.04 [  3.54]( 1.04)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Througput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.55]( 0.49)
>  2-clients     1.00 [  0.00]( 0.38)     0.99 [ -1.23]( 1.19)
>  4-clients     1.00 [  0.00]( 0.72)     0.98 [ -1.91]( 1.21)
>  8-clients     1.00 [  0.00]( 0.98)     0.98 [ -1.61]( 1.08)
> 16-clients     1.00 [  0.00]( 0.70)     0.98 [ -1.80]( 1.04)
> 32-clients     1.00 [  0.00]( 0.74)     0.98 [ -1.55]( 1.20)
> 64-clients     1.00 [  0.00]( 2.24)     1.00 [ -0.04]( 2.77)
> 128-clients    1.00 [  0.00]( 1.72)     1.03 [  3.22]( 1.99)
> 256-clients    1.00 [  0.00]( 4.44)     0.99 [ -1.33]( 4.71)
> 512-clients    1.00 [  0.00](52.42)     0.98 [ -1.61](52.72)
> 
> 
> ==================================================================
> Test          : schbench (old)
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>   1     1.00 [ -0.00]( 2.28)     0.96 [  4.00](15.68)
>   2     1.00 [ -0.00]( 6.42)     1.00 [ -0.00](10.96)
>   4     1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.61)
>   8     1.00 [ -0.00](13.83)     1.08 [ -7.89]( 2.86)
>  16     1.00 [ -0.00]( 4.37)     1.00 [ -0.00]( 2.13)
>  32     1.00 [ -0.00]( 8.69)     0.95 [  4.94]( 2.73)
>  64     1.00 [ -0.00]( 2.30)     1.05 [ -5.13]( 1.26)
> 128     1.00 [ -0.00](12.12)     1.03 [ -3.41]( 5.08)
> 256     1.00 [ -0.00](26.04)     0.91 [  8.88]( 2.59)
> 512     1.00 [ -0.00]( 5.62)     0.97 [  3.32]( 0.37)
> 
> 
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
> Metric		variant                      base		     SIS_CACHE
> Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41485503.82 (   0.57%)
> Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6233919689.40 (  -0.10%)
> Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2841236.43 *   4.29%*
> Amean     unixbench-syscall-512            7790656.20 (   0.00%)     7631558.00 *   2.04%*
> Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2598208.16 *   2.47%*
> Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   368566373.76 *   1.99%*
> Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4551.67 (   1.01%)
> Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69264.30 (  -0.17%)
> Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3822.67 (  -0.05%)
> Hmean     unixbench-execl-512                12288.64 (   0.00%)       11728.12 (  -4.56%)
> 
> 
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> base            : 309589.33 (var: 1.41%) 
> SIS_CACHE       : 304931.33 (var: 1.29%) [diff: -1.50%]
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : Normalized Throughput, relative to base
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Pinning         base     SIS_CACHE
> 1 CCD           100%      99.18% [%diff: -0.82%]
> 2 CCD           100%      97.46% [%diff: -2.54%]
> 4 CCD           100%      97.22% [%diff: -2.78%]
> 8 CCD           100%      99.01% [%diff: -0.99%]
> 
> --
> 
> Regression observed could either be because of the larger search time to
> find a non cache-hot idle CPU, or perhaps just the larger search time in
> general adding to utilization and curbing the SIS_UTIL limits further.

Yeah that is possible. And you also mentioned that we should consider the
cache-hot idle CPU if we can not find any cache-cold idle CPUs, that
might be a better choice than forcely putting the wakee on the current
CPU which brings task stacking.

> I'll go gather some stats to back my suspicion (particularly for
> hackbench).
>

Thanks!
Chenyu

next prev parent reply	other threads:[~2023-09-14 11:01 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-11  2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
2023-09-11  2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
2023-09-11  7:26   ` Aaron Lu
2023-09-11  8:40     ` Chen Yu
2023-09-13  6:22       ` Gautham R. Shenoy
2023-09-13  7:25         ` Chen Yu
2023-09-14  7:06           ` Gautham R. Shenoy
2023-09-14 12:09             ` Chen Yu
2023-09-15 15:18               ` Gautham R. Shenoy
2023-09-19  9:01                 ` Chen Yu
2023-09-11  8:29   ` K Prateek Nayak
2023-09-11 10:19     ` Chen Yu
2023-09-12  3:05       ` K Prateek Nayak
2023-09-12 12:32         ` Chen Yu
2023-09-12 14:26           ` K Prateek Nayak
2023-09-13  2:57             ` Chen Yu
2023-09-14  4:13               ` K Prateek Nayak
2023-09-14 11:01                 ` Chen Yu [this message]
2023-09-15  3:21                   ` K Prateek Nayak
2023-09-12  9:39       ` Mike Galbraith
2023-09-12 14:51         ` Chen Yu
2023-09-12  6:32     ` Mike Galbraith
2023-09-11 15:26   ` Mathieu Desnoyers
2023-09-11 15:43     ` Mathieu Desnoyers
2023-09-12 11:53       ` Chen Yu
2023-09-12 14:06         ` Mathieu Desnoyers
2023-09-12 14:14           ` Chen Yu
2023-09-12 15:18             ` Mathieu Desnoyers
2023-09-13  3:02               ` Chen Yu
2023-09-20 12:34     ` Chen Yu
2023-09-14  5:30   ` K Prateek Nayak
2023-09-14 10:43     ` Chen Yu
2023-09-15  3:37       ` K Prateek Nayak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZQLoAQcQJDCrdOGd@chenyu5-mobl2.ccr.corp.intel.com \
    --to=yu.c.chen@intel.com \
    --cc=aaron.lu@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=gautham.shenoy@amd.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox