From: Chen Yu <yu.c.chen@intel.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Tim Chen <tim.c.chen@intel.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Abel Wu" <wuyun.abel@bytedance.com>,
	Yicong Yang <yangyicong@hisilicon.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	Len Brown <len.brown@intel.com>, Chen Yu <yu.chen.surf@gmail.com>,
	Arjan Van De Ven <arjan.van.de.ven@intel.com>,
	Aaron Lu <aaron.lu@intel.com>, Barry Song <baohua@kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first
Date: Tue, 16 May 2023 16:41:27 +0800
Message-ID: <ZGNBt7vWJ3fDs5Sc@chenyu5-mobl1>
In-Reply-To: <19664c68f77f5b23a86e5636a17ad2cbfa073f78.camel@gmx.de>

On 2023-05-16 at 08:23:35 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> > [Problem Statement]
> >
> ...
> 
> > 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> > 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg
> 
> Yup, that's a serious problem, but...
> 
> > [Benchmark]
> >
> > The baseline is on sched/core branch on top of
> > commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
> >
> > Tested will-it-scale context_switch1 case, it shows good improvement
> > both on a server and a desktop:
> >
> > Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> > context_switch1_processes -s 100 -t 112 -n
> > baseline                   SIS_PAIR
> > 1.0                        +68.13%
> >
> > Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> > context_switch1_processes -s 100 -t 18 -n
> > baseline                   SIS_PAIR
> > 1.0                        +45.2%
> 
> git@homer: ./context_switch1_processes -s 100 -t 8 -n
> (running in an autogroup)
> 
>    PerfTop:   30853 irqs/sec  kernel:96.8%  exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------
> 
>      5.72%  [kernel]       [k] switch_mm_irqs_off
>      4.23%  [kernel]       [k] __update_load_avg_se
>      3.76%  [kernel]       [k] __update_load_avg_cfs_rq
>      3.70%  [kernel]       [k] __schedule
>      3.65%  [kernel]       [k] entry_SYSCALL_64
>      3.22%  [kernel]       [k] enqueue_task_fair
>      2.91%  [kernel]       [k] update_curr
>      2.67%  [kernel]       [k] select_task_rq_fair
>      2.60%  [kernel]       [k] pipe_read
>      2.55%  [kernel]       [k] __switch_to
>      2.54%  [kernel]       [k] __calc_delta
>      2.44%  [kernel]       [k] dequeue_task_fair
>      2.38%  [kernel]       [k] reweight_entity
>      2.13%  [kernel]       [k] pipe_write
>      1.96%  [kernel]       [k] restore_fpregs_from_fpstate
>      1.93%  [kernel]       [k] select_idle_smt
>      1.77%  [kernel]       [k] update_load_avg <==
>      1.73%  [kernel]       [k] native_sched_clock
>      1.66%  [kernel]       [k] try_to_wake_up
>      1.52%  [kernel]       [k] _raw_spin_lock_irqsave
>      1.47%  [kernel]       [k] update_min_vruntime
>      1.42%  [kernel]       [k] update_cfs_group <==
>      1.36%  [kernel]       [k] vfs_write
>      1.32%  [kernel]       [k] prepare_to_wait_event
> 
> ...not one with global scope.  My little i7-4790 can play ping-pong all
> day long, as can untold numbers of other boxen around the globe.
>
That is true; on smaller systems the C2C (cache-to-cache) overhead is not that high.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..e65028dcd6a6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >             asym_fits_cpu(task_util, util_min, util_max, target))
> >                 return target;
> >  
> > +       /*
> > +        * If the waker and the wakee are good friends to each other,
> > +        * putting them within the same SMT domain could reduce C2C
> > +        * overhead. SMT idle sibling should be preferred to wakee's
> > +        * previous CPU, because the latter could still have the risk of C2C
> > +        * overhead.
> > +        */
> > +       if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> > +           current->last_wakee == p && p->last_wakee == current) {
> > +               i = select_idle_smt(p, smp_processor_id());
> > +
> > +               if ((unsigned int)i < nr_cpumask_bits)
> > +                       return i;
> > +       }
> > +
> >         /*
> >          * If the previous CPU is cache affine and idle, don't be stupid:
> >          */
> 
> Global scope solutions for non-global issues tend to not work out.  
> 
> Below is a sample of potential scaling wreckage for boxen that are NOT
> akin to the one you're watching turn caches into silicon based pudding.
> 
> Note the *_RR numbers.  Those poked me in the eye because they closely
> resemble pipe ping-pong, all fun and games with about as close to zero
> work other than scheduling as network-land can get, but for my box, SMT
> was the third best option of three.
> 
> You just can't beat idle core selection when it comes to getting work
> done, which is why SIS evolved to select cores first.
> 
There could be some corner cases. Under some conditions, choosing an idle
CPU within the local core might be better than selecting a new idle core. The
tricky part is that SMT is quite special: SMT siblings share the L2 cache, but
they also compete for other critical resources. For short tasks that have a
close relationship with each other, putting them together on the local core
(on a system with a high CPU count) could sometimes bring a benefit. The short
duration means that the task pair has less chance to compete for the
instruction units shared by the SMT siblings. However, the short-duration
threshold depends on the number of CPUs in the LLC.
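
As a rough illustration of the "close relationship" test used in the quoted
hunk, here is a minimal user-space sketch. It only models the condition
"current->last_wakee == p && p->last_wakee == current". The names toy_task,
toy_record_wakeup() and tasks_are_paired() are hypothetical and are not
kernel APIs; the real bookkeeping lives in kernel/sched/fair.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct: only the field that the quoted
 * hunk reads is modeled here. */
struct toy_task {
	const struct toy_task *last_wakee;
};

/* Toy model of a wakeup: the waker remembers whom it woke last.
 * This sketch assumes a task never wakes itself. */
static void toy_record_wakeup(struct toy_task *waker,
			      const struct toy_task *wakee)
{
	assert(waker != wakee);
	waker->last_wakee = wakee;
}

/* True only when each task most recently woke the other, which is the
 * mutual condition the SIS_PAIR hunk checks before trying the local core. */
static bool tasks_are_paired(const struct toy_task *waker,
			     const struct toy_task *wakee)
{
	return waker != NULL && wakee != NULL &&
	       waker->last_wakee == wakee &&
	       wakee->last_wakee == waker;
}
```

In this model a pipe ping-pong pair becomes "paired" after one round trip,
and the pairing is lost as soon as either side wakes a third task.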
> Your box and ilk need help that treats the disease and not the symptom,
> or barring that, help that precisely targets boxen having the disease.
> 
IMO this issue could be generic. The cost of C2C is O(sqrt(n)), so in theory,
on a system with a large number of CPUs in the LLC and with SMT enabled, the
issue is easy to detect.

As an example, I did not choose a very large system but a desktop
i9-10980XE, and launched 2 pairs of ping-pong tasks.

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 2
average:956883

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 2 -n
average:849209

We can see that waking up the wakee on the local core brings a benefit on this platform.

For comparison, I also ran the same test on my laptop, an i5-8300H with
4 cores/8 CPUs. I did not see any difference when running 2 pairs of
will-it-scale tasks, but I did notice an improvement when wakees are woken up
on the local core with 4 pairs (I guess this is because the C2C reduction
accumulates):

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 4
average:731965

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 4 -n
average:644337
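
To put the two comparisons on the same scale, the relative gain of the
core-bound run over the unbound run can be computed as below. This is only a
small arithmetic sketch; relative_gain_pct() is a hypothetical helper name,
and the inputs are the "average" values reported above.

```c
#include <assert.h>

/* Relative gain of the core-bound result over the unbound result,
 * in percent. Both arguments are will-it-scale "average" values. */
static double relative_gain_pct(double bound, double unbound)
{
	assert(unbound > 0.0);
	return (bound - unbound) / unbound * 100.0;
}
```

With the numbers above, relative_gain_pct(956883, 849209) is roughly 12.7
for the i9-10980XE with 2 pairs, and relative_gain_pct(731965, 644337) is
roughly 13.6 for the i5-8300H with 4 pairs.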


thanks,
Chenyu

> 	-Mike
> 
> 10 seconds of 1 netperf client/server instance, no knobs twiddled.
> 
> TCP_SENDFILE-1  stacked    Avg:  65387
> TCP_SENDFILE-1  cross-smt  Avg:  65658
> TCP_SENDFILE-1  cross-core Avg:  96318
> 
> TCP_STREAM-1    stacked    Avg:  44322
> TCP_STREAM-1    cross-smt  Avg:  42390
> TCP_STREAM-1    cross-core Avg:  77850
> 
> TCP_MAERTS-1    stacked    Avg:  36636
> TCP_MAERTS-1    cross-smt  Avg:  42333
> TCP_MAERTS-1    cross-core Avg:  74122
> 
> UDP_STREAM-1    stacked    Avg:  52618
> UDP_STREAM-1    cross-smt  Avg:  55298
> UDP_STREAM-1    cross-core Avg:  97415
> 
> TCP_RR-1        stacked    Avg: 242606
> TCP_RR-1        cross-smt  Avg: 140863
> TCP_RR-1        cross-core Avg: 219400
> 
> UDP_RR-1        stacked    Avg: 282253
> UDP_RR-1        cross-smt  Avg: 202062
> UDP_RR-1        cross-core Avg: 288620
