All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Julia Lawall <Julia.Lawall@inria.fr>
Cc: Ingo Molnar <mingo@redhat.com>,
	kernel-janitors@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	linux-kernel@vger.kernel.org,
	Valentin Schneider <valentin.schneider@arm.com>,
	Gilles.Muller@inria.fr
Subject: Re: [PATCH] sched/fair: check for idle core
Date: Wed, 21 Oct 2020 11:20:38 +0000	[thread overview]
Message-ID: <20201021112038.GC32041@suse.de> (raw)
In-Reply-To: <1603211879-1064-1-git-send-email-Julia.Lawall@inria.fr>

On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> On a thread wakeup, the change [1] from runnable load average to load
> average for comparing candidate cores means that recent short-running
> daemons on the core where a thread ran previously can be considered to
> have a higher load than the core performing the wakeup, even when the
> core where the thread ran previously is currently idle.  This can
> cause a thread to migrate, taking the place of some other thread that
> is about to wake up, and so on.  To avoid unnecessary migrations,
> extend wake_affine_idle to check whether the core where the thread
> previously ran is currently idle, and if so return that core as the
> target.
> 
> [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> load in wakeup path")
> 
> This particularly has an impact when using passive (intel_cpufreq)
> power management, where kworkers run every 0.004 seconds on all cores,
> increasing the likelihood that an idle core will be considered to have
> a load.
> 
> The following numbers were obtained with the benchmarking tool
> hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> management were used.  Times are in seconds.  All experiments use all
> 160 hardware threads.
> 
> 	v5.9/active		v5.9+patch/active
> bt.C.c	24.725724+-0.962340	23.349608+-1.607214
> lu.C.x	29.105952+-4.804203	25.249052+-5.561617
> sp.C.x	31.220696+-1.831335	30.227760+-2.429792
> ua.C.x	26.606118+-1.767384	25.778367+-1.263850
> 
> 	v5.9/passive		v5.9+patch/passive
> bt.C.c	25.330360+-1.028316	23.544036+-1.020189
> lu.C.x	35.872659+-4.872090	23.719295+-3.883848
> sp.C.x	32.141310+-2.289541	29.125363+-0.872300
> ua.C.x	29.024597+-1.667049	25.728888+-1.539772
> 
> On the smaller data sets (A and B) and on the other NAS benchmarks
> there is no impact on performance.
> 
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

I suspect that the benefit of this patch is due to avoiding the overhead
of wake_affine_weight() check because the following check exists in
select_idle_sibling

        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */
        if (prev != target && cpus_share_cache(prev, target) &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)))
                return prev;

Still, the concept makes some sense to avoid wake_affine_weight but look
at the earlier part of wake_affine_idle()

        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

This thing is almost completely useless because this_cpu is only going to
be idle if it's a wakeup from interrupt context when the CPU was otherwise
idle *but* it takes care to only use the CPU if this and prev share LLC.

The patch as it stands may leave a task on a remote node when it should
have been pulled local to the waker because prev happened to be idle. This
is not guaranteed because a node could have multiple LLCs and prev is
still appropriate but that's a different problem entirely and requires
much deeper surgery. Still, not pulling a task from a remote node is
a change in expected behaviour. While it's possible that NUMA domains
will not even reach this path, it depends on the NUMA distance as can
be seen in sd_init() for the setting of SD_WAKE_AFFINE so I think the
cpus_share_cache check is necessary.

I think it would be more appropriate to rework that block that checks
this_cpu to instead check if the CPUs share cache first and then return one
of them (preference to prev based on the comment above it about avoiding
a migration) if either one is idle.

I see Vincent already agreed with the patch so I could be wrong.  Vincent,
did I miss something stupid?

-- 
Mel Gorman
SUSE Labs

WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mgorman@suse.de>
To: Julia Lawall <Julia.Lawall@inria.fr>
Cc: Ingo Molnar <mingo@redhat.com>,
	kernel-janitors@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	linux-kernel@vger.kernel.org,
	Valentin Schneider <valentin.schneider@arm.com>,
	Gilles.Muller@inria.fr
Subject: Re: [PATCH] sched/fair: check for idle core
Date: Wed, 21 Oct 2020 12:20:38 +0100	[thread overview]
Message-ID: <20201021112038.GC32041@suse.de> (raw)
In-Reply-To: <1603211879-1064-1-git-send-email-Julia.Lawall@inria.fr>

On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> On a thread wakeup, the change [1] from runnable load average to load
> average for comparing candidate cores means that recent short-running
> daemons on the core where a thread ran previously can be considered to
> have a higher load than the core performing the wakeup, even when the
> core where the thread ran previously is currently idle.  This can
> cause a thread to migrate, taking the place of some other thread that
> is about to wake up, and so on.  To avoid unnecessary migrations,
> extend wake_affine_idle to check whether the core where the thread
> previously ran is currently idle, and if so return that core as the
> target.
> 
> [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> load in wakeup path")
> 
> This particularly has an impact when using passive (intel_cpufreq)
> power management, where kworkers run every 0.004 seconds on all cores,
> increasing the likelihood that an idle core will be considered to have
> a load.
> 
> The following numbers were obtained with the benchmarking tool
> hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> management were used.  Times are in seconds.  All experiments use all
> 160 hardware threads.
> 
> 	v5.9/active		v5.9+patch/active
> bt.C.c	24.725724+-0.962340	23.349608+-1.607214
> lu.C.x	29.105952+-4.804203	25.249052+-5.561617
> sp.C.x	31.220696+-1.831335	30.227760+-2.429792
> ua.C.x	26.606118+-1.767384	25.778367+-1.263850
> 
> 	v5.9/passive		v5.9+patch/passive
> bt.C.c	25.330360+-1.028316	23.544036+-1.020189
> lu.C.x	35.872659+-4.872090	23.719295+-3.883848
> sp.C.x	32.141310+-2.289541	29.125363+-0.872300
> ua.C.x	29.024597+-1.667049	25.728888+-1.539772
> 
> On the smaller data sets (A and B) and on the other NAS benchmarks
> there is no impact on performance.
> 
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

I suspect that the benefit of this patch is due to avoiding the overhead
of wake_affine_weight() check because the following check exists in
select_idle_sibling

        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */
        if (prev != target && cpus_share_cache(prev, target) &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)))
                return prev;

Still, the concept makes some sense to avoid wake_affine_weight but look
at the earlier part of wake_affine_idle()

        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

This thing is almost completely useless because this_cpu is only going to
be idle if it's a wakeup from interrupt context when the CPU was otherwise
idle *but* it takes care to only use the CPU if this and prev share LLC.

The patch as it stands may leave a task on a remote node when it should
have been pulled local to the waker because prev happened to be idle. This
is not guaranteed because a node could have multiple LLCs and prev is
still appropriate but that's a different problem entirely and requires
much deeper surgery. Still, not pulling a task from a remote node is
a change in expected behaviour. While it's possible that NUMA domains
will not even reach this path, it depends on the NUMA distance as can
be seen in sd_init() for the setting of SD_WAKE_AFFINE so I think the
cpus_share_cache check is necessary.

I think it would be more appropriate to rework that block that checks
this_cpu to instead check if the CPUs share cache first and then return one
of them (preference to prev based on the comment above it about avoiding
a migration) if either one is idle.

I see Vincent already agreed with the patch so I could be wrong.  Vincent,
did I miss something stupid?

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2020-10-21 11:20 UTC|newest]

Thread overview: 133+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
2020-10-20 16:37 ` Julia Lawall
2020-10-21  7:29 ` Vincent Guittot
2020-10-21  7:29   ` Vincent Guittot
2020-10-21 11:13   ` Peter Zijlstra
2020-10-21 11:13     ` Peter Zijlstra
2020-10-21 12:27   ` Vincent Guittot
2020-10-21 12:27     ` Vincent Guittot
2020-10-21 11:20 ` Mel Gorman [this message]
2020-10-21 11:20   ` Mel Gorman
2020-10-21 11:56   ` Julia Lawall
2020-10-21 11:56     ` Julia Lawall
2020-10-21 12:19     ` Peter Zijlstra
2020-10-21 12:19       ` Peter Zijlstra
2020-10-21 12:42       ` Julia Lawall
2020-10-21 12:42         ` Julia Lawall
2020-10-21 12:52         ` Peter Zijlstra
2020-10-21 12:52           ` Peter Zijlstra
2020-10-21 13:43           ` Julia Lawall
2020-10-21 18:18           ` Rafael J. Wysocki
2020-10-21 18:18             ` Rafael J. Wysocki
2020-10-21 18:15         ` Rafael J. Wysocki
2020-10-21 18:15           ` Rafael J. Wysocki
2020-10-21 19:47           ` Julia Lawall
2020-10-21 19:47             ` Julia Lawall
2020-10-21 20:25             ` Rafael J. Wysocki
2020-10-21 20:25               ` Rafael J. Wysocki
2020-10-21 13:10       ` Peter Zijlstra
2020-10-21 13:10         ` Peter Zijlstra
2020-10-21 18:11         ` Rafael J. Wysocki
2020-10-21 18:11           ` Rafael J. Wysocki
2020-10-22  4:41           ` Viresh Kumar
2020-10-22  4:53             ` Viresh Kumar
2020-10-22  7:11           ` Peter Zijlstra
2020-10-22  7:11             ` Peter Zijlstra
2020-10-22 10:47             ` Viresh Kumar
2020-10-22 10:59               ` Viresh Kumar
2020-10-22 11:45               ` Rafael J. Wysocki
2020-10-22 11:45                 ` Rafael J. Wysocki
2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
2020-10-22 12:02                   ` Peter Zijlstra
2020-10-22 12:19                   ` Rafael J. Wysocki
2020-10-22 12:19                     ` Rafael J. Wysocki
2020-10-22 12:29                     ` Peter Zijlstra
2020-10-22 12:29                       ` Peter Zijlstra
2020-10-22 14:52                       ` Mel Gorman
2020-10-22 14:52                         ` Mel Gorman
2020-10-22 14:58                         ` Colin Ian King
2020-10-22 14:58                           ` Colin Ian King
2020-10-22 15:12                           ` Phil Auld
2020-10-22 15:12                             ` Phil Auld
2020-10-22 16:35                             ` Mel Gorman
2020-10-22 16:35                               ` Mel Gorman
2020-10-22 17:59                               ` Rafael J. Wysocki
2020-10-22 17:59                                 ` Rafael J. Wysocki
2020-10-22 20:32                                 ` Mel Gorman
2020-10-22 20:32                                   ` Mel Gorman
2020-10-22 20:39                                   ` Phil Auld
2020-10-22 20:39                                     ` Phil Auld
2020-10-22 15:25                         ` Peter Zijlstra
2020-10-22 15:25                           ` Peter Zijlstra
2020-10-22 15:55                           ` Rafael J. Wysocki
2020-10-22 15:55                             ` Rafael J. Wysocki
2020-10-22 16:29                           ` Mel Gorman
2020-10-22 16:29                             ` Mel Gorman
2020-10-22 20:10                           ` Giovanni Gherdovich
2020-10-22 20:10                             ` Giovanni Gherdovich
2020-10-22 20:16                             ` Giovanni Gherdovich
2020-10-22 20:16                               ` Giovanni Gherdovich
2020-10-23  7:03                             ` Peter Zijlstra
2020-10-23  7:03                               ` Peter Zijlstra
2020-10-23 17:46                               ` Tom Lendacky
2020-10-23 17:46                                 ` Tom Lendacky
2020-10-26 19:52                                 ` Fontenot, Nathan
2020-10-26 19:52                                   ` Fontenot, Nathan
2020-10-22 15:45                       ` A L
2020-10-22 15:45                         ` A L
2020-10-22 15:55                         ` Vincent Guittot
2020-10-22 15:55                           ` Vincent Guittot
2020-10-23  5:11                           ` Viresh Kumar
2020-10-23  5:23                             ` Viresh Kumar
2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
2020-10-22 16:23                     ` Rafael J. Wysocki
2020-10-23  6:17                     ` Viresh Kumar
2020-10-23  6:29                       ` Viresh Kumar
2020-10-23 11:59                       ` Rafael J. Wysocki
2020-10-23 11:59                         ` Rafael J. Wysocki
2020-10-23 15:15                     ` [PATCH v2] " Rafael J. Wysocki
2020-10-23 15:15                       ` Rafael J. Wysocki
2020-10-27  3:01                       ` Viresh Kumar
2020-10-27  3:13                         ` Viresh Kumar
2020-10-27 11:11                   ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef
2020-10-27 11:26                     ` Valentin Schneider
2020-10-27 11:42                       ` Qais Yousef
2020-10-27 11:48                         ` Viresh Kumar
2020-10-27 11:48                           ` Viresh Kumar
2020-10-23  6:12                 ` Viresh Kumar
2020-10-23  6:24                   ` Viresh Kumar
2020-10-23 15:06                   ` Rafael J. Wysocki
2020-10-23 15:06                     ` Rafael J. Wysocki
2020-10-27  3:01                     ` Viresh Kumar
2020-10-27  3:13                       ` Viresh Kumar
2020-10-22 11:21             ` AW: " Walter Harms
2020-10-21 12:28     ` Mel Gorman
2020-10-21 12:28       ` Mel Gorman
2020-10-21 12:25   ` Vincent Guittot
2020-10-21 12:25     ` Vincent Guittot
2020-10-21 12:47     ` Mel Gorman
2020-10-21 12:47       ` Mel Gorman
2020-10-21 12:56       ` Julia Lawall
2020-10-21 12:56         ` Julia Lawall
2020-10-21 13:18         ` Mel Gorman
2020-10-21 13:18           ` Mel Gorman
2020-10-21 13:24           ` Julia Lawall
2020-10-21 13:24             ` Julia Lawall
2020-10-21 15:08             ` Mel Gorman
2020-10-21 15:08               ` Mel Gorman
2020-10-21 15:18               ` Julia Lawall
2020-10-21 15:18                 ` Julia Lawall
2020-10-21 15:23                 ` Vincent Guittot
2020-10-21 15:23                   ` Vincent Guittot
2020-10-21 15:33                   ` Julia Lawall
2020-10-21 15:33                     ` Julia Lawall
2020-10-21 15:19               ` Vincent Guittot
2020-10-21 15:19                 ` Vincent Guittot
2020-10-21 17:00                 ` Mel Gorman
2020-10-21 17:00                   ` Mel Gorman
2020-10-21 17:39                   ` Julia Lawall
2020-10-21 17:39                     ` Julia Lawall
2020-10-21 13:48           ` Julia Lawall
2020-10-21 13:48             ` Julia Lawall
2020-10-21 15:26             ` Mel Gorman
2020-10-21 15:26               ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201021112038.GC32041@suse.de \
    --to=mgorman@suse.de \
    --cc=Gilles.Muller@inria.fr \
    --cc=Julia.Lawall@inria.fr \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-janitors@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.