From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>,
Matt Fleming <matt@codeblueprint.co.uk>,
LKML <linux-kernel@vger.kernel.org>,
Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
Date: Tue, 30 Jan 2018 10:45:55 +0000 [thread overview]
Message-ID: <20180130104555.4125-5-mgorman@techsingularity.net> (raw)
In-Reply-To: <20180130104555.4125-1-mgorman@techsingularity.net>

The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
Rewrite and improve select_idle_siblings()") replaced a domain iteration
with a search that broadly speaking does a wrapped walk of the scheduler
domain sharing a last-level-cache. While this had a number of improvements,
one consequence is that two tasks that share a waker/wakee relationship push
each other around a socket. Even though only two tasks may be active, all cores
end up being used. This is great from a search perspective and spreads load
across individual cores, but it has adverse consequences for cpufreq. As each
CPU has relatively low utilisation, cpufreq may decide the utilisation is
too low to use a higher P-state and overall computation throughput suffers.
While individual cpufreq and cpuidle drivers may compensate by artificially
boosting P-state (at C0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.

This patch tracks a recently used CPU: the CPU a task was running on when
it last acted as a waker, or a CPU it was recently running on when it is
the wakee. During SIS, the recently used CPU is used as a target if it's still
allowed by the task and is idle.

The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion. With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but a similar
principle applies).
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 2)
A (cpu 2) wake B (wakes on cpu 3)
etc.
A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time. It is simply because A can be rescheduled to another CPU;
the pattern is that prev == target when B tries to wake up A, and the
information about CPU 0 has been lost.
With this patch, the pattern is more likely to be
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 0)
A (cpu 0) wake B (wakes on cpu 1)
etc
i.e. two communicating tasks are more likely to use just two cores instead
of all available cores sharing an LLC.
The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different.
4.15.0-rc9 4.15.0-rc9
waprev-v1 biasancestor-v1
Hmean 1 287.54 ( 0.00%) 817.01 ( 184.14%)
Hmean 2 1268.12 ( 0.00%) 1781.24 ( 40.46%)
Hmean 4 1739.68 ( 0.00%) 1594.47 ( -8.35%)
Hmean 8 2464.12 ( 0.00%) 2479.56 ( 0.63%)
Hmean 64 1455.57 ( 0.00%) 1434.68 ( -1.44%)
The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also sees small improvements
(6-11% depending on machine and thread count). The Facebook schbench was also
tested but in most cases showed little or no difference in wakeup latencies.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
include/linux/sched.h | 8 ++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 22 ++++++++++++++++++++--
3 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2588263a989..d9140ddaa4e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -551,6 +551,14 @@ struct task_struct {
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;
+ /*
+ * recent_used_cpu is initially set as the last CPU used by a task
+ * that wakes affine another task. Waker/wakee relationships can
+ * push tasks around a CPU where each wakeup moves to the next one.
+ * Tracking a recently used CPU allows a quick search for a recently
+ * used CPU that may be idle.
+ */
+ int recent_used_cpu;
int wake_cpu;
#endif
int on_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32aabfda..68d7bcaf0fc7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2460,6 +2460,7 @@ void wake_up_new_task(struct task_struct *p)
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
* as we're not fully set-up yet.
*/
+ p->recent_used_cpu = task_cpu(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b732caa6fba..e96b0c1b43ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6201,7 +6201,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
struct sched_domain *sd;
- int i;
+ int i, recent_used_cpu;
if (idle_cpu(target))
return target;
@@ -6212,6 +6212,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
return prev;
+ /* Check a recently used CPU as a potential idle candidate */
+ recent_used_cpu = p->recent_used_cpu;
+ if (recent_used_cpu != prev &&
+ recent_used_cpu != target &&
+ cpus_share_cache(recent_used_cpu, target) &&
+ idle_cpu(recent_used_cpu) &&
+ cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
+ /*
+ * Replace recent_used_cpu with prev as it is a potential
+ * candidate for the next wake.
+ */
+ p->recent_used_cpu = prev;
+ return recent_used_cpu;
+ }
+
sd = rcu_dereference(per_cpu(sd_llc, target));
if (!sd)
return target;
@@ -6379,9 +6394,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (!sd) {
pick_cpu:
- if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+ if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ if (want_affine)
+ current->recent_used_cpu = cpu;
+ }
} else {
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
}
--
2.15.1