From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Phil Auld <pauld@redhat.com>,
Valentin Schneider <valentin.schneider@arm.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Quentin Perret <quentin.perret@arm.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Morten Rasmussen <Morten.Rasmussen@arm.com>,
Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v4 04/11] sched/fair: rework load_balance
Date: Fri, 8 Nov 2019 17:35:01 +0100
Message-ID: <20191108163501.GA26528@linaro.org>
In-Reply-To: <20191031114020.GQ3016@techsingularity.net>
On Thursday 31 Oct 2019 at 11:40:20 (+0000), Mel Gorman wrote:
> On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote:
> > > > > On the last one, spreading tasks evenly across NUMA domains is not
> > > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine
> > > > > with 24 logical CPUs per socket, it should not automatically mean that
> > > > > one task should move cross-node and I have definitely observed this
> > > > > happening. It's probably bad in terms of locality no matter what but it's
> > > > > especially bad if the 2 tasks happened to be communicating because then
> > > > > load balancing will pull apart the tasks while wake_affine will push
> > > > > them together (and potentially NUMA balancing as well). Note that this
> > > > > also applies for some IO workloads because, depending on the filesystem,
> > > > > the task may be communicating with workqueues (XFS) or a kernel thread
> > > > > (ext4 with jbd2).
> > > >
> > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing
> > > > still gives guidance with fbq_classify_group/queue.
> > >
> > > I know the NUMA_BALANCING part is not touched, I'm talking about load
> > > balancing across SD_NUMA domains which happens independently of
> > > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to
> > > override the load balancer when it moves tasks away from the preferred
> > > node.
> >
> > Yes, this patchset relies on this override for now to prevent moving tasks away.
>
> Fair enough, netperf hits the corner case where it does not work but
> that is also true without your series.
I ran the mmtests netperf tests on my setup. The results are a mix of small
positive and negative differences (see below).

netperf-udp
                         5.3-rc2                5.3-rc2
                             tip               +rwk+fix
Hmean     send-64          95.06 (   0.00%)       94.12 *  -0.99%*
Hmean     send-128        191.71 (   0.00%)      189.94 *  -0.93%*
Hmean     send-256        379.05 (   0.00%)      370.96 *  -2.14%*
Hmean     send-1024      1485.24 (   0.00%)     1476.64 *  -0.58%*
Hmean     send-2048      2894.80 (   0.00%)     2887.00 *  -0.27%*
Hmean     send-3312      4580.27 (   0.00%)     4555.91 *  -0.53%*
Hmean     send-4096      5592.99 (   0.00%)     5517.31 *  -1.35%*
Hmean     send-8192      9117.00 (   0.00%)     9497.06 *   4.17%*
Hmean     send-16384    15824.59 (   0.00%)    15824.30 *  -0.00%*
Hmean     recv-64          95.06 (   0.00%)       94.08 *  -1.04%*
Hmean     recv-128        191.68 (   0.00%)      189.89 *  -0.93%*
Hmean     recv-256        378.94 (   0.00%)      370.87 *  -2.13%*
Hmean     recv-1024      1485.24 (   0.00%)     1476.20 *  -0.61%*
Hmean     recv-2048      2893.52 (   0.00%)     2885.25 *  -0.29%*
Hmean     recv-3312      4580.27 (   0.00%)     4553.48 *  -0.58%*
Hmean     recv-4096      5592.99 (   0.00%)     5517.27 *  -1.35%*
Hmean     recv-8192      9115.69 (   0.00%)     9495.69 *   4.17%*
Hmean     recv-16384    15824.36 (   0.00%)    15818.36 *  -0.04%*
Stddev    send-64           0.15 (   0.00%)        1.17 (-688.29%)
Stddev    send-128          1.56 (   0.00%)        1.15 (  25.96%)
Stddev    send-256          4.20 (   0.00%)        5.27 ( -25.63%)
Stddev    send-1024        20.11 (   0.00%)        5.68 (  71.74%)
Stddev    send-2048        11.06 (   0.00%)       21.74 ( -96.50%)
Stddev    send-3312        61.10 (   0.00%)       48.03 (  21.39%)
Stddev    send-4096        71.84 (   0.00%)       31.99 (  55.46%)
Stddev    send-8192       165.14 (   0.00%)      159.99 (   3.12%)
Stddev    send-16384       81.30 (   0.00%)      188.65 (-132.05%)
Stddev    recv-64           0.15 (   0.00%)        1.15 (-673.42%)
Stddev    recv-128          1.58 (   0.00%)        1.14 (  28.27%)
Stddev    recv-256          4.29 (   0.00%)        5.19 ( -21.05%)
Stddev    recv-1024        20.11 (   0.00%)        5.70 (  71.67%)
Stddev    recv-2048        10.43 (   0.00%)       21.41 (-105.22%)
Stddev    recv-3312        61.10 (   0.00%)       46.92 (  23.20%)
Stddev    recv-4096        71.84 (   0.00%)       31.97 (  55.50%)
Stddev    recv-8192       163.90 (   0.00%)      160.88 (   1.84%)
Stddev    recv-16384       81.41 (   0.00%)      187.01 (-129.71%)

                       5.3-rc2     5.3-rc2
                           tip    +rwk+fix
Duration User            38.90       39.13
Duration System        1311.29     1311.10
Duration Elapsed       1892.82     1892.86

netperf-tcp
                         5.3-rc2                5.3-rc2
                             tip               +rwk+fix
Hmean     64              871.30 (   0.00%)      860.90 *  -1.19%*
Hmean     128            1689.39 (   0.00%)     1679.31 *  -0.60%*
Hmean     256            3199.59 (   0.00%)     3241.98 *   1.32%*
Hmean     1024           9390.47 (   0.00%)     9268.47 *  -1.30%*
Hmean     2048          13373.95 (   0.00%)    13395.61 *   0.16%*
Hmean     3312          16701.30 (   0.00%)    17165.96 *   2.78%*
Hmean     4096          15831.03 (   0.00%)    15544.66 *  -1.81%*
Hmean     8192          19720.01 (   0.00%)    20188.60 *   2.38%*
Hmean     16384         23925.90 (   0.00%)    23914.50 *  -0.05%*
Stddev    64                7.38 (   0.00%)        4.23 (  42.67%)
Stddev    128              11.62 (   0.00%)       10.13 (  12.85%)
Stddev    256              34.33 (   0.00%)        7.94 (  76.88%)
Stddev    1024             35.61 (   0.00%)      116.34 (-226.66%)
Stddev    2048            285.30 (   0.00%)       80.50 (  71.78%)
Stddev    3312            304.74 (   0.00%)      449.08 ( -47.36%)
Stddev    4096            668.11 (   0.00%)      569.30 (  14.79%)
Stddev    8192            733.23 (   0.00%)      944.38 ( -28.80%)
Stddev    16384           553.03 (   0.00%)      299.44 (  45.86%)

                       5.3-rc2     5.3-rc2
                           tip    +rwk+fix
Duration User           138.05      140.95
Duration System        1210.60     1208.45
Duration Elapsed       1352.86     1352.90
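As a reading aid for the netperf numbers: the percentage column looks like the relative change of the +rwk+fix kernel against the tip baseline, with the sign flipped for Stddev so that a positive value always means "better". That reading is an assumption inferred from the figures, not something the report states, and the helper names below are hypothetical. A minimal sketch in C:

```c
/*
 * Assumed interpretation of the percentage column; the function names
 * are illustrative only and do not come from mmtests.
 */

/* Relative change in percent for metrics where higher is better
 * (e.g. the Hmean throughput rows). */
static double pct_gain_higher_is_better(double base, double patched)
{
	return (patched - base) / base * 100.0;
}

/* Relative change in percent for metrics where lower is better
 * (e.g. the Stddev rows): the sign is flipped so that a positive
 * result still means the patched kernel improved. */
static double pct_gain_lower_is_better(double base, double patched)
{
	return (base - patched) / base * 100.0;
}
```

For instance, pct_gain_higher_is_better(9117.00, 9497.06) gives about 4.17, which matches the UDP send-8192 row. Because the report prints rounded values, recomputing from them can differ in the last digits, most visibly for very small Stddev values.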
>
> > I agree that additional patches are probably needed to improve load
> > balance at NUMA level and I expect that this rework will make it
> > simpler to add.
> > I just wanted to get the output of some real use cases before defining
> > more NUMA-level specific conditions. Some want to spread across their NUMA
> > nodes but others want to keep everything together. The preferred node
> > and fbq_classify_group were the only sensible metrics to me when I
> > wrote this patchset but changes can be added if they make sense.
> >
>
> That's fair. While it was possible to address the case before your
> series, it was a hatchet job. If the changelog simply notes that some
> special casing may still be required for SD_NUMA but it's outside the
> scope of the series, then I'd be happy. At least there is a good chance
> then if there is follow-up work that it won't be interpreted as an
> attempt to reintroduce hacky heuristics.
>
Would the additional comment below, about the work still to be done for
SD_NUMA, make sense to you?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ad4b21..7e4cb65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };
* group. see update_sd_pick_busiest().
*/
enum group_type {
+ /*
+ * The group has spare capacity that can be used to process more work.
+ */
group_has_spare = 0,
+ /*
+ * The group is fully used and the tasks don't compete for more CPU
+ * cycles. Nevertheless, some tasks might wait before running.
+ */
group_fully_busy,
+ /*
+ * One task doesn't fit with CPU's capacity and must be migrated to a
+ * more powerful CPU.
+ */
group_misfit_task,
+ /*
+ * One local CPU with higher capacity is available and the task should be
+ * migrated to it instead of staying on the current CPU.
+ */
group_asym_packing,
+ /*
+ * The tasks' affinity prevents the scheduler from balancing the load
+ * across the system.
+ */
group_imbalanced,
+ /*
+ * The CPU is overloaded and can't provide expected CPU cycles to all
+ * tasks.
+ */
group_overloaded
};
@@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/*
* Try to use spare capacity of local group without overloading it or
- * emptying busiest
+ * emptying busiest.
+ * XXX Spreading tasks across NUMA nodes is not always the best policy
+ * and special care should be taken for SD_NUMA domain level before
+ * spreading the tasks. For now, load_balance() fully relies on
+ * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
*/
if (local->group_type == group_has_spare) {
if (busiest->group_type > group_fully_busy) {
--
2.7.4
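To make the test in the second hunk concrete: because `enum group_type` assigns increasing values from `group_has_spare` to `group_overloaded`, the comparison `busiest->group_type > group_fully_busy` selects the misfit, asym-packing, imbalanced and overloaded cases in one integer test. The standalone sketch below assumes nothing beyond the enum and the two conditions visible in the diff; `struct sg_stats_sketch` and `spare_local_can_help()` are hypothetical names for illustration, not kernel API.

```c
#include <stdbool.h>

/* Copied from the diff; the implicit values 0..5 define the ordering. */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded
};

/* Hypothetical stand-in for the per-group statistics; only the field
 * needed for this illustration is kept. */
struct sg_stats_sketch {
	enum group_type group_type;
};

/*
 * Combines the two nested conditions shown in calculate_imbalance():
 * the local group has spare capacity and the busiest group is classified
 * above group_fully_busy.
 */
static bool spare_local_can_help(const struct sg_stats_sketch *local,
				 const struct sg_stats_sketch *busiest)
{
	return local->group_type == group_has_spare &&
	       busiest->group_type > group_fully_busy;
}
```

With a local group in `group_has_spare`, the helper returns true for a busiest group in `group_overloaded` and false for one in `group_fully_busy`. What the kernel then does inside that branch is not shown in the diff and is deliberately left out of the sketch.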
>
> > >
> > > > But the latter could also take advantage of the new type of group. For
> > > > example, what I did in the fix for find_idlest_group : checking
> > > > numa_preferred_nid when the group has capacity and keep the task on
> > > > preferred node if possible. Similar behavior could also be beneficial
> > > > in periodic load_balance case.
> > > >
> > >
> > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at
> > > all. NUMA balancing might be disabled, the task may not have been running
> > > long enough to pick a preferred NID or NUMA balancing might be unable to
> > > pick a preferred NID. The decision to avoid unnecessary migrations across
> > > NUMA domains should be made independently of NUMA balancing. The netperf
> > > configuration from mmtests is great at illustrating the point because it'll
> > > also say what the average local/remote access ratio is. 2 communicating
> > > tasks running on an otherwise idle NUMA machine should not have the load
> > > balancer move the server to one node and the client to another.
> >
> > I'm going to give it a try on my setup to see the results
> >
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs