All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Phil Auld <pauld@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
	Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v4 04/11] sched/fair: rework load_balance
Date: Fri, 8 Nov 2019 17:35:01 +0100	[thread overview]
Message-ID: <20191108163501.GA26528@linaro.org> (raw)
In-Reply-To: <20191031114020.GQ3016@techsingularity.net>

Le Thursday 31 Oct 2019 à 11:40:20 (+0000), Mel Gorman a écrit :
> On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote:
> > > > > On the last one, spreading tasks evenly across NUMA domains is not
> > > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine
> > > > > with 24 logical CPUs per socket, it should not automatically mean that
> > > > > one task should move cross-node and I have definitely observed this
> > > > > happening. It's probably bad in terms of locality no matter what but it's
> > > > > especially bad if the 2 tasks happened to be communicating because then
> > > > > load balancing will pull apart the tasks while wake_affine will push
> > > > > them together (and potentially NUMA balancing as well). Note that this
> > > > > also applies for some IO workloads because, depending on the filesystem,
> > > > > the task may be communicating with workqueues (XFS) or a kernel thread
> > > > > (ext4 with jbd2).
> > > >
> > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing
> > > > still gives guidances with fbq_classify_group/queue.
> > >
> > > I know the NUMA_BALANCING part is not touched, I'm talking about load
> > > balancing across SD_NUMA domains which happens independently of
> > > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to
> > > override the load balancer when it moves tasks away from the preferred
> > > node.
> > 
> > Yes. this patchset relies on this override for now to prevent moving task away.
> 
> Fair enough, netperf hits the corner case where it does not work but
> that is also true without your series.

I run mmtest/netperf test on my setup. It's a mix of small positive or
negative differences (see below)

netperf-udp
                                    5.3-rc2                5.3-rc2
								        tip               +rwk+fix
Hmean     send-64          95.06 (   0.00%)       94.12 *  -0.99%*
Hmean     send-128        191.71 (   0.00%)      189.94 *  -0.93%*
Hmean     send-256        379.05 (   0.00%)      370.96 *  -2.14%*
Hmean     send-1024      1485.24 (   0.00%)     1476.64 *  -0.58%*
Hmean     send-2048      2894.80 (   0.00%)     2887.00 *  -0.27%*
Hmean     send-3312      4580.27 (   0.00%)     4555.91 *  -0.53%*
Hmean     send-4096      5592.99 (   0.00%)     5517.31 *  -1.35%*
Hmean     send-8192      9117.00 (   0.00%)     9497.06 *   4.17%*
Hmean     send-16384    15824.59 (   0.00%)    15824.30 *  -0.00%*
Hmean     recv-64          95.06 (   0.00%)       94.08 *  -1.04%*
Hmean     recv-128        191.68 (   0.00%)      189.89 *  -0.93%*
Hmean     recv-256        378.94 (   0.00%)      370.87 *  -2.13%*
Hmean     recv-1024      1485.24 (   0.00%)     1476.20 *  -0.61%*
Hmean     recv-2048      2893.52 (   0.00%)     2885.25 *  -0.29%*
Hmean     recv-3312      4580.27 (   0.00%)     4553.48 *  -0.58%*
Hmean     recv-4096      5592.99 (   0.00%)     5517.27 *  -1.35%*
Hmean     recv-8192      9115.69 (   0.00%)     9495.69 *   4.17%*
Hmean     recv-16384    15824.36 (   0.00%)    15818.36 *  -0.04%*
Stddev    send-64           0.15 (   0.00%)        1.17 (-688.29%)
Stddev    send-128          1.56 (   0.00%)        1.15 (  25.96%)
Stddev    send-256          4.20 (   0.00%)        5.27 ( -25.63%)
Stddev    send-1024        20.11 (   0.00%)        5.68 (  71.74%)
Stddev    send-2048        11.06 (   0.00%)       21.74 ( -96.50%)
Stddev    send-3312        61.10 (   0.00%)       48.03 (  21.39%)
Stddev    send-4096        71.84 (   0.00%)       31.99 (  55.46%)
Stddev    send-8192       165.14 (   0.00%)      159.99 (   3.12%)
Stddev    send-16384       81.30 (   0.00%)      188.65 (-132.05%)
Stddev    recv-64           0.15 (   0.00%)        1.15 (-673.42%)
Stddev    recv-128          1.58 (   0.00%)        1.14 (  28.27%)
Stddev    recv-256          4.29 (   0.00%)        5.19 ( -21.05%)
Stddev    recv-1024        20.11 (   0.00%)        5.70 (  71.67%)
Stddev    recv-2048        10.43 (   0.00%)       21.41 (-105.22%)
Stddev    recv-3312        61.10 (   0.00%)       46.92 (  23.20%)
Stddev    recv-4096        71.84 (   0.00%)       31.97 (  55.50%)
Stddev    recv-8192       163.90 (   0.00%)      160.88 (   1.84%)
Stddev    recv-16384       81.41 (   0.00%)      187.01 (-129.71%)

                     5.3-rc2     5.3-rc2
                         tip    +rwk+fix
Duration User          38.90       39.13
Duration System      1311.29     1311.10
Duration Elapsed     1892.82     1892.86
 
netperf-tcp
                               5.3-rc2                5.3-rc2
                                   tip               +rwk+fix
Hmean     64         871.30 (   0.00%)      860.90 *  -1.19%*
Hmean     128       1689.39 (   0.00%)     1679.31 *  -0.60%*
Hmean     256       3199.59 (   0.00%)     3241.98 *   1.32%*
Hmean     1024      9390.47 (   0.00%)     9268.47 *  -1.30%*
Hmean     2048     13373.95 (   0.00%)    13395.61 *   0.16%*
Hmean     3312     16701.30 (   0.00%)    17165.96 *   2.78%*
Hmean     4096     15831.03 (   0.00%)    15544.66 *  -1.81%*
Hmean     8192     19720.01 (   0.00%)    20188.60 *   2.38%*
Hmean     16384    23925.90 (   0.00%)    23914.50 *  -0.05%*
Stddev    64           7.38 (   0.00%)        4.23 (  42.67%)
Stddev    128         11.62 (   0.00%)       10.13 (  12.85%)
Stddev    256         34.33 (   0.00%)        7.94 (  76.88%)
Stddev    1024        35.61 (   0.00%)      116.34 (-226.66%)
Stddev    2048       285.30 (   0.00%)       80.50 (  71.78%)
Stddev    3312       304.74 (   0.00%)      449.08 ( -47.36%)
Stddev    4096       668.11 (   0.00%)      569.30 (  14.79%)
Stddev    8192       733.23 (   0.00%)      944.38 ( -28.80%)
Stddev    16384      553.03 (   0.00%)      299.44 (  45.86%)

                     5.3-rc2     5.3-rc2
                         tip    +rwk+fix
Duration User         138.05      140.95
Duration System      1210.60     1208.45
Duration Elapsed     1352.86     1352.90


> 
> > I agree that additional patches are probably needed to improve load
> > balance at NUMA level and I expect that this rework will make it
> > simpler to add.
> > I just wanted to get the output of some real use cases before defining
> > more numa level specific conditions. Some want to spread on there numa
> > nodes but other want to keep everything together. The preferred node
> > and fbq_classify_group was the only sensible metrics to me when he
> > wrote this patchset but changes can be added if they make sense.
> > 
> 
> That's fair. While it was possible to address the case before your
> series, it was a hatchet job. If the changelog simply notes that some
> special casing may still be required for SD_NUMA but it's outside the
> scope of the series, then I'd be happy. At least there is a good chance
> then if there is follow-up work that it won't be interpreted as an
> attempt to reintroduce hacky heuristics.
>

Would the additional comment make sense for you about work to be done
for SD_NUMA ?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ad4b21..7e4cb65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };
  * group. see update_sd_pick_busiest().
  */
 enum group_type {
+	/*
+	 * The group has spare capacity that can be used to process more work.
+	 */
 	group_has_spare = 0,
+	/*
+	 * The group is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevetheless, some tasks might wait before running.
+	 */
 	group_fully_busy,
+	/*
+	 * One task doesn't fit with CPU's capacity and must be migrated on a
+	 * more powerful CPU.
+	 */
 	group_misfit_task,
+	/*
+	 * One local CPU with higher capacity is available and task should be
+	 * migrated on it instead on current CPU.
+	 */
 	group_asym_packing,
+	/*
+	 * The tasks affinity prevents the scheduler to balance the load across
+	 * the system.
+	 */
 	group_imbalanced,
+	/*
+	 * The CPU is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
 	group_overloaded
 };
 
@@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 
 	/*
 	 * Try to use spare capacity of local group without overloading it or
-	 * emptying busiest
+	 * emptying busiest.
+	 * XXX Spreading tasks across numa nodes is not always the best policy
+	 * and special cares should be taken for SD_NUMA domain level before
+	 * spreading the tasks. For now, load_balance() fully relies on
+	 * NUMA_BALANCING and fbq_classify_group/rq to overide the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
-- 
2.7.4

> 
> > >
> > > > But the latter could also take advantage of the new type of group. For
> > > > example, what I did in the fix for find_idlest_group : checking
> > > > numa_preferred_nid when the group has capacity and keep the task on
> > > > preferred node if possible. Similar behavior could also be beneficial
> > > > in periodic load_balance case.
> > > >
> > >
> > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at
> > > all. NUMA balancing might be disabled, the task may not have been running
> > > long enough to pick a preferred NID or NUMA balancing might be unable to
> > > pick a preferred NID. The decision to avoid unnecessary migrations across
> > > NUMA domains should be made independently of NUMA balancing. The netperf
> > > configuration from mmtests is great at illustrating the point because it'll
> > > also say what the average local/remote access ratio is. 2 communicating
> > > tasks running on an otherwise idle NUMA machine should not have the load
> > > balancer move the server to one node and the client to another.
> > 
> > I'm going to make it a try on my setup to see the results
> > 
> 
> Thanks.
> 
> -- 
> Mel Gorman
> SUSE Labs

  reply	other threads:[~2019-11-08 16:35 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-18 13:26 [PATCH v4 00/10] sched/fair: rework the CFS load balance Vincent Guittot
2019-10-18 13:26 ` [PATCH v4 01/11] sched/fair: clean up asym packing Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Clean " tip-bot2 for Vincent Guittot
2019-10-30 14:51   ` [PATCH v4 01/11] sched/fair: clean " Mel Gorman
2019-10-30 16:03     ` Vincent Guittot
2019-10-18 13:26 ` [PATCH v4 02/11] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Rename sg_lb_stats::sum_nr_running " tip-bot2 for Vincent Guittot
2019-10-30 14:53   ` [PATCH v4 02/11] sched/fair: rename sum_nr_running " Mel Gorman
2019-10-18 13:26 ` [PATCH v4 03/11] sched/fair: remove meaningless imbalance calculation Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Remove " tip-bot2 for Vincent Guittot
2019-10-18 13:26 ` [PATCH v4 04/11] sched/fair: rework load_balance Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Rework load_balance() tip-bot2 for Vincent Guittot
2019-10-30 15:45   ` [PATCH v4 04/11] sched/fair: rework load_balance Mel Gorman
2019-10-30 16:16     ` Valentin Schneider
2019-10-31  9:09     ` Vincent Guittot
2019-10-31 10:15       ` Mel Gorman
2019-10-31 11:13         ` Vincent Guittot
2019-10-31 11:40           ` Mel Gorman
2019-11-08 16:35             ` Vincent Guittot [this message]
2019-11-08 18:37               ` Mel Gorman
2019-11-12 10:58                 ` Vincent Guittot
2019-11-12 15:06                   ` Mel Gorman
2019-11-12 15:40                     ` Vincent Guittot
2019-11-12 17:45                       ` Mel Gorman
2019-11-18 13:50     ` Ingo Molnar
2019-11-18 13:57       ` Vincent Guittot
2019-11-18 14:51       ` Mel Gorman
2019-10-18 13:26 ` [PATCH v4 05/11] sched/fair: use rq->nr_running when balancing load Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Use " tip-bot2 for Vincent Guittot
2019-10-30 15:54   ` [PATCH v4 05/11] sched/fair: use " Mel Gorman
2019-10-18 13:26 ` [PATCH v4 06/11] sched/fair: use load instead of runnable load in load_balance Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Use load instead of runnable load in load_balance() tip-bot2 for Vincent Guittot
2019-10-30 15:58   ` [PATCH v4 06/11] sched/fair: use load instead of runnable load in load_balance Mel Gorman
2019-10-18 13:26 ` [PATCH v4 07/11] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Spread out tasks evenly " tip-bot2 for Vincent Guittot
2019-10-30 16:03   ` [PATCH v4 07/11] sched/fair: evenly spread tasks " Mel Gorman
2019-10-18 13:26 ` [PATCH v4 08/11] sched/fair: use utilization to select misfit task Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Use " tip-bot2 for Vincent Guittot
2019-10-18 13:26 ` [PATCH v4 09/11] sched/fair: use load instead of runnable load in wakeup path Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Use " tip-bot2 for Vincent Guittot
2019-10-18 13:26 ` [PATCH v4 10/11] sched/fair: optimize find_idlest_group Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Optimize find_idlest_group() tip-bot2 for Vincent Guittot
2019-10-18 13:26 ` [PATCH v4 11/11] sched/fair: rework find_idlest_group Vincent Guittot
2019-10-21  9:12   ` [tip: sched/core] sched/fair: Rework find_idlest_group() tip-bot2 for Vincent Guittot
2019-10-22 16:46   ` [PATCH] sched/fair: fix rework of find_idlest_group() Vincent Guittot
2019-10-23  7:50     ` Chen, Rong A
2019-10-30 16:07     ` Mel Gorman
2019-11-18 17:42     ` [tip: sched/core] sched/fair: Fix " tip-bot2 for Vincent Guittot
2019-11-22 14:37     ` [PATCH] sched/fair: fix " Valentin Schneider
2019-11-25  9:16       ` Vincent Guittot
2019-11-25 11:03         ` Valentin Schneider
2019-11-20 11:58   ` [PATCH v4 11/11] sched/fair: rework find_idlest_group Qais Yousef
2019-11-20 13:21     ` Vincent Guittot
2019-11-20 16:53       ` Vincent Guittot
2019-11-20 17:34         ` Qais Yousef
2019-11-20 17:43           ` Vincent Guittot
2019-11-20 18:10             ` Qais Yousef
2019-11-20 18:20               ` Vincent Guittot
2019-11-20 18:27                 ` Qais Yousef
2019-11-20 19:28                   ` Vincent Guittot
2019-11-20 19:55                     ` Qais Yousef
2019-11-21 14:58                       ` Qais Yousef
2019-11-22 14:34   ` Valentin Schneider
2019-11-25  9:59     ` Vincent Guittot
2019-11-25 11:13       ` Valentin Schneider
2019-10-21  7:50 ` [PATCH v4 00/10] sched/fair: rework the CFS load balance Ingo Molnar
2019-10-21  8:44   ` Vincent Guittot
2019-10-21 12:56     ` Phil Auld
2019-10-24 12:38     ` Phil Auld
2019-10-24 13:46       ` Phil Auld
2019-10-24 14:59         ` Vincent Guittot
2019-10-25 13:33           ` Phil Auld
2019-10-28 13:03             ` Vincent Guittot
2019-10-30 14:39               ` Phil Auld
2019-10-30 16:24                 ` Dietmar Eggemann
2019-10-30 16:35                   ` Valentin Schneider
2019-10-30 17:19                     ` Phil Auld
2019-10-30 17:25                       ` Valentin Schneider
2019-10-30 17:29                         ` Phil Auld
2019-10-30 17:28                       ` Vincent Guittot
2019-10-30 17:44                         ` Phil Auld
2019-10-30 17:25                 ` Vincent Guittot
2019-10-31 13:57                   ` Phil Auld
2019-10-31 16:41                     ` Vincent Guittot
2019-10-30 16:24   ` Mel Gorman
2019-10-30 16:35     ` Vincent Guittot
2019-11-18 13:15     ` Ingo Molnar
2019-11-25 12:48 ` Valentin Schneider
2020-01-03 16:39   ` Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191108163501.GA26528@linaro.org \
    --to=vincent.guittot@linaro.org \
    --cc=Morten.Rasmussen@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=parth@linux.ibm.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quentin.perret@arm.com \
    --cc=riel@surriel.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.