From: Mel Gorman <mgorman@techsingularity.net>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
pauld@redhat.com, valentin.schneider@arm.com,
srikar@linux.vnet.ibm.com, quentin.perret@arm.com,
dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com,
hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
Date: Thu, 19 Dec 2019 15:18:24 +0000 [thread overview]
Message-ID: <20191219151824.GI3178@techsingularity.net> (raw)
In-Reply-To: <20191219144539.GA19614@linaro.org>
On Thu, Dec 19, 2019 at 03:45:39PM +0100, Vincent Guittot wrote:
> Hi Mel,
>
> Thanks for looking at this NUMA locality vs spreading tasks point.
>
No problem.
> > @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > env->migration_type = migrate_task;
> > lsub_positive(&nr_diff, local->sum_nr_running);
> > env->imbalance = nr_diff >> 1;
> > - return;
> > + goto out_spare;
>
> Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too.
>
It's the common case for NUMA machines I'm aware of and from the
perspective of allowing a slight imbalance when there are spare CPUs, I
felt it was the same whether we were considering idle CPUs or the number
of tasks running.
The prefer_sibling case applies to the children and the corner case is
that balancing NUMA domains takes into account whether the MC domain
prefers siblings which is a bit odd. I believe, but don't know, that the
reasoning may have been to spread load for memory bandwidth usage.
> So you should probably add your
>
> > + * Whether balancing the number of running tasks or the number
> > + * of idle CPUs, consider allowing some degree of imbalance if
> > + * migrating between NUMA domains.
> > + */
> > + if (env->sd->flags & SD_NUMA) {
> > + unsigned int imbalance_adj, imbalance_max;
>
> ...
>
> > + }
>
> before the prefer_sibling case :
>
> if (busiest->group_weight == 1 || sds->prefer_sibling) {
> unsigned int nr_diff = busiest->sum_nr_running;
> /*
> * When prefer sibling, evenly spread running tasks on
> * groups.
> */
>
I don't understand. If I move SD_NUMA checks above the imbalance
calculation, how do I know whether the imbalance should be ignored?
>
> >
> > }
> >
> > /*
> > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > env->migration_type = migrate_task;
> > env->imbalance = max_t(long, 0, (local->idle_cpus -
> > busiest->idle_cpus) >> 1);
> > +
> > +out_spare:
> > + /*
> > + * Whether balancing the number of running tasks or the number
> > + * of idle CPUs, consider allowing some degree of imbalance if
> > + * migrating between NUMA domains.
> > + */
> > + if (env->sd->flags & SD_NUMA) {
> > + unsigned int imbalance_adj, imbalance_max;
> > +
> > + /*
> > + * imbalance_adj is the allowable degree of imbalance
> > + * to exist between two NUMA domains. It's calculated
> > + * relative to imbalance_pct with a minimum of two
> > + * tasks or idle CPUs.
> > + */
> > + imbalance_adj = (busiest->group_weight *
> > + (env->sd->imbalance_pct - 100) / 100) >> 1;
> > + imbalance_adj = max(imbalance_adj, 2U);
> > +
> > + /*
> > + * Ignore imbalance unless busiest sd is close to 50%
> > + * utilisation. At that point balancing for memory
> > + * bandwidth and potentially avoiding unnecessary use
> > + * of HT siblings is as relevant as memory locality.
> > + */
> > + imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > + if (env->imbalance <= imbalance_adj &&
> > + busiest->sum_nr_running < imbalance_max) {i
>
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
>
Why? CPU affinity could have stacked multiple tasks on one CPU where
as I'm looking for a proxy hint on the amount of bandwidth required.
sum_nr_running does not give me an accurate estimate but it's better than
idle cpus.
> and you could simplify by
>
>
> if ((env->sd->flags & SD_NUMA) &&
> ((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> env->imbalance = 0;
> return;
> }
>
> And otherwise it will continue with the current path
>
I ended up doing something similar to this in v2 but it's a bit more
expanded so I can put in comments on why the comparisons are the way
they are. The multiplications are in the slow path.
> Also I'm a bit worry about using a 50% threshold that look a bit like a
> heuristic which can change depending of platform and the UCs that run of the
> system.
>
UCs?
And yes, it's a heuristic. In this case, I'm as concerned about memory
bandwidth availability as I am about improper locality due to agressive
balancing. We do not know the available memory bandwidth and we do not
know how much bandwidth the tasks required so 50% was as good as threshold
as any. I do not know of any way that can cheaply measure either bandwidth
usage (PMUs are not cheap) or available bandwidth (theoretical bandwidth !=
actual bandwidth).
In an earlier version that I never posted, I had no cutoff at all and
NAS took a roughly 30% performance penalty across all computational
kernels. Debug tracing led me to this cutoff and running a battery
of workloads led me to believe that it was a reasonable cutoff. It's
important.
> In fact i was hoping that we could use the numa_preferred_nid ?
Unfortunately not. For some tasks, they are not long-lived enough for NUMA
balancing to make a decision. For longer-lived tasks, if load balancing is
spreading the load across nodes and wakeups are pulling tasks together,
NUMA balancing will get a mix of remote/local samples and will be unable
to pick a node properly.
In the netperf figures I put in the changelog, I pointed out that NUMA
balancing sampled roughly 50% of accesses were remote. With the patch,
100% of the samples are local.
> During the
> detach of tasks, we don't detach the task if busiest has spare capacity and
> preferred_nid of the task is busiest.
>
Sure, but again if load balancing and waker/wakees are fighting each
other, NUMA balancing gets caught in the crossfire and cannot make a
sensible decision.
--
Mel Gorman
SUSE Labs
next prev parent reply other threads:[~2019-12-19 15:18 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
2019-12-18 18:50 ` Valentin Schneider
2019-12-18 22:50 ` Mel Gorman
2019-12-19 11:56 ` Valentin Schneider
2019-12-19 10:02 ` Peter Zijlstra
2019-12-19 11:46 ` Valentin Schneider
2019-12-19 14:23 ` Valentin Schneider
2019-12-19 15:23 ` Mel Gorman
2019-12-18 18:54 ` Valentin Schneider
2019-12-19 2:58 ` Rik van Riel
2019-12-19 8:41 ` Mel Gorman
2019-12-19 10:04 ` Peter Zijlstra
2019-12-19 14:45 ` Vincent Guittot
2019-12-19 15:16 ` Valentin Schneider
2019-12-19 15:18 ` Mel Gorman [this message]
2019-12-19 15:41 ` Vincent Guittot
2019-12-19 15:58 ` Mel Gorman
2019-12-20 13:00 ` Srikar Dronamraju
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191219151824.GI3178@techsingularity.net \
--to=mgorman@techsingularity.net \
--cc=Morten.Rasmussen@arm.com \
--cc=dietmar.eggemann@arm.com \
--cc=hdanton@sina.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=parth@linux.ibm.com \
--cc=pauld@redhat.com \
--cc=peterz@infradead.org \
--cc=quentin.perret@arm.com \
--cc=riel@surriel.com \
--cc=srikar@linux.vnet.ibm.com \
--cc=valentin.schneider@arm.com \
--cc=vincent.guittot@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox