From: Mel Gorman <mgorman@techsingularity.net>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
Ingo Molnar <mingo@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
pauld@redhat.com, srikar@linux.vnet.ibm.com,
quentin.perret@arm.com, dietmar.eggemann@arm.com,
Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com,
riel@surriel.com, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
Date: Wed, 18 Dec 2019 22:50:23 +0000 [thread overview]
Message-ID: <20191218225023.GG3178@techsingularity.net> (raw)
In-Reply-To: <8f049805-3e97-09bb-2d32-0718be1dec9b@arm.com>
On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
> > In general, the patch simply seeks to avoid unnecessarily cross-node
> > migrations when a machine is lightly loaded but shows benefits for other
> > workloads. While tests are still running, so far it seems to benefit
> > light-utilisation smaller workloads on large machines and does not appear
> > to do any harm to larger or parallelised workloads.
> >
>
> Thanks for the detailed testing, I haven't digested it entirely yet but I
> appreciate the effort.
>
No problem, this is one of those patches that it's best to test a bunch
of loads and machines. I haven't presented it all because the changelog
would be beyond ridiculous.
> > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > env->migration_type = migrate_task;
> > env->imbalance = max_t(long, 0, (local->idle_cpus -
> > busiest->idle_cpus) >> 1);
> > +
> > +out_spare:
> > + /*
> > + * Whether balancing the number of running tasks or the number
> > + * of idle CPUs, consider allowing some degree of imbalance if
> > + * migrating between NUMA domains.
> > + */
> > + if (env->sd->flags & SD_NUMA) {
> > + unsigned int imbalance_adj, imbalance_max;
> > +
> > + /*
> > + * imbalance_adj is the allowable degree of imbalance
> > + * to exist between two NUMA domains. It's calculated
> > + * relative to imbalance_pct with a minimum of two
> > + * tasks or idle CPUs.
> > + */
> > + imbalance_adj = (busiest->group_weight *
> > + (env->sd->imbalance_pct - 100) / 100) >> 1;
>
> IIRC imbalance_pct for NUMA domains uses the default 125, so I read this as
> "allow an imbalance of 1 task per 8 CPU in the source group" (just making
> sure I follow).
>
That is how it works out in most cases. As imbalance_pct can be anything
in theory, I didn't specify what it usually breaks down to. The >> 1 is
to go "half way" similar to how imbalance itself is calculated.
> > + imbalance_adj = max(imbalance_adj, 2U);
> > +
> > + /*
> > + * Ignore imbalance unless busiest sd is close to 50%
> > + * utilisation. At that point balancing for memory
> > + * bandwidth and potentially avoiding unnecessary use
> > + * of HT siblings is as relevant as memory locality.
> > + */
> > + imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > + if (env->imbalance <= imbalance_adj &&
> > + busiest->sum_nr_running < imbalance_max) {
>
> The code does "unless busiest group has half as many runnable tasks (or more)
> as it has CPUs (modulo the adj thing)", is that what you mean by "unless
> busiest sd is close to 50% utilisation" in the comment? It's somewhat
> different IMO.
>
Crap, yes. At the time of writing, I was thinking in terms of running
tasks that were fully active hence the misleading comment.
> > + env->imbalance = 0;
> > + }
> > + }
> > return;
> > }
> >
> >
> I'm quite sure you have reasons to have written it that way, but I was
> hoping we could squash it down to something like:
I wrote it that way to make it clear exactly what has changed, the
thinking behind the checks and to avoid 80-col limits to make review
easier overall. It's a force of habit and I'm happy to reformat it as
you suggest except....
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..f05d09a8452e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> env->migration_type = migrate_task;
> lsub_positive(&nr_diff, local->sum_nr_running);
> env->imbalance = nr_diff >> 1;
> - return;
> + } else {
> +
> + /*
> + * If there is no overload, we just want to even the number of
> + * idle cpus.
> + */
> + env->migration_type = migrate_task;
> + env->imbalance = max_t(long, 0, (local->idle_cpus -
> + busiest->idle_cpus) >> 1);
> }
>
> /*
> - * If there is no overload, we just want to even the number of
> - * idle cpus.
> + * Allow for a small imbalance between NUMA groups; don't do any
> + * of it if there is at least half as many tasks / busy CPUs as
> + * there are available CPUs in the busiest group
> */
> - env->migration_type = migrate_task;
> - env->imbalance = max_t(long, 0, (local->idle_cpus -
> - busiest->idle_cpus) >> 1);
> + if (env->sd->flags & SD_NUMA &&
> + (busiest->sum_nr_running < busiest->group_weight >> 1) &&
This last line is not exactly equivalent to what I wrote. It would need
to be
(busiest->sum_nr_running < (busiest->group_weight >> 1) - imbalance_adj) &&
I can test as you suggest to see if it's roughly equivalent in terms of
performance. The intent was to have a cutoff just before we reached 50%
running tasks / busy CPUs.
--
Mel Gorman
SUSE Labs
next prev parent reply other threads:[~2019-12-18 22:50 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
2019-12-18 18:50 ` Valentin Schneider
2019-12-18 22:50 ` Mel Gorman [this message]
2019-12-19 11:56 ` Valentin Schneider
2019-12-19 10:02 ` Peter Zijlstra
2019-12-19 11:46 ` Valentin Schneider
2019-12-19 14:23 ` Valentin Schneider
2019-12-19 15:23 ` Mel Gorman
2019-12-18 18:54 ` Valentin Schneider
2019-12-19 2:58 ` Rik van Riel
2019-12-19 8:41 ` Mel Gorman
2019-12-19 10:04 ` Peter Zijlstra
2019-12-19 14:45 ` Vincent Guittot
2019-12-19 15:16 ` Valentin Schneider
2019-12-19 15:18 ` Mel Gorman
2019-12-19 15:41 ` Vincent Guittot
2019-12-19 15:58 ` Mel Gorman
2019-12-20 13:00 ` Srikar Dronamraju
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191218225023.GG3178@techsingularity.net \
--to=mgorman@techsingularity.net \
--cc=Morten.Rasmussen@arm.com \
--cc=dietmar.eggemann@arm.com \
--cc=hdanton@sina.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=parth@linux.ibm.com \
--cc=pauld@redhat.com \
--cc=peterz@infradead.org \
--cc=quentin.perret@arm.com \
--cc=riel@surriel.com \
--cc=srikar@linux.vnet.ibm.com \
--cc=valentin.schneider@arm.com \
--cc=vincent.guittot@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox