Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@techsingularity.net>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	pauld@redhat.com, srikar@linux.vnet.ibm.com,
	quentin.perret@arm.com, dietmar.eggemann@arm.com,
	Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com,
	riel@surriel.com, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
Date: Wed, 18 Dec 2019 22:50:23 +0000	[thread overview]
Message-ID: <20191218225023.GG3178@techsingularity.net> (raw)
In-Reply-To: <8f049805-3e97-09bb-2d32-0718be1dec9b@arm.com>

On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
> > In general, the patch simply seeks to avoid unnecessarily cross-node
> > migrations when a machine is lightly loaded but shows benefits for other
> > workloads. While tests are still running, so far it seems to benefit
> > light-utilisation smaller workloads on large machines and does not appear
> > to do any harm to larger or parallelised workloads.
> > 
> 
> Thanks for the detailed testing, I haven't digested it entirely yet but I
> appreciate the effort.
> 

No problem, this is one of those patches that it's best to test a bunch
of loads and machines. I haven't presented it all because the changelog
would be beyond ridiculous.

> > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		env->migration_type = migrate_task;
> >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> >  						 busiest->idle_cpus) >> 1);
> > +
> > +out_spare:
> > +		/*
> > +		 * Whether balancing the number of running tasks or the number
> > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > +		 * migrating between NUMA domains.
> > +		 */
> > +		if (env->sd->flags & SD_NUMA) {
> > +			unsigned int imbalance_adj, imbalance_max;
> > +
> > +			/*
> > +			 * imbalance_adj is the allowable degree of imbalance
> > +			 * to exist between two NUMA domains. It's calculated
> > +			 * relative to imbalance_pct with a minimum of two
> > +			 * tasks or idle CPUs.
> > +			 */
> > +			imbalance_adj = (busiest->group_weight *
> > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> 
> IIRC imbalance_pct for NUMA domains uses the default 125, so I read this as
> "allow an imbalance of 1 task per 8 CPU in the source group" (just making
> sure I follow).
> 

That is how it works out in most cases. As imbalance_pct can be anything
in theory, I didn't specify what it usually breaks down to. The >> 1 is
to go "half way" similar to how imbalance itself is calculated.

> > +			imbalance_adj = max(imbalance_adj, 2U);
> > +
> > +			/*
> > +			 * Ignore imbalance unless busiest sd is close to 50%
> > +			 * utilisation. At that point balancing for memory
> > +			 * bandwidth and potentially avoiding unnecessary use
> > +			 * of HT siblings is as relevant as memory locality.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> > +			    busiest->sum_nr_running < imbalance_max) {
> 
> The code does "unless busiest group has half as many runnable tasks (or more)
> as it has CPUs (modulo the adj thing)", is that what you mean by "unless
> busiest sd is close to 50% utilisation" in the comment? It's somewhat
> different IMO.
> 

Crap, yes. At the time of writing, I was thinking in terms of running
tasks that were fully active hence the misleading comment.

> > +				env->imbalance = 0;
> > +			}
> > +		}
> >  		return;
> >  	}
> >  
> > 
> I'm quite sure you have reasons to have written it that way, but I was
> hoping we could squash it down to something like:

I wrote it that way to make it clear exactly what has changed, the
thinking behind the checks and to avoid 80-col limits to make review
easier overall. It's a force of habit and I'm happy to reformat it as
you suggest except....

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..f05d09a8452e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> +		} else {
> +
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
> +							 busiest->idle_cpus) >> 1);
>  		}
>  
>  		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> +		 * Allow for a small imbalance between NUMA groups; don't do any
> +		 * of it if there is at least half as many tasks / busy CPUs as
> +		 * there are available CPUs in the busiest group
>  		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> -						 busiest->idle_cpus) >> 1);
> +		if (env->sd->flags & SD_NUMA &&
> +		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&

This last line is not exactly equivalent to what I wrote. It would need
to be

	(busiest->sum_nr_running < (busiest->group_weight >> 1) - imbalance_adj) &&

I can test as you suggest to see if it's roughly equivalent in terms of
performance. The intent was to have a cutoff just before we reached 50%
running tasks / busy CPUs.

-- 
Mel Gorman
SUSE Labs

next prev parent reply	other threads:[~2019-12-18 22:50 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
2019-12-18 18:50 ` Valentin Schneider
2019-12-18 22:50   ` Mel Gorman [this message]
2019-12-19 11:56     ` Valentin Schneider
2019-12-19 10:02   ` Peter Zijlstra
2019-12-19 11:46     ` Valentin Schneider
2019-12-19 14:23       ` Valentin Schneider
2019-12-19 15:23     ` Mel Gorman
2019-12-18 18:54 ` Valentin Schneider
2019-12-19  2:58 ` Rik van Riel
2019-12-19  8:41   ` Mel Gorman
2019-12-19 10:04 ` Peter Zijlstra
2019-12-19 14:45 ` Vincent Guittot
2019-12-19 15:16   ` Valentin Schneider
2019-12-19 15:18   ` Mel Gorman
2019-12-19 15:41     ` Vincent Guittot
2019-12-19 15:58       ` Mel Gorman
2019-12-20 13:00   ` Srikar Dronamraju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191218225023.GG3178@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=Morten.Rasmussen@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=parth@linux.ibm.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quentin.perret@arm.com \
    --cc=riel@surriel.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.