Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@techsingularity.net>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	pauld@redhat.com, valentin.schneider@arm.com,
	srikar@linux.vnet.ibm.com, quentin.perret@arm.com,
	dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com,
	hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
Date: Thu, 19 Dec 2019 15:18:24 +0000	[thread overview]
Message-ID: <20191219151824.GI3178@techsingularity.net> (raw)
In-Reply-To: <20191219144539.GA19614@linaro.org>

On Thu, Dec 19, 2019 at 03:45:39PM +0100, Vincent Guittot wrote:
> Hi Mel,
> 
> Thanks for looking at this NUMA locality vs spreading tasks point.
> 

No problem.

> > @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  			env->migration_type = migrate_task;
> >  			lsub_positive(&nr_diff, local->sum_nr_running);
> >  			env->imbalance = nr_diff >> 1;
> > -			return;
> > +			goto out_spare;
> 
> Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too.
> 

It's the common case for NUMA machines I'm aware of and from the
perspective of allowing a slight imbalance when there are spare CPUs, I
felt it was the same whether we were considering idle CPUs or the number
of tasks running.

The prefer_sibling case applies to the children and the corner case is
that balancing NUMA domains takes into account whether the MC domain
prefers siblings which is a bit odd. I believe, but don't know, that the
reasoning may have been to spread load for memory bandwidth usage.

> So you should probably add your
> 
> > +                * Whether balancing the number of running tasks or the number
> > +                * of idle CPUs, consider allowing some degree of imbalance if
> > +                * migrating between NUMA domains.
> > +                */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       unsigned int imbalance_adj, imbalance_max;
> 
> ...
> 
> > +               }
> 
> before the prefer_sibling case :
> 
> 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> 			unsigned int nr_diff = busiest->sum_nr_running;
> 			/*
> 			 * When prefer sibling, evenly spread running tasks on
> 			 * groups.
> 			 */
> 

I don't understand. If I move SD_NUMA checks above the imbalance
calculation, how do I know whether the imbalance should be ignored?

> 
> >
> >  		}
> >  
> >  		/*
> > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		env->migration_type = migrate_task;
> >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> >  						 busiest->idle_cpus) >> 1);
> > +
> > +out_spare:
> > +		/*
> > +		 * Whether balancing the number of running tasks or the number
> > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > +		 * migrating between NUMA domains.
> > +		 */
> > +		if (env->sd->flags & SD_NUMA) {
> > +			unsigned int imbalance_adj, imbalance_max;
> > +
> > +			/*
> > +			 * imbalance_adj is the allowable degree of imbalance
> > +			 * to exist between two NUMA domains. It's calculated
> > +			 * relative to imbalance_pct with a minimum of two
> > +			 * tasks or idle CPUs.
> > +			 */
> > +			imbalance_adj = (busiest->group_weight *
> > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> > +			imbalance_adj = max(imbalance_adj, 2U);
> > +
> > +			/*
> > +			 * Ignore imbalance unless busiest sd is close to 50%
> > +			 * utilisation. At that point balancing for memory
> > +			 * bandwidth and potentially avoiding unnecessary use
> > +			 * of HT siblings is as relevant as memory locality.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> > +			    busiest->sum_nr_running < imbalance_max) {i
> 
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> 

Why? CPU affinity could have stacked multiple tasks on one CPU where
as I'm looking for a proxy hint on the amount of bandwidth required.
sum_nr_running does not give me an accurate estimate but it's better than
idle cpus.

> and you could simplify by 
> 
> 
> 	if ((env->sd->flags & SD_NUMA) &&
> 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> 			env->imbalance = 0;
> 			return;
> 	}
> 
> And otherwise it will continue with the current path
> 

I ended up doing something similar to this in v2 but it's a bit more
expanded so I can put in comments on why the comparisons are the way
they are. The multiplications are in the slow path.

> Also I'm a bit worry about using a 50% threshold that look a bit like a
> heuristic which can change depending of platform and the UCs that run of the
> system.
> 

UCs?

And yes, it's a heuristic. In this case, I'm as concerned about memory
bandwidth availability as I am about improper locality due to agressive
balancing. We do not know the available memory bandwidth and we do not
know how much bandwidth the tasks required so 50% was as good as threshold
as any. I do not know of any way that can cheaply measure either bandwidth
usage (PMUs are not cheap) or available bandwidth (theoretical bandwidth !=
actual bandwidth).

In an earlier version that I never posted, I had no cutoff at all and
NAS took a roughly 30% performance penalty across all computational
kernels. Debug tracing led me to this cutoff and running a battery
of workloads led me to believe that it was a reasonable cutoff. It's
important.

> In fact i was hoping that we could use the numa_preferred_nid ?

Unfortunately not. For some tasks, they are not long-lived enough for NUMA
balancing to make a decision. For longer-lived tasks, if load balancing is
spreading the load across nodes and wakeups are pulling tasks together,
NUMA balancing will get a mix of remote/local samples and will be unable
to pick a node properly.

In the netperf figures I put in the changelog, I pointed out that NUMA
balancing sampled roughly 50% of accesses were remote. With the patch,
100% of the samples are local.

> During the
> detach of tasks, we don't detach the task if busiest has spare capacity and
> preferred_nid of the task is busiest.
> 

Sure, but again if load balancing and waker/wakees are fighting each
other, NUMA balancing gets caught in the crossfire and cannot make a
sensible decision.

-- 
Mel Gorman
SUSE Labs

next prev parent reply	other threads:[~2019-12-19 15:18 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
2019-12-18 18:50 ` Valentin Schneider
2019-12-18 22:50   ` Mel Gorman
2019-12-19 11:56     ` Valentin Schneider
2019-12-19 10:02   ` Peter Zijlstra
2019-12-19 11:46     ` Valentin Schneider
2019-12-19 14:23       ` Valentin Schneider
2019-12-19 15:23     ` Mel Gorman
2019-12-18 18:54 ` Valentin Schneider
2019-12-19  2:58 ` Rik van Riel
2019-12-19  8:41   ` Mel Gorman
2019-12-19 10:04 ` Peter Zijlstra
2019-12-19 14:45 ` Vincent Guittot
2019-12-19 15:16   ` Valentin Schneider
2019-12-19 15:18   ` Mel Gorman [this message]
2019-12-19 15:41     ` Vincent Guittot
2019-12-19 15:58       ` Mel Gorman
2019-12-20 13:00   ` Srikar Dronamraju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191219151824.GI3178@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=Morten.Rasmussen@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=parth@linux.ibm.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quentin.perret@arm.com \
    --cc=riel@surriel.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.