From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 68142C43603 for ; Thu, 19 Dec 2019 15:18:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3082024672 for ; Thu, 19 Dec 2019 15:18:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726908AbfLSPSb (ORCPT ); Thu, 19 Dec 2019 10:18:31 -0500 Received: from outbound-smtp08.blacknight.com ([46.22.139.13]:47700 "EHLO outbound-smtp08.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726712AbfLSPSa (ORCPT ); Thu, 19 Dec 2019 10:18:30 -0500 Received: from mail.blacknight.com (pemlinmail05.blacknight.ie [81.17.254.26]) by outbound-smtp08.blacknight.com (Postfix) with ESMTPS id 363821C3623 for ; Thu, 19 Dec 2019 15:18:27 +0000 (GMT) Received: (qmail 30271 invoked from network); 19 Dec 2019 15:18:27 -0000 Received: from unknown (HELO techsingularity.net) (mgorman@techsingularity.net@[84.203.18.57]) by 81.17.254.9 with ESMTPSA (AES256-SHA encrypted, authenticated); 19 Dec 2019 15:18:26 -0000 Date: Thu, 19 Dec 2019 15:18:24 +0000 From: Mel Gorman To: Vincent Guittot Cc: Ingo Molnar , Peter Zijlstra , pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, LKML Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Message-ID: <20191219151824.GI3178@techsingularity.net> References: <20191218154402.GF3178@techsingularity.net> <20191219144539.GA19614@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20191219144539.GA19614@linaro.org> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 19, 2019 at 03:45:39PM +0100, Vincent Guittot wrote: > Hi Mel, > > Thanks for looking at this NUMA locality vs spreading tasks point. > No problem. > > @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > env->migration_type = migrate_task; > > lsub_positive(&nr_diff, local->sum_nr_running); > > env->imbalance = nr_diff >> 1; > > - return; > > + goto out_spare; > > Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too. > It's the common case for NUMA machines I'm aware of and from the perspective of allowing a slight imbalance when there are spare CPUs, I felt it was the same whether we were considering idle CPUs or the number of tasks running. The prefer_sibling case applies to the children and the corner case is that balancing NUMA domains takes into account whether the MC domain prefers siblings which is a bit odd. I believe, but don't know, that the reasoning may have been to spread load for memory bandwidth usage. > So you should probably add your > > > +                * Whether balancing the number of running tasks or the number > > +                * of idle CPUs, consider allowing some degree of imbalance if > > +                * migrating between NUMA domains. > > +                */ > > +               if (env->sd->flags & SD_NUMA) { > > +                       unsigned int imbalance_adj, imbalance_max; > > ... > > > +               } > > before the prefer_sibling case : > > if (busiest->group_weight == 1 || sds->prefer_sibling) { > unsigned int nr_diff = busiest->sum_nr_running; > /* > * When prefer sibling, evenly spread running tasks on > * groups. > */ > I don't understand. If I move SD_NUMA checks above the imbalance calculation, how do I know whether the imbalance should be ignored? > > > > > } > > > > /* > > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > env->migration_type = migrate_task; > > env->imbalance = max_t(long, 0, (local->idle_cpus - > > busiest->idle_cpus) >> 1); > > + > > +out_spare: > > + /* > > + * Whether balancing the number of running tasks or the number > > + * of idle CPUs, consider allowing some degree of imbalance if > > + * migrating between NUMA domains. > > + */ > > + if (env->sd->flags & SD_NUMA) { > > + unsigned int imbalance_adj, imbalance_max; > > + > > + /* > > + * imbalance_adj is the allowable degree of imbalance > > + * to exist between two NUMA domains. It's calculated > > + * relative to imbalance_pct with a minimum of two > > + * tasks or idle CPUs. > > + */ > > + imbalance_adj = (busiest->group_weight * > > + (env->sd->imbalance_pct - 100) / 100) >> 1; > > + imbalance_adj = max(imbalance_adj, 2U); > > + > > + /* > > + * Ignore imbalance unless busiest sd is close to 50% > > + * utilisation. At that point balancing for memory > > + * bandwidth and potentially avoiding unnecessary use > > + * of HT siblings is as relevant as memory locality. > > + */ > > + imbalance_max = (busiest->group_weight >> 1) - imbalance_adj; > > + if (env->imbalance <= imbalance_adj && > > + busiest->sum_nr_running < imbalance_max) {i > > Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ? > Why? CPU affinity could have stacked multiple tasks on one CPU where as I'm looking for a proxy hint on the amount of bandwidth required. sum_nr_running does not give me an accurate estimate but it's better than idle cpus. > and you could simplify by  > > > if ((env->sd->flags & SD_NUMA) && > ((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) { > env->imbalance = 0; > return; > } > > And otherwise it will continue with the current path > I ended up doing something similar to this in v2 but it's a bit more expanded so I can put in comments on why the comparisons are the way they are. The multiplications are in the slow path. > Also I'm a bit worry about using a 50% threshold that look a bit like a > heuristic which can change depending of platform and the UCs that run of the > system. > UCs? And yes, it's a heuristic. In this case, I'm as concerned about memory bandwidth availability as I am about improper locality due to agressive balancing. We do not know the available memory bandwidth and we do not know how much bandwidth the tasks required so 50% was as good as threshold as any. I do not know of any way that can cheaply measure either bandwidth usage (PMUs are not cheap) or available bandwidth (theoretical bandwidth != actual bandwidth). In an earlier version that I never posted, I had no cutoff at all and NAS took a roughly 30% performance penalty across all computational kernels. Debug tracing led me to this cutoff and running a battery of workloads led me to believe that it was a reasonable cutoff. It's important. > In fact i was hoping that we could use the numa_preferred_nid ? Unfortunately not. For some tasks, they are not long-lived enough for NUMA balancing to make a decision. For longer-lived tasks, if load balancing is spreading the load across nodes and wakeups are pulling tasks together, NUMA balancing will get a mix of remote/local samples and will be unable to pick a node properly. In the netperf figures I put in the changelog, I pointed out that NUMA balancing sampled roughly 50% of accesses were remote. With the patch, 100% of the samples are local. > During the > detach of tasks, we don't detach the task if busiest has spare capacity and > preferred_nid of the task is busiest. > Sure, but again if load balancing and waker/wakees are fighting each other, NUMA balancing gets caught in the crossfire and cannot make a sensible decision. -- Mel Gorman SUSE Labs