From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1322559859.2921.190.camel@twins>
Subject: Re: [patch 3/6] sched, nohz: sched group, domain aware nohz idle load balancing
From: Peter Zijlstra
To: Suresh Siddha
Cc: Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri, Mike Galbraith,
 linux-kernel, Tim Chen, "Shi, Alex"
Date: Tue, 29 Nov 2011 10:44:19 +0100
In-Reply-To: <1322524316.21329.64.camel@sbsiddha-desk.sc.intel.com>
References: <20111118230323.592022417@sbsiddha-desk.sc.intel.com>
 <20111118230553.995756330@sbsiddha-desk.sc.intel.com>
 <1322135263.2921.12.camel@twins>
 <1322524316.21329.64.camel@sbsiddha-desk.sc.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2011-11-28 at 15:51 -0800, Suresh Siddha wrote:
> On Thu, 2011-11-24 at 03:47 -0800, Peter Zijlstra wrote:
> > On Fri, 2011-11-18 at 15:03 -0800, Suresh Siddha wrote:
> > > +	for_each_domain(cpu, sd) {
> > > +		struct sched_group *sg = sd->groups;
> > > +		struct sched_group_power *sgp = sg->sgp;
> > > +		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
> > > +
> > > +		if (nr_busy > 1 && (nr_busy * SCHED_LOAD_SCALE > sgp->power))
> > > +			goto need_kick;
> >
> > This looks wrong, it's basically always true for a box with HT.
>
> In the presence of two busy HT siblings, we need to do the idle load
> balance to figure out if the load from the busy core can be migrated to
> any other idle core/sibling in the platform.
> And at this point, we
> already know there are idle cpus in the platform.

Might have to; this nr_busy doesn't mean it's actually busy, just that
it's not nohz. It might very well be idle.

> But you are right, using group power like the above is not right. For
> example, in the case of two sockets with each socket having a dual core
> with no HT, if one socket is completely busy and the other completely
> idle, we would like to identify this. But the group power of that
> socket will be 2 * SCHED_POWER_SCALE.

Right, minus whatever time was taken by interrupts and RT tasks.

> In the older kernels, for the domains which were sharing package
> resources, we were setting the group power to SCHED_POWER_SCALE for the
> default performance mode. And I had that old code in mind while doing
> the above check.
>
> I will modify the above check to:
>
> 	if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> 		goto need_kick;
>
> This way, if there is an SMT/MC domain with more than one busy cpu in
> the group, then we will request the idle load balancing.

Potentially more than 1 busy, right? And we do the balancing just in
case there are indeed busy cpus. I think it's useful to mention
somewhere near that this nr_busy measure we use is an upper bound on
the actual busy count.

> Current mainline code kicks the idle load balancer if there are two
> busy cpus in the system. The above mentioned modification makes this
> decision somewhat better. For example, two busy cpus in two different
> sockets, or two busy cpus in a dual-core single-socket system, will
> never kick the idle load balancer (as there is no need).

Except in power balance mode (although that's probably busted anyway),
where we want to aggregate the two tasks over two sockets onto one
socket.

> In future we can add more heuristics to kick the idle load balancer
> only when it is really necessary (for example when there is a real
> imbalance between the highest and lowest loaded groups, etc.).
> The only catch is to
> identify those scenarios without adding much penalty to the busy cpu
> which is identifying the imbalance and kicking the idle load balancer.
> The above proposed approach is the simplest one that tries to do
> better than the current logic we have in the kernel now.

Fair enough, and yeah, each patch only needs to do better and
eventually we'll get somewhere ;-)

> Any more thoughts on making the kick decisions (for doing idle load
> balancing) more robust are welcome.

Ha, I'll certainly share them if I have any. I'd still love to split
this up somehow, but I've yet to figure out how to do so without
observing the whole machine.