From: Peter Zijlstra <peterz@infradead.org>
To: svaidy@linux.vnet.ibm.com
Cc: Suresh Siddha <suresh.b.siddha@intel.com>,
	Ingo Molnar <mingo@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Ma, Ling" <ling.ma@intel.com>,
	"Zhang, Yanmin" <yanmin_zhang@linux.intel.com>,
	ego@in.ibm.com
Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()
Date: Mon, 15 Feb 2010 14:00:43 +0100
Message-ID: <1266238843.15770.323.camel@laptop>
In-Reply-To: <20100215123538.GE8006@dirshya.in.ibm.com>

On Mon, 2010-02-15 at 18:05 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <peterz@infradead.org> [2010-02-14 11:11:58]:
> 
> > On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > > >                   continue;
> > > > > > 
> > > > > >           rq = cpu_rq(i);
> > > > > > -         wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > > -         wl /= power;
> > > > > > +         wl = weighted_cpuload(i);
> > > > > > 
> > > > > > +         /*
> > > > > > +          * When comparing with imbalance, use weighted_cpuload()
> > > > > > +          * which is not scaled with the cpu power.
> > > > > > +          */
> > > > > >           if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > > >                   continue;
> > > > > > 
> > > > > > +         /*
> > > > > > +          * For the load comparisons with the other cpu's, consider
> > > > > > +          * the weighted_cpuload() scaled with the cpu power, so that
> > > > > > +          * the load can be moved away from the cpu that is potentially
> > > > > > +          * running at a lower capacity.
> > > > > > +          */
> > > > > > +         wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > > +
> > > > > >           if (wl > max_load) {
> > > > > >                   max_load = wl;
> > > > > >                   busiest = rq;
> > > > > > 
> > > > > >
> > > 
> > > In addition to the above fix, for sched_smt_powersavings to work, the
> > > group capacity of the core (mc level) should be made 2 in
> > > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > > DIV_ROUND_UP():
> > > 
> > >         sgs->group_capacity =
> > >                 DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> > > 
> > > Ideally we can change this to DIV_ROUND_UP and let the
> > > SD_PREFER_SIBLING flag force capacity to 1.  Need to see if there are
> > > any side effects of setting SD_PREFER_SIBLING at the SIBLING level
> > > sched domain based on the sched_smt_powersavings flag.
> > 
> > OK, so while I think that Suresh' patch can make sense (haven't had time
> > to think it through), the above really sounds wrong. Things should not
> > rely on the cpu_power value like that.
> 
> Hi Peter,
> 
> The reason rounding is a problem is because threads have fractional
> cpu_power and we lose some power in DIV_ROUND_CLOSEST().  At MC level
> a group has 2*589=1178 and group_capacity will be 1 always if
> DIV_ROUND_CLOSEST() is used irrespective of the SD_PREFER_SIBLING
> flag.
> 
> We are reducing group capacity here to 1 even though we have 2 sibling
> threads in the group.  In the sched_smt_powersavings>0 case, the
> group_capacity should be 2 to allow task consolidation to this group
> while leaving other groups completely idle.
> 
> DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
> capacity is rounded up and counted.  
> 
> While, if SD_PREFER_SIBLING is set,
> 
> update_sd_lb_stats():
>         if (prefer_sibling)
> 		sgs.group_capacity = min(sgs.group_capacity, 1UL);
> 
> will ensure the group_capacity is 1 and allows spreading of tasks.                

We should be weakening this link between cpu_power and capacity, not
strengthening it. What I think you want is to use
cpumask_weight(sched_group_cpus(group)) or something as capacity.
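
A minimal sketch of that idea, outside the kernel: capacity is taken from
the number of cpus in the group and never looks at cpu_power. The
group_mask_t type and both helper names are hypothetical stand-ins (a
plain bitmask instead of a struct cpumask, a bit count in the role that
cpumask_weight() plays), not the kernel's API.

```c
/* Hypothetical stand-in for a group's cpumask: one bit per cpu. */
typedef unsigned long group_mask_t;

/* Count the set bits; this plays the role of cpumask_weight(). */
static unsigned long mask_weight(group_mask_t mask)
{
	unsigned long n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Capacity derived from the cpu count only, independent of cpu_power. */
static unsigned long group_capacity_from_cpus(group_mask_t mask)
{
	return mask_weight(mask);
}
```

With two sibling threads in the group (mask 0x3) this yields a capacity
of 2 no matter what cpu_power the threads carry.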

The setup for cpu_power is that it can reflect the actual capacity for
work. Especially with today's asymmetric cpus, where a socket can run at
a different frequency, we need to make sure this is so.

So no, that DIV_ROUND_UP is utterly broken, as there might be many ways
for cpu_power of multiple threads/cpus to be less than the number of
cpus.
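
To make the arithmetic concrete, here is a small standalone sketch. The
2*589=1178 figure is the one quoted above; the 500 and 400 per-cpu
values are made-up illustrations of reduced cpu_power, and the two
macros are simplified unsigned-only versions of the kernel ones.

```c
#define SCHED_LOAD_SCALE	1024UL

/* Simplified, unsigned-only forms of the kernel rounding macros. */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define DIV_ROUND_CLOSEST(n, d)	(((n) + (d) / 2) / (d))

/* Capacity as computed with DIV_ROUND_CLOSEST(). */
static unsigned long capacity_closest(unsigned long group_power)
{
	return DIV_ROUND_CLOSEST(group_power, SCHED_LOAD_SCALE);
}

/* Capacity as it would be computed with the proposed DIV_ROUND_UP(). */
static unsigned long capacity_up(unsigned long group_power)
{
	return DIV_ROUND_UP(group_power, SCHED_LOAD_SCALE);
}
```

For 2*589=1178, capacity_closest() gives 1 and capacity_up() gives 2,
which is the case being argued for. But two cpus at a hypothetical 500
each sum to 1000, and capacity_up(1000) is still 1; four cpus at 400
each sum to 1600, and capacity_up(1600) is 2, not 4. Rounding up only
matches the cpu count for some values of cpu_power.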

Furthermore, for powersavings it makes sense to make the capacity a
function of an overload argument/tunable, so that you can specify the
threshold of packing.
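
One way such a tunable could look, purely as an illustration: the
function name, the percentage parameter and its semantics are all
assumptions made up for this sketch, nothing like it exists in the code
under discussion.

```c
/*
 * Hypothetical packing threshold: overload_pct is the percentage of the
 * group's cpus that may be filled with tasks before load is spread to
 * other groups. 100 means one task per cpu, more than 100 allows
 * over-packing, less than 100 spreads earlier.
 */
static unsigned long packing_capacity(unsigned long nr_cpus,
				      unsigned long overload_pct)
{
	unsigned long cap;

	if (!nr_cpus)
		return 0;

	cap = nr_cpus * overload_pct / 100;

	/* A non-empty group can always take at least one task. */
	return cap ? cap : 1;
}
```

With two siblings, a setting of 100 gives a capacity of 2 (pack both
threads) and a setting of 50 gives 1 (spread), without touching
cpu_power at all.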

So really, cpu_power is a normalization factor to equally distribute
load across cpus that have asymmetric work capacity. If you need any
placement constraints outside of that, do _NOT_ touch cpu_power.
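
As a standalone illustration of the normalization, using the same
scaling as the quoted patch (wl * SCHED_LOAD_SCALE / power): the helper
name is hypothetical, and power is assumed to be non-zero.

```c
#define SCHED_LOAD_SCALE	1024UL

/*
 * Scale a raw weighted load by the cpu's power so that loads on cpus
 * with different work capacity become comparable. Assumes power != 0.
 */
static unsigned long scaled_load(unsigned long load, unsigned long power)
{
	return load * SCHED_LOAD_SCALE / power;
}
```

A raw load of 1024 on a cpu with power 1024 stays 1024, while the same
raw load on a thread with power 589 scales to 1780, so the weaker cpu
looks busier and load gets moved away from it first.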

