* sched_domains SD_BALANCE_FORK and sched_balance_self
@ 2005-08-05 23:29 Darren Hart
2005-08-09 22:03 ` Siddha, Suresh B
0 siblings, 1 reply; 5+ messages in thread
From: Darren Hart @ 2005-08-05 23:29 UTC (permalink / raw)
To: lkml; +Cc: Piggin, Nick, Bligh, Martin, Dobson, Matt
First off, apologies for not reviewing this code at 2.6.12-mm2, I was
tied up with other things. I have some concerns as to the intent vs.
actual implementation of SD_BALANCE_FORK and the sched_balance_fork()
routine.
ARCHS=i386,x86_64,ia64
First, IIRC, SD_NODE_INIT initializes the sched_domain that contains all
the cpus in the system, grouped by node (with the exception of ia64 and
the allnodes domain). Correct?
SD_NODE_INIT for $ARCHS contains SD_BALANCE_FORK, and no other SD_*_INIT
routines do. This seems strange to me as it would seem more appropriate
to balance within a node on fork so as not to have to access the duplicated
mm across nodes. If we are going to use SD_BALANCE_FORK, wouldn't it
make sense to push it down the sched_domain hierarchy to the SD_CPU_INIT
level?
And now to sched_balance_self(). As I understand it the purpose of this
method is to choose the "best" cpu to start a task on. It seems to me
that the best CPU for a forked process would be an idle CPU on the same
node as the parent in order to stay close to its memory.  Failing this,
we may need to move to other nodes if they are idle enough to warrant
the move across node boundaries. Thoughts?
For the comments below, flag = SD_BALANCE_FORK.
static int sched_balance_self(int cpu, int flag)
{
struct task_struct *t = current;
struct sched_domain *tmp, *sd = NULL;
for_each_domain(cpu, tmp)
if (tmp->flags & flag)
sd = tmp;
This jumps to the highest level domain that supports SD_BALANCE_FORK
which are the domains created with SD_NODE_INIT, so we look at all the
CPUs on the system first rather than those local to the parent's node.
while (sd) {
cpumask_t span;
struct sched_group *group;
int new_cpu;
int weight;
span = sd->span;
group = find_idlest_group(sd, t, cpu);
if (!group)
goto nextlevel;
new_cpu = find_idlest_cpu(group, cpu);
if (new_cpu == -1 || new_cpu == cpu)
goto nextlevel;
/* Now try balancing at a lower domain level */
cpu = new_cpu;
nextlevel:
sd = NULL;
weight = cpus_weight(span);
for_each_domain(cpu, tmp) {
if (weight <= cpus_weight(tmp->span))
break;
if (tmp->flags & flag)
sd = tmp;
}
If I am reading it right, this for_each_domain will exit immediately if
jumped to via nextlevel and will only do any work if a new cpu is found
to run on (which is fair since there is no need to keep looking if the
whole system doesn't have a better place for us to go). If a new cpu
_is_ assigned though, for_each_domain will start with the lowest level
domain - which always has the smallest cpus_weight doesn't it? If so,
won't the (weight <= cpu...) condition always equate to true, ending the
loop at the first domain? If so, then that last loop doesn't do
anything at all, ever. If I am misreading this fragment, could someone
please correct my thinking?
/* while loop will break here if sd == NULL */
}
return cpu;
}
So it seems to me that we should first look at the cpus on the local
domain for SD_BALANCE_FORK. SD_BALANCE_EXEC tasks however have a
minimal mm and could probably be put on whichever cpu/group is the most
idle, regardless of node. Thoughts?
Thanks,
--
Darren Hart
IBM Linux Technology Center
Linux Kernel Team
Phone: 503 578 3185
T/L: 775 3185
* Re: sched_domains SD_BALANCE_FORK and sched_balance_self
2005-08-05 23:29 sched_domains SD_BALANCE_FORK and sched_balance_self Darren Hart
@ 2005-08-09 22:03 ` Siddha, Suresh B
2005-08-09 22:19 ` Martin J. Bligh
From: Siddha, Suresh B @ 2005-08-09 22:03 UTC (permalink / raw)
To: Darren Hart
Cc: lkml, Piggin, Nick, Bligh, Martin, Dobson, Matt, nickpiggin,
mingo
On Fri, Aug 05, 2005 at 04:29:45PM -0700, Darren Hart wrote:
> I have some concerns as to the intent vs. actual implementation of
> SD_BALANCE_FORK and the sched_balance_fork() routine.
Intent and implementation match. Problem is with the intent ;-)
This has the intent info.
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=147cbb4bbe991452698f0772d8292f22825710ba
To solve these issues, we need to make the sched domain and its parameters
CMP aware, and we need to adjust these parameters dynamically based
on the system properties.
> SD_NODE_INIT for $ARCHS contains SD_BALANCE_FORK, and no other SD_*_INIT
> routines do. This seems strange to me as it would seem more appropriate
> to balance within a node on fork so as not to have to access the duplicated
> mm across nodes. If we are going to use SD_BALANCE_FORK, wouldn't it
> make sense to push it down the sched_domain hierarchy to the SD_CPU_INIT
> level?
Ideally SD_BALANCE_FORK needs to be set for the domains starting from the
lowest domain to the SMP domain.
> It seems to me that the best CPU for a forked process would be an idle
> CPU on the same node as the parent in order to stay close to its memory.
> Failing this, we may need to move to other nodes if they are idle enough
> to warrant the move across node boundaries. Thoughts?
We can choose the least loaded CPU in the home node and we can let the
load balancer move it to other nodes if there is an imbalance.
For exec, we can have the SD_BALANCE_EXEC for all the sched domains, which
is the case today.
> while (sd) {
> cpumask_t span;
> struct sched_group *group;
> int new_cpu;
> int weight;
>
> span = sd->span;
> group = find_idlest_group(sd, t, cpu);
> if (!group)
> goto nextlevel;
>
> new_cpu = find_idlest_cpu(group, cpu);
> if (new_cpu == -1 || new_cpu == cpu)
> goto nextlevel;
>
> /* Now try balancing at a lower domain level */
> cpu = new_cpu;
> nextlevel:
> sd = NULL;
> weight = cpus_weight(span);
> for_each_domain(cpu, tmp) {
> if (weight <= cpus_weight(tmp->span))
> break;
> if (tmp->flags & flag)
> sd = tmp;
> }
>
> If I am reading it right, this for_each_domain will exit immediately if
> jumped to via nextlevel and will only do any work if a new cpu is found
> to run on (which is fair since there is no need to keep looking if the
> whole system doesn't have a better place for us to go). If a new cpu
> _is_ assigned though, for_each_domain will start with the lowest level
> domain - which always has the smallest cpus_weight doesn't it? If so,
> won't the (weight <= cpu...) condition always equate to true, ending the
No. The last loop will take you to the domain which has the flag and is
immediately below the parent domain from where we started.
thanks,
suresh
* Re: sched_domains SD_BALANCE_FORK and sched_balance_self
2005-08-09 22:03 ` Siddha, Suresh B
@ 2005-08-09 22:19 ` Martin J. Bligh
2005-08-10 0:40 ` Siddha, Suresh B
From: Martin J. Bligh @ 2005-08-09 22:19 UTC (permalink / raw)
To: Siddha, Suresh B, Darren Hart
Cc: lkml, Piggin, Nick, Dobson, Matt, nickpiggin, mingo
--On Tuesday, August 09, 2005 15:03:32 -0700 "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
> On Fri, Aug 05, 2005 at 04:29:45PM -0700, Darren Hart wrote:
>> I have some concerns as to the intent vs. actual implementation of
>> SD_BALANCE_FORK and the sched_balance_fork() routine.
>
> Intent and implementation match. Problem is with the intent ;-)
>
> This has the intent info.
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=147cbb4bbe991452698f0772d8292f22825710ba
>
> To solve these issues, we need to make the sched domain and its parameters
> CMP aware, and we need to adjust these parameters dynamically based
> on the system properties.
Can you explain the purpose of doing balance on both fork and exec?
The reason we did it at exec time is that it's much cheaper to do
than at fork - you have very, very little state to deal with. The vast
majority of things that fork will exec immediately thereafter.
Balance on clone makes some sort of sense, since you know they're not
going to exec afterwards. We've thrashed through this many times before
and decided that unless there was an explicit hint from userspace,
balance on fork was not a good thing to do in the general case. Not only
based on a large range of testing, but also on previous experience from other
Unixes. What new data came forth to change this?
>> It seems to me that the best CPU for a forked process would be an idle
>> CPU on the same node as the parent in order to stay close to its memory.
>> Failing this, we may need to move to other nodes if they are idle enough
>> to warrant the move across node boundaries. Thoughts?
>
> We can choose the least loaded CPU in the home node and we can let the
> load balancer move it to other nodes if there is an imbalance.
Is that what it's actually doing now? That's not what Nick told me at
Kernel Summit, but is the correct thing to do for clone, I think.
> For exec, we can have the SD_BALANCE_EXEC for all the sched domains, which
> is the case today.
Yup.
M.
* Re: sched_domains SD_BALANCE_FORK and sched_balance_self
2005-08-09 22:19 ` Martin J. Bligh
@ 2005-08-10 0:40 ` Siddha, Suresh B
2005-08-10 0:43 ` Nick Piggin
From: Siddha, Suresh B @ 2005-08-10 0:40 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Siddha, Suresh B, Darren Hart, lkml, Piggin, Nick, Dobson, Matt,
nickpiggin, mingo
On Tue, Aug 09, 2005 at 03:19:58PM -0700, Martin J. Bligh wrote:
> --On Tuesday, August 09, 2005 15:03:32 -0700 "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
>
> > On Fri, Aug 05, 2005 at 04:29:45PM -0700, Darren Hart wrote:
> >> I have some concerns as to the intent vs. actual implementation of
> >> SD_BALANCE_FORK and the sched_balance_fork() routine.
> >
> > Intent and implementation match. Problem is with the intent ;-)
> >
> > This has the intent info.
> >
> > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=147cbb4bbe991452698f0772d8292f22825710ba
> >
> > To solve these issues, we need to make the sched domain and its parameters
> > CMP aware, and we need to adjust these parameters dynamically based
> > on the system properties.
>
> Can you explain the purpose of doing balance on both fork and exec?
> The reason we did it at exec time is that it's much cheaper to do
> than at fork - you have very, very little state to deal with. The vast
> majority of things that fork will exec immediately thereafter.
>
> Balance on clone makes some sort of sense, since you know they're not
> going to exec afterwards. We've thrashed through this many times before
> and decided that unless there was an explicit hint from userspace,
> balance on fork was not a good thing to do in the general case. Not only
> based on a large range of testing, but also on previous experience from other
> Unixes. What new data came forth to change this?
I agree with you. I will let Nick (the author) have a take at this.
> > We can choose the least loaded CPU in the home node and we can let the
> > load balancer move it to other nodes if there is an imbalance.
>
> Is that what it's actually doing now? That's not what Nick told me at
> Kernel Summit, but is the correct thing to do for clone, I think.
We don't do it today. But I would like to see that.
thanks,
suresh
* Re: sched_domains SD_BALANCE_FORK and sched_balance_self
2005-08-10 0:40 ` Siddha, Suresh B
@ 2005-08-10 0:43 ` Nick Piggin
From: Nick Piggin @ 2005-08-10 0:43 UTC (permalink / raw)
To: Siddha, Suresh B
Cc: Martin J. Bligh, Darren Hart, lkml, Piggin, Nick, Dobson, Matt,
mingo
Siddha, Suresh B wrote:
>On Tue, Aug 09, 2005 at 03:19:58PM -0700, Martin J. Bligh wrote:
>
>>--On Tuesday, August 09, 2005 15:03:32 -0700 "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
>>
>>
>>Balance on clone makes some sort of sense, since you know they're not
>>going to exec afterwards. We've thrashed through this many times before
>>and decided that unless there was an explicit hint from userspace,
>>balance on fork was not a good thing to do in the general case. Not only
>>based on a large range of testing, but also on previous experience from other
>>Unixes. What new data came forth to change this?
>>
>
>I agree with you. I will let Nick (the author) have a take at this.
>
>
Sorry I've taken a while with this. Darren, I'll reply to you soon.