Re: [rfc] forked kernel task and mm structures imbalanced on NUMA

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nick Piggin <npiggin@suse.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Lee Schermerhorn <lee.schermerhorn@hp.com>
Subject: Re: [rfc] forked kernel task and mm structures imbalanced on NUMA
Date: Tue, 1 Jun 2010 18:41:57 +1000	[thread overview]
Message-ID: <20100601084157.GS9453@laptop> (raw)
In-Reply-To: <1275380202.27810.26214.camel@twins>

On Tue, Jun 01, 2010 at 10:16:42AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-06-01 at 17:33 +1000, Nick Piggin wrote:
> > Another problem I found when testing this patch is that the scheduler
> > has some issues of its own when balancing. This is improved by
> > traversing the sd groups starting from a different spot each time, so
> > processes get sprinkled around the nodes a bit better.
> 
> Right, makes sense. And I think we could merge that group iteration
> without much problems.
> 
> Your alternative placement for sched_exec() seems to make sense too, the
> earlier we do that the better the memory allocations will be.
> 
> Your changes to sched_fork() and wake_up_new_task() made my head hurt a
> bit -- but that's not your fault. I'm not quite sure why you're changing
> that though.

Well because we indeed don't want to select a new CPU for it unless
there has been a cpus_allowed race in the meantime.

 
> The addition of sched_fork_suggest_cpu() to select a target node seems
> to make sense, but since you then call fork balancing a second time we
> have a chance of ending up on a totally different node all together.
> 
> So I think it would make sense to rework the fork balancing muck to be
> called only once and stick with its decision.

Just need to close that race somehow. AFAIKS we can't use TASK_WAKING
because that must not be preempted?

Otherwise I don't see a problem with just taking another balance on
the extremely rare case of task failure.


 
> One thing that would make the whole fork path much easier is fully
> ripping out that child_runs_first mess for CONFIG_SMP, I think its been
> disabled by default for long enough, and its always been broken in the
> face of fork balancing anyway.

Interesting problem. vfork is nice for fork+exec, but it's a bit
restrictive.

 
> So basically we have to move fork balancing back to sched_fork(), I'd
> have to again look at wth happens to ->cpus_allowed, but I guess it
> should be fixable, and I don't think we should care overly much about
> cpu-hotplug.

No more than simply getting it right. Simply calling into the balancer
again seems to be the simplest way to do it.

 
> A specific code comment:
> 
> > @@ -2550,14 +2561,16 @@ void wake_up_new_task(struct task_struct
> >          * We set TASK_WAKING so that select_task_rq() can drop rq->lock
> >          * without people poking at ->cpus_allowed.
> >          */
> > -       cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> > -       set_task_cpu(p, cpu);
> > -
> > -       p->state = TASK_RUNNING;
> > -       task_rq_unlock(rq, &flags);
> > +       if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) {
> > +               p->state = TASK_WAKING;
> > +               cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> > +               set_task_cpu(p, cpu);
> > +               p->state = TASK_RUNNING;
> > +               task_rq_unlock(rq, &flags);
> > +               rq = task_rq_lock(p, &flags);
> > +       }
> >  #endif
> 
> That's iffy because p->cpus_allowed isn't stable when we're not holding
> the task's current rq->lock or p->state is not TASK_WAKING.
> 

Oop, yeah missed that. Half hearted attempt to avoid more rq locks.

WARNING: multiple messages have this Message-ID (diff)

From: Nick Piggin <npiggin@suse.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Lee Schermerhorn <lee.schermerhorn@hp.com>
Subject: Re: [rfc] forked kernel task and mm structures imbalanced on NUMA
Date: Tue, 1 Jun 2010 18:41:57 +1000	[thread overview]
Message-ID: <20100601084157.GS9453@laptop> (raw)
In-Reply-To: <1275380202.27810.26214.camel@twins>

On Tue, Jun 01, 2010 at 10:16:42AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-06-01 at 17:33 +1000, Nick Piggin wrote:
> > Another problem I found when testing this patch is that the scheduler
> > has some issues of its own when balancing. This is improved by
> > traversing the sd groups starting from a different spot each time, so
> > processes get sprinkled around the nodes a bit better.
> 
> Right, makes sense. And I think we could merge that group iteration
> without much problems.
> 
> Your alternative placement for sched_exec() seems to make sense too, the
> earlier we do that the better the memory allocations will be.
> 
> Your changes to sched_fork() and wake_up_new_task() made my head hurt a
> bit -- but that's not your fault. I'm not quite sure why you're changing
> that though.

Well because we indeed don't want to select a new CPU for it unless
there has been a cpus_allowed race in the meantime.

 
> The addition of sched_fork_suggest_cpu() to select a target node seems
> to make sense, but since you then call fork balancing a second time we
> have a chance of ending up on a totally different node all together.
> 
> So I think it would make sense to rework the fork balancing muck to be
> called only once and stick with its decision.

Just need to close that race somehow. AFAIKS we can't use TASK_WAKING
because that must not be preempted?

Otherwise I don't see a problem with just taking another balance on
the extremely rare case of task failure.


 
> One thing that would make the whole fork path much easier is fully
> ripping out that child_runs_first mess for CONFIG_SMP, I think its been
> disabled by default for long enough, and its always been broken in the
> face of fork balancing anyway.

Interesting problem. vfork is nice for fork+exec, but it's a bit
restrictive.

 
> So basically we have to move fork balancing back to sched_fork(), I'd
> have to again look at wth happens to ->cpus_allowed, but I guess it
> should be fixable, and I don't think we should care overly much about
> cpu-hotplug.

No more than simply getting it right. Simply calling into the balancer
again seems to be the simplest way to do it.

 
> A specific code comment:
> 
> > @@ -2550,14 +2561,16 @@ void wake_up_new_task(struct task_struct
> >          * We set TASK_WAKING so that select_task_rq() can drop rq->lock
> >          * without people poking at ->cpus_allowed.
> >          */
> > -       cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> > -       set_task_cpu(p, cpu);
> > -
> > -       p->state = TASK_RUNNING;
> > -       task_rq_unlock(rq, &flags);
> > +       if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) {
> > +               p->state = TASK_WAKING;
> > +               cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> > +               set_task_cpu(p, cpu);
> > +               p->state = TASK_RUNNING;
> > +               task_rq_unlock(rq, &flags);
> > +               rq = task_rq_lock(p, &flags);
> > +       }
> >  #endif
> 
> That's iffy because p->cpus_allowed isn't stable when we're not holding
> the task's current rq->lock or p->state is not TASK_WAKING.
> 

Oop, yeah missed that. Half hearted attempt to avoid more rq locks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2010-06-01  8:42 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-01  7:33 [rfc] forked kernel task and mm structures imbalanced on NUMA Nick Piggin
2010-06-01  7:33 ` Nick Piggin
2010-06-01  8:16 ` Peter Zijlstra
2010-06-01  8:16   ` Peter Zijlstra
2010-06-01  8:41   ` Nick Piggin [this message]
2010-06-01  8:41     ` Nick Piggin
2010-06-01  9:05     ` Peter Zijlstra
2010-06-01  9:05       ` Peter Zijlstra
2010-06-01  8:49   ` Peter Zijlstra
2010-06-01  8:49     ` Peter Zijlstra
2010-06-01 15:48 ` Andi Kleen
2010-06-01 15:48   ` Andi Kleen
2010-06-01 15:59   ` Nick Piggin
2010-06-01 15:59     ` Nick Piggin
2010-06-01 16:20     ` Andi Kleen
2010-06-01 16:20       ` Andi Kleen
2010-06-01 16:31       ` Nick Piggin
2010-06-01 16:31         ` Nick Piggin
2010-06-01 16:00 ` Peter Zijlstra
2010-06-01 16:00   ` Peter Zijlstra
2010-06-01 16:02   ` Peter Zijlstra
2010-06-01 16:02     ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100601084157.GS9453@laptop \
    --to=npiggin@suse.de \
    --cc=lee.schermerhorn@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.