Re: [RFC] sched: unused cpu in affine workload

From: Ingo Molnar <mingo@kernel.org>
To: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	James Hartsock <hartsjc@redhat.com>,
	Rik van Riel <riel@redhat.com>,
	Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
	Kirill Tkhai <ktkhai@parallels.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC] sched: unused cpu in affine workload
Date: Mon, 4 Apr 2016 11:19:51 +0200
Message-ID: <20160404091951.GA10360@gmail.com>
In-Reply-To: <20160404085944.GA3030@gmail.com>


* Ingo Molnar <mingo@kernel.org> wrote:

>  - if you want to come up with a 'complete' solution then please don't put it into
>    hot paths such as wakeup or context switching, or any of the hardirq methods,
>    but try to integrate it with the NUMA scheduling slow path.
> 
> The NUMA balancing slow path: that is softirq driven and reasonably low freq to 
> not cause many performance problems.
> 
> The two problems (NUMA affinity and user affinity) are also loosely related on a
> conceptual level: the NUMA affinity optimization problem can be considered as a 
> workload determined, arbitrary 'NUMA mask' being optimized from first 
> principles.
> 
> There's one ABI detail: this is true only as long as SMP affinity masks follow 
> node boundaries - the current NUMA balancing code is very much node granular, so 
> the two can only be merged if the ->cpus_allowed mask follows node boundaries as 
> well.
> 
> A third approach would be to extend the NUMA balancing code to be CPU granular 
> (without changing any task placement behavior of the current NUMA balancing code
> of course), with node granular being a special case. This would fit the cgroups 
> (and virtualization) usecases, but that would be a major change.

So my thinking here is: if the NUMA balancing code (which is node granular at the 
moment and uses node masks, etc.) is extended to be CPU granular (which is a big 
task in itself), then the two problems can be 'unified':

  - the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the
    MM code into the scheduler.

  - the scheduler syscall ABI (and other configuration sources) inputs arbitrary 
    CPU affinity masks into the scheduler.
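
For reference, the second input path can be exercised from user space via
sched_setaffinity(2). The following is only a minimal sketch, assuming
Linux with glibc; the helper name pin_to_first_allowed_cpu() is made up for
illustration and is not part of any patch in this thread. It picks a CPU
out of the caller's current mask (rather than hard-coding CPU 0, which may
not be allowed inside a cpuset), installs a single-CPU mask and reads it
back:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for sched_setaffinity() and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: restrict the calling thread to one CPU taken from
 * its current affinity mask, then read the mask back to confirm it.
 * Returns the chosen CPU number, or -1 on error.
 */
static int pin_to_first_allowed_cpu(void)
{
	cpu_set_t cur, want, got;
	int cpu;

	if (sched_getaffinity(0, sizeof(cur), &cur) != 0)
		return -1;

	/* Find the lowest-numbered CPU we are currently allowed to use. */
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &cur))
			break;
	if (cpu == CPU_SETSIZE)
		return -1;

	/* Feed the new (hard) affinity mask into the scheduler. */
	CPU_ZERO(&want);
	CPU_SET(cpu, &want);
	if (sched_setaffinity(0, sizeof(want), &want) != 0)
		return -1;

	/* Read it back: the mask must now contain exactly that CPU. */
	if (sched_getaffinity(0, sizeof(got), &got) != 0)
		return -1;
	assert(CPU_COUNT(&got) == 1 && CPU_ISSET(cpu, &got));

	return cpu;
}
```

A pid of 0 means "the calling thread" for both calls. The mask set this
way is what ends up as ->cpus_allowed on the kernel side.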

It's a similar problem, with two (minor looking) complications:

 - the NUMA code right now is 'statistical', while ->cpus_allowed are hard 
   constraints that must never be violated. So there always has to be a final 
   layer to implement the hard constraint - which does not exist in the NUMA 
   balancing case. This should be relatively easy I think as we already do it
   with the regular balancer.

 - the balancing slowpath would have to be activated on non-NUMA systems as well, 
   so that it can handle ->cpus_allowed balancing.
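
The "final layer" from the first point above can be pictured with a toy
model. This is a sketch under stated assumptions, not kernel code: the
score[] array stands in for whatever statistical preference a balancer
computed, the 64-bit 'allowed' word plays the role of ->cpus_allowed, and
pick_cpu() and MAX_CPUS are hypothetical names. The preference only ranks
candidates, the mask decides which candidates exist at all:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64	/* one bit per CPU in a 64-bit mask (toy limit) */

/*
 * Hypothetical model of "soft preference + hard constraint":
 * return the highest-scoring CPU among those set in 'allowed',
 * or -1 if no CPU below nr_cpus is allowed.
 */
static int pick_cpu(const int score[], int nr_cpus, uint64_t allowed)
{
	int best = -1;
	int cpu;

	assert(nr_cpus >= 0 && nr_cpus <= MAX_CPUS);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* Hard constraint: never consider a CPU outside the mask. */
		if (!(allowed & (UINT64_C(1) << cpu)))
			continue;
		/* Soft preference: among allowed CPUs, take the best score. */
		if (best < 0 || score[cpu] > score[best])
			best = cpu;
	}

	return best;
}
```

With scores { 10, 90, 40, 70 } and all four CPUs allowed, this picks CPU 1.
Once CPU 1 is cleared from the mask it picks CPU 3, however strongly the
statistics favour CPU 1.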

... once all that is solved, I can see several advantages from unifying the NUMA 
balancing and SMP affinity balancing code:

 - the NUMA balancer would improve: cpus_allowed isolation is used more 
   frequently, so fixes from those workloads would benefit the NUMA balancing case 
   as well.

 - testing the NUMA balancer would become easier: we'd simply set cpus_allowed and
   would watch how it balances. No need to coax workloads into actual MM NUMA 
   usage patterns to set up interesting scenarios.

 - our existing half-hearted ways to deal with cpus_allowed balancing could be 
   outsourced to the NUMA slow path, which would simplify the SMP balancing fast 
   path.
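
The "set cpus_allowed and watch how it balances" idea from the list above
could be approximated from user space along these lines. Again only a
sketch, assuming Linux with glibc; count_affinity_violations() is a
hypothetical helper, not an existing test. It samples sched_getcpu() in a
loop, builds a per-CPU histogram of where the thread ran, and counts
samples that fell outside the thread's affinity mask (which must never
happen, since the mask is a hard constraint):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for sched_getcpu() and the CPU_* macros */
#endif
#include <sched.h>

/*
 * Hypothetical test helper: take 'samples' readings of the CPU the caller
 * is running on. hits[] must have CPU_SETSIZE entries and receives a
 * per-CPU histogram. Returns the number of readings that landed outside
 * the caller's affinity mask, or -1 on error.
 */
static long count_affinity_violations(long samples, long hits[])
{
	cpu_set_t allowed;
	long violations = 0;
	long i;
	int cpu;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		hits[cpu] = 0;

	for (i = 0; i < samples; i++) {
		cpu = sched_getcpu();
		if (cpu < 0 || cpu >= CPU_SETSIZE)
			return -1;
		hits[cpu]++;
		if (!CPU_ISSET(cpu, &allowed))
			violations++;
		/* Give the scheduler a chance to move us between samples. */
		sched_yield();
	}

	return violations;
}
```

Running several such threads under different masks and comparing their
histograms gives a crude view of how evenly the allowed CPUs get used.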

But it's a major piece of work, and I might be missing implementation details.
It would be the biggest new scheduler feature since NUMA balancing for sure.

Thanks,

	Ingo

Thread overview: 9+ messages
2016-04-04  8:23 [RFC] sched: unused cpu in affine workload Jiri Olsa
2016-04-04  8:44 ` Peter Zijlstra
2016-04-04  8:59 ` Ingo Molnar
2016-04-04  9:19   ` Ingo Molnar [this message]
2016-04-04  9:38     ` Ingo Molnar
2016-04-04 13:23       ` Peter Zijlstra
2016-04-04 19:45         ` Rik van Riel
2016-04-04 21:34           ` Peter Zijlstra
2016-04-05  8:56             ` Jiri Olsa
