From: Ingo Molnar <mingo@kernel.org>
To: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
James Hartsock <hartsjc@redhat.com>,
Rik van Riel <riel@redhat.com>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
Kirill Tkhai <ktkhai@parallels.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC] sched: unused cpu in affine workload
Date: Mon, 4 Apr 2016 10:59:45 +0200
Message-ID: <20160404085944.GA3030@gmail.com>
In-Reply-To: <20160404082302.GB2137@krava.local>
* Jiri Olsa <jolsa@redhat.com> wrote:
> hi,
> we've noticed following issue in one of our workloads.
>
> I have a 24 CPU server with the following sched domains:
> domain 0: (pairs)
> domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
> domain 2: 0-23 level NUMA
>
> I run a CPU-hogging workload on the following CPUs:
> 4,6,14,18,19,20,23
>
> that is:
> 4,14 CPUs from group1
> 6,18,19,20,23 CPUs from group2
>
> the workload process gets its affinity set up via 'taskset -c ${CPUs} workload ...'
> and forks child for every CPU
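(For reference, a standalone reproduction of a setup like this could look roughly
like the hypothetical hog.c sketch below; it is an illustration only, not the
actual test case. It pins itself to the CPUs given on the command line the way
'taskset -c' would, then forks one CPU-bound child per CPU, with all children
inheriting the same affinity mask.)

	/* hog.c - hypothetical reproduction sketch
	 * build: gcc -o hog hog.c
	 * run:   ./hog 4 6 14 18 19 20 23
	 */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/wait.h>

	int main(int argc, char **argv)
	{
		cpu_set_t set;
		int i;

		CPU_ZERO(&set);
		for (i = 1; i < argc; i++)
			CPU_SET(atoi(argv[i]), &set);

		/* equivalent of starting the workload under 'taskset -c ...' */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return 1;
		}

		/* one CPU hog per CPU; the children inherit the affinity mask,
		 * so the load balancer decides where each of them runs */
		for (i = 1; i < argc; i++)
			if (fork() == 0)
				for (;;)
					;	/* burn CPU */

		for (i = 1; i < argc; i++)
			wait(NULL);
		return 0;
	}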
>
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 are running just 4 processes, leaving one of the
> CPUs from group2 idle
>
> AFAICS from the code the reason for this is that the load balancing
> follows the sched domain setup (topology) and does not take affinity setups
> like this into account. The code in find_busiest_group, running on the idle
> CPU from group2, will find group1 as busiest, but its average load will be
> smaller than that of the local group, so there's no task pulling.
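(A rough back-of-the-envelope illustration of that, glossing over the details of
the actual load metric and assuming each hog contributes one full CPU of load: in
domain 1 both groups span 12 CPUs, so seen from the idle CPU in group2 the
comparison is roughly

	group1 (remote): 3 hogs over a 12-CPU group  ->  avg ~3/12 of a CPU
	group2 (local):  4 hogs over a 12-CPU group  ->  avg ~4/12 of a CPU

group1 is the one with overloaded CPUs, 3 runnable tasks allowed on only 2 CPUs,
but its group average is below the local group's, so no imbalance is computed and
nothing gets pulled towards the idle CPU.)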
>
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?
Yeah, so the principle with user-pinning of tasks to CPUs was always:
 - pinning a task to a single CPU should obviously work fine; it's the primary
   use case for isolation.

 - pinning a task to an arbitrary subset of CPUs is a 'hard' problem
   mathematically that the scheduler never truly wanted to solve in a frontal
   fashion.
... but that principle was set into place well before we did the NUMA scheduling
work, which in itself is a highly non-trivial load optimization problem to begin
with, so we might want to reconsider.
So there are two directions I can suggest:
 - if you can come up with workable small-scale solutions to scratch an itch
   that comes up in practice then that's obviously good, as long as it does not
   regress anything else.

 - if you want to come up with a 'complete' solution then please don't put it
   into hot paths such as wakeup or context switching, or any of the hardirq
   methods, but try to integrate it with the NUMA scheduling slow path.
The NUMA balancing slow path is softirq driven and runs at a reasonably low
frequency, so it should not cause many performance problems.
The two problems (NUMA affinity and user affinity) are also loosely related on a
conceptual level: the NUMA affinity optimization problem can be considered as a
workload-determined, arbitrary 'NUMA mask' being optimized from first principles.
There's one ABI detail: this is true only as long as SMP affinity masks follow
node boundaries - the current NUMA balancing code is very much node granular, so
the two can only be merged if the ->cpus_allowed mask follows node boundaries as
well.
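(To make that constraint concrete: a helper along the following lines could check
whether a given affinity mask can be expressed in node-granular terms. This is a
sketch only; cpumask_is_node_granular() is hypothetical, not an existing kernel
function.)

	#include <linux/cpumask.h>
	#include <linux/nodemask.h>
	#include <linux/topology.h>

	/*
	 * Hypothetical helper: true if @mask is a union of whole NUMA nodes,
	 * i.e. every online node is either fully inside or fully outside the
	 * mask. Only such masks map cleanly onto the node-granular view that
	 * the current NUMA balancing code operates on.
	 */
	static bool cpumask_is_node_granular(const struct cpumask *mask)
	{
		int nid;

		for_each_online_node(nid) {
			const struct cpumask *node_mask = cpumask_of_node(nid);

			if (cpumask_intersects(mask, node_mask) &&
			    !cpumask_subset(node_mask, mask))
				return false;
		}
		return true;
	}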
A third approach would be to extend the NUMA balancing code to be CPU granular
(without changing any task placement behavior of the current NUMA balancing code
of course), with node granular being a special case. This would fit the cgroups
(and virtualization) use cases, but that would be a major change.
Thanks,
Ingo