From: Ingo Molnar <mingo@kernel.org>
To: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
James Hartsock <hartsjc@redhat.com>,
Rik van Riel <riel@redhat.com>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
Kirill Tkhai <ktkhai@parallels.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC] sched: unused cpu in affine workload
Date: Mon, 4 Apr 2016 10:59:45 +0200
Message-ID: <20160404085944.GA3030@gmail.com>
In-Reply-To: <20160404082302.GB2137@krava.local>
* Jiri Olsa <jolsa@redhat.com> wrote:
> hi,
> we've noticed following issue in one of our workloads.
>
> I have a 24 CPU server with the following sched domains:
> domain 0: (pairs)
> domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
> domain 2: 0-23 level NUMA
>
> I run a CPU-hogging workload on the following CPUs:
> 4,6,14,18,19,20,23
>
> that is:
> 4,14 CPUs from group1
> 6,18,19,20,23 CPUs from group2
>
> the workload process gets its affinity set up via 'taskset -c ${CPUs} workload ...'
> and forks child for every CPU
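(For reference, a standalone reproduction of a setup like this could look roughly
like the hypothetical hog.c sketch below; it is an illustration only, not the
actual test case. It pins itself to the CPUs given on the command line the way
'taskset -c' would, then forks one CPU-bound child per CPU, with all children
inheriting the same affinity mask.)

	/* hog.c - hypothetical reproduction sketch
	 * build: gcc -o hog hog.c
	 * run:   ./hog 4 6 14 18 19 20 23
	 */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/wait.h>

	int main(int argc, char **argv)
	{
		cpu_set_t set;
		int i;

		CPU_ZERO(&set);
		for (i = 1; i < argc; i++)
			CPU_SET(atoi(argv[i]), &set);

		/* equivalent of starting the workload under 'taskset -c ...' */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return 1;
		}

		/* one CPU hog per CPU; the children inherit the affinity mask,
		 * so the load balancer decides where each of them runs */
		for (i = 1; i < argc; i++)
			if (fork() == 0)
				for (;;)
					;	/* burn CPU */

		for (i = 1; i < argc; i++)
			wait(NULL);
		return 0;
	}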
>
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 are running just 4 processes, leaving one of the
> CPUs from group2 idle
>
> AFAICS from the code the reason for this is that the load balancing
> follows the sched domain setup (topology) and does not take affinity setups
> like this into account. The code in find_busiest_group, running on the idle
> CPU from group2, will find group1 as busiest, but its average load will be
> smaller than that of the local group, so there's no task pulling.
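(A rough back-of-the-envelope illustration of that, glossing over the details of
the actual load metric and assuming each hog contributes one full CPU of load: in
domain 1 both groups span 12 CPUs, so seen from the idle CPU in group2 the
comparison is roughly

	group1 (remote): 3 hogs over a 12-CPU group  ->  avg ~3/12 of a CPU
	group2 (local):  4 hogs over a 12-CPU group  ->  avg ~4/12 of a CPU

group1 is the one with overloaded CPUs, 3 runnable tasks allowed on only 2 CPUs,
but its group average is below the local group's, so no imbalance is computed and
nothing gets pulled towards the idle CPU.)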
>
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?
Yeah, so the principle with user-pinning of tasks to CPUs was always:
 - pinning a task to a single CPU should obviously work fine; it's the primary
   use case for isolation.

 - pinning a task to an arbitrary subset of CPUs is a 'hard' problem
   mathematically that the scheduler never truly wanted to solve in a frontal
   fashion.
... but that principle was set into place well before we did the NUMA scheduling
work, which in itself is a highly non-trivial load optimization problem to begin
with, so we might want to reconsider.
So there are two directions I can suggest:
 - if you can come up with workable small-scale solutions to scratch an itch
   that comes up in practice then that's obviously good, as long as it does not
   regress anything else.

 - if you want to come up with a 'complete' solution then please don't put it
   into hot paths such as wakeup or context switching, or any of the hardirq
   methods, but try to integrate it with the NUMA scheduling slow path.
The NUMA balancing slow path is softirq driven and runs at a reasonably low
frequency, so it should not cause many performance problems.
The two problems (NUMA affinity and user affinity) are also loosely related on a
conceptual level: the NUMA affinity optimization problem can be considered as a
workload-determined, arbitrary 'NUMA mask' being optimized from first principles.
There's one ABI detail: this is true only as long as SMP affinity masks follow
node boundaries - the current NUMA balancing code is very much node granular, so
the two can only be merged if the ->cpus_allowed mask follows node boundaries as
well.
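(To make that constraint concrete: a helper along the following lines could check
whether a given affinity mask can be expressed in node-granular terms. This is a
sketch only; cpumask_is_node_granular() is hypothetical, not an existing kernel
function.)

	#include <linux/cpumask.h>
	#include <linux/nodemask.h>
	#include <linux/topology.h>

	/*
	 * Hypothetical helper: true if @mask is a union of whole NUMA nodes,
	 * i.e. every online node is either fully inside or fully outside the
	 * mask. Only such masks map cleanly onto the node-granular view that
	 * the current NUMA balancing code operates on.
	 */
	static bool cpumask_is_node_granular(const struct cpumask *mask)
	{
		int nid;

		for_each_online_node(nid) {
			const struct cpumask *node_mask = cpumask_of_node(nid);

			if (cpumask_intersects(mask, node_mask) &&
			    !cpumask_subset(node_mask, mask))
				return false;
		}
		return true;
	}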
A third approach would be to extend the NUMA balancing code to be CPU granular
(without changing any task placement behavior of the current NUMA balancing code
of course), with node granular being a special case. This would fit the cgroups
(and virtualization) use cases, but that would be a major change.
Thanks,
Ingo