From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
To: Rik van Riel <riel@redhat.com>,
umgwanakikbuti@gmail.com, Peter Zijlstra <peterz@infradead.org>
Cc: Preeti Murthy <preeti.lkml@gmail.com>,
LKML <linux-kernel@vger.kernel.org>,
Morten Rasmussen <morten.rasmussen@arm.com>,
Ingo Molnar <mingo@kernel.org>,
george.mccollister@gmail.com, ktkhai@parallels.com
Subject: Re: [PATCH RFC/TEST] sched: make sync affine wakeups work
Date: Mon, 05 May 2014 10:20:20 +0530
Message-ID: <5367188C.1060702@linux.vnet.ibm.com>
In-Reply-To: <53663565.9080306@redhat.com>

On 05/04/2014 06:11 PM, Rik van Riel wrote:
> On 05/04/2014 07:44 AM, Preeti Murthy wrote:
>> Hi Rik, Mike
>>
>> On Fri, May 2, 2014 at 12:00 PM, Rik van Riel <riel@redhat.com> wrote:
>>> On 05/02/2014 02:13 AM, Mike Galbraith wrote:
>>>> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
>>>>
>>>>> Whether or not this is the right thing to do remains to be seen,
>>>>> but it does allow us to verify whether or not the wake_affine
>>>>> strategy of always doing affine wakeups and only disabling them
>>>>> in a specific circumstance is sound, or needs rethinking...
>>>>
>>>> Yes, it needs rethinking.
>>>>
>>>> I know why you want to try this, yes, select_idle_sibling() is very much
>>>> a two faced little bitch.
>>>
>>> My biggest problem with select_idle_sibling and wake_affine in
>>> general is that it will override NUMA placement, even when
>>> processes only wake each other up infrequently...
>>
>> As far as my understanding goes, the logic in select_task_rq_fair()
>> does wake_affine() or calls select_idle_sibling() only at those
>> levels of sched domains where the flag SD_WAKE_AFFINE is set.
>> This flag is not set at the numa domain and hence they will not be
>> balancing across numa nodes. So I don't understand how
>> *these functions* are affecting NUMA placements.
>
> Even on 8-node DL980 systems, the NUMA distance in the
> SLIT table is less than RECLAIM_DISTANCE, and we will
> do wake_affine across the entire system.
>
>> The wake_affine() and select_idle_sibling() will shuttle tasks
>> within a NUMA node as far as I can see.i.e. if the cpu that the task
>> previously ran on and the waker cpu belong to the same node.
>> Else they are not called.
>
> That is what I first hoped, too. I was wrong.
>
>> If the prev_cpu and the waker cpu are on different NUMA nodes
>> then naturally the tasks will get shuttled across NUMA nodes but
>> the culprits are the find_idlest* functions.
>> They do a top-down search for the idlest group and cpu, starting
>> at the NUMA domain *attached to the waker and not the prev_cpu*.
>> This means that the task will end up on a different NUMA node.
>> Looks to me that the problem lies here and not in the wake_affine()
>> and select_idle_siblings().
>
> I have a patch for find_idlest_group that takes the NUMA
> distance between each group and the task's preferred node
> into account.
>
> However, as long as the wake_affine stuff still gets to
> override it, that does not make much difference :)
>
Yeah, now I see it. But I still feel wake_affine() and
select_idle_sibling() are not at fault, primarily because when they
were introduced I don't think it was foreseen that the cpu topology
would grow to the extent it has now.

select_idle_sibling(), for instance, scans the cpus within the purview
of the last level cache of a cpu, and this used to be a small set, so
there was no noticeable overhead. Now, with many cpus sharing the L3
cache, we see an overhead. wake_affine() probably did not expect the
NUMA nodes to come under its governance as well, and hence it sees no
harm in waking up tasks close to the waker, because it still assumes
that the wakeup will stay within a node.

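To make the overhead argument concrete, here is a rough userspace
sketch. It is not the kernel's select_idle_sibling(); the function
name, the idle[] array and the scanned counter are all made up for
illustration. It only shows that a linear search over the cpus sharing
a last level cache costs more as that set grows.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for the search done on wakeup:
 * walk the cpus that share a last level cache and return the first
 * idle one. idle[i] != 0 means cpu i is idle.
 *
 * Returns the cpu index, or -1 if no cpu in the set is idle.
 * *scanned is set to the number of cpus that had to be visited.
 */
static int sketch_find_idle_cpu(const int *idle, size_t nr_llc_cpus,
				size_t *scanned)
{
	size_t i;

	*scanned = 0;
	for (i = 0; i < nr_llc_cpus; i++) {
		(*scanned)++;
		if (idle[i])
			return (int)i;
	}
	return -1;
}
```

In the worst case every cpu in the set is visited, so a set of 16 cpus
costs 8 times as many checks per wakeup as a set of 2.
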
What has changed, I feel, is the code around these two functions. Take
this problem for instance. We ourselves are saying in sd_local_flags()
that this specific domain is fit for wake affine balance. So naturally
the logic in wake_affine() and select_idle_sibling() will follow.
My point is that the peripheral code is seeing the negative effect of
these two functions because that code pushed itself under their ambit.

Don't you think we should go conservative on the value of
RECLAIM_DISTANCE, at least in arch specific code? On powerpc we set it
to 10. Besides, the git log does not tell us the basis on which the
default value of 30 was chosen. Maybe this needs rethinking?

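As a rough illustration of why the threshold matters, below is a
simplified userspace model of the check as I read it in
sd_local_flags(). The names prefixed with sketch_/SKETCH_ are made up,
the flag value is a placeholder, and the distance of 20 used in the
note below is a hypothetical value, not taken from any real SLIT table.

```c
/* Values discussed above: generic default and the powerpc override. */
#define SKETCH_RECLAIM_DISTANCE_DEFAULT	30
#define SKETCH_RECLAIM_DISTANCE_POWERPC	10

/* Placeholder bit standing in for the kernel's SD_WAKE_AFFINE. */
#define SKETCH_SD_WAKE_AFFINE		0x1

/*
 * Simplified model of the decision: a NUMA domain level keeps the
 * wake affine flag unless its distance exceeds the threshold.
 */
static int sketch_sd_local_flags(int numa_distance, int reclaim_distance)
{
	if (numa_distance > reclaim_distance)
		return 0;

	return SKETCH_SD_WAKE_AFFINE;
}
```

With a hypothetical node distance of 20, the default of 30 leaves the
flag set, so affine wakeups span that domain, while a threshold of 10
clears it.
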
Regards
Preeti U Murthy