Re: [PATCH] sched/fair: Again ignore percpu threads for imbalance pulls

From: Yihao Wu <wuyihao@linux.alibaba.com>
To: Valentin Schneider <valentin.schneider@arm.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: "Shanpei Chen" <shanpeic@linux.alibaba.com>,
	王贇 <yun.wang@linux.alibaba.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/fair: Again ignore percpu threads for imbalance pulls
Date: Tue, 18 Jan 2022 16:11:14 +0800
Message-ID: <81d85a50-e54e-e74a-14a9-348413850f73@linux.alibaba.com>
In-Reply-To: <87ee56705h.mognet@arm.com>

On 2022/1/18 1:16 am, Valentin Schneider wrote:
> On 17/01/22 22:50, Yihao Wu wrote:
>> Thanks a lot for the help, Valentin and Peter!
>>
>> On 2021/12/17 2:26am, Valentin Schneider wrote:
>>> On 11/12/21 17:48, Yihao Wu wrote:
>>>> commit 2f5f4cce496e ("sched/fair: Ignore percpu threads for imbalance
>>>> pulls") was meant to fix a performance issue, when load balance tries to
>>>> migrate pinned kernel threads at MC domain level. This was destined to
>>>> fail.
>>>
>>>> After it fails, it further disrupts wakeup balance at the NUMA domain
>>>> level. The most severe case I noticed, which occurs frequently:
>>>>       |sum_nr_running(node1) - sum_nr_running(node2)| > 100
>>>>
>>>
>>> Wakeup balance (aka find_idlest_cpu()) is different from periodic load
>>> balance (aka load_balance()) and doesn't use can_migrate_task(), so the
>>> incriminated commit shouldn't have impacted it (at least not in obvious
>>> ways...). Do you have any more details on that issue?
>>
>> The original bugfix only concerns periodic load balance. However, I found
>> that wakeup balance is impacted too, after I observed a regression in the
>> lmbench3 test suite. This is how it is impacted:
>>
>> - Periodic load balance
>>   - kthread_is_per_cpu? No
>>   - env->flags |= LBF_SOME_PINNED
>>   - sd_parent..imbalance being set to 1 because of LBF_SOME_PINNED
>>
>> So far this is exactly what Chandrasekhar describes in 2f5f4cce496e.
>> The imbalance flag is then what connects periodic and wakeup balance.
>>
>> - Wakeup balance (find_idlest_group)
>>   - update_sg_wakeup_stats classifies local_sgs as group_imbalanced
>>   - find_idlest_group chooses another NUMA node
>>
>> Wakeup balance keeps doing this until the other NUMA node becomes too busy.
>> Then another periodic load balance just shifts the load back, making the
>> previously overloaded node completely idle.
>>
> 
> Oooh, right, I came to the same conclusion when I got that stress-ng
> regression report back then:
> 
> https://lore.kernel.org/all/871rajkfkn.mognet@arm.com/
> 

I'm shocked! I spent weeks tracking down almost the same regression. Why on
earth didn't I read your discussion from half a year ago?
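
For anyone following the thread, the wakeup side of that chain can be
sketched in plain userspace C. This is only a toy model under stated
assumptions: the enum ordering is meant to mirror enum group_type in
kernel/sched/fair.c, but every toy_* name is hypothetical, the classification
is reduced to "overloaded wins, then the imbalance flag, then the load-based
type", and ties simply stay local (the kernel compares more statistics there).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Ordering assumed to follow enum group_type: a higher value means busier. */
enum toy_group_type {
	toy_group_has_spare = 0,
	toy_group_fully_busy,
	toy_group_misfit_task,
	toy_group_asym_packing,
	toy_group_imbalanced,
	toy_group_overloaded,
};

/* Hypothetical stand-in for one sched group (here: one NUMA node). */
struct toy_group {
	enum toy_group_type by_load; /* type derived from load alone */
	bool imbalance_flag;         /* set by periodic balance on LBF_SOME_PINNED */
};

/* Simplified classification: overloaded first, then the imbalance flag. */
static enum toy_group_type toy_classify(const struct toy_group *g)
{
	if (g->by_load == toy_group_overloaded)
		return toy_group_overloaded;
	if (g->imbalance_flag)
		return toy_group_imbalanced;
	return g->by_load;
}

/*
 * Simplified wakeup choice: returns NULL to stay on the local node, or the
 * remote group when the local one is classified as busier.
 */
static const struct toy_group *
toy_find_idlest_group(const struct toy_group *local,
		      const struct toy_group *remote)
{
	enum toy_group_type l = toy_classify(local);
	enum toy_group_type r = toy_classify(remote);

	if (l > r)
		return remote;
	return NULL; /* local is idler, or a tie in this toy model */
}

/* Walks through the three situations described in the quoted text. */
static int toy_wakeup_demo(void)
{
	struct toy_group local = { toy_group_has_spare, true };
	struct toy_group remote = { toy_group_fully_busy, false };

	/* A stale imbalance flag pushes wakeups to the busier remote node. */
	assert(toy_find_idlest_group(&local, &remote) == &remote);

	/* This only stops once the remote node is overloaded. */
	remote.by_load = toy_group_overloaded;
	assert(toy_find_idlest_group(&local, &remote) == NULL);

	/* Without the flag, the idle local node would have been kept. */
	local.imbalance_flag = false;
	remote.by_load = toy_group_fully_busy;
	assert(toy_find_idlest_group(&local, &remote) == NULL);
	return 0;
}
```

Calling toy_wakeup_demo() walks through the three cases: a node that has
spare capacity but carries the imbalance flag keeps sending new tasks away
until the other node is classified as overloaded.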

> I pretty much gave up on that as the regression was caused by removing an
> obscure/accidental balance which I couldn't properly codify. I can give it

Strange, the regression reported to me behaves differently from yours.

               4.19.91    before_2f5f4    after_2f5f4
my_report      good       bad             bad
your_report    N/A        good            bad

your_report says 2f5f4 introduces a new regression, while
my_report says 2f5f4 fails to fix the old regression and leaves it in place ...

Maybe that's why you gave up on fixing it, whereas I ended up making
can_migrate_task cover more cases (kernel_thread).


> another shot, but AFAICT that only affects fork/exec heavy workloads (that
> -13% was on something doing almost only forks) which is an odd case to
> support.
>
Yes. They're indeed quite odd workloads.
- Apps with massive numbers of short-lived threads had better change their
  runtime model, or use a thread pool.
- Massive numbers of different apps on the same machine are even odder.

But I guess this problem affects normal workloads too, to some degree but
not significantly. It's hard to tell exactly how much influence it has.

>> (Thanks to the great schedviz tool, I observed that the workload as a
>> whole is migrated between the two NUMA nodes in a ping-pong pattern,
>> with a period of around 3ms.)
>>
>> The reason wakeup balance suffers more is that, in the fork+exit test case,
>> wakeup balance happens at a much higher frequency. I believe this occurs in
>> real-world applications too.
>>
>>>
>>>> However the original bugfix failed, because it covers only case 1) below.
>>>>     1) Created by create_kthread
>>>>     2) Created by kernel_thread
>>>> No kthread is assigned to task_struct in case 2 (Please refer to comments
>>>> in free_kthread_struct) so it simply won't work.
>>>>
>>>> The easiest way to cover both cases is to check nr_cpus_allowed, just as
>>>> discussed in the mailing list of the v1 version of the original fix.
>>>>
>>>> * lmbench3.lat_proc -P 104 fork (2 NUMA, and 26 cores, 2 threads)
>>>>
>>>
>>> Reasoning about "proper" pcpu kthreads was simpler since they are static,
>>> see 3a7956e25e1d ("kthread: Fix PF_KTHREAD vs to_kthread() race")
>>>
>> Got it. Thanks.
>>
>>>>                            w/out patch                 w/ patch
>>>> fork+exit latency            1660 ms                  1520 ms (   8.4%)
>>>>
>>>> Fixes: 2f5f4cce496e ("sched/fair: Ignore percpu threads for imbalance pulls")
>>>> Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
>>>> ---
>>>>    kernel/kthread.c | 6 +-----
>>>>    1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>>>> index 4a4d7092a2d8..cb05d3ff2de4 100644
>>>> --- a/kernel/kthread.c
>>>> +++ b/kernel/kthread.c
>>>> @@ -543,11 +543,7 @@ void kthread_set_per_cpu(struct task_struct *k, int cpu)
>>>>
>>>>    bool kthread_is_per_cpu(struct task_struct *p)
>>>>    {
>>>> -	struct kthread *kthread = __to_kthread(p);
>>>> -	if (!kthread)
>>>> -		return false;
>>>> -
>>>> -	return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
>>>> +	return (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1;
>>>>    }
>>>
>>> As Peter said, this is going to cause issues. If you look at
>>> kthread_set_per_cpu(), we also store a CPU value which we expect to be
>>> valid when kthread_is_per_cpu(), which that change is breaking.
>>>
>>> AIUI what you want to patch is the actual usage in can_migrate_task()
>>>
>>
>> Got it. Some may want a consistent view of kthread_is_per_cpu,
>> kthread->cpu, and KTHREAD_IS_PER_CPU.
>>
>> Are you suggesting patching only can_migrate_task to check
>> nr_cpus_allowed?
> 
> Yes
> 

Okay, I'll post a v2 and see if Peter likes it.
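
To make the idea concrete, here is a rough userspace sketch of the difference
between the flag-based check and an nr_cpus_allowed-based check inside
can_migrate_task. It is not the v2 patch itself. PF_KTHREAD and
LBF_SOME_PINNED use the values I believe the kernel defines; all toy_* types
and functions are hypothetical stand-ins, and the model assumes (as described
earlier in the thread) that a thread created via kernel_thread has no
struct kthread attached.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_KTHREAD      0x00200000u /* task is a kernel thread */
#define LBF_SOME_PINNED 0x08u       /* balance pass saw a pinned task */

/* Hypothetical stand-in for struct task_struct. */
struct toy_task {
	unsigned int flags;
	int nr_cpus_allowed;
	unsigned long cpus_mask;  /* bit n set: task may run on CPU n */
	bool has_kthread_struct;  /* false models the kernel_thread case */
	bool per_cpu_bit;         /* models KTHREAD_IS_PER_CPU */
};

/* Hypothetical stand-in for struct lb_env. */
struct toy_env {
	int dst_cpu;
	unsigned int flags;
};

/* Current check: needs a struct kthread with the per-CPU bit set. */
static bool toy_kthread_is_per_cpu(const struct toy_task *p)
{
	return p->has_kthread_struct && p->per_cpu_bit;
}

/* Alternative check: any kernel thread pinned to exactly one CPU. */
static bool toy_pinned_kthread(const struct toy_task *p)
{
	return (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1;
}

/* Simplified can_migrate_task: only the two checks discussed here. */
static bool toy_can_migrate(const struct toy_task *p, struct toy_env *env,
			    bool use_nr_cpus_allowed)
{
	bool skip = use_nr_cpus_allowed ? toy_pinned_kthread(p)
					: toy_kthread_is_per_cpu(p);

	if (skip)
		return false; /* ignored quietly, no flag is set */

	if (!(p->cpus_mask & (1UL << env->dst_cpu))) {
		env->flags |= LBF_SOME_PINNED; /* later marks the parent imbalanced */
		return false;
	}
	return true;
}

/* A pinned kernel thread without struct kthread, balanced towards CPU 0. */
static int toy_migrate_demo(void)
{
	struct toy_task t = {
		.flags = PF_KTHREAD, .nr_cpus_allowed = 1,
		.cpus_mask = 1UL << 3, .has_kthread_struct = false,
	};
	struct toy_env flag_env = { .dst_cpu = 0, .flags = 0 };
	struct toy_env mask_env = { .dst_cpu = 0, .flags = 0 };

	assert(!toy_can_migrate(&t, &flag_env, false));
	assert(flag_env.flags & LBF_SOME_PINNED);    /* flag-based: still set */

	assert(!toy_can_migrate(&t, &mask_env, true));
	assert(!(mask_env.flags & LBF_SOME_PINNED)); /* mask-based: not set */
	return 0;
}
```

In this model both variants refuse the migration, but only the flag-based one
sets LBF_SOME_PINNED for the kernel_thread case. Keeping the PF_KTHREAD test
means pinned user tasks still go through the normal affinity path.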

>> Wouldn't it be confusing if it uses an alternative way
>> to tell if p is a per-cpu kthread?
>>
> 
> Well then it wouldn't catch just per-CPU kthreads, but rather any pinned
> task (kernel or otherwise). But then you have to check/test if that's a
> sane thing to do :)
> 

Sounds like a pain... and not an option :-D

Thanks,
Yihao Wu

>> I don't have a better solution though. :(
>>
>>
>> Thanks,
>> Yihao Wu
>>
>>>>
>>>>    /**
>>>> --
>>>> 2.32.0.604.gb1f3e1269

Thread overview: 7+ messages
2021-12-11  9:48 [PATCH] sched/fair: Again ignore percpu threads for imbalance pulls Yihao Wu
2021-12-11 11:12 ` Peter Zijlstra
2021-12-16 18:26 ` Valentin Schneider
2022-01-17 14:50   ` Yihao Wu
2022-01-17 17:16     ` Valentin Schneider
2022-01-18  8:11       ` Yihao Wu [this message]
2022-01-18 17:10         ` Valentin Schneider