All of lore.kernel.org
 help / color / mirror / Atom feed
From: Frederic Weisbecker <frederic@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Qian Cai <cai@lca.pw>, Borislav Petkov <bp@alien8.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: Endless soft-lockups for compiling workload since next-20200519
Date: Thu, 21 May 2020 14:41:14 +0200	[thread overview]
Message-ID: <20200521124113.GC15455@lenoir> (raw)
In-Reply-To: <20200521110027.GC325303@hirez.programming.kicks-ass.net>

On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > 
> > > This:
> > > 
> > > >         smp_call_function_single_async() {             smp_call_function_single_async() {
> > > >             // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
> > > >             csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK
> > > 
> > > concurrent smp_call_function_single_async() using the same csd is what
> > > I'm looking at as well.
> > 
> > So something like this ought to cure the fundamental problem and make
> > smp_call_function_single_async() more user friendly, but also more
> > expensive.
> > 
> > The problem is that while the ILB case is easy to fix, I can't seem to
> > find an equally nice solution for the ttwu_remote_queue() case; that
> > would basically require sticking the wake_csd in task_struct, I'll also
> > post that.
> > 
> > So it's either this:
> 
> Or this:
> 
> ---
>  include/linux/sched.h | 4 ++++
>  kernel/sched/core.c   | 7 ++++---
>  kernel/sched/fair.c   | 2 +-
>  kernel/sched/sched.h  | 1 -
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f38d62c4632c..136ee400b568 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>  	struct uclamp_se		uclamp[UCLAMP_CNT];
>  #endif
>  
> +#ifdef CONFIG_SMP
> +	call_single_data_t		wake_csd;
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	/* List of struct preempt_notifier: */
>  	struct hlist_head		preempt_notifiers;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b286469e26e..a7129652e89b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  
>  	if (llist_add(&p->wake_entry, &rq->wake_list)) {
>  		if (!set_nr_if_polling(rq->idle))
> -			smp_call_function_single_async(cpu, &rq->wake_csd);
> +			smp_call_function_single_async(cpu, &p->wake_csd);
>  		else
>  			trace_sched_wake_idle_without_ipi(cpu);
>  	}
> @@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #endif
>  #if defined(CONFIG_SMP)
>  	p->on_cpu = 0;
> +	p->wake_csd = (struct __call_single_data) {
> +		.func = wake_csd_func,
> +	};
>  #endif
>  	init_task_preempt_count(p);
>  #ifdef CONFIG_SMP
> @@ -6723,8 +6726,6 @@ void __init sched_init(void)
>  		rq->avg_idle = 2*sysctl_sched_migration_cost;
>  		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>  
> -		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> -
>  		INIT_LIST_HEAD(&rq->cfs_tasks);
>  
>  		rq_attach_root(rq, &def_root_domain);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
>  	 * is idle. And the softirq performing nohz idle load balance
>  	 * will be run before returning from the IPI.
>  	 */
> -	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> +	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);

My fear here is that if a previous call from the the same CPU but to another
target is still pending, the new one will be spuriously ignored.

Namely this could happen:

CPU 0                                      CPU 1
-----                                      -----
                                           local_irq_disable() or VMEXIT
kick_ilb() {
    smp_call_function_single_async(CPU 1,      
                 &this_rq()->nohz_csd);
}

kick_ilb() {
    smp_call_function_single_async(CPU 2,      
                 &this_rq()->nohz_csd) {
        // IPI to CPU 2 ignored
        if (csd->flags == CSD_LOCK)
            return -EBUSY;
    }
}

                                            local_irq_enable();




But I believe we can still keep the remote csd if nohz_flags() are
strictly only set before the IPI and strictly only cleared from it.

And I still don't understand why trigger_load_balance() raise the
softirq without setting the current CPU as ilb. run_rebalance_domains()
thus ignores it most of the time in the end or it spuriously clear the
nohz_flags set by an IPI sender. Or there is something I misunderstood
there.

(Haven't checked the wake up case yet).

WARNING: multiple messages have this Message-ID (diff)
From: Frederic Weisbecker <frederic@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>, "Paul E. McKenney" <paulmck@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Michael Ellerman <mpe@ellerman.id.au>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Borislav Petkov <bp@alien8.de>
Subject: Re: Endless soft-lockups for compiling workload since next-20200519
Date: Thu, 21 May 2020 14:41:14 +0200	[thread overview]
Message-ID: <20200521124113.GC15455@lenoir> (raw)
In-Reply-To: <20200521110027.GC325303@hirez.programming.kicks-ass.net>

On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > 
> > > This:
> > > 
> > > >         smp_call_function_single_async() {             smp_call_function_single_async() {
> > > >             // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
> > > >             csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK
> > > 
> > > concurrent smp_call_function_single_async() using the same csd is what
> > > I'm looking at as well.
> > 
> > So something like this ought to cure the fundamental problem and make
> > smp_call_function_single_async() more user friendly, but also more
> > expensive.
> > 
> > The problem is that while the ILB case is easy to fix, I can't seem to
> > find an equally nice solution for the ttwu_remote_queue() case; that
> > would basically require sticking the wake_csd in task_struct, I'll also
> > post that.
> > 
> > So it's either this:
> 
> Or this:
> 
> ---
>  include/linux/sched.h | 4 ++++
>  kernel/sched/core.c   | 7 ++++---
>  kernel/sched/fair.c   | 2 +-
>  kernel/sched/sched.h  | 1 -
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f38d62c4632c..136ee400b568 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>  	struct uclamp_se		uclamp[UCLAMP_CNT];
>  #endif
>  
> +#ifdef CONFIG_SMP
> +	call_single_data_t		wake_csd;
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	/* List of struct preempt_notifier: */
>  	struct hlist_head		preempt_notifiers;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b286469e26e..a7129652e89b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  
>  	if (llist_add(&p->wake_entry, &rq->wake_list)) {
>  		if (!set_nr_if_polling(rq->idle))
> -			smp_call_function_single_async(cpu, &rq->wake_csd);
> +			smp_call_function_single_async(cpu, &p->wake_csd);
>  		else
>  			trace_sched_wake_idle_without_ipi(cpu);
>  	}
> @@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #endif
>  #if defined(CONFIG_SMP)
>  	p->on_cpu = 0;
> +	p->wake_csd = (struct __call_single_data) {
> +		.func = wake_csd_func,
> +	};
>  #endif
>  	init_task_preempt_count(p);
>  #ifdef CONFIG_SMP
> @@ -6723,8 +6726,6 @@ void __init sched_init(void)
>  		rq->avg_idle = 2*sysctl_sched_migration_cost;
>  		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>  
> -		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> -
>  		INIT_LIST_HEAD(&rq->cfs_tasks);
>  
>  		rq_attach_root(rq, &def_root_domain);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
>  	 * is idle. And the softirq performing nohz idle load balance
>  	 * will be run before returning from the IPI.
>  	 */
> -	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> +	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);

My fear here is that if a previous call from the the same CPU but to another
target is still pending, the new one will be spuriously ignored.

Namely this could happen:

CPU 0                                      CPU 1
-----                                      -----
                                           local_irq_disable() or VMEXIT
kick_ilb() {
    smp_call_function_single_async(CPU 1,      
                 &this_rq()->nohz_csd);
}

kick_ilb() {
    smp_call_function_single_async(CPU 2,      
                 &this_rq()->nohz_csd) {
        // IPI to CPU 2 ignored
        if (csd->flags == CSD_LOCK)
            return -EBUSY;
    }
}

                                            local_irq_enable();




But I believe we can still keep the remote csd if nohz_flags() are
strictly only set before the IPI and strictly only cleared from it.

And I still don't understand why trigger_load_balance() raise the
softirq without setting the current CPU as ilb. run_rebalance_domains()
thus ignores it most of the time in the end or it spuriously clear the
nohz_flags set by an IPI sender. Or there is something I misunderstood
there.

(Haven't checked the wake up case yet).

  reply	other threads:[~2020-05-21 12:43 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-20  3:58 Endless soft-lockups for compiling workload since next-20200519 Qian Cai
2020-05-20  3:58 ` Qian Cai
2020-05-20 12:50 ` Peter Zijlstra
2020-05-20 12:50   ` Peter Zijlstra
2020-05-20 14:06   ` Qian Cai
2020-05-20 14:06     ` Qian Cai
2020-05-21  0:40   ` Frederic Weisbecker
2020-05-21  0:40     ` Frederic Weisbecker
2020-05-21  9:39     ` Peter Zijlstra
2020-05-21  9:39       ` Peter Zijlstra
2020-05-21 10:49       ` Peter Zijlstra
2020-05-21 10:49         ` Peter Zijlstra
2020-05-21 11:00         ` Peter Zijlstra
2020-05-21 11:00           ` Peter Zijlstra
2020-05-21 12:41           ` Frederic Weisbecker [this message]
2020-05-21 12:41             ` Frederic Weisbecker
2020-05-25 13:21             ` Peter Zijlstra
2020-05-25 13:21               ` Peter Zijlstra
2020-05-25 14:05               ` Frederic Weisbecker
2020-05-25 14:05                 ` Frederic Weisbecker
2020-05-25 14:38                 ` Peter Zijlstra
2020-05-25 14:38                   ` Peter Zijlstra
2020-05-25 14:17               ` Peter Zijlstra
2020-05-25 14:17                 ` Peter Zijlstra
2020-05-22  2:00       ` Qian Cai
2020-05-22  2:00         ` Qian Cai
2020-05-21 10:10     ` Peter Zijlstra
2020-05-21 10:10       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200521124113.GC15455@lenoir \
    --to=frederic@kernel.org \
    --cc=bp@alien8.de \
    --cc=cai@lca.pw \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.