From: Frederic Weisbecker <frederic@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Qian Cai <cai@lca.pw>, Borislav Petkov <bp@alien8.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: Endless soft-lockups for compiling workload since next-20200519
Date: Thu, 21 May 2020 02:40:36 +0200
Message-ID: <20200521004035.GA15455@lenoir>
In-Reply-To: <20200520125056.GC325280@hirez.programming.kicks-ass.net>

On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
> 
> > [ 1167.993773][    C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0

So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated unfortunately:

CPU 0                                              CPU 1
-----                                              -----

tick {
    trigger_load_balance() {
        raise_softirq(SCHED_SOFTIRQ);
        //but nohz_flags(0) = 0
    }
                                                   kick_ilb() {
                                                       atomic_fetch_or(...., nohz_flags(0))
    softirq() {                                        #VMEXIT or anything that could stop a CPU for a while
        run_rebalance_domain() {
            nohz_idle_balance() {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flags(0))
            }
        }
    }
}

// schedule
nohz_newidle_balance() {
    kick_ilb() { // pick current CPU
        atomic_fetch_or(...., nohz_flags(0))           #VMENTER
        smp_call_function_single_async() {             smp_call_function_single_async() {
            // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
            csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK
            //execute in place                             //queue and send IPI
            csd->flags = 0
            nohz_csd_func()
        }
    }
}


IPI {
    flush_smp_call_function_queue() {
        csd_unlock() {
            WARN_ON(csd->flags != CSD_LOCK) <---------!!!!!



The root cause here would be that trigger_load_balance() unconditionally raises
the softirq. And I have to confess I'm not clear why, since the softirq is
essentially a no-op when nohz_flags() is 0.
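
To make the interleaving above easier to check, here is a minimal
single-threaded sketch that replays it step by step in user space. It is
only a model under stated assumptions: struct ilb_model, kick_claim(),
csd_unlock() as written here, replay_race() and the constant values are
made up for illustration and are not the kernel's code. The
tick_softirq_runs argument switches the unconditionally raised softirq
on or off.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative constants; the values are placeholders, not the kernel's. */
#define CSD_LOCK       0x1u
#define NOHZ_KICK_MASK 0x3u

/* Hypothetical model state: CPU 0's nohz flags and its single shared csd. */
struct ilb_model {
	unsigned int nohz_flags; /* stands in for nohz_flags(0) */
	unsigned int csd_flags;  /* stands in for csd->flags */
	int warnings;            /* times the csd_unlock() check tripped */
};

/*
 * First half of kick_ilb(): fetch-or the kick bits. Only the caller that
 * saw them clear goes on to smp_call_function_single_async().
 */
static bool kick_claim(struct ilb_model *m)
{
	unsigned int old = m->nohz_flags;

	m->nohz_flags |= NOHZ_KICK_MASK;
	return !(old & NOHZ_KICK_MASK);
}

/* csd_unlock() with the WARN_ON() turned into a counter. */
static void csd_unlock(struct ilb_model *m)
{
	if (m->csd_flags != CSD_LOCK)
		m->warnings++;
	m->csd_flags = 0;
}

/* Replays the interleaving in order; returns the number of warnings. */
int replay_race(bool tick_softirq_runs)
{
	struct ilb_model m = { 0, 0, 0 };

	/* CPU 1: kick_ilb() sets the kick bits, then the CPU stalls. */
	bool cpu1_kicks = kick_claim(&m);
	assert(cpu1_kicks); /* flags started at 0, so CPU 1 always claims */

	/* CPU 0: the softirq raised by the tick clears CPU 1's pending kick. */
	if (tick_softirq_runs)
		m.nohz_flags &= ~NOHZ_KICK_MASK;

	/* CPU 0: nohz_newidle_balance() -> kick_ilb() picks itself. */
	bool cpu0_kicks = kick_claim(&m);

	/* Both CPUs check the csd before either one locks it. */
	bool free_seen_by_cpu0 = (m.csd_flags != CSD_LOCK);
	bool free_seen_by_cpu1 = (m.csd_flags != CSD_LOCK);

	/* CPU 1: lock the csd, queue it and send the IPI. */
	if (free_seen_by_cpu1)
		m.csd_flags = CSD_LOCK;

	/* CPU 0: lock, clear the flags and execute in place. */
	if (cpu0_kicks && free_seen_by_cpu0) {
		m.csd_flags = CSD_LOCK;
		m.csd_flags = 0;
	}

	/* CPU 0: the IPI from CPU 1 flushes the queue and unlocks again. */
	if (free_seen_by_cpu1)
		csd_unlock(&m);

	return m.warnings;
}
```

In this model, replay_race(true) trips the check once, because the
softirq wipes CPU 1's kick bits and lets CPU 0 claim the same csd a
second time. replay_race(false) stays quiet: CPU 0's own kick_ilb()
still sees the bits set and backs off.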

Thanks.
