From: Michael Ellerman <mpe@ellerman.id.au>
To: paulmck@kernel.org
Cc: rcu <rcu@vger.kernel.org>, Zhouyi Zhou <zhouzhouyi@gmail.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Subject: Re: rcu_sched self-detected stall on CPU
Date: Tue, 12 Apr 2022 16:53:06 +1000	[thread overview]
Message-ID: <87mtgq6be5.fsf@mpe.ellerman.id.au> (raw)
In-Reply-To: <20220411030553.GW4285@paulmck-ThinkPad-P17-Gen-1>

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Sun, Apr 10, 2022 at 09:33:43PM +1000, Michael Ellerman wrote:
>> Zhouyi Zhou <zhouzhouyi@gmail.com> writes:
>> > On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>> >> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
>> >> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> ...
>> >> > > I haven't seen it in my testing. But using Miguel's config I can
>> >> > > reproduce it seemingly on every boot.
>> >> > >
>> >> > > For me it bisects to:
>> >> > >
>> >> > >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
>> >> > >
>> >> > > Which seems plausible.
>> >> > I also bisected to 35de589cb879 ("powerpc/time: improve decrementer
>> >> > clockevent processing")
>> ...
>> >>
>> >> > > Reverting that on mainline makes the bug go away.
>> 
>> >> > I also reverted that on mainline, and am currently running a stress
>> >> > test (by repeatedly invoking qemu and checking the console.log) on a
>> >> > PPC VM at Oregon State University.
>> 
>> > After 306 rounds of stress testing on mainline without triggering the
>> > bug (lasting 4 hours and 27 minutes), I think the bug is indeed caused
>> > by 35de589cb879 ("powerpc/time: improve decrementer clockevent
>> > processing"), so I have stopped the test for now.
>> 
>> Thanks for testing, that's pretty conclusive.
>> 
>> I'm not inclined to actually revert it yet.
>> 
>> We need to understand if there's actually a bug in the patch, or if it's
>> just exposing some existing bug/bad behavior we have. The fact that it
>> only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.
>> 
>> Do we have some code that inadvertently relies on something enabled by
>> HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y ?
>
> For whatever it is worth, moderate rcutorture runs to completion without
> errors with CONFIG_HIGH_RES_TIMERS=n on 64-bit x86.

Thanks for testing that; I don't have any big x86 machines to test on :)
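The stress test described above boils down to a boot-and-grep loop; a
minimal sketch follows. The qemu invocation, kernel image path, and
timeout are illustrative assumptions, not the exact command line used
in the reported tests:

```shell
# Sketch of the boot-and-check stress loop described above.  The qemu
# command line, image path, and timeout are illustrative assumptions,
# not the setup actually used in the reported tests.

check_console() {
    # Succeed if the given log contains an RCU stall report.
    grep -q "self-detected stall on CPU" "$1"
}

run_rounds() {
    rounds=$1
    i=1
    while [ "$i" -le "$rounds" ]; do
        # Boot the guest, capture its console, and tolerate failures.
        timeout 300 qemu-system-ppc64 -nographic -M pseries \
            -kernel vmlinux -append "console=hvc0" \
            > console.log 2>&1 || true
        if check_console console.log; then
            echo "stall detected on round $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no stall in $rounds rounds"
}

# Example: run_rounds 306    (the round count mentioned above)
```

The per-round timeout matters: a stalled guest never exits on its own,
so without it one bad boot would wedge the whole loop.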

> Also for whatever it is worth, I don't know of anything other than
> microcontrollers or the larger IoT devices that would want their kernels
> built with CONFIG_HIGH_RES_TIMERS=n.  Which might be a failure of
> imagination on my part, but so it goes.

Yeah I agree; like I said before, I wasn't even aware you could turn it
off. So I think we'll definitely add a select HIGH_RES_TIMERS in future,
but first I need to work out why we are seeing stalls with it disabled.
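A select of that sort would be a one-line Kconfig change. Sketched below
against the top-level powerpc symbol; the surrounding lines are
illustrative placeholders, not a faithful copy of arch/powerpc/Kconfig:

```kconfig
config PPC
	bool
	default y
	# ... existing selects elided ...
	select HIGH_RES_TIMERS
```

Using select rather than depends on would force the option on for all
powerpc configs, closing off the HIGH_RES_TIMERS=n combination entirely.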

cheers



Thread overview: 56+ messages
2022-04-05 21:41 rcu_sched self-detected stall on CPU Miguel Ojeda
2022-04-06  9:31 ` Zhouyi Zhou
2022-04-06 17:00   ` Paul E. McKenney
2022-04-06 18:25     ` Zhouyi Zhou
2022-04-06 19:50       ` Paul E. McKenney
2022-04-07  2:26         ` Zhouyi Zhou
2022-04-07 10:07           ` Miguel Ojeda
2022-04-07 15:15             ` Paul E. McKenney
2022-04-07 17:05               ` Miguel Ojeda
2022-04-07 17:55                 ` Paul E. McKenney
2022-04-07 23:14                   ` Zhouyi Zhou
2022-04-08  1:43                     ` Paul E. McKenney
2022-04-08  7:23     ` Michael Ellerman
2022-04-08 10:02       ` Zhouyi Zhou
2022-04-08 14:07         ` Paul E. McKenney
2022-04-08 14:25           ` Zhouyi Zhou
2022-04-10 11:33             ` Michael Ellerman
2022-04-11  3:05               ` Paul E. McKenney
2022-04-12  6:53                 ` Michael Ellerman [this message]
2022-04-12 13:36                   ` Paul E. McKenney
2022-04-08 13:52       ` Miguel Ojeda
2022-04-08 14:06       ` Paul E. McKenney
2022-04-08 14:42       ` Michael Ellerman
2022-04-08 15:52         ` Paul E. McKenney
2022-04-08 17:02         ` Miguel Ojeda
2022-04-13  5:11         ` Nicholas Piggin
2022-04-13  6:10           ` Low-res tick handler device not going to ONESHOT_STOPPED when tick is stopped (was: rcu_sched self-detected stall on CPU) Nicholas Piggin
2022-04-14 17:15             ` Paul E. McKenney
2022-04-22 15:53           ` Thomas Gleixner
2022-04-23  2:29             ` Nicholas Piggin
