From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>,
David Miller <davem@davemloft.net>,
npiggin@gmail.com, mpe@ellerman.id.au, dzickus@redhat.com,
sfr@canb.auug.org.au, linuxarm@huawei.com, tglx@linutronix.de,
sparclinux@vger.kernel.org, akpm@linux-foundation.org,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org, john.stultz@linaro.org,
anna-maria@linutronix.de
Subject: Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?
Date: Wed, 6 Sep 2017 05:28:44 -0700
Message-ID: <20170906122844.GA30180@linux.vnet.ibm.com>
In-Reply-To: <20170822152637.GA11320@linux.vnet.ibm.com>
On Tue, Aug 22, 2017 at 08:26:37AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 22, 2017 at 02:21:32PM +0530, Abdul Haleem wrote:
> > On Tue, 2017-08-22 at 08:49 +0100, Jonathan Cameron wrote:
[ . . . ]
> > No more RCU stalls on PowerPC, system is clean when idle or with some
> > test runs.
> >
> > Thank you all for your time and efforts in fixing this.
> >
> > Reported-and-Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
>
> I am still seeing failures, but then again I am running rcutorture with
> lots of CPU hotplug activity. So I am probably seeing some other bug,
> though it still looks a lot like a lost timer.
So one problem appears to be a timing-related deadlock between RCU and
timers. The way that this can happen is that the outgoing CPU goes
offline (as in cpuhp_report_idle_dead() invoked from do_idle()) with
one of RCU's grace-period kthread's timers queued. Now, if someone
waits for a grace period, either directly or indirectly, in a way that
blocks the hotplug notifiers, execution will never reach timers_dead_cpu(),
which means that the grace-period kthread will never wake, which will
mean that the grace period will never complete. Classic deadlock.
I currently have an extremely ugly workaround for this deadlock, which
is to periodically and (usually) redundantly wake up all the RCU
grace-period kthreads from the scheduling-interrupt handler. This is
of course completely inappropriate for mainline, but it does reliably
prevent the "kthread starved for %ld jiffies!" type of RCU CPU stall
warning that I would otherwise see.
To get this into mainline, one approach would be to have these timers
switch to add_timer_on() targeting a surviving CPU once the offlining
process starts. Alternatively, I suppose that RCU could do the
redundant-wakeup kludge, but with checks to prevent it from happening
unless (1) there is a CPU in the process of going offline, (2) there is
an RCU grace period in progress, and (3) the RCU grace-period kthread
has been blocked for (say) three times longer than it should have.
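To make the three checks concrete, here is a minimal userspace sketch in C.
It is not kernel code: struct gp_kthread_state, its fields, STALL_FACTOR,
and should_force_wakeup() are all hypothetical names, and a real version
would have to pull this state out of RCU and the hotplug code.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the three checks would consult. */
struct gp_kthread_state {
	bool cpu_going_offline;      /* (1) some CPU is partway through going offline */
	bool gp_in_progress;         /* (2) an RCU grace period is in progress */
	unsigned long expected_wait; /* jiffies the kthread meant to sleep */
	unsigned long blocked_for;   /* jiffies the kthread has actually been blocked */
};

/* The "(say) three times longer" factor, an arbitrary illustrative value. */
#define STALL_FACTOR 3UL

/* Return true only when all three conditions hold. */
static bool should_force_wakeup(const struct gp_kthread_state *s)
{
	if (!s->cpu_going_offline)
		return false;
	if (!s->gp_in_progress)
		return false;
	/* (3) blocked for more than STALL_FACTOR times the intended sleep. */
	return s->blocked_for > STALL_FACTOR * s->expected_wait;
}
```

With an intended sleep of 10 jiffies, this sketch would leave the kthread
alone at 30 jiffies blocked and force a wakeup at 31, and only if the
other two conditions also hold.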
Unfortunately, this is not sufficient to make rcutorture run reliably,
though it does help, which is of course to say that it makes debugging
slower. ;-)
What happens now is that random rcutorture kthreads will hang waiting for
timeouts to complete. This confused me for a while because I expected
that the timeouts would be delayed during offline processing, but that
my crude deadlock-resolution approach would eventually get things going.
My current suspicion is that the problem is due to a potential delay
between the time an outgoing CPU hits cpuhp_report_idle_dead() and the
time its timers get migrated by timers_dead_cpu(). This means that the
CPU adopting the timers might be a few ticks ahead of where the outgoing
CPU last processed timers. My current guess is that any timers queued in
the intervening indexes are going to wait one good long time. And I don't
see any code in timers_dead_cpu() that would account for this possibility,
though I of course cannot claim to fully understand this code.
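To illustrate the size of the delay I have in mind, here is a toy model
in C. It assumes a single-level wheel of 64 buckets whose contents are
handed over by index without being recomputed, which may well not be what
timers_dead_cpu() does, and the real wheel is hierarchical. WHEEL_SIZE,
wheel_index(), and ticks_until_visited() are made-up names.

```c
/* Toy single-level timer wheel; the size is an arbitrary assumption. */
#define WHEEL_SIZE 64UL

/* Bucket in which a timer expiring at tick `expires` is queued. */
static unsigned long wheel_index(unsigned long expires)
{
	return expires % WHEEL_SIZE;
}

/*
 * Number of ticks until a scanner that resumes at tick `resume_clk`,
 * visiting one bucket per tick, reaches the bucket holding `expires`.
 */
static unsigned long ticks_until_visited(unsigned long expires,
					 unsigned long resume_clk)
{
	return (wheel_index(expires) + WHEEL_SIZE - wheel_index(resume_clk))
		% WHEEL_SIZE;
}
```

In this model, if the outgoing CPU last processed tick 100 and the
adopting CPU resumes scanning at tick 105, a timer due at tick 102 sits
in a bucket that has already been passed, so ticks_until_visited(102, 105)
is 61 rather than something close to zero. Had scanning resumed at tick
101 instead, the same timer would be reached after 1 tick.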
Is this plausible, or am I confused? (Either way, -something- besides
just me is rather thoroughly confused!)
If this is plausible, my guess is that timers_dead_cpu() needs to check
for mismatched indexes (in timer->flags?) and force any intervening
timers to expire if so.
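A rough sketch of what such a check might look like, again as a userspace
toy in C rather than a patch. struct toy_timer, tick_after(), and
expire_intervening() are hypothetical, the flat array stands in for
whatever the real per-CPU timer base holds, and old_clk/new_clk stand in
for the outgoing and adopting CPUs' positions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued timer. */
struct toy_timer {
	unsigned long expires; /* tick at which the timer is due */
	bool fired;            /* set once the timer has been expired */
};

/* Wrap-safe "a is after b", in the style of jiffies comparisons. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * If the adopting CPU's clock (new_clk) is ahead of where the outgoing
 * CPU stopped (old_clk), force-expire every pending timer that was due
 * in (old_clk, new_clk]. Returns the number of timers forced.
 */
static size_t expire_intervening(struct toy_timer *timers, size_t n,
				 unsigned long old_clk, unsigned long new_clk)
{
	size_t forced = 0;

	if (!tick_after(new_clk, old_clk))
		return 0; /* Indexes match (or adopter is behind): nothing to do. */

	for (size_t i = 0; i < n; i++) {
		struct toy_timer *t = &timers[i];

		if (!t->fired && tick_after(t->expires, old_clk) &&
		    !tick_after(t->expires, new_clk)) {
			t->fired = true;
			forced++;
		}
	}
	return forced;
}
```

With old_clk at 100 and new_clk at 105, timers due at ticks 101, 103,
and 105 would be forced, while one due at tick 110 would be left queued
for normal processing.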
Thoughts?
Thanx, Paul