From: paulmck@linux.vnet.ibm.com (Paul E. McKenney)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH v2] ARM: Don't use complete() during __cpu_die
Date: Thu, 5 Feb 2015 09:54:31 -0800 [thread overview]
Message-ID: <20150205175431.GH5370@linux.vnet.ibm.com> (raw)
In-Reply-To: <20150205173440.GR8656@n2100.arm.linux.org.uk>
On Thu, Feb 05, 2015 at 05:34:40PM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 09:02:28AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> > > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > > > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > >
> > > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > > actually raises the IPI takes a raw spinlock, so it's not going to be
> > > this simple - there's a small theoretical window where we have taken
> > > this lock, written the register to send the IPI, and then dropped the
> > > lock - the update to the lock to release it could get lost if the
> > > CPU power is quickly cut at that point.
> > >
> > > Also, we _do_ need the second cache flush in place to ensure that the
> > > unlock is seen to other CPUs.
> > >
> > > We could work around that by taking and releasing the lock in the IPI
> > > processing function... but this is starting to look less attractive
> > > as the lock is private to irq-gic.c.
> > >
> > > Well, we're very close to 3.19, we're too close to be trying to sort
> > > this out, so I'm hoping that your changes which cause this RCU error
> > > are *not* going in during this merge window, because we seem to have
> > > something of a problem right now which needs more time to resolve.
> >
> > Most likely into the 3.20 merge window. But please keep in mind that
> > RCU is just the messenger here -- the current code will break if any
> > CPU for whatever reason takes more than a jiffy to get from its
> > _stop_machine() handler to the end of its last RCU read-side critical
> > section on its way out. A jiffy may sound like a lot, but it is not
> > hard to exceed this limit, especially in virtualized environments.
>
> What I'm saying is that we can't likely get a good fix prepared before
> the 3.20 merge window opens.
And I cannot count. Or cannot type. Or something.
I meant do say "Most likely into the 3.21 merge window." I agree that
this stuff is not ready for next week's merge window. For one thing,
there are similar issues with a number of other architectures as well.
> I don't term the set_bit/clear_bit solution a "good fix" because it is
> far too complex - I've not done a thorough review on it, but the idea
> of setting and clearing a couple of bits in unison, making sure that
> their state is set appropriately through multiple different code paths
> does not strike me as a provably correct replacement for this completion.
> The reason for that complexity is because there is no pre-notification
> to arch code that a CPU might be going down, so there's no way for the
> "CPU is dead" flag to be properly reset (which is why there's all the
> manipulation in lots of possible failure paths.)
>
> The idea that we could reset it in the CPU up code doesn't fly - that
> would only work if we had one secondary CPU (which would guarantee a
> strict up/down/up ordering on it) but as soon as you have more than one
> CPU, that doesn't hold true.
>
> We could hook into the CPU hotplug notifiers - which would be quite a
> lot of additional code to achieve the reset early enough in the hot
> unplug path, though it would probably be the most reliable solution to
> the wait-for-bit solution.
>
> However, any of those solutions needs writing and thorough testing,
> which, if Linus opens the merge window on Sunday, isn't going to
> happen before hand (and we know Linus doesn't like extra development
> appearing which wasn't in -next prior to the merge window - he's taken
> snapshots of -next to check during the merge window in the past - so
> it's not something I'm going to be adding during that time, not even
> as a "fix" because we know about the problem right now, before the
> merge window. To me, to treat this as a "fix" would be wilfully
> deceitful.)
>
> I don't think the existing code is a big problem at the moment - it's
> been like this for about 10 years, and no one has ever reported an
> issue with it, although there have been changes over that time:
>
> aa033810461ee56abbef6cef10aabd6b97f5caee
> ARM: smp: Drop RCU_NONIDLE usage in cpu_die()
>
> This removed the RCU_NONIDLE() from the completion() call.
>
> ff081e05bfba3461119cd280201d163b6858eda2
> ARM: 7457/1: smp: Fix suspicious RCU originating from cpu_die()
>
> This added the RCU_NONIDLE() to the completion() call.
>
> 3c030beabf937b1d3b4ecaedfd1fb2f1e2aa0c70
> ARM: CPU hotplug: move cpu_killed completion to core code
>
> This moved the completion code from Realview (and other ARM
> platforms) into core ARM code.
>
> and 97a63ecff4bd06da5d8feb8c0394a4d020f2d34d
> [ARM SMP] Add CPU hotplug support for Realview MPcore
>
> The earliest current introduction of CPU hotplug in 2005.
Agreed. It does need to be fixed, but it makes sense to take a few
weeks and get a fix that is more likely to be correct.
Thanx, Paul
WARNING: multiple messages have this Message-ID (diff)
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>,
Mark Rutland <mark.rutland@arm.com>,
Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>,
Marek Szyprowski <m.szyprowski@samsung.com>,
Stephen Boyd <sboyd@codeaurora.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will.deacon@arm.com>
Subject: Re: [PATCH v2] ARM: Don't use complete() during __cpu_die
Date: Thu, 5 Feb 2015 09:54:31 -0800 [thread overview]
Message-ID: <20150205175431.GH5370@linux.vnet.ibm.com> (raw)
In-Reply-To: <20150205173440.GR8656@n2100.arm.linux.org.uk>
On Thu, Feb 05, 2015 at 05:34:40PM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 09:02:28AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> > > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > > > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > >
> > > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > > actually raises the IPI takes a raw spinlock, so it's not going to be
> > > this simple - there's a small theoretical window where we have taken
> > > this lock, written the register to send the IPI, and then dropped the
> > > lock - the update to the lock to release it could get lost if the
> > > CPU power is quickly cut at that point.
> > >
> > > Also, we _do_ need the second cache flush in place to ensure that the
> > > unlock is seen to other CPUs.
> > >
> > > We could work around that by taking and releasing the lock in the IPI
> > > processing function... but this is starting to look less attractive
> > > as the lock is private to irq-gic.c.
> > >
> > > Well, we're very close to 3.19, we're too close to be trying to sort
> > > this out, so I'm hoping that your changes which cause this RCU error
> > > are *not* going in during this merge window, because we seem to have
> > > something of a problem right now which needs more time to resolve.
> >
> > Most likely into the 3.20 merge window. But please keep in mind that
> > RCU is just the messenger here -- the current code will break if any
> > CPU for whatever reason takes more than a jiffy to get from its
> > _stop_machine() handler to the end of its last RCU read-side critical
> > section on its way out. A jiffy may sound like a lot, but it is not
> > hard to exceed this limit, especially in virtualized environments.
>
> What I'm saying is that we can't likely get a good fix prepared before
> the 3.20 merge window opens.
And I cannot count. Or cannot type. Or something.
I meant do say "Most likely into the 3.21 merge window." I agree that
this stuff is not ready for next week's merge window. For one thing,
there are similar issues with a number of other architectures as well.
> I don't term the set_bit/clear_bit solution a "good fix" because it is
> far too complex - I've not done a thorough review on it, but the idea
> of setting and clearing a couple of bits in unison, making sure that
> their state is set appropriately through multiple different code paths
> does not strike me as a provably correct replacement for this completion.
> The reason for that complexity is because there is no pre-notification
> to arch code that a CPU might be going down, so there's no way for the
> "CPU is dead" flag to be properly reset (which is why there's all the
> manipulation in lots of possible failure paths.)
>
> The idea that we could reset it in the CPU up code doesn't fly - that
> would only work if we had one secondary CPU (which would guarantee a
> strict up/down/up ordering on it) but as soon as you have more than one
> CPU, that doesn't hold true.
>
> We could hook into the CPU hotplug notifiers - which would be quite a
> lot of additional code to achieve the reset early enough in the hot
> unplug path, though it would probably be the most reliable solution to
> the wait-for-bit solution.
>
> However, any of those solutions needs writing and thorough testing,
> which, if Linus opens the merge window on Sunday, isn't going to
> happen before hand (and we know Linus doesn't like extra development
> appearing which wasn't in -next prior to the merge window - he's taken
> snapshots of -next to check during the merge window in the past - so
> it's not something I'm going to be adding during that time, not even
> as a "fix" because we know about the problem right now, before the
> merge window. To me, to treat this as a "fix" would be wilfully
> deceitful.)
>
> I don't think the existing code is a big problem at the moment - it's
> been like this for about 10 years, and no one has ever reported an
> issue with it, although there have been changes over that time:
>
> aa033810461ee56abbef6cef10aabd6b97f5caee
> ARM: smp: Drop RCU_NONIDLE usage in cpu_die()
>
> This removed the RCU_NONIDLE() from the completion() call.
>
> ff081e05bfba3461119cd280201d163b6858eda2
> ARM: 7457/1: smp: Fix suspicious RCU originating from cpu_die()
>
> This added the RCU_NONIDLE() to the completion() call.
>
> 3c030beabf937b1d3b4ecaedfd1fb2f1e2aa0c70
> ARM: CPU hotplug: move cpu_killed completion to core code
>
> This moved the completion code from Realview (and other ARM
> platforms) into core ARM code.
>
> and 97a63ecff4bd06da5d8feb8c0394a4d020f2d34d
> [ARM SMP] Add CPU hotplug support for Realview MPcore
>
> The earliest current introduction of CPU hotplug in 2005.
Agreed. It does need to be fixed, but it makes sense to take a few
weeks and get a fix that is more likely to be correct.
Thanx, Paul
next prev parent reply other threads:[~2015-02-05 17:54 UTC|newest]
Thread overview: 92+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-02-05 10:14 [PATCH v2] ARM: Don't use complete() during __cpu_die Krzysztof Kozlowski
2015-02-05 10:14 ` Krzysztof Kozlowski
2015-02-05 10:50 ` Russell King - ARM Linux
2015-02-05 10:50 ` Russell King - ARM Linux
2015-02-05 11:00 ` Krzysztof Kozlowski
2015-02-05 11:00 ` Krzysztof Kozlowski
2015-02-05 11:08 ` Russell King - ARM Linux
2015-02-05 11:08 ` Russell King - ARM Linux
2015-02-05 11:28 ` Mark Rutland
2015-02-05 11:28 ` Mark Rutland
2015-02-05 11:30 ` Russell King - ARM Linux
2015-02-05 11:30 ` Russell King - ARM Linux
2015-02-05 14:29 ` Paul E. McKenney
2015-02-05 14:29 ` Paul E. McKenney
2015-02-05 16:11 ` Russell King - ARM Linux
2015-02-05 16:11 ` Russell King - ARM Linux
2015-02-05 17:02 ` Paul E. McKenney
2015-02-05 17:02 ` Paul E. McKenney
2015-02-05 17:34 ` Russell King - ARM Linux
2015-02-05 17:34 ` Russell King - ARM Linux
2015-02-05 17:54 ` Paul E. McKenney [this message]
2015-02-05 17:54 ` Paul E. McKenney
2015-02-10 1:24 ` Stephen Boyd
2015-02-10 1:24 ` Stephen Boyd
2015-02-10 1:37 ` Paul E. McKenney
2015-02-10 1:37 ` Paul E. McKenney
2015-02-10 2:05 ` Stephen Boyd
2015-02-10 2:05 ` Stephen Boyd
2015-02-10 3:05 ` Paul E. McKenney
2015-02-10 3:05 ` Paul E. McKenney
2015-02-10 15:14 ` Mark Rutland
2015-02-10 15:14 ` Mark Rutland
2015-02-10 20:48 ` Stephen Boyd
2015-02-10 20:48 ` Stephen Boyd
2015-02-10 21:04 ` Stephen Boyd
2015-02-10 21:04 ` Stephen Boyd
2015-02-10 21:15 ` Russell King - ARM Linux
2015-02-10 21:15 ` Russell King - ARM Linux
2015-02-10 21:49 ` Stephen Boyd
2015-02-10 21:49 ` Stephen Boyd
2015-02-10 22:05 ` Stephen Boyd
2015-02-10 22:05 ` Stephen Boyd
2015-02-13 15:52 ` Mark Rutland
2015-02-13 15:52 ` Mark Rutland
2015-02-13 16:27 ` Russell King - ARM Linux
2015-02-13 16:27 ` Russell King - ARM Linux
2015-02-13 17:21 ` Mark Rutland
2015-02-13 17:21 ` Mark Rutland
2015-02-13 17:30 ` Russell King - ARM Linux
2015-02-13 17:30 ` Russell King - ARM Linux
2015-02-13 16:28 ` Stephen Boyd
2015-02-13 16:28 ` Stephen Boyd
2015-02-13 15:38 ` Mark Rutland
2015-02-13 15:38 ` Mark Rutland
2015-02-10 20:58 ` Russell King - ARM Linux
2015-02-10 20:58 ` Russell King - ARM Linux
2015-02-10 15:41 ` Russell King - ARM Linux
2015-02-10 15:41 ` Russell King - ARM Linux
2015-02-10 18:33 ` Stephen Boyd
2015-02-10 18:33 ` Stephen Boyd
2015-02-25 12:56 ` Russell King - ARM Linux
2015-02-25 12:56 ` Russell King - ARM Linux
2015-02-25 16:47 ` Nicolas Pitre
2015-02-25 16:47 ` Nicolas Pitre
2015-02-25 17:00 ` Russell King - ARM Linux
2015-02-25 17:00 ` Russell King - ARM Linux
2015-02-25 18:13 ` Nicolas Pitre
2015-02-25 18:13 ` Nicolas Pitre
2015-02-25 20:16 ` Nicolas Pitre
2015-02-25 20:16 ` Nicolas Pitre
2015-02-26 1:05 ` Paul E. McKenney
2015-02-26 1:05 ` Paul E. McKenney
2015-03-22 23:30 ` Paul E. McKenney
2015-03-22 23:30 ` Paul E. McKenney
2015-03-23 12:55 ` Russell King - ARM Linux
2015-03-23 12:55 ` Russell King - ARM Linux
2015-03-23 13:21 ` Paul E. McKenney
2015-03-23 13:21 ` Paul E. McKenney
2015-03-23 14:00 ` Russell King - ARM Linux
2015-03-23 14:00 ` Russell King - ARM Linux
2015-03-23 15:37 ` Paul E. McKenney
2015-03-23 15:37 ` Paul E. McKenney
2015-03-23 16:56 ` Paul E. McKenney
2015-03-23 16:56 ` Paul E. McKenney
2015-02-26 19:14 ` Daniel Thompson
2015-02-26 19:14 ` Daniel Thompson
2015-02-26 19:47 ` Nicolas Pitre
2015-02-26 19:47 ` Nicolas Pitre
2015-02-05 10:53 ` Mark Rutland
2015-02-05 10:53 ` Mark Rutland
2015-02-05 10:59 ` Krzysztof Kozlowski
2015-02-05 10:59 ` Krzysztof Kozlowski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150205175431.GH5370@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.