From: andreiw@motorola.com (Andrei Warkentin)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC] Make SMP secondary CPU up more resilient to failure.
Date: Thu, 16 Dec 2010 17:09:48 -0600
Message-ID: <AANLkTikJFdxbXjvWUMmEXoAG4xR8G9jq30ox90Bo4SWe@mail.gmail.com>
In-Reply-To: <20101216113407.GO9937@n2100.arm.linux.org.uk>
On Thu, Dec 16, 2010 at 5:34 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
>
> On Wed, Dec 15, 2010 at 05:45:13PM -0600, Andrei Warkentin wrote:
> > This is my first time on linux-arm-kernel, and while I've read the
> > FAQ, hopefully I don't screw up too badly :).
> >
> > Anyway, we're on a dual-core ARMv7 running 2.6.36, and during
> > stability stress testing saw the following:
> > 1) After a number of hotplug iterations, CPU1 fails to set its online bit
> > quickly enough and __cpu_up() times out.
> > 2) CPU1 eventually completes its startup and sets the bit, however,
> > since _cpu_up() failed, CPU1's active bit is never set.
>
> Why is your CPU taking soo long to come up?  We wait one second in the
> generic code, which is the time taken from the platform code being happy
> that it has successfully started the CPU.  Normally, platforms wait an
> additional second to detect the CPU entering the kernel.
It seems twd_calibrate_rate() is the culprit (although in our case there
is no point in calibrating, since both CPUs run off the same clock).
We've only seen this when the device was under stress-test load.
>
> > 2) Additionally I ensure that if the CPU comes up later than it was
> > supposed to (shouldn't, but...), then it will not start initializing
> > behind cpu_up's back (which is not really undoable). This solves the
> > problem with both cpu_up+secondary_start_kernel races and with
> > platform_cpu_kill+secondary_start_kernel races.
>
> Why would you have platform_cpu_kill() running at the same time - firstly,
> hotplug events are serialized, and secondly the platform_cpu_kill() path
> should wait up to five seconds for the CPU to go offline.  If it doesn't
> go offline within five seconds it's dead (and maybe we should mark it
> not present.)
>
That's the platform_cpu_kill() I invoke when I time out waiting for the
online bit. Sorry, I wasn't clear. I was just trying to show that I
didn't introduce any races :).
The point is that the SMP bring-up logic is currently sensitive to system
load. Since boot_secondary() is supposed to return failure if it cannot
bring up the secondary, maybe there is no point in doing a timed wait for
the online bit at all: once boot_secondary() succeeds, the CPU is
guaranteed to get there eventually.
As it stands, you can end up in a situation where the wait times out, yet
the CPU is up, running and registered, and this causes bad behavior
later when you try to take it down.
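To make the "late secondary must not initialize behind cpu_up's back"
idea concrete, here is a minimal user-space sketch. It is not the patch
itself and uses no kernel APIs: the state names, struct bringup and all
helper functions are hypothetical, and C11 atomics stand in for whatever
synchronization the real code would use. The boot CPU and the secondary
both try to move a shared state out of PENDING with a compare-and-swap,
and only one of them can win.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bring-up states; these names are not from the kernel. */
enum bringup_state {
	BRINGUP_PENDING,	/* secondary released, outcome undecided */
	BRINGUP_ONLINE,		/* secondary won: it may finish initializing */
	BRINGUP_ABORTED,	/* boot CPU won: the timed wait expired */
};

struct bringup {
	atomic_int state;
};

static void bringup_init(struct bringup *b)
{
	atomic_init(&b->state, BRINGUP_PENDING);
}

/* Move PENDING -> new_state; fails if the other side already decided. */
static bool bringup_try(struct bringup *b, int new_state)
{
	int expected = BRINGUP_PENDING;

	return atomic_compare_exchange_strong(&b->state, &expected, new_state);
}

/* Called by the secondary before it registers itself as online. */
static bool secondary_claim_online(struct bringup *b)
{
	return bringup_try(b, BRINGUP_ONLINE);
}

/* Called by the boot CPU when its wait for the online bit times out. */
static bool boot_cpu_give_up(struct bringup *b)
{
	return bringup_try(b, BRINGUP_ABORTED);
}

/* Scenario from this thread: timeout first, secondary shows up late. */
static bool late_secondary_is_rejected(void)
{
	struct bringup b;
	bool aborted, went_online;

	bringup_init(&b);
	aborted = boot_cpu_give_up(&b);		/* wait expired */
	went_online = secondary_claim_online(&b); /* arrives too late */
	assert(atomic_load(&b.state) == BRINGUP_ABORTED);

	return aborted && !went_online;
}
```

If boot_cpu_give_up() returns false, the secondary made it online just
in time and the timeout can be treated as success; if it returns true,
a late secondary sees secondary_claim_online() fail and has to back off
instead of registering itself.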