linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: linux@arm.linux.org.uk (Russell King - ARM Linux)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC] Make SMP secondary CPU up more resilient to failure.
Date: Fri, 24 Dec 2010 17:38:24 +0000	[thread overview]
Message-ID: <20101224173824.GI20587@n2100.arm.linux.org.uk> (raw)
In-Reply-To: <AANLkTikHGX6OQWfGoNiQj-UX81=8_HwP3W+DXwZASatN@mail.gmail.com>

On Tue, Dec 21, 2010 at 03:53:46PM -0600, Andrei Warkentin wrote:
> Russel,

Grr.

> Thank you! The culprit looks as it seems to be the writel without
> __iowmb, as you pointed out. At the very least I've yet to hit the
> problem again this way.

Good news.

> I still want to add code inside the platform SMP support as a safety
> net. Maybe I am being too pedantic, but  In the near future (with
> those 40 patches), secondaries are going to boot directly via
> secondary_startup as well, so the first time platform-specific code
> gets invoked is platform_secondary_init. I want to ensure that when
> boot_secondary returns, the CPU is either guaranteed to be running or
> for-sure dead. The problem is that platform_secondary_init is already
> too late - if the CPU gets killed due to timeout anytime between the
> entry to secondary_start_kernel and  platform_secondary_init, it could
> have already increased the refcount on init_mm or disabled preemption.

Here's a problem for you to ponder on over Christmas.

Let's say the secondary CPU is running slowly due to system load.  It
makes it through to secondary_start_kernel(), and calls through to
your preinit function.  It checks that it should be booting, and
passes that test.

At this point, the requesting CPU times out, but gets preempted to
other tasks (which could very well happen on a heavily loaded system
with preempt enabled).

The booting CPU signals that via writing the reset vector, and continues
on to increment the mm_count and switch its page tables.

The requesting CPU finally switches back to the thread requesting
that the CPU be brought up.  It decides as it timed out to kill the
booting CPU, and does so.

What this means that we now have exactly the same scenario you've
referred to above, and adding the pre-init function hasn't really
solved anything.

I _really_ don't want platforms to start playing these games, because
we'll end up with lots of different attempts to solve the problem,
each of them probably racy like the above.  The safest solution is to
use a longer timeout - maybe an excessively long timeout - to guarantee
that we never miss a starting CPU.

If we do end up needing something like this in the kernel, then it needs
to be done carefully and in generic code where it can be done properly
once.  (If any bugs are found in it, we've also only one version to fix,
not five or six different versions.)  However, I'd argue that it's better
to wait longer for the CPU to come up if there's a possibility that it
will rather than trying to sort out the mess from a partially booted
secondary CPU.

  reply	other threads:[~2010-12-24 17:38 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-15 23:45 [RFC] Make SMP secondary CPU up more resilient to failure Andrei Warkentin
2010-12-16 11:34 ` Russell King - ARM Linux
2010-12-16 23:09   ` Andrei Warkentin
2010-12-16 23:28     ` Russell King - ARM Linux
2010-12-17 20:52       ` Andrei Warkentin
2010-12-17 23:14         ` Russell King - ARM Linux
2010-12-17 23:45           ` Andrei Warkentin
2010-12-18  0:08             ` Russell King - ARM Linux
2010-12-18  0:36               ` Russell King - ARM Linux
2010-12-18  7:17               ` Andrei Warkentin
2010-12-18 12:01                 ` Russell King - ARM Linux
2010-12-18 12:10                   ` Andrei Warkentin
2010-12-18 20:04                     ` Russell King - ARM Linux
2010-12-21 21:53                       ` Andrei Warkentin
2010-12-24 17:38                         ` Russell King - ARM Linux [this message]
2011-01-13 10:19                           ` Andrei Warkentin
2011-01-13 11:14                             ` Russell King - ARM Linux
2011-01-13 22:03                               ` Andrei Warkentin
2010-12-17  0:11     ` murali at embeddedwireless.com
2010-12-18  9:58     ` Russell King - ARM Linux
2010-12-18 11:54       ` Andrei Warkentin
2010-12-18 12:19         ` Russell King - ARM Linux

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101224173824.GI20587@n2100.arm.linux.org.uk \
    --to=linux@arm.linux.org.uk \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).