All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stephen Warren <swarren@wwwdotorg.org>
To: Will Deacon <will.deacon@arm.com>, Joseph Lo <josephl@nvidia.com>,
	Peter De Schrijver <pdeschrijver@nvidia.com>
Cc: "linux-tegra@vger.kernel.org" <linux-tegra@vger.kernel.org>,
	"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
	Stephen Warren <swarren@nvidia.com>,
	Eric Biederman <ebiederm@xmission.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH] kexec: disable non-boot CPUs
Date: Thu, 20 Dec 2012 13:15:30 -0700	[thread overview]
Message-ID: <50D371E2.1030807@wwwdotorg.org> (raw)
In-Reply-To: <50D35214.4090008@wwwdotorg.org>

On 12/20/2012 10:59 AM, Stephen Warren wrote:
> On 12/20/2012 10:36 AM, Will Deacon wrote:
>> On Thu, Dec 20, 2012 at 05:21:56PM +0000, Stephen Warren wrote:
>>> On 12/20/2012 03:49 AM, Will Deacon wrote:
>>>> If you do manage to get this merged, please can you follow up with a patch
>>>> to remove the smp_kill_cpus bits from arch/arm/kernel/smp.c please? It only
>>>> exists as a hook to do exactly this and currently nobody is using it afaict.
>>>
>>> I originally implemented this in
>>> arch/arm/kernel/process.c:machine_shutdown(), which currently is:
>>>
>>> void machine_shutdown(void)
>>> {
>>> #ifdef CONFIG_SMP
>>> 	smp_send_stop();
>>> #endif
>>> }
>>>
>>> and I changed it to something like:
>>>
>>> void machine_shutdown(void)
>>> {
>>> #ifdef CONFIG_HOTPLUG_CPU
>>>  	disable_nonboot_cpus();
>>> #elifdef CONFIG_SMP
>>> 	smp_send_stop();
>>> #endif
>>> }
>>>
>>> ... but then figured that moving it up into the core kexec code would be
>>> better, so that everything always worked the same way.
>>
>> Hmmm, isn't this racy: requiring the secondaries to hit idle and notice
>> they're offline and call cpu_die before the primary has replace the kernel
>> image?
> 
> Isn't disable_nonboot_cpus() synchronous? If not, I imagine my original
> patch wasn't any better in this respect, except that the hotunplug
> happened earlier, and hence reduced the likelihood of actually seeing
> any such issues.
> 
>>> Anyway, the change above addresses Eric's concern about isolating the
>>> change to ARM. Does that seem like a reasonable thing for the ARM code
>>> to do?
>>
>> I think you're better off using what we currently have and hanging your code
>> off platform_cpu_kill.
> 
> OK, I'll look into that. Joseph Lo just posted patches to implement
> cpu_kill() on Tegra, which was needed to fix some issues in our hotplug
> code anyway. Perhaps that will remove the need for any other changes...

Will,

I just remembered one other advantage of disable_nonboot_cpus(); it
always makes the kexec happen on the boot CPU. Without this, I believe
it's random whether CPU0 or CPU1 performs the kexec. I suspect it's most
likely to work if we can always kexec on the boot CPU rather than a
random CPU?

Joseph, Peter,

As you know, I've been looking into kexec[2] on Tegra. Here's a summary
of a few tests I did:

linux-next plus nothing in particular, SMP enabled:
* Hangs/crashes during kexec

linux-next + SMP disabled: kexec works

linux-next + hotunplug CPUs other than CPU0 before kexec: kexec works

Will Deacon suggested this was due to a missing implementation of struct
smp_operations .cpu_kill on Tegra, which means that when CPUn are
present, they're simply spinning executing code, and kexec will
eventually overwrite that code, causing all manner of problems. So,
since I noticed that Joseph posted an implementation of .cpu_kill for
Tegra, I tried that out[1]. It certainly doesn't solve the problem, and
in fact actually causes the kernel performing the kexec (rather than the
kexec'd kernel) to hang:

> [   26.291903] Starting new kernel
> [   47.309571] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=2102 jiffies, g=211, c=210, q=37)
> [   47.323410] Task dump for CPU 1:
> [   47.329763] dd              R running      0   401      1 0x00000000
> [   47.339343] [<c0521690>] (__schedule+0x360/0x600) from [<c002d9b0>] (do_syslog+0x2b4/0x478)
> [   47.350952] [<c002d9b0>] (do_syslog+0x2b4/0x478) from [<ed86bb08>] (0xed86bb08)

Manually hotunplugging the CPUs first still works OK with those patches
applied though.

I /think/ kexec calls .cpu_kill() without having caused the CPU itself
to call .cpu_die() first? Did you allow for that possibility inside the
implementation you posted?

[1]
http://patchwork.ozlabs.org/patch/207601/
http://patchwork.ozlabs.org/patch/207602/

[2]
http://en.wikipedia.org/wiki/Kexec

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

WARNING: multiple messages have this Message-ID (diff)
From: Stephen Warren <swarren-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
To: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>,
	Joseph Lo <josephl-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>,
	Peter De Schrijver
	<pdeschrijver-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
Cc: "linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Stephen Warren <swarren-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>,
	"kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org"
	<kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org>,
	Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>,
	"linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org"
	<linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org>
Subject: Re: [PATCH] kexec: disable non-boot CPUs
Date: Thu, 20 Dec 2012 13:15:30 -0700	[thread overview]
Message-ID: <50D371E2.1030807@wwwdotorg.org> (raw)
In-Reply-To: <50D35214.4090008-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>

On 12/20/2012 10:59 AM, Stephen Warren wrote:
> On 12/20/2012 10:36 AM, Will Deacon wrote:
>> On Thu, Dec 20, 2012 at 05:21:56PM +0000, Stephen Warren wrote:
>>> On 12/20/2012 03:49 AM, Will Deacon wrote:
>>>> If you do manage to get this merged, please can you follow up with a patch
>>>> to remove the smp_kill_cpus bits from arch/arm/kernel/smp.c please? It only
>>>> exists as a hook to do exactly this and currently nobody is using it afaict.
>>>
>>> I originally implemented this in
>>> arch/arm/kernel/process.c:machine_shutdown(), which currently is:
>>>
>>> void machine_shutdown(void)
>>> {
>>> #ifdef CONFIG_SMP
>>> 	smp_send_stop();
>>> #endif
>>> }
>>>
>>> and I changed it to something like:
>>>
>>> void machine_shutdown(void)
>>> {
>>> #ifdef CONFIG_HOTPLUG_CPU
>>>  	disable_nonboot_cpus();
>>> #elifdef CONFIG_SMP
>>> 	smp_send_stop();
>>> #endif
>>> }
>>>
>>> ... but then figured that moving it up into the core kexec code would be
>>> better, so that everything always worked the same way.
>>
>> Hmmm, isn't this racy: requiring the secondaries to hit idle and notice
>> they're offline and call cpu_die before the primary has replace the kernel
>> image?
> 
> Isn't disable_nonboot_cpus() synchronous? If not, I imagine my original
> patch wasn't any better in this respect, except that the hotunplug
> happened earlier, and hence reduced the likelihood of actually seeing
> any such issues.
> 
>>> Anyway, the change above addresses Eric's concern about isolating the
>>> change to ARM. Does that seem like a reasonable thing for the ARM code
>>> to do?
>>
>> I think you're better off using what we currently have and hanging your code
>> off platform_cpu_kill.
> 
> OK, I'll look into that. Joseph Lo just posted patches to implement
> cpu_kill() on Tegra, which was needed to fix some issues in our hotplug
> code anyway. Perhaps that will remove the need for any other changes...

Will,

I just remembered one other advantage of disable_nonboot_cpus(); it
always makes the kexec happen on the boot CPU. Without this, I believe
it's random whether CPU0 or CPU1 performs the kexec. I suspect it's most
likely to work if we can always kexec on the boot CPU rather than a
random CPU?

Joseph, Peter,

As you know, I've been looking into kexec[2] on Tegra. Here's a summary
of a few tests I did:

linux-next plus nothing in particular, SMP enabled:
* Hangs/crashes during kexec

linux-next + SMP disabled: kexec works

linux-next + hotunplug CPUs other than CPU0 before kexec: kexec works

Will Deacon suggested this was due to a missing implementation of struct
smp_operations .cpu_kill on Tegra, which means that when CPUn are
present, they're simply spinning executing code, and kexec will
eventually overwrite that code, causing all manner of problems. So,
since I noticed that Joseph posted an implementation of .cpu_kill for
Tegra, I tried that out[1]. It certainly doesn't solve the problem, and
in fact actually causes the kernel performing the kexec (rather than the
kexec'd kernel) to hang:

> [   26.291903] Starting new kernel
> [   47.309571] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=2102 jiffies, g=211, c=210, q=37)
> [   47.323410] Task dump for CPU 1:
> [   47.329763] dd              R running      0   401      1 0x00000000
> [   47.339343] [<c0521690>] (__schedule+0x360/0x600) from [<c002d9b0>] (do_syslog+0x2b4/0x478)
> [   47.350952] [<c002d9b0>] (do_syslog+0x2b4/0x478) from [<ed86bb08>] (0xed86bb08)

Manually hotunplugging the CPUs first still works OK with those patches
applied though.

I /think/ kexec calls .cpu_kill() without having caused the CPU itself
to call .cpu_die() first? Did you allow for that possibility inside the
implementation you posted?

[1]
http://patchwork.ozlabs.org/patch/207601/
http://patchwork.ozlabs.org/patch/207602/

[2]
http://en.wikipedia.org/wiki/Kexec

WARNING: multiple messages have this Message-ID (diff)
From: swarren@wwwdotorg.org (Stephen Warren)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH] kexec: disable non-boot CPUs
Date: Thu, 20 Dec 2012 13:15:30 -0700	[thread overview]
Message-ID: <50D371E2.1030807@wwwdotorg.org> (raw)
In-Reply-To: <50D35214.4090008@wwwdotorg.org>

On 12/20/2012 10:59 AM, Stephen Warren wrote:
> On 12/20/2012 10:36 AM, Will Deacon wrote:
>> On Thu, Dec 20, 2012 at 05:21:56PM +0000, Stephen Warren wrote:
>>> On 12/20/2012 03:49 AM, Will Deacon wrote:
>>>> If you do manage to get this merged, please can you follow up with a patch
>>>> to remove the smp_kill_cpus bits from arch/arm/kernel/smp.c please? It only
>>>> exists as a hook to do exactly this and currently nobody is using it afaict.
>>>
>>> I originally implemented this in
>>> arch/arm/kernel/process.c:machine_shutdown(), which currently is:
>>>
>>> void machine_shutdown(void)
>>> {
>>> #ifdef CONFIG_SMP
>>> 	smp_send_stop();
>>> #endif
>>> }
>>>
>>> and I changed it to something like:
>>>
>>> void machine_shutdown(void)
>>> {
>>> #ifdef CONFIG_HOTPLUG_CPU
>>>  	disable_nonboot_cpus();
>>> #elifdef CONFIG_SMP
>>> 	smp_send_stop();
>>> #endif
>>> }
>>>
>>> ... but then figured that moving it up into the core kexec code would be
>>> better, so that everything always worked the same way.
>>
>> Hmmm, isn't this racy: requiring the secondaries to hit idle and notice
>> they're offline and call cpu_die before the primary has replace the kernel
>> image?
> 
> Isn't disable_nonboot_cpus() synchronous? If not, I imagine my original
> patch wasn't any better in this respect, except that the hotunplug
> happened earlier, and hence reduced the likelihood of actually seeing
> any such issues.
> 
>>> Anyway, the change above addresses Eric's concern about isolating the
>>> change to ARM. Does that seem like a reasonable thing for the ARM code
>>> to do?
>>
>> I think you're better off using what we currently have and hanging your code
>> off platform_cpu_kill.
> 
> OK, I'll look into that. Joseph Lo just posted patches to implement
> cpu_kill() on Tegra, which was needed to fix some issues in our hotplug
> code anyway. Perhaps that will remove the need for any other changes...

Will,

I just remembered one other advantage of disable_nonboot_cpus(); it
always makes the kexec happen on the boot CPU. Without this, I believe
it's random whether CPU0 or CPU1 performs the kexec. I suspect it's most
likely to work if we can always kexec on the boot CPU rather than a
random CPU?

Joseph, Peter,

As you know, I've been looking into kexec[2] on Tegra. Here's a summary
of a few tests I did:

linux-next plus nothing in particular, SMP enabled:
* Hangs/crashes during kexec

linux-next + SMP disabled: kexec works

linux-next + hotunplug CPUs other than CPU0 before kexec: kexec works

Will Deacon suggested this was due to a missing implementation of struct
smp_operations .cpu_kill on Tegra, which means that when CPUn are
present, they're simply spinning executing code, and kexec will
eventually overwrite that code, causing all manner of problems. So,
since I noticed that Joseph posted an implementation of .cpu_kill for
Tegra, I tried that out[1]. It certainly doesn't solve the problem, and
in fact actually causes the kernel performing the kexec (rather than the
kexec'd kernel) to hang:

> [   26.291903] Starting new kernel
> [   47.309571] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=2102 jiffies, g=211, c=210, q=37)
> [   47.323410] Task dump for CPU 1:
> [   47.329763] dd              R running      0   401      1 0x00000000
> [   47.339343] [<c0521690>] (__schedule+0x360/0x600) from [<c002d9b0>] (do_syslog+0x2b4/0x478)
> [   47.350952] [<c002d9b0>] (do_syslog+0x2b4/0x478) from [<ed86bb08>] (0xed86bb08)

Manually hotunplugging the CPUs first still works OK with those patches
applied though.

I /think/ kexec calls .cpu_kill() without having caused the CPU itself
to call .cpu_die() first? Did you allow for that possibility inside the
implementation you posted?

[1]
http://patchwork.ozlabs.org/patch/207601/
http://patchwork.ozlabs.org/patch/207602/

[2]
http://en.wikipedia.org/wiki/Kexec

  reply	other threads:[~2012-12-20 20:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-19 23:44 [PATCH] kexec: disable non-boot CPUs Stephen Warren
2012-12-19 23:44 ` Stephen Warren
2012-12-19 23:44 ` Stephen Warren
2012-12-19 23:55 ` Eric W. Biederman
2012-12-19 23:55   ` Eric W. Biederman
2012-12-19 23:55   ` Eric W. Biederman
2012-12-20 10:49 ` Will Deacon
2012-12-20 10:49   ` Will Deacon
2012-12-20 10:49   ` Will Deacon
2012-12-20 17:21   ` Stephen Warren
2012-12-20 17:21     ` Stephen Warren
2012-12-20 17:21     ` Stephen Warren
2012-12-20 17:36     ` Will Deacon
2012-12-20 17:36       ` Will Deacon
2012-12-20 17:36       ` Will Deacon
2012-12-20 17:59       ` Stephen Warren
2012-12-20 17:59         ` Stephen Warren
2012-12-20 17:59         ` Stephen Warren
2012-12-20 20:15         ` Stephen Warren [this message]
2012-12-20 20:15           ` Stephen Warren
2012-12-20 20:15           ` Stephen Warren
2012-12-23 11:06           ` Will Deacon
2012-12-23 11:06             ` Will Deacon
2012-12-23 11:06             ` Will Deacon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50D371E2.1030807@wwwdotorg.org \
    --to=swarren@wwwdotorg.org \
    --cc=ebiederm@xmission.com \
    --cc=josephl@nvidia.com \
    --cc=kexec@lists.infradead.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=pdeschrijver@nvidia.com \
    --cc=swarren@nvidia.com \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.