From: Balbir Singh <bsingharora@gmail.com>
To: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC] powerpc/pseries: Increase busy loop in pseries_cpu_die
Date: Tue, 7 Feb 2017 08:26:45 +0530
Message-ID: <20170207025645.GB22303@localhost.localdomain>
In-Reply-To: <1486407496-12151-1-git-send-email-bauerman@linux.vnet.ibm.com>

On Mon, Feb 06, 2017 at 04:58:16PM -0200, Thiago Jung Bauermann wrote:
> [ 447.714064] Querying DEAD? cpu 134 (134) shows 2
> cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
> pc: 000000001ec3072c
> lr: 000000001ec2fee0
> sp: 1faf6bd0
> msr: 8000000102801000
> dar: 212d6c1a2a20c

This looks like we accessed a bad address, but why?

> dsisr: 42000000
> current = 0xc000000474c6d600
> paca = 0xc000000007b6b600 softe: 0 irq_happened: 0x01
> pid = 0, comm = swapper/134
> Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
> WARNING: exception is not recoverable, can't continue
>
> This was reproduced in v4.10-rc6 as well, but I don't have a crash log
> handy for that version right now. Sorry.
>
> This is a race between one CPU stopping and another one calling
> pseries_cpu_die to wait for it to stop. That function does a short
> busy loop calling RTAS query-cpu-stopped-state on the stopping CPU
> to verify that it is stopped.
>
> As can be seen in the dmesg right before or after the "Querying DEAD?"
> messages, if pseries_cpu_die waited a little longer it would have seen
> the CPU in the stopped state.
>
> I see two cases that can be causing this race:
>
> 1. It's possible that CPU 134 was inactive at the time it was unplugged.
> In that case, dlpar_offline_cpu calls H_PROD on the CPU and immediately
> calls pseries_cpu_die. Meanwhile, the prodded CPU activates and starts
> the process of stopping itself. It's possible that the busy loop is not
> long enough to allow for the CPU to wake up and complete the stopping
> process.
> 2. If CPU 134 was online at the time it was unplugged, it would have gone
> through the new CPU hotplug state machine in kernel/cpu.c that was
> introduced in v4.6 to get itself stopped. It's possible that the busy
> loop in pseries_cpu_die was long enough for the older hotplug code but
> not for the new hotplug state machine.
>
> Either way, the solution is the same: wait an adequate amount in
> pseries_cpu_die.
>
> The simple solution is to increase the number of tries in the loop.
> This was done to solve a similar problem in
> commit 940ce422a367 ("powerpc/pseries: Increase cpu die timeout"), so
> it's not as lame as it sounds. :-)
>
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
>
> Notes:
> A solution that is probably better is to have pseries_cpu_die wait
> on a per-CPU semaphore at the beginning of the function, before doing a
> short busy loop. Then the CPU that is stopping unlocks that semaphore right
> before stopping itself, probably at pseries_mach_cpu_die.
>
> What do you think? I can implement that if there is interest.
>
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index a1b63e00b2f7..3d43317eec1b 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -206,7 +206,7 @@ static void pseries_cpu_die(unsigned int cpu)
> }
> } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>
> - for (tries = 0; tries < 25; tries++) {
> + for (tries = 0; tries < 5000; tries++) {

This fixes some of the asymmetry between the handling of CPU_STATE_INACTIVE
and CPU_STATE_OFFLINE, but I think we can probably replace the cpu_relax()
with msleep(1).
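
To make the suggestion concrete, here is a rough userspace sketch of a
bounded poll that sleeps about 1 ms between tries instead of spinning. It
is not kernel code and not the actual pseries_cpu_die() loop. Every name in
it (wait_for_cpu_stopped, fake_query, sleep_ms, the SIM_* values) is
hypothetical, the firmware query is faked, and nanosleep() stands in for
msleep(1).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical status values; these are NOT the kernel's constants. */
enum sim_cpu_state { SIM_STOPPED = 0, SIM_NOT_STOPPED = 1 };

/* Hypothetical stand-in for the RTAS query-cpu-stopped-state call. */
typedef int (*query_fn)(unsigned int cpu, void *ctx);

/* Userspace stand-in for msleep(): block for roughly `ms` milliseconds. */
static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/*
 * Poll until the CPU reports stopped, sleeping ~1 ms between polls.
 * Returns the number of polls used, or -1 if max_tries was exhausted.
 */
static int wait_for_cpu_stopped(unsigned int cpu, int max_tries,
				query_fn query, void *ctx)
{
	assert(query != NULL);

	for (int tries = 0; tries < max_tries; tries++) {
		if (query(cpu, ctx) == SIM_STOPPED)
			return tries + 1;
		sleep_ms(1);	/* yield the CPU instead of busy-waiting */
	}
	return -1;
}

/* Fake firmware: reports "not stopped" for a fixed number of polls. */
struct fake_cpu {
	int polls_until_stopped;
};

static int fake_query(unsigned int cpu, void *ctx)
{
	struct fake_cpu *fake = ctx;

	(void)cpu;
	if (fake->polls_until_stopped > 0) {
		fake->polls_until_stopped--;
		return SIM_NOT_STOPPED;
	}
	return SIM_STOPPED;
}
```

With a fake CPU that needs 40 polls before it reports stopped, a limit of
25 tries gives up while a limit of 5000 succeeds, and the waiting time
between polls is spent sleeping rather than spinning.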
Please also see
940ce42 powerpc/pseries: Increase cpu die timeout

Balbir Singh.