linuxppc-dev.lists.ozlabs.org archive mirror
From: Balbir Singh <bsingharora@gmail.com>
To: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC] powerpc/pseries: Increase busy loop in pseries_cpu_die
Date: Tue, 7 Feb 2017 08:26:45 +0530
Message-ID: <20170207025645.GB22303@localhost.localdomain>
In-Reply-To: <1486407496-12151-1-git-send-email-bauerman@linux.vnet.ibm.com>

On Mon, Feb 06, 2017 at 04:58:16PM -0200, Thiago Jung Bauermann wrote:
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
>     pc: 000000001ec3072c
>     lr: 000000001ec2fee0
>     sp: 1faf6bd0
>    msr: 8000000102801000
>    dar: 212d6c1a2a20c

This looks like we accessed a bad address, but why?

>  dsisr: 42000000
>   current = 0xc000000474c6d600
>   paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
>     pid   = 0, comm = swapper/134
> Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
> WARNING: exception is not recoverable, can't continue
> 
> This was reproduced in v4.10-rc6 as well, but I don't have a crash log
> handy for that version right now. Sorry.
> 
> This is a race between one CPU stopping and another one calling
> pseries_cpu_die to wait for it to stop. That function does a short
> busy loop calling RTAS query-cpu-stopped-state on the stopping CPU
> to verify that it is stopped.
> 
> As can be seen in the dmesg right before or after the "Querying DEAD?"
> messages, if pseries_cpu_die waited a little longer it would have seen
> the CPU in the stopped state.
> 
> I see two cases that can be causing this race:
> 
> 1. It's possible that CPU 134 was inactive at the time it was unplugged.
>    In that case, dlpar_offline_cpu calls H_PROD on the CPU and immediately
>    calls pseries_cpu_die. Meanwhile, the prodded CPU activates and starts
>    the process of stopping itself. It's possible that the busy loop is not
>    long enough to allow for the CPU to wake up and complete the stopping
>    process.
> 2. If CPU 134 was online at the time it was unplugged, it would have gone
>    through the new CPU hotplug state machine in kernel/cpu.c that was
>    introduced in v4.6 to get itself stopped. It's possible that the busy
>    loop in pseries_cpu_die was long enough for the older hotplug code but
>    not for the new hotplug state machine.
> 
> Either way, the solution is the same: wait an adequate amount in
> pseries_cpu_die.
> 
> The simple solution is to increase the number of tries in the loop.
> This was done to solve a similar problem in
> commit 940ce422a367 ("powerpc/pseries: Increase cpu die timeout"), so
> it's not as lame as it sounds. :-)
> 
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
> 
> Notes:
>     A solution that is probably better is to have pseries_cpu_die wait
>     on a per-CPU semaphore at the beginning of the function, before doing a
>     short busy loop. Then the CPU that is stopping unlocks that semaphore right
>     before stopping itself, probably at pseries_mach_cpu_die.
>     
>     What do you think? I can implement that if there is interest.
> 
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index a1b63e00b2f7..3d43317eec1b 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -206,7 +206,7 @@ static void pseries_cpu_die(unsigned int cpu)
>  		}
>  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>  
> -		for (tries = 0; tries < 25; tries++) {
> +		for (tries = 0; tries < 5000; tries++) {

This fixes some of the asymmetry between the handling of CPU_STATE_INACTIVE
and CPU_STATE_OFFLINE, but I think we can probably replace the cpu_relax()
with msleep(1).
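
To make the suggestion concrete, here is a rough userspace model of a
sleep-based poll loop. It is only a sketch, not the kernel change itself:
wait_for_cpu_stopped(), sleep_1ms() and stops_after_n_polls() are
hypothetical names, the callback stands in for the RTAS
query-cpu-stopped-state call, and nanosleep() stands in for msleep(1).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the RTAS query-cpu-stopped-state call. */
typedef bool (*cpu_stopped_fn)(void *ctx);

/* Sleep for about one millisecond; userspace stand-in for msleep(1). */
static void sleep_1ms(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000L };

	nanosleep(&ts, NULL);
}

/*
 * Poll until is_stopped() reports true or max_tries polls have been made.
 * Returns the number of sleeps taken before success, or -1 on timeout.
 */
static int wait_for_cpu_stopped(cpu_stopped_fn is_stopped, void *ctx,
				int max_tries)
{
	assert(is_stopped != NULL);

	for (int tries = 0; tries < max_tries; tries++) {
		if (is_stopped(ctx))
			return tries;
		/* Yield the CPU instead of spinning between queries. */
		sleep_1ms();
	}
	return -1;
}

/* Toy model of a CPU that needs *ctx more polls before it is stopped. */
static bool stops_after_n_polls(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

With a sleep of about 1 ms per iteration, the try count becomes a rough
upper bound in milliseconds, so a limit of 5000 would wait on the order of
five seconds at most (sleeps can overshoot), rather than however long 5000
spins of cpu_relax() happen to take.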

Please also see commit 940ce42 ("powerpc/pseries: Increase cpu die timeout").

Balbir Singh.

Thread overview: 6+ messages
2017-02-06 18:58 [RFC] powerpc/pseries: Increase busy loop in pseries_cpu_die Thiago Jung Bauermann
2017-02-07  1:05 ` Han Pingtian
2017-02-07  2:10 ` Michael Ellerman
2017-02-07  2:56 ` Balbir Singh [this message]
2017-02-07 15:32   ` Thiago Jung Bauermann
2017-02-08  2:59     ` Michael Ellerman
