From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>,
linux-kernel@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
Michael Bringmann <mwb@linux.vnet.ibm.com>,
Tyrel Datwyler <tyreld@linux.vnet.ibm.com>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop
Date: Fri, 7 Dec 2018 16:13:11 +0530 [thread overview]
Message-ID: <20181207104311.GA11431@in.ibm.com> (raw)
In-Reply-To: <87a7li5zv2.fsf@morokweng.localdomain>
Hi Thiago,
On Thu, Dec 06, 2018 at 03:28:17PM -0200, Thiago Jung Bauermann wrote:
[..snip..]
>
>
> I posted a similar patch last year, but I wasn't able to arrive at a
> root cause analysis like you did:
>
>
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html
Ah! Nice. So this is a known problem.
>
> One thing I realized after I posted the patch was that in my case, the
> CPU was crashing inside RTAS. From the NIP and LR in the trace above it
> looks like it's crashing in RTAS in your case as well.
>
> Michael Ellerman had two comments on my patch:
>
> 1. Regardless of the underlying bug, the kernel shouldn't crash so we
> need a patch making it more resilient to this failure.
>
> 2. The wait loop should use udelay() so that the loop will actually take
> a set amount of wall time, rather than just cycles.
>
> Regarding 1. if the problem is that the kernel is causing RTAS to crash
> because it calls it in a way that's unsupported, then I don't see how we
> can make the kernel more resilient. We have to make the kernel respect
> RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
> gracefuly in these conditions).
I agree that the Kernel has to respect RTAS's restriction. The PAPR
v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
"The stop-self service needs to be serialized with calls to the
stop-self, start-cpu, and set-power-level services. The OS must
be able to call RTAS services on other processors while the
processor is stopped or being stopped"
Thus the onus is on the OS to ensure that there are no concurrent rtas
calls with "stop-self" token.
>
> Regarding 2. I implemented a new version of my patch (posted below) but
> I was never able to test it because I couldn't access a system where the
> problem was reproducible anymore.
>
> There's also a race between the CPU driving the unplug and the CPU being
> unplugged which I think is not easy for the CPU being unplugged to win,
> which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
> the race in the patch description.
>
> My solution to make the race less tight is to make the CPU driving the
> unplug to only start the busy loop only after the CPU being unplugged is
> in the CPU_STATE_OFFLINE state. At that point, we know that it either is
> about to call RTAS or it already has.
Ah, yes this is good optimization. Though, I think we ought to
unconditionally wait until the target CPU has woken up from CEDE and
changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
would have caught it in dlpar_offline_cpu() itself.
>
> Do you think this makes sense? If you do, would you mind testing my
> patch? You can change the timeouts and delays if you want. To be honest
> they're just guesses on my part...
Sure. I will test the patch and report back.
--
Thanks and Regards
gautham.
next prev parent reply other threads:[~2018-12-07 10:45 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-12-06 11:31 [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop Gautham R. Shenoy
2018-12-06 17:28 ` Thiago Jung Bauermann
2018-12-07 10:43 ` Gautham R Shenoy [this message]
2018-12-07 12:03 ` Gautham R Shenoy
2018-12-08 2:40 ` Thiago Jung Bauermann
2018-12-10 20:16 ` Michael Bringmann
2018-12-10 20:31 ` Thiago Jung Bauermann
2018-12-11 22:11 ` Michael Bringmann
2019-01-09 6:08 ` Gautham R Shenoy
2019-01-14 18:11 ` Michael Bringmann
2019-01-17 6:03 ` Gautham R Shenoy
2022-07-07 9:51 ` Christophe Leroy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181207104311.GA11431@in.ibm.com \
--to=ego@linux.vnet.ibm.com \
--cc=bauerman@linux.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mwb@linux.vnet.ibm.com \
--cc=npiggin@gmail.com \
--cc=tyreld@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).