Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: ego@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
	Nicholas Piggin <npiggin@gmail.com>,
	Tyrel Datwyler <tyreld@linux.vnet.ibm.com>,
	linuxppc-dev@lists.ozlabs.org,
	Thiago Jung Bauermann <bauerman@linux.ibm.com>
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop
Date: Thu, 17 Jan 2019 11:33:17 +0530	[thread overview]
Message-ID: <20190117060317.GA6897@in.ibm.com> (raw)
In-Reply-To: <74b2262b-48e7-b2a5-7d20-dc7f590958d9@linux.vnet.ibm.com>

Hello Michael,

On Mon, Jan 14, 2019 at 12:11:44PM -0600, Michael Bringmann wrote:
> On 1/9/19 12:08 AM, Gautham R Shenoy wrote:
> 
> > I did some testing during the holidays. Here are the observations:
> > 
> > 1) With just your patch (without any additional debug patch), if I run
> > DLPAR on /off operations on a system that has SMT=off, I am able to
> > see a crash involving RTAS stack corruption within an hour's time.
> > 
> > 2) With the debug patch (appended below) which has additional debug to
> > capture the callers of stop-self, start-cpu, set-power-levels, the
> > system is able to perform DLPAR on/off operations on a system with
> > SMT=off for three days. And then, it crashed with the dead CPU showing
> > a "Bad kernel stack pointer". From this log, I can clearly
> > see that there were no concurrent calls to stop-self, start-cpu,
> > set-power-levels. The only concurrent RTAS calls were the dying CPU
> > calling "stop-self", and the CPU running the DLPAR operation calling
> > "query-cpu-stopped-state". The crash signature is appended below as
> > well.
> > 
> > 3) Modifying your patch to remove the udelay and increase the loop
> > count from 25 to 1000 doesn't improve the situation. We are still able
> > to see the crash.
> > 
> > 4) With my patch, even without any additional debug, I was able to
> > observe the system run the tests successfully for over a week (I
> > started the tests before the Christmas weekend, and forgot to turn it
> > off!)
> 
> So does this mean that the problem is fixed with your patch?

No. On the contrary I think my patch unable to exploit the possible
race window.  From a technical point of view, Thiago's patch does the
right things

  - It waits for the target CPU to come out of CEDE and set its state
    to CPU_STATE_OFFLINE.
  
  - Only then it then makes the "query-cpu-stopped-state" rtas call in
    a loop, while giving sufficient delay between successive
    queries. This reduces the unnecessary rtas-calls.

In my patch, I don't do any of this, but simply keep calling
"query-cpu-stopped-state" call in a loop for 4000 iterations.

That said, if I incorporate the wait for the target CPU to set its
state to CPU_STATE_OFFLINE in my patch and then make the
"query-cpu-stopped-state" rtas call, then, even I am able to get the
crash with the "RTAS CALL BUFFER CORRUPTION" message.

I think that incorporating the wait for the target CPU set its state
to CPU_STATE_OFFLINE increases the probability that
"query-cpu-stopped-state" and "stop-self" rtas calls get called more
or less at the same time. However since concurrent invocations of
these rtas-calls is allowed by the PAPR, it should not result in a
"RTAS CALL BUFFER CORRUPTION". Am I missing something here ?

> 
> > 
> > It appears that there is a narrow race window involving rtas-stop-self
> > and rtas-query-cpu-stopped-state calls that can be observed with your
> > patch. Adding any printk's seems to reduce the probability of hitting
> > this race window. It might be worth the while to check with RTAS
> > folks, if they suspect something here.
> 
> What would the RTAS folks be looking at here?  The 'narrow race window'
> is with respect to a patch that it sound like we should not be
> using.

IMHO, if the race-window exists, it would be better to confirm and fix
it, given that we have a patch that can exploit it consistently.

> 
> Thanks.
> Michael
> 
> -- 
> Michael W. Bringmann
> Linux Technology Center
> IBM Corporation
> Tie-Line  363-5196
> External: (512) 286-5196
> Cell:       (512) 466-0650
> mwb@linux.vnet.ibm.com

--
Thanks and Regards
gautham.

WARNING: multiple messages have this Message-ID (diff)

From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: ego@linux.vnet.ibm.com,
	Thiago Jung Bauermann <bauerman@linux.ibm.com>,
	linux-kernel@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
	Tyrel Datwyler <tyreld@linux.vnet.ibm.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop
Date: Thu, 17 Jan 2019 11:33:17 +0530	[thread overview]
Message-ID: <20190117060317.GA6897@in.ibm.com> (raw)
In-Reply-To: <74b2262b-48e7-b2a5-7d20-dc7f590958d9@linux.vnet.ibm.com>

Hello Michael,

On Mon, Jan 14, 2019 at 12:11:44PM -0600, Michael Bringmann wrote:
> On 1/9/19 12:08 AM, Gautham R Shenoy wrote:
> 
> > I did some testing during the holidays. Here are the observations:
> > 
> > 1) With just your patch (without any additional debug patch), if I run
> > DLPAR on /off operations on a system that has SMT=off, I am able to
> > see a crash involving RTAS stack corruption within an hour's time.
> > 
> > 2) With the debug patch (appended below) which has additional debug to
> > capture the callers of stop-self, start-cpu, set-power-levels, the
> > system is able to perform DLPAR on/off operations on a system with
> > SMT=off for three days. And then, it crashed with the dead CPU showing
> > a "Bad kernel stack pointer". From this log, I can clearly
> > see that there were no concurrent calls to stop-self, start-cpu,
> > set-power-levels. The only concurrent RTAS calls were the dying CPU
> > calling "stop-self", and the CPU running the DLPAR operation calling
> > "query-cpu-stopped-state". The crash signature is appended below as
> > well.
> > 
> > 3) Modifying your patch to remove the udelay and increase the loop
> > count from 25 to 1000 doesn't improve the situation. We are still able
> > to see the crash.
> > 
> > 4) With my patch, even without any additional debug, I was able to
> > observe the system run the tests successfully for over a week (I
> > started the tests before the Christmas weekend, and forgot to turn it
> > off!)
> 
> So does this mean that the problem is fixed with your patch?

No. On the contrary I think my patch unable to exploit the possible
race window.  From a technical point of view, Thiago's patch does the
right things

  - It waits for the target CPU to come out of CEDE and set its state
    to CPU_STATE_OFFLINE.
  
  - Only then it then makes the "query-cpu-stopped-state" rtas call in
    a loop, while giving sufficient delay between successive
    queries. This reduces the unnecessary rtas-calls.

In my patch, I don't do any of this, but simply keep calling
"query-cpu-stopped-state" call in a loop for 4000 iterations.

That said, if I incorporate the wait for the target CPU to set its
state to CPU_STATE_OFFLINE in my patch and then make the
"query-cpu-stopped-state" rtas call, then, even I am able to get the
crash with the "RTAS CALL BUFFER CORRUPTION" message.

I think that incorporating the wait for the target CPU set its state
to CPU_STATE_OFFLINE increases the probability that
"query-cpu-stopped-state" and "stop-self" rtas calls get called more
or less at the same time. However since concurrent invocations of
these rtas-calls is allowed by the PAPR, it should not result in a
"RTAS CALL BUFFER CORRUPTION". Am I missing something here ?

> 
> > 
> > It appears that there is a narrow race window involving rtas-stop-self
> > and rtas-query-cpu-stopped-state calls that can be observed with your
> > patch. Adding any printk's seems to reduce the probability of hitting
> > this race window. It might be worth the while to check with RTAS
> > folks, if they suspect something here.
> 
> What would the RTAS folks be looking at here?  The 'narrow race window'
> is with respect to a patch that it sound like we should not be
> using.

IMHO, if the race-window exists, it would be better to confirm and fix
it, given that we have a patch that can exploit it consistently.

> 
> Thanks.
> Michael
> 
> -- 
> Michael W. Bringmann
> Linux Technology Center
> IBM Corporation
> Tie-Line  363-5196
> External: (512) 286-5196
> Cell:       (512) 466-0650
> mwb@linux.vnet.ibm.com

--
Thanks and Regards
gautham.

next prev parent reply	other threads:[~2019-01-17  6:05 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-06 11:31 [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop Gautham R. Shenoy
2018-12-06 11:31 ` Gautham R. Shenoy
2018-12-06 17:28 ` Thiago Jung Bauermann
2018-12-06 17:28   ` Thiago Jung Bauermann
2018-12-07 10:43   ` Gautham R Shenoy
2018-12-07 10:43     ` Gautham R Shenoy
2018-12-07 12:03     ` Gautham R Shenoy
2018-12-07 12:03       ` Gautham R Shenoy
2018-12-08  2:40       ` Thiago Jung Bauermann
2018-12-08  2:40         ` Thiago Jung Bauermann
2018-12-10 20:16         ` Michael Bringmann
2018-12-10 20:16           ` Michael Bringmann
2018-12-10 20:31           ` Thiago Jung Bauermann
2018-12-10 20:31             ` Thiago Jung Bauermann
2018-12-11 22:11             ` Michael Bringmann
2018-12-11 22:11               ` Michael Bringmann
2019-01-09  6:08         ` Gautham R Shenoy
2019-01-09  6:08           ` Gautham R Shenoy
2019-01-14 18:11           ` Michael Bringmann
2019-01-14 18:11             ` Michael Bringmann
2019-01-17  6:03             ` Gautham R Shenoy [this message]
2019-01-17  6:03               ` Gautham R Shenoy
2022-07-07  9:51 ` Christophe Leroy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190117060317.GA6897@in.ibm.com \
    --to=ego@linux.vnet.ibm.com \
    --cc=bauerman@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mwb@linux.vnet.ibm.com \
    --cc=npiggin@gmail.com \
    --cc=tyreld@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.