linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Nathan Lynch <nathanl@linux.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kurz <groug@kaod.org>,
	linuxppc-dev@lists.ozlabs.org,
	Michael Roth <mdroth@linux.vnet.ibm.com>,
	Thiago Jung Bauermann <bauerman@linux.ibm.com>,
	Cedric Le Goater <clg@kaod.org>
Subject: Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Date: Fri, 07 Aug 2020 02:05:09 -0500	[thread overview]
Message-ID: <87mu37ylzu.fsf@linux.ibm.com> (raw)
In-Reply-To: <87zh79yen7.fsf@mpe.ellerman.id.au>

Hi everyone,

Michael Ellerman <mpe@ellerman.id.au> writes:
> Greg Kurz <groug@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> Spinning forever seems like a bad idea, but as has been demonstrated at
>>> least twice now, continuing when we don't know the state of the other
>>> CPU can lead to straight up crashes.
>>> 
>>> So I think I'm persuaded that it's preferable to have the kernel stuck
>>> spinning rather than oopsing.
>>> 
>>
>> +1
>>
>>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>>> first instinct is no, if we're stuck here for 20s a stack trace would be
>>> good. But then we will probably hit that on some big and/or heavily
>>> loaded machine.
>>> 
>>> So possibly we should call cond_resched() but have some custom logic in
>>> the loop to print a warning if we are stuck for more than some
>>> sufficiently long amount of time.
>>
>> How long should that be ?
>
> Yeah good question.
>
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
>
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.

Maybe I'm not quite clear what this is referring to, but I don't think
stop-self/query-stopped-state latency increases with processor count, at
least not on PowerVM. And IIRC I was concerned with the earlier patch's
potential for causing the softlockup watchdog to rightly complain by
polling the stopped state without ever scheduling away.

The fact that smp_query_cpu_stopped() kind of collapses the two distinct
results from the query-cpu-stopped-state RTAS call into one return value
may make it harder than necessary to reason about the questions around
cond_resched() and whether to warn.

Sorry to pull this stunt but I have had some code sitting in a neglected
branch that I think gets the logic around this right.

What we should have is a simple C wrapper for the RTAS call that reflects the
architected inputs and outputs:

================================================================
(-- rtas.c --)

/**
 * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
 * @hwcpu: Identifies the processor thread to be queried.
 * @status: Pointer to status, valid only on success.
 *
 * Determine whether the given processor thread is in the stopped
 * state.  If successful and @status is non-NULL, the thread's status
 * is stored to @status.
 *
 * Return:
 * * 0   - Success
 * * -1  - Hardware error
 * * -2  - Busy, try again later
 */
int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
{
       unsigned int cpu_status;
       int token;
       int fwrc;

       token = rtas_token("query-cpu-stopped-state");

       fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
       if (fwrc != 0)
               goto out;

       if (status != NULL)
               *status = cpu_status;
out:
       return fwrc;
}
================================================================


And then a utility function that waits for the remote thread to enter
stopped state, with higher-level logic for rescheduling and warning. The
fact that smp_query_cpu_stopped() currently does not handle a -2/busy
status is a bug, fixed below by using rtas_busy_delay(). Note the
justification for the explicit cond_resched() in the outer loop:

================================================================
(-- rtas.h --)

/* query-cpu-stopped-state CPU_status */
#define RTAS_QCSS_STATUS_STOPPED     0
#define RTAS_QCSS_STATUS_IN_PROGRESS 1
#define RTAS_QCSS_STATUS_NOT_STOPPED 2

(-- pseries/hotplug-cpu.c --)

/**
 * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
 */
static void wait_for_cpu_stopped(unsigned int cpu)
{
       unsigned int status;
       unsigned int hwcpu;

       hwcpu = get_hard_smp_processor_id(cpu);

       do {
               int fwrc;

               /*
                * rtas_busy_delay() will yield only if RTAS returns a
                * busy status. Since query-cpu-stopped-state can
                * yield RTAS_QCSS_STATUS_IN_PROGRESS or
                * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
                * period before the target thread stops, we must take
                * care to explicitly reschedule while polling.
                */
               cond_resched();

               do {
                       fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
               } while (rtas_busy_delay(fwrc));

               if (fwrc == 0)
                       continue;

               pr_err_ratelimited("query-cpu-stopped-state for "
                                  "thread 0x%x returned %d\n",
                                  hwcpu, fwrc);
               goto out;

       } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
                status == RTAS_QCSS_STATUS_IN_PROGRESS);

       if (status != RTAS_QCSS_STATUS_STOPPED) {
               pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
                                  "status %d for thread 0x%x\n",
                                  status, hwcpu);
       }
out:
       return;
}

[...]

static void pseries_cpu_die(unsigned int cpu)
{
       wait_for_cpu_stopped(cpu);
       paca_ptrs[cpu]->cpu_start = 0;
}
================================================================

wait_for_cpu_stopped() should be able to accommodate a time-based
warning if necessary, but speaking as a likely recipient of any bug
reports that would arise here, I'm not convinced of the need and I
don't know what a good value would be. It's relatively easy to sample
the stack of a task that's apparently failing to make progress, plus I
probably would use 'perf probe' or similar to report the inputs and
outputs for the RTAS call.

I'm happy to make this a proper submission after I can clean it up and
retest it, or Michael R. is welcome to appropriate it, assuming it's
acceptable.


  parent reply	other threads:[~2020-08-07  7:07 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-04  3:29 [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death Michael Roth
2020-08-04 13:35 ` Michael Ellerman
2020-08-04 14:16   ` Greg Kurz
2020-08-05  3:07     ` Michael Ellerman
2020-08-05  4:01       ` Thiago Jung Bauermann
2020-08-05  4:37       ` Michael Roth
2020-08-05 22:29         ` Michael Roth
2020-08-05 22:31           ` Michael Roth
2020-08-06 12:51           ` Michael Ellerman
2020-08-07  7:05       ` Nathan Lynch [this message]
2020-08-11  5:39         ` Michael Roth
2020-08-11 11:56           ` Michael Ellerman
2020-08-11 14:46             ` Nathan Lynch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mu37ylzu.fsf@linux.ibm.com \
    --to=nathanl@linux.ibm.com \
    --cc=bauerman@linux.ibm.com \
    --cc=clg@kaod.org \
    --cc=groug@kaod.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=mpe@ellerman.id.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).