Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nathan Lynch <nathanl@linux.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kurz <groug@kaod.org>,
	linuxppc-dev@lists.ozlabs.org,
	Michael Roth <mdroth@linux.vnet.ibm.com>,
	Thiago Jung Bauermann <bauerman@linux.ibm.com>,
	Cedric Le Goater <clg@kaod.org>
Subject: Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Date: Fri, 07 Aug 2020 02:05:09 -0500	[thread overview]
Message-ID: <87mu37ylzu.fsf@linux.ibm.com> (raw)
In-Reply-To: <87zh79yen7.fsf@mpe.ellerman.id.au>

Hi everyone,

Michael Ellerman <mpe@ellerman.id.au> writes:
> Greg Kurz <groug@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> Spinning forever seems like a bad idea, but as has been demonstrated at
>>> least twice now, continuing when we don't know the state of the other
>>> CPU can lead to straight up crashes.
>>> 
>>> So I think I'm persuaded that it's preferable to have the kernel stuck
>>> spinning rather than oopsing.
>>> 
>>
>> +1
>>
>>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>>> first instinct is no, if we're stuck here for 20s a stack trace would be
>>> good. But then we will probably hit that on some big and/or heavily
>>> loaded machine.
>>> 
>>> So possibly we should call cond_resched() but have some custom logic in
>>> the loop to print a warning if we are stuck for more than some
>>> sufficiently long amount of time.
>>
>> How long should that be ?
>
> Yeah good question.
>
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
>
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.

Maybe I'm not quite clear what this is referring to, but I don't think
stop-self/query-stopped-state latency increases with processor count, at
least not on PowerVM. And IIRC I was concerned with the earlier patch's
potential for causing the softlockup watchdog to rightly complain by
polling the stopped state without ever scheduling away.

The fact that smp_query_cpu_stopped() kind of collapses the two distinct
results from the query-cpu-stopped-state RTAS call into one return value
may make it harder than necessary to reason about the questions around
cond_resched() and whether to warn.

Sorry to pull this stunt but I have had some code sitting in a neglected
branch that I think gets the logic around this right.

What we should have is a simple C wrapper for the RTAS call that reflects the
architected inputs and outputs:

================================================================
(-- rtas.c --)

/**
 * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
 * @hwcpu: Identifies the processor thread to be queried.
 * @status: Pointer to status, valid only on success.
 *
 * Determine whether the given processor thread is in the stopped
 * state.  If successful and @status is non-NULL, the thread's status
 * is stored to @status.
 *
 * Return:
 * * 0   - Success
 * * -1  - Hardware error
 * * -2  - Busy, try again later
 */
int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
{
       unsigned int cpu_status;
       int token;
       int fwrc;

       token = rtas_token("query-cpu-stopped-state");

       fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
       if (fwrc != 0)
               goto out;

       if (status != NULL)
               *status = cpu_status;
out:
       return fwrc;
}
================================================================


And then a utility function that waits for the remote thread to enter
stopped state, with higher-level logic for rescheduling and warning. The
fact that smp_query_cpu_stopped() currently does not handle a -2/busy
status is a bug, fixed below by using rtas_busy_delay(). Note the
justification for the explicit cond_resched() in the outer loop:

================================================================
(-- rtas.h --)

/* query-cpu-stopped-state CPU_status */
#define RTAS_QCSS_STATUS_STOPPED     0
#define RTAS_QCSS_STATUS_IN_PROGRESS 1
#define RTAS_QCSS_STATUS_NOT_STOPPED 2

(-- pseries/hotplug-cpu.c --)

/**
 * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
 */
static void wait_for_cpu_stopped(unsigned int cpu)
{
       unsigned int status;
       unsigned int hwcpu;

       hwcpu = get_hard_smp_processor_id(cpu);

       do {
               int fwrc;

               /*
                * rtas_busy_delay() will yield only if RTAS returns a
                * busy status. Since query-cpu-stopped-state can
                * yield RTAS_QCSS_STATUS_IN_PROGRESS or
                * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
                * period before the target thread stops, we must take
                * care to explicitly reschedule while polling.
                */
               cond_resched();

               do {
                       fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
               } while (rtas_busy_delay(fwrc));

               if (fwrc == 0)
                       continue;

               pr_err_ratelimited("query-cpu-stopped-state for "
                                  "thread 0x%x returned %d\n",
                                  hwcpu, fwrc);
               goto out;

       } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
                status == RTAS_QCSS_STATUS_IN_PROGRESS);

       if (status != RTAS_QCSS_STATUS_STOPPED) {
               pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
                                  "status %d for thread 0x%x\n",
                                  status, hwcpu);
       }
out:
       return;
}

[...]

static void pseries_cpu_die(unsigned int cpu)
{
       wait_for_cpu_stopped(cpu);
       paca_ptrs[cpu]->cpu_start = 0;
}
================================================================

wait_for_cpu_stopped() should be able to accommodate a time-based
warning if necessary, but speaking as a likely recipient of any bug
reports that would arise here, I'm not convinced of the need and I
don't know what a good value would be. It's relatively easy to sample
the stack of a task that's apparently failing to make progress, plus I
probably would use 'perf probe' or similar to report the inputs and
outputs for the RTAS call.

I'm happy to make this a proper submission after I can clean it up and
retest it, or Michael R. is welcome to appropriate it, assuming it's
acceptable.

next prev parent reply	other threads:[~2020-08-07  7:07 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-04  3:29 [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death Michael Roth
2020-08-04 13:35 ` Michael Ellerman
2020-08-04 14:16   ` Greg Kurz
2020-08-05  3:07     ` Michael Ellerman
2020-08-05  4:01       ` Thiago Jung Bauermann
2020-08-05  4:37       ` Michael Roth
2020-08-05 22:29         ` Michael Roth
2020-08-05 22:31           ` Michael Roth
2020-08-06 12:51           ` Michael Ellerman
2020-08-07  7:05       ` Nathan Lynch [this message]
2020-08-11  5:39         ` Michael Roth
2020-08-11 11:56           ` Michael Ellerman
2020-08-11 14:46             ` Nathan Lynch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mu37ylzu.fsf@linux.ibm.com \
    --to=nathanl@linux.ibm.com \
    --cc=bauerman@linux.ibm.com \
    --cc=clg@kaod.org \
    --cc=groug@kaod.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=mpe@ellerman.id.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.