linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Michael Neuling <mikey@neuling.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Bharata B Rao <bharata@linux.ibm.com>,
	linuxppc-dev@ozlabs.org, kvm-ppc@vger.kernel.org,
	Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC/PATCH  2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
Date: Mon, 13 Apr 2020 15:55:49 +0530	[thread overview]
Message-ID: <20200413102549.GA22532@in.ibm.com> (raw)
In-Reply-To: <20200408022957.GC44664@umbus.fritz.box>

Hello David,

On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > Hello David,
> > 
> > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > 
> > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > scheduling in the guest vCPU.
> > > > > > 
> > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > Bugs (starting POWER9 DD2.3).
> > > > > 
> > > > > How will guests know that they can use this facility safely after your
> > > > > series? You need both DD2.3 and a patched KVM.
> > > > 
> > > > 
> > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > state in guest works.
> > > > 
> > > > However, to answer your question, this is the scheme that I had in
> > > > mind :
> > > > 
> > > > OPAL:
> > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > 
> > > > Hypervisor Kernel:
> > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > >        we set bool enable_guest_stop = true;
> > > > 
> > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > >        enable_guest_stop == true.
> > > > 
> > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > > 
> > > > QEMU:
> > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > >    indicate the presence to the guest via device tree.
> > > 
> > > Nack.  Presenting different capabilities to the guest depending on
> > > host capabilities (rather than explicit options) is never ok.  It
> > > means that depending on the system you start on you may or may not be
> > > able to migrate to other systems that you're supposed to be able to,
> > 
> > I agree that blocking migration for the unavailability of this feature
> > is not desirable. Could you point me to some other capabilities in KVM
> > which have been implemented via explicit options?
> 
> TBH, most of the options for the 'pseries' machine type are in this
> category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> resizing extension), ic-mode (which irq controllers are available to
> the guest).


Thanks. I will follow this suit.

> 
> > The ISA 3.0 allows the guest to execute the "stop" instruction.
> 
> So, this was a bug in DD2.2's implementation of the architecture?

Yes, the previous versions could miss wakeup events when stop was
executed in HV=0,PR=0 mode. So, the hypervisor had to block that.


> 
> > If the
> > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > Exception, thus giving the hypervisor a chance to emulate the
> > instruction. However, in the current code, when the hypervisor
> > receives this exception, it sends a PROGKILL to the guest which
> > results in crashing the guest.
> > 
> > Patch 1 of this series emulates wakeup from the "stop"
> > instruction. Would the following scheme be ok?
> > 
> > OPAL:
> > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> > 
> > 	   If "idle-stop-guest" dt feature is available, then, before
> > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > 	   bits allowing the guest to safely execute stop instruction.
> > 
> > 	   If "idle-stop-guest" dt feature is not available, then, the
> > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > 	   guest "stop" instruction execution to trap back into the
> > 	   hypervisor. We then emulate a wakeup from the stop
> > 	   instruction (Patch 1 of this series).
> > 
> > Guest Kernel:
> >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> >       stop0lite cpuidle state.
> > 
> > This allows us to migrate the KVM guest across any POWER9
> > Hypervisor. The minimal addition that the Hypervisor would need is
> > Patch 1 of this series.
> 
> That could be workable.  Some caveats, though:
> 
>  * How does the latency of the trap-and-emulate compare to the guest
>    using H_CEDE in the first place?  i.e. how big a negative impact
>    will this have for guests running on DD2.2 hosts?


The wakeup latency of trap-and-emulated stop0lite (referred to as
"stop0lite Emulated" in the tables below) the compares favorably
compared to H_CEDE. It is in the order of 5-6us while the wakeup
latency of H_CEDE is ~25-30us.

======================================================================
Wakeup Latency measured using a timer (in ns) [Lower is better]
======================================================================
Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
----------------------------------------------------------------------
snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
----------------------------------------------------------------------
stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
----------------------------------------------------------------------
stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
Emulated   |             |         |        |        |       |       |
----------------------------------------------------------------------
Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
======================================================================


======================================================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
======================================================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
----------------------------------------------------------------------
snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
----------------------------------------------------------------------
stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
----------------------------------------------------------------------
stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
Emulated   |        |         |        |        |         |          |
----------------------------------------------------------------------
Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
======================================================================

> 
>  * We'll only be able to enable this in a new qemu machine type
>    version (say, pseries-5.1.0).  Otherwise a guest could start
>    thinking it can use stop states, then be migrated to an older qemu
>    or host kernel without the support and crash.

That makese sense. In fact in the case of not being able to backport
Patch 1 to all the older hypervisor kernels, we will need a way of
gating the guest from using stop-states and then migrating onto an
older hypervisor kernel. Associating this with a new qemu machine type
version should solve this problem, assuming that all the newer qemus
will also be running on newer hypervisor kernels.



> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

--
Thanks and Regards
gautham.

  reply	other threads:[~2020-04-13 10:28 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-31 12:10 [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest Gautham R. Shenoy
2020-03-31 12:10 ` [RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop Gautham R. Shenoy
2020-04-03  2:15   ` Nicholas Piggin
2020-03-31 12:10 ` [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry Gautham R. Shenoy
2020-04-03  2:20   ` Nicholas Piggin
2020-04-03  9:31     ` Gautham R Shenoy
2020-04-06  9:58       ` David Gibson
2020-04-07 13:25         ` Gautham R Shenoy
2020-04-08  2:29           ` David Gibson
2020-04-13 10:25             ` Gautham R Shenoy [this message]
2020-04-14  2:17               ` David Gibson
2020-04-07 12:33       ` Gautham R Shenoy
2020-03-31 12:10 ` [RFC/PATCH 3/3] cpuidle/pseries: Add stop0lite state Gautham R. Shenoy
2020-03-31 12:14 ` [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest Gautham R Shenoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200413102549.GA22532@in.ibm.com \
    --to=ego@linux.vnet.ibm.com \
    --cc=bharata@linux.ibm.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=linuxppc-dev@ozlabs.org \
    --cc=mikey@neuling.org \
    --cc=npiggin@gmail.com \
    --cc=svaidy@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).