From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>,
Michael Neuling <mikey@neuling.org>,
Nicholas Piggin <npiggin@gmail.com>,
Bharata B Rao <bharata@linux.ibm.com>,
linuxppc-dev@ozlabs.org, kvm-ppc@vger.kernel.org,
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
Date: Mon, 13 Apr 2020 10:37:49 +0000 [thread overview]
Message-ID: <20200413102549.GA22532@in.ibm.com> (raw)
In-Reply-To: <20200408022957.GC44664@umbus.fritz.box>
Hello David,
On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > Hello David,
> >
> > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > >
> > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > scheduling in the guest vCPU.
> > > > > >
> > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > Bugs (starting POWER9 DD2.3).
> > > > >
> > > > > How will guests know that they can use this facility safely after your
> > > > > series? You need both DD2.3 and a patched KVM.
> > > >
> > > >
> > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > state in guest works.
> > > >
> > > > However, to answer your question, this is the scheme that I had in
> > > > mind :
> > > >
> > > > OPAL:
> > > > On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > >
> > > > Hypervisor Kernel:
> > > > 1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > > we set bool enable_guest_stop = true;
> > > >
> > > > 2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > > enable_guest_stop = true.
> > > >
> > > > 3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > > KVM_CAP_STOP, return true iff enable_guest_top = true.
> > > >
> > > > QEMU:
> > > > Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > > indicate the presence to the guest via device tree.
> > >
> > > Nack. Presenting different capabilities to the guest depending on
> > > host capabilities (rather than explicit options) is never ok. It
> > > means that depending on the system you start on you may or may not be
> > > able to migrate to other systems that you're supposed to be able to,
> >
> > I agree that blocking migration for the unavailability of this feature
> > is not desirable. Could you point me to some other capabilities in KVM
> > which have been implemented via explicit options?
>
> TBH, most of the options for the 'pseries' machine type are in this
> category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> resizing extension), ic-mode (which irq controllers are available to
> the guest).
Thanks. I will follow this suit.
>
> > The ISA 3.0 allows the guest to execute the "stop" instruction.
>
> So, this was a bug in DD2.2's implementation of the architecture?
Yes, the previous versions could miss wakeup events when stop was
executed in HV=0,PR=0 mode. So, the hypervisor had to block that.
>
> > If the
> > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > Exception, thus giving the hypervisor a chance to emulate the
> > instruction. However, in the current code, when the hypervisor
> > receives this exception, it sends a PROGKILL to the guest which
> > results in crashing the guest.
> >
> > Patch 1 of this series emulates wakeup from the "stop"
> > instruction. Would the following scheme be ok?
> >
> > OPAL:
> > On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> >
> > Hypervisor Kernel:
> >
> > If "idle-stop-guest" dt feature is available, then, before
> > entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > bits allowing the guest to safely execute stop instruction.
> >
> > If "idle-stop-guest" dt feature is not available, then, the
> > Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > guest "stop" instruction execution to trap back into the
> > hypervisor. We then emulate a wakeup from the stop
> > instruction (Patch 1 of this series).
> >
> > Guest Kernel:
> > If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> > stop0lite cpuidle state.
> >
> > This allows us to migrate the KVM guest across any POWER9
> > Hypervisor. The minimal addition that the Hypervisor would need is
> > Patch 1 of this series.
>
> That could be workable. Some caveats, though:
>
> * How does the latency of the trap-and-emulate compare to the guest
> using H_CEDE in the first place? i.e. how big a negative impact
> will this have for guests running on DD2.2 hosts?
The wakeup latency of trap-and-emulated stop0lite (referred to as
"stop0lite Emulated" in the tables below) the compares favorably
compared to H_CEDE. It is in the order of 5-6us while the wakeup
latency of H_CEDE is ~25-30us.
===================================
Wakeup Latency measured using a timer (in ns) [Lower is better]
===================================
Idle state | Nr samples | Min | Max | Median | Avg | Stddev|
----------------------------------------------------------------------
snooze | 60 | 787 | 1059 | 938 | 937.4 | 42.27 |
----------------------------------------------------------------------
stop0lite | 60 | 770 | 1182 | 948 | 946.4 | 67.41 |
----------------------------------------------------------------------
stop0lite | 60 | 2378 | 7659 | 5006 |5093.6 |1578.7 |
Emulated | | | | | | |
----------------------------------------------------------------------
Shared CEDE| 60 | 9550 | 36694 | 29219 |28564.1|3545.9 |
===================================
===================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
===================================
Idle state | Nr | Min | Max | Median | Avg | Stddev |
|samples | | | | | |
----------------------------------------------------------------------
snooze | 60 | 2089| 4228| 2259| 2342.31| 316.56|
----------------------------------------------------------------------
stop0lite | 60 | 1947| 3674| 2653| 2610.57| 266.73|
----------------------------------------------------------------------
stop0lite | 60 | 3580| 8154| 5596| 5644.95| 1368.44|
Emulated | | | | | | |
----------------------------------------------------------------------
Shared CEDE| 60 | 20147| 36305| 21827| 26762.65| 6875.01|
===================================
>
> * We'll only be able to enable this in a new qemu machine type
> version (say, pseries-5.1.0). Otherwise a guest could start
> thinking it can use stop states, then be migrated to an older qemu
> or host kernel without the support and crash.
That makese sense. In fact in the case of not being able to backport
Patch 1 to all the older hypervisor kernels, we will need a way of
gating the guest from using stop-states and then migrating onto an
older hypervisor kernel. Associating this with a new qemu machine type
version should solve this problem, assuming that all the newer qemus
will also be running on newer hypervisor kernels.
>
> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson
--
Thanks and Regards
gautham.
WARNING: multiple messages have this Message-ID (diff)
From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>,
Michael Neuling <mikey@neuling.org>,
Nicholas Piggin <npiggin@gmail.com>,
Bharata B Rao <bharata@linux.ibm.com>,
linuxppc-dev@ozlabs.org, kvm-ppc@vger.kernel.org,
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry
Date: Mon, 13 Apr 2020 15:55:49 +0530 [thread overview]
Message-ID: <20200413102549.GA22532@in.ibm.com> (raw)
In-Reply-To: <20200408022957.GC44664@umbus.fritz.box>
Hello David,
On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > Hello David,
> >
> > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > >
> > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > scheduling in the guest vCPU.
> > > > > >
> > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > Bugs (starting POWER9 DD2.3).
> > > > >
> > > > > How will guests know that they can use this facility safely after your
> > > > > series? You need both DD2.3 and a patched KVM.
> > > >
> > > >
> > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > state in guest works.
> > > >
> > > > However, to answer your question, this is the scheme that I had in
> > > > mind :
> > > >
> > > > OPAL:
> > > > On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > >
> > > > Hypervisor Kernel:
> > > > 1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > > we set bool enable_guest_stop = true;
> > > >
> > > > 2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > > enable_guest_stop == true.
> > > >
> > > > 3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > > KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > >
> > > > QEMU:
> > > > Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > > indicate the presence to the guest via device tree.
> > >
> > > Nack. Presenting different capabilities to the guest depending on
> > > host capabilities (rather than explicit options) is never ok. It
> > > means that depending on the system you start on you may or may not be
> > > able to migrate to other systems that you're supposed to be able to,
> >
> > I agree that blocking migration for the unavailability of this feature
> > is not desirable. Could you point me to some other capabilities in KVM
> > which have been implemented via explicit options?
>
> TBH, most of the options for the 'pseries' machine type are in this
> category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> resizing extension), ic-mode (which irq controllers are available to
> the guest).
Thanks. I will follow this suit.
>
> > The ISA 3.0 allows the guest to execute the "stop" instruction.
>
> So, this was a bug in DD2.2's implementation of the architecture?
Yes, the previous versions could miss wakeup events when stop was
executed in HV=0,PR=0 mode. So, the hypervisor had to block that.
>
> > If the
> > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > Exception, thus giving the hypervisor a chance to emulate the
> > instruction. However, in the current code, when the hypervisor
> > receives this exception, it sends a PROGKILL to the guest which
> > results in crashing the guest.
> >
> > Patch 1 of this series emulates wakeup from the "stop"
> > instruction. Would the following scheme be ok?
> >
> > OPAL:
> > On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> >
> > Hypervisor Kernel:
> >
> > If "idle-stop-guest" dt feature is available, then, before
> > entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > bits allowing the guest to safely execute stop instruction.
> >
> > If "idle-stop-guest" dt feature is not available, then, the
> > Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > guest "stop" instruction execution to trap back into the
> > hypervisor. We then emulate a wakeup from the stop
> > instruction (Patch 1 of this series).
> >
> > Guest Kernel:
> > If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> > stop0lite cpuidle state.
> >
> > This allows us to migrate the KVM guest across any POWER9
> > Hypervisor. The minimal addition that the Hypervisor would need is
> > Patch 1 of this series.
>
> That could be workable. Some caveats, though:
>
> * How does the latency of the trap-and-emulate compare to the guest
> using H_CEDE in the first place? i.e. how big a negative impact
> will this have for guests running on DD2.2 hosts?
The wakeup latency of trap-and-emulated stop0lite (referred to as
"stop0lite Emulated" in the tables below) the compares favorably
compared to H_CEDE. It is in the order of 5-6us while the wakeup
latency of H_CEDE is ~25-30us.
======================================================================
Wakeup Latency measured using a timer (in ns) [Lower is better]
======================================================================
Idle state | Nr samples | Min | Max | Median | Avg | Stddev|
----------------------------------------------------------------------
snooze | 60 | 787 | 1059 | 938 | 937.4 | 42.27 |
----------------------------------------------------------------------
stop0lite | 60 | 770 | 1182 | 948 | 946.4 | 67.41 |
----------------------------------------------------------------------
stop0lite | 60 | 2378 | 7659 | 5006 |5093.6 |1578.7 |
Emulated | | | | | | |
----------------------------------------------------------------------
Shared CEDE| 60 | 9550 | 36694 | 29219 |28564.1|3545.9 |
======================================================================
======================================================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
======================================================================
Idle state | Nr | Min | Max | Median | Avg | Stddev |
|samples | | | | | |
----------------------------------------------------------------------
snooze | 60 | 2089| 4228| 2259| 2342.31| 316.56|
----------------------------------------------------------------------
stop0lite | 60 | 1947| 3674| 2653| 2610.57| 266.73|
----------------------------------------------------------------------
stop0lite | 60 | 3580| 8154| 5596| 5644.95| 1368.44|
Emulated | | | | | | |
----------------------------------------------------------------------
Shared CEDE| 60 | 20147| 36305| 21827| 26762.65| 6875.01|
======================================================================
>
> * We'll only be able to enable this in a new qemu machine type
> version (say, pseries-5.1.0). Otherwise a guest could start
> thinking it can use stop states, then be migrated to an older qemu
> or host kernel without the support and crash.
That makese sense. In fact in the case of not being able to backport
Patch 1 to all the older hypervisor kernels, we will need a way of
gating the guest from using stop-states and then migrating onto an
older hypervisor kernel. Associating this with a new qemu machine type
version should solve this problem, assuming that all the newer qemus
will also be running on newer hypervisor kernels.
>
> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson
--
Thanks and Regards
gautham.
next prev parent reply other threads:[~2020-04-13 10:37 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-31 12:10 [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest Gautham R. Shenoy
2020-03-31 12:22 ` Gautham R. Shenoy
2020-03-31 12:10 ` [RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop Gautham R. Shenoy
2020-03-31 12:22 ` Gautham R. Shenoy
2020-04-03 2:15 ` Nicholas Piggin
2020-04-03 2:15 ` Nicholas Piggin
2020-03-31 12:10 ` [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry Gautham R. Shenoy
2020-03-31 12:22 ` Gautham R. Shenoy
2020-04-03 2:20 ` Nicholas Piggin
2020-04-03 2:20 ` Nicholas Piggin
2020-04-03 9:31 ` Gautham R Shenoy
2020-04-03 9:43 ` Gautham R Shenoy
2020-04-06 9:58 ` David Gibson
2020-04-06 9:58 ` David Gibson
2020-04-07 13:25 ` Gautham R Shenoy
2020-04-07 13:37 ` Gautham R Shenoy
2020-04-08 2:29 ` David Gibson
2020-04-08 2:29 ` David Gibson
2020-04-13 10:25 ` Gautham R Shenoy [this message]
2020-04-13 10:37 ` Gautham R Shenoy
2020-04-14 2:17 ` David Gibson
2020-04-14 2:17 ` David Gibson
2020-04-07 12:33 ` Gautham R Shenoy
2020-04-07 12:45 ` Gautham R Shenoy
2020-03-31 12:10 ` [RFC/PATCH 3/3] cpuidle/pseries: Add stop0lite state Gautham R. Shenoy
2020-03-31 12:22 ` Gautham R. Shenoy
2020-03-31 12:14 ` [RFC/PATCH 0/3] Add support for stop instruction inside KVM guest Gautham R Shenoy
2020-03-31 12:26 ` Gautham R Shenoy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200413102549.GA22532@in.ibm.com \
--to=ego@linux.vnet.ibm.com \
--cc=bharata@linux.ibm.com \
--cc=david@gibson.dropbear.id.au \
--cc=kvm-ppc@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=linuxppc-dev@ozlabs.org \
--cc=mikey@neuling.org \
--cc=npiggin@gmail.com \
--cc=svaidy@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.