* [PATCH 0/2] ARM/arm64: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-07 15:40 UTC
To: linux-arm-kernel

This is a respin of a patch I posted a long while ago, this time with
numbers that I hope are convincing enough.

The basic idea is that spinning on WFE in a guest is a waste of
resources, and that we're better off running another vcpu instead.
This especially shows when the system is oversubscribed: the guest
vcpus can be seen spinning, waiting for a lock to be released, while
the lock holder is nowhere near a physical CPU.

This patch series just enables WFE trapping on both ARM and arm64, and
calls kvm_vcpu_on_spin(). This is enough to boost other vcpus, and
dramatically reduce the overhead.

Branch available at:
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/wfe-trap

Marc Zyngier (2):
  ARM: KVM: Yield CPU when vcpu executes a WFE
  arm64: KVM: Yield CPU when vcpu executes a WFE

 arch/arm/include/asm/kvm_arm.h   |  4 +++-
 arch/arm/kvm/handle_exit.c       |  6 +++++-
 arch/arm64/include/asm/kvm_arm.h |  8 ++++++--
 arch/arm64/kvm/handle_exit.c     | 18 +++++++++++++-----
 4 files changed, 27 insertions(+), 9 deletions(-)

--
1.8.2.3

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-07 15:40 UTC
To: linux-arm-kernel

On an (even slightly) oversubscribed system, spinlocks quickly become
a bottleneck, as some vcpus spin, waiting for a lock to be released,
while the vcpu holding the lock may not be running at all.

This creates contention, and the observed slowdown is 40x for
hackbench. No, this isn't a typo.

The solution is to trap blocking WFEs and tell KVM that we're
now spinning. This ensures that other vcpus will get a scheduling
boost, allowing the lock to be released more quickly.

From a performance point of view: hackbench 1 process 1000

2xA15 host (baseline):	1.843s

2xA15 guest w/o patch:	2.083s
4xA15 guest w/o patch:	80.212s

2xA15 guest w/ patch:	2.072s
4xA15 guest w/ patch:	3.202s

So we go from a 40x degradation to 1.5x, which is vaguely more
acceptable.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_arm.h | 4 +++-
 arch/arm/kvm/handle_exit.c     | 6 +++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index 64e9696..693d5b2 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -67,7 +67,7 @@
  */
 #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
 			HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-			HCR_SWIO | HCR_TIDCP)
+			HCR_TWE | HCR_SWIO | HCR_TIDCP)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
 /* System Control Register (SCTLR) bits */
@@ -208,6 +208,8 @@
 #define HSR_EC_DABT	(0x24)
 #define HSR_EC_DABT_HYP	(0x25)
 
+#define HSR_WFI_IS_WFE	(1U << 0)
+
 #define HSR_HVC_IMM_MASK ((1UL << 16) - 1)
 
 #define HSR_DABT_S1PTW	(1U << 7)
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index df4c82d..c4c496f 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run)
 static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	trace_kvm_wfi(*vcpu_pc(vcpu));
-	kvm_vcpu_block(vcpu);
+	if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
+		kvm_vcpu_on_spin(vcpu);
+	else
+		kvm_vcpu_block(vcpu);
+
 	return 1;
 }

--
1.8.2.3

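For reference: kvm_vcpu_on_spin() implements KVM's directed yield. It
walks the VM's other vcpus and hands the physical CPU to one that looks
runnable, in the hope that it is the preempted lock holder. Below is a
rough sketch, simplified from virt/kvm/kvm_main.c of that era; the real
code also rotates through vcpus starting at last_boosted_vcpu and makes
two passes:

/* Sketch only, not the exact upstream implementation. */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me)
			continue;
		/* A vcpu sleeping on its wait queue is blocked (e.g. in
		 * WFI), not spinning; boosting it would be pointless. */
		if (waitqueue_active(&vcpu->wq))
			continue;
		/* Directed yield: donate our timeslice to this candidate. */
		if (kvm_vcpu_yield_to(vcpu) > 0)
			break;
	}
}
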
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Alexander Graf @ 2013-10-07 16:04 UTC
To: linux-arm-kernel

On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@arm.com> wrote:

> On an (even slightly) oversubscribed system, spinlocks quickly become
> a bottleneck, as some vcpus spin, waiting for a lock to be released,
> while the vcpu holding the lock may not be running at all.
>
> This creates contention, and the observed slowdown is 40x for
> hackbench. No, this isn't a typo.
>
> The solution is to trap blocking WFEs and tell KVM that we're
> now spinning. This ensures that other vcpus will get a scheduling
> boost, allowing the lock to be released more quickly.
>
> From a performance point of view: hackbench 1 process 1000
>
> 2xA15 host (baseline):	1.843s
>
> 2xA15 guest w/o patch:	2.083s
> 4xA15 guest w/o patch:	80.212s
>
> 2xA15 guest w/ patch:	2.072s
> 4xA15 guest w/ patch:	3.202s

I'm confused. You go from 2.083s when not exiting on spin locks to
2.072s when exiting on _every_ spin lock that didn't immediately
succeed. I would've expected the second number to be worse rather than
better. I assume it's within jitter; I'm still puzzled why you don't
see any significant drop in performance.

Alex

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-07 16:16 UTC
To: linux-arm-kernel

On 07/10/13 17:04, Alexander Graf wrote:
> [...]
> I'm confused. You go from 2.083s when not exiting on spin locks to
> 2.072s when exiting on _every_ spin lock that didn't immediately
> succeed. I would've expected the second number to be worse rather
> than better. I assume it's within jitter; I'm still puzzled why you
> don't see any significant drop in performance.

The key is in the ARM ARM:

B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
mode other than Hyp mode, execution of a WFE instruction generates a Hyp
Trap exception if, ignoring the value of the HCR.TWE bit, conditions
permit the processor to suspend execution."

So, on a non-overcommitted system, you rarely hit a blocking spinlock,
hence no trapping. Otherwise, performance would go down the drain very
quickly.

And yes, the difference is pretty much noise.

	M.
--
Jazz is not dead. It just smells funny...

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Alexander Graf @ 2013-10-07 16:30 UTC
To: linux-arm-kernel

On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
> [...]
> The key is in the ARM ARM:
>
> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
> permit the processor to suspend execution."
>
> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
> hence no trapping. Otherwise, performance would go down the drain very
> quickly.

Well, it's the same as pause/loop exiting on x86, but there we have
special hardware features to only ever exit after n turnarounds. I
wonder why we have those when we could just as easily exit on every
blocking path.

I assume you simply don't contend on spin locks yet. Once you have
more guest cores, things would look different. So once you have a
system with more cores available, it might make sense to measure it
again.

Until then, the numbers are impressive.

Alex

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Gleb Natapov @ 2013-10-07 16:53 UTC
To: linux-arm-kernel

On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
> [...]
> Well, it's the same as pause/loop exiting on x86, but there we have
> special hardware features to only ever exit after n turnarounds. I
> wonder why we have those when we could just as easily exit on every
> blocking path.
>
It will hurt performance if the vcpu that holds the lock is running.
Ideally you want to exit to the hypervisor only if the lock holder is
preempted, but there is no way to know that, so you spin for a short
time, and if the lock is not released it means the lock holder is
preempted (a spinlock should not be held for a long time, after all),
so you exit.

--
			Gleb.

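For context: the x86 "special hardware features" are the VMX
pause-loop-exiting controls, which KVM exposes as module parameters in
arch/x86/kvm/vmx.c. The values below are illustrative defaults from
kernels of that era and may differ between versions:

/* The CPU counts back-to-back PAUSEs spaced less than ple_gap cycles
 * apart, and triggers a PAUSE-loop exit once the guest has been
 * spinning for roughly ple_window cycles. */
static int ple_gap = 128;
static int ple_window = 4096;
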
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Alexander Graf @ 2013-10-09 13:09 UTC
To: linux-arm-kernel

On 07.10.2013, at 18:53, Gleb Natapov <gleb@redhat.com> wrote:
> [...]
> It will hurt performance if the vcpu that holds the lock is running.

Apparently not so on ARM. At least that's what Marc's numbers are
showing. I'm not sure what exactly that means. Basically his logic is
"if we spin, the holder must have been preempted". And it seems to
work out surprisingly well.

Alex

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Gleb Natapov @ 2013-10-09 13:26 UTC
To: linux-arm-kernel

On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
> [...]
> Apparently not so on ARM. At least that's what Marc's numbers are
> showing. I'm not sure what exactly that means. Basically his logic is
> "if we spin, the holder must have been preempted". And it seems to
> work out surprisingly well.
>
For uncontended locks it makes sense. We need to recheck whether the
x86 assumption is still true there, but the x86 lock is ticketing,
which has not only the lock holder preemption problem but also the
lock waiter preemption problem, which makes the overcommit problem
even worse.

--
			Gleb.

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-09 14:18 UTC
To: linux-arm-kernel

On 09/10/13 14:26, Gleb Natapov wrote:
> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>> [...]
>> Apparently not so on ARM. At least that's what Marc's numbers are
>> showing. I'm not sure what exactly that means. Basically his logic
>> is "if we spin, the holder must have been preempted". And it seems
>> to work out surprisingly well.

Yes. I basically assume that contention should be rare, and that
ending up in a *blocking* WFE is a sign that we're in thrashing mode
already (no event is pending).

> For uncontended locks it makes sense. We need to recheck whether the
> x86 assumption is still true there, but the x86 lock is ticketing,
> which has not only the lock holder preemption problem but also the
> lock waiter preemption problem, which makes the overcommit problem
> even worse.

Locks are ticketing on ARM as well. But there is one key difference
here with x86 (or at least what I understand of it, which is very
close to none): we only trap if we would have blocked anyway. In our
case, it is almost always better to give up the CPU to someone else
rather than wait for some event to take the CPU out of sleep.

	M.
--
Jazz is not dead. It just smells funny...

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Anup Patel @ 2013-10-09 14:50 UTC
To: linux-arm-kernel

On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> [...]
> Yes. I basically assume that contention should be rare, and that
> ending up in a *blocking* WFE is a sign that we're in thrashing mode
> already (no event is pending).
>
> Locks are ticketing on ARM as well. But there is one key difference
> here with x86 (or at least what I understand of it, which is very
> close to none): we only trap if we would have blocked anyway. In our
> case, it is almost always better to give up the CPU to someone else
> rather than wait for some event to take the CPU out of sleep.

Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
1. How is the spin lock implemented in the Guest OS? We cannot assume
   that the underlying Guest OS is always Linux.
2. How bad/good is spin

It will be good if we can enable/disable "Yield CPU when vcpu executes
a WFE

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Anup Patel @ 2013-10-09 14:52 UTC
To: linux-arm-kernel

On Wed, Oct 9, 2013 at 8:20 PM, Anup Patel <anup@brainfault.org> wrote:
> [...]

(Please ignore the previous incomplete reply.)

Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
1. How is the spin lock implemented in the Guest OS?
   (Note: we cannot assume that the underlying Guest OS is always Linux.)
2. How bad/good is spin lock contention in the Guest?
   (Note: here too we cannot assume the loads running on the Guest.)

It will be good if we can enable/disable "Yield CPU when vcpu executes
a WFE" via Kconfig.

--Anup

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-09 14:59 UTC
To: linux-arm-kernel

On 09/10/13 15:50, Anup Patel wrote:
> [...]
> Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
> 1. How is the spin lock implemented in the Guest OS? We cannot assume
>    that the underlying Guest OS is always Linux.
> 2. How bad/good is spin

We do *not* spin. We *sleep*. So instead of taking a nap on a physical
CPU (which is slightly less than useful), we go and run some real
workload. If your guest OS is executing WFE (I'm not implying a lock
here), *and* that WFE is blocking, then I maintain it will be a gain
in the vast majority of the cases.

> It will be good if we can enable/disable "Yield CPU when vcpu
> executes a WFE

Not until someone has shown me a (real) workload where this is
actually detrimental.

	M.
--
Jazz is not dead. It just smells funny...

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Anup Patel @ 2013-10-09 15:10 UTC
To: linux-arm-kernel

On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> [...]
> We do *not* spin. We *sleep*. So instead of taking a nap on a
> physical CPU (which is slightly less than useful), we go and run some
> real workload. If your guest OS is executing WFE (I'm not implying a
> lock here), *and* that WFE is blocking, then I maintain it will be a
> gain in the vast majority of the cases.

What if VCPU A was about to release the lock and VCPU B tries to grab
the same lock? In this case VCPU B gets yielded due to WFE, causing an
unnecessary delay for VCPU B in acquiring the lock. This situation can
happen quite often because spin locks are generally used for
protecting very small portions of code.

>> It will be good if we can enable/disable "Yield CPU when vcpu
>> executes a WFE
>
> Not until someone has shown me a (real) workload where this is
> actually detrimental.

The gains from "Yield CPU when vcpu executes a WFE" are not
significant, and we don't have a consistent improvement when tried
multiple times. Please look at the numbers you reported for multiple
runs. Due to this fact it makes more sense to have a Kconfig option
for this.

--Anup

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-09 15:17 UTC
To: linux-arm-kernel

On 09/10/13 16:10, Anup Patel wrote:
> [...]
> The gains from "Yield CPU when vcpu executes a WFE" are not
> significant, and we don't have a consistent improvement when tried
> multiple times. Please look at the numbers you reported for multiple
> runs. Due to this fact it makes more sense to have a Kconfig option
> for this.

Not significant? I don't know if I should cry or laugh here...

	M.
--
Jazz is not dead. It just smells funny...

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Anup Patel @ 2013-10-09 15:17 UTC
To: linux-arm-kernel

On Wed, Oct 9, 2013 at 8:40 PM, Anup Patel <anup@brainfault.org> wrote:
> [...]
> What if VCPU A was about to release the lock and VCPU B tries to grab
> the same lock? In this case VCPU B gets yielded due to WFE, causing
> an unnecessary delay for VCPU B in acquiring the lock. This situation
> can happen quite often because spin locks are generally used for
> protecting very small portions of code.

It will be interesting to see what hackbench numbers you get if you
don't restrict all Guest VCPUs to the same Host CPU. Let's say a Guest
with 8 VCPUs running on a Host with more than 2 CPUs.

* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
From: Marc Zyngier @ 2013-10-07 16:55 UTC
To: linux-arm-kernel

On 07/10/13 17:30, Alexander Graf wrote:
> [...]
> Well, it's the same as pause/loop exiting on x86, but there we have
> special hardware features to only ever exit after n turnarounds. I
> wonder why we have those when we could just as easily exit on every
> blocking path.

My understanding of x86 is extremely patchy (and of the non-existent
flavour), so I can't really comment on that.

On ARM, WFE normally blocks if no event is pending for this CPU. We
use it on the spinlock slow path, and have a SEV (Send EVent) on
release. Even in the case of a race between entering the slow path and
releasing the spinlock, you may end up executing a non-blocking WFE.
In this case, no trap will occur.

> I assume you simply don't contend on spin locks yet. Once you have
> more guest cores, things would look different. So once you have a
> system with more cores available, it might make sense to measure it
> again.

Indeed. Though the above should probably stay valid even if we have a
different locking strategy. Entering a blocking WFE always means
you're going to block for some time (and no, you don't know how long).

> Until then, the numbers are impressive.

I thought as much...

	M.
--
Jazz is not dead. It just smells funny...

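To make the WFE/SEV dance concrete, here is a minimal sketch of an
ARM-style ticket spinlock. This is simplified and hypothetical: the
real implementation lives in arch/arm/include/asm/spinlock.h and uses
ldrex/strex rather than C11 atomics.

#include <stdatomic.h>

static inline void wfe(void) { __asm__ volatile("wfe" ::: "memory"); }
static inline void sev(void) { __asm__ volatile("dsb\n\tsev" ::: "memory"); }

typedef struct {
	_Atomic unsigned int next;   /* next ticket to hand out */
	_Atomic unsigned int owner;  /* ticket currently being served */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *lock)
{
	unsigned int ticket = atomic_fetch_add(&lock->next, 1);

	/* Slow path: WFE puts the CPU to sleep until a SEV from the
	 * unlocker. With HCR.TWE set, a WFE that would actually block
	 * traps to the hypervisor instead. An uncontended lock never
	 * reaches this loop, so it never traps. */
	while (atomic_load(&lock->owner) != ticket)
		wfe();
}

static void ticket_unlock(ticket_lock_t *lock)
{
	atomic_fetch_add(&lock->owner, 1);
	sev();	/* wake any waiters sleeping in WFE */
}
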
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE 2013-10-07 15:40 ` [PATCH 1/2] ARM: " Marc Zyngier 2013-10-07 16:04 ` Alexander Graf @ 2013-10-08 11:26 ` Raghavendra KT 2013-10-08 12:43 ` Marc Zyngier 1 sibling, 1 reply; 25+ messages in thread From: Raghavendra KT @ 2013-10-08 11:26 UTC (permalink / raw) To: linux-arm-kernel On Mon, Oct 7, 2013 at 9:10 PM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On an (even slightly) oversubscribed system, spinlocks are quickly > becoming a bottleneck, as some vcpus are spinning, waiting for a > lock to be released, while the vcpu holding the lock may not be > running at all. > > This creates contention, and the observed slowdown is 40x for > hackbench. No, this isn't a typo. > > The solution is to trap blocking WFEs and tell KVM that we're > now spinning. This ensures that other vcpus will get a scheduling > boost, allowing the lock to be released more quickly. > > From a performance point of view: hackbench 1 process 1000 > > 2xA15 host (baseline): 1.843s > > 2xA15 guest w/o patch: 2.083s > 4xA15 guest w/o patch: 80.212s > > 2xA15 guest w/ patch: 2.072s > 4xA15 guest w/ patch: 3.202s > > So we go from a 40x degradation to 1.5x, which is vaguely more > acceptable. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > arch/arm/include/asm/kvm_arm.h | 4 +++- > arch/arm/kvm/handle_exit.c | 6 +++++- > 2 files changed, 8 insertions(+), 2 deletions(-) > > diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h > index 64e9696..693d5b2 100644 > --- a/arch/arm/include/asm/kvm_arm.h > +++ b/arch/arm/include/asm/kvm_arm.h > @@ -67,7 +67,7 @@ > */ > #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \ > HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \ > - HCR_SWIO | HCR_TIDCP) > + HCR_TWE | HCR_SWIO | HCR_TIDCP) > #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF) > > /* System Control Register (SCTLR) bits */ > @@ -208,6 +208,8 @@ > #define HSR_EC_DABT (0x24) > #define HSR_EC_DABT_HYP (0x25) > > +#define HSR_WFI_IS_WFE (1U << 0) > + > #define HSR_HVC_IMM_MASK ((1UL << 16) - 1) > > #define HSR_DABT_S1PTW (1U << 7) > diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c > index df4c82d..c4c496f 100644 > --- a/arch/arm/kvm/handle_exit.c > +++ b/arch/arm/kvm/handle_exit.c > @@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run) > static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run) > { > trace_kvm_wfi(*vcpu_pc(vcpu)); > - kvm_vcpu_block(vcpu); > + if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE) > + kvm_vcpu_on_spin(vcpu); Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and check if the PLE handler logic helps further? We would ideally get one more optimization folded into the PLE handler if you enable that. ^ permalink raw reply [flat|nested] 25+ messages in thread
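For context on the suggestion: CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT adds per-vcpu spin-loop bookkeeping that kvm_vcpu_on_spin() uses to filter directed-yield candidates. Roughly, as a sketch from memory of the generic code of that era rather than the verbatim source:

#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
/*
 * A vcpu that was itself spinning recently is probably a lock
 * waiter, not the holder, so it makes a poor yield target.
 * Eligibility alternates on each check so such a vcpu is not
 * starved of boosts forever.
 */
static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
{
	bool eligible;

	eligible = !vcpu->spin_loop.in_spin_loop ||
		    vcpu->spin_loop.dy_eligible;

	if (vcpu->spin_loop.in_spin_loop)
		kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);

	return eligible;
}
#endif

With only a handful of vcpus there is little for this filter to do, which is consistent with the small regression Marc measures below.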
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE 2013-10-08 11:26 ` Raghavendra KT @ 2013-10-08 12:43 ` Marc Zyngier 2013-10-08 15:02 ` Raghavendra K T 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2013-10-08 12:43 UTC (permalink / raw) To: linux-arm-kernel On 08/10/13 12:26, Raghavendra KT wrote: > On Mon, Oct 7, 2013 at 9:10 PM, Marc Zyngier <marc.zyngier@arm.com> wrote: >> On an (even slightly) oversubscribed system, spinlocks are quickly >> becoming a bottleneck, as some vcpus are spinning, waiting for a >> lock to be released, while the vcpu holding the lock may not be >> running at all. >> >> This creates contention, and the observed slowdown is 40x for >> hackbench. No, this isn't a typo. >> >> The solution is to trap blocking WFEs and tell KVM that we're >> now spinning. This ensures that other vcpus will get a scheduling >> boost, allowing the lock to be released more quickly. >> >> From a performance point of view: hackbench 1 process 1000 >> >> 2xA15 host (baseline): 1.843s >> >> 2xA15 guest w/o patch: 2.083s >> 4xA15 guest w/o patch: 80.212s >> >> 2xA15 guest w/ patch: 2.072s >> 4xA15 guest w/ patch: 3.202s >> >> So we go from a 40x degradation to 1.5x, which is vaguely more >> acceptable. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> arch/arm/include/asm/kvm_arm.h | 4 +++- >> arch/arm/kvm/handle_exit.c | 6 +++++- >> 2 files changed, 8 insertions(+), 2 deletions(-) >> >> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h >> index 64e9696..693d5b2 100644 >> --- a/arch/arm/include/asm/kvm_arm.h >> +++ b/arch/arm/include/asm/kvm_arm.h >> @@ -67,7 +67,7 @@ >> */ >> #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \ >> HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \ >> - HCR_SWIO | HCR_TIDCP) >> + HCR_TWE | HCR_SWIO | HCR_TIDCP) >> #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF) >> >> /* System Control Register (SCTLR) bits */ >> @@ -208,6 +208,8 @@ >> #define HSR_EC_DABT (0x24) >> #define HSR_EC_DABT_HYP (0x25) >> >> +#define HSR_WFI_IS_WFE (1U << 0) >> + >> #define HSR_HVC_IMM_MASK ((1UL << 16) - 1) >> >> #define HSR_DABT_S1PTW (1U << 7) >> diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c >> index df4c82d..c4c496f 100644 >> --- a/arch/arm/kvm/handle_exit.c >> +++ b/arch/arm/kvm/handle_exit.c >> @@ -84,7 +84,11 @@ static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run) >> static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run) >> { >> trace_kvm_wfi(*vcpu_pc(vcpu)); >> - kvm_vcpu_block(vcpu); >> + if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE) >> + kvm_vcpu_on_spin(vcpu); > > Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and > check if the PLE handler logic helps further? > We would ideally get one more optimization folded into the PLE handler if > you enable that. Just gave it a go, and the results are slightly (but consistently) worse. Over 10 runs: Without RELAX_INTERCEPT: Average run 3.3623s With RELAX_INTERCEPT: Average run 3.4226s Not massive, but still noticeable. Any clue? M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE 2013-10-08 12:43 ` Marc Zyngier @ 2013-10-08 15:02 ` Raghavendra K T 2013-10-08 15:06 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Raghavendra K T @ 2013-10-08 15:02 UTC (permalink / raw) To: linux-arm-kernel [...] >>> + kvm_vcpu_on_spin(vcpu); >> >> Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and >> check if the PLE handler logic helps further? >> We would ideally get one more optimization folded into the PLE handler if >> you enable that. > > Just gave it a go, and the results are slightly (but consistently) > worse. Over 10 runs: > > Without RELAX_INTERCEPT: Average run 3.3623s > With RELAX_INTERCEPT: Average run 3.4226s > > Not massive, but still noticeable. Any clue? Is it a 4x overcommit? Probably we are just seeing the overhead of the extra code if these are small guests. RELAX_INTERCEPT is worth enabling for large guests with overcommits. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE 2013-10-08 15:02 ` Raghavendra K T @ 2013-10-08 15:06 ` Marc Zyngier 2013-10-08 15:13 ` Raghavendra K T 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2013-10-08 15:06 UTC (permalink / raw) To: linux-arm-kernel On 08/10/13 16:02, Raghavendra K T wrote: > [...] >>>> + kvm_vcpu_on_spin(vcpu); >>> >>> Could you also enable CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT for arm and >>> check if the PLE handler logic helps further? >>> We would ideally get one more optimization folded into the PLE handler if >>> you enable that. >> >> Just gave it a go, and the results are slightly (but consistently) >> worse. Over 10 runs: >> >> Without RELAX_INTERCEPT: Average run 3.3623s >> With RELAX_INTERCEPT: Average run 3.4226s >> >> Not massive, but still noticeable. Any clue? > > Is it a 4x overcommit? Probably we are just seeing the overhead of the > extra code if these are small guests. Only 2x overcommit (dual core host, quad vcpu guests). > >> RELAX_INTERCEPT is worth enabling for large guests with >> overcommits. I'll try something more aggressive as soon as I get the time. What do you call a large guest? So far, the hard limit on ARM is 8 vcpus. M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE 2013-10-08 15:06 ` Marc Zyngier @ 2013-10-08 15:13 ` Raghavendra K T 2013-10-08 16:09 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Raghavendra K T @ 2013-10-08 15:13 UTC (permalink / raw) To: linux-arm-kernel On 10/08/2013 08:36 PM, Marc Zyngier wrote: >>> Just gave it a go, and the results are slightly (but consistently) >>> worse. Over 10 runs: >>> >>> Without RELAX_INTERCEPT: Average run 3.3623s >>> With RELAX_INTERCEPT: Average run 3.4226s >>> >>> Not massive, but still noticeable. Any clue? >> >> Is it a 4x overcommit? Probably we are just seeing the overhead of the >> extra code if these are small guests. > > Only 2x overcommit (dual core host, quad vcpu guests). Okay, quad vcpu guests seem to explain it. > >> RELAX_INTERCEPT is worth enabling for large guests with >> overcommits. > > I'll try something more aggressive as soon as I get the time. What do > you call a large guest? So far, the hard limit on ARM is 8 vcpus. > Okay. I was referring to guests with >= 32 vcpus. Maybe 8-vcpu guests with 2x/4x overcommit are worth trying. If we still do not see a benefit, then it is not worth enabling. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE 2013-10-08 15:13 ` Raghavendra K T @ 2013-10-08 16:09 ` Marc Zyngier 0 siblings, 0 replies; 25+ messages in thread From: Marc Zyngier @ 2013-10-08 16:09 UTC (permalink / raw) To: linux-arm-kernel On 08/10/13 16:13, Raghavendra K T wrote: > On 10/08/2013 08:36 PM, Marc Zyngier wrote: >>>> Just gave it a go, and the results are slightly (but consistently) >>>> worse. Over 10 runs: >>>> >>>> Without RELAX_INTERCEPT: Average run 3.3623s >>>> With RELAX_INTERCEPT: Average run 3.4226s >>>> >>>> Not massive, but still noticeable. Any clue? >>> >>> Is it a 4x overcommit? Probably we are just seeing the overhead of the >>> extra code if these are small guests. >> >> Only 2x overcommit (dual core host, quad vcpu guests). > > Okay, quad vcpu guests seem to explain it. > >> >>> RELAX_INTERCEPT is worth enabling for large guests with >>> overcommits. >> >> I'll try something more aggressive as soon as I get the time. What do >> you call a large guest? So far, the hard limit on ARM is 8 vcpus. >> > > Okay. I was referring to guests with >= 32 vcpus. > Maybe 8-vcpu guests with 2x/4x overcommit are worth trying. If we still do not > see a benefit, then it is not worth enabling. I've just tried with the worst case I can construct, which is an 8 vcpu guest limited to one physical CPU: Over 10 runs: Without RELAX_INTERCEPT: Time: 6.793 Time: 7.619 Time: 6.690 Time: 7.198 Time: 7.659 Time: 7.054 Time: 7.728 Time: 8.546 Time: 7.306 Time: 7.219 Average: 7.381 With RELAX_INTERCEPT: Time: 6.850 Time: 6.889 Time: 7.170 Time: 6.938 Time: 6.756 Time: 7.341 Time: 6.707 Time: 7.452 Time: 6.617 Time: 8.095 Average: 7.082 We're now starting to see some (small) benefits: slightly faster with RELAX_INTERCEPT, and less jitter (the heuristic is better at picking the target vcpu than the default behaviour). I'll enable it in the next version of the series. Thanks! M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE 2013-10-07 15:40 [PATCH 0/2] ARM/arm64: KVM: Yield CPU when vcpu executes a WFE Marc Zyngier 2013-10-07 15:40 ` [PATCH 1/2] ARM: " Marc Zyngier @ 2013-10-07 15:40 ` Marc Zyngier 2013-10-07 15:52 ` Bhushan Bharat-R65777 1 sibling, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2013-10-07 15:40 UTC (permalink / raw) To: linux-arm-kernel On an (even slightly) oversubscribed system, spinlocks are quickly becoming a bottleneck, as some vcpus are spinning, waiting for a lock to be released, while the vcpu holding the lock may not be running at all. The solution is to trap blocking WFEs and tell KVM that we're now spinning. This ensures that other vcpus will get a scheduling boost, allowing the lock to be released more quickly. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> --- arch/arm64/include/asm/kvm_arm.h | 8 ++++++-- arch/arm64/kvm/handle_exit.c | 18 +++++++++++++----- 2 files changed, 19 insertions(+), 7 deletions(-) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index a5f28e2..c98ef47 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -63,6 +63,7 @@ * TAC: Trap ACTLR * TSC: Trap SMC * TSW: Trap cache operations by set/way + * TWE: Trap WFE * TWI: Trap WFI * TIDCP: Trap L2CTLR/L2ECTLR * BSU_IS: Upgrade barriers to the inner shareable domain @@ -72,8 +73,9 @@ * FMO: Override CPSR.F and enable signaling with VF * SWIO: Turn set/way invalidates into set/way clean+invalidate */ -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \ - HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \ +#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \ + HCR_BSU_IS | HCR_FB | HCR_TAC | \ + HCR_AMO | HCR_IMO | HCR_FMO | \ HCR_SWIO | HCR_TIDCP | HCR_RW) #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF) @@ -242,4 +244,6 @@ #define ESR_EL2_EC_xABT_xFSR_EXTABT 0x10 +#define ESR_EL2_EC_WFI_ISS_WFE (1 << 0) + #endif /* __ARM64_KVM_ARM_H__ */ diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index 9beaca03..8da5606 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -47,21 +47,29 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run) } /** - * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a guest + * kvm_handle_wfx - handle a wait-for-interrupts or wait-for-event + * instruction executed by a guest + * * @vcpu: the vcpu pointer * - * Simply call kvm_vcpu_block(), which will halt execution of + * WFE: Yield the CPU and come back to this vcpu when the scheduler + * decides to. + * WFI: Simply call kvm_vcpu_block(), which will halt execution of + * world-switches and schedule other host processes until there is an * incoming IRQ or FIQ to the VM. */ -static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run) +static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run) { - kvm_vcpu_block(vcpu); + if (kvm_vcpu_get_hsr(vcpu) & ESR_EL2_EC_WFI_ISS_WFE) + kvm_vcpu_on_spin(vcpu); + else + kvm_vcpu_block(vcpu); + return 1; } static exit_handle_fn arm_exit_handlers[] = { - [ESR_EL2_EC_WFI] = kvm_handle_wfi, + [ESR_EL2_EC_WFI] = kvm_handle_wfx, [ESR_EL2_EC_CP15_32] = kvm_handle_cp15_32, [ESR_EL2_EC_CP15_64] = kvm_handle_cp15_64, [ESR_EL2_EC_CP14_MR] = kvm_handle_cp14_access, -- 1.8.2.3 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE 2013-10-07 15:40 ` [PATCH 2/2] arm64: " Marc Zyngier @ 2013-10-07 15:52 ` Bhushan Bharat-R65777 2013-10-07 16:00 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Bhushan Bharat-R65777 @ 2013-10-07 15:52 UTC (permalink / raw) To: linux-arm-kernel > -----Original Message----- > From: Marc Zyngier [mailto:marc.zyngier at arm.com] > Sent: Monday, October 07, 2013 9:11 PM > To: linux-arm-kernel at lists.infradead.org; kvmarm at lists.cs.columbia.edu; > kvm at vger.kernel.org > Subject: [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE > > On an (even slightly) oversubscribed system, spinlocks are quickly becoming a > bottleneck, as some vcpus are spinning, waiting for a lock to be released, while > the vcpu holding the lock may not be running at all. > > The solution is to trap blocking WFEs and tell KVM that we're now spinning. This > ensures that other vcpus will get a scheduling boost, allowing the lock to be > released more quickly. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > arch/arm64/include/asm/kvm_arm.h | 8 ++++++-- > arch/arm64/kvm/handle_exit.c | 18 +++++++++++++----- > 2 files changed, 19 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h > index a5f28e2..c98ef47 100644 > --- a/arch/arm64/include/asm/kvm_arm.h > +++ b/arch/arm64/include/asm/kvm_arm.h > @@ -63,6 +63,7 @@ > * TAC: Trap ACTLR > * TSC: Trap SMC > * TSW: Trap cache operations by set/way > + * TWE: Trap WFE > * TWI: Trap WFI > * TIDCP: Trap L2CTLR/L2ECTLR > * BSU_IS: Upgrade barriers to the inner shareable domain > @@ -72,8 +73,9 @@ > * FMO: Override CPSR.F and enable signaling with VF > * SWIO: Turn set/way invalidates into set/way clean+invalidate > */ > -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \ > - HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \ > +#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \ > + HCR_BSU_IS | HCR_FB | HCR_TAC | \ > + HCR_AMO | HCR_IMO | HCR_FMO | \ > HCR_SWIO | HCR_TIDCP | HCR_RW) > #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF) > > @@ -242,4 +244,6 @@ > > #define ESR_EL2_EC_xABT_xFSR_EXTABT 0x10 > > +#define ESR_EL2_EC_WFI_ISS_WFE (1 << 0) In the other patch this is named WFI_IS_WFE whereas here it is WFI_ISS_WFE, which looks like a typo. Anyway, what I am interested to understand is: what does this macro mean? Thanks -Bharat > + > #endif /* __ARM64_KVM_ARM_H__ */ > diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index > 9beaca03..8da5606 100644 > --- a/arch/arm64/kvm/handle_exit.c > +++ b/arch/arm64/kvm/handle_exit.c > @@ -47,21 +47,29 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run > *run) } > > /** > - * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a > guest > + * kvm_handle_wfx - handle a wait-for-interrupts or wait-for-event > + * instruction executed by a guest > + * > * @vcpu: the vcpu pointer > * > - * Simply call kvm_vcpu_block(), which will halt execution of > + * WFE: Yield the CPU and come back to this vcpu when the scheduler > + * decides to. > + * WFI: Simply call kvm_vcpu_block(), which will halt execution of > * world-switches and schedule other host processes until there is an > * incoming IRQ or FIQ to the VM. 
> */ > -static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run) > +static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run) > { > - kvm_vcpu_block(vcpu); > + if (kvm_vcpu_get_hsr(vcpu) & ESR_EL2_EC_WFI_ISS_WFE) > + kvm_vcpu_on_spin(vcpu); > + else > + kvm_vcpu_block(vcpu); > + > return 1; > } > > static exit_handle_fn arm_exit_handlers[] = { > - [ESR_EL2_EC_WFI] = kvm_handle_wfi, > + [ESR_EL2_EC_WFI] = kvm_handle_wfx, > [ESR_EL2_EC_CP15_32] = kvm_handle_cp15_32, > [ESR_EL2_EC_CP15_64] = kvm_handle_cp15_64, > [ESR_EL2_EC_CP14_MR] = kvm_handle_cp14_access, > -- > 1.8.2.3 > > > > _______________________________________________ > kvmarm mailing list > kvmarm at lists.cs.columbia.edu > https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE 2013-10-07 15:52 ` Bhushan Bharat-R65777 @ 2013-10-07 16:00 ` Marc Zyngier 0 siblings, 0 replies; 25+ messages in thread From: Marc Zyngier @ 2013-10-07 16:00 UTC (permalink / raw) To: linux-arm-kernel On 07/10/13 16:52, Bhushan Bharat-R65777 wrote: > > >> -----Original Message----- >> From: Marc Zyngier [mailto:marc.zyngier at arm.com] >> Sent: Monday, October 07, 2013 9:11 PM >> To: linux-arm-kernel at lists.infradead.org; kvmarm at lists.cs.columbia.edu; >> kvm at vger.kernel.org >> Subject: [PATCH 2/2] arm64: KVM: Yield CPU when vcpu executes a WFE >> >> On an (even slightly) oversubscribed system, spinlocks are quickly becoming a >> bottleneck, as some vcpus are spinning, waiting for a lock to be released, while >> the vcpu holding the lock may not be running at all. >> >> The solution is to trap blocking WFEs and tell KVM that we're now spinning. This >> ensures that other vcpus will get a scheduling boost, allowing the lock to be >> released more quickly. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> arch/arm64/include/asm/kvm_arm.h | 8 ++++++-- >> arch/arm64/kvm/handle_exit.c | 18 +++++++++++++----- >> 2 files changed, 19 insertions(+), 7 deletions(-) >> >> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h >> index a5f28e2..c98ef47 100644 >> --- a/arch/arm64/include/asm/kvm_arm.h >> +++ b/arch/arm64/include/asm/kvm_arm.h >> @@ -63,6 +63,7 @@ >> * TAC: Trap ACTLR >> * TSC: Trap SMC >> * TSW: Trap cache operations by set/way >> + * TWE: Trap WFE >> * TWI: Trap WFI >> * TIDCP: Trap L2CTLR/L2ECTLR >> * BSU_IS: Upgrade barriers to the inner shareable domain >> @@ -72,8 +73,9 @@ >> * FMO: Override CPSR.F and enable signaling with VF >> * SWIO: Turn set/way invalidates into set/way clean+invalidate >> */ >> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \ >> - HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \ >> +#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \ >> + HCR_BSU_IS | HCR_FB | HCR_TAC | \ >> + HCR_AMO | HCR_IMO | HCR_FMO | \ >> HCR_SWIO | HCR_TIDCP | HCR_RW) >> #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF) >> >> @@ -242,4 +244,6 @@ >> >> #define ESR_EL2_EC_xABT_xFSR_EXTABT 0x10 >> >> +#define ESR_EL2_EC_WFI_ISS_WFE (1 << 0) > > In the other patch this is named WFI_IS_WFE whereas here it is WFI_ISS_WFE, which looks like a typo. Anyway, what I am interested to understand is: what does this macro mean? Not a typo. It decodes as: Exception Syndrome Register, Exception Level 2, Exception Class Wait For Interrupt, Instruction Specific Syndrome Wait For Event. The ARM code doesn't have such a convention, so I didn't bother. It just reads "Hyp Syndrome Register, Wait For Interrupt Is Wait For Event". M. 
> Thanks > -Bharat > >> + >> #endif /* __ARM64_KVM_ARM_H__ */ >> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index >> 9beaca03..8da5606 100644 >> --- a/arch/arm64/kvm/handle_exit.c >> +++ b/arch/arm64/kvm/handle_exit.c >> @@ -47,21 +47,29 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run >> *run) } >> >> /** >> - * kvm_handle_wfi - handle a wait-for-interrupts instruction executed by a >> guest >> + * kvm_handle_wfx - handle a wait-for-interrupts or wait-for-event >> + * instruction executed by a guest >> + * >> * @vcpu: the vcpu pointer >> * >> - * Simply call kvm_vcpu_block(), which will halt execution of >> + * WFE: Yield the CPU and come back to this vcpu when the scheduler >> + * decides to. >> + * WFI: Simply call kvm_vcpu_block(), which will halt execution of >> * world-switches and schedule other host processes until there is an >> * incoming IRQ or FIQ to the VM. >> */ >> -static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run) >> +static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run) >> { >> - kvm_vcpu_block(vcpu); >> + if (kvm_vcpu_get_hsr(vcpu) & ESR_EL2_EC_WFI_ISS_WFE) >> + kvm_vcpu_on_spin(vcpu); >> + else >> + kvm_vcpu_block(vcpu); >> + >> return 1; >> } >> >> static exit_handle_fn arm_exit_handlers[] = { >> - [ESR_EL2_EC_WFI] = kvm_handle_wfi, >> + [ESR_EL2_EC_WFI] = kvm_handle_wfx, >> [ESR_EL2_EC_CP15_32] = kvm_handle_cp15_32, >> [ESR_EL2_EC_CP15_64] = kvm_handle_cp15_64, >> [ESR_EL2_EC_CP14_MR] = kvm_handle_cp14_access, >> -- >> 1.8.2.3 >> >> >> >> _______________________________________________ >> kvmarm mailing list >> kvmarm at lists.cs.columbia.edu >> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm > > > -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 25+ messages in thread
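Side by side, the two spellings of the same ISS bit as defined by the two patches in this series:

/* 32-bit ARM: Hyp Syndrome Register, "this WFI trap is really a WFE" */
#define HSR_WFI_IS_WFE		(1U << 0)

/* arm64: ESR_EL2, Exception Class WFI, ISS bit 0 set means WFE */
#define ESR_EL2_EC_WFI_ISS_WFE	(1 << 0)

Both are tested against kvm_vcpu_get_hsr(vcpu) in the respective exit handlers to route the trap to kvm_vcpu_on_spin() instead of kvm_vcpu_block().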
end of thread, other threads:[~2013-10-09 15:17 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-10-07 15:40 [PATCH 0/2] ARM/arm64: KVM: Yield CPU when vcpu executes a WFE Marc Zyngier 2013-10-07 15:40 ` [PATCH 1/2] ARM: " Marc Zyngier 2013-10-07 16:04 ` Alexander Graf 2013-10-07 16:16 ` Marc Zyngier 2013-10-07 16:30 ` Alexander Graf 2013-10-07 16:53 ` Gleb Natapov 2013-10-09 13:09 ` Alexander Graf 2013-10-09 13:26 ` Gleb Natapov 2013-10-09 14:18 ` Marc Zyngier 2013-10-09 14:50 ` Anup Patel 2013-10-09 14:52 ` Anup Patel 2013-10-09 14:59 ` Marc Zyngier 2013-10-09 15:10 ` Anup Patel 2013-10-09 15:17 ` Marc Zyngier 2013-10-09 15:17 ` Anup Patel 2013-10-07 16:55 ` Marc Zyngier 2013-10-08 11:26 ` Raghavendra KT 2013-10-08 12:43 ` Marc Zyngier 2013-10-08 15:02 ` Raghavendra K T 2013-10-08 15:06 ` Marc Zyngier 2013-10-08 15:13 ` Raghavendra K T 2013-10-08 16:09 ` Marc Zyngier 2013-10-07 15:40 ` [PATCH 2/2] arm64: " Marc Zyngier 2013-10-07 15:52 ` Bhushan Bharat-R65777 2013-10-07 16:00 ` Marc Zyngier