* [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling
@ 2024-10-07 16:42 Oliver Upton
2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw)
To: kvmarm
Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
Sean Christopherson, Oliver Upton
v1 -> v2:
- Keep a reference on the MMU so long as the vCPU appears runnable
(i.e. not in WFI) (Sean)
- Add comment clarifying why TLBI unmaps are allowed to reschedule
[*]: https://lore.kernel.org/kvmarm/20241001001709.1303668-1-oliver.upton@linux.dev/#t
Oliver Upton (4):
KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
arch/arm64/include/asm/kvm_host.h | 3 ++
arch/arm64/include/asm/kvm_mmu.h | 3 +-
arch/arm64/include/asm/kvm_nested.h | 4 ++-
arch/arm64/kvm/arm.c | 2 ++
arch/arm64/kvm/mmu.c | 15 +++++-----
arch/arm64/kvm/nested.c | 46 ++++++++++++++++++++++++-----
arch/arm64/kvm/sys_regs.c | 33 +++++++++++++++++++--
7 files changed, 87 insertions(+), 19 deletions(-)
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
--
2.47.0.rc0.187.ge670bccf7e-goog
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out 2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton @ 2024-10-07 16:42 ` Oliver Upton 2024-10-07 18:39 ` Sean Christopherson 2024-10-07 16:42 ` [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton ` (2 subsequent siblings) 3 siblings, 1 reply; 14+ messages in thread From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw) To: kvmarm Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Oliver Upton If a vCPU is scheduling out and not in WFI emulation, it is highly likely it will get scheduled again soon and reuse the MMU it had before. Dropping the MMU at vcpu_put() can have some unfortunate consequences, as the MMU could get reclaimed and used in a different context, forcing another 'cold start' on an otherwise active MMU. Avoid that altogether by keeping a reference on the MMU if the vCPU is scheduling out, ensuring that another vCPU cannot reclaim it while the current vCPU is away. Since there are more MMUs than vCPUs, this does not affect the guarantee that an unused MMU is available at any time. Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible code, at least for where it matters in the stage-2 abort path. Yes, the MMU can change across WFI emulation, but there isn't even a use case where this would matter. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> --- arch/arm64/kvm/nested.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index f9e30dd34c7a..df670c14e1c6 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -663,6 +663,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu) void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) { + /* + * The vCPU kept its reference on the MMU after the last put, keep + * rolling with it. + */ + if (vcpu->arch.hw_mmu) + return; + if (is_hyp_ctxt(vcpu)) { vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu; } else { @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu) { - if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) { + /* + * Keep a reference on the associated stage-2 MMU if the vCPU is + * scheduling out and not in WFI emulation, suggesting it is likely to + * reuse the MMU sometime soon. + */ + if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI)) + return; + + if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) atomic_dec(&vcpu->arch.hw_mmu->refcnt); - vcpu->arch.hw_mmu = NULL; - } + + vcpu->arch.hw_mmu = NULL; } /* -- 2.47.0.rc0.187.ge670bccf7e-goog ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out 2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton @ 2024-10-07 18:39 ` Sean Christopherson 2024-10-07 19:01 ` Oliver Upton 0 siblings, 1 reply; 14+ messages in thread From: Sean Christopherson @ 2024-10-07 18:39 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024, Oliver Upton wrote: > If a vCPU is scheduling out and not in WFI emulation, it is highly > likely it will get scheduled again soon and reuse the MMU it had before. > Dropping the MMU at vcpu_put() can have some unfortunate consequences, > as the MMU could get reclaimed and used in a different context, forcing > another 'cold start' on an otherwise active MMU. > > Avoid that altogether by keeping a reference on the MMU if the vCPU is > scheduling out, ensuring that another vCPU cannot reclaim it while the > current vCPU is away. Since there are more MMUs than vCPUs, this does > not affect the guarantee that an unused MMU is available at any time. > > Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible > code, at least for where it matters in the stage-2 abort path. Yes, the > MMU can change across WFI emulation, but there isn't even a use case > where this would matter. > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev> > --- > arch/arm64/kvm/nested.c | 21 ++++++++++++++++++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c > index f9e30dd34c7a..df670c14e1c6 100644 > --- a/arch/arm64/kvm/nested.c > +++ b/arch/arm64/kvm/nested.c > @@ -663,6 +663,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu) > > void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) > { > + /* > + * The vCPU kept its reference on the MMU after the last put, keep > + * rolling with it. > + */ > + if (vcpu->arch.hw_mmu) > + return; > + > if (is_hyp_ctxt(vcpu)) { > vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu; > } else { > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) > > void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu) > { > - if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) { > + /* > + * Keep a reference on the associated stage-2 MMU if the vCPU is > + * scheduling out and not in WFI emulation, suggesting it is likely to > + * reuse the MMU sometime soon. > + */ > + if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI)) > + return; Assuming KVM arm64 supports halt-polling, I think it makes more sense to check kvm_vcpu_is_blocking() instead of IN_WFI. That way, KVM will keep the MMU resident if the vCPU happens to be preempted and halt-polling is successful. And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking() instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM would only need to put/reload the vGIC if the vCPU is actually scheduled out. Unfortunately, using kvm_vcpu_is_blocking() in vgic_v4_put() won't do the right thing, as KVM deliberately calls prepare_to_rcuwait() _after_ kvm_arch_vcpu_blocking(), because x86's AVIC implementation relies on kvm_vcpu_is_blocking() to be false when reloading after unblocking. Aha! With vcpu->scheduled_out, I think we can swap the order and make it more obvious what's going on in the AVIC code at the same time. With that, I think KVM arm64 could drop IN_WFI and use kvm_vcpu_is_blocking() throughout. E.g. diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 4b74ea91f4e6..bba93d1d65ae 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -1038,13 +1038,12 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) return; /* - * No need to update anything if the vCPU is blocking, i.e. if the vCPU - * is being scheduled in after being preempted. The CPU entries in the - * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'. - * If the vCPU was migrated, its new CPU value will be stuffed when the - * vCPU unblocks. + * No need to update anything if the vCPU is preempted while blocking. + * The CPU entries in the Physical APIC table and IRTE are consumed iff + * IsRun{ning} is '1'. If the vCPU was migrated, its new CPU value + * will be stuffed when the vCPU unblocks. */ - if (kvm_vcpu_is_blocking(vcpu)) + if (kvm_vcpu_is_blocking(vcpu) && vcpu->scheduled_out) return; /* ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out 2024-10-07 18:39 ` Sean Christopherson @ 2024-10-07 19:01 ` Oliver Upton 2024-10-07 19:52 ` Sean Christopherson 0 siblings, 1 reply; 14+ messages in thread From: Oliver Upton @ 2024-10-07 19:01 UTC (permalink / raw) To: Sean Christopherson Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024 at 11:39:37AM -0700, Sean Christopherson wrote: > > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) > > > > void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu) > > { > > - if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) { > > + /* > > + * Keep a reference on the associated stage-2 MMU if the vCPU is > > + * scheduling out and not in WFI emulation, suggesting it is likely to > > + * reuse the MMU sometime soon. > > + */ > > + if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI)) > > + return; > > Assuming KVM arm64 supports halt-polling, I think it makes more sense to check > kvm_vcpu_is_blocking() instead of IN_WFI. That way, KVM will keep the MMU resident > if the vCPU happens to be preempted and halt-polling is successful. > > And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking() > instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM > would only need to put/reload the vGIC if the vCPU is actually scheduled out. That wouldn't be an optimization, it'd render the halt polling loop useless. The most recent view of the guest's CPU interface is sitting in hardware at this point, so we need to synchronize KVM's view of the CPUIF with it to determine if a pending interrupt exceeds the priority mask. On top of that, we need to activate the doorbell IRQ for GICv4 to give KVM a kick when something arrives for the vPE. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out 2024-10-07 19:01 ` Oliver Upton @ 2024-10-07 19:52 ` Sean Christopherson 2024-10-07 21:22 ` Oliver Upton 0 siblings, 1 reply; 14+ messages in thread From: Sean Christopherson @ 2024-10-07 19:52 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024, Oliver Upton wrote: > On Mon, Oct 07, 2024 at 11:39:37AM -0700, Sean Christopherson wrote: > > > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) > > > > > > void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu) > > > { > > > - if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) { > > > + /* > > > + * Keep a reference on the associated stage-2 MMU if the vCPU is > > > + * scheduling out and not in WFI emulation, suggesting it is likely to > > > + * reuse the MMU sometime soon. > > > + */ > > > + if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI)) > > > + return; > > > > Assuming KVM arm64 supports halt-polling, I think it makes more sense to check > > kvm_vcpu_is_blocking() instead of IN_WFI. That way, KVM will keep the MMU resident > > if the vCPU happens to be preempted and halt-polling is successful. > > > > And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking() > > instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM > > would only need to put/reload the vGIC if the vCPU is actually scheduled out. > > That wouldn't be an optimization, it'd render the halt polling loop > useless. > > The most recent view of the guest's CPU interface is sitting in > hardware at this point, so we need to synchronize KVM's view of > the CPUIF with it to determine if a pending interrupt exceeds the > priority mask. Hmm, but what happens if the wakeup event arrives before IN_WFI is set? Ah, IIUC, GICR_VPENDBASER_PendingLast tracks if there's a pending event, KVM makes sure to read PendingLast after making vPE non-resident, and hardware is required to ensure either PendingLast=1 or a doorbell is signaled. Niave question time: why not query GICR_VPENDBASER_PendingLast directly in kvm_vgic_vcpu_pending_irq() instead of putting the vGIC? E.g. if safely querying PendingLast is "heavy", wouldn't it still make sense to do a slow check of PendingLast in kvm_vgic_vcpu_pending_irq() if IN_WFI=1? Or is putting the vGIC as lightweight as things get when it comes to checking PendingLast? > On top of that, we need to activate the doorbell IRQ for GICv4 to give > KVM a kick when something arrives for the vPE. Right, but KVM needs to "manually" detect pending events to account for events that arrive before the doorbell is activated, so from a functional perspective, it shouldn't matter when KVM activates the doorbell (with respect to halt-polling). ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out 2024-10-07 19:52 ` Sean Christopherson @ 2024-10-07 21:22 ` Oliver Upton 2024-10-07 22:18 ` Sean Christopherson 0 siblings, 1 reply; 14+ messages in thread From: Oliver Upton @ 2024-10-07 21:22 UTC (permalink / raw) To: Sean Christopherson Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024 at 12:52:48PM -0700, Sean Christopherson wrote: > On Mon, Oct 07, 2024, Oliver Upton wrote: > > On Mon, Oct 07, 2024 at 11:39:37AM -0700, Sean Christopherson wrote: > > > > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu) > > > > > > > > void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu) > > > > { > > > > - if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) { > > > > + /* > > > > + * Keep a reference on the associated stage-2 MMU if the vCPU is > > > > + * scheduling out and not in WFI emulation, suggesting it is likely to > > > > + * reuse the MMU sometime soon. > > > > + */ > > > > + if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI)) > > > > + return; > > > > > > Assuming KVM arm64 supports halt-polling, I think it makes more sense to check > > > kvm_vcpu_is_blocking() instead of IN_WFI. That way, KVM will keep the MMU resident > > > if the vCPU happens to be preempted and halt-polling is successful. > > > > > > And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking() > > > instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM > > > would only need to put/reload the vGIC if the vCPU is actually scheduled out. > > > > That wouldn't be an optimization, it'd render the halt polling loop > > useless. <snip> > Hmm, but what happens if the wakeup event arrives before IN_WFI is set? Ah, IIUC, > GICR_VPENDBASER_PendingLast tracks if there's a pending event, KVM makes sure to > read PendingLast after making vPE non-resident, and hardware is required to ensure > either PendingLast=1 or a doorbell is signaled. > > Niave question time: why not query GICR_VPENDBASER_PendingLast directly in > kvm_vgic_vcpu_pending_irq() instead of putting the vGIC? The GIC is only required to update GICR_VPENDBASER.PendingLast when software deschedules the loaded vPE, meaning GICR_VPENDBASER.Valid goes from 1 => 0. It is not updated for virtual interrupts that arrive after the vPE has been descheduled. So a polling loop that tests PendingLast would need to repeatedly mark the vPE as scheduled/descheduled to get the GIC to process the LPI pending table. You'd at minimum need a dsb(sy) to ensure the redistributor has seen the new configuration *and* poll it to make sure the LPI pending state has been completely serialized/deserialized into memory: while !GICR_VPENDBASER.PendingLast: GICR_VPENDBASER.Valid = 1 dsb(sy) while GICR_VPENDBASER.Dirty: cpu_relax() GICR_VPENDBASER.Valid = 0 while GICR_VPENDBASER.Dirty: cpu_relax() Conceptually, the redistributor is a 'far' piece of hardware, likely running at a different clock from your CPU. So it is safe to assume that interactions with it are pretty slow, and this overall scheme would be quite expensive. > E.g. if safely querying > PendingLast is "heavy", wouldn't it still make sense to do a slow check of > PendingLast in kvm_vgic_vcpu_pending_irq() if IN_WFI=1? Or is putting the vGIC > as lightweight as things get when it comes to checking PendingLast? I believe I've answered this above, since the status bit only makes sense when hardware has done a state transition. > > On top of that, we need to activate the doorbell IRQ for GICv4 to give > > KVM a kick when something arrives for the vPE. > > Right, but KVM needs to "manually" detect pending events to account for events > that arrive before the doorbell is activated, so from a functional perspective, > it shouldn't matter when KVM activates the doorbell (with respect to halt-polling). Like I said earlier: <paste> > > The most recent view of the guest's CPU interface is sitting in > > hardware at this point, so we need to synchronize KVM's view of > > the CPUIF with it to determine if a pending interrupt exceeds the > > priority mask. </paste> The guest's enable bits and priority mask are in ICH_VMCR_EL2, and we need to read that out of the CPU in order for this loop to work: int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) { struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; struct vgic_irq *irq; bool pending = false; unsigned long flags; struct vgic_vmcr vmcr; if (!vcpu->kvm->arch.vgic.enabled) return false; if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.pending_last) return true; vgic_get_vmcr(vcpu, &vmcr); raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags); list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) { raw_spin_lock(&irq->irq_lock); pending = irq_is_pending(irq) && irq->enabled && !irq->active && irq->priority < vmcr.pmr; <====== Need latest PMR raw_spin_unlock(&irq->irq_lock); if (pending) break; } raw_spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags); return pending; } KVM's in-memory view of VMCR is updated in vgic_put(). So even w/o GICv4 direct injection, the seemingly premature GIC save/restore is essential for halt polling to actually do something useful. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out 2024-10-07 21:22 ` Oliver Upton @ 2024-10-07 22:18 ` Sean Christopherson 0 siblings, 0 replies; 14+ messages in thread From: Sean Christopherson @ 2024-10-07 22:18 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024, Oliver Upton wrote: > On Mon, Oct 07, 2024 at 12:52:48PM -0700, Sean Christopherson wrote: > > > That wouldn't be an optimization, it'd render the halt polling loop > > > useless. > > <snip> > > > Hmm, but what happens if the wakeup event arrives before IN_WFI is set? Ah, IIUC, > > GICR_VPENDBASER_PendingLast tracks if there's a pending event, KVM makes sure to > > read PendingLast after making vPE non-resident, and hardware is required to ensure > > either PendingLast=1 or a doorbell is signaled. > > > > Niave question time: why not query GICR_VPENDBASER_PendingLast directly in > > kvm_vgic_vcpu_pending_irq() instead of putting the vGIC? > > The GIC is only required to update GICR_VPENDBASER.PendingLast when > software deschedules the loaded vPE, meaning GICR_VPENDBASER.Valid goes > from 1 => 0. Ugh, that sucks. I wish we could magically combine Intel+AMD+ARM implementations for direct injection. They all do things just slightly differently, and each one has some behaviors that I really like, and each one has at least one behavior that makes me sad. > > > The most recent view of the guest's CPU interface is sitting in > > > hardware at this point, so we need to synchronize KVM's view of > > > the CPUIF with it to determine if a pending interrupt exceeds the > > > priority mask. > > </paste> > > The guest's enable bits and priority mask are in ICH_VMCR_EL2, and we > need to read that out of the CPU in order for this loop to work: > > int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) > { > struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; > struct vgic_irq *irq; > bool pending = false; > unsigned long flags; > struct vgic_vmcr vmcr; > > if (!vcpu->kvm->arch.vgic.enabled) > return false; > > if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.pending_last) > return true; > > vgic_get_vmcr(vcpu, &vmcr); > > raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags); > > list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) { > raw_spin_lock(&irq->irq_lock); > pending = irq_is_pending(irq) && irq->enabled && > !irq->active && > irq->priority < vmcr.pmr; <====== Need latest PMR > raw_spin_unlock(&irq->irq_lock); > > if (pending) > break; > } > > raw_spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags); > > return pending; > } > > KVM's in-memory view of VMCR is updated in vgic_put(). Ah, vgic_get_vmcr() is just unpacking the software-cached VMCR. I missed that on the first few reads. Thanks for being patient :-) > So even w/o GICv4 direct injection, the seemingly premature GIC save/restore > is essential for halt polling to actually do something useful. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed 2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton 2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton @ 2024-10-07 16:42 ` Oliver Upton 2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton 2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton 3 siblings, 0 replies; 14+ messages in thread From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw) To: kvmarm Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Oliver Upton Right now the nested code allows unmap operations on a shadow stage-2 to block unconditionally. This is wrong in a couple places, such as a non-blocking MMU notifier or on the back of a sched_in() notifier as part of shadow MMU recycling. Carry through whether or not blocking is allowed to kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU reclaim would precipitate a stack overflow from a pile of kvm_sched_in() callbacks, all trying to recycle a stage-2 MMU. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> --- arch/arm64/include/asm/kvm_mmu.h | 3 ++- arch/arm64/include/asm/kvm_nested.h | 2 +- arch/arm64/kvm/mmu.c | 15 ++++++++------- arch/arm64/kvm/nested.c | 6 +++--- arch/arm64/kvm/sys_regs.c | 6 +++--- 5 files changed, 17 insertions(+), 15 deletions(-) diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h index cd4087fbda9a..66d93e320ec8 100644 --- a/arch/arm64/include/asm/kvm_mmu.h +++ b/arch/arm64/include/asm/kvm_mmu.h @@ -166,7 +166,8 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size, int create_hyp_stack(phys_addr_t phys_addr, unsigned long *haddr); void __init free_hyp_pgds(void); -void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size); +void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, + u64 size, bool may_block); void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end); void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end); diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h index e8bc6d67aba2..e74b90dcfac4 100644 --- a/arch/arm64/include/asm/kvm_nested.h +++ b/arch/arm64/include/asm/kvm_nested.h @@ -124,7 +124,7 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, struct kvm_s2_trans *trans); extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2); extern void kvm_nested_s2_wp(struct kvm *kvm); -extern void kvm_nested_s2_unmap(struct kvm *kvm); +extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block); extern void kvm_nested_s2_flush(struct kvm *kvm); unsigned long compute_tlb_inval_range(struct kvm_s2_mmu *mmu, u64 val); diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index a509b63bd4dd..0f7658aefa1a 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -328,9 +328,10 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 may_block)); } -void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size) +void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, + u64 size, bool may_block) { - __unmap_stage2_range(mmu, start, size, true); + __unmap_stage2_range(mmu, start, size, may_block); } void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end) @@ -1015,7 +1016,7 @@ static void stage2_unmap_memslot(struct kvm *kvm, if (!(vma->vm_flags & VM_PFNMAP)) { gpa_t gpa = addr + (vm_start - memslot->userspace_addr); - kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start); + kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start, true); } hva = vm_end; } while (hva < reg_end); @@ -1042,7 +1043,7 @@ void stage2_unmap_vm(struct kvm *kvm) kvm_for_each_memslot(memslot, bkt, slots) stage2_unmap_memslot(kvm, memslot); - kvm_nested_s2_unmap(kvm); + kvm_nested_s2_unmap(kvm, true); write_unlock(&kvm->mmu_lock); mmap_read_unlock(current->mm); @@ -1912,7 +1913,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) (range->end - range->start) << PAGE_SHIFT, range->may_block); - kvm_nested_s2_unmap(kvm); + kvm_nested_s2_unmap(kvm, range->may_block); return false; } @@ -2179,8 +2180,8 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm, phys_addr_t size = slot->npages << PAGE_SHIFT; write_lock(&kvm->mmu_lock); - kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size); - kvm_nested_s2_unmap(kvm); + kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true); + kvm_nested_s2_unmap(kvm, true); write_unlock(&kvm->mmu_lock); } diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index df670c14e1c6..58d3b998793a 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -634,7 +634,7 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) /* Clear the old state */ if (kvm_s2_mmu_valid(s2_mmu)) - kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu)); + kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false); /* * The virtual VMID (modulo CnP) will be used as a key when matching @@ -745,7 +745,7 @@ void kvm_nested_s2_wp(struct kvm *kvm) } } -void kvm_nested_s2_unmap(struct kvm *kvm) +void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block) { int i; @@ -755,7 +755,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm) struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i]; if (kvm_s2_mmu_valid(mmu)) - kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu)); + kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block); } } diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index dad88e31f953..e0c7cc16466e 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -2903,7 +2903,7 @@ static bool handle_alle1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p, * Drop all shadow S2s, resulting in S1/S2 TLBIs for each of the * corresponding VMIDs. */ - kvm_nested_s2_unmap(vcpu->kvm); + kvm_nested_s2_unmap(vcpu->kvm, true); write_unlock(&vcpu->kvm->mmu_lock); @@ -2955,7 +2955,7 @@ union tlbi_info { static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu, const union tlbi_info *info) { - kvm_stage2_unmap_range(mmu, info->range.start, info->range.size); + kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true); } static bool handle_vmalls12e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p, @@ -3050,7 +3050,7 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu, max_size = compute_tlb_inval_range(mmu, info->ipa.addr); base_addr &= ~(max_size - 1); - kvm_stage2_unmap_range(mmu, base_addr, max_size); + kvm_stage2_unmap_range(mmu, base_addr, max_size, true); } static bool handle_ipas2e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p, -- 2.47.0.rc0.187.ge670bccf7e-goog ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request 2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton 2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton 2024-10-07 16:42 ` [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton @ 2024-10-07 16:42 ` Oliver Upton 2024-10-07 19:06 ` Marc Zyngier 2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton 3 siblings, 1 reply; 14+ messages in thread From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw) To: kvmarm Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Oliver Upton Currently, when a nested MMU is repurposed for some other MMU context, KVM unmaps everything during vcpu_load() while holding the MMU lock for write. This is quite a performance bottleneck for large nested VMs, as all vCPU scheduling will spin until the unmap completes. Start punting the MMU cleanup to a vCPU request, where it is then possible to periodically release the MMU lock and CPU in the presence of contention. Ensure that no vCPU winds up using a stale MMU by tracking the pending unmap on the S2 MMU itself and requesting an unmap on every vCPU that finds it. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> --- arch/arm64/include/asm/kvm_host.h | 3 +++ arch/arm64/include/asm/kvm_nested.h | 2 ++ arch/arm64/kvm/arm.c | 2 ++ arch/arm64/kvm/nested.c | 21 +++++++++++++++++++-- 4 files changed, 26 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 329619c6fa96..16b4b0003e7d 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -51,6 +51,7 @@ #define KVM_REQ_RELOAD_PMU KVM_ARCH_REQ(5) #define KVM_REQ_SUSPEND KVM_ARCH_REQ(6) #define KVM_REQ_RESYNC_PMU_EL0 KVM_ARCH_REQ(7) +#define KVM_REQ_NESTED_S2_UNMAP KVM_ARCH_REQ(8) #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ KVM_DIRTY_LOG_INITIALLY_SET) @@ -216,6 +217,8 @@ struct kvm_s2_mmu { * >0: Somebody is actively using this. */ atomic_t refcnt; + + bool pending_unmap; }; struct kvm_arch_memory_slot { diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h index e74b90dcfac4..233e65522716 100644 --- a/arch/arm64/include/asm/kvm_nested.h +++ b/arch/arm64/include/asm/kvm_nested.h @@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid, extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu); extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu); +extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu); + struct kvm_s2_trans { phys_addr_t output; unsigned long block_size; diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index a0d01c46e408..c7b659604bae 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) if (kvm_dirty_ring_check_request(vcpu)) return 0; + + check_nested_vcpu_requests(vcpu); } return 1; diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index 58d3b998793a..71b96eaf6632 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) /* Set the scene for the next search */ kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size; - /* Clear the old state */ + /* Make sure we don't forget to do the laundry */ if (kvm_s2_mmu_valid(s2_mmu)) - kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false); + s2_mmu->pending_unmap = true; /* * The virtual VMID (modulo CnP) will be used as a key when matching @@ -650,6 +650,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) out: atomic_inc(&s2_mmu->refcnt); + if (s2_mmu->pending_unmap) + kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu); + return s2_mmu; } @@ -1199,3 +1202,17 @@ int kvm_init_nv_sysregs(struct kvm *kvm) return 0; } + +void check_nested_vcpu_requests(struct kvm_vcpu *vcpu) +{ + if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) { + struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu; + + write_lock(&vcpu->kvm->mmu_lock); + if (mmu->pending_unmap) { + kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true); + mmu->pending_unmap = false; + } + write_unlock(&vcpu->kvm->mmu_lock); + } +} -- 2.47.0.rc0.187.ge670bccf7e-goog ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request 2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton @ 2024-10-07 19:06 ` Marc Zyngier 2024-10-07 19:16 ` Oliver Upton 0 siblings, 1 reply; 14+ messages in thread From: Marc Zyngier @ 2024-10-07 19:06 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu, Sean Christopherson On Mon, 07 Oct 2024 17:42:54 +0100, Oliver Upton <oliver.upton@linux.dev> wrote: > > Currently, when a nested MMU is repurposed for some other MMU context, > KVM unmaps everything during vcpu_load() while holding the MMU lock for > write. This is quite a performance bottleneck for large nested VMs, as > all vCPU scheduling will spin until the unmap completes. > > Start punting the MMU cleanup to a vCPU request, where it is then > possible to periodically release the MMU lock and CPU in the presence of > contention. > > Ensure that no vCPU winds up using a stale MMU by tracking the pending > unmap on the S2 MMU itself and requesting an unmap on every vCPU that > finds it. > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev> > --- > arch/arm64/include/asm/kvm_host.h | 3 +++ > arch/arm64/include/asm/kvm_nested.h | 2 ++ > arch/arm64/kvm/arm.c | 2 ++ > arch/arm64/kvm/nested.c | 21 +++++++++++++++++++-- > 4 files changed, 26 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index 329619c6fa96..16b4b0003e7d 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -51,6 +51,7 @@ > #define KVM_REQ_RELOAD_PMU KVM_ARCH_REQ(5) > #define KVM_REQ_SUSPEND KVM_ARCH_REQ(6) > #define KVM_REQ_RESYNC_PMU_EL0 KVM_ARCH_REQ(7) > +#define KVM_REQ_NESTED_S2_UNMAP KVM_ARCH_REQ(8) > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ > KVM_DIRTY_LOG_INITIALLY_SET) > @@ -216,6 +217,8 @@ struct kvm_s2_mmu { > * >0: Somebody is actively using this. > */ > atomic_t refcnt; > + > + bool pending_unmap; Nit: move this up next to nested_stage2_enabled, saving a grand total of 4 bytes... > }; > > struct kvm_arch_memory_slot { > diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h > index e74b90dcfac4..233e65522716 100644 > --- a/arch/arm64/include/asm/kvm_nested.h > +++ b/arch/arm64/include/asm/kvm_nested.h > @@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid, > extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu); > extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu); > > +extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu); > + > struct kvm_s2_trans { > phys_addr_t output; > unsigned long block_size; > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index a0d01c46e408..c7b659604bae 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > if (kvm_dirty_ring_check_request(vcpu)) > return 0; > + > + check_nested_vcpu_requests(vcpu); > } > > return 1; > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c > index 58d3b998793a..71b96eaf6632 100644 > --- a/arch/arm64/kvm/nested.c > +++ b/arch/arm64/kvm/nested.c > @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) > /* Set the scene for the next search */ > kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size; > > - /* Clear the old state */ > + /* Make sure we don't forget to do the laundry */ > if (kvm_s2_mmu_valid(s2_mmu)) > - kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false); > + s2_mmu->pending_unmap = true; > > /* > * The virtual VMID (modulo CnP) will be used as a key when matching > @@ -650,6 +650,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) > > out: > atomic_inc(&s2_mmu->refcnt); > + if (s2_mmu->pending_unmap) > + kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu); > + This bit is awesomely subtle. Any vcpu incrementing the refcount on an MMU being laundered gets to partake in the cleaning process. Without this, the second guy would get to use the MMU before it has been cleaned-up. If you don't mind, I'd like to add a comment to that effect, so that our future selves remember what this is about. And move the pending_unmap field in the process. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request 2024-10-07 19:06 ` Marc Zyngier @ 2024-10-07 19:16 ` Oliver Upton 0 siblings, 0 replies; 14+ messages in thread From: Oliver Upton @ 2024-10-07 19:16 UTC (permalink / raw) To: Marc Zyngier Cc: kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu, Sean Christopherson On Mon, Oct 07, 2024 at 08:06:30PM +0100, Marc Zyngier wrote: > On Mon, 07 Oct 2024 17:42:54 +0100, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Currently, when a nested MMU is repurposed for some other MMU context, > > KVM unmaps everything during vcpu_load() while holding the MMU lock for > > write. This is quite a performance bottleneck for large nested VMs, as > > all vCPU scheduling will spin until the unmap completes. > > > > Start punting the MMU cleanup to a vCPU request, where it is then > > possible to periodically release the MMU lock and CPU in the presence of > > contention. > > > > Ensure that no vCPU winds up using a stale MMU by tracking the pending > > unmap on the S2 MMU itself and requesting an unmap on every vCPU that > > finds it. > > > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev> > > --- > > arch/arm64/include/asm/kvm_host.h | 3 +++ > > arch/arm64/include/asm/kvm_nested.h | 2 ++ > > arch/arm64/kvm/arm.c | 2 ++ > > arch/arm64/kvm/nested.c | 21 +++++++++++++++++++-- > > 4 files changed, 26 insertions(+), 2 deletions(-) > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > index 329619c6fa96..16b4b0003e7d 100644 > > --- a/arch/arm64/include/asm/kvm_host.h > > +++ b/arch/arm64/include/asm/kvm_host.h > > @@ -51,6 +51,7 @@ > > #define KVM_REQ_RELOAD_PMU KVM_ARCH_REQ(5) > > #define KVM_REQ_SUSPEND KVM_ARCH_REQ(6) > > #define KVM_REQ_RESYNC_PMU_EL0 KVM_ARCH_REQ(7) > > +#define KVM_REQ_NESTED_S2_UNMAP KVM_ARCH_REQ(8) > > > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ > > KVM_DIRTY_LOG_INITIALLY_SET) > > @@ -216,6 +217,8 @@ struct kvm_s2_mmu { > > * >0: Somebody is actively using this. > > */ > > atomic_t refcnt; > > + > > + bool pending_unmap; > > Nit: move this up next to nested_stage2_enabled, saving a grand total > of 4 bytes... /facepalm I had it that way in some WIP flavor of this series, looks like I dropped that diff. Will do. > > atomic_inc(&s2_mmu->refcnt); > > + if (s2_mmu->pending_unmap) > > + kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu); > > + > > This bit is awesomely subtle. Any vcpu incrementing the refcount on an > MMU being laundered gets to partake in the cleaning process. Without > this, the second guy would get to use the MMU before it has been > cleaned-up. > How about: /* * Set the vCPU request to perform an unmap, even if the pending * unmap originates from another vCPU. This guarantees that the * MMU has been completely unmapped before any vCPU actually * uses it, and allows multiple vCPUs to lend a hand with * completing the unmap. */ if (s2_mmu->pending_unmap) kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu); -- Thanks, Oliver ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule 2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton ` (2 preceding siblings ...) 2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton @ 2024-10-07 16:42 ` Oliver Upton 2024-10-07 18:51 ` Sean Christopherson 3 siblings, 1 reply; 14+ messages in thread From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw) To: kvmarm Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu, Sean Christopherson, Oliver Upton There's been a decent amount of attention around unmaps of nested MMUs, and TLBI handling is no exception to this. Add a comment clarifying why it is safe to reschedule during a TLBI unmap, even without a reference on the MMU in progress. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> --- arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index e0c7cc16466e..b773d107cd35 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -2955,6 +2955,29 @@ union tlbi_info { static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu, const union tlbi_info *info) { + /* + * The unmap operation is allowed to drop the MMU lock and block, which + * means that @mmu could be used for a different context than the one + * currently being invalidated. + * + * This behavior is still safe, as: + * + * 1) The vCPU that recycled the MMU is responsible for invalidating + * the entire MMU before reusing it, which still honors the intent + * of a TLBI. + * + * 2) Until the guest TLBI instruction is 'retired' (i.e. increment PC + * and ERET to the guest), other vCPUs are allowed to use stale + * translations. + * + * 3) Accidentally unmapping an unrelated MMU context is nonfatal, and + * at worst may cause more aborts for shadow stage-2 fills. + * + * Dropping the MMU lock also implies that shadow stage-2 fills could + * happen behind the back of the TLBI. This is still safe, though, as + * the L1 needs to put its stage-2 in a consistent state before doing + * the TLBI. + */ kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true); } @@ -3050,6 +3073,10 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu, max_size = compute_tlb_inval_range(mmu, info->ipa.addr); base_addr &= ~(max_size - 1); + /* + * See comment in s2_mmu_unmap_range() for why this is allowed to + * reschedule. + */ kvm_stage2_unmap_range(mmu, base_addr, max_size, true); } -- 2.47.0.rc0.187.ge670bccf7e-goog ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule 2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton @ 2024-10-07 18:51 ` Sean Christopherson 2024-10-07 19:20 ` Oliver Upton 0 siblings, 1 reply; 14+ messages in thread From: Sean Christopherson @ 2024-10-07 18:51 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024, Oliver Upton wrote: > There's been a decent amount of attention around unmaps of nested MMUs, > and TLBI handling is no exception to this. Add a comment clarifying why > it is safe to reschedule during a TLBI unmap, even without a reference > on the MMU in progress. > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev> > --- > arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c > index e0c7cc16466e..b773d107cd35 100644 > --- a/arch/arm64/kvm/sys_regs.c > +++ b/arch/arm64/kvm/sys_regs.c > @@ -2955,6 +2955,29 @@ union tlbi_info { > static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu, > const union tlbi_info *info) > { > + /* > + * The unmap operation is allowed to drop the MMU lock and block, which > + * means that @mmu could be used for a different context than the one > + * currently being invalidated. > + * > + * This behavior is still safe, as: > + * > + * 1) The vCPU that recycled the MMU is responsible for invalidating > + * the entire MMU before reusing it, which still honors the intent > + * of a TLBI. I think it's worth calling out that other vCPUs that want to reuse the "new" MMU will also ensure the entire MMU is invalidated before reusing it, and maybe even that multiple vCPUs can participate in recyling an MMU. The "The vCPU" wording is probably going to be interpreted by most readers as meaning only _one_ vCPU will ensure the MMU is invalidated. > + * 2) Until the guest TLBI instruction is 'retired' (i.e. increment PC > + * and ERET to the guest), other vCPUs are allowed to use stale > + * translations. > + * > + * 3) Accidentally unmapping an unrelated MMU context is nonfatal, and > + * at worst may cause more aborts for shadow stage-2 fills. > + * > + * Dropping the MMU lock also implies that shadow stage-2 fills could > + * happen behind the back of the TLBI. This is still safe, though, as > + * the L1 needs to put its stage-2 in a consistent state before doing > + * the TLBI. > + */ > kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true); > } > > @@ -3050,6 +3073,10 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu, > max_size = compute_tlb_inval_range(mmu, info->ipa.addr); > base_addr &= ~(max_size - 1); > > + /* > + * See comment in s2_mmu_unmap_range() for why this is allowed to > + * reschedule. > + */ > kvm_stage2_unmap_range(mmu, base_addr, max_size, true); > } > > -- > 2.47.0.rc0.187.ge670bccf7e-goog > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule 2024-10-07 18:51 ` Sean Christopherson @ 2024-10-07 19:20 ` Oliver Upton 0 siblings, 0 replies; 14+ messages in thread From: Oliver Upton @ 2024-10-07 19:20 UTC (permalink / raw) To: Sean Christopherson Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu On Mon, Oct 07, 2024 at 11:51:37AM -0700, Sean Christopherson wrote: > On Mon, Oct 07, 2024, Oliver Upton wrote: > > There's been a decent amount of attention around unmaps of nested MMUs, > > and TLBI handling is no exception to this. Add a comment clarifying why > > it is safe to reschedule during a TLBI unmap, even without a reference > > on the MMU in progress. > > > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev> > > --- > > arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++++ > > 1 file changed, 27 insertions(+) > > > > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c > > index e0c7cc16466e..b773d107cd35 100644 > > --- a/arch/arm64/kvm/sys_regs.c > > +++ b/arch/arm64/kvm/sys_regs.c > > @@ -2955,6 +2955,29 @@ union tlbi_info { > > static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu, > > const union tlbi_info *info) > > { > > + /* > > + * The unmap operation is allowed to drop the MMU lock and block, which > > + * means that @mmu could be used for a different context than the one > > + * currently being invalidated. > > + * > > + * This behavior is still safe, as: > > + * > > + * 1) The vCPU that recycled the MMU is responsible for invalidating > > + * the entire MMU before reusing it, which still honors the intent > > + * of a TLBI. > > I think it's worth calling out that other vCPUs that want to reuse the "new" MMU > will also ensure the entire MMU is invalidated before reusing it, and maybe even > that multiple vCPUs can participate in recyling an MMU. The "The vCPU" wording > is probably going to be interpreted by most readers as meaning only _one_ vCPU > will ensure the MMU is invalidated. I'm going to s/vCPU/vCPU(s)/ this block, but I'll leave any other detail of how recycling works at the point it gets inflicted on potentially more than one vCPU. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-10-07 22:18 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton 2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton 2024-10-07 18:39 ` Sean Christopherson 2024-10-07 19:01 ` Oliver Upton 2024-10-07 19:52 ` Sean Christopherson 2024-10-07 21:22 ` Oliver Upton 2024-10-07 22:18 ` Sean Christopherson 2024-10-07 16:42 ` [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton 2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton 2024-10-07 19:06 ` Marc Zyngier 2024-10-07 19:16 ` Oliver Upton 2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton 2024-10-07 18:51 ` Sean Christopherson 2024-10-07 19:20 ` Oliver Upton
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.