[PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling
@ 2024-10-07 16:42 Oliver Upton
  2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson, Oliver Upton

v1 -> v2:
  - Keep a reference on the MMU so long as the vCPU appears runnable
    (i.e. not in WFI) (Sean)
  - Add comment clarifying why TLBI unmaps are allowed to reschedule

[*]: https://lore.kernel.org/kvmarm/20241001001709.1303668-1-oliver.upton@linux.dev/#t

Oliver Upton (4):
  KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule

 arch/arm64/include/asm/kvm_host.h   |  3 ++
 arch/arm64/include/asm/kvm_mmu.h    |  3 +-
 arch/arm64/include/asm/kvm_nested.h |  4 ++-
 arch/arm64/kvm/arm.c                |  2 ++
 arch/arm64/kvm/mmu.c                | 15 +++++-----
 arch/arm64/kvm/nested.c             | 46 ++++++++++++++++++++++++-----
 arch/arm64/kvm/sys_regs.c           | 33 +++++++++++++++++++--
 7 files changed, 87 insertions(+), 19 deletions(-)


base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
-- 
2.47.0.rc0.187.ge670bccf7e-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
@ 2024-10-07 16:42 ` Oliver Upton
  2024-10-07 18:39   ` Sean Christopherson
  2024-10-07 16:42 ` [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson, Oliver Upton

If a vCPU is scheduling out and not in WFI emulation, it is highly
likely it will get scheduled again soon and reuse the MMU it had before.
Dropping the MMU at vcpu_put() can have some unfortunate consequences,
as the MMU could get reclaimed and used in a different context, forcing
another 'cold start' on an otherwise active MMU.

Avoid that altogether by keeping a reference on the MMU if the vCPU is
scheduling out, ensuring that another vCPU cannot reclaim it while the
current vCPU is away. Since there are more MMUs than vCPUs, this does
not affect the guarantee that an unused MMU is available at any time.

Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible
code, at least for where it matters in the stage-2 abort path. Yes, the
MMU can change across WFI emulation, but there isn't even a use case
where this would matter.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/kvm/nested.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index f9e30dd34c7a..df670c14e1c6 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -663,6 +663,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * The vCPU kept its reference on the MMU after the last put, keep
+	 * rolling with it.
+	 */
+	if (vcpu->arch.hw_mmu)
+		return;
+
 	if (is_hyp_ctxt(vcpu)) {
 		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
 	} else {
@@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
 
 void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
 {
-	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
+	/*
+	 * Keep a reference on the associated stage-2 MMU if the vCPU is
+	 * scheduling out and not in WFI emulation, suggesting it is likely to
+	 * reuse the MMU sometime soon.
+	 */
+	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
+		return;
+
+	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu))
 		atomic_dec(&vcpu->arch.hw_mmu->refcnt);
-		vcpu->arch.hw_mmu = NULL;
-	}
+
+	vcpu->arch.hw_mmu = NULL;
 }
 
 /*
-- 
2.47.0.rc0.187.ge670bccf7e-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton
@ 2024-10-07 18:39   ` Sean Christopherson
  2024-10-07 19:01     ` Oliver Upton
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-10-07 18:39 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024, Oliver Upton wrote:
> If a vCPU is scheduling out and not in WFI emulation, it is highly
> likely it will get scheduled again soon and reuse the MMU it had before.
> Dropping the MMU at vcpu_put() can have some unfortunate consequences,
> as the MMU could get reclaimed and used in a different context, forcing
> another 'cold start' on an otherwise active MMU.
> 
> Avoid that altogether by keeping a reference on the MMU if the vCPU is
> scheduling out, ensuring that another vCPU cannot reclaim it while the
> current vCPU is away. Since there are more MMUs than vCPUs, this does
> not affect the guarantee that an unused MMU is available at any time.
> 
> Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible
> code, at least for where it matters in the stage-2 abort path. Yes, the
> MMU can change across WFI emulation, but there isn't even a use case
> where this would matter.
> 
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> ---
>  arch/arm64/kvm/nested.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index f9e30dd34c7a..df670c14e1c6 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -663,6 +663,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
>  
>  void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>  {
> +	/*
> +	 * The vCPU kept its reference on the MMU after the last put, keep
> +	 * rolling with it.
> +	 */
> +	if (vcpu->arch.hw_mmu)
> +		return;
> +
>  	if (is_hyp_ctxt(vcpu)) {
>  		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
>  	} else {
> @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>  
>  void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
>  {
> -	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
> +	/*
> +	 * Keep a reference on the associated stage-2 MMU if the vCPU is
> +	 * scheduling out and not in WFI emulation, suggesting it is likely to
> +	 * reuse the MMU sometime soon.
> +	 */
> +	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
> +		return;

Assuming KVM arm64 supports halt-polling, I think it makes more sense to check
kvm_vcpu_is_blocking() instead of IN_WFI.  That way, KVM will keep the MMU resident
if the vCPU happens to be preempted and halt-polling is successful.

And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking()
instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM
would only need to put/reload the vGIC if the vCPU is actually scheduled out.

Unfortunately, using kvm_vcpu_is_blocking() in vgic_v4_put() won't do the right
thing, as KVM deliberately calls prepare_to_rcuwait() _after_ kvm_arch_vcpu_blocking(),
because x86's AVIC implementation relies on kvm_vcpu_is_blocking() to be false
when reloading after unblocking.

Aha!  With vcpu->scheduled_out, I think we can swap the order and make it more
obvious what's going on in the AVIC code at the same time.  With that, I think
KVM arm64 could drop IN_WFI and use kvm_vcpu_is_blocking() throughout.  E.g.

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4b74ea91f4e6..bba93d1d65ae 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1038,13 +1038,12 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
                return;
 
        /*
-        * No need to update anything if the vCPU is blocking, i.e. if the vCPU
-        * is being scheduled in after being preempted.  The CPU entries in the
-        * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
-        * If the vCPU was migrated, its new CPU value will be stuffed when the
-        * vCPU unblocks.
+        * No need to update anything if the vCPU is preempted while blocking.
+        * The CPU entries in the Physical APIC table and IRTE are consumed iff
+        * IsRun{ning} is '1'.  If the vCPU was migrated, its new CPU value
+        * will be stuffed when the vCPU unblocks.
         */
-       if (kvm_vcpu_is_blocking(vcpu))
+       if (kvm_vcpu_is_blocking(vcpu) && vcpu->scheduled_out)
                return;
 
        /*

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  2024-10-07 18:39   ` Sean Christopherson
@ 2024-10-07 19:01     ` Oliver Upton
  2024-10-07 19:52       ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 19:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024 at 11:39:37AM -0700, Sean Christopherson wrote:
> > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> >  
> >  void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
> >  {
> > -	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
> > +	/*
> > +	 * Keep a reference on the associated stage-2 MMU if the vCPU is
> > +	 * scheduling out and not in WFI emulation, suggesting it is likely to
> > +	 * reuse the MMU sometime soon.
> > +	 */
> > +	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
> > +		return;
> 
> Assuming KVM arm64 supports halt-polling, I think it makes more sense to check
> kvm_vcpu_is_blocking() instead of IN_WFI.  That way, KVM will keep the MMU resident
> if the vCPU happens to be preempted and halt-polling is successful.
> 
> And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking()
> instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM
> would only need to put/reload the vGIC if the vCPU is actually scheduled out.

That wouldn't be an optimization, it'd render the halt polling loop
useless.

The most recent view of the guest's CPU interface is sitting in
hardware at this point, so we need to synchronize KVM's view of
the CPUIF with it to determine if a pending interrupt exceeds the
priority mask.

On top of that, we need to activate the doorbell IRQ for GICv4 to give
KVM a kick when something arrives for the vPE.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  2024-10-07 19:01     ` Oliver Upton
@ 2024-10-07 19:52       ` Sean Christopherson
  2024-10-07 21:22         ` Oliver Upton
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-10-07 19:52 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024, Oliver Upton wrote:
> On Mon, Oct 07, 2024 at 11:39:37AM -0700, Sean Christopherson wrote:
> > > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> > >  
> > >  void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
> > >  {
> > > -	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
> > > +	/*
> > > +	 * Keep a reference on the associated stage-2 MMU if the vCPU is
> > > +	 * scheduling out and not in WFI emulation, suggesting it is likely to
> > > +	 * reuse the MMU sometime soon.
> > > +	 */
> > > +	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
> > > +		return;
> > 
> > Assuming KVM arm64 supports halt-polling, I think it makes more sense to check
> > kvm_vcpu_is_blocking() instead of IN_WFI.  That way, KVM will keep the MMU resident
> > if the vCPU happens to be preempted and halt-polling is successful.
> > 
> > And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking()
> > instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM
> > would only need to put/reload the vGIC if the vCPU is actually scheduled out.
> 
> That wouldn't be an optimization, it'd render the halt polling loop
> useless.
> 
> The most recent view of the guest's CPU interface is sitting in
> hardware at this point, so we need to synchronize KVM's view of
> the CPUIF with it to determine if a pending interrupt exceeds the
> priority mask.

Hmm, but what happens if the wakeup event arrives before IN_WFI is set?  Ah, IIUC,
GICR_VPENDBASER_PendingLast tracks if there's a pending event, KVM makes sure to
read PendingLast after making vPE non-resident, and hardware is required to ensure
either PendingLast=1 or a doorbell is signaled.

Niave question time: why not query GICR_VPENDBASER_PendingLast directly in
kvm_vgic_vcpu_pending_irq() instead of putting the vGIC?  E.g. if safely querying
PendingLast is "heavy", wouldn't it still make sense to do a slow check of
PendingLast in kvm_vgic_vcpu_pending_irq() if IN_WFI=1?  Or is putting the vGIC
as lightweight as things get when it comes to checking PendingLast?

> On top of that, we need to activate the doorbell IRQ for GICv4 to give
> KVM a kick when something arrives for the vPE.

Right, but KVM needs to "manually" detect pending events to account for events
that arrive before the doorbell is activated, so from a functional perspective,
it shouldn't matter when KVM activates the doorbell (with respect to halt-polling).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  2024-10-07 19:52       ` Sean Christopherson
@ 2024-10-07 21:22         ` Oliver Upton
  2024-10-07 22:18           ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 21:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024 at 12:52:48PM -0700, Sean Christopherson wrote:
> On Mon, Oct 07, 2024, Oliver Upton wrote:
> > On Mon, Oct 07, 2024 at 11:39:37AM -0700, Sean Christopherson wrote:
> > > > @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> > > >  
> > > >  void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
> > > >  {
> > > > -	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
> > > > +	/*
> > > > +	 * Keep a reference on the associated stage-2 MMU if the vCPU is
> > > > +	 * scheduling out and not in WFI emulation, suggesting it is likely to
> > > > +	 * reuse the MMU sometime soon.
> > > > +	 */
> > > > +	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
> > > > +		return;
> > > 
> > > Assuming KVM arm64 supports halt-polling, I think it makes more sense to check
> > > kvm_vcpu_is_blocking() instead of IN_WFI.  That way, KVM will keep the MMU resident
> > > if the vCPU happens to be preempted and halt-polling is successful.
> > > 
> > > And somewhat of a side topic, calling kvm_vgic_put() from kvm_arch_vcpu_blocking()
> > > instead of kvm_vcpu_wfi() would provide the same optimization for the GIC, as KVM
> > > would only need to put/reload the vGIC if the vCPU is actually scheduled out.
> > 
> > That wouldn't be an optimization, it'd render the halt polling loop
> > useless.

<snip>

> Hmm, but what happens if the wakeup event arrives before IN_WFI is set?  Ah, IIUC,
> GICR_VPENDBASER_PendingLast tracks if there's a pending event, KVM makes sure to
> read PendingLast after making vPE non-resident, and hardware is required to ensure
> either PendingLast=1 or a doorbell is signaled.
> 
> Niave question time: why not query GICR_VPENDBASER_PendingLast directly in
> kvm_vgic_vcpu_pending_irq() instead of putting the vGIC?

The GIC is only required to update GICR_VPENDBASER.PendingLast when
software deschedules the loaded vPE, meaning GICR_VPENDBASER.Valid goes
from 1 => 0. It is not updated for virtual interrupts that arrive after
the vPE has been descheduled.

So a polling loop that tests PendingLast would need to repeatedly mark
the vPE as scheduled/descheduled to get the GIC to process the LPI
pending table. You'd at minimum need a dsb(sy) to ensure the
redistributor has seen the new configuration *and* poll it to make sure
the LPI pending state has been completely serialized/deserialized into
memory:

	while !GICR_VPENDBASER.PendingLast:
		GICR_VPENDBASER.Valid = 1
		dsb(sy)
		while GICR_VPENDBASER.Dirty:
			cpu_relax()
		GICR_VPENDBASER.Valid = 0
		while GICR_VPENDBASER.Dirty:
			cpu_relax()

Conceptually, the redistributor is a 'far' piece of hardware, likely
running at a different clock from your CPU. So it is safe to assume that
interactions with it are pretty slow, and this overall scheme would be
quite expensive.

> E.g. if safely querying
> PendingLast is "heavy", wouldn't it still make sense to do a slow check of
> PendingLast in kvm_vgic_vcpu_pending_irq() if IN_WFI=1?  Or is putting the vGIC
> as lightweight as things get when it comes to checking PendingLast?

I believe I've answered this above, since the status bit only makes
sense when hardware has done a state transition.

> > On top of that, we need to activate the doorbell IRQ for GICv4 to give
> > KVM a kick when something arrives for the vPE.
> 
> Right, but KVM needs to "manually" detect pending events to account for events
> that arrive before the doorbell is activated, so from a functional perspective,
> it shouldn't matter when KVM activates the doorbell (with respect to halt-polling).

Like I said earlier:

<paste>

> > The most recent view of the guest's CPU interface is sitting in
> > hardware at this point, so we need to synchronize KVM's view of
> > the CPUIF with it to determine if a pending interrupt exceeds the
> > priority mask.

</paste>

The guest's enable bits and priority mask are in ICH_VMCR_EL2, and we
need to read that out of the CPU in order for this loop to work:

int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu)
{
	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
	struct vgic_irq *irq;
	bool pending = false;
	unsigned long flags;
	struct vgic_vmcr vmcr;

	if (!vcpu->kvm->arch.vgic.enabled)
		return false;

	if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.pending_last)
		return true;

	vgic_get_vmcr(vcpu, &vmcr);

	raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags);

	list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) {
		raw_spin_lock(&irq->irq_lock);
		pending = irq_is_pending(irq) && irq->enabled &&
			  !irq->active &&
			  irq->priority < vmcr.pmr;		<====== Need latest PMR
		raw_spin_unlock(&irq->irq_lock);

		if (pending)
			break;
	}

	raw_spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);

	return pending;
}

KVM's in-memory view of VMCR is updated in vgic_put(). So even w/o GICv4
direct injection, the seemingly premature GIC save/restore is essential
for halt polling to actually do something useful.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
  2024-10-07 21:22         ` Oliver Upton
@ 2024-10-07 22:18           ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-10-07 22:18 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024, Oliver Upton wrote:
> On Mon, Oct 07, 2024 at 12:52:48PM -0700, Sean Christopherson wrote:
> > > That wouldn't be an optimization, it'd render the halt polling loop
> > > useless.
> 
> <snip>
> 
> > Hmm, but what happens if the wakeup event arrives before IN_WFI is set?  Ah, IIUC,
> > GICR_VPENDBASER_PendingLast tracks if there's a pending event, KVM makes sure to
> > read PendingLast after making vPE non-resident, and hardware is required to ensure
> > either PendingLast=1 or a doorbell is signaled.
> > 
> > Niave question time: why not query GICR_VPENDBASER_PendingLast directly in
> > kvm_vgic_vcpu_pending_irq() instead of putting the vGIC?
> 
> The GIC is only required to update GICR_VPENDBASER.PendingLast when
> software deschedules the loaded vPE, meaning GICR_VPENDBASER.Valid goes
> from 1 => 0.

Ugh, that sucks.  I wish we could magically combine Intel+AMD+ARM implementations
for direct injection.  They all do things just slightly differently, and each one
has some behaviors that I really like, and each one has at least one behavior that
makes me sad.

> > > The most recent view of the guest's CPU interface is sitting in
> > > hardware at this point, so we need to synchronize KVM's view of
> > > the CPUIF with it to determine if a pending interrupt exceeds the
> > > priority mask.
> 
> </paste>
> 
> The guest's enable bits and priority mask are in ICH_VMCR_EL2, and we
> need to read that out of the CPU in order for this loop to work:
> 
> int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu)
> {
> 	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
> 	struct vgic_irq *irq;
> 	bool pending = false;
> 	unsigned long flags;
> 	struct vgic_vmcr vmcr;
> 
> 	if (!vcpu->kvm->arch.vgic.enabled)
> 		return false;
> 
> 	if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.pending_last)
> 		return true;
> 
> 	vgic_get_vmcr(vcpu, &vmcr);
> 
> 	raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags);
> 
> 	list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) {
> 		raw_spin_lock(&irq->irq_lock);
> 		pending = irq_is_pending(irq) && irq->enabled &&
> 			  !irq->active &&
> 			  irq->priority < vmcr.pmr;		<====== Need latest PMR
> 		raw_spin_unlock(&irq->irq_lock);
> 
> 		if (pending)
> 			break;
> 	}
> 
> 	raw_spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);
> 
> 	return pending;
> }
> 
> KVM's in-memory view of VMCR is updated in vgic_put().

Ah, vgic_get_vmcr() is just unpacking the software-cached VMCR.  I missed that
on the first few reads.

Thanks for being patient :-)

> So even w/o GICv4 direct injection, the seemingly premature GIC save/restore
> is essential for halt polling to actually do something useful.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
  2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton
@ 2024-10-07 16:42 ` Oliver Upton
  2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
  2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton
  3 siblings, 0 replies; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson, Oliver Upton

Right now the nested code allows unmap operations on a shadow stage-2 to
block unconditionally. This is wrong in a couple places, such as a
non-blocking MMU notifier or on the back of a sched_in() notifier as
part of shadow MMU recycling.

Carry through whether or not blocking is allowed to
kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU
reclaim would precipitate a stack overflow from a pile of kvm_sched_in()
callbacks, all trying to recycle a stage-2 MMU.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_mmu.h    |  3 ++-
 arch/arm64/include/asm/kvm_nested.h |  2 +-
 arch/arm64/kvm/mmu.c                | 15 ++++++++-------
 arch/arm64/kvm/nested.c             |  6 +++---
 arch/arm64/kvm/sys_regs.c           |  6 +++---
 5 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index cd4087fbda9a..66d93e320ec8 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -166,7 +166,8 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 int create_hyp_stack(phys_addr_t phys_addr, unsigned long *haddr);
 void __init free_hyp_pgds(void);
 
-void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size);
+void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
+			    u64 size, bool may_block);
 void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
 void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
 
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index e8bc6d67aba2..e74b90dcfac4 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -124,7 +124,7 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
 				    struct kvm_s2_trans *trans);
 extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
 extern void kvm_nested_s2_wp(struct kvm *kvm);
-extern void kvm_nested_s2_unmap(struct kvm *kvm);
+extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
 extern void kvm_nested_s2_flush(struct kvm *kvm);
 
 unsigned long compute_tlb_inval_range(struct kvm_s2_mmu *mmu, u64 val);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a509b63bd4dd..0f7658aefa1a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -328,9 +328,10 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
 				   may_block));
 }
 
-void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
+void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
+			    u64 size, bool may_block)
 {
-	__unmap_stage2_range(mmu, start, size, true);
+	__unmap_stage2_range(mmu, start, size, may_block);
 }
 
 void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
@@ -1015,7 +1016,7 @@ static void stage2_unmap_memslot(struct kvm *kvm,
 
 		if (!(vma->vm_flags & VM_PFNMAP)) {
 			gpa_t gpa = addr + (vm_start - memslot->userspace_addr);
-			kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start);
+			kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start, true);
 		}
 		hva = vm_end;
 	} while (hva < reg_end);
@@ -1042,7 +1043,7 @@ void stage2_unmap_vm(struct kvm *kvm)
 	kvm_for_each_memslot(memslot, bkt, slots)
 		stage2_unmap_memslot(kvm, memslot);
 
-	kvm_nested_s2_unmap(kvm);
+	kvm_nested_s2_unmap(kvm, true);
 
 	write_unlock(&kvm->mmu_lock);
 	mmap_read_unlock(current->mm);
@@ -1912,7 +1913,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 			     (range->end - range->start) << PAGE_SHIFT,
 			     range->may_block);
 
-	kvm_nested_s2_unmap(kvm);
+	kvm_nested_s2_unmap(kvm, range->may_block);
 	return false;
 }
 
@@ -2179,8 +2180,8 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 	phys_addr_t size = slot->npages << PAGE_SHIFT;
 
 	write_lock(&kvm->mmu_lock);
-	kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size);
-	kvm_nested_s2_unmap(kvm);
+	kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
+	kvm_nested_s2_unmap(kvm, true);
 	write_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index df670c14e1c6..58d3b998793a 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -634,7 +634,7 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 
 	/* Clear the old state */
 	if (kvm_s2_mmu_valid(s2_mmu))
-		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu));
+		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
 
 	/*
 	 * The virtual VMID (modulo CnP) will be used as a key when matching
@@ -745,7 +745,7 @@ void kvm_nested_s2_wp(struct kvm *kvm)
 	}
 }
 
-void kvm_nested_s2_unmap(struct kvm *kvm)
+void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 {
 	int i;
 
@@ -755,7 +755,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm)
 		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
 
 		if (kvm_s2_mmu_valid(mmu))
-			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu));
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
 	}
 }
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index dad88e31f953..e0c7cc16466e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2903,7 +2903,7 @@ static bool handle_alle1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 	 * Drop all shadow S2s, resulting in S1/S2 TLBIs for each of the
 	 * corresponding VMIDs.
 	 */
-	kvm_nested_s2_unmap(vcpu->kvm);
+	kvm_nested_s2_unmap(vcpu->kvm, true);
 
 	write_unlock(&vcpu->kvm->mmu_lock);
 
@@ -2955,7 +2955,7 @@ union tlbi_info {
 static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
 			       const union tlbi_info *info)
 {
-	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size);
+	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true);
 }
 
 static bool handle_vmalls12e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
@@ -3050,7 +3050,7 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
 	max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
 	base_addr &= ~(max_size - 1);
 
-	kvm_stage2_unmap_range(mmu, base_addr, max_size);
+	kvm_stage2_unmap_range(mmu, base_addr, max_size, true);
 }
 
 static bool handle_ipas2e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
-- 
2.47.0.rc0.187.ge670bccf7e-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
  2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton
  2024-10-07 16:42 ` [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton
@ 2024-10-07 16:42 ` Oliver Upton
  2024-10-07 19:06   ` Marc Zyngier
  2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton
  3 siblings, 1 reply; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson, Oliver Upton

Currently, when a nested MMU is repurposed for some other MMU context,
KVM unmaps everything during vcpu_load() while holding the MMU lock for
write. This is quite a performance bottleneck for large nested VMs, as
all vCPU scheduling will spin until the unmap completes.

Start punting the MMU cleanup to a vCPU request, where it is then
possible to periodically release the MMU lock and CPU in the presence of
contention.

Ensure that no vCPU winds up using a stale MMU by tracking the pending
unmap on the S2 MMU itself and requesting an unmap on every vCPU that
finds it.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_host.h   |  3 +++
 arch/arm64/include/asm/kvm_nested.h |  2 ++
 arch/arm64/kvm/arm.c                |  2 ++
 arch/arm64/kvm/nested.c             | 21 +++++++++++++++++++--
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 329619c6fa96..16b4b0003e7d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -51,6 +51,7 @@
 #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
 #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
 #define KVM_REQ_RESYNC_PMU_EL0	KVM_ARCH_REQ(7)
+#define KVM_REQ_NESTED_S2_UNMAP	KVM_ARCH_REQ(8)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
@@ -216,6 +217,8 @@ struct kvm_s2_mmu {
 	 * >0: Somebody is actively using this.
 	 */
 	atomic_t refcnt;
+
+	bool	pending_unmap;
 };
 
 struct kvm_arch_memory_slot {
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index e74b90dcfac4..233e65522716 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
 extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
 extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
 
+extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu);
+
 struct kvm_s2_trans {
 	phys_addr_t output;
 	unsigned long block_size;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a0d01c46e408..c7b659604bae 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_dirty_ring_check_request(vcpu))
 			return 0;
+
+		check_nested_vcpu_requests(vcpu);
 	}
 
 	return 1;
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 58d3b998793a..71b96eaf6632 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 	/* Set the scene for the next search */
 	kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
 
-	/* Clear the old state */
+	/* Make sure we don't forget to do the laundry */
 	if (kvm_s2_mmu_valid(s2_mmu))
-		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
+		s2_mmu->pending_unmap = true;
 
 	/*
 	 * The virtual VMID (modulo CnP) will be used as a key when matching
@@ -650,6 +650,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 
 out:
 	atomic_inc(&s2_mmu->refcnt);
+	if (s2_mmu->pending_unmap)
+		kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);
+
 	return s2_mmu;
 }
 
@@ -1199,3 +1202,17 @@ int kvm_init_nv_sysregs(struct kvm *kvm)
 
 	return 0;
 }
+
+void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
+{
+	if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) {
+		struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu;
+
+		write_lock(&vcpu->kvm->mmu_lock);
+		if (mmu->pending_unmap) {
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
+			mmu->pending_unmap = false;
+		}
+		write_unlock(&vcpu->kvm->mmu_lock);
+	}
+}
-- 
2.47.0.rc0.187.ge670bccf7e-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
@ 2024-10-07 19:06   ` Marc Zyngier
  2024-10-07 19:16     ` Oliver Upton
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Zyngier @ 2024-10-07 19:06 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson

On Mon, 07 Oct 2024 17:42:54 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Currently, when a nested MMU is repurposed for some other MMU context,
> KVM unmaps everything during vcpu_load() while holding the MMU lock for
> write. This is quite a performance bottleneck for large nested VMs, as
> all vCPU scheduling will spin until the unmap completes.
> 
> Start punting the MMU cleanup to a vCPU request, where it is then
> possible to periodically release the MMU lock and CPU in the presence of
> contention.
> 
> Ensure that no vCPU winds up using a stale MMU by tracking the pending
> unmap on the S2 MMU itself and requesting an unmap on every vCPU that
> finds it.
> 
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> ---
>  arch/arm64/include/asm/kvm_host.h   |  3 +++
>  arch/arm64/include/asm/kvm_nested.h |  2 ++
>  arch/arm64/kvm/arm.c                |  2 ++
>  arch/arm64/kvm/nested.c             | 21 +++++++++++++++++++--
>  4 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 329619c6fa96..16b4b0003e7d 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -51,6 +51,7 @@
>  #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
>  #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
>  #define KVM_REQ_RESYNC_PMU_EL0	KVM_ARCH_REQ(7)
> +#define KVM_REQ_NESTED_S2_UNMAP	KVM_ARCH_REQ(8)
>  
>  #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>  				     KVM_DIRTY_LOG_INITIALLY_SET)
> @@ -216,6 +217,8 @@ struct kvm_s2_mmu {
>  	 * >0: Somebody is actively using this.
>  	 */
>  	atomic_t refcnt;
> +
> +	bool	pending_unmap;

Nit: move this up next to nested_stage2_enabled, saving a grand total
of 4 bytes...

>  };
>  
>  struct kvm_arch_memory_slot {
> diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
> index e74b90dcfac4..233e65522716 100644
> --- a/arch/arm64/include/asm/kvm_nested.h
> +++ b/arch/arm64/include/asm/kvm_nested.h
> @@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
>  extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
>  extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
>  
> +extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu);
> +
>  struct kvm_s2_trans {
>  	phys_addr_t output;
>  	unsigned long block_size;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a0d01c46e408..c7b659604bae 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_dirty_ring_check_request(vcpu))
>  			return 0;
> +
> +		check_nested_vcpu_requests(vcpu);
>  	}
>  
>  	return 1;
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 58d3b998793a..71b96eaf6632 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
>  	/* Set the scene for the next search */
>  	kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
>  
> -	/* Clear the old state */
> +	/* Make sure we don't forget to do the laundry */
>  	if (kvm_s2_mmu_valid(s2_mmu))
> -		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
> +		s2_mmu->pending_unmap = true;
>  
>  	/*
>  	 * The virtual VMID (modulo CnP) will be used as a key when matching
> @@ -650,6 +650,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
>  
>  out:
>  	atomic_inc(&s2_mmu->refcnt);
> +	if (s2_mmu->pending_unmap)
> +		kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);
> +

This bit is awesomely subtle. Any vcpu incrementing the refcount on an
MMU being laundered gets to partake in the cleaning process. Without
this, the second guy would get to use the MMU before it has been
cleaned-up.

If you don't mind, I'd like to add a comment to that effect, so that
our future selves remember what this is about.

And move the pending_unmap field in the process.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-07 19:06   ` Marc Zyngier
@ 2024-10-07 19:16     ` Oliver Upton
  0 siblings, 0 replies; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 19:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson

On Mon, Oct 07, 2024 at 08:06:30PM +0100, Marc Zyngier wrote:
> On Mon, 07 Oct 2024 17:42:54 +0100,
> Oliver Upton <oliver.upton@linux.dev> wrote:
> > 
> > Currently, when a nested MMU is repurposed for some other MMU context,
> > KVM unmaps everything during vcpu_load() while holding the MMU lock for
> > write. This is quite a performance bottleneck for large nested VMs, as
> > all vCPU scheduling will spin until the unmap completes.
> > 
> > Start punting the MMU cleanup to a vCPU request, where it is then
> > possible to periodically release the MMU lock and CPU in the presence of
> > contention.
> > 
> > Ensure that no vCPU winds up using a stale MMU by tracking the pending
> > unmap on the S2 MMU itself and requesting an unmap on every vCPU that
> > finds it.
> > 
> > Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> > ---
> >  arch/arm64/include/asm/kvm_host.h   |  3 +++
> >  arch/arm64/include/asm/kvm_nested.h |  2 ++
> >  arch/arm64/kvm/arm.c                |  2 ++
> >  arch/arm64/kvm/nested.c             | 21 +++++++++++++++++++--
> >  4 files changed, 26 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 329619c6fa96..16b4b0003e7d 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -51,6 +51,7 @@
> >  #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
> >  #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
> >  #define KVM_REQ_RESYNC_PMU_EL0	KVM_ARCH_REQ(7)
> > +#define KVM_REQ_NESTED_S2_UNMAP	KVM_ARCH_REQ(8)
> >  
> >  #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> >  				     KVM_DIRTY_LOG_INITIALLY_SET)
> > @@ -216,6 +217,8 @@ struct kvm_s2_mmu {
> >  	 * >0: Somebody is actively using this.
> >  	 */
> >  	atomic_t refcnt;
> > +
> > +	bool	pending_unmap;
> 
> Nit: move this up next to nested_stage2_enabled, saving a grand total
> of 4 bytes...

/facepalm I had it that way in some WIP flavor of this series, looks
like I dropped that diff. Will do.

> >  	atomic_inc(&s2_mmu->refcnt);
> > +	if (s2_mmu->pending_unmap)
> > +		kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);
> > +
> 
> This bit is awesomely subtle. Any vcpu incrementing the refcount on an
> MMU being laundered gets to partake in the cleaning process. Without
> this, the second guy would get to use the MMU before it has been
> cleaned-up.
> 

How about:

	/*
	 * Set the vCPU request to perform an unmap, even if the pending
	 * unmap originates from another vCPU. This guarantees that the
	 * MMU has been completely unmapped before any vCPU actually
	 * uses it, and allows multiple vCPUs to lend a hand with
	 * completing the unmap.
	 */
	if (s2_mmu->pending_unmap)
		kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
  2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
                   ` (2 preceding siblings ...)
  2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
@ 2024-10-07 16:42 ` Oliver Upton
  2024-10-07 18:51   ` Sean Christopherson
  3 siblings, 1 reply; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 16:42 UTC (permalink / raw)
  To: kvmarm
  Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Sean Christopherson, Oliver Upton

There's been a decent amount of attention around unmaps of nested MMUs,
and TLBI handling is no exception to this. Add a comment clarifying why
it is safe to reschedule during a TLBI unmap, even without a reference
on the MMU in progress.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e0c7cc16466e..b773d107cd35 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2955,6 +2955,29 @@ union tlbi_info {
 static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
 			       const union tlbi_info *info)
 {
+	/*
+	 * The unmap operation is allowed to drop the MMU lock and block, which
+	 * means that @mmu could be used for a different context than the one
+	 * currently being invalidated.
+	 *
+	 * This behavior is still safe, as:
+	 *
+	 *  1) The vCPU that recycled the MMU is responsible for invalidating
+	 *     the entire MMU before reusing it, which still honors the intent
+	 *     of a TLBI.
+	 *
+	 *  2) Until the guest TLBI instruction is 'retired' (i.e. increment PC
+	 *     and ERET to the guest), other vCPUs are allowed to use stale
+	 *     translations.
+	 *
+	 *  3) Accidentally unmapping an unrelated MMU context is nonfatal, and
+	 *     at worst may cause more aborts for shadow stage-2 fills.
+	 *
+	 * Dropping the MMU lock also implies that shadow stage-2 fills could
+	 * happen behind the back of the TLBI. This is still safe, though, as
+	 * the L1 needs to put its stage-2 in a consistent state before doing
+	 * the TLBI.
+	 */
 	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true);
 }
 
@@ -3050,6 +3073,10 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
 	max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
 	base_addr &= ~(max_size - 1);
 
+	/*
+	 * See comment in s2_mmu_unmap_range() for why this is allowed to
+	 * reschedule.
+	 */
 	kvm_stage2_unmap_range(mmu, base_addr, max_size, true);
 }
 
-- 
2.47.0.rc0.187.ge670bccf7e-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
  2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton
@ 2024-10-07 18:51   ` Sean Christopherson
  2024-10-07 19:20     ` Oliver Upton
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-10-07 18:51 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024, Oliver Upton wrote:
> There's been a decent amount of attention around unmaps of nested MMUs,
> and TLBI handling is no exception to this. Add a comment clarifying why
> it is safe to reschedule during a TLBI unmap, even without a reference
> on the MMU in progress.
> 
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> ---
>  arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index e0c7cc16466e..b773d107cd35 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -2955,6 +2955,29 @@ union tlbi_info {
>  static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
>  			       const union tlbi_info *info)
>  {
> +	/*
> +	 * The unmap operation is allowed to drop the MMU lock and block, which
> +	 * means that @mmu could be used for a different context than the one
> +	 * currently being invalidated.
> +	 *
> +	 * This behavior is still safe, as:
> +	 *
> +	 *  1) The vCPU that recycled the MMU is responsible for invalidating
> +	 *     the entire MMU before reusing it, which still honors the intent
> +	 *     of a TLBI.

I think it's worth calling out that other vCPUs that want to reuse the "new" MMU
will also ensure the entire MMU is invalidated before reusing it, and maybe even
that multiple vCPUs can participate in recyling an MMU.  The "The vCPU" wording
is probably going to be interpreted by most readers as meaning only _one_ vCPU
will ensure the MMU is invalidated.

> +	 *  2) Until the guest TLBI instruction is 'retired' (i.e. increment PC
> +	 *     and ERET to the guest), other vCPUs are allowed to use stale
> +	 *     translations.
> +	 *
> +	 *  3) Accidentally unmapping an unrelated MMU context is nonfatal, and
> +	 *     at worst may cause more aborts for shadow stage-2 fills.
> +	 *
> +	 * Dropping the MMU lock also implies that shadow stage-2 fills could
> +	 * happen behind the back of the TLBI. This is still safe, though, as
> +	 * the L1 needs to put its stage-2 in a consistent state before doing
> +	 * the TLBI.
> +	 */
>  	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true);
>  }
>  
> @@ -3050,6 +3073,10 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
>  	max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
>  	base_addr &= ~(max_size - 1);
>  
> +	/*
> +	 * See comment in s2_mmu_unmap_range() for why this is allowed to
> +	 * reschedule.
> +	 */
>  	kvm_stage2_unmap_range(mmu, base_addr, max_size, true);
>  }
>  
> -- 
> 2.47.0.rc0.187.ge670bccf7e-goog
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
  2024-10-07 18:51   ` Sean Christopherson
@ 2024-10-07 19:20     ` Oliver Upton
  0 siblings, 0 replies; 14+ messages in thread
From: Oliver Upton @ 2024-10-07 19:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Mon, Oct 07, 2024 at 11:51:37AM -0700, Sean Christopherson wrote:
> On Mon, Oct 07, 2024, Oliver Upton wrote:
> > There's been a decent amount of attention around unmaps of nested MMUs,
> > and TLBI handling is no exception to this. Add a comment clarifying why
> > it is safe to reschedule during a TLBI unmap, even without a reference
> > on the MMU in progress.
> > 
> > Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> > ---
> >  arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> > index e0c7cc16466e..b773d107cd35 100644
> > --- a/arch/arm64/kvm/sys_regs.c
> > +++ b/arch/arm64/kvm/sys_regs.c
> > @@ -2955,6 +2955,29 @@ union tlbi_info {
> >  static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
> >  			       const union tlbi_info *info)
> >  {
> > +	/*
> > +	 * The unmap operation is allowed to drop the MMU lock and block, which
> > +	 * means that @mmu could be used for a different context than the one
> > +	 * currently being invalidated.
> > +	 *
> > +	 * This behavior is still safe, as:
> > +	 *
> > +	 *  1) The vCPU that recycled the MMU is responsible for invalidating
> > +	 *     the entire MMU before reusing it, which still honors the intent
> > +	 *     of a TLBI.
> 
> I think it's worth calling out that other vCPUs that want to reuse the "new" MMU
> will also ensure the entire MMU is invalidated before reusing it, and maybe even
> that multiple vCPUs can participate in recyling an MMU.  The "The vCPU" wording
> is probably going to be interpreted by most readers as meaning only _one_ vCPU
> will ensure the MMU is invalidated.

I'm going to s/vCPU/vCPU(s)/ this block, but I'll leave any other detail
of how recycling works at the point it gets inflicted on potentially
more than one vCPU.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-10-07 22:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-07 16:42 [PATCH v2 0/4] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
2024-10-07 16:42 ` [PATCH v2 1/4] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out Oliver Upton
2024-10-07 18:39   ` Sean Christopherson
2024-10-07 19:01     ` Oliver Upton
2024-10-07 19:52       ` Sean Christopherson
2024-10-07 21:22         ` Oliver Upton
2024-10-07 22:18           ` Sean Christopherson
2024-10-07 16:42 ` [PATCH v2 2/4] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton
2024-10-07 16:42 ` [PATCH v2 3/4] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
2024-10-07 19:06   ` Marc Zyngier
2024-10-07 19:16     ` Oliver Upton
2024-10-07 16:42 ` [PATCH v2 4/4] KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule Oliver Upton
2024-10-07 18:51   ` Sean Christopherson
2024-10-07 19:20     ` Oliver Upton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.