* [PATCH] KVM: arm64: nv: do not inject L2-bound IRQs to L1 hypervisor
@ 2025-10-02 21:00 Volodymyr Babchuk
2025-10-03 0:04 ` Oliver Upton
0 siblings, 1 reply; 3+ messages in thread
From: Volodymyr Babchuk @ 2025-10-02 21:00 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org
Cc: Volodymyr Babchuk, Marc Zyngier, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
Wei-Lin Chang, Christoffer Dall, Alok Tiwari
There is a class of "virtual" HW interrupts: interrupts that are
generated by a device model (such as QEMU or kvmtool) and are therefore
seen as hardware interrupts by the L1 hypervisor, even though they are
not backed by real HW interrupts. These interrupts are initially
targeted at the L1 hypervisor, activated by it, and then either handled
by the hypervisor itself or re-injected into an L2 guest. Usual stuff.
In the non-nested case this is perfectly fine: the hypervisor can (and
will) activate all pending IRQs at once and then feed them to the guest
batch by batch, with the batch size depending on the LR count, of
course. But in the NV case this causes a problem, because KVM maintains
the LRs for the L1 hypervisor and does not remove active entries from
them in case the L1 hypervisor later wishes to deactivate them. The L1
hypervisor, in turn, waits for the L2 guest to perform the actual
deactivation.
The bug happens when the number of active IRQs equals the LR<n> count.
1. KVM tries to inject one more IRQ into the L1 hypervisor: it triggers
an IRQ exception at vEL2 and tries to cram the new IRQ into the LRs,
but as all LRs are already in use, the IRQ remains on the ap_list.
2. The L1 hypervisor tries to handle the exception by activating the
new IRQ, but it is not present in the LRs, so the GIC returns 1023
(spurious) on the ICC_IAR1_EL1 read.
3. The L1 hypervisor sees that there are no new interrupts and ERETs
back to the L2 guest, so that the guest can complete its own interrupt
handler.
4. KVM still sees a pending IRQ for the L1 hypervisor, so GOTO 1.
This particular bug was observed with Xen as the L1 hypervisor, QEMU as
the device model, and lots of virtio-MMIO devices passed through to a
DomU.
The difference between nested virtualization and the "bare metal" case
is that the real GIC can track many active interrupts simultaneously,
while the vGIC is limited to the 4..16 available LRs.
This patch tries to fix the problem by assuming that the L1 hypervisor
will not touch any IRQ it has already injected into a guest. So, while
processing the shadow LRs, we can mark any LR entry with the HW bit set
as "targeted at L2" and skip the corresponding entry in the real LRs
when entering the L1 hypervisor. With this approach the L1 hypervisor
only sees IRQs that are either pending or active and not targeted at
the L2 guest.
Link: https://lists.infradead.org/pipermail/linux-arm-kernel/2025-October/1067534.html
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
This should cover real HW IRQs as well. If such an IRQ is passed
through all the way down to the L2 guest, it should be correctly
deactivated when the L2 guest performs the EOI. But it will not be
deactivated if the L1 hypervisor first passes it to the L2 guest and
then tries to deactivate it by itself, because in that case there will
be no corresponding entry in LR<n>. So, we are expecting that all L1
hypervisors will be well-behaved. Hence the RFC tag for this patch.
---
arch/arm64/kvm/vgic/vgic-v3-nested.c | 6 +++++-
arch/arm64/kvm/vgic/vgic.c | 11 +++++++++++
include/kvm/arm_vgic.h | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v3-nested.c b/arch/arm64/kvm/vgic/vgic-v3-nested.c
index 7f1259b49c505..bdd1fb78e3682 100644
--- a/arch/arm64/kvm/vgic/vgic-v3-nested.c
+++ b/arch/arm64/kvm/vgic/vgic-v3-nested.c
@@ -286,9 +286,13 @@ void vgic_v3_sync_nested(struct kvm_vcpu *vcpu)
if (WARN_ON(!irq)) /* Shouldn't happen as we check on load */
continue;
+ irq->targets_l2 = true;
+
lr = __gic_v3_get_lr(lr_map_idx_to_shadow_idx(shadow_if, i));
- if (!(lr & ICH_LR_STATE))
+ if (!(lr & ICH_LR_STATE)) {
irq->active = false;
+ irq->targets_l2 = false;
+ }
vgic_put_irq(vcpu->kvm, irq);
}
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 6dd5a10081e27..6f6759a74569a 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -858,6 +858,17 @@ static void vgic_flush_lr_state(struct kvm_vcpu *vcpu)
break;
}
+ /*
+ * If we are switching to the L1 hypervisor, only populate the
+ * LRs with IRQs that target it directly, skipping IRQs that
+ * target its L2 guest.
+ */
+ if (vcpu_has_nv(vcpu) && !vgic_state_is_nested(vcpu) &&
+ irq->targets_l2) {
+ raw_spin_unlock(&irq->irq_lock);
+ continue;
+ }
+
if (likely(vgic_target_oracle(irq) == vcpu)) {
vgic_populate_lr(vcpu, irq, count++);
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4000ff16f2957..f3a4561be1ca2 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -145,6 +145,7 @@ struct vgic_irq {
bool enabled;
bool hw; /* Tied to HW IRQ */
+ bool targets_l2; /* (Nesting) Targeted at L2 guest */
refcount_t refcount; /* Used for LPIs */
u32 hwintid; /* HW INTID number */
unsigned int host_irq; /* linux irq corresponding to hwintid */
--
2.50.1
* Re: [PATCH] KVM: arm64: nv: do not inject L2-bound IRQs to L1 hypervisor
2025-10-02 21:00 [PATCH] KVM: arm64: nv: do not inject L2-bound IRQs to L1 hypervisor Volodymyr Babchuk
@ 2025-10-03 0:04 ` Oliver Upton
2025-10-03 13:48 ` Volodymyr Babchuk
0 siblings, 1 reply; 3+ messages in thread
From: Oliver Upton @ 2025-10-03 0:04 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org, Marc Zyngier, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
Wei-Lin Chang, Christoffer Dall, Alok Tiwari
Hi Volodymyr,
On Thu, Oct 02, 2025 at 09:00:11PM +0000, Volodymyr Babchuk wrote:
> The difference between nested virtualization and the "bare metal" case
> is that the real GIC can track many active interrupts simultaneously,
> while the vGIC is limited to the 4..16 available LRs.
There isn't an architectural limitation here. Nothing prevents a
virtualized GIC from representing more active IRQs than there are LRs in
hardware.
ICH_HCR_EL2.LRENPIE allows you to take a trap when an EOI is received
for an IRQ that exists outside of the list registers, which would allow
the deactivation of the SW view of that IRQ.
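Roughly, the handling could look like the sketch below. This is
illustrative only: vgic_sw_deactivate_one() and the way the shadow
ICH_HCR_EL2 value is passed around are assumptions for the sake of the
example, not the actual KVM code.

/*
 * Illustrative sketch only (not the actual KVM code): shows how
 * ICH_HCR_EL2.LRENPIE + EOIcount could be used to deactivate the SW
 * view of IRQs that were EOIed without a list register entry.
 * vgic_sw_deactivate_one() is a hypothetical helper; the EOIcount
 * field lives in ICH_HCR_EL2[31:27] per the GICv3 architecture.
 */
#include <linux/bitfield.h>	/* FIELD_GET() */
#include <linux/bits.h>		/* GENMASK_ULL() */

#define SKETCH_EOICOUNT_MASK	GENMASK_ULL(31, 27)

static void handle_lr_underflow(struct kvm_vcpu *vcpu, u64 *shadow_hcr)
{
	/* Each EOI performed for a vIRQ with no LR entry bumps EOIcount */
	int eoicount = FIELD_GET(SKETCH_EOICOUNT_MASK, *shadow_hcr);

	/* Fold those "missed" deactivations back into the SW state */
	while (eoicount--)
		vgic_sw_deactivate_one(vcpu);	/* hypothetical helper */

	/* Reset the counter so further underflows are counted afresh */
	*shadow_hcr &= ~SKETCH_EOICOUNT_MASK;
}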
As Marc suggested, the correct thing to do is adjust the sorting of IRQs
such that pending IRQs fill LRs before those in an active state.
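Something along the lines of the comparator below, i.e. letting pending
IRQs win over active-only ones when sorting the ap_list. Again just a
sketch under my assumptions: the function name is made up, and the
locking and extra ordering rules handled by the real vgic_irq_cmp() are
omitted.

/*
 * Sketch of an ap_list sort comparator that ranks pending IRQs ahead
 * of IRQs that are only active, so that pending IRQs grab LRs first
 * when the ap_list is larger than the number of list registers.
 */
static int vgic_irq_cmp_pending_first(void *priv, const struct list_head *a,
				      const struct list_head *b)
{
	struct vgic_irq *irqa = container_of(a, struct vgic_irq, ap_list);
	struct vgic_irq *irqb = container_of(b, struct vgic_irq, ap_list);
	bool pa = irq_is_pending(irqa);
	bool pb = irq_is_pending(irqb);

	/* A pending IRQ sorts before an IRQ that is merely active */
	if (pa != pb)
		return pb - pa;

	/* Otherwise keep the usual priority order (lower value first) */
	return irqa->priority - irqb->priority;
}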
> diff --git a/arch/arm64/kvm/vgic/vgic-v3-nested.c b/arch/arm64/kvm/vgic/vgic-v3-nested.c
> index 7f1259b49c505..bdd1fb78e3682 100644
> --- a/arch/arm64/kvm/vgic/vgic-v3-nested.c
> +++ b/arch/arm64/kvm/vgic/vgic-v3-nested.c
> @@ -286,9 +286,13 @@ void vgic_v3_sync_nested(struct kvm_vcpu *vcpu)
> if (WARN_ON(!irq)) /* Shouldn't happen as we check on load */
> continue;
>
> + irq->targets_l2 = true;
> +
> lr = __gic_v3_get_lr(lr_map_idx_to_shadow_idx(shadow_if, i));
> - if (!(lr & ICH_LR_STATE))
> + if (!(lr & ICH_LR_STATE)) {
> irq->active = false;
> + irq->targets_l2 = false;
> + }
>
> vgic_put_irq(vcpu->kvm, irq);
> }
> diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
> index 6dd5a10081e27..6f6759a74569a 100644
> --- a/arch/arm64/kvm/vgic/vgic.c
> +++ b/arch/arm64/kvm/vgic/vgic.c
> @@ -858,6 +858,17 @@ static void vgic_flush_lr_state(struct kvm_vcpu *vcpu)
> break;
> }
>
> + /*
> + * If we are switching to the L1 hypervisor, only populate the
> + * LRs with IRQs that target it directly, skipping IRQs that
> + * target its L2 guest.
> + */
> + if (vcpu_has_nv(vcpu) && !vgic_state_is_nested(vcpu) &&
> + irq->targets_l2) {
> + raw_spin_unlock(&irq->irq_lock);
> + continue;
> + }
> +
> if (likely(vgic_target_oracle(irq) == vcpu)) {
> vgic_populate_lr(vcpu, irq, count++);
This has some serious issues as it violates our architectural model of
the GIC.
The existence of an LR associating a vIRQ to a pIRQ only has relevance
when in a nested state, specifically when handling vIRQ deactivation.
Besides that it is just a number...
For example, imagine the guest hypervisor associated an EL1 timer IRQ
with an LR and later tried to clear the active state and re-arm the
timer. Seems like we would miss the IRQ entirely.
I'd really like to see a solution similar to Marc's proposal which
addresses the fundamental problem of active IRQs overflowing the list
registers.
Thanks,
Oliver
* Re: [PATCH] KVM: arm64: nv: do not inject L2-bound IRQs to L1 hypervisor
2025-10-03 0:04 ` Oliver Upton
@ 2025-10-03 13:48 ` Volodymyr Babchuk
0 siblings, 0 replies; 3+ messages in thread
From: Volodymyr Babchuk @ 2025-10-03 13:48 UTC (permalink / raw)
To: Oliver Upton
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org, Marc Zyngier, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
Wei-Lin Chang, Christoffer Dall, Alok Tiwari
Hi Oliver,
Oliver Upton <oliver.upton@linux.dev> writes:
> Hi Volodymyr,
>
> On Thu, Oct 02, 2025 at 09:00:11PM +0000, Volodymyr Babchuk wrote:
>> The difference between nested virtualization and the "bare metal" case
>> is that the real GIC can track many active interrupts simultaneously,
>> while the vGIC is limited to the 4..16 available LRs.
>
> There isn't an architectural limitation here. Nothing prevents a
> virtualized GIC from representing more active IRQs than there are LRs in
> hardware.
>
> ICH_HCR_EL2.LRENPIE allows you to take a trap when an EOI is received
> for an IRQ that exists outside of teh list registers which would allow
> the deactivation of the SW view of that IRQ.
Ah, right. I had missed this feature.
>
> As Marc suggested, the correct thing to do is adjust the sorting of IRQs
> such that pending IRQs fill LRs before those in an active state.
Looks like I missed the email where he suggested that, but yeah, this
was my first idea as well; I just didn't know about the LRENPIE bit.
[...]
> I'd really like to see a solution similar to Marc's proposal which
> addresses the fundamental problem of active IRQs overflowing the list
> registers.
>
I'm sorry, but it appears that I won't be able to put any more effort
into this issue right now. My boss wants me to switch to another
pressing task... So yeah, I won't mind if someone else continues the
work.
Also, I am sorry that I can't share a reproducer, but as I said, that
would require the whole whopping Android build, and believe me, you
don't want that. I suspect that this issue can be reproduced by firing
lots of interrupts simultaneously in kvmtool. And of course in that
case there is nothing Xen-specific, so KVM as the L1 hypervisor should
be fine too. But I have no time to verify this.
--
WBR, Volodymyr