* [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
@ 2025-09-05 15:26 Khushit Shah
2025-09-08 9:05 ` Vitaly Kuznetsov
0 siblings, 1 reply; 8+ messages in thread
From: Khushit Shah @ 2025-09-05 15:26 UTC (permalink / raw)
To: seanjc@google.com, pbonzini@redhat.com, tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: hpa@zytor.com, x86@kernel.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
[1.] One line summary:
[KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
[2.] Problem/Report:
When running Windows with Credential Guard enabled and a split irqchip, level-triggered interrupts are not properly forwarded to L2 (Credential Guard) by L1 (Windows); instead, L1 EOIs the interrupt, which leads to extremely slow Windows boot times. The issue is only seen with Intel + split-irqchip; Intel + kernel-irqchip and AMD + (kernel/split)-irqchip work fine.
QEMU command used to create the VM:
/usr/libexec/qemu-kvm \
-machine q35,accel=kvm,smm=on,usb=off,acpi=on,kernel-irqchip=split \
-cpu host,+vmx,+invpcid,+ssse3,+aes,+xsave,+xsaveopt,+xgetbv1,+xsaves,+rdtscp,+tsc-deadline \
-m 20G -smp 1 \
-boot order=c,menu=on \
-drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE_4MB.secboot.fd \
-drive if=pflash,format=raw,file=/var/lib/qemu/nvram/win_VARS.fd \
-chardev socket,id=chrtpm,path=/var/lib/qemu/tpm-win/sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm \
-device tpm-crb,tpmdev=tpm0 \
-device ich9-usb-ehci1,id=usb \
-device ich9-usb-uhci1,masterbus=usb.0,firstport=0 \
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2 \
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4 \
-device usb-tablet,bus=usb.0,port=1 \
-device VGA,vgamem_mb=16 \
-device ich9-ahci,id=sata \
-drive file=/var/log/winsrv_2022_cg.qcow2,if=none,format=qcow2,id=hd0 \
-device ide-hd,drive=hd0,bus=sata.0 \
-vnc 0.0.0.0:5 \
-serial mon:stdio
Note that the USB controllers and USB devices generate the INTx interrupts and hence are important for the repro; the qcow2 image is Windows Server 2022 with UEFI boot and Credential Guard. Removing kernel-irqchip=split (i.e. using the in-kernel irqchip) does not trigger the issue.
Processor Information: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
Kernel Info:
6.12.26-11.0s20c78r5.el9.x86_64 (bug present)
6.17.0-11.0s20c79r5.el9.x86_64 (bug present)
Config file: standard config, KVM specific config here: https://pastebin.com/skXYQeJn
Qemu version: 9.1.0
We see the following turn of events with kernel-irqchip:
a) A level-triggered interrupt on pin 19 arrives after an MMIO write from L2 (the CG VM) to the USB device (not shown in the logs above).
b) Via kvm_nested_vmexit_inject, the L2 exit reason (EXTERNAL_INTERRUPT) is made visible to L1.
c) We see an L1 entry, which immediately re-enters L2 with VMRESUME.
d) L2 services the interrupt and performs an MMIO write, which de-asserts the level-triggered interrupt.
e) L2 then EOIs the interrupt after the de-assertion -> L1 EOIs the interrupt.
While for split-irqchip we see the following turn of events:
a) and b) are the same.
c) L1 enters and EOIs immediately, without letting L2 service the interrupt first (the line never de-asserts).
d) The interrupt gets re-injected repeatedly -> boot crawls.
[KVM tracepoints at the bottom]
We probed the interrupt delivery path in both cases; it differs only slightly. With kernel-irqchip, `kvm_set_irq` calls `kvm_ioapic_set_irq`, while with split-irqchip it calls `kvm_msi_set_irq`. Both paths merge at `kvm_irq_delivery_to_apic` with identical arguments, and from there both cases take the same route: kvm_irq_delivery_to_apic → kvm_irq_delivery_to_apic_fast → kvm_apic_set_irq → __apic_accept_irq → vmx_deliver_interrupt → vmx_deliver_posted_interrupt → kvm_vcpu_trigger_posted_interrupt
With split-irqchip, subsequent interrupts take a different route in kvm_vcpu_trigger_posted_interrupt, as they are received while L1 (Windows) is running on the vCPU.
We also took VMCS dumps just before kvm_entry and just after kvm_exit around the first level-triggered interrupt. VMCS01 and VMCS12 are identical just before the L1 VMENTER after the first interrupt is delivered: for both VMCS01 and VMCS12, InterruptStatus, SVI|RVI, the entry/exit information, and the pin-based/cpu-based control state all match. Yet with kernel-irqchip L1 passes the interrupt to L2 with VMRESUME, while with split-irqchip it performs an EOI.
[VMCS dumps at the bottom]
We also checked the last MSR reads and writes at different addresses before the first level-triggered interrupt; the dump is available here: https://pastebin.com/PckkM8pt.
The only values that differ between kernel-irqchip and split-irqchip appear to be run-to-run variation.
We also checked the CPUID info exposed to the guest. There was only one bit of difference, specifically on function 40000001:
A@1451335: kvm:kvm_cpuid: func 40000001 idx 0 rax 10040fb rbx 0 rcx 0 rdx 0, cpuid entry found
B@1476293: kvm:kvm_cpuid: func 40000001 idx 0 rax 100c0fb rbx 0 rcx 0 rdx 0, cpuid entry found
This seems to be the KVM_FEATURE_MSI_EXT_DEST_ID bit; disabling KVM_FEATURE_MSI_EXT_DEST_ID does not change the behavior.
[3.] Keywords:
kvm,vmx,nested-virtualization,split-irqchip,windows-guest
Any pointers on this would be extremely helpful, thanks!
Thanks,
Khushit
===========================================================
KVM tracepoints in the case of kernel-irqchip:
CPU 0/KVM 699876 [020] 18136.763622: kvm:kvm_set_irq: gsi 19 level 1 source 0
CPU 0/KVM 699876 [020] 18136.763624: kvm:kvm_apic_accept_irq: apicid 0 vec 129 (Fixed|level)
CPU 0/KVM 699876 [020] 18136.763625: kvm:kvm_apicv_accept_irq: apicid 0 vec 129 (Fixed|level)
CPU 0/KVM 699876 [020] 18136.763626: kvm:kvm_ioapic_set_irq: pin 19 dst 0 vec 129 (Fixed|physical|level)
CPU 0/KVM 699876 [020] 18136.763629: kvm:kvm_set_irq: gsi 11 level 0 source 0
CPU 0/KVM 699876 [020] 18136.763631: kvm:kvm_pic_set_irq: chip 1 pin 3 (edge|masked)
CPU 0/KVM 699876 [020] 18136.763632: kvm:kvm_ioapic_set_irq: pin 11 dst 0 vec 255 (Fixed|physical|edge|masked)
CPU 0/KVM 699876 [020] 18136.763633: kvm:kvm_set_irq: gsi 11 level 0 source 0
CPU 0/KVM 699876 [020] 18136.763634: kvm:kvm_pic_set_irq: chip 1 pin 3 (edge|masked)
CPU 0/KVM 699876 [020] 18136.763634: kvm:kvm_ioapic_set_irq: pin 11 dst 0 vec 255 (Fixed|physical|edge|masked)
CPU 0/KVM 699876 [020] 18136.763640: kvm:kvm_fpu: load
CPU 0/KVM 699876 [020] 18136.763644: kvm:kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000081 int_info_err 0
CPU 0/KVM 699876 [020] 18136.763647: kvm:kvm_entry: vcpu 0, rip 0xfffff8489b64142c
CPU 0/KVM 699876 [020] 18136.763651: kvm:kvm_exit: reason VMRESUME rip 0xfffff8489b641308 info 0 0
CPU 0/KVM 699876 [020] 18136.763652: kvm:kvm_nested_vmenter: rip: 0xfffff8489b641308 vmcs: 0xffffffffffffffff nested_rip: 0xfffff8031964b2b6 int_ctl: 0x00000000 event_inj: 0x80000081 nested_ept=y nested_eptp: 0x00000001019a701e
————————————————————————————— L2 services the interrupt ———————————————————
CPU 0/KVM 699876 [020] 18136.763695: kvm:kvm_mmio: mmio write len 4 gpa 0x81111024 val 0x4
CPU 0/KVM 699876 [020] 18136.763696: kvm:kvm_fpu: unload
CPU 0/KVM 699876 [020] 18136.763696: kvm:kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
CPU 0/KVM 699876 [020] 18136.763699: kvm:kvm_set_irq: gsi 19 level 0 source 0
CPU 0/KVM 699876 [020] 18136.763700: kvm:kvm_ioapic_set_irq: pin 19 dst 0 vec 129 (Fixed|physical|level)
CPU 0/KVM 699876 [020] 18136.763701: kvm:kvm_set_irq: gsi 11 level 0 source 0
CPU 0/KVM 699876 [020] 18136.763701: kvm:kvm_pic_set_irq: chip 1 pin 3 (edge|masked)
CPU 0/KVM 699876 [020] 18136.763702: kvm:kvm_ioapic_set_irq: pin 11 dst 0 vec 255 (Fixed|physical|edge|masked)
CPU 0/KVM 699876 [020] 18136.763703: kvm:kvm_set_irq: gsi 11 level 0 source 0
CPU 0/KVM 699876 [020] 18136.763703: kvm:kvm_pic_set_irq: chip 1 pin 3 (edge|masked)
CPU 0/KVM 699876 [020] 18136.763703: kvm:kvm_ioapic_set_irq: pin 11 dst 0 vec 255 (Fixed|physical|edge|masked)
CPU 0/KVM 699876 [020] 18136.763705: kvm:kvm_fpu: load
CPU 0/KVM 699876 [020] 18136.763705: kvm:kvm_entry: vcpu 0, rip 0xfffff8031964a708
CPU 0/KVM 699876 [020] 18136.763707: kvm:kvm_exit: reason EPT_MISCONFIG rip 0xfffff8031964a727 info 0 0
CPU 0/KVM 699876 [020] 18136.763707: kvm:kvm_nested_vmexit: CAN'T FIND FIELD "rip"<CANT FIND FIELD exit_code>vcpu 0 reason EPT_MISCONFIG rip 0xfffff8031964a727 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
CPU 0/KVM 699876 [020] 18136.763709: kvm:kvm_emulate_insn: 0:fffff8031964a727: 8b 56 0c
CPU 0/KVM 699876 [020] 18136.763710: kvm:vcpu_match_mmio: gva 0xffffbd0c86b8a02c gpa 0x8111102c Read GPA
CPU 0/KVM 699876 [020] 18136.763710: kvm:kvm_mmio: mmio unsatisfied-read len 4 gpa 0x8111102c val 0x0
CPU 0/KVM 699876 [020] 18136.763710: kvm:kvm_fpu: unload
CPU 0/KVM 699876 [020] 18136.763711: kvm:kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
CPU 0/KVM 699876 [020] 18136.763712: kvm:kvm_fpu: load
CPU 0/KVM 699876 [020] 18136.763713: kvm:kvm_mmio: mmio read len 4 gpa 0x8111102c val 0xec0
CPU 0/KVM 699876 [020] 18136.763714: kvm:kvm_entry: vcpu 0, rip 0xfffff8031964a72a
CPU 0/KVM 699876 [020] 18136.763716: kvm:kvm_exit: reason MSR_WRITE rip 0xfffff8030f62d36c info 0 0
CPU 0/KVM 699876 [020] 18136.763717: kvm:kvm_nested_vmexit: CAN'T FIND FIELD "rip"<CANT FIND FIELD exit_code>vcpu 0 reason MSR_WRITE rip 0xfffff8030f62d36c info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
CPU 0/KVM 699876 [020] 18136.763719: kvm:kvm_nested_vmexit_inject: reason MSR_WRITE info1 0 info2 0 int_info 0 int_info_err 0
CPU 0/KVM 699876 [020] 18136.763721: kvm:kvm_entry: vcpu 0, rip 0xfffff8489b64142c
CPU 0/KVM 699876 [020] 18136.763724: kvm:kvm_exit: reason MSR_WRITE rip 0xfffff8489b732daf info 0 0
CPU 0/KVM 699876 [020] 18136.763725: kvm:kvm_apic: apic_write APIC_EOI = 0x0
CPU 0/KVM 699876 [020] 18136.763725: kvm:kvm_eoi: apicid 0 vector 129
KVM tracepoints in the case of split-irqchip:
CPU 0/KVM 655470 [011] 17156.779254: kvm:kvm_set_irq: gsi 19 level 1 source 0
CPU 0/KVM 655470 [011] 17156.779255: kvm:kvm_msi_set_irq: dst 0 vec 129 (Fixed|physical|level)
CPU 0/KVM 655470 [011] 17156.779256: kvm:kvm_apic_accept_irq: apicid 0 vec 129 (Fixed|level)
CPU 0/KVM 655470 [011] 17156.779256: kvm:kvm_apicv_accept_irq: apicid 0 vec 129 (Fixed|level)
CPU 0/KVM 655470 [011] 17156.779263: kvm:kvm_fpu: load
CPU 0/KVM 655470 [011] 17156.779266: kvm:kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000081 int_info_err 0
CPU 0/KVM 655470 [011] 17156.779269: kvm:kvm_entry: vcpu 0, rip 0xfffff80653e4142c
CPU 0/KVM 655470 [011] 17156.779271: kvm:kvm_exit: reason MSR_WRITE rip 0xfffff80653f32daf info 0 0
CPU 0/KVM 655470 [011] 17156.779272: kvm:kvm_apic: apic_write APIC_EOI = 0x0
CPU 0/KVM 655470 [011] 17156.779272: kvm:kvm_eoi: apicid 0 vector 129
CPU 0/KVM 655470 [011] 17156.779273: kvm:kvm_msr: msr_write 40000070 = 0x0
CPU 0/KVM 655470 [011] 17156.779274: kvm:kvm_fpu: unload
CPU 0/KVM 655470 [011] 17156.779274: kvm:kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
VMCS dump for kernel-irqchip:
CPU 0/KVM-251862 [019] 6384.132433: bprint: vmx_dump_01_02_12: KVM/VMX: ===== autodump (VM-ENTER) vcpu=0 =====
CPU 0/KVM-251862 [019] 6384.132433: bputs: vmx_dump_01_02_12: KVM/VMX: [CURRENT]
CPU 0/KVM-251862 [019] 6384.132433: bputs: vmx_dump_01_02_12: KVM/VMC: [CURRENT] is vmc01
CPU 0/KVM-251862 [019] 6384.132433: bprint: dump_vmcs: VMCS 0xffff906388d0c000, last attempted VM-entry on CPU 19
CPU 0/KVM-251862 [019] 6384.132433: bputs: dump_vmcs: *** Guest State ***
CPU 0/KVM-251862 [019] 6384.132433: bprint: dump_vmcs: CR0: actual=0x0000000080010031, shadow=0x0000000080010031, gh_mask=fffffffffffefff7
CPU 0/KVM-251862 [019] 6384.132434: bprint: dump_vmcs: CR4: actual=0x00000000000422e0, shadow=0x00000000000422e0, gh_mask=fffffffffffef871
CPU 0/KVM-251862 [019] 6384.132434: bprint: dump_vmcs: CR3 = 0x0000000101945000
CPU 0/KVM-251862 [019] 6384.132434: bprint: dump_vmcs: PDPTR0 = 0x0000000065632001 PDPTR1 = 0x0000000065631001
CPU 0/KVM-251862 [019] 6384.132434: bprint: dump_vmcs: PDPTR2 = 0x0000000065630001 PDPTR3 = 0x000000006562f001
CPU 0/KVM-251862 [019] 6384.132435: bprint: dump_vmcs: RSP = 0xffffe70000005fc0 RIP = 0xfffff8390c241308
CPU 0/KVM-251862 [019] 6384.132435: bprint: dump_vmcs: RFLAGS=0x00000002 DR7 = 0x0000000000000400
CPU 0/KVM-251862 [019] 6384.132435: bprint: vmx_dump_dtsel: GDTR: limit=0x0000ffff, base=0xfffff8390c0232b0
CPU 0/KVM-251862 [019] 6384.132435: bprint: vmx_dump_dtsel: IDTR: limit=0x0000ffff, base=0xfffff8390c0220b0
CPU 0/KVM-251862 [019] 6384.132435: bprint: vmx_dump_sel.constprop.0: TR: sel=0x0030, attr=0x0008b, limit=0x00000067, base=0xfffff8390c123320
CPU 0/KVM-251862 [019] 6384.132436: bprint: dump_vmcs: DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000
CPU 0/KVM-251862 [019] 6384.132436: bprint: dump_vmcs: Interruptibility = 00000000 ActivityState = 00000000
CPU 0/KVM-251862 [019] 6384.132436: bprint: dump_vmcs: InterruptStatus = 5100
CPU 0/KVM-251862 [019] 6384.132436: bprint: vmx_dump_msrs: MSR guest autoload:
CPU 0/KVM-251862 [019] 6384.132436: bprint: vmx_dump_msrs: 0: msr=0x00000600 value=0x0000000000000000
CPU 0/KVM-251862 [019] 6384.132437: bputs: dump_vmcs: *** Host State ***
CPU 0/KVM-251862 [019] 6384.132437: bprint: dump_vmcs: RIP = 0xffffffffc0d89630 RSP = 0xffffb5c59424bce0
CPU 0/KVM-251862 [019] 6384.132437: bprint: dump_vmcs: GDTBase=fffffe3f5127d000 IDTBase=fffffe0000000000
CPU 0/KVM-251862 [019] 6384.132437: bprint: dump_vmcs: CR0=0000000080050033 CR3=0000000183c24001 CR4=0000000000772ef0
CPU 0/KVM-251862 [019] 6384.132437: bprint: dump_vmcs: PAT = 0x0407050600070106
CPU 0/KVM-251862 [019] 6384.132437: bprint: vmx_dump_msrs: MSR host autoload:
CPU 0/KVM-251862 [019] 6384.132438: bprint: vmx_dump_msrs: 0: msr=0x00000600 value=0xfffffe3f51297000
CPU 0/KVM-251862 [019] 6384.132438: bputs: dump_vmcs: *** Control State ***
CPU 0/KVM-251862 [019] 6384.132438: bprint: dump_vmcs: CPUBased=0xb5a26dfa SecondaryExec=0x821217eb TertiaryExec=0x0000000000000010
CPU 0/KVM-251862 [019] 6384.132438: bprint: dump_vmcs: PinBased=0x000000ff EntryControls=000053ff ExitControls=000befff
CPU 0/KVM-251862 [019] 6384.132438: bprint: dump_vmcs: ExceptionBitmap=000600c2 PFECmask=00000000 PFECmatch=00000000
CPU 0/KVM-251862 [019] 6384.132439: bprint: dump_vmcs: VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
CPU 0/KVM-251862 [019] 6384.132439: bprint: dump_vmcs: VMExit: intr_info=00000000 errcode=00000000 ilen=00000003
CPU 0/KVM-251862 [019] 6384.132439: bprint: dump_vmcs: reason=00000018 qualification=0000000000000000
CPU 0/KVM-251862 [019] 6384.132439: bprint: dump_vmcs: IDTVectoring: info=00000000 errcode=00000000
CPU 0/KVM-251862 [019] 6384.132439: bprint: dump_vmcs: TSC Offset = 0xfffff447b4a645a1
CPU 0/KVM-251862 [019] 6384.132439: bprint: dump_vmcs: TSC Multiplier = 0x0001000000000000
CPU 0/KVM-251862 [019] 6384.132440: bprint: dump_vmcs: SVI|RVI = 51|00
CPU 0/KVM-251862 [019] 6384.132440: bprint: dump_vmcs: TPR Threshold = 0x00
CPU 0/KVM-251862 [019] 6384.132440: bprint: dump_vmcs: APIC-access addr = 0x000000024ea83000
CPU 0/KVM-251862 [019] 6384.132440: bprint: dump_vmcs: virt-APIC addr = 0x00000046212ed000
CPU 0/KVM-251862 [019] 6384.132440: bprint: dump_vmcs: PostedIntrVec = 0xf2
CPU 0/KVM-251862 [019] 6384.132441: bprint: dump_vmcs: EPT pointer = 0x00000040a0db305e
CPU 0/KVM-251862 [019] 6384.132441: bprint: dump_vmcs: PLE Gap=00000080 Window=00001000
CPU 0/KVM-251862 [019] 6384.132441: bprint: dump_vmcs: Virtual processor ID = 0x0011
CPU 0/KVM-251862 [019] 6384.132441: bputs: vmx_dump_01_02_12: KVM/VMX: [VMCS12]
CPU 0/KVM-251862 [019] 6384.132441: bprint: vmx_dump_01_02_12: ENTRY_INTR_INFO=0xd1 ENTRY_LEN=0 EXIT_REASON=0x1 EXIT_INTR_INFO=0x80000051 EXIT_INTR_ERR=0
CPU 0/KVM-251862 [019] 6384.132442: bprint: vmx_dump_01_02_12: IDT_VEC_INFO=0 IDT_VEC_ERR=0 EXIT_LEN=0x3 VMX_INSN_INFO=0
CPU 0/KVM-251862 [019] 6384.132442: bprint: vmx_dump_01_02_12: GUEST_INTR_STATUS=0 TPR_THRESH=0 VAPIC=0x116688000 PIDESC=0 PINV=0
CPU 0/KVM-251862 [019] 6384.132442: bprint: vmx_dump_01_02_12: EOI_EXIT_BITMAPS: [0]=0 [1]=0 [2]=0 [3]=0
CPU 0/KVM-251862 [019] 6384.132442: bprint: vmx_dump_01_02_12: PINBASED=0x3f CPU_BASED=0xb6206dfa CPU_BASED2=0x1010ae VM_ENTRY_CTRL=0x13ff VM_EXIT_CTRL=0x3efff
CPU 0/KVM-251862 [019] 6384.132443: bprint: vmx_dump_01_02_12: RIP=0xfffff8012b5f50c2 RSP=0xfffffc8791b29fd0 RFLAGS=0x40213 INTR_STATE=0 ACTV=0
CPU 0/KVM-251862 [019] 6384.132443: bprint: vmx_dump_01_02_12: CR0=0x80050033 CR3=0x119167000 CR4=0x352ef8 EPTP=0x1019ab01e
CPU 0/KVM-251862 [019] 6384.132443: bputs: vmx_dump_01_02_12: KVM/VMX: =====================================
VMCS dump for split-irqchip:
CPU 0/KVM-453395 [001] 12461.840701: bprint: vmx_dump_01_02_12: KVM/VMX: ===== autodump (VM-ENTER) vcpu=0 =====
CPU 0/KVM-453395 [001] 12461.840702: bputs: vmx_dump_01_02_12: KVM/VMX: [CURRENT]
CPU 0/KVM-453395 [001] 12461.840702: bputs: vmx_dump_01_02_12: KVM/VMC: [CURRENT] is vmc01
CPU 0/KVM-453395 [001] 12461.840702: bprint: dump_vmcs: VMCS 0xffff906904c87000, last attempted VM-entry on CPU 1
CPU 0/KVM-453395 [001] 12461.840702: bputs: dump_vmcs: *** Guest State ***
CPU 0/KVM-453395 [001] 12461.840702: bprint: dump_vmcs: CR0: actual=0x0000000080010031, shadow=0x0000000080010031, gh_mask=fffffffffffefff7
CPU 0/KVM-453395 [001] 12461.840702: bprint: dump_vmcs: CR4: actual=0x00000000000422e0, shadow=0x00000000000422e0, gh_mask=fffffffffffef871
CPU 0/KVM-453395 [001] 12461.840703: bprint: dump_vmcs: CR3 = 0x0000000101944000
CPU 0/KVM-453395 [001] 12461.840703: bprint: dump_vmcs: PDPTR0 = 0x0000000065632001 PDPTR1 = 0x0000000065631001
CPU 0/KVM-453395 [001] 12461.840703: bprint: dump_vmcs: PDPTR2 = 0x0000000065630001 PDPTR3 = 0x000000006562f001
CPU 0/KVM-453395 [001] 12461.840703: bprint: dump_vmcs: RSP = 0xffffe70000005fc0 RIP = 0xfffff808b8441308
CPU 0/KVM-453395 [001] 12461.840703: bprint: dump_vmcs: RFLAGS=0x00000002 DR7 = 0x0000000000000400
CPU 0/KVM-453395 [001] 12461.840704: bprint: vmx_dump_dtsel: GDTR: limit=0x0000ffff, base=0xfffff808b82232b0
CPU 0/KVM-453395 [001] 12461.840704: bprint: vmx_dump_dtsel: IDTR: limit=0x0000ffff, base=0xfffff808b82220b0
CPU 0/KVM-453395 [001] 12461.840704: bprint: vmx_dump_sel.constprop.0: TR: sel=0x0030, attr=0x0008b, limit=0x00000067, base=0xfffff808b8323320
CPU 0/KVM-453395 [001] 12461.840704: bprint: dump_vmcs: DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000
CPU 0/KVM-453395 [001] 12461.840704: bprint: dump_vmcs: Interruptibility = 00000000 ActivityState = 00000000
CPU 0/KVM-453395 [001] 12461.840705: bprint: dump_vmcs: InterruptStatus = 5100
CPU 0/KVM-453395 [001] 12461.840705: bprint: vmx_dump_msrs: MSR guest autoload:
CPU 0/KVM-453395 [001] 12461.840705: bprint: vmx_dump_msrs: 0: msr=0x00000600 value=0x0000000000000000
CPU 0/KVM-453395 [001] 12461.840705: bputs: dump_vmcs: *** Host State ***
CPU 0/KVM-453395 [001] 12461.840705: bprint: dump_vmcs: RIP = 0xffffffffc0d89630 RSP = 0xffffb5c5a0117ce0
CPU 0/KVM-453395 [001] 12461.840705: bprint: dump_vmcs: GDTBase=fffffe5a3972e000 IDTBase=fffffe0000000000
CPU 0/KVM-453395 [001] 12461.840705: bprint: dump_vmcs: CR0=0000000080050033 CR3=0000000109cee003 CR4=0000000000772ef0
CPU 0/KVM-453395 [001] 12461.840706: bprint: dump_vmcs: PAT = 0x0407050600070106
CPU 0/KVM-453395 [001] 12461.840706: bprint: vmx_dump_msrs: MSR host autoload:
CPU 0/KVM-453395 [001] 12461.840706: bprint: vmx_dump_msrs: 0: msr=0x00000600 value=0xfffffe5a39748000
CPU 0/KVM-453395 [001] 12461.840706: bputs: dump_vmcs: *** Control State ***
CPU 0/KVM-453395 [001] 12461.840706: bprint: dump_vmcs: CPUBased=0xb5a26dfa SecondaryExec=0x821217eb TertiaryExec=0x0000000000000010
CPU 0/KVM-453395 [001] 12461.840706: bprint: dump_vmcs: PinBased=0x000000ff EntryControls=000053ff ExitControls=000befff
CPU 0/KVM-453395 [001] 12461.840707: bprint: dump_vmcs: ExceptionBitmap=000600c2 PFECmask=00000000 PFECmatch=00000000
CPU 0/KVM-453395 [001] 12461.840707: bprint: dump_vmcs: VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
CPU 0/KVM-453395 [001] 12461.840707: bprint: dump_vmcs: VMExit: intr_info=00000000 errcode=00000000 ilen=00000003
CPU 0/KVM-453395 [001] 12461.840707: bprint: dump_vmcs: reason=00000018 qualification=0000000000000000
CPU 0/KVM-453395 [001] 12461.840708: bprint: dump_vmcs: IDTVectoring: info=00000000 errcode=00000000
CPU 0/KVM-453395 [001] 12461.840708: bprint: dump_vmcs: TSC Offset = 0xffffe9399281aa53
CPU 0/KVM-453395 [001] 12461.840708: bprint: dump_vmcs: TSC Multiplier = 0x0001000000000000
CPU 0/KVM-453395 [001] 12461.840708: bprint: dump_vmcs: SVI|RVI = 51|00
CPU 0/KVM-453395 [001] 12461.840708: bprint: dump_vmcs: TPR Threshold = 0x00
CPU 0/KVM-453395 [001] 12461.840709: bprint: dump_vmcs: APIC-access addr = 0x000000012e1ad000
CPU 0/KVM-453395 [001] 12461.840709: bprint: dump_vmcs: virt-APIC addr = 0x0000004605554000
CPU 0/KVM-453395 [001] 12461.840709: bprint: dump_vmcs: PostedIntrVec = 0xf2
CPU 0/KVM-453395 [001] 12461.840709: bprint: dump_vmcs: EPT pointer = 0x00000001bdfa805e
CPU 0/KVM-453395 [001] 12461.840709: bprint: dump_vmcs: PLE Gap=00000080 Window=00001000
CPU 0/KVM-453395 [001] 12461.840709: bprint: dump_vmcs: Virtual processor ID = 0x0011
CPU 0/KVM-453395 [001] 12461.840710: bputs: vmx_dump_01_02_12: KVM/VMX: [VMCS12]
CPU 0/KVM-453395 [001] 12461.840710: bprint: vmx_dump_01_02_12: ENTRY_INTR_INFO=0xd1 ENTRY_LEN=0 EXIT_REASON=0x1 EXIT_INTR_INFO=0x80000051 EXIT_INTR_ERR=0
CPU 0/KVM-453395 [001] 12461.840710: bprint: vmx_dump_01_02_12: IDT_VEC_INFO=0 IDT_VEC_ERR=0 EXIT_LEN=0x5 VMX_INSN_INFO=0
CPU 0/KVM-453395 [001] 12461.840710: bprint: vmx_dump_01_02_12: GUEST_INTR_STATUS=0 TPR_THRESH=0 VAPIC=0x116687000 PIDESC=0 PINV=0
CPU 0/KVM-453395 [001] 12461.840711: bprint: vmx_dump_01_02_12: EOI_EXIT_BITMAPS: [0]=0 [1]=0 [2]=0 [3]=0
CPU 0/KVM-453395 [001] 12461.840711: bprint: vmx_dump_01_02_12: PINBASED=0x3f CPU_BASED=0xb6206dfa CPU_BASED2=0x1010ae VM_ENTRY_CTRL=0x13ff VM_EXIT_CTRL=0x3efff
CPU 0/KVM-453395 [001] 12461.840711: bprint: vmx_dump_01_02_12: RIP=0xfffff8035aff73fc RSP=0xffffc18d6b729f58 RFLAGS=0x40246 INTR_STATE=0 ACTV=0
CPU 0/KVM-453395 [001] 12461.840711: bprint: vmx_dump_01_02_12: CR0=0x80050033 CR3=0x11912e000 CR4=0x352ef8 EPTP=0x1019aa01e
CPU 0/KVM-453395 [001] 12461.840711: bputs: vmx_dump_01_02_12: KVM/VMX: =====================================
The full VMCS dumps + kvm tracepoints near the first level triggered interrupt for kernel-irqchip are here: https://pastebin.com/2tnN6Xfy, and for split-irqchip are here: https://pastebin.com/K1BKZYvv
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-05 15:26 [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip Khushit Shah
@ 2025-09-08 9:05 ` Vitaly Kuznetsov
2025-09-08 11:19 ` Khushit Shah
From: Vitaly Kuznetsov @ 2025-09-08 9:05 UTC (permalink / raw)
To: Khushit Shah
Cc: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
Khushit Shah <khushit.shah@nutanix.com> writes:
[trimmed 'Cc' list a bit]
> [1.] One line summary:
> [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
>
> [2.] Problem/Report:
> When running Windows with Credential Guard enabled and with split-irqchip, level triggered interrupts are not properly forwarded to L2 (Credential Guard) by L1 (Windows), instead L1 EOIs the interrupt. Which leads to extremely slow Windows boot time. This issue is only seen on Intel + split-irqchip. Intel + kernel-irqchip, AMD + (kernel/split)-irqchip works fine.
>
> Qemu command used to create the vm:
> /usr/libexec/qemu-kvm \
> -machine q35,accel=kvm,smm=on,usb=off,acpi=on,kernel-irqchip=split \
> -cpu host,+vmx,+invpcid,+ssse3,+aes,+xsave,+xsaveopt,+xgetbv1,+xsaves,+rdtscp,+tsc-deadline \
Is there a specific reason to not enable any Hyper-V enlightenments for
your guest? For nested cases, features like Enlightened VMCS
('hv-evmcs'), 'hv-vapic', 'hv-apicv', ... can change Windows's behavior
a lot. I'd even suggest you start with 'hv-passthrough' to see if the
slowness goes away and if yes, then try to find the required set of
options you can use in your setup.
> -m 20G -smp 1 \
Single CPU Windows guests are always very slow, doubly so when running
nested.
...
--
Vitaly
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-08 9:05 ` Vitaly Kuznetsov
@ 2025-09-08 11:19 ` Khushit Shah
2025-09-08 11:42 ` Vitaly Kuznetsov
From: Khushit Shah @ 2025-09-08 11:19 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
Thank you for the comments, Vitaly!
> On 8 Sep 2025, at 2:35 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> Is there a specific reason to not enable any Hyper-V enlightenments for
> your guest? For nested cases, features like Enlightened VMCS
> ('hv-evmcs'), 'hv-vapic', 'hv-apicv', ... can change Windows's behavior
> a lot. I'd even suggest you start with 'hv-passthrough' to see if the
> slowness goes away and if yes, then try to find the required set of
> options you can use in your setup.
Actually, in production we expose an extensive set of CPU features to the guest, and the issue still persists.
The issue is also present with the following hv-* options:
hypervisor=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x2000,hv-vpindex=on,hv-runtime=on,hv-synic=on,
hv-stimer=on,hv-tlbflush=on,hv-ipi=on,hv-evmcs=on
>
> On 8 Sep 2025, at 2:35 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> Single CPU Windows guests are always very slow, doubly so when running nested.
The bug is reproducible even with more vCPUs (e.g. 4 or 8); we use 1 to reduce noise in the captured logs.
I should also mention that by slow boot we mean extremely slow (>3h).
Thanks,
Khushit
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-08 11:19 ` Khushit Shah
@ 2025-09-08 11:42 ` Vitaly Kuznetsov
2025-09-09 10:34 ` Khushit Shah
From: Vitaly Kuznetsov @ 2025-09-08 11:42 UTC (permalink / raw)
To: Khushit Shah
Cc: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
Khushit Shah <khushit.shah@nutanix.com> writes:
> Thank you for the comments, Vitaly!
>
>> On 8 Sep 2025, at 2:35 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>
>> Is there a specific reason to not enable any Hyper-V enlightenments for
>> your guest? For nested cases, features like Enlightened VMCS
>> ('hv-evmcs'), 'hv-vapic', 'hv-apicv', ... can change Windows's behavior
>> a lot. I'd even suggest you start with 'hv-passthrough' to see if the
>> slowness goes away and if yes, then try to find the required set of
>> options you can use in your setup.
>
>
> Actually in production we use an extensive set of cpu features exposed to the guest, still the issue persists,
> With the following hv-* options also the issue is present:
> hypervisor=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x2000,hv-vpindex=on,hv-runtime=on,hv-synic=on,
> hv-stimer=on,hv-tlbflush=on,hv-ipi=on,hv-evmcs=on
>
Try adding 'hv-apicv' (AKA 'hv-avic') to the list too (not to be
confused with 'hv-vapic').
>>
>> On 8 Sep 2025, at 2:35 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>
>> Single CPU Windows guests are always very slow, doubly so when running nested.
>
> The bug was reproducible even with more cpus like (4,8), we use 1 to reduce noise in captured logs.
>
> I should also mention by slow boot we mean extremely slow (>3h).
Oh, this is next level) Do you know if the issue reproduces with newer
Windows, e.g. 2025?
Also, I've just recalled I fixed (well, 'workarounded') an issue similar
to yours a while ago in QEMU:
commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Tue Apr 2 10:02:15 2019 +0200
ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
maybe something has changed and it doesn't work anymore?
--
Vitaly
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-08 11:42 ` Vitaly Kuznetsov
@ 2025-09-09 10:34 ` Khushit Shah
2025-09-10 8:34 ` Vitaly Kuznetsov
From: Khushit Shah @ 2025-09-09 10:34 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
Thank you for the pointers, Vitaly!
> On 8 Sep 2025, at 5:12 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> Thank you for the comments, Vitaly!
>>
>>> On 8 Sep 2025, at 2:35 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>>
>>> Is there a specific reason to not enable any Hyper-V enlightenments for
>>> your guest? For nested cases, features like Enlightened VMCS
>>> ('hv-evmcs'), 'hv-vapic', 'hv-apicv', ... can change Windows's behavior
>>> a lot. I'd even suggest you start with 'hv-passthrough' to see if the
>>> slowness goes away and if yes, then try to find the required set of
>>> options you can use in your setup.
>>
>>
>> Actually in production we use an extensive set of cpu features exposed to the guest, still the issue persists,
>> With the following hv-* options also the issue is present:
>> hypervisor=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x2000,hv-vpindex=on,hv-runtime=on,hv-synic=on,
>> hv-stimer=on,hv-tlbflush=on,hv-ipi=on,hv-evmcs=on
>>
>
> Try adding 'hv-apicv' (AKA 'hv-avic') to the list too (not to be
> confused with 'hv-vapic').
I tried with hv-apicv=on; the issue still persists.
> Oh, this is next level) Do you know if the issue reproduces with newer
> Windows, e.g. 2025?
The latest I can try on is Win11 24H2, I will try and update here.
> Also, I've just recalled I fixed (well, 'workarounded') an issue similar
> to yours a while ago in QEMU:
>
> commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
> Author: Vitaly Kuznetsov <vkuznets@redhat.com>
> Date: Tue Apr 2 10:02:15 2019 +0200
>
> ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
>
> maybe something has changed and it doesn't work anymore?
This is really interesting; we are facing a very similar issue, but the interrupt storm only occurs when using split-irqchip.
With kernel-irqchip, we do not even see consecutive level-triggered interrupts of the same vector. From the logs it is
clear that with kernel-irqchip, L1 somehow passes the interrupt to L2 to service, but with split-irqchip, L1 EOIs without
servicing the interrupt. As it works properly with kernel-irqchip, we can't really point to it as a Hyper-V issue. AFAIK,
the kernel-irqchip setting should be transparent to the guest; can you think of anything that could change this?
Regards,
Khushit
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-09 10:34 ` Khushit Shah
@ 2025-09-10 8:34 ` Vitaly Kuznetsov
2025-09-10 9:39 ` David Woodhouse
From: Vitaly Kuznetsov @ 2025-09-10 8:34 UTC (permalink / raw)
To: Khushit Shah
Cc: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
Khushit Shah <khushit.shah@nutanix.com> writes:
>> On 8 Sep 2025, at 5:12 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>
...
>> Also, I've just recalled I fixed (well, 'workarounded') an issue similar
>> to yours a while ago in QEMU:
>>
>> commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
>> Author: Vitaly Kuznetsov <vkuznets@redhat.com>
>> Date: Tue Apr 2 10:02:15 2019 +0200
>>
>> ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
>>
>> maybe something has changed and it doesn't work anymore?
>
> This is really interesting; we are facing a very similar issue, but the interrupt storm only occurs when using split-irqchip.
> With kernel-irqchip we do not even see consecutive level-triggered interrupts of the same vector. From the logs it is
> clear that with kernel-irqchip L1 passes the interrupt to L2 to service, but with split-irqchip L1 EOIs without
> servicing the interrupt. As it works properly with kernel-irqchip, we can't really call it a Hyper-V issue. AFAIK
> the kernel-irqchip setting should be transparent to the guest; can you think of anything that could change this?
The problem I've fixed back then was also only visible with split
irqchip. The reason was:
"""
in-kernel IOAPIC implementation has commit 184564efae4d ("kvm: ioapic:
conditionally delay irq delivery during eoi broadcast")
"""
so even though the guest cannot really distinguish between in-kernel and
split irqchips, the small differences in implementation can make a big
difference in the observed behavior. In case we re-assert improperly
handled level-triggered interrupt too fast, the guest is not able to
make much progress but if we let it execute for even the tiniest
fraction of time, then the forward progress happens.
I don't exactly know what happens in this particular case but I'd
suggest you try to artificially delay re-asserting level-triggered
interrupts and see what happens.
--
Vitaly
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-10 8:34 ` Vitaly Kuznetsov
@ 2025-09-10 9:39 ` David Woodhouse
2025-09-18 16:05 ` Khushit Shah
0 siblings, 1 reply; 8+ messages in thread
From: David Woodhouse @ 2025-09-10 9:39 UTC (permalink / raw)
To: Vitaly Kuznetsov, Khushit Shah
Cc: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Shaju Abraham
On Wed, 2025-09-10 at 10:34 +0200, Vitaly Kuznetsov wrote:
> Khushit Shah <khushit.shah@nutanix.com> writes:
>
> > > On 8 Sep 2025, at 5:12 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> > >
>
> ...
>
> > > Also, I've just recalled I fixed (well, 'workarounded') an issue
> > > similar
> > > to yours a while ago in QEMU:
> > >
> > > commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
> > > Author: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Date: Tue Apr 2 10:02:15 2019 +0200
> > >
> > > ioapic: allow buggy guests mishandling level-triggered
> > > interrupts to make progress
> > >
> > > maybe something has changed and it doesn't work anymore?
> >
> > This is really interesting; we are facing a very similar issue, but
> > the interrupt storm only occurs when using split-irqchip.
> > With kernel-irqchip we do not even see consecutive level-triggered
> > interrupts of the same vector. From the logs it is
> > clear that with kernel-irqchip L1 passes the interrupt to
> > L2 to service, but with split-irqchip L1 EOIs without
> > servicing the interrupt. As it works properly with kernel-
> > irqchip, we can't really call it a Hyper-V issue. AFAIK
> > the kernel-irqchip setting should be transparent to the guest;
> > can you think of anything that could change this?
>
> The problem I've fixed back then was also only visible with split
> irqchip. The reason was:
>
> """
> in-kernel IOAPIC implementation has commit 184564efae4d ("kvm:
> ioapic: conditionally delay irq delivery during eoi broadcast")
> """
>
> so even though the guest cannot really distinguish between in-kernel
> and
> split irqchips, the small differences in implementation can make a
> big
> difference in the observed behavior. In case we re-assert improperly
> handled level-triggered interrupt too fast, the guest is not able to
> make much progress but if we let it execute for even the tiniest
> fraction of time, then the forward progress happens.
>
> I don't exactly know what happens in this particular case but I'd
> suggest you try to artificially delay re-asserting level-triggered
> interrupts and see what happens.
We know that QEMU reasserts INTx interrupts too soon anyway.
The in-kernel irqchip will trigger the VFIO resamplefd when the
interrupt is EOI'd in the I/O APIC, as $DEITY intended.
QEMU, on the other hand, will unmap the device BARs when the interrupt
happens and intercept subsequent access, triggering the VFIO resamplefd
as soon as the next access happens — even before it's EOI'd.
Could that be making a difference here?
I guess, in theory, "too soon" probably shouldn't matter if it's all
handled correctly elsewhere — it should get masked again in the
hardware and the pending status tracked correctly until it's
redelivered to the guest(s). But it's probably worth testing, given
that's one of the big behavioural differences between kernel and
userspace I/O APIC?
It's somewhat non-trivial to fix it 'properly' across all of QEMU's
interrupt controllers and IRQ abstractions, but hacking something up
which does the right thing just for this x86 platform and I/O APIC and
avoids the current MMIO-unmapping abomination might be worth a test?
* Re: [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip
2025-09-10 9:39 ` David Woodhouse
@ 2025-09-18 16:05 ` Khushit Shah
0 siblings, 0 replies; 8+ messages in thread
From: Khushit Shah @ 2025-09-18 16:05 UTC (permalink / raw)
To: David Woodhouse, Vitaly Kuznetsov
Cc: Vitaly Kuznetsov, seanjc@google.com, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Shaju Abraham
--------------------------------------------------
RCA: KVM ignores Directed EOI bit in split-irqchip
--------------------------------------------------
We traced the issue to KVM not respecting the Directed EOI bit in the
LAPIC Spurious Interrupt Vector Register (APIC_SPIV, bit 12) when using
split-irqchip.
Per the x2APIC specification, when APIC_SPIV.DirectedEOI is set the
LAPIC must not broadcast EOIs to the IOAPIC. Instead, the guest is
responsible for issuing an IOAPIC EOI by writing to its EOI register.
How we confirmed the RCA:
We added a manual delay after the level-triggered interrupt EOI. Right
after the EOI, Windows performs VMRESUME to L2, injecting the same
vector; L2 services the interrupt and then writes the vector value to
0xFEC00040 (the IOAPIC EOI register).
Relevant logs (abridged):
qemu-kvm 169720 [043] 3975.550049: kvm:kvm_entry: vcpu 0, rip 0xfffff8052377167e
qemu-kvm 169710 [039] 3975.550064: kvm:kvm_set_irq: gsi 21 level 1 source 0
qemu-kvm 169710 [039] 3975.550065: kvm:kvm_msi_set_irq: dst 0 vec 161 (Fixed|physical|level)
qemu-kvm 169710 [039] 3975.550065: kvm:kvm_apic_accept_irq: apicid 0 vec 161 (Fixed|level)
qemu-kvm 169710 [039] 3975.550066: kvm:kvm_apicv_accept_irq: apicid 0 vec 161 (Fixed|level)
qemu-kvm 169720 [043] 3975.550067: kvm:kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff8052363aa34 info 0 0
qemu-kvm 169720 [043] 3975.550068: kvm:kvm_nested_vmexit: CAN'T FIND FIELD "rip"<CANT FIND FIELD exit_code>vcpu 0 reason EXTERNAL_INTERRUPT rip 0xfffff8052363aa34 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x800000f2 error_code 0x00000000
qemu-kvm 169720 [043] 3975.550069: kvm:kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 800000a1 int_info_err 0
qemu-kvm 169720 [043] 3975.550070: kvm:kvm_entry: vcpu 0, rip 0xfffff829b2a4142c
qemu-kvm 169720 [043] 3975.550072: kvm:kvm_exit: reason EOI_INDUCED rip 0xfffff829b2a85561 info a1 0
qemu-kvm 169720 [043] 3975.550072: kvm:kvm_eoi: apicid 0 vector 161
qemu-kvm 169720 [043] 3975.550073: kvm:kvm_entry: vcpu 0, rip 0xfffff829b2a85561
qemu-kvm 169720 [043] 3975.550075: kvm:kvm_exit: reason VMRESUME rip 0xfffff829b2a41308 info 0 0
qemu-kvm 169720 [043] 3975.550075: kvm:kvm_nested_vmenter: rip: 0xfffff829b2a41308 vmcs: 0x000000011b3bc000 nested_rip: 0xfffff8052363aa34 int_ctl: 0x00000000 event_inj: 0x800000a1 nested_ept=y nested_eptp: 0x00000001030a501e
=================================================== L2 Services the Interrupt ======================================
qemu-kvm 169720 [043] 3975.550123: kvm:kvm_nested_vmexit: CAN'T FIND FIELD "rip"<CANT FIND FIELD exit_code>vcpu 0 reason MSR_WRITE rip 0xfffff8052362d36c info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
qemu-kvm 169720 [043] 3975.550124: kvm:kvm_nested_vmexit_inject: reason MSR_WRITE info1 0 info2 0 int_info 0 int_info_err 0
qemu-kvm 169720 [043] 3975.550125: kvm:kvm_entry: vcpu 0, rip 0xfffff829b2a4142c
qemu-kvm 169720 [043] 3975.550127: kvm:kvm_exit: reason EPT_VIOLATION rip 0xfffff829b2b04b82 info d82 0
qemu-kvm 169720 [043] 3975.550127: kvm:kvm_page_fault: vcpu 0 rip 0xfffff829b2b04b82 address 0x00000000fec00040 error_code 0xd82
qemu-kvm 169720 [043] 3975.550130: kvm:kvm_emulate_insn: 0:fffff829b2b04b82: 89 48 40
qemu-kvm 169720 [043] 3975.550131: kvm:vcpu_match_mmio: gva 0xfffff827a2606040 gpa 0xfec00040 Write GPA
qemu-kvm 169720 [043] 3975.550131: kvm:kvm_mmio: mmio write len 4 gpa 0xfec00040 val 0xa1
qemu-kvm 169720 [043] 3975.550131: kvm:kvm_fpu: unload
qemu-kvm 169720 [043] 3975.550132: kvm:kvm_userspace_exit: reason KVM_EXIT_MMIO (6)
When APIC_SPIV.DirectedEOI is set, Windows expects that the LAPIC will
not EOI the IOAPIC. KVM, however, EOIs the IOAPIC from userspace while
the interrupt has not yet been serviced, so the line remains asserted
and the IOAPIC reinserts the interrupt. This loop continues and Windows
makes no progress.
Why this is Intel + split-irqchip only:
This is not seen on AMD with split-irqchip because Windows does not set
APIC_SPIV.DirectedEOI there. We do not see it with kernel-irqchip
because the Directed EOI capability is only advertised when the irqchip
is not in the kernel (ref: arch/x86/kvm/lapic.c:kvm_apic_set_version).
The in-kernel IOAPIC implementation has no EOI register (IOAPIC version
0x11), while QEMU's default IOAPIC implementation does have one (IOAPIC
version 0x20).
This is possibly also the actual RCA for
commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Tue Apr 2 10:02:15 2019 +0200
ioapic: allow buggy guests mishandling level-triggered interrupts to make progress
Patch:
https://patchwork.kernel.org/project/kvm/patch/20250918162529.640943-1-jon@nutanix.com/
Additional finding (potential separate bug)
In arch/x86/kvm/lapic.c:kvm_apic_set_version, Directed EOI support is
exposed to the guest whenever x2APIC is present and the IOAPIC is not
in-kernel:
	if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) &&
	    !ioapic_in_kernel(vcpu->kvm))
		v |= APIC_LVR_DIRECTED_EOI;
	kvm_lapic_set_reg(apic, APIC_LVR, v);
This assumes the userspace IOAPIC always supports Directed EOI. QEMU’s
IOAPIC version can be 0x11, which does not support Directed EOI. We
should track this separately.
Thanks for all the help!
Regards,
Khushit
Thread overview: 8+ messages
2025-09-05 15:26 [BUG] [KVM/VMX] Level triggered interrupts mishandled on Windows w/ nested virt(Credential Guard) when using split irqchip Khushit Shah
2025-09-08 9:05 ` Vitaly Kuznetsov
2025-09-08 11:19 ` Khushit Shah
2025-09-08 11:42 ` Vitaly Kuznetsov
2025-09-09 10:34 ` Khushit Shah
2025-09-10 8:34 ` Vitaly Kuznetsov
2025-09-10 9:39 ` David Woodhouse
2025-09-18 16:05 ` Khushit Shah