* More than 255 vcpus Windows VM setup without viommu ?
@ 2024-07-02  5:17 Sandesh Patel
  2024-07-02  9:04 ` David Woodhouse
  2024-09-28 14:59 ` David Woodhouse
  0 siblings, 2 replies; 17+ messages in thread
From: Sandesh Patel @ 2024-07-02 5:17 UTC (permalink / raw)
  To: qemu-devel@nongnu.org
  Cc: dwmw2@infradead.org, Rob Scheepens, Prerna Saxena, qemu-devel@nongnu.org

Hi All,

Is it possible to set up a large Windows VM (say 512 vcpus) without adding a
viommu (EIM=on, IR=on)? When I try to power on such a VM, the qemu process
crashes with this error:

```
qemu-kvm: ../accel/kvm/kvm-all.c:1837: kvm_irqchip_commit_routes: Assertion `ret == 0' failed
```

Stack trace:

```
#1  0x00007f484bc21ea5 abort (libc.so.6)
#2  0x00007f484bc21d79 __assert_fail_base.cold.0 (libc.so.6)
#3  0x00007f484bc47426 __assert_fail (libc.so.6)
#4  0x000055b7215634d3 kvm_irqchip_commit_routes (qemu-kvm)
#5  0x000055b7213bfc7e kvm_virtio_pci_vector_use_one (qemu-kvm)
#6  0x000055b7213c02cf virtio_pci_set_guest_notifiers (qemu-kvm)
#7  0x000055b7214dd848 vhost_scsi_common_start (qemu-kvm)
#8  0x000055b72139b936 vhost_user_scsi_start (qemu-kvm)
#9  0x000055b72139ba64 vhost_user_scsi_set_status (qemu-kvm)
#10 0x000055b7214f865a virtio_set_status (qemu-kvm)
#11 0x000055b7213bdc3f virtio_pci_common_write (qemu-kvm)
#12 0x000055b721514e68 memory_region_write_accessor (qemu-kvm)
#13 0x000055b72151489e access_with_adjusted_size (qemu-kvm)
#14 0x000055b721514b89 memory_region_dispatch_write (qemu-kvm)
#15 0x000055b72151e3fc flatview_write_continue (qemu-kvm)
#16 0x000055b72151e553 flatview_write (qemu-kvm)
#17 0x000055b72151ee76 address_space_write (qemu-kvm)
#18 0x000055b721565526 kvm_cpu_exec (qemu-kvm)
#19 0x000055b72156634d kvm_vcpu_thread_fn (qemu-kvm)
#20 0x000055b721750224 qemu_thread_start (qemu-kvm)
#21 0x00007f484c0081ca start_thread (libpthread.so.0)
#22 0x00007f484bc39e73
```

The error is due to an invalid MSI-X routing entry passed to KVM.

The VM boots fine if we attach a vIOMMU, but adding a vIOMMU can potentially
result in IO performance loss in the guest. I was interested to know if
someone could boot a large Windows VM by some other means, like
kvm-msi-ext-dest-id.

Overheads of a viommu have been shown, for example, in
https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf

Thanks and regards,
Sandesh

^ permalink raw reply [flat|nested] 17+ messages in thread
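For background on why 255 is the hard limit here: in the legacy (non-remapped)
x86 MSI address format the destination APIC ID occupies only eight bits, so
without interrupt remapping or an extended-destination-ID scheme an external
interrupt cannot be steered to a CPU with an APIC ID above 255. A minimal
sketch of that layout (illustrative only, not QEMU code):

```
#include <stdint.h>

/* Legacy-format MSI address: base 0xFEExxxxx with the destination APIC ID
 * in bits 19:12.  Only 8 bits are available, which is why >255 vCPUs need
 * interrupt remapping (or the 15-bit ext-dest-id extension discussed later
 * in this thread). */
static uint32_t legacy_msi_address(uint8_t dest_apic_id)
{
    return 0xfee00000u | ((uint32_t)dest_apic_id << 12);
}
```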
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-07-02 5:17 More than 255 vcpus Windows VM setup without viommu ? Sandesh Patel @ 2024-07-02 9:04 ` David Woodhouse 2024-07-03 16:01 ` Sandesh Patel 2024-09-28 14:59 ` David Woodhouse 1 sibling, 1 reply; 17+ messages in thread From: David Woodhouse @ 2024-07-02 9:04 UTC (permalink / raw) To: Sandesh Patel, qemu-devel@nongnu.org Cc: Rob Scheepens, Prerna Saxena, Dexuan Cui [-- Attachment #1: Type: text/plain, Size: 1396 bytes --] On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote: > Hi All, > Is it possible to setup a large Windows VM (say 512 vcpus) without > adding viommu (EIM=on, IR=on)? > When I try to power such VM, the qemu process crashes with error- > ``` > qemu-kvm: ../accel/kvm/kvm-all.c:1837: kvm_irqchip_commit_routes: Assertion `ret == 0’ failed > Interesting. What exactly has Windows *done* in those MSI entries? That might give a clue about how to support it. > > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can > potentially result in IO performance loss in guest. > I was interested to know if someone could boot a large Windows VM by > some other means like kvm-msi-ext-dest-id. I worked with Microsoft folks when I was defining the msi-ext-dest-id support, and Hyper-V does it exactly the same way. But that's on the *hypervisor* side. At the time, I don't believe Windows as a guest was planning to use it. But I actually thought Windows worked OK without being able to direct external interrupts to all vCPUs, so it didn't matter? > Overheads of viommu have been shown for example in - > https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20 > Forum%202021%20-%20v4.pdf Isn't that for DMA translation though? If you give the guest an intel_iommu with dma_translation=off then it should *only* do interrupt remapping. [-- Attachment #2: smime.p7s --] [-- Type: application/pkcs7-signature, Size: 5965 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
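For context on the kvm-msi-ext-dest-id mechanism mentioned above: it lets a
guest target APIC IDs up to 32767 without any vIOMMU by carrying the upper
destination bits in otherwise-unused low address bits, gated on the hypervisor
advertising the feature (KVM_FEATURE_MSI_EXT_DEST_ID). A rough sketch of the
encoding (based on the published ext-dest-id definition; illustrative, not
QEMU code):

```
#include <stdint.h>

/* 15-bit "extended destination ID" MSI encoding: destination bits 7:0 go in
 * the usual address bits 19:12, and destination bits 14:8 go in address
 * bits 11:5. */
static uint32_t ext_dest_id_msi_address(uint32_t apic_id)
{
    uint32_t addr = 0xfee00000u;
    addr |= (apic_id & 0xffu) << 12;        /* legacy destination field  */
    addr |= ((apic_id >> 8) & 0x7fu) << 5;  /* extended destination bits */
    return addr;
}
```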
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-07-02  9:04 ` David Woodhouse
@ 2024-07-03 16:01   ` Sandesh Patel
  2024-07-08  9:13     ` David Woodhouse
  0 siblings, 1 reply; 17+ messages in thread
From: Sandesh Patel @ 2024-07-03 16:01 UTC (permalink / raw)
  To: David Woodhouse
  Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui

[-- Attachment #1: Type: text/plain, Size: 2883 bytes --]

Thanks David for the response.

On 2 Jul 2024, at 2:34 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
>> Hi All,
>> Is it possible to setup a large Windows VM (say 512 vcpus) without
>> adding viommu (EIM=on, IR=on)?
>> When I try to power such VM, the qemu process crashes with error-
>> ```
>> qemu-kvm: ../accel/kvm/kvm-all.c:1837: kvm_irqchip_commit_routes: Assertion `ret == 0' failed
>
> Interesting. What exactly has Windows *done* in those MSI entries?
> That might give a clue about how to support it.

The KVM_SET_GSI_ROUTING ioctl calls the kvm_set_routing_entry() function in KVM:

int kvm_set_routing_entry(struct kvm *kvm,
                          struct kvm_kernel_irq_routing_entry *e,
                          const struct kvm_irq_routing_entry *ue)
{
	switch (ue->type) {
	case KVM_IRQ_ROUTING_MSI:
		e->set = kvm_set_msi;
		e->msi.address_lo = ue->u.msi.address_lo;
		e->msi.address_hi = ue->u.msi.address_hi;
		e->msi.data = ue->u.msi.data;

		if (kvm_msi_route_invalid(kvm, e))
			return -EINVAL;
		break;
	}
}

static inline bool kvm_msi_route_invalid(struct kvm *kvm,
		struct kvm_kernel_irq_routing_entry *e)
{
	return kvm->arch.x2apic_format && (e->msi.address_hi & 0xff);
}

That means msi.address_hi must have 0 in the last byte.

The QEMU function kvm_arch_fixup_msi_route() is responsible for fixing the
msi.address_hi value in the MSI routing entry that is passed to KVM. For one
of the entries, this function got msi.addr_hi: 0x0 as input when the iommu was
enabled, and msi.addr_hi: 0x1 when the viommu was not enabled. The same value
was returned in the output and saved as the routing entry.

>> The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
>> potentially result in IO performance loss in guest.
>> I was interested to know if someone could boot a large Windows VM by
>> some other means like kvm-msi-ext-dest-id.
>
> I worked with Microsoft folks when I was defining the msi-ext-dest-id
> support, and Hyper-V does it exactly the same way. But that's on the
> *hypervisor* side. At the time, I don't believe Windows as a guest was
> planning to use it.
>
> But I actually thought Windows worked OK without being able to direct
> external interrupts to all vCPUs, so it didn't matter?

I think not. Looks like there is a difference in approach between how Hyper-V
limits the irq delivery and how QEMU/KVM do it.

>> Overheads of viommu have been shown for example in -
>> https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf
>
> Isn't that for DMA translation though? If you give the guest an
> intel_iommu with dma_translation=off then it should *only* do interrupt
> remapping.

Thanks for the suggestion. It avoids DMA translations and hence no major
performance loss.

[-- Attachment #2: Type: text/html, Size: 4810 bytes --]

^ permalink raw reply [flat|nested] 17+ messages in thread
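In other words (a minimal illustrative sketch, not kernel or QEMU code): once
the x2APIC API format is enabled, any routing entry whose address_hi has a
non-zero low byte (such as the 0x1 seen here) is rejected with -EINVAL, and
that failing KVM_SET_GSI_ROUTING call is what trips the assertion in
kvm_irqchip_commit_routes().

```
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the check quoted above: with kvm->arch.x2apic_format set,
 * bits 7:0 of msi.address_hi must be zero or the route is invalid. */
static bool msi_route_rejected(uint32_t address_hi, bool x2apic_format)
{
    return x2apic_format && (address_hi & 0xff);
}

/* e.g. msi_route_rejected(0x1, true) == true  -> KVM returns -EINVAL
 *      msi_route_rejected(0x0, true) == false -> route accepted      */
```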
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-07-03 16:01 ` Sandesh Patel @ 2024-07-08 9:13 ` David Woodhouse 2024-07-11 7:26 ` David Woodhouse 0 siblings, 1 reply; 17+ messages in thread From: David Woodhouse @ 2024-07-08 9:13 UTC (permalink / raw) To: Sandesh Patel Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui [-- Attachment #1: Type: text/plain, Size: 2016 bytes --] On Wed, 2024-07-03 at 16:01 +0000, Sandesh Patel wrote: > > > > Interesting. What exactly has Windows *done* in those MSI entries? > > That > > might give a clue about how to support it. > > The KVM_SET_GSI_ROUTING ioctl calls kvm_set_routing_entry function in > kvm. > > int kvm_set_routing_entry(struct kvm *kvm, struct > kvm_kernel_irq_routing_entry *e, > const struct kvm_irq_routing_entry *ue) > { > > switch (ue->type) { > case KVM_IRQ_ROUTING_MSI: > e->set = kvm_set_msi; > e->msi.address_lo = ue->u.msi.address_lo; > e->msi.address_hi = ue->u.msi.address_hi; > e->msi.data = ue->u.msi.data; > > if (kvm_msi_route_invalid(kvm, e)) > return -EINVAL; > break; > } > } > > static inline bool kvm_msi_route_invalid(struct kvm *kvm, > struct kvm_kernel_irq_routing_entry *e) > { > return kvm->arch.x2apic_format && (e->msi.address_hi & 0xff); > } > > That means msi.address_hi must have 0 in the last byte. > > Qemu function kvm_arch_fixup_msi_route is responsible for > fixing msi.address_hi value in > msi routing entry that is passed to kvm. > This function got msi.addr_hi: 0x0 in input when iommu was enabled > and msi.addr_hi: 0x1 > when viommu was not enabled for one of the entry. The same value was > returned in the output. > and saved as routing entry. That's after QEMU has translated it though. Precisely which MSI is this, belonging to which device, and what exactly did Windows write to the MSI table? If *Windows* put anything into the high address bits, that isn't even targeted at the APIC, and it would be an attempt at writing to actual memory. I suspect that isn't the case. Can you make kvm_arch_fixup_msi_route() print the actual address and data directly from the guest? [-- Attachment #2: smime.p7s --] [-- Type: application/pkcs7-signature, Size: 5965 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
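A sketch of the kind of instrumentation being asked for here (a hypothetical
helper; the real change would go at the top of kvm_arch_fixup_msi_route() in
QEMU's target/i386/kvm/kvm.c and print the untranslated values the guest
wrote):

```
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: dump an MSI address/data pair exactly as programmed
 * by the guest, before any fixup or interrupt-remapping translation. */
static void dump_guest_msi(uint64_t address, uint32_t data)
{
    fprintf(stderr, "guest MSI: addr 0x%" PRIx64 " data 0x%" PRIx32 "\n",
            address, data);
}
```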
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-07-08 9:13 ` David Woodhouse @ 2024-07-11 7:26 ` David Woodhouse 2024-07-11 11:23 ` David Woodhouse 2024-08-01 10:28 ` Sandesh Patel 0 siblings, 2 replies; 17+ messages in thread From: David Woodhouse @ 2024-07-11 7:26 UTC (permalink / raw) To: Sandesh Patel Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui, Alexander Graf [-- Attachment #1: Type: text/plain, Size: 9613 bytes --] On Mon, 2024-07-08 at 10:13 +0100, David Woodhouse wrote: > On Wed, 2024-07-03 at 16:01 +0000, Sandesh Patel wrote: > > > > > > Interesting. What exactly has Windows *done* in those MSI entries? > > > That might give a clue about how to support it. I repeated my experiment with interrupt-only remapping (no DMA remapping). On two hosts: vendor_id : AuthenticAMD cpu family : 25 model : 17 model name : AMD EPYC 9R14 96-Core Processor vendor_id : GenuineIntel cpu family : 6 model : 143 model name : Intel(R) Xeon(R) Platinum 8488C I used identical command lines on both, and on each host I got the same result with all of '-cpu host', '-cpu EPYC' and -cpu Skylake-Server'. It's the *host* that makes the difference, not the CPUID presented to the guest. On the Intel host it boots: $ ./qemu-system-x86_64 -accel kvm,kernel-irqchip=split -cdrom ~/Win10_22H2_EnglishInternational_x64v1.iso -m 16G -M q35 -smp 2,cores=12,threads=2,maxcpus=288 -accel kvm,kernel-irqchip=split -device intel-iommu,intremap=on,dma-translation=off -cpu Skylake-Server --trace vtd_ir_\* --trace apic\* --trace kvm_irqchip\* qemu-system-x86_64: -accel kvm,kernel-irqchip=split: warning: Number of hotpluggable cpus requested (288) exceeds the recommended cpus supported by KVM (192) kvm_irqchip_add_msi_route dev N/A vector 0 virq 0 kvm_irqchip_add_msi_route dev N/A vector 0 virq 1 kvm_irqchip_add_msi_route dev N/A vector 0 virq 2 kvm_irqchip_add_msi_route dev N/A vector 0 virq 3 kvm_irqchip_add_msi_route dev N/A vector 0 virq 4 kvm_irqchip_add_msi_route dev N/A vector 0 virq 5 kvm_irqchip_add_msi_route dev N/A vector 0 virq 6 kvm_irqchip_add_msi_route dev N/A vector 0 virq 7 kvm_irqchip_add_msi_route dev N/A vector 0 virq 8 kvm_irqchip_add_msi_route dev N/A vector 0 virq 9 kvm_irqchip_add_msi_route dev N/A vector 0 virq 10 kvm_irqchip_add_msi_route dev N/A vector 0 virq 11 kvm_irqchip_add_msi_route dev N/A vector 0 virq 12 kvm_irqchip_add_msi_route dev N/A vector 0 virq 13 kvm_irqchip_add_msi_route dev N/A vector 0 virq 14 kvm_irqchip_add_msi_route dev N/A vector 0 virq 15 kvm_irqchip_add_msi_route dev N/A vector 0 virq 16 kvm_irqchip_add_msi_route dev N/A vector 0 virq 17 kvm_irqchip_add_msi_route dev N/A vector 0 virq 18 kvm_irqchip_add_msi_route dev N/A vector 0 virq 19 kvm_irqchip_add_msi_route dev N/A vector 0 virq 20 kvm_irqchip_add_msi_route dev N/A vector 0 virq 21 kvm_irqchip_add_msi_route dev N/A vector 0 virq 22 kvm_irqchip_add_msi_route dev N/A vector 0 virq 23 kvm_irqchip_commit_routes qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes 
kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes vtd_ir_enable enable 1 kvm_irqchip_commit_routes kvm_irqchip_commit_routes vtd_ir_remap_msi_req addr 0xfee00010 data 0xd1 sid 0xffff do_fault 0 vtd_ir_irte_get index 0 low 0x0 high 0x100d10005 vtd_ir_remap index 0 trigger 0 vector 209 deliver 0 dest 0x1 mode 1 vtd_ir_remap_type IOAPIC vtd_ir_remap_msi (addr 0xfee00010, data 0xd1) -> (addr 0xfee01004, data 0x40d1) kvm_irqchip_update_msi_route Updating MSI route virq=2 kvm_irqchip_commit_routes vtd_ir_remap_msi_req addr 0xfee00010 data 0xd1 sid 0xffff do_fault 0 vtd_ir_irte_get index 0 low 0x0 high 0x100d10005 vtd_ir_remap index 0 trigger 0 vector 209 deliver 0 dest 0x1 mode 1 vtd_ir_remap_type IOAPIC vtd_ir_remap_msi (addr 0xfee00010, data 0xd1) -> (addr 0xfee01004, data 0x40d1) kvm_irqchip_update_msi_route Updating MSI route virq=2 kvm_irqchip_commit_routes ... 
On the AMD host it stops at the boot splash screen: $ ./qemu-system-x86_64 -accel kvm,kernel-irqchip=split -cdrom ~/Win10_22H2_EnglishInternational_x64v1.iso -m 16G -M q35 -smp 2,cores=12,threads=2,maxcpus=288 -accel kvm,kernel-irqchip=split -device intel-iommu,intremap=on,dma-translation=off -cpu Skylake-Server --trace vtd_ir_\* --trace apic\* --trace kvm_irqchip\* qemu-system-x86_64: -accel kvm,kernel-irqchip=split: warning: Number of hotpluggable cpus requested (288) exceeds the recommended cpus supported by KVM (192) kvm_irqchip_add_msi_route dev N/A vector 0 virq 0 kvm_irqchip_add_msi_route dev N/A vector 0 virq 1 kvm_irqchip_add_msi_route dev N/A vector 0 virq 2 kvm_irqchip_add_msi_route dev N/A vector 0 virq 3 kvm_irqchip_add_msi_route dev N/A vector 0 virq 4 kvm_irqchip_add_msi_route dev N/A vector 0 virq 5 kvm_irqchip_add_msi_route dev N/A vector 0 virq 6 kvm_irqchip_add_msi_route dev N/A vector 0 virq 7 kvm_irqchip_add_msi_route dev N/A vector 0 virq 8 kvm_irqchip_add_msi_route dev N/A vector 0 virq 9 kvm_irqchip_add_msi_route dev N/A vector 0 virq 10 kvm_irqchip_add_msi_route dev N/A vector 0 virq 11 kvm_irqchip_add_msi_route dev N/A vector 0 virq 12 kvm_irqchip_add_msi_route dev N/A vector 0 virq 13 kvm_irqchip_add_msi_route dev N/A vector 0 virq 14 kvm_irqchip_add_msi_route dev N/A vector 0 virq 15 kvm_irqchip_add_msi_route dev N/A vector 0 virq 16 kvm_irqchip_add_msi_route dev N/A vector 0 virq 17 kvm_irqchip_add_msi_route dev N/A vector 0 virq 18 kvm_irqchip_add_msi_route dev N/A vector 0 virq 19 kvm_irqchip_add_msi_route dev N/A vector 0 virq 20 kvm_irqchip_add_msi_route dev N/A vector 0 virq 21 kvm_irqchip_add_msi_route dev N/A vector 0 virq 22 kvm_irqchip_add_msi_route dev N/A vector 0 virq 23 kvm_irqchip_commit_routes qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2). Please configure -smp options properly or try enabling topoext feature. 
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes vtd_ir_enable enable 1 kvm_irqchip_commit_routes kvm_irqchip_commit_routes vtd_ir_remap_msi_req addr 0xfee00004 data 0x0 sid 0xffff do_fault 0 vtd_ir_remap_msi (addr 0xfee00004, data 0x0) -> (addr 0xfee00004, data 0x0) kvm_irqchip_update_msi_route Updating MSI route virq=2 kvm_irqchip_commit_routes ^Cqemu: terminating on signal 2 It looks like Windows is putting something bogus into the IOAPIC RTE for irq2? In the successful boot that was 0xfee00010 / 0xd1; vector 209 on CPU0. [-- Attachment #2: smime.p7s --] [-- Type: application/pkcs7-signature, Size: 5965 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
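For reference, a decoder for the two I/OAPIC-generated messages in the traces
above (a sketch based on the VT-d MSI layouts; illustrative only, not QEMU
code):

```
#include <stdint.h>
#include <stdio.h>

/* Address bit 4 distinguishes compatibility format from remappable format.
 * In remappable format, bits 19:5 carry handle[14:0], bit 3 is the
 * subhandle-valid (SHV) flag and bit 2 carries handle[15]. */
static void decode_vtd_msi(uint32_t addr, uint32_t data)
{
    if (addr & (1u << 4)) {
        uint32_t handle = ((addr >> 5) & 0x7fff) | (((addr >> 2) & 1) << 15);
        printf("remappable: IRTE handle %u, SHV %u\n", handle, (addr >> 3) & 1);
    } else {
        printf("compat: dest 0x%x, vector 0x%x\n", (addr >> 12) & 0xff, data & 0xff);
    }
}

/* decode_vtd_msi(0xfee00010, 0xd1): remappable format, IRTE handle 0 (good case).
 * decode_vtd_msi(0xfee00004, 0x00): compat format, dest 0, vector 0 (the bogus RTE). */
```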
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-07-11  7:26 ` David Woodhouse
@ 2024-07-11 11:23   ` David Woodhouse
  2024-07-11 11:52     ` Sandesh Patel
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2024-07-11 11:23 UTC (permalink / raw)
  To: Sandesh Patel
  Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 2045 bytes --]

On Thu, 2024-07-11 at 08:26 +0100, David Woodhouse wrote:
>
> I used identical command lines on both, and on each host I got the same
> result with all of '-cpu host', '-cpu EPYC' and -cpu Skylake-Server'.
> It's the *host* that makes the difference, not the CPUID presented to
> the guest.

Actually... it turns out QEMU isn't really advertising the CPUID we ask
it to. Leaf zero still says 'AuthenticAMD' vs. 'GenuineIntel' according
to the *host* it's running on, regardless of the -cpu option.

And it is indeed *just* that which seems to trigger the Windows bug,
setting IRQ2 to point somewhere bogus:

vtd_ir_remap_msi_req addr 0xfee00004 data 0x0 sid 0xffff do_fault 0
vtd_ir_remap_msi (addr 0xfee00004, data 0x0) -> (addr 0xfee00004, data 0x0)
kvm_irqchip_update_msi_route Updating MSI route virq=2

While in the happy case it does use a remappable format MSI message.
Not direct to vector 209 on CPU0 as I said before; I think that's IRTE
entry #209 which maps to vector 209 on CPU1.

vtd_ir_remap_msi_req addr 0xfee00010 data 0xd1 sid 0xffff do_fault 0
vtd_ir_irte_get index 0 low 0x0 high 0x100d10005
vtd_ir_remap index 0 trigger 0 vector 209 deliver 0 dest 0x1 mode 1
vtd_ir_remap_type IOAPIC
vtd_ir_remap_msi (addr 0xfee00010, data 0xd1) -> (addr 0xfee01004, data 0x40d1)

So it looks like Windows doesn't actually cope with Intel IRQ remapping
when it sees an AMD CPU, which is suboptimal.

So to support >255 vCPUs on AMD without having to also do *DMA*
translation, either we need to come up with a trick like the "no
supported address widths" we use for dma-translation=off on Intel, or
we see if we can persuade Windows to use the 15-bit MSI support.

Looking at the Linux guest support, it seems to look just at the HyperV
CPUID leaves 0x40000081 and 0x40000082. QEMU knows of those only for
SYNDBG; Sandesh, do you want to try setting the
HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE bit that Linux looks for,
and see how that affects Windows guests (with no emulated IOMMU)?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-07-11 11:23 ` David Woodhouse @ 2024-07-11 11:52 ` Sandesh Patel 2024-07-16 5:13 ` Sandesh Patel 0 siblings, 1 reply; 17+ messages in thread From: Sandesh Patel @ 2024-07-11 11:52 UTC (permalink / raw) To: David Woodhouse Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui, Alexander Graf Thanks David for all the analysis. > On 11 Jul 2024, at 4:53 PM, David Woodhouse <dwmw2@infradead.org> wrote: > > On Thu, 2024-07-11 at 08:26 +0100, David Woodhouse wrote: >> >> I used identical command lines on both, and on each host I got the same >> result with all of '-cpu host', '-cpu EPYC' and -cpu Skylake-Server'. >> It's the *host* that makes the difference, not the CPUID presented to >> the guest. > > Actually... it turns out QEMU isn't really advertising the CPUID we ask > it to. Leaf zero still does say 'AuthenticAMD' vs. GenuineIntel' > according to the *host* it's running on, regardless of the -cpu option. > that undermines advertised cpu model. > And it is indeed *just* that which seems to trigger the Windows bug, > setting IRQ2 to point somewhere bogus: > > vtd_ir_remap_msi_req addr 0xfee00004 data 0x0 sid 0xffff do_fault 0 > vtd_ir_remap_msi (addr 0xfee00004, data 0x0) -> (addr 0xfee00004, data 0x0) > kvm_irqchip_update_msi_route Updating MSI route virq=2 > > While in the happy case it does use a remappable format MSI message. > Not direct to vector 209 on CPU0 as I said before; I think that's IRTE > entry #209 which maps to vector 209 on CPU1. > > vtd_ir_remap_msi_req addr 0xfee00010 data 0xd1 sid 0xffff do_fault 0 > vtd_ir_irte_get index 0 low 0x0 high 0x100d10005 > vtd_ir_remap index 0 trigger 0 vector 209 deliver 0 dest 0x1 mode 1 > vtd_ir_remap_type IOAPIC > vtd_ir_remap_msi (addr 0xfee00010, data 0xd1) -> (addr 0xfee01004, data 0x40d1) > > So it looks like Windows doesn't actually cope with Intel IRQ remapping > when it sees and AMD CPU, which is suboptimal. > Makes sense. > So to support >255 vCPUs on AMD without having to also do *DMA* > translation, either we need to come up with a trick like the "no > supported address widths" we use for dma-translation=off on Intel, or > we see if we can persuade Windows to use the 15-bit MSI support. > > > Looking at the Linux guest support, it seems to look just at the HyperV > CPUID leaves 0x40000081 and 0x40000082. QEMU knows of those only for > SYNDBG; Sandesh do you want to try setting the > HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE bit that Linux looks for, > and see how that affects Windows guests (with no emulated IOMMU)? > Sure I would try that, I would need some reading time however. Regards, Sandesh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-07-11 11:52 ` Sandesh Patel
@ 2024-07-16  5:13   ` Sandesh Patel
  2024-07-24  9:22     ` David Woodhouse
  0 siblings, 1 reply; 17+ messages in thread
From: Sandesh Patel @ 2024-07-16 5:13 UTC (permalink / raw)
  To: David Woodhouse
  Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 1990 bytes --]

On 11 Jul 2024, at 5:22 PM, Sandesh Patel <sandesh.patel@nutanix.com> wrote:
> Thanks David for all the analysis.
>
>> Looking at the Linux guest support, it seems to look just at the HyperV
>> CPUID leaves 0x40000081 and 0x40000082. QEMU knows of those only for
>> SYNDBG; Sandesh do you want to try setting the
>> HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE bit that Linux looks for,
>> and see how that affects Windows guests (with no emulated IOMMU)?

I am enabling the same bit (BIT(2)) under
HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES (0x40000082) with a simple kvm
patch (need to check how we switch to the 15-bit destination id when
enabling this):

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 04cca46fed1e..b9e74b791247 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2567,6 +2567,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 
 	case HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES:
 		ent->eax |= HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING;
+		ent->eax |= HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
 		break;
 
 	default:
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 1030b1b50552..384585a1f165 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -41,6 +41,7 @@
  * These are HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES.EAX bits.
  */
 #define HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING	BIT(1)
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE	BIT(2)
 
 /* Hyper-V Synthetic debug options MSR */
 #define HV_X64_MSR_SYNDBG_CONTROL		0x400000F1

I am hitting an issue where the Windows guest is not booting (guest reset in a
loop) when adding the hv-syndbg hyperv feature (or using hv-passthrough).
Possibly an occurrence of
https://patchew.org/QEMU/20230612084201.294248-1-vkuznets@redhat.com/
Anything special to take care of here?

Regards,
Sandesh

[-- Attachment #2: Type: text/html, Size: 4317 bytes --]

^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-07-16 5:13 ` Sandesh Patel @ 2024-07-24 9:22 ` David Woodhouse 0 siblings, 0 replies; 17+ messages in thread From: David Woodhouse @ 2024-07-24 9:22 UTC (permalink / raw) To: Sandesh Patel Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui, Alexander Graf [-- Attachment #1: Type: text/plain, Size: 2588 bytes --] On Tue, 2024-07-16 at 05:13 +0000, Sandesh Patel wrote: > > > > On 11 Jul 2024, at 5:22 PM, Sandesh Patel <sandesh.patel@nutanix.com> wrote: > > > > Thanks David for all the analysis. > > > > > > Looking at the Linux guest support, it seems to look just at the HyperV > > > CPUID leaves 0x40000081 and 0x40000082. QEMU knows of those only for > > > SYNDBG; Sandesh do you want to try setting the > > > HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE bit that Linux looks for, > > > and see how that affects Windows guests (with no emulated IOMMU)? > > > > > > > I am enabling same bit (BIT(2) under > HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES (0x40000082) with simple > kvm patch (need to check how do we switch to 15 bit destination id > when enabling this)- > > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c > index 04cca46fed1e..b9e74b791247 100644 > --- a/arch/x86/kvm/hyperv.c > +++ b/arch/x86/kvm/hyperv.c > @@ -2567,6 +2567,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, > > case HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES: > ent->eax |= HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING; > + ent->eax |= HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE; > break; > > default: > diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h > index 1030b1b50552..384585a1f165 100644 > --- a/arch/x86/kvm/hyperv.h > +++ b/arch/x86/kvm/hyperv.h > @@ -41,6 +41,7 @@ > * These are HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES.EAX bits. > */ > #define HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING BIT(1) > +#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE BIT(2) > > /* Hyper-V Synthetic debug options MSR */ > #define HV_X64_MSR_SYNDBG_CONTROL 0x400000F1 > > > I am hitting an issue where the Windows guest is not booting (guest > reset in loop) when adding hv-syndbg hyperv feature (or using hv- > passthrough). Possibly an occurrence of - > https://patchew.org/QEMU/20230612084201.294248-1-vkuznets@redhat.com > / > Anything special to take care here? As a simple test, have you tried just *not* setting the ALLOW_KERNEL_DEBUGGING bit? Just comment that line out? So we're only setting the magic value in 0x40000081 and then the extended I/OAPIC RTE bit in 0x40000082? Although we've heard separately that this *isn't* implemented in Windows, so we guess it isn't going to work. [-- Attachment #2: smime.p7s --] [-- Type: application/pkcs7-signature, Size: 5965 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-07-11 7:26 ` David Woodhouse 2024-07-11 11:23 ` David Woodhouse @ 2024-08-01 10:28 ` Sandesh Patel 1 sibling, 0 replies; 17+ messages in thread From: Sandesh Patel @ 2024-08-01 10:28 UTC (permalink / raw) To: David Woodhouse Cc: qemu-devel@nongnu.org, Rob Scheepens, Prerna Saxena, Dexuan Cui, Alexander Graf > On 11 Jul 2024, at 12:56 PM, David Woodhouse <dwmw2@infradead.org> wrote: > > On Mon, 2024-07-08 at 10:13 +0100, David Woodhouse wrote: >> On Wed, 2024-07-03 at 16:01 +0000, Sandesh Patel wrote: >>>> >>>> Interesting. What exactly has Windows *done* in those MSI entries? >>>> That might give a clue about how to support it. > > I repeated my experiment with interrupt-only remapping (no DMA > remapping). On two hosts: > > > > vendor_id : AuthenticAMD > cpu family : 25 > model : 17 > model name : AMD EPYC 9R14 96-Core Processor > > vendor_id : GenuineIntel > cpu family : 6 > model : 143 > model name : Intel(R) Xeon(R) Platinum 8488C > > I used identical command lines on both, and on each host I got the same > result with all of '-cpu host', '-cpu EPYC' and -cpu Skylake-Server'. > It's the *host* that makes the difference, not the CPUID presented to > the guest. > > On the Intel host it boots: > > $ ./qemu-system-x86_64 -accel kvm,kernel-irqchip=split -cdrom ~/Win10_22H2_EnglishInternational_x64v1.iso -m 16G -M q35 -smp 2,cores=12,threads=2,maxcpus=288 -accel kvm,kernel-irqchip=split -device intel-iommu,intremap=on,dma-translation=off -cpu Skylake-Server --trace vtd_ir_\* --trace apic\* --trace kvm_irqchip\* > qemu-system-x86_64: -accel kvm,kernel-irqchip=split: warning: Number of hotpluggable cpus requested (288) exceeds the recommended cpus supported by KVM (192) > kvm_irqchip_add_msi_route dev N/A vector 0 virq 0 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 1 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 2 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 3 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 4 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 5 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 6 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 7 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 8 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 9 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 10 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 11 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 12 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 13 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 14 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 15 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 16 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 17 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 18 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 19 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 20 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 21 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 22 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 23 > kvm_irqchip_commit_routes > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > 
kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > vtd_ir_enable enable 1 > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > vtd_ir_remap_msi_req addr 0xfee00010 data 0xd1 sid 0xffff do_fault 0 > vtd_ir_irte_get index 0 low 0x0 high 0x100d10005 > vtd_ir_remap index 0 trigger 0 vector 209 deliver 0 dest 0x1 mode 1 > vtd_ir_remap_type IOAPIC > vtd_ir_remap_msi (addr 0xfee00010, data 0xd1) -> (addr 0xfee01004, data 0x40d1) > kvm_irqchip_update_msi_route Updating MSI route virq=2 > kvm_irqchip_commit_routes > vtd_ir_remap_msi_req addr 0xfee00010 data 0xd1 sid 0xffff do_fault 0 > vtd_ir_irte_get index 0 low 0x0 high 0x100d10005 > vtd_ir_remap index 0 trigger 0 vector 209 deliver 0 dest 0x1 mode 1 > vtd_ir_remap_type IOAPIC > vtd_ir_remap_msi (addr 0xfee00010, data 0xd1) -> (addr 0xfee01004, data 0x40d1) > kvm_irqchip_update_msi_route Updating MSI route virq=2 > kvm_irqchip_commit_routes > ... 
> > > On the AMD host it stops at the boot splash screen: > > $ ./qemu-system-x86_64 -accel kvm,kernel-irqchip=split -cdrom ~/Win10_22H2_EnglishInternational_x64v1.iso -m 16G -M q35 -smp 2,cores=12,threads=2,maxcpus=288 -accel kvm,kernel-irqchip=split -device intel-iommu,intremap=on,dma-translation=off -cpu Skylake-Server --trace vtd_ir_\* --trace apic\* --trace kvm_irqchip\* > qemu-system-x86_64: -accel kvm,kernel-irqchip=split: warning: Number of hotpluggable cpus requested (288) exceeds the recommended cpus supported by KVM (192) > kvm_irqchip_add_msi_route dev N/A vector 0 virq 0 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 1 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 2 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 3 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 4 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 5 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 6 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 7 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 8 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 9 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 10 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 11 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 12 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 13 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 14 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 15 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 16 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 17 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 18 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 19 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 20 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 21 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 22 > kvm_irqchip_add_msi_route dev N/A vector 0 virq 23 > kvm_irqchip_commit_routes > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] > qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2). Please configure -smp options properly or try enabling topoext feature. 
> qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.hle [bit 4] > qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.rtm [bit 11] > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > vtd_ir_enable enable 1 > kvm_irqchip_commit_routes > kvm_irqchip_commit_routes > vtd_ir_remap_msi_req addr 0xfee00004 data 0x0 sid 0xffff do_fault 0 > vtd_ir_remap_msi (addr 0xfee00004, data 0x0) -> (addr 0xfee00004, data 0x0) > kvm_irqchip_update_msi_route Updating MSI route virq=2 > kvm_irqchip_commit_routes > ^Cqemu: terminating on signal 2 > > > It looks like Windows is putting something bogus into the IOAPIC RTE for irq2? > In the successful boot that was 0xfee00010 / 0xd1; vector 209 on CPU0. > Hi David, Do Windows VMs boot with amd-iommu in you setup? It would be suboptimal in performance but is it able to boot? In my setup, a 360 vcpu Windows VM boot to splash screen and get hung there. 
$ qemu-system-x86_64 -vnc :2 --trace vtd_ir_\* --trace apic\* --trace kvm_irqchip\* -m 8G -cpu EPYC-Rome -smp 6,cores=60,maxcpus=360 -enable-kvm -machine q35,kernel_irqchip=split -device amd-iommu w25.qcow2 kvm_irqchip_add_msi_route dev N/A vector 0 virq 0 kvm_irqchip_add_msi_route dev N/A vector 0 virq 1 kvm_irqchip_add_msi_route dev N/A vector 0 virq 2 kvm_irqchip_add_msi_route dev N/A vector 0 virq 3 kvm_irqchip_add_msi_route dev N/A vector 0 virq 4 kvm_irqchip_add_msi_route dev N/A vector 0 virq 5 kvm_irqchip_add_msi_route dev N/A vector 0 virq 6 kvm_irqchip_add_msi_route dev N/A vector 0 virq 7 kvm_irqchip_add_msi_route dev N/A vector 0 virq 8 kvm_irqchip_add_msi_route dev N/A vector 0 virq 9 kvm_irqchip_add_msi_route dev N/A vector 0 virq 10 kvm_irqchip_add_msi_route dev N/A vector 0 virq 11 kvm_irqchip_add_msi_route dev N/A vector 0 virq 12 kvm_irqchip_add_msi_route dev N/A vector 0 virq 13 kvm_irqchip_add_msi_route dev N/A vector 0 virq 14 kvm_irqchip_add_msi_route dev N/A vector 0 virq 15 kvm_irqchip_add_msi_route dev N/A vector 0 virq 16 kvm_irqchip_add_msi_route dev N/A vector 0 virq 17 kvm_irqchip_add_msi_route dev N/A vector 0 virq 18 kvm_irqchip_add_msi_route dev N/A vector 0 virq 19 kvm_irqchip_add_msi_route dev N/A vector 0 virq 20 kvm_irqchip_add_msi_route dev N/A vector 0 virq 21 kvm_irqchip_add_msi_route dev N/A vector 0 virq 22 kvm_irqchip_add_msi_route dev N/A vector 0 virq 23 kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes kvm_irqchip_commit_routes Regards, Sandesh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-07-02  5:17 More than 255 vcpus Windows VM setup without viommu ? Sandesh Patel
  2024-07-02  9:04 ` David Woodhouse
@ 2024-09-28 14:59 ` David Woodhouse
  2024-09-30 15:50   ` David Woodhouse
  2024-10-01 13:33   ` Daniel P. Berrangé
  1 sibling, 2 replies; 17+ messages in thread
From: David Woodhouse @ 2024-09-28 14:59 UTC (permalink / raw)
  To: Sandesh Patel, qemu-devel@nongnu.org
  Cc: Rob Scheepens, Prerna Saxena, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 1492 bytes --]

On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
>
> The error is due to invalid MSIX routing entry passed to KVM.
>
> The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> potentially result in IO performance loss in guest.
> I was interested to know if someone could boot a large Windows VM by
> some other means like kvm-msi-ext-dest-id.

I think I may (with Alex Graf's suggestion) have found the Windows bug
with Intel IOMMU.

It looks like when interrupt remapping is enabled with an AMD CPU,
Windows *assumes* it can generate AMD-style MSI messages even if the
IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
remapping to make it interpret an AMD-style message, Windows seems to
boot at least a little bit further than it did before...

--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3550,9 +3550,14 @@ static int vtd_interrupt_remap_msi(IntelIOMMUState *iommu,
 
     /* This is compatible mode. */
     if (addr.addr.int_mode != VTD_IR_INT_FORMAT_REMAP) {
-        memcpy(translated, origin, sizeof(*origin));
-        goto out;
-    }
+        if (0) {
+            memcpy(translated, origin, sizeof(*origin));
+            goto out;
+        }
+        /* Pretend it's an AMD-format remappable MSI (Yay Windows!) */
+        index = origin->data & 0x7ff;
+        printf("Compat mode index 0x%x\n", index);
+    } else
 
     index = addr.addr.index_h << 15 | addr.addr.index_l;

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-09-28 14:59 ` David Woodhouse
@ 2024-09-30 15:50   ` David Woodhouse
  2024-10-02 11:33     ` Igor Mammedov
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2024-09-30 15:50 UTC (permalink / raw)
  To: Sandesh Patel, qemu-devel@nongnu.org, paul
  Cc: Rob Scheepens, Prerna Saxena, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 4240 bytes --]

On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote:
> On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
> >
> > The error is due to invalid MSIX routing entry passed to KVM.
> >
> > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> > potentially result in IO performance loss in guest.
> > I was interested to know if someone could boot a large Windows VM by
> > some other means like kvm-msi-ext-dest-id.
>
> I think I may (with Alex Graf's suggestion) have found the Windows bug
> with Intel IOMMU.
>
> It looks like when interrupt remapping is enabled with an AMD CPU,
> Windows *assumes* it can generate AMD-style MSI messages even if the
> IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> remapping to make it interpret an AMD-style message, Windows seems to
> boot at least a little bit further than it did before...

Sadly, Windows has *more* bugs than that.

The previous hack extracted the Interrupt Remapping Table Entry (IRTE)
index from an AMD-style MSI message, and looked it up in the Intel
IOMMU's IR Table.

That works... for the MSIs generated by the I/O APIC.

However... in the Intel IOMMU model, there is a single global IRT, and
each entry specifies which devices are permitted to invoke it. The AMD
model is slightly nicer, in that it allows a per-device IRT.

So for a PCI device, Windows just seems to configure each MSI vector in
order, with IRTE#0, 1, onwards. Because it's a per-device number space,
right? Which means that the first MSI vector on a PCI device gets
aliased to IRQ#0 on the I/O APIC.

I dumped the whole IRT, and it isn't just that Windows is using the
wrong index; it hasn't even set up the correct destination in *any* of
the entries. So we can't even do a nasty trick like scanning and
finding the Nth entry which is valid for a particular source-id.

Happily, Windows has *more* bugs than that... if I run with
`-cpu host,+hv-avic' then it puts the high bits of the target APIC ID
into the high bits of the MSI address. This *ought* to mean that MSIs
from devices miss the APIC (at 0x00000000FEExxxxx) and scribble over
guest memory at addresses like 0x1FEE00004. But we can add yet
*another* hack to catch that. For now I just hacked it to move the low
7 extra bits into the "right" place for the 15-bit extension.

--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -361,6 +361,14 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
         return;
     }
     attrs.requester_id = pci_requester_id(dev);
+    printf("Send MSI 0x%lx/0x%x from 0x%x\n", msg.address, msg.data, attrs.requester_id);
+    if (msg.address >> 32) {
+        uint64_t ext_id = msg.address >> 32;
+        msg.address &= 0xffffffff;
+        msg.address |= ext_id << 5;
+        printf("Now 0x%lx/0x%x with ext_id %lx\n", msg.address, msg.data, ext_id);
+    }
+
     address_space_stl_le(&dev->bus_master_as, msg.address, msg.data,
                          attrs, NULL);
 }

We also need to stop forcing Windows to use logical mode, and force it
to use physical mode instead:

--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o,
      * used
      */
     ((ms->smp.max_cpus > 8) ?
-        (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0),
+        (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0),
     .int_model = 1 /* Multiple APIC */,
     .rtc_century = RTC_CENTURY,
     .plvl2_lat = 0xfff /* C2 state not supported */,

So now, with *no* IOMMU configured, Windows Server 2022 is booting and
using CPUs > 255:

Send MSI 0x1fee01000/0x41b0 from 0xfa
Now 0xfee01020/0x41b0 with ext_id 1

That trick obviously can't work for the I/O APIC, but I haven't managed
to persuade Windows to target I/O APIC interrupts at any CPU other than
#0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to
test.

It may be that we need to advertise an Intel IOMMU that *only* has the
I/O APIC behind it, and all the actual PCI devices are direct, so we
can abuse that last Windows bug.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply [flat|nested] 17+ messages in thread
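To spell out the traced example above (a worked decode, assuming the 15-bit
encoding sketched earlier in this thread): the guest wrote address
0x1fee01000, i.e. ext_id = 0x1 above bit 32 and 0x01 in the legacy destination
field (bits 19:12), so the intended target is APIC ID (0x1 << 8) | 0x01 = 257.
After the hack shifts ext_id into bits 11:5, the address becomes 0xfee01020,
which the 15-bit ext-dest-id scheme decodes to the same destination, 257, with
vector 0xb0 taken from the data word 0x41b0.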
* Re: More than 255 vcpus Windows VM setup without viommu ? 2024-09-30 15:50 ` David Woodhouse @ 2024-10-02 11:33 ` Igor Mammedov 2024-10-02 15:30 ` David Woodhouse 0 siblings, 1 reply; 17+ messages in thread From: Igor Mammedov @ 2024-10-02 11:33 UTC (permalink / raw) To: David Woodhouse Cc: Sandesh Patel, qemu-devel@nongnu.org, paul, Rob Scheepens, Prerna Saxena, Alexander Graf On Mon, 30 Sep 2024 16:50:21 +0100 David Woodhouse <dwmw2@infradead.org> wrote: > On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote: > > On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote: > > > > > > The error is due to invalid MSIX routing entry passed to KVM. > > > > > > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can > > > potentially result in IO performance loss in guest. > > > I was interested to know if someone could boot a large Windows VM by > > > some other means like kvm-msi-ext-dest-id. > > > > I think I may (with Alex Graf's suggestion) have found the Windows bug > > with Intel IOMMU. > > > > It looks like when interrupt remapping is enabled with an AMD CPU, > > Windows *assumes* it can generate AMD-style MSI messages even if the > > IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt > > remapping to make it interpret an AMD-style message, Windows seems to > > boot at least a little bit further than it did before... > > Sadly, Windows has *more* bugs than that. > > The previous hack extracted the Interrupt Remapping Table Entry (IRTE) > index from an AMD-style MSI message, and looked it up in the Intel > IOMMU's IR Table. > > That works... for the MSIs generated by the I/O APIC. > > However... in the Intel IOMMU model, there is a single global IRT, and > each entry specifies which devices are permitted to invoke it. The AMD > model is slightly nicer, in that it allows a per-device IRT. > > So for a PCI device, Windows just seems to configure each MSI vector in > order, with IRTE#0, 1, onwards. Because it's a per-device number space, > right? Which means that first MSI vector on a PCI device gets aliased > to IRQ#0 on the I/O APIC. > > I dumped the whole IRT, and it isn't just that Windows is using the > wrong index; it hasn't even set up the correct destination in *any* of > the entries. So we can't even do a nasty trick like scanning and > funding the Nth entry which is valid for a particular source-id. > > Happily, Windows has *more* bugs than that... if I run with > `-cpu host,+hv-avic' then it puts the high bits of the target APIC ID > into the high bits of the MSI address. This *ought* to mean that MSIs > from device miss the APIC (at 0x00000000FEExxxxx) and scribble over > guest memory at addresses like 0x1FEE00004. But we can add yet > *another* hack to catch that. For now I just hacked it to move the low > 7 extra bits in to the "right" place for the 15-bit extension. 
> > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -361,6 +361,14 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg) > return; > } > attrs.requester_id = pci_requester_id(dev); > + printf("Send MSI 0x%lx/0x%x from 0x%x\n", msg.address, msg.data, attrs.requester_id); > + if (msg.address >> 32) { > + uint64_t ext_id = msg.address >> 32; > + msg.address &= 0xffffffff; > + msg.address |= ext_id << 5; > + printf("Now 0x%lx/0x%x with ext_id %lx\n", msg.address, msg.data, ext_id); > + } > + > address_space_stl_le(&dev->bus_master_as, msg.address, msg.data, > attrs, NULL); > } > > We also need to stop forcing Windows to use logical mode, and force it > to use physical mode instead: > > --- a/hw/i386/acpi-build.c > +++ b/hw/i386/acpi-build.c > @@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o, > * used > */ > ((ms->smp.max_cpus > 8) ? > - (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0), > + (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0), > .int_model = 1 /* Multiple APIC */, > .rtc_century = RTC_CENTURY, > .plvl2_lat = 0xfff /* C2 state not supported */, > > > So now, with *no* IOMMU configured, Windows Server 2022 is booting and > using CPUs > 255: > Send MSI 0x1fee01000/0x41b0 from 0xfa > Now 0xfee01020/0x41b0 with ext_id 1 > > That trick obviously can't work the the I/O APIC, but I haven't managed > to persuade Windows to target I/O APIC interrupts at any CPU other than > #0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to > test. > > It may be that we need to advertise an Intel IOMMU that *only* has the > I/O APIC behind it, and all the actual PCI devices are direct, so we > can abuse that last Windows bug. It's interesting as an experiment, to prove that Windows is riddled with bugs. (well, and it could serve as starting point to report issue to MS) But I'd rather Microsoft fix bugs on their side, instead of putting hacks in QEMU. PS: Given it's AMD cpu, I doubt very much that using intel_iommu would be accepted by Microsoft as valid complaint though. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-10-02 11:33 ` Igor Mammedov
@ 2024-10-02 15:30   ` David Woodhouse
  0 siblings, 0 replies; 17+ messages in thread
From: David Woodhouse @ 2024-10-02 15:30 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Sandesh Patel, qemu-devel@nongnu.org, paul, Rob Scheepens, Prerna Saxena, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 1603 bytes --]

On Wed, 2024-10-02 at 13:33 +0200, Igor Mammedov wrote:
>
> It's interesting as an experiment, to prove that Windows is riddled with bugs.
> (well, and it could serve as starting point to report issue to MS)
> But I'd rather Microsoft fix bugs on their side, instead of putting hacks in
> QEMU.

Absolutely. I would much prefer Microsoft to fix the bugs, and to
support the 15-bit destination ID enlightenment that KVM, Xen and even
Hyper-V all define — instead of randomly putting high bits into the
address which ought to cause it to miss the APIC and scribble over
memory. The 15-bit extension supports I/O APIC and HPET interrupts too.

But I'd like to at least understand the current behaviour and whether
there's anything we can do to work around it.

> PS:
> Given it's AMD cpu, I doubt very much that using intel_iommu would be
> accepted by Microsoft as valid complaint though.

Well, that argument only makes a little bit more sense than refusing to
support an Intel NIC with an AMD CPU. The IOMMU just isn't that tied to
the CPU ID.

But hey, it's Microsoft. However egregious their deviations from both
standards and from common sense, they usually like to claim it's
"Working as Designed". But to actually *initialise* the Intel IOMMU and
put it into remapping mode, and then to send MSIs formatted for an AMD
IOMMU which wasn't actually present in the system, would be a new low
even for Microsoft.

At *best* they could make a tenuous argument for not supporting the
Intel NIC (sorry, I mean the Intel IOMMU) at *all* when running on an
AMD CPU.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-09-28 14:59   ` David Woodhouse
  2024-09-30 15:50     ` David Woodhouse
@ 2024-10-01 13:33     ` Daniel P. Berrangé
  2024-10-01 16:37       ` David Woodhouse
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel P. Berrangé @ 2024-10-01 13:33 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Sandesh Patel, qemu-devel@nongnu.org, Rob Scheepens,
	Prerna Saxena, Alexander Graf

On Sat, Sep 28, 2024 at 03:59:32PM +0100, David Woodhouse wrote:
> On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
> > 
> > The error is due to invalid MSIX routing entry passed to KVM.
> > 
> > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> > potentially result in IO performance loss in guest.
> > I was interested to know if someone could boot a large Windows VM by
> > some other means like kvm-msi-ext-dest-id.
> 
> I think I may (with Alex Graf's suggestion) have found the Windows bug
> with Intel IOMMU.
> 
> It looks like when interrupt remapping is enabled with an AMD CPU,
> Windows *assumes* it can generate AMD-style MSI messages even if the
> IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> remapping to make it interpret an AMD-style message, Windows seems to
> boot at least a little bit further than it did before...

Rather than filling the intel IOMMU impl with hacks to make Windows
boot on AMD virtualized CPUs, shouldn't we steer people to use the
amd-iommu that QEMU already ships [1]?

Even if we hack the Intel IOMMU so that current Windows boots, can we
have confidence that future Windows releases will boot correctly on an
Intel IOMMU with virtualized AMD CPUs?

> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3550,9 +3550,14 @@ static int vtd_interrupt_remap_msi(IntelIOMMUState *iommu,
>  
>      /* This is compatible mode. */
>      if (addr.addr.int_mode != VTD_IR_INT_FORMAT_REMAP) {
> -        memcpy(translated, origin, sizeof(*origin));
> -        goto out;
> -    }
> +        if (0) {
> +            memcpy(translated, origin, sizeof(*origin));
> +            goto out;
> +        }
> +        /* Pretend it's an AMD-format remappable MSI (Yay Windows!) */
> +        index = origin->data & 0x7ff;
> +        printf("Compat mode index 0x%x\n", index);
> +    } else
>  
>      index = addr.addr.index_h << 15 | addr.addr.index_l;

With regards,
Daniel

[1] The AMD IOMMU is not perfect, because currently it has a significant
    QEMU impl flaw in that it secretly creates an extra PCI device behind
    the scenes. This makes it impossible for libvirt to manage the PCI
    resources from the AMD IOMMU. I feel like this ought to be solvable
    though, as it is just a QEMU impl decision that can be corrected.

-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 17+ messages in thread
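[As a reading aid for the quoted vtd_interrupt_remap_msi() hack: the
sketch below restates, outside of QEMU, what that hack amounts to. The
helper name and types are invented; the bit layouts are the ones
discussed later in this thread.]

```
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only (not QEMU code). An Intel Remappable Format MSI carries
 * the IRTE handle in the address (bit 4 set, handle[14:0] in bits 19:5,
 * handle[15] in bit 2); an AMD-style remappable MSI carries the IRTE
 * index in the low 11 bits of the *data*. The Compatibility Format
 * branch below is the equivalent of the "if (0)" hack above: assume the
 * guest (buggy Windows on an AMD vCPU) meant an AMD-style message.
 */
static uint32_t guess_irte_index(uint64_t msi_addr, uint32_t msi_data)
{
    bool intel_remappable = (msi_addr >> 4) & 1;

    if (intel_remappable) {
        uint32_t index_l = (msi_addr >> 5) & 0x7fff;    /* handle[14:0] */
        uint32_t index_h = (msi_addr >> 2) & 0x1;       /* handle[15]   */
        return (index_h << 15) | index_l;
    }

    /* Compatibility-format address: treat the low 11 data bits as an
     * AMD-style IRTE index. */
    return msi_data & 0x7ff;
}
```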
* Re: More than 255 vcpus Windows VM setup without viommu ?
  2024-10-01 13:33     ` Daniel P. Berrangé
@ 2024-10-01 16:37       ` David Woodhouse
  0 siblings, 0 replies; 17+ messages in thread
From: David Woodhouse @ 2024-10-01 16:37 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Sandesh Patel, qemu-devel@nongnu.org, Rob Scheepens,
	Prerna Saxena, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 25542 bytes --]

On Tue, 2024-10-01 at 14:33 +0100, Daniel P. Berrangé wrote:
> 
> > It looks like when interrupt remapping is enabled with an AMD CPU,
> > Windows *assumes* it can generate AMD-style MSI messages even if the
> > IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> > remapping to make it interpret an AMD-style message, Windows seems to
> > boot at least a little bit further than it did before...
> 
> Rather than filling the intel IOMMU impl with hacks to make Windows
> boot on AMD virtualized CPUs, shouldn't we steer people to use the
> amd-iommu that QEMU already ships [1]?

No, because there's no way to disable *DMA* translation on that. We
absolutely don't want to offer guests another level of DMA translation
under their control, because of the performance and security
implications.

The way we implement 'dma-translation=off' for the Intel IOMMU is a bit
of a hack, disabling all three of the SAGAW bits which advertise support
for 3-level, 4-level or 5-level page tables and thus leaving the guest
without *any* workable DMA page table setup. (I have asked Intel to
officially bless this trick, FWIW.)

Linux *used* to panic when it saw this, but I fixed it when I added the
'dma-translation=off' support to QEMU. Windows always just quietly
refrained from using such an IOMMU for DMA translation, while still
using it for Interrupt Remapping. Which was the point.

> Even if we hack the Intel IOMMU so that current Windows boots, can we
> have confidence that future Windows releases will boot correctly on an
> Intel IOMMU with virtualized AMD CPUs?

I'm not really proposing that we hack the Intel IOMMU like this; it's a
proof of concept trying to understand the Windows bugs. And it *only*
works for interrupts generated by the I/O APIC anyway.

For real PCI MSI, Windows still generates an AMD-style remappable MSI
message but *doesn't* actually program it into the IOMMU's table!
Probably because in AMD mode, the IRTE indices are per-device rather
than global.

For PCI MSI(-X) we're actually better off without an IOMMU, because then
we see a *different* Windows bug — it just puts the high bits of the
APIC ID into the high bits of the MSI address instead. Obviously such an
MSI *ought* to miss the APIC at 0x00000000FEExxxxx completely, and just
scribble over guest memory, but we can cope with that as I showed in a
later email.

At this point I'm just hacking around and trying to understand how
Windows behaves; until I do that I don't have any concrete suggestions
for if/how we should attempt to support it. There is a Design Change
Request open with Microsoft already, to fix some of this and use the
KVM/Xen/Hyper-V 15-bit MSI extension sensibly. Hopefully they can fix
it, and we don't have to worry too much about what future Windows
versions will do, because they'll be a bit saner. In the meantime we're
trying to work out if it's even possible to make today's versions of
Windows work, without having to give them DMA translation support.

With `-cpu host,+hv-avic` and a hack in pci_msi_trigger() to handle the
erroneous high bits in the MSI address, I do have Windows Server 2022
booting. I'm not sure what would happen if it ever tried to target an
I/O APIC interrupt at a CPU above 255 though.

FWIW I *already* wanted to rewrite QEMU's MSI handling before we gained
TCG X2APIC support, and now I want to rewrite it even more, even without
this Windows nonsense.

We should have a *single* translation function which covers KVM and TCG,
which includes IOMMUs, Xen's PIRQ hack, the 15-bit MSI extension, this
Windows bug (if we want to support it), and which will allow the IOMMU
to know whether to deliver an IRQ fault event or not. And which handles
the cookies needed for IOMMU invalidation, which needs to kick eventfd
assignments out of the KVM irq routing table.

When a guest programs a PCI device's MSI-X table, this function should
be called with deliver_now==false. If the translation succeeds, yay! It
should be put into the KVM routing table and the VFIO eventfd should be
attached (which will allow posted interrupts to work). If the
translation fails, QEMU should just listen on the VFIO eventfd for
itself.

When an MSI happens in real time, either because a VFIO eventfd fires or
because an emulated PCI device calls pci_msi_trigger(), it calls the
same function with deliver_now==true. If an IOMMU lookup *still* fails,
that's when the IOMMU will actually raise a fault.

That function allows us to collect all the various MSI format nonsense
in *one* place and handle it cleanly, converting to the KVM X2APIC MSI
format which both KVM *and* the TCG X2APIC implementation accept. It
would have a comment which looks something like this...

(Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> in case anyone gets
around to such a rewrite before I do, and/or just wants to nab this and
put it somewhere useful)

/*
 * ===================
 * MSI MESSAGE FORMATS
 * ===================
 *
 * Message Signaled Interrupts are simply DMA transactions from the device.
 * It really is just "write <these> 32 bits <here> when you want attention."
 * The MSI (or MSI-X) message configured in the device is just the 64 bits
 * of the address to write to, and the 32 bits to write there.
 *
 * You can use this to do polled I/O by telling the device to write into a
 * data structure of your own choosing, then checking to see when it does
 * so.
 *
 * Or you can tell the device to poke at MMIO on *another* device, for
 * example when it's finished receiving a packet and it's time for the next
 * device to process that packet.
 *
 * Or — and this one is *actually* how it's expected to be used by sane
 * operating systems — you can point it at a special region of "physical
 * memory" which isn't actually memory; it's really an MMIO device which
 * can be used to trigger interrupts.
 *
 * That MMIO device is called the APIC, and on x86 machines it lives at
 * 0x00000000FEExxxxx in the physical memory space (the real one in host
 * physical space, and a virtual one in guest physical space).
 *
 * When the APIC receives a write transaction, it looks at the low 24 bits of
 * the address, and the 32 bits of data, and that conveys all the information
 * about which interrupt vector to raise on which CPU, and a few more details
 * besides. Some of those details include special cases like cluster delivery
 * modes and ways to deliver NMI/INIT/etc. which we won't go into here.
 *
 * In the beginning, there was only one way of doing this. This is what Intel
 * documentation now calls "Compatibility Format" (§5.1.2.1 of the VT-d spec).
 * It has 8 bits for the Destination APIC ID which are in bits 12-19 of the
 * MSI address (i.e. the XX in 0xfeeXX...). The *vector* to be raised on that
 * CPU is in the low 8 bits of the data written to that address.
 *
 *
 * Compatibility Format
 * --------------------
 *
 * Address: 1111.1110.1110.dddd.dddd.0000.0000.rmxx
 *              0xFEE     . Dest ID .   Rsvd  .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 *
 * Crucially, this format has only 8 bits for the Destination ID. Since 0xFF
 * is the broadcast address, this allows only up to 255 CPUs to be supported.
 *
 * For many years the Reserved bits in bits 4-11 of the address were labelled
 * in some Intel documentation as "Extended Destination ID", but never used.
 *
 *
 * I/O APIC Redirection Table Entries
 * ----------------------------------
 *
 * The I/O APIC is just a device for turning line-level interrupts into MSI
 * messages. Each pin on the I/O APIC has a Redirection Table Entry (RTE)
 * which configures the MSI message to be sent. The 64 bits of the RTE
 * include all the fields of the Compatibility Format MSI, including the
 * Extended Destination ID, but basically shuffled into a strange order for
 * historical reasons. Creating a Compatibility Format MSI from an I/O APIC
 * RTE is basically just a series of masks and shifts to move the bits into
 * the right place. Linux will compose an MSI message as appropriate for the
 * actual APIC or IOMMU in use (we'll get to those), then just shuffle the
 * bits around to program the I/O APIC RTE.
 *
 *
 * Intel "Remappable Format"
 * -------------------------
 *
 * When Intel started supporting more than 255 CPUs, the 8-bit limit in what
 * was not yet called "Compatibility Format" became a problem. To support
 * the full 32 bits of logical X2APIC IDs they had to come up with another
 * solution. Since MSIs are basically just a DMA write, the logical place for
 * this was the IOMMU, which already intercepts DMA writes from devices. So
 * they invented "Interrupt Remapping". The "Remappable Format" MSI does not
 * directly encode which vector to send to which CPU; instead it just
 * identifies an index into an IOMMU table (the Interrupt Remapping Table).
 *
 * The Interrupt Remapping Table Entry (IRTE) contains all the information
 * which was once present in the MSI address+data, but allows for a full 32
 * bits of destination ID. (It can also be used for posted interrupts,
 * delivering the interrupt *directly* to a vCPU in guest mode.)
 *
 * To signal a Remappable Format MSI, Intel used bit 4 of the MSI address,
 * which is the lowest of the bits which were previously labelled "Extended
 * Destination ID". With an Intel IOMMU doing Interrupt Remapping, you can
 * either submit Remappable Format MSIs, *or* Compatibility Format, and the
 * IOMMU will only actually remap the former. (It can be told to block the
 * latter, for security reasons.)
 *
 * Intel calls the IRTE index the "handle". There are some legacy multi-MSI
 * devices which can't be explicitly configured with a different address/data
 * for each interrupt, but just add one to the data for each consecutive MSI
 * vector they generate. This *used* to correspond to consecutive IRQ vectors
 * on the same CPU. To cope with this, Intel added a "Subhandle" in the low
 * bits of the data, which *optionally* adds those bits to the handle
 * extracted from the MSI address:
 *
 * Address: 1111.1110.1110.hhhh.hhhh.hhhh.hhh1.shxx
 *              0xFEE     .   Handle[14:0]    .↑↑↑
 *                                             ||Don't Care
 *                                             |Handle[15]
 *                                             Subhandle Valid (SHV)
 *
 * Data:    0000.0000.0000.0000.ssss.ssss.ssss.ssss
 *                Reserved     . Subhandle (if SHV==1 in address)
 *
 * There is a slight complexity here for the I/O APIC, which doesn't *just*
 * shuffle the bits around to generate an MSI, but also handles EOI of line
 * level interrupts (and has to re-raise the IRQ if the line is actually
 * still asserted). For that, the I/O APIC interprets the RTE bits with their
 * original "compatibility" meaning. All those bits actually end up in the
 * low bits of the MSI data, so the OS has to program those bits accordingly
 * even though it sets SHV=0 so they're actually *ignored* when generating
 * the interrupt.
 *
 *
 * AMD Remappable MSI
 * ------------------
 *
 * AMD's IOMMU is completely different to Intel's, and they didn't make
 * things anywhere near as complicated. When the IOMMU is enabled, a
 * device cannot send "Compatibility Format" MSIs any more, so there is
 * no need to tell one format from the other. AMD just used the low 11
 * bits of the data as the IRTE index, and nothing else matters.
 *
 * Address: 1111.1110.1110.xxxx.xxxx.xxxx.xxxx.xxxx
 *              0xFEE     .        Don't Care
 *
 * Data:    xxxx.xxxx.xxxx.xxxx.xxxx.xiii.iiii.iiii
 *                 Don't Care        . IRTE Index
 *
 * The reason for using only 11 bits of IRTE index is because, as described
 * above, the I/O APIC actually *does* care about bit 11 of the MSI data
 * (or, more accurately, it cares about the RTE bit which gets shuffled into
 * bit 11 of the MSI data). That's the original "Trigger Mode" bit, which the
 * I/O APIC needs in order to re-raise level-triggered interrupts which are
 * EOI'd while they're still asserted.
 *
 * Although the Intel IOMMU has a single Interrupt Remapping Table and a
 * single number space for IRTE indices across the whole system, the AMD
 * IOMMU has a table per device. This, sadly, becomes important later.
 *
 *
 * The 15-bit MSI extension
 * ------------------------
 *
 * The problem with IOMMUs is that they were designed to support DMA
 * translation, and there is no architectural way to disable that and offer
 * guests an IOMMU which *only* supports Interrupt Remapping. We really don't
 * want guests doing DMA translation, as it has severe performance and
 * security implications.
 *
 * So KVM, Hyper-V and Xen all define a virt extension which uses 7 of the
 * original "Extended Destination ID" bits to give support for up to 32768
 * virtual CPUs. (This extension avoids the low bit which Intel used to
 * indicate Remappable Format.) This format is exactly like the Compatibility
 * Format, except that bits 5-11 of the MSI address are used as bits 8-14
 * of the destination APIC ID:
 *
 * Address: 1111.1110.1110.dddd.dddd.DDDD.DDD0.rmxx
 *              0xFEE     . Dest ID . ExtDest .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 *
 *
 * Xen MSI → PIRQ mapping
 * ----------------------
 *
 * All of the above are implementable in real hardware. Actual external PCI
 * devices can perform memory transactions to addresses in the physical
 * address range 0x00000000FEExxxxx, which reach the APIC and cause
 * interrupts to be injected into the relevant CPU.
 *
 * But Xen guests know that they are running in a virtual machine. So they
 * know that the PCI config space is a complete fiction. For example, if they
 * set up a BAR of a given device with a certain address, that is a *guest*
 * physical address. The hypervisor probably doesn't even change anything on
 * the device itself; it just adjusts the EPT page tables to make the
 * corresponding BAR *appear* in the guest physical address space at the
 * desired location.
 *
 * MSI messages are similarly fictional. The guest configures a PCI device
 * with its own vCPU APIC ID and vector, and the real hardware wouldn't know
 * what to do with that. (Well, we could design an IOMMU which *could* cope
 * with that, let guests write directly to the PCI devices' MSI tables, and
 * use the resulting MSIs for posted interrupts as a first-class citizen, but
 * nobody's done that.)
 *
 * In practice, what happens is that the hypervisor registers its *own*
 * handler for the interrupt in question (routing it to a given vector on a
 * given *host* CPU). When that host interrupt handler is triggered, the
 * hypervisor injects an interrupt to the guest vCPU accordingly. Most
 * hypervisors, including Xen and KVM, do *not* have a mechanism to simply
 * write to guest memory *instead* of injecting an interrupt. So if the guest
 * configured the MSI to target an address outside the 0x00000000FEExxxxx
 * range, it just gets dropped. (Boo, no DPDK polled-mode implementations
 * abusing MSIs for memory writes, in virt guests!)
 *
 * This means that we can abuse the high 32 bits of the address even in a
 * guest-visible way, right? Nothing would ever go wrong...
 *
 * Xen was the first to do this. It needed a way to map MSIs from PCI devices
 * to a 'PIRQ', which is a form of Xen paravirtualised interrupt which binds
 * to Xen Event Channels. By using vector #0, Xen guests indicate a special
 * MSI message which is to be routed to a PIRQ. The actual PIRQ# is then in
 * the original Destination ID field... and the high bits of the address.
 *
 * (We'll gloss over the way that Xen snoops on these even while masked, and
 * actually unmasks the MSI when the guest binds to the corresponding PIRQ,
 * because there's only so much pain I can inflict on the reader in one
 * sitting.)
 *
 * AddrHi:  DDDD.DDDD.DDDD.DDDD.DDDD.DDDD.0000.0000
 *                  PIRQ#[31-8]          .   Rsvd
 *
 * AddrLo:  1111.1110.1110.dddd.dddd.0000.0000.xxxx
 *              0xFEE     .PIRQ[7-0].   Rsvd  .Don't Care
 *
 * Data:    xxxx.xxxx.xxxx.xxxx.xxxx.xxxx.0000.0000
 *                  Don't Care           .Vector == 0
 *
 *
 * KVM X2APIC MSI API
 * ------------------
 *
 * KVM has an ioctl() for injecting MSI interrupts, and routing table entries
 * which cause MSIs to be injected to the guest when triggered. For
 * convenience, KVM originally just used the Compatibility Format MSI message
 * as its userspace ABI for configuring these. This got less convenient when
 * X2APIC came along and we needed an extra 24 bits for the Destination ID.
 *
 * KVM's solution was to abuse the high 32 bits of the address. If this was a
 * true memory transaction, such a write would miss the APIC completely and
 * scribble over guest memory at an address like 0x00000100FEExxxxx. But in
 * this case it's just an ABI between KVM and userspace, using bits which
 * would otherwise be completely redundant. KVM uses the high 24 bits of the
 * MSI address (bits 40-63) as the high 24 bits of the destination ID.
 *
 * AddrHi:  DDDD.DDDD.DDDD.DDDD.DDDD.DDDD.0000.0000
 *            Destination ID bits 8-31   .   Rsvd
 *
 * AddrLo:  1111.1110.1110.dddd.dddd.0000.0000.rmxx
 *              0xFEE     . Dest ID .   Rsvd  .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 *
 * This hack is not visible to a KVM guest. What a KVM guest programs into
 * the MSI descriptors of passthrough or emulated PCI devices is completely
 * different, and (at this point in our tale of woe, at least) never sets
 * the high 32 bits of the target address to anything but zero.
 *
 *
 * IOMMU interrupts
 * ----------------
 *
 * Since an IOMMU is responsible for remapping interrupts so they can reach
 * CPUs with higher APIC IDs, how do we actually configure the events from
 * the IOMMU itself?
 *
 * Intel uses the same format as the KVM X2APIC API (which may actually have
 * been why KVM did it that way). Since it's never going to be an actual
 * memory transaction, it's safe to abuse the high bits of the address. Intel
 * offers { Data, Address, Upper Address } registers for each type of event
 * that the IOMMU can generate for itself, with the high 24 bits of the
 * destination ID in the higher 24 bits of the address as shown above.
 *
 * AMD's IOMMU uses a completely different 64-bit register format (e.g. the
 * XT IOMMU General Interrupt Control Register) which doesn't pretend very
 * hard to look like an MSI at all. But it just happens to have the DestMode
 * at bit 2, like in the MSI address. And it just happens to have the vector
 * and Delivery Mode (from the low 9 bits of the MSI data) in the low 9 bits
 * of its high word (bits 32-40 of the register). And then it just throws the
 * actual destination ID in around them in some other bits:
 *
 * Low32:   dddd.dddd.dddd.dddd.dddd.dddd.xxxx.xmxx
 *            Destination ID [23-0]      . DC . ↑↑
 *                                              |Don't Care
 *                                              Destination Mode
 *
 * High32:  DDDD.DDDD.xxxx.xxxx.xxxx.xxxD.vvvv.vvvv
 *          DestId[31-24]                ↑.  Vector
 *                                       Delivery Mode
 *
 *
 * Windows, part 1: Intel IOMMU with no DMA translation
 * ----------------------------------------------------
 *
 * As noted above, the 15-bit extension was invented to avoid the need for
 * an IOMMU, because it is undesirable to offer a virtual IOMMU to guests
 * with support for them to do their own additional level of DMA translation.
 *
 * However, although Hyper-V exposes the 15-bit MSI feature, Windows as a
 * guest OS does not use it. In order to support Windows guests with more
 * than 255 vCPUs, a hack was found for the Intel IOMMU. Although there is no
 * official way to advertise that the IOMMU does not support DMA translation,
 * there *are* "Supported Adjusted Guest Address Width" bits which advertise
 * the ability to use 3-level, 4-level, and/or 5-level page tables. If
 * Windows encounters an IOMMU which sets *none* of these bits, Windows will
 * quietly refrain from attempting to use that IOMMU for DMA translation, but
 * will still use it for Interrupt Remapping.
 *
 * However, this only works correctly if Windows is running on an Intel CPU.
 * When Windows runs on an AMD CPU, it will happily configure and use the
 * Intel IOMMU, but misconfigures the MSI messages that it programs into the
 * devices. For I/O APIC interrupts, Windows programs the IRTE in the Intel
 * IOMMU correctly... but then configures the I/O APIC using the AMD format
 * (with the IRTE index where the vector would have been). A hack to the
 * virtual Intel IOMMU emulation can make it cope with this bug... but sadly
 * it *only* works for I/O APIC interrupts. For actual PCI MSI, Windows still
 * configures the device with an AMD-style remappable MSI but *doesn't*
 * actually configure the IRTE in the IOMMU at all. This is probably because
 * Intel's IRT is system-wide, while AMD has one per device; Windows does
 * seem to think it's using a separate IRTE space, so the first MSI vector
 * gets IRTE index 0, which conflicts with I/O APIC pin 0.
 *
 * So for PCI, the hypervisor has no idea where Windows intended a given MSI
 * to be routed, and cannot work around the Windows bugs to support >255 AMD
 * vCPUs this way.
 *
 *
 * Windows, part 2: No IOMMU
 * -------------------------
 *
 * If we do *not* offer an IOMMU to a Windows guest which has CPUs with high
 * APIC IDs, we encounter a *different* Windows bug, which is easier to work
 * around. Windows doesn't use the 15-bit extension described above, but it
 * *does* just throw the high bits of the destination ID into bits 32-55 of
 * the MSI address.
 *
 * Done without negotiation or discovery of any hypervisor feature, this
 * arguably ought to cause the device to write to an address in guest
 * *memory* and miss the APIC at 0x00000000FEExxxxx altogether, but we
 * already admitted almost no hypervisors actually *do* that. (QEMU is the
 * exception here, because for *emulated* PCI devices, pci_msi_trigger() does
 * actually generate true write cycles in the corresponding DMA address
 * space.)
 *
 * We can cope with this Windows bug and even use it to our advantage, by
 * spotting the high bits in the MSI address and using them. It does require
 * that we have an API which is specifically for *MSI*, not to be conflated
 * with actual DMA writes. So QEMU's pci_msi_trigger() would have to do
 * things differently. But let's pretend, for the sake of argument, that I'm
 * typing this C-comment essay into a VMM other than QEMU, which already
 * does think that way and has a cleaner separation of emulated-PCI vs. the
 * VFIO or true emulation which can back it, and *does* always handle MSIs
 * explicitly.
 *
 * In that case, all the translation function has to do, in addition to
 * invoking all the IOMMU and Xen and 15-bit translators as QEMU's
 * kvm_arch_fixup_msi_route() function already does, is add one more trivial
 * special case. This format is the same as the KVM X2APIC API format, with
 * the top 32 bits of the address shifted by 8 bits:
 *
 * AddrHi:  0000.0000.DDDD.DDDD.DDDD.DDDD.DDDD.DDDD
 *            Rsvd   .   Destination ID bits 8-31
 *
 * AddrLo:  1111.1110.1110.dddd.dddd.0000.0000.rmxx
 *              0xFEE     . Dest ID .   Rsvd  .↑↑↑
 *                                             ||Don't Care
 *                                             |Destination Mode
 *                                             Redirection Hint
 *
 * Data:    0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv
 *                Reserved     .↑↑     ↑ .  Vector
 *                              ||     Delivery Mode
 *                              Trigger Mode, Trigger Mode Level
 */
bool arch_translate_msi_message(struct kvm_irq_routing_entry *re,
                                const struct kvm_msi *in, uint64_t *cookie,
                                bool deliver_now)
{

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
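[As a footnote to the comment above, here is a hedged sketch, not from
the thread and not the body of the arch_translate_msi_message() prototype
it ends with, of how the final "Windows, part 2" case could be folded
into the KVM X2APIC routing format. The helper name is invented; struct
kvm_irq_routing_entry and KVM_IRQ_ROUTING_MSI are the existing KVM uapi
definitions.]

```
#include <stdint.h>
#include <linux/kvm.h>

/*
 * Sketch only: convert the erroneous Windows layout (APIC ID bits 8-31
 * in MSI address bits 32-55) into the KVM X2APIC MSI routing format
 * (APIC ID bits 8-31 in address bits 40-63).
 */
static void fold_windows_high_bits(struct kvm_irq_routing_entry *re,
                                   uint64_t msi_addr, uint32_t msi_data)
{
    uint64_t ext_dest = (msi_addr >> 32) & 0xffffff;     /* APIC ID bits 8-31 */

    re->type = KVM_IRQ_ROUTING_MSI;
    re->u.msi.address_lo = (uint32_t)msi_addr;           /* 0xFEExxxxx low word */
    re->u.msi.address_hi = (uint32_t)(ext_dest << 8);    /* KVM wants bits 40-63 */
    re->u.msi.data = msi_data;
}
```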
end of thread, other threads:[~2024-10-02 15:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2024-07-02  5:17 More than 255 vcpus Windows VM setup without viommu ? Sandesh Patel
2024-07-02  9:04 ` David Woodhouse
2024-07-03 16:01   ` Sandesh Patel
2024-07-08  9:13     ` David Woodhouse
2024-07-11  7:26       ` David Woodhouse
2024-07-11 11:23         ` David Woodhouse
2024-07-11 11:52           ` Sandesh Patel
2024-07-16  5:13             ` Sandesh Patel
2024-07-24  9:22               ` David Woodhouse
2024-08-01 10:28                 ` Sandesh Patel
2024-09-28 14:59 ` David Woodhouse
2024-09-30 15:50   ` David Woodhouse
2024-10-02 11:33     ` Igor Mammedov
2024-10-02 15:30       ` David Woodhouse
2024-10-01 13:33   ` Daniel P. Berrangé
2024-10-01 16:37     ` David Woodhouse
  -- strict thread matches above, loose matches on Subject: below --
2024-07-02  7:20 Sandesh Patel