From: Igor Mammedov <imammedo@redhat.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Sandesh Patel <sandesh.patel@nutanix.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	paul <paul@xen.org>, Rob Scheepens <rob.scheepens@nutanix.com>,
	Prerna Saxena <confluence@nutanix.com>,
	Alexander Graf <agraf@csgraf.de>
Subject: Re: More than 255 vcpus Windows VM setup without viommu ?
Date: Wed, 2 Oct 2024 13:33:22 +0200	[thread overview]
Message-ID: <20241002133322.18a4f1fa@imammedo.users.ipa.redhat.com> (raw)
In-Reply-To: <7571cdc42d6d69db0ac98ffc99801d11de1de129.camel@infradead.org>

On Mon, 30 Sep 2024 16:50:21 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote:
> > On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:  
> > > 
> > > The error is due to invalid MSIX routing entry passed to KVM.
> > > 
> > > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> > > potentially result in IO performance loss in guest.
> > > I was interested to know if someone could boot a large Windows VM by
> > > some other means like kvm-msi-ext-dest-id.  
> > 
> > I think I may (with Alex Graf's suggestion) have found the Windows bug
> > with Intel IOMMU.
> > 
> > It looks like when interrupt remapping is enabled with an AMD CPU,
> > Windows *assumes* it can generate AMD-style MSI messages even if the
> > IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> > remapping to make it interpret an AMD-style message, Windows seems to
> > boot at least a little bit further than it did before...  
> 
> Sadly, Windows has *more* bugs than that.
> 
> The previous hack extracted the Interrupt Remapping Table Entry (IRTE)
> index from an AMD-style MSI message, and looked it up in the Intel
> IOMMU's IR Table.
> 
> That works... for the MSIs generated by the I/O APIC.
> 
> However... in the Intel IOMMU model, there is a single global IRT, and
> each entry specifies which devices are permitted to invoke it. The AMD
> model is slightly nicer, in that it allows a per-device IRT.
> 
> So for a PCI device, Windows just seems to configure each MSI vector in
> order, with IRTE#0, 1, onwards. Because it's a per-device number space,
> right? Which means that first MSI vector on a PCI device gets aliased
> to IRQ#0 on the I/O APIC.
> 
> I dumped the whole IRT, and it isn't just that Windows is using the
> wrong index; it hasn't even set up the correct destination in *any* of
> the entries. So we can't even do a nasty trick like scanning and
> finding the Nth entry which is valid for a particular source-id.
> 
> Happily, Windows has *more* bugs than that... if I run with
> `-cpu host,+hv-avic' then it puts the high bits of the target APIC ID
> into the high bits of the MSI address. This *ought* to mean that MSIs
> from devices miss the APIC (at 0x00000000FEExxxxx) and scribble over
> guest memory at addresses like 0x1FEE00004. But we can add yet
> *another* hack to catch that. For now I just hacked it to move the low
> 7 extra bits in to the "right" place for the 15-bit extension.
> 
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -361,6 +361,14 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
>          return;
>      }
>      attrs.requester_id = pci_requester_id(dev);
> +    printf("Send MSI 0x%" PRIx64 "/0x%x from 0x%x\n", msg.address,
> +           msg.data, attrs.requester_id);
> +    if (msg.address >> 32) {
> +        uint64_t ext_id = msg.address >> 32;
> +        msg.address &= 0xffffffff;
> +        msg.address |= ext_id << 5;
> +        printf("Now 0x%" PRIx64 "/0x%x with ext_id 0x%" PRIx64 "\n",
> +               msg.address, msg.data, ext_id);
> +    }
> +
>      address_space_stl_le(&dev->bus_master_as, msg.address, msg.data,
>                           attrs, NULL);
>  }
> 
> We also need to stop forcing Windows to use logical mode, and force it
> to use physical mode instead:
> 
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o,
>               * used
>               */
>              ((ms->smp.max_cpus > 8) ?
> -                        (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0),
> +                        (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0),
>          .int_model = 1 /* Multiple APIC */,
>          .rtc_century = RTC_CENTURY,
>          .plvl2_lat = 0xfff /* C2 state not supported */,
> 
> 
> So now, with *no* IOMMU configured, Windows Server 2022 is booting and
> using CPUs > 255:
>   Send MSI 0x1fee01000/0x41b0 from 0xfa
>   Now 0xfee01020/0x41b0 with ext_id 1
> 
> That trick obviously can't work for the I/O APIC, but I haven't managed
> to persuade Windows to target I/O APIC interrupts at any CPU other than
> #0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to
> test.
> 
> It may be that we need to advertise an Intel IOMMU that *only* has the
> I/O APIC behind it, and all the actual PCI devices are direct, so we
> can abuse that last Windows bug.

It's interesting as an experiment, to prove that Windows is riddled with bugs.
(Well, it could also serve as a starting point for reporting the issue to MS.)
But I'd rather Microsoft fix the bugs on their side, instead of putting hacks
in QEMU.

PS:
Given it's an AMD CPU, I doubt very much that using intel_iommu would be
accepted by Microsoft as a valid complaint, though.



Thread overview: 17+ messages
2024-07-02  5:17 More than 255 vcpus Windows VM setup without viommu ? Sandesh Patel
2024-07-02  9:04 ` David Woodhouse
2024-07-03 16:01   ` Sandesh Patel
2024-07-08  9:13     ` David Woodhouse
2024-07-11  7:26       ` David Woodhouse
2024-07-11 11:23         ` David Woodhouse
2024-07-11 11:52           ` Sandesh Patel
2024-07-16  5:13             ` Sandesh Patel
2024-07-24  9:22               ` David Woodhouse
2024-08-01 10:28         ` Sandesh Patel
2024-09-28 14:59 ` David Woodhouse
2024-09-30 15:50   ` David Woodhouse
2024-10-02 11:33     ` Igor Mammedov [this message]
2024-10-02 15:30       ` David Woodhouse
2024-10-01 13:33   ` Daniel P. Berrangé
2024-10-01 16:37     ` David Woodhouse
