From: David Woodhouse <dwmw2@infradead.org>
To: Sandesh Patel <sandesh.patel@nutanix.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	paul <paul@xen.org>
Cc: Rob Scheepens <rob.scheepens@nutanix.com>,
	Prerna Saxena <confluence@nutanix.com>,
	Alexander Graf <agraf@csgraf.de>
Subject: Re: More than 255 vcpus Windows VM setup without viommu ?
Date: Mon, 30 Sep 2024 16:50:21 +0100
Message-ID: <7571cdc42d6d69db0ac98ffc99801d11de1de129.camel@infradead.org>
In-Reply-To: <a80c99b0e10e71a5a301c884d699eeaff3893349.camel@infradead.org>

On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote:
> On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
> > 
> > The error is due to invalid MSIX routing entry passed to KVM.
> > 
> > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> > potentially result in IO performance loss in guest.
> > I was interested to know if someone could boot a large Windows VM by
> > some other means like kvm-msi-ext-dest-id.
> 
> I think I may (with Alex Graf's suggestion) have found the Windows bug
> with Intel IOMMU.
> 
> It looks like when interrupt remapping is enabled with an AMD CPU,
> Windows *assumes* it can generate AMD-style MSI messages even if the
> IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> remapping to make it interpret an AMD-style message, Windows seems to
> boot at least a little bit further than it did before...

Sadly, Windows has *more* bugs than that.

The previous hack extracted the Interrupt Remapping Table Entry (IRTE)
index from an AMD-style MSI message, and looked it up in the Intel
IOMMU's IR Table.
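
In rough terms, the hack did something like this (a sketch only, not
the real patch; it assumes the AMD convention that a remappable MSI
carries its IRTE index in the low bits of the data word, and
vtd_irte_lookup() is a hypothetical stand-in for the Intel IOMMU's
IRTE fetch):

static int hack_remap_amd_style_msi(IntelIOMMUState *iommu, uint16_t sid,
                                    MSIMessage *msg)
{
    /* AMD IOMMU convention: the IRTE index lives in MSI data[10:0] */
    uint16_t index = msg->data & 0x7ff;

    /* ... then look that index up in the Intel IOMMU's global IR table */
    return vtd_irte_lookup(iommu, index, sid, msg);
}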

That works... for the MSIs generated by the I/O APIC.

However... in the Intel IOMMU model, there is a single global IRT, and
each entry specifies which devices are permitted to invoke it. The AMD
model is slightly nicer, in that it allows a per-device IRT.
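
For reference, an Intel IRTE looks roughly like this (an abridged
sketch after the VT-d spec, with reserved fields elided, so the widths
below don't add up to the full 128 bits; the SVT/SQ/SID fields in the
high half are the per-entry source validation):

struct vtd_irte_sketch {
    /* low 64 bits */
    uint64_t present:1;
    uint64_t fpd:1;        /* fault processing disable */
    uint64_t dest_mode:1;  /* 0 = physical, 1 = logical */
    /* ... redirection hint, trigger/delivery mode, reserved ... */
    uint64_t vector:8;
    uint64_t dest_id:32;   /* target APIC ID */
    /* high 64 bits */
    uint64_t sid:16;       /* permitted source-id (PCI requester ID) */
    uint64_t sq:2;         /* source-id qualifier */
    uint64_t svt:2;        /* source validation type */
};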

So for a PCI device, Windows just seems to configure each MSI vector in
order, with IRTE#0, 1, onwards. Because it's a per-device number space,
right? Which means that the first MSI vector on a PCI device gets aliased
to IRQ#0 on the I/O APIC.

I dumped the whole IRT, and it isn't just that Windows is using the
wrong index; it hasn't even set up the correct destination in *any* of
the entries. So we can't even do a nasty trick like scanning and
finding the Nth entry which is valid for a particular source-id.

Happily, Windows has *more* bugs than that... if I run with
`-cpu host,+hv-avic` then it puts the high bits of the target APIC ID
into the high bits of the MSI address. This *ought* to mean that MSIs
from devices miss the APIC (at 0x00000000FEExxxxx) and scribble over
guest memory at addresses like 0x1FEE00004. But we can add yet
*another* hack to catch that. For now I just hacked it to move the low
7 extra bits into the "right" place for the 15-bit extension.

--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -361,6 +361,14 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
         return;
     }
     attrs.requester_id = pci_requester_id(dev);
+    printf("Send MSI 0x%lx/0x%x from 0x%x\n", msg.address, msg.data, attrs.requester_id);
+    if (msg.address >> 32) {
+        uint64_t ext_id = msg.address >> 32;
+        msg.address &= 0xffffffff;
+        msg.address |= ext_id << 5;
+        printf("Now 0x%lx/0x%x with ext_id %lx\n", msg.address, msg.data, ext_id);
+    }
+        
     address_space_stl_le(&dev->bus_master_as, msg.address, msg.data,
                          attrs, NULL);
 }
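
Just to make the transformation concrete, here's a standalone sketch
of that fold-down (not QEMU code; the expected values are taken from
the trace quoted below):

#include <assert.h>
#include <stdint.h>

/*
 * The extra APIC ID bits which Windows put at address[39:32] get
 * shifted into the 15-bit ext-dest-id field at address bits 11:5.
 */
static uint64_t fold_ext_dest_id(uint64_t addr)
{
    uint64_t ext_id = addr >> 32;
    return (addr & 0xffffffffULL) | (ext_id << 5);
}

int main(void)
{
    /* Matches the trace below: ext_id 1, i.e. target APIC ID 0x100 */
    assert(fold_ext_dest_id(0x1fee01000ULL) == 0xfee01020ULL);
    return 0;
}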

We also need to stop forcing Windows to use logical mode, and force it
to use physical mode instead:

--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o,
              * used
              */
             ((ms->smp.max_cpus > 8) ?
-                        (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0),
+                        (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0),
         .int_model = 1 /* Multiple APIC */,
         .rtc_century = RTC_CENTURY,
         .plvl2_lat = 0xfff /* C2 state not supported */,


So now, with *no* IOMMU configured, Windows Server 2022 is booting and
using CPUs > 255:
  Send MSI 0x1fee01000/0x41b0 from 0xfa
  Now 0xfee01020/0x41b0 with ext_id 1

That trick obviously can't work for the I/O APIC, but I haven't managed
to persuade Windows to target I/O APIC interrupts at any CPU other than
#0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to
test.

It may be that we need to advertise an Intel IOMMU that *only* has the
I/O APIC behind it, and all the actual PCI devices are direct, so we
can abuse that last Windows bug.


