From: David Woodhouse <dwmw2@infradead.org>
To: Sandesh Patel <sandesh.patel@nutanix.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
paul <paul@xen.org>
Cc: Rob Scheepens <rob.scheepens@nutanix.com>,
Prerna Saxena <confluence@nutanix.com>,
Alexander Graf <agraf@csgraf.de>
Subject: Re: More than 255 vcpus Windows VM setup without viommu ?
Date: Mon, 30 Sep 2024 16:50:21 +0100
Message-ID: <7571cdc42d6d69db0ac98ffc99801d11de1de129.camel@infradead.org>
In-Reply-To: <a80c99b0e10e71a5a301c884d699eeaff3893349.camel@infradead.org>
On Sat, 2024-09-28 at 15:59 +0100, David Woodhouse wrote:
> On Tue, 2024-07-02 at 05:17 +0000, Sandesh Patel wrote:
> >
> > The error is due to invalid MSIX routing entry passed to KVM.
> >
> > The VM boots fine if we attach a vIOMMU but adding a vIOMMU can
> > potentially result in IO performance loss in guest.
> > I was interested to know if someone could boot a large Windows VM by
> > some other means like kvm-msi-ext-dest-id.
>
> I think I may (with Alex Graf's suggestion) have found the Windows bug
> with Intel IOMMU.
>
> It looks like when interrupt remapping is enabled with an AMD CPU,
> Windows *assumes* it can generate AMD-style MSI messages even if the
> IOMMU is an Intel one. If we put a little hack into the IOMMU interrupt
> remapping to make it interpret an AMD-style message, Windows seems to
> boot at least a little bit further than it did before...
Sadly, Windows has *more* bugs than that.
The previous hack extracted the Interrupt Remapping Table Entry (IRTE)
index from an AMD-style MSI message, and looked it up in the Intel
IOMMU's IR Table.
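
Roughly, that hack had this shape (a standalone sketch, not the actual
patch; the bit layouts follow my reading of the two specs, so treat
them as assumptions: Intel flags a remappable message with address bit
4 and keeps the handle in address bits 19:5, while AMD takes the table
index from the low 11 bits of the MSI data):

#include <stdint.h>
#include <stdio.h>

/* Intel VT-d "remappable format": address bit 4 set, IRTE handle in
 * address bits 19:5. */
#define VTD_MSI_ADDR_REMAP       (1u << 4)
#define VTD_MSI_ADDR_HANDLE(a)   (((uint32_t)(a) >> 5) & 0x7fffu)

/* AMD IOMMU: the IRTE index is taken from MSI data bits 10:0. */
#define AMD_MSI_DATA_IRTE_IDX(d) ((d) & 0x7ffu)

/* Pick the index to look up in the (Intel) IR table. */
static uint32_t irte_index_for_msi(uint64_t addr, uint32_t data)
{
    if (addr & VTD_MSI_ADDR_REMAP) {
        return VTD_MSI_ADDR_HANDLE(addr);  /* well-formed Intel MSI */
    }
    /* Windows-on-"AMD" fallback: treat it as an AMD-style message and
     * pull the index out of the data word instead. */
    return AMD_MSI_DATA_IRTE_IDX(data);
}

int main(void)
{
    /* An AMD-style message, as Windows generates them: index 3. */
    printf("index %u\n", irte_index_for_msi(0xfee00000, 3));
    return 0;
}
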
That works... for the MSIs generated by the I/O APIC.
However... in the Intel IOMMU model, there is a single global IRT, and
each entry specifies which devices are permitted to invoke it. The AMD
model is slightly nicer, in that it allows a per-device IRT.
So for a PCI device, Windows just seems to configure each MSI vector in
order, with IRTE#0, 1, onwards. Because it's a per-device number space,
right? Which means that the first MSI vector on a PCI device gets aliased
to IRQ#0 on the I/O APIC.
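
To make the collision concrete, here's a toy model (the field names
and the I/O APIC source-id are made up, and the real VT-d IRTE also
has SVT/SQ fields governing how strictly the source is matched):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy IRTE: one global table; each entry is owned by one requester. */
struct irte {
    bool     present;
    uint16_t source_id;   /* the only requester allowed to use it */
    uint8_t  vector;
};

static const struct irte irt[] = {
    /* IRTE#0 set up for the I/O APIC (hypothetical source-id). */
    { .present = true, .source_id = 0xff00, .vector = 0x60 },
};

static bool remap(uint16_t sid, unsigned int index, uint8_t *vector)
{
    if (index >= sizeof(irt) / sizeof(irt[0]) ||
        !irt[index].present || irt[index].source_id != sid) {
        return false;     /* interrupt remapping fault */
    }
    *vector = irt[index].vector;
    return true;
}

int main(void)
{
    uint8_t vec;
    /* A PCI device (requester 0xfa) using "its" per-device index 0
     * lands on the I/O APIC's global IRTE#0 and faults. */
    printf("%s\n", remap(0xfa, 0, &vec) ? "remapped" : "fault");
    return 0;
}
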
I dumped the whole IRT, and it isn't just that Windows is using the
wrong index; it hasn't even set up the correct destination in *any* of
the entries. So we can't even do a nasty trick like scanning and
finding the Nth entry which is valid for a particular source-id.
Happily, Windows has *more* bugs than that... if I run with
`-cpu host,+hv-avic' then it puts the high bits of the target APIC ID
into the high bits of the MSI address. This *ought* to mean that MSIs
from devices miss the APIC (at 0x00000000FEExxxxx) and scribble over
guest memory at addresses like 0x1FEE00004. But we can add yet
*another* hack to catch that. For now I just hacked it to move the low
7 extra bits into the "right" place for the 15-bit extension.
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -361,6 +361,20 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
         return;
     }
     attrs.requester_id = pci_requester_id(dev);
+    printf("Send MSI 0x%" PRIx64 "/0x%x from 0x%x\n",
+           msg.address, msg.data, attrs.requester_id);
+    if (msg.address >> 32) {
+        /*
+         * Windows put the high bits of the target APIC ID into the
+         * high bits of the address; fold them down into bits 11:5,
+         * where the 15-bit extended destination ID format wants them.
+         */
+        uint64_t ext_id = msg.address >> 32;
+        msg.address &= 0xffffffff;
+        msg.address |= ext_id << 5;
+        printf("Now 0x%" PRIx64 "/0x%x with ext_id %" PRIx64 "\n",
+               msg.address, msg.data, ext_id);
+    }
+
     address_space_stl_le(&dev->bus_master_as, msg.address, msg.data,
                          attrs, NULL);
 }
We also need to stop forcing Windows into logical (cluster) destination
mode, and force it to use physical destination mode instead:
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -158,7 +158,7 @@ static void init_common_fadt_data(MachineState *ms, Object *o,
              * used
              */
             ((ms->smp.max_cpus > 8) ?
-                (1 << ACPI_FADT_F_FORCE_APIC_CLUSTER_MODEL) : 0),
+                (1 << ACPI_FADT_F_FORCE_APIC_PHYSICAL_DESTINATION_MODE) : 0),
         .int_model = 1 /* Multiple APIC */,
         .rtc_century = RTC_CENTURY,
         .plvl2_lat = 0xfff /* C2 state not supported */,
So now, with *no* IOMMU configured, Windows Server 2022 is booting and
using CPUs > 255:
Send MSI 0x1fee01000/0x41b0 from 0xfa
Now 0xfee01020/0x41b0 with ext_id 1
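
Sanity-checking that decode (assuming the layout the hack above
targets: APIC ID bits 7:0 in address bits 19:12, bits 14:8 in address
bits 11:5):

#include <stdint.h>
#include <stdio.h>

/* Recover the destination from a 15-bit ext-dest-id MSI address. */
static uint32_t msi_dest_apic_id(uint32_t addr)
{
    return ((addr >> 12) & 0xff) | (((addr >> 5) & 0x7f) << 8);
}

int main(void)
{
    /* 0xfee01020 from the log above: expect 0x101, i.e. CPU 257. */
    printf("dest APIC ID 0x%x\n", msi_dest_apic_id(0xfee01020));
    return 0;
}
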
That trick obviously can't work for the I/O APIC, but I haven't managed
to persuade Windows to target I/O APIC interrupts at any CPU other than
#0 yet. I'm trying to make QEMU run with *only* higher APIC IDs, to
test.
It may be that we need to advertise an Intel IOMMU that *only* has the
I/O APIC behind it, and all the actual PCI devices are direct, so we
can abuse that last Windows bug.