public inbox for iommu@lists.linux-foundation.org
* Re: ath11k and vfio-pci support
       [not found]                       ` <3d22a730-aee5-4f2a-9ddc-b4b5bd4d62fe@quicinc.com>
@ 2024-01-14 14:36                         ` Kalle Valo
  2024-01-15 17:46                           ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Kalle Valo @ 2024-01-14 14:36 UTC (permalink / raw)
  To: Baochen Qiang
  Cc: James Prestwood, linux-wireless, ath11k, David Woodhouse, iommu

Baochen Qiang <quic_bqiang@quicinc.com> writes:

>>> Strange that still fails. Are you now seeing this error in your
>>> host or your Qemu? or both?
>>> Could you share your test steps? And if you can share please be as
>>> detailed as possible since I'm not familiar with passing WLAN
>>> hardware to a VM using vfio-pci.
>>
>> Just in Qemu, the hardware works fine on my host machine.
>> I basically follow this guide to set it up; it's written in the
>> context of GPUs/libvirt but the host setup is exactly the same. By
>> no means do you need to read it all, once you set the vfio-pci.ids
>> and see your unclaimed adapter you can stop:
>> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
>> In short you should be able to set the following host kernel options
>> and reboot (assuming your motherboard/hardware is compatible):
>> intel_iommu=on iommu=pt vfio-pci.ids=17cb:1103
>> Obviously change the device/vendor IDs to whatever ath11k hw you
>> have. Once the host is rebooted you should see your wlan adapter as
>> UNCLAIMED, showing the driver in use as vfio-pci. If not, it's likely
>> your motherboard just isn't compatible, the device has to be in its
>> own IOMMU group (you could try switching PCI ports if this is the
>> case).
>> I then build a "kvm_guest.config" kernel with the driver/firmware
>> for ath11k and boot into that with the following Qemu options:
>> -enable-kvm -device vfio-pci,host=<PCI address>
>> If it seems easier you could also utilize IWD's test-runner which
>> handles launching the Qemu kernel automatically, detecting any
>> vfio-devices and passes them through and mounts some useful host
>> folders into the VM. It's actually a very good general purpose tool
>> for kernel testing, not just for IWD:
>> https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/doc/test-runner.txt
>> Once set up you can just run test-runner with a few flags and you'll
>> boot into a shell:
>> ./tools/test-runner -k <kernel-image> --hw --start /bin/bash
>> Please reach out if you have questions, thanks for looking into
>> this.
>
> Thanks for these details. I reproduced this issue by following your guide.
>
> Seems the root cause is that the MSI vector assigned to WCN6855 in
> qemu is different from that in the host. In my case the MSI vector in
> qemu is [Address: fee00000  Data: 0020] while in the host it is
> [Address: fee00578  Data: 0000]. So in qemu ath11k configures MSI
> vector [Address: fee00000  Data: 0020] into the WCN6855
> hardware/firmware, and the firmware uses that vector to fire
> interrupts at the host/qemu. However, the host IOMMU doesn't know
> that vector because the real vector is [Address: fee00578
> Data: 0000]; as a result the host blocks that interrupt and reports
> an error, see the log below:
>
> [ 1414.206069] DMAR: DRHD: handling fault status reg 2
> [ 1414.206081] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> request
> [ 1414.210334] DMAR: DRHD: handling fault status reg 2
> [ 1414.210342] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> request
> [ 1414.212496] DMAR: DRHD: handling fault status reg 2
> [ 1414.212503] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> request
> [ 1414.214600] DMAR: DRHD: handling fault status reg 2
>
> While I don't think there is a way for qemu/ath11k to get the real MSI
> vector from host, I will try to read the vfio code to check further.
> Before that, to unblock you, a possible hack is to hard code the MSI
> vector in qemu to the same as in host, on condition that the MSI
> vector doesn't change.

Baochen, awesome that you were able to debug this further. Now at
least we know what the problem is.

I'll add David Woodhouse and the iommu list in hopes that they could
give us ideas how to solve this. Full thread here:

https://lore.kernel.org/all/adcb785e-4dc7-4c4a-b341-d53b72e13467@gmail.com/
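
The guest/host mismatch described above can be seen directly by comparing
the MSI capability in `lspci -s <BDF> -vv` output inside the VM and on the
host. A minimal parsing sketch (the helper and the sample capability text
are illustrative, not part of any ath11k or vfio tooling; values are the
ones reported in this thread):

```python
import re

def parse_msi(lspci_text):
    """Extract the MSI (address, data) pair from `lspci -vv` output.

    Matches the line printed for an enabled MSI capability,
    e.g. "Address: fee00578  Data: 0000".
    """
    m = re.search(r"Address:\s*([0-9a-f]+)\s+Data:\s*([0-9a-f]+)", lspci_text)
    if not m:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)

# Sample host-side capability dump (address/data taken from this thread).
host = """
Capabilities: [50] MSI: Enable+ Count=1/32 Maskable+ 64bit-
        Address: fee00578  Data: 0000
"""
addr, data = parse_msi(host)
print(hex(addr), hex(data))  # 0xfee00578 0x0
```

Running the same parse inside the guest would show the virtualized pair
(fee00000/0020 here), which is exactly the divergence the DMAR faults point at.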

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: ath11k and vfio-pci support
  2024-01-14 14:36                         ` ath11k and vfio-pci support Kalle Valo
@ 2024-01-15 17:46                           ` Alex Williamson
  2024-01-16 10:08                             ` Baochen Qiang
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-01-15 17:46 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Baochen Qiang, James Prestwood, linux-wireless, ath11k,
	David Woodhouse, iommu

On Sun, 14 Jan 2024 16:36:02 +0200
Kalle Valo <kvalo@kernel.org> wrote:

> Baochen Qiang <quic_bqiang@quicinc.com> writes:
> 
> >>> Strange that still fails. Are you now seeing this error in your
> >>> host or your Qemu? or both?
> >>> Could you share your test steps? And if you can share please be as
> >>> detailed as possible since I'm not familiar with passing WLAN
> >>> hardware to a VM using vfio-pci.  
> >>
> >> Just in Qemu, the hardware works fine on my host machine.
> >> I basically follow this guide to set it up, its written in the
> >> context of GPUs/libvirt but the host setup is exactly the same. By
> >> no means do you need to read it all, once you set the vfio-pci.ids
> >> and see your unclaimed adapter you can stop:
> >> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> >> In short you should be able to set the following host kernel options
> >> and reboot (assuming your motherboard/hardware is compatible):
> >> intel_iommu=on iommu=pt vfio-pci.ids=17cb:1103
> >> Obviously change the device/vendor IDs to whatever ath11k hw you
> >> have. Once the host is rebooted you should see your wlan adapter as
> >> UNCLAIMED, showing the driver in use as vfio-pci. If not, its likely
> >> your motherboard just isn't compatible, the device has to be in its
> >> own IOMMU group (you could try switching PCI ports if this is the
> >> case).
> >> I then build a "kvm_guest.config" kernel with the driver/firmware
> >> for ath11k and boot into that with the following Qemu options:
> >> -enable-kvm -device vfio-pci,host=<PCI address>
> >> If it seems easier you could also utilize IWD's test-runner which
> >> handles launching the Qemu kernel automatically, detecting any
> >> vfio-devices and passes them through and mounts some useful host
> >> folders into the VM. Its actually a very good general purpose tool
> >> for kernel testing, not just for IWD:
> >> https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/doc/test-runner.txt
> >> Once set up you can just run test-runner with a few flags and you'll
> >> boot into a shell:
> >> ./tools/test-runner -k <kernel-image> --hw --start /bin/bash
> >> Please reach out if you have questions, thanks for looking into
> >> this.  
> >
> > Thanks for these details. I reproduced this issue by following your guide.
> >
> > Seems the root cause is that the MSI vector assigned to WCN6855 in
> > qemu is different with that in host. In my case the MSI vector in qemu
> > is [Address: fee00000  Data: 0020] while in host it is [Address:
> > fee00578 Data: 0000]. So in qemu ath11k configures MSI vector
> > [Address: fee00000 Data: 0020] to WCN6855 hardware/firmware, and
> > firmware uses that vector to fire interrupts to host/qemu. However
> > host IOMMU doesn't know that vector because the real vector is
> > [Address: fee00578  Data: 0000], as a result host blocks that
> > interrupt and reports an error, see below log:
> >
> > [ 1414.206069] DMAR: DRHD: handling fault status reg 2
> > [ 1414.206081] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > request
> > [ 1414.210334] DMAR: DRHD: handling fault status reg 2
> > [ 1414.210342] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > request
> > [ 1414.212496] DMAR: DRHD: handling fault status reg 2
> > [ 1414.212503] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > request
> > [ 1414.214600] DMAR: DRHD: handling fault status reg 2
> >
> > While I don't think there is a way for qemu/ath11k to get the real MSI
> > vector from host, I will try to read the vfio code to check further.
> > Before that, to unblock you, a possible hack is to hard code the MSI
> > vector in qemu to the same as in host, on condition that the MSI
> > vector doesn't change.  
> 
> Baochen, awesome that you were able to debug this further. Now we at
> least know what's the problem.

It's an interesting problem; I don't think we've seen another device
where the driver reads the MSI register in order to program another
hardware entity to match the MSI address and data configuration.

When assigning a device, the host and guest use entirely separate
address spaces for MSI interrupts.  When the guest enables MSI, the
operation is trapped by the VMM and triggers an ioctl to the host to
perform an equivalent configuration.  Generally the physical device
will interrupt within the host where it may be directly attached to KVM
to signal the interrupt, trigger through the VMM, or where
virtualization hardware supports it, the interrupt can directly trigger
the vCPU.   From the VM perspective, the guest address/data pair is used
to signal the interrupt, which is why it makes sense to virtualize the
MSI registers.

Offhand I don't have a good solution for this; the hardware is
essentially imposing a unique requirement for MSI programming in that
the driver needs visibility of the physical MSI address and data.  It's
conceivable that device specific code could either make the physical
address/data pair visible to the VM or trap the firmware programming to
inject the correct physical values.  Is there somewhere other than the
standard MSI capability in config space that the driver could learn the
physical values, ie. somewhere that isn't virtualized?  Thanks,

Alex
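
The trap-and-configure path described above ends in a VFIO_DEVICE_SET_IRQS
ioctl whose argument is a struct vfio_irq_set carrying an eventfd. A sketch
of just the argument marshalling; the constants and layout are mirrored from
my reading of include/uapi/linux/vfio.h and should be checked against real
kernel headers, and no ioctl is actually issued since there is no device fd
here:

```python
import struct

# Assumed values, mirrored from include/uapi/linux/vfio.h.
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5
VFIO_PCI_MSI_IRQ_INDEX = 1

def vfio_irq_set_eventfd(eventfd, start=0, count=1):
    """Build the vfio_irq_set blob that routes MSI vector `start` to
    `eventfd` (layout: five __u32 header fields + one __s32 per vector)."""
    flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER
    argsz = 5 * 4 + 4 * count
    return struct.pack("<5Ii", argsz, flags, VFIO_PCI_MSI_IRQ_INDEX,
                       start, count, eventfd)

blob = vfio_irq_set_eventfd(eventfd=42)
print(len(blob))  # 24 bytes: header plus one eventfd
```

The point of the sketch is that nothing in this blob carries the guest's
address/data pair; the host picks its own MSI programming independently.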



* Re: ath11k and vfio-pci support
  2024-01-15 17:46                           ` Alex Williamson
@ 2024-01-16 10:08                             ` Baochen Qiang
  2024-01-16 10:41                               ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Baochen Qiang @ 2024-01-16 10:08 UTC (permalink / raw)
  To: Alex Williamson, Kalle Valo
  Cc: James Prestwood, linux-wireless, ath11k, David Woodhouse, iommu



On 1/16/2024 1:46 AM, Alex Williamson wrote:
> On Sun, 14 Jan 2024 16:36:02 +0200
> Kalle Valo <kvalo@kernel.org> wrote:
> 
>> Baochen Qiang <quic_bqiang@quicinc.com> writes:
>>
>>>>> Strange that still fails. Are you now seeing this error in your
>>>>> host or your Qemu? or both?
>>>>> Could you share your test steps? And if you can share please be as
>>>>> detailed as possible since I'm not familiar with passing WLAN
>>>>> hardware to a VM using vfio-pci.
>>>>
>>>> Just in Qemu, the hardware works fine on my host machine.
>>>> I basically follow this guide to set it up, its written in the
>>>> context of GPUs/libvirt but the host setup is exactly the same. By
>>>> no means do you need to read it all, once you set the vfio-pci.ids
>>>> and see your unclaimed adapter you can stop:
>>>> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
>>>> In short you should be able to set the following host kernel options
>>>> and reboot (assuming your motherboard/hardware is compatible):
>>>> intel_iommu=on iommu=pt vfio-pci.ids=17cb:1103
>>>> Obviously change the device/vendor IDs to whatever ath11k hw you
>>>> have. Once the host is rebooted you should see your wlan adapter as
>>>> UNCLAIMED, showing the driver in use as vfio-pci. If not, its likely
>>>> your motherboard just isn't compatible, the device has to be in its
>>>> own IOMMU group (you could try switching PCI ports if this is the
>>>> case).
>>>> I then build a "kvm_guest.config" kernel with the driver/firmware
>>>> for ath11k and boot into that with the following Qemu options:
>>>> -enable-kvm -device vfio-pci,host=<PCI address>
>>>> If it seems easier you could also utilize IWD's test-runner which
>>>> handles launching the Qemu kernel automatically, detecting any
>>>> vfio-devices and passes them through and mounts some useful host
>>>> folders into the VM. Its actually a very good general purpose tool
>>>> for kernel testing, not just for IWD:
>>>> https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/doc/test-runner.txt
>>>> Once set up you can just run test-runner with a few flags and you'll
>>>> boot into a shell:
>>>> ./tools/test-runner -k <kernel-image> --hw --start /bin/bash
>>>> Please reach out if you have questions, thanks for looking into
>>>> this.
>>>
>>> Thanks for these details. I reproduced this issue by following your guide.
>>>
>>> Seems the root cause is that the MSI vector assigned to WCN6855 in
>>> qemu is different with that in host. In my case the MSI vector in qemu
>>> is [Address: fee00000  Data: 0020] while in host it is [Address:
>>> fee00578 Data: 0000]. So in qemu ath11k configures MSI vector
>>> [Address: fee00000 Data: 0020] to WCN6855 hardware/firmware, and
>>> firmware uses that vector to fire interrupts to host/qemu. However
>>> host IOMMU doesn't know that vector because the real vector is
>>> [Address: fee00578  Data: 0000], as a result host blocks that
>>> interrupt and reports an error, see below log:
>>>
>>> [ 1414.206069] DMAR: DRHD: handling fault status reg 2
>>> [ 1414.206081] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
>>> 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
>>> request
>>> [ 1414.210334] DMAR: DRHD: handling fault status reg 2
>>> [ 1414.210342] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
>>> 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
>>> request
>>> [ 1414.212496] DMAR: DRHD: handling fault status reg 2
>>> [ 1414.212503] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
>>> 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
>>> request
>>> [ 1414.214600] DMAR: DRHD: handling fault status reg 2
>>>
>>> While I don't think there is a way for qemu/ath11k to get the real MSI
>>> vector from host, I will try to read the vfio code to check further.
>>> Before that, to unblock you, a possible hack is to hard code the MSI
>>> vector in qemu to the same as in host, on condition that the MSI
>>> vector doesn't change.
>>
>> Baochen, awesome that you were able to debug this further. Now we at
>> least know what's the problem.
> 
> It's an interesting problem, I don't think we've seen another device
> where the driver reads the MSI register in order to program another
> hardware entity to match the MSI address and data configuration.
> 
> When assigning a device, the host and guest use entirely separate
> address spaces for MSI interrupts.  When the guest enables MSI, the
> operation is trapped by the VMM and triggers an ioctl to the host to
> perform an equivalent configuration.  Generally the physical device
> will interrupt within the host where it may be directly attached to KVM
> to signal the interrupt, trigger through the VMM, or where
> virtualization hardware supports it, the interrupt can directly trigger
> the vCPU.   From the VM perspective, the guest address/data pair is used
> to signal the interrupt, which is why it makes sense to virtualize the
> MSI registers.
Hi Alex, could you elaborate a bit more? Why is virtualizing the MSI
registers necessary from the VM's perspective?

And, maybe a stupid question: is it possible for VM/KVM or vfio to
virtualize only writes to the MSI registers and leave reads
un-virtualized? I am asking because that way ath11k could read the real
vector and might get a chance to run in a VM.

> 
> Off hand I don't have a good solution for this, the hardware is
> essentially imposing a unique requirement for MSI programming that the
> driver needs visibility of the physical MSI address and data.  It's
> conceivable that device specific code could either make the physical
> address/data pair visible to the VM or trap the firmware programming to
> inject the correct physical values.  Is there somewhere other than the
> standard MSI capability in config space that the driver could learn the
> physical values, ie. somewhere that isn't virtualized?  Thanks,
I don't think we have such a capability in configuration space.

> 
> Alex
> 


* Re: ath11k and vfio-pci support
  2024-01-16 10:08                             ` Baochen Qiang
@ 2024-01-16 10:41                               ` David Woodhouse
  2024-01-16 15:29                                 ` Jason Gunthorpe
                                                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: David Woodhouse @ 2024-01-16 10:41 UTC (permalink / raw)
  To: Baochen Qiang, Alex Williamson, Kalle Valo
  Cc: James Prestwood, linux-wireless, ath11k, iommu

On Tue, 2024-01-16 at 18:08 +0800, Baochen Qiang wrote:
> 
> 
> On 1/16/2024 1:46 AM, Alex Williamson wrote:
> > On Sun, 14 Jan 2024 16:36:02 +0200
> > Kalle Valo <kvalo@kernel.org> wrote:
> > 
> > > Baochen Qiang <quic_bqiang@quicinc.com> writes:
> > > 
> > > > > > Strange that still fails. Are you now seeing this error in your
> > > > > > host or your Qemu? or both?
> > > > > > Could you share your test steps? And if you can share please be as
> > > > > > detailed as possible since I'm not familiar with passing WLAN
> > > > > > hardware to a VM using vfio-pci.
> > > > > 
> > > > > Just in Qemu, the hardware works fine on my host machine.
> > > > > I basically follow this guide to set it up, its written in the
> > > > > context of GPUs/libvirt but the host setup is exactly the same. By
> > > > > no means do you need to read it all, once you set the vfio-pci.ids
> > > > > and see your unclaimed adapter you can stop:
> > > > > https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> > > > > In short you should be able to set the following host kernel options
> > > > > and reboot (assuming your motherboard/hardware is compatible):
> > > > > intel_iommu=on iommu=pt vfio-pci.ids=17cb:1103
> > > > > Obviously change the device/vendor IDs to whatever ath11k hw you
> > > > > have. Once the host is rebooted you should see your wlan adapter as
> > > > > UNCLAIMED, showing the driver in use as vfio-pci. If not, its likely
> > > > > your motherboard just isn't compatible, the device has to be in its
> > > > > own IOMMU group (you could try switching PCI ports if this is the
> > > > > case).
> > > > > I then build a "kvm_guest.config" kernel with the driver/firmware
> > > > > for ath11k and boot into that with the following Qemu options:
> > > > > -enable-kvm -device vfio-pci,host=<PCI address>
> > > > > If it seems easier you could also utilize IWD's test-runner which
> > > > > handles launching the Qemu kernel automatically, detecting any
> > > > > vfio-devices and passes them through and mounts some useful host
> > > > > folders into the VM. Its actually a very good general purpose tool
> > > > > for kernel testing, not just for IWD:
> > > > > https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/doc/test-runner.txt
> > > > > Once set up you can just run test-runner with a few flags and you'll
> > > > > boot into a shell:
> > > > > ./tools/test-runner -k <kernel-image> --hw --start /bin/bash
> > > > > Please reach out if you have questions, thanks for looking into
> > > > > this.
> > > > 
> > > > Thanks for these details. I reproduced this issue by following your guide.
> > > > 
> > > > Seems the root cause is that the MSI vector assigned to WCN6855 in
> > > > qemu is different with that in host. In my case the MSI vector in qemu
> > > > is [Address: fee00000  Data: 0020] while in host it is [Address:
> > > > fee00578 Data: 0000]. So in qemu ath11k configures MSI vector
> > > > [Address: fee00000 Data: 0020] to WCN6855 hardware/firmware, and
> > > > firmware uses that vector to fire interrupts to host/qemu. However
> > > > host IOMMU doesn't know that vector because the real vector is
> > > > [Address: fee00578  Data: 0000], as a result host blocks that
> > > > interrupt and reports an error, see below log:
> > > > 
> > > > [ 1414.206069] DMAR: DRHD: handling fault status reg 2
> > > > [ 1414.206081] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > > > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > > > request
> > > > [ 1414.210334] DMAR: DRHD: handling fault status reg 2
> > > > [ 1414.210342] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > > > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > > > request
> > > > [ 1414.212496] DMAR: DRHD: handling fault status reg 2
> > > > [ 1414.212503] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > > > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > > > request
> > > > [ 1414.214600] DMAR: DRHD: handling fault status reg 2
> > > > 
> > > > While I don't think there is a way for qemu/ath11k to get the real MSI
> > > > vector from host, I will try to read the vfio code to check further.
> > > > Before that, to unblock you, a possible hack is to hard code the MSI
> > > > vector in qemu to the same as in host, on condition that the MSI
> > > > vector doesn't change.
> > > 
> > > Baochen, awesome that you were able to debug this further. Now we at
> > > least know what's the problem.
> > 
> > It's an interesting problem, I don't think we've seen another device
> > where the driver reads the MSI register in order to program another
> > hardware entity to match the MSI address and data configuration.
> > 
> > When assigning a device, the host and guest use entirely separate
> > address spaces for MSI interrupts.  When the guest enables MSI, the
> > operation is trapped by the VMM and triggers an ioctl to the host to
> > perform an equivalent configuration.  Generally the physical device
> > will interrupt within the host where it may be directly attached to KVM
> > to signal the interrupt, trigger through the VMM, or where
> > virtualization hardware supports it, the interrupt can directly trigger
> > the vCPU.   From the VM perspective, the guest address/data pair is used
> > to signal the interrupt, which is why it makes sense to virtualize the
> > MSI registers.
>
> Hi Alex, could you help elaborate more? why from the VM perspective MSI 
> virtualization is necessary?

An MSI is just a write to physical memory space. You can even use it
like that; configure the device to just write 4 bytes to some address
in a struct in memory to show that it needs attention, and you then
poll that memory.

But mostly we don't (ab)use it like that, of course. We tell the device
to write to a special range of the physical address space where the
interrupt controller lives — the range from 0xfee00000 to 0xfeefffff.
The low 20 bits of the address, and the 32 bits of data written to that
address, tell the interrupt controller which CPU to interrupt, and
which vector to raise on the CPU (as well as some other details and
weird interrupt modes which are theoretically encodable).

So in your example, the guest writes [Address: fee00000  Data: 0020]
which means it wants vector 0x20 on CPU#0 (well, the CPU with APICID
0). But that's what the *guest* wants. If we just blindly programmed
that into the hardware, the hardware would deliver vector 0x20 to the
host's CPU0... which would be very confused by it.
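
The Compatibility Format fields can be pulled apart mechanically; a sketch
of the decode (field positions per the usual x86 MSI address/data layout as
I recall it from the Intel SDM; worth double-checking there before relying
on it):

```python
def decode_compat_msi(addr, data):
    """Decode a Compatibility Format MSI: the low 20 address bits select
    the target CPU, the data word carries the vector and delivery mode."""
    assert addr >> 20 == 0xFEE, "not in the interrupt-controller range"
    return {
        "dest_apic_id": (addr >> 12) & 0xFF,  # only 8 bits without remapping
        "redir_hint": (addr >> 3) & 1,
        "dest_mode": (addr >> 2) & 1,
        "vector": data & 0xFF,
        "delivery_mode": (data >> 8) & 0x7,
    }

# The guest-programmed pair from this thread: vector 0x20 on APIC ID 0.
print(decode_compat_msi(0xFEE00000, 0x0020))
```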

The host has a driver for that device, probably the VFIO driver. The
host registers its own interrupt handlers for the real hardware,
decides which *host* CPU (and vector) should be notified when something
happens. And when that happens, the VFIO driver will raise an event on
an eventfd, which will notify QEMU to inject the appropriate interrupt
into the guest.

So... when the guest enables the MSI, that's trapped by QEMU which
remembers which *guest* CPU/vector the interrupt should go to. QEMU
tells VFIO to enable the corresponding interrupt, and what gets
programmed into the actual hardware is up to the *host* operating
system; nothing to do with the guest's information at all.

Then when the actual hardware raises the interrupt, the VFIO interrupt
handler runs in the host, signals an event on the eventfd, and QEMU
receives that and injects the event into the appropriate guest vCPU.

(In practice QEMU doesn't do it these days; there's actually a shortcut
which improves latency by allowing the kernel to deliver the event to
the guest directly, connecting the eventfd directly to the KVM irq
routing table.)


Interrupt remapping is probably not important here, but I'll explain it
briefly anyway. With interrupt remapping, the IOMMU handles the
'memory' write from the device, just as it handles all other memory
transactions. One of the reasons for interrupt remapping is that the
original definitions of the bits in the MSI (the low 20 bits of the
address and the 32 bits of what's written) only had 8 bits for the
target CPU APICID. And we have bigger systems than that now.

So by using one of the spare bits in the MSI message, we can indicate
that this isn't just a directly-encoded cpu/vector in "Compatibility
Format", but is a "Remappable Format" interrupt. Instead of the
cpu/vector it just contains an index in to the IOMMU's Interrupt
Redirection Table. Which *does* have a full 32-bits for the target APIC
ID. That's why x2apic support (which gives us support for >254 CPUs)
depends on interrupt remapping. 
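
That spare-bit encoding can be sketched as well. The bit positions below
are my reading of the VT-d spec's Remappable Format (address bit 4 = format,
bits 19:5 = handle[14:0], bit 3 = subhandle-valid, bit 2 = handle[15]) and
should be verified against the spec before use:

```python
def decode_remappable_msi(addr):
    """Split a VT-d Remappable Format MSI address into its fields,
    or return None for a plain Compatibility Format address."""
    assert addr >> 20 == 0xFEE
    if not (addr >> 4) & 1:
        return None  # Compatibility Format: no IRT handle encoded
    handle = ((addr >> 5) & 0x7FFF) | (((addr >> 2) & 1) << 15)
    return {"handle": handle, "shv": (addr >> 3) & 1}

# Host-programmed address from this thread: Remappable Format, so the
# cpu/vector live in the IOMMU's Interrupt Redirection Table, not here.
print(decode_remappable_msi(0xFEE00578))
```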

The other thing that the IOMMU can do in modern systems is *posted*
interrupts. Where the entry in the IOMMU's IRT doesn't just specify the
host's CPU/vector, but actually specifies a *vCPU* to deliver the
interrupt to. 

All of which is mostly irrelevant as it's just another bypass
optimisation to improve latency. The key here is that what the guest
writes to its emulated MSI table and what the host writes to the real
hardware are not at all related.

If we had had this posted interrupt support from the beginning, perhaps
we could have had a much simpler model — we just let the guest write
its intended (v)CPU#/vector *directly* to the MSI table in the device,
and let the IOMMU fix it up by having a table pointing to the
appropriate set of vCPUs. But that isn't how it happened. The model we
have is that the VMM has to *emulate* the config space and handle the
interrupts as described above.

This means that whenever a device has a non-standard way of configuring
MSIs, the VMM has to understand and intercept that. I believe we've
even seen some Atheros devices with the MSI target in some weird MMIO
registers instead of the standard location, so we've had to hack QEMU
to handle those too?

> And, maybe a stupid question, is that possible VM/KVM or vfio only 
> virtualize write operation to MSI register but leave read operation 
> un-virtualized? I am asking this because in that way ath11k may get a
> chance to run in VM after getting the real vector.

That might confuse a number of operating systems. Especially if they
mask/unmask by reading the register, flipping the mask bit and writing
back again.

How exactly is the content of this register then given back to the
firmware? Is that communication snoopable by the VMM?


> > 
> > Off hand I don't have a good solution for this, the hardware is
> > essentially imposing a unique requirement for MSI programming that the
> > driver needs visibility of the physical MSI address and data.
> > 

Strictly, the driver doesn't need visibility of the actual values used
by the hardware. Another way of looking at it would be to say that
the driver programs the MSI through this non-standard method, it just
needs the VMM to trap and handle that, just as the VMM does for the
standard MSI table. 

Which is what I thought we'd already seen on some Atheros devices.

> >   It's
> > conceivable that device specific code could either make the physical
> > address/data pair visible to the VM or trap the firmware programming to
> > inject the correct physical values.  Is there somewhere other than the
> > standard MSI capability in config space that the driver could learn the
> > physical values, ie. somewhere that isn't virtualized?  Thanks,
>
> I don't think we have such capability in configuration space.

Configuration space is a complete fiction though; it's all emulated. We
can do anything we like. Or we can have a PV hypercall which will
report it. I don't know that we'd *want* to, but all things are
possible.




* Re: ath11k and vfio-pci support
  2024-01-16 10:41                               ` David Woodhouse
@ 2024-01-16 15:29                                 ` Jason Gunthorpe
  2024-01-16 18:28                                 ` Alex Williamson
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2024-01-16 15:29 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Baochen Qiang, Alex Williamson, Kalle Valo, James Prestwood,
	linux-wireless, ath11k, iommu

On Tue, Jan 16, 2024 at 11:41:19AM +0100, David Woodhouse wrote:

> If we had had this posted interrupt support from the beginning, perhaps
> we could have had a much simpler model — we just let the guest write
> its intended (v)CPU#/vector *directly* to the MSI table in the device,
> and let the IOMMU fix it up by having a table pointing to the
> appropriate set of vCPUs. But that isn't how it happened. The model we
> have is that the VMM has to *emulate* the config space and handle the
> interrupts as described above.

I do have a strong desire to rework things to be more like this, just
not time yet :)

We have enough real problems related to the fake interrupt data in
the guest.

This ath11k thing sounds more like IMS really - it makes zero sense
that a device would be designed where the MSI vector has to be copied
to another location - most likely the other location is another
interrupt source that can be programmed independently, with its own
irqchip, etc? Linux supports this now. Thomas and Intel did it to
support SIOV IMS.

Are you sure you have implemented your Linux driver correctly? :)

Of course IMS doesn't work in VMs, but that is a big motivation to fix
the irq organizing. At least you'd know why the device is broken :)

Jason



* Re: ath11k and vfio-pci support
  2024-01-16 10:41                               ` David Woodhouse
  2024-01-16 15:29                                 ` Jason Gunthorpe
@ 2024-01-16 18:28                                 ` Alex Williamson
  2024-01-16 21:10                                   ` Jeff Johnson
  2024-01-17  5:47                                 ` Baochen Qiang
  2024-03-21 19:14                                 ` Johannes Berg
  3 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-01-16 18:28 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Baochen Qiang, Kalle Valo, James Prestwood, linux-wireless,
	ath11k, iommu

On Tue, 16 Jan 2024 11:41:19 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2024-01-16 at 18:08 +0800, Baochen Qiang wrote:
> > 
> > 
> > On 1/16/2024 1:46 AM, Alex Williamson wrote:  
> > > On Sun, 14 Jan 2024 16:36:02 +0200
> > > Kalle Valo <kvalo@kernel.org> wrote:
> > >   
> > > > Baochen Qiang <quic_bqiang@quicinc.com> writes:
> > > >   
> > > > > > > Strange that still fails. Are you now seeing this error in your
> > > > > > > host or your Qemu? or both?
> > > > > > > Could you share your test steps? And if you can share please be as
> > > > > > > detailed as possible since I'm not familiar with passing WLAN
> > > > > > > hardware to a VM using vfio-pci.  
> > > > > > 
> > > > > > Just in Qemu, the hardware works fine on my host machine.
> > > > > > I basically follow this guide to set it up; it's written in the
> > > > > > context of GPUs/libvirt but the host setup is exactly the same. By
> > > > > > no means do you need to read it all, once you set the vfio-pci.ids
> > > > > > and see your unclaimed adapter you can stop:
> > > > > > https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> > > > > > In short you should be able to set the following host kernel options
> > > > > > and reboot (assuming your motherboard/hardware is compatible):
> > > > > > intel_iommu=on iommu=pt vfio-pci.ids=17cb:1103
> > > > > > Obviously change the device/vendor IDs to whatever ath11k hw you
> > > > > > have. Once the host is rebooted you should see your wlan adapter as
> > > > > > UNCLAIMED, showing the driver in use as vfio-pci. If not, it's likely
> > > > > > your motherboard just isn't compatible, the device has to be in its
> > > > > > own IOMMU group (you could try switching PCI ports if this is the
> > > > > > case).
> > > > > > I then build a "kvm_guest.config" kernel with the driver/firmware
> > > > > > for ath11k and boot into that with the following Qemu options:
> > > > > > -enable-kvm -device vfio-pci,host=<PCI address>
> > > > > > If it seems easier you could also utilize IWD's test-runner which
> > > > > > handles launching the Qemu kernel automatically, detecting any
> > > > > > vfio-devices and passes them through and mounts some useful host
> > > > > > folders into the VM. It's actually a very good general-purpose tool
> > > > > > for kernel testing, not just for IWD:
> > > > > > https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/doc/test-runner.txt
> > > > > > Once set up you can just run test-runner with a few flags and you'll
> > > > > > boot into a shell:
> > > > > > ./tools/test-runner -k <kernel-image> --hw --start /bin/bash
> > > > > > Please reach out if you have questions, thanks for looking into
> > > > > > this.  
> > > > > 
> > > > > Thanks for these details. I reproduced this issue by following your guide.
> > > > > 
> > > > > It seems the root cause is that the MSI vector assigned to
> > > > > WCN6855 in qemu differs from that in the host. In my case the
> > > > > MSI vector in qemu is [Address: fee00000  Data: 0020] while in
> > > > > the host it is [Address: fee00578  Data: 0000]. So in qemu
> > > > > ath11k configures MSI vector [Address: fee00000  Data: 0020]
> > > > > into WCN6855 hardware/firmware, and the firmware uses that
> > > > > vector to fire interrupts at the host/qemu. However the host
> > > > > IOMMU doesn't know that vector, because the real vector is
> > > > > [Address: fee00578  Data: 0000]; as a result the host blocks
> > > > > that interrupt and reports an error, see the log below:
> > > > > 
> > > > > [ 1414.206069] DMAR: DRHD: handling fault status reg 2
> > > > > [ 1414.206081] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > > > > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > > > > request
> > > > > [ 1414.210334] DMAR: DRHD: handling fault status reg 2
> > > > > [ 1414.210342] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > > > > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > > > > request
> > > > > [ 1414.212496] DMAR: DRHD: handling fault status reg 2
> > > > > [ 1414.212503] DMAR: [INTR-REMAP] Request device [02:00.0] fault index
> > > > > 0x0 [fault reason 0x25] Blocked a compatibility format interrupt
> > > > > request
> > > > > [ 1414.214600] DMAR: DRHD: handling fault status reg 2
> > > > > 
> > > > > While I don't think there is a way for qemu/ath11k to get the real MSI
> > > > > vector from the host, I will try to read the vfio code to check further.
> > > > > Before that, to unblock you, a possible hack is to hard-code the MSI
> > > > > vector in qemu to the same value as in the host, on the condition that
> > > > > the MSI vector doesn't change.
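The mismatch described above can be sketched in a few lines (an illustrative toy, not code from the thread): the device fires whatever message the guest programmed into it, but with remapping enforced the host IOMMU only honours the message the host actually set up.

```python
# Toy illustration (mine, not from the thread) of the failure mode:
# the device fires the message the guest programmed into it, but the
# host IOMMU, with remapping enforced, only honours the message the
# host actually set up, so the interrupt is blocked with a DMAR fault.
guest_programmed = (0xfee00000, 0x0020)  # what ath11k in the VM wrote
host_programmed = (0xfee00578, 0x0000)   # what the host really set up

def iommu_accepts(msi, known):
    # Stand-in for the check behind "Blocked a compatibility format
    # interrupt request" in the DMAR log above.
    return msi == known

print(iommu_accepts(guest_programmed, host_programmed))  # False
print(iommu_accepts(host_programmed, host_programmed))   # True
```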
> > > > 
> > > > Baochen, awesome that you were able to debug this further. Now we at
> > > > least know what's the problem.  
> > > 
> > > It's an interesting problem, I don't think we've seen another device
> > > where the driver reads the MSI register in order to program another
> > > hardware entity to match the MSI address and data configuration.
> > > 
> > > When assigning a device, the host and guest use entirely separate
> > > address spaces for MSI interrupts.  When the guest enables MSI, the
> > > operation is trapped by the VMM and triggers an ioctl to the host to
> > > perform an equivalent configuration.  Generally the physical device
> > > will interrupt within the host where it may be directly attached to KVM
> > > to signal the interrupt, trigger through the VMM, or where
> > > virtualization hardware supports it, the interrupt can directly trigger
> > > the vCPU.   From the VM perspective, the guest address/data pair is used
> > > to signal the interrupt, which is why it makes sense to virtualize the
> > > MSI registers.  
> >
> > Hi Alex, could you elaborate a bit more? Why is MSI virtualization
> > necessary from the VM's perspective?  
> 
> An MSI is just a write to physical memory space. You can even use it
> like that; configure the device to just write 4 bytes to some address
> in a struct in memory to show that it needs attention, and you then
> poll that memory.
> 
> But mostly we don't (ab)use it like that, of course. We tell the device
> to write to a special range of the physical address space where the
> interrupt controller lives — the range from 0xfee00000 to 0xfeefffff.
> The low 20 bits of the address, and the 32 bits of data written to that
> address, tell the interrupt controller which CPU to interrupt, and
> which vector to raise on the CPU (as well as some other details and
> weird interrupt modes which are theoretically encodable).
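As a worked example of the encoding just described (an illustrative sketch of the "Compatibility Format" layout from the Intel SDM/VT-d spec; the helper and field names are mine):

```python
# Illustrative sketch (mine, not from the thread): decoding an x86
# "Compatibility Format" MSI message. Layout per the Intel SDM and
# VT-d spec; helper names are invented.
def decode_msi(address: int, data: int):
    assert (address >> 20) == 0xfee, "MSI writes land in 0xfeexxxxx"
    # VT-d: address bit 4 set marks a "Remappable Format" message,
    # i.e. an IRTE index rather than a literal cpu/vector.
    remappable = bool(address & (1 << 4))
    dest_apicid = (address >> 12) & 0xff  # only 8 bits in this format
    vector = data & 0xff
    return remappable, dest_apicid, vector

# The guest-programmed value from the thread: vector 0x20 on APIC ID 0.
print(decode_msi(0xfee00000, 0x0020))  # (False, 0, 32)
# The host-programmed value: bit 4 is set, so it is Remappable Format,
# and the real cpu/vector live in the IOMMU's table, not the message.
print(decode_msi(0xfee00578, 0x0000))  # (True, 0, 0)
```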
> 
> So in your example, the guest writes [Address: fee00000  Data: 0020]
> which means it wants vector 0x20 on CPU#0 (well, the CPU with APICID
> 0). But that's what the *guest* wants. If we just blindly programmed
> that into the hardware, the hardware would deliver vector 0x20 to the
> host's CPU0... which would be very confused by it.
> 
> The host has a driver for that device, probably the VFIO driver. The
> host registers its own interrupt handlers for the real hardware,
> decides which *host* CPU (and vector) should be notified when something
> happens. And when that happens, the VFIO driver will raise an event on
> an eventfd, which will notify QEMU to inject the appropriate interrupt
> into the guest.
> 
> So... when the guest enables the MSI, that's trapped by QEMU which
> remembers which *guest* CPU/vector the interrupt should go to. QEMU
> tells VFIO to enable the corresponding interrupt, and what gets
> programmed into the actual hardware is up to the *host* operating
> system; nothing to do with the guest's information at all.
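That trap-and-remap flow can be modelled in miniature (a toy sketch, not QEMU/VFIO code; all names and the fixed host vector are invented for illustration):

```python
# Toy model (mine, not QEMU/VFIO code) of the flow described above:
# the VMM traps the guest's MSI enable, the host picks its own vector
# for the real hardware, and hardware interrupts are forwarded to the
# guest's chosen (apicid, vector) via an eventfd-like path.
class ToyVmm:
    def __init__(self):
        self.guest_route = None   # what the guest asked for
        self.host_vector = None   # what the host really programs

    def guest_enables_msi(self, guest_addr, guest_data):
        # Trapped write: remember the guest's intent, but never program
        # the hardware with it.
        self.guest_route = ((guest_addr >> 12) & 0xff, guest_data & 0xff)
        self.host_vector = self.host_alloc_vector()
        return self.host_vector   # this value goes into the real device

    def host_alloc_vector(self):
        return 0x78               # stand-in for the host's own allocation

    def hardware_interrupt(self, vector):
        # Host handler runs, signals the eventfd; the VMM/KVM injects
        # the guest's (apicid, vector), not the host's.
        if vector == self.host_vector:
            return ("inject", self.guest_route)
        return ("spurious", None)

vmm = ToyVmm()
vmm.guest_enables_msi(0xfee00000, 0x0020)
print(vmm.hardware_interrupt(0x78))  # ('inject', (0, 32))
```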
> 
> Then when the actual hardware raises the interrupt, the VFIO interrupt
> handler runs in the guest, signals an event on the eventfd, and QEMU

s/guest/host/

> receives that and injects the event into the appropriate guest vCPU.
> 
> (In practice QEMU doesn't do it these days; there's actually a shortcut
> which improves latency by allowing the kernel to deliver the event to
> the guest directly, connecting the eventfd directly to the KVM irq
> routing table.)
> 
> 
> Interrupt remapping is probably not important here, but I'll explain it
> briefly anyway. With interrupt remapping, the IOMMU handles the
> 'memory' write from the device, just as it handles all other memory
> transactions. One of the reasons for interrupt remapping is that the
> original definitions of the bits in the MSI (the low 20 bits of the
> address and the 32 bits of what's written) only had 8 bits for the
> target CPU APICID. And we have bigger systems than that now.
> 
> So by using one of the spare bits in the MSI message, we can indicate
> that this isn't just a directly-encoded cpu/vector in "Compatibility
> Format", but is a "Remappable Format" interrupt. Instead of the
> cpu/vector it just contains an index into the IOMMU's Interrupt
> Redirection Table. Which *does* have a full 32 bits for the target APIC
> ID. That's why x2apic support (which gives us support for >254 CPUs)
> depends on interrupt remapping. 
> 
> The other thing that the IOMMU can do in modern systems is *posted*
> interrupts. Where the entry in the IOMMU's IRT doesn't just specify the
> host's CPU/vector, but actually specifies a *vCPU* to deliver the
> interrupt to. 
> 
> All of which is mostly irrelevant as it's just another bypass
> optimisation to improve latency. The key here is that what the guest
> writes to its emulated MSI table and what the host writes to the real
> hardware are not at all related.
> 
> If we had had this posted interrupt support from the beginning, perhaps
> we could have had a much simpler model — we just let the guest write
> its intended (v)CPU#/vector *directly* to the MSI table in the device,
> and let the IOMMU fix it up by having a table pointing to the
> appropriate set of vCPUs. But that isn't how it happened. The model we
> have is that the VMM has to *emulate* the config space and handle the
> interrupts as described above.
> 
> This means that whenever a device has a non-standard way of configuring
> MSIs, the VMM has to understand and intercept that. I believe we've
> even seen some Atheros devices with the MSI target in some weird MMIO
> registers instead of the standard location, so we've had to hack QEMU
> to handle those too?
> 
> > And, maybe a stupid question: is it possible for VM/KVM or vfio to
> > virtualize only write operations to the MSI register but leave read
> > operations un-virtualized? I am asking because that way ath11k may get
> > a chance to run in a VM after reading the real vector.  
> 
> That might confuse a number of operating systems. Especially if they
> mask/unmask by reading the register, flipping the mask bit and writing
> back again.
> 
> How exactly is the content of this register then given back to the
> firmware? Is that communication snoopable by the VMM?
> 
> 
> > > 
> > > Off hand I don't have a good solution for this, the hardware is
> > > essentially imposing a unique requirement for MSI programming that the
> > > driver needs visibility of the physical MSI address and data.
> > >   
> 
> Strictly, the driver doesn't need visibility to the actual values used
> by the hardware. Another way of looking at it would be to say that
> the driver programs the MSI through this non-standard method, it just
> needs the VMM to trap and handle that, just as the VMM does for the
> standard MSI table. 
> 
> Which is what I thought we'd already seen on some Atheros devices.
> 
> > >   It's
> > > conceivable that device specific code could either make the physical
> > > address/data pair visible to the VM or trap the firmware programming to
> > > inject the correct physical values.  Is there somewhere other than the
> > > standard MSI capability in config space that the driver could learn the
> > > physical values, ie. somewhere that isn't virtualized?  Thanks,  
> >
> > I don't think we have such capability in configuration space.  
> 
> Configuration space is a complete fiction though; it's all emulated. We
> can do anything we like. Or we can have a PV hypercall which will
> report it. I don't know that we'd *want* to, but all things are
> possible.

RTL8169 has a back door to the MSI-X vector table, maybe that's the one
you're thinking of.  Alternate methods for the driver to access config
space are common on GPUs, presumably because they require extensive
vBIOS support, and IO port and MMIO windows through which pre-boot code
can interact with config space are faster and easier than standard
config accesses.  Much of the work of assigning a GPU to a VM is to
wrap those alternate methods in virtualization to keep the driver
working within the guest address space.

The fictitious config space was my thought too: an ath11k vfio-pci
variant driver could insert a vendor-defined capability into config
space to expose the physical MSI address/data.  The driver would know,
by the presence of the capability, that it's running in a VM and should
prefer that mechanism to retrieve the MSI address and data.
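A rough sketch of what such a capability could look like (entirely hypothetical: the payload layout, offsets, and helper names below are invented for illustration; only the vendor-specific capability ID 0x09 and the capabilities pointer at config offset 0x34 are standard PCI):

```python
# Hypothetical sketch (layout and names are mine; no such capability is
# standardized) of the variant-driver idea above: expose the host MSI
# address/data via a PCI vendor-specific capability (cap ID 0x09) that
# the guest driver finds by walking the config-space capability list.
import struct

CAP_ID_VENDOR = 0x09

def find_vendor_cap(cfg: bytes):
    ptr = cfg[0x34]                      # standard capabilities pointer
    while ptr:
        cap_id, nxt = cfg[ptr], cfg[ptr + 1]
        if cap_id == CAP_ID_VENDOR:
            return ptr
        ptr = nxt
    return None

def read_host_msi(cfg: bytes):
    ptr = find_vendor_cap(cfg)
    if ptr is None:
        return None                      # bare metal: fall back to MSI cap
    # Hypothetical payload: u8 id, u8 next, u8 len, u8 rsvd, u64 addr, u32 data
    _, _, _, _, addr, data = struct.unpack_from("<BBBBQI", cfg, ptr)
    return addr, data

# Fake 256-byte config space with one vendor cap at offset 0x40.
cfg = bytearray(256)
cfg[0x34] = 0x40
struct.pack_into("<BBBBQI", cfg, 0x40, CAP_ID_VENDOR, 0, 16, 0,
                 0xfee00578, 0x0000)
res = read_host_msi(bytes(cfg))
print(hex(res[0]), res[1])  # 0xfee00578 0
```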

Alternatively, as also suggested here, if programming of the firmware
with the MSI address/data is something that a hypervisor could trap,
then we might be able to make it transparent to the guest.  For example
if it were programmed via MMIO, the guest address/data values could be
auto-magically replaced with physical values.  Since QEMU doesn't know
the physical values, this would also likely be through a device
specific extension to vfio-pci through a variant driver, or maybe some
combination of variant driver and QEMU if we need to make trapping
conditional in order to avoid a performance penalty.

This is essentially device specific interrupt programming, which either
needs to be virtualized (performed by the VMM) or paravirtualized
(performed in cooperation with the guest).  This is also something to
keep in mind relative to the initial source of this issue, ie. testing
device drivers and hardware under device assignment.  There can be
subtle differences.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: ath11k and vfio-pci support
  2024-01-16 18:28                                 ` Alex Williamson
@ 2024-01-16 21:10                                   ` Jeff Johnson
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff Johnson @ 2024-01-16 21:10 UTC (permalink / raw)
  To: Alex Williamson, David Woodhouse
  Cc: Baochen Qiang, Kalle Valo, James Prestwood, linux-wireless,
	ath11k, iommu, kernel@quicinc.com

On 1/16/2024 10:28 AM, Alex Williamson wrote:
> [... full quote of Alex's message snipped; it appears verbatim earlier
> in this thread ...]

+ kernel@quicinc.com for added visibility and advice

Full thread:
<https://lore.kernel.org/all/adcb785e-4dc7-4c4a-b341-d53b72e13467@gmail.com/>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: ath11k and vfio-pci support
  2024-01-16 10:41                               ` David Woodhouse
  2024-01-16 15:29                                 ` Jason Gunthorpe
  2024-01-16 18:28                                 ` Alex Williamson
@ 2024-01-17  5:47                                 ` Baochen Qiang
  2024-03-21 19:14                                 ` Johannes Berg
  3 siblings, 0 replies; 29+ messages in thread
From: Baochen Qiang @ 2024-01-17  5:47 UTC (permalink / raw)
  To: David Woodhouse, Alex Williamson, Kalle Valo
  Cc: James Prestwood, linux-wireless, ath11k, iommu



On 1/16/2024 6:41 PM, David Woodhouse wrote:
> On Tue, 2024-01-16 at 18:08 +0800, Baochen Qiang wrote:
>>
>>
>> On 1/16/2024 1:46 AM, Alex Williamson wrote:
>>> On Sun, 14 Jan 2024 16:36:02 +0200
>>> Kalle Valo <kvalo@kernel.org> wrote:
>>>
>>>> Baochen Qiang <quic_bqiang@quicinc.com> writes:
>>>>
>>>>> [... test setup details and DMAR fault log snipped; quoted verbatim
>>>>> earlier in this thread ...]
>>>>>
>>>>> While I don't think there is a way for qemu/ath11k to get the real MSI
>>>>> vector from host, I will try to read the vfio code to check further.
>>>>> Before that, to unblock you, a possible hack is to hard-code the MSI
>>>>> vector in qemu to the same value as on the host, provided that the
>>>>> host MSI vector doesn't change.
>>>>
>>>> Baochen, awesome that you were able to debug this further. Now we at
>>>> least know what's the problem.
>>>
>>> It's an interesting problem, I don't think we've seen another device
>>> where the driver reads the MSI register in order to program another
>>> hardware entity to match the MSI address and data configuration.
>>>
>>> When assigning a device, the host and guest use entirely separate
>>> address spaces for MSI interrupts.  When the guest enables MSI, the
>>> operation is trapped by the VMM and triggers an ioctl to the host to
>>> perform an equivalent configuration.  Generally the physical device
>>> will interrupt within the host where it may be directly attached to KVM
>>> to signal the interrupt, trigger through the VMM, or where
>>> virtualization hardware supports it, the interrupt can directly trigger
>>> the vCPU.   From the VM perspective, the guest address/data pair is used
>>> to signal the interrupt, which is why it makes sense to virtualize the
>>> MSI registers.
>>
>> Hi Alex, could you elaborate a bit more? Why is MSI virtualization
>> necessary from the VM perspective?
> 
> An MSI is just a write to physical memory space. You can even use it
> like that; configure the device to just write 4 bytes to some address
> in a struct in memory to show that it needs attention, and you then
> poll that memory.
> 
> But mostly we don't (ab)use it like that, of course. We tell the device
> to write to a special range of the physical address space where the
> interrupt controller lives — the range from 0xfee00000 to 0xfeefffff.
> The low 20 bits of the address, and the 32 bits of data written to that
> address, tell the interrupt controller which CPU to interrupt, and
> which vector to raise on the CPU (as well as some other details and
> weird interrupt modes which are theoretically encodable).
> 
> So in your example, the guest writes [Address: fee00000  Data: 0020]
> which means it wants vector 0x20 on CPU#0 (well, the CPU with APICID
> 0). But that's what the *guest* wants. If we just blindly programmed
> that into the hardware, the hardware would deliver vector 0x20 to the
> host's CPU0... which would be very confused by it.
> 
> The host has a driver for that device, probably the VFIO driver. The
> host registers its own interrupt handlers for the real hardware,
> decides which *host* CPU (and vector) should be notified when something
> happens. And when that happens, the VFIO driver will raise an event on
> an eventfd, which will notify QEMU to inject the appropriate interrupt
> into the guest.
> 
> So... when the guest enables the MSI, that's trapped by QEMU which
> remembers which *guest* CPU/vector the interrupt should go to. QEMU
> tells VFIO to enable the corresponding interrupt, and what gets
> programmed into the actual hardware is up to the *host* operating
> system; nothing to do with the guest's information at all.
> 
> Then when the actual hardware raises the interrupt, the VFIO interrupt
> handler runs in the host, signals an event on the eventfd, and QEMU
> receives that and injects the event into the appropriate guest vCPU.
> 
> (In practice QEMU doesn't do it these days; there's actually a shortcut
> which improves latency by allowing the kernel to deliver the event to
> the guest directly, connecting the eventfd directly to the KVM irq
> routing table.)
> 
> 
> Interrupt remapping is probably not important here, but I'll explain it
> briefly anyway. With interrupt remapping, the IOMMU handles the
> 'memory' write from the device, just as it handles all other memory
> transactions. One of the reasons for interrupt remapping is that the
> original definitions of the bits in the MSI (the low 20 bits of the
> address and the 32 bits of what's written) only had 8 bits for the
> target CPU APICID. And we have bigger systems than that now.
> 
> So by using one of the spare bits in the MSI message, we can indicate
> that this isn't just a directly-encoded cpu/vector in "Compatibility
> Format", but is a "Remappable Format" interrupt. Instead of the
> cpu/vector it just contains an index in to the IOMMU's Interrupt
> Redirection Table. Which *does* have a full 32-bits for the target APIC
> ID. That's why x2apic support (which gives us support for >254 CPUs)
> depends on interrupt remapping.
> 
> The other thing that the IOMMU can do in modern systems is *posted*
> interrupts. Where the entry in the IOMMU's IRT doesn't just specify the
> host's CPU/vector, but actually specifies a *vCPU* to deliver the
> interrupt to.
> 
> All of which is mostly irrelevant as it's just another bypass
> optimisation to improve latency. The key here is that what the guest
> writes to its emulated MSI table and what the host writes to the real
> hardware are not at all related.
> 
Thanks. A really detailed and clear explanation.

> If we had had this posted interrupt support from the beginning, perhaps
> we could have had a much simpler model — we just let the guest write
> its intended (v)CPU#/vector *directly* to the MSI table in the device,
> and let the IOMMU fix it up by having a table pointing to the
> appropriate set of vCPUs. But that isn't how it happened. The model we
> have is that the VMM has to *emulate* the config space and handle the
> interrupts as described above.
> 
> This means that whenever a device has a non-standard way of configuring
> MSIs, the VMM has to understand and intercept that. I believe we've
> even seen some Atheros devices with the MSI target in some weird MMIO
> registers instead of the standard location, so we've had to hack QEMU
> to handle those too?
> 
>> And, maybe a stupid question: would it be possible for VM/KVM or vfio to
>> virtualize only write operations to the MSI registers but leave read
>> operations un-virtualized? I am asking because that way ath11k might get
>> a chance to run in a VM after reading the real vector.
> 
> That might confuse a number of operating systems. Especially if they
> mask/unmask by reading the register, flipping the mask bit and writing
> back again.
> 
> How exactly is the content of this register then given back to the
> firmware? Is that communication snoopable by the VMM?
By programming it into an MMIO register. It is a non-standard,
device-specific register; I'm not sure whether it is snoopable by the VMM.

> 
> 
>>>
>>> Off hand I don't have a good solution for this, the hardware is
>>> essentially imposing a unique requirement for MSI programming that the
>>> driver needs visibility of the physical MSI address and data.
>>>
> 
> Strictly, the driver doesn't need visibility to the actual values used
> by the hardware. Another way of looking at it would be to say that
> the driver programs the MSI through this non-standard method, it just
> needs the VMM to trap and handle that, just as the VMM does for the
> standard MSI table.
> 
> Which is what I thought we'd already seen on some Atheros devices.
> 
>>>    It's
>>> conceivable that device specific code could either make the physical
>>> address/data pair visible to the VM or trap the firmware programming to
>>> inject the correct physical values.  Is there somewhere other than the
>>> standard MSI capability in config space that the driver could learn the
>>> physical values, ie. somewhere that isn't virtualized?  Thanks,
>>
>> I don't think we have such capability in configuration space.
> 
> Configuration space is a complete fiction though; it's all emulated. We
> can do anything we like. Or we can have a PV hypercall which will
> report it. I don't know that we'd *want* to, but all things are
> possible.
OK, I get the point now.

> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: ath11k and vfio-pci support
  2024-01-16 10:41                               ` David Woodhouse
                                                   ` (2 preceding siblings ...)
  2024-01-17  5:47                                 ` Baochen Qiang
@ 2024-03-21 19:14                                 ` Johannes Berg
  3 siblings, 0 replies; 29+ messages in thread
From: Johannes Berg @ 2024-03-21 19:14 UTC (permalink / raw)
  To: David Woodhouse, Baochen Qiang, Alex Williamson, Kalle Valo
  Cc: Jose Ignacio Tornos Martinez, James Prestwood, linux-wireless,
	ath11k, iommu

Given the "ewww" patch to make a module parameter workaround for this:
https://lore.kernel.org/linux-wireless/20240321172402.346191-1-jtornosm@redhat.com/T/#u


On Tue, 2024-01-16 at 11:41 +0100, David Woodhouse wrote:
> 
> How exactly is the content of this register then given back to the
> firmware? Is that communication snoopable by the VMM?

If I'm reading the code correctly, it's just a write to a register in a
memory mapped region, possibly in multiple locations (different queues
or something). Possibly indirect (see __ath11k_pcic_write32), but
someone would have to trace it or know the HW better to understand which
locations it's written to.

But yeah seems totally feasible to just translate that back in the VMM.

> > > Off hand I don't have a good solution for this, the hardware is
> > > essentially imposing a unique requirement for MSI programming that the
> > > driver needs visibility of the physical MSI address and data.
> > > 
> 
> Strictly, the driver doesn't need visibility to the actual values used
> by the hardware. Another way of looking at it would be to say that
> the driver programs the MSI through this non-standard method, it just
> needs the VMM to trap and handle that, just as the VMM does for the
> standard MSI table. 

Indeed. Much better than having a module parameter.

> Which is what I thought we'd already seen on some Atheros devices.

It probably also affects ath12k, seems similar.

johannes

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
       [not found] <adcb785e-4dc7-4c4a-b341-d53b72e13467@gmail.com>
       [not found] ` <8734v5zhol.fsf@kernel.org>
@ 2024-08-12 16:59 ` Alex Williamson
  2024-08-13 16:30   ` Jason Gunthorpe
  2024-08-12 17:00 ` [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless Alex Williamson
  2 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-08-12 16:59 UTC (permalink / raw)
  To: kvm
  Cc: Alex Williamson, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, jgg, kernel, johannes, jtornosm

vfio-pci has always virtualized the MSI address and data registers as
MSI programming is performed through the SET_IRQS ioctl.  Often this
virtualization is not used, and in specific cases can be unhelpful.

One such case where the virtualization is a hindrance is when the
device contains an onboard interrupt controller programmed by the guest
driver.  Userspace VMMs have a chance to quirk this programming,
injecting the host physical MSI information, but only if the userspace
driver can get access to the host physical address and data registers.

This introduces a device feature which allows the userspace driver to
disable virtualization of the MSI capability address and data registers
in order to provide read-only access to the physical values.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216055
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 26 ++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c   | 21 +++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  1 +
 include/uapi/linux/vfio.h          | 14 ++++++++++++++
 4 files changed, 62 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 97422aafaa7b..5f86e75ea6ca 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1259,6 +1259,32 @@ static int vfio_msi_cap_len(struct vfio_pci_core_device *vdev, u8 pos)
 	return len;
 }
 
+/* Disable virtualization of the MSI address and data fields */
+int vfio_pci_msi_novirt(struct vfio_pci_core_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct perm_bits *perm = vdev->msi_perm;
+	u16 flags;
+	int ret;
+
+	if (!perm)
+		return -EINVAL;
+
+	ret = pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS, &flags);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	p_setd(perm, PCI_MSI_ADDRESS_LO, NO_VIRT, NO_WRITE);
+	if (flags & PCI_MSI_FLAGS_64BIT) {
+		p_setd(perm, PCI_MSI_ADDRESS_HI, NO_VIRT, NO_WRITE);
+		p_setw(perm, PCI_MSI_DATA_64, (u16)NO_VIRT, (u16)NO_WRITE);
+	} else {
+		p_setw(perm, PCI_MSI_DATA_32, (u16)NO_VIRT, (u16)NO_WRITE);
+	}
+
+	return 0;
+}
+
 /* Determine extended capability length for VC (2 & 9) and MFVC */
 static int vfio_vc_cap_len(struct vfio_pci_core_device *vdev, u16 pos)
 {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba0ce0075b2f..acdced212be2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1518,6 +1518,24 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 	return 0;
 }
 
+static int vfio_pci_core_feature_msi_novirt(struct vfio_device *device,
+					    u32 flags, void __user *arg,
+					    size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	if (!vdev->msi_perm)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	return vfio_pci_msi_novirt(vdev);
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1531,6 +1549,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT:
+		return vfio_pci_core_feature_msi_novirt(device, flags,
+							arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5e4fa69aee16..6e6cc74c6579 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -53,6 +53,7 @@ int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
 
 int vfio_pci_init_perm_bits(void);
 void vfio_pci_uninit_perm_bits(void);
+int vfio_pci_msi_novirt(struct vfio_pci_core_device *vdev);
 
 int vfio_config_init(struct vfio_pci_core_device *vdev);
 void vfio_config_free(struct vfio_pci_core_device *vdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..ddf5dd9245fb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1458,6 +1458,20 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Toggle virtualization of PCI MSI address and data fields off.  By default
+ * vfio-pci-core based drivers virtualize the MSI address and data fields of
+ * the MSI capability to emulate direct access to the device, ie. writes are
+ * allowed and buffered where subsequent reads return the buffered data.
+ * VMMs often virtualize these registers anyway and there are cases in user-
+ * space where having access to the host MSI fields can be useful, such as
+ * quirking an embedded interrupt controller on the device to generate physical
+ * MSI interrupts.  Upon VFIO_DEVICE_FEATURE_SET of the PCI_MSI_NOVIRT feature,
+ * this virtualization is disabled: reads of the MSI address and data fields
+ * will return the physical values and writes are dropped.
+ */
+#define VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
       [not found] <adcb785e-4dc7-4c4a-b341-d53b72e13467@gmail.com>
       [not found] ` <8734v5zhol.fsf@kernel.org>
  2024-08-12 16:59 ` [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization Alex Williamson
@ 2024-08-12 17:00 ` Alex Williamson
  2024-08-13 16:43   ` Jason Gunthorpe
  2 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-08-12 17:00 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: Alex Williamson, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, jgg, kernel, johannes, jtornosm

These devices have an embedded interrupt controller which is programmed
with guest physical MSI address/data, which doesn't work.  We need
vfio-pci kernel support to provide a device feature which disables
virtualization of the MSI capability registers.  Then we can do brute
force testing for writes matching the MSI address, from which we can
infer writes of the MSI data, replacing each with host physical values.

This has only been tested on ath11k (0x1103), ath12k support is
speculative and requires testing.  Note that Windows guest drivers make
use of multi-vector MSI which requires interrupt remapping support in
the host.

NB. The #define for the new vfio feature is temporary for RFC/RFT, a
formal proposal will include a proper linux-headers update.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216055
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c | 236 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 239 insertions(+)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 39dae72497e0..5ba37bee3b36 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -23,6 +23,7 @@
 #include "qapi/visitor.h"
 #include <sys/ioctl.h>
 #include "hw/nvram/fw_cfg.h"
+#include "hw/pci/msi.h"
 #include "hw/qdev-properties.h"
 #include "pci.h"
 #include "trace.h"
@@ -1159,6 +1160,240 @@ static void vfio_probe_rtl8168_bar2_quirk(VFIOPCIDevice *vdev, int nr)
     trace_vfio_quirk_rtl8168_probe(vdev->vbasedev.name);
 }
 
+#define PCI_VENDOR_ID_QCOM 0x17cb
+
+/*
+ * ath11k wireless adapters do not support INTx and appear to have an interrupt
+ * controller embedded into the hardware.  By default the interrupt controller
+ * is programmed with the MSI guest physical address, which doesn't work.
+ * Instead we need to trap and insert the host physical address and data.
+ *
+ * By default vfio-pci virtualizes the MSI address and data registers, providing
+ * a writable buffer, where reads simply return the buffer.  QEMU writes
+ * everything through to hardware, so this only holds the guest MSI address.
+ * Therefore we first need to invoke a device feature that disables this
+ * emulation of the MSI address and data registers, allowing access to the host
+ * physical address and data.
+ *
+ * Next, where do these values get written?  If we disable mmap support and
+ * trace accesses, we get this:
+ *
+ * # grep -A2 region0.*0xfee00000 typescript
+ * vfio_region_write  (0000:01:00.0:region0+0x80048, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8004c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x80050, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x83048, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8304c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x83050, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x830a0, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x830a4, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x830a8, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x85048, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8504c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x85050, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x850a0, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x850a4, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x850a8, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x86048, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8604c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x86050, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x8b048, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8b04c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8b050, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x8b0a0, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8b0a4, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8b0a8, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0x8e048, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8e04c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0x8e050, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0xb4968, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb496c, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb4970, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0xb4a70, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb4a74, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb4a78, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0xb849c, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb84a0, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb84a4, 0x21, 4)
+ * --
+ * vfio_region_write  (0000:01:00.0:region0+0xb85a4, 0xfee00000, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb85a8, 0x100, 4)
+ * vfio_region_write  (0000:01:00.0:region0+0xb85ac, 0x21, 4)
+ *
+ * We can find in this example trace that the MSI capability is programmed as:
+ *
+ * vfio_pci_write_config  (0000:01:00.0, @0x54, 0xfee00000, len=0x4)
+ * vfio_pci_write_config  (0000:01:00.0, @0x58, 0x21, len=0x2)
+ *
+ * This is a 32-bit MSI capability based at 0x50, so the MSI address is
+ * 0xfee00000 with data 0x21.  So we see writes of the MSI address, followed
+ * 8 bytes later by what appears to be the MSI data.  There's no obvious
+ * pattern in the device address where these are being written.
+ *
+ * We therefore come up with a really crude quirk that looks for values
+ * written to the ATH11K_PCI_WINDOW (defined in Linux driver as starting at
+ * 0x80000 with an 18-bit mask, ie. 256k) that match the guest MSI address.
+ * When found we replace the data with the host physical address and set a
+ * cookie to match the MSI data write, again replacing with the host value and
+ * clearing the cookie.
+ *
+ * Amazingly, this seems sufficient to work, and the trapped window only seems
+ * to be used during initialization, so this should introduce minimal overhead.
+ *
+ * The Windows driver makes use of multi-vector MSI, where our sanity test
+ * of the MSI data value must then mask off the vector offset for comparison
+ * and add it back to the host base data value on write.
+ *
+ * Only 4- and 8-byte accesses are observed to the PCI window, and MSI
+ * accesses are only observed with 4-byte width.
+ */
+
+// Temporary, include updated Linux headers
+#define VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT 11
+
+typedef struct VFIOQAthQuirk {
+    VFIOPCIDevice *vdev;
+    uint32_t pci_window_base;
+    uint32_t msi_data_addr;
+} VFIOQAthQuirk;
+
+static uint64_t vfio_ath_quirk_read(void *opaque, hwaddr addr, unsigned size)
+{
+    VFIOQAthQuirk *ath = opaque;
+    VFIOPCIDevice *vdev = ath->vdev;
+
+    return vfio_region_read(&vdev->bars[0].region,
+                            addr + ath->pci_window_base, size);
+}
+
+static void vfio_ath_quirk_write(void *opaque, hwaddr addr,
+                                 uint64_t data, unsigned size)
+{
+    VFIOQAthQuirk *ath = opaque;
+    VFIOPCIDevice *vdev = ath->vdev;
+
+    if (size == 4 && msi_enabled(&vdev->pdev)) {
+        uint32_t phys = 0;
+
+        if (data == pci_get_long(vdev->pdev.config +
+                                 vdev->pdev.msi_cap + PCI_MSI_ADDRESS_LO)) {
+            pread(vdev->vbasedev.fd, &phys, 4, vdev->config_offset +
+                  vdev->pdev.msi_cap + PCI_MSI_ADDRESS_LO);
+            trace_vfio_quirk_ath_write_address(vdev->vbasedev.name,
+                                               addr + ath->pci_window_base,
+                                               data, phys);
+            data = phys;
+            ath->msi_data_addr = addr + 8;
+        } else if (ath->msi_data_addr && ath->msi_data_addr == addr) {
+            uint32_t mask = msi_nr_vectors_allocated(&vdev->pdev) - 1;
+            uint32_t nr = data & mask;
+
+            if ((data & ~mask) == pci_get_word(vdev->pdev.config +
+                                               vdev->pdev.msi_cap +
+                                               PCI_MSI_DATA_32)) {
+                pread(vdev->vbasedev.fd, &phys, 2, vdev->config_offset +
+                      vdev->pdev.msi_cap + PCI_MSI_DATA_32);
+                phys += nr;
+                trace_vfio_quirk_ath_write_data(vdev->vbasedev.name,
+                                                addr + ath->pci_window_base,
+                                                data, phys);
+                data = phys;
+            }
+            ath->msi_data_addr = 0;
+        }
+    }
+
+    vfio_region_write(&vdev->bars[0].region, addr + ath->pci_window_base,
+                      data, size);
+}
+
+static const MemoryRegionOps vfio_ath_quirk = {
+    .read = vfio_ath_quirk_read,
+    .write = vfio_ath_quirk_write,
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+        .unaligned = false,
+    },
+    .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static bool vfio_pci_msi_novirt(VFIOPCIDevice *vdev)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+    feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_SET |
+                     VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT;
+
+    return !ioctl(vdev->vbasedev.fd, VFIO_DEVICE_FEATURE, feature);
+}
+
+static void vfio_probe_ath_bar0_quirk(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOQuirk *quirk;
+    VFIOQAthQuirk *ath;
+    uint32_t pci_window_base, pci_window_size;
+
+    if (nr != 0 ||
+        vdev->vendor_id != PCI_VENDOR_ID_QCOM || !msi_present(&vdev->pdev)) {
+        return;
+    }
+
+    switch (vdev->device_id) {
+    case 0x1101: /* Untested */
+    case 0x1103:
+    case 0x1104: /* Untested */
+        /* Devices claimed by ath11k_pci_id_table in Linux driver as of v6.10 */
+        pci_window_base = 0x80000; /* ATH11K_PCI_WINDOW_START */
+        pci_window_size = 0x40000; /* ATH11K_PCI_WINDOW_RANGE_MASK (256k) */
+        break;
+    case 0x1107: /* Untested */
+    case 0x1109: /* Untested */
+        /* Devices claimed by ath12k_pci_id_table in Linux driver as of v6.10 */
+        pci_window_base = 0x1e00000; /* PCI_BAR_WINDOW0_BASE */
+        pci_window_size = 0x80000; /* ~PCI_BAR_WINDOW0_END (512k) */
+        break;
+    default:
+        return;
+    }
+
+    if (!vfio_pci_msi_novirt(vdev)) {
+        warn_report("Found device matching Atheros wireless quirk, but host "
+                    "does not support vfio device feature required for quirk. "
+                    "Device is known not to work with device assignment "
+                    "without this quirk.  Please update host kernel.");
+        return;
+    }
+
+    quirk = vfio_quirk_alloc(1);
+    quirk->data = ath = g_malloc0(sizeof(*ath));
+    ath->vdev = vdev;
+    ath->pci_window_base = pci_window_base;
+
+    memory_region_init_io(&quirk->mem[0], OBJECT(vdev), &vfio_ath_quirk,
+                          ath, "vfio-ath-quirk", pci_window_size);
+    memory_region_add_subregion_overlap(vdev->bars[nr].region.mem,
+                                        pci_window_base, &quirk->mem[0], 1);
+
+    QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
+
+    trace_vfio_quirk_ath_bar0_probe(vdev->vbasedev.name);
+}
+
 #define IGD_ASLS 0xfc /* ASL Storage Register */
 
 /*
@@ -1261,6 +1496,7 @@ void vfio_bar_quirk_setup(VFIOPCIDevice *vdev, int nr)
 #ifdef CONFIG_VFIO_IGD
     vfio_probe_igd_bar4_quirk(vdev, nr);
 #endif
+    vfio_probe_ath_bar0_quirk(vdev, nr);
 }
 
 void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 98bd4dcceadc..3d1154785157 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -82,6 +82,9 @@ vfio_ioeventfd_exit(const char *name, uint64_t addr, unsigned size, uint64_t dat
 vfio_ioeventfd_handler(const char *name, uint64_t addr, unsigned size, uint64_t data) "%s+0x%"PRIx64"[%d] -> 0x%"PRIx64
 vfio_ioeventfd_init(const char *name, uint64_t addr, unsigned size, uint64_t data, bool vfio) "%s+0x%"PRIx64"[%d]:0x%"PRIx64" vfio:%d"
 vfio_pci_igd_opregion_enabled(const char *name) "%s"
+vfio_quirk_ath_bar0_probe(const char *name) "%s"
+vfio_quirk_ath_write_address(const char *name, uint64_t addr, uint64_t data, uint32_t phys) "%s[0x%"PRIx64"]: Replacing MSI address 0x%"PRIx64" with value 0x%x"
+vfio_quirk_ath_write_data(const char *name, uint64_t addr, uint64_t data, uint32_t phys) "%s[0x%"PRIx64"]: Replacing MSI data 0x%"PRIx64" with value 0x%x"
 
 # igd.c
 vfio_pci_igd_bar4_write(const char *name, uint32_t index, uint32_t data, uint32_t base) "%s [0x%03x] 0x%08x -> 0x%08x"
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-12 16:59 ` [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization Alex Williamson
@ 2024-08-13 16:30   ` Jason Gunthorpe
  2024-08-13 17:30     ` Thomas Gleixner
  2024-08-13 21:14     ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 16:30 UTC (permalink / raw)
  To: Alex Williamson, Thomas Gleixner
  Cc: kvm, quic_bqiang, kvalo, prestwoj, linux-wireless, ath11k, dwmw2,
	iommu, kernel, johannes, jtornosm

On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> vfio-pci has always virtualized the MSI address and data registers as
> MSI programming is performed through the SET_IRQS ioctl.  Often this
> virtualization is not used, and in specific cases can be unhelpful.
> 
> One such case where the virtualization is a hindrance is when the
> device contains an onboard interrupt controller programmed by the guest
> driver.  Userspace VMMs have a chance to quirk this programming,
> injecting the host physical MSI information, but only if the userspace
> driver can get access to the host physical address and data registers.
> 
> This introduces a device feature which allows the userspace driver to
> disable virtualization of the MSI capability address and data registers
> in order to provide read-only access to the physical values.

Personally, I very much dislike this. Encouraging such hacky driver
use of the interrupt subsystem is not a good direction. Enabling this
in VMs will further complicate fixing the IRQ usages in these drivers
over the long run.

If the device has its own interrupt sources then the device needs to
create an irq_chip and related and hook them up properly. Not hackily
read the MSI-X registers and write them someplace else.

Thomas Gleixner has done a lot of great work recently to clean this up.

So if you imagine the driver is fixed, then this is not necessary.

However, it will still not work in a VM. Making IMS and non-MSI
interrupt controllers work within VMs is still something that needs to
be done.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2024-08-12 17:00 ` [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless Alex Williamson
@ 2024-08-13 16:43   ` Jason Gunthorpe
  2024-08-13 21:03     ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 16:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Mon, Aug 12, 2024 at 11:00:40AM -0600, Alex Williamson wrote:
> These devices have an embedded interrupt controller which is programmed
> with guest physical MSI address/data, which doesn't work.  We need
> vfio-pci kernel support to provide a device feature which disables
> virtualization of the MSI capability registers.  Then we can do brute
> force testing for writes matching the MSI address, from which we can
> infer writes of the MSI data, replacing each with host physical values.
> 
> This has only been tested on ath11k (0x1103), ath12k support is
> speculative and requires testing.  Note that Windows guest drivers make
> use of multi-vector MSI which requires interrupt remapping support in
> the host.

The way it is really supposed to work, is that the guest itself
controls/knows the MSI addr/data pairs and the interrupt remapping HW
makes that delegation safe since all the interrupt processing will be
qualified by the RID.

Then the guest can make up the unique interrupts for MSI and any
internal "IMS" sources and we just let the guest directly write the
MSI/MSI-X and any IMS values however it wants.

This hackery to capture and substitute the IMS programming is neat and
will solve this one device, but there are more IMS style devices in
the pipeline that will really need a full solution.

> + * The Windows driver makes use of multi-vector MSI, where our sanity test
> + * of the MSI data value must then mask off the vector offset for comparison
> + * and add it back to the host base data value on write.

But is that really enough? If the vector offset is newly created then
that means the VM built a new interrupt that needs setup to be routed
into the VM?? Is that why you say it "requires interrupt remapping
support" because that setup is happening implicitly on x86?

It looks like Windows is acting as I said Linux should, with a
"irq_chip" and so on to get the unique interrupt source a proper
unique addr/data pair...

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-13 16:30   ` Jason Gunthorpe
@ 2024-08-13 17:30     ` Thomas Gleixner
  2024-08-13 23:39       ` Jason Gunthorpe
  2024-12-13  9:10       ` David Woodhouse
  2024-08-13 21:14     ` Alex Williamson
  1 sibling, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2024-08-13 17:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: kvm, quic_bqiang, kvalo, prestwoj, linux-wireless, ath11k, dwmw2,
	iommu, kernel, johannes, jtornosm

On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
>> vfio-pci has always virtualized the MSI address and data registers as
>> MSI programming is performed through the SET_IRQS ioctl.  Often this
>> virtualization is not used, and in specific cases can be unhelpful.
>> 
>> One such case where the virtualization is a hindrance is when the
>> device contains an onboard interrupt controller programmed by the guest
>> driver.  Userspace VMMs have a chance to quirk this programming,
>> injecting the host physical MSI information, but only if the userspace
>> driver can get access to the host physical address and data registers.
>> 
>> This introduces a device feature which allows the userspace driver to
>> disable virtualization of the MSI capability address and data registers
>> in order to provide read-only access to the physical values.
>
> Personally, I very much dislike this. Encouraging such hacky driver
> use of the interrupt subsystem is not a good direction. Enabling this
> in VMs will further complicate fixing the IRQ usages in these drivers
> over the long run.
>
> If the device has its own interrupt sources then the device needs to
> create an irq_chip and related and hook them up properly. Not hackily
> read the MSI-X registers and write them someplace else.
>
> Thomas Gleixner has done a lot of great work recently to clean this up.
>
> So if you imagine the driver is fixed, then this is not necessary.

Yes. I looked at the ath11k driver when I was reworking the PCI/MSI
subsystem and that's a perfect candidate for a proper device specific
interrupt domain to replace the horrible MSI hackery it has.

> However, it will still not work in a VM. Making IMS and non-MSI
> interrupt controllers work within VMs is still something that needs to
> be done.

Sure, but we really want to do that in a generic way and not based on ad
hoc workarounds.

Did the debate around this go anywhere?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2024-08-13 16:43   ` Jason Gunthorpe
@ 2024-08-13 21:03     ` Alex Williamson
  2024-08-13 23:37       ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-08-13 21:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, 13 Aug 2024 13:43:41 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Aug 12, 2024 at 11:00:40AM -0600, Alex Williamson wrote:
> > These devices have an embedded interrupt controller which is programmed
> > with guest physical MSI address/data, which doesn't work.  We need
> > vfio-pci kernel support to provide a device feature which disables
> > virtualization of the MSI capability registers.  Then we can do brute
> > force testing for writes matching the MSI address, from which we can
> > infer writes of the MSI data, replacing each with host physical values.
> > 
> > This has only been tested on ath11k (0x1103), ath12k support is
> > speculative and requires testing.  Note that Windows guest drivers make
> > use of multi-vector MSI which requires interrupt remapping support in
> > the host.  
> 
> The way it is really supposed to work, is that the guest itself
> controls/knows the MSI addr/data pairs and the interrupt remapping HW
> makes that delegation safe since all the interrupt processing will be
> qualified by the RID.
> 
> Then the guest can make up the unique interrupts for MSI and any
> internal "IMS" sources and we just let the guest directly write the
> MSI/MSI-X and any IMS values however it wants.
> 
> This hackery to capture and substitute the IMS programming is neat and
> will solve this one device, but there are more IMS style devices in
> > the pipeline that will really need a full solution.

How does the guest know to write a remappable vector format?  How does
the guest know the host interrupt architecture?  For example why would
an aarch64 guest program an MSI vector of 0xfee... if the host is x86?

The idea of guest owning the physical MSI address space sounds great,
but is it practical?  Is it something that would be accomplished while
this device is still relevant?

> > + * The Windows driver makes use of multi-vector MSI, where our sanity test
> > + * of the MSI data value must then mask off the vector offset for comparison
> > + * and add it back to the host base data value on write.  
> 
> But is that really enough? If the vector offset is newly created then
> that means the VM built a new interrupt that needs setup to be routed
> into the VM?? Is that why you say it "requires interrupt remapping
> support" because that setup is happening implicitly on x86?
> 
> It looks like Windows is acting as I said Linux should, with a
> "irq_chip" and so on to get the unique interrupt source a proper
> unique addr/data pair...

The Windows driver is just programming the MSI capability to use 16
vectors.  We configure those vectors on the host at the time the
capability is written.  Whereas the Linux driver is only using a single
vector and therefore writing the same MSI address and data at the
locations noted in the trace, the Windows driver is writing different
data values at different locations to make use of those vectors.  This
note is simply describing that we can't directly write the physical
data value into the device, we need to determine which vector offset
the guest is using and provide the same offset from the host data
register value.

I don't know that interrupt remapping is specifically required, but the
MSI domain needs to support MSI_FLAG_MULTI_PCI_MSI and AFAIK that's
only available with interrupt remapping on x86, ie.
pci_alloc_irq_vectors() with max_vecs >1 and PCI_IRQ_MSI flags needs to
work on the host to mirror the guest MSI configuration.  Thanks,

Alex
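
[Editorial sketch: the vector-offset fix-up described above can be put
in a few lines of C. The function name and layout here are purely
illustrative, not taken from the actual QEMU quirk.]

```c
#include <assert.h>
#include <stdint.h>

/* With multi-vector MSI the low log2(nvec) bits of the data value
 * select the vector.  Mask them off to compare against the guest's
 * base data value, then re-apply them on top of the host base. */
static uint32_t fixup_msi_data(uint32_t guest_write, uint32_t guest_base,
                               uint32_t host_base, unsigned nvec)
{
    uint32_t mask = nvec - 1;            /* nvec is a power of two */

    if ((guest_write & ~mask) != (guest_base & ~mask))
        return guest_write;              /* not an MSI data value */

    return (host_base & ~mask) | (guest_write & mask);
}
```

For example, with a guest base of 0x4180, a host base of 0x4040 and 16
vectors, a guest write of 0x4185 (vector offset 5) becomes 0x4045.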


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-13 16:30   ` Jason Gunthorpe
  2024-08-13 17:30     ` Thomas Gleixner
@ 2024-08-13 21:14     ` Alex Williamson
  2024-08-13 23:16       ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-08-13 21:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, kvm, quic_bqiang, kvalo, prestwoj,
	linux-wireless, ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, 13 Aug 2024 13:30:53 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > vfio-pci has always virtualized the MSI address and data registers as
> > MSI programming is performed through the SET_IRQS ioctl.  Often this
> > virtualization is not used, and in specific cases can be unhelpful.
> > 
> > One such case where the virtualization is a hindrance is when the
> > device contains an onboard interrupt controller programmed by the guest
> > driver.  Userspace VMMs have a chance to quirk this programming,
> > injecting the host physical MSI information, but only if the userspace
> > driver can get access to the host physical address and data registers.
> > 
> > This introduces a device feature which allows the userspace driver to
> > disable virtualization of the MSI capability address and data registers
> > in order to provide read-only access to the physical values.  
> 
> Personally, I very much dislike this. Encouraging such hacky driver
> use of the interrupt subsystem is not a good direction. Enabling this
> in VMs will further complicate fixing the IRQ usages in these drivers
> over the long run.

Clearly these _guest_ drivers are doing this regardless of the
interfaces provided by vfio, so I don't see how we're encouraging hacky
driver behavior, especially when it comes to Windows guest drivers.

> If the device has its own interrupt sources then the device needs to
> create an irq_chip and related and hook them up properly. Not hackily
> read the MSI-X registers and write them someplace else.

This is how the hardware works, regardless of whether the guest driver
represents the hardware using an irq_chip.

> Thomas Gleixner has done a lot of great work recently to clean this up.
> 
> So if you imagine the driver is fixed, then this is not necessary.

How so?  Regardless of the guest driver structure, something is writing
the MSI address and data values elsewhere in the device.  AFAICT the
only way to avoid needing to fixup those values is to give the guest
ownership of the address space as you suggested in the other patch.
That also seems to have a pile of issues though.

> However, it will still not work in a VM. Making IMS and non-MSI
> interrupt controllers work within VMs is still something that needs to
> be done.

Making it work in a VM is sort of the point here.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-13 21:14     ` Alex Williamson
@ 2024-08-13 23:16       ` Jason Gunthorpe
  2024-08-14 14:55         ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 23:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Thomas Gleixner, kvm, quic_bqiang, kvalo, prestwoj,
	linux-wireless, ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, Aug 13, 2024 at 03:14:01PM -0600, Alex Williamson wrote:

> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
> 
> Clearly these _guest_ drivers are doing this regardless of the
> interfaces provided by vfio, so I don't see how we're encouraging hacky
> driver behavior, especially when it comes to Windows guest drivers.

Because people will then say the Linux driver can't be fixed to
properly use an irq_domain/etc as the only option that works in VMs
will be the hacky copy from MSI-X approach :\

> > Thomas Gleixner has done a lot of great work recently to clean this up.
> > 
> > So if you imagine the driver is fixed, then this is not necessary.
> 
> How so? 

Because if the driver is properly using the new irq_domain/etc
infrastructure to model its additional interrupt source then this
patch won't make it work in the VM anyhow, so it is not necessary..

Your other patch would be the only short term answer.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2024-08-13 21:03     ` Alex Williamson
@ 2024-08-13 23:37       ` Jason Gunthorpe
  2024-08-15 16:59         ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 23:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, Aug 13, 2024 at 03:03:20PM -0600, Alex Williamson wrote:

> How does the guest know to write a remappable vector format?  How does
> the guest know the host interrupt architecture?  For example why would
> an aarch64 guest program an MSI vector of 0xfee... if the host is x86?

All excellent questions.

Emulating real interrupt controllers in the VM is probably impossible
in every scenario. But certainly x86 emulating x86 and ARM emulating
ARM would be usefully achievable.

hyperv did a neat thing where their remapping driver seems to make VMM
traps and looks kind of like the VMM gives it the platform specific
addr/data pair.

It is a big ugly problem for sure, and we definitely have painted
ourselves into a corner where the OS has no idea if IMS techniques
work properly or it is broken. :( :(

But I think there may not be a terribly impossible path where at least
the guest could be offered a, say, virtio-irq in addition to the
existing platform controllers that would process IMS for it.

> The idea of guest owning the physical MSI address space sounds great,
> but is it practical?  

In many cases yes, it is, but more importantly it is the only sane way
to support these IMS like techniques broadly since IMS is by
definition not generally trappable.

> Is it something that would be accomplished while
> this device is still relevant?

I don't know, I fear not. But it keeps coming up. Too many things
don't work right with the trapping approach, including this.

> The Windows driver is just programming the MSI capability to use 16
> vectors.  We configure those vectors on the host at the time the
> capability is written.  Whereas the Linux driver is only using a single
> vector and therefore writing the same MSI address and data at the
> locations noted in the trace, the Windows driver is writing different
> data values at different locations to make use of those vectors.  This
> note is simply describing that we can't directly write the physical
> data value into the device, we need to determine which vector offset
> the guest is using and provide the same offset from the host data
> register value.

I see, it seems to be assuming also that these extra interrupt sources
are generating the same MSI message as the main MSI, not something
else. That is more a SW quirk of Windows, I expect. I don't think
Linux would do that..

This is probably the only way to approach this, trap and emulate the
places in the device that program additional interrupt sources and do
a full MSI-like flow to set them up in the kernel.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-13 17:30     ` Thomas Gleixner
@ 2024-08-13 23:39       ` Jason Gunthorpe
  2024-12-13  9:10       ` David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 23:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alex Williamson, kvm, quic_bqiang, kvalo, prestwoj,
	linux-wireless, ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, Aug 13, 2024 at 07:30:41PM +0200, Thomas Gleixner wrote:
> > However, it will still not work in a VM. Making IMS and non-MSI
> > interrupt controllers work within VMs is still something that needs to
> > be done.
> 
> Sure, but we really want to do that in a generic way and not based on ad
> hoc workarounds.
>
> Did the debate around this go anywhere?

No, it got stuck on the impossible situation that there is no existing
way for the VM to have any idea if IMS will work or is broken. Recall
Intel was planning to "solve" this by sticking a DVSEC in their
virtual config space that said to turn off IMS :\

So using IMS in the real world looked impractical and interest faded a
bit.

But the underlying reasons for IMS haven't gone away and more work is
coming that will bring it up again...

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-13 23:16       ` Jason Gunthorpe
@ 2024-08-14 14:55         ` Alex Williamson
  2024-08-14 15:20           ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-08-14 14:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, kvm, quic_bqiang, kvalo, prestwoj,
	linux-wireless, ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, 13 Aug 2024 20:16:42 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Aug 13, 2024 at 03:14:01PM -0600, Alex Williamson wrote:
> 
> > > Personally, I very much dislike this. Encouraging such hacky driver
> > > use of the interrupt subsystem is not a good direction. Enabling this
> > > in VMs will further complicate fixing the IRQ usages in these drivers
> > > over the long run.  
> > 
> > Clearly these _guest_ drivers are doing this regardless of the
> > interfaces provided by vfio, so I don't see how we're encouraging hacky
> > driver behavior, especially when it comes to Windows guest drivers.  
> 
> Because people will then say the Linux driver can't be fixed to
> properly use an irq_domain/etc as the only option that works in VMs
> will be the hacky copy from MSI-X approach :\

Ironically QEMU already has direct access to the MSI-X vector table in
MMIO space and could implement this type of quirk with no kernel
changes.  It's MSI that is now blocked by virtualization of the address
and data registers.  Note also that QEMU is still virtualizing these
registers, the values seen in the guest are unchanged.  It's only the
VMM that can bypass that virtualization to see the host values.

Let's imagine the guest driver does change to implement an irq_domain.
How does that fundamentally change the problem for the VMM that guest
MSI values are being written to other portions of the device?  The
guest driver can have whatever architecture it wants (we don't know
the architecture of the Windows driver) but we still need to trap
writes of the guest MSI address/data and replace it with host values.

> > > Thomas Gleixner has done a lot of great work recently to clean this up.
> > > 
> > > So if you imagine the driver is fixed, then this is not necessary.  
> > 
> > How so?   
> 
> Because if the driver is properly using the new irq_domain/etc
> infrastructure to model its additional interrupt source then this
> patch won't make it work in the VM anyhow, so it is not necessary..
> 
> Your other patch would be the only short term answer.

The QEMU patch relies on this kernel patch in order to be able to
access the host physical MSI address and data values through the vfio
interface.  Otherwise QEMU has no host values with which to patch-up
guest values.  As noted above, this does not provide any visible change
to a QEMU guest, it only enables QEMU to implement the quirk in the
other patch.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-14 14:55         ` Alex Williamson
@ 2024-08-14 15:20           ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 15:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Thomas Gleixner, kvm, quic_bqiang, kvalo, prestwoj,
	linux-wireless, ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Wed, Aug 14, 2024 at 08:55:05AM -0600, Alex Williamson wrote:
> Let's imagine the guest driver does change to implement an irq_domain.
> How does that fundamentally change the problem for the VMM that guest
> MSI values are being written to other portions of the device?

If changed to irq_domain the VM will write addr/data pairs into those
special registers that are unique to that interrupt source and will not
re-use values already set in the MSI table.

This means the VMM doesn't get any value from inspecting the MSI table
because the value it needs won't be there, and also that no interrupt
routing will have been setup. The VMM must call VFIO_DEVICE_SET_IRQS
to setup the unique routing.

These two patches are avoiding VFIO_DEVICE_SET_IRQS based on the
assumption that the VM will re-use an addr/data pair already setup in
the MSI table. Invalidating that assumption is the fundamental change
irq_domain in the VM will make.

> The guest driver can have whatever architecture it wants (we don't
> know the architecture of the Windows driver) but we still need to
> trap writes of the guest MSI address/data and replace it with host
> values.

Yes you do. But the wrinkle is you can't just assume one of the
existing MSI entries is a valid replacement and copy from the MSI
table. That works right now only because the Linux/Windows driver is
re-using a MSI vector in the IMS registers.

I suggest the general path is something like:

 1) A vfio variant driver sets up an irq_domain for the additional
    interrupt source registers
 2) Somehow wire up VFIO_DEVICE_SET_IRQS so it can target vectors in
    the additional interrupt domain
 3) Have the VMM trap writes to the extra interrupt source registers
    and execute VFIO_DEVICE_SET_IRQS
 4) IRQ layer will setup an appropriate unique IRQ and route it to the
    guest/whatever just like MSI. Callbacks into the variant driver's
    irq_domain will program the HW registers.

Basically exactly the same flow as MSI, except instead of targeting a
vector in the PCI core's MSI irq_domain it targets a vector in the
variant driver's IMS IRQ domain.

Then we don't make any assumptions about how the VM is using these
interrupt vectors, and crucially, SET_IRQs is called for every
interrupt source and we rely on the kernel to produce the correct
addr/data pair. No need for copying addr/data pairs from MSI tables.

> As noted above, this does not provide any visible change to a QEMU
> guest, it only enables QEMU to implement the quirk in the other
> patch.

I see, I definitely didn't understand that it only reaches qemu from
the commit message..

Jason
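
[Editorial sketch: the VFIO_DEVICE_SET_IRQS path discussed throughout
this subthread is driven from userspace roughly as below. The struct is
abridged from the <linux/vfio.h> uAPI; build_msi_set_irqs() is an
illustrative helper, not an existing API.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Abridged mirror of struct vfio_irq_set from <linux/vfio.h>,
 * reproduced here only to keep the sketch self-contained. */
struct vfio_irq_set {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t start;
    uint32_t count;
    int32_t  data[];            /* eventfds when DATA_EVENTFD is set */
};

#define VFIO_IRQ_SET_DATA_EVENTFD   (1 << 2)
#define VFIO_IRQ_SET_ACTION_TRIGGER (1 << 5)
#define VFIO_PCI_MSI_IRQ_INDEX      1

/* Build the argument wiring 'count' MSI vectors to eventfds; a VMM
 * would pass the result to ioctl(fd, VFIO_DEVICE_SET_IRQS, set). */
static struct vfio_irq_set *build_msi_set_irqs(const int32_t *efds,
                                               uint32_t count)
{
    size_t sz = sizeof(struct vfio_irq_set) + count * sizeof(int32_t);
    struct vfio_irq_set *set = calloc(1, sz);

    set->argsz = sz;
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    set->index = VFIO_PCI_MSI_IRQ_INDEX;
    set->start = 0;
    set->count = count;
    memcpy(set->data, efds, count * sizeof(int32_t));
    return set;
}
```

The host then programs the physical MSI registers itself, which is why
the guest-visible values are normally virtualized.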

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2024-08-13 23:37       ` Jason Gunthorpe
@ 2024-08-15 16:59         ` Alex Williamson
  2024-08-15 17:19           ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2024-08-15 16:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Tue, 13 Aug 2024 20:37:24 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Aug 13, 2024 at 03:03:20PM -0600, Alex Williamson wrote:
> 
> > How does the guest know to write a remappable vector format?  How does
> > the guest know the host interrupt architecture?  For example why would
> > an aarch64 guest program an MSI vector of 0xfee... if the host is x86?  
> 
> All excellent questions.
> 
> Emulating real interrupt controllers in the VM is probably impossible
> in every scenario. But certainly x86 emulating x86 and ARM emulating
> ARM would be usefully achievable.
> 
> hyperv did a neat thing where their remapping driver seems to make VMM
> traps and looks kind of like the VMM gives it the platform specific
> addr/data pair.
> 
> It is a big ugly problem for sure, and we definitely have painted
> ourselves into a corner where the OS has no idea if IMS techniques
> work properly or it is broken. :( :(
> 
> But I think there may not be a terribly impossible path where at least
> the guest could be offered a, say, virtio-irq in addition to the
> existing platform controllers that would process IMS for it.
> 
> > The idea of guest owning the physical MSI address space sounds great,
> > but is it practical?    
> 
> In many cases yes, it is, but more importantly it is the only sane way
> to support these IMS like techniques broadly since IMS is by
> definition not generally trappable.
> 
> > Is it something that would be accomplished while
> > this device is still relevant?  
> 
> I don't know, I fear not. But it keeps coming up. Too many things
> don't work right with the trapping approach, including this.
> 
> > The Windows driver is just programming the MSI capability to use 16
> > vectors.  We configure those vectors on the host at the time the
> > capability is written.  Whereas the Linux driver is only using a single
> > vector and therefore writing the same MSI address and data at the
> > locations noted in the trace, the Windows driver is writing different
> > data values at different locations to make use of those vectors.  This
> > note is simply describing that we can't directly write the physical
> > data value into the device, we need to determine which vector offset
> > the guest is using and provide the same offset from the host data
> > register value.  
> 
> I see, it seems to be assuming also that these extra interrupt sources
> are generating the same MSI message as the main MSI, not something
> else. That is more a SW quirk of Windows, I expect. I don't think
> Linux would do that..
> 
> This is probably the only way to approach this, trap and emulate the
> places in the device that program additional interrupt sources and do
> a full MSI-like flow to set them up in the kernel.

Your last sentence here seems to agree with this approach, but
everything else suggests disapproval, so I don't know where you're
going here.

I have no specs for this device, nor any involvement from the device
vendor, so the idea of creating a vfio-pci variant driver to setup an
irq_domain and augment a device-specific SET_IRQs ioctl not only sounds
tremendously more complicated (host and VMM), it's simply not possible
with the knowledge we have at hand.  Making this device work in a VM is
dead in the water if that's the bar to achieve.

I observe that the device configures MSI vectors and then writes that
same vector address/data elsewhere into the device.  Whether the device
can trigger those vectors based only on the MSI capability programming
and a secondary source piggybacks on those vectors or if this is just a
hack by Qualcomm to use an MSI capability to acquire some vectors which
are exclusively used by the secondary hardware, I have no idea.  Who
can even say if this is just a cost-saving measure where a PCI config
space is slapped into a platform device and there's simply no hw/fw
support to push the vector data into the hardware and the driver
bridges the gap.

The solution here is arguably fragile, we're relying on having a
sufficiently unique MSI address that we can recognize writes with that
value in order to both replace it with the host value and mark the
location of the data register.  If someone with some hardware insight
wants to come along and provide a reference for static locations of
these writes, I'd certainly welcome it.  My sample size is one, which
is why this is posted as an RFT.

I do not believe that introducing a vfio device feature that disables
virtualization of the MSI address/data _only_ at the vfio interface
(not to a QEMU VM) provides some implicit support of this device
behavior.  These values are already available to a privileged user in
the host and the same is available for an MSI-X use case by directly
reading the MSI-X vector table.  The only point of the vfio device
feature is that we need a vehicle to expose the MSI physical
address/data values through the vfio channel, without additional host
privileges.  The virtualized values are essentially unused by QEMU, so
why not give QEMU a way to turn off the virtualization to expose the
host values.  Thanks,

Alex
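
[Editorial sketch: the write-matching the RFC relies on amounts to a
small state machine. Names and layout here are illustrative, not the
actual hw/vfio/pci-quirks.c code, and a write of the data value is
assumed to follow the address match.]

```c
#include <assert.h>
#include <stdint.h>

/* When the guest writes its (virtualized) MSI address somewhere in
 * the BAR, substitute the host physical address and remember where it
 * happened; the guest's matching data write gets the same treatment. */
struct ath_quirk {
    uint64_t guest_msi_addr, host_msi_addr;
    uint32_t guest_msi_data, host_msi_data;
    uint64_t addr_offset;       /* where the address match was seen */
    int      addr_seen;
};

static uint64_t quirk_bar_write(struct ath_quirk *q, uint64_t offset,
                                uint64_t val)
{
    if (val == q->guest_msi_addr) {
        q->addr_seen = 1;
        q->addr_offset = offset;
        return q->host_msi_addr;        /* inject host MSI address */
    }
    if (q->addr_seen && (uint32_t)val == q->guest_msi_data) {
        q->addr_seen = 0;
        return q->host_msi_data;        /* inject host MSI data */
    }
    return val;                         /* unrelated write */
}
```

This also illustrates the fragility Alex notes: the scheme only works
while the MSI address is unique enough not to collide with unrelated
register values.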


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2024-08-15 16:59         ` Alex Williamson
@ 2024-08-15 17:19           ` Jason Gunthorpe
  2026-03-16 14:58             ` James Prestwood
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2024-08-15 17:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, prestwoj, linux-wireless,
	ath11k, dwmw2, iommu, kernel, johannes, jtornosm

On Thu, Aug 15, 2024 at 10:59:05AM -0600, Alex Williamson wrote:

> > This is probably the only way to approach this, trap and emulate the
> > places in the device that program additional interrupt sources and do
> > a full MSI-like flow to set them up in the kernel.
> 
> Your last sentence here seems to agree with this approach, but
> everything else suggests disapproval, so I don't know where you're
> going here.

Trapping and emulating is fine.

My concern is really only about skipping SET_IRQ.

That works because of the assumption that the IMS sources are going to
re-use addr/data pairs setup in the MSI CAP.

That assumption is frail, and won't work at all under the proper IMS
support Linux now has.

I really don't want to go down the road and have someone tell Thomas
he can't convert the Linux driver to use irq_domain IMS because it
will break this stuff here.

> I have no specs for this device, nor any involvement from the device
> vendor, so the idea of creating a vfio-pci variant driver to setup an
> irq_domain and augment a device specific SET_IRQs ioctls not only sounds
> tremendously more complicated (host and VMM), it's simply not possible
> with the knowledge we have at hand.  

It seems like you have reverse engineered a lot of the necessary
information though??

Maybe there is a more generic approach than a variant driver. If you
wanted to use IMS from userspace generically you could imagine some
kind of IMS focused "SET_IRQ" in generic VFIO. Where we'd create the
needed irq_domains and pass the addr/data pair back to userspace?

> I observe that the device configures MSI vectors and then writes that
> same vector address/data elsewhere into the device.  Whether the device
> can trigger those vectors based only on the MSI capability programming
> and a secondary source piggybacks on those vectors or if this is just a
> hack by Qualcomm to use an MSI capability to acquire some vectors which
> are exclusively used by the secondary hardware, I have no idea.

Well at least that should be testable - but it seems crazy if the
device has registers for an addr/data pair and then somehow doesn't
use the values that get put in them??

Copying from the MSI is almost certainly a SW hack because IMS support
has never really existed in an OS until now. I think your guess for
why it is like this is pretty good.

> I do not believe that introducing a vfio device feature that disables
> virtualization of the MSI address/data _only_ at the vfio interface
> (not to a QEMU VM) provides some implicit support of this device
> behavior.  These values are already available to a privileged user in
> the host and the same is available for an MSI-X use case by directly
> reading the MSI-X vector table.  

To be clear, I'm not really worried about showing the data to
userspace.

Userspace just shouldn't be using it to implement an IMS technique!

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-08-13 17:30     ` Thomas Gleixner
  2024-08-13 23:39       ` Jason Gunthorpe
@ 2024-12-13  9:10       ` David Woodhouse
  2025-01-03 14:31         ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2024-12-13  9:10 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe, Alex Williamson
  Cc: kvm, quic_bqiang, kvalo, prestwoj, linux-wireless, ath11k, iommu,
	kernel, johannes, jtornosm

[-- Attachment #1: Type: text/plain, Size: 4266 bytes --]

On Tue, 2024-08-13 at 19:30 +0200, Thomas Gleixner wrote:
> On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > > vfio-pci has always virtualized the MSI address and data registers as
> > > MSI programming is performed through the SET_IRQS ioctl.  Often this
> > > virtualization is not used, and in specific cases can be unhelpful.
> > > 
> > > One such case where the virtualization is a hindrance is when the
> > > device contains an onboard interrupt controller programmed by the guest
> > > driver.  Userspace VMMs have a chance to quirk this programming,
> > > injecting the host physical MSI information, but only if the userspace
> > > driver can get access to the host physical address and data registers.
> > > 
> > > This introduces a device feature which allows the userspace driver to
> > > disable virtualization of the MSI capability address and data registers
> > > in order to provide read-only access to the physical values.
> > 
> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
> > 
> > If the device has its own interrupt sources then the device needs to
> > create an irq_chip and related and hook them up properly. Not hackily
> > read the MSI-X registers and write them someplace else.
> > 
> > Thomas Gleixner has done a lot of great work recently to clean this up.
> > 
> > So if you imagine the driver is fixed, then this is not necessary.
> 
> Yes. I looked at the ath11k driver when I was reworking the PCI/MSI
> subsystem and that's a perfect candidate for a proper device specific
> interrupt domain to replace the horrible MSI hackery it has.

The ath11k hacks may be awful, but in their defence, that's because the
whole way the hardware works is awful.

Q: With PCI passthrough to a guest, how does the guest OS tell the
device where to do DMA?

A: The guest OS just hands the device a guest physical address and the
IOMMU does the rest. Nothing 'intercedes' between the guest and the
device to mess with that address.

Q: MSIs are just DMA. So with PCI passthrough to a guest, how does the
guest OS configure the device's MSIs? 

<fantasy>
A: The guest OS just hands the device a standard MSI message encoding
the target guest APIC ID and vector (etc.), and the IOMMU does the
rest. Nothing 'intercedes' between the guest and the device to mess
with that MSI message.

And thus ath11k didn't need to do *any* hacks to work around a stupid
hardware design with the VMM snooping on stuff it ideally shouldn't
have had any business touching in the first place.

Posted interrupts are almost the *default* because the IOMMU receives a
<source-id, vCPU APIC ID, vector> tuple on the bus. If receiving an
interrupt for a vCPU which isn't currently running, that's when the
IOMMU sets a bit in a table somewhere and notifies the host OS.

All that special case MSI handling and routing code that I had
nightmares about because it fell through a wormhole from a parallel
universe, doesn't exist.

And look, DPDK drivers which run in polling mode and 'abuse' MSIs by
using real memory addresses and asking the device to "write <these> 32
bits to <this> structure if you want attention" just work nicely in
virtual machines too, just as they do on real hardware.
</fantasy>

/me wakes up...

Shit.

And we have to enable this Interrupt Remapping crap even to address
more than 255 CPUs *without* virtualization? Even a *guest* has to see
a virtual IOMMU and enable Interrupt Remapping to be able to use more
than 255 vCPUs? Even though there were a metric shitload of spare bits
in the MSI message we could have used¹.

Wait, so that means we have to offer an IOMMU with *DMA* remapping to
guests, which means 2-stage translations and/or massive overhead, just
for that guest to be able to use >255 vCPUs?

Screw you all, I'm going back to bed.



¹ And *should* use, if we ever do something similar like, say, expand
  the vector# space past 8 bits. Intel and AMD take note. 

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2024-12-13  9:10       ` David Woodhouse
@ 2025-01-03 14:31         ` Jason Gunthorpe
  2025-01-03 14:47           ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-01-03 14:31 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, Alex Williamson, kvm, quic_bqiang, kvalo,
	prestwoj, linux-wireless, ath11k, iommu, kernel, johannes,
	jtornosm

On Fri, Dec 13, 2024 at 09:10:30AM +0000, David Woodhouse wrote:

> <fantasy>
> A: The guest OS just hands the device a standard MSI message encoding
> the target guest APIC ID and vector (etc.), and the IOMMU does the
> rest. Nothing 'intercedes' between the guest and the device to mess
> with that MSI message.
 
> /me wakes up...

Well, I share your dream at least. :\

Have the VMM shadow the virtual interrupt remapping tables and assign
it to the physical remapping so that the physical addr/data pair
doesn't change. Driving interrupt routing fully via the remapping HW
and not via MSI interception.

IIRC Alex had a patch series for qemu to capture and rewrite the ath
non-standard MSI locations, so virtualization worries should not block
moving ath to use the device-specific MSI..

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2025-01-03 14:31         ` Jason Gunthorpe
@ 2025-01-03 14:47           ` David Woodhouse
  2025-01-03 15:19             ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2025-01-03 14:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Alex Williamson, kvm, quic_bqiang, kvalo,
	prestwoj, linux-wireless, ath11k, iommu, kernel, johannes,
	jtornosm

[-- Attachment #1: Type: text/plain, Size: 696 bytes --]

On Fri, 2025-01-03 at 10:31 -0400, Jason Gunthorpe wrote:
> 
> IIRC Alex had a patch series for qemu to capture and rewrite the ath
> non-standard MSI locations, so virtualization worries should not block
> moving ath to use the device-specific MSI..

In the absence of the IOMMU in our shared fantasy, this seems like the
best approach.

Even for passed-through PCI hardware, the config space is largely a
fiction. As are the MSI(-X) vectors. So although it's a device-specific
quirk, this seems like the best approach.

Probably best for the hypervisor to have some way to *advertise* that
it's handling this though, as guests also might want to run on
hypervisors which don't.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization
  2025-01-03 14:47           ` David Woodhouse
@ 2025-01-03 15:19             ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2025-01-03 15:19 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, Alex Williamson, kvm, quic_bqiang, kvalo,
	prestwoj, linux-wireless, ath11k, iommu, kernel, johannes,
	jtornosm

On Fri, Jan 03, 2025 at 02:47:11PM +0000, David Woodhouse wrote:

> Probably best for the hypervisor to have some way to *advertise* that
> it's handling this though, as guests also might want to run on
> hypervisors which don't.

If the hypervisor doesn't properly virtualize the device it shouldn't
assign it to a VM to start with :\

Intel looked at the question of advertising clean interrupt remapping
when trying to virtualize IMS and it didn't seem so great.

Bare metal machines need to work, so any test they could think of
adding would either fail on bare metal or fail on existing VMs.

VMMs have taken the approach of not telling the guest they are in VMs
and then also not implementing the bare metal HW behaviors with full
fidelity. So we have no way to discover that the VMM is, in fact,
emulating broken "hw".

Thus we get this push that all kernels need to accommodate the worst
VMM behaviors :(

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2024-08-15 17:19           ` Jason Gunthorpe
@ 2026-03-16 14:58             ` James Prestwood
  2026-03-16 15:43               ` James Prestwood
  0 siblings, 1 reply; 29+ messages in thread
From: James Prestwood @ 2026-03-16 14:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, linux-wireless, ath11k,
	dwmw2, iommu, kernel, johannes, jtornosm

On 8/15/24 10:19 AM, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2024 at 10:59:05AM -0600, Alex Williamson wrote:
>
>>> This is probably the only way to approach this, trap and emulate the
>>> places in the device that program additional interrupt sources and do
>>> a full MSI-like flow to set them up in the kernel.
>> Your last sentence here seems to agree with this approach, but
>> everything else suggests disapproval, so I don't know where you're
>> going here.
> Trapping and emulating is fine.
>
> My concern is really only about skipping SET_IRQ.
>
> That works because of the assumption that the IMS sources are going to
> re-use addr/data pairs setup in the MSI CAP.
>
> That assumption is frail, and won't work at all under the proper IMS
> support Linux now has.
>
> I really don't want to go down the road and have someone tell Thomas
> he can't convert the Linux driver to use irq_domain IMS because it
> will break this stuff here.
>
>> I have no specs for this device, nor any involvement from the device
>> vendor, so the idea of creating a vfio-pci variant driver to setup an
>> irq_domain and augment a device specific SET_IRQs ioctls not only sounds
>> tremendously more complicated (host and VMM), it's simply not possible
>> with the knowledge we have at hand.
>> It seems like you have reverse engineered a lot of the necessary
> information though??
>
> Maybe there is a more generic approach than a variant driver. If you
> wanted to use IMS from userspace generically you could imagine some
> kind of IMS focused "SET_IRQ" in generic VFIO. Where we'd create the
> needed irq_domains and pass the addr/data pair back to userspace?
>
>> I observe that the device configures MSI vectors and then writes that
>> same vector address/data elsewhere into the device.  Whether the device
>> can trigger those vectors based only on the MSI capability programming
>> and a secondary source piggybacks on those vectors or if this is just a
>> hack by Qualcomm to use an MSI capability to acquire some vectors which
>> are exclusively used by the secondary hardware, I have no idea.
> Well at least that should be testable - but it seems crazy if the
> device has registers for an addr/data pair and then somehow doesn't
> use the values that get put in them??
>
> Copying from the MSI is almost certainly a SW hack because IMS support
> has never really existed in an OS until now. I think your guess for
> why it is like this is pretty good.
>
>> I do not believe that introducing a vfio device feature that disables
>> virtualization of the MSI address/data _only_ at the vfio interface
>> (not to a QEMU VM) provides some implicit support of this device
>> behavior.  These values are already available to a privileged user in
>> the host and the same is available for an MSI-X use case by directly
>> reading the MSI-X vector table.
> To be clear, I'm not really worried about showing the data to
> userspace.
>
> Userspace just shouldn't be using it to implement an IMS technique!
>
> Jason

I see this thread went stale. Wondering if there was ever a final 
conclusion on whether this could get fixed for ath11k or not. I tried 
again on a recent kernel, 6.17, and see the same behavior.

Thanks,

James


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless
  2026-03-16 14:58             ` James Prestwood
@ 2026-03-16 15:43               ` James Prestwood
  0 siblings, 0 replies; 29+ messages in thread
From: James Prestwood @ 2026-03-16 15:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: qemu-devel, kvm, quic_bqiang, kvalo, linux-wireless, ath11k,
	dwmw2, iommu, kernel, johannes, jtornosm


On 3/16/26 7:58 AM, James Prestwood wrote:
> On 8/15/24 10:19 AM, Jason Gunthorpe wrote:
>> On Thu, Aug 15, 2024 at 10:59:05AM -0600, Alex Williamson wrote:
>>
>>>> This is probably the only way to approach this, trap and emulate the
>>>> places in the device that program additional interrupt sources and do
>>>> a full MSI-like flow to set them up in the kernel.
>>> Your last sentence here seems to agree with this approach, but
>>> everything else suggests disapproval, so I don't know where you're
>>> going here.
>> Trapping and emulating is fine.
>>
>> My concern is really only about skipping SET_IRQ.
>>
>> That works because of the assumption that the IMS sources are going to
>> re-use addr/data pairs setup in the MSI CAP.
>>
>> That assumption is frail, and won't work at all under the proper IMS
>> support Linux now has.
>>
>> I really don't want to go down the road and have someone tell Thomas
>> he can't convert the Linux driver to use irq_domain IMS because it
>> will break this stuff here.
>>
>>> I have no specs for this device, nor any involvement from the device
>>> vendor, so the idea of creating a vfio-pci variant driver to setup an
>>> irq_domain and augment a device specific SET_IRQs ioctls not only 
>>> sounds
>>> tremendously more complicated (host and VMM), it's simply not possible
>>> with the knowledge we have at hand.
>> It seems like you have reverse engineered a lot of the necessary
>> information though??
>>
>> Maybe there is a more generic approach than a variant driver. If you
>> wanted to use IMS from userspace generically you could imagine some
>> kind of IMS focused "SET_IRQ" in generic VFIO. Where we'd create the
>> needed irq_domains and pass the addr/data pair back to userspace?
>>
>>> I observe that the device configures MSI vectors and then writes that
>>> same vector address/data elsewhere into the device.  Whether the device
>>> can trigger those vectors based only on the MSI capability programming
>>> and a secondary source piggybacks on those vectors or if this is just a
>>> hack by Qualcomm to use an MSI capability to acquire some vectors which
>>> are exclusively used by the secondary hardware, I have no idea.
>> Well at least that should be testable - but it seems crazy if the
>> device has registers for an addr/data pair and then somehow doesn't
>> use the values that get put in them??
>>
>> Copying from the MSI is almost certainly a SW hack because IMS support
>> has never really existed in an OS until now. I think your guess for
>> why it is like this is pretty good.
>>
>>> I do not believe that introducing a vfio device feature that disables
>>> virtualization of the MSI address/data _only_ at the vfio interface
>>> (not to a QEMU VM) provides some implicit support of this device
>>> behavior.  These values are already available to a privileged user in
>>> the host and the same is available for an MSI-X use case by directly
>>> reading the MSI-X vector table.
>> To be clear, I'm not really worried about showing the data to
>> userspace.
>>
>> Userspace just shouldn't be using it to implement an IMS technique!
>>
>> Jason
>
> I see this thread went stale. Wondering if there was ever a final 
> conclusion on whether this could get fixed for ath11k or not. I tried again on 
> a recent kernel, 6.17, and see the same behavior.
>
> Thanks,
>
> James
>
In addition, I've looked at various kernel versions and see no point where 
a "hw/vfio/pci-quirks.c" file ever existed in the tree. I tried many 
versions between 6.17 and 5.11, and I don't see the "hw" directory at all.

I'd like to try this patch out, but might need some guidance on what 
kernel version this was meant for and if files may have shuffled around.

Thanks,

James


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-03-16 15:43 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <adcb785e-4dc7-4c4a-b341-d53b72e13467@gmail.com>
     [not found] ` <8734v5zhol.fsf@kernel.org>
     [not found]   ` <87fa5220-6fd9-433d-879b-c55ac67a0748@gmail.com>
     [not found]     ` <87r0ipcn7j.fsf@kernel.org>
     [not found]       ` <356e0b05-f396-4ad7-9b29-c492b54af834@gmail.com>
     [not found]         ` <26119c3f-9012-47bb-948e-7e976d4773a7@quicinc.com>
     [not found]           ` <87mstccmk6.fsf@kernel.org>
     [not found]             ` <df9fd970-5af3-468c-b1f1-18f91215cf44@gmail.com>
     [not found]               ` <8734v4auc4.fsf@kernel.org>
     [not found]                 ` <e8878979-1f3f-4635-a716-9ac381c617d9@gmail.com>
     [not found]                   ` <285b84d0-229c-4c83-a7d6-4c3c23139597@quicinc.com>
     [not found]                     ` <4607fb37-8227-49a3-9e8c-10c9b117ec7b@gmail.com>
     [not found]                       ` <3d22a730-aee5-4f2a-9ddc-b4b5bd4d62fe@quicinc.com>
2024-01-14 14:36                         ` ath11k and vfio-pci support Kalle Valo
2024-01-15 17:46                           ` Alex Williamson
2024-01-16 10:08                             ` Baochen Qiang
2024-01-16 10:41                               ` David Woodhouse
2024-01-16 15:29                                 ` Jason Gunthorpe
2024-01-16 18:28                                 ` Alex Williamson
2024-01-16 21:10                                   ` Jeff Johnson
2024-01-17  5:47                                 ` Baochen Qiang
2024-03-21 19:14                                 ` Johannes Berg
2024-08-12 16:59 ` [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization Alex Williamson
2024-08-13 16:30   ` Jason Gunthorpe
2024-08-13 17:30     ` Thomas Gleixner
2024-08-13 23:39       ` Jason Gunthorpe
2024-12-13  9:10       ` David Woodhouse
2025-01-03 14:31         ` Jason Gunthorpe
2025-01-03 14:47           ` David Woodhouse
2025-01-03 15:19             ` Jason Gunthorpe
2024-08-13 21:14     ` Alex Williamson
2024-08-13 23:16       ` Jason Gunthorpe
2024-08-14 14:55         ` Alex Williamson
2024-08-14 15:20           ` Jason Gunthorpe
2024-08-12 17:00 ` [PATCH RFC/RFT] vfio/pci-quirks: Quirk for ath wireless Alex Williamson
2024-08-13 16:43   ` Jason Gunthorpe
2024-08-13 21:03     ` Alex Williamson
2024-08-13 23:37       ` Jason Gunthorpe
2024-08-15 16:59         ` Alex Williamson
2024-08-15 17:19           ` Jason Gunthorpe
2026-03-16 14:58             ` James Prestwood
2026-03-16 15:43               ` James Prestwood

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox