* A lingering doubt on PCI-MMIO region of PCI-passthrough-device
@ 2025-12-14 12:08 Ajay Garg
  2025-12-14 19:52 ` Alex Williamson
  0 siblings, 1 reply; 7+ messages in thread
From: Ajay Garg @ 2025-12-14 12:08 UTC (permalink / raw)
To: iommu, linux-pci, Linux Kernel Mailing List

Hi everyone.

Let's assume an x86_64 Linux host and guest, with full virtualization
and IOMMU hardware capabilities.

Before launching the VM, QEMU, with the help of VFIO, "installs" "dev1"
on the guest's virtual PCI root complex. After bootup, the guest does
the usual enumeration, finds "dev1" on the PCI bus, and programs the
BARs in its domain.

However, there is no guarantee that the guest's PCI-MMIO physical
ranges would be identical to "what would have been" the host's PCI-MMIO
physical ranges. Then how do the EPT/SLAT tables get set up for the
correct GPA => HPA mapping of dev1's BAR MMIO regions?

Will be grateful for pointers.

Thanks and Regards,
Ajay

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: A lingering doubt on PCI-MMIO region of PCI-passthrough-device
  2025-12-14 12:08 A lingering doubt on PCI-MMIO region of PCI-passthrough-device Ajay Garg
@ 2025-12-14 19:52 ` Alex Williamson
  2025-12-15  3:50   ` Ajay Garg
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Williamson @ 2025-12-14 19:52 UTC (permalink / raw)
To: Ajay Garg; +Cc: iommu, linux-pci, Linux Kernel Mailing List

On Sun, 14 Dec 2025 17:38:50 +0530
Ajay Garg <ajaygargnsit@gmail.com> wrote:

> Hi everyone.
>
> Let's assume an x86_64 Linux host and guest, with full virtualization
> and IOMMU hardware capabilities.
>
> Before launching the VM, QEMU, with the help of VFIO, "installs"
> "dev1" on the guest's virtual PCI root complex. After bootup, the
> guest does the usual enumeration, finds "dev1" on the PCI bus, and
> programs the BARs in its domain.
>
> However, there is no guarantee that the guest's PCI-MMIO physical
> ranges would be identical to "what would have been" the host's
> PCI-MMIO physical ranges. Then how do the EPT/SLAT tables get set up
> for the correct GPA => HPA mapping of dev1's BAR MMIO regions?

The guest doesn't get to program the device's physical BARs, nor is an
identity mapping required in the guest. The BAR programming is
virtualized. QEMU mmaps the BAR, and that mmap is the backing for the
mapping into the guest.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 7+ messages in thread
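[Editor's note: the key idea in Alex's answer — a single host mapping
backing both QEMU's view and the guest's view of the BAR — can be
sketched as a toy model. This is plain Python with a temp file standing
in for the VFIO-provided BAR region; nothing here is the real VFIO API.]

```python
import mmap
import os
import tempfile

# Toy illustration (plain file, not real VFIO): one backing object,
# two views of it. In the real flow, QEMU mmap()s the device BAR via
# the VFIO region file offset, and that same mapping backs the guest's
# physical address range, so guest MMIO accesses need no copying.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)

device_view = mmap.mmap(fd, 4096)  # stands in for the host-side mmap
guest_view = mmap.mmap(fd, 4096)   # stands in for the guest-visible view

device_view[0:4] = b"\xde\xad\xbe\xef"  # "device" updates a register
shared_ok = guest_view[0:4] == b"\xde\xad\xbe\xef"  # guest sees it at once

device_view.close()
guest_view.close()
os.close(fd)
os.unlink(path)
print("shared backing:", shared_ok)
```

Both views observe the same bytes because they share one backing
object, which is the property Alex is describing: no trap-and-emulate
is needed for the BAR contents themselves, only for the BAR registers.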
* Re: A lingering doubt on PCI-MMIO region of PCI-passthrough-device
  2025-12-14 19:52 ` Alex Williamson
@ 2025-12-15  3:50   ` Ajay Garg
  2025-12-19  6:23     ` Ajay Garg
  0 siblings, 1 reply; 7+ messages in thread
From: Ajay Garg @ 2025-12-15 3:50 UTC (permalink / raw)
To: Alex Williamson; +Cc: iommu, linux-pci, Linux Kernel Mailing List

Thanks Alex.

So does something like the following happen:

i)
During bootup, the guest starts PCI enumeration as usual.

ii)
Upon discovering the passthrough device, the guest carves out the MMIO
regions (as usual) in its physical address space, and attempts to
program the BARs with the guest-physical base addresses it carved out.

iii)
These attempts to program the BARs (lying in the passthrough device's
config space) are intercepted by the hypervisor instead (causing a
VM-exit in the interim).

iv)
The hypervisor uses the above info to update the EPT, ensuring GPA =>
HPA translations go fine when the guest later accesses the PCI-MMIO
regions (once the guest is fully booted up). Also, the hypervisor marks
the operation as a success (without "really" re-programming the BARs).

v)
VM-entry resumes the guest, which is left with the "impression" that
the BARs have been "programmed by the guest".

Is the above sequencing correct at a bird's-eye level?

Once again, many thanks for the help!

Thanks and Regards,
Ajay

^ permalink raw reply	[flat|nested] 7+ messages in thread
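[Editor's note: steps iii)-v) above can be sketched as a toy model of
the intercept-and-remap idea. Everything below — the addresses, the
flat dict standing in for the EPT, the function names — is invented for
illustration; real KVM uses multi-level tables and VM-exits, not Python.]

```python
# Toy model of the sequence above: a guest write to a virtualized BAR
# is "intercepted", the emulated BAR value is updated, and GPA -> HPA
# mappings (standing in for EPT entries) are installed. No real BAR
# reprogramming reaches the physical device.

PAGE = 0x1000
HOST_BAR_HPA = 0xFEB0_0000   # hypothetical host-physical BAR address
ept = {}                     # GPA page -> HPA page (toy "EPT")
emulated_bar = 0             # value the guest reads back

def guest_writes_bar(gpa_base):
    """Intercept the config-space write; map a hypothetical 16 KiB BAR."""
    global emulated_bar
    emulated_bar = gpa_base                  # virtualized BAR register
    for page in range(0, 4 * PAGE, PAGE):
        ept[gpa_base + page] = HOST_BAR_HPA + page

def guest_mmio_access(gpa):
    """Second-stage translation for a guest MMIO access."""
    page, off = gpa & ~(PAGE - 1), gpa & (PAGE - 1)
    return ept[page] + off

guest_writes_bar(0xC000_0000)
assert emulated_bar == 0xC000_0000           # guest sees its own value
assert guest_mmio_access(0xC000_0008) == 0xFEB0_0008
print("toy intercept ok")
```

Note that, per Alex's reply, the real mechanism works through QEMU's
mmap and KVM memory slots rather than per-page EPT edits made directly
by the BAR-write handler, but the observable effect for the guest is
the same as in this sketch.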
* Re: A lingering doubt on PCI-MMIO region of PCI-passthrough-device
  2025-12-15  3:50   ` Ajay Garg
@ 2025-12-19  6:23     ` Ajay Garg
  2025-12-20  0:06       ` Alex Williamson
  0 siblings, 1 reply; 7+ messages in thread
From: Ajay Garg @ 2025-12-19 6:23 UTC (permalink / raw)
To: Alex Williamson, QEMU Developers
Cc: iommu, linux-pci, Linux Kernel Mailing List

Hi Alex.
Kindly help with whether the steps listed in the previous email are
correct.

(Have added the qemu mailing-list too, as it might be a QEMU thing as
well, since virtual PCI is in the picture.)

On Mon, Dec 15, 2025 at 9:20 AM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> Thanks Alex.
>
> So does something like the following happen :
>
> i)
> During bootup, guest starts pci-enumeration as usual.
>
> ii)
> Upon discovering the "passthrough-device", guest carves the physical
> MMIO regions (as usual) in the guest's physical-address-space, and
> starts-to/attempts to program the BARs with the
> guest-physical-base-addresses carved out.
>
> iii)
> These attempts to program the BARs (lying in the
> "passthrough-device"'s config-space), are intercepted by the
> hypervisor instead (causing a VM-exit in the interim).
>
> iv)
> The hypervisor uses the above info to update the EPT, to ensure GPA =>
> HPA conversions go fine when the guest tries to access the PCI-MMIO
> regions later (once the guest is fully booted up). Also, the hypervisor
> marks the operation as success (without "really" re-programming the
> BARs).
>
> v)
> The VM-entry is called, and the guest resumes with the "impression"
> that the BARs have been "programmed by guest".
>
> Is the above sequencing correct at a bird's view level?
>
>
> Once again, many thanks for the help !
>
> Thanks and Regards,
> Ajay

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: A lingering doubt on PCI-MMIO region of PCI-passthrough-device
  2025-12-19  6:23     ` Ajay Garg
@ 2025-12-20  0:06       ` Alex Williamson
  2025-12-20 12:52         ` Ajay Garg
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Williamson @ 2025-12-20 0:06 UTC (permalink / raw)
To: Ajay Garg; +Cc: QEMU Developers, iommu, linux-pci, Linux Kernel Mailing List

On Fri, 19 Dec 2025 11:53:56 +0530
Ajay Garg <ajaygargnsit@gmail.com> wrote:

> Hi Alex.
> Kindly help if the steps listed in the previous email are correct.
>
> (Have added qemu mailing-list too, as it might be a qemu thing too as
> virtual-pci is in picture).
>
> On Mon, Dec 15, 2025 at 9:20 AM Ajay Garg <ajaygargnsit@gmail.com> wrote:
> >
> > Thanks Alex.
> >
> > So does something like the following happen :
> >
> > i)
> > During bootup, guest starts pci-enumeration as usual.
> >
> > ii)
> > Upon discovering the "passthrough-device", guest carves the physical
> > MMIO regions (as usual) in the guest's physical-address-space, and
> > starts-to/attempts to program the BARs with the
> > guest-physical-base-addresses carved out.
> >
> > iii)
> > These attempts to program the BARs (lying in the
> > "passthrough-device"'s config-space), are intercepted by the
> > hypervisor instead (causing a VM-exit in the interim).
> >
> > iv)
> > The hypervisor uses the above info to update the EPT, to ensure GPA =>
> > HPA conversions go fine when the guest tries to access the PCI-MMIO
> > regions later (once the guest is fully booted up). Also, the hypervisor
> > marks the operation as success (without "really" re-programming the
> > BARs).
> >
> > v)
> > The VM-entry is called, and the guest resumes with the "impression"
> > that the BARs have been "programmed by guest".
> >
> > Is the above sequencing correct at a bird's view level?

It's not far off.  The key is simply that we can create a host virtual
mapping to the device BARs, i.e. an mmap.
The guest enumerates emulated BARs; they're only used for sizing and
locating the BARs in the guest physical address space.  When the guest
BAR is programmed and memory is enabled, the address space in QEMU is
populated at the BAR-indicated GPA using the mmap backing.  KVM memory
slots are used to fill the mappings in the vCPU.  The same BAR mmap is
also used to provide DMA mapping of the BAR through the IOMMU in the
legacy type1 IOMMU backend case.  Barring a vIOMMU, the IOMMU IOVA
space is the guest physical address space.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 7+ messages in thread
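[Editor's note: the "sizing" that the emulated BARs serve can be
illustrated with the standard BAR sizing handshake: software writes all
1s to the BAR, reads back, and the zero low bits reveal the size. The
sketch below is a toy device model in Python, with a made-up 64 KiB
32-bit memory BAR; it is not QEMU's actual implementation.]

```python
# Toy emulated 32-bit memory BAR. Address bits at and above the BAR
# size are writable; bits below it always read back as zero, which is
# how enumeration software discovers the size.

BAR_SIZE = 0x10000          # 64 KiB, hypothetical
BAR_TYPE_BITS = 0b0000      # 32-bit, non-prefetchable memory BAR

bar_reg = 0                 # the emulated BAR register

def bar_write(val):
    global bar_reg
    # Keep only the writable address bits, then re-append the type bits.
    bar_reg = (val & ~(BAR_SIZE - 1) & 0xFFFF_FFF0) | BAR_TYPE_BITS

def bar_read():
    return bar_reg

bar_write(0xFFFF_FFFF)      # the sizing probe: write all 1s
readback = bar_read()
size = (~(readback & 0xFFFF_FFF0) + 1) & 0xFFFF_FFFF
assert size == BAR_SIZE

bar_write(0xC000_0000)      # then program the real guest-physical base
assert bar_read() == 0xC000_0000
print(hex(size))
```

Because this sizing dance runs entirely against the emulated register,
the host's physical BAR is never disturbed, which is exactly why the
guest's BAR addresses are free to differ from the host's.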
* Re: A lingering doubt on PCI-MMIO region of PCI-passthrough-device
  2025-12-20  0:06       ` Alex Williamson
@ 2025-12-20 12:52         ` Ajay Garg
  2025-12-20 13:24           ` Ajay Garg
  0 siblings, 1 reply; 7+ messages in thread
From: Ajay Garg @ 2025-12-20 12:52 UTC (permalink / raw)
To: Alex Williamson
Cc: QEMU Developers, iommu, linux-pci, Linux Kernel Mailing List

Thanks Alex.

I was/am aware of GPA ranges being backed by mmap'ed HVA ranges. On
further thought, I think I have all the missing pieces (except one,
mentioned at the end of this email).

I'll list the steps below:

a)
There are three stages:

* pre-configuration by the host/QEMU.
* guest-VM BIOS.
* guest-VM kernel.

b)
The host procures the following memory slots (amongst others) via mmap:

* guest RAM.
* PCI config space: with the help of VFIO's ioctls.
* PCI BAR MMIO space: with the help of VFIO's ioctls.

For the above memory slots:

* Guest RAM's base guest-physical address is known (0), so EPT
  mappings for guest RAM are set up even before the guest VM begins to
  boot.
* There is no concept of a guest-physical address for PCI config space.
* The PCI BAR MMIO space's guest-physical address is not known yet, so
  EPT mappings for it are not set up (yet).

c)
QEMU starts the guest, and the guest-VM BIOS runs next.

This BIOS is "owned by QEMU", and is "definitely different" from the
host BIOS (QEMU is an altogether different "hardware"); the QEMU BIOS
and the host BIOS handle PCI bus enumeration "completely differently".

When PCI enumeration runs during this guest-VM-BIOS stage, it accesses
the PCI device's config space (backed on the host by mmap'ed mappings).
Note that the guest kernel is still not in the picture.

"OBVIOUSLY", all accesses (reads/writes) to PCI config space go to the
PCI-config-space memory slot (handled purely by QEMU/BIOS code).

Once the guest-VM BIOS carves out guest-physical addresses for the PCI
device's BARs, it programs the BARs by writing to the BAR offsets in
the PCI config space.
QEMU detects this, and does the following:

* It does not relay the actual writes to the physical BARs on the host.
* Since the BAR guest-physical addresses are now known, the missing
  EPT mappings for the PCI BAR MMIO space are now set up.

d)
Finally, the guest kernel takes over, and:

* all accesses to RAM go through the vanilla two-stage translation.
* all accesses to the PCI BAR MMIO regions go through the vanilla
  two-stage translation.

Requests:

i)
Alex / QEMU experts: kindly correct me if I am wrong :) till now.

ii)
Once the kernel boots up, how are accesses to PCI config space
handled? Is the QEMU BIOS again involved in PCI-config-space accesses
after the guest kernel has booted up?

Once again, many thanks to everyone for their time and help.

Thanks and Regards,
Ajay

On Sat, Dec 20, 2025 at 5:36 AM Alex Williamson <alex@shazbot.org> wrote:
>
> On Fri, 19 Dec 2025 11:53:56 +0530
> Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> > Hi Alex.
> > Kindly help if the steps listed in the previous email are correct.
> >
> > (Have added qemu mailing-list too, as it might be a qemu thing too as
> > virtual-pci is in picture).
> >
> > On Mon, Dec 15, 2025 at 9:20 AM Ajay Garg <ajaygargnsit@gmail.com> wrote:
> > >
> > > Thanks Alex.
> > >
> > > So does something like the following happen :
> > >
> > > i)
> > > During bootup, guest starts pci-enumeration as usual.
> > >
> > > ii)
> > > Upon discovering the "passthrough-device", guest carves the physical
> > > MMIO regions (as usual) in the guest's physical-address-space, and
> > > starts-to/attempts to program the BARs with the
> > > guest-physical-base-addresses carved out.
> > >
> > > iii)
> > > These attempts to program the BARs (lying in the
> > > "passthrough-device"'s config-space), are intercepted by the
> > > hypervisor instead (causing a VM-exit in the interim).
> > >
> > > iv)
> > > The hypervisor uses the above info to update the EPT, to ensure GPA =>
> > > HPA conversions go fine when the guest tries to access the PCI-MMIO
> > > regions later (once the guest is fully booted up). Also, the hypervisor
> > > marks the operation as success (without "really" re-programming the
> > > BARs).
> > >
> > > v)
> > > The VM-entry is called, and the guest resumes with the "impression"
> > > that the BARs have been "programmed by guest".
> > >
> > > Is the above sequencing correct at a bird's view level?
>
> It's not far off.  The key is simply that we can create a host virtual
> mapping to the device BARs, ie. an mmap.  The guest enumerates emulated
> BARs, they're only used for sizing and locating the BARs in the guest
> physical address space.  When the guest BAR is programmed and memory
> enabled, the address space in QEMU is populated at the BAR indicated
> GPA using the mmap backing.  KVM memory slots are used to fill the
> mappings in the vCPU.  The same BAR mmap is also used to provide DMA
> mapping of the BAR through the IOMMU in the legacy type1 IOMMU backend
> case.  Barring a vIOMMU, the IOMMU IOVA space is the guest physical
> address space.  Thanks,
>
> Alex

^ permalink raw reply	[flat|nested] 7+ messages in thread
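[Editor's note: the "vanilla two-stage translation" in step d) above
can be sketched as follows. Flat dicts stand in for the multi-level
stage-1 (guest page tables, GVA -> GPA) and stage-2 (EPT, GPA -> HPA)
structures; all addresses here are invented for illustration.]

```python
# Toy two-stage address translation: stage 1 is guest-owned, stage 2
# is host-owned (the EPT). A guest MMIO access walks both.

PAGE = 0x1000
stage1 = {0x0000_4000: 0xC000_0000}    # GVA page -> GPA page (guest BAR va)
stage2 = {0xC000_0000: 0xFEB0_0000,    # GPA page -> HPA page (BAR, via EPT)
          0x0000_0000: 0x1234_0000}    # guest RAM at GPA 0, hypothetical HPA

def translate(gva):
    off = gva & (PAGE - 1)
    gpa = stage1[gva & ~(PAGE - 1)] + off                   # stage 1 walk
    return stage2[gpa & ~(PAGE - 1)] + (gpa & (PAGE - 1))   # stage 2 walk

# A guest-virtual access inside the BAR mapping ends up at the host BAR:
assert translate(0x4010) == 0xFEB0_0010
print("two-stage ok")
```

The point of the sketch is that RAM and passthrough-BAR accesses are
indistinguishable at translation time: both are just GPA pages with
stage-2 entries, exactly as step d) claims.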
* Re: A lingering doubt on PCI-MMIO region of PCI-passthrough-device
  2025-12-20 12:52         ` Ajay Garg
@ 2025-12-20 13:24           ` Ajay Garg
  0 siblings, 0 replies; 7+ messages in thread
From: Ajay Garg @ 2025-12-20 13:24 UTC (permalink / raw)
To: Alex Williamson
Cc: QEMU Developers, iommu, linux-pci, Linux Kernel Mailing List

Are the guest ACPI MCFG/MMCONFIG tables the answer to my last
question? :)

i.e. the QEMU BIOS sets up the ACPI MCFG / MMCONFIG addresses, which
are backed by the mmap'ed PCI-config-space mappings on the host (while
also setting up EPT mappings for the PCI config space)?

On Sat, Dec 20, 2025 at 6:22 PM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> Thanks Alex.
>
> I was/am aware of GPA-ranges backed by mmap'ed HVA-ranges.
> On further thought, I think I have all the missing pieces (except one,
> as mentioned at last in current email).
>
> I'll list the steps below :
>
> a)
> There are three stages :
>
> * pre-configuration by host/qemu.
> * guest-vm bios.
> * guest-vm kernel.
>
> b)
> Host procures following memory-slots (amongst others) via mmap :
>
> * guest-ram
> * pci-config-space : via vfio's ioctls' help.
> * pci-bar-mmio-space : via vfio's ioctls' help.
>
> For the above memory-slots,
>
> * guest-ram physical-address is known (0), so ept-mappings for
> guest-ram are set up even before guest-vm begins to boot up.
>
> * there is no concept of guest-physical-address for pci-config-space.
>
> * pci-bar-mmio-space physical address is not known yet, so
> ept-mappings for pci-bar-mmio-space are not set up (yet).
>
> c)
> qemu starts the guest, and guest-vm-bios runs next.
>
> This bios is "owned by qemu", and is "definitely different" from the
> host-bios (qemu is an altogether different "hardware"). qemu-bios and
> host-bios handle pci bus/enumeration "completely differently".
>
> When the pci-enumeration runs during this guest-vm-bios stage, it
> accesses the pci-device config-space (backed on the host by mmap'ed
> mappings).
> Note that guest-kernel is still not in picture.
>
> "OBVIOUSLY", all accesses (reads/writes) to pci-config space go to the
> pci-config-space memory-slot (handled purely by qemu-bios code).
>
> Once the guest-vm bios carves out guest-physical-addresses for the
> pci-device-bars, it programs the bars by writing to bars-offsets in
> the pci-config-space. qemu detects this, and does the following :
>
> * does not relay the actual-writes to physical bars on the host.
> * since the bar-guest-physical-addresses are now known, the missing
> ept-mappings for pci-bar-mmio-space are now set up.
>
> d)
> Finally, guest-kernel takes over, and
>
> * all accesses to ram go through vanilla two-stage translation.
> * all accesses to pci-bars-mmio go through vanilla two-stage translation.
>
> Requests :
>
> i)
> Alex / QEMU-experts : kindly correct me if I am wrong :) till now.
>
> ii)
> Once kernel boots up, how are accesses to pci-config-space handled? Is
> again qemu-bios involved in pci-config-space accesses after
> guest-kernel has booted up?
>
> Once again, many thanks to everyone for their time and help.
>
> Thanks and Regards,
> Ajay
>
> On Sat, Dec 20, 2025 at 5:36 AM Alex Williamson <alex@shazbot.org> wrote:
> >
> > On Fri, 19 Dec 2025 11:53:56 +0530
> > Ajay Garg <ajaygargnsit@gmail.com> wrote:
> >
> > > Hi Alex.
> > > Kindly help if the steps listed in the previous email are correct.
> > >
> > > (Have added qemu mailing-list too, as it might be a qemu thing too as
> > > virtual-pci is in picture).
> > >
> > > On Mon, Dec 15, 2025 at 9:20 AM Ajay Garg <ajaygargnsit@gmail.com> wrote:
> > > >
> > > > Thanks Alex.
> > > >
> > > > So does something like the following happen :
> > > >
> > > > i)
> > > > During bootup, guest starts pci-enumeration as usual.
> > > >
> > > > ii)
> > > > Upon discovering the "passthrough-device", guest carves the physical
> > > > MMIO regions (as usual) in the guest's physical-address-space, and
> > > > starts-to/attempts to program the BARs with the
> > > > guest-physical-base-addresses carved out.
> > > >
> > > > iii)
> > > > These attempts to program the BARs (lying in the
> > > > "passthrough-device"'s config-space), are intercepted by the
> > > > hypervisor instead (causing a VM-exit in the interim).
> > > >
> > > > iv)
> > > > The hypervisor uses the above info to update the EPT, to ensure GPA =>
> > > > HPA conversions go fine when the guest tries to access the PCI-MMIO
> > > > regions later (once the guest is fully booted up). Also, the hypervisor
> > > > marks the operation as success (without "really" re-programming the
> > > > BARs).
> > > >
> > > > v)
> > > > The VM-entry is called, and the guest resumes with the "impression"
> > > > that the BARs have been "programmed by guest".
> > > >
> > > > Is the above sequencing correct at a bird's view level?
> >
> > It's not far off.  The key is simply that we can create a host virtual
> > mapping to the device BARs, ie. an mmap.  The guest enumerates emulated
> > BARs, they're only used for sizing and locating the BARs in the guest
> > physical address space.  When the guest BAR is programmed and memory
> > enabled, the address space in QEMU is populated at the BAR indicated
> > GPA using the mmap backing.  KVM memory slots are used to fill the
> > mappings in the vCPU.  The same BAR mmap is also used to provide DMA
> > mapping of the BAR through the IOMMU in the legacy type1 IOMMU backend
> > case.  Barring a vIOMMU, the IOMMU IOVA space is the guest physical
> > address space.  Thanks,
> >
> > Alex

^ permalink raw reply	[flat|nested] 7+ messages in thread
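[Editor's note: the ECAM/MMCONFIG layout implied by the MCFG question
can be shown concretely. Once firmware publishes an MMCONFIG base
address in the ACPI MCFG table, a config-space register is just a
memory address, computed per the standard ECAM formula below. The base
address used is a common but hypothetical choice; real firmware may
place it elsewhere.]

```python
# ECAM address computation: each bus/device/function gets a 4 KiB
# config-space window inside the MMCONFIG region, at an offset formed
# from (bus << 20 | device << 15 | function << 12).

ECAM_BASE = 0xE000_0000  # hypothetical MMCONFIG base from the MCFG table

def ecam_addr(bus, dev, fn, offset):
    assert bus < 256 and dev < 32 and fn < 8 and offset < 4096
    return ECAM_BASE | (bus << 20) | (dev << 15) | (fn << 12) | offset

# The config dword at offset 0x10 (BAR0) of device 00:02.0:
assert ecam_addr(0, 2, 0, 0x10) == 0xE001_0010
print(hex(ecam_addr(0, 2, 0, 0x10)))
```

This is why the guest kernel needs no BIOS involvement after boot: it
reads the MCFG table once, then performs config accesses as plain MMIO
reads/writes into this region, which the hypervisor traps and emulates
like any other unmapped guest-physical range.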