[RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state

qemu-devel.nongnu.org archive mirror

* [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
@ 2025-02-19 17:58 Eric Auger
  2025-02-19 17:58 ` [RFC 1/2] hw/vfio: Introduce vfio_is_dma_map_allowed() callback Eric Auger
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Eric Auger @ 2025-02-19 17:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, alex.williamson, clg,
	zhenzhong.duan

Since kernel commit:
2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
in D3hot power state")
any attempt to do an mmap access to a BAR when the device is in D3hot
state will generate a fault.

On system_powerdown, if the VFIO device is translated by an IOMMU,
the device is moved to D3hot state and then the vIOMMU gets disabled
by the guest. As a result of this latter operation, the address space is
swapped from translated to untranslated. When re-enabling the aliased
regions, the RAM regions are dma-mapped again and this causes DMA_MAP
faults when attempting the operation on BARs.

To avoid doing the remap on those BARs, we compute whether the
device is in D3hot state and if so, skip the DMA MAP.
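
As a rough illustration of the check described above (this is a
standalone sketch, not the code from the series): the power state is
held in the low two bits of the PMCSR register of the PCI Power
Management capability, and the value 3 encodes D3hot. The helper names
and the "pm_cap == 0 means no PM capability" convention are assumptions
made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PowerState field of PMCSR, bits 1:0 (PCI_PM_CTRL_STATE_MASK in Linux) */
#define PM_CTRL_STATE_MASK  0x0003
/* Encoding of D3hot in the PowerState field */
#define PM_STATE_D3HOT      3

/* Hypothetical helper: extract the power state from a raw PMCSR value. */
static inline uint8_t pmcsr_power_state(uint16_t pmcsr)
{
    return pmcsr & PM_CTRL_STATE_MASK;
}

/*
 * Hypothetical helper: pm_cap is the config-space offset of the PM
 * capability, assumed to be 0 when the device has none. Without the
 * capability there is no PMCSR to read, so mapping stays allowed.
 */
static inline bool bar_dma_map_allowed(uint8_t pm_cap, uint16_t pmcsr)
{
    if (pm_cap == 0) {
        return true;
    }
    return pmcsr_power_state(pmcsr) != PM_STATE_D3HOT;
}
```

The pm_cap guard matters because reading at "pm_cap + PCI_PM_CTRL" with
pm_cap == 0 would land in the standard config header rather than in a
PM capability.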

This series can be found at:
https://github.com/eauger/qemu/tree/d3hot_dma_map

Eric Auger (2):
  hw/vfio: Introduce vfio_is_dma_map_allowed() callback
  hw/vfio/pci: Prevents BARs from being dma mapped in d3hot state

 hw/vfio/common.c              | 57 +++++++++++++++++++++--------------
 hw/vfio/pci.c                 | 22 ++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h | 11 +++++++
 4 files changed, 69 insertions(+), 22 deletions(-)

-- 
2.47.1


* [RFC 1/2] hw/vfio: Introduce vfio_is_dma_map_allowed() callback
  2025-02-19 17:58 [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state Eric Auger
@ 2025-02-19 17:58 ` Eric Auger
  2025-02-19 17:59 ` [RFC 2/2] hw/vfio/pci: Prevents BARs from being dma mapped in d3hot state Eric Auger
  2025-02-19 18:58 ` [RFC 0/2] hw/vfio/pci: Prevent " Alex Williamson
  2 siblings, 0 replies; 12+ messages in thread
From: Eric Auger @ 2025-02-19 17:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, alex.williamson, clg,
	zhenzhong.duan

It may happen that a VFIO device state prevents its regions
from being DMA mapped. Specifically this happens with a VFIO PCI
device in D3hot power state, whose BARs cannot be DMA mapped.
The behavior was introduced by kernel commit:

2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
in D3hot power state")

We introduce a new VFIODeviceOps callback to retrieve whether
DMA MAP is allowed. This callback will be called from the generic
code, in vfio_listener_region_add.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/vfio/vfio-common.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0c60be5b15..92c58f14a0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -182,6 +182,17 @@ struct VFIODeviceOps {
      * Returns zero to indicate success and negative for error
      */
     int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
+
+    /**
+     * @vfio_is_dma_map_allowed
+     *
+     * Returns true if the device regions can be DMA mapped.
+     * It may happen that the device state is not compatible
+     * with such an operation.
+     *
+     * @vdev: #VFIODevice whose power state needs to be tested
+     */
+    bool (*vfio_is_dma_map_allowed)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.47.1
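
For readers unfamiliar with the ops-table pattern, here is a minimal
standalone sketch of how generic code might consult such a callback.
It is not QEMU code: Device, DeviceOps and device_dma_map_allowed() are
hypothetical names, and treating a missing callback as "mapping
allowed" is an assumption of this sketch (the series instead asserts
that the callback exists for PCI devices).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Device Device;

/* Hypothetical ops table, loosely modelled on VFIODeviceOps */
typedef struct DeviceOps {
    /* Optional: NULL means the device never restricts DMA mapping */
    bool (*is_dma_map_allowed)(Device *dev);
} DeviceOps;

struct Device {
    const DeviceOps *ops;
    int power_state;            /* 0 = D0 ... 3 = D3hot */
};

/* Device-specific implementation: refuse mapping in D3hot */
static bool pci_is_dma_map_allowed(Device *dev)
{
    return dev->power_state != 3;
}

static const DeviceOps pci_ops = {
    .is_dma_map_allowed = pci_is_dma_map_allowed,
};

/* A device type that does not implement the callback */
static const DeviceOps platform_ops = {
    .is_dma_map_allowed = NULL,
};

/* Generic code: a missing callback defaults to "allowed" */
static bool device_dma_map_allowed(Device *dev)
{
    if (!dev->ops->is_dma_map_allowed) {
        return true;
    }
    return dev->ops->is_dma_map_allowed(dev);
}
```

Defaulting to "allowed" keeps device types that never change power
state from having to implement the callback at all.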




* [RFC 2/2] hw/vfio/pci: Prevents BARs from being dma mapped in d3hot state
  2025-02-19 17:58 [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state Eric Auger
  2025-02-19 17:58 ` [RFC 1/2] hw/vfio: Introduce vfio_is_dma_map_allowed() callback Eric Auger
@ 2025-02-19 17:59 ` Eric Auger
  2025-02-19 18:58 ` [RFC 0/2] hw/vfio/pci: Prevent " Alex Williamson
  2 siblings, 0 replies; 12+ messages in thread
From: Eric Auger @ 2025-02-19 17:59 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, alex.williamson, clg,
	zhenzhong.duan

Since kernel commit:
2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
in D3hot power state")
any attempt to do an mmap access to a BAR when the device is in D3hot
state will generate a fault.

On system_powerdown, if the VFIO device is translated by an IOMMU,
the device is moved to D3hot state and then the vIOMMU gets disabled
by the guest. As a result of this latter operation, the address space is
swapped from translated to untranslated. When re-enabling the aliased
regions, the RAM regions are dma-mapped again and this causes DMA_MAP
faults when attempting the operation on BARs.

To avoid doing the remap on those BARs, we need to know whether
the device is in an incompatible state.

Implement the vfio_is_dma_map_allowed() callback for PCI devices.
If the device is in D3hot state, skip the DMA MAP in
vfio_listener_region_add().

To ease the implementation, vfio_section_is_vfio_pci now returns
a VFIOPCIDevice pointer and the function is moved before the first
caller.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c     | 57 +++++++++++++++++++++++++++-----------------
 hw/vfio/pci.c        | 22 +++++++++++++++++
 hw/vfio/trace-events |  1 +
 3 files changed, 58 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 173fb3a997..96f401f10a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -555,11 +555,34 @@ static bool vfio_get_section_iova_range(VFIOContainerBase *bcontainer,
     return true;
 }
 
+static VFIOPCIDevice *vfio_section_is_vfio_pci(MemoryRegionSection *section,
+                                               VFIOContainerBase *bcontainer)
+{
+    VFIOPCIDevice *pcidev;
+    VFIODevice *vbasedev;
+    Object *owner;
+
+    owner = memory_region_owner(section->mr);
+
+    QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
+        if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+            continue;
+        }
+        pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+        if (OBJECT(pcidev) == owner) {
+            return pcidev;
+        }
+    }
+
+    return NULL;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
                                                  listener);
+    VFIOPCIDevice *vdev;
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -630,6 +653,18 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
     /* Here we assume that memory_region_is_ram(section->mr)==true */
 
+    /* skip if the region is a BAR and the power state forbids DMA MAP */
+    vdev = vfio_section_is_vfio_pci(section, bcontainer);
+    if (vdev) {
+        VFIODevice *vbasedev = &vdev->vbasedev;
+        assert(vbasedev->ops->vfio_is_dma_map_allowed);
+        if (!vbasedev->ops->vfio_is_dma_map_allowed(vbasedev)) {
+            trace_vfio_listener_region_add_skip(section->mr->name);
+            return;
+        }
+    }
+
+
     /*
      * For RAM memory regions with a RamDiscardManager, we only want to map the
      * actually populated parts - and update the mapping whenever we're notified
@@ -804,28 +839,6 @@ typedef struct VFIODirtyRangesListener {
     MemoryListener listener;
 } VFIODirtyRangesListener;
 
-static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
-                                     VFIOContainerBase *bcontainer)
-{
-    VFIOPCIDevice *pcidev;
-    VFIODevice *vbasedev;
-    Object *owner;
-
-    owner = memory_region_owner(section->mr);
-
-    QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
-        if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
-            continue;
-        }
-        pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
-        if (OBJECT(pcidev) == owner) {
-            return true;
-        }
-    }
-
-    return false;
-}
-
 static void vfio_dirty_tracking_update_range(VFIODirtyRanges *range,
                                              hwaddr iova, hwaddr end,
                                              bool update_pci)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ab17a98ee5..314dddae4a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2653,6 +2653,26 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     return ret;
 }
 
+/*
+ * BARs cannot be dma-mapped if the device is in D3hot state since
+ * linux commit 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block
+ * the access in D3hot power state")
+ */
+static bool vfio_pci_is_dma_map_allowed(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    uint16_t pmcsr;
+    uint8_t state;
+
+    pmcsr = vfio_pci_read_config(&vdev->pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
+    state = pmcsr & PCI_PM_CTRL_STATE_MASK;
+    if (state == 3) { /* PowerState 3 is D3hot */
+        return false;
+    }
+    return true;
+}
+
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
@@ -2660,6 +2680,7 @@ static VFIODeviceOps vfio_pci_ops = {
     .vfio_get_object = vfio_pci_get_object,
     .vfio_save_config = vfio_pci_save_config,
     .vfio_load_config = vfio_pci_load_config,
+    .vfio_is_dma_map_allowed = vfio_pci_is_dma_map_allowed,
 };
 
 bool vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
@@ -3477,3 +3498,4 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index c5385e1a4f..a0d5868c2f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -121,6 +121,7 @@ vfio_legacy_dma_unmap_overflow_workaround(void) ""
 vfio_get_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty_pages) "iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty_pages=%"PRIu64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 vfio_reset_handler(void) ""
+vfio_listener_region_add_skip(const char *name) "DMA MAP would fail on region %s due to incompatible power state, skip it"
 
 # platform.c
 vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"
-- 
2.47.1




* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-19 17:58 [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state Eric Auger
  2025-02-19 17:58 ` [RFC 1/2] hw/vfio: Introduce vfio_is_dma_map_allowed() callback Eric Auger
  2025-02-19 17:59 ` [RFC 2/2] hw/vfio/pci: Prevents BARs from being dma mapped in d3hot state Eric Auger
@ 2025-02-19 18:58 ` Alex Williamson
  2025-02-19 21:19   ` Alex Williamson
  2025-02-20  4:24   ` Duan, Zhenzhong
  2 siblings, 2 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-19 18:58 UTC (permalink / raw)
  To: Eric Auger; +Cc: eric.auger.pro, qemu-devel, clg, zhenzhong.duan

On Wed, 19 Feb 2025 18:58:58 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> Since kernel commit:
> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> in D3hot power state")
> any attempt to do an mmap access to a BAR when the device is in d3hot
> state will generate a fault.
> 
> On system_powerdown, if the VFIO device is translated by an IOMMU,
> the device is moved to D3hot state and then the vIOMMU gets disabled
> by the guest. As a result of this later operation, the address space is
> swapped from translated to untranslated. When re-enabling the aliased
> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> faults when attempting the operation on BARs.
> 
> To avoid doing the remap on those BARs, we compute whether the
> device is in D3hot state and if so, skip the DMA MAP.

Thinking on this some more, QEMU PCI code already manages the device
BARs appearing in the address space based on the memory enable bit in
the command register.  Should we do the same for PM state?

IOW, the device going into low power state should remove the BARs from
the AddressSpace and waking the device should re-add them.  The BAR DMA
mapping should then always be consistent, whereas here nothing would
remap the BARs when the device is woken.

I imagine we'd need an interface to register the PM capability with the
core QEMU PCI code, where address space updates are performed relative
to both memory enable and power status.  There might be a way to
implement this just for vfio-pci devices by toggling the enable state
of the BAR mmaps relative to PM state, but doing it at the PCI core
level seems like it'd provide behavior more true to physical hardware.
Thanks,

Alex
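
The rule suggested above can be read as a single predicate: a BAR is
visible in the address space only if its enable bit is set in the
command register and the device is in D0. The following is a
standalone sketch of that predicate, not the QEMU PCI core
implementation; bar_should_be_mapped(), BarType and the has_pm_cap
parameter are hypothetical, while the bit values are the standard PCI
command register and PMCSR encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMD_IO          0x0001  /* command register: I/O space enable */
#define CMD_MEMORY      0x0002  /* command register: memory space enable */
#define PM_STATE_MASK   0x0003  /* PMCSR PowerState field, bits 1:0 */
#define PM_STATE_D0     0

typedef enum { BAR_IO, BAR_MEMORY } BarType;

/*
 * Hypothetical predicate: decide whether a BAR should appear in the
 * address space, based on both the command register and the power state.
 */
static bool bar_should_be_mapped(BarType type, uint16_t command,
                                 bool has_pm_cap, uint16_t pmcsr)
{
    uint16_t enable = (type == BAR_IO) ? CMD_IO : CMD_MEMORY;

    /* Decoding disabled by the guest: BAR is not visible */
    if (!(command & enable)) {
        return false;
    }
    /* Any state other than D0 hides the BAR */
    if (has_pm_cap && (pmcsr & PM_STATE_MASK) != PM_STATE_D0) {
        return false;
    }
    return true;
}
```

Re-evaluating such a predicate on every write to either register would
cover both directions: the BAR disappears when the device enters a low
power state and reappears when it is woken.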


* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-19 18:58 ` [RFC 0/2] hw/vfio/pci: Prevent " Alex Williamson
@ 2025-02-19 21:19   ` Alex Williamson
  2025-02-20 10:31     ` Eric Auger
  2025-02-20  4:24   ` Duan, Zhenzhong
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2025-02-19 21:19 UTC (permalink / raw)
  To: Eric Auger; +Cc: eric.auger.pro, qemu-devel, clg, zhenzhong.duan

On Wed, 19 Feb 2025 11:58:44 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 19 Feb 2025 18:58:58 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
> > Since kernel commit:
> > 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> > in D3hot power state")
> > any attempt to do an mmap access to a BAR when the device is in d3hot
> > state will generate a fault.
> > 
> > On system_powerdown, if the VFIO device is translated by an IOMMU,
> > the device is moved to D3hot state and then the vIOMMU gets disabled
> > by the guest. As a result of this later operation, the address space is
> > swapped from translated to untranslated. When re-enabling the aliased
> > regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> > faults when attempting the operation on BARs.
> > 
> > To avoid doing the remap on those BARs, we compute whether the
> > device is in D3hot state and if so, skip the DMA MAP.  
> 
> Thinking on this some more, QEMU PCI code already manages the device
> BARs appearing in the address space based on the memory enable bit in
> the command register.  Should we do the same for PM state?
> 
> IOW, the device going into low power state should remove the BARs from
> the AddressSpace and waking the device should re-add them.  The BAR DMA
> mapping should then always be consistent, whereas here nothing would
> remap the BARs when the device is woken.
> 
> I imagine we'd need an interface to register the PM capability with the
> core QEMU PCI code, where address space updates are performed relative
> to both memory enable and power status.  There might be a way to
> implement this just for vfio-pci devices by toggling the enable state
> of the BAR mmaps relative to PM state, but doing it at the PCI core
> level seems like it'd provide behavior more true to physical hardware.

I took a stab at this approach here, it doesn't obviously break
anything in my configs, but I haven't yet tried to reproduce this exact
scenario.

https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state

There's another pm_cap on the PCIExpressDevice that needs to be
consolidated as well, once I do some research to figure out why a
non-express capability is tracked only by express devices and what
they're doing with it.  Thanks,

Alex




* RE: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-19 18:58 ` [RFC 0/2] hw/vfio/pci: Prevent " Alex Williamson
  2025-02-19 21:19   ` Alex Williamson
@ 2025-02-20  4:24   ` Duan, Zhenzhong
  2025-02-20  5:05     ` Alex Williamson
  1 sibling, 1 reply; 12+ messages in thread
From: Duan, Zhenzhong @ 2025-02-20  4:24 UTC (permalink / raw)
  To: Alex Williamson, Eric Auger
  Cc: eric.auger.pro@gmail.com, qemu-devel@nongnu.org, clg@redhat.com



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
>d3hot state
>
>On Wed, 19 Feb 2025 18:58:58 +0100
>Eric Auger <eric.auger@redhat.com> wrote:
>
>> Since kernel commit:
>> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
>> in D3hot power state")
>> any attempt to do an mmap access to a BAR when the device is in d3hot
>> state will generate a fault.
>>
>> On system_powerdown, if the VFIO device is translated by an IOMMU,
>> the device is moved to D3hot state and then the vIOMMU gets disabled
>> by the guest. As a result of this later operation, the address space is
>> swapped from translated to untranslated. When re-enabling the aliased
>> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
>> faults when attempting the operation on BARs.
>>
>> To avoid doing the remap on those BARs, we compute whether the
>> device is in D3hot state and if so, skip the DMA MAP.
>
>Thinking on this some more, QEMU PCI code already manages the device
>BARs appearing in the address space based on the memory enable bit in
>the command register.  Should we do the same for PM state?
>
>IOW, the device going into low power state should remove the BARs from
>the AddressSpace and waking the device should re-add them.  The BAR DMA
>mapping should then always be consistent, whereas here nothing would
>remap the BARs when the device is woken.

If BARs should be disabled before the D3hot transition, isn't it the guest's responsibility to do that itself?
Just like what is done for FLR, which calls pci_dev_save_and_disable().

Thanks
Zhenzhong

>
>I imagine we'd need an interface to register the PM capability with the
>core QEMU PCI code, where address space updates are performed relative
>to both memory enable and power status.  There might be a way to
>implement this just for vfio-pci devices by toggling the enable state
>of the BAR mmaps relative to PM state, but doing it at the PCI core
>level seems like it'd provide behavior more true to physical hardware.
>Thanks,
>
>Alex




* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-20  4:24   ` Duan, Zhenzhong
@ 2025-02-20  5:05     ` Alex Williamson
  2025-02-20  8:25       ` Duan, Zhenzhong
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2025-02-20  5:05 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Eric Auger, eric.auger.pro@gmail.com, qemu-devel@nongnu.org,
	clg@redhat.com

On Thu, 20 Feb 2025 04:24:13 +0000
"Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:

> >-----Original Message-----
> >From: Alex Williamson <alex.williamson@redhat.com>
> >Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
> >d3hot state
> >
> >On Wed, 19 Feb 2025 18:58:58 +0100
> >Eric Auger <eric.auger@redhat.com> wrote:
> >  
> >> Since kernel commit:
> >> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> >> in D3hot power state")
> >> any attempt to do an mmap access to a BAR when the device is in d3hot
> >> state will generate a fault.
> >>
> >> On system_powerdown, if the VFIO device is translated by an IOMMU,
> >> the device is moved to D3hot state and then the vIOMMU gets disabled
> >> by the guest. As a result of this later operation, the address space is
> >> swapped from translated to untranslated. When re-enabling the aliased
> >> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> >> faults when attempting the operation on BARs.
> >>
> >> To avoid doing the remap on those BARs, we compute whether the
> >> device is in D3hot state and if so, skip the DMA MAP.  
> >
> >Thinking on this some more, QEMU PCI code already manages the device
> >BARs appearing in the address space based on the memory enable bit in
> >the command register.  Should we do the same for PM state?
> >
> >IOW, the device going into low power state should remove the BARs from
> >the AddressSpace and waking the device should re-add them.  The BAR DMA
> >mapping should then always be consistent, whereas here nothing would
> >remap the BARs when the device is woken.  
> 
> If BARs should be disabled before D3hot transition, isn't it guest's responsibility to do that itself?
> Just like what have been done for FLR which calls pci_dev_save_and_disable().

Nothing requires the guest to clear memory and IO from the command
register before entering a low power state, nor are we going to get
very far arguing that it's the guest's fault for triggering an error in
the hypervisor.  The PCI spec indicates that memory and IO BARs are only
accessible when the device is in the D0 power state.  On bare metal
accessing the BAR for a device in a low power state would generate an
unsupported request.  Therefore why should QEMU map BARs of devices in
low power states into the address space?  Thanks,

Alex




* RE: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-20  5:05     ` Alex Williamson
@ 2025-02-20  8:25       ` Duan, Zhenzhong
  0 siblings, 0 replies; 12+ messages in thread
From: Duan, Zhenzhong @ 2025-02-20  8:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eric Auger, eric.auger.pro@gmail.com, qemu-devel@nongnu.org,
	clg@redhat.com



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
>d3hot state
>
>On Thu, 20 Feb 2025 04:24:13 +0000
>"Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:
>
>> >-----Original Message-----
>> >From: Alex Williamson <alex.williamson@redhat.com>
>> >Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
>> >d3hot state
>> >
>> >On Wed, 19 Feb 2025 18:58:58 +0100
>> >Eric Auger <eric.auger@redhat.com> wrote:
>> >
>> >> Since kernel commit:
>> >> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
>> >> in D3hot power state")
>> >> any attempt to do an mmap access to a BAR when the device is in d3hot
>> >> state will generate a fault.
>> >>
>> >> On system_powerdown, if the VFIO device is translated by an IOMMU,
>> >> the device is moved to D3hot state and then the vIOMMU gets disabled
>> >> by the guest. As a result of this later operation, the address space is
>> >> swapped from translated to untranslated. When re-enabling the aliased
>> >> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
>> >> faults when attempting the operation on BARs.
>> >>
>> >> To avoid doing the remap on those BARs, we compute whether the
>> >> device is in D3hot state and if so, skip the DMA MAP.
>> >
>> >Thinking on this some more, QEMU PCI code already manages the device
>> >BARs appearing in the address space based on the memory enable bit in
>> >the command register.  Should we do the same for PM state?
>> >
>> >IOW, the device going into low power state should remove the BARs from
>> >the AddressSpace and waking the device should re-add them.  The BAR DMA
>> >mapping should then always be consistent, whereas here nothing would
>> >remap the BARs when the device is woken.
>>
>> If BARs should be disabled before D3hot transition, isn't it guest's responsibility to do that itself?
>> Just like what have been done for FLR which calls pci_dev_save_and_disable().
>
>Nothing requires the guest to clear memory and IO from the command
>register before entering a low power state, nor are we going to get
>very far arguing that it's the guest's fault for triggering an error in
>the hypervisor.  The PCI spec indicates that memory and IO BARs are only
>accessible when the device is in the D0 power state.  On bare metal
>accessing the BAR for a device in a low power state would generate an
>unsupported request.

Understood, yes it makes sense to remove BARs from AddressSpace when D3hot.

> Therefore why should QEMU map BARs of devices in
>low power states into the address space?
Should not.

Thanks
Zhenzhong




* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-19 21:19   ` Alex Williamson
@ 2025-02-20 10:31     ` Eric Auger
  2025-02-20 10:45       ` Eric Auger
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Auger @ 2025-02-20 10:31 UTC (permalink / raw)
  To: Alex Williamson; +Cc: eric.auger.pro, qemu-devel, clg, zhenzhong.duan


Hi Alex,

On 2/19/25 10:19 PM, Alex Williamson wrote:
> On Wed, 19 Feb 2025 11:58:44 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
>> On Wed, 19 Feb 2025 18:58:58 +0100
>> Eric Auger <eric.auger@redhat.com> wrote:
>>
>>> Since kernel commit:
>>> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
>>> in D3hot power state")
>>> any attempt to do an mmap access to a BAR when the device is in d3hot
>>> state will generate a fault.
>>>
>>> On system_powerdown, if the VFIO device is translated by an IOMMU,
>>> the device is moved to D3hot state and then the vIOMMU gets disabled
>>> by the guest. As a result of this later operation, the address space is
>>> swapped from translated to untranslated. When re-enabling the aliased
>>> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
>>> faults when attempting the operation on BARs.
>>>
>>> To avoid doing the remap on those BARs, we compute whether the
>>> device is in D3hot state and if so, skip the DMA MAP.  
>> Thinking on this some more, QEMU PCI code already manages the device
>> BARs appearing in the address space based on the memory enable bit in
>> the command register.  Should we do the same for PM state?
>>
>> IOW, the device going into low power state should remove the BARs from
>> the AddressSpace and waking the device should re-add them.  The BAR DMA
>> mapping should then always be consistent, whereas here nothing would
>> remap the BARs when the device is woken.
>>
>> I imagine we'd need an interface to register the PM capability with the
>> core QEMU PCI code, where address space updates are performed relative
>> to both memory enable and power status.  There might be a way to
>> implement this just for vfio-pci devices by toggling the enable state
>> of the BAR mmaps relative to PM state, but doing it at the PCI core
>> level seems like it'd provide behavior more true to physical hardware.
> I took a stab at this approach here, it doesn't obviously break
> anything in my configs, but I haven't yet tried to reproduce this exact
> scenario.
>
> https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state

So if I understand correctly the BAR regions will disappear upon the
config cmd write in vfio_sub_page_bar_update_mapping(). Is that correct?
I will give it a try on my setup...
>
> There's another pm_cap on the PCIExpressDevice that needs to be
> consolidated as well, once I do some research to figure out why a
> non-express capability is tracked only by express devices and what
> they're doing with it.  Thanks,
I am not sure I get this last point though.

Thanks

Eric
>
> Alex
>




* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-20 10:31     ` Eric Auger
@ 2025-02-20 10:45       ` Eric Auger
  2025-02-20 15:07         ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Auger @ 2025-02-20 10:45 UTC (permalink / raw)
  To: eric.auger, Alex Williamson
  Cc: eric.auger.pro, qemu-devel, clg, zhenzhong.duan

Hi Alex,

On 2/20/25 11:31 AM, Eric Auger wrote:
> 
> Hi Alex,
> 
> On 2/19/25 10:19 PM, Alex Williamson wrote:
>> On Wed, 19 Feb 2025 11:58:44 -0700
>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>
>>> On Wed, 19 Feb 2025 18:58:58 +0100
>>> Eric Auger <eric.auger@redhat.com> wrote:
>>>
>>>> Since kernel commit:
>>>> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
>>>> in D3hot power state")
>>>> any attempt to do an mmap access to a BAR when the device is in d3hot
>>>> state will generate a fault.
>>>>
>>>> On system_powerdown, if the VFIO device is translated by an IOMMU,
>>>> the device is moved to D3hot state and then the vIOMMU gets disabled
>>>> by the guest. As a result of this later operation, the address space is
>>>> swapped from translated to untranslated. When re-enabling the aliased
>>>> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
>>>> faults when attempting the operation on BARs.
>>>>
>>>> To avoid doing the remap on those BARs, we compute whether the
>>>> device is in D3hot state and if so, skip the DMA MAP.  
>>> Thinking on this some more, QEMU PCI code already manages the device
>>> BARs appearing in the address space based on the memory enable bit in
>>> the command register.  Should we do the same for PM state?
>>>
>>> IOW, the device going into low power state should remove the BARs from
>>> the AddressSpace and waking the device should re-add them.  The BAR DMA
>>> mapping should then always be consistent, whereas here nothing would
>>> remap the BARs when the device is woken.
>>>
>>> I imagine we'd need an interface to register the PM capability with the
>>> core QEMU PCI code, where address space updates are performed relative
>>> to both memory enable and power status.  There might be a way to
>>> implement this just for vfio-pci devices by toggling the enable state
>>> of the BAR mmaps relative to PM state, but doing it at the PCI core
>>> level seems like it'd provide behavior more true to physical hardware.
>> I took a stab at this approach here, it doesn't obviously break
>> anything in my configs, but I haven't yet tried to reproduce this exact
>> scenario.
>>
>> https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state

It does not totally fix the issue. I now get:

qemu-system-x86_64: warning: vfio_container_dma_map(0x55cc25705680,
0x380000000000, 0x1000000, 0x7f8762000000) = -14 (Bad address)
0000:41:00.0: PCI peer-to-peer transactions on BARs are not supported.


Eric

> 
> So if I understand correctly the BAR regions will disappear upon the
> config cmd write in vfio_sub_page_bar_update_mapping(). Is that correct?
> I will give it a try on my setup...
>>
>> There's another pm_cap on the PCIExpressDevice that needs to be
>> consolidated as well, once I do some research to figure out why a
>> non-express capability is tracked only by express devices and what
>> they're doing with it.  Thanks,
> I am not sure I get this last point though.
> 
> Thanks
> 
> Eric
>>
>> Alex
>>
> 




* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-20 10:45       ` Eric Auger
@ 2025-02-20 15:07         ` Alex Williamson
  2025-02-20 15:48           ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2025-02-20 15:07 UTC (permalink / raw)
  To: Eric Auger; +Cc: eric.auger, eric.auger.pro, qemu-devel, clg, zhenzhong.duan

On Thu, 20 Feb 2025 11:45:35 +0100
Eric Auger <eauger@redhat.com> wrote:

> Hi Alex,
> 
> On 2/20/25 11:31 AM, Eric Auger wrote:
> > 
> > Hi Alex,
> > 
> > On 2/19/25 10:19 PM, Alex Williamson wrote:  
> >> On Wed, 19 Feb 2025 11:58:44 -0700
> >> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>  
> >>> On Wed, 19 Feb 2025 18:58:58 +0100
> >>> Eric Auger <eric.auger@redhat.com> wrote:
> >>>  
> >>>> Since kernel commit:
> >>>> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> >>>> in D3hot power state")
> >>>> any attempt to do an mmap access to a BAR when the device is in d3hot
> >>>> state will generate a fault.
> >>>>
> >>>> On system_powerdown, if the VFIO device is translated by an IOMMU,
> >>>> the device is moved to D3hot state and then the vIOMMU gets disabled
> >>>> by the guest. As a result of this later operation, the address space is
> >>>> swapped from translated to untranslated. When re-enabling the aliased
> >>>> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> >>>> faults when attempting the operation on BARs.
> >>>>
> >>>> To avoid doing the remap on those BARs, we compute whether the
> >>>> device is in D3hot state and if so, skip the DMA MAP.    
> >>> Thinking on this some more, QEMU PCI code already manages the device
> >>> BARs appearing in the address space based on the memory enable bit in
> >>> the command register.  Should we do the same for PM state?
> >>>
> >>> IOW, the device going into low power state should remove the BARs from
> >>> the AddressSpace and waking the device should re-add them.  The BAR DMA
> >>> mapping should then always be consistent, whereas here nothing would
> >>> remap the BARs when the device is woken.
> >>>
> >>> I imagine we'd need an interface to register the PM capability with the
> >>> core QEMU PCI code, where address space updates are performed relative
> >>> to both memory enable and power status.  There might be a way to
> >>> implement this just for vfio-pci devices by toggling the enable state
> >>> of the BAR mmaps relative to PM state, but doing it at the PCI core
> >>> level seems like it'd provide behavior more true to physical hardware.  
> >> I took a stab at this approach here, it doesn't obviously break
> >> anything in my configs, but I haven't yet tried to reproduce this exact
> >> scenario.
> >>
> >> https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state  
> 
> it does not totally fix the issue: I now get:
> 
> qemu-system-x86_64: warning: vfio_container_dma_map(0x55cc25705680,
> 0x380000000000, 0x1000000, 0x7f8762000000) = -14 (Bad address)
> 0000:41:00.0: PCI peer-to-peer transactions on BARs are not supported.

Hmm, I'll reproduce and debug further.  The intention here is that BARs
for a device in D3hot would not be DMA mapped, effectively as if the
memory enable bit in the command register were cleared, so I'd hoped
the listener would not be called for this range.

> > So if I understand correctly the BAR regions will disappear upon the
> > config cmd write in vfio_sub_page_bar_update_mapping(). Is that correct?
> > I will give it a try on my setup...  
> >>
> >> There's another pm_cap on the PCIExpressDevice that needs to be
> >> consolidated as well, once I do some research to figure out why a
> >> non-express capability is tracked only by express devices and what
> >> they're doing with it.  Thanks,  
> > I am not sure I get this last point though.

I added a patch to my branch that removes the redundant pm_cap from the
PCIExpressDevice.  It just wasn't clear, and really still isn't, why
this cap offset had been cached on the express object rather than the
conventional PCI device object.  Regardless, it can be consolidated.
Thanks,

Alex
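
For illustration only, the intended rule (a BAR is visible in the
address space, and thus eligible for DMA mapping, only when memory
decode is enabled and the device is in D0) could be sketched as below.
This is not code from QEMU or from the branch above; bar_is_mapped()
is a hypothetical helper, and the two constants are assumed to carry
the standard PCI register values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI register bits, redefined here so the sketch stands alone. */
#define PCI_COMMAND_MEMORY      0x0002  /* command register: memory space enable */
#define PCI_PM_CTRL_STATE_MASK  0x0003  /* PMCSR: power state field (0 = D0, 3 = D3hot) */

/*
 * Hypothetical helper: decide whether a device's memory BARs should be
 * present in the address space. The BARs are exposed only if memory
 * decode is enabled in the command register AND the power state field
 * of the PM control/status register reads D0.
 */
static bool bar_is_mapped(uint16_t command, uint16_t pmcsr)
{
    bool mem_enabled = (command & PCI_COMMAND_MEMORY) != 0;
    bool in_d0 = (pmcsr & PCI_PM_CTRL_STATE_MASK) == 0;

    return mem_enabled && in_d0;
}
```

With such a rule, a device moved to D3hot drops its BARs exactly as if
the guest had cleared the memory enable bit, so a later address space
switch has no BAR range to remap.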




* Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
  2025-02-20 15:07         ` Alex Williamson
@ 2025-02-20 15:48           ` Alex Williamson
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-20 15:48 UTC (permalink / raw)
  To: Eric Auger; +Cc: eric.auger, eric.auger.pro, qemu-devel, clg, zhenzhong.duan

On Thu, 20 Feb 2025 08:07:23 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 20 Feb 2025 11:45:35 +0100
> Eric Auger <eauger@redhat.com> wrote:
> 
> > Hi Alex,
> > 
> > On 2/20/25 11:31 AM, Eric Auger wrote:  
> > > 
> > > Hi Alex,
> > > 
> > > On 2/19/25 10:19 PM, Alex Williamson wrote:    
> > >> On Wed, 19 Feb 2025 11:58:44 -0700
> > >> Alex Williamson <alex.williamson@redhat.com> wrote:
> > >>    
> > >>> On Wed, 19 Feb 2025 18:58:58 +0100
> > >>> Eric Auger <eric.auger@redhat.com> wrote:
> > >>>    
> > >>>> Since kernel commit:
> > >>>> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> > >>>> in D3hot power state")
> > >>>> any attempt to do an mmap access to a BAR when the device is in d3hot
> > >>>> state will generate a fault.
> > >>>>
> > >>>> On system_powerdown, if the VFIO device is translated by an IOMMU,
> > >>>> the device is moved to D3hot state and then the vIOMMU gets disabled
> > >>>> by the guest. As a result of this later operation, the address space is
> > >>>> swapped from translated to untranslated. When re-enabling the aliased
> > >>>> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> > >>>> faults when attempting the operation on BARs.
> > >>>>
> > >>>> To avoid doing the remap on those BARs, we compute whether the
> > >>>> device is in D3hot state and if so, skip the DMA MAP.      
> > >>> Thinking on this some more, QEMU PCI code already manages the device
> > >>> BARs appearing in the address space based on the memory enable bit in
> > >>> the command register.  Should we do the same for PM state?
> > >>>
> > >>> IOW, the device going into low power state should remove the BARs from
> > >>> the AddressSpace and waking the device should re-add them.  The BAR DMA
> > >>> mapping should then always be consistent, whereas here nothing would
> > >>> remap the BARs when the device is woken.
> > >>>
> > >>> I imagine we'd need an interface to register the PM capability with the
> > >>> core QEMU PCI code, where address space updates are performed relative
> > >>> to both memory enable and power status.  There might be a way to
> > >>> implement this just for vfio-pci devices by toggling the enable state
> > >>> of the BAR mmaps relative to PM state, but doing it at the PCI core
> > >>> level seems like it'd provide behavior more true to physical hardware.    
> > >> I took a stab at this approach here, it doesn't obviously break
> > >> anything in my configs, but I haven't yet tried to reproduce this exact
> > >> scenario.
> > >>
> > >> https://gitlab.com/alex.williamson/qemu/-/tree/pci-pm-power-state    
> > 
> > it does not totally fix the issue: I now get:
> > 
> > qemu-system-x86_64: warning: vfio_container_dma_map(0x55cc25705680,
> > 0x380000000000, 0x1000000, 0x7f8762000000) = -14 (Bad address)
> > 0000:41:00.0: PCI peer-to-peer transactions on BARs are not supported.  
> 
> Hmm, I'll reproduce and debug further.  The intention here is that BARs
> for the device in D3hot would not be DMA mapped, effectively as if the
> memory enable bit in the command register were cleared, therefore I'd
> hoped the listener is not called for this range.

I forgot to mark the PM state field as writable in config space, so we
were always reading back D0 state.  Adding the following to
pci_pm_init() resolves it:

--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -445,6 +445,7 @@ int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
 
     d->pm_cap = cap;
     d->cap_present |= QEMU_PCI_CAP_PM;
+    pci_set_word(d->wmask + cap + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
 
     return cap;
 }

Changing this might cause a problem with migration; ISTR we validate
the wmask against the source.  Anyway, I'll post the series and we can
test further and discuss it there.  Thanks,

Alex
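
As a rough sketch of why the missing write mask made the state always
read back as D0: guest config writes only change bits that are set in
the device's write mask. The toy model below is an assumption made for
illustration, not QEMU code; struct toy_dev and the toy_* helpers are
hypothetical, and only the masking idea is meant to match.

```c
#include <stdint.h>

#define PCI_PM_CTRL             4       /* PMCSR offset within the PM capability */
#define PCI_PM_CTRL_STATE_MASK  0x0003  /* PMCSR: power state field */

/* Hypothetical toy device: config space plus a per-byte write mask. */
struct toy_dev {
    uint8_t config[256];
    uint8_t wmask[256];
};

/* Store a little-endian 16-bit value, as config space is laid out. */
static void toy_set_word(uint8_t *p, uint16_t val)
{
    p[0] = (uint8_t)(val & 0xff);
    p[1] = (uint8_t)(val >> 8);
}

/* Guest-visible write: only bits set in wmask are updated. */
static void toy_config_write16(struct toy_dev *d, unsigned addr, uint16_t val)
{
    for (unsigned i = 0; i < 2; i++) {
        uint8_t byte = (uint8_t)(val >> (8 * i));
        uint8_t mask = d->wmask[addr + i];

        d->config[addr + i] =
            (uint8_t)((d->config[addr + i] & (uint8_t)~mask) | (byte & mask));
    }
}

/* Guest-visible read of a little-endian 16-bit register. */
static uint16_t toy_config_read16(const struct toy_dev *d, unsigned addr)
{
    return (uint16_t)(d->config[addr] | (d->config[addr + 1] << 8));
}
```

In this model, writing 3 (D3hot) to the PMCSR with a zero mask leaves
the register at 0 (D0); once the state field is marked writable with
toy_set_word(d.wmask + cap + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK), the
same write sticks and the power state can be observed.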




Thread overview: 12+ messages
2025-02-19 17:58 [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state Eric Auger
2025-02-19 17:58 ` [RFC 1/2] hw/vfio: Introduce vfio_is_dma_map_allowed() callback Eric Auger
2025-02-19 17:59 ` [RFC 2/2] hw/vfio/pci: Prevents BARs from being dma mapped in d3hot state Eric Auger
2025-02-19 18:58 ` [RFC 0/2] hw/vfio/pci: Prevent " Alex Williamson
2025-02-19 21:19   ` Alex Williamson
2025-02-20 10:31     ` Eric Auger
2025-02-20 10:45       ` Eric Auger
2025-02-20 15:07         ` Alex Williamson
2025-02-20 15:48           ` Alex Williamson
2025-02-20  4:24   ` Duan, Zhenzhong
2025-02-20  5:05     ` Alex Williamson
2025-02-20  8:25       ` Duan, Zhenzhong
