qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] PCI: Implement basic PCI PM capability backing
@ 2025-02-20 22:48 Alex Williamson
  2025-02-20 22:48 ` [PATCH 1/5] hw/pci: Basic support for PCI power management Alex Williamson
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Alex Williamson @ 2025-02-20 22:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

Eric recently identified an issue[1] where during graceful shutdown
of a VM in a vIOMMU configuration, the guest driver places the device
into the D3 power state, the vIOMMU is then disabled, triggering an
AddressSpace update.  The device BARs are still mapped into the AS,
but the vfio host driver refuses to DMA map the MMIO space due to the
device power state.

The proposed solution in [1] was to skip mappings based on the
device power state.  Here we take a different approach.  The PCI spec
defines that devices in D1/2/3 power state should respond only to
configuration and message requests and all other requests should be
handled as an Unsupported Request.  In other words, the memory and
IO BARs are not accessible except when the device is in the D0 power
state.

To emulate this behavior, we can factor the device power state into
the mapping state of the device BARs.  Therefore the BAR is marked
as unmapped if either the respective command register enable bit is
clear or the device is not in the D0 power state.

In order to implement this, the PowerState field of the PMCSR
register becomes writable, which allows the device to appear in
lower power states.  This also therefore implements D3 support
(insofar as the BAR behavior) for all devices implementing the PM
capability.  The PCI spec requires D3 support.

An aspect that needs attention here is whether this change in the
wmask and PMCSR bits becomes a problem for migration, and how we
might solve it.  For a guest migrating old->new, the device would
always be in the D0 power state, but the register becomes writable.
In the opposite direction, is it possible that a device could
migrate in a low power state and be stuck there since the bits are
read-only in old QEMU?  Do we need an option for this behavior and a
machine state bump, or are there alternatives?

Thanks,
Alex

[1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/

Alex Williamson (5):
  hw/pci: Basic support for PCI power management
  pci: Use PCI PM capability initializer
  vfio/pci: Delete local pm_cap
  pcie, virtio: Remove redundant pm_cap
  hw/vfio/pci: Re-order pre-reset

 hw/net/e1000e.c                 |  3 +-
 hw/net/eepro100.c               |  4 +-
 hw/net/igb.c                    |  3 +-
 hw/nvme/ctrl.c                  |  3 +-
 hw/pci-bridge/pcie_pci_bridge.c |  3 +-
 hw/pci/pci.c                    | 83 ++++++++++++++++++++++++++++++++-
 hw/pci/trace-events             |  2 +
 hw/vfio/pci.c                   | 29 ++++++------
 hw/vfio/pci.h                   |  1 -
 hw/virtio/virtio-pci.c          | 11 ++---
 include/hw/pci/pci.h            |  3 ++
 include/hw/pci/pci_device.h     |  3 ++
 include/hw/pci/pcie.h           |  2 -
 13 files changed, 112 insertions(+), 38 deletions(-)

-- 
2.48.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/5] hw/pci: Basic support for PCI power management
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
@ 2025-02-20 22:48 ` Alex Williamson
  2025-02-24 19:03   ` Eric Auger
  2025-02-20 22:48 ` [PATCH 2/5] pci: Use PCI PM capability initializer Alex Williamson
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2025-02-20 22:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

The memory and IO BARs for devices are only accessible in the D0
power state.  In other power states the PCI spec defines that the
device should respond to TLPs and messages with an Unsupported
Request response.  The closest we can come to emulating this
behavior is to consider the BARs as unmapped, removing them from
the address space.

Introduce an interface to register the PM capability such that
the device power state is considered relative to the BAR mapping
state.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci/pci.c                | 83 ++++++++++++++++++++++++++++++++++++-
 hw/pci/trace-events         |  2 +
 include/hw/pci/pci.h        |  3 ++
 include/hw/pci/pci_device.h |  3 ++
 4 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 2afa423925c5..4f2554dd63ac 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -435,6 +435,74 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
                          attrs, NULL);
 }
 
+int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
+{
+    int cap = pci_add_capability(d, PCI_CAP_ID_PM, offset, PCI_PM_SIZEOF, errp);
+
+    if (cap < 0) {
+        return cap;
+    }
+
+    d->pm_cap = cap;
+    d->cap_present |= QEMU_PCI_CAP_PM;
+    pci_set_word(d->wmask + cap + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
+
+    return cap;
+}
+
+static uint8_t pci_pm_state(PCIDevice *d)
+{
+    uint16_t pmcsr;
+
+    if (!(d->cap_present & QEMU_PCI_CAP_PM)) {
+        return 0;
+    }
+
+    pmcsr = pci_get_word(d->config + d->pm_cap + PCI_PM_CTRL);
+
+    return pmcsr & PCI_PM_CTRL_STATE_MASK;
+}
+
+static uint8_t pci_pm_update(PCIDevice *d, uint32_t addr, int l, uint8_t old)
+{
+    uint16_t pmc;
+    uint8_t new;
+
+    if (!(d->cap_present & QEMU_PCI_CAP_PM) ||
+        !range_covers_byte(addr, l, d->pm_cap + PCI_PM_CTRL)) {
+        return old;
+    }
+
+    new = pci_pm_state(d);
+    if (new == old) {
+        return old;
+    }
+
+    pmc = pci_get_word(d->config + d->pm_cap + PCI_PM_PMC);
+
+    /*
+     * Transitions to D1 & D2 are only allowed if supported.  Devices may
+     * only transition to higher D-states or to D0.  The write is dropped
+     * for rejected transitions by restoring the old state.
+     */
+    if ((!(pmc & PCI_PM_CAP_D1) && new == 1) ||
+        (!(pmc & PCI_PM_CAP_D2) && new == 2) ||
+        (old && new && new < old)) {
+        pci_word_test_and_clear_mask(d->config + d->pm_cap + PCI_PM_CTRL,
+                                     PCI_PM_CTRL_STATE_MASK);
+        pci_word_test_and_set_mask(d->config + d->pm_cap + PCI_PM_CTRL,
+                                   old);
+        trace_pci_pm_bad_transition(d->name, pci_dev_bus_num(d),
+                                    PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                                    old, new);
+        return old;
+    }
+
+    trace_pci_pm_transition(d->name, pci_dev_bus_num(d), PCI_SLOT(d->devfn),
+                            PCI_FUNC(d->devfn), old, new);
+    return new;
+}
+
 static void pci_reset_regions(PCIDevice *dev)
 {
     int r;
@@ -474,6 +542,11 @@ static void pci_do_device_reset(PCIDevice *dev)
                               pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
                               pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
     dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
+    /* Default PM state is D0 */
+    if (dev->cap_present & QEMU_PCI_CAP_PM) {
+        pci_word_test_and_clear_mask(dev->config + dev->pm_cap + PCI_PM_CTRL,
+                                     PCI_PM_CTRL_STATE_MASK);
+    }
     pci_reset_regions(dev);
     pci_update_mappings(dev);
 
@@ -1598,7 +1671,7 @@ static void pci_update_mappings(PCIDevice *d)
             continue;
 
         new_addr = pci_bar_address(d, i, r->type, r->size);
-        if (!d->enabled) {
+        if (!d->enabled || pci_pm_state(d)) {
             new_addr = PCI_BAR_UNMAPPED;
         }
 
@@ -1664,6 +1737,7 @@ uint32_t pci_default_read_config(PCIDevice *d,
 
 void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
 {
+    uint8_t new_pm_state, old_pm_state = pci_pm_state(d);
     int i, was_irq_disabled = pci_irq_disabled(d);
     uint32_t val = val_in;
 
@@ -1676,11 +1750,16 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
         d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
         d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
     }
+
+    new_pm_state = pci_pm_update(d, addr, l, old_pm_state);
+
     if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
-        range_covers_byte(addr, l, PCI_COMMAND))
+        range_covers_byte(addr, l, PCI_COMMAND) ||
+        !!new_pm_state != !!old_pm_state) {
         pci_update_mappings(d);
+    }
 
     if (ranges_overlap(addr, l, PCI_COMMAND, 2)) {
         pci_update_irq_disabled(d, was_irq_disabled);
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index 19643aa8c6b0..811354f29031 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -1,6 +1,8 @@
 # See docs/devel/tracing.rst for syntax documentation.
 
 # pci.c
+pci_pm_bad_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x bad PM transition D%d->D%d"
+pci_pm_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x PM transition D%d->D%d"
 pci_update_mappings_del(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
 pci_update_mappings_add(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
 pci_route_irq(int dev_irq, const char *dev_path, int parent_irq, const char *parent_path) "IRQ %d @%s -> IRQ %d @%s"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 4002bbeebde0..c220cc844962 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -216,6 +216,8 @@ enum {
     QEMU_PCIE_ARI_NEXTFN_1 = (1 << QEMU_PCIE_ARI_NEXTFN_1_BITNR),
 #define QEMU_PCIE_EXT_TAG_BITNR 13
     QEMU_PCIE_EXT_TAG = (1 << QEMU_PCIE_EXT_TAG_BITNR),
+#define QEMU_PCI_CAP_PM_BITNR 14
+    QEMU_PCI_CAP_PM = (1 << QEMU_PCI_CAP_PM_BITNR),
 };
 
 typedef struct PCIINTxRoute {
@@ -676,5 +678,6 @@ static inline void pci_irq_deassert(PCIDevice *pci_dev)
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
 void pci_set_enabled(PCIDevice *pci_dev, bool state);
 void pci_set_power(PCIDevice *pci_dev, bool state);
+int pci_pm_init(PCIDevice *pci_dev, uint8_t offset, Error **errp);
 
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index add208edfabd..345b12eaac1a 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -105,6 +105,9 @@ struct PCIDevice {
     /* Capability bits */
     uint32_t cap_present;
 
+    /* Offset of PM capability in config space */
+    uint8_t pm_cap;
+
     /* Offset of MSI-X capability in config space */
     uint8_t msix_cap;
 
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/5] pci: Use PCI PM capability initializer
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
  2025-02-20 22:48 ` [PATCH 1/5] hw/pci: Basic support for PCI power management Alex Williamson
@ 2025-02-20 22:48 ` Alex Williamson
  2025-02-24 18:37   ` Eric Auger
  2025-02-20 22:48 ` [PATCH 3/5] vfio/pci: Delete local pm_cap Alex Williamson
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2025-02-20 22:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum, Dmitry Fleytman, Akihiko Odaki, Jason Wang,
	Stefan Weil, Sriram Yagnaraman, Keith Busch, Klaus Jensen,
	Jesper Devantier

Switch callers directly initializing the PCI PM capability with
pci_add_capability() to use pci_pm_init().

Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com>
Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stefan Weil <sw@weilnetz.de>
Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Klaus Jensen <its@irrelevant.dk>
Cc: Jesper Devantier <foss@defmacro.it>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/net/e1000e.c                 | 3 +--
 hw/net/eepro100.c               | 4 +---
 hw/net/igb.c                    | 3 +--
 hw/nvme/ctrl.c                  | 3 +--
 hw/pci-bridge/pcie_pci_bridge.c | 2 +-
 hw/vfio/pci.c                   | 2 +-
 hw/virtio/virtio-pci.c          | 3 +--
 7 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index f637853073e2..b72cbab7e889 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -372,8 +372,7 @@ static int
 e1000e_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
 {
     Error *local_err = NULL;
-    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
-                                 PCI_PM_SIZEOF, &local_err);
+    int ret = pci_pm_init(pdev, offset, &local_err);
 
     if (local_err) {
         error_report_err(local_err);
diff --git a/hw/net/eepro100.c b/hw/net/eepro100.c
index 6d853229aec2..29a39865a608 100644
--- a/hw/net/eepro100.c
+++ b/hw/net/eepro100.c
@@ -551,9 +551,7 @@ static void e100_pci_reset(EEPRO100State *s, Error **errp)
     if (info->power_management) {
         /* Power Management Capabilities */
         int cfg_offset = 0xdc;
-        int r = pci_add_capability(&s->dev, PCI_CAP_ID_PM,
-                                   cfg_offset, PCI_PM_SIZEOF,
-                                   errp);
+        int r = pci_pm_init(&s->dev, cfg_offset, errp);
         if (r < 0) {
             return;
         }
diff --git a/hw/net/igb.c b/hw/net/igb.c
index 4d93ce629f95..700dbc746d3d 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -356,8 +356,7 @@ static int
 igb_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
 {
     Error *local_err = NULL;
-    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
-                                 PCI_PM_SIZEOF, &local_err);
+    int ret = pci_pm_init(pdev, offset, &local_err);
 
     if (local_err) {
         error_report_err(local_err);
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 68903d1d7067..1faea3d2b85b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8503,8 +8503,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
     Error *err = NULL;
     int ret;
 
-    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
-                             PCI_PM_SIZEOF, &err);
+    ret = pci_pm_init(pci_dev, offset, &err);
     if (err) {
         error_report_err(err);
         return ret;
diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
index fd4514a595ce..9fa656b43b42 100644
--- a/hw/pci-bridge/pcie_pci_bridge.c
+++ b/hw/pci-bridge/pcie_pci_bridge.c
@@ -52,7 +52,7 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
         goto cap_error;
     }
 
-    pos = pci_add_capability(d, PCI_CAP_ID_PM, 0, PCI_PM_SIZEOF, errp);
+    pos = pci_pm_init(d, 0, errp);
     if (pos < 0) {
         goto pm_error;
     }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 89d900e9cf0c..6903f831e45f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2216,7 +2216,7 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
     case PCI_CAP_ID_PM:
         vfio_check_pm_reset(vdev, pos);
         vdev->pm_cap = pos;
-        ret = pci_add_capability(pdev, cap_id, pos, size, errp) >= 0;
+        ret = pci_pm_init(pdev, pos, errp) >= 0;
         break;
     case PCI_CAP_ID_AF:
         vfio_check_af_flr(vdev, pos);
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c773a9130c7e..afe8b5551c5c 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2204,8 +2204,7 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
         pos = pcie_endpoint_cap_init(pci_dev, 0);
         assert(pos > 0);
 
-        pos = pci_add_capability(pci_dev, PCI_CAP_ID_PM, 0,
-                                 PCI_PM_SIZEOF, errp);
+        pos = pci_pm_init(pci_dev, 0, errp);
         if (pos < 0) {
             return;
         }
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/5] vfio/pci: Delete local pm_cap
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
  2025-02-20 22:48 ` [PATCH 1/5] hw/pci: Basic support for PCI power management Alex Williamson
  2025-02-20 22:48 ` [PATCH 2/5] pci: Use PCI PM capability initializer Alex Williamson
@ 2025-02-20 22:48 ` Alex Williamson
  2025-02-24 18:38   ` Eric Auger
  2025-02-20 22:48 ` [PATCH 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2025-02-20 22:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

This is now redundant to PCIDevice.pm_cap.

Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c | 9 ++++-----
 hw/vfio/pci.h | 1 -
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6903f831e45f..ba4ef65b16fa 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2215,7 +2215,6 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
         break;
     case PCI_CAP_ID_PM:
         vfio_check_pm_reset(vdev, pos);
-        vdev->pm_cap = pos;
         ret = pci_pm_init(pdev, pos, errp) >= 0;
         break;
     case PCI_CAP_ID_AF:
@@ -2407,17 +2406,17 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
     vfio_disable_interrupts(vdev);
 
     /* Make sure the device is in D0 */
-    if (vdev->pm_cap) {
+    if (pdev->pm_cap) {
         uint16_t pmcsr;
         uint8_t state;
 
-        pmcsr = vfio_pci_read_config(pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
+        pmcsr = vfio_pci_read_config(pdev, pdev->pm_cap + PCI_PM_CTRL, 2);
         state = pmcsr & PCI_PM_CTRL_STATE_MASK;
         if (state) {
             pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
-            vfio_pci_write_config(pdev, vdev->pm_cap + PCI_PM_CTRL, pmcsr, 2);
+            vfio_pci_write_config(pdev, pdev->pm_cap + PCI_PM_CTRL, pmcsr, 2);
             /* vfio handles the necessary delay here */
-            pmcsr = vfio_pci_read_config(pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
+            pmcsr = vfio_pci_read_config(pdev, pdev->pm_cap + PCI_PM_CTRL, 2);
             state = pmcsr & PCI_PM_CTRL_STATE_MASK;
             if (state) {
                 error_report("vfio: Unable to power on device, stuck in D%d",
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 43c166680abb..d638c781f6f1 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -160,7 +160,6 @@ struct VFIOPCIDevice {
     int32_t bootindex;
     uint32_t igd_gms;
     OffAutoPCIBAR msix_relo;
-    uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
     bool pci_aer;
     bool req_enabled;
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (2 preceding siblings ...)
  2025-02-20 22:48 ` [PATCH 3/5] vfio/pci: Delete local pm_cap Alex Williamson
@ 2025-02-20 22:48 ` Alex Williamson
  2025-02-21  6:12   ` Duan, Zhenzhong
  2025-02-24 18:40   ` Eric Auger
  2025-02-20 22:48 ` [PATCH 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Alex Williamson @ 2025-02-20 22:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

The pm_cap on the PCIExpressDevice object can be distilled down
to the new instance on the PCIDevice object.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci-bridge/pcie_pci_bridge.c | 1 -
 hw/virtio/virtio-pci.c          | 8 +++-----
 include/hw/pci/pcie.h           | 2 --
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
index 9fa656b43b42..2429503cfbbf 100644
--- a/hw/pci-bridge/pcie_pci_bridge.c
+++ b/hw/pci-bridge/pcie_pci_bridge.c
@@ -56,7 +56,6 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
     if (pos < 0) {
         goto pm_error;
     }
-    d->exp.pm_cap = pos;
     pci_set_word(d->config + pos + PCI_PM_PMC, 0x3);
 
     pcie_cap_arifwd_init(d);
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index afe8b5551c5c..3ca3f849d391 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2209,8 +2209,6 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
             return;
         }
 
-        pci_dev->exp.pm_cap = pos;
-
         /*
          * Indicates that this function complies with revision 1.2 of the
          * PCI Power Management Interface Specification.
@@ -2309,11 +2307,11 @@ static bool virtio_pci_no_soft_reset(PCIDevice *dev)
 {
     uint16_t pmcsr;
 
-    if (!pci_is_express(dev) || !dev->exp.pm_cap) {
+    if (!pci_is_express(dev) || !(dev->cap_present & QEMU_PCI_CAP_PM)) {
         return false;
     }
 
-    pmcsr = pci_get_word(dev->config + dev->exp.pm_cap + PCI_PM_CTRL);
+    pmcsr = pci_get_word(dev->config + dev->pm_cap + PCI_PM_CTRL);
 
     /*
      * When No_Soft_Reset bit is set and the device
@@ -2342,7 +2340,7 @@ static void virtio_pci_bus_reset_hold(Object *obj, ResetType type)
 
         if (proxy->flags & VIRTIO_PCI_FLAG_INIT_PM) {
             pci_word_test_and_clear_mask(
-                dev->config + dev->exp.pm_cap + PCI_PM_CTRL,
+                dev->config + dev->pm_cap + PCI_PM_CTRL,
                 PCI_PM_CTRL_STATE_MASK);
         }
     }
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index b8d59732bc63..70a5de09de39 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -58,8 +58,6 @@ typedef enum {
 struct PCIExpressDevice {
     /* Offset of express capability in config space */
     uint8_t exp_cap;
-    /* Offset of Power Management capability in config space */
-    uint8_t pm_cap;
 
     /* SLOT */
     bool hpev_notified; /* Logical AND of conditions for hot plug event.
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] hw/vfio/pci: Re-order pre-reset
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (3 preceding siblings ...)
  2025-02-20 22:48 ` [PATCH 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
@ 2025-02-20 22:48 ` Alex Williamson
  2025-02-24 20:16   ` Eric Auger
  2025-02-20 22:54 ` [PATCH 0/5] PCI: Implement basic PCI PM capability backing Michael S. Tsirkin
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2025-02-20 22:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

We want the device in the D0 power state going into reset, but the
config write can enable the BARs in the address space, which are
then removed from the address space once we clear the memory enable
bit in the command register.  Re-order to clear the command bit
first, so the power state change doesn't enable the BARs.

Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ba4ef65b16fa..fcc5f118bf90 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2405,6 +2405,15 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
 
     vfio_disable_interrupts(vdev);
 
+    /*
+     * Stop any ongoing DMA by disconnecting I/O, MMIO, and bus master.
+     * Also put INTx Disable in known state.
+     */
+    cmd = vfio_pci_read_config(pdev, PCI_COMMAND, 2);
+    cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
+             PCI_COMMAND_INTX_DISABLE);
+    vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
+
     /* Make sure the device is in D0 */
     if (pdev->pm_cap) {
         uint16_t pmcsr;
@@ -2424,15 +2433,6 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
             }
         }
     }
-
-    /*
-     * Stop any ongoing DMA by disconnecting I/O, MMIO, and bus master.
-     * Also put INTx Disable in known state.
-     */
-    cmd = vfio_pci_read_config(pdev, PCI_COMMAND, 2);
-    cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
-             PCI_COMMAND_INTX_DISABLE);
-    vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
 }
 
 void vfio_pci_post_reset(VFIOPCIDevice *vdev)
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (4 preceding siblings ...)
  2025-02-20 22:48 ` [PATCH 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
@ 2025-02-20 22:54 ` Michael S. Tsirkin
  2025-02-24  8:21   ` Cédric Le Goater
  2025-02-24  1:43 ` Duan, Zhenzhong
  2025-02-24  8:14 ` Cédric Le Goater
  7 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-02-20 22:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	marcel.apfelbaum

On Thu, Feb 20, 2025 at 03:48:53PM -0700, Alex Williamson wrote:
> Eric recently identified an issue[1] where during graceful shutdown
> of a VM in a vIOMMU configuration, the guest driver places the device
> into the D3 power state, the vIOMMU is then disabled, triggering an
> AddressSpace update.  The device BARs are still mapped into the AS,
> but the vfio host driver refuses to DMA map the MMIO space due to the
> device power state.
> 
> The proposed solution in [1] was to skip mappings based on the
> device power state.  Here we take a different approach.  The PCI spec
> defines that devices in D1/2/3 power state should respond only to
> configuration and message requests and all other requests should be
> handled as an Unsupported Request.  In other words, the memory and
> IO BARs are not accessible except when the device is in the D0 power
> state.
> 
> To emulate this behavior, we can factor the device power state into
> the mapping state of the device BARs.  Therefore the BAR is marked
> as unmapped if either the respective command register enable bit is
> clear or the device is not in the D0 power state.
> 
> In order to implement this, the PowerState field of the PMCSR
> register becomes writable, which allows the device to appear in
> lower power states.  This also therefore implements D3 support
> (insofar as the BAR behavior) for all devices implementing the PM
> capability.  The PCI spec requires D3 support.
> 
> An aspect that needs attention here is whether this change in the
> wmask and PMCSR bits becomes a problem for migration, and how we
> might solve it.  For a guest migrating old->new, the device would
> always be in the D0 power state, but the register becomes writable.
> In the opposite direction, is it possible that a device could
> migrate in a low power state and be stuck there since the bits are
> read-only in old QEMU?  Do we need an option for this behavior and a
> machine state bump, or are there alternatives?
> 
> Thanks,
> Alex
> 
> [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/


PCI bits:

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

feel free to merge.

> Alex Williamson (5):
>   hw/pci: Basic support for PCI power management
>   pci: Use PCI PM capability initializer
>   vfio/pci: Delete local pm_cap
>   pcie, virtio: Remove redundant pm_cap
>   hw/vfio/pci: Re-order pre-reset
> 
>  hw/net/e1000e.c                 |  3 +-
>  hw/net/eepro100.c               |  4 +-
>  hw/net/igb.c                    |  3 +-
>  hw/nvme/ctrl.c                  |  3 +-
>  hw/pci-bridge/pcie_pci_bridge.c |  3 +-
>  hw/pci/pci.c                    | 83 ++++++++++++++++++++++++++++++++-
>  hw/pci/trace-events             |  2 +
>  hw/vfio/pci.c                   | 29 ++++++------
>  hw/vfio/pci.h                   |  1 -
>  hw/virtio/virtio-pci.c          | 11 ++---
>  include/hw/pci/pci.h            |  3 ++
>  include/hw/pci/pci_device.h     |  3 ++
>  include/hw/pci/pcie.h           |  2 -
>  13 files changed, 112 insertions(+), 38 deletions(-)
> 
> -- 
> 2.48.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
  2025-02-20 22:48 ` [PATCH 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
@ 2025-02-21  6:12   ` Duan, Zhenzhong
  2025-02-22  6:00     ` Cédric Le Goater
  2025-02-24 18:40   ` Eric Auger
  1 sibling, 1 reply; 22+ messages in thread
From: Duan, Zhenzhong @ 2025-02-21  6:12 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel@nongnu.org
  Cc: eric.auger.pro@gmail.com, eric.auger@redhat.com, clg@redhat.com,
	mst@redhat.com, marcel.apfelbaum@gmail.com



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
>
>The pm_cap on the PCIExpressDevice object can be distilled down
>to the new instance on the PCIDevice object.
>
>Cc: Michael S. Tsirkin <mst@redhat.com>
>Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>---
> hw/pci-bridge/pcie_pci_bridge.c | 1 -
> hw/virtio/virtio-pci.c          | 8 +++-----
> include/hw/pci/pcie.h           | 2 --
> 3 files changed, 3 insertions(+), 8 deletions(-)
>
>diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
>index 9fa656b43b42..2429503cfbbf 100644
>--- a/hw/pci-bridge/pcie_pci_bridge.c
>+++ b/hw/pci-bridge/pcie_pci_bridge.c
>@@ -56,7 +56,6 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error
>**errp)
>     if (pos < 0) {
>         goto pm_error;
>     }
>-    d->exp.pm_cap = pos;
>     pci_set_word(d->config + pos + PCI_PM_PMC, 0x3);
>
>     pcie_cap_arifwd_init(d);
>diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
>index afe8b5551c5c..3ca3f849d391 100644
>--- a/hw/virtio/virtio-pci.c
>+++ b/hw/virtio/virtio-pci.c
>@@ -2209,8 +2209,6 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error
>**errp)
>             return;
>         }
>
>-        pci_dev->exp.pm_cap = pos;
>-
>         /*
>          * Indicates that this function complies with revision 1.2 of the
>          * PCI Power Management Interface Specification.
>@@ -2309,11 +2307,11 @@ static bool virtio_pci_no_soft_reset(PCIDevice *dev)
> {
>     uint16_t pmcsr;
>
>-    if (!pci_is_express(dev) || !dev->exp.pm_cap) {
>+    if (!pci_is_express(dev) || !(dev->cap_present & QEMU_PCI_CAP_PM)) {

Maybe a bit more optimized by checking dev->pm_cap,
but it's also ok checking present bit. For the whole series,

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Thanks
Zhenzhong

>         return false;
>     }
>
>-    pmcsr = pci_get_word(dev->config + dev->exp.pm_cap + PCI_PM_CTRL);
>+    pmcsr = pci_get_word(dev->config + dev->pm_cap + PCI_PM_CTRL);
>
>     /*
>      * When No_Soft_Reset bit is set and the device
>@@ -2342,7 +2340,7 @@ static void virtio_pci_bus_reset_hold(Object *obj,
>ResetType type)
>
>         if (proxy->flags & VIRTIO_PCI_FLAG_INIT_PM) {
>             pci_word_test_and_clear_mask(
>-                dev->config + dev->exp.pm_cap + PCI_PM_CTRL,
>+                dev->config + dev->pm_cap + PCI_PM_CTRL,
>                 PCI_PM_CTRL_STATE_MASK);
>         }
>     }
>diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
>index b8d59732bc63..70a5de09de39 100644
>--- a/include/hw/pci/pcie.h
>+++ b/include/hw/pci/pcie.h
>@@ -58,8 +58,6 @@ typedef enum {
> struct PCIExpressDevice {
>     /* Offset of express capability in config space */
>     uint8_t exp_cap;
>-    /* Offset of Power Management capability in config space */
>-    uint8_t pm_cap;
>
>     /* SLOT */
>     bool hpev_notified; /* Logical AND of conditions for hot plug event.
>--
>2.48.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
  2025-02-21  6:12   ` Duan, Zhenzhong
@ 2025-02-22  6:00     ` Cédric Le Goater
  2025-02-24  1:45       ` Duan, Zhenzhong
  0 siblings, 1 reply; 22+ messages in thread
From: Cédric Le Goater @ 2025-02-22  6:00 UTC (permalink / raw)
  To: Duan, Zhenzhong, Alex Williamson, qemu-devel@nongnu.org
  Cc: eric.auger.pro@gmail.com, eric.auger@redhat.com, mst@redhat.com,
	marcel.apfelbaum@gmail.com

Hello Zhenzhong,

On 2/21/25 07:12, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Subject: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
>>
>> The pm_cap on the PCIExpressDevice object can be distilled down
>> to the new instance on the PCIDevice object.
>>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>> hw/pci-bridge/pcie_pci_bridge.c | 1 -
>> hw/virtio/virtio-pci.c          | 8 +++-----
>> include/hw/pci/pcie.h           | 2 --
>> 3 files changed, 3 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
>> index 9fa656b43b42..2429503cfbbf 100644
>> --- a/hw/pci-bridge/pcie_pci_bridge.c
>> +++ b/hw/pci-bridge/pcie_pci_bridge.c
>> @@ -56,7 +56,6 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error
>> **errp)
>>      if (pos < 0) {
>>          goto pm_error;
>>      }
>> -    d->exp.pm_cap = pos;
>>      pci_set_word(d->config + pos + PCI_PM_PMC, 0x3);
>>
>>      pcie_cap_arifwd_init(d);
>> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
>> index afe8b5551c5c..3ca3f849d391 100644
>> --- a/hw/virtio/virtio-pci.c
>> +++ b/hw/virtio/virtio-pci.c
>> @@ -2209,8 +2209,6 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error
>> **errp)
>>              return;
>>          }
>>
>> -        pci_dev->exp.pm_cap = pos;
>> -
>>          /*
>>           * Indicates that this function complies with revision 1.2 of the
>>           * PCI Power Management Interface Specification.
>> @@ -2309,11 +2307,11 @@ static bool virtio_pci_no_soft_reset(PCIDevice *dev)
>> {
>>      uint16_t pmcsr;
>>
>> -    if (!pci_is_express(dev) || !dev->exp.pm_cap) {
>> +    if (!pci_is_express(dev) || !(dev->cap_present & QEMU_PCI_CAP_PM)) {
> 
> Maybe a bit more optimized by checking dev->pm_cap,
> but it's also ok checking present bit. For the whole series,
> 
> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Could you please reply to the cover letter instead ? Tools like b4
will apply the provided trailers to the whole series and not this
patch only.


Thanks,

C.




> 
> Thanks
> Zhenzhong
> 
>>          return false;
>>      }
>>
>> -    pmcsr = pci_get_word(dev->config + dev->exp.pm_cap + PCI_PM_CTRL);
>> +    pmcsr = pci_get_word(dev->config + dev->pm_cap + PCI_PM_CTRL);
>>
>>      /*
>>       * When No_Soft_Reset bit is set and the device
>> @@ -2342,7 +2340,7 @@ static void virtio_pci_bus_reset_hold(Object *obj,
>> ResetType type)
>>
>>          if (proxy->flags & VIRTIO_PCI_FLAG_INIT_PM) {
>>              pci_word_test_and_clear_mask(
>> -                dev->config + dev->exp.pm_cap + PCI_PM_CTRL,
>> +                dev->config + dev->pm_cap + PCI_PM_CTRL,
>>                  PCI_PM_CTRL_STATE_MASK);
>>          }
>>      }
>> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
>> index b8d59732bc63..70a5de09de39 100644
>> --- a/include/hw/pci/pcie.h
>> +++ b/include/hw/pci/pcie.h
>> @@ -58,8 +58,6 @@ typedef enum {
>> struct PCIExpressDevice {
>>      /* Offset of express capability in config space */
>>      uint8_t exp_cap;
>> -    /* Offset of Power Management capability in config space */
>> -    uint8_t pm_cap;
>>
>>      /* SLOT */
>>      bool hpev_notified; /* Logical AND of conditions for hot plug event.
>> --
>> 2.48.1
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (5 preceding siblings ...)
  2025-02-20 22:54 ` [PATCH 0/5] PCI: Implement basic PCI PM capability backing Michael S. Tsirkin
@ 2025-02-24  1:43 ` Duan, Zhenzhong
  2025-02-24  8:14 ` Cédric Le Goater
  7 siblings, 0 replies; 22+ messages in thread
From: Duan, Zhenzhong @ 2025-02-24  1:43 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel@nongnu.org
  Cc: eric.auger.pro@gmail.com, eric.auger@redhat.com, clg@redhat.com,
	mst@redhat.com, marcel.apfelbaum@gmail.com



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
>
>Eric recently identified an issue[1] where during graceful shutdown
>of a VM in a vIOMMU configuration, the guest driver places the device
>into the D3 power state, the vIOMMU is then disabled, triggering an
>AddressSpace update.  The device BARs are still mapped into the AS,
>but the vfio host driver refuses to DMA map the MMIO space due to the
>device power state.
>
>The proposed solution in [1] was to skip mappings based on the
>device power state.  Here we take a different approach.  The PCI spec
>defines that devices in D1/2/3 power state should respond only to
>configuration and message requests and all other requests should be
>handled as an Unsupported Request.  In other words, the memory and
>IO BARs are not accessible except when the device is in the D0 power
>state.
>
>To emulate this behavior, we can factor the device power state into
>the mapping state of the device BARs.  Therefore the BAR is marked
>as unmapped if either the respective command register enable bit is
>clear or the device is not in the D0 power state.
>
>In order to implement this, the PowerState field of the PMCSR
>register becomes writable, which allows the device to appear in
>lower power states.  This also therefore implements D3 support
>(insofar as the BAR behavior) for all devices implementing the PM
>capability.  The PCI spec requires D3 support.
>
>An aspect that needs attention here is whether this change in the
>wmask and PMCSR bits becomes a problem for migration, and how we
>might solve it.  For a guest migrating old->new, the device would
>always be in the D0 power state, but the register becomes writable.
>In the opposite direction, is it possible that a device could
>migrate in a low power state and be stuck there since the bits are
>read-only in old QEMU?  Do we need an option for this behavior and a
>machine state bump, or are there alternatives?
>
>Thanks,
>Alex
>
>[1]https://lore.kernel.org/all/20250219175941.135390-1-
>eric.auger@redhat.com/
>
>Alex Williamson (5):
>  hw/pci: Basic support for PCI power management
>  pci: Use PCI PM capability initializer
>  vfio/pci: Delete local pm_cap
>  pcie, virtio: Remove redundant pm_cap
>  hw/vfio/pci: Re-order pre-reset

For the whole series,

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
  2025-02-22  6:00     ` Cédric Le Goater
@ 2025-02-24  1:45       ` Duan, Zhenzhong
  0 siblings, 0 replies; 22+ messages in thread
From: Duan, Zhenzhong @ 2025-02-24  1:45 UTC (permalink / raw)
  To: Cédric Le Goater, Alex Williamson, qemu-devel@nongnu.org
  Cc: eric.auger.pro@gmail.com, eric.auger@redhat.com, mst@redhat.com,
	marcel.apfelbaum@gmail.com

Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
>
>Hello Zhenzhong,
>
>On 2/21/25 07:12, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Alex Williamson <alex.williamson@redhat.com>
>>> Subject: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
>>>
>>> The pm_cap on the PCIExpressDevice object can be distilled down
>>> to the new instance on the PCIDevice object.
>>>
>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>> hw/pci-bridge/pcie_pci_bridge.c | 1 -
>>> hw/virtio/virtio-pci.c          | 8 +++-----
>>> include/hw/pci/pcie.h           | 2 --
>>> 3 files changed, 3 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
>>> index 9fa656b43b42..2429503cfbbf 100644
>>> --- a/hw/pci-bridge/pcie_pci_bridge.c
>>> +++ b/hw/pci-bridge/pcie_pci_bridge.c
>>> @@ -56,7 +56,6 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error
>>> **errp)
>>>      if (pos < 0) {
>>>          goto pm_error;
>>>      }
>>> -    d->exp.pm_cap = pos;
>>>      pci_set_word(d->config + pos + PCI_PM_PMC, 0x3);
>>>
>>>      pcie_cap_arifwd_init(d);
>>> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
>>> index afe8b5551c5c..3ca3f849d391 100644
>>> --- a/hw/virtio/virtio-pci.c
>>> +++ b/hw/virtio/virtio-pci.c
>>> @@ -2209,8 +2209,6 @@ static void virtio_pci_realize(PCIDevice *pci_dev,
>Error
>>> **errp)
>>>              return;
>>>          }
>>>
>>> -        pci_dev->exp.pm_cap = pos;
>>> -
>>>          /*
>>>           * Indicates that this function complies with revision 1.2 of the
>>>           * PCI Power Management Interface Specification.
>>> @@ -2309,11 +2307,11 @@ static bool virtio_pci_no_soft_reset(PCIDevice
>*dev)
>>> {
>>>      uint16_t pmcsr;
>>>
>>> -    if (!pci_is_express(dev) || !dev->exp.pm_cap) {
>>> +    if (!pci_is_express(dev) || !(dev->cap_present & QEMU_PCI_CAP_PM)) {
>>
>> Maybe a bit more optimized by checking dev->pm_cap,
>> but it's also ok checking present bit. For the whole series,
>>
>> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>
>Could you please reply to the cover letter instead ? Tools like b4
>will apply the provided trailers to the whole series and not this
>patch only.

Got it, just done.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (6 preceding siblings ...)
  2025-02-24  1:43 ` Duan, Zhenzhong
@ 2025-02-24  8:14 ` Cédric Le Goater
  2025-02-24 15:09   ` Alex Williamson
  7 siblings, 1 reply; 22+ messages in thread
From: Cédric Le Goater @ 2025-02-24  8:14 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, eric.auger, zhenzhong.duan, mst, marcel.apfelbaum

> An aspect that needs attention here is whether this change in the
> wmask and PMCSR bits becomes a problem for migration, and how we
> might solve it.  For a guest migrating old->new, the device would
> always be in the D0 power state, but the register becomes writable.
> In the opposite direction, is it possible that a device could
> migrate in a low power state and be stuck there since the bits are
> read-only in old QEMU?  Do we need an option for this behavior and a
> machine state bump, or are there alternatives?

Should we introduce a migration blocker when a PCI device is in low
power state  ?


Thanks,

C.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-20 22:54 ` [PATCH 0/5] PCI: Implement basic PCI PM capability backing Michael S. Tsirkin
@ 2025-02-24  8:21   ` Cédric Le Goater
  0 siblings, 0 replies; 22+ messages in thread
From: Cédric Le Goater @ 2025-02-24  8:21 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alex Williamson
  Cc: qemu-devel, eric.auger.pro, eric.auger, zhenzhong.duan,
	marcel.apfelbaum

On 2/20/25 23:54, Michael S. Tsirkin wrote:
> On Thu, Feb 20, 2025 at 03:48:53PM -0700, Alex Williamson wrote:
>> Eric recently identified an issue[1] where during graceful shutdown
>> of a VM in a vIOMMU configuration, the guest driver places the device
>> into the D3 power state, the vIOMMU is then disabled, triggering an
>> AddressSpace update.  The device BARs are still mapped into the AS,
>> but the vfio host driver refuses to DMA map the MMIO space due to the
>> device power state.
>>
>> The proposed solution in [1] was to skip mappings based on the
>> device power state.  Here we take a different approach.  The PCI spec
>> defines that devices in D1/2/3 power state should respond only to
>> configuration and message requests and all other requests should be
>> handled as an Unsupported Request.  In other words, the memory and
>> IO BARs are not accessible except when the device is in the D0 power
>> state.
>>
>> To emulate this behavior, we can factor the device power state into
>> the mapping state of the device BARs.  Therefore the BAR is marked
>> as unmapped if either the respective command register enable bit is
>> clear or the device is not in the D0 power state.
>>
>> In order to implement this, the PowerState field of the PMCSR
>> register becomes writable, which allows the device to appear in
>> lower power states.  This also therefore implements D3 support
>> (insofar as the BAR behavior) for all devices implementing the PM
>> capability.  The PCI spec requires D3 support.
>>
>> An aspect that needs attention here is whether this change in the
>> wmask and PMCSR bits becomes a problem for migration, and how we
>> might solve it.  For a guest migrating old->new, the device would
>> always be in the D0 power state, but the register becomes writable.
>> In the opposite direction, is it possible that a device could
>> migrate in a low power state and be stuck there since the bits are
>> read-only in old QEMU?  Do we need an option for this behavior and a
>> machine state bump, or are there alternatives?
>>
>> Thanks,
>> Alex
>>
>> [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/
> 
> 
> PCI bits:
> 
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> 
> feel free to merge.

Applied to vfio-next.

Thanks,

C.





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-24  8:14 ` Cédric Le Goater
@ 2025-02-24 15:09   ` Alex Williamson
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2025-02-24 15:09 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-devel, eric.auger.pro, eric.auger, zhenzhong.duan, mst,
	marcel.apfelbaum

On Mon, 24 Feb 2025 09:14:19 +0100
Cédric Le Goater <clg@redhat.com> wrote:

> > An aspect that needs attention here is whether this change in the
> > wmask and PMCSR bits becomes a problem for migration, and how we
> > might solve it.  For a guest migrating old->new, the device would
> > always be in the D0 power state, but the register becomes writable.
> > In the opposite direction, is it possible that a device could
> > migrate in a low power state and be stuck there since the bits are
> > read-only in old QEMU?  Do we need an option for this behavior and a
> > machine state bump, or are there alternatives?  
> 
> Should we introduce a migration blocker when a PCI device is in low
> power state  ?

Blocking relative to the power state of a device seems relatively
non-intuitive for a user to debug.  Logically there's also an
opportunity that any device could support migration while in D3 if it
indicates a soft reset is performed on D3->D0 transition, regardless of
underlying VMM support for the device to migrate.  So that doesn't
really feel like the right approach to me.

FWIW, the emulated igb device will enter D3 when idle and bound to
vfio-pci in the guest, so we should be able to test migration in
various states with purely emulated devices.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] pci: Use PCI PM capability initializer
  2025-02-20 22:48 ` [PATCH 2/5] pci: Use PCI PM capability initializer Alex Williamson
@ 2025-02-24 18:37   ` Eric Auger
  2025-02-24 19:03     ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Auger @ 2025-02-24 18:37 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum,
	Dmitry Fleytman, Akihiko Odaki, Jason Wang, Stefan Weil,
	Sriram Yagnaraman, Keith Busch, Klaus Jensen, Jesper Devantier


Hi Alex,

On 2/20/25 11:48 PM, Alex Williamson wrote:
> Switch callers directly initializing the PCI PM capability with
> pci_add_capability() to use pci_pm_init().
>
> Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com>
> Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Stefan Weil <sw@weilnetz.de>
> Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Klaus Jensen <its@irrelevant.dk>
> Cc: Jesper Devantier <foss@defmacro.it>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/net/e1000e.c                 | 3 +--
>  hw/net/eepro100.c               | 4 +---
>  hw/net/igb.c                    | 3 +--
>  hw/nvme/ctrl.c                  | 3 +--
>  hw/pci-bridge/pcie_pci_bridge.c | 2 +-
>  hw/vfio/pci.c                   | 2 +-
>  hw/virtio/virtio-pci.c          | 3 +--
>  7 files changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
> index f637853073e2..b72cbab7e889 100644
> --- a/hw/net/e1000e.c
> +++ b/hw/net/e1000e.c
> @@ -372,8 +372,7 @@ static int
>  e1000e_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
>  {
>      Error *local_err = NULL;
> -    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
> -                                 PCI_PM_SIZEOF, &local_err);
> +    int ret = pci_pm_init(pdev, offset, &local_err);
>  
>      if (local_err) {
>          error_report_err(local_err);
> diff --git a/hw/net/eepro100.c b/hw/net/eepro100.c
> index 6d853229aec2..29a39865a608 100644
> --- a/hw/net/eepro100.c
> +++ b/hw/net/eepro100.c
> @@ -551,9 +551,7 @@ static void e100_pci_reset(EEPRO100State *s, Error **errp)
>      if (info->power_management) {
>          /* Power Management Capabilities */
>          int cfg_offset = 0xdc;
> -        int r = pci_add_capability(&s->dev, PCI_CAP_ID_PM,
> -                                   cfg_offset, PCI_PM_SIZEOF,
> -                                   errp);
> +        int r = pci_pm_init(&s->dev, cfg_offset, errp);
>          if (r < 0) {
>              return;
>          }
> diff --git a/hw/net/igb.c b/hw/net/igb.c
> index 4d93ce629f95..700dbc746d3d 100644
> --- a/hw/net/igb.c
> +++ b/hw/net/igb.c
> @@ -356,8 +356,7 @@ static int
>  igb_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
>  {
>      Error *local_err = NULL;
> -    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
> -                                 PCI_PM_SIZEOF, &local_err);
> +    int ret = pci_pm_init(pdev, offset, &local_err);
>  
>      if (local_err) {
>          error_report_err(local_err);
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 68903d1d7067..1faea3d2b85b 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -8503,8 +8503,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
>      Error *err = NULL;
>      int ret;
>  
> -    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
> -                             PCI_PM_SIZEOF, &err);
> +    ret = pci_pm_init(pci_dev, offset, &err);
>      if (err) {
>          error_report_err(err);
>          return ret;
nit: below there is a redundant
    pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL,
                 PCI_PM_CTRL_STATE_MASK);

Thanks

Eric
> diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
> index fd4514a595ce..9fa656b43b42 100644
> --- a/hw/pci-bridge/pcie_pci_bridge.c
> +++ b/hw/pci-bridge/pcie_pci_bridge.c
> @@ -52,7 +52,7 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
>          goto cap_error;
>      }
>  
> -    pos = pci_add_capability(d, PCI_CAP_ID_PM, 0, PCI_PM_SIZEOF, errp);
> +    pos = pci_pm_init(d, 0, errp);
>      if (pos < 0) {
>          goto pm_error;
>      }
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 89d900e9cf0c..6903f831e45f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2216,7 +2216,7 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
>      case PCI_CAP_ID_PM:
>          vfio_check_pm_reset(vdev, pos);
>          vdev->pm_cap = pos;
> -        ret = pci_add_capability(pdev, cap_id, pos, size, errp) >= 0;
> +        ret = pci_pm_init(pdev, pos, errp) >= 0;
>          break;
>      case PCI_CAP_ID_AF:
>          vfio_check_af_flr(vdev, pos);
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index c773a9130c7e..afe8b5551c5c 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2204,8 +2204,7 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
>          pos = pcie_endpoint_cap_init(pci_dev, 0);
>          assert(pos > 0);
>  
> -        pos = pci_add_capability(pci_dev, PCI_CAP_ID_PM, 0,
> -                                 PCI_PM_SIZEOF, errp);
> +        pos = pci_pm_init(pci_dev, 0, errp);
>          if (pos < 0) {
>              return;
>          }



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] vfio/pci: Delete local pm_cap
  2025-02-20 22:48 ` [PATCH 3/5] vfio/pci: Delete local pm_cap Alex Williamson
@ 2025-02-24 18:38   ` Eric Auger
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Auger @ 2025-02-24 18:38 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum




On 2/20/25 11:48 PM, Alex Williamson wrote:
> This is now redundant to PCIDevice.pm_cap.
>
> Cc: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/vfio/pci.c | 9 ++++-----
>  hw/vfio/pci.h | 1 -
>  2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6903f831e45f..ba4ef65b16fa 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2215,7 +2215,6 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
>          break;
>      case PCI_CAP_ID_PM:
>          vfio_check_pm_reset(vdev, pos);
> -        vdev->pm_cap = pos;
>          ret = pci_pm_init(pdev, pos, errp) >= 0;
>          break;
>      case PCI_CAP_ID_AF:
> @@ -2407,17 +2406,17 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
>      vfio_disable_interrupts(vdev);
>  
>      /* Make sure the device is in D0 */
> -    if (vdev->pm_cap) {
> +    if (pdev->pm_cap) {
>          uint16_t pmcsr;
>          uint8_t state;
>  
> -        pmcsr = vfio_pci_read_config(pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
> +        pmcsr = vfio_pci_read_config(pdev, pdev->pm_cap + PCI_PM_CTRL, 2);
>          state = pmcsr & PCI_PM_CTRL_STATE_MASK;
>          if (state) {
>              pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
> -            vfio_pci_write_config(pdev, vdev->pm_cap + PCI_PM_CTRL, pmcsr, 2);
> +            vfio_pci_write_config(pdev, pdev->pm_cap + PCI_PM_CTRL, pmcsr, 2);
>              /* vfio handles the necessary delay here */
> -            pmcsr = vfio_pci_read_config(pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
> +            pmcsr = vfio_pci_read_config(pdev, pdev->pm_cap + PCI_PM_CTRL, 2);
>              state = pmcsr & PCI_PM_CTRL_STATE_MASK;
>              if (state) {
>                  error_report("vfio: Unable to power on device, stuck in D%d",
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 43c166680abb..d638c781f6f1 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -160,7 +160,6 @@ struct VFIOPCIDevice {
>      int32_t bootindex;
>      uint32_t igd_gms;
>      OffAutoPCIBAR msix_relo;
> -    uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
>      bool pci_aer;
>      bool req_enabled;



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] pcie, virtio: Remove redundant pm_cap
  2025-02-20 22:48 ` [PATCH 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
  2025-02-21  6:12   ` Duan, Zhenzhong
@ 2025-02-24 18:40   ` Eric Auger
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Auger @ 2025-02-24 18:40 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum




On 2/20/25 11:48 PM, Alex Williamson wrote:
> The pm_cap on the PCIExpressDevice object can be distilled down
> to the new instance on the PCIDevice object.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/pci-bridge/pcie_pci_bridge.c | 1 -
>  hw/virtio/virtio-pci.c          | 8 +++-----
>  include/hw/pci/pcie.h           | 2 --
>  3 files changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
> index 9fa656b43b42..2429503cfbbf 100644
> --- a/hw/pci-bridge/pcie_pci_bridge.c
> +++ b/hw/pci-bridge/pcie_pci_bridge.c
> @@ -56,7 +56,6 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
>      if (pos < 0) {
>          goto pm_error;
>      }
> -    d->exp.pm_cap = pos;
>      pci_set_word(d->config + pos + PCI_PM_PMC, 0x3);
>  
>      pcie_cap_arifwd_init(d);
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index afe8b5551c5c..3ca3f849d391 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2209,8 +2209,6 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
>              return;
>          }
>  
> -        pci_dev->exp.pm_cap = pos;
> -
>          /*
>           * Indicates that this function complies with revision 1.2 of the
>           * PCI Power Management Interface Specification.
> @@ -2309,11 +2307,11 @@ static bool virtio_pci_no_soft_reset(PCIDevice *dev)
>  {
>      uint16_t pmcsr;
>  
> -    if (!pci_is_express(dev) || !dev->exp.pm_cap) {
> +    if (!pci_is_express(dev) || !(dev->cap_present & QEMU_PCI_CAP_PM)) {
>          return false;
>      }
>  
> -    pmcsr = pci_get_word(dev->config + dev->exp.pm_cap + PCI_PM_CTRL);
> +    pmcsr = pci_get_word(dev->config + dev->pm_cap + PCI_PM_CTRL);
>  
>      /*
>       * When No_Soft_Reset bit is set and the device
> @@ -2342,7 +2340,7 @@ static void virtio_pci_bus_reset_hold(Object *obj, ResetType type)
>  
>          if (proxy->flags & VIRTIO_PCI_FLAG_INIT_PM) {
>              pci_word_test_and_clear_mask(
> -                dev->config + dev->exp.pm_cap + PCI_PM_CTRL,
> +                dev->config + dev->pm_cap + PCI_PM_CTRL,
>                  PCI_PM_CTRL_STATE_MASK);
>          }
>      }
> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> index b8d59732bc63..70a5de09de39 100644
> --- a/include/hw/pci/pcie.h
> +++ b/include/hw/pci/pcie.h
> @@ -58,8 +58,6 @@ typedef enum {
>  struct PCIExpressDevice {
>      /* Offset of express capability in config space */
>      uint8_t exp_cap;
> -    /* Offset of Power Management capability in config space */
> -    uint8_t pm_cap;
>  
>      /* SLOT */
>      bool hpev_notified; /* Logical AND of conditions for hot plug event.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] pci: Use PCI PM capability initializer
  2025-02-24 18:37   ` Eric Auger
@ 2025-02-24 19:03     ` Alex Williamson
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2025-02-24 19:03 UTC (permalink / raw)
  To: Eric Auger
  Cc: qemu-devel, eric.auger.pro, clg, zhenzhong.duan, mst,
	marcel.apfelbaum, Dmitry Fleytman, Akihiko Odaki, Jason Wang,
	Stefan Weil, Sriram Yagnaraman, Keith Busch, Klaus Jensen,
	Jesper Devantier

On Mon, 24 Feb 2025 19:37:03 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 2/20/25 11:48 PM, Alex Williamson wrote:
> > Switch callers directly initializing the PCI PM capability with
> > pci_add_capability() to use pci_pm_init().
> >
> > Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com>
> > Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Stefan Weil <sw@weilnetz.de>
> > Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
> > Cc: Keith Busch <kbusch@kernel.org>
> > Cc: Klaus Jensen <its@irrelevant.dk>
> > Cc: Jesper Devantier <foss@defmacro.it>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > Cc: Cédric Le Goater <clg@redhat.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/net/e1000e.c                 | 3 +--
> >  hw/net/eepro100.c               | 4 +---
> >  hw/net/igb.c                    | 3 +--
> >  hw/nvme/ctrl.c                  | 3 +--
> >  hw/pci-bridge/pcie_pci_bridge.c | 2 +-
> >  hw/vfio/pci.c                   | 2 +-
> >  hw/virtio/virtio-pci.c          | 3 +--
> >  7 files changed, 7 insertions(+), 13 deletions(-)
> >
> > diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
> > index f637853073e2..b72cbab7e889 100644
> > --- a/hw/net/e1000e.c
> > +++ b/hw/net/e1000e.c
> > @@ -372,8 +372,7 @@ static int
> >  e1000e_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
> >  {
> >      Error *local_err = NULL;
> > -    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
> > -                                 PCI_PM_SIZEOF, &local_err);
> > +    int ret = pci_pm_init(pdev, offset, &local_err);
> >  
> >      if (local_err) {
> >          error_report_err(local_err);
> > diff --git a/hw/net/eepro100.c b/hw/net/eepro100.c
> > index 6d853229aec2..29a39865a608 100644
> > --- a/hw/net/eepro100.c
> > +++ b/hw/net/eepro100.c
> > @@ -551,9 +551,7 @@ static void e100_pci_reset(EEPRO100State *s, Error **errp)
> >      if (info->power_management) {
> >          /* Power Management Capabilities */
> >          int cfg_offset = 0xdc;
> > -        int r = pci_add_capability(&s->dev, PCI_CAP_ID_PM,
> > -                                   cfg_offset, PCI_PM_SIZEOF,
> > -                                   errp);
> > +        int r = pci_pm_init(&s->dev, cfg_offset, errp);
> >          if (r < 0) {
> >              return;
> >          }
> > diff --git a/hw/net/igb.c b/hw/net/igb.c
> > index 4d93ce629f95..700dbc746d3d 100644
> > --- a/hw/net/igb.c
> > +++ b/hw/net/igb.c
> > @@ -356,8 +356,7 @@ static int
> >  igb_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
> >  {
> >      Error *local_err = NULL;
> > -    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
> > -                                 PCI_PM_SIZEOF, &local_err);
> > +    int ret = pci_pm_init(pdev, offset, &local_err);
> >  
> >      if (local_err) {
> >          error_report_err(local_err);
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 68903d1d7067..1faea3d2b85b 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -8503,8 +8503,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
> >      Error *err = NULL;
> >      int ret;
> >  
> > -    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
> > -                             PCI_PM_SIZEOF, &err);
> > +    ret = pci_pm_init(pci_dev, offset, &err);
> >      if (err) {
> >          error_report_err(err);
> >          return ret;  
> nit: below there is a redundant
>     pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL,
>                  PCI_PM_CTRL_STATE_MASK);

Indeed there is, thanks for spotting that!  I'll fix it in the next
spin.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] hw/pci: Basic support for PCI power management
  2025-02-20 22:48 ` [PATCH 1/5] hw/pci: Basic support for PCI power management Alex Williamson
@ 2025-02-24 19:03   ` Eric Auger
  2025-02-25  5:24     ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Auger @ 2025-02-24 19:03 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum


Hi Alex,

On 2/20/25 11:48 PM, Alex Williamson wrote:
> The memory and IO BARs for devices are only accessible in the D0
> power state.  In other power states the PCI spec defines that the
> device should respond to TLPs and messages with an Unsupported
> Request response.  The closest we can come to emulating this
> behavior is to consider the BARs as unmapped, removing them from
> the address space.
>
> Introduce an interface to register the PM capability such that
> the device power state is considered relative to the BAR mapping
> state.

"the device power state is considered relative to the BAR mapping
state."
I rather understood that you want the BAR mapping state to depend on the power state but maybe I misunderstand the langage here.


>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/pci/pci.c                | 83 ++++++++++++++++++++++++++++++++++++-
>  hw/pci/trace-events         |  2 +
>  include/hw/pci/pci.h        |  3 ++
>  include/hw/pci/pci_device.h |  3 ++
>  4 files changed, 89 insertions(+), 2 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 2afa423925c5..4f2554dd63ac 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -435,6 +435,74 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
>                           attrs, NULL);
>  }
>  
> +int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
> +{
> +    int cap = pci_add_capability(d, PCI_CAP_ID_PM, offset, PCI_PM_SIZEOF, errp);
> +
> +    if (cap < 0) {
> +        return cap;
> +    }
> +
> +    d->pm_cap = cap;
> +    d->cap_present |= QEMU_PCI_CAP_PM;
> +    pci_set_word(d->wmask + cap + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
> +
> +    return cap;
> +}
> +
> +static uint8_t pci_pm_state(PCIDevice *d)
> +{
> +    uint16_t pmcsr;
> +
> +    if (!(d->cap_present & QEMU_PCI_CAP_PM)) {
> +        return 0;
> +    }
> +
> +    pmcsr = pci_get_word(d->config + d->pm_cap + PCI_PM_CTRL);
> +
> +    return pmcsr & PCI_PM_CTRL_STATE_MASK;
> +}
> +
> +static uint8_t pci_pm_update(PCIDevice *d, uint32_t addr, int l, uint8_t old)
the old/new terminology sounds weird to me. generally when one updates
we pass the new value.
> +{
> +    uint16_t pmc;
> +    uint8_t new;
> +
> +    if (!(d->cap_present & QEMU_PCI_CAP_PM) ||
> +        !range_covers_byte(addr, l, d->pm_cap + PCI_PM_CTRL)) {
> +        return old;
> +    }
> +
> +    new = pci_pm_state(d);
and new sounds to be the current state.

pci_pm_update() does a kind of validation of the current value? 

> +    if (new == old) {
> +        return old;
> +    }
> +
> +    pmc = pci_get_word(d->config + d->pm_cap + PCI_PM_PMC);
> +
> +    /*
> +     * Transitions to D1 & D2 are only allowed if supported.  Devices may
> +     * only transition to higher D-states or to D0.  The write is dropped
> +     * for rejected transitions by restoring the old state.
> +     */
> +    if ((!(pmc & PCI_PM_CAP_D1) && new == 1) ||
> +        (!(pmc & PCI_PM_CAP_D2) && new == 2) ||
> +        (old && new && new < old)) {
> +        pci_word_test_and_clear_mask(d->config + d->pm_cap + PCI_PM_CTRL,
> +                                     PCI_PM_CTRL_STATE_MASK);
> +        pci_word_test_and_set_mask(d->config + d->pm_cap + PCI_PM_CTRL,
> +                                   old);
> +        trace_pci_pm_bad_transition(d->name, pci_dev_bus_num(d),
> +                                    PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                                    old, new);
the trace does not output that the old state is kept. This may be helpful.
> +        return old;
> +    }
> +
> +    trace_pci_pm_transition(d->name, pci_dev_bus_num(d), PCI_SLOT(d->devfn),
> +                            PCI_FUNC(d->devfn), old, new);
> +    return new;
> +}
> +
>  static void pci_reset_regions(PCIDevice *dev)
>  {
>      int r;
> @@ -474,6 +542,11 @@ static void pci_do_device_reset(PCIDevice *dev)
>                                pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>                                pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
>      dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> +    /* Default PM state is D0 */
> +    if (dev->cap_present & QEMU_PCI_CAP_PM) {
> +        pci_word_test_and_clear_mask(dev->config + dev->pm_cap + PCI_PM_CTRL,
> +                                     PCI_PM_CTRL_STATE_MASK);
> +    }
>      pci_reset_regions(dev);
>      pci_update_mappings(dev);
>  
> @@ -1598,7 +1671,7 @@ static void pci_update_mappings(PCIDevice *d)
>              continue;
>  
>          new_addr = pci_bar_address(d, i, r->type, r->size);
> -        if (!d->enabled) {
> +        if (!d->enabled || pci_pm_state(d)) {
>              new_addr = PCI_BAR_UNMAPPED;
>          }
>  
> @@ -1664,6 +1737,7 @@ uint32_t pci_default_read_config(PCIDevice *d,
>  
>  void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
>  {
> +    uint8_t new_pm_state, old_pm_state = pci_pm_state(d);
>      int i, was_irq_disabled = pci_irq_disabled(d);
>      uint32_t val = val_in;
>  
> @@ -1676,11 +1750,16 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>          d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
>          d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
>      }
> +
IIUC the config is first updated and then we check the validity and
potentially restore the older value.
Couldn't we check the validity before updating d->config?
> +    new_pm_state = pci_pm_update(d, addr, l, old_pm_state);
> +
>      if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
>          ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
>          ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
> -        range_covers_byte(addr, l, PCI_COMMAND))
> +        range_covers_byte(addr, l, PCI_COMMAND) ||
> +        !!new_pm_state != !!old_pm_state) {
>          pci_update_mappings(d);
Eric
> +    }
>  
>      if (ranges_overlap(addr, l, PCI_COMMAND, 2)) {
>          pci_update_irq_disabled(d, was_irq_disabled);
> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
> index 19643aa8c6b0..811354f29031 100644
> --- a/hw/pci/trace-events
> +++ b/hw/pci/trace-events
> @@ -1,6 +1,8 @@
>  # See docs/devel/tracing.rst for syntax documentation.
>  
>  # pci.c
> +pci_pm_bad_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x bad PM transition D%d->D%d"
> +pci_pm_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x PM transition D%d->D%d"
>  pci_update_mappings_del(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>  pci_update_mappings_add(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>  pci_route_irq(int dev_irq, const char *dev_path, int parent_irq, const char *parent_path) "IRQ %d @%s -> IRQ %d @%s"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 4002bbeebde0..c220cc844962 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -216,6 +216,8 @@ enum {
>      QEMU_PCIE_ARI_NEXTFN_1 = (1 << QEMU_PCIE_ARI_NEXTFN_1_BITNR),
>  #define QEMU_PCIE_EXT_TAG_BITNR 13
>      QEMU_PCIE_EXT_TAG = (1 << QEMU_PCIE_EXT_TAG_BITNR),
> +#define QEMU_PCI_CAP_PM_BITNR 14
> +    QEMU_PCI_CAP_PM = (1 << QEMU_PCI_CAP_PM_BITNR),
>  };
>  
>  typedef struct PCIINTxRoute {
> @@ -676,5 +678,6 @@ static inline void pci_irq_deassert(PCIDevice *pci_dev)
>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>  void pci_set_enabled(PCIDevice *pci_dev, bool state);
>  void pci_set_power(PCIDevice *pci_dev, bool state);
> +int pci_pm_init(PCIDevice *pci_dev, uint8_t offset, Error **errp);
>  
>  #endif
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index add208edfabd..345b12eaac1a 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -105,6 +105,9 @@ struct PCIDevice {
>      /* Capability bits */
>      uint32_t cap_present;
>  
> +    /* Offset of PM capability in config space */
> +    uint8_t pm_cap;
> +
>      /* Offset of MSI-X capability in config space */
>      uint8_t msix_cap;
>  



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] hw/vfio/pci: Re-order pre-reset
  2025-02-20 22:48 ` [PATCH 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
@ 2025-02-24 20:16   ` Eric Auger
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Auger @ 2025-02-24 20:16 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum




On 2/20/25 11:48 PM, Alex Williamson wrote:
> We want the device in the D0 power state going into reset, but the
> config write can enable the BARs in the address space, which are
> then removed from the address space once we clear the memory enable
> bit in the command register.  Re-order to clear the command bit
> first, so the power state change doesn't enable the BARs.
>
> Cc: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/vfio/pci.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index ba4ef65b16fa..fcc5f118bf90 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2405,6 +2405,15 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
>  
>      vfio_disable_interrupts(vdev);
>  
> +    /*
> +     * Stop any ongoing DMA by disconnecting I/O, MMIO, and bus master.
> +     * Also put INTx Disable in known state.
> +     */
> +    cmd = vfio_pci_read_config(pdev, PCI_COMMAND, 2);
> +    cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
> +             PCI_COMMAND_INTX_DISABLE);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
> +
>      /* Make sure the device is in D0 */
>      if (pdev->pm_cap) {
>          uint16_t pmcsr;
> @@ -2424,15 +2433,6 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
>              }
>          }
>      }
> -
> -    /*
> -     * Stop any ongoing DMA by disconnecting I/O, MMIO, and bus master.
> -     * Also put INTx Disable in known state.
> -     */
> -    cmd = vfio_pci_read_config(pdev, PCI_COMMAND, 2);
> -    cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
> -             PCI_COMMAND_INTX_DISABLE);
> -    vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
>  }
>  
>  void vfio_pci_post_reset(VFIOPCIDevice *vdev)



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] hw/pci: Basic support for PCI power management
  2025-02-24 19:03   ` Eric Auger
@ 2025-02-25  5:24     ` Alex Williamson
  2025-02-25  9:45       ` Eric Auger
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2025-02-25  5:24 UTC (permalink / raw)
  To: Eric Auger
  Cc: qemu-devel, eric.auger.pro, clg, zhenzhong.duan, mst,
	marcel.apfelbaum

On Mon, 24 Feb 2025 20:03:56 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 2/20/25 11:48 PM, Alex Williamson wrote:
> > The memory and IO BARs for devices are only accessible in the D0
> > power state.  In other power states the PCI spec defines that the
> > device should respond to TLPs and messages with an Unsupported
> > Request response.  The closest we can come to emulating this
> > behavior is to consider the BARs as unmapped, removing them from
> > the address space.
> >
> > Introduce an interface to register the PM capability such that
> > the device power state is considered relative to the BAR mapping
> > state.  
> 
> "the device power state is considered relative to the BAR mapping
> state."
> I rather understood that you want the BAR mapping state to depend on the power state but maybe I misunderstand the langage here.

Yes, for the BAR to be mapped, both the memory enable state of the
command register and the device power state are relevant, they are both
considered.

> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/pci/pci.c                | 83 ++++++++++++++++++++++++++++++++++++-
> >  hw/pci/trace-events         |  2 +
> >  include/hw/pci/pci.h        |  3 ++
> >  include/hw/pci/pci_device.h |  3 ++
> >  4 files changed, 89 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 2afa423925c5..4f2554dd63ac 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -435,6 +435,74 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
> >                           attrs, NULL);
> >  }
> >  
> > +int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
> > +{
> > +    int cap = pci_add_capability(d, PCI_CAP_ID_PM, offset, PCI_PM_SIZEOF, errp);
> > +
> > +    if (cap < 0) {
> > +        return cap;
> > +    }
> > +
> > +    d->pm_cap = cap;
> > +    d->cap_present |= QEMU_PCI_CAP_PM;
> > +    pci_set_word(d->wmask + cap + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
> > +
> > +    return cap;
> > +}
> > +
> > +static uint8_t pci_pm_state(PCIDevice *d)
> > +{
> > +    uint16_t pmcsr;
> > +
> > +    if (!(d->cap_present & QEMU_PCI_CAP_PM)) {
> > +        return 0;
> > +    }
> > +
> > +    pmcsr = pci_get_word(d->config + d->pm_cap + PCI_PM_CTRL);
> > +
> > +    return pmcsr & PCI_PM_CTRL_STATE_MASK;
> > +}
> > +
> > +static uint8_t pci_pm_update(PCIDevice *d, uint32_t addr, int l, uint8_t old)  
> the old/new terminology sounds weird to me. generally when one updates
> we pass the new value.

The new state lives in config space, the old state is recorded before
config space is updated.  Otherwise we interfere with or duplicate the
update of config space, which seems error prone and unnecessarily
complicated.  pci_update_irq_disable() uses the same strategy.

> > +{
> > +    uint16_t pmc;
> > +    uint8_t new;
> > +
> > +    if (!(d->cap_present & QEMU_PCI_CAP_PM) ||
> > +        !range_covers_byte(addr, l, d->pm_cap + PCI_PM_CTRL)) {
> > +        return old;
> > +    }
> > +
> > +    new = pci_pm_state(d);  
> and new sounds to be the current state.

The state after the guest config write is first applied, yes.

> pci_pm_update() does a kind of validation of the current value? 

Yes.  The PCI spec indicates that invalid transitions are ignored by
the device.  That's currently all this does, but if a device wanted a
callback to effect some behavior at state change, this would be a
logical place to do that.

> > +    if (new == old) {
> > +        return old;
> > +    }
> > +
> > +    pmc = pci_get_word(d->config + d->pm_cap + PCI_PM_PMC);
> > +
> > +    /*
> > +     * Transitions to D1 & D2 are only allowed if supported.  Devices may
> > +     * only transition to higher D-states or to D0.  The write is dropped
> > +     * for rejected transitions by restoring the old state.
> > +     */
> > +    if ((!(pmc & PCI_PM_CAP_D1) && new == 1) ||
> > +        (!(pmc & PCI_PM_CAP_D2) && new == 2) ||
> > +        (old && new && new < old)) {
> > +        pci_word_test_and_clear_mask(d->config + d->pm_cap + PCI_PM_CTRL,
> > +                                     PCI_PM_CTRL_STATE_MASK);
> > +        pci_word_test_and_set_mask(d->config + d->pm_cap + PCI_PM_CTRL,
> > +                                   old);
> > +        trace_pci_pm_bad_transition(d->name, pci_dev_bus_num(d),
> > +                                    PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> > +                                    old, new);  
> the trace does not output that the old state is kept. This may be helpful.

I was trying to keep the trace log brief.  The trace indicates it's a
"bad" transition versus the trace below indicates a successful
transition.  AIUI, most trace logs require some degree of familiarity
with the call path to fully comprehend and it seemed unnecessary to
print the old value twice to explicitly show it was restored.
Suggestions welcome.

> > +        return old;
> > +    }
> > +
> > +    trace_pci_pm_transition(d->name, pci_dev_bus_num(d), PCI_SLOT(d->devfn),
> > +                            PCI_FUNC(d->devfn), old, new);
> > +    return new;
> > +}
> > +
> >  static void pci_reset_regions(PCIDevice *dev)
> >  {
> >      int r;
> > @@ -474,6 +542,11 @@ static void pci_do_device_reset(PCIDevice *dev)
> >                                pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
> >                                pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
> >      dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > +    /* Default PM state is D0 */
> > +    if (dev->cap_present & QEMU_PCI_CAP_PM) {
> > +        pci_word_test_and_clear_mask(dev->config + dev->pm_cap + PCI_PM_CTRL,
> > +                                     PCI_PM_CTRL_STATE_MASK);
> > +    }
> >      pci_reset_regions(dev);
> >      pci_update_mappings(dev);
> >  
> > @@ -1598,7 +1671,7 @@ static void pci_update_mappings(PCIDevice *d)
> >              continue;
> >  
> >          new_addr = pci_bar_address(d, i, r->type, r->size);
> > -        if (!d->enabled) {
> > +        if (!d->enabled || pci_pm_state(d)) {
> >              new_addr = PCI_BAR_UNMAPPED;
> >          }
> >  
> > @@ -1664,6 +1737,7 @@ uint32_t pci_default_read_config(PCIDevice *d,
> >  
> >  void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
> >  {
> > +    uint8_t new_pm_state, old_pm_state = pci_pm_state(d);
> >      int i, was_irq_disabled = pci_irq_disabled(d);
> >      uint32_t val = val_in;
> >  
> > @@ -1676,11 +1750,16 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
> >          d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
> >          d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
> >      }
> > +  
> IIUC the config is first updated and then we check the validity and
> potentially restore the older value.
> Couldn't we check the validity before updating d->config?

We could, but why?  It's easier to deal with the before and after
values rather than mangle the incoming write value, imo.  I also don't
think the intermediate value is visible to the guest anyway.  Fixing
the value "in flight" would need to take into account the wmask,
duplicating or preempting the code just above.  Thanks,

Alex

> > +    new_pm_state = pci_pm_update(d, addr, l, old_pm_state);
> > +
> >      if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
> >          ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
> >          ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
> > -        range_covers_byte(addr, l, PCI_COMMAND))
> > +        range_covers_byte(addr, l, PCI_COMMAND) ||
> > +        !!new_pm_state != !!old_pm_state) {
> >          pci_update_mappings(d);  
> Eric
> > +    }
> >  
> >      if (ranges_overlap(addr, l, PCI_COMMAND, 2)) {
> >          pci_update_irq_disabled(d, was_irq_disabled);
> > diff --git a/hw/pci/trace-events b/hw/pci/trace-events
> > index 19643aa8c6b0..811354f29031 100644
> > --- a/hw/pci/trace-events
> > +++ b/hw/pci/trace-events
> > @@ -1,6 +1,8 @@
> >  # See docs/devel/tracing.rst for syntax documentation.
> >  
> >  # pci.c
> > +pci_pm_bad_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x bad PM transition D%d->D%d"
> > +pci_pm_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x PM transition D%d->D%d"
> >  pci_update_mappings_del(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
> >  pci_update_mappings_add(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
> >  pci_route_irq(int dev_irq, const char *dev_path, int parent_irq, const char *parent_path) "IRQ %d @%s -> IRQ %d @%s"
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index 4002bbeebde0..c220cc844962 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -216,6 +216,8 @@ enum {
> >      QEMU_PCIE_ARI_NEXTFN_1 = (1 << QEMU_PCIE_ARI_NEXTFN_1_BITNR),
> >  #define QEMU_PCIE_EXT_TAG_BITNR 13
> >      QEMU_PCIE_EXT_TAG = (1 << QEMU_PCIE_EXT_TAG_BITNR),
> > +#define QEMU_PCI_CAP_PM_BITNR 14
> > +    QEMU_PCI_CAP_PM = (1 << QEMU_PCI_CAP_PM_BITNR),
> >  };
> >  
> >  typedef struct PCIINTxRoute {
> > @@ -676,5 +678,6 @@ static inline void pci_irq_deassert(PCIDevice *pci_dev)
> >  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
> >  void pci_set_enabled(PCIDevice *pci_dev, bool state);
> >  void pci_set_power(PCIDevice *pci_dev, bool state);
> > +int pci_pm_init(PCIDevice *pci_dev, uint8_t offset, Error **errp);
> >  
> >  #endif
> > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > index add208edfabd..345b12eaac1a 100644
> > --- a/include/hw/pci/pci_device.h
> > +++ b/include/hw/pci/pci_device.h
> > @@ -105,6 +105,9 @@ struct PCIDevice {
> >      /* Capability bits */
> >      uint32_t cap_present;
> >  
> > +    /* Offset of PM capability in config space */
> > +    uint8_t pm_cap;
> > +
> >      /* Offset of MSI-X capability in config space */
> >      uint8_t msix_cap;
> >    
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] hw/pci: Basic support for PCI power management
  2025-02-25  5:24     ` Alex Williamson
@ 2025-02-25  9:45       ` Eric Auger
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Auger @ 2025-02-25  9:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, eric.auger.pro, clg, zhenzhong.duan, mst,
	marcel.apfelbaum

Hi Alex,


On 2/25/25 6:24 AM, Alex Williamson wrote:
> On Mon, 24 Feb 2025 20:03:56 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
>
>> Hi Alex,
>>
>> On 2/20/25 11:48 PM, Alex Williamson wrote:
>>> The memory and IO BARs for devices are only accessible in the D0
>>> power state.  In other power states the PCI spec defines that the
>>> device should respond to TLPs and messages with an Unsupported
>>> Request response.  The closest we can come to emulating this
>>> behavior is to consider the BARs as unmapped, removing them from
>>> the address space.
>>>
>>> Introduce an interface to register the PM capability such that
>>> the device power state is considered relative to the BAR mapping
>>> state.  
>> "the device power state is considered relative to the BAR mapping
>> state."
>> I rather understood that you want the BAR mapping state to depend on the power state but maybe I misunderstand the langage here.
> Yes, for the BAR to be mapped, both the memory enable state of the
> command register and the device power state are relevant, they are both
> considered.
thanks. That sounds clearer to me.
>
>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/pci/pci.c                | 83 ++++++++++++++++++++++++++++++++++++-
>>>  hw/pci/trace-events         |  2 +
>>>  include/hw/pci/pci.h        |  3 ++
>>>  include/hw/pci/pci_device.h |  3 ++
>>>  4 files changed, 89 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index 2afa423925c5..4f2554dd63ac 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -435,6 +435,74 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
>>>                           attrs, NULL);
>>>  }
>>>  
>>> +int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
>>> +{
>>> +    int cap = pci_add_capability(d, PCI_CAP_ID_PM, offset, PCI_PM_SIZEOF, errp);
>>> +
>>> +    if (cap < 0) {
>>> +        return cap;
>>> +    }
>>> +
>>> +    d->pm_cap = cap;
>>> +    d->cap_present |= QEMU_PCI_CAP_PM;
>>> +    pci_set_word(d->wmask + cap + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
>>> +
>>> +    return cap;
>>> +}
>>> +
>>> +static uint8_t pci_pm_state(PCIDevice *d)
>>> +{
>>> +    uint16_t pmcsr;
>>> +
>>> +    if (!(d->cap_present & QEMU_PCI_CAP_PM)) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    pmcsr = pci_get_word(d->config + d->pm_cap + PCI_PM_CTRL);
>>> +
>>> +    return pmcsr & PCI_PM_CTRL_STATE_MASK;
>>> +}
>>> +
>>> +static uint8_t pci_pm_update(PCIDevice *d, uint32_t addr, int l, uint8_t old)  
>> the old/new terminology sounds weird to me. generally when one updates
>> we pass the new value.
> The new state lives in config space, the old state is recorded before
> config space is updated.  Otherwise we interfere with or duplicate the
> update of config space, which seems error prone and unnecessarily
> complicated.  pci_update_irq_disable() uses the same strategy.
ok
>
>>> +{
>>> +    uint16_t pmc;
>>> +    uint8_t new;
>>> +
>>> +    if (!(d->cap_present & QEMU_PCI_CAP_PM) ||
>>> +        !range_covers_byte(addr, l, d->pm_cap + PCI_PM_CTRL)) {
>>> +        return old;
>>> +    }
>>> +
>>> +    new = pci_pm_state(d);  
>> and new sounds to be the current state.
> The state after the guest config write is first applied, yes.
>
>> pci_pm_update() does a kind of validation of the current value? 
> Yes.  The PCI spec indicates that invalid transitions are ignored by
> the device.  That's currently all this does, but if a device wanted a
> callback to effect some behavior at state change, this would be a
> logical place to do that.
ok
>
>>> +    if (new == old) {
>>> +        return old;
>>> +    }
>>> +
>>> +    pmc = pci_get_word(d->config + d->pm_cap + PCI_PM_PMC);
>>> +
>>> +    /*
>>> +     * Transitions to D1 & D2 are only allowed if supported.  Devices may
>>> +     * only transition to higher D-states or to D0.  The write is dropped
>>> +     * for rejected transitions by restoring the old state.
>>> +     */
>>> +    if ((!(pmc & PCI_PM_CAP_D1) && new == 1) ||
>>> +        (!(pmc & PCI_PM_CAP_D2) && new == 2) ||
>>> +        (old && new && new < old)) {
>>> +        pci_word_test_and_clear_mask(d->config + d->pm_cap + PCI_PM_CTRL,
>>> +                                     PCI_PM_CTRL_STATE_MASK);
>>> +        pci_word_test_and_set_mask(d->config + d->pm_cap + PCI_PM_CTRL,
>>> +                                   old);
>>> +        trace_pci_pm_bad_transition(d->name, pci_dev_bus_num(d),
>>> +                                    PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>>> +                                    old, new);  
>> the trace does not output that the old state is kept. This may be helpful.
> I was trying to keep the trace log brief.  The trace indicates it's a
> "bad" transition versus the trace below indicates a successful
> transition.  AIUI, most trace logs require some degree of familiarity
> with the call path to fully comprehend and it seemed unnecessary to
> print the old value twice to explicitly show it was restored.
> Suggestions welcome.
ok
>
>>> +        return old;
>>> +    }
>>> +
>>> +    trace_pci_pm_transition(d->name, pci_dev_bus_num(d), PCI_SLOT(d->devfn),
>>> +                            PCI_FUNC(d->devfn), old, new);
>>> +    return new;
>>> +}
>>> +
>>>  static void pci_reset_regions(PCIDevice *dev)
>>>  {
>>>      int r;
>>> @@ -474,6 +542,11 @@ static void pci_do_device_reset(PCIDevice *dev)
>>>                                pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>>>                                pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
>>>      dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
>>> +    /* Default PM state is D0 */
>>> +    if (dev->cap_present & QEMU_PCI_CAP_PM) {
>>> +        pci_word_test_and_clear_mask(dev->config + dev->pm_cap + PCI_PM_CTRL,
>>> +                                     PCI_PM_CTRL_STATE_MASK);
>>> +    }
>>>      pci_reset_regions(dev);
>>>      pci_update_mappings(dev);
>>>  
>>> @@ -1598,7 +1671,7 @@ static void pci_update_mappings(PCIDevice *d)
>>>              continue;
>>>  
>>>          new_addr = pci_bar_address(d, i, r->type, r->size);
>>> -        if (!d->enabled) {
>>> +        if (!d->enabled || pci_pm_state(d)) {
>>>              new_addr = PCI_BAR_UNMAPPED;
>>>          }
>>>  
>>> @@ -1664,6 +1737,7 @@ uint32_t pci_default_read_config(PCIDevice *d,
>>>  
>>>  void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
>>>  {
>>> +    uint8_t new_pm_state, old_pm_state = pci_pm_state(d);
>>>      int i, was_irq_disabled = pci_irq_disabled(d);
>>>      uint32_t val = val_in;
>>>  
>>> @@ -1676,11 +1750,16 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>>>          d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
>>>          d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
>>>      }
>>> +  
>> IIUC the config is first updated and then we check the validity and
>> potentially restore the older value.
>> Couldn't we check the validity before updating d->config?
> We could, but why?  It's easier to deal with the before and after
> values rather than mangle the incoming write value, imo.  I also don't
> think the intermediate value is visible to the guest anyway.  Fixing
> the value "in flight" would need to take into account the wmask,
> duplicating or preempting the code just above.  Thanks,
OK, up to you. That pattern looked a bit unusual to me + update with
'old' param. But I have not seen any functional issue

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric


>
> Alex
>
>>> +    new_pm_state = pci_pm_update(d, addr, l, old_pm_state);
>>> +
>>>      if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
>>>          ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
>>>          ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
>>> -        range_covers_byte(addr, l, PCI_COMMAND))
>>> +        range_covers_byte(addr, l, PCI_COMMAND) ||
>>> +        !!new_pm_state != !!old_pm_state) {
>>>          pci_update_mappings(d);  
>> Eric
>>> +    }
>>>  
>>>      if (ranges_overlap(addr, l, PCI_COMMAND, 2)) {
>>>          pci_update_irq_disabled(d, was_irq_disabled);
>>> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
>>> index 19643aa8c6b0..811354f29031 100644
>>> --- a/hw/pci/trace-events
>>> +++ b/hw/pci/trace-events
>>> @@ -1,6 +1,8 @@
>>>  # See docs/devel/tracing.rst for syntax documentation.
>>>  
>>>  # pci.c
>>> +pci_pm_bad_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x bad PM transition D%d->D%d"
>>> +pci_pm_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x PM transition D%d->D%d"
>>>  pci_update_mappings_del(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>>>  pci_update_mappings_add(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>>>  pci_route_irq(int dev_irq, const char *dev_path, int parent_irq, const char *parent_path) "IRQ %d @%s -> IRQ %d @%s"
>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>> index 4002bbeebde0..c220cc844962 100644
>>> --- a/include/hw/pci/pci.h
>>> +++ b/include/hw/pci/pci.h
>>> @@ -216,6 +216,8 @@ enum {
>>>      QEMU_PCIE_ARI_NEXTFN_1 = (1 << QEMU_PCIE_ARI_NEXTFN_1_BITNR),
>>>  #define QEMU_PCIE_EXT_TAG_BITNR 13
>>>      QEMU_PCIE_EXT_TAG = (1 << QEMU_PCIE_EXT_TAG_BITNR),
>>> +#define QEMU_PCI_CAP_PM_BITNR 14
>>> +    QEMU_PCI_CAP_PM = (1 << QEMU_PCI_CAP_PM_BITNR),
>>>  };
>>>  
>>>  typedef struct PCIINTxRoute {
>>> @@ -676,5 +678,6 @@ static inline void pci_irq_deassert(PCIDevice *pci_dev)
>>>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>>>  void pci_set_enabled(PCIDevice *pci_dev, bool state);
>>>  void pci_set_power(PCIDevice *pci_dev, bool state);
>>> +int pci_pm_init(PCIDevice *pci_dev, uint8_t offset, Error **errp);
>>>  
>>>  #endif
>>> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
>>> index add208edfabd..345b12eaac1a 100644
>>> --- a/include/hw/pci/pci_device.h
>>> +++ b/include/hw/pci/pci_device.h
>>> @@ -105,6 +105,9 @@ struct PCIDevice {
>>>      /* Capability bits */
>>>      uint32_t cap_present;
>>>  
>>> +    /* Offset of PM capability in config space */
>>> +    uint8_t pm_cap;
>>> +
>>>      /* Offset of MSI-X capability in config space */
>>>      uint8_t msix_cap;
>>>    



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-02-25  9:45 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-20 22:48 [PATCH 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
2025-02-20 22:48 ` [PATCH 1/5] hw/pci: Basic support for PCI power management Alex Williamson
2025-02-24 19:03   ` Eric Auger
2025-02-25  5:24     ` Alex Williamson
2025-02-25  9:45       ` Eric Auger
2025-02-20 22:48 ` [PATCH 2/5] pci: Use PCI PM capability initializer Alex Williamson
2025-02-24 18:37   ` Eric Auger
2025-02-24 19:03     ` Alex Williamson
2025-02-20 22:48 ` [PATCH 3/5] vfio/pci: Delete local pm_cap Alex Williamson
2025-02-24 18:38   ` Eric Auger
2025-02-20 22:48 ` [PATCH 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
2025-02-21  6:12   ` Duan, Zhenzhong
2025-02-22  6:00     ` Cédric Le Goater
2025-02-24  1:45       ` Duan, Zhenzhong
2025-02-24 18:40   ` Eric Auger
2025-02-20 22:48 ` [PATCH 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
2025-02-24 20:16   ` Eric Auger
2025-02-20 22:54 ` [PATCH 0/5] PCI: Implement basic PCI PM capability backing Michael S. Tsirkin
2025-02-24  8:21   ` Cédric Le Goater
2025-02-24  1:43 ` Duan, Zhenzhong
2025-02-24  8:14 ` Cédric Le Goater
2025-02-24 15:09   ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).