qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing
@ 2025-02-25 21:52 Alex Williamson
  2025-02-25 21:52 ` [PATCH v2 1/5] hw/pci: Basic support for PCI power management Alex Williamson
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-25 21:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

v2:

Eric noted in v1 that one of the drivers had a redundant wmask setting
since pci_pm_init() enabled writes to the power state field.  This was
added because vfio-pci was not setting wmask for this capability but
is allowing writes to the PM state field through to the device.  For
vfio-pci, QEMU emulated config space is rather secondary to the config
space through vfio.

It turns out therefore, that vfio-pci is nearly unique in not already
managing the wmask of the PM capability state and if we embrace that
it's the pci_pm_init() caller's responsibility to manage the remaining
contents and write-access of the capability, then I think we also
solve the question of migration compatibility.  The new infrastructure
here is not changing whether any fields were previously writable, it's
only effecting a mapping change based on the value found there.

This requires only a slight change to patch 1/, removing setting of
the wmask, but commit log is also updated and comments added.  I also
made the bad transition trace a little more obvious given Eric's
comments.  Patch 2/ is also updated so that vfio-pci effects the wmask
change locally.  The couple drivers that don't currently update wmask
simply don't get this new BAR unmapped when not in D0 behavior.

Incorporated reviews for the unmodified patches.  Please re-review and
report any noted issues.  Thanks,

Alex

v1:

https://lore.kernel.org/all/20250220224918.2520417-1-alex.williamson@redhat.com/

Eric recently identified an issue[1] where during graceful shutdown
of a VM in a vIOMMU configuration, the guest driver places the device
into the D3 power state, the vIOMMU is then disabled, triggering an
AddressSpace update.  The device BARs are still mapped into the AS,
but the vfio host driver refuses to DMA map the MMIO space due to the
device power state.

The proposed solution in [1] was to skip mappings based on the
device power state.  Here we take a different approach.  The PCI spec
defines that devices in D1/2/3 power state should respond only to
configuration and message requests and all other requests should be
handled as an Unsupported Request.  In other words, the memory and
IO BARs are not accessible except when the device is in the D0 power
state.

To emulate this behavior, we can factor the device power state into
the mapping state of the device BARs.  Therefore the BAR is marked
as unmapped if either the respective command register enable bit is
clear or the device is not in the D0 power state.

In order to implement this, the PowerState field of the PMCSR
register becomes writable, which allows the device to appear in
lower power states.  This also therefore implements D3 support
(insofar as the BAR behavior) for all devices implementing the PM
capability.  The PCI spec requires D3 support.

An aspect that needs attention here is whether this change in the
wmask and PMCSR bits becomes a problem for migration, and how we
might solve it.  For a guest migrating old->new, the device would
always be in the D0 power state, but the register becomes writable.
In the opposite direction, is it possible that a device could
migrate in a low power state and be stuck there since the bits are
read-only in old QEMU?  Do we need an option for this behavior and a
machine state bump, or are there alternatives?

Thanks,
Alex

[1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/


Alex Williamson (5):
  hw/pci: Basic support for PCI power management
  pci: Use PCI PM capability initializer
  vfio/pci: Delete local pm_cap
  pcie, virtio: Remove redundant pm_cap
  hw/vfio/pci: Re-order pre-reset

 hw/net/e1000e.c                 |  3 +-
 hw/net/eepro100.c               |  4 +-
 hw/net/igb.c                    |  3 +-
 hw/nvme/ctrl.c                  |  3 +-
 hw/pci-bridge/pcie_pci_bridge.c |  3 +-
 hw/pci/pci.c                    | 93 ++++++++++++++++++++++++++++++++-
 hw/pci/trace-events             |  2 +
 hw/vfio/pci.c                   | 34 ++++++------
 hw/vfio/pci.h                   |  1 -
 hw/virtio/virtio-pci.c          | 11 ++--
 include/hw/pci/pci.h            |  3 ++
 include/hw/pci/pci_device.h     |  3 ++
 include/hw/pci/pcie.h           |  2 -
 13 files changed, 127 insertions(+), 38 deletions(-)

-- 
2.48.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/5] hw/pci: Basic support for PCI power management
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
@ 2025-02-25 21:52 ` Alex Williamson
  2025-02-27  8:16   ` Eric Auger
  2025-02-25 21:52 ` [PATCH v2 2/5] pci: Use PCI PM capability initializer Alex Williamson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2025-02-25 21:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

The memory and IO BARs for devices are only accessible in the D0 power
state.  In other power states the PCI spec defines that the device
responds to TLPs and messages with an Unsupported Request response.

To approximate this behavior, consider the BARs as unmapped when the
device is not in the D0 power state.  This makes the BARs inaccessible
and has the additional bonus for vfio-pci that we don't attempt to DMA
map BARs for devices in a non-D0 power state.

To support this, an interface is added for devices to register the PM
capability, which allows central tracking to enforce valid transitions
and unmap BARs in non-D0 states.

NB. We currently have device models (eepro100 and pcie_pci_bridge)
that register a PM capability but do not set wmask to enable writes to
the power state field.  In order to maintain migration compatibility,
this new helper does not manage the wmask to enable guest writes to
initiate a power state change.  The contents and write access of the
PM capability are still managed by the caller.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci/pci.c                | 93 ++++++++++++++++++++++++++++++++++++-
 hw/pci/trace-events         |  2 +
 include/hw/pci/pci.h        |  3 ++
 include/hw/pci/pci_device.h |  3 ++
 4 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 2afa423925c5..24629807de82 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -435,6 +435,84 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
                          attrs, NULL);
 }
 
+/*
+ * Register and track a PM capability.  If wmask is also enabled for the power
+ * state field of the pmcsr register, guest writes may change the device PM
+ * state.  BAR access is only enabled while the device is in the D0 state.
+ * Return the capability offset or negative error code.
+ */
+int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
+{
+    int cap = pci_add_capability(d, PCI_CAP_ID_PM, offset, PCI_PM_SIZEOF, errp);
+
+    if (cap < 0) {
+        return cap;
+    }
+
+    d->pm_cap = cap;
+    d->cap_present |= QEMU_PCI_CAP_PM;
+
+    return cap;
+}
+
+static uint8_t pci_pm_state(PCIDevice *d)
+{
+    uint16_t pmcsr;
+
+    if (!(d->cap_present & QEMU_PCI_CAP_PM)) {
+        return 0;
+    }
+
+    pmcsr = pci_get_word(d->config + d->pm_cap + PCI_PM_CTRL);
+
+    return pmcsr & PCI_PM_CTRL_STATE_MASK;
+}
+
+/*
+ * Update the PM capability state based on the new value stored in config
+ * space respective to the old, pre-write state provided.  If the new value
+ * is rejected (unsupported or invalid transition) restore the old value.
+ * Return the resulting PM state.
+ */
+static uint8_t pci_pm_update(PCIDevice *d, uint32_t addr, int l, uint8_t old)
+{
+    uint16_t pmc;
+    uint8_t new;
+
+    if (!(d->cap_present & QEMU_PCI_CAP_PM) ||
+        !range_covers_byte(addr, l, d->pm_cap + PCI_PM_CTRL)) {
+        return old;
+    }
+
+    new = pci_pm_state(d);
+    if (new == old) {
+        return old;
+    }
+
+    pmc = pci_get_word(d->config + d->pm_cap + PCI_PM_PMC);
+
+    /*
+     * Transitions to D1 & D2 are only allowed if supported.  Devices may
+     * only transition to higher D-states or to D0.
+     */
+    if ((!(pmc & PCI_PM_CAP_D1) && new == 1) ||
+        (!(pmc & PCI_PM_CAP_D2) && new == 2) ||
+        (old && new && new < old)) {
+        pci_word_test_and_clear_mask(d->config + d->pm_cap + PCI_PM_CTRL,
+                                     PCI_PM_CTRL_STATE_MASK);
+        pci_word_test_and_set_mask(d->config + d->pm_cap + PCI_PM_CTRL,
+                                   old);
+        trace_pci_pm_bad_transition(d->name, pci_dev_bus_num(d),
+                                    PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                                    old, new);
+        return old;
+    }
+
+    trace_pci_pm_transition(d->name, pci_dev_bus_num(d), PCI_SLOT(d->devfn),
+                            PCI_FUNC(d->devfn), old, new);
+    return new;
+}
+
 static void pci_reset_regions(PCIDevice *dev)
 {
     int r;
@@ -474,6 +552,11 @@ static void pci_do_device_reset(PCIDevice *dev)
                               pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
                               pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
     dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
+    /* Default PM state is D0 */
+    if (dev->cap_present & QEMU_PCI_CAP_PM) {
+        pci_word_test_and_clear_mask(dev->config + dev->pm_cap + PCI_PM_CTRL,
+                                     PCI_PM_CTRL_STATE_MASK);
+    }
     pci_reset_regions(dev);
     pci_update_mappings(dev);
 
@@ -1598,7 +1681,7 @@ static void pci_update_mappings(PCIDevice *d)
             continue;
 
         new_addr = pci_bar_address(d, i, r->type, r->size);
-        if (!d->enabled) {
+        if (!d->enabled || pci_pm_state(d)) {
             new_addr = PCI_BAR_UNMAPPED;
         }
 
@@ -1664,6 +1747,7 @@ uint32_t pci_default_read_config(PCIDevice *d,
 
 void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
 {
+    uint8_t new_pm_state, old_pm_state = pci_pm_state(d);
     int i, was_irq_disabled = pci_irq_disabled(d);
     uint32_t val = val_in;
 
@@ -1676,11 +1760,16 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
         d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
         d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
     }
+
+    new_pm_state = pci_pm_update(d, addr, l, old_pm_state);
+
     if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
-        range_covers_byte(addr, l, PCI_COMMAND))
+        range_covers_byte(addr, l, PCI_COMMAND) ||
+        !!new_pm_state != !!old_pm_state) {
         pci_update_mappings(d);
+    }
 
     if (ranges_overlap(addr, l, PCI_COMMAND, 2)) {
         pci_update_irq_disabled(d, was_irq_disabled);
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index 19643aa8c6b0..c82a87ffdd2b 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -1,6 +1,8 @@
 # See docs/devel/tracing.rst for syntax documentation.
 
 # pci.c
+pci_pm_bad_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x REJECTED PM transition D%d->D%d"
+pci_pm_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x PM transition D%d->D%d"
 pci_update_mappings_del(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
 pci_update_mappings_add(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
 pci_route_irq(int dev_irq, const char *dev_path, int parent_irq, const char *parent_path) "IRQ %d @%s -> IRQ %d @%s"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 4002bbeebde0..c220cc844962 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -216,6 +216,8 @@ enum {
     QEMU_PCIE_ARI_NEXTFN_1 = (1 << QEMU_PCIE_ARI_NEXTFN_1_BITNR),
 #define QEMU_PCIE_EXT_TAG_BITNR 13
     QEMU_PCIE_EXT_TAG = (1 << QEMU_PCIE_EXT_TAG_BITNR),
+#define QEMU_PCI_CAP_PM_BITNR 14
+    QEMU_PCI_CAP_PM = (1 << QEMU_PCI_CAP_PM_BITNR),
 };
 
 typedef struct PCIINTxRoute {
@@ -676,5 +678,6 @@ static inline void pci_irq_deassert(PCIDevice *pci_dev)
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
 void pci_set_enabled(PCIDevice *pci_dev, bool state);
 void pci_set_power(PCIDevice *pci_dev, bool state);
+int pci_pm_init(PCIDevice *pci_dev, uint8_t offset, Error **errp);
 
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index add208edfabd..345b12eaac1a 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -105,6 +105,9 @@ struct PCIDevice {
     /* Capability bits */
     uint32_t cap_present;
 
+    /* Offset of PM capability in config space */
+    uint8_t pm_cap;
+
     /* Offset of MSI-X capability in config space */
     uint8_t msix_cap;
 
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/5] pci: Use PCI PM capability initializer
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
  2025-02-25 21:52 ` [PATCH v2 1/5] hw/pci: Basic support for PCI power management Alex Williamson
@ 2025-02-25 21:52 ` Alex Williamson
  2025-02-27  7:23   ` Akihiko Odaki
  2025-02-27  8:17   ` Eric Auger
  2025-02-25 21:52 ` [PATCH v2 3/5] vfio/pci: Delete local pm_cap Alex Williamson
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-25 21:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum, Dmitry Fleytman, Akihiko Odaki, Jason Wang,
	Stefan Weil, Sriram Yagnaraman, Keith Busch, Klaus Jensen,
	Jesper Devantier

Switch callers directly initializing the PCI PM capability with
pci_add_capability() to use pci_pm_init().

Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com>
Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stefan Weil <sw@weilnetz.de>
Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Klaus Jensen <its@irrelevant.dk>
Cc: Jesper Devantier <foss@defmacro.it>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/net/e1000e.c                 | 3 +--
 hw/net/eepro100.c               | 4 +---
 hw/net/igb.c                    | 3 +--
 hw/nvme/ctrl.c                  | 3 +--
 hw/pci-bridge/pcie_pci_bridge.c | 2 +-
 hw/vfio/pci.c                   | 7 ++++++-
 hw/virtio/virtio-pci.c          | 3 +--
 7 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index f637853073e2..b72cbab7e889 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -372,8 +372,7 @@ static int
 e1000e_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
 {
     Error *local_err = NULL;
-    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
-                                 PCI_PM_SIZEOF, &local_err);
+    int ret = pci_pm_init(pdev, offset, &local_err);
 
     if (local_err) {
         error_report_err(local_err);
diff --git a/hw/net/eepro100.c b/hw/net/eepro100.c
index 6d853229aec2..29a39865a608 100644
--- a/hw/net/eepro100.c
+++ b/hw/net/eepro100.c
@@ -551,9 +551,7 @@ static void e100_pci_reset(EEPRO100State *s, Error **errp)
     if (info->power_management) {
         /* Power Management Capabilities */
         int cfg_offset = 0xdc;
-        int r = pci_add_capability(&s->dev, PCI_CAP_ID_PM,
-                                   cfg_offset, PCI_PM_SIZEOF,
-                                   errp);
+        int r = pci_pm_init(&s->dev, cfg_offset, errp);
         if (r < 0) {
             return;
         }
diff --git a/hw/net/igb.c b/hw/net/igb.c
index 4d93ce629f95..700dbc746d3d 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -356,8 +356,7 @@ static int
 igb_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
 {
     Error *local_err = NULL;
-    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
-                                 PCI_PM_SIZEOF, &local_err);
+    int ret = pci_pm_init(pdev, offset, &local_err);
 
     if (local_err) {
         error_report_err(local_err);
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 68903d1d7067..1faea3d2b85b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8503,8 +8503,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
     Error *err = NULL;
     int ret;
 
-    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
-                             PCI_PM_SIZEOF, &err);
+    ret = pci_pm_init(pci_dev, offset, &err);
     if (err) {
         error_report_err(err);
         return ret;
diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
index fd4514a595ce..9fa656b43b42 100644
--- a/hw/pci-bridge/pcie_pci_bridge.c
+++ b/hw/pci-bridge/pcie_pci_bridge.c
@@ -52,7 +52,7 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
         goto cap_error;
     }
 
-    pos = pci_add_capability(d, PCI_CAP_ID_PM, 0, PCI_PM_SIZEOF, errp);
+    pos = pci_pm_init(d, 0, errp);
     if (pos < 0) {
         goto pm_error;
     }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 89d900e9cf0c..1a4a0b4b15b4 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2216,7 +2216,12 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
     case PCI_CAP_ID_PM:
         vfio_check_pm_reset(vdev, pos);
         vdev->pm_cap = pos;
-        ret = pci_add_capability(pdev, cap_id, pos, size, errp) >= 0;
+        ret = pci_pm_init(pdev, pos, errp) >= 0;
+        /*
+         * PCI-core config space emulation needs write access to the power
+         * state enabled for tracking BAR mapping relative to PM state.
+         */
+        pci_set_word(pdev->wmask + pos + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
         break;
     case PCI_CAP_ID_AF:
         vfio_check_af_flr(vdev, pos);
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c773a9130c7e..afe8b5551c5c 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2204,8 +2204,7 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
         pos = pcie_endpoint_cap_init(pci_dev, 0);
         assert(pos > 0);
 
-        pos = pci_add_capability(pci_dev, PCI_CAP_ID_PM, 0,
-                                 PCI_PM_SIZEOF, errp);
+        pos = pci_pm_init(pci_dev, 0, errp);
         if (pos < 0) {
             return;
         }
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/5] vfio/pci: Delete local pm_cap
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
  2025-02-25 21:52 ` [PATCH v2 1/5] hw/pci: Basic support for PCI power management Alex Williamson
  2025-02-25 21:52 ` [PATCH v2 2/5] pci: Use PCI PM capability initializer Alex Williamson
@ 2025-02-25 21:52 ` Alex Williamson
  2025-02-25 21:52 ` [PATCH v2 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-25 21:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

This is now redundant to PCIDevice.pm_cap.

Cc: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c | 9 ++++-----
 hw/vfio/pci.h | 1 -
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1a4a0b4b15b4..eab8974e9b48 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2215,7 +2215,6 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
         break;
     case PCI_CAP_ID_PM:
         vfio_check_pm_reset(vdev, pos);
-        vdev->pm_cap = pos;
         ret = pci_pm_init(pdev, pos, errp) >= 0;
         /*
          * PCI-core config space emulation needs write access to the power
@@ -2412,17 +2411,17 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
     vfio_disable_interrupts(vdev);
 
     /* Make sure the device is in D0 */
-    if (vdev->pm_cap) {
+    if (pdev->pm_cap) {
         uint16_t pmcsr;
         uint8_t state;
 
-        pmcsr = vfio_pci_read_config(pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
+        pmcsr = vfio_pci_read_config(pdev, pdev->pm_cap + PCI_PM_CTRL, 2);
         state = pmcsr & PCI_PM_CTRL_STATE_MASK;
         if (state) {
             pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
-            vfio_pci_write_config(pdev, vdev->pm_cap + PCI_PM_CTRL, pmcsr, 2);
+            vfio_pci_write_config(pdev, pdev->pm_cap + PCI_PM_CTRL, pmcsr, 2);
             /* vfio handles the necessary delay here */
-            pmcsr = vfio_pci_read_config(pdev, vdev->pm_cap + PCI_PM_CTRL, 2);
+            pmcsr = vfio_pci_read_config(pdev, pdev->pm_cap + PCI_PM_CTRL, 2);
             state = pmcsr & PCI_PM_CTRL_STATE_MASK;
             if (state) {
                 error_report("vfio: Unable to power on device, stuck in D%d",
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 43c166680abb..d638c781f6f1 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -160,7 +160,6 @@ struct VFIOPCIDevice {
     int32_t bootindex;
     uint32_t igd_gms;
     OffAutoPCIBAR msix_relo;
-    uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
     bool pci_aer;
     bool req_enabled;
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/5] pcie, virtio: Remove redundant pm_cap
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (2 preceding siblings ...)
  2025-02-25 21:52 ` [PATCH v2 3/5] vfio/pci: Delete local pm_cap Alex Williamson
@ 2025-02-25 21:52 ` Alex Williamson
  2025-02-25 21:52 ` [PATCH v2 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-25 21:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

The pm_cap on the PCIExpressDevice object can be distilled down
to the new instance on the PCIDevice object.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci-bridge/pcie_pci_bridge.c | 1 -
 hw/virtio/virtio-pci.c          | 8 +++-----
 include/hw/pci/pcie.h           | 2 --
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
index 9fa656b43b42..2429503cfbbf 100644
--- a/hw/pci-bridge/pcie_pci_bridge.c
+++ b/hw/pci-bridge/pcie_pci_bridge.c
@@ -56,7 +56,6 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
     if (pos < 0) {
         goto pm_error;
     }
-    d->exp.pm_cap = pos;
     pci_set_word(d->config + pos + PCI_PM_PMC, 0x3);
 
     pcie_cap_arifwd_init(d);
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index afe8b5551c5c..3ca3f849d391 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2209,8 +2209,6 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
             return;
         }
 
-        pci_dev->exp.pm_cap = pos;
-
         /*
          * Indicates that this function complies with revision 1.2 of the
          * PCI Power Management Interface Specification.
@@ -2309,11 +2307,11 @@ static bool virtio_pci_no_soft_reset(PCIDevice *dev)
 {
     uint16_t pmcsr;
 
-    if (!pci_is_express(dev) || !dev->exp.pm_cap) {
+    if (!pci_is_express(dev) || !(dev->cap_present & QEMU_PCI_CAP_PM)) {
         return false;
     }
 
-    pmcsr = pci_get_word(dev->config + dev->exp.pm_cap + PCI_PM_CTRL);
+    pmcsr = pci_get_word(dev->config + dev->pm_cap + PCI_PM_CTRL);
 
     /*
      * When No_Soft_Reset bit is set and the device
@@ -2342,7 +2340,7 @@ static void virtio_pci_bus_reset_hold(Object *obj, ResetType type)
 
         if (proxy->flags & VIRTIO_PCI_FLAG_INIT_PM) {
             pci_word_test_and_clear_mask(
-                dev->config + dev->exp.pm_cap + PCI_PM_CTRL,
+                dev->config + dev->pm_cap + PCI_PM_CTRL,
                 PCI_PM_CTRL_STATE_MASK);
         }
     }
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index b8d59732bc63..70a5de09de39 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -58,8 +58,6 @@ typedef enum {
 struct PCIExpressDevice {
     /* Offset of express capability in config space */
     uint8_t exp_cap;
-    /* Offset of Power Management capability in config space */
-    uint8_t pm_cap;
 
     /* SLOT */
     bool hpev_notified; /* Logical AND of conditions for hot plug event.
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 5/5] hw/vfio/pci: Re-order pre-reset
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (3 preceding siblings ...)
  2025-02-25 21:52 ` [PATCH v2 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
@ 2025-02-25 21:52 ` Alex Williamson
  2025-03-04 12:18 ` [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Cédric Le Goater
  2025-03-05 17:42 ` Cédric Le Goater
  6 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2025-02-25 21:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, eric.auger.pro, eric.auger, clg, zhenzhong.duan,
	mst, marcel.apfelbaum

We want the device in the D0 power state going into reset, but the
config write can enable the BARs in the address space, which are
then removed from the address space once we clear the memory enable
bit in the command register.  Re-order to clear the command bit
first, so the power state change doesn't enable the BARs.

Cc: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index eab8974e9b48..153455fae85d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2410,6 +2410,15 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
 
     vfio_disable_interrupts(vdev);
 
+    /*
+     * Stop any ongoing DMA by disconnecting I/O, MMIO, and bus master.
+     * Also put INTx Disable in known state.
+     */
+    cmd = vfio_pci_read_config(pdev, PCI_COMMAND, 2);
+    cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
+             PCI_COMMAND_INTX_DISABLE);
+    vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
+
     /* Make sure the device is in D0 */
     if (pdev->pm_cap) {
         uint16_t pmcsr;
@@ -2429,15 +2438,6 @@ void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
             }
         }
     }
-
-    /*
-     * Stop any ongoing DMA by disconnecting I/O, MMIO, and bus master.
-     * Also put INTx Disable in known state.
-     */
-    cmd = vfio_pci_read_config(pdev, PCI_COMMAND, 2);
-    cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER |
-             PCI_COMMAND_INTX_DISABLE);
-    vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2);
 }
 
 void vfio_pci_post_reset(VFIOPCIDevice *vdev)
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/5] pci: Use PCI PM capability initializer
  2025-02-25 21:52 ` [PATCH v2 2/5] pci: Use PCI PM capability initializer Alex Williamson
@ 2025-02-27  7:23   ` Akihiko Odaki
  2025-02-27  8:17   ` Eric Auger
  1 sibling, 0 replies; 12+ messages in thread
From: Akihiko Odaki @ 2025-02-27  7:23 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, eric.auger, clg, zhenzhong.duan, mst,
	marcel.apfelbaum, Dmitry Fleytman, Jason Wang, Stefan Weil,
	Sriram Yagnaraman, Keith Busch, Klaus Jensen, Jesper Devantier

On 2025/02/26 6:52, Alex Williamson wrote:
> Switch callers directly initializing the PCI PM capability with
> pci_add_capability() to use pci_pm_init().
> 
> Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com>
> Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Stefan Weil <sw@weilnetz.de>
> Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Klaus Jensen <its@irrelevant.dk>
> Cc: Jesper Devantier <foss@defmacro.it>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/5] hw/pci: Basic support for PCI power management
  2025-02-25 21:52 ` [PATCH v2 1/5] hw/pci: Basic support for PCI power management Alex Williamson
@ 2025-02-27  8:16   ` Eric Auger
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Auger @ 2025-02-27  8:16 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum




On 2/25/25 10:52 PM, Alex Williamson wrote:
> The memory and IO BARs for devices are only accessible in the D0 power
> state.  In other power states the PCI spec defines that the device
> responds to TLPs and messages with an Unsupported Request response.
>
> To approximate this behavior, consider the BARs as unmapped when the
> device is not in the D0 power state.  This makes the BARs inaccessible
> and has the additional bonus for vfio-pci that we don't attempt to DMA
> map BARs for devices in a non-D0 power state.
>
> To support this, an interface is added for devices to register the PM
> capability, which allows central tracking to enforce valid transitions
> and unmap BARs in non-D0 states.
>
> NB. We currently have device models (eepro100 and pcie_pci_bridge)
> that register a PM capability but do not set wmask to enable writes to
> the power state field.  In order to maintain migration compatibility,
> this new helper does not manage the wmask to enable guest writes to
> initiate a power state change.  The contents and write access of the
> PM capability are still managed by the caller.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/pci/pci.c                | 93 ++++++++++++++++++++++++++++++++++++-
>  hw/pci/trace-events         |  2 +
>  include/hw/pci/pci.h        |  3 ++
>  include/hw/pci/pci_device.h |  3 ++
>  4 files changed, 99 insertions(+), 2 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 2afa423925c5..24629807de82 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -435,6 +435,84 @@ static void pci_msi_trigger(PCIDevice *dev, MSIMessage msg)
>                           attrs, NULL);
>  }
>  
> +/*
> + * Register and track a PM capability.  If wmask is also enabled for the power
> + * state field of the pmcsr register, guest writes may change the device PM
> + * state.  BAR access is only enabled while the device is in the D0 state.
> + * Return the capability offset or negative error code.
> + */
> +int pci_pm_init(PCIDevice *d, uint8_t offset, Error **errp)
> +{
> +    int cap = pci_add_capability(d, PCI_CAP_ID_PM, offset, PCI_PM_SIZEOF, errp);
> +
> +    if (cap < 0) {
> +        return cap;
> +    }
> +
> +    d->pm_cap = cap;
> +    d->cap_present |= QEMU_PCI_CAP_PM;
> +
> +    return cap;
> +}
> +
> +static uint8_t pci_pm_state(PCIDevice *d)
> +{
> +    uint16_t pmcsr;
> +
> +    if (!(d->cap_present & QEMU_PCI_CAP_PM)) {
> +        return 0;
> +    }
> +
> +    pmcsr = pci_get_word(d->config + d->pm_cap + PCI_PM_CTRL);
> +
> +    return pmcsr & PCI_PM_CTRL_STATE_MASK;
> +}
> +
> +/*
> + * Update the PM capability state based on the new value stored in config
> + * space respective to the old, pre-write state provided.  If the new value
> + * is rejected (unsupported or invalid transition) restore the old value.
> + * Return the resulting PM state.
> + */
> +static uint8_t pci_pm_update(PCIDevice *d, uint32_t addr, int l, uint8_t old)
> +{
> +    uint16_t pmc;
> +    uint8_t new;
> +
> +    if (!(d->cap_present & QEMU_PCI_CAP_PM) ||
> +        !range_covers_byte(addr, l, d->pm_cap + PCI_PM_CTRL)) {
> +        return old;
> +    }
> +
> +    new = pci_pm_state(d);
> +    if (new == old) {
> +        return old;
> +    }
> +
> +    pmc = pci_get_word(d->config + d->pm_cap + PCI_PM_PMC);
> +
> +    /*
> +     * Transitions to D1 & D2 are only allowed if supported.  Devices may
> +     * only transition to higher D-states or to D0.
> +     */
> +    if ((!(pmc & PCI_PM_CAP_D1) && new == 1) ||
> +        (!(pmc & PCI_PM_CAP_D2) && new == 2) ||
> +        (old && new && new < old)) {
> +        pci_word_test_and_clear_mask(d->config + d->pm_cap + PCI_PM_CTRL,
> +                                     PCI_PM_CTRL_STATE_MASK);
> +        pci_word_test_and_set_mask(d->config + d->pm_cap + PCI_PM_CTRL,
> +                                   old);
> +        trace_pci_pm_bad_transition(d->name, pci_dev_bus_num(d),
> +                                    PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                                    old, new);
> +        return old;
> +    }
> +
> +    trace_pci_pm_transition(d->name, pci_dev_bus_num(d), PCI_SLOT(d->devfn),
> +                            PCI_FUNC(d->devfn), old, new);
> +    return new;
> +}
> +
>  static void pci_reset_regions(PCIDevice *dev)
>  {
>      int r;
> @@ -474,6 +552,11 @@ static void pci_do_device_reset(PCIDevice *dev)
>                                pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>                                pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
>      dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> +    /* Default PM state is D0 */
> +    if (dev->cap_present & QEMU_PCI_CAP_PM) {
> +        pci_word_test_and_clear_mask(dev->config + dev->pm_cap + PCI_PM_CTRL,
> +                                     PCI_PM_CTRL_STATE_MASK);
> +    }
>      pci_reset_regions(dev);
>      pci_update_mappings(dev);
>  
> @@ -1598,7 +1681,7 @@ static void pci_update_mappings(PCIDevice *d)
>              continue;
>  
>          new_addr = pci_bar_address(d, i, r->type, r->size);
> -        if (!d->enabled) {
> +        if (!d->enabled || pci_pm_state(d)) {
>              new_addr = PCI_BAR_UNMAPPED;
>          }
>  
> @@ -1664,6 +1747,7 @@ uint32_t pci_default_read_config(PCIDevice *d,
>  
>  void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
>  {
> +    uint8_t new_pm_state, old_pm_state = pci_pm_state(d);
>      int i, was_irq_disabled = pci_irq_disabled(d);
>      uint32_t val = val_in;
>  
> @@ -1676,11 +1760,16 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>          d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
>          d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
>      }
> +
> +    new_pm_state = pci_pm_update(d, addr, l, old_pm_state);
> +
>      if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
>          ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
>          ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
> -        range_covers_byte(addr, l, PCI_COMMAND))
> +        range_covers_byte(addr, l, PCI_COMMAND) ||
> +        !!new_pm_state != !!old_pm_state) {
>          pci_update_mappings(d);
> +    }
>  
>      if (ranges_overlap(addr, l, PCI_COMMAND, 2)) {
>          pci_update_irq_disabled(d, was_irq_disabled);
> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
> index 19643aa8c6b0..c82a87ffdd2b 100644
> --- a/hw/pci/trace-events
> +++ b/hw/pci/trace-events
> @@ -1,6 +1,8 @@
>  # See docs/devel/tracing.rst for syntax documentation.
>  
>  # pci.c
> +pci_pm_bad_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x REJECTED PM transition D%d->D%d"
> +pci_pm_transition(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, uint8_t old, uint8_t new) "%s %02x:%02x.%x PM transition D%d->D%d"
>  pci_update_mappings_del(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>  pci_update_mappings_add(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "%s %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>  pci_route_irq(int dev_irq, const char *dev_path, int parent_irq, const char *parent_path) "IRQ %d @%s -> IRQ %d @%s"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 4002bbeebde0..c220cc844962 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -216,6 +216,8 @@ enum {
>      QEMU_PCIE_ARI_NEXTFN_1 = (1 << QEMU_PCIE_ARI_NEXTFN_1_BITNR),
>  #define QEMU_PCIE_EXT_TAG_BITNR 13
>      QEMU_PCIE_EXT_TAG = (1 << QEMU_PCIE_EXT_TAG_BITNR),
> +#define QEMU_PCI_CAP_PM_BITNR 14
> +    QEMU_PCI_CAP_PM = (1 << QEMU_PCI_CAP_PM_BITNR),
>  };
>  
>  typedef struct PCIINTxRoute {
> @@ -676,5 +678,6 @@ static inline void pci_irq_deassert(PCIDevice *pci_dev)
>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>  void pci_set_enabled(PCIDevice *pci_dev, bool state);
>  void pci_set_power(PCIDevice *pci_dev, bool state);
> +int pci_pm_init(PCIDevice *pci_dev, uint8_t offset, Error **errp);
>  
>  #endif
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index add208edfabd..345b12eaac1a 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -105,6 +105,9 @@ struct PCIDevice {
>      /* Capability bits */
>      uint32_t cap_present;
>  
> +    /* Offset of PM capability in config space */
> +    uint8_t pm_cap;
> +
>      /* Offset of MSI-X capability in config space */
>      uint8_t msix_cap;
>  



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/5] pci: Use PCI PM capability initializer
  2025-02-25 21:52 ` [PATCH v2 2/5] pci: Use PCI PM capability initializer Alex Williamson
  2025-02-27  7:23   ` Akihiko Odaki
@ 2025-02-27  8:17   ` Eric Auger
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Auger @ 2025-02-27  8:17 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, clg, zhenzhong.duan, mst, marcel.apfelbaum,
	Dmitry Fleytman, Akihiko Odaki, Jason Wang, Stefan Weil,
	Sriram Yagnaraman, Keith Busch, Klaus Jensen, Jesper Devantier




On 2/25/25 10:52 PM, Alex Williamson wrote:
> Switch callers directly initializing the PCI PM capability with
> pci_add_capability() to use pci_pm_init().
>
> Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com>
> Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Stefan Weil <sw@weilnetz.de>
> Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Klaus Jensen <its@irrelevant.dk>
> Cc: Jesper Devantier <foss@defmacro.it>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/net/e1000e.c                 | 3 +--
>  hw/net/eepro100.c               | 4 +---
>  hw/net/igb.c                    | 3 +--
>  hw/nvme/ctrl.c                  | 3 +--
>  hw/pci-bridge/pcie_pci_bridge.c | 2 +-
>  hw/vfio/pci.c                   | 7 ++++++-
>  hw/virtio/virtio-pci.c          | 3 +--
>  7 files changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
> index f637853073e2..b72cbab7e889 100644
> --- a/hw/net/e1000e.c
> +++ b/hw/net/e1000e.c
> @@ -372,8 +372,7 @@ static int
>  e1000e_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
>  {
>      Error *local_err = NULL;
> -    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
> -                                 PCI_PM_SIZEOF, &local_err);
> +    int ret = pci_pm_init(pdev, offset, &local_err);
>  
>      if (local_err) {
>          error_report_err(local_err);
> diff --git a/hw/net/eepro100.c b/hw/net/eepro100.c
> index 6d853229aec2..29a39865a608 100644
> --- a/hw/net/eepro100.c
> +++ b/hw/net/eepro100.c
> @@ -551,9 +551,7 @@ static void e100_pci_reset(EEPRO100State *s, Error **errp)
>      if (info->power_management) {
>          /* Power Management Capabilities */
>          int cfg_offset = 0xdc;
> -        int r = pci_add_capability(&s->dev, PCI_CAP_ID_PM,
> -                                   cfg_offset, PCI_PM_SIZEOF,
> -                                   errp);
> +        int r = pci_pm_init(&s->dev, cfg_offset, errp);
>          if (r < 0) {
>              return;
>          }
> diff --git a/hw/net/igb.c b/hw/net/igb.c
> index 4d93ce629f95..700dbc746d3d 100644
> --- a/hw/net/igb.c
> +++ b/hw/net/igb.c
> @@ -356,8 +356,7 @@ static int
>  igb_add_pm_capability(PCIDevice *pdev, uint8_t offset, uint16_t pmc)
>  {
>      Error *local_err = NULL;
> -    int ret = pci_add_capability(pdev, PCI_CAP_ID_PM, offset,
> -                                 PCI_PM_SIZEOF, &local_err);
> +    int ret = pci_pm_init(pdev, offset, &local_err);
>  
>      if (local_err) {
>          error_report_err(local_err);
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 68903d1d7067..1faea3d2b85b 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -8503,8 +8503,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
>      Error *err = NULL;
>      int ret;
>  
> -    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
> -                             PCI_PM_SIZEOF, &err);
> +    ret = pci_pm_init(pci_dev, offset, &err);
>      if (err) {
>          error_report_err(err);
>          return ret;
> diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
> index fd4514a595ce..9fa656b43b42 100644
> --- a/hw/pci-bridge/pcie_pci_bridge.c
> +++ b/hw/pci-bridge/pcie_pci_bridge.c
> @@ -52,7 +52,7 @@ static void pcie_pci_bridge_realize(PCIDevice *d, Error **errp)
>          goto cap_error;
>      }
>  
> -    pos = pci_add_capability(d, PCI_CAP_ID_PM, 0, PCI_PM_SIZEOF, errp);
> +    pos = pci_pm_init(d, 0, errp);
>      if (pos < 0) {
>          goto pm_error;
>      }
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 89d900e9cf0c..1a4a0b4b15b4 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2216,7 +2216,12 @@ static bool vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
>      case PCI_CAP_ID_PM:
>          vfio_check_pm_reset(vdev, pos);
>          vdev->pm_cap = pos;
> -        ret = pci_add_capability(pdev, cap_id, pos, size, errp) >= 0;
> +        ret = pci_pm_init(pdev, pos, errp) >= 0;
> +        /*
> +         * PCI-core config space emulation needs write access to the power
> +         * state enabled for tracking BAR mapping relative to PM state.
> +         */
> +        pci_set_word(pdev->wmask + pos + PCI_PM_CTRL, PCI_PM_CTRL_STATE_MASK);
>          break;
>      case PCI_CAP_ID_AF:
>          vfio_check_af_flr(vdev, pos);
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index c773a9130c7e..afe8b5551c5c 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2204,8 +2204,7 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
>          pos = pcie_endpoint_cap_init(pci_dev, 0);
>          assert(pos > 0);
>  
> -        pos = pci_add_capability(pci_dev, PCI_CAP_ID_PM, 0,
> -                                 PCI_PM_SIZEOF, errp);
> +        pos = pci_pm_init(pci_dev, 0, errp);
>          if (pos < 0) {
>              return;
>          }



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (4 preceding siblings ...)
  2025-02-25 21:52 ` [PATCH v2 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
@ 2025-03-04 12:18 ` Cédric Le Goater
  2025-03-04 17:14   ` Michael S. Tsirkin
  2025-03-05 17:42 ` Cédric Le Goater
  6 siblings, 1 reply; 12+ messages in thread
From: Cédric Le Goater @ 2025-03-04 12:18 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, eric.auger, zhenzhong.duan, mst, marcel.apfelbaum

Hello Michael,

Could you please re-ack (or not) v2 ?

Thanks

C.

On 2/25/25 22:52, Alex Williamson wrote:
> v2:
> 
> Eric noted in v1 that one of the drivers had a redundant wmask setting
> since pci_pm_init() enabled writes to the power state field.  This was
> added because vfio-pci was not setting wmask for this capability but
> is allowing writes to the PM state field through to the device.  For
> vfio-pci, QEMU emulated config space is rather secondary to the config
> space through vfio.
> 
> It turns out therefore, that vfio-pci is nearly unique in not already
> managing the wmask of the PM capability state and if we embrace that
> it's the pci_pm_init() caller's responsibility to manage the remaining
> contents and write-access of the capability, then I think we also
> solve the question of migration compatibility.  The new infrastructure
> here is not changing whether any fields were previously writable, it's
> only effecting a mapping change based on the value found there.
> 
> This requires only a slight change to patch 1/, removing setting of
> the wmask, but commit log is also updated and comments added.  I also
> made the bad transition trace a little more obvious given Eric's
> comments.  Patch 2/ is also updated so that vfio-pci effects the wmask
> change locally.  The couple drivers that don't currently update wmask
> simply don't get this new BAR unmapped when not in D0 behavior.
> 
> Incorporated reviews for the unmodified patches.  Please re-review and
> report any noted issues.  Thanks,
> 
> Alex
> 
> v1:
> 
> https://lore.kernel.org/all/20250220224918.2520417-1-alex.williamson@redhat.com/
> 
> Eric recently identified an issue[1] where during graceful shutdown
> of a VM in a vIOMMU configuration, the guest driver places the device
> into the D3 power state, the vIOMMU is then disabled, triggering an
> AddressSpace update.  The device BARs are still mapped into the AS,
> but the vfio host driver refuses to DMA map the MMIO space due to the
> device power state.
> 
> The proposed solution in [1] was to skip mappings based on the
> device power state.  Here we take a different approach.  The PCI spec
> defines that devices in D1/2/3 power state should respond only to
> configuration and message requests and all other requests should be
> handled as an Unsupported Request.  In other words, the memory and
> IO BARs are not accessible except when the device is in the D0 power
> state.
> 
> To emulate this behavior, we can factor the device power state into
> the mapping state of the device BARs.  Therefore the BAR is marked
> as unmapped if either the respective command register enable bit is
> clear or the device is not in the D0 power state.
> 
> In order to implement this, the PowerState field of the PMCSR
> register becomes writable, which allows the device to appear in
> lower power states.  This also therefore implements D3 support
> (insofar as the BAR behavior) for all devices implementing the PM
> capability.  The PCI spec requires D3 support.
> 
> An aspect that needs attention here is whether this change in the
> wmask and PMCSR bits becomes a problem for migration, and how we
> might solve it.  For a guest migrating old->new, the device would
> always be in the D0 power state, but the register becomes writable.
> In the opposite direction, is it possible that a device could
> migrate in a low power state and be stuck there since the bits are
> read-only in old QEMU?  Do we need an option for this behavior and a
> machine state bump, or are there alternatives?
> 
> Thanks,
> Alex
> 
> [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/
> 
> 
> Alex Williamson (5):
>    hw/pci: Basic support for PCI power management
>    pci: Use PCI PM capability initializer
>    vfio/pci: Delete local pm_cap
>    pcie, virtio: Remove redundant pm_cap
>    hw/vfio/pci: Re-order pre-reset
> 
>   hw/net/e1000e.c                 |  3 +-
>   hw/net/eepro100.c               |  4 +-
>   hw/net/igb.c                    |  3 +-
>   hw/nvme/ctrl.c                  |  3 +-
>   hw/pci-bridge/pcie_pci_bridge.c |  3 +-
>   hw/pci/pci.c                    | 93 ++++++++++++++++++++++++++++++++-
>   hw/pci/trace-events             |  2 +
>   hw/vfio/pci.c                   | 34 ++++++------
>   hw/vfio/pci.h                   |  1 -
>   hw/virtio/virtio-pci.c          | 11 ++--
>   include/hw/pci/pci.h            |  3 ++
>   include/hw/pci/pci_device.h     |  3 ++
>   include/hw/pci/pcie.h           |  2 -
>   13 files changed, 127 insertions(+), 38 deletions(-)
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing
  2025-03-04 12:18 ` [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Cédric Le Goater
@ 2025-03-04 17:14   ` Michael S. Tsirkin
  0 siblings, 0 replies; 12+ messages in thread
From: Michael S. Tsirkin @ 2025-03-04 17:14 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, qemu-devel, eric.auger.pro, eric.auger,
	zhenzhong.duan, marcel.apfelbaum

On Tue, Mar 04, 2025 at 01:18:45PM +0100, Cédric Le Goater wrote:
> Hello Michael,
> 
> Could you please re-ack (or not) v2 ?
> 
> Thanks
> 
> C.


pci things:
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>


> On 2/25/25 22:52, Alex Williamson wrote:
> > v2:
> > 
> > Eric noted in v1 that one of the drivers had a redundant wmask setting
> > since pci_pm_init() enabled writes to the power state field.  This was
> > added because vfio-pci was not setting wmask for this capability but
> > is allowing writes to the PM state field through to the device.  For
> > vfio-pci, QEMU emulated config space is rather secondary to the config
> > space through vfio.
> > 
> > It turns out therefore, that vfio-pci is nearly unique in not already
> > managing the wmask of the PM capability state and if we embrace that
> > it's the pci_pm_init() caller's responsibility to manage the remaining
> > contents and write-access of the capability, then I think we also
> > solve the question of migration compatibility.  The new infrastructure
> > here is not changing whether any fields were previously writable, it's
> > only effecting a mapping change based on the value found there.
> > 
> > This requires only a slight change to patch 1/, removing setting of
> > the wmask, but commit log is also updated and comments added.  I also
> > made the bad transition trace a little more obvious given Eric's
> > comments.  Patch 2/ is also updated so that vfio-pci effects the wmask
> > change locally.  The couple drivers that don't currently update wmask
> > simply don't get this new BAR unmapped when not in D0 behavior.
> > 
> > Incorporated reviews for the unmodified patches.  Please re-review and
> > report any noted issues.  Thanks,
> > 
> > Alex
> > 
> > v1:
> > 
> > https://lore.kernel.org/all/20250220224918.2520417-1-alex.williamson@redhat.com/
> > 
> > Eric recently identified an issue[1] where during graceful shutdown
> > of a VM in a vIOMMU configuration, the guest driver places the device
> > into the D3 power state, the vIOMMU is then disabled, triggering an
> > AddressSpace update.  The device BARs are still mapped into the AS,
> > but the vfio host driver refuses to DMA map the MMIO space due to the
> > device power state.
> > 
> > The proposed solution in [1] was to skip mappings based on the
> > device power state.  Here we take a different approach.  The PCI spec
> > defines that devices in D1/2/3 power state should respond only to
> > configuration and message requests and all other requests should be
> > handled as an Unsupported Request.  In other words, the memory and
> > IO BARs are not accessible except when the device is in the D0 power
> > state.
> > 
> > To emulate this behavior, we can factor the device power state into
> > the mapping state of the device BARs.  Therefore the BAR is marked
> > as unmapped if either the respective command register enable bit is
> > clear or the device is not in the D0 power state.
> > 
> > In order to implement this, the PowerState field of the PMCSR
> > register becomes writable, which allows the device to appear in
> > lower power states.  This also therefore implements D3 support
> > (insofar as the BAR behavior) for all devices implementing the PM
> > capability.  The PCI spec requires D3 support.
> > 
> > An aspect that needs attention here is whether this change in the
> > wmask and PMCSR bits becomes a problem for migration, and how we
> > might solve it.  For a guest migrating old->new, the device would
> > always be in the D0 power state, but the register becomes writable.
> > In the opposite direction, is it possible that a device could
> > migrate in a low power state and be stuck there since the bits are
> > read-only in old QEMU?  Do we need an option for this behavior and a
> > machine state bump, or are there alternatives?
> > 
> > Thanks,
> > Alex
> > 
> > [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/
> > 
> > 
> > Alex Williamson (5):
> >    hw/pci: Basic support for PCI power management
> >    pci: Use PCI PM capability initializer
> >    vfio/pci: Delete local pm_cap
> >    pcie, virtio: Remove redundant pm_cap
> >    hw/vfio/pci: Re-order pre-reset
> > 
> >   hw/net/e1000e.c                 |  3 +-
> >   hw/net/eepro100.c               |  4 +-
> >   hw/net/igb.c                    |  3 +-
> >   hw/nvme/ctrl.c                  |  3 +-
> >   hw/pci-bridge/pcie_pci_bridge.c |  3 +-
> >   hw/pci/pci.c                    | 93 ++++++++++++++++++++++++++++++++-
> >   hw/pci/trace-events             |  2 +
> >   hw/vfio/pci.c                   | 34 ++++++------
> >   hw/vfio/pci.h                   |  1 -
> >   hw/virtio/virtio-pci.c          | 11 ++--
> >   include/hw/pci/pci.h            |  3 ++
> >   include/hw/pci/pci_device.h     |  3 ++
> >   include/hw/pci/pcie.h           |  2 -
> >   13 files changed, 127 insertions(+), 38 deletions(-)
> > 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing
  2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
                   ` (5 preceding siblings ...)
  2025-03-04 12:18 ` [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Cédric Le Goater
@ 2025-03-05 17:42 ` Cédric Le Goater
  6 siblings, 0 replies; 12+ messages in thread
From: Cédric Le Goater @ 2025-03-05 17:42 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: eric.auger.pro, eric.auger, zhenzhong.duan, mst, marcel.apfelbaum

On 2/25/25 22:52, Alex Williamson wrote:
> v2:
> 
> Eric noted in v1 that one of the drivers had a redundant wmask setting
> since pci_pm_init() enabled writes to the power state field.  This was
> added because vfio-pci was not setting wmask for this capability but
> is allowing writes to the PM state field through to the device.  For
> vfio-pci, QEMU emulated config space is rather secondary to the config
> space through vfio.
> 
> It turns out therefore, that vfio-pci is nearly unique in not already
> managing the wmask of the PM capability state and if we embrace that
> it's the pci_pm_init() caller's responsibility to manage the remaining
> contents and write-access of the capability, then I think we also
> solve the question of migration compatibility.  The new infrastructure
> here is not changing whether any fields were previously writable, it's
> only effecting a mapping change based on the value found there.
> 
> This requires only a slight change to patch 1/, removing setting of
> the wmask, but commit log is also updated and comments added.  I also
> made the bad transition trace a little more obvious given Eric's
> comments.  Patch 2/ is also updated so that vfio-pci effects the wmask
> change locally.  The couple drivers that don't currently update wmask
> simply don't get this new BAR unmapped when not in D0 behavior.
> 
> Incorporated reviews for the unmodified patches.  Please re-review and
> report any noted issues.  Thanks,
> 
> Alex
> 
> v1:
> 
> https://lore.kernel.org/all/20250220224918.2520417-1-alex.williamson@redhat.com/
> 
> Eric recently identified an issue[1] where during graceful shutdown
> of a VM in a vIOMMU configuration, the guest driver places the device
> into the D3 power state, the vIOMMU is then disabled, triggering an
> AddressSpace update.  The device BARs are still mapped into the AS,
> but the vfio host driver refuses to DMA map the MMIO space due to the
> device power state.
> 
> The proposed solution in [1] was to skip mappings based on the
> device power state.  Here we take a different approach.  The PCI spec
> defines that devices in D1/2/3 power state should respond only to
> configuration and message requests and all other requests should be
> handled as an Unsupported Request.  In other words, the memory and
> IO BARs are not accessible except when the device is in the D0 power
> state.
> 
> To emulate this behavior, we can factor the device power state into
> the mapping state of the device BARs.  Therefore the BAR is marked
> as unmapped if either the respective command register enable bit is
> clear or the device is not in the D0 power state.
> 
> In order to implement this, the PowerState field of the PMCSR
> register becomes writable, which allows the device to appear in
> lower power states.  This also therefore implements D3 support
> (insofar as the BAR behavior) for all devices implementing the PM
> capability.  The PCI spec requires D3 support.
> 
> An aspect that needs attention here is whether this change in the
> wmask and PMCSR bits becomes a problem for migration, and how we
> might solve it.  For a guest migrating old->new, the device would
> always be in the D0 power state, but the register becomes writable.
> In the opposite direction, is it possible that a device could
> migrate in a low power state and be stuck there since the bits are
> read-only in old QEMU?  Do we need an option for this behavior and a
> machine state bump, or are there alternatives?
> 
> Thanks,
> Alex
> 
> [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/
> 
> 
> Alex Williamson (5):
>    hw/pci: Basic support for PCI power management
>    pci: Use PCI PM capability initializer
>    vfio/pci: Delete local pm_cap
>    pcie, virtio: Remove redundant pm_cap
>    hw/vfio/pci: Re-order pre-reset
> 
>   hw/net/e1000e.c                 |  3 +-
>   hw/net/eepro100.c               |  4 +-
>   hw/net/igb.c                    |  3 +-
>   hw/nvme/ctrl.c                  |  3 +-
>   hw/pci-bridge/pcie_pci_bridge.c |  3 +-
>   hw/pci/pci.c                    | 93 ++++++++++++++++++++++++++++++++-
>   hw/pci/trace-events             |  2 +
>   hw/vfio/pci.c                   | 34 ++++++------
>   hw/vfio/pci.h                   |  1 -
>   hw/virtio/virtio-pci.c          | 11 ++--
>   include/hw/pci/pci.h            |  3 ++
>   include/hw/pci/pci_device.h     |  3 ++
>   include/hw/pci/pcie.h           |  2 -
>   13 files changed, 127 insertions(+), 38 deletions(-)
> 


Applied to vfio-next.

Thanks,

C.




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-03-05 17:43 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-25 21:52 [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Alex Williamson
2025-02-25 21:52 ` [PATCH v2 1/5] hw/pci: Basic support for PCI power management Alex Williamson
2025-02-27  8:16   ` Eric Auger
2025-02-25 21:52 ` [PATCH v2 2/5] pci: Use PCI PM capability initializer Alex Williamson
2025-02-27  7:23   ` Akihiko Odaki
2025-02-27  8:17   ` Eric Auger
2025-02-25 21:52 ` [PATCH v2 3/5] vfio/pci: Delete local pm_cap Alex Williamson
2025-02-25 21:52 ` [PATCH v2 4/5] pcie, virtio: Remove redundant pm_cap Alex Williamson
2025-02-25 21:52 ` [PATCH v2 5/5] hw/vfio/pci: Re-order pre-reset Alex Williamson
2025-03-04 12:18 ` [PATCH v2 0/5] PCI: Implement basic PCI PM capability backing Cédric Le Goater
2025-03-04 17:14   ` Michael S. Tsirkin
2025-03-05 17:42 ` Cédric Le Goater

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).