* [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs
@ 2026-05-28 8:31 Srirangan Madhavan
2026-05-28 8:31 ` [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders Srirangan Madhavan
` (10 more replies)
0 siblings, 11 replies; 32+ messages in thread
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Hi folks!
This patch series introduces support for the CXL Reset method for CXL
Type 2 devices, implementing the reset procedure outlined in the CXL
Specification r3.2 [1], Sections 8.1.3, 9.6, and 9.7.
The userspace ABI is a write-only cxl_reset attribute under the CXL
memdev device:
/sys/bus/cxl/devices/memX/cxl_reset
The memdev is the userspace handle, while the implementation coordinates
the target PCI function, affected sibling PCI functions, active CXL
memdevs, and any CXL regions reachable through those memdevs.
v6 changes (from v5 [2]):
- Rebased on the current CXL tree used for v7.1-rc4 development.
- Move the ABI from /sys/bus/pci/devices/.../cxl_reset to
/sys/bus/cxl/devices/memX/cxl_reset.
- Use the memdev as the userspace handle while limiting reset
orchestration to the CXL device's reset scope.
- Reduce the earlier PCI/CXL save/restore series [3] to a single CXL HDM
decoder restore/commit helper patch, included here as patch 1.
- Do not offline or hot-remove memory as part of reset. Return -EBUSY
if an affected CXL region is online as System RAM or has an active
region driver bound.
- Add reset-idle validation and CPU cache invalidation for affected CXL
regions.
- Add CXL sibling PCI function discovery using the Non-CXL Function Map
DVSEC and CXL.cache/CXL.mem capability bits.
- Coordinate PCI save/disable/restore and IOMMU reset prepare/done for
the target and affected sibling functions.
- Add CXL DVSEC reset sequencing, including CXL.cache disable,
writeback-invalidate, a minimum 100ms quiet period, reset-complete
polling, and Reset Error reporting.
- Track affected memdevs, lock active memdevs across reset, restore and
commit decoder state, re-enable CXL.mem, and wait for media ready
after reset.
- Cache reset capability at memdev registration time for sysfs
visibility.
- Document the reset scope, that Memory Clear is not requested, and the
-EBUSY behavior for active CXL regions.
Motivation:
-----------
- As support for Type 2 devices is being introduced, more devices need a
CXL-specific reset mechanism beyond bus-wide PCI reset methods.
- FLR does not affect CXL.cache or CXL.mem protocol state, making CXL
Reset the appropriate mechanism for cases where those protocols must
be reset.
- The CXL specification highlights use cases such as function rebinding
and error recovery where CXL Reset is explicitly required.
Change Description:
-------------------
Patch 1: cxl/hdm: Add helpers to restore and commit memdev decoders
- Restore endpoint decoder programming from CXL core's cached decoder
objects while keeping CXL.mem disabled.
- Commit restored HDM decoders as a separate step so reset orchestration
can re-enable CXL.mem only after safety checks complete.
Patch 2: PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
- Export PCI reset lifecycle helpers so CXL reset orchestration can save,
disable, restore, and invoke reset callbacks for affected functions.
Patch 3: cxl: Add reset-idle and cache flush helpers
- Collect CXL regions affected by a memdev reset.
- Fail reset if affected regions are not idle.
- Invalidate CPU caches for each affected region once.
Patch 4: PCI/CXL: Add sibling function coordination for reset
- Identify CXL.cache/CXL.mem sibling functions in the reset scope.
- Use the Non-CXL Function Map DVSEC to exclude non-CXL functions.
- Save, disable, restore, and unlock affected PCI sibling functions.
Patch 5: cxl/pci: Add CXL DVSEC reset helper
- Execute CXL Reset through the CXL Device DVSEC.
- Disable CXL.cache and request writeback-invalidate where supported.
- Enforce the post-reset quiet period and poll for reset completion.
- Block and restore IOMMU traffic while reset is active.
Patch 6: cxl/pci: Track memdevs affected by CXL reset
- Track the target memdev and any sibling-function memdevs affected by
reset.
- Revalidate and lock active memdevs before reset proceeds.
Patch 7: cxl/pci: Orchestrate CXL reset for affected memdevs
- Coordinate region validation, CPU cache invalidation, PCI function
preparation, DVSEC reset, decoder restore and commit, CXL.mem enable,
and media-ready wait.
Patch 8: cxl/memdev: Add cxl_reset sysfs attribute
- Expose /sys/bus/cxl/devices/memX/cxl_reset.
- Only make the attribute visible when the underlying PCI function is
Type 2 and reset capable.
- Write a boolean true value, such as "1" or "true", to trigger reset.
Patch 9: Documentation/ABI: Document CXL memdev cxl_reset
- Document the new memdev sysfs ABI, reset scope, Memory Clear behavior,
and idle-region requirement.
The CPU cache invalidation step depends on
cpu_cache_invalidate_memregion() support for the affected address ranges.
If no provider is available, reset fails with -ENXIO before hardware
reset is requested, unless CONFIG_CXL_REGION_INVALIDATION_TEST is enabled.
Command line to test CXL reset on a capable memdev:
echo 1 > /sys/bus/cxl/devices/memX/cxl_reset
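The same trigger can be issued from a small C helper, which makes it easier to see the errno that the write returns (for example -EBUSY when an affected region is still active). This is only a sketch: the function name cxl_reset_trigger() is hypothetical and not part of the series, and the only thing taken from the cover letter is the attribute path and the "1" value.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write "1" to a cxl_reset attribute (hypothetical helper, not from the
 * series). Returns 0 on success or a negative errno value reported by
 * open(), write(), or close().
 */
static int cxl_reset_trigger(const char *attr_path)
{
	static const char val[] = "1\n";
	const size_t len = sizeof(val) - 1;
	ssize_t n;
	int fd, rc = 0;

	assert(attr_path != NULL);

	/* The attribute is write-only, so open it without read access. */
	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, len);
	if (n < 0)
		rc = -errno;		/* e.g. -EBUSY for an active region */
	else if ((size_t)n != len)
		rc = -EIO;		/* short write, treat as an I/O error */

	if (close(fd) < 0 && !rc)
		rc = -errno;

	return rc;
}
```

A caller would pass the full attribute path, such as /sys/bus/cxl/devices/memX/cxl_reset with X replaced by the memdev id, and report the negative return value via strerror().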
Basic CXL DVSEC reset testing was done on a CXL Type 2 device. The reset
sequence completed successfully and ResetComplete was observed. Full
memdev/region integration testing is still in progress.
References:
[1] https://computeexpresslink.org/wp-content/uploads/2024/12/CXL_3.2-Spec-Announcement_FINAL-1.pdf
[2] https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/
[3] https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/
Srirangan Madhavan (9):
cxl/hdm: Add helpers to restore and commit memdev decoders
PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
cxl: Add reset-idle and cache flush helpers
PCI/CXL: Add sibling function coordination for reset
cxl/pci: Add CXL DVSEC reset helper
cxl/pci: Track memdevs affected by CXL reset
cxl/pci: Orchestrate CXL reset for affected memdevs
cxl/memdev: Add cxl_reset sysfs attribute
Documentation/ABI: Document CXL memdev cxl_reset
Documentation/ABI/testing/sysfs-bus-cxl | 28 +
drivers/cxl/core/hdm.c | 318 ++++++-
drivers/cxl/core/memdev.c | 30 +
drivers/cxl/core/pci.c | 1140 +++++++++++++++++++++++
drivers/cxl/cxl.h | 5 +
drivers/cxl/cxlmem.h | 2 +
drivers/pci/pci.c | 22 +-
include/linux/pci.h | 2 +
include/uapi/linux/pci_regs.h | 15 +
9 files changed, 1557 insertions(+), 5 deletions(-)
base-commit: abb3c0de119032f4c0c81177884a3bb0a133e6ca
--
2.43.0
* [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Add helpers to restore endpoint decoder programming for a CXL memdev from
CXL core's cached decoder objects, then commit it as a distinct step.
Callers are expected to have established reset safety and to hold
cxl_rwsem.region for write.
cxl_restore_memdev_decoders() restores programmable decoder state while
keeping traffic disabled. For HDM-backed endpoints it programs enabled
endpoint decoder fields without COMMIT, keeps the HDM Decoder Capability
disabled, and mirrors matching endpoint DVSEC ranges where possible. For
endpoints without HDM decoder registers, it restores the legacy DVSEC
ranges that model endpoint decode.
cxl_commit_memdev_decoders() enables the HDM Decoder Capability and
commits enabled, unlocked endpoint decoders after safety checks pass. It
sets COMMIT only after decoder fields have been restored, does not
re-lock decoders, and does not set DVSEC MEM_ENABLE.
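The two-phase contract described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: the model_* names and struct layouts are made up, the bit positions are illustrative only, and the "hardware" acknowledges COMMIT immediately instead of being polled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative control bits for a modeled decoder; not the kernel's macros. */
#define MODEL_CTRL_LOCK		(1u << 8)
#define MODEL_CTRL_COMMIT	(1u << 9)
#define MODEL_CTRL_COMMITTED	(1u << 10)
#define MODEL_CTRL_COMMIT_ERR	(1u << 11)

struct model_decoder {		/* stand-in for decoder registers */
	uint64_t base, size;
	uint32_t ctrl;
};

struct cached_decoder {		/* software copy that survives the reset */
	uint64_t base, size;
	bool enabled;
};

/* Phase 1: reprogram fields from the cache and never set COMMIT. */
static void model_restore(struct model_decoder *hw,
			  const struct cached_decoder *c)
{
	/* Disabled or locked decoders are left untouched. */
	if (!c->enabled || (hw->ctrl & MODEL_CTRL_LOCK))
		return;

	hw->base = c->base;
	hw->size = c->size;
	hw->ctrl &= ~(MODEL_CTRL_COMMIT | MODEL_CTRL_COMMITTED |
		      MODEL_CTRL_COMMIT_ERR);
}

/* Phase 2: set COMMIT only after the caller's safety checks have passed. */
static int model_commit(struct model_decoder *hw,
			const struct cached_decoder *c)
{
	if (!c->enabled ||
	    (hw->ctrl & (MODEL_CTRL_LOCK | MODEL_CTRL_COMMITTED)))
		return 0;

	hw->ctrl |= MODEL_CTRL_COMMIT;
	if (hw->size == 0) {	/* stand-in for a hardware commit error */
		hw->ctrl |= MODEL_CTRL_COMMIT_ERR;
		return -1;
	}
	hw->ctrl |= MODEL_CTRL_COMMITTED;
	return 0;
}
```

Keeping the phases separate means a caller can abort between them and leave the decoder programmed but inactive, which is the property the reset orchestration relies on.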
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/hdm.c | 318 ++++++++++++++++++++++++++++++++++++++++-
drivers/cxl/cxl.h | 2 +
2 files changed, 317 insertions(+), 3 deletions(-)
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0c80b76a5f9b..f7af1041a9fc 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -679,7 +679,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
}
-static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
+static int cxld_set_interleave_fields(struct cxl_decoder *cxld, u32 *ctrl)
{
u16 eig;
u8 eiw;
@@ -690,14 +690,22 @@ static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
*/
if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
"invalid interleave_ways: %d\n", cxld->interleave_ways))
- return;
+ return -EINVAL;
if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
"invalid interleave_granularity: %d\n",
cxld->interleave_granularity))
- return;
+ return -EINVAL;
u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
+ return 0;
+}
+
+static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
+{
+ if (cxld_set_interleave_fields(cxld, ctrl))
+ return;
+
*ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
}
@@ -927,6 +935,310 @@ static void cxl_decoder_reset(struct cxl_decoder *cxld)
}
}
+static int cxl_restore_dvsec_range(struct cxl_memdev *cxlmd,
+ struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ struct cxl_decoder *cxld = &cxled->cxld;
+ struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+ u64 base = cxld->hpa_range.start;
+ u64 size = range_len(&cxld->hpa_range);
+ u32 lo;
+ int dvsec = cxlds->cxl_dvsec;
+ int id = cxld->id;
+ int rc;
+
+ if (!dvsec)
+ return 0;
+
+ if (id >= CXL_DVSEC_RANGE_MAX)
+ return 0;
+
+ rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_HIGH(id),
+ upper_32_bits(base));
+ if (rc)
+ return rc;
+
+ rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
+ &lo);
+ if (rc)
+ return rc;
+ lo &= ~PCI_DVSEC_CXL_MEM_BASE_LOW;
+ lo |= lower_32_bits(base) & PCI_DVSEC_CXL_MEM_BASE_LOW;
+
+ rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
+ lo);
+ if (rc)
+ return rc;
+
+ rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_HIGH(id),
+ upper_32_bits(size));
+ if (rc)
+ return rc;
+
+ rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
+ &lo);
+ if (rc)
+ return rc;
+
+ /*
+ * Preserve MEM_INFO_VALID / MEM_ACTIVE and any reserved bits while
+ * restoring only the programmable size bits.
+ */
+ lo &= ~PCI_DVSEC_CXL_MEM_SIZE_LOW;
+ lo |= lower_32_bits(size) & PCI_DVSEC_CXL_MEM_SIZE_LOW;
+
+ return pci_write_config_dword(pdev,
+ dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
+ lo);
+}
+
+static int cxl_restore_hdm_decoder(struct cxl_hdm *cxlhdm,
+ struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_decoder *cxld = &cxled->cxld;
+ void __iomem *hdm;
+ u64 base, size, skip;
+ u32 ctrl;
+ int id;
+
+ id = cxld->id;
+ hdm = cxlhdm->regs.hdm_decoder;
+ ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+ if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
+ return 0;
+
+ base = cxld->hpa_range.start;
+ size = range_len(&cxld->hpa_range);
+ skip = cxled->skip;
+
+ ctrl &= ~(CXL_HDM_DECODER0_CTRL_LOCK |
+ CXL_HDM_DECODER0_CTRL_COMMIT |
+ CXL_HDM_DECODER0_CTRL_COMMITTED |
+ CXL_HDM_DECODER0_CTRL_COMMIT_ERROR);
+ if (cxld_set_interleave_fields(cxld, &ctrl))
+ return -EINVAL;
+ cxld_set_type(cxld, &ctrl);
+
+ /* Preserve setup_hw_decoder() programming order, without COMMIT. */
+ writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
+ writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
+ writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
+ writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
+ writel(upper_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_HIGH(id));
+ writel(lower_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_LOW(id));
+ wmb();
+ writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+
+ return 0;
+}
+
+struct cxl_restore_ctx {
+ struct cxl_memdev *cxlmd;
+ struct cxl_hdm *cxlhdm;
+};
+
+static int cxl_restore_decoder(struct device *dev, void *data)
+{
+ struct cxl_restore_ctx *ctx = data;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_decoder *cxld;
+ int rc;
+
+ if (!is_endpoint_decoder(dev))
+ return 0;
+
+ cxled = to_cxl_endpoint_decoder(dev);
+ cxld = &cxled->cxld;
+ if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
+ return 0;
+
+ if (ctx->cxlhdm->regs.hdm_decoder) {
+ if (cxld->id >= ctx->cxlhdm->decoder_count)
+ return -EINVAL;
+
+ rc = cxl_restore_hdm_decoder(ctx->cxlhdm, cxled);
+ if (rc)
+ return rc;
+ }
+
+ return cxl_restore_dvsec_range(ctx->cxlmd, cxled);
+}
+
+static int cxl_restore_decoders(struct cxl_memdev *cxlmd, struct cxl_hdm *cxlhdm)
+{
+ struct cxl_port *port = cxlhdm->port;
+ void __iomem *hdm = cxlhdm->regs.hdm_decoder;
+ struct cxl_restore_ctx ctx = {
+ .cxlmd = cxlmd,
+ .cxlhdm = cxlhdm,
+ };
+ u32 global_ctrl;
+
+ if (hdm) {
+ global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
+ writel(global_ctrl & ~CXL_HDM_DECODER_ENABLE,
+ hdm + CXL_HDM_DECODER_CTRL_OFFSET);
+ }
+
+ return device_for_each_child(&port->dev, &ctx, cxl_restore_decoder);
+}
+
+/**
+ * cxl_restore_memdev_decoders - Restore endpoint decoder programming
+ * @cxlmd: CXL memdev whose endpoint decoders need to be restored
+ *
+ * Restore only programmable decoder state from CXL core's cached decoder
+ * objects. For endpoints with HDM decoder registers, program the HDM decoder
+ * fields and mirror decoder ids representable by CXL_DVSEC_RANGE_MAX into the
+ * DVSEC range registers when present. For endpoints without HDM decoder
+ * registers, restore DVSEC range registers only.
+ *
+ * This helper leaves CXL.mem disabled: it does not commit HDM decoders, enable
+ * the HDM Decoder Capability, set PCI_DVSEC_CXL_MEM_ENABLE, or restore
+ * unrelated DVSEC CTRL, CTRL2, LOCK, MEM_ENABLE, or other control state.
+ * Callers must perform final commit/resume steps only after reset safety checks
+ * pass.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd)
+{
+ struct cxl_port *endpoint = cxlmd->endpoint;
+ struct cxl_hdm *cxlhdm;
+ int rc;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ if (!endpoint)
+ return -ENODEV;
+
+ cxlhdm = dev_get_drvdata(&endpoint->dev);
+ if (!cxlhdm)
+ return -ENODEV;
+
+ scoped_guard(rwsem_read, &cxl_rwsem.dpa)
+ rc = cxl_restore_decoders(cxlmd, cxlhdm);
+ return rc;
+}
+
+static int cxl_commit_restored_hdm_decoder(struct cxl_hdm *cxlhdm,
+ struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_decoder *cxld = &cxled->cxld;
+ void __iomem *hdm = cxlhdm->regs.hdm_decoder;
+ u32 ctrl;
+ int id;
+
+ if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
+ return 0;
+
+ if (!hdm)
+ return 0;
+
+ id = cxld->id;
+ if (id >= cxlhdm->decoder_count)
+ return -EINVAL;
+
+ /*
+ * cxl_restore_hdm_decoder() programmed the decoder fields first. This
+ * control register write sets COMMIT as the final programming step.
+ */
+ ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+ if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
+ return 0;
+
+ if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
+ return 0;
+
+ ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
+ writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+
+ return cxld_await_commit(hdm, id);
+}
+
+struct cxl_commit_decoder_ctx {
+ struct cxl_hdm *cxlhdm;
+ int id;
+};
+
+static int cxl_commit_restored_decoder_by_id(struct device *dev, void *data)
+{
+ struct cxl_commit_decoder_ctx *ctx = data;
+ struct cxl_endpoint_decoder *cxled;
+ int rc;
+
+ if (!is_endpoint_decoder(dev))
+ return 0;
+
+ cxled = to_cxl_endpoint_decoder(dev);
+ if (cxled->cxld.id != ctx->id)
+ return 0;
+
+ rc = cxl_commit_restored_hdm_decoder(ctx->cxlhdm, cxled);
+ return rc ?: 1;
+}
+
+/**
+ * cxl_commit_memdev_decoders - Commit restored endpoint decoder programming
+ * @cxlmd: CXL memdev whose endpoint decoders need to be committed
+ *
+ * Resume endpoint decoding after cxl_restore_memdev_decoders() has restored
+ * programmable decoder fields. For endpoints with HDM decoder registers, enable
+ * the HDM Decoder Capability and commit enabled, unlocked endpoint decoders.
+ * Locked decoders are left to their current hardware/firmware-owned state.
+ *
+ * This helper does not set PCI_DVSEC_CXL_MEM_ENABLE. Callers must enable
+ * CXL.mem only after all reset safety checks and decoder restore/commit steps
+ * have completed.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd)
+{
+ struct cxl_port *endpoint = cxlmd->endpoint;
+ struct cxl_hdm *cxlhdm;
+ void __iomem *hdm;
+ u32 global_ctrl;
+ int i, rc;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ if (!endpoint)
+ return -ENODEV;
+
+ cxlhdm = dev_get_drvdata(&endpoint->dev);
+ if (!cxlhdm)
+ return -ENODEV;
+
+ hdm = cxlhdm->regs.hdm_decoder;
+ if (!hdm)
+ return 0;
+
+ global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
+ writel(global_ctrl | CXL_HDM_DECODER_ENABLE,
+ hdm + CXL_HDM_DECODER_CTRL_OFFSET);
+
+ for (i = 0; i < cxlhdm->decoder_count; i++) {
+ struct cxl_commit_decoder_ctx ctx = {
+ .cxlhdm = cxlhdm,
+ .id = i,
+ };
+
+ /*
+ * Per CXL Spec 3.1 8.2.4.20.12 software must commit decoders
+ * in HPA order. Region setup already enforces that ordering by
+ * decoder id, so restore commits follow ascending id order.
+ */
+ rc = device_for_each_child(&endpoint->dev, &ctx,
+ cxl_commit_restored_decoder_by_id);
+ if (rc < 0)
+ return rc;
+ }
+
+ return 0;
+}
+
static int cxl_setup_hdm_decoder_from_dvsec(
struct cxl_port *port, struct cxl_decoder *cxld, u64 *dpa_base,
int which, struct cxl_endpoint_dvsec_info *info)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 1297594beaec..b51b1e9d6400 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -794,6 +794,8 @@ int cxl_port_setup_regs(struct cxl_port *port,
struct cxl_dev_state;
int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
struct cxl_endpoint_dvsec_info *info);
+int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
+int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
bool is_cxl_region(struct device *dev);
--
2.43.0
* [PATCH v6 2/9] PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Export pci_dev_save_and_disable() and pci_dev_restore() so CXL reset
orchestration can reuse the PCI core reset lifecycle for non-standard
reset flows.
These helpers invoke driver reset_prepare/reset_done callbacks, save and
restore PCI config state, and disable the device while the caller holds
the device lock.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/pci/pci.c | 22 ++++++++++++++++++++--
include/linux/pci.h | 2 ++
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d34266651ad0..75d2f4074750 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5003,7 +5003,15 @@ void pci_dev_unlock(struct pci_dev *dev)
}
EXPORT_SYMBOL_GPL(pci_dev_unlock);
-static void pci_dev_save_and_disable(struct pci_dev *dev)
+/**
+ * pci_dev_save_and_disable - Save device state and disable it
+ * @dev: PCI device to save and disable
+ *
+ * Invoke the driver's reset_prepare() callback if present, save the PCI
+ * configuration state, and disable the device by writing the Command
+ * register with only INTx Disable set. The caller must hold the device lock.
+ */
+void pci_dev_save_and_disable(struct pci_dev *dev)
{
const struct pci_error_handlers *err_handler =
dev->driver ? dev->driver->err_handler : NULL;
@@ -5036,8 +5044,17 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
*/
pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
}
+EXPORT_SYMBOL_GPL(pci_dev_save_and_disable);
-static void pci_dev_restore(struct pci_dev *dev)
+/**
+ * pci_dev_restore - Restore device state after reset
+ * @dev: PCI device to restore
+ *
+ * Restore the saved PCI configuration state and invoke the driver's
+ * reset_done() callback if present. The device lock must be held by the
+ * caller.
+ */
+void pci_dev_restore(struct pci_dev *dev)
{
const struct pci_error_handlers *err_handler =
dev->driver ? dev->driver->err_handler : NULL;
@@ -5054,6 +5071,7 @@ static void pci_dev_restore(struct pci_dev *dev)
else if (dev->driver)
pci_warn(dev, "reset done");
}
+EXPORT_SYMBOL_GPL(pci_dev_restore);
/* dev->reset_methods[] is a 0-terminated list of indices into this array */
const struct pci_reset_fn_method pci_reset_fn_methods[] = {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..d6303e16e11b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2012,6 +2012,8 @@ void pci_dev_lock(struct pci_dev *dev);
int pci_dev_trylock(struct pci_dev *dev);
void pci_dev_unlock(struct pci_dev *dev);
DEFINE_GUARD(pci_dev, struct pci_dev *, pci_dev_lock(_T), pci_dev_unlock(_T))
+void pci_dev_save_and_disable(struct pci_dev *dev);
+void pci_dev_restore(struct pci_dev *dev);
/*
* PCI domain support. Sometimes called PCI segment (eg by ACPI),
--
2.43.0
* [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Add helpers to collect the CXL regions affected by a memdev reset,
verify that those regions are idle, and invalidate CPU caches for the
affected address ranges before reset.
A memdev can participate in an interleaved region through multiple
endpoint decoders. Track affected regions in a temporary xarray so each
region is checked and cache-invalidated once per reset operation.
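The de-duplication idea can be shown with a userspace stand-in for the xarray. This is a sketch only: the patch uses xa_insert() keyed by the region pointer and treats -EBUSY as "already tracked", while the fixed-size array and the model_* names below are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_REGIONS 16	/* arbitrary bound for the sketch */

struct model_region_set {
	const void *regions[MODEL_MAX_REGIONS];
	size_t count;
};

/*
 * Track a region once. A duplicate is reported as success, mirroring how
 * the patch maps the "already present" result to 0.
 */
static int model_region_set_add(struct model_region_set *set,
				const void *region)
{
	size_t i;

	/* Decoders that are not attached to a region contribute nothing. */
	if (!region)
		return 0;

	for (i = 0; i < set->count; i++)
		if (set->regions[i] == region)
			return 0;

	if (set->count == MODEL_MAX_REGIONS)
		return -ENOMEM;

	set->regions[set->count++] = region;
	return 0;
}
```

With this shape, two endpoint decoders that point at the same interleaved region produce a single entry, so the idle check and the cache invalidation each run once for that region.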
These helpers prepare the CXL.mem data path for reset. The actual reset
orchestration and decoder restore flow are added separately.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/pci.c | 170 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 170 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index d1f487b3d809..318744695f62 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -4,9 +4,11 @@
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/device.h>
#include <linux/delay.h>
+#include <linux/memregion.h>
#include <linux/pci.h>
#include <linux/pci-doe.h>
#include <linux/aer.h>
+#include <linux/xarray.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -926,3 +928,171 @@ int cxl_port_get_possible_dports(struct cxl_port *port)
return ctx.count;
}
+
+static int cxl_reset_system_ram_found(struct resource *res, void *data)
+{
+ return 1;
+}
+
+struct cxl_reset_region_context {
+ struct xarray regions;
+};
+
+static void __maybe_unused
+cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
+{
+ xa_init(&ctx->regions);
+}
+
+static void __maybe_unused
+cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
+{
+ xa_destroy(&ctx->regions);
+}
+
+static int cxl_reset_add_region(struct cxl_reset_region_context *ctx,
+ struct cxl_region *cxlr)
+{
+ int rc;
+
+ if (!cxlr || !cxlr->params.res)
+ return 0;
+
+ rc = xa_insert(&ctx->regions, (unsigned long)cxlr, cxlr, GFP_KERNEL);
+
+ /* A region may be referenced by multiple affected endpoint decoders. */
+ return rc == -EBUSY ? 0 : rc;
+}
+
+static int cxl_reset_collect_region(struct device *dev, void *data)
+{
+ struct cxl_reset_region_context *ctx = data;
+ struct cxl_endpoint_decoder *cxled;
+
+ if (!is_endpoint_decoder(dev))
+ return 0;
+
+ cxled = to_cxl_endpoint_decoder(dev);
+ return cxl_reset_add_region(ctx, cxled->cxld.region);
+}
+
+static int __maybe_unused
+cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
+ struct cxl_memdev *cxlmd)
+{
+ struct cxl_port *endpoint;
+
+ if (!cxlmd || !cxlmd->cxlds)
+ return -ENODEV;
+
+ endpoint = cxlmd->endpoint;
+ if (!endpoint)
+ return 0;
+
+ return device_for_each_child(&endpoint->dev, ctx,
+ cxl_reset_collect_region);
+}
+
+static bool cxl_reset_region_has_system_ram(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int rc;
+
+ if (!p->res)
+ return false;
+
+ rc = walk_iomem_res_desc(IORES_DESC_NONE,
+ IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+ p->res->start, p->res->end, NULL,
+ cxl_reset_system_ram_found);
+
+ return rc > 0;
+}
+
+static int cxl_reset_validate_region_idle(struct cxl_region *cxlr)
+{
+ struct resource *res = cxlr->params.res;
+ int rc = 0;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ if (cxl_reset_region_has_system_ram(cxlr)) {
+ dev_err(&cxlr->dev,
+ "Cannot reset while CXL memory is online as System RAM [%pr]\n",
+ res);
+ return -EBUSY;
+ }
+
+ if (!device_trylock(&cxlr->dev))
+ return -EAGAIN;
+
+ if (cxlr->dev.driver) {
+ dev_err(&cxlr->dev,
+ "Cannot reset while CXL region has an active driver\n");
+ rc = -EBUSY;
+ }
+
+ device_unlock(&cxlr->dev);
+ return rc;
+}
+
+static int __maybe_unused
+cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
+{
+ struct cxl_region *cxlr;
+ unsigned long index;
+ int rc;
+
+ xa_for_each(&ctx->regions, index, cxlr) {
+ rc = cxl_reset_validate_region_idle(cxlr);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
+static int cxl_reset_flush_region_cache(struct cxl_region *cxlr)
+{
+ struct resource *res = cxlr->params.res;
+ int rc;
+
+ if (!res)
+ return 0;
+
+ rc = cpu_cache_invalidate_memregion(res->start, resource_size(res));
+ if (rc)
+ dev_err(&cxlr->dev, "Failed to invalidate CPU cache [%pr]: %d\n",
+ res, rc);
+
+ return rc;
+}
+
+static int __maybe_unused
+cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
+{
+ struct cxl_region *cxlr;
+ unsigned long index;
+ int rc;
+
+ if (xa_empty(&ctx->regions))
+ return 0;
+
+ if (!cpu_cache_has_invalidate_memregion()) {
+ if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
+ pr_info_once(
+ "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
+ return 0;
+ }
+ pr_warn("Failed to synchronize CPU cache state\n");
+ return -ENXIO;
+ }
+
+ xa_for_each(&ctx->regions, index, cxlr) {
+ rc = cxl_reset_flush_region_cache(cxlr);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
--
2.43.0
* [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Add helpers to collect CXL sibling PCI functions affected by a CXL reset
and prepare them for reset by saving and disabling them. Restore those
siblings and drop their references when reset coordination completes.
Use the Non-CXL Function Map DVSEC to exclude non-CXL functions, and
filter remaining siblings to functions that advertise CXL.cache or
CXL.mem capability.
Use pci_dev_trylock() for sibling locking and unwind on contention or
allocation failure, so competing reset paths fail with an errno.
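The Function Map filter can be sketched in userspace as follows. This is an illustration under stated assumptions: the model_* names are hypothetical, the bit index for the non-ARI case simply mirrors what cxl_reset_is_cxl_sibling() in this patch computes and has not been checked against the specification here, and the same-bus check and the CXL.cache/CXL.mem capability filter are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAP_REGS 8	/* 8 x 32 bits = 256 function bits */

/* Split an 8-bit devfn into its device (slot) and function numbers. */
static unsigned int model_slot(uint8_t devfn) { return (devfn >> 3) & 0x1f; }
static unsigned int model_func(uint8_t devfn) { return devfn & 0x7; }

static bool model_test_bit(const uint32_t *map, unsigned int bit)
{
	return (map[bit / 32] & (1u << (bit % 32))) != 0;
}

/*
 * Return true when @sibling should be treated as a CXL function in the
 * reset scope of @target, given the Non-CXL Function Map bits.
 */
static bool model_is_cxl_sibling(uint8_t target, uint8_t sibling, bool ari,
				 const uint32_t *non_cxl_map)
{
	if (sibling == target)
		return false;

	/* With ARI, the whole devfn is the function number. */
	if (ari)
		return !model_test_bit(non_cxl_map, sibling);

	/* Without ARI, only functions of the same device are siblings. */
	if (model_slot(sibling) != model_slot(target))
		return false;

	return !model_test_bit(non_cxl_map,
			       model_func(sibling) * 32 + model_slot(sibling));
}
```

An all-zero map means no function is excluded, which matches the patch's fallback of treating every sibling as CXL when the map cannot be read.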
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/pci.c | 207 ++++++++++++++++++++++++++++++++++
include/uapi/linux/pci_regs.h | 2 +
2 files changed, 209 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 318744695f62..01effbb4e7cd 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -1,9 +1,11 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
#include <linux/units.h>
+#include <linux/bitmap.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/device.h>
#include <linux/delay.h>
+#include <linux/iommu.h>
#include <linux/memregion.h>
#include <linux/pci.h>
#include <linux/pci-doe.h>
@@ -15,6 +17,10 @@
#include "core.h"
#include "trace.h"
+#define CXL_RESET_MAX_FUNCTIONS 256
+#define CXL_RESET_FUNCTION_MAP_REGS (CXL_RESET_MAX_FUNCTIONS / 32)
+#define CXL_RESET_SIBLINGS_INIT 8
+
/**
* DOC: cxl core pci
*
@@ -1096,3 +1102,204 @@ cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
return 0;
}
+
+struct cxl_reset_context {
+ struct pci_dev *target;
+ struct pci_dev **siblings;
+ int nr_siblings;
+ int sibling_capacity;
+ int nr_siblings_prepared;
+};
+
+struct cxl_reset_walk_ctx {
+ struct cxl_reset_context *ctx;
+ unsigned long *non_cxl_func_map;
+ int rc;
+};
+
+static void
+cxl_reset_read_non_cxl_func_map(struct pci_dev *pdev,
+ unsigned long *non_cxl_func_map)
+{
+ u32 map[CXL_RESET_FUNCTION_MAP_REGS] = {};
+ u16 dvsec;
+ int rc, i;
+
+ bitmap_zero(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_FUNCTION_MAP);
+ if (!dvsec)
+ return;
+
+ for (i = 0; i < CXL_RESET_FUNCTION_MAP_REGS; i++) {
+ rc = pci_read_config_dword(pdev,
+ dvsec + PCI_DVSEC_CXL_FUNCTION_MAP_REG +
+ i * sizeof(map[i]), &map[i]);
+ if (rc) {
+ pci_warn(pdev,
+ "failed to read CXL Function Map; treating all siblings as CXL: %d\n",
+ rc);
+ bitmap_zero(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
+ return;
+ }
+ }
+
+ bitmap_from_arr32(non_cxl_func_map, map, CXL_RESET_MAX_FUNCTIONS);
+}
+
+static bool cxl_reset_is_cxl_sibling(struct pci_dev *pdev,
+ struct pci_dev *sibling,
+ unsigned long *non_cxl_func_map)
+{
+ if (sibling == pdev || sibling->bus != pdev->bus)
+ return false;
+
+ if (pci_ari_enabled(pdev->bus))
+ return !test_bit(sibling->devfn, non_cxl_func_map);
+
+ if (PCI_SLOT(sibling->devfn) != PCI_SLOT(pdev->devfn))
+ return false;
+
+ return !test_bit(PCI_FUNC(sibling->devfn) * 32 +
+ PCI_SLOT(sibling->devfn), non_cxl_func_map);
+}
+
+static bool cxl_reset_has_cache_or_mem(struct pci_dev *pdev)
+{
+ u16 dvsec, cap;
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return false;
+
+ if (pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap))
+ return false;
+
+ return cap & (PCI_DVSEC_CXL_CACHE_CAPABLE | PCI_DVSEC_CXL_MEM_CAPABLE);
+}
+
+static int cxl_reset_add_sibling(struct cxl_reset_context *ctx,
+ struct pci_dev *sibling)
+{
+ struct pci_dev **siblings;
+ int capacity;
+
+ if (ctx->nr_siblings < ctx->sibling_capacity)
+ goto add;
+
+ capacity = ctx->sibling_capacity ? ctx->sibling_capacity * 2 :
+ CXL_RESET_SIBLINGS_INIT;
+ siblings = krealloc(ctx->siblings, capacity * sizeof(*siblings),
+ GFP_KERNEL);
+ if (!siblings)
+ return -ENOMEM;
+
+ ctx->siblings = siblings;
+ ctx->sibling_capacity = capacity;
+
+add:
+ pci_dev_get(sibling);
+ ctx->siblings[ctx->nr_siblings++] = sibling;
+ return 0;
+}
+
+static int cxl_reset_collect_sibling(struct pci_dev *sibling, void *data)
+{
+ struct cxl_reset_walk_ctx *wctx = data;
+ struct cxl_reset_context *ctx = wctx->ctx;
+ struct pci_dev *pdev = ctx->target;
+
+ if (!cxl_reset_is_cxl_sibling(pdev, sibling, wctx->non_cxl_func_map))
+ return 0;
+
+ if (!cxl_reset_has_cache_or_mem(sibling))
+ return 0;
+
+ wctx->rc = cxl_reset_add_sibling(ctx, sibling);
+ return wctx->rc;
+}
+
+static int cxl_reset_collect_siblings(struct cxl_reset_context *ctx)
+{
+ DECLARE_BITMAP(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
+ struct cxl_reset_walk_ctx wctx = {
+ .ctx = ctx,
+ .non_cxl_func_map = non_cxl_func_map,
+ };
+
+ cxl_reset_read_non_cxl_func_map(ctx->target, non_cxl_func_map);
+ pci_walk_bus(ctx->target->bus, cxl_reset_collect_sibling, &wctx);
+ return wctx.rc;
+}
+
+static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
+{
+ int i;
+
+ for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--) {
+ struct pci_dev *sibling = ctx->siblings[i];
+
+ pci_dev_reset_iommu_done(sibling);
+ pci_dev_restore(sibling);
+ pci_dev_unlock(sibling);
+ }
+
+ for (i = 0; i < ctx->nr_siblings; i++)
+ pci_dev_put(ctx->siblings[i]);
+
+ kfree(ctx->siblings);
+ ctx->siblings = NULL;
+ ctx->nr_siblings = 0;
+ ctx->sibling_capacity = 0;
+ ctx->nr_siblings_prepared = 0;
+}
+
+static int __maybe_unused
+cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
+{
+ int rc, i;
+
+ ctx->siblings = NULL;
+ ctx->nr_siblings = 0;
+ ctx->sibling_capacity = 0;
+ ctx->nr_siblings_prepared = 0;
+
+ rc = cxl_reset_collect_siblings(ctx);
+ if (rc)
+ goto err;
+
+ for (i = 0; i < ctx->nr_siblings; i++) {
+ struct pci_dev *sibling = ctx->siblings[i];
+
+ if (!pci_dev_trylock(sibling)) {
+ rc = -EAGAIN;
+ goto err;
+ }
+
+ pci_dev_save_and_disable(sibling);
+ rc = pci_dev_reset_iommu_prepare(sibling);
+ if (rc) {
+ pci_err(sibling,
+ "failed to block IOMMU for CXL reset: %d\n",
+ rc);
+ /*
+ * Undo save_and_disable() for this sibling. IOMMU
+ * prepare failed, so this sibling is not counted in
+ * nr_siblings_prepared and must not get iommu_done().
+ */
+ pci_dev_restore(sibling);
+ pci_dev_unlock(sibling);
+ goto err;
+ }
+
+ ctx->nr_siblings_prepared++;
+ }
+
+ return 0;
+
+err:
+ cxl_pci_functions_reset_done(ctx);
+ return rc;
+}
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 14f634ab9350..fa1fcd26af01 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1349,6 +1349,7 @@
/* CXL r4.0, 8.1.3: PCIe DVSEC for CXL Device */
#define PCI_DVSEC_CXL_DEVICE 0
#define PCI_DVSEC_CXL_CAP 0xA
+#define PCI_DVSEC_CXL_CACHE_CAPABLE _BITUL(0)
#define PCI_DVSEC_CXL_MEM_CAPABLE _BITUL(2)
#define PCI_DVSEC_CXL_HDM_COUNT __GENMASK(5, 4)
#define PCI_DVSEC_CXL_CTRL 0xC
@@ -1366,6 +1367,7 @@
/* CXL r4.0, 8.1.4: Non-CXL Function Map DVSEC */
#define PCI_DVSEC_CXL_FUNCTION_MAP 2
+#define PCI_DVSEC_CXL_FUNCTION_MAP_REG 0x0C
/* CXL r4.0, 8.1.5: Extensions DVSEC for Ports */
#define PCI_DVSEC_CXL_PORT 3
--
2.43.0
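
Note on the Non-CXL Function Map lookup in this patch: the bit index depends on whether ARI is enabled. The following is a minimal userspace sketch, not part of the patch, that mirrors the indexing used by cxl_reset_is_cxl_sibling(). The helper names (slot_of, func_of, map_bit_index, map_test_bit, is_cxl_function) and MAP_REGS are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAP_REGS 8 /* 8 x 32-bit registers = 256 bits, as in the patch */

/* Same slot/function split as the kernel's PCI_SLOT() and PCI_FUNC(). */
static unsigned int slot_of(uint8_t devfn) { return (devfn >> 3) & 0x1f; }
static unsigned int func_of(uint8_t devfn) { return devfn & 0x07; }

/* Bit index into the 256-bit map for a given devfn. */
static unsigned int map_bit_index(uint8_t devfn, bool ari)
{
	if (ari)
		return devfn; /* ARI: index is the 8-bit function number */

	/* Non-ARI: register selects the function, bit selects the device. */
	return func_of(devfn) * 32 + slot_of(devfn);
}

static bool map_test_bit(const uint32_t map[MAP_REGS], unsigned int bit)
{
	assert(bit < MAP_REGS * 32);
	return (map[bit / 32] >> (bit % 32)) & 1u;
}

/* A function is treated as CXL when its bit in the Non-CXL map is clear. */
static bool is_cxl_function(const uint32_t map[MAP_REGS], uint8_t devfn,
			    bool ari)
{
	return !map_test_bit(map, map_bit_index(devfn, ari));
}
```

With an all-zero map (the fallback the patch uses when the DVSEC is absent or a config read fails), every sibling is treated as CXL.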
* [PATCH v6 5/9] cxl/pci: Add CXL DVSEC reset helper
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (3 preceding siblings ...)
2026-05-28 8:31 ` [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset Srirangan Madhavan
@ 2026-05-28 8:31 ` Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-05-28 8:31 ` [PATCH v6 6/9] cxl/pci: Track memdevs affected by CXL reset Srirangan Madhavan
` (5 subsequent siblings)
10 siblings, 1 reply; 32+ messages in thread
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC (permalink / raw)
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Add a helper to execute CXL Reset through the CXL Device DVSEC. The
helper verifies reset capability, waits for pending PCIe transactions,
disables CXL.cache, optionally initiates cache writeback and invalidation,
and then starts CXL Reset through the DVSEC Control2 register.
Block IOMMU traffic while reset is active, then restore IOMMU
translations after reset completes.
Wait for the DVSEC reset timeout before checking reset completion, and
report reset error or timeout status from the DVSEC Status2 register. Add
the CXL Device DVSEC reset and cache control definitions needed by the
helper.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/pci.c | 185 ++++++++++++++++++++++++++++++++++
include/uapi/linux/pci_regs.h | 13 +++
2 files changed, 198 insertions(+)
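
Note on the reset wait described above: the wait time comes from the 3-bit reset timeout field of the DVSEC capability register, clamped to the table size and raised to a minimum quiet period. The following is a minimal userspace sketch of that decode, not part of the patch. The names reset_wait_ms, RST_TIMEOUT_SHIFT, RST_TIMEOUT_MASK and MIN_QUIET_MS are hypothetical; the table values and bit positions are copied from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define RST_TIMEOUT_SHIFT 8
#define RST_TIMEOUT_MASK  (0x7u << RST_TIMEOUT_SHIFT) /* bits 10:8 */
#define MIN_QUIET_MS      100u

/* Decode the capability word into the number of milliseconds to sleep. */
static uint32_t reset_wait_ms(uint16_t cap)
{
	static const uint32_t timeout_ms[] = { 10, 100, 1000, 10000, 100000 };
	const unsigned int n = sizeof(timeout_ms) / sizeof(timeout_ms[0]);
	unsigned int idx = (cap & RST_TIMEOUT_MASK) >> RST_TIMEOUT_SHIFT;
	uint32_t ms;

	assert(n > 0);

	/* Out-of-range encodings fall back to the longest timeout. */
	if (idx >= n)
		idx = n - 1;
	ms = timeout_ms[idx];

	/* Never wait less than the minimum quiet period. */
	return ms > MIN_QUIET_MS ? ms : MIN_QUIET_MS;
}
```

Under these assumptions, encoding 0 (10 ms) is raised to 100 ms, and encodings 5 through 7 are treated like encoding 4.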
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 01effbb4e7cd..1dd880f5a333 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -20,6 +20,9 @@
#define CXL_RESET_MAX_FUNCTIONS 256
#define CXL_RESET_FUNCTION_MAP_REGS (CXL_RESET_MAX_FUNCTIONS / 32)
#define CXL_RESET_SIBLINGS_INIT 8
+#define CXL_RESET_CACHE_WBI_POLL_US 100
+#define CXL_RESET_CACHE_WBI_TIMEOUT_US (100 * USEC_PER_MSEC)
+#define CXL_RESET_MIN_QUIET_MS 100
/**
* DOC: cxl core pci
@@ -1303,3 +1306,185 @@ cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
cxl_pci_functions_reset_done(ctx);
return rc;
}
+
+static int cxl_reset_update_ctrl2(struct pci_dev *pdev, int dvsec, u16 set,
+ u16 clear)
+{
+ u16 ctrl2;
+ int rc;
+
+ rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CTRL2, &ctrl2);
+ if (rc)
+ return rc;
+
+ ctrl2 &= ~clear;
+ ctrl2 |= set;
+
+ return pci_write_config_word(pdev, dvsec + PCI_DVSEC_CXL_CTRL2, ctrl2);
+}
+
+static int cxl_reset_wait_cache_inv(struct pci_dev *pdev, int dvsec)
+{
+ int remaining_us = CXL_RESET_CACHE_WBI_TIMEOUT_US;
+ u16 status2;
+ int rc;
+
+ do {
+ usleep_range(CXL_RESET_CACHE_WBI_POLL_US,
+ CXL_RESET_CACHE_WBI_POLL_US + 1);
+ remaining_us -= CXL_RESET_CACHE_WBI_POLL_US;
+
+ rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_STATUS2,
+ &status2);
+ if (rc)
+ return rc;
+
+ if (status2 & PCI_DVSEC_CXL_CACHE_INV)
+ return 0;
+ } while (remaining_us > 0);
+
+ pci_err(pdev, "CXL cache WB+I timed out\n");
+ return -ETIMEDOUT;
+}
+
+static int cxl_reset_enable_cache(struct pci_dev *pdev, int dvsec, u16 cap)
+{
+ if (!(cap & PCI_DVSEC_CXL_CACHE_CAPABLE))
+ return 0;
+
+ return cxl_reset_update_ctrl2(pdev, dvsec, 0,
+ PCI_DVSEC_CXL_DISABLE_CACHING);
+}
+
+static int cxl_reset_disable_cache(struct pci_dev *pdev, int dvsec, u16 cap)
+{
+ int rc;
+
+ if (!(cap & PCI_DVSEC_CXL_CACHE_CAPABLE))
+ return 0;
+
+ rc = cxl_reset_update_ctrl2(pdev, dvsec,
+ PCI_DVSEC_CXL_DISABLE_CACHING, 0);
+ if (rc)
+ return rc;
+
+ if (!(cap & PCI_DVSEC_CXL_CACHE_WBI_CAPABLE))
+ return 0;
+
+ rc = cxl_reset_update_ctrl2(pdev, dvsec,
+ PCI_DVSEC_CXL_INIT_CACHE_WBI, 0);
+ if (rc)
+ goto err_enable_cache;
+
+ rc = cxl_reset_wait_cache_inv(pdev, dvsec);
+ if (rc)
+ goto err_enable_cache;
+
+ return 0;
+
+err_enable_cache:
+ /*
+ * Best effort rollback: preserve the original WB+I failure even if
+ * re-enabling CXL.cache also fails.
+ */
+ cxl_reset_enable_cache(pdev, dvsec, cap);
+ return rc;
+}
+
+static int cxl_reset_wait_done(struct pci_dev *pdev, int dvsec, u16 cap)
+{
+ static const u32 reset_timeout_ms[] = { 10, 100, 1000, 10000, 100000 };
+ u32 timeout_ms;
+ u16 status2;
+ int rc, idx;
+
+ idx = FIELD_GET(PCI_DVSEC_CXL_RST_TIMEOUT, cap);
+ if (idx >= ARRAY_SIZE(reset_timeout_ms))
+ idx = ARRAY_SIZE(reset_timeout_ms) - 1;
+ timeout_ms = reset_timeout_ms[idx];
+
+ msleep(max_t(u32, timeout_ms, CXL_RESET_MIN_QUIET_MS));
+
+ rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_STATUS2,
+ &status2);
+ if (rc)
+ return rc;
+
+ if (status2 & PCI_DVSEC_CXL_RST_ERR) {
+ pci_err(pdev, "CXL reset error\n");
+ return -EIO;
+ }
+
+ if (!(status2 & PCI_DVSEC_CXL_RST_DONE)) {
+ pci_err(pdev, "CXL reset timed out\n");
+ return -ETIMEDOUT;
+ }
+
+ return 0;
+}
+
+static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
+{
+ int dvsec, rc;
+ u16 ctrl2_clear = 0;
+ u16 cap;
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return -ENODEV;
+
+ rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap);
+ if (rc)
+ return rc;
+
+ if (!(cap & PCI_DVSEC_CXL_RST_CAPABLE))
+ return -EOPNOTSUPP;
+
+ if (mem_clear && !(cap & PCI_DVSEC_CXL_RST_MEM_CLR_CAPABLE))
+ return -EOPNOTSUPP;
+
+ if (!pci_wait_for_pending_transaction(pdev))
+ pci_err(pdev, "timed out waiting for pending transactions\n");
+
+ rc = pci_dev_reset_iommu_prepare(pdev);
+ if (rc) {
+ pci_err(pdev, "failed to block IOMMU for CXL reset: %d\n",
+ rc);
+ return rc;
+ }
+
+ rc = cxl_reset_disable_cache(pdev, dvsec, cap);
+ if (rc)
+ goto out_iommu;
+ if (cap & PCI_DVSEC_CXL_CACHE_CAPABLE)
+ ctrl2_clear |= PCI_DVSEC_CXL_DISABLE_CACHING;
+
+ if (mem_clear) {
+ rc = cxl_reset_update_ctrl2(pdev, dvsec,
+ PCI_DVSEC_CXL_RST_MEM_CLR_EN, 0);
+ if (rc)
+ goto out_ctrl2;
+ ctrl2_clear |= PCI_DVSEC_CXL_RST_MEM_CLR_EN;
+ }
+
+ rc = cxl_reset_update_ctrl2(pdev, dvsec,
+ PCI_DVSEC_CXL_INIT_CXL_RST, 0);
+ if (rc)
+ goto out_ctrl2;
+
+ rc = cxl_reset_wait_done(pdev, dvsec, cap);
+ if (rc)
+ goto out_iommu;
+
+ rc = cxl_reset_update_ctrl2(pdev, dvsec, 0,
+ PCI_DVSEC_CXL_DISABLE_CACHING);
+
+out_ctrl2:
+ if (rc && ctrl2_clear)
+ cxl_reset_update_ctrl2(pdev, dvsec, 0, ctrl2_clear);
+
+out_iommu:
+ pci_dev_reset_iommu_done(pdev);
+ return rc;
+}
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index fa1fcd26af01..7fc1d34fcce7 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1352,8 +1352,21 @@
#define PCI_DVSEC_CXL_CACHE_CAPABLE _BITUL(0)
#define PCI_DVSEC_CXL_MEM_CAPABLE _BITUL(2)
#define PCI_DVSEC_CXL_HDM_COUNT __GENMASK(5, 4)
+#define PCI_DVSEC_CXL_CACHE_WBI_CAPABLE _BITUL(6)
+#define PCI_DVSEC_CXL_RST_CAPABLE _BITUL(7)
+#define PCI_DVSEC_CXL_RST_TIMEOUT __GENMASK(10, 8)
+#define PCI_DVSEC_CXL_RST_MEM_CLR_CAPABLE _BITUL(11)
#define PCI_DVSEC_CXL_CTRL 0xC
#define PCI_DVSEC_CXL_MEM_ENABLE _BITUL(2)
+#define PCI_DVSEC_CXL_CTRL2 0x10
+#define PCI_DVSEC_CXL_DISABLE_CACHING _BITUL(0)
+#define PCI_DVSEC_CXL_INIT_CACHE_WBI _BITUL(1)
+#define PCI_DVSEC_CXL_INIT_CXL_RST _BITUL(2)
+#define PCI_DVSEC_CXL_RST_MEM_CLR_EN _BITUL(3)
+#define PCI_DVSEC_CXL_STATUS2 0x12
+#define PCI_DVSEC_CXL_CACHE_INV _BITUL(0)
+#define PCI_DVSEC_CXL_RST_DONE _BITUL(1)
+#define PCI_DVSEC_CXL_RST_ERR _BITUL(2)
#define PCI_DVSEC_CXL_RANGE_SIZE_HIGH(i) (0x18 + (i * 0x10))
#define PCI_DVSEC_CXL_RANGE_SIZE_LOW(i) (0x1C + (i * 0x10))
#define PCI_DVSEC_CXL_MEM_INFO_VALID _BITUL(0)
--
2.43.0
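
Note on the cache writeback-and-invalidate wait in this patch: the loop polls Status2 with a fixed time budget. The following is a minimal userspace sketch of that pattern, not part of the patch. The callback type read_status_fn, struct fake_dev and fake_read_status are hypothetical stand-ins for the config-space read, and the sleep is omitted so the sketch runs instantly.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_INV_BIT 0x1u
#define POLL_US       100
#define TIMEOUT_US    (100 * 1000)

/* Hypothetical stand-in for a config-space read of Status2. */
typedef uint16_t (*read_status_fn)(void *cookie);

/* Returns the number of polls used on success, or -1 on timeout. */
static int wait_cache_inv(read_status_fn read_status, void *cookie)
{
	int remaining_us = TIMEOUT_US;
	int polls = 0;

	do {
		/* A real implementation would sleep for POLL_US here. */
		remaining_us -= POLL_US;
		polls++;

		if (read_status(cookie) & CACHE_INV_BIT)
			return polls;
	} while (remaining_us > 0);

	return -1;
}

/* Test double: reports the bit as set after a given number of reads. */
struct fake_dev {
	int reads_until_set;
};

static uint16_t fake_read_status(void *cookie)
{
	struct fake_dev *dev = cookie;

	assert(dev);
	if (dev->reads_until_set > 0)
		dev->reads_until_set--;
	return dev->reads_until_set == 0 ? CACHE_INV_BIT : 0;
}
```

The budget allows TIMEOUT_US / POLL_US = 1000 polls before the sketch gives up.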
* [PATCH v6 6/9] cxl/pci: Track memdevs affected by CXL reset
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (4 preceding siblings ...)
2026-05-28 8:31 ` [PATCH v6 5/9] cxl/pci: Add CXL DVSEC reset helper Srirangan Madhavan
@ 2026-05-28 8:31 ` Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-05-28 8:31 ` [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs Srirangan Madhavan
` (4 subsequent siblings)
10 siblings, 1 reply; 32+ messages in thread
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC (permalink / raw)
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
CXL reset is scoped to the CXL.cache/mem function set, so reset
orchestration needs to account for the target memdev and any affected
sibling-function memdevs.
Add reset context tracking for affected memdevs. Collect the memdevs
associated with the target and sibling PCI functions, track which ones
are active, collect their regions, and provide helpers to lock and
revalidate the active memdevs before reset proceeds.
The reset orchestration and CXL.mem restore flow are added separately.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/pci.c | 176 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 176 insertions(+)
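
Note on the lock helpers described above: only active memdevs are locked, each lock is attempted with trylock, and any failure releases everything taken so far. The following is a minimal userspace sketch of that all-or-nothing pattern, not part of the patch. struct tracked, try_lock, lock_all and unlock_all are hypothetical, and a plain boolean stands in for the device lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tracked {
	bool busy;   /* lock word: true while anyone holds the lock */
	bool active; /* only active entries are locked */
	bool locked; /* true while this context holds the lock */
};

static bool try_lock(struct tracked *t)
{
	if (t->busy)
		return false;
	t->busy = true;
	return true;
}

/* Release, in reverse order, only the locks this context took. */
static void unlock_all(struct tracked *v, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		if (!v[i].locked)
			continue;
		v[i].busy = false;
		v[i].locked = false;
	}
}

/* Returns 0 when every active entry is locked, -1 after rolling back. */
static int lock_all(struct tracked *v, size_t n)
{
	assert(v || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!v[i].active)
			continue;
		if (!try_lock(&v[i])) {
			unlock_all(v, n);
			return -1;
		}
		v[i].locked = true;
	}
	return 0;
}
```

Tracking a per-entry locked flag is what lets the unlock helper run safely on a partially locked set, which is the same role rmd->locked plays in the patch.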
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 1dd880f5a333..c755c18c8d84 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -1106,8 +1106,17 @@ cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
return 0;
}
+struct cxl_reset_memdev {
+ struct cxl_memdev *cxlmd;
+ bool active;
+ bool locked;
+};
+
struct cxl_reset_context {
struct pci_dev *target;
+ struct cxl_reset_memdev *memdevs;
+ int nr_memdevs;
+ int memdev_capacity;
struct pci_dev **siblings;
int nr_siblings;
int sibling_capacity;
@@ -1237,6 +1246,173 @@ static int cxl_reset_collect_siblings(struct cxl_reset_context *ctx)
return wctx.rc;
}
+static int cxl_reset_match_memdev_by_parent(struct device *dev,
+ const void *parent)
+{
+ return is_cxl_memdev(dev) && dev->parent == parent;
+}
+
+static bool cxl_reset_memdev_active(struct cxl_memdev *cxlmd)
+{
+ return cxlmd->dev.driver && cxlmd->endpoint &&
+ !IS_ERR(cxlmd->endpoint);
+}
+
+static int cxl_reset_collect_pci_memdev(struct cxl_reset_context *ctx,
+ struct pci_dev *pdev)
+{
+ struct cxl_reset_memdev *memdevs;
+ struct cxl_memdev *cxlmd;
+ struct device *dev;
+ int capacity, i;
+
+ dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
+ cxl_reset_match_memdev_by_parent);
+ if (!dev)
+ return 0;
+
+ cxlmd = to_cxl_memdev(dev);
+ for (i = 0; i < ctx->nr_memdevs; i++) {
+ if (ctx->memdevs[i].cxlmd == cxlmd) {
+ put_device(dev);
+ return 0;
+ }
+ }
+
+ if (ctx->nr_memdevs < ctx->memdev_capacity)
+ goto add;
+
+ capacity = ctx->memdev_capacity ? ctx->memdev_capacity * 2 :
+ CXL_RESET_SIBLINGS_INIT;
+ memdevs = krealloc(ctx->memdevs, capacity * sizeof(*memdevs),
+ GFP_KERNEL);
+ if (!memdevs) {
+ put_device(dev);
+ return -ENOMEM;
+ }
+
+ ctx->memdevs = memdevs;
+ ctx->memdev_capacity = capacity;
+
+add:
+ ctx->memdevs[ctx->nr_memdevs++] = (struct cxl_reset_memdev) {
+ .cxlmd = cxlmd,
+ };
+ return 0;
+}
+
+/*
+ * CXL Reset is device scoped for CXL.cache/mem. Use the affected PCI
+ * function set to find memdevs whose regions and endpoint decoder state must
+ * be handled around the reset.
+ */
+static int __maybe_unused cxl_reset_collect_memdevs(struct cxl_reset_context *ctx)
+{
+ int rc, i;
+
+ rc = cxl_reset_collect_pci_memdev(ctx, ctx->target);
+ if (rc)
+ return rc;
+
+ for (i = 0; i < ctx->nr_siblings; i++) {
+ rc = cxl_reset_collect_pci_memdev(ctx, ctx->siblings[i]);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
+static int __maybe_unused
+cxl_reset_collect_regions(struct cxl_reset_context *ctx,
+ struct cxl_reset_region_context *region_ctx)
+{
+ int rc, i;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ for (i = 0; i < ctx->nr_memdevs; i++) {
+ struct cxl_reset_memdev *rmd = &ctx->memdevs[i];
+ struct cxl_memdev *cxlmd = rmd->cxlmd;
+
+ if (!device_trylock(&cxlmd->dev))
+ return -EAGAIN;
+
+ if (cxl_reset_memdev_active(cxlmd)) {
+ rc = cxl_reset_collect_memdev_regions(region_ctx,
+ cxlmd);
+ if (!rc)
+ rmd->active = true;
+ } else {
+ rc = 0;
+ }
+
+ device_unlock(&cxlmd->dev);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
+static void cxl_reset_unlock_memdevs(struct cxl_reset_context *ctx)
+{
+ int i;
+
+ for (i = ctx->nr_memdevs - 1; i >= 0; i--) {
+ struct cxl_reset_memdev *rmd = &ctx->memdevs[i];
+
+ if (!rmd->locked)
+ continue;
+
+ device_unlock(&rmd->cxlmd->dev);
+ rmd->locked = false;
+ }
+}
+
+static int __maybe_unused cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
+{
+ int i;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ for (i = 0; i < ctx->nr_memdevs; i++) {
+ struct cxl_reset_memdev *rmd = &ctx->memdevs[i];
+ struct cxl_memdev *cxlmd = rmd->cxlmd;
+
+ if (!rmd->active)
+ continue;
+
+ if (!device_trylock(&cxlmd->dev))
+ goto err;
+
+ rmd->locked = true;
+ if (!cxl_reset_memdev_active(cxlmd)) {
+ cxl_reset_unlock_memdevs(ctx);
+ return -ENODEV;
+ }
+ }
+
+ return 0;
+
+err:
+ cxl_reset_unlock_memdevs(ctx);
+ return -EAGAIN;
+}
+
+static void __maybe_unused cxl_reset_put_memdevs(struct cxl_reset_context *ctx)
+{
+ int i;
+
+ for (i = 0; i < ctx->nr_memdevs; i++)
+ put_device(&ctx->memdevs[i].cxlmd->dev);
+
+ kfree(ctx->memdevs);
+ ctx->memdevs = NULL;
+ ctx->nr_memdevs = 0;
+ ctx->memdev_capacity = 0;
+}
+
static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
{
int i;
--
2.43.0
* [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (5 preceding siblings ...)
2026-05-28 8:31 ` [PATCH v6 6/9] cxl/pci: Track memdevs affected by CXL reset Srirangan Madhavan
@ 2026-05-28 8:31 ` Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-04 3:25 ` Dan Williams (nvidia)
2026-05-28 8:31 ` [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute Srirangan Madhavan
` (3 subsequent siblings)
10 siblings, 2 replies; 32+ messages in thread
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC (permalink / raw)
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Add the reset flow that coordinates the target function, affected CXL
sibling functions, and any active memdevs in the CXL.cache/mem reset
scope.
The flow collects regions for the affected memdevs under
cxl_rwsem.region, verifies that those regions are idle, flushes CPU
caches for the affected ranges, saves and disables the target and sibling
PCI functions, and locks active memdevs to revalidate that their
endpoints are still present before reset.
After the CXL DVSEC reset completes, restore PCI config space so CXL
MMIO is accessible, restore decoder programming for all active affected
memdevs, commit their restored decoders, and only then re-enable CXL.mem
for the affected set.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/pci.c | 414 +++++++++++++++++++++++++++++++++++------
1 file changed, 358 insertions(+), 56 deletions(-)
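
Note on the restore ordering described above: decoders are restored for every active memdev, then committed for every active memdev, and only then is CXL.mem enabled; an enable failure disables the whole affected set again. The following is a minimal userspace sketch of that phased loop, not part of the patch. struct dev, its fields and restore_all are hypothetical, and fail_enable is fault injection for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev {
	bool active;
	bool fail_enable; /* hypothetical fault injection */
	bool restored;
	bool committed;
	bool enabled;
};

/* Returns 0 on success, -1 if any enable step fails. */
static int restore_all(struct dev *v, size_t n)
{
	size_t i;

	assert(v || n == 0);

	/* Phase 1: reprogram decoders on every active device. */
	for (i = 0; i < n; i++)
		if (v[i].active)
			v[i].restored = true;

	/* Phase 2: commit decoders once all of them are programmed. */
	for (i = 0; i < n; i++)
		if (v[i].active)
			v[i].committed = true;

	/* Phase 3: enable; on failure, disable the whole set in reverse. */
	for (i = 0; i < n; i++) {
		if (!v[i].active)
			continue;
		if (v[i].fail_enable) {
			for (size_t j = n; j-- > 0; )
				if (v[j].active)
					v[j].enabled = false;
			return -1;
		}
		v[i].enabled = true;
	}

	return 0;
}
```

Running each phase across the full set before starting the next means no device is enabled while another device in the same reset scope still has uncommitted decoders.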
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c755c18c8d84..486c447e98f3 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -947,14 +947,12 @@ struct cxl_reset_region_context {
struct xarray regions;
};
-static void __maybe_unused
-cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
+static void cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
{
xa_init(&ctx->regions);
}
-static void __maybe_unused
-cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
+static void cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
{
xa_destroy(&ctx->regions);
}
@@ -985,9 +983,8 @@ static int cxl_reset_collect_region(struct device *dev, void *data)
return cxl_reset_add_region(ctx, cxled->cxld.region);
}
-static int __maybe_unused
-cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
- struct cxl_memdev *cxlmd)
+static int cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
+ struct cxl_memdev *cxlmd)
{
struct cxl_port *endpoint;
@@ -1045,8 +1042,7 @@ static int cxl_reset_validate_region_idle(struct cxl_region *cxlr)
return rc;
}
-static int __maybe_unused
-cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
+static int cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
{
struct cxl_region *cxlr;
unsigned long index;
@@ -1077,26 +1073,41 @@ static int cxl_reset_flush_region_cache(struct cxl_region *cxlr)
return rc;
}
-static int __maybe_unused
-cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
+static int cxl_reset_cpu_cache_flush_preflight(struct cxl_reset_region_context *ctx,
+ bool *skip)
{
- struct cxl_region *cxlr;
- unsigned long index;
- int rc;
+ if (skip)
+ *skip = false;
if (xa_empty(&ctx->regions))
return 0;
- if (!cpu_cache_has_invalidate_memregion()) {
- if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
- pr_info_once(
- "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
- return 0;
- }
- pr_warn("Failed to synchronize CPU cache state\n");
- return -ENXIO;
+ if (cpu_cache_has_invalidate_memregion())
+ return 0;
+
+ if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
+ pr_info_once(
+ "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
+ if (skip)
+ *skip = true;
+ return 0;
}
+ pr_warn("Failed to synchronize CPU cache state\n");
+ return -ENXIO;
+}
+
+static int cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
+{
+ struct cxl_region *cxlr;
+ unsigned long index;
+ bool skip;
+ int rc;
+
+ rc = cxl_reset_cpu_cache_flush_preflight(ctx, &skip);
+ if (rc || skip)
+ return rc;
+
xa_for_each(&ctx->regions, index, cxlr) {
rc = cxl_reset_flush_region_cache(cxlr);
if (rc)
@@ -1120,7 +1131,11 @@ struct cxl_reset_context {
struct pci_dev **siblings;
int nr_siblings;
int sibling_capacity;
+ int nr_siblings_locked;
int nr_siblings_prepared;
+ bool target_locked;
+ bool target_saved;
+ bool target_iommu_prepared;
};
struct cxl_reset_walk_ctx {
@@ -1306,7 +1321,7 @@ static int cxl_reset_collect_pci_memdev(struct cxl_reset_context *ctx,
* function set to find memdevs whose regions and endpoint decoder state must
* be handled around the reset.
*/
-static int __maybe_unused cxl_reset_collect_memdevs(struct cxl_reset_context *ctx)
+static int cxl_reset_collect_memdevs(struct cxl_reset_context *ctx)
{
int rc, i;
@@ -1323,7 +1338,7 @@ static int __maybe_unused cxl_reset_collect_memdevs(struct cxl_reset_context *ct
return 0;
}
-static int __maybe_unused
+static int
cxl_reset_collect_regions(struct cxl_reset_context *ctx,
struct cxl_reset_region_context *region_ctx)
{
@@ -1370,7 +1385,7 @@ static void cxl_reset_unlock_memdevs(struct cxl_reset_context *ctx)
}
}
-static int __maybe_unused cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
+static int cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
{
int i;
@@ -1400,7 +1415,7 @@ static int __maybe_unused cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
return -EAGAIN;
}
-static void __maybe_unused cxl_reset_put_memdevs(struct cxl_reset_context *ctx)
+static void cxl_reset_put_memdevs(struct cxl_reset_context *ctx)
{
int i;
@@ -1417,14 +1432,20 @@ static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
{
int i;
+ /*
+ * Config state was restored early for CXL MMIO access. Complete PCI
+ * reset recovery here by unblocking IOMMU and running reset_done().
+ */
for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--) {
struct pci_dev *sibling = ctx->siblings[i];
pci_dev_reset_iommu_done(sibling);
pci_dev_restore(sibling);
- pci_dev_unlock(sibling);
}
+ for (i = ctx->nr_siblings_locked - 1; i >= 0; i--)
+ pci_dev_unlock(ctx->siblings[i]);
+
for (i = 0; i < ctx->nr_siblings; i++)
pci_dev_put(ctx->siblings[i]);
@@ -1432,31 +1453,39 @@ static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
ctx->siblings = NULL;
ctx->nr_siblings = 0;
ctx->sibling_capacity = 0;
+ ctx->nr_siblings_locked = 0;
ctx->nr_siblings_prepared = 0;
}
-static int __maybe_unused
-cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
+static int cxl_pci_functions_lock(struct cxl_reset_context *ctx)
{
- int rc, i;
-
- ctx->siblings = NULL;
- ctx->nr_siblings = 0;
- ctx->sibling_capacity = 0;
- ctx->nr_siblings_prepared = 0;
+ int i;
- rc = cxl_reset_collect_siblings(ctx);
- if (rc)
- goto err;
+ ctx->nr_siblings_locked = 0;
for (i = 0; i < ctx->nr_siblings; i++) {
struct pci_dev *sibling = ctx->siblings[i];
if (!pci_dev_trylock(sibling)) {
- rc = -EAGAIN;
- goto err;
+ cxl_pci_functions_reset_done(ctx);
+ return -EAGAIN;
}
+ ctx->nr_siblings_locked++;
+ }
+
+ return 0;
+}
+
+static int cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
+{
+ int rc, i;
+
+ ctx->nr_siblings_prepared = 0;
+
+ for (i = 0; i < ctx->nr_siblings_locked; i++) {
+ struct pci_dev *sibling = ctx->siblings[i];
+
pci_dev_save_and_disable(sibling);
rc = pci_dev_reset_iommu_prepare(sibling);
if (rc) {
@@ -1469,7 +1498,6 @@ cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
* nr_siblings_prepared and must not get iommu_done().
*/
pci_dev_restore(sibling);
- pci_dev_unlock(sibling);
goto err;
}
@@ -1483,6 +1511,79 @@ cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
return rc;
}
+/*
+ * Restore PCI config space after reset so CXL MMIO is accessible for memdev
+ * restore. Driver reset_done callbacks remain deferred to final cleanup.
+ */
+static void cxl_pci_functions_restore_state(struct cxl_reset_context *ctx)
+{
+ int i;
+
+ for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--)
+ pci_restore_state(ctx->siblings[i]);
+}
+
+static int cxl_pci_target_lock(struct cxl_reset_context *ctx)
+{
+ struct pci_dev *pdev = ctx->target;
+
+ if (!pci_dev_trylock(pdev))
+ return -EAGAIN;
+
+ ctx->target_locked = true;
+ return 0;
+}
+
+static int cxl_pci_target_reset_prepare(struct cxl_reset_context *ctx)
+{
+ struct pci_dev *pdev = ctx->target;
+ int rc;
+
+ /* Disable first to stop new transactions, then drain in-flight ones. */
+ pci_dev_save_and_disable(pdev);
+ ctx->target_saved = true;
+
+ if (!pci_wait_for_pending_transaction(pdev))
+ pci_err(pdev, "timed out waiting for pending transactions\n");
+
+ rc = pci_dev_reset_iommu_prepare(pdev);
+ if (rc) {
+ pci_err(pdev, "failed to block IOMMU for CXL reset: %d\n", rc);
+ return rc;
+ }
+
+ ctx->target_iommu_prepared = true;
+ return 0;
+}
+
+static void cxl_pci_target_restore_state(struct cxl_reset_context *ctx)
+{
+ if (ctx->target_saved)
+ pci_restore_state(ctx->target);
+}
+
+static void cxl_pci_target_reset_done(struct cxl_reset_context *ctx)
+{
+ if (ctx->target_iommu_prepared) {
+ pci_dev_reset_iommu_done(ctx->target);
+ ctx->target_iommu_prepared = false;
+ }
+
+ /*
+ * cxl_pci_target_restore_state() restores config space before memdev
+ * restore. Complete PCI reset recovery here with reset_done().
+ */
+ if (ctx->target_saved) {
+ pci_dev_restore(ctx->target);
+ ctx->target_saved = false;
+ }
+
+ if (ctx->target_locked) {
+ pci_dev_unlock(ctx->target);
+ ctx->target_locked = false;
+ }
+}
+
static int cxl_reset_update_ctrl2(struct pci_dev *pdev, int dvsec, u16 set,
u16 clear)
{
@@ -1599,7 +1700,7 @@ static int cxl_reset_wait_done(struct pci_dev *pdev, int dvsec, u16 cap)
return 0;
}
-static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
+static int cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
{
int dvsec, rc;
u16 ctrl2_clear = 0;
@@ -1620,19 +1721,9 @@ static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
if (mem_clear && !(cap & PCI_DVSEC_CXL_RST_MEM_CLR_CAPABLE))
return -EOPNOTSUPP;
- if (!pci_wait_for_pending_transaction(pdev))
- pci_err(pdev, "timed out waiting for pending transactions\n");
-
- rc = pci_dev_reset_iommu_prepare(pdev);
- if (rc) {
- pci_err(pdev, "failed to block IOMMU for CXL reset: %d\n",
- rc);
- return rc;
- }
-
rc = cxl_reset_disable_cache(pdev, dvsec, cap);
if (rc)
- goto out_iommu;
+ return rc;
if (cap & PCI_DVSEC_CXL_CACHE_CAPABLE)
ctrl2_clear |= PCI_DVSEC_CXL_DISABLE_CACHING;
@@ -1651,7 +1742,7 @@ static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
rc = cxl_reset_wait_done(pdev, dvsec, cap);
if (rc)
- goto out_iommu;
+ return rc;
rc = cxl_reset_update_ctrl2(pdev, dvsec, 0,
PCI_DVSEC_CXL_DISABLE_CACHING);
@@ -1660,7 +1751,218 @@ static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
if (rc && ctrl2_clear)
cxl_reset_update_ctrl2(pdev, dvsec, 0, ctrl2_clear);
-out_iommu:
- pci_dev_reset_iommu_done(pdev);
+ return rc;
+}
+
+static int cxl_reset_restore_memdev(struct cxl_reset_memdev *rmd)
+{
+ struct cxl_memdev *cxlmd = rmd->cxlmd;
+ int rc;
+
+ if (!rmd->active)
+ return 0;
+
+ rc = cxl_restore_memdev_decoders(cxlmd);
+ if (rc)
+ dev_err(&cxlmd->dev,
+ "Failed to restore CXL.mem decoders after reset: %d\n",
+ rc);
+
+ return rc;
+}
+
+static int cxl_reset_commit_memdev(struct cxl_reset_memdev *rmd)
+{
+ struct cxl_memdev *cxlmd = rmd->cxlmd;
+ int rc;
+
+ if (!rmd->active)
+ return 0;
+
+ rc = cxl_commit_memdev_decoders(cxlmd);
+ if (rc)
+ dev_err(&cxlmd->dev,
+ "Failed to commit CXL.mem decoders after reset: %d\n",
+ rc);
+
+ return rc;
+}
+
+static int cxl_reset_enable_memdev(struct cxl_reset_memdev *rmd)
+{
+ struct cxl_memdev *cxlmd = rmd->cxlmd;
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ int rc;
+
+ if (!rmd->active)
+ return 0;
+
+ cxlds->media_ready = false;
+
+ rc = cxl_set_mem_enable(cxlds, PCI_DVSEC_CXL_MEM_ENABLE);
+ if (rc < 0) {
+ dev_err(&cxlmd->dev,
+ "Failed to enable CXL.mem after reset: %d\n", rc);
+ return rc;
+ }
+
+ rc = cxl_await_media_ready(cxlds);
+ if (rc) {
+ dev_err(&cxlmd->dev,
+ "Media not active after CXL reset: %d\n", rc);
+ return rc;
+ }
+ cxlds->media_ready = true;
+
+ return 0;
+}
+
+static void cxl_reset_disable_memdevs(struct cxl_reset_context *ctx)
+{
+ int rc, i;
+
+ for (i = ctx->nr_memdevs - 1; i >= 0; i--) {
+ struct cxl_memdev *cxlmd = ctx->memdevs[i].cxlmd;
+
+ if (!ctx->memdevs[i].active)
+ continue;
+
+ rc = cxl_set_mem_enable(cxlmd->cxlds, 0);
+ if (rc < 0)
+ dev_err(&cxlmd->dev,
+ "Failed to disable CXL.mem after reset restore failure; device state may be inconsistent: %d\n",
+ rc);
+ }
+}
+
+static int cxl_reset_restore_memdevs(struct cxl_reset_context *ctx)
+{
+ int rc;
+ int i;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ for (i = 0; i < ctx->nr_memdevs; i++) {
+ rc = cxl_reset_restore_memdev(&ctx->memdevs[i]);
+ if (rc)
+ return rc;
+ }
+
+ for (i = 0; i < ctx->nr_memdevs; i++) {
+ rc = cxl_reset_commit_memdev(&ctx->memdevs[i]);
+ if (rc)
+ return rc;
+ }
+
+ for (i = 0; i < ctx->nr_memdevs; i++) {
+ rc = cxl_reset_enable_memdev(&ctx->memdevs[i]);
+ if (rc) {
+ cxl_reset_disable_memdevs(ctx);
+ return rc;
+ }
+ }
+
+ return 0;
+}
+
+static void cxl_reset_context_destroy(struct cxl_reset_context *ctx)
+{
+ /*
+ * LIFO unwind for regular completion and partial initialization:
+ * memdevs, sibling functions, target function, then references.
+ * Each cleanup helper tolerates being called after its state was
+ * already released on an earlier error path.
+ */
+ cxl_reset_unlock_memdevs(ctx);
+ cxl_pci_functions_reset_done(ctx);
+ cxl_pci_target_reset_done(ctx);
+ cxl_reset_put_memdevs(ctx);
+}
+
+static int cxl_do_reset_locked(struct cxl_reset_context *ctx, bool mem_clear)
+{
+ struct cxl_reset_region_context region_ctx;
+ int rc;
+
+ lockdep_assert_held_write(&cxl_rwsem.region);
+
+ cxl_reset_region_context_init(&region_ctx);
+
+ rc = cxl_reset_collect_regions(ctx, &region_ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_pci_target_lock(ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_pci_functions_lock(ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_reset_lock_memdevs(ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_reset_cpu_cache_flush_preflight(&region_ctx, NULL);
+ if (rc)
+ goto out;
+
+ rc = cxl_reset_validate_regions_idle(&region_ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_reset_flush_cpu_caches(&region_ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_pci_target_reset_prepare(ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_pci_functions_reset_prepare(ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_dev_reset(ctx->target, mem_clear);
+
+ cxl_pci_target_restore_state(ctx);
+ cxl_pci_functions_restore_state(ctx);
+
+ if (!rc)
+ rc = cxl_reset_restore_memdevs(ctx);
+
+ cxl_reset_unlock_memdevs(ctx);
+
+out:
+ cxl_reset_region_context_destroy(&region_ctx);
+ return rc;
+}
+
+static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
+{
+ struct cxl_reset_context ctx = {
+ .target = pdev,
+ };
+ int rc;
+
+ /*
+ * Snapshot the CXL r3.2 9.7 device reset scope before taking
+ * cxl_rwsem.region. Hot-added functions after this point are not
+ * coordinated by this reset operation.
+ */
+ rc = cxl_reset_collect_siblings(&ctx);
+ if (rc)
+ goto out;
+
+ rc = cxl_reset_collect_memdevs(&ctx);
+ if (rc)
+ goto out;
+
+ scoped_guard(rwsem_write, &cxl_rwsem.region)
+ rc = cxl_do_reset_locked(&ctx, mem_clear);
+
+out:
+ cxl_reset_context_destroy(&ctx);
return rc;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
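The ordering in cxl_reset_restore_memdevs() above (restore every memdev, then commit every memdev, then enable every memdev, with a reverse-order disable of all active memdevs if any enable fails) can be sketched as a small userspace model. This is only an illustration: struct memdev_model, the fail_enable knob, and all model_* helpers are hypothetical stand-ins, and the assumption that the enable step skips inactive entries is mine, not something the patch shows.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for one ctx->memdevs[] entry; not the kernel type. */
struct memdev_model {
	int active;      /* mirrors ctx->memdevs[i].active */
	int fail_enable; /* test knob: force the enable phase to fail here */
	int enabled;     /* set by the enable phase, cleared by rollback */
};

/* Restore and commit always succeed in this model. */
static int model_restore(struct memdev_model *m) { (void)m; return 0; }
static int model_commit(struct memdev_model *m) { (void)m; return 0; }

/* Assumption: inactive entries are skipped rather than enabled. */
static int model_enable(struct memdev_model *m)
{
	if (!m->active)
		return 0;
	if (m->fail_enable)
		return -EIO;
	m->enabled = 1;
	return 0;
}

/* Reverse walk over all active entries, like cxl_reset_disable_memdevs(). */
static void model_disable_all(struct memdev_model *devs, int n)
{
	for (int i = n - 1; i >= 0; i--) {
		if (!devs[i].active)
			continue;
		devs[i].enabled = 0;
	}
}

/* Three separate passes: no memdev is enabled until all are committed. */
static int model_restore_memdevs(struct memdev_model *devs, int n)
{
	int rc, i;

	assert(n >= 0);

	for (i = 0; i < n; i++) {
		rc = model_restore(&devs[i]);
		if (rc)
			return rc;
	}

	for (i = 0; i < n; i++) {
		rc = model_commit(&devs[i]);
		if (rc)
			return rc;
	}

	for (i = 0; i < n; i++) {
		rc = model_enable(&devs[i]);
		if (rc) {
			/* Only the enable phase needs an explicit rollback. */
			model_disable_all(devs, n);
			return rc;
		}
	}

	return 0;
}
```

In this model a failure in the restore or commit pass returns immediately because nothing has been enabled yet, which matches the early returns in the patch.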
* [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (6 preceding siblings ...)
2026-05-28 8:31 ` [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs Srirangan Madhavan
@ 2026-05-28 8:31 ` Srirangan Madhavan
2026-06-02 21:35 ` Cheatham, Benjamin
2026-06-02 23:50 ` Dave Jiang
2026-05-28 8:31 ` [PATCH v6 9/9] Documentation/ABI: Document CXL memdev cxl_reset Srirangan Madhavan
` (2 subsequent siblings)
10 siblings, 2 replies; 32+ messages in thread
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC (permalink / raw)
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Expose CXL reset through the CXL memdev device. The reset flow
depends on CXL memdev state to identify affected regions, coordinate
decoder restore, and keep CXL-specific policy out of the PCI sysfs ABI.
Add a write-only cxl_reset attribute under memX. The attribute is visible
only when the memdev's PCI parent advertises CXL Reset capability.
Writing a true boolean value invokes the CXL reset orchestration.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
drivers/cxl/core/memdev.c | 30 +++++++++++
drivers/cxl/core/pci.c | 102 +++++++++++++++++++++++++++++++++++++-
drivers/cxl/cxl.h | 3 ++
drivers/cxl/cxlmem.h | 2 +
4 files changed, 136 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 80e65690eb77..af67fa3d11b8 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -199,6 +199,26 @@ static ssize_t security_erase_store(struct device *dev,
static struct device_attribute dev_attr_security_erase =
__ATTR(erase, 0200, NULL, security_erase_store);
+static ssize_t cxl_reset_store(struct device *dev,
+ struct device_attribute *attr, const char *buf,
+ size_t len)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ bool reset;
+ int rc;
+
+ rc = kstrtobool(buf, &reset);
+ if (rc)
+ return rc;
+
+ if (!reset)
+ return -EINVAL;
+
+ rc = cxl_memdev_reset(cxlmd);
+ return rc ? rc : len;
+}
+static DEVICE_ATTR_WO(cxl_reset);
+
bool cxl_memdev_has_poison_cmd(struct cxl_memdev *cxlmd,
enum poison_cmd_enabled_bits cmd)
{
@@ -421,6 +441,7 @@ static struct attribute *cxl_memdev_attributes[] = {
&dev_attr_payload_max.attr,
&dev_attr_label_storage_size.attr,
&dev_attr_numa_node.attr,
+ &dev_attr_cxl_reset.attr,
NULL,
};
@@ -485,8 +506,16 @@ static struct attribute *cxl_memdev_security_attributes[] = {
static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
int n)
{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
if (!IS_ENABLED(CONFIG_NUMA) && a == &dev_attr_numa_node.attr)
return 0;
+
+ if (a == &dev_attr_cxl_reset.attr &&
+ !cxl_memdev_reset_capable(cxlmd))
+ return 0;
+
return a->mode;
}
@@ -1099,6 +1128,7 @@ static int cxlmd_add(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
cxlmd->cxlds = cxlds;
cxlds->cxlmd = cxlmd;
+ cxl_memdev_init_reset(cxlmd);
rc = cdev_device_add(&cxlmd->cdev, &cxlmd->dev);
if (rc) {
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 486c447e98f3..09f016544d24 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -1207,6 +1207,22 @@ static bool cxl_reset_has_cache_or_mem(struct pci_dev *pdev)
return cap & (PCI_DVSEC_CXL_CACHE_CAPABLE | PCI_DVSEC_CXL_MEM_CAPABLE);
}
+static bool cxl_reset_is_type2(struct pci_dev *pdev)
+{
+ u16 dvsec, cap;
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return false;
+
+ if (pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap))
+ return false;
+
+ return (cap & PCI_DVSEC_CXL_CACHE_CAPABLE) &&
+ (cap & PCI_DVSEC_CXL_MEM_CAPABLE);
+}
+
static int cxl_reset_add_sibling(struct cxl_reset_context *ctx,
struct pci_dev *sibling)
{
@@ -1939,7 +1955,7 @@ static int cxl_do_reset_locked(struct cxl_reset_context *ctx, bool mem_clear)
return rc;
}
-static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
+static int cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
{
struct cxl_reset_context ctx = {
.target = pdev,
@@ -1966,3 +1982,87 @@ static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
cxl_reset_context_destroy(&ctx);
return rc;
}
+
+static struct pci_dev *cxl_reset_get_fn0(struct pci_dev *pdev)
+{
+ unsigned int devfn;
+
+ /*
+ * CXL Reset control/status is exposed in Function 0 and affects all
+ * CXL.cache/mem functions in the device.
+ */
+ if (pci_ari_enabled(pdev->bus))
+ devfn = 0;
+ else
+ devfn = PCI_DEVFN(PCI_SLOT(pdev->devfn), 0);
+
+ if (pdev->devfn == devfn)
+ return pci_dev_get(pdev);
+
+ return pci_get_slot(pdev->bus, devfn);
+}
+
+static bool cxl_memdev_probe_reset_capable(struct cxl_memdev *cxlmd)
+{
+ struct device *dev = cxlmd->dev.parent;
+ struct pci_dev *pdev, *fn0;
+ int dvsec;
+ u16 cap;
+
+ if (!dev || !dev_is_pci(dev))
+ return false;
+
+ pdev = to_pci_dev(dev);
+ if (!cxl_reset_is_type2(pdev))
+ return false;
+
+ fn0 = cxl_reset_get_fn0(pdev);
+ if (!fn0)
+ return false;
+
+ dvsec = pci_find_dvsec_capability(fn0, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ goto out;
+
+ if (pci_read_config_word(fn0, dvsec + PCI_DVSEC_CXL_CAP, &cap))
+ goto out;
+
+ pci_dev_put(fn0);
+ return cap & PCI_DVSEC_CXL_RST_CAPABLE;
+
+out:
+ pci_dev_put(fn0);
+ return false;
+}
+
+void cxl_memdev_init_reset(struct cxl_memdev *cxlmd)
+{
+ cxlmd->reset_capable = cxl_memdev_probe_reset_capable(cxlmd);
+}
+
+bool cxl_memdev_reset_capable(struct cxl_memdev *cxlmd)
+{
+ return cxlmd->reset_capable;
+}
+
+int cxl_memdev_reset(struct cxl_memdev *cxlmd)
+{
+ struct device *dev = cxlmd->dev.parent;
+ struct pci_dev *fn0;
+ int rc;
+
+ if (!cxl_memdev_reset_capable(cxlmd))
+ return -EOPNOTSUPP;
+
+ if (!dev || !dev_is_pci(dev))
+ return -ENODEV;
+
+ fn0 = cxl_reset_get_fn0(to_pci_dev(dev));
+ if (!fn0)
+ return -ENODEV;
+
+ rc = cxl_do_reset(fn0, false);
+ pci_dev_put(fn0);
+ return rc;
+}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b51b1e9d6400..bf65996e24dc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -796,6 +796,9 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
struct cxl_endpoint_dvsec_info *info);
int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
+void cxl_memdev_init_reset(struct cxl_memdev *cxlmd);
+bool cxl_memdev_reset_capable(struct cxl_memdev *cxlmd);
+int cxl_memdev_reset(struct cxl_memdev *cxlmd);
bool is_cxl_region(struct device *dev);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 776c50d1db51..c8e7349fb130 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -48,6 +48,7 @@ struct cxl_memdev_attach {
* @cxl_nvd: optional bridge to an nvdimm if the device supports pmem
* @endpoint: connection to the CXL port topology for this memory device
* @attach: creator of this memdev depends on CXL link attach to operate
+ * @reset_capable: cached CXL Reset support
* @id: id number of this memdev instance.
* @depth: endpoint port depth
* @scrub_cycle: current scrub cycle set for this device
@@ -65,6 +66,7 @@ struct cxl_memdev {
struct cxl_nvdimm *cxl_nvd;
struct cxl_port *endpoint;
const struct cxl_memdev_attach *attach;
+ bool reset_capable;
int id;
int depth;
u8 scrub_cycle;
--
2.43.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
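The Function 0 lookup in cxl_reset_get_fn0() above comes down to a devfn calculation: with ARI the whole 8-bit devfn is a function number, so Function 0 is devfn 0; without ARI, Function 0 is the function-0 entry of the same slot. A minimal userspace sketch of just that arithmetic follows. The MODEL_* macros re-create the usual 5-bit slot / 3-bit function encoding for illustration, and model_fn0_devfn() is a hypothetical name, not a kernel helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace copies of the conventional devfn encoding: slot[7:3], func[2:0]. */
#define MODEL_PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define MODEL_PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)

/* Return the devfn of Function 0 for the device that owns @devfn. */
static unsigned int model_fn0_devfn(unsigned int devfn, bool ari_enabled)
{
	assert(devfn <= 0xff);

	/* With ARI there is no slot field, so Function 0 is simply devfn 0. */
	if (ari_enabled)
		return 0;

	/* Without ARI, keep the slot and clear the function bits. */
	return MODEL_PCI_DEVFN(MODEL_PCI_SLOT(devfn), 0);
}
```

For example, slot 3 function 1 (devfn 25) maps to devfn 24 without ARI and to devfn 0 with ARI.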
* [PATCH v6 9/9] Documentation/ABI: Document CXL memdev cxl_reset
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (7 preceding siblings ...)
2026-05-28 8:31 ` [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute Srirangan Madhavan
@ 2026-05-28 8:31 ` Srirangan Madhavan
2026-06-03 0:11 ` Dave Jiang
2026-06-02 20:34 ` [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Cheatham, Benjamin
2026-06-02 21:42 ` Dan Williams (nvidia)
10 siblings, 1 reply; 32+ messages in thread
From: Srirangan Madhavan @ 2026-05-28 8:31 UTC (permalink / raw)
To: linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Document the write-only cxl_reset attribute under CXL memdev devices.
The attribute is visible only when the memdev's PCI parent advertises
CXL Reset capability, and writing a true boolean value requests the CXL
reset flow.
Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
---
Documentation/ABI/testing/sysfs-bus-cxl | 28 +++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 16a9b3d2e2c0..d5d055e7a756 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -110,6 +110,34 @@ Description:
affinity for this device.
+What: /sys/bus/cxl/devices/memX/cxl_reset
+Date: May, 2026
+KernelVersion: v7.1
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (WO) Write a boolean true value, for example "1" or "true", to
+ request CXL Reset for this memory device. The driver performs
+ CXL-specific reset coordination for the target memdev before
+ issuing reset, including any required preparation for affected
+ CXL memory regions and related CXL memory devices.
+
+ CXL Reset control is Function 0 scoped. A write to this
+ attribute resets the CXL.cache and CXL.mem state for all
+ CXL.cache or CXL.mem functions in the same CXL device reset
+ scope, not only the memX device associated with this file.
+
+ The optional CXL Reset Memory Clear operation is not exposed by
+ this attribute.
+
+ A reset fails with -EBUSY if any affected CXL region is
+ online as System RAM or has an active region driver bound.
+ Userspace must first quiesce and release affected CXL memory
+ mappings.
+
+ If this file is not present, then CXL Reset is not supported
+ for the device.
+
+
What: /sys/bus/cxl/devices/memX/security/state
Date: June, 2023
KernelVersion: v6.5
--
2.43.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
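From userspace, the ABI documented above is a single write of a true boolean to the cxl_reset file. A minimal sketch of a caller is shown below; request_cxl_reset() is a hypothetical helper name and the path is supplied by the caller (for instance /sys/bus/cxl/devices/mem0/cxl_reset, where mem0 is only an example instance). Based on the ABI text, a caller should expect -EBUSY when an affected region is still in use, and -ENOENT when the attribute is absent because CXL Reset is not supported.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write "1" to the given attribute; return 0 on success or a negative errno. */
static int request_cxl_reset(const char *path)
{
	ssize_t n;
	int fd, rc = 0;

	assert(path != NULL);

	/* The attribute is write-only, so open it O_WRONLY. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, "1", 1);
	if (n < 0)
		rc = -errno;  /* e.g. -EBUSY if a region is still online */
	else if (n != 1)
		rc = -EIO;    /* unexpected short write */

	close(fd);
	return rc;
}
```

A caller that sees -EBUSY would first have to quiesce and release the affected CXL memory, as the Description says, and then retry.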
* Re: [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders
2026-05-28 8:31 ` [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders Srirangan Madhavan
@ 2026-05-28 11:06 ` Richard Cheng
2026-06-02 18:12 ` Dave Jiang
2026-06-02 18:31 ` Dave Jiang
` (2 subsequent siblings)
3 siblings, 1 reply; 32+ messages in thread
From: Richard Cheng @ 2026-05-28 11:06 UTC (permalink / raw)
To: Srirangan Madhavan
Cc: linux-cxl, linux-pci, linux-kernel, vsethi, alwilliamson,
Dan Williams, Sai Yashwanth Reddy Kancherla, Vishal Aslot,
Manish Honap, Jiandi An, linux-tegra
On Thu, May 28, 2026 at 08:31:46AM +0800, Srirangan Madhavan wrote:
> Add helpers to restore endpoint decoder programming for a CXL memdev from
> CXL core's cached decoder objects, then commit it as a distinct step.
> Callers are expected to have established reset safety and to hold
> cxl_rwsem.region for write.
>
> cxl_restore_memdev_decoders() restores programmable decoder state while
> keeping traffic disabled. For HDM-backed endpoints it programs enabled
> endpoint decoder fields without COMMIT, keeps the HDM Decoder Capability
> disabled, and mirrors matching endpoint DVSEC ranges where possible. For
> endpoints without HDM decoder registers, it restores the legacy DVSEC
> ranges that model endpoint decode.
>
> cxl_commit_memdev_decoders() enables the HDM Decoder Capability and
> commits enabled, unlocked endpoint decoders after safety checks pass. It
> sets COMMIT only after decoder fields have been restored, does not
> re-lock decoders, and does not set DVSEC MEM_ENABLE.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/hdm.c | 318 ++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 2 +
> 2 files changed, 317 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0c80b76a5f9b..f7af1041a9fc 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -679,7 +679,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
> return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
> }
>
> -static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +static int cxld_set_interleave_fields(struct cxl_decoder *cxld, u32 *ctrl)
> {
> u16 eig;
> u8 eiw;
> @@ -690,14 +690,22 @@ static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> */
> if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
> "invalid interleave_ways: %d\n", cxld->interleave_ways))
> - return;
> + return -EINVAL;
> if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
> "invalid interleave_granularity: %d\n",
> cxld->interleave_granularity))
> - return;
> + return -EINVAL;
>
> u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
> u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
> + return 0;
> +}
> +
> +static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +{
> + if (cxld_set_interleave_fields(cxld, ctrl))
> + return;
> +
> *ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> }
>
> @@ -927,6 +935,310 @@ static void cxl_decoder_reset(struct cxl_decoder *cxld)
> }
> }
>
> +static int cxl_restore_dvsec_range(struct cxl_memdev *cxlmd,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + struct cxl_decoder *cxld = &cxled->cxld;
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + u64 base = cxld->hpa_range.start;
> + u64 size = range_len(&cxld->hpa_range);
> + u32 lo;
> + int dvsec = cxlds->cxl_dvsec;
> + int id = cxld->id;
> + int rc;
> +
> + if (!dvsec)
> + return 0;
> +
> + if (id >= CXL_DVSEC_RANGE_MAX)
> + return 0;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_HIGH(id),
> + upper_32_bits(base));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
The pci_read_config_*()/pci_write_config_*() accessors return positive values
on failure, and here that value is passed up unchanged. It eventually surfaces
through cxl_reset_store() to userspace, where sysfs treats a positive return
value as "bytes written".
I think this needs a fix?
Best regards,
Richard Cheng.
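To illustrate the concern above: the tail of cxl_reset_store() is "return rc ? rc : len", so a positive PCIBIOS_* code would be returned as if it were a byte count. One possible shape of a fix is to convert at the call site, for example with the kernel's pcibios_err_to_errno(). The sketch below is a userspace model only: the MODEL_PCIBIOS_* values and the mapping are written from my recollection of include/linux/pci.h and should be checked against the tree, and model_store_return() is a hypothetical stand-in for the sysfs store tail.

```c
#include <assert.h>
#include <errno.h>

/* PCIBIOS codes as recalled from include/linux/pci.h (assumption, verify). */
#define MODEL_PCIBIOS_SUCCESSFUL          0x00
#define MODEL_PCIBIOS_FUNC_NOT_SUPPORTED  0x81
#define MODEL_PCIBIOS_BAD_VENDOR_ID       0x83
#define MODEL_PCIBIOS_DEVICE_NOT_FOUND    0x86
#define MODEL_PCIBIOS_BAD_REGISTER_NUMBER 0x87
#define MODEL_PCIBIOS_SET_FAILED          0x88
#define MODEL_PCIBIOS_BUFFER_TOO_SMALL    0x89

/* Model of the conversion: positive PCIBIOS code in, negative errno out. */
static int model_pcibios_err_to_errno(int err)
{
	if (err <= MODEL_PCIBIOS_SUCCESSFUL)
		return err; /* already 0 or a negative errno */

	switch (err) {
	case MODEL_PCIBIOS_FUNC_NOT_SUPPORTED:
		return -ENOENT;
	case MODEL_PCIBIOS_BAD_VENDOR_ID:
		return -ENOTTY;
	case MODEL_PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case MODEL_PCIBIOS_BAD_REGISTER_NUMBER:
		return -EFAULT;
	case MODEL_PCIBIOS_SET_FAILED:
		return -EIO;
	case MODEL_PCIBIOS_BUFFER_TOO_SMALL:
		return -ENOSPC;
	}

	return -ERANGE;
}

/* Stand-in for the end of cxl_reset_store(): "return rc ? rc : len". */
static long model_store_return(int rc, long len)
{
	assert(len >= 0);
	return rc ? rc : len;
}
```

Without the conversion the model returns 0x88 (136) for a failed config write, which a sysfs writer would read as 136 bytes consumed; with it, the same failure becomes -EIO.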
> + lo &= ~PCI_DVSEC_CXL_MEM_BASE_LOW;
> + lo |= lower_32_bits(base) & PCI_DVSEC_CXL_MEM_BASE_LOW;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + lo);
> + if (rc)
> + return rc;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_HIGH(id),
> + upper_32_bits(size));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
> +
> + /*
> + * Preserve MEM_INFO_VALID / MEM_ACTIVE and any reserved bits while
> + * restoring only the programmable size bits.
> + */
> + lo &= ~PCI_DVSEC_CXL_MEM_SIZE_LOW;
> + lo |= lower_32_bits(size) & PCI_DVSEC_CXL_MEM_SIZE_LOW;
> +
> + return pci_write_config_dword(pdev,
> + dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + lo);
> +}
> +
> +static int cxl_restore_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm;
> + u64 base, size, skip;
> + u32 ctrl;
> + int id;
> +
> + id = cxld->id;
> + hdm = cxlhdm->regs.hdm_decoder;
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + base = cxld->hpa_range.start;
> + size = range_len(&cxld->hpa_range);
> + skip = cxled->skip;
> +
> + ctrl &= ~(CXL_HDM_DECODER0_CTRL_LOCK |
> + CXL_HDM_DECODER0_CTRL_COMMIT |
> + CXL_HDM_DECODER0_CTRL_COMMITTED |
> + CXL_HDM_DECODER0_CTRL_COMMIT_ERROR);
> + if (cxld_set_interleave_fields(cxld, &ctrl))
> + return -EINVAL;
> + cxld_set_type(cxld, &ctrl);
> +
> + /* Preserve setup_hw_decoder() programming order, without COMMIT. */
> + writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
> + writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
> + writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
> + writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
> + writel(upper_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_HIGH(id));
> + writel(lower_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_LOW(id));
> + wmb();
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return 0;
> +}
> +
> +struct cxl_restore_ctx {
> + struct cxl_memdev *cxlmd;
> + struct cxl_hdm *cxlhdm;
> +};
> +
> +static int cxl_restore_decoder(struct device *dev, void *data)
> +{
> + struct cxl_restore_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_decoder *cxld;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + cxld = &cxled->cxld;
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (ctx->cxlhdm->regs.hdm_decoder) {
> + if (cxld->id >= ctx->cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + rc = cxl_restore_hdm_decoder(ctx->cxlhdm, cxled);
> + if (rc)
> + return rc;
> + }
> +
> + return cxl_restore_dvsec_range(ctx->cxlmd, cxled);
> +}
> +
> +static int cxl_restore_decoders(struct cxl_memdev *cxlmd, struct cxl_hdm *cxlhdm)
> +{
> + struct cxl_port *port = cxlhdm->port;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + struct cxl_restore_ctx ctx = {
> + .cxlmd = cxlmd,
> + .cxlhdm = cxlhdm,
> + };
> + u32 global_ctrl;
> +
> + if (hdm) {
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl & ~CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + }
> +
> + return device_for_each_child(&port->dev, &ctx, cxl_restore_decoder);
> +}
> +
> +/**
> + * cxl_restore_memdev_decoders - Restore endpoint decoder programming
> + * @cxlmd: CXL memdev whose endpoint decoders need to be restored
> + *
> + * Restore only programmable decoder state from CXL core's cached decoder
> + * objects. For endpoints with HDM decoder registers, program the HDM decoder
> + * fields and mirror decoder ids representable by CXL_DVSEC_RANGE_MAX into the
> + * DVSEC range registers when present. For endpoints without HDM decoder
> + * registers, restore DVSEC range registers only.
> + *
> + * This helper leaves CXL.mem disabled: it does not commit HDM decoders, enable
> + * the HDM Decoder Capability, set PCI_DVSEC_CXL_MEM_ENABLE, or restore
> + * unrelated DVSEC CTRL, CTRL2, LOCK, MEM_ENABLE, or other control state.
> + * Callers must perform final commit/resume steps only after reset safety checks
> + * pass.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_hdm *cxlhdm;
> + int rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (!endpoint)
> + return -ENODEV;
> +
> + cxlhdm = dev_get_drvdata(&endpoint->dev);
> + if (!cxlhdm)
> + return -ENODEV;
> +
> + scoped_guard(rwsem_read, &cxl_rwsem.dpa)
> + rc = cxl_restore_decoders(cxlmd, cxlhdm);
> + return rc;
> +}
> +
> +static int cxl_commit_restored_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + u32 ctrl;
> + int id;
> +
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (!hdm)
> + return 0;
> +
> + id = cxld->id;
> + if (id >= cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + /*
> + * cxl_restore_hdm_decoder() programmed the decoder fields first. This
> + * control register write sets COMMIT as the final programming step.
> + */
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
> + return 0;
> +
> + ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return cxld_await_commit(hdm, id);
> +}
> +
> +struct cxl_commit_decoder_ctx {
> + struct cxl_hdm *cxlhdm;
> + int id;
> +};
> +
> +static int cxl_commit_restored_decoder_by_id(struct device *dev, void *data)
> +{
> + struct cxl_commit_decoder_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + if (cxled->cxld.id != ctx->id)
> + return 0;
> +
> + rc = cxl_commit_restored_hdm_decoder(ctx->cxlhdm, cxled);
> + return rc ?: 1;
> +}
> +
> +/**
> + * cxl_commit_memdev_decoders - Commit restored endpoint decoder programming
> + * @cxlmd: CXL memdev whose endpoint decoders need to be committed
> + *
> + * Resume endpoint decoding after cxl_restore_memdev_decoders() has restored
> + * programmable decoder fields. For endpoints with HDM decoder registers, enable
> + * the HDM Decoder Capability and commit enabled, unlocked endpoint decoders.
> + * Locked decoders are left to their current hardware/firmware-owned state.
> + *
> + * This helper does not set PCI_DVSEC_CXL_MEM_ENABLE. Callers must enable
> + * CXL.mem only after all reset safety checks and decoder restore/commit steps
> + * have completed.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_hdm *cxlhdm;
> + void __iomem *hdm;
> + u32 global_ctrl;
> + int i, rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (!endpoint)
> + return -ENODEV;
> +
> + cxlhdm = dev_get_drvdata(&endpoint->dev);
> + if (!cxlhdm)
> + return -ENODEV;
> +
> + hdm = cxlhdm->regs.hdm_decoder;
> + if (!hdm)
> + return 0;
> +
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl | CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> +
> + for (i = 0; i < cxlhdm->decoder_count; i++) {
> + struct cxl_commit_decoder_ctx ctx = {
> + .cxlhdm = cxlhdm,
> + .id = i,
> + };
> +
> + /*
> + * Per CXL Spec 3.1 8.2.4.20.12 software must commit decoders
> + * in HPA order. Region setup already enforces that ordering by
> + * decoder id, so restore commits follow ascending id order.
> + */
> + rc = device_for_each_child(&endpoint->dev, &ctx,
> + cxl_commit_restored_decoder_by_id);
> + if (rc < 0)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static int cxl_setup_hdm_decoder_from_dvsec(
> struct cxl_port *port, struct cxl_decoder *cxld, u64 *dpa_base,
> int which, struct cxl_endpoint_dvsec_info *info)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 1297594beaec..b51b1e9d6400 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -794,6 +794,8 @@ int cxl_port_setup_regs(struct cxl_port *port,
> struct cxl_dev_state;
> int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
> struct cxl_endpoint_dvsec_info *info);
> +int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
> +int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
>
> bool is_cxl_region(struct device *dev);
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread
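The DVSEC range restore in cxl_restore_dvsec_range() quoted above uses a read-modify-write on the low dword so that only the programmable address bits change while status and reserved bits are preserved. A small userspace sketch of that merge step follows. The mask value is my assumption for PCI_DVSEC_CXL_MEM_BASE_LOW (bits 31:28) and should be checked against pci_regs.h; the model_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed equivalent of PCI_DVSEC_CXL_MEM_BASE_LOW: bits 31:28 (verify). */
#define MODEL_MEM_BASE_LOW_MASK 0xF0000000u

/* Userspace equivalent of lower_32_bits(). */
static uint32_t model_lower_32_bits(uint64_t v)
{
	return (uint32_t)v;
}

/*
 * Merge the programmable bits of @base into the current register value
 * @reg, leaving every bit outside the mask exactly as it was read.
 */
static uint32_t model_merge_base_low(uint32_t reg, uint64_t base)
{
	reg &= ~MODEL_MEM_BASE_LOW_MASK;
	reg |= model_lower_32_bits(base) & MODEL_MEM_BASE_LOW_MASK;
	return reg;
}
```

The same pattern applies to the size-low register in the patch, where the comment notes that MEM_INFO_VALID / MEM_ACTIVE and reserved bits must survive the write.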
* Re: [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset
2026-05-28 8:31 ` [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset Srirangan Madhavan
@ 2026-05-28 11:15 ` Richard Cheng
2026-06-02 22:10 ` Dave Jiang
2026-06-04 3:13 ` Dan Williams (nvidia)
2 siblings, 0 replies; 32+ messages in thread
From: Richard Cheng @ 2026-05-28 11:15 UTC (permalink / raw)
To: Srirangan Madhavan
Cc: linux-cxl, linux-pci, linux-kernel, vsethi, alwilliamson,
Dan Williams, Sai Yashwanth Reddy Kancherla, Vishal Aslot,
Manish Honap, Jiandi An, linux-tegra
On Thu, May 28, 2026 at 08:31:49AM +0800, Srirangan Madhavan wrote:
> Add helpers to collect CXL sibling PCI functions affected by a CXL reset
> and prepare them for reset by saving and disabling them. Restore those
> siblings and drop their references when reset coordination completes.
>
> Use the Non-CXL Function Map DVSEC to exclude non-CXL functions, and
> filter remaining siblings to functions that advertise CXL.cache or
> CXL.mem capability.
>
> Use pci_dev_trylock() for sibling locking and unwind on contention or
> allocation failure, so competing reset paths fail with an errno.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 207 ++++++++++++++++++++++++++++++++++
> include/uapi/linux/pci_regs.h | 2 +
> 2 files changed, 209 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 318744695f62..01effbb4e7cd 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1,9 +1,11 @@
> // SPDX-License-Identifier: GPL-2.0-only
> /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
> #include <linux/units.h>
> +#include <linux/bitmap.h>
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/device.h>
> #include <linux/delay.h>
> +#include <linux/iommu.h>
> #include <linux/memregion.h>
> #include <linux/pci.h>
> #include <linux/pci-doe.h>
> @@ -15,6 +17,10 @@
> #include "core.h"
> #include "trace.h"
>
> +#define CXL_RESET_MAX_FUNCTIONS 256
> +#define CXL_RESET_FUNCTION_MAP_REGS (CXL_RESET_MAX_FUNCTIONS / 32)
> +#define CXL_RESET_SIBLINGS_INIT 8
> +
> /**
> * DOC: cxl core pci
> *
> @@ -1096,3 +1102,204 @@ cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
>
> return 0;
> }
> +
> +struct cxl_reset_context {
> + struct pci_dev *target;
> + struct pci_dev **siblings;
> + int nr_siblings;
> + int sibling_capacity;
> + int nr_siblings_prepared;
> +};
> +
> +struct cxl_reset_walk_ctx {
> + struct cxl_reset_context *ctx;
> + unsigned long *non_cxl_func_map;
> + int rc;
> +};
> +
> +static void
> +cxl_reset_read_non_cxl_func_map(struct pci_dev *pdev,
> + unsigned long *non_cxl_func_map)
> +{
> + u32 map[CXL_RESET_FUNCTION_MAP_REGS] = {};
> + u16 dvsec;
> + int rc, i;
> +
> + bitmap_zero(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_FUNCTION_MAP);
> + if (!dvsec)
> + return;
> +
> + for (i = 0; i < CXL_RESET_FUNCTION_MAP_REGS; i++) {
> + rc = pci_read_config_dword(pdev,
> + dvsec + PCI_DVSEC_CXL_FUNCTION_MAP_REG +
> + i * sizeof(map[i]), &map[i]);
> + if (rc) {
> + pci_warn(pdev,
> + "failed to read CXL Function Map; treating all siblings as CXL: %d\n",
> + rc);
> + bitmap_zero(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
> + return;
> + }
> + }
> +
> + bitmap_from_arr32(non_cxl_func_map, map, CXL_RESET_MAX_FUNCTIONS);
> +}
> +
> +static bool cxl_reset_is_cxl_sibling(struct pci_dev *pdev,
> + struct pci_dev *sibling,
> + unsigned long *non_cxl_func_map)
> +{
> + if (sibling == pdev || sibling->bus != pdev->bus)
> + return false;
> +
> + if (pci_ari_enabled(pdev->bus))
> + return !test_bit(sibling->devfn, non_cxl_func_map);
> +
> + if (PCI_SLOT(sibling->devfn) != PCI_SLOT(pdev->devfn))
> + return false;
> +
> + return !test_bit(PCI_FUNC(sibling->devfn) * 32 +
> + PCI_SLOT(sibling->devfn), non_cxl_func_map);
> +}
> +
Ack on sashiko-bot's finding. On top of that, this function has already
checked that the sibling's slot matches the target's slot, so
PCI_SLOT(sibling->devfn) is guaranteed to equal the target's slot. It is a constant here.
According to the spec, the Non-CXL Function Map is one bit per function within the same
multi-function device, so I think the following change would be reasonable:
"""
return !test_bit(PCI_FUNC(sibling->devfn), non_cxl_func_map);
"""
Besides the false-negative case, I think the more common failure would be a false positive,
e.g. F >= 1 reads bits 32, 64, ... in the reserved portion of the 256-bit map, which are almost
always clear, so non-CXL siblings get pulled into the CXL reset path.
Best regards,
Richard Cheng.
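To make the difference between the two indexings concrete, here is a small userspace sketch that computes the bit tested by the posted non-ARI code and the bit tested by the change suggested above. It does not decide which layout the spec mandates; that has to be checked against the Non-CXL Function Map DVSEC definition. The MODEL_* macros re-create the usual devfn encoding and the model_* function names are hypothetical.

```c
#include <assert.h>

/* Userspace copies of the conventional devfn encoding: slot[7:3], func[2:0]. */
#define MODEL_PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define MODEL_PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define MODEL_PCI_FUNC(devfn)       ((devfn) & 0x07)

/* Bit index tested by the posted patch in the non-ARI case. */
static unsigned int model_bit_posted(unsigned int devfn)
{
	assert(devfn <= 0xff);
	return MODEL_PCI_FUNC(devfn) * 32 + MODEL_PCI_SLOT(devfn);
}

/* Bit index tested by the suggested change: one bit per function. */
static unsigned int model_bit_suggested(unsigned int devfn)
{
	assert(devfn <= 0xff);
	return MODEL_PCI_FUNC(devfn);
}
```

For slot 0 function 1 the posted code tests bit 32 while the suggestion tests bit 1; the two only agree for function 0 in slot 0.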
> +static bool cxl_reset_has_cache_or_mem(struct pci_dev *pdev)
> +{
> + u16 dvsec, cap;
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + return false;
> +
> + if (pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap))
> + return false;
> +
> + return cap & (PCI_DVSEC_CXL_CACHE_CAPABLE | PCI_DVSEC_CXL_MEM_CAPABLE);
> +}
> +
> +static int cxl_reset_add_sibling(struct cxl_reset_context *ctx,
> + struct pci_dev *sibling)
> +{
> + struct pci_dev **siblings;
> + int capacity;
> +
> + if (ctx->nr_siblings < ctx->sibling_capacity)
> + goto add;
> +
> + capacity = ctx->sibling_capacity ? ctx->sibling_capacity * 2 :
> + CXL_RESET_SIBLINGS_INIT;
> + siblings = krealloc(ctx->siblings, capacity * sizeof(*siblings),
> + GFP_KERNEL);
> + if (!siblings)
> + return -ENOMEM;
> +
> + ctx->siblings = siblings;
> + ctx->sibling_capacity = capacity;
> +
> +add:
> + pci_dev_get(sibling);
> + ctx->siblings[ctx->nr_siblings++] = sibling;
> + return 0;
> +}
> +
> +static int cxl_reset_collect_sibling(struct pci_dev *sibling, void *data)
> +{
> + struct cxl_reset_walk_ctx *wctx = data;
> + struct cxl_reset_context *ctx = wctx->ctx;
> + struct pci_dev *pdev = ctx->target;
> +
> + if (!cxl_reset_is_cxl_sibling(pdev, sibling, wctx->non_cxl_func_map))
> + return 0;
> +
> + if (!cxl_reset_has_cache_or_mem(sibling))
> + return 0;
> +
> + wctx->rc = cxl_reset_add_sibling(ctx, sibling);
> + return wctx->rc;
> +}
> +
> +static int cxl_reset_collect_siblings(struct cxl_reset_context *ctx)
> +{
> + DECLARE_BITMAP(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
> + struct cxl_reset_walk_ctx wctx = {
> + .ctx = ctx,
> + .non_cxl_func_map = non_cxl_func_map,
> + };
> +
> + cxl_reset_read_non_cxl_func_map(ctx->target, non_cxl_func_map);
> + pci_walk_bus(ctx->target->bus, cxl_reset_collect_sibling, &wctx);
> + return wctx.rc;
> +}
> +
> +static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
> +{
> + int i;
> +
> + for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--) {
> + struct pci_dev *sibling = ctx->siblings[i];
> +
> + pci_dev_reset_iommu_done(sibling);
> + pci_dev_restore(sibling);
> + pci_dev_unlock(sibling);
> + }
> +
> + for (i = 0; i < ctx->nr_siblings; i++)
> + pci_dev_put(ctx->siblings[i]);
> +
> + kfree(ctx->siblings);
> + ctx->siblings = NULL;
> + ctx->nr_siblings = 0;
> + ctx->sibling_capacity = 0;
> + ctx->nr_siblings_prepared = 0;
> +}
> +
> +static int __maybe_unused
> +cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> +{
> + int rc, i;
> +
> + ctx->siblings = NULL;
> + ctx->nr_siblings = 0;
> + ctx->sibling_capacity = 0;
> + ctx->nr_siblings_prepared = 0;
> +
> + rc = cxl_reset_collect_siblings(ctx);
> + if (rc)
> + goto err;
> +
> + for (i = 0; i < ctx->nr_siblings; i++) {
> + struct pci_dev *sibling = ctx->siblings[i];
> +
> + if (!pci_dev_trylock(sibling)) {
> + rc = -EAGAIN;
> + goto err;
> + }
> +
> + pci_dev_save_and_disable(sibling);
> + rc = pci_dev_reset_iommu_prepare(sibling);
> + if (rc) {
> + pci_err(sibling,
> + "failed to block IOMMU for CXL reset: %d\n",
> + rc);
> + /*
> + * Undo save_and_disable() for this sibling. IOMMU
> + * prepare failed, so this sibling is not counted in
> + * nr_siblings_prepared and must not get iommu_done().
> + */
> + pci_dev_restore(sibling);
> + pci_dev_unlock(sibling);
> + goto err;
> + }
> +
> + ctx->nr_siblings_prepared++;
> + }
> +
> + return 0;
> +
> +err:
> + cxl_pci_functions_reset_done(ctx);
> + return rc;
> +}
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 14f634ab9350..fa1fcd26af01 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1349,6 +1349,7 @@
> /* CXL r4.0, 8.1.3: PCIe DVSEC for CXL Device */
> #define PCI_DVSEC_CXL_DEVICE 0
> #define PCI_DVSEC_CXL_CAP 0xA
> +#define PCI_DVSEC_CXL_CACHE_CAPABLE _BITUL(0)
> #define PCI_DVSEC_CXL_MEM_CAPABLE _BITUL(2)
> #define PCI_DVSEC_CXL_HDM_COUNT __GENMASK(5, 4)
> #define PCI_DVSEC_CXL_CTRL 0xC
> @@ -1366,6 +1367,7 @@
>
> /* CXL r4.0, 8.1.4: Non-CXL Function Map DVSEC */
> #define PCI_DVSEC_CXL_FUNCTION_MAP 2
> +#define PCI_DVSEC_CXL_FUNCTION_MAP_REG 0x0C
>
> /* CXL r4.0, 8.1.5: Extensions DVSEC for Ports */
> #define PCI_DVSEC_CXL_PORT 3
> --
> 2.43.0
>
* Re: [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders
2026-05-28 11:06 ` Richard Cheng
@ 2026-06-02 18:12 ` Dave Jiang
0 siblings, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-02 18:12 UTC (permalink / raw)
To: Richard Cheng, Srirangan Madhavan
Cc: linux-cxl, linux-pci, linux-kernel, vsethi, alwilliamson,
Dan Williams, Sai Yashwanth Reddy Kancherla, Vishal Aslot,
Manish Honap, Jiandi An, linux-tegra
On 5/28/26 4:06 AM, Richard Cheng wrote:
> On Thu, May 28, 2026 at 08:31:46AM +0800, Srirangan Madhavan wrote:
<-- snip -->
>> +static int cxl_restore_dvsec_range(struct cxl_memdev *cxlmd,
>> + struct cxl_endpoint_decoder *cxled)
>> +{
>> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> + struct cxl_decoder *cxld = &cxled->cxld;
>> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>> + u64 base = cxld->hpa_range.start;
>> + u64 size = range_len(&cxld->hpa_range);
>> + u32 lo;
>> + int dvsec = cxlds->cxl_dvsec;
>> + int id = cxld->id;
>> + int rc;
>> +
>> + if (!dvsec)
>> + return 0;
>> +
>> + if (id >= CXL_DVSEC_RANGE_MAX)
>> + return 0;
>> +
>> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_HIGH(id),
>> + upper_32_bits(base));
>> + if (rc)
>> + return rc;
>> +
>> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
>> + &lo);
>> + if (rc)
>> + return rc;
>
> Here pci_read/write* returns positive values on failure, and you pass the value up.
> Eventually surfacing through cxl_reset_store to userspace where sysfs thinks positive
> values as "bytes written".
>
> I think this might need a fix ?
Great catch! I had never thought about this before WRT pci_read/write*. I think the return values need to go through pcibios_err_to_errno() to become a negative errno. The sysfs exposure definitely needs to be audited.
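To make the failure mode concrete, here is a minimal userspace sketch of the conversion. The PCIBIOS_* values and the errno mapping are written from memory of include/linux/pci.h and should be treated as assumptions to check against the tree. model_pcibios_err_to_errno() and model_store() are hypothetical names for illustration, not kernel symbols.

```c
#include <errno.h>

/* Assumed PCIBIOS_* codes; verify against include/linux/pci.h. */
#define PCIBIOS_SUCCESSFUL		0x00
#define PCIBIOS_FUNC_NOT_SUPPORTED	0x81
#define PCIBIOS_BAD_VENDOR_ID		0x83
#define PCIBIOS_DEVICE_NOT_FOUND	0x86
#define PCIBIOS_BAD_REGISTER_NUMBER	0x87
#define PCIBIOS_SET_FAILED		0x88
#define PCIBIOS_BUFFER_TOO_SMALL	0x89

/* Hypothetical userspace model of the kernel's pcibios_err_to_errno(). */
static int model_pcibios_err_to_errno(int err)
{
	if (err <= PCIBIOS_SUCCESSFUL)
		return err;	/* already 0 or a negative errno */

	switch (err) {
	case PCIBIOS_FUNC_NOT_SUPPORTED:
		return -ENOENT;
	case PCIBIOS_BAD_VENDOR_ID:
		return -ENOTTY;
	case PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case PCIBIOS_BAD_REGISTER_NUMBER:
		return -EFAULT;
	case PCIBIOS_SET_FAILED:
		return -EIO;
	case PCIBIOS_BUFFER_TOO_SMALL:
		return -ENOSPC;
	}

	return -ERANGE;
}

/*
 * Hypothetical sysfs-style store(): the caller reads a positive return as
 * "bytes consumed", so an unconverted PCIBIOS_* code looks like success.
 */
static long model_store(int cfg_rc, long count, int convert)
{
	if (convert)
		cfg_rc = model_pcibios_err_to_errno(cfg_rc);
	if (cfg_rc)
		return cfg_rc;

	return count;
}
```

Without the conversion, a failed config write of 0x88 would be reported to userspace as 136 bytes written. With it, the same failure surfaces as -EIO.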
* Re: [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders
2026-05-28 8:31 ` [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders Srirangan Madhavan
2026-05-28 11:06 ` Richard Cheng
@ 2026-06-02 18:31 ` Dave Jiang
2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-03 22:35 ` Dan Williams (nvidia)
3 siblings, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-02 18:31 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/26 1:31 AM, Srirangan Madhavan wrote:
> Add helpers to restore endpoint decoder programming for a CXL memdev from
> CXL core's cached decoder objects, then commit it as a distinct step.
> Callers are expected to have established reset safety and to hold
> cxl_rwsem.region for write.
>
> cxl_restore_memdev_decoders() restores programmable decoder state while
> keeping traffic disabled. For HDM-backed endpoints it programs enabled
> endpoint decoder fields without COMMIT, keeps the HDM Decoder Capability
> disabled, and mirrors matching endpoint DVSEC ranges where possible. For
> endpoints without HDM decoder registers, it restores the legacy DVSEC
> ranges that model endpoint decode.
>
> cxl_commit_memdev_decoders() enables the HDM Decoder Capability and
> commits enabled, unlocked endpoint decoders after safety checks pass. It
> sets COMMIT only after decoder fields have been restored, does not
> re-lock decoders, and does not set DVSEC MEM_ENABLE.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
A few minor things below besides what Richard raised...
> ---
> drivers/cxl/core/hdm.c | 318 ++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 2 +
> 2 files changed, 317 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0c80b76a5f9b..f7af1041a9fc 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -679,7 +679,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
> return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
> }
>
> -static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +static int cxld_set_interleave_fields(struct cxl_decoder *cxld, u32 *ctrl)
> {
> u16 eig;
> u8 eiw;
> @@ -690,14 +690,22 @@ static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> */
> if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
> "invalid interleave_ways: %d\n", cxld->interleave_ways))
> - return;
> + return -EINVAL;
> if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
> "invalid interleave_granularity: %d\n",
> cxld->interleave_granularity))
> - return;
> + return -EINVAL;
>
> u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
> u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
> + return 0;
> +}
> +
> +static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +{
> + if (cxld_set_interleave_fields(cxld, ctrl))
> + return;
> +
> *ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> }
>
> @@ -927,6 +935,310 @@ static void cxl_decoder_reset(struct cxl_decoder *cxld)
> }
> }
>
> +static int cxl_restore_dvsec_range(struct cxl_memdev *cxlmd,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + struct cxl_decoder *cxld = &cxled->cxld;
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + u64 base = cxld->hpa_range.start;
> + u64 size = range_len(&cxld->hpa_range);
> + u32 lo;
> + int dvsec = cxlds->cxl_dvsec;
> + int id = cxld->id;
> + int rc;
Just a nit: please arrange the declarations in reverse xmas tree order where possible throughout the code.
> +
> + if (!dvsec)
> + return 0;
> +
> + if (id >= CXL_DVSEC_RANGE_MAX)
> + return 0;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_HIGH(id),
> + upper_32_bits(base));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
Can use a blank line here
> + lo &= ~PCI_DVSEC_CXL_MEM_BASE_LOW;
> + lo |= lower_32_bits(base) & PCI_DVSEC_CXL_MEM_BASE_LOW;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + lo);
> + if (rc)
> + return rc;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_HIGH(id),
> + upper_32_bits(size));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
> +
> + /*
> + * Preserve MEM_INFO_VALID / MEM_ACTIVE and any reserved bits while
> + * restoring only the programmable size bits.
> + */
> + lo &= ~PCI_DVSEC_CXL_MEM_SIZE_LOW;
> + lo |= lower_32_bits(size) & PCI_DVSEC_CXL_MEM_SIZE_LOW;
> +
> + return pci_write_config_dword(pdev,
> + dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + lo);
> +}
> +
> +static int cxl_restore_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm;
> + u64 base, size, skip;
> + u32 ctrl;
> + int id;
> +
> + id = cxld->id;
> + hdm = cxlhdm->regs.hdm_decoder;
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + base = cxld->hpa_range.start;
> + size = range_len(&cxld->hpa_range);
> + skip = cxled->skip;
> +
> + ctrl &= ~(CXL_HDM_DECODER0_CTRL_LOCK |
> + CXL_HDM_DECODER0_CTRL_COMMIT |
> + CXL_HDM_DECODER0_CTRL_COMMITTED |
> + CXL_HDM_DECODER0_CTRL_COMMIT_ERROR);
> + if (cxld_set_interleave_fields(cxld, &ctrl))
> + return -EINVAL;
> + cxld_set_type(cxld, &ctrl);
> +
> + /* Preserve setup_hw_decoder() programming order, without COMMIT. */
> + writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
> + writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
> + writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
> + writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
> + writel(upper_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_HIGH(id));
> + writel(lower_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_LOW(id));
> + wmb();
The wmb() is unnecessary. See Documentation/driver-api/device-io.rst. readX()/writeX() accesses to the same device are ordered.
DJ
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return 0;
> +}
> +
> +struct cxl_restore_ctx {
> + struct cxl_memdev *cxlmd;
> + struct cxl_hdm *cxlhdm;
> +};
> +
> +static int cxl_restore_decoder(struct device *dev, void *data)
> +{
> + struct cxl_restore_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_decoder *cxld;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + cxld = &cxled->cxld;
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (ctx->cxlhdm->regs.hdm_decoder) {
> + if (cxld->id >= ctx->cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + rc = cxl_restore_hdm_decoder(ctx->cxlhdm, cxled);
> + if (rc)
> + return rc;
> + }
> +
> + return cxl_restore_dvsec_range(ctx->cxlmd, cxled);
> +}
> +
> +static int cxl_restore_decoders(struct cxl_memdev *cxlmd, struct cxl_hdm *cxlhdm)
> +{
> + struct cxl_port *port = cxlhdm->port;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + struct cxl_restore_ctx ctx = {
> + .cxlmd = cxlmd,
> + .cxlhdm = cxlhdm,
> + };
> + u32 global_ctrl;
> +
> + if (hdm) {
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl & ~CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + }
> +
> + return device_for_each_child(&port->dev, &ctx, cxl_restore_decoder);
> +}
> +
> +/**
> + * cxl_restore_memdev_decoders - Restore endpoint decoder programming
> + * @cxlmd: CXL memdev whose endpoint decoders need to be restored
> + *
> + * Restore only programmable decoder state from CXL core's cached decoder
> + * objects. For endpoints with HDM decoder registers, program the HDM decoder
> + * fields and mirror decoder ids representable by CXL_DVSEC_RANGE_MAX into the
> + * DVSEC range registers when present. For endpoints without HDM decoder
> + * registers, restore DVSEC range registers only.
> + *
> + * This helper leaves CXL.mem disabled: it does not commit HDM decoders, enable
> + * the HDM Decoder Capability, set PCI_DVSEC_CXL_MEM_ENABLE, or restore
> + * unrelated DVSEC CTRL, CTRL2, LOCK, MEM_ENABLE, or other control state.
> + * Callers must perform final commit/resume steps only after reset safety checks
> + * pass.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_hdm *cxlhdm;
> + int rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (!endpoint)
> + return -ENODEV;
> +
> + cxlhdm = dev_get_drvdata(&endpoint->dev);
> + if (!cxlhdm)
> + return -ENODEV;
> +
> + scoped_guard(rwsem_read, &cxl_rwsem.dpa)
> + rc = cxl_restore_decoders(cxlmd, cxlhdm);
> + return rc;
> +}
> +
> +static int cxl_commit_restored_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + u32 ctrl;
> + int id;
> +
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (!hdm)
> + return 0;
> +
> + id = cxld->id;
> + if (id >= cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + /*
> + * cxl_restore_hdm_decoder() programmed the decoder fields first. This
> + * control register write sets COMMIT as the final programming step.
> + */
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
> + return 0;
> +
> + ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return cxld_await_commit(hdm, id);
> +}
> +
> +struct cxl_commit_decoder_ctx {
> + struct cxl_hdm *cxlhdm;
> + int id;
> +};
> +
> +static int cxl_commit_restored_decoder_by_id(struct device *dev, void *data)
> +{
> + struct cxl_commit_decoder_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + if (cxled->cxld.id != ctx->id)
> + return 0;
> +
> + rc = cxl_commit_restored_hdm_decoder(ctx->cxlhdm, cxled);
> + return rc ?: 1;
> +}
> +
> +/**
> + * cxl_commit_memdev_decoders - Commit restored endpoint decoder programming
> + * @cxlmd: CXL memdev whose endpoint decoders need to be committed
> + *
> + * Resume endpoint decoding after cxl_restore_memdev_decoders() has restored
> + * programmable decoder fields. For endpoints with HDM decoder registers, enable
> + * the HDM Decoder Capability and commit enabled, unlocked endpoint decoders.
> + * Locked decoders are left to their current hardware/firmware-owned state.
> + *
> + * This helper does not set PCI_DVSEC_CXL_MEM_ENABLE. Callers must enable
> + * CXL.mem only after all reset safety checks and decoder restore/commit steps
> + * have completed.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_hdm *cxlhdm;
> + void __iomem *hdm;
> + u32 global_ctrl;
> + int i, rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (!endpoint)
> + return -ENODEV;
> +
> + cxlhdm = dev_get_drvdata(&endpoint->dev);
> + if (!cxlhdm)
> + return -ENODEV;
> +
> + hdm = cxlhdm->regs.hdm_decoder;
> + if (!hdm)
> + return 0;
> +
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl | CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> +
> + for (i = 0; i < cxlhdm->decoder_count; i++) {
> + struct cxl_commit_decoder_ctx ctx = {
> + .cxlhdm = cxlhdm,
> + .id = i,
> + };
> +
> + /*
> + * Per CXL Spec 3.1 8.2.4.20.12 software must commit decoders
> + * in HPA order. Region setup already enforces that ordering by
> + * decoder id, so restore commits follow ascending id order.
> + */
> + rc = device_for_each_child(&endpoint->dev, &ctx,
> + cxl_commit_restored_decoder_by_id);
> + if (rc < 0)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static int cxl_setup_hdm_decoder_from_dvsec(
> struct cxl_port *port, struct cxl_decoder *cxld, u64 *dpa_base,
> int which, struct cxl_endpoint_dvsec_info *info)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 1297594beaec..b51b1e9d6400 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -794,6 +794,8 @@ int cxl_port_setup_regs(struct cxl_port *port,
> struct cxl_dev_state;
> int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
> struct cxl_endpoint_dvsec_info *info);
> +int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
> +int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
>
> bool is_cxl_region(struct device *dev);
>
* Re: [PATCH v6 2/9] PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
2026-05-28 8:31 ` [PATCH v6 2/9] PCI: Export pci_dev_save_and_disable() and pci_dev_restore() Srirangan Madhavan
@ 2026-06-02 20:18 ` Dave Jiang
2026-06-03 22:36 ` Dan Williams (nvidia)
1 sibling, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-02 20:18 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/26 1:31 AM, Srirangan Madhavan wrote:
> Export pci_dev_save_and_disable() and pci_dev_restore() so CXL reset
> orchestration can reuse the PCI core reset lifecycle for non-standard
> reset flows.
>
> These helpers invoke driver reset_prepare/reset_done callbacks, save and
> restore PCI config state, and disable the device while the caller holds
> the device lock.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/pci/pci.c | 22 ++++++++++++++++++++--
> include/linux/pci.h | 2 ++
> 2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d34266651ad0..75d2f4074750 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5003,7 +5003,15 @@ void pci_dev_unlock(struct pci_dev *dev)
> }
> EXPORT_SYMBOL_GPL(pci_dev_unlock);
>
> -static void pci_dev_save_and_disable(struct pci_dev *dev)
> +/**
> + * pci_dev_save_and_disable - Save device state and disable it
> + * @dev: PCI device to save and disable
> + *
> + * Save the PCI configuration state, invoke the driver's reset_prepare()
> + * callback if present, and disable the device by clearing the Command
> + * register. The device lock must be held by the caller.
> + */
> +void pci_dev_save_and_disable(struct pci_dev *dev)
> {
> const struct pci_error_handlers *err_handler =
> dev->driver ? dev->driver->err_handler : NULL;
> @@ -5036,8 +5044,17 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> */
> pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
> }
> +EXPORT_SYMBOL_GPL(pci_dev_save_and_disable);
Maybe only export it to the CXL namespace to reduce the scope?
>
> -static void pci_dev_restore(struct pci_dev *dev)
> +/**
> + * pci_dev_restore - Restore device state after reset
> + * @dev: PCI device to restore
> + *
> + * Restore the saved PCI configuration state and invoke the driver's
> + * reset_done() callback if present. The device lock must be held by the
> + * caller.
> + */
> +void pci_dev_restore(struct pci_dev *dev)
> {
> const struct pci_error_handlers *err_handler =
> dev->driver ? dev->driver->err_handler : NULL;
> @@ -5054,6 +5071,7 @@ static void pci_dev_restore(struct pci_dev *dev)
> else if (dev->driver)
> pci_warn(dev, "reset done");
> }
> +EXPORT_SYMBOL_GPL(pci_dev_restore);
same comment as above
>
> /* dev->reset_methods[] is a 0-terminated list of indices into this array */
> const struct pci_reset_fn_method pci_reset_fn_methods[] = {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 2c4454583c11..d6303e16e11b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2012,6 +2012,8 @@ void pci_dev_lock(struct pci_dev *dev);
> int pci_dev_trylock(struct pci_dev *dev);
> void pci_dev_unlock(struct pci_dev *dev);
> DEFINE_GUARD(pci_dev, struct pci_dev *, pci_dev_lock(_T), pci_dev_unlock(_T))
> +void pci_dev_save_and_disable(struct pci_dev *dev);
> +void pci_dev_restore(struct pci_dev *dev);
>
> /*
> * PCI domain support. Sometimes called PCI segment (eg by ACPI),
* Re: [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders
2026-05-28 8:31 ` [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders Srirangan Madhavan
2026-05-28 11:06 ` Richard Cheng
2026-06-02 18:31 ` Dave Jiang
@ 2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-03 22:35 ` Dan Williams (nvidia)
3 siblings, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 20:34 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Add helpers to restore endpoint decoder programming for a CXL memdev from
> CXL core's cached decoder objects, then commit it as a distinct step.
> Callers are expected to have established reset safety and to hold
> cxl_rwsem.region for write.
>
> cxl_restore_memdev_decoders() restores programmable decoder state while
> keeping traffic disabled. For HDM-backed endpoints it programs enabled
> endpoint decoder fields without COMMIT, keeps the HDM Decoder Capability
> disabled, and mirrors matching endpoint DVSEC ranges where possible. For
> endpoints without HDM decoder registers, it restores the legacy DVSEC
> ranges that model endpoint decode.
>
> cxl_commit_memdev_decoders() enables the HDM Decoder Capability and
> commits enabled, unlocked endpoint decoders after safety checks pass. It
> sets COMMIT only after decoder fields have been restored, does not
> re-lock decoders, and does not set DVSEC MEM_ENABLE.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/hdm.c | 318 ++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 2 +
> 2 files changed, 317 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0c80b76a5f9b..f7af1041a9fc 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -679,7 +679,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
> return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
> }
>
> -static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +static int cxld_set_interleave_fields(struct cxl_decoder *cxld, u32 *ctrl)
> {
> u16 eig;
> u8 eiw;
> @@ -690,14 +690,22 @@ static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> */
> if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
> "invalid interleave_ways: %d\n", cxld->interleave_ways))
> - return;
> + return -EINVAL;
> if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
> "invalid interleave_granularity: %d\n",
> cxld->interleave_granularity))
> - return;
> + return -EINVAL;
>
> u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
> u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
> + return 0;
> +}
> +
> +static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +{
> + if (cxld_set_interleave_fields(cxld, ctrl))
> + return;
> +
> *ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> }
A few issues here:
1. This isn't used in this patch, so it should be moved to the patch it's used in
2. There's no mention that this sets the COMMIT bit. A name update would be best,
but a comment above the function could work as well.
I haven't looked ahead, but if this only gets used once or twice I would just remove this
helper, keep the name the same for cxld_set_interleave_fields(), and manually set the COMMIT
bit when you need to.
>
> @@ -927,6 +935,310 @@ static void cxl_decoder_reset(struct cxl_decoder *cxld)
> }
> }
>
> +static int cxl_restore_dvsec_range(struct cxl_memdev *cxlmd,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + struct cxl_decoder *cxld = &cxled->cxld;
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + u64 base = cxld->hpa_range.start;
> + u64 size = range_len(&cxld->hpa_range);
> + u32 lo;
Nit: this should be after id below
> + int dvsec = cxlds->cxl_dvsec;
> + int id = cxld->id;
> + int rc;
> +
> + if (!dvsec)
> + return 0;
> +
> + if (id >= CXL_DVSEC_RANGE_MAX)
> + return 0;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_HIGH(id),
> + upper_32_bits(base));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
> + lo &= ~PCI_DVSEC_CXL_MEM_BASE_LOW;
> + lo |= lower_32_bits(base) & PCI_DVSEC_CXL_MEM_BASE_LOW;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + lo);
> + if (rc)
> + return rc;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_HIGH(id),
> + upper_32_bits(size));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
> +
> + /*
> + * Preserve MEM_INFO_VALID / MEM_ACTIVE and any reserved bits while
> + * restoring only the programmable size bits.
> + */
I would move this to where you do the masking above. I think it's (effectively) doing the same thing,
so putting it at the first instance makes it easier on the reader.
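Since both masking sites perform the same read-modify-write, one option is a small shared helper, which would also give the comment a single home. The following is only a sketch: model_update_low() and MODEL_MEM_LOW_MASK are hypothetical names, and the mask value (bits 31:28) is an assumption standing in for PCI_DVSEC_CXL_MEM_BASE_LOW / PCI_DVSEC_CXL_MEM_SIZE_LOW.

```c
#include <stdint.h>

/* Assumed programmable field of a RANGE_*_LOW register: bits 31:28. */
#define MODEL_MEM_LOW_MASK	0xF0000000u

/*
 * Hypothetical helper: replace only the programmable field with the low
 * 32 bits of @val and keep every other bit (status and reserved bits)
 * exactly as it was read from the register.
 */
static uint32_t model_update_low(uint32_t cur, uint64_t val)
{
	cur &= ~MODEL_MEM_LOW_MASK;
	cur |= (uint32_t)val & MODEL_MEM_LOW_MASK;
	return cur;
}
```

Both the base and the size path could then read the register, pass the value through the helper, and write the result back.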
> + lo &= ~PCI_DVSEC_CXL_MEM_SIZE_LOW;
> + lo |= lower_32_bits(size) & PCI_DVSEC_CXL_MEM_SIZE_LOW;
> +
> + return pci_write_config_dword(pdev,
> + dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + lo);
> +}
> +
> +static int cxl_restore_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm;
> + u64 base, size, skip;
> + u32 ctrl;
> + int id;
> +
> + id = cxld->id;
> + hdm = cxlhdm->regs.hdm_decoder;
I would go ahead and set these variables at the top.
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + base = cxld->hpa_range.start;
> + size = range_len(&cxld->hpa_range);
> + skip = cxled->skip;
Same with these.
> +
> + ctrl &= ~(CXL_HDM_DECODER0_CTRL_LOCK |
> + CXL_HDM_DECODER0_CTRL_COMMIT |
> + CXL_HDM_DECODER0_CTRL_COMMITTED |
> + CXL_HDM_DECODER0_CTRL_COMMIT_ERROR);
You don't need the CXL_HDM_DECODER0_CTRL_LOCK bit here since you've already
verified it isn't set above.
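A small userspace sketch of why the LOCK bit is redundant in the mask follows. The MODEL_CTRL_* bit positions are assumptions mirroring the CXL_HDM_DECODER0_CTRL_* layout, and model_prepare_ctrl() is a hypothetical name.

```c
#include <stdint.h>

/* Assumed bit positions, modeled on CXL_HDM_DECODER0_CTRL_*. */
#define MODEL_CTRL_LOCK		(1u << 8)
#define MODEL_CTRL_COMMIT	(1u << 9)
#define MODEL_CTRL_COMMITTED	(1u << 10)
#define MODEL_CTRL_COMMIT_ERROR	(1u << 11)

/*
 * Hypothetical model of the control-word preparation: returns 0 and leaves
 * *ctrl untouched for a locked decoder, otherwise clears the commit state
 * and returns 1.
 */
static int model_prepare_ctrl(uint32_t *ctrl)
{
	if (*ctrl & MODEL_CTRL_LOCK)
		return 0;

	/* LOCK is known to be clear here, so it can be left out of the mask. */
	*ctrl &= ~(MODEL_CTRL_COMMIT | MODEL_CTRL_COMMITTED |
		   MODEL_CTRL_COMMIT_ERROR);
	return 1;
}
```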
> + if (cxld_set_interleave_fields(cxld, &ctrl))
> + return -EINVAL;
> + cxld_set_type(cxld, &ctrl);
> +
> + /* Preserve setup_hw_decoder() programming order, without COMMIT. */
> + writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
> + writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
> + writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
> + writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
> + writel(upper_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_HIGH(id));
> + writel(lower_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_LOW(id));
> + wmb();
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return 0;
> +}
> +
> +struct cxl_restore_ctx {
> + struct cxl_memdev *cxlmd;
> + struct cxl_hdm *cxlhdm;
> +};
> +
> +static int cxl_restore_decoder(struct device *dev, void *data)
> +{
> + struct cxl_restore_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_decoder *cxld;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + cxld = &cxled->cxld;
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (ctx->cxlhdm->regs.hdm_decoder) {
> + if (cxld->id >= ctx->cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + rc = cxl_restore_hdm_decoder(ctx->cxlhdm, cxled);
> + if (rc)
> + return rc;
I'm pretty sure HDM decoders and range registers are mutually exclusive,
so this should just return the result of cxl_restore_hdm_decoder() here.
I think you could also fall back to restoring the range registers, but I'm
not sure whether that needs more setup.
> + }
> +
> + return cxl_restore_dvsec_range(ctx->cxlmd, cxled);
> +}
> +
> +static int cxl_restore_decoders(struct cxl_memdev *cxlmd, struct cxl_hdm *cxlhdm)
> +{
> + struct cxl_port *port = cxlhdm->port;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + struct cxl_restore_ctx ctx = {
> + .cxlmd = cxlmd,
> + .cxlhdm = cxlhdm,
> + };
> + u32 global_ctrl;
> +
> + if (hdm) {
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl & ~CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + }
> +
> + return device_for_each_child(&port->dev, &ctx, cxl_restore_decoder);
Now that I'm looking at it, I think it would be better to modify cxl_restore_dvsec_range()
to be usable by device_for_each_child() and rework this function to:
static int cxl_restore_decoders(struct cxl_memdev *cxlmd, struct cxl_hdm *cxlhdm)
{
	struct cxl_port *port = cxlhdm->port;
	void __iomem *hdm = cxlhdm->regs.hdm_decoder;
	struct cxl_restore_ctx ctx = {
		.cxlmd = cxlmd,
		.cxlhdm = cxlhdm,
	};
	u32 global_ctrl;
	if (hdm) {
		global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
		writel(global_ctrl & ~CXL_HDM_DECODER_ENABLE,
		       hdm + CXL_HDM_DECODER_CTRL_OFFSET);
		return device_for_each_child(&port->dev, &ctx, cxl_restore_decoder);
	}
	return device_for_each_child(&port->dev, &ctx, cxl_restore_dvsec_range);
}
That would allow you to remove the hdm register NULL checks and keep the same behavior.
Only drawback is that cxl_restore_dvsec_range() wouldn't take typed parameters, but you
could use a wrapper function for that if needed.
> +}
> +
> +/**
> + * cxl_restore_memdev_decoders - Restore endpoint decoder programming
> + * @cxlmd: CXL memdev whose endpoint decoders need to be restored
> + *
> + * Restore only programmable decoder state from CXL core's cached decoder
> + * objects. For endpoints with HDM decoder registers, program the HDM decoder
> + * fields and mirror decoder ids representable by CXL_DVSEC_RANGE_MAX into the
> + * DVSEC range registers when present. For endpoints without HDM decoder
> + * registers, restore DVSEC range registers only.
> + *
> + * This helper leaves CXL.mem disabled: it does not commit HDM decoders, enable
> + * the HDM Decoder Capability, set PCI_DVSEC_CXL_MEM_ENABLE, or restore
> + * unrelated DVSEC CTRL, CTRL2, LOCK, MEM_ENABLE, or other control state.
> + * Callers must perform final commit/resume steps only after reset safety checks
> + * pass.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_hdm *cxlhdm;
> + int rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (!endpoint)
> + return -ENODEV;
> +
> + cxlhdm = dev_get_drvdata(&endpoint->dev);
> + if (!cxlhdm)
> + return -ENODEV;
> +
> + scoped_guard(rwsem_read, &cxl_rwsem.dpa)
> + rc = cxl_restore_decoders(cxlmd, cxlhdm);
I don't think you need the scoped guard here, just use a regular one? i.e.:
	guard(rwsem_read)(&cxl_rwsem.dpa);
	return cxl_restore_decoders(cxlmd, cxlhdm);
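For readers less familiar with scope-based guards, the behaviour relied on here can be sketched in userspace with the compiler's cleanup attribute (GCC and Clang only). This is a simplified model, not the kernel's cleanup.h implementation, and all model_* and MODEL_* names are hypothetical.

```c
/* Hypothetical stand-in for a read lock; it only counts holders. */
struct model_rwsem {
	int readers;
};

static struct model_rwsem *model_lock_read(struct model_rwsem *sem)
{
	sem->readers++;
	return sem;
}

/* Cleanup handler: receives a pointer to the guard variable at scope exit. */
static void model_guard_exit(struct model_rwsem **semp)
{
	(*semp)->readers--;
}

/* Simplified guard: take the lock now, drop it when the scope ends. */
#define MODEL_GUARD_READ(sem)						\
	struct model_rwsem *model_guard_				\
		__attribute__((cleanup(model_guard_exit), unused)) =	\
		model_lock_read(sem)

/* Models "guard + direct return": no local rc is needed. */
static int model_restore(struct model_rwsem *sem, int *held_during_call)
{
	MODEL_GUARD_READ(sem);

	/* The lock is held here; it is released as the function returns. */
	*held_during_call = sem->readers;
	return 0;
}
```

The return expression is evaluated while the lock is still held, and the cleanup runs afterwards, which is why the intermediate rc variable becomes unnecessary.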
> + return rc;
> +}
> +
> +static int cxl_commit_restored_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + u32 ctrl;
> + int id;
> +
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (!hdm)
> + return 0;
The two checks above can be combined.
> +
> + id = cxld->id;
> + if (id >= cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + /*
> + * cxl_restore_hdm_decoder() programmed the decoder fields first. This
> + * control register write sets COMMIT as the final programming step.
> + */
The second sentence of this comment isn't needed; the code is pretty self-explanatory.
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
> + return 0;
> +
> + ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return cxld_await_commit(hdm, id);
> +}
> +
> +struct cxl_commit_decoder_ctx {
> + struct cxl_hdm *cxlhdm;
> + int id;
> +};
> +
> +static int cxl_commit_restored_decoder_by_id(struct device *dev, void *data)
> +{
> + struct cxl_commit_decoder_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + if (cxled->cxld.id != ctx->id)
> + return 0;
> +
> + rc = cxl_commit_restored_hdm_decoder(ctx->cxlhdm, cxled);
> + return rc ?: 1;
> +}
> +
> +/**
> + * cxl_commit_memdev_decoders - Commit restored endpoint decoder programming
> + * @cxlmd: CXL memdev whose endpoint decoders need to be committed
> + *
> + * Resume endpoint decoding after cxl_restore_memdev_decoders() has restored
> + * programmable decoder fields. For endpoints with HDM decoder registers, enable
> + * the HDM Decoder Capability and commit enabled, unlocked endpoint decoders.
> + * Locked decoders are left to their current hardware/firmware-owned state.
> + *
> + * This helper does not set PCI_DVSEC_CXL_MEM_ENABLE. Callers must enable
> + * CXL.mem only after all reset safety checks and decoder restore/commit steps
> + * have completed.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_hdm *cxlhdm;
> + void __iomem *hdm;
> + u32 global_ctrl;
> + int i, rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (!endpoint)
> + return -ENODEV;
> +
> + cxlhdm = dev_get_drvdata(&endpoint->dev);
> + if (!cxlhdm)
> + return -ENODEV;
> +
> + hdm = cxlhdm->regs.hdm_decoder;
> + if (!hdm)
> + return 0;
> +
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl | CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> +
> + for (i = 0; i < cxlhdm->decoder_count; i++) {
> + struct cxl_commit_decoder_ctx ctx = {
> + .cxlhdm = cxlhdm,
> + .id = i,
> + };
> +
> + /*
> + * Per CXL Spec 3.1 8.2.4.20.12 software must commit decoders
Please update this reference to the 4.0 spec.
> + * in HPA order. Region setup already enforces that ordering by
> + * decoder id, so restore commits follow ascending id order.
> + */
> + rc = device_for_each_child(&endpoint->dev, &ctx,
> + cxl_commit_restored_decoder_by_id);
> + if (rc < 0)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static int cxl_setup_hdm_decoder_from_dvsec(
> struct cxl_port *port, struct cxl_decoder *cxld, u64 *dpa_base,
> int which, struct cxl_endpoint_dvsec_info *info)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 1297594beaec..b51b1e9d6400 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -794,6 +794,8 @@ int cxl_port_setup_regs(struct cxl_port *port,
> struct cxl_dev_state;
> int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
> struct cxl_endpoint_dvsec_info *info);
> +int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
> +int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
>
> bool is_cxl_region(struct device *dev);
>
* Re: [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers
2026-05-28 8:31 ` [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers Srirangan Madhavan
@ 2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-02 20:36 ` Dave Jiang
2026-06-04 2:49 ` Dan Williams (nvidia)
2 siblings, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 20:34 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Add helpers to collect the CXL regions affected by a memdev reset,
> verify that those regions are idle, and invalidate CPU caches for the
> affected address ranges before reset.
>
> A memdev can participate in an interleaved region through multiple
> endpoint decoders. Track affected regions in a temporary xarray so each
> region is checked and cache-invalidated once per reset operation.
>
> These helpers prepare the CXL.mem data path for reset. The actual reset
> orchestration and decoder restore flow are added separately.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 170 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 170 insertions(+)
These changes should probably go into cxl/core/region.c. cxl/core/pci.c is more
for actually touching PCI registers/config as I understand it.
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index d1f487b3d809..318744695f62 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -4,9 +4,11 @@
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/device.h>
> #include <linux/delay.h>
> +#include <linux/memregion.h>
> #include <linux/pci.h>
> #include <linux/pci-doe.h>
> #include <linux/aer.h>
> +#include <linux/xarray.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> #include <cxl.h>
> @@ -926,3 +928,171 @@ int cxl_port_get_possible_dports(struct cxl_port *port)
>
> return ctx.count;
> }
> +
> +static int cxl_reset_system_ram_found(struct resource *res, void *data)
> +{
> + return 1;
> +}
There's already a helper in cxl/core/region.c called is_system_ram() that does
this. I'd use that instead when you move these functions over.
> +
> +struct cxl_reset_region_context {
> + struct xarray regions;
> +};
> +
> +static void __maybe_unused
> +cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
> +{
> + xa_init(&ctx->regions);
> +}
> +
> +static void __maybe_unused
> +cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
> +{
> + xa_destroy(&ctx->regions);
> +}
> +
> +static int cxl_reset_add_region(struct cxl_reset_region_context *ctx,
> + struct cxl_region *cxlr)
> +{
> + int rc;
> +
> + if (!cxlr || !cxlr->params.res)
> + return 0;
> +
> + rc = xa_insert(&ctx->regions, (unsigned long)cxlr, cxlr, GFP_KERNEL);
It may be easier to use cxlr->id as the xarray index instead of (unsigned long)cxlr, but that
depends on how you're iterating later on.
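A rough user-space sketch of what indexing by id could look like. This is only a model: the fixed-size slot table stands in for the xarray, and region_set, region_set_insert() and region_set_add_once() are made-up names, not kernel API. The insert mimics the "already present" -EBUSY result that the patch tolerates.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_REGIONS 64	/* hypothetical bound, only for this model */

struct region {
	int id;
};

/* Stand-in for the xarray: one slot per region id. */
struct region_set {
	struct region *slots[MAX_REGIONS];
};

/* Insert at index r->id; report -EBUSY when the slot is already taken. */
static int region_set_insert(struct region_set *set, struct region *r)
{
	if (r->id < 0 || r->id >= MAX_REGIONS)
		return -EINVAL;
	if (set->slots[r->id])
		return -EBUSY;
	set->slots[r->id] = r;
	return 0;
}

/* A region seen through several decoders is recorded only once. */
static int region_set_add_once(struct region_set *set, struct region *r)
{
	int rc = region_set_insert(set, r);

	return rc == -EBUSY ? 0 : rc;
}
```

With a small integer index, later iteration visits regions in id order rather than in pointer order, which may or may not matter for the reset flow.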
> +
> + /* A region may be referenced by multiple affected endpoint decoders. */
> + return rc == -EBUSY ? 0 : rc;
> +}
> +
> +static int cxl_reset_collect_region(struct device *dev, void *data)
> +{
> + struct cxl_reset_region_context *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + return cxl_reset_add_region(ctx, cxled->cxld.region);
It looks like cxl_reset_add_region() is only used here. I'd just do the internals
of it here and remove the function.
> +}
> +
> +static int __maybe_unused
> +cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
> + struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint;
> +
> + if (!cxlmd || !cxlmd->cxlds)
> + return -ENODEV;
Why check for cxlmd->cxlds here? It doesn't look like it's used in this path,
are you checking if the driver is attached?
> +
> + endpoint = cxlmd->endpoint;
> + if (!endpoint)
> + return 0;
> +
> + return device_for_each_child(&endpoint->dev, ctx,
> + cxl_reset_collect_region);
> +}
> +
> +static bool cxl_reset_region_has_system_ram(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int rc;
> +
> + if (!p->res)
> + return false;
> +
> + rc = walk_iomem_res_desc(IORES_DESC_NONE,
> + IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> + p->res->start, p->res->end, NULL,
> + cxl_reset_system_ram_found);
> +
> + return rc > 0;
> +}
This helper could also be used in cxl_region_probe() for ram regions,
see the switch case statement in that function. I don't know if it's
worth the churn though...
> +
> +static int cxl_reset_validate_region_idle(struct cxl_region *cxlr)
> +{
> + struct resource *res = cxlr->params.res;
> + int rc = 0;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (cxl_reset_region_has_system_ram(cxlr)) {
> + dev_err(&cxlr->dev,
> + "Cannot reset while CXL memory is online as System RAM [%pr]\n",
> + res);
> + return -EBUSY;
> + }
> +
> + if (!device_trylock(&cxlr->dev))
> + return -EAGAIN;
I think you can use ACQUIRE() here? I'm pretty sure it was made for this case and gets
rid of the device_unlock() below.
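To illustrate the idea (scope-based unlock after a conditional acquire), here is a user-space model built on the GCC/Clang cleanup attribute. It is not the kernel ACQUIRE() macro itself, and fake_lock, try_acquire() and validate_region_idle() are hypothetical names for this sketch only.

```c
#include <errno.h>
#include <stddef.h>

/* Single-threaded stand-in for a device lock. */
struct fake_lock {
	int held;
};

static int fake_trylock(struct fake_lock *lock)
{
	if (lock->held)
		return 0;
	lock->held = 1;
	return 1;
}

static void fake_unlock(struct fake_lock *lock)
{
	lock->held = 0;
}

/* Returns the lock when it was acquired, NULL otherwise. */
static struct fake_lock *try_acquire(struct fake_lock *lock)
{
	return fake_trylock(lock) ? lock : NULL;
}

/* Cleanup handler: runs at scope exit, unlocks only if we own the lock. */
static void unlock_if_held(struct fake_lock **lockp)
{
	if (*lockp)
		fake_unlock(*lockp);
}

struct fake_region {
	struct fake_lock lock;
	int has_driver;
};

static int validate_region_idle(struct fake_region *r)
{
	struct fake_lock *held __attribute__((cleanup(unlock_if_held))) =
		try_acquire(&r->lock);

	if (!held)
		return -EAGAIN;
	if (r->has_driver)
		return -EBUSY;	/* no explicit unlock on any return path */
	return 0;
}
```

Every return path releases the lock automatically, so the rc variable and the trailing unlock go away.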
> +
> + if (cxlr->dev.driver) {
> + dev_err(&cxlr->dev,
> + "Cannot reset while CXL region has an active driver\n");
> + rc = -EBUSY;
> + }
> +
> + device_unlock(&cxlr->dev);
> + return rc;
> +}
> +
> +static int __maybe_unused
> +cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
> +{
> + struct cxl_region *cxlr;
> + unsigned long index;
> + int rc;
> +
> + xa_for_each(&ctx->regions, index, cxlr) {
> + rc = cxl_reset_validate_region_idle(cxlr);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +static int cxl_reset_flush_region_cache(struct cxl_region *cxlr)
> +{
> + struct resource *res = cxlr->params.res;
> + int rc;
> +
> + if (!res)
> + return 0;
> +
> + rc = cpu_cache_invalidate_memregion(res->start, resource_size(res));
> + if (rc)
> + dev_err(&cxlr->dev, "Failed to invalidate CPU cache [%pr]: %d\n",
> + res, rc);
> +
> + return rc;
> +}
There's already a helper in cxl/core/region.c, see cxl_region_invalidate_memregion().
You'd have to modify the function below to use it here though.
> +
> +static int __maybe_unused
> +cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
> +{
> + struct cxl_region *cxlr;
> + unsigned long index;
> + int rc;
> +
> + if (xa_empty(&ctx->regions))
> + return 0;
> +
> + if (!cpu_cache_has_invalidate_memregion()) {
> + if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
> + pr_info_once(
> + "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
> + return 0;
> + }
> + pr_warn("Failed to synchronize CPU cache state\n");
> + return -ENXIO;
> + }
> +
> + xa_for_each(&ctx->regions, index, cxlr) {
> + rc = cxl_reset_flush_region_cache(cxlr);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
I see you've marked most of these functions with __maybe_unused and then drop the annotations later
in the series. It would be much better, throughout the whole series, to move these definitions into
the patches where they're used. Fortunately, it seems most of these can just be moved to
patch 7/9.
* Re: [PATCH v6 5/9] cxl/pci: Add CXL DVSEC reset helper
2026-05-28 8:31 ` [PATCH v6 5/9] cxl/pci: Add CXL DVSEC reset helper Srirangan Madhavan
@ 2026-06-02 20:34 ` Cheatham, Benjamin
0 siblings, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 20:34 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Add a helper to execute CXL Reset through the CXL Device DVSEC. The
> helper verifies reset capability, waits for pending PCIe transactions,
> disables CXL.cache, optionally initiates cache writeback and invalidation,
> and then starts CXL Reset through the DVSEC Control2 register.
>
> Block IOMMU traffic while reset is active, then restore IOMMU
> translations after reset completes.
>
> Wait for the DVSEC reset timeout before checking reset completion, and
> report reset error or timeout status from the DVSEC Status2 register. Add
> the CXL Device DVSEC reset and cache control definitions needed by the
> helper.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 185 ++++++++++++++++++++++++++++++++++
> include/uapi/linux/pci_regs.h | 13 +++
> 2 files changed, 198 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 01effbb4e7cd..1dd880f5a333 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -20,6 +20,9 @@
> #define CXL_RESET_MAX_FUNCTIONS 256
> #define CXL_RESET_FUNCTION_MAP_REGS (CXL_RESET_MAX_FUNCTIONS / 32)
> #define CXL_RESET_SIBLINGS_INIT 8
> +#define CXL_RESET_CACHE_WBI_POLL_US 100
> +#define CXL_RESET_CACHE_WBI_TIMEOUT_US (100 * USEC_PER_MSEC)
> +#define CXL_RESET_MIN_QUIET_MS 100
>
> /**
> * DOC: cxl core pci
> @@ -1303,3 +1306,185 @@ cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> cxl_pci_functions_reset_done(ctx);
> return rc;
> }
> +
> +static int cxl_reset_update_ctrl2(struct pci_dev *pdev, int dvsec, u16 set,
> + u16 clear)
I'd change to cxl_dvsec_update_ctrl2(), see next comment.
> +{
> + u16 ctrl2;
> + int rc;
> +
> + rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CTRL2, &ctrl2);
> + if (rc)
> + return rc;
> +
> + ctrl2 &= ~clear;
> + ctrl2 |= set;
> +
> + return pci_write_config_word(pdev, dvsec + PCI_DVSEC_CXL_CTRL2, ctrl2);
> +}
> +
> +static int cxl_reset_wait_cache_inv(struct pci_dev *pdev, int dvsec)
> +{
> + int remaining_us = CXL_RESET_CACHE_WBI_TIMEOUT_US;
> + u16 status2;
> + int rc;
> +
> + do {
> + usleep_range(CXL_RESET_CACHE_WBI_POLL_US,
> + CXL_RESET_CACHE_WBI_POLL_US + 1);
> + remaining_us -= CXL_RESET_CACHE_WBI_POLL_US;
> +
> + rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_STATUS2,
> + &status2);
> + if (rc)
> + return rc;
> +
> + if (status2 & PCI_DVSEC_CXL_CACHE_INV)
> + return 0;
> + } while (remaining_us > 0);
> +
> + pci_err(pdev, "CXL cache WB+I timed out\n");
> + return -ETIMEDOUT;
> +}
> +
> +static int cxl_reset_enable_cache(struct pci_dev *pdev, int dvsec, u16 cap)
> +{
> + if (!(cap & PCI_DVSEC_CXL_CACHE_CAPABLE))
> + return 0;
> +
> + return cxl_reset_update_ctrl2(pdev, dvsec, 0,
> + PCI_DVSEC_CXL_DISABLE_CACHING);
> +}
> +
> +static int cxl_reset_disable_cache(struct pci_dev *pdev, int dvsec, u16 cap)
I would remove the reset portion of the names of these three functions. There's nothing
reset specific about them and it'd be nice to have general functions for when/if CXL.cache
device support comes to the core.
I'd also like to see the RESET part of the WBI #defines dropped for the same reason.
> +{
> + int rc;
> +
> + if (!(cap & PCI_DVSEC_CXL_CACHE_CAPABLE))
> + return 0;
> +
> + rc = cxl_reset_update_ctrl2(pdev, dvsec,
> + PCI_DVSEC_CXL_DISABLE_CACHING, 0);
> + if (rc)
> + return rc;
> +
> + if (!(cap & PCI_DVSEC_CXL_CACHE_WBI_CAPABLE))
> + return 0;
> +
> + rc = cxl_reset_update_ctrl2(pdev, dvsec,
> + PCI_DVSEC_CXL_INIT_CACHE_WBI, 0);
> + if (rc)
> + goto err_enable_cache;
I think you can probably return here. I would be surprised if the pci_write/read_config()
updated the register with the new value while also returning an error.
> +
> + rc = cxl_reset_wait_cache_inv(pdev, dvsec);
> + if (rc)
> + goto err_enable_cache;
> +
> + return 0;
With above change you can drop the goto and do:
rc = cxl_reset_wait_cache_inv(pdev, dvsec);
if (rc)
dev_warn(<descriptive error message>);
return rc;
> +
> +err_enable_cache:
> + /*
> + * Best effort rollback: preserve the original WB+I failure even if
> + * re-enabling CXL.cache also fails.
> + */
> + cxl_reset_enable_cache(pdev, dvsec, cap);
> + return rc;
> +}
> +
> +static int cxl_reset_wait_done(struct pci_dev *pdev, int dvsec, u16 cap)
> +{
> + static const u32 reset_timeout_ms[] = { 10, 100, 1000, 10000, 100000 };
> + u32 timeout_ms;
> + u16 status2;
> + int rc, idx;
> +
> + idx = FIELD_GET(PCI_DVSEC_CXL_RST_TIMEOUT, cap);
> + if (idx >= ARRAY_SIZE(reset_timeout_ms))
> + idx = ARRAY_SIZE(reset_timeout_ms) - 1;
I'd put a dev_dbg() here just in case someone is wondering why their timeout value
is being shortened. Would also be nice for vendors/users to see that their hardware is
being programmed with a value that's (probably) too long.
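Something along these lines, modeled in user space. fprintf() stands in for dev_dbg(), and reset_timeout_from_field() is a made-up name; the table is the one from the patch.

```c
#include <stddef.h>
#include <stdio.h>

static const unsigned int reset_timeout_ms[] = { 10, 100, 1000, 10000, 100000 };

#define NR_RESET_TIMEOUTS \
	(sizeof(reset_timeout_ms) / sizeof(reset_timeout_ms[0]))

/*
 * Decode the reset timeout field into milliseconds. Encodings past the
 * end of the table are clamped to the last entry, and the clamp is
 * reported so nobody has to guess why the wait differs from the field.
 */
static unsigned int reset_timeout_from_field(unsigned int field)
{
	size_t idx = field;

	if (idx >= NR_RESET_TIMEOUTS) {
		idx = NR_RESET_TIMEOUTS - 1;
		fprintf(stderr,
			"reset timeout encoding %u out of range, using %u ms\n",
			field, reset_timeout_ms[idx]);
	}

	return reset_timeout_ms[idx];
}
```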
> + timeout_ms = reset_timeout_ms[idx];
> +
> + msleep(max_t(u32, timeout_ms, CXL_RESET_MIN_QUIET_MS));
> +
> + rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_STATUS2,
> + &status2);
> + if (rc)
> + return rc;
> +
> + if (status2 & PCI_DVSEC_CXL_RST_ERR) {
> + pci_err(pdev, "CXL reset error\n");
> + return -EIO;
> + }
> +
> + if (!(status2 & PCI_DVSEC_CXL_RST_DONE)) {
> + pci_err(pdev, "CXL reset timed out\n");
> + return -ETIMEDOUT;
> + }
> +
> + return 0;
> +}
> +
> +static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
> +{
> + int dvsec, rc;
> + u16 ctrl2_clear = 0;
> + u16 cap;
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + return -ENODEV;
> +
> + rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap);
> + if (rc)
> + return rc;
> +
> + if (!(cap & PCI_DVSEC_CXL_RST_CAPABLE))
> + return -EOPNOTSUPP;
> +
> + if (mem_clear && !(cap & PCI_DVSEC_CXL_RST_MEM_CLR_CAPABLE))
> + return -EOPNOTSUPP;
> +
> + if (!pci_wait_for_pending_transaction(pdev))
> + pci_err(pdev, "timed out waiting for pending transactions\n");
> +
> + rc = pci_dev_reset_iommu_prepare(pdev);
> + if (rc) {
> + pci_err(pdev, "failed to block IOMMU for CXL reset: %d\n",
> + rc);
> + return rc;
> + }
> +
> + rc = cxl_reset_disable_cache(pdev, dvsec, cap);
> + if (rc)
> + goto out_iommu;
> + if (cap & PCI_DVSEC_CXL_CACHE_CAPABLE)
> + ctrl2_clear |= PCI_DVSEC_CXL_DISABLE_CACHING;
> +
> + if (mem_clear) {
> + rc = cxl_reset_update_ctrl2(pdev, dvsec,
> + PCI_DVSEC_CXL_RST_MEM_CLR_EN, 0);
> + if (rc)
> + goto out_ctrl2;
> + ctrl2_clear |= PCI_DVSEC_CXL_RST_MEM_CLR_EN;
> + }
> +
> + rc = cxl_reset_update_ctrl2(pdev, dvsec,
> + PCI_DVSEC_CXL_INIT_CXL_RST, 0);
> + if (rc)
> + goto out_ctrl2;
> +
> + rc = cxl_reset_wait_done(pdev, dvsec, cap);
> + if (rc)
> + goto out_iommu;
> +
> + rc = cxl_reset_update_ctrl2(pdev, dvsec, 0,
> + PCI_DVSEC_CXL_DISABLE_CACHING);
> +
> +out_ctrl2:
> + if (rc && ctrl2_clear)
> + cxl_reset_update_ctrl2(pdev, dvsec, 0, ctrl2_clear);
> +
> +out_iommu:
> + pci_dev_reset_iommu_done(pdev);
> + return rc;
> +}
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index fa1fcd26af01..7fc1d34fcce7 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1352,8 +1352,21 @@
> #define PCI_DVSEC_CXL_CACHE_CAPABLE _BITUL(0)
> #define PCI_DVSEC_CXL_MEM_CAPABLE _BITUL(2)
> #define PCI_DVSEC_CXL_HDM_COUNT __GENMASK(5, 4)
> +#define PCI_DVSEC_CXL_CACHE_WBI_CAPABLE _BITUL(6)
> +#define PCI_DVSEC_CXL_RST_CAPABLE _BITUL(7)
> +#define PCI_DVSEC_CXL_RST_TIMEOUT __GENMASK(10, 8)
> +#define PCI_DVSEC_CXL_RST_MEM_CLR_CAPABLE _BITUL(11)
> #define PCI_DVSEC_CXL_CTRL 0xC
> #define PCI_DVSEC_CXL_MEM_ENABLE _BITUL(2)
> +#define PCI_DVSEC_CXL_CTRL2 0x10
> +#define PCI_DVSEC_CXL_DISABLE_CACHING _BITUL(0)
> +#define PCI_DVSEC_CXL_INIT_CACHE_WBI _BITUL(1)
> +#define PCI_DVSEC_CXL_INIT_CXL_RST _BITUL(2)
> +#define PCI_DVSEC_CXL_RST_MEM_CLR_EN _BITUL(3)
> +#define PCI_DVSEC_CXL_STATUS2 0x12
> +#define PCI_DVSEC_CXL_CACHE_INV _BITUL(0)
> +#define PCI_DVSEC_CXL_RST_DONE _BITUL(1)
> +#define PCI_DVSEC_CXL_RST_ERR _BITUL(2)
> #define PCI_DVSEC_CXL_RANGE_SIZE_HIGH(i) (0x18 + (i * 0x10))
> #define PCI_DVSEC_CXL_RANGE_SIZE_LOW(i) (0x1C + (i * 0x10))
> #define PCI_DVSEC_CXL_MEM_INFO_VALID _BITUL(0)
* Re: [PATCH v6 6/9] cxl/pci: Track memdevs affected by CXL reset
2026-05-28 8:31 ` [PATCH v6 6/9] cxl/pci: Track memdevs affected by CXL reset Srirangan Madhavan
@ 2026-06-02 20:34 ` Cheatham, Benjamin
0 siblings, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 20:34 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> CXL reset is scoped to the CXL.cache/mem function set, so reset
> orchestration needs to account for the target memdev and any affected
> sibling-function memdevs.
I would move this patch to be right after 4/9 since it's doing the back
half of finding the devices affected by reset.
>
> Add reset context tracking for affected memdevs. Collect the memdevs
> associated with the target and sibling PCI functions, track which ones
> are active, collect their regions, and provide helpers to lock and
> revalidate the active memdevs before reset proceeds.
>
> The reset orchestration and CXL.mem restore flow are added separately.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 176 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 176 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 1dd880f5a333..c755c18c8d84 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1106,8 +1106,17 @@ cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
> return 0;
> }
>
> +struct cxl_reset_memdev {
> + struct cxl_memdev *cxlmd;
> + bool active;
> + bool locked;
> +};
> +
> struct cxl_reset_context {
> struct pci_dev *target;
> + struct cxl_reset_memdev *memdevs;
> + int nr_memdevs;
> + int memdev_capacity;
> struct pci_dev **siblings;
> int nr_siblings;
> int sibling_capacity;
> @@ -1237,6 +1246,173 @@ static int cxl_reset_collect_siblings(struct cxl_reset_context *ctx)
> return wctx.rc;
> }
>
> +static int cxl_reset_match_memdev_by_parent(struct device *dev,
> + const void *parent)
> +{
> + return is_cxl_memdev(dev) && dev->parent == parent;
> +}
> +
> +static bool cxl_reset_memdev_active(struct cxl_memdev *cxlmd)
> +{
> + return cxlmd->dev.driver && cxlmd->endpoint &&
> + !IS_ERR(cxlmd->endpoint);
You can replace the last two checks with !IS_ERR_OR_NULL(cxlmd->endpoint).
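For reference, a user-space model of why the two forms are equivalent. The helpers below are lowercase re-implementations for illustration only (err_ptr(), is_err(), is_err_or_null() are not the kernel macros), assuming the usual convention that the top 4095 addresses encode errnos.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True when the pointer falls in the top MAX_ERRNO addresses. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool is_err_or_null(const void *ptr)
{
	return !ptr || is_err(ptr);
}

/* Two-check form, as written in the patch. */
static inline bool usable_two_checks(const void *ptr)
{
	return ptr && !is_err(ptr);
}

/* Single-check form, as suggested. */
static inline bool usable_one_check(const void *ptr)
{
	return !is_err_or_null(ptr);
}
```

Both predicates agree for NULL, for an encoded errno, and for a valid pointer.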
> +}
> +
> +static int cxl_reset_collect_pci_memdev(struct cxl_reset_context *ctx,
> + struct pci_dev *pdev)
> +{
> + struct cxl_reset_memdev *memdevs;
> + struct cxl_memdev *cxlmd;
> + struct device *dev;
> + int capacity, i;
> +
> + dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
> + cxl_reset_match_memdev_by_parent);
> + if (!dev)
> + return 0;
> +
> + cxlmd = to_cxl_memdev(dev);
> + for (i = 0; i < ctx->nr_memdevs; i++) {
> + if (ctx->memdevs[i].cxlmd == cxlmd) {
> + put_device(dev);
> + return 0;
> + }
> + }
> +
> + if (ctx->nr_memdevs < ctx->memdev_capacity)
> + goto add;
The goto here isn't great. I'd just add the memdev inside the if statement
and return here.
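One goto-free shape for this, sketched in user space with realloc() in place of krealloc(). The int payload and the memdev_list names are placeholders for this example; splitting the growth step into its own helper is just one option. Note that with realloc() a failed call leaves the old buffer valid, so the list is still consistent on the -ENOMEM path.

```c
#include <errno.h>
#include <stdlib.h>

#define INIT_CAPACITY 8	/* mirrors CXL_RESET_SIBLINGS_INIT in the patch */

struct memdev_list {
	int *items;
	int nr;
	int capacity;
};

/* Make room for one more entry, doubling the buffer when it is full. */
static int memdev_list_reserve(struct memdev_list *list)
{
	int capacity;
	int *items;

	if (list->nr < list->capacity)
		return 0;

	capacity = list->capacity ? list->capacity * 2 : INIT_CAPACITY;
	items = realloc(list->items, capacity * sizeof(*items));
	if (!items)
		return -ENOMEM;	/* list->items is untouched and still owned */

	list->items = items;
	list->capacity = capacity;
	return 0;
}

/* Append without a goto: reserve first, then store. */
static int memdev_list_add(struct memdev_list *list, int item)
{
	int rc = memdev_list_reserve(list);

	if (rc)
		return rc;

	list->items[list->nr++] = item;
	return 0;
}
```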
> +
> + capacity = ctx->memdev_capacity ? ctx->memdev_capacity * 2 :
> + CXL_RESET_SIBLINGS_INIT;
> + memdevs = krealloc(ctx->memdevs, capacity * sizeof(*memdevs),
> + GFP_KERNEL);
> + if (!memdevs) {
> + put_device(dev);
Should you null out ctx->memdevs here? Is it possible it would have a stale value at this
point?
> + return -ENOMEM;
> + }
> +
> + ctx->memdevs = memdevs;
> + ctx->memdev_capacity = capacity;
> +
> +add:
> + ctx->memdevs[ctx->nr_memdevs++] = (struct cxl_reset_memdev) {
> + .cxlmd = cxlmd,
> + };
> + return 0;
> +}
> +
> +/*
> + * CXL Reset is device scoped for CXL.cache/mem. Use the affected PCI
> + * function set to find memdevs whose regions and endpoint decoder state must
> + * be handled around the reset.
> + */
> +static int __maybe_unused cxl_reset_collect_memdevs(struct cxl_reset_context *ctx)
> +{
> + int rc, i;
> +
> + rc = cxl_reset_collect_pci_memdev(ctx, ctx->target);
> + if (rc)
> + return rc;
> +
> + for (i = 0; i < ctx->nr_siblings; i++) {
> + rc = cxl_reset_collect_pci_memdev(ctx, ctx->siblings[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +static int __maybe_unused
> +cxl_reset_collect_regions(struct cxl_reset_context *ctx,
> + struct cxl_reset_region_context *region_ctx)
> +{
> + int rc, i;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + for (i = 0; i < ctx->nr_memdevs; i++) {
> + struct cxl_reset_memdev *rmd = &ctx->memdevs[i];
> + struct cxl_memdev *cxlmd = rmd->cxlmd;
> +
> + if (!device_trylock(&cxlmd->dev))
> + return -EAGAIN;
Use ACQUIRE() here.
> +
> + if (cxl_reset_memdev_active(cxlmd)) {
> + rc = cxl_reset_collect_memdev_regions(region_ctx,
> + cxlmd);
> + if (!rc)
> + rmd->active = true;
> + } else {
> + rc = 0;
> + }
> +
> + device_unlock(&cxlmd->dev);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +static void cxl_reset_unlock_memdevs(struct cxl_reset_context *ctx)
> +{
> + int i;
> +
> + for (i = ctx->nr_memdevs - 1; i >= 0; i--) {
> + struct cxl_reset_memdev *rmd = &ctx->memdevs[i];
> +
> + if (!rmd->locked)
> + continue;
> +
> + device_unlock(&rmd->cxlmd->dev);
> + rmd->locked = false;
> + }
> +}
> +
> +static int __maybe_unused cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
> +{
> + int i;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + for (i = 0; i < ctx->nr_memdevs; i++) {
> + struct cxl_reset_memdev *rmd = &ctx->memdevs[i];
> + struct cxl_memdev *cxlmd = rmd->cxlmd;
> +
> + if (!rmd->active)
> + continue;
> +
> + if (!device_trylock(&cxlmd->dev))
> + goto err;
> +
> + rmd->locked = true;
> + if (!cxl_reset_memdev_active(cxlmd)) {
> + cxl_reset_unlock_memdevs(ctx);
> + return -ENODEV;
> + }
> + }
> +
> + return 0;
> +
> +err:
> + cxl_reset_unlock_memdevs(ctx);
> + return -EAGAIN;
> +}
> +
> +static void __maybe_unused cxl_reset_put_memdevs(struct cxl_reset_context *ctx)
> +{
> + int i;
> +
> + for (i = 0; i < ctx->nr_memdevs; i++)
> + put_device(&ctx->memdevs[i].cxlmd->dev);
> +
> + kfree(ctx->memdevs);
> + ctx->memdevs = NULL;
> + ctx->nr_memdevs = 0;
> + ctx->memdev_capacity = 0;
> +}
> +
> static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
> {
> int i;
* Re: [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs
2026-05-28 8:31 ` [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs Srirangan Madhavan
@ 2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-04 3:25 ` Dan Williams (nvidia)
1 sibling, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 20:34 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Add the reset flow that coordinates the target function, affected CXL
> sibling functions, and any active memdevs in the CXL.cache/mem reset
> scope.
>
> The flow collects regions for the affected memdevs under
> cxl_rwsem.region, verifies that those regions are idle, flushes CPU
> caches for the affected ranges, saves and disables the target and sibling
> PCI functions, and locks active memdevs to revalidate that their
> endpoints are still present before reset.
>
> After the CXL DVSEC reset completes, restore PCI config space so CXL
> MMIO is accessible, restore decoder programming for all active affected
> memdevs, commit their restored decoders, and only then re-enable CXL.mem
> for the affected set.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 414 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 358 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index c755c18c8d84..486c447e98f3 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -947,14 +947,12 @@ struct cxl_reset_region_context {
> struct xarray regions;
> };
>
> -static void __maybe_unused
> -cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
> +static void cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
> {
> xa_init(&ctx->regions);
> }
>
> -static void __maybe_unused
> -cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
> +static void cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
> {
> xa_destroy(&ctx->regions);
> }
> @@ -985,9 +983,8 @@ static int cxl_reset_collect_region(struct device *dev, void *data)
> return cxl_reset_add_region(ctx, cxled->cxld.region);
> }
>
> -static int __maybe_unused
> -cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
> - struct cxl_memdev *cxlmd)
> +static int cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
> + struct cxl_memdev *cxlmd)
> {
> struct cxl_port *endpoint;
>
> @@ -1045,8 +1042,7 @@ static int cxl_reset_validate_region_idle(struct cxl_region *cxlr)
> return rc;
> }
>
> -static int __maybe_unused
> -cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
> +static int cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
> {
> struct cxl_region *cxlr;
> unsigned long index;
> @@ -1077,26 +1073,41 @@ static int cxl_reset_flush_region_cache(struct cxl_region *cxlr)
> return rc;
> }
Move the helpers from patch 3/9 to this patch; having to remove a bunch of __maybe_unused
annotations creates a lot of needless churn.
>
> -static int __maybe_unused
> -cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
> +static int cxl_reset_cpu_cache_flush_preflight(struct cxl_reset_region_context *ctx,
> + bool *skip)
> {
> - struct cxl_region *cxlr;
> - unsigned long index;
> - int rc;
> + if (skip)
> + *skip = false;
>
> if (xa_empty(&ctx->regions))
> return 0;
>
> - if (!cpu_cache_has_invalidate_memregion()) {
> - if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
> - pr_info_once(
> - "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
> - return 0;
> - }
> - pr_warn("Failed to synchronize CPU cache state\n");
> - return -ENXIO;
> + if (cpu_cache_has_invalidate_memregion())
> + return 0;
> +
> + if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
> + pr_info_once(
> + "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
> + if (skip)
> + *skip = true;
> + return 0;
> }
>
> + pr_warn("Failed to synchronize CPU cache state\n");
> + return -ENXIO;
> +}
And I'd remove the cxl_reset_flush_cpu_caches() definition from 3/9 and just replace it
with this to begin with. It doesn't really make sense to add a bunch of code that isn't
used just to replace it later.
I'll stop commenting on it now, but I'd take another look at how to get the helper
functions into the same patch that uses them.
> +
> +static int cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
> +{
> + struct cxl_region *cxlr;
> + unsigned long index;
> + bool skip;
> + int rc;
> +
> + rc = cxl_reset_cpu_cache_flush_preflight(ctx, &skip);
> + if (rc || skip)
> + return rc;
> +
> xa_for_each(&ctx->regions, index, cxlr) {
> rc = cxl_reset_flush_region_cache(cxlr);
> if (rc)
> @@ -1120,7 +1131,11 @@ struct cxl_reset_context {
> struct pci_dev **siblings;
> int nr_siblings;
> int sibling_capacity;
> + int nr_siblings_locked;
> int nr_siblings_prepared;
> + bool target_locked;
> + bool target_saved;
> + bool target_iommu_prepared;
> };
>
> struct cxl_reset_walk_ctx {
> @@ -1306,7 +1321,7 @@ static int cxl_reset_collect_pci_memdev(struct cxl_reset_context *ctx,
> * function set to find memdevs whose regions and endpoint decoder state must
> * be handled around the reset.
> */
> -static int __maybe_unused cxl_reset_collect_memdevs(struct cxl_reset_context *ctx)
> +static int cxl_reset_collect_memdevs(struct cxl_reset_context *ctx)
> {
> int rc, i;
>
> @@ -1323,7 +1338,7 @@ static int __maybe_unused cxl_reset_collect_memdevs(struct cxl_reset_context *ct
> return 0;
> }
>
> -static int __maybe_unused
> +static int
> cxl_reset_collect_regions(struct cxl_reset_context *ctx,
> struct cxl_reset_region_context *region_ctx)
> {
> @@ -1370,7 +1385,7 @@ static void cxl_reset_unlock_memdevs(struct cxl_reset_context *ctx)
> }
> }
>
> -static int __maybe_unused cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
> +static int cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
> {
> int i;
>
> @@ -1400,7 +1415,7 @@ static int __maybe_unused cxl_reset_lock_memdevs(struct cxl_reset_context *ctx)
> return -EAGAIN;
> }
>
> -static void __maybe_unused cxl_reset_put_memdevs(struct cxl_reset_context *ctx)
> +static void cxl_reset_put_memdevs(struct cxl_reset_context *ctx)
> {
> int i;
>
> @@ -1417,14 +1432,20 @@ static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
> {
> int i;
>
> + /*
> + * Config state was restored early for CXL MMIO access. Complete PCI
> + * reset recovery here by unblocking IOMMU and running reset_done().
> + */
> for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--) {
> struct pci_dev *sibling = ctx->siblings[i];
>
> pci_dev_reset_iommu_done(sibling);
> pci_dev_restore(sibling);
> - pci_dev_unlock(sibling);
> }
>
> + for (i = ctx->nr_siblings_locked - 1; i >= 0; i--)
> + pci_dev_unlock(ctx->siblings[i]);
> +
> for (i = 0; i < ctx->nr_siblings; i++)
> pci_dev_put(ctx->siblings[i]);
>
> @@ -1432,31 +1453,39 @@ static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
> ctx->siblings = NULL;
> ctx->nr_siblings = 0;
> ctx->sibling_capacity = 0;
> + ctx->nr_siblings_locked = 0;
> ctx->nr_siblings_prepared = 0;
> }
>
> -static int __maybe_unused
> -cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> +static int cxl_pci_functions_lock(struct cxl_reset_context *ctx)
> {
> - int rc, i;
> -
> - ctx->siblings = NULL;
> - ctx->nr_siblings = 0;
> - ctx->sibling_capacity = 0;
> - ctx->nr_siblings_prepared = 0;
> + int i;
>
> - rc = cxl_reset_collect_siblings(ctx);
> - if (rc)
> - goto err;
> + ctx->nr_siblings_locked = 0;
>
> for (i = 0; i < ctx->nr_siblings; i++) {
> struct pci_dev *sibling = ctx->siblings[i];
>
> if (!pci_dev_trylock(sibling)) {
> - rc = -EAGAIN;
> - goto err;
> + cxl_pci_functions_reset_done(ctx);
> + return -EAGAIN;
> }
>
> + ctx->nr_siblings_locked++;
> + }
> +
> + return 0;
> +}
> +
> +static int cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> +{
> + int rc, i;
> +
> + ctx->nr_siblings_prepared = 0;
> +
> + for (i = 0; i < ctx->nr_siblings_locked; i++) {
> + struct pci_dev *sibling = ctx->siblings[i];
> +
> pci_dev_save_and_disable(sibling);
> rc = pci_dev_reset_iommu_prepare(sibling);
> if (rc) {
> @@ -1469,7 +1498,6 @@ cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> * nr_siblings_prepared and must not get iommu_done().
> */
> pci_dev_restore(sibling);
> - pci_dev_unlock(sibling);
> goto err;
> }
>
> @@ -1483,6 +1511,79 @@ cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> return rc;
> }
>
> +/*
> + * Restore PCI config space after reset so CXL MMIO is accessible for memdev
> + * restore. Driver reset_done callbacks remain deferred to final cleanup.
> + */
> +static void cxl_pci_functions_restore_state(struct cxl_reset_context *ctx)
> +{
> + int i;
> +
> + for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--)
> + pci_restore_state(ctx->siblings[i]);
> +}
> +
> +static int cxl_pci_target_lock(struct cxl_reset_context *ctx)
> +{
> + struct pci_dev *pdev = ctx->target;
> +
> + if (!pci_dev_trylock(pdev))
> + return -EAGAIN;
> +
> + ctx->target_locked = true;
> + return 0;
> +}
> +
> +static int cxl_pci_target_reset_prepare(struct cxl_reset_context *ctx)
> +{
> + struct pci_dev *pdev = ctx->target;
> + int rc;
> +
> + /* Disable first to stop new transactions, then drain in-flight ones. */
> + pci_dev_save_and_disable(pdev);
> + ctx->target_saved = true;
> +
> + if (!pci_wait_for_pending_transaction(pdev))
> + pci_err(pdev, "timed out waiting for pending transactions\n");
> +
> + rc = pci_dev_reset_iommu_prepare(pdev);
> + if (rc) {
> + pci_err(pdev, "failed to block IOMMU for CXL reset: %d\n", rc);
> + return rc;
> + }
> +
> + ctx->target_iommu_prepared = true;
> + return 0;
> +}
> +
> +static void cxl_pci_target_restore_state(struct cxl_reset_context *ctx)
> +{
> + if (ctx->target_saved)
> + pci_restore_state(ctx->target);
> +}
> +
> +static void cxl_pci_target_reset_done(struct cxl_reset_context *ctx)
> +{
> + if (ctx->target_iommu_prepared) {
> + pci_dev_reset_iommu_done(ctx->target);
> + ctx->target_iommu_prepared = false;
> + }
> +
> + /*
> + * cxl_pci_target_restore_state() restores config space before memdev
> + * restore. Complete PCI reset recovery here with reset_done().
> + */
> + if (ctx->target_saved) {
> + pci_dev_restore(ctx->target);
> + ctx->target_saved = false;
> + }
> +
> + if (ctx->target_locked) {
> + pci_dev_unlock(ctx->target);
> + ctx->target_locked = false;
> + }
> +}
> +
> static int cxl_reset_update_ctrl2(struct pci_dev *pdev, int dvsec, u16 set,
> u16 clear)
> {
> @@ -1599,7 +1700,7 @@ static int cxl_reset_wait_done(struct pci_dev *pdev, int dvsec, u16 cap)
> return 0;
> }
>
> -static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
> +static int cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
> {
> int dvsec, rc;
> u16 ctrl2_clear = 0;
> @@ -1620,19 +1721,9 @@ static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
> if (mem_clear && !(cap & PCI_DVSEC_CXL_RST_MEM_CLR_CAPABLE))
> return -EOPNOTSUPP;
>
> - if (!pci_wait_for_pending_transaction(pdev))
> - pci_err(pdev, "timed out waiting for pending transactions\n");
> -
> - rc = pci_dev_reset_iommu_prepare(pdev);
> - if (rc) {
> - pci_err(pdev, "failed to block IOMMU for CXL reset: %d\n",
> - rc);
> - return rc;
> - }
> -
> rc = cxl_reset_disable_cache(pdev, dvsec, cap);
> if (rc)
> - goto out_iommu;
> + return rc;
> if (cap & PCI_DVSEC_CXL_CACHE_CAPABLE)
> ctrl2_clear |= PCI_DVSEC_CXL_DISABLE_CACHING;
>
> @@ -1651,7 +1742,7 @@ static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
>
> rc = cxl_reset_wait_done(pdev, dvsec, cap);
> if (rc)
> - goto out_iommu;
> + return rc;
>
> rc = cxl_reset_update_ctrl2(pdev, dvsec, 0,
> PCI_DVSEC_CXL_DISABLE_CACHING);
> @@ -1660,7 +1751,218 @@ static int __maybe_unused cxl_dev_reset(struct pci_dev *pdev, bool mem_clear)
> if (rc && ctrl2_clear)
> cxl_reset_update_ctrl2(pdev, dvsec, 0, ctrl2_clear);
>
> -out_iommu:
> - pci_dev_reset_iommu_done(pdev);
> + return rc;
> +}
> +
> +static int cxl_reset_restore_memdev(struct cxl_reset_memdev *rmd)
> +{
> + struct cxl_memdev *cxlmd = rmd->cxlmd;
> + int rc;
> +
> + if (!rmd->active)
> + return 0;
> +
> + rc = cxl_restore_memdev_decoders(cxlmd);
> + if (rc)
> + dev_err(&cxlmd->dev,
> + "Failed to restore CXL.mem decoders after reset: %d\n",
> + rc);
> +
> + return rc;
> +}
> +
> +static int cxl_reset_commit_memdev(struct cxl_reset_memdev *rmd)
> +{
> + struct cxl_memdev *cxlmd = rmd->cxlmd;
> + int rc;
> +
> + if (!rmd->active)
> + return 0;
> +
> + rc = cxl_commit_memdev_decoders(cxlmd);
> + if (rc)
> + dev_err(&cxlmd->dev,
> + "Failed to commit CXL.mem decoders after reset: %d\n",
> + rc);
> +
> + return rc;
> +}
> +
> +static int cxl_reset_enable_memdev(struct cxl_reset_memdev *rmd)
> +{
> + struct cxl_memdev *cxlmd = rmd->cxlmd;
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + int rc;
> +
> + if (!rmd->active)
> + return 0;
> +
> + cxlds->media_ready = false;
> +
> + rc = cxl_set_mem_enable(cxlds, PCI_DVSEC_CXL_MEM_ENABLE);
> + if (rc < 0) {
> + dev_err(&cxlmd->dev,
> + "Failed to enable CXL.mem after reset: %d\n", rc);
> + return rc;
> + }
> +
> + rc = cxl_await_media_ready(cxlds);
> + if (rc) {
> + dev_err(&cxlmd->dev,
> + "Media not active after CXL reset: %d\n", rc);
> + return rc;
> + }
> + cxlds->media_ready = true;
You need to verify the memdev has mailbox support before touching media_ready; type 2 devices
aren't required to have mailbox support.
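A minimal userspace sketch of the gating being asked for here, not kernel code. The struct, the has_mailbox flag, and the model_* helpers are all hypothetical stand-ins (the real check would use whatever cxl_dev_state exposes for mailbox presence); the point is only that CXL.mem enable still happens while media_ready is left alone for mailbox-less devices.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant bits of device state. */
struct model_dev_state {
	bool has_mailbox;	/* assumption: some way to tell a mailbox exists */
	bool mem_enabled;
	bool media_ready;
};

/* Hypothetical stand-in for the media-ready wait; always succeeds here. */
static int model_await_media_ready(const struct model_dev_state *ds)
{
	(void)ds;
	return 0;
}

int model_enable_after_reset(struct model_dev_state *ds)
{
	int rc;

	/* Re-enable CXL.mem regardless of mailbox support. */
	ds->mem_enabled = true;

	/* No mailbox: do not touch media_ready at all. */
	if (!ds->has_mailbox)
		return 0;

	ds->media_ready = false;
	rc = model_await_media_ready(ds);
	if (rc)
		return rc;

	ds->media_ready = true;
	return 0;
}
```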
> +
> + return 0;
> +}
> +
> +static void cxl_reset_disable_memdevs(struct cxl_reset_context *ctx)
> +{
> + int rc, i;
> +
> + for (i = ctx->nr_memdevs - 1; i >= 0; i--) {
> + struct cxl_memdev *cxlmd = ctx->memdevs[i].cxlmd;
> +
> + if (!ctx->memdevs[i].active)
> + continue;
> +
> + rc = cxl_set_mem_enable(cxlmd->cxlds, 0);
> + if (rc < 0)
> + dev_err(&cxlmd->dev,
> + "Failed to disable CXL.mem after reset restore failure; device state may be inconsistent: %d\n",
> + rc);
> + }
> +}
> +
> +static int cxl_reset_restore_memdevs(struct cxl_reset_context *ctx)
> +{
> + int rc;
> + int i;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + for (i = 0; i < ctx->nr_memdevs; i++) {
> + rc = cxl_reset_restore_memdev(&ctx->memdevs[i]);
> + if (rc)
> + return rc;
> + }
> +
> + for (i = 0; i < ctx->nr_memdevs; i++) {
> + rc = cxl_reset_commit_memdev(&ctx->memdevs[i]);
> + if (rc)
> + return rc;
> + }
> +
> + for (i = 0; i < ctx->nr_memdevs; i++) {
> + rc = cxl_reset_enable_memdev(&ctx->memdevs[i]);
> + if (rc) {
> + cxl_reset_disable_memdevs(ctx);
> + return rc;
> + }
> + }
Do you need to walk back the committed memdevs if you fail to enable them?
I don't think it's an issue if any of them fail to commit, but I'm more
wary when it comes to enabling them.
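To make the question concrete, here is a rough userspace model of a walk-back, not kernel code. Every name in it is hypothetical, and whether the decoders should really be uncommitted on an enable failure is exactly the open question; the sketch only shows the shape of the unwind if the answer is yes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one memdev's restore state. */
struct model_memdev {
	bool active;		/* participates in the reset */
	bool committed;		/* decoders committed */
	bool enabled;		/* CXL.mem enabled */
	bool fail_enable;	/* test knob: force the enable step to fail */
};

static int model_commit(struct model_memdev *md)
{
	if (md->active)
		md->committed = true;
	return 0;
}

static int model_enable(struct model_memdev *md)
{
	if (!md->active)
		return 0;
	if (md->fail_enable)
		return -EIO;
	md->enabled = true;
	return 0;
}

/* Walk back in reverse order: disable, then drop the commit. */
static void model_unwind(struct model_memdev *mds, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		if (!mds[i].active)
			continue;
		mds[i].enabled = false;
		mds[i].committed = false;
	}
}

int model_restore_memdevs(struct model_memdev *mds, size_t n)
{
	int rc;

	for (size_t i = 0; i < n; i++) {
		rc = model_commit(&mds[i]);
		if (rc)
			return rc;
	}

	for (size_t i = 0; i < n; i++) {
		rc = model_enable(&mds[i]);
		if (rc) {
			/* Enable failed: undo enable and commit everywhere. */
			model_unwind(mds, n);
			return rc;
		}
	}

	return 0;
}
```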
> +
> + return 0;
> +}
> +
> +static void cxl_reset_context_destroy(struct cxl_reset_context *ctx)
> +{
> + /*
> + * LIFO unwind for regular completion and partial initialization:
> + * memdevs, sibling functions, target function, then references.
> + * Each cleanup helper tolerates being called after its state was
> + * already released on an earlier error path.
> + */
> + cxl_reset_unlock_memdevs(ctx);
> + cxl_pci_functions_reset_done(ctx);
> + cxl_pci_target_reset_done(ctx);
> + cxl_reset_put_memdevs(ctx);
> +}
> +
> +static int cxl_do_reset_locked(struct cxl_reset_context *ctx, bool mem_clear)
> +{
> + struct cxl_reset_region_context region_ctx;
> + int rc;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + cxl_reset_region_context_init(&region_ctx);
> +
> + rc = cxl_reset_collect_regions(ctx, &region_ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_pci_target_lock(ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_pci_functions_lock(ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_reset_lock_memdevs(ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_reset_cpu_cache_flush_preflight(&region_ctx, NULL);
> + if (rc)
> + goto out;
> +
> + rc = cxl_reset_validate_regions_idle(&region_ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_reset_flush_cpu_caches(&region_ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_pci_target_reset_prepare(ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_pci_functions_reset_prepare(ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_dev_reset(ctx->target, mem_clear);
> +
> + cxl_pci_target_restore_state(ctx);
> + cxl_pci_functions_restore_state(ctx);
> +
> + if (!rc)
> + rc = cxl_reset_restore_memdevs(ctx);
> +
> + cxl_reset_unlock_memdevs(ctx);
> +
> +out:
> + cxl_reset_region_context_destroy(&region_ctx);
> + return rc;
> +}
> +
> +static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> +{
> + struct cxl_reset_context ctx = {
> + .target = pdev,
> + };
> + int rc;
> +
> + /*
> + * Snapshot the CXL r3.2 9.7 device reset scope before taking
Update to latest revision. I think I mentioned 4.0 earlier, it may be at 4.1 at this point...
> + * cxl_rwsem.region. Hot-added functions after this point are not
> + * coordinated by this reset operation.
> + */
> + rc = cxl_reset_collect_siblings(&ctx);
> + if (rc)
> + goto out;
> +
> + rc = cxl_reset_collect_memdevs(&ctx);
> + if (rc)
> + goto out;
> +
> + scoped_guard(rwsem_write, &cxl_rwsem.region)
> + rc = cxl_do_reset_locked(&ctx, mem_clear);
> +
> +out:
> + cxl_reset_context_destroy(&ctx);
> return rc;
> }
* Re: [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (8 preceding siblings ...)
2026-05-28 8:31 ` [PATCH v6 9/9] Documentation/ABI: Document CXL memdev cxl_reset Srirangan Madhavan
@ 2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-02 21:42 ` Dan Williams (nvidia)
10 siblings, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 20:34 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Hi folks!
>
> This patch series introduces support for the CXL Reset method for CXL
> Type 2 devices, implementing the reset procedure outlined in the CXL
> Specification r3.2 [1], Sections 8.1.3, 9.6, and 9.7.
>
> The userspace ABI is a write-only cxl_reset attribute under the CXL
> memdev device:
>
> /sys/bus/cxl/devices/memX/cxl_reset
>
> The memdev is the userspace handle, while the implementation coordinates
> the target PCI function, affected sibling PCI functions, active CXL
> memdevs, and any CXL regions reachable through those memdevs.
>
This may be a dumb question, but where do type 2 drivers slot into this?
I think it's expected that they'll implement the ->reset_XXX callbacks for
their own handling? If not, you may want to look at where type 2 drivers
outside of drivers/cxl/ can have a hook to do any reset related clean up
or prep.
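For illustration, a small userspace model of the callback flow I have in mind, not kernel code. The model_* types and the accel_* driver are hypothetical; they only mirror the shape of the reset_prepare()/reset_done() hooks that the cover letter says are invoked through the exported PCI helpers.

```c
#include <stddef.h>

struct model_dev;

/* Hypothetical mirror of a driver's reset callbacks. */
struct model_reset_ops {
	void (*reset_prepare)(struct model_dev *dev);
	void (*reset_done)(struct model_dev *dev);
};

struct model_dev {
	const struct model_reset_ops *ops;	/* NULL if no driver is bound */
	char log[8];				/* records the order of events */
	size_t n;
};

static void model_log(struct model_dev *dev, char c)
{
	if (dev->n + 1 < sizeof(dev->log)) {
		dev->log[dev->n++] = c;
		dev->log[dev->n] = '\0';
	}
}

/* Hypothetical type 2 driver: quiesce before reset, reinitialize after. */
static void accel_reset_prepare(struct model_dev *dev)
{
	model_log(dev, 'P');
}

static void accel_reset_done(struct model_dev *dev)
{
	model_log(dev, 'D');
}

const struct model_reset_ops accel_reset_ops = {
	.reset_prepare = accel_reset_prepare,
	.reset_done = accel_reset_done,
};

/* Orchestrator: prepare hook, then the reset itself, then the done hook. */
void model_reset(struct model_dev *dev)
{
	if (dev->ops && dev->ops->reset_prepare)
		dev->ops->reset_prepare(dev);

	model_log(dev, 'R');	/* stands in for the actual CXL reset */

	if (dev->ops && dev->ops->reset_done)
		dev->ops->reset_done(dev);
}
```

If the callbacks turn out not to be enough for drivers outside drivers/cxl/, the same structure would need an extra hook between the reset and the decoder restore.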
Thanks,
Ben
> v6 changes (from v5 [2]):
> - Rebased on the current CXL tree used for v7.1-rc4 development.
> - Move the ABI from /sys/bus/pci/devices/.../cxl_reset to
> /sys/bus/cxl/devices/memX/cxl_reset.
> - Use the memdev as the userspace handle while keeping the reset
> orchestration scoped to the CXL device reset scope.
> - Reduce the earlier PCI/CXL save/restore series [3] to a single CXL HDM
> decoder restore/commit helper patch, included here as patch 1.
> - Do not offline or hot-remove memory as part of reset. Return -EBUSY
> if an affected CXL region is online as System RAM or has an active
> region driver bound.
> - Add reset-idle validation and CPU cache invalidation for affected CXL
> regions.
> - Add CXL sibling PCI function discovery using the Non-CXL Function Map
> DVSEC and CXL.cache/CXL.mem capability bits.
> - Coordinate PCI save/disable/restore and IOMMU reset prepare/done for
> the target and affected sibling functions.
> - Add CXL DVSEC reset sequencing, including CXL.cache disable,
> writeback-invalidate, a minimum 100ms quiet period, reset-complete
> polling, and Reset Error reporting.
> - Track affected memdevs, lock active memdevs across reset, restore and
> commit decoder state, re-enable CXL.mem, and wait for media ready
> after reset.
> - Cache reset capability at memdev registration time for sysfs
> visibility.
> - Document reset scope, Memory Clear not being requested, and -EBUSY
> behavior for active CXL regions.
>
> Motivation:
> -----------
> - As support for Type 2 devices is being introduced, more devices need a
> CXL-specific reset mechanism beyond bus-wide PCI reset methods.
>
> - FLR does not affect CXL.cache or CXL.mem protocol state, making CXL
> Reset the appropriate mechanism for cases where those protocols must
> be reset.
>
> - The CXL specification highlights use cases such as function rebinding
> and error recovery where CXL Reset is explicitly required.
>
> Change Description:
> -------------------
>
> Patch 1: cxl/hdm: Add helpers to restore and commit memdev decoders
> - Restore endpoint decoder programming from CXL core's cached decoder
> objects while keeping CXL.mem disabled.
> - Commit restored HDM decoders as a separate step so reset orchestration
> can re-enable CXL.mem only after safety checks complete.
>
> Patch 2: PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
> - Export PCI reset lifecycle helpers so CXL reset orchestration can save,
> disable, restore, and invoke reset callbacks for affected functions.
>
> Patch 3: cxl: Add reset-idle and cache flush helpers
> - Collect CXL regions affected by a memdev reset.
> - Fail reset if affected regions are not idle.
> - Invalidate CPU caches for each affected region once.
>
> Patch 4: PCI/CXL: Add sibling function coordination for reset
> - Identify CXL.cache/CXL.mem sibling functions in the reset scope.
> - Use the Non-CXL Function Map DVSEC to exclude non-CXL functions.
> - Save, disable, restore, and unlock affected PCI sibling functions.
>
> Patch 5: cxl/pci: Add CXL DVSEC reset helper
> - Execute CXL Reset through the CXL Device DVSEC.
> - Disable CXL.cache and request writeback-invalidate where supported.
> - Enforce the post-reset quiet period and poll for reset completion.
> - Block and restore IOMMU traffic while reset is active.
>
> Patch 6: cxl/pci: Track memdevs affected by CXL reset
> - Track the target memdev and any sibling-function memdevs affected by
> reset.
> - Revalidate and lock active memdevs before reset proceeds.
>
> Patch 7: cxl/pci: Orchestrate CXL reset for affected memdevs
> - Coordinate region validation, CPU cache invalidation, PCI function
> preparation, DVSEC reset, decoder restore and commit, CXL.mem enable,
> and media-ready wait.
>
> Patch 8: cxl/memdev: Add cxl_reset sysfs attribute
> - Expose /sys/bus/cxl/devices/memX/cxl_reset.
> - Only make the attribute visible when the underlying PCI function is
> Type 2 and reset capable.
> - Write a boolean true value, such as "1" or "true", to trigger reset.
>
> Patch 9: Documentation/ABI: Document CXL memdev cxl_reset
> - Document the new memdev sysfs ABI, reset scope, Memory Clear behavior,
> and idle-region requirement.
>
> The CPU cache invalidation step depends on
> cpu_cache_invalidate_memregion() support for the affected address ranges.
> If no provider is available, reset fails before hardware reset is
> requested.
>
> Command line to test CXL reset on a capable memdev:
>
> echo 1 > /sys/bus/cxl/devices/memX/cxl_reset
>
> Basic CXL DVSEC reset testing was done on a CXL Type 2 device. The reset
> sequence completed successfully and ResetComplete was observed. Full
> memdev/region integration testing is still in progress.
>
> References:
> [1] https://computeexpresslink.org/wp-content/uploads/2024/12/CXL_3.2-Spec-Announcement_FINAL-1.pdf
> [2] https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/
> [3] https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/
>
> Srirangan Madhavan (9):
> cxl/hdm: Add helpers to restore and commit memdev decoders
> PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
> cxl: Add reset-idle and cache flush helpers
> PCI/CXL: Add sibling function coordination for reset
> cxl/pci: Add CXL DVSEC reset helper
> cxl/pci: Track memdevs affected by CXL reset
> cxl/pci: Orchestrate CXL reset for affected memdevs
> cxl/memdev: Add cxl_reset sysfs attribute
> Documentation/ABI: Document CXL memdev cxl_reset
>
> Documentation/ABI/testing/sysfs-bus-cxl | 28 +
> drivers/cxl/core/hdm.c | 318 ++++++-
> drivers/cxl/core/memdev.c | 30 +
> drivers/cxl/core/pci.c | 1140 +++++++++++++++++++++++
> drivers/cxl/cxl.h | 5 +
> drivers/cxl/cxlmem.h | 2 +
> drivers/pci/pci.c | 22 +-
> include/linux/pci.h | 2 +
> include/uapi/linux/pci_regs.h | 15 +
> 9 files changed, 1557 insertions(+), 5 deletions(-)
>
> base-commit: abb3c0de119032f4c0c81177884a3bb0a133e6ca
* Re: [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers
2026-05-28 8:31 ` [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
@ 2026-06-02 20:36 ` Dave Jiang
2026-06-04 2:49 ` Dan Williams (nvidia)
2 siblings, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-02 20:36 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/26 1:31 AM, Srirangan Madhavan wrote:
> Add helpers to collect the CXL regions affected by a memdev reset,
> verify that those regions are idle, and invalidate CPU caches for the
> affected address ranges before reset.
>
> A memdev can participate in an interleaved region through multiple
> endpoint decoders. Track affected regions in a temporary xarray so each
> region is checked and cache-invalidated once per reset operation.
>
> These helpers prepare the CXL.mem data path for reset. The actual reset
> orchestration and decoder restore flow are added separately.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 170 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 170 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index d1f487b3d809..318744695f62 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -4,9 +4,11 @@
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/device.h>
> #include <linux/delay.h>
> +#include <linux/memregion.h>
> #include <linux/pci.h>
> #include <linux/pci-doe.h>
> #include <linux/aer.h>
> +#include <linux/xarray.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> #include <cxl.h>
> @@ -926,3 +928,171 @@ int cxl_port_get_possible_dports(struct cxl_port *port)
>
> return ctx.count;
> }
> +
> +static int cxl_reset_system_ram_found(struct resource *res, void *data)
> +{
> + return 1;
> +}
Reuse core/region.c:is_system_ram()?
DJ
> +
> +struct cxl_reset_region_context {
> + struct xarray regions;
> +};
> +
> +static void __maybe_unused
> +cxl_reset_region_context_init(struct cxl_reset_region_context *ctx)
> +{
> + xa_init(&ctx->regions);
> +}
> +
> +static void __maybe_unused
> +cxl_reset_region_context_destroy(struct cxl_reset_region_context *ctx)
> +{
> + xa_destroy(&ctx->regions);
> +}
> +
> +static int cxl_reset_add_region(struct cxl_reset_region_context *ctx,
> + struct cxl_region *cxlr)
> +{
> + int rc;
> +
> + if (!cxlr || !cxlr->params.res)
> + return 0;
> +
> + rc = xa_insert(&ctx->regions, (unsigned long)cxlr, cxlr, GFP_KERNEL);
> +
> + /* A region may be referenced by multiple affected endpoint decoders. */
> + return rc == -EBUSY ? 0 : rc;
> +}
> +
> +static int cxl_reset_collect_region(struct device *dev, void *data)
> +{
> + struct cxl_reset_region_context *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + return cxl_reset_add_region(ctx, cxled->cxld.region);
> +}
> +
> +static int __maybe_unused
> +cxl_reset_collect_memdev_regions(struct cxl_reset_region_context *ctx,
> + struct cxl_memdev *cxlmd)
> +{
> + struct cxl_port *endpoint;
> +
> + if (!cxlmd || !cxlmd->cxlds)
> + return -ENODEV;
> +
> + endpoint = cxlmd->endpoint;
> + if (!endpoint)
> + return 0;
> +
> + return device_for_each_child(&endpoint->dev, ctx,
> + cxl_reset_collect_region);
> +}
> +
> +static bool cxl_reset_region_has_system_ram(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int rc;
> +
> + if (!p->res)
> + return false;
> +
> + rc = walk_iomem_res_desc(IORES_DESC_NONE,
> + IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> + p->res->start, p->res->end, NULL,
> + cxl_reset_system_ram_found);
> +
> + return rc > 0;
> +}
> +
> +static int cxl_reset_validate_region_idle(struct cxl_region *cxlr)
> +{
> + struct resource *res = cxlr->params.res;
> + int rc = 0;
> +
> + lockdep_assert_held_write(&cxl_rwsem.region);
> +
> + if (cxl_reset_region_has_system_ram(cxlr)) {
> + dev_err(&cxlr->dev,
> + "Cannot reset while CXL memory is online as System RAM [%pr]\n",
> + res);
> + return -EBUSY;
> + }
> +
> + if (!device_trylock(&cxlr->dev))
> + return -EAGAIN;
> +
> + if (cxlr->dev.driver) {
> + dev_err(&cxlr->dev,
> + "Cannot reset while CXL region has an active driver\n");
> + rc = -EBUSY;
> + }
> +
> + device_unlock(&cxlr->dev);
> + return rc;
> +}
> +
> +static int __maybe_unused
> +cxl_reset_validate_regions_idle(struct cxl_reset_region_context *ctx)
> +{
> + struct cxl_region *cxlr;
> + unsigned long index;
> + int rc;
> +
> + xa_for_each(&ctx->regions, index, cxlr) {
> + rc = cxl_reset_validate_region_idle(cxlr);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +static int cxl_reset_flush_region_cache(struct cxl_region *cxlr)
> +{
> + struct resource *res = cxlr->params.res;
> + int rc;
> +
> + if (!res)
> + return 0;
> +
> + rc = cpu_cache_invalidate_memregion(res->start, resource_size(res));
> + if (rc)
> + dev_err(&cxlr->dev, "Failed to invalidate CPU cache [%pr]: %d\n",
> + res, rc);
> +
> + return rc;
> +}
> +
> +static int __maybe_unused
> +cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
> +{
> + struct cxl_region *cxlr;
> + unsigned long index;
> + int rc;
> +
> + if (xa_empty(&ctx->regions))
> + return 0;
> +
> + if (!cpu_cache_has_invalidate_memregion()) {
> + if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
> + pr_info_once(
> + "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
> + return 0;
> + }
> + pr_warn("Failed to synchronize CPU cache state\n");
> + return -ENXIO;
> + }
> +
> + xa_for_each(&ctx->regions, index, cxlr) {
> + rc = cxl_reset_flush_region_cache(cxlr);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
* Re: [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute
2026-05-28 8:31 ` [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute Srirangan Madhavan
@ 2026-06-02 21:35 ` Cheatham, Benjamin
2026-06-02 23:50 ` Dave Jiang
1 sibling, 0 replies; 32+ messages in thread
From: Cheatham, Benjamin @ 2026-06-02 21:35 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Expose CXL reset through the CXL memdev device. The reset flow
> depends on CXL memdev state to identify affected regions, coordinate
> decoder restore, and keep CXL-specific policy out of the PCI sysfs ABI.
>
> Add a write-only cxl_reset attribute under memX. The attribute is visible
> only when the memdev's PCI parent advertises CXL Reset capability.
> Writing a true boolean value invokes the CXL reset orchestration.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/memdev.c | 30 +++++++++++
> drivers/cxl/core/pci.c | 102 +++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 3 ++
> drivers/cxl/cxlmem.h | 2 +
> 4 files changed, 136 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 80e65690eb77..af67fa3d11b8 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -199,6 +199,26 @@ static ssize_t security_erase_store(struct device *dev,
> static struct device_attribute dev_attr_security_erase =
> __ATTR(erase, 0200, NULL, security_erase_store);
>
> +static ssize_t cxl_reset_store(struct device *dev,
> + struct device_attribute *attr, const char *buf,
> + size_t len)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + bool reset;
> + int rc;
> +
> + rc = kstrtobool(buf, &reset);
> + if (rc)
> + return rc;
> +
> + if (!reset)
> + return -EINVAL;
> +
> + rc = cxl_memdev_reset(cxlmd);
> + return rc ? rc : len;
> +}
> +static DEVICE_ATTR_WO(cxl_reset);
> +
> bool cxl_memdev_has_poison_cmd(struct cxl_memdev *cxlmd,
> enum poison_cmd_enabled_bits cmd)
> {
> @@ -421,6 +441,7 @@ static struct attribute *cxl_memdev_attributes[] = {
> &dev_attr_payload_max.attr,
> &dev_attr_label_storage_size.attr,
> &dev_attr_numa_node.attr,
> + &dev_attr_cxl_reset.attr,
> NULL,
> };
>
> @@ -485,8 +506,16 @@ static struct attribute *cxl_memdev_security_attributes[] = {
> static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
> int n)
> {
> + struct device *dev = kobj_to_dev(kobj);
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
> if (!IS_ENABLED(CONFIG_NUMA) && a == &dev_attr_numa_node.attr)
> return 0;
> +
> + if (a == &dev_attr_cxl_reset.attr &&
> + !cxl_memdev_reset_capable(cxlmd))
> + return 0;
> +
> return a->mode;
> }
>
> @@ -1099,6 +1128,7 @@ static int cxlmd_add(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
>
> cxlmd->cxlds = cxlds;
> cxlds->cxlmd = cxlmd;
> + cxl_memdev_init_reset(cxlmd);
>
> rc = cdev_device_add(&cxlmd->cdev, &cxlmd->dev);
> if (rc) {
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 486c447e98f3..09f016544d24 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1207,6 +1207,22 @@ static bool cxl_reset_has_cache_or_mem(struct pci_dev *pdev)
> return cap & (PCI_DVSEC_CXL_CACHE_CAPABLE | PCI_DVSEC_CXL_MEM_CAPABLE);
> }
>
> +static bool cxl_reset_is_type2(struct pci_dev *pdev)
> +{
> + u16 dvsec, cap;
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + return false;
> +
> + if (pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap))
> + return false;
> +
> + return (cap & PCI_DVSEC_CXL_CACHE_CAPABLE) &&
> + (cap & PCI_DVSEC_CXL_MEM_CAPABLE);
> +}
> +
> static int cxl_reset_add_sibling(struct cxl_reset_context *ctx,
> struct pci_dev *sibling)
> {
> @@ -1939,7 +1955,7 @@ static int cxl_do_reset_locked(struct cxl_reset_context *ctx, bool mem_clear)
> return rc;
> }
>
> -static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> +static int cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> {
> struct cxl_reset_context ctx = {
> .target = pdev,
> @@ -1966,3 +1982,87 @@ static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> cxl_reset_context_destroy(&ctx);
> return rc;
> }
> +
> +static struct pci_dev *cxl_reset_get_fn0(struct pci_dev *pdev)
> +{
> + unsigned int devfn;
> +
> + /*
> + * CXL Reset control/status is exposed in Function 0 and affects all
> + * CXL.cache/mem functions in the device.
> + */
> + if (pci_ari_enabled(pdev->bus))
> + devfn = 0;
> + else
> + devfn = PCI_DEVFN(PCI_SLOT(pdev->devfn), 0);
> +
> + if (pdev->devfn == devfn)
> + return pci_dev_get(pdev);
> +
> + return pci_get_slot(pdev->bus, devfn);
> +}
> +
> +static bool cxl_memdev_probe_reset_capable(struct cxl_memdev *cxlmd)
> +{
> + struct device *dev = cxlmd->dev.parent;
> + struct pci_dev *pdev, *fn0;
> + int dvsec;
> + u16 cap;
> +
> + if (!dev || !dev_is_pci(dev))
> + return false;
> +
> + pdev = to_pci_dev(dev);
> + if (!cxl_reset_is_type2(pdev))
> + return false;
> +
> + fn0 = cxl_reset_get_fn0(pdev);
> + if (!fn0)
> + return false;
> +
> + dvsec = pci_find_dvsec_capability(fn0, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + goto out;
> +
> + if (pci_read_config_word(fn0, dvsec + PCI_DVSEC_CXL_CAP, &cap))
> + goto out;
> +
> + pci_dev_put(fn0);
> + return cap & PCI_DVSEC_CXL_RST_CAPABLE;
> +
> +out:
> + pci_dev_put(fn0);
> + return false;
> +}
> +
> +void cxl_memdev_init_reset(struct cxl_memdev *cxlmd)
> +{
> + cxlmd->reset_capable = cxl_memdev_probe_reset_capable(cxlmd);
> +}
> +
> +bool cxl_memdev_reset_capable(struct cxl_memdev *cxlmd)
> +{
> + return cxlmd->reset_capable;
> +}
I would get rid of these and just set reset_capable in cxlmd_add() and check
reset_capable directly.
I guess these could be used by a type 2 driver in the future, but it's probably
better to defer creating these functions until then.
> +
> +int cxl_memdev_reset(struct cxl_memdev *cxlmd)
> +{
> + struct device *dev = cxlmd->dev.parent;
> + struct pci_dev *fn0;
> + int rc;
> +
> + if (!cxl_memdev_reset_capable(cxlmd))
> + return -EOPNOTSUPP;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + fn0 = cxl_reset_get_fn0(to_pci_dev(dev));
> + if (!fn0)
> + return -ENODEV;
> +
> + rc = cxl_do_reset(fn0, false);
> + pci_dev_put(fn0);
> + return rc;
> +}
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b51b1e9d6400..bf65996e24dc 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -796,6 +796,9 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
> struct cxl_endpoint_dvsec_info *info);
> int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
> int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
> +void cxl_memdev_init_reset(struct cxl_memdev *cxlmd);
> +bool cxl_memdev_reset_capable(struct cxl_memdev *cxlmd);
> +int cxl_memdev_reset(struct cxl_memdev *cxlmd);
>
> bool is_cxl_region(struct device *dev);
>
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 776c50d1db51..c8e7349fb130 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -48,6 +48,7 @@ struct cxl_memdev_attach {
> * @cxl_nvd: optional bridge to an nvdimm if the device supports pmem
> * @endpoint: connection to the CXL port topology for this memory device
> * @attach: creator of this memdev depends on CXL link attach to operate
> + * @reset_capable: cached CXL Reset support
> * @id: id number of this memdev instance.
> * @depth: endpoint port depth
> * @scrub_cycle: current scrub cycle set for this device
> @@ -65,6 +66,7 @@ struct cxl_memdev {
> struct cxl_nvdimm *cxl_nvd;
> struct cxl_port *endpoint;
> const struct cxl_memdev_attach *attach;
> + bool reset_capable;
Should this go into cxl_dev_state instead? This seems like a driver state thing, but the
line is a bit blurry here.
> int id;
> int depth;
> u8 scrub_cycle;
* Re: [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
` (9 preceding siblings ...)
2026-06-02 20:34 ` [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Cheatham, Benjamin
@ 2026-06-02 21:42 ` Dan Williams (nvidia)
10 siblings, 0 replies; 32+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-02 21:42 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Srirangan Madhavan wrote:
> Hi folks!
>
> This patch series introduces support for the CXL Reset method for CXL
> Type 2 devices, implementing the reset procedure outlined in the CXL
> Specification r3.2 [1], Sections 8.1.3, 9.6, and 9.7.
>
> The userspace ABI is a write-only cxl_reset attribute under the CXL
> memdev device:
>
> /sys/bus/cxl/devices/memX/cxl_reset
Hi Srirangan,
To move this forward we need a compromise between avoiding a
reimplementation of CXL bits in drivers/pci/ (what I reacted to in the
initial postings) and keeping the /sys/bus/pci reset entry point (what
you and Alex defended in response to my comments).
I started a suggestion here...
http://lore.kernel.org/6a0620acec806_57ad71008c@djbw-dev.notmuch
...however, looking at it again, this:
echo 1 > /sys/bus/pci/devices/$pdev/cxl/reset
...ends up functionally equivalent to the original:
echo cxl_reset > /sys/bus/pci/devices/$pdev/reset_method
echo 1 > /sys/bus/pci/devices/$pdev/reset
Now, my motivations for pushing on /sys/bus/cxl/devices/memX/cxl_reset
were to avoid duplicating HDM enumeration in multiple places, and to
coordinate changes to the CXL memory configuration with CXL reset. I.e.
CXL reset can take HDM locks (where the PCI reset device locks may not
be sufficient).
The fatal downside of that proposal is that the memX/cxl_reset ABI
requires driver loading. Long term, as you and Alex convinced me, that
is going to be a pain and breaks current device assignment flows.
A compromise that lets PCI and CXL share infrastructure while still
supporting the long-standing PCI reset ABI is:
1/ Carry CXL decoder settings in the PCI device
2/ Build in shared low-level helpers for marshaling decoder settings
to/from hardware.
3/ Allow the low-level helpers to reference CXL locks
I drafted a rough conversion of what would be needed to share this
low-level coordination across the PCI and CXL core.
It introduces 'struct cxl_decoder_settings' and moves all the HDM decode
related definitions to cxl/cxl.h. It moves the core locks and low-level
hardware update helpers into a built-in drivers/cxl/core/reset.o object
where all of this reset coordination can be shared. It provides for
saving and restoring HDM state not just over reset, but from initial
device enumeration for devices that may forget their CXL configuration
for other reasons besides PCI reset.
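To make the 'struct cxl_decoder_settings' idea concrete, here is a
minimal userspace sketch (not part of the patch below) of how a tagged
struct group lets the same fields be reached both as cxld->id and as a
standalone settings object. The macro is a simplified stand-in for the
kernel's struct_group_tagged(), 'struct range' and the 'dev' / 'region'
placeholders are reduced to plain types, and settings_save() /
settings_restore() are hypothetical helpers named only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified userspace stand-in for the kernel's struct_group_tagged():
 * the same members are declared twice inside an anonymous union, once
 * anonymously (so cxld->id keeps working) and once as a named, tagged
 * struct (so the group can be copied and passed as its own type).
 */
#define struct_group_tagged(TAG, NAME, ...)		\
	union {						\
		struct { __VA_ARGS__ };			\
		struct TAG { __VA_ARGS__ } NAME;	\
	}

struct range {
	uint64_t start;
	uint64_t end;
};

struct cxl_decoder {
	void *dev;	/* placeholder for struct device */
	struct_group_tagged(cxl_decoder_settings, settings,
		int id;
		struct range hpa_range;
		union {
			uint64_t skip;		/* endpoint decoders */
			uint64_t targets;	/* switch decoders */
		};
		int interleave_ways;
		int interleave_granularity;
		unsigned long flags;
	);
	void *region;	/* placeholder for struct cxl_region * */
};

/* Both views of a grouped member must land on the same storage */
static_assert(offsetof(struct cxl_decoder, id) ==
	      offsetof(struct cxl_decoder, settings.id),
	      "grouped members must alias");

/* Hypothetical helper: snapshot the settings group, e.g. before reset */
static void settings_save(const struct cxl_decoder *cxld,
			  struct cxl_decoder_settings *saved)
{
	*saved = cxld->settings;
}

/* Hypothetical helper: put a snapshot back, e.g. after reset */
static void settings_restore(struct cxl_decoder *cxld,
			     const struct cxl_decoder_settings *saved)
{
	cxld->settings = *saved;
}
```

The property this is meant to show is that a saved 'struct
cxl_decoder_settings' holds no reference to the decoder device itself,
which is what would allow a copy to be carried in the PCI device and
replayed by shared helpers.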
The bulk of this is movement from drivers/cxl/cxl.h to
include/cxl/cxl.h, and drivers/cxl/core/hdm.c to
drivers/cxl/core/reset.c.
Thoughts? Does this compromise address all the open ABI concerns? I will
go through the rest of the patches and provide some notes with this
proposal in mind.
Applies against v7.1-rc3, needs splitting once we agree on this shape
(only build tested):
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 80aeb0d556bd..a809ba0dcc0c 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -5,6 +5,7 @@ menuconfig CXL_BUS
select FW_LOADER
select FW_UPLOAD
select PCI_DOE
+ select CXL_HDM
select FIRMWARE_TABLE
select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
select FWCTL if CXL_FEATURES
@@ -243,4 +244,7 @@ config CXL_ATL
depends on CXL_REGION
depends on ACPI_PRMT && AMD_NB
+config CXL_HDM
+ bool
+
endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..ebb0891daeb5 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_CXL_BUS) += cxl_core.o
obj-$(CONFIG_CXL_SUSPEND) += suspend.o
+obj-$(CONFIG_CXL_HDM) += reset.o
ccflags-y += -I$(srctree)/drivers/cxl
CFLAGS_trace.o = -DTRACE_INCLUDE_PATH=. -I$(src)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 1297594beaec..e31462fcf37b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -252,49 +252,8 @@ int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
#define CXL_DECODER_F_NORMALIZED_ADDRESSING BIT(6)
#define CXL_DECODER_F_RESET_MASK (CXL_DECODER_F_ENABLE | CXL_DECODER_F_LOCK)
-enum cxl_decoder_type {
- CXL_DECODER_DEVMEM = 2,
- CXL_DECODER_HOSTONLYMEM = 3,
-};
-
-/*
- * Current specification goes up to 8, double that seems a reasonable
- * software max for the foreseeable future
- */
-#define CXL_DECODER_MAX_INTERLEAVE 16
-
#define CXL_QOS_CLASS_INVALID -1
-/**
- * struct cxl_decoder - Common CXL HDM Decoder Attributes
- * @dev: this decoder's device
- * @id: kernel device name id
- * @hpa_range: Host physical address range mapped by this decoder
- * @interleave_ways: number of cxl_dports in this decode
- * @interleave_granularity: data stride per dport
- * @target_type: accelerator vs expander (type2 vs type3) selector
- * @region: currently assigned region for this decoder
- * @flags: memory type capabilities and locking
- * @target_map: cached copy of hardware port-id list, available at init
- * before all @dport objects have been instantiated. While
- * dport id is 8bit, CFMWS interleave targets are 32bits.
- * @commit: device/decoder-type specific callback to commit settings to hw
- * @reset: device/decoder-type specific callback to reset hw settings
-*/
-struct cxl_decoder {
- struct device dev;
- int id;
- struct range hpa_range;
- int interleave_ways;
- int interleave_granularity;
- enum cxl_decoder_type target_type;
- struct cxl_region *region;
- unsigned long flags;
- u32 target_map[CXL_DECODER_MAX_INTERLEAVE];
- int (*commit)(struct cxl_decoder *cxld);
- void (*reset)(struct cxl_decoder *cxld);
-};
-
/*
* Track whether this decoder is free for userspace provisioning, reserved for
* region autodiscovery, whether it is started connecting (awaiting other
@@ -310,7 +269,6 @@ enum cxl_decoder_state {
* struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
* @cxld: base cxl_decoder_object
* @dpa_res: actively claimed DPA span of this decoder
- * @skip: offset into @dpa_res where @cxld.hpa_range maps
* @state: autodiscovery state
* @part: partition index this decoder maps
* @pos: interleave position in @cxld.region
@@ -318,7 +276,6 @@ enum cxl_decoder_state {
struct cxl_endpoint_decoder {
struct cxl_decoder cxld;
struct resource *dpa_res;
- resource_size_t skip;
enum cxl_decoder_state state;
int part;
int pos;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index fa7269154620..1460bfefe593 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -5,6 +5,7 @@
#ifndef __CXL_CXL_H__
#define __CXL_CXL_H__
+#include <linux/device.h>
#include <linux/node.h>
#include <linux/ioport.h>
#include <cxl/mailbox.h>
@@ -23,7 +24,56 @@ enum cxl_devtype {
CXL_DEVTYPE_CLASSMEM,
};
-struct device;
+enum cxl_decoder_type {
+ CXL_DECODER_DEVMEM = 2,
+ CXL_DECODER_HOSTONLYMEM = 3,
+};
+
+/*
+ * Current specification goes up to 8, double that seems a reasonable
+ * software max for the foreseeable future
+ */
+#define CXL_DECODER_MAX_INTERLEAVE 16
+
+/**
+ * struct cxl_decoder - Common CXL HDM Decoder Attributes
+ * @dev: this decoder's device
+ * @id: kernel device name id
+ * @hpa_range: Host physical address range mapped by this decoder
+ * @skip: offset into @dpa_res where @cxld.hpa_range maps (endpoint)
+ * @targets: interleave position to dport mapping (switch)
+ * @interleave_ways: number of cxl_dports in this decode
+ * @interleave_granularity: data stride per dport
+ * @target_type: accelerator vs expander (type2 vs type3) selector
+ * @flags: memory type capabilities and locking
+ * @region: currently assigned region for this decoder
+ * @target_map: cached copy of hardware port-id list, available at init
+ * before all @dport objects have been instantiated. While
+ * dport id is 8bit, CFMWS interleave targets are 32bits.
+ * @commit: device/decoder-type specific callback to commit settings to hw
+ * @reset: device/decoder-type specific callback to reset hw settings
+*/
+struct cxl_decoder {
+ struct device dev;
+ struct_group_tagged(cxl_decoder_settings, settings,
+ int id;
+ struct range hpa_range;
+ union {
+ u64 skip;
+ u64 targets;
+ };
+ int interleave_ways;
+ int interleave_granularity;
+ enum cxl_decoder_type target_type;
+ unsigned long flags;
+ );
+ struct cxl_region *region;
+ u32 target_map[CXL_DECODER_MAX_INTERLEAVE];
+ int (*commit)(struct cxl_decoder *cxld);
+ void (*reset)(struct cxl_decoder *cxld);
+};
+
+int cxl_commit(struct cxl_decoder_settings *cxld, void __iomem *hdm);
/*
* Using struct_group() allows for per register-block-type helper routines,
@@ -116,6 +166,12 @@ struct cxl_register_map {
};
};
+struct cxl_hdm_info {
+ int decoder_count;
+ struct cxl_component_regs regs;
+ struct cxl_decoder_settings settings[] __counted_by(decoder_count);
+};
+
/**
* struct cxl_dpa_perf - DPA performance property entry
* @dpa_range: range for DPA address
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..35d05c8bdd43 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -39,6 +39,7 @@
#include <linux/io.h>
#include <linux/resource_ext.h>
#include <linux/msi_api.h>
+#include <cxl/cxl.h>
#include <uapi/linux/pci.h>
#include <linux/pci_ids.h>
@@ -577,6 +578,9 @@ struct pci_dev {
#endif
#ifdef CONFIG_PCI_TSM
struct pci_tsm *tsm; /* TSM operation state */
+#endif
+#ifdef CONFIG_CXL_HDM
+ struct cxl_hdm_info *hdm;
#endif
u16 acs_cap; /* ACS Capability offset */
u16 acs_capabilities; /* ACS Capabilities */
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0c80b76a5f9b..8c236d116174 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -16,11 +16,6 @@
* for enumerating these registers and capabilities.
*/
-struct cxl_rwsem cxl_rwsem = {
- .region = __RWSEM_INITIALIZER(cxl_rwsem.region),
- .dpa = __RWSEM_INITIALIZER(cxl_rwsem.dpa),
-};
-
static int add_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld)
{
int rc;
@@ -249,17 +244,18 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
struct cxl_port *port = cxled_to_port(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ struct cxl_decoder *cxld = &cxled->cxld;
struct resource *res = cxled->dpa_res;
resource_size_t skip_start;
lockdep_assert_held_write(&cxl_rwsem.dpa);
/* save @skip_start, before @res is released */
- skip_start = res->start - cxled->skip;
+ skip_start = res->start - cxld->skip;
__release_region(&cxlds->dpa_res, res->start, resource_size(res));
- if (cxled->skip)
- release_skip(cxlds, skip_start, cxled->skip);
- cxled->skip = 0;
+ if (cxld->skip)
+ release_skip(cxlds, skip_start, cxld->skip);
+ cxld->skip = 0;
cxled->dpa_res = NULL;
put_device(&cxled->cxld.dev);
port->hdm_end--;
@@ -343,6 +339,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
struct cxl_port *port = cxled_to_port(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ struct cxl_decoder *cxld = &cxled->cxld;
struct device *dev = &port->dev;
struct resource *res;
int rc;
@@ -388,7 +385,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
return -EBUSY;
}
cxled->dpa_res = res;
- cxled->skip = skipped;
+ cxld->skip = skipped;
/*
* When allocating new capacity, ->part is already set, when
@@ -679,39 +676,12 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
}
-static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
-{
- u16 eig;
- u8 eiw;
-
- /*
- * Input validation ensures these warns never fire, but otherwise
- * suppress unititalized variable usage warnings.
- */
- if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
- "invalid interleave_ways: %d\n", cxld->interleave_ways))
- return;
- if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
- "invalid interleave_granularity: %d\n",
- cxld->interleave_granularity))
- return;
-
- u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
- u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
- *ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
-}
-
-static void cxld_set_type(struct cxl_decoder *cxld, u32 *ctrl)
-{
- u32p_replace_bits(ctrl,
- !!(cxld->target_type == CXL_DECODER_HOSTONLYMEM),
- CXL_HDM_DECODER0_CTRL_HOSTONLY);
-}
-
-static void cxlsd_set_targets(struct cxl_switch_decoder *cxlsd, u64 *tgt)
+static void cxlsd_set_targets(struct cxl_decoder *cxld)
{
+ struct cxl_switch_decoder *cxlsd = to_cxl_switch_decoder(&cxld->dev);
struct cxl_dport **t = &cxlsd->target[0];
int ways = cxlsd->cxld.interleave_ways;
+ u64 *tgt = &cxld->targets;
*tgt = FIELD_PREP(GENMASK(7, 0), t[0]->port_id);
if (ways > 1)
@@ -730,73 +700,6 @@ static void cxlsd_set_targets(struct cxl_switch_decoder *cxlsd, u64 *tgt)
*tgt |= FIELD_PREP(GENMASK_ULL(63, 56), t[7]->port_id);
}
-/*
- * Per CXL 2.0 8.2.5.12.20 Committing Decoder Programming, hardware must set
- * committed or error within 10ms, but just be generous with 20ms to account for
- * clock skew and other marginal behavior
- */
-#define COMMIT_TIMEOUT_MS 20
-static int cxld_await_commit(void __iomem *hdm, int id)
-{
- u32 ctrl;
- int i;
-
- for (i = 0; i < COMMIT_TIMEOUT_MS; i++) {
- ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
- if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMIT_ERROR, ctrl)) {
- ctrl &= ~CXL_HDM_DECODER0_CTRL_COMMIT;
- writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
- return -EIO;
- }
- if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
- return 0;
- fsleep(1000);
- }
-
- return -ETIMEDOUT;
-}
-
-static void setup_hw_decoder(struct cxl_decoder *cxld, void __iomem *hdm)
-{
- int id = cxld->id;
- u64 base, size;
- u32 ctrl;
-
- /* common decoder settings */
- ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
- cxld_set_interleave(cxld, &ctrl);
- cxld_set_type(cxld, &ctrl);
- base = cxld->hpa_range.start;
- size = range_len(&cxld->hpa_range);
-
- writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
- writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
- writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
- writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
-
- if (is_switch_decoder(&cxld->dev)) {
- struct cxl_switch_decoder *cxlsd =
- to_cxl_switch_decoder(&cxld->dev);
- void __iomem *tl_hi = hdm + CXL_HDM_DECODER0_TL_HIGH(id);
- void __iomem *tl_lo = hdm + CXL_HDM_DECODER0_TL_LOW(id);
- u64 targets;
-
- cxlsd_set_targets(cxlsd, &targets);
- writel(upper_32_bits(targets), tl_hi);
- writel(lower_32_bits(targets), tl_lo);
- } else {
- struct cxl_endpoint_decoder *cxled =
- to_cxl_endpoint_decoder(&cxld->dev);
- void __iomem *sk_hi = hdm + CXL_HDM_DECODER0_SKIP_HIGH(id);
- void __iomem *sk_lo = hdm + CXL_HDM_DECODER0_SKIP_LOW(id);
-
- writel(upper_32_bits(cxled->skip), sk_hi);
- writel(lower_32_bits(cxled->skip), sk_lo);
- }
-
- writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
-}
-
static int cxl_decoder_commit(struct cxl_decoder *cxld)
{
struct cxl_port *port = to_cxl_port(cxld->dev.parent);
@@ -832,21 +735,17 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
dev_name(&cxld->dev));
return -EBUSY;
}
- }
-
- scoped_guard(rwsem_read, &cxl_rwsem.dpa)
- setup_hw_decoder(cxld, hdm);
+ } else
+ cxlsd_set_targets(cxld);
- rc = cxld_await_commit(hdm, cxld->id);
- if (rc) {
+ rc = cxl_commit(&cxld->settings, hdm);
+ if (rc)
dev_dbg(&port->dev, "%s: error %d committing decoder\n",
dev_name(&cxld->dev), rc);
- return rc;
- }
- port->commit_end++;
- cxld->flags |= CXL_DECODER_F_ENABLE;
+ else
+ port->commit_end++;
- return 0;
+ return rc;
}
static int commit_reap(struct device *dev, void *data)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e50dc716d4e8..0349d73140e3 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2899,6 +2899,7 @@ static int poison_by_decoder(struct device *dev, void *arg)
struct cxl_endpoint_decoder *cxled;
enum cxl_partition_mode mode;
struct cxl_dev_state *cxlds;
+ struct cxl_decoder *cxld;
struct cxl_memdev *cxlmd;
u64 offset, length;
int rc = 0;
@@ -2912,11 +2913,12 @@ static int poison_by_decoder(struct device *dev, void *arg)
cxlmd = cxled_to_memdev(cxled);
cxlds = cxlmd->cxlds;
+ cxld = &cxled->cxld;
mode = cxlds->part[cxled->part].mode;
- if (cxled->skip) {
- offset = cxled->dpa_res->start - cxled->skip;
- length = cxled->skip;
+ if (cxld->skip) {
+ offset = cxled->dpa_res->start - cxld->skip;
+ length = cxld->skip;
rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
if (rc == -EFAULT && mode == CXL_PARTMODE_RAM)
rc = 0;
diff --git a/drivers/cxl/core/reset.c b/drivers/cxl/core/reset.c
new file mode 100644
index 000000000000..0b4372b6d608
--- /dev/null
+++ b/drivers/cxl/core/reset.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation & Affiliates */
+#include <cxl/cxl.h>
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/range.h>
+#include <cxl.h>
+#include "core.h"
+
+/*
+ * Common lowlevel setup and re-initialization (reset) helpers for the
+ * CXL memory associated with a PCI device. CXL core locks are built-in
+ * to the main kernel image for coordination with in-kernel mechanisms
+ * like reset.
+ */
+
+struct cxl_rwsem cxl_rwsem = {
+ .region = __RWSEM_INITIALIZER(cxl_rwsem.region),
+ .dpa = __RWSEM_INITIALIZER(cxl_rwsem.dpa),
+};
+EXPORT_SYMBOL_FOR_MODULES(cxl_rwsem, "cxl_core");
+
+static void cxld_set_interleave(struct cxl_decoder_settings *cxld, u32 *ctrl)
+{
+ u16 eig;
+ u8 eiw;
+
+ /*
+ * Input validation ensures these warns never fire, but otherwise
+ * suppress uninitialized variable usage warnings.
+ */
+ if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
+ "invalid interleave_ways: %d\n", cxld->interleave_ways))
+ return;
+ if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
+ "invalid interleave_granularity: %d\n",
+ cxld->interleave_granularity))
+ return;
+
+ u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
+ u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
+ *ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
+}
+
+static void cxld_set_type(struct cxl_decoder_settings *cxld, u32 *ctrl)
+{
+ u32p_replace_bits(ctrl,
+ !!(cxld->target_type == CXL_DECODER_HOSTONLYMEM),
+ CXL_HDM_DECODER0_CTRL_HOSTONLY);
+}
+
+static void setup_hw_decoder(struct cxl_decoder_settings *cxld, void __iomem *hdm)
+{
+ u32 ctrl;
+ u64 base, size;
+ int id = cxld->id;
+ void __iomem *sk_hi = hdm + CXL_HDM_DECODER0_SKIP_HIGH(id);
+ void __iomem *sk_lo = hdm + CXL_HDM_DECODER0_SKIP_LOW(id);
+
+ /* common decoder settings */
+ ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+ cxld_set_interleave(cxld, &ctrl);
+ cxld_set_type(cxld, &ctrl);
+ base = cxld->hpa_range.start;
+ size = range_len(&cxld->hpa_range);
+
+ writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
+ writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
+ writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
+ writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
+
+ /* endpoint 'skip' and switch 'targets' settings alias */
+ writel(upper_32_bits(cxld->skip), sk_hi);
+ writel(lower_32_bits(cxld->skip), sk_lo);
+
+ writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+}
+
+/*
+ * Per CXL 2.0 8.2.5.12.20 Committing Decoder Programming, hardware must set
+ * committed or error within 10ms, but just be generous with 20ms to account for
+ * clock skew and other marginal behavior
+ */
+#define COMMIT_TIMEOUT_MS 20
+static int cxld_await_commit(void __iomem *hdm, int id)
+{
+ u32 ctrl;
+ int i;
+
+ for (i = 0; i < COMMIT_TIMEOUT_MS; i++) {
+ ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+ if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMIT_ERROR, ctrl)) {
+ ctrl &= ~CXL_HDM_DECODER0_CTRL_COMMIT;
+ writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+ return -EIO;
+ }
+ if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
+ return 0;
+ fsleep(1000);
+ }
+
+ return -ETIMEDOUT;
+}
+
+int cxl_commit(struct cxl_decoder_settings *cxld, void __iomem *hdm)
+{
+ int rc;
+
+ scoped_guard(rwsem_read, &cxl_rwsem.dpa)
+ setup_hw_decoder(cxld, hdm);
+
+ rc = cxld_await_commit(hdm, cxld->id);
+ if (rc == 0)
+ cxld->flags |= CXL_DECODER_F_ENABLE;
+ return rc;
+}
+EXPORT_SYMBOL_FOR_MODULES(cxl_commit, "cxl_core");
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 418669927fb0..de088bb930c3 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -840,11 +840,11 @@ static int cxld_registry_restore(struct cxl_decoder *cxld,
dbg_cxld(port, "restore", &td->cxled.cxld);
cxld_copy(cxld, &td->cxled.cxld);
cxled->state = td->cxled.state;
- cxled->skip = td->cxled.skip;
+ cxld->skip = td->cxled.cxld.skip;
if (range_len(&td->dpa_range)) {
rc = devm_cxl_dpa_reserve(cxled, td->dpa_range.start,
range_len(&td->dpa_range),
- td->cxled.skip);
+ td->cxled.cxld.skip);
if (rc) {
init_disabled_mock_decoder(cxld);
return rc;
@@ -882,7 +882,7 @@ static void __cxld_registry_save(struct cxl_test_decoder *td,
cxld_copy(&td->cxled.cxld, cxld);
td->cxled.state = cxled->state;
- td->cxled.skip = cxled->skip;
+ td->cxled.cxld.skip = cxld->skip;
if (!(cxld->flags & CXL_DECODER_F_ENABLE)) {
td->dpa_range.start = 0;
@@ -970,7 +970,7 @@ static void mock_decoder_reset(struct cxl_decoder *cxld)
to_cxl_endpoint_decoder(&cxld->dev);
cxled->state = CXL_DECODER_STATE_MANUAL;
- cxled->skip = 0;
+ cxld->skip = 0;
}
if (decoder_reset_preserve_registry)
dev_dbg(port->uport_dev, "decoder%d: skip registry update\n",
@@ -1021,7 +1021,7 @@ static void init_disabled_mock_decoder(struct cxl_decoder *cxld)
to_cxl_endpoint_decoder(&cxld->dev);
cxled->state = CXL_DECODER_STATE_MANUAL;
- cxled->skip = 0;
+ cxld->skip = 0;
}
}
* Re: [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset
2026-05-28 8:31 ` [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset Srirangan Madhavan
2026-05-28 11:15 ` Richard Cheng
@ 2026-06-02 22:10 ` Dave Jiang
2026-06-04 3:13 ` Dan Williams (nvidia)
2 siblings, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-02 22:10 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/26 1:31 AM, Srirangan Madhavan wrote:
> Add helpers to collect CXL sibling PCI functions affected by a CXL reset
> and prepare them for reset by saving and disabling them. Restore those
> siblings and drop their references when reset coordination completes.
>
> Use the Non-CXL Function Map DVSEC to exclude non-CXL functions, and
> filter remaining siblings to functions that advertise CXL.cache or
> CXL.mem capability.
>
> Use pci_dev_trylock() for sibling locking and unwind on contention or
> allocation failure, so competing reset paths fail with an errno.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 207 ++++++++++++++++++++++++++++++++++
> include/uapi/linux/pci_regs.h | 2 +
> 2 files changed, 209 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 318744695f62..01effbb4e7cd 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1,9 +1,11 @@
> // SPDX-License-Identifier: GPL-2.0-only
> /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
> #include <linux/units.h>
> +#include <linux/bitmap.h>
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/device.h>
> #include <linux/delay.h>
> +#include <linux/iommu.h>
> #include <linux/memregion.h>
> #include <linux/pci.h>
> #include <linux/pci-doe.h>
> @@ -15,6 +17,10 @@
> #include "core.h"
> #include "trace.h"
>
> +#define CXL_RESET_MAX_FUNCTIONS 256
Can use MAX_NR_DEVFNS defined by PCI
DJ
> +#define CXL_RESET_FUNCTION_MAP_REGS (CXL_RESET_MAX_FUNCTIONS / 32)
> +#define CXL_RESET_SIBLINGS_INIT 8
> +
> /**
> * DOC: cxl core pci
> *
> @@ -1096,3 +1102,204 @@ cxl_reset_flush_cpu_caches(struct cxl_reset_region_context *ctx)
>
> return 0;
> }
> +
> +struct cxl_reset_context {
> + struct pci_dev *target;
> + struct pci_dev **siblings;
> + int nr_siblings;
> + int sibling_capacity;
> + int nr_siblings_prepared;
> +};
> +
> +struct cxl_reset_walk_ctx {
> + struct cxl_reset_context *ctx;
> + unsigned long *non_cxl_func_map;
> + int rc;
> +};
> +
> +static void
> +cxl_reset_read_non_cxl_func_map(struct pci_dev *pdev,
> + unsigned long *non_cxl_func_map)
> +{
> + u32 map[CXL_RESET_FUNCTION_MAP_REGS] = {};
> + u16 dvsec;
> + int rc, i;
> +
> + bitmap_zero(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_FUNCTION_MAP);
> + if (!dvsec)
> + return;
> +
> + for (i = 0; i < CXL_RESET_FUNCTION_MAP_REGS; i++) {
> + rc = pci_read_config_dword(pdev,
> + dvsec + PCI_DVSEC_CXL_FUNCTION_MAP_REG +
> + i * sizeof(map[i]), &map[i]);
> + if (rc) {
> + pci_warn(pdev,
> + "failed to read CXL Function Map; treating all siblings as CXL: %d\n",
> + rc);
> + bitmap_zero(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
> + return;
> + }
> + }
> +
> + bitmap_from_arr32(non_cxl_func_map, map, CXL_RESET_MAX_FUNCTIONS);
> +}
> +
> +static bool cxl_reset_is_cxl_sibling(struct pci_dev *pdev,
> + struct pci_dev *sibling,
> + unsigned long *non_cxl_func_map)
> +{
> + if (sibling == pdev || sibling->bus != pdev->bus)
> + return false;
> +
> + if (pci_ari_enabled(pdev->bus))
> + return !test_bit(sibling->devfn, non_cxl_func_map);
> +
> + if (PCI_SLOT(sibling->devfn) != PCI_SLOT(pdev->devfn))
> + return false;
> +
> + return !test_bit(PCI_FUNC(sibling->devfn) * 32 +
> + PCI_SLOT(sibling->devfn), non_cxl_func_map);
> +}
> +
> +static bool cxl_reset_has_cache_or_mem(struct pci_dev *pdev)
> +{
> + u16 dvsec, cap;
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + return false;
> +
> + if (pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap))
> + return false;
> +
> + return cap & (PCI_DVSEC_CXL_CACHE_CAPABLE | PCI_DVSEC_CXL_MEM_CAPABLE);
> +}
> +
> +static int cxl_reset_add_sibling(struct cxl_reset_context *ctx,
> + struct pci_dev *sibling)
> +{
> + struct pci_dev **siblings;
> + int capacity;
> +
> + if (ctx->nr_siblings < ctx->sibling_capacity)
> + goto add;
> +
> + capacity = ctx->sibling_capacity ? ctx->sibling_capacity * 2 :
> + CXL_RESET_SIBLINGS_INIT;
> + siblings = krealloc(ctx->siblings, capacity * sizeof(*siblings),
> + GFP_KERNEL);
> + if (!siblings)
> + return -ENOMEM;
> +
> + ctx->siblings = siblings;
> + ctx->sibling_capacity = capacity;
> +
> +add:
> + pci_dev_get(sibling);
> + ctx->siblings[ctx->nr_siblings++] = sibling;
> + return 0;
> +}
> +
> +static int cxl_reset_collect_sibling(struct pci_dev *sibling, void *data)
> +{
> + struct cxl_reset_walk_ctx *wctx = data;
> + struct cxl_reset_context *ctx = wctx->ctx;
> + struct pci_dev *pdev = ctx->target;
> +
> + if (!cxl_reset_is_cxl_sibling(pdev, sibling, wctx->non_cxl_func_map))
> + return 0;
> +
> + if (!cxl_reset_has_cache_or_mem(sibling))
> + return 0;
> +
> + wctx->rc = cxl_reset_add_sibling(ctx, sibling);
> + return wctx->rc;
> +}
> +
> +static int cxl_reset_collect_siblings(struct cxl_reset_context *ctx)
> +{
> + DECLARE_BITMAP(non_cxl_func_map, CXL_RESET_MAX_FUNCTIONS);
> + struct cxl_reset_walk_ctx wctx = {
> + .ctx = ctx,
> + .non_cxl_func_map = non_cxl_func_map,
> + };
> +
> + cxl_reset_read_non_cxl_func_map(ctx->target, non_cxl_func_map);
> + pci_walk_bus(ctx->target->bus, cxl_reset_collect_sibling, &wctx);
> + return wctx.rc;
> +}
> +
> +static void cxl_pci_functions_reset_done(struct cxl_reset_context *ctx)
> +{
> + int i;
> +
> + for (i = ctx->nr_siblings_prepared - 1; i >= 0; i--) {
> + struct pci_dev *sibling = ctx->siblings[i];
> +
> + pci_dev_reset_iommu_done(sibling);
> + pci_dev_restore(sibling);
> + pci_dev_unlock(sibling);
> + }
> +
> + for (i = 0; i < ctx->nr_siblings; i++)
> + pci_dev_put(ctx->siblings[i]);
> +
> + kfree(ctx->siblings);
> + ctx->siblings = NULL;
> + ctx->nr_siblings = 0;
> + ctx->sibling_capacity = 0;
> + ctx->nr_siblings_prepared = 0;
> +}
> +
> +static int __maybe_unused
> +cxl_pci_functions_reset_prepare(struct cxl_reset_context *ctx)
> +{
> + int rc, i;
> +
> + ctx->siblings = NULL;
> + ctx->nr_siblings = 0;
> + ctx->sibling_capacity = 0;
> + ctx->nr_siblings_prepared = 0;
> +
> + rc = cxl_reset_collect_siblings(ctx);
> + if (rc)
> + goto err;
> +
> + for (i = 0; i < ctx->nr_siblings; i++) {
> + struct pci_dev *sibling = ctx->siblings[i];
> +
> + if (!pci_dev_trylock(sibling)) {
> + rc = -EAGAIN;
> + goto err;
> + }
> +
> + pci_dev_save_and_disable(sibling);
> + rc = pci_dev_reset_iommu_prepare(sibling);
> + if (rc) {
> + pci_err(sibling,
> + "failed to block IOMMU for CXL reset: %d\n",
> + rc);
> + /*
> + * Undo save_and_disable() for this sibling. IOMMU
> + * prepare failed, so this sibling is not counted in
> + * nr_siblings_prepared and must not get iommu_done().
> + */
> + pci_dev_restore(sibling);
> + pci_dev_unlock(sibling);
> + goto err;
> + }
> +
> + ctx->nr_siblings_prepared++;
> + }
> +
> + return 0;
> +
> +err:
> + cxl_pci_functions_reset_done(ctx);
> + return rc;
> +}
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 14f634ab9350..fa1fcd26af01 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1349,6 +1349,7 @@
> /* CXL r4.0, 8.1.3: PCIe DVSEC for CXL Device */
> #define PCI_DVSEC_CXL_DEVICE 0
> #define PCI_DVSEC_CXL_CAP 0xA
> +#define PCI_DVSEC_CXL_CACHE_CAPABLE _BITUL(0)
> #define PCI_DVSEC_CXL_MEM_CAPABLE _BITUL(2)
> #define PCI_DVSEC_CXL_HDM_COUNT __GENMASK(5, 4)
> #define PCI_DVSEC_CXL_CTRL 0xC
> @@ -1366,6 +1367,7 @@
>
> /* CXL r4.0, 8.1.4: Non-CXL Function Map DVSEC */
> #define PCI_DVSEC_CXL_FUNCTION_MAP 2
> +#define PCI_DVSEC_CXL_FUNCTION_MAP_REG 0x0C
>
> /* CXL r4.0, 8.1.5: Extensions DVSEC for Ports */
> #define PCI_DVSEC_CXL_PORT 3
* Re: [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute
2026-05-28 8:31 ` [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute Srirangan Madhavan
2026-06-02 21:35 ` Cheatham, Benjamin
@ 2026-06-02 23:50 ` Dave Jiang
1 sibling, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-02 23:50 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/26 1:31 AM, Srirangan Madhavan wrote:
> Expose CXL reset through the CXL memdev device. The reset flow
> depends on CXL memdev state to identify affected regions, coordinate
> decoder restore, and keep CXL-specific policy out of the PCI sysfs ABI.
>
> Add a write-only cxl_reset attribute under memX. The attribute is visible
> only when the memdev's PCI parent advertises CXL Reset capability.
> Writing a true boolean value invokes the CXL reset orchestration.
The commit log should probably state explicitly that the reset is limited to Type 2 devices, and that this is a design choice.
DJ
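For reference, the Type 2 restriction comes down to requiring both capability bits from the CXL DVSEC capability register, whereas sibling discovery accepts either one. A minimal userspace sketch of that distinction follows, using the bit positions quoted in the pci_regs.h hunk of this series. The helper names here are hypothetical and only mirror cxl_reset_is_type2() and cxl_reset_has_cache_or_mem() from the patch; this is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as quoted in the pci_regs.h hunk of this series. */
#define CXL_CAP_CACHE_CAPABLE (1u << 0)
#define CXL_CAP_MEM_CAPABLE   (1u << 2)

/* Hypothetical mirror of cxl_reset_is_type2(): both bits are required. */
static bool cap_is_type2(uint16_t cap)
{
	return (cap & CXL_CAP_CACHE_CAPABLE) && (cap & CXL_CAP_MEM_CAPABLE);
}

/* Hypothetical mirror of cxl_reset_has_cache_or_mem(): either bit is enough. */
static bool cap_has_cache_or_mem(uint16_t cap)
{
	return (cap & (CXL_CAP_CACHE_CAPABLE | CXL_CAP_MEM_CAPABLE)) != 0;
}
```

Under this model a memory-only device (only the mem bit set) is still a valid reset sibling but would never get the cxl_reset attribute, which is the design choice the commit log is being asked to spell out.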
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/memdev.c | 30 +++++++++++
> drivers/cxl/core/pci.c | 102 +++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 3 ++
> drivers/cxl/cxlmem.h | 2 +
> 4 files changed, 136 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 80e65690eb77..af67fa3d11b8 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -199,6 +199,26 @@ static ssize_t security_erase_store(struct device *dev,
> static struct device_attribute dev_attr_security_erase =
> __ATTR(erase, 0200, NULL, security_erase_store);
>
> +static ssize_t cxl_reset_store(struct device *dev,
> + struct device_attribute *attr, const char *buf,
> + size_t len)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + bool reset;
> + int rc;
> +
> + rc = kstrtobool(buf, &reset);
> + if (rc)
> + return rc;
> +
> + if (!reset)
> + return -EINVAL;
> +
> + rc = cxl_memdev_reset(cxlmd);
> + return rc ? rc : len;
> +}
> +static DEVICE_ATTR_WO(cxl_reset);
> +
> bool cxl_memdev_has_poison_cmd(struct cxl_memdev *cxlmd,
> enum poison_cmd_enabled_bits cmd)
> {
> @@ -421,6 +441,7 @@ static struct attribute *cxl_memdev_attributes[] = {
> &dev_attr_payload_max.attr,
> &dev_attr_label_storage_size.attr,
> &dev_attr_numa_node.attr,
> + &dev_attr_cxl_reset.attr,
> NULL,
> };
>
> @@ -485,8 +506,16 @@ static struct attribute *cxl_memdev_security_attributes[] = {
> static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
> int n)
> {
> + struct device *dev = kobj_to_dev(kobj);
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
> if (!IS_ENABLED(CONFIG_NUMA) && a == &dev_attr_numa_node.attr)
> return 0;
> +
> + if (a == &dev_attr_cxl_reset.attr &&
> + !cxl_memdev_reset_capable(cxlmd))
> + return 0;
> +
> return a->mode;
> }
>
> @@ -1099,6 +1128,7 @@ static int cxlmd_add(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
>
> cxlmd->cxlds = cxlds;
> cxlds->cxlmd = cxlmd;
> + cxl_memdev_init_reset(cxlmd);
>
> rc = cdev_device_add(&cxlmd->cdev, &cxlmd->dev);
> if (rc) {
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 486c447e98f3..09f016544d24 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1207,6 +1207,22 @@ static bool cxl_reset_has_cache_or_mem(struct pci_dev *pdev)
> return cap & (PCI_DVSEC_CXL_CACHE_CAPABLE | PCI_DVSEC_CXL_MEM_CAPABLE);
> }
>
> +static bool cxl_reset_is_type2(struct pci_dev *pdev)
> +{
> + u16 dvsec, cap;
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + return false;
> +
> + if (pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap))
> + return false;
> +
> + return (cap & PCI_DVSEC_CXL_CACHE_CAPABLE) &&
> + (cap & PCI_DVSEC_CXL_MEM_CAPABLE);
> +}
> +
> static int cxl_reset_add_sibling(struct cxl_reset_context *ctx,
> struct pci_dev *sibling)
> {
> @@ -1939,7 +1955,7 @@ static int cxl_do_reset_locked(struct cxl_reset_context *ctx, bool mem_clear)
> return rc;
> }
>
> -static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> +static int cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> {
> struct cxl_reset_context ctx = {
> .target = pdev,
> @@ -1966,3 +1982,87 @@ static int __maybe_unused cxl_do_reset(struct pci_dev *pdev, bool mem_clear)
> cxl_reset_context_destroy(&ctx);
> return rc;
> }
> +
> +static struct pci_dev *cxl_reset_get_fn0(struct pci_dev *pdev)
> +{
> + unsigned int devfn;
> +
> + /*
> + * CXL Reset control/status is exposed in Function 0 and affects all
> + * CXL.cache/mem functions in the device.
> + */
> + if (pci_ari_enabled(pdev->bus))
> + devfn = 0;
> + else
> + devfn = PCI_DEVFN(PCI_SLOT(pdev->devfn), 0);
> +
> + if (pdev->devfn == devfn)
> + return pci_dev_get(pdev);
> +
> + return pci_get_slot(pdev->bus, devfn);
> +}
> +
> +static bool cxl_memdev_probe_reset_capable(struct cxl_memdev *cxlmd)
> +{
> + struct device *dev = cxlmd->dev.parent;
> + struct pci_dev *pdev, *fn0;
> + int dvsec;
> + u16 cap;
> +
> + if (!dev || !dev_is_pci(dev))
> + return false;
> +
> + pdev = to_pci_dev(dev);
> + if (!cxl_reset_is_type2(pdev))
> + return false;
> +
> + fn0 = cxl_reset_get_fn0(pdev);
> + if (!fn0)
> + return false;
> +
> + dvsec = pci_find_dvsec_capability(fn0, PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + goto out;
> +
> + if (pci_read_config_word(fn0, dvsec + PCI_DVSEC_CXL_CAP, &cap))
> + goto out;
> +
> + pci_dev_put(fn0);
> + return cap & PCI_DVSEC_CXL_RST_CAPABLE;
> +
> +out:
> + pci_dev_put(fn0);
> + return false;
> +}
> +
> +void cxl_memdev_init_reset(struct cxl_memdev *cxlmd)
> +{
> + cxlmd->reset_capable = cxl_memdev_probe_reset_capable(cxlmd);
> +}
> +
> +bool cxl_memdev_reset_capable(struct cxl_memdev *cxlmd)
> +{
> + return cxlmd->reset_capable;
> +}
> +
> +int cxl_memdev_reset(struct cxl_memdev *cxlmd)
> +{
> + struct device *dev = cxlmd->dev.parent;
> + struct pci_dev *fn0;
> + int rc;
> +
> + if (!cxl_memdev_reset_capable(cxlmd))
> + return -EOPNOTSUPP;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + fn0 = cxl_reset_get_fn0(to_pci_dev(dev));
> + if (!fn0)
> + return -ENODEV;
> +
> + rc = cxl_do_reset(fn0, false);
> + pci_dev_put(fn0);
> + return rc;
> +}
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b51b1e9d6400..bf65996e24dc 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -796,6 +796,9 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
> struct cxl_endpoint_dvsec_info *info);
> int cxl_restore_memdev_decoders(struct cxl_memdev *cxlmd);
> int cxl_commit_memdev_decoders(struct cxl_memdev *cxlmd);
> +void cxl_memdev_init_reset(struct cxl_memdev *cxlmd);
> +bool cxl_memdev_reset_capable(struct cxl_memdev *cxlmd);
> +int cxl_memdev_reset(struct cxl_memdev *cxlmd);
>
> bool is_cxl_region(struct device *dev);
>
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 776c50d1db51..c8e7349fb130 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -48,6 +48,7 @@ struct cxl_memdev_attach {
> * @cxl_nvd: optional bridge to an nvdimm if the device supports pmem
> * @endpoint: connection to the CXL port topology for this memory device
> * @attach: creator of this memdev depends on CXL link attach to operate
> + * @reset_capable: cached CXL Reset support
> * @id: id number of this memdev instance.
> * @depth: endpoint port depth
> * @scrub_cycle: current scrub cycle set for this device
> @@ -65,6 +66,7 @@ struct cxl_memdev {
> struct cxl_nvdimm *cxl_nvd;
> struct cxl_port *endpoint;
> const struct cxl_memdev_attach *attach;
> + bool reset_capable;
> int id;
> int depth;
> u8 scrub_cycle;
* Re: [PATCH v6 9/9] Documentation/ABI: Document CXL memdev cxl_reset
2026-05-28 8:31 ` [PATCH v6 9/9] Documentation/ABI: Document CXL memdev cxl_reset Srirangan Madhavan
@ 2026-06-03 0:11 ` Dave Jiang
0 siblings, 0 replies; 32+ messages in thread
From: Dave Jiang @ 2026-06-03 0:11 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra
On 5/28/26 1:31 AM, Srirangan Madhavan wrote:
> Document the write-only cxl_reset attribute under CXL memdev devices.
> The attribute is visible only when the memdev's PCI parent advertises
> CXL Reset capability, and writing a true boolean value requests the CXL
> reset flow.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 28 +++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 16a9b3d2e2c0..d5d055e7a756 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -110,6 +110,34 @@ Description:
> affinity for this device.
>
>
> +What: /sys/bus/cxl/devices/memX/cxl_reset
> +Date: May, 2026
> +KernelVersion: v7.1
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (WO) Write a boolean true value, for example "1" or "true", to
> + request CXL Reset for this memory device. The driver performs
> + CXL-specific reset coordination for the target memdev before
> + issuing reset, including any required preparation for affected
> + CXL memory regions and related CXL memory devices.
> +
> + CXL Reset control is Function 0 scoped. A write to this
> + attribute resets the CXL.cache and CXL.mem state for all
> + CXL.cache or CXL.mem functions in the same CXL device reset
> + scope, not only the memX device associated with this file.
> +
> + The optional CXL Reset Memory Clear operation is not exposed by
> + this attribute.
> +
> + A reset fails with -EBUSY if any affected CXL region is
> + online as System RAM or has an active region driver bound.
> + Userspace must first quiesce and release affected CXL memory
> + mappings.
> +
> + If this file is not present, then CXL Reset is not supported
> + for the device.
This needs to mention that the attribute only shows up for Type 2 devices. It is also missing information about decoder and DVSEC range restore.
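As an illustration of the ABI described in the quoted text, here is a minimal userspace sketch that writes "1" to the attribute and reports the errno back to the caller. It assumes the attribute behaves as documented in this (not yet merged) patch; the function name cxl_reset_request() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" to a cxl_reset-style attribute.
 * Returns 0 on success or a negative errno, e.g. -ENOENT when the
 * attribute is absent, or whatever error the kernel reports for the write.
 */
static int cxl_reset_request(const char *attr_path)
{
	ssize_t n;
	int fd, rc = 0;

	assert(attr_path);

	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, "1", 1);
	if (n < 0)
		rc = -errno;
	else if (n != 1)
		rc = -EIO;

	close(fd);
	return rc;
}
```

A caller would pass a path such as /sys/bus/cxl/devices/mem0/cxl_reset (mem0 standing in for memX). Going by the description above, -ENOENT would mean CXL Reset is not supported for the device, and -EBUSY would mean an affected region still has to be quiesced first.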
> +
> +
> What: /sys/bus/cxl/devices/memX/security/state
> Date: June, 2023
> KernelVersion: v6.5
* Re: [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders
2026-05-28 8:31 ` [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders Srirangan Madhavan
` (2 preceding siblings ...)
2026-06-02 20:34 ` Cheatham, Benjamin
@ 2026-06-03 22:35 ` Dan Williams (nvidia)
3 siblings, 0 replies; 32+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-03 22:35 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Srirangan Madhavan wrote:
> Add helpers to restore endpoint decoder programming for a CXL memdev from
> CXL core's cached decoder objects, then commit it as a distinct step.
> Callers are expected to have established reset safety and to hold
> cxl_rwsem.region for write.
In the /sys/bus/.../reset path the PCI device only has cached HDM state,
no 'struct cxl_region' objects. So this needs a simpler method to
determine that HDM is unmapped. It probably also needs to account for
userspace mappings. The closest approximation to "this phys addr is not
mapped anywhere, even /dev/mem" is a successful __request_region().
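To make the "successful claim means nobody else owns it" idea concrete, here is a small userspace model of an exclusive range claim. It is only a sketch of the property being relied on, not the kernel resource tree: the claim_table structure and the range_claim()/range_release() names are hypothetical, and in the kernel the equivalent step would presumably be a __request_region() (or request_mem_region()) call on the decoder's HPA range.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_CLAIMS 8

/* Inclusive [start, end] physical address range with an owner tag. */
struct claim {
	uint64_t start, end;
	const char *owner;
};

struct claim_table {
	struct claim claims[MAX_CLAIMS];
	int nr;
};

/*
 * Claim [start, start + len - 1] exclusively. Fails with -EBUSY if any
 * existing claim overlaps, so success means no other owner currently
 * holds any part of the range.
 */
static int range_claim(struct claim_table *t, uint64_t start, uint64_t len,
		       const char *owner)
{
	uint64_t end;
	int i;

	assert(len > 0);
	end = start + len - 1;

	for (i = 0; i < t->nr; i++)
		if (start <= t->claims[i].end && t->claims[i].start <= end)
			return -EBUSY;

	if (t->nr == MAX_CLAIMS)
		return -ENOMEM;

	t->claims[t->nr++] = (struct claim){ start, end, owner };
	return 0;
}

/* Drop the claim that begins at @start, e.g. once the reset has finished. */
static void range_release(struct claim_table *t, uint64_t start)
{
	int i;

	for (i = 0; i < t->nr; i++) {
		if (t->claims[i].start == start) {
			t->claims[i] = t->claims[--t->nr];
			return;
		}
	}
}
```

In this model the reset path would claim each cached HDM range before touching the device and back out with -EBUSY if any claim fails, without needing a 'struct cxl_region' to consult.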
> cxl_restore_memdev_decoders() restores programmable decoder state while
> keeping traffic disabled. For HDM-backed endpoints it programs enabled
> endpoint decoder fields without COMMIT, keeps the HDM Decoder Capability
> disabled, and mirrors matching endpoint DVSEC ranges where possible. For
> endpoints without HDM decoder registers, it restores the legacy DVSEC
> ranges that model endpoint decode.
Ideally just use the same helper for restoring the configuration as
writing the configuration. I.e. I am not sure that writing all the
address ranges first is required since per decoder commit is still
required. Effectively trying to maximize identical flows for the reset
and dynamic configuration paths.
> cxl_commit_memdev_decoders() enables the HDM Decoder Capability and
> commits enabled, unlocked endpoint decoders after safety checks pass. It
> sets COMMIT only after decoder fields have been restored, does not
> re-lock decoders, and does not set DVSEC MEM_ENABLE.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
> drivers/cxl/core/hdm.c | 318 ++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 2 +
> 2 files changed, 317 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0c80b76a5f9b..f7af1041a9fc 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -679,7 +679,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
> return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
> }
>
> -static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +static int cxld_set_interleave_fields(struct cxl_decoder *cxld, u32 *ctrl)
> {
> u16 eig;
> u8 eiw;
> @@ -690,14 +690,22 @@ static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> */
> if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
> "invalid interleave_ways: %d\n", cxld->interleave_ways))
> - return;
> + return -EINVAL;
> if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
> "invalid interleave_granularity: %d\n",
> cxld->interleave_granularity))
> - return;
> + return -EINVAL;
>
> u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
> u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
> + return 0;
> +}
It is awkward to get all the way down to restoring the interleave to
find that the original interleave settings were invalid to start.
I think it is also impossible. These warnings have never provided any
value and should probably be deleted rather than honored and reflected
up the stack.
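The reason the failure looks impossible at restore time is that the encoding is a pure function of the cached ways and granularity, which were already checked when the decoder was configured. A userspace sketch of that encoding follows, based on the CXL HDM decoder field encodings as I understand them (granularity 256 B to 16 KiB, ways 1/2/4/8/16 and 3/6/12). The model_* names are hypothetical, and the kernel's ways_to_eiw() and granularity_to_eig() may differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

static int is_pow2(unsigned int v)
{
	return v && !(v & (v - 1));
}

static unsigned int log2u(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Granularity of 256..16384 bytes, power of two: eig = log2(g) - 8. */
static int model_granularity_to_eig(unsigned int granularity, uint16_t *eig)
{
	assert(eig);
	if (granularity < 256 || granularity > 16384 || !is_pow2(granularity))
		return -EINVAL;
	*eig = (uint16_t)(log2u(granularity) - 8);
	return 0;
}

/* Ways 1,2,4,8,16 encode as 0..4; ways 3,6,12 encode as 8..10. */
static int model_ways_to_eiw(unsigned int ways, uint8_t *eiw)
{
	assert(eiw);
	if (ways == 0 || ways > 16)
		return -EINVAL;
	if (is_pow2(ways)) {
		*eiw = (uint8_t)log2u(ways);
		return 0;
	}
	if (ways % 3)
		return -EINVAL;
	ways /= 3;
	if (!is_pow2(ways))
		return -EINVAL;
	*eiw = (uint8_t)(log2u(ways) + 8);
	return 0;
}
```

Since the inputs do not change between the original commit and the restore, a pair that encoded successfully once will encode successfully again, which supports dropping the warnings instead of propagating -EINVAL.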
> +
> +static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
> +{
> + if (cxld_set_interleave_fields(cxld, ctrl))
> + return;
> +
> *ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
> }
>
> @@ -927,6 +935,310 @@ static void cxl_decoder_reset(struct cxl_decoder *cxld)
> }
> }
>
> +static int cxl_restore_dvsec_range(struct cxl_memdev *cxlmd,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + struct cxl_decoder *cxld = &cxled->cxld;
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + u64 base = cxld->hpa_range.start;
> + u64 size = range_len(&cxld->hpa_range);
> + u32 lo;
> + int dvsec = cxlds->cxl_dvsec;
> + int id = cxld->id;
> + int rc;
> +
> + if (!dvsec)
> + return 0;
> +
> + if (id >= CXL_DVSEC_RANGE_MAX)
> + return 0;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_HIGH(id),
> + upper_32_bits(base));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
> + lo &= ~PCI_DVSEC_CXL_MEM_BASE_LOW;
> + lo |= lower_32_bits(base) & PCI_DVSEC_CXL_MEM_BASE_LOW;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_BASE_LOW(id),
> + lo);
> + if (rc)
> + return rc;
> +
> + rc = pci_write_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_HIGH(id),
> + upper_32_bits(size));
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + &lo);
> + if (rc)
> + return rc;
> +
> + /*
> + * Preserve MEM_INFO_VALID / MEM_ACTIVE and any reserved bits while
> + * restoring only the programmable size bits.
> + */
> + lo &= ~PCI_DVSEC_CXL_MEM_SIZE_LOW;
> + lo |= lower_32_bits(size) & PCI_DVSEC_CXL_MEM_SIZE_LOW;
No need for lower_32_bits() with the mask.
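A quick userspace check of that claim: when the mask only covers bits below 32, masking the 64-bit value directly gives the same result as truncating first. The mask value below (bits 31:28) is an assumption about PCI_DVSEC_CXL_MEM_SIZE_LOW for illustration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout: the size-low field occupies bits 31:28. */
#define MEM_SIZE_LOW_MASK 0xF0000000u

/* Form used in the patch: truncate to 32 bits, then mask. */
static uint32_t size_low_with_cast(uint32_t lo, uint64_t size)
{
	lo &= ~MEM_SIZE_LOW_MASK;
	lo |= (uint32_t)size & MEM_SIZE_LOW_MASK;
	return lo;
}

/* Suggested form: the mask alone already discards bits 63:32. */
static uint32_t size_low_mask_only(uint32_t lo, uint64_t size)
{
	lo &= ~MEM_SIZE_LOW_MASK;
	lo |= size & MEM_SIZE_LOW_MASK;
	return lo;
}
```

Both forms leave the bits of lo outside the mask untouched, so MEM_INFO_VALID, MEM_ACTIVE and reserved bits are preserved either way.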
> +
> + return pci_write_config_dword(pdev,
> + dvsec + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
> + lo);
> +}
> +
> +static int cxl_restore_hdm_decoder(struct cxl_hdm *cxlhdm,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_decoder *cxld = &cxled->cxld;
> + void __iomem *hdm;
> + u64 base, size, skip;
> + u32 ctrl;
> + int id;
> +
> + id = cxld->id;
> + hdm = cxlhdm->regs.hdm_decoder;
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> + if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> + return 0;
> +
> + base = cxld->hpa_range.start;
> + size = range_len(&cxld->hpa_range);
> + skip = cxled->skip;
> +
> + ctrl &= ~(CXL_HDM_DECODER0_CTRL_LOCK |
> + CXL_HDM_DECODER0_CTRL_COMMIT |
> + CXL_HDM_DECODER0_CTRL_COMMITTED |
> + CXL_HDM_DECODER0_CTRL_COMMIT_ERROR);
> + if (cxld_set_interleave_fields(cxld, &ctrl))
> + return -EINVAL;
> + cxld_set_type(cxld, &ctrl);
> +
> + /* Preserve setup_hw_decoder() programming order, without COMMIT. */
> + writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
> + writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
> + writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
> + writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
> + writel(upper_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_HIGH(id));
> + writel(lower_32_bits(skip), hdm + CXL_HDM_DECODER0_SKIP_LOW(id));
> + wmb();
> + writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> +
> + return 0;
> +}
> +
> +struct cxl_restore_ctx {
> + struct cxl_memdev *cxlmd;
> + struct cxl_hdm *cxlhdm;
> +};
> +
> +static int cxl_restore_decoder(struct device *dev, void *data)
> +{
> + struct cxl_restore_ctx *ctx = data;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_decoder *cxld;
> + int rc;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + cxld = &cxled->cxld;
> + if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + if (ctx->cxlhdm->regs.hdm_decoder) {
> + if (cxld->id >= ctx->cxlhdm->decoder_count)
> + return -EINVAL;
> +
> + rc = cxl_restore_hdm_decoder(ctx->cxlhdm, cxled);
> + if (rc)
> + return rc;
> + }
> +
> + return cxl_restore_dvsec_range(ctx->cxlmd, cxled);
> +}
> +
> +static int cxl_restore_decoders(struct cxl_memdev *cxlmd, struct cxl_hdm *cxlhdm)
> +{
> + struct cxl_port *port = cxlhdm->port;
> + void __iomem *hdm = cxlhdm->regs.hdm_decoder;
> + struct cxl_restore_ctx ctx = {
> + .cxlmd = cxlmd,
> + .cxlhdm = cxlhdm,
> + };
> + u32 global_ctrl;
> +
> + if (hdm) {
> + global_ctrl = readl(hdm + CXL_HDM_DECODER_CTRL_OFFSET);
> + writel(global_ctrl & ~CXL_HDM_DECODER_ENABLE,
> + hdm + CXL_HDM_DECODER_CTRL_OFFSET);
After reset, global control should not require re-disabling.
> + }
> +
> + return device_for_each_child(&port->dev, &ctx, cxl_restore_decoder);
This gets cleaner when being able to skip the device_for_each_child()
and just walk the array of cached HDM info in the device.
* Re: [PATCH v6 2/9] PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
2026-05-28 8:31 ` [PATCH v6 2/9] PCI: Export pci_dev_save_and_disable() and pci_dev_restore() Srirangan Madhavan
2026-06-02 20:18 ` Dave Jiang
@ 2026-06-03 22:36 ` Dan Williams (nvidia)
1 sibling, 0 replies; 32+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-03 22:36 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Srirangan Madhavan wrote:
> Export pci_dev_save_and_disable() and pci_dev_restore() so CXL reset
> orchestration can reuse the PCI core reset lifecycle for non-standard
> reset flows.
>
> These helpers invoke driver reset_prepare/reset_done callbacks, save and
> restore PCI config state, and disable the device while the caller holds
> the device lock.
No longer required with the plan for built-in CXL helpers.
* Re: [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers
2026-05-28 8:31 ` [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-02 20:36 ` Dave Jiang
@ 2026-06-04 2:49 ` Dan Williams (nvidia)
2 siblings, 0 replies; 32+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-04 2:49 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan, rrichter
[ cc Robert ]
Srirangan Madhavan wrote:
> Add helpers to collect the CXL regions affected by a memdev reset,
> verify that those regions are idle, and invalidate CPU caches for the
> affected address ranges before reset.
>
> A memdev can participate in an interleaved region through multiple
> endpoint decoders. Track affected regions in a temporary xarray so each
> region is checked and cache-invalidated once per reset operation.
With the new proposal we still have the HPA range per-endpoint decoder,
so you can still check that the endpoint decoder is not mapped via
request_region().
Probably the more important optimization is to enumerate to CXL when the
cache invalidation routine is global. That lets the reset implementation
do its own simple "one global cache operation per-device" rather than
per-decoder.
Now, thinking through this, recall that some AMD platforms need firmware
help to translate the decoder settings from per-host-bridge normalized
addressing to typical global HPA addressing. See cxl_prm_setup_root().
I think for now HDM restore cannot be expected to work on those platforms.
The implementation either needs to cache the translation at init, or the
restore path needs to be taught to reverse translate SPA back to the HW
decoder HPA values.
* Re: [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset
2026-05-28 8:31 ` [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset Srirangan Madhavan
2026-05-28 11:15 ` Richard Cheng
2026-06-02 22:10 ` Dave Jiang
@ 2026-06-04 3:13 ` Dan Williams (nvidia)
2 siblings, 0 replies; 32+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-04 3:13 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Srirangan Madhavan wrote:
> Add helpers to collect CXL sibling PCI functions affected by a CXL reset
> and prepare them for reset by saving and disabling them. Restore those
> siblings and drop their references when reset coordination completes.
>
> Use the Non-CXL Function Map DVSEC to exclude non-CXL functions, and
> filter remaining siblings to functions that advertise CXL.cache or
> CXL.mem capability.
>
> Use pci_dev_trylock() for sibling locking and unwind on contention or
> allocation failure, so competing reset paths fail with an errno.
This is a pile of code just to precisely save and restore only the
functions impacted by the reset. What is not clear to me is what is the
cost of over saving and restoring. The specification seems to imply that
CXL Reset has the same effect as FLR as far as CXL.io is concerned.
Which could maybe be read as all functions that speak CXL.io (all of
them) see the reset even if only a subset participate in CXL.cachemem.
Otherwise, there is a good chance that the pci_dev_reset_iommu_prepare()
calls are all going to apply to the same IOMMU group for this device.
pci_dev_save_and_disable(sibling);
rc = pci_dev_reset_iommu_prepare(sibling);
...maybe the simple thing to do is just treat this like slot reset and
use the existing method of walking the device list by matching slot to
save and disable every function on the device. In other words, it is not
clear that the precision is worth it just to avoid a few extra
save_and_disable cycles.
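A sketch of the simpler selection rule, as a userspace model: pick every function on the bus whose slot number matches the target, with no DVSEC filtering. The devfn packing follows the kernel's PCI_SLOT() convention (slot in bits 7:3). The collect_slot_functions() name is hypothetical, and the model ignores ARI, where the slot field is not meaningful.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot number lives in bits 7:3 of devfn, as in the kernel's PCI_SLOT(). */
#define SLOT_OF(devfn) (((devfn) >> 3) & 0x1f)

/*
 * Collect every function in @bus_devfns that shares a slot with @target.
 * Returns the number of entries written to @out (at most @out_len).
 */
static size_t collect_slot_functions(const uint8_t *bus_devfns, size_t nr,
				     uint8_t target, uint8_t *out,
				     size_t out_len)
{
	size_t i, n = 0;

	assert(bus_devfns && out);

	for (i = 0; i < nr && n < out_len; i++)
		if (SLOT_OF(bus_devfns[i]) == SLOT_OF(target))
			out[n++] = bus_devfns[i];

	return n;
}
```

Compared with the Function Map DVSEC walk in the patch, this may save and restore a few functions that the reset does not strictly touch, which is the over-saving cost being weighed above.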
* Re: [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs
2026-05-28 8:31 ` [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
@ 2026-06-04 3:25 ` Dan Williams (nvidia)
1 sibling, 0 replies; 32+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-04 3:25 UTC (permalink / raw)
To: Srirangan Madhavan, linux-cxl, linux-pci, linux-kernel
Cc: vsethi, alwilliamson, Dan Williams, Sai Yashwanth Reddy Kancherla,
Vishal Aslot, Manish Honap, Jiandi An, Richard Cheng, linux-tegra,
Srirangan Madhavan
Srirangan Madhavan wrote:
> Add the reset flow that coordinates the target function, affected CXL
> sibling functions, and any active memdevs in the CXL.cache/mem reset
> scope.
>
> The flow collects regions for the affected memdevs under
> cxl_rwsem.region, verifies that those regions are idle, flushes CPU
> caches for the affected ranges, saves and disables the target and sibling
> PCI functions, and locks active memdevs to revalidate that their
> endpoints are still present before reset.
>
> After the CXL DVSEC reset completes, restore PCI config space so CXL
> MMIO is accessible, restore decoder programming for all active affected
> memdevs, commit their restored decoders, and only then re-enable CXL.mem
> for the affected set.
>
> Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
> ---
[..]
> + rc = cxl_reset_collect_memdevs(&ctx);
There can never be more than one memdev or cache interface to reset per
device, right? Those controls only exist for function0. The siblings
will not have their own reset and cache disable control DVSECs.
So, per the other observation that this probably does not need to /
cannot worry about save_and_disable precision, it only needs to invoke
the actual reset/cache management for function0.
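For illustration, locating function 0 from any function of the device is a small computation, modeled here in userspace after cxl_reset_get_fn0() from patch 8. The fn0_devfn() name is hypothetical; the ARI branch follows the quoted patch, which treats devfn 0 as function 0 when ARI is enabled on the bus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* devfn packing as in the kernel's PCI_DEVFN()/PCI_SLOT() macros. */
#define DEVFN(slot, func) ((uint8_t)((((slot) & 0x1f) << 3) | ((func) & 0x07)))
#define SLOT_OF(devfn)    (((devfn) >> 3) & 0x1f)

/*
 * Return the devfn of function 0 for the device that @devfn belongs to.
 * With ARI the whole devfn byte is the function number, so fn0 is 0.
 */
static uint8_t fn0_devfn(uint8_t devfn, bool ari_enabled)
{
	if (ari_enabled)
		return 0;

	return DEVFN(SLOT_OF(devfn), 0);
}
```

If only function 0 carries the reset and cache controls, the orchestration can resolve this one devfn and issue the reset there, while treating every other function purely as something to save and restore.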
end of thread, other threads:[~2026-06-04 3:25 UTC | newest]
Thread overview: 32+ messages
2026-05-28 8:31 [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Srirangan Madhavan
2026-05-28 8:31 ` [PATCH v6 1/9] cxl/hdm: Add helpers to restore and commit memdev decoders Srirangan Madhavan
2026-05-28 11:06 ` Richard Cheng
2026-06-02 18:12 ` Dave Jiang
2026-06-02 18:31 ` Dave Jiang
2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-03 22:35 ` Dan Williams (nvidia)
2026-05-28 8:31 ` [PATCH v6 2/9] PCI: Export pci_dev_save_and_disable() and pci_dev_restore() Srirangan Madhavan
2026-06-02 20:18 ` Dave Jiang
2026-06-03 22:36 ` Dan Williams (nvidia)
2026-05-28 8:31 ` [PATCH v6 3/9] cxl: Add reset-idle and cache flush helpers Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-02 20:36 ` Dave Jiang
2026-06-04 2:49 ` Dan Williams (nvidia)
2026-05-28 8:31 ` [PATCH v6 4/9] PCI/CXL: Add sibling function coordination for reset Srirangan Madhavan
2026-05-28 11:15 ` Richard Cheng
2026-06-02 22:10 ` Dave Jiang
2026-06-04 3:13 ` Dan Williams (nvidia)
2026-05-28 8:31 ` [PATCH v6 5/9] cxl/pci: Add CXL DVSEC reset helper Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-05-28 8:31 ` [PATCH v6 6/9] cxl/pci: Track memdevs affected by CXL reset Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-05-28 8:31 ` [PATCH v6 7/9] cxl/pci: Orchestrate CXL reset for affected memdevs Srirangan Madhavan
2026-06-02 20:34 ` Cheatham, Benjamin
2026-06-04 3:25 ` Dan Williams (nvidia)
2026-05-28 8:31 ` [PATCH v6 8/9] cxl/memdev: Add cxl_reset sysfs attribute Srirangan Madhavan
2026-06-02 21:35 ` Cheatham, Benjamin
2026-06-02 23:50 ` Dave Jiang
2026-05-28 8:31 ` [PATCH v6 9/9] Documentation/ABI: Document CXL memdev cxl_reset Srirangan Madhavan
2026-06-03 0:11 ` Dave Jiang
2026-06-02 20:34 ` [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs Cheatham, Benjamin
2026-06-02 21:42 ` Dan Williams (nvidia)