[PATCH 0/5] cxl: Initialization and shutdown fixes

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] cxl: Initialization and shutdown fixes
@ 2024-10-11  5:33 Dan Williams
  2024-10-11  5:34 ` [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in Dan Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Dan Williams @ 2024-10-11  5:33 UTC (permalink / raw)
  To: dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to
delayed arrival of the CXL "root" infrastructure [1] prompted questions
of how the existing mechanism for retrying cxl_mem_probe() could be
failing.

The critical missing piece in the debug was that Gregory's setup had
almost all CXL modules built-in to the kernel.

On the way to that discovery several other bugs and init-order corner
cases were discovered.

The main fix is to make sure the drivers/cxl/Makefile object order
supports root CXL ports being fully initialized upon cxl_acpi_probe()
exit. The modular case has some similar potential holes that are fixed
with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update
cxl_test to reproduce the original report resulted in the discovery of a
separate long standing use after free bug in cxl_region_detach().

[1]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net

---

Dan Williams (5):
      cxl/port: Fix CXL port initialization order when the subsystem is built-in
      cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
      cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
      cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
      cxl/test: Improve init-order fidelity relative to real-world systems

 drivers/base/core.c          |   35 +++++++
 drivers/cxl/Kconfig          |    1
 drivers/cxl/Makefile         |   12 +--
 drivers/cxl/acpi.c           |    7 +
 drivers/cxl/core/hdm.c       |   50 +++++++++--
 drivers/cxl/core/port.c      |   13 ++-
 drivers/cxl/core/region.c    |   48 +++-------
 drivers/cxl/cxl.h            |    3 -
 include/linux/device.h       |    3 +
 tools/testing/cxl/test/cxl.c |  200 +++++++++++++++++++++++-------------------
 tools/testing/cxl/test/mem.c |    1
 11 files changed, 228 insertions(+), 145 deletions(-)

base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in
  2024-10-11  5:33 [PATCH 0/5] cxl: Initialization and shutdown fixes Dan Williams
@ 2024-10-11  5:34 ` Dan Williams
  2024-10-14 11:33   ` Jonathan Cameron
  2024-10-11  5:34 ` [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown Dan Williams
  2024-10-11 11:21 ` [PATCH 0/5] cxl: Initialization and shutdown fixes Alejandro Lucero Palau
  2 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-11  5:34 UTC (permalink / raw)
  To: dave.jiang, ira.weiny
  Cc: Gregory Price, Gregory Price, stable, Davidlohr Bueso,
	Jonathan Cameron, Alison Schofield, Vishal Verma, linux-cxl

When the CXL subsystem is built-in the module init order is determined
by Makefile order. That order violates expectations. The expectation is
that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins
the race cxl_mem will find the enabled CXL root ports it needs and if
cxl_acpi loses the race it will retrigger cxl_mem to attach via
cxl_bus_rescan(). That only works if cxl_acpi can assume ports are
enabled immediately upone cxl_acpi_probe() return. That in turn can only
happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears
before the cxl_acpi object in the Makefile.

Fix up the order to prevent initialization failures, and make sure that
cxl_port is built-in if cxl_acpi is also built-in.

As for what contributed to this not being found earlier, the CXL
regression environment, cxl_test, builds all CXL functionality as a
module to allow to symbol mocking and other dynamic reload tests.  As a
result there is no regression coverage for the built-in case.

Reported-by: Gregory Price <gourry@gourry.net>
Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
Tested-by: Gregory Price <gourry@gourry.net>
Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
Cc: <stable@vger.kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/Kconfig  |    1 +
 drivers/cxl/Makefile |   12 ++++++------
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 29c192f20082..876469e23f7a 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -60,6 +60,7 @@ config CXL_ACPI
 	default CXL_BUS
 	select ACPI_TABLE_LIB
 	select ACPI_HMAT
+	select CXL_PORT
 	help
 	  Enable support for host managed device memory (HDM) resources
 	  published by a platform's ACPI CXL memory layout description.  See
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index db321f48ba52..374829359275 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -1,13 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y += core/
-obj-$(CONFIG_CXL_PCI) += cxl_pci.o
-obj-$(CONFIG_CXL_MEM) += cxl_mem.o
+obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
+obj-$(CONFIG_CXL_PCI) += cxl_pci.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
-obj-$(CONFIG_CXL_PORT) += cxl_port.o
+obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 
-cxl_mem-y := mem.o
-cxl_pci-y := pci.o
+cxl_port-y := port.o
 cxl_acpi-y := acpi.o
+cxl_pci-y := pci.o
 cxl_pmem-y := pmem.o security.o
-cxl_port-y := port.o
+cxl_mem-y := mem.o


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11  5:33 [PATCH 0/5] cxl: Initialization and shutdown fixes Dan Williams
  2024-10-11  5:34 ` [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in Dan Williams
@ 2024-10-11  5:34 ` Dan Williams
  2024-10-11 11:50   ` Zijun Hu
  2024-10-15 16:47   ` Jonathan Cameron
  2024-10-11 11:21 ` [PATCH 0/5] cxl: Initialization and shutdown fixes Alejandro Lucero Palau
  2 siblings, 2 replies; 25+ messages in thread
From: Dan Williams @ 2024-10-11  5:34 UTC (permalink / raw)
  To: dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, Zijun Hu, vishal.l.verma, linux-cxl

In support of investigating an initialization failure report [1],
cxl_test was updated to register mock memory-devices after the mock
root-port/bus device had been registered. That led to cxl_test crashing
with a use-after-free bug with the following signature:

    cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
    cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
    cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
    [..]
    cxld_unregister: cxl decoder14.0:
    cxl_region_decode_reset: cxl_region region3:
    mock_decoder_reset: cxl_port port3: decoder3.0 reset
2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
    cxl_endpoint_decoder_release: cxl decoder14.0:
    [..]
    cxld_unregister: cxl decoder7.0:
3)  cxl_region_decode_reset: cxl_region region3:
    Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
    [..]
    RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
    [..]
    Call Trace:
     <TASK>
     cxl_region_decode_reset+0x69/0x190 [cxl_core]
     cxl_region_detach+0xe8/0x210 [cxl_core]
     cxl_decoder_kill_region+0x27/0x40 [cxl_core]
     cxld_unregister+0x5d/0x60 [cxl_core]

At 1) a region has been established with 2 endpoint decoders (7.0 and
14.0). Those endpoints share a common switch-decoder in the topology
(3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
the "out of order reset case" in the switch decoder. The effect though
is that region3 cleanup is aborted leaving it in-tact and
referencing decoder14.0. At 3) the second attempt to teardown region3
trips over the stale decoder14.0 object which has long since been
deleted.

The fix here is to recognize that the CXL specification places no
mandate on in-order shutdown of switch-decoders, the driver enforces
in-order allocation, and hardware enforces in-order commit. So, rather
than fail and leave objects dangling, always remove them.

In support of making cxl_region_decode_reset() always succeed,
cxl_region_invalidate_memregion() failures are turned into warnings.
Crashing the kernel is ok there since system integrity is at risk if
caches cannot be managed around physical address mutation events like
CXL region destruction.

A new device_for_each_child_reverse_from() is added to cleanup
port->commit_end after all dependent decoders have been disabled. In
other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
port->commit_end only decrements from 2 after 2 has been disabled, and
it decrements all the way to zero since 1 was disabled previously.

Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
Cc: <stable@vger.kernel.org>
Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Zijun Hu <zijun_hu@icloud.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/base/core.c          |   35 +++++++++++++++++++++++++++++
 drivers/cxl/core/hdm.c       |   50 +++++++++++++++++++++++++++++++++++-------
 drivers/cxl/core/region.c    |   48 +++++++++++-----------------------------
 drivers/cxl/cxl.h            |    3 ++-
 include/linux/device.h       |    3 +++
 tools/testing/cxl/test/cxl.c |   14 ++++--------
 6 files changed, 100 insertions(+), 53 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index a4c853411a6b..e42f1ad73078 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data,
 }
 EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
 
+/**
+ * device_for_each_child_reverse_from - device child iterator in reversed order.
+ * @parent: parent struct device.
+ * @from: optional starting point in child list
+ * @fn: function to be called for each device.
+ * @data: data for the callback.
+ *
+ * Iterate over @parent's child devices, starting at @from, and call @fn
+ * for each, passing it @data. This helper is identical to
+ * device_for_each_child_reverse() when @from is NULL.
+ *
+ * @fn is checked each iteration. If it returns anything other than 0,
+ * iteration stop and that value is returned to the caller of
+ * device_for_each_child_reverse_from();
+ */
+int device_for_each_child_reverse_from(struct device *parent,
+				       struct device *from, const void *data,
+				       int (*fn)(struct device *, const void *))
+{
+	struct klist_iter i;
+	struct device *child;
+	int error = 0;
+
+	if (!parent->p)
+		return 0;
+
+	klist_iter_init_node(&parent->p->klist_children, &i,
+			     (from ? &from->p->knode_parent : NULL));
+	while ((child = prev_device(&i)) && !error)
+		error = fn(child, data);
+	klist_iter_exit(&i);
+	return error;
+}
+EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
+
 /**
  * device_find_child - device iterator for locating a particular device.
  * @parent: parent struct device
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3df10517a327..223c273c0cd1 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
 	return 0;
 }
 
-static int cxl_decoder_reset(struct cxl_decoder *cxld)
+static int commit_reap(struct device *dev, const void *data)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld;
+
+	if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev))
+		return 0;
+
+	cxld = to_cxl_decoder(dev);
+	if (port->commit_end == cxld->id &&
+	    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
+		port->commit_end--;
+		dev_dbg(&port->dev, "reap: %s commit_end: %d\n",
+			dev_name(&cxld->dev), port->commit_end);
+	}
+
+	return 0;
+}
+
+void cxl_port_commit_reap(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+
+	lockdep_assert_held_write(&cxl_region_rwsem);
+
+	/*
+	 * Once the highest committed decoder is disabled, free any other
+	 * decoders that were pinned allocated by out-of-order release.
+	 */
+	port->commit_end--;
+	dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev),
+		port->commit_end);
+	device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL,
+					   commit_reap);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL);
+
+static void cxl_decoder_reset(struct cxl_decoder *cxld)
 {
 	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
 	struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev);
@@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 	u32 ctrl;
 
 	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
-		return 0;
+		return;
 
-	if (port->commit_end != id) {
+	if (port->commit_end == id)
+		cxl_port_commit_reap(cxld);
+	else
 		dev_dbg(&port->dev,
 			"%s: out of order reset, expected decoder%d.%d\n",
 			dev_name(&cxld->dev), port->id, port->commit_end);
-		return -EBUSY;
-	}
 
 	down_read(&cxl_dpa_rwsem);
 	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
@@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 	writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
 	up_read(&cxl_dpa_rwsem);
 
-	port->commit_end--;
 	cxld->flags &= ~CXL_DECODER_F_ENABLE;
 
 	/* Userspace is now responsible for reconfiguring this decoder */
@@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 		cxled = to_cxl_endpoint_decoder(&cxld->dev);
 		cxled->state = CXL_DECODER_STATE_MANUAL;
 	}
-
-	return 0;
 }
 
 static int cxl_setup_hdm_decoder_from_dvsec(
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e701e4b04032..3478d2058303 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 				"Bypassing cpu_cache_invalidate_memregion() for testing!\n");
 			return 0;
 		} else {
-			dev_err(&cxlr->dev,
-				"Failed to synchronize CPU cache state\n");
+			dev_WARN(&cxlr->dev,
+				 "Failed to synchronize CPU cache state\n");
 			return -ENXIO;
 		}
 	}
@@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 	return 0;
 }
 
-static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
+static void cxl_region_decode_reset(struct cxl_region *cxlr, int count)
 {
 	struct cxl_region_params *p = &cxlr->params;
-	int i, rc = 0;
+	int i;
 
 	/*
-	 * Before region teardown attempt to flush, and if the flush
-	 * fails cancel the region teardown for data consistency
-	 * concerns
+	 * Before region teardown attempt to flush, evict any data cached for
+	 * this region, or scream loudly about missing arch / platform support
+	 * for CXL teardown.
 	 */
-	rc = cxl_region_invalidate_memregion(cxlr);
-	if (rc)
-		return rc;
+	cxl_region_invalidate_memregion(cxlr);
 
 	for (i = count - 1; i >= 0; i--) {
 		struct cxl_endpoint_decoder *cxled = p->targets[i];
@@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
 			cxl_rr = cxl_rr_load(iter, cxlr);
 			cxld = cxl_rr->decoder;
 			if (cxld->reset)
-				rc = cxld->reset(cxld);
-			if (rc)
-				return rc;
+				cxld->reset(cxld);
 			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
 		}
 
 endpoint_reset:
-		rc = cxled->cxld.reset(&cxled->cxld);
-		if (rc)
-			return rc;
+		cxled->cxld.reset(&cxled->cxld);
 		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
 	}
 
 	/* all decoders associated with this region have been torn down */
 	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
-
-	return 0;
 }
 
 static int commit_decoder(struct cxl_decoder *cxld)
@@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
 		 * still pending.
 		 */
 		if (p->state == CXL_CONFIG_RESET_PENDING) {
-			rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
-			/*
-			 * Revert to committed since there may still be active
-			 * decoders associated with this region, or move forward
-			 * to active to mark the reset successful
-			 */
-			if (rc)
-				p->state = CXL_CONFIG_COMMIT;
-			else
-				p->state = CXL_CONFIG_ACTIVE;
+			cxl_region_decode_reset(cxlr, p->interleave_ways);
+			p->state = CXL_CONFIG_ACTIVE;
 		}
 	}
 
@@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled)
 	get_device(&cxlr->dev);
 
 	if (p->state > CXL_CONFIG_ACTIVE) {
-		/*
-		 * TODO: tear down all impacted regions if a device is
-		 * removed out of order
-		 */
-		rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
-		if (rc)
-			goto out;
+		cxl_region_decode_reset(cxlr, p->interleave_ways);
 		p->state = CXL_CONFIG_ACTIVE;
 	}
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0d8b810a51f0..5406e3ab3d4a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -359,7 +359,7 @@ struct cxl_decoder {
 	struct cxl_region *region;
 	unsigned long flags;
 	int (*commit)(struct cxl_decoder *cxld);
-	int (*reset)(struct cxl_decoder *cxld);
+	void (*reset)(struct cxl_decoder *cxld);
 };
 
 /*
@@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
 int cxl_num_decoders_committed(struct cxl_port *port);
 bool is_cxl_port(const struct device *dev);
 struct cxl_port *to_cxl_port(const struct device *dev);
+void cxl_port_commit_reap(struct cxl_decoder *cxld);
 struct pci_bus;
 int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev,
 			      struct pci_bus *bus);
diff --git a/include/linux/device.h b/include/linux/device.h
index b4bde8d22697..667cb6db9019 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data,
 			  int (*fn)(struct device *dev, void *data));
 int device_for_each_child_reverse(struct device *dev, void *data,
 				  int (*fn)(struct device *dev, void *data));
+int device_for_each_child_reverse_from(struct device *parent,
+				       struct device *from, const void *data,
+				       int (*fn)(struct device *, const void *));
 struct device *device_find_child(struct device *dev, void *data,
 				 int (*match)(struct device *dev, void *data));
 struct device *device_find_child_by_name(struct device *parent,
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 90d5afd52dd0..c5bbd89b3192 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld)
 	return 0;
 }
 
-static int mock_decoder_reset(struct cxl_decoder *cxld)
+static void mock_decoder_reset(struct cxl_decoder *cxld)
 {
 	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
 	int id = cxld->id;
 
 	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
-		return 0;
+		return;
 
 	dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev));
-	if (port->commit_end != id) {
+	if (port->commit_end == id)
+		cxl_port_commit_reap(cxld);
+	else
 		dev_dbg(&port->dev,
 			"%s: out of order reset, expected decoder%d.%d\n",
 			dev_name(&cxld->dev), port->id, port->commit_end);
-		return -EBUSY;
-	}
-
-	port->commit_end--;
 	cxld->flags &= ~CXL_DECODER_F_ENABLE;
-
-	return 0;
 }
 
 static void default_mock_decoder(struct cxl_decoder *cxld)


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-11  5:33 [PATCH 0/5] cxl: Initialization and shutdown fixes Dan Williams
  2024-10-11  5:34 ` [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in Dan Williams
  2024-10-11  5:34 ` [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown Dan Williams
@ 2024-10-11 11:21 ` Alejandro Lucero Palau
  2024-10-11 17:38   ` Dan Williams
  2 siblings, 1 reply; 25+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-11 11:21 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Hi Dan,


I think this is the same issue one of the patches in type2 support tries 
to deal with:


https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e


If this fixes that situation, I guess I can drop that one from v4 which 
is ready to be sent.


The other problem I try to fix in that patch, the endpoint not being 
there when that code tries to use it, it is likely not needed either, 
although I have a trivial fix for it now instead of that ugly loop with 
delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for 
the cxl_mem_driver which implies the device_add will only return when 
the device is really created. Maybe that is worth it for other potential 
situations suffering the delayed creation.


On 10/11/24 06:33, Dan Williams wrote:
> Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to
> delayed arrival of the CXL "root" infrastructure [1] prompted questions
> of how the existing mechanism for retrying cxl_mem_probe() could be
> failing.
>
> The critical missing piece in the debug was that Gregory's setup had
> almost all CXL modules built-in to the kernel.
>
> On the way to that discovery several other bugs and init-order corner
> cases were discovered.
>
> The main fix is to make sure the drivers/cxl/Makefile object order
> supports root CXL ports being fully initialized upon cxl_acpi_probe()
> exit. The modular case has some similar potential holes that are fixed
> with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update
> cxl_test to reproduce the original report resulted in the discovery of a
> separate long standing use after free bug in cxl_region_detach().
>
> [1]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
>
> ---
>
> Dan Williams (5):
>        cxl/port: Fix CXL port initialization order when the subsystem is built-in
>        cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
>        cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
>        cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
>        cxl/test: Improve init-order fidelity relative to real-world systems
>
>
>   drivers/base/core.c          |   35 +++++++
>   drivers/cxl/Kconfig          |    1
>   drivers/cxl/Makefile         |   12 +--
>   drivers/cxl/acpi.c           |    7 +
>   drivers/cxl/core/hdm.c       |   50 +++++++++--
>   drivers/cxl/core/port.c      |   13 ++-
>   drivers/cxl/core/region.c    |   48 +++-------
>   drivers/cxl/cxl.h            |    3 -
>   include/linux/device.h       |    3 +
>   tools/testing/cxl/test/cxl.c |  200 +++++++++++++++++++++++-------------------
>   tools/testing/cxl/test/mem.c |    1
>   11 files changed, 228 insertions(+), 145 deletions(-)
>
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11  5:34 ` [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown Dan Williams
@ 2024-10-11 11:50   ` Zijun Hu
  2024-10-11 17:46     ` Dan Williams
  2024-10-15 16:47   ` Jonathan Cameron
  1 sibling, 1 reply; 25+ messages in thread
From: Zijun Hu @ 2024-10-11 11:50 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

On 2024/10/11 13:34, Dan Williams wrote:
> In support of investigating an initialization failure report [1],
> cxl_test was updated to register mock memory-devices after the mock
> root-port/bus device had been registered. That led to cxl_test crashing
> with a use-after-free bug with the following signature:
> 
>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
>     cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
> 1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
>     [..]
>     cxld_unregister: cxl decoder14.0:
>     cxl_region_decode_reset: cxl_region region3:
>     mock_decoder_reset: cxl_port port3: decoder3.0 reset
> 2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
>     cxl_endpoint_decoder_release: cxl decoder14.0:
>     [..]
>     cxld_unregister: cxl decoder7.0:
> 3)  cxl_region_decode_reset: cxl_region region3:
>     Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
>     [..]
>     RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
>     [..]
>     Call Trace:
>      <TASK>
>      cxl_region_decode_reset+0x69/0x190 [cxl_core]
>      cxl_region_detach+0xe8/0x210 [cxl_core]
>      cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>      cxld_unregister+0x5d/0x60 [cxl_core]
> 
> At 1) a region has been established with 2 endpoint decoders (7.0 and
> 14.0). Those endpoints share a common switch-decoder in the topology
> (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
> the "out of order reset case" in the switch decoder. The effect though
> is that region3 cleanup is aborted leaving it in-tact and
> referencing decoder14.0. At 3) the second attempt to teardown region3
> trips over the stale decoder14.0 object which has long since been
> deleted.
> 
> The fix here is to recognize that the CXL specification places no
> mandate on in-order shutdown of switch-decoders, the driver enforces
> in-order allocation, and hardware enforces in-order commit. So, rather
> than fail and leave objects dangling, always remove them.
> 
> In support of making cxl_region_decode_reset() always succeed,
> cxl_region_invalidate_memregion() failures are turned into warnings.
> Crashing the kernel is ok there since system integrity is at risk if
> caches cannot be managed around physical address mutation events like
> CXL region destruction.
> 
> A new device_for_each_child_reverse_from() is added to cleanup
> port->commit_end after all dependent decoders have been disabled. In
> other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
> port->commit_end only decrements from 2 after 2 has been disabled, and
> it decrements all the way to zero since 1 was disabled previously.
> 
> Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
> Cc: <stable@vger.kernel.org>
> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alison Schofield <alison.schofield@intel.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Zijun Hu <zijun_hu@icloud.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/base/core.c          |   35 +++++++++++++++++++++++++++++
>  drivers/cxl/core/hdm.c       |   50 +++++++++++++++++++++++++++++++++++-------
>  drivers/cxl/core/region.c    |   48 +++++++++++-----------------------------
>  drivers/cxl/cxl.h            |    3 ++-
>  include/linux/device.h       |    3 +++
>  tools/testing/cxl/test/cxl.c |   14 ++++--------
>  6 files changed, 100 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index a4c853411a6b..e42f1ad73078 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data,
>  }
>  EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
>  
> +/**
> + * device_for_each_child_reverse_from - device child iterator in reversed order.
> + * @parent: parent struct device.
> + * @from: optional starting point in child list
> + * @fn: function to be called for each device.
> + * @data: data for the callback.
> + *
> + * Iterate over @parent's child devices, starting at @from, and call @fn
> + * for each, passing it @data. This helper is identical to
> + * device_for_each_child_reverse() when @from is NULL.
> + *
> + * @fn is checked each iteration. If it returns anything other than 0,
> + * iteration stop and that value is returned to the caller of
> + * device_for_each_child_reverse_from();
> + */
> +int device_for_each_child_reverse_from(struct device *parent,
> +				       struct device *from, const void *data,
> +				       int (*fn)(struct device *, const void *))
> +{
> +	struct klist_iter i;
> +	struct device *child;
> +	int error = 0;
> +
> +	if (!parent->p)
> +		return 0;
> +
> +	klist_iter_init_node(&parent->p->klist_children, &i,
> +			     (from ? &from->p->knode_parent : NULL));
> +	while ((child = prev_device(&i)) && !error)
> +		error = fn(child, data);
> +	klist_iter_exit(&i);
> +	return error;
> +}
> +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
> +

it does NOT deserve, also does NOT need to introduce a new core driver
API device_for_each_child_reverse_from(). existing
device_for_each_child_reverse() can do what the _from() wants to do.

we can use similar approach as below link shown:
https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/

>  /**
>   * device_find_child - device iterator for locating a particular device.
>   * @parent: parent struct device
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3df10517a327..223c273c0cd1 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
>  	return 0;
>  }
>  
> -static int cxl_decoder_reset(struct cxl_decoder *cxld)
> +static int commit_reap(struct device *dev, const void *data)
> +{
> +	struct cxl_port *port = to_cxl_port(dev->parent);
> +	struct cxl_decoder *cxld;
> +
> +	if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxld = to_cxl_decoder(dev);
> +	if (port->commit_end == cxld->id &&
> +	    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
> +		port->commit_end--;
> +		dev_dbg(&port->dev, "reap: %s commit_end: %d\n",
> +			dev_name(&cxld->dev), port->commit_end);
> +	}
> +
> +	return 0;
> +}
> +
> +void cxl_port_commit_reap(struct cxl_decoder *cxld)
> +{
> +	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +
> +	lockdep_assert_held_write(&cxl_region_rwsem);
> +
> +	/*
> +	 * Once the highest committed decoder is disabled, free any other
> +	 * decoders that were pinned allocated by out-of-order release.
> +	 */
> +	port->commit_end--;
> +	dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev),
> +		port->commit_end);
> +	device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL,
> +					   commit_reap);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL);
> +
> +static void cxl_decoder_reset(struct cxl_decoder *cxld)
>  {
>  	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
>  	struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev);
> @@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>  	u32 ctrl;
>  
>  	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> -		return 0;
> +		return;
>  
> -	if (port->commit_end != id) {
> +	if (port->commit_end == id)
> +		cxl_port_commit_reap(cxld);
> +	else
>  		dev_dbg(&port->dev,
>  			"%s: out of order reset, expected decoder%d.%d\n",
>  			dev_name(&cxld->dev), port->id, port->commit_end);
> -		return -EBUSY;
> -	}
>  
>  	down_read(&cxl_dpa_rwsem);
>  	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
> @@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>  	writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
>  	up_read(&cxl_dpa_rwsem);
>  
> -	port->commit_end--;
>  	cxld->flags &= ~CXL_DECODER_F_ENABLE;
>  
>  	/* Userspace is now responsible for reconfiguring this decoder */
> @@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>  		cxled = to_cxl_endpoint_decoder(&cxld->dev);
>  		cxled->state = CXL_DECODER_STATE_MANUAL;
>  	}
> -
> -	return 0;
>  }
>  
>  static int cxl_setup_hdm_decoder_from_dvsec(
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e701e4b04032..3478d2058303 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  				"Bypassing cpu_cache_invalidate_memregion() for testing!\n");
>  			return 0;
>  		} else {
> -			dev_err(&cxlr->dev,
> -				"Failed to synchronize CPU cache state\n");
> +			dev_WARN(&cxlr->dev,
> +				 "Failed to synchronize CPU cache state\n");
>  			return -ENXIO;
>  		}
>  	}
> @@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  	return 0;
>  }
>  
> -static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> +static void cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  {
>  	struct cxl_region_params *p = &cxlr->params;
> -	int i, rc = 0;
> +	int i;
>  
>  	/*
> -	 * Before region teardown attempt to flush, and if the flush
> -	 * fails cancel the region teardown for data consistency
> -	 * concerns
> +	 * Before region teardown attempt to flush, evict any data cached for
> +	 * this region, or scream loudly about missing arch / platform support
> +	 * for CXL teardown.
>  	 */
> -	rc = cxl_region_invalidate_memregion(cxlr);
> -	if (rc)
> -		return rc;
> +	cxl_region_invalidate_memregion(cxlr);
>  
>  	for (i = count - 1; i >= 0; i--) {
>  		struct cxl_endpoint_decoder *cxled = p->targets[i];
> @@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  			cxl_rr = cxl_rr_load(iter, cxlr);
>  			cxld = cxl_rr->decoder;
>  			if (cxld->reset)
> -				rc = cxld->reset(cxld);
> -			if (rc)
> -				return rc;
> +				cxld->reset(cxld);
>  			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>  		}
>  
>  endpoint_reset:
> -		rc = cxled->cxld.reset(&cxled->cxld);
> -		if (rc)
> -			return rc;
> +		cxled->cxld.reset(&cxled->cxld);
>  		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>  	}
>  
>  	/* all decoders associated with this region have been torn down */
>  	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
> -
> -	return 0;
>  }
>  
>  static int commit_decoder(struct cxl_decoder *cxld)
> @@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
>  		 * still pending.
>  		 */
>  		if (p->state == CXL_CONFIG_RESET_PENDING) {
> -			rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
> -			/*
> -			 * Revert to committed since there may still be active
> -			 * decoders associated with this region, or move forward
> -			 * to active to mark the reset successful
> -			 */
> -			if (rc)
> -				p->state = CXL_CONFIG_COMMIT;
> -			else
> -				p->state = CXL_CONFIG_ACTIVE;
> +			cxl_region_decode_reset(cxlr, p->interleave_ways);
> +			p->state = CXL_CONFIG_ACTIVE;
>  		}
>  	}
>  
> @@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled)
>  	get_device(&cxlr->dev);
>  
>  	if (p->state > CXL_CONFIG_ACTIVE) {
> -		/*
> -		 * TODO: tear down all impacted regions if a device is
> -		 * removed out of order
> -		 */
> -		rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
> -		if (rc)
> -			goto out;
> +		cxl_region_decode_reset(cxlr, p->interleave_ways);
>  		p->state = CXL_CONFIG_ACTIVE;
>  	}
>  
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 0d8b810a51f0..5406e3ab3d4a 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -359,7 +359,7 @@ struct cxl_decoder {
>  	struct cxl_region *region;
>  	unsigned long flags;
>  	int (*commit)(struct cxl_decoder *cxld);
> -	int (*reset)(struct cxl_decoder *cxld);
> +	void (*reset)(struct cxl_decoder *cxld);
>  };
>  
>  /*
> @@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
>  int cxl_num_decoders_committed(struct cxl_port *port);
>  bool is_cxl_port(const struct device *dev);
>  struct cxl_port *to_cxl_port(const struct device *dev);
> +void cxl_port_commit_reap(struct cxl_decoder *cxld);
>  struct pci_bus;
>  int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev,
>  			      struct pci_bus *bus);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index b4bde8d22697..667cb6db9019 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data,
>  			  int (*fn)(struct device *dev, void *data));
>  int device_for_each_child_reverse(struct device *dev, void *data,
>  				  int (*fn)(struct device *dev, void *data));
> +int device_for_each_child_reverse_from(struct device *parent,
> +				       struct device *from, const void *data,
> +				       int (*fn)(struct device *, const void *));
>  struct device *device_find_child(struct device *dev, void *data,
>  				 int (*match)(struct device *dev, void *data));
>  struct device *device_find_child_by_name(struct device *parent,
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 90d5afd52dd0..c5bbd89b3192 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld)
>  	return 0;
>  }
>  
> -static int mock_decoder_reset(struct cxl_decoder *cxld)
> +static void mock_decoder_reset(struct cxl_decoder *cxld)
>  {
>  	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
>  	int id = cxld->id;
>  
>  	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> -		return 0;
> +		return;
>  
>  	dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev));
> -	if (port->commit_end != id) {
> +	if (port->commit_end == id)
> +		cxl_port_commit_reap(cxld);
> +	else
>  		dev_dbg(&port->dev,
>  			"%s: out of order reset, expected decoder%d.%d\n",
>  			dev_name(&cxld->dev), port->id, port->commit_end);
> -		return -EBUSY;
> -	}
> -
> -	port->commit_end--;
>  	cxld->flags &= ~CXL_DECODER_F_ENABLE;
> -
> -	return 0;
>  }
>  
>  static void default_mock_decoder(struct cxl_decoder *cxld)
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-11 11:21 ` [PATCH 0/5] cxl: Initialization and shutdown fixes Alejandro Lucero Palau
@ 2024-10-11 17:38   ` Dan Williams
  2024-10-12  6:30     ` Alejandro Lucero Palau
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-11 17:38 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Alejandro Lucero Palau wrote:
> Hi Dan,
> 
> 
> I think this is the same issue one of the patches in type2 support tries 
> to deal with:
> 
> 
> https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e
> 
> 
> If this fixes that situation, I guess I can drop that one from v4 which 
> is ready to be sent.
> 
> 
> The other problem I try to fix in that patch, the endpoint not being 
> there when that code tries to use it, it is likely not needed either, 
> although I have a trivial fix for it now instead of that ugly loop with 
> delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for 
> the cxl_mem_driver which implies the device_add will only return when 
> the device is really created. Maybe that is worth it for other potential 
> situations suffering the delayed creation.

I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
device-readiness bug. Some other assumption is violated if that is
required.

For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
assumption that an accelerator driver might want to wait until CXL is
initialized before the base accelerator proceeds. However, if
accelerator drivers behave the same as the cxl_pci driver and are ok
with asynchronus arrival of CXL functionality then no deferral is
needed.

Otherwise, the only motivation for synchronous probing I can think of
would be to have more predictable naming of kernel objects. So yes, I
would be curious to understand what scenarios probe deferral is still
needed.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11 11:50   ` Zijun Hu
@ 2024-10-11 17:46     ` Dan Williams
  2024-10-11 23:40       ` Zijun Hu
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-11 17:46 UTC (permalink / raw)
  To: Zijun Hu, Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

Zijun Hu wrote:
> On 2024/10/11 13:34, Dan Williams wrote:
> > In support of investigating an initialization failure report [1],
> > cxl_test was updated to register mock memory-devices after the mock
> > root-port/bus device had been registered. That led to cxl_test crashing
> > with a use-after-free bug with the following signature:
> > 
> >     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
> >     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
> >     cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
> > 1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
> >     [..]
> >     cxld_unregister: cxl decoder14.0:
> >     cxl_region_decode_reset: cxl_region region3:
> >     mock_decoder_reset: cxl_port port3: decoder3.0 reset
> > 2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
> >     cxl_endpoint_decoder_release: cxl decoder14.0:
> >     [..]
> >     cxld_unregister: cxl decoder7.0:
> > 3)  cxl_region_decode_reset: cxl_region region3:
> >     Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
> >     [..]
> >     RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
> >     [..]
> >     Call Trace:
> >      <TASK>
> >      cxl_region_decode_reset+0x69/0x190 [cxl_core]
> >      cxl_region_detach+0xe8/0x210 [cxl_core]
> >      cxl_decoder_kill_region+0x27/0x40 [cxl_core]
> >      cxld_unregister+0x5d/0x60 [cxl_core]
> > 
> > At 1) a region has been established with 2 endpoint decoders (7.0 and
> > 14.0). Those endpoints share a common switch-decoder in the topology
> > (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
> > the "out of order reset case" in the switch decoder. The effect though
> > is that region3 cleanup is aborted leaving it in-tact and
> > referencing decoder14.0. At 3) the second attempt to teardown region3
> > trips over the stale decoder14.0 object which has long since been
> > deleted.
> > 
> > The fix here is to recognize that the CXL specification places no
> > mandate on in-order shutdown of switch-decoders, the driver enforces
> > in-order allocation, and hardware enforces in-order commit. So, rather
> > than fail and leave objects dangling, always remove them.
> > 
> > In support of making cxl_region_decode_reset() always succeed,
> > cxl_region_invalidate_memregion() failures are turned into warnings.
> > Crashing the kernel is ok there since system integrity is at risk if
> > caches cannot be managed around physical address mutation events like
> > CXL region destruction.
> > 
> > A new device_for_each_child_reverse_from() is added to cleanup
> > port->commit_end after all dependent decoders have been disabled. In
> > other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
> > port->commit_end only decrements from 2 after 2 has been disabled, and
> > it decrements all the way to zero since 1 was disabled previously.
> > 
> > Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
> > Cc: <stable@vger.kernel.org>
> > Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Davidlohr Bueso <dave@stgolabs.net>
> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alison Schofield <alison.schofield@intel.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Cc: Zijun Hu <zijun_hu@icloud.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/base/core.c          |   35 +++++++++++++++++++++++++++++
> >  drivers/cxl/core/hdm.c       |   50 +++++++++++++++++++++++++++++++++++-------
> >  drivers/cxl/core/region.c    |   48 +++++++++++-----------------------------
> >  drivers/cxl/cxl.h            |    3 ++-
> >  include/linux/device.h       |    3 +++
> >  tools/testing/cxl/test/cxl.c |   14 ++++--------
> >  6 files changed, 100 insertions(+), 53 deletions(-)
> > 
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index a4c853411a6b..e42f1ad73078 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data,
> >  }
> >  EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
> >  
> > +/**
> > + * device_for_each_child_reverse_from - device child iterator in reversed order.
> > + * @parent: parent struct device.
> > + * @from: optional starting point in child list
> > + * @fn: function to be called for each device.
> > + * @data: data for the callback.
> > + *
> > + * Iterate over @parent's child devices, starting at @from, and call @fn
> > + * for each, passing it @data. This helper is identical to
> > + * device_for_each_child_reverse() when @from is NULL.
> > + *
> > + * @fn is checked each iteration. If it returns anything other than 0,
> > + * iteration stop and that value is returned to the caller of
> > + * device_for_each_child_reverse_from();
> > + */
> > +int device_for_each_child_reverse_from(struct device *parent,
> > +				       struct device *from, const void *data,
> > +				       int (*fn)(struct device *, const void *))
> > +{
> > +	struct klist_iter i;
> > +	struct device *child;
> > +	int error = 0;
> > +
> > +	if (!parent->p)
> > +		return 0;
> > +
> > +	klist_iter_init_node(&parent->p->klist_children, &i,
> > +			     (from ? &from->p->knode_parent : NULL));
> > +	while ((child = prev_device(&i)) && !error)
> > +		error = fn(child, data);
> > +	klist_iter_exit(&i);
> > +	return error;
> > +}
> > +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
> > +
> 
> it does NOT deserve, also does NOT need to introduce a new core driver
> API device_for_each_child_reverse_from(). existing
> device_for_each_child_reverse() can do what the _from() wants to do.
> 
> we can use similar approach as below link shown:
> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/

No, just have a simple starting point parameter. I understand that more
logic can be placed around device_for_each_child_reverse() to achieve
the same effect, but the core helpers should be removing logic from
consumers, not forcing them to add more.

If bloat is a concern, then after your const cleanups go through
device_for_each_child_reverse() can be rewritten in terms of
device_for_each_child_reverse_from() as (untested):

diff --git a/drivers/base/core.c b/drivers/base/core.c
index e42f1ad73078..2571c910da46 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -4007,36 +4007,6 @@ int device_for_each_child(struct device *parent, void *data,
 }
 EXPORT_SYMBOL_GPL(device_for_each_child);
 
-/**
- * device_for_each_child_reverse - device child iterator in reversed order.
- * @parent: parent struct device.
- * @fn: function to be called for each device.
- * @data: data for the callback.
- *
- * Iterate over @parent's child devices, and call @fn for each,
- * passing it @data.
- *
- * We check the return of @fn each time. If it returns anything
- * other than 0, we break out and return that value.
- */
-int device_for_each_child_reverse(struct device *parent, void *data,
-				  int (*fn)(struct device *dev, void *data))
-{
-	struct klist_iter i;
-	struct device *child;
-	int error = 0;
-
-	if (!parent || !parent->p)
-		return 0;
-
-	klist_iter_init(&parent->p->klist_children, &i);
-	while ((child = prev_device(&i)) && !error)
-		error = fn(child, data);
-	klist_iter_exit(&i);
-	return error;
-}
-EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
-
 /**
  * device_for_each_child_reverse_from - device child iterator in reversed order.
  * @parent: parent struct device.
diff --git a/include/linux/device.h b/include/linux/device.h
index 667cb6db9019..96a2c072bf5b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T))
 
 int device_for_each_child(struct device *dev, void *data,
 			  int (*fn)(struct device *dev, void *data));
-int device_for_each_child_reverse(struct device *dev, void *data,
-				  int (*fn)(struct device *dev, void *data));
 int device_for_each_child_reverse_from(struct device *parent,
 				       struct device *from, const void *data,
 				       int (*fn)(struct device *, const void *));
+static inline int device_for_each_child_reverse(struct device *dev, const void *data,
+						int (*fn)(struct device *, const void *))
+{
+	return device_for_each_child_reverse_from(dev, NULL, data, fn);
+}
 struct device *device_find_child(struct device *dev, void *data,
 				 int (*match)(struct device *dev, void *data));
 struct device *device_find_child_by_name(struct device *parent,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11 17:46     ` Dan Williams
@ 2024-10-11 23:40       ` Zijun Hu
  2024-10-12 17:56         ` Gregory Price
  2024-10-12 22:16         ` Dan Williams
  0 siblings, 2 replies; 25+ messages in thread
From: Zijun Hu @ 2024-10-11 23:40 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

On 2024/10/12 01:46, Dan Williams wrote:
> Zijun Hu wrote:
>> On 2024/10/11 13:34, Dan Williams wrote:
>>> In support of investigating an initialization failure report [1],
>>> cxl_test was updated to register mock memory-devices after the mock
>>> root-port/bus device had been registered. That led to cxl_test crashing
>>> with a use-after-free bug with the following signature:
>>>
>>>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
>>>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
>>>     cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
>>> 1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
>>>     [..]
>>>     cxld_unregister: cxl decoder14.0:
>>>     cxl_region_decode_reset: cxl_region region3:
>>>     mock_decoder_reset: cxl_port port3: decoder3.0 reset
>>> 2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
>>>     cxl_endpoint_decoder_release: cxl decoder14.0:
>>>     [..]
>>>     cxld_unregister: cxl decoder7.0:
>>> 3)  cxl_region_decode_reset: cxl_region region3:
>>>     Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
>>>     [..]
>>>     RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
>>>     [..]
>>>     Call Trace:
>>>      <TASK>
>>>      cxl_region_decode_reset+0x69/0x190 [cxl_core]
>>>      cxl_region_detach+0xe8/0x210 [cxl_core]
>>>      cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>>>      cxld_unregister+0x5d/0x60 [cxl_core]
>>>
>>> At 1) a region has been established with 2 endpoint decoders (7.0 and
>>> 14.0). Those endpoints share a common switch-decoder in the topology
>>> (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
>>> the "out of order reset case" in the switch decoder. The effect though
>>> is that region3 cleanup is aborted leaving it in-tact and
>>> referencing decoder14.0. At 3) the second attempt to teardown region3
>>> trips over the stale decoder14.0 object which has long since been
>>> deleted.
>>>
>>> The fix here is to recognize that the CXL specification places no
>>> mandate on in-order shutdown of switch-decoders, the driver enforces
>>> in-order allocation, and hardware enforces in-order commit. So, rather
>>> than fail and leave objects dangling, always remove them.
>>>
>>> In support of making cxl_region_decode_reset() always succeed,
>>> cxl_region_invalidate_memregion() failures are turned into warnings.
>>> Crashing the kernel is ok there since system integrity is at risk if
>>> caches cannot be managed around physical address mutation events like
>>> CXL region destruction.
>>>
>>> A new device_for_each_child_reverse_from() is added to cleanup
>>> port->commit_end after all dependent decoders have been disabled. In
>>> other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
>>> port->commit_end only decrements from 2 after 2 has been disabled, and
>>> it decrements all the way to zero since 1 was disabled previously.
>>>
>>> Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
>>> Cc: <stable@vger.kernel.org>
>>> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>>> Cc: Dave Jiang <dave.jiang@intel.com>
>>> Cc: Alison Schofield <alison.schofield@intel.com>
>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>> Cc: Zijun Hu <zijun_hu@icloud.com>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  drivers/base/core.c          |   35 +++++++++++++++++++++++++++++
>>>  drivers/cxl/core/hdm.c       |   50 +++++++++++++++++++++++++++++++++++-------
>>>  drivers/cxl/core/region.c    |   48 +++++++++++-----------------------------
>>>  drivers/cxl/cxl.h            |    3 ++-
>>>  include/linux/device.h       |    3 +++
>>>  tools/testing/cxl/test/cxl.c |   14 ++++--------
>>>  6 files changed, 100 insertions(+), 53 deletions(-)
>>>
>>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>>> index a4c853411a6b..e42f1ad73078 100644
>>> --- a/drivers/base/core.c
>>> +++ b/drivers/base/core.c
>>> @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data,
>>>  }
>>>  EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
>>>  
>>> +/**
>>> + * device_for_each_child_reverse_from - device child iterator in reversed order.
>>> + * @parent: parent struct device.
>>> + * @from: optional starting point in child list
>>> + * @fn: function to be called for each device.
>>> + * @data: data for the callback.
>>> + *
>>> + * Iterate over @parent's child devices, starting at @from, and call @fn
>>> + * for each, passing it @data. This helper is identical to
>>> + * device_for_each_child_reverse() when @from is NULL.
>>> + *
>>> + * @fn is checked each iteration. If it returns anything other than 0,
>>> + * iteration stop and that value is returned to the caller of
>>> + * device_for_each_child_reverse_from();
>>> + */
>>> +int device_for_each_child_reverse_from(struct device *parent,
>>> +				       struct device *from, const void *data,
>>> +				       int (*fn)(struct device *, const void *))
>>> +{
>>> +	struct klist_iter i;
>>> +	struct device *child;
>>> +	int error = 0;
>>> +
>>> +	if (!parent->p)
>>> +		return 0;
>>> +
>>> +	klist_iter_init_node(&parent->p->klist_children, &i,
>>> +			     (from ? &from->p->knode_parent : NULL));
>>> +	while ((child = prev_device(&i)) && !error)
>>> +		error = fn(child, data);
>>> +	klist_iter_exit(&i);
>>> +	return error;
>>> +}
>>> +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
>>> +
>>
>> it does NOT deserve, also does NOT need to introduce a new core driver
>> API device_for_each_child_reverse_from(). existing
>> device_for_each_child_reverse() can do what the _from() wants to do.
>>
>> we can use similar approach as below link shown:
>> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/
> 
> No, just have a simple starting point parameter. I understand that more
> logic can be placed around device_for_each_child_reverse() to achieve
> the same effect, but the core helpers should be removing logic from
> consumers, not forcing them to add more.
> 
> If bloat is a concern, then after your const cleanups go through
> device_for_each_child_reverse() can be rewritten in terms of
> device_for_each_child_reverse_from() as (untested):
> 

bloat is one aspect, the other aspect is that there are redundant
between both driver core APIs, namely, there are a question:

why to still need device_for_each_child_reverse() if it is same as
_from(..., NULL, ...) ?

So i suggest use existing API now.
if there are more users who have such starting point requirement, then
add the parameter into device_for_each_child_reverse() which is
consistent with other existing *_for_each_*() core APIs such as
(class|driver|bus)_for_each_device() and bus_for_each_drv(), that may
need much efforts.


could you please contains your proposal "fixing this allocation
order validation" of below link into this patch series with below
reason? and Cc me (^^)

https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.com.notmuch/

A)
  the proposal depends on this patch series.
B)
  one of the issues the proposal fix is match_free_decoder()  error
logic which is also relevant issues this patch series fix, the proposal
also can fix the other device_find_child()'s match() issue i care about.

C)
 Actually, it is a bit difficult for me to understand the proposal since
 i don't have any basic knowledge about CXL. (^^)

> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index e42f1ad73078..2571c910da46 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -4007,36 +4007,6 @@ int device_for_each_child(struct device *parent, void *data,
>  }
>  EXPORT_SYMBOL_GPL(device_for_each_child);
>  
> -/**
> - * device_for_each_child_reverse - device child iterator in reversed order.
> - * @parent: parent struct device.
> - * @fn: function to be called for each device.
> - * @data: data for the callback.
> - *
> - * Iterate over @parent's child devices, and call @fn for each,
> - * passing it @data.
> - *
> - * We check the return of @fn each time. If it returns anything
> - * other than 0, we break out and return that value.
> - */
> -int device_for_each_child_reverse(struct device *parent, void *data,
> -				  int (*fn)(struct device *dev, void *data))
> -{
> -	struct klist_iter i;
> -	struct device *child;
> -	int error = 0;
> -
> -	if (!parent || !parent->p)
> -		return 0;
> -
> -	klist_iter_init(&parent->p->klist_children, &i);
> -	while ((child = prev_device(&i)) && !error)
> -		error = fn(child, data);
> -	klist_iter_exit(&i);
> -	return error;
> -}
> -EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
> -
>  /**
>   * device_for_each_child_reverse_from - device child iterator in reversed order.
>   * @parent: parent struct device.
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 667cb6db9019..96a2c072bf5b 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T))
>  
>  int device_for_each_child(struct device *dev, void *data,
>  			  int (*fn)(struct device *dev, void *data));
> -int device_for_each_child_reverse(struct device *dev, void *data,
> -				  int (*fn)(struct device *dev, void *data));
>  int device_for_each_child_reverse_from(struct device *parent,
>  				       struct device *from, const void *data,
>  				       int (*fn)(struct device *, const void *));
> +static inline int device_for_each_child_reverse(struct device *dev, const void *data,
> +						int (*fn)(struct device *, const void *))
> +{
> +	return device_for_each_child_reverse_from(dev, NULL, data, fn);
> +}
>  struct device *device_find_child(struct device *dev, void *data,
>  				 int (*match)(struct device *dev, void *data));
>  struct device *device_find_child_by_name(struct device *parent,


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-11 17:38   ` Dan Williams
@ 2024-10-12  6:30     ` Alejandro Lucero Palau
  2024-10-12 21:57       ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-12  6:30 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

On 10/11/24 18:38, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> Hi Dan,
>>
>>
>> I think this is the same issue one of the patches in type2 support tries
>> to deal with:
>>
>>
>> https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e
>>
>>
>> If this fixes that situation, I guess I can drop that one from v4 which
>> is ready to be sent.
>>
>>
>> The other problem I try to fix in that patch, the endpoint not being
>> there when that code tries to use it, it is likely not needed either,
>> although I have a trivial fix for it now instead of that ugly loop with
>> delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for
>> the cxl_mem_driver which implies the device_add will only return when
>> the device is really created. Maybe that is worth it for other potential
>> situations suffering the delayed creation.
> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
> device-readiness bug. Some other assumption is violated if that is
> required.

But that problem is not about device readiness but just how the device 
model works. In this case the memdev creation is adding devices, no real 
ones but those abstractions we use from the device model, and that 
device creation is done asynchronously. When the code creating the 
memdev, a Type2 driver in my case, is going to work with such a device 
abstraction just after the memdev creation, it is not there yet. It is 
true that clx_mem_probe will interact with the real device, but the fact 
is such a function is not invoked by the device model synchronously, so 
the code using such a device abstraction needs to wait until it is 
there. With this flag the waiting is implicit to device creation. 
Without that flag other "nasty dancing" with delays and checks needs to 
be done as the code in v3 did.

> For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
> assumption that an accelerator driver might want to wait until CXL is
> initialized before the base accelerator proceeds. However, if
> accelerator drivers behave the same as the cxl_pci driver and are ok
> with asynchronus arrival of CXL functionality then no deferral is
> needed.

I think deferring the accel driver makes sense. In the sfc driver case, 
it could work without CXL and then change to use it once the CXL kernel 
support is fully initialised, but I guess other accel drivers will rely 
on CXL with no other option, and even with the sfc driver, supporting 
such a change will make the code far more complex.

>
> Otherwise, the only motivation for synchronous probing I can think of
> would be to have more predictable naming of kernel objects. So yes, I
> would be curious to understand what scenarios probe deferral is still
> needed.

OK. I will keep that patch with the last change in the v4. Let's discuss 
this further with that patch as a reference.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11 23:40       ` Zijun Hu
@ 2024-10-12 17:56         ` Gregory Price
  2024-10-12 22:16         ` Dan Williams
  1 sibling, 0 replies; 25+ messages in thread
From: Gregory Price @ 2024-10-12 17:56 UTC (permalink / raw)
  To: Zijun Hu
  Cc: Dan Williams, dave.jiang, ira.weiny, stable, Greg Kroah-Hartman,
	Davidlohr Bueso, Jonathan Cameron, Alison Schofield,
	vishal.l.verma, linux-cxl

On Sat, Oct 12, 2024 at 07:40:13AM +0800, Zijun Hu wrote:
> On 2024/10/12 01:46, Dan Williams wrote:
> > Zijun Hu wrote:
> >> On 2024/10/11 13:34, Dan Williams wrote:
> >>> In support of investigating an initialization failure report [1],
> >>> cxl_test was updated to register mock memory-devices after the mock
> >>> root-port/bus device had been registered. That led to cxl_test crashing
> >>> with a use-after-free bug with the following signature:
> >>>
> >>>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
> >>>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
> >>>     cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
> >>> 1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
> >>>     [..]
> >>>     cxld_unregister: cxl decoder14.0:
> >>>     cxl_region_decode_reset: cxl_region region3:
> >>>     mock_decoder_reset: cxl_port port3: decoder3.0 reset
> >>> 2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
> >>>     cxl_endpoint_decoder_release: cxl decoder14.0:
> >>>     [..]
> >>>     cxld_unregister: cxl decoder7.0:
> >>> 3)  cxl_region_decode_reset: cxl_region region3:
> >>>     Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
> >>>     [..]
> >>>     RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
> >>>     [..]
> >>>     Call Trace:
> >>>      <TASK>
> >>>      cxl_region_decode_reset+0x69/0x190 [cxl_core]
> >>>      cxl_region_detach+0xe8/0x210 [cxl_core]
> >>>      cxl_decoder_kill_region+0x27/0x40 [cxl_core]
> >>>      cxld_unregister+0x5d/0x60 [cxl_core]
> >>>
> >>> At 1) a region has been established with 2 endpoint decoders (7.0 and
> >>> 14.0). Those endpoints share a common switch-decoder in the topology
> >>> (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
> >>> the "out of order reset case" in the switch decoder. The effect though
> >>> is that region3 cleanup is aborted leaving it in-tact and
> >>> referencing decoder14.0. At 3) the second attempt to teardown region3
> >>> trips over the stale decoder14.0 object which has long since been
> >>> deleted.
> >>>
> >>> The fix here is to recognize that the CXL specification places no
> >>> mandate on in-order shutdown of switch-decoders, the driver enforces
> >>> in-order allocation, and hardware enforces in-order commit. So, rather
> >>> than fail and leave objects dangling, always remove them.
> >>>
> >>> In support of making cxl_region_decode_reset() always succeed,
> >>> cxl_region_invalidate_memregion() failures are turned into warnings.
> >>> Crashing the kernel is ok there since system integrity is at risk if
> >>> caches cannot be managed around physical address mutation events like
> >>> CXL region destruction.
> >>>
> >>> A new device_for_each_child_reverse_from() is added to cleanup
> >>> port->commit_end after all dependent decoders have been disabled. In
> >>> other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
> >>> port->commit_end only decrements from 2 after 2 has been disabled, and
> >>> it decrements all the way to zero since 1 was disabled previously.
> >>>
> >>> Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
> >>> Cc: <stable@vger.kernel.org>
> >>> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>> Cc: Davidlohr Bueso <dave@stgolabs.net>
> >>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> >>> Cc: Dave Jiang <dave.jiang@intel.com>
> >>> Cc: Alison Schofield <alison.schofield@intel.com>
> >>> Cc: Ira Weiny <ira.weiny@intel.com>
> >>> Cc: Zijun Hu <zijun_hu@icloud.com>
> >>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  drivers/base/core.c          |   35 +++++++++++++++++++++++++++++
> >>>  drivers/cxl/core/hdm.c       |   50 +++++++++++++++++++++++++++++++++++-------
> >>>  drivers/cxl/core/region.c    |   48 +++++++++++-----------------------------
> >>>  drivers/cxl/cxl.h            |    3 ++-
> >>>  include/linux/device.h       |    3 +++
> >>>  tools/testing/cxl/test/cxl.c |   14 ++++--------
> >>>  6 files changed, 100 insertions(+), 53 deletions(-)
> >>>
> >>> diff --git a/drivers/base/core.c b/drivers/base/core.c
> >>> index a4c853411a6b..e42f1ad73078 100644
> >>> --- a/drivers/base/core.c
> >>> +++ b/drivers/base/core.c
> >>> @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data,
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
> >>>  
> >>> +/**
> >>> + * device_for_each_child_reverse_from - device child iterator in reversed order.
> >>> + * @parent: parent struct device.
> >>> + * @from: optional starting point in child list
> >>> + * @fn: function to be called for each device.
> >>> + * @data: data for the callback.
> >>> + *
> >>> + * Iterate over @parent's child devices, starting at @from, and call @fn
> >>> + * for each, passing it @data. This helper is identical to
> >>> + * device_for_each_child_reverse() when @from is NULL.
> >>> + *
> >>> + * @fn is checked each iteration. If it returns anything other than 0,
> >>> + * iteration stop and that value is returned to the caller of
> >>> + * device_for_each_child_reverse_from();
> >>> + */
> >>> +int device_for_each_child_reverse_from(struct device *parent,
> >>> +				       struct device *from, const void *data,
> >>> +				       int (*fn)(struct device *, const void *))
> >>> +{
> >>> +	struct klist_iter i;
> >>> +	struct device *child;
> >>> +	int error = 0;
> >>> +
> >>> +	if (!parent->p)
> >>> +		return 0;
> >>> +
> >>> +	klist_iter_init_node(&parent->p->klist_children, &i,
> >>> +			     (from ? &from->p->knode_parent : NULL));
> >>> +	while ((child = prev_device(&i)) && !error)
> >>> +		error = fn(child, data);
> >>> +	klist_iter_exit(&i);
> >>> +	return error;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
> >>> +
> >>
> >> it does NOT deserve, also does NOT need to introduce a new core driver
> >> API device_for_each_child_reverse_from(). existing
> >> device_for_each_child_reverse() can do what the _from() wants to do.
> >>
> >> we can use similar approach as below link shown:
> >> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/
> > 
> > No, just have a simple starting point parameter. I understand that more
> > logic can be placed around device_for_each_child_reverse() to achieve
> > the same effect, but the core helpers should be removing logic from
> > consumers, not forcing them to add more.
> > 
> > If bloat is a concern, then after your const cleanups go through
> > device_for_each_child_reverse() can be rewritten in terms of
> > device_for_each_child_reverse_from() as (untested):
> > 
> 
> bloat is one aspect, the other aspect is that there are redundant
> between both driver core APIs, namely, there are a question:
> 
> why to still need device_for_each_child_reverse() if it is same as
> _from(..., NULL, ...) ?
> 

This same pattern (_reverse and _from_reverse) is present in list.h
and other iterators. Why would it be contentious here?

Reducing _reverse() to be a wrapper of _from_reverse is a nice way
of reducing the bloat/redundancy without having to update every
current user - this is a very common refactor pattern.

Refactoring without disrupting in-flight work is intrinsically valuable.

> > - *
> > - * Iterate over @parent's child devices, and call @fn for each,
> > - * passing it @data.
> > - *
> > - * We check the return of @fn each time. If it returns anything
> > - * other than 0, we break out and return that value.
> > - */
> > -int device_for_each_child_reverse(struct device *parent, void *data,
> > -				  int (*fn)(struct device *dev, void *data))
> > -{
> > -	struct klist_iter i;
> > -	struct device *child;
> > -	int error = 0;
> > -
> > -	if (!parent || !parent->p)
> > -		return 0;
> > -
> > -	klist_iter_init(&parent->p->klist_children, &i);
> > -	while ((child = prev_device(&i)) && !error)
> > -		error = fn(child, data);
> > -	klist_iter_exit(&i);
> > -	return error;
> > -}
> > -EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
> > -
> >  /**
> >   * device_for_each_child_reverse_from - device child iterator in reversed order.
> >   * @parent: parent struct device.
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 667cb6db9019..96a2c072bf5b 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T))
> >  
> >  int device_for_each_child(struct device *dev, void *data,
> >  			  int (*fn)(struct device *dev, void *data));
> > -int device_for_each_child_reverse(struct device *dev, void *data,
> > -				  int (*fn)(struct device *dev, void *data));
> >  int device_for_each_child_reverse_from(struct device *parent,
> >  				       struct device *from, const void *data,
> >  				       int (*fn)(struct device *, const void *));
> > +static inline int device_for_each_child_reverse(struct device *dev, const void *data,
> > +						int (*fn)(struct device *, const void *))
> > +{
> > +	return device_for_each_child_reverse_from(dev, NULL, data, fn);
> > +}
> >  struct device *device_find_child(struct device *dev, void *data,
> >  				 int (*match)(struct device *dev, void *data));
> >  struct device *device_find_child_by_name(struct device *parent,
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-12  6:30     ` Alejandro Lucero Palau
@ 2024-10-12 21:57       ` Dan Williams
  2024-10-14 15:13         ` Alejandro Lucero Palau
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-12 21:57 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Alejandro Lucero Palau wrote:
[..]
> > I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
> > device-readiness bug. Some other assumption is violated if that is
> > required.
> 
> 
> But that problem is not about device readiness but just how the device 
> model works. In this case the memdev creation is adding devices, no real 
> ones but those abstractions we use from the device model, and that 
> device creation is done asynchronously.

Device creation is not done asynchronously, the PCI driver is attaching
asynchrounously. When the PCI driver attaches it creates memdevs and
those are attached to cxl_mem synchronously.

> memdev, a Type2 driver in my case, is going to work with such a device 
> abstraction just after the memdev creation, it is not there yet.

Oh, is the concern that you always want to have the memdev attached to
cxl_mem immediately after it is registered?

I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
needed. However, to fix this situation once and for all I think I would
rather just drop all this modularity and move both cxl_port and cxl_mem
to be drivers internal to cxl_core.ko similar to the cxl_region driver.


> It is true that clx_mem_probe will interact with the real device, but
> the fact is such a function is not invoked by the device model
> synchronously, so the code using such a device abstraction needs to
> wait until it is there. With this flag the waiting is implicit to
> device creation.  Without that flag other "nasty dancing" with delays
> and checks needs to be done as the code in v3 did.

It is invoked synchronously *if* the driver is loaded *and* the user has
not forced asynchronous attach on the command line in which case they
get to keep the pieces.

> > For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
> > assumption that an accelerator driver might want to wait until CXL is
> > initialized before the base accelerator proceeds. However, if
> > accelerator drivers behave the same as the cxl_pci driver and are ok
> > with asynchronus arrival of CXL functionality then no deferral is
> > needed.
> 
> 
> I think deferring the accel driver makes sense. In the sfc driver case, 
> it could work without CXL and then change to use it once the CXL kernel 
> support is fully initialised, but I guess other accel drivers will rely 
> on CXL with no other option, and even with the sfc driver, supporting 
> such a change will make the code far more complex.

Makes sense.

> > Otherwise, the only motivation for synchronous probing I can think of
> > would be to have more predictable naming of kernel objects. So yes, I
> > would be curious to understand what scenarios probe deferral is still
> > needed.
> 
> OK. I will keep that patch with the last change in the v4. Let's discuss 
> this further with that patch as a reference.

EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just
comment that the driver does not have a PCIe-only operation mode and
requires that CXL initializes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11 23:40       ` Zijun Hu
  2024-10-12 17:56         ` Gregory Price
@ 2024-10-12 22:16         ` Dan Williams
  2024-10-14  1:29           ` Zijun Hu
  1 sibling, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-12 22:16 UTC (permalink / raw)
  To: Zijun Hu, Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

Zijun Hu wrote:
[..]
> >> it does NOT deserve, also does NOT need to introduce a new core driver
> >> API device_for_each_child_reverse_from(). existing
> >> device_for_each_child_reverse() can do what the _from() wants to do.
> >>
> >> we can use similar approach as below link shown:
> >> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/
> > 
> > No, just have a simple starting point parameter. I understand that more
> > logic can be placed around device_for_each_child_reverse() to achieve
> > the same effect, but the core helpers should be removing logic from
> > consumers, not forcing them to add more.
> > 
> > If bloat is a concern, then after your const cleanups go through
> > device_for_each_child_reverse() can be rewritten in terms of
> > device_for_each_child_reverse_from() as (untested):
> > 
> 
> bloat is one aspect, the other aspect is that there are redundant
> between both driver core APIs, namely, there are a question:
> 
> why to still need device_for_each_child_reverse() if it is same as
> _from(..., NULL, ...) ?

To allow access to the new functionality without burdening existing
callers. With device_for_each_child_reverse() re-written to internally
use device_for_each_child_reverse_from() Linux gets more functionality
with no disruption to existing users and no bloat. This is typical API
evolution.

> So i suggest use existing API now.

No, I am failing to understand your concern.

> if there are more users who have such starting point requirement, then
> add the parameter into device_for_each_child_reverse() which is
> consistent with other existing *_for_each_*() core APIs such as
> (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may
> need much efforts.

There are ~370 existing device_for_each* callers. Let's not thrash them.

Introduce new superset calls with the additional parameter and then
rewrite the old routines to just have a simple wrapper that passes a
NULL @start parameter.

Now, if Greg has the appetite to go touch all ~370 existing callers, so
be it, but introducing a superset-iterator-helper and a compat-wrapper
for legacy is the path I would take.

> could you please contains your proposal "fixing this allocation
> order validation" of below link into this patch series with below
> reason? and Cc me (^^)
> 
> https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.com.notmuch/
> 
> A)
>   the proposal depends on this patch series.
> B)
>   one of the issues the proposal fix is match_free_decoder()  error
> logic which is also relevant issues this patch series fix, the proposal
> also can fix the other device_find_child()'s match() issue i care about.
> 
> C)
>  Actually, it is a bit difficult for me to understand the proposal since
>  i don't have any basic knowledge about CXL. (^^)

If I understand your question correctly you are asking how does
device_for_each_child_reverse_from() get used in
cxl_region_find_decoder() to enforce in-order allocation?

The recommendation is the following:

-- 8< --
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3478d2058303..32cde18ff31b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
 	return rc;
 }
 
+static int check_commit_order(struct device *dev, const void *data)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+
+	/*
+	 * if port->commit_end is not the only free decoder, then out of
+	 * order shutdown has occurred, block further allocations until
+	 * that is resolved
+	 */
+	if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0))
+		return -EBUSY;
+	return 0;
+}
+
 static int match_free_decoder(struct device *dev, void *data)
 {
+	struct cxl_port *port = to_cxl_port(dev->parent);
 	struct cxl_decoder *cxld;
-	int *id = data;
+	int rc;
 
 	if (!is_switch_decoder(dev))
 		return 0;
 
 	cxld = to_cxl_decoder(dev);
 
-	/* enforce ordered allocation */
-	if (cxld->id != *id)
+	if (cxld->id != port->commit_end + 1)
 		return 0;
 
-	if (!cxld->region)
-		return 1;
-
-	(*id)++;
+	if (cxld->region) {
+		dev_dbg(dev->parent,
+			"next decoder to commit is already reserved\n",
+			dev_name(dev));
+		return 0;
+	}
 
-	return 0;
+	rc = device_for_each_child_reverse_from(dev->parent, dev, NULL,
+						check_commit_order);
+	if (rc) {
+		dev_dbg(dev->parent,
+			"unable to allocate %s due to out of order shutdown\n",
+			dev_name(dev));
+		return 0;
+	}
+	return 1;
 }
 
 static int match_auto_decoder(struct device *dev, void *data)
@@ -824,7 +848,6 @@ cxl_region_find_decoder(struct cxl_port *port,
 			struct cxl_region *cxlr)
 {
 	struct device *dev;
-	int id = 0;
 
 	if (port == cxled_to_port(cxled))
 		return &cxled->cxld;
@@ -833,7 +856,7 @@ cxl_region_find_decoder(struct cxl_port *port,
 		dev = device_find_child(&port->dev, &cxlr->params,
 					match_auto_decoder);
 	else
-		dev = device_find_child(&port->dev, &id, match_free_decoder);
+		dev = device_find_child(&port->dev, NULL, match_free_decoder);
 	if (!dev)
 		return NULL;
 	/*

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-12 22:16         ` Dan Williams
@ 2024-10-14  1:29           ` Zijun Hu
  2024-10-14 19:32             ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Zijun Hu @ 2024-10-14  1:29 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

On 2024/10/13 06:16, Dan Williams wrote:
> Zijun Hu wrote:
> [..]
>>>> it does NOT deserve, also does NOT need to introduce a new core driver
>>>> API device_for_each_child_reverse_from(). existing
>>>> device_for_each_child_reverse() can do what the _from() wants to do.
>>>>
>>>> we can use similar approach as below link shown:
>>>> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/
>>>
>>> No, just have a simple starting point parameter. I understand that more
>>> logic can be placed around device_for_each_child_reverse() to achieve
>>> the same effect, but the core helpers should be removing logic from
>>> consumers, not forcing them to add more.
>>>
>>> If bloat is a concern, then after your const cleanups go through
>>> device_for_each_child_reverse() can be rewritten in terms of
>>> device_for_each_child_reverse_from() as (untested):
>>>
>>
>> bloat is one aspect, the other aspect is that there are redundant
>> between both driver core APIs, namely, there are a question:
>>
>> why to still need device_for_each_child_reverse() if it is same as
>> _from(..., NULL, ...) ?
> 
> To allow access to the new functionality without burdening existing
> callers. With device_for_each_child_reverse() re-written to internally
> use device_for_each_child_reverse_from() Linux gets more functionality
> with no disruption to existing users and no bloat. This is typical API
> evolution.
> 
>> So i suggest use existing API now.
> 
> No, I am failing to understand your concern.
> 
>> if there are more users who have such starting point requirement, then
>> add the parameter into device_for_each_child_reverse() which is
>> consistent with other existing *_for_each_*() core APIs such as
>> (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may
>> need much efforts.
> 
> There are ~370 existing device_for_each* callers. Let's not thrash them.
> 
> Introduce new superset calls with the additional parameter and then
> rewrite the old routines to just have a simple wrapper that passes a
> NULL @start parameter.
> 
> Now, if Greg has the appetite to go touch all ~370 existing callers, so
> be it, but introducing a superset-iterator-helper and a compat-wrapper
> for legacy is the path I would take.
> 

current kernel tree ONLY has 15 usages of
device_for_each_child_reverse(), i would like to
add an extra parameter @start as existing
(class|driver)_for_each_device() and bus_for_each_(dev|drv)() do
if it is required.

>> could you please contains your proposal "fixing this allocation
>> order validation" of below link into this patch series with below
>> reason? and Cc me (^^)
>>
>> https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.com.notmuch/
>>
>> A)
>>   the proposal depends on this patch series.
>> B)
>>   one of the issues the proposal fix is match_free_decoder()  error
>> logic which is also relevant issues this patch series fix, the proposal
>> also can fix the other device_find_child()'s match() issue i care about.
>>
>> C)
>>  Actually, it is a bit difficult for me to understand the proposal since
>>  i don't have any basic knowledge about CXL. (^^)
> 
> If I understand your question correctly you are asking how does
> device_for_each_child_reverse_from() get used in
> cxl_region_find_decoder() to enforce in-order allocation?
> 

yes. your recommendation may help me understand it.

> The recommendation is the following:
> 
> -- 8< --
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3478d2058303..32cde18ff31b 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
>  	return rc;
>  }
>  
> +static int check_commit_order(struct device *dev, const void *data)
> +{
> +	struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +
> +	/*
> +	 * if port->commit_end is not the only free decoder, then out of
> +	 * order shutdown has occurred, block further allocations until
> +	 * that is resolved
> +	 */
> +	if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0))
> +		return -EBUSY;
> +	return 0;
> +}
> +
>  static int match_free_decoder(struct device *dev, void *data)
>  {
> +	struct cxl_port *port = to_cxl_port(dev->parent);
>  	struct cxl_decoder *cxld;
> -	int *id = data;
> +	int rc;
>  
>  	if (!is_switch_decoder(dev))
>  		return 0;
>  
>  	cxld = to_cxl_decoder(dev);
>  
> -	/* enforce ordered allocation */
> -	if (cxld->id != *id)
> +	if (cxld->id != port->commit_end + 1)
>  		return 0;
>  

for a port, is it possible that there are multiple CXLDs with same IDs?

> -	if (!cxld->region)
> -		return 1;
> -
> -	(*id)++;
> +	if (cxld->region) {
> +		dev_dbg(dev->parent,
> +			"next decoder to commit is already reserved\n",
> +			dev_name(dev));
> +		return 0;
> +	}
>  
> -	return 0;
> +	rc = device_for_each_child_reverse_from(dev->parent, dev, NULL,
> +						check_commit_order);
> +	if (rc) {
> +		dev_dbg(dev->parent,
> +			"unable to allocate %s due to out of order shutdown\n",
> +			dev_name(dev));
> +		return 0;
> +	}
> +	return 1;
>  }
>  
>  static int match_auto_decoder(struct device *dev, void *data)
> @@ -824,7 +848,6 @@ cxl_region_find_decoder(struct cxl_port *port,
>  			struct cxl_region *cxlr)
>  {
>  	struct device *dev;
> -	int id = 0;
>  
>  	if (port == cxled_to_port(cxled))
>  		return &cxled->cxld;
> @@ -833,7 +856,7 @@ cxl_region_find_decoder(struct cxl_port *port,
>  		dev = device_find_child(&port->dev, &cxlr->params,
>  					match_auto_decoder);
>  	else
> -		dev = device_find_child(&port->dev, &id, match_free_decoder);
> +		dev = device_find_child(&port->dev, NULL, match_free_decoder);
>  	if (!dev)
>  		return NULL;
>  	/*


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in
  2024-10-11  5:34 ` [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in Dan Williams
@ 2024-10-14 11:33   ` Jonathan Cameron
  0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2024-10-14 11:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: dave.jiang, ira.weiny, Gregory Price, stable, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl

On Thu, 10 Oct 2024 22:34:02 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> When the CXL subsystem is built-in the module init order is determined
> by Makefile order. That order violates expectations. The expectation is
> that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins
> the race cxl_mem will find the enabled CXL root ports it needs and if
> cxl_acpi loses the race it will retrigger cxl_mem to attach via
> cxl_bus_rescan(). That only works if cxl_acpi can assume ports are
> enabled immediately upone cxl_acpi_probe() return. That in turn can only

upon

> happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears
> before the cxl_acpi object in the Makefile.
> 
> Fix up the order to prevent initialization failures, and make sure that
> cxl_port is built-in if cxl_acpi is also built-in.
> 
> As for what contributed to this not being found earlier, the CXL
> regression environment, cxl_test, builds all CXL functionality as a
> module to allow to symbol mocking and other dynamic reload tests.  As a
> result there is no regression coverage for the built-in case.

My testing is all modular too :( 

> 
> Reported-by: Gregory Price <gourry@gourry.net>
> Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
> Tested-by: Gregory Price <gourry@gourry.net>
> Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
> Cc: <stable@vger.kernel.org>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alison Schofield <alison.schofield@intel.com>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/Kconfig  |    1 +
>  drivers/cxl/Makefile |   12 ++++++------
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 29c192f20082..876469e23f7a 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -60,6 +60,7 @@ config CXL_ACPI
>  	default CXL_BUS
>  	select ACPI_TABLE_LIB
>  	select ACPI_HMAT
> +	select CXL_PORT
>  	help
>  	  Enable support for host managed device memory (HDM) resources
>  	  published by a platform's ACPI CXL memory layout description.  See
> diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> index db321f48ba52..374829359275 100644
> --- a/drivers/cxl/Makefile
> +++ b/drivers/cxl/Makefile
> @@ -1,13 +1,13 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-y += core/
> -obj-$(CONFIG_CXL_PCI) += cxl_pci.o
> -obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> +obj-$(CONFIG_CXL_PORT) += cxl_port.o
>  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o

Needs some comments on the ordering being required.
Otherwise some future 'cleanup' will reorder them again.
However relying on build order is nasty.

> +obj-$(CONFIG_CXL_PCI) += cxl_pci.o
>  obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
> -obj-$(CONFIG_CXL_PORT) += cxl_port.o
> +obj-$(CONFIG_CXL_MEM) += cxl_mem.o
>  
> -cxl_mem-y := mem.o
> -cxl_pci-y := pci.o
> +cxl_port-y := port.o
>  cxl_acpi-y := acpi.o
> +cxl_pci-y := pci.o
>  cxl_pmem-y := pmem.o security.o
> -cxl_port-y := port.o
> +cxl_mem-y := mem.o
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-12 21:57       ` Dan Williams
@ 2024-10-14 15:13         ` Alejandro Lucero Palau
  2024-10-14 22:24           ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-14 15:13 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl


On 10/12/24 22:57, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
> [..]
>>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
>>> device-readiness bug. Some other assumption is violated if that is
>>> required.
>>
>> But that problem is not about device readiness but just how the device
>> model works. In this case the memdev creation is adding devices, no real
>> ones but those abstractions we use from the device model, and that
>> device creation is done asynchronously.
> Device creation is not done asynchronously, the PCI driver is attaching
> asynchrounously. When the PCI driver attaches it creates memdevs and
> those are attached to cxl_mem synchronously.
>
>> memdev, a Type2 driver in my case, is going to work with such a device
>> abstraction just after the memdev creation, it is not there yet.
> Oh, is the concern that you always want to have the memdev attached to
> cxl_mem immediately after it is registered?
>
> I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
> needed. However, to fix this situation once and for all I think I would
> rather just drop all this modularity and move both cxl_port and cxl_mem
> to be drivers internal to cxl_core.ko similar to the cxl_region driver.


Oh, so the problem is the code is not ready because the functionality is 
in a module not loaded yet.

Then it makes sense that change. I'll do it if not already taken. I'll 
send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the 
previous loop with delays implemented in v3.

Thanks


>
>> It is true that clx_mem_probe will interact with the real device, but
>> the fact is such a function is not invoked by the device model
>> synchronously, so the code using such a device abstraction needs to
>> wait until it is there. With this flag the waiting is implicit to
>> device creation.  Without that flag other "nasty dancing" with delays
>> and checks needs to be done as the code in v3 did.
> It is invoked synchronously *if* the driver is loaded *and* the user has
> not forced asynchronous attach on the command line in which case they
> get to keep the pieces.
>
>>> For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
>>> assumption that an accelerator driver might want to wait until CXL is
>>> initialized before the base accelerator proceeds. However, if
>>> accelerator drivers behave the same as the cxl_pci driver and are ok
>>> with asynchronus arrival of CXL functionality then no deferral is
>>> needed.
>>
>> I think deferring the accel driver makes sense. In the sfc driver case,
>> it could work without CXL and then change to use it once the CXL kernel
>> support is fully initialised, but I guess other accel drivers will rely
>> on CXL with no other option, and even with the sfc driver, supporting
>> such a change will make the code far more complex.
> Makes sense.
>
>>> Otherwise, the only motivation for synchronous probing I can think of
>>> would be to have more predictable naming of kernel objects. So yes, I
>>> would be curious to understand what scenarios probe deferral is still
>>> needed.
>> OK. I will keep that patch with the last change in the v4. Let's discuss
>> this further with that patch as a reference.
> EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just
> comment that the driver does not have a PCIe-only operation mode and
> requires that CXL initializes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-14  1:29           ` Zijun Hu
@ 2024-10-14 19:32             ` Dan Williams
  2024-10-15  0:02               ` Zijun Hu
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-14 19:32 UTC (permalink / raw)
  To: Zijun Hu, Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

Zijun Hu wrote:
> On 2024/10/13 06:16, Dan Williams wrote:
> > Zijun Hu wrote:
> > [..]
> >>>> it does NOT deserve, also does NOT need to introduce a new core driver
> >>>> API device_for_each_child_reverse_from(). existing
> >>>> device_for_each_child_reverse() can do what the _from() wants to do.
> >>>>
> >>>> we can use similar approach as below link shown:
> >>>> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/
> >>>
> >>> No, just have a simple starting point parameter. I understand that more
> >>> logic can be placed around device_for_each_child_reverse() to achieve
> >>> the same effect, but the core helpers should be removing logic from
> >>> consumers, not forcing them to add more.
> >>>
> >>> If bloat is a concern, then after your const cleanups go through
> >>> device_for_each_child_reverse() can be rewritten in terms of
> >>> device_for_each_child_reverse_from() as (untested):
> >>>
> >>
> >> bloat is one aspect, the other aspect is that there are redundant
> >> between both driver core APIs, namely, there are a question:
> >>
> >> why to still need device_for_each_child_reverse() if it is same as
> >> _from(..., NULL, ...) ?
> > 
> > To allow access to the new functionality without burdening existing
> > callers. With device_for_each_child_reverse() re-written to internally
> > use device_for_each_child_reverse_from() Linux gets more functionality
> > with no disruption to existing users and no bloat. This is typical API
> > evolution.
> > 
> >> So i suggest use existing API now.
> > 
> > No, I am failing to understand your concern.
> > 
> >> if there are more users who have such starting point requirement, then
> >> add the parameter into device_for_each_child_reverse() which is
> >> consistent with other existing *_for_each_*() core APIs such as
> >> (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may
> >> need much efforts.
> > 
> > There are ~370 existing device_for_each* callers. Let's not thrash them.
> > 
> > Introduce new superset calls with the additional parameter and then
> > rewrite the old routines to just have a simple wrapper that passes a
> > NULL @start parameter.
> > 
> > Now, if Greg has the appetite to go touch all ~370 existing callers, so
> > be it, but introducing a superset-iterator-helper and a compat-wrapper
> > for legacy is the path I would take.
> > 
> 
> current kernel tree ONLY has 15 usages of
> device_for_each_child_reverse(), i would like to
> add an extra parameter @start as existing
> (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do
> if it is required.

A new parameter to a new wrapper symbol sounds fine to me. Otherwise,
please do not go thrash all the call sites to pass an unused NULL @start
parameter. Just accept that device_for_each_* did not follow the
{class,driver,bus}_for_each_* example and instead introduce a new symbol
to wrap the new functionality that so far only has the single CXL user.

[..]
> > If I understand your question correctly you are asking how does
> > device_for_each_child_reverse_from() get used in
> > cxl_region_find_decoder() to enforce in-order allocation?
> > 
> 
> yes. your recommendation may help me understand it.
> 
> > The recommendation is the following:
> > 
> > -- 8< --
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 3478d2058303..32cde18ff31b 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
> >  	return rc;
> >  }
> >  
> > +static int check_commit_order(struct device *dev, const void *data)
> > +{
> > +	struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +
> > +	/*
> > +	 * if port->commit_end is not the only free decoder, then out of
> > +	 * order shutdown has occurred, block further allocations until
> > +	 * that is resolved
> > +	 */
> > +	if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0))
> > +		return -EBUSY;
> > +	return 0;
> > +}
> > +
> >  static int match_free_decoder(struct device *dev, void *data)
> >  {
> > +	struct cxl_port *port = to_cxl_port(dev->parent);
> >  	struct cxl_decoder *cxld;
> > -	int *id = data;
> > +	int rc;
> >  
> >  	if (!is_switch_decoder(dev))
> >  		return 0;
> >  
> >  	cxld = to_cxl_decoder(dev);
> >  
> > -	/* enforce ordered allocation */
> > -	if (cxld->id != *id)
> > +	if (cxld->id != port->commit_end + 1)
> >  		return 0;
> >  
> 
> for a port, is it possible that there are multiple CXLDs with same IDs?

Not possible, that is also enforced by the driver core, all kernel
object names must be unique (at least if they have the same parent), and
the local subsystem convention is that all decoders are registered in
id-order.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-14 15:13         ` Alejandro Lucero Palau
@ 2024-10-14 22:24           ` Dan Williams
  2024-10-15  8:45             ` Alejandro Lucero Palau
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-14 22:24 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Alejandro Lucero Palau wrote:
> 
> On 10/12/24 22:57, Dan Williams wrote:
> > Alejandro Lucero Palau wrote:
> > [..]
> >>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
> >>> device-readiness bug. Some other assumption is violated if that is
> >>> required.
> >>
> >> But that problem is not about device readiness but just how the device
> >> model works. In this case the memdev creation is adding devices, no real
> >> ones but those abstractions we use from the device model, and that
> >> device creation is done asynchronously.
> > Device creation is not done asynchronously, the PCI driver is attaching
> > asynchrounously. When the PCI driver attaches it creates memdevs and
> > those are attached to cxl_mem synchronously.
> >
> >> memdev, a Type2 driver in my case, is going to work with such a device
> >> abstraction just after the memdev creation, it is not there yet.
> > Oh, is the concern that you always want to have the memdev attached to
> > cxl_mem immediately after it is registered?
> >
> > I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
> > needed. However, to fix this situation once and for all I think I would
> > rather just drop all this modularity and move both cxl_port and cxl_mem
> > to be drivers internal to cxl_core.ko similar to the cxl_region driver.
> 
> 
> Oh, so the problem is the code is not ready because the functionality is 
> in a module not loaded yet.

Right.

> Then it makes sense that change. I'll do it if not already taken. I'll 
> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the 
> previous loop with delays implemented in v3.

So I think EPROBE_DEFER can stay out of the CXL core because it is up to
the accelerator driver to decide whether CXL availability is fatal to
init or not.

Additionally, I am less and less convinced that Type-2 drivers should be
forced to depend on the cxl_mem driver to attach vs just arranging for
those Type-2 drivers to call devm_cxl_enumerate_ports() and
devm_cxl_add_endpoint() directly. In other words I am starting to worry
that the generic cxl_mem driver design pattern is a midlayer mistake.

So, if it makes it easier for sfc, I would be open to exploring a direct
scheme for endpoint attachment, and not requiring the generic cxl_mem
driver as an intermediary.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-14 19:32             ` Dan Williams
@ 2024-10-15  0:02               ` Zijun Hu
  2024-10-15  0:10                 ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Zijun Hu @ 2024-10-15  0:02 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

On 2024/10/15 03:32, Dan Williams wrote:
> Zijun Hu wrote:
>> On 2024/10/13 06:16, Dan Williams wrote:
>>> Zijun Hu wrote:
>>> [..]
>>>>>> it does NOT deserve, also does NOT need to introduce a new core driver
>>>>>> API device_for_each_child_reverse_from(). existing
>>>>>> device_for_each_child_reverse() can do what the _from() wants to do.
>>>>>>

[snip]

>>> Introduce new superset calls with the additional parameter and then
>>> rewrite the old routines to just have a simple wrapper that passes a
>>> NULL @start parameter.
>>>
>>> Now, if Greg has the appetite to go touch all ~370 existing callers, so
>>> be it, but introducing a superset-iterator-helper and a compat-wrapper
>>> for legacy is the path I would take.
>>>
>>
>> current kernel tree ONLY has 15 usages of
>> device_for_each_child_reverse(), i would like to
>> add an extra parameter @start as existing
>> (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do
>> if it is required.
> 
> A new parameter to a new wrapper symbol sounds fine to me. Otherwise,
> please do not go thrash all the call sites to pass an unused NULL @start
> parameter. Just accept that device_for_each_* did not follow the
> {class,driver,bus}_for_each_* example and instead introduce a new symbol
> to wrap the new functionality that so far only has the single CXL user.
> 

you maybe regard my idea as a alternative proposal if Greg dislike
introducing a new core driver API. (^^)

> [..]
>>> If I understand your question correctly you are asking how does
>>> device_for_each_child_reverse_from() get used in
>>> cxl_region_find_decoder() to enforce in-order allocation?
>>>
>>
>> yes. your recommendation may help me understand it.
>>

below simple solution should have same effect as your recommendation.
also have below optimizations:

1) it don't need new core API.
2) it is more efficient since it has minimal iterating.

i will submit it if you like it. (^^)

index e701e4b04032..37da42329ff3 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -796,8 +796,9 @@ static size_t show_targetN(struct cxl_region *cxlr,
char *buf, int pos)

 static int match_free_decoder(struct device *dev, void *data)
 {
+       struct cxl_port *port = to_cxl_port(dev->parent);
        struct cxl_decoder *cxld;
-       int *id = data;
+       struct device **target_dev = data;

        if (!is_switch_decoder(dev))
                return 0;
@@ -805,15 +806,19 @@ static int match_free_decoder(struct device *dev,
void *data)
        cxld = to_cxl_decoder(dev);

        /* enforce ordered allocation */
-       if (cxld->id != *id)
-               return 0;
-
-       if (!cxld->region)
-               return 1;
-
-       (*id)++;
-
-       return 0;
+       if (cxld->id == port->commit_end + 1) {
+               if (!cxld->region) {
+                       *target_dev = dev;
+                       return 1;
+               } else {
+                       dev_dbg(dev->parent,
+                               "next decoder to commit is already
reserved\n",
+                               dev_name(dev));
+                       return -ENODEV;
+               }
+       } else {
+               return cxld->flags & CXL_DECODER_F_ENABLE ? 0 : -EBUSY;
+       }
 }

 static int match_auto_decoder(struct device *dev, void *data)
@@ -839,7 +844,7 @@ cxl_region_find_decoder(struct cxl_port *port,
                        struct cxl_endpoint_decoder *cxled,
                        struct cxl_region *cxlr)
 {
-       struct device *dev;
+       struct device *dev = NULL;
        int id = 0;

        if (port == cxled_to_port(cxled))
@@ -848,8 +853,8 @@ cxl_region_find_decoder(struct cxl_port *port,
        if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
                dev = device_find_child(&port->dev, &cxlr->params,
                                        match_auto_decoder);
-       else
-               dev = device_find_child(&port->dev, &id,
match_free_decoder);
+       else if (device_for_each_child(&port->dev, &dev,
match_free_decoder) > 0)
+               get_device(dev);
        if (!dev)
                return NULL;
        /*


>>> The recommendation is the following:
>>>
>>> -- 8< --
>>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>>> index 3478d2058303..32cde18ff31b 100644
>>> --- a/drivers/cxl/core/region.c
>>> +++ b/drivers/cxl/core/region.c
>>> @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
>>>  	return rc;
>>>  }
>>>  
>>> +static int check_commit_order(struct device *dev, const void *data)
>>> +{
>>> +	struct cxl_decoder *cxld = to_cxl_decoder(dev);
>>> +
>>> +	/*
>>> +	 * if port->commit_end is not the only free decoder, then out of
>>> +	 * order shutdown has occurred, block further allocations until
>>> +	 * that is resolved
>>> +	 */
>>> +	if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0))
>>> +		return -EBUSY;
>>> +	return 0;
>>> +}
>>> +
>>>  static int match_free_decoder(struct device *dev, void *data)
>>>  {
>>> +	struct cxl_port *port = to_cxl_port(dev->parent);
>>>  	struct cxl_decoder *cxld;
>>> -	int *id = data;
>>> +	int rc;
>>>  
>>>  	if (!is_switch_decoder(dev))
>>>  		return 0;
>>>  
>>>  	cxld = to_cxl_decoder(dev);
>>>  
>>> -	/* enforce ordered allocation */
>>> -	if (cxld->id != *id)
>>> +	if (cxld->id != port->commit_end + 1)
>>>  		return 0;
>>>  
>>
>> for a port, is it possible that there are multiple CXLDs with same IDs?
> 
> Not possible, that is also enforced by the driver core, all kernel
> object names must be unique (at least if they have the same parent), and
> the local subsystem convention is that all decoders are registered in
> id-order.


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-15  0:02               ` Zijun Hu
@ 2024-10-15  0:10                 ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2024-10-15  0:10 UTC (permalink / raw)
  To: Zijun Hu, Dan Williams, dave.jiang, ira.weiny
  Cc: stable, Greg Kroah-Hartman, Davidlohr Bueso, Jonathan Cameron,
	Alison Schofield, vishal.l.verma, linux-cxl

Zijun Hu wrote:
> On 2024/10/15 03:32, Dan Williams wrote:
> > Zijun Hu wrote:
> >> On 2024/10/13 06:16, Dan Williams wrote:
> >>> Zijun Hu wrote:
> >>> [..]
> >>>>>> it does NOT deserve, also does NOT need to introduce a new core driver
> >>>>>> API device_for_each_child_reverse_from(). existing
> >>>>>> device_for_each_child_reverse() can do what the _from() wants to do.
> >>>>>>
> 
> [snip]
> 
> >>> Introduce new superset calls with the additional parameter and then
> >>> rewrite the old routines to just have a simple wrapper that passes a
> >>> NULL @start parameter.
> >>>
> >>> Now, if Greg has the appetite to go touch all ~370 existing callers, so
> >>> be it, but introducing a superset-iterator-helper and a compat-wrapper
> >>> for legacy is the path I would take.
> >>>
> >>
> >> current kernel tree ONLY has 15 usages of
> >> device_for_each_child_reverse(), i would like to
> >> add an extra parameter @start as existing
> >> (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do
> >> if it is required.
> > 
> > A new parameter to a new wrapper symbol sounds fine to me. Otherwise,
> > please do not go thrash all the call sites to pass an unused NULL @start
> > parameter. Just accept that device_for_each_* did not follow the
> > {class,driver,bus}_for_each_* example and instead introduce a new symbol
> > to wrap the new functionality that so far only has the single CXL user.
> > 
> 
> you maybe regard my idea as a alternative proposal if Greg dislike
> introducing a new core driver API. (^^)

If the proposal is to add an unwanted parameter to existing call sites
then yes, I would NAK that.

> > [..]
> >>> If I understand your question correctly you are asking how does
> >>> device_for_each_child_reverse_from() get used in
> >>> cxl_region_find_decoder() to enforce in-order allocation?
> >>>
> >>
> >> yes. your recommendation may help me understand it.
> >>
> 
> below simple solution should have same effect as your recommendation.
> also have below optimizations:
> 
> 1) it don't need new core API.
> 2) it is more efficient since it has minimal iterating.
> 
> i will submit it if you like it. (^^)

See the patch I just submitted, it does not handle the case of competing
allocations. The cxld->region check is not sufficient for determining
that the decoder is committed.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-14 22:24           ` Dan Williams
@ 2024-10-15  8:45             ` Alejandro Lucero Palau
  2024-10-15 16:37               ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-15  8:45 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl


On 10/14/24 23:24, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> On 10/12/24 22:57, Dan Williams wrote:
>>> Alejandro Lucero Palau wrote:
>>> [..]
>>>>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
>>>>> device-readiness bug. Some other assumption is violated if that is
>>>>> required.
>>>> But that problem is not about device readiness but just how the device
>>>> model works. In this case the memdev creation is adding devices, no real
>>>> ones but those abstractions we use from the device model, and that
>>>> device creation is done asynchronously.
>>> Device creation is not done asynchronously, the PCI driver is attaching
>>> asynchrounously. When the PCI driver attaches it creates memdevs and
>>> those are attached to cxl_mem synchronously.
>>>
>>>> memdev, a Type2 driver in my case, is going to work with such a device
>>>> abstraction just after the memdev creation, it is not there yet.
>>> Oh, is the concern that you always want to have the memdev attached to
>>> cxl_mem immediately after it is registered?
>>>
>>> I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
>>> needed. However, to fix this situation once and for all I think I would
>>> rather just drop all this modularity and move both cxl_port and cxl_mem
>>> to be drivers internal to cxl_core.ko similar to the cxl_region driver.
>>
>> Oh, so the problem is the code is not ready because the functionality is
>> in a module not loaded yet.
> Right.
>
>> Then it makes sense that change. I'll do it if not already taken. I'll
>> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the
>> previous loop with delays implemented in v3.
> So I think EPROBE_DEFER can stay out of the CXL core because it is up to
> the accelerator driver to decide whether CXL availability is fatal to
> init or not.


It needs support from the cxl core though. If the cxl root is not there 
yet, the driver needs to know, and that is what you did in your original 
patch and I'm keeping as well.


> Additionally, I am less and less convinced that Type-2 drivers should be
> forced to depend on the cxl_mem driver to attach vs just arranging for
> those Type-2 drivers to call devm_cxl_enumerate_ports() and
> devm_cxl_add_endpoint() directly. In other words I am starting to worry
> that the generic cxl_mem driver design pattern is a midlayer mistake.


You know better than me but in my view, a Type2 should follow what a 
Type3 does with some small changes for dealing with the differences, 
mainly the way it is going to be used and those optional capabilities 
for Type2. This makes more sense to me for Type1.

> So, if it makes it easier for sfc, I would be open to exploring a direct
> scheme for endpoint attachment, and not requiring the generic cxl_mem
> driver as an intermediary.


V4 is ready (just having problems when testing with 6.12-rcX) so I would 
like to keep it and explore this once we have something working and 
accepted. Type2 and Type1 with CXL cache will bring new challenges and I 
bet we will need refactoring in the code and potentially new design for 
generic code (Type3 and Type2, Type2 and Type1).



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-15  8:45             ` Alejandro Lucero Palau
@ 2024-10-15 16:37               ` Dan Williams
  2024-10-16 14:41                 ` Alejandro Lucero Palau
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2024-10-15 16:37 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Alejandro Lucero Palau wrote:
[..]
> >> Then it makes sense that change. I'll do it if not already taken. I'll
> >> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the
> >> previous loop with delays implemented in v3.
> > So I think EPROBE_DEFER can stay out of the CXL core because it is up to
> > the accelerator driver to decide whether CXL availability is fatal to
> > init or not.
> 
> 
> It needs support from the cxl core though. If the cxl root is not there 
> yet, the driver needs to know, and that is what you did in your original 
> patch and I'm keeping as well.

So there are two ways, check if a registered @memdev has
@memdev->dev.driver set, assuming you know that the cxl_mem driver is
available, or call devm_cxl_enumerate_ports() yourself and skip the
cxl_mem driver indirection.

Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally
had, is an even more indirect way to convey a similar result and is
starting to feel a bit "mid-layer-y".

> > Additionally, I am less and less convinced that Type-2 drivers should be
> > forced to depend on the cxl_mem driver to attach vs just arranging for
> > those Type-2 drivers to call devm_cxl_enumerate_ports() and
> > devm_cxl_add_endpoint() directly. In other words I am starting to worry
> > that the generic cxl_mem driver design pattern is a midlayer mistake.
> 
> You know better than me but in my view, a Type2 should follow what a 
> Type3 does with some small changes for dealing with the differences, 
> mainly the way it is going to be used and those optional capabilities 
> for Type2. This makes more sense to me for Type1.

If an accelerator driver *wants* to depend on cxl_mem, it can, but all
the cxl_core functions that cxl_mem uses are also exported for
accelerator drivers to use directly to avoid needing to guess about when
/ if cxl_mem is going to attach.

> > So, if it makes it easier for sfc, I would be open to exploring a direct
> > scheme for endpoint attachment, and not requiring the generic cxl_mem
> > driver as an intermediary.
> 
> V4 is ready (just having problems when testing with 6.12-rcX) so I would 
> like to keep it and explore this once we have something working and 
> accepted. Type2 and Type1 with CXL cache will bring new challenges and I 
> bet we will need refactoring in the code and potentially new design for 
> generic code (Type3 and Type2, Type2 and Type1).

Yeah, no need to switch horses mid-race if you already have a cxl_mem
dependent approach nearly ready, but a potential simplification to
explore going forward.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-11  5:34 ` [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown Dan Williams
  2024-10-11 11:50   ` Zijun Hu
@ 2024-10-15 16:47   ` Jonathan Cameron
  2024-10-23  0:31     ` Dan Williams
  1 sibling, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2024-10-15 16:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: dave.jiang, ira.weiny, stable, Greg Kroah-Hartman,
	Davidlohr Bueso, Alison Schofield, Zijun Hu, vishal.l.verma,
	linux-cxl

On Thu, 10 Oct 2024 22:34:26 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> In support of investigating an initialization failure report [1],
> cxl_test was updated to register mock memory-devices after the mock
> root-port/bus device had been registered. That led to cxl_test crashing
> with a use-after-free bug with the following signature:
> 
>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
>     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
>     cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
> 1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
>     [..]
>     cxld_unregister: cxl decoder14.0:
>     cxl_region_decode_reset: cxl_region region3:
>     mock_decoder_reset: cxl_port port3: decoder3.0 reset
> 2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
>     cxl_endpoint_decoder_release: cxl decoder14.0:
>     [..]
>     cxld_unregister: cxl decoder7.0:
> 3)  cxl_region_decode_reset: cxl_region region3:
>     Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
>     [..]
>     RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
>     [..]
>     Call Trace:
>      <TASK>
>      cxl_region_decode_reset+0x69/0x190 [cxl_core]
>      cxl_region_detach+0xe8/0x210 [cxl_core]
>      cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>      cxld_unregister+0x5d/0x60 [cxl_core]
> 
> At 1) a region has been established with 2 endpoint decoders (7.0 and
> 14.0). Those endpoints share a common switch-decoder in the topology
> (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
> the "out of order reset case" in the switch decoder. The effect though
> is that region3 cleanup is aborted leaving it in-tact and
> referencing decoder14.0. At 3) the second attempt to teardown region3
> trips over the stale decoder14.0 object which has long since been
> deleted.
> 
> The fix here is to recognize that the CXL specification places no
> mandate on in-order shutdown of switch-decoders, the driver enforces
> in-order allocation, and hardware enforces in-order commit. So, rather
> than fail and leave objects dangling, always remove them.
> 
> In support of making cxl_region_decode_reset() always succeed,
> cxl_region_invalidate_memregion() failures are turned into warnings.
> Crashing the kernel is ok there since system integrity is at risk if
> caches cannot be managed around physical address mutation events like
> CXL region destruction.

I'm fine with this, but seems like it is worth breaking out as a precursor
where we can discuss merits of that change separate from the complexity
of the rest.

I don't mind that strongly though so if you keep this intact,
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Trivial passing comment inline.



> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3df10517a327..223c273c0cd1 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
>  	return 0;
>  }
>  
> -static int cxl_decoder_reset(struct cxl_decoder *cxld)
> +static int commit_reap(struct device *dev, const void *data)
> +{
> +	struct cxl_port *port = to_cxl_port(dev->parent);
> +	struct cxl_decoder *cxld;
> +
> +	if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxld = to_cxl_decoder(dev);
> +	if (port->commit_end == cxld->id &&
> +	    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
I'd have gone with !(cxld->flags & CXL_DECODER_F_ENABLE) but
this is consistent with exiting form, so fine as is.

> +		port->commit_end--;
> +		dev_dbg(&port->dev, "reap: %s commit_end: %d\n",
> +			dev_name(&cxld->dev), port->commit_end);
> +	}
> +
> +	return 0;
> +}


> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 90d5afd52dd0..c5bbd89b3192 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld)
>  	return 0;
>  }
>  
> -static int mock_decoder_reset(struct cxl_decoder *cxld)
> +static void mock_decoder_reset(struct cxl_decoder *cxld)
>  {
>  	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
>  	int id = cxld->id;
>  
>  	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
> -		return 0;
> +		return;
>  
>  	dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev));
> -	if (port->commit_end != id) {
> +	if (port->commit_end == id)
> +		cxl_port_commit_reap(cxld);
> +	else
>  		dev_dbg(&port->dev,
>  			"%s: out of order reset, expected decoder%d.%d\n",
>  			dev_name(&cxld->dev), port->id, port->commit_end);
> -		return -EBUSY;
> -	}
> -
> -	port->commit_end--;
>  	cxld->flags &= ~CXL_DECODER_F_ENABLE;
> -
> -	return 0;
>  }
>  
>  static void default_mock_decoder(struct cxl_decoder *cxld)
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-15 16:37               ` Dan Williams
@ 2024-10-16 14:41                 ` Alejandro Lucero Palau
  2024-10-23  0:46                   ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-16 14:41 UTC (permalink / raw)
  To: Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl


On 10/15/24 17:37, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
> [..]
>>>> Then it makes sense that change. I'll do it if not already taken. I'll
>>>> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the
>>>> previous loop with delays implemented in v3.
>>> So I think EPROBE_DEFER can stay out of the CXL core because it is up to
>>> the accelerator driver to decide whether CXL availability is fatal to
>>> init or not.
>>
>> It needs support from the cxl core though. If the cxl root is not there
>> yet, the driver needs to know, and that is what you did in your original
>> patch and I'm keeping as well.
> So there are two ways, check if a registered @memdev has
> @memdev->dev.driver set, assuming you know that the cxl_mem driver is
> available, or call devm_cxl_enumerate_ports() yourself and skip the
> cxl_mem driver indirection.
>
> Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally
> had, is an even more indirect way to convey a similar result and is
> starting to feel a bit "mid-layer-y".
>

I was a bit confused with this answer until I read again the patch 
commit from your original work.


The confusion came from my assumption about if the root device is not 
there, it is due to the hardware root initialization requiring more 
time. But I realize now you specifically said "the root driver has not 
attached yet" what turns it into this problem of kernel modules not 
loaded yet.


If so, I think I can solve this within the type2 driver code and 
kconfig. Kconfig will force the driver being compiled as a module if the 
cxl core is a module, and with MODULE_SOFTDEP("pre: cxl_core cxl_port 
cxl_acpi cxl-mem) the type2 driver modprobe will trigger loading of 
dependencies beforehand. This makes unnecessary EPROBE_DEFER.


What do you think?


>>> Additionally, I am less and less convinced that Type-2 drivers should be
>>> forced to depend on the cxl_mem driver to attach vs just arranging for
>>> those Type-2 drivers to call devm_cxl_enumerate_ports() and
>>> devm_cxl_add_endpoint() directly. In other words I am starting to worry
>>> that the generic cxl_mem driver design pattern is a midlayer mistake.
>> You know better than me but in my view, a Type2 should follow what a
>> Type3 does with some small changes for dealing with the differences,
>> mainly the way it is going to be used and those optional capabilities
>> for Type2. This makes more sense to me for Type1.
> If an accelerator driver *wants* to depend on cxl_mem, it can, but all
> the cxl_core functions that cxl_mem uses are also exported for
> accelerator drivers to use directly to avoid needing to guess about when
> / if cxl_mem is going to attach.
>
>>> So, if it makes it easier for sfc, I would be open to exploring a direct
>>> scheme for endpoint attachment, and not requiring the generic cxl_mem
>>> driver as an intermediary.
>> V4 is ready (just having problems when testing with 6.12-rcX) so I would
>> like to keep it and explore this once we have something working and
>> accepted. Type2 and Type1 with CXL cache will bring new challenges and I
>> bet we will need refactoring in the code and potentially new design for
>> generic code (Type3 and Type2, Type2 and Type1).
> Yeah, no need to switch horses mid-race if you already have a cxl_mem
> dependent approach nearly ready, but a potential simplification to
> explore going forward.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  2024-10-15 16:47   ` Jonathan Cameron
@ 2024-10-23  0:31     ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2024-10-23  0:31 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: dave.jiang, ira.weiny, stable, Greg Kroah-Hartman,
	Davidlohr Bueso, Alison Schofield, Zijun Hu, vishal.l.verma,
	linux-cxl

Jonathan Cameron wrote:
> On Thu, 10 Oct 2024 22:34:26 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > In support of investigating an initialization failure report [1],
> > cxl_test was updated to register mock memory-devices after the mock
> > root-port/bus device had been registered. That led to cxl_test crashing
> > with a use-after-free bug with the following signature:
> > 
> >     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
> >     cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
> >     cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
> > 1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
> >     [..]
> >     cxld_unregister: cxl decoder14.0:
> >     cxl_region_decode_reset: cxl_region region3:
> >     mock_decoder_reset: cxl_port port3: decoder3.0 reset
> > 2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
> >     cxl_endpoint_decoder_release: cxl decoder14.0:
> >     [..]
> >     cxld_unregister: cxl decoder7.0:
> > 3)  cxl_region_decode_reset: cxl_region region3:
> >     Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
> >     [..]
> >     RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
> >     [..]
> >     Call Trace:
> >      <TASK>
> >      cxl_region_decode_reset+0x69/0x190 [cxl_core]
> >      cxl_region_detach+0xe8/0x210 [cxl_core]
> >      cxl_decoder_kill_region+0x27/0x40 [cxl_core]
> >      cxld_unregister+0x5d/0x60 [cxl_core]
> > 
> > At 1) a region has been established with 2 endpoint decoders (7.0 and
> > 14.0). Those endpoints share a common switch-decoder in the topology
> > (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
> > the "out of order reset case" in the switch decoder. The effect though
> > is that region3 cleanup is aborted leaving it in-tact and
> > referencing decoder14.0. At 3) the second attempt to teardown region3
> > trips over the stale decoder14.0 object which has long since been
> > deleted.
> > 
> > The fix here is to recognize that the CXL specification places no
> > mandate on in-order shutdown of switch-decoders, the driver enforces
> > in-order allocation, and hardware enforces in-order commit. So, rather
> > than fail and leave objects dangling, always remove them.
> > 
> > In support of making cxl_region_decode_reset() always succeed,
> > cxl_region_invalidate_memregion() failures are turned into warnings.
> > Crashing the kernel is ok there since system integrity is at risk if
> > caches cannot be managed around physical address mutation events like
> > CXL region destruction.
> 
> I'm fine with this, but seems like it is worth breaking out as a precursor
> where we can discuss merits of that change separate from the complexity
> of the rest.
> 
> I don't mind that strongly though so if you keep this intact,

If there are merits to discuss, let's discuss them in this patch (in v2)
because if cxl_region_invalidate_memregion() failures are not suitable
to be warnings then that invalidates this patch.

> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Trivial passing comment inline.
> 
> 
> 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 3df10517a327..223c273c0cd1 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
> >  	return 0;
> >  }
> >  
> > -static int cxl_decoder_reset(struct cxl_decoder *cxld)
> > +static int commit_reap(struct device *dev, const void *data)
> > +{
> > +	struct cxl_port *port = to_cxl_port(dev->parent);
> > +	struct cxl_decoder *cxld;
> > +
> > +	if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev))
> > +		return 0;
> > +
> > +	cxld = to_cxl_decoder(dev);
> > +	if (port->commit_end == cxld->id &&
> > +	    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
> I'd have gone with !(cxld->flags & CXL_DECODER_F_ENABLE) but
> this is consistent with exiting form, so fine as is.

I have long had an aversion to negation operators for the small
speedbump to left-to-right readability as evidenced by all the other
"(cxld->flags & CXL_DECODER_F_ENABLE) == 0" in drivers/cxl/.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] cxl: Initialization and shutdown fixes
  2024-10-16 14:41                 ` Alejandro Lucero Palau
@ 2024-10-23  0:46                   ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2024-10-23  0:46 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, dave.jiang, ira.weiny
  Cc: Davidlohr Bueso, Jonathan Cameron, stable, Zijun Hu,
	Greg Kroah-Hartman, Alison Schofield, Gregory Price, Vishal Verma,
	linux-cxl

Alejandro Lucero Palau wrote:
[..]
> > Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally
> > had, is an even more indirect way to convey a similar result and is
> > starting to feel a bit "mid-layer-y".
> >
> 
> I was a bit confused with this answer until I read again the patch 
> commit from your original work.
> 
> The confusion came from my assumption about if the root device is not 
> there, it is due to the hardware root initialization requiring more 
> time. But I realize now you specifically said "the root driver has not 
> attached yet" what turns it into this problem of kernel modules not 
> loaded yet.
> 
> If so, I think I can solve this within the type2 driver code and 
> kconfig. Kconfig will force the driver being compiled as a module...

There should be no requirement that accelerator drivers must be built as
modules. An accelerator driver simply cannot enforce, via module load
order, that CXL root infrastructure is up and ready before the
accelerator 'probe' routine runs. This is because enumeration order still
dominiates and enumeration order is effectively random*.

The accelerator driver only has 2 options, return EPROBE_DEFER until all
resource dependencies are ready, or do what cxl_pci + cxl_mem do. What
cxl_pci + cxl_mem do is, cxl_pci_probe() registers a memdev and then at
some point later cxl_mem notices that the root infrastructure has
arrived via the cxl_bus_rescan() event.

Note that these patches are about fixing the assumptions of
cxl_bus_rescan(), not about ensuring init order.

* ...at least nothing should break if CXL root and CXL endpoint
  enumeration happens out of order.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-10-23  0:46 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-11  5:33 [PATCH 0/5] cxl: Initialization and shutdown fixes Dan Williams
2024-10-11  5:34 ` [PATCH 1/5] cxl/port: Fix CXL port initialization order when the subsystem is built-in Dan Williams
2024-10-14 11:33   ` Jonathan Cameron
2024-10-11  5:34 ` [PATCH 4/5] cxl/port: Fix use-after-free, permit out-of-order decoder shutdown Dan Williams
2024-10-11 11:50   ` Zijun Hu
2024-10-11 17:46     ` Dan Williams
2024-10-11 23:40       ` Zijun Hu
2024-10-12 17:56         ` Gregory Price
2024-10-12 22:16         ` Dan Williams
2024-10-14  1:29           ` Zijun Hu
2024-10-14 19:32             ` Dan Williams
2024-10-15  0:02               ` Zijun Hu
2024-10-15  0:10                 ` Dan Williams
2024-10-15 16:47   ` Jonathan Cameron
2024-10-23  0:31     ` Dan Williams
2024-10-11 11:21 ` [PATCH 0/5] cxl: Initialization and shutdown fixes Alejandro Lucero Palau
2024-10-11 17:38   ` Dan Williams
2024-10-12  6:30     ` Alejandro Lucero Palau
2024-10-12 21:57       ` Dan Williams
2024-10-14 15:13         ` Alejandro Lucero Palau
2024-10-14 22:24           ` Dan Williams
2024-10-15  8:45             ` Alejandro Lucero Palau
2024-10-15 16:37               ` Dan Williams
2024-10-16 14:41                 ` Alejandro Lucero Palau
2024-10-23  0:46                   ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox