Linux CXL
 help / color / mirror / Atom feed
* [PATCH 0/4] cxl: DPA partition metadata is a mess...
@ 2025-01-17  6:10 Dan Williams
  2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17  6:10 UTC (permalink / raw)
  To: linux-cxl; +Cc: Ira Weiny, Dave Jiang, Alejandro Lucero, dave.jiang

As noted in patch3, the pending efforts to add CXL Accelerator (type-2)
device [1], and Dynamic Capacity (DCD) support [2], tripped on the
no-longer-fit-for-purpose design in the CXL subsystem for tracking
device-physical-address (DPA) metadata.

In fact there was no design at all, just a couple of open-coded 'struct
resource' instances for 'ram' and 'pmem' and a pile of explicit code
referencing those resources directly.

See patch3 for more details on the specific problems this caused, and
patch4 for the eyesore reduction of making the DPA allocation algorithm
partition number agnostic.

The motivation for this effort is to make it easier to land the Type-2
and DCD series.

Next on the cleanup list is 'enum cxl_decoder_mode', which has little
reason to exist once partition info is centralized. That cleanup is
left as an
exercise for the DCD series.

This series passes a cxl-test run at every patch.
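For illustration only (a standalone userspace sketch, not the kernel code;
all names here are hypothetical): the difference between the current
open-coded per-partition fields and the array-based layout this series
moves toward is that consumers can loop rather than naming 'ram' and
'pmem' explicitly:

```c
#include <assert.h>

/*
 * Sketch: 'before' hardcodes one field per partition, so every consumer
 * must reference ->ram_res and ->pmem_res by name; 'after' lets
 * consumers stay partition-number agnostic.
 */
struct res { unsigned long long start, end; };

#define NR_PARTITIONS_MAX 2

struct dev_state_before {
	struct res ram_res;	/* open-coded per-partition fields */
	struct res pmem_res;
};

struct dev_state_after {
	int nr_partitions;
	struct res part[NR_PARTITIONS_MAX];	/* loop over these */
};

/* a capacity query that works for any partition count */
static unsigned long long total_size(const struct dev_state_after *s)
{
	unsigned long long sum = 0;

	for (int i = 0; i < s->nr_partitions; i++)
		sum += s->part[i].end - s->part[i].start + 1;
	return sum;
}
```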

[1]: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com
[2]: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com

---

Dan Williams (4):
      cxl: Remove the CXL_DECODER_MIXED mistake
      cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
      cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
      cxl: Make cxl_dpa_alloc() DPA partition number agnostic


 drivers/cxl/core/cdat.c      |   63 ++++++-----
 drivers/cxl/core/hdm.c       |  244 +++++++++++++++++++++++++++++++++---------
 drivers/cxl/core/mbox.c      |   84 ++++++--------
 drivers/cxl/core/memdev.c    |   42 ++++---
 drivers/cxl/core/region.c    |   22 +---
 drivers/cxl/cxl.h            |    4 -
 drivers/cxl/cxlmem.h         |   94 ++++++++++++++--
 drivers/cxl/mem.c            |    2 
 drivers/cxl/pci.c            |    7 +
 tools/testing/cxl/test/cxl.c |   22 ++--
 tools/testing/cxl/test/mem.c |    7 +
 11 files changed, 396 insertions(+), 195 deletions(-)

base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake
  2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
@ 2025-01-17  6:10 ` Dan Williams
  2025-01-17 10:03   ` Jonathan Cameron
                     ` (2 more replies)
  2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17  6:10 UTC (permalink / raw)
  To: linux-cxl; +Cc: dave.jiang

CXL_DECODER_MIXED is a safety mechanism introduced for the case where
platform firmware has programmed an endpoint decoder that straddles a
DPA partition boundary. While the kernel is careful to only allocate DPA
capacity within a single partition there is no guarantee that platform
firmware, or anything that touched the device before the current kernel,
gets that right.

However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
designation because of the way it tracks partition boundaries. A
request_resource() that spans ->ram_res and ->pmem_res fails with the
following signature:

    __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation

CXL_DECODER_MIXED is dead defensive programming after the driver has
already given up on the device. It has never offered any protection in
practice, so just delete it.
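To see why the mixed designation is unreachable, consider a standalone
sketch (hypothetical names, not the kernel implementation) of the
containment check that __cxl_dpa_reserve() applies: a DPA range that
straddles the ram/pmem boundary is contained by neither partition, and in
the kernel the earlier request_resource() against the partition resource
would already have failed for such a range.

```c
#include <assert.h>
#include <stdbool.h>

/* simplified stand-in for the kernel's struct resource */
struct res { unsigned long long start, end; };

/* mimics resource_contains(): parent fully covers child */
static bool res_contains(const struct res *parent, const struct res *child)
{
	return parent->start <= child->start && parent->end >= child->end;
}
```

A range spanning the two partitions fails both checks, so neither the
PMEM nor the RAM branch is taken, and the allocation would have been
rejected before any "mixed" state could be recorded.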

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/hdm.c    |    8 ++++----
 drivers/cxl/core/region.c |   12 ------------
 drivers/cxl/cxl.h         |    4 +---
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 28edd5822486..be8556119d94 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -329,12 +329,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 
 	if (resource_contains(&cxlds->pmem_res, res))
 		cxled->mode = CXL_DECODER_PMEM;
 	else if (resource_contains(&cxlds->ram_res, res))
 		cxled->mode = CXL_DECODER_RAM;
 	else {
-		dev_warn(dev, "decoder%d.%d: %pr mixed mode not supported\n",
-			 port->id, cxled->cxld.id, cxled->dpa_res);
-		cxled->mode = CXL_DECODER_MIXED;
+		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
+			 port->id, cxled->cxld.id, res);
+		cxled->mode = CXL_DECODER_NONE;
 	}
 
 	port->hdm_end++;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index d77899650798..e4885acac853 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2725,18 +2725,6 @@ static int poison_by_decoder(struct device *dev, void *arg)
 	if (!cxled->dpa_res || !resource_size(cxled->dpa_res))
 		return rc;
 
-	/*
-	 * Regions are only created with single mode decoders: pmem or ram.
-	 * Linux does not support mixed mode decoders. This means that
-	 * reading poison per endpoint decoder adheres to the requirement
-	 * that poison reads of pmem and ram must be separated.
-	 * CXL 3.0 Spec 8.2.9.8.4.1
-	 */
-	if (cxled->mode == CXL_DECODER_MIXED) {
-		dev_dbg(dev, "poison list read unsupported in mixed mode\n");
-		return rc;
-	}
-
 	cxlmd = cxled_to_memdev(cxled);
 	if (cxled->skip) {
 		offset = cxled->dpa_res->start - cxled->skip;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f6015f24ad38..0fb8d70fa3e5 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -379,7 +379,6 @@ enum cxl_decoder_mode {
 	CXL_DECODER_NONE,
 	CXL_DECODER_RAM,
 	CXL_DECODER_PMEM,
-	CXL_DECODER_MIXED,
 	CXL_DECODER_DEAD,
 };
 
@@ -389,10 +388,9 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 		[CXL_DECODER_NONE] = "none",
 		[CXL_DECODER_RAM] = "ram",
 		[CXL_DECODER_PMEM] = "pmem",
-		[CXL_DECODER_MIXED] = "mixed",
 	};
 
-	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
+	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_PMEM)
 		return names[mode];
 	return "mixed";
 }


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
  2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
  2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
@ 2025-01-17  6:10 ` Dan Williams
  2025-01-17 10:20   ` Jonathan Cameron
  2025-01-17 13:33   ` Alejandro Lucero Palau
  2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
  2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
  3 siblings, 2 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17  6:10 UTC (permalink / raw)
  To: linux-cxl; +Cc: Dave Jiang, Alejandro Lucero, Ira Weiny, dave.jiang

In preparation for consolidating all DPA partition information into an
array of DPA metadata, introduce helpers that hide the layout of the
current data. I.e. make the eventual replacement of ->ram_res,
->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a
no-op for code paths that consume that information, and reduce the noise
of follow-on patches.

The end goal is to consolidate all DPA information in 'struct
cxl_dev_state', but for now the helpers just make it appear that all DPA
metadata is relative to @cxlds.

Note that a follow-on patch also cleans up the temporary placeholders of
@ram_res, and @pmem_res in the qos_class manipulation code,
cxl_dpa_alloc(), and cxl_mem_create_range_info().
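The accessor pattern can be sketched in isolation (a hypothetical,
simplified userspace model, not the kernel headers): callers go through
to_ram_res() instead of touching the underscored field directly, so the
backing storage can later become an array entry without touching any
caller.

```c
#include <assert.h>
#include <stddef.h>

struct res { unsigned long long start, end; };

struct dev_state {
	struct res _ram_res;	/* leading underscore: do not use directly */
	struct res _pmem_res;
};

/* the only sanctioned way to reach the ram partition metadata */
static struct res *to_ram_res(struct dev_state *s)
{
	return &s->_ram_res;
}

/* size helper that tolerates an absent partition, like cxl_ram_size() */
static unsigned long long res_size(const struct res *r)
{
	if (!r)
		return 0;
	return r->end - r->start + 1;
}
```

When the field later moves into a partition array, only to_ram_res()
changes; every res_size(to_ram_res(...)) call site is untouched.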

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alucerop@amd.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/cdat.c      |   70 +++++++++++++++++++++++++-----------------
 drivers/cxl/core/hdm.c       |   26 ++++++++--------
 drivers/cxl/core/mbox.c      |   18 ++++++-----
 drivers/cxl/core/memdev.c    |   42 +++++++++++++------------
 drivers/cxl/core/region.c    |   10 ++++--
 drivers/cxl/cxlmem.h         |   58 ++++++++++++++++++++++++++++++-----
 drivers/cxl/mem.c            |    2 +
 tools/testing/cxl/test/cxl.c |   25 ++++++++-------
 8 files changed, 159 insertions(+), 92 deletions(-)

diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index 8153f8d83a16..b177a488e29b 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -258,29 +258,33 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
 static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
 				     struct xarray *dsmas_xa)
 {
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
 	struct device *dev = cxlds->dev;
-	struct range pmem_range = {
-		.start = cxlds->pmem_res.start,
-		.end = cxlds->pmem_res.end,
-	};
-	struct range ram_range = {
-		.start = cxlds->ram_res.start,
-		.end = cxlds->ram_res.end,
-	};
 	struct dsmas_entry *dent;
 	unsigned long index;
+	const struct resource *partition[] = {
+		to_ram_res(cxlds),
+		to_pmem_res(cxlds),
+	};
+	struct cxl_dpa_perf *perf[] = {
+		to_ram_perf(cxlds),
+		to_pmem_perf(cxlds),
+	};
 
 	xa_for_each(dsmas_xa, index, dent) {
-		if (resource_size(&cxlds->ram_res) &&
-		    range_contains(&ram_range, &dent->dpa_range))
-			update_perf_entry(dev, dent, &mds->ram_perf);
-		else if (resource_size(&cxlds->pmem_res) &&
-			 range_contains(&pmem_range, &dent->dpa_range))
-			update_perf_entry(dev, dent, &mds->pmem_perf);
-		else
-			dev_dbg(dev, "no partition for dsmas dpa: %pra\n",
-				&dent->dpa_range);
+		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
+			const struct resource *res = partition[i];
+			struct range range = {
+				.start = res->start,
+				.end = res->end,
+			};
+
+			if (range_contains(&range, &dent->dpa_range))
+				update_perf_entry(dev, dent, perf[i]);
+			else
+				dev_dbg(dev,
+					"no partition for dsmas dpa: %pra\n",
+					&dent->dpa_range);
+		}
 	}
 }
 
@@ -304,6 +308,9 @@ static int match_cxlrd_qos_class(struct device *dev, void *data)
 
 static void reset_dpa_perf(struct cxl_dpa_perf *dpa_perf)
 {
+	if (!dpa_perf)
+		return;
+
 	*dpa_perf = (struct cxl_dpa_perf) {
 		.qos_class = CXL_QOS_CLASS_INVALID,
 	};
@@ -312,6 +319,9 @@ static void reset_dpa_perf(struct cxl_dpa_perf *dpa_perf)
 static bool cxl_qos_match(struct cxl_port *root_port,
 			  struct cxl_dpa_perf *dpa_perf)
 {
+	if (!dpa_perf)
+		return false;
+
 	if (dpa_perf->qos_class == CXL_QOS_CLASS_INVALID)
 		return false;
 
@@ -346,7 +356,8 @@ static int match_cxlrd_hb(struct device *dev, void *data)
 static int cxl_qos_class_verify(struct cxl_memdev *cxlmd)
 {
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
+	struct cxl_dpa_perf *ram_perf = to_ram_perf(cxlds),
+			    *pmem_perf = to_pmem_perf(cxlds);
 	struct cxl_port *root_port;
 	int rc;
 
@@ -359,17 +370,17 @@ static int cxl_qos_class_verify(struct cxl_memdev *cxlmd)
 	root_port = &cxl_root->port;
 
 	/* Check that the QTG IDs are all sane between end device and root decoders */
-	if (!cxl_qos_match(root_port, &mds->ram_perf))
-		reset_dpa_perf(&mds->ram_perf);
-	if (!cxl_qos_match(root_port, &mds->pmem_perf))
-		reset_dpa_perf(&mds->pmem_perf);
+	if (!cxl_qos_match(root_port, ram_perf))
+		reset_dpa_perf(ram_perf);
+	if (!cxl_qos_match(root_port, pmem_perf))
+		reset_dpa_perf(pmem_perf);
 
 	/* Check to make sure that the device's host bridge is under a root decoder */
 	rc = device_for_each_child(&root_port->dev,
 				   cxlmd->endpoint->host_bridge, match_cxlrd_hb);
 	if (!rc) {
-		reset_dpa_perf(&mds->ram_perf);
-		reset_dpa_perf(&mds->pmem_perf);
+		reset_dpa_perf(ram_perf);
+		reset_dpa_perf(pmem_perf);
 	}
 
 	return rc;
@@ -567,6 +578,9 @@ static bool dpa_perf_contains(struct cxl_dpa_perf *perf,
 		.end = dpa_res->end,
 	};
 
+	if (!perf)
+		return false;
+
 	return range_contains(&perf->dpa_range, &dpa);
 }
 
@@ -574,15 +588,15 @@ static struct cxl_dpa_perf *cxled_get_dpa_perf(struct cxl_endpoint_decoder *cxle
 					       enum cxl_decoder_mode mode)
 {
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct cxl_dpa_perf *perf;
 
 	switch (mode) {
 	case CXL_DECODER_RAM:
-		perf = &mds->ram_perf;
+		perf = to_ram_perf(cxlds);
 		break;
 	case CXL_DECODER_PMEM:
-		perf = &mds->pmem_perf;
+		perf = to_pmem_perf(cxlds);
 		break;
 	default:
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index be8556119d94..7a85522294ad 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -327,9 +327,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
-	if (resource_contains(&cxlds->pmem_res, res))
+	if (resource_contains(to_pmem_res(cxlds), res))
 		cxled->mode = CXL_DECODER_PMEM;
-	else if (resource_contains(&cxlds->ram_res, res))
+	else if (resource_contains(to_ram_res(cxlds), res))
 		cxled->mode = CXL_DECODER_RAM;
 	else {
 		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
@@ -442,11 +442,11 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 	 * Only allow modes that are supported by the current partition
 	 * configuration
 	 */
-	if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
+	if (mode == CXL_DECODER_PMEM && !cxl_pmem_size(cxlds)) {
 		dev_dbg(dev, "no available pmem capacity\n");
 		return -ENXIO;
 	}
-	if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
+	if (mode == CXL_DECODER_RAM && !cxl_ram_size(cxlds)) {
 		dev_dbg(dev, "no available ram capacity\n");
 		return -ENXIO;
 	}
@@ -464,6 +464,8 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 	struct device *dev = &cxled->cxld.dev;
 	resource_size_t start, avail, skip;
 	struct resource *p, *last;
+	const struct resource *ram_res = to_ram_res(cxlds);
+	const struct resource *pmem_res = to_pmem_res(cxlds);
 	int rc;
 
 	down_write(&cxl_dpa_rwsem);
@@ -480,37 +482,37 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 		goto out;
 	}
 
-	for (p = cxlds->ram_res.child, last = NULL; p; p = p->sibling)
+	for (p = ram_res->child, last = NULL; p; p = p->sibling)
 		last = p;
 	if (last)
 		free_ram_start = last->end + 1;
 	else
-		free_ram_start = cxlds->ram_res.start;
+		free_ram_start = ram_res->start;
 
-	for (p = cxlds->pmem_res.child, last = NULL; p; p = p->sibling)
+	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
 		last = p;
 	if (last)
 		free_pmem_start = last->end + 1;
 	else
-		free_pmem_start = cxlds->pmem_res.start;
+		free_pmem_start = pmem_res->start;
 
 	if (cxled->mode == CXL_DECODER_RAM) {
 		start = free_ram_start;
-		avail = cxlds->ram_res.end - start + 1;
+		avail = ram_res->end - start + 1;
 		skip = 0;
 	} else if (cxled->mode == CXL_DECODER_PMEM) {
 		resource_size_t skip_start, skip_end;
 
 		start = free_pmem_start;
-		avail = cxlds->pmem_res.end - start + 1;
+		avail = pmem_res->end - start + 1;
 		skip_start = free_ram_start;
 
 		/*
 		 * If some pmem is already allocated, then that allocation
 		 * already handled the skip.
 		 */
-		if (cxlds->pmem_res.child &&
-		    skip_start == cxlds->pmem_res.child->start)
+		if (pmem_res->child &&
+		    skip_start == pmem_res->child->start)
 			skip_end = skip_start - 1;
 		else
 			skip_end = start - 1;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 548564c770c0..3502f1633ad2 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1270,24 +1270,26 @@ static int add_dpa_res(struct device *dev, struct resource *parent,
 int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
+	struct resource *ram_res = to_ram_res(cxlds);
+	struct resource *pmem_res = to_pmem_res(cxlds);
 	struct device *dev = cxlds->dev;
 	int rc;
 
 	if (!cxlds->media_ready) {
 		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
-		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
-		cxlds->pmem_res = DEFINE_RES_MEM(0, 0);
+		*ram_res = DEFINE_RES_MEM(0, 0);
+		*pmem_res = DEFINE_RES_MEM(0, 0);
 		return 0;
 	}
 
 	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
 
 	if (mds->partition_align_bytes == 0) {
-		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
+		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
 				 mds->volatile_only_bytes, "ram");
 		if (rc)
 			return rc;
-		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
+		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
 				   mds->volatile_only_bytes,
 				   mds->persistent_only_bytes, "pmem");
 	}
@@ -1298,11 +1300,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 		return rc;
 	}
 
-	rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
+	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
 			 mds->active_volatile_bytes, "ram");
 	if (rc)
 		return rc;
-	return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
+	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
 			   mds->active_volatile_bytes,
 			   mds->active_persistent_bytes, "pmem");
 }
@@ -1450,8 +1452,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mds->cxlds.reg_map.host = dev;
 	mds->cxlds.reg_map.resource = CXL_RESOURCE_NONE;
 	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
-	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
-	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
+	to_ram_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
+	to_pmem_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
 
 	return mds;
 }
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index ae3dfcbe8938..c5f8320ed330 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -80,7 +80,7 @@ static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	unsigned long long len = resource_size(&cxlds->ram_res);
+	unsigned long long len = resource_size(to_ram_res(cxlds));
 
 	return sysfs_emit(buf, "%#llx\n", len);
 }
@@ -93,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	unsigned long long len = resource_size(&cxlds->pmem_res);
+	unsigned long long len = cxl_pmem_size(cxlds);
 
 	return sysfs_emit(buf, "%#llx\n", len);
 }
@@ -198,16 +198,20 @@ static int cxl_get_poison_by_memdev(struct cxl_memdev *cxlmd)
 	int rc = 0;
 
 	/* CXL 3.0 Spec 8.2.9.8.4.1 Separate pmem and ram poison requests */
-	if (resource_size(&cxlds->pmem_res)) {
-		offset = cxlds->pmem_res.start;
-		length = resource_size(&cxlds->pmem_res);
+	if (cxl_pmem_size(cxlds)) {
+		const struct resource *res = to_pmem_res(cxlds);
+
+		offset = res->start;
+		length = resource_size(res);
 		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
 		if (rc)
 			return rc;
 	}
-	if (resource_size(&cxlds->ram_res)) {
-		offset = cxlds->ram_res.start;
-		length = resource_size(&cxlds->ram_res);
+	if (cxl_ram_size(cxlds)) {
+		const struct resource *res = to_ram_res(cxlds);
+
+		offset = res->start;
+		length = resource_size(res);
 		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
 		/*
 		 * Invalid Physical Address is not an error for
@@ -409,9 +413,8 @@ static ssize_t pmem_qos_class_show(struct device *dev,
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
 
-	return sysfs_emit(buf, "%d\n", mds->pmem_perf.qos_class);
+	return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
 }
 
 static struct device_attribute dev_attr_pmem_qos_class =
@@ -428,9 +431,8 @@ static ssize_t ram_qos_class_show(struct device *dev,
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
 
-	return sysfs_emit(buf, "%d\n", mds->ram_perf.qos_class);
+	return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
 }
 
 static struct device_attribute dev_attr_ram_qos_class =
@@ -466,11 +468,11 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
 {
 	struct device *dev = kobj_to_dev(kobj);
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
 
-	if (a == &dev_attr_ram_qos_class.attr)
-		if (mds->ram_perf.qos_class == CXL_QOS_CLASS_INVALID)
-			return 0;
+	if (a == &dev_attr_ram_qos_class.attr &&
+	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
+		return 0;
 
 	return a->mode;
 }
@@ -485,11 +487,11 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
 {
 	struct device *dev = kobj_to_dev(kobj);
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
 
-	if (a == &dev_attr_pmem_qos_class.attr)
-		if (mds->pmem_perf.qos_class == CXL_QOS_CLASS_INVALID)
-			return 0;
+	if (a == &dev_attr_pmem_qos_class.attr &&
+	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
+		return 0;
 
 	return a->mode;
 }
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e4885acac853..9f0f6fdbc841 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2688,7 +2688,7 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
 
 	if (ctx->mode == CXL_DECODER_RAM) {
 		offset = ctx->offset;
-		length = resource_size(&cxlds->ram_res) - offset;
+		length = cxl_ram_size(cxlds) - offset;
 		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
 		if (rc == -EFAULT)
 			rc = 0;
@@ -2700,9 +2700,11 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
 		length = resource_size(&cxlds->dpa_res) - offset;
 		if (!length)
 			return 0;
-	} else if (resource_size(&cxlds->pmem_res)) {
-		offset = cxlds->pmem_res.start;
-		length = resource_size(&cxlds->pmem_res);
+	} else if (cxl_pmem_size(cxlds)) {
+		const struct resource *res = to_pmem_res(cxlds);
+
+		offset = res->start;
+		length = resource_size(res);
 	} else {
 		return 0;
 	}
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2a25d1957ddb..78e92e24d7b5 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -423,8 +423,8 @@ struct cxl_dpa_perf {
  * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
  * @media_ready: Indicate whether the device media is usable
  * @dpa_res: Overall DPA resource tree for the device
- * @pmem_res: Active Persistent memory capacity configuration
- * @ram_res: Active Volatile memory capacity configuration
+ * @_pmem_res: Active Persistent memory capacity configuration
+ * @_ram_res: Active Volatile memory capacity configuration
  * @serial: PCIe Device Serial Number
  * @type: Generic Memory Class device or Vendor Specific Memory device
  * @cxl_mbox: CXL mailbox context
@@ -438,13 +438,41 @@ struct cxl_dev_state {
 	bool rcd;
 	bool media_ready;
 	struct resource dpa_res;
-	struct resource pmem_res;
-	struct resource ram_res;
+	struct resource _pmem_res;
+	struct resource _ram_res;
 	u64 serial;
 	enum cxl_devtype type;
 	struct cxl_mailbox cxl_mbox;
 };
 
+static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
+{
+	return &cxlds->_ram_res;
+}
+
+static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
+{
+	return &cxlds->_pmem_res;
+}
+
+static inline resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
+{
+	const struct resource *res = to_ram_res(cxlds);
+
+	if (!res)
+		return 0;
+	return resource_size(res);
+}
+
+static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
+{
+	const struct resource *res = to_pmem_res(cxlds);
+
+	if (!res)
+		return 0;
+	return resource_size(res);
+}
+
 static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
 {
 	return dev_get_drvdata(cxl_mbox->host);
@@ -471,8 +499,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @active_persistent_bytes: sum of hard + soft persistent
  * @next_volatile_bytes: volatile capacity change pending device reset
  * @next_persistent_bytes: persistent capacity change pending device reset
- * @ram_perf: performance data entry matched to RAM partition
- * @pmem_perf: performance data entry matched to PMEM partition
+ * @_ram_perf: performance data entry matched to RAM partition
+ * @_pmem_perf: performance data entry matched to PMEM partition
  * @event: event log driver state
  * @poison: poison driver state info
  * @security: security driver state info
@@ -496,8 +524,8 @@ struct cxl_memdev_state {
 	u64 next_volatile_bytes;
 	u64 next_persistent_bytes;
 
-	struct cxl_dpa_perf ram_perf;
-	struct cxl_dpa_perf pmem_perf;
+	struct cxl_dpa_perf _ram_perf;
+	struct cxl_dpa_perf _pmem_perf;
 
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
@@ -505,6 +533,20 @@ struct cxl_memdev_state {
 	struct cxl_fw_state fw;
 };
 
+static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
+{
+	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
+
+	return &mds->_ram_perf;
+}
+
+static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
+{
+	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
+
+	return &mds->_pmem_perf;
+}
+
 static inline struct cxl_memdev_state *
 to_cxl_memdev_state(struct cxl_dev_state *cxlds)
 {
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 2f03a4d5606e..9675243bd05b 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -152,7 +152,7 @@ static int cxl_mem_probe(struct device *dev)
 		return -ENXIO;
 	}
 
-	if (resource_size(&cxlds->pmem_res) && IS_ENABLED(CONFIG_CXL_PMEM)) {
+	if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
 		rc = devm_cxl_add_nvdimm(parent_port, cxlmd);
 		if (rc) {
 			if (rc == -ENODEV)
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index d0337c11f9ee..7f1c5061307b 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -1000,25 +1000,28 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
 		find_cxl_root(port);
 	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
 	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
-	struct range pmem_range = {
-		.start = cxlds->pmem_res.start,
-		.end = cxlds->pmem_res.end,
+	const struct resource *partition[] = {
+		to_ram_res(cxlds),
+		to_pmem_res(cxlds),
 	};
-	struct range ram_range = {
-		.start = cxlds->ram_res.start,
-		.end = cxlds->ram_res.end,
+	struct cxl_dpa_perf *perf[] = {
+		to_ram_perf(cxlds),
+		to_pmem_perf(cxlds),
 	};
 
 	if (!cxl_root)
 		return;
 
-	if (range_len(&ram_range))
-		dpa_perf_setup(port, &ram_range, &mds->ram_perf);
+	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
+		const struct resource *res = partition[i];
+		struct range range = {
+			.start = res->start,
+			.end = res->end,
+		};
 
-	if (range_len(&pmem_range))
-		dpa_perf_setup(port, &pmem_range, &mds->pmem_perf);
+		dpa_perf_setup(port, &range, perf[i]);
+	}
 
 	cxl_memdev_update_perf(cxlmd);
 


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
  2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
  2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
@ 2025-01-17  6:10 ` Dan Williams
  2025-01-17 10:52   ` Jonathan Cameron
                     ` (3 more replies)
  2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
  3 siblings, 4 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17  6:10 UTC (permalink / raw)
  To: linux-cxl; +Cc: Dave Jiang, Alejandro Lucero, Ira Weiny, dave.jiang

The pending efforts to add CXL Accelerator (type-2) device [1], and
Dynamic Capacity (DCD) support [2], tripped on the
no-longer-fit-for-purpose design in the CXL subsystem for tracking
device-physical-address (DPA) metadata. Trip hazards include:

- CXL Memory Devices need to consider a PMEM partition, but Accelerator
  devices with CXL.mem likely do not in the common case.

- CXL Memory Devices enumerate DPA through Memory Device mailbox
  commands like Partition Info; Accelerator devices do not.

- CXL Memory Devices that support DCD support more than 2 partitions.
  Some of the driver algorithms are awkward to expand to > 2 partition
  cases.

- DPA performance data is a general capability that can be shared with
  accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
  suitable.

- 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
  memory property; it should be phased out in favor of a partition id,
  with the memory property coming from the partition info.

Towards cleaning up those issues and allowing a smoother landing for the
aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
way for Memory Devices and Accelerators to initialize the DPA information
in 'struct cxl_dev_state'.

For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
get the new data structure initialized, and cleanup some qos_class init.
Follow on patches will go further to use the new data structure to
cleanup algorithms that are better suited to loop over all possible
partitions.

cxl_dpa_setup() follows the locking expectations of mutating the device
DPA map, and is suitable for Accelerator drivers to use. Accelerators
likely only have one hardcoded 'ram' partition to convey to the
cxl_core.
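The intended shape of the setup path can be sketched as a standalone
userspace model (hypothetical, heavily simplified from the patch): a
setup helper publishes a partition array from caller-provided range info,
and consumers loop over nr_partitions instead of naming ram and pmem. An
accelerator would pass nr_partitions == 1 with a single 'ram' entry.

```c
#include <assert.h>
#include <string.h>

struct range { unsigned long long start, end; };

#define NR_PARTITIONS_MAX 2

/* shared init descriptor, analogous to 'struct cxl_dpa_info' */
struct dpa_info {
	unsigned long long size;
	int nr_partitions;
	struct range part_range[NR_PARTITIONS_MAX];
};

struct dev_state {
	int nr_partitions;
	struct range part[NR_PARTITIONS_MAX];
};

/* analogous to cxl_dpa_setup(): one-shot, refuses re-initialization */
static int dpa_setup(struct dev_state *s, const struct dpa_info *info)
{
	if (s->nr_partitions)
		return -1;	/* like -EBUSY: already initialized */
	if (!info->size || !info->nr_partitions)
		return 0;	/* nothing to publish */
	for (int i = 0; i < info->nr_partitions; i++)
		s->part[i] = info->part_range[i];
	s->nr_partitions = info->nr_partitions;
	return 0;
}
```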

Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alucerop@amd.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/cdat.c      |   15 ++-----
 drivers/cxl/core/hdm.c       |   69 ++++++++++++++++++++++++++++++++++
 drivers/cxl/core/mbox.c      |   86 ++++++++++++++++++------------------------
 drivers/cxl/cxlmem.h         |   79 +++++++++++++++++++++++++--------------
 drivers/cxl/pci.c            |    7 +++
 tools/testing/cxl/test/cxl.c |   15 ++-----
 tools/testing/cxl/test/mem.c |    7 +++
 7 files changed, 176 insertions(+), 102 deletions(-)

diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index b177a488e29b..5400a421ad30 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -261,25 +261,18 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
 	struct device *dev = cxlds->dev;
 	struct dsmas_entry *dent;
 	unsigned long index;
-	const struct resource *partition[] = {
-		to_ram_res(cxlds),
-		to_pmem_res(cxlds),
-	};
-	struct cxl_dpa_perf *perf[] = {
-		to_ram_perf(cxlds),
-		to_pmem_perf(cxlds),
-	};
 
 	xa_for_each(dsmas_xa, index, dent) {
-		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
-			const struct resource *res = partition[i];
+		for (int i = 0; i < cxlds->nr_partitions; i++) {
+			struct resource *res = &cxlds->part[i].res;
 			struct range range = {
 				.start = res->start,
 				.end = res->end,
 			};
 
 			if (range_contains(&range, &dent->dpa_range))
-				update_perf_entry(dev, dent, perf[i]);
+				update_perf_entry(dev, dent,
+						  &cxlds->part[i].perf);
 			else
 				dev_dbg(dev,
 					"no partition for dsmas dpa: %pra\n",
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 7a85522294ad..7e1559b3ed88 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -342,6 +342,75 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	return 0;
 }
 
+static int add_dpa_res(struct device *dev, struct resource *parent,
+		       struct resource *res, resource_size_t start,
+		       resource_size_t size, const char *type)
+{
+	int rc;
+
+	*res = (struct resource) {
+		.name = type,
+		.start = start,
+		.end =  start + size - 1,
+		.flags = IORESOURCE_MEM,
+	};
+	if (resource_size(res) == 0) {
+		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
+		return 0;
+	}
+	rc = request_resource(parent, res);
+	if (rc) {
+		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
+			res, rc);
+		return rc;
+	}
+
+	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
+
+	return 0;
+}
+
+/* if this fails the caller must destroy @cxlds, there is no recovery */
+int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
+{
+	struct device *dev = cxlds->dev;
+
+	guard(rwsem_write)(&cxl_dpa_rwsem);
+
+	if (cxlds->nr_partitions)
+		return -EBUSY;
+
+	if (!info->size || !info->nr_partitions) {
+		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
+		cxlds->nr_partitions = 0;
+		return 0;
+	}
+
+	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
+
+	for (int i = 0; i < info->nr_partitions; i++) {
+		const char *desc;
+		int rc;
+
+		if (i == CXL_PARTITION_RAM)
+			desc = "ram";
+		else if (i == CXL_PARTITION_PMEM)
+			desc = "pmem";
+		else
+			desc = "";
+		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
+		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
+				 info->range[i].start,
+				 range_len(&info->range[i]), desc);
+		if (rc)
+			return rc;
+		cxlds->nr_partitions++;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxl_dpa_setup);
+
 int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 				resource_size_t base, resource_size_t len,
 				resource_size_t skipped)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 3502f1633ad2..7dca5c8c3494 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
 	return rc;
 }
 
-static int add_dpa_res(struct device *dev, struct resource *parent,
-		       struct resource *res, resource_size_t start,
-		       resource_size_t size, const char *type)
-{
-	int rc;
-
-	res->name = type;
-	res->start = start;
-	res->end = start + size - 1;
-	res->flags = IORESOURCE_MEM;
-	if (resource_size(res) == 0) {
-		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
-		return 0;
-	}
-	rc = request_resource(parent, res);
-	if (rc) {
-		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
-			res, rc);
-		return rc;
-	}
-
-	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
-
-	return 0;
-}
-
-int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
+int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
-	struct resource *ram_res = to_ram_res(cxlds);
-	struct resource *pmem_res = to_pmem_res(cxlds);
 	struct device *dev = cxlds->dev;
 	int rc;
 
 	if (!cxlds->media_ready) {
-		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
-		*ram_res = DEFINE_RES_MEM(0, 0);
-		*pmem_res = DEFINE_RES_MEM(0, 0);
+		info->size = 0;
 		return 0;
 	}
 
-	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
+	info->size = mds->total_bytes;
 
 	if (mds->partition_align_bytes == 0) {
-		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
-				 mds->volatile_only_bytes, "ram");
-		if (rc)
-			return rc;
-		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
-				   mds->volatile_only_bytes,
-				   mds->persistent_only_bytes, "pmem");
+		info->range[CXL_PARTITION_RAM] = (struct range) {
+			.start = 0,
+			.end = mds->volatile_only_bytes - 1,
+		};
+		info->nr_partitions++;
+
+		if (!mds->persistent_only_bytes)
+			return 0;
+
+		info->range[CXL_PARTITION_PMEM] = (struct range) {
+			.start = mds->volatile_only_bytes,
+			.end = mds->volatile_only_bytes +
+			       mds->persistent_only_bytes - 1,
+		};
+		info->nr_partitions++;
+		return 0;
 	}
 
 	rc = cxl_mem_get_partition_info(mds);
@@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 		return rc;
 	}
 
-	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
-			 mds->active_volatile_bytes, "ram");
-	if (rc)
-		return rc;
-	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
-			   mds->active_volatile_bytes,
-			   mds->active_persistent_bytes, "pmem");
+	info->range[CXL_PARTITION_RAM] = (struct range) {
+		.start = 0,
+		.end = mds->active_volatile_bytes - 1,
+	};
+	info->nr_partitions++;
+
+	if (!mds->active_persistent_bytes)
+		return 0;
+
+	info->range[CXL_PARTITION_PMEM] = (struct range) {
+		.start = mds->active_volatile_bytes,
+		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
+	};
+	info->nr_partitions++;
+
+	return 0;
 }
-EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
+EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
 
 int cxl_set_timestamp(struct cxl_memdev_state *mds)
 {
@@ -1452,8 +1440,6 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mds->cxlds.reg_map.host = dev;
 	mds->cxlds.reg_map.resource = CXL_RESOURCE_NONE;
 	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
-	to_ram_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
-	to_pmem_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
 
 	return mds;
 }
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 78e92e24d7b5..2e728d4b7327 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			 resource_size_t base, resource_size_t len,
 			 resource_size_t skipped);
 
+/* Well known, spec defined partition indices */
+enum cxl_partition {
+	CXL_PARTITION_RAM,
+	CXL_PARTITION_PMEM,
+	CXL_PARTITION_MAX,
+};
+
+struct cxl_dpa_info {
+	u64 size;
+	struct range range[CXL_PARTITION_MAX];
+	int nr_partitions;
+};
+int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
+
 static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
 					 struct cxl_memdev *cxlmd)
 {
@@ -408,6 +422,16 @@ struct cxl_dpa_perf {
 	int qos_class;
 };
 
+/**
+ * struct cxl_dpa_partition - DPA partition descriptor
+ * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
+ * @perf: performance attributes of the partition from CDAT
+ */
+struct cxl_dpa_partition {
+	struct resource res;
+	struct cxl_dpa_perf perf;
+};
+
 /**
  * struct cxl_dev_state - The driver device state
  *
@@ -423,8 +447,8 @@ struct cxl_dpa_perf {
  * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
  * @media_ready: Indicate whether the device media is usable
  * @dpa_res: Overall DPA resource tree for the device
- * @_pmem_res: Active Persistent memory capacity configuration
- * @_ram_res: Active Volatile memory capacity configuration
+ * @part: DPA partition array
+ * @nr_partitions: Number of DPA partitions
  * @serial: PCIe Device Serial Number
  * @type: Generic Memory Class device or Vendor Specific Memory device
  * @cxl_mbox: CXL mailbox context
@@ -438,21 +462,39 @@ struct cxl_dev_state {
 	bool rcd;
 	bool media_ready;
 	struct resource dpa_res;
-	struct resource _pmem_res;
-	struct resource _ram_res;
+	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
+	unsigned int nr_partitions;
 	u64 serial;
 	enum cxl_devtype type;
 	struct cxl_mailbox cxl_mbox;
 };
 
-static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
+static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
 {
-	return &cxlds->_ram_res;
+	if (cxlds->nr_partitions > 0)
+		return &cxlds->part[CXL_PARTITION_RAM].res;
+	return NULL;
 }
 
-static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
+static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
 {
-	return &cxlds->_pmem_res;
+	if (cxlds->nr_partitions > 1)
+		return &cxlds->part[CXL_PARTITION_PMEM].res;
+	return NULL;
+}
+
+static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
+{
+	if (cxlds->nr_partitions > 0)
+		return &cxlds->part[CXL_PARTITION_RAM].perf;
+	return NULL;
+}
+
+static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
+{
+	if (cxlds->nr_partitions > 1)
+		return &cxlds->part[CXL_PARTITION_PMEM].perf;
+	return NULL;
 }
 
 static inline resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
@@ -499,8 +541,6 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @active_persistent_bytes: sum of hard + soft persistent
  * @next_volatile_bytes: volatile capacity change pending device reset
  * @next_persistent_bytes: persistent capacity change pending device reset
- * @_ram_perf: performance data entry matched to RAM partition
- * @_pmem_perf: performance data entry matched to PMEM partition
  * @event: event log driver state
  * @poison: poison driver state info
  * @security: security driver state info
@@ -524,29 +564,12 @@ struct cxl_memdev_state {
 	u64 next_volatile_bytes;
 	u64 next_persistent_bytes;
 
-	struct cxl_dpa_perf _ram_perf;
-	struct cxl_dpa_perf _pmem_perf;
-
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
 	struct cxl_security_state security;
 	struct cxl_fw_state fw;
 };
 
-static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
-{
-	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
-
-	return &mds->_ram_perf;
-}
-
-static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
-{
-	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
-
-	return &mds->_pmem_perf;
-}
-
 static inline struct cxl_memdev_state *
 to_cxl_memdev_state(struct cxl_dev_state *cxlds)
 {
@@ -860,7 +883,7 @@ int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
-int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
+int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
 void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 				unsigned long *cmds);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 0241d1d7133a..47dbfe406236 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -900,6 +900,7 @@ __ATTRIBUTE_GROUPS(cxl_rcd);
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
+	struct cxl_dpa_info range_info = { 0 };
 	struct cxl_memdev_state *mds;
 	struct cxl_dev_state *cxlds;
 	struct cxl_register_map map;
@@ -989,7 +990,11 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
-	rc = cxl_mem_create_range_info(mds);
+	rc = cxl_mem_dpa_fetch(mds, &range_info);
+	if (rc)
+		return rc;
+
+	rc = cxl_dpa_setup(cxlds, &range_info);
 	if (rc)
 		return rc;
 
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 7f1c5061307b..ba3d48b37de3 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -1001,26 +1001,19 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
 	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
-	const struct resource *partition[] = {
-		to_ram_res(cxlds),
-		to_pmem_res(cxlds),
-	};
-	struct cxl_dpa_perf *perf[] = {
-		to_ram_perf(cxlds),
-		to_pmem_perf(cxlds),
-	};
 
 	if (!cxl_root)
 		return;
 
-	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
-		const struct resource *res = partition[i];
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		struct resource *res = &cxlds->part[i].res;
+		struct cxl_dpa_perf *perf = &cxlds->part[i].perf;
 		struct range range = {
 			.start = res->start,
 			.end = res->end,
 		};
 
-		dpa_perf_setup(port, &range, perf[i]);
+		dpa_perf_setup(port, &range, perf);
 	}
 
 	cxl_memdev_update_perf(cxlmd);
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 347c1e7b37bd..ed365e083c8f 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -1477,6 +1477,7 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	struct cxl_dev_state *cxlds;
 	struct cxl_mockmem_data *mdata;
 	struct cxl_mailbox *cxl_mbox;
+	struct cxl_dpa_info range_info = { 0 };
 	int rc;
 
 	mdata = devm_kzalloc(dev, sizeof(*mdata), GFP_KERNEL);
@@ -1537,7 +1538,11 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		return rc;
 
-	rc = cxl_mem_create_range_info(mds);
+	rc = cxl_mem_dpa_fetch(mds, &range_info);
+	if (rc)
+		return rc;
+
+	rc = cxl_dpa_setup(cxlds, &range_info);
 	if (rc)
 		return rc;
 


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
                   ` (2 preceding siblings ...)
  2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
@ 2025-01-17  6:10 ` Dan Williams
  2025-01-17 11:12   ` Jonathan Cameron
  2025-01-17 15:42   ` Alejandro Lucero Palau
  3 siblings, 2 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17  6:10 UTC (permalink / raw)
  To: linux-cxl; +Cc: Dave Jiang, Alejandro Lucero, Ira Weiny, dave.jiang

cxl_dpa_alloc() is a hard-coded nest of assumptions that PMEM
allocations differ from RAM allocations in specific ways, when in
practice the allocation rules depend only on the DPA partition index.

The rules for cxl_dpa_alloc() are:

- allocations can only come from 1 partition

- if allocating at partition-index-N, all free space in partitions less
  than partition-index-N must be skipped over

Use the new 'struct cxl_dpa_partition' array to support allocation with
an arbitrary number of DPA partitions on the device.

A follow-on patch can go further, cleaning up the 'enum
cxl_decoder_mode' concept and superseding it by looking up memory
properties from partition metadata.

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alucerop@amd.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/hdm.c |  167 +++++++++++++++++++++++++++++++++---------------
 drivers/cxl/cxlmem.h   |    9 +++
 2 files changed, 125 insertions(+), 51 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 7e1559b3ed88..4a2816102a1e 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -223,6 +223,30 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
 
+static void release_skip(struct cxl_dev_state *cxlds,
+			 const resource_size_t skip_base,
+			 const resource_size_t skip_len)
+{
+	resource_size_t skip_start = skip_base, skip_rem = skip_len;
+
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		const struct resource *part_res = &cxlds->part[i].res;
+		resource_size_t skip_end, skip_size;
+
+		if (skip_start < part_res->start || skip_start > part_res->end)
+			continue;
+
+		skip_end = min(part_res->end, skip_start + skip_rem - 1);
+		skip_size = skip_end - skip_start + 1;
+		__release_region(&cxlds->dpa_res, skip_start, skip_size);
+		skip_start += skip_size;
+		skip_rem -= skip_size;
+
+		if (!skip_rem)
+			break;
+	}
+}
+
 /*
  * Must be called in a context that synchronizes against this decoder's
  * port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -241,7 +265,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	skip_start = res->start - cxled->skip;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
 	if (cxled->skip)
-		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+		release_skip(cxlds, skip_start, cxled->skip);
 	cxled->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
@@ -268,6 +292,47 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	__cxl_dpa_release(cxled);
 }
 
+static int request_skip(struct cxl_dev_state *cxlds,
+			struct cxl_endpoint_decoder *cxled,
+			const resource_size_t skip_base,
+			const resource_size_t skip_len)
+{
+	resource_size_t skip_start = skip_base, skip_rem = skip_len;
+
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		const struct resource *part_res = &cxlds->part[i].res;
+		struct cxl_port *port = cxled_to_port(cxled);
+		resource_size_t skip_end, skip_size;
+		struct resource *res;
+
+		if (skip_start < part_res->start || skip_start > part_res->end)
+			continue;
+
+		skip_end = min(part_res->end, skip_start + skip_rem - 1);
+		skip_size = skip_end - skip_start + 1;
+
+		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
+				       dev_name(&cxled->cxld.dev), 0);
+		if (!res) {
+			dev_dbg(cxlds->dev,
+				"decoder%d.%d: failed to reserve skipped space\n",
+				port->id, cxled->cxld.id);
+			break;
+		}
+		skip_start += skip_size;
+		skip_rem -= skip_size;
+		if (!skip_rem)
+			break;
+	}
+
+	if (skip_rem == 0)
+		return 0;
+
+	release_skip(cxlds, skip_base, skip_len - skip_rem);
+
+	return -EBUSY;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -277,6 +342,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &port->dev;
 	struct resource *res;
+	int rc;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
@@ -305,14 +371,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
-		}
+		rc = request_skip(cxlds, cxled, base - skipped, skipped);
+		if (rc)
+			return rc;
 	}
 	res = __request_region(&cxlds->dpa_res, base, len,
 			       dev_name(&cxled->cxld.dev), 0);
@@ -320,16 +381,15 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
 			port->id, cxled->cxld.id);
 		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+			release_skip(cxlds, base - skipped, skipped);
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
-	if (resource_contains(to_pmem_res(cxlds), res))
+	if (cxl_partition_contains(cxlds, CXL_PARTITION_PMEM, res))
 		cxled->mode = CXL_DECODER_PMEM;
-	else if (resource_contains(to_ram_res(cxlds), res))
+	else if (cxl_partition_contains(cxlds, CXL_PARTITION_RAM, res))
 		cxled->mode = CXL_DECODER_RAM;
 	else {
 		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
@@ -527,15 +587,13 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 {
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	resource_size_t free_ram_start, free_pmem_start;
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &cxled->cxld.dev;
-	resource_size_t start, avail, skip;
+	struct resource *res, *prev = NULL;
+	resource_size_t start, avail, skip, skip_start;
 	struct resource *p, *last;
-	const struct resource *ram_res = to_ram_res(cxlds);
-	const struct resource *pmem_res = to_pmem_res(cxlds);
-	int rc;
+	int part, rc;
 
 	down_write(&cxl_dpa_rwsem);
 	if (cxled->cxld.region) {
@@ -551,47 +609,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 		goto out;
 	}
 
-	for (p = ram_res->child, last = NULL; p; p = p->sibling)
-		last = p;
-	if (last)
-		free_ram_start = last->end + 1;
+	if (cxled->mode == CXL_DECODER_RAM)
+		part = CXL_PARTITION_RAM;
+	else if (cxled->mode == CXL_DECODER_PMEM)
+		part = CXL_PARTITION_PMEM;
 	else
-		free_ram_start = ram_res->start;
+		part = cxlds->nr_partitions;
+
+	if (part >= cxlds->nr_partitions) {
+		dev_dbg(dev, "partition %d not found\n", part);
+		rc = -EBUSY;
+		goto out;
+	}
+
+	res = &cxlds->part[part].res;
 
-	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
+	for (p = res->child, last = NULL; p; p = p->sibling)
 		last = p;
 	if (last)
-		free_pmem_start = last->end + 1;
+		start = last->end + 1;
 	else
-		free_pmem_start = pmem_res->start;
+		start = res->start;
 
-	if (cxled->mode == CXL_DECODER_RAM) {
-		start = free_ram_start;
-		avail = ram_res->end - start + 1;
-		skip = 0;
-	} else if (cxled->mode == CXL_DECODER_PMEM) {
-		resource_size_t skip_start, skip_end;
-
-		start = free_pmem_start;
-		avail = pmem_res->end - start + 1;
-		skip_start = free_ram_start;
-
-		/*
-		 * If some pmem is already allocated, then that allocation
-		 * already handled the skip.
-		 */
-		if (pmem_res->child &&
-		    skip_start == pmem_res->child->start)
-			skip_end = skip_start - 1;
-		else
-			skip_end = start - 1;
-		skip = skip_end - skip_start + 1;
-	} else {
-		dev_dbg(dev, "mode not set\n");
-		rc = -EINVAL;
-		goto out;
+	/*
+	 * To allocate at partition N, a skip needs to be calculated for all
+	 * unallocated space at lower partitions indices.
+	 *
+	 * If a partition has any allocations, the search can end because a
+	 * previous cxl_dpa_alloc() invocation is assumed to have accounted for
+	 * all previous partitions.
+	 */
+	skip_start = CXL_RESOURCE_NONE;
+	for (int i = part; i; i--) {
+		prev = &cxlds->part[i - 1].res;
+		for (p = prev->child, last = NULL; p; p = p->sibling)
+			last = p;
+		if (last) {
+			skip_start = last->end + 1;
+			break;
+		}
+		skip_start = prev->start;
 	}
 
+	avail = res->end - start + 1;
+	if (skip_start == CXL_RESOURCE_NONE)
+		skip = 0;
+	else
+		skip = res->start - skip_start;
+
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
 			cxl_decoder_mode_name(cxled->mode), &avail);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2e728d4b7327..43acd48b300f 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -515,6 +515,15 @@ static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
 	return resource_size(res);
 }
 
+static inline bool cxl_partition_contains(struct cxl_dev_state *cxlds,
+					  unsigned int part,
+					  struct resource *res)
+{
+	if (part >= cxlds->nr_partitions)
+		return false;
+	return resource_contains(&cxlds->part[part].res, res);
+}
+
 static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
 {
 	return dev_get_drvdata(cxl_mbox->host);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake
  2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
@ 2025-01-17 10:03   ` Jonathan Cameron
  2025-01-17 17:47     ` Dan Williams
  2025-01-17 10:24   ` Alejandro Lucero Palau
  2025-01-17 18:45   ` Ira Weiny
  2 siblings, 1 reply; 32+ messages in thread
From: Jonathan Cameron @ 2025-01-17 10:03 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, dave.jiang

On Thu, 16 Jan 2025 22:10:32 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> CXL_DECODER_MIXED is a safety mechanism introduced for the case where
> platform firmware has programmed an endpoint decoder that straddles a
> DPA partition boundary. While the kernel is careful to only allocate DPA
> capacity within a single partition there is no guarantee that platform
> firmware, or anything that touched the device before the current kernel,
> gets that right.
> 
> However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
> designation because of the way it tracks partition boundaries. A
> request_resource() that spans ->ram_res and ->pmem_res fails with the
> following signature:
> 
>     __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation
> 
> CXL_DECODER_MIXED is dead defensive programming after the driver has
> already given up on the device. It has never offered any protection in
> practice, just delete it.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/hdm.c    |    8 ++++----
>  drivers/cxl/core/region.c |   12 ------------
>  drivers/cxl/cxl.h         |    4 +---
>  3 files changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 28edd5822486..be8556119d94 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -329,12 +329,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
> -	else if (resource_contains(&cxlds->ram_res, res))
> +	if (resource_contains(&cxlds->ram_res, res))

Logic of removing the else?  I assume there is 0 chance that both conditions
match, but doesn't this mean if the res is not in ram_res we always hit the next
else and print the warning?

>  		cxled->mode = CXL_DECODER_RAM;
>  	else {
> -		dev_warn(dev, "decoder%d.%d: %pr mixed mode not supported\n",
> -			 port->id, cxled->cxld.id, cxled->dpa_res);
> -		cxled->mode = CXL_DECODER_MIXED;
> +		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
> +			 port->id, cxled->cxld.id, res);
> +		cxled->mode = CXL_DECODER_NONE;
>  	}
>  
>  	port->hdm_end++;

> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index f6015f24ad38..0fb8d70fa3e5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -379,7 +379,6 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> -	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
>  
> @@ -389,10 +388,9 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> -		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> -	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
> +	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_PMEM)
Maybe just < DEAD is simpler?
>  		return names[mode];
>  	return "mixed";
>  }
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
  2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
@ 2025-01-17 10:20   ` Jonathan Cameron
  2025-01-17 10:23     ` Jonathan Cameron
  2025-01-17 13:33   ` Alejandro Lucero Palau
  1 sibling, 1 reply; 32+ messages in thread
From: Jonathan Cameron @ 2025-01-17 10:20 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

On Thu, 16 Jan 2025 22:10:38 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> In preparation for consolidating all DPA partition information into an
> array of DPA metadata, introduce helpers that hide the layout of the
> current data. I.e. make the eventual replacement of ->ram_res,
> ->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a  
> no-op for code paths that consume that information, and reduce the noise
> of follow-on patches.
> 
> The end goal is to consolidate all DPA information in 'struct
> cxl_dev_state', but for now the helpers just make it appear that all DPA
> metadata is relative to @cxlds.
> 
> Note that a follow-on patch also cleans up the temporary placeholders of
> @ram_res, and @pmem_res in the qos_class manipulation code,
> cxl_dpa_alloc(), and cxl_mem_create_range_info().
> 
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I'm not that keen on wrapping the size but not the base.
Leads to some odd looking code in places.

Various other comments inline.

> ---
>  drivers/cxl/core/cdat.c      |   70 +++++++++++++++++++++++++-----------------
>  drivers/cxl/core/hdm.c       |   26 ++++++++--------
>  drivers/cxl/core/mbox.c      |   18 ++++++-----
>  drivers/cxl/core/memdev.c    |   42 +++++++++++++------------
>  drivers/cxl/core/region.c    |   10 ++++--
>  drivers/cxl/cxlmem.h         |   58 ++++++++++++++++++++++++++++++-----
>  drivers/cxl/mem.c            |    2 +
>  tools/testing/cxl/test/cxl.c |   25 ++++++++-------
>  8 files changed, 159 insertions(+), 92 deletions(-)
> 
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index 8153f8d83a16..b177a488e29b 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -258,29 +258,33 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
>  static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>  				     struct xarray *dsmas_xa)
>  {
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>  	struct device *dev = cxlds->dev;
> -	struct range pmem_range = {
> -		.start = cxlds->pmem_res.start,
> -		.end = cxlds->pmem_res.end,
> -	};
> -	struct range ram_range = {
> -		.start = cxlds->ram_res.start,
> -		.end = cxlds->ram_res.end,
> -	};
>  	struct dsmas_entry *dent;
>  	unsigned long index;
> +	const struct resource *partition[] = {
> +		to_ram_res(cxlds),
> +		to_pmem_res(cxlds),
> +	};
As in the test code case (see later), why not just use a range
here and fill the various bits directly?

I think the code ends up simpler, particularly if you have _base()
helpers as well as size ones.

> +	struct cxl_dpa_perf *perf[] = {
> +		to_ram_perf(cxlds),
> +		to_pmem_perf(cxlds),
> +	};
>  
>  	xa_for_each(dsmas_xa, index, dent) {
> -		if (resource_size(&cxlds->ram_res) &&
> -		    range_contains(&ram_range, &dent->dpa_range))
> -			update_perf_entry(dev, dent, &mds->ram_perf);
> -		else if (resource_size(&cxlds->pmem_res) &&
> -			 range_contains(&pmem_range, &dent->dpa_range))
> -			update_perf_entry(dev, dent, &mds->pmem_perf);
> -		else
> -			dev_dbg(dev, "no partition for dsmas dpa: %pra\n",
> -				&dent->dpa_range);
> +		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> +			const struct resource *res = partition[i];
> +			struct range range = {
> +				.start = res->start,
> +				.end = res->end,
> +			};
> +
> +			if (range_contains(&range, &dent->dpa_range))
> +				update_perf_entry(dev, dent, perf[i]);
> +			else
> +				dev_dbg(dev,
> +					"no partition for dsmas dpa: %pra\n",
> +					&dent->dpa_range);
> +		}
>  	}
>  }
>  
> @@ -304,6 +308,9 @@ static int match_cxlrd_qos_class(struct device *dev, void *data)
>  
>  static void reset_dpa_perf(struct cxl_dpa_perf *dpa_perf)
>  {
> +	if (!dpa_perf)
> +		return;

I don't mind the change, but this smells like a functional change that
doesn't belong in this patch.  I'm not seeing the check removed from elsewhere.

> +
>  	*dpa_perf = (struct cxl_dpa_perf) {
>  		.qos_class = CXL_QOS_CLASS_INVALID,
>  	};
> @@ -312,6 +319,9 @@ static void reset_dpa_perf(struct cxl_dpa_perf *dpa_perf)
>  static bool cxl_qos_match(struct cxl_port *root_port,
>  			  struct cxl_dpa_perf *dpa_perf)
>  {
> +	if (!dpa_perf)
> +		return false;
> +
>  	if (dpa_perf->qos_class == CXL_QOS_CLASS_INVALID)
>  		return false;
>  
> @@ -346,7 +356,8 @@ static int match_cxlrd_hb(struct device *dev, void *data)
>  static int cxl_qos_class_verify(struct cxl_memdev *cxlmd)
>  {
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> +	struct cxl_dpa_perf *ram_perf = to_ram_perf(cxlds),
> +			    *pmem_perf = to_pmem_perf(cxlds);
I'd just repeat the type.  To me that would be easier to read than
this (slightly).


> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index be8556119d94..7a85522294ad 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -327,9 +327,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,

> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index ae3dfcbe8938..c5f8320ed330 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -80,7 +80,7 @@ static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	unsigned long long len = resource_size(&cxlds->ram_res);
> +	unsigned long long len = resource_size(to_ram_res(cxlds));

Use the cxl_ram_size() helper here too, to match pmem_size_show() below.

>  
>  	return sysfs_emit(buf, "%#llx\n", len);
>  }
> @@ -93,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	unsigned long long len = resource_size(&cxlds->pmem_res);
> +	unsigned long long len = cxl_pmem_size(cxlds);
>  
>  	return sysfs_emit(buf, "%#llx\n", len);
>  }
> @@ -198,16 +198,20 @@ static int cxl_get_poison_by_memdev(struct cxl_memdev *cxlmd)
>  	int rc = 0;
>  
>  	/* CXL 3.0 Spec 8.2.9.8.4.1 Separate pmem and ram poison requests */
> -	if (resource_size(&cxlds->pmem_res)) {
> -		offset = cxlds->pmem_res.start;
> -		length = resource_size(&cxlds->pmem_res);
> +	if (cxl_pmem_size(cxlds)) {
> +		const struct resource *res = to_pmem_res(cxlds);
> +
> +		offset = res->start;
> +		length = resource_size(res);
>  		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
>  		if (rc)
>  			return rc;
>  	}
> -	if (resource_size(&cxlds->ram_res)) {
> -		offset = cxlds->ram_res.start;
> -		length = resource_size(&cxlds->ram_res);
> +	if (cxl_ram_size(cxlds)) {
> +		const struct resource *res = to_ram_res(cxlds);
> +
> +		offset = res->start;
> +		length = resource_size(res);

Having a size helper and not using it consistently seems ugly to me.
If we can keep the direct manipulation to where we actually care
about the tree of resources, that seems simpler to me.

> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e4885acac853..9f0f6fdbc841 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2688,7 +2688,7 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
>  
>  	if (ctx->mode == CXL_DECODER_RAM) {
>  		offset = ctx->offset;
> -		length = resource_size(&cxlds->ram_res) - offset;
> +		length = cxl_ram_size(cxlds) - offset;
>  		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
>  		if (rc == -EFAULT)
>  			rc = 0;
> @@ -2700,9 +2700,11 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
>  		length = resource_size(&cxlds->dpa_res) - offset;
>  		if (!length)
>  			return 0;
> -	} else if (resource_size(&cxlds->pmem_res)) {
> -		offset = cxlds->pmem_res.start;
> -		length = resource_size(&cxlds->pmem_res);
> +	} else if (cxl_pmem_size(cxlds)) {
> +		const struct resource *res = to_pmem_res(cxlds);
> +
> +		offset = res->start;
> +		length = resource_size(res);
Whilst it's slightly more complex in terms of what it runs, I'd go with

		length = cxl_pmem_size(cxlds);
Could introduce a wrapper for base as well but perhaps that's a step
too far.

>  	} else {
>  		return 0;
>  	}
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 2a25d1957ddb..78e92e24d7b5 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 2f03a4d5606e..9675243bd05b 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -152,7 +152,7 @@ static int cxl_mem_probe(struct device *dev)
>  		return -ENXIO;
>  	}
>  
> -	if (resource_size(&cxlds->pmem_res) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> +	if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
>  		rc = devm_cxl_add_nvdimm(parent_port, cxlmd);
>  		if (rc) {
>  			if (rc == -ENODEV)
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index d0337c11f9ee..7f1c5061307b 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -1000,25 +1000,28 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
>  		find_cxl_root(port);
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>  	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
> -	struct range pmem_range = {
> -		.start = cxlds->pmem_res.start,
> -		.end = cxlds->pmem_res.end,
> +	const struct resource *partition[] = {
> +		to_ram_res(cxlds),
> +		to_pmem_res(cxlds),
>  	};
Maybe better to use an array of ranges, and fill the two entries
using helpers (I'd add them for base as well as size).

> -	struct range ram_range = {
> -		.start = cxlds->ram_res.start,
> -		.end = cxlds->ram_res.end,
> +	struct cxl_dpa_perf *perf[] = {
> +		to_ram_perf(cxlds),
> +		to_pmem_perf(cxlds),
>  	};

>  
>  	if (!cxl_root)
>  		return;
>  
> -	if (range_len(&ram_range))
> -		dpa_perf_setup(port, &ram_range, &mds->ram_perf);
> +	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> +		const struct resource *res = partition[i];
> +		struct range range = {
> +			.start = res->start,
> +			.end = res->end,

Purely a preference thing, but I'd go with not introducing the local
for just two simple dereferences.  Keeping it clearly associated with
the definitions above looks better to me.

			.start = partition[i]->start,
			.end = partition[i]->end,
Or switch to ranges for your partitions array and do the conversion in one
place.

		
> +		};
>  
> -	if (range_len(&pmem_range))
> -		dpa_perf_setup(port, &pmem_range, &mds->pmem_perf);
> +		dpa_perf_setup(port, &range, perf[i]);
> +	}
>  
>  	cxl_memdev_update_perf(cxlmd);
>  
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
  2025-01-17 10:20   ` Jonathan Cameron
@ 2025-01-17 10:23     ` Jonathan Cameron
  2025-01-17 17:55       ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Jonathan Cameron @ 2025-01-17 10:23 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

On Fri, 17 Jan 2025 10:20:56 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Thu, 16 Jan 2025 22:10:38 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > In preparation for consolidating all DPA partition information into an
> > array of DPA metadata, introduce helpers that hide the layout of the
> > current data. I.e. make the eventual replacement of ->ram_res,  
> > ->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a    
> > no-op for code paths that consume that information, and reduce the noise
> > of follow-on patches.
> > 
> > The end goal is to consolidate all DPA information in 'struct
> > cxl_dev_state', but for now the helpers just make it appear that all DPA
> > metadata is relative to @cxlds.
> > 
> > Note that a follow-on patch also cleans up the temporary placeholders of
> > @ram_res, and @pmem_res in the qos_class manipulation code,
> > cxl_dpa_alloc(), and cxl_mem_create_range_info().
> > 
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>  
> 
> I'm not that keen on wrapping the size but not the base.
> Leads to some odd looking code in places.

It seems some of the code I didn't like goes away later in the series anyway.
So maybe it makes sense from a churn reduction point of view.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake
  2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
  2025-01-17 10:03   ` Jonathan Cameron
@ 2025-01-17 10:24   ` Alejandro Lucero Palau
  2025-01-17 17:54     ` Dan Williams
  2025-01-17 18:45   ` Ira Weiny
  2 siblings, 1 reply; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-17 10:24 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: dave.jiang


On 1/17/25 06:10, Dan Williams wrote:
> CXL_DECODER_MIXED is a safety mechanism introduced for the case where
> platform firmware has programmed an endpoint decoder that straddles a
> DPA partition boundary. While the kernel is careful to only allocate DPA
> capacity within a single partition there is no guarantee that platform
> firmware, or anything that touched the device before the current kernel,
> gets that right.
>
> However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
> designation because of the way it tracks partition boundaries. A
> request_resource() that spans ->ram_res and ->pmem_res fails with the
> following signature:
>
>      __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation
>
> CXL_DECODER_MIXED is dead defensive programming after the driver has
> already given up on the device. It has never offered any protection in
> practice, just delete it.


I wonder if the original reason for adding CXL_DECODER_MIXED makes it
worth fixing __cxl_dpa_reserve() instead of just not supporting this
case.

Assuming it does not:

Reviewed-by: Alejandro Lucero <alucerop@amd.com>


> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>   drivers/cxl/core/hdm.c    |    8 ++++----
>   drivers/cxl/core/region.c |   12 ------------
>   drivers/cxl/cxl.h         |    4 +---
>   3 files changed, 5 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 28edd5822486..be8556119d94 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -329,12 +329,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   
>   	if (resource_contains(&cxlds->pmem_res, res))
>   		cxled->mode = CXL_DECODER_PMEM;
> -	else if (resource_contains(&cxlds->ram_res, res))
> +	if (resource_contains(&cxlds->ram_res, res))
>   		cxled->mode = CXL_DECODER_RAM;
>   	else {
> -		dev_warn(dev, "decoder%d.%d: %pr mixed mode not supported\n",
> -			 port->id, cxled->cxld.id, cxled->dpa_res);
> -		cxled->mode = CXL_DECODER_MIXED;
> +		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
> +			 port->id, cxled->cxld.id, res);
> +		cxled->mode = CXL_DECODER_NONE;
>   	}
>   
>   	port->hdm_end++;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index d77899650798..e4885acac853 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2725,18 +2725,6 @@ static int poison_by_decoder(struct device *dev, void *arg)
>   	if (!cxled->dpa_res || !resource_size(cxled->dpa_res))
>   		return rc;
>   
> -	/*
> -	 * Regions are only created with single mode decoders: pmem or ram.
> -	 * Linux does not support mixed mode decoders. This means that
> -	 * reading poison per endpoint decoder adheres to the requirement
> -	 * that poison reads of pmem and ram must be separated.
> -	 * CXL 3.0 Spec 8.2.9.8.4.1
> -	 */
> -	if (cxled->mode == CXL_DECODER_MIXED) {
> -		dev_dbg(dev, "poison list read unsupported in mixed mode\n");
> -		return rc;
> -	}
> -
>   	cxlmd = cxled_to_memdev(cxled);
>   	if (cxled->skip) {
>   		offset = cxled->dpa_res->start - cxled->skip;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index f6015f24ad38..0fb8d70fa3e5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -379,7 +379,6 @@ enum cxl_decoder_mode {
>   	CXL_DECODER_NONE,
>   	CXL_DECODER_RAM,
>   	CXL_DECODER_PMEM,
> -	CXL_DECODER_MIXED,
>   	CXL_DECODER_DEAD,
>   };
>   
> @@ -389,10 +388,9 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>   		[CXL_DECODER_NONE] = "none",
>   		[CXL_DECODER_RAM] = "ram",
>   		[CXL_DECODER_PMEM] = "pmem",
> -		[CXL_DECODER_MIXED] = "mixed",
>   	};
>   
> -	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
> +	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_PMEM)
>   		return names[mode];
>   	return "mixed";
>   }
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
@ 2025-01-17 10:52   ` Jonathan Cameron
  2025-01-17 13:38     ` Alejandro Lucero Palau
  2025-01-17 18:23     ` Dan Williams
  2025-01-17 15:58   ` Alejandro Lucero Palau
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 32+ messages in thread
From: Jonathan Cameron @ 2025-01-17 10:52 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

On Thu, 16 Jan 2025 22:10:44 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> The pending efforts to add CXL Accelerator (type-2) device [1], and
> Dynamic Capacity (DCD) support [2], tripped on the
> no-longer-fit-for-purpose design in the CXL subsystem for tracking
> device-physical-address (DPA) metadata. Trip hazards include:
> 
> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
>   devices with CXL.mem likely do not in the common case.
> 
> - CXL Memory Devices enumerate DPA through Memory Device mailbox
>   commands like Partition Info, Accelerators devices do not.
> 
> - CXL Memory Devices that support DCD support more than 2 partitions.
>   Some of the driver algorithms are awkward to expand to > 2 partition
>   cases.
> 
> - DPA performance data is a general capability that can be shared with
>   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
>   suitable.
> 
> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
>   memory property, it should be phased in favor of a partition id and
>   the memory property comes from the partition info.
> 
> Towards cleaning up those issues and allowing a smoother landing for the
> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> way for Memory Devices and Accelerators to initialize the DPA information
> in 'struct cxl_dev_state'.
> 
> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> get the new data structure initialized, and cleanup some qos_class init.
> Follow on patches will go further to use the new data structure to
> cleanup algorithms that are better suited to loop over all possible
> partitions.
> 
> cxl_dpa_setup() follows the locking expectations of mutating the device
> DPA map, and is suitable for Accelerator drivers to use. Accelerators
> likely only have one hardcoded 'ram' partition to convey to the
> cxl_core.
> 
> Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Hi Dan,

In basic form this seems fine, but I find the nr_partitions variable usage
very counter-intuitive.  It's just how many we configured, not how many
there actually are, potentially with 0 size (so not a partition).  I'd be
happier if we could avoid that by prefilling the lot with zero size and
filling in the ones we want: zero size means the partition doesn't exist,
and an iterator skips the zero-size entries where appropriate.

Without that tidied up, this is more confusing to me than the previous code.

Jonathan

> ---
>  drivers/cxl/core/cdat.c      |   15 ++-----
>  drivers/cxl/core/hdm.c       |   69 ++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c      |   86 ++++++++++++++++++------------------------
>  drivers/cxl/cxlmem.h         |   79 +++++++++++++++++++++++++--------------
>  drivers/cxl/pci.c            |    7 +++
>  tools/testing/cxl/test/cxl.c |   15 ++-----
>  tools/testing/cxl/test/mem.c |    7 +++
>  7 files changed, 176 insertions(+), 102 deletions(-)
> 
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index b177a488e29b..5400a421ad30 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -261,25 +261,18 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>  	struct device *dev = cxlds->dev;
>  	struct dsmas_entry *dent;
>  	unsigned long index;
> -	const struct resource *partition[] = {
> -		to_ram_res(cxlds),
> -		to_pmem_res(cxlds),
> -	};
> -	struct cxl_dpa_perf *perf[] = {
> -		to_ram_perf(cxlds),
> -		to_pmem_perf(cxlds),
> -	};

Ok. This removes some of the concerns from previous patch.

>  
>  	xa_for_each(dsmas_xa, index, dent) {
> -		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> -			const struct resource *res = partition[i];
> +		for (int i = 0; i < cxlds->nr_partitions; i++) {
> +			struct resource *res = &cxlds->part[i].res;
>  			struct range range = {
>  				.start = res->start,
>  				.end = res->end,
>  			};
>  
>  			if (range_contains(&range, &dent->dpa_range))
> -				update_perf_entry(dev, dent, perf[i]);
> +				update_perf_entry(dev, dent,
> +						  &cxlds->part[i].perf);
>  			else
>  				dev_dbg(dev,
>  					"no partition for dsmas dpa: %pra\n",
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7a85522294ad..7e1559b3ed88 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -342,6 +342,75 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	return 0;
>  }
>  
> +static int add_dpa_res(struct device *dev, struct resource *parent,
> +		       struct resource *res, resource_size_t start,
> +		       resource_size_t size, const char *type)
> +{
> +	int rc;
> +
> +	*res = (struct resource) {
> +		.name = type,
> +		.start = start,
> +		.end =  start + size - 1,
> +		.flags = IORESOURCE_MEM,
> +	};
> +	if (resource_size(res) == 0) {
> +		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
> +		return 0;
> +	}
> +	rc = request_resource(parent, res);
> +	if (rc) {
> +		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
> +			res, rc);
> +		return rc;
> +	}
> +
> +	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
> +
> +	return 0;
> +}
> +
> +/* if this fails the caller must destroy @cxlds, there is no recovery */
> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> +{
> +	struct device *dev = cxlds->dev;
> +
> +	guard(rwsem_write)(&cxl_dpa_rwsem);
> +
> +	if (cxlds->nr_partitions)
> +		return -EBUSY;
> +
> +	if (!info->size || !info->nr_partitions) {
> +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> +		cxlds->nr_partitions = 0;
> +		return 0;
> +	}
> +
> +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> +
> +	for (int i = 0; i < info->nr_partitions; i++) {
> +		const char *desc;
> +		int rc;
> +
> +		if (i == CXL_PARTITION_RAM)
> +			desc = "ram";
> +		else if (i == CXL_PARTITION_PMEM)
> +			desc = "pmem";
> +		else
> +			desc = "";
> +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> +				 info->range[i].start,
> +				 range_len(&info->range[i]), desc);
> +		if (rc)
> +			return rc;
> +		cxlds->nr_partitions++;
I'd just initialize the rest to 0 length, similar to what happens in the
pmem-only case anyway.  Then nr_partitions goes away and stops being a
possible source of confusion.

> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(cxl_dpa_setup);

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 3502f1633ad2..7dca5c8c3494 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)

> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
> -	struct resource *ram_res = to_ram_res(cxlds);
> -	struct resource *pmem_res = to_pmem_res(cxlds);
>  	struct device *dev = cxlds->dev;
>  	int rc;
>  
>  	if (!cxlds->media_ready) {
> -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> -		*ram_res = DEFINE_RES_MEM(0, 0);
> -		*pmem_res = DEFINE_RES_MEM(0, 0);
> +		info->size = 0;
>  		return 0;
>  	}
>  
> -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> +	info->size = mds->total_bytes;
>  
>  	if (mds->partition_align_bytes == 0) {
Obviously nothing to do with your patch as such, but maybe tidy this up
by making active values == fixed values when we don't have partition control.
That seems logical anyway to me and means we only end up with one lot of
range setup in here.  I can't immediately see any side effects of doing this.



	if (mds->partition_align_bytes != 0) {
		rc = cxl_mem_get_partition_info(mds);
		if (rc)
			return rc;
	} else {
		mds->active_volatile_bytes = mds->volatile_only_bytes;
		mds->active_persistent_bytes = mds->persistent_only_bytes;
	}
 	info->range[CXL_PARTITION_RAM] = (struct range) {
		.start = 0,
		.end = mds->active_volatile_bytes - 1,
	};
	info->nr_partitions++;

	if (!mds->active_persistent_bytes)
		return 0;

	info->range[CXL_PARTITION_PMEM] = (struct range) {
		.start = mds->active_volatile_bytes,
		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
	};
	info->nr_partitions++;

	return 0;
}

> -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> -				 mds->volatile_only_bytes, "ram");
> -		if (rc)
> -			return rc;
> -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> -				   mds->volatile_only_bytes,
> -				   mds->persistent_only_bytes, "pmem");
> +		info->range[CXL_PARTITION_RAM] = (struct range) {
> +			.start = 0,
> +			.end = mds->volatile_only_bytes - 1,
> +		};
> +		info->nr_partitions++;
> +
> +		if (!mds->persistent_only_bytes)
> +			return 0;
> +
> +		info->range[CXL_PARTITION_PMEM] = (struct range) {
> +			.start = mds->volatile_only_bytes,
> +			.end = mds->volatile_only_bytes +
> +			       mds->persistent_only_bytes - 1,
> +		};
> +		info->nr_partitions++;

This nr_partitions makes some sense, though I'd be tempted to add a type
array to info so that we can simply not pass empty ones if we don't want
to.  That makes this code a little more complex, but not by a lot, and
means nr_partitions becomes the number that actually exist.

> +		return 0;
>  	}
>  
>  	rc = cxl_mem_get_partition_info(mds);
> @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  		return rc;
>  	}
>  
> -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> -			 mds->active_volatile_bytes, "ram");
> -	if (rc)
> -		return rc;
> -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> -			   mds->active_volatile_bytes,
> -			   mds->active_persistent_bytes, "pmem");
> +	info->range[CXL_PARTITION_RAM] = (struct range) {
> +		.start = 0,
> +		.end = mds->active_volatile_bytes - 1,
> +	};
> +	info->nr_partitions++;
> +
> +	if (!mds->active_persistent_bytes)
> +		return 0;
> +
> +	info->range[CXL_PARTITION_PMEM] = (struct range) {
> +		.start = mds->active_volatile_bytes,
> +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> +	};
> +	info->nr_partitions++;
> +
> +	return 0;
>  }
> -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");

> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 78e92e24d7b5..2e728d4b7327 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			 resource_size_t base, resource_size_t len,
>  			 resource_size_t skipped);
>  
> +/* Well known, spec defined partition indices */
> +enum cxl_partition {
> +	CXL_PARTITION_RAM,
> +	CXL_PARTITION_PMEM,
> +	CXL_PARTITION_MAX,
> +};
> +
> +struct cxl_dpa_info {
> +	u64 size;
> +	struct range range[CXL_PARTITION_MAX];
> +	int nr_partitions;
> +};

blank line seems appropriate here.

> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
> +
>  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>  					 struct cxl_memdev *cxlmd)
>  {
> @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
>  	int qos_class;
>  };
>  

>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
>   * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
>   * @media_ready: Indicate whether the device media is usable
>   * @dpa_res: Overall DPA resource tree for the device
> - * @_pmem_res: Active Persistent memory capacity configuration
> - * @_ram_res: Active Volatile memory capacity configuration
> + * @part: DPA partition array
> + * @nr_partitions: Number of DPA partitions

This needs more. It is not the number of partitions present I think, it
is the number that a particular driver is potentially interested in.

>   * @serial: PCIe Device Serial Number
>   * @type: Generic Memory Class device or Vendor Specific Memory device
>   * @cxl_mbox: CXL mailbox context
> @@ -438,21 +462,39 @@ struct cxl_dev_state {
>  	bool rcd;
>  	bool media_ready;
>  	struct resource dpa_res;
> -	struct resource _pmem_res;
> -	struct resource _ram_res;
> +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
> +	unsigned int nr_partitions;
>  	u64 serial;
>  	enum cxl_devtype type;
>  	struct cxl_mailbox cxl_mbox;
>  };
>  
> -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>  {
> -	return &cxlds->_ram_res;
> +	if (cxlds->nr_partitions > 0)
> +		return &cxlds->part[CXL_PARTITION_RAM].res;
> +	return NULL;
>  }
>  
> -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>  {
> -	return &cxlds->_pmem_res;
> +	if (cxlds->nr_partitions > 1)

This is very confusing, as nr_partitions is being used not to indicate the
number of partitions but whether a driver has filled in the data for them
(which may well be empty).

I'd rather see that as a bitmap, or a 'not set' value initialized by
the core that is then replaced when they are set.


> +		return &cxlds->part[CXL_PARTITION_PMEM].res;
> +	return NULL;
> +}
> +
> +static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> +{
> +	if (cxlds->nr_partitions > 0)
> +		return &cxlds->part[CXL_PARTITION_RAM].perf;
> +	return NULL;
> +}
> +
> +static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> +{
> +	if (cxlds->nr_partitions > 1)
> +		return &cxlds->part[CXL_PARTITION_PMEM].perf;
> +	return NULL;
>  }


> @@ -860,7 +883,7 @@ int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
>  void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
>  				unsigned long *cmds);

> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 7f1c5061307b..ba3d48b37de3 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -1001,26 +1001,19 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
> -	const struct resource *partition[] = {
> -		to_ram_res(cxlds),
> -		to_pmem_res(cxlds),
> -	};
> -	struct cxl_dpa_perf *perf[] = {
> -		to_ram_perf(cxlds),
> -		to_pmem_perf(cxlds),
> -	};

Ok. This gets rid of some of the earlier concerns.

>  
>  	if (!cxl_root)
>  		return;
>  
> -	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> -		const struct resource *res = partition[i];
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		struct resource *res = &cxlds->part[i].res;
> +		struct cxl_dpa_perf *perf = &cxlds->part[i].perf;
>  		struct range range = {
>  			.start = res->start,
>  			.end = res->end,
>  		};
>  
> -		dpa_perf_setup(port, &range, perf[i]);
> +		dpa_perf_setup(port, &range, perf);
>  	}
>  
>  	cxl_memdev_update_perf(cxlmd);



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
@ 2025-01-17 11:12   ` Jonathan Cameron
  2025-01-17 18:37     ` Dan Williams
  2025-01-17 15:42   ` Alejandro Lucero Palau
  1 sibling, 1 reply; 32+ messages in thread
From: Jonathan Cameron @ 2025-01-17 11:12 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

On Thu, 16 Jan 2025 22:10:50 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
> 
> The rules for cxl_dpa_alloc() are:
> 
> - allocations can only come from 1 partition
> 
> - if allocating at partition-index-N, all free space in partitions less
>   than partition-index-N must be skipped over
> 
> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
> 
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata.

If we move to metadata and these were tightly packed, then I'd be fine
with nr_partitions.  Until that step, I find it confusing.

A few comments inline.  This series does bring some advantages, though at
the cost of code that needs a bit more documentation at the very least.

> 
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/hdm.c |  167 +++++++++++++++++++++++++++++++++---------------
>  drivers/cxl/cxlmem.h   |    9 +++
>  2 files changed, 125 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7e1559b3ed88..4a2816102a1e 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -223,6 +223,30 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
>  

Some documentation would be useful. I'm not sure I understand
this algorithm correctly.

I think this complexity is all about ensuring that skip regions have
their resources broken up on partition boundaries?

Can we potentially relax the constraints a little more, to make this
easier to read by not caring about the ordering?  Find the overlap of the
skip region with any partition and remove that bit unconditionally.

> +static void release_skip(struct cxl_dev_state *cxlds,
> +			 const resource_size_t skip_base,
> +			 const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		resource_size_t skip_end, skip_size;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +
> +		if (!skip_rem)
> +			break;
> +	}
> +}
> +
>  /*
>   * Must be called in a context that synchronizes against this decoder's
>   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -241,7 +265,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	skip_start = res->start - cxled->skip;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
>  	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +		release_skip(cxlds, skip_start, cxled->skip);
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -268,6 +292,47 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int request_skip(struct cxl_dev_state *cxlds,
> +			struct cxl_endpoint_decoder *cxled,
> +			const resource_size_t skip_base,
> +			const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {

Likewise, if we relax the constraint on ordering, can we make this simpler?
We would just need to keep track of whether we had reserved enough. I'm not
100% sure that is sufficient for the final error check.

> +		const struct resource *part_res = &cxlds->part[i].res;
> +		struct cxl_port *port = cxled_to_port(cxled);
> +		resource_size_t skip_end, skip_size;
> +		struct resource *res;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +
> +		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
> +				       dev_name(&cxled->cxld.dev), 0);
> +		if (!res) {
> +			dev_dbg(cxlds->dev,
> +				"decoder%d.%d: failed to reserve skipped space\n",
> +				port->id, cxled->cxld.id);
> +			break;
> +		}
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +		if (!skip_rem)
> +			break;
> +	}
> +
> +	if (skip_rem == 0)
> +		return 0;
> +
> +	release_skip(cxlds, skip_base, skip_len - skip_rem);
Ah, this complicates the possibility of relaxations, as we'd need to pass in
which partition number we'd reached when the failure occurred.
Maybe this is the best algorithm, but I'd definitely like docs for this
function to make it clear what its assumptions are (partitions in order of DPA, etc.)
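A kernel-doc header along these lines would capture those assumptions (a
sketch only; the wording is mine, not from the patch):

```c
/**
 * request_skip() - reserve skipped DPA capacity across partitions
 * @cxlds: device DPA state; assumes @part[] is sorted by increasing DPA
 * @cxled: endpoint decoder on whose behalf the skip is reserved
 * @skip_base: DPA start of the skip region
 * @skip_len: length of the skip region in bytes
 *
 * Walks the partitions in DPA order and reserves the portion of the skip
 * region that falls within each, so that skip resources are broken up on
 * partition boundaries. On failure, any partially reserved skip space is
 * released before returning.
 *
 * Return: 0 on success, -EBUSY if the skip could not be fully reserved.
 */
```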
> +
> +	return -EBUSY;
> +}
> +


> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 2e728d4b7327..43acd48b300f 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -515,6 +515,15 @@ static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
>  	return resource_size(res);
>  }
>  
> +static inline bool cxl_partition_contains(struct cxl_dev_state *cxlds,
> +					  unsigned int part,
> +					  struct resource *res)
> +{
> +	if (part >= cxlds->nr_partitions)
> +		return false;
> +	return resource_contains(&cxlds->part[part].res, res);
As per previous review: a zero-size resource never contains anything, so we can
drop the check on nr_partitions and instead check against a MAX that the core
initializes to empty (and that might be overwritten by the drivers).
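As a sketch of the semantics that check relies on (an empty range contains
nothing), with stand-in types and hypothetical names rather than the kernel's
'struct resource':

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for 'struct resource'; end < start models an empty slot. */
struct range_model { uint64_t start, end; };

/* Relaxed containment check: empty entries simply never contain. */
static bool range_model_contains(const struct range_model *parent,
				 const struct range_model *child)
{
	if (parent->end < parent->start)	/* empty: contains nothing */
		return false;
	return parent->start <= child->start && child->end <= parent->end;
}
```

With that property, iterating over all MAX slots is safe: the prefilled empty
slots fail the containment test without any explicit nr_partitions bound.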
> +}
> +
>  static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>  {
>  	return dev_get_drvdata(cxl_mbox->host);
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
  2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
  2025-01-17 10:20   ` Jonathan Cameron
@ 2025-01-17 13:33   ` Alejandro Lucero Palau
  2025-01-17 20:47     ` Dan Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-17 13:33 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny


On 1/17/25 06:10, Dan Williams wrote:
> In preparation for consolidating all DPA partition information into an
> array of DPA metadata, introduce helpers that hide the layout of the
> current data. I.e. make the eventual replacement of ->ram_res,
> ->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a
> no-op for code paths that consume that information, and reduce the noise
> of follow-on patches.
>
> The end goal is to consolidate all DPA information in 'struct
> cxl_dev_state', but for now the helpers just make it appear that all DPA
> metadata is relative to @cxlds.
>
> Note that a follow-on patch also cleans up the temporary placeholders of
> @ram_res, and @pmem_res in the qos_class manipulation code,
> cxl_dpa_alloc(), and cxl_mem_create_range_info().
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>   drivers/cxl/core/cdat.c      |   70 +++++++++++++++++++++++++-----------------
>   drivers/cxl/core/hdm.c       |   26 ++++++++--------
>   drivers/cxl/core/mbox.c      |   18 ++++++-----
>   drivers/cxl/core/memdev.c    |   42 +++++++++++++------------
>   drivers/cxl/core/region.c    |   10 ++++--
>   drivers/cxl/cxlmem.h         |   58 ++++++++++++++++++++++++++++++-----
>   drivers/cxl/mem.c            |    2 +
>   tools/testing/cxl/test/cxl.c |   25 ++++++++-------
>   8 files changed, 159 insertions(+), 92 deletions(-)
>
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index 8153f8d83a16..b177a488e29b 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -258,29 +258,33 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
>   static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>   				     struct xarray *dsmas_xa)
>   {
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>   	struct device *dev = cxlds->dev;
> -	struct range pmem_range = {
> -		.start = cxlds->pmem_res.start,
> -		.end = cxlds->pmem_res.end,
> -	};
> -	struct range ram_range = {
> -		.start = cxlds->ram_res.start,
> -		.end = cxlds->ram_res.end,
> -	};
>   	struct dsmas_entry *dent;
>   	unsigned long index;
> +	const struct resource *partition[] = {
> +		to_ram_res(cxlds),
> +		to_pmem_res(cxlds),
> +	};
> +	struct cxl_dpa_perf *perf[] = {
> +		to_ram_perf(cxlds),
> +		to_pmem_perf(cxlds),
> +	};
>   
>   	xa_for_each(dsmas_xa, index, dent) {
> -		if (resource_size(&cxlds->ram_res) &&
> -		    range_contains(&ram_range, &dent->dpa_range))
> -			update_perf_entry(dev, dent, &mds->ram_perf);
> -		else if (resource_size(&cxlds->pmem_res) &&
> -			 range_contains(&pmem_range, &dent->dpa_range))
> -			update_perf_entry(dev, dent, &mds->pmem_perf);
> -		else
> -			dev_dbg(dev, "no partition for dsmas dpa: %pra\n",
> -				&dent->dpa_range);
> +		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> +			const struct resource *res = partition[i];
> +			struct range range = {
> +				.start = res->start,
> +				.end = res->end,
> +			};
> +
> +			if (range_contains(&range, &dent->dpa_range))
> +				update_perf_entry(dev, dent, perf[i]);
> +			else
> +				dev_dbg(dev,
> +					"no partition for dsmas dpa: %pra\n",
> +					&dent->dpa_range);
> +		}
>   	}
>   }
>   
> @@ -304,6 +308,9 @@ static int match_cxlrd_qos_class(struct device *dev, void *data)
>   
>   static void reset_dpa_perf(struct cxl_dpa_perf *dpa_perf)
>   {
> +	if (!dpa_perf)
> +		return;
> +
>   	*dpa_perf = (struct cxl_dpa_perf) {
>   		.qos_class = CXL_QOS_CLASS_INVALID,
>   	};
> @@ -312,6 +319,9 @@ static void reset_dpa_perf(struct cxl_dpa_perf *dpa_perf)
>   static bool cxl_qos_match(struct cxl_port *root_port,
>   			  struct cxl_dpa_perf *dpa_perf)
>   {
> +	if (!dpa_perf)
> +		return false;
> +
>   	if (dpa_perf->qos_class == CXL_QOS_CLASS_INVALID)
>   		return false;
>   
> @@ -346,7 +356,8 @@ static int match_cxlrd_hb(struct device *dev, void *data)
>   static int cxl_qos_class_verify(struct cxl_memdev *cxlmd)
>   {
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> +	struct cxl_dpa_perf *ram_perf = to_ram_perf(cxlds),
> +			    *pmem_perf = to_pmem_perf(cxlds);
>   	struct cxl_port *root_port;
>   	int rc;
>   
> @@ -359,17 +370,17 @@ static int cxl_qos_class_verify(struct cxl_memdev *cxlmd)
>   	root_port = &cxl_root->port;
>   
>   	/* Check that the QTG IDs are all sane between end device and root decoders */
> -	if (!cxl_qos_match(root_port, &mds->ram_perf))
> -		reset_dpa_perf(&mds->ram_perf);
> -	if (!cxl_qos_match(root_port, &mds->pmem_perf))
> -		reset_dpa_perf(&mds->pmem_perf);
> +	if (!cxl_qos_match(root_port, ram_perf))
> +		reset_dpa_perf(ram_perf);
> +	if (!cxl_qos_match(root_port, pmem_perf))
> +		reset_dpa_perf(pmem_perf);
>   
>   	/* Check to make sure that the device's host bridge is under a root decoder */
>   	rc = device_for_each_child(&root_port->dev,
>   				   cxlmd->endpoint->host_bridge, match_cxlrd_hb);
>   	if (!rc) {
> -		reset_dpa_perf(&mds->ram_perf);
> -		reset_dpa_perf(&mds->pmem_perf);
> +		reset_dpa_perf(ram_perf);
> +		reset_dpa_perf(pmem_perf);
>   	}
>   
>   	return rc;
> @@ -567,6 +578,9 @@ static bool dpa_perf_contains(struct cxl_dpa_perf *perf,
>   		.end = dpa_res->end,
>   	};
>   
> +	if (!perf)
> +		return false;
> +
>   	return range_contains(&perf->dpa_range, &dpa);
>   }
>   
> @@ -574,15 +588,15 @@ static struct cxl_dpa_perf *cxled_get_dpa_perf(struct cxl_endpoint_decoder *cxle
>   					       enum cxl_decoder_mode mode)
>   {
>   	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>   	struct cxl_dpa_perf *perf;
>   
>   	switch (mode) {
>   	case CXL_DECODER_RAM:
> -		perf = &mds->ram_perf;
> +		perf = to_ram_perf(cxlds);
>   		break;
>   	case CXL_DECODER_PMEM:
> -		perf = &mds->pmem_perf;
> +		perf = to_pmem_perf(cxlds);
>   		break;
>   	default:
>   		return ERR_PTR(-EINVAL);
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index be8556119d94..7a85522294ad 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -327,9 +327,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   	cxled->dpa_res = res;
>   	cxled->skip = skipped;
>   
> -	if (resource_contains(&cxlds->pmem_res, res))
> +	if (resource_contains(to_pmem_res(cxlds), res))
>   		cxled->mode = CXL_DECODER_PMEM;
> -	if (resource_contains(&cxlds->ram_res, res))
> +	else if (resource_contains(to_ram_res(cxlds), res))
>   		cxled->mode = CXL_DECODER_RAM;
>   	else {
>   		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
> @@ -442,11 +442,11 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>   	 * Only allow modes that are supported by the current partition
>   	 * configuration
>   	 */
> -	if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
> +	if (mode == CXL_DECODER_PMEM && !cxl_pmem_size(cxlds)) {
>   		dev_dbg(dev, "no available pmem capacity\n");
>   		return -ENXIO;
>   	}
> -	if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
> +	if (mode == CXL_DECODER_RAM && !cxl_ram_size(cxlds)) {
>   		dev_dbg(dev, "no available ram capacity\n");
>   		return -ENXIO;
>   	}
> @@ -464,6 +464,8 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>   	struct device *dev = &cxled->cxld.dev;
>   	resource_size_t start, avail, skip;
>   	struct resource *p, *last;
> +	const struct resource *ram_res = to_ram_res(cxlds);
> +	const struct resource *pmem_res = to_pmem_res(cxlds);
>   	int rc;
>   
>   	down_write(&cxl_dpa_rwsem);
> @@ -480,37 +482,37 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>   		goto out;
>   	}
>   
> -	for (p = cxlds->ram_res.child, last = NULL; p; p = p->sibling)
> +	for (p = ram_res->child, last = NULL; p; p = p->sibling)
>   		last = p;
>   	if (last)
>   		free_ram_start = last->end + 1;
>   	else
> -		free_ram_start = cxlds->ram_res.start;
> +		free_ram_start = ram_res->start;
>   
> -	for (p = cxlds->pmem_res.child, last = NULL; p; p = p->sibling)
> +	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
>   		last = p;
>   	if (last)
>   		free_pmem_start = last->end + 1;
>   	else
> -		free_pmem_start = cxlds->pmem_res.start;
> +		free_pmem_start = pmem_res->start;
>   
>   	if (cxled->mode == CXL_DECODER_RAM) {
>   		start = free_ram_start;
> -		avail = cxlds->ram_res.end - start + 1;
> +		avail = ram_res->end - start + 1;
>   		skip = 0;
>   	} else if (cxled->mode == CXL_DECODER_PMEM) {
>   		resource_size_t skip_start, skip_end;
>   
>   		start = free_pmem_start;
> -		avail = cxlds->pmem_res.end - start + 1;
> +		avail = pmem_res->end - start + 1;
>   		skip_start = free_ram_start;
>   
>   		/*
>   		 * If some pmem is already allocated, then that allocation
>   		 * already handled the skip.
>   		 */
> -		if (cxlds->pmem_res.child &&
> -		    skip_start == cxlds->pmem_res.child->start)
> +		if (pmem_res->child &&
> +		    skip_start == pmem_res->child->start)
>   			skip_end = skip_start - 1;
>   		else
>   			skip_end = start - 1;
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 548564c770c0..3502f1633ad2 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1270,24 +1270,26 @@ static int add_dpa_res(struct device *dev, struct resource *parent,
>   int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   {
>   	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	struct resource *ram_res = to_ram_res(cxlds);
> +	struct resource *pmem_res = to_pmem_res(cxlds);
>   	struct device *dev = cxlds->dev;
>   	int rc;
>   
>   	if (!cxlds->media_ready) {
>   		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> -		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> -		cxlds->pmem_res = DEFINE_RES_MEM(0, 0);
> +		*ram_res = DEFINE_RES_MEM(0, 0);
> +		*pmem_res = DEFINE_RES_MEM(0, 0);
>   		return 0;
>   	}
>   
>   	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>   
>   	if (mds->partition_align_bytes == 0) {
> -		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>   				 mds->volatile_only_bytes, "ram");
>   		if (rc)
>   			return rc;
> -		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
> +		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>   				   mds->volatile_only_bytes,
>   				   mds->persistent_only_bytes, "pmem");
>   	}
> @@ -1298,11 +1300,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   		return rc;
>   	}
>   
> -	rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> +	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>   			 mds->active_volatile_bytes, "ram");
>   	if (rc)
>   		return rc;
> -	return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
> +	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>   			   mds->active_volatile_bytes,
>   			   mds->active_persistent_bytes, "pmem");
>   }
> @@ -1450,8 +1452,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>   	mds->cxlds.reg_map.host = dev;
>   	mds->cxlds.reg_map.resource = CXL_RESOURCE_NONE;
>   	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> -	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
> -	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
> +	to_ram_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
> +	to_pmem_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
>   
>   	return mds;
>   }
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index ae3dfcbe8938..c5f8320ed330 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -80,7 +80,7 @@ static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
>   {
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	unsigned long long len = resource_size(&cxlds->ram_res);
> +	unsigned long long len = resource_size(to_ram_res(cxlds));
>   
>   	return sysfs_emit(buf, "%#llx\n", len);
>   }
> @@ -93,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>   {
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	unsigned long long len = resource_size(&cxlds->pmem_res);
> +	unsigned long long len = cxl_pmem_size(cxlds);
>   
>   	return sysfs_emit(buf, "%#llx\n", len);
>   }
> @@ -198,16 +198,20 @@ static int cxl_get_poison_by_memdev(struct cxl_memdev *cxlmd)
>   	int rc = 0;
>   
>   	/* CXL 3.0 Spec 8.2.9.8.4.1 Separate pmem and ram poison requests */
> -	if (resource_size(&cxlds->pmem_res)) {
> -		offset = cxlds->pmem_res.start;
> -		length = resource_size(&cxlds->pmem_res);
> +	if (cxl_pmem_size(cxlds)) {
> +		const struct resource *res = to_pmem_res(cxlds);
> +
> +		offset = res->start;
> +		length = resource_size(res);
>   		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
>   		if (rc)
>   			return rc;
>   	}
> -	if (resource_size(&cxlds->ram_res)) {
> -		offset = cxlds->ram_res.start;
> -		length = resource_size(&cxlds->ram_res);
> +	if (cxl_ram_size(cxlds)) {
> +		const struct resource *res = to_ram_res(cxlds);
> +
> +		offset = res->start;
> +		length = resource_size(res);
>   		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
>   		/*
>   		 * Invalid Physical Address is not an error for
> @@ -409,9 +413,8 @@ static ssize_t pmem_qos_class_show(struct device *dev,
>   {
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>   
> -	return sysfs_emit(buf, "%d\n", mds->pmem_perf.qos_class);
> +	return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
>   }
>   
>   static struct device_attribute dev_attr_pmem_qos_class =
> @@ -428,9 +431,8 @@ static ssize_t ram_qos_class_show(struct device *dev,
>   {
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>   
> -	return sysfs_emit(buf, "%d\n", mds->ram_perf.qos_class);
> +	return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
>   }
>   
>   static struct device_attribute dev_attr_ram_qos_class =
> @@ -466,11 +468,11 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
>   {
>   	struct device *dev = kobj_to_dev(kobj);
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
>   
> -	if (a == &dev_attr_ram_qos_class.attr)
> -		if (mds->ram_perf.qos_class == CXL_QOS_CLASS_INVALID)
> -			return 0;
> +	if (a == &dev_attr_ram_qos_class.attr &&
> +	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> +		return 0;
>   
>   	return a->mode;
>   }
> @@ -485,11 +487,11 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
>   {
>   	struct device *dev = kobj_to_dev(kobj);
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
>   
> -	if (a == &dev_attr_pmem_qos_class.attr)
> -		if (mds->pmem_perf.qos_class == CXL_QOS_CLASS_INVALID)
> -			return 0;
> +	if (a == &dev_attr_pmem_qos_class.attr &&
> +	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> +		return 0;
>   
>   	return a->mode;
>   }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e4885acac853..9f0f6fdbc841 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2688,7 +2688,7 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
>   
>   	if (ctx->mode == CXL_DECODER_RAM) {
>   		offset = ctx->offset;
> -		length = resource_size(&cxlds->ram_res) - offset;
> +		length = cxl_ram_size(cxlds) - offset;
>   		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
>   		if (rc == -EFAULT)
>   			rc = 0;
> @@ -2700,9 +2700,11 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
>   		length = resource_size(&cxlds->dpa_res) - offset;
>   		if (!length)
>   			return 0;
> -	} else if (resource_size(&cxlds->pmem_res)) {
> -		offset = cxlds->pmem_res.start;
> -		length = resource_size(&cxlds->pmem_res);
> +	} else if (cxl_pmem_size(cxlds)) {
> +		const struct resource *res = to_pmem_res(cxlds);
> +
> +		offset = res->start;
> +		length = resource_size(res);
>   	} else {
>   		return 0;
>   	}
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 2a25d1957ddb..78e92e24d7b5 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -423,8 +423,8 @@ struct cxl_dpa_perf {
>    * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
>    * @media_ready: Indicate whether the device media is usable
>    * @dpa_res: Overall DPA resource tree for the device
> - * @pmem_res: Active Persistent memory capacity configuration
> - * @ram_res: Active Volatile memory capacity configuration
> + * @_pmem_res: Active Persistent memory capacity configuration
> + * @_ram_res: Active Volatile memory capacity configuration
>    * @serial: PCIe Device Serial Number
>    * @type: Generic Memory Class device or Vendor Specific Memory device
>    * @cxl_mbox: CXL mailbox context
> @@ -438,13 +438,41 @@ struct cxl_dev_state {
>   	bool rcd;
>   	bool media_ready;
>   	struct resource dpa_res;
> -	struct resource pmem_res;
> -	struct resource ram_res;
> +	struct resource _pmem_res;
> +	struct resource _ram_res;


I think this is unnecessary since it is clear those fields are going
away later on, and this change only adds confusion. Moreover, they are
not referenced in the code now because of the helpers.


>   	u64 serial;
>   	enum cxl_devtype type;
>   	struct cxl_mailbox cxl_mbox;
>   };
>   
> +static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> +{
> +	return &cxlds->_ram_res;
> +}
> +
> +static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> +{
> +	return &cxlds->_pmem_res;
> +}
> +
> +static inline resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
> +{
> +	const struct resource *res = to_ram_res(cxlds);
> +
> +	if (!res)
> +		return 0;


This check is not needed now, and with the change in the next patch, I think
it should not be needed either.

Do we need the distinction between no ram or no pmem, and ram/pmem with
size 0?


> +	return resource_size(res);
> +}
> +
> +static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
> +{
> +	const struct resource *res = to_pmem_res(cxlds);
> +
> +	if (!res)
> +		return 0;
> +	return resource_size(res);
> +}
> +
>   static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>   {
>   	return dev_get_drvdata(cxl_mbox->host);
> @@ -471,8 +499,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>    * @active_persistent_bytes: sum of hard + soft persistent
>    * @next_volatile_bytes: volatile capacity change pending device reset
>    * @next_persistent_bytes: persistent capacity change pending device reset
> - * @ram_perf: performance data entry matched to RAM partition
> - * @pmem_perf: performance data entry matched to PMEM partition
> + * @_ram_perf: performance data entry matched to RAM partition
> + * @_pmem_perf: performance data entry matched to PMEM partition
>    * @event: event log driver state
>    * @poison: poison driver state info
>    * @security: security driver state info
> @@ -496,8 +524,8 @@ struct cxl_memdev_state {
>   	u64 next_volatile_bytes;
>   	u64 next_persistent_bytes;
>   
> -	struct cxl_dpa_perf ram_perf;
> -	struct cxl_dpa_perf pmem_perf;
> +	struct cxl_dpa_perf _ram_perf;
> +	struct cxl_dpa_perf _pmem_perf;
>   
>   	struct cxl_event_state event;
>   	struct cxl_poison_state poison;
> @@ -505,6 +533,20 @@ struct cxl_memdev_state {
>   	struct cxl_fw_state fw;
>   };
>   
> +static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> +{
> +	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
> +
> +	return &mds->_ram_perf;
> +}
> +
> +static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> +{
> +	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
> +
> +	return &mds->_pmem_perf;
> +}
> +
>   static inline struct cxl_memdev_state *
>   to_cxl_memdev_state(struct cxl_dev_state *cxlds)
>   {
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 2f03a4d5606e..9675243bd05b 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -152,7 +152,7 @@ static int cxl_mem_probe(struct device *dev)
>   		return -ENXIO;
>   	}
>   
> -	if (resource_size(&cxlds->pmem_res) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> +	if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
>   		rc = devm_cxl_add_nvdimm(parent_port, cxlmd);
>   		if (rc) {
>   			if (rc == -ENODEV)
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index d0337c11f9ee..7f1c5061307b 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -1000,25 +1000,28 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
>   		find_cxl_root(port);
>   	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>   	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
> -	struct range pmem_range = {
> -		.start = cxlds->pmem_res.start,
> -		.end = cxlds->pmem_res.end,
> +	const struct resource *partition[] = {
> +		to_ram_res(cxlds),
> +		to_pmem_res(cxlds),
>   	};
> -	struct range ram_range = {
> -		.start = cxlds->ram_res.start,
> -		.end = cxlds->ram_res.end,
> +	struct cxl_dpa_perf *perf[] = {
> +		to_ram_perf(cxlds),
> +		to_pmem_perf(cxlds),
>   	};
>   
>   	if (!cxl_root)
>   		return;
>   
> -	if (range_len(&ram_range))
> -		dpa_perf_setup(port, &ram_range, &mds->ram_perf);
> +	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> +		const struct resource *res = partition[i];
> +		struct range range = {
> +			.start = res->start,
> +			.end = res->end,
> +		};
>   
> -	if (range_len(&pmem_range))
> -		dpa_perf_setup(port, &pmem_range, &mds->pmem_perf);
> +		dpa_perf_setup(port, &range, perf[i]);
> +	}
>   
>   	cxl_memdev_update_perf(cxlmd);
>   
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17 10:52   ` Jonathan Cameron
@ 2025-01-17 13:38     ` Alejandro Lucero Palau
  2025-01-17 18:23     ` Dan Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-17 13:38 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams; +Cc: linux-cxl, Dave Jiang, Ira Weiny


On 1/17/25 10:52, Jonathan Cameron wrote:
> On Thu, 16 Jan 2025 22:10:44 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The pending efforts to add CXL Accelerator (type-2) device [1], and
>> Dynamic Capacity (DCD) support [2], tripped on the
>> no-longer-fit-for-purpose design in the CXL subsystem for tracking
>> device-physical-address (DPA) metadata. Trip hazards include:
>>
>> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
>>    devices with CXL.mem likely do not in the common case.
>>
>> - CXL Memory Devices enumerate DPA through Memory Device mailbox
>>    commands like Partition Info; Accelerator devices do not.
>>
>> - CXL Memory Devices that support DCD support more than 2 partitions.
>>    Some of the driver algorithms are awkward to expand to > 2 partition
>>    cases.
>>
>> - DPA performance data is a general capability that can be shared with
>>    accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
>>    suitable.
>>
>> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
>>    memory property, it should be phased in favor of a partition id and
>>    the memory property comes from the partition info.
>>
>> Towards cleaning up those issues and allowing a smoother landing for the
>> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
>> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
>> way for Memory Devices and Accelerators to initialize the DPA information
>> in 'struct cxl_dev_state'.
>>
>> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
>> get the new data structure initialized, and cleanup some qos_class init.
>> Follow on patches will go further to use the new data structure to
>> cleanup algorithms that are better suited to loop over all possible
>> partitions.
>>
>> cxl_dpa_setup() follows the locking expectations of mutating the device
>> DPA map, and is suitable for Accelerator drivers to use. Accelerators
>> likely only have one hardcoded 'ram' partition to convey to the
>> cxl_core.
>>
>> Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
>> Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Alejandro Lucero <alucerop@amd.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Hi Dan,
>
> In basic form this seems fine, but I find the nr_partitions variable usage very
> counterintuitive.  It's just how many we configured, not how many there
> are, potentially with 0 size (so not a partition).  I'd be happier if we
> can avoid that by just prefilling the lot with zero size and filling in
> the ones we want.  So zero size means it doesn't exist, and use an iterator where
> appropriate to skip the zero-size ones.
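Modelled in plain C, the prefill-and-skip scheme would look something like
the following (CXL_MAX_PARTS and all names here are hypothetical stand-ins,
not the kernel structures):

```c
#include <assert.h>
#include <stdint.h>

#define CXL_MAX_PARTS 4	/* hypothetical fixed bound prefilled by the core */

/* Stand-in partition entry; size == 0 means "slot not configured". */
struct part_model { uint64_t start, size; };

/* Visit only configured partitions; empty prefilled slots are skipped. */
static int count_present(const struct part_model parts[CXL_MAX_PARTS])
{
	int n = 0;

	for (int i = 0; i < CXL_MAX_PARTS; i++) {
		if (!parts[i].size)
			continue;	/* zero size: does not exist */
		n++;
	}
	return n;
}
```

Callers then never consult an nr_partitions count at all; every loop runs to
the fixed bound and the zero-size test does the filtering.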
>
> Without that tidied up, to me this is more confusing than the previous code.
>
> Jonathan
>
>> ---
>>   drivers/cxl/core/cdat.c      |   15 ++-----
>>   drivers/cxl/core/hdm.c       |   69 ++++++++++++++++++++++++++++++++++
>>   drivers/cxl/core/mbox.c      |   86 ++++++++++++++++++------------------------
>>   drivers/cxl/cxlmem.h         |   79 +++++++++++++++++++++++++--------------
>>   drivers/cxl/pci.c            |    7 +++
>>   tools/testing/cxl/test/cxl.c |   15 ++-----
>>   tools/testing/cxl/test/mem.c |    7 +++
>>   7 files changed, 176 insertions(+), 102 deletions(-)
>>
>> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
>> index b177a488e29b..5400a421ad30 100644
>> --- a/drivers/cxl/core/cdat.c
>> +++ b/drivers/cxl/core/cdat.c
>> @@ -261,25 +261,18 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>>   	struct device *dev = cxlds->dev;
>>   	struct dsmas_entry *dent;
>>   	unsigned long index;
>> -	const struct resource *partition[] = {
>> -		to_ram_res(cxlds),
>> -		to_pmem_res(cxlds),
>> -	};
>> -	struct cxl_dpa_perf *perf[] = {
>> -		to_ram_perf(cxlds),
>> -		to_pmem_perf(cxlds),
>> -	};
> Ok. This removes some of the concerns from previous patch.
>
>>   
>>   	xa_for_each(dsmas_xa, index, dent) {
>> -		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
>> -			const struct resource *res = partition[i];
>> +		for (int i = 0; i < cxlds->nr_partitions; i++) {
>> +			struct resource *res = &cxlds->part[i].res;
>>   			struct range range = {
>>   				.start = res->start,
>>   				.end = res->end,
>>   			};
>>   
>>   			if (range_contains(&range, &dent->dpa_range))
>> -				update_perf_entry(dev, dent, perf[i]);
>> +				update_perf_entry(dev, dent,
>> +						  &cxlds->part[i].perf);
>>   			else
>>   				dev_dbg(dev,
>>   					"no partition for dsmas dpa: %pra\n",
>> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
>> index 7a85522294ad..7e1559b3ed88 100644
>> --- a/drivers/cxl/core/hdm.c
>> +++ b/drivers/cxl/core/hdm.c
>> @@ -342,6 +342,75 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>   	return 0;
>>   }
>>   
>> +static int add_dpa_res(struct device *dev, struct resource *parent,
>> +		       struct resource *res, resource_size_t start,
>> +		       resource_size_t size, const char *type)
>> +{
>> +	int rc;
>> +
>> +	*res = (struct resource) {
>> +		.name = type,
>> +		.start = start,
>> +		.end =  start + size - 1,
>> +		.flags = IORESOURCE_MEM,
>> +	};
>> +	if (resource_size(res) == 0) {
>> +		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
>> +		return 0;
>> +	}
>> +	rc = request_resource(parent, res);
>> +	if (rc) {
>> +		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
>> +			res, rc);
>> +		return rc;
>> +	}
>> +
>> +	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
>> +
>> +	return 0;
>> +}
>> +
>> +/* if this fails the caller must destroy @cxlds, there is no recovery */
>> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
>> +{
>> +	struct device *dev = cxlds->dev;
>> +
>> +	guard(rwsem_write)(&cxl_dpa_rwsem);
>> +
>> +	if (cxlds->nr_partitions)
>> +		return -EBUSY;
>> +
>> +	if (!info->size || !info->nr_partitions) {
>> +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>> +		cxlds->nr_partitions = 0;
>> +		return 0;
>> +	}
>> +
>> +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>> +
>> +	for (int i = 0; i < info->nr_partitions; i++) {
>> +		const char *desc;
>> +		int rc;
>> +
>> +		if (i == CXL_PARTITION_RAM)
>> +			desc = "ram";
>> +		else if (i == CXL_PARTITION_PMEM)
>> +			desc = "pmem";
>> +		else
>> +			desc = "";
>> +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
>> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
>> +				 info->range[i].start,
>> +				 range_len(&info->range[i]), desc);
>> +		if (rc)
>> +			return rc;
>> +		cxlds->nr_partitions++;
> I'd just initialize the rest to 0 length, similar to what happens
> if we have pmem only anyway. Then this nr_partitions goes away and
> stops being a possible source of confusion.
>
>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
>> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
>> index 3502f1633ad2..7dca5c8c3494 100644
>> --- a/drivers/cxl/core/mbox.c
>> +++ b/drivers/cxl/core/mbox.c
>> @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>>   {
>>   	struct cxl_dev_state *cxlds = &mds->cxlds;
>> -	struct resource *ram_res = to_ram_res(cxlds);
>> -	struct resource *pmem_res = to_pmem_res(cxlds);
>>   	struct device *dev = cxlds->dev;
>>   	int rc;
>>   
>>   	if (!cxlds->media_ready) {
>> -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>> -		*ram_res = DEFINE_RES_MEM(0, 0);
>> -		*pmem_res = DEFINE_RES_MEM(0, 0);
>> +		info->size = 0;
>>   		return 0;
>>   	}
>>   
>> -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>> +	info->size = mds->total_bytes;
>>   
>>   	if (mds->partition_align_bytes == 0) {
> Obviously nothing to do with your patch as such, but maybe tidy this up
> by making the active values equal the fixed values when we don't have
> partition control. That seems logical to me anyway, and means we only end up
> with one lot of range setup in here. I can't immediately see any side effects
> of doing this.
>
>
>
> 	if (mds->partition_align_bytes != 0) {
> 		rc = cxl_mem_get_partition_info(mds);
> 		if (rc)
> 			return rc;
> 	} else {
> 		mds->active_volatile_bytes = mds->volatile_only_bytes;
> 		mds->active_persistent_bytes = mds->persistent_only_bytes;
> 	}
>   	info->range[CXL_PARTITION_RAM] = (struct range) {
> 		.start = 0,
> 		.end = mds->active_volatile_bytes - 1,
> 	};
> 	info->nr_partitions++;
>
> 	if (!mds->active_persistent_bytes)
> 		return 0;
>
> 	info->range[CXL_PARTITION_PMEM] = (struct range) {
> 		.start = mds->active_volatile_bytes,
> 		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> 	};
> 	info->nr_partitions++;
>
> 	return 0;
> }
>
>> -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>> -				 mds->volatile_only_bytes, "ram");
>> -		if (rc)
>> -			return rc;
>> -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>> -				   mds->volatile_only_bytes,
>> -				   mds->persistent_only_bytes, "pmem");
>> +		info->range[CXL_PARTITION_RAM] = (struct range) {
>> +			.start = 0,
>> +			.end = mds->volatile_only_bytes - 1,
>> +		};
>> +		info->nr_partitions++;
>> +
>> +		if (!mds->persistent_only_bytes)
>> +			return 0;
>> +
>> +		info->range[CXL_PARTITION_PMEM] = (struct range) {
>> +			.start = mds->volatile_only_bytes,
>> +			.end = mds->volatile_only_bytes +
>> +			       mds->persistent_only_bytes - 1,
>> +		};
>> +		info->nr_partitions++;
> This nr_partitions makes some sense, though I'd be tempted to add a type
> array to info so that we can simply not pass empty ones if we don't want to.
> It makes this code a little more complex, but not a lot, and means
> nr_partitions becomes the number of partitions that actually exist.
>
>> +		return 0;
>>   	}
>>   
>>   	rc = cxl_mem_get_partition_info(mds);
>> @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>>   		return rc;
>>   	}
>>   
>> -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>> -			 mds->active_volatile_bytes, "ram");
>> -	if (rc)
>> -		return rc;
>> -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>> -			   mds->active_volatile_bytes,
>> -			   mds->active_persistent_bytes, "pmem");
>> +	info->range[CXL_PARTITION_RAM] = (struct range) {
>> +		.start = 0,
>> +		.end = mds->active_volatile_bytes - 1,
>> +	};
>> +	info->nr_partitions++;
>> +
>> +	if (!mds->active_persistent_bytes)
>> +		return 0;
>> +
>> +	info->range[CXL_PARTITION_PMEM] = (struct range) {
>> +		.start = mds->active_volatile_bytes,
>> +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
>> +	};
>> +	info->nr_partitions++;
>> +
>> +	return 0;
>>   }
>> -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
>> +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>> index 78e92e24d7b5..2e728d4b7327 100644
>> --- a/drivers/cxl/cxlmem.h
>> +++ b/drivers/cxl/cxlmem.h
>> @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>   			 resource_size_t base, resource_size_t len,
>>   			 resource_size_t skipped);
>>   
>> +/* Well known, spec defined partition indices */
>> +enum cxl_partition {
>> +	CXL_PARTITION_RAM,
>> +	CXL_PARTITION_PMEM,
>> +	CXL_PARTITION_MAX,
>> +};
>> +
>> +struct cxl_dpa_info {
>> +	u64 size;
>> +	struct range range[CXL_PARTITION_MAX];
>> +	int nr_partitions;
>> +};
> blank line seems appropriate here.
>
>> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
>> +
>>   static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>>   					 struct cxl_memdev *cxlmd)
>>   {
>> @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
>>   	int qos_class;
>>   };
>>   
>>   /**
>>    * struct cxl_dev_state - The driver device state
>>    *
>> @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
>>    * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
>>    * @media_ready: Indicate whether the device media is usable
>>    * @dpa_res: Overall DPA resource tree for the device
>> - * @_pmem_res: Active Persistent memory capacity configuration
>> - * @_ram_res: Active Volatile memory capacity configuration
>> + * @part: DPA partition array
>> + * @nr_partitions: Number of DPA partitions
> This needs more explanation. It is not the number of partitions present, I
> think; it is the number that a particular driver is potentially interested in.
>
>>    * @serial: PCIe Device Serial Number
>>    * @type: Generic Memory Class device or Vendor Specific Memory device
>>    * @cxl_mbox: CXL mailbox context
>> @@ -438,21 +462,39 @@ struct cxl_dev_state {
>>   	bool rcd;
>>   	bool media_ready;
>>   	struct resource dpa_res;
>> -	struct resource _pmem_res;
>> -	struct resource _ram_res;
>> +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
>> +	unsigned int nr_partitions;
>>   	u64 serial;
>>   	enum cxl_devtype type;
>>   	struct cxl_mailbox cxl_mbox;
>>   };
>>   
>> -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>> +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>>   {
>> -	return &cxlds->_ram_res;
>> +	if (cxlds->nr_partitions > 0)
>> +		return &cxlds->part[CXL_PARTITION_RAM].res;
>> +	return NULL;
>>   }
>>   
>> -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>> +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>>   {
>> -	return &cxlds->_pmem_res;
>> +	if (cxlds->nr_partitions > 1)
> This is very confusing, as nr_partitions is being used not to indicate the
> number of partitions but whether a driver has filled in the data for them
> (which may well be empty).


I would say this is more than confusing: it is broken. What if a device 
only has pmem?

Number of partitions would be 1 ...

>
> I'd rather see that as a bitmap, or a 'not set' value initialized by
> the core that is then replaced when they are set.


Repeating the doubt I expressed on the previous patch, in other words:

Is it not enough to hand out the reference to the ram/pmem resource and then 
work based on its size?
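A userspace model of that size-based alternative might look like the sketch
below. resource_size() mimics the kernel helper of the same name; part_exists()
and the rest are illustrative names, not proposed kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: always hand out the partition resource and let callers test
 * its size, instead of gating the helpers on nr_partitions.
 */
struct resource {
	uint64_t start;
	uint64_t end;
};

/* matches the kernel convention: an empty resource has end == start - 1 */
static uint64_t resource_size(const struct resource *r)
{
	return r->end - r->start + 1;
}

static bool part_exists(const struct resource *r)
{
	return r && resource_size(r) != 0;
}
```

A pmem-only device would then keep an empty DEFINE_RES_MEM(0, 0) in the RAM
slot and part_exists() would report it as absent without any counter.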


>> +		return &cxlds->part[CXL_PARTITION_PMEM].res;
>> +	return NULL;
>> +}
>> +
>> +static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
>> +{
>> +	if (cxlds->nr_partitions > 0)
>> +		return &cxlds->part[CXL_PARTITION_RAM].perf;
>> +	return NULL;
>> +}
>> +
>> +static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
>> +{
>> +	if (cxlds->nr_partitions > 1)
>> +		return &cxlds->part[CXL_PARTITION_PMEM].perf;
>> +	return NULL;
>>   }
>
>> @@ -860,7 +883,7 @@ int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>>   int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>>   int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>>   int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>>   struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
>>   void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
>>   				unsigned long *cmds);
>> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
>> index 7f1c5061307b..ba3d48b37de3 100644
>> --- a/tools/testing/cxl/test/cxl.c
>> +++ b/tools/testing/cxl/test/cxl.c
>> @@ -1001,26 +1001,19 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
>>   	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>>   	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
>> -	const struct resource *partition[] = {
>> -		to_ram_res(cxlds),
>> -		to_pmem_res(cxlds),
>> -	};
>> -	struct cxl_dpa_perf *perf[] = {
>> -		to_ram_perf(cxlds),
>> -		to_pmem_perf(cxlds),
>> -	};
> Ok. This gets rid of some of the earlier concerns.
>
>>   
>>   	if (!cxl_root)
>>   		return;
>>   
>> -	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
>> -		const struct resource *res = partition[i];
>> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
>> +		struct resource *res = &cxlds->part[i].res;
>> +		struct cxl_dpa_perf *perf = &cxlds->part[i].perf;
>>   		struct range range = {
>>   			.start = res->start,
>>   			.end = res->end,
>>   		};
>>   
>> -		dpa_perf_setup(port, &range, perf[i]);
>> +		dpa_perf_setup(port, &range, perf);
>>   	}
>>   
>>   	cxl_memdev_update_perf(cxlmd);
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
  2025-01-17 11:12   ` Jonathan Cameron
@ 2025-01-17 15:42   ` Alejandro Lucero Palau
  2025-01-17 20:57     ` Dan Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-17 15:42 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny


On 1/17/25 06:10, Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
>
> The rules for cxl_dpa_alloc() are:
>
> - allocations can only come from 1 partition
>
> - if allocating at partition-index-N, all free space in partitions less
>    than partition-index-N must be skipped over


In my view, you are mixing the current code with the new code in this 
explanation. It would be better to say that the current code assumes just two 
partitions, ram and pmem, but DCD changes the game.


> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
>
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata.
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>   drivers/cxl/core/hdm.c |  167 +++++++++++++++++++++++++++++++++---------------
>   drivers/cxl/cxlmem.h   |    9 +++
>   2 files changed, 125 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7e1559b3ed88..4a2816102a1e 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -223,6 +223,30 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
>   
> +static void release_skip(struct cxl_dev_state *cxlds,
> +			 const resource_size_t skip_base,
> +			 const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		resource_size_t skip_end, skip_size;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +
> +		if (!skip_rem)
> +			break;
> +	}
> +}
> +


This implies the skip cannot be based on the last child's end, as the code 
currently implements it.


>   /*
>    * Must be called in a context that synchronizes against this decoder's
>    * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -241,7 +265,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>   	skip_start = res->start - cxled->skip;
>   	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
>   	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +		release_skip(cxlds, skip_start, cxled->skip);
>   	cxled->skip = 0;
>   	cxled->dpa_res = NULL;
>   	put_device(&cxled->cxld.dev);
> @@ -268,6 +292,47 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>   	__cxl_dpa_release(cxled);
>   }
>   
> +static int request_skip(struct cxl_dev_state *cxlds,
> +			struct cxl_endpoint_decoder *cxled,
> +			const resource_size_t skip_base,
> +			const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		struct cxl_port *port = cxled_to_port(cxled);
> +		resource_size_t skip_end, skip_size;
> +		struct resource *res;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +
> +		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
> +				       dev_name(&cxled->cxld.dev), 0);
> +		if (!res) {
> +			dev_dbg(cxlds->dev,
> +				"decoder%d.%d: failed to reserve skipped space\n",
> +				port->id, cxled->cxld.id);
> +			break;
> +		}
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +		if (!skip_rem)
> +			break;
> +	}
> +
> +	if (skip_rem == 0)
> +		return 0;
> +
> +	release_skip(cxlds, skip_base, skip_len - skip_rem);
> +
> +	return -EBUSY;
> +}
> +
>   static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   			     resource_size_t base, resource_size_t len,
>   			     resource_size_t skipped)
> @@ -277,6 +342,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>   	struct device *dev = &port->dev;
>   	struct resource *res;
> +	int rc;
>   
>   	lockdep_assert_held_write(&cxl_dpa_rwsem);
>   
> @@ -305,14 +371,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   	}
>   
>   	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> -		}
> +		rc = request_skip(cxlds, cxled, base - skipped, skipped);
> +		if (rc)
> +			return rc;
>   	}
>   	res = __request_region(&cxlds->dpa_res, base, len,
>   			       dev_name(&cxled->cxld.dev), 0);
> @@ -320,16 +381,15 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
>   			port->id, cxled->cxld.id);
>   		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +			release_skip(cxlds, base - skipped, skipped);
>   		return -EBUSY;
>   	}
>   	cxled->dpa_res = res;
>   	cxled->skip = skipped;
>   
> -	if (resource_contains(to_pmem_res(cxlds), res))
> +	if (cxl_partition_contains(cxlds, CXL_PARTITION_PMEM, res))
>   		cxled->mode = CXL_DECODER_PMEM;
> -	else if (resource_contains(to_ram_res(cxlds), res))
> +	else if (cxl_partition_contains(cxlds, CXL_PARTITION_RAM, res))
>   		cxled->mode = CXL_DECODER_RAM;
>   	else {
>   		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
> @@ -527,15 +587,13 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>   int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>   {
>   	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
>   	struct cxl_port *port = cxled_to_port(cxled);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>   	struct device *dev = &cxled->cxld.dev;
> -	resource_size_t start, avail, skip;
> +	struct resource *res, *prev = NULL;
> +	resource_size_t start, avail, skip, skip_start;
>   	struct resource *p, *last;
> -	const struct resource *ram_res = to_ram_res(cxlds);
> -	const struct resource *pmem_res = to_pmem_res(cxlds);
> -	int rc;
> +	int part, rc;
>   
>   	down_write(&cxl_dpa_rwsem);
>   	if (cxled->cxld.region) {
> @@ -551,47 +609,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>   		goto out;
>   	}
>   
> -	for (p = ram_res->child, last = NULL; p; p = p->sibling)
> -		last = p;
> -	if (last)
> -		free_ram_start = last->end + 1;
> +	if (cxled->mode == CXL_DECODER_RAM)
> +		part = CXL_PARTITION_RAM;
> +	else if (cxled->mode == CXL_DECODER_PMEM)
> +		part = CXL_PARTITION_PMEM;
>   	else
> -		free_ram_start = ram_res->start;
> +		part = cxlds->nr_partitions;
> +
> +	if (part >= cxlds->nr_partitions) {
> +		dev_dbg(dev, "partition %d not found\n", part);
> +		rc = -EBUSY;
> +		goto out;
> +	}
> +
> +	res = &cxlds->part[part].res;
>   
> -	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
> +	for (p = res->child, last = NULL; p; p = p->sibling)
>   		last = p;
>   	if (last)
> -		free_pmem_start = last->end + 1;
> +		start = last->end + 1;
>   	else
> -		free_pmem_start = pmem_res->start;
> +		start = res->start;
>   


As said above, this is not correct if there are holes due to releases.


> -	if (cxled->mode == CXL_DECODER_RAM) {
> -		start = free_ram_start;
> -		avail = ram_res->end - start + 1;
> -		skip = 0;
> -	} else if (cxled->mode == CXL_DECODER_PMEM) {
> -		resource_size_t skip_start, skip_end;
> -
> -		start = free_pmem_start;
> -		avail = pmem_res->end - start + 1;
> -		skip_start = free_ram_start;
> -
> -		/*
> -		 * If some pmem is already allocated, then that allocation
> -		 * already handled the skip.
> -		 */
> -		if (pmem_res->child &&
> -		    skip_start == pmem_res->child->start)
> -			skip_end = skip_start - 1;
> -		else
> -			skip_end = start - 1;
> -		skip = skip_end - skip_start + 1;
> -	} else {
> -		dev_dbg(dev, "mode not set\n");
> -		rc = -EINVAL;
> -		goto out;
> +	/*
> +	 * To allocate at partition N, a skip needs to be calculated for all
> +	 * unallocated space at lower partitions indices.
> +	 *
> +	 * If a partition has any allocations, the search can end because a
> +	 * previous cxl_dpa_alloc() invocation is assumed to have accounted for
> +	 * all previous partitions.
> +	 */


This is right, but the code below is not because ...


> +	skip_start = CXL_RESOURCE_NONE;
> +	for (int i = part; i; i--) {
> +		prev = &cxlds->part[i - 1].res;
> +		for (p = prev->child, last = NULL; p; p = p->sibling)
> +			last = p;


... holes ...


I think the problem here is we assumed ram and pmem each being a single child 
plus, likely, some free space, but a device with multiple HDM decoders implies 
potentially several children.

The code supported the case of multiple children, but I guess we still had the 
simple case in mind. Otherwise I cannot understand all this ...
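The holes concern can be made concrete with a toy model of the resource tree.
first_free() below scans the (sorted) children for the first gap, whereas the
"last child end + 1" heuristic in cxl_dpa_alloc() would skip over it. The
struct is a stand-in for the kernel's resource tree, illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model: a partition resource with several child allocations,
 * possibly with holes between them (e.g. after a release).
 */
struct res {
	uint64_t start;
	uint64_t end;
	struct res *child;	/* first allocation inside this resource */
	struct res *sibling;	/* next allocation, sorted by start */
};

/* first free offset, found by scanning the sorted children in order */
static uint64_t first_free(const struct res *parent)
{
	uint64_t free_start = parent->start;

	for (const struct res *p = parent->child; p; p = p->sibling) {
		if (p->start > free_start)
			break;	/* hole before this child */
		free_start = p->end + 1;
	}
	return free_start;
}
```

With children [0, 9] and [50, 59] in a [0, 99] partition, the last-child-end
heuristic reports 60 as the next free DPA while the first hole is at 10.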



* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
  2025-01-17 10:52   ` Jonathan Cameron
@ 2025-01-17 15:58   ` Alejandro Lucero Palau
  2025-01-17 22:52     ` Dan Williams
  2025-01-17 20:42   ` Ira Weiny
  2025-01-17 22:08   ` Ira Weiny
  3 siblings, 1 reply; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-17 15:58 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny


On 1/17/25 06:10, Dan Williams wrote:
> The pending efforts to add CXL Accelerator (type-2) device [1], and
> Dynamic Capacity (DCD) support [2], tripped on the
> no-longer-fit-for-purpose design in the CXL subsystem for tracking
> device-physical-address (DPA) metadata. Trip hazards include:
>
> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
>    devices with CXL.mem likely do not in the common case.
>
> - CXL Memory Devices enumerate DPA through Memory Device mailbox
>    commands like Partition Info, Accelerators devices do not.
>
> - CXL Memory Devices that support DCD support more than 2 partitions.
>    Some of the driver algorithms are awkward to expand to > 2 partition
>    cases.
>
> - DPA performance data is a general capability that can be shared with
>    accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
>    suitable.
>
> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
>    memory property, it should be phased in favor of a partition id and
>    the memory property comes from the partition info.
>
> Towards cleaning up those issues and allowing a smoother landing for the
> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> way for Memory Devices and Accelerators to initialize the DPA information
> in 'struct cxl_dev_state'.
>
> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> get the new data structure initialized, and cleanup some qos_class init.
> Follow on patches will go further to use the new data structure to
> cleanup algorithms that are better suited to loop over all possible
> partitions.
>
> cxl_dpa_setup() follows the locking expectations of mutating the device
> DPA map, and is suitable for Accelerator drivers to use. Accelerators
> likely only have one hardcoded 'ram' partition to convey to the
> cxl_core.


<snip>

> +/* if this fails the caller must destroy @cxlds, there is no recovery */
> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> +{
> +	struct device *dev = cxlds->dev;
> +
> +	guard(rwsem_write)(&cxl_dpa_rwsem);
> +


This explains to me what you meant about locking when setting the 
resources for Type2.


However, I think this is not necessary because, as I understand it, no user 
space is involved when creating CXL regions for a Type2 device. It is all up 
to the accel driver to do so, therefore no locking is needed because nothing 
else is going to traverse the child resource list while it is being 
initialised/updated.

It does no harm to have it for the current Type2 case, though, and it is 
always a good idea to have it for potential future cases.




* Re: [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake
  2025-01-17 10:03   ` Jonathan Cameron
@ 2025-01-17 17:47     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 17:47 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams; +Cc: linux-cxl, dave.jiang

Jonathan Cameron wrote:
> On Thu, 16 Jan 2025 22:10:32 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > CXL_DECODER_MIXED is a safety mechanism introduced for the case where
> > platform firmware has programmed an endpoint decoder that straddles a
> > DPA partition boundary. While the kernel is careful to only allocate DPA
> > capacity within a single partition there is no guarantee that platform
> > firmware, or anything that touched the device before the current kernel,
> > gets that right.
> > 
> > However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
> > designation because of the way it tracks partition boundaries. A
> > request_resource() that spans ->ram_res and ->pmem_res fails with the
> > following signature:
> > 
> >     __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation
> > 
> > CXL_DECODER_MIXED is dead defensive programming after the driver has
> > already given up on the device. It has never offered any protection in
> > practice, just delete it.
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/cxl/core/hdm.c    |    8 ++++----
> >  drivers/cxl/core/region.c |   12 ------------
> >  drivers/cxl/cxl.h         |    4 +---
> >  3 files changed, 5 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 28edd5822486..be8556119d94 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -329,12 +329,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  
> >  	if (resource_contains(&cxlds->pmem_res, res))
> >  		cxled->mode = CXL_DECODER_PMEM;
> > -	else if (resource_contains(&cxlds->ram_res, res))
> > +	if (resource_contains(&cxlds->ram_res, res))
> 
> What is the logic of removing the else? I assume there is zero chance that
> both conditions match, but doesn't this mean that if the res is not in
> ram_res we always hit the next else and print the warning?

...bug that I fixed later in the series and did not fold all the way
back to where it came from when splitting the series.

Good catch.

> 
> >  		cxled->mode = CXL_DECODER_RAM;
> >  	else {
> > -		dev_warn(dev, "decoder%d.%d: %pr mixed mode not supported\n",
> > -			 port->id, cxled->cxld.id, cxled->dpa_res);
> > -		cxled->mode = CXL_DECODER_MIXED;
> > +		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
> > +			 port->id, cxled->cxld.id, res);
> > +		cxled->mode = CXL_DECODER_NONE;
> >  	}
> >  
> >  	port->hdm_end++;
> 
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index f6015f24ad38..0fb8d70fa3e5 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -379,7 +379,6 @@ enum cxl_decoder_mode {
> >  	CXL_DECODER_NONE,
> >  	CXL_DECODER_RAM,
> >  	CXL_DECODER_PMEM,
> > -	CXL_DECODER_MIXED,
> >  	CXL_DECODER_DEAD,
> >  };
> >  
> > @@ -389,10 +388,9 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> >  		[CXL_DECODER_NONE] = "none",
> >  		[CXL_DECODER_RAM] = "ram",
> >  		[CXL_DECODER_PMEM] = "pmem",
> > -		[CXL_DECODER_MIXED] = "mixed",
> >  	};
> >  
> > -	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
> > +	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_PMEM)
> Maybe just < DEAD is simpler?

I like that.
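The agreed simplification can be sketched as below. The enum mirrors the patch
after CXL_DECODER_MIXED is removed; the "unknown" fallback string is an
assumption for illustration, not necessarily what the kernel returns:

```c
#include <assert.h>
#include <string.h>

/* enum as it looks after the CXL_DECODER_MIXED removal */
enum cxl_decoder_mode {
	CXL_DECODER_NONE,
	CXL_DECODER_RAM,
	CXL_DECODER_PMEM,
	CXL_DECODER_DEAD,
};

static const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
{
	static const char * const names[] = {
		[CXL_DECODER_NONE] = "none",
		[CXL_DECODER_RAM] = "ram",
		[CXL_DECODER_PMEM] = "pmem",
	};

	/* "mode < CXL_DECODER_DEAD" replaces the two-sided range check */
	if (mode < CXL_DECODER_DEAD)
		return names[mode];
	return "unknown";
}
```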


* Re: [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake
  2025-01-17 10:24   ` Alejandro Lucero Palau
@ 2025-01-17 17:54     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 17:54 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, linux-cxl; +Cc: dave.jiang

Alejandro Lucero Palau wrote:
> 
> On 1/17/25 06:10, Dan Williams wrote:
> > CXL_DECODER_MIXED is a safety mechanism introduced for the case where
> > platform firmware has programmed an endpoint decoder that straddles a
> > DPA partition boundary. While the kernel is careful to only allocate DPA
> > capacity within a single partition there is no guarantee that platform
> > firmware, or anything that touched the device before the current kernel,
> > gets that right.
> >
> > However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
> > designation because of the way it tracks partition boundaries. A
> > request_resource() that spans ->ram_res and ->pmem_res fails with the
> > following signature:
> >
> >      __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation
> >
> > CXL_DECODER_MIXED is dead defensive programming after the driver has
> > already given up on the device. It has never offered any protection in
> > practice, just delete it.
> 
> 
I wonder whether the reason for adding CXL_DECODER_MIXED still makes it worth 
fixing __cxl_dpa_reserve, instead of just not supporting this case.

See where that "failed to reserve allocation" message is printed. That
leads to the driver giving up on the device before the bad decoder
setting can confuse other code paths.


* Re: [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
  2025-01-17 10:23     ` Jonathan Cameron
@ 2025-01-17 17:55       ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 17:55 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

Jonathan Cameron wrote:
> On Fri, 17 Jan 2025 10:20:56 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Thu, 16 Jan 2025 22:10:38 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > In preparation for consolidating all DPA partition information into an
> > > array of DPA metadata, introduce helpers that hide the layout of the
> > > current data. I.e. make the eventual replacement of ->ram_res,  
> > > ->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a    
> > > no-op for code paths that consume that information, and reduce the noise
> > > of follow-on patches.
> > > 
> > > The end goal is to consolidate all DPA information in 'struct
> > > cxl_dev_state', but for now the helpers just make it appear that all DPA
> > > metadata is relative to @cxlds.
> > > 
> > > Note that a follow-on patch also cleans up the temporary placeholders of
> > > @ram_res, and @pmem_res in the qos_class manipulation code,
> > > cxl_dpa_alloc(), and cxl_mem_create_range_info().
> > > 
> > > Cc: Dave Jiang <dave.jiang@intel.com>
> > > Cc: Alejandro Lucero <alucerop@amd.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>  
> > 
> > I'm not that keen on wrapping the size but not the base.
> > Leads to some odd looking code in places.
> 
> It seems some of the code I didn't like goes away later in the series anyway.
> So maybe it makes sense from a churn reduction point of view.

Yeah, I tried to clarify that was a temporary side effect of patch
splitting until 'struct cxl_dpa_partition' could clean things up
further.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17 10:52   ` Jonathan Cameron
  2025-01-17 13:38     ` Alejandro Lucero Palau
@ 2025-01-17 18:23     ` Dan Williams
  2025-01-17 20:32       ` Ira Weiny
  2025-01-20 12:24       ` Alejandro Lucero Palau
  1 sibling, 2 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 18:23 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

Jonathan Cameron wrote:
> On Thu, 16 Jan 2025 22:10:44 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > The pending efforts to add CXL Accelerator (type-2) device [1], and
> > Dynamic Capacity (DCD) support [2], tripped on the
> > no-longer-fit-for-purpose design in the CXL subsystem for tracking
> > device-physical-address (DPA) metadata. Trip hazards include:
> > 
> > - CXL Memory Devices need to consider a PMEM partition, but Accelerator
> >   devices with CXL.mem likely do not in the common case.
> > 
> > - CXL Memory Devices enumerate DPA through Memory Device mailbox
> >   commands like Partition Info; Accelerator devices do not.
> > 
> > - CXL Memory Devices that support DCD support more than 2 partitions.
> >   Some of the driver algorithms are awkward to expand to > 2 partition
> >   cases.
> > 
> > - DPA performance data is a general capability that can be shared with
> >   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
> >   suitable.
> > 
> > - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
> >   memory property; it should be phased out in favor of a partition id,
> >   with the memory property coming from the partition info.
> > 
> > Towards cleaning up those issues and allowing a smoother landing for the
> > aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> > array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> > way for Memory Devices and Accelerators to initialize the DPA information
> > in 'struct cxl_dev_state'.
> > 
> > For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> > get the new data structure initialized, and clean up some qos_class init.
> > Follow-on patches will go further to use the new data structure to
> > clean up algorithms that are better suited to loop over all possible
> > partitions.
> > 
> > cxl_dpa_setup() follows the locking expectations of mutating the device
> > DPA map, and is suitable for Accelerator drivers to use. Accelerators
> > likely only have one hardcoded 'ram' partition to convey to the
> > cxl_core.
> > 
> > Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> > Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> Hi Dan,
> 
> In basic form this seems fine, but I find the nr_partitions variable usage very
> counter intuitive.  It's just how many we configured, not how many there
> are, potentially with 0 size (so not a partition).  I'd be happier if we
> can avoid that by just prefilling the lot with zero size and filling in
> the ones we want.  So zero size means doesn't exist and use an iterator where
> appropriate to skip the zero size ones.

The PMEM-only device case did give me pause. Is that 2 partitions with a
zero-sized first partition, or is that just 1 partition?

Ultimately I do think the code should further evolve to treat that as
1-PMEM-partition, but as far as I can see that depends on 'enum
cxl_decoder_mode' being eliminated and teaching all code paths to search
for the position of the PMEM partition.

> Without that tidied up, to me this is more confusing than the previous code.

I was going to save PMEM at a partition other than 1 for the DCD series,
but let me take another pass at adding that to this series.

[..]
> > +/* if this fails the caller must destroy @cxlds, there is no recovery */
> > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +
> > +	guard(rwsem_write)(&cxl_dpa_rwsem);
> > +
> > +	if (cxlds->nr_partitions)
> > +		return -EBUSY;
> > +
> > +	if (!info->size || !info->nr_partitions) {
> > +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > +		cxlds->nr_partitions = 0;
> > +		return 0;
> > +	}
> > +
> > +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> > +
> > +	for (int i = 0; i < info->nr_partitions; i++) {
> > +		const char *desc;
> > +		int rc;
> > +
> > +		if (i == CXL_PARTITION_RAM)
> > +			desc = "ram";
> > +		else if (i == CXL_PARTITION_PMEM)
> > +			desc = "pmem";
> > +		else
> > +			desc = "";
> > +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> > +				 info->range[i].start,
> > +				 range_len(&info->range[i]), desc);
> > +		if (rc)
> > +			return rc;
> > +		cxlds->nr_partitions++;
> I'd just initialize the rest to 0 length similar to what is happening
> if we have pmem only anyway.  Then this nr_partitions goes away and
> stops being a possible source of confusion.

Modulo teaching other code that wants to ask "what is the size of the
PMEM partition" to use a helper that hides the "find the device's PMEM
partition".


> 
> > +	}
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
> 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 3502f1633ad2..7dca5c8c3494 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> 
> > -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> >  {
> >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> > -	struct resource *ram_res = to_ram_res(cxlds);
> > -	struct resource *pmem_res = to_pmem_res(cxlds);
> >  	struct device *dev = cxlds->dev;
> >  	int rc;
> >  
> >  	if (!cxlds->media_ready) {
> > -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > -		*ram_res = DEFINE_RES_MEM(0, 0);
> > -		*pmem_res = DEFINE_RES_MEM(0, 0);
> > +		info->size = 0;
> >  		return 0;
> >  	}
> >  
> > -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> > +	info->size = mds->total_bytes;
> >  
> >  	if (mds->partition_align_bytes == 0) {
> Obviously nothing to do with your patch as such, but maybe tidy this up
> by making active values == fixed values when we don't have partition control.
> That seems logical anyway to me and means we only end up with one lot of
> range setup in here.  I can't immediately see any side effects of doing this.

Yeah, I mentioned this in another thread. There is no reason
for 'struct cxl_memdev_state' to carry these values at all. They are
just temporary init-data.

So, cxl_dev_state_identify() becomes cxl_mem_identify(), since
it is a memory-device command. Move it inside cxl_mem_dpa_fetch(),
since it only produces temporary init-data for 'struct cxl_dpa_info'.

[..]
> > -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > -				 mds->volatile_only_bytes, "ram");
> > -		if (rc)
> > -			return rc;
> > -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > -				   mds->volatile_only_bytes,
> > -				   mds->persistent_only_bytes, "pmem");
> > +		info->range[CXL_PARTITION_RAM] = (struct range) {
> > +			.start = 0,
> > +			.end = mds->volatile_only_bytes - 1,
> > +		};
> > +		info->nr_partitions++;
> > +
> > +		if (!mds->persistent_only_bytes)
> > +			return 0;
> > +
> > +		info->range[CXL_PARTITION_PMEM] = (struct range) {
> > +			.start = mds->volatile_only_bytes,
> > +			.end = mds->volatile_only_bytes +
> > +			       mds->persistent_only_bytes - 1,
> > +		};
> > +		info->nr_partitions++;
> 
> This nr partitions makes some sense though I'd be tempted to add a type
> array to info so that we can just not pass empty ones if we don't want to.
> Makes this code a little more complex, but not a lot and means
> nr->partitions becomes the ones that actually exist.

Agree, that's the end goal.

> 
> > +		return 0;
> >  	}
> >  
> >  	rc = cxl_mem_get_partition_info(mds);
> > @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  		return rc;
> >  	}
> >  
> > -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > -			 mds->active_volatile_bytes, "ram");
> > -	if (rc)
> > -		return rc;
> > -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > -			   mds->active_volatile_bytes,
> > -			   mds->active_persistent_bytes, "pmem");
> > +	info->range[CXL_PARTITION_RAM] = (struct range) {
> > +		.start = 0,
> > +		.end = mds->active_volatile_bytes - 1,
> > +	};
> > +	info->nr_partitions++;
> > +
> > +	if (!mds->active_persistent_bytes)
> > +		return 0;
> > +
> > +	info->range[CXL_PARTITION_PMEM] = (struct range) {
> > +		.start = mds->active_volatile_bytes,
> > +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> > +	};
> > +	info->nr_partitions++;
> > +
> > +	return 0;
> >  }
> > -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
> > +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
> 
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 78e92e24d7b5..2e728d4b7327 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  			 resource_size_t base, resource_size_t len,
> >  			 resource_size_t skipped);
> >  
> > +/* Well known, spec defined partition indices */
> > +enum cxl_partition {
> > +	CXL_PARTITION_RAM,
> > +	CXL_PARTITION_PMEM,
> > +	CXL_PARTITION_MAX,
> > +};
> > +
> > +struct cxl_dpa_info {
> > +	u64 size;
> > +	struct range range[CXL_PARTITION_MAX];
> > +	int nr_partitions;
> > +};
> 
> blank line seems appropriate here.

Added.

> 
> > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
> > +
> >  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
> >  					 struct cxl_memdev *cxlmd)
> >  {
> > @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
> >  	int qos_class;
> >  };
> >  
> 
> >  /**
> >   * struct cxl_dev_state - The driver device state
> >   *
> > @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
> >   * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
> >   * @media_ready: Indicate whether the device media is usable
> >   * @dpa_res: Overall DPA resource tree for the device
> > - * @_pmem_res: Active Persistent memory capacity configuration
> > - * @_ram_res: Active Volatile memory capacity configuration
> > + * @part: DPA partition array
> > + * @nr_partitions: Number of DPA partitions
> 
> This needs more. It is not the number of partitions present I think, it
> is the number that a particular driver is potentially interested in.
> 
> >   * @serial: PCIe Device Serial Number
> >   * @type: Generic Memory Class device or Vendor Specific Memory device
> >   * @cxl_mbox: CXL mailbox context
> > @@ -438,21 +462,39 @@ struct cxl_dev_state {
> >  	bool rcd;
> >  	bool media_ready;
> >  	struct resource dpa_res;
> > -	struct resource _pmem_res;
> > -	struct resource _ram_res;
> > +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
> > +	unsigned int nr_partitions;
> >  	u64 serial;
> >  	enum cxl_devtype type;
> >  	struct cxl_mailbox cxl_mbox;
> >  };
> >  
> > -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> > +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> >  {
> > -	return &cxlds->_ram_res;
> > +	if (cxlds->nr_partitions > 0)
> > +		return &cxlds->part[CXL_PARTITION_RAM].res;
> > +	return NULL;
> >  }
> >  
> > -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> > +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> >  {
> > -	return &cxlds->_pmem_res;
> > +	if (cxlds->nr_partitions > 1)
> 
> This is very confusing as nr_partitions is being used not to indicate
> number of partitions but whether a driver has filled in the data for them
> (which may well be empty).
> 
> I'd rather see that as a bitmap, or a 'not set' value initialized by
> the core that is then replaced when they are set.

...or even better, not require PMEM to be at partition1.

[..]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-17 11:12   ` Jonathan Cameron
@ 2025-01-17 18:37     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 18:37 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

Jonathan Cameron wrote:
> On Thu, 16 Jan 2025 22:10:50 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > cxl_dpa_alloc() is a hard-coded nest of assumptions around PMEM
> > allocations being distinct from RAM allocations in specific ways, when in
> > practice the allocation rules are only relative to DPA partition index.
> > 
> > The rules for cxl_dpa_alloc() are:
> > 
> > - allocations can only come from 1 partition
> > 
> > - if allocating at partition-index-N, all free space in partitions less
> >   than partition-index-N must be skipped over
> > 
> > Use the new 'struct cxl_dpa_partition' array to support allocation with
> > an arbitrary number of DPA partitions on the device.
> > 
> > A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> > concept and supersede it with looking up the memory properties from
> > partition metadata.
> 
> If we'd move to metadata and these were tightly packed then I'd be fine
> with nr_partitions. Until that step, I find it confusing.
> 
> A few comments inline. This series does bring some advantages though
> at cost of code that needs a bit more documentation at the very least.
> 
> > 
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/cxl/core/hdm.c |  167 +++++++++++++++++++++++++++++++++---------------
> >  drivers/cxl/cxlmem.h   |    9 +++
> >  2 files changed, 125 insertions(+), 51 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 7e1559b3ed88..4a2816102a1e 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -223,6 +223,30 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
> >  
> 
> Some documentation would be useful. I'm not sure I understand
> this algorithm correctly.
> 
> I think this complexity is all about ensuring that skip regions have
> their resources broken up on partition boundaries?
> 
> Can we potentially relax constraints a little more to make this
> easier to read by not caring on the ordering?  Find overlap
> of skip region with any partition and remove that bit unconditionally.
> 
> > +static void release_skip(struct cxl_dev_state *cxlds,
> > +			 const resource_size_t skip_base,
> > +			 const resource_size_t skip_len)
> > +{
> > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > +
> > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > +		const struct resource *part_res = &cxlds->part[i].res;
> > +		resource_size_t skip_end, skip_size;
> > +
> > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > +			continue;
> > +
> > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > +		skip_size = skip_end - skip_start + 1;
> > +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> > +		skip_start += skip_size;
> > +		skip_rem -= skip_size;
> > +
> > +		if (!skip_rem)
> > +			break;
> > +	}
> > +}
> > +
> >  /*
> >   * Must be called in a context that synchronizes against this decoder's
> >   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> > @@ -241,7 +265,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	skip_start = res->start - cxled->skip;
> >  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> >  	if (cxled->skip)
> > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > +		release_skip(cxlds, skip_start, cxled->skip);
> >  	cxled->skip = 0;
> >  	cxled->dpa_res = NULL;
> >  	put_device(&cxled->cxld.dev);
> > @@ -268,6 +292,47 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	__cxl_dpa_release(cxled);
> >  }
> >  
> > +static int request_skip(struct cxl_dev_state *cxlds,
> > +			struct cxl_endpoint_decoder *cxled,
> > +			const resource_size_t skip_base,
> > +			const resource_size_t skip_len)
> > +{
> > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > +
> > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> 
> Likewise, if we relax a constraint on ordering can we make this simpler?
> Would just need to keep track of whether we had reserved enough. I'm not
> 100% sure that is sufficient for the final error check.
> 
> > +		const struct resource *part_res = &cxlds->part[i].res;
> > +		struct cxl_port *port = cxled_to_port(cxled);
> > +		resource_size_t skip_end, skip_size;
> > +		struct resource *res;
> > +
> > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > +			continue;
> > +
> > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > +		skip_size = skip_end - skip_start + 1;
> > +
> > +		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
> > +				       dev_name(&cxled->cxld.dev), 0);
> > +		if (!res) {
> > +			dev_dbg(cxlds->dev,
> > +				"decoder%d.%d: failed to reserve skipped space\n",
> > +				port->id, cxled->cxld.id);
> > +			break;
> > +		}
> > +		skip_start += skip_size;
> > +		skip_rem -= skip_size;
> > +		if (!skip_rem)
> > +			break;
> > +	}
> > +
> > +	if (skip_rem == 0)
> > +		return 0;
> > +
> > +	release_skip(cxlds, skip_base, skip_len - skip_rem);
> Ah, this complicates the possibility of relaxations, as we'd need to pass in
> what partition number we'd reached when the failure occurred.
> Maybe this is the best algorithm, but I'd definitely like docs for this
> function to make it clear what its assumptions are (partitions in order of DPA, etc.)

Yes, the software requirements and assumptions for tracking "DPABase"
deserve to be called out in comments for these helpers. Those "DPABase"
tracking assumptions are spelled out in the implementation note in
"8.2.4.19.13 Decoder Protection".

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake
  2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
  2025-01-17 10:03   ` Jonathan Cameron
  2025-01-17 10:24   ` Alejandro Lucero Palau
@ 2025-01-17 18:45   ` Ira Weiny
  2 siblings, 0 replies; 32+ messages in thread
From: Ira Weiny @ 2025-01-17 18:45 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: dave.jiang

Dan Williams wrote:
> CXL_DECODER_MIXED is a safety mechanism introduced for the case where
> platform firmware has programmed an endpoint decoder that straddles a
> DPA partition boundary. While the kernel is careful to only allocate DPA
> capacity within a single partition there is no guarantee that platform
> firmware, or anything that touched the device before the current kernel,
> gets that right.
> 
> However, __cxl_dpa_reserve() will never get to the CXL_DECODER_MIXED
> designation because of the way it tracks partition boundaries. A
> request_resource() that spans ->ram_res and ->pmem_res fails with the
> following signature:
> 
>     __cxl_dpa_reserve: cxl_port endpoint15: decoder15.0: failed to reserve allocation
> 
> CXL_DECODER_MIXED is dead defensive programming after the driver has
> already given up on the device. It has never offered any protection in
> practice, just delete it.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17 18:23     ` Dan Williams
@ 2025-01-17 20:32       ` Ira Weiny
  2025-01-20 12:24       ` Alejandro Lucero Palau
  1 sibling, 0 replies; 32+ messages in thread
From: Ira Weiny @ 2025-01-17 20:32 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron
  Cc: linux-cxl, Dave Jiang, Alejandro Lucero, Ira Weiny

Dan Williams wrote:
> Jonathan Cameron wrote:
> > On Thu, 16 Jan 2025 22:10:44 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > The pending efforts to add CXL Accelerator (type-2) device [1], and
> > > Dynamic Capacity (DCD) support [2], tripped on the
> > > no-longer-fit-for-purpose design in the CXL subsystem for tracking
> > > device-physical-address (DPA) metadata. Trip hazards include:
> > > 
> > > - CXL Memory Devices need to consider a PMEM partition, but Accelerator
> > >   devices with CXL.mem likely do not in the common case.
> > > 
> > > - CXL Memory Devices enumerate DPA through Memory Device mailbox
> > >   commands like Partition Info; Accelerator devices do not.
> > > 
> > > - CXL Memory Devices that support DCD support more than 2 partitions.
> > >   Some of the driver algorithms are awkward to expand to > 2 partition
> > >   cases.
> > > 
> > > - DPA performance data is a general capability that can be shared with
> > >   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
> > >   suitable.
> > > 
> > > - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
> > >   memory property; it should be phased out in favor of a partition id,
> > >   with the memory property coming from the partition info.
> > > 
> > > Towards cleaning up those issues and allowing a smoother landing for the
> > > aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> > > array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> > > way for Memory Devices and Accelerators to initialize the DPA information
> > > in 'struct cxl_dev_state'.
> > > 
> > > For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> > > get the new data structure initialized, and clean up some qos_class init.
> > > Follow-on patches will go further to use the new data structure to
> > > clean up algorithms that are better suited to loop over all possible
> > > partitions.
> > > 
> > > cxl_dpa_setup() follows the locking expectations of mutating the device
> > > DPA map, and is suitable for Accelerator drivers to use. Accelerators
> > > likely only have one hardcoded 'ram' partition to convey to the
> > > cxl_core.
> > > 
> > > Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> > > Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> > > Cc: Dave Jiang <dave.jiang@intel.com>
> > > Cc: Alejandro Lucero <alucerop@amd.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > 
> > Hi Dan,
> > 
> > In basic form this seems fine, but I find the nr_partitions variable usage very
> > counter intuitive.  It's just how many we configured, not how many there
> > are, potentially with 0 size (so not a partition).  I'd be happier if we
> > can avoid that by just prefilling the lot with zero size and filling in
> > the ones we want.  So zero size means doesn't exist and use an iterator where
> > appropriate to skip the zero size ones.
> 
> The PMEM-only device case did give me pause. Is that 2 partitions with a
> zero-sized first partition, or is that just 1 partition?
> 
> Ultimately I do think the code should further evolve to treat that as
> 1-PMEM-partition, but as far as I can see that depends on 'enum
> cxl_decoder_mode' being eliminated and teaching all code paths to search
> for the position of the PMEM partition.

I was of the same mind that the decoder names could be used to index the
array.  For ram/pmem this is baked into the user API, but for DCD one could
imagine not just specifying partition dc0 but rather a 'ram' dcd
partition.

> 
> > Without that tidied up, to me this is more confusing than the previous code.
> 
> I was going to save PMEM at a partition other than 1 for the DCD series,
> but let me take another pass at adding that to this series.
> 
> [..]
> > > +/* if this fails the caller must destroy @cxlds, there is no recovery */
> > > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > > +{
> > > +	struct device *dev = cxlds->dev;
> > > +
> > > +	guard(rwsem_write)(&cxl_dpa_rwsem);
> > > +
> > > +	if (cxlds->nr_partitions)
> > > +		return -EBUSY;
> > > +
> > > +	if (!info->size || !info->nr_partitions) {
> > > +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > > +		cxlds->nr_partitions = 0;
> > > +		return 0;
> > > +	}
> > > +
> > > +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> > > +
> > > +	for (int i = 0; i < info->nr_partitions; i++) {
> > > +		const char *desc;
> > > +		int rc;
> > > +
> > > +		if (i == CXL_PARTITION_RAM)
> > > +			desc = "ram";
> > > +		else if (i == CXL_PARTITION_PMEM)
> > > +			desc = "pmem";
> > > +		else
> > > +			desc = "";
> > > +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> > > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> > > +				 info->range[i].start,
> > > +				 range_len(&info->range[i]), desc);
> > > +		if (rc)
> > > +			return rc;
> > > +		cxlds->nr_partitions++;
> > I'd just initialize the rest to 0 length similar to what is happening
> > if we have pmem only anyway.  Then this nr_partitions goes away and
> > stops being a possible source of confusion.
> 
> Modulo teaching other code that wants to ask "what is the size of the
> PMEM partition" to use a helper that hides the "find the device's PMEM
> partition".
> 
> 
> > 
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
> > 
> > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > > index 3502f1633ad2..7dca5c8c3494 100644
> > > --- a/drivers/cxl/core/mbox.c
> > > +++ b/drivers/cxl/core/mbox.c
> > > @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> > 
> > > -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > > +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> > >  {
> > >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> > > -	struct resource *ram_res = to_ram_res(cxlds);
> > > -	struct resource *pmem_res = to_pmem_res(cxlds);
> > >  	struct device *dev = cxlds->dev;
> > >  	int rc;
> > >  
> > >  	if (!cxlds->media_ready) {
> > > -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > > -		*ram_res = DEFINE_RES_MEM(0, 0);
> > > -		*pmem_res = DEFINE_RES_MEM(0, 0);
> > > +		info->size = 0;
> > >  		return 0;
> > >  	}
> > >  
> > > -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> > > +	info->size = mds->total_bytes;
> > >  
> > >  	if (mds->partition_align_bytes == 0) {
> > Obviously nothing to do with your patch as such, but maybe tidy this up
> > by making active values == fixed values when we don't have partition control.
> > That seems logical anyway to me and means we only end up with one lot of
> > range setup in here.  I can't immediately see any side effects of doing this.
> 
> Yeah, I mentioned this in another thread. There is no reason
> for 'struct cxl_memdev_state' to carry these values at all. They are
> just temporary init-data.
> 
> So, cxl_dev_state_identify() becomes cxl_mem_identify(), since
> it is a memory-device command. Move it inside of cxl_mem_dpa_fetch()
> since it is just temporary init-data for 'struct cxl_dpa_info'.

I took a different direction and removed all these temporary variables,
which also had the side effect of separating the mbox command processing
from the resource creation.

I do prefer cxl_dpa_info to the way I coded it, but I did not anticipate my
structure living long. I'm also not a fan of 'cxl_byte_layout':


diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 6d63c29eb0e1..9646465e2cbe 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -463,12 +463,6 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @firmware_version: Firmware version for the memory device.
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
- * @total_bytes: sum of all possible capacities
- * @volatile_only_bytes: hard volatile capacity
- * @persistent_only_bytes: hard persistent capacity
- * @partition_align_bytes: alignment size for partition-able capacity
- * @active_volatile_bytes: sum of hard + soft volatile
- * @active_persistent_bytes: sum of hard + soft persistent
  * @ram_perf: performance data entry matched to RAM partition
  * @pmem_perf: performance data entry matched to PMEM partition
  * @event: event log driver state
@@ -485,12 +479,6 @@ struct cxl_memdev_state {
        char firmware_version[0x10];
        DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
        DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
-       u64 total_bytes;
-       u64 volatile_only_bytes;
-       u64 persistent_only_bytes;
-       u64 partition_align_bytes;
-       u64 active_volatile_bytes;
-       u64 active_persistent_bytes;
 
        struct cxl_dpa_perf ram_perf;
        struct cxl_dpa_perf pmem_perf;
@@ -811,10 +799,19 @@ enum {
 
 int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
                          struct cxl_mbox_cmd *cmd);
-int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+
+struct cxl_mem_byte_layout {
+       u64 total_bytes;
+       u64 volatile_bytes;
+       u64 persistent_bytes;
+};
+
+int cxl_dev_state_identify(struct cxl_memdev_state *mds,
+                          struct cxl_mem_byte_layout *byte_layout);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
-int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
+int cxl_create_range_info(struct cxl_dev_state *cxlds,
+                             struct cxl_mem_byte_layout *byte_layout);
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
 void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
                                unsigned long *cmds);


> 
> [..]
> > > -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > > -				 mds->volatile_only_bytes, "ram");
> > > -		if (rc)
> > > -			return rc;
> > > -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > > -				   mds->volatile_only_bytes,
> > > -				   mds->persistent_only_bytes, "pmem");
> > > +		info->range[CXL_PARTITION_RAM] = (struct range) {
> > > +			.start = 0,
> > > +			.end = mds->volatile_only_bytes - 1,
> > > +		};
> > > +		info->nr_partitions++;
> > > +
> > > +		if (!mds->persistent_only_bytes)
> > > +			return 0;
> > > +
> > > +		info->range[CXL_PARTITION_PMEM] = (struct range) {
> > > +			.start = mds->volatile_only_bytes,
> > > +			.end = mds->volatile_only_bytes +
> > > +			       mds->persistent_only_bytes - 1,
> > > +		};
> > > +		info->nr_partitions++;
> > 
> > This nr_partitions makes some sense, though I'd be tempted to add a type
> > array to info so that we can simply not pass empty partitions if we don't
> > want to. That makes this code a little more complex, but not by much, and
> > means nr_partitions becomes a count of the partitions that actually exist.
> 
> Agree, that's the end goal.
> 
> > 
> > > +		return 0;
> > >  	}
> > >  
> > >  	rc = cxl_mem_get_partition_info(mds);
> > > @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > >  		return rc;
> > >  	}
> > >  
> > > -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > > -			 mds->active_volatile_bytes, "ram");
> > > -	if (rc)
> > > -		return rc;
> > > -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > > -			   mds->active_volatile_bytes,
> > > -			   mds->active_persistent_bytes, "pmem");
> > > +	info->range[CXL_PARTITION_RAM] = (struct range) {
> > > +		.start = 0,
> > > +		.end = mds->active_volatile_bytes - 1,
> > > +	};
> > > +	info->nr_partitions++;
> > > +
> > > +	if (!mds->active_persistent_bytes)
> > > +		return 0;
> > > +
> > > +	info->range[CXL_PARTITION_PMEM] = (struct range) {
> > > +		.start = mds->active_volatile_bytes,
> > > +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> > > +	};
> > > +	info->nr_partitions++;
> > > +
> > > +	return 0;
> > >  }
> > > -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
> > > +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
> > 
> > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > index 78e92e24d7b5..2e728d4b7327 100644
> > > --- a/drivers/cxl/cxlmem.h
> > > +++ b/drivers/cxl/cxlmem.h
> > > @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > >  			 resource_size_t base, resource_size_t len,
> > >  			 resource_size_t skipped);
> > >  
> > > +/* Well known, spec defined partition indices */
> > > +enum cxl_partition {
> > > +	CXL_PARTITION_RAM,
> > > +	CXL_PARTITION_PMEM,
> > > +	CXL_PARTITION_MAX,
> > > +};
> > > +
> > > +struct cxl_dpa_info {
> > > +	u64 size;
> > > +	struct range range[CXL_PARTITION_MAX];
> > > +	int nr_partitions;
> > > +};
> > 
> > blank line seems appropriate here.
> 
> Added.
> 
> > 
> > > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
> > > +
> > >  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
> > >  					 struct cxl_memdev *cxlmd)
> > >  {
> > > @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
> > >  	int qos_class;
> > >  };
> > >  
> > 
> > >  /**
> > >   * struct cxl_dev_state - The driver device state
> > >   *
> > > @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
> > >   * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
> > >   * @media_ready: Indicate whether the device media is usable
> > >   * @dpa_res: Overall DPA resource tree for the device
> > > - * @_pmem_res: Active Persistent memory capacity configuration
> > > - * @_ram_res: Active Volatile memory capacity configuration
> > > + * @part: DPA partition array
> > > + * @nr_partitions: Number of DPA partitions
> > 
> > This needs more. It is not the number of partitions present I think, it
> > is the number that a particular driver is potentially interested in.
> > 
> > >   * @serial: PCIe Device Serial Number
> > >   * @type: Generic Memory Class device or Vendor Specific Memory device
> > >   * @cxl_mbox: CXL mailbox context
> > > @@ -438,21 +462,39 @@ struct cxl_dev_state {
> > >  	bool rcd;
> > >  	bool media_ready;
> > >  	struct resource dpa_res;
> > > -	struct resource _pmem_res;
> > > -	struct resource _ram_res;
> > > +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
> > > +	unsigned int nr_partitions;
> > >  	u64 serial;
> > >  	enum cxl_devtype type;
> > >  	struct cxl_mailbox cxl_mbox;
> > >  };
> > >  
> > > -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> > > +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> > >  {
> > > -	return &cxlds->_ram_res;
> > > +	if (cxlds->nr_partitions > 0)
> > > +		return &cxlds->part[CXL_PARTITION_RAM].res;
> > > +	return NULL;
> > >  }
> > >  
> > > -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> > > +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> > >  {
> > > -	return &cxlds->_pmem_res;
> > > +	if (cxlds->nr_partitions > 1)
> > 
> > This is very confusing as nr_partitions is being used not to indicate
> > number of partitions but whether a driver has filled in the data for them
> > (which may well be empty).
> > 
> > I'd rather see that as a bitmap, or a 'not set' value initialized by
> > the core that is then replaced when they are set.
> 
> ...or even better, not require PMEM to be at partition1.

FWIW I think that is a step too far to be rushing into 6.14.

Ira

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
  2025-01-17 10:52   ` Jonathan Cameron
  2025-01-17 15:58   ` Alejandro Lucero Palau
@ 2025-01-17 20:42   ` Ira Weiny
  2025-01-17 22:08   ` Ira Weiny
  3 siblings, 0 replies; 32+ messages in thread
From: Ira Weiny @ 2025-01-17 20:42 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: Dave Jiang, Alejandro Lucero, Ira Weiny

Dan Williams wrote:

[snip]

> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7a85522294ad..7e1559b3ed88 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -342,6 +342,75 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	return 0;
>  }
>  
> +static int add_dpa_res(struct device *dev, struct resource *parent,
> +		       struct resource *res, resource_size_t start,
> +		       resource_size_t size, const char *type)
> +{
> +	int rc;
> +
> +	*res = (struct resource) {
> +		.name = type,
> +		.start = start,
> +		.end =  start + size - 1,
> +		.flags = IORESOURCE_MEM,
> +	};
> +	if (resource_size(res) == 0) {
> +		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
> +		return 0;
> +	}
> +	rc = request_resource(parent, res);
> +	if (rc) {
> +		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
> +			res, rc);
> +		return rc;
> +	}
> +
> +	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
> +
> +	return 0;
> +}
> +
> +/* if this fails the caller must destroy @cxlds, there is no recovery */
> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> +{
> +	struct device *dev = cxlds->dev;
> +
> +	guard(rwsem_write)(&cxl_dpa_rwsem);
> +
> +	if (cxlds->nr_partitions)
> +		return -EBUSY;
> +
> +	if (!info->size || !info->nr_partitions) {
> +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> +		cxlds->nr_partitions = 0;
> +		return 0;
> +	}
> +
> +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> +
> +	for (int i = 0; i < info->nr_partitions; i++) {
> +		const char *desc;
> +		int rc;
> +
> +		if (i == CXL_PARTITION_RAM)
> +			desc = "ram";
> +		else if (i == CXL_PARTITION_PMEM)
> +			desc = "pmem";
> +		else
> +			desc = "";
> +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> +				 info->range[i].start,
> +				 range_len(&info->range[i]), desc);
> +		if (rc)
> +			return rc;
> +		cxlds->nr_partitions++;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(cxl_dpa_setup);

Why put this in the middle of hdm.c where it splits up devm_cxl_dpa_reserve()
and __cxl_dpa_reserve()?

Ira

> +
>  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  				resource_size_t base, resource_size_t len,
>  				resource_size_t skipped)

[snip]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers
  2025-01-17 13:33   ` Alejandro Lucero Palau
@ 2025-01-17 20:47     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 20:47 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny

Alejandro Lucero Palau wrote:
> 
> On 1/17/25 06:10, Dan Williams wrote:
> > In preparation for consolidating all DPA partition information into an
> > array of DPA metadata, introduce helpers that hide the layout of the
> > current data. I.e. make the eventual replacement of ->ram_res,
> > ->pmem_res, ->ram_perf, and ->pmem_perf with a new DPA metadata array a
> > no-op for code paths that consume that information, and reduce the noise
> > of follow-on patches.
> >
> > The end goal is to consolidate all DPA information in 'struct
> > cxl_dev_state', but for now the helpers just make it appear that all DPA
> > metadata is relative to @cxlds.
> >
> > Note that a follow-on patch also cleans up the temporary placeholders of
> > @ram_res, and @pmem_res in the qos_class manipulation code,
> > cxl_dpa_alloc(), and cxl_mem_create_range_info().
> >
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 2a25d1957ddb..78e92e24d7b5 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
[..]
> > @@ -438,13 +438,41 @@ struct cxl_dev_state {
> >   	bool rcd;
> >   	bool media_ready;
> >   	struct resource dpa_res;
> > -	struct resource pmem_res;
> > -	struct resource ram_res;
> > +	struct resource _pmem_res;
> > +	struct resource _ram_res;
> 
> 
> I think this is unnecessary since it is clear those fields are going
> away later on, and this change only adds confusion. Moreover, they are
> not referenced in the code now because of the helpers.

That is part of demonstrating a safe conversion. You can read this
change to know that every possible usage of the old name is gone and
that is verified by the compiler.

Otherwise the reviewer needs to spend effort to grep to see if all the
old usages are indeed gone.
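
The compiler-verified rename pattern described here can be illustrated
with a reduced userspace sketch (the struct and field below are
stand-ins for 'struct cxl_dev_state', not the kernel definitions):

```c
#include <assert.h>

/* Stand-in for 'struct resource' */
struct resource_stub { unsigned long long start, end; };

/* Renaming the field with a leading underscore means any stale direct
 * user of the old 'ram_res' name now fails to compile, so the compiler
 * proves that every remaining access goes through the helper. */
struct dev_state_stub {
	struct resource_stub _ram_res;	/* was 'ram_res' before the rename */
};

static inline struct resource_stub *to_ram_res(struct dev_state_stub *d)
{
	return &d->_ram_res;	/* the accessor is the only direct user */
}
```

Once everything compiles against to_ram_res(), the later conversion to a
partition array only has to change this one helper body.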

> > +static inline resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
> > +{
> > +	const struct resource *res = to_ram_res(cxlds);
> > +
> > +	if (!res)
> > +		return 0;
> 
> 
> This check is not needed now, and with the change in the next patch, I
> think it should not be needed either.
> 
> Do we need the distinction between no ram or no pmem, and ram/pmem with
> size 0?

This was also Jonathan's feedback. In v2 I am removing the distinction
that PMEM is always index-1 in the cxl_dpa_partition array.
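
A hedged sketch of the index-independent layout being alluded to for v2:
a mode tag per partition entry so that pmem need not live at a fixed
index (the names below are hypothetical, not from the posted series):

```c
#include <assert.h>
#include <stddef.h>

enum part_mode { PART_RAM, PART_PMEM };

struct part_entry {
	enum part_mode mode;
	unsigned long long start, size;
};

/* Look a partition up by mode instead of assuming a fixed array index */
static const struct part_entry *find_part(const struct part_entry *parts,
					  int nr, enum part_mode mode)
{
	for (int i = 0; i < nr; i++)
		if (parts[i].mode == mode)
			return &parts[i];
	return NULL;
}
```

With this shape, nr counts only partitions that actually exist, and a
device with pmem-only capacity is representable without a placeholder
ram entry.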

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-17 15:42   ` Alejandro Lucero Palau
@ 2025-01-17 20:57     ` Dan Williams
  2025-01-20 12:39       ` Alejandro Lucero Palau
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2025-01-17 20:57 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny

Alejandro Lucero Palau wrote:
> 
> On 1/17/25 06:10, Dan Williams wrote:
> > cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> > allocations being distinct from RAM allocations in specific ways when in
> > practice the allocation rules are only relative to DPA partition index.
> >
> > The rules for cxl_dpa_alloc() are:
> >
> > - allocations can only come from 1 partition
> >
> > - if allocating at partition-index-N, all free space in partitions less
> >    than partition-index-N must be skipped over
> 
> 
> In my view, you are mixing the current code with the new code in this 
> explanation. It would be better to say that the current code assumes 
> just two partitions, ram and pmem, but DCD changes the game.

There is no mixture in that description. The rules have not changed from
old to new, the implementation is updated to reflect that the algorithm
never needed to consider ram and pmem explicitly.

> > Use the new 'struct cxl_dpa_partition' array to support allocation with
> > an arbitrary number of DPA partitions on the device.
> >
> > A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> > concept and supersede it with looking up the memory properties from
> > partition metadata.
> >
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >   drivers/cxl/core/hdm.c |  167 +++++++++++++++++++++++++++++++++---------------
> >   drivers/cxl/cxlmem.h   |    9 +++
> >   2 files changed, 125 insertions(+), 51 deletions(-)
> >
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 7e1559b3ed88..4a2816102a1e 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -223,6 +223,30 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> >   }
> >   EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
> >   
> > +static void release_skip(struct cxl_dev_state *cxlds,
> > +			 const resource_size_t skip_base,
> > +			 const resource_size_t skip_len)
> > +{
> > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > +
> > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > +		const struct resource *part_res = &cxlds->part[i].res;
> > +		resource_size_t skip_end, skip_size;
> > +
> > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > +			continue;
> > +
> > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > +		skip_size = skip_end - skip_start + 1;
> > +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> > +		skip_start += skip_size;
> > +		skip_rem -= skip_size;
> > +
> > +		if (!skip_rem)
> > +			break;
> > +	}
> > +}
> > +
> 
> 
> This implies the skip cannot be based on the end of the last child, as 
> the code implements it.

I do not follow this comment... more below.

[..]
> > @@ -551,47 +609,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> >   		goto out;
> >   	}
> >   
> > -	for (p = ram_res->child, last = NULL; p; p = p->sibling)
> > -		last = p;
> > -	if (last)
> > -		free_ram_start = last->end + 1;
> > +	if (cxled->mode == CXL_DECODER_RAM)
> > +		part = CXL_PARTITION_RAM;
> > +	else if (cxled->mode == CXL_DECODER_PMEM)
> > +		part = CXL_PARTITION_PMEM;
> >   	else
> > -		free_ram_start = ram_res->start;
> > +		part = cxlds->nr_partitions;
> > +
> > +	if (part >= cxlds->nr_partitions) {
> > +		dev_dbg(dev, "partition %d not found\n", part);
> > +		rc = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	res = &cxlds->part[part].res;
> >   
> > -	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
> > +	for (p = res->child, last = NULL; p; p = p->sibling)
> >   		last = p;
> >   	if (last)
> > -		free_pmem_start = last->end + 1;
> > +		start = last->end + 1;
> >   	else
> > -		free_pmem_start = pmem_res->start;
> > +		start = res->start;
> >   
> 
> 
> As said above, this is not correct if there are holes due to releases.

There are no holes introduced by releases. The skip is always one
contiguous range; it just happens to need to be tracked via multiple
entries in the ->dpa_res tree. Jonathan had asked for more commentary on
what is happening in request_skip() and release_skip(); I will clarify
this detail about contiguity.
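
That "one contiguous skip, multiple resource entries" behavior can be
modeled in plain C. This is a sketch with stand-in types mirroring the
release_skip() walk quoted above, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive [start, end] range, like 'struct resource' */
struct part_range { uint64_t start, end; };

/* Walk the partitions a contiguous skip [skip_base, skip_base + skip_len)
 * crosses, recording each per-partition piece in pieces[]; each piece
 * corresponds to one __release_region() call in release_skip(). */
static int split_skip(const struct part_range *parts, int nr_parts,
		      uint64_t skip_base, uint64_t skip_len, uint64_t *pieces)
{
	uint64_t skip_start = skip_base, skip_rem = skip_len;
	int n = 0;

	for (int i = 0; i < nr_parts; i++) {
		uint64_t skip_end, skip_size;

		if (skip_start < parts[i].start || skip_start > parts[i].end)
			continue;

		/* clamp the piece to the current partition's end */
		skip_end = parts[i].end < skip_start + skip_rem - 1 ?
			   parts[i].end : skip_start + skip_rem - 1;
		skip_size = skip_end - skip_start + 1;
		pieces[n++] = skip_size;
		skip_start += skip_size;
		skip_rem -= skip_size;
		if (!skip_rem)
			break;
	}
	return n;
}
```

A skip straddling a ram/pmem boundary comes back as two pieces whose
sizes sum to the original length, i.e. it is still one contiguous DPA
range, merely tracked as two entries.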

> 
> 
> > -	if (cxled->mode == CXL_DECODER_RAM) {
> > -		start = free_ram_start;
> > -		avail = ram_res->end - start + 1;
> > -		skip = 0;
> > -	} else if (cxled->mode == CXL_DECODER_PMEM) {
> > -		resource_size_t skip_start, skip_end;
> > -
> > -		start = free_pmem_start;
> > -		avail = pmem_res->end - start + 1;
> > -		skip_start = free_ram_start;
> > -
> > -		/*
> > -		 * If some pmem is already allocated, then that allocation
> > -		 * already handled the skip.
> > -		 */
> > -		if (pmem_res->child &&
> > -		    skip_start == pmem_res->child->start)
> > -			skip_end = skip_start - 1;
> > -		else
> > -			skip_end = start - 1;
> > -		skip = skip_end - skip_start + 1;
> > -	} else {
> > -		dev_dbg(dev, "mode not set\n");
> > -		rc = -EINVAL;
> > -		goto out;
> > +	/*
> > +	 * To allocate at partition N, a skip needs to be calculated for all
> > +	 * unallocated space at lower partitions indices.
> > +	 *
> > +	 * If a partition has any allocations, the search can end because a
> > +	 * previous cxl_dpa_alloc() invocation is assumed to have accounted for
> > +	 * all previous partitions.
> > +	 */
> 
> 
> This is right, but the code below is not because ...
> 
> 
> > +	skip_start = CXL_RESOURCE_NONE;
> > +	for (int i = part; i; i--) {
> > +		prev = &cxlds->part[i - 1].res;
> > +		for (p = prev->child, last = NULL; p; p = p->sibling)
> > +			last = p;
> 
> 
> ... holes ...
> 
> 
> I think the problem here is that we assumed ram and pmem each being a 
> child and likely some free space, but a device with multiple HDM 
> decoders implies potentially several children.
> 
> The code supported the case of multiple children, but I guess we still 
> had the simple case in mind. Otherwise I cannot understand all this ...

Holes are not allowed. If you want to delete any decoder capacity you
need to tear down all higher DPA allocations. That is a constraint of
the hardware definition around how software can assume the value of
DPABase. Will add some comments to that effect.
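
Under those constraints (allocations grow contiguously from each
partition's start, no holes), the skip computed by the loop quoted above
can be sketched in userspace C with stand-in types (an illustration of
the rule, not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive [start, end] range, like 'struct resource' */
struct part_range { uint64_t start, end; };

/* alloced[i] is how many bytes are already allocated at the start of
 * partition i. The skip for partition 'part' is all free space below it;
 * the walk stops at the first lower partition with an allocation, since
 * the cxl_dpa_alloc() that made it already skipped everything below. */
static uint64_t compute_skip(const struct part_range *parts,
			     const uint64_t *alloced, int part)
{
	uint64_t skip = 0;

	for (int i = part; i; i--) {
		uint64_t free_start = parts[i - 1].start + alloced[i - 1];

		if (free_start <= parts[i - 1].end)
			skip += parts[i - 1].end - free_start + 1;
		if (alloced[i - 1])
			break;
	}
	return skip;
}
```

So an allocation at pmem skips the whole ram partition when ram is
empty, but only ram's free tail when ram already has allocations.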

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
                     ` (2 preceding siblings ...)
  2025-01-17 20:42   ` Ira Weiny
@ 2025-01-17 22:08   ` Ira Weiny
  2025-01-31 23:39     ` Dan Williams
  3 siblings, 1 reply; 32+ messages in thread
From: Ira Weiny @ 2025-01-17 22:08 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: Dave Jiang, Alejandro Lucero, Ira Weiny

Dan Williams wrote:
> The pending efforts to add CXL Accelerator (type-2) device [1], and
> Dynamic Capacity (DCD) support [2], tripped on the
> no-longer-fit-for-purpose design in the CXL subsystem for tracking
> device-physical-address (DPA) metadata. Trip hazards include:
> 
> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
>   devices with CXL.mem likely do not in the common case.
> 
> - CXL Memory Devices enumerate DPA through Memory Device mailbox
>   commands like Partition Info; Accelerator devices do not.
> 
> - CXL Memory Devices that support DCD support more than 2 partitions.
>   Some of the driver algorithms are awkward to expand to > 2 partition
>   cases.
> 
> - DPA performance data is a general capability that can be shared with
>   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
>   suitable.
> 
> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
>   memory property; it should be phased out in favor of a partition id,
>   with the memory property coming from the partition info.
> 
> Towards cleaning up those issues and allowing a smoother landing for the
> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> way for Memory Devices and Accelerators to initialize the DPA information
> in 'struct cxl_dev_state'.
> 
> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> get the new data structure initialized, and cleanup some qos_class init.
> Follow on patches will go further to use the new data structure to
> cleanup algorithms that are better suited to loop over all possible
> partitions.
> 
> cxl_dpa_setup() follows the locking expectations of mutating the device
> DPA map, and is suitable for Accelerator drivers to use. Accelerators
> likely only have one hardcoded 'ram' partition to convey to the
> cxl_core.
> 
> Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/cdat.c      |   15 ++-----
>  drivers/cxl/core/hdm.c       |   69 ++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c      |   86 ++++++++++++++++++------------------------
>  drivers/cxl/cxlmem.h         |   79 +++++++++++++++++++++++++--------------
>  drivers/cxl/pci.c            |    7 +++
>  tools/testing/cxl/test/cxl.c |   15 ++-----
>  tools/testing/cxl/test/mem.c |    7 +++
>  7 files changed, 176 insertions(+), 102 deletions(-)
> 
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index b177a488e29b..5400a421ad30 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -261,25 +261,18 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>  	struct device *dev = cxlds->dev;
>  	struct dsmas_entry *dent;
>  	unsigned long index;
> -	const struct resource *partition[] = {
> -		to_ram_res(cxlds),
> -		to_pmem_res(cxlds),
> -	};
> -	struct cxl_dpa_perf *perf[] = {
> -		to_ram_perf(cxlds),
> -		to_pmem_perf(cxlds),
> -	};
>  
>  	xa_for_each(dsmas_xa, index, dent) {
> -		for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> -			const struct resource *res = partition[i];
> +		for (int i = 0; i < cxlds->nr_partitions; i++) {
> +			struct resource *res = &cxlds->part[i].res;
>  			struct range range = {
>  				.start = res->start,
>  				.end = res->end,
>  			};
>  
>  			if (range_contains(&range, &dent->dpa_range))
> -				update_perf_entry(dev, dent, perf[i]);
> +				update_perf_entry(dev, dent,
> +						  &cxlds->part[i].perf);
>  			else
>  				dev_dbg(dev,
>  					"no partition for dsmas dpa: %pra\n",
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7a85522294ad..7e1559b3ed88 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -342,6 +342,75 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	return 0;
>  }
>  
> +static int add_dpa_res(struct device *dev, struct resource *parent,
> +		       struct resource *res, resource_size_t start,
> +		       resource_size_t size, const char *type)
> +{
> +	int rc;
> +
> +	*res = (struct resource) {
> +		.name = type,
> +		.start = start,
> +		.end =  start + size - 1,
> +		.flags = IORESOURCE_MEM,
> +	};
> +	if (resource_size(res) == 0) {
> +		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
> +		return 0;
> +	}
> +	rc = request_resource(parent, res);
> +	if (rc) {
> +		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
> +			res, rc);
> +		return rc;
> +	}
> +
> +	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
> +
> +	return 0;
> +}
> +
> +/* if this fails the caller must destroy @cxlds, there is no recovery */
> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> +{
> +	struct device *dev = cxlds->dev;
> +
> +	guard(rwsem_write)(&cxl_dpa_rwsem);

Why is this semaphore required now?

Ira

> +
> +	if (cxlds->nr_partitions)
> +		return -EBUSY;
> +
> +	if (!info->size || !info->nr_partitions) {
> +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> +		cxlds->nr_partitions = 0;
> +		return 0;
> +	}
> +
> +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> +
> +	for (int i = 0; i < info->nr_partitions; i++) {
> +		const char *desc;
> +		int rc;
> +
> +		if (i == CXL_PARTITION_RAM)
> +			desc = "ram";
> +		else if (i == CXL_PARTITION_PMEM)
> +			desc = "pmem";
> +		else
> +			desc = "";
> +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> +				 info->range[i].start,
> +				 range_len(&info->range[i]), desc);
> +		if (rc)
> +			return rc;
> +		cxlds->nr_partitions++;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
> +
>  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  				resource_size_t base, resource_size_t len,
>  				resource_size_t skipped)
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 3502f1633ad2..7dca5c8c3494 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>  	return rc;
>  }
>  
> -static int add_dpa_res(struct device *dev, struct resource *parent,
> -		       struct resource *res, resource_size_t start,
> -		       resource_size_t size, const char *type)
> -{
> -	int rc;
> -
> -	res->name = type;
> -	res->start = start;
> -	res->end = start + size - 1;
> -	res->flags = IORESOURCE_MEM;
> -	if (resource_size(res) == 0) {
> -		dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
> -		return 0;
> -	}
> -	rc = request_resource(parent, res);
> -	if (rc) {
> -		dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
> -			res, rc);
> -		return rc;
> -	}
> -
> -	dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
> -
> -	return 0;
> -}
> -
> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
> -	struct resource *ram_res = to_ram_res(cxlds);
> -	struct resource *pmem_res = to_pmem_res(cxlds);
>  	struct device *dev = cxlds->dev;
>  	int rc;
>  
>  	if (!cxlds->media_ready) {
> -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> -		*ram_res = DEFINE_RES_MEM(0, 0);
> -		*pmem_res = DEFINE_RES_MEM(0, 0);
> +		info->size = 0;
>  		return 0;
>  	}
>  
> -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> +	info->size = mds->total_bytes;
>  
>  	if (mds->partition_align_bytes == 0) {
> -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> -				 mds->volatile_only_bytes, "ram");
> -		if (rc)
> -			return rc;
> -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> -				   mds->volatile_only_bytes,
> -				   mds->persistent_only_bytes, "pmem");
> +		info->range[CXL_PARTITION_RAM] = (struct range) {
> +			.start = 0,
> +			.end = mds->volatile_only_bytes - 1,
> +		};
> +		info->nr_partitions++;
> +
> +		if (!mds->persistent_only_bytes)
> +			return 0;
> +
> +		info->range[CXL_PARTITION_PMEM] = (struct range) {
> +			.start = mds->volatile_only_bytes,
> +			.end = mds->volatile_only_bytes +
> +			       mds->persistent_only_bytes - 1,
> +		};
> +		info->nr_partitions++;
> +		return 0;
>  	}
>  
>  	rc = cxl_mem_get_partition_info(mds);
> @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  		return rc;
>  	}
>  
> -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> -			 mds->active_volatile_bytes, "ram");
> -	if (rc)
> -		return rc;
> -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> -			   mds->active_volatile_bytes,
> -			   mds->active_persistent_bytes, "pmem");
> +	info->range[CXL_PARTITION_RAM] = (struct range) {
> +		.start = 0,
> +		.end = mds->active_volatile_bytes - 1,
> +	};
> +	info->nr_partitions++;
> +
> +	if (!mds->active_persistent_bytes)
> +		return 0;
> +
> +	info->range[CXL_PARTITION_PMEM] = (struct range) {
> +		.start = mds->active_volatile_bytes,
> +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> +	};
> +	info->nr_partitions++;
> +
> +	return 0;
>  }
> -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
>  
>  int cxl_set_timestamp(struct cxl_memdev_state *mds)
>  {
> @@ -1452,8 +1440,6 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  	mds->cxlds.reg_map.host = dev;
>  	mds->cxlds.reg_map.resource = CXL_RESOURCE_NONE;
>  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> -	to_ram_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
> -	to_pmem_perf(&mds->cxlds)->qos_class = CXL_QOS_CLASS_INVALID;
>  
>  	return mds;
>  }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 78e92e24d7b5..2e728d4b7327 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			 resource_size_t base, resource_size_t len,
>  			 resource_size_t skipped);
>  
> +/* Well known, spec defined partition indices */
> +enum cxl_partition {
> +	CXL_PARTITION_RAM,
> +	CXL_PARTITION_PMEM,
> +	CXL_PARTITION_MAX,
> +};
> +
> +struct cxl_dpa_info {
> +	u64 size;
> +	struct range range[CXL_PARTITION_MAX];
> +	int nr_partitions;
> +};
> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
> +
>  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>  					 struct cxl_memdev *cxlmd)
>  {
> @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
>  	int qos_class;
>  };
>  
> +/**
> + * struct cxl_dpa_partition - DPA partition descriptor
> + * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
> + * @perf: performance attributes of the partition from CDAT
> + */
> +struct cxl_dpa_partition {
> +	struct resource res;
> +	struct cxl_dpa_perf perf;
> +};
> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
>   * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
>   * @media_ready: Indicate whether the device media is usable
>   * @dpa_res: Overall DPA resource tree for the device
> - * @_pmem_res: Active Persistent memory capacity configuration
> - * @_ram_res: Active Volatile memory capacity configuration
> + * @part: DPA partition array
> + * @nr_partitions: Number of DPA partitions
>   * @serial: PCIe Device Serial Number
>   * @type: Generic Memory Class device or Vendor Specific Memory device
>   * @cxl_mbox: CXL mailbox context
> @@ -438,21 +462,39 @@ struct cxl_dev_state {
>  	bool rcd;
>  	bool media_ready;
>  	struct resource dpa_res;
> -	struct resource _pmem_res;
> -	struct resource _ram_res;
> +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
> +	unsigned int nr_partitions;
>  	u64 serial;
>  	enum cxl_devtype type;
>  	struct cxl_mailbox cxl_mbox;
>  };
>  
> -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>  {
> -	return &cxlds->_ram_res;
> +	if (cxlds->nr_partitions > 0)
> +		return &cxlds->part[CXL_PARTITION_RAM].res;
> +	return NULL;
>  }
>  
> -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>  {
> -	return &cxlds->_pmem_res;
> +	if (cxlds->nr_partitions > 1)
> +		return &cxlds->part[CXL_PARTITION_PMEM].res;
> +	return NULL;
> +}
> +
> +static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> +{
> +	if (cxlds->nr_partitions > 0)
> +		return &cxlds->part[CXL_PARTITION_RAM].perf;
> +	return NULL;
> +}
> +
> +static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> +{
> +	if (cxlds->nr_partitions > 1)
> +		return &cxlds->part[CXL_PARTITION_PMEM].perf;
> +	return NULL;
>  }
>  
>  static inline resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
> @@ -499,8 +541,6 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> - * @_ram_perf: performance data entry matched to RAM partition
> - * @_pmem_perf: performance data entry matched to PMEM partition
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -524,29 +564,12 @@ struct cxl_memdev_state {
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
>  
> -	struct cxl_dpa_perf _ram_perf;
> -	struct cxl_dpa_perf _pmem_perf;
> -
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	struct cxl_security_state security;
>  	struct cxl_fw_state fw;
>  };
>  
> -static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> -{
> -	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
> -
> -	return &mds->_ram_perf;
> -}
> -
> -static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> -{
> -	struct cxl_memdev_state *mds = container_of(cxlds, typeof(*mds), cxlds);
> -
> -	return &mds->_pmem_perf;
> -}
> -
>  static inline struct cxl_memdev_state *
>  to_cxl_memdev_state(struct cxl_dev_state *cxlds)
>  {
> @@ -860,7 +883,7 @@ int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
>  void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
>  				unsigned long *cmds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 0241d1d7133a..47dbfe406236 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -900,6 +900,7 @@ __ATTRIBUTE_GROUPS(cxl_rcd);
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
> +	struct cxl_dpa_info range_info = { 0 };
>  	struct cxl_memdev_state *mds;
>  	struct cxl_dev_state *cxlds;
>  	struct cxl_register_map map;
> @@ -989,7 +990,11 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_mem_create_range_info(mds);
> +	rc = cxl_mem_dpa_fetch(mds, &range_info);
> +	if (rc)
> +		return rc;
> +
> +	rc = cxl_dpa_setup(cxlds, &range_info);
>  	if (rc)
>  		return rc;
>  
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 7f1c5061307b..ba3d48b37de3 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -1001,26 +1001,19 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
> -	const struct resource *partition[] = {
> -		to_ram_res(cxlds),
> -		to_pmem_res(cxlds),
> -	};
> -	struct cxl_dpa_perf *perf[] = {
> -		to_ram_perf(cxlds),
> -		to_pmem_perf(cxlds),
> -	};
>  
>  	if (!cxl_root)
>  		return;
>  
> -	for (int i = 0; i < ARRAY_SIZE(partition); i++) {
> -		const struct resource *res = partition[i];
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		struct resource *res = &cxlds->part[i].res;
> +		struct cxl_dpa_perf *perf = &cxlds->part[i].perf;
>  		struct range range = {
>  			.start = res->start,
>  			.end = res->end,
>  		};
>  
> -		dpa_perf_setup(port, &range, perf[i]);
> +		dpa_perf_setup(port, &range, perf);
>  	}
>  
>  	cxl_memdev_update_perf(cxlmd);
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 347c1e7b37bd..ed365e083c8f 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -1477,6 +1477,7 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
>  	struct cxl_dev_state *cxlds;
>  	struct cxl_mockmem_data *mdata;
>  	struct cxl_mailbox *cxl_mbox;
> +	struct cxl_dpa_info range_info = { 0 };
>  	int rc;
>  
>  	mdata = devm_kzalloc(dev, sizeof(*mdata), GFP_KERNEL);
> @@ -1537,7 +1538,11 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_mem_create_range_info(mds);
> +	rc = cxl_mem_dpa_fetch(mds, &range_info);
> +	if (rc)
> +		return rc;
> +
> +	rc = cxl_dpa_setup(cxlds, &range_info);
>  	if (rc)
>  		return rc;
>  
> 
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17 15:58   ` Alejandro Lucero Palau
@ 2025-01-17 22:52     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-17 22:52 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny

Alejandro Lucero Palau wrote:
[..]
> > +/* if this fails the caller must destroy @cxlds, there is no recovery */
> > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +
> > +	guard(rwsem_write)(&cxl_dpa_rwsem);
> > +
> 
> 
> This explains to me what you meant about locking when setting the 
> resources for Type2.
> 
> 
> However, I think this is not necessary because, as I understand it, no
> user space is involved when creating CXL regions for a Type2. It is all
> up to the accel driver to do so, therefore no locking is needed because
> nothing else will traverse the child resource list while it is being
> initialised/updated.

Yes, no locking is needed, and that was the status quo for cxl_pci since
it was simple to audit the single user. Going forward, with multiple
users, and the fact that cxl_dev_state is not strictly private to the
cxl_core, some safety is reasonable.

> It does no harm to have it for the current Type2 case, and it is always
> a good idea to have it for potential future cases.

My main motivation is to help protect against someone thinking that
calling cxl_dpa_setup() twice is a workable model.
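
The "call it once" contract described here can be sketched outside the
kernel as a tiny model; all `toy_*` names below are illustrative
stand-ins, not the actual CXL data structures:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EBUSY 16
#define TOY_MAX_PARTS 2

/* Toy stand-in for the partition fields of 'struct cxl_dev_state' */
struct toy_dev_state {
	unsigned int nr_partitions;
	unsigned long long part_size[TOY_MAX_PARTS];
};

/*
 * Mirrors the cxl_dpa_setup() contract: the first caller wins; a second
 * call fails with -EBUSY instead of silently re-initializing DPA state.
 * (The real function additionally holds cxl_dpa_rwsem for writing.)
 */
static int toy_dpa_setup(struct toy_dev_state *s,
			 const unsigned long long *sizes, unsigned int n)
{
	if (s->nr_partitions)
		return -TOY_EBUSY;
	if (n > TOY_MAX_PARTS)
		return -1;
	for (unsigned int i = 0; i < n; i++) {
		s->part_size[i] = sizes[i];
		s->nr_partitions++;
	}
	return 0;
}
```

The -EBUSY check doubles as documentation: a caller that wants a
different partition layout must destroy the device state and start over,
as the comment above cxl_dpa_setup() says.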


* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17 18:23     ` Dan Williams
  2025-01-17 20:32       ` Ira Weiny
@ 2025-01-20 12:24       ` Alejandro Lucero Palau
  2025-01-31 23:54         ` Dan Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-20 12:24 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron; +Cc: linux-cxl, Dave Jiang, Ira Weiny


On 1/17/25 18:23, Dan Williams wrote:
> Jonathan Cameron wrote:
>> On Thu, 16 Jan 2025 22:10:44 -0800
>> Dan Williams <dan.j.williams@intel.com> wrote:
>>
>>> The pending efforts to add CXL Accelerator (type-2) device [1], and
>>> Dynamic Capacity (DCD) support [2], tripped on the
>>> no-longer-fit-for-purpose design in the CXL subsystem for tracking
>>> device-physical-address (DPA) metadata. Trip hazards include:
>>>
>>> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
>>>    devices with CXL.mem likely do not in the common case.
>>>
>>> - CXL Memory Devices enumerate DPA through Memory Device mailbox
>>>    commands like Partition Info, Accelerator devices do not.
>>>
>>> - CXL Memory Devices that support DCD support more than 2 partitions.
>>>    Some of the driver algorithms are awkward to expand to > 2 partition
>>>    cases.
>>>
>>> - DPA performance data is a general capability that can be shared with
>>>    accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
>>>    suitable.
>>>
>>> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
>>>    memory property, it should be phased in favor of a partition id and
>>>    the memory property comes from the partition info.
>>>
>>> Towards cleaning up those issues and allowing a smoother landing for the
>>> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
>>> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
>>> way for Memory Devices and Accelerators to initialize the DPA information
>>> in 'struct cxl_dev_state'.
>>>
>>> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
>>> get the new data structure initialized, and cleanup some qos_class init.
>>> Follow on patches will go further to use the new data structure to
>>> cleanup algorithms that are better suited to loop over all possible
>>> partitions.
>>>
>>> cxl_dpa_setup() follows the locking expectations of mutating the device
>>> DPA map, and is suitable for Accelerator drivers to use. Accelerators
>>> likely only have one hardcoded 'ram' partition to convey to the
>>> cxl_core.
>>>
>>> Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
>>> Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
>>> Cc: Dave Jiang <dave.jiang@intel.com>
>>> Cc: Alejandro Lucero <alucerop@amd.com>
>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> Hi Dan,
>>
>> In basic form this seems fine, but I find the nr_partitions variable usage very
>> counter intuitive.  It's just how many we configured not how many there
>> are, potentially with 0 size (so not a partition).  I'd be happier if we
>> can avoid that by just prefilling the lot with zero size and filling in
>> the ones we want.  So zero size means doesn't exist and use an iterator where
>> appropriate to skip the zero size ones.
> The PMEM-only device case did give me pause. Is that 2 partitions with a
> zero-sized first partition, or is that just 1 partition?


I was wrong about the code being broken for this case.

The code would create two partitions, at least for the case of partition 
alignment being 0, with the first one having 0 size.

This is all based on data and code driven by mbox commands. Without an 
mbox, this partition info needs to be hardcoded or created by the accel 
driver through its own means, so it is good to know the code expects such 
a 0-size partition for the pmem-only case and that nr_partitions should 
be set accordingly. It is not what my type2 patchset needs now, but I bet 
a coming accel driver will need to set this properly.
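
As a rough illustration of the accel-driver side (hypothetical `toy_*`
names, not the proposed kernel API): a driver with one fixed volatile
capacity would only describe a single partition when populating its DPA
info.

```c
#include <assert.h>

/* Simplified stand-ins for 'struct range' and 'struct cxl_dpa_info' */
struct toy_range {
	unsigned long long start, end;
};

struct toy_dpa_info {
	unsigned long long size;
	struct toy_range range[2];
	int nr_partitions;
};

/*
 * An accel driver with one fixed volatile capacity describes a single
 * 'ram' partition; no zero-sized pmem placeholder is needed for that
 * common case.
 */
static void toy_accel_fill_dpa_info(struct toy_dpa_info *info,
				    unsigned long long ram_bytes)
{
	info->size = ram_bytes;
	info->range[0].start = 0;
	info->range[0].end = ram_bytes - 1;
	info->nr_partitions = 1;
}
```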


> Ultimately I do think the code should further evolve to treat that as
> 1-PMEM-partition, but as far as I can see that depends on 'enum
> cxl_decoder_mode' being eliminated and teaching all code paths to search
> for the position of the PMEM partition.
>
>> Without that tidied up, to me this is more confusing than the previous code.
> I was going to save PMEM at a partition other than 1 for the DCD series,
> but let me take another pass at adding that to this series.
>
> [..]
>>> +/* if this fails the caller must destroy @cxlds, there is no recovery */
>>> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
>>> +{
>>> +	struct device *dev = cxlds->dev;
>>> +
>>> +	guard(rwsem_write)(&cxl_dpa_rwsem);
>>> +
>>> +	if (cxlds->nr_partitions)
>>> +		return -EBUSY;
>>> +
>>> +	if (!info->size || !info->nr_partitions) {
>>> +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>>> +		cxlds->nr_partitions = 0;
>>> +		return 0;
>>> +	}
>>> +
>>> +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>>> +
>>> +	for (int i = 0; i < info->nr_partitions; i++) {
>>> +		const char *desc;
>>> +		int rc;
>>> +
>>> +		if (i == CXL_PARTITION_RAM)
>>> +			desc = "ram";
>>> +		else if (i == CXL_PARTITION_PMEM)
>>> +			desc = "pmem";
>>> +		else
>>> +			desc = "";
>>> +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
>>> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
>>> +				 info->range[i].start,
>>> +				 range_len(&info->range[i]), desc);
>>> +		if (rc)
>>> +			return rc;
>>> +		cxlds->nr_partitions++;
>> I'd just initialize the rest to 0 length similar to what is happening
>> if we have pmem only anyway.  Then this nr_patitions goes away and
>> stops being a possible source of confusion.
> Modulo teaching other code that wants to ask "what is the size of the
> PMEM partition" to use a helper that hides the "find the device's PMEM
> partition".
>
>
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
>>> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
>>> index 3502f1633ad2..7dca5c8c3494 100644
>>> --- a/drivers/cxl/core/mbox.c
>>> +++ b/drivers/cxl/core/mbox.c
>>> @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>>> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>>> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>>>   {
>>>   	struct cxl_dev_state *cxlds = &mds->cxlds;
>>> -	struct resource *ram_res = to_ram_res(cxlds);
>>> -	struct resource *pmem_res = to_pmem_res(cxlds);
>>>   	struct device *dev = cxlds->dev;
>>>   	int rc;
>>>   
>>>   	if (!cxlds->media_ready) {
>>> -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>>> -		*ram_res = DEFINE_RES_MEM(0, 0);
>>> -		*pmem_res = DEFINE_RES_MEM(0, 0);
>>> +		info->size = 0;
>>>   		return 0;
>>>   	}
>>>   
>>> -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>>> +	info->size = mds->total_bytes;
>>>   
>>>   	if (mds->partition_align_bytes == 0) {
>> Obviously nothing to do with your patch as such, but maybe tidy this up
>> by making active values == fixed values when we don't have partition control.
>> That seems logical anyway to me and means we only end up with one lot of
>> range setup in here.  I can't immediately see any side effects of doing this.
> Yeah, I mentioned this in another thread. There is no reason
> for 'struct cxl_memdev_state' to carry these values at all. They are
> just temporary init-data.
>
> So, cxl_dev_state_identify() becomes cxl_mem_identify(), since
> it is a memory-device command. Move it inside of cxl_mem_dpa_fetch()
> since it is just temporary init-data for 'struct cxl_dpa_info'.
>
> [..]
>>> -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>>> -				 mds->volatile_only_bytes, "ram");
>>> -		if (rc)
>>> -			return rc;
>>> -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>>> -				   mds->volatile_only_bytes,
>>> -				   mds->persistent_only_bytes, "pmem");
>>> +		info->range[CXL_PARTITION_RAM] = (struct range) {
>>> +			.start = 0,
>>> +			.end = mds->volatile_only_bytes - 1,
>>> +		};
>>> +		info->nr_partitions++;
>>> +
>>> +		if (!mds->persistent_only_bytes)
>>> +			return 0;
>>> +
>>> +		info->range[CXL_PARTITION_PMEM] = (struct range) {
>>> +			.start = mds->volatile_only_bytes,
>>> +			.end = mds->volatile_only_bytes +
>>> +			       mds->persistent_only_bytes - 1,
>>> +		};
>>> +		info->nr_partitions++;
>> This nr partitions makes some sense though I'd be tempted to add a type
>> array to info so that we can just not pass empty ones if we don't want to.
>> Makes this code a little more complex, but not a lot and means
>> nr_partitions becomes the ones that actually exist.
> Agree, that's the end goal.
>
>>> +		return 0;
>>>   	}
>>>   
>>>   	rc = cxl_mem_get_partition_info(mds);
>>> @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>>>   		return rc;
>>>   	}
>>>   
>>> -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>>> -			 mds->active_volatile_bytes, "ram");
>>> -	if (rc)
>>> -		return rc;
>>> -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>>> -			   mds->active_volatile_bytes,
>>> -			   mds->active_persistent_bytes, "pmem");
>>> +	info->range[CXL_PARTITION_RAM] = (struct range) {
>>> +		.start = 0,
>>> +		.end = mds->active_volatile_bytes - 1,
>>> +	};
>>> +	info->nr_partitions++;
>>> +
>>> +	if (!mds->active_persistent_bytes)
>>> +		return 0;
>>> +
>>> +	info->range[CXL_PARTITION_PMEM] = (struct range) {
>>> +		.start = mds->active_volatile_bytes,
>>> +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
>>> +	};
>>> +	info->nr_partitions++;
>>> +
>>> +	return 0;
>>>   }
>>> -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
>>> +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
>>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>>> index 78e92e24d7b5..2e728d4b7327 100644
>>> --- a/drivers/cxl/cxlmem.h
>>> +++ b/drivers/cxl/cxlmem.h
>>> @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>>   			 resource_size_t base, resource_size_t len,
>>>   			 resource_size_t skipped);
>>>   
>>> +/* Well known, spec defined partition indices */
>>> +enum cxl_partition {
>>> +	CXL_PARTITION_RAM,
>>> +	CXL_PARTITION_PMEM,
>>> +	CXL_PARTITION_MAX,
>>> +};
>>> +
>>> +struct cxl_dpa_info {
>>> +	u64 size;
>>> +	struct range range[CXL_PARTITION_MAX];
>>> +	int nr_partitions;
>>> +};
>> blank line seems appropriate here.
> Added.
>
>>> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
>>> +
>>>   static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>>>   					 struct cxl_memdev *cxlmd)
>>>   {
>>> @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
>>>   	int qos_class;
>>>   };
>>>   
>>>   /**
>>>    * struct cxl_dev_state - The driver device state
>>>    *
>>> @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
>>>    * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
>>>    * @media_ready: Indicate whether the device media is usable
>>>    * @dpa_res: Overall DPA resource tree for the device
>>> - * @_pmem_res: Active Persistent memory capacity configuration
>>> - * @_ram_res: Active Volatile memory capacity configuration
>>> + * @part: DPA partition array
>>> + * @nr_partitions: Number of DPA partitions
>> This needs more. It is not the number of partitions present I think, it
>> is the number that a particular driver is potentially interested in.
>>
>>>    * @serial: PCIe Device Serial Number
>>>    * @type: Generic Memory Class device or Vendor Specific Memory device
>>>    * @cxl_mbox: CXL mailbox context
>>> @@ -438,21 +462,39 @@ struct cxl_dev_state {
>>>   	bool rcd;
>>>   	bool media_ready;
>>>   	struct resource dpa_res;
>>> -	struct resource _pmem_res;
>>> -	struct resource _ram_res;
>>> +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
>>> +	unsigned int nr_partitions;
>>>   	u64 serial;
>>>   	enum cxl_devtype type;
>>>   	struct cxl_mailbox cxl_mbox;
>>>   };
>>>   
>>> -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>>> +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>>>   {
>>> -	return &cxlds->_ram_res;
>>> +	if (cxlds->nr_partitions > 0)
>>> +		return &cxlds->part[CXL_PARTITION_RAM].res;
>>> +	return NULL;
>>>   }
>>>   
>>> -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>>> +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>>>   {
>>> -	return &cxlds->_pmem_res;
>>> +	if (cxlds->nr_partitions > 1)
>> This is very confusing as nr_partitions is being used not to indicate
>> number of partitions but whether a driver has filled in the data for them
>> (which may well be empty).
>>
>> I'd rather see that as a bitmap, or a 'not set' value initialized by
>> the core that is then replaced when they are set.
> ...or even better, not require PMEM to be at partition1.
>
> [..]


* Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-17 20:57     ` Dan Williams
@ 2025-01-20 12:39       ` Alejandro Lucero Palau
  2025-02-01  0:08         ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-20 12:39 UTC (permalink / raw)
  To: Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny


On 1/17/25 20:57, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> On 1/17/25 06:10, Dan Williams wrote:
>>> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
>>> allocations being distinct from RAM allocations in specific ways when in
>>> practice the allocation rules are only relative to DPA partition index.
>>>
>>> The rules for cxl_dpa_alloc() are:
>>>
>>> - allocations can only come from 1 partition
>>>
>>> - if allocating at partition-index-N, all free space in partitions less
>>>     than partition-index-N must be skipped over
>>
>> In my view, you are mixing the current code with the new code in this
>> explanation. It would be better to say the current code assumption is
>> just two partitions, ram and pmem, but DCD changes the game.
> There is no mixture in that description. The rules have not changed from
> old to new, the implementation is updated to reflect that the algorithm
> never needed to consider ram and pmem explicitly.


Well, if I'm not mistaken, until now we did not need to support an 
arbitrary number N of partitions but just 2.

In fact, your next sentence just confirms this.


<snip>


>
>> I think the problem here is we assumed ram and pmem each being a child,
>> likely with some free space, but a device with multiple HDM decoders
>> implies potentially several children.
>>
>> The code supported the case of multiple children, but I guess we still
>> had the simple case in mind. Otherwise I can not understand all this ...
> Holes are not allowed. If you want to delete any decoder capacity you
> need to tear down all higher DPA allocations. That is a constraint of
> the hardware definition around how software can assume the value of
> DPABase. Will add some comments to that effect.


I have to admit I have not mastered the case of regions created from 
user space, but I thought the idea was to allow creating regions from 
independent devices to facilitate interleaving, and to add flexibility 
when a memory region bigger than what a single device can offer at a 
given time is needed.


If so, I would expect cases like two different regions using different 
ranges inside a device's DPA, with the two regions completely 
independent. If I understand your answer, this is possible, but if you 
release one of those regions you can not reuse the released DPA until 
the higher one is also released ... my instinct tells me this can not 
be the case ...


If what I'm saying makes no sense, I'll try a more detailed description 
with a timeline and actions from user space, endpoint decoders, regions 
and so on, which will definitely help me to (hopefully) understand the 
user space case.
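
Dan's "holes are not allowed" constraint quoted above can be pictured as
a last-in-first-out stack of decoder allocations. This is a toy model
under that assumption, not the kernel implementation:

```c
#include <assert.h>

#define TOY_MAX_DECODERS 4

/*
 * Toy model of the hardware rule: each decoder's DPA base is implied by
 * the decoders below it, so capacity can only be released in
 * last-in-first-out order.
 */
struct toy_dpa_stack {
	unsigned long long end[TOY_MAX_DECODERS];
	int n;
};

static int toy_dpa_reserve(struct toy_dpa_stack *s, unsigned long long len)
{
	unsigned long long base = s->n ? s->end[s->n - 1] : 0;

	if (s->n == TOY_MAX_DECODERS)
		return -1;
	s->end[s->n] = base + len;
	return s->n++;		/* returns the decoder index */
}

static int toy_dpa_release(struct toy_dpa_stack *s, int idx)
{
	if (idx != s->n - 1)
		return -1;	/* would leave a hole below a live decoder */
	s->n--;
	return 0;
}
```

Under this model, a lower region's DPA really is unusable until every
higher allocation has been torn down first.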




* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-17 22:08   ` Ira Weiny
@ 2025-01-31 23:39     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-31 23:39 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams, linux-cxl
  Cc: Dave Jiang, Alejandro Lucero, Ira Weiny

Ira Weiny wrote:
> Dan Williams wrote:
> > The pending efforts to add CXL Accelerator (type-2) device [1], and
> > Dynamic Capacity (DCD) support [2], tripped on the
> > no-longer-fit-for-purpose design in the CXL subsystem for tracking
> > device-physical-address (DPA) metadata. Trip hazards include:
> > 
> > - CXL Memory Devices need to consider a PMEM partition, but Accelerator
> >   devices with CXL.mem likely do not in the common case.
> > 
> > - CXL Memory Devices enumerate DPA through Memory Device mailbox
> >   commands like Partition Info, Accelerator devices do not.
> > 
> > - CXL Memory Devices that support DCD support more than 2 partitions.
> >   Some of the driver algorithms are awkward to expand to > 2 partition
> >   cases.
> > 
> > - DPA performance data is a general capability that can be shared with
> >   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
> >   suitable.
> > 
> > - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
> >   memory property, it should be phased in favor of a partition id and
> >   the memory property comes from the partition info.
> > 
> > Towards cleaning up those issues and allowing a smoother landing for the
> > aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> > array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> > way for Memory Devices and Accelerators to initialize the DPA information
> > in 'struct cxl_dev_state'.
> > 
> > For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> > get the new data structure initialized, and cleanup some qos_class init.
> > Follow on patches will go further to use the new data structure to
> > cleanup algorithms that are better suited to loop over all possible
> > partitions.
> > 
> > cxl_dpa_setup() follows the locking expectations of mutating the device
> > DPA map, and is suitable for Accelerator drivers to use. Accelerators
> > likely only have one hardcoded 'ram' partition to convey to the
> > cxl_core.
> > 
> > Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> > Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 7a85522294ad..7e1559b3ed88 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
[..]
> > +/* if this fails the caller must destroy @cxlds, there is no recovery */
> > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +
> > +	guard(rwsem_write)(&cxl_dpa_rwsem);
> 
> Why is this semaphore required now?

Previously DPA setup activities were known to be carried out in a
hard-coded order by the cxl_pci driver. With accelerator support and
this being a publicly exported function, that calling context can no
longer be assumed. So, take the lock as is typically expected when
mutating the DPA space.


* Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
  2025-01-20 12:24       ` Alejandro Lucero Palau
@ 2025-01-31 23:54         ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-01-31 23:54 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, Jonathan Cameron
  Cc: linux-cxl, Dave Jiang, Ira Weiny

Alejandro Lucero Palau wrote:
> 
> On 1/17/25 18:23, Dan Williams wrote:
> > Jonathan Cameron wrote:
> >> On Thu, 16 Jan 2025 22:10:44 -0800
> >> Dan Williams <dan.j.williams@intel.com> wrote:
> >>
> >>> The pending efforts to add CXL Accelerator (type-2) device [1], and
> >>> Dynamic Capacity (DCD) support [2], tripped on the
> >>> no-longer-fit-for-purpose design in the CXL subsystem for tracking
> >>> device-physical-address (DPA) metadata. Trip hazards include:
> >>>
> >>> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
> >>>    devices with CXL.mem likely do not in the common case.
> >>>
> >>> - CXL Memory Devices enumerate DPA through Memory Device mailbox
> >>>    commands like Partition Info, Accelerator devices do not.
> >>>
> >>> - CXL Memory Devices that support DCD support more than 2 partitions.
> >>>    Some of the driver algorithms are awkward to expand to > 2 partition
> >>>    cases.
> >>>
> >>> - DPA performance data is a general capability that can be shared with
> >>>    accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
> >>>    suitable.
> >>>
> >>> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
> >>>    memory property, it should be phased in favor of a partition id and
> >>>    the memory property comes from the partition info.
> >>>
> >>> Towards cleaning up those issues and allowing a smoother landing for the
> >>> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> >>> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> >>> way for Memory Devices and Accelerators to initialize the DPA information
> >>> in 'struct cxl_dev_state'.
> >>>
> >>> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> >>> get the new data structure initialized, and cleanup some qos_class init.
> >>> Follow on patches will go further to use the new data structure to
> >>> cleanup algorithms that are better suited to loop over all possible
> >>> partitions.
> >>>
> >>> cxl_dpa_setup() follows the locking expectations of mutating the device
> >>> DPA map, and is suitable for Accelerator drivers to use. Accelerators
> >>> likely only have one hardcoded 'ram' partition to convey to the
> >>> cxl_core.
> >>>
> >>> Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> >>> Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> >>> Cc: Dave Jiang <dave.jiang@intel.com>
> >>> Cc: Alejandro Lucero <alucerop@amd.com>
> >>> Cc: Ira Weiny <ira.weiny@intel.com>
> >>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> Hi Dan,
> >>
> >> In basic form this seems fine, but I find the nr_partitions variable usage very
> >> counter intuitive.  It's just how many we configured not how many there
> >> are, potentially with 0 size (so not a partition).  I'd be happier if we
> >> can avoid that by just prefilling the lot with zero size and filling in
> >> the ones we want.  So zero size means doesn't exist and use an iterator where
> >> appropriate to skip the zero size ones.
> > The PMEM-only device case did give me pause. Is that 2 partitions with a
> > zero-sized first partition, or is that just 1 partition?
> 
> 
> I was wrong about the code being broken for this case.
> 
> The code would create two partitions, at least for the case of partition 
> alignment being 0, with the first one having 0 size.

That was the feedback from Jonathan from v1 => v2. Drop zero-size
partitions from the tracking.

> This is all based on data/code based on mbox commands. Without mbox this 
> partition info needs to be hardcoded or created somehow by the accel 
> driver by its own means, so it is good to know the code expects such a 
> 0-size partition for the pmem-only case and the nr_partitions should be 
> set accordingly. Not what my type2 patchset needs now, but I bet this 
> will need to be set properly by a coming accel driver.

An accelerator does not need to worry about passing in a 0-sized pmem
partition, it can just register the one DPA range that it cares about.
nr_partitions means that the partition array has entries
[0]..[nr_partitions-1] filled with non-zero data, each with a distinct
operation mode, in contiguous order starting from zero.
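
That invariant can be written down as a small check (toy types, assumed
for illustration only):

```c
#include <assert.h>

enum toy_mode { TOY_NONE = 0, TOY_RAM, TOY_PMEM };

struct toy_part {
	enum toy_mode mode;
	unsigned long long size;
};

/*
 * Checks the stated invariant: entries [0..nr-1] carry real, non-zero
 * partitions, and every entry at index nr or above is untouched.
 */
static int toy_parts_valid(const struct toy_part *p, int nr, int max)
{
	for (int i = 0; i < nr; i++)
		if (p[i].mode == TOY_NONE || p[i].size == 0)
			return 0;
	for (int i = nr; i < max; i++)
		if (p[i].mode != TOY_NONE || p[i].size != 0)
			return 0;
	return 1;
}
```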

The only place Ira and I identified where this potentially runs into
trouble is if a device places a gap between partitions, or if we want to
skip over "shared" capacity in the first round of DCD support. Might
just need to mandate that userspace skip over "shared" capacity for now.


* Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
  2025-01-20 12:39       ` Alejandro Lucero Palau
@ 2025-02-01  0:08         ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2025-02-01  0:08 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams, linux-cxl; +Cc: Dave Jiang, Ira Weiny

Alejandro Lucero Palau wrote:
> 
> On 1/17/25 20:57, Dan Williams wrote:
> > Alejandro Lucero Palau wrote:
> >> On 1/17/25 06:10, Dan Williams wrote:
> >>> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> >>> allocations being distinct from RAM allocations in specific ways when in
> >>> practice the allocation rules are only relative to DPA partition index.
> >>>
> >>> The rules for cxl_dpa_alloc() are:
> >>>
> >>> - allocations can only come from 1 partition
> >>>
> >>> - if allocating at partition-index-N, all free space in partitions less
> >>>     than partition-index-N must be skipped over
> >>
> >> In my view, you are mixing the current code with the new code in this
> >> explanation. It would be better to say the current code assumption is
> >> just two partitions, ram and pmem, but DCD changes the game.
> > There is no mixture in that description. The rules have not changed from
> > old to new, the implementation is updated to reflect that the algorithm
> > never needed to consider ram and pmem explicitly.
> 
> 
> Well, if I'm not mistaken, until now we did not need to support an 
> arbitrary number N of partitions but just 2.
> 
> In fact, your next sentence just confirms this.

Right, the algorithm can be more generic, and that is immediately needed
for the DCD case where it adds new partition types beyond pmem.

[..]
> > Holes are not allowed. If you want to delete any decoder capacity you
> > need to tear down all higher DPA allocations. That is a constraint of
> > the hardware definition around how software can assume the value of
> > DPABase. Will add some comments to that effect.
> 
> 
> I have to admit I do not master the case of regions created from user 
> space, but I thought the idea was to allow creating regions from 
> independent devices, to facilitate interleaving and to add flexibility 
> when a memory region bigger than what a single device can offer at a 
> given time is needed.
> 
> 
> If so, I would expect cases like two different regions using different 
> ranges inside a device DPA, and the two regions completely independent. 
> If I understand your answer, this is possible but if you want to release 
> one of those regions, you can not use the released DPA until the first 
> one is also released ... my instinct tells me this can not be the case ...

As it turns out, the CXL specification mandates this counter-intuitive
situation. See the implementation note in the spec titled "Device
Decode Logic". There you will see that if you have regionA and regionB
on a device using decoderA and decoderB, where B is at a higher DPA (and
by definition a higher decoder id than A), you cannot change the size of
decoderA without impacting the base address of decoderB. CXL HDM
Decoders are not PCI BARs.

> If what I'm saying makes no sense, I'll try with a more complex 
> description with a timeline and actions from user space, endpoint 
> decoders, regions and so on, which will definitely help me to 
> (hopefully) understand the user space case.

See some of the commits that deal with this specification constraint:

105b6235ad0f cxl/port: Prevent out-of-order decoder allocation
101c268bd2f3 cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
cb66b1d60c28 cxl/region: Allow out of order assembly of autodiscovered regions
2ab47045ac96 cxl/region: Flag partially torn down regions as unusable

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-02-01  0:08 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
2025-01-17 10:03   ` Jonathan Cameron
2025-01-17 17:47     ` Dan Williams
2025-01-17 10:24   ` Alejandro Lucero Palau
2025-01-17 17:54     ` Dan Williams
2025-01-17 18:45   ` Ira Weiny
2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
2025-01-17 10:20   ` Jonathan Cameron
2025-01-17 10:23     ` Jonathan Cameron
2025-01-17 17:55       ` Dan Williams
2025-01-17 13:33   ` Alejandro Lucero Palau
2025-01-17 20:47     ` Dan Williams
2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
2025-01-17 10:52   ` Jonathan Cameron
2025-01-17 13:38     ` Alejandro Lucero Palau
2025-01-17 18:23     ` Dan Williams
2025-01-17 20:32       ` Ira Weiny
2025-01-20 12:24       ` Alejandro Lucero Palau
2025-01-31 23:54         ` Dan Williams
2025-01-17 15:58   ` Alejandro Lucero Palau
2025-01-17 22:52     ` Dan Williams
2025-01-17 20:42   ` Ira Weiny
2025-01-17 22:08   ` Ira Weiny
2025-01-31 23:39     ` Dan Williams
2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
2025-01-17 11:12   ` Jonathan Cameron
2025-01-17 18:37     ` Dan Williams
2025-01-17 15:42   ` Alejandro Lucero Palau
2025-01-17 20:57     ` Dan Williams
2025-01-20 12:39       ` Alejandro Lucero Palau
2025-02-01  0:08         ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox