* [PATCH v9 01/19] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 14:19 ` Jonathan Cameron
2025-04-13 22:52 ` [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device Ira Weiny
` (21 subsequent siblings)
22 siblings, 1 reply; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Per the CXL 3.1 specification software must check the Command Effects
Log (CEL) for dynamic capacity command support.
Detect support for the DCD commands while reading the CEL, including:
Get DC Config
Get DC Extent List
Add DC Response
Release DC
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebased]
[iweiny: remove tags]
[djbw: remove dcd_cmds bitmask from mds]
---
drivers/cxl/core/mbox.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxlmem.h | 15 +++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index d72764056ce6..58d378400a4b 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -165,6 +165,43 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
}
}
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+ return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds, u16 opcode,
+ unsigned long *cmd_mask)
+{
+ switch (opcode) {
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ set_bit(CXL_DCD_ENABLED_GET_CONFIG, cmd_mask);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, cmd_mask);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, cmd_mask);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ set_bit(CXL_DCD_ENABLED_RELEASE, cmd_mask);
+ break;
+ default:
+ break;
+ }
+}
+
+static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
+{
+ DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
+ DECLARE_BITMAP(dst, CXL_DCD_ENABLED_MAX);
+
+ bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
+ return bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
+}
+
static bool cxl_is_poison_command(u16 opcode)
{
#define CXL_MBOX_OP_POISON_CMDS 0x43
@@ -750,6 +787,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
struct cxl_cel_entry *cel_entry;
const int cel_entries = size / sizeof(*cel_entry);
+ DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
struct device *dev = mds->cxlds.dev;
int i, ro_cmds = 0, wr_cmds = 0;
@@ -778,11 +816,17 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
enabled++;
}
+ if (cxl_is_dcd_command(opcode)) {
+ cxl_set_dcd_cmd_enabled(mds, opcode, dcd_cmds);
+ enabled++;
+ }
+
dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
enabled ? "enabled" : "unsupported by driver");
}
set_features_cap(cxl_mbox, ro_cmds, wr_cmds);
+ mds->dcd_supported = cxl_verify_dcd_cmds(mds, dcd_cmds);
}
static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_memdev_state *mds)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 3ec6b906371b..394a776954f4 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -216,6 +216,15 @@ struct cxl_event_state {
struct mutex log_lock;
};
+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+ CXL_DCD_ENABLED_GET_CONFIG,
+ CXL_DCD_ENABLED_GET_EXTENT_LIST,
+ CXL_DCD_ENABLED_ADD_RESPONSE,
+ CXL_DCD_ENABLED_RELEASE,
+ CXL_DCD_ENABLED_MAX
+};
+
/* Device enabled poison commands */
enum poison_cmd_enabled_bits {
CXL_POISON_ENABLED_LIST,
@@ -472,6 +481,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @partition_align_bytes: alignment size for partition-able capacity
* @active_volatile_bytes: sum of hard + soft volatile
* @active_persistent_bytes: sum of hard + soft persistent
+ * @dcd_supported: all DCD commands are supported
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -491,6 +501,7 @@ struct cxl_memdev_state {
u64 partition_align_bytes;
u64 active_volatile_bytes;
u64 active_persistent_bytes;
+ bool dcd_supported;
struct cxl_event_state event;
struct cxl_poison_state poison;
@@ -551,6 +562,10 @@ enum cxl_opcode {
CXL_MBOX_OP_UNLOCK = 0x4503,
CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
+ CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
+ CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
+ CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
+ CXL_MBOX_OP_RELEASE_DC = 0x4803,
CXL_MBOX_OP_MAX = 0x10000
};
--
2.49.0
^ permalink raw reply related [flat|nested] 65+ messages in thread

* Re: [PATCH v9 01/19] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2025-04-13 22:52 ` [PATCH v9 01/19] cxl/mbox: Flag " Ira Weiny
@ 2025-04-14 14:19 ` Jonathan Cameron
2025-05-05 21:04 ` Fan Ni
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 14:19 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:09 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) for dynamic capacity command support.
>
> Detect support for the DCD commands while reading the CEL, including:
>
> Get DC Config
> Get DC Extent List
> Add DC Response
> Release DC
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> +
> +static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
It's not immediately obvious to me what the right behavior
from something called cxl_verify_dcd_cmds() is. A comment might help with that.
I think all it does right now is check if any bits are set. In my head
it was going to check that all bits needed for a useful implementation were
set. I did have to go check what a 'logical and' of a bitmap was defined as
because that bit of the bitmap_and() return value wasn't obvious to me either!
> +{
> + DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
> + DECLARE_BITMAP(dst, CXL_DCD_ENABLED_MAX);
> +
> + bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
> + return bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
> +}
> +
^ permalink raw reply [flat|nested] 65+ messages in thread

* Re: [PATCH v9 01/19] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2025-04-14 14:19 ` Jonathan Cameron
@ 2025-05-05 21:04 ` Fan Ni
2025-05-06 16:09 ` Ira Weiny
0 siblings, 1 reply; 65+ messages in thread
From: Fan Ni @ 2025-05-05 21:04 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Ira Weiny, Dave Jiang, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Mon, Apr 14, 2025 at 03:19:50PM +0100, Jonathan Cameron wrote:
> On Sun, 13 Apr 2025 17:52:09 -0500
> Ira Weiny <ira.weiny@intel.com> wrote:
>
> > Per the CXL 3.1 specification software must check the Command Effects
> > Log (CEL) for dynamic capacity command support.
> >
> > Detect support for the DCD commands while reading the CEL, including:
> >
> > Get DC Config
> > Get DC Extent List
> > Add DC Response
> > Release DC
> >
> > Based on an original patch by Navneet Singh.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
>
> > +
> > +static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
>
> It's not immediately obvious to me what the right behavior
> from something called cxl_verify_dcd_cmds() is. A comment might help with that.
>
> I think all it does right now is check if any bits are set. In my head
> it was going to check that all bits needed for a useful implementation were
> set. I did have to go check what a 'logical and' of a bitmap was defined as
> because that bit of the bitmap_and() return value wasn't obvious to me either!
The code only checks whether any DCD command (48xx) is supported; if any
is set, it will set "dcd_supported".
As you mentioned, it seems we should check that all the related commands
are supported; otherwise it is not a valid implementation.
Fan
>
>
> > +{
> > + DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
> > + DECLARE_BITMAP(dst, CXL_DCD_ENABLED_MAX);
> > +
> > + bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
> > + return bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
> > +}
> > +
>
>
--
Fan Ni
^ permalink raw reply [flat|nested] 65+ messages in thread

* Re: [PATCH v9 01/19] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2025-05-05 21:04 ` Fan Ni
@ 2025-05-06 16:09 ` Ira Weiny
2025-05-06 18:54 ` Fan Ni
0 siblings, 1 reply; 65+ messages in thread
From: Ira Weiny @ 2025-05-06 16:09 UTC (permalink / raw)
To: Fan Ni, Jonathan Cameron
Cc: Ira Weiny, Dave Jiang, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
Fan Ni wrote:
> On Mon, Apr 14, 2025 at 03:19:50PM +0100, Jonathan Cameron wrote:
> > On Sun, 13 Apr 2025 17:52:09 -0500
> > Ira Weiny <ira.weiny@intel.com> wrote:
[snip]
> >
> > > +
> > > +static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
> >
> > It's not immediately obvious to me what the right behavior
> > from something called cxl_verify_dcd_cmds() is. A comment might help with that.
> >
> > I think all it does right now is check if any bits are set. In my head
> > it was going to check that all bits needed for a useful implementation were
> > set. I did have to go check what a 'logical and' of a bitmap was defined as
> > because that bit of the bitmap_and() return value wasn't obvious to me either!
>
> The code only checks whether any DCD command (48xx) is supported; if any
> is set, it will set "dcd_supported".
> As you mentioned, it seems we should check that all the related commands
> are supported; otherwise it is not a valid implementation.
>
> Fan
> >
> >
> > > +{
> > > + DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
> > > + DECLARE_BITMAP(dst, CXL_DCD_ENABLED_MAX);
> > > +
> > > + bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
> > > + return bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
Yea... so this should read:
...
bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
return bitmap_equal(dst, all_cmds, CXL_DCD_ENABLED_MAX);
...
Of course if a device has set any of these commands true it better have
set them all. Otherwise the device is broken and it will fail in bad
ways.
But I agree with both of you that this is much better and makes it
explicit that something went wrong. A dev_dbg() might be in order to
debug such an issue.
Ira
[snip]
^ permalink raw reply [flat|nested] 65+ messages in thread

* Re: [PATCH v9 01/19] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2025-05-06 16:09 ` Ira Weiny
@ 2025-05-06 18:54 ` Fan Ni
0 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-05-06 18:54 UTC (permalink / raw)
To: Ira Weiny
Cc: Fan Ni, Jonathan Cameron, Dave Jiang, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel
On Tue, May 06, 2025 at 11:09:09AM -0500, Ira Weiny wrote:
> Fan Ni wrote:
> > On Mon, Apr 14, 2025 at 03:19:50PM +0100, Jonathan Cameron wrote:
> > > On Sun, 13 Apr 2025 17:52:09 -0500
> > > Ira Weiny <ira.weiny@intel.com> wrote:
>
> [snip]
>
> > >
> > > > +
> > > > +static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
> > >
> > > It's not immediately obvious to me what the right behavior
> > > from something called cxl_verify_dcd_cmds() is. A comment might help with that.
> > >
> > > I think all it does right now is check if any bits are set. In my head
> > > it was going to check that all bits needed for a useful implementation were
> > > set. I did have to go check what a 'logical and' of a bitmap was defined as
> > > because that bit of the bitmap_and() return value wasn't obvious to me either!
> >
> > The code only checks whether any DCD command (48xx) is supported; if any
> > is set, it will set "dcd_supported".
> > As you mentioned, it seems we should check that all the related commands
> > are supported; otherwise it is not a valid implementation.
> >
> > Fan
> > >
> > >
> > > > +{
> > > > + DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
> > > > + DECLARE_BITMAP(dst, CXL_DCD_ENABLED_MAX);
> > > > +
> > > > + bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
> > > > + return bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
>
> Yea... so this should read:
>
> ...
> bitmap_and(dst, cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
> return bitmap_equal(dst, all_cmds, CXL_DCD_ENABLED_MAX);
Maybe only
return bitmap_equal(cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX)?
Fan
> ...
>
> Of course if a device has set any of these commands true it better have
> set them all. Otherwise the device is broken and it will fail in bad
> ways.
>
> But I agree with both of you that this is much better and explicit that
> something went wrong. A dev_dbg() might be in order to debug such an
> issue.
>
> Ira
>
> [snip]
--
Fan Ni
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
2025-04-13 22:52 ` [PATCH v9 01/19] cxl/mbox: Flag " Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 14:35 ` Jonathan Cameron
2025-05-07 17:40 ` Fan Ni
2025-04-13 22:52 ` [PATCH v9 03/19] cxl/cdat: Gather DSMAS data for DCD partitions Ira Weiny
` (20 subsequent siblings)
22 siblings, 2 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Devices which optionally support Dynamic Capacity (DC) are configured
via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
the Get DC Configuration command in order to properly configure DCDs.
Without the Get DC Configuration command DCD can't be supported.
Implement the DC mailbox commands as specified in CXL 3.2 section
8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
information. Disable DCD if an invalid configuration is found.
Linux has no support for more than one dynamic capacity partition. Read
and validate all the partitions but configure only the first partition
as 'dynamic ram A'. Additional partitions can be added in the future if
such a device ever materializes. Additionally, it is anticipated that no
skips will be present after the end of the pmem partition. Check for and
disallow this configuration as well.
Linux has no use for the trailing fields of the Get Dynamic Capacity
Configuration Output Payload (Total number of supported extents, number
of available extents, total number of supported tags, and number of
available tags). Avoid defining those fields in favor of the more
useful flexible C array.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebase]
[iweiny: Update spec references to 3.2]
[djbw: Limit to 1 partition]
[djbw: Avoid inter-partition skipping]
[djbw: s/region/partition/]
[djbw: remove cxl_dc_region[partition]_info->name]
[iweiny: adjust to lack of dcd_cmds in mds]
[iweiny: remove extra 'region' from names]
[iweiny: remove unused CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG]
---
drivers/cxl/core/hdm.c | 2 +
drivers/cxl/core/mbox.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 1 +
drivers/cxl/cxlmem.h | 54 ++++++++++++++-
drivers/cxl/pci.c | 3 +
5 files changed, 238 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 70cae4ebf8a4..c5f8a17d00f1 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -459,6 +459,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
return "ram";
case CXL_PARTMODE_PMEM:
return "pmem";
+ case CXL_PARTMODE_DYNAMIC_RAM_A:
+ return "dynamic_ram_a";
default:
return "";
};
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 58d378400a4b..866a423d6125 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1313,6 +1313,153 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
return -EBUSY;
}
+static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
+ u8 index, struct cxl_dc_partition *dev_part)
+{
+ size_t blk_size, len;
+
+ part_array[index].start = le64_to_cpu(dev_part->base);
+ part_array[index].size = le64_to_cpu(dev_part->decode_length);
+ part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
+ len = le64_to_cpu(dev_part->length);
+ blk_size = le64_to_cpu(dev_part->block_size);
+
+ /* Check partitions are in increasing DPA order */
+ if (index > 0) {
+ struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
+
+ if ((prev_part->start + prev_part->size) >
+ part_array[index].start) {
+ dev_err(dev,
+ "DPA ordering violation for DC partition %d and %d\n",
+ index - 1, index);
+ return -EINVAL;
+ }
+ }
+
+ if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
+ !IS_ALIGNED(part_array[index].start, blk_size)) {
+ dev_err(dev, "DC partition %d invalid start %zu blk size %zu\n",
+ index, part_array[index].start, blk_size);
+ return -EINVAL;
+ }
+
+ if (part_array[index].size == 0 || len == 0 ||
+ part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
+ dev_err(dev, "DC partition %d invalid length; size %zu len %zu blk size %zu\n",
+ index, part_array[index].size, len, blk_size);
+ return -EINVAL;
+ }
+
+ if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
+ !is_power_of_2(blk_size)) {
+ dev_err(dev, "DC partition %d invalid block size; %zu\n",
+ index, blk_size);
+ return -EINVAL;
+ }
+
+ dev_dbg(dev, "DC partition %d start %zu size %zu blk size %zu\n",
+ index, part_array[index].start, part_array[index].size,
+ blk_size);
+
+ return 0;
+}
+
+/* Returns the number of partitions in dc_resp or -ERRNO */
+static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
+ struct cxl_mbox_get_dc_config_out *dc_resp,
+ size_t dc_resp_size)
+{
+ struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
+ .partition_count = CXL_MAX_DC_PARTITIONS,
+ .start_partition_index = start_partition,
+ };
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+ .payload_in = &get_dc,
+ .size_in = sizeof(get_dc),
+ .size_out = dc_resp_size,
+ .payload_out = dc_resp,
+ .min_out = 1,
+ };
+ int rc;
+
+ rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
+ dc_resp->partitions_returned, dc_resp->avail_partition_count);
+ return dc_resp->partitions_returned;
+}
+
+/**
+ * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
+ * device.
+ * @mbox: Mailbox to query
+ * @dc_info: The dynamic partition information to return
+ *
+ * Read Dynamic Capacity information from the device and return the partition
+ * information.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ * On error, @dc_info is left unchanged.
+ */
+int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
+ struct cxl_dc_partition_info *dc_info)
+{
+ struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
+ size_t dc_resp_size = mbox->payload_size;
+ struct device *dev = mbox->host;
+ u8 start_partition;
+ u8 num_partitions = 0;
+
+ struct cxl_mbox_get_dc_config_out *dc_resp __free(kvfree) =
+ kvmalloc(dc_resp_size, GFP_KERNEL);
+ if (!dc_resp)
+ return -ENOMEM;
+
+ /* Read and check all partition information for validity and potential
+ * debugging; see debug output in cxl_dc_check() */
+ start_partition = 0;
+ do {
+ int rc, i, j;
+
+ rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
+ if (rc < 0) {
+ dev_err(dev, "Failed to get DC config: %d\n", rc);
+ return rc;
+ }
+
+ num_partitions += rc;
+
+ if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
+ dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
+ num_partitions);
+ return -EINVAL;
+ }
+
+ for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
+ rc = cxl_dc_check(dev, partitions, i,
+ &dc_resp->partition[j]);
+ if (rc)
+ return rc;
+ }
+
+ start_partition = num_partitions;
+
+ } while (num_partitions < dc_resp->avail_partition_count);
+
+ /* Return 1st partition */
+ dc_info->start = partitions[0].start;
+ dc_info->size = partitions[0].size;
+ dev_dbg(dev, "Returning partition 0 start %zu size %zu\n",
+ dc_info->start, dc_info->size);
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
+
static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
{
int i = info->nr_partitions;
@@ -1383,6 +1530,38 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
}
EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
+void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
+{
+ struct cxl_dc_partition_info dc_info = { 0 };
+ struct device *dev = mds->cxlds.dev;
+ size_t skip;
+ int rc;
+
+ rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
+ if (rc) {
+ dev_warn(dev,
+ "Failed to read Dynamic Capacity config: %d\n", rc);
+ cxl_disable_dcd(mds);
+ return;
+ }
+
+ /* Skips between pmem and the dynamic partition are not supported */
+ skip = dc_info.start - info->size;
+ if (skip) {
+ dev_warn(dev,
+ "Dynamic Capacity skip from pmem not supported: %zu\n",
+ skip);
+ cxl_disable_dcd(mds);
+ return;
+ }
+
+ info->size += dc_info.size;
+ dev_dbg(dev, "Adding dynamic ram partition A; %zu size %zu\n",
+ dc_info.start, dc_info.size);
+ add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_A);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
+
int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
{
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index be8a7dc77719..a9d42210e8a3 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -485,6 +485,7 @@ struct cxl_region_params {
enum cxl_partition_mode {
CXL_PARTMODE_RAM,
CXL_PARTMODE_PMEM,
+ CXL_PARTMODE_DYNAMIC_RAM_A,
};
/*
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 394a776954f4..057933128d2c 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -97,7 +97,7 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
resource_size_t base, resource_size_t len,
resource_size_t skipped);
-#define CXL_NR_PARTITIONS_MAX 2
+#define CXL_NR_PARTITIONS_MAX 3
struct cxl_dpa_info {
u64 size;
@@ -380,6 +380,7 @@ enum cxl_devtype {
CXL_DEVTYPE_CLASSMEM,
};
+#define CXL_MAX_DC_PARTITIONS 8
/**
* struct cxl_dpa_perf - DPA performance property entry
* @dpa_range: range for DPA address
@@ -722,6 +723,31 @@ struct cxl_mbox_set_shutdown_state_in {
u8 state;
} __packed;
+/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
+struct cxl_mbox_get_dc_config_in {
+ u8 partition_count;
+ u8 start_partition_index;
+} __packed;
+
+/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
+struct cxl_mbox_get_dc_config_out {
+ u8 avail_partition_count;
+ u8 partitions_returned;
+ u8 rsvd[6];
+ /* See CXL 3.2 Table 8-180 */
+ struct cxl_dc_partition {
+ __le64 base;
+ __le64 decode_length;
+ __le64 length;
+ __le64 block_size;
+ __le32 dsmad_handle;
+ u8 flags;
+ u8 rsvd[3];
+ } __packed partition[] __counted_by(partitions_returned);
+ /* Trailing fields unused */
+} __packed;
+#define CXL_DCD_BLOCK_LINE_SIZE 0x40
+
/* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
struct cxl_mbox_set_timestamp_in {
__le64 timestamp;
@@ -845,9 +871,24 @@ enum {
int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd);
int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+
+struct cxl_mem_dev_info {
+ u64 total_bytes;
+ u64 volatile_bytes;
+ u64 persistent_bytes;
+};
+
+struct cxl_dc_partition_info {
+ size_t start;
+ size_t size;
+};
+
+int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
+ struct cxl_dc_partition_info *dc_info);
int cxl_await_media_ready(struct cxl_dev_state *cxlds);
int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
+void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
unsigned long *cmds);
@@ -860,6 +901,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
const uuid_t *uuid, union cxl_event *evt);
int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
+
+static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+{
+ return mds->dcd_supported;
+}
+
+static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
+{
+ mds->dcd_supported = false;
+}
+
int cxl_set_timestamp(struct cxl_memdev_state *mds);
int cxl_poison_state_init(struct cxl_memdev_state *mds);
int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 7b14a154463c..bc40cf6e2fe9 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -998,6 +998,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
+ if (cxl_dcd_supported(mds))
+ cxl_configure_dcd(mds, &range_info);
+
rc = cxl_dpa_setup(cxlds, &range_info);
if (rc)
return rc;
--
2.49.0
^ permalink raw reply related [flat|nested] 65+ messages in thread

* Re: [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device
2025-04-13 22:52 ` [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device Ira Weiny
@ 2025-04-14 14:35 ` Jonathan Cameron
2025-04-14 15:20 ` Jonathan Cameron
2025-05-07 17:40 ` Fan Ni
1 sibling, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 14:35 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:10 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
> the Get DC Configuration command in order to properly configure DCDs.
> Without the Get DC Configuration command DCD can't be supported.
>
> Implement the DC mailbox commands as specified in CXL 3.2 section
> 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information. Disable DCD if an invalid configuration is found.
>
> Linux has no support for more than one dynamic capacity partition. Read
> and validate all the partitions but configure only the first partition
> as 'dynamic ram A'. Additional partitions can be added in the future if
> such a device ever materializes. Additionally, it is anticipated that no
> skips will be present after the end of the pmem partition. Check for and
> disallow this configuration as well.
>
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags). Avoid defining those fields in favor of the more
> useful flexible C array.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [iweiny: rebase]
> [iweiny: Update spec references to 3.2]
> [djbw: Limit to 1 partition]
> [djbw: Avoid inter-partition skipping]
> [djbw: s/region/partition/]
> [djbw: remove cxl_dc_region[partition]_info->name]
> [iweiny: adjust to lack of dcd_cmds in mds]
> [iweiny: remove extra 'region' from names]
> [iweiny: remove unused CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG]
> ---
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58d378400a4b..866a423d6125 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1313,6 +1313,153 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return -EBUSY;
> }
>
> +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
> + u8 index, struct cxl_dc_partition *dev_part)
> +{
> + size_t blk_size, len;
> +
> + part_array[index].start = le64_to_cpu(dev_part->base);
> + part_array[index].size = le64_to_cpu(dev_part->decode_length);
> + part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> + len = le64_to_cpu(dev_part->length);
> + blk_size = le64_to_cpu(dev_part->block_size);
> +
> + /* Check partitions are in increasing DPA order */
> + if (index > 0) {
> + struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
> +
> + if ((prev_part->start + prev_part->size) >
> + part_array[index].start) {
> + dev_err(dev,
> + "DPA ordering violation for DC partition %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
> +
> + if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
> + !IS_ALIGNED(part_array[index].start, blk_size)) {
> + dev_err(dev, "DC partition %d invalid start %zu blk size %zu\n",
> + index, part_array[index].start, blk_size);
> + return -EINVAL;
> + }
> +
> + if (part_array[index].size == 0 || len == 0 ||
> + part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
> + dev_err(dev, "DC partition %d invalid length; size %zu len %zu blk size %zu\n",
> + index, part_array[index].size, len, blk_size);
> + return -EINVAL;
> + }
> +
> + if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
> + !is_power_of_2(blk_size)) {
> + dev_err(dev, "DC partition %d invalid block size; %zu\n",
> + index, blk_size);
> + return -EINVAL;
> + }
> +
> + dev_dbg(dev, "DC partition %d start %zu size %zu blk size %zu\n",
> + index, part_array[index].start, part_array[index].size,
> + blk_size);
> +
> + return 0;
> +}
> +
> +/* Returns the number of partitions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .partition_count = CXL_MAX_DC_PARTITIONS,
> + .start_partition_index = start_partition,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 1,
> + };
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
> + dc_resp->partitions_returned, dc_resp->avail_partition_count);
> + return dc_resp->partitions_returned;
> +}
> +
> +/**
> + * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
> + * device.
> + * @mbox: Mailbox to query
> + * @dc_info: The dynamic partition information to return
> + *
> + * Read Dynamic Capacity information from the device and return the partition
> + * information.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + * On error, @dc_info is left unchanged.
> + */
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> + struct cxl_dc_partition_info *dc_info)
> +{
> + struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
> + size_t dc_resp_size = mbox->payload_size;
> + struct device *dev = mbox->host;
> + u8 start_partition;
> + u8 num_partitions = 0;
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kvfree) =
> + kvmalloc(dc_resp_size, GFP_KERNEL);
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + /* Read and check all partition information for validity and potential
> + * debugging; see debug output in cxl_dc_check() */
> + start_partition = 0;
> + do {
> + int rc, i, j;
> +
> + rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_err(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + num_partitions += rc;
> +
> + if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
> + dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
> + num_partitions);
> + return -EINVAL;
> + }
> +
> + for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
> + rc = cxl_dc_check(dev, partitions, i,
> + &dc_resp->partition[j]);
> + if (rc)
> + return rc;
> + }
> +
> + start_partition = num_partitions;
> +
> + } while (num_partitions < dc_resp->avail_partition_count);
> +
> + /* Return 1st partition */
> + dc_info->start = partitions[0].start;
> + dc_info->size = partitions[0].size;
> + dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> + dc_info->start, dc_info->size);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> +
> static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
> {
> int i = info->nr_partitions;
> @@ -1383,6 +1530,38 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
>
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> +{
> + struct cxl_dc_partition_info dc_info = { 0 };
Trivial bit of C that surprised me in another thread the other day. It doesn't
apply here because of the packed nature of the structure, but...
= {}; is defined in C23 (and accepted by compilers long before that in
practice) as the "empty initializer".
> + struct device *dev = mds->cxlds.dev;
> + size_t skip;
> + int rc;
> +
> + rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
> + if (rc) {
> + dev_warn(dev,
> + "Failed to read Dynamic Capacity config: %d\n", rc);
> + cxl_disable_dcd(mds);
> + return;
> + }
> +
> + /* Skips between pmem and the dynamic partition are not supported */
> + skip = dc_info.start - info->size;
> + if (skip) {
> + dev_warn(dev,
> + "Dynamic Capacity skip from pmem not supported: %zu\n",
> + skip);
> + cxl_disable_dcd(mds);
> + return;
> + }
> +
> + info->size += dc_info.size;
> + dev_dbg(dev, "Adding dynamic ram partition A; %zu size %zu\n",
> + dc_info.start, dc_info.size);
> + add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_A);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
> +
> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
> {
> struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index be8a7dc77719..a9d42210e8a3 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -485,6 +485,7 @@ struct cxl_region_params {
> enum cxl_partition_mode {
> CXL_PARTMODE_RAM,
> CXL_PARTMODE_PMEM,
> + CXL_PARTMODE_DYNAMIC_RAM_A,
> };
>
> /*
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 394a776954f4..057933128d2c 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -97,7 +97,7 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> resource_size_t base, resource_size_t len,
> resource_size_t skipped);
>
> -#define CXL_NR_PARTITIONS_MAX 2
> +#define CXL_NR_PARTITIONS_MAX 3
>
> struct cxl_dpa_info {
> u64 size;
> @@ -380,6 +380,7 @@ enum cxl_devtype {
> CXL_DEVTYPE_CLASSMEM,
> };
>
> +#define CXL_MAX_DC_PARTITIONS 8
> /**
> * struct cxl_dpa_perf - DPA performance property entry
> * @dpa_range: range for DPA address
> @@ -722,6 +723,31 @@ struct cxl_mbox_set_shutdown_state_in {
> u8 state;
> } __packed;
>
> +/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
> +struct cxl_mbox_get_dc_config_in {
> + u8 partition_count;
> + u8 start_partition_index;
> +} __packed;
> +
> +/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_partition_count;
> + u8 partitions_returned;
> + u8 rsvd[6];
> + /* See CXL 3.2 Table 8-180 */
> + struct cxl_dc_partition {
> + __le64 base;
> + __le64 decode_length;
> + __le64 length;
> + __le64 block_size;
> + __le32 dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed partition[] __counted_by(partitions_returned);
> + /* Trailing fields unused */
> +} __packed;
> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -845,9 +871,24 @@ enum {
> int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +
> +struct cxl_mem_dev_info {
> + u64 total_bytes;
> + u64 volatile_bytes;
> + u64 persistent_bytes;
> +};
> +
> +struct cxl_dc_partition_info {
> + size_t start;
> + size_t size;
> +};
> +
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> + struct cxl_dc_partition_info *dc_info);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
> void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
> unsigned long *cmds);
> @@ -860,6 +901,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> const uuid_t *uuid, union cxl_event *evt);
> int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
> +
> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return mds->dcd_supported;
> +}
> +
> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + mds->dcd_supported = false;
> +}
> +
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 7b14a154463c..bc40cf6e2fe9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -998,6 +998,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> + if (cxl_dcd_supported(mds))
> + cxl_configure_dcd(mds, &range_info);
> +
> rc = cxl_dpa_setup(cxlds, &range_info);
> if (rc)
> return rc;
>
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device
2025-04-14 14:35 ` Jonathan Cameron
@ 2025-04-14 15:20 ` Jonathan Cameron
0 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 15:20 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:10 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
> the Get DC Configuration command in order to properly configure DCDs.
> Without the Get DC Configuration command DCD can't be supported.
>
> Implement the DC mailbox commands as specified in CXL 3.2 section
> 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information. Disable DCD if an invalid configuration is found.
>
> Linux has no support for more than one dynamic capacity partition. Read
> and validate all the partitions but configure only the first partition
> as 'dynamic ram A'. Additional partitions can be added in the future if
> such a device ever materializes. Additionally, it is anticipated that no
> skips will be present from the end of the pmem partition. Check for and
> disallow this configuration as well.
>
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags). Avoid defining those fields to use the more useful
> dynamic C array.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
Hi Ira,
This ended up with a slightly odd mix of the nice flexible code which
we had before to handle multiple regions and just handing one.
Whilst I don't mind keeping the multiple region handling you could further
simplify this if you didn't...
Jonathan
> ---
> Changes:
> [iweiny: rebase]
> [iweiny: Update spec references to 3.2]
> [djbw: Limit to 1 partition]
> [djbw: Avoid inter-partition skipping]
> [djbw: s/region/partition/]
> [djbw: remove cxl_dc_region[partition]_info->name]
> [iweiny: adjust to lack of dcd_cmds in mds]
> [iweiny: remove extra 'region' from names]
> [iweiny: remove unused CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG]
> ---
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58d378400a4b..866a423d6125 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1313,6 +1313,153 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return -EBUSY;
> }
>
> +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
> + u8 index, struct cxl_dc_partition *dev_part)
I'd be tempted to pass in both this part and the previous one (or NULL) directly rather
than passing in the array. Seems like it would end up slightly simpler in here.
Mind you we only support the first one anyway so maybe we don't need the prev_part
stuff for now...
> +{
> + size_t blk_size, len;
> +
> + part_array[index].start = le64_to_cpu(dev_part->base);
> + part_array[index].size = le64_to_cpu(dev_part->decode_length);
> + part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> + len = le64_to_cpu(dev_part->length);
> + blk_size = le64_to_cpu(dev_part->block_size);
For these, might as well do it at declaration and save a line.
size_t blk_size = le64_to_cpu(dev_part->block_size);
size_t len = le64_to_cpu(dev_part->length);
> +
> + /* Check partitions are in increasing DPA order */
> + if (index > 0) {
If you pass the prev_part in as a parameter, this just becomes
if (prev_part)
> + struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
> +
> + if ((prev_part->start + prev_part->size) >
> + part_array[index].start) {
> + dev_err(dev,
> + "DPA ordering violation for DC partition %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
Rest of the checks look good to me.
> +}
> +
> +/* Returns the number of partitions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .partition_count = CXL_MAX_DC_PARTITIONS,
> + .start_partition_index = start_partition,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 1,
Why 1? If a device oddly supported 0 regions I think it would still be 8
to cover the first two fields and the reserved space before region configuration
structure.
> + };
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
> + dc_resp->partitions_returned, dc_resp->avail_partition_count);
> + return dc_resp->partitions_returned;
> +}
> +
> +/**
> + * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
> + * device.
> + * @mbox: Mailbox to query
> + * @dc_info: The dynamic partition information to return
> + *
> + * Read Dynamic Capacity information from the device and return the partition
> + * information.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + * On error, @dc_info is left unchanged.
> + */
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> + struct cxl_dc_partition_info *dc_info)
> +{
> + struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
> + size_t dc_resp_size = mbox->payload_size;
> + struct device *dev = mbox->host;
> + u8 start_partition;
> + u8 num_partitions;
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kvfree) =
> + kvmalloc(dc_resp_size, GFP_KERNEL);
Could we size this one for max possible? (i.e. 8 partitions) with a struct
size and avoid needing vmalloc. Maybe it is worth the bother.
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + /* Read and check all partition information for validity and potential
This isn't the multi-line comment style used elsewhere in this file.
> + * debugging; see debug output in cxl_dc_check() */
> + start_partition = 0;
Could set at declaration (up to you)
> + do {
> + int rc, i, j;
> +
> + rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_err(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + num_partitions += rc;
Initialization missing I think.
> +
> + if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
> + dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
> + num_partitions);
> + return -EINVAL;
> + }
> +
> + for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
> + rc = cxl_dc_check(dev, partitions, i,
> + &dc_resp->partition[j]);
> + if (rc)
> + return rc;
> + }
> +
> + start_partition = num_partitions;
> +
> + } while (num_partitions < dc_resp->avail_partition_count);
> +
> + /* Return 1st partition */
> + dc_info->start = partitions[0].start;
> + dc_info->size = partitions[0].size;
I'm not against keeping the complexity above but if all we are going to do is
use the first partition, maybe just ask for that in the first place?
We don't need to check for issues in things we aren't turning on.
> + dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> + dc_info->start, dc_info->size);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 394a776954f4..057933128d2c 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> +
> +struct cxl_mem_dev_info {
> + u64 total_bytes;
> + u64 volatile_bytes;
> + u64 persistent_bytes;
> +};
So far I'm not seeing any use of this. Left over from previous patch
or something that gets used later in the series and so should get
introduced with first use?
* Re: [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device
2025-04-13 22:52 ` [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device Ira Weiny
2025-04-14 14:35 ` Jonathan Cameron
@ 2025-05-07 17:40 ` Fan Ni
2025-05-08 13:35 ` Ira Weiny
1 sibling, 1 reply; 65+ messages in thread
From: Fan Ni @ 2025-05-07 17:40 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, Apr 13, 2025 at 05:52:10PM -0500, Ira Weiny wrote:
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
> the Get DC Configuration command in order to properly configure DCDs.
> Without the Get DC Configuration command DCD can't be supported.
>
> Implement the DC mailbox commands as specified in CXL 3.2 section
> 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information. Disable DCD if an invalid configuration is found.
>
> Linux has no support for more than one dynamic capacity partition. Read
> and validate all the partitions but configure only the first partition
> as 'dynamic ram A'. Additional partitions can be added in the future if
> such a device ever materializes. Additionally, it is anticipated that no
> skips will be present from the end of the pmem partition. Check for and
> disallow this configuration as well.
>
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags). Avoid defining those fields to use the more useful
> dynamic C array.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [iweiny: rebase]
> [iweiny: Update spec references to 3.2]
> [djbw: Limit to 1 partition]
> [djbw: Avoid inter-partition skipping]
> [djbw: s/region/partition/]
> [djbw: remove cxl_dc_region[partition]_info->name]
> [iweiny: adjust to lack of dcd_cmds in mds]
> [iweiny: remove extra 'region' from names]
> [iweiny: remove unused CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG]
> ---
> drivers/cxl/core/hdm.c | 2 +
> drivers/cxl/core/mbox.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxl.h | 1 +
> drivers/cxl/cxlmem.h | 54 ++++++++++++++-
> drivers/cxl/pci.c | 3 +
> 5 files changed, 238 insertions(+), 1 deletion(-)
...
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -845,9 +871,24 @@ enum {
> int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +
> +struct cxl_mem_dev_info {
> + u64 total_bytes;
> + u64 volatile_bytes;
> + u64 persistent_bytes;
> +};
Defined, but never used.
Fan
> +
> +struct cxl_dc_partition_info {
> + size_t start;
> + size_t size;
> +};
> +
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> + struct cxl_dc_partition_info *dc_info);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
> void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
> unsigned long *cmds);
> @@ -860,6 +901,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> const uuid_t *uuid, union cxl_event *evt);
> int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
> +
> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return mds->dcd_supported;
> +}
> +
> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + mds->dcd_supported = false;
> +}
> +
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 7b14a154463c..bc40cf6e2fe9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -998,6 +998,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> + if (cxl_dcd_supported(mds))
> + cxl_configure_dcd(mds, &range_info);
> +
> rc = cxl_dpa_setup(cxlds, &range_info);
> if (rc)
> return rc;
>
> --
> 2.49.0
>
--
Fan Ni
* Re: [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device
2025-05-07 17:40 ` Fan Ni
@ 2025-05-08 13:35 ` Ira Weiny
0 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-05-08 13:35 UTC (permalink / raw)
To: Fan Ni, Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
Fan Ni wrote:
> On Sun, Apr 13, 2025 at 05:52:10PM -0500, Ira Weiny wrote:
> > Devices which optionally support Dynamic Capacity (DC) are configured
> > via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
> > the Get DC Configuration command in order to properly configure DCDs.
> > Without the Get DC Configuration command DCD can't be supported.
> >
> > Implement the DC mailbox commands as specified in CXL 3.2 section
> > 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> > information. Disable DCD if an invalid configuration is found.
> >
> > Linux has no support for more than one dynamic capacity partition. Read
> > and validate all the partitions but configure only the first partition
> > as 'dynamic ram A'. Additional partitions can be added in the future if
> > such a device ever materializes. Additionally, it is anticipated that no
> > skips will be present from the end of the pmem partition. Check for and
> > disallow this configuration as well.
> >
> > Linux has no use for the trailing fields of the Get Dynamic Capacity
> > Configuration Output Payload (Total number of supported extents, number
> > of available extents, total number of supported tags, and number of
> > available tags). Avoid defining those fields to use the more useful
> > dynamic C array.
> >
> > Based on an original patch by Navneet Singh.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes:
> > [iweiny: rebase]
> > [iweiny: Update spec references to 3.2]
> > [djbw: Limit to 1 partition]
> > [djbw: Avoid inter-partition skipping]
> > [djbw: s/region/partition/]
> > [djbw: remove cxl_dc_region[partition]_info->name]
> > [iweiny: adjust to lack of dcd_cmds in mds]
> > [iweiny: remove extra 'region' from names]
> > [iweiny: remove unused CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG]
> > ---
> > drivers/cxl/core/hdm.c | 2 +
> > drivers/cxl/core/mbox.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++
> > drivers/cxl/cxl.h | 1 +
> > drivers/cxl/cxlmem.h | 54 ++++++++++++++-
> > drivers/cxl/pci.c | 3 +
> > 5 files changed, 238 insertions(+), 1 deletion(-)
> ...
> > /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> > struct cxl_mbox_set_timestamp_in {
> > __le64 timestamp;
> > @@ -845,9 +871,24 @@ enum {
> > int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> > struct cxl_mbox_cmd *cmd);
> > int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > +
> > +struct cxl_mem_dev_info {
> > + u64 total_bytes;
> > + u64 volatile_bytes;
> > + u64 persistent_bytes;
> > +};
>
> Defined, but never used.
Shoot... That was from a previous version of work on type2...
Thanks for the catch!
Ira
[snip]
* [PATCH v9 03/19] cxl/cdat: Gather DSMAS data for DCD partitions
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
2025-04-13 22:52 ` [PATCH v9 01/19] cxl/mbox: Flag " Ira Weiny
2025-04-13 22:52 ` [PATCH v9 02/19] cxl/mem: Read dynamic capacity configuration from the device Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 15:29 ` Jonathan Cameron
2025-04-13 22:52 ` [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls Ira Weiny
` (19 subsequent siblings)
22 siblings, 1 reply; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Additional DCD partition (AKA region) information is contained in the
DSMAS CDAT tables, including performance, read only, and shareable
attributes.
Match DCD partitions with DSMAS tables and store the meta data.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: Adjust for new perf/partition infrastructure]
---
drivers/cxl/core/cdat.c | 11 +++++++++++
drivers/cxl/core/mbox.c | 2 ++
drivers/cxl/cxlmem.h | 6 ++++++
3 files changed, 19 insertions(+)
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index edb4f41eeacc..ad93713f4364 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -17,6 +17,7 @@ struct dsmas_entry {
struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
int entries;
int qos_class;
+ bool shareable;
};
static u32 cdat_normalize(u16 entry, u64 base, u8 type)
@@ -74,6 +75,7 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
return -ENOMEM;
dent->handle = dsmas->dsmad_handle;
+ dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
@@ -244,6 +246,7 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
dpa_perf->coord[i] = dent->coord[i];
dpa_perf->cdat_coord[i] = dent->cdat_coord[i];
}
+ dpa_perf->shareable = dent->shareable;
dpa_perf->dpa_range = dent->dpa_range;
dpa_perf->qos_class = dent->qos_class;
dev_dbg(dev,
@@ -266,13 +269,21 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
bool found = false;
for (int i = 0; i < cxlds->nr_partitions; i++) {
+ enum cxl_partition_mode mode = cxlds->part[i].mode;
struct resource *res = &cxlds->part[i].res;
+ u8 handle = cxlds->part[i].handle;
struct range range = {
.start = res->start,
.end = res->end,
};
if (range_contains(&range, &dent->dpa_range)) {
+ if (mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ dent->handle != handle)
+ dev_warn(dev,
+ "Dynamic RAM perf mismatch; %pra (%u) vs %pra (%u)\n",
+ &range, handle, &dent->dpa_range, dent->handle);
+
update_perf_entry(dev, dent,
&cxlds->part[i].perf);
found = true;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 866a423d6125..c589d8a330bb 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1321,6 +1321,7 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
part_array[index].start = le64_to_cpu(dev_part->base);
part_array[index].size = le64_to_cpu(dev_part->decode_length);
part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
+ part_array[index].handle = le32_to_cpu(dev_part->dsmad_handle) & 0xFF;
len = le64_to_cpu(dev_part->length);
blk_size = le64_to_cpu(dev_part->block_size);
@@ -1453,6 +1454,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
/* Return 1st partition */
dc_info->start = partitions[0].start;
dc_info->size = partitions[0].size;
+ dc_info->handle = partitions[0].handle;
dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
dc_info->start, dc_info->size);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 057933128d2c..96d8edaa5003 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -104,6 +104,7 @@ struct cxl_dpa_info {
struct cxl_dpa_part_info {
struct range range;
enum cxl_partition_mode mode;
+ u8 handle;
} part[CXL_NR_PARTITIONS_MAX];
int nr_partitions;
};
@@ -387,12 +388,14 @@ enum cxl_devtype {
* @coord: QoS performance data (i.e. latency, bandwidth)
* @cdat_coord: raw QoS performance data from CDAT
* @qos_class: QoS Class cookies
+ * @shareable: Is the range shareable
*/
struct cxl_dpa_perf {
struct range dpa_range;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
int qos_class;
+ bool shareable;
};
/**
@@ -400,11 +403,13 @@ struct cxl_dpa_perf {
* @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
* @perf: performance attributes of the partition from CDAT
* @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
+ * @handle: DMASS handle intended to represent this partition
*/
struct cxl_dpa_partition {
struct resource res;
struct cxl_dpa_perf perf;
enum cxl_partition_mode mode;
+ u8 handle;
};
/**
@@ -881,6 +886,7 @@ struct cxl_mem_dev_info {
struct cxl_dc_partition_info {
size_t start;
size_t size;
+ u8 handle;
};
int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
--
2.49.0
* Re: [PATCH v9 03/19] cxl/cdat: Gather DSMAS data for DCD partitions
2025-04-13 22:52 ` [PATCH v9 03/19] cxl/cdat: Gather DSMAS data for DCD partitions Ira Weiny
@ 2025-04-14 15:29 ` Jonathan Cameron
0 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 15:29 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:11 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Additional DCD partition (AKA region) information is contained in the
> DSMAS CDAT tables, including performance, read only, and shareable
> attributes.
>
> Match DCD partitions with DSMAS tables and store the meta data.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 866a423d6125..c589d8a330bb 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1321,6 +1321,7 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
> part_array[index].start = le64_to_cpu(dev_part->base);
> part_array[index].size = le64_to_cpu(dev_part->decode_length);
> part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> + part_array[index].handle = le32_to_cpu(dev_part->dsmad_handle) & 0xFF;
Perhaps a comment on this. Or a check that it is representable in
CDAT (where we only have the one byte) and a print + fail to carry on if not?
> len = le64_to_cpu(dev_part->length);
> blk_size = le64_to_cpu(dev_part->block_size);
>
> @@ -1453,6 +1454,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> /* Return 1st partition */
> dc_info->start = partitions[0].start;
> dc_info->size = partitions[0].size;
> + dc_info->handle = partitions[0].handle;
> dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> dc_info->start, dc_info->size);
>
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 057933128d2c..96d8edaa5003 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -104,6 +104,7 @@ struct cxl_dpa_info {
> struct cxl_dpa_part_info {
> struct range range;
> enum cxl_partition_mode mode;
> + u8 handle;
> } part[CXL_NR_PARTITIONS_MAX];
> int nr_partitions;
> };
> @@ -387,12 +388,14 @@ enum cxl_devtype {
> * @coord: QoS performance data (i.e. latency, bandwidth)
> * @cdat_coord: raw QoS performance data from CDAT
> * @qos_class: QoS Class cookies
> + * @shareable: Is the range shareable
> */
> struct cxl_dpa_perf {
> struct range dpa_range;
> struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
> int qos_class;
> + bool shareable;
It feels a bit odd to have this in the dpa_perf structure, as it's not really
a performance thing, but I guess this is the only convenient place to stash it.
> };
>
> /**
> @@ -400,11 +403,13 @@ struct cxl_dpa_perf {
> * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
> * @perf: performance attributes of the partition from CDAT
> * @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
> + * @handle: DMASS handle intended to represent this partition
DSMAS ?
> */
> struct cxl_dpa_partition {
> struct resource res;
> struct cxl_dpa_perf perf;
> enum cxl_partition_mode mode;
> + u8 handle;
> };
>
> /**
> @@ -881,6 +886,7 @@ struct cxl_mem_dev_info {
> struct cxl_dc_partition_info {
> size_t start;
> size_t size;
> + u8 handle;
> };
>
> int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
>
* [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (2 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 03/19] cxl/cdat: Gather DSMAS data for DCD partitions Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 15:32 ` Jonathan Cameron
2026-02-02 19:25 ` Davidlohr Bueso
2025-04-13 22:52 ` [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs Ira Weiny
` (18 subsequent siblings)
22 siblings, 2 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Device partitions have an implied order which is made more complex by
the addition of a dynamic partition.
Remove the ram special case information calls in favor of generic calls
with a check ahead of time to ensure the preservation of the implied
partition order.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/core/hdm.c | 11 ++++++++++-
drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
drivers/cxl/cxl.h | 1 +
drivers/cxl/cxlmem.h | 9 +++------
drivers/cxl/mem.c | 2 +-
5 files changed, 24 insertions(+), 31 deletions(-)
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index c5f8a17d00f1..92e1a24e2109 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -470,6 +470,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
{
struct device *dev = cxlds->dev;
+ int i;
guard(rwsem_write)(&cxl_dpa_rwsem);
@@ -482,9 +483,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
return 0;
}
+ /* Verify partitions are in expected order. */
+ for (i = 1; i < info->nr_partitions; i++) {
+ if (cxlds->part[i].mode < cxlds->part[i-1].mode) {
+ dev_err(dev, "Partition order mismatch\n");
+ return 0;
+ }
+ }
+
cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
- for (int i = 0; i < info->nr_partitions; i++) {
+ for (i = 0; i < info->nr_partitions; i++) {
const struct cxl_dpa_part_info *part = &info->part[i];
int rc;
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index a16a5886d40a..9d6f8800e37a 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -75,20 +75,12 @@ static ssize_t label_storage_size_show(struct device *dev,
}
static DEVICE_ATTR_RO(label_storage_size);
-static resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
-{
- /* Static RAM is only expected at partition 0. */
- if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
- return 0;
- return resource_size(&cxlds->part[0].res);
-}
-
static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- unsigned long long len = cxl_ram_size(cxlds);
+ unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_RAM);
return sysfs_emit(buf, "%#llx\n", len);
}
@@ -101,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- unsigned long long len = cxl_pmem_size(cxlds);
+ unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_PMEM);
return sysfs_emit(buf, "%#llx\n", len);
}
@@ -407,10 +399,11 @@ static struct attribute *cxl_memdev_attributes[] = {
NULL,
};
-static struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
+static struct cxl_dpa_perf *part_perf(struct cxl_dev_state *cxlds,
+ enum cxl_partition_mode mode)
{
for (int i = 0; i < cxlds->nr_partitions; i++)
- if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
+ if (cxlds->part[i].mode == mode)
return &cxlds->part[i].perf;
return NULL;
}
@@ -421,7 +414,7 @@ static ssize_t pmem_qos_class_show(struct device *dev,
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
+ return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_PMEM)->qos_class);
}
static struct device_attribute dev_attr_pmem_qos_class =
@@ -433,20 +426,13 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
NULL,
};
-static struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
-{
- if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
- return NULL;
- return &cxlds->part[0].perf;
-}
-
static ssize_t ram_qos_class_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
+ return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_RAM)->qos_class);
}
static struct device_attribute dev_attr_ram_qos_class =
@@ -482,7 +468,7 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
{
struct device *dev = kobj_to_dev(kobj);
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
- struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
+ struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_RAM);
if (a == &dev_attr_ram_qos_class.attr &&
(!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
@@ -501,7 +487,7 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
{
struct device *dev = kobj_to_dev(kobj);
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
- struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
+ struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_PMEM);
if (a == &dev_attr_pmem_qos_class.attr &&
(!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a9d42210e8a3..4bb0ff4d8f5f 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -482,6 +482,7 @@ struct cxl_region_params {
resource_size_t cache_size;
};
+/* Modes should be in the implied DPA order */
enum cxl_partition_mode {
CXL_PARTMODE_RAM,
CXL_PARTMODE_PMEM,
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 96d8edaa5003..a74ac2d70d8d 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -453,14 +453,11 @@ struct cxl_dev_state {
#endif
};
-static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
+static inline resource_size_t cxl_part_size(struct cxl_dev_state *cxlds,
+ enum cxl_partition_mode mode)
{
- /*
- * Static PMEM may be at partition index 0 when there is no static RAM
- * capacity.
- */
for (int i = 0; i < cxlds->nr_partitions; i++)
- if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
+ if (cxlds->part[i].mode == mode)
return resource_size(&cxlds->part[i].res);
return 0;
}
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 9675243bd05b..b58b915708f9 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -152,7 +152,7 @@ static int cxl_mem_probe(struct device *dev)
return -ENXIO;
}
- if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
+ if (cxl_part_size(cxlds, CXL_PARTMODE_PMEM) && IS_ENABLED(CONFIG_CXL_PMEM)) {
rc = devm_cxl_add_nvdimm(parent_port, cxlmd);
if (rc) {
if (rc == -ENODEV)
--
2.49.0
* Re: [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls
2025-04-13 22:52 ` [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls Ira Weiny
@ 2025-04-14 15:32 ` Jonathan Cameron
2026-02-02 19:25 ` Davidlohr Bueso
1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 15:32 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:12 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Device partitions have an implied order which is made more complex by
> the addition of a dynamic partition.
>
> Remove the ram special case information calls in favor of generic calls
> with a check ahead of time to ensure the preservation of the implied
> partition order.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
One trivial thing inline.
To me this patch stands on its own irrespective of the rest of the
series. Maybe one to queue up early as a cleanup?
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Jonathan
> ---
> drivers/cxl/core/hdm.c | 11 ++++++++++-
> drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
> drivers/cxl/cxl.h | 1 +
> drivers/cxl/cxlmem.h | 9 +++------
> drivers/cxl/mem.c | 2 +-
> 5 files changed, 24 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index c5f8a17d00f1..92e1a24e2109 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -470,6 +470,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
> int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> {
> struct device *dev = cxlds->dev;
> + int i;
>
> guard(rwsem_write)(&cxl_dpa_rwsem);
>
> @@ -482,9 +483,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> return 0;
> }
>
> + /* Verify partitions are in expected order. */
> + for (i = 1; i < info->nr_partitions; i++) {
> + if (cxlds->part[i].mode < cxlds->part[i-1].mode) {
spaces around -
> + dev_err(dev, "Partition order mismatch\n");
> + return 0;
> + }
> + }
> +
> cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>
> - for (int i = 0; i < info->nr_partitions; i++) {
> + for (i = 0; i < info->nr_partitions; i++) {
> const struct cxl_dpa_part_info *part = &info->part[i];
> int rc;
* Re: [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls
2025-04-13 22:52 ` [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls Ira Weiny
2025-04-14 15:32 ` Jonathan Cameron
@ 2026-02-02 19:25 ` Davidlohr Bueso
1 sibling, 0 replies; 65+ messages in thread
From: Davidlohr Bueso @ 2026-02-02 19:25 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025, Ira Weiny wrote:
>Device partitions have an implied order which is made more complex by
>the addition of a dynamic partition.
>
>Remove the ram special case information calls in favor of generic calls
>with a check ahead of time to ensure the preservation of the implied
>partition order.
>
>Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>---
> drivers/cxl/core/hdm.c | 11 ++++++++++-
> drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
> drivers/cxl/cxl.h | 1 +
> drivers/cxl/cxlmem.h | 9 +++------
> drivers/cxl/mem.c | 2 +-
> 5 files changed, 24 insertions(+), 31 deletions(-)
>
>diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
>index c5f8a17d00f1..92e1a24e2109 100644
>--- a/drivers/cxl/core/hdm.c
>+++ b/drivers/cxl/core/hdm.c
>@@ -470,6 +470,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
> int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> {
> struct device *dev = cxlds->dev;
>+ int i;
>
> guard(rwsem_write)(&cxl_dpa_rwsem);
>
>@@ -482,9 +483,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> return 0;
> }
>
>+ /* Verify partitions are in expected order. */
>+ for (i = 1; i < info->nr_partitions; i++) {
>+ if (cxlds->part[i].mode < cxlds->part[i-1].mode) {
>+ dev_err(dev, "Partition order mismatch\n");
>+ return 0;
return -EINVAL?
>+ }
>+ }
>+
> cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>
>- for (int i = 0; i < info->nr_partitions; i++) {
>+ for (i = 0; i < info->nr_partitions; i++) {
> const struct cxl_dpa_part_info *part = &info->part[i];
> int rc;
>
* [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (3 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 04/19] cxl/core: Enforce partition order/simplify partition calls Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 15:34 ` Jonathan Cameron
2026-02-02 19:28 ` Davidlohr Bueso
2025-04-13 22:52 ` [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Ira Weiny
` (17 subsequent siblings)
22 siblings, 2 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
To properly configure CXL regions user space will need to know the
details of the dynamic ram partition.
Expose the first dynamic ram partition through sysfs.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: Complete rewrite of the old patch.]
---
Documentation/ABI/testing/sysfs-bus-cxl | 24 ++++++++++++++
drivers/cxl/core/memdev.c | 57 +++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 99bb3faf7a0e..2b59041bb410 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -89,6 +89,30 @@ Description:
and there are platform specific performance related
side-effects that may result. First class-id is displayed.
+What: /sys/bus/cxl/devices/memX/dynamic_ram_a/size
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The first Dynamic RAM partition capacity in bytes.
+
+
+What: /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) For CXL host platforms that support "QoS Telemetry"
+ this attribute conveys a comma delimited list of platform
+ specific cookies that identifies a QoS performance class
+ for the persistent partition of the CXL mem device. These
+ class-ids can be compared against a similar "qos_class"
+ published for a root decoder. While it is not required
+ that the endpoints map their local memory-class to a
+ matching platform class, mismatches are not recommended
+ and there are platform specific performance related
+ side-effects that may result. First class-id is displayed.
+
What: /sys/bus/cxl/devices/memX/serial
Date: January, 2022
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 9d6f8800e37a..063a14c1973a 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
static struct device_attribute dev_attr_pmem_size =
__ATTR(size, 0444, pmem_size_show, NULL);
+static ssize_t dynamic_ram_a_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
+
+ return sysfs_emit(buf, "%#llx\n", len);
+}
+
+static struct device_attribute dev_attr_dynamic_ram_a_size =
+ __ATTR(size, 0444, dynamic_ram_a_size_show, NULL);
+
static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -426,6 +439,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
NULL,
};
+static ssize_t dynamic_ram_a_qos_class_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ return sysfs_emit(buf, "%d\n",
+ part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)->qos_class);
+}
+
+static struct device_attribute dev_attr_dynamic_ram_a_qos_class =
+ __ATTR(qos_class, 0444, dynamic_ram_a_qos_class_show, NULL);
+
+static struct attribute *cxl_memdev_dynamic_ram_a_attributes[] = {
+ &dev_attr_dynamic_ram_a_size.attr,
+ &dev_attr_dynamic_ram_a_qos_class.attr,
+ NULL,
+};
+
static ssize_t ram_qos_class_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -502,6 +534,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
.is_visible = cxl_pmem_visible,
};
+static umode_t cxl_dynamic_ram_a_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
+
+ if (a == &dev_attr_dynamic_ram_a_qos_class.attr &&
+ (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
+ return 0;
+
+ if (a == &dev_attr_dynamic_ram_a_size.attr &&
+ (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)))
+ return 0;
+
+ return a->mode;
+}
+
+static struct attribute_group cxl_memdev_dynamic_ram_a_attribute_group = {
+ .name = "dynamic_ram_a",
+ .attrs = cxl_memdev_dynamic_ram_a_attributes,
+ .is_visible = cxl_dynamic_ram_a_visible,
+};
+
static umode_t cxl_memdev_security_visible(struct kobject *kobj,
struct attribute *a, int n)
{
@@ -530,6 +585,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
&cxl_memdev_attribute_group,
&cxl_memdev_ram_attribute_group,
&cxl_memdev_pmem_attribute_group,
+ &cxl_memdev_dynamic_ram_a_attribute_group,
&cxl_memdev_security_attribute_group,
NULL,
};
@@ -538,6 +594,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
{
sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
+ sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_a_attribute_group);
}
EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
--
2.49.0
* Re: [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs
2025-04-13 22:52 ` [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs Ira Weiny
@ 2025-04-14 15:34 ` Jonathan Cameron
2026-02-02 19:28 ` Davidlohr Bueso
1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 15:34 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:13 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> To properly configure CXL regions user space will need to know the
> details of the dynamic ram partition.
>
> Expose the first dynamic ram partition through sysfs.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
LGTM
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs
2025-04-13 22:52 ` [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs Ira Weiny
2025-04-14 15:34 ` Jonathan Cameron
@ 2026-02-02 19:28 ` Davidlohr Bueso
1 sibling, 0 replies; 65+ messages in thread
From: Davidlohr Bueso @ 2026-02-02 19:28 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025, Ira Weiny wrote:
>To properly configure CXL regions user space will need to know the
>details of the dynamic ram partition.
>
>Expose the first dynamic ram partition through sysfs.
>
>Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
>---
>Changes:
>[iweiny: Complete rewrite of the old patch.]
>---
> Documentation/ABI/testing/sysfs-bus-cxl | 24 ++++++++++++++
> drivers/cxl/core/memdev.c | 57 +++++++++++++++++++++++++++++++++
> 2 files changed, 81 insertions(+)
>
>diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
>index 99bb3faf7a0e..2b59041bb410 100644
>--- a/Documentation/ABI/testing/sysfs-bus-cxl
>+++ b/Documentation/ABI/testing/sysfs-bus-cxl
>@@ -89,6 +89,30 @@ Description:
> and there are platform specific performance related
> side-effects that may result. First class-id is displayed.
>
>+What: /sys/bus/cxl/devices/memX/dynamic_ram_a/size
>+Date: May, 2025
>+KernelVersion: v6.16
>+Contact: linux-cxl@vger.kernel.org
>+Description:
>+ (RO) The first Dynamic RAM partition capacity in bytes.
>+
>+
>+What: /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class
>+Date: May, 2025
>+KernelVersion: v6.16
>+Contact: linux-cxl@vger.kernel.org
>+Description:
>+ (RO) For CXL host platforms that support "QoS Telemetry"
>+ this attribute conveys a comma delimited list of platform
>+ specific cookies that identifies a QoS performance class
>+ for the persistent partition of the CXL mem device. These
^^ 'persistent' should be dropped
>+ class-ids can be compared against a similar "qos_class"
>+ published for a root decoder. While it is not required
>+ that the endpoints map their local memory-class to a
>+ matching platform class, mismatches are not recommended
>+ and there are platform specific performance related
>+ side-effects that may result. First class-id is displayed.
>+
* [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (4 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 05/19] cxl/mem: Expose dynamic ram A partition in sysfs Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 15:36 ` Jonathan Cameron
2025-05-07 20:50 ` Fan Ni
2025-04-13 22:52 ` [PATCH v9 07/19] cxl/region: Add sparse DAX region support Ira Weiny
` (16 subsequent siblings)
22 siblings, 2 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Endpoints can now support a single dynamic ram partition following the
persistent memory partition.
Expand the mode to allow a decoder to point to the first dynamic ram
partition.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: completely re-written]
---
Documentation/ABI/testing/sysfs-bus-cxl | 18 +++++++++---------
drivers/cxl/core/port.c | 4 ++++
2 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 2b59041bb410..b2754e6047ca 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -358,22 +358,22 @@ Description:
What: /sys/bus/cxl/devices/decoderX.Y/mode
-Date: May, 2022
-KernelVersion: v6.0
+Date: May, 2022, May 2025
+KernelVersion: v6.0, v6.16 (dynamic_ram_a)
Contact: linux-cxl@vger.kernel.org
Description:
(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
translates from a host physical address range, to a device
local address range. Device-local address ranges are further
- split into a 'ram' (volatile memory) range and 'pmem'
- (persistent memory) range. The 'mode' attribute emits one of
- 'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
- not actively decoding, or no DPA allocation policy has been
- set.
+ split into a 'ram' (volatile memory) range, 'pmem' (persistent
+ memory), and 'dynamic_ram_a' (first Dynamic RAM) range. The
+ 'mode' attribute emits one of 'ram', 'pmem', 'dynamic_ram_a' or
+ 'none'. The 'none' indicates the decoder is not actively
+ decoding, or no DPA allocation policy has been set.
'mode' can be written, when the decoder is in the 'disabled'
- state, with either 'ram' or 'pmem' to set the boundaries for the
- next allocation.
+ state, with either 'ram', 'pmem', or 'dynamic_ram_a' to set the
+ boundaries for the next allocation.
What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0fd6646c1a2e..e98605bd39b4 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -125,6 +125,7 @@ static DEVICE_ATTR_RO(name)
CXL_DECODER_FLAG_ATTR(cap_pmem, CXL_DECODER_F_PMEM);
CXL_DECODER_FLAG_ATTR(cap_ram, CXL_DECODER_F_RAM);
+CXL_DECODER_FLAG_ATTR(cap_dynamic_ram_a, CXL_DECODER_F_RAM);
CXL_DECODER_FLAG_ATTR(cap_type2, CXL_DECODER_F_TYPE2);
CXL_DECODER_FLAG_ATTR(cap_type3, CXL_DECODER_F_TYPE3);
CXL_DECODER_FLAG_ATTR(locked, CXL_DECODER_F_LOCK);
@@ -219,6 +220,8 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
mode = CXL_PARTMODE_PMEM;
else if (sysfs_streq(buf, "ram"))
mode = CXL_PARTMODE_RAM;
+ else if (sysfs_streq(buf, "dynamic_ram_a"))
+ mode = CXL_PARTMODE_DYNAMIC_RAM_A;
else
return -EINVAL;
@@ -324,6 +327,7 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
static struct attribute *cxl_decoder_root_attrs[] = {
&dev_attr_cap_pmem.attr,
&dev_attr_cap_ram.attr,
+ &dev_attr_cap_dynamic_ram_a.attr,
&dev_attr_cap_type2.attr,
&dev_attr_cap_type3.attr,
&dev_attr_target_list.attr,
--
2.49.0
* Re: [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
2025-04-13 22:52 ` [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Ira Weiny
@ 2025-04-14 15:36 ` Jonathan Cameron
2025-05-07 20:50 ` Fan Ni
1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 15:36 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:14 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Endpoints can now support a single dynamic ram partition following the
> persistent memory partition.
>
> Expand the mode to allow a decoder to point to the first dynamic ram
> partition.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
2025-04-13 22:52 ` [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Ira Weiny
2025-04-14 15:36 ` Jonathan Cameron
@ 2025-05-07 20:50 ` Fan Ni
1 sibling, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-05-07 20:50 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, Apr 13, 2025 at 05:52:14PM -0500, Ira Weiny wrote:
> Endpoints can now support a single dynamic ram partition following the
> persistent memory partition.
>
> Expand the mode to allow a decoder to point to the first dynamic ram
> partition.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
> ---
> Changes:
> [iweiny: completely re-written]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 18 +++++++++---------
> drivers/cxl/core/port.c | 4 ++++
> 2 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 2b59041bb410..b2754e6047ca 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -358,22 +358,22 @@ Description:
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/mode
> -Date: May, 2022
> -KernelVersion: v6.0
> +Date: May, 2022, May 2025
> +KernelVersion: v6.0, v6.16 (dynamic_ram_a)
> Contact: linux-cxl@vger.kernel.org
> Description:
> (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> translates from a host physical address range, to a device
> local address range. Device-local address ranges are further
> - split into a 'ram' (volatile memory) range and 'pmem'
> - (persistent memory) range. The 'mode' attribute emits one of
> - 'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
> - not actively decoding, or no DPA allocation policy has been
> - set.
> + split into a 'ram' (volatile memory) range, 'pmem' (persistent
> + memory), and 'dynamic_ram_a' (first Dynamic RAM) range. The
> + 'mode' attribute emits one of 'ram', 'pmem', 'dynamic_ram_a' or
> + 'none'. The 'none' indicates the decoder is not actively
> + decoding, or no DPA allocation policy has been set.
>
> 'mode' can be written, when the decoder is in the 'disabled'
> - state, with either 'ram' or 'pmem' to set the boundaries for the
> - next allocation.
> + state, with either 'ram', 'pmem', or 'dynamic_ram_a' to set the
> + boundaries for the next allocation.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 0fd6646c1a2e..e98605bd39b4 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -125,6 +125,7 @@ static DEVICE_ATTR_RO(name)
>
> CXL_DECODER_FLAG_ATTR(cap_pmem, CXL_DECODER_F_PMEM);
> CXL_DECODER_FLAG_ATTR(cap_ram, CXL_DECODER_F_RAM);
> +CXL_DECODER_FLAG_ATTR(cap_dynamic_ram_a, CXL_DECODER_F_RAM);
> CXL_DECODER_FLAG_ATTR(cap_type2, CXL_DECODER_F_TYPE2);
> CXL_DECODER_FLAG_ATTR(cap_type3, CXL_DECODER_F_TYPE3);
> CXL_DECODER_FLAG_ATTR(locked, CXL_DECODER_F_LOCK);
> @@ -219,6 +220,8 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
> mode = CXL_PARTMODE_PMEM;
> else if (sysfs_streq(buf, "ram"))
> mode = CXL_PARTMODE_RAM;
> + else if (sysfs_streq(buf, "dynamic_ram_a"))
> + mode = CXL_PARTMODE_DYNAMIC_RAM_A;
> else
> return -EINVAL;
>
> @@ -324,6 +327,7 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_cap_pmem.attr,
> &dev_attr_cap_ram.attr,
> + &dev_attr_cap_dynamic_ram_a.attr,
> &dev_attr_cap_type2.attr,
> &dev_attr_cap_type3.attr,
> &dev_attr_target_list.attr,
>
> --
> 2.49.0
>
--
Fan Ni
* [PATCH v9 07/19] cxl/region: Add sparse DAX region support
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (5 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 06/19] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 15:40 ` Jonathan Cameron
` (2 more replies)
2025-04-13 22:52 ` [PATCH v9 08/19] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
` (15 subsequent siblings)
22 siblings, 3 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Dynamic Capacity CXL regions must allow memory to be added or removed
dynamically. In addition to the quantity of memory available, the
location of the memory within a DC partition is dynamic, based on the
extents offered by a device. CXL DAX regions must accommodate the
sparseness of this memory in the management of DAX regions and devices.
Introduce the concept of a sparse DAX region. Introduce
create_dynamic_ram_a_region() sysfs entry to create such regions.
Special case dynamic capable regions to create a 0-sized seed DAX
device, maintaining compatibility with the existing expectation that a
default DAX device holds a region reference.
Report 0 bytes of available capacity until capacity is added.
Sparse regions complicate the range mapping of dax devices. There is no
known use case for range mapping on sparse regions. Avoid the
complication by preventing range mapping of dax devices on sparse
regions.
Interleaving is deferred for now; add checks to reject it.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: adjust to new partition mode and new singular dynamic ram
partition]
---
Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++----------
drivers/cxl/core/core.h | 11 ++++++++++
drivers/cxl/core/port.c | 1 +
drivers/cxl/core/region.c | 38 +++++++++++++++++++++++++++++++--
drivers/dax/bus.c | 10 +++++++++
drivers/dax/bus.h | 1 +
drivers/dax/cxl.c | 16 ++++++++++++--
7 files changed, 84 insertions(+), 15 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index b2754e6047ca..2e26d95ac66f 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -434,20 +434,20 @@ Description:
interleave_granularity).
-What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
-Date: May, 2022, January, 2023
-KernelVersion: v6.0 (pmem), v6.3 (ram)
+What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_a}_region
+Date: May, 2022, January, 2023, May 2025
+KernelVersion: v6.0 (pmem), v6.3 (ram), v6.16 (dynamic_ram_a)
Contact: linux-cxl@vger.kernel.org
Description:
(RW) Write a string in the form 'regionZ' to start the process
- of defining a new persistent, or volatile memory region
- (interleave-set) within the decode range bounded by root decoder
- 'decoderX.Y'. The value written must match the current value
- returned from reading this attribute. An atomic compare exchange
- operation is done on write to assign the requested id to a
- region and allocate the region-id for the next creation attempt.
- EBUSY is returned if the region name written does not match the
- current cached value.
+ of defining a new persistent, volatile, or dynamic RAM memory
+ region (interleave-set) within the decode range bounded by root
+ decoder 'decoderX.Y'. The value written must match the current
+ value returned from reading this attribute. An atomic compare
+ exchange operation is done on write to assign the requested id
+ to a region and allocate the region-id for the next creation
+ attempt. EBUSY is returned if the region name written does not
+ match the current cached value.
What: /sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 15699299dc11..08facbc2d270 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -5,6 +5,7 @@
#define __CXL_CORE_H__
#include <cxl/mailbox.h>
+#include <cxlmem.h>
extern const struct device_type cxl_nvdimm_bridge_type;
extern const struct device_type cxl_nvdimm_type;
@@ -12,9 +13,19 @@ extern const struct device_type cxl_pmu_type;
extern struct attribute_group cxl_base_attribute_group;
+static inline struct cxl_memdev_state *
+cxled_to_mds(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ return container_of(cxlds, struct cxl_memdev_state, cxlds);
+}
+
#ifdef CONFIG_CXL_REGION
extern struct device_attribute dev_attr_create_pmem_region;
extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dynamic_ram_a_region;
extern struct device_attribute dev_attr_delete_region;
extern struct device_attribute dev_attr_region;
extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index e98605bd39b4..b2bd24437484 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -334,6 +334,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
&dev_attr_qos_class.attr,
SET_CXL_REGION_ATTR(create_pmem_region)
SET_CXL_REGION_ATTR(create_ram_region)
+ SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
SET_CXL_REGION_ATTR(delete_region)
NULL,
};
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index c3f4dc244df7..716d33140ee8 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -480,6 +480,11 @@ static ssize_t interleave_ways_store(struct device *dev,
if (rc)
return rc;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
+ dev_err(dev, "Interleaving and DCD not supported\n");
+ return -EINVAL;
+ }
+
rc = ways_to_eiw(val, &iw);
if (rc)
return rc;
@@ -2198,6 +2203,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
if (sysfs_streq(buf, "\n"))
rc = detach_target(cxlr, pos);
else {
+ struct cxl_endpoint_decoder *cxled;
struct device *dev;
dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
@@ -2209,8 +2215,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
goto out;
}
- rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
- TASK_INTERRUPTIBLE);
+ cxled = to_cxl_endpoint_decoder(dev);
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ !cxl_dcd_supported(cxled_to_mds(cxled))) {
+ dev_dbg(dev, "DCD unsupported\n");
+ return -EINVAL;
+ }
+ rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
out:
put_device(dev);
}
@@ -2555,6 +2566,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
switch (mode) {
case CXL_PARTMODE_RAM:
case CXL_PARTMODE_PMEM:
+ case CXL_PARTMODE_DYNAMIC_RAM_A:
break;
default:
dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
@@ -2607,6 +2619,21 @@ static ssize_t create_ram_region_store(struct device *dev,
}
DEVICE_ATTR_RW(create_ram_region);
+static ssize_t create_dynamic_ram_a_region_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dynamic_ram_a_region_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_A);
+}
+DEVICE_ATTR_RW(create_dynamic_ram_a_region);
+
static ssize_t region_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -3173,6 +3200,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
struct device *dev;
int rc;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ cxlr->params.interleave_ways != 1) {
+ dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+ return -EINVAL;
+ }
+
cxlr_dax = cxl_dax_region_alloc(cxlr);
if (IS_ERR(cxlr_dax))
return PTR_ERR(cxlr_dax);
@@ -3539,6 +3572,7 @@ static int cxl_region_probe(struct device *dev)
case CXL_PARTMODE_PMEM:
return devm_cxl_add_pmem_region(cxlr);
case CXL_PARTMODE_RAM:
+ case CXL_PARTMODE_DYNAMIC_RAM_A:
/*
* The region can not be managed by CXL if any portion of
* it is already online as 'System RAM'
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..d8cb5195a227 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
}
+static bool is_sparse(struct dax_region *dax_region)
+{
+ return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
+}
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
lockdep_assert_held(&dax_region_rwsem);
+ if (is_sparse(dax_region))
+ return 0;
+
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_mapping.attr && is_static(dax_region))
return 0;
+ if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
+ return 0;
if ((a == &dev_attr_align.attr ||
a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..783bfeef42cc 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,6 +13,7 @@ struct dax_region;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
#define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 13cd94d32ff7..88b051cea755 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
struct cxl_region *cxlr = cxlr_dax->cxlr;
struct dax_region *dax_region;
struct dev_dax_data data;
+ resource_size_t dev_size;
+ unsigned long flags;
if (nid == NUMA_NO_NODE)
nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
+ flags = IORESOURCE_DAX_KMEM;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ flags |= IORESOURCE_DAX_SPARSE_CAP;
+
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, IORESOURCE_DAX_KMEM);
+ PMD_SIZE, flags);
if (!dax_region)
return -ENOMEM;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ /* Add empty seed dax device */
+ dev_size = 0;
+ else
+ dev_size = range_len(&cxlr_dax->hpa_range);
+
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = -1,
- .size = range_len(&cxlr_dax->hpa_range),
+ .size = dev_size,
.memmap_on_memory = true,
};
--
2.49.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v9 07/19] cxl/region: Add sparse DAX region support
2025-04-13 22:52 ` [PATCH v9 07/19] cxl/region: Add sparse DAX region support Ira Weiny
@ 2025-04-14 15:40 ` Jonathan Cameron
2025-05-08 17:54 ` Fan Ni
2025-05-08 18:17 ` Fan Ni
2 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 15:40 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:15 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically. In addition to the quantity of memory available, the
> location of the memory within a DC partition is dynamic, based on the
> extents offered by a device. CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
>
> Introduce the concept of a sparse DAX region. Introduce a
> create_dynamic_ram_a_region sysfs entry to create such regions.
> Special-case dynamic-capable regions to create a 0-sized seed DAX
> device to maintain compatibility, which requires a default DAX device
> to hold a region reference.
>
> Indicate 0 bytes of available capacity until capacity is added.
>
> Sparse regions complicate the range mapping of dax devices. There is
> no known use case for range mapping on sparse regions. Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
>
> Support for interleaved dynamic capacity regions is deferred for now.
> Add checks to reject interleaving on such regions.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
I'm not that familiar with the DAX parts but looks fine to me.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 07/19] cxl/region: Add sparse DAX region support
2025-04-13 22:52 ` [PATCH v9 07/19] cxl/region: Add sparse DAX region support Ira Weiny
2025-04-14 15:40 ` Jonathan Cameron
@ 2025-05-08 17:54 ` Fan Ni
2025-05-08 18:17 ` Fan Ni
2 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-05-08 17:54 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, Apr 13, 2025 at 05:52:15PM -0500, Ira Weiny wrote:
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically. In addition to the quantity of memory available, the
> location of the memory within a DC partition is dynamic, based on the
> extents offered by a device. CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
>
> Introduce the concept of a sparse DAX region. Introduce a
> create_dynamic_ram_a_region sysfs entry to create such regions.
> Special-case dynamic-capable regions to create a 0-sized seed DAX
> device to maintain compatibility, which requires a default DAX device
> to hold a region reference.
>
> Indicate 0 bytes of available capacity until capacity is added.
>
> Sparse regions complicate the range mapping of dax devices. There is
> no known use case for range mapping on sparse regions. Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
>
> Support for interleaved dynamic capacity regions is deferred for now.
> Add checks to reject interleaving on such regions.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [iweiny: adjust to new partition mode and new singular dynamic ram
> partition]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++----------
> drivers/cxl/core/core.h | 11 ++++++++++
> drivers/cxl/core/port.c | 1 +
> drivers/cxl/core/region.c | 38 +++++++++++++++++++++++++++++++--
> drivers/dax/bus.c | 10 +++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 16 ++++++++++++--
> 7 files changed, 84 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index b2754e6047ca..2e26d95ac66f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -434,20 +434,20 @@ Description:
> interleave_granularity).
>
>
> -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date: May, 2022, January, 2023
> -KernelVersion: v6.0 (pmem), v6.3 (ram)
> +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_a}_region
> +Date: May, 2022, January, 2023, May 2025
> +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.16 (dynamic_ram_a)
> Contact: linux-cxl@vger.kernel.org
> Description:
> (RW) Write a string in the form 'regionZ' to start the process
> - of defining a new persistent, or volatile memory region
> - (interleave-set) within the decode range bounded by root decoder
> - 'decoderX.Y'. The value written must match the current value
> - returned from reading this attribute. An atomic compare exchange
> - operation is done on write to assign the requested id to a
> - region and allocate the region-id for the next creation attempt.
> - EBUSY is returned if the region name written does not match the
> - current cached value.
> + of defining a new persistent, volatile, or dynamic RAM memory
> + region (interleave-set) within the decode range bounded by root
> + decoder 'decoderX.Y'. The value written must match the current
> + value returned from reading this attribute. An atomic compare
> + exchange operation is done on write to assign the requested id
> + to a region and allocate the region-id for the next creation
> + attempt. EBUSY is returned if the region name written does not
> + match the current cached value.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 15699299dc11..08facbc2d270 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -5,6 +5,7 @@
> #define __CXL_CORE_H__
>
> #include <cxl/mailbox.h>
> +#include <cxlmem.h>
>
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> @@ -12,9 +13,19 @@ extern const struct device_type cxl_pmu_type;
>
> extern struct attribute_group cxl_base_attribute_group;
>
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
> #ifdef CONFIG_CXL_REGION
> extern struct device_attribute dev_attr_create_pmem_region;
> extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dynamic_ram_a_region;
> extern struct device_attribute dev_attr_delete_region;
> extern struct device_attribute dev_attr_region;
> extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index e98605bd39b4..b2bd24437484 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -334,6 +334,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_qos_class.attr,
> SET_CXL_REGION_ATTR(create_pmem_region)
> SET_CXL_REGION_ATTR(create_ram_region)
> + SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
> SET_CXL_REGION_ATTR(delete_region)
> NULL,
> };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index c3f4dc244df7..716d33140ee8 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -480,6 +480,11 @@ static ssize_t interleave_ways_store(struct device *dev,
> if (rc)
> return rc;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
> + dev_err(dev, "Interleaving and DCD not supported\n");
> + return -EINVAL;
> + }
> +
> rc = ways_to_eiw(val, &iw);
> if (rc)
> return rc;
> @@ -2198,6 +2203,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> if (sysfs_streq(buf, "\n"))
> rc = detach_target(cxlr, pos);
> else {
> + struct cxl_endpoint_decoder *cxled;
> struct device *dev;
>
> dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2209,8 +2215,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> goto out;
> }
>
> - rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> - TASK_INTERRUPTIBLE);
> + cxled = to_cxl_endpoint_decoder(dev);
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + !cxl_dcd_supported(cxled_to_mds(cxled))) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return -EINVAL;
Should this be the following? Returning directly here skips the
put_device() at the 'out' label and leaks the reference taken by
bus_find_device_by_name():
+ rc = -EINVAL;
+ goto out;
Fan
> + }
> + rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
> out:
> put_device(dev);
> }
> @@ -2555,6 +2566,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> switch (mode) {
> case CXL_PARTMODE_RAM:
> case CXL_PARTMODE_PMEM:
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> break;
> default:
> dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2607,6 +2619,21 @@ static ssize_t create_ram_region_store(struct device *dev,
> }
> DEVICE_ATTR_RW(create_ram_region);
>
> +static ssize_t create_dynamic_ram_a_region_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dynamic_ram_a_region_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_A);
> +}
> +DEVICE_ATTR_RW(create_dynamic_ram_a_region);
> +
> static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -3173,6 +3200,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> struct device *dev;
> int rc;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + cxlr->params.interleave_ways != 1) {
> + dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> cxlr_dax = cxl_dax_region_alloc(cxlr);
> if (IS_ERR(cxlr_dax))
> return PTR_ERR(cxlr_dax);
> @@ -3539,6 +3572,7 @@ static int cxl_region_probe(struct device *dev)
> case CXL_PARTMODE_PMEM:
> return devm_cxl_add_pmem_region(cxlr);
> case CXL_PARTMODE_RAM:
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> /*
> * The region can not be managed by CXL if any portion of
> * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index fde29e0ad68b..d8cb5195a227 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> }
>
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> + return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> @@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>
> lockdep_assert_held(&dax_region_rwsem);
>
> + if (is_sparse(dax_region))
> + return 0;
> +
> for_each_dax_region_resource(dax_region, res)
> size -= resource_size(res);
> return size;
> @@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> return 0;
> if (a == &dev_attr_mapping.attr && is_static(dax_region))
> return 0;
> + if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> + return 0;
> if ((a == &dev_attr_align.attr ||
> a == &dev_attr_size.attr) && is_static(dax_region))
> return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
> /* dax bus specific ioresource flags */
> #define IORESOURCE_DAX_STATIC BIT(0)
> #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 13cd94d32ff7..88b051cea755 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
> struct cxl_region *cxlr = cxlr_dax->cxlr;
> struct dax_region *dax_region;
> struct dev_dax_data data;
> + resource_size_t dev_size;
> + unsigned long flags;
>
> if (nid == NUMA_NO_NODE)
> nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>
> + flags = IORESOURCE_DAX_KMEM;
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
> dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> - PMD_SIZE, IORESOURCE_DAX_KMEM);
> + PMD_SIZE, flags);
> if (!dax_region)
> return -ENOMEM;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + /* Add empty seed dax device */
> + dev_size = 0;
> + else
> + dev_size = range_len(&cxlr_dax->hpa_range);
> +
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> .id = -1,
> - .size = range_len(&cxlr_dax->hpa_range),
> + .size = dev_size,
> .memmap_on_memory = true,
> };
>
>
> --
> 2.49.0
>
--
Fan Ni (From gmail)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 07/19] cxl/region: Add sparse DAX region support
2025-04-13 22:52 ` [PATCH v9 07/19] cxl/region: Add sparse DAX region support Ira Weiny
2025-04-14 15:40 ` Jonathan Cameron
2025-05-08 17:54 ` Fan Ni
@ 2025-05-08 18:17 ` Fan Ni
2 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-05-08 18:17 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, Apr 13, 2025 at 05:52:15PM -0500, Ira Weiny wrote:
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically. In addition to the quantity of memory available, the
> location of the memory within a DC partition is dynamic, based on the
> extents offered by a device. CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
>
> Introduce the concept of a sparse DAX region. Introduce a
> create_dynamic_ram_a_region sysfs entry to create such regions.
> Special-case dynamic-capable regions to create a 0-sized seed DAX
> device to maintain compatibility, which requires a default DAX device
> to hold a region reference.
>
> Indicate 0 bytes of available capacity until capacity is added.
>
> Sparse regions complicate the range mapping of dax devices. There is
> no known use case for range mapping on sparse regions. Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
>
> Support for interleaved dynamic capacity regions is deferred for now.
> Add checks to reject interleaving on such regions.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
LGTM, although I am not very familiar with dax.
Reviewed-by: Fan Ni <fan.ni@samsung.com>
>
> ---
> Changes:
> [iweiny: adjust to new partition mode and new singular dynamic ram
> partition]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++----------
> drivers/cxl/core/core.h | 11 ++++++++++
> drivers/cxl/core/port.c | 1 +
> drivers/cxl/core/region.c | 38 +++++++++++++++++++++++++++++++--
> drivers/dax/bus.c | 10 +++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 16 ++++++++++++--
> 7 files changed, 84 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index b2754e6047ca..2e26d95ac66f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -434,20 +434,20 @@ Description:
> interleave_granularity).
>
>
> -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date: May, 2022, January, 2023
> -KernelVersion: v6.0 (pmem), v6.3 (ram)
> +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_a}_region
> +Date: May, 2022, January, 2023, May 2025
> +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.16 (dynamic_ram_a)
> Contact: linux-cxl@vger.kernel.org
> Description:
> (RW) Write a string in the form 'regionZ' to start the process
> - of defining a new persistent, or volatile memory region
> - (interleave-set) within the decode range bounded by root decoder
> - 'decoderX.Y'. The value written must match the current value
> - returned from reading this attribute. An atomic compare exchange
> - operation is done on write to assign the requested id to a
> - region and allocate the region-id for the next creation attempt.
> - EBUSY is returned if the region name written does not match the
> - current cached value.
> + of defining a new persistent, volatile, or dynamic RAM memory
> + region (interleave-set) within the decode range bounded by root
> + decoder 'decoderX.Y'. The value written must match the current
> + value returned from reading this attribute. An atomic compare
> + exchange operation is done on write to assign the requested id
> + to a region and allocate the region-id for the next creation
> + attempt. EBUSY is returned if the region name written does not
> + match the current cached value.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 15699299dc11..08facbc2d270 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -5,6 +5,7 @@
> #define __CXL_CORE_H__
>
> #include <cxl/mailbox.h>
> +#include <cxlmem.h>
>
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> @@ -12,9 +13,19 @@ extern const struct device_type cxl_pmu_type;
>
> extern struct attribute_group cxl_base_attribute_group;
>
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
> #ifdef CONFIG_CXL_REGION
> extern struct device_attribute dev_attr_create_pmem_region;
> extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dynamic_ram_a_region;
> extern struct device_attribute dev_attr_delete_region;
> extern struct device_attribute dev_attr_region;
> extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index e98605bd39b4..b2bd24437484 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -334,6 +334,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_qos_class.attr,
> SET_CXL_REGION_ATTR(create_pmem_region)
> SET_CXL_REGION_ATTR(create_ram_region)
> + SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
> SET_CXL_REGION_ATTR(delete_region)
> NULL,
> };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index c3f4dc244df7..716d33140ee8 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -480,6 +480,11 @@ static ssize_t interleave_ways_store(struct device *dev,
> if (rc)
> return rc;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
> + dev_err(dev, "Interleaving and DCD not supported\n");
> + return -EINVAL;
> + }
> +
> rc = ways_to_eiw(val, &iw);
> if (rc)
> return rc;
> @@ -2198,6 +2203,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> if (sysfs_streq(buf, "\n"))
> rc = detach_target(cxlr, pos);
> else {
> + struct cxl_endpoint_decoder *cxled;
> struct device *dev;
>
> dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2209,8 +2215,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> goto out;
> }
>
> - rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> - TASK_INTERRUPTIBLE);
> + cxled = to_cxl_endpoint_decoder(dev);
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + !cxl_dcd_supported(cxled_to_mds(cxled))) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return -EINVAL;
> + }
> + rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
> out:
> put_device(dev);
> }
> @@ -2555,6 +2566,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> switch (mode) {
> case CXL_PARTMODE_RAM:
> case CXL_PARTMODE_PMEM:
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> break;
> default:
> dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2607,6 +2619,21 @@ static ssize_t create_ram_region_store(struct device *dev,
> }
> DEVICE_ATTR_RW(create_ram_region);
>
> +static ssize_t create_dynamic_ram_a_region_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dynamic_ram_a_region_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_A);
> +}
> +DEVICE_ATTR_RW(create_dynamic_ram_a_region);
> +
> static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -3173,6 +3200,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> struct device *dev;
> int rc;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + cxlr->params.interleave_ways != 1) {
> + dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> cxlr_dax = cxl_dax_region_alloc(cxlr);
> if (IS_ERR(cxlr_dax))
> return PTR_ERR(cxlr_dax);
> @@ -3539,6 +3572,7 @@ static int cxl_region_probe(struct device *dev)
> case CXL_PARTMODE_PMEM:
> return devm_cxl_add_pmem_region(cxlr);
> case CXL_PARTMODE_RAM:
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> /*
> * The region can not be managed by CXL if any portion of
> * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index fde29e0ad68b..d8cb5195a227 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> }
>
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> + return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> @@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>
> lockdep_assert_held(&dax_region_rwsem);
>
> + if (is_sparse(dax_region))
> + return 0;
> +
> for_each_dax_region_resource(dax_region, res)
> size -= resource_size(res);
> return size;
> @@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> return 0;
> if (a == &dev_attr_mapping.attr && is_static(dax_region))
> return 0;
> + if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> + return 0;
> if ((a == &dev_attr_align.attr ||
> a == &dev_attr_size.attr) && is_static(dax_region))
> return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
> /* dax bus specific ioresource flags */
> #define IORESOURCE_DAX_STATIC BIT(0)
> #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 13cd94d32ff7..88b051cea755 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
> struct cxl_region *cxlr = cxlr_dax->cxlr;
> struct dax_region *dax_region;
> struct dev_dax_data data;
> + resource_size_t dev_size;
> + unsigned long flags;
>
> if (nid == NUMA_NO_NODE)
> nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>
> + flags = IORESOURCE_DAX_KMEM;
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
> dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> - PMD_SIZE, IORESOURCE_DAX_KMEM);
> + PMD_SIZE, flags);
> if (!dax_region)
> return -ENOMEM;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + /* Add empty seed dax device */
> + dev_size = 0;
> + else
> + dev_size = range_len(&cxlr_dax->hpa_range);
> +
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> .id = -1,
> - .size = range_len(&cxlr_dax->hpa_range),
> + .size = dev_size,
> .memmap_on_memory = true,
> };
>
>
> --
> 2.49.0
>
--
Fan Ni (From gmail)
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v9 08/19] cxl/events: Split event msgnum configuration from irq setup
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (6 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 07/19] cxl/region: Add sparse DAX region support Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 09/19] cxl/pci: Factor out interrupt policy check Ira Weiny
` (14 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel, Li Ming
Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal. BIOS may have control over non-DCD event
processing. DCD interrupt configuration needs to be separate from
memory event interrupt configuration.
Split cxl_event_config_msgnums() from irq setup in preparation for
separate DCD interrupt configuration.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/pci.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bc40cf6e2fe9..308b05bbb82d 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -715,35 +715,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
return cxl_event_get_int_policy(mds, policy);
}
-static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
+static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
- struct cxl_event_interrupt_policy policy;
int rc;
- rc = cxl_event_config_msgnums(mds, &policy);
- if (rc)
- return rc;
-
- rc = cxl_event_req_irq(cxlds, policy.info_settings);
+ rc = cxl_event_req_irq(cxlds, policy->info_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.warn_settings);
+ rc = cxl_event_req_irq(cxlds, policy->warn_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.failure_settings);
+ rc = cxl_event_req_irq(cxlds, policy->failure_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
+ rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
return rc;
@@ -790,11 +786,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
return -EBUSY;
}
+ rc = cxl_event_config_msgnums(mds, &policy);
+ if (rc)
+ return rc;
+
rc = cxl_mem_alloc_event_buf(mds);
if (rc)
return rc;
- rc = cxl_event_irqsetup(mds);
+ rc = cxl_event_irqsetup(mds, &policy);
if (rc)
return rc;
--
2.49.0
* [PATCH v9 09/19] cxl/pci: Factor out interrupt policy check
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (7 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 08/19] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 10/19] cxl/mem: Configure dynamic capacity interrupts Ira Weiny
` (13 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel, Li Ming
Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal. BIOS may have control over non-DCD event
processing. DCD interrupt configuration needs to be separate from
memory event interrupt configuration.
Factor out event interrupt setting validation.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/pci.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 308b05bbb82d..36d031d66dec 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -755,6 +755,21 @@ static bool cxl_event_int_is_fw(u8 setting)
return mode == CXL_INT_FW;
}
+static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
+{
+ if (cxl_event_int_is_fw(policy->info_settings) ||
+ cxl_event_int_is_fw(policy->warn_settings) ||
+ cxl_event_int_is_fw(policy->failure_settings) ||
+ cxl_event_int_is_fw(policy->fatal_settings)) {
+ dev_err(mds->cxlds.dev,
+ "FW still in control of Event Logs despite _OSC settings\n");
+ return false;
+ }
+
+ return true;
+}
+
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
@@ -777,14 +792,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- if (cxl_event_int_is_fw(policy.info_settings) ||
- cxl_event_int_is_fw(policy.warn_settings) ||
- cxl_event_int_is_fw(policy.failure_settings) ||
- cxl_event_int_is_fw(policy.fatal_settings)) {
- dev_err(mds->cxlds.dev,
- "FW still in control of Event Logs despite _OSC settings\n");
+ if (!cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- }
rc = cxl_event_config_msgnums(mds, &policy);
if (rc)
--
2.49.0
* [PATCH v9 10/19] cxl/mem: Configure dynamic capacity interrupts
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (8 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 09/19] cxl/pci: Factor out interrupt policy check Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 11/19] cxl/core: Return endpoint decoder information from region search Ira Weiny
` (12 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel, Li Ming
Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism. The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications. Firmware cannot
configure DCD events to be FW controlled, but it can retain control of
memory events.
Configure DCD event log interrupts on devices supporting dynamic
capacity. Disable DCD if interrupts are not supported.
Care is taken to preserve the interrupt policy set by the FW if
firmware-first has been selected by the BIOS.
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/cxlmem.h | 2 ++
drivers/cxl/pci.c | 73 ++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 62 insertions(+), 13 deletions(-)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index a74ac2d70d8d..34a606c5ead0 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -204,7 +204,9 @@ struct cxl_event_interrupt_policy {
u8 warn_settings;
u8 failure_settings;
u8 fatal_settings;
+ u8 dcd_settings;
} __packed;
+#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
/**
* struct cxl_event_state - Event log driver state
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 36d031d66dec..c8a315bbf012 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -685,23 +685,34 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
}
static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
- struct cxl_event_interrupt_policy *policy)
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
{
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
struct cxl_mbox_cmd mbox_cmd;
int rc;
- *policy = (struct cxl_event_interrupt_policy) {
- .info_settings = CXL_INT_MSI_MSIX,
- .warn_settings = CXL_INT_MSI_MSIX,
- .failure_settings = CXL_INT_MSI_MSIX,
- .fatal_settings = CXL_INT_MSI_MSIX,
- };
+ /* memory event policy is left if FW has control */
+ if (native_cxl) {
+ *policy = (struct cxl_event_interrupt_policy) {
+ .info_settings = CXL_INT_MSI_MSIX,
+ .warn_settings = CXL_INT_MSI_MSIX,
+ .failure_settings = CXL_INT_MSI_MSIX,
+ .fatal_settings = CXL_INT_MSI_MSIX,
+ .dcd_settings = 0,
+ };
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ policy->dcd_settings = CXL_INT_MSI_MSIX;
+ size_in += sizeof(policy->dcd_settings);
+ }
mbox_cmd = (struct cxl_mbox_cmd) {
.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
.payload_in = policy,
- .size_in = sizeof(*policy),
+ .size_in = size_in,
};
rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
@@ -748,6 +759,30 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
return 0;
}
+static int cxl_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
+{
+ struct cxl_dev_state *cxlds = &mds->cxlds;
+ int rc;
+
+ if (native_cxl) {
+ rc = cxl_event_irqsetup(mds, policy);
+ if (rc)
+ return rc;
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
+ if (rc) {
+ dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
+ cxl_disable_dcd(mds);
+ }
+ }
+
+ return 0;
+}
+
static bool cxl_event_int_is_fw(u8 setting)
{
u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
@@ -773,18 +808,26 @@ static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
- struct cxl_event_interrupt_policy policy;
+ struct cxl_event_interrupt_policy policy = { 0 };
+ bool native_cxl = host_bridge->native_cxl_error;
int rc;
/*
* When BIOS maintains CXL error reporting control, it will process
* event records. Only one agent can do so.
+ *
+ * If BIOS has control of events and DCD is not supported skip event
+ * configuration.
*/
- if (!host_bridge->native_cxl_error)
+ if (!native_cxl && !cxl_dcd_supported(mds))
return 0;
if (!irq_avail) {
dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
+ if (cxl_dcd_supported(mds)) {
+ dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
+ cxl_disable_dcd(mds);
+ }
return 0;
}
@@ -792,10 +835,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- if (!cxl_event_validate_mem_policy(mds, &policy))
+ if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- rc = cxl_event_config_msgnums(mds, &policy);
+ rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
if (rc)
return rc;
@@ -803,12 +846,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- rc = cxl_event_irqsetup(mds, &policy);
+ rc = cxl_irqsetup(mds, &policy, native_cxl);
if (rc)
return rc;
cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
+ dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
+ native_cxl ? "OS" : "BIOS",
+ cxl_dcd_supported(mds) ? "supported" : "not supported");
+
return 0;
}
--
2.49.0
* [PATCH v9 11/19] cxl/core: Return endpoint decoder information from region search
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (9 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 10/19] cxl/mem: Configure dynamic capacity interrupts Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
` (11 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel, Li Ming
cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
The search involves finding the device endpoint decoder as well.
Dynamic capacity extent processing uses the endpoint decoder HPA
information to calculate the HPA offset. In addition, well-behaved
extents should be contained within an endpoint decoder.
Return the endpoint decoder found to be used in subsequent DCD code.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebase]
---
drivers/cxl/core/core.h | 6 ++++--
drivers/cxl/core/mbox.c | 2 +-
drivers/cxl/core/memdev.c | 4 ++--
drivers/cxl/core/region.c | 8 +++++++-
4 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 08facbc2d270..76e23ec03fb4 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -40,7 +40,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
int cxl_region_init(void);
void cxl_region_exit(void);
int cxl_get_poison_by_endpoint(struct cxl_port *port);
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled);
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
@@ -51,7 +52,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
return ULLONG_MAX;
}
static inline
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled)
{
return NULL;
}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index c589d8a330bb..b3dd119d166a 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -957,7 +957,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
guard(rwsem_read)(&cxl_dpa_rwsem);
dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr) {
u64 cache_size = cxlr->params.cache_size;
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 063a14c1973a..d3555d1f13c6 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -320,7 +320,7 @@ int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa)
if (rc)
goto out;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
dev_warn_once(cxl_mbox->host,
"poison inject dpa:%#llx region: %s\n", dpa,
@@ -384,7 +384,7 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa)
if (rc)
goto out;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
dev_warn_once(cxl_mbox->host,
"poison clear dpa:%#llx region: %s\n", dpa,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 716d33140ee8..9c573e8d6ed7 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2839,6 +2839,7 @@ int cxl_get_poison_by_endpoint(struct cxl_port *port)
struct cxl_dpa_to_region_context {
struct cxl_region *cxlr;
u64 dpa;
+ struct cxl_endpoint_decoder *cxled;
};
static int __cxl_dpa_to_region(struct device *dev, void *arg)
@@ -2872,11 +2873,13 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
dev_name(dev));
ctx->cxlr = cxlr;
+ ctx->cxled = cxled;
return 1;
}
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled)
{
struct cxl_dpa_to_region_context ctx;
struct cxl_port *port;
@@ -2888,6 +2891,9 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
+ if (cxled)
+ *cxled = ctx.cxled;
+
return ctx.cxlr;
}
--
2.49.0
* [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (10 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 11/19] cxl/core: Return endpoint decoder information from region search Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 16:07 ` Jonathan Cameron
` (4 more replies)
2025-04-13 22:52 ` [PATCH v9 13/19] cxl/region/extent: Expose region extent information in sysfs Ira Weiny
` (10 subsequent siblings)
22 siblings, 5 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel, Li Ming
A dynamic capacity device (DCD) sends events to signal the host about
changes in the availability of Dynamic Capacity (DC) memory. These
events contain extents describing a DPA range and metadata for memory
to be added or removed. Events may be sent from the device at any time.
Three types of events can be signaled: Add, Release, and Force Release.
On add, the host may accept or reject the memory being offered. If no
region exists, or the extent is invalid, the extent should be rejected.
Add extent events may be grouped by a 'more' bit which indicates those
extents should be processed as a group.
On remove, the host can delay the response until the host is safely no
longer using the memory. If no region exists, the release can be sent
immediately. The host may also release extents (or partial extents) at
any time. Thus the 'more' bit grouping of release events is of less
value and can be ignored in favor of sending multiple release capacity
responses for groups of release events.
Force removal is intended as a mechanism between the FM and the device,
for use only when the host is unresponsive, out of sync, or otherwise
broken. Purposely ignore force removal events.
Regions are made up of one or more devices which may be surfacing memory
to the host. Once all devices in a region have surfaced an extent the
region can expose a corresponding extent for the user to consume.
Without interleaving, a device extent forms a 1:1 relationship with the
region extent. Immediately surface a region extent upon getting a
device extent.
Per the specification, the device is allowed to offer or remove extents
at any time. However, anticipated use cases can expect extents to be
offered, accepted, and removed in well-defined chunks.
Simplify extent tracking with the following restrictions.
1) Flag for removal any extent which overlaps a requested
release range.
2) Refuse the offer of extents which overlap already accepted
memory ranges.
3) Accept again a range which has already been accepted by the
host. Eating duplicates serves three purposes.
3a) It simplifies the code if the device should get out of
sync with the host, and it is safe to acknowledge the
extent again.
3b) This simplifies the code to process existing extents if
the extent list should change while the extent list is
being read.
3c) Duplicates for a given partition which are seen during a
race between the hardware surfacing an extent and the cxl
dax driver scanning for existing extents will be ignored.
NOTE: Processing existing extents is done in a later patch.
Management of the region extent devices must be synchronized with
potential uses of the memory within the DAX layer. Create region extent
devices as children of the cxl_dax_region device such that the DAX
region driver can co-drive them and synchronize with the DAX layer.
Synchronization and management is handled in a subsequent patch.
Tags are not yet supported within the DAX layer. To maintain
compatibility with legacy DAX/region processing, only tags with a value
of 0 are allowed. This defines existing DAX devices as having a 0 tag,
which is the most logical default.
Process DCD events and create region devices.
Based on an original patch by Navneet Singh.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebase]
[djbw: s/region/partition/]
[iweiny: Adapt to new partition arch]
[iweiny: s/tag/uuid/ throughout the code]
---
drivers/cxl/core/Makefile | 2 +-
drivers/cxl/core/core.h | 13 ++
drivers/cxl/core/extent.c | 366 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 292 +++++++++++++++++++++++++++++++++++-
drivers/cxl/core/region.c | 3 +
drivers/cxl/cxl.h | 53 ++++++-
drivers/cxl/cxlmem.h | 27 ++++
include/cxl/event.h | 31 ++++
tools/testing/cxl/Kbuild | 3 +-
9 files changed, 786 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 086df97a0fcf..792ac799f39d 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -17,6 +17,6 @@ cxl_core-y += cdat.o
cxl_core-y += ras.o
cxl_core-y += acpi.o
cxl_core-$(CONFIG_TRACING) += trace.o
-cxl_core-$(CONFIG_CXL_REGION) += region.o
+cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o
cxl_core-$(CONFIG_CXL_MCE) += mce.o
cxl_core-$(CONFIG_CXL_FEATURES) += features.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 76e23ec03fb4..1272be497926 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -45,12 +45,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
const struct cxl_memdev *cxlmd, u64 dpa)
{
return ULLONG_MAX;
}
+static inline int cxl_add_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ return 0;
+}
+static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ return 0;
+}
static inline
struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
struct cxl_endpoint_decoder **cxled)
@@ -129,6 +141,7 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
bool cxl_need_node_perf_attrs_update(int nid);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
int cxl_ras_init(void);
void cxl_ras_exit(void);
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
new file mode 100644
index 000000000000..6df277caf974
--- /dev/null
+++ b/drivers/cxl/core/extent.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <cxl.h>
+
+#include "core.h"
+
+static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxled_extent *ed_extent)
+{
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct device *dev = &cxled->cxld.dev;
+
+ dev_dbg(dev, "Remove extent %pra (%pU)\n",
+ &ed_extent->dpa_range, &ed_extent->uuid);
+ memdev_release_extent(mds, &ed_extent->dpa_range);
+ kfree(ed_extent);
+}
+
+static void free_region_extent(struct region_extent *region_extent)
+{
+ struct cxled_extent *ed_extent;
+ unsigned long index;
+
+ /*
+ * Remove from each endpoint decoder the extent which backs this region
+ * extent
+ */
+ xa_for_each(®ion_extent->decoder_extents, index, ed_extent)
+ cxled_release_extent(ed_extent->cxled, ed_extent);
+ xa_destroy(®ion_extent->decoder_extents);
+ ida_free(®ion_extent->cxlr_dax->extent_ida, region_extent->dev.id);
+ kfree(region_extent);
+}
+
+static void region_extent_release(struct device *dev)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ free_region_extent(region_extent);
+}
+
+static const struct device_type region_extent_type = {
+ .name = "extent",
+ .release = region_extent_release,
+};
+
+bool is_region_extent(struct device *dev)
+{
+ return dev->type == ®ion_extent_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_region_extent, "CXL");
+
+static void region_extent_unregister(void *ext)
+{
+ struct region_extent *region_extent = ext;
+
+ dev_dbg(®ion_extent->dev, "DAX region rm extent HPA %pra\n",
+ ®ion_extent->hpa_range);
+ device_unregister(®ion_extent->dev);
+}
+
+static void region_rm_extent(struct region_extent *region_extent)
+{
+ struct device *region_dev = region_extent->dev.parent;
+
+ devm_release_action(region_dev, region_extent_unregister, region_extent);
+}
+
+static struct region_extent *
+alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range,
+ uuid_t *uuid)
+{
+ int id;
+
+ struct region_extent *region_extent __free(kfree) =
+ kzalloc(sizeof(*region_extent), GFP_KERNEL);
+ if (!region_extent)
+ return ERR_PTR(-ENOMEM);
+
+ id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL);
+ if (id < 0)
+ return ERR_PTR(-ENOMEM);
+
+ region_extent->hpa_range = *hpa_range;
+ region_extent->cxlr_dax = cxlr_dax;
+ uuid_copy(®ion_extent->uuid, uuid);
+ region_extent->dev.id = id;
+ xa_init(®ion_extent->decoder_extents);
+ return no_free_ptr(region_extent);
+}
+
+static int online_region_extent(struct region_extent *region_extent)
+{
+ struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
+ struct device *dev = ®ion_extent->dev;
+ int rc;
+
+ device_initialize(dev);
+ device_set_pm_not_required(dev);
+ dev->parent = &cxlr_dax->dev;
+ dev->type = ®ion_extent_type;
+ rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
+ if (rc)
+ goto err;
+
+ rc = device_add(dev);
+ if (rc)
+ goto err;
+
+ dev_dbg(dev, "region extent HPA %pra\n", ®ion_extent->hpa_range);
+ return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
+ region_extent);
+
+err:
+ dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %pra\n",
+ ®ion_extent->hpa_range);
+
+ put_device(dev);
+ return rc;
+}
+
+struct match_data {
+ struct cxl_endpoint_decoder *cxled;
+ struct range *new_range;
+};
+
+static int match_contains(struct device *dev, const void *data)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ const struct match_data *md = data;
+ struct cxled_extent *entry;
+ unsigned long index;
+
+ if (!region_extent)
+ return 0;
+
+ xa_for_each(®ion_extent->decoder_extents, index, entry) {
+ if (md->cxled == entry->cxled &&
+ range_contains(&entry->dpa_range, md->new_range))
+ return 1;
+ }
+ return 0;
+}
+
+static bool extents_contain(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ struct range *new_range)
+{
+ struct match_data md = {
+ .cxled = cxled,
+ .new_range = new_range,
+ };
+
+ struct device *extent_device __free(put_device)
+ = device_find_child(&cxlr_dax->dev, &md, match_contains);
+ if (!extent_device)
+ return false;
+
+ return true;
+}
+
+static int match_overlaps(struct device *dev, const void *data)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ const struct match_data *md = data;
+ struct cxled_extent *entry;
+ unsigned long index;
+
+ if (!region_extent)
+ return 0;
+
+ xa_for_each(®ion_extent->decoder_extents, index, entry) {
+ if (md->cxled == entry->cxled &&
+ range_overlaps(&entry->dpa_range, md->new_range))
+ return 1;
+ }
+
+ return 0;
+}
+
+static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ struct range *new_range)
+{
+ struct match_data md = {
+ .cxled = cxled,
+ .new_range = new_range,
+ };
+
+ struct device *extent_device __free(put_device)
+ = device_find_child(&cxlr_dax->dev, &md, match_overlaps);
+ if (!extent_device)
+ return false;
+
+ return true;
+}
+
+static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dax_region *cxlr_dax,
+ struct range *dpa_range,
+ struct range *hpa_range)
+{
+ resource_size_t dpa_offset, hpa;
+
+ dpa_offset = dpa_range->start - cxled->dpa_res->start;
+ hpa = cxled->cxld.hpa_range.start + dpa_offset;
+
+ hpa_range->start = hpa - cxlr_dax->hpa_range.start;
+ hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
+}
+
+static int cxlr_rm_extent(struct device *dev, void *data)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ struct range *region_hpa_range = data;
+
+ if (!region_extent)
+ return 0;
+
+ /*
+ * Any extent which 'touches' the released range is removed.
+ */
+ if (range_overlaps(region_hpa_range, ®ion_extent->hpa_range)) {
+ dev_dbg(dev, "Remove region extent HPA %pra\n",
+ ®ion_extent->hpa_range);
+ region_rm_extent(region_extent);
+ }
+ return 0;
+}
+
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_endpoint_decoder *cxled;
+ struct range hpa_range, dpa_range;
+ struct cxl_region *cxlr;
+
+ dpa_range = (struct range) {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+
+ guard(rwsem_read)(&cxl_region_rwsem);
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+ if (!cxlr) {
+ /*
+ * No region can happen here for a few reasons:
+ *
+ * 1) Extents were accepted and the host crashed/rebooted
+ * leaving them in an accepted state. On reboot the host
+ * has not yet created a region to own them.
+ *
+ * 2) Region destruction won the race with the device releasing
+ * all the extents. Here the release will be a duplicate of
+ * the one sent via region destruction.
+ *
+ * 3) The device is confused and releasing extents for which no
+ * region ever existed.
+ *
+ * In all these cases make sure the device knows we are not
+ * using this extent.
+ */
+ memdev_release_extent(mds, &dpa_range);
+ return -ENXIO;
+ }
+
+ calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
+
+ /* Remove region extents which overlap */
+ return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+ cxlr_rm_extent);
+}
+
+static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ struct cxled_extent *ed_extent)
+{
+ struct region_extent *region_extent;
+ struct range hpa_range;
+ int rc;
+
+ calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
+
+ region_extent = alloc_region_extent(cxlr_dax, &hpa_range, &ed_extent->uuid);
+ if (IS_ERR(region_extent))
+ return PTR_ERR(region_extent);
+
+ rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent,
+ ed_extent, GFP_KERNEL);
+ if (rc) {
+ free_region_extent(region_extent);
+ return rc;
+ }
+
+ /* device model handles freeing region_extent */
+ return online_region_extent(region_extent);
+}
+
+/* Callers are expected to ensure cxled has been attached to a region */
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_endpoint_decoder *cxled;
+ struct range ed_range, ext_range;
+ struct cxl_dax_region *cxlr_dax;
+ struct cxled_extent *ed_extent;
+ struct cxl_region *cxlr;
+ struct device *dev;
+
+ ext_range = (struct range) {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+
+ guard(rwsem_read)(&cxl_region_rwsem);
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+ if (!cxlr)
+ return -ENXIO;
+
+ cxlr_dax = cxled->cxld.region->cxlr_dax;
+ dev = &cxled->cxld.dev;
+ ed_range = (struct range) {
+ .start = cxled->dpa_res->start,
+ .end = cxled->dpa_res->end,
+ };
+
+ dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %pra\n",
+ cxled->dpa_res, &ext_range);
+
+ if (!range_contains(&ed_range, &ext_range)) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) is not fully in ED %pra\n",
+ &ext_range, extent->uuid, &ed_range);
+ return -ENXIO;
+ }
+
+ /*
+ * Allowing duplicates or extents which are already in an accepted
+ * range simplifies extent processing, especially when dealing with the
+ * cxl dax driver scanning for existing extents.
+ */
+ if (extents_contain(cxlr_dax, cxled, &ext_range)) {
+ dev_warn_ratelimited(dev, "Extent %pra exists; accept again\n",
+ &ext_range);
+ return 0;
+ }
+
+ if (extents_overlap(cxlr_dax, cxled, &ext_range))
+ return -ENXIO;
+
+ ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
+ if (!ed_extent)
+ return -ENOMEM;
+
+ ed_extent->cxled = cxled;
+ ed_extent->dpa_range = ext_range;
+ import_uuid(&ed_extent->uuid, extent->uuid);
+
+ dev_dbg(dev, "Add extent %pra (%pU)\n", &ed_extent->dpa_range, &ed_extent->uuid);
+
+ return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
+}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index b3dd119d166a..de01c6684530 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -930,6 +930,60 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, "CXL");
+static int cxl_validate_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ struct cxl_dev_state *cxlds = &mds->cxlds;
+ struct device *dev = mds->cxlds.dev;
+ u64 start, length;
+
+ start = le64_to_cpu(extent->start_dpa);
+ length = le64_to_cpu(extent->length);
+
+ struct range ext_range = (struct range){
+ .start = start,
+ .end = start + length - 1,
+ };
+
+ if (le16_to_cpu(extent->shared_extn_seq) != 0) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) can not be shared\n",
+ &ext_range, extent->uuid);
+ return -ENXIO;
+ }
+
+ if (!uuid_is_null((const uuid_t *)extent->uuid)) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU); tags not supported\n",
+ &ext_range, extent->uuid);
+ return -ENXIO;
+ }
+
+ /* Extents must be within the DC partition boundary */
+ for (int i = 0; i < cxlds->nr_partitions; i++) {
+ struct cxl_dpa_partition *part = &cxlds->part[i];
+
+ if (part->mode != CXL_PARTMODE_DYNAMIC_RAM_A)
+ continue;
+
+ struct range partition_range = (struct range) {
+ .start = part->res.start,
+ .end = part->res.end,
+ };
+
+ if (range_contains(&partition_range, &ext_range)) {
+ dev_dbg(dev, "DC extent DPA %pra (DCR:%pra)(%pU)\n",
+ &ext_range, &partition_range, extent->uuid);
+ return 0;
+ }
+ }
+
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) is not in a valid DC partition\n",
+ &ext_range, extent->uuid);
+ return -ENXIO;
+}
+
void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
enum cxl_event_log_type type,
enum cxl_event_type event_type,
@@ -1064,6 +1118,221 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
return rc;
}
+static int send_one_response(struct cxl_mailbox *cxl_mbox,
+ struct cxl_mbox_dc_response *response,
+ int opcode, u32 extent_list_size, u8 flags)
+{
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = opcode,
+ .size_in = struct_size(response, extent_list, extent_list_size),
+ .payload_in = response,
+ };
+
+ response->extent_list_size = cpu_to_le32(extent_list_size);
+ response->flags = flags;
+ return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+}
+
+static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
+ struct xarray *extent_array, int cnt)
+{
+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ struct cxl_mbox_dc_response *p;
+ struct cxl_extent *extent;
+ unsigned long index;
+ u32 pl_index;
+
+ size_t pl_size = struct_size(p, extent_list, cnt);
+ u32 max_extents = cnt;
+
+ /* May have to use the 'more' bit on the response. */
+ if (pl_size > cxl_mbox->payload_size) {
+ max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
+ sizeof(struct updated_extent_list);
+ pl_size = struct_size(p, extent_list, max_extents);
+ }
+
+ struct cxl_mbox_dc_response *response __free(kfree) =
+ kzalloc(pl_size, GFP_KERNEL);
+ if (!response)
+ return -ENOMEM;
+
+ if (cnt == 0)
+ return send_one_response(cxl_mbox, response, opcode, 0, 0);
+
+ pl_index = 0;
+ xa_for_each(extent_array, index, extent) {
+ response->extent_list[pl_index].dpa_start = extent->start_dpa;
+ response->extent_list[pl_index].length = extent->length;
+ pl_index++;
+
+ if (pl_index == max_extents) {
+ u8 flags = 0;
+ int rc;
+
+ if (pl_index < cnt)
+ flags |= CXL_DCD_EVENT_MORE;
+ rc = send_one_response(cxl_mbox, response, opcode,
+ pl_index, flags);
+ if (rc)
+ return rc;
+ cnt -= pl_index;
+ pl_index = 0;
+ }
+ }
+
+ if (!pl_index) /* nothing more to do */
+ return 0;
+ return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
+}
+
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct xarray extent_list;
+
+ struct cxl_extent extent = {
+ .start_dpa = cpu_to_le64(range->start),
+ .length = cpu_to_le64(range_len(range)),
+ };
+
+ dev_dbg(dev, "Release response dpa %pra\n", &range);
+
+ xa_init(&extent_list);
+ if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
+ dev_dbg(dev, "Failed to release %pra\n", &range);
+ goto destroy;
+ }
+
+ if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
+ dev_dbg(dev, "Failed to release %pra\n", &range);
+
+destroy:
+ xa_destroy(&extent_list);
+}
+
+static int validate_add_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ int rc;
+
+ rc = cxl_validate_extent(mds, extent);
+ if (rc)
+ return rc;
+
+ return cxl_add_extent(mds, extent);
+}
+
+static int cxl_add_pending(struct cxl_memdev_state *mds)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_extent *extent;
+ unsigned long cnt = 0;
+ unsigned long index;
+ int rc;
+
+ xa_for_each(&mds->pending_extents, index, extent) {
+ if (validate_add_extent(mds, extent)) {
+ /*
+ * Any extents which are to be rejected are omitted from
+ * the response. An empty response means all are
+ * rejected.
+ */
+ dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
+ le64_to_cpu(extent->start_dpa),
+ le64_to_cpu(extent->length));
+ xa_erase(&mds->pending_extents, index);
+ kfree(extent);
+ continue;
+ }
+ cnt++;
+ }
+ rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+ &mds->pending_extents, cnt);
+ xa_for_each(&mds->pending_extents, index, extent) {
+ xa_erase(&mds->pending_extents, index);
+ kfree(extent);
+ }
+ return rc;
+}
+
+static int handle_add_event(struct cxl_memdev_state *mds,
+ struct cxl_event_dcd *event)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_extent *extent;
+
+ extent = kmemdup(&event->extent, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ if (xa_insert(&mds->pending_extents, (unsigned long)extent, extent,
+ GFP_KERNEL)) {
+ kfree(extent);
+ return -ENOMEM;
+ }
+
+ if (event->flags & CXL_DCD_EVENT_MORE) {
+ dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+ return 0;
+ }
+
+ /* extents are removed and freed in cxl_add_pending() */
+ return cxl_add_pending(mds);
+}
+
+static char *cxl_dcd_evt_type_str(u8 type)
+{
+ switch (type) {
+ case DCD_ADD_CAPACITY:
+ return "add";
+ case DCD_RELEASE_CAPACITY:
+ return "release";
+ case DCD_FORCED_CAPACITY_RELEASE:
+ return "force release";
+ default:
+ break;
+ }
+
+ return "<unknown>";
+}
+
+static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+ struct cxl_event_record_raw *raw_rec)
+{
+ struct cxl_event_dcd *event = &raw_rec->event.dcd;
+ struct cxl_extent *extent = &event->extent;
+ struct device *dev = mds->cxlds.dev;
+ uuid_t *id = &raw_rec->id;
+ int rc;
+
+ if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
+ return;
+
+ dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
+ cxl_dcd_evt_type_str(event->event_type),
+ le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
+
+ switch (event->event_type) {
+ case DCD_ADD_CAPACITY:
+ rc = handle_add_event(mds, event);
+ break;
+ case DCD_RELEASE_CAPACITY:
+ rc = cxl_rm_extent(mds, &event->extent);
+ break;
+ case DCD_FORCED_CAPACITY_RELEASE:
+ dev_err_ratelimited(dev, "Forced release event ignored.\n");
+ rc = 0;
+ break;
+ default:
+ rc = -EINVAL;
+ break;
+ }
+
+ if (rc)
+ dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
+}
+
static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
enum cxl_event_log_type type)
{
@@ -1100,9 +1369,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
if (!nr_rec)
break;
- for (i = 0; i < nr_rec; i++)
+ for (i = 0; i < nr_rec; i++) {
__cxl_event_trace_record(cxlmd, type,
&payload->records[i]);
+ if (type == CXL_EVENT_TYPE_DCD)
+ cxl_handle_dcd_event_records(mds,
+ &payload->records[i]);
+ }
if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
trace_cxl_overflow(cxlmd, type, payload);
@@ -1134,6 +1407,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
{
dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
+ if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
+ cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
if (status & CXLDEV_EVENT_STATUS_FATAL)
cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
if (status & CXLDEV_EVENT_STATUS_FAIL)
@@ -1709,6 +1984,17 @@ int cxl_mailbox_init(struct cxl_mailbox *cxl_mbox, struct device *host)
}
EXPORT_SYMBOL_NS_GPL(cxl_mailbox_init, "CXL");
+static void clear_pending_extents(void *_mds)
+{
+ struct cxl_memdev_state *mds = _mds;
+ struct cxl_extent *extent;
+ unsigned long index;
+
+ xa_for_each(&mds->pending_extents, index, extent)
+ kfree(extent);
+ xa_destroy(&mds->pending_extents);
+}
+
struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
{
struct cxl_memdev_state *mds;
@@ -1726,6 +2012,10 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
mds->cxlds.cxl_mbox.host = dev;
mds->cxlds.reg_map.resource = CXL_RESOURCE_NONE;
mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
+ xa_init(&mds->pending_extents);
+ rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
+ if (rc)
+ return ERR_PTR(rc);
rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
if (rc == -EOPNOTSUPP)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 9c573e8d6ed7..3106df6f3636 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3048,6 +3048,7 @@ static void cxl_dax_region_release(struct device *dev)
{
struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ ida_destroy(&cxlr_dax->extent_ida);
kfree(cxlr_dax);
}
@@ -3097,6 +3098,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
dev = &cxlr_dax->dev;
cxlr_dax->cxlr = cxlr;
+ cxlr->cxlr_dax = cxlr_dax;
+ ida_init(&cxlr_dax->extent_ida);
device_initialize(dev);
lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
device_set_pm_not_required(dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 4bb0ff4d8f5f..d027432b1572 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -11,6 +11,8 @@
#include <linux/log2.h>
#include <linux/node.h>
#include <linux/io.h>
+#include <linux/xarray.h>
+#include <cxl/event.h>
extern const struct nvdimm_security_ops *cxl_security_ops;
@@ -169,11 +171,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
#define CXLDEV_EVENT_STATUS_WARN BIT(1)
#define CXLDEV_EVENT_STATUS_FAIL BIT(2)
#define CXLDEV_EVENT_STATUS_FATAL BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD BIT(4)
#define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
CXLDEV_EVENT_STATUS_WARN | \
CXLDEV_EVENT_STATUS_FAIL | \
- CXLDEV_EVENT_STATUS_FATAL)
+ CXLDEV_EVENT_STATUS_FATAL | \
+ CXLDEV_EVENT_STATUS_DCD)
/* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
#define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
@@ -381,6 +385,18 @@ enum cxl_decoder_state {
CXL_DECODER_STATE_AUTO,
};
+/**
+ * struct cxled_extent - Extent within an endpoint decoder
+ * @cxled: Reference to the endpoint decoder
+ * @dpa_range: DPA range this extent covers within the decoder
+ * @uuid: uuid from device for this extent
+ */
+struct cxled_extent {
+ struct cxl_endpoint_decoder *cxled;
+ struct range dpa_range;
+ uuid_t uuid;
+};
+
/**
* struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
* @cxld: base cxl_decoder_object
@@ -512,6 +528,7 @@ enum cxl_partition_mode {
* @type: Endpoint decoder target type
* @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
@@ -525,6 +542,7 @@ struct cxl_region {
enum cxl_decoder_type type;
struct cxl_nvdimm_bridge *cxl_nvb;
struct cxl_pmem_region *cxlr_pmem;
+ struct cxl_dax_region *cxlr_dax;
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
@@ -566,12 +584,45 @@ struct cxl_pmem_region {
struct cxl_pmem_region_mapping mapping[];
};
+/* See CXL 3.1 8.2.9.2.1.6 */
+enum dc_event {
+ DCD_ADD_CAPACITY,
+ DCD_RELEASE_CAPACITY,
+ DCD_FORCED_CAPACITY_RELEASE,
+ DCD_REGION_CONFIGURATION_UPDATED,
+};
+
struct cxl_dax_region {
struct device dev;
struct cxl_region *cxlr;
struct range hpa_range;
+ struct ida extent_ida;
};
+/**
+ * struct region_extent - CXL DAX region extent
+ * @dev: device representing this extent
+ * @cxlr_dax: back reference to parent region device
+ * @hpa_range: HPA range of this extent
+ * @uuid: uuid of the extent
+ * @decoder_extents: Endpoint decoder extents which make up this region extent
+ */
+struct region_extent {
+ struct device dev;
+ struct cxl_dax_region *cxlr_dax;
+ struct range hpa_range;
+ uuid_t uuid;
+ struct xarray decoder_extents;
+};
+
+bool is_region_extent(struct device *dev);
+static inline struct region_extent *to_region_extent(struct device *dev)
+{
+ if (!is_region_extent(dev))
+ return NULL;
+ return container_of(dev, struct region_extent, dev);
+}
+
/**
* struct cxl_port - logical collection of upstream port devices and
* downstream port devices to construct a CXL memory
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 34a606c5ead0..63a38e449454 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -7,6 +7,7 @@
#include <linux/cdev.h>
#include <linux/uuid.h>
#include <linux/node.h>
+#include <linux/xarray.h>
#include <cxl/event.h>
#include <cxl/mailbox.h>
#include "cxl.h"
@@ -487,6 +488,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @active_volatile_bytes: sum of hard + soft volatile
* @active_persistent_bytes: sum of hard + soft persistent
* @dcd_supported: all DCD commands are supported
+ * @pending_extents: array of extents pending during more bit processing
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -507,6 +509,7 @@ struct cxl_memdev_state {
u64 active_volatile_bytes;
u64 active_persistent_bytes;
bool dcd_supported;
+ struct xarray pending_extents;
struct cxl_event_state event;
struct cxl_poison_state poison;
@@ -582,6 +585,21 @@ enum cxl_opcode {
UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
0x40, 0x3d, 0x86)
+/*
+ * Add Dynamic Capacity Response
+ * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
+ */
+struct cxl_mbox_dc_response {
+ __le32 extent_list_size;
+ u8 flags;
+ u8 reserved[3];
+ struct updated_extent_list {
+ __le64 dpa_start;
+ __le64 length;
+ u8 reserved[8];
+ } __packed extent_list[];
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
@@ -644,6 +662,14 @@ struct cxl_mbox_identify {
UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
0x13, 0xb7, 0x74)
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
+ */
+#define CXL_EVENT_DC_EVENT_UUID \
+ UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
+ 0x10, 0x1a, 0x2a)
+
/*
* Get Event Records output payload
* CXL rev 3.0 section 8.2.9.2.2; Table 8-50
@@ -669,6 +695,7 @@ enum cxl_event_log_type {
CXL_EVENT_TYPE_WARN,
CXL_EVENT_TYPE_FAIL,
CXL_EVENT_TYPE_FATAL,
+ CXL_EVENT_TYPE_DCD,
CXL_EVENT_TYPE_MAX
};
diff --git a/include/cxl/event.h b/include/cxl/event.h
index f9ae1796da85..0c159eac4337 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -108,11 +108,42 @@ struct cxl_event_mem_module {
u8 reserved[0x2a];
} __packed;
+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+struct cxl_extent {
+ __le64 start_dpa;
+ __le64 length;
+ u8 uuid[UUID_SIZE];
+ __le16 shared_extn_seq;
+ u8 reserved[0x6];
+} __packed;
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ */
+#define CXL_DCD_EVENT_MORE BIT(0)
+struct cxl_event_dcd {
+ struct cxl_event_record_hdr hdr;
+ u8 event_type;
+ u8 validity_flags;
+ __le16 host_id;
+ u8 partition_index;
+ u8 flags;
+ u8 reserved1[0x2];
+ struct cxl_extent extent;
+ u8 reserved2[0x18];
+ __le32 num_avail_extents;
+ __le32 num_avail_tags;
+} __packed;
+
union cxl_event {
struct cxl_event_generic generic;
struct cxl_event_gen_media gen_media;
struct cxl_event_dram dram;
struct cxl_event_mem_module mem_module;
+ struct cxl_event_dcd dcd;
/* dram & gen_media event header */
struct cxl_event_media_hdr media_hdr;
} __packed;
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 387f3df8b988..916f2b30e2f3 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -64,7 +64,8 @@ cxl_core-y += $(CXL_CORE_SRC)/cdat.o
cxl_core-y += $(CXL_CORE_SRC)/ras.o
cxl_core-y += $(CXL_CORE_SRC)/acpi.o
cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
-cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
+cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
+ $(CXL_CORE_SRC)/extent.o
cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
cxl_core-y += config_check.o
--
2.49.0
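The DPA-to-HPA translation in calc_hpa_range() above boils down to two offset subtractions: the extent's offset into the decoder's DPA window, rebased onto the decoder's HPA window, then made relative to the region start. A standalone userspace sketch of that arithmetic (hypothetical names; not the kernel code, which works on cxled->dpa_res, cxled->cxld.hpa_range and cxlr_dax->hpa_range):

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t start, end; };

/*
 * Sketch of calc_hpa_range(): translate a device DPA range into an HPA
 * range relative to the start of the dax region. decoder_dpa_start,
 * decoder_hpa_start and region_hpa_start stand in for the decoder's DPA
 * window start, the decoder's HPA window start, and the region's HPA
 * start respectively.
 */
static struct range dpa_to_region_hpa(const struct range *dpa_range,
				      uint64_t decoder_dpa_start,
				      uint64_t decoder_hpa_start,
				      uint64_t region_hpa_start)
{
	uint64_t dpa_offset = dpa_range->start - decoder_dpa_start;
	uint64_t hpa = decoder_hpa_start + dpa_offset;
	uint64_t len = dpa_range->end - dpa_range->start + 1;

	return (struct range) {
		.start = hpa - region_hpa_start,
		.end = hpa - region_hpa_start + len - 1,
	};
}
```

For example, an extent at DPA 0x11000 in a decoder whose DPA window starts at 0x10000 lands at region-relative offset 0x1000 when the decoder's HPA window coincides with the region base.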
^ permalink raw reply related [flat|nested] 65+ messages in thread

* Re: [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
@ 2025-04-14 16:07 ` Jonathan Cameron
2025-04-14 22:10 ` Alison Schofield
` (3 subsequent siblings)
4 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 16:07 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, 13 Apr 2025 17:52:20 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed. Events may be sent from the device at any time.
>
> Three types of events can be signaled, Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered. If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory. If no region exists the release can be sent
> immediately. The host may also release extents (or partial extents) at
> any time. Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken. Purposely ignore force removal events.
>
> Regions are made up of one or more devices which may be surfacing memory
> to the host. Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent. Immediately surface a region extent upon getting a
> device extent.
>
> Per the specification the device is allowed to offer or remove extents
> at any time. However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
>
> Simplify extent tracking with the following restrictions.
>
> 1) Flag for removal any extent which overlaps a requested
> release range.
> 2) Refuse the offer of extents which overlap already accepted
> memory ranges.
> 3) Accept again a range which has already been accepted by the
> host. Eating duplicates serves three purposes.
> 3a) This simplifies the code if the device should get out of
> sync with the host. And it should be safe to acknowledge
> the extent again.
> 3b) This simplifies the code to process existing extents if
> the extent list should change while the extent list is
> being read.
> 3c) Duplicates for a given partition which are seen during a
> race between the hardware surfacing an extent and the cxl
> dax driver scanning for existing extents will be ignored.
>
> NOTE: Processing existing extents is done in a later patch.
>
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer. Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
>
> Tag support within the DAX layer is not yet supported. To maintain
> compatibility with legacy DAX/region processing only tags with a value
> of 0 are allowed. This defines existing DAX devices as having a 0 tag
> which makes the most logical sense as a default.
>
> Process DCD events and create region devices.
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Li Ming <ming.li@zohomail.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
I've forgotten what our policy is on spec references in new
code. Maybe update them to 3.2?
A few tiny little things inline from a fresh look.
Thanks,
Jonathan
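The accept/reject policy spelled out in the changelog (accept exact duplicates again, refuse partial overlaps, record anything new) condenses to a short loop. A userspace sketch with hypothetical names, not the driver code, which tracks extents per decoder via extents_contain()/extents_overlap():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct range { uint64_t start, end; };	/* inclusive bounds */

static bool range_contains(const struct range *outer, const struct range *inner)
{
	return outer->start <= inner->start && inner->end <= outer->end;
}

static bool range_overlaps(const struct range *a, const struct range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/*
 * Returns 0 for a duplicate of an already accepted extent (accept it
 * again), -1 for a partial overlap (reject), 1 for a new extent the
 * caller should record.
 */
static int offer_extent(const struct range *accepted, int n,
			const struct range *offer)
{
	for (int i = 0; i < n; i++) {
		if (range_contains(&accepted[i], offer))
			return 0;	/* duplicate: accept again */
		if (range_overlaps(&accepted[i], offer))
			return -1;	/* partial overlap: reject */
	}
	return 1;			/* new extent */
}
```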
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..6df277caf974
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> +static int cxlr_rm_extent(struct device *dev, void *data)
> +{
> + struct region_extent *region_extent = to_region_extent(dev);
> + struct range *region_hpa_range = data;
> +
> + if (!region_extent)
> + return 0;
> +
> + /*
> + * Any extent which 'touches' the released range is removed.
> + */
Single line comment syntax.
> + if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> + dev_dbg(dev, "Remove region extent HPA %pra\n",
> + &region_extent->hpa_range);
> + region_rm_extent(region_extent);
> + }
> + return 0;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index b3dd119d166a..de01c6684530 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -930,6 +930,60 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, "CXL");
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent)
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + struct device *dev = mds->cxlds.dev;
> + u64 start, length;
> +
> + start = le64_to_cpu(extent->start_dpa);
> + length = le64_to_cpu(extent->length);
Set these at declaration..
> +
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
With the above set at declaration this is then not a mid-code
declaration, which is still generally looked at in a funny way in the kernel!
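Something along these lines would keep the initialization at the declaration (untested sketch; le64_to_cpu stubbed for a little-endian userspace build):

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t start, end; };

/* stand-in for le64_to_cpu() on a little-endian host */
static uint64_t le64_to_cpu_stub(uint64_t v) { return v; }

/*
 * Setting start/length at declaration lets ext_range also be built at
 * declaration, avoiding the mid-function declaration the review points
 * at.
 */
static struct range extent_to_range(uint64_t start_dpa_le, uint64_t length_le)
{
	uint64_t start = le64_to_cpu_stub(start_dpa_le);
	uint64_t length = le64_to_cpu_stub(length_le);
	struct range ext_range = {
		.start = start,
		.end = start + length - 1,
	};

	return ext_range;
}
```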
> +
> + if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra (%pU) can not be shared\n",
> + &ext_range, extent->uuid);
> + return -ENXIO;
> + }
> +
> + if (!uuid_is_null((const uuid_t *)extent->uuid)) {
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra (%pU); tags not supported\n",
> + &ext_range, extent->uuid);
> + return -ENXIO;
> + }
> +
> + /* Extents must be within the DC partition boundary */
> + for (int i = 0; i < cxlds->nr_partitions; i++) {
> + struct cxl_dpa_partition *part = &cxlds->part[i];
> +
> + if (part->mode != CXL_PARTMODE_DYNAMIC_RAM_A)
> + continue;
> +
> + struct range partition_range = (struct range) {
Maybe move the declaration up and just assign it here.
> + .start = part->res.start,
> + .end = part->res.end,
> + };
> +
> + if (range_contains(&partition_range, &ext_range)) {
> + dev_dbg(dev, "DC extent DPA %pra (DCR:%pra)(%pU)\n",
> + &ext_range, &partition_range, extent->uuid);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra (%pU) is not in a valid DC partition\n",
> + &ext_range, extent->uuid);
> + return -ENXIO;
> +}
> +/**
> + * struct cxled_extent - Extent within an endpoint decoder
> + * @cxled: Reference to the endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @uuid: uuid from device for this extent
> + */
> +struct cxled_extent {
> + struct cxl_endpoint_decoder *cxled;
> + struct range dpa_range;
> + uuid_t uuid;
> +};
> +/* See CXL 3.1 8.2.9.2.1.6 */
> +enum dc_event {
> + DCD_ADD_CAPACITY,
> + DCD_RELEASE_CAPACITY,
> + DCD_FORCED_CAPACITY_RELEASE,
> + DCD_REGION_CONFIGURATION_UPDATED,
Perhaps a comment here that the other values don't apply to the
normal mailbox interface (they are FM only).
Might avoid confusion.
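Assuming the remaining encodings in the spec table really are FM-API only, the suggested comment could look like this (sketch, not the posted patch):

```c
#include <assert.h>

/* See CXL 3.1 8.2.9.2.1.6 */
enum dc_event {
	DCD_ADD_CAPACITY,
	DCD_RELEASE_CAPACITY,
	DCD_FORCED_CAPACITY_RELEASE,
	DCD_REGION_CONFIGURATION_UPDATED,
	/*
	 * Further encodings are FM-API only and never arrive via the
	 * device event log, so they are not enumerated here.
	 */
};
```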
> +};
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 34a606c5ead0..63a38e449454 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[];
A counted_by marking is always nice to have, and here it's extent_list_size I think
(which has an odd name given it is a count, not a size... *dramatic sigh*)
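A userspace-compilable sketch of that suggestion, with the kernel types and attributes stubbed out (whether __counted_by interacts cleanly with a __le32 count on big-endian builds is worth double checking):

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint32_t __le32;
typedef uint64_t __le64;
#define __packed __attribute__((packed))
#ifndef __counted_by
#define __counted_by(member)	/* no-op outside the kernel */
#endif

struct cxl_mbox_dc_response {
	__le32 extent_list_size;	/* despite the name, a count of entries */
	u8 flags;
	u8 reserved[3];
	struct updated_extent_list {
		__le64 dpa_start;
		__le64 length;
		u8 reserved[8];
	} __packed extent_list[] __counted_by(extent_list_size);
} __packed;
```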
> +} __packed;
* Re: [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
2025-04-14 16:07 ` Jonathan Cameron
@ 2025-04-14 22:10 ` Alison Schofield
2025-05-12 17:47 ` Fan Ni
` (2 subsequent siblings)
4 siblings, 0 replies; 65+ messages in thread
From: Alison Schofield @ 2025-04-14 22:10 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, Apr 13, 2025 at 05:52:20PM -0500, Ira Weiny wrote:
snip
> +
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct xarray extent_list;
> +
> + struct cxl_extent extent = {
> + .start_dpa = cpu_to_le64(range->start),
> + .length = cpu_to_le64(range_len(range)),
> + };
> +
> + dev_dbg(dev, "Release response dpa %pra\n", &range);
> +
> + xa_init(&extent_list);
> + if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
> + dev_dbg(dev, "Failed to release %pra\n", &range);
> + goto destroy;
> + }
> +
> + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> + dev_dbg(dev, "Failed to release %pra\n", &range);
> +
smatch complains about the above 3 dev_dbg() messages:
memdev_release_extent() error: '%pr' expects argument of type struct range *, but argument 4 has type 'struct range**'
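The mismatch is a pointer-level slip: range is already a struct range *, so &range hands the formatter a struct range **. A minimal userspace illustration, with a hypothetical format_range() standing in for the kernel's %pra specifier:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct range { uint64_t start, end; };

/* stand-in for the kernel's %pra printk specifier */
static int format_range(char *buf, size_t len, const struct range *r)
{
	return snprintf(buf, len, "[0x%llx-0x%llx]",
			(unsigned long long)r->start,
			(unsigned long long)r->end);
}

/* mirrors memdev_release_extent(..., struct range *range) */
static void release_dbg(char *buf, size_t len, struct range *range)
{
	/* correct: pass the pointer through unchanged */
	format_range(buf, len, range);
	/*
	 * The flagged form would be format_range(buf, len, &range):
	 * '&range' is struct range **, one level of indirection too
	 * many, which is what smatch reports for the %pra uses above.
	 */
}
```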
snip
* Re: [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
2025-04-14 16:07 ` Jonathan Cameron
2025-04-14 22:10 ` Alison Schofield
@ 2025-05-12 17:47 ` Fan Ni
2026-02-02 20:00 ` Davidlohr Bueso
2026-02-24 1:24 ` Anisa Su
4 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-05-12 17:47 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, Apr 13, 2025 at 05:52:20PM -0500, Ira Weiny wrote:
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed. Events may be sent from the device at any time.
>
...
> Tag support within the DAX layer is not yet supported. To maintain
> compatibility with legacy DAX/region processing only tags with a value
> of 0 are allowed. This defines existing DAX devices as having a 0 tag
> which makes the most logical sense as a default.
>
> Process DCD events and create region devices.
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Li Ming <ming.li@zohomail.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
Hi Ira,
I have some comments inline.
There is one that will need to fix if I understand the code correctly.
> ---
> Changes:
> [iweiny: rebase]
> [djbw: s/region/partition/]
> [iweiny: Adapt to new partition arch]
> [iweiny: s/tag/uuid/ throughout the code]
> ---
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/core.h | 13 ++
> drivers/cxl/core/extent.c | 366 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 292 +++++++++++++++++++++++++++++++++++-
> drivers/cxl/core/region.c | 3 +
> drivers/cxl/cxl.h | 53 ++++++-
> drivers/cxl/cxlmem.h | 27 ++++
> include/cxl/event.h | 31 ++++
> tools/testing/cxl/Kbuild | 3 +-
> 9 files changed, 786 insertions(+), 4 deletions(-)
...
> +static int send_one_response(struct cxl_mailbox *cxl_mbox,
I feel like the name is not that informative; maybe
send_one_dc_response?
> + struct cxl_mbox_dc_response *response,
> + int opcode, u32 extent_list_size, u8 flags)
> +{
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = struct_size(response, extent_list, extent_list_size),
> + .payload_in = response,
> + };
> +
> + response->extent_list_size = cpu_to_le32(extent_list_size);
> + response->flags = flags;
> + return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> +}
> +
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> + struct xarray *extent_array, int cnt)
> +{
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct cxl_mbox_dc_response *p;
> + struct cxl_extent *extent;
> + unsigned long index;
> + u32 pl_index;
> +
> + size_t pl_size = struct_size(p, extent_list, cnt);
> + u32 max_extents = cnt;
> +
> + /* May have to use more bit on response. */
> + if (pl_size > cxl_mbox->payload_size) {
> + max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> + sizeof(struct updated_extent_list);
> + pl_size = struct_size(p, extent_list, max_extents);
> + }
> +
> + struct cxl_mbox_dc_response *response __free(kfree) =
> + kzalloc(pl_size, GFP_KERNEL);
> + if (!response)
> + return -ENOMEM;
> +
> + if (cnt == 0)
> + return send_one_response(cxl_mbox, response, opcode, 0, 0);
> +
> + pl_index = 0;
> + xa_for_each(extent_array, index, extent) {
> + response->extent_list[pl_index].dpa_start = extent->start_dpa;
> + response->extent_list[pl_index].length = extent->length;
> + pl_index++;
> +
> + if (pl_index == max_extents) {
> + u8 flags = 0;
> + int rc;
> +
> + if (pl_index < cnt)
> + flags |= CXL_DCD_EVENT_MORE;
> + rc = send_one_response(cxl_mbox, response, opcode,
> + pl_index, flags);
> + if (rc)
> + return rc;
> + cnt -= pl_index;
> + pl_index = 0;
The logic here seems incorrect.
Let's say cnt = 8, and max_extents = 5.
For the first 5 extents it works fine. But after the first 5 extents
are processed (response sent), cnt becomes 8 - 5 = 3 while max_extents
is still 5, so there is no chance to send a response for the last 3
extents.
I think we need to update max_extents based on "cnt" after each
iteration.
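One way to sanity-check this accounting outside the kernel is a stand-alone trace of the loop, with the mailbox send replaced by counters. simulate() below is a hypothetical userspace harness, not kernel code; it only mirrors the pl_index/cnt arithmetic quoted above:

```c
#define CXL_DCD_EVENT_MORE 0x1

/*
 * Stand-alone trace of the batching arithmetic in cxl_send_dc_response():
 * walk `total` extents, flushing a response every `max_extents` extents,
 * plus a trailing flush for any remainder.  Returns how many extents end
 * up in some response; *responses and *first_flags report how many
 * responses were sent and the flags carried by the first one.
 */
static int simulate(int total, int max_extents, int *responses,
		    int *first_flags)
{
	int cnt = total, pl_index = 0, sent = 0;

	*responses = 0;
	*first_flags = 0;

	for (int i = 0; i < total; i++) {
		pl_index++;
		if (pl_index == max_extents) {
			int flags = 0;

			if (pl_index < cnt)
				flags |= CXL_DCD_EVENT_MORE;
			if (*responses == 0)
				*first_flags = flags;
			(*responses)++;
			sent += pl_index;
			cnt -= pl_index;
			pl_index = 0;
		}
	}
	if (pl_index) {		/* the trailing send_one_response() */
		(*responses)++;
		sent += pl_index;
	}
	return sent;
}
```

Feeding the cnt = 8, max_extents = 5 case through this harness shows how many responses the loop structure emits and with what flags, which makes it easier to judge whether the trailing send after the loop covers the remainder.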
> + }
> + }
> +
> + if (!pl_index) /* nothing more to do */
> + return 0;
...
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
As Jonathan mentioned, "size" may not be a good name.
Maybe "nr_extents"?
Fan
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[];
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
> @@ -644,6 +662,14 @@ struct cxl_mbox_identify {
> UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
> 0x13, 0xb7, 0x74)
>
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
> + */
> +#define CXL_EVENT_DC_EVENT_UUID \
> + UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
> + 0x10, 0x1a, 0x2a)
> +
> /*
> * Get Event Records output payload
> * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> @@ -669,6 +695,7 @@ enum cxl_event_log_type {
> CXL_EVENT_TYPE_WARN,
> CXL_EVENT_TYPE_FAIL,
> CXL_EVENT_TYPE_FATAL,
> + CXL_EVENT_TYPE_DCD,
> CXL_EVENT_TYPE_MAX
> };
>
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index f9ae1796da85..0c159eac4337 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -108,11 +108,42 @@ struct cxl_event_mem_module {
> u8 reserved[0x2a];
> } __packed;
>
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +struct cxl_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 uuid[UUID_SIZE];
> + __le16 shared_extn_seq;
> + u8 reserved[0x6];
> +} __packed;
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +#define CXL_DCD_EVENT_MORE BIT(0)
> +struct cxl_event_dcd {
> + struct cxl_event_record_hdr hdr;
> + u8 event_type;
> + u8 validity_flags;
> + __le16 host_id;
> + u8 partition_index;
> + u8 flags;
> + u8 reserved1[0x2];
> + struct cxl_extent extent;
> + u8 reserved2[0x18];
> + __le32 num_avail_extents;
> + __le32 num_avail_tags;
> +} __packed;
> +
> union cxl_event {
> struct cxl_event_generic generic;
> struct cxl_event_gen_media gen_media;
> struct cxl_event_dram dram;
> struct cxl_event_mem_module mem_module;
> + struct cxl_event_dcd dcd;
> /* dram & gen_media event header */
> struct cxl_event_media_hdr media_hdr;
> } __packed;
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 387f3df8b988..916f2b30e2f3 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -64,7 +64,8 @@ cxl_core-y += $(CXL_CORE_SRC)/cdat.o
> cxl_core-y += $(CXL_CORE_SRC)/ras.o
> cxl_core-y += $(CXL_CORE_SRC)/acpi.o
> cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
> + $(CXL_CORE_SRC)/extent.o
> cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
> cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
> cxl_core-y += config_check.o
>
> --
> 2.49.0
>
--
Fan Ni (From gmail)
* Re: [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
` (2 preceding siblings ...)
2025-05-12 17:47 ` Fan Ni
@ 2026-02-02 20:00 ` Davidlohr Bueso
2026-02-24 1:24 ` Anisa Su
4 siblings, 0 replies; 65+ messages in thread
From: Davidlohr Bueso @ 2026-02-02 20:00 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, 13 Apr 2025, Ira Weiny wrote:
>+static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
>+ struct cxl_endpoint_decoder *cxled,
>+ struct cxled_extent *ed_extent)
>+{
>+ struct region_extent *region_extent;
>+ struct range hpa_range;
>+ int rc;
>+
>+ calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
>+
>+ region_extent = alloc_region_extent(cxlr_dax, &hpa_range, &ed_extent->uuid);
>+ if (IS_ERR(region_extent))
>+ return PTR_ERR(region_extent);
>+
afaict the ed_extent can leak in this error path
>+ rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent,
>+ ed_extent, GFP_KERNEL);
>+ if (rc) {
>+ free_region_extent(region_extent);
>+ return rc;
>+ }
.. and this one (not in the xarray).
>+
>+ /* device model handles freeing region_extent */
>+ return online_region_extent(region_extent);
>+}
...
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
>@@ -1100,9 +1369,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> if (!nr_rec)
> break;
>
>- for (i = 0; i < nr_rec; i++)
>+ for (i = 0; i < nr_rec; i++) {
> __cxl_event_trace_record(cxlmd, type,
> &payload->records[i]);
>+ if (type == CXL_EVENT_TYPE_DCD)
>+ cxl_handle_dcd_event_records(mds,
>+ &payload->records[i]);
>+ }
>
> if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
> trace_cxl_overflow(cxlmd, type, payload);
With DCD the extent list needs to be resynced in the overflow case;
cxl_process_extent_list() needs to be called.
* Re: [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
` (3 preceding siblings ...)
2026-02-02 20:00 ` Davidlohr Bueso
@ 2026-02-24 1:24 ` Anisa Su
2026-03-05 22:00 ` Ira Weiny
4 siblings, 1 reply; 65+ messages in thread
From: Anisa Su @ 2026-02-24 1:24 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Sun, Apr 13, 2025 at 05:52:20PM -0500, Ira Weiny wrote:
A few notes while going through and removing sparse dax semantics and plumbing
for fs-dax mode:
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed. Events may be sent from the device at any time.
>
> Three types of events can be signaled, Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered. If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory. If no region exists the release can be sent
> immediately. The host may also release extents (or partial extents) at
> any time.
Partial release is no longer valid for tagged release, IIRC from the calls.
> Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
[snip]
> +
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> + struct xarray *extent_array, int cnt)
> +{
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct cxl_mbox_dc_response *p;
> + struct cxl_extent *extent;
> + unsigned long index;
> + u32 pl_index;
> +
> + size_t pl_size = struct_size(p, extent_list, cnt);
> + u32 max_extents = cnt;
> +
> + /* May have to use more bit on response. */
> + if (pl_size > cxl_mbox->payload_size) {
> + max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> + sizeof(struct updated_extent_list);
> + pl_size = struct_size(p, extent_list, max_extents);
> + }
> +
> + struct cxl_mbox_dc_response *response __free(kfree) =
> + kzalloc(pl_size, GFP_KERNEL);
> + if (!response)
> + return -ENOMEM;
> +
> + if (cnt == 0)
> + return send_one_response(cxl_mbox, response, opcode, 0, 0);
> +
> + pl_index = 0;
I was wondering why an xarray is used here instead of a list? I didn't
see anywhere that we need to look up a specific index to benefit from
the O(log n) complexity; AFAICT it is simply used to iterate over all
elements.
> + xa_for_each(extent_array, index, extent) {
> + response->extent_list[pl_index].dpa_start = extent->start_dpa;
> + response->extent_list[pl_index].length = extent->length;
> + pl_index++;
> +
> + if (pl_index == max_extents) {
> + u8 flags = 0;
> + int rc;
> +
> + if (pl_index < cnt)
> + flags |= CXL_DCD_EVENT_MORE;
> + rc = send_one_response(cxl_mbox, response, opcode,
> + pl_index, flags);
> + if (rc)
> + return rc;
> + cnt -= pl_index;
> + pl_index = 0;
> + }
> + }
> +
> + if (!pl_index) /* nothing more to do */
> + return 0;
> + return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
> +}
> +
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct xarray extent_list;
> +
> + struct cxl_extent extent = {
> + .start_dpa = cpu_to_le64(range->start),
> + .length = cpu_to_le64(range_len(range)),
> + };
> +
> + dev_dbg(dev, "Release response dpa %pra\n", &range);
> +
> + xa_init(&extent_list);
> + if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
> + dev_dbg(dev, "Failed to release %pra\n", &range);
> + goto destroy;
> + }
> +
> + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> + dev_dbg(dev, "Failed to release %pra\n", &range);
> +
> +destroy:
> + xa_destroy(&extent_list);
> +}
> +
> +static int validate_add_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent)
> +{
> + int rc;
> +
> + rc = cxl_validate_extent(mds, extent);
> + if (rc)
> + return rc;
> +
> + return cxl_add_extent(mds, extent);
> +}
> +
> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_extent *extent;
> + unsigned long cnt = 0;
> + unsigned long index;
> + int rc;
> +
Also according to the spec:
"In response to an Add Capacity Event Record, or multiple Add Capacity Event
records grouped via the More flag (see Table 8-229), the host is expected to
respond with exactly one Add Dynamic Capacity Response acknowledgment,
corresponding to the order of the Add Capacity Events received. If the order
does not match, the device shall return Invalid Input. The Add Dynamic Capacity
Response acknowledgment must be sent in the same order as the Add Capacity
Event Records."
Using an xarray does not preserve the order of the extents; this
requires a FIFO queue.
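For illustration, a minimal userspace sketch of the arrival-ordered FIFO being suggested here (in the kernel this would more naturally be a list_head based list); `struct extent` is a hypothetical reduced stand-in for `struct cxl_extent`, keeping only the fields the response needs:

```c
#include <stddef.h>

/* Hypothetical extent record, reduced to the fields the response needs. */
struct extent {
	unsigned long long dpa;
	unsigned long long len;
	struct extent *next;
};

/* Singly-linked FIFO: push at tail, pop from head, arrival order kept. */
struct fifo {
	struct extent *head;
	struct extent **tail;
};

static void fifo_init(struct fifo *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

static void fifo_push(struct fifo *q, struct extent *e)
{
	e->next = NULL;
	*q->tail = e;
	q->tail = &e->next;
}

static struct extent *fifo_pop(struct fifo *q)
{
	struct extent *e = q->head;

	if (e) {
		q->head = e->next;
		if (!q->head)
			q->tail = &q->head;
	}
	return e;
}
```

Because responses are built by draining the queue in pop order, the acknowledgment order trivially matches the Add Capacity Event Record order, which is the spec requirement quoted above.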
> + xa_for_each(&mds->pending_extents, index, extent) {
> + if (validate_add_extent(mds, extent)) {
> + /*
> + * Any extents which are to be rejected are omitted from
> + * the response. An empty response means all are
> + * rejected.
> + */
> + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> + le64_to_cpu(extent->start_dpa),
> + le64_to_cpu(extent->length));
> + xa_erase(&mds->pending_extents, index);
> + kfree(extent);
> + continue;
> + }
> + cnt++;
> + }
> + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> + &mds->pending_extents, cnt);
> + xa_for_each(&mds->pending_extents, index, extent) {
> + xa_erase(&mds->pending_extents, index);
> + kfree(extent);
> + }
> + return rc;
> +}
> +
[snip]
Thanks,
Anisa
* Re: [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents
2026-02-24 1:24 ` Anisa Su
@ 2026-03-05 22:00 ` Ira Weiny
0 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2026-03-05 22:00 UTC (permalink / raw)
To: Anisa Su, Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
Anisa Su wrote:
> On Sun, Apr 13, 2025 at 05:52:20PM -0500, Ira Weiny wrote:
> A few notes while going through and removing sparse dax semantics and plumbing
> for fs-dax mode:
> > A dynamic capacity device (DCD) sends events to signal the host for
> > changes in the availability of Dynamic Capacity (DC) memory. These
> > events contain extents describing a DPA range and meta data for memory
> > to be added or removed. Events may be sent from the device at any time.
> >
> > Three types of events can be signaled, Add, Release, and Force Release.
> >
> > On add, the host may accept or reject the memory being offered. If no
> > region exists, or the extent is invalid, the extent should be rejected.
> > Add extent events may be grouped by a 'more' bit which indicates those
> > extents should be processed as a group.
> >
> > On remove, the host can delay the response until the host is safely not
> > using the memory. If no region exists the release can be sent
> > immediately. The host may also release extents (or partial extents) at
> > any time.
> Partial release is no longer valid for tagged release, IIRC from the calls.
Tags were not supported in this version:
if (!uuid_is_null((const uuid_t *)extent->uuid)) {
dev_err_ratelimited(dev,
"DC extent DPA %pra (%pU); tags not supported\n",
&ext_range, extent->uuid);
return -ENXIO;
}
>
> > Thus the 'more' bit grouping of release events is of less
> > value and can be ignored in favor of sending multiple release capacity
> > responses for groups of release events.
> >
> [snip]
> > +
> > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> > + struct xarray *extent_array, int cnt)
> > +{
> > + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> > + struct cxl_mbox_dc_response *p;
> > + struct cxl_extent *extent;
> > + unsigned long index;
> > + u32 pl_index;
> > +
> > + size_t pl_size = struct_size(p, extent_list, cnt);
> > + u32 max_extents = cnt;
> > +
> > + /* May have to use more bit on response. */
> > + if (pl_size > cxl_mbox->payload_size) {
> > + max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> > + sizeof(struct updated_extent_list);
> > + pl_size = struct_size(p, extent_list, max_extents);
> > + }
> > +
> > + struct cxl_mbox_dc_response *response __free(kfree) =
> > + kzalloc(pl_size, GFP_KERNEL);
> > + if (!response)
> > + return -ENOMEM;
> > +
> > + if (cnt == 0)
> > + return send_one_response(cxl_mbox, response, opcode, 0, 0);
> > +
> > + pl_index = 0;
> I was wondering why an xarray is used here instead of a list? I didn't
> see anywhere that we need to look up a specific index to benefit from
> the O(log n) complexity; AFAICT it is simply used to iterate over all
> elements.
xarray was just easier than a list.
>
> > + xa_for_each(extent_array, index, extent) {
> > + response->extent_list[pl_index].dpa_start = extent->start_dpa;
> > + response->extent_list[pl_index].length = extent->length;
> > + pl_index++;
> > +
> > + if (pl_index == max_extents) {
> > + u8 flags = 0;
> > + int rc;
> > +
> > + if (pl_index < cnt)
> > + flags |= CXL_DCD_EVENT_MORE;
> > + rc = send_one_response(cxl_mbox, response, opcode,
> > + pl_index, flags);
> > + if (rc)
> > + return rc;
> > + cnt -= pl_index;
> > + pl_index = 0;
> > + }
> > + }
> > +
> > + if (!pl_index) /* nothing more to do */
> > + return 0;
> > + return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
> > +}
> > +
[snip]
> > +static int validate_add_extent(struct cxl_memdev_state *mds,
> > + struct cxl_extent *extent)
> > +{
> > + int rc;
> > +
> > + rc = cxl_validate_extent(mds, extent);
> > + if (rc)
> > + return rc;
> > +
> > + return cxl_add_extent(mds, extent);
> > +}
> > +
> > +static int cxl_add_pending(struct cxl_memdev_state *mds)
> > +{
> > + struct device *dev = mds->cxlds.dev;
> > + struct cxl_extent *extent;
> > + unsigned long cnt = 0;
> > + unsigned long index;
> > + int rc;
> > +
> Also according to the spec:
> "In response to an Add Capacity Event Record, or multiple Add Capacity Event
> records grouped via the More flag (see Table 8-229), the host is expected to
> respond with exactly one Add Dynamic Capacity Response acknowledgment,
> corresponding to the order of the Add Capacity Events received. If the order
> does not match, the device shall return Invalid Input. The Add Dynamic Capacity
> Response acknowledgment must be sent in the same order as the Add Capacity
> Event Records."
hmmm... yea that might be wrong, I don't recall.
>
> Using an xarray does not preserve the order of the extents; this
> requires a FIFO queue.
It could if the index was the order.
But in the end I'm not opposed to using a list.
Ira
[snip]
* [PATCH v9 13/19] cxl/region/extent: Expose region extent information in sysfs
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (11 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 12/19] cxl/extent: Process dynamic partition events and realize region extents Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 14/19] dax/bus: Factor out dev dax resize logic Ira Weiny
` (9 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Extent information can be helpful to the user to coordinate memory usage
with the external orchestrator and FM.
Expose the details of region extents by creating the following
sysfs entries.
/sys/bus/cxl/devices/dax_regionX/extentX.Y
/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
/sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebase]
[iweiny: s/tag/uuid/ throughout the code]
[iweiny: update sysfs docs to 2025]
---
Documentation/ABI/testing/sysfs-bus-cxl | 36 ++++++++++++++++++++
drivers/cxl/core/extent.c | 58 +++++++++++++++++++++++++++++++++
2 files changed, 94 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 2e26d95ac66f..6e9d60baf546 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -639,3 +639,39 @@ Description:
The count is persistent across power loss and wraps back to 0
upon overflow. If this file is not present, the device does not
have the necessary support for dirty tracking.
+
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Users can use the
+ extent information to create DAX devices on specific extents.
+ This is done by creating and destroying DAX devices in specific
+ sequences and looking at the mappings created. Extent offset
+ within the region.
+
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Users can use the
+ extent information to create DAX devices on specific extents.
+ This is done by creating and destroying DAX devices in specific
+ sequences and looking at the mappings created. Extent length
+ within the region.
+
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Users can use the
+ extent information to create DAX devices on specific extents.
+ This is done by creating and destroying DAX devices in specific
+ sequences and looking at the mappings created. UUID of this
+ extent.
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 6df277caf974..3fb20cd7afc8 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -6,6 +6,63 @@
#include "core.h"
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ return sysfs_emit(buf, "%#llx\n", region_extent->hpa_range.start);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ u64 length = range_len(&region_extent->hpa_range);
+
+ return sysfs_emit(buf, "%#llx\n", length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ return sysfs_emit(buf, "%pUb\n", &region_extent->uuid);
+}
+static DEVICE_ATTR_RO(uuid);
+
+static struct attribute *region_extent_attrs[] = {
+ &dev_attr_offset.attr,
+ &dev_attr_length.attr,
+ &dev_attr_uuid.attr,
+ NULL
+};
+
+static uuid_t empty_uuid = { 0 };
+
+static umode_t region_extent_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ if (a == &dev_attr_uuid.attr &&
+ uuid_equal(&region_extent->uuid, &empty_uuid))
+ return 0;
+
+ return a->mode;
+}
+
+static const struct attribute_group region_extent_attribute_group = {
+ .attrs = region_extent_attrs,
+ .is_visible = region_extent_visible,
+};
+
+__ATTRIBUTE_GROUPS(region_extent_attribute);
+
static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
struct cxled_extent *ed_extent)
{
@@ -44,6 +101,7 @@ static void region_extent_release(struct device *dev)
static const struct device_type region_extent_type = {
.name = "extent",
.release = region_extent_release,
+ .groups = region_extent_attribute_groups,
};
bool is_region_extent(struct device *dev)
--
2.49.0
* [PATCH v9 14/19] dax/bus: Factor out dev dax resize logic
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (12 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 13/19] cxl/region/extent: Expose region extent information in sysfs Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 15/19] dax/region: Create resources on sparse DAX regions Ira Weiny
` (8 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Dynamic Capacity regions must limit dev dax resources to those areas
which have extents backing real memory. Such DAX regions are dubbed
'sparse' regions. In order to manage where memory is available four
alternatives were considered:
1) Create a single region resource child on region creation which
reserves the entire region. Then as extents are added punch holes in
this reservation. This requires new resource manipulation to punch
the holes and still requires an additional iteration over the extent
areas which may already have existing dev dax resources used.
2) Maintain an ordered xarray of extents which can be queried while
processing the resize logic. The issue is that existing region->res
children may artificially limit the allocation size sent to
alloc_dev_dax_range(). I.e., the resource children can't be directly
used in the resize logic to find where space in the region is. This
also poses a problem of managing the available size in 2 places.
3) Maintain a separate resource tree with extents. This option is the
same as 2) but with the different data structure. Most ideally there
should be a unified representation of the resource tree not two places
to look for space.
4) Create region resource children for each extent. Manage the dax dev
resize logic in the same way as before but use a region child
(extent) resource as the parents to find space within each extent.
Option 4 can leverage the existing resize algorithm to find space within
the extents. It manages the available space in a singular resource tree
which is less complicated for finding space.
In preparation for this change, factor out the dev_dax_resize logic.
For static regions use dax_region->res as the parent to find space for
the dax ranges. Future patches will use the same algorithm with
individual extent resources as the parent.
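As a rough illustration of the first-fit scan that this refactor factors out, here is a userspace sketch over inclusive [start, end] ranges. first_gap_alloc() is a hypothetical stand-in for the resize scan; it ignores the adjust_ok()/range-extension case and assumes the children are sorted and non-overlapping:

```c
#include <stddef.h>

/* Simplified stand-in for struct resource: an inclusive [start, end] range. */
struct res {
	unsigned long long start;
	unsigned long long end;
};

/*
 * First-fit gap search: scan the parent's children (sorted,
 * non-overlapping) and return how much of to_alloc fits in the first
 * gap found, writing the gap start to *at.  Returns 0 if no gap exists.
 */
static unsigned long long first_gap_alloc(const struct res *parent,
					  const struct res *child, size_t n,
					  unsigned long long to_alloc,
					  unsigned long long *at)
{
	unsigned long long gap;

	if (n == 0) {			/* empty parent: allocate at start */
		*at = parent->start;
		gap = parent->end - parent->start + 1;
		return to_alloc < gap ? to_alloc : gap;
	}
	if (child[0].start > parent->start) { /* space before first child */
		*at = parent->start;
		gap = child[0].start - parent->start;
		return to_alloc < gap ? to_alloc : gap;
	}
	for (size_t i = 0; i < n; i++) {
		unsigned long long next_start =
			(i + 1 < n) ? child[i + 1].start : parent->end + 1;

		if (next_start > child[i].end + 1) { /* gap after child i */
			*at = child[i].end + 1;
			gap = next_start - (child[i].end + 1);
			return to_alloc < gap ? to_alloc : gap;
		}
	}
	return 0;			/* no space left in the parent */
}
```

With extents as region resource children (option 4), the same scan works unchanged whether the parent is the whole static region or an individual extent resource, which is the point of passing the parent in explicitly.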
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/dax/bus.c | 130 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 80 insertions(+), 50 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index d8cb5195a227..c25942a3d125 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -844,11 +844,9 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
return 0;
}
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
- resource_size_t size)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+ u64 start, resource_size_t size)
{
- struct dax_region *dax_region = dev_dax->region;
- struct resource *res = &dax_region->res;
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
unsigned long pgoff = 0;
@@ -866,14 +864,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
return 0;
}
- alloc = __request_region(res, start, size, dev_name(dev), 0);
+ alloc = __request_region(parent, start, size, dev_name(dev), 0);
if (!alloc)
return -ENOMEM;
ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
* (dev_dax->nr_range + 1), GFP_KERNEL);
if (!ranges) {
- __release_region(res, alloc->start, resource_size(alloc));
+ __release_region(parent, alloc->start, resource_size(alloc));
return -ENOMEM;
}
@@ -1026,50 +1024,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
return true;
}
-static ssize_t dev_dax_resize(struct dax_region *dax_region,
- struct dev_dax *dev_dax, resource_size_t size)
+/**
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in
+ * @dev_dax: DAX device to be expanded
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
{
- resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
- resource_size_t dev_size = dev_dax_size(dev_dax);
- struct resource *region_res = &dax_region->res;
- struct device *dev = &dev_dax->dev;
struct resource *res, *first;
- resource_size_t alloc = 0;
int rc;
- if (dev->driver)
- return -EBUSY;
- if (size == dev_size)
- return 0;
- if (size > dev_size && size - dev_size > avail)
- return -ENOSPC;
- if (size < dev_size)
- return dev_dax_shrink(dev_dax, size);
-
- to_alloc = size - dev_size;
- if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
- "resize of %pa misaligned\n", &to_alloc))
- return -ENXIO;
-
- /*
- * Expand the device into the unused portion of the region. This
- * may involve adjusting the end of an existing resource, or
- * allocating a new resource.
- */
-retry:
- first = region_res->child;
- if (!first)
- return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+ first = parent->child;
+ if (!first) {
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, to_alloc);
+ if (rc)
+ return rc;
+ return to_alloc;
+ }
- rc = -ENOSPC;
for (res = first; res; res = res->sibling) {
struct resource *next = res->sibling;
+ resource_size_t alloc;
/* space at the beginning of the region */
- if (res == first && res->start > dax_region->res.start) {
- alloc = min(res->start - dax_region->res.start, to_alloc);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
- break;
+ if (res == first && res->start > parent->start) {
+ alloc = min(res->start - parent->start, to_alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, alloc);
+ if (rc)
+ return rc;
+ return alloc;
}
alloc = 0;
@@ -1078,21 +1071,56 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
alloc = min(next->start - (res->end + 1), to_alloc);
/* space at the end of the region */
- if (!alloc && !next && res->end < region_res->end)
- alloc = min(region_res->end - res->end, to_alloc);
+ if (!alloc && !next && res->end < parent->end)
+ alloc = min(parent->end - res->end, to_alloc);
if (!alloc)
continue;
if (adjust_ok(dev_dax, res)) {
rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
- break;
+ if (rc)
+ return rc;
+ return alloc;
}
- rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
- break;
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+ if (rc)
+ return rc;
+ return alloc;
}
- if (rc)
- return rc;
+
+ /* available was already calculated and should never be an issue */
+ dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+ return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+ struct dev_dax *dev_dax, resource_size_t size)
+{
+ resource_size_t avail = dax_region_avail_size(dax_region);
+ resource_size_t dev_size = dev_dax_size(dev_dax);
+ struct device *dev = &dev_dax->dev;
+ resource_size_t to_alloc;
+ resource_size_t alloc;
+
+ if (dev->driver)
+ return -EBUSY;
+ if (size == dev_size)
+ return 0;
+ if (size > dev_size && size - dev_size > avail)
+ return -ENOSPC;
+ if (size < dev_size)
+ return dev_dax_shrink(dev_dax, size);
+
+ to_alloc = size - dev_size;
+ if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+ "resize of %pa misaligned\n", &to_alloc))
+ return -ENXIO;
+
+retry:
+ alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (alloc <= 0)
+ return alloc;
to_alloc -= alloc;
if (to_alloc)
goto retry;
@@ -1198,7 +1226,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
- rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+ to_alloc);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);
@@ -1466,7 +1495,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+ data->size);
if (rc)
goto err_range;
--
2.49.0
* [PATCH v9 15/19] dax/region: Create resources on sparse DAX regions
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (13 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 14/19] dax/bus: Factor out dev dax resize logic Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 16/19] cxl/region: Read existing extents on region creation Ira Weiny
` (7 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
DAX regions which map dynamic capacity partitions require that memory be
allowed to come and go. Recall that sparse regions were created for this
purpose. Now that extents can be realized within DAX regions, the DAX
region driver can start tracking sub-resource information.
The tight relationship between DAX region operations and extent
operations requires memory changes to be controlled synchronously with
the user of the region. Synchronize through the dax_region_rwsem and by
having the region driver drive both the region device and the
extent sub-devices.
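The sub-resource bookkeeping this enables can be sketched in plain userspace C (a hedged illustration only; `struct res`, `extent_avail()`, and `sparse_region_avail()` below are simplified stand-ins for the kernel's `struct resource`, `dax_avail_size()`, and `dax_region_avail_size()`): in a sparse region the children of the region resource represent *available* extents, and allocations in turn appear as children of those extents.

```c
/* Simplified stand-in for struct resource: a two-level sibling/child tree. */
struct res {
	unsigned long long start, end;	/* inclusive bounds, as in the kernel */
	struct res *sibling, *child;
};

unsigned long long res_size(const struct res *r)
{
	return r->end - r->start + 1;
}

/* Analogous to dax_avail_size(): extent size minus its in-use children. */
unsigned long long extent_avail(const struct res *extent)
{
	unsigned long long avail = res_size(extent);
	const struct res *used;

	for (used = extent->child; used; used = used->sibling)
		avail -= res_size(used);
	return avail;
}

/*
 * Analogous to the sparse branch of dax_region_avail_size(): children of a
 * sparse region are extents of available space, so sum their free space.
 */
unsigned long long sparse_region_avail(const struct res *region)
{
	unsigned long long avail = 0;
	const struct res *ext;

	for (ext = region->child; ext; ext = ext->sibling)
		avail += extent_avail(ext);
	return avail;
}
```

With one 0x10000-byte extent carrying a 0x4000-byte allocation, both helpers report 0xc000 bytes free, matching the "children are available space, not used space" convention in the diff below.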
Recall that requests to remove extents can arrive at any time and that a
host is not obligated to release the memory until it is no longer in
use. If an extent is not in use, allow a release response.
When an extent is eligible for release, no mappings exist, but data may
still reside in caches and not yet be written to the device. Call
cxl_region_invalidate_memregion() to write back data to the device prior
to signaling the release complete.
Speculative writes after a release may dirty the cache such that a read
from a newly surfaced extent may not come from the device. Call
cxl_region_invalidate_memregion() prior to bringing a new extent online
to ensure the cache is marked invalid.
While these invalidate calls are inefficient, they are the best that can
be done to ensure cache consistency in the absence of back invalidation.
Furthermore, with sufficiently large extents this should occur
infrequently enough that real workloads are not impacted much.
The DAX layer has no need for the details of the CXL memory extent
devices. Expose extents to the DAX layer as device children of the DAX
region device. A single callback from the driver allows the DAX layer
to determine whether a child device is an extent. The DAX layer also
registers a devres function to automatically clean up when the device is
removed from the region.
There is a race between extents being surfaced and the dax_cxl driver
being loaded. Synchronize the driver during probe by scanning for
existing extents while holding the device lock.
Respond to extent notifications by managing the DAX region resource
tree based on each extent's lifetime. Return the status of remove
notifications to the lower layers so that they can manage the hardware
appropriately.
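The driver-notification handshake described above can be sketched as follows (a hedged userspace model, not the kernel code; `notify_extent()` mirrors the shape of `cxlr_notify_extent()` in the diff below, and the consumer callback is hypothetical): when no driver is bound there is nobody to coordinate with, so the notification trivially succeeds; a bound driver may veto a release with -EBUSY.

```c
#include <errno.h>
#include <stddef.h>

enum dc_event { DCD_ADD_CAPACITY, DCD_RELEASE_CAPACITY };

/* Minimal models of the driver/device relationship used by the dispatch. */
struct drv {
	int (*notify)(enum dc_event event);
};

struct dev {
	struct drv *driver;
};

/*
 * No bound driver (or no notify op) means there is no consumer to
 * coordinate with, so report success; otherwise return the driver's
 * answer, e.g. -EBUSY to keep an in-use extent from being released.
 */
int notify_extent(struct dev *dev, enum dc_event event)
{
	if (!dev->driver || !dev->driver->notify)
		return 0;
	return dev->driver->notify(event);
}

/* Example consumer: refuses releases, accepts new capacity. */
int veto_release(enum dc_event event)
{
	return event == DCD_RELEASE_CAPACITY ? -EBUSY : 0;
}
```

In `cxlr_rm_extent()` below, exactly this -EBUSY answer causes the extent to be left in place rather than removed.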
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: convert range prints to %pra]
---
drivers/cxl/core/core.h | 2 +
drivers/cxl/core/extent.c | 83 ++++++++++++++--
drivers/cxl/core/region.c | 2 +-
drivers/cxl/cxl.h | 6 ++
drivers/dax/bus.c | 246 +++++++++++++++++++++++++++++++++++++++++-----
drivers/dax/bus.h | 3 +-
drivers/dax/cxl.c | 61 +++++++++++-
drivers/dax/dax-private.h | 40 ++++++++
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem.c | 2 +-
include/linux/ioport.h | 3 +
11 files changed, 411 insertions(+), 39 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1272be497926..027dd1504d77 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -22,6 +22,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
+
#ifdef CONFIG_CXL_REGION
extern struct device_attribute dev_attr_create_pmem_region;
extern struct device_attribute dev_attr_create_ram_region;
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 3fb20cd7afc8..4dc0dec486f6 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -116,6 +116,12 @@ static void region_extent_unregister(void *ext)
dev_dbg(®ion_extent->dev, "DAX region rm extent HPA %pra\n",
®ion_extent->hpa_range);
+ /*
+ * Extent is not in use or an error has occurred. No mappings
+ * exist at this point. Write and invalidate caches to ensure
+ * the device has all data prior to final release.
+ */
+ cxl_region_invalidate_memregion(region_extent->cxlr_dax->cxlr);
device_unregister(®ion_extent->dev);
}
@@ -269,20 +275,65 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
}
+static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+ struct region_extent *region_extent)
+{
+ struct device *dev = &cxlr->cxlr_dax->dev;
+ struct cxl_notify_data notify_data;
+ struct cxl_driver *driver;
+
+ dev_dbg(dev, "Trying notify: type %d HPA %pra\n", event,
+ ®ion_extent->hpa_range);
+
+ guard(device)(dev);
+
+ /*
+ * Without a driver bound there is no consumer to notify; no user
+ * space coordination is possible.
+ */
+ if (!dev->driver)
+ return 0;
+ driver = to_cxl_drv(dev->driver);
+ if (!driver->notify)
+ return 0;
+
+ notify_data = (struct cxl_notify_data) {
+ .event = event,
+ .region_extent = region_extent,
+ };
+
+ dev_dbg(dev, "Notify: type %d HPA %pra\n", event,
+ ®ion_extent->hpa_range);
+ return driver->notify(dev, ¬ify_data);
+}
+
+struct rm_data {
+ struct cxl_region *cxlr;
+ struct range *range;
+};
+
static int cxlr_rm_extent(struct device *dev, void *data)
{
struct region_extent *region_extent = to_region_extent(dev);
- struct range *region_hpa_range = data;
+ struct rm_data *rm_data = data;
+ int rc;
if (!region_extent)
return 0;
/*
- * Any extent which 'touches' the released range is removed.
+ * An attempt is made to remove any extent which 'touches' the
+ * released range.
*/
- if (range_overlaps(region_hpa_range, ®ion_extent->hpa_range)) {
+ if (range_overlaps(rm_data->range, ®ion_extent->hpa_range)) {
+ struct cxl_region *cxlr = rm_data->cxlr;
+
dev_dbg(dev, "Remove region extent HPA %pra\n",
®ion_extent->hpa_range);
+ rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, region_extent);
+ if (rc == -EBUSY)
+ return 0;
+
region_rm_extent(region_extent);
}
return 0;
@@ -327,8 +378,13 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
+ struct rm_data rm_data = {
+ .cxlr = cxlr,
+ .range = &hpa_range,
+ };
+
/* Remove region extents which overlap */
- return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+ return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
cxlr_rm_extent);
}
@@ -353,8 +409,23 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
return rc;
}
- /* device model handles freeing region_extent */
- return online_region_extent(region_extent);
+ /* Ensure caches are clean prior to onlining */
+ cxl_region_invalidate_memregion(cxlr_dax->cxlr);
+
+ rc = online_region_extent(region_extent);
+ /* device model handled freeing region_extent */
+ if (rc)
+ return rc;
+
+ rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
+ /*
+ * The region device was briefly live but the DAX layer ensures it
+ * was not used.
+ */
+ if (rc)
+ region_rm_extent(region_extent);
+
+ return rc;
}
/* Callers are expected to ensure cxled has been attached to a region */
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3106df6f3636..eeabc5a6b18a 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -223,7 +223,7 @@ static struct cxl_region_ref *cxl_rr_load(struct cxl_port *port,
return xa_load(&port->regions, (unsigned long)cxlr);
}
-static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
{
if (!cpu_cache_has_invalidate_memregion()) {
if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d027432b1572..a14b33eca1d0 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -867,10 +867,16 @@ bool is_cxl_region(struct device *dev);
extern struct bus_type cxl_bus_type;
+struct cxl_notify_data {
+ enum dc_event event;
+ struct region_extent *region_extent;
+};
+
struct cxl_driver {
const char *name;
int (*probe)(struct device *dev);
void (*remove)(struct device *dev);
+ int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
struct device_driver drv;
int id;
};
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index c25942a3d125..45573d077b5a 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -183,6 +183,93 @@ static bool is_sparse(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
}
+static void __dax_release_resource(struct dax_resource *dax_resource)
+{
+ struct dax_region *dax_region = dax_resource->region;
+
+ lockdep_assert_held_write(&dax_region_rwsem);
+ dev_dbg(dax_region->dev, "Extent release resource %pr\n",
+ dax_resource->res);
+ if (dax_resource->res)
+ __release_region(&dax_region->res, dax_resource->res->start,
+ resource_size(dax_resource->res));
+ dax_resource->res = NULL;
+}
+
+static void dax_release_resource(void *res)
+{
+ struct dax_resource *dax_resource = res;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+ __dax_release_resource(dax_resource);
+ kfree(dax_resource);
+}
+
+int dax_region_add_resource(struct dax_region *dax_region,
+ struct device *device,
+ resource_size_t start, resource_size_t length)
+{
+ struct resource *new_resource;
+ int rc;
+
+ struct dax_resource *dax_resource __free(kfree) =
+ kzalloc(sizeof(*dax_resource), GFP_KERNEL);
+ if (!dax_resource)
+ return -ENOMEM;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+ new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
+ if (!new_resource) {
+ dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+ &start, &length);
+ return -ENOSPC;
+ }
+
+ dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
+ dax_resource->region = dax_region;
+ dax_resource->res = new_resource;
+
+ /*
+ * open code devm_add_action_or_reset() to avoid recursive write lock
+ * of dax_region_rwsem in the error case.
+ */
+ rc = devm_add_action(device, dax_release_resource, dax_resource);
+ if (rc) {
+ __dax_release_resource(dax_resource);
+ return rc;
+ }
+
+ dev_set_drvdata(device, no_free_ptr(dax_resource));
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resource);
+
+int dax_region_rm_resource(struct dax_region *dax_region,
+ struct device *dev)
+{
+ struct dax_resource *dax_resource;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource)
+ return 0;
+
+ if (dax_resource->use_cnt)
+ return -EBUSY;
+
+ /*
+ * release the resource under dax_region_rwsem to avoid races with
+ * users trying to use the extent
+ */
+ __dax_release_resource(dax_resource);
+ dev_set_drvdata(dev, NULL);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resource);
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -296,19 +383,41 @@ static ssize_t region_align_show(struct device *dev,
static struct device_attribute dev_attr_region_align =
__ATTR(align, 0400, region_align_show, NULL);
+resource_size_t
+dax_avail_size(struct resource *dax_resource)
+{
+ resource_size_t rc;
+ struct resource *used_res;
+
+ rc = resource_size(dax_resource);
+ for_each_child_resource(dax_resource, used_res)
+ rc -= resource_size(used_res);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(dax_avail_size);
+
#define for_each_dax_region_resource(dax_region, res) \
for (res = (dax_region)->res.child; res; res = res->sibling)
static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
{
- resource_size_t size = resource_size(&dax_region->res);
+ resource_size_t size;
struct resource *res;
lockdep_assert_held(&dax_region_rwsem);
- if (is_sparse(dax_region))
- return 0;
+ if (is_sparse(dax_region)) {
+ /*
+ * Children of a sparse region represent available space, not
+ * used space.
+ */
+ size = 0;
+ for_each_dax_region_resource(dax_region, res)
+ size += dax_avail_size(res);
+ return size;
+ }
+ size = resource_size(&dax_region->res);
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -449,15 +558,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
static void trim_dev_dax_range(struct dev_dax *dev_dax)
{
int i = dev_dax->nr_range - 1;
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_region *dax_region = dev_dax->region;
+ struct resource *res = &dax_region->res;
lockdep_assert_held_write(&dax_region_rwsem);
dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);
- __release_region(&dax_region->res, range->start, range_len(range));
+ if (dev_range->dax_resource) {
+ res = dev_range->dax_resource->res;
+ dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
+ }
+
+ __release_region(res, range->start, range_len(range));
+
+ if (dev_range->dax_resource)
+ dev_range->dax_resource->use_cnt--;
+
if (--dev_dax->nr_range == 0) {
kfree(dev_dax->ranges);
dev_dax->ranges = NULL;
@@ -640,7 +760,7 @@ static void dax_region_unregister(void *region)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags)
+ unsigned long flags, struct dax_sparse_ops *sparse_ops)
{
struct dax_region *dax_region;
@@ -658,12 +778,16 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
|| !IS_ALIGNED(range_len(range), align))
return NULL;
+ if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
+ return NULL;
+
dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
if (!dax_region)
return NULL;
dev_set_drvdata(parent, dax_region);
kref_init(&dax_region->kref);
+ dax_region->sparse_ops = sparse_ops;
dax_region->id = region_id;
dax_region->align = align;
dax_region->dev = parent;
@@ -845,7 +969,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
}
static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
- u64 start, resource_size_t size)
+ u64 start, resource_size_t size,
+ struct dax_resource *dax_resource)
{
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
@@ -884,6 +1009,7 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
.start = alloc->start,
.end = alloc->end,
},
+ .dax_resource = dax_resource,
};
dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -966,7 +1092,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
int i;
for (i = dev_dax->nr_range - 1; i >= 0; i--) {
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
struct resource *adjust = NULL, *res;
resource_size_t shrink;
@@ -982,12 +1109,21 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
continue;
}
- for_each_dax_region_resource(dax_region, res)
- if (strcmp(res->name, dev_name(dev)) == 0
- && res->start == range->start) {
- adjust = res;
- break;
- }
+ if (dev_range->dax_resource) {
+ for_each_child_resource(dev_range->dax_resource->res, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+ } else {
+ for_each_dax_region_resource(dax_region, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+ }
if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
"failed to find matching resource\n"))
@@ -1025,19 +1161,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
}
/**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
*
* @parent: parent resource to allocate this range in
* @dev_dax: DAX device to be expanded
* @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_resource: if sparse, the parent resource
*
* Return the amount of space allocated or -ERRNO on failure
*/
-static ssize_t dev_dax_resize_static(struct resource *parent,
- struct dev_dax *dev_dax,
- resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc,
+ struct dax_resource *dax_resource)
{
struct resource *res, *first;
int rc;
@@ -1045,7 +1183,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
first = parent->child;
if (!first) {
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, to_alloc);
+ parent->start, to_alloc,
+ dax_resource);
if (rc)
return rc;
return to_alloc;
@@ -1059,7 +1198,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
if (res == first && res->start > parent->start) {
alloc = min(res->start - parent->start, to_alloc);
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, alloc);
+ parent->start, alloc,
+ dax_resource);
if (rc)
return rc;
return alloc;
@@ -1083,7 +1223,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return rc;
return alloc;
}
- rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+ dax_resource);
if (rc)
return rc;
return alloc;
@@ -1094,6 +1235,51 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return 0;
}
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
+static int find_free_extent(struct device *dev, const void *data)
+{
+ const struct dax_region *dax_region = data;
+ struct dax_resource *dax_resource;
+
+ if (!dax_region->sparse_ops->is_extent(dev))
+ return 0;
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource || !dax_avail_size(dax_resource->res))
+ return 0;
+ return 1;
+}
+
+static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ struct dax_resource *dax_resource;
+ ssize_t alloc;
+
+ struct device *extent_dev __free(put_device) =
+ device_find_child(dax_region->dev, dax_region,
+ find_free_extent);
+ if (!extent_dev)
+ return 0;
+
+ dax_resource = dev_get_drvdata(extent_dev);
+ if (!dax_resource)
+ return 0;
+
+ to_alloc = min(dax_avail_size(dax_resource->res), to_alloc);
+ alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
+ if (alloc > 0)
+ dax_resource->use_cnt++;
+ return alloc;
+}
+
static ssize_t dev_dax_resize(struct dax_region *dax_region,
struct dev_dax *dev_dax, resource_size_t size)
{
@@ -1118,7 +1304,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return -ENXIO;
retry:
- alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (is_sparse(dax_region))
+ alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
+ else
+ alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
if (alloc <= 0)
return alloc;
to_alloc -= alloc;
@@ -1227,7 +1416,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
- to_alloc);
+ to_alloc, NULL);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);
@@ -1466,6 +1655,11 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
struct device *dev;
int rc;
+ if (is_sparse(dax_region) && data->size) {
+ dev_err(parent, "Sparse DAX region devices must be created initially with 0 size");
+ return ERR_PTR(-EINVAL);
+ }
+
dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL);
if (!dev_dax)
return ERR_PTR(-ENOMEM);
@@ -1496,7 +1690,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
- data->size);
+ data->size, NULL);
if (rc)
goto err_range;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 783bfeef42cc..ae5029ea6047 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -9,6 +9,7 @@ struct dev_dax;
struct resource;
struct dax_device;
struct dax_region;
+struct dax_sparse_ops;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
@@ -17,7 +18,7 @@ struct dax_region;
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags);
+ unsigned long flags, struct dax_sparse_ops *sparse_ops);
struct dev_dax_data {
struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 88b051cea755..011bd1dc7691 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,57 @@
#include "../cxl/cxl.h"
#include "bus.h"
+#include "dax-private.h"
+
+static int __cxl_dax_add_resource(struct dax_region *dax_region,
+ struct region_extent *region_extent)
+{
+ struct device *dev = ®ion_extent->dev;
+ resource_size_t start, length;
+
+ start = dax_region->res.start + region_extent->hpa_range.start;
+ length = range_len(®ion_extent->hpa_range);
+ return dax_region_add_resource(dax_region, dev, start, length);
+}
+
+static int cxl_dax_add_resource(struct device *dev, void *data)
+{
+ struct dax_region *dax_region = data;
+ struct region_extent *region_extent;
+
+ region_extent = to_region_extent(dev);
+ if (!region_extent)
+ return 0;
+
+ dev_dbg(dax_region->dev, "Adding resource HPA %pra\n",
+ ®ion_extent->hpa_range);
+
+ return __cxl_dax_add_resource(dax_region, region_extent);
+}
+
+static int cxl_dax_region_notify(struct device *dev,
+ struct cxl_notify_data *notify_data)
+{
+ struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct region_extent *region_extent = notify_data->region_extent;
+
+ switch (notify_data->event) {
+ case DCD_ADD_CAPACITY:
+ return __cxl_dax_add_resource(dax_region, region_extent);
+ case DCD_RELEASE_CAPACITY:
+ return dax_region_rm_resource(dax_region, ®ion_extent->dev);
+ case DCD_FORCED_CAPACITY_RELEASE:
+ default:
+ dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
+ notify_data->event);
+ return -ENXIO;
+ }
+}
+
+static struct dax_sparse_ops sparse_ops = {
+ .is_extent = is_region_extent,
+};
static int cxl_dax_region_probe(struct device *dev)
{
@@ -24,15 +75,18 @@ static int cxl_dax_region_probe(struct device *dev)
flags |= IORESOURCE_DAX_SPARSE_CAP;
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, flags);
+ PMD_SIZE, flags, &sparse_ops);
if (!dax_region)
return -ENOMEM;
- if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A) {
+ device_for_each_child(&cxlr_dax->dev, dax_region,
+ cxl_dax_add_resource);
/* Add empty seed dax device */
dev_size = 0;
- else
+ } else {
dev_size = range_len(&cxlr_dax->hpa_range);
+ }
data = (struct dev_dax_data) {
.dax_region = dax_region,
@@ -47,6 +101,7 @@ static int cxl_dax_region_probe(struct device *dev)
static struct cxl_driver cxl_dax_region_driver = {
.name = "cxl_dax_region",
.probe = cxl_dax_region_probe,
+ .notify = cxl_dax_region_notify,
.id = CXL_DEVICE_DAX_REGION,
.drv = {
.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..39fb587561f8 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -16,6 +16,14 @@ struct inode *dax_inode(struct dax_device *dax_dev);
int dax_bus_init(void);
void dax_bus_exit(void);
+/**
+ * struct dax_sparse_ops - Operations for sparse regions
+ * @is_extent: return if the device is an extent
+ */
+struct dax_sparse_ops {
+ bool (*is_extent)(struct device *dev);
+};
+
/**
* struct dax_region - mapping infrastructure for dax devices
* @id: kernel-wide unique region for a memory range
@@ -27,6 +35,7 @@ void dax_bus_exit(void);
* @res: resource tree to track instance allocations
* @seed: allow userspace to find the first unbound seed device
* @youngest: allow userspace to find the most recently created device
+ * @sparse_ops: operations required for sparse regions
*/
struct dax_region {
int id;
@@ -38,6 +47,7 @@ struct dax_region {
struct resource res;
struct device *seed;
struct device *youngest;
+ struct dax_sparse_ops *sparse_ops;
};
/**
@@ -57,11 +67,13 @@ struct dax_mapping {
* @pgoff: page offset
* @range: resource-span
* @mapping: reference to the dax_mapping for this range
+ * @dax_resource: if not NULL, the dax sparse resource containing this range
*/
struct dev_dax_range {
unsigned long pgoff;
struct range range;
struct dax_mapping *mapping;
+ struct dax_resource *dax_resource;
};
/**
@@ -100,6 +112,34 @@ struct dev_dax {
*/
void run_dax(struct dax_device *dax_dev);
+/**
+ * struct dax_resource - For sparse regions; an active resource
+ * @region: dax_region this resource is in
+ * @res: resource
+ * @use_cnt: count the number of uses of this resource
+ *
+ * Changes to the dax_region and the dax_resources within it are protected by
+ * dax_region_rwsem
+ *
+ * dax_resource objects are not intended to be used outside the dax layer.
+ */
+struct dax_resource {
+ struct dax_region *region;
+ struct resource *res;
+ unsigned int use_cnt;
+};
+
+/*
+ * Similar to run_dax() dax_region_{add,rm}_resource() and dax_avail_size() are
+ * exported but are not intended to be generic operations outside the dax
+ * subsystem. They are only generic between the dax layer and the dax drivers.
+ */
+int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
+ resource_size_t start, resource_size_t length);
+int dax_region_rm_resource(struct dax_region *dax_region,
+ struct device *dev);
+resource_size_t dax_avail_size(struct resource *dax_resource);
+
static inline struct dev_dax *to_dev_dax(struct device *dev)
{
return container_of(dev, struct dev_dax, dev);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491..0eea65052874 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
mri = dev->platform_data;
dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
- mri->target_node, PMD_SIZE, flags);
+ mri->target_node, PMD_SIZE, flags, NULL);
if (!dax_region)
return -ENOMEM;
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index c8ebf4e281f2..f927e855f240 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,7 +54,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
range.start += offset;
dax_region = alloc_dax_region(dev, region_id, &range,
nd_region->target_node, le32_to_cpu(pfn_sb->align),
- IORESOURCE_DAX_STATIC);
+ IORESOURCE_DAX_STATIC, NULL);
if (!dax_region)
return ERR_PTR(-ENOMEM);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index e8b2d6aa4013..a97bb3d936a7 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -27,6 +27,9 @@ struct resource {
struct resource *parent, *sibling, *child;
};
+#define for_each_child_resource(parent, res) \
+ for (res = (parent)->child; res; res = res->sibling)
+
/*
* IO resources have these defined flags.
*
--
2.49.0
* [PATCH v9 16/19] cxl/region: Read existing extents on region creation
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (14 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 15/19] dax/region: Create resources on sparse DAX regions Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 16:15 ` Jonathan Cameron
2026-02-02 19:42 ` Davidlohr Bueso
2025-04-13 22:52 ` [PATCH v9 17/19] cxl/mem: Trace Dynamic capacity Event Record Ira Weiny
` (6 subsequent siblings)
22 siblings, 2 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash. In this case it is expected
that the creation of a new region on top of a DC partition can read
those extents and surface them for continued use.
Once all endpoint decoders are part of a region and the region is being
realized, a read of the device's extent list can reveal these
previously accepted extents.
CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
this purpose. The call returns all the extents for all dynamic capacity
partitions. If the fabric manager is adding extents to any DCD
partition, the extent list for the recovered region may change. In this
case the query must be retried. Upon retry the query could encounter
extents which were accepted on a previous list query. Adding such
extents is ignored without error because they lie entirely within a
previously accepted extent. Instead, warn on this case to allow for
differentiating bad devices from this normal condition.
Latch any errors and bubble them up to ensure the user is notified even
if individual errors are rate limited or otherwise ignored.
The scan for existing extents races with the dax_cxl driver. This is
synchronized through the region device lock. Extents found after the
driver has loaded surface through the normal notification path, while
extents present before the driver loads are read during driver load.
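The retry-on-change behavior can be sketched as a hedged userspace simulation (names here do not match the kernel code; `get_extent_list()` is a fake device that bumps the list generation once, mid-read, to mimic the fabric manager adding capacity): the first response latches the generation number, and any later chunk carrying a different generation restarts the whole read.

```c
/* Simulated mailbox response for Get Dynamic Capacity Extent List. */
struct list_chunk {
	unsigned gen;		/* generation number of the extent list */
	unsigned total;		/* total extents currently on the device */
	unsigned returned;	/* extents carried by this chunk */
};

/*
 * Fake device: 5 extents, at most 2 per response; the generation is
 * bumped once, mid-read, to force exactly one retry.
 */
void get_extent_list(unsigned start, struct list_chunk *out)
{
	static unsigned gen = 1;
	static int bumped;

	if (!bumped && start == 2) {
		gen++;
		bumped = 1;
	}
	out->gen = gen;
	out->total = 5;
	out->returned = (out->total - start) < 2 ? (out->total - start) : 2;
}

/* Read the full list, restarting whenever the generation changes. */
int read_all_extents(unsigned *retries)
{
	struct list_chunk chunk;
	unsigned start, initial_gen = 0;

restart:
	start = 0;
	do {
		get_extent_list(start, &chunk);
		if (start == 0) {
			initial_gen = chunk.gen;
		} else if (chunk.gen != initial_gen) {
			(*retries)++;	/* list changed underneath us */
			goto restart;
		}
		start += chunk.returned;
	} while (start < chunk.total);
	return 0;
}
```

The kernel version (the -EAGAIN path in `__cxl_process_extent_list()` below) follows the same shape, with the mailbox call standing in for the fake device here.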
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[0day: fix extent count in GetExtent input payload]
[iweiny: minor clean ups]
[iweiny: Adjust for partition arch]
---
drivers/cxl/core/core.h | 1 +
drivers/cxl/core/mbox.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/region.c | 25 +++++++++++
drivers/cxl/cxlmem.h | 21 +++++++++
4 files changed, 156 insertions(+)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 027dd1504d77..e06a46fec217 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -22,6 +22,7 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
#ifdef CONFIG_CXL_REGION
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index de01c6684530..8af3a4173b99 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1737,6 +1737,115 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
}
EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
+/* Return -EAGAIN if the extent list changes while reading */
+static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+ u32 current_index, total_read, total_expected, initial_gen_num;
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_mbox_cmd mbox_cmd;
+ u32 max_extent_count;
+ int latched_rc = 0;
+ bool first = true;
+
+ struct cxl_mbox_get_extent_out *extents __free(kvfree) =
+ kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+ if (!extents)
+ return -ENOMEM;
+
+ total_read = 0;
+ current_index = 0;
+ total_expected = 0;
+ max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
+ sizeof(struct cxl_extent);
+ do {
+ u32 nr_returned, current_total, current_gen_num;
+ struct cxl_mbox_get_extent_in get_extent;
+ int rc;
+
+ get_extent = (struct cxl_mbox_get_extent_in) {
+ .extent_cnt = cpu_to_le32(max(max_extent_count,
+ total_expected - current_index)),
+ .start_extent_index = cpu_to_le32(current_index),
+ };
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+ .payload_in = &get_extent,
+ .size_in = sizeof(get_extent),
+ .size_out = cxl_mbox->payload_size,
+ .payload_out = extents,
+ .min_out = 1,
+ };
+
+ rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ /* Save initial data */
+ if (first) {
+ total_expected = le32_to_cpu(extents->total_extent_count);
+ initial_gen_num = le32_to_cpu(extents->generation_num);
+ first = false;
+ }
+
+ nr_returned = le32_to_cpu(extents->returned_extent_count);
+ total_read += nr_returned;
+ current_total = le32_to_cpu(extents->total_extent_count);
+ current_gen_num = le32_to_cpu(extents->generation_num);
+
+ dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
+ current_index, total_read - 1, current_total, current_gen_num);
+
+ if (current_gen_num != initial_gen_num || total_expected != current_total) {
+ dev_warn(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
+ current_gen_num, initial_gen_num,
+ total_expected, current_total);
+ return -EAGAIN;
+ }
+
+ for (int i = 0; i < nr_returned ; i++) {
+ struct cxl_extent *extent = &extents->extent[i];
+
+ dev_dbg(dev, "Processing extent %d/%d\n",
+ current_index + i, total_expected);
+
+ rc = validate_add_extent(mds, extent);
+ if (rc)
+ latched_rc = rc;
+ }
+
+ current_index += nr_returned;
+ } while (total_expected > total_read);
+
+ return latched_rc;
+}
+
+#define CXL_READ_EXTENT_LIST_RETRY 10
+
+/**
+ * cxl_process_extent_list() - Read existing extents
+ * @cxled: Endpoint decoder which is part of a region
+ *
+ * Issue the Get Dynamic Capacity Extent List command to the device
+ * and add existing extents if found.
+ *
+ * A retry of 10 is somewhat arbitrary, however, extent changes should be
+ * relatively rare while bringing up a region. So 10 should be plenty.
+ */
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+ int retry = CXL_READ_EXTENT_LIST_RETRY;
+ int rc;
+
+ do {
+ rc = __cxl_process_extent_list(cxled);
+ } while (rc == -EAGAIN && retry--);
+
+ return rc;
+}
+
static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
{
int i = info->nr_partitions;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index eeabc5a6b18a..a43b43972bae 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3196,6 +3196,26 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
return rc;
}
+static int cxlr_add_existing_extents(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int i, latched_rc = 0;
+
+ for (i = 0; i < p->nr_targets; i++) {
+ struct device *dev = &p->targets[i]->cxld.dev;
+ int rc;
+
+ rc = cxl_process_extent_list(p->targets[i]);
+ if (rc) {
+ dev_err(dev, "Existing extent processing failed %d\n",
+ rc);
+ latched_rc = rc;
+ }
+ }
+
+ return latched_rc;
+}
+
static void cxlr_dax_unregister(void *_cxlr_dax)
{
struct cxl_dax_region *cxlr_dax = _cxlr_dax;
@@ -3231,6 +3251,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
dev_name(dev));
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ if (cxlr_add_existing_extents(cxlr))
+ dev_err(&cxlr->dev, "Existing extent processing failed %d\n",
+ rc);
+
return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
cxlr_dax);
err:
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 63a38e449454..f80f70549c0b 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -600,6 +600,27 @@ struct cxl_mbox_dc_response {
} __packed extent_list[];
} __packed;
+/*
+ * Get Dynamic Capacity Extent List; Input Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
+ */
+struct cxl_mbox_get_extent_in {
+ __le32 extent_cnt;
+ __le32 start_extent_index;
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Output Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
+ */
+struct cxl_mbox_get_extent_out {
+ __le32 returned_extent_count;
+ __le32 total_extent_count;
+ __le32 generation_num;
+ u8 rsvd[4];
+ struct cxl_extent extent[];
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
--
2.49.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v9 16/19] cxl/region: Read existing extents on region creation
2025-04-13 22:52 ` [PATCH v9 16/19] cxl/region: Read existing extents on region creation Ira Weiny
@ 2025-04-14 16:15 ` Jonathan Cameron
2026-02-02 19:42 ` Davidlohr Bueso
1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 16:15 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025 17:52:24 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case it is expected
> that the creation of a new region on top of a DC partition can read
> those extents and surface them for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized, a read of the 'devices extent list' can reveal these
> previously accepted extents.
>
> CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
> this purpose. The call returns all the extents for all dynamic capacity
> partitions. If the fabric manager is adding extents to any DCD
> partition, the extent list for the recovered region may change. In this
> case the query must retry. Upon retry the query could encounter extents
> which were accepted on a previous list query. Adding such extents is
> ignored without error because they are entirely within a previous
> accepted extent. Instead warn on this case to allow for differentiating
> bad devices from this normal condition.
>
> Latch any errors to be bubbled up to ensure notification to the user
> even if individual errors are rate limited or otherwise ignored.
>
> The scan for existing extents races with the dax_cxl driver. This is
> synchronized through the region device lock. Extents which are found
> after the driver has loaded will surface through the normal notification
> path while extents seen prior to the driver are read during driver load.
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
A couple of minor things noticed on taking another look.
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index de01c6684530..8af3a4173b99 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1737,6 +1737,115 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
>
> +/* Return -EAGAIN if the extent list changes while reading */
> +static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> + u32 current_index, total_read, total_expected, initial_gen_num;
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> + u32 max_extent_count;
> + int latched_rc = 0;
> + bool first = true;
> +
> + struct cxl_mbox_get_extent_out *extents __free(kvfree) =
> + kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
> + if (!extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + current_index = 0;
> + total_expected = 0;
> + max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
> + sizeof(struct cxl_extent);
> + do {
> + u32 nr_returned, current_total, current_gen_num;
> + struct cxl_mbox_get_extent_in get_extent;
> + int rc;
> +
> + get_extent = (struct cxl_mbox_get_extent_in) {
> + .extent_cnt = cpu_to_le32(max(max_extent_count,
> + total_expected - current_index)),
> + .start_extent_index = cpu_to_le32(current_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_extent,
> + .size_in = sizeof(get_extent),
> + .size_out = cxl_mbox->payload_size,
> + .payload_out = extents,
> + .min_out = 1,
Similar to earlier comment (I might well have forgotten how this works) but
why not 16 which is what I think we should get even if no extents.
> + };
> +
> + rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + /* Save initial data */
> + if (first) {
> + total_expected = le32_to_cpu(extents->total_extent_count);
> + initial_gen_num = le32_to_cpu(extents->generation_num);
> + first = false;
> + }
> +
> + nr_returned = le32_to_cpu(extents->returned_extent_count);
> + total_read += nr_returned;
> + current_total = le32_to_cpu(extents->total_extent_count);
> + current_gen_num = le32_to_cpu(extents->generation_num);
> +
> + dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
> + current_index, total_read - 1, current_total, current_gen_num);
> +
> + if (current_gen_num != initial_gen_num || total_expected != current_total) {
> + dev_warn(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
> + current_gen_num, initial_gen_num,
> + total_expected, current_total);
> + return -EAGAIN;
> + }
> +
> + for (int i = 0; i < nr_returned ; i++) {
> + struct cxl_extent *extent = &extents->extent[i];
> +
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + current_index + i, total_expected);
> +
> + rc = validate_add_extent(mds, extent);
> + if (rc)
> + latched_rc = rc;
> + }
> +
> + current_index += nr_returned;
> + } while (total_expected > total_read);
> +
> + return latched_rc;
> +}
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_extent_out {
> + __le32 returned_extent_count;
> + __le32 total_extent_count;
> + __le32 generation_num;
> + u8 rsvd[4];
> + struct cxl_extent extent[];
Throw some counted_by magic at this?
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
>
* Re: [PATCH v9 16/19] cxl/region: Read existing extents on region creation
2025-04-13 22:52 ` [PATCH v9 16/19] cxl/region: Read existing extents on region creation Ira Weiny
2025-04-14 16:15 ` Jonathan Cameron
@ 2026-02-02 19:42 ` Davidlohr Bueso
1 sibling, 0 replies; 65+ messages in thread
From: Davidlohr Bueso @ 2026-02-02 19:42 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel
On Sun, 13 Apr 2025, Ira Weiny wrote:
>Dynamic capacity device extents may be left in an accepted state on a
>device due to an unexpected host crash. In this case it is expected
>that the creation of a new region on top of a DC partition can read
>those extents and surface them for continued use.
>
>Once all endpoint decoders are part of a region and the region is being
>realized, a read of the 'devices extent list' can reveal these
>previously accepted extents.
>
>CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
>this purpose. The call returns all the extents for all dynamic capacity
>partitions. If the fabric manager is adding extents to any DCD
>partition, the extent list for the recovered region may change. In this
>case the query must retry. Upon retry the query could encounter extents
>which were accepted on a previous list query. Adding such extents is
>ignored without error because they are entirely within a previous
>accepted extent. Instead warn on this case to allow for differentiating
>bad devices from this normal condition.
>
>Latch any errors to be bubbled up to ensure notification to the user
>even if individual errors are rate limited or otherwise ignored.
>
>The scan for existing extents races with the dax_cxl driver. This is
>synchronized through the region device lock. Extents which are found
>after the driver has loaded will surface through the normal notification
>path while extents seen prior to the driver are read during driver load.
>
>Based on an original patch by Navneet Singh.
>
>Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>Reviewed-by: Fan Ni <fan.ni@samsung.com>
>Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
>---
>Changes:
>[0day: fix extent count in GetExtent input payload]
>[iweiny: minor clean ups]
>[iweiny: Adjust for partition arch]
>---
> drivers/cxl/core/core.h | 1 +
> drivers/cxl/core/mbox.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 25 +++++++++++
> drivers/cxl/cxlmem.h | 21 +++++++++
> 4 files changed, 156 insertions(+)
>
>diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>index 027dd1504d77..e06a46fec217 100644
>--- a/drivers/cxl/core/core.h
>+++ b/drivers/cxl/core/core.h
>@@ -22,6 +22,7 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
>+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
> int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
>
> #ifdef CONFIG_CXL_REGION
>diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
>index de01c6684530..8af3a4173b99 100644
>--- a/drivers/cxl/core/mbox.c
>+++ b/drivers/cxl/core/mbox.c
>@@ -1737,6 +1737,115 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
>
>+/* Return -EAGAIN if the extent list changes while reading */
>+static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
>+{
>+ u32 current_index, total_read, total_expected, initial_gen_num;
>+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
>+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
>+ struct device *dev = mds->cxlds.dev;
>+ struct cxl_mbox_cmd mbox_cmd;
>+ u32 max_extent_count;
>+ int latched_rc = 0;
>+ bool first = true;
>+
>+ struct cxl_mbox_get_extent_out *extents __free(kvfree) =
>+ kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
>+ if (!extents)
>+ return -ENOMEM;
>+
>+ total_read = 0;
>+ current_index = 0;
>+ total_expected = 0;
>+ max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
>+ sizeof(struct cxl_extent);
>+ do {
>+ u32 nr_returned, current_total, current_gen_num;
>+ struct cxl_mbox_get_extent_in get_extent;
>+ int rc;
>+
>+ get_extent = (struct cxl_mbox_get_extent_in) {
>+ .extent_cnt = cpu_to_le32(max(max_extent_count,
>+ total_expected - current_index)),
s/max/min().
>+ .start_extent_index = cpu_to_le32(current_index),
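Davidlohr's point is that the input extent count should be clamped to the smaller of what fits in the payload and what remains unread, i.e. min(), not max(). A userspace sketch of the corrected computation (function and parameter names are illustrative, not from the patch):

```c
#include <stdint.h>

/*
 * How many extents to request in one Get Extent List call: no more
 * than fit in the mailbox payload (max_extent_count), and no more
 * than remain unread. Using max() here, as the posted patch does,
 * would over-request.
 */
static uint32_t extents_to_request(uint32_t max_extent_count,
				   uint32_t total_expected,
				   uint32_t current_index)
{
	uint32_t remaining = total_expected - current_index;

	return remaining < max_extent_count ? remaining : max_extent_count;
}
```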
* [PATCH v9 17/19] cxl/mem: Trace Dynamic capacity Event Record
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (15 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 16/19] cxl/region: Read existing extents on region creation Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 18/19] tools/testing/cxl: Make event logs dynamic Ira Weiny
` (5 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Record.
User space can use trace events to debug DC capacity changes.
Add DC trace points to the trace log.
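The trace point decodes the DER event-type field with `__print_symbolic()`; the equivalent decode can be sketched in plain C (the type codes below are the ones the patch defines from CXL r3.1 Table 8-50, the helper name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Map a Dynamic Capacity Event Record event type to its name,
 * mirroring show_dc_evt_type() in the patch. */
static const char *dc_evt_name(uint8_t type)
{
	switch (type) {
	case 0x00: return "Add capacity";
	case 0x01: return "Release capacity";
	case 0x02: return "Forced capacity release";
	case 0x03: return "Region Configuration Updated";
	default:   return "Unknown";
	}
}
```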
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[djbw: s/region/partition/]
[iweiny: s/tag/uuid/]
---
drivers/cxl/core/mbox.c | 4 +++
drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 8af3a4173b99..891a213ce7be 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1043,6 +1043,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
ev_type = CXL_CPER_EVENT_DRAM;
else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
ev_type = CXL_CPER_EVENT_MEM_MODULE;
+ else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
+ trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
+ return;
+ }
cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 25ebfbc1616c..384017259970 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -978,6 +978,71 @@ TRACE_EVENT(cxl_poison,
)
);
+/*
+ * Dynamic Capacity Event Record - DER
+ *
+ * CXL rev 3.1 section 8.2.9.2.1.6 Table 8-50
+ */
+
+#define CXL_DC_ADD_CAPACITY 0x00
+#define CXL_DC_REL_CAPACITY 0x01
+#define CXL_DC_FORCED_REL_CAPACITY 0x02
+#define CXL_DC_REG_CONF_UPDATED 0x03
+#define show_dc_evt_type(type) __print_symbolic(type, \
+ { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
+ { CXL_DC_REL_CAPACITY, "Release capacity"}, \
+ { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
+ { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+ TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+ struct cxl_event_dcd *rec),
+
+ TP_ARGS(cxlmd, log, rec),
+
+ TP_STRUCT__entry(
+ CXL_EVT_TP_entry
+
+ /* Dynamic capacity Event */
+ __field(u8, event_type)
+ __field(u16, hostid)
+ __field(u8, partition_id)
+ __field(u64, dpa_start)
+ __field(u64, length)
+ __array(u8, uuid, UUID_SIZE)
+ __field(u16, sh_extent_seq)
+ ),
+
+ TP_fast_assign(
+ CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+ /* Dynamic_capacity Event */
+ __entry->event_type = rec->event_type;
+
+ /* DCD event record data */
+ __entry->hostid = le16_to_cpu(rec->host_id);
+ __entry->partition_id = rec->partition_index;
+ __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
+ __entry->length = le64_to_cpu(rec->extent.length);
+ memcpy(__entry->uuid, &rec->extent.uuid, UUID_SIZE);
+ __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
+ ),
+
+ CXL_EVT_TP_printk("event_type='%s' host_id='%d' partition_id='%d' " \
+ "starting_dpa=%llx length=%llx tag=%pU " \
+ "shared_extent_sequence=%d",
+ show_dc_evt_type(__entry->event_type),
+ __entry->hostid,
+ __entry->partition_id,
+ __entry->dpa_start,
+ __entry->length,
+ __entry->uuid,
+ __entry->sh_extent_seq
+ )
+);
+
#endif /* _CXL_EVENTS_H */
#define TRACE_INCLUDE_FILE trace
--
2.49.0
* [PATCH v9 18/19] tools/testing/cxl: Make event logs dynamic
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (16 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 17/19] cxl/mem: Trace Dynamic capacity Event Record Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-13 22:52 ` [PATCH v9 19/19] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
` (4 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
The event log test was built on static arrays as an easy way to mock
events. Dynamic Capacity Device (DCD) test support requires that events
be generated dynamically when extents are created or destroyed.
The current event log test has specific checks for the number of events
seen including log overflow.
Modify mock event logs to be dynamically allocated. Adjust array size
and mock event entry data to match the output expected by the existing
event test.
Use the static event data to create the dynamic events in the new logs
without inventing complex event injection for the previous tests.
Simplify log processing by using the event log array index as the
handle. Add a lock to manage concurrency required when user space is
allowed to control DCD extents.
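The handle arithmetic this commit introduces can be isolated into a small userspace sketch: handles are 1-based (a handle of 0 is invalid), so the ring carries one extra slot and the increment wraps past the array end back to 1. `log_full()` is a hypothetical name for the fullness check the patch performs inline in `mes_add_event()`:

```c
#include <stdint.h>

#define EVENT_CNT_MAX	16
/* One extra slot because slot 0 is never used as a handle. */
#define ARRAY_SZ	(EVENT_CNT_MAX + 1)

/* Advance a 1-based circular handle, skipping the invalid value 0,
 * as event_inc_handle() does in the mock driver. */
static uint16_t event_inc_handle(uint16_t handle)
{
	handle = (handle + 1) % ARRAY_SZ;
	if (handle == 0)
		handle = 1;
	return handle;
}

/* The log is full when advancing the producer (last_handle) would
 * collide with the consumer (current_handle). */
static int log_full(uint16_t last_handle, uint16_t current_handle)
{
	return event_inc_handle(last_handle) == current_handle;
}
```

With this scheme the valid handle range is 1..16, and the empty/full conditions are the usual one-slot-gap ring-buffer tests.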
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebase to 6.15-rc1]
---
tools/testing/cxl/test/mem.c | 268 ++++++++++++++++++++++++++-----------------
1 file changed, 162 insertions(+), 106 deletions(-)
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index f2957a3e36fe..a71a72966de1 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -142,18 +142,26 @@ static struct {
#define PASS_TRY_LIMIT 3
-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 16
+/* 1 extra slot to accommodate that handles can't be 0 */
+#define CXL_TEST_EVENT_ARRAY_SIZE (CXL_TEST_EVENT_CNT_MAX + 1)
/* Set a number of events to return at a time for simulation. */
#define CXL_TEST_EVENT_RET_MAX 4
+/*
+ * @last_handle: last handle (index) to have an entry stored
+ * @current_handle: current handle (index) to be returned to the user on get_event
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
struct mock_event_log {
- u16 clear_idx;
- u16 cur_idx;
- u16 nr_events;
+ u16 last_handle;
+ u16 current_handle;
u16 nr_overflow;
- u16 overflow_reset;
- struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+ rwlock_t lock;
+ struct cxl_event_record_raw *events[CXL_TEST_EVENT_ARRAY_SIZE];
};
struct mock_event_store {
@@ -194,56 +202,65 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
return &mdata->mes.mock_logs[log_type];
}
-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
- return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
- log->cur_idx = 0;
- log->clear_idx = 0;
- log->nr_overflow = log->overflow_reset;
-}
-
/* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
+static u16 event_inc_handle(u16 handle)
{
- return log->clear_idx + 1;
+ handle = (handle + 1) % CXL_TEST_EVENT_ARRAY_SIZE;
+ if (handle == 0)
+ handle = 1;
+ return handle;
}
-/* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
-{
- u16 cur_handle = log->cur_idx + 1;
-
- return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
- return log->cur_idx == log->nr_events;
-}
-
-static void mes_add_event(struct mock_event_store *mes,
+/* Add the event or free it on overflow */
+static void mes_add_event(struct cxl_mockmem_data *mdata,
enum cxl_event_log_type log_type,
struct cxl_event_record_raw *event)
{
+ struct device *dev = mdata->mds->cxlds.dev;
struct mock_event_log *log;
if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
return;
- log = &mes->mock_logs[log_type];
+ log = &mdata->mes.mock_logs[log_type];
+
+ guard(write_lock)(&log->lock);
- if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+ dev_dbg(dev, "Add log %d cur %d last %d\n",
+ log_type, log->current_handle, log->last_handle);
+
+ /* Check next buffer */
+ if (event_inc_handle(log->last_handle) == log->current_handle) {
log->nr_overflow++;
- log->overflow_reset = log->nr_overflow;
+ dev_dbg(dev, "Overflowing log %d nr %d\n",
+ log_type, log->nr_overflow);
+ devm_kfree(dev, event);
return;
}
- log->events[log->nr_events] = event;
- log->nr_events++;
+ dev_dbg(dev, "Log %d; handle %u\n", log_type, log->last_handle);
+ event->event.generic.hdr.handle = cpu_to_le16(log->last_handle);
+ log->events[log->last_handle] = event;
+ log->last_handle = event_inc_handle(log->last_handle);
+}
+
+static void mes_del_event(struct device *dev,
+ struct mock_event_log *log,
+ u16 handle)
+{
+ struct cxl_event_record_raw *record;
+
+ lockdep_assert(lockdep_is_held(&log->lock));
+
+ dev_dbg(dev, "Clearing event %u; record %u\n",
+ handle, log->current_handle);
+ record = log->events[handle];
+ if (!record)
+ dev_err(dev, "Mock event index %u empty?\n", handle);
+
+ log->events[handle] = NULL;
+ log->current_handle = event_inc_handle(log->current_handle);
+ devm_kfree(dev, record);
}
/*
@@ -256,7 +273,7 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_get_event_payload *pl;
struct mock_event_log *log;
- u16 nr_overflow;
+ u16 handle;
u8 log_type;
int i;
@@ -277,29 +294,38 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
memset(cmd->payload_out, 0, struct_size(pl, records, 0));
log = event_find_log(dev, log_type);
- if (!log || event_log_empty(log))
+ if (!log)
return 0;
pl = cmd->payload_out;
- for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
- memcpy(&pl->records[i], event_get_current(log),
- sizeof(pl->records[i]));
- pl->records[i].event.generic.hdr.handle =
- event_get_cur_event_handle(log);
- log->cur_idx++;
+ guard(read_lock)(&log->lock);
+
+ handle = log->current_handle;
+ dev_dbg(dev, "Get log %d handle %u last %u\n",
+ log_type, handle, log->last_handle);
+ for (i = 0; i < ret_limit && handle != log->last_handle;
+ i++, handle = event_inc_handle(handle)) {
+ struct cxl_event_record_raw *cur;
+
+ cur = log->events[handle];
+ dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+ log_type, le16_to_cpu(cur->event.generic.hdr.handle),
+ handle);
+ memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
+ pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
}
cmd->size_out = struct_size(pl, records, i);
pl->record_count = cpu_to_le16(i);
- if (!event_log_empty(log))
+ if (handle != log->last_handle)
pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
if (log->nr_overflow) {
u64 ns;
pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
- pl->overflow_err_count = cpu_to_le16(nr_overflow);
+ pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
ns = ktime_get_real_ns();
ns -= 5000000000; /* 5s ago */
pl->first_overflow_timestamp = cpu_to_le64(ns);
@@ -314,8 +340,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
- struct mock_event_log *log;
u8 log_type = pl->event_log;
+ struct mock_event_log *log;
u16 handle;
int nr;
@@ -326,23 +352,20 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
if (!log)
return 0; /* No mock data in this log */
- /*
- * This check is technically not invalid per the specification AFAICS.
- * (The host could 'guess' handles and clear them in order).
- * However, this is not good behavior for the host so test it.
- */
- if (log->clear_idx + pl->nr_recs > log->cur_idx) {
- dev_err(dev,
- "Attempting to clear more events than returned!\n");
- return -EINVAL;
- }
+ guard(write_lock)(&log->lock);
/* Check handle order prior to clearing events */
- for (nr = 0, handle = event_get_clear_handle(log);
- nr < pl->nr_recs;
- nr++, handle++) {
+ handle = log->current_handle;
+ for (nr = 0; nr < pl->nr_recs && handle != log->last_handle;
+ nr++, handle = event_inc_handle(handle)) {
+
+ dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+ log_type, handle,
+ le16_to_cpu(pl->handles[nr]));
+
if (handle != le16_to_cpu(pl->handles[nr])) {
- dev_err(dev, "Clearing events out of order\n");
+ dev_err(dev, "Clearing events out of order %u %u\n",
+ handle, le16_to_cpu(pl->handles[nr]));
return -EINVAL;
}
}
@@ -351,25 +374,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
log->nr_overflow = 0;
/* Clear events */
- log->clear_idx += pl->nr_recs;
- return 0;
-}
-
-static void cxl_mock_event_trigger(struct device *dev)
-{
- struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
- struct mock_event_store *mes = &mdata->mes;
- int i;
+ for (nr = 0; nr < pl->nr_recs; nr++)
+ mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
+ dev_dbg(dev, "Delete log %d cur %d last %d\n",
+ log_type, log->current_handle, log->last_handle);
- for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
- struct mock_event_log *log;
-
- log = event_find_log(dev, i);
- if (log)
- event_reset_log(log);
- }
-
- cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+ return 0;
}
struct cxl_event_record_raw maint_needed = {
@@ -510,8 +520,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
return 0;
}
-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static void add_event_from_static(struct cxl_mockmem_data *mdata,
+ enum cxl_event_log_type log_type,
+ struct cxl_event_record_raw *raw)
+{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_event_record_raw *rec;
+
+ rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
+ if (!rec) {
+ dev_err(dev, "Failed to alloc event for log\n");
+ return;
+ }
+ mes_add_event(mdata, log_type, rec);
+}
+
+static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
{
+ struct mock_event_store *mes = &mdata->mes;
+ struct device *dev = mdata->mds->cxlds.dev;
+
put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK |
CXL_GMER_VALID_COMPONENT | CXL_GMER_VALID_COMPONENT_ID_FORMAT,
&gen_media.rec.media_hdr.validity_flags);
@@ -524,43 +553,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
put_unaligned_le16(CXL_MMER_VALID_COMPONENT | CXL_MMER_VALID_COMPONENT_ID_FORMAT,
&mem_module.rec.validity_flags);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_INFO);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&mem_module);
mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FAIL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
+ (struct cxl_event_record_raw *)&mem_module);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&mem_module);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
/* Overflow this log */
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FATAL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
(struct cxl_event_record_raw *)&dram);
mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
}
+static void cxl_mock_event_trigger(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct mock_event_store *mes = &mdata->mes;
+
+ cxl_mock_add_event_logs(mdata);
+ cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1685,6 +1731,14 @@ static void cxl_mock_test_feat_init(struct cxl_mockmem_data *mdata)
mdata->test_feat.data = cpu_to_le32(0xdeadbeef);
}
+static void init_event_log(struct mock_event_log *log)
+{
+ rwlock_init(&log->lock);
+ /* Handles can never be 0; use 1-based indexing for handles */
+ log->current_handle = 1;
+ log->last_handle = 1;
+}
+
static int cxl_mock_mem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -1766,7 +1820,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
dev_dbg(dev, "No CXL Features discovered\n");
- cxl_mock_add_event_logs(&mdata->mes);
+ for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+ init_event_log(&mdata->mes.mock_logs[i]);
+ cxl_mock_add_event_logs(mdata);
cxlmd = devm_cxl_add_memdev(&pdev->dev, cxlds);
if (IS_ERR(cxlmd))
--
2.49.0
* [PATCH v9 19/19] tools/testing/cxl: Add DC Regions to mock mem data
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (17 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 18/19] tools/testing/cxl: Make event logs dynamic Ira Weiny
@ 2025-04-13 22:52 ` Ira Weiny
2025-04-14 16:11 ` [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
` (3 subsequent siblings)
22 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-13 22:52 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, nvdimm, linux-kernel
cxl_test provides a good way to ensure quick smoke and regression
testing. The complexity of Dynamic Capacity (DC) extent processing as
well as the complexity of the new sparse DAX regions can mostly be
tested through cxl_test. This includes management of sparse regions and
DAX devices on those regions; the management of extent device lifetimes;
and the processing of DCD events.
The only missing functionality from this test is actual interrupt
processing.
Mock memory devices can easily mock DC information and manage fake
extent data.
Define mock_dc_partition information within the mock memory data. Add
sysfs entries on the mock device to inject and delete extents.
The inject format is <start>:<length>:<tag>:<more_flag>
The delete format is <start>:<length>
Directly call the event irq callback to simulate irqs to process the
test extents.
Add DC mailbox commands to the CEL and implement those commands.
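As a sketch of the inject/delete formats above (the sysfs device path is illustrative, not taken from this patch; only the string formats are from the commit message), driving the mock device might look like:

```shell
# Illustrative sketch: DEV is an assumed mock-device path; the actual
# writes are shown as comments so the formats can be checked anywhere.
DEV=/sys/bus/platform/devices/cxl_mem0

start=0x40000000   # must be a multiple of the partition block size
length=0x10000000
tag=tag0           # any string up to 16 bytes
more=0             # 1 would signal another extent follows in this transaction

inject="$start:$length:$tag:$more"
delete="$start:$length"

printf '%s\n' "$inject"   # echo "$inject" > "$DEV/dc_inject_extent"
printf '%s\n' "$delete"   # echo "$delete" > "$DEV/dc_del_extent"
```

The delete string reuses only the first two fields, matching the `<start>:<length>` format accepted by dc_del_extent and dc_force_del_extent.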
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: rebase]
[djbw: s/region/partition/]
[iweiny: s/tag/uuid/]
---
tools/testing/cxl/test/mem.c | 753 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 753 insertions(+)
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index a71a72966de1..a85a04168434 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -20,6 +20,7 @@
#define FW_SLOTS 3
#define DEV_SIZE SZ_2G
#define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
#define MOCK_INJECT_DEV_MAX 8
#define MOCK_INJECT_TEST_MAX 128
@@ -113,6 +114,22 @@ static struct cxl_cel_entry mock_cel[] = {
EFFECT(SECURITY_CHANGE_IMMEDIATE) |
EFFECT(BACKGROUND_OP)),
},
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
};
/* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -173,6 +190,8 @@ struct vendor_test_feat {
__le32 data;
} __packed;
+#define NUM_MOCK_DC_REGIONS 2
+
struct cxl_mockmem_data {
void *lsa;
void *fw;
@@ -191,6 +210,20 @@ struct cxl_mockmem_data {
unsigned long sanitize_timeout;
struct vendor_test_feat test_feat;
u8 shutdown_state;
+
+ struct cxl_dc_partition dc_partitions[NUM_MOCK_DC_REGIONS];
+ u32 dc_ext_generation;
+ struct mutex ext_lock;
+
+ /*
+ * Extents are in 1 of 3 states
+ * FM (sysfs added but not sent to the host yet)
+ * sent (sent to the host but not accepted)
+ * accepted (by the host)
+ */
+ struct xarray dc_fm_extents;
+ struct xarray dc_sent_extents;
+ struct xarray dc_accepted_exts;
};
static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -607,6 +640,251 @@ static void cxl_mock_event_trigger(struct device *dev)
cxl_mem_get_event_records(mdata->mds, mes->ev_status);
}
+struct cxl_extent_data {
+ u64 dpa_start;
+ u64 length;
+ u8 uuid[UUID_SIZE];
+ bool shared;
+};
+
+static int __devm_add_extent(struct device *dev, struct xarray *array,
+ u64 start, u64 length, const char *uuid,
+ bool shared)
+{
+ struct cxl_extent_data *extent;
+
+ extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ extent->dpa_start = start;
+ extent->length = length;
+ memcpy(extent->uuid, uuid, min(sizeof(extent->uuid), strlen(uuid)));
+ extent->shared = shared;
+
+ if (xa_insert(array, start, extent, GFP_KERNEL)) {
+ devm_kfree(dev, extent);
+ dev_err(dev, "Failed xarray insert %#llx\n", start);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int devm_add_fm_extent(struct device *dev, u64 start, u64 length,
+ const char *uuid, bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ guard(mutex)(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_fm_extents, start, length,
+ uuid, shared);
+}
+
+/* It is known that ext and the new range are not equal */
+static struct cxl_extent_data *
+split_ext(struct device *dev, struct xarray *array,
+ struct cxl_extent_data *ext, u64 start, u64 length)
+{
+ u64 new_start, new_length;
+
+ if (ext->dpa_start == start) {
+ new_start = start + length;
+ new_length = (ext->dpa_start + ext->length) - new_start;
+
+ if (__devm_add_extent(dev, array, new_start, new_length,
+ ext->uuid, false))
+ return NULL;
+
+ ext = xa_erase(array, ext->dpa_start);
+ if (__devm_add_extent(dev, array, start, length, ext->uuid,
+ false))
+ return NULL;
+
+ return xa_load(array, start);
+ }
+
+ /* ext->dpa_start != start */
+
+ if (__devm_add_extent(dev, array, start, length, ext->uuid, false))
+ return NULL;
+
+ new_start = ext->dpa_start;
+ new_length = start - ext->dpa_start;
+
+ ext = xa_erase(array, ext->dpa_start);
+ if (__devm_add_extent(dev, array, new_start, new_length, ext->uuid,
+ false))
+ return NULL;
+
+ return xa_load(array, start);
+}
+
+/*
+ * Only handle ranges that lie entirely within a single extent sent to
+ * the host.
+ */
+static struct cxl_extent_data *
+find_create_ext(struct device *dev, struct xarray *array, u64 start, u64 length)
+{
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ xa_for_each(array, index, ext) {
+ u64 end = start + length;
+
+ /* skip unless ext->dpa_start <= start < ext->dpa_start + ext->length */
+ if (start < ext->dpa_start ||
+ (ext->dpa_start + ext->length) <= start)
+ continue;
+
+ if (end <= ext->dpa_start ||
+ (ext->dpa_start + ext->length) < end) {
+ dev_err(dev, "Invalid range %#llx-%#llx\n", start,
+ end);
+ return NULL;
+ }
+
+ break;
+ }
+
+ if (!ext)
+ return NULL;
+
+ if (start == ext->dpa_start && length == ext->length)
+ return ext;
+
+ return split_ext(dev, array, ext, start, length);
+}
+
+static int dc_accept_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ dev_dbg(dev, "Host accepting extent %#llx\n", start);
+ mdata->dc_ext_generation++;
+
+ lockdep_assert_held(&mdata->ext_lock);
+ ext = find_create_ext(dev, &mdata->dc_sent_extents, start, length);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx-%#llx not found\n",
+ start, start + length);
+ return -ENOMEM;
+ }
+ ext = xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
+ return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+ struct cxl_mockmem_data *mdata = md;
+
+ xa_destroy(&mdata->dc_fm_extents);
+ xa_destroy(&mdata->dc_sent_extents);
+ xa_destroy(&mdata->dc_accepted_exts);
+}
+
+/* Pretend to have some previously accepted extents */
+struct pre_ext_info {
+ u64 offset;
+ u64 length;
+} pre_ext_info[] = {
+ {
+ .offset = SZ_128M,
+ .length = SZ_64M,
+ },
+ {
+ .offset = SZ_256M,
+ .length = SZ_64M,
+ },
+};
+
+static int devm_add_sent_extent(struct device *dev, u64 start, u64 length,
+ const char *tag, bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ lockdep_assert_held(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_sent_extents, start, length,
+ tag, shared);
+}
+
+static int inject_prev_extents(struct device *dev, u64 base_dpa)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ int rc;
+
+ dev_dbg(dev, "Adding %ld pre-extents for testing\n",
+ ARRAY_SIZE(pre_ext_info));
+
+ guard(mutex)(&mdata->ext_lock);
+ for (int i = 0; i < ARRAY_SIZE(pre_ext_info); i++) {
+ u64 ext_dpa = base_dpa + pre_ext_info[i].offset;
+ u64 ext_len = pre_ext_info[i].length;
+
+ dev_dbg(dev, "Adding pre-extent DPA:%#llx LEN:%#llx\n",
+ ext_dpa, ext_len);
+
+ rc = devm_add_sent_extent(dev, ext_dpa, ext_len, "", false);
+ if (rc) {
+ dev_err(dev, "Failed to add pre-extent DPA:%#llx LEN:%#llx; %d\n",
+ ext_dpa, ext_len, rc);
+ return rc;
+ }
+
+ rc = dc_accept_extent(dev, ext_dpa, ext_len);
+ if (rc)
+ return rc;
+ }
+ return 0;
+}
+
+static int cxl_mock_dc_partition_setup(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+ u32 dsmad_handle = 0xFADE;
+ u64 decode_length = SZ_512M;
+ u64 block_size = SZ_512;
+ u64 length = SZ_512M;
+ int rc;
+
+ mutex_init(&mdata->ext_lock);
+ xa_init(&mdata->dc_fm_extents);
+ xa_init(&mdata->dc_sent_extents);
+ xa_init(&mdata->dc_accepted_exts);
+
+ rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+ if (rc)
+ return rc;
+
+ for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ struct cxl_dc_partition *part = &mdata->dc_partitions[i];
+
+ dev_dbg(dev, "Creating DC partition DC%d DPA:%#llx LEN:%#llx\n",
+ i, base_dpa, length);
+
+ part->base = cpu_to_le64(base_dpa);
+ part->decode_length = cpu_to_le64(decode_length /
+ CXL_CAPACITY_MULTIPLIER);
+ part->length = cpu_to_le64(length);
+ part->block_size = cpu_to_le64(block_size);
+ part->dsmad_handle = cpu_to_le32(dsmad_handle);
+ dsmad_handle++;
+
+ rc = inject_prev_extents(dev, base_dpa);
+ if (rc) {
+ dev_err(dev, "Failed to add pre-extents for DC%d\n", i);
+ return rc;
+ }
+
+ base_dpa += decode_length;
+ }
+
+ return 0;
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1582,6 +1860,192 @@ static int mock_get_supported_features(struct cxl_mockmem_data *mdata,
return 0;
}
+static int mock_get_dc_config(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u8 partition_requested, partition_start_idx, partition_ret_cnt;
+ struct cxl_mbox_get_dc_config_out *resp;
+ int i;
+
+ partition_requested = min(dc_config->partition_count, NUM_MOCK_DC_REGIONS);
+
+ if (cmd->size_out < struct_size(resp, partition, partition_requested))
+ return -EINVAL;
+
+ memset(cmd->payload_out, 0, cmd->size_out);
+ resp = cmd->payload_out;
+
+ partition_start_idx = dc_config->start_partition_index;
+ partition_ret_cnt = 0;
+ for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ if (i >= partition_start_idx) {
+ memcpy(&resp->partition[partition_ret_cnt],
+ &mdata->dc_partitions[i],
+ sizeof(resp->partition[partition_ret_cnt]));
+ partition_ret_cnt++;
+ }
+ }
+ resp->avail_partition_count = NUM_MOCK_DC_REGIONS;
+ resp->partitions_returned = partition_ret_cnt;
+
+ dev_dbg(dev, "Returning %d dc partitions\n", partition_ret_cnt);
+ return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_get_extent_in *get = cmd->payload_in;
+ u32 total_avail = 0, total_ret = 0;
+ struct cxl_extent_data *ext;
+ u32 ext_count, start_idx;
+ unsigned long i;
+
+ ext_count = le32_to_cpu(get->extent_cnt);
+ start_idx = le32_to_cpu(get->start_extent_index);
+
+ memset(resp, 0, sizeof(*resp));
+
+ guard(mutex)(&mdata->ext_lock);
+ /*
+ * Total available needs to be calculated and returned regardless of
+ * how many can actually be returned.
+ */
+ xa_for_each(&mdata->dc_accepted_exts, i, ext)
+ total_avail++;
+
+ if (start_idx > total_avail)
+ return -EINVAL;
+
+ xa_for_each(&mdata->dc_accepted_exts, i, ext) {
+ if (total_ret >= ext_count)
+ break;
+
+ if (total_ret >= start_idx) {
+ resp->extent[total_ret].start_dpa =
+ cpu_to_le64(ext->dpa_start);
+ resp->extent[total_ret].length =
+ cpu_to_le64(ext->length);
+ memcpy(&resp->extent[total_ret].uuid, ext->uuid,
+ sizeof(resp->extent[total_ret].uuid));
+ total_ret++;
+ }
+ }
+
+ resp->returned_extent_count = cpu_to_le32(total_ret);
+ resp->total_extent_count = cpu_to_le32(total_avail);
+ resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
+
+ dev_dbg(dev, "Returning %d extents of %d total\n",
+ total_ret, total_avail);
+
+ return 0;
+}
+
+static void dc_clear_sent(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ lockdep_assert_held(&mdata->ext_lock);
+
+ /* Any extents not accepted must be cleared */
+ xa_for_each(&mdata->dc_sent_extents, index, ext) {
+ dev_dbg(dev, "Host rejected extent %#llx\n", ext->dpa_start);
+ xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
+ }
+}
+
+static int mock_add_dc_response(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ guard(mutex)(&mdata->ext_lock);
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+ int rc;
+
+ rc = dc_accept_extent(dev, start, length);
+ if (rc)
+ return rc;
+ }
+
+ dc_clear_sent(dev);
+ return 0;
+}
+
+static void dc_delete_extent(struct device *dev, unsigned long long start,
+ unsigned long long length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long end = start + length;
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
+
+ guard(mutex)(&mdata->ext_lock);
+ xa_for_each(&mdata->dc_fm_extents, index, ext) {
+ u64 extent_end = ext->dpa_start + ext->length;
+
+ /*
+ * Any extent which 'touches' the deleted range will be
+ * removed.
+ */
+ if ((start <= ext->dpa_start && ext->dpa_start < end) ||
+ (start <= extent_end && extent_end < end))
+ xa_erase(&mdata->dc_fm_extents, ext->dpa_start);
+ }
+
+ /*
+ * If the extent was accepted, leave it for the host to drop
+ * later.
+ */
+}
+
+static int release_accepted_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = find_create_ext(dev, &mdata->dc_accepted_exts, start, length);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx not in accepted state\n", start);
+ return -EINVAL;
+ }
+ xa_erase(&mdata->dc_accepted_exts, ext->dpa_start);
+ mdata->dc_ext_generation++;
+
+ return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+
+ dev_dbg(dev, "Extent %#llx released by host\n", start);
+ release_accepted_extent(dev, start, length);
+ }
+
+ return 0;
+}
+
static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd)
{
@@ -1673,6 +2137,18 @@ static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
case CXL_MBOX_OP_GET_SUPPORTED_FEATURES:
rc = mock_get_supported_features(mdata, cmd);
break;
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ rc = mock_get_dc_config(dev, cmd);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ rc = mock_get_dc_extent_list(dev, cmd);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ rc = mock_add_dc_response(dev, cmd);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ rc = mock_dc_release(dev, cmd);
+ break;
case CXL_MBOX_OP_GET_FEATURE:
rc = mock_get_feature(mdata, cmd);
break;
@@ -1755,6 +2231,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
return -ENOMEM;
dev_set_drvdata(dev, mdata);
+ rc = cxl_mock_dc_partition_setup(dev);
+ if (rc)
+ return rc;
+
mdata->lsa = vmalloc(LSA_SIZE);
if (!mdata->lsa)
return -ENOMEM;
@@ -1812,6 +2292,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;
+ if (cxl_dcd_supported(mds))
+ cxl_configure_dcd(mds, &range_info);
+
rc = cxl_dpa_setup(cxlds, &range_info);
if (rc)
return rc;
@@ -1936,11 +2419,281 @@ static ssize_t sanitize_timeout_store(struct device *dev,
static DEVICE_ATTR_RW(sanitize_timeout);
+/* Return false if the proposed extent would break the test code */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+ size_t new_len)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *extent;
+ size_t new_end, i;
+
+ if (!new_len)
+ return false;
+
+ new_end = new_start + new_len;
+
+ dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+ guard(mutex)(&mdata->ext_lock);
+ dev_dbg(dev, "Checking extents starts...\n");
+ xa_for_each(&mdata->dc_fm_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking sent extents starts...\n");
+ xa_for_each(&mdata->dc_sent_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking accepted extents starts...\n");
+ xa_for_each(&mdata->dc_accepted_exts, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ return true;
+}
+
+struct cxl_test_dcd {
+ uuid_t id;
+ struct cxl_event_dcd rec;
+} __packed;
+
+struct cxl_test_dcd dcd_event_rec_template = {
+ .id = CXL_EVENT_DC_EVENT_UUID,
+ .rec = {
+ .hdr = {
+ .length = sizeof(struct cxl_test_dcd),
+ },
+ },
+};
+
+static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
+ u64 start, u64 length, const char *tag_str, bool more)
+{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_test_dcd *dcd_event;
+
+ dev_dbg(dev, "mock device log event %d\n", type);
+
+ dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
+ sizeof(*dcd_event), GFP_KERNEL);
+ if (!dcd_event)
+ return -ENOMEM;
+
+ dcd_event->rec.flags = 0;
+ if (more)
+ dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
+ dcd_event->rec.event_type = type;
+ dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
+ dcd_event->rec.extent.length = cpu_to_le64(length);
+ memcpy(dcd_event->rec.extent.uuid, tag_str,
+ min(sizeof(dcd_event->rec.extent.uuid),
+ strlen(tag_str)));
+
+ mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
+ (struct cxl_event_record_raw *)dcd_event);
+
+ /* Fake the irq */
+ cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
+
+ return 0;
+}
+
+static void mark_extent_sent(struct device *dev, unsigned long long start)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = xa_erase(&mdata->dc_fm_extents, start);
+ if (xa_insert(&mdata->dc_sent_extents, ext->dpa_start, ext, GFP_KERNEL))
+ dev_err(dev, "Failed to mark extent %#llx sent\n", ext->dpa_start);
+}
+
+/*
+ * Format <start>:<length>:<tag>:<more_flag>
+ *
+ * start and length must be a multiple of the configured partition block size.
+ * Tag can be any string up to 16 bytes.
+ *
+ * Extents must be exclusive of other extents
+ *
+ * If the more flag is specified it is expected that an additional extent will
+ * be specified without the more flag to complete the test transaction with the
+ * host.
+ */
+static ssize_t __dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long start, length, more;
+ char *len_str, *uuid_str, *more_str;
+ size_t buf_len = count;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, buf_len, ':');
+ if (!len_str) {
+ dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ *len_str = '\0';
+ len_str += 1;
+ buf_len -= strlen(start_str);
+
+ uuid_str = strnchr(len_str, buf_len, ':');
+ if (!uuid_str) {
+ dev_err(dev, "Extent failed to find uuid_str: %s\n", len_str);
+ return -EINVAL;
+ }
+ *uuid_str = '\0';
+ uuid_str += 1;
+
+ more_str = strnchr(uuid_str, buf_len, ':');
+ if (!more_str) {
+ dev_err(dev, "Extent failed to find more_str: %s\n", uuid_str);
+ return -EINVAL;
+ }
+ *more_str = '\0';
+ more_str += 1;
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(more_str, 0, &more)) {
+ dev_err(dev, "Extent failed to parse more: %s\n", more_str);
+ return -EINVAL;
+ }
+
+ if (!new_extent_valid(dev, start, length))
+ return -EINVAL;
+
+ rc = devm_add_fm_extent(dev, start, length, uuid_str, shared);
+ if (rc) {
+ dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
+ start, length, rc);
+ return rc;
+ }
+
+ mark_extent_sent(dev, start);
+ rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, uuid_str, more);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
+ return count;
+}
+
+static ssize_t dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_inject_extent_store(dev, attr, buf, count, false);
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t dc_inject_shared_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_inject_extent_store(dev, attr, buf, count, true);
+}
+static DEVICE_ATTR_WO(dc_inject_shared_extent);
+
+static ssize_t __dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ enum dc_event type)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long start, length;
+ char *len_str;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, count, ':');
+ if (!len_str) {
+ dev_err(dev, "Failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+ *len_str = '\0';
+ len_str += 1;
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ dc_delete_extent(dev, start, length);
+
+ if (type == DCD_FORCED_CAPACITY_RELEASE)
+ dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
+ start, length);
+
+ rc = log_dc_event(mdata, type, start, length, "", false);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
+ return count;
+}
+
+/*
+ * Format <start>:<length>
+ */
+static ssize_t dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_RELEASE_CAPACITY);
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_FORCED_CAPACITY_RELEASE);
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
static struct attribute *cxl_mock_mem_attrs[] = {
&dev_attr_security_lock.attr,
&dev_attr_event_trigger.attr,
&dev_attr_fw_buf_checksum.attr,
&dev_attr_sanitize_timeout.attr,
+ &dev_attr_dc_inject_extent.attr,
+ &dev_attr_dc_inject_shared_extent.attr,
+ &dev_attr_dc_del_extent.attr,
+ &dev_attr_dc_force_del_extent.attr,
NULL
};
ATTRIBUTE_GROUPS(cxl_mock_mem);
--
2.49.0
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (18 preceding siblings ...)
2025-04-13 22:52 ` [PATCH v9 19/19] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
@ 2025-04-14 16:11 ` Fan Ni
2025-04-15 2:37 ` Ira Weiny
2025-04-14 16:47 ` Jonathan Cameron
` (2 subsequent siblings)
22 siblings, 1 reply; 65+ messages in thread
From: Fan Ni @ 2025-04-14 16:11 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
>
> This is now based on 6.15-rc2.
>
> Due to the stagnation of solid requirements for users of DCD I do not
> plan to rev this work in Q2 of 2025 and possibly beyond.
>
> It is anticipated that this will support at least the initial
> implementation of DCD devices, if and when they appear in the ecosystem.
> The patch set should be reviewed with the limited set of functionality in
> mind. Additional functionality can be added as devices support them.
>
> It is strongly encouraged that individuals or companies wishing to bring
> DCD devices to market review this set with the customer use cases they
> have in mind.
Hi Ira,
thanks for sending it out.
I have not gotten a chance to check the code or test it extensively.
I tried to test one specific case and hit an issue.
I tried to add some DC extents to the extent list on the device when the
VM is launched by hacking QEMU as below,
diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index 87fa308495..4049fc8dd9 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
QTAILQ_INIT(&ct3d->dc.extents);
QTAILQ_INIT(&ct3d->dc.extents_pending);
+ cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
+ CXL_CAPACITY_MULTIPLIER, NULL, 0);
+ ct3d->dc.total_extent_count = 1;
+ ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
+
return true;
}
Then after the VM is launched, I tried to create a DC region with the
command: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
dynamic_ram_a.
It works fine. As you can see below, the region is created and the
extent is showing correctly.
root@debian:~# cxl list -r region0 -N
[
{
"region":"region0",
"resource":79725330432,
"size":1073741824,
"interleave_ways":1,
"interleave_granularity":256,
"decode_state":"commit",
"extents":[
{
"offset":0,
"length":268435456,
"uuid":"00000000-0000-0000-0000-000000000000"
}
]
}
]
However, after that, I tried to create a dax device as below, but it failed.
root@debian:~# daxctl create-device -r region0 -v
libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
error creating devices: No such device or address
created 0 devices
root@debian:~#
root@debian:~# ls /sys/class/dax
ls: cannot access '/sys/class/dax': No such file or directory
The dmesg shows the really_probe() function returns early because a
resource is present before probe, as below:
[ 1745.505068] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
[ 1745.506063] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
[ 1745.506953] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[ 1745.507911] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
[ 1745.508958] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
[ 1745.509843] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
[ 1745.511748] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
[ 1745.512626] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x3fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
[ 1745.514143] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
[ 1745.515485] cxl_core:online_region_extent:176: extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
[ 1745.516576] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
[ 1745.517768] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
[ 1745.524984] cxl dax_region0: Resources present before probing
btw, I hit the same issue with the previous version also.
Fan
>
> Series info
> ===========
>
> This series has 2 parts:
>
> Patch 1-17: Core DCD support
> Patch 18-19: cxl_test support
>
> Background
> ==========
>
> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows memory capacity within a region to change
> dynamically without the need for resetting the device, reconfiguring
> HDM decoders, or reconfiguring software DAX regions.
>
> One of the biggest anticipated use cases for Dynamic Capacity is to
> allow hosts to dynamically add or remove memory from a host within a
> data center without physically changing the per-host attached memory nor
> rebooting the host.
>
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory. Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Logical
> device, the Host Kernel, and a Host User.
>
> An example work flow is shown below.
>
> Orchestrator FM Device Host Kernel Host User
>
> | | | | |
> |-------------- Create region ------------------------>|
> | | | | |
> | | | |<-- Create ----|
> | | | | Region |
> | | | |(dynamic_ram_a)|
> |<------------- Signal done ---------------------------|
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create ---->|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | | | |
> | | | |<- Release ----| <-+
> | | | | DAX dev |
> | | | | |
> |<------------- Signal done ---------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create -----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | |<- Release ----| <-+
> | | | | DAX dev |
> |<------------- Signal done ---------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | |<- Create -----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> |-- Remove -->|- Release->|- Release ->| | |
> | Capacity | Extent | Extent | | |
> | | | | | |
> | | | (Release Ignored) | |
> | | | | | |
> | | | |<- Release ----| <-+
> | | | | DAX dev |
> |<------------- Signal done ---------------------------|
> | | | | |
> | |- Release->|- Release ->| |
> | | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | |<- Destroy ----|
> | | | | Region |
> | | | | |
>
> Implementation
> ==============
>
> This series requires the creation of regions and DAX devices to be
> closely synchronized with the Orchestrator and Fabric Manager. The host
> kernel will reject extents if a region is not yet created. It also
> ignores extent release if memory is in use (DAX device created). These
> synchronizations are not anticipated to be an issue with real
> applications.
>
> Only a single dynamic ram partition is supported (dynamic_ram_a). The
> requirements, use cases, and existence of actual hardware devices to
> support more than one DC partition is unknown at this time. So a less
> complex implementation was chosen.
>
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced. A sparse DAX region may have 0 or
> more bytes of available space. The total space depends on the number
> and size of the extents which have been added.
>
> It is anticipated that users of the memory will carefully coordinate the
> surfacing of capacity with the creation of DAX devices which use that
> capacity. Therefore, the allocation of the memory to DAX devices does
> not allow for specific associations between DAX device and extent. This
> keeps allocations of DAX devices similar to existing DAX region
> behavior.
>
> To keep the DAX memory allocation aligned with the existing DAX devices
> which do not have tags, extents are not allowed to have tags in this
> implementation. Future support for tags can be added when real use
> cases surface.
>
> Great care was taken to keep the extent tracking simple. Some xarray's
> needed to be added but extra software objects are kept to a minimum.
>
> Region extents are tracked as sub-devices of the DAX region. This
> ensures that region destruction cleans up all extent allocations
> properly.
>
> The major functionality of this series includes:
>
> - Getting the dynamic capacity (DC) configuration information from cxl
> devices
>
> - Configuring a DC partition found in hardware.
>
> - Enhancing the CXL and DAX regions for dynamic capacity support
> a. Maintain a logical separation between hardware extents and
> software managed extents. This provides an abstraction
> between the layers and should allow for interleaving in the
> future
>
> - Get existing hardware extent lists for endpoint decoders upon region
> creation.
>
> - Respond to DC capacity events and adjust available region memory.
> a. Add capacity Events
> b. Release capacity events
>
> - Host response for add capacity
> a. do not accept the extent if:
> If the region does not exist
> or an error occurs realizing the extent
> b. If the region does exist
> realize a DAX region extent with 1:1 mapping (no
> interleave yet)
> c. Support the event more bit by processing a list of extents
> marked with the more bit together before setting up a
> response.
>
> - Host response for remove capacity
> a. If no DAX device references the extent; release the extent
> b. If a reference does exist, ignore the request.
> (Require FM to issue release again.)
> c. Release extents flagged with the 'more' bit individually as
> the specification allows for the asynchronous release of
> memory and the implementation is simplified by doing so.
>
> - Modify DAX device creation/resize to account for extents within a
> sparse DAX region
>
> - Trace Dynamic Capacity events for debugging
>
> - Add cxl-test infrastructure to allow for faster unit testing
> (See new ndctl branch for cxl-dcd.sh test[1])
>
> - Only support 0 value extent tags
>
> Fan Ni's upstream of Qemu DCD was used for testing.
>
> Remaining work:
>
> 1) Allow mapping to specific extents (perhaps based on
> label/tag)
> 1a) devise region size reporting based on tags
> 2) Interleave support
>
> Possible additional work depending on requirements:
>
> 1) Accept a new extent which extends (but overlaps) already
> accepted extent(s)
> 2) Rework DAX device interfaces, memfd has been explored a bit
> 3) Support more than 1 DC partition
>
> [1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
>
> ---
> Changes in v9:
> - djbw: pare down support to only a single DC parition
> - djbw: adjust to the new core partition processing which aligns with
> new type2 work.
> - iweiny: address smaller comments from v8
> - iweiny: rebase off of 6.15-rc1
> - Link to v8: https://patch.msgid.link/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com
>
> ---
> Ira Weiny (19):
> cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> cxl/mem: Read dynamic capacity configuration from the device
> cxl/cdat: Gather DSMAS data for DCD partitions
> cxl/core: Enforce partition order/simplify partition calls
> cxl/mem: Expose dynamic ram A partition in sysfs
> cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
> cxl/region: Add sparse DAX region support
> cxl/events: Split event msgnum configuration from irq setup
> cxl/pci: Factor out interrupt policy check
> cxl/mem: Configure dynamic capacity interrupts
> cxl/core: Return endpoint decoder information from region search
> cxl/extent: Process dynamic partition events and realize region extents
> cxl/region/extent: Expose region extent information in sysfs
> dax/bus: Factor out dev dax resize logic
> dax/region: Create resources on sparse DAX regions
> cxl/region: Read existing extents on region creation
> cxl/mem: Trace Dynamic capacity Event Record
> tools/testing/cxl: Make event logs dynamic
> tools/testing/cxl: Add DC Regions to mock mem data
>
> Documentation/ABI/testing/sysfs-bus-cxl | 100 ++-
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/cdat.c | 11 +
> drivers/cxl/core/core.h | 33 +-
> drivers/cxl/core/extent.c | 495 +++++++++++++++
> drivers/cxl/core/hdm.c | 13 +-
> drivers/cxl/core/mbox.c | 632 ++++++++++++++++++-
> drivers/cxl/core/memdev.c | 87 ++-
> drivers/cxl/core/port.c | 5 +
> drivers/cxl/core/region.c | 76 ++-
> drivers/cxl/core/trace.h | 65 ++
> drivers/cxl/cxl.h | 61 +-
> drivers/cxl/cxlmem.h | 134 +++-
> drivers/cxl/mem.c | 2 +-
> drivers/cxl/pci.c | 115 +++-
> drivers/dax/bus.c | 356 +++++++++--
> drivers/dax/bus.h | 4 +-
> drivers/dax/cxl.c | 71 ++-
> drivers/dax/dax-private.h | 40 ++
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> include/cxl/event.h | 31 +
> include/linux/ioport.h | 3 +
> tools/testing/cxl/Kbuild | 3 +-
> tools/testing/cxl/test/mem.c | 1021 +++++++++++++++++++++++++++----
> 25 files changed, 3102 insertions(+), 262 deletions(-)
> ---
> base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
>
> Best regards,
> --
> Ira Weiny <ira.weiny@intel.com>
>
^ permalink raw reply related [flat|nested] 65+ messages in thread

* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-14 16:11 ` [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
@ 2025-04-15 2:37 ` Ira Weiny
2025-04-15 2:47 ` Fan Ni
` (2 more replies)
0 siblings, 3 replies; 65+ messages in thread
From: Ira Weiny @ 2025-04-15 2:37 UTC (permalink / raw)
To: Fan Ni, Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
Fan Ni wrote:
> On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> >
> > https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> >
> > This is now based on 6.15-rc2.
> >
> > Due to the stagnation of solid requirements for users of DCD I do not
> > plan to rev this work in Q2 of 2025 and possibly beyond.
> >
> > It is anticipated that this will support at least the initial
> > implementation of DCD devices, if and when they appear in the ecosystem.
> > The patch set should be reviewed with the limited set of functionality in
> > mind. Additional functionality can be added as devices support them.
> >
> > It is strongly encouraged for individuals or companies wishing to bring
> > DCD devices to market review this set with the customer use cases they
> > have in mind.
>
> Hi Ira,
> thanks for sending it out.
>
> I have not got a chance to check the code or test it extensively.
>
> I tried to test one specific case and hit issue.
>
> I tried to add some DC extents to the extent list on the device when the
> VM is launched by hacking qemu like below,
>
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 87fa308495..4049fc8dd9 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
> QTAILQ_INIT(&ct3d->dc.extents);
> QTAILQ_INIT(&ct3d->dc.extents_pending);
>
> + cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
> + CXL_CAPACITY_MULTIPLIER, NULL, 0);
> + ct3d->dc.total_extent_count = 1;
> + ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
> +
> return true;
> }
>
>
> Then after the VM is launched, I tried to create a DC region with
> commmand: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
> dynamic_ram_a.
>
> It works fine. As you can see below, the region is created and the
> extent is showing correctly.
>
> root@debian:~# cxl list -r region0 -N
> [
> {
> "region":"region0",
> "resource":79725330432,
> "size":1073741824,
> "interleave_ways":1,
> "interleave_granularity":256,
> "decode_state":"commit",
> "extents":[
> {
> "offset":0,
> "length":268435456,
> "uuid":"00000000-0000-0000-0000-000000000000"
> }
> ]
> }
> ]
>
>
> However, after that, I tried to create a dax device as below, it failed.
>
> root@debian:~# daxctl create-device -r region0 -v
> libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
> error creating devices: No such device or address
> created 0 devices
> root@debian:~#
>
> root@debian:~# ls /sys/class/dax
> ls: cannot access '/sys/class/dax': No such file or directory
Have you updated daxctl along with cxl-cli?
I was confused by this lack of /sys/class/dax and checked with Vishal. He
says this is legacy.
I have /sys/bus/dax and that works fine for me with the latest daxctl
built from the ndctl code I sent out:
https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
Could you build and use the executables from that version?
Ira
>
> The dmesg shows the really_probe function returns early as resource
> presents before probe as below,
>
> [ 1745.505068] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
> [ 1745.506063] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
> [ 1745.506953] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
> [ 1745.507911] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
> [ 1745.508958] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
> [ 1745.509843] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
> [ 1745.511748] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
> [ 1745.512626] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x3fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
> [ 1745.514143] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
> [ 1745.515485] cxl_core:online_region_extent:176: extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
> [ 1745.516576] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
> [ 1745.517768] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
> [ 1745.524984] cxl dax_region0: Resources present before probing
>
>
> btw, I hit the same issue with the previous verson also.
>
> Fan
[snip]
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-15 2:37 ` Ira Weiny
@ 2025-04-15 2:47 ` Fan Ni
2025-04-15 4:28 ` Dan Williams
2025-05-13 18:55 ` Fan Ni
2 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-04-15 2:47 UTC (permalink / raw)
To: Ira Weiny
Cc: Fan Ni, Dave Jiang, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Mon, Apr 14, 2025 at 09:37:02PM -0500, Ira Weiny wrote:
> Fan Ni wrote:
> > On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > > A git tree of this series can be found here:
> > >
> > > https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > >
> > > This is now based on 6.15-rc2.
> > >
> > > Due to the stagnation of solid requirements for users of DCD I do not
> > > plan to rev this work in Q2 of 2025 and possibly beyond.
> > >
> > > It is anticipated that this will support at least the initial
> > > implementation of DCD devices, if and when they appear in the ecosystem.
> > > The patch set should be reviewed with the limited set of functionality in
> > > mind. Additional functionality can be added as devices support them.
> > >
> > > It is strongly encouraged for individuals or companies wishing to bring
> > > DCD devices to market review this set with the customer use cases they
> > > have in mind.
> >
> > Hi Ira,
> > thanks for sending it out.
> >
> > I have not got a chance to check the code or test it extensively.
> >
> > I tried to test one specific case and hit issue.
> >
> > I tried to add some DC extents to the extent list on the device when the
> > VM is launched by hacking qemu like below,
> >
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index 87fa308495..4049fc8dd9 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
> > QTAILQ_INIT(&ct3d->dc.extents);
> > QTAILQ_INIT(&ct3d->dc.extents_pending);
> >
> > + cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
> > + CXL_CAPACITY_MULTIPLIER, NULL, 0);
> > + ct3d->dc.total_extent_count = 1;
> > + ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
> > +
> > return true;
> > }
> >
> >
> > Then after the VM is launched, I tried to create a DC region with
> > commmand: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
> > dynamic_ram_a.
> >
> > It works fine. As you can see below, the region is created and the
> > extent is showing correctly.
> >
> > root@debian:~# cxl list -r region0 -N
> > [
> > {
> > "region":"region0",
> > "resource":79725330432,
> > "size":1073741824,
> > "interleave_ways":1,
> > "interleave_granularity":256,
> > "decode_state":"commit",
> > "extents":[
> > {
> > "offset":0,
> > "length":268435456,
> > "uuid":"00000000-0000-0000-0000-000000000000"
> > }
> > ]
> > }
> > ]
> >
> >
> > However, after that, I tried to create a dax device as below, it failed.
> >
> > root@debian:~# daxctl create-device -r region0 -v
> > libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
> > error creating devices: No such device or address
> > created 0 devices
> > root@debian:~#
> >
> > root@debian:~# ls /sys/class/dax
> > ls: cannot access '/sys/class/dax': No such file or directory
>
> Have you update daxctl with cxl-cli?
>
> I was confused by this lack of /sys/class/dax and checked with Vishal. He
> says this is legacy.
>
> I have /sys/bus/dax and that works fine for me with the latest daxctl
> built from the ndctl code I sent out:
>
> https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
>
> Could you build and use the executables from that version?
>
> Ira
That is my setup.
root@debian:~# cxl list -r region0 -N
[
{
"region":"region0",
"resource":79725330432,
"size":2147483648,
"interleave_ways":1,
"interleave_granularity":256,
"decode_state":"commit",
"extents":[
{
"offset":0,
"length":268435456,
"uuid":"00000000-0000-0000-0000-000000000000"
}
]
}
]
root@debian:~# cd ndctl/
root@debian:~/ndctl# git branch
* dcd-region3-2025-04-13
root@debian:~/ndctl# ./build/daxctl/daxctl create-device -r region0 -v
libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
error creating devices: No such device or address
created 0 devices
root@debian:~/ndctl# cat .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[remote "origin"]
url = https://github.com/weiny2/ndctl.git
fetch = +refs/heads/dcd-region3-2025-04-13:refs/remotes/origin/dcd-region3-2025-04-13
[branch "dcd-region3-2025-04-13"]
remote = origin
merge = refs/heads/dcd-region3-2025-04-13
Fan
>
> >
> > The dmesg shows the really_probe function returns early as resource
> > presents before probe as below,
> >
> > [ 1745.505068] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
> > [ 1745.506063] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
> > [ 1745.506953] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
> > [ 1745.507911] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
> > [ 1745.508958] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
> > [ 1745.509843] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
> > [ 1745.511748] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
> > [ 1745.512626] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x3fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.514143] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
> > [ 1745.515485] cxl_core:online_region_extent:176: extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.516576] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.517768] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
> > [ 1745.524984] cxl dax_region0: Resources present before probing
> >
> >
> > btw, I hit the same issue with the previous verson also.
> >
> > Fan
>
> [snip]
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-15 2:37 ` Ira Weiny
2025-04-15 2:47 ` Fan Ni
@ 2025-04-15 4:28 ` Dan Williams
2025-05-13 18:55 ` Fan Ni
2 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2025-04-15 4:28 UTC (permalink / raw)
To: Ira Weiny, Fan Ni
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
Ira Weiny wrote:
[..]
> > However, after that, I tried to create a dax device as below, it failed.
> >
> > root@debian:~# daxctl create-device -r region0 -v
> > libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
Note that /sys/class/dax support was removed from the kernel back in
v5.17:
83762cb5c7c4 dax: Kill DEV_DAX_PMEM_COMPAT
daxctl still supports pre-v5.17 kernels and always checks both subsystem
types. This is a debug message just confirming that it is running on a
new kernel, see dax_regions_init() in daxctl.
> > error creating devices: No such device or address
> > created 0 devices
> > root@debian:~#
> >
> > root@debian:~# ls /sys/class/dax
> > ls: cannot access '/sys/class/dax': No such file or directory
>
> Have you update daxctl with cxl-cli?
>
> I was confused by this lack of /sys/class/dax and checked with Vishal. He
> says this is legacy.
>
> I have /sys/bus/dax and that works fine for me with the latest daxctl
> built from the ndctl code I sent out:
>
> https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
>
> Could you build and use the executables from that version?
The same debug message still exists in that version and will fire every
time debug is enabled.
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-15 2:37 ` Ira Weiny
2025-04-15 2:47 ` Fan Ni
2025-04-15 4:28 ` Dan Williams
@ 2025-05-13 18:55 ` Fan Ni
2 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-05-13 18:55 UTC (permalink / raw)
To: Ira Weiny
Cc: Fan Ni, Dave Jiang, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming, anisa.su887
On Mon, Apr 14, 2025 at 09:37:02PM -0500, Ira Weiny wrote:
> Fan Ni wrote:
> > On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > > A git tree of this series can be found here:
> > >
> > > https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > >
> > > This is now based on 6.15-rc2.
> > >
> > > Due to the stagnation of solid requirements for users of DCD I do not
> > > plan to rev this work in Q2 of 2025 and possibly beyond.
> > >
> > > It is anticipated that this will support at least the initial
> > > implementation of DCD devices, if and when they appear in the ecosystem.
> > > The patch set should be reviewed with the limited set of functionality in
> > > mind. Additional functionality can be added as devices support them.
> > >
> > > It is strongly encouraged for individuals or companies wishing to bring
> > > DCD devices to market review this set with the customer use cases they
> > > have in mind.
> >
> > Hi Ira,
> > thanks for sending it out.
> >
> > I have not got a chance to check the code or test it extensively.
> >
> > I tried to test one specific case and hit issue.
> >
> > I tried to add some DC extents to the extent list on the device when the
> > VM is launched by hacking qemu like below,
> >
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index 87fa308495..4049fc8dd9 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
> > QTAILQ_INIT(&ct3d->dc.extents);
> > QTAILQ_INIT(&ct3d->dc.extents_pending);
> >
> > + cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
> > + CXL_CAPACITY_MULTIPLIER, NULL, 0);
> > + ct3d->dc.total_extent_count = 1;
> > + ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
> > +
> > return true;
> > }
> >
> >
> > Then after the VM is launched, I tried to create a DC region with
> > commmand: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
> > dynamic_ram_a.
> >
> > It works fine. As you can see below, the region is created and the
> > extent is showing correctly.
> >
> > root@debian:~# cxl list -r region0 -N
> > [
> > {
> > "region":"region0",
> > "resource":79725330432,
> > "size":1073741824,
> > "interleave_ways":1,
> > "interleave_granularity":256,
> > "decode_state":"commit",
> > "extents":[
> > {
> > "offset":0,
> > "length":268435456,
> > "uuid":"00000000-0000-0000-0000-000000000000"
> > }
> > ]
> > }
> > ]
> >
> >
> > However, after that, I tried to create a dax device as below, it failed.
> >
> > root@debian:~# daxctl create-device -r region0 -v
> > libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
> > error creating devices: No such device or address
> > created 0 devices
> > root@debian:~#
> >
> > root@debian:~# ls /sys/class/dax
> > ls: cannot access '/sys/class/dax': No such file or directory
>
> Have you update daxctl with cxl-cli?
>
> I was confused by this lack of /sys/class/dax and checked with Vishal. He
> says this is legacy.
>
> I have /sys/bus/dax and that works fine for me with the latest daxctl
> built from the ndctl code I sent out:
>
> https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
>
> Could you build and use the executables from that version?
>
> Ira
Hi Ira,
Here are more details about the issue and reasoning.
# ISSUE: No DAX device created
## What we see: No DAX device is created after creating the DC region
<pre>
fan@smc-140338-bm01:~/cxl/linux-dcd$ cxl-tool.py --dcd-test mem0
Load cxl drivers first
ssh root@localhost -p 2024 "modprobe -a cxl_acpi cxl_core cxl_pci cxl_port cxl_mem"
Module Size Used by
dax_pmem 12288 0
device_dax 16384 0
nd_pmem 24576 0
nd_btt 28672 1 nd_pmem
dax 57344 3 dax_pmem,device_dax,nd_pmem
cxl_pmu 28672 0
cxl_mem 12288 0
cxl_pmem 24576 0
libnvdimm 217088 4 cxl_pmem,dax_pmem,nd_btt,nd_pmem
cxl_pci 28672 0
cxl_acpi 24576 0
cxl_port 16384 0
cxl_core 368640 7 cxl_pmem,cxl_port,cxl_mem,cxl_pci,cxl_acpi,cxl_pmu
ssh root@localhost -p 2024 "cxl enable-memdev mem0"
cxl memdev: cmd_enable_memdev: enabled 1 mem
{
"region":"region0",
"resource":79725330432,
"size":2147483648,
"interleave_ways":1,
"interleave_granularity":256,
"decode_state":"commit",
"mappings":[
{
"position":0,
"memdev":"mem0",
"decoder":"decoder2.0"
}
]
}
cxl region: cmd_create_region: created 1 region
sn=3840
cxl-memdev0
sn=3840
Choose OP: 0: add, 1: release, 2: print extent, 9: exit
Choice: 9
Do you want to continue to create dax device for DC(Y/N):y
daxctl create-device -r region0
error creating devices: No such device or address
created 0 devices
daxctl list -r region0 -D
Create dax device failed
</pre>
## What caused the issue: Resources present before probing
<pre>
...
[ 14.251500] cxl_core:cxl_region_probe:3571: cxl_region region0: config state: 0
[ 14.254129] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: -6
[ 14.256536] cxl_core:devm_cxl_add_region:2535: cxl_acpi ACPI0017:00: decoder0.0: created region0
[ 14.281676] cxl_core:cxl_port_attach_region:1169: cxl region0: mem0:endpoint2 decoder2.0 add: mem0:decoder2.0 @ 0 next: none nr_eps: 1 nr_targets: 1
[ 14.286254] cxl_core:cxl_port_attach_region:1169: cxl region0: pci0000:0c:port1 decoder1.0 add: mem0:decoder2.0 @ 0 next: mem0 nr_eps: 1 nr_targets: 1
[ 14.290995] cxl_core:cxl_port_setup_targets:1489: cxl region0: pci0000:0c:port1 iw: 1 ig: 256
[ 14.294161] cxl_core:cxl_port_setup_targets:1513: cxl region0: pci0000:0c:port1 target[0] = 0000:0c:00.0 for mem0:decoder2.0 @ 0
[ 14.298209] cxl_core:cxl_calc_interleave_pos:1880: cxl_mem mem0: decoder:decoder2.0 parent:0000:0d:00.0 port:endpoint2 range:0x1290000000-0x130fffffff pos:0
[ 14.303224] cxl_core:cxl_region_attach:2080: cxl decoder2.0: Test cxl_calc_interleave_pos(): success test_pos:0 cxled->pos:0
[ 14.307522] cxl region0: Bypassing cpu_cache_invalidate_memregion() for testing!
[ 14.319576] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
[ 14.322918] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
[ 14.326102] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[ 14.329523] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
[ 14.333141] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
[ 14.336172] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
[ 14.342736] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
[ 14.345447] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x7fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
[ 14.350198] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
[ 14.354574] cxl_core:online_region_extent:176: extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
[ 14.357876] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
[ 14.361361] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
[ 14.395020] cxl dax_region0: Resources present before probing
...
</pre>
## Workaround (not a fix)
By chasing down why the devres linked list is not empty, and when add_dr() is
called, I located the code that caused the issue. The hack below confirms
that the issue is triggered by the devm_add_action_or_reset() call.
<pre>
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 4dc0dec486f6..26daa7906717 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -174,6 +174,7 @@ static int online_region_extent(struct region_extent *region_extent)
goto err;
dev_dbg(dev, "region extent HPA %pra\n", &region_extent->hpa_range);
+ return 0;
return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
region_extent);
</pre>
## Output
<pre>
fan@smc-140338-bm01:~/cxl/linux-dcd$ cxl-tool.py --run --create-topo
Info: back memory/lsa file exist under /tmp/host0 from previous run, delete them Y/N(default Y):
Starting VM...
QEMU instance is up, access it: ssh root@localhost -p 2024
fan@smc-140338-bm01:~/cxl/linux-dcd$ cxl-tool.py --dcd-test mem0
Load cxl drivers first
ssh root@localhost -p 2024 "modprobe -a cxl_acpi cxl_core cxl_pci cxl_port cxl_mem"
Module Size Used by
dax_pmem 12288 0
device_dax 16384 0
nd_pmem 24576 0
nd_btt 28672 1 nd_pmem
dax 57344 3 dax_pmem,device_dax,nd_pmem
cxl_pmem 24576 0
cxl_pmu 28672 0
cxl_mem 12288 0
libnvdimm 217088 4 cxl_pmem,dax_pmem,nd_btt,nd_pmem
cxl_pci 28672 0
cxl_acpi 24576 0
cxl_port 16384 0
cxl_core 368640 7 cxl_pmem,cxl_port,cxl_mem,cxl_pci,cxl_acpi,cxl_pmu
ssh root@localhost -p 2024 "cxl enable-memdev mem0"
cxl memdev: cmd_enable_memdev: enabled 1 mem
cxl region: cmd_create_region: created 1 region
{
"region":"region0",
"resource":79725330432,
"size":2147483648,
"interleave_ways":1,
"interleave_granularity":256,
"decode_state":"commit",
"mappings":[
{
"position":0,
"memdev":"mem0",
"decoder":"decoder2.0"
}
]
}
sn=3840
cxl-memdev0
sn=3840
Choose OP: 0: add, 1: release, 2: print extent, 9: exit
Choice: 2
cat /tmp/qmp-show.json|ncat localhost 4445
{"QMP": {"version": {"qemu": {"micro": 90, "minor": 2, "major": 9}, "package": "v6.2.0-28065-g3537a06886-dirty"}, "capabilities": ["oob"]}}
{"return": {}}
{"return": {}}
{"return": {}}
Print accepted extent info:
0: [0x0 - 0x10000000]
In total, 1 extents printed!
Print pending-to-add extent info:
In total, 0 extents printed!
Choose OP: 0: add, 1: release, 2: print extent, 9: exit
Choice: 9
Do you want to continue to create dax device for DC(Y/N):y
daxctl create-device -r region0
[
{
"chardev":"dax0.1",
"size":268435456,
"target_node":1,
"align":2097152,
"mode":"devdax"
}
]
created 1 device
daxctl list -r region0 -D
[
{
"chardev":"dax0.1",
"size":268435456,
"target_node":1,
"align":2097152,
"mode":"devdax"
}
]
ssh root@localhost -p 2024 "daxctl reconfigure-device dax0.1 -m system-ram"
[
{
"chardev":"dax0.1",
"size":268435456,
"target_node":1,
"align":2097152,
"mode":"system-ram",
"online_memblocks":2,
"total_memblocks":2,
"movable":true
}
]
reconfigured 1 device
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0-15
0x0000000100000000-0x000000027fffffff 6G online yes 32-79
0x0000001290000000-0x000000129fffffff 256M online yes 594-595
Memory block size: 128M
Total online memory: 8.3G
</pre>
fan
>
> >
> > The dmesg shows the really_probe function returns early as resource
> > presents before probe as below,
> >
> > [ 1745.505068] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
> > [ 1745.506063] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
> > [ 1745.506953] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
> > [ 1745.507911] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
> > [ 1745.508958] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
> > [ 1745.509843] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
> > [ 1745.511748] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
> > [ 1745.512626] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x3fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.514143] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
> > [ 1745.515485] cxl_core:online_region_extent:176: extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.516576] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.517768] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
> > [ 1745.524984] cxl dax_region0: Resources present before probing
> >
> >
> > btw, I hit the same issue with the previous version also.
> >
> > Fan
>
> [snip]
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (19 preceding siblings ...)
2025-04-14 16:11 ` [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
@ 2025-04-14 16:47 ` Jonathan Cameron
2025-04-15 4:50 ` Dan Williams
2025-06-03 16:32 ` Fan Ni
2026-02-02 20:22 ` Gregory Price
22 siblings, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-14 16:47 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, 13 Apr 2025 17:52:08 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
>
> This is now based on 6.15-rc2.
Hi Ira,
Firstly thanks for the update and your hard work driving this forwards.
>
> Due to the stagnation of solid requirements for users of DCD I do not
> plan to rev this work in Q2 of 2025 and possibly beyond.
Hopefully there will be limited need to make changes (it looks pretty
good to me - we'll run a bunch of tests though which I haven't done
yet). I do have reason to want this code upstream and it is
now simple enough that I hope it is not controversial. Let's discuss
path forwards on the sync call tomorrow as I'm sure I'm not the only one.
If needed I'm fine picking up the baton to keep this moving forwards
(I'm even more happy to let someone else step up though!)
To me we don't need to answer the question of whether we fully understand
requirements, or whether this support covers them, but rather to ask
if anyone has requirements that are not sensible to satisfy with additional
work building on this?
I'm not aware of any such blocker. For the things I care about the
path forwards looks fine (particularly tagged capacity and sharing).
>
> It is anticipated that this will support at least the initial
> implementation of DCD devices, if and when they appear in the ecosystem.
> The patch set should be reviewed with the limited set of functionality in
> mind. Additional functionality can be added as devices support them.
Personally I think that's a chicken and egg problem but fully understand
the desire to keep things simple in the short term. Getting initial DCD
support in will help reduce the response (that I frequently hear) of
'the ecosystem isn't ready, let's leave that for a generation'.
>
> It is strongly encouraged for individuals or companies wishing to bring
> DCD devices to market review this set with the customer use cases they
> have in mind.
>
Absolutely. I can't share anything about devices at this time but you
can read whatever you want into my willingness to help get this (and a
bunch of things built on top of it) over the line.
> Remaining work:
>
> 1) Allow mapping to specific extents (perhaps based on
> label/tag)
> 1a) devise region size reporting based on tags
> 2) Interleave support
I'd maybe label these as 'additional possible future features'.
Personally I'm doubtful that hardware interleave of DCD is a short
term feature and it definitely doesn't have to be there for this to be useful.
Tags will matter but that is a 'next step' that this series does
not seem to hinder.
>
> Possible additional work depending on requirements:
>
> 1) Accept a new extent which extends (but overlaps) already
> accepted extent(s)
> 2) Rework DAX device interfaces, memfd has been explored a bit
> 3) Support more than 1 DC partition
>
> [1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
Thanks,
Jonathan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-14 16:47 ` Jonathan Cameron
@ 2025-04-15 4:50 ` Dan Williams
2025-04-15 10:03 ` Jonathan Cameron
0 siblings, 1 reply; 65+ messages in thread
From: Dan Williams @ 2025-04-15 4:50 UTC (permalink / raw)
To: Jonathan Cameron, Ira Weiny
Cc: Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
Jonathan Cameron wrote:
[..]
> To me we don't need to answer the question of whether we fully understand
> requirements, or whether this support covers them, but rather to ask
> if anyone has requirements that are not sensible to satisfy with additional
> work building on this?
Wearing only my upstream kernel development hat, the question for
merging is "what is the end user visible impact of merging this?". As
long as DCD remains in proof-of-concept mode then leave the code out of
tree until it is ready to graduate past that point.
Same held for HDM-D support which was an out-of-tree POC until
Alejandro arrived with the SFC consumer.
DCD is joined by HDM-DB (awaiting an endpoint) and CXL Error Isolation
(awaiting a production consumer) as solutions that have time to validate
that the ecosystem is indeed graduating to consume them. There was no
"chicken-egg" paradox for the ecosystem to deliver base
static-memory-expander CXL support.
The ongoing failure to get productive engagement on just how ruthlessly
simple the implementation could be and still meet planned usages
continues to give the impression that Linux is way out in front of
hardware here. Uncomfortably so.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-15 4:50 ` Dan Williams
@ 2025-04-15 10:03 ` Jonathan Cameron
2025-04-15 17:45 ` Dan Williams
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2025-04-15 10:03 UTC (permalink / raw)
To: Dan Williams
Cc: Ira Weiny, Dave Jiang, Fan Ni, Davidlohr Bueso, Alison Schofield,
Vishal Verma, linux-cxl, nvdimm, linux-kernel, Li Ming
On Mon, 14 Apr 2025 21:50:31 -0700
Dan Williams <dan.j.williams@intel.com> wrote:
> Jonathan Cameron wrote:
> [..]
> > To me we don't need to answer the question of whether we fully understand
> > requirements, or whether this support covers them, but rather to ask
> > if anyone has requirements that are not sensible to satisfy with additional
> > work building on this?
>
> Wearing only my upstream kernel development hat, the question for
> merging is "what is the end user visible impact of merging this?". As
> long as DCD remains in proof-of-concept mode then leave the code out of
> tree until it is ready to graduate past that point.
Hi Dan,
Seems like we'll have to disagree on this. The only thing I can
therefore do is help to keep this patch set in a 'ready to go' state.
I would ask that people review it with that in mind so that we can
merge it the day someone is willing to announce a product which
is a lot more about marketing decisions than anything technical.
Note that will be far too late for distro cycles so distro folk
may have to pick up the fork (which they will hate).
Hopefully that 'fork' will provide a base on which we can build
the next set of key features.
>
> Same held for HDM-D support which was an out-of-tree POC until
> Alejandro arrived with the SFC consumer.
Obviously I can't comment on status of that hardware!
>
> DCD is joined by HDM-DB (awaiting an endpoint) and CXL Error Isolation
> (awaiting a production consumer) as solutions that have time to validate
> that the ecosystem is indeed graduating to consume them.
Those I'm fine with waiting on, though obviously others may not be!
> There was no
> "chicken-egg" paradox for the ecosystem to deliver base
> static-memory-expander CXL support.
That is (at least partly) because the ecosystem for those was initially BIOS
only. That's not true for DCD, so people built devices on the basis that they
didn't need any kernel support. Lots of disadvantages to that, but it's what happened.
As a side note, I'd much rather that path had never been there as it is
continuing to make a mess for Gregory and others.
>
> The ongoing failure to get productive engagement on just how ruthlessly
> simple the implementation could be and still meet planned usages
> continues to give the impression that Linux is way out in front of
> hardware here. Uncomfortably so.
I'll keep pushing for others to engage with this. I also have on my
list writing a document on the future of DCD and proposing at least one
way to add all features on that roadmap. A major intent of that being
to show that there is no blocker to what we have here. I.e. we can
extend it in a logical fashion to exactly what is needed.
The reality is I cannot say anything about unannounced products. Whilst some
companies will talk about stuff well ahead of hardware being ready for
customers, we do not do that (normally we announce long after customers
have it). Hence it seems I have no way to get this upstream other than to
hope someone else has a more flexible policy.
Jonathan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-15 10:03 ` Jonathan Cameron
@ 2025-04-15 17:45 ` Dan Williams
0 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2025-04-15 17:45 UTC (permalink / raw)
To: Jonathan Cameron, Dan Williams
Cc: Ira Weiny, Dave Jiang, Fan Ni, Davidlohr Bueso, Alison Schofield,
Vishal Verma, linux-cxl, nvdimm, linux-kernel, Li Ming
Jonathan Cameron wrote:
> On Mon, 14 Apr 2025 21:50:31 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > Jonathan Cameron wrote:
> > [..]
> > > To me we don't need to answer the question of whether we fully understand
> > > requirements, or whether this support covers them, but rather to ask
> > > if anyone has requirements that are not sensible to satisfy with additional
> > > work building on this?
> >
> > Wearing only my upstream kernel development hat, the question for
> > merging is "what is the end user visible impact of merging this?". As
> > long as DCD remains in proof-of-concept mode then leave the code out of
> > tree until it is ready to graduate past that point.
>
> Hi Dan,
>
> Seems like we'll have to disagree on this. The only thing I can
> therefore do is help to keep this patch set in a 'ready to go' state.
>
> I would ask that people review it with that in mind so that we can
> merge it the day someone is willing to announce a product which
> is a lot more about marketing decisions than anything technical.
> Note that will be far too late for distro cycles so distro folk
> may have to pick up the fork (which they will hate).
This is overstated. Distros say "no" to supporting even *shipping*
hardware when there is insufficient customer pull-through. If none of
the distros' customers can get their hands on DCD hardware, that
contraindicates merge and distro intercept decisions.
> Hopefully that 'fork' will provide a base on which we can build
> the next set of key features.
They are only key features when the adoption approaches inevitability.
The LSF/MM discussions around the ongoing challenges of managing
disparate performance memory pools still have me uneasy about whether
Linux yet has the right ABI in hand for dedicated-memory.
What folks seem to want is an anon-only memory provider that never
leaks into kernel allocations, and optionally a filesystem abstraction
to provide file-backed allocation of dedicated memory. What they do not
want is to teach their applications anything beyond "malloc()" for anon.
[..]
> That is (at least partly) because the ecosystem for those was initially BIOS
> only. That's not true for DCD. So people built devices on basis they didn't
> need any kernel support. Lots of disadvantages to that but it's what happened.
> As a side note, I'd much rather that path had never been there as it is
> continuing to make a mess for Gregory and others.
The mess is driven by insufficient communication between platform
firmware implementations and Linux expectations. That is a tractable
problem.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (20 preceding siblings ...)
2025-04-14 16:47 ` Jonathan Cameron
@ 2025-06-03 16:32 ` Fan Ni
2025-06-09 17:09 ` Fan Ni
2026-02-02 20:22 ` Gregory Price
22 siblings, 1 reply; 65+ messages in thread
From: Fan Ni @ 2025-06-03 16:32 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
>
> This is now based on 6.15-rc2.
>
> Due to the stagnation of solid requirements for users of DCD I do not
> plan to rev this work in Q2 of 2025 and possibly beyond.
>
> It is anticipated that this will support at least the initial
> implementation of DCD devices, if and when they appear in the ecosystem.
> The patch set should be reviewed with the limited set of functionality in
> mind. Additional functionality can be added as devices support them.
>
> It is strongly encouraged for individuals or companies wishing to bring
> DCD devices to market review this set with the customer use cases they
> have in mind.
Hi,
I have a general question about DCD.
How will the start DPA of the first region be set before any extent is
offered to the hosts?
In this series, no DPA gap (skip) is allowed between static capacity and
dynamic capacity. That seems to imply that some component that knows the
layout of the host memory will need to set the start DPA of the first DC
region? The firmware?
Also, if a DC extent is shared among multiple hosts, each of which has a
different memory configuration, how does the DCD device provide the extents
to each host to make sure there is no DPA gap between the static and dynamic
capacity ranges on all the hosts?
It seems the start DPA of the DCD needs to be different for each host. Not
sure how to achieve that.
Fan
>
> Series info
> ===========
>
> This series has 2 parts:
>
> Patch 1-17: Core DCD support
> Patch 18-19: cxl_test support
>
> Background
> ==========
>
> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows memory capacity within a region to change
> dynamically without the need for resetting the device, reconfiguring
> HDM decoders, or reconfiguring software DAX regions.
>
> One of the biggest anticipated use cases for Dynamic Capacity is to
> allow hosts to dynamically add or remove memory from a host within a
> data center without physically changing the per-host attached memory nor
> rebooting the host.
>
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory. Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Logical
> device, the Host Kernel, and a Host User.
>
> An example work flow is shown below.
>
> Orchestrator FM Device Host Kernel Host User
>
> | | | | |
> |-------------- Create region ------------------------>|
> | | | | |
> | | | |<-- Create ----|
> | | | | Region |
> | | | |(dynamic_ram_a)|
> |<------------- Signal done ---------------------------|
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create ---->|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | | | |
> | | | |<- Release ----| <-+
> | | | | DAX dev |
> | | | | |
> |<------------- Signal done ---------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create -----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | |<- Release ----| <-+
> | | | | DAX dev |
> |<------------- Signal done ---------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | |<- Create -----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> |-- Remove -->|- Release->|- Release ->| | |
> | Capacity | Extent | Extent | | |
> | | | | | |
> | | | (Release Ignored) | |
> | | | | | |
> | | | |<- Release ----| <-+
> | | | | DAX dev |
> |<------------- Signal done ---------------------------|
> | | | | |
> | |- Release->|- Release ->| |
> | | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | |<- Destroy ----|
> | | | | Region |
> | | | | |
>
> Implementation
> ==============
>
> This series requires the creation of regions and DAX devices to be
> closely synchronized with the Orchestrator and Fabric Manager. The host
> kernel will reject extents if a region is not yet created. It also
> ignores extent release if memory is in use (DAX device created). These
> synchronizations are not anticipated to be an issue with real
> applications.
>
> Only a single dynamic ram partition is supported (dynamic_ram_a). The
> requirements, use cases, and existence of actual hardware devices to
> support more than one DC partition are unknown at this time. So a less
> complex implementation was chosen.
>
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced. A sparse DAX region may have 0 or
> more bytes of available space. The total space depends on the number
> and size of the extents which have been added.
>
> It is anticipated that users of the memory will carefully coordinate the
> surfacing of capacity with the creation of DAX devices which use that
> capacity. Therefore, the allocation of the memory to DAX devices does
> not allow for specific associations between DAX device and extent. This
> keeps allocations of DAX devices similar to existing DAX region
> behavior.
>
> To keep the DAX memory allocation aligned with the existing DAX devices
> which do not have tags, extents are not allowed to have tags in this
> implementation. Future support for tags can be added when real use
> cases surface.
>
> Great care was taken to keep the extent tracking simple. Some xarray's
> needed to be added but extra software objects are kept to a minimum.
>
> Region extents are tracked as sub-devices of the DAX region. This
> ensures that region destruction cleans up all extent allocations
> properly.
>
> The major functionality of this series includes:
>
> - Getting the dynamic capacity (DC) configuration information from cxl
> devices
>
> - Configuring a DC partition found in hardware.
>
> - Enhancing the CXL and DAX regions for dynamic capacity support
> a. Maintain a logical separation between hardware extents and
> software managed extents. This provides an abstraction
> between the layers and should allow for interleaving in the
> future
>
> - Get existing hardware extent lists for endpoint decoders upon region
> creation.
>
> - Respond to DC capacity events and adjust available region memory.
> a. Add capacity Events
> b. Release capacity events
>
> - Host response for add capacity
> a. do not accept the extent if:
> If the region does not exist
> or an error occurs realizing the extent
> b. If the region does exist
> realize a DAX region extent with 1:1 mapping (no
> interleave yet)
> c. Support the event more bit by processing a list of extents
> marked with the more bit together before setting up a
> response.
>
> - Host response for remove capacity
> a. If no DAX device references the extent; release the extent
> b. If a reference does exist, ignore the request.
> (Require FM to issue release again.)
> c. Release extents flagged with the 'more' bit individually as
> the specification allows for the asynchronous release of
> memory and the implementation is simplified by doing so.
>
> - Modify DAX device creation/resize to account for extents within a
> sparse DAX region
>
> - Trace Dynamic Capacity events for debugging
>
> - Add cxl-test infrastructure to allow for faster unit testing
> (See new ndctl branch for cxl-dcd.sh test[1])
>
> - Only support 0 value extent tags
>
> Fan Ni's upstream of Qemu DCD was used for testing.
>
> Remaining work:
>
> 1) Allow mapping to specific extents (perhaps based on
> label/tag)
> 1a) devise region size reporting based on tags
> 2) Interleave support
>
> Possible additional work depending on requirements:
>
> 1) Accept a new extent which extends (but overlaps) already
> accepted extent(s)
> 2) Rework DAX device interfaces, memfd has been explored a bit
> 3) Support more than 1 DC partition
>
> [1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
>
> ---
> Changes in v9:
> - djbw: pare down support to only a single DC partition
> - djbw: adjust to the new core partition processing which aligns with
> new type2 work.
> - iweiny: address smaller comments from v8
> - iweiny: rebase off of 6.15-rc1
> - Link to v8: https://patch.msgid.link/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com
>
> ---
> Ira Weiny (19):
> cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> cxl/mem: Read dynamic capacity configuration from the device
> cxl/cdat: Gather DSMAS data for DCD partitions
> cxl/core: Enforce partition order/simplify partition calls
> cxl/mem: Expose dynamic ram A partition in sysfs
> cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
> cxl/region: Add sparse DAX region support
> cxl/events: Split event msgnum configuration from irq setup
> cxl/pci: Factor out interrupt policy check
> cxl/mem: Configure dynamic capacity interrupts
> cxl/core: Return endpoint decoder information from region search
> cxl/extent: Process dynamic partition events and realize region extents
> cxl/region/extent: Expose region extent information in sysfs
> dax/bus: Factor out dev dax resize logic
> dax/region: Create resources on sparse DAX regions
> cxl/region: Read existing extents on region creation
> cxl/mem: Trace Dynamic capacity Event Record
> tools/testing/cxl: Make event logs dynamic
> tools/testing/cxl: Add DC Regions to mock mem data
>
> Documentation/ABI/testing/sysfs-bus-cxl | 100 ++-
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/cdat.c | 11 +
> drivers/cxl/core/core.h | 33 +-
> drivers/cxl/core/extent.c | 495 +++++++++++++++
> drivers/cxl/core/hdm.c | 13 +-
> drivers/cxl/core/mbox.c | 632 ++++++++++++++++++-
> drivers/cxl/core/memdev.c | 87 ++-
> drivers/cxl/core/port.c | 5 +
> drivers/cxl/core/region.c | 76 ++-
> drivers/cxl/core/trace.h | 65 ++
> drivers/cxl/cxl.h | 61 +-
> drivers/cxl/cxlmem.h | 134 +++-
> drivers/cxl/mem.c | 2 +-
> drivers/cxl/pci.c | 115 +++-
> drivers/dax/bus.c | 356 +++++++++--
> drivers/dax/bus.h | 4 +-
> drivers/dax/cxl.c | 71 ++-
> drivers/dax/dax-private.h | 40 ++
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> include/cxl/event.h | 31 +
> include/linux/ioport.h | 3 +
> tools/testing/cxl/Kbuild | 3 +-
> tools/testing/cxl/test/mem.c | 1021 +++++++++++++++++++++++++++----
> 25 files changed, 3102 insertions(+), 262 deletions(-)
> ---
> base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
>
> Best regards,
> --
> Ira Weiny <ira.weiny@intel.com>
>
--
Fan Ni
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-06-03 16:32 ` Fan Ni
@ 2025-06-09 17:09 ` Fan Ni
0 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2025-06-09 17:09 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Jonathan Cameron, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming
On Tue, Jun 03, 2025 at 09:32:18AM -0700, Fan Ni wrote:
> On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> >
> > https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> >
> > This is now based on 6.15-rc2.
> >
> > Due to the stagnation of solid requirements for users of DCD I do not
> > plan to rev this work in Q2 of 2025 and possibly beyond.
> >
> > It is anticipated that this will support at least the initial
> > implementation of DCD devices, if and when they appear in the ecosystem.
> > The patch set should be reviewed with the limited set of functionality in
> > mind. Additional functionality can be added as devices support them.
> >
> > It is strongly encouraged for individuals or companies wishing to bring
> > DCD devices to market review this set with the customer use cases they
> > have in mind.
>
> Hi,
> I have a general question about DCD.
>
> How will the start DPA of the first region be set before any extent is
> offered to the hosts?
>
> In this series, no DPA gap (skip) is allowed between static capacity and
> dynamic capacity. That seems to imply that some component that knows the
> layout of the host memory will need to set the start DPA of the first DC
> region? The firmware?
>
> Also, if a DC extent is shared among multiple hosts, each of which has a
> different memory configuration, how does the DCD device provide the extents
> to each host to make sure there is no DPA gap between the static and dynamic
> capacity ranges on all the hosts?
> It seems the start DPA of the DCD needs to be different for each host. Not
> sure how to achieve that.
>
> Fan
Ignore the above message, the question does not make sense.
Fan
>
> >
> > Series info
> > ===========
> >
> > This series has 2 parts:
> >
> > Patch 1-17: Core DCD support
> > Patch 18-19: cxl_test support
> >
> > Background
> > ==========
> >
> > A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> > device that allows memory capacity within a region to change
> > dynamically without the need for resetting the device, reconfiguring
> > HDM decoders, or reconfiguring software DAX regions.
> >
> > One of the biggest anticipated use cases for Dynamic Capacity is to
> > allow hosts to dynamically add or remove memory from a host within a
> > data center without physically changing the per-host attached memory nor
> > rebooting the host.
> >
> > The general flow for the addition or removal of memory is to have an
> > orchestrator coordinate the use of the memory. Generally there are 5
> > actors in such a system, the Orchestrator, Fabric Manager, the Logical
> > device, the Host Kernel, and a Host User.
> >
> > An example work flow is shown below.
> >
> > Orchestrator FM Device Host Kernel Host User
> >
> > | | | | |
> > |-------------- Create region ------------------------>|
> > | | | | |
> > | | | |<-- Create ----|
> > | | | | Region |
> > | | | |(dynamic_ram_a)|
> > |<------------- Signal done ---------------------------|
> > | | | | |
> > |-- Add ----->|-- Add --->|--- Add --->| |
> > | Capacity | Extent | Extent | |
> > | | | | |
> > | |<- Accept -|<- Accept -| |
> > | | Extent | Extent | |
> > | | | |<- Create ---->|
> > | | | | DAX dev |-- Use memory
> > | | | | | |
> > | | | | | |
> > | | | |<- Release ----| <-+
> > | | | | DAX dev |
> > | | | | |
> > |<------------- Signal done ---------------------------|
> > | | | | |
> > |-- Remove -->|- Release->|- Release ->| |
> > | Capacity | Extent | Extent | |
> > | | | | |
> > | |<- Release-|<- Release -| |
> > | | Extent | Extent | |
> > | | | | |
> > |-- Add ----->|-- Add --->|--- Add --->| |
> > | Capacity | Extent | Extent | |
> > | | | | |
> > | |<- Accept -|<- Accept -| |
> > | | Extent | Extent | |
> > | | | |<- Create -----|
> > | | | | DAX dev |-- Use memory
> > | | | | | |
> > | | | |<- Release ----| <-+
> > | | | | DAX dev |
> > |<------------- Signal done ---------------------------|
> > | | | | |
> > |-- Remove -->|- Release->|- Release ->| |
> > | Capacity | Extent | Extent | |
> > | | | | |
> > | |<- Release-|<- Release -| |
> > | | Extent | Extent | |
> > | | | | |
> > |-- Add ----->|-- Add --->|--- Add --->| |
> > | Capacity | Extent | Extent | |
> > | | | |<- Create -----|
> > | | | | DAX dev |-- Use memory
> > | | | | | |
> > |-- Remove -->|- Release->|- Release ->| | |
> > | Capacity | Extent | Extent | | |
> > | | | | | |
> > | | | (Release Ignored) | |
> > | | | | | |
> > | | | |<- Release ----| <-+
> > | | | | DAX dev |
> > |<------------- Signal done ---------------------------|
> > | | | | |
> > | |- Release->|- Release ->| |
> > | | Extent | Extent | |
> > | | | | |
> > | |<- Release-|<- Release -| |
> > | | Extent | Extent | |
> > | | | |<- Destroy ----|
> > | | | | Region |
> > | | | | |
> >
> > Implementation
> > ==============
> >
> > This series requires the creation of regions and DAX devices to be
> > closely synchronized with the Orchestrator and Fabric Manager. The host
> > kernel will reject extents if a region is not yet created. It also
> > ignores extent release if memory is in use (DAX device created). These
> > synchronizations are not anticipated to be an issue with real
> > applications.
> >
> > Only a single dynamic ram partition is supported (dynamic_ram_a). The
> > requirements, use cases, and existence of actual hardware devices to
> > support more than one DC partition are unknown at this time, so a less
> > complex implementation was chosen.
> >
> > In order to allow for capacity to be added and removed a new concept of
> > a sparse DAX region is introduced. A sparse DAX region may have 0 or
> > more bytes of available space. The total space depends on the number
> > and size of the extents which have been added.
> >
> > It is anticipated that users of the memory will carefully coordinate the
> > surfacing of capacity with the creation of DAX devices which use that
> > capacity. Therefore, the allocation of the memory to DAX devices does
> > not allow for specific associations between DAX device and extent. This
> > keeps allocations of DAX devices similar to existing DAX region
> > behavior.
> >
> > To keep the DAX memory allocation aligned with the existing DAX devices
> > which do not have tags, extents are not allowed to have tags in this
> > implementation. Future support for tags can be added when real use
> > cases surface.
> >
> > Great care was taken to keep the extent tracking simple. Some xarrays
> > needed to be added, but extra software objects are kept to a minimum.
> >
> > Region extents are tracked as sub-devices of the DAX region. This
> > ensures that region destruction cleans up all extent allocations
> > properly.
> >
> > The major functionality of this series includes:
> >
> > - Getting the dynamic capacity (DC) configuration information from cxl
> > devices
> >
> > - Configuring a DC partition found in hardware.
> >
> > - Enhancing the CXL and DAX regions for dynamic capacity support
> > a. Maintain a logical separation between hardware extents and
> > software managed extents. This provides an abstraction
> > between the layers and should allow for interleaving in the
> > future
> >
> > - Get existing hardware extent lists for endpoint decoders upon region
> > creation.
> >
> > - Respond to DC capacity events and adjust available region memory.
> > a. Add capacity events
> > b. Release capacity events
> >
> > - Host response for add capacity
> > a. Do not accept the extent if the region does not exist
> > or an error occurs realizing the extent
> > b. If the region does exist
> > realize a DAX region extent with 1:1 mapping (no
> > interleave yet)
> > c. Support the event more bit by processing a list of extents
> > marked with the more bit together before setting up a
> > response.
> >
> > - Host response for remove capacity
> > a. If no DAX device references the extent; release the extent
> > b. If a reference does exist, ignore the request.
> > (Require FM to issue release again.)
> > c. Release extents flagged with the 'more' bit individually as
> > the specification allows for the asynchronous release of
> > memory and the implementation is simplified by doing so.
> >
> > - Modify DAX device creation/resize to account for extents within a
> > sparse DAX region
> >
> > - Trace Dynamic Capacity events for debugging
> >
> > - Add cxl-test infrastructure to allow for faster unit testing
> > (See new ndctl branch for cxl-dcd.sh test[1])
> >
> > - Only support 0 value extent tags
> >
> > Fan Ni's upstream of Qemu DCD was used for testing.
> >
> > Remaining work:
> >
> > 1) Allow mapping to specific extents (perhaps based on
> > label/tag)
> > 1a) devise region size reporting based on tags
> > 2) Interleave support
> >
> > Possible additional work depending on requirements:
> >
> > 1) Accept a new extent which extends (but overlaps) already
> > accepted extent(s)
> > 2) Rework DAX device interfaces, memfd has been explored a bit
> > 3) Support more than 1 DC partition
> >
> > [1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
> >
> > ---
> > Changes in v9:
> > - djbw: pare down support to only a single DC partition
> > - djbw: adjust to the new core partition processing which aligns with
> > new type2 work.
> > - iweiny: address smaller comments from v8
> > - iweiny: rebase off of 6.15-rc1
> > - Link to v8: https://patch.msgid.link/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com
> >
> > ---
> > Ira Weiny (19):
> > cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> > cxl/mem: Read dynamic capacity configuration from the device
> > cxl/cdat: Gather DSMAS data for DCD partitions
> > cxl/core: Enforce partition order/simplify partition calls
> > cxl/mem: Expose dynamic ram A partition in sysfs
> > cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
> > cxl/region: Add sparse DAX region support
> > cxl/events: Split event msgnum configuration from irq setup
> > cxl/pci: Factor out interrupt policy check
> > cxl/mem: Configure dynamic capacity interrupts
> > cxl/core: Return endpoint decoder information from region search
> > cxl/extent: Process dynamic partition events and realize region extents
> > cxl/region/extent: Expose region extent information in sysfs
> > dax/bus: Factor out dev dax resize logic
> > dax/region: Create resources on sparse DAX regions
> > cxl/region: Read existing extents on region creation
> > cxl/mem: Trace Dynamic capacity Event Record
> > tools/testing/cxl: Make event logs dynamic
> > tools/testing/cxl: Add DC Regions to mock mem data
> >
> > Documentation/ABI/testing/sysfs-bus-cxl | 100 ++-
> > drivers/cxl/core/Makefile | 2 +-
> > drivers/cxl/core/cdat.c | 11 +
> > drivers/cxl/core/core.h | 33 +-
> > drivers/cxl/core/extent.c | 495 +++++++++++++++
> > drivers/cxl/core/hdm.c | 13 +-
> > drivers/cxl/core/mbox.c | 632 ++++++++++++++++++-
> > drivers/cxl/core/memdev.c | 87 ++-
> > drivers/cxl/core/port.c | 5 +
> > drivers/cxl/core/region.c | 76 ++-
> > drivers/cxl/core/trace.h | 65 ++
> > drivers/cxl/cxl.h | 61 +-
> > drivers/cxl/cxlmem.h | 134 +++-
> > drivers/cxl/mem.c | 2 +-
> > drivers/cxl/pci.c | 115 +++-
> > drivers/dax/bus.c | 356 +++++++++--
> > drivers/dax/bus.h | 4 +-
> > drivers/dax/cxl.c | 71 ++-
> > drivers/dax/dax-private.h | 40 ++
> > drivers/dax/hmem/hmem.c | 2 +-
> > drivers/dax/pmem.c | 2 +-
> > include/cxl/event.h | 31 +
> > include/linux/ioport.h | 3 +
> > tools/testing/cxl/Kbuild | 3 +-
> > tools/testing/cxl/test/mem.c | 1021 +++++++++++++++++++++++++++----
> > 25 files changed, 3102 insertions(+), 262 deletions(-)
> > ---
> > base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
> > change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> >
> > Best regards,
> > --
> > Ira Weiny <ira.weiny@intel.com>
> >
>
> --
> Fan Ni
--
Fan Ni
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2025-04-13 22:52 [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (21 preceding siblings ...)
2025-06-03 16:32 ` Fan Ni
@ 2026-02-02 20:22 ` Gregory Price
2026-02-03 22:04 ` Ira Weiny
22 siblings, 1 reply; 65+ messages in thread
From: Gregory Price @ 2026-02-02 20:22 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
>
> This is now based on 6.15-rc2.
>
Extreme necro-bump for this set, but i wonder what folks' opinion is on
DCD support if we expose a new region control pattern ala:
https://lore.kernel.org/linux-cxl/20260129210442.3951412-1-gourry@gourry.net/
The major difference would be elimination of sparse-DAX, which i know
has been a concern, in favor of a per-region-driver policy on how to
manage hot-add/remove events.
Things I've discussed with folks in different private contexts
sysram usecase:
----
echo regionN > decoder0.0/create_dc_region
/* configure decoders */
echo regionN > cxl/drivers/sysram/bind
tagged extents arrive and leave as a group, no sparseness
extents cannot share a tag unless they arrive together
e.g. set(A) & set(B) must have different tags
add and expose daxN.M/uuid as the tag for collective management
Can decide whether linux wants to support untagged extents
cxl_sysram could choose to track and hotplug untagged extents
directly without going through DAX. Partial release would be
possible on a per-extent granularity in this case.
----
virtio usecase: (making some stuff up here)
----
echo regionN > decoder0.0/create_dc_region
/* configure decoders */
echo regionN > cxl/drivers/virtio/bind
tags are required and may imply specific VM routing
may or may not use DAX under the hood
extents may be tracked individually and add/removed individually
if using DAX, this implies 1 device per extent.
This probably requires a minimum extent size to be reasonable.
Does not expose the memory as SysRAM, instead builds new interface
to handle memory management message routing to/from the VMM
(N_MEMORY_PRIVATE?)
----
devdax usecase (FAMFS?)
----
echo regionN > decoder0.0/create_dc_region
/* configure decoders */
echo regionN > cxl/drivers/devdax/bind
All sets of extents appear as new DAX devices
Tags are exposed via daxN.M/uuid
Tags are required
otherwise you can't make sense of what that devdax represents
---
Begs the question:
Do we require tags as a baseline feature for all modes?
No tag - no service.
Heavily implied: Tags are globally unique (uuid)
But I think this resolves a lot of the disparate disagreements on "what
to do with tags" and how to manage sparseness - just split the policy
into each individual use-case's respective driver.
If a sufficiently unique use-case comes along that doesn't fit the
existing categories - a new region-driver may be warranted.
~Gregory
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-02 20:22 ` Gregory Price
@ 2026-02-03 22:04 ` Ira Weiny
2026-02-04 15:12 ` Gregory Price
0 siblings, 1 reply; 65+ messages in thread
From: Ira Weiny @ 2026-02-03 22:04 UTC (permalink / raw)
To: Gregory Price, Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
Gregory Price wrote:
> On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> >
> > https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> >
> > This is now based on 6.15-rc2.
> >
>
> Extreme necro-bump for this set, but i wonder what folks' opinion is on
> DCD support if we expose a new region control pattern ala:
>
> https://lore.kernel.org/linux-cxl/20260129210442.3951412-1-gourry@gourry.net/
>
> The major difference would be elimination of sparse-DAX, which i know
Sparse-dax is somewhat of a misnomer; 'sparse regions' may have been a
better name for it. That is really what we are speaking of. It is the
idea that we have regions which don't necessarily have memory backing the
full size of the region.
For the DCD series I wrote dax devices could only be created after extents
appeared.
> has been a concern, in favor of a per-region-driver policy on how to
> manage hot-add/remove events.
I think a concern would be that each region driver is implementing a
'policy' which requires new drivers for new policies.
My memory is very weak on all this stuff...
My general architecture was trying to expose the extent ranges to user
space and allow userspace to build them into ranges with whatever policy
they wanted.
The tests[1] were all written to create dax devices on top of the extents
in certain ways to link together those extents.
[1] https://github.com/weiny2/ndctl/blob/dcd-region3-2025-04-13/test/cxl-dcd.sh
I did not like the 'implicit' nature of the association of dax device with
extent. But it maintained backwards compatibility with non-sparse
regions...
My vision for tags was that eventually dax device creation could have a
tag specified prior and would only allocate from extents with that tag.
>
> Things I've discussed with folks in different private contexts
>
> sysram usecase:
> ----
> echo regionN > decoder0.0/create_dc_region
> /* configure decoders */
> echo regionN > cxl/drivers/sysram/bind
>
> tagged extents arrive and leave as a group, no sparseness
> extents cannot share a tag unless they arrive together
> e.g. set(A) & set(B) must have different tags
> add and expose daxN.M/uuid as the tag for collective management
I'm not following this. If set(A) arrives can another set(A) arrive
later?
How long does the kernel wait for all the 'A's to arrive? Or must they be
in a ... 'more bit set' set of extents.
Regardless IMO if user space was monitoring the extents with tag A they
can decide if and when all those extents have arrived and can build on top
of that.
>
> Can decide whether linux wants to support untagged extents
> cxl_sysram could choose to track and hotplug untagged extents
'cxl_sysram' is the sysram region driver right?
Are we expecting to have tags and non-tagged extents on the same DCD
region?
I'm ok not supporting that. But just to be clear about what you are
suggesting.
Would the cxl_sysram region driver be attached to the DCD partition? Then
it would have some DCD functionality built in... I guess make a common
extent processing lib for the 2 drivers?
I feel like that is a lot of policy being built into the kernel. Where
having the DCD region driver simply tell user space 'Hey there is a new
extent here' and then having user space online that as sysram makes the
policy decision in user space.
Segueing into the N_PRIVATE work. Couldn't we assign that memory to a
NUMA node with N_PRIVATE-only memory via userspace... Then it is onlined
in a way that any app which is allocating from that node would get that
memory. And keep it out of kernel space?
But keep all that policy in user space when an extent appears. Not baked
into a particular driver.
> directly without going through DAX. Partial release would be
> possible on a per-extent granularity in this case.
> ----
>
>
> virtio usecase: (making some stuff up here)
> ----
> echo regionN > decoder0.0/create_dc_region
> /* configure decoders */
> echo regionN > cxl/drivers/virtio/bind
>
> tags are required and may imply specific VM routing
> may or may not use DAX under the hood
>
> extents may be tracked individually and add/removed individually
> if using DAX, this implies 1 device per extent.
> This probably requires a minimum extent size to be reasonable.
>
> Does not expose the memory as SysRAM, instead builds new interface
> to handle memory management message routing to/from the VMM
> (N_MEMORY_PRIVATE?)
> ----
>
>
> devdax usecase (FAMFS?)
> ----
> echo regionN > decoder0.0/create_dc_region
> /* configure decoders */
> echo regionN > cxl/drivers/devdax/bind
>
> All sets of extents appear as new DAX devices
> Tags are exposed via daxN.M/uuid
> Tags are required
> otherwise you can't make sense of what that devdax represents
> ---
>
> Begs the question:
> Do we require tags as a baseline feature for all modes?
Previously no. But I've often thought of no tag as just a special case of
tag == 0. But we agreed at one time that they would have special no tag
meaning such that it was just memory to be used however...
> No tag - no service.
> Heavily implied: Tags are globally unique (uuid)
>
> But I think this resolves a lot of the disparate disagreements on "what
> to do with tags" and how to manage sparseness - just split the policy
> into each individual use-case's respective driver.
I think what I'm worried about is where that policy resides.
I think it is best to have a DCD region driver which simply exposes
extents and allows user space to control how those extents are used. I
think some of what you have above works like that but I want to be careful
baking in policy.
>
> If a sufficiently unique use-case comes along that doesn't fit the
> existing categories - a new region-driver may be warranted.
Again I don't like the idea of needing new drivers for new policies. That
goes against how things should work in the kernel.
Ira
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-03 22:04 ` Ira Weiny
@ 2026-02-04 15:12 ` Gregory Price
2026-02-04 17:57 ` Ira Weiny
0 siblings, 1 reply; 65+ messages in thread
From: Gregory Price @ 2026-02-04 15:12 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Tue, Feb 03, 2026 at 04:04:23PM -0600, Ira Weiny wrote:
> Gregory Price wrote:
... snipping this to the top ...
> Again I don't like the idea of needing new drivers for new policies. That
> goes against how things should work in the kernel.
If you define "How should virtio consume an extent" and "How should
FAMFS consume an extent" as "Policy" I can see your argument, and we
should address this.
I view "All things shall route through DAX" as "A policy" that
dictates cxl-driven changes to dax - including new dax drivers
(see: famfs new dax mechanism).
So we're already there. Might as well reduce the complexity (as
explained below) and cut out dax where it makes sense rather than
force everyone to eat DAX (for potentially negative value).
---
> > has been a concern, in favor of a per-region-driver policy on how to
> > manage hot-add/remove events.
>
> I think a concern would be that each region driver is implementing a
> 'policy' which requires new drivers for new policies.
>
This is fair, we don't want infinite drivers - and many use cases
(we imagine) will end up using DAX - I'm not arguing to get rid of the
dax driver.
There are at least 3 or 4 use-cases i've seen so far
- dax (dev and fs): can share a driver w/ DAXDRV_ selection
- sysram : preferably doing direct hotplug - not via dax
private-ram may re-use this cleanly with some config bits
- virtio : may not even want to expose objects to userland
may prefer to simply directly interact with a VMM
dax may present a security issue if reconfig'd to device
- type-2 : may have wildly different patterns and preferences
may also end up somewhat generalized
I think trying to pump all of these through dax and into userland by
default is a mistake - if only because it drives more complexity.
We should get form from function.
Example: for sysram - dax_kmem is just glue, the hotplug logic should
live in cxl and operate directly on extents. It's simpler and
doesn't add a bunch of needless dependencies.
Consider a hot-unplug request
Current setup
----
FM -> Host
1) Unplug Extent A
Host
2) cxl: hotunplug(dax_map[A])
3) dax: Does this cover the entire dax? (no->reject, yes->unplug())
- might fail due to dax-reasons
- might fail due to normal hot-unplug reasons
4) unbind dax
5) return extent
Dropping Dax in favor of sysram doing direct hotplug
----
FM -> Host
1) Unplug Extent A
Host
2) hotunplug(extents_map[A])
- might fail because of normal hot-unplug reasons
3) return extent
It's just simpler and gives you the option of complete sparseness
(untagged extents) or tracking related extents (tagged extents).
This pattern may not carry over the same with dax or virtio uses.
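The shorter path above is easy to model. Here is a toy userspace sketch (plain C, not kernel code; the extent map, release_extent(), and the 'pinned' stand-in for "normal hot-unplug reasons" are all invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a region tracks a handful of extents by base DPA. */
#define MAX_EXTENTS 8

struct extent {
    unsigned long long base;
    unsigned long long len;
    bool online;   /* currently hotplugged as system RAM */
    bool pinned;   /* stand-in for "normal hot-unplug reasons" */
};

struct region {
    struct extent extents[MAX_EXTENTS];
    size_t nr;
};

static struct extent *find_extent(struct region *r, unsigned long long base)
{
    for (size_t i = 0; i < r->nr; i++)
        if (r->extents[i].base == base)
            return &r->extents[i];
    return NULL;
}

/*
 * Direct-hotplug release path: look the extent up, try to offline it,
 * and return it to the device on success.  No DAX device sits in the
 * middle, so the only failure mode is the hot-unplug itself.
 */
static int release_extent(struct region *r, unsigned long long base)
{
    struct extent *e = find_extent(r, base);

    if (!e || !e->online)
        return -1;             /* unknown extent: reject */
    if (e->pinned)
        return -2;             /* offline failed: FM must retry */
    e->online = false;         /* 2) hotunplug(extents_map[A]) */
    return 0;                  /* 3) return extent */
}
```

The DAX-based flow would need an extra "does this request cover the whole dax device?" check and an unbind step between the lookup and the offline.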
> I did not like the 'implicit' nature of the association of dax device with
> extent. But it maintained backwards compatibility with non-sparse
> regions...
>
> My vision for tags was that eventually dax device creation could have a
> tag specified prior and would only allocate from extents with that tag.
>
yeah i think it's pretty clear the dax case wants a daxN.M/uuid of some
kind (we can argue whether it needs to be exposed to userland - but
having had some conversations about FAMFS, this sounds useful).
> I'm not following this. If set(A) arrives can another set(A) arrive
> later?
>
> How long does the kernel wait for all the 'A's to arrive? Or must they be
> in a ... 'more bit set' set of extents.
>
Set(A) = extents that arrive together with the more bit set
So lets say you get two sets that arrive with the same tag (A)
Set(A) + Set(A)'
Set(A)' would get rejected because Set(A) has already arrived.
Otherwise, accepting Set(A)' implies sparseness of Set(A).
Having a tag map to a region is pointless - the HPA maps extent to
region. So there's no other use for a tag in the sysram case.
On the flip side - assuming you want to try to allow Set(A)+Set(A)',
how is userland expected to know when all extents have arrived if
hotplug cannot occur until all the extents have arrived, and the only
place to put those extents is DAX? Seems needlessly complex.
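For what it's worth, the "reject Set(A)' once Set(A) has arrived" rule is cheap to state in code. A toy userspace sketch (the tag table and accept_extent_set() are invented names; the 16-byte width just mirrors the spec's UUID-sized tag):

```c
#include <stdbool.h>
#include <string.h>

/* Toy tag table: a tag is accepted at most once per region. */
#define MAX_TAGS 16
#define TAG_LEN 16   /* tags are 16-byte UUIDs */

struct tag_table {
    unsigned char tags[MAX_TAGS][TAG_LEN];
    int nr;
};

/*
 * Accept a set of extents identified by 'tag'.  If a set with the same
 * tag was already accepted, reject the newcomer: accepting Set(A)'
 * after Set(A) would reintroduce sparseness within the tag.
 */
static bool accept_extent_set(struct tag_table *t,
                              const unsigned char tag[TAG_LEN])
{
    for (int i = 0; i < t->nr; i++)
        if (!memcmp(t->tags[i], tag, TAG_LEN))
            return false;                    /* Set(A)': reject */
    if (t->nr >= MAX_TAGS)
        return false;
    memcpy(t->tags[t->nr++], tag, TAG_LEN);  /* Set(A): accept */
    return true;
}
```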
> Regardless IMO if user space was monitoring the extents with tag A they
> can decide if and when all those extents have arrived and can build on top
> of that.
>
This assumes userland has something to build on top of, and moreover
that this something will be DAX.
- I agree for a filesystem-consumption pattern.
- I disagree for hotplug - dax is pointless glue.
- I don't know if DAX is right-fit for other use cases. (it might just
want to pass the raw IORESOURCE region to the VMM, for example).
> Are we expecting to have tags and non-tagged extents on the same DCD
> region?
>
> I'm ok not supporting that. But just to be clear about what you are
> suggesting.
>
Probably not. And in fact I think that should be one configuration bit
(either you support tags or you don't - reject the other state).
But I can imagine a driver wanting to support either (exclusive-or)
> Would the cxl_sysram region driver be attached to the DCD partition? Then
> it would have some DCD functionality built in... I guess make a common
> extent processing lib for the 2 drivers?
>
Same driver - allow it to bind PARTMODE_RAM or PARTMODE_DC.
A RAM region hotplugs exactly once: at bind/unbind
A DC region hotplugs at runtime.
Same code, DC just adds the log monitoring stuff.
> I feel like that is a lot of policy being built into the kernel. Where
> having the DCD region driver simply tell user space 'Hey there is a new
> extent here' and then having user space online that as sysram makes the
> policy decision in user space.
>
> Segueing into the N_PRIVATE work. Couldn't we assign that memory to a
> NUMA node with N_PRIVATE-only memory via userspace... Then it is onlined
> in a way that any app which is allocating from that node would get that
> memory. And keep it out of kernel space?
>
> But keep all that policy in user space when an extent appears. Not baked
> into a particular driver.
>
I would need to think this over a bit more, I'm not quite seeing how
what you are suggesting would work.
N_MEMORY_PRIVATE implies there is some special feature of the device
that should be taken into account when managing the memory - but that
you want to re-use (some of) the existing mm/ infrastructure for basic
operations (page_alloc, reclaim, migration, etc).
There's an argument that some such nodes shouldn't even be visible to
userspace (of what use is knowing a node is there if mempolicy commands
are rejected or ignored if you try to bind to it?)
But also, setting N_MEMORY_PRIVATE vs N_MEMORY would explicitly be an
mm/memory_hotplug.c operation - so there's a pretty long path from
userland to "Setting N_MEMORY_PRIVATE" that goes through the drivers.
You can't set N_MEMORY_PRIVATE before going online (has to be done
during the hotplug process, otherwise you get nasty race conditions).
> > But I think this resolves a lot of the disparate disagreements on "what
> > to do with tags" and how to manage sparseness - just split the policy
> > into each individual use-case's respective driver.
>
> I think what I'm worried about is where that policy resides.
>
> I think it is best to have a DCD region driver which simply exposes
> extents and allows user space to control how those extents are used. I
> think some of what you have above works like that but I want to be careful
> baking in policy.
>
I guess summarizing the sysram case: the policy seems simple enough to
not warrant over-complicating the infrastructure for the sake of making
dax "The One Interface To Rule Them All".
All userland wants to do for sysram is hot(un)plug. Why bother with
dax at all?
~Gregory
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-04 15:12 ` Gregory Price
@ 2026-02-04 17:57 ` Ira Weiny
2026-02-04 18:53 ` Gregory Price
0 siblings, 1 reply; 65+ messages in thread
From: Ira Weiny @ 2026-02-04 17:57 UTC (permalink / raw)
To: Gregory Price, Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
Gregory Price wrote:
> On Tue, Feb 03, 2026 at 04:04:23PM -0600, Ira Weiny wrote:
> > Gregory Price wrote:
>
> ... snipping this to the top ...
> > Again I don't like the idea of needing new drivers for new policies. That
> > goes against how things should work in the kernel.
>
> If you define "How should virtio consume an extent" and "How should
> FAMFS consume an extent" as "Policy" I can see your argument, and we
> should address this.
TLDR; I just don't want to see an explosion of 'drivers' for various
'policies'. I think your use of the word 'policy' triggered me.
>
> I view "All things shall route through DAX" as "A policy" that
> dictates cxl-driven changes to dax - including new dax drivers
> (see: famfs new dax mechanism).
>
> So we're already there. Might as well reduce the complexity (as
> explained below) and cut out dax where it makes sense rather than
> force everyone to eat DAX (for potentially negative value).
>
> ---
>
> > > has been a concern, in favor of a per-region-driver policy on how to
> > > manage hot-add/remove events.
> >
> > I think a concern would be that each region driver is implementing a
> > 'policy' which requires new drivers for new policies.
> >
>
> This is fair, we don't want infinite drivers - and many use cases
> (we imagine) will end up using DAX - I'm not arguing to get rid of the
> dax driver.
>
> There are at least 3 or 4 use-cases i've seen so far
>
> - dax (dev and fs): can share a driver w/ DAXDRV_ selection
Legacy... check!
>
> - sysram : preferably doing direct hotplug - not via dax
> private-ram may re-use this cleanly with some config bits
Having read ahead in this entire email, I think what I was thinking was
bundling a lot of this in here. Put knobs here to control 'policy', not
add to this list for more policies.
>
> - virtio : may not even want to expose objects to userland
> may prefer to simply directly interact with a VMM
Even if directly interacting with the VMM there have to be controls
exposed directly to user space. I'm not a virtio expert so... Ok,
let's just say there is another flow here. Don't call it a policy though.
> dax may present a security issue if reconfig'd to device
I don't understand this comment.
>
> - type-2 : may have wildly different patterns and preferences
> may also end up somewhat generalized
I think this is all going to be handled in the specific drivers of the
specific devices. There is no policy here other than 'special' for the
device and we can't control that.
>
> I think trying to pump all of these through dax and into userland by
> default is a mistake - if only because it drives more complexity.
I don't want to preserve DAX. I don't.
So I think this list is fine.
>
> We should get form from function.
>
> Example: for sysram - dax_kmem is just glue, the hotplug logic should
> live in cxl and operate directly on extents. It's simpler and
> doesn't add a bunch of needless dependencies.
Agreed.
>
> Consider a hot-unplug request
>
> Current setup
> ----
> FM -> Host
> 1) Unplug Extent A
> Host
> 2) cxl: hotunplug(dax_map[A])
> 3) dax: Does this cover the entire dax? (no->reject, yes->unplug())
> - might fail due to dax-reasons
> - might fail due to normal hot-unplug reasons
> 4) unbind dax
> 5) return extent
>
> Dropping Dax in favor of sysram doing direct hotplug
> ----
> FM -> Host
> 1) Unplug Extent A
> Host
> 2) hotunplug(extents_map[A])
> - might fail because of normal hot-unplug reasons
> 3) return extent
Agreed.
>
> It's just simpler and gives you the option of complete sparseness
> (untagged extents) or tracking related extents (tagged extents).
Just add the knobs for the tags and yea... the policy of how to handle
the extents can then be controlled by user space.
>
> This pattern may not carry over the same with dax or virtio uses.
I don't fully understand the virtio case. So I'll defer this. But I feel
like this is not so much of a new policy as a different path which is, as
you said above, potentially not in user space at all.
>
> > I did not like the 'implicit' nature of the association of dax device with
> > extent. But it maintained backwards compatibility with non-sparse
> > regions...
> >
> > My vision for tags was that eventually dax device creation could have a
> > tag specified prior and would only allocate from extents with that tag.
> >
>
> yeah i think it's pretty clear the dax case wants a daxN.M/uuid of some
> kind (we can argue whether it needs to be exposed to userland - but
> having some conversations about FAMFS, this sounds userful.
>
> > I'm not following this. If set(A) arrives can another set(A) arrive
> > later?
> >
> > How long does the kernel wait for all the 'A's to arrive? Or must they be
> > in a ... 'more bit set' set of extents.
> >
>
> Set(A) = extents that arrive together with the more bit set
>
> So let's say you get two sets that arrive with the same tag (A)
> Set(A) + Set(A)'
>
> Set(A)' would get rejected because Set(A) has already arrived.
> Otherwise, accepting Set(A)' implies sparseness of Set(A).
>
> Having a tag map to a region is pointless - the HPA maps extent to
> region. So there's no other use for a tag in the sysram case.
>
> On the flip side - assuming you want to try to allow Set(A)+Set(A)'
>
> How is userland expected to know when all extents have arrived, if
> hotplug cannot occur until all the extents have arrived and the only
> place to put those extents is DAX? Seems needlessly complex.
Ok I think we need to sync up on the driver here.
For FAMFS/famdax they can expect the more bit and all that jazz. I can't
stop that.
But for sysram, no. It is easy enough to assign a tag to the region and
any extent which shows up without that tag (be it a NULL tag or some other
tag) gets rejected. All valid tagged extents get hot-plugged.
Simple. Easy policy for user space to control.
>
> > Regardless IMO if user space was monitoring the extents with tag A they
> > can decide if and when all those extents have arrived and can build on top
> > of that.
> >
>
> This assumes userland has something to build on top of, and moreover
> that this something will be DAX.
>
> - I agree for a filesystem-consumption pattern.
> - I disagree for hotplug - dax is pointless glue.
> - I don't know if DAX is right-fit for other use cases. (it might just
> want to pass the raw IORESOURCE region to the VMM, for example).
>
> > Are we expecting to have tags and non-tagged extents on the same DCD
> > region?
> >
> > I'm ok not supporting that. But just to be clear about what you are
> > suggesting.
> >
>
> Probably not. And in fact I think that should be one configuration bit
> (either you support tags or you don't - reject the other state).
Not a bit. Just a non-null uuid being set.
>
> But I can imagine a driver wanting to support either (exclusive-or)
Yes. Set the uuid.
>
> > Would the cxl_sysram region driver be attached to the DCD partition? Then
> > it would have some DCD functionality built in... I guess make a common
> > extent processing lib for the 2 drivers?
> >
>
> Same driver - allow it to bind PARTMODE_RAM or PARTMODE_DC.
ok good.
>
> A RAM region hotplugs exactly once: at bind/unbind
> A DC region hotplugs at runtime.
Yes for every extent as they are seen.
>
> Same code, DC just adds the log monitoring stuff.
Yep.
>
> > I feel like that is a lot of policy being built into the kernel. Where
> > having the DCD region driver simply tell user space 'Hey there is a new
> > extent here' and then having user space online that as sysram makes the
> > policy decision in user space.
> >
> > Segueing into the N_PRIVATE work. Couldn't we assign that memory to a
> > NUMA node with N_PRIVATE only memory via userspace... Then it is onlined
> > in a way that any app which is allocating from that node would get that
> > memory. And keep it out of kernel space?
> >
> > But keep all that policy in user space when an extent appears. Not baked
> > into a particular driver.
> >
>
> I would need to think this over a bit more, I'm not quite seeing how
> what you are suggesting would work.
I think you set it out above. I thought the sysram driver would have a
control for N_MEMORY_PRIVATE vs N_MEMORY which could control that policy
during hotplug. Maybe I'm hallucinating.
>
> N_MEMORY_PRIVATE implies there is some special feature of the device
> that should be taken into account when managing the memory - but that
> you want to re-use (some of) the existing mm/ infrastructure for basic
> operations (page_alloc, reclaim, migration, etc).
>
> There's an argument that some such nodes shouldn't even be visible to
> userspace (of what use is knowing a node is there if mempolicy commands
> are rejected or ignored if you try to bind to it?)
>
> But also, setting N_MEMORY_PRIVATE vs N_MEMORY would explicitly be an
> mm/memory_hotplug.c operation - so there's a pretty long path from
> userland to "Setting N_MEMORY_PRIVATE" that goes through the drivers.
>
> You can't set N_MEMORY_PRIVATE before going online (has to be done
> during the hotplug process, otherwise you get nasty race conditions).
>
> > > But I think this resolves a lot of the disparate disagreements on "what
> > > to do with tags" and how to manage sparseness - just split the policy
> > > into each individual use-case's respective driver.
> >
> > I think what I'm worried about is where that policy resides.
> >
> > I think it is best to have a DCD region driver which simply exposes
> > extents and allows user space to control how those extents are used. I
> > think some of what you have above works like that but I want to be careful
> > baking in policy.
> >
>
> I guess summarizing the sysram case: The policy seems simple enough to
> not warrant over-complicating the infrastructure for the sake of making
> dax "The One Interface To Rule Them All".
>
> All userland wants to do for sysram is hot(un)plug. Why bother with
> dax at all?
I did not want dax. Was not advocating for dax. Just did not want to
build a bunch of new 'drivers' for each new policy.
Summary, it is fine to add new knobs to the sysram driver for new policy
controls. It is _not_ ok to have to put in a new driver.
I'm not clear if sysram could be used for virtio, or even needed. I'm
still figuring out how virtio of simple memory devices is a gain.
Ira
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-04 17:57 ` Ira Weiny
@ 2026-02-04 18:53 ` Gregory Price
2026-02-05 17:48 ` Jonathan Cameron
0 siblings, 1 reply; 65+ messages in thread
From: Gregory Price @ 2026-02-04 18:53 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Wed, Feb 04, 2026 at 11:57:34AM -0600, Ira Weiny wrote:
> Gregory Price wrote:
>
> TLDR; I just don't want to see an explosion of 'drivers' for various
> 'policies'. I think your use of the word 'policy' triggered me.
>
Gotcha. Yeah words are hard. I'm not sure what to call the difference
between the dax pattern and the sysram pattern... workflow?
You're *kind of* encoding "a policy", but more like defining a workflow,
I guess. I suppose I'll update to that terminology unless someone has
something better.
> > - sysram : preferably doing direct hotplug - not via dax
> > private-ram may re-use this cleanly with some config bits
>
> Pre-reading this entire email I think what I was thinking was bundling a
> lot of this in here. Put knobs here to control 'policy', not add to this
> list for more policies.
>
yup, so you have some sysram_region/ specific knobs
sysram_region0/online_type
sysram_region0/extents/[A,B,C]
> >
> >
... snipping out virtio stuff until the end ...
>
> But for sysram. No. It is easy enough to assign a tag to the region and
> any extent which shows up without that tag (be it NULL tag or tag A) gets
> rejected. All valid tagged extents get hot plugged.
>
> Simple. Easy policy for user space to control.
>
Of what use is a tag for a sysram region?
The HPA is effectively a tag in this case.
An HPA can only belong to one region.
> >
> > I would need to think this over a bit more, I'm not quite seeing how
> > what you are suggesting would work.
>
> I think you set it out above. I thought the sysram driver would have a
> control for N_MEMORY_PRIVATE vs N_MEMORY which could control that policy
> during hotplug. Maybe I'm hallucinating.
>
I imagine a device driver setting up a sysram_region with a private bit
before it goes to hotplug.
this would dictate whether it called
add_memory_driver_managed() or
add_private_memory_driver_managed()
so like
my_driver_code:
sysram = create_sysram_region(...);
sysram.private_callbacks = my_driver_callbacks;
... continue with the rest of configuration ...
probe(sysram); /* sysram does the registration */
Since private-memory users actually have *device-defined* POLICY (yes,
policy) of some kind, I can imagine those devices needing to provide
drivers that set up that policy.
example: compressed memory devices may want to be on a demote-only node
and control page-table mappings to enforce Read-Only.
(note: don't get hung-up on callbacks, design here is not set, just
things floating around)
But in the short term, we should try to design it such that additional
drivers are not needed where reasonable.
I can imagine this showing up as needing mm/cram.c and registering a
compressed-node with mm/cram.c rather than enabling driver callbacks
(I'm learning callbacks are a mess, and am going to try to avoid them).
> Summary, it is fine to add new knobs to the sysram driver for new policy
> controls. It is _not_ ok to have to put in a new driver.
>
Well, we don't have a sysram driver at the moment :P
We have a region driver :]
We should have a sysram driver and split up the workflows between dax
and sysram.
> I'm not clear if sysram could be used for virtio, or even needed. I'm
> still figuring out how virtio of simple memory devices is a gain.
>
Jonathan mentioned that he thinks it would be possible to just bring it
online as a private-node and inform the consumer of this. I think
that's probably reasonable.
~Gregory
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-04 18:53 ` Gregory Price
@ 2026-02-05 17:48 ` Jonathan Cameron
2026-02-06 11:01 ` Alireza Sanaee
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2026-02-05 17:48 UTC (permalink / raw)
To: Gregory Price
Cc: Ira Weiny, Dave Jiang, Fan Ni, Dan Williams, Davidlohr Bueso,
Alison Schofield, Vishal Verma, linux-cxl, nvdimm, linux-kernel,
Li Ming, Alireza Sanaee
> > I'm not clear if sysram could be used for virtio, or even needed. I'm
> > still figuring out how virtio of simple memory devices is a gain.
> >
>
> Jonathan mentioned that he thinks it would be possible to just bring it
> online as a private-node and inform the consumer of this. I think
> that's probably reasonable.
Firstly VM == Application. If we have say a DB that wants to do everything
itself, it would use the same interface as a VM to get the whole memory
on offer. (I'm still trying to get that Application Specific Memory term
adopted ;)
This would be better if we didn't assume anything to do with virtio
- that's just one option (and right now for CXL mem probably not the
sensible one as it's missing too many things we get for free by just
emulating CXL devices - e.g. all the stuff you are describing here
for the host is just as valid in the guest.) We have a path to
get that emulation and should have the big missing piece posted shortly
(DCD backed by 'things - this discussion' that turn up after VM boot).
The real topic is memory for a VM and we need a way to tie a memory
backend in qemu to it, so that whatever the fabric manager provided for
that VM is given to the VM and not used for anything else.
If it's for a specific VM, then it's tagged, as otherwise how else
would we know the intent? (let's ignore random other out-of-band paths).
Layering wise we can surface as many backing sources as we like at
runtime via 1+ emulated DCD devices (to give perf information etc).
They each show up in the guest as contiguous (maybe tagged) single
extent and then we apply whatever comes out of the rest of this
discussion on top of that.
So all we care about is how the host presents it.
Bunch of things might work for this.
1. Just put it in a numa node that requires specific selection to allocate
from. This is nice because it just looks like normal memory and we
can apply any type of front end on top of that. Not good if we have a lot
of these coming and going.
2. Provide it as something with an fd we can memmap. I was fine with Dax for
this but if it's normal ram just for a VM anything that gives me a handle
that I can memmap is fine. Just need a way to know which one (so tag).
It's pretty similar for shared cases. Just need a handle to memmap.
In that case, tag goes straight up to guest OS (we've just unwound the
extent ordering in the host and presented it as a contiguous single
extent).
Assumption here is we always provide all that capacity that was tagged
for the VM to use to the VM. Things may get more entertaining if we have
a bunch of capacity that was tagged to provide extra space for a set of
VMs (e.g. we overcommit on top of the DCD extents) - to me that's a
job for another day.
So I'm not really envisioning anything special for the VM case, it's
just a dedicated allocation of memory for a user who knows how to get it.
We will want a way to get perf info though so we can provide that
in the VM. Maybe can figure that out from the CXL HW backing it without
needing anything special in what is being discussed here.
Jonathan
>
> ~Gregory
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-05 17:48 ` Jonathan Cameron
@ 2026-02-06 11:01 ` Alireza Sanaee
2026-02-06 13:26 ` Gregory Price
0 siblings, 1 reply; 65+ messages in thread
From: Alireza Sanaee @ 2026-02-06 11:01 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Gregory Price, Ira Weiny, Dave Jiang, Fan Ni, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Thu, 5 Feb 2026 17:48:47 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
Hi Jonathan,
Thanks for the clarifications.
Quick thought inline.
> > > I'm not clear if sysram could be used for virtio, or even needed. I'm
> > > still figuring out how virtio of simple memory devices is a gain.
> > >
> >
> > Jonathan mentioned that he thinks it would be possible to just bring it
> > online as a private-node and inform the consumer of this. I think
> > that's probably reasonable.
>
> Firstly VM == Application. If we have say a DB that wants to do everything
> itself, it would use the same interface as a VM to get the whole memory
> on offer. (I'm still trying to get that Application Specific Memory term
> adopted ;)
>
> This would be better if we didn't assume anything to do with virtio
> - that's just one option (and right now for CXL mem probably not the
> sensible one as it's missing too many things we get for free by just
> emulating CXL devices - e.g. all the stuff you are describing here
> for the host is just as valid in the guest.) We have a path to
> get that emulation and should have the big missing piece posted shortly
> (DCD backed by 'things - this discussion' that turn up after VM boot).
>
> The real topic is memory for a VM and we need a way to tie a memory
> backend in qemu to it, so that whatever the fabric manager provided for
> that VM is given to the VM and not used for anything else.
>
> If it's for a specific VM, then it's tagged, as otherwise how else
> would we know the intent? (let's ignore random other out-of-band paths).
>
> Layering wise we can surface as many backing sources as we like at
> runtime via 1+ emulated DCD devices (to give perf information etc).
> They each show up in the guest as contiguous (maybe tagged) single
> extent and then we apply whatever comes out of the rest of this
> discussion on top of that.
>
> So all we care about is how the host presents it.
>
> Bunch of things might work for this.
>
> 1. Just put it in a numa node that requires specific selection to allocate
> from. This is nice because it just looks like normal memory and we
> can apply any type of front end on top of that. Not good if we have a lot
> of these coming and going.
>
> 2. Provide it as something with an fd we can memmap. I was fine with Dax for
> this but if it's normal ram just for a VM anything that gives me a handle
> that I can memmap is fine. Just need a way to know which one (so tag).
I think both of these approaches are OK, but looking at it from a developer's
perspective, if someone wants specific memory for their workload, they
should rather get an fd and play with it in whichever way they want. NUMA may
not give that much flexibility. As a developer I would prefer 2. Though you
may say, oh, dax then? Not sure!
>
> It's pretty similar for shared cases. Just need a handle to memmap.
> In that case, tag goes straight up to guest OS (we've just unwound the
> extent ordering in the host and presented it as a contiguous single
> extent).
>
> Assumption here is we always provide all that capacity that was tagged
> for the VM to use to the VM. Things may get more entertaining if we have
> a bunch of capacity that was tagged to provide extra space for a set of
> VMs (e.g. we overcommit on top of the DCD extents) - to me that's a
> job for another day.
>
> So I'm not really envisioning anything special for the VM case, it's
> just a dedicated allocation of memory for a user who knows how to get it.
> We will want a way to get perf info though so we can provide that
> in the VM. Maybe can figure that out from the CXL HW backing it without
> needing anything special in what is being discussed here.
>
> Jonathan
>
> >
> > ~Gregory
>
* Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-02-06 11:01 ` Alireza Sanaee
@ 2026-02-06 13:26 ` Gregory Price
0 siblings, 0 replies; 65+ messages in thread
From: Gregory Price @ 2026-02-06 13:26 UTC (permalink / raw)
To: Alireza Sanaee
Cc: Jonathan Cameron, Ira Weiny, Dave Jiang, Fan Ni, Dan Williams,
Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
nvdimm, linux-kernel, Li Ming
On Fri, Feb 06, 2026 at 11:01:30AM +0000, Alireza Sanaee wrote:
> On Thu, 5 Feb 2026 17:48:47 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> I think both of these approaches are OK, but looking at it from a developer's
> perspective, if someone wants specific memory for their workload, they
> should rather get an fd and play with it in whichever way they want. NUMA may
> not give that much flexibility. As a developer I would prefer 2. Though you
> may say, oh, dax then? Not sure!
DAX or numa-aware memfd
If you want *specific* memory (a particular HPA/DPA range), tagged dax is
probably appropriate.
If you just want any old page from a particular chunk of HPA, then
probably some kind of numa-aware memfd would be simplest (though this
may require new interfaces, since memfd is not currently numa-aware).
We might be able to make private node work specifically with membind
policy on a VMA (not on a task). That would probably be sufficient.
~Gregory