* DCD: Add support for Dynamic Capacity Devices (DCD)
@ 2026-06-25 11:04 Anisa Su
  2026-06-25 11:04 ` [PATCH v11 01/31] cxl/mbox: Flag " Anisa Su
                   ` (30 more replies)
  0 siblings, 31 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

Table of Contents
=================
  1. Changes since v10
  2. Background
  3. Patch organization
  4. Notable
  5. Testing

This series branch: https://github.com/anisa-su993/anisa-linux-kernel/tree/dcd-v11-06-23-26
NDCTL branch: https://github.com/anisa-su993/anisa-ndctl/tree/dcd-2026-06-24

v10: https://lore.kernel.org/linux-cxl/ajuMJi5nTQRB_ZP0@AnisaLaptop.localdomain/T/#mfdfc28c829071204333824c542ca3af4170dafb4

Changes since v10
=================
The overall architecture and semantics are unchanged; v11 consists of
review fixes, naming/ABI corrections, and fixes for locking/concurrency
edge cases between the CXL and DAX layers.

Naming / ABI:
 - Renamed dynamic_ram_a to dynamic_ram_1 throughout (endpoint-decoder
   mode, the partition sysfs name, and enum CXL_PARTMODE_DYNAMIC_RAM_1),
   matching the numbered-partition convention.
 - Sharable extent sequence numbers are now a dense 0..n-1 (previously
   1..n); the CXL validation path and the DAX claim path enforce the same
   0..n-1 invariant.
 - The DAX 'uuid' attribute reads back the null UUID (all-zeroes) when
   untagged rather than "0".
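
The dense 0..n-1 invariant above can be illustrated with a small user-space
sketch.  This is not the code from the series: seq_is_dense() is a
hypothetical helper, and it assumes "dense" means every value in 0..n-1
appears exactly once, in any arrival order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the series): returns true if seq[0..n-1]
 * contains each value in 0..n-1 exactly once.
 */
static bool seq_is_dense(const unsigned int *seq, size_t n)
{
	bool *seen;
	bool ok = true;
	size_t i;

	if (n == 0)
		return true;

	seen = calloc(n, sizeof(*seen));
	if (!seen)
		return false;

	for (i = 0; i < n; i++) {
		/* Out-of-range values (e.g. a 1-based "n") or duplicates fail */
		if (seq[i] >= n || seen[seq[i]]) {
			ok = false;
			break;
		}
		seen[seq[i]] = true;
	}

	free(seen);
	return ok;
}
```

With this check, a group numbered {2, 0, 1} is accepted, while the old
1-based numbering {1, 2, 3} is rejected because 3 is out of range for n = 3.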

Recovery and lifecycle:
 - Creating a region over a DC partition now reads the device's
   already-accepted extents at cxl_dax_region probe time; recovered
   extents are not re-acknowledged via Add-DC-Response.  New add events
   are deferred until the initial scan completes so a tag already in
   use is never registered twice.
 - Per-tag-group add and release of DAX resources are atomic
   (all-or-none).  Previously, adding a tag group took the lock
   separately for each extent addition; the lock now covers the entire
   group.
 - Added an upper bound of 100 pending extents so that the 20-second
   timeout for closing a More chain cannot be refreshed indefinitely
   (unlikely unless the device is malicious).
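
To make the pending-extent bound concrete, here is a minimal user-space
sketch of the idea.  Only CXL_DC_MAX_PENDING_EXTENTS and its value of 100
come from this series; struct pending_chain, chain_add(), and
DC_CHAIN_TIMEOUT_MS are hypothetical names, and the sketch assumes each
More=1 extent refreshes the deadline and that the extent exceeding the cap
causes the whole chain to be dropped.

```c
#include <stdbool.h>
#include <stdint.h>

#define CXL_DC_MAX_PENDING_EXTENTS 100	/* value stated in the cover letter */
#define DC_CHAIN_TIMEOUT_MS 20000	/* 20 s; hypothetical name */

/* Hypothetical bookkeeping for one in-flight More=1 chain */
struct pending_chain {
	unsigned int nr_pending;
	uint64_t deadline_ms;
};

enum chain_result { CHAIN_PENDING, CHAIN_COMPLETE, CHAIN_DROPPED };

static enum chain_result chain_add(struct pending_chain *c, uint64_t now_ms,
				   bool more)
{
	/* A stalled chain is dropped once its deadline has passed */
	if (c->nr_pending && now_ms > c->deadline_ms) {
		c->nr_pending = 0;
		return CHAIN_DROPPED;
	}

	/* The cap stops a device from refreshing the deadline forever */
	if (c->nr_pending >= CXL_DC_MAX_PENDING_EXTENTS) {
		c->nr_pending = 0;
		return CHAIN_DROPPED;
	}

	c->nr_pending++;
	if (!more) {
		/* More=0 closes the chain; the group can now be processed */
		c->nr_pending = 0;
		return CHAIN_COMPLETE;
	}

	c->deadline_ms = now_ms + DC_CHAIN_TIMEOUT_MS;
	return CHAIN_PENDING;
}
```

Without the cap, a device sending one More=1 extent every few seconds would
keep pushing the deadline out; with it, the chain is bounded both in time
per extent and in total length.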

Robustness (device-supplied data is treated as untrusted):
 - Added various checks on device-supplied payload sizes, including
   overflow/underflow checks.
 - Fix places where we need to check for native_cxl to avoid overriding
   BIOS-owned events

Documentation:
 - Small changes to reflect dynamic_ram_a to dynamic_ram_1 change and the
   sequence num change (0...n-1 instead of 1...n)
 - Bump kver to 7.3 and date for sysfs attribute documentation

Signoffs/Tags:
 - Updated Ira's signoffs and authored-by to use iweiny@kernel.org
 - Updated Jonathan Cameron's email to jic23@kernel.org for various review tags
 - Updated Fan's email to nifan.cxl@gmail.com
 - Updated Dan's email to djbw@kernel.org

Background
==========
A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows memory capacity within a region to change
dynamically without resetting the device, reconfiguring HDM decoders,
or reconfiguring software DAX regions.

One of the biggest anticipated use cases for Dynamic Capacity is to
allow memory to be dynamically added to or removed from a host within a
data center without physically changing the per-host attached memory or
rebooting the host.

The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory.  Generally there are five
actors in such a system: the Orchestrator, the Fabric Manager (FM), the
Logical Device, the Host Kernel, and a Host User.

An example workflow is shown below.

Orchestrator      FM         Device       Host Kernel    Host User
    |             |           |            |               |
    |-------------- Create region ------------------------>|
    |             |           |            |               |
    |             |           |            |<-- Create ----|
    |             |           |            |    Region     |
    |             |           |            |(dynamic_ram_1)|
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |-- Add ----->|-- Add --->|--- Add --->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Accept -|<- Accept  -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |<- Create -----|
    |             |           |            |   DAX dev     |-- Use memory
    |             |           |            |               |   |
    |             |           |            |               |   |
    |             |           |            |<- Release ----| <-+
    |             |           |            |   DAX dev     |
    |             |           |            |               |
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |-- Remove -->|- Release->|- Release ->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Release-|<- Release -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |               |
    |-- Add ----->|-- Add --->|--- Add --->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Accept -|<- Accept  -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |<- Create -----|
    |             |           |            |   DAX dev     |-- Use memory
    |             |           |            |               |   |
    |             |           |            |<- Release ----| <-+
    |             |           |            |   DAX dev     |
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |-- Remove -->|- Release->|- Release ->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Release-|<- Release -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |               |
    |-- Add ----->|-- Add --->|--- Add --->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |<- Create -----|
    |             |           |            |   DAX dev     |-- Use memory
    |             |           |            |               |   |
    |-- Remove -->|- Release->|- Release ->|               |   |
    |  Capacity   |  Extent   |   Extent   |               |   |
    |             |           |            |               |   |
    |             |           |     (Release Ignored)      |   |
    |             |           |            |               |   |
    |             |           |            |<- Release ----| <-+
    |             |           |            |   DAX dev     |
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |             |- Release->|- Release ->|               |
    |             |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Release-|<- Release -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |<- Destroy ----|
    |             |           |            |   Region      |
    |             |           |            |               |


Patch organization
==================
Device enablement and partition configuration:
 - cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
 - cxl/mem: Read dynamic capacity configuration from the device
 - cxl/cdat: Gather DSMAS data for DCD partitions
 - cxl/core: Enforce partition order/simplify partition calls
 - cxl/mem: Expose dynamic ram 1 partition in sysfs
 - cxl/port: Add 'dynamic_ram_1' to endpoint decoder mode
 - cxl/region: Add DC DAX region support

Event and interrupt plumbing:
 - cxl/events: Split event msgnum configuration from irq setup
 - cxl/pci: Factor out interrupt policy check
 - cxl/mem: Configure dynamic capacity interrupts
 - cxl/core: Return endpoint decoder information from region search
 - cxl/mem: Set up framework for handling DC Events
 - cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains

Extent handling - add, release, and validation:
 - cxl/extent: Handle DC Add Capacity events
 - cxl/mem: Drop misaligned DCD extent groups
 - cxl/extent: Validate DC extent partition
 - cxl/mem: Enforce tag-group semantics
 - cxl/extent: Handle DC Release Capacity events
 - cxl/extent: Enforce cross-region tag uniqueness
 - cxl/region/extent: Expose dc_extent information in sysfs

DAX resource surfacing and device model:
 - cxl + dax: Surface dax_resources on DCD Add Capacity events
 - cxl + dax: Release dax_resources on DCD Release Capacity events
 - dax/bus: Factor out dev dax resize logic
 - dax/bus: Add uuid sysfs attribute to dax devices
 - dax/bus: Reject resize on DC dax devices and enforce 0-size creation
 - dax/bus: Tag-aware uuid claim and show on DC dax devices
 - cxl/region: Read existing extents on region creation

Tracing, test infrastructure, and documentation:
 - cxl/mem: Trace Dynamic capacity Event Record
 - tools/testing/cxl: Make event logs dynamic
 - tools/testing/cxl: Add DC Regions to mock mem data
 - Documentation/cxl: Document DCD extent handling and DC-backed DAX regions


Notable
=======
 - A More=1 add chain is bounded by the 20s timeout and CXL_DC_MAX_PENDING_EXTENTS,
   set to 100. Suggested by Sashiko as a defensive cap against a fabric manager
   that never closes the chain.  The value is arbitrary; feedback on it is welcome.

 - Several Sashiko review comments assumed multiple host threads could process a
   single DCD add event, or concurrently mutate one tag group, at the same
   time. I don't believe that can happen: DCD events for a memdev are delivered
   and handled serially by that device's event-interrupt thread,
   and a tag group is owned by exactly one memory device.  I therefore did
   not act on those comments. Please correct me if this assumption is wrong
   so I can address them.

Testing
=======
ndctl unit suite: built and run against the QEMU cxl_test mock with the
ndctl 'cxl' suite (branch dcd-2026-06-24): 16 of 17 tests pass,
including cxl-dcd.sh and the cxl-region-replay.sh crash-recovery test
that exercises reading pre-existing extents on region creation;
cxl-features is skipped as unsupported.

QEMU end-to-end: used Ali's QEMU patchset adding tag support
[1], with the below topology:

TOPO='-object memory-backend-file,id=cxl-mem1,mem-path=/tmp/t3_cxl1.raw,size=12G \
     -object memory-backend-file,id=cxl-lsa1,mem-path=/tmp/t3_lsa1.raw,size=1G \
     -device usb-ehci,id=ehci \
     -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true \
     -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
     -device cxl-type3,bus=cxl_rp_port0,id=cxl-dcd0,dc-regions-total-size=12G,num-dc-regions=1,sn=99 \
     -device usb-cxl-mctp,bus=ehci.0,id=usb1,target=cxl-dcd0 \
     -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=12G,cxl-fmw.0.interleave-granularity=1k'

The exact instructions are the same as the previous version, so I've truncated some details.

  1. Boot the guest.
  2. QMP object-add a tagged 8G memory-backend-ram
     (tag 5be13bce-ae34-4a77-b6c3-16df975fcf1a).
  3. cxl create-region -m -d decoder0.0 -w 1 -s 8G mem0 -t dynamic_ram_1
  4. QMP cxl-add-dynamic-capacity (prescriptive, region 0, same tag)
     injecting an 8G extent at offset 0.
  5. The extent surfaces under the region: dax_region0/extent0.0 reports
     offset 0x0, length 0x200000000, uuid 5be13bce-...
  6. daxctl create-device -r region0 --uuid 5be13bce-... creates the 8G
     devdax device.

We are also working with some internal teams to test on real hardware, so
I'll report any findings as we go.

References:
[1] https://lore.kernel.org/linux-cxl/20260325184259.366-1-alireza.sanaee@huawei.com/T/#t

This series applies on the v7.1 tag (Linus' tree).

base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6

Anisa Su (6):
  cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains
  cxl/mem: Enforce tag-group semantics
  cxl/extent: Enforce cross-region tag uniqueness
  dax/bus: Add uuid sysfs attribute to dax devices
  dax/bus: Tag-aware uuid claim and show on DC dax devices
  Documentation/cxl: Document DCD extent handling and DC-backed DAX
    regions

Ira Weiny (25):
  cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  cxl/mem: Read dynamic capacity configuration from the device
  cxl/cdat: Gather DSMAS data for DCD partitions
  cxl/core: Enforce partition order/simplify partition calls
  cxl/mem: Expose dynamic ram 1 partition in sysfs
  cxl/port: Add 'dynamic_ram_1' to endpoint decoder mode
  cxl/region: Add DC DAX region support
  cxl/events: Split event msgnum configuration from irq setup
  cxl/pci: Factor out interrupt policy check
  cxl/mem: Configure dynamic capacity interrupts
  cxl/core: Return endpoint decoder information from region search
  cxl/mem: Set up framework for handling DC Events
  cxl/extent: Handle DC Add Capacity events
  cxl/mem: Drop misaligned DCD extent groups
  cxl/extent: Validate DC extent partition
  cxl/extent: Handle DC Release Capacity events
  cxl/region/extent: Expose dc_extent information in sysfs
  cxl + dax: Surface dax_resources on DCD Add Capacity events
  cxl + dax: Release dax_resources on DCD Release Capacity events
  dax/bus: Factor out dev dax resize logic
  dax/bus: Reject resize on DC dax devices and enforce 0-size creation
  cxl/region: Read existing extents on region creation
  cxl/mem: Trace Dynamic capacity Event Record
  tools/testing/cxl: Make event logs dynamic
  tools/testing/cxl: Add DC Regions to mock mem data

 Documentation/ABI/testing/sysfs-bus-cxl       |  100 +-
 Documentation/ABI/testing/sysfs-bus-dax       |   18 +
 .../driver-api/cxl/linux/cxl-driver.rst       |  149 +++
 .../driver-api/cxl/linux/dax-driver.rst       |  169 +++
 drivers/cxl/core/Makefile                     |    2 +-
 drivers/cxl/core/cdat.c                       |   12 +
 drivers/cxl/core/core.h                       |   67 +-
 drivers/cxl/core/extent.c                     |  783 ++++++++++++
 drivers/cxl/core/hdm.c                        |   14 +-
 drivers/cxl/core/mbox.c                       | 1107 +++++++++++++++-
 drivers/cxl/core/memdev.c                     |   87 +-
 drivers/cxl/core/port.c                       |    9 +
 drivers/cxl/core/region.c                     |   53 +-
 drivers/cxl/core/region_dax.c                 |   49 +-
 drivers/cxl/core/trace.h                      |   75 ++
 drivers/cxl/cxl.h                             |  114 +-
 drivers/cxl/cxlmem.h                          |  162 ++-
 drivers/cxl/mem.c                             |    2 +-
 drivers/cxl/pci.c                             |  136 +-
 drivers/dax/bus.c                             |  653 +++++++++-
 drivers/dax/bus.h                             |    4 +-
 drivers/dax/cxl.c                             |  115 +-
 drivers/dax/dax-private.h                     |   63 +
 drivers/dax/hmem/hmem.c                       |    2 +-
 drivers/dax/pmem.c                            |    2 +-
 include/cxl/cxl.h                             |    7 +-
 include/cxl/event.h                           |   38 +
 tools/testing/cxl/Kbuild                      |    5 +-
 tools/testing/cxl/test/cxl.c                  |   12 +
 tools/testing/cxl/test/mem.c                  | 1109 +++++++++++++++--
 tools/testing/cxl/test/mock.h                 |    9 +
 31 files changed, 4858 insertions(+), 269 deletions(-)
 create mode 100644 drivers/cxl/core/extent.c

-- 
2.43.0



* [PATCH v11 01/31] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-26 21:43   ` Dave Jiang
  2026-06-25 11:04 ` [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
                   ` (29 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Per the CXL 4.0 specification, software must check the Command Effects
Log (CEL) for dynamic capacity command support.

Detect support for the DCD commands while reading the CEL, including:

        Get DC Config
        Get DC Extent List
        Add DC Response
        Release DC

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---

Changes:
- remove unused param struct cxl_memdev_state *mds from
  cxl_set_dcd_cmd_enabled()

- remove unused param struct cxl_memdev_state *mds from
  cxl_verify_dcd_cmds()

- cxl_verify_dcd_cmds(): originally filled out a local
  bitmap with all DCD cmd bits and checked whether the cmds_seen
  bitmap was equal to it. Replace with a
  simple call to bitmap_full(cmds_seen)

- cxl_walk_cel(): zero out dcd_cmds bitmap before using

- cxlmem.h: Add comment to enum dcd_cmd_enabled_bits
  pointing to where the command set is defined in the
  4.0 spec

- original commit message referred to CXL r3.1. Bump to r4.0
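
For reviewers who want to see the CEL check in isolation, the logic
described in the commit message can be sketched in user space as follows.
The opcodes 0x4800-0x4803 are taken from this patch; dcd_supported() and
the plain unsigned bitmask are stand-ins for cxl_walk_cel() and the kernel
bitmap helpers, and the sketch assumes the low byte of the opcode can be
used directly as the bit index, which only holds because the enum order
matches the opcode order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same order as enum dcd_cmd_enabled_bits in the patch */
enum {
	DCD_GET_CONFIG,
	DCD_GET_EXTENT_LIST,
	DCD_ADD_RESPONSE,
	DCD_RELEASE,
	DCD_MAX
};

/*
 * Stand-in for the CEL walk: DCD is reported as supported only if all
 * four 48xxh opcodes are present in the list.
 */
static bool dcd_supported(const uint16_t *cel_opcodes, size_t n)
{
	unsigned int seen = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint16_t op = cel_opcodes[i];

		/* Command set is the high byte; 48h is the DCD set */
		if ((op >> 8) != 0x48)
			continue;

		/* Opcodes beyond 4803h are ignored, as in the patch's default case */
		if ((op & 0xff) < DCD_MAX)
			seen |= 1u << (op & 0xff);
	}

	/* Equivalent of bitmap_full(): every bit below DCD_MAX is set */
	return seen == (1u << DCD_MAX) - 1;
}
```

A CEL that lists 0x4800, 0x4801, and 0x4802 but not 0x4803 therefore
leaves DCD disabled, which matches the all-or-nothing intent of
cxl_verify_dcd_cmds().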
---
 drivers/cxl/core/mbox.c | 39 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    | 20 ++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 7c6c5b7450a5..07aba6f0b719 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -165,6 +165,38 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
 	}
 }
 
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(u16 opcode, unsigned long *cmd_mask)
+{
+	switch (opcode) {
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		set_bit(CXL_DCD_ENABLED_GET_CONFIG, cmd_mask);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, cmd_mask);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, cmd_mask);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		set_bit(CXL_DCD_ENABLED_RELEASE, cmd_mask);
+		break;
+	default:
+		break;
+	}
+}
+
+static bool cxl_verify_dcd_cmds(unsigned long *cmds_seen)
+{
+	return bitmap_full(cmds_seen, CXL_DCD_ENABLED_MAX);
+}
+
 static bool cxl_is_poison_command(u16 opcode)
 {
 #define CXL_MBOX_OP_POISON_CMDS 0x43
@@ -757,6 +789,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
 	struct cxl_cel_entry *cel_entry;
 	const int cel_entries = size / sizeof(*cel_entry);
+	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX) = {};
 	struct device *dev = mds->cxlds.dev;
 	int i, ro_cmds = 0, wr_cmds = 0;
 
@@ -785,11 +818,17 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 			enabled++;
 		}
 
+		if (cxl_is_dcd_command(opcode)) {
+			cxl_set_dcd_cmd_enabled(opcode, dcd_cmds);
+			enabled++;
+		}
+
 		dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
 			enabled ? "enabled" : "unsupported by driver");
 	}
 
 	set_features_cap(cxl_mbox, ro_cmds, wr_cmds);
+	mds->dcd_supported = cxl_verify_dcd_cmds(dcd_cmds);
 }
 
 static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_memdev_state *mds)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 776c50d1db51..60dc3f0006a7 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -230,6 +230,20 @@ struct cxl_event_state {
 	struct mutex log_lock;
 };
 
+/*
+ * CXL r4.0 Section 8.2.10.9 - Memory Device Command Sets. See Table 8-308.
+ *
+ * The 48h Command Set (Opcodes 4800h - 4803h) defines the device-enabled DCD
+ * commands.
+ */
+enum dcd_cmd_enabled_bits {
+	CXL_DCD_ENABLED_GET_CONFIG,
+	CXL_DCD_ENABLED_GET_EXTENT_LIST,
+	CXL_DCD_ENABLED_ADD_RESPONSE,
+	CXL_DCD_ENABLED_RELEASE,
+	CXL_DCD_ENABLED_MAX
+};
+
 /* Device enabled poison commands */
 enum poison_cmd_enabled_bits {
 	CXL_POISON_ENABLED_LIST,
@@ -405,6 +419,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @partition_align_bytes: alignment size for partition-able capacity
  * @active_volatile_bytes: sum of hard + soft volatile
  * @active_persistent_bytes: sum of hard + soft persistent
+ * @dcd_supported: all DCD commands are supported
  * @event: event log driver state
  * @poison: poison driver state info
  * @security: security driver state info
@@ -424,6 +439,7 @@ struct cxl_memdev_state {
 	u64 partition_align_bytes;
 	u64 active_volatile_bytes;
 	u64 active_persistent_bytes;
+	bool dcd_supported;
 
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
@@ -485,6 +501,10 @@ enum cxl_opcode {
 	CXL_MBOX_OP_UNLOCK		= 0x4503,
 	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
 	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
+	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
+	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
+	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
+	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
 	CXL_MBOX_OP_MAX			= 0x10000
 };
 
-- 
2.43.0



* [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
  2026-06-25 11:04 ` [PATCH v11 01/31] cxl/mbox: Flag " Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:16   ` sashiko-bot
  2026-06-26 22:26   ` Dave Jiang
  2026-06-25 11:04 ` [PATCH v11 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
                   ` (28 subsequent siblings)
  30 siblings, 2 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Devices which optionally support Dynamic Capacity (DC) are configured
via mailbox commands.  CXL r4.0 section 9.13.3 requires the host to issue
the Get DC Configuration command in order to properly configure DCDs.
Without the Get DC Configuration command DCD can't be supported.

Implement the DC mailbox commands as specified in CXL 4.0 section
8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
information.  Disable DCD if an invalid configuration is found.

Linux has no support for more than one dynamic capacity partition.  Read
and validate all the partitions but configure only the first partition
as 'dynamic ram 1'.  Additional partitions can be added in the future if
such a device ever materializes.  Additionally, no skip is expected
between the end of the pmem partition and the start of the first DC
partition.  Check for such a skip and disable DCD if one is found.
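
The per-partition validation described above can be sketched in user space
as follows.  This is an illustration rather than the patch's
cxl_dc_check(): struct dc_part and dc_part_valid() are hypothetical, the
64-byte value for CXL_DCD_BLOCK_LINE_SIZE is an assumption, and overflow of
start + size is not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_256M		(256ULL << 20)
#define BLOCK_LINE_SIZE	64ULL	/* assumed value of CXL_DCD_BLOCK_LINE_SIZE */

/* Hypothetical host-endian view of one DC partition */
struct dc_part {
	uint64_t start;		/* DPA base */
	uint64_t size;		/* decode length in bytes */
	uint64_t len;		/* usable length in bytes */
	uint64_t blk_size;	/* extent block size in bytes */
};

static bool is_pow2(uint64_t v)
{
	return v && !(v & (v - 1));
}

/* prev may be NULL for the first partition */
static bool dc_part_valid(const struct dc_part *prev, const struct dc_part *p)
{
	/* Partitions must be in increasing, non-overlapping DPA order */
	if (prev && prev->start + prev->size > p->start)
		return false;

	/* Block size: non-zero power of two and a multiple of the line size */
	if (!is_pow2(p->blk_size) || p->blk_size % BLOCK_LINE_SIZE)
		return false;

	/* Length must be non-zero, fit in the decode size, and be block aligned */
	if (!p->size || !p->len || p->size < p->len ||
	    (p->len & (p->blk_size - 1)))
		return false;

	/* Start must be 256MB aligned and block aligned */
	if ((p->start & (SZ_256M - 1)) || (p->start & (p->blk_size - 1)))
		return false;

	return true;
}
```

The sketch validates the block size before any mask-based alignment test,
since `x & (blk_size - 1)` is only meaningful when blk_size is a power of
two.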

Linux has no use for the trailing fields of the Get Dynamic Capacity
Configuration Output Payload (total number of supported extents, number
of available extents, total number of supported tags, and number of
available tags).  Leave those fields undefined so that the partition
list can be declared as a C flexible array member.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. Move partition alignment check after is_power_of_2() check on
   blk_size, as IS_ALIGNED(partition start, blk_size) expects blk_size
   to be a power of 2 in cxl_dc_check()

2. cxl_get_dc_config(): verify mbox_cmd.size_out against
   dc_resp->partitions_returned

3. cxl_dev_dc_identify(): originally calculated size of dc_resp using
   struct cxl_dc_partition_info, but dc_resp->partition[] is of type
   struct cxl_dc_partition. Fix size calculation.

4. fix do/while loop in cxl_dev_dc_identify to protect against returning
   0 partitions infinitely

5. cxl_configure_dcd(): originally checked for gap between PMEM and DC
   partition by calculating if a gap exists:
   	if ([start of dc part] - [end of pmem part])
   Replace with: if ([start of dc part] != [end of pmem part]) to avoid
   underflow in case of bad input

6. Change struct cxl_dc_partition_info to use u64 instead of size_t
   fields

7. Original commit message referenced CXL r3.2. Bump to r4.0.
   Verified section numbers remain the same

8. Rename dynamic_ram_a to dynamic_ram_1
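
As an illustration of change 4, the paging loop and its no-progress guard
can be sketched in user space as below.  read_all_partitions(), the
get_config_fn callback, and the two mock devices are hypothetical and
exist only for this example; the real loop lives in cxl_dev_dc_identify()
and calls cxl_get_dc_config().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical callback: returns the number of partitions read starting
 * at index 'start' (or -errno) and reports the device's total in *avail.
 */
typedef int (*get_config_fn)(void *ctx, uint8_t start, uint8_t *avail);

static int read_all_partitions(get_config_fn get, void *ctx, unsigned int max)
{
	unsigned int total = 0;
	uint8_t avail = 0;

	do {
		int rc = get(ctx, (uint8_t)total, &avail);

		if (rc < 0)
			return rc;
		/* No progress: bail out rather than spin on a bad device */
		if (rc == 0)
			return -EIO;

		total += rc;
		if (total > max)
			return -EINVAL;
	} while (total < avail);

	return (int)total;
}

/* Mock device that hands out at most two partitions per call */
static int mock_paged(void *ctx, uint8_t start, uint8_t *avail)
{
	unsigned int total = *(unsigned int *)ctx;

	*avail = (uint8_t)total;
	if (start >= total)
		return 0;
	return (total - start > 2) ? 2 : (int)(total - start);
}

/* Mock device that advertises partitions but never returns any */
static int mock_stuck(void *ctx, uint8_t start, uint8_t *avail)
{
	(void)ctx;
	(void)start;
	*avail = 3;
	return 0;
}
```

With mock_paged and five partitions the loop finishes after three calls;
with mock_stuck it fails on the first call instead of looping forever.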
---
 drivers/cxl/core/hdm.c  |   2 +
 drivers/cxl/core/mbox.c | 211 ++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    |  47 +++++++++
 drivers/cxl/pci.c       |   3 +
 include/cxl/cxl.h       |   3 +-
 5 files changed, 265 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0c80b76a5f9b..0ef076c08ed2 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -446,6 +446,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
 		return "ram";
 	case CXL_PARTMODE_PMEM:
 		return "pmem";
+	case CXL_PARTMODE_DYNAMIC_RAM_1:
+		return "dynamic_ram_1";
 	default:
 		return "";
 	};
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 07aba6f0b719..2932bbd67e55 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1347,6 +1347,188 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
 	return -EBUSY;
 }
 
+static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
+			u8 index, struct cxl_dc_partition *dev_part)
+{
+	u64 blk_size = le64_to_cpu(dev_part->block_size);
+	u64 len = le64_to_cpu(dev_part->length);
+
+	part_array[index].start = le64_to_cpu(dev_part->base);
+	part_array[index].size = le64_to_cpu(dev_part->decode_length);
+	part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
+
+	/* Check partitions are in increasing DPA order */
+	if (index > 0) {
+		struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
+
+		if ((prev_part->start + prev_part->size) >
+		     part_array[index].start) {
+			dev_err(dev,
+				"DPA ordering violation for DC partition %d and %d\n",
+				index - 1, index);
+			return -EINVAL;
+		}
+	}
+
+	if (part_array[index].size == 0 || len == 0 ||
+	    part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
+		dev_err(dev, "DC partition %d invalid length; size %llu len %llu blk size %llu\n",
+			index, part_array[index].size, len, blk_size);
+		return -EINVAL;
+	}
+
+	if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
+	    !is_power_of_2(blk_size)) {
+		dev_err(dev, "DC partition %d invalid block size %llu\n",
+			index, blk_size);
+		return -EINVAL;
+	}
+
+	if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
+	    !IS_ALIGNED(part_array[index].start, blk_size)) {
+		dev_err(dev, "DC partition %d invalid start %llu blk size %llu\n",
+			index, part_array[index].start, blk_size);
+		return -EINVAL;
+	}
+
+	dev_dbg(dev, "DC partition %d start %llu size %llu blk_size: %llu\n",
+		index, part_array[index].start, part_array[index].size,
+		blk_size);
+
+	return 0;
+}
+
+/* Returns the number of partitions in dc_resp or -ERRNO */
+static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
+			     struct cxl_mbox_get_dc_config_out *dc_resp,
+			     size_t dc_resp_size)
+{
+	struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
+		.partition_count = CXL_MAX_DC_PARTITIONS,
+		.start_partition_index = start_partition,
+	};
+	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+		.payload_in = &get_dc,
+		.size_in = sizeof(get_dc),
+		.size_out = dc_resp_size,
+		.payload_out = dc_resp,
+		.min_out = 8,
+	};
+	size_t expected_sz;
+	int rc;
+
+	rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
+	if (rc < 0)
+		return rc;
+
+	if (dc_resp->partitions_returned > CXL_MAX_DC_PARTITIONS) {
+		dev_err(mbox->host, "Device returned %u partitions, max %d\n",
+			dc_resp->partitions_returned, CXL_MAX_DC_PARTITIONS);
+		return -EIO;
+	}
+
+	/*
+	 * The payload carries trailing extent/tag count fields after the
+	 * partition array (CXL 3.2 Table 8-179) which the driver ignores, so
+	 * the response is at least, not exactly, expected_sz.
+	 */
+	expected_sz = struct_size(dc_resp, partition,
+				  dc_resp->partitions_returned);
+
+	if (mbox_cmd.size_out < expected_sz) {
+		dev_err(mbox->host,
+			"Payload size %zu less than expected %zu for %u partitions\n",
+			mbox_cmd.size_out,
+			expected_sz,
+			dc_resp->partitions_returned);
+		return -EIO;
+	}
+
+	dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
+		dc_resp->partitions_returned, dc_resp->avail_partition_count);
+	return dc_resp->partitions_returned;
+}
+
+/**
+ * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
+ *                         device.
+ * @mbox: Mailbox to query
+ * @dc_info: The dynamic partition information to return
+ *
+ * Read Dynamic Capacity information from the device and return the partition
+ * information.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ *         On error, @dc_info is left unchanged.
+ */
+int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
+			struct cxl_dc_partition_info *dc_info)
+{
+	struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
+	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree);
+	struct device *dev = mbox->host;
+	u8 start_partition;
+	u8 num_partitions;
+	size_t dc_resp_size = struct_size(dc_resp,
+					  partition,
+					  CXL_MAX_DC_PARTITIONS);
+
+	dc_resp = kmalloc(dc_resp_size, GFP_KERNEL);
+	if (!dc_resp)
+		return -ENOMEM;
+
+	/*
+	 * Read and check all partition information for validity and potential
+	 * debugging; see debug output in cxl_dc_check()
+	 */
+	start_partition = 0;
+	num_partitions = 0;
+	do {
+		int rc, i, j;
+
+		rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
+		if (rc < 0) {
+			dev_err(dev, "Failed to get DC config: %d\n", rc);
+			return rc;
+		}
+
+		if (rc == 0) {
+			dev_err(dev,
+				"Device reported %u partitions available but returned none at index %u\n",
+				dc_resp->avail_partition_count, start_partition);
+			return -EIO;
+		}
+
+		num_partitions += rc;
+
+		if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
+			dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
+				num_partitions);
+			return -EINVAL;
+		}
+
+		for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
+			rc = cxl_dc_check(dev, partitions, i,
+					  &dc_resp->partition[j]);
+			if (rc)
+				return rc;
+		}
+
+		start_partition = num_partitions;
+
+	} while (num_partitions < dc_resp->avail_partition_count);
+
+	/* Return 1st partition */
+	dc_info->start = partitions[0].start;
+	dc_info->size = partitions[0].size;
+	dev_dbg(dev, "Returning partition 0 %llu size %llu\n",
+		dc_info->start, dc_info->size);
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
+
 static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
 {
 	int i = info->nr_partitions;
@@ -1417,6 +1599,35 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
 
+void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
+{
+	struct cxl_dc_partition_info dc_info = { 0 };
+	struct device *dev = mds->cxlds.dev;
+	int rc;
+
+	rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
+	if (rc) {
+		dev_warn(dev,
+			 "Failed to read Dynamic Capacity config: %d\n", rc);
+		cxl_disable_dcd(mds);
+		return;
+	}
+
+	/* Skips between pmem and the dynamic partition are not supported */
+	if (dc_info.start != info->size) {
+		dev_warn(dev,
+			 "Dynamic Capacity skip from pmem not supported\n");
+		cxl_disable_dcd(mds);
+		return;
+	}
+
+	info->size += dc_info.size;
+	dev_dbg(dev, "Adding dynamic ram partition 1; %llu size %llu\n",
+		dc_info.start, dc_info.size);
+	add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_1);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
+
 int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
 {
 	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 60dc3f0006a7..6b548a1ec1e9 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -385,6 +385,8 @@ struct cxl_security_state {
 	struct kernfs_node *sanitize_node;
 };
 
+#define CXL_MAX_DC_PARTITIONS 8
+
 static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
 {
 	/*
@@ -669,6 +671,31 @@ struct cxl_mbox_set_shutdown_state_in {
 	u8 state;
 } __packed;
 
+/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
+struct cxl_mbox_get_dc_config_in {
+	u8 partition_count;
+	u8 start_partition_index;
+} __packed;
+
+/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
+struct cxl_mbox_get_dc_config_out {
+	u8 avail_partition_count;
+	u8 partitions_returned;
+	u8 rsvd[6];
+	/* See CXL 3.2 Table 8-180 */
+	struct cxl_dc_partition {
+		__le64 base;
+		__le64 decode_length;
+		__le64 length;
+		__le64 block_size;
+		__le32 dsmad_handle;
+		u8 flags;
+		u8 rsvd[3];
+	} __packed partition[] __counted_by(partitions_returned);
+	/* Trailing extent/tag count fields unused */
+} __packed;
+#define CXL_DCD_BLOCK_LINE_SIZE 0x40
+
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
 	__le64 timestamp;
@@ -792,9 +819,18 @@ enum {
 int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
 			  struct cxl_mbox_cmd *cmd);
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+
+struct cxl_dc_partition_info {
+	u64 start;
+	u64 size;
+};
+
+int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
+			struct cxl_dc_partition_info *dc_info);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
 int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
+void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
 						 u16 dvsec);
 void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
@@ -808,6 +844,17 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
 			    const uuid_t *uuid, union cxl_event *evt);
 int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
 int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
+
+static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+{
+	return mds->dcd_supported;
+}
+
+static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
+{
+	mds->dcd_supported = false;
+}
+
 int cxl_set_timestamp(struct cxl_memdev_state *mds);
 int cxl_poison_state_init(struct cxl_memdev_state *mds);
 int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bace662dc988..60f9fa05d9ef 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -870,6 +870,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	if (cxl_dcd_supported(mds))
+		cxl_configure_dcd(mds, &range_info);
+
 	rc = cxl_dpa_setup(cxlds, &range_info);
 	if (rc)
 		return rc;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index fa7269154620..e8a0899960d4 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -133,6 +133,7 @@ struct cxl_dpa_perf {
 enum cxl_partition_mode {
 	CXL_PARTMODE_RAM,
 	CXL_PARTMODE_PMEM,
+	CXL_PARTMODE_DYNAMIC_RAM_1,
 };
 
 /**
@@ -147,7 +148,7 @@ struct cxl_dpa_partition {
 	enum cxl_partition_mode mode;
 };
 
-#define CXL_NR_PARTITIONS_MAX 2
+#define CXL_NR_PARTITIONS_MAX 3
 
 /**
  * struct cxl_dev_state - The driver device state
-- 
2.43.0
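
The paginated read in cxl_dev_dc_identify() above can be modeled in plain
user-space C. This is only a minimal sketch under stated assumptions, not
the driver code: struct fake_device, fake_get_dc_config() and MAX_PER_CALL
are hypothetical stand-ins for the device and the mailbox command, and the
per-partition validation done by cxl_dc_check() is left out.

```c
#define MAX_DC_PARTITIONS 8	/* mirrors CXL_MAX_DC_PARTITIONS */
#define MAX_PER_CALL 2		/* hypothetical per-call payload limit */

struct part { unsigned long long base, len; };

/* Hypothetical device model: 'avail' partitions backed by an array. */
struct fake_device {
	unsigned int avail;
	struct part parts[MAX_DC_PARTITIONS];
};

/*
 * Hypothetical stand-in for the mailbox call: copy up to MAX_PER_CALL
 * partitions starting at index 'start' into 'out' and return the count.
 */
static int fake_get_dc_config(const struct fake_device *d, unsigned int start,
			      struct part *out)
{
	unsigned int n = 0;

	while (start + n < d->avail && start + n < MAX_DC_PARTITIONS &&
	       n < MAX_PER_CALL) {
		out[n] = d->parts[start + n];
		n++;
	}
	return (int)n;
}

/* Accumulate every partition; return the total, or -1 on a short read. */
static int read_all_partitions(const struct fake_device *d, struct part *all)
{
	unsigned int start = 0, total = 0;

	do {
		struct part chunk[MAX_PER_CALL];
		int rc = fake_get_dc_config(d, start, chunk);

		if (rc <= 0)
			return -1;	/* device promised more than it returned */
		total += (unsigned int)rc;
		if (total > MAX_DC_PARTITIONS)
			return -1;
		/* chunk[] is indexed from 0, all[] from the running start */
		for (unsigned int i = start, j = 0; i < total; i++, j++)
			all[i] = chunk[j];
		start = total;
	} while (total < d->avail);

	return (int)total;
}
```

The caller passes an array of MAX_DC_PARTITIONS entries for 'all'. As in
the patch, a call that returns no partitions while more are advertised is
treated as an error rather than retried.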


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 04/31] cxl/core: Enforce partition order/simplify partition calls
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
  2026-06-25 11:04 ` [PATCH v11 01/31] cxl/mbox: Flag " Anisa Su
  2026-06-25 11:04 ` [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-26 22:37   ` Dave Jiang
  2026-06-25 11:04 ` [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs Anisa Su
                   ` (27 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Device partitions have an implied order which is made more complex by
the addition of a dynamic partition.

Remove the ram-specific size and perf helpers in favor of generic helpers
keyed by partition mode, and check the partition order ahead of time so
that the implied partition order is preserved.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:

1. Use info->part[i] for verifying partitions are in expected order,
   not cxlds->part[i]. cxlds->part[] is populated in the loop following
   this check.
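
For illustration, the order check added to cxl_dpa_setup() can be sketched
in stand-alone C. This is a simplified model, not the kernel code: enum
part_mode, struct part_info and check_part_order() are hypothetical names,
and the enum order is assumed to match enum cxl_partition_mode after this
series (ram, pmem, dynamic_ram_1).

```c
/* Assumed to mirror the enumerator order of enum cxl_partition_mode. */
enum part_mode { PART_RAM, PART_PMEM, PART_DYNAMIC_RAM_1 };

struct part_info { enum part_mode mode; };

/*
 * Return 0 if the mode never decreases from one partition to the next,
 * -1 otherwise.
 */
static int check_part_order(const struct part_info *part, int nr)
{
	for (int i = 1; i < nr; i++)
		if (part[i].mode < part[i - 1].mode)
			return -1;
	return 0;
}
```

Because the comparison is a strict '<', two adjacent partitions with the
same mode pass the check; only a decrease in mode is rejected.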
---
 drivers/cxl/core/hdm.c    | 11 ++++++++++-
 drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
 drivers/cxl/cxlmem.h      |  9 +++------
 drivers/cxl/mem.c         |  2 +-
 4 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 7f63b86887f4..54b6848928a9 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -457,6 +457,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
 int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
 {
 	struct device *dev = cxlds->dev;
+	int i;
 
 	guard(rwsem_write)(&cxl_rwsem.dpa);
 
@@ -469,9 +470,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
 		return 0;
 	}
 
+	/* Verify partitions are in expected order. */
+	for (i = 1; i < info->nr_partitions; i++) {
+		if (info->part[i].mode < info->part[i-1].mode) {
+			dev_err(dev, "Partition order mismatch\n");
+			return -EINVAL;
+		}
+	}
+
 	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
 
-	for (int i = 0; i < info->nr_partitions; i++) {
+	for (i = 0; i < info->nr_partitions; i++) {
 		const struct cxl_dpa_part_info *part = &info->part[i];
 		int rc;
 
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 80e65690eb77..71602820f896 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -75,20 +75,12 @@ static ssize_t label_storage_size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(label_storage_size);
 
-static resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
-{
-	/* Static RAM is only expected at partition 0. */
-	if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
-		return 0;
-	return resource_size(&cxlds->part[0].res);
-}
-
 static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
 			     char *buf)
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	unsigned long long len = cxl_ram_size(cxlds);
+	unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_RAM);
 
 	return sysfs_emit(buf, "%#llx\n", len);
 }
@@ -101,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	unsigned long long len = cxl_pmem_size(cxlds);
+	unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_PMEM);
 
 	return sysfs_emit(buf, "%#llx\n", len);
 }
@@ -424,10 +416,11 @@ static struct attribute *cxl_memdev_attributes[] = {
 	NULL,
 };
 
-static struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
+static struct cxl_dpa_perf *part_perf(struct cxl_dev_state *cxlds,
+				      enum cxl_partition_mode mode)
 {
 	for (int i = 0; i < cxlds->nr_partitions; i++)
-		if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
+		if (cxlds->part[i].mode == mode)
 			return &cxlds->part[i].perf;
 	return NULL;
 }
@@ -438,7 +431,7 @@ static ssize_t pmem_qos_class_show(struct device *dev,
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 
-	return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
+	return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_PMEM)->qos_class);
 }
 
 static struct device_attribute dev_attr_pmem_qos_class =
@@ -450,20 +443,13 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
 	NULL,
 };
 
-static struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
-{
-	if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
-		return NULL;
-	return &cxlds->part[0].perf;
-}
-
 static ssize_t ram_qos_class_show(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 
-	return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
+	return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_RAM)->qos_class);
 }
 
 static struct device_attribute dev_attr_ram_qos_class =
@@ -499,7 +485,7 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
 {
 	struct device *dev = kobj_to_dev(kobj);
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
-	struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
+	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_RAM);
 
 	if (a == &dev_attr_ram_qos_class.attr &&
 	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
@@ -518,7 +504,7 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
 {
 	struct device *dev = kobj_to_dev(kobj);
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
-	struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
+	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_PMEM);
 
 	if (a == &dev_attr_pmem_qos_class.attr &&
 	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index b29fb16725b4..afc195d8c090 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -388,14 +388,11 @@ struct cxl_security_state {
 
 #define CXL_MAX_DC_PARTITIONS 8
 
-static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
+static inline resource_size_t cxl_part_size(struct cxl_dev_state *cxlds,
+					    enum cxl_partition_mode mode)
 {
-	/*
-	 * Static PMEM may be at partition index 0 when there is no static RAM
-	 * capacity.
-	 */
 	for (int i = 0; i < cxlds->nr_partitions; i++)
-		if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
+		if (cxlds->part[i].mode == mode)
 			return resource_size(&cxlds->part[i].res);
 	return 0;
 }
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index fcffe24dcb42..f19e08279ec7 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -114,7 +114,7 @@ static int cxl_mem_probe(struct device *dev)
 		return -ENXIO;
 	}
 
-	if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
+	if (cxl_part_size(cxlds, CXL_PARTMODE_PMEM) && IS_ENABLED(CONFIG_CXL_PMEM)) {
 		rc = devm_cxl_add_nvdimm(dev, parent_port, cxlmd);
 		if (rc) {
 			if (rc == -ENODEV)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (2 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:12   ` sashiko-bot
  2026-06-26 23:08   ` Dave Jiang
  2026-06-25 11:04 ` [PATCH v11 06/31] cxl/port: Add 'dynamic_ram_1' to endpoint decoder mode Anisa Su
                   ` (26 subsequent siblings)
  30 siblings, 2 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

To properly configure CXL regions, user space will need to know the
details of the dynamic ram partition.

Expose the first dynamic ram partition through sysfs.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---
Changes:
1. Documentation: bump kernel version to 7.3 and date to June 2026
2. Pick up Dave's reviewed-by tag
3. Rename dynamic_ram_a to dynamic_ram_1
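
As a user-space illustration, the 'size' attribute is emitted with
"%#llx\n", so a consumer can parse it with strtoull() and base 0. The
helper below is a hedged sketch; parse_sysfs_size() is a hypothetical name
and reading the attribute file itself is left to the caller.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text read from a "size" attribute, e.g. "0x10000000\n".
 * Returns 0 and stores the value in *out on success, -1 on bad input.
 */
static int parse_sysfs_size(const char *buf, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(buf, &end, 0);	/* base 0 accepts the 0x prefix */
	if (errno || end == buf)
		return -1;
	if (*end != '\n' && *end != '\0')
		return -1;		/* trailing garbage */
	*out = val;
	return 0;
}
```

A zero-sized partition prints as plain "0" (the '#' flag adds no prefix for
zero), which base 0 also handles.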
---
 Documentation/ABI/testing/sysfs-bus-cxl | 24 +++++++++++
 drivers/cxl/core/memdev.c               | 57 +++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 16a9b3d2e2c0..435495de409c 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -89,6 +89,30 @@ Description:
 		and there are platform specific performance related
 		side-effects that may result. First class-id is displayed.
 
+What:		/sys/bus/cxl/devices/memX/dynamic_ram_1/size
+Date:		June, 2026
+KernelVersion:	v7.3
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) The first Dynamic RAM partition capacity as bytes.
+
+
+What:		/sys/bus/cxl/devices/memX/dynamic_ram_1/qos_class
+Date:		June, 2026
+KernelVersion:	v7.3
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) For CXL host platforms that support "QoS Telemetry"
+		this attribute conveys a comma delimited list of platform
+		specific cookies that identifies a QoS performance class
+		for the partition of the CXL mem device. These
+		class-ids can be compared against a similar "qos_class"
+		published for a root decoder. While it is not required
+		that the endpoints map their local memory-class to a
+		matching platform class, mismatches are not recommended
+		and there are platform specific performance related
+		side-effects that may result. First class-id is displayed.
+
 
 What:		/sys/bus/cxl/devices/memX/serial
 Date:		January, 2022
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 71602820f896..20417db933aa 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
+static ssize_t dynamic_ram_1_size_show(struct device *dev, struct device_attribute *attr,
+			      char *buf)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_1);
+
+	return sysfs_emit(buf, "%#llx\n", len);
+}
+
+static struct device_attribute dev_attr_dynamic_ram_1_size =
+	__ATTR(size, 0444, dynamic_ram_1_size_show, NULL);
+
 static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -443,6 +456,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
 	NULL,
 };
 
+static ssize_t dynamic_ram_1_qos_class_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	return sysfs_emit(buf, "%d\n",
+			  part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_1)->qos_class);
+}
+
+static struct device_attribute dev_attr_dynamic_ram_1_qos_class =
+	__ATTR(qos_class, 0444, dynamic_ram_1_qos_class_show, NULL);
+
+static struct attribute *cxl_memdev_dynamic_ram_1_attributes[] = {
+	&dev_attr_dynamic_ram_1_size.attr,
+	&dev_attr_dynamic_ram_1_qos_class.attr,
+	NULL,
+};
+
 static ssize_t ram_qos_class_show(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
@@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
 	.is_visible = cxl_pmem_visible,
 };
 
+static umode_t cxl_dynamic_ram_1_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_1);
+
+	if (a == &dev_attr_dynamic_ram_1_qos_class.attr &&
+	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
+		return 0;
+
+	if (a == &dev_attr_dynamic_ram_1_size.attr &&
+	    (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_1)))
+		return 0;
+
+	return a->mode;
+}
+
+static struct attribute_group cxl_memdev_dynamic_ram_1_attribute_group = {
+	.name = "dynamic_ram_1",
+	.attrs = cxl_memdev_dynamic_ram_1_attributes,
+	.is_visible = cxl_dynamic_ram_1_visible,
+};
+
 static umode_t cxl_memdev_security_visible(struct kobject *kobj,
 					   struct attribute *a, int n)
 {
@@ -547,6 +602,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
 	&cxl_memdev_attribute_group,
 	&cxl_memdev_ram_attribute_group,
 	&cxl_memdev_pmem_attribute_group,
+	&cxl_memdev_dynamic_ram_1_attribute_group,
 	&cxl_memdev_security_attribute_group,
 	NULL,
 };
@@ -555,6 +611,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
 {
 	sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
 	sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
+	sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_1_attribute_group);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 06/31] cxl/port: Add 'dynamic_ram_1' to endpoint decoder mode
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (3 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 11:04 ` [PATCH v11 07/31] cxl/region: Add DC DAX region support Anisa Su
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Endpoints can now support a single dynamic ram partition following the
persistent memory partition.

Expand the mode to allow a decoder to point to the first dynamic ram
partition.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---
Changes:
1. Documentation: Bump kver to 7.3 and date to June 2026
2. Pick up Dave's reviewed-by tag
3. Rename dynamic_ram_a to dynamic_ram_1
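
The string matching in mode_store() can be sketched outside the kernel as
follows. This is an approximation under stated assumptions: streq_nl() is a
hypothetical user-space helper written to behave like sysfs_streq() (one
trailing newline is ignored), and enum part_mode is a stand-in for enum
cxl_partition_mode.

```c
#include <stdbool.h>

enum part_mode { PART_RAM, PART_PMEM, PART_DYNAMIC_RAM_1 };

/* Compare two strings, ignoring a single trailing newline on either one. */
static bool streq_nl(const char *s1, const char *s2)
{
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}
	if (*s1 == *s2)
		return true;
	if (!*s1 && *s2 == '\n' && !s2[1])
		return true;
	if (*s1 == '\n' && !s1[1] && !*s2)
		return true;
	return false;
}

/* Map the written string to a mode; return -1 for anything unrecognized. */
static int parse_mode(const char *buf)
{
	if (streq_nl(buf, "pmem"))
		return PART_PMEM;
	if (streq_nl(buf, "ram"))
		return PART_RAM;
	if (streq_nl(buf, "dynamic_ram_1"))
		return PART_DYNAMIC_RAM_1;
	return -1;
}
```

With this matching, the old name "dynamic_ram_a" is rejected just like any
other unknown string.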
---
 Documentation/ABI/testing/sysfs-bus-cxl | 18 +++++++++---------
 drivers/cxl/core/port.c                 |  4 ++++
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 435495de409c..499741cbb899 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -358,22 +358,22 @@ Description:
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/mode
-Date:		May, 2022
-KernelVersion:	v6.0
+Date:		May, 2022, June, 2026
+KernelVersion:	v6.0, v7.3 (dynamic_ram_1)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
 		translates from a host physical address range, to a device
 		local address range. Device-local address ranges are further
-		split into a 'ram' (volatile memory) range and 'pmem'
-		(persistent memory) range. The 'mode' attribute emits one of
-		'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
-		not actively decoding, or no DPA allocation policy has been
-		set.
+		split into a 'ram' (volatile memory) range, 'pmem' (persistent
+		memory) range, and 'dynamic_ram_1' (first Dynamic RAM) range.
+		The 'mode' attribute emits one of 'ram', 'pmem', 'dynamic_ram_1'
+		or 'none'. The 'none' indicates the decoder is not actively
+		decoding, or no DPA allocation policy has been set.
 
 		'mode' can be written, when the decoder is in the 'disabled'
-		state, with either 'ram' or 'pmem' to set the boundaries for the
-		next allocation.
+		state, with one of 'ram', 'pmem', or 'dynamic_ram_1' to set the
+		boundaries for the next allocation.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index c5aacd7054f1..57d0fc72023f 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -128,6 +128,7 @@ static DEVICE_ATTR_RO(name)
 
 CXL_DECODER_FLAG_ATTR(cap_pmem, CXL_DECODER_F_PMEM);
 CXL_DECODER_FLAG_ATTR(cap_ram, CXL_DECODER_F_RAM);
+CXL_DECODER_FLAG_ATTR(cap_dynamic_ram_1, CXL_DECODER_F_RAM);
 CXL_DECODER_FLAG_ATTR(cap_type2, CXL_DECODER_F_TYPE2);
 CXL_DECODER_FLAG_ATTR(cap_type3, CXL_DECODER_F_TYPE3);
 CXL_DECODER_FLAG_ATTR(locked, CXL_DECODER_F_LOCK);
@@ -222,6 +223,8 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
 		mode = CXL_PARTMODE_PMEM;
 	else if (sysfs_streq(buf, "ram"))
 		mode = CXL_PARTMODE_RAM;
+	else if (sysfs_streq(buf, "dynamic_ram_1"))
+		mode = CXL_PARTMODE_DYNAMIC_RAM_1;
 	else
 		return -EINVAL;
 
@@ -327,6 +330,7 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
 static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_cap_pmem.attr,
 	&dev_attr_cap_ram.attr,
+	&dev_attr_cap_dynamic_ram_1.attr,
 	&dev_attr_cap_type2.attr,
 	&dev_attr_cap_type3.attr,
 	&dev_attr_target_list.attr,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 07/31] cxl/region: Add DC DAX region support
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (4 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 06/31] cxl/port: Add 'dynamic_ram_1' to endpoint decoder mode Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:16   ` sashiko-bot
  2026-06-26 23:18   ` Dave Jiang
  2026-06-25 11:04 ` [PATCH v11 08/31] cxl/events: Split event msgnum configuration from irq setup Anisa Su
                   ` (24 subsequent siblings)
  30 siblings, 2 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

DC DAX regions must allow memory to be added or removed dynamically.
In addition to the quantity of memory available, the location of the
memory within a DC partition is dynamic, based on the extents offered
by a device.  CXL DAX regions must accommodate the dynamic movement of
this memory in the management of DAX regions and devices.

Introduce the concept of a dynamic DAX region, along with a
create_dynamic_ram_1_region sysfs entry to create such regions.
Special-case DC-capable regions to create a 0-sized seed DAX device.
This maintains compatibility with the existing behavior, which requires
a default DAX device to hold a region reference.

Indicate 0 byte available capacity until such time that capacity is
added.

Dynamic regions complicate the range mapping of dax devices.  There is no
known use case for range mapping on dynamic regions.  Avoid the
complication by preventing range mapping of dax devices on dynamic
regions.

Interleaving is deferred for now.  Add checks that reject DC regions with
more than one interleave way.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. Documentation: bump kver to 7.3 and date to June 2026
2. port.c: use helper function to_cxl_memdev_state() in cxled_to_mds()
3. region.c: cxled_to_mds() can return NULL with ^ above change.
   Handle properly in store_targetN()
4. port.c: check if the root decoder supports ram regions before
   exposing create_dynamic_ram_1_region attribute in
   cxl_root_decoder_visible()
5. Rename dynamic_ram_a to dynamic_ram_1
6. Add dynamic_ram_1 to mode_show()
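
To illustrate the available-capacity rule described in the commit message,
here is a small stand-alone model. It is a sketch under stated assumptions:
struct region, REGION_F_DYNAMIC and region_avail_size() are hypothetical
names, and the flag value is made up for the example (it is not the
kernel's IORESOURCE_DAX_DCD value).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit chosen for this example only. */
#define REGION_F_DYNAMIC (1UL << 0)

struct region {
	unsigned long flags;
	unsigned long long size;		/* total region size in bytes */
	const unsigned long long *claimed;	/* sizes already handed out */
	size_t nr_claimed;
};

static bool region_is_dynamic(const struct region *r)
{
	return (r->flags & REGION_F_DYNAMIC) != 0;
}

/* Bytes still available for new allocations in the region. */
static unsigned long long region_avail_size(const struct region *r)
{
	unsigned long long avail = r->size;

	/* In this patch a dynamic region always reports 0 bytes available. */
	if (region_is_dynamic(r))
		return 0;

	for (size_t i = 0; i < r->nr_claimed; i++)
		avail -= r->claimed[i];
	return avail;
}
```

A static region reports its size minus what has been claimed, while a
dynamic region reports 0 regardless of its nominal size.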
---
 Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++-------
 drivers/cxl/core/core.h                 |  9 ++++++
 drivers/cxl/core/port.c                 |  5 +++
 drivers/cxl/core/region.c               | 43 +++++++++++++++++++++++--
 drivers/cxl/core/region_dax.c           |  6 ++++
 drivers/dax/bus.c                       | 10 ++++++
 drivers/dax/bus.h                       |  1 +
 drivers/dax/cxl.c                       | 17 ++++++++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 499741cbb899..00b98bbe0ff3 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -434,20 +434,20 @@ Description:
 		interleave_granularity).
 
 
-What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
-Date:		May, 2022, January, 2023
-KernelVersion:	v6.0 (pmem), v6.3 (ram)
+What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_1}_region
+Date:		May, 2022, January, 2023, June, 2026
+KernelVersion:	v6.0 (pmem), v6.3 (ram), v7.3 (dynamic_ram_1)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a string in the form 'regionZ' to start the process
-		of defining a new persistent, or volatile memory region
-		(interleave-set) within the decode range bounded by root decoder
-		'decoderX.Y'. The value written must match the current value
-		returned from reading this attribute. An atomic compare exchange
-		operation is done on write to assign the requested id to a
-		region and allocate the region-id for the next creation attempt.
-		EBUSY is returned if the region name written does not match the
-		current cached value.
+		of defining a new persistent, volatile, or dynamic RAM memory
+		region (interleave-set) within the decode range bounded by root
+		decoder 'decoderX.Y'. The value written must match the current
+		value returned from reading this attribute.  An atomic compare
+		exchange operation is done on write to assign the requested id
+		to a region and allocate the region-id for the next creation
+		attempt.  EBUSY is returned if the region name written does not
+		match the current cached value.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 82ca3a476708..9ed141fa1334 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
 
 #include <cxl/mailbox.h>
 #include <linux/rwsem.h>
+#include <cxlmem.h>
 
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
@@ -18,6 +19,13 @@ enum cxl_detach_mode {
 	DETACH_INVALIDATE,
 };
 
+static inline struct cxl_memdev_state *
+cxled_to_mds(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	return to_cxl_memdev_state(cxlmd->cxlds);
+}
+
 #ifdef CONFIG_CXL_REGION
 
 struct cxl_region_context {
@@ -29,6 +37,7 @@ struct cxl_region_context {
 
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dynamic_ram_1_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 57d0fc72023f..279279f544d8 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -337,6 +337,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_qos_class.attr,
 	SET_CXL_REGION_ATTR(create_pmem_region)
 	SET_CXL_REGION_ATTR(create_ram_region)
+	SET_CXL_REGION_ATTR(create_dynamic_ram_1_region)
 	SET_CXL_REGION_ATTR(delete_region)
 	NULL,
 };
@@ -366,6 +367,10 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
 	if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
 		return 0;
 
+	if (a == CXL_REGION_ATTR(create_dynamic_ram_1_region) &&
+	    !can_create_ram(cxlrd))
+		return 0;
+
 	if (a == CXL_REGION_ATTR(delete_region) &&
 	    !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
 		return 0;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e50dc716d4e8..ba03ec5e27c3 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -493,6 +493,11 @@ static int set_interleave_ways(struct cxl_region *cxlr, int val)
 	int save, rc;
 	u8 iw;
 
+	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1 && val != 1) {
+		dev_err(&cxlr->dev, "Interleaving and DCD not supported\n");
+		return -EINVAL;
+	}
+
 	rc = ways_to_eiw(val, &iw);
 	if (rc)
 		return rc;
@@ -642,6 +647,8 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
 		desc = "ram";
 	else if (cxlr->mode == CXL_PARTMODE_PMEM)
 		desc = "pmem";
+	else if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
+		desc = "dynamic_ram_1";
 	else
 		desc = "";
 
@@ -2389,6 +2396,8 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
 	if (sysfs_streq(buf, "\n"))
 		rc = detach_target(cxlr, pos);
 	else {
+		struct cxl_endpoint_decoder *cxled;
+		struct cxl_memdev_state *mds;
 		struct device *dev;
 
 		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
@@ -2400,8 +2409,21 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
 			goto out;
 		}
 
-		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
-				   TASK_INTERRUPTIBLE);
+		cxled = to_cxl_endpoint_decoder(dev);
+		if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1) {
+			mds = cxled_to_mds(cxled);
+			if (!mds) {
+				dev_dbg(dev, "No memdev state\n");
+				rc = -ENODEV;
+				goto out;
+			}
+			if (!cxl_dcd_supported(mds)) {
+				dev_dbg(dev, "DCD unsupported\n");
+				rc = -EINVAL;
+				goto out;
+			}
+		}
+		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
 out:
 		put_device(dev);
 	}
@@ -2750,6 +2772,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
 	switch (mode) {
 	case CXL_PARTMODE_RAM:
 	case CXL_PARTMODE_PMEM:
+	case CXL_PARTMODE_DYNAMIC_RAM_1:
 		break;
 	default:
 		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
@@ -2802,6 +2825,21 @@ static ssize_t create_ram_region_store(struct device *dev,
 }
 DEVICE_ATTR_RW(create_ram_region);
 
+static ssize_t create_dynamic_ram_1_region_show(struct device *dev,
+						struct device_attribute *attr,
+						char *buf)
+{
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dynamic_ram_1_region_store(struct device *dev,
+						 struct device_attribute *attr,
+						 const char *buf, size_t len)
+{
+	return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_1);
+}
+DEVICE_ATTR_RW(create_dynamic_ram_1_region);
+
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -4083,6 +4121,7 @@ static int cxl_region_probe(struct device *dev)
 
 		return devm_cxl_add_pmem_region(cxlr);
 	case CXL_PARTMODE_RAM:
+	case CXL_PARTMODE_DYNAMIC_RAM_1:
 		rc = devm_cxl_region_edac_register(cxlr);
 		if (rc)
 			dev_dbg(&cxlr->dev, "CXL EDAC registration for region_id=%d failed\n",
diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
index de04f78f6ad8..3865961c4301 100644
--- a/drivers/cxl/core/region_dax.c
+++ b/drivers/cxl/core/region_dax.c
@@ -84,6 +84,12 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	struct device *dev;
 	int rc;
 
+	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1 &&
+	    cxlr->params.interleave_ways != 1) {
+		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+		return -EINVAL;
+	}
+
 	struct cxl_dax_region *cxlr_dax __free(put_cxl_dax_region) =
 		cxl_dax_region_alloc(cxlr);
 	if (IS_ERR(cxlr_dax))
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index ccfe65004888..7356aaaffe57 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -177,6 +177,11 @@ static bool is_static(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }
 
+static bool is_dynamic(struct dax_region *dax_region)
+{
+	return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
+}
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
@@ -300,6 +305,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 
 	lockdep_assert_held(&dax_region_rwsem);
 
+	if (is_dynamic(dax_region))
+		return 0;
+
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
 	return size;
@@ -1385,6 +1393,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 		return 0;
 	if (a == &dev_attr_mapping.attr && is_static(dax_region))
 		return 0;
+	if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
+		return 0;
 	if ((a == &dev_attr_align.attr ||
 	     a == &dev_attr_size.attr) && is_static(dax_region))
 		return 0444;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 5909171a4428..6e739bfab932 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -15,6 +15,7 @@ struct dax_region;
 /* dax bus specific ioresource flags */
 #define IORESOURCE_DAX_STATIC BIT(0)
 #define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_DCD BIT(2)
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 3ab39b77843d..cedd974c2d0c 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,32 @@ static int cxl_dax_region_probe(struct device *dev)
 	struct cxl_region *cxlr = cxlr_dax->cxlr;
 	struct dax_region *dax_region;
 	struct dev_dax_data data;
+	resource_size_t dev_size;
+	unsigned long flags;
 
 	if (nid == NUMA_NO_NODE)
 		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
 
+	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
+		flags = IORESOURCE_DAX_DCD;
+	else
+		flags = IORESOURCE_DAX_KMEM;
+
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
-				      PMD_SIZE, IORESOURCE_DAX_KMEM);
+				      PMD_SIZE, flags);
 	if (!dax_region)
 		return -ENOMEM;
 
+	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
+		/* Add empty seed dax device */
+		dev_size = 0;
+	else
+		dev_size = range_len(&cxlr_dax->hpa_range);
+
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
-		.size = range_len(&cxlr_dax->hpa_range),
+		.size = dev_size,
 		.memmap_on_memory = true,
 	};
 
-- 
2.43.0



* [PATCH v11 08/31] cxl/events: Split event msgnum configuration from irq setup
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (5 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 07/31] cxl/region: Add DC DAX region support Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 11:04 ` [PATCH v11 09/31] cxl/pci: Factor out interrupt policy check Anisa Su
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su, Fan Ni, Li Ming

From: Ira Weiny <iweiny@kernel.org>

Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal.  BIOS may have control over non-DCD event
processing.  DCD interrupt configuration needs to be separate from
memory event interrupt configuration.

Split cxl_event_config_msgnums() from irq setup in preparation for
separate DCD interrupt configuration.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Fan Ni <nifan.cxl@gmail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
---
 drivers/cxl/pci.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 60f9fa05d9ef..35942b2ace53 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -599,35 +599,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
 	return cxl_event_get_int_policy(mds, policy);
 }
 
-static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
+static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
+			      struct cxl_event_interrupt_policy *policy)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
-	struct cxl_event_interrupt_policy policy;
 	int rc;
 
-	rc = cxl_event_config_msgnums(mds, &policy);
-	if (rc)
-		return rc;
-
-	rc = cxl_event_req_irq(cxlds, policy.info_settings);
+	rc = cxl_event_req_irq(cxlds, policy->info_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
 		return rc;
 	}
 
-	rc = cxl_event_req_irq(cxlds, policy.warn_settings);
+	rc = cxl_event_req_irq(cxlds, policy->warn_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
 		return rc;
 	}
 
-	rc = cxl_event_req_irq(cxlds, policy.failure_settings);
+	rc = cxl_event_req_irq(cxlds, policy->failure_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
 		return rc;
 	}
 
-	rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
+	rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
 		return rc;
@@ -674,11 +670,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 		return -EBUSY;
 	}
 
+	rc = cxl_event_config_msgnums(mds, &policy);
+	if (rc)
+		return rc;
+
 	rc = cxl_mem_alloc_event_buf(mds);
 	if (rc)
 		return rc;
 
-	rc = cxl_event_irqsetup(mds);
+	rc = cxl_event_irqsetup(mds, &policy);
 	if (rc)
 		return rc;
 
-- 
2.43.0



* [PATCH v11 09/31] cxl/pci: Factor out interrupt policy check
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (6 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 08/31] cxl/events: Split event msgnum configuration from irq setup Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 11:04 ` [PATCH v11 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su, Fan Ni, Li Ming

From: Ira Weiny <iweiny@kernel.org>

Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal.  BIOS may have control over non-DCD event
processing.  DCD interrupt configuration needs to be separate from
memory event interrupt configuration.

Factor out event interrupt setting validation.

Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Suggested-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Fan Ni <nifan.cxl@gmail.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
---
 drivers/cxl/pci.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 35942b2ace53..8d12c684d670 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -639,6 +639,21 @@ static bool cxl_event_int_is_fw(u8 setting)
 	return mode == CXL_INT_FW;
 }
 
+static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
+					  struct cxl_event_interrupt_policy *policy)
+{
+	if (cxl_event_int_is_fw(policy->info_settings) ||
+	    cxl_event_int_is_fw(policy->warn_settings) ||
+	    cxl_event_int_is_fw(policy->failure_settings) ||
+	    cxl_event_int_is_fw(policy->fatal_settings)) {
+		dev_err(mds->cxlds.dev,
+			"FW still in control of Event Logs despite _OSC settings\n");
+		return false;
+	}
+
+	return true;
+}
+
 static int cxl_event_config(struct pci_host_bridge *host_bridge,
 			    struct cxl_memdev_state *mds, bool irq_avail)
 {
@@ -661,14 +676,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	if (rc)
 		return rc;
 
-	if (cxl_event_int_is_fw(policy.info_settings) ||
-	    cxl_event_int_is_fw(policy.warn_settings) ||
-	    cxl_event_int_is_fw(policy.failure_settings) ||
-	    cxl_event_int_is_fw(policy.fatal_settings)) {
-		dev_err(mds->cxlds.dev,
-			"FW still in control of Event Logs despite _OSC settings\n");
+	if (!cxl_event_validate_mem_policy(mds, &policy))
 		return -EBUSY;
-	}
 
 	rc = cxl_event_config_msgnums(mds, &policy);
 	if (rc)
-- 
2.43.0



* [PATCH v11 10/31] cxl/mem: Configure dynamic capacity interrupts
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (7 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 09/31] cxl/pci: Factor out interrupt policy check Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:14   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 11/31] cxl/core: Return endpoint decoder information from region search Anisa Su
                   ` (21 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism.  The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications.  Firmware can't
configure DCD events to be FW controlled but can retain control of
memory events.

Configure DCD event log interrupts on devices supporting dynamic
capacity.  Disable DCD if interrupts are not supported.

Care is taken to preserve the interrupt policy set by the FW if
firmware-first has been selected by the BIOS.

Accept the 4-byte CXL 2.0 reply on GET Event Interrupt Policy by setting
min_out to CXL_EVENT_INT_POLICY_BASE_SIZE; pre-CXL 3.1 firmware omits
dcd_settings and would otherwise fail the size check.
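
As an illustration only (not part of the patch), the payload sizing
described above can be sketched in standalone C.  The struct mirrors the
five-byte layout of struct cxl_event_interrupt_policy after this change;
the helper names policy_size_in() and policy_reply_len_ok() are
hypothetical, and the upper bound in the reply check is an assumption
about the output buffer size rather than something the mailbox core is
shown to enforce here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct cxl_event_interrupt_policy after this patch. */
struct evt_int_policy {
	uint8_t info_settings;
	uint8_t warn_settings;
	uint8_t failure_settings;
	uint8_t fatal_settings;
	uint8_t dcd_settings;	/* omitted by pre-CXL 3.1 firmware */
};

/* info, warn, failure, fatal */
#define EVT_INT_POLICY_BASE_SIZE 4

/* Hypothetical helper: bytes to send with Set Event Interrupt Policy. */
static size_t policy_size_in(bool dcd_supported)
{
	size_t size = EVT_INT_POLICY_BASE_SIZE;

	/* Only append the DCD byte when the device supports DCD. */
	if (dcd_supported)
		size += sizeof(((struct evt_int_policy *)0)->dcd_settings);

	assert(size <= sizeof(struct evt_int_policy));
	return size;
}

/* Hypothetical helper: accept both the short and the full reply. */
static bool policy_reply_len_ok(size_t len)
{
	return len >= EVT_INT_POLICY_BASE_SIZE &&
	       len <= sizeof(struct evt_int_policy);
}
```

With these assumptions a 4-byte reply from older firmware and a 5-byte
reply from CXL 3.1 firmware both pass, while anything shorter is
rejected.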

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
[anisa: add CXLDEV_EVENT_STATUS_DCD (bit 4) to CXLDEV_EVENT_STATUS_ALL;
previously added in a later commit but moved to current commit]

[anisa: check native_cxl before cxl_mem_get_event_records]
---
 drivers/cxl/cxl.h    |  4 +-
 drivers/cxl/cxlmem.h |  2 +
 drivers/cxl/pci.c    | 94 ++++++++++++++++++++++++++++++++++++--------
 3 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 1297594beaec..864f6d3c03d4 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -180,11 +180,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
 #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
 #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
 
 #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
 				 CXLDEV_EVENT_STATUS_WARN |	\
 				 CXLDEV_EVENT_STATUS_FAIL |	\
-				 CXLDEV_EVENT_STATUS_FATAL)
+				 CXLDEV_EVENT_STATUS_FATAL |	\
+				 CXLDEV_EVENT_STATUS_DCD)
 
 /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
 #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index afc195d8c090..bcf976829c3e 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -218,7 +218,9 @@ struct cxl_event_interrupt_policy {
 	u8 warn_settings;
 	u8 failure_settings;
 	u8 fatal_settings;
+	u8 dcd_settings;
 } __packed;
+#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
 
 /**
  * struct cxl_event_state - Event log driver state
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 8d12c684d670..95a4bf7c1e46 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -514,7 +514,19 @@ static irqreturn_t cxl_event_thread(int irq, void *id)
 	struct cxl_dev_id *dev_id = id;
 	struct cxl_dev_state *cxlds = dev_id->cxlds;
 	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
+	struct pci_host_bridge *host_bridge =
+		pci_find_host_bridge(to_pci_dev(cxlds->dev)->bus);
 	u32 status;
+	u32 mask;
+
+	/*
+	 * Only drain logs the driver owns.  When BIOS owns event reporting
+	 * (!native_cxl) the driver is only here for the Dynamic Capacity log;
+	 * processing the standard logs would steal firmware-first events from
+	 * BIOS, so mask them out.
+	 */
+	mask = host_bridge->native_cxl_error ? CXLDEV_EVENT_STATUS_ALL
+					     : CXLDEV_EVENT_STATUS_DCD;
 
 	do {
 		/*
@@ -522,8 +534,8 @@ static irqreturn_t cxl_event_thread(int irq, void *id)
 		 * ignore the reserved upper 32 bits
 		 */
 		status = readl(cxlds->regs.status + CXLDEV_DEV_EVENT_STATUS_OFFSET);
-		/* Ignore logs unknown to the driver */
-		status &= CXLDEV_EVENT_STATUS_ALL;
+		/* Ignore logs unknown to the driver or owned by BIOS */
+		status &= mask;
 		if (!status)
 			break;
 		cxl_mem_get_event_records(mds, status);
@@ -557,6 +569,8 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
 		.opcode = CXL_MBOX_OP_GET_EVT_INT_POLICY,
 		.payload_out = policy,
 		.size_out = sizeof(*policy),
+		/* CXL 2.0 firmware omits dcd_settings; accept the shorter reply */
+		.min_out = CXL_EVENT_INT_POLICY_BASE_SIZE,
 	};
 	int rc;
 
@@ -569,23 +583,34 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
 }
 
 static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
-				    struct cxl_event_interrupt_policy *policy)
+				    struct cxl_event_interrupt_policy *policy,
+				    bool native_cxl)
 {
 	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+	size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
 	struct cxl_mbox_cmd mbox_cmd;
 	int rc;
 
-	*policy = (struct cxl_event_interrupt_policy) {
-		.info_settings = CXL_INT_MSI_MSIX,
-		.warn_settings = CXL_INT_MSI_MSIX,
-		.failure_settings = CXL_INT_MSI_MSIX,
-		.fatal_settings = CXL_INT_MSI_MSIX,
-	};
+	/* memory event policy is left if FW has control */
+	if (native_cxl) {
+		*policy = (struct cxl_event_interrupt_policy) {
+			.info_settings = CXL_INT_MSI_MSIX,
+			.warn_settings = CXL_INT_MSI_MSIX,
+			.failure_settings = CXL_INT_MSI_MSIX,
+			.fatal_settings = CXL_INT_MSI_MSIX,
+			.dcd_settings = 0,
+		};
+	}
+
+	if (cxl_dcd_supported(mds)) {
+		policy->dcd_settings = CXL_INT_MSI_MSIX;
+		size_in += sizeof(policy->dcd_settings);
+	}
 
 	mbox_cmd = (struct cxl_mbox_cmd) {
 		.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
 		.payload_in = policy,
-		.size_in = sizeof(*policy),
+		.size_in = size_in,
 	};
 
 	rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
@@ -632,6 +657,30 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
 	return 0;
 }
 
+static int cxl_irqsetup(struct cxl_memdev_state *mds,
+			struct cxl_event_interrupt_policy *policy,
+			bool native_cxl)
+{
+	struct cxl_dev_state *cxlds = &mds->cxlds;
+	int rc;
+
+	if (native_cxl) {
+		rc = cxl_event_irqsetup(mds, policy);
+		if (rc)
+			return rc;
+	}
+
+	if (cxl_dcd_supported(mds)) {
+		rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
+		if (rc) {
+			dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
+			cxl_disable_dcd(mds);
+		}
+	}
+
+	return 0;
+}
+
 static bool cxl_event_int_is_fw(u8 setting)
 {
 	u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
@@ -657,18 +706,26 @@ static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
 static int cxl_event_config(struct pci_host_bridge *host_bridge,
 			    struct cxl_memdev_state *mds, bool irq_avail)
 {
-	struct cxl_event_interrupt_policy policy;
+	struct cxl_event_interrupt_policy policy = { 0 };
+	bool native_cxl = host_bridge->native_cxl_error;
 	int rc;
 
 	/*
 	 * When BIOS maintains CXL error reporting control, it will process
 	 * event records.  Only one agent can do so.
+	 *
+	 * If BIOS has control of events and DCD is not supported skip event
+	 * configuration.
 	 */
-	if (!host_bridge->native_cxl_error)
+	if (!native_cxl && !cxl_dcd_supported(mds))
 		return 0;
 
 	if (!irq_avail) {
 		dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
+		if (cxl_dcd_supported(mds)) {
+			dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
+			cxl_disable_dcd(mds);
+		}
 		return 0;
 	}
 
@@ -676,10 +733,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	if (rc)
 		return rc;
 
-	if (!cxl_event_validate_mem_policy(mds, &policy))
+	if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
 		return -EBUSY;
 
-	rc = cxl_event_config_msgnums(mds, &policy);
+	rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
 	if (rc)
 		return rc;
 
@@ -687,11 +744,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	if (rc)
 		return rc;
 
-	rc = cxl_event_irqsetup(mds, &policy);
+	rc = cxl_irqsetup(mds, &policy, native_cxl);
 	if (rc)
 		return rc;
 
-	cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
+	if (native_cxl)
+		cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
+
+	dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
+		native_cxl ? "OS" : "BIOS",
+		cxl_dcd_supported(mds) ? "supported" : "not supported");
 
 	return 0;
 }
-- 
2.43.0



* [PATCH v11 11/31] cxl/core: Return endpoint decoder information from region search
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (8 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 11:04 ` [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su, Fan Ni, Li Ming

From: Ira Weiny <iweiny@kernel.org>

cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
The search involves finding the device endpoint decoder as well.

Dynamic capacity extent processing uses the endpoint decoder HPA
information to calculate the HPA offset.  In addition, well-behaved
extents should be contained within an endpoint decoder.

Return the endpoint decoder that was found so subsequent DCD code can
use it.
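
The calling convention (an optional out-parameter that existing callers
can pass as NULL) can be sketched outside the kernel roughly as follows.
This is an illustration under assumptions, not the patch itself: struct
decoder, example_decoders, dpa_to_region() and lookup_offset() are
hypothetical stand-ins for the CXL types and helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an endpoint decoder's DPA window. */
struct decoder {
	uint64_t dpa_start;
	uint64_t dpa_len;
	int region_id;
};

/* Hypothetical sample data: two decoders backing regions 0 and 1. */
static const struct decoder example_decoders[] = {
	{ .dpa_start = 0x0000, .dpa_len = 0x1000, .region_id = 0 },
	{ .dpa_start = 0x4000, .dpa_len = 0x2000, .region_id = 1 },
};

/*
 * Return the region id covering @dpa, or -1 if none.  When @found is
 * non-NULL it also receives the matching decoder (NULL on a miss), so
 * callers that do not need the decoder simply pass NULL.
 */
static int dpa_to_region(const struct decoder *decs, size_t n, uint64_t dpa,
			 const struct decoder **found)
{
	const struct decoder *match = NULL;
	int region = -1;

	for (size_t i = 0; i < n; i++) {
		if (dpa >= decs[i].dpa_start &&
		    dpa - decs[i].dpa_start < decs[i].dpa_len) {
			match = &decs[i];
			region = decs[i].region_id;
			break;
		}
	}

	if (found)
		*found = match;

	return region;
}

/* Offset of @dpa inside its decoder, or UINT64_MAX when unmapped. */
static uint64_t lookup_offset(uint64_t dpa)
{
	const struct decoder *d = NULL;
	size_t n = sizeof(example_decoders) / sizeof(example_decoders[0]);

	if (dpa_to_region(example_decoders, n, dpa, &d) < 0)
		return UINT64_MAX;

	assert(d != NULL);
	return dpa - d->dpa_start;
}
```

Writing the out-parameter on every path, including a miss, means a
caller never reads a stale pointer.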

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Fan Ni <nifan.cxl@gmail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
---
 drivers/cxl/core/core.h   | 6 ++++--
 drivers/cxl/core/mbox.c   | 2 +-
 drivers/cxl/core/memdev.c | 4 ++--
 drivers/cxl/core/region.c | 8 +++++++-
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 9ed141fa1334..1e3f19d8c9a3 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -56,7 +56,8 @@ int cxl_decoder_detach(struct cxl_region *cxlr,
 int cxl_region_init(void);
 void cxl_region_exit(void);
 int cxl_get_poison_by_endpoint(struct cxl_port *port);
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+				     struct cxl_endpoint_decoder **cxled);
 u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
 		   u64 dpa);
 int devm_cxl_add_dax_region(struct cxl_region *cxlr);
@@ -69,7 +70,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 	return ULLONG_MAX;
 }
 static inline
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+				     struct cxl_endpoint_decoder **cxled)
 {
 	return NULL;
 }
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index bdb908c6e7f3..2ab400788824 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -964,7 +964,7 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
 		guard(rwsem_read)(&cxl_rwsem.dpa);
 
 		dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
-		cxlr = cxl_dpa_to_region(cxlmd, dpa);
+		cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
 		if (cxlr) {
 			u64 cache_size = cxlr->params.cache_size;
 
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 20417db933aa..1a2b4d8bdd76 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -320,7 +320,7 @@ int cxl_inject_poison_locked(struct cxl_memdev *cxlmd, u64 dpa)
 	if (rc)
 		return rc;
 
-	cxlr = cxl_dpa_to_region(cxlmd, dpa);
+	cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
 	if (cxlr)
 		dev_warn_once(cxl_mbox->host,
 			      "poison inject dpa:%#llx region: %s\n", dpa,
@@ -389,7 +389,7 @@ int cxl_clear_poison_locked(struct cxl_memdev *cxlmd, u64 dpa)
 	if (rc)
 		return rc;
 
-	cxlr = cxl_dpa_to_region(cxlmd, dpa);
+	cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
 	if (cxlr)
 		dev_warn_once(cxl_mbox->host,
 			      "poison clear dpa:%#llx region: %s\n", dpa,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index ba03ec5e27c3..f6e93bc59ae7 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3001,6 +3001,7 @@ int cxl_get_poison_by_endpoint(struct cxl_port *port)
 struct cxl_dpa_to_region_context {
 	struct cxl_region *cxlr;
 	u64 dpa;
+	struct cxl_endpoint_decoder *cxled;
 };
 
 static int __cxl_dpa_to_region(struct device *dev, void *arg)
@@ -3034,11 +3035,13 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
 			dev_name(dev));
 
 	ctx->cxlr = cxlr;
+	ctx->cxled = cxled;
 
 	return 1;
 }
 
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+				     struct cxl_endpoint_decoder **cxled)
 {
 	struct cxl_dpa_to_region_context ctx;
 	struct cxl_port *port = cxlmd->endpoint;
@@ -3052,6 +3055,9 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
 	if (cxl_num_decoders_committed(port))
 		device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
 
+	if (cxled)
+		*cxled = ctx.cxled;
+
 	return ctx.cxlr;
 }
 
-- 
2.43.0



* [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (9 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 11/31] cxl/core: Return endpoint decoder information from region search Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:12   ` sashiko-bot
  2026-06-26 21:54   ` Dave Jiang
  2026-06-25 11:04 ` [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
                   ` (19 subsequent siblings)
  30 siblings, 2 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Add support for receiving DC event records but defer the real
add/release logic to subsequent commits.  Simply refuse all extents for
DC_ADD and ack all DC_RELEASE events for now.  Forced release is
currently unsupported.

In order, this commit adds the following:

1. Learn about DC Event Records and how to respond to them

* cxl_mem_get_event_records() learns about the DC Event record.
Records of that type are routed to cxl_handle_dcd_event_records().

* cxl_handle_dcd_event_records() switches on event_type:
	- DCD_ADD_CAPACITY     -> handle_add_event()
	- DCD_RELEASE_CAPACITY -> cxl_rm_extent()
	- DCD_FORCED_CAPACITY_RELEASE is logged and ignored (FM/device-only).

* cxl_send_dc_response() sends the reply mailbox commands
ADD_DC_RESPONSE / RELEASE_DC.

2. Add stubs for DC_ADD and DC_RELEASE logic

* handle_add_event() stages incoming extents onto
mds->add_ctx.pending_extents and, when More=0 closes the chain,
replies with an empty ADD_DC_RESPONSE, refusing all extents for now.

* cxl_rm_extent() acks the release via memdev_release_extent() so the
device's view stays consistent; all releases can be acked because no
offered extents are accepted or used yet.

3. Structural setup for later commits:

* struct dc_extent, struct cxl_dc_tag_group, and pending_add_ctx
set the stage for the real DC_ADD path, which will enforce
tag/grouping semantics.
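
For readers following the response path, the way cxl_send_dc_response()
splits an extent list that does not fit in one mailbox payload can be
sketched as below.  This is a simplified model, not the kernel code:
the helper names are hypothetical, header and entry sizes are passed in
as parameters because the real structure sizes are not restated here,
and the value of the More flag is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DC_EVENT_MORE 0x1	/* assumed bit value for the More flag */

/* One response message: how many extents it carries and its flags. */
struct dc_chunk {
	uint32_t nr_extents;
	uint8_t flags;
};

/* Extents that fit in one payload after the fixed response header. */
static uint32_t dc_max_extents(size_t payload_size, size_t hdr_size,
			       size_t entry_size)
{
	assert(entry_size > 0 && payload_size >= hdr_size);
	return (uint32_t)((payload_size - hdr_size) / entry_size);
}

/* Number of responses needed; an empty list still sends one reply. */
static uint32_t dc_response_count(uint32_t cnt, uint32_t max_per_msg)
{
	assert(max_per_msg > 0);
	if (cnt == 0)
		return 1;
	return cnt / max_per_msg + (cnt % max_per_msg != 0);
}

/* Shape of the idx-th response for a list of cnt extents. */
static struct dc_chunk dc_chunk_at(uint32_t cnt, uint32_t max_per_msg,
				   uint32_t idx)
{
	struct dc_chunk chunk = { 0, 0 };
	uint32_t left;

	assert(idx < dc_response_count(cnt, max_per_msg));
	if (cnt == 0)
		return chunk;	/* empty response: no extents accepted */

	left = cnt - idx * max_per_msg;
	chunk.nr_extents = left < max_per_msg ? left : max_per_msg;

	/* Every message except the last one carries the More flag. */
	if (left > chunk.nr_extents)
		chunk.flags |= DC_EVENT_MORE;

	return chunk;
}
```

In this model five extents with room for two per message produce three
responses, with More set on the first two only.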

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. mbox.c: fix leak of node in add_to_pending_list()
2. mbox.c: clear previously staged extents in handle_add_event()
   if add_to_pending_list() returns an error
3. mbox.c: memdev_release_extent(): allocate cxl_extent_list_node on
   the stack instead of with kzalloc(); a failed allocation would
   prevent the Release DC response from being sent, so avoid that
   possibility altogether.
---
 drivers/cxl/core/mbox.c | 245 +++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxl.h       |  69 +++++++++++
 drivers/cxl/cxlmem.h    |  45 ++++++++
 include/cxl/event.h     |  38 +++++++
 4 files changed, 396 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 2ab400788824..7dd40fb8d613 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -5,6 +5,7 @@
 #include <linux/ktime.h>
 #include <linux/mutex.h>
 #include <linux/unaligned.h>
+#include <linux/list.h>
 #include <cxlpci.h>
 #include <cxlmem.h>
 #include <cxl.h>
@@ -1098,6 +1099,237 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
 	return rc;
 }
 
+static int send_one_response(struct cxl_mailbox *cxl_mbox,
+			     struct cxl_mbox_dc_response *response,
+			     int opcode, u32 extent_list_size, u8 flags)
+{
+	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = opcode,
+		.size_in = struct_size(response, extent_list, extent_list_size),
+		.payload_in = response,
+	};
+
+	response->extent_list_size = cpu_to_le32(extent_list_size);
+	response->flags = flags;
+	return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+}
+
+static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
+				struct list_head *extent_list, int cnt)
+{
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+	struct cxl_mbox_dc_response *p;
+	struct cxl_extent_list_node *pos, *tmp;
+	struct cxl_extent *extent;
+	u32 pl_index;
+
+	size_t pl_size = struct_size(p, extent_list, cnt);
+	u32 max_extents = cnt;
+
+	/* May have to use more bit on response. */
+	if (pl_size > cxl_mbox->payload_size) {
+		max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
+			      sizeof(struct updated_extent_list);
+		pl_size = struct_size(p, extent_list, max_extents);
+	}
+
+	struct cxl_mbox_dc_response *response __free(kfree) =
+						kzalloc(pl_size, GFP_KERNEL);
+	if (!response)
+		return -ENOMEM;
+
+	/* Set __counted_by to capacity before populating; reset per response. */
+	response->extent_list_size = cpu_to_le32(max_extents);
+
+	if (cnt == 0)
+		return send_one_response(cxl_mbox, response, opcode, 0, 0);
+
+	pl_index = 0;
+	list_for_each_entry_safe(pos, tmp, extent_list, list) {
+		extent = pos->extent;
+		response->extent_list[pl_index].dpa_start = extent->start_dpa;
+		response->extent_list[pl_index].length = extent->length;
+		pl_index++;
+
+		if (pl_index == max_extents) {
+			u8 flags = 0;
+			int rc;
+
+			if (pl_index < cnt)
+				flags |= CXL_DCD_EVENT_MORE;
+			rc = send_one_response(cxl_mbox, response, opcode,
+					       pl_index, flags);
+			if (rc)
+				return rc;
+			cnt -= pl_index;
+			if (cnt < max_extents)
+				max_extents = cnt;
+			pl_index = 0;
+		}
+	}
+
+	if (!pl_index) /* nothing more to do */
+		return 0;
+	return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
+}
+
+static void delete_extent_node(struct cxl_extent_list_node *node)
+{
+	list_del(&node->list);
+	kfree(node->extent);
+	kfree(node);
+}
+
+static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_extent extent = {
+		.start_dpa = cpu_to_le64(range->start),
+		.length = cpu_to_le64(range_len(range)),
+	};
+	struct cxl_extent_list_node node = { .extent = &extent };
+	LIST_HEAD(extent_list);
+
+	dev_dbg(dev, "Release response dpa %pra\n", range);
+
+	/*
+	 * Stack-allocated instead of kzalloc'ed to avoid potential -ENOMEM,
+	 * which would prevent sending the release DC rsp.
+	 */
+	list_add_tail(&node.list, &extent_list);
+	if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
+		dev_dbg(dev, "Failed to release %pra\n", range);
+}
+
+static void clear_pending_extents(void *_mds)
+{
+	struct cxl_memdev_state *mds = _mds;
+	struct cxl_extent_list_node *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list)
+		delete_extent_node(pos);
+	mds->add_ctx.group = NULL;
+}
+
+static int add_to_pending_list(struct list_head *pending_list,
+			       struct cxl_extent *to_add)
+{
+	struct cxl_extent_list_node *node = kzalloc(sizeof(*node), GFP_KERNEL);
+	struct cxl_extent *extent;
+
+	if (!node)
+		return -ENOMEM;
+	extent = kmemdup(to_add, sizeof(*extent), GFP_KERNEL);
+	if (!extent) {
+		kfree(node);
+		return -ENOMEM;
+	}
+
+	node->extent = extent;
+	list_add_tail(&node->list, pending_list);
+	return 0;
+}
+
+/*
+ * Stub: stage extents on the pending list and reply with an empty
+ * ADD_DC_RESPONSE on More=0 (refuse all).  A later commit replaces
+ * the no-op tail with the real Add pipeline that surfaces a dax
+ * device per accepted extent.
+ */
+static int handle_add_event(struct cxl_memdev_state *mds,
+			    struct cxl_event_dcd *event)
+{
+	struct device *dev = mds->cxlds.dev;
+	int rc;
+
+	rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
+	if (rc) {
+		clear_pending_extents(mds);
+		return rc;
+	}
+
+	if (event->flags & CXL_DCD_EVENT_MORE) {
+		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+		return 0;
+	}
+
+	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+				  &mds->add_ctx.pending_extents, 0);
+	clear_pending_extents(mds);
+	return rc;
+}
+
+/*
+ * Stub: ack the release back to the device so it knows we are not
+ * using the range.  A later commit replaces this with the real
+ * teardown that walks the region's tag group and tears down the
+ * member dc_extent devices.
+ */
+static int cxl_rm_extent(struct cxl_memdev_state *mds,
+			 struct cxl_extent *extent)
+{
+	u64 start_dpa = le64_to_cpu(extent->start_dpa);
+	struct range dpa_range = {
+		.start = start_dpa,
+		.end = start_dpa + le64_to_cpu(extent->length) - 1,
+	};
+
+	memdev_release_extent(mds, &dpa_range);
+	return 0;
+}
+
+static char *cxl_dcd_evt_type_str(u8 type)
+{
+	switch (type) {
+	case DCD_ADD_CAPACITY:
+		return "add";
+	case DCD_RELEASE_CAPACITY:
+		return "release";
+	case DCD_FORCED_CAPACITY_RELEASE:
+		return "force release";
+	default:
+		break;
+	}
+
+	return "<unknown>";
+}
+
+static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+					 struct cxl_event_record_raw *raw_rec)
+{
+	struct cxl_event_dcd *event = &raw_rec->event.dcd;
+	struct cxl_extent *extent = &event->extent;
+	struct device *dev = mds->cxlds.dev;
+	uuid_t *id = &raw_rec->id;
+	int rc;
+
+	if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
+		return;
+
+	dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
+		cxl_dcd_evt_type_str(event->event_type),
+		le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
+
+	switch (event->event_type) {
+	case DCD_ADD_CAPACITY:
+		rc = handle_add_event(mds, event);
+		break;
+	case DCD_RELEASE_CAPACITY:
+		rc = cxl_rm_extent(mds, &event->extent);
+		break;
+	case DCD_FORCED_CAPACITY_RELEASE:
+		dev_err_ratelimited(dev, "Forced release event ignored.\n");
+		rc = 0;
+		break;
+	default:
+		rc = -EINVAL;
+		break;
+	}
+
+	if (rc)
+		dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
+}
+
 static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 				    enum cxl_event_log_type type)
 {
@@ -1134,9 +1366,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 		if (!nr_rec)
 			break;
 
-		for (i = 0; i < nr_rec; i++)
+		for (i = 0; i < nr_rec; i++) {
 			__cxl_event_trace_record(cxlmd, type,
 						 &payload->records[i]);
+			if (type == CXL_EVENT_TYPE_DCD)
+				cxl_handle_dcd_event_records(mds,
+							&payload->records[i]);
+		}
 
 		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
 			trace_cxl_overflow(cxlmd, type, payload);
@@ -1168,6 +1404,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
 {
 	dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
 
+	if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
+		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
 	if (status & CXLDEV_EVENT_STATUS_FATAL)
 		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
 	if (status & CXLDEV_EVENT_STATUS_FAIL)
@@ -1797,6 +2035,11 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
 	}
 
 	mutex_init(&mds->event.log_lock);
+	INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
+
+	rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
+	if (rc)
+		return ERR_PTR(rc);
 
 	rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
 	if (rc == -EOPNOTSUPP)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 864f6d3c03d4..367c3d9e2147 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -12,6 +12,7 @@
 #include <linux/node.h>
 #include <linux/io.h>
 #include <linux/range.h>
+#include <linux/xarray.h>
 #include <cxl/cxl.h>
 
 extern const struct nvdimm_security_ops *cxl_security_ops;
@@ -308,6 +309,41 @@ enum cxl_decoder_state {
 	CXL_DECODER_STATE_AUTO_STAGED,
 };
 
+struct cxl_dc_tag_group;
+
+/**
+ * struct dc_extent - A single dynamic-capacity extent surfaced to the host.
+ *
+ * One per device-stamped extent.  Multiple dc_extents that share a tag
+ * (see &struct cxl_dc_tag_group) form a single logical allocation, but
+ * each dc_extent has its own HPA range and is the unit that the DAX
+ * layer sees as a backing dax_resource.
+ *
+ * @dev: device representing this extent; child of cxlr_dax->dev.
+ * @group: containing tag group (allocation); shared across siblings.
+ * @cxled: endpoint decoder backing the DPA range.
+ * @dpa_range: DPA range this extent covers within @cxled.
+ * @hpa_range: HPA range that @dpa_range decodes to, relative to
+ *	       cxlr_dax->hpa_range.start.
+ * @uuid: tag uuid (matches @group->uuid; kept for the release-path log).
+ * @seq_num: 0..n-1 assembly-order index within the tag group, assigned by the
+ *	     host in cxl_realize_group().  For a sharable partition the
+ *	     order follows the device-stamped shared_extn_seq (0..n-1 per
+ *	     CXL r4.0 Table 8-230); for a non-sharable partition, where
+ *	     shared_extn_seq is reserved, the order follows event arrival.
+ *	     Either way @seq_num is a dense 0..n-1 so the dax layer can
+ *	     assemble ranges in the right order regardless of source.
+ */
+struct dc_extent {
+	struct device dev;
+	struct cxl_dc_tag_group *group;
+	struct cxl_endpoint_decoder *cxled;
+	struct range dpa_range;
+	struct range hpa_range;
+	uuid_t uuid;
+	u16 seq_num;
+};
+
 /**
  * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
  * @cxld: base cxl_decoder_object
@@ -520,12 +556,45 @@ struct cxl_pmem_region {
 	struct cxl_pmem_region_mapping mapping[];
 };
 
+/* See CXL 3.1 8.2.9.2.1.6 */
+enum dc_event {
+	DCD_ADD_CAPACITY,
+	DCD_RELEASE_CAPACITY,
+	DCD_FORCED_CAPACITY_RELEASE,
+	DCD_REGION_CONFIGURATION_UPDATED,
+};
+
 struct cxl_dax_region {
 	struct device dev;
 	struct cxl_region *cxlr;
 	struct range hpa_range;
 };
 
+/**
+ * struct cxl_dc_tag_group - A tagged dynamic-capacity allocation.
+ *
+ * Container for the &struct dc_extent siblings that share a tag.  The
+ * group has no sysfs identity; userspace sees the individual dc_extents
+ * directly under the parent dax_region device.  The group exists to
+ * keep tag-scoped invariants (atomic add, atomic release, ordered carve
+ * by seq_num) in one place.
+ *
+ * @cxlr_dax: back reference to parent region device.
+ * @uuid: tag identifying this allocation; same across all member dc_extents.
+ * @dc_extents: xarray of &struct dc_extent in this group, indexed by the
+ *		dc_extent's @seq_num (0..n-1, dense).  See &struct dc_extent
+ *		for how seq_num is sourced for sharable vs non-sharable
+ *		allocations.
+ * @nr_extents: live count of dc_extents in the group; the group is freed
+ *		when the last dc_extent device is released.
+ */
+struct cxl_dc_tag_group {
+	struct cxl_dax_region *cxlr_dax;
+	uuid_t uuid;
+	struct xarray dc_extents;
+	unsigned int nr_extents;
+};
+
 /**
  * struct cxl_port - logical collection of upstream port devices and
  *		     downstream port devices to construct a CXL memory
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index bcf976829c3e..4ffa7bd1e5f1 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -7,6 +7,7 @@
 #include <linux/cdev.h>
 #include <linux/uuid.h>
 #include <linux/node.h>
+#include <linux/list.h>
 #include <cxl/event.h>
 #include <cxl/mailbox.h>
 #include "cxl.h"
@@ -404,6 +405,23 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
 	return dev_get_drvdata(cxl_mbox->host);
 }
 
+/**
+ * struct pending_add_ctx - Staging state for an in-progress
+ *			    DCD_ADD_CAPACITY event chain
+ * @pending_extents: extents received so far in the chain; flushed when
+ *		     the chain closes (More=0)
+ * @group: tag group being assembled from the chain
+ *
+ * A DCD_ADD_CAPACITY notification can span multiple event records
+ * stitched together by the CXL_DCD_EVENT_MORE flag.  Records are staged
+ * here until the device clears More, at which point the staged batch is
+ * processed and responded to as a single Add_DC_Response.
+ */
+struct pending_add_ctx {
+	struct list_head pending_extents;
+	struct cxl_dc_tag_group *group;
+};
+
 /**
  * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
  *
@@ -422,6 +440,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @active_volatile_bytes: sum of hard + soft volatile
  * @active_persistent_bytes: sum of hard + soft persistent
  * @dcd_supported: all DCD commands are supported
+ * @add_ctx: state for an in-progress DCD_ADD_CAPACITY chain
+ *	     (see &struct pending_add_ctx)
  * @event: event log driver state
  * @poison: poison driver state info
  * @security: security driver state info
@@ -442,6 +462,7 @@ struct cxl_memdev_state {
 	u64 active_volatile_bytes;
 	u64 active_persistent_bytes;
 	bool dcd_supported;
+	struct pending_add_ctx add_ctx;
 
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
@@ -518,6 +539,21 @@ enum cxl_opcode {
 	UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
 		  0x40, 0x3d, 0x86)
 
+/*
+ * Add Dynamic Capacity Response
+ * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
+ */
+struct cxl_mbox_dc_response {
+	__le32 extent_list_size;
+	u8 flags;
+	u8 reserved[3];
+	struct updated_extent_list {
+		__le64 dpa_start;
+		__le64 length;
+		u8 reserved[8];
+	} __packed extent_list[] __counted_by(extent_list_size);
+} __packed;
+
 struct cxl_mbox_get_supported_logs {
 	__le16 entries;
 	u8 rsvd[6];
@@ -588,6 +624,14 @@ struct cxl_mbox_identify {
 	UUID_INIT(0xe71f3a40, 0x2d29, 0x4092, 0x8a, 0x39, 0x4d, 0x1c, 0x96, \
 		  0x6c, 0x7c, 0x65)
 
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
+ */
+#define CXL_EVENT_DC_EVENT_UUID                                             \
+	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
+		  0x10, 0x1a, 0x2a)
+
 /*
  * Get Event Records output payload
  * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
@@ -613,6 +657,7 @@ enum cxl_event_log_type {
 	CXL_EVENT_TYPE_WARN,
 	CXL_EVENT_TYPE_FAIL,
 	CXL_EVENT_TYPE_FATAL,
+	CXL_EVENT_TYPE_DCD,
 	CXL_EVENT_TYPE_MAX
 };
 
diff --git a/include/cxl/event.h b/include/cxl/event.h
index ff97fea718d2..fa3cd895f656 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 #include <linux/uuid.h>
 #include <linux/workqueue_types.h>
+#include <linux/list.h>
 
 /*
  * Common Event Record Format
@@ -141,12 +142,49 @@ struct cxl_event_mem_sparing {
 	u8 reserved2[0x25];
 } __packed;
 
+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+struct cxl_extent {
+	__le64 start_dpa;
+	__le64 length;
+	u8 uuid[UUID_SIZE];
+	__le16 shared_extn_seq;
+	u8 reserved[0x6];
+} __packed;
+
+struct cxl_extent_list_node {
+	struct cxl_extent *extent;
+	struct list_head list;
+	int rid;
+};
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ */
+#define CXL_DCD_EVENT_MORE			BIT(0)
+struct cxl_event_dcd {
+	struct cxl_event_record_hdr hdr;
+	u8 event_type;
+	u8 validity_flags;
+	__le16 host_id;
+	u8 partition_index;
+	u8 flags;
+	u8 reserved1[0x2];
+	struct cxl_extent extent;
+	u8 reserved2[0x18];
+	__le32 num_avail_extents;
+	__le32 num_avail_tags;
+} __packed;
+
 union cxl_event {
 	struct cxl_event_generic generic;
 	struct cxl_event_gen_media gen_media;
 	struct cxl_event_dram dram;
 	struct cxl_event_mem_module mem_module;
 	struct cxl_event_mem_sparing mem_sparing;
+	struct cxl_event_dcd dcd;
 	/* dram & gen_media event header */
 	struct cxl_event_media_hdr media_hdr;
 } __packed;
-- 
2.43.0



* [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (10 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:15   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
                   ` (18 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

A DC_ADD_CAPACITY event can span multiple event records grouped together
by the CXL_DCD_EVENT_MORE flag. Extents are staged in the pending list until
the last event record ('More'=0) is received, at which point the pending
list is processed. If the device opens such a chain (More=1) but never
sends the closing record, the staged list sits indefinitely.

Add a delayed-work watchdog that, on expiry, refuses the chain with an
empty ADD_DC_RESPONSE and drops the staged list.

The 20s timeout is a conservative upper bound and may be tightened
later. The timeout is purely defensive: the spec does not require it,
but it guards against a chain left open by a lost mailbox response or
a crashed fabric manager.

The watchdog bounds how long a chain may stall, but a device could still
defeat it by streaming More=1 records faster than the timeout, growing the
staged list without bound. Also cap a runtime chain at
CXL_DC_MAX_PENDING_EXTENTS and refuse it once exceeded; existing-extent
recovery is bounded separately by the device's reported extent count.
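
As an illustration only, the interaction between the watchdog and the
cap can be modelled in userspace C. This is a minimal sketch, not the
driver code: struct chain, chain_add() and chain_expire() are
hypothetical names, time is an abstract counter in seconds rather than
jiffies, and there is no locking or delayed work. It also simplifies
the cap path by disarming the timer when the chain is refused.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING	100	/* stands in for CXL_DC_MAX_PENDING_EXTENTS */
#define TIMEOUT_SECS	20	/* stands in for CXL_DC_ADD_TIMEOUT, in seconds */

enum chain_result { CHAIN_STAGED, CHAIN_CLOSED, CHAIN_REFUSED };

/* Hypothetical model of the staging state for one More=1 chain. */
struct chain {
	unsigned int nr_pending;	/* extents staged so far */
	unsigned long deadline;		/* time at which the watchdog fires */
	bool armed;			/* a chain is open and being timed */
};

/* Drop the staged extents and disarm the watchdog. */
void chain_reset(struct chain *c)
{
	c->nr_pending = 0;
	c->armed = false;
}

/* Stage one event record; 'more' is the record's More flag. */
enum chain_result chain_add(struct chain *c, bool more, unsigned long now)
{
	/* The cap bounds memory: refuse once the staged list is full. */
	if (c->nr_pending >= MAX_PENDING) {
		chain_reset(c);
		return CHAIN_REFUSED;
	}

	c->nr_pending++;
	if (more) {
		/* Every More=1 record pushes the deadline out again. */
		c->deadline = now + TIMEOUT_SECS;
		c->armed = true;
		return CHAIN_STAGED;
	}

	/* More=0 closes the chain; the batch would be processed here. */
	chain_reset(c);
	return CHAIN_CLOSED;
}

/* Watchdog: returns true if the open chain timed out and was refused. */
bool chain_expire(struct chain *c, unsigned long now)
{
	if (!c->armed || now < c->deadline)
		return false;
	chain_reset(c);
	return true;
}
```

The sketch shows why both bounds are needed: a record stream that keeps
re-arming the deadline never trips chain_expire(), so only the count
check in chain_add() stops the staged list from growing.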

Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. mbox.c: Fix comment in handle_add_event(), before closing the 'More'
   chain and disabling the watchdog. The comment incorrectly claimed
   handle_add_event() runs in system_wq.
2. mbox.c: Drop unnecessary initialization of add_ctx.armed=false in
   cxl_memdev_state_create(), as allocated memory is already zeroed
3. mbox.c: assert add_ctx.lock is held in add_to_pending_list(); it
   serializes access to add_ctx.pending_extents.
4. mbox.c: cap a runtime More=1 chain at CXL_DC_MAX_PENDING_EXTENTS in
   handle_add_event() so a buggy device cannot grow the staged list
   without bound (the watchdog bounds time, not memory).
---
 drivers/cxl/core/mbox.c | 98 ++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxlmem.h    | 24 ++++++++--
 2 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 7dd40fb8d613..4e887b5cdc3e 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1208,15 +1208,78 @@ static void clear_pending_extents(void *_mds)
 
 	list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list)
 		delete_extent_node(pos);
+	mds->add_ctx.nr_pending = 0;
 	mds->add_ctx.group = NULL;
 }
 
+/*
+ * Defensive cap on extents staged in one runtime More=1 chain: a buggy
+ * device could otherwise grow the list without bound.  Not spec-defined.
+ */
+#define CXL_DC_MAX_PENDING_EXTENTS	100
+
+/*
+ * Bound on how long the host will wait for a device to finish a
+ * multi-record DC_ADD_CAPACITY chain (More=1 ... More=0) before
+ * refusing the chain.
+ * The timeout is not defined in the spec, but added for defensive purposes.
+ * Since there is no spec-defined timeout, 20s is chosen as a generous
+ * upper bound and matches the GPF timeout.
+ */
+#define CXL_DC_ADD_TIMEOUT	(20 * HZ)
+
+static void cxl_dc_add_timeout(struct work_struct *work)
+{
+	struct pending_add_ctx *ctx = container_of(to_delayed_work(work),
+						   struct pending_add_ctx,
+						   timeout_work);
+	struct cxl_memdev_state *mds = container_of(ctx,
+						    struct cxl_memdev_state,
+						    add_ctx);
+	struct device *dev = mds->cxlds.dev;
+
+	guard(mutex)(&ctx->lock);
+
+	/*
+	 * handle_add_event() cancels this work non-synchronously (a sync
+	 * cancel would deadlock on @ctx->lock, which the chain-close path
+	 * holds), so a callback that already started running can reach here
+	 * after its chain has moved on.  Abort only if a chain is still armed
+	 * AND the timer has not been re-armed since this expiry fired: a fresh
+	 * mod_delayed_work() (a later extent in this chain, or a new chain)
+	 * makes delayed_work_pending() true, meaning this expiry belongs to a
+	 * superseded deadline and must not abort the current chain.
+	 */
+	if (!ctx->armed || delayed_work_pending(&ctx->timeout_work))
+		return;
+
+	dev_warn(dev, "DC add chain timed out; refusing staged extents\n");
+
+	if (cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+				 &ctx->pending_extents, 0))
+		dev_dbg(dev, "Failed to send empty ADD_DC_RESPONSE on timeout\n");
+
+	clear_pending_extents(mds);
+	ctx->armed = false;
+}
+
+static void cxl_cancel_dcd_add_chain_work(void *_mds)
+{
+	struct cxl_memdev_state *mds = _mds;
+
+	cancel_delayed_work_sync(&mds->add_ctx.timeout_work);
+}
+
 static int add_to_pending_list(struct list_head *pending_list,
 			       struct cxl_extent *to_add)
 {
+	struct pending_add_ctx *ctx =
+		container_of(pending_list, struct pending_add_ctx, pending_extents);
 	struct cxl_extent_list_node *node = kzalloc(sizeof(*node), GFP_KERNEL);
 	struct cxl_extent *extent;
 
+	lockdep_assert_held(&ctx->lock);
+
 	if (!node)
 		return -ENOMEM;
 	extent = kmemdup(to_add, sizeof(*extent), GFP_KERNEL);
@@ -1227,6 +1290,7 @@ static int add_to_pending_list(struct list_head *pending_list,
 
 	node->extent = extent;
 	list_add_tail(&node->list, pending_list);
+	ctx->nr_pending++;
 	return 0;
 }
 
@@ -1239,10 +1303,20 @@ static int add_to_pending_list(struct list_head *pending_list,
 static int handle_add_event(struct cxl_memdev_state *mds,
 			    struct cxl_event_dcd *event)
 {
+	struct pending_add_ctx *ctx = &mds->add_ctx;
 	struct device *dev = mds->cxlds.dev;
 	int rc;
 
-	rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
+	guard(mutex)(&ctx->lock);
+
+	if (ctx->nr_pending >= CXL_DC_MAX_PENDING_EXTENTS) {
+		dev_warn(dev, "DC add chain exceeds %u extents; dropping (firmware bug)\n",
+			 CXL_DC_MAX_PENDING_EXTENTS);
+		clear_pending_extents(mds);
+		return -ENOSPC;
+	}
+
+	rc = add_to_pending_list(&ctx->pending_extents, &event->extent);
 	if (rc) {
 		clear_pending_extents(mds);
 		return rc;
@@ -1250,9 +1324,19 @@ static int handle_add_event(struct cxl_memdev_state *mds,
 
 	if (event->flags & CXL_DCD_EVENT_MORE) {
 		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+		mod_delayed_work(system_wq, &ctx->timeout_work,
+				 CXL_DC_ADD_TIMEOUT);
+		ctx->armed = true;
 		return 0;
 	}
 
+	/*
+	 * Chain is closing.  Disarm before flushing so a pending watchdog
+	 * (queued but blocked on @ctx->lock) sees !armed and bails out.
+	 */
+	ctx->armed = false;
+	cancel_delayed_work(&ctx->timeout_work);
+
 	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
 				  &mds->add_ctx.pending_extents, 0);
 	clear_pending_extents(mds);
@@ -2036,11 +2120,23 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
 
 	mutex_init(&mds->event.log_lock);
 	INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
+	mutex_init(&mds->add_ctx.lock);
+	INIT_DELAYED_WORK(&mds->add_ctx.timeout_work,
+			  cxl_dc_add_timeout);
 
 	rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
 	if (rc)
 		return ERR_PTR(rc);
 
+	/*
+	 * Registered after clear_pending_extents so devm's reverse-order
+	 * unwind cancels (and waits for) the watchdog first, then the list
+	 * cleanup runs with the watchdog guaranteed not to refire.
+	 */
+	rc = devm_add_action_or_reset(dev, cxl_cancel_dcd_add_chain_work, mds);
+	if (rc)
+		return ERR_PTR(rc);
+
 	rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
 	if (rc == -EOPNOTSUPP)
 		dev_warn(dev, "CXL MCE unsupported\n");
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 4ffa7bd1e5f1..81498d47f309 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -8,6 +8,8 @@
 #include <linux/uuid.h>
 #include <linux/node.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
 #include <cxl/event.h>
 #include <cxl/mailbox.h>
 #include "cxl.h"
@@ -407,19 +409,33 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
 
 /**
  * struct pending_add_ctx - Staging state for an in-progress
- *			    DCD_ADD_CAPACITY event chain
+ *							DCD_ADD_CAPACITY event chain
  * @pending_extents: extents received so far in the chain; flushed when
- *		     the chain closes (More=0)
+ *					 the chain closes (More=0)
  * @group: tag group being assembled from the chain
+ * @timeout_work: watchdog that fires if a chain is opened with
+ *				  CXL_DCD_EVENT_MORE but the closing record never arrives
+ * @lock: serialises updates to the chain state against the watchdog
+ * @armed: set when a More=1 chain opens; cleared when the chain closes,
+ *		   either by a More=0 event record or by the watchdog firing.
  *
  * A DCD_ADD_CAPACITY notification can span multiple event records
  * stitched together by the CXL_DCD_EVENT_MORE flag.  Records are staged
- * here until the device clears More, at which point the staged batch is
- * processed and responded to as a single Add_DC_Response.
+ * here until an event record with 'More'=0 is received, at which point the
+ * staged batch is processed and responded to as a single Add_DC_Response.
+ *
+ * If a chain is opened (More=1) but the device never sends the closing
+ * record, the staged list would otherwise sit indefinitely.  @timeout_work
+ * is a defensive watchdog that refuses such a chain with an empty response
+ * and drops the staged list.
  */
 struct pending_add_ctx {
 	struct list_head pending_extents;
 	struct cxl_dc_tag_group *group;
+	struct delayed_work timeout_work;
+	struct mutex lock;
+	unsigned int nr_pending;
+	bool armed;
 };
 
 /**
-- 
2.43.0



* [PATCH v11 14/31] cxl/extent: Handle DC Add Capacity events
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (11 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:16   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
                   ` (17 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Replace the empty-response stub in handle_add_event() with the real
add pipeline.

DC Event Records can be grouped together with the 'More' flag. The
previous commit completed the setup for holding extents in the
pending list until the last event record of the group, marked by
'More'=0, is received.

This commit fills in the logic for processing the pending list and
adds basic validation for extents before they are added to the
device model as children of the cxlr_dax region. More complete
checks for tags, sequence numbers, and alignment are added in
subsequent commits.

For each tag that appears in the pending list:
1. Extract all extents in the pending list with that tag to a
local list.

2. Order the local list. The spec requires that sharable extents be
ordered by shared extent sequence number, which "instructs each host
on the relative order these extents must be placed in adjacent
virtual address space" (r4.0 Section 9.13.3 Figure 9-23
Shared Extent List Example). Otherwise, retain arrival order.

Thus the tag group is stable-sorted by shared_extn_seq; for non-sharable
extents every key is 0 and the stable sort preserves arrival
order.
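
The ordering rule can be illustrated with a small userspace sketch. It
is not the code in this patch: struct staged_extent and
sort_tag_group() are hypothetical names, and a plain insertion sort is
used here only because it is an easy-to-read stable sort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of one staged extent. */
struct staged_extent {
	uint64_t start_dpa;
	uint16_t shared_extn_seq;	/* 0 for non-sharable extents */
};

/*
 * Stable insertion sort by shared_extn_seq.  Elements are only moved
 * past neighbours with a strictly larger key, so extents with equal
 * keys keep their arrival order.
 */
void sort_tag_group(struct staged_extent *ext, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		struct staged_extent cur = ext[i];
		size_t j = i;

		while (j > 0 && ext[j - 1].shared_extn_seq > cur.shared_extn_seq) {
			ext[j] = ext[j - 1];
			j--;
		}
		ext[j] = cur;
	}
}
```

With sharable extents arriving as seq 2, 0, 1 the result is ordered
0, 1, 2; with non-sharable extents (all keys 0) the array is left in
arrival order, which is the property the commit message relies on.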

Each extent is checked for the following:
1. Its DPA range fully resolves to an endpoint decoder.

2. It doesn't overlap with a previously accepted extent.

3. Its sequence number doesn't collide with others in the same
tag group.
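
The overlap check (item 2) can be sketched in userspace C as follows.
This is a simplified model under stated assumptions: struct dpa_range,
classify() and the flat array of accepted ranges are hypothetical
stand-ins, ranges are inclusive, and the per-endpoint-decoder filter
that the driver applies is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive DPA range, modelled on the kernel's struct range. */
struct dpa_range {
	uint64_t start;
	uint64_t end;
};

enum ext_class { EXT_NEW, EXT_DUPLICATE, EXT_OVERLAP };

/* True if @a fully contains @b. */
bool contains(const struct dpa_range *a, const struct dpa_range *b)
{
	return a->start <= b->start && a->end >= b->end;
}

/* True if @a and @b share at least one address. */
bool overlaps(const struct dpa_range *a, const struct dpa_range *b)
{
	return a->start <= b->end && a->end >= b->start;
}

/*
 * Classify @ext against already-accepted ranges.  Containment is
 * tested first: a fully contained range is a repeated offer, while a
 * partial overlap is an inconsistent one.
 */
enum ext_class classify(const struct dpa_range *accepted, size_t n,
			const struct dpa_range *ext)
{
	for (size_t i = 0; i < n; i++) {
		if (contains(&accepted[i], ext))
			return EXT_DUPLICATE;
		if (overlaps(&accepted[i], ext))
			return EXT_OVERLAP;
	}
	return EXT_NEW;
}
```

For an accepted range 0x1000-0x1fff, an offer of 0x1000-0x17ff is a
duplicate, 0x1800-0x27ff is an overlap, and 0x2000-0x2fff is new.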

Upon passing these checks, extents are "onlined" together as a tag
group: online_tag_group() registers a struct device per dc_extent
under cxlr_dax->dev so the dax layer can discover them via
device_for_each_child().

Once the pending list has been fully processed, send the
Add_DC_Response.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. extent.c: add NULL check in dc_extent_release
2. extent.c: add NULL check for cxlr_dax in cxl_validate_extent
3. extent.c: change cxlr_add_extent to return 1 on success so
   cxl_add_pending() can differentiate between accepting a duplicate
   extent vs. a new one. Otherwise the accounting for total_accepted
   will be off.
4. region_dax.c: init dc_extents xarray with XA_FLAGS_ALLOC1 flag.
   Otherwise if xa_alloc fails for an extent in online_tag_group, the id
   is invalid.
   put_device() calls dc_extent_release(), which clears id 0, which is
   not necessarily the right extent to clear.
5. mbox.c: cxl_add_pending() needs to delete_extent_node() before
   continuing if mds->add_ctx has no tag group
---
 drivers/cxl/core/Makefile     |   2 +-
 drivers/cxl/core/core.h       |  14 ++
 drivers/cxl/core/extent.c     | 411 ++++++++++++++++++++++++++++++++++
 drivers/cxl/core/mbox.c       | 163 +++++++++++++-
 drivers/cxl/core/region_dax.c |   3 +
 drivers/cxl/cxl.h             |  25 +++
 tools/testing/cxl/Kbuild      |   5 +-
 7 files changed, 614 insertions(+), 9 deletions(-)
 create mode 100644 drivers/cxl/core/extent.c

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..208917ad8aac 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -15,7 +15,7 @@ cxl_core-y += hdm.o
 cxl_core-y += pmu.o
 cxl_core-y += cdat.o
 cxl_core-$(CONFIG_TRACING) += trace.o
-cxl_core-$(CONFIG_CXL_REGION) += region.o region_pmem.o region_dax.o
+cxl_core-$(CONFIG_CXL_REGION) += region.o region_pmem.o region_dax.o extent.o
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1e3f19d8c9a3..2c1df75ebbc5 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -63,12 +63,25 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
 int devm_cxl_add_dax_region(struct cxl_region *cxlr);
 int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
 
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
+		   u16 seq_num);
+int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release);
 #else
 static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 				 const struct cxl_memdev *cxlmd, u64 dpa)
 {
 	return ULLONG_MAX;
 }
+static inline int cxl_add_extent(struct cxl_memdev_state *mds,
+				 struct cxl_extent *extent, u16 seq_num)
+{
+	return 0;
+}
+static inline int online_tag_group(struct cxl_dc_tag_group *group,
+				   bool skip_release)
+{
+	return 0;
+}
 static inline
 struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
 				     struct cxl_endpoint_decoder **cxled)
@@ -164,6 +177,7 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
 int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
 int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 					struct access_coordinate *c);
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
 
 static inline struct device *port_to_host(struct cxl_port *port)
 {
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
new file mode 100644
index 000000000000..6e67e787d14d
--- /dev/null
+++ b/drivers/cxl/core/extent.c
@@ -0,0 +1,411 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2024 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <cxl.h>
+
+#include "core.h"
+
+
+static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
+				 struct dc_extent *dc_extent)
+{
+	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+	struct device *dev = &cxled->cxld.dev;
+
+	dev_dbg(dev, "Remove extent %pra (%pU)\n",
+		&dc_extent->dpa_range, &dc_extent->uuid);
+	memdev_release_extent(mds, &dc_extent->dpa_range);
+}
+
+static void free_tag_group(struct cxl_dc_tag_group *group)
+{
+	xa_destroy(&group->dc_extents);
+	/* Drop the pin taken in alloc_tag_group(). */
+	put_device(&group->cxlr_dax->dev);
+	kfree(group);
+}
+
+static void dc_extent_release(struct device *dev)
+{
+	struct dc_extent *dc_extent = to_dc_extent(dev);
+	struct cxl_dc_tag_group *group;
+
+	if (!dc_extent)
+		return;
+
+	group = dc_extent->group;
+	cxled_release_extent(dc_extent->cxled, dc_extent);
+	xa_erase(&group->cxlr_dax->dc_extents, dc_extent->dev.id);
+	xa_erase(&group->dc_extents, dc_extent->seq_num);
+	group->nr_extents--;
+	if (!group->nr_extents)
+		free_tag_group(group);
+	kfree(dc_extent);
+}
+
+static const struct device_type dc_extent_type = {
+	.name = "extent",
+	.release = dc_extent_release,
+};
+
+bool is_dc_extent(struct device *dev)
+{
+	return dev->type == &dc_extent_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_dc_extent, "CXL");
+
+static struct cxl_dc_tag_group *
+alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
+{
+	struct cxl_dc_tag_group *group __free(kfree) =
+				kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	group->cxlr_dax = cxlr_dax;
+	uuid_copy(&group->uuid, uuid);
+	xa_init(&group->dc_extents);
+
+	/*
+	 * Pin cxlr_dax: it is used after cxl_rwsem.region is dropped, so a
+	 * refcount must keep it alive.  Released in free_tag_group().
+	 */
+	get_device(&cxlr_dax->dev);
+
+	return no_free_ptr(group);
+}
+
+/*
+ * Stage 1 of the add pipeline: pure, no allocation.  Resolve the extent
+ * to its region/endpoint decoder and ext_range, and verify the range
+ * fits in the resolved endpoint decoder's DPA resource.  Further
+ * per-extent invariants layer into this function in subsequent commits.
+ *
+ * Caller must hold cxl_rwsem.region for read (cxl_dpa_to_region()).
+ * On success, @out_cxled / @out_cxlr_dax / @out_ext_range carry the
+ * resolved handles consumed by the rest of the pipeline.
+ */
+static int cxl_validate_extent(struct cxl_memdev_state *mds,
+			       struct cxl_extent *extent,
+			       struct cxl_endpoint_decoder **out_cxled,
+			       struct cxl_dax_region **out_cxlr_dax,
+			       struct range *out_ext_range)
+{
+	u64 start_dpa = le64_to_cpu(extent->start_dpa);
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_region *cxlr;
+	struct range ext_range = (struct range) {
+		.start = start_dpa,
+		.end = start_dpa + le64_to_cpu(extent->length) - 1,
+	};
+	struct range ed_range;
+
+	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+	if (!cxlr || !cxlr->cxlr_dax)
+		return -ENXIO;
+
+	ed_range = (struct range) {
+		.start = cxled->dpa_res->start,
+		.end = cxled->dpa_res->end,
+	};
+	if (!range_contains(&ed_range, &ext_range)) {
+		dev_err_ratelimited(&cxled->cxld.dev,
+				    "DC extent DPA %pra (%pU) is not fully in ED %pra\n",
+				    &ext_range, extent->uuid, &ed_range);
+		return -ENXIO;
+	}
+
+	*out_cxled = cxled;
+	*out_cxlr_dax = cxlr->cxlr_dax;
+	*out_ext_range = ext_range;
+	return 0;
+}
+
+enum cxl_extent_class {
+	CXL_EXT_NEW,
+	CXL_EXT_DUPLICATE,
+	CXL_EXT_OVERLAP,
+};
+
+/*
+ * Stage 2: classify @ext_range against extents already accepted on this
+ * cxlr_dax+cxled.  Walks cxlr_dax->dc_extents once: a stored extent that
+ * fully contains @ext_range means a duplicate accept (idempotent, fine);
+ * a stored extent that only overlaps means an inconsistent offer.
+ */
+static enum cxl_extent_class
+cxlr_dax_classify_extent(struct cxl_dax_region *cxlr_dax,
+			 struct cxl_endpoint_decoder *cxled,
+			 const struct range *ext_range)
+{
+	struct dc_extent *entry;
+	unsigned long i;
+
+	xa_for_each(&cxlr_dax->dc_extents, i, entry) {
+		if (entry->cxled != cxled)
+			continue;
+		if (range_contains(&entry->dpa_range, ext_range))
+			return CXL_EXT_DUPLICATE;
+		if (range_overlaps(&entry->dpa_range, ext_range))
+			return CXL_EXT_OVERLAP;
+	}
+	return CXL_EXT_NEW;
+}
+
+/*
+ * Stage 3: allocate and populate a dc_extent for an already-validated,
+ * already-classified-as-new @ext_range.  Only -ENOMEM can fail here.
+ */
+static struct dc_extent *
+dc_extent_build(struct cxl_endpoint_decoder *cxled,
+		struct cxl_dax_region *cxlr_dax,
+		struct cxl_extent *extent,
+		const struct range *ext_range, u16 seq_num)
+{
+	resource_size_t dpa_offset = ext_range->start - cxled->dpa_res->start;
+	resource_size_t hpa = cxled->cxld.hpa_range.start + dpa_offset;
+	struct dc_extent *dc_extent;
+
+	dc_extent = kzalloc(sizeof(*dc_extent), GFP_KERNEL);
+	if (!dc_extent)
+		return ERR_PTR(-ENOMEM);
+
+	dc_extent->cxled = cxled;
+	dc_extent->dpa_range = *ext_range;
+	dc_extent->hpa_range.start = hpa - cxlr_dax->hpa_range.start;
+	dc_extent->hpa_range.end = dc_extent->hpa_range.start +
+				   range_len(ext_range) - 1;
+	dc_extent->seq_num = seq_num;
+	import_uuid(&dc_extent->uuid, extent->uuid);
+	return dc_extent;
+}
+
+/*
+ * Stage 4: insert @dc_extent into the pending tag group.  All extents in
+ * one More-chain group share a UUID — enforced here as the group is
+ * either being created (first extent) or appended to.  On any failure
+ * the dc_extent is freed.
+ *
+ * Returns 1 on success to allow caller (cxl_add_extent) to distinguish
+ * between accepting a new extent, accepting a duplicate, or error.
+ */
+static int cxlr_add_extent(struct cxl_memdev_state *mds,
+			   struct cxl_dax_region *cxlr_dax,
+			   struct dc_extent *dc_extent)
+{
+	struct cxl_dc_tag_group **group = &mds->add_ctx.group;
+	int rc;
+
+	if (*group && !uuid_equal(&(*group)->uuid, &dc_extent->uuid)) {
+		kfree(dc_extent);
+		return -EINVAL;
+	}
+
+	if (!*group) {
+		dev_dbg(&cxlr_dax->dev, "Alloc new tag group\n");
+		*group = alloc_tag_group(cxlr_dax, &dc_extent->uuid);
+		if (IS_ERR(*group)) {
+			rc = PTR_ERR(*group);
+			*group = NULL;
+			kfree(dc_extent);
+			return rc;
+		}
+	} else {
+		dev_dbg(&cxlr_dax->dev, "Append dc_extent to tag group\n");
+	}
+
+	dc_extent->group = *group;
+
+	/*
+	 * Key by @seq_num so iteration order equals assembly order.  @seq_num
+	 * is a dense 0..n-1 index (see &struct dc_extent), so a collision
+	 * here signals a cxl-side validation gap.
+	 */
+	rc = xa_insert(&(*group)->dc_extents, dc_extent->seq_num,
+		       dc_extent, GFP_KERNEL);
+	if (rc) {
+		dev_WARN_ONCE(&cxlr_dax->dev, rc == -EBUSY,
+			"duplicate seq_num %u in tag %pUb\n",
+			dc_extent->seq_num, &dc_extent->uuid);
+		kfree(dc_extent);
+		return rc;
+	}
+
+	return 1;
+}
+
+/*
+ * Returns 1 for a successfully added extent, 0 for a duplicate extent,
+ * and <0 on error
+ */
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
+		   u16 seq_num)
+{
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_dax_region *cxlr_dax;
+	struct dc_extent *dc_extent;
+	struct range ext_range;
+	int rc;
+
+	guard(rwsem_read)(&cxl_rwsem.region);
+
+	rc = cxl_validate_extent(mds, extent, &cxled, &cxlr_dax, &ext_range);
+	if (rc)
+		return rc;
+
+	switch (cxlr_dax_classify_extent(cxlr_dax, cxled, &ext_range)) {
+	case CXL_EXT_DUPLICATE:
+		/*
+		 * Idempotent accept simplifies the dax-side scan for existing
+		 * extents on region creation; reply success without duplicating.
+		 */
+		dev_warn_ratelimited(&cxled->cxld.dev,
+				     "Extent %pra exists; accept again\n",
+				     &ext_range);
+		return 0;
+	case CXL_EXT_OVERLAP:
+		return -ENXIO;
+	case CXL_EXT_NEW:
+		break;
+	}
+
+	dc_extent = dc_extent_build(cxled, cxlr_dax, extent, &ext_range,
+				    seq_num);
+	if (IS_ERR(dc_extent))
+		return PTR_ERR(dc_extent);
+
+	dev_dbg(&cxled->cxld.dev, "Add extent %pra (%pU)\n",
+		&dc_extent->dpa_range, &dc_extent->uuid);
+
+	/* Returns 1 on success, <0 on error. */
+	return cxlr_add_extent(mds, cxlr_dax, dc_extent);
+}
+
+static void dc_extent_unregister(void *ext)
+{
+	struct dc_extent *dc_extent = ext;
+
+	dev_dbg(&dc_extent->dev, "DAX region rm extent HPA %pra\n",
+		&dc_extent->hpa_range);
+	device_unregister(&dc_extent->dev);
+}
+
+static void cleanup_pending_dc_extent(struct dc_extent *dc_extent)
+{
+	struct cxl_dc_tag_group *group = dc_extent->group;
+
+	if (!group->skip_device_release)
+		cxled_release_extent(dc_extent->cxled, dc_extent);
+	xa_erase(&group->dc_extents, dc_extent->seq_num);
+	group->nr_extents--;
+	if (!group->nr_extents)
+		free_tag_group(group);
+	kfree(dc_extent);
+}
+
+int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release)
+{
+	struct cxl_dax_region *cxlr_dax = group->cxlr_dax;
+	struct dc_extent *dc_extent;
+	unsigned long index;
+	int rc = 0;
+
+	/*
+	 * Seed nr_extents with the full group size plus a +1 pin held by
+	 * this function.  The size counts every dc_extent that might
+	 * decrement nr_extents on cleanup; the pin keeps @group alive
+	 * across the body even if every dc_extent release fires inside
+	 * the loop (e.g. devm_add_action_or_reset failure on the only
+	 * pending extent).  The pin is dropped at the end of the function.
+	 */
+	xa_for_each(&group->dc_extents, index, dc_extent)
+		group->nr_extents++;
+	group->nr_extents++;
+
+	xa_for_each(&group->dc_extents, index, dc_extent) {
+		struct device *dev = &dc_extent->dev;
+		u32 id;
+
+		device_initialize(dev);
+		device_set_pm_not_required(dev);
+		dev->parent = &cxlr_dax->dev;
+		dev->type = &dc_extent_type;
+
+		rc = xa_alloc(&cxlr_dax->dc_extents, &id, dc_extent,
+			      xa_limit_32b, GFP_KERNEL);
+		/*
+		 * put_device() fires dc_extent_release().  On xa_alloc failure
+		 * dev->id is still its invalid init value (0), but the xarray is
+		 * declared XA_FLAGS_ALLOC1 so 0 is never a valid id and erasing
+		 * it cannot remove another extent.
+		 */
+		if (rc < 0) {
+			put_device(dev);
+			break;
+		}
+		dev->id = id;
+
+		rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id,
+				  dev->id);
+		if (rc) {
+			xa_erase(&cxlr_dax->dc_extents, dev->id);
+			put_device(dev);
+			break;
+		}
+
+		rc = device_add(dev);
+		if (rc) {
+			xa_erase(&cxlr_dax->dc_extents, dev->id);
+			put_device(dev);
+			break;
+		}
+
+		dev_dbg(dev, "dc_extent HPA %pra (%pU)\n",
+			&dc_extent->hpa_range, &group->uuid);
+
+		rc = devm_add_action_or_reset(&cxlr_dax->dev,
+					      dc_extent_unregister, dc_extent);
+		if (rc)
+			break;
+	}
+
+	if (rc) {
+		/*
+		 * The group failed to online atomically, so none of it is
+		 * reported accepted in the Add-DC-Response.  When @skip_release
+		 * is set these extents were never accepted by this host (a
+		 * fresh Add-Capacity group), so suppress the per-extent Release
+		 * DC the teardown would otherwise emit.  Recovered extents
+		 * (already accepted on the device) leave it clear so the device
+		 * reclaims capacity the host cannot use.
+		 */
+		if (skip_release)
+			group->skip_device_release = true;
+
+		/*
+		 * Unwind every remaining dc_extent in the group.  The pin
+		 * above keeps @group alive across this walk.  Distinguish
+		 * onlined dc_extents (have a devm action) from pending ones
+		 * via devm_remove_action_nowarn(): a 0 return means the
+		 * action was installed and is now consumed, so we run the
+		 * unregister ourselves; -ENOENT means pending.
+		 */
+		xa_for_each(&group->dc_extents, index, dc_extent) {
+			int r = devm_remove_action_nowarn(&cxlr_dax->dev,
+							  dc_extent_unregister,
+							  dc_extent);
+			if (r == 0)
+				dc_extent_unregister(dc_extent);
+			else
+				cleanup_pending_dc_extent(dc_extent);
+		}
+	}
+
+	/* Drop the pin; if nothing else still references @group, free it. */
+	group->nr_extents--;
+	if (!group->nr_extents)
+		free_tag_group(group);
+	return rc;
+}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 4e887b5cdc3e..08f51b8807c0 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -6,6 +6,7 @@
 #include <linux/mutex.h>
 #include <linux/unaligned.h>
 #include <linux/list.h>
+#include <linux/list_sort.h>
 #include <cxlpci.h>
 #include <cxlmem.h>
 #include <cxl.h>
@@ -1180,7 +1181,7 @@ static void delete_extent_node(struct cxl_extent_list_node *node)
 	kfree(node);
 }
 
-static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
 {
 	struct device *dev = mds->cxlds.dev;
 	struct cxl_extent extent = {
@@ -1295,11 +1296,159 @@ static int add_to_pending_list(struct list_head *pending_list,
 }
 
 /*
- * Stub: stage extents on the pending list and reply with an empty
- * ADD_DC_RESPONSE on More=0 (refuse all).  A later commit replaces
- * the no-op tail with the real Add pipeline that surfaces a dax
- * device per accepted extent.
+ * Compare two extents by shared_extn_seq (ascending).  list_sort is
+ * stable, so extents with equal keys keep their arrival order from
+ * add_to_pending_list()'s list_add_tail().
  */
+static int extent_seq_compare(void *priv,
+			      const struct list_head *a,
+			      const struct list_head *b)
+{
+	const struct cxl_extent_list_node *ea =
+		list_entry(a, struct cxl_extent_list_node, list);
+	const struct cxl_extent_list_node *eb =
+		list_entry(b, struct cxl_extent_list_node, list);
+	u16 sa = le16_to_cpu(ea->extent->shared_extn_seq);
+	u16 sb = le16_to_cpu(eb->extent->shared_extn_seq);
+
+	if (sa < sb)
+		return -1;
+	if (sa > sb)
+		return 1;
+	return 0;
+}
+
+/*
+ * Move every pending extent whose tag matches @tag onto @group, preserving
+ * the order they appear in @pending.
+ */
+static void extract_tag_group(struct list_head *pending,
+			      const uuid_t *tag,
+			      struct list_head *group)
+{
+	struct cxl_extent_list_node *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, pending, list) {
+		uuid_t t;
+
+		import_uuid(&t, pos->extent->uuid);
+		if (uuid_equal(&t, tag))
+			list_move_tail(&pos->list, group);
+	}
+}
+
+/* Drop every node in @group, releasing its extent. */
+static void drop_extent_group(struct list_head *group)
+{
+	struct cxl_extent_list_node *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, group, list)
+		delete_extent_node(pos);
+}
+
+/*
+ * Realize a tag @group: add each extent via cxl_add_extent(), then online
+ * the resulting tag group.  Returns the number of accepted extents (>= 0)
+ * with @group left holding them for the caller to splice, or a negative
+ * errno on failure with @group untouched for the caller to drop.
+ */
+static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
+			     struct list_head *group, bool existing)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_extent_list_node *pos, *tmp;
+	struct cxl_dc_tag_group *tag_group;
+	int group_cnt = 0;
+	int rc;
+
+	list_for_each_entry_safe(pos, tmp, group, list) {
+		/*
+		 * Pass the device-stamped 0-based shared_extn_seq through
+		 * unchanged as the dax-side @seq_num (0..n-1).
+		 */
+		u16 seq = le16_to_cpu(pos->extent->shared_extn_seq);
+
+		if (cxl_add_extent(mds, pos->extent, seq) < 0) {
+			dev_dbg(dev,
+				"Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
+				tag,
+				le64_to_cpu(pos->extent->start_dpa),
+				le64_to_cpu(pos->extent->length));
+			delete_extent_node(pos);
+			continue;
+		}
+		group_cnt++;
+	}
+
+	tag_group = mds->add_ctx.group;
+	mds->add_ctx.group = NULL;
+	if (!tag_group)
+		return -ENXIO;
+
+	rc = online_tag_group(tag_group, !existing);
+	if (rc) {
+		dev_warn(dev, "Tag %pUb: failed to online tag group (%d)\n",
+			 tag, rc);
+		return rc;
+	}
+
+	return group_cnt;
+}
+
+/*
+ * Drive the pending Add-Capacity records through cxl_realize_group(),
+ * grouped by tag.  Per group: extract from pending, stable-sort by
+ * shared_extn_seq, realize the group, and on success move it onto the
+ * accepted list.  Validation gates layer onto this loop in later commits.
+ */
+static int cxl_add_pending(struct cxl_memdev_state *mds, bool existing)
+{
+	struct list_head *pending = &mds->add_ctx.pending_extents;
+	LIST_HEAD(accepted);
+	int total_accepted = 0;
+
+	while (!list_empty(pending)) {
+		LIST_HEAD(group);
+		uuid_t tag;
+		int cnt;
+
+		import_uuid(&tag,
+			list_first_entry(pending,
+					 struct cxl_extent_list_node,
+					 list)->extent->uuid);
+		extract_tag_group(pending, &tag, &group);
+
+		/*
+		 * Only a sharable allocation carries a meaningful per-extent
+		 * shared_extn_seq; order those by it.  For non-sharable groups,
+		 * the stable sort maintains arrival order.
+		 */
+		list_sort(NULL, &group, extent_seq_compare);
+
+		cnt = cxl_realize_group(mds, &tag, &group, existing);
+		if (cnt < 0) {
+			drop_extent_group(&group);
+			continue;
+		}
+
+		list_splice_tail_init(&group, &accepted);
+		total_accepted += cnt;
+	}
+
+	list_splice(&accepted, pending);
+
+	/*
+	 * Recovered (already-accepted) extents must not be re-reported in an
+	 * Add-DC-Response: the device rejects a DPA range already added by a
+	 * previous response (CXL r4.0 8.2.10.9.9.3, Invalid Physical Address).
+	 */
+	if (existing)
+		return 0;
+
+	return cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+				    pending, total_accepted);
+}
+
 static int handle_add_event(struct cxl_memdev_state *mds,
 			    struct cxl_event_dcd *event)
 {
@@ -1337,8 +1486,8 @@ static int handle_add_event(struct cxl_memdev_state *mds,
 	ctx->armed = false;
 	cancel_delayed_work(&ctx->timeout_work);
 
-	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
-				  &mds->add_ctx.pending_extents, 0);
+	/* Fresh add events: extents are not yet accepted (not existing). */
+	rc = cxl_add_pending(mds, false);
 	clear_pending_extents(mds);
 	return rc;
 }
diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
index 3865961c4301..70b086d50451 100644
--- a/drivers/cxl/core/region_dax.c
+++ b/drivers/cxl/core/region_dax.c
@@ -13,6 +13,7 @@ static void cxl_dax_region_release(struct device *dev)
 {
 	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
 
+	xa_destroy(&cxlr_dax->dc_extents);
 	kfree(cxlr_dax);
 }
 
@@ -57,11 +58,13 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
 	if (!cxlr_dax)
 		return ERR_PTR(-ENOMEM);
 
+	xa_init_flags(&cxlr_dax->dc_extents, XA_FLAGS_ALLOC1);
 	cxlr_dax->hpa_range.start = p->res->start;
 	cxlr_dax->hpa_range.end = p->res->end;
 
 	dev = &cxlr_dax->dev;
 	cxlr_dax->cxlr = cxlr;
+	cxlr->cxlr_dax = cxlr_dax;
 	device_initialize(dev);
 	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
 	device_set_pm_not_required(dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 367c3d9e2147..aae7eecd191a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -495,6 +495,7 @@ struct cxl_region_params {
  * @type: Endpoint decoder target type
  * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
  * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
  * @flags: Region state flags
  * @params: active + config params for the region
  * @coord: QoS access coordinates for the region
@@ -510,6 +511,7 @@ struct cxl_region {
 	enum cxl_decoder_type type;
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;
+	struct cxl_dax_region *cxlr_dax;
 	unsigned long flags;
 	struct cxl_region_params params;
 	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
@@ -568,6 +570,15 @@ struct cxl_dax_region {
 	struct device dev;
 	struct cxl_region *cxlr;
 	struct range hpa_range;
+	/*
+	 * dc_extents is keyed by an allocator-assigned u32 (see
+	 * online_tag_group()).  Tag groups have no first-class identity in
+	 * this xarray; siblings within a tag find each other via
+	 * dc_extent->group.  Tag-uniqueness lookup is a linear xa_for_each
+	 * walk, adequate at the bounded per-region extent counts the
+	 * driver handles.
+	 */
+	struct xarray dc_extents;
 };
 
 /**
@@ -587,14 +598,28 @@ struct cxl_dax_region {
  *		allocations.
  * @nr_extents: live count of dc_extents in the group; the group is freed
  *		when the last dc_extent device is released.
+ * @skip_device_release: tear the group down without sending a Release DC
+ *		command to the device.  Set when rejecting a group whose
+ *		extents this host never accepted, so they are omitted from the
+ *		Add-DC-Response rather than released — a Release DC would tell
+ *		the device to free capacity it never handed us.
  */
 struct cxl_dc_tag_group {
 	struct cxl_dax_region *cxlr_dax;
 	uuid_t uuid;
 	struct xarray dc_extents;
 	unsigned int nr_extents;
+	bool skip_device_release;
 };
 
+bool is_dc_extent(struct device *dev);
+static inline struct dc_extent *to_dc_extent(struct device *dev)
+{
+	if (!is_dc_extent(dev))
+		return NULL;
+	return container_of(dev, struct dc_extent, dev);
+}
+
 /**
  * struct cxl_port - logical collection of upstream port devices and
  *		     downstream port devices to construct a CXL memory
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 2be1df80fcc9..8941cf187462 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -63,7 +63,10 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
 cxl_core-y += $(CXL_CORE_SRC)/pmu.o
 cxl_core-y += $(CXL_CORE_SRC)/cdat.o
 cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
-cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o $(CXL_CORE_SRC)/region_pmem.o $(CXL_CORE_SRC)/region_dax.o
+cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
+				 $(CXL_CORE_SRC)/region_pmem.o \
+				 $(CXL_CORE_SRC)/region_dax.o \
+				 $(CXL_CORE_SRC)/extent.o
 cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += $(CXL_CORE_SRC)/edac.o
-- 
2.43.0



* [PATCH v11 15/31] cxl/mem: Drop misaligned DCD extent groups
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (12 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:19   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 16/31] cxl/extent: Validate DC extent partition Anisa Su
                   ` (16 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Add an alignment gate to cxl_add_pending(): every extent in a tag group
must have its start_dpa and length aligned to the dax region's mapping
granularity.  A misaligned extent makes the resulting dax device unusable,
so drop the whole group rather than accept a partial allocation that would
surface a broken dax_resource.

Based on patches by John Groves.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: John Groves <John@Groves.net>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
[anisa: gate on the dax region's actual mapping alignment (PMD_SIZE)
	instead of a hardcoded SZ_2M]
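
For review, the all-or-nothing alignment gate can be modeled in userspace
roughly as below.  This is only a sketch under stated assumptions:
SKETCH_PMD_SIZE is assumed to be 2 MiB (the x86-64 value; PMD_SIZE is
arch-dependent), the fields are taken as already converted to host endian,
and sketch_extent, sketch_extent_aligned() and sketch_group_aligned() are
hypothetical names, not kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB, the x86-64 PMD_SIZE; the kernel value is per-arch. */
#define SKETCH_PMD_SIZE (UINT64_C(1) << 21)

/* Hypothetical stand-in for the host-endian fields of struct cxl_extent. */
struct sketch_extent {
	uint64_t start_dpa;
	uint64_t length;
};

/* Both the start and the length must be multiples of the mapping size. */
static bool sketch_extent_aligned(const struct sketch_extent *e)
{
	return (e->start_dpa % SKETCH_PMD_SIZE) == 0 &&
	       (e->length % SKETCH_PMD_SIZE) == 0;
}

/* All-or-nothing: one misaligned member rejects the whole group. */
static bool sketch_group_aligned(const struct sketch_extent *group, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!sketch_extent_aligned(&group[i]))
			return false;
	return true;
}
```

A group with one 4 KiB-long member fails as a whole in this model, which
mirrors the "drop the whole group" behavior described above.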
---
 drivers/cxl/core/mbox.c | 51 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 08f51b8807c0..14ba263044f0 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -7,6 +7,8 @@
 #include <linux/unaligned.h>
 #include <linux/list.h>
 #include <linux/list_sort.h>
+#include <linux/pgtable.h>
+#include <linux/sizes.h>
 #include <cxlpci.h>
 #include <cxlmem.h>
 #include <cxl.h>
@@ -1295,6 +1297,19 @@ static int add_to_pending_list(struct list_head *pending_list,
 	return 0;
 }
 
+/*
+ * Extents need to be aligned to the dax region's mapping granularity.
+ * Use PMD_SIZE, since cxl_dax_region_probe() calls alloc_dax_region with
+ * PMD_SIZE for the 'align' parameter.
+ */
+static bool cxl_extent_dcd_aligned(const struct cxl_extent *extent)
+{
+	u64 start = le64_to_cpu(extent->start_dpa);
+	u64 len = le64_to_cpu(extent->length);
+
+	return IS_ALIGNED(start, PMD_SIZE) && IS_ALIGNED(len, PMD_SIZE);
+}
+
 /*
  * Compare two extents by shared_extn_seq (ascending).  list_sort is
  * stable, so extents with equal keys keep their arrival order from
@@ -1395,11 +1410,38 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 	return group_cnt;
 }
 
+/*
+ * Validate a tag @group before realizing it.  Returns 0 if the group may be
+ * added, or a negative errno if it must be dropped.  Further gates layer in
+ * here in later commits.
+ */
+static int cxl_validate_group(struct cxl_memdev_state *mds, const uuid_t *tag,
+			      struct list_head *group)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_extent_list_node *pos;
+
+	/* Alignment gate — drop the group if any member fails */
+	list_for_each_entry(pos, group, list) {
+		if (!cxl_extent_dcd_aligned(pos->extent)) {
+			dev_warn(dev,
+				 "Tag %pUb: dropping group, extent DPA:%#llx LEN:%#llx not %#llx-aligned\n",
+				 tag,
+				 le64_to_cpu(pos->extent->start_dpa),
+				 le64_to_cpu(pos->extent->length),
+				 (u64)PMD_SIZE);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 /*
  * Drive the pending Add-Capacity records through cxl_realize_group(),
  * grouped by tag.  Per group: extract from pending, stable-sort by
- * shared_extn_seq, realize the group, and on success move it onto the
- * accepted list.  Validation gates layer onto this loop in later commits.
+ * shared_extn_seq, validate, realize the group, and on success move it onto
+ * the accepted list.
  */
 static int cxl_add_pending(struct cxl_memdev_state *mds, bool existing)
 {
@@ -1425,6 +1467,11 @@ static int cxl_add_pending(struct cxl_memdev_state *mds, bool existing)
 		 */
 		list_sort(NULL, &group, extent_seq_compare);
 
+		if (cxl_validate_group(mds, &tag, &group)) {
+			drop_extent_group(&group);
+			continue;
+		}
+
 		cnt = cxl_realize_group(mds, &tag, &group, existing);
 		if (cnt < 0) {
 			drop_extent_group(&group);
-- 
2.43.0



* [PATCH v11 16/31] cxl/extent: Validate DC extent partition
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (13 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:20   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 17/31] cxl/mem: Enforce tag-group semantics Anisa Su
                   ` (15 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Extend cxl_validate_extent(), the per-extent check of the add pipeline,
to check partition membership.

Resolve an extent's DPA to its containing DC partition.  Sharability is
a property of the partition (part->shareable), taken from its CDAT DSMAS
entry.

An extent from a sharable partition must carry a non-null tag, since hosts
sharing the allocation key on that tag.  A null tag there is a device
firmware bug; reject the extent.

shared_extn_seq numbering is validated in cxl_check_group_seq() once the
whole tag group is collected.

Based on patches by John Groves.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: John Groves <John@Groves.net>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. cxl_extent_dc_partition() declared static — it is only called
 from extent.c at this point.  A subsequent commit ("cxl/mem: Enforce
 tag-group semantics") drops static and adds the declaration to core.h
 when mbox.c starts calling it.
2. In cxl_validate_extent(), declare the local uuid as a struct
 (uuid_t uuid) and fill it via import_uuid(&uuid, extent->uuid) instead
 of casting (uuid_t *)extent->uuid.
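
For review, the per-extent partition rules can be modeled in userspace
roughly as below.  This is a sketch, not the kernel code: sketch_range,
sketch_part, sketch_find_part() and sketch_check_extent() are hypothetical
names, every partition in the model is assumed to be a DC partition (the
mode filter is omitted), and -1 stands in for -ENXIO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive range, modeled on the kernel's struct range. */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical partition: a DPA range plus its sharability attribute. */
struct sketch_part {
	struct sketch_range res;
	bool shareable;
};

/* Return the partition that fully contains @ext, or NULL if none does. */
static const struct sketch_part *
sketch_find_part(const struct sketch_part *parts, size_t n,
		 const struct sketch_range *ext)
{
	/* Reject inverted ranges first; the containment test could pass them. */
	if (ext->end < ext->start)
		return NULL;

	for (size_t i = 0; i < n; i++)
		if (parts[i].res.start <= ext->start &&
		    parts[i].res.end >= ext->end)
			return &parts[i];
	return NULL;
}

/* Return 0 if the extent passes the partition rules, -1 otherwise. */
static int sketch_check_extent(const struct sketch_part *parts, size_t n,
			       const struct sketch_range *ext,
			       bool tag_is_null, uint16_t seq)
{
	const struct sketch_part *part = sketch_find_part(parts, n, ext);

	if (!part)
		return -1;
	if (part->shareable)
		return tag_is_null ? -1 : 0;	/* sharable needs a tag */
	return seq ? -1 : 0;			/* seq is reserved (zero) */
}
```

A zero-length extent yields end = start - 1 in this model, so the
inverted-range guard is what keeps it from matching a partition.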
---
 drivers/cxl/core/extent.c | 85 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 82 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 6e67e787d14d..2e770c5279c2 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -76,11 +76,67 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
 	return no_free_ptr(group);
 }
 
+/*
+ * Find the DC (Dynamic Capacity) partition that fully contains @ext_range,
+ * or NULL if the extent falls outside every DC partition on this memdev.
+ * The returned pointer is owned by mds->cxlds.part[] and lives for the
+ * lifetime of the memdev.
+ */
+static const struct cxl_dpa_partition *
+cxl_extent_dc_partition(struct cxl_memdev_state *mds,
+			struct cxl_extent *extent,
+			struct range *ext_range)
+{
+	struct cxl_dev_state *cxlds = &mds->cxlds;
+	struct device *dev = mds->cxlds.dev;
+
+	/*
+	 * A device-side error could cause end < start, which range_contains()
+	 * would treat as contained in any partition.
+	 */
+	if (ext_range->end < ext_range->start) {
+		dev_err_ratelimited(dev,
+				    "DC extent DPA %pra (%pU) has invalid length (firmware bug)\n",
+				    ext_range, extent->uuid);
+		return NULL;
+	}
+
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		struct cxl_dpa_partition *part = &cxlds->part[i];
+		struct range partition_range = {
+			.start = part->res.start,
+			.end = part->res.end,
+		};
+
+		if (part->mode != CXL_PARTMODE_DYNAMIC_RAM_1)
+			continue;
+
+		if (range_contains(&partition_range, ext_range)) {
+			dev_dbg(dev, "DC extent DPA %pra (DCR:%pra)(%pU)\n",
+				ext_range, &partition_range, extent->uuid);
+			return part;
+		}
+	}
+
+	dev_err_ratelimited(dev,
+			    "DC extent DPA %pra (%pU) is not in a valid DC partition\n",
+			    ext_range, extent->uuid);
+	return NULL;
+}
+
 /*
  * Stage 1 of the add pipeline: pure, no allocation.  Resolve the extent
- * to its region/endpoint decoder and ext_range, and verify the range
- * fits in the resolved endpoint decoder's DPA resource.  Further
- * per-extent invariants layer into this function in subsequent commits.
+ * to its region/endpoint decoder and ext_range, and enforce every
+ * per-extent invariant the device must satisfy:
+ *
+ *   - DPA falls inside a Dynamic Capacity partition (cxl_extent_dc_partition).
+ *   - Sharability is a property of the partition (part->shareable), not of
+ *     the shared_extn_seq value: a sharable-partition extent must carry a
+ *     non-null tag, and a non-sharable-partition extent must leave
+ *     shared_extn_seq reserved (zero).  The dense 0..n-1 numbering within a
+ *     sharable tag group is validated separately (cxl_check_group_seq()).
+ *   - DPA resolves to an endpoint decoder attached to a region.
+ *   - The extent's range is fully contained in that ED's DPA resource.
  *
  * Caller must hold cxl_rwsem.region for read (cxl_dpa_to_region()).
  * On success, @out_cxled / @out_cxlr_dax / @out_ext_range carry the
@@ -94,6 +150,8 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
 {
 	u64 start_dpa = le64_to_cpu(extent->start_dpa);
 	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct device *dev = mds->cxlds.dev;
+	const struct cxl_dpa_partition *part;
 	struct cxl_endpoint_decoder *cxled;
 	struct cxl_region *cxlr;
 	struct range ext_range = (struct range) {
@@ -101,6 +159,27 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
 		.end = start_dpa + le64_to_cpu(extent->length) - 1,
 	};
 	struct range ed_range;
+	uuid_t uuid;
+
+	import_uuid(&uuid, extent->uuid);
+
+	part = cxl_extent_dc_partition(mds, extent, &ext_range);
+	if (!part)
+		return -ENXIO;
+
+	if (part->shareable) {
+		if (uuid_is_null(&uuid)) {
+			dev_err_ratelimited(dev,
+				"DC extent DPA %pra: sharable-partition extent has null tag (firmware bug)\n",
+				&ext_range);
+			return -ENXIO;
+		}
+	} else if (le16_to_cpu(extent->shared_extn_seq)) {
+		dev_err_ratelimited(dev,
+			"DC extent DPA %pra (%pU): non-sharable partition but shared_extn_seq=%u (firmware bug)\n",
+			&ext_range, &uuid, le16_to_cpu(extent->shared_extn_seq));
+		return -ENXIO;
+	}
 
 	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
 	if (!cxlr || !cxlr->cxlr_dax)
-- 
2.43.0



* [PATCH v11 17/31] cxl/mem: Enforce tag-group semantics
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (14 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 16/31] cxl/extent: Validate DC extent partition Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:24   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
                   ` (14 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

The previous commit fully fleshed out validation for individual extents.
This commit completes tag-group validation.

Add two group-level gates to cxl_add_pending() that
cxl_validate_extent()'s per-extent view can't see:

  - Sequence integrity (cxl_check_group_seq): sharability is taken from
    the group's DC partition attribute (cxl_group_is_shareable).

    For a sharable partition, extents have a shared_extn_seq number,
    which must be 0..n-1 for a tag. Any gap or duplicate value is a
    firmware bug.

    Non-sharable partitions leave the field reserved and are not checked.

  - cxl_check_group_partition: verify tagged allocations
    don't span DC partitions.

Each check drops the whole group on violation.

Tag uniqueness checks land in a subsequent commit.

Based on patches by John Groves.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: John Groves <John@Groves.net>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. Fix sequence number check

Previously, it was assumed that shared extent sequence nums
start at 1, which is wrong. The spec states "For extents
describing shareable regions this field shall be within the range
of 0 to n-1 where n is the number of extents, with each value
appearing only once." Fix to start at 0.

2. Use partition->shareable attribute (from DSMAS flags) as the
source of truth for extent shareability, not sequence num.

Not only was it assumed sequence nums start at 1 for shared extents,
it was also assumed that unshared extents have a sequence num of 0.
So seq num was used to determine if an extent was shareable, which is
wrong.
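
For review, the dense 0..n-1 rule from change 1 can be modeled in userspace
roughly as below.  This is a sketch of the invariant only, not the body of
cxl_check_group_seq(): sketch_seq_dense() is a hypothetical name, and the
input is assumed to be the group's shared_extn_seq values already sorted
ascending and converted to host endian.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * @seq holds the sorted sequence numbers of one tag group.  For a sorted
 * list, "each value in 0..n-1 appears exactly once" is equivalent to
 * "the value at position i is i", so one pass catches gaps, duplicates
 * and out-of-range values alike.
 */
static bool sketch_seq_dense(const uint16_t *seq, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (seq[i] != i)
			return false;
	return true;
}
```

Under this model a 1-based group (1, 2, 3) is rejected at the first
element, which is the case the earlier 1..n assumption got wrong.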
---
 drivers/cxl/core/core.h   |   4 +
 drivers/cxl/core/extent.c |   2 +-
 drivers/cxl/core/mbox.c   | 168 +++++++++++++++++++++++++++++++++-----
 3 files changed, 154 insertions(+), 20 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 2c1df75ebbc5..6ac68f46a18e 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -178,6 +178,10 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
 int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 					struct access_coordinate *c);
 void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
+const struct cxl_dpa_partition *
+cxl_extent_dc_partition(struct cxl_memdev_state *mds,
+			struct cxl_extent *extent,
+			struct range *ext_range);
 
 static inline struct device *port_to_host(struct cxl_port *port)
 {
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 2e770c5279c2..0ebb581ca833 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -82,7 +82,7 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
  * The returned pointer is owned by mds->cxlds.part[] and lives for the
  * lifetime of the memdev.
  */
-static const struct cxl_dpa_partition *
+const struct cxl_dpa_partition *
 cxl_extent_dc_partition(struct cxl_memdev_state *mds,
 			struct cxl_extent *extent,
 			struct range *ext_range)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 14ba263044f0..7967b0db2c51 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1362,28 +1362,148 @@ static void drop_extent_group(struct list_head *group)
 }
 
 /*
- * Realize a tag @group: add each extent via cxl_add_extent(), then online
- * the resulting tag group.  Returns the number of accepted extents (>= 0)
- * with @group left holding them for the caller to splice, or a negative
- * errno on failure with @group untouched for the caller to drop.
+ * Validate shared_extn_seq across a tag group from a sharable partition,
+ * already sorted ascending.  Per CXL r4.0 Table 8-230 the device stamps
+ * each extent of an n-extent sharable allocation with a unique value in
+ * 0..n-1, so the sorted group must be exactly 0, 1, ..., n-1.  A gap,
+ * duplicate, or out-of-range value is a device firmware bug.
+ *
+ * Non-sharable partitions leave shared_extn_seq reserved; sharability is
+ * determined by the partition, not the seq value, so there is nothing to
+ * validate here — the caller assigns assembly order by arrival.
+ */
+static int cxl_check_group_seq(struct device *dev,
+			       const uuid_t *tag,
+			       const struct list_head *group,
+			       bool shareable)
+{
+	struct cxl_extent_list_node *pos;
+	u16 expected = 0;
+
+	if (!shareable)
+		return 0;
+
+	list_for_each_entry(pos, group, list) {
+		u16 s = le16_to_cpu(pos->extent->shared_extn_seq);
+
+		if (s != expected) {
+			dev_warn(dev,
+				 "Tag %pUb: sharable shared_extn_seq must be dense 0..n-1: expected %u got %u (firmware bug)\n",
+				 tag, expected, s);
+			return -EINVAL;
+		}
+		expected++;
+	}
+	return 0;
+}
+
+/*
+ * A tag group's sharability is a property of the DC partition holding its
+ * extents (cxl_check_group_partition() separately enforces that the group
+ * does not span partitions).  Resolve it from the first extent; an empty
+ * group or an extent outside any DC partition is treated as non-sharable.
+ */
+static bool cxl_group_is_shareable(struct cxl_memdev_state *mds,
+				   const struct list_head *group)
+{
+	const struct cxl_dpa_partition *part;
+	struct cxl_extent_list_node *first;
+	struct cxl_extent *extent;
+	struct range ext_range;
+
+	if (list_empty(group))
+		return false;
+
+	first = list_first_entry(group, struct cxl_extent_list_node, list);
+	extent = first->extent;
+	ext_range = (struct range) {
+		.start = le64_to_cpu(extent->start_dpa),
+		.end = le64_to_cpu(extent->start_dpa) +
+			le64_to_cpu(extent->length) - 1,
+	};
+	part = cxl_extent_dc_partition(mds, extent, &ext_range);
+	return part && part->shareable;
+}
+
+/*
+ * For tagged groups, reject allocations that span DC partitions.  A tag
+ * is an allocation identity; the partition's CDAT DSMAS entry is what
+ * tells the host which attributes (sharable, writable, coherency)
+ * apply.  Untagged groups are skipped — the spec does not define a
+ * cross-chain identity for them.
+ */
+static int cxl_check_group_partition(struct cxl_memdev_state *mds,
+				     const uuid_t *tag,
+				     const struct list_head *group)
+{
+	struct device *dev = mds->cxlds.dev;
+	const struct cxl_dpa_partition *first_part = NULL;
+	u64 first_dpa = 0;
+	struct cxl_extent_list_node *pos;
+
+	if (uuid_is_null(tag) || list_empty(group))
+		return 0;
+
+	list_for_each_entry(pos, group, list) {
+		struct cxl_extent *extent = pos->extent;
+		struct range ext_range = (struct range) {
+			.start = le64_to_cpu(extent->start_dpa),
+			.end = le64_to_cpu(extent->start_dpa) +
+				le64_to_cpu(extent->length) - 1,
+		};
+		const struct cxl_dpa_partition *part;
+
+		part = cxl_extent_dc_partition(mds, extent, &ext_range);
+		if (!part)
+			return -ENXIO;
+
+		if (!first_part) {
+			first_part = part;
+			first_dpa = ext_range.start;
+			continue;
+		}
+
+		if (part != first_part) {
+			dev_warn(dev,
+				 "Tag %pUb: extents span DC partitions (DPA:%#llx and DPA:%#llx), firmware bug\n",
+				 tag, first_dpa, ext_range.start);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Realize a tag @group: assign each extent its dax-side @seq_num and add it
+ * via cxl_add_extent(), then online the resulting tag group.  Returns the
+ * number of accepted extents (>= 0) with @group left holding them for the
+ * caller to splice, or a negative errno on failure with @group untouched for
+ * the caller to drop.
+ *
+ * A shared extent carries the device-assigned shared_extn_seq (dense 0..n-1).
+ * Non-sharable groups have no meaningful per-extent sequence, so number them
+ * by arrival order.  The counter advances for every member so a failed add
+ * leaves a gap and the partial group is later refused rather than carved.
  */
 static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
-			     struct list_head *group, bool existing)
+			     struct list_head *group, bool shareable,
+			     bool existing)
 {
 	struct device *dev = mds->cxlds.dev;
 	struct cxl_extent_list_node *pos, *tmp;
 	struct cxl_dc_tag_group *tag_group;
 	int group_cnt = 0;
+	u16 seq_num;
 	int rc;
 
+	seq_num = -1;
 	list_for_each_entry_safe(pos, tmp, group, list) {
-		/*
-		 * Pass the device-stamped 0-based shared_extn_seq through
-		 * unchanged as the dax-side @seq_num (0..n-1).
-		 */
-		u16 seq = le16_to_cpu(pos->extent->shared_extn_seq);
+		if (shareable)
+			seq_num = le16_to_cpu(pos->extent->shared_extn_seq);
+		else
+			seq_num++;
 
-		if (cxl_add_extent(mds, pos->extent, seq) < 0) {
+		if (cxl_add_extent(mds, pos->extent, seq_num) < 0) {
 			dev_dbg(dev,
 				"Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
 				tag,
@@ -1412,15 +1532,22 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 
 /*
  * Validate a tag @group before realizing it.  Returns 0 if the group may be
- * added, or a negative errno if it must be dropped.  Further gates layer in
- * here in later commits.
+ * added, or a negative errno if it must be dropped.
  */
 static int cxl_validate_group(struct cxl_memdev_state *mds, const uuid_t *tag,
-			      struct list_head *group)
+			      struct list_head *group, bool shareable)
 {
 	struct device *dev = mds->cxlds.dev;
 	struct cxl_extent_list_node *pos;
 
+	/* Sequence-number integrity */
+	if (cxl_check_group_seq(dev, tag, group, shareable))
+		return -EINVAL;
+
+	/* Partition equality (skipped for null UUID) */
+	if (cxl_check_group_partition(mds, tag, group))
+		return -EINVAL;
+
 	/* Alignment gate — drop the group if any member fails */
 	list_for_each_entry(pos, group, list) {
 		if (!cxl_extent_dcd_aligned(pos->extent)) {
@@ -1439,9 +1566,10 @@ static int cxl_validate_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 
 /*
  * Drive the pending Add-Capacity records through cxl_realize_group(),
- * grouped by tag.  Per group: extract from pending, stable-sort by
- * shared_extn_seq, validate, realize the group, and on success move it onto
- * the accepted list.
+ * grouped by tag.  Per group: extract from pending; for a sharable partition
+ * stable-sort by the device's shared_extn_seq (non-sharable groups keep
+ * arrival order), validate, then realize the group, moving it onto the
+ * accepted list on success.
  */
 static int cxl_add_pending(struct cxl_memdev_state *mds, bool existing)
 {
@@ -1451,6 +1579,7 @@ static int cxl_add_pending(struct cxl_memdev_state *mds, bool existing)
 
 	while (!list_empty(pending)) {
 		LIST_HEAD(group);
+		bool shareable;
 		uuid_t tag;
 		int cnt;
 
@@ -1466,13 +1595,14 @@ static int cxl_add_pending(struct cxl_memdev_state *mds, bool existing)
 		 * the stable sort maintains arrival order.
 		 */
 		list_sort(NULL, &group, extent_seq_compare);
+		shareable = cxl_group_is_shareable(mds, &group);
 
-		if (cxl_validate_group(mds, &tag, &group)) {
+		if (cxl_validate_group(mds, &tag, &group, shareable)) {
 			drop_extent_group(&group);
 			continue;
 		}
 
-		cnt = cxl_realize_group(mds, &tag, &group, existing);
+		cnt = cxl_realize_group(mds, &tag, &group, shareable, existing);
 		if (cnt < 0) {
 			drop_extent_group(&group);
 			continue;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 18/31] cxl/extent: Handle DC Release Capacity events
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (15 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 17/31] cxl/mem: Enforce tag-group semantics Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:23   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Replace the no-op ack stub for cxl_rm_extent() with the real teardown:
resolve the released DPA range to its region and endpoint decoder,
locate the matching dc_extent in cxlr_dax->dc_extents (filtering by
cxled, range containment, and tag), and tear down the entire containing
tag group atomically through rm_tag_group().  Partial release is not
supported.

Invalidate CPU caches once before rm_tag_group(), then walk the
group's dc_extents and release each via the devm action installed
at online_tag_group() time.

When the released range maps to no region (host crashed before
persisting acceptance, region destruction raced device release, or the
device is confused) the host has nothing to drop, so reply via
memdev_release_extent() to keep the device's view consistent.

The same applies when the region exists but its cxlr_dax does not.
If the dax region is not set up, the host is not tracking the extent,
so the extent can be released back to the device.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. Check rc of cxl_region_invalidate_memregion()

If flushing fails, don't release; wait for the device to
retry.

2. Check if cxlr->cxlr_dax is null before rm_tag_group()

As in the !cxlr case, if cxlr->cxlr_dax is not set up, the host
is not tracking extents, so it's safe to reply to the device with
release.

3. core.h: move declaration for cxl_region_invalidate_memregion()
inside #ifdef CONFIG_CXL_REGION
---
 drivers/cxl/core/core.h   |   8 +++
 drivers/cxl/core/extent.c | 115 ++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/mbox.c   |  19 -------
 drivers/cxl/core/region.c |   2 +-
 4 files changed, 124 insertions(+), 20 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 6ac68f46a18e..bbbb86ababad 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -28,6 +28,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
 
 #ifdef CONFIG_CXL_REGION
 
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
+
 struct cxl_region_context {
 	struct cxl_endpoint_decoder *cxled;
 	struct range hpa_range;
@@ -65,6 +67,7 @@ int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
 
 int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
 		   u16 seq_num);
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
 int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release);
 #else
 static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
@@ -77,6 +80,11 @@ static inline int cxl_add_extent(struct cxl_memdev_state *mds,
 {
 	return 0;
 }
+static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
+				struct cxl_extent *extent)
+{
+	return 0;
+}
 static inline int online_tag_group(struct cxl_dc_tag_group *group,
 				   bool skip_release)
 {
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 0ebb581ca833..a590a89f3580 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -371,6 +371,121 @@ static void dc_extent_unregister(void *ext)
 	device_unregister(&dc_extent->dev);
 }
 
+static void rm_tag_group(struct cxl_dc_tag_group *group)
+{
+	struct device *region_dev = &group->cxlr_dax->dev;
+	struct dc_extent *dc_extent;
+	unsigned long index;
+
+	/*
+	 * Pin @group across the walk: each devm_release_action runs the
+	 * dc_extent_unregister action synchronously, which drops the last
+	 * reference on the dc_extent device and fires dc_extent_release.
+	 * The release decrements group->nr_extents and, on the final
+	 * decrement, frees @group.  Without the pin the next iteration's
+	 * xa_find_after() dereferences a freed xarray.
+	 */
+	group->nr_extents++;
+	xa_for_each(&group->dc_extents, index, dc_extent)
+		devm_release_action(region_dev, dc_extent_unregister, dc_extent);
+	group->nr_extents--;
+	if (!group->nr_extents)
+		free_tag_group(group);
+}
+
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+	u64 start_dpa = le64_to_cpu(extent->start_dpa);
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_dax_region *cxlr_dax;
+	struct cxl_dc_tag_group *group;
+	struct dc_extent *dc_extent;
+	struct cxl_region *cxlr;
+	struct range dpa_range;
+	unsigned long idx;
+	uuid_t tag;
+	int rc;
+
+	dpa_range = (struct range) {
+		.start = start_dpa,
+		.end = start_dpa + le64_to_cpu(extent->length) - 1,
+	};
+
+	guard(rwsem_read)(&cxl_rwsem.region);
+	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+	if (!cxlr) {
+		/*
+		 * No region can happen here for a few reasons:
+		 *
+		 * 1) Extents were accepted and the host crashed/rebooted
+		 *    leaving them in an accepted state.  On reboot the host
+		 *    has not yet created a region to own them.
+		 *
+		 * 2) Region destruction won the race with the device releasing
+		 *    all the extents.  Here the release will be a duplicate of
+		 *    the one sent via region destruction.
+		 *
+		 * 3) The device is confused and releasing extents for which no
+		 *    region ever existed.
+		 *
+		 * In all these cases make sure the device knows we are not
+		 * using this extent.
+		 */
+		memdev_release_extent(mds, &dpa_range);
+		return -ENXIO;
+	}
+
+	cxlr_dax = cxlr->cxlr_dax;
+	if (!cxlr_dax) {
+		/*
+		 * The region exists but its dax region is not set up, so the
+		 * host is not tracking this extent.  Tell the device it is not
+		 * in use, as in the no-region case above.
+		 */
+		memdev_release_extent(mds, &dpa_range);
+		return -ENXIO;
+	}
+
+	import_uuid(&tag, extent->uuid);
+
+	/*
+	 * Find the dc_extent whose DPA range covers the released range and
+	 * whose tag matches.  The release targets the entire containing
+	 * tag group atomically; partial release is not supported.
+	 */
+	group = NULL;
+	xa_for_each(&cxlr_dax->dc_extents, idx, dc_extent) {
+		if (dc_extent->cxled != cxled)
+			continue;
+		if (!range_contains(&dc_extent->dpa_range, &dpa_range))
+			continue;
+		if (!uuid_equal(&dc_extent->group->uuid, &tag))
+			continue;
+		group = dc_extent->group;
+		break;
+	}
+	if (!group) {
+		dev_err(&cxlr_dax->dev,
+			"release DPA %pra (%pU) matches no dc_extent\n",
+			&dpa_range, &tag);
+		return -EINVAL;
+	}
+
+	/*
+	 * Invalidate CPU caches for the region before releasing the capacity
+	 * back to the device so it cannot reassign the range while stale
+	 * cached data lingers.  On failure do not release: leave the tag
+	 * group intact and let the device retry.
+	 */
+	rc = cxl_region_invalidate_memregion(cxlr);
+	if (rc)
+		return rc;
+
+	rm_tag_group(group);
+	return 0;
+}
+
 static void cleanup_pending_dc_extent(struct dc_extent *dc_extent)
 {
 	struct cxl_dc_tag_group *group = dc_extent->group;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 7967b0db2c51..a072355f2f7c 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1669,25 +1669,6 @@ static int handle_add_event(struct cxl_memdev_state *mds,
 	return rc;
 }
 
-/*
- * Stub: ack the release back to the device so it knows we are not
- * using the range.  A later commit replaces this with the real
- * teardown that walks the region's tag group and tears down the
- * member dc_extent devices.
- */
-static int cxl_rm_extent(struct cxl_memdev_state *mds,
-			 struct cxl_extent *extent)
-{
-	u64 start_dpa = le64_to_cpu(extent->start_dpa);
-	struct range dpa_range = {
-		.start = start_dpa,
-		.end = start_dpa + le64_to_cpu(extent->length) - 1,
-	};
-
-	memdev_release_extent(mds, &dpa_range);
-	return 0;
-}
-
 static char *cxl_dcd_evt_type_str(u8 type)
 {
 	switch (type) {
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index f6e93bc59ae7..528f0b980f58 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -222,7 +222,7 @@ static struct cxl_region_ref *cxl_rr_load(struct cxl_port *port,
 	return xa_load(&port->regions, (unsigned long)cxlr);
 }
 
-static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 {
 	if (!cpu_cache_has_invalidate_memregion()) {
 		if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 19/31] cxl/extent: Enforce cross-region tag uniqueness
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (16 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:23   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
                   ` (12 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

The per-region scans in cxlr_add_extent() and uuid_claim_tagged() only
catch a tag re-appearing on the same cxlr_dax.  The orchestrator owns
tag allocation and is responsible for global uniqueness, but a buggy FM
(or firmware redelivering a tag for a previously-closed allocation)
can still hand the same uuid to extents on two different regions or
memdevs, and the per-region checks accept the second one — leaving
two independent cxl_dc_tag_group objects with the same uuid.

Add a host-wide registry of live tag groups with non-null uuids.
alloc_tag_group() inserts on success, free_tag_group() removes; both
skip the null-uuid case since the spec defines no cross-chain identity
for untagged allocations.

A second group with the same uuid is then rejected: cxl_validate_group()
consults the registry via cxl_tag_already_committed() and returns
-EEXIST before the group is realized, and cxl_tag_register() returns
-EBUSY as a backstop against a racing insert between validate and
realize.
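
The registry semantics can be modeled in user space as below. This is a
single-threaded sketch under stated assumptions: struct demo_group,
tag_register(), and tag_unregister() are hypothetical names, a plain
singly linked list replaces the kernel list, and the mutex the patch
takes around every access is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TAG_LEN 16

/* Hypothetical stand-in for a tag group; only the uuid matters here. */
struct demo_group {
	unsigned char uuid[TAG_LEN];
	struct demo_group *next;
};

static struct demo_group *registry;

static int tag_is_null(const unsigned char *uuid)
{
	static const unsigned char zero[TAG_LEN];

	return memcmp(uuid, zero, TAG_LEN) == 0;
}

/* Insert @grp unless a live group already holds the same non-null uuid. */
static int tag_register(struct demo_group *grp)
{
	assert(grp);

	if (tag_is_null(grp->uuid))
		return 0;	/* untagged groups are never tracked */

	for (struct demo_group *g = registry; g; g = g->next)
		if (memcmp(g->uuid, grp->uuid, TAG_LEN) == 0)
			return -EBUSY;

	grp->next = registry;
	registry = grp;
	return 0;
}

/* Remove @grp so its uuid may be used by a later allocation. */
static void tag_unregister(struct demo_group *grp)
{
	assert(grp);

	if (tag_is_null(grp->uuid))
		return;

	for (struct demo_group **pp = &registry; *pp; pp = &(*pp)->next) {
		if (*pp == grp) {
			*pp = grp->next;
			grp->next = NULL;
			return;
		}
	}
}
```

Once the first holder of a uuid is unregistered, a later group with the
same uuid registers successfully, matching the "live tag groups" wording
above.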

No exit hook is needed: cxl_core only unloads after every dependent
module has, by which point every live tag group has been freed and
the registry is empty.

Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/core.h   |  5 ++++
 drivers/cxl/core/extent.c | 59 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/mbox.c   | 16 +++++++++++
 drivers/cxl/cxl.h         |  3 ++
 4 files changed, 83 insertions(+)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index bbbb86ababad..ab75cc67c24d 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -67,6 +67,7 @@ int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
 
 int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
 		   u16 seq_num);
+bool cxl_tag_already_committed(const uuid_t *tag);
 int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
 int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release);
 #else
@@ -90,6 +91,10 @@ static inline int online_tag_group(struct cxl_dc_tag_group *group,
 {
 	return 0;
 }
+static inline bool cxl_tag_already_committed(const uuid_t *tag)
+{
+	return false;
+}
 static inline
 struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
 				     struct cxl_endpoint_decoder **cxled)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index a590a89f3580..36be56ca1097 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -18,8 +18,60 @@ static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
 	memdev_release_extent(mds, &dc_extent->dpa_range);
 }
 
+/*
+ * Host-wide registry of live tag groups with non-null uuids.  Enforces
+ * that within this host, a tag uuid identifies exactly one allocation
+ * across all regions and memdevs — closing the gap left by the
+ * per-region scans in cxlr_add_extent() and uuid_claim_tagged().  The
+ * orchestrator (FM) owns tag-uuid allocation per spec; this is a
+ * defense against firmware bugs and orchestrator misbehavior.  Untagged
+ * (null uuid) allocations are not tracked: the spec defines no
+ * cross-chain identity for them.
+ */
+static DEFINE_MUTEX(cxl_tag_lock);
+static LIST_HEAD(cxl_tag_groups);
+
+static int cxl_tag_register(struct cxl_dc_tag_group *grp)
+{
+	struct cxl_dc_tag_group *g;
+
+	if (uuid_is_null(&grp->uuid))
+		return 0;
+
+	guard(mutex)(&cxl_tag_lock);
+	list_for_each_entry(g, &cxl_tag_groups, registry_node)
+		if (uuid_equal(&g->uuid, &grp->uuid))
+			return -EBUSY;
+	list_add_tail(&grp->registry_node, &cxl_tag_groups);
+	return 0;
+}
+
+static void cxl_tag_unregister(struct cxl_dc_tag_group *grp)
+{
+	if (uuid_is_null(&grp->uuid))
+		return;
+
+	guard(mutex)(&cxl_tag_lock);
+	list_del(&grp->registry_node);
+}
+
+bool cxl_tag_already_committed(const uuid_t *tag)
+{
+	struct cxl_dc_tag_group *g;
+
+	if (uuid_is_null(tag))
+		return false;
+
+	guard(mutex)(&cxl_tag_lock);
+	list_for_each_entry(g, &cxl_tag_groups, registry_node)
+		if (uuid_equal(&g->uuid, tag))
+			return true;
+	return false;
+}
+
 static void free_tag_group(struct cxl_dc_tag_group *group)
 {
+	cxl_tag_unregister(group);
 	xa_destroy(&group->dc_extents);
 	/* Drop the pin taken in alloc_tag_group(). */
 	put_device(&group->cxlr_dax->dev);
@@ -60,12 +112,19 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
 {
 	struct cxl_dc_tag_group *group __free(kfree) =
 				kzalloc(sizeof(*group), GFP_KERNEL);
+	int rc;
+
 	if (!group)
 		return ERR_PTR(-ENOMEM);
 
 	group->cxlr_dax = cxlr_dax;
 	uuid_copy(&group->uuid, uuid);
 	xa_init(&group->dc_extents);
+	INIT_LIST_HEAD(&group->registry_node);
+
+	rc = cxl_tag_register(group);
+	if (rc)
+		return ERR_PTR(rc);
 
 	/*
 	 * Pin cxlr_dax: it is used after cxl_rwsem.region is dropped, so a
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index a072355f2f7c..0e6d6ad0390b 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1540,6 +1540,22 @@ static int cxl_validate_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 	struct device *dev = mds->cxlds.dev;
 	struct cxl_extent_list_node *pos;
 
+	/*
+	 * Cross-More-chain uniqueness.  A non-null tag seen in this group must
+	 * not already correspond to a committed tag group anywhere on this
+	 * host.  More=0 was supposed to close that allocation, and tag uuids
+	 * must be unique across all regions and memdevs (the orchestrator owns
+	 * assignment per spec).  Either constraint failing — same chain
+	 * redelivered, or two distinct allocations colliding on the same uuid —
+	 * is a firmware/orchestrator bug; reject the whole group.
+	 */
+	if (cxl_tag_already_committed(tag)) {
+		dev_warn(dev,
+			 "Tag %pUb: dropping group, tag already committed (firmware/orchestrator bug)\n",
+			 tag);
+		return -EEXIST;
+	}
+
 	/* Sequence-number integrity */
 	if (cxl_check_group_seq(dev, tag, group, shareable))
 		return -EINVAL;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index aae7eecd191a..e82d8bf1388b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -598,6 +598,8 @@ struct cxl_dax_region {
  *		allocations.
  * @nr_extents: live count of dc_extents in the group; the group is freed
  *		when the last dc_extent device is released.
+ * @registry_node: anchor in the host-wide non-null-tag registry that
+ *		enforces tag uuid uniqueness across all regions and memdevs.
  * @skip_device_release: tear the group down without sending a Release DC
  *		command to the device.  Set when rejecting a group whose
  *		extents this host never accepted, so they are omitted from the
@@ -609,6 +611,7 @@ struct cxl_dc_tag_group {
 	uuid_t uuid;
 	struct xarray dc_extents;
 	unsigned int nr_extents;
+	struct list_head registry_node;
 	bool skip_device_release;
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 20/31] cxl/region/extent: Expose dc_extent information in sysfs
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (17 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:33   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
                   ` (11 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su, Fan Ni

From: Ira Weiny <iweiny@kernel.org>

Extent information can be helpful to the user to coordinate memory
usage with the external orchestrator and FM.

Expose the details of each dc_extent by creating the following sysfs
entries.

	/sys/bus/cxl/devices/dax_regionX/extentX.Y
	/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
	/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
	/sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid

Each dc_extent surfaces as its own extentX.Y device under the parent
dax_region.  offset and length describe that dc_extent's HPA range,
not an aggregate bounding box across the containing tagged
allocation — so when a tagged allocation has multiple
DPA-discontiguous extents, each is reported with its own offset and
length.  uuid is the tag identifying the containing allocation; it
is shared across dc_extents that belong to the same tagged
allocation and is hidden for untagged extents.
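
For reference, a user-space consumer could read these attributes roughly
as follows. This is a sketch only: read_extent_attr() is a hypothetical
helper, and the directory passed in (for example some
/sys/bus/cxl/devices/dax_regionX/extentX.Y path) is supplied by the
caller. A missing uuid file is reported as an ordinary failure, which is
what an untagged extent would produce.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one attribute file under @dir into @buf and strip the trailing
 * newline.  Returns 0 on success, -1 if the file is missing or
 * unreadable (e.g. uuid on an untagged extent).
 */
static int read_extent_attr(const char *dir, const char *attr,
			    char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(dir && attr && buf && len > 0);

	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A caller would invoke it once each for "offset", "length", and "uuid",
treating a failure on "uuid" as "this extent is untagged".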

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Fan Ni <nifan.cxl@gmail.com>
Tested-by: Fan Ni <nifan.cxl@gmail.com>

---
Changes:
1. Bump kver to 7.3 and date to June 2026
---
 Documentation/ABI/testing/sysfs-bus-cxl | 36 ++++++++++++++++
 drivers/cxl/core/extent.c               | 56 +++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 00b98bbe0ff3..2a3817cc6eef 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -661,3 +661,39 @@ Description:
 		The count is persistent across power loss and wraps back to 0
 		upon overflow. If this file is not present, the device does not
 		have the necessary support for dirty tracking.
+
+
+What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
+Date:		June, 2026
+KernelVersion:	v7.3
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) [For Dynamic Capacity regions only] Users can use the
+		extent information to create DAX devices on specific extents.
+		This is done by creating and destroying DAX devices in specific
+		sequences and looking at the mappings created.  Extent offset
+		within the region.
+
+
+What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
+Date:		June, 2026
+KernelVersion:	v7.3
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) [For Dynamic Capacity regions only] Users can use the
+		extent information to create DAX devices on specific extents.
+		This is done by creating and destroying DAX devices in specific
+		sequences and looking at the mappings created.  Extent length
+		within the region.
+
+
+What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
+Date:		June, 2026
+KernelVersion:	v7.3
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) [For Dynamic Capacity regions only] Users can use the
+		extent information to create DAX devices on specific extents.
+		This is done by creating and destroying DAX devices in specific
+		sequences and looking at the mappings created.  UUID of this
+		extent.
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 36be56ca1097..69c993cdd558 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -6,6 +6,61 @@
 
 #include "core.h"
 
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct dc_extent *dc_extent = to_dc_extent(dev);
+
+	return sysfs_emit(buf, "%#llx\n", dc_extent->hpa_range.start);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct dc_extent *dc_extent = to_dc_extent(dev);
+	u64 length = range_len(&dc_extent->hpa_range);
+
+	return sysfs_emit(buf, "%#llx\n", length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct dc_extent *dc_extent = to_dc_extent(dev);
+
+	return sysfs_emit(buf, "%pUb\n", &dc_extent->group->uuid);
+}
+static DEVICE_ATTR_RO(uuid);
+
+static struct attribute *dc_extent_attrs[] = {
+	&dev_attr_offset.attr,
+	&dev_attr_length.attr,
+	&dev_attr_uuid.attr,
+	NULL
+};
+
+static umode_t dc_extent_visible(struct kobject *kobj,
+				 struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct dc_extent *dc_extent = to_dc_extent(dev);
+
+	if (a == &dev_attr_uuid.attr &&
+	    uuid_is_null(&dc_extent->group->uuid))
+		return 0;
+
+	return a->mode;
+}
+
+static const struct attribute_group dc_extent_attribute_group = {
+	.attrs = dc_extent_attrs,
+	.is_visible = dc_extent_visible,
+};
+
+__ATTRIBUTE_GROUPS(dc_extent_attribute);
+
 
 static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
 				 struct dc_extent *dc_extent)
@@ -99,6 +154,7 @@ static void dc_extent_release(struct device *dev)
 static const struct device_type dc_extent_type = {
 	.name = "extent",
 	.release = dc_extent_release,
+	.groups = dc_extent_attribute_groups,
 };
 
 bool is_dc_extent(struct device *dev)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v11 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (18 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:29   ` sashiko-bot
  2026-06-25 11:04 ` [PATCH v11 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
                   ` (10 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

When an extent is accepted or released, the CXL driver must notify
the DAX driver so the two can coordinate resource management. Add a
.notify callback to struct cxl_driver and implement it in the
cxl_dax_region driver to enable this coordination.
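
As a rough illustration of the dispatch described above, here is a
minimal userspace sketch, not the kernel code. All toy_* names are
hypothetical stand-ins, and the device lock taken by the real helper is
omitted. It assumes, as cxlr_notify_extent() does in this patch, that a
missing driver or a missing callback is reported as success.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for enum dc_event and struct cxl_notify_data. */
enum toy_event { TOY_ADD, TOY_RELEASE };

struct toy_notify_data {
	enum toy_event event;
	int group_id;
};

/* A driver may or may not implement the optional notify callback. */
struct toy_driver {
	int (*notify)(void *ctx, struct toy_notify_data *data);
};

struct toy_device {
	struct toy_driver *driver;	/* NULL when no driver is bound */
	void *ctx;			/* driver-private data */
};

/* Deliver an event; an unbound device or a missing callback is a no-op. */
static int toy_notify(struct toy_device *dev, enum toy_event event,
		      int group_id)
{
	struct toy_notify_data data = {
		.event = event,
		.group_id = group_id,
	};

	assert(dev);
	if (!dev->driver || !dev->driver->notify)
		return 0;
	return dev->driver->notify(dev->ctx, &data);
}

/* Example callback: count add events, reject everything else. */
static int toy_count_adds(void *ctx, struct toy_notify_data *data)
{
	int *adds = ctx;

	if (data->event != TOY_ADD)
		return -1;
	(*adds)++;
	return 0;
}
```

The caller decides what a non-zero return means; in this series the add
path tears the tag group down when the callback fails.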

Define struct dax_resource, a sub-resource of a DC dax_region
representing the capacity of one dc_extent.

When the cxl side onlines a tag group during a DC Add event, notify the
DAX region to register a struct dax_resource for each extent.  Surface
the group atomically: dax_region_add_resources() registers every
extent's dax_resource under one dax_region_rwsem and rolls the whole
set back on any failure, so a partial group (or one racing a concurrent
uuid_store claim) never lands in the region.
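
The all-or-none add can be sketched in plain C as follows. This is a
simplified userspace model, not the kernel implementation: toy_region,
toy_add_one(), toy_rm_one() and toy_add_all() are hypothetical names,
and the dax_region_rwsem that the real code holds across the whole loop
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Hypothetical region: one flag per slot, true once a slot is registered. */
struct toy_region {
	bool used[TOY_SLOTS];
};

/* Register one slot; fails if it is out of range or already taken. */
static int toy_add_one(struct toy_region *r, size_t slot)
{
	if (slot >= TOY_SLOTS || r->used[slot])
		return -1;
	r->used[slot] = true;
	return 0;
}

/* Undo a registration made by toy_add_one(). */
static void toy_rm_one(struct toy_region *r, size_t slot)
{
	assert(slot < TOY_SLOTS);
	r->used[slot] = false;
}

/*
 * Register every slot in @slots or none of them.  On the first failure,
 * walk back over the entries added by this call and undo each one, so the
 * region ends up exactly as it was on entry.
 */
static int toy_add_all(struct toy_region *r, const size_t *slots, size_t n)
{
	size_t i;
	int rc;

	for (i = 0; i < n; i++) {
		rc = toy_add_one(r, slots[i]);
		if (rc) {
			while (i-- > 0)
				toy_rm_one(r, slots[i]);
			return rc;
		}
	}
	return 0;
}
```

The rollback loop touches only the entries added by the failing call, so
a slot that was already registered before the call is left alone.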

The dax_resource model:

  * struct dax_resource (dax-private.h) — per-extent sub-resource of
    a DC dax_region: pointer back to its region, the kernel struct
    resource, the tag uuid, the per-allocation seq_num, and a use_cnt
    that lets a later commit refuse release of an in-use extent.
  * struct dev_dax_range gains a dax_resource back-pointer so a
    carved range remembers which extent it lives in.
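
A small userspace sketch of how the back-pointer and use_cnt might fit
together. The toy_* names are hypothetical. The claim step in particular
is an assumption, since this patch only shows the decrement in
trim_dev_dax_range() and the use_cnt check on removal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dax_resource. */
struct toy_extent {
	unsigned int use_cnt;	/* number of ranges carved from this extent */
};

/* Hypothetical stand-in for struct dev_dax_range. */
struct toy_range {
	struct toy_extent *extent;	/* NULL for a static region range */
};

/* Assumed claim step: remember the backing extent and count the use. */
static void toy_claim(struct toy_range *range, struct toy_extent *extent)
{
	range->extent = extent;
	if (extent)
		extent->use_cnt++;
}

/* Trim step: drop the use taken at claim time, if there was one. */
static void toy_trim(struct toy_range *range)
{
	if (range->extent) {
		assert(range->extent->use_cnt > 0);
		range->extent->use_cnt--;
	}
	range->extent = NULL;
}

/* An extent may be released only when no range still uses it. */
static int toy_can_release(const struct toy_extent *extent)
{
	return extent->use_cnt == 0;
}
```

A range with a NULL back-pointer behaves like a static range, and trimming
it never touches a counter.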

For now, dax_resources live under the dax_region and remain inaccessible
to DAX devices. A later commit adds support for specifying a tag when
creating a DAX device, which then allows dax_resources to be claimed by
tag.

Release is handled in the following commit.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. add the dax_resource set atomically via dax_region_add_resources()
   (all-or-none under dax_region_rwsem), instead of a per-extent walk
   that had to be removed again later.
---
 drivers/cxl/core/core.h   |  10 +++
 drivers/cxl/core/extent.c |  36 +++++++-
 drivers/cxl/core/mbox.c   |  18 ++++
 drivers/cxl/cxl.h         |   6 ++
 drivers/dax/bus.c         | 183 +++++++++++++++++++++++++++++++++++---
 drivers/dax/bus.h         |   3 +-
 drivers/dax/cxl.c         |  71 ++++++++++++++-
 drivers/dax/dax-private.h |  59 ++++++++++++
 drivers/dax/hmem/hmem.c   |   2 +-
 drivers/dax/pmem.c        |   2 +-
 10 files changed, 371 insertions(+), 19 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index ab75cc67c24d..1a2bc22ad3cc 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -70,6 +70,9 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
 bool cxl_tag_already_committed(const uuid_t *tag);
 int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
 int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release);
+void rm_tag_group(struct cxl_dc_tag_group *group);
+int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+		       struct cxl_dc_tag_group *group);
 #else
 static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 				 const struct cxl_memdev *cxlmd, u64 dpa)
@@ -95,6 +98,13 @@ static inline bool cxl_tag_already_committed(const uuid_t *tag)
 {
 	return false;
 }
+static inline void rm_tag_group(struct cxl_dc_tag_group *group) { }
+static inline int cxlr_notify_extent(struct cxl_region *cxlr,
+				     enum dc_event event,
+				     struct cxl_dc_tag_group *group)
+{
+	return 0;
+}
 static inline
 struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
 				     struct cxl_endpoint_decoder **cxled)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 69c993cdd558..59db1878b5e2 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -61,7 +61,6 @@ static const struct attribute_group dc_extent_attribute_group = {
 
 __ATTRIBUTE_GROUPS(dc_extent_attribute);
 
-
 static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
 				 struct dc_extent *dc_extent)
 {
@@ -142,7 +141,8 @@ static void dc_extent_release(struct device *dev)
 		return;
 
 	group = dc_extent->group;
-	cxled_release_extent(dc_extent->cxled, dc_extent);
+	if (!group->skip_device_release)
+		cxled_release_extent(dc_extent->cxled, dc_extent);
 	xa_erase(&group->cxlr_dax->dc_extents, dc_extent->dev.id);
 	xa_erase(&group->dc_extents, dc_extent->seq_num);
 	group->nr_extents--;
@@ -376,6 +376,36 @@ dc_extent_build(struct cxl_endpoint_decoder *cxled,
 	return dc_extent;
 }
 
+int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+		       struct cxl_dc_tag_group *group)
+{
+	struct device *dev = &cxlr->cxlr_dax->dev;
+	struct cxl_notify_data notify_data;
+	struct cxl_driver *driver;
+
+	dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid);
+
+	guard(device)(dev);
+
+	/*
+	 * Without a bound driver the notification cannot be delivered and no
+	 * user space coordination is possible.  Return success so the caller
+	 * proceeds as if the event had been handled.
+	 */
+	if (!dev->driver)
+		return 0;
+	driver = to_cxl_drv(dev->driver);
+	if (!driver->notify)
+		return 0;
+
+	notify_data = (struct cxl_notify_data) {
+		.event = event,
+		.group = group,
+	};
+
+	dev_dbg(dev, "Notify: type %d tag %pUb\n", event, &group->uuid);
+	return driver->notify(dev, &notify_data);
+}
+
 /*
  * Stage 4: insert @dc_extent into the pending tag group.  All extents in
  * one More-chain group share a UUID — enforced here as the group is
@@ -486,7 +516,7 @@ static void dc_extent_unregister(void *ext)
 	device_unregister(&dc_extent->dev);
 }
 
-static void rm_tag_group(struct cxl_dc_tag_group *group)
+void rm_tag_group(struct cxl_dc_tag_group *group)
 {
 	struct device *region_dev = &group->cxlr_dax->dev;
 	struct dc_extent *dc_extent;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 0e6d6ad0390b..79258681d428 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1527,6 +1527,24 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 		return rc;
 	}
 
+	rc = cxlr_notify_extent(tag_group->cxlr_dax->cxlr, DCD_ADD_CAPACITY,
+				tag_group);
+	if (rc) {
+		/*
+		 * The dax-side notification failed; tear down the tag group.
+		 * For a fresh add (!existing) the extents were never accepted —
+		 * they are omitted from the trailing Add-DC-Response — so
+		 * suppress the per-extent Release DC; the device never handed us
+		 * this capacity to release.  Recovered (existing) extents are
+		 * already accepted and cannot be re-notified, so release them
+		 * back to the device rather than leak the capacity.
+		 */
+		if (!existing)
+			tag_group->skip_device_release = true;
+		rm_tag_group(tag_group);
+		return rc;
+	}
+
 	return group_cnt;
 }
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index e82d8bf1388b..1bb861bb23fe 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -898,6 +898,11 @@ bool is_cxl_region(struct device *dev);
 
 extern const struct bus_type cxl_bus_type;
 
+struct cxl_notify_data {
+	enum dc_event event;
+	struct cxl_dc_tag_group *group;
+};
+
 /*
  * Note, add_dport() is expressly for the cxl_port driver. TODO: investigate a
  * type-safe driver model where probe()/remove() take the type of object implied
@@ -910,6 +915,7 @@ struct cxl_driver {
 	void (*remove)(struct device *dev);
 	struct cxl_dport *(*add_dport)(struct cxl_port *port,
 				       struct device *dport_dev);
+	int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
 	struct device_driver drv;
 	int id;
 };
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 7356aaaffe57..9b5c03616b83 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -182,6 +182,138 @@ static bool is_dynamic(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
 }
 
+static void __dax_release_resource(struct dax_resource *dax_resource)
+{
+	struct dax_region *dax_region = dax_resource->region;
+
+	lockdep_assert_held_write(&dax_region_rwsem);
+	dev_dbg(dax_region->dev, "Extent release resource %pr\n",
+		dax_resource->res);
+	if (dax_resource->res)
+		__release_region(&dax_region->res, dax_resource->res->start,
+				 resource_size(dax_resource->res));
+	dax_resource->res = NULL;
+}
+
+static void dax_release_resource(void *res)
+{
+	struct dax_resource *dax_resource = res;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+	__dax_release_resource(dax_resource);
+	kfree(dax_resource);
+}
+
+static int __dax_region_add_resource(struct dax_region *dax_region,
+				     struct device *device,
+				     resource_size_t start, resource_size_t length,
+				     const uuid_t *tag, u16 seq_num)
+{
+	struct dax_resource *dax_resource __free(kfree) =
+				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
+	struct resource *new_resource;
+	int rc;
+
+	lockdep_assert_held_write(&dax_region_rwsem);
+
+	if (!dax_resource)
+		return -ENOMEM;
+
+	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
+	if (!new_resource) {
+		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+			&start, &length);
+		return -ENOSPC;
+	}
+
+	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
+	dax_resource->region = dax_region;
+	dax_resource->res = new_resource;
+	dax_resource->seq_num = seq_num;
+	if (tag)
+		uuid_copy(&dax_resource->uuid, tag);
+
+	/*
+	 * open code devm_add_action_or_reset() to avoid recursive write lock
+	 * of dax_region_rwsem in the error case.
+	 */
+	rc = devm_add_action(device, dax_release_resource, dax_resource);
+	if (rc) {
+		__dax_release_resource(dax_resource);
+		return rc;
+	}
+
+	dev_set_drvdata(device, no_free_ptr(dax_resource));
+	return 0;
+}
+
+int dax_region_add_resource(struct dax_region *dax_region,
+			    struct device *device,
+			    resource_size_t start, resource_size_t length,
+			    const uuid_t *tag, u16 seq_num)
+{
+	guard(rwsem_write)(&dax_region_rwsem);
+	return __dax_region_add_resource(dax_region, device, start, length,
+					 tag, seq_num);
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resource);
+
+static int __dax_region_rm_resource(struct dax_region *dax_region,
+				    struct device *dev)
+{
+	struct dax_resource *dax_resource;
+
+	lockdep_assert_held_write(&dax_region_rwsem);
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource)
+		return 0;
+
+	if (dax_resource->use_cnt)
+		return -EBUSY;
+
+	/*
+	 * release the resource under dax_region_rwsem to avoid races with
+	 * users trying to use the extent
+	 */
+	__dax_release_resource(dax_resource);
+	dev_set_drvdata(dev, NULL);
+	return 0;
+}
+
+/**
+ * dax_region_add_resources - atomically add a set of dax_resources.
+ *
+ * Hold dax_region_rwsem across the whole set so the add cannot interleave
+ * with a concurrent claim (uuid_store) or removal.  On any failure, roll
+ * back the resources already added in this call, leaving the region
+ * unchanged.  Mirrors dax_region_rm_resources()'s all-or-none semantics.
+ */
+int dax_region_add_resources(struct dax_region *dax_region,
+			     const struct dax_resource_spec *specs,
+			     unsigned int n, const uuid_t *tag)
+{
+	unsigned int i;
+	int rc;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+
+	for (i = 0; i < n; i++) {
+		rc = __dax_region_add_resource(dax_region, specs[i].device,
+					       specs[i].start, specs[i].length,
+					       tag, specs[i].seq_num);
+		if (rc) {
+			while (i-- > 0)
+				__dax_region_rm_resource(dax_region,
+							 specs[i].device);
+			return rc;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resources);
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
@@ -300,14 +432,25 @@ static struct device_attribute dev_attr_region_align =
 
 static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 {
-	resource_size_t size = resource_size(&dax_region->res);
+	resource_size_t size;
 	struct resource *res;
 
 	lockdep_assert_held(&dax_region_rwsem);
 
-	if (is_dynamic(dax_region))
-		return 0;
+	if (is_dynamic(dax_region)) {
+		/*
+		 * Children of a dynamic region are extents, claimed
+		 * all-or-nothing: an extent's resource is either unclaimed (no
+		 * child) or fully consumed by exactly one dax device.
+		 */
+		size = 0;
+		for_each_dax_region_resource(dax_region, res)
+			if (!res->child)
+				size += resource_size(res);
+		return size;
+	}
 
+	size = resource_size(&dax_region->res);
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
 	return size;
@@ -448,15 +591,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
 static void trim_dev_dax_range(struct dev_dax *dev_dax)
 {
 	int i = dev_dax->nr_range - 1;
-	struct range *range = &dev_dax->ranges[i].range;
+	struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+	struct range *range = &dev_range->range;
 	struct dax_region *dax_region = dev_dax->region;
+	struct resource *res = &dax_region->res;
 
 	lockdep_assert_held_write(&dax_region_rwsem);
 	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
 		(unsigned long long)range->start,
 		(unsigned long long)range->end);
 
-	__release_region(&dax_region->res, range->start, range_len(range));
+	if (dev_range->dax_resource) {
+		res = dev_range->dax_resource->res;
+		dev_dbg(&dev_dax->dev, "Trim dc extent %pr\n", res);
+	}
+
+	__release_region(res, range->start, range_len(range));
+
+	if (dev_range->dax_resource)
+		dev_range->dax_resource->use_cnt--;
+
 	if (--dev_dax->nr_range == 0) {
 		kfree(dev_dax->ranges);
 		dev_dax->ranges = NULL;
@@ -640,11 +794,14 @@ static void dax_region_unregister(void *region)
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
-		unsigned long flags)
+		unsigned long flags, struct dax_dc_ops *dc_ops)
 {
 	struct dax_region *dax_region;
 	int rc;
 
+	if (!dc_ops && (flags & IORESOURCE_DAX_DCD))
+		return NULL;
+
 	/*
 	 * The DAX core assumes that it can store its private data in
 	 * parent->driver_data. This WARN is a reminder / safeguard for
@@ -669,6 +826,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->target_node = target_node;
+	dax_region->dc_ops = dc_ops;
 	ida_init(&dax_region->ida);
 	dax_region->res = (struct resource) {
 		.start = range->start,
@@ -857,7 +1015,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 }
 
 static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
-		resource_size_t size)
+		resource_size_t size, struct dax_resource *dax_resource)
 {
 	struct dax_region *dax_region = dev_dax->region;
 	struct resource *res = &dax_region->res;
@@ -898,6 +1056,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 			.start = alloc->start,
 			.end = alloc->end,
 		},
+		.dax_resource = dax_resource,
 	};
 
 	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -1071,7 +1230,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 retry:
 	first = region_res->child;
 	if (!first)
-		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc, NULL);
 
 	rc = -ENOSPC;
 	for (res = first; res; res = res->sibling) {
@@ -1080,7 +1239,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		/* space at the beginning of the region */
 		if (res == first && res->start > dax_region->res.start) {
 			alloc = min(res->start - dax_region->res.start, to_alloc);
-			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
+			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc, NULL);
 			break;
 		}
 
@@ -1100,7 +1259,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
 			break;
 		}
-		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
+		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc, NULL);
 		break;
 	}
 	if (rc)
@@ -1210,7 +1369,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
-		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc, NULL);
 	up_write(&dax_dev_rwsem);
 	up_write(&dax_region_rwsem);
 
@@ -1498,7 +1657,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
-	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size, NULL);
 	if (rc)
 		goto err_range;
 
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 6e739bfab932..7a115893a102 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -11,6 +11,7 @@ struct dev_dax;
 struct resource;
 struct dax_device;
 struct dax_region;
+struct dax_dc_ops;
 
 /* dax bus specific ioresource flags */
 #define IORESOURCE_DAX_STATIC BIT(0)
@@ -19,7 +20,7 @@ struct dax_region;
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
-		unsigned long flags);
+		unsigned long flags, struct dax_dc_ops *dc_ops);
 
 struct dev_dax_data {
 	struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index cedd974c2d0c..5d33be342d42 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,74 @@
 
 #include "../cxl/cxl.h"
 #include "bus.h"
+#include "dax-private.h"
+
+static int cxl_dax_group_add(struct dax_region *dax_region,
+			     struct cxl_dc_tag_group *group)
+{
+	struct dax_resource_spec *specs;
+	struct dc_extent *dc_extent;
+	unsigned long index;
+	unsigned int n = 0;
+	int rc;
+
+	if (!group->nr_extents)
+		return 0;
+
+	specs = kmalloc_array(group->nr_extents, sizeof(*specs), GFP_KERNEL);
+	if (!specs)
+		return -ENOMEM;
+
+	xa_for_each(&group->dc_extents, index, dc_extent) {
+		if (n == group->nr_extents)
+			break;
+		specs[n++] = (struct dax_resource_spec) {
+			.device = &dc_extent->dev,
+			.start = dax_region->res.start + dc_extent->hpa_range.start,
+			.length = range_len(&dc_extent->hpa_range),
+			.seq_num = dc_extent->seq_num,
+		};
+	}
+
+	/* Atomic all-or-none add, mirroring cxl_dax_group_rm(). */
+	rc = dax_region_add_resources(dax_region, specs, n, &group->uuid);
+	kfree(specs);
+	return rc;
+}
+
+/*
+ * RELEASE is still a stub here — the atomic dax_region_rm_resources API
+ * and its wire-up land in the next commit.  An incoming RELEASE returns
+ * success and the cxl side proceeds to rm_tag_group(), which device-
+ * unregisters each dc_extent; the devm action armed by
+ * dax_region_add_resource() then tears down each dax_resource.
+ */
+static int cxl_dax_region_notify(struct device *dev,
+				 struct cxl_notify_data *notify_data)
+{
+	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	struct cxl_dc_tag_group *group = notify_data->group;
+
+	switch (notify_data->event) {
+	case DCD_ADD_CAPACITY:
+		return cxl_dax_group_add(dax_region, group);
+	case DCD_RELEASE_CAPACITY:
+		dev_dbg(&cxlr_dax->dev,
+			"DCD RELEASE notify (tag %pUb): no-op (stub)\n",
+			&group->uuid);
+		return 0;
+	case DCD_FORCED_CAPACITY_RELEASE:
+	default:
+		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
+			notify_data->event);
+		return -ENXIO;
+	}
+}
+
+static struct dax_dc_ops dc_ops = {
+	.is_extent = is_dc_extent,
+};
 
 static int cxl_dax_region_probe(struct device *dev)
 {
@@ -25,7 +93,7 @@ static int cxl_dax_region_probe(struct device *dev)
 		flags = IORESOURCE_DAX_KMEM;
 
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
-				      PMD_SIZE, flags);
+				      PMD_SIZE, flags, &dc_ops);
 	if (!dax_region)
 		return -ENOMEM;
 
@@ -48,6 +116,7 @@ static int cxl_dax_region_probe(struct device *dev)
 static struct cxl_driver cxl_dax_region_driver = {
 	.name = "cxl_dax_region",
 	.probe = cxl_dax_region_probe,
+	.notify = cxl_dax_region_notify,
 	.id = CXL_DEVICE_DAX_REGION,
 	.drv = {
 		.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c..8d98fc9adb4b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -8,6 +8,7 @@
 #include <linux/device.h>
 #include <linux/cdev.h>
 #include <linux/idr.h>
+#include <linux/uuid.h>
 
 /* private routines between core files */
 struct dax_device;
@@ -16,6 +17,14 @@ struct inode *dax_inode(struct dax_device *dax_dev);
 int dax_bus_init(void);
 void dax_bus_exit(void);
 
+/**
+ * struct dax_dc_ops - Operations for dc-backed regions
+ * @is_extent: return true if the device is an extent
+ */
+struct dax_dc_ops {
+	bool (*is_extent)(struct device *dev);
+};
+
 /**
  * struct dax_region - mapping infrastructure for dax devices
  * @id: kernel-wide unique region for a memory range
@@ -27,6 +36,7 @@ void dax_bus_exit(void);
  * @res: resource tree to track instance allocations
  * @seed: allow userspace to find the first unbound seed device
  * @youngest: allow userspace to find the most recently created device
+ * @dc_ops: operations required for DC-backed regions
  */
 struct dax_region {
 	int id;
@@ -38,6 +48,7 @@ struct dax_region {
 	struct resource res;
 	struct device *seed;
 	struct device *youngest;
+	struct dax_dc_ops *dc_ops;
 };
 
 /**
@@ -57,11 +68,13 @@ struct dax_mapping {
  * @pgoff: page offset
  * @range: resource-span
  * @mapping: reference to the dax_mapping for this range
+ * @dax_resource: if not NULL; dax DC resource containing this range
  */
 struct dev_dax_range {
 	unsigned long pgoff;
 	struct range range;
 	struct dax_mapping *mapping;
+	struct dax_resource *dax_resource;
 };
 
 /**
@@ -106,6 +119,52 @@ struct dev_dax {
  */
 void run_dax(struct dax_device *dax_dev);
 
+/**
+ * struct dax_resource - For DC DAX regions; an active resource
+ * @region: dax_region this resource is in
+ * @res: resource
+ * @uuid: tag identifying the backing extent; zero uuid means untagged
+ * @seq_num: dense 0..n-1 assembly-order index within the tag group.  The
+ *	     cxl layer assigns it in assembly order — the device-stamped
+ *	     0..n-1 shared_extn_seq (CXL r4.0 Table 8-230) for a sharable
+ *	     partition, or event arrival order otherwise — so the dax layer
+ *	     can rely on a single 0..n-1 dense invariant when it claims a
+ *	     tagged group in uuid_store().
+ * @use_cnt: count the number of uses of this resource
+ *
+ * Changes to the dax_region and the dax_resources within it are protected by
+ * dax_region_rwsem
+ *
+ * dax_resources are not intended to be used outside the dax layer.
+ */
+struct dax_resource {
+	struct dax_region *region;
+	struct resource *res;
+	uuid_t uuid;
+	u16 seq_num;
+	unsigned int use_cnt;
+};
+
+/*
+ * Similar to run_dax() dax_region_add_resource() is exported but is not
+ * intended to be a generic operation outside the dax subsystem.  It is only
+ * generic between the dax layer and the dax drivers.
+ */
+int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
+			    resource_size_t start, resource_size_t length,
+			    const uuid_t *tag, u16 seq_num);
+
+/* One resource to add as part of an atomic dax_region_add_resources() set. */
+struct dax_resource_spec {
+	struct device *device;
+	resource_size_t start;
+	resource_size_t length;
+	u16 seq_num;
+};
+int dax_region_add_resources(struct dax_region *dax_region,
+			     const struct dax_resource_spec *specs,
+			     unsigned int n, const uuid_t *tag);
+
 static inline struct dev_dax *to_dev_dax(struct device *dev)
 {
 	return container_of(dev, struct dev_dax, dev);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index af21f66bf872..be938c2a73f8 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
 
 	mri = dev->platform_data;
 	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
-				      mri->target_node, PMD_SIZE, flags);
+				      mri->target_node, PMD_SIZE, flags, NULL);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index bee93066a849..5b5be86768f3 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -53,7 +53,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	range.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &range,
 			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			IORESOURCE_DAX_STATIC);
+			IORESOURCE_DAX_STATIC, NULL);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.43.0



* [PATCH v11 22/31] cxl + dax: Release dax_resources on DCD Release Capacity events
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (19 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
@ 2026-06-25 11:04 ` Anisa Su
  2026-06-25 18:36   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 23/31] dax/bus: Factor out dev dax resize logic Anisa Su
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:04 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Implement the release path that mirrors the add path: when the device
asks for capacity back, the dax layer tears down the per-extent
resources for the whole tag group atomically via
dax_region_rm_resources().

If any extent in the group is still mapped by a dev_dax, the release
is refused with -EBUSY and no state changes; the cxl side then leaves
the tag group intact and the device retries.
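
The refuse-or-release-everything behaviour can be modelled in userspace
with a two-pass walk. This is a sketch under stated assumptions rather
than the kernel code: toy_resource, toy_rm_all() and TOY_EBUSY are
hypothetical names, and the write lock held across both passes in the
real function is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EBUSY 16	/* placeholder value standing in for EBUSY */

/* Hypothetical stand-in for struct dax_resource. */
struct toy_resource {
	unsigned int use_cnt;	/* non-zero while a range maps this extent */
	int registered;		/* 1 while the resource is in the region */
};

/*
 * Release every resource in @set or none of them.  NULL entries model
 * devices with no resource attached and are skipped in both passes.
 */
static int toy_rm_all(struct toy_resource **set, size_t n)
{
	size_t i;

	/* Pass 1: refuse the whole set if any member is still in use. */
	for (i = 0; i < n; i++)
		if (set[i] && set[i]->use_cnt)
			return -TOY_EBUSY;

	/* Pass 2: nothing is busy, so every release is safe. */
	for (i = 0; i < n; i++) {
		if (!set[i])
			continue;
		assert(set[i]->use_cnt == 0);
		set[i]->registered = 0;
	}
	return 0;
}
```

Checking before releasing avoids any rollback on the release side,
because a refused call has not modified anything.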

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/extent.c | 12 +++++++++++
 drivers/dax/bus.c         | 43 +++++++++++++++++++++++++++++++++++++++
 drivers/dax/cxl.c         | 39 +++++++++++++++++++++++++----------
 drivers/dax/dax-private.h |  8 ++++++--
 4 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 59db1878b5e2..7009ac6a51b4 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -627,6 +627,18 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
 	if (rc)
 		return rc;
 
+	rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, group);
+	if (rc) {
+		/*
+		 * dax layer refused (-EBUSY) or failed (-ENOMEM, etc.).  Do
+		 * not proceed to tear down the tag group — leave its
+		 * dax_resources alive so we do not free them out from under
+		 * live dev_dax ranges.  The device will retry the release.
+		 */
+		return 0;
+	}
+
+	/* Release the entire tag group */
 	rm_tag_group(group);
 	return 0;
 }
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 9b5c03616b83..95683dc8fcd0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -282,6 +282,14 @@ static int __dax_region_rm_resource(struct dax_region *dax_region,
 	return 0;
 }
 
+int dax_region_rm_resource(struct dax_region *dax_region,
+			   struct device *dev)
+{
+	guard(rwsem_write)(&dax_region_rwsem);
+	return __dax_region_rm_resource(dax_region, dev);
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resource);
+
 /**
  * dax_region_add_resources - atomically add a set of dax_resources.
  *
@@ -314,6 +322,41 @@ int dax_region_add_resources(struct dax_region *dax_region,
 }
 EXPORT_SYMBOL_GPL(dax_region_add_resources);
 
+/**
+ * dax_region_rm_resources - atomically remove a set of dax_resources.
+ *
+ * Walk @devs twice under dax_region_rwsem.  First pass refuses the
+ * operation if any member's use_cnt is non-zero; second pass releases
+ * each.  This gives refuse-all-or-none semantics across the set, which
+ * a tag group's atomic release relies on.  Devices with no
+ * dax_resource attached are silently skipped.
+ */
+int dax_region_rm_resources(struct dax_region *dax_region,
+			    struct device * const *devs, unsigned int n)
+{
+	unsigned int i;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+
+	for (i = 0; i < n; i++) {
+		struct dax_resource *r = dev_get_drvdata(devs[i]);
+
+		if (r && r->use_cnt)
+			return -EBUSY;
+	}
+
+	for (i = 0; i < n; i++) {
+		struct dax_resource *r = dev_get_drvdata(devs[i]);
+
+		if (!r)
+			continue;
+		__dax_release_resource(r);
+		dev_set_drvdata(devs[i], NULL);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resources);
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 5d33be342d42..d885b6e698ef 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -40,13 +40,33 @@ static int cxl_dax_group_add(struct dax_region *dax_region,
 	return rc;
 }
 
-/*
- * RELEASE is still a stub here — the atomic dax_region_rm_resources API
- * and its wire-up land in the next commit.  An incoming RELEASE returns
- * success and the cxl side proceeds to rm_tag_group(), which device-
- * unregisters each dc_extent; the devm action armed by
- * dax_region_add_resource() then tears down each dax_resource.
- */
+static int cxl_dax_group_rm(struct dax_region *dax_region,
+			    struct cxl_dc_tag_group *group)
+{
+	struct dc_extent *dc_extent;
+	struct device **devs;
+	unsigned long index;
+	unsigned int n = 0;
+	int rc;
+
+	if (!group->nr_extents)
+		return 0;
+
+	devs = kmalloc_array(group->nr_extents, sizeof(*devs), GFP_KERNEL);
+	if (!devs)
+		return -ENOMEM;
+
+	xa_for_each(&group->dc_extents, index, dc_extent) {
+		if (n == group->nr_extents)
+			break;
+		devs[n++] = &dc_extent->dev;
+	}
+
+	rc = dax_region_rm_resources(dax_region, devs, n);
+	kfree(devs);
+	return rc;
+}
+
 static int cxl_dax_region_notify(struct device *dev,
 				 struct cxl_notify_data *notify_data)
 {
@@ -58,10 +78,7 @@ static int cxl_dax_region_notify(struct device *dev,
 	case DCD_ADD_CAPACITY:
 		return cxl_dax_group_add(dax_region, group);
 	case DCD_RELEASE_CAPACITY:
-		dev_dbg(&cxlr_dax->dev,
-			"DCD RELEASE notify (tag %pUb): no-op (stub)\n",
-			&group->uuid);
-		return 0;
+		return cxl_dax_group_rm(dax_region, group);
 	case DCD_FORCED_CAPACITY_RELEASE:
 	default:
 		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 8d98fc9adb4b..59ba929e14fd 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -146,13 +146,17 @@ struct dax_resource {
 };
 
 /*
- * Similar to run_dax() dax_region_add_resource() is exported but is not
- * intended to be a generic operation outside the dax subsystem.  It is only
+ * Similar to run_dax() dax_region_{add,rm}_resource() are exported but are not
+ * intended to be generic operations outside the dax subsystem.  They are only
  * generic between the dax layer and the dax drivers.
  */
 int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
 			    resource_size_t start, resource_size_t length,
 			    const uuid_t *tag, u16 seq_num);
+int dax_region_rm_resource(struct dax_region *dax_region,
+			   struct device *dev);
+int dax_region_rm_resources(struct dax_region *dax_region,
+			    struct device * const *devs, unsigned int n);
 
 /* One resource to add as part of an atomic dax_region_add_resources() set. */
 struct dax_resource_spec {
-- 
2.43.0



* [PATCH v11 23/31] dax/bus: Factor out dev dax resize logic
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (20 preceding siblings ...)
  2026-06-25 11:04 ` [PATCH v11 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:27   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 24/31] dax/bus: Add uuid sysfs attribute to dax devices Anisa Su
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Dynamic Capacity (DC) DAX regions back their dax devices with per-extent
resource children of the region, rather than carving from a single
contiguous dax_region->res.  Managing space for a DC dax device, both on
initial uuid claim of its backing extents and on shrink-to-0 during
destroy, needs the same allocator the static case uses, but pointed at
a different parent resource.

In preparation for this change, factor out the dev_dax_resize logic.
For static regions use dax_region->res as the parent to find space for
the dax ranges.  Future patches will use the same algorithm with
individual extent resources as the parent.
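
As a rough userspace illustration of why the parent is worth passing in,
the first-fit walk can be modelled over any inclusive address span.  The
names below (struct span, first_gap) are hypothetical stand-ins, not
kernel API, and the sketch assumes children are sorted, non-overlapping
and that parent->end is below UINT64_MAX:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct resource: an inclusive [start, end] span. */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Find the first free gap inside @parent, given @n children sorted by start
 * address.  On success store the gap start in @out and return how many bytes
 * (at most @want) fit there; return 0 if @parent has no free space.
 */
static uint64_t first_gap(const struct span *parent, const struct span *kids,
			  size_t n, uint64_t want, uint64_t *out)
{
	uint64_t cursor = parent->start;	/* lowest address not yet ruled out */
	uint64_t gap;
	size_t i;

	for (i = 0; i < n; i++) {
		/* space before the first child, or between two children */
		if (kids[i].start > cursor) {
			gap = kids[i].start - cursor;
			*out = cursor;
			return gap < want ? gap : want;
		}
		cursor = kids[i].end + 1;
	}

	/* space after the last child (or the whole parent when n == 0) */
	if (cursor <= parent->end) {
		gap = parent->end - cursor + 1;
		*out = cursor;
		return gap < want ? gap : want;
	}
	return 0;
}
```

Because the walk only looks at @parent and its children, the same
function serves a whole region or a single extent; a caller loops until
the requested amount has been placed or 0 is returned.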

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---
Changes:
1. dev_dax_resize(): change resource_size_t alloc (unsigned) to signed
   ssize_t to correctly capture -errno
---
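
A small standalone sketch of the hazard this changelog entry describes.
The helper names are made up for illustration; only the signed versus
unsigned behaviour is the point:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical allocator step: bytes allocated on success, -errno on failure. */
static ssize_t alloc_step(size_t avail, size_t want)
{
	if (!avail)
		return -ENOSPC;
	return (ssize_t)(want < avail ? want : avail);
}

/*
 * What storing the result in an unsigned type amounts to: the sign is lost
 * and -ENOSPC turns into a very large byte count.
 */
static size_t store_unsigned(ssize_t result)
{
	return (size_t)result;
}
```

With a signed local the caller can test "alloc < 0" and propagate the
error; with an unsigned one the same value would be subtracted from
to_alloc as if it were a successful allocation.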
 drivers/dax/bus.c | 133 +++++++++++++++++++++++++++++-----------------
 1 file changed, 83 insertions(+), 50 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 95683dc8fcd0..ffa6b303fc9b 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1057,11 +1057,10 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 	return 0;
 }
 
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
-		resource_size_t size, struct dax_resource *dax_resource)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+			       u64 start, resource_size_t size,
+			       struct dax_resource *dax_resource)
 {
-	struct dax_region *dax_region = dev_dax->region;
-	struct resource *res = &dax_region->res;
 	struct device *dev = &dev_dax->dev;
 	struct dev_dax_range *ranges;
 	unsigned long pgoff = 0;
@@ -1079,14 +1078,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 		return 0;
 	}
 
-	alloc = __request_region(res, start, size, dev_name(dev), 0);
+	alloc = __request_region(parent, start, size, dev_name(dev), 0);
 	if (!alloc)
 		return -ENOMEM;
 
 	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
 			* (dev_dax->nr_range + 1), GFP_KERNEL);
 	if (!ranges) {
-		__release_region(res, alloc->start, resource_size(alloc));
+		__release_region(parent, alloc->start, resource_size(alloc));
 		return -ENOMEM;
 	}
 
@@ -1240,50 +1239,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 	return true;
 }
 
-static ssize_t dev_dax_resize(struct dax_region *dax_region,
-		struct dev_dax *dev_dax, resource_size_t size)
+/**
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in
+ * @dev_dax: DAX device to be expanded
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
 {
-	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
-	resource_size_t dev_size = dev_dax_size(dev_dax);
-	struct resource *region_res = &dax_region->res;
-	struct device *dev = &dev_dax->dev;
 	struct resource *res, *first;
-	resource_size_t alloc = 0;
 	int rc;
 
-	if (dev->driver)
-		return -EBUSY;
-	if (size == dev_size)
-		return 0;
-	if (size > dev_size && size - dev_size > avail)
-		return -ENOSPC;
-	if (size < dev_size)
-		return dev_dax_shrink(dev_dax, size);
-
-	to_alloc = size - dev_size;
-	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
-			"resize of %pa misaligned\n", &to_alloc))
-		return -ENXIO;
-
-	/*
-	 * Expand the device into the unused portion of the region. This
-	 * may involve adjusting the end of an existing resource, or
-	 * allocating a new resource.
-	 */
-retry:
-	first = region_res->child;
-	if (!first)
-		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc, NULL);
+	first = parent->child;
+	if (!first) {
+		rc = alloc_dev_dax_range(parent, dev_dax,
+					   parent->start, to_alloc, NULL);
+		if (rc)
+			return rc;
+		return to_alloc;
+	}
 
-	rc = -ENOSPC;
 	for (res = first; res; res = res->sibling) {
 		struct resource *next = res->sibling;
+		resource_size_t alloc;
 
 		/* space at the beginning of the region */
-		if (res == first && res->start > dax_region->res.start) {
-			alloc = min(res->start - dax_region->res.start, to_alloc);
-			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc, NULL);
-			break;
+		if (res == first && res->start > parent->start) {
+			alloc = min(res->start - parent->start, to_alloc);
+			rc = alloc_dev_dax_range(parent, dev_dax,
+						 parent->start, alloc, NULL);
+			if (rc)
+				return rc;
+			return alloc;
 		}
 
 		alloc = 0;
@@ -1292,21 +1286,58 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 			alloc = min(next->start - (res->end + 1), to_alloc);
 
 		/* space at the end of the region */
-		if (!alloc && !next && res->end < region_res->end)
-			alloc = min(region_res->end - res->end, to_alloc);
+		if (!alloc && !next && res->end < parent->end)
+			alloc = min(parent->end - res->end, to_alloc);
 
 		if (!alloc)
 			continue;
 
 		if (adjust_ok(dev_dax, res)) {
 			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
-			break;
+			if (rc)
+				return rc;
+			return alloc;
 		}
-		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc, NULL);
-		break;
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc, NULL);
+		if (rc)
+			return rc;
+		return alloc;
 	}
-	if (rc)
-		return rc;
+
+	/* available was already calculated and should never be an issue */
+	dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+	return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+		struct dev_dax *dev_dax, resource_size_t size)
+{
+	resource_size_t avail = dax_region_avail_size(dax_region);
+	resource_size_t dev_size = dev_dax_size(dev_dax);
+	struct device *dev = &dev_dax->dev;
+	resource_size_t to_alloc;
+	ssize_t alloc;
+
+	if (dev->driver)
+		return -EBUSY;
+	if (size == dev_size)
+		return 0;
+	if (size > dev_size && size - dev_size > avail)
+		return -ENOSPC;
+	if (size < dev_size)
+		return dev_dax_shrink(dev_dax, size);
+
+	to_alloc = size - dev_size;
+	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+			"resize of %pa misaligned\n", &to_alloc))
+		return -ENXIO;
+
+retry:
+	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	if (alloc < 0)
+		return alloc;
+	if (alloc == 0)
+		return -ENOSPC;
 	to_alloc -= alloc;
 	if (to_alloc)
 		goto retry;
@@ -1412,7 +1443,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
-		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc, NULL);
+		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+					 to_alloc, NULL);
 	up_write(&dax_dev_rwsem);
 	up_write(&dax_region_rwsem);
 
@@ -1700,7 +1732,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
-	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size, NULL);
+	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+				 data->size, NULL);
 	if (rc)
 		goto err_range;
 
-- 
2.43.0



* [PATCH v11 24/31] dax/bus: Add uuid sysfs attribute to dax devices
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (21 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 23/31] dax/bus: Factor out dev dax resize logic Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 11:05 ` [PATCH v11 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation Anisa Su
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

Introduce a read-write 'uuid' sysfs entry at
/sys/bus/dax/devices/daxX.Y/ with stubbed handlers: show returns the
null uuid and store returns -EOPNOTSUPP.  A follow-on patch wires both
directions to dax_resource tracking.

Document the attribute in the dax sysfs ABI.

Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
- uuid_show() emits the null uuid ("%pUb" of uuid_null) instead of "0".
- ABI: describe the no-uuid read value as a null uuid instead of "0";
  bump Date to June, 2026 and KernelVersion to v7.3.
---
 Documentation/ABI/testing/sysfs-bus-dax | 18 ++++++++++++++++++
 drivers/dax/bus.c                       | 14 ++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..3219c09dea01 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -59,6 +59,24 @@ Description:
 		backing device for this dax device, emit the CPU node
 		affinity for this device.
 
+What:		/sys/bus/dax/devices/daxX.Y/uuid
+Date:		June, 2026
+KernelVersion:	v7.3
+Contact:	nvdimm@lists.linux.dev
+Description:
+		(RW) On read, reports the uuid identifying the capacity
+		backing this dax device.  A null uuid (all-zeroes) indicates
+		that the device has no associated uuid — either it is not
+		backed by DCD capacity, or the backing extents are untagged.
+
+		Writes are accepted only on dax devices in sparse (DCD)
+		regions; writes to non-sparse devices return -EOPNOTSUPP.
+		Writing a non-null uuid claims every dax_resource in the
+		parent region whose tag matches the written uuid, consuming
+		any available capacity in each matching resource.  Writing
+		"0" is shorthand for the null uuid and claims a single
+		untagged dax_resource.
+
 What:		/sys/bus/dax/devices/daxX.Y/target_node
 Date:		February, 2019
 KernelVersion:	v5.1
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index ffa6b303fc9b..f61309a6f934 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1573,6 +1573,19 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%pUb\n", &uuid_null);
+}
+
+static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+static DEVICE_ATTR_RW(uuid);
+
 static ssize_t memmap_on_memory_show(struct device *dev,
 				     struct device_attribute *attr, char *buf)
 {
@@ -1644,6 +1657,7 @@ static struct attribute *dev_dax_attributes[] = {
 	&dev_attr_resource.attr,
 	&dev_attr_numa_node.attr,
 	&dev_attr_memmap_on_memory.attr,
+	&dev_attr_uuid.attr,
 	NULL,
 };
 
-- 
2.43.0



* [PATCH v11 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (22 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 24/31] dax/bus: Add uuid sysfs attribute to dax devices Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 11:05 ` [PATCH v11 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

A DC dax device's size is determined by the extents that back it, not by
the user.  DCD extents are all-or-nothing, so partial shrink is just as
illegal as growing.  Enforce that on the size and creation paths:

  * size_store: any resize to a non-zero size on a DC region returns
    -EOPNOTSUPP.  Only size=0 is accepted: daxctl destroy-device writes
    it to return every claimed extent to the region's available pool
    before the device's name is written to the region's 'delete'
    attribute.
  * __devm_create_dev_dax: a DC dax device must be created at size 0.
    Non-zero data->size on a DC region returns -EINVAL with a clear
    message.
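
The two rules above can be sketched as a pair of pure checks.  This is a
userspace model under assumptions: struct resize_req and both function
names are hypothetical, and the order of tests simply follows the
dev_dax_resize() hunk in this patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state the resize checks look at. */
struct resize_req {
	bool driver_bound;	/* device currently has a driver attached */
	bool dynamic;		/* region is a DC region */
	uint64_t dev_size;	/* current device size in bytes */
	uint64_t avail;		/* free bytes left in the region */
};

/* Returns 0 if the resize may proceed (or is a no-op), else -errno. */
static int resize_check(const struct resize_req *r, uint64_t size)
{
	if (r->driver_bound)
		return -EBUSY;
	if (size == r->dev_size)
		return 0;
	/* DC devices may only be shrunk to 0, never grown or trimmed */
	if (size != 0 && r->dynamic)
		return -EOPNOTSUPP;
	if (size > r->dev_size && size - r->dev_size > r->avail)
		return -ENOSPC;
	return 0;
}

/* Creation rule: a device in a DC region must start out empty. */
static int create_check(bool dynamic, uint64_t size)
{
	return (dynamic && size) ? -EINVAL : 0;
}
```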

To prepare the shrink path for DC, dev_dax_resize_static() is renamed
__dev_dax_resize() and, alongside its existing @parent resource, gains
an optional @dax_resource; a thin dev_dax_resize_static() wrapper passes
the region resource and NULL, so static behaviour is unchanged.  The
path that walks a per-extent dax_resource->res parent (so shrink-to-0
releases each extent's child resource rather than the region's) is
wired up by the tag-aware uuid claim in a later commit.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dax/bus.c | 46 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index f61309a6f934..f086ad27d507 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1181,7 +1181,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 	int i;
 
 	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
-		struct range *range = &dev_dax->ranges[i].range;
+		struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+		struct range *range = &dev_range->range;
 		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
 		struct resource *adjust = NULL, *res;
 		resource_size_t shrink;
@@ -1197,6 +1198,10 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 			continue;
 		}
 
+		/*
+		 * Partial shrink: forbidden on DC regions, so dev_range
+		 * here must belong to a static device.
+		 */
 		for_each_dax_region_resource(dax_region, res)
 			if (strcmp(res->name, dev_name(dev)) == 0
 					&& res->start == range->start) {
@@ -1240,19 +1245,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 }
 
 /**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
  *
  * @parent: parent resource to allocate this range in
  * @dev_dax: DAX device to be expanded
  * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_resource: for DC regions, the dax_resource that owns @parent; else NULL
  *
  * Return the amount of space allocated or -ERRNO on failure
  */
-static ssize_t dev_dax_resize_static(struct resource *parent,
-				     struct dev_dax *dev_dax,
-				     resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+				struct dev_dax *dev_dax,
+				resource_size_t to_alloc,
+				struct dax_resource *dax_resource)
 {
 	struct resource *res, *first;
 	int rc;
@@ -1260,7 +1267,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 	first = parent->child;
 	if (!first) {
 		rc = alloc_dev_dax_range(parent, dev_dax,
-					   parent->start, to_alloc, NULL);
+					   parent->start, to_alloc,
+					   dax_resource);
 		if (rc)
 			return rc;
 		return to_alloc;
@@ -1274,7 +1282,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 		if (res == first && res->start > parent->start) {
 			alloc = min(res->start - parent->start, to_alloc);
 			rc = alloc_dev_dax_range(parent, dev_dax,
-						 parent->start, alloc, NULL);
+						 parent->start, alloc,
+						 dax_resource);
 			if (rc)
 				return rc;
 			return alloc;
@@ -1298,7 +1307,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 				return rc;
 			return alloc;
 		}
-		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc, NULL);
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+					 dax_resource);
 		if (rc)
 			return rc;
 		return alloc;
@@ -1309,6 +1319,13 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 	return 0;
 }
 
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
+{
+	return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
 static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		struct dev_dax *dev_dax, resource_size_t size)
 {
@@ -1322,6 +1339,8 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		return -EBUSY;
 	if (size == dev_size)
 		return 0;
+	if (size != 0 && is_dynamic(dax_region))
+		return -EOPNOTSUPP;
 	if (size > dev_size && size - dev_size > avail)
 		return -ENOSPC;
 	if (size < dev_size)
@@ -1333,7 +1352,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		return -ENXIO;
 
 retry:
-	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
 	if (alloc < 0)
 		return alloc;
 	if (alloc == 0)
@@ -1717,6 +1736,11 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	struct device *dev;
 	int rc;
 
+	if (is_dynamic(dax_region) && data->size) {
+		dev_err(parent, "DC DAX region devices must be created initially with 0 size\n");
+		return ERR_PTR(-EINVAL);
+	}
+
 	dev_dax = kzalloc_obj(*dev_dax);
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
-- 
2.43.0



* [PATCH v11 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (23 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:26   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 27/31] cxl/region: Read existing extents on region creation Anisa Su
                   ` (5 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

DC DAX regions are populated with dax_resource children that each carry a
backing tag uuid and a per-allocation sequence number (seq_num).  Add the
userspace claim semantics that resolve those tagged groups into DAX
devices.

A DC region's seed dax device is created at 0-size on probe; userspace
populates it by writing to its 'uuid' attribute:

  * A non-null UUID claims every dax_resource on this region whose tag
    matches, in seq_num order via uuid_claim_tagged().  The match set
    must form a dense 0..n-1 sequence (no gap, no duplicate); the CXL
    side maintains this invariant for both sharable allocations (where
    the device stamps shared_extn_seq) and non-sharable allocations
    (where cxl_realize_group assigns arrival-order seq).  The resulting
    DAX device's size equals the sum of every member extent's size.

  * "0" claims a single untagged dax_resource via
    uuid_claim_untagged().  Untagged extents are independent
    allocations; collapsing several would aggregate unrelated capacity,
    so each uuid="0" write consumes exactly one untagged resource.

  * A write that matches no dax_resource returns -ENOENT; the device
    stays at size 0.

  * A write to an already-claimed device (non-zero size) returns
    -EBUSY; a device's uuid cannot be overwritten once claimed.
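
The dense 0..n-1 requirement for a tagged claim can be sketched in
plain C.  This is a userspace model only: struct extent and
claim_group() are hypothetical names, and qsort() stands in for the
kernel's sort():

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one tagged extent. */
struct extent {
	uint16_t seq_num;
	uint64_t size;
};

static int extent_seq_cmp(const void *a, const void *b)
{
	const struct extent *ea = a, *eb = b;

	return (ea->seq_num > eb->seq_num) - (ea->seq_num < eb->seq_num);
}

/*
 * Sort @ext by seq_num and verify the numbers are exactly 0..n-1.
 * On success store the summed size in @total and return 0.  Return
 * -ENOENT for an empty match set and -EINVAL on a gap or duplicate,
 * in which case @total is left untouched.
 */
static int claim_group(struct extent *ext, size_t n, uint64_t *total)
{
	uint64_t sum = 0;
	size_t i;

	if (!n)
		return -ENOENT;

	qsort(ext, n, sizeof(*ext), extent_seq_cmp);
	for (i = 0; i < n; i++) {
		/* after sorting, slot i must hold sequence number i */
		if ((size_t)ext[i].seq_num != i)
			return -EINVAL;
		sum += ext[i].size;
	}
	*total = sum;
	return 0;
}
```

Checking after the sort catches a gap, a duplicate and a group that does
not start at 0 with one comparison per slot.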

uuid_show() reads back the backing tag uuid, or the null UUID for an
untagged claim or an unclaimed device.  dev_dax_visible() makes the
attribute writable on DC dax devices and read-only (0444) elsewhere;
uuid_store() additionally returns -EOPNOTSUPP on non-DC regions.
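
The mode selection can be pictured as a small pure function.  The enum
and the attr_mode() name are hypothetical; the three tests mirror the
dev_dax_visible() hunk in this patch and leave out the attributes that
are hidden outright:

```c
#include <stdbool.h>

/* Hypothetical labels for the attributes the hunk touches. */
enum dax_attr { ATTR_UUID, ATTR_ALIGN, ATTR_SIZE, ATTR_OTHER };

/*
 * Return the sysfs mode for attribute @a.  @is_static and @is_dc model
 * is_static() and is_dynamic() on the region; @default_mode is the mode
 * the attribute was declared with.
 */
static unsigned int attr_mode(enum dax_attr a, bool is_static, bool is_dc,
			      unsigned int default_mode)
{
	if (a == ATTR_UUID && !is_dc)
		return 0444;	/* uuid is only writable on DC devices */
	if (a == ATTR_ALIGN && (is_static || is_dc))
		return 0444;
	if (a == ATTR_SIZE && is_static)
		return 0444;
	return default_mode;
}
```

In this model 'size' stays writable on a DC device, which matches the
size=0 write that destroy relies on.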

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
[anisa: uuid_show() emits the null uuid ("%pUb" of uuid_null) rather
 than "0" for an untagged or uuid-less device, matching the documented
 read value.]
[anisa: uuid_show()/uuid_store() take their rwsems via ACQUIRE() scoped
 guards instead of explicit down/up with goto unwinding.]
[anisa: uuid_store() refuses to re-claim an already-claimed device
 (-EBUSY) so a uuid cannot be overwritten.]
---
 drivers/dax/bus.c | 262 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 259 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index f086ad27d507..d94c0853af10 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -5,6 +5,7 @@
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/slab.h>
+#include <linux/sort.h>
 #include <linux/dax.h>
 #include <linux/io.h>
 #include "dax-private.h"
@@ -1100,6 +1101,9 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
 		},
 		.dax_resource = dax_resource,
 	};
+	/* Pin the extent for this range; trim_dev_dax_range() drops it. */
+	if (dax_resource)
+		dax_resource->use_cnt++;
 
 	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
 			&alloc->start, &alloc->end);
@@ -1363,6 +1367,89 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 	return 0;
 }
 
+/* DC extents are all-or-nothing: an extent is either free or fully claimed. */
+static bool dax_resource_in_use(const struct dax_resource *dax_resource)
+{
+	return dax_resource->use_cnt > 0;
+}
+
+struct dax_uuid_match {
+	const struct dax_region *dax_region;
+	const uuid_t *uuid;
+};
+
+static int find_uuid_extent(struct device *dev, const void *data)
+{
+	const struct dax_uuid_match *match = data;
+	struct dax_resource *dax_resource;
+
+	if (!match->dax_region->dc_ops->is_extent(dev))
+		return 0;
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource || dax_resource_in_use(dax_resource))
+		return 0;
+	return uuid_equal(&dax_resource->uuid, match->uuid);
+}
+
+struct dax_tag_collect {
+	const struct dax_region *dax_region;
+	const uuid_t *uuid;
+	struct dax_resource **arr;
+	unsigned int count;
+	unsigned int cap;
+};
+
+static int collect_uuid_extent(struct device *dev, void *data)
+{
+	struct dax_tag_collect *c = data;
+	struct dax_resource *dax_resource;
+
+	if (!c->dax_region->dc_ops->is_extent(dev))
+		return 0;
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource || dax_resource_in_use(dax_resource))
+		return 0;
+	if (!uuid_equal(&dax_resource->uuid, c->uuid))
+		return 0;
+
+	if (c->count == c->cap)
+		return -ENOSPC;
+	c->arr[c->count++] = dax_resource;
+	return 0;
+}
+
+static int count_uuid_extent(struct device *dev, void *data)
+{
+	struct dax_tag_collect *c = data;
+	struct dax_resource *dax_resource;
+
+	if (!c->dax_region->dc_ops->is_extent(dev))
+		return 0;
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource || dax_resource_in_use(dax_resource))
+		return 0;
+	if (!uuid_equal(&dax_resource->uuid, c->uuid))
+		return 0;
+
+	c->count++;
+	return 0;
+}
+
+static int dax_resource_seq_cmp(const void *a, const void *b)
+{
+	const struct dax_resource * const *pa = a;
+	const struct dax_resource * const *pb = b;
+
+	if ((*pa)->seq_num < (*pb)->seq_num)
+		return -1;
+	if ((*pa)->seq_num > (*pb)->seq_num)
+		return 1;
+	return 0;
+}
+
 static ssize_t size_store(struct device *dev, struct device_attribute *attr,
 		const char *buf, size_t len)
 {
@@ -1595,13 +1682,178 @@ static DEVICE_ATTR_RO(numa_node);
 static ssize_t uuid_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	int rc;
+
+	ACQUIRE(rwsem_read_intr, rwsem)(&dax_dev_rwsem);
+	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &rwsem)))
+		return rc;
+
+	for (int i = 0; i < dev_dax->nr_range; i++) {
+		struct dax_resource *r = dev_dax->ranges[i].dax_resource;
+
+		if (r && !uuid_is_null(&r->uuid))
+			return sysfs_emit(buf, "%pUb\n", &r->uuid);
+	}
 	return sysfs_emit(buf, "%pUb\n", &uuid_null);
 }
 
+static ssize_t uuid_claim_untagged(struct dax_region *dax_region,
+				   struct dev_dax *dev_dax)
+{
+	struct dax_uuid_match match = {
+		.dax_region = dax_region,
+		.uuid = &uuid_null,
+	};
+	struct dax_resource *dax_resource;
+	resource_size_t to_alloc;
+	struct device *extent_dev;
+	ssize_t alloc;
+
+	extent_dev = device_find_child(dax_region->dev, &match,
+				       find_uuid_extent);
+	if (!extent_dev)
+		return -ENOENT;
+
+	dax_resource = dev_get_drvdata(extent_dev);
+	to_alloc = resource_size(dax_resource->res);
+	if (!alloc_is_aligned(dev_dax, to_alloc)) {
+		put_device(extent_dev);
+		return -EINVAL;
+	}
+	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc,
+				 dax_resource);
+	put_device(extent_dev);
+	if (alloc < 0)
+		return alloc;
+	if (alloc == 0)
+		return -ENOENT;
+	return 0;
+}
+
+static ssize_t uuid_claim_tagged(struct dax_region *dax_region,
+				 struct dev_dax *dev_dax, const uuid_t *uuid)
+{
+	struct dax_tag_collect c = {
+		.dax_region = dax_region,
+		.uuid = uuid,
+	};
+	unsigned int i;
+	ssize_t rc;
+
+	/* Two-pass: count, then collect into a sized array. */
+	device_for_each_child(dax_region->dev, &c, count_uuid_extent);
+	if (!c.count)
+		return -ENOENT;
+
+	c.arr = kmalloc_array(c.count, sizeof(*c.arr), GFP_KERNEL);
+	if (!c.arr)
+		return -ENOMEM;
+	c.cap = c.count;
+	c.count = 0;
+
+	rc = device_for_each_child(dax_region->dev, &c, collect_uuid_extent);
+	if (rc)
+		goto out;
+
+	sort(c.arr, c.count, sizeof(*c.arr), dax_resource_seq_cmp, NULL);
+
+	/*
+	 * Tagged groups carry a dense 0..n-1 @seq_num regardless of source —
+	 * the device-stamped shared_extn_seq (already 0..n-1) for a sharable
+	 * partition, or cxl-side arrival order for a non-sharable one (see
+	 * &struct dax_resource).  A gap or out-of-range value here means an
+	 * extent went missing on the cxl side (e.g. a per-extent failure in
+	 * cxl_add_pending) or a cxl-side validation gap; in either case
+	 * refuse the whole group rather than carve a partial allocation.
+	 */
+	for (i = 0; i < c.count; i++) {
+		if (c.arr[i]->seq_num != i) {
+			dev_WARN_ONCE(dax_region->dev, 1,
+				"tag %pUb seq invariant violated at slot %u (got %u)\n",
+				uuid, i, c.arr[i]->seq_num);
+			rc = -EINVAL;
+			goto out;
+		}
+	}
+
+	for (i = 0; i < c.count; i++) {
+		resource_size_t to_alloc = resource_size(c.arr[i]->res);
+		ssize_t alloc;
+
+		if (!alloc_is_aligned(dev_dax, to_alloc)) {
+			rc = -EINVAL;
+			goto rollback;
+		}
+		alloc = __dev_dax_resize(c.arr[i]->res, dev_dax, to_alloc,
+					 c.arr[i]);
+		if (alloc < 0) {
+			rc = alloc;
+			goto rollback;
+		}
+		if (alloc == 0) {
+			rc = -ENOSPC;
+			goto rollback;
+		}
+	}
+	rc = 0;
+	goto out;
+
+rollback:
+	/*
+	 * Partial failure: trim every range we added in this attempt.
+	 * trim_dev_dax_range pops the most-recently-appended range from
+	 * dev_dax->ranges[] and decrements its dax_resource->use_cnt, so
+	 * looping until we have undone @i additions restores both
+	 * dev_dax->ranges[] and the matched dax_resources' use_cnt.
+	 */
+	while (i-- > 0)
+		trim_dev_dax_range(dev_dax);
+out:
+	kfree(c.arr);
+	return rc;
+}
+
 static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
 			  const char *buf, size_t len)
 {
-	return -EOPNOTSUPP;
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+	uuid_t uuid;
+	ssize_t rc;
+
+	if (!is_dynamic(dax_region))
+		return -EOPNOTSUPP;
+
+	if (sysfs_streq(buf, "0"))
+		uuid_copy(&uuid, &uuid_null);
+	else {
+		rc = uuid_parse(buf, &uuid);
+		if (rc)
+			return rc;
+	}
+
+	ACQUIRE(rwsem_write_kill, region_rwsem)(&dax_region_rwsem);
+	if ((rc = ACQUIRE_ERR(rwsem_write_kill, &region_rwsem)))
+		return rc;
+
+	if (!dax_region->dev->driver)
+		return -ENXIO;
+
+	ACQUIRE(rwsem_write_kill, dev_rwsem)(&dax_dev_rwsem);
+	if ((rc = ACQUIRE_ERR(rwsem_write_kill, &dev_rwsem)))
+		return rc;
+
+	/* A claimed device already has capacity; do not overwrite its uuid. */
+	if (dev_dax_size(dev_dax))
+		return -EBUSY;
+
+	if (uuid_is_null(&uuid))
+		rc = uuid_claim_untagged(dax_region, dev_dax);
+	else
+		rc = uuid_claim_tagged(dax_region, dev_dax, &uuid);
+
+	return rc < 0 ? rc : len;
 }
 static DEVICE_ATTR_RW(uuid);
 
@@ -1661,8 +1913,12 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 		return 0;
 	if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
 		return 0;
-	if ((a == &dev_attr_align.attr ||
-	     a == &dev_attr_size.attr) && is_static(dax_region))
+	if (a == &dev_attr_uuid.attr && !is_dynamic(dax_region))
+		return 0444;
+	if (a == &dev_attr_align.attr &&
+	    (is_static(dax_region) || is_dynamic(dax_region)))
+		return 0444;
+	if (a == &dev_attr_size.attr && is_static(dax_region))
 		return 0444;
 	return a->mode;
 }
-- 
2.43.0



* [PATCH v11 27/31] cxl/region: Read existing extents on region creation
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (24 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:32   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
                   ` (4 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash.  In this case it is expected
that the creation of a new region on top of a DC partition can read
those extents and surface them for continued use.

Once all endpoint decoders are part of a region and the region is being
realized, a read of the device's extent list can reveal these
previously accepted extents.

CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
this purpose.  The call returns all the extents for all dynamic capacity
partitions.  If the fabric manager is adding extents to any DCD
partition, the extent list for the recovered region may change; the
generation number changing between queries is detected and the read is
retried.
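
As a rough illustration of the read-and-retry behaviour described above,
here is a minimal user-space sketch.  It is not the kernel code:
struct mock_dev, mock_query(), read_list_once() and read_list() are
hypothetical names, and the mock device simply bumps its generation
number after a chosen number of queries to stand in for a fabric manager
changing the list mid-read.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a device's extent list; not a kernel type. */
struct mock_dev {
	uint32_t total;        /* extents currently on the device */
	uint32_t gen;          /* generation number, bumped on any change */
	uint32_t page;         /* max extents returned per query */
	uint32_t change_after; /* bump gen on this query number (0 = never) */
	uint32_t queries;      /* queries issued so far */
};

/* One mock "get extent list" query; returns the number of extents returned. */
static uint32_t mock_query(struct mock_dev *d, uint32_t start,
			   uint32_t *total, uint32_t *gen)
{
	uint32_t left;

	d->queries++;
	if (d->change_after && d->queries == d->change_after) {
		d->gen++;		/* the list changed mid-read */
		d->change_after = 0;
	}
	*total = d->total;
	*gen = d->gen;
	left = start < d->total ? d->total - start : 0;
	return left < d->page ? left : d->page;
}

/* Read the whole list once; -EAGAIN if it changed, -EIO on no progress. */
static int read_list_once(struct mock_dev *d, uint32_t *nr_read)
{
	uint32_t index = 0, first_total = 0, first_gen = 0;
	int first = 1;

	*nr_read = 0;
	do {
		uint32_t total, gen, n = mock_query(d, index, &total, &gen);

		if (first) {
			first_total = total;
			first_gen = gen;
			first = 0;
		}
		if (gen != first_gen || total != first_total)
			return -EAGAIN;
		if (!n && first_total > *nr_read)
			return -EIO;
		*nr_read += n;
		index += n;
	} while (first_total > *nr_read);

	return 0;
}

/* Retry a changed list, giving up after max_attempts total attempts. */
static int read_list(struct mock_dev *d, int max_attempts, uint32_t *nr_read)
{
	int rc;

	do {
		rc = read_list_once(d, nr_read);
	} while (rc == -EAGAIN && --max_attempts);

	return rc;
}
```

In this sketch each retry restarts from index 0 and discards whatever was
read so far, so a caller only ever sees a list taken from a single
generation.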

Process the existing extents inside the asynchronous cxl_dax_region
probe rather than at region-creation time.  Reading them at creation
races the probe.

New add events for a region are deferred until that region's pre-existing
extents have been read so a tag already in use is never registered twice.
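
The deferral can be pictured with a small single-threaded model.  This is
only a sketch under assumptions: struct mock_region, mock_add_extent()
and mock_scan_existing() are hypothetical, and C11 acquire/release
atomics stand in for the smp_load_acquire()/smp_store_release() pair the
patch uses on extents_scanned.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a region; field names are illustrative only. */
struct mock_region {
	atomic_bool extents_scanned;	/* set once the initial scan is done */
	int nr_extents;			/* extents registered so far */
};

/* Add one extent; new (non-recovered) adds wait for the initial scan. */
static int mock_add_extent(struct mock_region *r, bool existing)
{
	if (!existing &&
	    !atomic_load_explicit(&r->extents_scanned, memory_order_acquire))
		return -EBUSY;

	r->nr_extents++;
	return 0;
}

/* Initial scan: register recovered extents, then open the gate. */
static int mock_scan_existing(struct mock_region *r, int nr_recovered)
{
	for (int i = 0; i < nr_recovered; i++) {
		int rc = mock_add_extent(r, true);

		if (rc)
			return rc;
	}

	/* Release pairs with the acquire load in mock_add_extent(). */
	atomic_store_explicit(&r->extents_scanned, true, memory_order_release);
	return 0;
}
```

A new add that arrives before the scan completes gets -EBUSY in this
model; what the caller does with that result is outside the sketch.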

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>

---
Changes:
1. request a full buffer (max_extent_count) per Get DC Extent List
   call rather than max(buffer, remaining).
2. latch the first per-extent error rather than the last.
3. cap the -EAGAIN retry at CXL_READ_EXTENT_LIST_RETRY (10) total
   attempts (was off by one, 11).
4. register cxlr_dax_unregister before processing existing extents
   so a failure there fails region creation cleanly.
5. recovered extents are processed with existing=true so they are
   not re-acknowledged via Add-DC-Response (the device rejects a DPA
   already added by a prior response, CXL r4.0 8.2.10.9.9.3); a failed
   online still releases them.
6. hold add_ctx.lock while consuming pending_extents in
   __cxl_process_extent_list(), the same lock handle_add_event() holds.
7. process existing extents inside the dax_region probe (with the
   __cxlr_notify_extent() lock-held core) instead of at region
   creation to avoid racing with probe.
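
For item 7, the split between a lock-taking wrapper and a lock-held core
can be sketched in user space as below.  Everything here is hypothetical
(struct mock_lock, mock_notify(), mock_notify_locked(), mock_probe());
the toy lock returns -EDEADLK where a real non-recursive lock would
simply hang.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical non-recursive lock; a second acquire reports the deadlock. */
struct mock_lock {
	bool held;
};

static int mock_lock_acquire(struct mock_lock *l)
{
	if (l->held)
		return -EDEADLK;	/* a real mutex would hang here */
	l->held = true;
	return 0;
}

static void mock_lock_release(struct mock_lock *l)
{
	l->held = false;
}

struct mock_dev {
	struct mock_lock lock;
	int notified;
};

/* Core: the caller must already hold dev->lock. */
static int mock_notify_locked(struct mock_dev *dev)
{
	assert(dev->lock.held);
	dev->notified++;
	return 0;
}

/* Wrapper for callers that do not hold dev->lock. */
static int mock_notify(struct mock_dev *dev)
{
	int rc = mock_lock_acquire(&dev->lock);

	if (rc)
		return rc;
	rc = mock_notify_locked(dev);
	mock_lock_release(&dev->lock);
	return rc;
}

/* Probe-like context: the lock is held for the whole call. */
static int mock_probe(struct mock_dev *dev, bool use_core)
{
	int rc = mock_lock_acquire(&dev->lock);

	if (rc)
		return rc;
	rc = use_core ? mock_notify_locked(dev) : mock_notify(dev);
	mock_lock_release(&dev->lock);
	return rc;
}
```

Calling the wrapper from the probe-like context fails in this model,
which is the self-deadlock the lock-held variant is meant to avoid.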
---
 drivers/cxl/core/core.h       |  15 +++-
 drivers/cxl/core/extent.c     |  35 ++++++--
 drivers/cxl/core/mbox.c       | 155 +++++++++++++++++++++++++++++++++-
 drivers/cxl/core/region_dax.c |  40 ++++++++-
 drivers/cxl/cxl.h             |   7 ++
 drivers/cxl/cxlmem.h          |  21 +++++
 drivers/dax/cxl.c             |  16 +++-
 7 files changed, 273 insertions(+), 16 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1a2bc22ad3cc..29c92e45972c 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -26,6 +26,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
 	return to_cxl_memdev_state(cxlmd->cxlds);
 }
 
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
+
 #ifdef CONFIG_CXL_REGION
 
 int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
@@ -66,13 +68,15 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr);
 int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
 
 int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
-		   u16 seq_num);
+		   u16 seq_num, bool existing);
 bool cxl_tag_already_committed(const uuid_t *tag);
 int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
 int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release);
 void rm_tag_group(struct cxl_dc_tag_group *group);
 int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
 		       struct cxl_dc_tag_group *group);
+int __cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+			 struct cxl_dc_tag_group *group);
 #else
 static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 				 const struct cxl_memdev *cxlmd, u64 dpa)
@@ -80,7 +84,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 	return ULLONG_MAX;
 }
 static inline int cxl_add_extent(struct cxl_memdev_state *mds,
-				 struct cxl_extent *extent, u16 seq_num)
+				 struct cxl_extent *extent, u16 seq_num,
+				 bool existing)
 {
 	return 0;
 }
@@ -105,6 +110,12 @@ static inline int cxlr_notify_extent(struct cxl_region *cxlr,
 {
 	return 0;
 }
+static inline int __cxlr_notify_extent(struct cxl_region *cxlr,
+				       enum dc_event event,
+				       struct cxl_dc_tag_group *group)
+{
+	return 0;
+}
 static inline
 struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
 				     struct cxl_endpoint_decoder **cxled)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 7009ac6a51b4..03ae9473b461 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -261,7 +261,7 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
 			       struct cxl_extent *extent,
 			       struct cxl_endpoint_decoder **out_cxled,
 			       struct cxl_dax_region **out_cxlr_dax,
-			       struct range *out_ext_range)
+			       struct range *out_ext_range, bool existing)
 {
 	u64 start_dpa = le64_to_cpu(extent->start_dpa);
 	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
@@ -300,6 +300,13 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
 	if (!cxlr || !cxlr->cxlr_dax)
 		return -ENXIO;
 
+	/*
+	 * Pre-existing extents must be read before any new extent is added so a
+	 * tag already in use is never added twice; defer new adds until then.
+	 */
+	if (!existing && !smp_load_acquire(&cxlr->cxlr_dax->extents_scanned))
+		return -EBUSY;
+
 	ed_range = (struct range) {
 		.start = cxled->dpa_res->start,
 		.end = cxled->dpa_res->end,
@@ -376,16 +383,22 @@ dc_extent_build(struct cxl_endpoint_decoder *cxled,
 	return dc_extent;
 }
 
-int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
-		       struct cxl_dc_tag_group *group)
+/*
+ * Core notify: the caller must hold device_lock(&cxlr->cxlr_dax->dev).  Used by
+ * the existing-extent path that runs inside cxl_dax_region_probe(), where the
+ * async device-attach already holds the dax_region's device_lock — taking it
+ * again (as cxlr_notify_extent() does) would deadlock the probe against itself.
+ */
+int __cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+			 struct cxl_dc_tag_group *group)
 {
 	struct device *dev = &cxlr->cxlr_dax->dev;
 	struct cxl_notify_data notify_data;
 	struct cxl_driver *driver;
 
-	dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid);
+	device_lock_assert(dev);
 
-	guard(device)(dev);
+	dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid);
 
 	/*
 	 * The lack of a driver indicates a notification has failed.  No user
@@ -406,6 +419,13 @@ int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
 	return driver->notify(dev, &notify_data);
 }
 
+int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+		       struct cxl_dc_tag_group *group)
+{
+	guard(device)(&cxlr->cxlr_dax->dev);
+	return __cxlr_notify_extent(cxlr, event, group);
+}
+
 /*
  * Stage 4: insert @dc_extent into the pending tag group.  All extents in
  * one More-chain group share a UUID — enforced here as the group is
@@ -465,7 +485,7 @@ static int cxlr_add_extent(struct cxl_memdev_state *mds,
  * and <0 on error
  */
 int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
-		   u16 seq_num)
+		   u16 seq_num, bool existing)
 {
 	struct cxl_endpoint_decoder *cxled;
 	struct cxl_dax_region *cxlr_dax;
@@ -475,7 +495,8 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
 
 	guard(rwsem_read)(&cxl_rwsem.region);
 
-	rc = cxl_validate_extent(mds, extent, &cxled, &cxlr_dax, &ext_range);
+	rc = cxl_validate_extent(mds, extent, &cxled, &cxlr_dax, &ext_range,
+				 existing);
 	if (rc)
 		return rc;
 
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 79258681d428..6f0d776e7e78 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1503,7 +1503,7 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 		else
 			seq_num++;
 
-		if (cxl_add_extent(mds, pos->extent, seq_num) < 0) {
+		if (cxl_add_extent(mds, pos->extent, seq_num, existing) < 0) {
 			dev_dbg(dev,
 				"Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
 				tag,
@@ -1527,8 +1527,18 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
 		return rc;
 	}
 
-	rc = cxlr_notify_extent(tag_group->cxlr_dax->cxlr, DCD_ADD_CAPACITY,
-				tag_group);
+	/*
+	 * The @existing path runs inside cxl_dax_region_probe() with the
+	 * dax_region's device_lock already held, so use the lock-held notify
+	 * variant to avoid re-acquiring it (which would deadlock the async
+	 * probe against itself).  The runtime add-event path holds no such lock.
+	 */
+	if (existing)
+		rc = __cxlr_notify_extent(tag_group->cxlr_dax->cxlr,
+					  DCD_ADD_CAPACITY, tag_group);
+	else
+		rc = cxlr_notify_extent(tag_group->cxlr_dax->cxlr,
+					DCD_ADD_CAPACITY, tag_group);
 	if (rc) {
 		/*
 		 * The dax-side notification failed; tear down the tag group.
@@ -2199,6 +2209,145 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
 
+/* Return -EAGAIN if the extent list changes while reading */
+static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+	u32 current_index, total_read, total_expected, initial_gen_num;
+	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_mbox_cmd mbox_cmd;
+	u32 max_extent_count;
+	int rc = 0;
+	bool first = true;
+
+	struct cxl_mbox_get_extent_out *extents __free(kvfree) =
+				kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+	if (!extents)
+		return -ENOMEM;
+
+	/*
+	 * Build and consume add_ctx.pending_extents under add_ctx.lock, the
+	 * same lock the DC event path (handle_add_event()) holds, so the two
+	 * cannot corrupt the shared pending list.
+	 */
+	guard(mutex)(&mds->add_ctx.lock);
+
+	total_read = 0;
+	current_index = 0;
+	total_expected = 0;
+	max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
+			    sizeof(struct cxl_extent);
+	do {
+		u32 nr_returned, current_total, current_gen_num;
+		struct cxl_mbox_get_extent_in get_extent;
+
+		get_extent = (struct cxl_mbox_get_extent_in) {
+			.extent_cnt = cpu_to_le32(max_extent_count),
+			.start_extent_index = cpu_to_le32(current_index),
+		};
+
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+			.payload_in = &get_extent,
+			.size_in = sizeof(get_extent),
+			.size_out = cxl_mbox->payload_size,
+			.payload_out = extents,
+			.min_out = 1,
+		};
+
+		rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+		if (rc < 0)
+			goto out;
+
+		/* Save initial data */
+		if (first) {
+			total_expected = le32_to_cpu(extents->total_extent_count);
+			initial_gen_num = le32_to_cpu(extents->generation_num);
+			first = false;
+		}
+
+		nr_returned = le32_to_cpu(extents->returned_extent_count);
+		total_read += nr_returned;
+		current_total = le32_to_cpu(extents->total_extent_count);
+		current_gen_num = le32_to_cpu(extents->generation_num);
+
+		dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
+			current_index, total_read - 1, current_total, current_gen_num);
+
+		if (current_gen_num != initial_gen_num || total_expected != current_total) {
+			dev_warn(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
+				 current_gen_num, initial_gen_num,
+				 total_expected, current_total);
+			rc = -EAGAIN;
+			goto out;
+		}
+
+		/* No progress with more expected: a buggy device would loop forever. */
+		if (!nr_returned && total_expected > total_read) {
+			dev_warn(dev, "Device returned 0 of %u remaining extents\n",
+				 total_expected - total_read);
+			rc = -EIO;
+			goto out;
+		}
+
+		for (int i = 0; i < nr_returned ; i++) {
+			struct cxl_extent *extent = &extents->extent[i];
+
+			dev_dbg(dev, "Processing extent %d/%d\n",
+				current_index + i, total_expected);
+
+			rc = add_to_pending_list(&mds->add_ctx.pending_extents,
+						 extent);
+			if (rc)
+				goto out;
+		}
+
+		current_index += nr_returned;
+	} while (total_expected > total_read);
+
+	if (!list_empty(&mds->add_ctx.pending_extents)) {
+		/*
+		 * Reached only on the success path (every error does goto out),
+		 * so rc is 0 here.  These extents are already accepted on the
+		 * device (recovered from a prior boot).  Pass existing=true so
+		 * they are not re-reported in an Add-DC-Response (the device
+		 * would reject a DPA already added by a prior response), and so
+		 * a failed online releases them rather than silently dropping
+		 * them.
+		 */
+		rc = cxl_add_pending(mds, true);
+	}
+out:
+	clear_pending_extents(mds);
+
+	return rc;
+}
+
+#define CXL_READ_EXTENT_LIST_RETRY 10
+
+/**
+ * cxl_process_extent_list() - Read existing extents
+ * @cxled: Endpoint decoder which is part of a region
+ *
+ * Issue the Get Dynamic Capacity Extent List command to the device
+ * and add existing extents if found.
+ *
+ * A retry of 10 is somewhat arbitrary, however, extent changes should be
+ * relatively rare while bringing up a region.  So 10 should be plenty.
+ */
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+	int retry = CXL_READ_EXTENT_LIST_RETRY;
+	int rc;
+
+	do {
+		rc = __cxl_process_extent_list(cxled);
+	} while (rc == -EAGAIN && --retry);
+
+	return rc;
+}
+
 static void add_part(struct cxl_dpa_info *info, u64 start, u64 size,
 		     enum cxl_partition_mode mode, u8 handle)
 {
diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
index 70b086d50451..c614f5458330 100644
--- a/drivers/cxl/core/region_dax.c
+++ b/drivers/cxl/core/region_dax.c
@@ -82,6 +82,38 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
 	device_unregister(&cxlr_dax->dev);
 }
 
+/*
+ * Process existing extents from the probe, not region creation: the probe is
+ * async, and attaching extent devres before really_probe() runs trips its
+ * "resources present" -EBUSY gate, so the dax_region never binds.
+ */
+int cxl_region_add_existing_extents(struct cxl_region *cxlr)
+{
+	struct cxl_region_params *p = &cxlr->params;
+	int i, latched_rc = 0;
+
+	for (i = 0; i < p->nr_targets; i++) {
+		struct device *dev = &p->targets[i]->cxld.dev;
+		int rc;
+
+		rc = cxl_process_extent_list(p->targets[i]);
+		if (rc) {
+			dev_err(dev, "Existing extent processing failed %d\n",
+				rc);
+			/* Process every target, but report the first error. */
+			if (!latched_rc)
+				latched_rc = rc;
+		}
+	}
+
+	/* Pre-existing extents are read; new add events may now proceed. */
+	if (!latched_rc)
+		smp_store_release(&cxlr->cxlr_dax->extents_scanned, true);
+
+	return latched_rc;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_region_add_existing_extents, "CXL");
+
 int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 {
 	struct device *dev;
@@ -110,6 +142,10 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
 		dev_name(dev));
 
-	return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
-					no_free_ptr(cxlr_dax));
+	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
+				      no_free_ptr(cxlr_dax));
+	if (rc)
+		return rc;
+
+	return 0;
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 1bb861bb23fe..07ecb0e1888b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -579,6 +579,8 @@ struct cxl_dax_region {
 	 * driver handles.
 	 */
 	struct xarray dc_extents;
+	/* Set once the probe has read the device's pre-existing extents. */
+	bool extents_scanned;
 };
 
 /**
@@ -959,6 +961,7 @@ bool is_cxl_pmem_region(struct device *dev);
 struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
 int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
 struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
+int cxl_region_add_existing_extents(struct cxl_region *cxlr);
 u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
 bool cxl_region_contains_resource(const struct resource *res);
 #else
@@ -978,6 +981,10 @@ static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
 {
 	return NULL;
 }
+static inline int cxl_region_add_existing_extents(struct cxl_region *cxlr)
+{
+	return 0;
+}
 static inline u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint,
 					       u64 spa)
 {
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 81498d47f309..414a20b3522e 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -570,6 +570,27 @@ struct cxl_mbox_dc_response {
 	} __packed extent_list[] __counted_by(extent_list_size);
 } __packed;
 
+/*
+ * Get Dynamic Capacity Extent List; Input Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
+ */
+struct cxl_mbox_get_extent_in {
+	__le32 extent_cnt;
+	__le32 start_extent_index;
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Output Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
+ */
+struct cxl_mbox_get_extent_out {
+	__le32 returned_extent_count;
+	__le32 total_extent_count;
+	__le32 generation_num;
+	u8 rsvd[4];
+	struct cxl_extent extent[];
+} __packed;
+
 struct cxl_mbox_get_supported_logs {
 	__le16 entries;
 	u8 rsvd[6];
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index d885b6e698ef..54fa7630231a 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -114,11 +114,23 @@ static int cxl_dax_region_probe(struct device *dev)
 	if (!dax_region)
 		return -ENOMEM;
 
-	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
+	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1) {
+		int rc;
+
+		/*
+		 * Run inside the probe, not at region creation: attaching extent
+		 * devres before really_probe() trips its "resources present"
+		 * -EBUSY gate.  The notify path adds the dax_region resources.
+		 */
+		rc = cxl_region_add_existing_extents(cxlr);
+		if (rc)
+			return rc;
+
 		/* Add empty seed dax device */
 		dev_size = 0;
-	else
+	} else {
 		dev_size = range_len(&cxlr_dax->hpa_range);
+	}
 
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
-- 
2.43.0


* [PATCH v11 28/31] cxl/mem: Trace Dynamic capacity Event Record
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (25 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 27/31] cxl/region: Read existing extents on region creation Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:29   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
                   ` (3 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su, Fan Ni

From: Ira Weiny <iweiny@kernel.org>

CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
User space can use trace events to debug dynamic capacity changes.

Add DC trace points to the trace log.

Based on an original patch by Navneet Singh.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <nifan.cxl@gmail.com>

---
Changes:
[anisa: bump comments to reference the CXL r4.0 spec (section
 8.2.10.2.1.6, Table 8-229).]
[anisa: fix partition_index to updated_region_index, reporting it only
 for Region Configuration Updated events (U8_MAX otherwise).]
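
The index selection in the second note amounts to the following, shown
as a standalone sketch.  traced_region_index() is a hypothetical helper
that does not exist in the patch; the event type values are the ones the
patch adds to trace.h, and UINT8_MAX plays the role of U8_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Event type values as added to trace.h by this patch. */
#define CXL_DC_ADD_CAPACITY		0x00
#define CXL_DC_REL_CAPACITY		0x01
#define CXL_DC_FORCED_REL_CAPACITY	0x02
#define CXL_DC_REG_CONF_UPDATED		0x03

/* Hypothetical helper: the value the trace point would report. */
static uint8_t traced_region_index(uint8_t event_type, uint8_t raw_index)
{
	if (event_type == CXL_DC_REG_CONF_UPDATED)
		return raw_index;

	return UINT8_MAX;	/* field is reserved for other event types */
}
```

A consumer of the trace output can then treat 255 as "no index reported"
rather than as a region index.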
---
 drivers/cxl/core/mbox.c  |  4 +++
 drivers/cxl/core/trace.h | 75 ++++++++++++++++++++++++++++++++++++++++
 include/cxl/event.h      |  4 +--
 3 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 6f0d776e7e78..2e56ab639100 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1027,6 +1027,10 @@ static void __cxl_event_trace_record(struct cxl_memdev *cxlmd,
 		ev_type = CXL_CPER_EVENT_MEM_MODULE;
 	else if (uuid_equal(uuid, &CXL_EVENT_MEM_SPARING_UUID))
 		ev_type = CXL_CPER_EVENT_MEM_SPARING;
+	else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
+		trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
+		return;
+	}
 
 	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
 }
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a972e4ef1936..e5b88887d11b 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -1099,6 +1099,81 @@ TRACE_EVENT(cxl_poison,
 	)
 );
 
+/*
+ * Dynamic Capacity Event Record - DER
+ *
+ * CXL r4.0 section 8.2.10.2.1.6 Table 8-229
+ */
+
+#define CXL_DC_ADD_CAPACITY			0x00
+#define CXL_DC_REL_CAPACITY			0x01
+#define CXL_DC_FORCED_REL_CAPACITY		0x02
+#define CXL_DC_REG_CONF_UPDATED			0x03
+#define show_dc_evt_type(type)	__print_symbolic(type,		\
+	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
+	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
+	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
+	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+		 struct cxl_event_dcd *rec),
+
+	TP_ARGS(cxlmd, log, rec),
+
+	TP_STRUCT__entry(
+		CXL_EVT_TP_entry
+
+		/* Dynamic capacity Event */
+		__field(u8, event_type)
+		__field(u16, hostid)
+		__field(u8, updated_region_index)
+		__field(u64, dpa_start)
+		__field(u64, length)
+		__array(u8, uuid, UUID_SIZE)
+		__field(u16, sh_extent_seq)
+	),
+
+	TP_fast_assign(
+		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+		__entry->hdr_uuid = CXL_EVENT_DC_EVENT_UUID;
+
+		/* Dynamic_capacity Event */
+		__entry->event_type = rec->event_type;
+
+		/* DCD event record data */
+		__entry->hostid = le16_to_cpu(rec->host_id);
+		/*
+		 * The Updated Region Index is only defined for Region
+		 * Configuration Updated events (Table 8-229); report U8_MAX
+		 * (not a valid index) for other event types where the field
+		 * is reserved.
+		 */
+		if (rec->event_type == CXL_DC_REG_CONF_UPDATED)
+			__entry->updated_region_index = rec->updated_region_index;
+		else
+			__entry->updated_region_index = U8_MAX;
+		__entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
+		__entry->length = le64_to_cpu(rec->extent.length);
+		memcpy(__entry->uuid, &rec->extent.uuid, UUID_SIZE);
+		__entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
+	),
+
+	CXL_EVT_TP_printk("event_type='%s' host_id='%d' updated_region_index='%d' " \
+		"starting_dpa=%llx length=%llx tag=%pU " \
+		"shared_extent_sequence=%d",
+		show_dc_evt_type(__entry->event_type),
+		__entry->hostid,
+		__entry->updated_region_index,
+		__entry->dpa_start,
+		__entry->length,
+		__entry->uuid,
+		__entry->sh_extent_seq
+	)
+);
+
 #endif /* _CXL_EVENTS_H */
 
 #define TRACE_INCLUDE_FILE trace
diff --git a/include/cxl/event.h b/include/cxl/event.h
index fa3cd895f656..601eae40def9 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -161,7 +161,7 @@ struct cxl_extent_list_node {
 
 /*
  * Dynamic Capacity Event Record
- * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ * CXL r4.0 section 8.2.10.2.1.6; Table 8-229
  */
 #define CXL_DCD_EVENT_MORE			BIT(0)
 struct cxl_event_dcd {
@@ -169,7 +169,7 @@ struct cxl_event_dcd {
 	u8 event_type;
 	u8 validity_flags;
 	__le16 host_id;
-	u8 partition_index;
+	u8 updated_region_index;
 	u8 flags;
 	u8 reserved1[0x2];
 	struct cxl_extent extent;
-- 
2.43.0


* [PATCH v11 29/31] tools/testing/cxl: Make event logs dynamic
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (26 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:31   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
                   ` (2 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

The event logs test was created as static arrays as an easy way to mock
events.  Dynamic Capacity Device (DCD) test support requires events be
generated dynamically when extents are created or destroyed.

The current event log test has specific checks for the number of events
seen including log overflow.

Modify mock event logs to be dynamically allocated.  Adjust array size
and mock event entry data to match the output expected by the existing
event test.

Use the static event data to create the dynamic events in the new logs
without inventing complex event injection for the previous tests.

Simplify log processing by using the event log array index as the
handle.  Add a lock to manage concurrency required when user space is
allowed to control DCD extents.
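
The index-as-handle scheme can be modelled in user space as a ring that
skips slot 0.  This is a sketch, not the test code: struct mock_log and
the log_*() helpers are hypothetical, no event payloads are stored, and
it assumes both handles start at 1, which this patch excerpt does not
show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_CNT_MAX		16
#define EVENT_ARRAY_SIZE	(EVENT_CNT_MAX + 1)	/* slot 0 is never used */

struct mock_log {
	uint16_t current_handle;	/* oldest pending entry */
	uint16_t last_handle;		/* next slot to be filled */
	uint16_t nr_overflow;		/* adds rejected because the log was full */
};

/* Advance a handle, wrapping past 0 because 0 is not a valid handle. */
static uint16_t inc_handle(uint16_t handle)
{
	handle = (handle + 1) % EVENT_ARRAY_SIZE;
	return handle ? handle : 1;
}

static void log_init(struct mock_log *log)
{
	/* Assumption: both handles start at 1. */
	log->current_handle = 1;
	log->last_handle = 1;
	log->nr_overflow = 0;
}

/* Returns true if the event was stored, false if it overflowed. */
static bool log_add(struct mock_log *log)
{
	if (inc_handle(log->last_handle) == log->current_handle) {
		log->nr_overflow++;
		return false;
	}
	log->last_handle = inc_handle(log->last_handle);
	return true;
}

/* Clear the oldest pending event; false if the log is empty. */
static bool log_clear_one(struct mock_log *log)
{
	if (log->current_handle == log->last_handle)
		return false;
	log->current_handle = inc_handle(log->current_handle);
	return true;
}

/* Count pending events by walking from the oldest to the next free slot. */
static unsigned int log_pending(const struct mock_log *log)
{
	unsigned int n = 0;

	for (uint16_t h = log->current_handle; h != log->last_handle;
	     h = inc_handle(h))
		n++;
	return n;
}
```

Because one usable slot is kept free to tell a full log from an empty
one, this model holds at most 15 pending events with 16 usable handles.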

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---
Changes:
[anisa: use lockdep_assert_held() in mes_del_event(), and return on the
 empty-record error instead of falling through.]

[anisa: add_event_from_static() returns int so an event allocation
 failure propagates up and fails the probe instead of being swallowed.]
---
 tools/testing/cxl/test/mem.c | 284 ++++++++++++++++++++++-------------
 1 file changed, 178 insertions(+), 106 deletions(-)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 271c7ad8cc32..a2bfd52db076 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -142,18 +142,26 @@ static struct {
 
 #define PASS_TRY_LIMIT 3
 
-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 16
+/* 1 extra slot to accommodate that handles can't be 0 */
+#define CXL_TEST_EVENT_ARRAY_SIZE (CXL_TEST_EVENT_CNT_MAX + 1)
 
 /* Set a number of events to return at a time for simulation.  */
 #define CXL_TEST_EVENT_RET_MAX 4
 
+/*
+ * @last_handle: last handle (index) to have an entry stored
+ * @current_handle: current handle (index) to be returned to the user on get_event
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
 struct mock_event_log {
-	u16 clear_idx;
-	u16 cur_idx;
-	u16 nr_events;
+	u16 last_handle;
+	u16 current_handle;
 	u16 nr_overflow;
-	u16 overflow_reset;
-	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+	rwlock_t lock;
+	struct cxl_event_record_raw *events[CXL_TEST_EVENT_ARRAY_SIZE];
 };
 
 struct mock_event_store {
@@ -194,56 +202,67 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
 	return &mdata->mes.mock_logs[log_type];
 }
 
-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
-	return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
-	log->cur_idx = 0;
-	log->clear_idx = 0;
-	log->nr_overflow = log->overflow_reset;
-}
-
 /* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
+static u16 event_inc_handle(u16 handle)
 {
-	return log->clear_idx + 1;
+	handle = (handle + 1) % CXL_TEST_EVENT_ARRAY_SIZE;
+	if (handle == 0)
+		handle = 1;
+	return handle;
 }
 
-/* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
-{
-	u16 cur_handle = log->cur_idx + 1;
-
-	return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
-	return log->cur_idx == log->nr_events;
-}
-
-static void mes_add_event(struct mock_event_store *mes,
+/* Add the event or free it on overflow */
+static void mes_add_event(struct cxl_mockmem_data *mdata,
 			  enum cxl_event_log_type log_type,
 			  struct cxl_event_record_raw *event)
 {
+	struct device *dev = mdata->mds->cxlds.dev;
 	struct mock_event_log *log;
 
 	if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
 		return;
 
-	log = &mes->mock_logs[log_type];
+	log = &mdata->mes.mock_logs[log_type];
 
-	if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+	guard(write_lock)(&log->lock);
+
+	dev_dbg(dev, "Add log %d cur %d last %d\n",
+		log_type, log->current_handle, log->last_handle);
+
+	/* Check next buffer */
+	if (event_inc_handle(log->last_handle) == log->current_handle) {
 		log->nr_overflow++;
-		log->overflow_reset = log->nr_overflow;
+		dev_dbg(dev, "Overflowing log %d nr %d\n",
+			log_type, log->nr_overflow);
+		devm_kfree(dev, event);
 		return;
 	}
 
-	log->events[log->nr_events] = event;
-	log->nr_events++;
+	dev_dbg(dev, "Log %d; handle %u\n", log_type, log->last_handle);
+	event->event.generic.hdr.handle = cpu_to_le16(log->last_handle);
+	log->events[log->last_handle] = event;
+	log->last_handle = event_inc_handle(log->last_handle);
+}
+
+static void mes_del_event(struct device *dev,
+			  struct mock_event_log *log,
+			  u16 handle)
+{
+	struct cxl_event_record_raw *record;
+
+	lockdep_assert_held(&log->lock);
+
+	dev_dbg(dev, "Clearing event %u; record %u\n",
+		handle, log->current_handle);
+	record = log->events[handle];
+	if (!record) {
+		dev_err(dev, "Mock event index %u empty?\n", handle);
+		return;
+	}
+
+	log->events[handle] = NULL;
+	log->current_handle = event_inc_handle(log->current_handle);
+	devm_kfree(dev, record);
 }
 
 /*
@@ -257,6 +276,7 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	struct cxl_get_event_payload *pl;
 	struct mock_event_log *log;
 	int ret_limit;
+	u16 handle;
 	u8 log_type;
 	int i;
 
@@ -276,22 +296,31 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	memset(cmd->payload_out, 0, struct_size(pl, records, 0));
 
 	log = event_find_log(dev, log_type);
-	if (!log || event_log_empty(log))
+	if (!log)
 		return 0;
 
 	pl = cmd->payload_out;
 
-	for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
-		memcpy(&pl->records[i], event_get_current(log),
-		       sizeof(pl->records[i]));
-		pl->records[i].event.generic.hdr.handle =
-				event_get_cur_event_handle(log);
-		log->cur_idx++;
+	guard(read_lock)(&log->lock);
+
+	handle = log->current_handle;
+	dev_dbg(dev, "Get log %d handle %u last %u\n",
+		log_type, handle, log->last_handle);
+	for (i = 0; i < ret_limit && handle != log->last_handle;
+	     i++, handle = event_inc_handle(handle)) {
+		struct cxl_event_record_raw *cur;
+
+		cur = log->events[handle];
+		dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+			log_type, le16_to_cpu(cur->event.generic.hdr.handle),
+			handle);
+		memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
+		pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
 	}
 
 	cmd->size_out = struct_size(pl, records, i);
 	pl->record_count = cpu_to_le16(i);
-	if (!event_log_empty(log))
+	if (handle != log->last_handle)
 		pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
 
 	if (log->nr_overflow) {
@@ -313,8 +342,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 {
 	struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
-	struct mock_event_log *log;
 	u8 log_type = pl->event_log;
+	struct mock_event_log *log;
 	u16 handle;
 	int nr;
 
@@ -325,23 +354,20 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	if (!log)
 		return 0; /* No mock data in this log */
 
-	/*
-	 * This check is technically not invalid per the specification AFAICS.
-	 * (The host could 'guess' handles and clear them in order).
-	 * However, this is not good behavior for the host so test it.
-	 */
-	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
-		dev_err(dev,
-			"Attempting to clear more events than returned!\n");
-		return -EINVAL;
-	}
+	guard(write_lock)(&log->lock);
 
 	/* Check handle order prior to clearing events */
-	for (nr = 0, handle = event_get_clear_handle(log);
-	     nr < pl->nr_recs;
-	     nr++, handle++) {
+	handle = log->current_handle;
+	for (nr = 0; nr < pl->nr_recs && handle != log->last_handle;
+	     nr++, handle = event_inc_handle(handle)) {
+
+		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+			log_type, handle,
+			le16_to_cpu(pl->handles[nr]));
+
 		if (handle != le16_to_cpu(pl->handles[nr])) {
-			dev_err(dev, "Clearing events out of order\n");
+			dev_err(dev, "Clearing events out of order %u %u\n",
+				handle, le16_to_cpu(pl->handles[nr]));
 			return -EINVAL;
 		}
 	}
@@ -350,25 +376,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 		log->nr_overflow = 0;
 
 	/* Clear events */
-	log->clear_idx += pl->nr_recs;
-	return 0;
-}
+	for (nr = 0; nr < pl->nr_recs; nr++)
+		mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
+	dev_dbg(dev, "Delete log %d cur %d last %d\n",
+		log_type, log->current_handle, log->last_handle);
 
-static void cxl_mock_event_trigger(struct device *dev)
-{
-	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
-	struct mock_event_store *mes = &mdata->mes;
-	int i;
-
-	for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
-		struct mock_event_log *log;
-
-		log = event_find_log(dev, i);
-		if (log)
-			event_reset_log(log);
-	}
-
-	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+	return 0;
 }
 
 struct cxl_event_record_raw maint_needed = {
@@ -509,8 +522,30 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
 	return 0;
 }
 
-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static int add_event_from_static(struct cxl_mockmem_data *mdata,
+				 enum cxl_event_log_type log_type,
+				 struct cxl_event_record_raw *raw)
+{
+	struct device *dev = mdata->mds->cxlds.dev;
+	struct cxl_event_record_raw *rec;
+
+	rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
+	if (!rec) {
+		dev_err(dev, "Failed to alloc event for log\n");
+		return -ENOMEM;
+	}
+	mes_add_event(mdata, log_type, rec);
+
+	return 0;
+}
+
+static int cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
 {
+	struct mock_event_store *mes = &mdata->mes;
+	struct device *dev = mdata->mds->cxlds.dev;
+	int rc = 0;
+
 	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK |
 			   CXL_GMER_VALID_COMPONENT | CXL_GMER_VALID_COMPONENT_ID_FORMAT,
 			   &gen_media.rec.media_hdr.validity_flags);
@@ -523,41 +558,65 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
 	put_unaligned_le16(CXL_MMER_VALID_COMPONENT | CXL_MMER_VALID_COMPONENT_ID_FORMAT,
 			   &mem_module.rec.validity_flags);
 
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_INFO);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
 		      (struct cxl_event_record_raw *)&gen_media);
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
 		      (struct cxl_event_record_raw *)&mem_module);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
 
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_FAIL);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
+		      (struct cxl_event_record_raw *)&mem_module);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&dram);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&gen_media);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&mem_module);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&dram);
 	/* Overflow this log */
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
 
-	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_FATAL);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+	rc = rc ?: add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
 		      (struct cxl_event_record_raw *)&dram);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
+
+	return rc;
+}
+
+static int cxl_mock_event_trigger(struct device *dev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct mock_event_store *mes = &mdata->mes;
+	int rc;
+
+	rc = cxl_mock_add_event_logs(mdata);
+	if (rc)
+		return rc;
+	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+
+	return 0;
 }
 
 static int mock_gsl(struct cxl_mbox_cmd *cmd)
@@ -1663,8 +1722,9 @@ static ssize_t event_trigger_store(struct device *dev,
 				   struct device_attribute *attr,
 				   const char *buf, size_t count)
 {
-	cxl_mock_event_trigger(dev);
-	return count;
+	int rc = cxl_mock_event_trigger(dev);
+
+	return rc ?: count;
 }
 static DEVICE_ATTR_WO(event_trigger);
 
@@ -1684,6 +1744,14 @@ static void cxl_mock_test_feat_init(struct cxl_mockmem_data *mdata)
 	mdata->test_feat.data = cpu_to_le32(0xdeadbeef);
 }
 
+static void init_event_log(struct mock_event_log *log)
+{
+	rwlock_init(&log->lock);
+	/* A handle can never be 0, so use 1-based indexing for handles */
+	log->current_handle = 1;
+	log->last_handle = 1;
+}
+
 static int cxl_mock_mem_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -1767,7 +1835,11 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		dev_dbg(dev, "No CXL Features discovered\n");
 
-	cxl_mock_add_event_logs(&mdata->mes);
+	for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+		init_event_log(&mdata->mes.mock_logs[i]);
+	rc = cxl_mock_add_event_logs(mdata);
+	if (rc)
+		return rc;
 
 	cxlmd = devm_cxl_add_memdev(cxlds, NULL);
 	if (IS_ERR(cxlmd))
-- 
2.43.0



* [PATCH v11 30/31] tools/testing/cxl: Add DC Regions to mock mem data
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (27 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:34   ` sashiko-bot
  2026-06-25 11:05 ` [PATCH v11 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions Anisa Su
  2026-06-25 18:00 ` [PATCH v11 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

From: Ira Weiny <iweiny@kernel.org>

cxl_test provides quick smoke and regression testing.  Most of the
complexity of Dynamic Capacity (DC) extent processing and of DC-backed
DAX regions can be exercised through cxl_test.  This includes the
management of DC regions and the DAX devices on them, the management of
extent device lifetimes, and the processing of DCD events.

The only missing functionality from this test is actual interrupt
processing.

Mock memory devices can easily mock DC information and manage fake
extent data.

Define mock_dc_partition information within the mock memory data.  Add
sysfs entries on the mock device to inject and delete extents.

The inject format is <start>:<length>:<tag>:<more>[:<seq>] where <tag>
is a UUID string (or "" / "0" for the null UUID) and <seq> is an
optional shared_extn_seq value used for sharable-partition tests
(defaults to 0).
The delete format is <start>:<length>:<uuid>.
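
As an illustration of the inject format described above, here is a
minimal userspace sketch of how such a string could be split and
validated.  It is not the kernel parser: struct inject_req, parse_u64()
and parse_inject() are hypothetical names, the tag is only length-checked
rather than validated as a UUID, and a trailing newline is stripped by
hand.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UUID_TEXT_LEN 36

/* Hypothetical userspace mirror of one inject request (not a kernel type). */
struct inject_req {
	uint64_t start;
	uint64_t length;
	char tag[UUID_TEXT_LEN + 1];	/* UUID text, "" means the null UUID */
	bool more;
	uint16_t seq;			/* shared_extn_seq, 0 when omitted */
};

/* Parse a whole field as an unsigned number (decimal, 0x hex or 0 octal). */
static int parse_u64(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	if (*s == '\0')
		return -EINVAL;
	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno || *end != '\0')
		return -EINVAL;
	*out = v;
	return 0;
}

/* Split "<start>:<length>:<tag>:<more>[:<seq>]" and fill in *req. */
static int parse_inject(const char *buf, struct inject_req *req)
{
	char tmp[128];
	char *fields[5];
	char *p = tmp;
	uint64_t more, seq = 0;
	size_t len = strlen(buf);
	int n = 0;

	assert(req);
	if (len >= sizeof(tmp))
		return -EINVAL;
	memcpy(tmp, buf, len + 1);
	if (len && tmp[len - 1] == '\n')
		tmp[len - 1] = '\0';

	/* strchr() keeps empty fields, so "" can stand for the null tag. */
	for (;;) {
		if (n == 5)
			return -EINVAL;		/* too many fields */
		fields[n++] = p;
		p = strchr(p, ':');
		if (!p)
			break;
		*p++ = '\0';
	}
	if (n < 4)
		return -EINVAL;			/* <more> is mandatory */

	if (parse_u64(fields[0], &req->start) ||
	    parse_u64(fields[1], &req->length) ||
	    parse_u64(fields[3], &more))
		return -EINVAL;
	if (n == 5 && (parse_u64(fields[4], &seq) || seq > UINT16_MAX))
		return -EINVAL;

	if (fields[2][0] == '\0' || strcmp(fields[2], "0") == 0)
		req->tag[0] = '\0';
	else if (strlen(fields[2]) == UUID_TEXT_LEN)
		strcpy(req->tag, fields[2]);
	else
		return -EINVAL;

	req->more = more != 0;
	req->seq = (uint16_t)seq;
	return 0;
}
```

Splitting with strchr() rather than strtok() is deliberate in this
sketch: strtok() collapses adjacent separators, which would make the
empty-tag form indistinguishable from a missing field.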

Directly call the event irq callback to simulate irqs to process the
test extents.

Add DC mailbox commands to the CEL and implement those commands.
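
The mock tracks each extent in one of three states: queued by the FM
through sysfs, sent to the host, and accepted by the host.  The rough
userspace model below sketches those transitions under simplifying
assumptions: a small fixed array stands in for the three xarrays, there
is no locking, and every name (struct ext_table, ext_inject(), and so
on) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ext_state { EXT_FREE, EXT_FM, EXT_SENT, EXT_ACCEPTED };

struct ext {
	uint64_t start;
	uint64_t length;
	enum ext_state state;
};

#define MAX_EXT 8

/* Zero-initialised table: every slot starts out EXT_FREE. */
struct ext_table {
	struct ext e[MAX_EXT];
};

static struct ext *ext_find(struct ext_table *t, uint64_t start,
			    enum ext_state state)
{
	assert(t);
	for (size_t i = 0; i < MAX_EXT; i++)
		if (t->e[i].state == state && t->e[i].start == start)
			return &t->e[i];
	return NULL;
}

/* FM queues an extent; a start address already in use is refused. */
static bool ext_inject(struct ext_table *t, uint64_t start, uint64_t length)
{
	struct ext *slot = NULL;

	if (!length)
		return false;
	for (size_t i = 0; i < MAX_EXT; i++) {
		if (t->e[i].state == EXT_FREE) {
			if (!slot)
				slot = &t->e[i];
		} else if (t->e[i].start == start) {
			return false;
		}
	}
	if (!slot)
		return false;
	*slot = (struct ext){ start, length, EXT_FM };
	return true;
}

/* The event was offered to the host: FM -> sent. */
static bool ext_mark_sent(struct ext_table *t, uint64_t start)
{
	struct ext *e = ext_find(t, start, EXT_FM);

	if (!e)
		return false;
	e->state = EXT_SENT;
	return true;
}

/* Host accepts: sent -> accepted.  Re-accepting is a successful no-op. */
static bool ext_accept(struct ext_table *t, uint64_t start, uint64_t length)
{
	struct ext *e = ext_find(t, start, EXT_SENT);

	if (e && e->length == length) {
		e->state = EXT_ACCEPTED;
		return true;
	}
	e = ext_find(t, start, EXT_ACCEPTED);
	return e && e->length == length;
}

/* After a response is processed, anything still "sent" was rejected. */
static void ext_clear_sent(struct ext_table *t)
{
	for (size_t i = 0; i < MAX_EXT; i++)
		if (t->e[i].state == EXT_SENT)
			t->e[i].state = EXT_FREE;
}

/* Host releases an accepted extent. */
static bool ext_release(struct ext_table *t, uint64_t start, uint64_t length)
{
	struct ext *e = ext_find(t, start, EXT_ACCEPTED);

	if (!e || e->length != length)
		return false;
	e->state = EXT_FREE;
	return true;
}
```

In this model an extent that the host leaves out of its response is
dropped by ext_clear_sent(), so a later accept of the same range fails.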

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
 tools/testing/cxl/test/cxl.c  |  12 +
 tools/testing/cxl/test/mem.c  | 825 +++++++++++++++++++++++++++++++++-
 tools/testing/cxl/test/mock.h |   9 +
 3 files changed, 845 insertions(+), 1 deletion(-)

diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 296516eecfd6..993f79175aa1 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -1432,6 +1432,18 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
 		};
 
 		dpa_perf_setup(port, &range, perf);
+
+		/*
+		 * The mock probe stamps MOCK_DC_SHARABLE_SERIAL onto exactly
+		 * one cxl_mem instance; mark its DC partition sharable so
+		 * cxl_validate_extent() routes shared-seq injects through
+		 * the sharable regime.  Every other memdev keeps its DC
+		 * partition non-sharable so the existing untagged / seq=0
+		 * tests still run on this kernel.
+		 */
+		if (cxlds->part[i].mode == CXL_PARTMODE_DYNAMIC_RAM_1 &&
+		    cxlds->serial == MOCK_DC_SHARABLE_SERIAL)
+			cxlds->part[i].shareable = true;
 	}
 
 	cxl_memdev_update_perf(cxlmd);
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index a2bfd52db076..455f8a50d581 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -13,6 +13,7 @@
 #include <crypto/sha2.h>
 #include <cxlmem.h>
 
+#include "mock.h"
 #include "trace.h"
 
 #define LSA_SIZE SZ_128K
@@ -20,6 +21,7 @@
 #define FW_SLOTS 3
 #define DEV_SIZE SZ_2G
 #define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
 
 #define MOCK_INJECT_DEV_MAX 8
 #define MOCK_INJECT_TEST_MAX 128
@@ -113,6 +115,22 @@ static struct cxl_cel_entry mock_cel[] = {
 				      EFFECT(SECURITY_CHANGE_IMMEDIATE) |
 				      EFFECT(BACKGROUND_OP)),
 	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+		.effect = CXL_CMD_EFFECT_NONE,
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+		.effect = CXL_CMD_EFFECT_NONE,
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+	},
 };
 
 /* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -173,6 +191,16 @@ struct vendor_test_feat {
 	__le32 data;
 } __packed;
 
+/*
+ * The kernel surfaces only the first DC partition reported by the
+ * device (cxl_dev_dc_identify() takes partitions[0] only), so any
+ * extents we pre-inject into a second mock partition end up rejected
+ * as "not in a valid DC partition" — and for tagged groups they also
+ * trip the partition-equality gate and drop the whole group (including
+ * the in-range member in DC0).  Keep the mock at one DC partition.
+ */
+#define NUM_MOCK_DC_REGIONS 1
+
 struct cxl_mockmem_data {
 	void *lsa;
 	void *fw;
@@ -191,6 +219,21 @@ struct cxl_mockmem_data {
 	unsigned long sanitize_timeout;
 	struct vendor_test_feat test_feat;
 	u8 shutdown_state;
+
+	struct cxl_dc_partition dc_partitions[NUM_MOCK_DC_REGIONS];
+	u32 dc_ext_generation;
+	u32 dc_offer_seq;
+	struct mutex ext_lock;
+
+	/*
+	 * Extents are in 1 of 3 states
+	 * FM (sysfs added but not sent to the host yet)
+	 * sent (sent to the host but not accepted)
+	 * accepted (by the host)
+	 */
+	struct xarray dc_fm_extents;
+	struct xarray dc_sent_extents;
+	struct xarray dc_accepted_exts;
 };
 
 static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -619,6 +662,230 @@ static int cxl_mock_event_trigger(struct device *dev)
 	return 0;
 }
 
+struct cxl_extent_data {
+	u64 dpa_start;
+	u64 length;
+	uuid_t uuid;
+	u16 shared_extn_seq;
+	u32 offer_seq;
+	bool shared;
+};
+
+/*
+ * Parse a tag string into a uuid_t.  Accepts the empty string and "0"
+ * as shorthand for the null UUID; anything else must be a UUID string
+ * uuid_parse() can understand.
+ */
+static int parse_tag(const char *tag, uuid_t *out)
+{
+	if (!tag || tag[0] == '\0' || strcmp(tag, "0") == 0) {
+		uuid_copy(out, &uuid_null);
+		return 0;
+	}
+	return uuid_parse(tag, out);
+}
+
+static int __devm_add_extent(struct device *dev, struct xarray *array,
+			     u64 start, u64 length, const char *tag,
+			     u16 shared_extn_seq, bool shared)
+{
+	struct cxl_extent_data *extent;
+	int rc;
+
+	extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+	if (!extent)
+		return -ENOMEM;
+
+	extent->dpa_start = start;
+	extent->length = length;
+	rc = parse_tag(tag, &extent->uuid);
+	if (rc) {
+		dev_err(dev, "Failed to parse tag '%s'\n", tag);
+		devm_kfree(dev, extent);
+		return rc;
+	}
+	extent->shared_extn_seq = shared_extn_seq;
+	extent->shared = shared;
+
+	if (xa_insert(array, start, extent, GFP_KERNEL)) {
+		devm_kfree(dev, extent);
+		dev_err(dev, "Failed xarray insert %#llx\n", start);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int devm_add_fm_extent(struct device *dev, u64 start, u64 length,
+			      const char *tag, u16 shared_extn_seq, bool shared)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+	guard(mutex)(&mdata->ext_lock);
+	return __devm_add_extent(dev, &mdata->dc_fm_extents, start, length,
+				 tag, shared_extn_seq, shared);
+}
+
+static int dc_accept_extent(struct device *dev, u64 start, u64 length)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *ext;
+
+	lockdep_assert_held(&mdata->ext_lock);
+
+	dev_dbg(dev, "Host accepting extent %#llx\n", start);
+	ext = xa_load(&mdata->dc_sent_extents, start);
+	if (!ext || ext->length != length) {
+		/*
+		 * The host may re-accept extents we already moved into the
+		 * accepted xarray (e.g. pre-injected extents replayed on
+		 * region creation).  Treat that as a successful no-op so
+		 * the existing-extent ingest path doesn't abort.
+		 */
+		ext = xa_load(&mdata->dc_accepted_exts, start);
+		if (ext && ext->length == length)
+			return 0;
+		dev_err(dev, "Extent %#llx-%#llx not found\n",
+			start, start + length);
+		return -EINVAL;
+	}
+	mdata->dc_ext_generation++;
+	xa_erase(&mdata->dc_sent_extents, start);
+	return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+	struct cxl_mockmem_data *mdata = md;
+
+	xa_destroy(&mdata->dc_fm_extents);
+	xa_destroy(&mdata->dc_sent_extents);
+	xa_destroy(&mdata->dc_accepted_exts);
+}
+
+/* Pretend to have some previous accepted extents */
+struct pre_ext_info {
+	u64 offset;
+	u64 length;
+	const char *tag;
+} pre_ext_info[] = {
+	{
+		.offset = SZ_128M,
+		.length = SZ_64M,
+		.tag = "",
+	},
+	{
+		.offset = SZ_256M,
+		.length = SZ_64M,
+		.tag = "deadbeef-cafe-baad-f00d-fedcba987654",
+	},
+};
+
+static int devm_add_sent_extent(struct device *dev, u64 start, u64 length,
+				const char *tag, u16 shared_extn_seq, bool shared)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+	lockdep_assert_held(&mdata->ext_lock);
+	return __devm_add_extent(dev, &mdata->dc_sent_extents, start, length,
+				 tag, shared_extn_seq, shared);
+}
+
+static int inject_prev_extents(struct device *dev, u64 base_dpa)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	int rc;
+
+	dev_dbg(dev, "Adding %zu pre-extents for testing\n",
+		ARRAY_SIZE(pre_ext_info));
+
+	guard(mutex)(&mdata->ext_lock);
+	for (int i = 0; i < ARRAY_SIZE(pre_ext_info); i++) {
+		u64 ext_dpa = base_dpa + pre_ext_info[i].offset;
+		u64 ext_len = pre_ext_info[i].length;
+
+		dev_dbg(dev, "Adding pre-extent DPA:%#llx LEN:%#llx tag:%s\n",
+			ext_dpa, ext_len, pre_ext_info[i].tag);
+
+		rc = devm_add_sent_extent(dev, ext_dpa, ext_len,
+					  pre_ext_info[i].tag, 0, false);
+		if (rc) {
+			dev_err(dev, "Failed to add pre-extent DPA:%#llx LEN:%#llx; %d\n",
+				ext_dpa, ext_len, rc);
+			return rc;
+		}
+
+		rc = dc_accept_extent(dev, ext_dpa, ext_len);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+static int cxl_mock_dc_partition_setup(struct device *dev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+	u32 dsmad_handle = 0xFA;
+	u64 decode_length = SZ_512M;
+	u64 block_size = SZ_512;
+	u64 length = SZ_512M;
+	int rc;
+
+	mutex_init(&mdata->ext_lock);
+	xa_init(&mdata->dc_fm_extents);
+	xa_init(&mdata->dc_sent_extents);
+	xa_init(&mdata->dc_accepted_exts);
+
+	rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+	if (rc)
+		return rc;
+
+	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+		struct cxl_dc_partition *part = &mdata->dc_partitions[i];
+
+		dev_dbg(dev, "Creating DC partition DC%d DPA:%#llx LEN:%#llx\n",
+			i, base_dpa, length);
+
+		part->base = cpu_to_le64(base_dpa);
+		part->decode_length = cpu_to_le64(decode_length /
+						  CXL_CAPACITY_MULTIPLIER);
+		part->length = cpu_to_le64(length);
+		part->block_size = cpu_to_le64(block_size);
+		part->dsmad_handle = cpu_to_le32(dsmad_handle);
+		dsmad_handle++;
+
+		/*
+		 * Skip pre-injection on the sharable mock memdev.  The
+		 * pre-injected extents are untagged / seq=0, which a
+		 * sharable partition rejects as firmware-bug; leaving the
+		 * sharable memdev with an empty DC partition is what its
+		 * dedicated tests (test_shared_extent_inject and
+		 * test_seq_integrity_gap in cxl-dcd.sh) expect anyway.
+		 *
+		 * The sharable fixture is the memdev at pdev->id == 0 —
+		 * see the matching MOCK_DC_SHARABLE_SERIAL stamp in
+		 * cxl_mock_mem_probe().  This relies on tools/testing/cxl
+		 * always allocating a "cxl_mem" platform device with id 0
+		 * as the first memdev; if that invariant ever breaks the
+		 * sharable test fixture will land on the wrong device.
+		 */
+		if (to_platform_device(dev)->id != 0) {
+			rc = inject_prev_extents(dev, base_dpa);
+			if (rc) {
+				dev_err(dev,
+					"Failed to add pre-extents for DC%d\n",
+					i);
+				return rc;
+			}
+		}
+
+		base_dpa += decode_length;
+	}
+
+	return 0;
+}
+
 static int mock_gsl(struct cxl_mbox_cmd *cmd)
 {
 	if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1594,6 +1861,215 @@ static int mock_get_supported_features(struct cxl_mockmem_data *mdata,
 	return 0;
 }
 
+static int mock_get_dc_config(struct device *dev,
+			      struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u8 partition_requested, partition_start_idx, partition_ret_cnt;
+	struct cxl_mbox_get_dc_config_out *resp;
+	int i;
+
+	partition_requested = min(dc_config->partition_count, NUM_MOCK_DC_REGIONS);
+
+	if (cmd->size_out < struct_size(resp, partition, partition_requested))
+		return -EINVAL;
+
+	memset(cmd->payload_out, 0, cmd->size_out);
+	resp = cmd->payload_out;
+
+	partition_start_idx = dc_config->start_partition_index;
+	partition_ret_cnt = 0;
+	for (i = partition_start_idx;
+	     i < NUM_MOCK_DC_REGIONS && partition_ret_cnt < partition_requested;
+	     i++) {
+		memcpy(&resp->partition[partition_ret_cnt],
+			&mdata->dc_partitions[i],
+			sizeof(resp->partition[partition_ret_cnt]));
+		partition_ret_cnt++;
+	}
+	resp->avail_partition_count = NUM_MOCK_DC_REGIONS;
+	resp->partitions_returned = partition_ret_cnt;
+	cmd->size_out = struct_size(resp, partition, partition_ret_cnt);
+
+	dev_dbg(dev, "Returning %d dc partitions\n", partition_ret_cnt);
+	return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+				   struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_mbox_get_extent_in *get = cmd->payload_in;
+	u32 total_avail = 0, total_ret = 0, idx = 0;
+	struct cxl_extent_data *ext;
+	u32 ext_count, start_idx;
+	unsigned long i;
+
+	ext_count = le32_to_cpu(get->extent_cnt);
+	start_idx = le32_to_cpu(get->start_extent_index);
+
+	memset(resp, 0, sizeof(*resp));
+
+	guard(mutex)(&mdata->ext_lock);
+	/*
+	 * Total available needs to be calculated and returned regardless of
+	 * how many can actually be returned.
+	 */
+	xa_for_each(&mdata->dc_accepted_exts, i, ext)
+		total_avail++;
+
+	if (start_idx > total_avail)
+		return -EINVAL;
+
+	xa_for_each(&mdata->dc_accepted_exts, i, ext) {
+		if (total_ret >= ext_count)
+			break;
+
+		if (idx++ >= start_idx) {
+			resp->extent[total_ret].start_dpa =
+						cpu_to_le64(ext->dpa_start);
+			resp->extent[total_ret].length =
+						cpu_to_le64(ext->length);
+			export_uuid(resp->extent[total_ret].uuid, &ext->uuid);
+			resp->extent[total_ret].shared_extn_seq =
+						cpu_to_le16(ext->shared_extn_seq);
+			total_ret++;
+		}
+	}
+
+	resp->returned_extent_count = cpu_to_le32(total_ret);
+	resp->total_extent_count = cpu_to_le32(total_avail);
+	resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
+
+	dev_dbg(dev, "Returning %d extents of %d total\n",
+		total_ret, total_avail);
+
+	return 0;
+}
+
+static void dc_clear_sent(struct device *dev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *ext;
+	unsigned long index;
+
+	lockdep_assert_held(&mdata->ext_lock);
+
+	/* Any extents not accepted must be cleared */
+	xa_for_each(&mdata->dc_sent_extents, index, ext) {
+		dev_dbg(dev, "Host rejected extent %#llx\n", ext->dpa_start);
+		xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
+	}
+}
+
+static int mock_add_dc_response(struct device *dev,
+				struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_dc_response *req = cmd->payload_in;
+	u32 list_size = le32_to_cpu(req->extent_list_size);
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u32 last_offer_seq = 0;
+	bool first = true;
+
+	guard(mutex)(&mdata->ext_lock);
+	for (int i = 0; i < list_size; i++) {
+		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+		u64 length = le64_to_cpu(req->extent_list[i].length);
+		struct cxl_extent_data *ext;
+		int rc;
+
+		/*
+		 * CXL r4.0 8.2.10.9.9.3: the host must list extents in the
+		 * order the device offered them (Add Capacity events); reject
+		 * an out-of-order response as Invalid Input.
+		 */
+		ext = xa_load(&mdata->dc_sent_extents, start);
+		if (!ext)
+			ext = xa_load(&mdata->dc_accepted_exts, start);
+		if (ext) {
+			if (!first && ext->offer_seq < last_offer_seq) {
+				dev_err(dev, "Add-DC-Response out of order at extent %#llx\n",
+					start);
+				return -EINVAL;
+			}
+			last_offer_seq = ext->offer_seq;
+			first = false;
+		}
+
+		rc = dc_accept_extent(dev, start, length);
+		if (rc)
+			return rc;
+	}
+
+	dc_clear_sent(dev);
+	return 0;
+}
+
+static void dc_delete_extent(struct device *dev, unsigned long long start,
+			     unsigned long long length)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long end = start + length;
+	struct cxl_extent_data *ext;
+	unsigned long index;
+
+	dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
+
+	guard(mutex)(&mdata->ext_lock);
+	xa_for_each(&mdata->dc_fm_extents, index, ext) {
+		u64 extent_end = ext->dpa_start + ext->length;
+
+		/*
+		 * Any extent which 'touches' the released delete range will be
+		 * removed.  Standard half-open overlap test, which also catches
+		 * an extent that fully contains the delete range.
+		 */
+		if (start < extent_end && ext->dpa_start < end)
+			xa_erase(&mdata->dc_fm_extents, ext->dpa_start);
+	}
+
+	/*
+	 * If the extent was accepted let it be for the host to drop
+	 * later.
+	 */
+}
+
+static int release_accepted_extent(struct device *dev, u64 start, u64 length)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *ext;
+
+	guard(mutex)(&mdata->ext_lock);
+	ext = xa_load(&mdata->dc_accepted_exts, start);
+	if (!ext || ext->length != length) {
+		dev_err(dev, "Extent %#llx not in accepted state\n", start);
+		return -EINVAL;
+	}
+	xa_erase(&mdata->dc_accepted_exts, start);
+	mdata->dc_ext_generation++;
+
+	return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+			   struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_dc_response *req = cmd->payload_in;
+	u32 list_size = le32_to_cpu(req->extent_list_size);
+
+	for (int i = 0; i < list_size; i++) {
+		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+		u64 length = le64_to_cpu(req->extent_list[i].length);
+
+		dev_dbg(dev, "Extent %#llx released by host\n", start);
+		release_accepted_extent(dev, start, length);
+	}
+
+	return 0;
+}
+
 static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
 			      struct cxl_mbox_cmd *cmd)
 {
@@ -1685,6 +2161,18 @@ static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
 	case CXL_MBOX_OP_GET_SUPPORTED_FEATURES:
 		rc = mock_get_supported_features(mdata, cmd);
 		break;
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		rc = mock_get_dc_config(dev, cmd);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		rc = mock_get_dc_extent_list(dev, cmd);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		rc = mock_add_dc_response(dev, cmd);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		rc = mock_dc_release(dev, cmd);
+		break;
 	case CXL_MBOX_OP_GET_FEATURE:
 		rc = mock_get_feature(mdata, cmd);
 		break;
@@ -1761,6 +2249,7 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	struct cxl_mockmem_data *mdata;
 	struct cxl_mailbox *cxl_mbox;
 	struct cxl_dpa_info range_info = { 0 };
+	u64 serial;
 	int rc;
 
 	/* Increase async probe race window */
@@ -1771,6 +2260,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	dev_set_drvdata(dev, mdata);
 
+	rc = cxl_mock_dc_partition_setup(dev);
+	if (rc)
+		return rc;
+
 	mdata->lsa = vmalloc(LSA_SIZE);
 	if (!mdata->lsa)
 		return -ENOMEM;
@@ -1787,7 +2280,17 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		return rc;
 
-	mds = cxl_memdev_state_create(dev, pdev->id + 1, 0);
+	/*
+	 * Reserve the memdev at pdev->id == 0 as the sharable DC partition
+	 * test fixture.  This relies on tools/testing/cxl always allocating a
+	 * "cxl_mem" platform device with id 0 as the first memdev — currently
+	 * true in cxl.c, but if the topology ever renumbers, the sharable
+	 * serial will be stamped on the wrong device (or no device).  Matched
+	 * by the skip-pre-inject guard in cxl_mock_dc_partition_setup and by
+	 * mock_cxl_endpoint_parse_cdat in cxl_test.
+	 */
+	serial = pdev->id == 0 ? MOCK_DC_SHARABLE_SERIAL : pdev->id + 1;
+	mds = cxl_memdev_state_create(dev, serial, 0);
 	if (IS_ERR(mds))
 		return PTR_ERR(mds);
 
@@ -1827,6 +2330,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		return rc;
 
+	if (cxl_dcd_supported(mds))
+		cxl_configure_dcd(mds, &range_info);
+
 	rc = cxl_dpa_setup(cxlds, &range_info);
 	if (rc)
 		return rc;
@@ -1936,11 +2442,328 @@ static ssize_t sanitize_timeout_store(struct device *dev,
 
 static DEVICE_ATTR_RW(sanitize_timeout);
 
+/* Return true if the proposed extent will not break the test code */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+			     size_t new_len)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *extent;
+	size_t new_end, i;
+
+	if (!new_len)
+		return false;
+
+	new_end = new_start + new_len;
+
+	dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+	guard(mutex)(&mdata->ext_lock);
+	dev_dbg(dev, "Checking extents starts...\n");
+	xa_for_each(&mdata->dc_fm_extents, i, extent) {
+		if (extent->dpa_start == new_start)
+			return false;
+	}
+
+	dev_dbg(dev, "Checking sent extents starts...\n");
+	xa_for_each(&mdata->dc_sent_extents, i, extent) {
+		if (extent->dpa_start == new_start)
+			return false;
+	}
+
+	dev_dbg(dev, "Checking accepted extents starts...\n");
+	xa_for_each(&mdata->dc_accepted_exts, i, extent) {
+		if (extent->dpa_start == new_start)
+			return false;
+	}
+
+	return true;
+}
+
+struct cxl_test_dcd {
+	uuid_t id;
+	struct cxl_event_dcd rec;
+} __packed;
+
+struct cxl_test_dcd dcd_event_rec_template = {
+	.id = CXL_EVENT_DC_EVENT_UUID,
+	.rec = {
+		.hdr = {
+			.length = sizeof(struct cxl_test_dcd),
+		},
+	},
+};
+
+static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
+			u64 start, u64 length, const char *tag_str,
+			u16 shared_extn_seq, bool more)
+{
+	struct device *dev = mdata->mds->cxlds.dev;
+	struct cxl_test_dcd *dcd_event;
+	uuid_t tag;
+	int rc;
+
+	dev_dbg(dev, "mock device log event %d\n", type);
+
+	dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
+				     sizeof(*dcd_event), GFP_KERNEL);
+	if (!dcd_event)
+		return -ENOMEM;
+
+	dcd_event->rec.flags = 0;
+	if (more)
+		dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
+	dcd_event->rec.event_type = type;
+	dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
+	dcd_event->rec.extent.length = cpu_to_le64(length);
+	rc = parse_tag(tag_str, &tag);
+	if (rc) {
+		devm_kfree(dev, dcd_event);
+		return rc;
+	}
+	export_uuid(dcd_event->rec.extent.uuid, &tag);
+	dcd_event->rec.extent.shared_extn_seq = cpu_to_le16(shared_extn_seq);
+
+	mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
+		      (struct cxl_event_record_raw *)dcd_event);
+
+	/* Fake the irq */
+	cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
+
+	return 0;
+}
+
+static void mark_extent_sent(struct device *dev, unsigned long long start)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *ext;
+
+	guard(mutex)(&mdata->ext_lock);
+	ext = xa_erase(&mdata->dc_fm_extents, start);
+	/* A concurrent delete may have removed it. */
+	if (!ext)
+		return;
+	/* Record the order extents are offered to the host (event order). */
+	ext->offer_seq = mdata->dc_offer_seq++;
+	if (xa_insert(&mdata->dc_sent_extents, ext->dpa_start, ext, GFP_KERNEL)) {
+		dev_err(dev, "Failed to mark extent %#llx sent\n", ext->dpa_start);
+		devm_kfree(dev, ext);
+	}
+}
+
+/*
+ * Format <start>:<length>:<tag>:<more_flag>[:<seq>]
+ *
+ * start and length must be a multiple of the configured partition block size.
+ * Tag is a UUID string; "" or "0" selects the null UUID.  seq defaults to 0.
+ *
+ * Extents must be exclusive of other extents.
+ *
+ * If the more flag is specified it is expected that an additional extent will
+ * be specified without the more flag to complete the test transaction with the
+ * host.
+ */
+static ssize_t __dc_inject_extent_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count,
+					bool shared)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long start, length, more;
+	char *len_str, *uuid_str, *more_str, *seq_str;
+	u16 shared_extn_seq = 0;
+	size_t buf_len = count;
+	int rc;
+
+	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+	if (!start_str)
+		return -ENOMEM;
+
+	len_str = strnchr(start_str, buf_len, ':');
+	if (!len_str) {
+		dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	*len_str = '\0';
+	len_str += 1;
+	buf_len -= strlen(start_str);
+
+	uuid_str = strnchr(len_str, buf_len, ':');
+	if (!uuid_str) {
+		dev_err(dev, "Extent failed to find uuid_str: %s\n", len_str);
+		return -EINVAL;
+	}
+	*uuid_str = '\0';
+	uuid_str += 1;
+
+	more_str = strnchr(uuid_str, buf_len, ':');
+	if (!more_str) {
+		dev_err(dev, "Extent failed to find more_str: %s\n", uuid_str);
+		return -EINVAL;
+	}
+	*more_str = '\0';
+	more_str += 1;
+
+	/* Optional 5th field: shared_extn_seq.  Absent -> 0. */
+	seq_str = strnchr(more_str, buf_len, ':');
+	if (seq_str) {
+		unsigned long long seq;
+
+		*seq_str = '\0';
+		seq_str += 1;
+		if (kstrtoull(seq_str, 0, &seq) || seq > U16_MAX) {
+			dev_err(dev, "Extent failed to parse seq: %s\n",
+				seq_str);
+			return -EINVAL;
+		}
+		shared_extn_seq = seq;
+	}
+
+	if (kstrtoull(start_str, 0, &start)) {
+		dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	if (kstrtoull(len_str, 0, &length)) {
+		dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+		return -EINVAL;
+	}
+
+	if (kstrtoull(more_str, 0, &more)) {
+		dev_err(dev, "Extent failed to parse more: %s\n", more_str);
+		return -EINVAL;
+	}
+
+	if (!new_extent_valid(dev, start, length))
+		return -EINVAL;
+
+	rc = devm_add_fm_extent(dev, start, length, uuid_str, shared_extn_seq,
+				shared);
+	if (rc) {
+		dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
+			start, length, rc);
+		return rc;
+	}
+
+	mark_extent_sent(dev, start);
+	rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, uuid_str,
+			  shared_extn_seq, more);
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
+	return count;
+}
+
+static ssize_t dc_inject_extent_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	return __dc_inject_extent_store(dev, attr, buf, count, false);
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t dc_inject_shared_extent_store(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf, size_t count)
+{
+	return __dc_inject_extent_store(dev, attr, buf, count, true);
+}
+static DEVICE_ATTR_WO(dc_inject_shared_extent);
+
+static ssize_t __dc_del_extent_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t count,
+				     enum dc_event type)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long start, length;
+	char *len_str, *uuid_str;
+	size_t buf_len = count;
+	int rc;
+
+	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+	if (!start_str)
+		return -ENOMEM;
+
+	len_str = strnchr(start_str, buf_len, ':');
+	if (!len_str) {
+		dev_err(dev, "Failed to find len_str: %s\n", start_str);
+		return -EINVAL;
+	}
+	*len_str = '\0';
+	len_str += 1;
+	buf_len -= strlen(start_str);
+
+	uuid_str = strnchr(len_str, buf_len, ':');
+	if (!uuid_str) {
+		dev_err(dev, "Failed to find uuid_str: %s\n", len_str);
+		return -EINVAL;
+	}
+	*uuid_str = '\0';
+	uuid_str += 1;
+	/*
+	 * uuid_str is the trailing field; trim shell-added '\n' so
+	 * parse_tag()/uuid_parse() see a clean string.
+	 */
+	uuid_str = strim(uuid_str);
+
+	if (kstrtoull(start_str, 0, &start)) {
+		dev_err(dev, "Failed to parse start: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	if (kstrtoull(len_str, 0, &length)) {
+		dev_err(dev, "Failed to parse length: %s\n", len_str);
+		return -EINVAL;
+	}
+
+	dc_delete_extent(dev, start, length);
+
+	if (type == DCD_FORCED_CAPACITY_RELEASE)
+		dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
+			start, length);
+
+	rc = log_dc_event(mdata, type, start, length, uuid_str, 0, false);
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
+	return count;
+}
+
+/*
+ * Format <start>:<length>:<uuid>
+ */
+static ssize_t dc_del_extent_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	return __dc_del_extent_store(dev, attr, buf, count,
+				     DCD_RELEASE_CAPACITY);
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	return __dc_del_extent_store(dev, attr, buf, count,
+				     DCD_FORCED_CAPACITY_RELEASE);
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
 static struct attribute *cxl_mock_mem_attrs[] = {
 	&dev_attr_security_lock.attr,
 	&dev_attr_event_trigger.attr,
 	&dev_attr_fw_buf_checksum.attr,
 	&dev_attr_sanitize_timeout.attr,
+	&dev_attr_dc_inject_extent.attr,
+	&dev_attr_dc_inject_shared_extent.attr,
+	&dev_attr_dc_del_extent.attr,
+	&dev_attr_dc_force_del_extent.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(cxl_mock_mem);
diff --git a/tools/testing/cxl/test/mock.h b/tools/testing/cxl/test/mock.h
index 4f57dc80ae7d..58ebe1c81fd2 100644
--- a/tools/testing/cxl/test/mock.h
+++ b/tools/testing/cxl/test/mock.h
@@ -5,6 +5,15 @@
 #include <linux/dax.h>
 #include <cxl.h>
 
+/*
+ * Mock serial sentinel.  The cxl_mock_mem probe stamps this serial on
+ * exactly one platform device (cxl_mem with id 0); that single memdev's
+ * DC partition is marked sharable in mock_cxl_endpoint_parse_cdat() so
+ * the suite can exercise sharable-extent code paths without losing the
+ * non-sharable coverage on the other mock memdevs.
+ */
+#define MOCK_DC_SHARABLE_SERIAL 0xDCDCULL
+
 struct cxl_mock_ops {
 	struct list_head list;
 	bool (*is_mock_adev)(struct acpi_device *dev);
-- 
2.43.0
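
Reader note on the injection format: the string written to
dc_inject_extent is "<start>:<length>:<tag>:<more>" with an optional
fifth ":<seq>" field that defaults to 0.  The following is a minimal
userspace sketch of that split-and-convert logic, not the mock driver's
code.  struct inject_req, parse_ull() and parse_inject() are hypothetical
names, the buffer sizes are arbitrary, and strtoull() stands in for the
kernel's kstrtoull().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed injection request. */
struct inject_req {
	unsigned long long start;
	unsigned long long length;
	char tag[64];
	unsigned long long more;
	uint16_t seq;		/* 0 when the optional 5th field is absent */
};

/* Convert one numeric field (base auto-detected); returns 0 on success. */
static int parse_ull(const char *s, unsigned long long *out)
{
	char *end;

	errno = 0;
	*out = strtoull(s, &end, 0);
	if (end == s || errno)
		return -EINVAL;
	/* Tolerate one trailing newline, as a shell "echo" would add. */
	if (*end == '\n')
		end++;
	return *end ? -EINVAL : 0;
}

/* Split "<start>:<length>:<tag>:<more>[:<seq>]" and convert each field. */
static int parse_inject(const char *buf, struct inject_req *req)
{
	char copy[160], *field[5], *p = copy;
	unsigned long long seq = 0;
	int n = 0;

	if (strlen(buf) >= sizeof(copy))
		return -EINVAL;
	strcpy(copy, buf);

	/* Cut the copy at each ':'; at most five fields are accepted. */
	field[n++] = p;
	while ((p = strchr(p, ':'))) {
		if (n == 5)
			return -EINVAL;
		*p++ = '\0';
		field[n++] = p;
	}
	if (n < 4)
		return -EINVAL;

	if (parse_ull(field[0], &req->start) ||
	    parse_ull(field[1], &req->length) ||
	    parse_ull(field[3], &req->more))
		return -EINVAL;
	/* The sequence number must fit the 16-bit wire field. */
	if (n == 5 && (parse_ull(field[4], &seq) || seq > UINT16_MAX))
		return -EINVAL;
	if (strlen(field[2]) >= sizeof(req->tag))
		return -EINVAL;

	strcpy(req->tag, field[2]);
	req->seq = (uint16_t)seq;
	return 0;
}
```

As in the mock driver, the tag is passed through as an opaque string here;
turning it into a UUID is left to a separate step.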



* [PATCH v11 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (28 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
@ 2026-06-25 11:05 ` Anisa Su
  2026-06-25 18:24   ` sashiko-bot
  2026-06-25 18:00 ` [PATCH v11 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 11:05 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
	John Groves, Gregory Price, Anisa Su

Extend the CXL and DAX driver-api documentation to cover Dynamic
Capacity Devices.

cxl-driver.rst gains a "Dynamic Capacity Extents" section describing
the conditions under which the CXL core accepts an offered extent
(per-extent: region resolution, full ED-range containment,
no-overlap, duplicate tolerance; per-tag-group: host-wide tag-uuid
uniqueness, sequence-number integrity, partition equality,
alignment) and the conditions under which a release request is
honoured (DPA-range containment in some member, tag match,
DAX-layer EBUSY deferral, whole-tag-group release).  The host-wide
uniqueness gate is enforced by the cxl_tag_register registry in
drivers/cxl/core/extent.c.  For sequence numbers the doc spells out
both regimes — device-stamped 0..n-1 on sharable allocations and
host-assigned arrival-order 0..n-1 (assigned by cxl_realize_group())
on non-sharable allocations — and notes that the DAX layer sees one
unified 0..n-1 dense invariant.

dax-driver.rst gains a "Dynamic Capacity (DC) Regions" section
that lays out the four-object layering device extent → dc_extent →
dax_resource → DAX device, with cardinalities: one tagged
allocation maps to one cxl_dc_tag_group containing N dc_extents and
N dax_resources, claimed into one DAX device with N range entries
in seq_num order; an untagged Add delivery becomes its own
single-member group.  Each dc_extent carries its own hpa_range —
there is no aggregated bounding-box range across siblings.
Tag-based DAX device creation, DC-only sizing rules (no non-zero
resize, size=0 to destroy), and the uuid attribute semantics are
documented alongside.

Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
 .../driver-api/cxl/linux/cxl-driver.rst       | 149 +++++++++++++++
 .../driver-api/cxl/linux/dax-driver.rst       | 169 ++++++++++++++++++
 2 files changed, 318 insertions(+)
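
Reader note on two of the per-group gates described in the commit
message (sequence-number integrity and alignment): the following is a
minimal userspace sketch of the checks, not the code in
drivers/cxl/core/mbox.c.  struct sketch_extent, SKETCH_PMD_SIZE and the
helper names are hypothetical, and the 2 MiB PMD size is an assumption
(x86-64 with 4 KiB pages); the kernel uses the architecture's PMD_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed PMD size for this sketch only: 2 MiB. */
#define SKETCH_PMD_SIZE (2ULL << 20)

/* Hypothetical stand-in for one pending extent of a tag group. */
struct sketch_extent {
	uint64_t start_dpa;
	uint64_t length;
	uint16_t shared_extn_seq;
};

static int cmp_seq(const void *a, const void *b)
{
	const struct sketch_extent *x = a, *y = b;

	return (int)x->shared_extn_seq - (int)y->shared_extn_seq;
}

/* True if every member carries seq 0, or the sorted seqs are exactly 0..n-1. */
static bool seq_ok(struct sketch_extent *ext, size_t n)
{
	bool all_zero = true;
	size_t i;

	for (i = 0; i < n; i++)
		if (ext[i].shared_extn_seq)
			all_zero = false;
	/* Non-sharable regime: leave arrival order untouched. */
	if (all_zero)
		return true;

	/* Sharable regime: after sorting, slot i must hold sequence i. */
	qsort(ext, n, sizeof(*ext), cmp_seq);
	for (i = 0; i < n; i++)
		if ((size_t)ext[i].shared_extn_seq != i)
			return false;
	return true;
}

/* True if every start and length is a multiple of the assumed PMD size. */
static bool aligned_ok(const struct sketch_extent *ext, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (ext[i].start_dpa % SKETCH_PMD_SIZE ||
		    ext[i].length % SKETCH_PMD_SIZE)
			return false;
	return true;
}

/* A group is kept only if every gate passes; otherwise it is dropped whole. */
static bool group_ok(struct sketch_extent *ext, size_t n)
{
	return n > 0 && aligned_ok(ext, n) && seq_ok(ext, n);
}
```

qsort() is not a stable sort, which is why the sketch returns before
sorting in the all-zero case; in the dense case every key is unique, so
stability does not matter.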

diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
index dd6dd17dc536..581dca271749 100644
--- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -619,6 +619,155 @@ from HPA to DPA.  This is why they must be aware of the entire interleave set.
 Linux does not support unbalanced interleave configurations.  As a result, all
 endpoints in an interleave set must have the same ways and granularity.
 
+Dynamic Capacity Extents
+========================
+
+A `Dynamic Capacity Device (DCD)` advertises capacity in `DC partitions`
+and surfaces individual chunks of that capacity to the host as `extents`.
+The device may add an extent at any time (a `pending add`) and may
+request that a previously accepted extent be released (a `pending
+release`).  Each transition is mediated by a mailbox handshake whose
+state machine the CXL driver enforces in
+:code:`drivers/cxl/core/{mbox.c,extent.c}`.
+
+Extents that share a non-null tag form one logical allocation.  Each
+surviving member becomes its own :code:`struct dc_extent` (per-extent
+sysfs device, per-extent HPA range); their containing tag group is an
+internal-only :code:`struct cxl_dc_tag_group` keyed by UUID with no
+sysfs identity.  Each :code:`dc_extent` becomes one
+:code:`dax_resource` on the DAX side, and a tagged DAX device is built
+by claiming every :code:`dax_resource` that carries the tag.
+
+For DAX-side semantics — how accepted extents materialize into
+:code:`dax_resource` objects and DAX devices — see
+:doc:`dax-driver`.
+
+Accepting Extents
+-----------------
+Extents are made available to the host from the device through DC ADD events.
+Event records contain extents, which may be tagged or untagged, shared or
+not shared. Multiple event records can be chained together by the `More` flag.
+
+The unit of allocation is a `tag`.  All extents
+sharing a tag form one allocation; the More flag is a delivery boundary
+only, meaning when the More chain ends, the host can assume that all extents
+have been collected for each tag.
+A tag may be the null UUID (an `untagged` allocation, valid in
+non-sharable regions) or a non-null UUID identifying a sharable or
+non-sharable allocation.
+
+When a `More`-terminated chain of pending adds closes, the driver
+processes the pending list one tag group at a time.  A group is
+committed only if it passes every gate below; failing any gate drops
+the entire group with a firmware-bug warning, and the dropped extents
+do not appear in the :code:`ADD_DC_RESPONSE`.  There is no
+partial-extent acceptance — either an offered extent is accepted whole
+or it is dropped whole.
+
+Per-extent gates (applied in :code:`cxl_add_extent`,
+:code:`drivers/cxl/core/extent.c`):
+
+* The extent's DPA range must resolve to a CXL region via
+  :code:`cxl_dpa_to_region()`.  An extent with no owning region is
+  dropped; the device sees the omission from :code:`ADD_DC_RESPONSE`.
+* The extent's DPA range must be `fully contained` in the endpoint
+  decoder's DPA range.  An extent that straddles the decoder boundary
+  is rejected with :code:`-ENXIO`; the driver never clips an extent to
+  fit.
+* The extent must not overlap an extent already present in the same
+  region.  Overlap classification is done in
+  :code:`cxlr_dax_classify_extent()` using :code:`range_overlaps()`.
+  Exact duplicates of a previously-accepted range are tolerated —
+  accepting the same range twice is a no-op, which simplifies
+  probe-time scans of the device's existing accepted list.
+
+Per-group gates (applied in :code:`cxl_add_pending`,
+:code:`drivers/cxl/core/mbox.c`):
+
+* `Host-wide tag uniqueness`: a non-null tag must not already
+  correspond to a live :code:`cxl_dc_tag_group` anywhere on this host.
+  The orchestrator (FM) owns tag-UUID allocation per spec; the
+  registry in :code:`drivers/cxl/core/extent.c`
+  (:code:`cxl_tag_register` / :code:`cxl_tag_already_committed`)
+  catches firmware bugs and orchestrator misbehavior across every
+  region and memdev.  Skipped for the null UUID, which has no
+  cross-chain identity.
+* `Sequence-number integrity`: every member must carry the wire
+  field :code:`shared_extn_seq == 0` (non-sharable allocation), or
+  the group's sorted sequence numbers must be exactly
+  :code:`0, 1, …, n-1` (sharable allocation).  Sets that are mixed,
+  gapped, contain duplicates, or do not start at 0 are rejected.
+* `Partition equality`: every tagged extent in the group must
+  resolve to the same DC partition.  A single allocation cannot span
+  partitions because CDAT describes sharable / writable / coherency
+  attributes per-partition.  Skipped for the null UUID.
+* `Alignment`: every extent's :code:`start_dpa` and :code:`length`
+  must be :code:`PMD_SIZE`-aligned.  Partial acceptance
+  of an aligned subset would leave an unusable DAX device, so the
+  group is dropped instead.
+
+Surviving extents are sorted by the wire field
+:code:`shared_extn_seq` — stable, so arrival order is preserved for
+the all-zero non-sharable case — and each becomes a
+:code:`dc_extent` inserted into a fresh :code:`cxl_dc_tag_group`
+keyed by the group's UUID.  Each :code:`dc_extent` carries its own
+:code:`hpa_range`; the tag group itself has no aggregate range.
+
+As each surviving extent is attached the host assigns it a 0..n-1
+:code:`seq_num`: for sharable allocations this equals the
+device-stamped :code:`shared_extn_seq` directly; for non-sharable
+allocations the device sends :code:`shared_extn_seq == 0` and the
+host fills in the arrival-order position (assigned in
+:code:`cxl_realize_group`).  The DAX layer enforces the same
+:code:`0..n-1` dense invariant in both cases.
+
+The tag group is brought online via :code:`online_tag_group()`,
+which registers every member :code:`dc_extent` as an
+:code:`extentX.Y` child of :code:`cxlr_dax->dev`.  The DAX layer is
+then notified with :code:`DCD_ADD_CAPACITY`, and the accepted extents
+are spliced into the response list for a single :code:`ADD_DC_RESPONSE`
+mailbox command per More-chain.
+
+Releasing Extents
+-----------------
+
+A release may be initiated by the device (a pending release
+notification) or by the host (when destroying a DAX device or tearing
+down a region).  Both paths converge on :code:`cxl_rm_extent`
+(:code:`drivers/cxl/core/extent.c`).
+
+Per-extent gates:
+
+* The DPA range must resolve to a CXL region.  If it does not — for
+  example, an extent left over from a host crash that has not yet
+  been re-claimed, or a duplicate release racing region teardown —
+  the release is acknowledged via :code:`memdev_release_extent()` so
+  the device knows the host is not using the capacity, and the
+  operation returns :code:`-ENXIO`.
+* The DPA range must be `fully contained` in some member
+  :code:`dc_extent`'s :code:`dpa_range` on the region's
+  :code:`cxlr_dax`, and the tag (UUID) on that member's
+  :code:`cxl_dc_tag_group` must match the release request.  Releases
+  are keyed by :code:`(DPA range, tag)` rather than by pointer
+  because the device, not the host, supplies the identity.  A
+  request that matches no :code:`dc_extent` is rejected with
+  :code:`-EINVAL`.
+
+If those gates pass, the DAX layer is notified with
+:code:`DCD_RELEASE_CAPACITY` and consulted for permission to proceed.
+If the DAX layer returns :code:`-EBUSY` — the capacity is still mapped
+or otherwise in use — the release is deferred and
+:code:`cxl_rm_extent` returns success without unregistering anything.
+When the DAX layer ultimately grants release,
+:code:`rm_tag_group()` invalidates the backing memregion once for the
+whole group, then unregisters every member :code:`dc_extent` device,
+which cascades through the DAX layer to drop the corresponding
+:code:`dax_resource`\ s.
+
+The release path is always whole-tag-group: tagged allocations
+release atomically, and the kernel does not split a group in response
+to a sub-range release request.
+
 Example Configurations
 ======================
 .. toctree::
diff --git a/Documentation/driver-api/cxl/linux/dax-driver.rst b/Documentation/driver-api/cxl/linux/dax-driver.rst
index 10d953a2167b..d869bcad41f4 100644
--- a/Documentation/driver-api/cxl/linux/dax-driver.rst
+++ b/Documentation/driver-api/cxl/linux/dax-driver.rst
@@ -27,6 +27,175 @@ CXL capacity in the task's page tables.
 Users wishing to manually handle allocation of CXL memory should use this
 interface.
 
+Dynamic Capacity (DC) Regions
+=============================
+A region backed by a CXL `Dynamic Capacity Device (DCD)` is a `DC region`:
+its HPA window is fixed at probe time, but the DPA capacity that fills the
+window arrives and departs at runtime as the device offers and reclaims
+`extents`.  DC regions are distinguished from static regions by the
+:code:`IORESOURCE_DAX_DCD` flag on the :code:`dax_region`.
+
+For the CXL-side rules governing when an offered extent is accepted or a
+release request is honoured, see :doc:`cxl-driver`.  This section covers
+the DAX-side mapping between accepted extents and DAX devices.
+
+The Extent Layering Model
+-------------------------
+Four objects sit between the wire-level CXL extent and the
+user-visible DAX device.  Understanding the cardinality between them
+is the key to the DC-region model.
+
+::
+
+    device extents     dc_extent           dax_resource         DAX device
+    (CXL device)       (CXL core)          (DAX bus)            (/dev/daxN.Y)
+    -------------      -------------       -------------        ------------
+    e1 ─┐                ┌─► dc_e1 ──►     res_1 (seq=0) ──┐
+    e2 ─┼─── tag A ──►   ┼─► dc_e2 ──►     res_2 (seq=1) ──┼──►  daxN.0
+    e3 ─┘                └─► dc_e3 ──►     res_3 (seq=2) ──┘     (claimed by tag A,
+                                                                   size = Σ |e_i|)
+
+    e4 ─── tag B ────►     dc_e4 ──►       res_4 (seq=0) ────►   daxN.1
+
+    e5 ─── null tag ─►     dc_e5 ──►       res_5 (seq=0) ────►   daxN.2
+    e6 ─── null tag ─►     dc_e6 ──►       res_6 (seq=0) ────►   daxN.3
+
+The CXL core groups extents sharing a non-null tag into a single
+:code:`cxl_dc_tag_group` (internal-only, no sysfs identity), but each
+member extent stays a distinct :code:`dc_extent` with its own HPA
+range.  The DAX bridge creates one :code:`dax_resource` per
+:code:`dc_extent`, and userspace claims a DAX device by writing the
+tag's UUID to the seed device's :code:`uuid` attribute, which carves
+every matching :code:`dax_resource` (in :code:`seq_num` order) into
+the device's :code:`ranges[]` array.
+
+`Device extent`
+  The unit the CXL device delivers over the mailbox: a
+  :code:`(DPA, length, tag, shared_extn_seq)` tuple inside an
+  Add-Capacity event.  The tag is either a non-null UUID (a
+  `tagged allocation`) or the null UUID (`untagged`).
+
+:code:`dc_extent`
+  The CXL core's per-extent object, one per surviving device extent.
+  Each :code:`dc_extent` is registered as its own :code:`extentX.Y`
+  sysfs device under :code:`cxlr_dax->dev` and carries its own
+  :code:`hpa_range` — there is no aggregated / bounding-box HPA
+  range across siblings.  Members of one tag group point at a
+  shared :code:`cxl_dc_tag_group` (which holds the UUID and a
+  manual refcount on the surviving siblings) but otherwise exist as
+  independent kernel objects.
+
+  For a `non-null tag`, the host-wide tag-uniqueness gate
+  (:doc:`cxl-driver`) guarantees there is at most one
+  :code:`cxl_dc_tag_group` per UUID on the host, so the set of
+  :code:`dc_extent`\ s sharing that UUID is a single allocation.
+
+  For the `null tag` there is no cross-event identity — the spec is
+  silent on aggregating untagged extents across Add-Capacity events.
+  Each untagged device extent becomes its own :code:`dc_extent` in
+  its own single-member tag group; two untagged extents delivered
+  separately are two distinct allocations.
+
+:code:`dax_resource`
+  The DAX bus's per-extent view, one-to-one with :code:`dc_extent`.
+  When the CXL DAX driver receives a :code:`DCD_ADD_CAPACITY`
+  notification it iterates the tag group and calls
+  :code:`dax_region_add_resource()` once per member, creating one
+  :code:`dax_resource` per :code:`dc_extent`.  Each
+  :code:`dax_resource` carries that member's HPA range, the tag
+  UUID (copied from :code:`dc_extent->group->uuid`), and a 0..n-1
+  :code:`seq_num` so :code:`uuid_claim_tagged` can carve the matched
+  set into the device's :code:`ranges[]` array in the right order
+  (see :code:`drivers/dax/bus.c`).
+
+`DAX device` (:code:`/dev/daxN.Y`)
+  Created by userspace claiming a set of :code:`dax_resource`\ s via
+  the :code:`uuid` sysfs attribute.  Each DAX device corresponds to
+  exactly one allocation:
+
+  * A `tagged` DAX device is built from every :code:`dax_resource`
+    carrying the tag — one per :code:`dc_extent` in the allocation
+    — carved into the device's :code:`ranges[]` in :code:`seq_num`
+    order.  Its size equals the sum of every member's size.
+  * An `untagged` DAX device is built from one untagged
+    :code:`dax_resource` and its size equals that one extent.
+
+So the end-to-end rule is: **one tagged allocation = one
+cxl_dc_tag_group = N dc_extents = N dax_resources = one DAX device
+with N range entries**.  An untagged device extent becomes its own
+:code:`dc_extent` / :code:`dax_resource` / single-range DAX device,
+claimed one at a time.
+
+Release follows the same layering in reverse.  When the CXL core
+calls :code:`rm_tag_group()` (after the device asks for release and
+the DAX layer consents), the DAX bridge collects every matching
+:code:`dax_resource` and removes them as a set via
+:code:`dax_region_rm_resources()`.  The removal is all-or-none
+under :code:`dax_region_rwsem`: if any member is in use, the whole
+group stays.  When removal commits, the HPA capacity returns to the
+region's free pool and any DAX device that had claimed it is left
+with no backing capacity.  Userspace tears the DAX device down via
+:code:`daxctl destroy-device` (size=0, then write the device name to
+the region's :code:`delete` attribute).
+
+UUID-Based DAX Device Creation
+------------------------------
+A DAX device on a DC region is created by writing a UUID to the
+seed device's :code:`uuid` attribute
+(:code:`/sys/bus/dax/devices/daxN.Y/uuid`).  The seed starts at
+size 0; writing :code:`uuid` is a `claim` operation that resolves
+the layering above and populates the device:
+
+* A `non-null UUID` claims `every` :code:`dax_resource` whose tag
+  matches.  :code:`uuid_claim_tagged` (in
+  :code:`drivers/dax/bus.c`) collects them, sorts by
+  :code:`seq_num`, enforces the dense :code:`0..n-1` invariant, and
+  carves each via :code:`__dev_dax_resize` in :code:`seq_num` order
+  so the device's :code:`ranges[]` array is dense and ordered.
+  The resulting DAX device represents exactly the tagged
+  allocation: its size equals the sum of every member extent's
+  size.
+
+  The dense :code:`0..n-1` invariant is the unified rule the CXL
+  side maintains for both sharable and non-sharable allocations
+  (see :doc:`cxl-driver`); the match set has exactly one entry per
+  :code:`dc_extent` in the tag group.
+
+* The value :code:`"0"` is shorthand for the null UUID and claims
+  exactly `one` untagged :code:`dax_resource`.  Untagged
+  :code:`dax_resource`\ s correspond to independent untagged
+  allocations; collapsing several into one device would aggregate
+  unrelated capacity, so each :code:`uuid` write consumes a single
+  untagged resource.
+
+* A write that matches no :code:`dax_resource` returns
+  :code:`-ENOENT` and the device remains at size 0.
+
+* Writes to the :code:`uuid` attribute on non-DC regions return
+  :code:`-EOPNOTSUPP`; the attribute itself is read-only (0444) on
+  non-DC devices.
+
+The device's size is determined entirely by the backing allocation:
+users do not choose a size on DC regions.  Accordingly, the
+:code:`size` attribute on a DC DAX device rejects any non-zero resize
+(grow or partial shrink) with :code:`-EOPNOTSUPP`.  Writing :code:`0`
+is still permitted and is
+how :code:`daxctl destroy-device` returns each claimed extent to the
+region's available pool before the device's name is written to the
+region's :code:`delete` attribute.
+
+Reads of :code:`uuid` report the tag identifying the capacity
+backing the device:
+
+* For a non-null-UUID-claimed DC DAX device, :code:`uuid` reads
+  back the claimed UUID.
+* For a DC DAX device claimed via :code:`"0"`, or for any
+  non-DCD DAX device, :code:`uuid` reads back the null UUID
+  (:code:`00000000-0000-0000-0000-000000000000`).
+
+See :code:`Documentation/ABI/testing/sysfs-bus-dax` for the
+authoritative attribute contracts.
+
 kmem conversion
 ===============
 The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
-- 
2.43.0
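
Reader note on the release gate documented above (a release request
must be fully contained in some member's DPA range and carry that
member's tag): the following is a minimal userspace sketch of the match,
not the code in drivers/cxl/core/extent.c.  struct sketch_range,
struct sketch_member and find_release_target() are hypothetical names,
and ranges are assumed inclusive at both ends.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Inclusive [start, end] range, assumed for this sketch. */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical stand-in for one accepted extent and its group's tag. */
struct sketch_member {
	struct sketch_range dpa_range;
	uint8_t tag[16];
};

/* True if inner lies entirely within outer. */
static bool contains(const struct sketch_range *outer,
		     const struct sketch_range *inner)
{
	return outer->start <= inner->start && inner->end <= outer->end;
}

/*
 * Return the member whose DPA range fully contains the request and whose
 * tag matches it, or NULL when nothing matches (the documented -EINVAL case).
 */
static const struct sketch_member *
find_release_target(const struct sketch_member *m, size_t n,
		    const struct sketch_range *req, const uint8_t tag[16])
{
	for (size_t i = 0; i < n; i++)
		if (contains(&m[i].dpa_range, req) &&
		    !memcmp(m[i].tag, tag, 16))
			return &m[i];
	return NULL;
}
```

A request that straddles two members matches neither, consistent with
the whole-tag-group release rule: a match selects the member's group,
never a sub-range of it.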



* [PATCH v11 03/31] cxl/cdat: Gather DSMAS data for DCD partitions
  2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
                   ` (29 preceding siblings ...)
  2026-06-25 11:05 ` [PATCH v11 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions Anisa Su
@ 2026-06-25 18:00 ` Anisa Su
  2026-06-26 22:30   ` Dave Jiang
  30 siblings, 1 reply; 61+ messages in thread
From: Anisa Su @ 2026-06-25 18:00 UTC (permalink / raw)
  To: linux-cxl, linux-kernel
  Cc: nvdimm, djbw, jic23, dave, dave.jiang, vishal.l.verma, iweiny,
	alison.schofield, gourry, anisa.su

From: Ira Weiny <iweiny@kernel.org>

Additional DCD partition (AKA region) information is contained in the
DSMAS CDAT tables, including performance, read-only, and shareable
attributes.

Match DCD partitions with DSMAS tables and store the metadata.

Signed-off-by: Ira Weiny <iweiny@kernel.org>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
 drivers/cxl/core/cdat.c | 12 ++++++++++++
 drivers/cxl/core/hdm.c  |  1 +
 drivers/cxl/core/mbox.c | 22 ++++++++++++++++------
 drivers/cxl/cxlmem.h    |  2 ++
 include/cxl/cxl.h       |  4 ++++
 5 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index 5c9f07262513..a280039e4cd1 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -17,6 +17,7 @@ struct dsmas_entry {
 	struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
 	int entries;
 	int qos_class;
+	bool shareable;
 };
 
 static u32 cdat_normalize(u16 entry, u64 base, u8 type)
@@ -74,6 +75,7 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
 		return -ENOMEM;
 
 	dent->handle = dsmas->dsmad_handle;
+	dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
 	dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
 	dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
 			      le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
@@ -266,15 +268,25 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
 		bool found = false;
 
 		for (int i = 0; i < cxlds->nr_partitions; i++) {
+			enum cxl_partition_mode mode = cxlds->part[i].mode;
 			struct resource *res = &cxlds->part[i].res;
+			u8 handle = cxlds->part[i].handle;
 			struct range range = {
 				.start = res->start,
 				.end = res->end,
 			};
 
 			if (range_contains(&range, &dent->dpa_range)) {
+				if (mode == CXL_PARTMODE_DYNAMIC_RAM_1 &&
+				    dent->handle != handle) {
+					dev_warn(dev,
+						"Dynamic RAM perf mismatch; %pra (%u) vs %pra (%u)\n",
+						&range, handle, &dent->dpa_range, dent->handle);
+					continue;
+				}
 				update_perf_entry(dev, dent,
 						  &cxlds->part[i].perf);
+				cxlds->part[i].shareable = dent->shareable;
 				found = true;
 				break;
 			}
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0ef076c08ed2..7f63b86887f4 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -477,6 +477,7 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
 
 		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
 		cxlds->part[i].mode = part->mode;
+		cxlds->part[i].handle = part->handle;
 
 		/* Require ordered + contiguous partitions */
 		if (i) {
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 2932bbd67e55..bdb908c6e7f3 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1352,10 +1352,16 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
 {
 	u64 blk_size = le64_to_cpu(dev_part->block_size);
 	u64 len = le64_to_cpu(dev_part->length);
+	u32 handle = le32_to_cpu(dev_part->dsmad_handle);
 
 	part_array[index].start = le64_to_cpu(dev_part->base);
 	part_array[index].size = le64_to_cpu(dev_part->decode_length);
 	part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
+	if (handle & ~0xFF) {
+		dev_warn(dev, "DSMAD handle 0x%x has non-zero reserved bits\n", handle);
+		return -EINVAL;
+	}
+	part_array[index].handle = handle;
 
 	/* Check partitions are in increasing DPA order */
 	if (index > 0) {
@@ -1522,6 +1528,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
 	/* Return 1st partition */
 	dc_info->start = partitions[0].start;
 	dc_info->size = partitions[0].size;
+	dc_info->handle = partitions[0].handle;
 	dev_dbg(dev, "Returning partition 0 %llu size %llu\n",
 		dc_info->start, dc_info->size);
 
@@ -1529,7 +1536,8 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
 
-static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
+static void add_part(struct cxl_dpa_info *info, u64 start, u64 size,
+		     enum cxl_partition_mode mode, u8 handle)
 {
 	int i = info->nr_partitions;
 
@@ -1541,6 +1549,7 @@ static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_pa
 		.end = start + size - 1,
 	};
 	info->part[i].mode = mode;
+	info->part[i].handle = handle;
 	info->nr_partitions++;
 }
 
@@ -1558,9 +1567,9 @@ int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
 	info->size = mds->total_bytes;
 
 	if (mds->partition_align_bytes == 0) {
-		add_part(info, 0, mds->volatile_only_bytes, CXL_PARTMODE_RAM);
+		add_part(info, 0, mds->volatile_only_bytes, CXL_PARTMODE_RAM, 0);
 		add_part(info, mds->volatile_only_bytes,
-			 mds->persistent_only_bytes, CXL_PARTMODE_PMEM);
+			 mds->persistent_only_bytes, CXL_PARTMODE_PMEM, 0);
 		return 0;
 	}
 
@@ -1570,9 +1579,9 @@ int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
 		return rc;
 	}
 
-	add_part(info, 0, mds->active_volatile_bytes, CXL_PARTMODE_RAM);
+	add_part(info, 0, mds->active_volatile_bytes, CXL_PARTMODE_RAM, 0);
 	add_part(info, mds->active_volatile_bytes, mds->active_persistent_bytes,
-		 CXL_PARTMODE_PMEM);
+		 CXL_PARTMODE_PMEM, 0);
 
 	return 0;
 }
@@ -1624,7 +1633,8 @@ void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
 	info->size += dc_info.size;
 	dev_dbg(dev, "Adding dynamic ram partition 1; %llu size %llu\n",
 		dc_info.start, dc_info.size);
-	add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_1);
+	add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_1,
+		 dc_info.handle);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
 
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 6b548a1ec1e9..b29fb16725b4 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -118,6 +118,7 @@ struct cxl_dpa_info {
 	struct cxl_dpa_part_info {
 		struct range range;
 		enum cxl_partition_mode mode;
+		u8 handle;
 	} part[CXL_NR_PARTITIONS_MAX];
 	int nr_partitions;
 };
@@ -823,6 +824,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds);
 struct cxl_dc_partition_info {
 	u64 start;
 	u64 size;
+	u8 handle;
 };
 
 int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index e8a0899960d4..502d8333318b 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -141,11 +141,15 @@ enum cxl_partition_mode {
  * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
  * @perf: performance attributes of the partition from CDAT
  * @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
+ * @handle: DSMAS handle intended to represent this partition
+ * @shareable: Is the partition sharable (from its CDAT DSMAS entry)
  */
 struct cxl_dpa_partition {
 	struct resource res;
 	struct cxl_dpa_perf perf;
 	enum cxl_partition_mode mode;
+	u8 handle;
+	bool shareable;
 };
 
 #define CXL_NR_PARTITIONS_MAX 3
-- 
2.43.0



* Re: [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events
  2026-06-25 11:04 ` [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
@ 2026-06-25 18:12   ` sashiko-bot
  2026-06-26 21:54   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:12 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Missing DCD response on memory allocation failure leaves the device state machine permanently blocked while unconditionally clearing the event log.
- [Medium] Stack-allocation optimization in `memdev_release_extent()` is defeated by an internal dynamic allocation in `cxl_send_dc_response()`.
- [Medium] Invalid `__counted_by` attribute on a little-endian struct member breaks bounds checking on big-endian architectures.
--

commit f5333550094fcaacb4e98dbf0c270b32ec17b506
Author: Ira Weiny <iweiny@kernel.org>

cxl/mem: Set up framework for handling DC Events
    
Add support for receiving DC event records, but defer the real add/release
logic to subsequent commits. For now, refuse all extents for DC_ADD and ack
all DC_RELEASE events. Forced release is currently unsupported.

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c

[ ... ]

> +static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent extent = {
> +		.start_dpa = cpu_to_le64(range->start),
> +		.length = cpu_to_le64(range_len(range)),
> +	};
> +	struct cxl_extent_list_node node = { .extent = &extent };
> +	LIST_HEAD(extent_list);
> +
> +	dev_dbg(dev, "Release response dpa %pra\n", range);
> +
> +	/*
> +	 * Stack-allocated instead of kzalloc'ed to avoid potential -ENOMEM,
> +	 * which would prevent sending the release DC rsp.
> +	 */
> +	list_add_tail(&node.list, &extent_list);
> +	if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))

[Severity: Medium]
The comment mentions stack-allocating to avoid -ENOMEM so we can ensure the
release DC response is sent.

However, cxl_send_dc_response() uses kzalloc() to allocate the
cxl_mbox_dc_response message:

    struct cxl_mbox_dc_response *response __free(kfree) =
                        kzalloc(pl_size, GFP_KERNEL);

Doesn't this dynamic allocation defeat the optimization, since memory
pressure could still cause kzalloc() to fail, preventing the release
response from being sent?
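
For illustration only, one way the single-extent release path could avoid
the heap entirely is to encode the response into a fixed-size on-stack
buffer. The userspace sketch below follows the cxl_mbox_dc_response layout
quoted later in this review (8-byte header, 24-byte extent entry). The
helper names are hypothetical, and whether the mailbox layer can take such
a buffer directly is an assumption, not something this patch shows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sizes follow the quoted struct: __le32 + u8 + u8[3], then 3 x 8 bytes. */
#define DC_RSP_HDR_SIZE		8
#define DC_RSP_EXTENT_SIZE	24
#define DC_RSP_ONE_SIZE		(DC_RSP_HDR_SIZE + DC_RSP_EXTENT_SIZE)

/* Store the low nbytes of v at dst in little-endian order (hypothetical helper). */
static void put_le(uint8_t *dst, uint64_t v, unsigned int nbytes)
{
	for (unsigned int i = 0; i < nbytes; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/* Build a one-extent response in caller-provided storage; no allocation. */
static void build_single_extent_rsp(uint8_t *buf, uint64_t dpa, uint64_t len)
{
	assert(buf != NULL);
	memset(buf, 0, DC_RSP_ONE_SIZE);		/* flags and reserved = 0 */
	put_le(buf, 1, 4);				/* extent_list_size = 1 */
	put_le(buf + DC_RSP_HDR_SIZE, dpa, 8);		/* dpa_start */
	put_le(buf + DC_RSP_HDR_SIZE + 8, len, 8);	/* length */
}
```

A caller would declare `uint8_t buf[DC_RSP_ONE_SIZE];` on its stack, so the
only remaining failure mode is the mailbox command itself.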

> +		dev_dbg(dev, "Failed to release %pra\n", range);
> +}

[ ... ]

> +static int handle_add_event(struct cxl_memdev_state *mds,
> +			    struct cxl_event_dcd *event)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
> +	if (rc) {
> +		clear_pending_extents(mds);
> +		return rc;
> +	}

[Severity: High]
If add_to_pending_list() fails and returns -ENOMEM, we return early without
sending the required CXL_MBOX_OP_ADD_DC_RESPONSE command.

Since cxl_handle_dcd_event_records() drops the error and the event is cleared
from the device log, will this leave the device state machine permanently
blocked waiting for a response that never arrives?

> +
> +	if (event->flags & CXL_DCD_EVENT_MORE) {

[ ... ]

> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h

[ ... ]

> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 flags;
> +	u8 reserved[3];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[] __counted_by(extent_list_size);

[Severity: Medium]
Should this use __counted_by_le() instead of __counted_by()?

Since extent_list_size is explicitly little-endian (__le32), using
__counted_by() on big-endian architectures will result in the bounds checking
instrumentation reading an inflated integer value, leading to out-of-bounds
errors or kernel panics.

> +} __packed;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=12


* Re: [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs
  2026-06-25 11:04 ` [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs Anisa Su
@ 2026-06-25 18:12   ` sashiko-bot
  2026-06-26 23:08   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:12 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Empty sysfs directory created for absent dynamic RAM partitions.
--

commit 839299dd9514d54314e661cd789de765da08d4f4
Author: Ira Weiny <iweiny@kernel.org>

cxl/mem: Expose dynamic ram 1 partition in sysfs

This commit exposes the details of the first dynamic ram partition via
sysfs so user space can properly configure CXL regions.

> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c

[ ... ]

> @@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
>  	.is_visible = cxl_pmem_visible,
>  };
>  
> +static umode_t cxl_dynamic_ram_1_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_1);
> +
> +	if (a == &dev_attr_dynamic_ram_1_qos_class.attr &&
> +	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> +		return 0;
> +
> +	if (a == &dev_attr_dynamic_ram_1_size.attr &&
> +	    (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_1)))
> +		return 0;

[Severity: Low]
Should cxl_dynamic_ram_1_visible() return SYSFS_GROUP_INVISIBLE here when the
dynamic RAM partition is absent?

Since cxl_memdev_dynamic_ram_1_attribute_group is a named group
(.name = "dynamic_ram_1"), returning 0 hides the individual attributes but
still creates the group directory.

Won't this result in a permanently empty "dynamic_ram_1" directory in sysfs if
the size is 0, rather than suppressing the directory entirely?

> +
> +	return a->mode;
> +}
> +
> +static struct attribute_group cxl_memdev_dynamic_ram_1_attribute_group = {
> +	.name = "dynamic_ram_1",
> +	.attrs = cxl_memdev_dynamic_ram_1_attributes,
> +	.is_visible = cxl_dynamic_ram_1_visible,
> +};

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=5


* Re: [PATCH v11 10/31] cxl/mem: Configure dynamic capacity interrupts
  2026-06-25 11:04 ` [PATCH v11 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
@ 2026-06-25 18:14   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:14 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] An unhandled DCD event status bit will cause an infinite loop in the `cxl_event_thread` IRQ handler if DCD processing is disabled dynamically.
- [High] The driver fails to perform an initial drain of the DCD event log when the BIOS has control of standard CXL errors (`!native_cxl`), stranding pre-existing events.
--

commit 4e2d7455b6a9875d8ee530fc91ae3a8b65741caa
Author: Ira Weiny <iweiny@kernel.org>

cxl/mem: Configure dynamic capacity interrupts

Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism.  The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications.  Firmware can't
configure DCD events to be FW controlled but can retain control of
memory events.

> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 8d12c684d6700..95a4bf7c1e466 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
[ ... ]
> @@ -632,6 +657,30 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
>  	return 0;
>  }
>  
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> +			struct cxl_event_interrupt_policy *policy,
> +			bool native_cxl)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	int rc;
> +
> +	if (native_cxl) {
> +		rc = cxl_event_irqsetup(mds, policy);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	if (cxl_dcd_supported(mds)) {
> +		rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> +		if (rc) {
> +			dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
> +			cxl_disable_dcd(mds);

[Severity: High]
If cxl_event_req_irq() fails here, cxl_disable_dcd() is called, but the
device is already programmed to generate DCD interrupts.

When a DCD event interrupt fires, cxl_event_thread() will read the status
and retain the CXLDEV_EVENT_STATUS_DCD bit because the mask includes it.
However, cxl_mem_get_event_records() skips draining the log when
cxl_dcd_supported() is false:

drivers/cxl/core/mbox.c:cxl_mem_get_event_records() {
    ...
    if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
        cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
    ...
}

Will this cause cxl_event_thread() to spin in an infinite loop since the
status bit is never cleared by draining the log?

> +		}
> +	}
> +
> +	return 0;
> +}
[ ... ]
> @@ -687,11 +744,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_event_irqsetup(mds, &policy);
> +	rc = cxl_irqsetup(mds, &policy, native_cxl);
>  	if (rc)
>  		return rc;
>  
> -	cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
> +	if (native_cxl)
> +		cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);

[Severity: High]
If the BIOS has control of standard CXL errors (!native_cxl), does this skip
the initial drain of the DCD event log?

Since the driver owns the DCD log regardless of native_cxl, it appears that
pre-existing DCD events might be left in the log, potentially preventing
subsequent edge-triggered interrupts from firing.

> +
> +	dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
> +		native_cxl ? "OS" : "BIOS",
> +		cxl_dcd_supported(mds) ? "supported" : "not supported");
>  
>  	return 0;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=10


* Re: [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains
  2026-06-25 11:04 ` [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
@ 2026-06-25 18:15   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:15 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Error paths in handle_add_event fail to abort the DCD event chain properly, leading to a stalled device, leaked watchdog timers, and partial chain acceptance.
- [High] Watchdog timeout clears the pending list without transitioning to a discard state, leading to a double-response protocol violation if the chain completes.
--

commit f4b4a95fd0c56d91f06baa4811e9a05ea46012ee
Author: Anisa Su <anisa.su887@gmail.com>

cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains

A DC_ADD_CAPACITY event can span multiple event records. This patch adds a
delayed-work watchdog to prevent indefinitely stalled chains and enforces a
maximum limit on the number of pending extents.

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 7dd40fb8d613c..4e887b5cdc3ed 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c

[ ... ]

> @@ -1208,15 +1208,78 @@ static void clear_pending_extents(void *_mds)
>  
>  	list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list)
>  		delete_extent_node(pos);
> +	mds->add_ctx.nr_pending = 0;
>  	mds->add_ctx.group = NULL;
>  }

[ ... ]

> +static void cxl_dc_add_timeout(struct work_struct *work)
> +{
> +	struct pending_add_ctx *ctx = container_of(to_delayed_work(work),
> +						   struct pending_add_ctx,
> +						   timeout_work);

[ ... ]

> +	if (cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> +				 &ctx->pending_extents, 0))
> +		dev_dbg(dev, "Failed to send empty ADD_DC_RESPONSE on timeout\n");
> +
> +	clear_pending_extents(mds);
> +	ctx->armed = false;
> +}

[Severity: High]
If the device is just slow and later sends the remainder of the chain
(including the closing More=0 event), will the driver treat the delayed
events as a fresh chain?

Because cxl_dc_add_timeout() resets ctx->armed = false and nr_pending = 0
without marking the chain as discarded, it appears handle_add_event() would
process a delayed tail of the chain as a new chain.

When the More=0 event is processed, wouldn't the driver issue a second
ADD_DC_RESPONSE for this single event chain, violating the CXL protocol?

[ ... ]

> @@ -1239,10 +1303,20 @@ static int add_to_pending_list(struct list_head *pending_list,
>  static int handle_add_event(struct cxl_memdev_state *mds,
>  			    struct cxl_event_dcd *event)
>  {
> +	struct pending_add_ctx *ctx = &mds->add_ctx;
>  	struct device *dev = mds->cxlds.dev;
>  	int rc;
>  
> -	rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
> +	guard(mutex)(&ctx->lock);
> +
> +	if (ctx->nr_pending >= CXL_DC_MAX_PENDING_EXTENTS) {
> +		dev_warn(dev, "DC add chain exceeds %u extents; dropping (firmware bug)\n",
> +			 CXL_DC_MAX_PENDING_EXTENTS);
> +		clear_pending_extents(mds);
> +		return -ENOSPC;
> +	}
> +
> +	rc = add_to_pending_list(&ctx->pending_extents, &event->extent);
>  	if (rc) {
>  		clear_pending_extents(mds);
>  		return rc;

[Severity: High]
Do these early error returns fail to abort the event chain properly?

If the limit is exceeded or memory allocation fails, the code calls
clear_pending_extents() and returns an error without sending an
ADD_DC_RESPONSE.

If this happens on the first extent, wouldn't the device never receive a
response and stall indefinitely?

If it occurs on a subsequent extent, does this leave the watchdog timer
armed to fire spuriously later?

Additionally, because clear_pending_extents() resets nr_pending to 0, if
the device continues sending More=1 events for the same chain, would the
driver incorrectly process them as a new chain and potentially accept the
tail of a chain after dropping the head?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=13


* Re: [PATCH v11 14/31] cxl/extent: Handle DC Add Capacity events
  2026-06-25 11:04 ` [PATCH v11 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
@ 2026-06-25 18:16   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:16 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [High] Use-after-free in `cxlr_dax_classify_extent` due to lockless XArray iteration.
- [High] Fully duplicate Add-Capacity groups are erroneously dropped without sending an Add-DC-Response.
- [High] Failed extents unconditionally send Release DC commands on error paths, violating the skip_device_release protocol.
- [High] Manual `xa_erase()` before `put_device()` creates a race condition that clobbers re-allocated IDs.
- [High] Concurrent teardown causes a data race on `group->nr_extents`, leading to a permanent resource leak.
- [High] Recovered extents are permanently leaked to the device if their individual tracking allocation fails.
- [High] Tag groups spanning multiple DAX regions are silently grouped under the first region, corrupting HPA ranges.
--

commit c1287602ad57e47bb90d02808a1f908fcb287a89
Author: Ira Weiny <iweiny@kernel.org>
Subject: cxl/extent: Handle DC Add Capacity events

Replace the empty-response stub in handle_add_event() with the real
add pipeline. This commit fills in the logic for processing the pending list
and adds basic validation for extents before they are added to the
device model as a child of the cxlr_dax region.

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 0000000000000..6e67e787d14de
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,411 @@
[ ... ]
> +static void dc_extent_release(struct device *dev)
> +{
> +	struct dc_extent *dc_extent = to_dc_extent(dev);
> +	struct cxl_dc_tag_group *group;
> +
> +	if (!dc_extent)
> +		return;
> +
> +	group = dc_extent->group;
> +	cxled_release_extent(dc_extent->cxled, dc_extent);
> +	xa_erase(&group->cxlr_dax->dc_extents, dc_extent->dev.id);
> +	xa_erase(&group->dc_extents, dc_extent->seq_num);
> +	group->nr_extents--;
> +	if (!group->nr_extents)
> +		free_tag_group(group);

[Severity: High]
Is it safe to decrement group->nr_extents without a lock? Since multiple
extents in the same group can be released concurrently, couldn't this
non-atomic decrement cause a data race, leading to a permanent memory leak
of the cxl_dc_tag_group and its pinned cxlr_dax->dev reference?

> +	kfree(dc_extent);
> +}
[ ... ]
> +static enum cxl_extent_class
> +cxlr_dax_classify_extent(struct cxl_dax_region *cxlr_dax,
> +			 struct cxl_endpoint_decoder *cxled,
> +			 const struct range *ext_range)
> +{
> +	struct dc_extent *entry;
> +	unsigned long i;
> +
> +	xa_for_each(&cxlr_dax->dc_extents, i, entry) {
> +		if (entry->cxled != cxled)

[Severity: High]
Is this iteration safe against concurrent teardown? xa_for_each drops the
internal RCU read lock between iterations, leaving the returned entry
unprotected. If online_tag_group fails concurrently or there is a
parallel release, couldn't accessing entry->cxled trigger a use-after-free?

> +			continue;
> +		if (range_contains(&entry->dpa_range, ext_range))
> +			return CXL_EXT_DUPLICATE;
[ ... ]
> +static int cxlr_add_extent(struct cxl_memdev_state *mds,
> +			   struct cxl_dax_region *cxlr_dax,
> +			   struct dc_extent *dc_extent)
> +{
> +	struct cxl_dc_tag_group **group = &mds->add_ctx.group;
> +	int rc;
> +
> +	if (*group && !uuid_equal(&(*group)->uuid, &dc_extent->uuid)) {
> +		kfree(dc_extent);
> +		return -EINVAL;
> +	}
> +
> +	if (!*group) {
> +		dev_dbg(&cxlr_dax->dev, "Alloc new tag group\n");
> +		*group = alloc_tag_group(cxlr_dax, &dc_extent->uuid);
> +		if (IS_ERR(*group)) {
> +			rc = PTR_ERR(*group);
> +			*group = NULL;
> +			kfree(dc_extent);
> +			return rc;
> +		}
> +	} else {
> +		dev_dbg(&cxlr_dax->dev, "Append dc_extent to tag group\n");
> +	}
> +
> +	dc_extent->group = *group;

[Severity: High]
What happens if extents in the same tag group (sharing a UUID) map to
different endpoint decoders and thus different DAX regions? This logic
appears to allocate the tag group using the cxlr_dax of the first extent
and blindly links subsequent extents to it. Could this cause cross-region
extents to be parented to the wrong region and corrupt their HPA mapping
offsets?

> +
> +	/*
[ ... ]
> +int online_tag_group(struct cxl_dc_tag_group *group, bool skip_release)
> +{
[ ... ]
> +		rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id,
> +				  dev->id);
> +		if (rc) {
> +			xa_erase(&cxlr_dax->dc_extents, dev->id);
> +			put_device(dev);
> +			break;
> +		}
> +
> +		rc = device_add(dev);
> +		if (rc) {
> +			xa_erase(&cxlr_dax->dc_extents, dev->id);
> +			put_device(dev);
> +			break;
> +		}

[Severity: High]
Does the manual xa_erase() before put_device() create a double-erase race
condition?

put_device(dev) synchronously calls dc_extent_release(), which performs a
second xa_erase() for the same ID. If a concurrent thread allocates an ID
between the first and second erase, wouldn't the second erase silently
clobber the new valid allocation?

[Severity: High]
Additionally, does this error path violate the skip_device_release protocol?
At this point, group->skip_device_release has not been set yet. Because it
is false, won't dc_extent_release() unconditionally send a Release DC command
to the hardware for capacity the host never formally accepted?

> +
> +		dev_dbg(dev, "dc_extent HPA %pra (%pU)\n",
[ ... ]
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 4e887b5cdc3ed..08f51b8807c06 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
[ ... ]
> +static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
> +			     struct list_head *group, bool existing)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent_list_node *pos, *tmp;
> +	struct cxl_dc_tag_group *tag_group;
> +	int group_cnt = 0;
> +	int rc;
> +
> +	list_for_each_entry_safe(pos, tmp, group, list) {
> +		/*
> +		 * Pass the device-stamped 0-based shared_extn_seq through
> +		 * unchanged as the dax-side @seq_num (0..n-1).
> +		 */
> +		u16 seq = le16_to_cpu(pos->extent->shared_extn_seq);
> +
> +		if (cxl_add_extent(mds, pos->extent, seq) < 0) {
> +			dev_dbg(dev,
> +				"Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
> +				tag,
> +				le64_to_cpu(pos->extent->start_dpa),
> +				le64_to_cpu(pos->extent->length));
> +			delete_extent_node(pos);
> +			continue;
> +		}

[Severity: High]
If cxl_add_extent() fails during initialization of existing extents
(e.g., under memory pressure), we simply delete the node and continue.
Because the extent is never added to a tag group, the teardown error path
cannot see it. Does this permanently strand the capacity on the device since
no Release DC command is sent?

> +		group_cnt++;
> +	}
> +
> +	tag_group = mds->add_ctx.group;
> +	mds->add_ctx.group = NULL;
> +	if (!tag_group)
> +		return -ENXIO;

[Severity: High]
If an Add Capacity event contains only extents that the host has already
accepted (duplicates), cxl_add_extent() succeeds but returns 0, and
mds->add_ctx.group is never allocated. Does this erroneously fail the
entire group with -ENXIO, causing cxl_add_pending() to drop it without
sending an Add-DC-Response, thereby stalling the device?

> +
> +	rc = online_tag_group(tag_group, !existing);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=14


* Re: [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device
  2026-06-25 11:04 ` [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
@ 2026-06-25 18:16   ` sashiko-bot
  2026-06-26 22:26   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:16 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Mailbox payload size statically requests too many partitions, failing initialization on devices with small mailboxes.
- [High] Incorrect unit conversion of DC partition decode_length causes severe integer overflow.
- [High] Logic error in the DPA skip check guarantees failure, breaking DCD on all devices.
- [Medium] Valid CXL devices configured with 0 Dynamic Capacity partitions are erroneously treated as having an I/O error.
--

commit cbb0f735b5888cbb19cb493b31cc330ef3afdadf
Author: Ira Weiny <iweiny@kernel.org>

cxl/mem: Read dynamic capacity configuration from the device

This commit implements the mailbox commands to read Dynamic Capacity
partition configurations from CXL devices. It validates the partitions
and configures the first partition as dynamic RAM.

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 07aba6f0b7194..2932bbd67e550 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1347,6 +1347,188 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)

[ ... ]

> +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
> +			u8 index, struct cxl_dc_partition *dev_part)
> +{
> +	u64 blk_size = le64_to_cpu(dev_part->block_size);
> +	u64 len = le64_to_cpu(dev_part->length);
> +
> +	part_array[index].start = le64_to_cpu(dev_part->base);
> +	part_array[index].size = le64_to_cpu(dev_part->decode_length);
> +	part_array[index].size *= CXL_CAPACITY_MULTIPLIER;

[Severity: High]
Is it correct to multiply decode_length by CXL_CAPACITY_MULTIPLIER here?

The CXL specification provides the decode length in bytes, similar to the 
base and length fields. Multiplying this by 256MB will cause a 64-bit 
integer overflow for capacities of 64GB or larger. This overflow can wrap 
around to a small value, bypassing the later DPA ordering checks and 
registering a corrupted partition size.

[ ... ]

> +/* Returns the number of partitions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
> +			     struct cxl_mbox_get_dc_config_out *dc_resp,
> +			     size_t dc_resp_size)
> +{
> +	struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> +		.partition_count = CXL_MAX_DC_PARTITIONS,
> +		.start_partition_index = start_partition,
> +	};
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = dc_resp_size,
> +		.payload_out = dc_resp,
> +		.min_out = 8,
> +	};

[ ... ]

> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> +			struct cxl_dc_partition_info *dc_info)
> +{
> +	struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
> +	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree);
> +	struct device *dev = mbox->host;
> +	u8 start_partition;
> +	u8 num_partitions;
> +	size_t dc_resp_size = struct_size(dc_resp,
> +					  partition,
> +					  CXL_MAX_DC_PARTITIONS);

[Severity: High]
Can this static size calculation cause initialization failures on devices 
with small mailboxes?

The code requests CXL_MAX_DC_PARTITIONS (8 partitions), resulting in an 
expected response payload size of 328 bytes (an 8-byte header plus 8 
40-byte partition entries). However, the CXL specification allows a minimum 
mailbox payload size of 256 bytes. 

If a device has a mailbox smaller than 328 bytes, cxl_internal_send_cmd() 
will unconditionally fail with -E2BIG. Should the requested number of 
partitions be dynamically bounded by the device's negotiated payload size?
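
A minimal sketch of such a bound, using the sizes cited above (8-byte
header, 40-byte partition entries, at most 8 partitions). The macro and
function names are hypothetical, and the payload size would come from
whatever the mailbox negotiated.

```c
#include <assert.h>
#include <stddef.h>

#define DC_CONFIG_HDR_SIZE	8
#define DC_PARTITION_ENTRY_SIZE	40
#define MAX_DC_PARTITIONS	8

/* The full-size request is the 328 bytes mentioned in the review. */
static_assert(DC_CONFIG_HDR_SIZE +
	      MAX_DC_PARTITIONS * DC_PARTITION_ENTRY_SIZE == 328,
	      "unexpected full response size");

/* How many partition entries fit in one response of payload_size bytes. */
static unsigned int dc_partitions_per_call(size_t payload_size)
{
	size_t n;

	if (payload_size < DC_CONFIG_HDR_SIZE + DC_PARTITION_ENTRY_SIZE)
		return 0;	/* not even one entry fits */

	n = (payload_size - DC_CONFIG_HDR_SIZE) / DC_PARTITION_ENTRY_SIZE;
	return n > MAX_DC_PARTITIONS ? MAX_DC_PARTITIONS : (unsigned int)n;
}
```

With a 256-byte mailbox this yields 6 entries per call, so the existing
do/while loop over start_partition would need two calls to cover all 8.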

> +	dc_resp = kmalloc(dc_resp_size, GFP_KERNEL);
> +	if (!dc_resp)
> +		return -ENOMEM;
> +
> +	/**
> +	 * Read and check all partition information for validity and potential
> +	 * debugging; see debug output in cxl_dc_check()
> +	 */
> +	start_partition = 0;
> +	num_partitions = 0;
> +	do {
> +		int rc, i, j;
> +
> +		rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> +		if (rc < 0) {
> +			dev_err(dev, "Failed to get DC config: %d\n", rc);
> +			return rc;
> +		}
> +
> +		if (rc == 0) {
> +			dev_err(dev,
> +				"Device reported %u partitions available but returned none at index %u\n",
> +				dc_resp->avail_partition_count, start_partition);
> +			return -EIO;
> +		}

[Severity: Medium]
Does this unintentionally fail on valid devices configured with 0 Dynamic 
Capacity partitions?

If a device supports DCD but is configured with zero partitions, 
avail_partition_count is 0, and cxl_get_dc_config() legitimately returns 0. 
This path treats it as a hard error and returns -EIO, logging a confusing 
error message instead of cleanly disabling DCD and proceeding.

[ ... ]

> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> +{
> +	struct cxl_dc_partition_info dc_info = { 0 };
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
> +	if (rc) {
> +		dev_warn(dev,
> +			 "Failed to read Dynamic Capacity config: %d\n", rc);
> +		cxl_disable_dcd(mds);
> +		return;
> +	}
> +
> +	/* Skips between pmem and the dynamic partition are not supported */
> +	if (dc_info.start != info->size) {
> +		dev_warn(dev,
> +			 "Dynamic Capacity skip from pmem not supported\n");
> +		cxl_disable_dcd(mds);
> +		return;
> +	}

[Severity: High]
Is it intended to compare dc_info.start against info->size here?

Because cxl_mem_dpa_fetch() initializes info->size to mds->total_bytes, 
it already includes the dynamic capacity. This means dc_info.start will 
always be strictly less than info->size, causing this skip check to falsely 
fail on every valid device with a DC partition and breaking DCD entirely.

> +
> +	info->size += dc_info.size;

[Severity: High]
Does this inadvertently double-count the dynamic capacity?

Since info->size already includes the total device capacity, adding 
dc_info.size here will inflate the total DPA span incorrectly.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=2


* Re: [PATCH v11 07/31] cxl/region: Add DC DAX region support
  2026-06-25 11:04 ` [PATCH v11 07/31] cxl/region: Add DC DAX region support Anisa Su
@ 2026-06-25 18:16   ` sashiko-bot
  2026-06-26 23:18   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:16 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Lockless access to `cxlmd->cxlds` via the newly introduced `cxled_to_mds()` helper in `store_targetN()` leads to a NULL pointer dereference during concurrent device teardown.
--

commit 0243c2c435424873e37aafdcb17aee31b0ea07a1
Author: Ira Weiny <iweiny@kernel.org>

cxl/region: Add DC DAX region support

This commit introduces dynamic DAX regions to support CXL DC capabilities.
It adds a new sysfs attribute for creating dynamic RAM regions and updates
target attachment logic to handle DCD support.

> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 82ca3a4767080..9ed141fa13341 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
[ ... ]
> @@ -18,6 +19,13 @@ enum cxl_detach_mode {
>  	DETACH_INVALIDATE,
>  };
>  
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	return to_cxl_memdev_state(cxlmd->cxlds);
> +}

[Severity: High]
Can this result in a NULL pointer dereference during concurrent device
teardown?

If a user writes to a CXL region's targetX sysfs file, store_targetN()
calls cxled_to_mds() without holding locks that synchronize with device
teardown.

Concurrently, if the CXL device is hot-removed or the driver is unbound,
cxl_memdev_shutdown() nullifies cxlmd->cxlds under cxl_memdev_rwsem.

This would cause cxled_to_mds() to pass a NULL pointer to
to_cxl_memdev_state(), which unconditionally dereferences it:

drivers/cxl/cxlmem.h:to_cxl_memdev_state() {
    if (cxlds->type != CXL_DEVTYPE_CLASSMEM)
        return NULL;
    ...
}

Does this path need to hold cxl_memdev_rwsem or otherwise verify
cxlmd->cxlds is not NULL before using it?
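
One possible shape for the second option, as a minimal userspace sketch.
The types below are hypothetical stand-ins that carry only the fields the
example needs, and locking is deliberately left out: the caller is assumed
to hold whatever lock serializes against shutdown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
enum devtype_model { DEVTYPE_DEVMEM, DEVTYPE_CLASSMEM };

struct dev_state_model {
	enum devtype_model type;
};

struct memdev_model {
	struct dev_state_model *cxlds;	/* set to NULL on shutdown */
};

/*
 * NULL-tolerant lookup: returns the state only when it is still attached
 * and of the class-memory type. This sketch does not model locking.
 */
struct dev_state_model *memdev_state_or_null(struct memdev_model *cxlmd)
{
	struct dev_state_model *cxlds;

	assert(cxlmd != NULL);
	cxlds = cxlmd->cxlds;
	if (!cxlds || cxlds->type != DEVTYPE_CLASSMEM)
		return NULL;
	return cxlds;
}
```

A NULL check alone only narrows the window; without the lock the pointer
can still be cleared between the check and the use.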

> +
>  #ifdef CONFIG_CXL_REGION
>  
>  struct cxl_region_context {
[ ... ]
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e50dc716d4e82..ba03ec5e27c32 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
[ ... ]
> @@ -2389,6 +2396,8 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  	if (sysfs_streq(buf, "\n"))
>  		rc = detach_target(cxlr, pos);
>  	else {
> +		struct cxl_endpoint_decoder *cxled;
> +		struct cxl_memdev_state *mds;
>  		struct device *dev;
>  
>  		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2400,8 +2409,21 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  			goto out;
>  		}
>  
> -		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> -				   TASK_INTERRUPTIBLE);
> +		cxled = to_cxl_endpoint_decoder(dev);
> +		if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1) {
> +			mds = cxled_to_mds(cxled);
> +			if (!mds) {
> +				dev_dbg(dev, "No memdev state\n");
> +				rc = -ENODEV;
> +				goto out;
> +			}
> +			if (!cxl_dcd_supported(mds)) {
> +				dev_dbg(dev, "DCD unsupported\n");
> +				rc = -EINVAL;
> +				goto out;
> +			}
> +		}
> +		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
>  out:
>  		put_device(dev);
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=7

* Re: [PATCH v11 15/31] cxl/mem: Drop misaligned DCD extent groups
  2026-06-25 11:04 ` [PATCH v11 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
@ 2026-06-25 18:19   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:19 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Un-ratelimited `dev_warn` on invalid CXL device input can lead to log flooding and Denial of Service.
--

commit 2eb192d0644cd71e7046e47b82d5b1422081590e
Author: Ira Weiny <iweiny@kernel.org>

cxl/mem: Drop misaligned DCD extent groups

This commit adds an alignment check to cxl_add_pending() to ensure that
extents in a tag group are aligned to the dax region's mapping granularity.
Misaligned extents now cause the entire group to be dropped.

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 08f51b8807c06..14ba263044f08 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c

[ ... ]

> @@ -1395,11 +1410,38 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
>  	return group_cnt;
>  }
>  
> +/*
> + * Validate a tag @group before realizing it.  Returns 0 if the group may be
> + * added, or a negative errno if it must be dropped.  Further gates layer in
> + * here in later commits.
> + */
> +static int cxl_validate_group(struct cxl_memdev_state *mds, const uuid_t *tag,
> +			      struct list_head *group)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent_list_node *pos;
> +
> +	/* Alignment gate — drop the group if any member fails */
> +	list_for_each_entry(pos, group, list) {
> +		if (!cxl_extent_dcd_aligned(pos->extent)) {
> +			dev_warn(dev,
> +				 "Tag %pUb: dropping group, extent DPA:%#llx LEN:%#llx not %#llx-aligned\n",
> +				 tag,
> +				 le64_to_cpu(pos->extent->start_dpa),
> +				 le64_to_cpu(pos->extent->length),
> +				 (u64)PMD_SIZE);

[Severity: Medium]
Could this un-ratelimited dev_warn() lead to log flooding?

If a malfunctioning or compromised CXL device rapidly sends DCD_ADD_CAPACITY
events with misaligned extents, it appears this could spam the kernel log
and cause high CPU overhead.

Should this use dev_warn_ratelimited() instead, since it is driven by
untrusted device input?

> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=15

* Re: [PATCH v11 16/31] cxl/extent: Validate DC extent partition
  2026-06-25 11:04 ` [PATCH v11 16/31] cxl/extent: Validate DC extent partition Anisa Su
@ 2026-06-25 18:20   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:20 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Entirely duplicate tag groups are rejected and dropped instead of replying success to the device.
- [High] Concurrent access to `group->nr_extents` without locks causes a memory leak and pins the DAX region device indefinitely.
--

commit b3bfb6df5d2f0f0f9e2b022064279f74742afa2e
Author: Ira Weiny <iweiny@kernel.org>

cxl/extent: Validate DC extent partition
    
Extend cxl_validate_extent() — the per-extent check of the add pipeline
to check partition membership. Resolves an extent's DPA to its containing
DC partition. Sharability is a property of the partition (part->shareable),
taken from its CDAT DSMAS entry.

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 6e67e787d14de..2e770c5279c23 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c

[ ... ]

> @@ -101,6 +159,27 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
>  		.end = start_dpa + le64_to_cpu(extent->length) - 1,
>  	};
>  	struct range ed_range;
> +	uuid_t uuid;
> +
> +	import_uuid(&uuid, extent->uuid);
> +
> +	part = cxl_extent_dc_partition(mds, extent, &ext_range);
> +	if (!part)
> +		return -ENXIO;
> +
> +	if (part->shareable) {
> +		if (uuid_is_null(&uuid)) {
> +			dev_err_ratelimited(dev,
> +				"DC extent DPA %pra: sharable-partition extent has null tag (firmware bug)\n",
> +				&ext_range);
> +			return -ENXIO;
> +		}
> +	} else if (le16_to_cpu(extent->shared_extn_seq)) {
> +		dev_err_ratelimited(dev,
> +			"DC extent DPA %pra (%pU): non-sharable partition but shared_extn_seq=%u (firmware bug)\n",
> +			&ext_range, &uuid, le16_to_cpu(extent->shared_extn_seq));
> +		return -ENXIO;
> +	}
>  
>  	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
>  	if (!cxlr || !cxlr->cxlr_dax)

[Severity: High]
This is a pre-existing issue, but I noticed a potential race condition in
drivers/cxl/core/extent.c:dc_extent_release() that might lead to a leak:

	group->nr_extents--;
	if (!group->nr_extents)
		free_tag_group(group);

Could concurrent access to group->nr_extents without locks cause a lost
decrement?

When userspace closes sysfs files for multiple extents of the same tag group
concurrently, their kobject refcounts can drop to 0 simultaneously, causing
dc_extent_release() to run on multiple CPUs in parallel.

If group->nr_extents never reaches 0, does this permanently pin the DAX
region device (since the reference isn't dropped) and leak the
cxl_dc_tag_group structure?
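
The property the counter needs is that exactly one caller sees the final
decrement. A minimal userspace sketch with C11 atomics follows; the struct
and function names are hypothetical. In the kernel the analogous choice
would be something like refcount_dec_and_test() or a lock around the
counter, and which one fits is left to the author.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a tag group's extent count; not the patch's type. */
struct group_model {
	atomic_uint nr_extents;
};

/*
 * Drop one reference. atomic_fetch_sub() returns the value held before the
 * subtraction, so exactly one caller observes the 1 -> 0 transition and is
 * told to free the group, no matter how the callers interleave.
 */
bool group_put(struct group_model *g)
{
	unsigned int old = atomic_fetch_sub(&g->nr_extents, 1);

	assert(old > 0);	/* underflow would mean an unbalanced put */
	return old == 1;
}
```

Only the caller that receives true would go on to free the group, so no
decrement is lost and no second caller touches freed memory.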

[Severity: High]
This isn't a bug introduced by this patch, but I have a question about how
duplicate tag groups are handled between drivers/cxl/core/extent.c and
drivers/cxl/core/mbox.c.

In drivers/cxl/core/extent.c:cxl_add_extent(), duplicate extents return 0
without allocating or assigning mds->add_ctx.group:

	case CXL_EXT_DUPLICATE:
		/*
		 * Idempotent accept simplifies the dax-side scan for existing
		 * extents on region creation; reply success without duplicating.
		 */
		dev_warn_ratelimited(&cxled->cxld.dev,
				     "Extent %pra exists; accept again\n",
				     &ext_range);
		return 0;

If all extents in a group are duplicates, cxl_realize_group() in
drivers/cxl/core/mbox.c finishes its loop with group_cnt > 0 but
mds->add_ctx.group remains NULL. It then returns -ENXIO:

	tag_group = mds->add_ctx.group;
	mds->add_ctx.group = NULL;
	if (!tag_group)
		return -ENXIO;

Since this causes the caller cxl_add_pending() to treat the group as an error
and drop the extents instead of adding them to the accepted list, does this
leave the device hanging without an ADD_DC_RESPONSE if it resends an
unacknowledged extent group?
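
If acknowledging a fully duplicate group is the intended behaviour, the
return logic could be modelled as below. This is a userspace sketch under
that assumption; the enum and the function are hypothetical and stand in
for the per-extent classification done by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-extent outcomes; the real classifier is in the patch. */
enum add_result_model { ADD_NEW, ADD_DUPLICATE, ADD_FAILED };

/*
 * Returns the number of extents to acknowledge, or a negative errno.
 * A group made up entirely of duplicates is acknowledged rather than
 * dropped; only a failed extent rejects the group.
 */
int realize_group_model(const enum add_result_model *results, size_t n)
{
	size_t added = 0, dups = 0;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		switch (results[i]) {
		case ADD_NEW:
			added++;
			break;
		case ADD_DUPLICATE:
			dups++;
			break;
		default:
			return -ENXIO;
		}
	}
	if (added + dups == 0)
		return -ENXIO;	/* empty group: nothing to acknowledge */
	return (int)(added + dups);
}
```

The point of the sketch is only that "no new tag group was allocated" and
"the group failed" are distinguishable outcomes.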

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=16

* Re: [PATCH v11 19/31] cxl/extent: Enforce cross-region tag uniqueness
  2026-06-25 11:04 ` [PATCH v11 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
@ 2026-06-25 18:23   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:23 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Use-after-free of `tag_group` in `cxl_realize_group()` when extent addition fails.
- [High] Data race on `group->nr_extents` during concurrent extent device release leading to UAF/leak.
--

commit 0105c4aa2861118ac775e4aaac54f4e888f84aa2
Author: Anisa Su <anisa.su887@gmail.com>

cxl/extent: Enforce cross-region tag uniqueness

This patch adds a host-wide registry of live tag groups with non-null uuids
to ensure that tag uuids are unique across all regions and memdevs. It
updates alloc_tag_group() to register new tags, free_tag_group() to
unregister them, and validation logic to reject conflicting tags.

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index a590a89f35802..36be56ca10974 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c

[ ... ]

>  static void free_tag_group(struct cxl_dc_tag_group *group)
>  {
> +	cxl_tag_unregister(group);
>  	xa_destroy(&group->dc_extents);
>  	/* Drop the pin taken in alloc_tag_group(). */
>  	put_device(&group->cxlr_dax->dev);

[Severity: High]
Can this lead to a use-after-free of the tag_group pointer?

If xa_insert() fails (e.g., due to -ENOMEM) for the first extent in
cxlr_add_extent(), the allocated tag_group is retained in
mds->add_ctx.group but its dc_extents xarray remains empty.

Back in cxl_realize_group(), the loop continues and eventually extracts
this empty tag_group:

cxl_realize_group() {
        ...
        tag_group = mds->add_ctx.group;
        ...
        rc = online_tag_group(tag_group, !existing);
        ...
        if (existing)
                rc = __cxlr_notify_extent(tag_group->cxlr_dax->cxlr, ...
}

If online_tag_group() iterates over the empty xarray, decrements
group->nr_extents to 0, frees the tag_group via free_tag_group(), and
returns 0 (success), won't cxl_realize_group() then proceed to
dereference the freed tag_group pointer when calling
__cxlr_notify_extent() or cxlr_notify_extent()?

[Severity: High]
Can a data race occur on group->nr_extents during concurrent releases?

The group->nr_extents counter is decremented with a non-atomic
group->nr_extents-- in both dc_extent_release() and rm_tag_group().
Because sysfs files like length and offset are world-readable,
unprivileged userspace processes can hold references to multiple extents
within the same tag group.

If these processes close the sysfs files concurrently,
dc_extent_release() executes on multiple CPUs concurrently.

If one CPU reads the counter, and a second CPU reads, decrements, and
frees the tag_group via free_tag_group(), won't the first CPU then
write to the nr_extents field of the freed memory?

>  	kfree(group);
>  }

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=19

* Re: [PATCH v11 18/31] cxl/extent: Handle DC Release Capacity events
  2026-06-25 11:04 ` [PATCH v11 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
@ 2026-06-25 18:23   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:23 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Lockless iteration over the shared `cxlr_dax->dc_extents` xarray leads to Use-After-Free.
- [High] Unsynchronized non-atomic modifications to `group->nr_extents` lead to lost decrements and tag group leak.
- [High] Multi-extent tag group releases incorrectly tear down the entire group on the first chunk, causing spurious errors and WARN_ON stack traces on subsequent chunks.
--

commit 277b0b7ee1e3ee9199ddbac19db51d5566770cbb
Author: Ira Weiny <iweiny@kernel.org>

cxl/extent: Handle DC Release Capacity events
    
This commit replaces the no-op ack stub for cxl_rm_extent() with the real
teardown logic. It resolves the released DPA range to its region and endpoint
decoder, locates the matching dc_extent, and tears down the entire containing
tag group atomically.

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 0ebb581ca8335..a590a89f35802 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
[ ... ]
> +static void rm_tag_group(struct cxl_dc_tag_group *group)
> +{
> +	struct device *region_dev = &group->cxlr_dax->dev;
> +	struct dc_extent *dc_extent;
> +	unsigned long index;
> +
> +	/*
> +	 * Pin @group across the walk: each devm_release_action runs the
> +	 * dc_extent_unregister action synchronously, which drops the last
> +	 * reference on the dc_extent device and fires dc_extent_release.
> +	 * The release decrements group->nr_extents and, on the final
> +	 * decrement, frees @group.  Without the pin the next iteration's
> +	 * xa_find_after() dereferences a freed xarray.
> +	 */
> +	group->nr_extents++;
> +	xa_for_each(&group->dc_extents, index, dc_extent)
> +		devm_release_action(region_dev, dc_extent_unregister, dc_extent);

[Severity: High]
What happens if a device releases a multi-extent tag group by sending multiple
DCD release events (one per chunk)?

When the host processes the first event, cxl_rm_extent() resolves the group
and calls rm_tag_group(), which unregisters all extents in one pass. If
userspace still holds a reference to an extent (e.g. an open sysfs file),
that extent remains in the xarray.

When subsequent release events for the remaining chunks are processed,
cxl_rm_extent() finds the same group and calls rm_tag_group() again. Will
invoking devm_release_action() on already-removed actions trigger a
WARN_ON(!devres) stack trace?

> +	group->nr_extents--;
> +	if (!group->nr_extents)
> +		free_tag_group(group);

[Severity: High]
Can this result in lost decrements and a tag group leak?

Since group->nr_extents is a plain unsigned int, a lockless decrement here
in rm_tag_group() could race with dc_extent_release(). If userspace holds
references to multiple extents in the same tag group via sysfs and closes
them concurrently, dc_extent_release() will run on multiple CPUs
simultaneously.

This would race with the decrement here, preventing the counter from reaching
zero and permanently leaking the cxl_dc_tag_group and its UUID.

> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
[ ... ]
> +	/*
> +	 * Find the dc_extent whose DPA range covers the released range and
> +	 * whose tag matches.  The release targets the entire containing
> +	 * tag group atomically; partial release is not supported.
> +	 */
> +	group = NULL;
> +	xa_for_each(&cxlr_dax->dc_extents, idx, dc_extent) {
> +		if (dc_extent->cxled != cxled)

[Severity: High]
Is it safe to iterate the shared cxlr_dax->dc_extents xarray without RCU
protection or a write lock here?

cxl_rm_extent() and cxlr_dax_classify_extent() iterate over the shared
xarray using xa_for_each() while holding cxl_rwsem.region only for read. If
concurrent DCD events from different memdevs in the same interleaved region
trigger rm_tag_group() or a failed cxl_add_extent(), dc_extent objects can be
removed and synchronously freed via kfree() in dc_extent_release() by another
thread.

Since xa_for_each() drops the internal RCU lock during the loop body, could
the iterating thread dereference a freed dc_extent pointer here?

> +			continue;
> +		if (!range_contains(&dc_extent->dpa_range, &dpa_range))
> +			continue;
> +		if (!uuid_equal(&dc_extent->group->uuid, &tag))
> +			continue;
> +		group = dc_extent->group;
> +		break;
> +	}
> +	if (!group) {
> +		dev_err(&cxlr_dax->dev,
> +			"release DPA %pra (%pU) matches no dc_extent\n",
> +			&dpa_range, &tag);

[Severity: High]
Could this error path be triggered spuriously on multi-chunk releases?

If a device sends multiple DCD release events for a multi-extent tag group,
the first chunk will cause rm_tag_group() to unregister all extents. If no
sysfs references remain, the extents are freed immediately.

When subsequent release events for the remaining chunks are processed, they
will fail to find the group here and print this spurious "matches no dc_extent"
error.

> +		return -EINVAL;
> +	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=18

* Re: [PATCH v11 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions
  2026-06-25 11:05 ` [PATCH v11 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions Anisa Su
@ 2026-06-25 18:24   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Typo in documentation: 'can by chained' should be 'can be chained'.
--

commit 06abcda7f883fa6e5d50becfc5150a117123a7a6
Author: Anisa Su <anisa.su887@gmail.com>

Documentation/cxl: Document DCD extent handling and DC-backed DAX regions

This commit extends the CXL and DAX driver-api documentation to cover
Dynamic Capacity Devices. It details the conditions for accepting or
releasing extents, and explains the mapping between accepted extents and
DAX devices.

> diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
> --- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
> +++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst

[ ... ]

> +Accepting Extents
> +-----------------
> +Extents are made available to the host from the device through DC ADD events.
> +Event records contain extents, which may be tagged or untagged, shared or
> +not shared. Multiple event records can by chained together by the `More` flag.

[Severity: Low]
This isn't a bug, but there is a typo here: "can by chained" should be
"can be chained".

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=31

* Re: [PATCH v11 17/31] cxl/mem: Enforce tag-group semantics
  2026-06-25 11:04 ` [PATCH v11 17/31] cxl/mem: Enforce tag-group semantics Anisa Su
@ 2026-06-25 18:24   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Untagged extents spanning different DC partitions and DAX regions are incorrectly coalesced and assigned to the same region.
- [High] Fully duplicate extent groups are incorrectly rejected and omitted from the device response.
--

commit c2cf18ae81b1c32953f39b06aeaa9e153f4ac609
Author: Anisa Su <anisa.su887@gmail.com>

cxl/mem: Enforce tag-group semantics

This commit completes tag-group validation. It adds sequence integrity
checks to verify that sharable partitions have dense shared_extn_seq
values, and verifies that tagged allocations do not span DC partitions.

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 14ba263044f08..7967b0db2c512 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c

[ ... ]

> +static int cxl_check_group_partition(struct cxl_memdev_state *mds,
> +				     const uuid_t *tag,
> +				     const struct list_head *group)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	const struct cxl_dpa_partition *first_part = NULL;
> +	u64 first_dpa = 0;
> +	struct cxl_extent_list_node *pos;
> +
> +	if (uuid_is_null(tag) || list_empty(group))
> +		return 0;

[Severity: High]
This is a pre-existing issue, but does this explicit bypass for untagged
extents allow them to span different DC partitions?

If a device sends multiple untagged DC add capacity events spanning
different partitions, cxl_add_pending() appears to extract them all into a
single group.

Since this check returns 0 for null UUIDs, cxl_realize_group() processes them
together. Would cxlr_add_extent() then register them all under the DAX region
of the first extent, effectively merging extents from different partitions into
the same region's resource tree?
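
A per-extent partition check that does not depend on the tag could look
like the userspace sketch below. The range type and both helpers are
hypothetical; bounds are inclusive, matching how the quoted code computes
an extent's end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range_model { uint64_t start, end; };	/* inclusive bounds */

/* Index of the partition wholly containing @ext, or -1 if none does. */
int partition_of(const struct range_model *parts, size_t nr_parts,
		 const struct range_model *ext)
{
	for (size_t i = 0; i < nr_parts; i++)
		if (ext->start >= parts[i].start && ext->end <= parts[i].end)
			return (int)i;
	return -1;
}

/* True when every extent resolves to the same partition as the first. */
bool group_in_one_partition(const struct range_model *parts, size_t nr_parts,
			    const struct range_model *exts, size_t nr_exts)
{
	int first;

	assert(nr_exts > 0);
	first = partition_of(parts, nr_parts, &exts[0]);
	if (first < 0)
		return false;
	for (size_t i = 1; i < nr_exts; i++)
		if (partition_of(parts, nr_parts, &exts[i]) != first)
			return false;
	return true;
}
```

Splitting an untagged batch by partition before realizing it would be an
alternative to rejecting it; the sketch only shows the detection step.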

[ ... ]

>  static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
> -			     struct list_head *group, bool existing)
> +			     struct list_head *group, bool shareable,
> +			     bool existing)
>  {
>  	struct device *dev = mds->cxlds.dev;
>  	struct cxl_extent_list_node *pos, *tmp;
>  	struct cxl_dc_tag_group *tag_group;
>  	int group_cnt = 0;
> +	u16 seq_num;
>  	int rc;
>  
> +	seq_num = -1;
>  	list_for_each_entry_safe(pos, tmp, group, list) {
> -		/*
> -		 * Pass the device-stamped 0-based shared_extn_seq through
> -		 * unchanged as the dax-side @seq_num (0..n-1).
> -		 */
> -		u16 seq = le16_to_cpu(pos->extent->shared_extn_seq);
> +		if (shareable)
> +			seq_num = le16_to_cpu(pos->extent->shared_extn_seq);
> +		else
> +			seq_num++;
>  
> -		if (cxl_add_extent(mds, pos->extent, seq) < 0) {
> +		if (cxl_add_extent(mds, pos->extent, seq_num) < 0) {

[Severity: High]
This is a pre-existing issue, but what happens if a device resends a tag group
where all extents have already been accepted?

If cxl_add_extent() detects a duplicate, it returns 0 without calling
cxlr_add_extent(). If this happens for every extent in the group,
mds->add_ctx.group is never allocated and remains NULL.

Does this cause the subsequent if (!tag_group) check in this function to drop
the group and return -ENXIO instead of returning group_cnt to acknowledge
the extents back to the device?

>  			dev_dbg(dev,
>  				"Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
>  				tag,
> @@ -1412,15 +1532,22 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=17

* Re: [PATCH v11 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices
  2026-06-25 11:05 ` [PATCH v11 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
@ 2026-06-25 18:26   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:26 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider:
- [High] Retaining pointers to `dax_resource` in `dev_dax` ranges without holding a driver-core reference on the underlying extent device risks a Use-After-Free.
- [High] The error rollback path in `uuid_claim_tagged()` leaks DAX mapping devres actions, leading to Use-After-Free on device teardown.
- [High] The sequence validation in `uuid_claim_tagged()` fails to detect missing trailing extents, silently carving a partial allocation.
- [Medium] Modifying device capacity via `uuid_store` bypasses the check for bound drivers, leading to driver state inconsistency.
- [Medium] The `align` attribute is silently made read-only on dynamic DAX regions, preventing userspace from configuring huge page alignments.
- [Medium] `uuid_claim_untagged()` may falsely fail with `-EINVAL` if the first discovered untagged extent is unaligned.
--

commit cde0cc72c8f92d08e5dfa29000813bc5b9c53109
Author: Anisa Su <anisa.su887@gmail.com>

dax/bus: Tag-aware uuid claim and show on DC dax devices

This patch introduces userspace claim semantics to resolve tagged groups
into DAX devices on DC DAX regions. It populates a seed device with zero
size, and allows capacity to be claimed by writing to the uuid sysfs
attribute, expanding the size or consuming an untagged resource.

> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index f086ad27d5077..d94c0853af105 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1100,6 +1101,9 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
>  		},
>  		.dax_resource = dax_resource,
>  	};
> +	/* Pin the extent for this range; trim_dev_dax_range() drops it. */
> +	if (dax_resource)
> +		dax_resource->use_cnt++;

[Severity: High]
Does this code risk a use-after-free if the underlying extent device is
removed?

alloc_dev_dax_range() pins the extent by incrementing use_cnt, but it does
not appear to secure a reference to the underlying extent_dev via
get_device(). If the parent CXL device is hot-unplugged, the core driver
model will unregister and free the child extent devices and their
dax_resource data.

Later, when the dev_dax device is destroyed, trim_dev_dax_range() would
dereference the freed dax_resource to decrement use_cnt.

[ ... ]

> @@ -1595,13 +1682,178 @@ static DEVICE_ATTR_RO(numa_node);
>  static ssize_t uuid_show(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
[ ... ]
> +static ssize_t uuid_claim_untagged(struct dax_region *dax_region,
> +				   struct dev_dax *dev_dax)
> +{
> +	struct dax_uuid_match match = {
> +		.dax_region = dax_region,
> +		.uuid = &uuid_null,
> +	};
> +	struct dax_resource *dax_resource;
> +	resource_size_t to_alloc;
> +	struct device *extent_dev;
> +	ssize_t alloc;
> +
> +	extent_dev = device_find_child(dax_region->dev, &match,
> +				       find_uuid_extent);
> +	if (!extent_dev)
> +		return -ENOENT;
> +
> +	dax_resource = dev_get_drvdata(extent_dev);
> +	to_alloc = resource_size(dax_resource->res);
> +	if (!alloc_is_aligned(dev_dax, to_alloc)) {
> +		put_device(extent_dev);
> +		return -EINVAL;
> +	}

[Severity: Medium]
Will this fail spuriously if the first untagged extent found happens to
be unaligned?

Since device_find_child() only returns the first matching extent, returning
-EINVAL here stops the claim entirely instead of iterating to check if
there is another untagged extent available that is properly aligned.
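
The alternative described above, continuing past an unaligned candidate,
can be sketched in userspace as follows. The function is hypothetical and
works on a plain array of extent sizes rather than on child devices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first extent size that is a multiple of @align,
 * or -1 when none qualifies. @align must be a non-zero power of two.
 */
int first_aligned_extent(const uint64_t *sizes, size_t n, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	for (size_t i = 0; i < n; i++)
		if ((sizes[i] & (align - 1)) == 0)
			return (int)i;
	return -1;
}
```

With sizes of 1 MiB, 2 MiB and 4 MiB and a 2 MiB alignment, the scan skips
the first entry and selects the second instead of failing outright.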

[ ... ]

> +static ssize_t uuid_claim_tagged(struct dax_region *dax_region,
> +				 struct dev_dax *dev_dax, const uuid_t *uuid)
> +{
[ ... ]
> +	/*
> +	 * Tagged groups carry a dense 0..n-1 @seq_num regardless of source —
> +	 * the device-stamped shared_extn_seq (already 0..n-1) for a sharable
> +	 * partition, or cxl-side arrival order for a non-sharable one (see
> +	 * &struct dax_resource).  A gap or out-of-range value here means an
> +	 * extent went missing on the cxl side (e.g. a per-extent failure in
> +	 * cxl_add_pending) or a cxl-side validation gap; in either case
> +	 * refuse the whole group rather than carve a partial allocation.
> +	 */
> +	for (i = 0; i < c.count; i++) {
> +		if (c.arr[i]->seq_num != i) {
> +			dev_WARN_ONCE(dax_region->dev, 1,
> +				"tag %pUb seq invariant violated at slot %u (got %u)\n",
> +				uuid, i, c.arr[i]->seq_num);
> +			rc = -EINVAL;
> +			goto out;
> +		}
> +	}

[Severity: High]
Does this validation loop fail to detect missing trailing extents?

The loop only checks up to c.count. If a group is expected to have 4
extents, but only the first 2 (seq_num 0 and 1) were collected, the loop
will successfully validate them.

This seems to violate the all-or-nothing allocation invariant mentioned in
the comment, silently carving a partial allocation and permanently orphaning
the remaining extents since a device cannot be appended to once claimed.
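
Catching a truncated group requires comparing the collected count against
an expected total, as in this userspace sketch. The function and its
@expected parameter are hypothetical; the sketch does not say where the
claim path would learn the expected size of the group.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check that @seq holds exactly 0..expected-1 in order. Comparing @count
 * against @expected is what catches a group whose tail never arrived.
 */
int check_dense_seq(const uint16_t *seq, size_t count, size_t expected)
{
	assert(seq != NULL || count == 0);
	if (count != expected)
		return -EINVAL;
	for (size_t i = 0; i < count; i++)
		if ((size_t)seq[i] != i)
			return -EINVAL;
	return 0;
}
```

A group of four with only sequence numbers 0 and 1 collected passes the
per-slot comparison but is rejected by the count check.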

[ ... ]

> +rollback:
> +	/*
> +	 * Partial failure: trim every range we added in this attempt.
> +	 * trim_dev_dax_range pops the most-recently-appended range from
> +	 * dev_dax->ranges[] and decrements its dax_resource->use_cnt, so
> +	 * looping until we have undone @i additions restores both
> +	 * dev_dax->ranges[] and the matched dax_resources' use_cnt.
> +	 */
> +	while (i-- > 0)
> +		trim_dev_dax_range(dev_dax);
> +out:
> +	kfree(c.arr);
> +	return rc;
> +}

[Severity: High]
Does this error rollback leak devm mapping actions?

During a successful allocation, alloc_dev_dax_range() calls
devm_register_dax_mapping() which attaches an unregister_dax_mapping devres
action.

When trim_dev_dax_range() pops the ranges, it does not seem to call
devm_release_action() to clean up the attached actions. When the dev_dax
device is eventually destroyed, these leaked actions would execute and attempt
to unregister mappings using array indices that no longer exist, leading to
an out-of-bounds dereference on the freed dev_dax->ranges array.

>  static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
>  			  const char *buf, size_t len)
>  {
[ ... ]
> +	ACQUIRE(rwsem_write_kill, dev_rwsem)(&dax_dev_rwsem);
> +	if ((rc = ACQUIRE_ERR(rwsem_write_kill, &dev_rwsem)))
> +		return rc;
> +
> +	/* A claimed device already has capacity; do not overwrite its uuid. */
> +	if (dev_dax_size(dev_dax))
> +		return -EBUSY;

[Severity: Medium]
Should this also check if a driver is actively bound to the device?

If userspace binds a driver to this 0-sized device before writing the
uuid attribute, the dev_dax_size() check above still passes. Without a
dev_dax->dev.driver check, the capacity can change while a driver is
already bound, which breaks the contract enforced in size_store().
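
The combined precondition can be modelled in userspace as below. The
struct and the function are hypothetical, and returning -EBUSY for the
driver-bound case is an assumption chosen to match the existing size check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two facts the claim path cares about. */
struct dev_dax_model {
	uint64_t size;		/* current capacity in bytes */
	bool driver_bound;	/* true while a driver is attached */
};

/* Returns 0 when a uuid claim may proceed, -EBUSY otherwise. */
int uuid_claim_precheck(const struct dev_dax_model *d)
{
	assert(d != NULL);
	if (d->driver_bound)
		return -EBUSY;	/* capacity must not change under a driver */
	if (d->size)
		return -EBUSY;	/* already claimed */
	return 0;
}
```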

[ ... ]

> @@ -1661,8 +1913,12 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
>  		return 0;
>  	if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
>  		return 0;
> -	if ((a == &dev_attr_align.attr ||
> -	     a == &dev_attr_size.attr) && is_static(dax_region))
> +	if (a == &dev_attr_uuid.attr && !is_dynamic(dax_region))
> +		return 0444;
> +	if (a == &dev_attr_align.attr &&
> +	    (is_static(dax_region) || is_dynamic(dax_region)))
> +		return 0444;

[Severity: Medium]
Does this unintentionally prevent configuring alignment for dynamic regions?

By forcing the align attribute to be read-only (0444) for dynamic regions,
userspace cannot configure huge page alignments (like 2MB or 1GB) for DC
DAX devices before they are claimed.

> +	if (a == &dev_attr_size.attr && is_static(dax_region))
>  		return 0444;
>  	return a->mode;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=26

* Re: [PATCH v11 23/31] dax/bus: Factor out dev dax resize logic
  2026-06-25 11:05 ` [PATCH v11 23/31] dax/bus: Factor out dev dax resize logic Anisa Su
@ 2026-06-25 18:27   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:27 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Integer overflow in __dev_dax_resize causes >2GB allocations to be treated as errors, leaking DAX resources on 32-bit systems.
--

commit abac505d7b066ca0edac0e8d7df59c037cc29e89
Author: Ira Weiny <iweiny@kernel.org>

dax/bus: Factor out dev dax resize logic

This commit factors out the device dax resize logic to prepare for dynamic
capacity regions. It introduces a helper function that allocates ranges
using a specified parent resource, returning the successfully allocated size.

> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 95683dc8fcd0c..ffa6b303fc9bd 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
[ ... ]
> +static ssize_t dev_dax_resize_static(struct resource *parent,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
>  {
>  	struct resource *res, *first;
>  	int rc;
>  
> +	first = parent->child;
> +	if (!first) {
> +		rc = alloc_dev_dax_range(parent, dev_dax,
> +					   parent->start, to_alloc, NULL);
> +		if (rc)
> +			return rc;
> +		return to_alloc;
> +	}
>  
>  	for (res = first; res; res = res->sibling) {
>  		struct resource *next = res->sibling;
> +		resource_size_t alloc;
>  
>  		/* space at the beginning of the region */
> +		if (res == first && res->start > parent->start) {
> +			alloc = min(res->start - parent->start, to_alloc);
> +			rc = alloc_dev_dax_range(parent, dev_dax,
> +						 parent->start, alloc, NULL);
> +			if (rc)
> +				return rc;
> +			return alloc;

[Severity: High]
On 32-bit systems where ssize_t is a 32-bit signed integer and
resource_size_t can be an unsigned 64-bit value, does returning a 2GB or
larger alloc size cause an integer overflow?

If alloc is >= 2GB, returning it as ssize_t could wrap to a negative number
or zero if truncated from a larger type.
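
The truncation can be reproduced in user space. This is a minimal sketch,
assuming a 32-bit ssize_t is modeled with int32_t; to_ssize32() and
fits_ssize32() are hypothetical helpers for illustration and are not part
of the patch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Model returning a 64-bit resource_size_t through a 32-bit ssize_t.
 * Keeps the low 32 bits and reinterprets them as a signed value, which
 * is what the truncation amounts to on a two's-complement target.
 */
static int32_t to_ssize32(uint64_t alloc)
{
	uint32_t low = (uint32_t)alloc;	/* modular reduction, well defined */
	int32_t out;

	memcpy(&out, &low, sizeof(out));
	return out;
}

/* Hypothetical guard: true when alloc survives the conversion intact. */
static bool fits_ssize32(uint64_t alloc)
{
	return alloc <= (uint64_t)INT32_MAX;
}
```

In this model a 2GB allocation becomes negative and a 4GB allocation
becomes zero, which are the two values the caller treats as failure.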

[ ... ]
> +static ssize_t dev_dax_resize(struct dax_region *dax_region,
> +		struct dev_dax *dev_dax, resource_size_t size)
> +{
[ ... ]
> +retry:
> +	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> +	if (alloc < 0)
> +		return alloc;

[Severity: High]
If a large allocation overflows into a negative ssize_t, wouldn't this caller
evaluate alloc < 0 as true and mistakenly treat the successful allocation as
an error?

Because the successfully mapped chunk was already added to dev_dax->ranges in
dev_dax_resize_static(), aborting the resize here without unwinding the newly
allocated range seems like it would permanently leak the DAX space and leave
the region in an inconsistent state.

> +	if (alloc == 0)
> +		return -ENOSPC;
>  	to_alloc -= alloc;
>  	if (to_alloc)
>  		goto retry;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=23

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v11 28/31] cxl/mem: Trace Dynamic capacity Event Record
  2026-06-25 11:05 ` [PATCH v11 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
@ 2026-06-25 18:29   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:29 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The `cxl_dynamic_capacity` tracepoint omits `validity_flags` and other critical diagnostic fields, leaving userspace unable to validate parsed values.
--

commit 7da6c652fcb2ca33cefdf89b66affc0318921f70
Author: Ira Weiny <iweiny@kernel.org>

cxl/mem: Trace Dynamic capacity Event Record

This commit adds tracepoints for CXL Dynamic Capacity Event Records defined
in the CXL specification. It enables user space debugging of DC capacity
changes by exposing the events through the trace log.

> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a972e4ef19368..e5b88887d11b5 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h

[ ... ]

> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct cxl_event_dcd *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Dynamic capacity Event */
> +		__field(u8, event_type)
> +		__field(u16, hostid)
> +		__field(u8, updated_region_index)
> +		__field(u64, dpa_start)
> +		__field(u64, length)
> +		__array(u8, uuid, UUID_SIZE)
> +		__field(u16, sh_extent_seq)
> +	),

[Severity: Medium]
Are we missing some diagnostic fields from struct cxl_event_dcd in
this tracepoint definition?

It appears that validity_flags, flags (which contains the More bit),
num_avail_extents, and num_avail_tags are omitted here. Without
validity_flags, user space might be unable to determine if hostid and
updated_region_index contain valid data or hardware garbage. Additionally,
dropping the flags field could prevent user space from properly piecing
together multi-record capacity changes using the More bit.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=28

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v11 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events
  2026-06-25 11:04 ` [PATCH v11 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
@ 2026-06-25 18:29   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:29 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The dax_resource devm cleanup action is tied to the dc_extent device instead of the DAX region, leading to a Use-After-Free of the DAX region if the DAX driver unbinds.
--

commit 20b0e06e09f045b8ed2d61cce6004f2c3647ec5a
Author: Ira Weiny <iweiny@kernel.org>

cxl + dax: Surface dax_resources on DCD Add Capacity events

This commit defines the DAX region .notify callback and registers a
dax_resource for each extent when the CXL side onlines a tag group during
a DC Add event.

> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 7356aaaffe577..9b5c03616b83d 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
[ ... ]
> @@ -182,6 +182,138 @@ static bool is_dynamic(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
>  }
>  
> +static void __dax_release_resource(struct dax_resource *dax_resource)
> +{
> +	struct dax_region *dax_region = dax_resource->region;
> +
> +	lockdep_assert_held_write(&dax_region_rwsem);
> +	dev_dbg(dax_region->dev, "Extent release resource %pr\n",
> +		dax_resource->res);
[ ... ]
> +static int __dax_region_add_resource(struct dax_region *dax_region,
> +				     struct device *device,
> +				     resource_size_t start, resource_size_t length,
> +				     const uuid_t *tag, u16 seq_num)
> +{
> +	struct dax_resource *dax_resource __free(kfree) =
> +				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
[ ... ]
> +	/*
> +	 * open code devm_add_action_or_reset() to avoid recursive write lock
> +	 * of dax_region_rwsem in the error case.
> +	 */
> +	rc = devm_add_action(device, dax_release_resource, dax_resource);

[Severity: High]
Does this tie the cleanup action to the wrong device lifetime?

In __dax_region_add_resource(), the devm_add_action() registers
dax_release_resource() to the passed-in device, which is the dc_extent
device managed by the CXL core.

However, the dax_region is tied to the DAX driver's lifecycle. If a user
unbinds the DAX driver, the dax_region is freed. Because the dc_extent
device is managed externally, it outlives the DAX driver unbind.

Later, when the CXL region is destroyed and the dc_extent is released,
dax_release_resource() will run and call __dax_release_resource(), which
dereferences the freed dax_region:

__dax_release_resource()
    struct dax_region *dax_region = dax_resource->region;
    dev_dbg(dax_region->dev, ...);

Can this lead to a use-after-free on the dax_region?

> +	if (rc) {
> +		__dax_release_resource(dax_resource);
> +		return rc;
> +	}
> +
> +	dev_set_drvdata(device, no_free_ptr(dax_resource));
> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=21

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v11 29/31] tools/testing/cxl: Make event logs dynamic
  2026-06-25 11:05 ` [PATCH v11 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
@ 2026-06-25 18:31   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:31 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Incomplete validation of `nr_recs` in `mock_clear_event` allows an out-of-bounds array access and arbitrary memory free.
--

commit fb02931b1ff4b8cef05ff7386cba792c02a0c46e
Author: Ira Weiny <iweiny@kernel.org>

tools/testing/cxl: Make event logs dynamic

This commit modifies the mock event logs in the cxl test tool to be
dynamically allocated instead of using static arrays. This is intended to
support dynamically generating events when extents are created or destroyed.

> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 271c7ad8cc32c..a2bfd52db0762 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c

[ ... ]

> @@ -325,23 +354,20 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  	if (!log)
>  		return 0; /* No mock data in this log */
>  
> -	/*
> -	 * This check is technically not invalid per the specification AFAICS.
> -	 * (The host could 'guess' handles and clear them in order).
> -	 * However, this is not good behavior for the host so test it.
> -	 */
> -	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
> -		dev_err(dev,
> -			"Attempting to clear more events than returned!\n");
> -		return -EINVAL;
> -	}
> +	guard(write_lock)(&log->lock);
>  
>  	/* Check handle order prior to clearing events */
> -	for (nr = 0, handle = event_get_clear_handle(log);
> -	     nr < pl->nr_recs;
> -	     nr++, handle++) {
> +	handle = log->current_handle;
> +	for (nr = 0; nr < pl->nr_recs && handle != log->last_handle;
> +	     nr++, handle = event_inc_handle(handle)) {
> +
> +		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
> +			log_type, handle,
> +			le16_to_cpu(pl->handles[nr]));
> +
>  		if (handle != le16_to_cpu(pl->handles[nr])) {
> -			dev_err(dev, "Clearing events out of order\n");
> +			dev_err(dev, "Clearing events out of order %u %u\n",
> +				handle, le16_to_cpu(pl->handles[nr]));
>  			return -EINVAL;
>  		}
>  	}
> @@ -350,25 +376,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  		log->nr_overflow = 0;
>  
>  	/* Clear events */
> -	log->clear_idx += pl->nr_recs;
> -	return 0;
> -}
> +	for (nr = 0; nr < pl->nr_recs; nr++)
> +		mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));

[Severity: Medium]
Can this result in an out-of-bounds access and arbitrary memory free?

If a user issues a clear command where pl->nr_recs is greater than the
number of valid events, the validation loop above terminates early when
handle == log->last_handle. It does not return an error for the extra
records.

The deletion loop here then unconditionally iterates pl->nr_recs times,
passing the remaining, unvalidated user-provided handles to mes_del_event().

Inside mes_del_event():

static void mes_del_event(struct device *dev, struct mock_event_log *log,
                          u16 handle)
{
        ...
        record = log->events[handle];
        ...
        devm_kfree(dev, record);
}

Since the user-provided handles are not validated against the log->events
array size bounds when the validation loop terminates early, does this allow
an out-of-bounds read on the array followed by freeing whatever value is read?
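
One possible shape for a stricter validation pass is sketched below as a
user-space model. Everything here is an assumption for illustration:
struct model_log, MODEL_LOG_CAPACITY, model_inc_handle() and
model_validate_clear() are hypothetical, and the real handle increment
in the test code may differ:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LOG_CAPACITY 8	/* hypothetical ring size */

struct model_log {
	uint16_t current_handle;	/* oldest pending event */
	uint16_t last_handle;		/* one past the newest pending event */
};

/* Hypothetical wrap-around increment. */
static uint16_t model_inc_handle(uint16_t handle)
{
	return (uint16_t)((handle + 1) % MODEL_LOG_CAPACITY);
}

/*
 * Reject the request when it names more records than are pending,
 * instead of silently stopping the scan at last_handle.
 */
static int model_validate_clear(const struct model_log *log,
				const uint16_t *handles, size_t nr_recs)
{
	uint16_t handle = log->current_handle;
	size_t nr;

	for (nr = 0; nr < nr_recs; nr++) {
		if (handle == log->last_handle)
			return -EINVAL;	/* ran out of pending events */
		if (handles[nr] != handle)
			return -EINVAL;	/* out of order */
		handle = model_inc_handle(handle);
	}
	return 0;
}
```

With this shape, the deletion loop would only ever see handles that the
validation pass has matched against pending events.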

> +	dev_dbg(dev, "Delete log %d cur %d last %d\n",
> +		log_type, log->current_handle, log->last_handle);
>  
> -static void cxl_mock_event_trigger(struct device *dev)
> -{

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=29

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v11 27/31] cxl/region: Read existing extents on region creation
  2026-06-25 11:05 ` [PATCH v11 27/31] cxl/region: Read existing extents on region creation Anisa Su
@ 2026-06-25 18:32   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:32 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 8 potential issue(s) to consider:

New issues:
- [High] Cross-region lock assertion failure when processing existing extents.
- [High] Out-of-bounds read due to unvalidated `nr_returned` payload size.
- [High] Runtime DC add events are permanently dropped and lost during region probe, contradicting the design intent.
- [High] Permanent device capacity leak when existing extents are dropped.
- [High] AB-BA deadlock between `mds->add_ctx.lock` and `cxlr_dax->dev` lock.
- [High] Shared `pending_extents` list corruption aborts in-flight runtime events.
- [Medium] Mixing `goto`-based error handling with `__free()` and `guard()` violates subsystem cleanup guidelines.

Pre-existing issues:
- [High] Integer overflow in extent boundary calculation bypasses range validation.
--

commit ae344fab96ea05c143bd5e8ab47ce8059fc778b6
Author: Ira Weiny <iweiny@kernel.org>
Subject: cxl/region: Read existing extents on region creation

Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash. This patch reads the 'devices
extent list' during the asynchronous cxl_dax_region probe to reveal
previously accepted extents and add them to the new region, avoiding
races with probe.

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -261,7 +261,7 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
>  			       struct cxl_extent *extent,
>  			       struct cxl_endpoint_decoder **out_cxled,
>  			       struct cxl_dax_region **out_cxlr_dax,
> -			       struct range *out_ext_range)
> +			       struct range *out_ext_range, bool existing)
>  {
>  	u64 start_dpa = le64_to_cpu(extent->start_dpa);

[Severity: High]
This is a pre-existing issue, but could the extent boundary calculation
in cxl_validate_extent() overflow?

A few lines down from this signature change, the end boundary is calculated:

    struct range ext_range = (struct range) {
        .start = start_dpa,
        .end = start_dpa + le64_to_cpu(extent->length) - 1,
    };

If a device provides a start_dpa near the top of the 64-bit range and a
small length, could this calculation wrap around and produce a small
ext_range.end?

If so, the bounds check later in the function:

    if (!range_contains(&ed_range, &ext_range)) {

would evaluate r1->start <= r2->start && r1->end >= r2->end. With a wrapped
near-zero ext_range.end, the check might incorrectly pass, allowing
an out-of-bounds extent.
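
The wrap and one way to reject it can be shown with a small user-space
model. This is a sketch under stated assumptions: struct model_range and
model_extent_range() are hypothetical names, and model_range_contains()
only mirrors the predicate quoted above:

```c
#include <stdbool.h>
#include <stdint.h>

struct model_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* Same predicate as quoted above: r1 fully contains r2. */
static bool model_range_contains(const struct model_range *r1,
				 const struct model_range *r2)
{
	return r1->start <= r2->start && r1->end >= r2->end;
}

/*
 * Hypothetical helper: compute the inclusive end of an extent and
 * report failure on a zero length or on 64-bit wrap-around.
 */
static bool model_extent_range(uint64_t start, uint64_t len,
			       struct model_range *out)
{
	if (len == 0 || len - 1 > UINT64_MAX - start)
		return false;
	out->start = start;
	out->end = start + len - 1;
	return true;
}
```

In the kernel, check_add_overflow() from include/linux/overflow.h would
be the natural tool for the same test.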

> [ ... ]
> @@ -300,6 +300,13 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
>  	if (!cxlr || !cxlr->cxlr_dax)
>  		return -ENXIO;
>  
> +	/*
> +	 * Pre-existing extents must be read before any new extent is added so a
> +	 * tag already in use is never added twice; defer new adds until then.
> +	 */
> +	if (!existing && !smp_load_acquire(&cxlr->cxlr_dax->extents_scanned))
> +		return -EBUSY;
> +

[Severity: High]
Does returning -EBUSY here cause runtime Add Capacity events to be permanently
lost?

If cxl_validate_extent() returns -EBUSY for a new runtime event, its caller
cxl_realize_group() treats it as a failure, logging an error and deleting the
extent node:

    if (cxl_add_extent(mds, pos->extent, seq_num, existing) < 0) {
        dev_dbg(dev, "Tag %pUb: failed to add extent ...");
        delete_extent_node(pos);
        continue;
    }

Since the extent is destroyed rather than deferred, it seems it will be omitted
from the ADD_DC_RESPONSE, meaning the host officially rejects it. Does this
prevent the runtime capacity from being added?

> [ ... ]
> @@ -376,16 +383,22 @@ dc_extent_build(struct cxl_endpoint_decoder *cxled,
>  	return dc_extent;
>  }
>  
> -int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> -		       struct cxl_dc_tag_group *group)
> +/*
> + * Core notify: the caller must hold device_lock(&cxlr->cxlr_dax->dev).  Used by
> + * the existing-extent path that runs inside cxl_dax_region_probe(), where the
> + * async device-attach already holds the dax_region's device_lock — taking it
> + * again (as cxlr_notify_extent() does) would deadlock the probe against itself.
> + */
> +int __cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			 struct cxl_dc_tag_group *group)
>  {
>  	struct device *dev = &cxlr->cxlr_dax->dev;
>  	struct cxl_notify_data notify_data;
>  	struct cxl_driver *driver;
>  
> -	dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid);
> +	device_lock_assert(dev);

[Severity: High]
Could this lock assertion fail when processing extents from other regions?

When __cxl_process_extent_list() queries the device for existing extents,
the device returns all extents for all partitions, which may belong to
different regions.

For an extent belonging to a different region, cxl_realize_group() will call
__cxlr_notify_extent() targeting that other region. However, the probe thread
only holds the device lock for the region currently being probed. Will this
cause the assertion to fail for extents on other regions?

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1503,7 +1503,7 @@ static int cxl_realize_group(struct cxl_memdev_state *mds, const uuid_t *tag,
>  		else
>  			seq_num++;
>  
> -		if (cxl_add_extent(mds, pos->extent, seq_num) < 0) {
> +		if (cxl_add_extent(mds, pos->extent, seq_num, existing) < 0) {

[Severity: High]
If cxl_add_extent() fails for an existing extent, could this cause a permanent
capacity leak?

If existing=true, but cxl_validate_extent() or cxl_add_extent() fails
(e.g., due to an overlap or memory allocation failure), the host drops the
extent by calling delete_extent_node(pos) inside cxl_realize_group().

Because existing=true, cxl_add_pending() skips sending an ADD_DC_RESPONSE.
Since the device considers the extent accepted and the host never sends a
Release-DC command for the dropped extent, does the capacity remain leaked
until reboot?

> [ ... ]
> @@ -2199,6 +2209,145 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
>  
> +/* Return -EAGAIN if the extent list changes while reading */
> +static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> +	u32 current_index, total_read, total_expected, initial_gen_num;
> +	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> +	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u32 max_extent_count;
> +	int rc = 0;
> +	bool first = true;
> +
> +	struct cxl_mbox_get_extent_out *extents __free(kvfree) =
> +				kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
> +	if (!extents)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Build and consume add_ctx.pending_extents under add_ctx.lock, the
> +	 * same lock the DC event path (handle_add_event()) holds, so the two
> +	 * cannot corrupt the shared pending list.
> +	 */
> +	guard(mutex)(&mds->add_ctx.lock);

[Severity: High]
Could this lock acquisition lead to an AB-BA deadlock between the
mds->add_ctx.lock and the region device lock?

In __cxl_process_extent_list(), the driver core is probing the region,
so the region's device lock is already held. Here we acquire mds->add_ctx.lock,
establishing a (device_lock -> add_ctx.lock) ordering.

However, in the asynchronous event path (e.g., handle_add_event()), the driver
holds mds->add_ctx.lock, and then cxl_realize_group() calls cxlr_notify_extent()
which acquires the device lock:

    guard(device)(&cxlr->cxlr_dax->dev);

This establishes an (add_ctx.lock -> device_lock) ordering. If a DC event
arrives while a region is being probed, could these paths deadlock?

[Severity: High]
Could appending to mds->add_ctx.pending_extents here disrupt in-flight
multi-part runtime Add Capacity events?

Runtime DC Add Capacity events with the 'More' flag set accumulate partial
extents in mds->add_ctx.pending_extents across multiple interrupts, with the
lock released in between.

If __cxl_process_extent_list() is called concurrently during a region probe,
it unconditionally appends to this list. Later, it clears the list:

    out:
        clear_pending_extents(mds);

If there was an in-flight runtime event, would its accumulated extents be
processed as existing=true (meaning no response sent) and then destroyed,
leaving the device stalled waiting for a response?

> +
> +	total_read = 0;
> +	current_index = 0;
> +	total_expected = 0;
> +	max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
> +			    sizeof(struct cxl_extent);
> +	do {
> +		u32 nr_returned, current_total, current_gen_num;
> +		struct cxl_mbox_get_extent_in get_extent;
> +
> +		get_extent = (struct cxl_mbox_get_extent_in) {
> +			.extent_cnt = cpu_to_le32(max_extent_count),
> +			.start_extent_index = cpu_to_le32(current_index),
> +		};
> +
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +			.payload_in = &get_extent,
> +			.size_in = sizeof(get_extent),
> +			.size_out = cxl_mbox->payload_size,
> +			.payload_out = extents,
> +			.min_out = 1,
> +		};
> +
> +		rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> +		if (rc < 0)
> +			goto out;
> +
> +		/* Save initial data */
> +		if (first) {
> +			total_expected = le32_to_cpu(extents->total_extent_count);
> +			initial_gen_num = le32_to_cpu(extents->generation_num);
> +			first = false;
> +		}
> +
> +		nr_returned = le32_to_cpu(extents->returned_extent_count);

[Severity: High]
Could a malformed device response cause an out-of-bounds read here?

Because min_out is hardcoded to 1, truncated mailbox responses are accepted.
The code then extracts nr_returned:

    nr_returned = le32_to_cpu(extents->returned_extent_count);

and iterates over it:

    for (int i = 0; i < nr_returned ; i++) {
        struct cxl_extent *extent = &extents->extent[i];

Since nr_returned is not validated against max_extent_count or the actual
payload size returned by the hardware, could a large nr_returned value cause
the loop to read past the end of the kvmalloc buffer into unrelated kernel
heap memory?
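
A bound on nr_returned could be derived from the number of bytes the
device actually returned. The following is a minimal user-space sketch;
model_nr_returned_ok() is a hypothetical helper, and the header and
per-extent sizes are passed in as parameters rather than taken from the
real structures:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: nr_returned extents of ext_size bytes each must
 * fit in the size_out bytes returned, after a header of hdr_size bytes.
 */
static bool model_nr_returned_ok(size_t size_out, size_t hdr_size,
				 size_t ext_size, uint32_t nr_returned)
{
	if (ext_size == 0 || size_out < hdr_size)
		return false;	/* truncated or malformed response */
	return nr_returned <= (size_out - hdr_size) / ext_size;
}
```

The division form avoids multiplying a device-supplied count, so the
check itself cannot overflow.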

> [ ... ]
> +	} while (total_expected > total_read);
> +
> +	if (!list_empty(&mds->add_ctx.pending_extents)) {
> +		/*
> +		 * Reached only on the success path (every error does goto out),
> +		 * so rc is 0 here.  These extents are already accepted on the
> +		 * device (recovered from a prior boot).  Pass existing=true so
> +		 * they are not re-reported in an Add-DC-Response (the device
> +		 * would reject a DPA already added by a prior response), and so
> +		 * a failed online releases them rather than silently dropping
> +		 * them.
> +		 */
> +		rc = cxl_add_pending(mds, true);
> +	}
> +out:
> +	clear_pending_extents(mds);

[Severity: Medium]
Does mixing goto statements with scope-based cleanup violate the cleanup
subsystem guidelines?

This function declares resources using scope-based cleanup:

    struct cxl_mbox_get_extent_out *extents __free(kvfree) = ...
    guard(mutex)(&mds->add_ctx.lock);

But it also uses explicit goto out statements that jump to a manual cleanup
label:

    out:
        clear_pending_extents(mds);

The cleanup subsystem documentation expects that usage of "goto" and cleanup
helpers is never mixed in the same function to avoid confusing ownership
semantics. Should this be refactored?

> +
> +	return rc;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=27

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v11 20/31] cxl/region/extent: Expose dc_extent information in sysfs
  2026-06-25 11:04 ` [PATCH v11 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
@ 2026-06-25 18:33   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:33 UTC (permalink / raw)
  To: Anisa Su; +Cc: nvdimm, linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Double `xa_erase` race condition in `online_tag_group` error path can silently delete concurrently allocated extents.
- [High] Partial group accepted in `cxl_realize_group` on `cxl_add_extent` failure, violating More-chain atomicity.
--

commit 802d981e0cb1d1c5a24f7358b9d9626230786651
Author: Ira Weiny <iweiny@kernel.org>

cxl/region/extent: Expose dc_extent information in sysfs

Extent information can be helpful to the user to coordinate memory usage
with the external orchestrator and FM. Expose the details of each
dc_extent by creating sysfs entries.

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 36be56ca10974..69c993cdd558b 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c

[ ... ]

> @@ -99,6 +154,7 @@ static void dc_extent_release(struct device *dev)
>  static const struct device_type dc_extent_type = {
>  	.name = "extent",
>  	.release = dc_extent_release,
> +	.groups = dc_extent_attribute_groups,
>  };

[Severity: High]
This is a pre-existing issue, but does the error path in online_tag_group()
have a race condition that could silently delete concurrently allocated
extents?

If dev_set_name() or device_add() fails in online_tag_group(), the error path
calls xa_erase() followed by put_device():

drivers/cxl/core/extent.c:online_tag_group() {
    ...
		rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id,
				  dev->id);
		if (rc) {
			xa_erase(&cxlr_dax->dc_extents, dev->id);
			put_device(dev);
			break;
		}

		rc = device_add(dev);
		if (rc) {
			xa_erase(&cxlr_dax->dc_extents, dev->id);
			put_device(dev);
			break;
		}
    ...
}

However, put_device() synchronously invokes dc_extent_release(), which
unconditionally executes a second xa_erase() for the same id:

drivers/cxl/core/extent.c:dc_extent_release() {
    ...
	group = dc_extent->group;
	cxled_release_extent(dc_extent->cxled, dc_extent);
	xa_erase(&group->cxlr_dax->dc_extents, dc_extent->dev.id);
    ...
}

Since the xarray is configured with XA_FLAGS_ALLOC1, a concurrent thread
processing another DCD event could allocate this id via xa_alloc() immediately
after the first xa_erase().

Could the second xa_erase() in the release function erroneously delete the
concurrent thread's newly allocated extent, breaking region capacity
accounting?
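
One conceivable mitigation is to erase an id only while the slot still
holds the caller's own entry. The sketch below is a single-threaded
user-space model, not a proposed fix: struct model_table and
model_erase_if_owner() are hypothetical, and a kernel version would need
an atomic compare-and-exchange such as xa_cmpxchg() rather than a plain
comparison:

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 4	/* hypothetical table size */

struct model_table {
	void *slot[MODEL_SLOTS];
};

/*
 * Clear slot[id] only if it still points at owner. Returns true when
 * the entry was erased, false when the id is out of range or has been
 * reused by another entry in the meantime.
 */
static bool model_erase_if_owner(struct model_table *table, size_t id,
				 const void *owner)
{
	if (id >= MODEL_SLOTS || table->slot[id] != owner)
		return false;
	table->slot[id] = NULL;
	return true;
}
```

In this model a second erase by the original owner becomes a no-op once
the id has been handed to a different entry.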


[Severity: High]
This is a pre-existing issue, but does the error handling in
cxl_realize_group() allow partial tag groups to be onlined?

If cxl_add_extent() fails for any extent in a More-chain group (e.g., due to an
overlap or memory allocation failure), the loop calls delete_extent_node() and
executes continue:

drivers/cxl/core/mbox.c:cxl_realize_group() {
    ...
		if (cxl_add_extent(mds, pos->extent, seq_num, existing) < 0) {
			dev_dbg(dev,
				"Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
				tag,
				le64_to_cpu(pos->extent->start_dpa),
				le64_to_cpu(pos->extent->length));
			delete_extent_node(pos);
			continue;
		}
		group_cnt++;
    ...
}

By using continue, it bypasses the failure and leaves the remaining valid
extents in the group, resulting in a partial tag group being successfully
onlined.

Could this cause the host to surface a broken or incomplete DAX resource,
violating the architectural intent to either accept or reject tagged
allocations atomically?
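
An all-or-none variant could validate the whole group before committing
any of it. This is a minimal user-space sketch of that two-pass idea;
struct model_extent, its valid flag, and model_realize_group() are
hypothetical and only stand in for the per-extent checks done by
cxl_add_extent():

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_extent {
	uint64_t start;
	uint64_t len;
	bool valid;	/* stands in for the result of validation */
};

/*
 * Accept every extent in the group or none of them. Returns the number
 * of extents accepted: n on success, 0 if any extent fails validation.
 */
static size_t model_realize_group(const struct model_extent *ext, size_t n,
				  bool *accepted)
{
	size_t i;

	/* Nothing is accepted until the whole group has been checked. */
	for (i = 0; i < n; i++)
		accepted[i] = false;

	/* Pass 1: any failure rejects the whole group. */
	for (i = 0; i < n; i++)
		if (!ext[i].valid)
			return 0;

	/* Pass 2: commit only after every extent has passed. */
	for (i = 0; i < n; i++)
		accepted[i] = true;
	return n;
}
```

Compared with the delete-and-continue loop quoted above, a single bad
extent here leaves the group entirely unaccepted.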

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=20

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v11 30/31] tools/testing/cxl: Add DC Regions to mock mem data
  2026-06-25 11:05 ` [PATCH v11 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
@ 2026-06-25 18:34   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:34 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 8 potential issue(s) to consider:
- [Medium] Missing bounds validation on `cmd->size_out` causes out-of-bounds writes in `mock_get_dc_extent_list`.
- [Medium] `mock_get_dc_extent_list` fails to update `cmd->size_out` to the actual response size.
- [Medium] Unconditionally clearing sent extents in `mock_add_dc_response` breaks the host's ability to issue multi-part responses.
- [Medium] Missing bounds validation on `cmd->size_in` causes out-of-bounds reads in mailbox commands.
- [Medium] Missing `devm_kfree()` calls cause unbounded memory accumulation of devres structures when extents are removed.
- [Medium] Validation failures mid-command leave the mock device in a torn state due to partial state mutation.
- [Medium] Flawed overlap validation in `new_extent_valid()` allows injection of overlapping extents.
- [Medium] Concurrent sysfs operations create a race condition resulting in orphaned mock events.
--

commit 8e9b7fc74b6af2c32bcba55480acc53f67c13c08
Author: Ira Weiny <iweiny@kernel.org>

tools/testing/cxl: Add DC Regions to mock mem data

This commit enhances cxl_test by introducing mock memory data for Dynamic
Capacity (DC) regions. It provides simulated sysfs interfaces to inject and
delete extents, alongside emulated mailbox commands to interact with them.

> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index a2bfd52db0762..455f8a50d5818 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
[ ... ]
> @@ -1594,6 +1861,215 @@ static int mock_get_supported_features(struct cxl_mockmem_data *mdata,
>  	return 0;
>  }
>  
> +static int mock_get_dc_config(struct device *dev,
[ ... ]
> +static int mock_get_dc_extent_list(struct device *dev,
> +				   struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_mbox_get_extent_in *get = cmd->payload_in;
> +	u32 total_avail = 0, total_ret = 0, idx = 0;
> +	struct cxl_extent_data *ext;
> +	u32 ext_count, start_idx;
> +	unsigned long i;
> +
> +	ext_count = le32_to_cpu(get->extent_cnt);
> +	start_idx = le32_to_cpu(get->start_extent_index);
> +
> +	memset(resp, 0, sizeof(*resp));
> +
> +	guard(mutex)(&mdata->ext_lock);

[Severity: Medium]
Does `cmd->size_out` need to be validated against the expected output size
before we `memset` and write into the flexible array? If the caller provides
a buffer smaller than `sizeof(*resp) + total_ret * sizeof(struct cxl_extent)`,
could this overflow the output buffer?

> +	/*
> +	 * Total available needs to be calculated and returned regardless of
> +	 * how many can actually be returned.
> +	 */
> +	xa_for_each(&mdata->dc_accepted_exts, i, ext)
> +		total_avail++;
> +
> +	if (start_idx > total_avail)
> +		return -EINVAL;
> +
> +	xa_for_each(&mdata->dc_accepted_exts, i, ext) {
> +		if (total_ret >= ext_count)
> +			break;
> +
> +		if (idx++ >= start_idx) {
> +			resp->extent[total_ret].start_dpa =
> +						cpu_to_le64(ext->dpa_start);
> +			resp->extent[total_ret].length =
> +						cpu_to_le64(ext->length);
> +			export_uuid(resp->extent[total_ret].uuid, &ext->uuid);
> +			resp->extent[total_ret].shared_extn_seq =
> +						cpu_to_le16(ext->shared_extn_seq);
> +			total_ret++;
> +		}
> +	}
> +
> +	resp->returned_extent_count = cpu_to_le32(total_ret);
> +	resp->total_extent_count = cpu_to_le32(total_avail);
> +	resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
> +
> +	dev_dbg(dev, "Returning %d extents of %d total\n",
> +		total_ret, total_avail);
> +
> +	return 0;
> +}

[Severity: Medium]
Is it expected that `cmd->size_out` is left unmodified on success here? Other
mock mailbox commands update this field to reflect the actual response payload
size (e.g., setting it to `struct_size(resp, extent, total_ret)`).
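
As a rough userspace analogue of what `struct_size()` computes for the
response, the sketch below adds the fixed header to `n` trailing entries and
saturates to `SIZE_MAX` on overflow. The struct layout and `payload_size()`
are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical response layout with a trailing flexible array. */
struct sz_extent {
	uint64_t start_dpa;
	uint64_t length;
};

struct sz_extent_out {
	uint32_t returned_extent_count;
	uint32_t total_extent_count;
	struct sz_extent extent[];	/* flexible array member */
};

/* Header plus n entries, saturating to SIZE_MAX if the sum would wrap. */
static size_t payload_size(size_t n)
{
	size_t hdr = sizeof(struct sz_extent_out);

	if (n > (SIZE_MAX - hdr) / sizeof(struct sz_extent))
		return SIZE_MAX;
	return hdr + n * sizeof(struct sz_extent);
}
```

The value returned for `total_ret` entries is what the caller would expect to
find in `cmd->size_out` after a successful call.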

> +
> +static void dc_clear_sent(struct device *dev)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_extent_data *ext;
> +	unsigned long index;
> +
> +	lockdep_assert_held(&mdata->ext_lock);
> +
> +	/* Any extents not accepted must be cleared */
> +	xa_for_each(&mdata->dc_sent_extents, index, ext) {
> +		dev_dbg(dev, "Host rejected extent %#llx\n", ext->dpa_start);
> +		xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
> +	}
> +}

[Severity: Medium]
Does removing the extent from the XArray via `xa_erase()` leak the underlying
`cxl_extent_data` structure? Since it was originally allocated with
`devm_kzalloc()`, it appears this memory will accumulate indefinitely over
repeated insert/erase cycles unless `devm_kfree()` is explicitly called.
This identical pattern exists in `dc_delete_extent()` and
`release_accepted_extent()` as well.

> +
> +static int mock_add_dc_response(struct device *dev,
> +				struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_dc_response *req = cmd->payload_in;
> +	u32 list_size = le32_to_cpu(req->extent_list_size);
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	u32 last_offer_seq = 0;
> +	bool first = true;
> +
> +	guard(mutex)(&mdata->ext_lock);
> +	for (int i = 0; i < list_size; i++) {
> +		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
> +		u64 length = le64_to_cpu(req->extent_list[i].length);

[Severity: Medium]
Is it safe to iterate up to `list_size` without first validating that
`cmd->size_in` is large enough to contain
`struct_size(req, extent_list, list_size)`? Could a malformed request read
out-of-bounds memory here?
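
A minimal sketch of the kind of input check meant here, written for userspace
with hypothetical structure names. It rejects the request outright rather than
clamping, and uses division so that a large `list_size` cannot wrap the
multiplication.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the DC response input payload. */
struct in_extent {
	uint64_t dpa_start;
	uint64_t length;
	uint8_t reserved[8];
};

struct in_dc_response {
	uint32_t extent_list_size;
	uint8_t flags;
	uint8_t reserved[3];
	struct in_extent extent_list[];	/* flexible array member */
};

/* True only if size_in bytes can hold the header and list_size entries. */
static bool dc_response_fits(size_t size_in, uint32_t list_size)
{
	if (size_in < sizeof(struct in_dc_response))
		return false;

	return list_size <= (size_in - sizeof(struct in_dc_response)) /
			    sizeof(struct in_extent);
}
```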

> +		struct cxl_extent_data *ext;
> +		int rc;
> +
> +		/*
> +		 * CXL r4.0 8.2.10.9.9.3: the host must list extents in the
> +		 * order the device offered them (Add Capacity events); reject
> +		 * an out-of-order response as Invalid Input.
> +		 */
> +		ext = xa_load(&mdata->dc_sent_extents, start);
> +		if (!ext)
> +			ext = xa_load(&mdata->dc_accepted_exts, start);
> +		if (ext) {
> +			if (!first && ext->offer_seq < last_offer_seq) {
> +				dev_err(dev, "Add-DC-Response out of order at extent %#llx\n",
> +					start);
> +				return -EINVAL;
> +			}
> +			last_offer_seq = ext->offer_seq;
> +			first = false;
> +		}
> +
> +		rc = dc_accept_extent(dev, start, length);
> +		if (rc)
> +			return rc;
> +	}

[Severity: Medium]
Could returning an error midway through this loop leave the mock device in a
torn state? If an extent midway through the list fails validation (like the
out-of-order check), previous extents in the iteration have already been
irrevocably transitioned to the accepted state via `dc_accept_extent()`.
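
One common way to avoid a torn state is to validate the whole list before
committing any of it. The sketch below shows that two-pass shape in userspace;
`struct txn_extent` and `accept_all_or_none()` are hypothetical and much
simpler than the mock's XArray-backed state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record. */
struct txn_extent {
	uint64_t dpa_start;
	uint32_t offer_seq;
	bool accepted;
};

/*
 * Accept every extent in exts[0..n) only if the whole list is in
 * offer order. Returns 0 on success, -1 (nothing modified) otherwise.
 */
static int accept_all_or_none(struct txn_extent *exts, size_t n)
{
	size_t i;

	/* Pass 1: validate without touching any state. */
	for (i = 1; i < n; i++)
		if (exts[i].offer_seq < exts[i - 1].offer_seq)
			return -1;

	/* Pass 2: commit; this step cannot fail in the sketch. */
	for (i = 0; i < n; i++)
		exts[i].accepted = true;

	return 0;
}
```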

> +
> +	dc_clear_sent(dev);
> +	return 0;
> +}

[Severity: Medium]
Will unconditionally calling `dc_clear_sent(dev)` break host responses that are
split into multiple parts? If the host sets `req->flags & CXL_DCD_EVENT_MORE`,
this clears the remaining pending extents before the host can acknowledge them.
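
To make the concern concrete, here is a userspace sketch in which the pending
set is only cleared on the final part of a split response. The counters, the
flag value, and `sketch_add_response()` are all hypothetical simplifications.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DCD_EVENT_MORE 0x1	/* hypothetical flag value */

/* Hypothetical counters standing in for the sent/accepted extent sets. */
struct sketch_pending {
	unsigned int sent;	/* offered, not yet answered */
	unsigned int accepted;
};

static void sketch_add_response(struct sketch_pending *st,
				unsigned int nr_listed, uint8_t flags)
{
	if (nr_listed > st->sent)
		nr_listed = st->sent;

	st->sent -= nr_listed;
	st->accepted += nr_listed;

	/* More set: further parts of this response are still to come. */
	if (flags & SKETCH_DCD_EVENT_MORE)
		return;

	/* Final part: whatever was not listed is treated as rejected. */
	st->sent = 0;
}
```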

> +
> +static void dc_delete_extent(struct device *dev, unsigned long long start,
> +			     unsigned long long length)
> +{
[ ... ]
> +static int mock_dc_release(struct device *dev,
> +			   struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_dc_response *req = cmd->payload_in;
> +	u32 list_size = le32_to_cpu(req->extent_list_size);
> +
> +	for (int i = 0; i < list_size; i++) {
> +		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
> +		u64 length = le64_to_cpu(req->extent_list[i].length);

[Severity: Medium]
Similarly to `mock_add_dc_response()`, does `mock_dc_release()` need to
validate that `cmd->size_in` is large enough to hold `list_size` entries before
iterating over `req->extent_list[i]`?

> +
> +		dev_dbg(dev, "Extent %#llx released by host\n", start);
> +		release_accepted_extent(dev, start, length);
> +	}
> +
> +	return 0;
> +}
[ ... ]
> @@ -1936,11 +2442,328 @@ static ssize_t sanitize_timeout_store(struct device *dev,
>  
>  static DEVICE_ATTR_RW(sanitize_timeout);
>  
> +/* Return if the proposed extent would break the test code */
> +static bool new_extent_valid(struct device *dev, size_t new_start,
> +			     size_t new_len)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_extent_data *extent;
> +	size_t new_end, i;
> +
> +	if (!new_len)
> +		return false;
> +
> +	new_end = new_start + new_len;
> +
> +	dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
> +
> +	guard(mutex)(&mdata->ext_lock);
> +	dev_dbg(dev, "Checking extents starts...\n");
> +	xa_for_each(&mdata->dc_fm_extents, i, extent) {
> +		if (extent->dpa_start == new_start)
> +			return false;
> +	}

[Severity: Medium]
Is it intentional that the overlap validation only checks exact starting
addresses? The function computes `new_end`, but none of the loops use it or
`extent->length`, which seems to permit injecting extents whose DPA ranges
partially overlap.
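
For comparison, a full range-overlap test over half-open `[start, start + len)`
ranges could look like the sketch below. `dpa_ranges_overlap()` is a
hypothetical helper; it assumes non-zero lengths and uses subtraction so that
`start + len` never has to be computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [a_start, a_start + a_len) and [b_start, b_start + b_len)
 * share at least one byte. Assumes both lengths are non-zero.
 */
static bool dpa_ranges_overlap(uint64_t a_start, uint64_t a_len,
			       uint64_t b_start, uint64_t b_len)
{
	if (a_start >= b_start)
		return a_start - b_start < b_len;
	return b_start - a_start < a_len;
}
```

Adjacent ranges do not count as overlapping here, so back-to-back extents
would still be allowed.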

> +
> +	dev_dbg(dev, "Checking sent extents starts...\n");
> +	xa_for_each(&mdata->dc_sent_extents, i, extent) {
> +		if (extent->dpa_start == new_start)
> +			return false;
> +	}
> +
> +	dev_dbg(dev, "Checking accepted extents starts...\n");
> +	xa_for_each(&mdata->dc_accepted_exts, i, extent) {
> +		if (extent->dpa_start == new_start)
> +			return false;
> +	}
> +
> +	return true;
> +}
[ ... ]
> +static ssize_t __dc_inject_extent_store(struct device *dev,
> +					struct device_attribute *attr,
> +					const char *buf, size_t count,
> +					bool shared)
> +{
[ ... ]
> +	if (!new_extent_valid(dev, start, length))
> +		return -EINVAL;
> +
> +	rc = devm_add_fm_extent(dev, start, length, uuid_str, shared_extn_seq,
> +				shared);
> +	if (rc) {
> +		dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
> +			start, length, rc);
> +		return rc;
> +	}
> +
> +	mark_extent_sent(dev, start);
> +	rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, uuid_str,
> +			  shared_extn_seq, more);
> +	if (rc) {

[Severity: Medium]
Could concurrent sysfs operations result in orphaned events here?
`mark_extent_sent()` aborts silently if the extent was deleted concurrently
(e.g., from a sysfs delete executing at the same time), but
`__dc_inject_extent_store()` will still unconditionally queue a
`DCD_ADD_CAPACITY` event via `log_dc_event()`.
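
A sketch of the control flow this suggests: have the marking step report
whether it found the extent, and only queue the event when it did. The state
and function names are hypothetical, and the sketch ignores locking; in the
mock, both steps would presumably need to run under the same lock to close
the race.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified mock device state. */
struct inject_state {
	bool extent_present;	/* false once a concurrent delete has won */
	bool extent_sent;
	unsigned int events_logged;
};

/* Report whether the extent was still there to be marked. */
static bool sketch_mark_extent_sent(struct inject_state *st)
{
	if (!st->extent_present)
		return false;
	st->extent_sent = true;
	return true;
}

/* Queue the add event only when the extent was actually marked sent. */
static int sketch_inject(struct inject_state *st)
{
	if (!sketch_mark_extent_sent(st))
		return -1;	/* a kernel version might return -ENOENT */
	st->events_logged++;
	return 0;
}
```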

> +		dev_err(dev, "Failed to add event %d\n", rc);
> +		return rc;
> +	}
> +
> +	return count;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=30


* Re: [PATCH v11 22/31] cxl + dax: Release dax_resources on DCD Release Capacity events
  2026-06-25 11:04 ` [PATCH v11 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
@ 2026-06-25 18:36   ` sashiko-bot
  0 siblings, 0 replies; 61+ messages in thread
From: sashiko-bot @ 2026-06-25 18:36 UTC (permalink / raw)
  To: Anisa Su; +Cc: linux-cxl, nvdimm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Returning 0 when the DAX layer returns `-EBUSY` silently clears the release event without notifying the device, causing a permanent capacity leak.

Pre-existing issues:
- [Critical] Unprotected `xa_for_each` iteration over `dc_extents` in `cxl_rm_extent` accesses unlocked and potentially freed `dc_extent` pointers, causing a Use-After-Free.
--

commit a5dc5f1cb7a057f33cf2e7a95e0f09b9a990b420
Author: Ira Weiny <iweiny@kernel.org>

cxl + dax: Release dax_resources on DCD Release Capacity events

Implement the release path that mirrors the add path: when the device
asks for capacity back, the dax layer tears down the per-extent
resources for the whole tag group atomically via
dax_region_rm_resources().

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 59db1878b5e2d..7009ac6a51b46 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c

[Severity: Critical]
This isn't a bug introduced by this patch, but is it possible for the
xa_for_each() loop in cxl_rm_extent() to trigger a use-after-free?

Since cxl_rm_extent() and handle_add_event() hold cxl_rwsem.region
only for read, can two threads execute them concurrently for the same
region?

If thread A calls rm_tag_group() and unregisters devices:

drivers/cxl/core/extent.c:rm_tag_group() {
    xa_for_each(&group->dc_extents, index, dc_extent)
        devm_release_action(region_dev, dc_extent_unregister, dc_extent);
}

Which triggers a synchronous free via dc_extent_release():

drivers/cxl/core/extent.c:dc_extent_release() {
    kfree(dc_extent);
}

And concurrently, thread B executes the xa_for_each loop in cxl_rm_extent():

drivers/cxl/core/extent.c:cxl_rm_extent() {
    xa_for_each(&cxlr_dax->dc_extents, idx, dc_extent) {
        if (dc_extent->cxled != cxled)

Since xa_for_each() takes and drops the RCU read lock internally during
lookup, could thread B obtain the dc_extent pointer just as thread A frees it,
making the access to dc_extent->cxled a use-after-free?

[ ... ]

> @@ -627,6 +627,18 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, group);
> +	if (rc) {
> +		/*
> +		 * dax layer refused (-EBUSY) or failed (-ENOMEM, etc.).  Do
> +		 * not proceed to tear down the tag group — leave its
> +		 * dax_resources alive so we do not free them out from under
> +		 * live dev_dax ranges.  The device will retry the release.
> +		 */
> +		return 0;
> +	}

[Severity: High]
Does returning 0 here when the DAX layer returns -EBUSY silently clear the
release event and cause a permanent capacity leak?

If cxlr_notify_extent() returns -EBUSY because capacity is actively mapped,
returning 0 signals success to cxl_handle_dcd_event_records().

This will cause cxl_mem_get_records_log() to issue a
CXL_MBOX_OP_CLEAR_EVENT_RECORD command, which clears the event from the device
without ever sending the CXL_MBOX_OP_RELEASE_DC command.

The comment mentions the device will retry the release, but section
8.2.9.2.1.3 of the CXL 3.1 specification states the device 'MAY reissue an
event', not MUST.

If the device does not actively reissue the event, will the host permanently
lose track of the release request and leak the capacity?

> +
> +	/* Release the entire tag group */
>  	rm_tag_group(group);
>  	return 0;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625112638.550691-1-anisa.su@samsung.com?part=22


* Re: [PATCH v11 01/31] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2026-06-25 11:04 ` [PATCH v11 01/31] cxl/mbox: Flag " Anisa Su
@ 2026-06-26 21:43   ` Dave Jiang
  0 siblings, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 21:43 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
	Gregory Price, Anisa Su



On 6/25/26 4:04 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> Per the CXL 4.0 specification software must check the Command Effects
> Log (CEL) for dynamic capacity command support.
> 
> Detect support for the DCD commands while reading the CEL, including:
> 
>         Get DC Config
>         Get DC Extent List
>         Add DC Response
>         Release DC
> 
> Based on an original patch by Navneet Singh.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

tiny nit below

> 
> ---
> 
> Changes:
> - remove unused param struct cxl_memdev_state *mds from
>   cxl_set_dcd_cmd_enabled()
> 
> - remove unused param struct cxl_memdev_state *mds from
>   cxl_verify_dcd_cmds()
> 
> - cxl_verify_dcd_cmds(): originally filled out local
>   bitmap with all DCD cmd bits and checking if cmds_seen
>   bitmap is equal to the local bitmap. Replace with
>   simple call to bitmap_full(cmd_seen)
> 
> - cxl_walk_cel(): zero out dcd_cmds bitmap before using
> 
> - cxlmem.h: Add comment to enum dcd_cmd_enabled_bits
>   pointing to where the command set is defined in the
>   4.0 spec
> 
> - original commit message referred to CXL r3.1. Bump to r4.0
> ---
>  drivers/cxl/core/mbox.c | 39 +++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlmem.h    | 20 ++++++++++++++++++++
>  2 files changed, 59 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 7c6c5b7450a5..07aba6f0b719 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -165,6 +165,38 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
>  	}
>  }
>  
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(u16 opcode, unsigned long *cmd_mask)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, cmd_mask);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, cmd_mask);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, cmd_mask);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, cmd_mask);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
> +static bool cxl_verify_dcd_cmds(unsigned long *cmds_seen)
> +{
> +	return bitmap_full(cmds_seen, CXL_DCD_ENABLED_MAX);
> +}
> +
>  static bool cxl_is_poison_command(u16 opcode)
>  {
>  #define CXL_MBOX_OP_POISON_CMDS 0x43
> @@ -757,6 +789,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
>  	struct cxl_cel_entry *cel_entry;
>  	const int cel_entries = size / sizeof(*cel_entry);
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX) = {};
>  	struct device *dev = mds->cxlds.dev;
>  	int i, ro_cmds = 0, wr_cmds = 0;
>  
> @@ -785,11 +818,17 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  			enabled++;
>  		}
>  
> +		if (cxl_is_dcd_command(opcode)) {
> +			cxl_set_dcd_cmd_enabled(opcode, dcd_cmds);
> +			enabled++;
> +		}
> +
>  		dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
>  			enabled ? "enabled" : "unsupported by driver");
>  	}
>  
>  	set_features_cap(cxl_mbox, ro_cmds, wr_cmds);
> +	mds->dcd_supported = cxl_verify_dcd_cmds(dcd_cmds);
>  }
>  
>  static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_memdev_state *mds)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 776c50d1db51..60dc3f0006a7 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -230,6 +230,20 @@ struct cxl_event_state {
>  	struct mutex log_lock;
>  };
>  
> +/**
> + * CXL r4.0 Section 8.2.10.9 - Memory Device Command Sets. See Table 8-308.
> + *
> + * The 48h Command Set (Opcodes 4800h - 4803h) defines the device-enabled DCD
> + * commands.
> + * */

The comment should end with a plain " */" on its own line; checkpatch flags
the " * */" form.

DJ

> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>  /* Device enabled poison commands */
>  enum poison_cmd_enabled_bits {
>  	CXL_POISON_ENABLED_LIST,
> @@ -405,6 +419,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>   * @partition_align_bytes: alignment size for partition-able capacity
>   * @active_volatile_bytes: sum of hard + soft volatile
>   * @active_persistent_bytes: sum of hard + soft persistent
> + * @dcd_supported: all DCD commands are supported
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -424,6 +439,7 @@ struct cxl_memdev_state {
>  	u64 partition_align_bytes;
>  	u64 active_volatile_bytes;
>  	u64 active_persistent_bytes;
> +	bool dcd_supported;
>  
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
> @@ -485,6 +501,10 @@ enum cxl_opcode {
>  	CXL_MBOX_OP_UNLOCK		= 0x4503,
>  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>  	CXL_MBOX_OP_MAX			= 0x10000
>  };
>  



* Re: [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events
  2026-06-25 11:04 ` [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
  2026-06-25 18:12   ` sashiko-bot
@ 2026-06-26 21:54   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 21:54 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
	Gregory Price, Anisa Su



On 6/25/26 4:04 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> Adds the support for receiving DC event records but defers
> the real add/release logic to subsequent commits. Simply refuse all
> extents for DC_ADD and ack all DC_RELEASE events for now. Forced
> release is currently unsupported.
> 
> In order, this commit adds the following:
> 
> 1. Learn about DC Event Records and how to respond to them
> 
> * cxl_mem_get_event_records() learns about the DC Event record.
> Records of that type are routed to cxl_handle_dcd_event_records().
> 
> * cxl_handle_dcd_event_records() switches on event_type:
> 	- DCD_ADD_CAPACITY     -> handle_add_event()
> 	- DCD_RELEASE_CAPACITY -> cxl_rm_extent()
> 	- DCD_FORCED_CAPACITY_RELEASE is logged and ignored (FM/device-only).
> 
> * cxl_send_dc_response() sends the reply mailbox commands
> ADD_DC_RESPONSE / RELEASE_DC
> 
> 2. Add stubs for DC_ADD and DC_RELEASE logic
> 
> * handle_add_event() stages incoming extents onto
> mds->add_ctx.pending_extents and, when More=0 closes the chain,
> replies with an empty ADD_DC_RESPONSE — refusing all extents for now
> 
> * cxl_rm_extent() acks the release via memdev_release_extent() so the
> device's view stays consistent; we can ack all releases because
> we currently don't accept/use any extents offered.
> 
> 3. Structural setup for later commits:
> 
> * struct dc_extent, struct cxl_dc_tag_group, and pending_add_ctx
> set up the stage for the real DC_ADD path, which will enforce
> tag/grouping semantics
> 
> Based on an original patch by Navneet Singh.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>

Just one minor comment below, in addition to the issues raised by sashiko.


> 
> ---
> Changes:
> 1. mbox.c: fix leaking var node in add_to_pending_list()
> 2. mbox.c: clear previously staged extents in handle_add_event()
>   if add_to_pending_list() returns an error
> 3. mbox.c memdev_release_extent(): alloc cxl_extent_list_node
>  on stack instead of using kzalloc because if kzalloc fails,
>  it would prevent the release DC rsp from being sent. Avoid
>  that possibility altogether.
> ---
>  drivers/cxl/core/mbox.c | 245 +++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxl.h       |  69 +++++++++++
>  drivers/cxl/cxlmem.h    |  45 ++++++++
>  include/cxl/event.h     |  38 +++++++
>  4 files changed, 396 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 2ab400788824..7dd40fb8d613 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -5,6 +5,7 @@
>  #include <linux/ktime.h>
>  #include <linux/mutex.h>
>  #include <linux/unaligned.h>
> +#include <linux/list.h>
>  #include <cxlpci.h>
>  #include <cxlmem.h>
>  #include <cxl.h>
> @@ -1098,6 +1099,237 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int send_one_response(struct cxl_mailbox *cxl_mbox,
> +			     struct cxl_mbox_dc_response *response,
> +			     int opcode, u32 extent_list_size, u8 flags)
> +{
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = opcode,
> +		.size_in = struct_size(response, extent_list, extent_list_size),
> +		.payload_in = response,
> +	};
> +
> +	response->extent_list_size = cpu_to_le32(extent_list_size);
> +	response->flags = flags;
> +	return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> +}
> +
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +				struct list_head *extent_list, int cnt)
> +{
> +	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> +	struct cxl_mbox_dc_response *p;
> +	struct cxl_extent_list_node *pos, *tmp;
> +	struct cxl_extent *extent;
> +	u32 pl_index;
> +
> +	size_t pl_size = struct_size(p, extent_list, cnt);
> +	u32 max_extents = cnt;
> +
> +	/* May have to use more bit on response. */
> +	if (pl_size > cxl_mbox->payload_size) {
> +		max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> +			      sizeof(struct updated_extent_list);
> +		pl_size = struct_size(p, extent_list, max_extents);
> +	}
> +
> +	struct cxl_mbox_dc_response *response __free(kfree) =
> +						kzalloc(pl_size, GFP_KERNEL);
> +	if (!response)
> +		return -ENOMEM;
> +
> +	/* Set __counted_by to capacity before populating; reset per response. */
> +	response->extent_list_size = cpu_to_le32(max_extents);
> +
> +	if (cnt == 0)
> +		return send_one_response(cxl_mbox, response, opcode, 0, 0);
> +
> +	pl_index = 0;
> +	list_for_each_entry_safe(pos, tmp, extent_list, list) {
> +		extent = pos->extent;
> +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +		response->extent_list[pl_index].length = extent->length;
> +		pl_index++;
> +
> +		if (pl_index == max_extents) {
> +			u8 flags = 0;
> +			int rc;
> +
> +			if (pl_index < cnt)
> +				flags |= CXL_DCD_EVENT_MORE;
> +			rc = send_one_response(cxl_mbox, response, opcode,
> +					       pl_index, flags);
> +			if (rc)
> +				return rc;
> +			cnt -= pl_index;
> +			if (cnt < max_extents)
> +				max_extents = cnt;
> +			pl_index = 0;
> +		}
> +	}
> +
> +	if (!pl_index) /* nothing more to do */
> +		return 0;
> +	return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
> +}
> +
> +static void delete_extent_node(struct cxl_extent_list_node *node)
> +{
> +	list_del(&node->list);
> +	kfree(node->extent);
> +	kfree(node);
> +}
> +
> +static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent extent = {
> +		.start_dpa = cpu_to_le64(range->start),
> +		.length = cpu_to_le64(range_len(range)),
> +	};
> +	struct cxl_extent_list_node node = { .extent = &extent };
> +	LIST_HEAD(extent_list);
> +
> +	dev_dbg(dev, "Release response dpa %pra\n", range);
> +
> +	/*
> +	 * Stack-allocated instead of kzalloc'ed to avoid potential -ENOMEM,
> +	 * which would prevent sending the release DC rsp.
> +	 */
> +	list_add_tail(&node.list, &extent_list);
> +	if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> +		dev_dbg(dev, "Failed to release %pra\n", range);
> +}
> +
> +static void clear_pending_extents(void *_mds)
> +{
> +	struct cxl_memdev_state *mds = _mds;
> +	struct cxl_extent_list_node *pos, *tmp;
> +
> +	list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list)
> +		delete_extent_node(pos);
> +	mds->add_ctx.group = NULL;
> +}
> +
> +static int add_to_pending_list(struct list_head *pending_list,
> +			       struct cxl_extent *to_add)
> +{
> +	struct cxl_extent_list_node *node = kzalloc(sizeof(*node), GFP_KERNEL);
> +	struct cxl_extent *extent;
> +
> +	if (!node)
> +		return -ENOMEM;
> +	extent = kmemdup(to_add, sizeof(*extent), GFP_KERNEL);
> +	if (!extent) {
> +		kfree(node);
> +		return -ENOMEM;
> +	}
> +
> +	node->extent = extent;
> +	list_add_tail(&node->list, pending_list);
> +	return 0;
> +}
> +
> +/*
> + * Stub: stage extents on the pending list and reply with an empty
> + * ADD_DC_RESPONSE on More=0 (refuse all).  A later commit replaces
> + * the no-op tail with the real Add pipeline that surfaces a dax
> + * device per accepted extent.
> + */
> +static int handle_add_event(struct cxl_memdev_state *mds,
> +			    struct cxl_event_dcd *event)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
> +	if (rc) {
> +		clear_pending_extents(mds);
> +		return rc;
> +	}
> +
> +	if (event->flags & CXL_DCD_EVENT_MORE) {
> +		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> +		return 0;
> +	}
> +
> +	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> +				  &mds->add_ctx.pending_extents, 0);
> +	clear_pending_extents(mds);
> +	return rc;
> +}
> +
> +/*
> + * Stub: ack the release back to the device so it knows we are not
> + * using the range.  A later commit replaces this with the real
> + * teardown that walks the region's tag group and tears down the
> + * member dc_extent devices.
> + */
> +static int cxl_rm_extent(struct cxl_memdev_state *mds,
> +			 struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct range dpa_range = {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	memdev_release_extent(mds, &dpa_range);
> +	return 0;
> +}
> +
> +static char *cxl_dcd_evt_type_str(u8 type)

Should it return 'const char *' instead?

DJ
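
For reference, a userspace sketch of the const-qualified form. The enum is a
local copy of the event values so the snippet builds on its own, and the
`sketch_` names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Local copy of the event values, for a standalone userspace build. */
enum sketch_dc_event {
	SKETCH_DCD_ADD_CAPACITY,
	SKETCH_DCD_RELEASE_CAPACITY,
	SKETCH_DCD_FORCED_CAPACITY_RELEASE,
};

/* String literals must not be modified, so hand them out as const. */
static const char *sketch_dcd_evt_type_str(unsigned char type)
{
	switch (type) {
	case SKETCH_DCD_ADD_CAPACITY:
		return "add";
	case SKETCH_DCD_RELEASE_CAPACITY:
		return "release";
	case SKETCH_DCD_FORCED_CAPACITY_RELEASE:
		return "force release";
	default:
		return "<unknown>";
	}
}
```

With the const return type, an accidental write through the returned pointer
becomes a compile-time diagnostic instead of undefined behavior at run time.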

> +{
> +	switch (type) {
> +	case DCD_ADD_CAPACITY:
> +		return "add";
> +	case DCD_RELEASE_CAPACITY:
> +		return "release";
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		return "force release";
> +	default:
> +		break;
> +	}
> +
> +	return "<unknown>";
> +}
> +
> +static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					 struct cxl_event_record_raw *raw_rec)
> +{
> +	struct cxl_event_dcd *event = &raw_rec->event.dcd;
> +	struct cxl_extent *extent = &event->extent;
> +	struct device *dev = mds->cxlds.dev;
> +	uuid_t *id = &raw_rec->id;
> +	int rc;
> +
> +	if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
> +		return;
> +
> +	dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
> +		cxl_dcd_evt_type_str(event->event_type),
> +		le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
> +
> +	switch (event->event_type) {
> +	case DCD_ADD_CAPACITY:
> +		rc = handle_add_event(mds, event);
> +		break;
> +	case DCD_RELEASE_CAPACITY:
> +		rc = cxl_rm_extent(mds, &event->extent);
> +		break;
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		dev_err_ratelimited(dev, "Forced release event ignored.\n");
> +		rc = 0;
> +		break;
> +	default:
> +		rc = -EINVAL;
> +		break;
> +	}
> +
> +	if (rc)
> +		dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
> +}
> +
>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  				    enum cxl_event_log_type type)
>  {
> @@ -1134,9 +1366,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  		if (!nr_rec)
>  			break;
>  
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>  			__cxl_event_trace_record(cxlmd, type,
>  						 &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD)
> +				cxl_handle_dcd_event_records(mds,
> +							&payload->records[i]);
> +		}
>  
>  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>  			trace_cxl_overflow(cxlmd, type, payload);
> @@ -1168,6 +1404,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>  {
>  	dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
>  
> +	if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>  	if (status & CXLDEV_EVENT_STATUS_FATAL)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
>  	if (status & CXLDEV_EVENT_STATUS_FAIL)
> @@ -1797,6 +2035,11 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
>  	}
>  
>  	mutex_init(&mds->event.log_lock);
> +	INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
> +
> +	rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
> +	if (rc)
> +		return ERR_PTR(rc);
>  
>  	rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
>  	if (rc == -EOPNOTSUPP)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 864f6d3c03d4..367c3d9e2147 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -12,6 +12,7 @@
>  #include <linux/node.h>
>  #include <linux/io.h>
>  #include <linux/range.h>
> +#include <linux/xarray.h>
>  #include <cxl/cxl.h>
>  
>  extern const struct nvdimm_security_ops *cxl_security_ops;
> @@ -308,6 +309,41 @@ enum cxl_decoder_state {
>  	CXL_DECODER_STATE_AUTO_STAGED,
>  };
>  
> +struct cxl_dc_tag_group;
> +
> +/**
> + * struct dc_extent - A single dynamic-capacity extent surfaced to the host.
> + *
> + * One per device-stamped extent.  Multiple dc_extents that share a tag
> + * (see &struct cxl_dc_tag_group) form a single logical allocation, but
> + * each dc_extent has its own HPA range and is the unit that the DAX
> + * layer sees as a backing dax_resource.
> + *
> + * @dev: device representing this extent; child of cxlr_dax->dev.
> + * @group: containing tag group (allocation); shared across siblings.
> + * @cxled: endpoint decoder backing the DPA range.
> + * @dpa_range: DPA range this extent covers within @cxled.
> + * @hpa_range: HPA range that @dpa_range decodes to, relative to
> + *	       cxlr_dax->hpa_range.start.
> + * @uuid: tag uuid (matches @group->uuid; kept for the release-path log).
> + * @seq_num: 0..n-1 assembly-order index within the tag group, assigned by the
> + *	     host in cxl_realize_group().  For a sharable partition the
> + *	     order follows the device-stamped shared_extn_seq (0..n-1 per
> + *	     CXL r4.0 Table 8-230); for a non-sharable partition, where
> + *	     shared_extn_seq is reserved, the order follows event arrival.
> + *	     Either way @seq_num is a dense 0..n-1 so the dax layer can
> + *	     assemble ranges in the right order regardless of source.
> + */
> +struct dc_extent {
> +	struct device dev;
> +	struct cxl_dc_tag_group *group;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range dpa_range;
> +	struct range hpa_range;
> +	uuid_t uuid;
> +	u16 seq_num;
> +};
> +
>  /**
>   * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
>   * @cxld: base cxl_decoder_object
> @@ -520,12 +556,45 @@ struct cxl_pmem_region {
>  	struct cxl_pmem_region_mapping mapping[];
>  };
>  
> +/* See CXL 3.1 8.2.9.2.1.6 */
> +enum dc_event {
> +	DCD_ADD_CAPACITY,
> +	DCD_RELEASE_CAPACITY,
> +	DCD_FORCED_CAPACITY_RELEASE,
> +	DCD_REGION_CONFIGURATION_UPDATED,
> +};
> +
>  struct cxl_dax_region {
>  	struct device dev;
>  	struct cxl_region *cxlr;
>  	struct range hpa_range;
>  };
>  
> +/**
> + * struct cxl_dc_tag_group - A tagged dynamic-capacity allocation.
> + *
> + * Container for the &struct dc_extent siblings that share a tag.  The
> + * group has no sysfs identity; userspace sees the individual dc_extents
> + * directly under the parent dax_region device.  The group exists to
> + * keep tag-scoped invariants (atomic add, atomic release, ordered carve
> + * by seq_num) in one place.
> + *
> + * @cxlr_dax: back reference to parent region device.
> + * @uuid: tag identifying this allocation; same across all member dc_extents.
> + * @dc_extents: xarray of &struct dc_extent in this group, indexed by the
> + *		dc_extent's @seq_num (0..n-1, dense).  See &struct dc_extent
> + *		for how seq_num is sourced for sharable vs non-sharable
> + *		allocations.
> + * @nr_extents: live count of dc_extents in the group; the group is freed
> + *		when the last dc_extent device is released.
> + */
> +struct cxl_dc_tag_group {
> +	struct cxl_dax_region *cxlr_dax;
> +	uuid_t uuid;
> +	struct xarray dc_extents;
> +	unsigned int nr_extents;
> +};
> +
>  /**
>   * struct cxl_port - logical collection of upstream port devices and
>   *		     downstream port devices to construct a CXL memory
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index bcf976829c3e..4ffa7bd1e5f1 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/cdev.h>
>  #include <linux/uuid.h>
>  #include <linux/node.h>
> +#include <linux/list.h>
>  #include <cxl/event.h>
>  #include <cxl/mailbox.h>
>  #include "cxl.h"
> @@ -404,6 +405,23 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>  	return dev_get_drvdata(cxl_mbox->host);
>  }
>  
> +/**
> + * struct pending_add_ctx - Staging state for an in-progress
> + *			    DCD_ADD_CAPACITY event chain
> + * @pending_extents: extents received so far in the chain; flushed when
> + *		     the chain closes (More=0)
> + * @group: tag group being assembled from the chain
> + *
> + * A DCD_ADD_CAPACITY notification can span multiple event records
> + * stitched together by the CXL_DCD_EVENT_MORE flag.  Records are staged
> + * here until the device clears More, at which point the staged batch is
> + * processed and responded to as a single Add_DC_Response.
> + */
> +struct pending_add_ctx {
> +	struct list_head pending_extents;
> +	struct cxl_dc_tag_group *group;
> +};
> +
>  /**
>   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>   *
> @@ -422,6 +440,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>   * @active_volatile_bytes: sum of hard + soft volatile
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @dcd_supported: all DCD commands are supported
> + * @add_ctx: state for an in-progress DCD_ADD_CAPACITY chain
> + *	     (see &struct pending_add_ctx)
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -442,6 +462,7 @@ struct cxl_memdev_state {
>  	u64 active_volatile_bytes;
>  	u64 active_persistent_bytes;
>  	bool dcd_supported;
> +	struct pending_add_ctx add_ctx;
>  
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
> @@ -518,6 +539,21 @@ enum cxl_opcode {
>  	UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>  		  0x40, 0x3d, 0x86)
>  
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 flags;
> +	u8 reserved[3];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[] __counted_by(extent_list_size);
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> @@ -588,6 +624,14 @@ struct cxl_mbox_identify {
>  	UUID_INIT(0xe71f3a40, 0x2d29, 0x4092, 0x8a, 0x39, 0x4d, 0x1c, 0x96, \
>  		  0x6c, 0x7c, 0x65)
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
> + */
> +#define CXL_EVENT_DC_EVENT_UUID                                             \
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
> +		  0x10, 0x1a, 0x2a)
> +
>  /*
>   * Get Event Records output payload
>   * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> @@ -613,6 +657,7 @@ enum cxl_event_log_type {
>  	CXL_EVENT_TYPE_WARN,
>  	CXL_EVENT_TYPE_FAIL,
>  	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>  	CXL_EVENT_TYPE_MAX
>  };
>  
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index ff97fea718d2..fa3cd895f656 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -6,6 +6,7 @@
>  #include <linux/types.h>
>  #include <linux/uuid.h>
>  #include <linux/workqueue_types.h>
> +#include <linux/list.h>
>  
>  /*
>   * Common Event Record Format
> @@ -141,12 +142,49 @@ struct cxl_event_mem_sparing {
>  	u8 reserved2[0x25];
>  } __packed;
>  
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +struct cxl_extent {
> +	__le64 start_dpa;
> +	__le64 length;
> +	u8 uuid[UUID_SIZE];
> +	__le16 shared_extn_seq;
> +	u8 reserved[0x6];
> +} __packed;
> +
> +struct cxl_extent_list_node {
> +	struct cxl_extent *extent;
> +	struct list_head list;
> +	int rid;
> +};
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +#define CXL_DCD_EVENT_MORE			BIT(0)
> +struct cxl_event_dcd {
> +	struct cxl_event_record_hdr hdr;
> +	u8 event_type;
> +	u8 validity_flags;
> +	__le16 host_id;
> +	u8 partition_index;
> +	u8 flags;
> +	u8 reserved1[0x2];
> +	struct cxl_extent extent;
> +	u8 reserved2[0x18];
> +	__le32 num_avail_extents;
> +	__le32 num_avail_tags;
> +} __packed;
> +
>  union cxl_event {
>  	struct cxl_event_generic generic;
>  	struct cxl_event_gen_media gen_media;
>  	struct cxl_event_dram dram;
>  	struct cxl_event_mem_module mem_module;
>  	struct cxl_event_mem_sparing mem_sparing;
> +	struct cxl_event_dcd dcd;
>  	/* dram & gen_media event header */
>  	struct cxl_event_media_hdr media_hdr;
>  } __packed;
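
The pending_add_ctx kernel-doc quoted above describes staging records until the device clears More. A minimal userspace sketch of that idea follows. All sketch_* names are hypothetical, a fixed array stands in for the list_head used in the patch, and the "response" is only a counter, so this illustrates the chain handling rather than the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DCD_EVENT_MORE 0x1u	/* stands in for CXL_DCD_EVENT_MORE, BIT(0) */
#define SKETCH_MAX_PENDING 16		/* arbitrary bound for the sketch */

struct sketch_extent {
	uint64_t start_dpa;
	uint64_t length;
};

struct sketch_add_ctx {
	struct sketch_extent pending[SKETCH_MAX_PENDING];
	size_t nr_pending;	/* records staged in the open chain */
	size_t nr_responses;	/* batches flushed so far */
	size_t last_batch;	/* extents in the most recent batch */
};

/*
 * Stage one event record.
 * Returns 0 while the chain is still open (More set), 1 when the chain
 * closed and the staged batch was flushed as a single response, and -1
 * if the staging area is full.
 */
static int sketch_stage_record(struct sketch_add_ctx *ctx,
			       struct sketch_extent ext, uint8_t flags)
{
	assert(ctx != NULL);

	if (ctx->nr_pending == SKETCH_MAX_PENDING)
		return -1;

	ctx->pending[ctx->nr_pending++] = ext;

	if (flags & SKETCH_DCD_EVENT_MORE)
		return 0;

	/* More=0: the whole batch is answered once, then staging resets */
	ctx->last_batch = ctx->nr_pending;
	ctx->nr_responses++;
	ctx->nr_pending = 0;
	return 1;
}
```

Feeding two records with More set and a third with More clear leaves one flushed batch of three extents and an empty staging area.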



* Re: [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device
  2026-06-25 11:04 ` [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
  2026-06-25 18:16   ` sashiko-bot
@ 2026-06-26 22:26   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 22:26 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
	Gregory Price, Anisa Su



On 6/25/26 4:04 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands.  CXL r4.0 section 9.13.3 requires the host to issue
> the Get DC Configuration command in order to properly configure DCDs.
> Without the Get DC Configuration command DCD can't be supported.
> 
> Implement the DC mailbox commands as specified in CXL 4.0 section
> 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information.  Disable DCD if an invalid configuration is found.
> 
> Linux has no support for more than one dynamic capacity partition.  Read
> and validate all the partitions but configure only the first partition
> as 'dynamic ram 1'.  Additional partitions can be added in the future if
> such a device ever materializes.  Additionally it is anticipated that no
> skips will be present from the end of the pmem partition.  Check for and
> disallow this configuration as well.
> 
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags).  Avoid defining those fields to use the more useful
> dynamic C array.
> 
> Based on an original patch by Navneet Singh.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>

A couple of minor things besides the sashiko issues.

> 
> ---
> Changes:
> 1. Move partition alignment check after is_power_of_2() check on
>    blk_size, as IS_ALIGNED(partition start, blk_size) expects blk_size
>    to be a power of 2 in cxl_dc_check()
> 
> 2. cxl_get_dc_config(): verify mbox_cmd.size_out against
>    dc_resp->partitions_returned
> 
> 3. cxl_dev_dc_identify(): originally calculated size of dc_resp using
>    struct cxl_dc_partition_info, but dc_resp->partition[] is of type
>    struct cxl_dc_partition. Fix size calculation.
> 
> 4. fix do/while loop in cxl_dev_dc_identify to protect against returning
>    0 partitions infinitely
> 
> 5. cxl_configure_dcd(): originally checked for gap between PMEM and DC
>    partition by calculating if a gap exists:
>    	if ([start of dc part] - [end of pmem part])
>    Replace with: if ([start of dc part] != [end of pmem part]) to avoid
>    underflow in case of bad input
> 
> 6. Change struct cxl_dc_partition_info to use u64 instead of size_t
>    fields
> 
> 7. Original commit message referenced CXL r3.2. Bump to r4.0.
>    Verified section numbers remain the same
> 
> 8. Rename dynamic_ram_a to dynamic_ram_1
> ---
>  drivers/cxl/core/hdm.c  |   2 +
>  drivers/cxl/core/mbox.c | 211 ++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlmem.h    |  47 +++++++++
>  drivers/cxl/pci.c       |   3 +
>  include/cxl/cxl.h       |   3 +-
>  5 files changed, 265 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0c80b76a5f9b..0ef076c08ed2 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -446,6 +446,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
>  		return "ram";
>  	case CXL_PARTMODE_PMEM:
>  		return "pmem";
> +	case CXL_PARTMODE_DYNAMIC_RAM_1:
> +		return "dynamic_ram_1";
>  	default:
>  		return "";
>  	};
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 07aba6f0b719..2932bbd67e55 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1347,6 +1347,188 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>  	return -EBUSY;
>  }
>  
> +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
> +			u8 index, struct cxl_dc_partition *dev_part)
> +{
> +	u64 blk_size = le64_to_cpu(dev_part->block_size);
> +	u64 len = le64_to_cpu(dev_part->length);
> +
> +	part_array[index].start = le64_to_cpu(dev_part->base);
> +	part_array[index].size = le64_to_cpu(dev_part->decode_length);
> +	part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> +
> +	/* Check partitions are in increasing DPA order */
> +	if (index > 0) {
> +		struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
> +
> +		if ((prev_part->start + prev_part->size) >
> +		     part_array[index].start) {
> +			dev_err(dev,
> +				"DPA ordering violation for DC partition %d and %d\n",
> +				index - 1, index);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (part_array[index].size == 0 || len == 0 ||
> +	    part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
> +		dev_err(dev, "DC partition %d invalid length; size %llu len %llu blk size %llu\n",
> +			index, part_array[index].size, len, blk_size);
> +		return -EINVAL;
> +	}
> +
> +	if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
> +	    !is_power_of_2(blk_size)) {
> +		dev_err(dev, "DC partition %d invalid block size %llu\n",
> +			index, blk_size);
> +		return -EINVAL;
> +	}
> +
> +	if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
> +	    !IS_ALIGNED(part_array[index].start, blk_size)) {
> +		dev_err(dev, "DC partition %d invalid start %llu blk size %llu\n",
> +			index, part_array[index].start, blk_size);
> +		return -EINVAL;
> +	}
> +
> +	dev_dbg(dev, "DC partition %d start %llu size %llu blk_size: %llu\n",
> +		index, part_array[index].start, part_array[index].size,
> +		blk_size);
> +
> +	return 0;
> +}
> +
> +/* Returns the number of partitions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
> +			     struct cxl_mbox_get_dc_config_out *dc_resp,
> +			     size_t dc_resp_size)
> +{
> +	struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> +		.partition_count = CXL_MAX_DC_PARTITIONS,
> +		.start_partition_index = start_partition,
> +	};
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = dc_resp_size,
> +		.payload_out = dc_resp,
> +		.min_out = 8,
> +	};
> +	size_t expected_sz;
> +	int rc;
> +
> +	rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
> +	if (rc < 0)
> +		return rc;
> +
> +	if (dc_resp->partitions_returned > CXL_MAX_DC_PARTITIONS) {
> +		dev_err(mbox->host, "Device returned %u partitions, max %d\n",
> +			dc_resp->partitions_returned, CXL_MAX_DC_PARTITIONS);
> +		return -EIO;
> +	}
> +
> +	/*
> +	 * The payload carries trailing extent/tag count fields after the
> +	 * partition array (CXL 3.2 Table 8-179) which the driver ignores, so
> +	 * the response is at least, not exactly, expected_sz.
> +	 */
> +	expected_sz = struct_size(dc_resp, partition,
> +				  dc_resp->partitions_returned);
> +
> +	if (mbox_cmd.size_out < expected_sz) {
> +		dev_err(mbox->host,
> +			"Payload size %zu less than expected %zu for %u partitions\n",
> +			mbox_cmd.size_out,
> +			expected_sz,
> +			dc_resp->partitions_returned);
> +		return -EIO;
> +	}
> +
> +	dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
> +		dc_resp->partitions_returned, dc_resp->avail_partition_count);
> +	return dc_resp->partitions_returned;
> +}
> +
> +/**
> + * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
> + *                         device.
> + * @mbox: Mailbox to query
> + * @dc_info: The dynamic partition information to return
> + *
> + * Read Dynamic Capacity information from the device and return the partition
> + * information.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + *         on error only dynamic_bytes is left unchanged.

s/dynamic_bytes/dc_info/ ?

> + */
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> +			struct cxl_dc_partition_info *dc_info)
> +{
> +	struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
> +	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree);
> +	struct device *dev = mbox->host;
> +	u8 start_partition;
> +	u8 num_partitions;
> +	size_t dc_resp_size = struct_size(dc_resp,
> +					  partition,
> +					  CXL_MAX_DC_PARTITIONS);
> +
> +	dc_resp = kmalloc(dc_resp_size, GFP_KERNEL);
> +	if (!dc_resp)
> +		return -ENOMEM;
> +
> +	/**

Use /* here; /** is reserved for kernel-doc comments.

> +	 * Read and check all partition information for validity and potential
> +	 * debugging; see debug output in cxl_dc_check()
> +	 */
> +	start_partition = 0;
> +	num_partitions = 0;
> +	do {
> +		int rc, i, j;
> +
> +		rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> +		if (rc < 0) {
> +			dev_err(dev, "Failed to get DC config: %d\n", rc);
> +			return rc;
> +		}
> +
> +		if (rc == 0) {
> +			dev_err(dev,
> +				"Device reported %u partitions available but returned none at index %u\n",
> +				dc_resp->avail_partition_count, start_partition);
> +			return -EIO;
> +		}
> +
> +		num_partitions += rc;
> +
> +		if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
> +			dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
> +				num_partitions);
> +			return -EINVAL;
> +		}
> +
> +		for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
> +			rc = cxl_dc_check(dev, partitions, i,
> +					  &dc_resp->partition[j]);
> +			if (rc)
> +				return rc;
> +		}
> +
> +		start_partition = num_partitions;
> +
> +	} while (num_partitions < dc_resp->avail_partition_count);
> +
> +	/* Return 1st partition */
> +	dc_info->start = partitions[0].start;
> +	dc_info->size = partitions[0].size;
> +	dev_dbg(dev, "Returning partition 0 %llu size %llu\n",
> +		dc_info->start, dc_info->size);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> +
>  static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
>  {
>  	int i = info->nr_partitions;
> @@ -1417,6 +1599,35 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
>  
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> +{
> +	struct cxl_dc_partition_info dc_info = { 0 };
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
> +	if (rc) {
> +		dev_warn(dev,
> +			 "Failed to read Dynamic Capacity config: %d\n", rc);
> +		cxl_disable_dcd(mds);
> +		return;
> +	}
> +
> +	/* Skips between pmem and the dynamic partition are not supported */
> +	if (dc_info.start != info->size) {
> +		dev_warn(dev,
> +			 "Dynamic Capacity skip from pmem not supported\n");
> +		cxl_disable_dcd(mds);
> +		return;
> +	}
> +
> +	info->size += dc_info.size;
> +	dev_dbg(dev, "Adding dynamic ram partition 1; %llu size %llu\n",
> +		dc_info.start, dc_info.size);
> +	add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_1);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
> +
>  int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
>  {
>  	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 60dc3f0006a7..6b548a1ec1e9 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -385,6 +385,8 @@ struct cxl_security_state {
>  	struct kernfs_node *sanitize_node;
>  };
>  
> +#define CXL_MAX_DC_PARTITIONS 8
> +
>  static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
>  {
>  	/*
> @@ -669,6 +671,31 @@ struct cxl_mbox_set_shutdown_state_in {
>  	u8 state;
>  } __packed;
>  
> +/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */

Update to r4.0, which is also what the commit log cites.

> +struct cxl_mbox_get_dc_config_in {
> +	u8 partition_count;
> +	u8 start_partition_index;
> +} __packed;
> +
> +/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */

Update to r4.0

DJ

> +struct cxl_mbox_get_dc_config_out {
> +	u8 avail_partition_count;
> +	u8 partitions_returned;
> +	u8 rsvd[6];
> +	/* See CXL 3.2 Table 8-180 */
> +	struct cxl_dc_partition {
> +		__le64 base;
> +		__le64 decode_length;
> +		__le64 length;
> +		__le64 block_size;
> +		__le32 dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed partition[] __counted_by(partitions_returned);
> +	/* Trailing extent/tag count fields unused */
> +} __packed;
> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -792,9 +819,18 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +
> +struct cxl_dc_partition_info {
> +	u64 start;
> +	u64 size;
> +};
> +
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> +			struct cxl_dc_partition_info *dc_info);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
>  						 u16 dvsec);
>  void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
> @@ -808,6 +844,17 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
>  			    const uuid_t *uuid, union cxl_event *evt);
>  int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
>  int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
> +
> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> +	return mds->dcd_supported;
> +}
> +
> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> +	mds->dcd_supported = false;
> +}
> +
>  int cxl_set_timestamp(struct cxl_memdev_state *mds);
>  int cxl_poison_state_init(struct cxl_memdev_state *mds);
>  int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index bace662dc988..60f9fa05d9ef 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -870,6 +870,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	if (cxl_dcd_supported(mds))
> +		cxl_configure_dcd(mds, &range_info);
> +
>  	rc = cxl_dpa_setup(cxlds, &range_info);
>  	if (rc)
>  		return rc;
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index fa7269154620..e8a0899960d4 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -133,6 +133,7 @@ struct cxl_dpa_perf {
>  enum cxl_partition_mode {
>  	CXL_PARTMODE_RAM,
>  	CXL_PARTMODE_PMEM,
> +	CXL_PARTMODE_DYNAMIC_RAM_1,
>  };
>  
>  /**
> @@ -147,7 +148,7 @@ struct cxl_dpa_partition {
>  	enum cxl_partition_mode mode;
>  };
>  
> -#define CXL_NR_PARTITIONS_MAX 2
> +#define CXL_NR_PARTITIONS_MAX 3
>  
>  /**
>   * struct cxl_dev_state - The driver device state
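
Item 1 of the changelog moves the start-alignment check after the is_power_of_2() check because a mask-based alignment test is only meaningful for power-of-2 alignments. A small userspace sketch of why the order matters follows. The sketch_* helpers are hypothetical stand-ins that use the same mask trick as the kernel's IS_ALIGNED(); they are not the kernel macros themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask-based test; only correct when 'a' is a power of 2 */
static bool sketch_mask_aligned(uint64_t x, uint64_t a)
{
	assert(a != 0);		/* a - 1 would wrap for a == 0 */
	return (x & (a - 1)) == 0;
}

static bool sketch_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Validate the block size first, then trust the mask test on the start */
static bool sketch_start_ok(uint64_t start, uint64_t blk_size)
{
	if (!sketch_is_power_of_2(blk_size))
		return false;
	return sketch_mask_aligned(start, blk_size);
}
```

With a block size of 48, the mask test alone accepts a start of 64 (which is not a multiple of 48) and rejects 96 (which is), so rejecting the block size up front avoids acting on a meaningless answer.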



* Re: [PATCH v11 03/31] cxl/cdat: Gather DSMAS data for DCD partitions
  2026-06-25 18:00 ` [PATCH v11 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
@ 2026-06-26 22:30   ` Dave Jiang
  0 siblings, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 22:30 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, djbw, jic23, dave, vishal.l.verma, iweiny,
	alison.schofield, gourry, anisa.su



On 6/25/26 11:00 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> Additional DCD partition (AKA region) information is contained in the
> DSMAS CDAT tables, including performance, read only, and shareable
> attributes.
> 
> Match DCD partitions with DSMAS tables and store the meta data.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>

Maybe add a Co-developed-by tag, since you are changing where the shareable flag lives?

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  drivers/cxl/core/cdat.c | 12 ++++++++++++
>  drivers/cxl/core/hdm.c  |  1 +
>  drivers/cxl/core/mbox.c | 22 ++++++++++++++++------
>  drivers/cxl/cxlmem.h    |  2 ++
>  include/cxl/cxl.h       |  4 ++++
>  5 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index 5c9f07262513..a280039e4cd1 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -17,6 +17,7 @@ struct dsmas_entry {
>  	struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
>  	int entries;
>  	int qos_class;
> +	bool shareable;
>  };
>  
>  static u32 cdat_normalize(u16 entry, u64 base, u8 type)
> @@ -74,6 +75,7 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
>  		return -ENOMEM;
>  
>  	dent->handle = dsmas->dsmad_handle;
> +	dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
>  	dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
>  	dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
>  			      le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
> @@ -266,15 +268,25 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>  		bool found = false;
>  
>  		for (int i = 0; i < cxlds->nr_partitions; i++) {
> +			enum cxl_partition_mode mode = cxlds->part[i].mode;
>  			struct resource *res = &cxlds->part[i].res;
> +			u8 handle = cxlds->part[i].handle;
>  			struct range range = {
>  				.start = res->start,
>  				.end = res->end,
>  			};
>  
>  			if (range_contains(&range, &dent->dpa_range)) {
> +				if (mode == CXL_PARTMODE_DYNAMIC_RAM_1 &&
> +				    dent->handle != handle) {
> +					dev_warn(dev,
> +						"Dynamic RAM perf mismatch; %pra (%u) vs %pra (%u)\n",
> +						&range, handle, &dent->dpa_range, dent->handle);
> +					continue;
> +				}
>  				update_perf_entry(dev, dent,
>  						  &cxlds->part[i].perf);
> +				cxlds->part[i].shareable = dent->shareable;
>  				found = true;
>  				break;
>  			}
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0ef076c08ed2..7f63b86887f4 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -477,6 +477,7 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
>  
>  		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
>  		cxlds->part[i].mode = part->mode;
> +		cxlds->part[i].handle = part->handle;
>  
>  		/* Require ordered + contiguous partitions */
>  		if (i) {
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 2932bbd67e55..bdb908c6e7f3 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1352,10 +1352,16 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
>  {
>  	u64 blk_size = le64_to_cpu(dev_part->block_size);
>  	u64 len = le64_to_cpu(dev_part->length);
> +	u32 handle = le32_to_cpu(dev_part->dsmad_handle);
>  
>  	part_array[index].start = le64_to_cpu(dev_part->base);
>  	part_array[index].size = le64_to_cpu(dev_part->decode_length);
>  	part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> +	if (handle & ~0xFF) {
> +		dev_warn(dev, "DSMAD handle 0x%x has non-zero reserved bits\n", handle);
> +		return -EINVAL;
> +	}
> +	part_array[index].handle = handle;
>  
>  	/* Check partitions are in increasing DPA order */
>  	if (index > 0) {
> @@ -1522,6 +1528,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
>  	/* Return 1st partition */
>  	dc_info->start = partitions[0].start;
>  	dc_info->size = partitions[0].size;
> +	dc_info->handle = partitions[0].handle;
>  	dev_dbg(dev, "Returning partition 0 %llu size %llu\n",
>  		dc_info->start, dc_info->size);
>  
> @@ -1529,7 +1536,8 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
>  
> -static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
> +static void add_part(struct cxl_dpa_info *info, u64 start, u64 size,
> +		     enum cxl_partition_mode mode, u8 handle)
>  {
>  	int i = info->nr_partitions;
>  
> @@ -1541,6 +1549,7 @@ static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_pa
>  		.end = start + size - 1,
>  	};
>  	info->part[i].mode = mode;
> +	info->part[i].handle = handle;
>  	info->nr_partitions++;
>  }
>  
> @@ -1558,9 +1567,9 @@ int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>  	info->size = mds->total_bytes;
>  
>  	if (mds->partition_align_bytes == 0) {
> -		add_part(info, 0, mds->volatile_only_bytes, CXL_PARTMODE_RAM);
> +		add_part(info, 0, mds->volatile_only_bytes, CXL_PARTMODE_RAM, 0);
>  		add_part(info, mds->volatile_only_bytes,
> -			 mds->persistent_only_bytes, CXL_PARTMODE_PMEM);
> +			 mds->persistent_only_bytes, CXL_PARTMODE_PMEM, 0);
>  		return 0;
>  	}
>  
> @@ -1570,9 +1579,9 @@ int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>  		return rc;
>  	}
>  
> -	add_part(info, 0, mds->active_volatile_bytes, CXL_PARTMODE_RAM);
> +	add_part(info, 0, mds->active_volatile_bytes, CXL_PARTMODE_RAM, 0);
>  	add_part(info, mds->active_volatile_bytes, mds->active_persistent_bytes,
> -		 CXL_PARTMODE_PMEM);
> +		 CXL_PARTMODE_PMEM, 0);
>  
>  	return 0;
>  }
> @@ -1624,7 +1633,8 @@ void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>  	info->size += dc_info.size;
>  	dev_dbg(dev, "Adding dynamic ram partition 1; %llu size %llu\n",
>  		dc_info.start, dc_info.size);
> -	add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_1);
> +	add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_1,
> +		 dc_info.handle);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
>  
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 6b548a1ec1e9..b29fb16725b4 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -118,6 +118,7 @@ struct cxl_dpa_info {
>  	struct cxl_dpa_part_info {
>  		struct range range;
>  		enum cxl_partition_mode mode;
> +		u8 handle;
>  	} part[CXL_NR_PARTITIONS_MAX];
>  	int nr_partitions;
>  };
> @@ -823,6 +824,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>  struct cxl_dc_partition_info {
>  	u64 start;
>  	u64 size;
> +	u8 handle;
>  };
>  
>  int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index e8a0899960d4..502d8333318b 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -141,11 +141,15 @@ enum cxl_partition_mode {
>   * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
>   * @perf: performance attributes of the partition from CDAT
>   * @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
> + * @handle: DSMAS handle intended to represent this partition
> + * @shareable: Is the partition sharable (from its CDAT DSMAS entry)
>   */
>  struct cxl_dpa_partition {
>  	struct resource res;
>  	struct cxl_dpa_perf perf;
>  	enum cxl_partition_mode mode;
> +	u8 handle;
> +	bool shareable;
>  };
>  
>  #define CXL_NR_PARTITIONS_MAX 3
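
The handle plumbing in the hunks above comes down to two rules: the 32-bit DSMAD handle field must fit in 8 bits, and a DSMAS entry whose DPA range already falls inside a dynamic RAM partition is applied only if its handle matches that partition's handle. A hedged userspace sketch of those two rules follows. The sketch_* names are hypothetical, and -1 stands in for the -EINVAL returned in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Narrow the 32-bit device field to the 8-bit handle.
 * Returns 0 on success, -1 if any of the upper 24 bits are set
 * (in which case *out is left untouched).
 */
static int sketch_narrow_handle(uint32_t raw, uint8_t *out)
{
	assert(out != NULL);

	if (raw & ~0xFFu)
		return -1;

	*out = (uint8_t)raw;
	return 0;
}

/*
 * Decide whether a DSMAS entry's perf data applies to a partition,
 * assuming the DPA range containment check has already passed.
 * Static partitions take the entry regardless of handle; a dynamic
 * RAM partition additionally requires the handles to match.
 */
static bool sketch_perf_applies(bool is_dynamic_ram, uint8_t part_handle,
				uint8_t dsmas_handle)
{
	if (is_dynamic_ram && part_handle != dsmas_handle)
		return false;
	return true;
}
```

For example, a raw handle of 0x100 is rejected, and a dynamic RAM partition with handle 3 skips a DSMAS entry carrying handle 4.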



* Re: [PATCH v11 04/31] cxl/core: Enforce partition order/simplify partition calls
  2026-06-25 11:04 ` [PATCH v11 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
@ 2026-06-26 22:37   ` Dave Jiang
  0 siblings, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 22:37 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
	Gregory Price, Anisa Su



On 6/25/26 4:04 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> Device partitions have an implied order which is made more complex by
> the addition of a dynamic partition.
> 
> Remove the ram special case information calls in favor of generic calls
> with a check ahead of time to ensure the preservation of the implied
> partition order.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> Changes:
> 
> 1. Use info->part[i] for verifying partitions are in expected order,
>    not cxlds->part[i]. cxlds->part[] is populated in the loop following
>    this check.
> ---
>  drivers/cxl/core/hdm.c    | 11 ++++++++++-
>  drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
>  drivers/cxl/cxlmem.h      |  9 +++------
>  drivers/cxl/mem.c         |  2 +-
>  4 files changed, 23 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7f63b86887f4..54b6848928a9 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -457,6 +457,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
>  int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
>  {
>  	struct device *dev = cxlds->dev;
> +	int i;
>  
>  	guard(rwsem_write)(&cxl_rwsem.dpa);
>  
> @@ -469,9 +470,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
>  		return 0;
>  	}
>  
> +	/* Verify partitions are in expected order. */
> +	for (i = 1; i < info->nr_partitions; i++) {
> +		if (info->part[i].mode < info->part[i-1].mode) {
> +			dev_err(dev, "Partition order mismatch\n");
> +			return -EINVAL;
> +		}
> +	}
> +
>  	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>  
> -	for (int i = 0; i < info->nr_partitions; i++) {
> +	for (i = 0; i < info->nr_partitions; i++) {
>  		const struct cxl_dpa_part_info *part = &info->part[i];
>  		int rc;
>  
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 80e65690eb77..71602820f896 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -75,20 +75,12 @@ static ssize_t label_storage_size_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(label_storage_size);
>  
> -static resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
> -{
> -	/* Static RAM is only expected at partition 0. */
> -	if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
> -		return 0;
> -	return resource_size(&cxlds->part[0].res);
> -}
> -
>  static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
>  			     char *buf)
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	unsigned long long len = cxl_ram_size(cxlds);
> +	unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_RAM);
>  
>  	return sysfs_emit(buf, "%#llx\n", len);
>  }
> @@ -101,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> -	unsigned long long len = cxl_pmem_size(cxlds);
> +	unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_PMEM);
>  
>  	return sysfs_emit(buf, "%#llx\n", len);
>  }
> @@ -424,10 +416,11 @@ static struct attribute *cxl_memdev_attributes[] = {
>  	NULL,
>  };
>  
> -static struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> +static struct cxl_dpa_perf *part_perf(struct cxl_dev_state *cxlds,
> +				      enum cxl_partition_mode mode)
>  {
>  	for (int i = 0; i < cxlds->nr_partitions; i++)
> -		if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
> +		if (cxlds->part[i].mode == mode)
>  			return &cxlds->part[i].perf;
>  	return NULL;
>  }
> @@ -438,7 +431,7 @@ static ssize_t pmem_qos_class_show(struct device *dev,
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  
> -	return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
> +	return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_PMEM)->qos_class);
>  }
>  
>  static struct device_attribute dev_attr_pmem_qos_class =
> @@ -450,20 +443,13 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
>  	NULL,
>  };
>  
> -static struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> -{
> -	if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
> -		return NULL;
> -	return &cxlds->part[0].perf;
> -}
> -
>  static ssize_t ram_qos_class_show(struct device *dev,
>  				  struct device_attribute *attr, char *buf)
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  
> -	return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
> +	return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_RAM)->qos_class);
>  }
>  
>  static struct device_attribute dev_attr_ram_qos_class =
> @@ -499,7 +485,7 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
>  	struct device *dev = kobj_to_dev(kobj);
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> -	struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
> +	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_RAM);
>  
>  	if (a == &dev_attr_ram_qos_class.attr &&
>  	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> @@ -518,7 +504,7 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
>  {
>  	struct device *dev = kobj_to_dev(kobj);
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> -	struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
> +	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_PMEM);
>  
>  	if (a == &dev_attr_pmem_qos_class.attr &&
>  	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index b29fb16725b4..afc195d8c090 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -388,14 +388,11 @@ struct cxl_security_state {
>  
>  #define CXL_MAX_DC_PARTITIONS 8
>  
> -static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
> +static inline resource_size_t cxl_part_size(struct cxl_dev_state *cxlds,
> +					    enum cxl_partition_mode mode)
>  {
> -	/*
> -	 * Static PMEM may be at partition index 0 when there is no static RAM
> -	 * capacity.
> -	 */
>  	for (int i = 0; i < cxlds->nr_partitions; i++)
> -		if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
> +		if (cxlds->part[i].mode == mode)
>  			return resource_size(&cxlds->part[i].res);
>  	return 0;
>  }
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index fcffe24dcb42..f19e08279ec7 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -114,7 +114,7 @@ static int cxl_mem_probe(struct device *dev)
>  		return -ENXIO;
>  	}
>  
> -	if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> +	if (cxl_part_size(cxlds, CXL_PARTMODE_PMEM) && IS_ENABLED(CONFIG_CXL_PMEM)) {
>  		rc = devm_cxl_add_nvdimm(dev, parent_port, cxlmd);
>  		if (rc) {
>  			if (rc == -ENODEV)


^ permalink raw reply	[flat|nested] 61+ messages in thread
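
Editorial note on patch 04: the ordering check added to cxl_dpa_setup()
can be read in isolation as the user-space sketch below. It is not the
kernel code. The enum, the struct, and the function name
check_partition_order() are hypothetical stand-ins, and the relative
order of the enum values is an assumption made for illustration. As in
the patch, the comparison is "<", so two adjacent partitions with the
same mode pass the check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for enum cxl_partition_mode; the order is assumed. */
enum part_mode {
	PART_RAM,
	PART_PMEM,
	PART_DYNAMIC_RAM_1,
};

/* Hypothetical stand-in for struct cxl_dpa_part_info. */
struct part_info {
	enum part_mode mode;
};

/*
 * Return 0 when the mode never decreases from one partition to the
 * next, -EINVAL otherwise. The caller-supplied array is validated
 * before anything is copied out of it, which is the point of change
 * note 1 in the patch (check info->part[], not cxlds->part[]).
 */
static int check_partition_order(const struct part_info *part, size_t nr)
{
	assert(part != NULL || nr == 0);

	for (size_t i = 1; i < nr; i++)
		if (part[i].mode < part[i - 1].mode)
			return -EINVAL;
	return 0;
}
```

An array ordered ram, pmem, dynamic_ram_1 is accepted; pmem followed by
ram is rejected with -EINVAL.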

* Re: [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs
  2026-06-25 11:04 ` [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs Anisa Su
  2026-06-25 18:12   ` sashiko-bot
@ 2026-06-26 23:08   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 23:08 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
	Gregory Price, Anisa Su



On 6/25/26 4:04 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> To properly configure CXL regions user space will need to know the
> details of the dynamic ram partition.
> 
> Expose the first dynamic ram partition through sysfs.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Just a spelling error below

> 
> ---
> Changes:
> 1. Documentation: bump kernel version to 7.3 and date to June 2026
> 2. Pick up Dave's reviewed-by tag
> 3. Rename dynamic_ram_a to dynamic_ram_1
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 24 +++++++++++
>  drivers/cxl/core/memdev.c               | 57 +++++++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 16a9b3d2e2c0..435495de409c 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -89,6 +89,30 @@ Description:
>  		and there are platform specific performance related
>  		side-effects that may result. First class-id is displayed.
>  
> +What:		/sys/bus/cxl/devices/memX/dynamic_ram_1/size
> +Date:		June, 2026
> +KernelVersion:	v7.3
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) The first Dynamic RAM partition capacity as bytes.
> +
> +
> +What:		/sys/bus/cxl/devices/memX/dynamic_ram_1/qos_class
> +Date:		June, 2026
> +KernelVersion:	v7.3
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) For CXL host platforms that support "QoS Telemmetry"

Telemetry

DJ

> +		this attribute conveys a comma delimited list of platform
> +		specific cookies that identifies a QoS performance class
> +		for the partition of the CXL mem device. These
> +		class-ids can be compared against a similar "qos_class"
> +		published for a root decoder. While it is not required
> +		that the endpoints map their local memory-class to a
> +		matching platform class, mismatches are not recommended
> +		and there are platform specific performance related
> +		side-effects that may result. First class-id is displayed.
> +
>  
>  What:		/sys/bus/cxl/devices/memX/serial
>  Date:		January, 2022
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 71602820f896..20417db933aa 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  static struct device_attribute dev_attr_pmem_size =
>  	__ATTR(size, 0444, pmem_size_show, NULL);
>  
> +static ssize_t dynamic_ram_1_size_show(struct device *dev, struct device_attribute *attr,
> +			      char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_1);
> +
> +	return sysfs_emit(buf, "%#llx\n", len);
> +}
> +
> +static struct device_attribute dev_attr_dynamic_ram_1_size =
> +	__ATTR(size, 0444, dynamic_ram_1_size_show, NULL);
> +
>  static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -443,6 +456,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
>  	NULL,
>  };
>  
> +static ssize_t dynamic_ram_1_qos_class_show(struct device *dev,
> +				   struct device_attribute *attr, char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> +	return sysfs_emit(buf, "%d\n",
> +			  part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_1)->qos_class);
> +}
> +
> +static struct device_attribute dev_attr_dynamic_ram_1_qos_class =
> +	__ATTR(qos_class, 0444, dynamic_ram_1_qos_class_show, NULL);
> +
> +static struct attribute *cxl_memdev_dynamic_ram_1_attributes[] = {
> +	&dev_attr_dynamic_ram_1_size.attr,
> +	&dev_attr_dynamic_ram_1_qos_class.attr,
> +	NULL,
> +};
> +
>  static ssize_t ram_qos_class_show(struct device *dev,
>  				  struct device_attribute *attr, char *buf)
>  {
> @@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
>  	.is_visible = cxl_pmem_visible,
>  };
>  
> +static umode_t cxl_dynamic_ram_1_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_1);
> +
> +	if (a == &dev_attr_dynamic_ram_1_qos_class.attr &&
> +	    (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> +		return 0;
> +
> +	if (a == &dev_attr_dynamic_ram_1_size.attr &&
> +	    (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_1)))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static struct attribute_group cxl_memdev_dynamic_ram_1_attribute_group = {
> +	.name = "dynamic_ram_1",
> +	.attrs = cxl_memdev_dynamic_ram_1_attributes,
> +	.is_visible = cxl_dynamic_ram_1_visible,
> +};
> +
>  static umode_t cxl_memdev_security_visible(struct kobject *kobj,
>  					   struct attribute *a, int n)
>  {
> @@ -547,6 +602,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
>  	&cxl_memdev_attribute_group,
>  	&cxl_memdev_ram_attribute_group,
>  	&cxl_memdev_pmem_attribute_group,
> +	&cxl_memdev_dynamic_ram_1_attribute_group,
>  	&cxl_memdev_security_attribute_group,
>  	NULL,
>  };
> @@ -555,6 +611,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
>  {
>  	sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
>  	sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
> +	sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_1_attribute_group);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
>  


^ permalink raw reply	[flat|nested] 61+ messages in thread
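
Editorial note on patch 05: the visibility rule in
cxl_dynamic_ram_1_visible() amounts to "hide size when no partition of
that mode has capacity, hide qos_class when there is no partition or
its class is invalid". The sketch below restates that rule in user
space. All type and function names are hypothetical, and
QOS_CLASS_INVALID is only assumed to play the role of
CXL_QOS_CLASS_INVALID.

```c
#include <assert.h>
#include <stddef.h>

enum part_mode { PART_RAM, PART_PMEM, PART_DYNAMIC_RAM_1 };

/* Assumed stand-in for CXL_QOS_CLASS_INVALID. */
#define QOS_CLASS_INVALID (-1)

struct part {
	enum part_mode mode;
	unsigned long long size;
	int qos_class;
};

struct dev_state {
	struct part part[4];
	int nr_partitions;
};

/* Return the first partition with the given mode, or NULL if none exists. */
static const struct part *find_part(const struct dev_state *ds,
				    enum part_mode mode)
{
	assert(ds != NULL);

	for (int i = 0; i < ds->nr_partitions; i++)
		if (ds->part[i].mode == mode)
			return &ds->part[i];
	return NULL;
}

/* Size of the first matching partition, 0 when the mode is absent. */
static unsigned long long part_size(const struct dev_state *ds,
				    enum part_mode mode)
{
	const struct part *p = find_part(ds, mode);

	return p ? p->size : 0;
}

/* The size attribute is shown only when the partition has capacity. */
static int size_visible(const struct dev_state *ds, enum part_mode mode)
{
	return part_size(ds, mode) != 0;
}

/* The qos_class attribute is shown only for a partition with a valid class. */
static int qos_visible(const struct dev_state *ds, enum part_mode mode)
{
	const struct part *p = find_part(ds, mode);

	return p && p->qos_class != QOS_CLASS_INVALID;
}
```

In this sketch the show path can only be reached for a partition that
exists and has a valid class, because qos_visible() hides the attribute
otherwise.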

* Re: [PATCH v11 07/31] cxl/region: Add DC DAX region support
  2026-06-25 11:04 ` [PATCH v11 07/31] cxl/region: Add DC DAX region support Anisa Su
  2026-06-25 18:16   ` sashiko-bot
@ 2026-06-26 23:18   ` Dave Jiang
  1 sibling, 0 replies; 61+ messages in thread
From: Dave Jiang @ 2026-06-26 23:18 UTC (permalink / raw)
  To: Anisa Su, linux-cxl, linux-kernel
  Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
	Gregory Price, Anisa Su



On 6/25/26 4:04 AM, Anisa Su wrote:
> From: Ira Weiny <iweiny@kernel.org>
> 
> DC DAX regions must allow memory to be added or removed dynamically.
> In addition to the quantity of memory available the,

stray comma

DJ

> location of the memory within a DC partition is dynamic, based on the
> extents offered by a device.  CXL DAX regions must accommodate the
> dynamic movement of this memory in the management of DAX regions and devices.
> 
> Introduce the concept of a dynamic DAX region. Introduce
> create_dynamic_ram_1_region() sysfs entry to create such regions.
> Special case DC-capable regions to create a 0 sized seed DAX device
> to maintain compatibility which requires a default DAX device to hold a
> region reference.
> 
> Indicate 0 byte available capacity until such time that capacity is
> added.
> 
> Dynamic regions complicate the range mapping of dax devices.  There is no
> known use case for range mapping on dynamic regions.  Avoid the
> complication by preventing range mapping of dax devices on dynamic
> regions.
> 
> Interleaving is deferred for now.  Add checks.
> 
> Based on an original patch by Navneet Singh.
> 
> Signed-off-by: Ira Weiny <iweiny@kernel.org>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
> 
> ---
> Changes:
> 1. Documentation: bump kver to 7.3 and date to June 2026
> 2. port.c: use helper function to_cxl_memdev_state() in cxled_to_mds()
> 3. region.c: cxled_to_mds() can return NULL with ^ above change.
>    Handle properly in store_targetN()
> 4. port.c: check if the root decoder supports ram regions before
>    exposing create_dynamic_ram_1_region attribute in
>    cxl_root_decoder_visible()
> 5. Rename dynamic_ram_a to dynamic_ram_1
> 6. Add dynamic_ram_1 to mode_show()
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++-------
>  drivers/cxl/core/core.h                 |  9 ++++++
>  drivers/cxl/core/port.c                 |  5 +++
>  drivers/cxl/core/region.c               | 43 +++++++++++++++++++++++--
>  drivers/cxl/core/region_dax.c           |  6 ++++
>  drivers/dax/bus.c                       | 10 ++++++
>  drivers/dax/bus.h                       |  1 +
>  drivers/dax/cxl.c                       | 17 ++++++++--
>  8 files changed, 98 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 499741cbb899..00b98bbe0ff3 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -434,20 +434,20 @@ Description:
>  		interleave_granularity).
>  
>  
> -What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date:		May, 2022, January, 2023
> -KernelVersion:	v6.0 (pmem), v6.3 (ram)
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_1}_region
> +Date:		May, 2022, January, 2023, June 2026
> +KernelVersion:	v6.0 (pmem), v6.3 (ram), v7.3 (dynamic_ram_1)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a string in the form 'regionZ' to start the process
> -		of defining a new persistent, or volatile memory region
> -		(interleave-set) within the decode range bounded by root decoder
> -		'decoderX.Y'. The value written must match the current value
> -		returned from reading this attribute. An atomic compare exchange
> -		operation is done on write to assign the requested id to a
> -		region and allocate the region-id for the next creation attempt.
> -		EBUSY is returned if the region name written does not match the
> -		current cached value.
> +		of defining a new persistent, volatile, or dynamic RAM memory
> +		region (interleave-set) within the decode range bounded by root
> +		decoder 'decoderX.Y'. The value written must match the current
> +		value returned from reading this attribute.  An atomic compare
> +		exchange operation is done on write to assign the requested id
> +		to a region and allocate the region-id for the next creation
> +		attempt.  EBUSY is returned if the region name written does not
> +		match the current cached value.
>  
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 82ca3a476708..9ed141fa1334 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -6,6 +6,7 @@
>  
>  #include <cxl/mailbox.h>
>  #include <linux/rwsem.h>
> +#include <cxlmem.h>
>  
>  extern const struct device_type cxl_nvdimm_bridge_type;
>  extern const struct device_type cxl_nvdimm_type;
> @@ -18,6 +19,13 @@ enum cxl_detach_mode {
>  	DETACH_INVALIDATE,
>  };
>  
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	return to_cxl_memdev_state(cxlmd->cxlds);
> +}
> +
>  #ifdef CONFIG_CXL_REGION
>  
>  struct cxl_region_context {
> @@ -29,6 +37,7 @@ struct cxl_region_context {
>  
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dynamic_ram_1_region;
>  extern struct device_attribute dev_attr_delete_region;
>  extern struct device_attribute dev_attr_region;
>  extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 57d0fc72023f..279279f544d8 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -337,6 +337,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_qos_class.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_REGION_ATTR(create_dynamic_ram_1_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> @@ -366,6 +367,10 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
>  	if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
>  		return 0;
>  
> +	if (a == CXL_REGION_ATTR(create_dynamic_ram_1_region) &&
> +	    !can_create_ram(cxlrd))
> +		return 0;
> +
>  	if (a == CXL_REGION_ATTR(delete_region) &&
>  	    !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
>  		return 0;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e50dc716d4e8..ba03ec5e27c3 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -493,6 +493,11 @@ static int set_interleave_ways(struct cxl_region *cxlr, int val)
>  	int save, rc;
>  	u8 iw;
>  
> +	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1 && val != 1) {
> +		dev_err(&cxlr->dev, "Interleaving and DCD not supported\n");
> +		return -EINVAL;
> +	}
> +
>  	rc = ways_to_eiw(val, &iw);
>  	if (rc)
>  		return rc;
> @@ -642,6 +647,8 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
>  		desc = "ram";
>  	else if (cxlr->mode == CXL_PARTMODE_PMEM)
>  		desc = "pmem";
> +	else if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
> +		desc = "dynamic_ram_1";
>  	else
>  		desc = "";
>  
> @@ -2389,6 +2396,8 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  	if (sysfs_streq(buf, "\n"))
>  		rc = detach_target(cxlr, pos);
>  	else {
> +		struct cxl_endpoint_decoder *cxled;
> +		struct cxl_memdev_state *mds;
>  		struct device *dev;
>  
>  		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2400,8 +2409,21 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  			goto out;
>  		}
>  
> -		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> -				   TASK_INTERRUPTIBLE);
> +		cxled = to_cxl_endpoint_decoder(dev);
> +		if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1) {
> +			mds = cxled_to_mds(cxled);
> +			if (!mds) {
> +				dev_dbg(dev, "No memdev state\n");
> +				rc = -ENODEV;
> +				goto out;
> +			}
> +			if (!cxl_dcd_supported(mds)) {
> +				dev_dbg(dev, "DCD unsupported\n");
> +				rc = -EINVAL;
> +				goto out;
> +			}
> +		}
> +		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
>  out:
>  		put_device(dev);
>  	}
> @@ -2750,6 +2772,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_PARTMODE_RAM:
>  	case CXL_PARTMODE_PMEM:
> +	case CXL_PARTMODE_DYNAMIC_RAM_1:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2802,6 +2825,21 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t create_dynamic_ram_1_region_show(struct device *dev,
> +						struct device_attribute *attr,
> +						char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dynamic_ram_1_region_store(struct device *dev,
> +						 struct device_attribute *attr,
> +						 const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_1);
> +}
> +DEVICE_ATTR_RW(create_dynamic_ram_1_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -4083,6 +4121,7 @@ static int cxl_region_probe(struct device *dev)
>  
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_PARTMODE_RAM:
> +	case CXL_PARTMODE_DYNAMIC_RAM_1:
>  		rc = devm_cxl_region_edac_register(cxlr);
>  		if (rc)
>  			dev_dbg(&cxlr->dev, "CXL EDAC registration for region_id=%d failed\n",
> diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
> index de04f78f6ad8..3865961c4301 100644
> --- a/drivers/cxl/core/region_dax.c
> +++ b/drivers/cxl/core/region_dax.c
> @@ -84,6 +84,12 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	struct device *dev;
>  	int rc;
>  
> +	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1 &&
> +	    cxlr->params.interleave_ways != 1) {
> +		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
>  	struct cxl_dax_region *cxlr_dax __free(put_cxl_dax_region) =
>  		cxl_dax_region_alloc(cxlr);
>  	if (IS_ERR(cxlr_dax))
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index ccfe65004888..7356aaaffe57 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -177,6 +177,11 @@ static bool is_static(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>  }
>  
> +static bool is_dynamic(struct dax_region *dax_region)
> +{
> +	return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
> +}
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -300,6 +305,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>  
>  	lockdep_assert_held(&dax_region_rwsem);
>  
> +	if (is_dynamic(dax_region))
> +		return 0;
> +
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> @@ -1385,6 +1393,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
>  		return 0;
>  	if (a == &dev_attr_mapping.attr && is_static(dax_region))
>  		return 0;
> +	if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
> +		return 0;
>  	if ((a == &dev_attr_align.attr ||
>  	     a == &dev_attr_size.attr) && is_static(dax_region))
>  		return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 5909171a4428..6e739bfab932 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -15,6 +15,7 @@ struct dax_region;
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
>  #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_DCD BIT(2)
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 3ab39b77843d..cedd974c2d0c 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,32 @@ static int cxl_dax_region_probe(struct device *dev)
>  	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  	struct dax_region *dax_region;
>  	struct dev_dax_data data;
> +	resource_size_t dev_size;
> +	unsigned long flags;
>  
>  	if (nid == NUMA_NO_NODE)
>  		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>  
> +	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
> +		flags = IORESOURCE_DAX_DCD;
> +	else
> +		flags = IORESOURCE_DAX_KMEM;
> +
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +				      PMD_SIZE, flags);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_1)
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +	else
> +		dev_size = range_len(&cxlr_dax->hpa_range);
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
> -		.size = range_len(&cxlr_dax->hpa_range),
> +		.size = dev_size,
>  		.memmap_on_memory = true,
>  	};
>  


^ permalink raw reply	[flat|nested] 61+ messages in thread
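
Editorial note on patch 07: the ABI text for
create_{pmem,ram,dynamic_ram_1}_region describes an atomic compare
exchange on the advertised region id, with EBUSY when the written name
does not match the cached value. A minimal user-space sketch of that
handshake, using C11 atomics, might look as follows. The struct, the
allocator alloc_next_id(), and claim_region_id() are hypothetical and
are not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a root decoder's cached "next region" id. */
struct root_decoder {
	atomic_int next_region_id;	/* value returned by reading the attribute */
};

/* Hypothetical allocator for the id to advertise after a successful claim. */
static int alloc_next_id(int current)
{
	return current + 1;
}

/*
 * Claim 'requested' only if it is still the advertised id. On success
 * the advertised id moves on to a freshly allocated one and the claimed
 * id is returned. On mismatch nothing changes and -EBUSY is returned.
 */
static int claim_region_id(struct root_decoder *rd, int requested)
{
	int expected = requested;
	int next;

	assert(rd != NULL);

	next = alloc_next_id(requested);
	if (!atomic_compare_exchange_strong(&rd->next_region_id, &expected, next))
		return -EBUSY;
	return requested;
}
```

Two writers racing with the same 'regionZ' string cannot both succeed:
the first exchange advances the cached id, so the second comparison
fails and that writer sees -EBUSY.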

Thread overview: 61+ messages
2026-06-25 11:04 DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
2026-06-25 11:04 ` [PATCH v11 01/31] cxl/mbox: Flag " Anisa Su
2026-06-26 21:43   ` Dave Jiang
2026-06-25 11:04 ` [PATCH v11 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
2026-06-25 18:16   ` sashiko-bot
2026-06-26 22:26   ` Dave Jiang
2026-06-25 11:04 ` [PATCH v11 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
2026-06-26 22:37   ` Dave Jiang
2026-06-25 11:04 ` [PATCH v11 05/31] cxl/mem: Expose dynamic ram 1 partition in sysfs Anisa Su
2026-06-25 18:12   ` sashiko-bot
2026-06-26 23:08   ` Dave Jiang
2026-06-25 11:04 ` [PATCH v11 06/31] cxl/port: Add 'dynamic_ram_1' to endpoint decoder mode Anisa Su
2026-06-25 11:04 ` [PATCH v11 07/31] cxl/region: Add DC DAX region support Anisa Su
2026-06-25 18:16   ` sashiko-bot
2026-06-26 23:18   ` Dave Jiang
2026-06-25 11:04 ` [PATCH v11 08/31] cxl/events: Split event msgnum configuration from irq setup Anisa Su
2026-06-25 11:04 ` [PATCH v11 09/31] cxl/pci: Factor out interrupt policy check Anisa Su
2026-06-25 11:04 ` [PATCH v11 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
2026-06-25 18:14   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 11/31] cxl/core: Return endpoint decoder information from region search Anisa Su
2026-06-25 11:04 ` [PATCH v11 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
2026-06-25 18:12   ` sashiko-bot
2026-06-26 21:54   ` Dave Jiang
2026-06-25 11:04 ` [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
2026-06-25 18:15   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
2026-06-25 18:16   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
2026-06-25 18:19   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 16/31] cxl/extent: Validate DC extent partition Anisa Su
2026-06-25 18:20   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 17/31] cxl/mem: Enforce tag-group semantics Anisa Su
2026-06-25 18:24   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
2026-06-25 18:23   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
2026-06-25 18:23   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
2026-06-25 18:33   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
2026-06-25 18:29   ` sashiko-bot
2026-06-25 11:04 ` [PATCH v11 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
2026-06-25 18:36   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 23/31] dax/bus: Factor out dev dax resize logic Anisa Su
2026-06-25 18:27   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 24/31] dax/bus: Add uuid sysfs attribute to dax devices Anisa Su
2026-06-25 11:05 ` [PATCH v11 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation Anisa Su
2026-06-25 11:05 ` [PATCH v11 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
2026-06-25 18:26   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 27/31] cxl/region: Read existing extents on region creation Anisa Su
2026-06-25 18:32   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
2026-06-25 18:29   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
2026-06-25 18:31   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
2026-06-25 18:34   ` sashiko-bot
2026-06-25 11:05 ` [PATCH v11 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions Anisa Su
2026-06-25 18:24   ` sashiko-bot
2026-06-25 18:00 ` [PATCH v11 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
2026-06-26 22:30   ` Dave Jiang
