* [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs
[not found] <20260423235108.3732424-1-john@jagalactic.com>
@ 2026-04-23 23:51 ` John Groves
2026-04-23 23:52 ` [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray John Groves
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: John Groves @ 2026-04-23 23:51 UTC (permalink / raw)
To: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
John Groves, dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
From: John Groves <john@groves.net>
This series fixes correctness issues in the DCD (Dynamic Capacity
Device) add-capacity pipeline so that multi-tag allocations, sharable
CDAT regions, and cross-partition checks work end-to-end.
Base
----
The series applies on top of Anisa Su's DCD stack — specifically the
tip of famfs-v9-dcd on the 'anisa' remote
(git@github.com:anisa-su993/anisa-linux-kernel.git). That stack is
the one posted to linux-cxl here:
https://lore.kernel.org/linux-cxl/adnWg-60uKmduTsh@gourry-fedora-PF4VCD3F/T/#t
I started out commenting on the prior series, then moved to modifying it,
before settling on this initial approach of generalizing the
functionality through additional patches. I'm hoping the stakeholders
will find this easy to follow - at least as easy to follow as something
this tedious and nuanced can be.
Also, as one of the authors of the DCD spec: Now I get why this was hard
to follow and implement. The implementation is pretty complicated.
Note that this RFC builds happily, but I have not tested it yet. This
exercise has been about fleshing out the full logic.
The patches
-----------
1) cxl/extent: Promote cxlr_dax->region_extent to an xarray
Infrastructure only. Replaces the single-slot region_extent
pointer on struct cxl_dax_region with an xarray keyed by an
allocator-assigned u32. No behavior change; prepares storage for
multiple tagged allocations per DAX-region shim.
2) cxl/extent: Fix DCD add-capacity: per-tag assembly, ordering, and
integrity
Rewrites the add-capacity pipeline around the tag as the unit of
allocation (not the More-chain). Drops the DPA-contiguity reject,
assembles each tag group separately, stable-sorts by
shared_extn_seq, and adds cross-More-chain uniqueness,
sequence-integrity, and alignment gates. Lifts the single-slot
"already onlined" gate now that (1) is in place.
3) cxl/extent: Support extents in sharable CDAT regions
Drops the per-extent rejection of shared_extn_seq != 0 and
replaces it with a partition-aware check driven by the DSMAS
SHAREABLE bit already exposed through cxl_dpa_partition.
4) cxl/extent: Reject tagged extents that span DC partitions
Adds a group-level check rejecting tagged allocations whose
extents resolve to different DC partitions. A single host-visible
allocation cannot meaningfully straddle partitions whose CDAT
attributes disagree.
Signed-off-by: John Groves <John@Groves.net>
John Groves (4):
cxl/extent: Promote cxlr_dax->region_extent to an xarray
cxl/extent: Fix DCD add-capacity: per-tag assembly, ordering, and
integrity
cxl/extent: Support extents in sharable CDAT regions
cxl/extent: Reject tagged extents that span DC partitions
drivers/cxl/core/extent.c | 93 +++---
drivers/cxl/core/mbox.c | 628 +++++++++++++++++++++++++++++++-------
drivers/cxl/core/region.c | 2 +
drivers/cxl/cxl.h | 9 +-
include/cxl/event.h | 1 -
5 files changed, 584 insertions(+), 149 deletions(-)
--
2.53.0
* [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
@ 2026-04-23 23:52 ` John Groves
2026-04-24 0:51 ` Dave Jiang
2026-04-24 22:01 ` Ira Weiny
2026-04-23 23:52 ` [RFC PATCH 2/4] cxl/extent: Fix DCD add-capacity: per-tag assembly, ordering, and integrity John Groves
` (4 subsequent siblings)
5 siblings, 2 replies; 15+ messages in thread
From: John Groves @ 2026-04-23 23:52 UTC (permalink / raw)
To: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
John Groves, dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
From: John Groves <John@Groves.net>
Terminology note: "CXL region" (struct cxl_region) is the
mode-agnostic HDM-decoded HPA window; "DAX region"
(struct cxl_dax_region, nicknamed cxlr_dax) is the DCD-specific
CXL->device-dax shim that hangs off a cxl_region. This commit
touches only the DAX-region shim; nothing about cxl_region, its
HDM decoding, or its lifecycle changes.
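In picture form (just a restatement of the note above, not any new
structure):

    struct cxl_region                 /* mode-agnostic HPA window      */
      `-- struct cxl_dax_region      /* "cxlr_dax", the DCD shim      */
            `-- region_extent devices /* the storage this commit touches */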
Prior to this commit, struct cxl_dax_region holds a single
region_extent pointer, so at most one tagged allocation can ever
surface under a DAX region. A subsequent commit rewrites DCD
add-capacity handling to assemble extents per-tag and permits multiple
allocations per More-chain; that work is latent until cxlr_dax can
actually hold more than one region_extent. Do the infrastructural
swap here, on its own, so the behavior change lands in a dedicated
commit and the per-tag assembly work isn't tangled with storage-layout
changes.
Replace the single-slot pointer with an xarray. Note: the xarray is
*not* keyed by UUID. xarray is a radix tree over unsigned-long keys;
a 128-bit pseudo-random UUID is neither the right width nor the right
distribution for a radix index. The key is an allocator-assigned u32
obtained via xa_alloc() when online_region_extent() registers the
device; that same u32 is stored in region_extent->dev.id so device
names become "extent<rid>.<id>" with unique suffixes per allocation
(replacing the hardcoded .0). UUID remains a stored attribute on
struct region_extent; tag-based lookup is a linear xa_for_each walk,
which is adequate at the bounded per-region tag counts DCD delivers.
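As a minimal sketch of the idiom (condensed from the diff below; the
lookup fragment assumes a uuid_t *tag in scope and elides surrounding
context, it is not a new API):

    u32 id;
    int rc = xa_alloc(&cxlr_dax->region_extents, &id, region_extent,
                      xa_limit_32b, GFP_KERNEL);
    if (rc < 0)
        return rc;
    region_extent->dev.id = id;    /* name becomes "extent<rid>.<id>" */

    /* Tag lookup: a bounded linear walk, not a keyed lookup */
    struct region_extent *re;
    unsigned long idx;

    xa_for_each(&cxlr_dax->region_extents, idx, re)
        if (uuid_equal(&re->uuid, tag))
            return re;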
User-visible naming: device names remain "extent<rid>.<N>". Before
this commit, N is always 0; after, N is the xa_alloc()-assigned id.
The first region_extent under a cxlr_dax still lands at id 0, so a
single-allocation region is observably unchanged. Only a 2nd+
region_extent (which cannot be produced at all today) gets a
non-zero suffix.
Call-site translations:
- cxl_dax_region_alloc() / _release(): xa_init / xa_destroy.
- alloc_region_extent(): drop the hardcoded dev.id = 0; the ID is
assigned when online_region_extent() registers the device.
- online_region_extent(): device_initialize() first so the release
function is wired, then xa_alloc(), assign the returned id into
dev->id, then dev_set_name() and device_add(). On any failure
from xa_alloc() onward, put_device(dev) -> release ->
free_region_extent() -> xa_erase() handles cleanup; no explicit
xa_erase() is needed in the error path of online_region_extent().
- free_region_extent(): xa_erase() replaces NULLing the slot.
- extents_contain() / extents_overlap(): nested xa_for_each, outer
over region_extents, inner over decoder_extents.
- cxl_rm_extent(): walk region_extents and match by UUID to find
the target region_extent (linear, bounded).
- cxl_add_extent()'s "already onlined" gate: translated from
single-slot NULL check to !xa_empty(). This preserves the
existing single-region_extent-per-cxlr_dax invariant; lifting
that gate to actually support multiple allocations is a separate
semantic change that belongs with the per-tag assembly rework.
The dax-side (drivers/dax/cxl.c) is unchanged. cxl_dax_region_probe()
still iterates region_extent children via device_for_each_child() on
cxlr_dax->dev, and cxl_dax_region_notify() still takes a
struct region_extent * through cxl_notify_data; neither the CXL<->DAX
boundary nor the DAX resource model is affected.
No functional change: all paths still observe at most one
region_extent per cxlr_dax; this commit only prepares the storage.
Signed-off-by: John Groves <john@groves.net>
---
drivers/cxl/core/extent.c | 89 ++++++++++++++++++++++-----------------
drivers/cxl/core/region.c | 2 +
drivers/cxl/cxl.h | 9 +++-
3 files changed, 61 insertions(+), 39 deletions(-)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index c4ad81814fb4d..44b58cd477655 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -87,7 +87,8 @@ static void free_region_extent(struct region_extent *region_extent)
xa_for_each(&region_extent->decoder_extents, index, ed_extent)
cxled_release_extent(ed_extent->cxled, ed_extent);
xa_destroy(&region_extent->decoder_extents);
- region_extent->cxlr_dax->region_extent = NULL;
+ xa_erase(&region_extent->cxlr_dax->region_extents,
+ region_extent->dev.id);
kfree(region_extent);
}
@@ -144,7 +145,6 @@ alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range,
region_extent->hpa_range = *hpa_range;
region_extent->cxlr_dax = cxlr_dax;
uuid_copy(&region_extent->uuid, uuid);
- region_extent->dev.id = 0;
xa_init(&region_extent->decoder_extents);
return no_free_ptr(region_extent);
}
@@ -153,12 +153,26 @@ int online_region_extent(struct region_extent *region_extent)
{
struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
struct device *dev = &region_extent->dev;
+ u32 id;
int rc;
device_initialize(dev);
device_set_pm_not_required(dev);
dev->parent = &cxlr_dax->dev;
dev->type = ®ion_extent_type;
+
+ /*
+ * Insert into cxlr_dax->region_extents before dev_set_name so
+ * the allocated id is available as the .<N> suffix. On any
+ * failure from here on, put_device(dev) -> release ->
+ * free_region_extent() -> xa_erase() handles the cleanup.
+ */
+ rc = xa_alloc(&cxlr_dax->region_extents, &id, region_extent,
+ xa_limit_32b, GFP_KERNEL);
+ if (rc < 0)
+ goto err;
+ dev->id = id;
+
rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
if (rc)
goto err;
@@ -167,7 +181,6 @@ int online_region_extent(struct region_extent *region_extent)
if (rc)
goto err;
- cxlr_dax->region_extent = region_extent;
dev_dbg(dev, "region extent HPA %pra\n", &region_extent->hpa_range);
return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
region_extent);
@@ -184,17 +197,16 @@ static bool extents_contain(struct cxl_dax_region *cxlr_dax,
struct cxl_endpoint_decoder *cxled,
struct range *new_range)
{
- struct region_extent *re = cxlr_dax->region_extent;
+ struct region_extent *re;
struct cxled_extent *entry;
- unsigned long index;
-
- if (!re)
- return false;
-
- xa_for_each(&re->decoder_extents, index, entry) {
- if (cxled == entry->cxled &&
- range_contains(&entry->dpa_range, new_range))
- return true;
+ unsigned long i, j;
+
+ xa_for_each(&cxlr_dax->region_extents, i, re) {
+ xa_for_each(&re->decoder_extents, j, entry) {
+ if (cxled == entry->cxled &&
+ range_contains(&entry->dpa_range, new_range))
+ return true;
+ }
}
return false;
}
@@ -203,17 +215,16 @@ static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
struct cxl_endpoint_decoder *cxled,
struct range *new_range)
{
- struct region_extent *re = cxlr_dax->region_extent;
+ struct region_extent *re;
struct cxled_extent *entry;
- unsigned long index;
-
- if (!re)
- return false;
-
- xa_for_each(&re->decoder_extents, index, entry) {
- if (cxled == entry->cxled &&
- range_overlaps(&entry->dpa_range, new_range))
- return true;
+ unsigned long i, j;
+
+ xa_for_each(&cxlr_dax->region_extents, i, re) {
+ xa_for_each(&re->decoder_extents, j, entry) {
+ if (cxled == entry->cxled &&
+ range_overlaps(&entry->dpa_range, new_range))
+ return true;
+ }
}
return false;
}
@@ -306,21 +317,25 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
}
cxlr_dax = cxlr->cxlr_dax;
- reg_ext = cxlr_dax->region_extent;
+ import_uuid(&tag, extent->uuid);
+ {
+ struct region_extent *r;
+ unsigned long idx;
+
+ reg_ext = NULL;
+ xa_for_each(&cxlr_dax->region_extents, idx, r) {
+ if (uuid_equal(&r->uuid, &tag)) {
+ reg_ext = r;
+ break;
+ }
+ }
+ }
if (!reg_ext) {
- dev_err(&cxlr->cxlr_dax->dev,
- "no capacity has been added to the region\n");
+ dev_err(&cxlr_dax->dev,
+ "no region_extent matches tag %pU\n", &tag);
return -ENXIO;
}
- import_uuid(&tag, extent->uuid);
- if (!uuid_equal(&tag, &reg_ext->uuid)) {
- dev_err(&cxlr->cxlr_dax->dev,
- "extent tag %pU doesn't match region tag %pU\n",
- &tag, &reg_ext->uuid);
- return -EINVAL;
- }
-
calc_hpa_range(cxled, cxlr_dax, &dpa_range, &hpa_range);
if (!range_contains(&reg_ext->hpa_range, &hpa_range)) {
dev_err(&cxlr_dax->dev,
@@ -329,9 +344,7 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
return -EINVAL;
}
- rc = cxlr_notify_extent(cxlr,
- DCD_RELEASE_CAPACITY,
- cxlr_dax->region_extent);
+ rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, reg_ext);
if (rc == -EBUSY)
return 0;
@@ -416,7 +429,7 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
cxlr_dax = cxlr->cxlr_dax;
/* Cannot add to a region_extent once it's been onlined */
- if (cxlr_dax->region_extent) {
+ if (!xa_empty(&cxlr_dax->region_extents)) {
dev_err(&cxlr_dax->dev, "Can no longer add to region %d\n",
cxlr->id);
return -EINVAL;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 15065d9a1cbf2..42b3a3bbd502f 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3546,6 +3546,7 @@ static void cxl_dax_region_release(struct device *dev)
{
struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ xa_destroy(&cxlr_dax->region_extents);
kfree(cxlr_dax);
}
@@ -3590,6 +3591,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
if (!cxlr_dax)
return ERR_PTR(-ENOMEM);
+ xa_init(&cxlr_dax->region_extents);
cxlr_dax->hpa_range.start = p->res->start;
cxlr_dax->hpa_range.end = p->res->end;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 53d8a2f0d1272..715fa9e580cb0 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -637,7 +637,14 @@ struct cxl_dax_region {
struct device dev;
struct cxl_region *cxlr;
struct range hpa_range;
- struct region_extent *region_extent;
+ /*
+ * region_extents is keyed by an allocator-assigned u32 (see
+ * online_region_extent()), not by extent UUID. The UUID is a
+ * 128-bit pseudo-random identity carried on each region_extent;
+ * tag lookup is a linear xa_for_each walk, which is adequate at
+ * the bounded per-region tag counts this driver handles.
+ */
+ struct xarray region_extents;
};
/**
--
2.53.0
* [RFC PATCH 2/4] cxl/extent: Fix DCD add-capacity: per-tag assembly, ordering, and integrity
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
2026-04-23 23:52 ` [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray John Groves
@ 2026-04-23 23:52 ` John Groves
2026-04-23 23:52 ` [RFC PATCH 3/4] cxl/extent: Support extents in sharable CDAT regions John Groves
` (3 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: John Groves @ 2026-04-23 23:52 UTC (permalink / raw)
To: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
John Groves, dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
From: John Groves <John@Groves.net>
Prior to this commit, the DCD add-capacity path mishandles the stream
of extent events that the device uses to deliver allocations:
1. It list_sort()s the pending batch by DPA, rejects the whole batch
unless every adjacent pair is DPA-contiguous, then re-sorts by a
saved arrival index to restore order for the response.
This defeats the purpose of the extent mechanism. Extents exist
so the device can satisfy a single allocation from non-contiguous
DPA pieces; rejecting those sets means only trivially-contiguous
allocations ever become dax devices. The idx bookkeeping on
struct cxl_extent_list_node exists only to undo the DPA sort.
2. It conflates "More-chain" with "allocation". The allocation unit
in the spec is the *tag*: extents that share the same tag form one
allocation and belong in one dax device. A More-chain is a
delivery boundary, not an allocation boundary — it may carry
extents for several distinct tags. What More=0 guarantees is
completeness: for every tag that appears inside the chain, all of
that tag's extents are delivered by the time the chain closes.
Consequently, an extent bearing a tag that a previous More-chain
already committed is a firmware bug; this applies to any tagged
allocation, whether or not the extents carry sequence numbers.
Prior to this commit, the code forces all More-chain extents into a single
region_extent via cxl_add_extent()'s uuid_equal() check, so for a
multi-tag chain every allocation but the first is silently
dropped. It also has no explicit detection for a tag reappearing
in a later More-chain.
3. It orders assembly by the host's event arrival timing (via the
DPA sort + idx unsort), not by the per-allocation sequence number
the spec defines on each extent (Shared Extent Sequence Number).
The dax device's backing layout should be a property of the
device's allocation, not of host event delivery.
Note that sequence numbers are a sharable-region concern: extents
in non-sharable regions do not carry them (shared_extn_seq == 0
on every extent), and a single host may simply assemble those in
arrival order. The same comparator must handle both cases.
There is also no check that the set of sequence numbers in a tag
group is well-formed. Valid sharable values are 1..n contiguous
and unique; mixing 0 with non-zero, skipping positions, or
duplicating a position is a firmware bug that should not produce
a dax device.
4. It rejects any extent with a null (untagged) UUID up front, even
though CXL 3.1 explicitly permits untagged extents in non-sharable
regions.
5. It does not validate extent alignment. A dax device backed by
extents that are not aligned to device-dax granularity cannot be
mapped; surfacing such a device to userspace is a bug waiting to
happen.
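To make (2) and (3) concrete, here is a hypothetical event stream
(illustrative only, not taken from the spec) and the grouping this
series produces:

    chain: AC(More=1, tag=A, seq=1), AC(More=1, tag=B, seq=0),
           AC(More=0, tag=A, seq=2)

    => group A = { seq 1, seq 2 }  -> one dax device, assembled by seq
       group B = { seq 0 }         -> a second dax device, arrival order

Any later chain carrying another tag-A extent is a firmware bug:
More=0 committed allocation A as complete.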
Fix:
- Replace the flat DPA sort + contiguity reject + idx restore with
a per-tag-group assembly pipeline:
1. Extract this tag's pending extents to a local list.
2. Stable-sort by shared_extn_seq. For sharable extents this
walks the group in device-stamped order; for non-sharable
extents every key is 0 and the stable sort preserves arrival
order. One comparator, both cases.
3. Cross-More-chain uniqueness: if this (tagged) group's tag
already maps to a committed region_extent on its target
cxlr_dax, the device has re-sent a completed allocation —
reject the whole group with a firmware-bug warning. Skipped
for the null UUID (the spec does not define cross-chain
identity for untagged extents). Implemented as a linear walk
of cxlr_dax->region_extents comparing stored UUIDs.
4. Sequence-number integrity: a tag group is well-formed iff
every member carries shared_extn_seq == 0 (non-sharable) or
the sorted group is exactly 1, 2, ..., n (sharable). A mix,
a gap, a duplicate, or a non-zero set that does not start at
1 is a firmware bug and the whole group is dropped. For now,
the per-extent sharable-extent rejection (see below) means only
the all-zero branch is reachable; the non-zero branches are in
place so that lifting that restriction keeps the sequencing-
integrity contract enforced.
5. Alignment gate: if any extent in the group fails
CXL_DCD_EXTENT_ALIGN (SZ_2M), drop the whole group with a
warning. Partial acceptance would leave an unusable dax
device.
6. Validate + cxl_add_extent() each survivor into a fresh
region_extent.
7. Online + notify, splice accepted extents onto the response
list, clear add_ctx.region_extent for the next tag.
The outer loop picks tags in first-appearance order; the *intra-
group* order is the spec's sequence number (or, for non-sharable
extents, arrival order via stable-sort tie-breaking).
- Drop the contiguity check. Same-tag extents form one dax device
regardless of DPA layout.
- Drop the null-UUID rejection in cxl_validate_extent(). Untagged
extents may now be accepted; uuid_equal() in cxl_add_extent()
already collapses them into a single untagged region_extent, which
is the simplest conformant choice given the spec's silence on
untagged aggregation. Sharable extents (shared_extn_seq != 0) remain
unsupported at the per-extent validate stage, so for now the group-
level sequence-integrity check only exercises the all-zero branch.
- Drop the DPA sort, the restore-by-idx sort, and the idx field on
struct cxl_extent_list_node. list_sort is retained (stable sort
for the intra-tag sequence-number ordering).
- Lift cxl_add_extent()'s "region already has a region_extent" gate.
It was the single-slot safety stop; with the prior commit's xarray
in place there is no reason to reject adding a second allocation
to cxlr_dax, and the cross-More-chain uniqueness check above
handles the one case the gate actually defended against (a
same-tag re-delivery).
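Condensed, the resulting cxl_add_pending() outer loop has this shape
(the check and drop helpers here are shorthand for readability, not
real functions; the actual code is in the diff below):

    while (!list_empty(pending)) {
        tag = tag_of_first_entry(pending);        /* first-appearance order */
        extract_tag_group(pending, &tag, &group); /* keeps arrival order */
        list_sort(NULL, &group, extent_seq_compare); /* stable sort */

        if (tag_already_committed(&tag) ||        /* cross-More-chain */
            bad_group_seq(&group) ||              /* not all-0 or 1..n */
            misaligned(&group)) {
            drop_group(&group);                   /* all-or-nothing */
            continue;
        }
        validate_add_online_notify(&group);       /* one region_extent */
    }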
Fixes: 7f9b600a07e1 ("cxl/extent: Process dynamic partition events and realize region extents")
Signed-off-by: John Groves <john@groves.net>
---
drivers/cxl/core/extent.c | 6 -
drivers/cxl/core/mbox.c | 494 +++++++++++++++++++++++++++++++-------
include/cxl/event.h | 1 -
3 files changed, 401 insertions(+), 100 deletions(-)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 44b58cd477655..559c68f10dc6b 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -428,12 +428,6 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
return -ENXIO;
cxlr_dax = cxlr->cxlr_dax;
- /* Cannot add to a region_extent once it's been onlined */
- if (!xa_empty(&cxlr_dax->region_extents)) {
- dev_err(&cxlr_dax->dev, "Can no longer add to region %d\n",
- cxlr->id);
- return -EINVAL;
- }
if (pending_region_ext &&
!uuid_equal((uuid_t *)extent->uuid, &pending_region_ext->uuid)) {
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 34ba57d0494f5..804b7846b5726 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -7,6 +7,7 @@
#include <linux/unaligned.h>
#include <linux/list.h>
#include <linux/list_sort.h>
+#include <linux/sizes.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -970,11 +971,13 @@ static int cxl_validate_extent_partition(struct cxl_memdev_state *mds,
}
/*
- * Check extent tag is non-null, tag is not already in use, extent belongs to a
- * region, and the extent is within the bounds of a DC partition.
- * If extent is not first in the pending_list, check tag and region match the
- * previous entry's.
+ * CXL 3.1 permits both tagged (non-null UUID) and untagged (null UUID)
+ * extents. The spec is silent on whether untagged extents from different
+ * events may be aggregated; we allow them to be combined into a single
+ * dax device for simplicity. Sharable extents (shared_extn_seq != 0) are
+ * not supported yet and are rejected here.
*
+ * Partition boundary and region-attachment are validated separately.
*/
static int cxl_validate_extent(struct cxl_memdev_state *mds,
struct cxl_extent_list_node *pos)
@@ -988,14 +991,9 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
};
uuid_t *uuid = (uuid_t *)extent->uuid;
- if (uuid_is_null(uuid)) {
- dev_dbg(dev, "no tag for extent: %pra\n", &ext_range);
- return -EINVAL;
- }
-
if (le16_to_cpu(extent->shared_extn_seq) != 0) {
dev_dbg(dev,
- "DC extent DPA %pra (%pU) can not be shared\n",
+ "DC extent DPA %pra (%pU) is sharable; not supported\n",
&ext_range, uuid);
return -ENXIO;
}
@@ -1293,122 +1291,444 @@ static void clear_pending_extents(void *_mds)
mds->add_ctx.region_extent = NULL;
}
-static int dpa_compare(void *priv,
- const struct list_head *a,
- const struct list_head *b)
+/*
+ * Device-dax requires extent boundaries aligned to its mapping granularity.
+ * Use SZ_2M as a conservative default; a tighter check that queries the
+ * cxl_dax_region / cxl_endpoint_decoder for its actual alignment would be
+ * strictly more correct, but SZ_2M is the minimum device-dax supports on
+ * every architecture that enables CXL DCD today.
+ */
+#define CXL_DCD_EXTENT_ALIGN SZ_2M
+
+static bool cxl_extent_dcd_aligned(const struct cxl_extent *extent)
+{
+ u64 start = le64_to_cpu(extent->start_dpa);
+ u64 len = le64_to_cpu(extent->length);
+
+ return IS_ALIGNED(start, CXL_DCD_EXTENT_ALIGN) &&
+ IS_ALIGNED(len, CXL_DCD_EXTENT_ALIGN);
+}
+
+/*
+ * Compare two extents by shared_extn_seq (ascending).
+ *
+ * Per CXL 3.1 Table 8-51, shared_extn_seq is defined only for extents in
+ * *sharable* CDAT regions: those extents are required to carry both a
+ * non-null tag and a per-allocation sequence number so multiple hosts
+ * reading the same allocation assemble the extents into the same order.
+ *
+ * Extents in non-sharable regions do not carry a sequence number
+ * (shared_extn_seq == 0 on every extent); for those, a single host's
+ * arrival order is a sufficient definition of "the order the device
+ * sent them." list_sort() is stable, so when every element in a group
+ * has shared_extn_seq == 0, ties fall back to list order — which is
+ * arrival order via list_add_tail() in add_to_pending_list(). Thus
+ * the same comparator gives the right answer for both cases, and the
+ * code stays correct if/when sharable (sequenced) extents become
+ * supported.
+ */
+static int extent_seq_compare(void *priv,
+ const struct list_head *a,
+ const struct list_head *b)
{
const struct cxl_extent_list_node *ea =
list_entry(a, struct cxl_extent_list_node, list);
const struct cxl_extent_list_node *eb =
list_entry(b, struct cxl_extent_list_node, list);
+ u16 sa = le16_to_cpu(ea->extent->shared_extn_seq);
+ u16 sb = le16_to_cpu(eb->extent->shared_extn_seq);
- if (ea->extent->start_dpa < eb->extent->start_dpa)
+ if (sa < sb)
return -1;
- if (ea->extent->start_dpa > eb->extent->start_dpa)
+ if (sa > sb)
return 1;
-
return 0;
}
-static int idx_compare(void *priv,
- const struct list_head *a,
- const struct list_head *b)
+/*
+ * Move every pending extent whose tag matches @tag onto @group, preserving
+ * the order they appear in @pending. @group is left in arrival order so
+ * the caller can then sort it by shared_extn_seq with list_sort()'s stable
+ * ordering guarantee.
+ */
+static void extract_tag_group(struct list_head *pending,
+ const uuid_t *tag,
+ struct list_head *group)
{
- const struct cxl_extent_list_node *ea =
- list_entry(a, struct cxl_extent_list_node, list);
- const struct cxl_extent_list_node *eb =
- list_entry(b, struct cxl_extent_list_node, list);
+ struct cxl_extent_list_node *pos, *tmp;
- if (ea->idx < eb->idx)
- return -1;
- if (ea->idx > eb->idx)
- return 1;
+ list_for_each_entry_safe(pos, tmp, pending, list) {
+ uuid_t t;
+
+ import_uuid(&t, pos->extent->uuid);
+ if (uuid_equal(&t, tag))
+ list_move_tail(&pos->list, group);
+ }
+}
+/*
+ * Detect a tagged allocation re-appearing after its More-chain closed.
+ *
+ * A More-chain (the sequence of Add-Capacity events terminated by
+ * More=0) guarantees completeness for every tag it carries: once the
+ * chain ends, no extent bearing a tag that appeared inside it may
+ * arrive in any later chain. This is true for tagged extents whether
+ * or not they carry shared_extn_seq — sequencing is a sharable-region
+ * concern, completeness is a general one.
+ *
+ * Detection here is a linear walk of cxlr_dax->region_extents (keyed
+ * by allocator-assigned IDs, not by UUID) comparing each stored
+ * region_extent's UUID against the incoming tag.
+ *
+ * Returns true iff @tag is non-null AND a region_extent with a
+ * matching uuid already exists on the target region. For an untagged
+ * (null-UUID) extent the check is skipped: the spec is silent on
+ * aggregating untagged extents across More-chains, so we don't
+ * manufacture a rule here.
+ */
+static bool cxl_tag_already_committed(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent,
+ const uuid_t *tag)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_dax_region *cxlr_dax;
+ struct region_extent *re;
+ struct cxl_region *cxlr;
+ unsigned long idx;
+
+ if (uuid_is_null(tag))
+ return false;
+
+ guard(rwsem_read)(&cxl_rwsem.region);
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, NULL);
+ if (!cxlr)
+ return false;
+
+ cxlr_dax = cxlr->cxlr_dax;
+ xa_for_each(&cxlr_dax->region_extents, idx, re) {
+ if (uuid_equal(&re->uuid, tag))
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Validate shared_extn_seq across a tag group already sorted ascending.
+ *
+ * Per CXL 3.1 Table 8-51, shared_extn_seq is the per-allocation
+ * extent sequence number. Interpretation:
+ *
+ * - For extents in non-sharable regions the field is unused; every
+ * extent of the allocation carries shared_extn_seq == 0.
+ * - For extents in sharable regions the field carries the device's
+ * stamped position within the allocation. Valid values are 1..n
+ * where n is the number of extents in the allocation; the set
+ * must be contiguous (no gaps), unique (no duplicates), and
+ * complete (no missing positions). 0 is reserved as the
+ * "non-sharable" marker and is not a valid sharable sequence
+ * number.
+ *
+ * Hence a tag group is well-formed iff either (a) every extent has
+ * shared_extn_seq == 0, or (b) the sorted group is exactly 1, 2, ...,
+ * n. Anything else — a mix of 0 and non-zero values, a non-zero set
+ * that does not start at 1, a gap, or a duplicate — is a device
+ * firmware bug. Reject the whole group in those cases; partial
+ * acceptance would surface a dax device whose backing layout does
+ * not reflect the device's allocation.
+ *
+ * NOTE: at the time of this patch cxl_validate_extent() still rejects
+ * any extent with shared_extn_seq != 0 per-extent (sharable extents are
+ * not yet surfaced). This group-level check therefore only exercises
+ * the all-zero arm in the current driver; the non-zero arms are in
+ * place so that lifting the per-extent restriction does not leave the
+ * sequencing-integrity contract unenforced.
+ */
+static int cxl_check_group_seq(struct device *dev,
+ const uuid_t *tag,
+ const struct list_head *group)
+{
+ struct cxl_extent_list_node *pos;
+ u16 first, expected;
+
+ if (list_empty(group))
+ return 0;
+
+ pos = list_first_entry(group, struct cxl_extent_list_node, list);
+ first = le16_to_cpu(pos->extent->shared_extn_seq);
+
+ if (first == 0) {
+ /* Non-sharable: every member must be 0. */
+ list_for_each_entry(pos, group, list) {
+ if (le16_to_cpu(pos->extent->shared_extn_seq) != 0) {
+ dev_warn(dev,
+ "Tag %pUb: shared_extn_seq mixed 0/non-zero in one allocation (firmware bug)\n",
+ tag);
+ return -EINVAL;
+ }
+ }
+ return 0;
+ }
+
+ /* Sharable: group must be exactly 1, 2, ..., n (contiguous). */
+ if (first != 1) {
+ dev_warn(dev,
+ "Tag %pUb: shared_extn_seq starts at %u, expected 1 (firmware bug)\n",
+ tag, first);
+ return -EINVAL;
+ }
+
+ expected = 1;
+ list_for_each_entry(pos, group, list) {
+ u16 s = le16_to_cpu(pos->extent->shared_extn_seq);
+
+ if (s != expected) {
+ dev_warn(dev,
+ "Tag %pUb: shared_extn_seq gap/dup: expected %u got %u (firmware bug)\n",
+ tag, expected, s);
+ return -EINVAL;
+ }
+ expected++;
+ }
return 0;
}
/*
- * Validate and add contiguous extents. Removes invalid, non-contiguous, or
- * mismatched extents from pending_list. Sorts by DPA for processing, then
- * restores original order for response.
+ * Assemble the pending Add-Capacity events into dax devices and send the
+ * ADD_DC_RESPONSE.
+ *
+ * Spec semantics (CXL 3.1 8.2.9.9.9.3 / 8.2.9.2.1.6):
+ *
+ * - The unit of allocation is a *tag*, not a More-chain. All extents
+ * that share the same tag form one allocation and must be assembled
+ * into a single dax device. For extents in sharable CDAT regions
+ * a non-null tag is required; for extents in non-sharable regions
+ * the tag is optional — the null UUID is a valid "untagged"
+ * allocation identity.
+ *
+ * - Within a tag, extents must be ordered by shared_extn_seq (the
+ * per-allocation sequence number, Table 8-51). shared_extn_seq is
+ * a sharable-region concern: multiple hosts reading the same
+ * allocation need to agree on assembly order, so the device stamps
+ * each extent with its position. For non-sharable extents the
+ * spec does not provide sequence numbers (shared_extn_seq == 0 on
+ * every extent); the lone host simply assembles in arrival order.
+ * list_sort() is stable, so one comparator handles both cases:
+ * sequence-number order when it is populated, arrival order when
+ * every tie key is zero.
+ *
+ * Valid sharable values are 1..n, contiguous and unique across the
+ * n extents of one allocation; 0 is reserved for the non-sharable
+ * marker. A tag group is well-formed iff either every member is
+ * 0 or the sorted group is exactly 1, 2, ..., n. See
+ * cxl_check_group_seq().
+ *
+ * - A More-chain is a delivery boundary, not an allocation boundary:
+ * it may carry extents for several distinct tags. What More=0
+ * guarantees is completeness — for every tag that appears inside
+ * the chain, all of that tag's extents are delivered by the time
+ * the chain closes. This completeness guarantee applies to tagged
+ * allocations regardless of whether the extents carry sequence
+ * numbers. Therefore, receiving an extent bearing a tag that a
+ * previous More-chain already committed is a device firmware bug:
+ * the tag's allocation was supposed to have been complete when its
+ * chain closed. The untagged case is excluded — the spec does not
+ * define a cross-chain identity for untagged extents.
+ *
+ * - An allocation is not required to be DPA-contiguous; extents exist
+ * precisely so the device can satisfy one allocation from scattered
+ * DPA pieces.
+ *
+ * - Untagged extents from distinct events: the spec is silent on
+ * aggregation. Collapsing them into a single untagged dax device
+ * is the simplest conformant choice and is what the existing
+ * cxl_add_extent()/uuid_equal() logic implements for the null-UUID
+ * case.
+ *
+ * Enforced here, per tag group (in first-appearance order of the tag):
+ *
+ * 1. Extract the group to a local list, then stable-sort by
+ * shared_extn_seq. For sharable extents this walks the group
+ * in device-stamped sequence order; for non-sharable extents
+ * every key is 0 and the stable sort preserves arrival order.
+ * 2. Cross-More-chain uniqueness — if this (tagged) group's tag
+ * already maps to a committed region_extent on its target
+ * cxlr_dax, the device has re-sent a completed allocation.
+ * Drop the whole group with a firmware-bug warning. Skipped
+ * for the null UUID.
+ * 3. Sequence-number integrity — either every member carries
+ * shared_extn_seq == 0 (non-sharable allocation) or the sorted
+ * group is exactly 1, 2, ..., n (sharable). Mix, gap,
+ * duplicate, or a non-zero set that does not start at 1 is a
+ * firmware bug; drop the whole group.
+ * 4. Alignment gate — every extent's start_dpa and length must be
+ * CXL_DCD_EXTENT_ALIGN-aligned, else drop the whole group with
+ * a warning. Partial acceptance would leave an unusable dax
+ * device.
+ * 5. Validate + cxl_add_extent() each surviving extent into a fresh
+ * region_extent built up in add_ctx.
+ * 6. Online + notify the region_extent, splice accepted extents
+ * into the response list, clear the add_ctx slot so the next
+ * tag's group can build its own. online_region_extent() inserts
+ * each realized region_extent into cxlr_dax->region_extents
+ * (an xarray keyed by an allocator-assigned ID, not by UUID),
+ * which is what allows multiple tag groups to surface under
+ * one DAX region.
*/
static int cxl_add_pending(struct cxl_memdev_state *mds)
{
struct device *dev = mds->cxlds.dev;
+ struct list_head *pending = &mds->add_ctx.pending_extents;
struct cxl_extent_list_node *pos, *tmp;
- struct region_extent *pending_reg_ext;
- struct cxl_extent *extent;
- u64 prev_end, start, len;
- int cnt = 0, rc;
+ LIST_HEAD(accepted);
+ int total_accepted = 0;
+
+ while (!list_empty(pending)) {
+ LIST_HEAD(group);
+ struct region_extent *reg_ext;
+ bool aligned = true;
+ int group_cnt = 0;
+ uuid_t tag;
+ int rc;
- list_sort(NULL, &mds->add_ctx.pending_extents, dpa_compare);
- list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list) {
- extent = pos->extent;
- start = le64_to_cpu(extent->start_dpa);
- len = le64_to_cpu(extent->length);
-
- /* Start enforcing contiguity after accepting first extent */
- if (cnt && start != prev_end) {
- dev_dbg(dev,
- "Non-contiguous extent DPA:%#llx LEN:%#llx\n",
- start, len);
- delete_extent_node(pos);
+ /*
+ * (1) Extract this tag's extents from pending, then order
+ * them by shared_extn_seq. The outer tag is picked by the
+ * first-appearance extent in pending; groups *within* a tag
+ * are ordered by the per-allocation sequence number, which
+ * is the invariant the spec defines.
+ */
+ import_uuid(&tag,
+ list_first_entry(pending,
+ struct cxl_extent_list_node,
+ list)->extent->uuid);
+ extract_tag_group(pending, &tag, &group);
+ list_sort(NULL, &group, extent_seq_compare);
+
+ /*
+ * (2) Cross-More-chain uniqueness. A non-null tag seen in
+ * this group must not already correspond to a committed
+ * region_extent on its target cxlr_dax: More=0 was supposed
+ * to close that allocation. Firmware bug — reject the whole
+ * group. Any extent in the group maps to the same region
+ * (same tag == same allocation == same target), so checking
+ * the first suffices.
+ */
+ pos = list_first_entry(&group, struct cxl_extent_list_node,
+ list);
+ if (cxl_tag_already_committed(mds, pos->extent, &tag)) {
+ dev_warn(dev,
+ "Tag %pUb: dropping group, tag already committed in a previous More-chain (firmware bug)\n",
+ &tag);
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
continue;
}
- if (cxl_validate_extent(mds, pos)) {
- delete_extent_node(pos);
+ /*
+ * (3) Sequence-number integrity. All-zero (non-sharable)
+ * or exact 1..n contiguous (sharable). Anything else is a
+ * firmware bug — reject the whole group; no partial
+ * acceptance.
+ */
+ if (cxl_check_group_seq(dev, &tag, &group)) {
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
continue;
}
- if (cxl_add_extent(mds, extent)) {
- dev_dbg(dev,
- "Failed to add extent DPA:%#llx LEN:%#llx\n",
- start, len);
- delete_extent_node(pos);
+ /* (4) Alignment gate — abort the group if any member fails */
+ list_for_each_entry(pos, &group, list) {
+ if (!cxl_extent_dcd_aligned(pos->extent)) {
+ dev_warn(dev,
+ "Tag %pUb: dropping group, extent DPA:%#llx LEN:%#llx not %u-aligned\n",
+ &tag,
+ le64_to_cpu(pos->extent->start_dpa),
+ le64_to_cpu(pos->extent->length),
+ CXL_DCD_EXTENT_ALIGN);
+ aligned = false;
+ break;
+ }
+ }
+ if (!aligned) {
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
continue;
}
- prev_end = start + len;
- cnt++;
- }
+ /*
+ * (5) Validate + attach in seq order. Surviving nodes stay
+ * on @group in seq order; failed nodes are removed.
+ */
+ list_for_each_entry_safe(pos, tmp, &group, list) {
+ if (cxl_validate_extent(mds, pos)) {
+ delete_extent_node(pos);
+ continue;
+ }
- if (!mds->add_ctx.region_extent) {
- dev_dbg(dev, "No valid extents in list; accept none\n");
- return 0;
- }
+ if (cxl_add_extent(mds, pos->extent)) {
+ dev_dbg(dev,
+ "Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
+ &tag,
+ le64_to_cpu(pos->extent->start_dpa),
+ le64_to_cpu(pos->extent->length));
+ delete_extent_node(pos);
+ continue;
+ }
+ group_cnt++;
+ }
- pending_reg_ext = mds->add_ctx.region_extent;
- /* Ensure caches are clean prior onlining */
- rc = cxl_region_invalidate_memregion(pending_reg_ext->cxlr_dax->cxlr);
- if (rc)
- return rc;
+ /* (6) online + notify */
+ reg_ext = mds->add_ctx.region_extent;
+ if (!reg_ext) {
+ /* Every extent in the group was dropped */
+ continue;
+ }
- /* device model handles freeing region_extent */
- rc = online_region_extent(pending_reg_ext);
- if (rc)
- return rc;
+ rc = cxl_region_invalidate_memregion(reg_ext->cxlr_dax->cxlr);
+ if (!rc)
+ rc = online_region_extent(reg_ext);
+ if (rc) {
+ dev_warn(dev,
+ "Tag %pUb: failed to online region_extent (%d)\n",
+ &tag, rc);
+ /*
+ * region_extent was not onlined; the allocation
+ * failed. Drop its extents so we do not mis-report
+ * acceptance to the device.
+ */
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ } else {
+ rc = cxlr_notify_extent(reg_ext->cxlr_dax->cxlr,
+ DCD_ADD_CAPACITY, reg_ext);
+ if (rc)
+ region_rm_extent(reg_ext);
+ /* Keep accepted extents for the response */
+ list_splice_tail_init(&group, &accepted);
+ total_accepted += group_cnt;
+ }
+
+ /* Next tag's group gets a fresh add_ctx slot */
+ mds->add_ctx.region_extent = NULL;
+ }
- rc = cxlr_notify_extent(pending_reg_ext->cxlr_dax->cxlr,
- DCD_ADD_CAPACITY,
- pending_reg_ext);
/*
- * The region device was briefly live but DAX layer ensures it was not
- * used
+ * Response payload: all accepted extents, grouped by tag (in the
+ * tag's first-appearance order), each group ordered by
+ * shared_extn_seq. pending_extents is empty at this point since
+ * every tag group was extracted; splice the accepted list in so
+ * cxl_send_dc_response() can walk a single list.
*/
- if (rc)
- region_rm_extent(pending_reg_ext);
-
- /* Restore remaining extents to original order and send rsp */
- list_sort(NULL, &mds->add_ctx.pending_extents, idx_compare);
+ list_splice(&accepted, pending);
return cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
- &mds->add_ctx.pending_extents, cnt);
+ pending, total_accepted);
}
static int add_to_pending_list(struct list_head *pending_list,
struct cxl_extent *to_add)
{
- struct cxl_extent_list_node *node, *prev;
+ struct cxl_extent_list_node *node;
struct cxl_extent *extent;
node = kzalloc(sizeof(*node), GFP_KERNEL);
@@ -1420,18 +1740,6 @@ static int add_to_pending_list(struct list_head *pending_list,
node->extent = extent;
list_add_tail(&node->list, pending_list);
-
- /*
- * List is sorted by DPA when adding. Save original index to restore
- * order when sending DC rsp, as required by the spec.
- */
- if (list_is_first(&node->list, pending_list)) {
- node->idx = 0;
- } else {
- prev = list_prev_entry(node, list);
- node->idx = prev->idx + 1;
- }
-
return 0;
}
diff --git a/include/cxl/event.h b/include/cxl/event.h
index fbd95e381e414..fa3cd895f656f 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -156,7 +156,6 @@ struct cxl_extent {
struct cxl_extent_list_node {
struct cxl_extent *extent;
struct list_head list;
- int idx;
int rid;
};
--
2.53.0
* [RFC PATCH 3/4] cxl/extent: Support extents in sharable CDAT regions
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
2026-04-23 23:52 ` [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray John Groves
2026-04-23 23:52 ` [RFC PATCH 2/4] cxl/extent: Fix DCD add-capacity: per-tag assembly, ordering, and integrity John Groves
@ 2026-04-23 23:52 ` John Groves
2026-04-23 23:52 ` [RFC PATCH 4/4] cxl/extent: Reject tagged extents that span DC partitions John Groves
` (2 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: John Groves @ 2026-04-23 23:52 UTC (permalink / raw)
To: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
John Groves, dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
From: John Groves <John@Groves.net>
The previous cleanup (per-tag assembly, ordering, integrity) intentionally
left extents in sharable CDAT regions rejected at the per-extent gate:
cxl_validate_extent() returned -ENXIO for any shared_extn_seq != 0. The
group-level 1..n contiguity check (cxl_check_group_seq) was written so
that turning the gate off would leave the sequencing contract enforced.
Turn the gate off and replace it with a partition-aware check.
The CDAT DSMAS entry for each DC partition already carries the
ACPI_CDAT_DSMAS_SHAREABLE bit, propagated into
cxl_dpa_partition::perf.shareable by the CDAT parse path. That bit is
the authoritative answer to "is this extent multi-host sharable":
- Sharable partition. An extent's tag (UUID) is the allocation
identity that every sharing host uses, so the tag must be
non-null. The device stamps each extent with its position in the
allocation via shared_extn_seq, so the field must be non-zero.
The group-level check then verifies the sorted group is exactly
1, 2, ..., n.
- Non-sharable partition. No multi-host coordination is required,
so the tag is optional (null UUID means "untagged") and
shared_extn_seq must be 0.
Any cross-mixing (sharable + null tag, sharable + seq == 0,
non-sharable + seq != 0) is a device firmware bug and the extent is
rejected with a per-extent firmware-bug message.
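Condensed, the per-extent decision is a small matrix (a sketch of what
cxl_validate_extent() below enforces, given the partition returned by
cxl_extent_dc_partition()):

    if (part->perf.shareable) {
        /* sharable partition: tag and sequence number are mandatory */
        if (uuid_is_null(uuid) || seq == 0)
            return -ENXIO;    /* firmware bug */
    } else {
        /* non-sharable partition: tag optional, seq must be 0 */
        if (seq != 0)
            return -ENXIO;    /* firmware bug */
    }
    return 0;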
Refactor to support the check:
- Rename cxl_validate_extent_partition() to cxl_extent_dc_partition()
and return the matching DC partition (NULL on not-found). One
call site; the caller now has the partition in hand and can read
perf.shareable without a second lookup.
- Rewrite cxl_validate_extent() around the partition's sharable
flag as described above. Drop the blanket shared_extn_seq != 0
reject.
- Update the comments on extent_seq_compare() and
cxl_check_group_seq() to drop the "sharable not yet surfaced"
caveats; both branches of each helper are now reachable.
The cxl_add_pending() pipeline (sort, cross-More-chain uniqueness,
sequence-number integrity, alignment, attach, online) is unchanged:
all of it was already written for both regimes.
Signed-off-by: John Groves <john@groves.net>
---
drivers/cxl/core/mbox.c | 94 ++++++++++++++++++++++++++++-------------
1 file changed, 65 insertions(+), 29 deletions(-)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 804b7846b5726..3ffcd90698f3c 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -939,14 +939,20 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, "CXL");
-static int cxl_validate_extent_partition(struct cxl_memdev_state *mds,
- struct cxl_extent *extent,
- struct range *ext_range)
+/*
+ * Find the DC (Dynamic Capacity) partition that fully contains @ext_range,
+ * or NULL if the extent falls outside every DC partition on this memdev.
+ * The returned pointer is owned by mds->cxlds.part[] and lives for the
+ * lifetime of the memdev.
+ */
+static const struct cxl_dpa_partition *
+cxl_extent_dc_partition(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent,
+ struct range *ext_range)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
struct device *dev = mds->cxlds.dev;
- /* Extents must be within the DC partition boundary */
for (int i = 0; i < cxlds->nr_partitions; i++) {
struct cxl_dpa_partition *part = &cxlds->part[i];
struct range partition_range = {
@@ -959,25 +965,37 @@ static int cxl_validate_extent_partition(struct cxl_memdev_state *mds,
if (range_contains(&partition_range, ext_range)) {
dev_dbg(dev, "DC extent DPA %pra (DCR:%pra)(%pU)\n",
- &ext_range, &partition_range, extent->uuid);
- return 0;
+ ext_range, &partition_range, extent->uuid);
+ return part;
}
}
dev_err_ratelimited(dev,
"DC extent DPA %pra (%pU) is not in a valid DC partition\n",
- &ext_range, extent->uuid);
- return -ENXIO;
+ ext_range, extent->uuid);
+ return NULL;
}
/*
- * CXL 3.1 permits both tagged (non-null UUID) and untagged (null UUID)
- * extents. The spec is silent on whether untagged extents from different
- * events may be aggregated; we allow them to be combined into a single
- * dax device for simplicity. Sharable extents (shared_extn_seq != 0) are
- * not supported yet and are rejected here.
+ * Per-extent validation for an Add-Capacity event. Two regimes, chosen
+ * by the DC partition's CDAT-advertised sharability:
*
- * Partition boundary and region-attachment are validated separately.
+ * Sharable partition (DSMAS flag ACPI_CDAT_DSMAS_SHAREABLE,
+ * reflected in part->perf.shareable):
+ * - A non-null tag (UUID) is required. The tag is the allocation
+ * identity that every host sharing the allocation uses.
+ * - shared_extn_seq must be non-zero. Together with the other
+ * members of the tag group it forms the 1..n contiguous set that
+ * cxl_check_group_seq() enforces.
+ *
+ * Non-sharable partition:
+ * - The tag is optional; null UUID is permitted.
+ * - shared_extn_seq must be 0. Sequencing is meaningless when
+ * only one host consumes the allocation.
+ *
+ * Any cross-mixing (sharable partition with null tag or seq == 0;
+ * non-sharable partition with non-zero seq) is a device firmware bug.
+ * Partition-boundary and region-attachment checks are separate.
*/
static int cxl_validate_extent(struct cxl_memdev_state *mds,
struct cxl_extent_list_node *pos)
@@ -990,16 +1008,36 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
le64_to_cpu(extent->length) - 1,
};
uuid_t *uuid = (uuid_t *)extent->uuid;
+ const struct cxl_dpa_partition *part;
+ u16 seq = le16_to_cpu(extent->shared_extn_seq);
- if (le16_to_cpu(extent->shared_extn_seq) != 0) {
- dev_dbg(dev,
- "DC extent DPA %pra (%pU) is sharable; not supported\n",
- &ext_range, uuid);
+ part = cxl_extent_dc_partition(mds, extent, &ext_range);
+ if (!part)
return -ENXIO;
+
+ if (part->perf.shareable) {
+ if (uuid_is_null(uuid)) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra: sharable-partition extent has null tag (firmware bug)\n",
+ &ext_range);
+ return -ENXIO;
+ }
+ if (seq == 0) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU): sharable-partition extent missing shared_extn_seq (firmware bug)\n",
+ &ext_range, uuid);
+ return -ENXIO;
+ }
+ return 0;
}
- if (cxl_validate_extent_partition(mds, extent, &ext_range))
+ /* Non-sharable partition. */
+ if (seq != 0) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU): non-sharable partition but shared_extn_seq=%u (firmware bug)\n",
+ &ext_range, uuid, seq);
return -ENXIO;
+ }
return 0;
}
@@ -1322,10 +1360,8 @@ static bool cxl_extent_dcd_aligned(const struct cxl_extent *extent)
* arrival order is a sufficient definition of "the order the device
* sent them." list_sort() is stable, so when every element in a group
* has shared_extn_seq == 0, ties fall back to list order — which is
- * arrival order via list_add_tail() in add_to_pending_list(). Thus
- * the same comparator gives the right answer for both cases, and the
- * code stays correct if/when sharable (sequenced) extents become
- * supported.
+ * arrival order via list_add_tail() in add_to_pending_list(). One
+ * comparator, both regimes.
*/
static int extent_seq_compare(void *priv,
const struct list_head *a,
@@ -1437,12 +1473,12 @@ static bool cxl_tag_already_committed(struct cxl_memdev_state *mds,
* acceptance would surface a dax device whose backing layout does
* not reflect the device's allocation.
*
- * NOTE: at the time of this patch cxl_validate_extent() still rejects
- * any extent with shared_extn_seq != 0 per-extent (sharable extents are
- * not yet surfaced). This group-level check therefore only exercises
- * the all-zero arm in the current driver; the non-zero arms are in
- * place so that lifting the per-extent restriction does not leave the
- * sequencing-integrity contract unenforced.
+ * cxl_validate_extent() enforces the per-extent partition/sharable
+ * consistency (sharable partition -> non-null tag + non-zero seq;
+ * non-sharable -> seq == 0), so by the time a group reaches this
+ * check all members agree on regime. This helper then enforces the
+ * group-level invariants the per-extent check cannot see: that
+ * sharable groups form an exact 1..n set with no gap or duplicate.
*/
static int cxl_check_group_seq(struct device *dev,
const uuid_t *tag,
--
2.53.0
* [RFC PATCH 4/4] cxl/extent: Reject tagged extents that span DC partitions
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
` (2 preceding siblings ...)
2026-04-23 23:52 ` [RFC PATCH 3/4] cxl/extent: Support extents in sharable CDAT regions John Groves
@ 2026-04-23 23:52 ` John Groves
2026-04-24 6:34 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs Anisa Su
2026-05-01 22:00 ` Anisa Su
5 siblings, 0 replies; 15+ messages in thread
From: John Groves @ 2026-04-23 23:52 UTC (permalink / raw)
To: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
John Groves, dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
From: John Groves <John@Groves.net>
A CXL DC partition is described by a CDAT DSMAS entry whose attribute
bits cover sharable vs. non-sharable, writable vs. read-only, and HW
vs. SW cache coherency (HDM-DB vs. HDM-H). A single tagged allocation
is a single host-visible object; it cannot meaningfully straddle two
partitions whose CDAT attributes disagree.
At today's driver granularity each DC partition maps to at most one
DSMAS entry (see update_perf_entry() in core/cdat.c and the
"Dynamic RAM perf mismatch" warning), and the only attribute plumbed
out of DSMAS into cxl_dpa_perf is ->shareable. Comparing the
containing cxl_dpa_partition of every extent in a tag group is
therefore a sufficient proxy for "same CDAT attributes": if all
extents resolve to the same partition pointer, they share every
attribute the driver currently distinguishes, and the check will
continue to hold as more DSMAS attributes get plumbed in the future
without needing to inspect them individually here.
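In code terms the proxy is just pointer identity on the containing
partition (a condensation of cxl_check_group_partition() below;
reject_group() is shorthand, not a real function):

    part = cxl_extent_dc_partition(mds, extent, &ext_range);
    if (part != first_part)    /* different DSMAS description */
        reject_group();        /* attributes may differ */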
Add cxl_check_group_partition() as step (4) of cxl_add_pending(),
between sequence-number integrity and the alignment gate. It walks
the sorted group, resolves each extent via cxl_extent_dc_partition(),
and rejects the whole group with a firmware-bug warning naming the
two offending DPAs if any extent's partition differs from the first.
The check is skipped for the null UUID: the spec does not define a
cross-chain identity for untagged extents, and the driver aggregates
them separately.
Signed-off-by: John Groves <john@groves.net>
---
drivers/cxl/core/mbox.c | 84 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 80 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 3ffcd90698f3c..b5300b48879f9 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1529,6 +1529,65 @@ static int cxl_check_group_seq(struct device *dev,
return 0;
}
+/*
+ * For tagged groups, reject allocations that span DC partitions.
+ *
+ * A tag is an allocation identity; the CDAT DSMAS entry that describes
+ * the containing DC partition is what tells the host which attributes
+ * (sharable, writable, HW cache coherency) apply. At current driver
+ * granularity each DC partition is described by at most one DSMAS and
+ * the only plumbed attribute is part->perf.shareable — but partition
+ * identity is a sufficient proxy for "same set of CDAT attributes."
+ * Comparing the containing cxl_dpa_partition of every extent to the
+ * first extent's therefore implicitly enforces attribute equality for
+ * all attributes the driver distinguishes today, and will keep doing so
+ * as more attributes become CDAT-plumbed.
+ *
+ * Untagged (null-UUID) groups are not meaningful here: the spec does
+ * not define a cross-chain identity for them and the driver aggregates
+ * them separately; skip the check.
+ */
+static int cxl_check_group_partition(struct cxl_memdev_state *mds,
+ const uuid_t *tag,
+ const struct list_head *group)
+{
+ struct device *dev = mds->cxlds.dev;
+ const struct cxl_dpa_partition *first_part = NULL;
+ u64 first_dpa = 0;
+ struct cxl_extent_list_node *pos;
+
+ if (uuid_is_null(tag) || list_empty(group))
+ return 0;
+
+ list_for_each_entry(pos, group, list) {
+ struct cxl_extent *extent = pos->extent;
+ struct range ext_range = (struct range) {
+ .start = le64_to_cpu(extent->start_dpa),
+ .end = le64_to_cpu(extent->start_dpa) +
+ le64_to_cpu(extent->length) - 1,
+ };
+ const struct cxl_dpa_partition *part;
+
+ part = cxl_extent_dc_partition(mds, extent, &ext_range);
+ if (!part)
+ return -ENXIO;
+
+ if (!first_part) {
+ first_part = part;
+ first_dpa = ext_range.start;
+ continue;
+ }
+
+ if (part != first_part) {
+ dev_warn(dev,
+ "Tag %pUb: extents span DC partitions (DPA:%#llx and DPA:%#llx), firmware bug\n",
+ tag, first_dpa, ext_range.start);
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
/*
* Assemble the pending Add-Capacity events into dax devices and send the
* ADD_DC_RESPONSE.
@@ -1597,13 +1656,18 @@ static int cxl_check_group_seq(struct device *dev,
* group is exactly 1, 2, ..., n (sharable). Mix, gap,
* duplicate, or a non-zero set that does not start at 1 is a
* firmware bug; drop the whole group.
- * 4. Alignment gate — every extent's start_dpa and length must be
+ * 4. Partition equality — for tagged groups, every extent must
+ * resolve to the same DC partition. CDAT describes a partition
+ * with a DSMAS entry carrying sharable / writable / coherency
+ * attributes; a single allocation cannot span differing CDAT
+ * attributes. Skipped for the null UUID.
+ * 5. Alignment gate — every extent's start_dpa and length must be
* CXL_DCD_EXTENT_ALIGN-aligned, else drop the whole group with
* a warning. Partial acceptance would leave an unusable dax
* device.
- * 5. Validate + cxl_add_extent() each surviving extent into a fresh
+ * 6. Validate + cxl_add_extent() each surviving extent into a fresh
* region_extent built up in add_ctx.
- * 6. Online + notify the region_extent, splice accepted extents
+ * 7. Online + notify the region_extent, splice accepted extents
* into the response list, clear the add_ctx slot so the next
* tag's group can build its own. online_region_extent() inserts
* each realized region_extent into cxlr_dax->region_extents
@@ -1673,7 +1737,19 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
continue;
}
- /* (4) Alignment gate — abort the group if any member fails */
+ /*
+ * (4) Partition equality — tagged allocations cannot span DC
+ * partitions, because a DC partition is the unit at which CDAT
+ * attributes (sharable, writable, coherency) are described.
+ * Skipped for the null UUID.
+ */
+ if (cxl_check_group_partition(mds, &tag, &group)) {
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ continue;
+ }
+
+ /* (5) Alignment gate — abort the group if any member fails */
list_for_each_entry(pos, &group, list) {
if (!cxl_extent_dcd_aligned(pos->extent)) {
dev_warn(dev,
--
2.53.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-23 23:52 ` [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray John Groves
@ 2026-04-24 0:51 ` Dave Jiang
2026-04-24 22:01 ` Ira Weiny
1 sibling, 0 replies; 15+ messages in thread
From: Dave Jiang @ 2026-04-24 0:51 UTC (permalink / raw)
To: John Groves, Davidlohr Bueso, Jonathan Cameron, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
On 4/23/26 4:52 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Terminology note: "CXL region" (struct cxl_region) is the
> mode-agnostic HDM-decoded HPA window; "DAX region"
> (struct cxl_dax_region, nicknamed cxlr_dax) is the DCD-specific
> CXL->device-dax shim that hangs off a cxl_region. This commit
> touches only the DAX-region shim; nothing about cxl_region, its
> HDM decoding, or its lifecycle changes.
>
> Prior to this commit, struct cxl_dax_region holds a single
> region_extent pointer, so at most one tagged allocation can ever
> surface under a DAX region. A subsequent commit rewrites DCD
> add-capacity handling to assemble extents per-tag and permits multiple
> allocations per More-chain; that work is latent until cxlr_dax can
> actually hold more than one region_extent. Do the infrastructural
> swap here, on its own, so the behavior change lands in a dedicated
> commit and the per-tag assembly work isn't tangled with storage-layout
> changes.
>
> Replace the single-slot pointer with an xarray. Note: the xarray is
> *not* keyed by UUID. xarray is a radix tree over unsigned-long keys;
> a 128-bit pseudo-random UUID is neither the right width nor the right
> distribution for a radix index. The key is an allocator-assigned u32
> obtained via xa_alloc() when online_region_extent() registers the
> device; that same u32 is stored in region_extent->dev.id so device
> names become "extent<rid>.<id>" with unique suffixes per allocation
> (replacing the hardcoded .0). UUID remains a stored attribute on
> struct region_extent; tag-based lookup is a linear xa_for_each walk,
> which is adequate at the bounded per-region tag counts DCD delivers.
>
> User-visible naming: device names remain "extent<rid>.<N>". Before
> this commit, N is always 0; after, N is the xa_alloc()-assigned id.
> The first region_extent under a cxlr_dax still lands at id 0, so a
> single-allocation region is observably unchanged. Only a 2nd+
> region_extent (which cannot be produced at all today) gets a
> non-zero suffix.
>
> Call-site translations:
>
> - cxl_dax_region_alloc() / _release(): xa_init / xa_destroy.
> - alloc_region_extent(): drop the hardcoded dev.id = 0; the ID is
> assigned when online_region_extent() registers the device.
> - online_region_extent(): device_initialize() first so the release
> function is wired, then xa_alloc(), assign the returned id into
> dev->id, then dev_set_name() and device_add(). On any failure
> from xa_alloc() onward, put_device(dev) -> release ->
> free_region_extent() -> xa_erase() handles cleanup; no explicit
> xa_erase() is needed in the error path of online_region_extent().
> - free_region_extent(): xa_erase() replaces NULLing the slot.
> - extents_contain() / extents_overlap(): nested xa_for_each, outer
> over region_extents, inner over decoder_extents.
> - cxl_rm_extent(): walk region_extents and match by UUID to find
> the target region_extent (linear, bounded).
> - cxl_add_extent()'s "already onlined" gate: translated from
> single-slot NULL check to !xa_empty(). This preserves the
> existing single-region_extent-per-cxlr_dax invariant; lifting
> that gate to actually support multiple allocations is a separate
> semantic change that belongs with the per-tag assembly rework.
>
> The dax-side (drivers/dax/cxl.c) is unchanged. cxl_dax_region_probe()
> still iterates region_extent children via device_for_each_child() on
> cxlr_dax->dev, and cxl_dax_region_notify() still takes a
> struct region_extent * through cxl_notify_data; neither the CXL<->DAX
> boundary nor the DAX resource model is affected.
>
> No functional change: all paths still observe at most one
> region_extent per cxlr_dax; this commit only prepares the storage.
>
> Signed-off-by: John Groves <John@Groves.net>
> Signed-off-by: John Groves <john@groves.net>
Double sign offs
> ---
> drivers/cxl/core/extent.c | 89 ++++++++++++++++++++++-----------------
> drivers/cxl/core/region.c | 2 +
> drivers/cxl/cxl.h | 9 +++-
> 3 files changed, 61 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index c4ad81814fb4d..44b58cd477655 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -87,7 +87,8 @@ static void free_region_extent(struct region_extent *region_extent)
> xa_for_each(®ion_extent->decoder_extents, index, ed_extent)
> cxled_release_extent(ed_extent->cxled, ed_extent);
> xa_destroy(®ion_extent->decoder_extents);
> - region_extent->cxlr_dax->region_extent = NULL;
> + xa_erase(®ion_extent->cxlr_dax->region_extents,
> + region_extent->dev.id);
> kfree(region_extent);
> }
>
> @@ -144,7 +145,6 @@ alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range,
> region_extent->hpa_range = *hpa_range;
> region_extent->cxlr_dax = cxlr_dax;
> uuid_copy(®ion_extent->uuid, uuid);
> - region_extent->dev.id = 0;
> xa_init(®ion_extent->decoder_extents);
> return no_free_ptr(region_extent);
> }
> @@ -153,12 +153,26 @@ int online_region_extent(struct region_extent *region_extent)
> {
> struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
> struct device *dev = ®ion_extent->dev;
> + u32 id;
> int rc;
>
> device_initialize(dev);
> device_set_pm_not_required(dev);
> dev->parent = &cxlr_dax->dev;
> dev->type = ®ion_extent_type;
> +
> + /*
> + * Insert into cxlr_dax->region_extents before dev_set_name so
> + * the allocated id is available as the .<N> suffix. On any
> + * failure from here on, put_device(dev) -> release ->
> + * free_region_extent() -> xa_erase() handles the cleanup.
> + */
> + rc = xa_alloc(&cxlr_dax->region_extents, &id, region_extent,
> + xa_limit_32b, GFP_KERNEL);
> + if (rc < 0)
> + goto err;
> + dev->id = id;
> +
> rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
> if (rc)
> goto err;
> @@ -167,7 +181,6 @@ int online_region_extent(struct region_extent *region_extent)
> if (rc)
> goto err;
>
> - cxlr_dax->region_extent = region_extent;
> dev_dbg(dev, "region extent HPA %pra\n", ®ion_extent->hpa_range);
> return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> region_extent);
> @@ -184,17 +197,16 @@ static bool extents_contain(struct cxl_dax_region *cxlr_dax,
> struct cxl_endpoint_decoder *cxled,
> struct range *new_range)
> {
> - struct region_extent *re = cxlr_dax->region_extent;
> + struct region_extent *re;
> struct cxled_extent *entry;
> - unsigned long index;
> -
> - if (!re)
> - return false;
> -
> - xa_for_each(&re->decoder_extents, index, entry) {
> - if (cxled == entry->cxled &&
> - range_contains(&entry->dpa_range, new_range))
> - return true;
> + unsigned long i, j;
> +
> + xa_for_each(&cxlr_dax->region_extents, i, re) {
> + xa_for_each(&re->decoder_extents, j, entry) {
> + if (cxled == entry->cxled &&
> + range_contains(&entry->dpa_range, new_range))
> + return true;
> + }
> }
> return false;
> }
> @@ -203,17 +215,16 @@ static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
> struct cxl_endpoint_decoder *cxled,
> struct range *new_range)
> {
> - struct region_extent *re = cxlr_dax->region_extent;
> + struct region_extent *re;
> struct cxled_extent *entry;
> - unsigned long index;
> -
> - if (!re)
> - return false;
> -
> - xa_for_each(&re->decoder_extents, index, entry) {
> - if (cxled == entry->cxled &&
> - range_overlaps(&entry->dpa_range, new_range))
> - return true;
> + unsigned long i, j;
> +
> + xa_for_each(&cxlr_dax->region_extents, i, re) {
> + xa_for_each(&re->decoder_extents, j, entry) {
> + if (cxled == entry->cxled &&
> + range_overlaps(&entry->dpa_range, new_range))
> + return true;
> + }
> }
Second time seeing this pattern. Maybe just create a helper function?
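For illustration, such a helper might look like the following; the name
extents_match() and the bool-selector shape are hypothetical, not part
of the series:

/*
 * Hypothetical consolidation of extents_contain()/extents_overlap():
 * walk every region_extent's decoder_extents and apply the selected
 * range predicate.
 */
static bool extents_match(struct cxl_dax_region *cxlr_dax,
			  struct cxl_endpoint_decoder *cxled,
			  struct range *new_range, bool overlap)
{
	struct region_extent *re;
	struct cxled_extent *entry;
	unsigned long i, j;

	xa_for_each(&cxlr_dax->region_extents, i, re) {
		xa_for_each(&re->decoder_extents, j, entry) {
			if (cxled != entry->cxled)
				continue;
			if (overlap ?
			    range_overlaps(&entry->dpa_range, new_range) :
			    range_contains(&entry->dpa_range, new_range))
				return true;
		}
	}
	return false;
}

extents_contain() would then reduce to extents_match(..., false) and
extents_overlap() to extents_match(..., true).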
> return false;
> }
> @@ -306,21 +317,25 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> }
>
> cxlr_dax = cxlr->cxlr_dax;
> - reg_ext = cxlr_dax->region_extent;
> + import_uuid(&tag, extent->uuid);
> + {
> + struct region_extent *r;
> + unsigned long idx;
> +
> + reg_ext = NULL;
> + xa_for_each(&cxlr_dax->region_extents, idx, r) {
> + if (uuid_equal(&r->uuid, &tag)) {
> + reg_ext = r;
> + break;
> + }
> + }
> + }
Would prefer not to have the extra scope.
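For illustration, hoisting the walk into a lookup helper would also drop
the scope; find_region_extent() is a hypothetical name:

/* Hypothetical: linear tag lookup over cxlr_dax->region_extents */
static struct region_extent *
find_region_extent(struct cxl_dax_region *cxlr_dax, const uuid_t *tag)
{
	struct region_extent *re;
	unsigned long idx;

	xa_for_each(&cxlr_dax->region_extents, idx, re)
		if (uuid_equal(&re->uuid, tag))
			return re;
	return NULL;
}

so the call site reduces to reg_ext = find_region_extent(cxlr_dax, &tag);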
DJ
> if (!reg_ext) {
> - dev_err(&cxlr->cxlr_dax->dev,
> - "no capacity has been added to the region\n");
> + dev_err(&cxlr_dax->dev,
> + "no region_extent matches tag %pU\n", &tag);
> return -ENXIO;
> }
>
> - import_uuid(&tag, extent->uuid);
> - if (!uuid_equal(&tag, ®_ext->uuid)) {
> - dev_err(&cxlr->cxlr_dax->dev,
> - "extent tag %pU doesn't match region tag %pU\n",
> - &tag, ®_ext->uuid);
> - return -EINVAL;
> - }
> -
> calc_hpa_range(cxled, cxlr_dax, &dpa_range, &hpa_range);
> if (!range_contains(®_ext->hpa_range, &hpa_range)) {
> dev_err(&cxlr_dax->dev,
> @@ -329,9 +344,7 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> return -EINVAL;
> }
>
> - rc = cxlr_notify_extent(cxlr,
> - DCD_RELEASE_CAPACITY,
> - cxlr_dax->region_extent);
> + rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, reg_ext);
> if (rc == -EBUSY)
> return 0;
>
> @@ -416,7 +429,7 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
>
> cxlr_dax = cxlr->cxlr_dax;
> /* Cannot add to a region_extent once it's been onlined */
> - if (cxlr_dax->region_extent) {
> + if (!xa_empty(&cxlr_dax->region_extents)) {
> dev_err(&cxlr_dax->dev, "Can no longer add to region %d\n",
> cxlr->id);
> return -EINVAL;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 15065d9a1cbf2..42b3a3bbd502f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3546,6 +3546,7 @@ static void cxl_dax_region_release(struct device *dev)
> {
> struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
>
> + xa_destroy(&cxlr_dax->region_extents);
> kfree(cxlr_dax);
> }
>
> @@ -3590,6 +3591,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
> if (!cxlr_dax)
> return ERR_PTR(-ENOMEM);
>
> + xa_init(&cxlr_dax->region_extents);
> cxlr_dax->hpa_range.start = p->res->start;
> cxlr_dax->hpa_range.end = p->res->end;
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 53d8a2f0d1272..715fa9e580cb0 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -637,7 +637,14 @@ struct cxl_dax_region {
> struct device dev;
> struct cxl_region *cxlr;
> struct range hpa_range;
> - struct region_extent *region_extent;
> + /*
> + * region_extents is keyed by an allocator-assigned u32 (see
> + * online_region_extent()), not by extent UUID. The UUID is a
> + * 128-bit pseudo-random identity carried on each region_extent;
> + * tag lookup is a linear xa_for_each walk, which is adequate at
> + * the bounded per-region tag counts this driver handles.
> + */
> + struct xarray region_extents;
> };
>
> /**
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
` (3 preceding siblings ...)
2026-04-23 23:52 ` [RFC PATCH 4/4] cxl/extent: Reject tagged extents that span DC partitions John Groves
@ 2026-04-24 6:34 ` Anisa Su
2026-05-01 22:00 ` Anisa Su
5 siblings, 0 replies; 15+ messages in thread
From: Anisa Su @ 2026-04-24 6:34 UTC (permalink / raw)
To: John Groves
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Fan Ni,
Shiju Jose, Robert Richter, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, dev.srinivasulu@gmail.com,
arramesh@micron.com, ajayjoshi@micron.com
On Thu, Apr 23, 2026 at 11:51:12PM +0000, John Groves wrote:
> From: John Groves <john@groves.net>
>
> This series fixes correctness issues in the DCD (Dynamic Capacity
> Device) add-capacity pipeline so that multi-tag allocations, sharable
> CDAT regions, and cross-partition checks work end-to-end.
>
Thanks for the fixes; this makes way more sense. Let me see if I can squash
these nicely before LSFMM :)
One question based on an initial read:
When we do the actual dax device creation via daxctl create-device, would we
want to add an optional arg specifying uuid? I think right now, each
region_extent becomes a dax_resource of the sparse dax region and create-device
grabs any unused dax_resource if -s isn't specified. So if we want to limit to
1 tagged allocation/DAX device, we would want some way to specify that.
But let me double-check tomorrow.
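For concreteness, the kind of invocation in question might look like
this; the --tag option is hypothetical and does not exist in daxctl
today:

# hypothetical --tag option, not in today's daxctl
daxctl create-device -r region0 -s 16G --tag f81d4fae-7dec-11d0-a765-00a0c91e6bf6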
Thanks,
Anisa
[snip]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-23 23:52 ` [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray John Groves
2026-04-24 0:51 ` Dave Jiang
@ 2026-04-24 22:01 ` Ira Weiny
2026-04-27 12:38 ` Jonathan Cameron
2026-04-27 18:32 ` Anisa Su
1 sibling, 2 replies; 15+ messages in thread
From: Ira Weiny @ 2026-04-24 22:01 UTC (permalink / raw)
To: John Groves, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
John Groves, Fan Ni, Anisa Su, Shiju Jose, Robert Richter
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
John Groves, dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
John Groves wrote:
> From: John Groves <John@Groves.net>
>
There is a misunderstanding of what a region_extent was intended to do.
From my original commit message:
<quote>
Regions are made up of one or more devices which may be surfacing memory
to the host. Once all devices in a region have surfaced an extent the
region can expose a corresponding extent for the user to consume.
Without interleaving a device extent forms a 1:1 relationship with the
region extent. Immediately surface a region extent upon getting a
device extent.
</quote>
Without interleave support the relationship should remain 1:1.
I don't see any comments here which indicate interleaving is being added.
I think the misunderstanding started with the modifications that Anisa
did trying to make device extents contiguous. Her version of the commit
message:
<quote>
Regions are made up of one or more devices which may be surfacing memory
to the host. Once all devices in a region have surfaced an extent the
region can expose a corresponding extent for the user to consume.
A region extent is comprised of 1 or more contiguous device extents with
> the same tag. It is surfaced after all extents from the DC event are
processed (events grouped together with the "More" flag).
Interleaving is unsupported.
</quote>
From what I can tell Anisa confused 'region_extent' with dax device.
struct dax_dev is the container to group together extents.
Look at the diagrams in this presentation:
https://lpc.events/event/18/contributions/1826/attachments/1435/3335/LPC2024_CXL_DCD-v2.pdf
'DAX dev 1' covers memory from Extent A and Extent B. What y'all will want
to do is ensure that the region extents which get surfaced are ordered
based on the sequence number _when_ _the_ _dax_ _device_ is created. The
order they come into the host does not really matter. Although, yeah, the
spec has a bunch of rules... so whatever, follow those. But it is the
dax device which groups the extents into a contiguous HPA range and maps
those ranges through struct dev_dax->ranges.
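A minimal sketch of that ordering step, assuming the dax layer can see
each extent's tag and shared-extent sequence number; the struct and
names below are hypothetical:

/* Hypothetical per-extent record visible at dax-device creation */
struct ext_rec {
	uuid_t tag;
	u32 seq;	/* shared_extn_seq from the event */
	struct range dpa;
};

/* sort() comparator: ascending sequence number within one tag group */
static int cmp_seq(const void *a, const void *b)
{
	const struct ext_rec *ea = a, *eb = b;

	if (ea->seq != eb->seq)
		return ea->seq < eb->seq ? -1 : 1;
	return 0;
}

At creation time: sort(group, n, sizeof(*group), cmp_seq, NULL), then
hand out device offsets in that order via struct dev_dax->ranges.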
Patch 2/4 tries to undo the contiguous DPA requirement. This is going
down a bad path. We should go back and fix Anisa's patch rather than add
on top of all this misunderstanding.
It may also be advantageous to remove the region_extent if interleave is
way far out on the support list... But it was added by me because folks
in our meetings insisted it was something we should consider in the
initial implementation. Sounds like that is just confusing things. So I
have to say if I were Dan I would probably be boiling right now with the
added complexity Ira added to the series... :-/ Sorry...
Regardless, this series is a bad way to try and get the base series fixed
up.
Allow Anisa to go back and remove the contiguous restriction and fix up
the base series.
Ira
[snip]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-24 22:01 ` Ira Weiny
@ 2026-04-27 12:38 ` Jonathan Cameron
2026-04-27 15:12 ` Ira Weiny
2026-04-27 18:32 ` Anisa Su
1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2026-04-27 12:38 UTC (permalink / raw)
To: Ira Weiny
Cc: John Groves, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Alison Schofield, Vishal Verma, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, dev.srinivasulu@gmail.com,
arramesh@micron.com, ajayjoshi@micron.com
> Look at the diagrams in this presentation:
>
> https://lpc.events/event/18/contributions/1826/attachments/1435/3335/LPC2024_CXL_DCD-v2.pdf
>
> 'DAX dev 1' covers memory from Extent A and Extent B. What yall will want
> to do is ensure that the region extents which get surfaced are ordered
> based on the sequence number _when_ _the_ _dax_ _device_ is created. The
> order they come into the host does not really matter. Although yea the
> spec has a bunch of rules... so whatever, follow those. But it is the
> dax device which groups the extents into a contiguous HPA range and maps
On this bit, HPA? Why would extents be contiguous in HPA? Contiguous
in the DAX device mapping sure, but not HPA.
> those ranges through struct dev_dax->ranges.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-27 12:38 ` Jonathan Cameron
@ 2026-04-27 15:12 ` Ira Weiny
2026-04-29 10:56 ` Jonathan Cameron
0 siblings, 1 reply; 15+ messages in thread
From: Ira Weiny @ 2026-04-27 15:12 UTC (permalink / raw)
To: Jonathan Cameron, Ira Weiny
Cc: John Groves, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Alison Schofield, Vishal Verma, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, dev.srinivasulu@gmail.com,
arramesh@micron.com, ajayjoshi@micron.com
Jonathan Cameron wrote:
>
> > Look at the diagrams in this presentation:
> >
> > https://lpc.events/event/18/contributions/1826/attachments/1435/3335/LPC2024_CXL_DCD-v2.pdf
> >
> > 'DAX dev 1' covers memory from Extent A and Extent B. What yall will want
> > to do is ensure that the region extents which get surfaced are ordered
> > based on the sequence number _when_ _the_ _dax_ _device_ is created. The
> > order they come into the host does not really matter. Although yea the
> > spec has a bunch of rules... so whatever, follow those. But it is the
> > dax device which groups the extents into a contiguous HPA range and maps
>
> On this bit, HPA? Why would extents be contiguous in HPA? Contiguous
> in the DAX device mapping sure, but not HPA.
I'm not saying the _extents_ are contiguous in HPA. I said they get
_mapped_ into a contiguous HPA _by_ the DAX device.
IOW the DAX device has some policy(tm) which looks at the extents and
decides which ones are part of the device (the grouping I mention above).
Then it presents a contiguous HPA to the user by mapping those
discontinuous extents.
Ira
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-24 22:01 ` Ira Weiny
2026-04-27 12:38 ` Jonathan Cameron
@ 2026-04-27 18:32 ` Anisa Su
1 sibling, 0 replies; 15+ messages in thread
From: Anisa Su @ 2026-04-27 18:32 UTC (permalink / raw)
To: Ira Weiny
Cc: John Groves, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Alison Schofield, Vishal Verma, Dan Williams, John Groves, Fan Ni,
Shiju Jose, Robert Richter, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, dev.srinivasulu@gmail.com,
arramesh@micron.com, ajayjoshi@micron.com
On Fri, Apr 24, 2026 at 05:01:26PM -0500, Ira Weiny wrote:
> John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
>
> There is a misunderstanding of what a region_extent was intended to do.
>
> From my original commit message:
>
> <quote>
> Regions are made up of one or more devices which may be surfacing memory
> to the host. Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent. Immediately surface a region extent upon getting a
> device extent.
> </quote>
>
> Without interleave support the relationship should remain 1:1.
>
> I don't see any comments here which indicate interleaving is being added.
>
> I think the misunderstanding started with the modifications that Anisa
> did trying to make device extents contiguous. Her version of the commit
> message:
>
> <quote>
> Regions are made up of one or more devices which may be surfacing memory
> to the host. Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> A region extent is comprised of 1 or more contiguous device extents with
> the same tag. It is surfaced after all extents from the DC event ar
> processed (events grouped together with the "More" flag).
> Interleaving is unsupported.
> </quote>
>
> From what I can tell Anisa confused 'region_extent' with dax device.
> struct dax_dev is the container to group together extents.
>
> Look at the diagrams in this presentation:
>
> https://lpc.events/event/18/contributions/1826/attachments/1435/3335/LPC2024_CXL_DCD-v2.pdf
>
> 'DAX dev 1' covers memory from Extent A and Extent B. What yall will want
> to do is ensure that the region extents which get surfaced are ordered
> based on the sequence number _when_ _the_ _dax_ _device_ is created. The
> order they come into the host does not really matter. Although yea the
> spec has a bunch of rules... so whatever, follow those. But it is the
> dax device which groups the extents into a contiguous HPA range and maps
> those ranges through struct dev_dax->ranges.
>
> Patch 2/4 tries to undo the contiguous DPA requirement. This is going
> down a bad path. We should go back and fix Anisa's patch rather than add
> on top of all this misunderstanding.
>
I think what John meant was that we can squash this patchset rather than apply
on top.
All 4 of these patches should be able to be squashed into "cxl/extent: Process
dynamic partition events and realize region extents", which is the commit that
added the extent add/remove logic.
I can squash them by EOD and we can iterate from that? Would that work?
Anyway seems like I've made a huge mess from my confusion... sorry all :(
Thanks,
Anisa
[snip]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-27 15:12 ` Ira Weiny
@ 2026-04-29 10:56 ` Jonathan Cameron
2026-05-01 10:56 ` Gregory Price
0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2026-04-29 10:56 UTC (permalink / raw)
To: Ira Weiny
Cc: John Groves, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Alison Schofield, Vishal Verma, Dan Williams, John Groves, Fan Ni,
Anisa Su, Shiju Jose, Robert Richter, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, dev.srinivasulu@gmail.com,
arramesh@micron.com, ajayjoshi@micron.com
On Mon, 27 Apr 2026 10:12:02 -0500
Ira Weiny <ira.weiny@intel.com> wrote:
> Jonathan Cameron wrote:
> >
> > > Look at the diagrams in this presentation:
> > >
> > > https://lpc.events/event/18/contributions/1826/attachments/1435/3335/LPC2024_CXL_DCD-v2.pdf
> > >
> > > 'DAX dev 1' covers memory from Extent A and Extent B. What yall will want
> > > to do is ensure that the region extents which get surfaced are ordered
> > > based on the sequence number _when_ _the_ _dax_ _device_ is created. The
> > > order they come into the host does not really matter. Although yea the
> > > spec has a bunch of rules... so whatever, follow those. But it is the
> > > dax device which groups the extents into a contiguous HPA range and maps
> >
> > On this bit, HPA? Why would extents be contiguous in HPA? Contiguous
> > in the DAX device mapping sure, but not HPA.
>
> I'm not saying the _extents_ are contiguous in HPA. I said they get
> _mapped_ into a contiguous HPA _by_ the DAX device.
That's not HPA at that point; it's a VA of some type.
Now maybe it smells like HPA in some DAX interface, but if it does,
we should think about any renames etc. necessary to avoid it doing
so! Or introduce DAXPA (not DPA for obvious reasons ;)
>
> IOW the DAX device has some policy(tm) which looks at the extents and
> decides which ones are part of the device (the grouping I mention above).
> Then it presents a contiguous HPA to the user by mapping those
> discontinuous extents.
>
> Ira
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray
2026-04-29 10:56 ` Jonathan Cameron
@ 2026-05-01 10:56 ` Gregory Price
0 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2026-05-01 10:56 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Ira Weiny, John Groves, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Dan Williams,
John Groves, Fan Ni, Anisa Su, Shiju Jose, Robert Richter,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
On Wed, Apr 29, 2026 at 11:56:38AM +0100, Jonathan Cameron wrote:
> On Mon, 27 Apr 2026 10:12:02 -0500
> Ira Weiny <ira.weiny@intel.com> wrote:
>
> > Jonathan Cameron wrote:
> > >
> > > > Look at the diagrams in this presentation:
> > > >
> > > > https://lpc.events/event/18/contributions/1826/attachments/1435/3335/LPC2024_CXL_DCD-v2.pdf
> > > >
> > > > 'DAX dev 1' covers memory from Extent A and Extent B. What yall will want
> > > > to do is ensure that the region extents which get surfaced are ordered
> > > > based on the sequence number _when_ _the_ _dax_ _device_ is created. The
> > > > order they come into the host does not really matter. Although yea the
> > > > spec has a bunch of rules... so whatever, follow those. But it is the
> > > > dax device which groups the extents into a contiguous HPA range and maps
> > >
> > > On this bit, HPA? Why would extents be contiguous in HPA? Contiguous
> > > in the DAX device mapping sure, but not HPA.
> >
> > I'm not saying the _extents_ are contiguous in HPA. I said they get
> > _mapped_ into a contiguous HPA _by_ the DAX device.
>
> That's not HPA at that point, it's a VA of some type.
> Now maybe it smells like HPA in some DAX interface but if it does
> we should think about any renames etc necessary to avoid it doing
> so! Or introduce DAXPA (not DPA for obvious reasons ;)
>
Dynamic Capacity Extent
Byte Offset   Length   Description
00h           08h      Starting DPA
This is all we need in order to know what should happen in the rest of the path.
The device sends a set of DPA extents that map into a host-defined DCD
region (which has an HPA range).
The extents it sends may be laid out as such:
Extent0: Tag(A) Seq(0) DPA(0x0000.0000) Length(0x1000.0000)
Extent1: Tag(A) Seq(1) DPA(0xF000.0000) Length(0x1000.0000)
(start of device, end of device)
The CXL driver creates a DCD Region, which maps a *possible region of
memory* for extents, but this region *may be sparse*.
For example:
DCDRegion -> Base(0x5.0000.0000) Length(0x1.0000.0000)
Extent0 -> Base(0x5.0000.0000) Length(0x1000.0000) (start of region)
Extent1 -> Base(0x5.F000.0000) Length(0x1000.0000) (end of region)
This is a completely valid set of extents.
We use DAX to cobble these sparse extents into a "virtually contiguous"
region of memory - essentially base-0 + length.
DAX0.0: Base(0x0) Length(0x2000.0000)
Region List:
Base(0x0) -> Extent0
Base(0x1000.0000) -> Extent1
So in a single diagram:
[ device dc partition ]
[extent0] ... unalloc... [extent1] Device View
| Sparse! |
----------------------------------------
| Sparse! |
[extent0] ... unalloc... [extent1]
[ host dc region ] CXL Driver View
| |
----------------------------------------
| |
[ DAX0.0 ]
[ Extent0 ][ Extent1 ] DAX View
| NOT Sparse |
----------------------------------------
fd = open("/dev/dax0.0")
buf = mmap(fd, ...) Software View
buf[0]; -> DPA 0x0000.0000
buf[0x10000000]; -> DPA 0xF000.0000
~Gregory
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
` (4 preceding siblings ...)
2026-04-24 6:34 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs Anisa Su
@ 2026-05-01 22:00 ` Anisa Su
2026-05-02 10:48 ` Gregory Price
5 siblings, 1 reply; 15+ messages in thread
From: Anisa Su @ 2026-05-01 22:00 UTC (permalink / raw)
To: John Groves, Gregory Price
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, John Groves, Shiju Jose,
Robert Richter, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, dev.srinivasulu@gmail.com,
arramesh@micron.com, ajayjoshi@micron.com
On Thu, Apr 23, 2026 at 11:51:12PM +0000, John Groves wrote:
> From: John Groves <john@groves.net>
>
> This series fixes correctness issues in the DCD (Dynamic Capacity
> Device) add-capacity pipeline so that multi-tag allocations, sharable
> CDAT regions, and cross-partition checks work end-to-end.
>
[snip]
Hi John,
I've already squashed your branches to a GH branch, so they're just
hanging out until we get some more feedback.

But I do have some food for thought on the DAX-layer changes that
would be required to make this patchset function as intended:

Currently, DAX device creation is not tag-aware. This patchset only
affects the conditions under which we accept or reject extents;
accepted extents just hang out under the region until a DAX device is
created and claims resources on it.
Currently, running

    daxctl create-device -r region0 -s requested_size

on a dynamic_ram_a type region does two things:
1. create a 0-sized dev_dax seed:
   - create_store()
   - extents must already have been added to the region; otherwise this
     returns -ENOSPC:

       avail = dax_region_avail_size(dax_region);
       if (avail == 0)
               rc = -ENOSPC;

   - creates a 0-sized DAX device seed (sparse DAX region devices must
     initially be created with 0 size)
2. resize the empty dev_dax to requested_size, or to the size of the
   region if -s is not specified (see the condensed sketch after this
   list):
   - size_store(to_alloc = requested_size ? requested_size :
     dax_region_avail_size(dax_region))
   - dev_dax_resize()
     - iterates over each child resource of the region to find unused
       resources, and claims them up to min(available_size, to_alloc)
     - for sparse regions, each child resource = 1 region extent
     - *note*: in Ira's original patchset, each region_extent = 1
       device extent, so each resource under the dax_region's resource
       tree corresponds to 1 device extent. In the current version,
       1 region_extent --> 1 tagged allocation (1+ extents with the
       same tag). This should also be cleared up.
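Condensed into pseudo-kernel-C, the two steps above look roughly like
this (a sketch of the flow, not the actual create_store()/size_store()
bodies in drivers/dax/bus.c; error handling is trimmed and
create_seed_device() is a hypothetical stand-in):

/* Sketch only: the shape of the two sysfs-driven steps behind
 * daxctl create-device. */
static int dcd_create_and_size(struct dax_region *dax_region,
			       resource_size_t requested_size)
{
	struct dev_dax *seed;
	resource_size_t to_alloc;

	/* Step 1: create_store() -- no extents yet means no capacity */
	if (dax_region_avail_size(dax_region) == 0)
		return -ENOSPC;

	seed = create_seed_device(dax_region);	/* hypothetical, 0-sized */
	if (IS_ERR(seed))
		return PTR_ERR(seed);

	/* Step 2: size_store() -> dev_dax_resize(); claim child
	 * resources (extents) up to min(avail, to_alloc) */
	to_alloc = requested_size ? requested_size
				  : dax_region_avail_size(dax_region);
	return dev_dax_resize(dax_region, seed, to_alloc);
}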
If we want each DAX device to correspond to a specific tag (tag-aware
DAX device creation):
1. Add tag to struct dax_resource:

/**
 * struct dax_resource - For sparse regions; an active resource
 * @region: dax_region this resource is in
 * @res: resource
 * @use_cnt: count the number of uses of this resource
+ * @tag: tag of the allocation backing this resource
 *
 * Changes to the dax_region and the dax_resources within it are protected by
 * dax_region_rwsem
 *
 * dax_resource's are not intended to be used outside the dax layer.
 */
struct dax_resource {
	struct dax_region *region;
	struct resource *res;
	unsigned int use_cnt;
+	uuid_t tag;
};
2. In the resizing operation (see the sketch after this list):
   - only allow a DAX device to claim a resource if it has the right tag
   - all resources with that tag must be claimed by that device
   - currently, the logic allows a DAX device to partially claim an
     extent if requested_size_to_alloc < size_of_free_extent(s); that
     would need to be disallowed for tagged extents
3. ? New sysfs interface to allow for something like "echo [tag_abcd] >
   /sys/.../daxX.Y/tag"
4. If we allow non-tagged extents (as this patchset does):
   - can non-tagged extents be partially claimed by a DAX device that
     doesn't have a specified tag?
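A minimal sketch of the gate in item 2, assuming a tag field is added
to struct dev_dax as well as to struct dax_resource (neither exists
today):

#include <linux/uuid.h>

/* Hypothetical helper for dev_dax_resize()'s claim loop: a resource
 * is claimable only when its tag matches the claiming device's. */
static bool dax_resource_claimable(const struct dax_resource *dax_res,
				   const struct dev_dax *dev_dax)
{
	/* untagged resources stay reserved for untagged devices */
	if (uuid_is_null(&dax_res->tag))
		return uuid_is_null(&dev_dax->tag);

	/* tagged resources go only to the device with the same tag */
	return uuid_equal(&dax_res->tag, &dev_dax->tag);
}

Enforcing "all resources with that tag must be claimed by that device"
would then be a separate pass after the per-resource match succeeds.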
Ira's original patchset introduced minimal changes to the DAX logic.
Introducing tag-awareness to the DAX resize logic may require adding
more conditional statements, especially if we allow both NULL and
non-NULL tags. (There might be a nice way to do it; this is just my
initial impression. I haven't tried implementing it, and I could
certainly look into it more.)
TL;DR -- some topics for discussion:
1. region_extent to device extent relationship
   - Ira said that without interleaving it should remain 1:1, not
     1:many for grouping device extents by tag
2. tag-aware DAX device creation
   - changes needed & their implications
   - what to do with non-tagged extents
See you at LSFMM :)
Thanks,
Anisa
* Re: [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs
2026-05-01 22:00 ` Anisa Su
@ 2026-05-02 10:48 ` Gregory Price
0 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2026-05-02 10:48 UTC (permalink / raw)
To: Anisa Su
Cc: John Groves, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
John Groves, Shiju Jose, Robert Richter,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
dev.srinivasulu@gmail.com, arramesh@micron.com,
ajayjoshi@micron.com
On Fri, May 01, 2026 at 03:00:07PM -0700, Anisa Su wrote:
>
> 4. If we allow non-tagged extents (as this patchset does):
>    - can non-tagged extents be partially claimed by a DAX device that
>      doesn't have a specified tag?
>
> Ira's original patchset introduced minimal changes to the DAX logic.
> Introducing tag-awareness to the DAX resize logic may require adding
> more conditional statements, especially if we allow both NULL and
> non-NULL tags. (There might be a nice way to do it; this is just my
> initial impression. I haven't tried implementing it, and I could
> certainly look into it more.)
>
There is a lot of confusion built around tagged vs untagged.
Can we at least agree that if a DC extent has a NULL tag:

    1 extent = 1 DAX device

and that a NULL UUID is just a sentinel value ("unknown")?
I.e., there's no such thing as a "partial claim by a DAX device":
either the extent is tagged (part of a set), or it is its own thing
without any identifying data around it (probably intended to be
hotplugged as system RAM or something -- a userland problem).

Alternatively: do not allow untagged extents at all, because we can't
make sense of what to do with them.
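To pin the first option down, a rough sketch in C (the helpers and the
extent->tag field are hypothetical, not current driver API):

/* Sketch of the NULL-tag-as-sentinel policy described above. */
static int dcd_extent_to_dax(struct dev_dax *dev_dax,
			     struct region_extent *extent)
{
	if (uuid_is_null(&extent->tag))
		/* untagged: the extent stands alone -- one extent,
		 * one DAX device, never partially claimed */
		return claim_whole_extent(dev_dax, extent);

	/* tagged: part of a set -- the device must claim every
	 * extent carrying the same tag */
	return claim_tag_group(dev_dax, &extent->tag);
}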
~Gregory
end of thread, other threads:[~2026-05-02 10:48 UTC | newest]
Thread overview: 15+ messages
[not found] <20260423235108.3732424-1-john@jagalactic.com>
2026-04-23 23:51 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs John Groves
2026-04-23 23:52 ` [RFC PATCH 1/4] cxl/extent: Promote cxlr_dax->region_extent to an xarray John Groves
2026-04-24 0:51 ` Dave Jiang
2026-04-24 22:01 ` Ira Weiny
2026-04-27 12:38 ` Jonathan Cameron
2026-04-27 15:12 ` Ira Weiny
2026-04-29 10:56 ` Jonathan Cameron
2026-05-01 10:56 ` Gregory Price
2026-04-27 18:32 ` Anisa Su
2026-04-23 23:52 ` [RFC PATCH 2/4] cxl/extent: Fix DCD add-capacity: per-tag assembly, ordering, and integrity John Groves
2026-04-23 23:52 ` [RFC PATCH 3/4] cxl/extent: Support extents in sharable CDAT regions John Groves
2026-04-23 23:52 ` [RFC PATCH 4/4] cxl/extent: Reject tagged extents that span DC partitions John Groves
2026-04-24 6:34 ` [RFC PATCH 0/4] cxl/extent: Enable and validate multi-extent DCDs Anisa Su
2026-05-01 22:00 ` Anisa Su
2026-05-02 10:48 ` Gregory Price