* [PATCH v10 01/31] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
@ 2026-05-23 9:42 ` Anisa Su
2026-05-27 21:34 ` Dave Jiang
2026-05-23 9:42 ` [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
` (31 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:42 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
Per the CXL 3.1 specification software must check the Command Effects
Log (CEL) for dynamic capacity command support.
Detect support for the DCD commands while reading the CEL, including:
Get DC Config
Get DC Extent List
Add DC Response
Release DC
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
---
drivers/cxl/core/mbox.c | 43 +++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxlmem.h | 15 ++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index aaa5c6277ebf..7ef5708bf210 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -165,6 +165,42 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
}
}
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+ return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds, u16 opcode,
+ unsigned long *cmd_mask)
+{
+ switch (opcode) {
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ set_bit(CXL_DCD_ENABLED_GET_CONFIG, cmd_mask);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, cmd_mask);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, cmd_mask);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ set_bit(CXL_DCD_ENABLED_RELEASE, cmd_mask);
+ break;
+ default:
+ break;
+ }
+}
+
+static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
+{
+ DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
+
+ bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
+ return bitmap_equal(cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
+}
+
static bool cxl_is_poison_command(u16 opcode)
{
#define CXL_MBOX_OP_POISON_CMDS 0x43
@@ -757,6 +793,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
struct cxl_cel_entry *cel_entry;
const int cel_entries = size / sizeof(*cel_entry);
+ DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
struct device *dev = mds->cxlds.dev;
int i, ro_cmds = 0, wr_cmds = 0;
@@ -785,11 +822,17 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
enabled++;
}
+ if (cxl_is_dcd_command(opcode)) {
+ cxl_set_dcd_cmd_enabled(mds, opcode, dcd_cmds);
+ enabled++;
+ }
+
dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
enabled ? "enabled" : "unsupported by driver");
}
set_features_cap(cxl_mbox, ro_cmds, wr_cmds);
+ mds->dcd_supported = cxl_verify_dcd_cmds(mds, dcd_cmds);
}
static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_memdev_state *mds)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 776c50d1db51..53444af448d7 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -230,6 +230,15 @@ struct cxl_event_state {
struct mutex log_lock;
};
+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+ CXL_DCD_ENABLED_GET_CONFIG,
+ CXL_DCD_ENABLED_GET_EXTENT_LIST,
+ CXL_DCD_ENABLED_ADD_RESPONSE,
+ CXL_DCD_ENABLED_RELEASE,
+ CXL_DCD_ENABLED_MAX
+};
+
/* Device enabled poison commands */
enum poison_cmd_enabled_bits {
CXL_POISON_ENABLED_LIST,
@@ -405,6 +414,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @partition_align_bytes: alignment size for partition-able capacity
* @active_volatile_bytes: sum of hard + soft volatile
* @active_persistent_bytes: sum of hard + soft persistent
+ * @dcd_supported: all DCD commands are supported
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -424,6 +434,7 @@ struct cxl_memdev_state {
u64 partition_align_bytes;
u64 active_volatile_bytes;
u64 active_persistent_bytes;
+ bool dcd_supported;
struct cxl_event_state event;
struct cxl_poison_state poison;
@@ -485,6 +496,10 @@ enum cxl_opcode {
CXL_MBOX_OP_UNLOCK = 0x4503,
CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
+ CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
+ CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
+ CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
+ CXL_MBOX_OP_RELEASE_DC = 0x4803,
CXL_MBOX_OP_MAX = 0x10000
};
--
2.43.0
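For readers following the thread, the detection logic in this patch can be sketched in userspace C. This is a minimal sketch, not the driver code: the opcode values and bit order are taken from the patch, while `is_dcd_command()`, `note_dcd_cmd()` and `dcd_supported()` are hypothetical names, a plain `unsigned long` stands in for the kernel bitmap, and the CEL is assumed to be already reduced to an array of opcodes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values as listed in the patch. */
enum {
	OP_GET_DC_CONFIG      = 0x4800,
	OP_GET_DC_EXTENT_LIST = 0x4801,
	OP_ADD_DC_RESPONSE    = 0x4802,
	OP_RELEASE_DC         = 0x4803,
};

/* Bit positions mirroring enum dcd_cmd_enabled_bits from the patch. */
enum {
	DCD_GET_CONFIG,
	DCD_GET_EXTENT_LIST,
	DCD_ADD_RESPONSE,
	DCD_RELEASE,
	DCD_MAX
};

/* The upper byte of the opcode selects the command set; 0x48 is DCD. */
static bool is_dcd_command(uint16_t opcode)
{
	return (opcode >> 8) == 0x48;
}

/* Record one opcode in the mask; other 0x48xx opcodes are ignored. */
static void note_dcd_cmd(uint16_t opcode, unsigned long *mask)
{
	switch (opcode) {
	case OP_GET_DC_CONFIG:
		*mask |= 1UL << DCD_GET_CONFIG;
		break;
	case OP_GET_DC_EXTENT_LIST:
		*mask |= 1UL << DCD_GET_EXTENT_LIST;
		break;
	case OP_ADD_DC_RESPONSE:
		*mask |= 1UL << DCD_ADD_RESPONSE;
		break;
	case OP_RELEASE_DC:
		*mask |= 1UL << DCD_RELEASE;
		break;
	default:
		break;
	}
}

/*
 * Hypothetical helper: walk a list of opcodes (standing in for the CEL)
 * and report DCD support only if all four commands were seen.
 */
static bool dcd_supported(const uint16_t *cel, size_t n)
{
	unsigned long mask = 0;	/* must start empty */

	for (size_t i = 0; i < n; i++)
		if (is_dcd_command(cel[i]))
			note_dcd_cmd(cel[i], &mask);

	return mask == (1UL << DCD_MAX) - 1;
}
```

The all-or-nothing result matches the intent of `dcd_supported` in the patch: a device advertising only some of the four commands is treated as not supporting DCD.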
* Re: [PATCH v10 01/31] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2026-05-23 9:42 ` [PATCH v10 01/31] cxl/mbox: Flag " Anisa Su
@ 2026-05-27 21:34 ` Dave Jiang
2026-05-30 6:22 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 21:34 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:42 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Per the CXL 3.1 specification software must check the Command Effects
May as well update to CXL r4.0
> Log (CEL) for dynamic capacity command support.
>
> Detect support for the DCD commands while reading the CEL, including:
>
> Get DC Config
> Get DC Extent List
> Add DC Response
> Release DC
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
missing Anisa sign off
>
> ---
> Changes:
> [anisa: rebase]
> ---
> drivers/cxl/core/mbox.c | 43 +++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 15 ++++++++++++++
> 2 files changed, 58 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index aaa5c6277ebf..7ef5708bf210 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -165,6 +165,42 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> }
> }
>
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> + return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds, u16 opcode,
> + unsigned long *cmd_mask)
mds not used, consider drop
> +{
> + switch (opcode) {
> + case CXL_MBOX_OP_GET_DC_CONFIG:
> + set_bit(CXL_DCD_ENABLED_GET_CONFIG, cmd_mask);
> + break;
> + case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> + set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, cmd_mask);
> + break;
> + case CXL_MBOX_OP_ADD_DC_RESPONSE:
> + set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, cmd_mask);
> + break;
> + case CXL_MBOX_OP_RELEASE_DC:
> + set_bit(CXL_DCD_ENABLED_RELEASE, cmd_mask);
> + break;
> + default:
> + break;
> + }
> +}
> +
> +static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
mds not used. consider drop
> +{
> + DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
> +
> + bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
> + return bitmap_equal(cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
Above lines can be replaced with:
return bitmap_full(cmds_seen, CXL_DCD_ENABLED_MAX);
> +}
> +
> static bool cxl_is_poison_command(u16 opcode)
> {
> #define CXL_MBOX_OP_POISON_CMDS 0x43
> @@ -757,6 +793,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> struct cxl_cel_entry *cel_entry;
> const int cel_entries = size / sizeof(*cel_entry);
> + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
Need to zero out the declared bitmap 'dcd_cmds' on stack before using.
DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX) = {};
> struct device *dev = mds->cxlds.dev;
> int i, ro_cmds = 0, wr_cmds = 0;
>
> @@ -785,11 +822,17 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> enabled++;
> }
>
> + if (cxl_is_dcd_command(opcode)) {
> + cxl_set_dcd_cmd_enabled(mds, opcode, dcd_cmds);
> + enabled++;
> + }
> +
> dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> enabled ? "enabled" : "unsupported by driver");
> }
>
> set_features_cap(cxl_mbox, ro_cmds, wr_cmds);
> + mds->dcd_supported = cxl_verify_dcd_cmds(mds, dcd_cmds);
> }
>
> static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_memdev_state *mds)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 776c50d1db51..53444af448d7 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -230,6 +230,15 @@ struct cxl_event_state {
> struct mutex log_lock;
> };
>
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> + CXL_DCD_ENABLED_GET_CONFIG,
> + CXL_DCD_ENABLED_GET_EXTENT_LIST,
> + CXL_DCD_ENABLED_ADD_RESPONSE,
> + CXL_DCD_ENABLED_RELEASE,
> + CXL_DCD_ENABLED_MAX
> +};
> +
would be nice to have comment point to where in the spec this is
DJ
> /* Device enabled poison commands */
> enum poison_cmd_enabled_bits {
> CXL_POISON_ENABLED_LIST,
> @@ -405,6 +414,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> * @partition_align_bytes: alignment size for partition-able capacity
> * @active_volatile_bytes: sum of hard + soft volatile
> * @active_persistent_bytes: sum of hard + soft persistent
> + * @dcd_supported: all DCD commands are supported
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -424,6 +434,7 @@ struct cxl_memdev_state {
> u64 partition_align_bytes;
> u64 active_volatile_bytes;
> u64 active_persistent_bytes;
> + bool dcd_supported;
>
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> @@ -485,6 +496,10 @@ enum cxl_opcode {
> CXL_MBOX_OP_UNLOCK = 0x4503,
> CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
> + CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
> + CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
> + CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
> + CXL_MBOX_OP_RELEASE_DC = 0x4803,
> CXL_MBOX_OP_MAX = 0x10000
> };
>
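Two of the review points above (zero the on-stack bitmap, and use a "full" test instead of fill plus compare) can be illustrated outside the kernel. This is a sketch under assumptions: a single `unsigned long` stands in for the bitmap, `NBITS` stands in for `CXL_DCD_ENABLED_MAX`, and all function names are hypothetical. The two checks are written the way a single-word bitmap comparison is commonly expressed, not copied from the kernel headers.

```c
#include <stdbool.h>

#define NBITS 4				/* stand-in for CXL_DCD_ENABLED_MAX */
#define LOW_MASK(n) ((1UL << (n)) - 1)	/* the low n bits set */

/* Shape used in v10: build an all-ones mask, then compare against it. */
static bool verify_by_compare(unsigned long seen)
{
	unsigned long all = LOW_MASK(NBITS);

	return ((seen ^ all) & LOW_MASK(NBITS)) == 0;
}

/* Shape suggested in review: test directly that no low bit is clear. */
static bool verify_by_full(unsigned long seen)
{
	return (~seen & LOW_MASK(NBITS)) == 0;
}

/*
 * Collect bit numbers into a mask. The explicit "= 0" plays the role of
 * the "= {}" initializer requested for the on-stack bitmap: without it,
 * stale stack contents could make unseen commands look present.
 */
static unsigned long collect(const unsigned int *bits, int n)
{
	unsigned long seen = 0;

	for (int i = 0; i < n; i++)
		if (bits[i] < NBITS)
			seen |= 1UL << bits[i];

	return seen;
}
```

Both predicates agree for every possible mask, so the shorter form loses nothing; the initializer is the part that changes behavior.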
* Re: [PATCH v10 01/31] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2026-05-27 21:34 ` Dave Jiang
@ 2026-05-30 6:22 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-30 6:22 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 02:34:31PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:42 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Per the CXL 3.1 specification software must check the Command Effects
>
> May as well update to CXL r4.0
>
Updated!
> > Log (CEL) for dynamic capacity command support.
> >
> > Detect support for the DCD commands while reading the CEL, including:
> >
> > Get DC Config
> > Get DC Extent List
> > Add DC Response
> > Release DC
> >
> > Based on an original patch by Navneet Singh.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> missing Anisa sign off
>
Added!
> >
> > ---
> > Changes:
> > [anisa: rebase]
> > ---
> > drivers/cxl/core/mbox.c | 43 +++++++++++++++++++++++++++++++++++++++++
> > drivers/cxl/cxlmem.h | 15 ++++++++++++++
> > 2 files changed, 58 insertions(+)
> >
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index aaa5c6277ebf..7ef5708bf210 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -165,6 +165,42 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> > }
> > }
> >
> > +static bool cxl_is_dcd_command(u16 opcode)
> > +{
> > +#define CXL_MBOX_OP_DCD_CMDS 0x48
> > +
> > + return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> > +}
> > +
> > +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds, u16 opcode,
> > + unsigned long *cmd_mask)
>
> mds not used, consider drop
>
dropped
> > +{
> > + switch (opcode) {
> > + case CXL_MBOX_OP_GET_DC_CONFIG:
> > + set_bit(CXL_DCD_ENABLED_GET_CONFIG, cmd_mask);
> > + break;
> > + case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> > + set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, cmd_mask);
> > + break;
> > + case CXL_MBOX_OP_ADD_DC_RESPONSE:
> > + set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, cmd_mask);
> > + break;
> > + case CXL_MBOX_OP_RELEASE_DC:
> > + set_bit(CXL_DCD_ENABLED_RELEASE, cmd_mask);
> > + break;
> > + default:
> > + break;
> > + }
> > +}
> > +
> > +static bool cxl_verify_dcd_cmds(struct cxl_memdev_state *mds, unsigned long *cmds_seen)
>
> mds not used. consider drop
>
also dropped
> > +{
> > + DECLARE_BITMAP(all_cmds, CXL_DCD_ENABLED_MAX);
> > +
> > + bitmap_fill(all_cmds, CXL_DCD_ENABLED_MAX);
> > + return bitmap_equal(cmds_seen, all_cmds, CXL_DCD_ENABLED_MAX);
>
> Above lines can be replaced with:
> return bitmap_full(cmds_seen, CXL_DCD_ENABLED_MAX);
>
replaced
> > +}
> > +
> > static bool cxl_is_poison_command(u16 opcode)
> > {
> > #define CXL_MBOX_OP_POISON_CMDS 0x43
> > @@ -757,6 +793,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> > struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> > struct cxl_cel_entry *cel_entry;
> > const int cel_entries = size / sizeof(*cel_entry);
> > + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>
> Need to zero out the declared bitmap 'dcd_cmds' on stack before using.
>
> DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX) = {};
>
done :)
> > struct device *dev = mds->cxlds.dev;
> > int i, ro_cmds = 0, wr_cmds = 0;
> >
> > @@ -785,11 +822,17 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> > enabled++;
> > }
> >
> > + if (cxl_is_dcd_command(opcode)) {
> > + cxl_set_dcd_cmd_enabled(mds, opcode, dcd_cmds);
> > + enabled++;
> > + }
> > +
> > dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> > enabled ? "enabled" : "unsupported by driver");
> > }
> >
> > set_features_cap(cxl_mbox, ro_cmds, wr_cmds);
> > + mds->dcd_supported = cxl_verify_dcd_cmds(mds, dcd_cmds);
> > }
> >
> > static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_memdev_state *mds)
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 776c50d1db51..53444af448d7 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -230,6 +230,15 @@ struct cxl_event_state {
> > struct mutex log_lock;
> > };
> >
> > +/* Device enabled DCD commands */
> > +enum dcd_cmd_enabled_bits {
> > + CXL_DCD_ENABLED_GET_CONFIG,
> > + CXL_DCD_ENABLED_GET_EXTENT_LIST,
> > + CXL_DCD_ENABLED_ADD_RESPONSE,
> > + CXL_DCD_ENABLED_RELEASE,
> > + CXL_DCD_ENABLED_MAX
> > +};
> > +
>
> would be nice to have comment point to where in the spec this is
>
Added
> DJ
>
Thanks!
Anisa
> > /* Device enabled poison commands */
> > enum poison_cmd_enabled_bits {
> > CXL_POISON_ENABLED_LIST,
> > @@ -405,6 +414,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> > * @partition_align_bytes: alignment size for partition-able capacity
> > * @active_volatile_bytes: sum of hard + soft volatile
> > * @active_persistent_bytes: sum of hard + soft persistent
> > + * @dcd_supported: all DCD commands are supported
> > * @event: event log driver state
> > * @poison: poison driver state info
> > * @security: security driver state info
> > @@ -424,6 +434,7 @@ struct cxl_memdev_state {
> > u64 partition_align_bytes;
> > u64 active_volatile_bytes;
> > u64 active_persistent_bytes;
> > + bool dcd_supported;
> >
> > struct cxl_event_state event;
> > struct cxl_poison_state poison;
> > @@ -485,6 +496,10 @@ enum cxl_opcode {
> > CXL_MBOX_OP_UNLOCK = 0x4503,
> > CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> > CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
> > + CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
> > + CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
> > + CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
> > + CXL_MBOX_OP_RELEASE_DC = 0x4803,
> > CXL_MBOX_OP_MAX = 0x10000
> > };
> >
>
* [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
2026-05-23 9:42 ` [PATCH v10 01/31] cxl/mbox: Flag " Anisa Su
@ 2026-05-23 9:42 ` Anisa Su
2026-05-27 22:28 ` Dave Jiang
2026-05-23 9:42 ` [PATCH v10 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
` (30 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:42 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
Devices which optionally support Dynamic Capacity (DC) are configured
via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
the Get DC Configuration command in order to properly configure DCDs.
Without the Get DC Configuration command DCD can't be supported.
Implement the DC mailbox commands as specified in CXL 3.2 section
8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
information. Disable DCD if an invalid configuration is found.
Linux has no support for more than one dynamic capacity partition. Read
and validate all the partitions but configure only the first partition
as 'dynamic ram A'. Additional partitions can be added in the future if
such a device ever materializes. Additionally, it is anticipated that no
skips will be present from the end of the pmem partition. Check for and
disallow this configuration as well.
Linux has no use for the trailing fields of the Get Dynamic Capacity
Configuration Output Payload (Total number of supported extents, number
of available extents, total number of supported tags, and number of
available tags). Avoid defining those fields to use the more useful
dynamic C array.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
[jonathan: mbox.c: use max possible size for get_dc_config command to
avoid vmalloc]
[jonathan & fan: cxlmem.h: remove unused struct cxl_mem_dev_info]
---
drivers/cxl/core/hdm.c | 2 +
drivers/cxl/core/mbox.c | 182 ++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxlmem.h | 47 +++++++++++
drivers/cxl/pci.c | 3 +
include/cxl/cxl.h | 3 +-
5 files changed, 236 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3930e130d6b6..28974adaab75 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -453,6 +453,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
return "ram";
case CXL_PARTMODE_PMEM:
return "pmem";
+ case CXL_PARTMODE_DYNAMIC_RAM_A:
+ return "dynamic_ram_a";
default:
return "";
};
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 7ef5708bf210..71b29cd6abfe 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1351,6 +1351,156 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
return -EBUSY;
}
+static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
+ u8 index, struct cxl_dc_partition *dev_part)
+{
+ size_t blk_size = le64_to_cpu(dev_part->block_size);
+ size_t len = le64_to_cpu(dev_part->length);
+
+ part_array[index].start = le64_to_cpu(dev_part->base);
+ part_array[index].size = le64_to_cpu(dev_part->decode_length);
+ part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
+
+ /* Check partitions are in increasing DPA order */
+ if (index > 0) {
+ struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
+
+ if ((prev_part->start + prev_part->size) >
+ part_array[index].start) {
+ dev_err(dev,
+ "DPA ordering violation for DC partition %d and %d\n",
+ index - 1, index);
+ return -EINVAL;
+ }
+ }
+
+ if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
+ !IS_ALIGNED(part_array[index].start, blk_size)) {
+ dev_err(dev, "DC partition %d invalid start %zu blk size %zu\n",
+ index, part_array[index].start, blk_size);
+ return -EINVAL;
+ }
+
+ if (part_array[index].size == 0 || len == 0 ||
+ part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
+ dev_err(dev, "DC partition %d invalid length; size %zu len %zu blk size %zu\n",
+ index, part_array[index].size, len, blk_size);
+ return -EINVAL;
+ }
+
+ if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
+ !is_power_of_2(blk_size)) {
+ dev_err(dev, "DC partition %d invalid block size; %zu\n",
+ index, blk_size);
+ return -EINVAL;
+ }
+
+ dev_dbg(dev, "DC partition %d start %zu start %zu size %zu\n",
+ index, part_array[index].start, part_array[index].size,
+ blk_size);
+
+ return 0;
+}
+
+/* Returns the number of partitions in dc_resp or -ERRNO */
+static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
+ struct cxl_mbox_get_dc_config_out *dc_resp,
+ size_t dc_resp_size)
+{
+ struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
+ .partition_count = CXL_MAX_DC_PARTITIONS,
+ .start_partition_index = start_partition,
+ };
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+ .payload_in = &get_dc,
+ .size_in = sizeof(get_dc),
+ .size_out = dc_resp_size,
+ .payload_out = dc_resp,
+ .min_out = 8,
+ };
+ int rc;
+
+ rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
+ dc_resp->partitions_returned, dc_resp->avail_partition_count);
+ return dc_resp->partitions_returned;
+}
+
+/**
+ * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
+ * device.
+ * @mbox: Mailbox to query
+ * @dc_info: The dynamic partition information to return
+ *
+ * Read Dynamic Capacity information from the device and return the partition
+ * information.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ * on error only dynamic_bytes is left unchanged.
+ */
+int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
+ struct cxl_dc_partition_info *dc_info)
+{
+ struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
+ struct device *dev = mbox->host;
+ size_t dc_resp_size =
+ sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(partitions);
+ u8 start_partition;
+ u8 num_partitions;
+
+ struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
+ kmalloc(dc_resp_size, GFP_KERNEL);
+ if (!dc_resp)
+ return -ENOMEM;
+
+ /**
+ * Read and check all partition information for validity and potential
+ * debugging; see debug output in cxl_dc_check()
+ */
+ start_partition = 0;
+ num_partitions = 0;
+ do {
+ int rc, i, j;
+
+ rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
+ if (rc < 0) {
+ dev_err(dev, "Failed to get DC config: %d\n", rc);
+ return rc;
+ }
+
+ num_partitions += rc;
+
+ if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
+ dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
+ num_partitions);
+ return -EINVAL;
+ }
+
+ for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
+ rc = cxl_dc_check(dev, partitions, i,
+ &dc_resp->partition[j]);
+ if (rc)
+ return rc;
+ }
+
+ start_partition = num_partitions;
+
+ } while (num_partitions < dc_resp->avail_partition_count);
+
+ /* Return 1st partition */
+ dc_info->start = partitions[0].start;
+ dc_info->size = partitions[0].size;
+ dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
+ dc_info->start, dc_info->size);
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
+
static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
{
int i = info->nr_partitions;
@@ -1421,6 +1571,38 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
}
EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
+void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
+{
+ struct cxl_dc_partition_info dc_info = { 0 };
+ struct device *dev = mds->cxlds.dev;
+ size_t skip;
+ int rc;
+
+ rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
+ if (rc) {
+ dev_warn(dev,
+ "Failed to read Dynamic Capacity config: %d\n", rc);
+ cxl_disable_dcd(mds);
+ return;
+ }
+
+ /* Skips between pmem and the dynamic partition are not supported */
+ skip = dc_info.start - info->size;
+ if (skip) {
+ dev_warn(dev,
+ "Dynamic Capacity skip from pmem not supported: %zu\n",
+ skip);
+ cxl_disable_dcd(mds);
+ return;
+ }
+
+ info->size += dc_info.size;
+ dev_dbg(dev, "Adding dynamic ram partition A; %zu size %zu\n",
+ dc_info.start, dc_info.size);
+ add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_A);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
+
int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
{
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 53444af448d7..87386488ad10 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -380,6 +380,8 @@ struct cxl_security_state {
struct kernfs_node *sanitize_node;
};
+#define CXL_MAX_DC_PARTITIONS 8
+
static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
{
/*
@@ -664,6 +666,31 @@ struct cxl_mbox_set_shutdown_state_in {
u8 state;
} __packed;
+/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
+struct cxl_mbox_get_dc_config_in {
+ u8 partition_count;
+ u8 start_partition_index;
+} __packed;
+
+/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
+struct cxl_mbox_get_dc_config_out {
+ u8 avail_partition_count;
+ u8 partitions_returned;
+ u8 rsvd[6];
+ /* See CXL 3.2 Table 8-180 */
+ struct cxl_dc_partition {
+ __le64 base;
+ __le64 decode_length;
+ __le64 length;
+ __le64 block_size;
+ __le32 dsmad_handle;
+ u8 flags;
+ u8 rsvd[3];
+ } __packed partition[] __counted_by(partitions_returned);
+ /* Trailing fields unused */
+} __packed;
+#define CXL_DCD_BLOCK_LINE_SIZE 0x40
+
/* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
struct cxl_mbox_set_timestamp_in {
__le64 timestamp;
@@ -787,9 +814,18 @@ enum {
int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd);
int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+
+struct cxl_dc_partition_info {
+ size_t start;
+ size_t size;
+};
+
+int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
+ struct cxl_dc_partition_info *dc_info);
int cxl_await_media_ready(struct cxl_dev_state *cxlds);
int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
+void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
u16 dvsec);
void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
@@ -803,6 +839,17 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
const uuid_t *uuid, union cxl_event *evt);
int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
+
+static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+{
+ return mds->dcd_supported;
+}
+
+static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
+{
+ mds->dcd_supported = false;
+}
+
int cxl_set_timestamp(struct cxl_memdev_state *mds);
int cxl_poison_state_init(struct cxl_memdev_state *mds);
int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bace662dc988..60f9fa05d9ef 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -870,6 +870,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
+ if (cxl_dcd_supported(mds))
+ cxl_configure_dcd(mds, &range_info);
+
rc = cxl_dpa_setup(cxlds, &range_info);
if (rc)
return rc;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index fa7269154620..bb1df0cef863 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -133,6 +133,7 @@ struct cxl_dpa_perf {
enum cxl_partition_mode {
CXL_PARTMODE_RAM,
CXL_PARTMODE_PMEM,
+ CXL_PARTMODE_DYNAMIC_RAM_A,
};
/**
@@ -147,7 +148,7 @@ struct cxl_dpa_partition {
enum cxl_partition_mode mode;
};
-#define CXL_NR_PARTITIONS_MAX 2
+#define CXL_NR_PARTITIONS_MAX 3
/**
* struct cxl_dev_state - The driver device state
--
2.43.0
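The per-partition checks in cxl_dc_check() can be restated as a small userspace predicate. This is a sketch, not the driver function: `struct dc_part` and `dc_part_valid()` are hypothetical, sizes are assumed to be already converted to bytes, and the block-size test is deliberately moved first because this sketch uses `%`, which would divide by zero on a zero block size (the patch uses IS_ALIGNED() and checks the block size last).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_256M		(256ULL << 20)
#define BLOCK_LINE_SIZE	0x40ULL	/* CXL_DCD_BLOCK_LINE_SIZE in the patch */

/* Hypothetical flattened view of one DC partition, all values in bytes. */
struct dc_part {
	uint64_t start;		/* base DPA */
	uint64_t size;		/* decode length */
	uint64_t len;		/* usable length */
	uint64_t blk_size;	/* block size */
};

static bool is_pow2(uint64_t v)
{
	return v && !(v & (v - 1));
}

/* prev is the preceding partition, or NULL for the first one. */
static bool dc_part_valid(const struct dc_part *prev, const struct dc_part *p)
{
	/* Block size: non-zero power of two and a multiple of 64 bytes. */
	if (!is_pow2(p->blk_size) || p->blk_size % BLOCK_LINE_SIZE)
		return false;

	/* Partitions must appear in increasing DPA order, no overlap. */
	if (prev && prev->start + prev->size > p->start)
		return false;

	/* Start must be aligned to 256MB and to the block size. */
	if (p->start % SZ_256M || p->start % p->blk_size)
		return false;

	/* Length: non-zero, within the decode length, block aligned. */
	if (p->size == 0 || p->len == 0 || p->size < p->len ||
	    p->len % p->blk_size)
		return false;

	return true;
}
```

A caller would walk the returned partitions in index order, passing the previous entry as `prev`, and give up on DCD at the first failure, which is the policy the commit message describes.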
* Re: [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device
2026-05-23 9:42 ` [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
@ 2026-05-27 22:28 ` Dave Jiang
2026-05-30 6:40 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 22:28 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:42 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
4.0
> the Get DC Configuration command in order to properly configure DCDs.
> Without the Get DC Configuration command DCD can't be supported.
>
> Implement the DC mailbox commands as specified in CXL 3.2 section
4.0
> 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information. Disable DCD if an invalid configuration is found.
>
> Linux has no support for more than one dynamic capacity partition. Read
> and validate all the partitions but configure only the first partition
> as 'dynamic ram A'. Additional partitions can be added in the future if
> such a device ever materializes. Additionally, it is anticipated that no
> skips will be present from the end of the pmem partition. Check for and
> disallow this configuration as well.
>
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags). Avoid defining those fields to use the more useful
> dynamic C array.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Missing Anisa sign off
>
> ---
> Changes:
> [anisa: rebase]
> [jonathan: mbox.c: use max possible size for get_dc_config command to
> avoid vmalloc]
> [jonathan & fan: cxlmem.h: remove unused struct cxl_mem_dev_info]
> ---
> drivers/cxl/core/hdm.c | 2 +
> drivers/cxl/core/mbox.c | 182 ++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 47 +++++++++++
> drivers/cxl/pci.c | 3 +
> include/cxl/cxl.h | 3 +-
> 5 files changed, 236 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3930e130d6b6..28974adaab75 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -453,6 +453,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
> return "ram";
> case CXL_PARTMODE_PMEM:
> return "pmem";
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> + return "dynamic_ram_a";
> default:
> return "";
> };
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 7ef5708bf210..71b29cd6abfe 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1351,6 +1351,156 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return -EBUSY;
> }
>
> +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
> + u8 index, struct cxl_dc_partition *dev_part)
> +{
> + size_t blk_size = le64_to_cpu(dev_part->block_size);
> + size_t len = le64_to_cpu(dev_part->length);
> +
> + part_array[index].start = le64_to_cpu(dev_part->base);
> + part_array[index].size = le64_to_cpu(dev_part->decode_length);
> + part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> +
> + /* Check partitions are in increasing DPA order */
> + if (index > 0) {
> + struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
> +
> + if ((prev_part->start + prev_part->size) >
> + part_array[index].start) {
> + dev_err(dev,
> + "DPA ordering violation for DC partition %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
> +
> + if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
> + !IS_ALIGNED(part_array[index].start, blk_size)) {
> + dev_err(dev, "DC partition %d invalid start %zu blk size %zu\n",
> + index, part_array[index].start, blk_size);
> + return -EINVAL;
> + }
> +
> + if (part_array[index].size == 0 || len == 0 ||
> + part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
> + dev_err(dev, "DC partition %d invalid length; size %zu len %zu blk size %zu\n",
> + index, part_array[index].size, len, blk_size);
> + return -EINVAL;
> + }
> +
> + if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
> + !is_power_of_2(blk_size)) {
> + dev_err(dev, "DC partition %d invalid block size; %zu\n",
size: instead of size;
> + index, blk_size);
> + return -EINVAL;
> + }
> +
> + dev_dbg(dev, "DC partition %d start %zu start %zu size %zu\n",
should it be "DC partition %d start %zu size %zu blk_size: %zu\n"?
> + index, part_array[index].start, part_array[index].size,
> + blk_size);
> +
> + return 0;
> +}
> +
> +/* Returns the number of partitions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .partition_count = CXL_MAX_DC_PARTITIONS,
> + .start_partition_index = start_partition,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 8,
> + };
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
> + dc_resp->partitions_returned, dc_resp->avail_partition_count);
> + return dc_resp->partitions_returned;
> +}
> +
> +/**
> + * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
> + * device.
> + * @mbox: Mailbox to query
> + * @dc_info: The dynamic partition information to return
> + *
> + * Read Dynamic Capacity information from the device and return the partition
> + * information.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + * on error only dynamic_bytes is left unchanged.
> + */
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> + struct cxl_dc_partition_info *dc_info)
> +{
> + struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
> + struct device *dev = mbox->host;
> + size_t dc_resp_size =
> + sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(partitions);
I think it needs to be something like below because of the 'partition' flex array:
size_t dc_resp_size = struct_size(dc_resp, partition, CXL_MAX_DC_PARTITIONS);
partitions is of type 'struct cxl_dc_partition_info' and dc_resp->partition is of type 'struct cxl_dc_partition', so the size calculation is wrong. It should at least be:
size_t dc_resp_size = sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(struct cxl_dc_partition) * CXL_MAX_DC_PARTITIONS;
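To make the size mismatch concrete, here is a minimal userspace sketch, not driver code. The struct names mirror the patch with the cxl_ prefix dropped, and FLEX_SIZE() is a hypothetical stand-in for the kernel's struct_size() without its overflow checking. It assumes a GCC or Clang style packed attribute.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DC_PARTITIONS 8 /* mirrors CXL_MAX_DC_PARTITIONS in the patch */

/* Wire-format entry: 40 bytes when packed (mirrors struct cxl_dc_partition) */
struct dc_partition {
	uint64_t base;
	uint64_t decode_length;
	uint64_t length;
	uint64_t block_size;
	uint32_t dsmad_handle;
	uint8_t flags;
	uint8_t rsvd[3];
} __attribute__((packed));

/* 8-byte header followed by a flexible array of wire-format entries */
struct get_dc_config_out {
	uint8_t avail_partition_count;
	uint8_t partitions_returned;
	uint8_t rsvd[6];
	struct dc_partition partition[];
} __attribute__((packed));

/* Driver-side bookkeeping: a different, smaller type than the wire entry */
struct dc_partition_info {
	size_t start;
	size_t size;
};

/* Hypothetical userspace stand-in for struct_size(); no overflow checking */
#define FLEX_SIZE(type, member, n) \
	(offsetof(type, member) + (n) * sizeof(((type *)0)->member[0]))

/* Size as computed in the posted patch: header plus the bookkeeping array */
static size_t resp_size_as_posted(void)
{
	return sizeof(struct get_dc_config_out) +
	       sizeof(struct dc_partition_info[MAX_DC_PARTITIONS]);
}

/* Size that actually covers MAX_DC_PARTITIONS wire-format entries */
static size_t resp_size_fixed(void)
{
	return FLEX_SIZE(struct get_dc_config_out, partition, MAX_DC_PARTITIONS);
}
```

With these definitions the posted calculation comes out smaller than the corrected one, so a device returning all 8 partitions would not fit in the buffer.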
> + u8 start_partition;
> + u8 num_partitions;
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> + kmalloc(dc_resp_size, GFP_KERNEL);
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + /**
/*
> + * Read and check all partition information for validity and potential
> + * debugging; see debug output in cxl_dc_check()
> + */
> + start_partition = 0;
> + num_partitions = 0;
> + do {
> + int rc, i, j;
> +
> + rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_err(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + num_partitions += rc;
Would cxl_get_dc_config() repeatedly returning 0 be a problem? It is not likely to happen unless the device is malicious.
> +
> + if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
> + dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
> + num_partitions);
> + return -EINVAL;
> + }
> +
> + for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
> + rc = cxl_dc_check(dev, partitions, i,
> + &dc_resp->partition[j]);
> + if (rc)
> + return rc;
> + }
> +
> + start_partition = num_partitions;
> +
> + } while (num_partitions < dc_resp->avail_partition_count);
> +
> + /* Return 1st partition */
> + dc_info->start = partitions[0].start;
> + dc_info->size = partitions[0].size;
> + dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> + dc_info->start, dc_info->size);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> +
> static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
> {
> int i = info->nr_partitions;
> @@ -1421,6 +1571,38 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
>
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> +{
> + struct cxl_dc_partition_info dc_info = { 0 };
> + struct device *dev = mds->cxlds.dev;
> + size_t skip;
> + int rc;
> +
> + rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
> + if (rc) {
> + dev_warn(dev,
> + "Failed to read Dynamic Capacity config: %d\n", rc);
> + cxl_disable_dcd(mds);
> + return;
> + }
> +
> + /* Skips between pmem and the dynamic partition are not supported */
> + skip = dc_info.start - info->size;
> + if (skip) {
Would this be sufficient?
if (dc_info.start != info->size)
DJ
> + dev_warn(dev,
> + "Dynamic Capacity skip from pmem not supported: %zu\n",
> + skip);
> + cxl_disable_dcd(mds);
> + return;
> + }
> +
> + info->size += dc_info.size;
> + dev_dbg(dev, "Adding dynamic ram partition A; %zu size %zu\n",
> + dc_info.start, dc_info.size);
> + add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_A);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
> +
> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
> {
> struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 53444af448d7..87386488ad10 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -380,6 +380,8 @@ struct cxl_security_state {
> struct kernfs_node *sanitize_node;
> };
>
> +#define CXL_MAX_DC_PARTITIONS 8
> +
> static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
> {
> /*
> @@ -664,6 +666,31 @@ struct cxl_mbox_set_shutdown_state_in {
> u8 state;
> } __packed;
>
> +/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
> +struct cxl_mbox_get_dc_config_in {
> + u8 partition_count;
> + u8 start_partition_index;
> +} __packed;
> +
> +/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_partition_count;
> + u8 partitions_returned;
> + u8 rsvd[6];
> + /* See CXL 3.2 Table 8-180 */
> + struct cxl_dc_partition {
> + __le64 base;
> + __le64 decode_length;
> + __le64 length;
> + __le64 block_size;
> + __le32 dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed partition[] __counted_by(partitions_returned);
> + /* Trailing fields unused */
> +} __packed;
> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -787,9 +814,18 @@ enum {
> int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +
> +struct cxl_dc_partition_info {
> + size_t start;
> + size_t size;
> +};
> +
> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> + struct cxl_dc_partition_info *dc_info);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
> u16 dvsec);
> void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
> @@ -803,6 +839,17 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
> const uuid_t *uuid, union cxl_event *evt);
> int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
> +
> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return mds->dcd_supported;
> +}
> +
> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + mds->dcd_supported = false;
> +}
> +
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index bace662dc988..60f9fa05d9ef 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -870,6 +870,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> + if (cxl_dcd_supported(mds))
> + cxl_configure_dcd(mds, &range_info);
> +
> rc = cxl_dpa_setup(cxlds, &range_info);
> if (rc)
> return rc;
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index fa7269154620..bb1df0cef863 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -133,6 +133,7 @@ struct cxl_dpa_perf {
> enum cxl_partition_mode {
> CXL_PARTMODE_RAM,
> CXL_PARTMODE_PMEM,
> + CXL_PARTMODE_DYNAMIC_RAM_A,
> };
>
> /**
> @@ -147,7 +148,7 @@ struct cxl_dpa_partition {
> enum cxl_partition_mode mode;
> };
>
> -#define CXL_NR_PARTITIONS_MAX 2
> +#define CXL_NR_PARTITIONS_MAX 3
>
> /**
> * struct cxl_dev_state - The driver device state
* Re: [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device
2026-05-27 22:28 ` Dave Jiang
@ 2026-05-30 6:40 ` Anisa Su
2026-06-01 15:23 ` Dave Jiang
0 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-30 6:40 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 03:28:56PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:42 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Devices which optionally support Dynamic Capacity (DC) are configured
> > via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
>
> 4.0
>
done
> > the Get DC Configuration command in order to properly configure DCDs.
> > Without the Get DC Configuration command DCD can't be supported.
> >
> > Implement the DC mailbox commands as specified in CXL 3.2 section
>
> 4.0
>
done :)
> > 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
> > information. Disable DCD if an invalid configuration is found.
> >
> > Linux has no support for more than one dynamic capacity partition. Read
> > and validate all the partitions but configure only the first partition
> > as 'dynamic ram A'. Additional partitions can be added in the future if
> > such a device ever materializes. Additionally, it is anticipated that no
> > skips will be present from the end of the pmem partition. Check for and
> > disallow this configuration as well.
> >
> > Linux has no use for the trailing fields of the Get Dynamic Capacity
> > Configuration Output Payload (Total number of supported extents, number
> > of available extents, total number of supported tags, and number of
> > available tags). Avoid defining those fields to use the more useful
> > dynamic C array.
> >
> > Based on an original patch by Navneet Singh.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> Missing Anisa sign off
>
Added
> >
> > ---
> > Changes:
> > [anisa: rebase]
> > [jonathan: mbox.c: use max possible size for get_dc_config command to
> > avoid vmalloc]
> > [jonathan & fan: cxlmem.h: remove unused struct cxl_mem_dev_info]
> > ---
> > drivers/cxl/core/hdm.c | 2 +
> > drivers/cxl/core/mbox.c | 182 ++++++++++++++++++++++++++++++++++++++++
> > drivers/cxl/cxlmem.h | 47 +++++++++++
> > drivers/cxl/pci.c | 3 +
> > include/cxl/cxl.h | 3 +-
> > 5 files changed, 236 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 3930e130d6b6..28974adaab75 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -453,6 +453,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
> > return "ram";
> > case CXL_PARTMODE_PMEM:
> > return "pmem";
> > + case CXL_PARTMODE_DYNAMIC_RAM_A:
> > + return "dynamic_ram_a";
> > default:
> > return "";
> > };
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 7ef5708bf210..71b29cd6abfe 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1351,6 +1351,156 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> > return -EBUSY;
> > }
> >
> > +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
> > + u8 index, struct cxl_dc_partition *dev_part)
> > +{
> > + size_t blk_size = le64_to_cpu(dev_part->block_size);
> > + size_t len = le64_to_cpu(dev_part->length);
> > +
> > + part_array[index].start = le64_to_cpu(dev_part->base);
> > + part_array[index].size = le64_to_cpu(dev_part->decode_length);
> > + part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> > +
> > + /* Check partitions are in increasing DPA order */
> > + if (index > 0) {
> > + struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
> > +
> > + if ((prev_part->start + prev_part->size) >
> > + part_array[index].start) {
> > + dev_err(dev,
> > + "DPA ordering violation for DC partition %d and %d\n",
> > + index - 1, index);
> > + return -EINVAL;
> > + }
> > + }
> > +
> > + if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
> > + !IS_ALIGNED(part_array[index].start, blk_size)) {
> > + dev_err(dev, "DC partition %d invalid start %zu blk size %zu\n",
> > + index, part_array[index].start, blk_size);
> > + return -EINVAL;
> > + }
> > +
> > + if (part_array[index].size == 0 || len == 0 ||
> > + part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
> > + dev_err(dev, "DC partition %d invalid length; size %zu len %zu blk size %zu\n",
> > + index, part_array[index].size, len, blk_size);
> > + return -EINVAL;
> > + }
> > +
> > + if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
> > + !is_power_of_2(blk_size)) {
> > + dev_err(dev, "DC partition %d invalid block size; %zu\n",
>
> size: instead of size;
>
fixed!
> > + index, blk_size);
> > + return -EINVAL;
> > + }
> > +
> > + dev_dbg(dev, "DC partition %d start %zu start %zu size %zu\n",
>
> should it be "DC partition %d start %zu size %zu blk_size: %zu\n"?
>
yep, fixed! Also I changed the type of
struct cxl_dc_partition_info->start/size from size_t to u64 so
the print specifier uses %llu now. Unless it's better to stick with
size_t?
> > + index, part_array[index].start, part_array[index].size,
> > + blk_size);
> > +
> > + return 0;
> > +}
> > +
> > +/* Returns the number of partitions in dc_resp or -ERRNO */
> > +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
> > + struct cxl_mbox_get_dc_config_out *dc_resp,
> > + size_t dc_resp_size)
> > +{
> > + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> > + .partition_count = CXL_MAX_DC_PARTITIONS,
> > + .start_partition_index = start_partition,
> > + };
> > + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> > + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> > + .payload_in = &get_dc,
> > + .size_in = sizeof(get_dc),
> > + .size_out = dc_resp_size,
> > + .payload_out = dc_resp,
> > + .min_out = 8,
> > + };
> > + int rc;
> > +
> > + rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
> > + if (rc < 0)
> > + return rc;
> > +
> > + dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
> > + dc_resp->partitions_returned, dc_resp->avail_partition_count);
> > + return dc_resp->partitions_returned;
> > +}
> > +
> > +/**
> > + * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
> > + * device.
> > + * @mbox: Mailbox to query
> > + * @dc_info: The dynamic partition information to return
> > + *
> > + * Read Dynamic Capacity information from the device and return the partition
> > + * information.
> > + *
> > + * Return: 0 if identify was executed successfully, -ERRNO on error.
> > + * on error only dynamic_bytes is left unchanged.
> > + */
> > +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> > + struct cxl_dc_partition_info *dc_info)
> > +{
> > + struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
> > + struct device *dev = mbox->host;
> > + size_t dc_resp_size =
> > + sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(partitions);
>
> I think it needs to be something like below because of the 'partition' flex array:
> size_t dc_resp_size = struct_size(dc_resp, partition, CXL_MAX_DC_PARTITIONS);
>
> partitions is of type 'struct cxl_dc_partition_info' and dc_resp->partition is of type 'struct cxl_dc_partition', so the size calculation is wrong. It should at least be:
> size_t dc_resp_size = sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(struct cxl_dc_partition) * CXL_MAX_DC_PARTITIONS;
>
Fixed!
>
> > + u8 start_partition;
> > + u8 num_partitions;
> > +
> > + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> > + kmalloc(dc_resp_size, GFP_KERNEL);
> > + if (!dc_resp)
> > + return -ENOMEM;
> > +
> > + /**
>
> /*
>
> > + * Read and check all partition information for validity and potential
> > + * debugging; see debug output in cxl_dc_check()
> > + */
> > + start_partition = 0;
> > + num_partitions = 0;
> > + do {
> > + int rc, i, j;
> > +
> > + rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> > + if (rc < 0) {
> > + dev_err(dev, "Failed to get DC config: %d\n", rc);
> > + return rc;
> > + }
> > +
if (rc == 0) {
dev_err(dev,
"Device reported %u partitions available but returned none at index %u\n",
dc_resp->avail_partition_count, start_partition);
return -EIO;
}
> > + num_partitions += rc;
>
> Would cxl_get_dc_config() repeatedly returning 0 be a problem? It is not likely to happen unless the device is malicious.
>
Not sure but I added a check anyway. ^ See above. It prohibits
cxl_get_dc_config() returning 0 at all though. But could be changed to
err only if 0 partitions are returned X amount of times...?
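For reference, the termination behaviour of the strict variant can be sketched in userspace as below. This is only an illustration under assumptions: struct fake_dev and fake_get_dc_config() are hypothetical stand-ins for the device and for cxl_get_dc_config(), and the loop keeps only the counting logic, not the per-partition validation.

```c
#include <errno.h>

#define MAX_PARTS 8 /* mirrors CXL_MAX_DC_PARTITIONS in the patch */

/* Hypothetical stand-in for a device; not part of the CXL driver */
struct fake_dev {
	unsigned int avail; /* partitions the device claims to have */
	unsigned int chunk; /* partitions handed back per query */
};

/* Hypothetical stand-in for cxl_get_dc_config(): returns partitions read */
static int fake_get_dc_config(const struct fake_dev *d, unsigned int start)
{
	unsigned int left = d->avail > start ? d->avail - start : 0;

	return (int)(left < d->chunk ? left : d->chunk);
}

/* Returns the total number of partitions read, or a negative errno */
static int read_all_partitions(const struct fake_dev *d)
{
	unsigned int num = 0;

	do {
		int rc = fake_get_dc_config(d, num);

		if (rc < 0)
			return rc;
		/* No forward progress: bail out instead of spinning forever */
		if (rc == 0)
			return -EIO;
		num += (unsigned int)rc;
		if (num > MAX_PARTS)
			return -EINVAL;
	} while (num < d->avail);

	return (int)num;
}
```

Every pass through the loop either fails or advances num by at least one, and num is capped at MAX_PARTS, so the loop runs at most MAX_PARTS times regardless of what the device reports.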
> > +
> > + if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
> > + dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
> > + num_partitions);
> > + return -EINVAL;
> > + }
> > +
> > + for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
> > + rc = cxl_dc_check(dev, partitions, i,
> > + &dc_resp->partition[j]);
> > + if (rc)
> > + return rc;
> > + }
> > +
> > + start_partition = num_partitions;
> > +
> > + } while (num_partitions < dc_resp->avail_partition_count);
> > +
> > + /* Return 1st partition */
> > + dc_info->start = partitions[0].start;
> > + dc_info->size = partitions[0].size;
> > + dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> > + dc_info->start, dc_info->size);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> > +
> > static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
> > {
> > int i = info->nr_partitions;
> > @@ -1421,6 +1571,38 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
> >
> > +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> > +{
> > + struct cxl_dc_partition_info dc_info = { 0 };
> > + struct device *dev = mds->cxlds.dev;
> > + size_t skip;
> > + int rc;
> > +
> > + rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
> > + if (rc) {
> > + dev_warn(dev,
> > + "Failed to read Dynamic Capacity config: %d\n", rc);
> > + cxl_disable_dcd(mds);
> > + return;
> > + }
> > +
> > + /* Skips between pmem and the dynamic partition are not supported */
> > + skip = dc_info.start - info->size;
> > + if (skip) {
>
> Would this be sufficient?
>
> if (dc_info.start != info->size)
>
Fixed!
> DJ
Thanks,
Anisa
> > + dev_warn(dev,
> > + "Dynamic Capacity skip from pmem not supported: %zu\n",
> > + skip);
> > + cxl_disable_dcd(mds);
> > + return;
> > + }
> > +
> > + info->size += dc_info.size;
> > + dev_dbg(dev, "Adding dynamic ram partition A; %zu size %zu\n",
> > + dc_info.start, dc_info.size);
> > + add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_A);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
> > +
> > int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
> > {
> > struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 53444af448d7..87386488ad10 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -380,6 +380,8 @@ struct cxl_security_state {
> > struct kernfs_node *sanitize_node;
> > };
> >
> > +#define CXL_MAX_DC_PARTITIONS 8
> > +
> > static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
> > {
> > /*
> > @@ -664,6 +666,31 @@ struct cxl_mbox_set_shutdown_state_in {
> > u8 state;
> > } __packed;
> >
> > +/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
> > +struct cxl_mbox_get_dc_config_in {
> > + u8 partition_count;
> > + u8 start_partition_index;
> > +} __packed;
> > +
> > +/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_get_dc_config_out {
> > + u8 avail_partition_count;
> > + u8 partitions_returned;
> > + u8 rsvd[6];
> > + /* See CXL 3.2 Table 8-180 */
> > + struct cxl_dc_partition {
> > + __le64 base;
> > + __le64 decode_length;
> > + __le64 length;
> > + __le64 block_size;
> > + __le32 dsmad_handle;
> > + u8 flags;
> > + u8 rsvd[3];
> > + } __packed partition[] __counted_by(partitions_returned);
> > + /* Trailing fields unused */
> > +} __packed;
> > +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> > +
> > /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> > struct cxl_mbox_set_timestamp_in {
> > __le64 timestamp;
> > @@ -787,9 +814,18 @@ enum {
> > int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> > struct cxl_mbox_cmd *cmd);
> > int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > +
> > +struct cxl_dc_partition_info {
> > + size_t start;
> > + size_t size;
> > +};
> > +
> > +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> > + struct cxl_dc_partition_info *dc_info);
> > int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> > int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> > int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> > +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
> > struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
> > u16 dvsec);
> > void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
> > @@ -803,6 +839,17 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
> > const uuid_t *uuid, union cxl_event *evt);
> > int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
> > int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
> > +
> > +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> > +{
> > + return mds->dcd_supported;
> > +}
> > +
> > +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> > +{
> > + mds->dcd_supported = false;
> > +}
> > +
> > int cxl_set_timestamp(struct cxl_memdev_state *mds);
> > int cxl_poison_state_init(struct cxl_memdev_state *mds);
> > int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index bace662dc988..60f9fa05d9ef 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -870,6 +870,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > if (rc)
> > return rc;
> >
> > + if (cxl_dcd_supported(mds))
> > + cxl_configure_dcd(mds, &range_info);
> > +
> > rc = cxl_dpa_setup(cxlds, &range_info);
> > if (rc)
> > return rc;
> > diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> > index fa7269154620..bb1df0cef863 100644
> > --- a/include/cxl/cxl.h
> > +++ b/include/cxl/cxl.h
> > @@ -133,6 +133,7 @@ struct cxl_dpa_perf {
> > enum cxl_partition_mode {
> > CXL_PARTMODE_RAM,
> > CXL_PARTMODE_PMEM,
> > + CXL_PARTMODE_DYNAMIC_RAM_A,
> > };
> >
> > /**
> > @@ -147,7 +148,7 @@ struct cxl_dpa_partition {
> > enum cxl_partition_mode mode;
> > };
> >
> > -#define CXL_NR_PARTITIONS_MAX 2
> > +#define CXL_NR_PARTITIONS_MAX 3
> >
> > /**
> > * struct cxl_dev_state - The driver device state
>
* Re: [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device
2026-05-30 6:40 ` Anisa Su
@ 2026-06-01 15:23 ` Dave Jiang
2026-06-02 9:46 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-06-01 15:23 UTC (permalink / raw)
To: Anisa Su
Cc: linux-cxl, linux-kernel, nvdimm, Dan Williams, Jonathan Cameron,
Davidlohr Bueso, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
On 5/29/26 11:40 PM, Anisa Su wrote:
> On Wed, May 27, 2026 at 03:28:56PM -0700, Dave Jiang wrote:
>>
>>
>> On 5/23/26 2:42 AM, Anisa Su wrote:
>>> From: Ira Weiny <ira.weiny@intel.com>
>>>
>>> Devices which optionally support Dynamic Capacity (DC) are configured
>>> via mailbox commands. CXL 3.2 section 9.13.3 requires the host to issue
>>
>> 4.0
>>
> done
>>> the Get DC Configuration command in order to properly configure DCDs.
>>> Without the Get DC Configuration command DCD can't be supported.
>>>
>>> Implement the DC mailbox commands as specified in CXL 3.2 section
>>
>> 4.0
>>
> done :)
>>> 8.2.10.9.9 (opcodes 48XXh) to read and store the DCD configuration
>>> information. Disable DCD if an invalid configuration is found.
>>>
>>> Linux has no support for more than one dynamic capacity partition. Read
>>> and validate all the partitions but configure only the first partition
>>> as 'dynamic ram A'. Additional partitions can be added in the future if
>>> such a device ever materializes. Additionally, it is anticipated that no
>>> skips will be present from the end of the pmem partition. Check for and
>>> disallow this configuration as well.
>>>
>>> Linux has no use for the trailing fields of the Get Dynamic Capacity
>>> Configuration Output Payload (Total number of supported extents, number
>>> of available extents, total number of supported tags, and number of
>>> available tags). Avoid defining those fields to use the more useful
>>> dynamic C array.
>>>
>>> Based on an original patch by Navneet Singh.
>>>
>>> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>>
>> Missing Anisa sign off
>>
> Added
>>>
>>> ---
>>> Changes:
>>> [anisa: rebase]
>>> [jonathan: mbox.c: use max possible size for get_dc_config command to
>>> avoid vmalloc]
>>> [jonathan & fan: cxlmem.h: remove unused struct cxl_mem_dev_info]
>>> ---
>>> drivers/cxl/core/hdm.c | 2 +
>>> drivers/cxl/core/mbox.c | 182 ++++++++++++++++++++++++++++++++++++++++
>>> drivers/cxl/cxlmem.h | 47 +++++++++++
>>> drivers/cxl/pci.c | 3 +
>>> include/cxl/cxl.h | 3 +-
>>> 5 files changed, 236 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
>>> index 3930e130d6b6..28974adaab75 100644
>>> --- a/drivers/cxl/core/hdm.c
>>> +++ b/drivers/cxl/core/hdm.c
>>> @@ -453,6 +453,8 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
>>> return "ram";
>>> case CXL_PARTMODE_PMEM:
>>> return "pmem";
>>> + case CXL_PARTMODE_DYNAMIC_RAM_A:
>>> + return "dynamic_ram_a";
>>> default:
>>> return "";
>>> };
>>> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
>>> index 7ef5708bf210..71b29cd6abfe 100644
>>> --- a/drivers/cxl/core/mbox.c
>>> +++ b/drivers/cxl/core/mbox.c
>>> @@ -1351,6 +1351,156 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>>> return -EBUSY;
>>> }
>>>
>>> +static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_array,
>>> + u8 index, struct cxl_dc_partition *dev_part)
>>> +{
>>> + size_t blk_size = le64_to_cpu(dev_part->block_size);
>>> + size_t len = le64_to_cpu(dev_part->length);
>>> +
>>> + part_array[index].start = le64_to_cpu(dev_part->base);
>>> + part_array[index].size = le64_to_cpu(dev_part->decode_length);
>>> + part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
>>> +
>>> + /* Check partitions are in increasing DPA order */
>>> + if (index > 0) {
>>> + struct cxl_dc_partition_info *prev_part = &part_array[index - 1];
>>> +
>>> + if ((prev_part->start + prev_part->size) >
>>> + part_array[index].start) {
>>> + dev_err(dev,
>>> + "DPA ordering violation for DC partition %d and %d\n",
>>> + index - 1, index);
>>> + return -EINVAL;
>>> + }
>>> + }
>>> +
>>> + if (!IS_ALIGNED(part_array[index].start, SZ_256M) ||
>>> + !IS_ALIGNED(part_array[index].start, blk_size)) {
>>> + dev_err(dev, "DC partition %d invalid start %zu blk size %zu\n",
>>> + index, part_array[index].start, blk_size);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + if (part_array[index].size == 0 || len == 0 ||
>>> + part_array[index].size < len || !IS_ALIGNED(len, blk_size)) {
>>> + dev_err(dev, "DC partition %d invalid length; size %zu len %zu blk size %zu\n",
>>> + index, part_array[index].size, len, blk_size);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + if (blk_size == 0 || blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
>>> + !is_power_of_2(blk_size)) {
>>> + dev_err(dev, "DC partition %d invalid block size; %zu\n",
>>
>> size: instead of size;
>>
> fixed!
>>> + index, blk_size);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + dev_dbg(dev, "DC partition %d start %zu start %zu size %zu\n",
>>
>> should it be "DC partition %d start %zu size %zu blk_size: %zu\n"?
>>
> yep, fixed! Also I changed the type of
> struct cxl_dc_partition_info->start/size from size_t to u64 so
> the print specifier uses %llu now. Unless it's better to stick with
> size_t?
I think u64 would be explicit and better. I can just see the kbot complaining about 32bit systems and size_t....
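A rough userspace illustration of the concern, assuming nothing about the final patch: a DPA at or above 4 GiB does not fit in a 32-bit size_t, while a fixed 64-bit type prints the same everywhere. The helper names are hypothetical, and PRIu64 plays the role that %llu with u64 plays in the kernel.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns nonzero when v would lose bits if stored in a 32-bit size_t */
static int truncates_in_32bit(uint64_t v)
{
	return v != (uint64_t)(uint32_t)v;
}

/* Formats a partition using fixed-width 64-bit values */
static int format_partition(char *buf, size_t len, uint64_t start,
			    uint64_t size)
{
	return snprintf(buf, len, "start %" PRIu64 " size %" PRIu64,
			start, size);
}
```

A partition starting at 4 GiB is enough to trip the truncation check, which is well within the range of DPAs a CXL memory device can report.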
>
>>> + index, part_array[index].start, part_array[index].size,
>>> + blk_size);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/* Returns the number of partitions in dc_resp or -ERRNO */
>>> +static int cxl_get_dc_config(struct cxl_mailbox *mbox, u8 start_partition,
>>> + struct cxl_mbox_get_dc_config_out *dc_resp,
>>> + size_t dc_resp_size)
>>> +{
>>> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
>>> + .partition_count = CXL_MAX_DC_PARTITIONS,
>>> + .start_partition_index = start_partition,
>>> + };
>>> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
>>> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
>>> + .payload_in = &get_dc,
>>> + .size_in = sizeof(get_dc),
>>> + .size_out = dc_resp_size,
>>> + .payload_out = dc_resp,
>>> + .min_out = 8,
>>> + };
>>> + int rc;
>>> +
>>> + rc = cxl_internal_send_cmd(mbox, &mbox_cmd);
>>> + if (rc < 0)
>>> + return rc;
>>> +
>>> + dev_dbg(mbox->host, "Read %d/%d DC partitions\n",
>>> + dc_resp->partitions_returned, dc_resp->avail_partition_count);
>>> + return dc_resp->partitions_returned;
>>> +}
>>> +
>>> +/**
>>> + * cxl_dev_dc_identify() - Reads the dynamic capacity information from the
>>> + * device.
>>> + * @mbox: Mailbox to query
>>> + * @dc_info: The dynamic partition information to return
>>> + *
>>> + * Read Dynamic Capacity information from the device and return the partition
>>> + * information.
>>> + *
>>> + * Return: 0 if identify was executed successfully, -ERRNO on error.
>>> + * on error only dynamic_bytes is left unchanged.
>>> + */
>>> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
>>> + struct cxl_dc_partition_info *dc_info)
>>> +{
>>> + struct cxl_dc_partition_info partitions[CXL_MAX_DC_PARTITIONS];
>>> + struct device *dev = mbox->host;
>>> + size_t dc_resp_size =
>>> + sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(partitions);
>>
>> I think it needs to be something like below because of the 'partition' flex array:
>> size_t dc_resp_size = struct_size(dc_resp, partition, CXL_MAX_DC_PARTITIONS);
>>
>> partitions is type 'struct cxl_dc_partition_info', and dc_resp->partition is type 'struct cxl_dc_partition', so the size calculation is wrong. It should at least be:
>> size_t dc_resp_size = sizeof(struct cxl_mbox_get_dc_config_out) + sizeof(struct cxl_dc_partition) * CXL_MAX_DC_PARTITIONS;
>>
> Fixed!
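For reference, a minimal userspace sketch of the sizing problem discussed above. The struct names are stand-ins for the ones in the patch (no cxl_ prefix, stdint types instead of __le64/u8), and dc_resp_size() is a hypothetical helper that does by hand what struct_size() would do in the kernel, minus its overflow checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DC_PARTITIONS 8	/* mirrors CXL_MAX_DC_PARTITIONS in the patch */

/* Wire-format partition entry (stand-in for struct cxl_dc_partition). */
struct dc_partition {
	uint64_t base;
	uint64_t decode_length;
	uint64_t length;
	uint64_t block_size;
	uint32_t dsmad_handle;
	uint8_t flags;
	uint8_t rsvd[3];
} __attribute__((packed));

/* Response header followed by a flexible array of wire-format entries. */
struct get_dc_config_out {
	uint8_t avail_partition_count;
	uint8_t partitions_returned;
	uint8_t rsvd[6];
	struct dc_partition partition[];
} __attribute__((packed));

/* Driver-side summary (stand-in for struct cxl_dc_partition_info). */
struct dc_partition_info {
	uint64_t start;
	uint64_t size;
};

/* Size the buffer from the flex array's element type, not the summary type. */
static size_t dc_resp_size(size_t nr)
{
	return offsetof(struct get_dc_config_out, partition) +
	       nr * sizeof(struct dc_partition);
}
```

With these layouts each wire entry is 40 bytes while the summary struct is 16 bytes, so sizing the buffer from the summary array under-allocates the response.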
>>
>>> + u8 start_partition;
>>> + u8 num_partitions;
>>> +
>>> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
>>> + kmalloc(dc_resp_size, GFP_KERNEL);
>>> + if (!dc_resp)
>>> + return -ENOMEM;
>>> +
>>> + /**
>>
>> /*
>>
>>> + * Read and check all partition information for validity and potential
>>> + * debugging; see debug output in cxl_dc_check()
>>> + */
>>> + start_partition = 0;
>>> + num_partitions = 0;
>>> + do {
>>> + int rc, i, j;
>>> +
>>> + rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
>>> + if (rc < 0) {
>>> + dev_err(dev, "Failed to get DC config: %d\n", rc);
>>> + return rc;
>>> + }
>>> +
> if (rc == 0) {
> dev_err(dev,
> "Device reported %u partitions available but returned none at index %u\n",
> dc_resp->avail_partition_count, start_partition);
> return -EIO;
> }
>>> + num_partitions += rc;
>>
>> Would cxl_get_dc_config() repeatedly returning 0 be a problem? It is not likely to happen unless the device is malicious.
>>
> Not sure but I added a check anyway. ^ See above. It prohibits
> cxl_get_dc_config() returning 0 at all though. But could be changed to
> err only if 0 partitions are returned X amount of times...?
I think that's fine as long as we have a way to detect that we aren't moving forward in this loop and can get out at some point.
DJ
>>> +
>>> + if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
>>> + dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
>>> + num_partitions);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
>>> + rc = cxl_dc_check(dev, partitions, i,
>>> + &dc_resp->partition[j]);
>>> + if (rc)
>>> + return rc;
>>> + }
>>> +
>>> + start_partition = num_partitions;
>>> +
>>> + } while (num_partitions < dc_resp->avail_partition_count);
>>> +
>>> + /* Return 1st partition */
>>> + dc_info->start = partitions[0].start;
>>> + dc_info->size = partitions[0].size;
>>> + dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
>>> + dc_info->start, dc_info->size);
>>> +
>>> + return 0;
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
>>> +
>>> static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
>>> {
>>> int i = info->nr_partitions;
>>> @@ -1421,6 +1571,38 @@ int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count)
>>> }
>>> EXPORT_SYMBOL_NS_GPL(cxl_get_dirty_count, "CXL");
>>>
>>> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>>> +{
>>> + struct cxl_dc_partition_info dc_info = { 0 };
>>> + struct device *dev = mds->cxlds.dev;
>>> + size_t skip;
>>> + int rc;
>>> +
>>> + rc = cxl_dev_dc_identify(&mds->cxlds.cxl_mbox, &dc_info);
>>> + if (rc) {
>>> + dev_warn(dev,
>>> + "Failed to read Dynamic Capacity config: %d\n", rc);
>>> + cxl_disable_dcd(mds);
>>> + return;
>>> + }
>>> +
>>> + /* Skips between pmem and the dynamic partition are not supported */
>>> + skip = dc_info.start - info->size;
>>> + if (skip) {
>>
>> Would this be sufficient?
>>
>> if (dc_info.start != info->size)
>>
> Fixed!
>> DJ
> Thanks,
> Anisa
>>> + dev_warn(dev,
>>> + "Dynamic Capacity skip from pmem not supported: %zu\n",
>>> + skip);
>>> + cxl_disable_dcd(mds);
>>> + return;
>>> + }
>>> +
>>> + info->size += dc_info.size;
>>> + dev_dbg(dev, "Adding dynamic ram partition A; %zu size %zu\n",
>>> + dc_info.start, dc_info.size);
>>> + add_part(info, dc_info.start, dc_info.size, CXL_PARTMODE_DYNAMIC_RAM_A);
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(cxl_configure_dcd, "CXL");
>>> +
>>> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds)
>>> {
>>> struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
>>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>>> index 53444af448d7..87386488ad10 100644
>>> --- a/drivers/cxl/cxlmem.h
>>> +++ b/drivers/cxl/cxlmem.h
>>> @@ -380,6 +380,8 @@ struct cxl_security_state {
>>> struct kernfs_node *sanitize_node;
>>> };
>>>
>>> +#define CXL_MAX_DC_PARTITIONS 8
>>> +
>>> static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
>>> {
>>> /*
>>> @@ -664,6 +666,31 @@ struct cxl_mbox_set_shutdown_state_in {
>>> u8 state;
>>> } __packed;
>>>
>>> +/* See CXL 3.2 Table 8-178 get dynamic capacity config Input Payload */
>>> +struct cxl_mbox_get_dc_config_in {
>>> + u8 partition_count;
>>> + u8 start_partition_index;
>>> +} __packed;
>>> +
>>> +/* See CXL 3.2 Table 8-179 get dynamic capacity config Output Payload */
>>> +struct cxl_mbox_get_dc_config_out {
>>> + u8 avail_partition_count;
>>> + u8 partitions_returned;
>>> + u8 rsvd[6];
>>> + /* See CXL 3.2 Table 8-180 */
>>> + struct cxl_dc_partition {
>>> + __le64 base;
>>> + __le64 decode_length;
>>> + __le64 length;
>>> + __le64 block_size;
>>> + __le32 dsmad_handle;
>>> + u8 flags;
>>> + u8 rsvd[3];
>>> + } __packed partition[] __counted_by(partitions_returned);
>>> + /* Trailing fields unused */
>>> +} __packed;
>>> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
>>> +
>>> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>>> struct cxl_mbox_set_timestamp_in {
>>> __le64 timestamp;
>>> @@ -787,9 +814,18 @@ enum {
>>> int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>>> struct cxl_mbox_cmd *cmd);
>>> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>>> +
>>> +struct cxl_dc_partition_info {
>>> + size_t start;
>>> + size_t size;
>>> +};
>>> +
>>> +int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
>>> + struct cxl_dc_partition_info *dc_info);
>>> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>>> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>>> int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>>> +void cxl_configure_dcd(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>>> struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
>>> u16 dvsec);
>>> void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
>>> @@ -803,6 +839,17 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
>>> const uuid_t *uuid, union cxl_event *evt);
>>> int cxl_get_dirty_count(struct cxl_memdev_state *mds, u32 *count);
>>> int cxl_arm_dirty_shutdown(struct cxl_memdev_state *mds);
>>> +
>>> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
>>> +{
>>> + return mds->dcd_supported;
>>> +}
>>> +
>>> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
>>> +{
>>> + mds->dcd_supported = false;
>>> +}
>>> +
>>> int cxl_set_timestamp(struct cxl_memdev_state *mds);
>>> int cxl_poison_state_init(struct cxl_memdev_state *mds);
>>> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>> index bace662dc988..60f9fa05d9ef 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/pci.c
>>> @@ -870,6 +870,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>> if (rc)
>>> return rc;
>>>
>>> + if (cxl_dcd_supported(mds))
>>> + cxl_configure_dcd(mds, &range_info);
>>> +
>>> rc = cxl_dpa_setup(cxlds, &range_info);
>>> if (rc)
>>> return rc;
>>> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
>>> index fa7269154620..bb1df0cef863 100644
>>> --- a/include/cxl/cxl.h
>>> +++ b/include/cxl/cxl.h
>>> @@ -133,6 +133,7 @@ struct cxl_dpa_perf {
>>> enum cxl_partition_mode {
>>> CXL_PARTMODE_RAM,
>>> CXL_PARTMODE_PMEM,
>>> + CXL_PARTMODE_DYNAMIC_RAM_A,
>>> };
>>>
>>> /**
>>> @@ -147,7 +148,7 @@ struct cxl_dpa_partition {
>>> enum cxl_partition_mode mode;
>>> };
>>>
>>> -#define CXL_NR_PARTITIONS_MAX 2
>>> +#define CXL_NR_PARTITIONS_MAX 3
>>>
>>> /**
>>> * struct cxl_dev_state - The driver device state
>>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device
2026-06-01 15:23 ` Dave Jiang
@ 2026-06-02 9:46 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-06-02 9:46 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Mon, Jun 01, 2026 at 08:23:46AM -0700, Dave Jiang wrote:
>
>
> On 5/29/26 11:40 PM, Anisa Su wrote:
> > On Wed, May 27, 2026 at 03:28:56PM -0700, Dave Jiang wrote:
> >>
> >>
> >> On 5/23/26 2:42 AM, Anisa Su wrote:
[snip]
> >>> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> >>> + kmalloc(dc_resp_size, GFP_KERNEL);
> >>> + if (!dc_resp)
> >>> + return -ENOMEM;
> >>> +
> >>> + /**
> >>
> >> /*
> >>
> >>> + * Read and check all partition information for validity and potential
> >>> + * debugging; see debug output in cxl_dc_check()
> >>> + */
> >>> + start_partition = 0;
> >>> + num_partitions = 0;
> >>> + do {
> >>> + int rc, i, j;
> >>> +
> >>> + rc = cxl_get_dc_config(mbox, start_partition, dc_resp, dc_resp_size);
> >>> + if (rc < 0) {
> >>> + dev_err(dev, "Failed to get DC config: %d\n", rc);
> >>> + return rc;
> >>> + }
> >>> +
> > if (rc == 0) {
> > dev_err(dev,
> > "Device reported %u partitions available but returned none at index %u\n",
> > dc_resp->avail_partition_count, start_partition);
> > return -EIO;
> > }
> >>> + num_partitions += rc;
> >>
> >> Would cxl_get_dc_config() repeatedly returning 0 be a problem? It is not likely to happen unless the device is malicious.
> >>
> > Not sure but I added a check anyway. ^ See above. It prohibits
> > cxl_get_dc_config() returning 0 at all though. But could be changed to
> > err only if 0 partitions are returned X amount of times...?
>
> I think that's fine as long as we have a way to detect that we aren't moving forward in this loop and can get out at some point.
>
> DJ
>
I'll keep the check above then, and just prohibit returning 0 partitions
when the device reports that it has more partitions available. I don't
think it makes sense for the device to transiently return 0 and then
make progress on a retry anyway.
However, I need to move it below this check:
if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
so that the no-forward-progress case is differentiated from the device
returning 0 partitions overall.
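A rough userspace sketch of that ordering, in case it helps review. Everything here is hypothetical: struct fake_dev and fake_get_dc_config() model a device that hands out partitions in batches and can be told to stall, and read_all_partitions() only mirrors the loop shape of cxl_dev_dc_identify(), not its validation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_DC_PARTITIONS 8	/* mirrors CXL_MAX_DC_PARTITIONS */

/* Hypothetical device model used only for this sketch. */
struct fake_dev {
	uint8_t avail;		/* partitions the device claims to have */
	uint8_t per_call;	/* most entries returned per query */
	uint8_t stall_at;	/* start index from which it returns nothing */
};

/* Stand-in for cxl_get_dc_config(): returns how many entries were read. */
static int fake_get_dc_config(const struct fake_dev *dev, uint8_t start)
{
	uint8_t left;

	if (start >= dev->stall_at || start >= dev->avail)
		return 0;

	left = dev->avail - start;
	return left < dev->per_call ? left : dev->per_call;
}

/* Returns the number of partitions read, or -ERRNO. */
static int read_all_partitions(const struct fake_dev *dev)
{
	unsigned int total = 0;
	uint8_t start = 0;

	do {
		int rc = fake_get_dc_config(dev, start);

		if (rc < 0)
			return rc;

		total += rc;
		/* Overall count must be sane, including the "none at all" case. */
		if (total < 1 || total > MAX_DC_PARTITIONS)
			return -EINVAL;

		/* No forward progress: more were promised, none were returned. */
		if (rc == 0)
			return -EIO;

		start = total;
	} while (total < dev->avail);

	return total;
}
```

With the range check first, a device that never returns anything fails with -EINVAL, while a device that returns some partitions and then stalls fails with -EIO, so the two cases stay distinguishable in the log.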
Thanks,
Anisa
> >>> +
> >>> + if (num_partitions < 1 || num_partitions > CXL_MAX_DC_PARTITIONS) {
> >>> + dev_err(dev, "Invalid num of dynamic capacity partitions %d\n",
> >>> + num_partitions);
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> + for (i = start_partition, j = 0; i < num_partitions; i++, j++) {
> >>> + rc = cxl_dc_check(dev, partitions, i,
> >>> + &dc_resp->partition[j]);
> >>> + if (rc)
> >>> + return rc;
> >>> + }
> >>> +
> >>> + start_partition = num_partitions;
> >>> +
> >>> + } while (num_partitions < dc_resp->avail_partition_count);
> >>> +
> >>> + /* Return 1st partition */
> >>> + dc_info->start = partitions[0].start;
> >>> + dc_info->size = partitions[0].size;
> >>> + dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> >>> + dc_info->start, dc_info->size);
> >>> +
> >>> + return 0;
> >>> +}
> >>> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> >>> +
>
[snip]
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 03/31] cxl/cdat: Gather DSMAS data for DCD partitions
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
2026-05-23 9:42 ` [PATCH v10 01/31] cxl/mbox: Flag " Anisa Su
2026-05-23 9:42 ` [PATCH v10 02/31] cxl/mem: Read dynamic capacity configuration from the device Anisa Su
@ 2026-05-23 9:42 ` Anisa Su
2026-05-27 23:16 ` Dave Jiang
2026-05-23 9:42 ` [PATCH v10 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
` (29 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:42 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
Additional DCD partition (AKA region) information is contained in the
DSMAS CDAT tables, including performance, read only, and shareable
attributes.
Match DCD partitions with DSMAS tables and store the metadata.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
[jonathan: core/mbox.c: error if there are non-zero reserved bits in DSMAD
handle in cxl_dc_check]
---
drivers/cxl/core/cdat.c | 11 +++++++++++
drivers/cxl/core/mbox.c | 7 +++++++
drivers/cxl/cxlmem.h | 2 ++
include/cxl/cxl.h | 4 ++++
4 files changed, 24 insertions(+)
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index 5c9f07262513..c5f3d2ebea55 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -17,6 +17,7 @@ struct dsmas_entry {
struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
int entries;
int qos_class;
+ bool shareable;
};
static u32 cdat_normalize(u16 entry, u64 base, u8 type)
@@ -74,6 +75,7 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
return -ENOMEM;
dent->handle = dsmas->dsmad_handle;
+ dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
@@ -244,6 +246,7 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
dpa_perf->coord[i] = dent->coord[i];
dpa_perf->cdat_coord[i] = dent->cdat_coord[i];
}
+ dpa_perf->shareable = dent->shareable;
dpa_perf->dpa_range = dent->dpa_range;
dpa_perf->qos_class = dent->qos_class;
dev_dbg(dev,
@@ -266,13 +269,21 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
bool found = false;
for (int i = 0; i < cxlds->nr_partitions; i++) {
+ enum cxl_partition_mode mode = cxlds->part[i].mode;
struct resource *res = &cxlds->part[i].res;
+ u8 handle = cxlds->part[i].handle;
struct range range = {
.start = res->start,
.end = res->end,
};
if (range_contains(&range, &dent->dpa_range)) {
+ if (mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ dent->handle != handle)
+ dev_warn(dev,
+ "Dynamic RAM perf mismatch; %pra (%u) vs %pra (%u)\n",
+ &range, handle, &dent->dpa_range, dent->handle);
+
update_perf_entry(dev, dent,
&cxlds->part[i].perf);
found = true;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 71b29cd6abfe..f9a5e21f5d09 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1356,10 +1356,16 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
{
size_t blk_size = le64_to_cpu(dev_part->block_size);
size_t len = le64_to_cpu(dev_part->length);
+ u32 handle = le32_to_cpu(dev_part->dsmad_handle);
part_array[index].start = le64_to_cpu(dev_part->base);
part_array[index].size = le64_to_cpu(dev_part->decode_length);
part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
+ if (handle & ~0xFF) {
+ dev_warn(dev, "DSMAD handle 0x%x has non-zero reserved bits\n", handle);
+ return -EINVAL;
+ }
+ part_array[index].handle = handle;
/* Check partitions are in increasing DPA order */
if (index > 0) {
@@ -1494,6 +1500,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
/* Return 1st partition */
dc_info->start = partitions[0].start;
dc_info->size = partitions[0].size;
+ dc_info->handle = partitions[0].handle;
dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
dc_info->start, dc_info->size);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 87386488ad10..cee936fb3d03 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -118,6 +118,7 @@ struct cxl_dpa_info {
struct cxl_dpa_part_info {
struct range range;
enum cxl_partition_mode mode;
+ u8 handle;
} part[CXL_NR_PARTITIONS_MAX];
int nr_partitions;
};
@@ -818,6 +819,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds);
struct cxl_dc_partition_info {
size_t start;
size_t size;
+ u8 handle;
};
int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index bb1df0cef863..51685a01d19c 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -122,12 +122,14 @@ struct cxl_register_map {
* @coord: QoS performance data (i.e. latency, bandwidth)
* @cdat_coord: raw QoS performance data from CDAT
* @qos_class: QoS Class cookies
+ * @shareable: Is the range shareable
*/
struct cxl_dpa_perf {
struct range dpa_range;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
int qos_class;
+ bool shareable;
};
enum cxl_partition_mode {
@@ -141,11 +143,13 @@ enum cxl_partition_mode {
* @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
* @perf: performance attributes of the partition from CDAT
* @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
+ * @handle: DSMAS handle intended to represent this partition
*/
struct cxl_dpa_partition {
struct resource res;
struct cxl_dpa_perf perf;
enum cxl_partition_mode mode;
+ u8 handle;
};
#define CXL_NR_PARTITIONS_MAX 3
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 03/31] cxl/cdat: Gather DSMAS data for DCD partitions
2026-05-23 9:42 ` [PATCH v10 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
@ 2026-05-27 23:16 ` Dave Jiang
2026-05-30 6:45 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 23:16 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:42 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Additional DCD partition (AKA region) information is contained in the
> DSMAS CDAT tables, including performance, read only, and shareable
> attributes.
>
> Match DCD partitions with DSMAS tables and store the metadata.
DCD handle needs to be propagated.
add_part() needs to copy over the handle
cxl_dpa_setup() also needs to copy the handle
Would be good to get this checked against actual hardware.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [anisa: rebase]
> [jonathan: core/mbox.c: error if there are non-zero reserved bits in DSMAD
> handle in cxl_dc_check]
> ---
> drivers/cxl/core/cdat.c | 11 +++++++++++
> drivers/cxl/core/mbox.c | 7 +++++++
> drivers/cxl/cxlmem.h | 2 ++
> include/cxl/cxl.h | 4 ++++
> 4 files changed, 24 insertions(+)
>
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index 5c9f07262513..c5f3d2ebea55 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -17,6 +17,7 @@ struct dsmas_entry {
> struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
> int entries;
> int qos_class;
> + bool shareable;
> };
>
> static u32 cdat_normalize(u16 entry, u64 base, u8 type)
> @@ -74,6 +75,7 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
> return -ENOMEM;
>
> dent->handle = dsmas->dsmad_handle;
> + dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
> dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
> dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
> le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
> @@ -244,6 +246,7 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
> dpa_perf->coord[i] = dent->coord[i];
> dpa_perf->cdat_coord[i] = dent->cdat_coord[i];
> }
> + dpa_perf->shareable = dent->shareable;
> dpa_perf->dpa_range = dent->dpa_range;
> dpa_perf->qos_class = dent->qos_class;
> dev_dbg(dev,
> @@ -266,13 +269,21 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
> bool found = false;
>
> for (int i = 0; i < cxlds->nr_partitions; i++) {
> + enum cxl_partition_mode mode = cxlds->part[i].mode;
> struct resource *res = &cxlds->part[i].res;
> + u8 handle = cxlds->part[i].handle;
> struct range range = {
> .start = res->start,
> .end = res->end,
> };
>
> if (range_contains(&range, &dent->dpa_range)) {
> + if (mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + dent->handle != handle)
> + dev_warn(dev,
> + "Dynamic RAM perf mismatch; %pra (%u) vs %pra (%u)\n",
> + &range, handle, &dent->dpa_range, dent->handle);
Should it 'continue' here since it mismatches?
DJ
> +
> update_perf_entry(dev, dent,
> &cxlds->part[i].perf);
> found = true;
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 71b29cd6abfe..f9a5e21f5d09 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1356,10 +1356,16 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
> {
> size_t blk_size = le64_to_cpu(dev_part->block_size);
> size_t len = le64_to_cpu(dev_part->length);
> + u32 handle = le32_to_cpu(dev_part->dsmad_handle);
>
> part_array[index].start = le64_to_cpu(dev_part->base);
> part_array[index].size = le64_to_cpu(dev_part->decode_length);
> part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> + if (handle & ~0xFF) {
> + dev_warn(dev, "DSMAD handle 0x%x has non-zero reserved bits\n", handle);
> + return -EINVAL;
> + }
> + part_array[index].handle = handle;
>
> /* Check partitions are in increasing DPA order */
> if (index > 0) {
> @@ -1494,6 +1500,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> /* Return 1st partition */
> dc_info->start = partitions[0].start;
> dc_info->size = partitions[0].size;
> + dc_info->handle = partitions[0].handle;
> dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> dc_info->start, dc_info->size);
>
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 87386488ad10..cee936fb3d03 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -118,6 +118,7 @@ struct cxl_dpa_info {
> struct cxl_dpa_part_info {
> struct range range;
> enum cxl_partition_mode mode;
> + u8 handle;
> } part[CXL_NR_PARTITIONS_MAX];
> int nr_partitions;
> };
> @@ -818,6 +819,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> struct cxl_dc_partition_info {
> size_t start;
> size_t size;
> + u8 handle;
> };
>
> int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index bb1df0cef863..51685a01d19c 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -122,12 +122,14 @@ struct cxl_register_map {
> * @coord: QoS performance data (i.e. latency, bandwidth)
> * @cdat_coord: raw QoS performance data from CDAT
> * @qos_class: QoS Class cookies
> + * @shareable: Is the range shareable
> */
> struct cxl_dpa_perf {
> struct range dpa_range;
> struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
> int qos_class;
> + bool shareable;
> };
>
> enum cxl_partition_mode {
> @@ -141,11 +143,13 @@ enum cxl_partition_mode {
> * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
> * @perf: performance attributes of the partition from CDAT
> * @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
> + * @handle: DSMAS handle intended to represent this partition
> */
> struct cxl_dpa_partition {
> struct resource res;
> struct cxl_dpa_perf perf;
> enum cxl_partition_mode mode;
> + u8 handle;
> };
>
> #define CXL_NR_PARTITIONS_MAX 3
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 03/31] cxl/cdat: Gather DSMAS data for DCD partitions
2026-05-27 23:16 ` Dave Jiang
@ 2026-05-30 6:45 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-30 6:45 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 04:16:05PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:42 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Additional DCD partition (AKA region) information is contained in the
> > DSMAS CDAT tables, including performance, read only, and shareable
> > attributes.
> >
> > Match DCD partitions with DSMAS tables and store the metadata.
>
> DCD handle needs to be propagated.
>
> add_part() needs to copy over the handle
Fixed: the add_part() signature gains a u8 handle parameter.
The call sites in cxl_mem_dpa_fetch() for the RAM and PMEM partitions
pass 0 for the handle.
The call site in cxl_configure_dcd() passes dc_info.handle.
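Roughly, the new add_part() looks like the userspace sketch below. The types are simplified stand-ins (flat start/end instead of struct range, no cxl_ prefixes), and the array bound guard is an assumption of this sketch rather than something taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PARTITIONS_MAX 3	/* mirrors CXL_NR_PARTITIONS_MAX after patch 02 */

enum partition_mode { PARTMODE_RAM, PARTMODE_PMEM, PARTMODE_DYNAMIC_RAM_A };

/* Stand-in for struct cxl_dpa_part_info. */
struct dpa_part_info {
	uint64_t start;
	uint64_t end;
	enum partition_mode mode;
	uint8_t handle;
};

/* Stand-in for struct cxl_dpa_info. */
struct dpa_info {
	uint64_t size;
	struct dpa_part_info part[NR_PARTITIONS_MAX];
	int nr_partitions;
};

/* Record a partition and carry the DSMAD handle along with it. */
static void add_part(struct dpa_info *info, uint64_t start, uint64_t size,
		     enum partition_mode mode, uint8_t handle)
{
	int i = info->nr_partitions;

	/* Skip empty partitions; the bound check is an assumption of this sketch. */
	if (size == 0 || i >= NR_PARTITIONS_MAX)
		return;

	info->part[i].start = start;
	info->part[i].end = start + size - 1;
	info->part[i].mode = mode;
	info->part[i].handle = handle;
	info->nr_partitions++;
}
```

Static RAM and PMEM partitions have no DSMAD handle from Get DC Config, so 0 is passed there; only the dynamic partition carries the value reported by the device.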
> cxl_dpa_setup() also needs to copy the handle
Fixed
>
>
> Would be good to get this checked against actual hardware.
>
Yes, we are coordinating with another team within Samsung to validate.
>
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes:
> > [anisa: rebase]
> > [jonathan: core/mbox.c: error if there are non-zero reserved bits in DSMAD
> > handle in cxl_dc_check]
> > ---
> > drivers/cxl/core/cdat.c | 11 +++++++++++
> > drivers/cxl/core/mbox.c | 7 +++++++
> > drivers/cxl/cxlmem.h | 2 ++
> > include/cxl/cxl.h | 4 ++++
> > 4 files changed, 24 insertions(+)
> >
> > diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> > index 5c9f07262513..c5f3d2ebea55 100644
> > --- a/drivers/cxl/core/cdat.c
> > +++ b/drivers/cxl/core/cdat.c
> > @@ -17,6 +17,7 @@ struct dsmas_entry {
> > struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
> > int entries;
> > int qos_class;
> > + bool shareable;
> > };
> >
> > static u32 cdat_normalize(u16 entry, u64 base, u8 type)
> > @@ -74,6 +75,7 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
> > return -ENOMEM;
> >
> > dent->handle = dsmas->dsmad_handle;
> > + dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
> > dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
> > dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
> > le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
> > @@ -244,6 +246,7 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
> > dpa_perf->coord[i] = dent->coord[i];
> > dpa_perf->cdat_coord[i] = dent->cdat_coord[i];
> > }
> > + dpa_perf->shareable = dent->shareable;
> > dpa_perf->dpa_range = dent->dpa_range;
> > dpa_perf->qos_class = dent->qos_class;
> > dev_dbg(dev,
> > @@ -266,13 +269,21 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
> > bool found = false;
> >
> > for (int i = 0; i < cxlds->nr_partitions; i++) {
> > + enum cxl_partition_mode mode = cxlds->part[i].mode;
> > struct resource *res = &cxlds->part[i].res;
> > + u8 handle = cxlds->part[i].handle;
> > struct range range = {
> > .start = res->start,
> > .end = res->end,
> > };
> >
> > if (range_contains(&range, &dent->dpa_range)) {
> > + if (mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> > + dent->handle != handle)
> > + dev_warn(dev,
> > + "Dynamic RAM perf mismatch; %pra (%u) vs %pra (%u)\n",
> > + &range, handle, &dent->dpa_range, dent->handle);
>
> Should it 'continue' here since it mismatches?
>
Fixed!
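The intended behavior after the fix is sketched below in userspace C. The structs are trimmed stand-ins for the kernel ones, perf_set is a hypothetical flag standing in for update_perf_entry() having run, and match_dsmas() is an illustrative name, not a function in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum partition_mode { PARTMODE_RAM, PARTMODE_PMEM, PARTMODE_DYNAMIC_RAM_A };

struct range {
	uint64_t start;
	uint64_t end;
};

/* Stand-in for struct cxl_dpa_partition. */
struct partition {
	struct range range;
	enum partition_mode mode;
	uint8_t handle;
	bool perf_set;	/* hypothetical: set where update_perf_entry() would run */
};

/* Stand-in for struct dsmas_entry. */
struct dsmas_entry {
	struct range dpa_range;
	uint8_t handle;
};

static bool range_contains(const struct range *outer, const struct range *inner)
{
	return outer->start <= inner->start && outer->end >= inner->end;
}

/* Returns true if the DSMAS entry was applied to some partition. */
static bool match_dsmas(struct partition *part, int nr,
			const struct dsmas_entry *dent)
{
	bool found = false;

	for (int i = 0; i < nr; i++) {
		if (!range_contains(&part[i].range, &dent->dpa_range))
			continue;

		/* Dynamic partitions must also agree on the DSMAD handle. */
		if (part[i].mode == PARTMODE_DYNAMIC_RAM_A &&
		    part[i].handle != dent->handle)
			continue;	/* the real code warns before skipping */

		part[i].perf_set = true;
		found = true;
	}

	return found;
}
```

A mismatched handle now leaves the partition's perf data untouched, so the entry falls through to the existing "not found" handling instead of being applied with a warning.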
> DJ
>
Thanks,
Anisa
> > +
> > update_perf_entry(dev, dent,
> > &cxlds->part[i].perf);
> > found = true;
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 71b29cd6abfe..f9a5e21f5d09 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1356,10 +1356,16 @@ static int cxl_dc_check(struct device *dev, struct cxl_dc_partition_info *part_a
> > {
> > size_t blk_size = le64_to_cpu(dev_part->block_size);
> > size_t len = le64_to_cpu(dev_part->length);
> > + u32 handle = le32_to_cpu(dev_part->dsmad_handle);
> >
> > part_array[index].start = le64_to_cpu(dev_part->base);
> > part_array[index].size = le64_to_cpu(dev_part->decode_length);
> > part_array[index].size *= CXL_CAPACITY_MULTIPLIER;
> > + if (handle & ~0xFF) {
> > + dev_warn(dev, "DSMAD handle 0x%x has non-zero reserved bits\n", handle);
> > + return -EINVAL;
> > + }
> > + part_array[index].handle = handle;
> >
> > /* Check partitions are in increasing DPA order */
> > if (index > 0) {
> > @@ -1494,6 +1500,7 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> > /* Return 1st partition */
> > dc_info->start = partitions[0].start;
> > dc_info->size = partitions[0].size;
> > + dc_info->handle = partitions[0].handle;
> > dev_dbg(dev, "Returning partition 0 %zu size %zu\n",
> > dc_info->start, dc_info->size);
> >
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 87386488ad10..cee936fb3d03 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -118,6 +118,7 @@ struct cxl_dpa_info {
> > struct cxl_dpa_part_info {
> > struct range range;
> > enum cxl_partition_mode mode;
> > + u8 handle;
> > } part[CXL_NR_PARTITIONS_MAX];
> > int nr_partitions;
> > };
> > @@ -818,6 +819,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > struct cxl_dc_partition_info {
> > size_t start;
> > size_t size;
> > + u8 handle;
> > };
> >
> > int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> > diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> > index bb1df0cef863..51685a01d19c 100644
> > --- a/include/cxl/cxl.h
> > +++ b/include/cxl/cxl.h
> > @@ -122,12 +122,14 @@ struct cxl_register_map {
> > * @coord: QoS performance data (i.e. latency, bandwidth)
> > * @cdat_coord: raw QoS performance data from CDAT
> > * @qos_class: QoS Class cookies
> > + * @shareable: Is the range shareable
> > */
> > struct cxl_dpa_perf {
> > struct range dpa_range;
> > struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> > struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
> > int qos_class;
> > + bool shareable;
> > };
> >
> > enum cxl_partition_mode {
> > @@ -141,11 +143,13 @@ enum cxl_partition_mode {
> > * @res: shortcut to the partition in the DPA resource tree (cxlds->dpa_res)
> > * @perf: performance attributes of the partition from CDAT
> > * @mode: operation mode for the DPA capacity, e.g. ram, pmem, dynamic...
> > + * @handle: DSMAS handle intended to represent this partition
> > */
> > struct cxl_dpa_partition {
> > struct resource res;
> > struct cxl_dpa_perf perf;
> > enum cxl_partition_mode mode;
> > + u8 handle;
> > };
> >
> > #define CXL_NR_PARTITIONS_MAX 3
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 04/31] cxl/core: Enforce partition order/simplify partition calls
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (2 preceding siblings ...)
2026-05-23 9:42 ` [PATCH v10 03/31] cxl/cdat: Gather DSMAS data for DCD partitions Anisa Su
@ 2026-05-23 9:42 ` Anisa Su
2026-05-27 23:37 ` Dave Jiang
2026-05-23 9:42 ` [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs Anisa Su
` (28 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:42 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
Device partitions have an implied order which is made more complex by
the addition of a dynamic partition.
Remove the ram special case information calls in favor of generic calls
with a check ahead of time to ensure the preservation of the implied
partition order.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
[davidlohr: core/hdm.c: return -EINVAL instead of 0 in cxl_dpa_setup
if partitions are out of order]
---
drivers/cxl/core/hdm.c | 11 ++++++++++-
drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
drivers/cxl/cxlmem.h | 9 +++------
drivers/cxl/mem.c | 2 +-
4 files changed, 23 insertions(+), 31 deletions(-)
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 28974adaab75..7a5812971f8f 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -464,6 +464,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
{
struct device *dev = cxlds->dev;
+ int i;
guard(rwsem_write)(&cxl_rwsem.dpa);
@@ -476,9 +477,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
return 0;
}
+ /* Verify partitions are in expected order. */
+ for (i = 1; i < info->nr_partitions; i++) {
+ if (cxlds->part[i].mode < cxlds->part[i-1].mode) {
+ dev_err(dev, "Partition order mismatch\n");
+ return -EINVAL;
+ }
+ }
+
cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
- for (int i = 0; i < info->nr_partitions; i++) {
+ for (i = 0; i < info->nr_partitions; i++) {
const struct cxl_dpa_part_info *part = &info->part[i];
int rc;
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 80e65690eb77..71602820f896 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -75,20 +75,12 @@ static ssize_t label_storage_size_show(struct device *dev,
}
static DEVICE_ATTR_RO(label_storage_size);
-static resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
-{
- /* Static RAM is only expected at partition 0. */
- if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
- return 0;
- return resource_size(&cxlds->part[0].res);
-}
-
static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- unsigned long long len = cxl_ram_size(cxlds);
+ unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_RAM);
return sysfs_emit(buf, "%#llx\n", len);
}
@@ -101,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- unsigned long long len = cxl_pmem_size(cxlds);
+ unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_PMEM);
return sysfs_emit(buf, "%#llx\n", len);
}
@@ -424,10 +416,11 @@ static struct attribute *cxl_memdev_attributes[] = {
NULL,
};
-static struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
+static struct cxl_dpa_perf *part_perf(struct cxl_dev_state *cxlds,
+ enum cxl_partition_mode mode)
{
for (int i = 0; i < cxlds->nr_partitions; i++)
- if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
+ if (cxlds->part[i].mode == mode)
return &cxlds->part[i].perf;
return NULL;
}
@@ -438,7 +431,7 @@ static ssize_t pmem_qos_class_show(struct device *dev,
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
+ return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_PMEM)->qos_class);
}
static struct device_attribute dev_attr_pmem_qos_class =
@@ -450,20 +443,13 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
NULL,
};
-static struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
-{
- if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
- return NULL;
- return &cxlds->part[0].perf;
-}
-
static ssize_t ram_qos_class_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
- return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
+ return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_RAM)->qos_class);
}
static struct device_attribute dev_attr_ram_qos_class =
@@ -499,7 +485,7 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
{
struct device *dev = kobj_to_dev(kobj);
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
- struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
+ struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_RAM);
if (a == &dev_attr_ram_qos_class.attr &&
(!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
@@ -518,7 +504,7 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
{
struct device *dev = kobj_to_dev(kobj);
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
- struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
+ struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_PMEM);
if (a == &dev_attr_pmem_qos_class.attr &&
(!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index cee936fb3d03..10175ca3b7ee 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -383,14 +383,11 @@ struct cxl_security_state {
#define CXL_MAX_DC_PARTITIONS 8
-static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
+static inline resource_size_t cxl_part_size(struct cxl_dev_state *cxlds,
+ enum cxl_partition_mode mode)
{
- /*
- * Static PMEM may be at partition index 0 when there is no static RAM
- * capacity.
- */
for (int i = 0; i < cxlds->nr_partitions; i++)
- if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
+ if (cxlds->part[i].mode == mode)
return resource_size(&cxlds->part[i].res);
return 0;
}
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index fcffe24dcb42..f19e08279ec7 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -114,7 +114,7 @@ static int cxl_mem_probe(struct device *dev)
return -ENXIO;
}
- if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
+ if (cxl_part_size(cxlds, CXL_PARTMODE_PMEM) && IS_ENABLED(CONFIG_CXL_PMEM)) {
rc = devm_cxl_add_nvdimm(dev, parent_port, cxlmd);
if (rc) {
if (rc == -ENODEV)
--
2.43.0
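
For readers following the refactor, here is a minimal userspace sketch of the "look up by mode" pattern that cxl_part_size() and part_perf() share in this patch. The types and names (struct part, struct dev_state, find_part, part_size) are hypothetical stand-ins, not the kernel definitions, and the enum order is an assumption. It also illustrates why callers that dereference the lookup result, as the qos_class show functions do, depend on the is_visible callbacks hiding the attribute when no matching partition exists.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; names are illustrative only. */
enum part_mode { PARTMODE_RAM, PARTMODE_PMEM, PARTMODE_DYNAMIC_RAM_A };

struct part {
	enum part_mode mode;
	unsigned long long size;	/* stands in for resource_size(&res) */
	int qos_class;			/* stands in for perf.qos_class */
};

struct dev_state {
	struct part part[3];
	int nr_partitions;
};

/* First partition with the requested mode, or NULL when none exists. */
static const struct part *find_part(const struct dev_state *ds,
				    enum part_mode mode)
{
	for (int i = 0; i < ds->nr_partitions; i++)
		if (ds->part[i].mode == mode)
			return &ds->part[i];
	return NULL;
}

/* Size of the first matching partition; 0 means "no such partition". */
static unsigned long long part_size(const struct dev_state *ds,
				    enum part_mode mode)
{
	const struct part *p = find_part(ds, mode);

	return p ? p->size : 0;
}
```

A caller that needs the qos_class would check the find_part() result for NULL first unless, as in the patch, visibility of the attribute already guarantees a match.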
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 04/31] cxl/core: Enforce partition order/simplify partition calls
2026-05-23 9:42 ` [PATCH v10 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
@ 2026-05-27 23:37 ` Dave Jiang
2026-05-30 6:57 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 23:37 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:42 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Device partitions have an implied order which is made more complex by
> the addition of a dynamic partition.
>
> Remove the ram special case information calls in favor of generic calls
> with a check ahead of time to ensure the preservation of the implied
> partition order.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [anisa: rebase]
> [davidlohr: core/hdm.c: return -EINVAL instead of 0 in cxl_dpa_setup
> if partitions are out of order]
> ---
> drivers/cxl/core/hdm.c | 11 ++++++++++-
> drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
> drivers/cxl/cxlmem.h | 9 +++------
> drivers/cxl/mem.c | 2 +-
> 4 files changed, 23 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 28974adaab75..7a5812971f8f 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -464,6 +464,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
> int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> {
> struct device *dev = cxlds->dev;
> + int i;
>
> guard(rwsem_write)(&cxl_rwsem.dpa);
>
> @@ -476,9 +477,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> return 0;
> }
>
> + /* Verify partitions are in expected order. */
> + for (i = 1; i < info->nr_partitions; i++) {
> + if (cxlds->part[i].mode < cxlds->part[i-1].mode) {
I think we need to check info->part[i].mode here rather than cxlds->part[i].mode, because the cxlds modes are only assigned later in this function.
DJ
> + dev_err(dev, "Partition order mismatch\n");
> + return -EINVAL;
> + }
> + }
> +
> cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>
> - for (int i = 0; i < info->nr_partitions; i++) {
> + for (i = 0; i < info->nr_partitions; i++) {
> const struct cxl_dpa_part_info *part = &info->part[i];
> int rc;
>
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 80e65690eb77..71602820f896 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -75,20 +75,12 @@ static ssize_t label_storage_size_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(label_storage_size);
>
> -static resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
> -{
> - /* Static RAM is only expected at partition 0. */
> - if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
> - return 0;
> - return resource_size(&cxlds->part[0].res);
> -}
> -
> static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> - unsigned long long len = cxl_ram_size(cxlds);
> + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_RAM);
>
> return sysfs_emit(buf, "%#llx\n", len);
> }
> @@ -101,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> {
> struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> - unsigned long long len = cxl_pmem_size(cxlds);
> + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_PMEM);
>
> return sysfs_emit(buf, "%#llx\n", len);
> }
> @@ -424,10 +416,11 @@ static struct attribute *cxl_memdev_attributes[] = {
> NULL,
> };
>
> -static struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> +static struct cxl_dpa_perf *part_perf(struct cxl_dev_state *cxlds,
> + enum cxl_partition_mode mode)
> {
> for (int i = 0; i < cxlds->nr_partitions; i++)
> - if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
> + if (cxlds->part[i].mode == mode)
> return &cxlds->part[i].perf;
> return NULL;
> }
> @@ -438,7 +431,7 @@ static ssize_t pmem_qos_class_show(struct device *dev,
> struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
>
> - return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
> + return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_PMEM)->qos_class);
> }
>
> static struct device_attribute dev_attr_pmem_qos_class =
> @@ -450,20 +443,13 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
> NULL,
> };
>
> -static struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> -{
> - if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
> - return NULL;
> - return &cxlds->part[0].perf;
> -}
> -
> static ssize_t ram_qos_class_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
>
> - return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
> + return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_RAM)->qos_class);
> }
>
> static struct device_attribute dev_attr_ram_qos_class =
> @@ -499,7 +485,7 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
> {
> struct device *dev = kobj_to_dev(kobj);
> struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> - struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
> + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_RAM);
>
> if (a == &dev_attr_ram_qos_class.attr &&
> (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> @@ -518,7 +504,7 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
> {
> struct device *dev = kobj_to_dev(kobj);
> struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> - struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
> + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_PMEM);
>
> if (a == &dev_attr_pmem_qos_class.attr &&
> (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index cee936fb3d03..10175ca3b7ee 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -383,14 +383,11 @@ struct cxl_security_state {
>
> #define CXL_MAX_DC_PARTITIONS 8
>
> -static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
> +static inline resource_size_t cxl_part_size(struct cxl_dev_state *cxlds,
> + enum cxl_partition_mode mode)
> {
> - /*
> - * Static PMEM may be at partition index 0 when there is no static RAM
> - * capacity.
> - */
> for (int i = 0; i < cxlds->nr_partitions; i++)
> - if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
> + if (cxlds->part[i].mode == mode)
> return resource_size(&cxlds->part[i].res);
> return 0;
> }
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index fcffe24dcb42..f19e08279ec7 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -114,7 +114,7 @@ static int cxl_mem_probe(struct device *dev)
> return -ENXIO;
> }
>
> - if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> + if (cxl_part_size(cxlds, CXL_PARTMODE_PMEM) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> rc = devm_cxl_add_nvdimm(dev, parent_port, cxlmd);
> if (rc) {
> if (rc == -ENODEV)
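
To make the review comment on the ordering check in cxl_dpa_setup() concrete, here is a minimal userspace sketch of the check run against the caller-provided info rather than the not-yet-populated device state. The struct and enum names are simplified, hypothetical stand-ins for the kernel types, and the enum order (ram, then pmem, then dynamic ram A) is an assumption; this is not the fix that was actually applied.

```c
#include <errno.h>

/* Simplified stand-ins for the kernel types; the enum order is assumed. */
enum part_mode { PARTMODE_RAM, PARTMODE_PMEM, PARTMODE_DYNAMIC_RAM_A };

struct part_info {
	enum part_mode mode;
};

struct dpa_info {
	struct part_info part[3];
	int nr_partitions;
};

/*
 * Return 0 when the modes in the caller-provided info are non-decreasing,
 * -EINVAL otherwise. Only info is consulted, so the result does not depend
 * on state that is filled in later.
 */
static int check_partition_order(const struct dpa_info *info)
{
	for (int i = 1; i < info->nr_partitions; i++) {
		if (info->part[i].mode < info->part[i - 1].mode)
			return -EINVAL;
	}
	return 0;
}
```

With zero or one partition the loop body never runs, so the check passes trivially.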
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 04/31] cxl/core: Enforce partition order/simplify partition calls
2026-05-27 23:37 ` Dave Jiang
@ 2026-05-30 6:57 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-30 6:57 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 04:37:11PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:42 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Device partitions have an implied order which is made more complex by
> > the addition of a dynamic partition.
> >
> > Remove the ram special case information calls in favor of generic calls
> > with a check ahead of time to ensure the preservation of the implied
> > partition order.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes:
> > [anisa: rebase]
> > [davidlohr: core/hdm.c: return -EINVAL instead of 0 in cxl_dpa_setup
> > if partitions are out of order]
> > ---
> > drivers/cxl/core/hdm.c | 11 ++++++++++-
> > drivers/cxl/core/memdev.c | 32 +++++++++-----------------------
> > drivers/cxl/cxlmem.h | 9 +++------
> > drivers/cxl/mem.c | 2 +-
> > 4 files changed, 23 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 28974adaab75..7a5812971f8f 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -464,6 +464,7 @@ static const char *cxl_mode_name(enum cxl_partition_mode mode)
> > int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > {
> > struct device *dev = cxlds->dev;
> > + int i;
> >
> > guard(rwsem_write)(&cxl_rwsem.dpa);
> >
> > @@ -476,9 +477,17 @@ int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > return 0;
> > }
> >
> > + /* Verify partitions are in expected order. */
> > + for (i = 1; i < info->nr_partitions; i++) {
> > + if (cxlds->part[i].mode < cxlds->part[i-1].mode) {
>
> I think we need to check info->part[i].mode here rather than cxlds->part[i].mode, because the cxlds modes are only assigned later in this function.
>
fixed!
> DJ
>
Thanks,
Anisa
>
> > + dev_err(dev, "Partition order mismatch\n");
> > + return -EINVAL;
> > + }
> > + }
> > +
> > cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> >
> > - for (int i = 0; i < info->nr_partitions; i++) {
> > + for (i = 0; i < info->nr_partitions; i++) {
> > const struct cxl_dpa_part_info *part = &info->part[i];
> > int rc;
> >
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index 80e65690eb77..71602820f896 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -75,20 +75,12 @@ static ssize_t label_storage_size_show(struct device *dev,
> > }
> > static DEVICE_ATTR_RO(label_storage_size);
> >
> > -static resource_size_t cxl_ram_size(struct cxl_dev_state *cxlds)
> > -{
> > - /* Static RAM is only expected at partition 0. */
> > - if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
> > - return 0;
> > - return resource_size(&cxlds->part[0].res);
> > -}
> > -
> > static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
> > char *buf)
> > {
> > struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > - unsigned long long len = cxl_ram_size(cxlds);
> > + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_RAM);
> >
> > return sysfs_emit(buf, "%#llx\n", len);
> > }
> > @@ -101,7 +93,7 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> > {
> > struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > - unsigned long long len = cxl_pmem_size(cxlds);
> > + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_PMEM);
> >
> > return sysfs_emit(buf, "%#llx\n", len);
> > }
> > @@ -424,10 +416,11 @@ static struct attribute *cxl_memdev_attributes[] = {
> > NULL,
> > };
> >
> > -static struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
> > +static struct cxl_dpa_perf *part_perf(struct cxl_dev_state *cxlds,
> > + enum cxl_partition_mode mode)
> > {
> > for (int i = 0; i < cxlds->nr_partitions; i++)
> > - if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
> > + if (cxlds->part[i].mode == mode)
> > return &cxlds->part[i].perf;
> > return NULL;
> > }
> > @@ -438,7 +431,7 @@ static ssize_t pmem_qos_class_show(struct device *dev,
> > struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >
> > - return sysfs_emit(buf, "%d\n", to_pmem_perf(cxlds)->qos_class);
> > + return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_PMEM)->qos_class);
> > }
> >
> > static struct device_attribute dev_attr_pmem_qos_class =
> > @@ -450,20 +443,13 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
> > NULL,
> > };
> >
> > -static struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
> > -{
> > - if (cxlds->part[0].mode != CXL_PARTMODE_RAM)
> > - return NULL;
> > - return &cxlds->part[0].perf;
> > -}
> > -
> > static ssize_t ram_qos_class_show(struct device *dev,
> > struct device_attribute *attr, char *buf)
> > {
> > struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >
> > - return sysfs_emit(buf, "%d\n", to_ram_perf(cxlds)->qos_class);
> > + return sysfs_emit(buf, "%d\n", part_perf(cxlds, CXL_PARTMODE_RAM)->qos_class);
> > }
> >
> > static struct device_attribute dev_attr_ram_qos_class =
> > @@ -499,7 +485,7 @@ static umode_t cxl_ram_visible(struct kobject *kobj, struct attribute *a, int n)
> > {
> > struct device *dev = kobj_to_dev(kobj);
> > struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > - struct cxl_dpa_perf *perf = to_ram_perf(cxlmd->cxlds);
> > + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_RAM);
> >
> > if (a == &dev_attr_ram_qos_class.attr &&
> > (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> > @@ -518,7 +504,7 @@ static umode_t cxl_pmem_visible(struct kobject *kobj, struct attribute *a, int n
> > {
> > struct device *dev = kobj_to_dev(kobj);
> > struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > - struct cxl_dpa_perf *perf = to_pmem_perf(cxlmd->cxlds);
> > + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_PMEM);
> >
> > if (a == &dev_attr_pmem_qos_class.attr &&
> > (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index cee936fb3d03..10175ca3b7ee 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -383,14 +383,11 @@ struct cxl_security_state {
> >
> > #define CXL_MAX_DC_PARTITIONS 8
> >
> > -static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
> > +static inline resource_size_t cxl_part_size(struct cxl_dev_state *cxlds,
> > + enum cxl_partition_mode mode)
> > {
> > - /*
> > - * Static PMEM may be at partition index 0 when there is no static RAM
> > - * capacity.
> > - */
> > for (int i = 0; i < cxlds->nr_partitions; i++)
> > - if (cxlds->part[i].mode == CXL_PARTMODE_PMEM)
> > + if (cxlds->part[i].mode == mode)
> > return resource_size(&cxlds->part[i].res);
> > return 0;
> > }
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index fcffe24dcb42..f19e08279ec7 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -114,7 +114,7 @@ static int cxl_mem_probe(struct device *dev)
> > return -ENXIO;
> > }
> >
> > - if (cxl_pmem_size(cxlds) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> > + if (cxl_part_size(cxlds, CXL_PARTMODE_PMEM) && IS_ENABLED(CONFIG_CXL_PMEM)) {
> > rc = devm_cxl_add_nvdimm(dev, parent_port, cxlmd);
> > if (rc) {
> > if (rc == -ENODEV)
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (3 preceding siblings ...)
2026-05-23 9:42 ` [PATCH v10 04/31] cxl/core: Enforce partition order/simplify partition calls Anisa Su
@ 2026-05-23 9:42 ` Anisa Su
2026-05-27 23:54 ` Dave Jiang
2026-05-27 23:56 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 06/31] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Anisa Su
` (27 subsequent siblings)
32 siblings, 2 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:42 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
To properly configure CXL regions user space will need to know the
details of the dynamic ram partition.
Expose the first dynamic ram partition through sysfs.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: Update kernel version to 7.0]
[davidlohr: Remove "persistent" from description of
/sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class]
---
Documentation/ABI/testing/sysfs-bus-cxl | 24 +++++++++++
drivers/cxl/core/memdev.c | 57 +++++++++++++++++++++++++
2 files changed, 81 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 16a9b3d2e2c0..3d95c325f6e0 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -89,6 +89,30 @@ Description:
and there are platform specific performance related
side-effects that may result. First class-id is displayed.
+What: /sys/bus/cxl/devices/memX/dynamic_ram_a/size
+Date: May, 2025
+KernelVersion: v7.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The first Dynamic RAM partition capacity as bytes.
+
+
+What: /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class
+Date: May, 2025
+KernelVersion: v7.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) For CXL host platforms that support "QoS Telemetry"
+ this attribute conveys a comma delimited list of platform
+ specific cookies that identify a QoS performance class
+ for the partition of the CXL mem device. These
+ class-ids can be compared against a similar "qos_class"
+ published for a root decoder. While it is not required
+ that the endpoints map their local memory-class to a
+ matching platform class, mismatches are not recommended
+ and there are platform specific performance related
+ side-effects that may result. First class-id is displayed.
+
What: /sys/bus/cxl/devices/memX/serial
Date: January, 2022
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 71602820f896..064cfd628577 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
static struct device_attribute dev_attr_pmem_size =
__ATTR(size, 0444, pmem_size_show, NULL);
+static ssize_t dynamic_ram_a_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
+
+ return sysfs_emit(buf, "%#llx\n", len);
+}
+
+static struct device_attribute dev_attr_dynamic_ram_a_size =
+ __ATTR(size, 0444, dynamic_ram_a_size_show, NULL);
+
static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -443,6 +456,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
NULL,
};
+static ssize_t dynamic_ram_a_qos_class_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ return sysfs_emit(buf, "%d\n",
+ part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)->qos_class);
+}
+
+static struct device_attribute dev_attr_dynamic_ram_a_qos_class =
+ __ATTR(qos_class, 0444, dynamic_ram_a_qos_class_show, NULL);
+
+static struct attribute *cxl_memdev_dynamic_ram_a_attributes[] = {
+ &dev_attr_dynamic_ram_a_size.attr,
+ &dev_attr_dynamic_ram_a_qos_class.attr,
+ NULL,
+};
+
static ssize_t ram_qos_class_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
.is_visible = cxl_pmem_visible,
};
+static umode_t cxl_dynamic_ram_a_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
+
+ if (a == &dev_attr_dynamic_ram_a_qos_class.attr &&
+ (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
+ return 0;
+
+ if (a == &dev_attr_dynamic_ram_a_size.attr &&
+ (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)))
+ return 0;
+
+ return a->mode;
+}
+
+static struct attribute_group cxl_memdev_dynamic_ram_a_attribute_group = {
+ .name = "dynamic_ram_a",
+ .attrs = cxl_memdev_dynamic_ram_a_attributes,
+ .is_visible = cxl_dynamic_ram_a_visible,
+};
+
static umode_t cxl_memdev_security_visible(struct kobject *kobj,
struct attribute *a, int n)
{
@@ -547,6 +602,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
&cxl_memdev_attribute_group,
&cxl_memdev_ram_attribute_group,
&cxl_memdev_pmem_attribute_group,
+ &cxl_memdev_dynamic_ram_a_attribute_group,
&cxl_memdev_security_attribute_group,
NULL,
};
@@ -555,6 +611,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
{
sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
+ sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_a_attribute_group);
}
EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
--
2.43.0
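
As an illustration of how user space might consume the new size attribute, here is a minimal sketch in C. It assumes only what the patch shows: the attribute emits a single line formatted with "%#llx\n". The helper names (parse_size, read_size_attr) and any concrete device path such as .../mem0/dynamic_ram_a/size are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text emitted by the size attribute. Base 0 lets strtoull()
 * accept both the "0x..." form and the bare "0" that "%#llx" prints for
 * a zero value. Returns 0 on success, -1 on malformed input.
 */
static int parse_size(const char *text, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(text, &end, 0);
	if (errno || end == text || (*end != '\n' && *end != '\0'))
		return -1;
	*out = val;
	return 0;
}

/* Read one attribute file and parse its single line; -1 on any failure. */
static int read_size_attr(const char *path, unsigned long long *out)
{
	char buf[64];
	char *line;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	line = fgets(buf, sizeof(buf), f);
	fclose(f);
	return line ? parse_size(buf, out) : -1;
}
```

Because the is_visible callback hides the attribute when the partition has no capacity, a failed open is the expected outcome on devices without a dynamic ram A partition, so a tool would likely treat that case as "not present" rather than as an error.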
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs
2026-05-23 9:42 ` [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs Anisa Su
@ 2026-05-27 23:54 ` Dave Jiang
2026-05-27 23:56 ` Dave Jiang
1 sibling, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 23:54 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:42 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> To properly configure CXL regions user space will need to know the
> details of the dynamic ram partition.
>
> Expose the first dynamic ram partition through sysfs.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Missing Anisa's sign-off.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
> Changes:
> [anisa: Update kernel version to 7.0]
> [davidlohr: Remove "persistent" from description of
> /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 24 +++++++++++
> drivers/cxl/core/memdev.c | 57 +++++++++++++++++++++++++
> 2 files changed, 81 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 16a9b3d2e2c0..3d95c325f6e0 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -89,6 +89,30 @@ Description:
> and there are platform specific performance related
> side-effects that may result. First class-id is displayed.
>
> +What: /sys/bus/cxl/devices/memX/dynamic_ram_a/size
> +Date: May, 2025
> +KernelVersion: v7.0
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) The first Dynamic RAM partition capacity as bytes.
> +
> +
> +What: /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class
> +Date: May, 2025
> +KernelVersion: v7.0
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) For CXL host platforms that support "QoS Telemetry"
> + this attribute conveys a comma delimited list of platform
> + specific cookies that identify a QoS performance class
> + for the partition of the CXL mem device. These
> + class-ids can be compared against a similar "qos_class"
> + published for a root decoder. While it is not required
> + that the endpoints map their local memory-class to a
> + matching platform class, mismatches are not recommended
> + and there are platform specific performance related
> + side-effects that may result. First class-id is displayed.
> +
>
> What: /sys/bus/cxl/devices/memX/serial
> Date: January, 2022
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 71602820f896..064cfd628577 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> static struct device_attribute dev_attr_pmem_size =
> __ATTR(size, 0444, pmem_size_show, NULL);
>
> +static ssize_t dynamic_ram_a_size_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
> +
> + return sysfs_emit(buf, "%#llx\n", len);
> +}
> +
> +static struct device_attribute dev_attr_dynamic_ram_a_size =
> + __ATTR(size, 0444, dynamic_ram_a_size_show, NULL);
> +
> static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -443,6 +456,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
> NULL,
> };
>
> +static ssize_t dynamic_ram_a_qos_class_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return sysfs_emit(buf, "%d\n",
> + part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)->qos_class);
> +}
> +
> +static struct device_attribute dev_attr_dynamic_ram_a_qos_class =
> + __ATTR(qos_class, 0444, dynamic_ram_a_qos_class_show, NULL);
> +
> +static struct attribute *cxl_memdev_dynamic_ram_a_attributes[] = {
> + &dev_attr_dynamic_ram_a_size.attr,
> + &dev_attr_dynamic_ram_a_qos_class.attr,
> + NULL,
> +};
> +
> static ssize_t ram_qos_class_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> @@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
> .is_visible = cxl_pmem_visible,
> };
>
> +static umode_t cxl_dynamic_ram_a_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
> +
> + if (a == &dev_attr_dynamic_ram_a_qos_class.attr &&
> + (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> + return 0;
> +
> + if (a == &dev_attr_dynamic_ram_a_size.attr &&
> + (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)))
> + return 0;
> +
> + return a->mode;
> +}
> +
> +static struct attribute_group cxl_memdev_dynamic_ram_a_attribute_group = {
> + .name = "dynamic_ram_a",
> + .attrs = cxl_memdev_dynamic_ram_a_attributes,
> + .is_visible = cxl_dynamic_ram_a_visible,
> +};
> +
> static umode_t cxl_memdev_security_visible(struct kobject *kobj,
> struct attribute *a, int n)
> {
> @@ -547,6 +602,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
> &cxl_memdev_attribute_group,
> &cxl_memdev_ram_attribute_group,
> &cxl_memdev_pmem_attribute_group,
> + &cxl_memdev_dynamic_ram_a_attribute_group,
> &cxl_memdev_security_attribute_group,
> NULL,
> };
> @@ -555,6 +611,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
> {
> sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
> sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
> + sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_a_attribute_group);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs
2026-05-23 9:42 ` [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs Anisa Su
2026-05-27 23:54 ` Dave Jiang
@ 2026-05-27 23:56 ` Dave Jiang
2026-05-30 7:04 ` Anisa Su
1 sibling, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 23:56 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:42 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> To properly configure CXL regions user space will need to know the
> details of the dynamic ram partition.
>
> Expose the first dynamic ram partition through sysfs.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [anisa: Update kernel version to 7.0]
> [davidlohr: Remove "persistent" from description of
> /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 24 +++++++++++
> drivers/cxl/core/memdev.c | 57 +++++++++++++++++++++++++
> 2 files changed, 81 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 16a9b3d2e2c0..3d95c325f6e0 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -89,6 +89,30 @@ Description:
> and there are platform specific performance related
> side-effects that may result. First class-id is displayed.
>
> +What: /sys/bus/cxl/devices/memX/dynamic_ram_a/size
> +Date: May, 2025
> +KernelVersion: v7.0

Probably should update this to 7.3 maybe?

DJ

> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) The first Dynamic RAM partition capacity as bytes.
> +
> +
> +What: /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class
> +Date: May, 2025
> +KernelVersion: v7.0
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) For CXL host platforms that support "QoS Telemmetry"
> + this attribute conveys a comma delimited list of platform
> + specific cookies that identifies a QoS performance class
> + for the partition of the CXL mem device. These
> + class-ids can be compared against a similar "qos_class"
> + published for a root decoder. While it is not required
> + that the endpoints map their local memory-class to a
> + matching platform class, mismatches are not recommended
> + and there are platform specific performance related
> + side-effects that may result. First class-id is displayed.
> +
>
> What: /sys/bus/cxl/devices/memX/serial
> Date: January, 2022
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 71602820f896..064cfd628577 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> static struct device_attribute dev_attr_pmem_size =
> __ATTR(size, 0444, pmem_size_show, NULL);
>
> +static ssize_t dynamic_ram_a_size_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
> +
> + return sysfs_emit(buf, "%#llx\n", len);
> +}
> +
> +static struct device_attribute dev_attr_dynamic_ram_a_size =
> + __ATTR(size, 0444, dynamic_ram_a_size_show, NULL);
> +
> static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -443,6 +456,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
> NULL,
> };
>
> +static ssize_t dynamic_ram_a_qos_class_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return sysfs_emit(buf, "%d\n",
> + part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)->qos_class);
> +}
> +
> +static struct device_attribute dev_attr_dynamic_ram_a_qos_class =
> + __ATTR(qos_class, 0444, dynamic_ram_a_qos_class_show, NULL);
> +
> +static struct attribute *cxl_memdev_dynamic_ram_a_attributes[] = {
> + &dev_attr_dynamic_ram_a_size.attr,
> + &dev_attr_dynamic_ram_a_qos_class.attr,
> + NULL,
> +};
> +
> static ssize_t ram_qos_class_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> @@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
> .is_visible = cxl_pmem_visible,
> };
>
> +static umode_t cxl_dynamic_ram_a_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
> +
> + if (a == &dev_attr_dynamic_ram_a_qos_class.attr &&
> + (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> + return 0;
> +
> + if (a == &dev_attr_dynamic_ram_a_size.attr &&
> + (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)))
> + return 0;
> +
> + return a->mode;
> +}
> +
> +static struct attribute_group cxl_memdev_dynamic_ram_a_attribute_group = {
> + .name = "dynamic_ram_a",
> + .attrs = cxl_memdev_dynamic_ram_a_attributes,
> + .is_visible = cxl_dynamic_ram_a_visible,
> +};
> +
> static umode_t cxl_memdev_security_visible(struct kobject *kobj,
> struct attribute *a, int n)
> {
> @@ -547,6 +602,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
> &cxl_memdev_attribute_group,
> &cxl_memdev_ram_attribute_group,
> &cxl_memdev_pmem_attribute_group,
> + &cxl_memdev_dynamic_ram_a_attribute_group,
> &cxl_memdev_security_attribute_group,
> NULL,
> };
> @@ -555,6 +611,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
> {
> sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
> sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
> + sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_a_attribute_group);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs
2026-05-27 23:56 ` Dave Jiang
@ 2026-05-30 7:04 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-30 7:04 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 04:56:34PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:42 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > To properly configure CXL regions user space will need to know the
> > details of the dynamic ram partition.
> >
> > Expose the first dynamic ram partition through sysfs.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes:
> > [anisa: Update kernel version to 7.0]
> > [davidlohr: Remove "persistent" from description of
> > /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class]
> > ---
> > Documentation/ABI/testing/sysfs-bus-cxl | 24 +++++++++++
> > drivers/cxl/core/memdev.c | 57 +++++++++++++++++++++++++
> > 2 files changed, 81 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 16a9b3d2e2c0..3d95c325f6e0 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -89,6 +89,30 @@ Description:
> > and there are platform specific performance related
> > side-effects that may result. First class-id is displayed.
> >
> > +What: /sys/bus/cxl/devices/memX/dynamic_ram_a/size
> > +Date: May, 2025
> > +KernelVersion: v7.0
>
> Probably should update this to 7.3 maybe?
>
Updated
> DJ
>
> > +Contact: linux-cxl@vger.kernel.org
> > +Description:
> > + (RO) The first Dynamic RAM partition capacity as bytes.
> > +
> > +
> > +What: /sys/bus/cxl/devices/memX/dynamic_ram_a/qos_class
> > +Date: May, 2025
> > +KernelVersion: v7.0

Also updated

Thanks,
Anisa

> > +Contact: linux-cxl@vger.kernel.org
> > +Description:
> > + (RO) For CXL host platforms that support "QoS Telemmetry"
> > + this attribute conveys a comma delimited list of platform
> > + specific cookies that identifies a QoS performance class
> > + for the partition of the CXL mem device. These
> > + class-ids can be compared against a similar "qos_class"
> > + published for a root decoder. While it is not required
> > + that the endpoints map their local memory-class to a
> > + matching platform class, mismatches are not recommended
> > + and there are platform specific performance related
> > + side-effects that may result. First class-id is displayed.
> > +
> >
> > What: /sys/bus/cxl/devices/memX/serial
> > Date: January, 2022
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index 71602820f896..064cfd628577 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -101,6 +101,19 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> > static struct device_attribute dev_attr_pmem_size =
> > __ATTR(size, 0444, pmem_size_show, NULL);
> >
> > +static ssize_t dynamic_ram_a_size_show(struct device *dev, struct device_attribute *attr,
> > + char *buf)
> > +{
> > + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > + unsigned long long len = cxl_part_size(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
> > +
> > + return sysfs_emit(buf, "%#llx\n", len);
> > +}
> > +
> > +static struct device_attribute dev_attr_dynamic_ram_a_size =
> > + __ATTR(size, 0444, dynamic_ram_a_size_show, NULL);
> > +
> > static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
> > char *buf)
> > {
> > @@ -443,6 +456,25 @@ static struct attribute *cxl_memdev_pmem_attributes[] = {
> > NULL,
> > };
> >
> > +static ssize_t dynamic_ram_a_qos_class_show(struct device *dev,
> > + struct device_attribute *attr, char *buf)
> > +{
> > + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +
> > + return sysfs_emit(buf, "%d\n",
> > + part_perf(cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)->qos_class);
> > +}
> > +
> > +static struct device_attribute dev_attr_dynamic_ram_a_qos_class =
> > + __ATTR(qos_class, 0444, dynamic_ram_a_qos_class_show, NULL);
> > +
> > +static struct attribute *cxl_memdev_dynamic_ram_a_attributes[] = {
> > + &dev_attr_dynamic_ram_a_size.attr,
> > + &dev_attr_dynamic_ram_a_qos_class.attr,
> > + NULL,
> > +};
> > +
> > static ssize_t ram_qos_class_show(struct device *dev,
> > struct device_attribute *attr, char *buf)
> > {
> > @@ -519,6 +551,29 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
> > .is_visible = cxl_pmem_visible,
> > };
> >
> > +static umode_t cxl_dynamic_ram_a_visible(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > + struct device *dev = kobj_to_dev(kobj);
> > + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > + struct cxl_dpa_perf *perf = part_perf(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A);
> > +
> > + if (a == &dev_attr_dynamic_ram_a_qos_class.attr &&
> > + (!perf || perf->qos_class == CXL_QOS_CLASS_INVALID))
> > + return 0;
> > +
> > + if (a == &dev_attr_dynamic_ram_a_size.attr &&
> > + (!cxl_part_size(cxlmd->cxlds, CXL_PARTMODE_DYNAMIC_RAM_A)))
> > + return 0;
> > +
> > + return a->mode;
> > +}
> > +
> > +static struct attribute_group cxl_memdev_dynamic_ram_a_attribute_group = {
> > + .name = "dynamic_ram_a",
> > + .attrs = cxl_memdev_dynamic_ram_a_attributes,
> > + .is_visible = cxl_dynamic_ram_a_visible,
> > +};
> > +
> > static umode_t cxl_memdev_security_visible(struct kobject *kobj,
> > struct attribute *a, int n)
> > {
> > @@ -547,6 +602,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
> > &cxl_memdev_attribute_group,
> > &cxl_memdev_ram_attribute_group,
> > &cxl_memdev_pmem_attribute_group,
> > + &cxl_memdev_dynamic_ram_a_attribute_group,
> > &cxl_memdev_security_attribute_group,
> > NULL,
> > };
> > @@ -555,6 +611,7 @@ void cxl_memdev_update_perf(struct cxl_memdev *cxlmd)
> > {
> > sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_ram_attribute_group);
> > sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_pmem_attribute_group);
> > + sysfs_update_group(&cxlmd->dev.kobj, &cxl_memdev_dynamic_ram_a_attribute_group);
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_memdev_update_perf, "CXL");
> >
>
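
For anyone consuming the new attribute from user space: the size file
quoted above is emitted with "%#llx", so the value arrives as a
hex string with a "0x" prefix and a trailing newline. A minimal sketch
of a parser follows. The function name parse_size_attr() is
hypothetical and not part of this series; reading the file itself is
left to the caller.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text of a sysfs "size" attribute such as "0x10000000\n".
 * Base 0 lets strtoull() accept the "0x" prefix produced by "%#llx".
 * Returns 0 on success and stores the byte count in *bytes,
 * or -1 if the text is not a single number.
 *
 * parse_size_attr() is a hypothetical helper for illustration only.
 */
static int parse_size_attr(const char *text, unsigned long long *bytes)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(text, &end, 0);
	if (errno != 0 || end == text)
		return -1;

	/* sysfs_emit() output ends in a newline; allow it and nothing else */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;

	*bytes = val;
	return 0;
}
```

A caller would read memX/dynamic_ram_a/size into a NUL-terminated
buffer and pass that buffer to parse_size_attr(). Note that the group's
is_visible callback hides the attribute when the partition size is 0,
so a missing file is the expected "no dynamic RAM" signal.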
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 06/31] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (4 preceding siblings ...)
2026-05-23 9:42 ` [PATCH v10 05/31] cxl/mem: Expose dynamic ram A partition in sysfs Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 0:01 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 07/31] cxl/region: Add DC DAX region support Anisa Su
` (26 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
Endpoints can now support a single dynamic ram partition following the
persistent memory partition.
Expand the mode to allow a decoder to point to the first dynamic ram
partition.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
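
The string matching that mode_store() gains in this patch can be
modelled in user space as below. This is only a sketch: enum part_mode,
streq_nl() and parse_mode() are hypothetical stand-ins for the kernel's
CXL_PARTMODE_* values, sysfs_streq() and mode_store(), written here so
the accepted inputs can be checked outside the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's CXL_PARTMODE_* values. */
enum part_mode {
	PART_MODE_INVALID = -1,
	PART_MODE_RAM,
	PART_MODE_PMEM,
	PART_MODE_DYNAMIC_RAM_A,
};

/*
 * Compare two strings, treating a single trailing newline on either one
 * as insignificant, so input written with "echo" still matches. This is
 * a local re-implementation for illustration, not the kernel function.
 */
static bool streq_nl(const char *a, const char *b)
{
	while (*a && *a == *b) {
		a++;
		b++;
	}
	if (*a == *b)
		return true;
	if (!*a && *b == '\n' && !b[1])
		return true;
	if (!*b && *a == '\n' && !a[1])
		return true;
	return false;
}

/* Map a written mode string to a partition mode, as mode_store() does. */
static enum part_mode parse_mode(const char *buf)
{
	if (streq_nl(buf, "pmem"))
		return PART_MODE_PMEM;
	if (streq_nl(buf, "ram"))
		return PART_MODE_RAM;
	if (streq_nl(buf, "dynamic_ram_a"))
		return PART_MODE_DYNAMIC_RAM_A;
	return PART_MODE_INVALID;	/* the kernel returns -EINVAL here */
}
```

Any other string, including a hypothetical "dynamic_ram_b", is rejected,
which matches the single dynamic ram partition described above.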
---
Documentation/ABI/testing/sysfs-bus-cxl | 18 +++++++++---------
drivers/cxl/core/port.c | 4 ++++
2 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 3d95c325f6e0..c604c7ca6432 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -358,22 +358,22 @@ Description:
What: /sys/bus/cxl/devices/decoderX.Y/mode
-Date: May, 2022
-KernelVersion: v6.0
+Date: May, 2022, May 2025
+KernelVersion: v6.0, v6.16 (dynamic_ram_a)
Contact: linux-cxl@vger.kernel.org
Description:
(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
translates from a host physical address range, to a device
local address range. Device-local address ranges are further
- split into a 'ram' (volatile memory) range and 'pmem'
- (persistent memory) range. The 'mode' attribute emits one of
- 'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
- not actively decoding, or no DPA allocation policy has been
- set.
+ split into a 'ram' (volatile memory) range, 'pmem' (persistent
+ memory), and 'dynamic_ram_a' (first Dynamic RAM) range. The
+ 'mode' attribute emits one of 'ram', 'pmem', 'dynamic_ram_a' or
+ 'none'. The 'none' indicates the decoder is not actively
+ decoding, or no DPA allocation policy has been set.
'mode' can be written, when the decoder is in the 'disabled'
- state, with either 'ram' or 'pmem' to set the boundaries for the
- next allocation.
+ state, with either 'ram', 'pmem', or 'dynamic_ram_a' to set the
+ boundaries for the next allocation.
What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0c5957d1d329..a7f71f36531f 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -128,6 +128,7 @@ static DEVICE_ATTR_RO(name)
CXL_DECODER_FLAG_ATTR(cap_pmem, CXL_DECODER_F_PMEM);
CXL_DECODER_FLAG_ATTR(cap_ram, CXL_DECODER_F_RAM);
+CXL_DECODER_FLAG_ATTR(cap_dynamic_ram_a, CXL_DECODER_F_RAM);
CXL_DECODER_FLAG_ATTR(cap_type2, CXL_DECODER_F_TYPE2);
CXL_DECODER_FLAG_ATTR(cap_type3, CXL_DECODER_F_TYPE3);
CXL_DECODER_FLAG_ATTR(locked, CXL_DECODER_F_LOCK);
@@ -222,6 +223,8 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
mode = CXL_PARTMODE_PMEM;
else if (sysfs_streq(buf, "ram"))
mode = CXL_PARTMODE_RAM;
+ else if (sysfs_streq(buf, "dynamic_ram_a"))
+ mode = CXL_PARTMODE_DYNAMIC_RAM_A;
else
return -EINVAL;
@@ -327,6 +330,7 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
static struct attribute *cxl_decoder_root_attrs[] = {
&dev_attr_cap_pmem.attr,
&dev_attr_cap_ram.attr,
+ &dev_attr_cap_dynamic_ram_a.attr,
&dev_attr_cap_type2.attr,
&dev_attr_cap_type3.attr,
&dev_attr_target_list.attr,
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 06/31] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
2026-05-23 9:43 ` [PATCH v10 06/31] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Anisa Su
@ 2026-05-28 0:01 ` Dave Jiang
2026-05-30 7:07 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 0:01 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Endpoints can now support a single dynamic ram partition following the
> persistent memory partition.
>
> Expand the mode to allow a decoder to point to the first dynamic ram
> partition.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Need Anisa sign off

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Just update kver and dates below.

>
> ---
> Changes:
> [anisa: rebase]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 18 +++++++++---------
> drivers/cxl/core/port.c | 4 ++++
> 2 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 3d95c325f6e0..c604c7ca6432 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -358,22 +358,22 @@ Description:
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/mode
> -Date: May, 2022
> -KernelVersion: v6.0
> +Date: May, 2022, May 2025
A later date
> +KernelVersion: v6.0, v6.16 (dynamic_ram_a)

7.3 maybe?

DJ

> Contact: linux-cxl@vger.kernel.org
> Description:
> (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> translates from a host physical address range, to a device
> local address range. Device-local address ranges are further
> - split into a 'ram' (volatile memory) range and 'pmem'
> - (persistent memory) range. The 'mode' attribute emits one of
> - 'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
> - not actively decoding, or no DPA allocation policy has been
> - set.
> + split into a 'ram' (volatile memory) range, 'pmem' (persistent
> + memory), and 'dynamic_ram_a' (first Dynamic RAM) range. The
> + 'mode' attribute emits one of 'ram', 'pmem', 'dynamic_ram_a' or
> + 'none'. The 'none' indicates the decoder is not actively
> + decoding, or no DPA allocation policy has been set.
>
> 'mode' can be written, when the decoder is in the 'disabled'
> - state, with either 'ram' or 'pmem' to set the boundaries for the
> - next allocation.
> + state, with either 'ram', 'pmem', or 'dynamic_ram_a' to set the
> + boundaries for the next allocation.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 0c5957d1d329..a7f71f36531f 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -128,6 +128,7 @@ static DEVICE_ATTR_RO(name)
>
> CXL_DECODER_FLAG_ATTR(cap_pmem, CXL_DECODER_F_PMEM);
> CXL_DECODER_FLAG_ATTR(cap_ram, CXL_DECODER_F_RAM);
> +CXL_DECODER_FLAG_ATTR(cap_dynamic_ram_a, CXL_DECODER_F_RAM);
> CXL_DECODER_FLAG_ATTR(cap_type2, CXL_DECODER_F_TYPE2);
> CXL_DECODER_FLAG_ATTR(cap_type3, CXL_DECODER_F_TYPE3);
> CXL_DECODER_FLAG_ATTR(locked, CXL_DECODER_F_LOCK);
> @@ -222,6 +223,8 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
> mode = CXL_PARTMODE_PMEM;
> else if (sysfs_streq(buf, "ram"))
> mode = CXL_PARTMODE_RAM;
> + else if (sysfs_streq(buf, "dynamic_ram_a"))
> + mode = CXL_PARTMODE_DYNAMIC_RAM_A;
> else
> return -EINVAL;
>
> @@ -327,6 +330,7 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_cap_pmem.attr,
> &dev_attr_cap_ram.attr,
> + &dev_attr_cap_dynamic_ram_a.attr,
> &dev_attr_cap_type2.attr,
> &dev_attr_cap_type3.attr,
> &dev_attr_target_list.attr,
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 06/31] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
2026-05-28 0:01 ` Dave Jiang
@ 2026-05-30 7:07 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-30 7:07 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 05:01:44PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:43 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Endpoints can now support a single dynamic ram partition following the
> > persistent memory partition.
> >
> > Expand the mode to allow a decoder to point to the first dynamic ram
> > partition.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> Need Anisa sign off
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> Just update kver and dates below.
>

Updated!

Thanks,
Anisa

> >
> > ---
> > Changes:
> > [anisa: rebase]
> > ---
> > Documentation/ABI/testing/sysfs-bus-cxl | 18 +++++++++---------
> > drivers/cxl/core/port.c | 4 ++++
> > 2 files changed, 13 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 3d95c325f6e0..c604c7ca6432 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -358,22 +358,22 @@ Description:
> >
> >
> > What: /sys/bus/cxl/devices/decoderX.Y/mode
> > -Date: May, 2022
> > -KernelVersion: v6.0
> > +Date: May, 2022, May 2025
>
> A later date
>
> > +KernelVersion: v6.0, v6.16 (dynamic_ram_a)
>
> 7.3 maybe?
>
> DJ
>
> > Contact: linux-cxl@vger.kernel.org
> > Description:
> > (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> > translates from a host physical address range, to a device
> > local address range. Device-local address ranges are further
> > - split into a 'ram' (volatile memory) range and 'pmem'
> > - (persistent memory) range. The 'mode' attribute emits one of
> > - 'ram', 'pmem', or 'none'. The 'none' indicates the decoder is
> > - not actively decoding, or no DPA allocation policy has been
> > - set.
> > + split into a 'ram' (volatile memory) range, 'pmem' (persistent
> > + memory), and 'dynamic_ram_a' (first Dynamic RAM) range. The
> > + 'mode' attribute emits one of 'ram', 'pmem', 'dynamic_ram_a' or
> > + 'none'. The 'none' indicates the decoder is not actively
> > + decoding, or no DPA allocation policy has been set.
> >
> > 'mode' can be written, when the decoder is in the 'disabled'
> > - state, with either 'ram' or 'pmem' to set the boundaries for the
> > - next allocation.
> > + state, with either 'ram', 'pmem', or 'dynamic_ram_a' to set the
> > + boundaries for the next allocation.
> >
> >
> > What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 0c5957d1d329..a7f71f36531f 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -128,6 +128,7 @@ static DEVICE_ATTR_RO(name)
> >
> > CXL_DECODER_FLAG_ATTR(cap_pmem, CXL_DECODER_F_PMEM);
> > CXL_DECODER_FLAG_ATTR(cap_ram, CXL_DECODER_F_RAM);
> > +CXL_DECODER_FLAG_ATTR(cap_dynamic_ram_a, CXL_DECODER_F_RAM);
> > CXL_DECODER_FLAG_ATTR(cap_type2, CXL_DECODER_F_TYPE2);
> > CXL_DECODER_FLAG_ATTR(cap_type3, CXL_DECODER_F_TYPE3);
> > CXL_DECODER_FLAG_ATTR(locked, CXL_DECODER_F_LOCK);
> > @@ -222,6 +223,8 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
> > mode = CXL_PARTMODE_PMEM;
> > else if (sysfs_streq(buf, "ram"))
> > mode = CXL_PARTMODE_RAM;
> > + else if (sysfs_streq(buf, "dynamic_ram_a"))
> > + mode = CXL_PARTMODE_DYNAMIC_RAM_A;
> > else
> > return -EINVAL;
> >
> > @@ -327,6 +330,7 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> > static struct attribute *cxl_decoder_root_attrs[] = {
> > &dev_attr_cap_pmem.attr,
> > &dev_attr_cap_ram.attr,
> > + &dev_attr_cap_dynamic_ram_a.attr,
> > &dev_attr_cap_type2.attr,
> > &dev_attr_cap_type3.attr,
> > &dev_attr_target_list.attr,
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 07/31] cxl/region: Add DC DAX region support
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (5 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 06/31] cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 0:16 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 08/31] cxl/events: Split event msgnum configuration from irq setup Anisa Su
` (25 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
From: Ira Weiny <ira.weiny@intel.com>
DC DAX regions must allow memory to be added or removed dynamically.
In addition to the quantity of memory available, the location of the
memory within a DC partition is dynamic, based on the extents offered
by a device. CXL DAX regions must accommodate the dynamic movement of
this memory in the management of DAX regions and devices.
Introduce the concept of a dynamic DAX region. Introduce
create_dynamic_ram_a_region() sysfs entry to create such regions.
Special-case DC-capable regions to create a 0-sized seed DAX device.
This maintains compatibility, which requires a default DAX device to
hold a region reference.
Indicate 0 byte available capacity until such time that capacity is
added.
Dynamic regions complicate the range mapping of dax devices. There is no
known use case for range mapping on dynamic regions. Avoid the
complication by preventing range mapping of dax devices on dynamic
regions.
Interleaving is deferred for now. Add checks that reject any interleave
ways value other than 1 on dynamic regions.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
[anisa: change "sparse" naming conventions to "dynamic"]
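
The create_{pmem,ram,dynamic_ram_a}_region description in the ABI hunk
below relies on an atomic compare exchange: a write succeeds only if the
name written still matches the advertised id, and EBUSY is returned
otherwise. A minimal user-space model of that protocol using C11 atomics
is sketched here. The names next_region_id and claim_region_id() are
hypothetical, and the caller supplies the replacement id that the kernel
would obtain from its own id allocator.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical model of the "next region id" advertised by a root decoder. */
static atomic_int next_region_id;

/*
 * Claim region id 'requested' if it is still the advertised one.
 * 'fresh' is the id to advertise for the next creation attempt.
 * Returns 0 on success, or -EBUSY if the advertised id no longer
 * matches (for example because another writer claimed it first).
 */
static int claim_region_id(int requested, int fresh)
{
	int expected = requested;

	if (!atomic_compare_exchange_strong(&next_region_id, &expected, fresh))
		return -EBUSY;
	return 0;
}
```

Two writers that both read "region3" and both write it back will see one
success and one -EBUSY; the loser has to re-read the attribute to learn
the new id before retrying.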
---
Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++---------
drivers/cxl/core/core.h | 11 +++++++++
drivers/cxl/core/port.c | 1 +
drivers/cxl/core/region.c | 33 +++++++++++++++++++++++--
drivers/cxl/core/region_dax.c | 6 +++++
drivers/dax/bus.c | 10 ++++++++
drivers/dax/bus.h | 1 +
drivers/dax/cxl.c | 17 +++++++++++--
8 files changed, 86 insertions(+), 15 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index c604c7ca6432..3080aef9ad67 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -434,20 +434,20 @@ Description:
interleave_granularity).
-What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
-Date: May, 2022, January, 2023
-KernelVersion: v6.0 (pmem), v6.3 (ram)
+What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_a}_region
+Date: May, 2022, January, 2023, May 2025
+KernelVersion: v6.0 (pmem), v6.3 (ram), v6.16 (dynamic_ram_a)
Contact: linux-cxl@vger.kernel.org
Description:
(RW) Write a string in the form 'regionZ' to start the process
- of defining a new persistent, or volatile memory region
- (interleave-set) within the decode range bounded by root decoder
- 'decoderX.Y'. The value written must match the current value
- returned from reading this attribute. An atomic compare exchange
- operation is done on write to assign the requested id to a
- region and allocate the region-id for the next creation attempt.
- EBUSY is returned if the region name written does not match the
- current cached value.
+ of defining a new persistent, volatile, or dynamic RAM memory
+ region (interleave-set) within the decode range bounded by root
+ decoder 'decoderX.Y'. The value written must match the current
+ value returned from reading this attribute. An atomic compare
+ exchange operation is done on write to assign the requested id
+ to a region and allocate the region-id for the next creation
+ attempt. EBUSY is returned if the region name written does not
+ match the current cached value.
What: /sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 82ca3a476708..8881cc9323e0 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
#include <cxl/mailbox.h>
#include <linux/rwsem.h>
+#include <cxlmem.h>
extern const struct device_type cxl_nvdimm_bridge_type;
extern const struct device_type cxl_nvdimm_type;
@@ -18,6 +19,15 @@ enum cxl_detach_mode {
DETACH_INVALIDATE,
};
+static inline struct cxl_memdev_state *
+cxled_to_mds(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ return container_of(cxlds, struct cxl_memdev_state, cxlds);
+}
+
#ifdef CONFIG_CXL_REGION
struct cxl_region_context {
@@ -29,6 +39,7 @@ struct cxl_region_context {
extern struct device_attribute dev_attr_create_pmem_region;
extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dynamic_ram_a_region;
extern struct device_attribute dev_attr_delete_region;
extern struct device_attribute dev_attr_region;
extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index a7f71f36531f..2d33001dac26 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -337,6 +337,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
&dev_attr_qos_class.attr,
SET_CXL_REGION_ATTR(create_pmem_region)
SET_CXL_REGION_ATTR(create_ram_region)
+ SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
SET_CXL_REGION_ATTR(delete_region)
NULL,
};
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index edc267c6cf77..7561bf3d8af8 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -493,6 +493,11 @@ static int set_interleave_ways(struct cxl_region *cxlr, int val)
int save, rc;
u8 iw;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
+ dev_err(&cxlr->dev, "Interleaving and DCD not supported\n");
+ return -EINVAL;
+ }
+
rc = ways_to_eiw(val, &iw);
if (rc)
return rc;
@@ -2389,6 +2394,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
if (sysfs_streq(buf, "\n"))
rc = detach_target(cxlr, pos);
else {
+ struct cxl_endpoint_decoder *cxled;
struct device *dev;
dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
@@ -2400,8 +2406,14 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
goto out;
}
- rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
- TASK_INTERRUPTIBLE);
+ cxled = to_cxl_endpoint_decoder(dev);
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ !cxl_dcd_supported(cxled_to_mds(cxled))) {
+ dev_dbg(dev, "DCD unsupported\n");
+ rc = -EINVAL;
+ goto out;
+ }
+ rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
out:
put_device(dev);
}
@@ -2750,6 +2762,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
switch (mode) {
case CXL_PARTMODE_RAM:
case CXL_PARTMODE_PMEM:
+ case CXL_PARTMODE_DYNAMIC_RAM_A:
break;
default:
dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
@@ -2802,6 +2815,21 @@ static ssize_t create_ram_region_store(struct device *dev,
}
DEVICE_ATTR_RW(create_ram_region);
+static ssize_t create_dynamic_ram_a_region_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dynamic_ram_a_region_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_A);
+}
+DEVICE_ATTR_RW(create_dynamic_ram_a_region);
+
static ssize_t region_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -4081,6 +4109,7 @@ static int cxl_region_probe(struct device *dev)
return devm_cxl_add_pmem_region(cxlr);
case CXL_PARTMODE_RAM:
+ case CXL_PARTMODE_DYNAMIC_RAM_A:
rc = devm_cxl_region_edac_register(cxlr);
if (rc)
dev_dbg(&cxlr->dev, "CXL EDAC registration for region_id=%d failed\n",
diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
index de04f78f6ad8..d6bf69155827 100644
--- a/drivers/cxl/core/region_dax.c
+++ b/drivers/cxl/core/region_dax.c
@@ -84,6 +84,12 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
struct device *dev;
int rc;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ cxlr->params.interleave_ways != 1) {
+ dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+ return -EINVAL;
+ }
+
struct cxl_dax_region *cxlr_dax __free(put_cxl_dax_region) =
cxl_dax_region_alloc(cxlr);
if (IS_ERR(cxlr_dax))
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 95aee2a037fb..b0c2162b5e37 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -181,6 +181,11 @@ static bool is_static(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
}
+static bool is_dynamic(struct dax_region *dax_region)
+{
+ return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
+}
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -304,6 +309,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
lockdep_assert_held(&dax_region_rwsem);
+ if (is_dynamic(dax_region))
+ return 0;
+
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -1389,6 +1397,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_mapping.attr && is_static(dax_region))
return 0;
+ if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
+ return 0;
if ((a == &dev_attr_align.attr ||
a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 5909171a4428..6e739bfab932 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -15,6 +15,7 @@ struct dax_region;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
#define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_DCD BIT(2)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 3ab39b77843d..f58fe992aa8d 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,32 @@ static int cxl_dax_region_probe(struct device *dev)
struct cxl_region *cxlr = cxlr_dax->cxlr;
struct dax_region *dax_region;
struct dev_dax_data data;
+ resource_size_t dev_size;
+ unsigned long flags;
if (nid == NUMA_NO_NODE)
nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ flags = IORESOURCE_DAX_DCD;
+ else
+ flags = IORESOURCE_DAX_KMEM;
+
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, IORESOURCE_DAX_KMEM);
+ PMD_SIZE, flags);
if (!dax_region)
return -ENOMEM;
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ /* Add empty seed dax device */
+ dev_size = 0;
+ else
+ dev_size = range_len(&cxlr_dax->hpa_range);
+
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = -1,
- .size = range_len(&cxlr_dax->hpa_range),
+ .size = dev_size,
.memmap_on_memory = true,
};
--
2.43.0
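
For readers following the dax side of this patch, here is a minimal user-space sketch (not kernel code) of the two behaviours it adds: a dynamic region reports zero available capacity, and its seed device is created empty. The struct `region_model`, the `model_*` functions and the `MODEL_DAX_*` macros are hypothetical stand-ins invented for illustration. Only the bit positions are taken from the IORESOURCE_DAX_* definitions in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions mirror the IORESOURCE_DAX_* flags shown in drivers/dax/bus.h. */
#define MODEL_DAX_STATIC (1UL << 0)
#define MODEL_DAX_KMEM   (1UL << 1)
#define MODEL_DAX_DCD    (1UL << 2)

/* Hypothetical stand-in for struct dax_region: total size, bytes in use, flags. */
struct region_model {
	uint64_t size;
	uint64_t allocated;
	unsigned long flags;
};

static bool model_is_dynamic(const struct region_model *r)
{
	return (r->flags & MODEL_DAX_DCD) != 0;
}

/* With this patch alone, a dynamic region always reports zero available bytes. */
static uint64_t model_avail_size(const struct region_model *r)
{
	assert(r->allocated <= r->size);
	if (model_is_dynamic(r))
		return 0;
	return r->size - r->allocated;
}

/* Seed device size: empty for dynamic regions, the full range otherwise. */
static uint64_t model_seed_size(const struct region_model *r)
{
	return model_is_dynamic(r) ? 0 : r->size;
}
```

The early return keeps the existing subtraction loop untouched for static and kmem regions, which appears to be the design choice the patch makes as well.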
* Re: [PATCH v10 07/31] cxl/region: Add DC DAX region support
2026-05-23 9:43 ` [PATCH v10 07/31] cxl/region: Add DC DAX region support Anisa Su
@ 2026-05-28 0:16 ` Dave Jiang
2026-06-02 9:22 ` Anisa Su
0 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 0:16 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> DC DAX regions must allow memory to be added or removed dynamically.
> In addition to the quantity of memory available, the
> location of the memory within a DC partition is dynamic, based on the
> extents offered by a device. CXL DAX regions must accommodate the
> dynamic movement of this memory in the management of DAX regions and devices.
>
> Introduce the concept of a dynamic DAX region. Introduce
> create_dynamic_ram_a_region() sysfs entry to create such regions.
> Special case DC-capable regions to create a 0 sized seed DAX device
> to maintain compatibility which requires a default DAX device to hold a
> region reference.
>
> Indicate 0 byte available capacity until such time that capacity is
> added.
>
> Dynamic regions complicate the range mapping of dax devices. There is no
> known use case for range mapping on dynamic regions. Avoid the
> complication by preventing range mapping of dax devices on dynamic
> regions.
>
> Interleaving is deferred for now. Add checks.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Missing Anisa's sign-off
>
> ---
> Changes:
> [anisa: rebase]
> [anisa: change "sparse" naming conventions to "dynamic"]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++---------
> drivers/cxl/core/core.h | 11 +++++++++
> drivers/cxl/core/port.c | 1 +
> drivers/cxl/core/region.c | 33 +++++++++++++++++++++++--
> drivers/cxl/core/region_dax.c | 6 +++++
> drivers/dax/bus.c | 10 ++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 17 +++++++++++--
> 8 files changed, 86 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index c604c7ca6432..3080aef9ad67 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -434,20 +434,20 @@ Description:
> interleave_granularity).
>
>
> -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date: May, 2022, January, 2023
> -KernelVersion: v6.0 (pmem), v6.3 (ram)
> +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_a}_region
> +Date: May, 2022, January, 2023, May 2025
> +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.16 (dynamic_ram_a)
The Date and KernelVersion entries need updating.
> Contact: linux-cxl@vger.kernel.org
> Description:
> (RW) Write a string in the form 'regionZ' to start the process
> - of defining a new persistent, or volatile memory region
> - (interleave-set) within the decode range bounded by root decoder
> - 'decoderX.Y'. The value written must match the current value
> - returned from reading this attribute. An atomic compare exchange
> - operation is done on write to assign the requested id to a
> - region and allocate the region-id for the next creation attempt.
> - EBUSY is returned if the region name written does not match the
> - current cached value.
> + of defining a new persistent, volatile, or dynamic RAM memory
> + region (interleave-set) within the decode range bounded by root
> + decoder 'decoderX.Y'. The value written must match the current
> + value returned from reading this attribute. An atomic compare
> + exchange operation is done on write to assign the requested id
> + to a region and allocate the region-id for the next creation
> + attempt. EBUSY is returned if the region name written does not
> + match the current cached value.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 82ca3a476708..8881cc9323e0 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -6,6 +6,7 @@
>
> #include <cxl/mailbox.h>
> #include <linux/rwsem.h>
> +#include <cxlmem.h>
>
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> @@ -18,6 +19,15 @@ enum cxl_detach_mode {
> DETACH_INVALIDATE,
> };
>
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
return to_cxl_memdev_state(cxlmd->cxlds);
> +}
> +
> #ifdef CONFIG_CXL_REGION
>
> struct cxl_region_context {
> @@ -29,6 +39,7 @@ struct cxl_region_context {
>
> extern struct device_attribute dev_attr_create_pmem_region;
> extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dynamic_ram_a_region;
> extern struct device_attribute dev_attr_delete_region;
> extern struct device_attribute dev_attr_region;
> extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index a7f71f36531f..2d33001dac26 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -337,6 +337,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_qos_class.attr,
> SET_CXL_REGION_ATTR(create_pmem_region)
> SET_CXL_REGION_ATTR(create_ram_region)
> + SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
With this addition, cxl_root_decoder_visible() may need checks for dynamic_ram, covering both create and delete.
> SET_CXL_REGION_ATTR(delete_region)
> NULL,
> };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index edc267c6cf77..7561bf3d8af8 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -493,6 +493,11 @@ static int set_interleave_ways(struct cxl_region *cxlr, int val)
> int save, rc;
> u8 iw;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
> + dev_err(&cxlr->dev, "Interleaving and DCD not supported\n");
> + return -EINVAL;
> + }
> +
> rc = ways_to_eiw(val, &iw);
> if (rc)
> return rc;
> @@ -2389,6 +2394,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> if (sysfs_streq(buf, "\n"))
> rc = detach_target(cxlr, pos);
> else {
> + struct cxl_endpoint_decoder *cxled;
> struct device *dev;
>
> dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2400,8 +2406,14 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> goto out;
> }
>
> - rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> - TASK_INTERRUPTIBLE);
> + cxled = to_cxl_endpoint_decoder(dev);
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + !cxl_dcd_supported(cxled_to_mds(cxled))) {
With the change suggested earlier, cxled_to_mds() can return NULL. That case needs to be handled.
DJ
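
A minimal user-space sketch of the NULL handling requested above, assuming (as the comment implies) that the suggested conversion helper returns NULL when the device state is not embedded in a memdev state. All `model_*` names and the `MODEL_DEVTYPE_*` values are hypothetical stand-ins for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device classes standing in for the kernel's device type enum. */
enum model_devtype { MODEL_DEVTYPE_DEVMEM, MODEL_DEVTYPE_CLASSMEM };

struct model_dev_state {
	enum model_devtype type;
};

/* Class-memory state embeds the generic device state as a member. */
struct model_memdev_state {
	struct model_dev_state cxlds;
	bool dcd_supported;
};

/* Returns NULL when cxlds is not embedded in a model_memdev_state. */
static struct model_memdev_state *model_to_mds(struct model_dev_state *cxlds)
{
	assert(cxlds != NULL);
	if (cxlds->type != MODEL_DEVTYPE_CLASSMEM)
		return NULL;
	/* Open-coded container_of(): step back from the member to its parent. */
	return (struct model_memdev_state *)
		((char *)cxlds - offsetof(struct model_memdev_state, cxlds));
}

/* NULL-safe capability check: no memdev state means no DCD support. */
static bool model_dcd_supported(struct model_dev_state *cxlds)
{
	struct model_memdev_state *mds = model_to_mds(cxlds);

	return mds && mds->dcd_supported;
}
```

Folding the NULL test into the capability check means the caller in store_targetN() could keep a single condition. Whether the series handles it there or at the call site is not stated in this thread.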
> + dev_dbg(dev, "DCD unsupported\n");
> + rc = -EINVAL;
> + goto out;
> + }
> + rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
> out:
> put_device(dev);
> }
> @@ -2750,6 +2762,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> switch (mode) {
> case CXL_PARTMODE_RAM:
> case CXL_PARTMODE_PMEM:
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> break;
> default:
> dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2802,6 +2815,21 @@ static ssize_t create_ram_region_store(struct device *dev,
> }
> DEVICE_ATTR_RW(create_ram_region);
>
> +static ssize_t create_dynamic_ram_a_region_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dynamic_ram_a_region_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_A);
> +}
> +DEVICE_ATTR_RW(create_dynamic_ram_a_region);
> +
> static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -4081,6 +4109,7 @@ static int cxl_region_probe(struct device *dev)
>
> return devm_cxl_add_pmem_region(cxlr);
> case CXL_PARTMODE_RAM:
> + case CXL_PARTMODE_DYNAMIC_RAM_A:
> rc = devm_cxl_region_edac_register(cxlr);
> if (rc)
> dev_dbg(&cxlr->dev, "CXL EDAC registration for region_id=%d failed\n",
> diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
> index de04f78f6ad8..d6bf69155827 100644
> --- a/drivers/cxl/core/region_dax.c
> +++ b/drivers/cxl/core/region_dax.c
> @@ -84,6 +84,12 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> struct device *dev;
> int rc;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + cxlr->params.interleave_ways != 1) {
> + dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> struct cxl_dax_region *cxlr_dax __free(put_cxl_dax_region) =
> cxl_dax_region_alloc(cxlr);
> if (IS_ERR(cxlr_dax))
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 95aee2a037fb..b0c2162b5e37 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -181,6 +181,11 @@ static bool is_static(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> }
>
> +static bool is_dynamic(struct dax_region *dax_region)
> +{
> + return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
> +}
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> @@ -304,6 +309,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>
> lockdep_assert_held(&dax_region_rwsem);
>
> + if (is_dynamic(dax_region))
> + return 0;
> +
> for_each_dax_region_resource(dax_region, res)
> size -= resource_size(res);
> return size;
> @@ -1389,6 +1397,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> return 0;
> if (a == &dev_attr_mapping.attr && is_static(dax_region))
> return 0;
> + if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
> + return 0;
> if ((a == &dev_attr_align.attr ||
> a == &dev_attr_size.attr) && is_static(dax_region))
> return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 5909171a4428..6e739bfab932 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -15,6 +15,7 @@ struct dax_region;
> /* dax bus specific ioresource flags */
> #define IORESOURCE_DAX_STATIC BIT(0)
> #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_DCD BIT(2)
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 3ab39b77843d..f58fe992aa8d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,32 @@ static int cxl_dax_region_probe(struct device *dev)
> struct cxl_region *cxlr = cxlr_dax->cxlr;
> struct dax_region *dax_region;
> struct dev_dax_data data;
> + resource_size_t dev_size;
> + unsigned long flags;
>
> if (nid == NUMA_NO_NODE)
> nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + flags = IORESOURCE_DAX_DCD;
> + else
> + flags = IORESOURCE_DAX_KMEM;
> +
> dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> - PMD_SIZE, IORESOURCE_DAX_KMEM);
> + PMD_SIZE, flags);
> if (!dax_region)
> return -ENOMEM;
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + /* Add empty seed dax device */
> + dev_size = 0;
> + else
> + dev_size = range_len(&cxlr_dax->hpa_range);
> +
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> .id = -1,
> - .size = range_len(&cxlr_dax->hpa_range),
> + .size = dev_size,
> .memmap_on_memory = true,
> };
>
* Re: [PATCH v10 07/31] cxl/region: Add DC DAX region support
2026-05-28 0:16 ` Dave Jiang
@ 2026-06-02 9:22 ` Anisa Su
2026-06-02 15:42 ` Dave Jiang
0 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-06-02 9:22 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price, Ira Weiny
On Wed, May 27, 2026 at 05:16:44PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:43 AM, Anisa Su wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > DC DAX regions must allow memory to be added or removed dynamically.
> > In addition to the quantity of memory available, the
> > location of the memory within a DC partition is dynamic, based on the
> > extents offered by a device. CXL DAX regions must accommodate the
> > dynamic movement of this memory in the management of DAX regions and devices.
> >
> > Introduce the concept of a dynamic DAX region. Introduce
> > create_dynamic_ram_a_region() sysfs entry to create such regions.
> > Special case DC-capable regions to create a 0 sized seed DAX device
> > to maintain compatibility which requires a default DAX device to hold a
> > region reference.
> >
> > Indicate 0 byte available capacity until such time that capacity is
> > added.
> >
> > Dynamic regions complicate the range mapping of dax devices. There is no
> > known use case for range mapping on dynamic regions. Avoid the
> > complication by preventing range mapping of dax devices on dynamic
> > regions.
> >
> > Interleaving is deferred for now. Add checks.
> >
> > Based on an original patch by Navneet Singh.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> Missing Anisa sign off
>
Added!
>
> >
> > ---
> > Changes:
> > [anisa: rebase]
> > [anisa: change "sparse" naming conventions to "dynamic"]
> > ---
> > Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++---------
> > drivers/cxl/core/core.h | 11 +++++++++
> > drivers/cxl/core/port.c | 1 +
> > drivers/cxl/core/region.c | 33 +++++++++++++++++++++++--
> > drivers/cxl/core/region_dax.c | 6 +++++
> > drivers/dax/bus.c | 10 ++++++++
> > drivers/dax/bus.h | 1 +
> > drivers/dax/cxl.c | 17 +++++++++++--
> > 8 files changed, 86 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index c604c7ca6432..3080aef9ad67 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -434,20 +434,20 @@ Description:
> > interleave_granularity).
> >
> >
> > -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> > -Date: May, 2022, January, 2023
> > -KernelVersion: v6.0 (pmem), v6.3 (ram)
> > +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dynamic_ram_a}_region
> > +Date: May, 2022, January, 2023, May 2025
> > +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.16 (dynamic_ram_a)
>
> update
>
updated
> > Contact: linux-cxl@vger.kernel.org
> > Description:
> > (RW) Write a string in the form 'regionZ' to start the process
> > - of defining a new persistent, or volatile memory region
> > - (interleave-set) within the decode range bounded by root decoder
> > - 'decoderX.Y'. The value written must match the current value
> > - returned from reading this attribute. An atomic compare exchange
> > - operation is done on write to assign the requested id to a
> > - region and allocate the region-id for the next creation attempt.
> > - EBUSY is returned if the region name written does not match the
> > - current cached value.
> > + of defining a new persistent, volatile, or dynamic RAM memory
> > + region (interleave-set) within the decode range bounded by root
> > + decoder 'decoderX.Y'. The value written must match the current
> > + value returned from reading this attribute. An atomic compare
> > + exchange operation is done on write to assign the requested id
> > + to a region and allocate the region-id for the next creation
> > + attempt. EBUSY is returned if the region name written does not
> > + match the current cached value.
> >
> >
> > What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 82ca3a476708..8881cc9323e0 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -6,6 +6,7 @@
> >
> > #include <cxl/mailbox.h>
> > #include <linux/rwsem.h>
> > +#include <cxlmem.h>
> >
> > extern const struct device_type cxl_nvdimm_bridge_type;
> > extern const struct device_type cxl_nvdimm_type;
> > @@ -18,6 +19,15 @@ enum cxl_detach_mode {
> > DETACH_INVALIDATE,
> > };
> >
> > +static inline struct cxl_memdev_state *
> > +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +
> > + return container_of(cxlds, struct cxl_memdev_state, cxlds);
>
> return to_cxl_memdev_state(cxlmd->cxlds);
>
done
> > +}
> > +
> > #ifdef CONFIG_CXL_REGION
> >
> > struct cxl_region_context {
> > @@ -29,6 +39,7 @@ struct cxl_region_context {
> >
> > extern struct device_attribute dev_attr_create_pmem_region;
> > extern struct device_attribute dev_attr_create_ram_region;
> > +extern struct device_attribute dev_attr_create_dynamic_ram_a_region;
> > extern struct device_attribute dev_attr_delete_region;
> > extern struct device_attribute dev_attr_region;
> > extern const struct device_type cxl_pmem_region_type;
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index a7f71f36531f..2d33001dac26 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -337,6 +337,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> > &dev_attr_qos_class.attr,
> > SET_CXL_REGION_ATTR(create_pmem_region)
> > SET_CXL_REGION_ATTR(create_ram_region)
> > + SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
>
> With this add, may need to add checks in cxl_root_decoder_visible() for dynamic_ram for create and also delete.
>
So for this check, since there's no CXL_DECODER_F_ bit defined for DCD, I considered
traversing through all endpoints and seeing if they have a DYNAMIC_RAM_A
partition, but that traversal already happens in the store_targetN() path,
which also includes the mode mismatch check.
Specifically, in cxl_region_attach():

	if (cxlds->part[cxled->part].mode != cxlr->mode) {
		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch\n",
			dev_name(&cxled->cxld.dev), cxlr->mode);
		return -EINVAL;
	}

Is it sufficient here to prohibit creating a dynamic ram region if the root
decoder does not support ram?

	if (a == CXL_REGION_ATTR(create_dynamic_ram_a_region) && !can_create_ram(cxlrd))
		return 0;
> > SET_CXL_REGION_ATTR(delete_region)
> > NULL,
> > };
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index edc267c6cf77..7561bf3d8af8 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -493,6 +493,11 @@ static int set_interleave_ways(struct cxl_region *cxlr, int val)
> > int save, rc;
> > u8 iw;
> >
> > + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
> > + dev_err(&cxlr->dev, "Interleaving and DCD not supported\n");
> > + return -EINVAL;
> > + }
> > +
> > rc = ways_to_eiw(val, &iw);
> > if (rc)
> > return rc;
> > @@ -2389,6 +2394,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> > if (sysfs_streq(buf, "\n"))
> > rc = detach_target(cxlr, pos);
> > else {
> > + struct cxl_endpoint_decoder *cxled;
> > struct device *dev;
> >
> > dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > @@ -2400,8 +2406,14 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> > goto out;
> > }
> >
> > - rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> > - TASK_INTERRUPTIBLE);
> > + cxled = to_cxl_endpoint_decoder(dev);
> > + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> > + !cxl_dcd_supported(cxled_to_mds(cxled))) {
>
> cxled_to_mds() can return NULL with the earlier change suggested. Need to handle that
>
Fixed
> DJ
>
Thanks,
Anisa
Also, so that potential future support for multiple DC partitions is not
awkward, I think it would make sense to rename dynamic_ram_a to dynamic_ram_1.
Any objections?
>
> > + dev_dbg(dev, "DCD unsupported\n");
> > + rc = -EINVAL;
> > + goto out;
> > + }
> > + rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
> > out:
> > put_device(dev);
> > }
> > @@ -2750,6 +2762,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> > switch (mode) {
> > case CXL_PARTMODE_RAM:
> > case CXL_PARTMODE_PMEM:
> > + case CXL_PARTMODE_DYNAMIC_RAM_A:
> > break;
> > default:
> > dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> > @@ -2802,6 +2815,21 @@ static ssize_t create_ram_region_store(struct device *dev,
> > }
> > DEVICE_ATTR_RW(create_ram_region);
> >
> > +static ssize_t create_dynamic_ram_a_region_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + return __create_region_show(to_cxl_root_decoder(dev), buf);
> > +}
> > +
> > +static ssize_t create_dynamic_ram_a_region_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t len)
> > +{
> > + return create_region_store(dev, buf, len, CXL_PARTMODE_DYNAMIC_RAM_A);
> > +}
> > +DEVICE_ATTR_RW(create_dynamic_ram_a_region);
> > +
> > static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> > char *buf)
> > {
> > @@ -4081,6 +4109,7 @@ static int cxl_region_probe(struct device *dev)
> >
> > return devm_cxl_add_pmem_region(cxlr);
> > case CXL_PARTMODE_RAM:
> > + case CXL_PARTMODE_DYNAMIC_RAM_A:
> > rc = devm_cxl_region_edac_register(cxlr);
> > if (rc)
> > dev_dbg(&cxlr->dev, "CXL EDAC registration for region_id=%d failed\n",
> > diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
> > index de04f78f6ad8..d6bf69155827 100644
> > --- a/drivers/cxl/core/region_dax.c
> > +++ b/drivers/cxl/core/region_dax.c
> > @@ -84,6 +84,12 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> > struct device *dev;
> > int rc;
> >
> > + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> > + cxlr->params.interleave_ways != 1) {
> > + dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> > + return -EINVAL;
> > + }
> > +
> > struct cxl_dax_region *cxlr_dax __free(put_cxl_dax_region) =
> > cxl_dax_region_alloc(cxlr);
> > if (IS_ERR(cxlr_dax))
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 95aee2a037fb..b0c2162b5e37 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -181,6 +181,11 @@ static bool is_static(struct dax_region *dax_region)
> > return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> > }
> >
> > +static bool is_dynamic(struct dax_region *dax_region)
> > +{
> > + return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
> > +}
> > +
> > bool static_dev_dax(struct dev_dax *dev_dax)
> > {
> > return is_static(dev_dax->region);
> > @@ -304,6 +309,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
> >
> > lockdep_assert_held(&dax_region_rwsem);
> >
> > + if (is_dynamic(dax_region))
> > + return 0;
> > +
> > for_each_dax_region_resource(dax_region, res)
> > size -= resource_size(res);
> > return size;
> > @@ -1389,6 +1397,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> > return 0;
> > if (a == &dev_attr_mapping.attr && is_static(dax_region))
> > return 0;
> > + if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
> > + return 0;
> > if ((a == &dev_attr_align.attr ||
> > a == &dev_attr_size.attr) && is_static(dax_region))
> > return 0444;
> > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> > index 5909171a4428..6e739bfab932 100644
> > --- a/drivers/dax/bus.h
> > +++ b/drivers/dax/bus.h
> > @@ -15,6 +15,7 @@ struct dax_region;
> > /* dax bus specific ioresource flags */
> > #define IORESOURCE_DAX_STATIC BIT(0)
> > #define IORESOURCE_DAX_KMEM BIT(1)
> > +#define IORESOURCE_DAX_DCD BIT(2)
> >
> > struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> > struct range *range, int target_node, unsigned int align,
> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index 3ab39b77843d..f58fe992aa8d 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -13,19 +13,32 @@ static int cxl_dax_region_probe(struct device *dev)
> > struct cxl_region *cxlr = cxlr_dax->cxlr;
> > struct dax_region *dax_region;
> > struct dev_dax_data data;
> > + resource_size_t dev_size;
> > + unsigned long flags;
> >
> > if (nid == NUMA_NO_NODE)
> > nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
> >
> > + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> > + flags = IORESOURCE_DAX_DCD;
> > + else
> > + flags = IORESOURCE_DAX_KMEM;
> > +
> > dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> > - PMD_SIZE, IORESOURCE_DAX_KMEM);
> > + PMD_SIZE, flags);
> > if (!dax_region)
> > return -ENOMEM;
> >
> > + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> > + /* Add empty seed dax device */
> > + dev_size = 0;
> > + else
> > + dev_size = range_len(&cxlr_dax->hpa_range);
> > +
> > data = (struct dev_dax_data) {
> > .dax_region = dax_region,
> > .id = -1,
> > - .size = range_len(&cxlr_dax->hpa_range),
> > + .size = dev_size,
> > .memmap_on_memory = true,
> > };
> >
>
* Re: [PATCH v10 07/31] cxl/region: Add DC DAX region support
2026-06-02 9:22 ` Anisa Su
@ 2026-06-02 15:42 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-06-02 15:42 UTC (permalink / raw)
To: Anisa Su
Cc: linux-cxl, linux-kernel, nvdimm, Dan Williams, Jonathan Cameron,
Davidlohr Bueso, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny
On 6/2/26 2:22 AM, Anisa Su wrote:
> On Wed, May 27, 2026 at 05:16:44PM -0700, Dave Jiang wrote:
>>
>>
>> On 5/23/26 2:43 AM, Anisa Su wrote:
>>> From: Ira Weiny <ira.weiny@intel.com>
<-- snip -->
>>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>>> index a7f71f36531f..2d33001dac26 100644
>>> --- a/drivers/cxl/core/port.c
>>> +++ b/drivers/cxl/core/port.c
>>> @@ -337,6 +337,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>>> &dev_attr_qos_class.attr,
>>> SET_CXL_REGION_ATTR(create_pmem_region)
>>> SET_CXL_REGION_ATTR(create_ram_region)
>>> + SET_CXL_REGION_ATTR(create_dynamic_ram_a_region)
>>
>> With this add, may need to add checks in cxl_root_decoder_visible() for dynamic_ram for create and also delete.
>>
> So for this check, since there's no CXL_DECODER_F_ bit defined for DCD, I considered
> traversing through all endpoints and seeing if they have a DYNAMIC_RAM_A
> partition, but that traversal already happens in the store_targetN() path,
> which also includes the mode mismatch check.
>
> Specifically, in cxl_region_attach:
>
> if (cxlds->part[cxled->part].mode != cxlr->mode) {
> dev_dbg(&cxlr->dev, "%s region mode: %d mismatch\n",
> dev_name(&cxled->cxld.dev), cxlr->mode);
> return -EINVAL;
> }
>
> Is it sufficient here to prohibit creating a dynamic ram region if the root
> decoder does not support ram?
>
> if (a == CXL_REGION_ATTR(create_dynamic_ram_a_region) && !can_create_ram(cxlrd))
> return 0;
>
I think so.
>>> SET_CXL_REGION_ATTR(delete_region)
>>> NULL,
>>> };
>>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>>> index edc267c6cf77..7561bf3d8af8 100644
>>> --- a/drivers/cxl/core/region.c
>>> +++ b/drivers/cxl/core/region.c
>>> @@ -493,6 +493,11 @@ static int set_interleave_ways(struct cxl_region *cxlr, int val)
>>> int save, rc;
>>> u8 iw;
>>>
>>> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A && val != 1) {
>>> + dev_err(&cxlr->dev, "Interleaving and DCD not supported\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> rc = ways_to_eiw(val, &iw);
>>> if (rc)
>>> return rc;
>>> @@ -2389,6 +2394,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>>> if (sysfs_streq(buf, "\n"))
>>> rc = detach_target(cxlr, pos);
>>> else {
>>> + struct cxl_endpoint_decoder *cxled;
>>> struct device *dev;
>>>
>>> dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
>>> @@ -2400,8 +2406,14 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>>> goto out;
>>> }
>>>
>>> - rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
>>> - TASK_INTERRUPTIBLE);
>>> + cxled = to_cxl_endpoint_decoder(dev);
>>> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
>>> + !cxl_dcd_supported(cxled_to_mds(cxled))) {
>>
>> cxled_to_mds() can return NULL with the change suggested earlier. That needs to be handled.
>>
> Fixed
>> DJ
>>
> Thanks,
> Anisa
>
> Also, so that potential future support for multiple DC partitions is not
> awkward, I think it would make sense to rename dynamic_ram_a to
> dynamic_ram_1. Any objections?
No objections from me. Seems reasonable.
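
The visibility rule agreed on in this reply can be sketched in userspace C as below. This is only an illustration under stated assumptions: `enum region_attr`, `struct root_decoder_caps` and `attr_visible()` are hypothetical stand-ins for the attribute pointers, the root decoder flags and cxl_root_decoder_visible(), and the delete_region rule (hidden when the decoder supports neither pmem nor ram) is assumed rather than taken from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute identifiers; the kernel compares attribute pointers. */
enum region_attr {
	ATTR_CREATE_PMEM_REGION,
	ATTR_CREATE_RAM_REGION,
	ATTR_CREATE_DYNAMIC_RAM_REGION,
	ATTR_DELETE_REGION,
};

/* Hypothetical stand-in for the capability flags of a root decoder. */
struct root_decoder_caps {
	bool pmem;
	bool ram;
};

/* Return the mode bits to expose for the attribute, or 0 to hide it. */
static unsigned int attr_visible(const struct root_decoder_caps *caps,
				 enum region_attr attr, unsigned int mode)
{
	assert(caps != NULL);

	switch (attr) {
	case ATTR_CREATE_PMEM_REGION:
		return caps->pmem ? mode : 0;
	case ATTR_CREATE_RAM_REGION:
	case ATTR_CREATE_DYNAMIC_RAM_REGION:
		/* No DCD-specific decoder flag exists, so reuse the RAM check. */
		return caps->ram ? mode : 0;
	case ATTR_DELETE_REGION:
		return (caps->pmem || caps->ram) ? mode : 0;
	}
	return 0;
}
```

With this shape, a pmem-only root decoder hides the dynamic RAM create attribute, while the per-endpoint mode mismatch is still caught later on the attach path.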
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 08/31] cxl/events: Split event msgnum configuration from irq setup
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (6 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 07/31] cxl/region: Add DC DAX region support Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-23 9:43 ` [PATCH v10 09/31] cxl/pci: Factor out interrupt policy check Anisa Su
` (24 subsequent siblings)
32 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni,
Li Ming
From: Ira Weiny <ira.weiny@intel.com>
Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal. BIOS may have control over non-DCD event
processing. DCD interrupt configuration needs to be separate from
memory event interrupt configuration.
Split cxl_event_config_msgnums() from irq setup in preparation for
separate DCD interrupts configuration.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
---
drivers/cxl/pci.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 60f9fa05d9ef..35942b2ace53 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -599,35 +599,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
return cxl_event_get_int_policy(mds, policy);
}
-static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
+static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
- struct cxl_event_interrupt_policy policy;
int rc;
- rc = cxl_event_config_msgnums(mds, &policy);
- if (rc)
- return rc;
-
- rc = cxl_event_req_irq(cxlds, policy.info_settings);
+ rc = cxl_event_req_irq(cxlds, policy->info_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.warn_settings);
+ rc = cxl_event_req_irq(cxlds, policy->warn_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.failure_settings);
+ rc = cxl_event_req_irq(cxlds, policy->failure_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
+ rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
return rc;
@@ -674,11 +670,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
return -EBUSY;
}
+ rc = cxl_event_config_msgnums(mds, &policy);
+ if (rc)
+ return rc;
+
rc = cxl_mem_alloc_event_buf(mds);
if (rc)
return rc;
- rc = cxl_event_irqsetup(mds);
+ rc = cxl_event_irqsetup(mds, &policy);
if (rc)
return rc;
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v10 09/31] cxl/pci: Factor out interrupt policy check
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (7 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 08/31] cxl/events: Split event msgnum configuration from irq setup Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-23 9:43 ` [PATCH v10 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
` (23 subsequent siblings)
32 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni,
Li Ming, Dan Williams
From: Ira Weiny <ira.weiny@intel.com>
Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal. BIOS may have control over non-DCD event
processing. DCD interrupt configuration needs to be separate from
memory event interrupt configuration.
Factor out event interrupt setting validation.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
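
Editorial sketch of the check this patch factors out, written as standalone C. The two-bit mode field and the values 0x01 (MSI/MSI-X) and 0x02 (firmware) are assumptions about the settings byte layout, and `struct int_policy`, `int_is_fw()` and `validate_mem_policy()` are illustrative names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: interrupt mode in bits 1:0 of each settings byte. */
#define INT_MODE_MASK 0x03u

enum int_mode {
	INT_NONE = 0x00,
	INT_MSI_MSIX = 0x01,
	INT_FW = 0x02,
};

struct int_policy {
	uint8_t info_settings;
	uint8_t warn_settings;
	uint8_t failure_settings;
	uint8_t fatal_settings;
};

static bool int_is_fw(uint8_t setting)
{
	return (setting & INT_MODE_MASK) == INT_FW;
}

/* True when none of the four memory event logs is firmware controlled. */
static bool validate_mem_policy(const struct int_policy *policy)
{
	assert(policy != NULL);

	return !(int_is_fw(policy->info_settings) ||
		 int_is_fw(policy->warn_settings) ||
		 int_is_fw(policy->failure_settings) ||
		 int_is_fw(policy->fatal_settings));
}
```

Keeping the check in its own helper lets a later caller skip it when firmware legitimately owns the memory event logs.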
---
drivers/cxl/pci.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 35942b2ace53..8d12c684d670 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -639,6 +639,21 @@ static bool cxl_event_int_is_fw(u8 setting)
return mode == CXL_INT_FW;
}
+static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
+{
+ if (cxl_event_int_is_fw(policy->info_settings) ||
+ cxl_event_int_is_fw(policy->warn_settings) ||
+ cxl_event_int_is_fw(policy->failure_settings) ||
+ cxl_event_int_is_fw(policy->fatal_settings)) {
+ dev_err(mds->cxlds.dev,
+ "FW still in control of Event Logs despite _OSC settings\n");
+ return false;
+ }
+
+ return true;
+}
+
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
@@ -661,14 +676,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- if (cxl_event_int_is_fw(policy.info_settings) ||
- cxl_event_int_is_fw(policy.warn_settings) ||
- cxl_event_int_is_fw(policy.failure_settings) ||
- cxl_event_int_is_fw(policy.fatal_settings)) {
- dev_err(mds->cxlds.dev,
- "FW still in control of Event Logs despite _OSC settings\n");
+ if (!cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- }
rc = cxl_event_config_msgnums(mds, &policy);
if (rc)
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v10 10/31] cxl/mem: Configure dynamic capacity interrupts
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (8 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 09/31] cxl/pci: Factor out interrupt policy check Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 16:21 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 11/31] cxl/core: Return endpoint decoder information from region search Anisa Su
` (22 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Anisa Su
From: Ira Weiny <ira.weiny@intel.com>
Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism. The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications. Firmware can't
configure DCD events to be FW controlled but can retain control of
memory events.
Configure DCD event log interrupts on devices supporting dynamic
capacity. Disable DCD if interrupts are not supported.
Care is taken to preserve the interrupt policy set by the FW if FW first
has been selected by the BIOS.
Accept the 4-byte CXL 2.0 reply on GET Event Interrupt Policy by setting
min_out to CXL_EVENT_INT_POLICY_BASE_SIZE; pre-CXL 3.1 firmware omits
dcd_settings and would otherwise fail the size check.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: rebase]
[anisa: accept 4-byte CXL 2.0 GET reply via min_out]
[anisa: drop Reviewed-by tags now that the patch carries new changes]
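
Editorial sketch of the payload sizing described in the commit message, as standalone C. It assumes the policy is five one-byte fields with dcd_settings last; `set_policy_size()` and `get_reply_len_ok()` are hypothetical helpers, and the base size is derived with offsetof() here, whereas the patch defines it as a literal 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct int_policy {
	uint8_t info_settings;
	uint8_t warn_settings;
	uint8_t failure_settings;
	uint8_t fatal_settings;
	uint8_t dcd_settings;	/* only present from CXL 3.1 onwards */
};

static_assert(sizeof(struct int_policy) == 5, "policy must have no padding");

/* Size of the pre-DCD payload: info, warn, failure, fatal. */
#define INT_POLICY_BASE_SIZE offsetof(struct int_policy, dcd_settings)

/* Bytes to send with Set Event Interrupt Policy. */
static size_t set_policy_size(bool dcd_supported)
{
	size_t size = INT_POLICY_BASE_SIZE;

	if (dcd_supported)
		size += sizeof(uint8_t);	/* append dcd_settings */
	return size;
}

/* A Get reply is usable if it covers the base fields and fits the buffer. */
static bool get_reply_len_ok(size_t reply_len)
{
	return reply_len >= INT_POLICY_BASE_SIZE &&
	       reply_len <= sizeof(struct int_policy);
}
```

Because the caller zero-initializes the policy, a 4-byte reply simply leaves dcd_settings at 0.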
---
drivers/cxl/cxlmem.h | 2 ++
drivers/cxl/pci.c | 75 ++++++++++++++++++++++++++++++++++++--------
2 files changed, 64 insertions(+), 13 deletions(-)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 10175ca3b7ee..65c009b02da6 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -218,7 +218,9 @@ struct cxl_event_interrupt_policy {
u8 warn_settings;
u8 failure_settings;
u8 fatal_settings;
+ u8 dcd_settings;
} __packed;
+#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
/**
* struct cxl_event_state - Event log driver state
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 8d12c684d670..83617439bbd3 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -557,6 +557,8 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
.opcode = CXL_MBOX_OP_GET_EVT_INT_POLICY,
.payload_out = policy,
.size_out = sizeof(*policy),
+ /* CXL 2.0 firmware omits dcd_settings; accept the shorter reply */
+ .min_out = CXL_EVENT_INT_POLICY_BASE_SIZE,
};
int rc;
@@ -569,23 +571,34 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
}
static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
- struct cxl_event_interrupt_policy *policy)
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
{
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
struct cxl_mbox_cmd mbox_cmd;
int rc;
- *policy = (struct cxl_event_interrupt_policy) {
- .info_settings = CXL_INT_MSI_MSIX,
- .warn_settings = CXL_INT_MSI_MSIX,
- .failure_settings = CXL_INT_MSI_MSIX,
- .fatal_settings = CXL_INT_MSI_MSIX,
- };
+ /* memory event policy is left if FW has control */
+ if (native_cxl) {
+ *policy = (struct cxl_event_interrupt_policy) {
+ .info_settings = CXL_INT_MSI_MSIX,
+ .warn_settings = CXL_INT_MSI_MSIX,
+ .failure_settings = CXL_INT_MSI_MSIX,
+ .fatal_settings = CXL_INT_MSI_MSIX,
+ .dcd_settings = 0,
+ };
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ policy->dcd_settings = CXL_INT_MSI_MSIX;
+ size_in += sizeof(policy->dcd_settings);
+ }
mbox_cmd = (struct cxl_mbox_cmd) {
.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
.payload_in = policy,
- .size_in = sizeof(*policy),
+ .size_in = size_in,
};
rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
@@ -632,6 +645,30 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
return 0;
}
+static int cxl_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
+{
+ struct cxl_dev_state *cxlds = &mds->cxlds;
+ int rc;
+
+ if (native_cxl) {
+ rc = cxl_event_irqsetup(mds, policy);
+ if (rc)
+ return rc;
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
+ if (rc) {
+ dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
+ cxl_disable_dcd(mds);
+ }
+ }
+
+ return 0;
+}
+
static bool cxl_event_int_is_fw(u8 setting)
{
u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
@@ -657,18 +694,26 @@ static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
- struct cxl_event_interrupt_policy policy;
+ struct cxl_event_interrupt_policy policy = { 0 };
+ bool native_cxl = host_bridge->native_cxl_error;
int rc;
/*
* When BIOS maintains CXL error reporting control, it will process
* event records. Only one agent can do so.
+ *
+ * If BIOS has control of events and DCD is not supported skip event
+ * configuration.
*/
- if (!host_bridge->native_cxl_error)
+ if (!native_cxl && !cxl_dcd_supported(mds))
return 0;
if (!irq_avail) {
dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
+ if (cxl_dcd_supported(mds)) {
+ dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
+ cxl_disable_dcd(mds);
+ }
return 0;
}
@@ -676,10 +721,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- if (!cxl_event_validate_mem_policy(mds, &policy))
+ if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- rc = cxl_event_config_msgnums(mds, &policy);
+ rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
if (rc)
return rc;
@@ -687,12 +732,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- rc = cxl_event_irqsetup(mds, &policy);
+ rc = cxl_irqsetup(mds, &policy, native_cxl);
if (rc)
return rc;
cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
+ dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
+ native_cxl ? "OS" : "BIOS",
+ cxl_dcd_supported(mds) ? "supported" : "not supported");
+
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 10/31] cxl/mem: Configure dynamic capacity interrupts
2026-05-23 9:43 ` [PATCH v10 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
@ 2026-05-28 16:21 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 16:21 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny, Anisa Su
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism. The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications. Firmware can't
> configure DCD events to be FW controlled but can retain control of
> memory events.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity. Disable DCD if interrupts are not supported.
>
> Care is taken to preserve the interrupt policy set by the FW if FW first
> has been selected by the BIOS.
>
> Accept the 4-byte CXL 2.0 reply on GET Event Interrupt Policy by setting
> min_out to CXL_EVENT_INT_POLICY_BASE_SIZE; pre-CXL 3.1 firmware omits
> dcd_settings and would otherwise fail the size check.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: rebase]
> [anisa: accept 4-byte CXL 2.0 GET reply via min_out]
> [anisa: drop Reviewed-by tags now that the patch carries new changes]
> ---
> drivers/cxl/cxlmem.h | 2 ++
> drivers/cxl/pci.c | 75 ++++++++++++++++++++++++++++++++++++--------
> 2 files changed, 64 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 10175ca3b7ee..65c009b02da6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -218,7 +218,9 @@ struct cxl_event_interrupt_policy {
> u8 warn_settings;
> u8 failure_settings;
> u8 fatal_settings;
> + u8 dcd_settings;
> } __packed;
> +#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
>
> /**
> * struct cxl_event_state - Event log driver state
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 8d12c684d670..83617439bbd3 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -557,6 +557,8 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> .opcode = CXL_MBOX_OP_GET_EVT_INT_POLICY,
> .payload_out = policy,
> .size_out = sizeof(*policy),
> + /* CXL 2.0 firmware omits dcd_settings; accept the shorter reply */
> + .min_out = CXL_EVENT_INT_POLICY_BASE_SIZE,
> };
> int rc;
>
> @@ -569,23 +571,34 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> }
>
> static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> - struct cxl_event_interrupt_policy *policy)
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> {
> struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
> struct cxl_mbox_cmd mbox_cmd;
> int rc;
>
> - *policy = (struct cxl_event_interrupt_policy) {
> - .info_settings = CXL_INT_MSI_MSIX,
> - .warn_settings = CXL_INT_MSI_MSIX,
> - .failure_settings = CXL_INT_MSI_MSIX,
> - .fatal_settings = CXL_INT_MSI_MSIX,
> - };
> + /* memory event policy is left if FW has control */
> + if (native_cxl) {
> + *policy = (struct cxl_event_interrupt_policy) {
> + .info_settings = CXL_INT_MSI_MSIX,
> + .warn_settings = CXL_INT_MSI_MSIX,
> + .failure_settings = CXL_INT_MSI_MSIX,
> + .fatal_settings = CXL_INT_MSI_MSIX,
> + .dcd_settings = 0,
> + };
> + }
> +
> + if (cxl_dcd_supported(mds)) {
> + policy->dcd_settings = CXL_INT_MSI_MSIX;
> + size_in += sizeof(policy->dcd_settings);
> + }
>
> mbox_cmd = (struct cxl_mbox_cmd) {
> .opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
> .payload_in = policy,
> - .size_in = sizeof(*policy),
> + .size_in = size_in,
> };
>
> rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> @@ -632,6 +645,30 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> return 0;
> }
>
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + int rc;
> +
> + if (native_cxl) {
> + rc = cxl_event_irqsetup(mds, policy);
> + if (rc)
> + return rc;
> + }
> +
> + if (cxl_dcd_supported(mds)) {
> + rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> + if (rc) {
> + dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
> + cxl_disable_dcd(mds);
> + }
> + }
> +
> + return 0;
> +}
> +
> static bool cxl_event_int_is_fw(u8 setting)
> {
> u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
> @@ -657,18 +694,26 @@ static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
> static int cxl_event_config(struct pci_host_bridge *host_bridge,
> struct cxl_memdev_state *mds, bool irq_avail)
> {
> - struct cxl_event_interrupt_policy policy;
> + struct cxl_event_interrupt_policy policy = { 0 };
> + bool native_cxl = host_bridge->native_cxl_error;
> int rc;
>
> /*
> * When BIOS maintains CXL error reporting control, it will process
> * event records. Only one agent can do so.
> + *
> + * If BIOS has control of events and DCD is not supported skip event
> + * configuration.
> */
> - if (!host_bridge->native_cxl_error)
> + if (!native_cxl && !cxl_dcd_supported(mds))
> return 0;
>
> if (!irq_avail) {
> dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
> + if (cxl_dcd_supported(mds)) {
> + dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
> + cxl_disable_dcd(mds);
> + }
> return 0;
> }
>
> @@ -676,10 +721,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - if (!cxl_event_validate_mem_policy(mds, &policy))
> + if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
> return -EBUSY;
>
> - rc = cxl_event_config_msgnums(mds, &policy);
> + rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
> if (rc)
> return rc;
>
> @@ -687,12 +732,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - rc = cxl_event_irqsetup(mds, &policy);
> + rc = cxl_irqsetup(mds, &policy, native_cxl);
> if (rc)
> return rc;
>
> cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
This issue was probably always there, but should this check native_cxl so that BIOS-owned events are not retrieved?
if (native_cxl)
cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
Also, CXLDEV_EVENT_STATUS_ALL is missing bit 4 (Dynamic Capacity Event Log). See CXL r4.0 8.2.9.3.1 Table 8-203.
DJ
>
> + dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
> + native_cxl ? "OS" : "BIOS",
> + cxl_dcd_supported(mds) ? "supported" : "not supported");
> +
> return 0;
> }
>
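
One possible reading of the two review comments above, sketched as standalone C: pick the event logs to drain at init from who owns memory events and whether DCD is in use. The macro names and `initial_drain_mask()` are hypothetical, bits 0 to 3 are assumed to be the four memory event logs, and draining the DCD log when firmware owns the others goes a step beyond the `if (native_cxl)` guard suggested in the review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_STATUS_INFO  (1u << 0)
#define EVENT_STATUS_WARN  (1u << 1)
#define EVENT_STATUS_FAIL  (1u << 2)
#define EVENT_STATUS_FATAL (1u << 3)
#define EVENT_STATUS_DCD   (1u << 4)	/* bit 4, per the review comment */

#define EVENT_STATUS_MEM (EVENT_STATUS_INFO | EVENT_STATUS_WARN | \
			  EVENT_STATUS_FAIL | EVENT_STATUS_FATAL)

static_assert((EVENT_STATUS_MEM & EVENT_STATUS_DCD) == 0,
	      "memory and DCD log bits must not overlap");

/* Logs the OS should drain once interrupts are configured. */
static uint32_t initial_drain_mask(bool native_cxl, bool dcd_supported)
{
	uint32_t mask = 0;

	if (native_cxl)
		mask |= EVENT_STATUS_MEM;	/* OS owns the memory event logs */
	if (dcd_supported)
		mask |= EVENT_STATUS_DCD;	/* DCD events are never FW owned */
	return mask;
}
```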
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 11/31] cxl/core: Return endpoint decoder information from region search
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (9 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 10/31] cxl/mem: Configure dynamic capacity interrupts Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-23 9:43 ` [PATCH v10 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
` (21 subsequent siblings)
32 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni,
Li Ming
From: Ira Weiny <ira.weiny@intel.com>
cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
The search involves finding the device endpoint decoder as well.
Dynamic capacity extent processing uses the endpoint decoder HPA
information to calculate the HPA offset. In addition, well behaved
extents should be contained within an endpoint decoder.
Return the endpoint decoder found to be used in subsequent DCD code.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: rebase]
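
Editorial sketch of the optional out-parameter pattern this patch adds, as standalone C. The flat decoder array, `struct region`, `struct endpoint_decoder` and `dpa_to_region()` are simplified, hypothetical stand-ins; the real search walks the endpoint port's child devices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
	int id;
};

struct endpoint_decoder {
	uint64_t dpa_start;
	uint64_t dpa_len;
	struct region *region;	/* NULL when the decoder is not attached */
};

/*
 * Find the region that maps dpa. If found is non-NULL it receives the
 * matching decoder, or NULL when nothing matches.
 */
static struct region *dpa_to_region(struct endpoint_decoder *decoders, size_t n,
				    uint64_t dpa, struct endpoint_decoder **found)
{
	struct endpoint_decoder *match = NULL;

	assert(decoders != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct endpoint_decoder *d = &decoders[i];

		if (d->region && dpa >= d->dpa_start &&
		    dpa - d->dpa_start < d->dpa_len) {
			match = d;
			break;
		}
	}

	/* Callers that only need the region pass NULL. */
	if (found)
		*found = match;
	return match ? match->region : NULL;
}
```

Existing callers keep their behavior by passing NULL, which is why the poison and trace paths only gain an extra argument.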
---
drivers/cxl/core/core.h | 6 ++++--
drivers/cxl/core/mbox.c | 2 +-
drivers/cxl/core/memdev.c | 4 ++--
drivers/cxl/core/region.c | 8 +++++++-
4 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 8881cc9323e0..14723cfd05f0 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -58,7 +58,8 @@ int cxl_decoder_detach(struct cxl_region *cxlr,
int cxl_region_init(void);
void cxl_region_exit(void);
int cxl_get_poison_by_endpoint(struct cxl_port *port);
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled);
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
int devm_cxl_add_dax_region(struct cxl_region *cxlr);
@@ -71,7 +72,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
return ULLONG_MAX;
}
static inline
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled)
{
return NULL;
}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index f9a5e21f5d09..01b1a318f34f 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -968,7 +968,7 @@ void cxl_event_trace_record(struct cxl_memdev *cxlmd,
guard(rwsem_read)(&cxl_rwsem.dpa);
dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr) {
u64 cache_size = cxlr->params.cache_size;
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 064cfd628577..b8b3489f69e5 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -320,7 +320,7 @@ int cxl_inject_poison_locked(struct cxl_memdev *cxlmd, u64 dpa)
if (rc)
return rc;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
dev_warn_once(cxl_mbox->host,
"poison inject dpa:%#llx region: %s\n", dpa,
@@ -389,7 +389,7 @@ int cxl_clear_poison_locked(struct cxl_memdev *cxlmd, u64 dpa)
if (rc)
return rc;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
dev_warn_once(cxl_mbox->host,
"poison clear dpa:%#llx region: %s\n", dpa,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 7561bf3d8af8..733d77c07493 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2991,6 +2991,7 @@ int cxl_get_poison_by_endpoint(struct cxl_port *port)
struct cxl_dpa_to_region_context {
struct cxl_region *cxlr;
u64 dpa;
+ struct cxl_endpoint_decoder *cxled;
};
static int __cxl_dpa_to_region(struct device *dev, void *arg)
@@ -3024,11 +3025,13 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
dev_name(dev));
ctx->cxlr = cxlr;
+ ctx->cxled = cxled;
return 1;
}
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled)
{
struct cxl_dpa_to_region_context ctx;
struct cxl_port *port = cxlmd->endpoint;
@@ -3042,6 +3045,9 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
if (cxl_num_decoders_committed(port))
device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
+ if (cxled)
+ *cxled = ctx.cxled;
+
return ctx.cxlr;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v10 12/31] cxl/mem: Set up framework for handling DC Events
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (10 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 11/31] cxl/core: Return endpoint decoder information from region search Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 16:40 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
` (20 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
Add support for receiving DC event records, but defer the real
add/release logic to subsequent commits. For now, simply refuse all
extents for DC_ADD and ack all DC_RELEASE events. Forced release is
currently unsupported.
In order, this commit adds the following:
1. Learn about DC Event Records and how to respond to them
* cxl_mem_get_event_records() learns about the DC Event record.
Records of that type are routed to cxl_handle_dcd_event_records().
* cxl_handle_dcd_event_records() switches on event_type:
- DCD_ADD_CAPACITY -> handle_add_event()
- DCD_RELEASE_CAPACITY -> cxl_rm_extent()
- DCD_FORCED_CAPACITY_RELEASE is logged and ignored (FM/device-only).
* cxl_send_dc_response() sends the reply mailbox commands
ADD_DC_RESPONSE / RELEASE_DC
2. Add stubs for DC_ADD and DC_RELEASE logic
* handle_add_event() stages incoming extents onto
mds->add_ctx.pending_extents and, when More=0 closes the chain,
replies with an empty ADD_DC_RESPONSE, refusing all extents for now
* cxl_rm_extent() acks the release via memdev_release_extent() so the
device's view stays consistent; all releases can be acked because
no offered extents are accepted or used yet.
3. Structural setup for later commits:
* struct dc_extent, struct cxl_dc_tag_group, and pending_add_ctx
set up the stage for the real DC_ADD path, which will enforce
tag/grouping semantics
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: restructured from the original "Process dynamic partition
events" monolith; this commit lands only the wire-level intake and
dispatches the add/release logic to stubbed handlers. The handlers
are fleshed out in subsequent commits.]
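
Editorial sketch of how a response can be split across mailbox messages using the More flag, as standalone C. `struct chunk`, `plan_response()` and the bit value of `RESPONSE_MORE` are assumptions for illustration; the patch computes the per-message limit from the mailbox payload size, which is taken here as the `max_per_msg` parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RESPONSE_MORE 0x01u	/* assumed bit position for the More flag */

struct chunk {
	uint32_t nr_extents;
	uint8_t flags;
};

/*
 * Split cnt extents into messages of at most max_per_msg entries each.
 * Returns the number of chunks written to out, or 0 if out is too small.
 */
static size_t plan_response(uint32_t cnt, uint32_t max_per_msg,
			    struct chunk *out, size_t out_len)
{
	size_t n = 0;

	assert(max_per_msg > 0);

	/* An empty response is still one message: it refuses every extent. */
	if (cnt == 0) {
		if (out_len < 1)
			return 0;
		out[0] = (struct chunk){ .nr_extents = 0, .flags = 0 };
		return 1;
	}

	while (cnt > 0) {
		uint32_t take = cnt < max_per_msg ? cnt : max_per_msg;

		if (n == out_len)
			return 0;
		cnt -= take;
		out[n].nr_extents = take;
		/* Every message except the last one carries More. */
		out[n].flags = cnt ? RESPONSE_MORE : 0;
		n++;
	}
	return n;
}
```

For example, seven extents with room for three per message give three messages of 3, 3 and 1 extents, with More set on the first two only.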
---
drivers/cxl/core/mbox.c | 246 +++++++++++++++++++++++++++++++++++++++-
drivers/cxl/cxl.h | 73 +++++++++++-
drivers/cxl/cxlmem.h | 45 ++++++++
include/cxl/event.h | 38 +++++++
4 files changed, 400 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 01b1a318f34f..1b38f34538f3 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -5,6 +5,7 @@
#include <linux/ktime.h>
#include <linux/mutex.h>
#include <linux/unaligned.h>
+#include <linux/list.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -1102,6 +1103,238 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
return rc;
}
+static int send_one_response(struct cxl_mailbox *cxl_mbox,
+ struct cxl_mbox_dc_response *response,
+ int opcode, u32 extent_list_size, u8 flags)
+{
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = opcode,
+ .size_in = struct_size(response, extent_list, extent_list_size),
+ .payload_in = response,
+ };
+
+ response->extent_list_size = cpu_to_le32(extent_list_size);
+ response->flags = flags;
+ return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+}
+
+static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
+ struct list_head *extent_list, int cnt)
+{
+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ struct cxl_mbox_dc_response *p;
+ struct cxl_extent_list_node *pos, *tmp;
+ struct cxl_extent *extent;
+ u32 pl_index;
+
+ size_t pl_size = struct_size(p, extent_list, cnt);
+ u32 max_extents = cnt;
+
+ /* May have to use more bit on response. */
+ if (pl_size > cxl_mbox->payload_size) {
+ max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
+ sizeof(struct updated_extent_list);
+ pl_size = struct_size(p, extent_list, max_extents);
+ }
+
+ struct cxl_mbox_dc_response *response __free(kfree) =
+ kzalloc(pl_size, GFP_KERNEL);
+ if (!response)
+ return -ENOMEM;
+
+ if (cnt == 0)
+ return send_one_response(cxl_mbox, response, opcode, 0, 0);
+
+ pl_index = 0;
+ list_for_each_entry_safe(pos, tmp, extent_list, list) {
+ extent = pos->extent;
+ response->extent_list[pl_index].dpa_start = extent->start_dpa;
+ response->extent_list[pl_index].length = extent->length;
+ pl_index++;
+
+ if (pl_index == max_extents) {
+ u8 flags = 0;
+ int rc;
+
+ if (pl_index < cnt)
+ flags |= CXL_DCD_EVENT_MORE;
+ rc = send_one_response(cxl_mbox, response, opcode,
+ pl_index, flags);
+ if (rc)
+ return rc;
+ cnt -= pl_index;
+ if (cnt < max_extents)
+ max_extents = cnt;
+ pl_index = 0;
+ }
+ }
+
+ if (!pl_index) /* nothing more to do */
+ return 0;
+ return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
+}
+
+static void delete_extent_node(struct cxl_extent_list_node *node)
+{
+ list_del(&node->list);
+ kfree(node->extent);
+ kfree(node);
+}
+
+static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_extent_list_node *node;
+ LIST_HEAD(extent_list);
+
+ dev_dbg(dev, "Release response dpa %pra\n", range);
+
+ node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return;
+
+ node->extent = kzalloc(sizeof(*node->extent), GFP_KERNEL);
+ if (!node->extent) {
+ kfree(node);
+ return;
+ }
+
+ node->extent->start_dpa = cpu_to_le64(range->start);
+ node->extent->length = cpu_to_le64(range_len(range));
+ list_add_tail(&node->list, &extent_list);
+
+ if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
+ dev_dbg(dev, "Failed to release %pra\n", range);
+
+ delete_extent_node(node);
+}
+
+static void clear_pending_extents(void *_mds)
+{
+ struct cxl_memdev_state *mds = _mds;
+ struct cxl_extent_list_node *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list)
+ delete_extent_node(pos);
+ mds->add_ctx.group = NULL;
+}
+
+static int add_to_pending_list(struct list_head *pending_list,
+ struct cxl_extent *to_add)
+{
+ struct cxl_extent_list_node *node;
+ struct cxl_extent *extent;
+
+ node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+ extent = kmemdup(to_add, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ node->extent = extent;
+ list_add_tail(&node->list, pending_list);
+ return 0;
+}
+
+/*
+ * Stub: stage extents on the pending list and reply with an empty
+ * ADD_DC_RESPONSE on More=0 (refuse all). A later commit replaces
+ * the no-op tail with the real Add pipeline that surfaces a dax
+ * device per accepted extent.
+ */
+static int handle_add_event(struct cxl_memdev_state *mds,
+ struct cxl_event_dcd *event)
+{
+ struct device *dev = mds->cxlds.dev;
+ int rc;
+
+ rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
+ if (rc)
+ return rc;
+
+ if (event->flags & CXL_DCD_EVENT_MORE) {
+ dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+ return 0;
+ }
+
+ rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+ &mds->add_ctx.pending_extents, 0);
+ clear_pending_extents(mds);
+ return rc;
+}
+
+/*
+ * Stub: ack the release back to the device so it knows we are not
+ * using the range. A later commit replaces this with the real
+ * teardown that walks the region's tag group and tears down the
+ * member dc_extent devices.
+ */
+static int cxl_rm_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct range dpa_range = {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+
+ memdev_release_extent(mds, &dpa_range);
+ return 0;
+}
+
+static char *cxl_dcd_evt_type_str(u8 type)
+{
+ switch (type) {
+ case DCD_ADD_CAPACITY:
+ return "add";
+ case DCD_RELEASE_CAPACITY:
+ return "release";
+ case DCD_FORCED_CAPACITY_RELEASE:
+ return "force release";
+ default:
+ break;
+ }
+
+ return "<unknown>";
+}
+
+static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+ struct cxl_event_record_raw *raw_rec)
+{
+ struct cxl_event_dcd *event = &raw_rec->event.dcd;
+ struct cxl_extent *extent = &event->extent;
+ struct device *dev = mds->cxlds.dev;
+ uuid_t *id = &raw_rec->id;
+ int rc;
+
+ if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
+ return;
+
+ dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
+ cxl_dcd_evt_type_str(event->event_type),
+ le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
+
+ switch (event->event_type) {
+ case DCD_ADD_CAPACITY:
+ rc = handle_add_event(mds, event);
+ break;
+ case DCD_RELEASE_CAPACITY:
+ rc = cxl_rm_extent(mds, &event->extent);
+ break;
+ case DCD_FORCED_CAPACITY_RELEASE:
+ dev_err_ratelimited(dev, "Forced release event ignored.\n");
+ rc = 0;
+ break;
+ default:
+ rc = -EINVAL;
+ break;
+ }
+
+ if (rc)
+ dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
+}
+
static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
enum cxl_event_log_type type)
{
@@ -1138,9 +1371,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
if (!nr_rec)
break;
- for (i = 0; i < nr_rec; i++)
+ for (i = 0; i < nr_rec; i++) {
__cxl_event_trace_record(cxlmd, type,
&payload->records[i]);
+ if (type == CXL_EVENT_TYPE_DCD)
+ cxl_handle_dcd_event_records(mds,
+ &payload->records[i]);
+ }
if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
trace_cxl_overflow(cxlmd, type, payload);
@@ -1172,6 +1409,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
{
dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
+ if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
+ cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
if (status & CXLDEV_EVENT_STATUS_FATAL)
cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
if (status & CXLDEV_EVENT_STATUS_FAIL)
@@ -1769,6 +2008,11 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
}
mutex_init(&mds->event.log_lock);
+ INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
+
+ rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
+ if (rc)
+ return ERR_PTR(rc);
rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
if (rc == -EOPNOTSUPP)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 1297594beaec..5ef2cf4d005b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -12,6 +12,7 @@
#include <linux/node.h>
#include <linux/io.h>
#include <linux/range.h>
+#include <linux/xarray.h>
#include <cxl/cxl.h>
extern const struct nvdimm_security_ops *cxl_security_ops;
@@ -180,11 +181,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
#define CXLDEV_EVENT_STATUS_WARN BIT(1)
#define CXLDEV_EVENT_STATUS_FAIL BIT(2)
#define CXLDEV_EVENT_STATUS_FATAL BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD BIT(4)
#define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
CXLDEV_EVENT_STATUS_WARN | \
CXLDEV_EVENT_STATUS_FAIL | \
- CXLDEV_EVENT_STATUS_FATAL)
+ CXLDEV_EVENT_STATUS_FATAL | \
+ CXLDEV_EVENT_STATUS_DCD)
/* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
#define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
@@ -306,6 +309,41 @@ enum cxl_decoder_state {
CXL_DECODER_STATE_AUTO_STAGED,
};
+struct cxl_dc_tag_group;
+
+/**
+ * struct dc_extent - A single dynamic-capacity extent surfaced to the host.
+ *
+ * One per device-stamped extent. Multiple dc_extents that share a tag
+ * (see &struct cxl_dc_tag_group) form a single logical allocation, but
+ * each dc_extent has its own HPA range and is the unit that the DAX
+ * layer sees as a backing dax_resource.
+ *
+ * @dev: device representing this extent; child of cxlr_dax->dev.
+ * @group: containing tag group (allocation); shared across siblings.
+ * @cxled: endpoint decoder backing the DPA range.
+ * @dpa_range: DPA range this extent covers within @cxled.
+ * @hpa_range: HPA range that @dpa_range decodes to, relative to
+ * cxlr_dax->hpa_range.start.
+ * @uuid: tag uuid (matches @group->uuid; kept for the release-path log).
+ * @seq_num: 1..n assembly-order index within the tag group. For extents
+ * from a sharable partition this equals the device-stamped
+ * shared_extn_seq (CXL 3.1 Table 8-51). For extents from a
+ * non-sharable partition the device leaves shared_extn_seq == 0
+ * and the host assigns @seq_num in event arrival order at
+ * cxl_add_pending() time. Used by the dax layer to assemble
+ * ranges in the right order regardless of source.
+ */
+struct dc_extent {
+ struct device dev;
+ struct cxl_dc_tag_group *group;
+ struct cxl_endpoint_decoder *cxled;
+ struct range dpa_range;
+ struct range hpa_range;
+ uuid_t uuid;
+ u16 seq_num;
+};
+
/**
* struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
* @cxld: base cxl_decoder_object
@@ -518,12 +556,45 @@ struct cxl_pmem_region {
struct cxl_pmem_region_mapping mapping[];
};
+/* See CXL 3.1 8.2.9.2.1.6 */
+enum dc_event {
+ DCD_ADD_CAPACITY,
+ DCD_RELEASE_CAPACITY,
+ DCD_FORCED_CAPACITY_RELEASE,
+ DCD_REGION_CONFIGURATION_UPDATED,
+};
+
struct cxl_dax_region {
struct device dev;
struct cxl_region *cxlr;
struct range hpa_range;
};
+/**
+ * struct cxl_dc_tag_group - A tagged dynamic-capacity allocation.
+ *
+ * Container for the &struct dc_extent siblings that share a tag. The
+ * group has no sysfs identity; userspace sees the individual dc_extents
+ * directly under the parent dax_region device. The group exists to
+ * keep tag-scoped invariants (atomic add, atomic release, ordered carve
+ * by seq_num) in one place.
+ *
+ * @cxlr_dax: back reference to parent region device.
+ * @uuid: tag identifying this allocation; same across all member dc_extents.
+ * @dc_extents: xarray of &struct dc_extent in this group, indexed by the
+ * dc_extent's @seq_num (1..n, dense). See &struct dc_extent
+ * for how seq_num is sourced for sharable vs non-sharable
+ * allocations.
+ * @nr_extents: live count of dc_extents in the group; the group is freed
+ * when the last dc_extent device is released.
+ */
+struct cxl_dc_tag_group {
+ struct cxl_dax_region *cxlr_dax;
+ uuid_t uuid;
+ struct xarray dc_extents;
+ unsigned int nr_extents;
+};
+
/**
* struct cxl_port - logical collection of upstream port devices and
* downstream port devices to construct a CXL memory
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 65c009b02da6..592c8e3b611c 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -7,6 +7,7 @@
#include <linux/cdev.h>
#include <linux/uuid.h>
#include <linux/node.h>
+#include <linux/list.h>
#include <cxl/event.h>
#include <cxl/mailbox.h>
#include "cxl.h"
@@ -399,6 +400,23 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
return dev_get_drvdata(cxl_mbox->host);
}
+/**
+ * struct pending_add_ctx - Staging state for an in-progress
+ * DCD_ADD_CAPACITY event chain
+ * @pending_extents: extents received so far in the chain; flushed when
+ * the chain closes (More=0)
+ * @group: tag group being assembled from the chain
+ *
+ * A DCD_ADD_CAPACITY notification can span multiple event records
+ * stitched together by the CXL_DCD_EVENT_MORE flag. Records are staged
+ * here until the device clears More, at which point the staged batch is
+ * processed and responded to as a single Add_DC_Response.
+ */
+struct pending_add_ctx {
+ struct list_head pending_extents;
+ struct cxl_dc_tag_group *group;
+};
+
/**
* struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
*
@@ -417,6 +435,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @active_volatile_bytes: sum of hard + soft volatile
* @active_persistent_bytes: sum of hard + soft persistent
* @dcd_supported: all DCD commands are supported
+ * @add_ctx: state for an in-progress DCD_ADD_CAPACITY chain
+ * (see &struct pending_add_ctx)
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -437,6 +457,7 @@ struct cxl_memdev_state {
u64 active_volatile_bytes;
u64 active_persistent_bytes;
bool dcd_supported;
+ struct pending_add_ctx add_ctx;
struct cxl_event_state event;
struct cxl_poison_state poison;
@@ -513,6 +534,21 @@ enum cxl_opcode {
UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
0x40, 0x3d, 0x86)
+/*
+ * Add Dynamic Capacity Response
+ * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
+ */
+struct cxl_mbox_dc_response {
+ __le32 extent_list_size;
+ u8 flags;
+ u8 reserved[3];
+ struct updated_extent_list {
+ __le64 dpa_start;
+ __le64 length;
+ u8 reserved[8];
+ } __packed extent_list[] __counted_by(extent_list_size);
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
@@ -583,6 +619,14 @@ struct cxl_mbox_identify {
UUID_INIT(0xe71f3a40, 0x2d29, 0x4092, 0x8a, 0x39, 0x4d, 0x1c, 0x96, \
0x6c, 0x7c, 0x65)
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
+ */
+#define CXL_EVENT_DC_EVENT_UUID \
+ UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
+ 0x10, 0x1a, 0x2a)
+
/*
* Get Event Records output payload
* CXL rev 3.0 section 8.2.9.2.2; Table 8-50
@@ -608,6 +652,7 @@ enum cxl_event_log_type {
CXL_EVENT_TYPE_WARN,
CXL_EVENT_TYPE_FAIL,
CXL_EVENT_TYPE_FATAL,
+ CXL_EVENT_TYPE_DCD,
CXL_EVENT_TYPE_MAX
};
diff --git a/include/cxl/event.h b/include/cxl/event.h
index ff97fea718d2..fa3cd895f656 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
#include <linux/uuid.h>
#include <linux/workqueue_types.h>
+#include <linux/list.h>
/*
* Common Event Record Format
@@ -141,12 +142,49 @@ struct cxl_event_mem_sparing {
u8 reserved2[0x25];
} __packed;
+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+struct cxl_extent {
+ __le64 start_dpa;
+ __le64 length;
+ u8 uuid[UUID_SIZE];
+ __le16 shared_extn_seq;
+ u8 reserved[0x6];
+} __packed;
+
+struct cxl_extent_list_node {
+ struct cxl_extent *extent;
+ struct list_head list;
+ int rid;
+};
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ */
+#define CXL_DCD_EVENT_MORE BIT(0)
+struct cxl_event_dcd {
+ struct cxl_event_record_hdr hdr;
+ u8 event_type;
+ u8 validity_flags;
+ __le16 host_id;
+ u8 partition_index;
+ u8 flags;
+ u8 reserved1[0x2];
+ struct cxl_extent extent;
+ u8 reserved2[0x18];
+ __le32 num_avail_extents;
+ __le32 num_avail_tags;
+} __packed;
+
union cxl_event {
struct cxl_event_generic generic;
struct cxl_event_gen_media gen_media;
struct cxl_event_dram dram;
struct cxl_event_mem_module mem_module;
struct cxl_event_mem_sparing mem_sparing;
+ struct cxl_event_dcd dcd;
/* dram & gen_media event header */
struct cxl_event_media_hdr media_hdr;
} __packed;
--
2.43.0
* Re: [PATCH v10 12/31] cxl/mem: Set up framework for handling DC Events
2026-05-23 9:43 ` [PATCH v10 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
@ 2026-05-28 16:40 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 16:40 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> Adds the support for receiving DC event records but defers
> the real add/release logic to subsequent commits. Simply refuse all
> extents for DC_ADD and ack all DC_RELEASE events for now. Forced
> release is currently unsupported.
>
> In order, this commit adds the following:
>
> 1. Learn about DC Event Records and how to respond to them
>
> * cxl_mem_get_event_records() learns about the DC Event record.
> Records of that type are routed to cxl_handle_dcd_event_records().
>
> * cxl_handle_dcd_event_records() switches on event_type:
> - DCD_ADD_CAPACITY -> handle_add_event()
> - DCD_RELEASE_CAPACITY -> cxl_rm_extent()
> - DCD_FORCED_CAPACITY_RELEASE is logged and ignored (FM/device-only).
>
> * cxl_send_dc_response() sends the reply mailbox commands
> ADD_DC_RESPONSE / RELEASE_DC
>
> 2. Add stubs for DC_ADD and DC_RELEASE logic
>
> * handle_add_event() stages incoming extents onto
> mds->add_ctx.pending_extents and, when More=0 closes the chain,
> replies with an empty ADD_DC_RESPONSE — refusing all extents for now
>
> * cxl_rm_extent() acks the release via memdev_release_extent() so the
> device's view stays consistent; we can ack all releases because
> we currently don't accept/use any extents offered.
>
> 3. Structural setup for later commits:
>
> * struct dc_extent, struct cxl_dc_tag_group, and pending_add_ctx
> set up the stage for the real DC_ADD path, which will enforce
> tag/grouping semantics
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: restructured from the original "Process dynamic partition
> events" monolith; this commit lands only the wire-level intake and
> dispatches the add/release logic to stubbed handlers. The handlers
> are fleshed out in subsequent commits.]
> ---
> drivers/cxl/core/mbox.c | 246 +++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 73 +++++++++++-
> drivers/cxl/cxlmem.h | 45 ++++++++
> include/cxl/event.h | 38 +++++++
> 4 files changed, 400 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01b1a318f34f..1b38f34538f3 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -5,6 +5,7 @@
> #include <linux/ktime.h>
> #include <linux/mutex.h>
> #include <linux/unaligned.h>
> +#include <linux/list.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> #include <cxl.h>
> @@ -1102,6 +1103,238 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int send_one_response(struct cxl_mailbox *cxl_mbox,
> + struct cxl_mbox_dc_response *response,
> + int opcode, u32 extent_list_size, u8 flags)
> +{
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = struct_size(response, extent_list, extent_list_size),
> + .payload_in = response,
> + };
> +
> + response->extent_list_size = cpu_to_le32(extent_list_size);
> + response->flags = flags;
> + return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> +}
> +
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> + struct list_head *extent_list, int cnt)
> +{
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct cxl_mbox_dc_response *p;
> + struct cxl_extent_list_node *pos, *tmp;
> + struct cxl_extent *extent;
> + u32 pl_index;
> +
> + size_t pl_size = struct_size(p, extent_list, cnt);
> + u32 max_extents = cnt;
> +
> + /* May have to use more bit on response. */
> + if (pl_size > cxl_mbox->payload_size) {
> + max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> + sizeof(struct updated_extent_list);
> + pl_size = struct_size(p, extent_list, max_extents);
> + }
> +
> + struct cxl_mbox_dc_response *response __free(kfree) =
> + kzalloc(pl_size, GFP_KERNEL);
> + if (!response)
> + return -ENOMEM;
> +
> + if (cnt == 0)
> + return send_one_response(cxl_mbox, response, opcode, 0, 0);
> +
> + pl_index = 0;
> + list_for_each_entry_safe(pos, tmp, extent_list, list) {
> + extent = pos->extent;
> + response->extent_list[pl_index].dpa_start = extent->start_dpa;
> + response->extent_list[pl_index].length = extent->length;
> + pl_index++;
> +
> + if (pl_index == max_extents) {
> + u8 flags = 0;
> + int rc;
> +
> + if (pl_index < cnt)
> + flags |= CXL_DCD_EVENT_MORE;
> + rc = send_one_response(cxl_mbox, response, opcode,
> + pl_index, flags);
> + if (rc)
> + return rc;
> + cnt -= pl_index;
> + if (cnt < max_extents)
> + max_extents = cnt;
> + pl_index = 0;
> + }
> + }
> +
> + if (!pl_index) /* nothing more to do */
> + return 0;
> + return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
> +}
> +
> +static void delete_extent_node(struct cxl_extent_list_node *node)
> +{
> + list_del(&node->list);
> + kfree(node->extent);
> + kfree(node);
> +}
> +
> +static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_extent_list_node *node;
> + LIST_HEAD(extent_list);
> +
> + dev_dbg(dev, "Release response dpa %pra\n", range);
> +
> + node = kzalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return;
> +
> + node->extent = kzalloc(sizeof(*node->extent), GFP_KERNEL);
> + if (!node->extent) {
> + kfree(node);
> + return;
> + }
> +
> + node->extent->start_dpa = cpu_to_le64(range->start);
> + node->extent->length = cpu_to_le64(range_len(range));
> + list_add_tail(&node->list, &extent_list);
> +
> + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> + dev_dbg(dev, "Failed to release %pra\n", range);
> +
> + delete_extent_node(node);
> +}
> +
> +static void clear_pending_extents(void *_mds)
> +{
> + struct cxl_memdev_state *mds = _mds;
> + struct cxl_extent_list_node *pos, *tmp;
> +
> + list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list)
> + delete_extent_node(pos);
> + mds->add_ctx.group = NULL;
> +}
> +
> +static int add_to_pending_list(struct list_head *pending_list,
> + struct cxl_extent *to_add)
> +{
> + struct cxl_extent_list_node *node;
> + struct cxl_extent *extent;
> +
> + node = kzalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return -ENOMEM;
> + extent = kmemdup(to_add, sizeof(*extent), GFP_KERNEL);
> + if (!extent)
> + return -ENOMEM;
Leaking node here. Maybe convert to using __free()?
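For illustration, here is a userspace sketch of the scope-based cleanup pattern that __free() provides. It is not the kernel code. free_ptr(), auto_free and steal_ptr() are hypothetical stand-ins for what the kernel expresses with __free(kfree) and no_free_ptr(), the singly linked list stands in for list_head, and the cleanup attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified userspace stand-ins for the structures in the patch. */
struct extent {
	unsigned long long start_dpa;
	unsigned long long length;
};

struct extent_node {
	struct extent *extent;
	struct extent_node *next;
};

/* Hypothetical analogue of __free(kfree): free the pointer at scope exit. */
static void free_ptr(void *p)
{
	free(*(void **)p);
}
#define auto_free __attribute__((cleanup(free_ptr)))

/* Hypothetical analogue of no_free_ptr(): take ownership, disarm cleanup. */
static void *steal_ptr(void *pp)
{
	void **p = pp;
	void *val = *p;

	*p = NULL;
	return val;
}

static int add_to_pending(struct extent_node **head,
			  const struct extent *to_add)
{
	assert(head && to_add);

	auto_free struct extent_node *node = calloc(1, sizeof(*node));
	if (!node)
		return -1;

	node->extent = malloc(sizeof(*node->extent));
	if (!node->extent)
		return -1;	/* node is freed automatically on this path */
	memcpy(node->extent, to_add, sizeof(*to_add));

	/* Append at the tail so arrival order is preserved. */
	while (*head)
		head = &(*head)->next;
	*head = steal_ptr(&node);	/* the list now owns node */
	return 0;
}
```

With this shape the early return after the failed extent allocation no longer needs an explicit free of node, and the only place ownership leaves the function is the line that links the node into the list.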
> +
> + node->extent = extent;
> + list_add_tail(&node->list, pending_list);
> + return 0;
> +}
> +
> +/*
> + * Stub: stage extents on the pending list and reply with an empty
> + * ADD_DC_RESPONSE on More=0 (refuse all). A later commit replaces
> + * the no-op tail with the real Add pipeline that surfaces a dax
> + * device per accepted extent.
> + */
> +static int handle_add_event(struct cxl_memdev_state *mds,
> + struct cxl_event_dcd *event)
> +{
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
> + if (rc)
> + return rc;
Should clear_pending_extents() be called before return to clean up previously staged extents?
DJ
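If so, a rough userspace sketch of the flow I have in mind follows. The names (add_ctx, stage_extent(), handle_add(), the simulate_enomem test hook) are made up for illustration and only model the control flow, not the mailbox response or the real list_head handling.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the staged extents in mds->add_ctx. */
struct pending_node {
	unsigned long long dpa;
	struct pending_node *next;
};

struct add_ctx {
	struct pending_node *pending;
	int nr_pending;
};

/* Drop everything staged so far for the current chain. */
static void clear_pending(struct add_ctx *ctx)
{
	while (ctx->pending) {
		struct pending_node *n = ctx->pending;

		ctx->pending = n->next;
		free(n);
	}
	ctx->nr_pending = 0;
}

/* simulate_enomem is a test hook that forces the allocation to fail. */
static int stage_extent(struct add_ctx *ctx, unsigned long long dpa,
			bool simulate_enomem)
{
	struct pending_node *n = simulate_enomem ? NULL : malloc(sizeof(*n));

	if (!n)
		return -1;
	n->dpa = dpa;
	n->next = ctx->pending;
	ctx->pending = n;
	ctx->nr_pending++;
	return 0;
}

static int handle_add(struct add_ctx *ctx, unsigned long long dpa,
		      bool more, bool simulate_enomem)
{
	int rc = stage_extent(ctx, dpa, simulate_enomem);

	if (rc) {
		/* Do not leave a half-built chain behind on failure. */
		clear_pending(ctx);
		return rc;
	}

	if (more)
		return 0;	/* chain still open, keep staging */

	/* Chain closed: the response would be sent here, then reset. */
	clear_pending(ctx);
	return 0;
}
```

Whether the device should also get an explicit refusal for the aborted chain is a separate question that this sketch does not try to answer.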
> +
> + if (event->flags & CXL_DCD_EVENT_MORE) {
> + dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> + return 0;
> + }
> +
> + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> + &mds->add_ctx.pending_extents, 0);
> + clear_pending_extents(mds);
> + return rc;
> +}
> +
> +/*
> + * Stub: ack the release back to the device so it knows we are not
> + * using the range. A later commit replaces this with the real
> + * teardown that walks the region's tag group and tears down the
> + * member dc_extent devices.
> + */
> +static int cxl_rm_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent)
> +{
> + u64 start_dpa = le64_to_cpu(extent->start_dpa);
> + struct range dpa_range = {
> + .start = start_dpa,
> + .end = start_dpa + le64_to_cpu(extent->length) - 1,
> + };
> +
> + memdev_release_extent(mds, &dpa_range);
> + return 0;
> +}
> +
> +static char *cxl_dcd_evt_type_str(u8 type)
> +{
> + switch (type) {
> + case DCD_ADD_CAPACITY:
> + return "add";
> + case DCD_RELEASE_CAPACITY:
> + return "release";
> + case DCD_FORCED_CAPACITY_RELEASE:
> + return "force release";
> + default:
> + break;
> + }
> +
> + return "<unknown>";
> +}
> +
> +static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> + struct cxl_event_record_raw *raw_rec)
> +{
> + struct cxl_event_dcd *event = &raw_rec->event.dcd;
> + struct cxl_extent *extent = &event->extent;
> + struct device *dev = mds->cxlds.dev;
> + uuid_t *id = &raw_rec->id;
> + int rc;
> +
> + if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
> + return;
> +
> + dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
> + cxl_dcd_evt_type_str(event->event_type),
> + le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
> +
> + switch (event->event_type) {
> + case DCD_ADD_CAPACITY:
> + rc = handle_add_event(mds, event);
> + break;
> + case DCD_RELEASE_CAPACITY:
> + rc = cxl_rm_extent(mds, &event->extent);
> + break;
> + case DCD_FORCED_CAPACITY_RELEASE:
> + dev_err_ratelimited(dev, "Forced release event ignored.\n");
> + rc = 0;
> + break;
> + default:
> + rc = -EINVAL;
> + break;
> + }
> +
> + if (rc)
> + dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
> +}
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1138,9 +1371,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> if (!nr_rec)
> break;
>
> - for (i = 0; i < nr_rec; i++)
> + for (i = 0; i < nr_rec; i++) {
> __cxl_event_trace_record(cxlmd, type,
> &payload->records[i]);
> + if (type == CXL_EVENT_TYPE_DCD)
> + cxl_handle_dcd_event_records(mds,
> + &payload->records[i]);
> + }
>
> if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
> trace_cxl_overflow(cxlmd, type, payload);
> @@ -1172,6 +1409,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
> {
> dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
>
> + if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
> + cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
> if (status & CXLDEV_EVENT_STATUS_FATAL)
> cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
> if (status & CXLDEV_EVENT_STATUS_FAIL)
> @@ -1769,6 +2008,11 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
> }
>
> mutex_init(&mds->event.log_lock);
> + INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
> +
> + rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
> + if (rc)
> + return ERR_PTR(rc);
>
> rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
> if (rc == -EOPNOTSUPP)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 1297594beaec..5ef2cf4d005b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -12,6 +12,7 @@
> #include <linux/node.h>
> #include <linux/io.h>
> #include <linux/range.h>
> +#include <linux/xarray.h>
> #include <cxl/cxl.h>
>
> extern const struct nvdimm_security_ops *cxl_security_ops;
> @@ -180,11 +181,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> #define CXLDEV_EVENT_STATUS_WARN BIT(1)
> #define CXLDEV_EVENT_STATUS_FAIL BIT(2)
> #define CXLDEV_EVENT_STATUS_FATAL BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD BIT(4)
>
> #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
> CXLDEV_EVENT_STATUS_WARN | \
> CXLDEV_EVENT_STATUS_FAIL | \
> - CXLDEV_EVENT_STATUS_FATAL)
> + CXLDEV_EVENT_STATUS_FATAL | \
> + CXLDEV_EVENT_STATUS_DCD)
>
> /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
> #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
> @@ -306,6 +309,41 @@ enum cxl_decoder_state {
> CXL_DECODER_STATE_AUTO_STAGED,
> };
>
> +struct cxl_dc_tag_group;
> +
> +/**
> + * struct dc_extent - A single dynamic-capacity extent surfaced to the host.
> + *
> + * One per device-stamped extent. Multiple dc_extents that share a tag
> + * (see &struct cxl_dc_tag_group) form a single logical allocation, but
> + * each dc_extent has its own HPA range and is the unit that the DAX
> + * layer sees as a backing dax_resource.
> + *
> + * @dev: device representing this extent; child of cxlr_dax->dev.
> + * @group: containing tag group (allocation); shared across siblings.
> + * @cxled: endpoint decoder backing the DPA range.
> + * @dpa_range: DPA range this extent covers within @cxled.
> + * @hpa_range: HPA range that @dpa_range decodes to, relative to
> + * cxlr_dax->hpa_range.start.
> + * @uuid: tag uuid (matches @group->uuid; kept for the release-path log).
> + * @seq_num: 1..n assembly-order index within the tag group. For extents
> + * from a sharable partition this equals the device-stamped
> + * shared_extn_seq (CXL 3.1 Table 8-51). For extents from a
> + * non-sharable partition the device leaves shared_extn_seq == 0
> + * and the host assigns @seq_num in event arrival order at
> + * cxl_add_pending() time. Used by the dax layer to assemble
> + * ranges in the right order regardless of source.
> + */
> +struct dc_extent {
> + struct device dev;
> + struct cxl_dc_tag_group *group;
> + struct cxl_endpoint_decoder *cxled;
> + struct range dpa_range;
> + struct range hpa_range;
> + uuid_t uuid;
> + u16 seq_num;
> +};
> +
> /**
> * struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
> * @cxld: base cxl_decoder_object
> @@ -518,12 +556,45 @@ struct cxl_pmem_region {
> struct cxl_pmem_region_mapping mapping[];
> };
>
> +/* See CXL 3.1 8.2.9.2.1.6 */
> +enum dc_event {
> + DCD_ADD_CAPACITY,
> + DCD_RELEASE_CAPACITY,
> + DCD_FORCED_CAPACITY_RELEASE,
> + DCD_REGION_CONFIGURATION_UPDATED,
> +};
> +
> struct cxl_dax_region {
> struct device dev;
> struct cxl_region *cxlr;
> struct range hpa_range;
> };
>
> +/**
> + * struct cxl_dc_tag_group - A tagged dynamic-capacity allocation.
> + *
> + * Container for the &struct dc_extent siblings that share a tag. The
> + * group has no sysfs identity; userspace sees the individual dc_extents
> + * directly under the parent dax_region device. The group exists to
> + * keep tag-scoped invariants (atomic add, atomic release, ordered carve
> + * by seq_num) in one place.
> + *
> + * @cxlr_dax: back reference to parent region device.
> + * @uuid: tag identifying this allocation; same across all member dc_extents.
> + * @dc_extents: xarray of &struct dc_extent in this group, indexed by the
> + * dc_extent's @seq_num (1..n, dense). See &struct dc_extent
> + * for how seq_num is sourced for sharable vs non-sharable
> + * allocations.
> + * @nr_extents: live count of dc_extents in the group; the group is freed
> + * when the last dc_extent device is released.
> + */
> +struct cxl_dc_tag_group {
> + struct cxl_dax_region *cxlr_dax;
> + uuid_t uuid;
> + struct xarray dc_extents;
> + unsigned int nr_extents;
> +};
> +
> /**
> * struct cxl_port - logical collection of upstream port devices and
> * downstream port devices to construct a CXL memory
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 65c009b02da6..592c8e3b611c 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -7,6 +7,7 @@
> #include <linux/cdev.h>
> #include <linux/uuid.h>
> #include <linux/node.h>
> +#include <linux/list.h>
> #include <cxl/event.h>
> #include <cxl/mailbox.h>
> #include "cxl.h"
> @@ -399,6 +400,23 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> return dev_get_drvdata(cxl_mbox->host);
> }
>
> +/**
> + * struct pending_add_ctx - Staging state for an in-progress
> + * DCD_ADD_CAPACITY event chain
> + * @pending_extents: extents received so far in the chain; flushed when
> + * the chain closes (More=0)
> + * @group: tag group being assembled from the chain
> + *
> + * A DCD_ADD_CAPACITY notification can span multiple event records
> + * stitched together by the CXL_DCD_EVENT_MORE flag. Records are staged
> + * here until the device clears More, at which point the staged batch is
> + * processed and responded to as a single Add_DC_Response.
> + */
> +struct pending_add_ctx {
> + struct list_head pending_extents;
> + struct cxl_dc_tag_group *group;
> +};
> +
> /**
> * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> *
> @@ -417,6 +435,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> * @active_volatile_bytes: sum of hard + soft volatile
> * @active_persistent_bytes: sum of hard + soft persistent
> * @dcd_supported: all DCD commands are supported
> + * @add_ctx: state for an in-progress DCD_ADD_CAPACITY chain
> + * (see &struct pending_add_ctx)
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -437,6 +457,7 @@ struct cxl_memdev_state {
> u64 active_volatile_bytes;
> u64 active_persistent_bytes;
> bool dcd_supported;
> + struct pending_add_ctx add_ctx;
>
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> @@ -513,6 +534,21 @@ enum cxl_opcode {
> UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
> 0x40, 0x3d, 0x86)
>
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[] __counted_by(extent_list_size);
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
> @@ -583,6 +619,14 @@ struct cxl_mbox_identify {
> UUID_INIT(0xe71f3a40, 0x2d29, 0x4092, 0x8a, 0x39, 0x4d, 0x1c, 0x96, \
> 0x6c, 0x7c, 0x65)
>
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
> + */
> +#define CXL_EVENT_DC_EVENT_UUID \
> + UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
> + 0x10, 0x1a, 0x2a)
> +
> /*
> * Get Event Records output payload
> * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> @@ -608,6 +652,7 @@ enum cxl_event_log_type {
> CXL_EVENT_TYPE_WARN,
> CXL_EVENT_TYPE_FAIL,
> CXL_EVENT_TYPE_FATAL,
> + CXL_EVENT_TYPE_DCD,
> CXL_EVENT_TYPE_MAX
> };
>
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index ff97fea718d2..fa3cd895f656 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -6,6 +6,7 @@
> #include <linux/types.h>
> #include <linux/uuid.h>
> #include <linux/workqueue_types.h>
> +#include <linux/list.h>
>
> /*
> * Common Event Record Format
> @@ -141,12 +142,49 @@ struct cxl_event_mem_sparing {
> u8 reserved2[0x25];
> } __packed;
>
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +struct cxl_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 uuid[UUID_SIZE];
> + __le16 shared_extn_seq;
> + u8 reserved[0x6];
> +} __packed;
> +
> +struct cxl_extent_list_node {
> + struct cxl_extent *extent;
> + struct list_head list;
> + int rid;
> +};
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +#define CXL_DCD_EVENT_MORE BIT(0)
> +struct cxl_event_dcd {
> + struct cxl_event_record_hdr hdr;
> + u8 event_type;
> + u8 validity_flags;
> + __le16 host_id;
> + u8 partition_index;
> + u8 flags;
> + u8 reserved1[0x2];
> + struct cxl_extent extent;
> + u8 reserved2[0x18];
> + __le32 num_avail_extents;
> + __le32 num_avail_tags;
> +} __packed;
> +
> union cxl_event {
> struct cxl_event_generic generic;
> struct cxl_event_gen_media gen_media;
> struct cxl_event_dram dram;
> struct cxl_event_mem_module mem_module;
> struct cxl_event_mem_sparing mem_sparing;
> + struct cxl_event_dcd dcd;
> /* dram & gen_media event header */
> struct cxl_event_media_hdr media_hdr;
> } __packed;
* [PATCH v10 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (11 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 12/31] cxl/mem: Set up framework for handling DC Events Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 16:57 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
` (19 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su
A DC_ADD_CAPACITY event can span multiple event records grouped together
by the CXL_DCD_EVENT_MORE flag. Extents are staged in the pending list until
the last event record ('More'=0) is received, at which point the pending
list is processed. If the device opens such a chain (More=1) but never
sends the closing record, the staged list sits indefinitely.
Add a delayed-work watchdog that, on expiry, refuses the chain with an
empty ADD_DC_RESPONSE and drops the staged list.
The 20s timeout is a conservative upper bound and may be tightened
later. The timeout is purely defensive — the spec does not require it,
but prevents issues from a lost mailbox response or a crashed fabric manager.
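For illustration only (not part of the patch): the interaction between the
event path and the watchdog can be modelled in userspace with a mutex and an
"armed" flag. This is a minimal sketch under stated assumptions. The names
chain_ctx, chain_handle_record() and chain_timeout() are hypothetical, a
pthread mutex stands in for the kernel mutex, and the delayed work itself is
not modelled; only the rule "the watchdog acts only if the chain is still
armed" is shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct pending_add_ctx. */
struct chain_ctx {
	pthread_mutex_t lock;
	bool armed;	/* true while a More=1 chain is open */
	size_t staged;	/* number of extents staged so far */
	size_t refused;	/* number of chains refused by the watchdog */
};

static void chain_init(struct chain_ctx *ctx)
{
	assert(ctx != NULL);
	pthread_mutex_init(&ctx->lock, NULL);
	ctx->armed = false;
	ctx->staged = 0;
	ctx->refused = 0;
}

/*
 * Event path: stage one record. Returns the number of extents flushed,
 * which is 0 while the chain is still open (More=1).
 */
static size_t chain_handle_record(struct chain_ctx *ctx, bool more)
{
	size_t flushed = 0;

	pthread_mutex_lock(&ctx->lock);
	ctx->staged++;
	if (more) {
		/* The patch also (re)arms the delayed work at this point. */
		ctx->armed = true;
	} else {
		/* Disarm before flushing so a late watchdog bails out. */
		ctx->armed = false;
		flushed = ctx->staged;
		ctx->staged = 0;
	}
	pthread_mutex_unlock(&ctx->lock);
	return flushed;
}

/* Watchdog body: returns true only if it actually refused a chain. */
static bool chain_timeout(struct chain_ctx *ctx)
{
	bool fired = false;

	pthread_mutex_lock(&ctx->lock);
	if (ctx->armed) {
		ctx->staged = 0;	/* drop the staged list */
		ctx->armed = false;
		ctx->refused++;
		fired = true;
	}
	pthread_mutex_unlock(&ctx->lock);
	return fired;
}
```

Because both paths take the same lock and check the flag under it, a watchdog
that runs after the closing record sees armed == false and does nothing.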
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
drivers/cxl/core/mbox.c | 73 ++++++++++++++++++++++++++++++++++++++++-
drivers/cxl/cxlmem.h | 23 ++++++++++---
2 files changed, 91 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 1b38f34538f3..c376492fa166 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1219,6 +1219,48 @@ static void clear_pending_extents(void *_mds)
mds->add_ctx.group = NULL;
}
+/*
+ * Bound on how long the host will wait for a device to finish a
+ * multi-record DC_ADD_CAPACITY chain (More=1 ... More=0) before
+ * refusing the chain.
+ * The timeout is not defined in the spec, but added for defensive purposes.
+ * Since there is no spec-defined timeout, 20s is chosen as a generous
+ * upper bound and matches the GPF timeout.
+ */
+#define CXL_DC_ADD_TIMEOUT (20 * HZ)
+
+static void cxl_dc_add_timeout(struct work_struct *work)
+{
+ struct pending_add_ctx *ctx = container_of(to_delayed_work(work),
+ struct pending_add_ctx,
+ timeout_work);
+ struct cxl_memdev_state *mds = container_of(ctx,
+ struct cxl_memdev_state,
+ add_ctx);
+ struct device *dev = mds->cxlds.dev;
+
+ guard(mutex)(&ctx->lock);
+
+ if (!ctx->armed)
+ return;
+
+ dev_warn(dev, "DC add chain timed out; refusing staged extents\n");
+
+ if (cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+ &ctx->pending_extents, 0))
+ dev_dbg(dev, "Failed to send empty ADD_DC_RESPONSE on timeout\n");
+
+ clear_pending_extents(mds);
+ ctx->armed = false;
+}
+
+static void cxl_cancel_dcd_add_chain_work(void *_mds)
+{
+ struct cxl_memdev_state *mds = _mds;
+
+ cancel_delayed_work_sync(&mds->add_ctx.timeout_work);
+}
+
static int add_to_pending_list(struct list_head *pending_list,
struct cxl_extent *to_add)
{
@@ -1246,18 +1288,34 @@ static int add_to_pending_list(struct list_head *pending_list,
static int handle_add_event(struct cxl_memdev_state *mds,
struct cxl_event_dcd *event)
{
+ struct pending_add_ctx *ctx = &mds->add_ctx;
struct device *dev = mds->cxlds.dev;
int rc;
- rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
+ guard(mutex)(&ctx->lock);
+
+ rc = add_to_pending_list(&ctx->pending_extents, &event->extent);
if (rc)
return rc;
if (event->flags & CXL_DCD_EVENT_MORE) {
dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+ mod_delayed_work(system_wq, &ctx->timeout_work,
+ CXL_DC_ADD_TIMEOUT);
+ ctx->armed = true;
return 0;
}
+ /*
+ * Chain is closing. Disarm before flushing so a pending watchdog
+ * (queued but blocked on @ctx->lock) sees !armed and bails out.
+ * cancel_delayed_work() — not _sync — because handle_add_event()
+ * itself runs on system_wq and a sync cancel of same-wq work can
+ * deadlock.
+ */
+ ctx->armed = false;
+ cancel_delayed_work(&ctx->timeout_work);
+
rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
&mds->add_ctx.pending_extents, 0);
clear_pending_extents(mds);
@@ -2009,11 +2067,24 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
mutex_init(&mds->event.log_lock);
INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
+ mutex_init(&mds->add_ctx.lock);
+ INIT_DELAYED_WORK(&mds->add_ctx.timeout_work,
+ cxl_dc_add_timeout);
+ mds->add_ctx.armed = false;
rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
if (rc)
return ERR_PTR(rc);
+ /*
+ * Registered after clear_pending_extents so devm's reverse-order
+ * unwind cancels (and waits for) the watchdog first, then the list
+ * cleanup runs with the watchdog guaranteed not to refire.
+ */
+ rc = devm_add_action_or_reset(dev, cxl_cancel_dcd_add_chain_work, mds);
+ if (rc)
+ return ERR_PTR(rc);
+
rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
if (rc == -EOPNOTSUPP)
dev_warn(dev, "CXL MCE unsupported\n");
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 592c8e3b611c..d992cc9b7811 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -8,6 +8,8 @@
#include <linux/uuid.h>
#include <linux/node.h>
#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
#include <cxl/event.h>
#include <cxl/mailbox.h>
#include "cxl.h"
@@ -402,19 +404,32 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
/**
* struct pending_add_ctx - Staging state for an in-progress
- * DCD_ADD_CAPACITY event chain
+ * DCD_ADD_CAPACITY event chain
* @pending_extents: extents received so far in the chain; flushed when
- * the chain closes (More=0)
+ * the chain closes (More=0)
* @group: tag group being assembled from the chain
+ * @timeout_work: watchdog that fires if a chain is opened with
+ * CXL_DCD_EVENT_MORE but the closing record never arrives
+ * @lock: serialises updates to the chain state against the watchdog
+ * @armed: set when a More=1 chain opens; cleared when the chain closes,
+ * either by a More=0 event record or by the watchdog firing.
*
* A DCD_ADD_CAPACITY notification can span multiple event records
* stitched together by the CXL_DCD_EVENT_MORE flag. Records are staged
- * here until the device clears More, at which point the staged batch is
- * processed and responded to as a single Add_DC_Response.
+ * here until an event record with 'More'=0 is received, at which point the
+ * staged batch is processed and responded to as a single Add_DC_Response.
+ *
+ * If a chain is opened (More=1) but the device never sends the closing
+ * record, the staged list would otherwise sit indefinitely. @timeout_work
+ * is a defensive watchdog that refuses such a chain with an empty response
+ * and drops the staged list.
*/
struct pending_add_ctx {
struct list_head pending_extents;
struct cxl_dc_tag_group *group;
+ struct delayed_work timeout_work;
+ struct mutex lock;
+ bool armed;
};
/**
--
2.43.0
* Re: [PATCH v10 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains
2026-05-23 9:43 ` [PATCH v10 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
@ 2026-05-28 16:57 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 16:57 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su
On 5/23/26 2:43 AM, Anisa Su wrote:
> A DC_ADD_CAPACITY event can span multiple event records grouped together
> by the CXL_DCD_EVENT_MORE flag. Extents are staged in the pending list until
> the last event record ('More'=0) is received, at which point the pending
> list is processed. If the device opens such a chain (More=1) but never
> sends the closing record, the staged list sits indefinitely.
>
> Add a delayed-work watchdog that, on expiry, refuses the chain with an
> empty ADD_DC_RESPONSE and drops the staged list.
>
> The 20s timeout is a conservative upper bound and may be tightened
> later. The timeout is purely defensive — the spec does not require it,
> but prevents issues from a lost mailbox response or a crashed fabric manager.
>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
> ---
> drivers/cxl/core/mbox.c | 73 ++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxlmem.h | 23 ++++++++++---
> 2 files changed, 91 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 1b38f34538f3..c376492fa166 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1219,6 +1219,48 @@ static void clear_pending_extents(void *_mds)
> mds->add_ctx.group = NULL;
> }
>
> +/*
> + * Bound on how long the host will wait for a device to finish a
> + * multi-record DC_ADD_CAPACITY chain (More=1 ... More=0) before
> + * refusing the chain.
> + * The timeout is not defined in the spec, but added for defensive purposes.
> + * Since there is no spec-defined timeout, 20s is chosen as a generous
> + * upper bound and matches the GPF timeout.
> + */
> +#define CXL_DC_ADD_TIMEOUT (20 * HZ)
> +
> +static void cxl_dc_add_timeout(struct work_struct *work)
> +{
> + struct pending_add_ctx *ctx = container_of(to_delayed_work(work),
> + struct pending_add_ctx,
> + timeout_work);
> + struct cxl_memdev_state *mds = container_of(ctx,
> + struct cxl_memdev_state,
> + add_ctx);
> + struct device *dev = mds->cxlds.dev;
> +
> + guard(mutex)(&ctx->lock);
> +
> + if (!ctx->armed)
> + return;
> +
> + dev_warn(dev, "DC add chain timed out; refusing staged extents\n");
> +
> + if (cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> + &ctx->pending_extents, 0))
> + dev_dbg(dev, "Failed to send empty ADD_DC_RESPONSE on timeout\n");
> +
> + clear_pending_extents(mds);
> + ctx->armed = false;
> +}
> +
> +static void cxl_cancel_dcd_add_chain_work(void *_mds)
> +{
> + struct cxl_memdev_state *mds = _mds;
> +
> + cancel_delayed_work_sync(&mds->add_ctx.timeout_work);
> +}
> +
> static int add_to_pending_list(struct list_head *pending_list,
> struct cxl_extent *to_add)
> {
> @@ -1246,18 +1288,34 @@ static int add_to_pending_list(struct list_head *pending_list,
> static int handle_add_event(struct cxl_memdev_state *mds,
> struct cxl_event_dcd *event)
> {
> + struct pending_add_ctx *ctx = &mds->add_ctx;
> struct device *dev = mds->cxlds.dev;
> int rc;
>
> - rc = add_to_pending_list(&mds->add_ctx.pending_extents, &event->extent);
> + guard(mutex)(&ctx->lock);
> +
> + rc = add_to_pending_list(&ctx->pending_extents, &event->extent);
> if (rc)
> return rc;
>
> if (event->flags & CXL_DCD_EVENT_MORE) {
> dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> + mod_delayed_work(system_wq, &ctx->timeout_work,
> + CXL_DC_ADD_TIMEOUT);
> + ctx->armed = true;
> return 0;
> }
>
> + /*
> + * Chain is closing. Disarm before flushing so a pending watchdog
> + * (queued but blocked on @ctx->lock) sees !armed and bails out.
> + * cancel_delayed_work() — not _sync — because handle_add_event()
> + * itself runs on system_wq and a sync cancel of same-wq work can
> + * deadlock.
> + */
I don't think this comment is correct. handle_add_event() is launched from a threaded IRQ and does not run on system_wq. Just drop the second part of the comment.
> + ctx->armed = false;
> + cancel_delayed_work(&ctx->timeout_work);
> +
> rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> &mds->add_ctx.pending_extents, 0);
> clear_pending_extents(mds);
> @@ -2009,11 +2067,24 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
>
> mutex_init(&mds->event.log_lock);
> INIT_LIST_HEAD(&mds->add_ctx.pending_extents);
> + mutex_init(&mds->add_ctx.lock);
> + INIT_DELAYED_WORK(&mds->add_ctx.timeout_work,
> + cxl_dc_add_timeout);
> + mds->add_ctx.armed = false;
Not needed. The allocated memory is already zeroed.
DJ
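For illustration only (not part of the patch or the review): a userspace
sketch of why the explicit "armed = false" is redundant, assuming, as the
comment above states, that the containing state comes from a zeroing
allocator. Here calloc() plays that role, and demo_add_ctx and
demo_ctx_alloc() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct pending_add_ctx. */
struct demo_add_ctx {
	void *group;
	bool armed;
};

/*
 * calloc() returns zero-filled memory, so every field starts out as
 * all-bits-zero. On Linux targets that means armed == false and
 * group == NULL without any explicit assignment.
 */
static struct demo_add_ctx *demo_ctx_alloc(void)
{
	struct demo_add_ctx *ctx = calloc(1, sizeof(*ctx));

	assert(ctx == NULL || !ctx->armed);
	return ctx;
}
```

Only fields that need real setup (the list head, the mutex and the delayed
work in the patch) require explicit initialisation after such an allocation.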
>
> rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
> if (rc)
> return ERR_PTR(rc);
>
> + /*
> + * Registered after clear_pending_extents so devm's reverse-order
> + * unwind cancels (and waits for) the watchdog first, then the list
> + * cleanup runs with the watchdog guaranteed not to refire.
> + */
> + rc = devm_add_action_or_reset(dev, cxl_cancel_dcd_add_chain_work, mds);
> + if (rc)
> + return ERR_PTR(rc);
> +
> rc = devm_cxl_register_mce_notifier(dev, &mds->mce_notifier);
> if (rc == -EOPNOTSUPP)
> dev_warn(dev, "CXL MCE unsupported\n");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 592c8e3b611c..d992cc9b7811 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -8,6 +8,8 @@
> #include <linux/uuid.h>
> #include <linux/node.h>
> #include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> #include <cxl/event.h>
> #include <cxl/mailbox.h>
> #include "cxl.h"
> @@ -402,19 +404,32 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>
> /**
> * struct pending_add_ctx - Staging state for an in-progress
> - * DCD_ADD_CAPACITY event chain
> + * DCD_ADD_CAPACITY event chain
> * @pending_extents: extents received so far in the chain; flushed when
> - * the chain closes (More=0)
> + * the chain closes (More=0)
> * @group: tag group being assembled from the chain
> + * @timeout_work: watchdog that fires if a chain is opened with
> + * CXL_DCD_EVENT_MORE but the closing record never arrives
> + * @lock: serialises updates to the chain state against the watchdog
> + * @armed: set when a More=1 chain opens; cleared when the chain closes,
> + * either by a More=0 event record or by the watchdog firing.
> *
> * A DCD_ADD_CAPACITY notification can span multiple event records
> * stitched together by the CXL_DCD_EVENT_MORE flag. Records are staged
> - * here until the device clears More, at which point the staged batch is
> - * processed and responded to as a single Add_DC_Response.
> + * here until an event record with 'More'=0 is received, at which point the
> + * staged batch is processed and responded to as a single Add_DC_Response.
> + *
> + * If a chain is opened (More=1) but the device never sends the closing
> + * record, the staged list would otherwise sit indefinitely. @timeout_work
> + * is a defensive watchdog that refuses such a chain with an empty response
> + * and drops the staged list.
> */
> struct pending_add_ctx {
> struct list_head pending_extents;
> struct cxl_dc_tag_group *group;
> + struct delayed_work timeout_work;
> + struct mutex lock;
> + bool armed;
> };
>
> /**
* [PATCH v10 14/31] cxl/extent: Handle DC Add Capacity events
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (12 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 19:06 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
` (18 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
Replace the empty-response stub in handle_add_event() with the real
add pipeline.
DC Event Records can be grouped together with the 'More' flag. The
previous commit completed the setup for holding onto extents in
the pending list until receiving the last event record of the group,
marked by 'More'=0.
This commit fills in the logic for processing the pending list and
adds basic validation for extents before they are added to the
device model as a child of the cxlr_dax region. More complete
checks for tags/sequence numbers/alignment are added in subsequent commits.
For each tag that appears in the pending list:
1. Extract all extents in the pending list with that tag to a
local list.
2. The spec requires that shareable extents are ordered by
shared extent sequence number, which "instructs each host
on the relative order these extents must be placed in adjacent
virtual address space" (r4.0 Section 9.13.3 Figure 9-23
Shared Extent List Example). Otherwise, retain arrival order.
Thus the tag group is stable-sorted by shared_extn_seq; for non-sharable
extents every key is 0 and the stable sort preserves arrival
order.
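For illustration only (not part of the patch): the ordering rule above can be
sketched in userspace with any stable sort. This sketch uses a plain insertion
sort over an array rather than the kernel's list sorting helper, and the names
demo_extent and demo_stable_sort() are hypothetical. The "arrival" field exists
only to make the tie-breaking behaviour visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a staged extent. */
struct demo_extent {
	uint16_t shared_extn_seq;	/* sort key; 0 for every non-shared extent */
	unsigned int arrival;		/* position in the original event chain */
};

/*
 * Stable insertion sort by shared_extn_seq. Elements are shifted only
 * when the left neighbour's key is strictly greater, so extents with
 * equal keys keep their arrival order.
 */
static void demo_stable_sort(struct demo_extent *v, size_t n)
{
	assert(v != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		struct demo_extent key = v[i];
		size_t j = i;

		while (j > 0 && v[j - 1].shared_extn_seq > key.shared_extn_seq) {
			v[j] = v[j - 1];
			j--;
		}
		v[j] = key;
	}
}
```

With every key equal to 0 the loop never shifts anything, which is the
"retain arrival order" case described above.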
Individual extents are checked for the following:
1. The extent's DPA range fully resolves to an endpoint decoder.
2. Doesn't overlap with a previously accepted extent.
3. Sequence number doesn't collide with others in the same
tag group
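For illustration only (not part of the patch): a userspace sketch of the
overlap check in item 2, modelled on the classification step in the diff
below, where a candidate fully contained in an accepted extent is treated as a
duplicate accept and any other intersection is rejected. Ranges are inclusive
on both ends. The names demo_range, demo_class and demo_classify() are
hypothetical, and the per-endpoint-decoder filter is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive [start, end] range. */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

enum demo_class { DEMO_NEW, DEMO_DUPLICATE, DEMO_OVERLAP };

/* True if @inner lies entirely within @outer. */
static bool demo_contains(const struct demo_range *outer,
			  const struct demo_range *inner)
{
	return outer->start <= inner->start && outer->end >= inner->end;
}

/* True if the two ranges share at least one address. */
static bool demo_overlaps(const struct demo_range *a,
			  const struct demo_range *b)
{
	return a->start <= b->end && a->end >= b->start;
}

/* Classify @cand against the @n already accepted ranges. */
static enum demo_class demo_classify(const struct demo_range *accepted,
				     size_t n, const struct demo_range *cand)
{
	assert(cand->start <= cand->end);

	for (size_t i = 0; i < n; i++) {
		/* Containment must be tested first: it also overlaps. */
		if (demo_contains(&accepted[i], cand))
			return DEMO_DUPLICATE;
		if (demo_overlaps(&accepted[i], cand))
			return DEMO_OVERLAP;
	}
	return DEMO_NEW;
}
```

The order of the two tests matters because a contained range always overlaps
as well; testing overlap first would misreport duplicates as conflicts.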
Upon passing these checks, extents are "onlined" together
as a tag group:
online_tag_group() registers a struct device per
dc_extent under cxlr_dax->dev so the dax layer can discover them
via device_for_each_child().
Once the pending list has been fully processed, send the
DC_ADD_RESPONSE.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: restructured from the original "Process dynamic partition
events" monolith; this commit fills in the Add path on top of the
previous commit's stubs. Further validation lands in subsequent
commits.]
---
drivers/cxl/core/Makefile | 2 +-
drivers/cxl/core/core.h | 13 ++
drivers/cxl/core/extent.c | 372 ++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 123 ++++++++++-
drivers/cxl/core/region_dax.c | 3 +
drivers/cxl/cxl.h | 19 ++
tools/testing/cxl/Kbuild | 5 +-
7 files changed, 528 insertions(+), 9 deletions(-)
create mode 100644 drivers/cxl/core/extent.c
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..208917ad8aac 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -15,7 +15,7 @@ cxl_core-y += hdm.o
cxl_core-y += pmu.o
cxl_core-y += cdat.o
cxl_core-$(CONFIG_TRACING) += trace.o
-cxl_core-$(CONFIG_CXL_REGION) += region.o region_pmem.o region_dax.o
+cxl_core-$(CONFIG_CXL_REGION) += region.o region_pmem.o region_dax.o extent.o
cxl_core-$(CONFIG_CXL_MCE) += mce.o
cxl_core-$(CONFIG_CXL_FEATURES) += features.o
cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 14723cfd05f0..1bae80dbf991 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -65,12 +65,24 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
int devm_cxl_add_dax_region(struct cxl_region *cxlr);
int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
+ u16 seq_num);
+int online_tag_group(struct cxl_dc_tag_group *group);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
const struct cxl_memdev *cxlmd, u64 dpa)
{
return ULLONG_MAX;
}
+static inline int cxl_add_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent, u16 seq_num)
+{
+ return 0;
+}
+static inline int online_tag_group(struct cxl_dc_tag_group *group)
+{
+ return 0;
+}
static inline
struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
struct cxl_endpoint_decoder **cxled)
@@ -166,6 +178,7 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
static inline struct device *port_to_host(struct cxl_port *port)
{
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
new file mode 100644
index 000000000000..94128d06f4ed
--- /dev/null
+++ b/drivers/cxl/core/extent.c
@@ -0,0 +1,372 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <cxl.h>
+
+#include "core.h"
+
+
+static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
+ struct dc_extent *dc_extent)
+{
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct device *dev = &cxled->cxld.dev;
+
+ dev_dbg(dev, "Remove extent %pra (%pU)\n",
+ &dc_extent->dpa_range, &dc_extent->uuid);
+ memdev_release_extent(mds, &dc_extent->dpa_range);
+}
+
+static void free_tag_group(struct cxl_dc_tag_group *group)
+{
+ xa_destroy(&group->dc_extents);
+ kfree(group);
+}
+
+static void dc_extent_release(struct device *dev)
+{
+ struct dc_extent *dc_extent = to_dc_extent(dev);
+ struct cxl_dc_tag_group *group = dc_extent->group;
+
+ cxled_release_extent(dc_extent->cxled, dc_extent);
+ xa_erase(&group->cxlr_dax->dc_extents, dc_extent->dev.id);
+ xa_erase(&group->dc_extents, dc_extent->seq_num);
+ group->nr_extents--;
+ if (!group->nr_extents)
+ free_tag_group(group);
+ kfree(dc_extent);
+}
+
+static const struct device_type dc_extent_type = {
+ .name = "extent",
+ .release = dc_extent_release,
+};
+
+bool is_dc_extent(struct device *dev)
+{
+ return dev->type == &dc_extent_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_dc_extent, "CXL");
+
+static struct cxl_dc_tag_group *
+alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
+{
+ struct cxl_dc_tag_group *group __free(kfree) =
+ kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ group->cxlr_dax = cxlr_dax;
+ uuid_copy(&group->uuid, uuid);
+ xa_init(&group->dc_extents);
+ return no_free_ptr(group);
+}
+
+/*
+ * Stage 1 of the add pipeline: pure, no allocation. Resolve the extent
+ * to its region/endpoint decoder and ext_range, and verify the range
+ * fits in the resolved endpoint decoder's DPA resource. Further
+ * per-extent invariants layer into this function in subsequent commits.
+ *
+ * Caller must hold cxl_rwsem.region for read (cxl_dpa_to_region()).
+ * On success, @out_cxled / @out_cxlr_dax / @out_ext_range carry the
+ * resolved handles consumed by the rest of the pipeline.
+ */
+static int cxl_validate_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent,
+ struct cxl_endpoint_decoder **out_cxled,
+ struct cxl_dax_region **out_cxlr_dax,
+ struct range *out_ext_range)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_region *cxlr;
+ struct range ext_range = (struct range) {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+ struct range ed_range;
+
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+ if (!cxlr)
+ return -ENXIO;
+
+ ed_range = (struct range) {
+ .start = cxled->dpa_res->start,
+ .end = cxled->dpa_res->end,
+ };
+ if (!range_contains(&ed_range, &ext_range)) {
+ dev_err_ratelimited(&cxled->cxld.dev,
+ "DC extent DPA %pra (%pU) is not fully in ED %pra\n",
+ &ext_range, extent->uuid, &ed_range);
+ return -ENXIO;
+ }
+
+ *out_cxled = cxled;
+ *out_cxlr_dax = cxlr->cxlr_dax;
+ *out_ext_range = ext_range;
+ return 0;
+}
+
+enum cxl_extent_class {
+ CXL_EXT_NEW,
+ CXL_EXT_DUPLICATE,
+ CXL_EXT_OVERLAP,
+};
+
+/*
+ * Stage 2: classify @ext_range against extents already accepted on this
+ * cxlr_dax+cxled. Walks cxlr_dax->dc_extents once: a stored extent that
+ * fully contains @ext_range means a duplicate accept (idempotent, fine);
+ * a stored extent that only overlaps means an inconsistent offer.
+ */
+static enum cxl_extent_class
+cxlr_dax_classify_extent(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ const struct range *ext_range)
+{
+ struct dc_extent *entry;
+ unsigned long i;
+
+ xa_for_each(&cxlr_dax->dc_extents, i, entry) {
+ if (entry->cxled != cxled)
+ continue;
+ if (range_contains(&entry->dpa_range, ext_range))
+ return CXL_EXT_DUPLICATE;
+ if (range_overlaps(&entry->dpa_range, ext_range))
+ return CXL_EXT_OVERLAP;
+ }
+ return CXL_EXT_NEW;
+}
+
+/*
+ * Stage 3: allocate and populate a dc_extent for an already-validated,
+ * already-classified-as-new @ext_range. Only -ENOMEM can fail here.
+ */
+static struct dc_extent *
+dc_extent_build(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dax_region *cxlr_dax,
+ struct cxl_extent *extent,
+ const struct range *ext_range, u16 seq_num)
+{
+ resource_size_t dpa_offset = ext_range->start - cxled->dpa_res->start;
+ resource_size_t hpa = cxled->cxld.hpa_range.start + dpa_offset;
+ struct dc_extent *dc_extent;
+
+ dc_extent = kzalloc(sizeof(*dc_extent), GFP_KERNEL);
+ if (!dc_extent)
+ return ERR_PTR(-ENOMEM);
+
+ dc_extent->cxled = cxled;
+ dc_extent->dpa_range = *ext_range;
+ dc_extent->hpa_range.start = hpa - cxlr_dax->hpa_range.start;
+ dc_extent->hpa_range.end = dc_extent->hpa_range.start +
+ range_len(ext_range) - 1;
+ dc_extent->seq_num = seq_num;
+ import_uuid(&dc_extent->uuid, extent->uuid);
+ return dc_extent;
+}
+
+/*
+ * Stage 4: insert @dc_extent into the pending tag group. All extents in
+ * one More-chain group share a UUID — enforced here as the group is
+ * either being created (first extent) or appended to. On any failure
+ * the dc_extent is freed.
+ */
+static int cxlr_add_extent(struct cxl_memdev_state *mds,
+ struct cxl_dax_region *cxlr_dax,
+ struct dc_extent *dc_extent)
+{
+ struct cxl_dc_tag_group **group = &mds->add_ctx.group;
+ int rc;
+
+ if (*group && !uuid_equal(&(*group)->uuid, &dc_extent->uuid)) {
+ kfree(dc_extent);
+ return -EINVAL;
+ }
+
+ if (!*group) {
+ dev_dbg(&cxlr_dax->dev, "Alloc new tag group\n");
+ *group = alloc_tag_group(cxlr_dax, &dc_extent->uuid);
+ if (IS_ERR(*group)) {
+ rc = PTR_ERR(*group);
+ *group = NULL;
+ kfree(dc_extent);
+ return rc;
+ }
+ } else {
+ dev_dbg(&cxlr_dax->dev, "Append dc_extent to tag group\n");
+ }
+
+ dc_extent->group = *group;
+
+ /*
+ * Key by @seq_num so iteration order equals assembly order, in both
+ * the sharable case (device-stamped 1..n) and the non-sharable case
+ * (host-assigned arrival-order 1..n). A collision here signals a
+ * cxl-side validation gap.
+ */
+ rc = xa_insert(&(*group)->dc_extents, dc_extent->seq_num,
+ dc_extent, GFP_KERNEL);
+ if (rc) {
+ dev_WARN_ONCE(&cxlr_dax->dev, rc == -EBUSY,
+ "duplicate seq_num %u in tag %pUb\n",
+ dc_extent->seq_num, &dc_extent->uuid);
+ kfree(dc_extent);
+ return rc;
+ }
+
+ return 0;
+}
+
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
+ u16 seq_num)
+{
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_dax_region *cxlr_dax;
+ struct dc_extent *dc_extent;
+ struct range ext_range;
+ int rc;
+
+ guard(rwsem_read)(&cxl_rwsem.region);
+
+ rc = cxl_validate_extent(mds, extent, &cxled, &cxlr_dax, &ext_range);
+ if (rc)
+ return rc;
+
+ switch (cxlr_dax_classify_extent(cxlr_dax, cxled, &ext_range)) {
+ case CXL_EXT_DUPLICATE:
+ /*
+ * Idempotent accept simplifies the dax-side scan for existing
+ * extents on region creation; reply success without duplicating.
+ */
+ dev_warn_ratelimited(&cxled->cxld.dev,
+ "Extent %pra exists; accept again\n",
+ &ext_range);
+ return 0;
+ case CXL_EXT_OVERLAP:
+ return -ENXIO;
+ case CXL_EXT_NEW:
+ break;
+ }
+
+ dc_extent = dc_extent_build(cxled, cxlr_dax, extent, &ext_range,
+ seq_num);
+ if (IS_ERR(dc_extent))
+ return PTR_ERR(dc_extent);
+
+ dev_dbg(&cxled->cxld.dev, "Add extent %pra (%pU)\n",
+ &dc_extent->dpa_range, &dc_extent->uuid);
+
+ return cxlr_add_extent(mds, cxlr_dax, dc_extent);
+}
+
+static void dc_extent_unregister(void *ext)
+{
+ struct dc_extent *dc_extent = ext;
+
+ dev_dbg(&dc_extent->dev, "DAX region rm extent HPA %pra\n",
+ &dc_extent->hpa_range);
+ device_unregister(&dc_extent->dev);
+}
+
+static void cleanup_pending_dc_extent(struct dc_extent *dc_extent)
+{
+ struct cxl_dc_tag_group *group = dc_extent->group;
+
+ cxled_release_extent(dc_extent->cxled, dc_extent);
+ xa_erase(&group->dc_extents, dc_extent->seq_num);
+ group->nr_extents--;
+ if (!group->nr_extents)
+ free_tag_group(group);
+ kfree(dc_extent);
+}
+
+int online_tag_group(struct cxl_dc_tag_group *group)
+{
+ struct cxl_dax_region *cxlr_dax = group->cxlr_dax;
+ struct dc_extent *dc_extent;
+ unsigned long index;
+ int rc = 0;
+
+ /*
+ * Seed nr_extents with the full group size plus a +1 pin held by
+ * this function. The size counts every dc_extent that might
+ * decrement nr_extents on cleanup; the pin keeps @group alive
+ * across the body even if every dc_extent release fires inside
+ * the loop (e.g. devm_add_action_or_reset failure on the only
+ * pending extent). The pin is dropped at the end of the function.
+ */
+ xa_for_each(&group->dc_extents, index, dc_extent)
+ group->nr_extents++;
+ group->nr_extents++;
+
+ xa_for_each(&group->dc_extents, index, dc_extent) {
+ struct device *dev = &dc_extent->dev;
+ u32 id;
+
+ device_initialize(dev);
+ device_set_pm_not_required(dev);
+ dev->parent = &cxlr_dax->dev;
+ dev->type = &dc_extent_type;
+
+ rc = xa_alloc(&cxlr_dax->dc_extents, &id, dc_extent,
+ xa_limit_32b, GFP_KERNEL);
+ if (rc < 0) {
+ put_device(dev);
+ break;
+ }
+ dev->id = id;
+
+ rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id,
+ dev->id);
+ if (rc) {
+ xa_erase(&cxlr_dax->dc_extents, dev->id);
+ put_device(dev);
+ break;
+ }
+
+ rc = device_add(dev);
+ if (rc) {
+ xa_erase(&cxlr_dax->dc_extents, dev->id);
+ put_device(dev);
+ break;
+ }
+
+ dev_dbg(dev, "dc_extent HPA %pra (%pU)\n",
+ &dc_extent->hpa_range, &group->uuid);
+
+ rc = devm_add_action_or_reset(&cxlr_dax->dev,
+ dc_extent_unregister, dc_extent);
+ if (rc)
+ break;
+ }
+
+ if (rc) {
+ /*
+ * Unwind every remaining dc_extent in the group. The pin
+ * above keeps @group alive across this walk. Distinguish
+ * onlined dc_extents (have a devm action) from pending ones
+ * via devm_remove_action_nowarn(): a 0 return means the
+ * action was installed and is now consumed, so we run the
+ * unregister ourselves; -ENOENT means pending.
+ */
+ xa_for_each(&group->dc_extents, index, dc_extent) {
+ int r = devm_remove_action_nowarn(&cxlr_dax->dev,
+ dc_extent_unregister,
+ dc_extent);
+ if (r == 0)
+ dc_extent_unregister(dc_extent);
+ else
+ cleanup_pending_dc_extent(dc_extent);
+ }
+ }
+
+ /* Drop the pin; if nothing else still references @group, free it. */
+ group->nr_extents--;
+ if (!group->nr_extents)
+ free_tag_group(group);
+ return rc;
+}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index c376492fa166..e5edc3975e8f 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -6,6 +6,7 @@
#include <linux/mutex.h>
#include <linux/unaligned.h>
#include <linux/list.h>
+#include <linux/list_sort.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -1181,7 +1182,7 @@ static void delete_extent_node(struct cxl_extent_list_node *node)
kfree(node);
}
-static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
{
struct device *dev = mds->cxlds.dev;
struct cxl_extent_list_node *node;
@@ -1280,11 +1281,120 @@ static int add_to_pending_list(struct list_head *pending_list,
}
/*
- * Stub: stage extents on the pending list and reply with an empty
- * ADD_DC_RESPONSE on More=0 (refuse all). A later commit replaces
- * the no-op tail with the real Add pipeline that surfaces a dax
- * device per accepted extent.
+ * Compare two extents by shared_extn_seq (ascending). list_sort is
+ * stable so when shared_extn_seq is 0 for every entry (non-sharable
+ * partition) ties fall back to arrival order via list_add_tail() in
+ * add_to_pending_list().
*/
+static int extent_seq_compare(void *priv,
+ const struct list_head *a,
+ const struct list_head *b)
+{
+ const struct cxl_extent_list_node *ea =
+ list_entry(a, struct cxl_extent_list_node, list);
+ const struct cxl_extent_list_node *eb =
+ list_entry(b, struct cxl_extent_list_node, list);
+ u16 sa = le16_to_cpu(ea->extent->shared_extn_seq);
+ u16 sb = le16_to_cpu(eb->extent->shared_extn_seq);
+
+ if (sa < sb)
+ return -1;
+ if (sa > sb)
+ return 1;
+ return 0;
+}
+
+/*
+ * Move every pending extent whose tag matches @tag onto @group, preserving
+ * the order they appear in @pending.
+ */
+static void extract_tag_group(struct list_head *pending,
+ const uuid_t *tag,
+ struct list_head *group)
+{
+ struct cxl_extent_list_node *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, pending, list) {
+ uuid_t t;
+
+ import_uuid(&t, pos->extent->uuid);
+ if (uuid_equal(&t, tag))
+ list_move_tail(&pos->list, group);
+ }
+}
+
+/*
+ * Drive the pending Add-Capacity records through cxl_add_extent(),
+ * grouped by tag. Per group: extract from pending, stable-sort by
+ * shared_extn_seq, then attempt to add each extent. Online the tag
+ * group via online_tag_group() once all of its extents have been
+ * realized. Validation gates layer onto this loop in later commits.
+ */
+static int cxl_add_pending(struct cxl_memdev_state *mds)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct list_head *pending = &mds->add_ctx.pending_extents;
+ struct cxl_extent_list_node *pos, *tmp;
+ LIST_HEAD(accepted);
+ int total_accepted = 0;
+
+ while (!list_empty(pending)) {
+ LIST_HEAD(group);
+ struct cxl_dc_tag_group *tag_group;
+ int group_cnt = 0;
+ uuid_t tag;
+ int rc;
+
+ import_uuid(&tag,
+ list_first_entry(pending,
+ struct cxl_extent_list_node,
+ list)->extent->uuid);
+ extract_tag_group(pending, &tag, &group);
+ list_sort(NULL, &group, extent_seq_compare);
+
+ u16 logical_seq = 1;
+ list_for_each_entry_safe(pos, tmp, &group, list) {
+ u16 raw = le16_to_cpu(pos->extent->shared_extn_seq);
+ u16 seq = raw ? raw : logical_seq;
+
+ logical_seq++;
+
+ if (cxl_add_extent(mds, pos->extent, seq)) {
+ dev_dbg(dev,
+ "Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
+ &tag,
+ le64_to_cpu(pos->extent->start_dpa),
+ le64_to_cpu(pos->extent->length));
+ delete_extent_node(pos);
+ continue;
+ }
+ group_cnt++;
+ }
+
+ tag_group = mds->add_ctx.group;
+ if (!tag_group)
+ continue;
+
+ rc = online_tag_group(tag_group);
+ if (rc) {
+ dev_warn(dev,
+ "Tag %pUb: failed to online tag group (%d)\n",
+ &tag, rc);
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ } else {
+ list_splice_tail_init(&group, &accepted);
+ total_accepted += group_cnt;
+ }
+
+ mds->add_ctx.group = NULL;
+ }
+
+ list_splice(&accepted, pending);
+ return cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+ pending, total_accepted);
+}
+
static int handle_add_event(struct cxl_memdev_state *mds,
struct cxl_event_dcd *event)
{
@@ -1316,8 +1426,7 @@ static int handle_add_event(struct cxl_memdev_state *mds,
ctx->armed = false;
cancel_delayed_work(&ctx->timeout_work);
- rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
- &mds->add_ctx.pending_extents, 0);
+ rc = cxl_add_pending(mds);
clear_pending_extents(mds);
return rc;
}
diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
index d6bf69155827..519e203c486a 100644
--- a/drivers/cxl/core/region_dax.c
+++ b/drivers/cxl/core/region_dax.c
@@ -13,6 +13,7 @@ static void cxl_dax_region_release(struct device *dev)
{
struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ xa_destroy(&cxlr_dax->dc_extents);
kfree(cxlr_dax);
}
@@ -57,11 +58,13 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
if (!cxlr_dax)
return ERR_PTR(-ENOMEM);
+ xa_init_flags(&cxlr_dax->dc_extents, XA_FLAGS_ALLOC);
cxlr_dax->hpa_range.start = p->res->start;
cxlr_dax->hpa_range.end = p->res->end;
dev = &cxlr_dax->dev;
cxlr_dax->cxlr = cxlr;
+ cxlr->cxlr_dax = cxlr_dax;
device_initialize(dev);
lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
device_set_pm_not_required(dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5ef2cf4d005b..cbbfba92fea9 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -495,6 +495,7 @@ struct cxl_region_params {
* @type: Endpoint decoder target type
* @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
@@ -510,6 +511,7 @@ struct cxl_region {
enum cxl_decoder_type type;
struct cxl_nvdimm_bridge *cxl_nvb;
struct cxl_pmem_region *cxlr_pmem;
+ struct cxl_dax_region *cxlr_dax;
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
@@ -568,6 +570,15 @@ struct cxl_dax_region {
struct device dev;
struct cxl_region *cxlr;
struct range hpa_range;
+ /*
+ * dc_extents is keyed by an allocator-assigned u32 (see
+ * online_tag_group()). Tag groups have no first-class identity in
+ * this xarray; siblings within a tag find each other via
+ * dc_extent->group. Tag-uniqueness lookup is a linear xa_for_each
+ * walk, adequate at the bounded per-region extent counts the
+ * driver handles.
+ */
+ struct xarray dc_extents;
};
/**
@@ -595,6 +606,14 @@ struct cxl_dc_tag_group {
unsigned int nr_extents;
};
+bool is_dc_extent(struct device *dev);
+static inline struct dc_extent *to_dc_extent(struct device *dev)
+{
+ if (!is_dc_extent(dev))
+ return NULL;
+ return container_of(dev, struct dc_extent, dev);
+}
+
/**
* struct cxl_port - logical collection of upstream port devices and
* downstream port devices to construct a CXL memory
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 2be1df80fcc9..8941cf187462 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -63,7 +63,10 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
cxl_core-y += $(CXL_CORE_SRC)/pmu.o
cxl_core-y += $(CXL_CORE_SRC)/cdat.o
cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
-cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o $(CXL_CORE_SRC)/region_pmem.o $(CXL_CORE_SRC)/region_dax.o
+cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
+ $(CXL_CORE_SRC)/region_pmem.o \
+ $(CXL_CORE_SRC)/region_dax.o \
+ $(CXL_CORE_SRC)/extent.o
cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += $(CXL_CORE_SRC)/edac.o
--
2.43.0
* Re: [PATCH v10 14/31] cxl/extent: Handle DC Add Capacity events
2026-05-23 9:43 ` [PATCH v10 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
@ 2026-05-28 19:06 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 19:06 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> Replace the empty-response stub in handle_add_event() with the real
> add pipeline.
>
> DC Event Records can be grouped together with the 'More' flag. The
> previous commit completed the set up for holding onto extents in
> the pending list until receiving the last event record of the group,
> marked by 'More'=0.
>
> This commit fills in the logic for processing the pending list and
> adds basic validation for extents before they are added to the
> device model as a child of the cxlr_dax region. More complete
> checks for tags/sequence numbers/alignment are added in subsequent commits.
>
> For each tag that appears in the pending list:
> 1. Extract all extents in the pending list with that tag to a
> local list.
>
> 2. The spec requires that shareable extents are ordered by
> shared extent sequence number, which "instructs each host
> on the relative order these extents must be placed in adjacent
> virtual address space" (r4.0 Section 9.13.3 Figure 9-23
> Shared Extent List Example). Otherwise, retain arrival order.
>
> Thus the tag group is stable-sorted by shared_extn_seq; for non-sharable
> extents every key is 0 and the stable sort preserves arrival
> order.
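As a rough illustration of the ordering rule described above (this is
not the patch's list_sort() call), here is a userspace sketch of a
stable sort keyed on the sequence number. struct pending_ext and
sort_by_seq() are hypothetical names made up for this example. The only
property it relies on is that insertion sort never moves an element
past an equal key, so a group whose keys are all 0 keeps arrival order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pending extent record. */
struct pending_ext {
	uint16_t seq;	/* models shared_extn_seq; 0 means non-sharable */
	int arrival;	/* position in which the record arrived */
};

/*
 * Stable insertion sort, ascending by seq. Elements are only shifted
 * past strictly larger keys, so equal keys keep their relative order.
 */
static void sort_by_seq(struct pending_ext *v, size_t n)
{
	assert(v || n == 0);

	for (size_t i = 1; i < n; i++) {
		struct pending_ext cur = v[i];
		size_t j = i;

		while (j > 0 && v[j - 1].seq > cur.seq) {
			v[j] = v[j - 1];
			j--;
		}
		v[j] = cur;
	}
}
```

With every key equal to 0 the array comes back unchanged; with mixed
keys, entries that share a key stay in the order they arrived.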
>
> Individual extents are checked for the following:
> 1. The extent's DPA range fully resolves to an endpoint decoder.
>
> 2. Doesn't overlap with a previously accepted extent.
>
> 3. Sequence number doesn't collide with others in the same
> tag group
>
> Upon passing these checks, extents are "onlined" together
> as a tag group:
> online_tag_group() registers a struct device per
> dc_extent under cxlr_dax->dev so the dax layer can discover them
> via device_for_each_child().
>
> Once the pending list has been fully processed, send the
> DC_ADD_RESPONSE.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: restructured from the original "Process dynamic partition
> events" monolith; this commit fills in the Add path on top of the
> previous commit's stubs. Further validation lands in subsequent
> commits.]
> ---
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/core.h | 13 ++
> drivers/cxl/core/extent.c | 372 ++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 123 ++++++++++-
> drivers/cxl/core/region_dax.c | 3 +
> drivers/cxl/cxl.h | 19 ++
> tools/testing/cxl/Kbuild | 5 +-
> 7 files changed, 528 insertions(+), 9 deletions(-)
> create mode 100644 drivers/cxl/core/extent.c
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index ce7213818d3c..208917ad8aac 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -15,7 +15,7 @@ cxl_core-y += hdm.o
> cxl_core-y += pmu.o
> cxl_core-y += cdat.o
> cxl_core-$(CONFIG_TRACING) += trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += region.o region_pmem.o region_dax.o
> +cxl_core-$(CONFIG_CXL_REGION) += region.o region_pmem.o region_dax.o extent.o
> cxl_core-$(CONFIG_CXL_MCE) += mce.o
> cxl_core-$(CONFIG_CXL_FEATURES) += features.o
> cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 14723cfd05f0..1bae80dbf991 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -65,12 +65,24 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
> int devm_cxl_add_dax_region(struct cxl_region *cxlr);
> int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
>
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
> + u16 seq_num);
> +int online_tag_group(struct cxl_dc_tag_group *group);
> #else
> static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
> const struct cxl_memdev *cxlmd, u64 dpa)
> {
> return ULLONG_MAX;
> }
> +static inline int cxl_add_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent, u16 seq_num)
> +{
> + return 0;
> +}
> +static inline int online_tag_group(struct cxl_dc_tag_group *group)
> +{
> + return 0;
> +}
> static inline
> struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
> struct cxl_endpoint_decoder **cxled)
> @@ -166,6 +178,7 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
> int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
> int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
> struct access_coordinate *c);
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
>
> static inline struct device *port_to_host(struct cxl_port *port)
> {
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..94128d06f4ed
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,372 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <cxl.h>
> +
> +#include "core.h"
> +
> +
> +static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
> + struct dc_extent *dc_extent)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = &cxled->cxld.dev;
> +
> + dev_dbg(dev, "Remove extent %pra (%pU)\n",
> + &dc_extent->dpa_range, &dc_extent->uuid);
> + memdev_release_extent(mds, &dc_extent->dpa_range);
> +}
> +
> +static void free_tag_group(struct cxl_dc_tag_group *group)
> +{
> + xa_destroy(&group->dc_extents);
> + kfree(group);
> +}
> +
> +static void dc_extent_release(struct device *dev)
> +{
> + struct dc_extent *dc_extent = to_dc_extent(dev);
to_dc_extent() can return NULL, so dc_extent needs to be checked before it is dereferenced.
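Something along these lines, as an untested userspace model rather than
driver code. struct model_dev, struct model_ext, to_model_ext() and
model_release() are hypothetical stand-ins for struct device, struct
dc_extent, to_dc_extent() and the release callback; the sketch only
shows the guard before the first dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: only the type pointer. */
struct model_dev {
	const void *type;
};

/* Hypothetical stand-in for struct dc_extent. */
struct model_ext {
	int released;
	struct model_dev dev;
};

/* Address of this object plays the role of &dc_extent_type. */
static const char model_ext_type = 0;

/* Return the containing model_ext, or NULL if @dev is another type. */
static struct model_ext *to_model_ext(struct model_dev *dev)
{
	assert(dev);
	if (dev->type != &model_ext_type)
		return NULL;
	return (struct model_ext *)((char *)dev -
				    offsetof(struct model_ext, dev));
}

/* Release path that bails out instead of dereferencing NULL. */
static int model_release(struct model_dev *dev)
{
	struct model_ext *ext = to_model_ext(dev);

	if (!ext)
		return -1;
	ext->released = 1;
	return 0;
}
```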
> + struct cxl_dc_tag_group *group = dc_extent->group;
> +
> + cxled_release_extent(dc_extent->cxled, dc_extent);
> + xa_erase(&group->cxlr_dax->dc_extents, dc_extent->dev.id);
This could potentially be a problem. See the comments on online_tag_group() below.
> + xa_erase(&group->dc_extents, dc_extent->seq_num);
> + group->nr_extents--;
> + if (!group->nr_extents)
> + free_tag_group(group);
> + kfree(dc_extent);
> +}
> +
> +static const struct device_type dc_extent_type = {
> + .name = "extent",
> + .release = dc_extent_release,
> +};
> +
> +bool is_dc_extent(struct device *dev)
> +{
> + return dev->type == &dc_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_dc_extent, "CXL");
> +
> +static struct cxl_dc_tag_group *
> +alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
> +{
> + struct cxl_dc_tag_group *group __free(kfree) =
> + kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group)
> + return ERR_PTR(-ENOMEM);
> +
> + group->cxlr_dax = cxlr_dax;
> + uuid_copy(&group->uuid, uuid);
> + xa_init(&group->dc_extents);
> + return no_free_ptr(group);
> +}
> +
> +/*
> + * Stage 1 of the add pipeline: pure, no allocation. Resolve the extent
> + * to its region/endpoint decoder and ext_range, and verify the range
> + * fits in the resolved endpoint decoder's DPA resource. Further
> + * per-extent invariants layer into this function in subsequent commits.
> + *
> + * Caller must hold cxl_rwsem.region for read (cxl_dpa_to_region()).
> + * On success, @out_cxled / @out_cxlr_dax / @out_ext_range carry the
> + * resolved handles consumed by the rest of the pipeline.
> + */
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent,
> + struct cxl_endpoint_decoder **out_cxled,
> + struct cxl_dax_region **out_cxlr_dax,
> + struct range *out_ext_range)
> +{
> + u64 start_dpa = le64_to_cpu(extent->start_dpa);
> + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_region *cxlr;
> + struct range ext_range = (struct range) {
> + .start = start_dpa,
> + .end = start_dpa + le64_to_cpu(extent->length) - 1,
> + };
> + struct range ed_range;
> +
> + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> + if (!cxlr)
> + return -ENXIO;
> +
> + ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> + if (!range_contains(&ed_range, &ext_range)) {
> + dev_err_ratelimited(&cxled->cxld.dev,
> + "DC extent DPA %pra (%pU) is not fully in ED %pra\n",
> + &ext_range, extent->uuid, &ed_range);
> + return -ENXIO;
> + }
> +
> + *out_cxled = cxled;
> + *out_cxlr_dax = cxlr->cxlr_dax;
Should there be a check for cxled and cxlr->cxlr_dax to make sure they aren't NULL?
> + *out_ext_range = ext_range;
> + return 0;
> +}
> +
> +enum cxl_extent_class {
> + CXL_EXT_NEW,
> + CXL_EXT_DUPLICATE,
> + CXL_EXT_OVERLAP,
> +};
> +
> +/*
> + * Stage 2: classify @ext_range against extents already accepted on this
> + * cxlr_dax+cxled. Walks cxlr_dax->dc_extents once: a stored extent that
> + * fully contains @ext_range means a duplicate accept (idempotent, fine);
> + * a stored extent that only overlaps means an inconsistent offer.
> + */
> +static enum cxl_extent_class
> +cxlr_dax_classify_extent(struct cxl_dax_region *cxlr_dax,
> + struct cxl_endpoint_decoder *cxled,
> + const struct range *ext_range)
> +{
> + struct dc_extent *entry;
> + unsigned long i;
> +
> + xa_for_each(&cxlr_dax->dc_extents, i, entry) {
> + if (entry->cxled != cxled)
> + continue;
> + if (range_contains(&entry->dpa_range, ext_range))
> + return CXL_EXT_DUPLICATE;
> + if (range_overlaps(&entry->dpa_range, ext_range))
> + return CXL_EXT_OVERLAP;
> + }
> + return CXL_EXT_NEW;
> +}
> +
> +/*
> + * Stage 3: allocate and populate a dc_extent for an already-validated,
> + * already-classified-as-new @ext_range. Only -ENOMEM can fail here.
> + */
> +static struct dc_extent *
> +dc_extent_build(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dax_region *cxlr_dax,
> + struct cxl_extent *extent,
> + const struct range *ext_range, u16 seq_num)
> +{
> + resource_size_t dpa_offset = ext_range->start - cxled->dpa_res->start;
> + resource_size_t hpa = cxled->cxld.hpa_range.start + dpa_offset;
> + struct dc_extent *dc_extent;
> +
> + dc_extent = kzalloc(sizeof(*dc_extent), GFP_KERNEL);
> + if (!dc_extent)
> + return ERR_PTR(-ENOMEM);
> +
> + dc_extent->cxled = cxled;
> + dc_extent->dpa_range = *ext_range;
> + dc_extent->hpa_range.start = hpa - cxlr_dax->hpa_range.start;
> + dc_extent->hpa_range.end = dc_extent->hpa_range.start +
> + range_len(ext_range) - 1;
> + dc_extent->seq_num = seq_num;
> + import_uuid(&dc_extent->uuid, extent->uuid);
> + return dc_extent;
> +}
> +
> +/*
> + * Stage 4: insert @dc_extent into the pending tag group. All extents in
> + * one More-chain group share a UUID — enforced here as the group is
> + * either being created (first extent) or appended to. On any failure
> + * the dc_extent is freed.
> + */
> +static int cxlr_add_extent(struct cxl_memdev_state *mds,
> + struct cxl_dax_region *cxlr_dax,
> + struct dc_extent *dc_extent)
> +{
> + struct cxl_dc_tag_group **group = &mds->add_ctx.group;
> + int rc;
> +
> + if (*group && !uuid_equal(&(*group)->uuid, &dc_extent->uuid)) {
> + kfree(dc_extent);
> + return -EINVAL;
> + }
> +
> + if (!*group) {
> + dev_dbg(&cxlr_dax->dev, "Alloc new tag group\n");
> + *group = alloc_tag_group(cxlr_dax, &dc_extent->uuid);
> + if (IS_ERR(*group)) {
> + rc = PTR_ERR(*group);
> + *group = NULL;
> + kfree(dc_extent);
> + return rc;
> + }
> + } else {
> + dev_dbg(&cxlr_dax->dev, "Append dc_extent to tag group\n");
> + }
> +
> + dc_extent->group = *group;
> +
> + /*
> + * Key by @seq_num so iteration order equals assembly order, in both
> + * the sharable case (device-stamped 1..n) and the non-sharable case
> + * (host-assigned arrival-order 1..n). A collision here signals a
> + * cxl-side validation gap.
> + */
> + rc = xa_insert(&(*group)->dc_extents, dc_extent->seq_num,
> + dc_extent, GFP_KERNEL);
> + if (rc) {
> + dev_WARN_ONCE(&cxlr_dax->dev, rc == -EBUSY,
> + "duplicate seq_num %u in tag %pUb\n",
> + dc_extent->seq_num, &dc_extent->uuid);
> + kfree(dc_extent);
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
> + u16 seq_num)
> +{
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_dax_region *cxlr_dax;
> + struct dc_extent *dc_extent;
> + struct range ext_range;
> + int rc;
> +
> + guard(rwsem_read)(&cxl_rwsem.region);
> +
> + rc = cxl_validate_extent(mds, extent, &cxled, &cxlr_dax, &ext_range);
> + if (rc)
> + return rc;
> +
> + switch (cxlr_dax_classify_extent(cxlr_dax, cxled, &ext_range)) {
> + case CXL_EXT_DUPLICATE:
> + /*
> + * Idempotent accept simplifies the dax-side scan for existing
> + * extents on region creation; reply success without duplicating.
> + */
> + dev_warn_ratelimited(&cxled->cxld.dev,
> + "Extent %pra exists; accept again\n",
> + &ext_range);
> + return 0;
Does this cause issues with the accounting in cxl_add_pending()? At the
end, total_accepted would include duplicates as well. I wonder if
modifying cxlr_add_extent() to return 1 when an extent is added, and
handling the return values appropriately, would make the accounting
more accurate.
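To make the suggestion concrete, here is an untested userspace sketch
of the return convention. add_result(), count_accepted() and enum
ext_class are hypothetical and only model the three outcomes (new,
duplicate, overlap); they are not the driver's functions, and whether
duplicates should be left out of the count is still the open question.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcome of classifying one offered extent. */
enum ext_class { EXT_NEW, EXT_DUPLICATE, EXT_OVERLAP };

/*
 * Proposed convention: 1 when a new extent is actually added, 0 when
 * the offer is a duplicate that is accepted without adding anything,
 * and a negative errno when the offer is rejected.
 */
static int add_result(enum ext_class class)
{
	switch (class) {
	case EXT_NEW:
		return 1;
	case EXT_DUPLICATE:
		return 0;
	case EXT_OVERLAP:
		return -ENXIO;
	}
	return -EINVAL;
}

/* Count only the offers that produced a new extent. */
static int count_accepted(const enum ext_class *offers, size_t n)
{
	int accepted = 0;

	assert(offers || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (add_result(offers[i]) > 0)
			accepted++;
	}
	return accepted;
}
```

The caller then distinguishes rc < 0 (drop the node), rc == 0 (keep it
but do not count it) and rc > 0 (count it).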
> + case CXL_EXT_OVERLAP:
> + return -ENXIO;
> + case CXL_EXT_NEW:
> + break;
> + }
> +
> + dc_extent = dc_extent_build(cxled, cxlr_dax, extent, &ext_range,
> + seq_num);
> + if (IS_ERR(dc_extent))
> + return PTR_ERR(dc_extent);
> +
> + dev_dbg(&cxled->cxld.dev, "Add extent %pra (%pU)\n",
> + &dc_extent->dpa_range, &dc_extent->uuid);
> +
> + return cxlr_add_extent(mds, cxlr_dax, dc_extent);
> +}
> +
> +static void dc_extent_unregister(void *ext)
> +{
> + struct dc_extent *dc_extent = ext;
> +
> + dev_dbg(&dc_extent->dev, "DAX region rm extent HPA %pra\n",
> + &dc_extent->hpa_range);
> + device_unregister(&dc_extent->dev);
> +}
> +
> +static void cleanup_pending_dc_extent(struct dc_extent *dc_extent)
> +{
> + struct cxl_dc_tag_group *group = dc_extent->group;
> +
> + cxled_release_extent(dc_extent->cxled, dc_extent);
> + xa_erase(&group->dc_extents, dc_extent->seq_num);
> + group->nr_extents--;
> + if (!group->nr_extents)
> + free_tag_group(group);
> + kfree(dc_extent);
> +}
> +
> +int online_tag_group(struct cxl_dc_tag_group *group)
> +{
> + struct cxl_dax_region *cxlr_dax = group->cxlr_dax;
> + struct dc_extent *dc_extent;
> + unsigned long index;
> + int rc = 0;
> +
> + /*
> + * Seed nr_extents with the full group size plus a +1 pin held by
> + * this function. The size counts every dc_extent that might
> + * decrement nr_extents on cleanup; the pin keeps @group alive
> + * across the body even if every dc_extent release fires inside
> + * the loop (e.g. devm_add_action_or_reset failure on the only
> + * pending extent). The pin is dropped at the end of the function.
> + */
> + xa_for_each(&group->dc_extents, index, dc_extent)
> + group->nr_extents++;
> + group->nr_extents++;
> +
> + xa_for_each(&group->dc_extents, index, dc_extent) {
> + struct device *dev = &dc_extent->dev;
> + u32 id;
> +
> + device_initialize(dev);
> + device_set_pm_not_required(dev);
> + dev->parent = &cxlr_dax->dev;
> + dev->type = &dc_extent_type;
> +
> + rc = xa_alloc(&cxlr_dax->dc_extents, &id, dc_extent,
> + xa_limit_32b, GFP_KERNEL);
> + if (rc < 0) {
> + put_device(dev);
Here xa_alloc() fails, so there is no valid id. But the put_device()
will trigger dc_extent_release(), which passes the still-zero dev->id to
xa_erase() and can erase slot 0 even though that slot was never assigned
to this extent. The xarray could be initialized with XA_FLAGS_ALLOC1 so
that ids start at 1 and slot 0 is never used.
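A userspace model of the aliasing problem, assuming a zero-initialized
id field and a small table in place of the xarray. struct id_table and
its helpers are hypothetical stand-ins, not the xarray API. The point
is only that when ids start at 1, an erase keyed on a stale id of 0
cannot hit a live entry.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SLOTS 8

/* Hypothetical stand-in for an allocating xarray. */
struct id_table {
	void *slot[TABLE_SLOTS];
	/* 0 models XA_FLAGS_ALLOC, 1 models XA_FLAGS_ALLOC1 */
	unsigned int first_id;
};

/* Store @entry in the lowest free slot at or above first_id. */
static int table_alloc(struct id_table *t, unsigned int *id, void *entry)
{
	assert(t && id && entry);

	for (unsigned int i = t->first_id; i < TABLE_SLOTS; i++) {
		if (!t->slot[i]) {
			t->slot[i] = entry;
			*id = i;
			return 0;
		}
	}
	return -1;	/* table full; *id is left untouched */
}

/* Clear slot @id unconditionally, as an erase on a stale id would. */
static void table_erase(struct id_table *t, unsigned int id)
{
	assert(t);
	if (id < TABLE_SLOTS)
		t->slot[id] = NULL;
}
```

With first_id = 0, erasing id 0 on behalf of an object that never got
an id drops whichever entry legitimately owns slot 0. With first_id = 1
the same erase is a no-op.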
> + break;
> + }
> + dev->id = id;
> +
> + rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id,
> + dev->id);
> + if (rc) {
> + xa_erase(&cxlr_dax->dc_extents, dev->id);
> + put_device(dev);
> + break;
> + }
> +
> + rc = device_add(dev);
> + if (rc) {
> + xa_erase(&cxlr_dax->dc_extents, dev->id);
> + put_device(dev);
> + break;
> + }
> +
> + dev_dbg(dev, "dc_extent HPA %pra (%pU)\n",
> + &dc_extent->hpa_range, &group->uuid);
> +
> + rc = devm_add_action_or_reset(&cxlr_dax->dev,
> + dc_extent_unregister, dc_extent);
> + if (rc)
> + break;
> + }
> +
> + if (rc) {
> + /*
> + * Unwind every remaining dc_extent in the group. The pin
> + * above keeps @group alive across this walk. Distinguish
> + * onlined dc_extents (have a devm action) from pending ones
> + * via devm_remove_action_nowarn(): a 0 return means the
> + * action was installed and is now consumed, so we run the
> + * unregister ourselves; -ENOENT means pending.
> + */
> + xa_for_each(&group->dc_extents, index, dc_extent) {
> + int r = devm_remove_action_nowarn(&cxlr_dax->dev,
> + dc_extent_unregister,
> + dc_extent);
> + if (r == 0)
> + dc_extent_unregister(dc_extent);
> + else
> + cleanup_pending_dc_extent(dc_extent);
> + }
> + }
> +
> + /* Drop the pin; if nothing else still references @group, free it. */
> + group->nr_extents--;
> + if (!group->nr_extents)
> + free_tag_group(group);
> + return rc;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index c376492fa166..e5edc3975e8f 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -6,6 +6,7 @@
> #include <linux/mutex.h>
> #include <linux/unaligned.h>
> #include <linux/list.h>
> +#include <linux/list_sort.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> #include <cxl.h>
> @@ -1181,7 +1182,7 @@ static void delete_extent_node(struct cxl_extent_list_node *node)
> kfree(node);
> }
>
> -static void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> {
> struct device *dev = mds->cxlds.dev;
> struct cxl_extent_list_node *node;
> @@ -1280,11 +1281,120 @@ static int add_to_pending_list(struct list_head *pending_list,
> }
>
> /*
> - * Stub: stage extents on the pending list and reply with an empty
> - * ADD_DC_RESPONSE on More=0 (refuse all). A later commit replaces
> - * the no-op tail with the real Add pipeline that surfaces a dax
> - * device per accepted extent.
> + * Compare two extents by shared_extn_seq (ascending). list_sort is
> + * stable so when shared_extn_seq is 0 for every entry (non-sharable
> + * partition) ties fall back to arrival order via list_add_tail() in
> + * add_to_pending_list().
> */
> +static int extent_seq_compare(void *priv,
> + const struct list_head *a,
> + const struct list_head *b)
> +{
> + const struct cxl_extent_list_node *ea =
> + list_entry(a, struct cxl_extent_list_node, list);
> + const struct cxl_extent_list_node *eb =
> + list_entry(b, struct cxl_extent_list_node, list);
> + u16 sa = le16_to_cpu(ea->extent->shared_extn_seq);
> + u16 sb = le16_to_cpu(eb->extent->shared_extn_seq);
> +
> + if (sa < sb)
> + return -1;
> + if (sa > sb)
> + return 1;
> + return 0;
> +}
> +
> +/*
> + * Move every pending extent whose tag matches @tag onto @group, preserving
> + * the order they appear in @pending.
> + */
> +static void extract_tag_group(struct list_head *pending,
> + const uuid_t *tag,
> + struct list_head *group)
> +{
> + struct cxl_extent_list_node *pos, *tmp;
> +
> + list_for_each_entry_safe(pos, tmp, pending, list) {
> + uuid_t t;
> +
> + import_uuid(&t, pos->extent->uuid);
> + if (uuid_equal(&t, tag))
> + list_move_tail(&pos->list, group);
> + }
> +}
> +
> +/*
> + * Drive the pending Add-Capacity records through cxl_add_extent(),
> + * grouped by tag. Per group: extract from pending, stable-sort by
> + * shared_extn_seq, then attempt to add each extent. Online the tag
> + * group via online_tag_group() once all of its extents have been
> + * realized. Validation gates layer onto this loop in later commits.
> + */
> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct list_head *pending = &mds->add_ctx.pending_extents;
> + struct cxl_extent_list_node *pos, *tmp;
> + LIST_HEAD(accepted);
> + int total_accepted = 0;
> +
> + while (!list_empty(pending)) {
> + LIST_HEAD(group);
> + struct cxl_dc_tag_group *tag_group;
> + int group_cnt = 0;
> + uuid_t tag;
> + int rc;
> +
> + import_uuid(&tag,
> + list_first_entry(pending,
> + struct cxl_extent_list_node,
> + list)->extent->uuid);
> + extract_tag_group(pending, &tag, &group);
> + list_sort(NULL, &group, extent_seq_compare);
> +
> + u16 logical_seq = 1;
logical_seq is declared in the middle of the code.
> + list_for_each_entry_safe(pos, tmp, &group, list) {
> + u16 raw = le16_to_cpu(pos->extent->shared_extn_seq);
> + u16 seq = raw ? raw : logical_seq;
> +
> + logical_seq++;
> +
> + if (cxl_add_extent(mds, pos->extent, seq)) {
> + dev_dbg(dev,
> + "Tag %pUb: failed to add extent DPA:%#llx LEN:%#llx\n",
> + &tag,
> + le64_to_cpu(pos->extent->start_dpa),
> + le64_to_cpu(pos->extent->length));
> + delete_extent_node(pos);
> + continue;
> + }
> + group_cnt++;
> + }
> +
> + tag_group = mds->add_ctx.group;
> + if (!tag_group)
> + continue;
Does this need to call delete_extent_node() on the nodes still in the group list before the continue?
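What I mean, as a userspace sketch: if I read it right, when nothing
was added for this tag (for example every extent was a duplicate), the
nodes still sitting on the local list go out of scope with it. struct
node, push_node() and drain_group() below are hypothetical; in the
patch the equivalent would presumably be a list_for_each_entry_safe()
walk calling delete_extent_node().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cxl_extent_list_node on a local list. */
struct node {
	struct node *next;
};

/* Push a freshly allocated node on the list; returns 0 on success. */
static int push_node(struct node **head)
{
	struct node *n = malloc(sizeof(*n));

	assert(head);
	if (!n)
		return -1;
	n->next = *head;
	*head = n;
	return 0;
}

/* Free every node left on the list and return how many were freed. */
static int drain_group(struct node **head)
{
	int freed = 0;

	assert(head);
	while (*head) {
		struct node *n = *head;

		*head = n->next;
		free(n);
		freed++;
	}
	return freed;
}
```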
DJ
> +
> + rc = online_tag_group(tag_group);
> + if (rc) {
> + dev_warn(dev,
> + "Tag %pUb: failed to online tag group (%d)\n",
> + &tag, rc);
> + list_for_each_entry_safe(pos, tmp, &group, list)
> + delete_extent_node(pos);
> + } else {
> + list_splice_tail_init(&group, &accepted);
> + total_accepted += group_cnt;
> + }
> +
> + mds->add_ctx.group = NULL;
> + }
> +
> + list_splice(&accepted, pending);
> + return cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> + pending, total_accepted);
> +}
> +
> static int handle_add_event(struct cxl_memdev_state *mds,
> struct cxl_event_dcd *event)
> {
> @@ -1316,8 +1426,7 @@ static int handle_add_event(struct cxl_memdev_state *mds,
> ctx->armed = false;
> cancel_delayed_work(&ctx->timeout_work);
>
> - rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> - &mds->add_ctx.pending_extents, 0);
> + rc = cxl_add_pending(mds);
> clear_pending_extents(mds);
> return rc;
> }
> diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
> index d6bf69155827..519e203c486a 100644
> --- a/drivers/cxl/core/region_dax.c
> +++ b/drivers/cxl/core/region_dax.c
> @@ -13,6 +13,7 @@ static void cxl_dax_region_release(struct device *dev)
> {
> struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
>
> + xa_destroy(&cxlr_dax->dc_extents);
> kfree(cxlr_dax);
> }
>
> @@ -57,11 +58,13 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
> if (!cxlr_dax)
> return ERR_PTR(-ENOMEM);
>
> + xa_init_flags(&cxlr_dax->dc_extents, XA_FLAGS_ALLOC);
> cxlr_dax->hpa_range.start = p->res->start;
> cxlr_dax->hpa_range.end = p->res->end;
>
> dev = &cxlr_dax->dev;
> cxlr_dax->cxlr = cxlr;
> + cxlr->cxlr_dax = cxlr_dax;
> device_initialize(dev);
> lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> device_set_pm_not_required(dev);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5ef2cf4d005b..cbbfba92fea9 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -495,6 +495,7 @@ struct cxl_region_params {
> * @type: Endpoint decoder target type
> * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
> * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
> * @flags: Region state flags
> * @params: active + config params for the region
> * @coord: QoS access coordinates for the region
> @@ -510,6 +511,7 @@ struct cxl_region {
> enum cxl_decoder_type type;
> struct cxl_nvdimm_bridge *cxl_nvb;
> struct cxl_pmem_region *cxlr_pmem;
> + struct cxl_dax_region *cxlr_dax;
> unsigned long flags;
> struct cxl_region_params params;
> struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> @@ -568,6 +570,15 @@ struct cxl_dax_region {
> struct device dev;
> struct cxl_region *cxlr;
> struct range hpa_range;
> + /*
> + * dc_extents is keyed by an allocator-assigned u32 (see
> + * online_tag_group()). Tag groups have no first-class identity in
> + * this xarray; siblings within a tag find each other via
> + * dc_extent->group. Tag-uniqueness lookup is a linear xa_for_each
> + * walk, adequate at the bounded per-region extent counts the
> + * driver handles.
> + */
> + struct xarray dc_extents;
> };
>
> /**
> @@ -595,6 +606,14 @@ struct cxl_dc_tag_group {
> unsigned int nr_extents;
> };
>
> +bool is_dc_extent(struct device *dev);
> +static inline struct dc_extent *to_dc_extent(struct device *dev)
> +{
> + if (!is_dc_extent(dev))
> + return NULL;
> + return container_of(dev, struct dc_extent, dev);
> +}
> +
> /**
> * struct cxl_port - logical collection of upstream port devices and
> * downstream port devices to construct a CXL memory
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 2be1df80fcc9..8941cf187462 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -63,7 +63,10 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> cxl_core-y += $(CXL_CORE_SRC)/pmu.o
> cxl_core-y += $(CXL_CORE_SRC)/cdat.o
> cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o $(CXL_CORE_SRC)/region_pmem.o $(CXL_CORE_SRC)/region_dax.o
> +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
> + $(CXL_CORE_SRC)/region_pmem.o \
> + $(CXL_CORE_SRC)/region_dax.o \
> + $(CXL_CORE_SRC)/extent.o
> cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
> cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
> cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += $(CXL_CORE_SRC)/edac.o
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 15/31] cxl/mem: Drop misaligned DCD extent groups
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (13 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 14/31] cxl/extent: Handle DC Add Capacity events Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 21:03 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 16/31] cxl/extent: Validate DC extent partition Anisa Su
` (17 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
Add an alignment gate to cxl_add_pending(): every extent in a tag group
must have its start_dpa and length aligned to CXL_DCD_EXTENT_ALIGN (SZ_2M,
the minimum device-dax mapping granularity on every architecture that
enables CXL DCD). A misaligned extent makes the resulting dax device
unusable, so drop the whole group rather than accept a partial allocation
that would surface a broken dax_resource.
Based on patches by John Groves.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Groves <John@Groves.net>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: split out as a separate validation step]
---
drivers/cxl/core/mbox.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index e5edc3975e8f..421bd716a273 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -7,6 +7,7 @@
#include <linux/unaligned.h>
#include <linux/list.h>
#include <linux/list_sort.h>
+#include <linux/sizes.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -1280,6 +1281,24 @@ static int add_to_pending_list(struct list_head *pending_list,
return 0;
}
+/*
+ * Device-dax requires extent boundaries aligned to its mapping granularity.
+ * Use SZ_2M as a conservative default; a tighter check that queries the
+ * cxl_dax_region / cxl_endpoint_decoder for its actual alignment would be
+ * strictly more correct, but SZ_2M is the minimum device-dax supports on
+ * every architecture that enables CXL DCD today.
+ */
+#define CXL_DCD_EXTENT_ALIGN SZ_2M
+
+static bool cxl_extent_dcd_aligned(const struct cxl_extent *extent)
+{
+ u64 start = le64_to_cpu(extent->start_dpa);
+ u64 len = le64_to_cpu(extent->length);
+
+ return IS_ALIGNED(start, CXL_DCD_EXTENT_ALIGN) &&
+ IS_ALIGNED(len, CXL_DCD_EXTENT_ALIGN);
+}
+
/*
* Compare two extents by shared_extn_seq (ascending). list_sort is
* stable so when shared_extn_seq is 0 for every entry (non-sharable
@@ -1352,6 +1371,26 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
extract_tag_group(pending, &tag, &group);
list_sort(NULL, &group, extent_seq_compare);
+ /* Alignment gate — abort the group if any member fails */
+ bool aligned = true;
+ list_for_each_entry(pos, &group, list) {
+ if (!cxl_extent_dcd_aligned(pos->extent)) {
+ dev_warn(dev,
+ "Tag %pUb: dropping group, extent DPA:%#llx LEN:%#llx not %u-aligned\n",
+ &tag,
+ le64_to_cpu(pos->extent->start_dpa),
+ le64_to_cpu(pos->extent->length),
+ CXL_DCD_EXTENT_ALIGN);
+ aligned = false;
+ break;
+ }
+ }
+ if (!aligned) {
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ continue;
+ }
+
u16 logical_seq = 1;
list_for_each_entry_safe(pos, tmp, &group, list) {
u16 raw = le16_to_cpu(pos->extent->shared_extn_seq);
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 15/31] cxl/mem: Drop misaligned DCD extent groups
2026-05-23 9:43 ` [PATCH v10 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
@ 2026-05-28 21:03 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 21:03 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> Add an alignment gate to cxl_add_pending(): every extent in a tag group
> must have its start_dpa and length aligned to CXL_DCD_EXTENT_ALIGN (SZ_2M,
> the minimum device-dax mapping granularity on every architecture that
> enables CXL DCD). A misaligned extent makes the resulting dax device
> unusable, so drop the whole group rather than accept a partial allocation
> that would surface a broken dax_resource.
>
> Based on patches by John Groves.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Groves <John@Groves.net>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: split out as a separate validation step]
> ---
> drivers/cxl/core/mbox.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index e5edc3975e8f..421bd716a273 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -7,6 +7,7 @@
> #include <linux/unaligned.h>
> #include <linux/list.h>
> #include <linux/list_sort.h>
> +#include <linux/sizes.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> #include <cxl.h>
> @@ -1280,6 +1281,24 @@ static int add_to_pending_list(struct list_head *pending_list,
> return 0;
> }
>
> +/*
> + * Device-dax requires extent boundaries aligned to its mapping granularity.
> + * Use SZ_2M as a conservative default; a tighter check that queries the
> + * cxl_dax_region / cxl_endpoint_decoder for its actual alignment would be
> + * strictly more correct, but SZ_2M is the minimum device-dax supports on
> + * every architecture that enables CXL DCD today.
> + */
> +#define CXL_DCD_EXTENT_ALIGN SZ_2M
I wonder whether this would cause issues for DAX on ARM64 with a 64K page size, since the PMD size there is 512M.
DJ
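To make the concern concrete, here is a minimal user-space sketch (not kernel code) of the same power-of-two alignment test applied with two different granules. The helper name extent_aligned() and the constants ALIGN_2M / ALIGN_512M are hypothetical; the 512M figure is simply the PMD size mentioned above for ARM64 with 64K pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ALIGN_2M   (UINT64_C(2) << 20)    /* same value as SZ_2M */
#define ALIGN_512M (UINT64_C(512) << 20)  /* PMD size cited for arm64 + 64K pages */

/* Hypothetical helper: true if both start and length are multiples of align. */
static bool extent_aligned(uint64_t start, uint64_t len, uint64_t align)
{
	/* The mask test below is only valid for a non-zero power of two. */
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((start | len) & (align - 1)) == 0;
}
```

With these definitions, an extent at DPA 0x200000 with length 0x400000 passes the 2M check but fails the 512M one, which is the situation the question above is asking about.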
> +
> +static bool cxl_extent_dcd_aligned(const struct cxl_extent *extent)
> +{
> + u64 start = le64_to_cpu(extent->start_dpa);
> + u64 len = le64_to_cpu(extent->length);
> +
> + return IS_ALIGNED(start, CXL_DCD_EXTENT_ALIGN) &&
> + IS_ALIGNED(len, CXL_DCD_EXTENT_ALIGN);
> +}
> +
> /*
> * Compare two extents by shared_extn_seq (ascending). list_sort is
> * stable so when shared_extn_seq is 0 for every entry (non-sharable
> @@ -1352,6 +1371,26 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
> extract_tag_group(pending, &tag, &group);
> list_sort(NULL, &group, extent_seq_compare);
>
> + /* Alignment gate — abort the group if any member fails */
> + bool aligned = true;
This declares a variable in the middle of the code.
> + list_for_each_entry(pos, &group, list) {
> + if (!cxl_extent_dcd_aligned(pos->extent)) {
> + dev_warn(dev,
> + "Tag %pUb: dropping group, extent DPA:%#llx LEN:%#llx not %u-aligned\n",
> + &tag,
> + le64_to_cpu(pos->extent->start_dpa),
> + le64_to_cpu(pos->extent->length),
> + CXL_DCD_EXTENT_ALIGN);
> + aligned = false;
> + break;
> + }
> + }
> + if (!aligned) {
> + list_for_each_entry_safe(pos, tmp, &group, list)
> + delete_extent_node(pos);
> + continue;
> + }
> +
> u16 logical_seq = 1;
Looks like this one came from a previous patch.
> list_for_each_entry_safe(pos, tmp, &group, list) {
> u16 raw = le16_to_cpu(pos->extent->shared_extn_seq);
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 16/31] cxl/extent: Validate DC extent partition
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (14 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 15/31] cxl/mem: Drop misaligned DCD extent groups Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 21:34 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 17/31] cxl/mem: Enforce tag-group semantics Anisa Su
` (16 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
Extend cxl_validate_extent(), the per-extent check of the add pipeline,
to check partition membership.
Resolve an extent's DPA to its containing DC partition. Then, depending
on whether the partition is shareable:
- Shareable: tag must be non-null and shared_extn_seq must be non-zero,
because multiple hosts reading the same allocation rely on the
device-stamped 1..n sequence to assemble extents in an agreed order.
- Non-shareable: shared_extn_seq must be zero, because sequencing is
meaningless when only one host consumes the allocation; tag is
optional (null UUID permitted).
Any cross-mix is a device firmware bug; reject the extent.
Based on patches by John Groves.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Groves <John@Groves.net>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: split out as a separate validation step]
---
drivers/cxl/core/core.h | 4 ++
drivers/cxl/core/extent.c | 78 +++++++++++++++++++++++++++++++++++++--
2 files changed, 79 insertions(+), 3 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1bae80dbf991..30b6b05b155b 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -179,6 +179,10 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
+const struct cxl_dpa_partition *
+cxl_extent_dc_partition(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent,
+ struct range *ext_range);
static inline struct device *port_to_host(struct cxl_port *port)
{
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 94128d06f4ed..b01507022cff 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -63,11 +63,55 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
return no_free_ptr(group);
}
+/*
+ * Find the DC (Dynamic Capacity) partition that fully contains @ext_range,
+ * or NULL if the extent falls outside every DC partition on this memdev.
+ * The returned pointer is owned by mds->cxlds.part[] and lives for the
+ * lifetime of the memdev.
+ */
+const struct cxl_dpa_partition *
+cxl_extent_dc_partition(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent,
+ struct range *ext_range)
+{
+ struct cxl_dev_state *cxlds = &mds->cxlds;
+ struct device *dev = mds->cxlds.dev;
+
+ for (int i = 0; i < cxlds->nr_partitions; i++) {
+ struct cxl_dpa_partition *part = &cxlds->part[i];
+ struct range partition_range = {
+ .start = part->res.start,
+ .end = part->res.end,
+ };
+
+ if (part->mode != CXL_PARTMODE_DYNAMIC_RAM_A)
+ continue;
+
+ if (range_contains(&partition_range, ext_range)) {
+ dev_dbg(dev, "DC extent DPA %pra (DCR:%pra)(%pU)\n",
+ ext_range, &partition_range, extent->uuid);
+ return part;
+ }
+ }
+
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) is not in a valid DC partition\n",
+ ext_range, extent->uuid);
+ return NULL;
+}
+
/*
* Stage 1 of the add pipeline: pure, no allocation. Resolve the extent
- * to its region/endpoint decoder and ext_range, and verify the range
- * fits in the resolved endpoint decoder's DPA resource. Further
- * per-extent invariants layer into this function in subsequent commits.
+ * to its region/endpoint decoder and ext_range, and enforce every
+ * per-extent invariant the device must satisfy:
+ *
+ * - DPA falls inside a Dynamic Capacity partition (cxl_extent_dc_partition).
+ * - CDAT-sharability rules:
+ * sharable: tag must be non-null AND shared_extn_seq != 0
+ * non-sharable: shared_extn_seq must be 0 (tag is optional)
+ * Any cross-mixing is a device firmware bug.
+ * - DPA resolves to an endpoint decoder attached to a region.
+ * - The extent's range is fully contained in that ED's DPA resource.
*
* Caller must hold cxl_rwsem.region for read (cxl_dpa_to_region()).
* On success, @out_cxled / @out_cxlr_dax / @out_ext_range carry the
@@ -81,6 +125,10 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
{
u64 start_dpa = le64_to_cpu(extent->start_dpa);
struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct device *dev = mds->cxlds.dev;
+ uuid_t *uuid = (uuid_t *)extent->uuid;
+ u16 seq = le16_to_cpu(extent->shared_extn_seq);
+ const struct cxl_dpa_partition *part;
struct cxl_endpoint_decoder *cxled;
struct cxl_region *cxlr;
struct range ext_range = (struct range) {
@@ -89,6 +137,30 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
};
struct range ed_range;
+ part = cxl_extent_dc_partition(mds, extent, &ext_range);
+ if (!part)
+ return -ENXIO;
+
+ if (part->perf.shareable) {
+ if (uuid_is_null(uuid)) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra: sharable-partition extent has null tag (firmware bug)\n",
+ &ext_range);
+ return -ENXIO;
+ }
+ if (seq == 0) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU): sharable-partition extent missing shared_extn_seq (firmware bug)\n",
+ &ext_range, uuid);
+ return -ENXIO;
+ }
+ } else if (seq != 0) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU): non-sharable partition but shared_extn_seq=%u (firmware bug)\n",
+ &ext_range, uuid, seq);
+ return -ENXIO;
+ }
+
cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
if (!cxlr)
return -ENXIO;
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 16/31] cxl/extent: Validate DC extent partition
2026-05-23 9:43 ` [PATCH v10 16/31] cxl/extent: Validate DC extent partition Anisa Su
@ 2026-05-28 21:34 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 21:34 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> Extend cxl_validate_extent(), the per-extent check of the add pipeline,
> to check partition membership.
>
> Resolve an extent's DPA to its containing DC partition. Then, depending
> on whether the partition is shareable:
>
> - Shareable: tag must be non-null and shared_extn_seq must be non-zero,
> because multiple hosts reading the same allocation rely on the
> device-stamped 1..n sequence to assemble extents in an agreed order.
> - Non-shareable: shared_extn_seq must be zero, because sequencing is
> meaningless when only one host consumes the allocation; tag is
> optional (null UUID permitted).
>
> Any cross-mix is a device firmware bug; reject the extent.
>
> Based on patches by John Groves.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Groves <John@Groves.net>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: split out as a separate validation step]
> ---
> drivers/cxl/core/core.h | 4 ++
> drivers/cxl/core/extent.c | 78 +++++++++++++++++++++++++++++++++++++--
> 2 files changed, 79 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 1bae80dbf991..30b6b05b155b 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -179,6 +179,10 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
> int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
> struct access_coordinate *c);
> void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
> +const struct cxl_dpa_partition *
> +cxl_extent_dc_partition(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent,
> + struct range *ext_range);
>
> static inline struct device *port_to_host(struct cxl_port *port)
> {
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 94128d06f4ed..b01507022cff 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -63,11 +63,55 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
> return no_free_ptr(group);
> }
>
> +/*
> + * Find the DC (Dynamic Capacity) partition that fully contains @ext_range,
> + * or NULL if the extent falls outside every DC partition on this memdev.
> + * The returned pointer is owned by mds->cxlds.part[] and lives for the
> + * lifetime of the memdev.
> + */
> +const struct cxl_dpa_partition *
> +cxl_extent_dc_partition(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent,
> + struct range *ext_range)
This can be static, given it's only called in extent.c
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + struct device *dev = mds->cxlds.dev;
> +
> + for (int i = 0; i < cxlds->nr_partitions; i++) {
> + struct cxl_dpa_partition *part = &cxlds->part[i];
> + struct range partition_range = {
> + .start = part->res.start,
> + .end = part->res.end,
> + };
> +
> + if (part->mode != CXL_PARTMODE_DYNAMIC_RAM_A)
> + continue;
> +
> + if (range_contains(&partition_range, ext_range)) {
> + dev_dbg(dev, "DC extent DPA %pra (DCR:%pra)(%pU)\n",
> + ext_range, &partition_range, extent->uuid);
> + return part;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra (%pU) is not in a valid DC partition\n",
> + ext_range, extent->uuid);
> + return NULL;
> +}
> +
> /*
> * Stage 1 of the add pipeline: pure, no allocation. Resolve the extent
> - * to its region/endpoint decoder and ext_range, and verify the range
> - * fits in the resolved endpoint decoder's DPA resource. Further
> - * per-extent invariants layer into this function in subsequent commits.
> + * to its region/endpoint decoder and ext_range, and enforce every
> + * per-extent invariant the device must satisfy:
> + *
> + * - DPA falls inside a Dynamic Capacity partition (cxl_extent_dc_partition).
> + * - CDAT-sharability rules:
> + * sharable: tag must be non-null AND shared_extn_seq != 0
> + * non-sharable: shared_extn_seq must be 0 (tag is optional)
> + * Any cross-mixing is a device firmware bug.
> + * - DPA resolves to an endpoint decoder attached to a region.
> + * - The extent's range is fully contained in that ED's DPA resource.
> *
> * Caller must hold cxl_rwsem.region for read (cxl_dpa_to_region()).
> * On success, @out_cxled / @out_cxlr_dax / @out_ext_range carry the
> @@ -81,6 +125,10 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
> {
> u64 start_dpa = le64_to_cpu(extent->start_dpa);
> struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> + struct device *dev = mds->cxlds.dev;
> + uuid_t *uuid = (uuid_t *)extent->uuid;
Consider using import_uuid() instead of direct cast.
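As an illustration of the difference, the following user-space sketch models the copy-based approach. The model_* names are hypothetical stand-ins written for this example, on the assumption that the kernel's import_uuid() copies the 16 raw bytes into a uuid_t; this is not the kernel implementation itself.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_UUID_SIZE 16

/* User-space stand-in for the kernel's uuid_t. */
typedef struct {
	uint8_t b[MODEL_UUID_SIZE];
} model_uuid_t;

/* Stand-in for import_uuid(): copy the raw bytes instead of casting the pointer. */
static void model_import_uuid(model_uuid_t *dst, const uint8_t *src)
{
	memcpy(dst, src, sizeof(*dst));
}

/* Stand-in for uuid_is_null(): true when all 16 bytes are zero. */
static bool model_uuid_is_null(const model_uuid_t *u)
{
	static const model_uuid_t null_uuid;	/* zero-initialized */

	return memcmp(u, &null_uuid, sizeof(*u)) == 0;
}
```

Copying into a local object means later checks operate on a properly typed value and do not depend on the layout or alignment of the on-wire extent record.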
> + u16 seq = le16_to_cpu(extent->shared_extn_seq);
> + const struct cxl_dpa_partition *part;
> struct cxl_endpoint_decoder *cxled;
> struct cxl_region *cxlr;
> struct range ext_range = (struct range) {
> @@ -89,6 +137,30 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
> };
> struct range ed_range;
>
> + part = cxl_extent_dc_partition(mds, extent, &ext_range);
> + if (!part)
> + return -ENXIO;
> +
> + if (part->perf.shareable) {
> + if (uuid_is_null(uuid)) {
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra: sharable-partition extent has null tag (firmware bug)\n",
> + &ext_range);
> + return -ENXIO;
> + }
> + if (seq == 0) {
I don't think this complies with the spec language. In r4.0 Table 8-230: "For extents describing shareable regions this field shall be within the range of 0 to n-1 where n is the number of extents, with each value appearing only once." So seq == 0 is an acceptable value.
Also, looking at cxl_add_pending() at around line 1396, does shared sequence number '0' hold a special meaning there? If so, that may need to change. I would also suggest adding a comment explaining what is happening there if '0' is special.
DJ
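For reference, a check that follows the quoted spec wording (values 0 to n-1, each appearing once) could look like the user-space sketch below. The helper name seq_is_zero_based_run() is hypothetical, and the sketch assumes the group's sequence numbers have already been sorted ascending, as cxl_add_pending() does with list_sort().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: @seq holds the shared_extn_seq values of one tag
 * group, already sorted ascending. Returns true if they are exactly
 * 0, 1, ..., n-1 with each value appearing once.
 */
static bool seq_is_zero_based_run(const uint16_t *seq, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* After sorting, position i must hold value i. */
		if (seq[i] != i)
			return false;
	}
	return true;
}
```

Note that under this rule a single-extent shareable group carries sequence number 0, so the sequence value alone cannot distinguish it from a non-shareable extent; the partition's shareable attribute would have to decide.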
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra (%pU): sharable-partition extent missing shared_extn_seq (firmware bug)\n",
> + &ext_range, uuid);
> + return -ENXIO;
> + }
> + } else if (seq != 0) {
> + dev_err_ratelimited(dev,
> + "DC extent DPA %pra (%pU): non-sharable partition but shared_extn_seq=%u (firmware bug)\n",
> + &ext_range, uuid, seq);
> + return -ENXIO;
> + }
> +
> cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> if (!cxlr)
> return -ENXIO;
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 17/31] cxl/mem: Enforce tag-group semantics
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (15 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 16/31] cxl/extent: Validate DC extent partition Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-23 9:43 ` [PATCH v10 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
` (15 subsequent siblings)
32 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
The previous commit fully fleshed out validation for individual extents.
This commit completes tag-group validation.
Add two group-level gates to cxl_add_pending() for invariants that
cxl_validate_extent()'s per-extent view cannot see:
- Sequence integrity (cxl_check_group_seq): a group is well-formed if
and only if every member has shared_extn_seq == 0 (non-shareable) or
the sorted group is exactly 1..n contiguous (shareable).
- Partition equality (cxl_check_group_partition): tagged allocations
cannot span DC partitions; a partition's CDAT DSMAS entry is the
unit at which shareable / writable / coherency attributes are
described. Skipped for the null UUID.
Each check drops the whole group on violation. Cross-chain uniqueness
of a tag lands in a subsequent commit alongside the host-wide tag
registry.
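The partition-equality gate can be modelled in user space as follows. This is a simplified sketch under stated assumptions, not the patch's code: struct range here is a local definition with inclusive bounds, and find_partition() / check_group_partition() are hypothetical names that stand in for cxl_extent_dc_partition() and cxl_check_group_partition().

```c
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* Index of the partition that fully contains @ext, or -1 if none does. */
static int find_partition(const struct range *parts, size_t nr_parts,
			  const struct range *ext)
{
	for (size_t i = 0; i < nr_parts; i++) {
		if (parts[i].start <= ext->start && ext->end <= parts[i].end)
			return (int)i;
	}
	return -1;
}

/* Returns 0 if every extent lands in the same partition, -1 otherwise. */
static int check_group_partition(const struct range *parts, size_t nr_parts,
				 const struct range *exts, size_t nr_exts)
{
	int first = -1;

	for (size_t i = 0; i < nr_exts; i++) {
		int idx = find_partition(parts, nr_parts, &exts[i]);

		if (idx < 0)
			return -1;	/* outside every partition */
		if (first < 0)
			first = idx;	/* remember the first extent's partition */
		else if (idx != first)
			return -1;	/* group spans partitions */
	}
	return 0;
}
```

A group whose extents all resolve to one partition passes; a group with one extent in each of two partitions, or with an extent straddling a partition boundary, is rejected as a whole.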
Based on patches by John Groves.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Groves <John@Groves.net>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: split out as a separate validation step. Cross-chain
uniqueness moved to the dedicated "Enforce cross-region tag
uniqueness" commit so this one only adds — no add-then-replace.]
---
drivers/cxl/core/mbox.c | 117 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 117 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 421bd716a273..545c48c9c373 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1342,6 +1342,109 @@ static void extract_tag_group(struct list_head *pending,
}
}
+/*
+ * Validate shared_extn_seq across a tag group already sorted ascending.
+ *
+ * A tag group is well-formed iff either every member has
+ * shared_extn_seq == 0 (non-sharable allocation) or the sorted group is
+ * exactly 1, 2, ..., n (sharable). Anything else — mix, gap, duplicate,
+ * non-zero starting other than 1 — is a device firmware bug.
+ */
+static int cxl_check_group_seq(struct device *dev,
+ const uuid_t *tag,
+ const struct list_head *group)
+{
+ struct cxl_extent_list_node *pos;
+ u16 first, expected;
+
+ if (list_empty(group))
+ return 0;
+
+ pos = list_first_entry(group, struct cxl_extent_list_node, list);
+ first = le16_to_cpu(pos->extent->shared_extn_seq);
+
+ if (first == 0) {
+ list_for_each_entry(pos, group, list) {
+ if (le16_to_cpu(pos->extent->shared_extn_seq) != 0) {
+ dev_warn(dev,
+ "Tag %pUb: shared_extn_seq mixed 0/non-zero in one allocation (firmware bug)\n",
+ tag);
+ return -EINVAL;
+ }
+ }
+ return 0;
+ }
+
+ if (first != 1) {
+ dev_warn(dev,
+ "Tag %pUb: shared_extn_seq starts at %u, expected 1 (firmware bug)\n",
+ tag, first);
+ return -EINVAL;
+ }
+
+ expected = 1;
+ list_for_each_entry(pos, group, list) {
+ u16 s = le16_to_cpu(pos->extent->shared_extn_seq);
+
+ if (s != expected) {
+ dev_warn(dev,
+ "Tag %pUb: shared_extn_seq gap/dup: expected %u got %u (firmware bug)\n",
+ tag, expected, s);
+ return -EINVAL;
+ }
+ expected++;
+ }
+ return 0;
+}
+
+/*
+ * For tagged groups, reject allocations that span DC partitions. A tag
+ * is an allocation identity; the partition's CDAT DSMAS entry is what
+ * tells the host which attributes (sharable, writable, coherency)
+ * apply. Untagged groups are skipped — the spec does not define a
+ * cross-chain identity for them.
+ */
+static int cxl_check_group_partition(struct cxl_memdev_state *mds,
+ const uuid_t *tag,
+ const struct list_head *group)
+{
+ struct device *dev = mds->cxlds.dev;
+ const struct cxl_dpa_partition *first_part = NULL;
+ u64 first_dpa = 0;
+ struct cxl_extent_list_node *pos;
+
+ if (uuid_is_null(tag) || list_empty(group))
+ return 0;
+
+ list_for_each_entry(pos, group, list) {
+ struct cxl_extent *extent = pos->extent;
+ struct range ext_range = (struct range) {
+ .start = le64_to_cpu(extent->start_dpa),
+ .end = le64_to_cpu(extent->start_dpa) +
+ le64_to_cpu(extent->length) - 1,
+ };
+ const struct cxl_dpa_partition *part;
+
+ part = cxl_extent_dc_partition(mds, extent, &ext_range);
+ if (!part)
+ return -ENXIO;
+
+ if (!first_part) {
+ first_part = part;
+ first_dpa = ext_range.start;
+ continue;
+ }
+
+ if (part != first_part) {
+ dev_warn(dev,
+ "Tag %pUb: extents span DC partitions (DPA:%#llx and DPA:%#llx), firmware bug\n",
+ tag, first_dpa, ext_range.start);
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
/*
* Drive the pending Add-Capacity records through cxl_add_extent(),
* grouped by tag. Per group: extract from pending, stable-sort by
@@ -1371,6 +1474,20 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
extract_tag_group(pending, &tag, &group);
list_sort(NULL, &group, extent_seq_compare);
+ /* Sequence-number integrity */
+ if (cxl_check_group_seq(dev, &tag, &group)) {
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ continue;
+ }
+
+ /* Partition equality (skipped for null UUID) */
+ if (cxl_check_group_partition(mds, &tag, &group)) {
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ continue;
+ }
+
/* Alignment gate — abort the group if any member fails */
bool aligned = true;
list_for_each_entry(pos, &group, list) {
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v10 18/31] cxl/extent: Handle DC Release Capacity events
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (16 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 17/31] cxl/mem: Enforce tag-group semantics Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 22:13 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
` (14 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
Replace the no-op ack stub for cxl_rm_extent() with the real teardown:
resolve the released DPA range to its region and endpoint decoder,
locate the matching dc_extent in cxlr_dax->dc_extents (filtering by
cxled, range containment, and tag), and tear down the entire containing
tag group atomically through rm_tag_group(). Partial release is not
supported.
rm_tag_group() invalidates caches once for the whole group (no mappings
exist at this point — partial release is not supported, so all members
are leaving together), then walks the group's dc_extents and releases
each via its devm action installed at online_tag_group() time.
cxl_region_invalidate_memregion() becomes non-static and is declared
in core.h so rm_tag_group() can flush caches before tearing the group down.
When the released range maps to no region (host crashed before
persisting acceptance, region destruction raced device release, or the
device is confused) the host has nothing to drop, so reply via
memdev_release_extent() to keep the device's view consistent.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: restructured from the original "Process dynamic partition
events" monolith; this commit replaces the stubbed release with the
real walk-and-tear-down of the matching tag group.]
---
drivers/cxl/core/core.h | 8 +++
drivers/cxl/core/extent.c | 101 ++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 19 -------
drivers/cxl/core/region.c | 2 +-
4 files changed, 110 insertions(+), 20 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 30b6b05b155b..65daaaadf68e 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -28,6 +28,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
+
#ifdef CONFIG_CXL_REGION
struct cxl_region_context {
@@ -67,6 +69,7 @@ int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
u16 seq_num);
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
int online_tag_group(struct cxl_dc_tag_group *group);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
@@ -79,6 +82,11 @@ static inline int cxl_add_extent(struct cxl_memdev_state *mds,
{
return 0;
}
+static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ return 0;
+}
static inline int online_tag_group(struct cxl_dc_tag_group *group)
{
return 0;
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index b01507022cff..51116c8139ed 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -344,6 +344,107 @@ static void dc_extent_unregister(void *ext)
device_unregister(&dc_extent->dev);
}
+static void rm_tag_group(struct cxl_dc_tag_group *group)
+{
+ struct device *region_dev = &group->cxlr_dax->dev;
+ struct dc_extent *dc_extent;
+ unsigned long index;
+
+ /*
+ * Tagged allocations release atomically. Invalidate caches once
+ * for the whole group (no mappings exist at this point — partial
+ * release is not supported, so all members are leaving use
+ * together) before tearing down each dc_extent device.
+ *
+ * Pin @group across the walk: each devm_release_action runs the
+ * dc_extent_unregister action synchronously, which drops the last
+ * reference on the dc_extent device and fires dc_extent_release.
+ * The release decrements group->nr_extents and, on the final
+ * decrement, frees @group. Without the pin the next iteration's
+ * xa_find_after() dereferences a freed xarray.
+ */
+ cxl_region_invalidate_memregion(group->cxlr_dax->cxlr);
+
+ group->nr_extents++;
+ xa_for_each(&group->dc_extents, index, dc_extent)
+ devm_release_action(region_dev, dc_extent_unregister, dc_extent);
+ group->nr_extents--;
+ if (!group->nr_extents)
+ free_tag_group(group);
+}
+
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_dax_region *cxlr_dax;
+ struct cxl_dc_tag_group *group;
+ struct dc_extent *dc_extent;
+ struct cxl_region *cxlr;
+ struct range dpa_range;
+ unsigned long idx;
+ uuid_t tag;
+
+ dpa_range = (struct range) {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+
+ guard(rwsem_read)(&cxl_rwsem.region);
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+ if (!cxlr) {
+ /*
+ * No region can happen here for a few reasons:
+ *
+ * 1) Extents were accepted and the host crashed/rebooted
+ * leaving them in an accepted state. On reboot the host
+ * has not yet created a region to own them.
+ *
+ * 2) Region destruction won the race with the device releasing
+ * all the extents. Here the release will be a duplicate of
+ * the one sent via region destruction.
+ *
+ * 3) The device is confused and releasing extents for which no
+ * region ever existed.
+ *
+ * In all these cases make sure the device knows we are not
+ * using this extent.
+ */
+ memdev_release_extent(mds, &dpa_range);
+ return -ENXIO;
+ }
+
+ cxlr_dax = cxlr->cxlr_dax;
+ import_uuid(&tag, extent->uuid);
+
+ /*
+ * Find the dc_extent whose DPA range covers the released range and
+ * whose tag matches. The release targets the entire containing
+ * tag group atomically; partial release is not supported.
+ */
+ group = NULL;
+ xa_for_each(&cxlr_dax->dc_extents, idx, dc_extent) {
+ if (dc_extent->cxled != cxled)
+ continue;
+ if (!range_contains(&dc_extent->dpa_range, &dpa_range))
+ continue;
+ if (!uuid_equal(&dc_extent->group->uuid, &tag))
+ continue;
+ group = dc_extent->group;
+ break;
+ }
+ if (!group) {
+ dev_err(&cxlr_dax->dev,
+ "release DPA %pra (%pU) matches no dc_extent\n",
+ &dpa_range, &tag);
+ return -EINVAL;
+ }
+
+ rm_tag_group(group);
+ return 0;
+}
+
static void cleanup_pending_dc_extent(struct dc_extent *dc_extent)
{
struct cxl_dc_tag_group *group = dc_extent->group;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 545c48c9c373..70e6c4c9743c 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1587,25 +1587,6 @@ static int handle_add_event(struct cxl_memdev_state *mds,
return rc;
}
-/*
- * Stub: ack the release back to the device so it knows we are not
- * using the range. A later commit replaces this with the real
- * teardown that walks the region's tag group and tears down the
- * member dc_extent devices.
- */
-static int cxl_rm_extent(struct cxl_memdev_state *mds,
- struct cxl_extent *extent)
-{
- u64 start_dpa = le64_to_cpu(extent->start_dpa);
- struct range dpa_range = {
- .start = start_dpa,
- .end = start_dpa + le64_to_cpu(extent->length) - 1,
- };
-
- memdev_release_extent(mds, &dpa_range);
- return 0;
-}
-
static char *cxl_dcd_evt_type_str(u8 type)
{
switch (type) {
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 733d77c07493..317630d8bf2e 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -222,7 +222,7 @@ static struct cxl_region_ref *cxl_rr_load(struct cxl_port *port,
return xa_load(&port->regions, (unsigned long)cxlr);
}
-static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
{
if (!cpu_cache_has_invalidate_memregion()) {
if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
--
2.43.0
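
Not part of the patch: the pin described in the rm_tag_group() comment can be modelled in plain userspace C. This is only a sketch under stated assumptions. struct tag_group, make_group(), release_extent(), remove_group() and the freed_flag test hook are hypothetical stand-ins for the kernel objects; a fixed array replaces the dc_extents xarray, and no locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_EXTENTS 8

/* Hypothetical stand-in for struct cxl_dc_tag_group. */
struct tag_group {
	unsigned int nr_extents;   /* live members plus any temporary pin */
	int present[MAX_EXTENTS];  /* stand-in for the dc_extents xarray */
	int *freed_flag;           /* test hook: set to 1 when the group is freed */
};

static void free_group(struct tag_group *g)
{
	*g->freed_flag = 1;
	free(g);
}

static struct tag_group *make_group(unsigned int n, int *freed_flag)
{
	struct tag_group *g;

	if (n > MAX_EXTENTS)
		return NULL;
	g = calloc(1, sizeof(*g));
	if (!g)
		return NULL;
	for (unsigned int i = 0; i < n; i++)
		g->present[i] = 1;
	g->nr_extents = n;
	g->freed_flag = freed_flag;
	return g;
}

/* Models dc_extent_release(): the last member to go frees the group. */
static void release_extent(struct tag_group *g, int idx)
{
	g->present[idx] = 0;
	if (--g->nr_extents == 0)
		free_group(g);
}

/* Models rm_tag_group(): returns how many members were released. */
static int remove_group(struct tag_group *g)
{
	int released = 0;

	g->nr_extents++;  /* pin: the count cannot reach zero inside the loop */
	for (int i = 0; i < MAX_EXTENTS; i++) {
		if (!g->present[i])
			continue;
		release_extent(g, i);
		released++;
	}
	assert(g->nr_extents >= 1);  /* at least the pin is still held */
	if (--g->nr_extents == 0)
		free_group(g);       /* drop the pin; the group is freed once */
	return released;
}
```

Without the increment before the loop, the final release_extent() call would free the group while the loop still reads g->present[], which is the use-after-free the comment in the patch warns about.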
* Re: [PATCH v10 18/31] cxl/extent: Handle DC Release Capacity events
2026-05-23 9:43 ` [PATCH v10 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
@ 2026-05-28 22:13 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 22:13 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> Replace the no-op ack stub for cxl_rm_extent() with the real teardown:
> resolve the released DPA range to its region and endpoint decoder,
> locate the matching dc_extent in cxlr_dax->dc_extents (filtering by
> cxled, range containment, and tag), and tear down the entire containing
> tag group atomically through rm_tag_group(). Partial release is not
> supported.
>
> rm_tag_group() invalidates caches once for the whole group (no mappings
> exist at this point — partial release is not supported, so all members
> are leaving together), then walks the group's dc_extents and releases
> each via its devm action installed at online_tag_group() time.
>
> cxl_region_invalidate_memregion() becomes non-static and is declared
> in core.h so rm_tag_group() can flush caches before tearing the group down.
>
> When the released range maps to no region (host crashed before
> persisting acceptance, region destruction raced device release, or the
> device is confused) the host has nothing to drop, so reply via
> memdev_release_extent() to keep the device's view consistent.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: restructured from the original "Process dynamic partition
> events" monolith; this commit replaces the stubbed release with the
> real walk-and-tear-down of the matching tag group.]
> ---
> drivers/cxl/core/core.h | 8 +++
> drivers/cxl/core/extent.c | 101 ++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 19 -------
> drivers/cxl/core/region.c | 2 +-
> 4 files changed, 110 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 30b6b05b155b..65daaaadf68e 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -28,6 +28,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
Doesn't this need to go within CONFIG_CXL_REGION?
> +
> #ifdef CONFIG_CXL_REGION
>
> struct cxl_region_context {
> @@ -67,6 +69,7 @@ int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
>
> int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
> u16 seq_num);
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
> int online_tag_group(struct cxl_dc_tag_group *group);
> #else
> static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
> @@ -79,6 +82,11 @@ static inline int cxl_add_extent(struct cxl_memdev_state *mds,
> {
> return 0;
> }
> +static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent)
> +{
> + return 0;
> +}
> static inline int online_tag_group(struct cxl_dc_tag_group *group)
> {
> return 0;
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index b01507022cff..51116c8139ed 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -344,6 +344,107 @@ static void dc_extent_unregister(void *ext)
> device_unregister(&dc_extent->dev);
> }
>
> +static void rm_tag_group(struct cxl_dc_tag_group *group)
> +{
> + struct device *region_dev = &group->cxlr_dax->dev;
> + struct dc_extent *dc_extent;
> + unsigned long index;
> +
> + /*
> + * Tagged allocations release atomically. Invalidate caches once
> + * for the whole group (no mappings exist at this point — partial
> + * release is not supported, so all members are leaving use
> + * together) before tearing down each dc_extent device.
> + *
> + * Pin @group across the walk: each devm_release_action runs the
> + * dc_extent_unregister action synchronously, which drops the last
> + * reference on the dc_extent device and fires dc_extent_release.
> + * The release decrements group->nr_extents and, on the final
> + * decrement, frees @group. Without the pin the next iteration's
> + * xa_find_after() dereferences a freed xarray.
> + */
> + cxl_region_invalidate_memregion(group->cxlr_dax->cxlr);
check return value?
> +
> + group->nr_extents++;
> + xa_for_each(&group->dc_extents, index, dc_extent)
> + devm_release_action(region_dev, dc_extent_unregister, dc_extent);
> + group->nr_extents--;
> + if (!group->nr_extents)
> + free_tag_group(group);
> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> + u64 start_dpa = le64_to_cpu(extent->start_dpa);
> + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_dax_region *cxlr_dax;
> + struct cxl_dc_tag_group *group;
> + struct dc_extent *dc_extent;
> + struct cxl_region *cxlr;
> + struct range dpa_range;
> + unsigned long idx;
> + uuid_t tag;
> +
> + dpa_range = (struct range) {
> + .start = start_dpa,
> + .end = start_dpa + le64_to_cpu(extent->length) - 1,
> + };
> +
> + guard(rwsem_read)(&cxl_rwsem.region);
> + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> + if (!cxlr) {
> + /*
> + * No region can happen here for a few reasons:
> + *
> + * 1) Extents were accepted and the host crashed/rebooted
> + * leaving them in an accepted state. On reboot the host
> + * has not yet created a region to own them.
> + *
> + * 2) Region destruction won the race with the device releasing
> + * all the extents. Here the release will be a duplicate of
> + * the one sent via region destruction.
> + *
> + * 3) The device is confused and releasing extents for which no
> + * region ever existed.
> + *
> + * In all these cases make sure the device knows we are not
> + * using this extent.
> + */
> + memdev_release_extent(mds, &dpa_range);
> + return -ENXIO;
> + }
> +
> + cxlr_dax = cxlr->cxlr_dax;
Does it need to check if cxlr_dax is NULL?
DJ
> + import_uuid(&tag, extent->uuid);
> +
> + /*
> + * Find the dc_extent whose DPA range covers the released range and
> + * whose tag matches. The release targets the entire containing
> + * tag group atomically; partial release is not supported.
> + */
> + group = NULL;
> + xa_for_each(&cxlr_dax->dc_extents, idx, dc_extent) {
> + if (dc_extent->cxled != cxled)
> + continue;
> + if (!range_contains(&dc_extent->dpa_range, &dpa_range))
> + continue;
> + if (!uuid_equal(&dc_extent->group->uuid, &tag))
> + continue;
> + group = dc_extent->group;
> + break;
> + }
> + if (!group) {
> + dev_err(&cxlr_dax->dev,
> + "release DPA %pra (%pU) matches no dc_extent\n",
> + &dpa_range, &tag);
> + return -EINVAL;
> + }
> +
> + rm_tag_group(group);
> + return 0;
> +}
> +
> static void cleanup_pending_dc_extent(struct dc_extent *dc_extent)
> {
> struct cxl_dc_tag_group *group = dc_extent->group;
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 545c48c9c373..70e6c4c9743c 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1587,25 +1587,6 @@ static int handle_add_event(struct cxl_memdev_state *mds,
> return rc;
> }
>
> -/*
> - * Stub: ack the release back to the device so it knows we are not
> - * using the range. A later commit replaces this with the real
> - * teardown that walks the region's tag group and tears down the
> - * member dc_extent devices.
> - */
> -static int cxl_rm_extent(struct cxl_memdev_state *mds,
> - struct cxl_extent *extent)
> -{
> - u64 start_dpa = le64_to_cpu(extent->start_dpa);
> - struct range dpa_range = {
> - .start = start_dpa,
> - .end = start_dpa + le64_to_cpu(extent->length) - 1,
> - };
> -
> - memdev_release_extent(mds, &dpa_range);
> - return 0;
> -}
> -
> static char *cxl_dcd_evt_type_str(u8 type)
> {
> switch (type) {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 733d77c07493..317630d8bf2e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -222,7 +222,7 @@ static struct cxl_region_ref *cxl_rr_load(struct cxl_port *port,
> return xa_load(&port->regions, (unsigned long)cxlr);
> }
>
> -static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
> +int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
> {
> if (!cpu_cache_has_invalidate_memregion()) {
> if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
* [PATCH v10 19/31] cxl/extent: Enforce cross-region tag uniqueness
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (17 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 18/31] cxl/extent: Handle DC Release Capacity events Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 22:44 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
` (13 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su
The per-region scan in cxl_tag_already_committed() only catches a tag
re-appearing on the same cxlr_dax. The orchestrator owns tag
allocation and is responsible for global uniqueness, but a buggy FM
(or firmware redelivering a tag for a previously-closed allocation)
can still hand the same uuid to extents on two different regions or
memdevs, and the per-region check accepts the second one — leaving
two independent cxl_dc_tag_group objects with the same uuid.
Add a host-wide registry of live tag groups with non-null uuids.
alloc_tag_group() inserts on success, free_tag_group() removes; both
skip the null-uuid case since the spec defines no cross-chain identity
for untagged allocations.
An attempt to add a second group with the same uuid fails with
-EBUSY.
No exit hook is needed: cxl_core only unloads after every dependent
module has, by which point every live tag group has been freed and
the registry is empty.
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
drivers/cxl/core/core.h | 5 ++++
drivers/cxl/core/extent.c | 60 +++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 19 +++++++++++++
drivers/cxl/cxl.h | 3 ++
4 files changed, 87 insertions(+)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 65daaaadf68e..02b36728c22d 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -69,6 +69,7 @@ int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
u16 seq_num);
+bool cxl_tag_already_committed(const uuid_t *tag);
int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
int online_tag_group(struct cxl_dc_tag_group *group);
#else
@@ -91,6 +92,10 @@ static inline int online_tag_group(struct cxl_dc_tag_group *group)
{
return 0;
}
+static inline bool cxl_tag_already_committed(const uuid_t *tag)
+{
+ return false;
+}
static inline
struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
struct cxl_endpoint_decoder **cxled)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 51116c8139ed..f66fa8c600c5 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -18,8 +18,60 @@ static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
memdev_release_extent(mds, &dc_extent->dpa_range);
}
+/*
+ * Host-wide registry of live tag groups with non-null uuids. Enforces
+ * that within this host, a tag uuid identifies exactly one allocation
+ * across all regions and memdevs — closing the gap left by the
+ * per-region scans in cxlr_add_extent() and uuid_claim_tagged(). The
+ * orchestrator (FM) owns tag-uuid allocation per spec; this is a
+ * defense against firmware bugs and orchestrator misbehavior. Untagged
+ * (null uuid) allocations are not tracked: the spec defines no
+ * cross-chain identity for them.
+ */
+static DEFINE_MUTEX(cxl_tag_lock);
+static LIST_HEAD(cxl_tag_groups);
+
+static int cxl_tag_register(struct cxl_dc_tag_group *grp)
+{
+ struct cxl_dc_tag_group *g;
+
+ if (uuid_is_null(&grp->uuid))
+ return 0;
+
+ guard(mutex)(&cxl_tag_lock);
+ list_for_each_entry(g, &cxl_tag_groups, registry_node)
+ if (uuid_equal(&g->uuid, &grp->uuid))
+ return -EBUSY;
+ list_add_tail(&grp->registry_node, &cxl_tag_groups);
+ return 0;
+}
+
+static void cxl_tag_unregister(struct cxl_dc_tag_group *grp)
+{
+ if (uuid_is_null(&grp->uuid))
+ return;
+
+ guard(mutex)(&cxl_tag_lock);
+ list_del(&grp->registry_node);
+}
+
+bool cxl_tag_already_committed(const uuid_t *tag)
+{
+ struct cxl_dc_tag_group *g;
+
+ if (uuid_is_null(tag))
+ return false;
+
+ guard(mutex)(&cxl_tag_lock);
+ list_for_each_entry(g, &cxl_tag_groups, registry_node)
+ if (uuid_equal(&g->uuid, tag))
+ return true;
+ return false;
+}
+
static void free_tag_group(struct cxl_dc_tag_group *group)
{
+ cxl_tag_unregister(group);
xa_destroy(&group->dc_extents);
kfree(group);
}
@@ -54,12 +106,20 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
{
struct cxl_dc_tag_group *group __free(kfree) =
kzalloc(sizeof(*group), GFP_KERNEL);
+ int rc;
+
if (!group)
return ERR_PTR(-ENOMEM);
group->cxlr_dax = cxlr_dax;
uuid_copy(&group->uuid, uuid);
xa_init(&group->dc_extents);
+ INIT_LIST_HEAD(&group->registry_node);
+
+ rc = cxl_tag_register(group);
+ if (rc)
+ return ERR_PTR(rc);
+
return no_free_ptr(group);
}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 70e6c4c9743c..85959dee35ea 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1474,6 +1474,25 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
extract_tag_group(pending, &tag, &group);
list_sort(NULL, &group, extent_seq_compare);
+ /*
+ * Cross-More-chain uniqueness. A non-null tag seen in this
+ * group must not already correspond to a committed tag group
+ * anywhere on this host. More=0 was supposed to close that
+ * allocation, and tag uuids must be unique across all regions
+ * and memdevs (the orchestrator owns assignment per spec).
+ * Either constraint failing — same chain redelivered, or two
+ * distinct allocations colliding on the same uuid — is a
+ * firmware/orchestrator bug; reject the whole group.
+ */
+ if (cxl_tag_already_committed(&tag)) {
+ dev_warn(dev,
+ "Tag %pUb: dropping group, tag already committed (firmware/orchestrator bug)\n",
+ &tag);
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ continue;
+ }
+
/* Sequence-number integrity */
if (cxl_check_group_seq(dev, &tag, &group)) {
list_for_each_entry_safe(pos, tmp, &group, list)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index cbbfba92fea9..a28e7b12a4a8 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -598,12 +598,15 @@ struct cxl_dax_region {
* allocations.
* @nr_extents: live count of dc_extents in the group; the group is freed
* when the last dc_extent device is released.
+ * @registry_node: anchor in the host-wide non-null-tag registry that
+ * enforces tag uuid uniqueness across all regions and memdevs.
*/
struct cxl_dc_tag_group {
struct cxl_dax_region *cxlr_dax;
uuid_t uuid;
struct xarray dc_extents;
unsigned int nr_extents;
+ struct list_head registry_node;
};
bool is_dc_extent(struct device *dev);
--
2.43.0
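
Not part of the patch: the registry semantics from the commit message (null uuids are never tracked, a second live group with the same uuid gets -EBUSY, removal frees the uuid for reuse) can be sketched in userspace C. Everything here is an assumption made for illustration: struct tag_group, tag_register(), tag_unregister() and tag_committed() are hypothetical names, a singly linked list replaces the kernel list_head, and the mutex is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TAG_LEN 16

/* Hypothetical stand-in for struct cxl_dc_tag_group. */
struct tag_group {
	unsigned char uuid[TAG_LEN];
	struct tag_group *next;    /* stand-in for registry_node */
};

static struct tag_group *registry;  /* stand-in for cxl_tag_groups */

static int tag_is_null(const unsigned char *uuid)
{
	static const unsigned char zero[TAG_LEN];

	return memcmp(uuid, zero, TAG_LEN) == 0;
}

/* Models cxl_tag_already_committed(): is a live group using this uuid? */
static int tag_committed(const unsigned char *uuid)
{
	if (tag_is_null(uuid))
		return 0;
	for (struct tag_group *g = registry; g; g = g->next)
		if (memcmp(g->uuid, uuid, TAG_LEN) == 0)
			return 1;
	return 0;
}

/* Models cxl_tag_register(): untagged groups are accepted but not tracked. */
static int tag_register(struct tag_group *grp)
{
	assert(grp->next == NULL);  /* caller must pass an unlinked group */
	if (tag_is_null(grp->uuid))
		return 0;
	if (tag_committed(grp->uuid))
		return -EBUSY;
	grp->next = registry;
	registry = grp;
	return 0;
}

/* Models cxl_tag_unregister(): unlink the group if it was tracked. */
static void tag_unregister(struct tag_group *grp)
{
	if (tag_is_null(grp->uuid))
		return;
	for (struct tag_group **pp = &registry; *pp; pp = &(*pp)->next) {
		if (*pp == grp) {
			*pp = grp->next;
			grp->next = NULL;
			return;
		}
	}
}
```

A linear scan matches the list walk in the patch; it is adequate only while the number of live tagged allocations on a host stays small.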
* Re: [PATCH v10 19/31] cxl/extent: Enforce cross-region tag uniqueness
2026-05-23 9:43 ` [PATCH v10 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
@ 2026-05-28 22:44 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 22:44 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su
On 5/23/26 2:43 AM, Anisa Su wrote:
> The per-region scan in cxl_tag_already_committed() only catches a tag
> re-appearing on the same cxlr_dax. The orchestrator owns tag
> allocation and is responsible for global uniqueness, but a buggy FM
> (or firmware redelivering a tag for a previously-closed allocation)
> can still hand the same uuid to extents on two different regions or
> memdevs, and the per-region check accepts the second one — leaving
> two independent cxl_dc_tag_group objects with the same uuid.
>
> Add a host-wide registry of live tag groups with non-null uuids.
> alloc_tag_group() inserts on success, free_tag_group() removes; both
> skip the null-uuid case since the spec defines no cross-chain identity
> for untagged allocations.
>
> An attempt to add a second group with the same uuid fails with
> -EBUSY.
>
> No exit hook is needed: cxl_core only unloads after every dependent
> module has, by which point every live tag group has been freed and
> the registry is empty.
>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> drivers/cxl/core/core.h | 5 ++++
> drivers/cxl/core/extent.c | 60 +++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 19 +++++++++++++
> drivers/cxl/cxl.h | 3 ++
> 4 files changed, 87 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 65daaaadf68e..02b36728c22d 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -69,6 +69,7 @@ int devm_cxl_add_pmem_region(struct cxl_region *cxlr);
>
> int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
> u16 seq_num);
> +bool cxl_tag_already_committed(const uuid_t *tag);
> int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
> int online_tag_group(struct cxl_dc_tag_group *group);
> #else
> @@ -91,6 +92,10 @@ static inline int online_tag_group(struct cxl_dc_tag_group *group)
> {
> return 0;
> }
> +static inline bool cxl_tag_already_committed(const uuid_t *tag)
> +{
> + return false;
> +}
> static inline
> struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
> struct cxl_endpoint_decoder **cxled)
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 51116c8139ed..f66fa8c600c5 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -18,8 +18,60 @@ static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
> memdev_release_extent(mds, &dc_extent->dpa_range);
> }
>
> +/*
> + * Host-wide registry of live tag groups with non-null uuids. Enforces
> + * that within this host, a tag uuid identifies exactly one allocation
> + * across all regions and memdevs — closing the gap left by the
> + * per-region scans in cxlr_add_extent() and uuid_claim_tagged(). The
> + * orchestrator (FM) owns tag-uuid allocation per spec; this is a
> + * defense against firmware bugs and orchestrator misbehavior. Untagged
> + * (null uuid) allocations are not tracked: the spec defines no
> + * cross-chain identity for them.
> + */
> +static DEFINE_MUTEX(cxl_tag_lock);
> +static LIST_HEAD(cxl_tag_groups);
> +
> +static int cxl_tag_register(struct cxl_dc_tag_group *grp)
> +{
> + struct cxl_dc_tag_group *g;
> +
> + if (uuid_is_null(&grp->uuid))
> + return 0;
> +
> + guard(mutex)(&cxl_tag_lock);
> + list_for_each_entry(g, &cxl_tag_groups, registry_node)
> + if (uuid_equal(&g->uuid, &grp->uuid))
> + return -EBUSY;
> + list_add_tail(&grp->registry_node, &cxl_tag_groups);
> + return 0;
> +}
> +
> +static void cxl_tag_unregister(struct cxl_dc_tag_group *grp)
> +{
> + if (uuid_is_null(&grp->uuid))
> + return;
> +
> + guard(mutex)(&cxl_tag_lock);
> + list_del(&grp->registry_node);
> +}
> +
> +bool cxl_tag_already_committed(const uuid_t *tag)
> +{
> + struct cxl_dc_tag_group *g;
> +
> + if (uuid_is_null(tag))
> + return false;
> +
> + guard(mutex)(&cxl_tag_lock);
> + list_for_each_entry(g, &cxl_tag_groups, registry_node)
> + if (uuid_equal(&g->uuid, tag))
> + return true;
> + return false;
> +}
> +
> static void free_tag_group(struct cxl_dc_tag_group *group)
> {
> + cxl_tag_unregister(group);
> xa_destroy(&group->dc_extents);
> kfree(group);
> }
> @@ -54,12 +106,20 @@ alloc_tag_group(struct cxl_dax_region *cxlr_dax, uuid_t *uuid)
> {
> struct cxl_dc_tag_group *group __free(kfree) =
> kzalloc(sizeof(*group), GFP_KERNEL);
> + int rc;
> +
> if (!group)
> return ERR_PTR(-ENOMEM);
>
> group->cxlr_dax = cxlr_dax;
> uuid_copy(&group->uuid, uuid);
> xa_init(&group->dc_extents);
> + INIT_LIST_HEAD(&group->registry_node);
> +
> + rc = cxl_tag_register(group);
> + if (rc)
> + return ERR_PTR(rc);
> +
> return no_free_ptr(group);
> }
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 70e6c4c9743c..85959dee35ea 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1474,6 +1474,25 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
> extract_tag_group(pending, &tag, &group);
> list_sort(NULL, &group, extent_seq_compare);
>
> + /*
> + * Cross-More-chain uniqueness. A non-null tag seen in this
> + * group must not already correspond to a committed tag group
> + * anywhere on this host. More=0 was supposed to close that
> + * allocation, and tag uuids must be unique across all regions
> + * and memdevs (the orchestrator owns assignment per spec).
> + * Either constraint failing — same chain redelivered, or two
> + * distinct allocations colliding on the same uuid — is a
> + * firmware/orchestrator bug; reject the whole group.
> + */
> + if (cxl_tag_already_committed(&tag)) {
> + dev_warn(dev,
> + "Tag %pUb: dropping group, tag already committed (firmware/orchestrator bug)\n",
> + &tag);
> + list_for_each_entry_safe(pos, tmp, &group, list)
> + delete_extent_node(pos);
> + continue;
> + }
> +
> /* Sequence-number integrity */
> if (cxl_check_group_seq(dev, &tag, &group)) {
> list_for_each_entry_safe(pos, tmp, &group, list)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index cbbfba92fea9..a28e7b12a4a8 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -598,12 +598,15 @@ struct cxl_dax_region {
> * allocations.
> * @nr_extents: live count of dc_extents in the group; the group is freed
> * when the last dc_extent device is released.
> + * @registry_node: anchor in the host-wide non-null-tag registry that
> + * enforces tag uuid uniqueness across all regions and memdevs.
> */
> struct cxl_dc_tag_group {
> struct cxl_dax_region *cxlr_dax;
> uuid_t uuid;
> struct xarray dc_extents;
> unsigned int nr_extents;
> + struct list_head registry_node;
> };
>
> bool is_dc_extent(struct device *dev);
* [PATCH v10 20/31] cxl/region/extent: Expose dc_extent information in sysfs
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (18 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 19/31] cxl/extent: Enforce cross-region tag uniqueness Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 22:54 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
` (12 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni
From: Ira Weiny <ira.weiny@intel.com>
Extent information can be helpful to the user to coordinate memory
usage with the external orchestrator and FM.
Expose the details of each dc_extent by creating the following sysfs
entries.
/sys/bus/cxl/devices/dax_regionX/extentX.Y
/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
/sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
Each dc_extent surfaces as its own extentX.Y device under the parent
dax_region. offset and length describe that dc_extent's HPA range,
not an aggregate bounding box across the containing tagged
allocation — so when a tagged allocation has multiple
DPA-discontiguous extents, each is reported with its own offset and
length. uuid is the tag identifying the containing allocation; it
is shared across dc_extents that belong to the same tagged
allocation and is hidden for untagged extents.
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Documentation/ABI/testing/sysfs-bus-cxl | 36 +++++++++++++++
drivers/cxl/core/extent.c | 58 +++++++++++++++++++++++++
2 files changed, 94 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 3080aef9ad67..38cf0a2894b9 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -661,3 +661,39 @@ Description:
The count is persistent across power loss and wraps back to 0
upon overflow. If this file is not present, the device does not
have the necessary support for dirty tracking.
+
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Extent offset within
+ the region. Users can use the extent information to create DAX
+ devices on specific extents. This is done by creating and
+ destroying DAX devices in specific sequences and looking at the
+ mappings created.
+
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Length of this extent.
+ Users can use the extent information to create DAX devices on
+ specific extents. This is done by creating and destroying DAX
+ devices in specific sequences and looking at the mappings
+ created.
+
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
+Date: May, 2025
+KernelVersion: v6.16
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] UUID tag of the
+ allocation containing this extent; not present for untagged
+ extents. Users can use the extent information to create DAX
+ devices on specific extents by creating and destroying DAX
+ devices in specific sequences and looking at the mappings created.
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index f66fa8c600c5..34babfe032d1 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -6,6 +6,63 @@
#include "core.h"
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct dc_extent *dc_extent = to_dc_extent(dev);
+
+ return sysfs_emit(buf, "%#llx\n", dc_extent->hpa_range.start);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct dc_extent *dc_extent = to_dc_extent(dev);
+ u64 length = range_len(&dc_extent->hpa_range);
+
+ return sysfs_emit(buf, "%#llx\n", length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct dc_extent *dc_extent = to_dc_extent(dev);
+
+ return sysfs_emit(buf, "%pUb\n", &dc_extent->group->uuid);
+}
+static DEVICE_ATTR_RO(uuid);
+
+static struct attribute *dc_extent_attrs[] = {
+ &dev_attr_offset.attr,
+ &dev_attr_length.attr,
+ &dev_attr_uuid.attr,
+ NULL
+};
+
+static uuid_t empty_uuid = { 0 };
+
+static umode_t dc_extent_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct dc_extent *dc_extent = to_dc_extent(dev);
+
+ if (a == &dev_attr_uuid.attr &&
+ uuid_equal(&dc_extent->group->uuid, &empty_uuid))
+ return 0;
+
+ return a->mode;
+}
+
+static const struct attribute_group dc_extent_attribute_group = {
+ .attrs = dc_extent_attrs,
+ .is_visible = dc_extent_visible,
+};
+
+__ATTRIBUTE_GROUPS(dc_extent_attribute);
+
static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
struct dc_extent *dc_extent)
@@ -93,6 +150,7 @@ static void dc_extent_release(struct device *dev)
static const struct device_type dc_extent_type = {
.name = "extent",
.release = dc_extent_release,
+ .groups = dc_extent_attribute_groups,
};
bool is_dc_extent(struct device *dev)
--
2.43.0
* Re: [PATCH v10 20/31] cxl/region/extent: Expose dc_extent information in sysfs
2026-05-23 9:43 ` [PATCH v10 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
@ 2026-05-28 22:54 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 22:54 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Extent information can be helpful to the user to coordinate memory
> usage with the external orchestrator and FM.
>
> Expose the details of each dc_extent by creating the following sysfs
> entries.
>
> /sys/bus/cxl/devices/dax_regionX/extentX.Y
> /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> /sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
>
> Each dc_extent surfaces as its own extentX.Y device under the parent
> dax_region. offset and length describe that dc_extent's HPA range,
> not an aggregate bounding box across the containing tagged
> allocation — so when a tagged allocation has multiple
> DPA-discontiguous extents, each is reported with its own offset and
> length. uuid is the tag identifying the containing allocation; it
> is shared across dc_extents that belong to the same tagged
> allocation and is hidden for untagged extents.
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Missing Anisa's Signed-off-by.
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 36 +++++++++++++++
> drivers/cxl/core/extent.c | 58 +++++++++++++++++++++++++
> 2 files changed, 94 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 3080aef9ad67..38cf0a2894b9 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -661,3 +661,39 @@ Description:
> The count is persistent across power loss and wraps back to 0
> upon overflow. If this file is not present, the device does not
> have the necessary support for dirty tracking.
> +
> +
> +What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> +Date: May, 2025
> +KernelVersion: v6.16
Update the Date and KernelVersion fields for all three entries.
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) [For Dynamic Capacity regions only] Users can use the
> + extent information to create DAX devices on specific extents.
> + This is done by creating and destroying DAX devices in specific
> + sequences and looking at the mappings created. Extent offset
> + within the region.
> +
> +
> +What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> +Date: May, 2025
> +KernelVersion: v6.16
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) [For Dynamic Capacity regions only] Users can use the
> + extent information to create DAX devices on specific extents.
> + This is done by creating and destroying DAX devices in specific
> + sequences and looking at the mappings created. Extent length
> + within the region.
> +
> +
> +What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/uuid
> +Date: May, 2025
> +KernelVersion: v6.16
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) [For Dynamic Capacity regions only] Users can use the
> + extent information to create DAX devices on specific extents.
> + This is done by creating and destroying DAX devices in specific
> + sequences and looking at the mappings created. UUID of this
> + extent.
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index f66fa8c600c5..34babfe032d1 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -6,6 +6,63 @@
>
> #include "core.h"
>
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct dc_extent *dc_extent = to_dc_extent(dev);
> +
> + return sysfs_emit(buf, "%#llx\n", dc_extent->hpa_range.start);
> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct dc_extent *dc_extent = to_dc_extent(dev);
> + u64 length = range_len(&dc_extent->hpa_range);
> +
> + return sysfs_emit(buf, "%#llx\n", length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct dc_extent *dc_extent = to_dc_extent(dev);
> +
> + return sysfs_emit(buf, "%pUb\n", &dc_extent->group->uuid);
> +}
> +static DEVICE_ATTR_RO(uuid);
> +
> +static struct attribute *dc_extent_attrs[] = {
> + &dev_attr_offset.attr,
> + &dev_attr_length.attr,
> + &dev_attr_uuid.attr,
> + NULL
> +};
> +
> +static uuid_t empty_uuid = { 0 };
> +
> +static umode_t dc_extent_visible(struct kobject *kobj,
> + struct attribute *a, int n)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct dc_extent *dc_extent = to_dc_extent(dev);
> +
> + if (a == &dev_attr_uuid.attr &&
> + uuid_equal(&dc_extent->group->uuid, &empty_uuid))
Can uuid_is_null() be used here instead?
DJ
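For illustration only, here is a userspace sketch of the check this suggestion points at. The kernel's uuid_is_null() tests a uuid_t against the all-zero UUID. The names sketch_uuid, sketch_uuid_is_null() and sketch_uuid_attr_mode() are hypothetical stand-ins so the snippet builds outside the kernel; they are not code from this series.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's uuid_t (16 raw bytes). */
typedef struct {
	unsigned char b[16];
} sketch_uuid;

/* Stand-in for uuid_is_null(): true when all 16 bytes are zero. */
bool sketch_uuid_is_null(const sketch_uuid *uuid)
{
	static const sketch_uuid null_uuid; /* zero-initialized */

	return memcmp(uuid, &null_uuid, sizeof(*uuid)) == 0;
}

/*
 * Hypothetical reduction of the visibility decision for the "uuid"
 * attribute: hide it (mode 0) for untagged extents, otherwise keep
 * the attribute's declared mode.
 */
unsigned int sketch_uuid_attr_mode(const sketch_uuid *tag, unsigned int mode)
{
	return sketch_uuid_is_null(tag) ? 0 : mode;
}
```

If the real helper were used this way, the file-local empty_uuid variable would presumably no longer be needed.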
> + return 0;
> +
> + return a->mode;
> +}
> +
> +static const struct attribute_group dc_extent_attribute_group = {
> + .attrs = dc_extent_attrs,
> + .is_visible = dc_extent_visible,
> +};
> +
> +__ATTRIBUTE_GROUPS(dc_extent_attribute);
> +
>
> static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
> struct dc_extent *dc_extent)
> @@ -93,6 +150,7 @@ static void dc_extent_release(struct device *dev)
> static const struct device_type dc_extent_type = {
> .name = "extent",
> .release = dc_extent_release,
> + .groups = dc_extent_attribute_groups,
> };
>
> bool is_dc_extent(struct device *dev)
* [PATCH v10 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (19 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 20/31] cxl/region/extent: Expose dc_extent information in sysfs Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 23:41 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
` (11 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
When an extent is accepted/released, the CXL driver must notify
the DAX driver to coordinate the management of resources. Define
the .notify callback to the cxl_dax region driver to enable
the coordination.
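
The dispatch described above can be sketched in userspace C roughly as follows. This is a simplified model, not the driver code: every sketch_* name is hypothetical, and the real path additionally holds the device lock and goes through struct cxl_driver.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the event type and notify payload. */
enum sketch_dc_event { SKETCH_ADD_CAPACITY, SKETCH_RELEASE_CAPACITY };

struct sketch_notify_data {
	enum sketch_dc_event event;
	const void *group; /* opaque tag group handle */
};

struct sketch_driver {
	int (*notify)(void *dev, struct sketch_notify_data *data);
};

/*
 * Returns 0 when no driver is bound or the driver has no notify
 * callback; otherwise forwards the event and returns the callback's
 * result.
 */
int sketch_notify(void *dev, const struct sketch_driver *drv,
		  enum sketch_dc_event event, const void *group)
{
	struct sketch_notify_data data = { .event = event, .group = group };

	if (!drv || !drv->notify)
		return 0;
	return drv->notify(dev, &data);
}

/* Example callback that accepts ADD and rejects everything else. */
int sketch_add_only(void *dev, struct sketch_notify_data *data)
{
	(void)dev;
	return data->event == SKETCH_ADD_CAPACITY ? 0 : -1;
}
```

A missing driver or callback is treated as success here, so an unbound DAX region does not block the CXL side.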
Define struct dax_resource, a sub-resource of a DC dax_region
representing the capacity of one dc_extent.
When the cxl side onlines a tag group during a DC Add event,
notify the DAX region to register a struct dax_resource for each
extent under the dax_region's resource tree.
The dax_resource model:
* struct dax_resource (dax-private.h) — per-extent sub-resource of
a DC dax_region: pointer back to its region, the kernel struct
resource, the tag uuid, the per-allocation seq_num, and a use_cnt
that lets a later commit refuse release of an in-use extent.
* struct dev_dax_range gains a dax_resource back-pointer so a
carved range remembers which extent it lives in.
For now, dax_resources live under the dax_region and remain inaccessible
to DAX devices. A later commit adds the support to specify a tag
when creating a DAX device, which then allows dax_resources to be
claimed by tag.
Release is handled in the following commit.
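
As a rough illustration of the dax_resource model above, the userspace sketch below shows how available capacity in a DC region could be summed when extents are claimed all-or-nothing. The struct and function names are hypothetical, and using use_cnt as the "claimed" test is an assumption made for brevity; the patch itself checks whether the extent's resource has a child.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct dax_resource. */
struct sketch_dax_resource {
	uint64_t start;        /* first byte of the extent */
	uint64_t len;          /* extent length in bytes */
	unsigned char tag[16]; /* tag uuid; all zero means untagged */
	uint16_t seq_num;      /* order within the tag group */
	unsigned int use_cnt;  /* nonzero once a DAX device claims it */
};

/*
 * Sum the sizes of all unclaimed extents. A claimed extent contributes
 * nothing, because an extent is either free or fully consumed.
 */
uint64_t sketch_avail_size(const struct sketch_dax_resource *res, size_t n)
{
	uint64_t size = 0;

	for (size_t i = 0; i < n; i++)
		if (res[i].use_cnt == 0)
			size += res[i].len;
	return size;
}
```

With no extents present the sum is zero, which matches the old behaviour of reporting no available size for a dynamic region.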
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: restructured from the original "Create resources on sparse DAX
regions" commit]
---
drivers/cxl/core/core.h | 10 ++++
drivers/cxl/core/extent.c | 33 ++++++++++-
drivers/cxl/core/mbox.c | 17 +++++-
drivers/cxl/cxl.h | 6 ++
drivers/dax/bus.c | 118 ++++++++++++++++++++++++++++++++++----
drivers/dax/bus.h | 3 +-
drivers/dax/cxl.c | 88 +++++++++++++++++++++++++++-
drivers/dax/dax-private.h | 49 ++++++++++++++++
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem.c | 2 +-
10 files changed, 306 insertions(+), 22 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 02b36728c22d..c28e357c5817 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -72,6 +72,9 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
bool cxl_tag_already_committed(const uuid_t *tag);
int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
int online_tag_group(struct cxl_dc_tag_group *group);
+void rm_tag_group(struct cxl_dc_tag_group *group);
+int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+ struct cxl_dc_tag_group *group);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
const struct cxl_memdev *cxlmd, u64 dpa)
@@ -96,6 +99,13 @@ static inline bool cxl_tag_already_committed(const uuid_t *tag)
{
return false;
}
+static inline void rm_tag_group(struct cxl_dc_tag_group *group) { }
+static inline int cxlr_notify_extent(struct cxl_region *cxlr,
+ enum dc_event event,
+ struct cxl_dc_tag_group *group)
+{
+ return 0;
+}
static inline
struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
struct cxl_endpoint_decoder **cxled)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 34babfe032d1..3fc4b7292664 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -63,7 +63,6 @@ static const struct attribute_group dc_extent_attribute_group = {
__ATTRIBUTE_GROUPS(dc_extent_attribute);
-
static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
struct dc_extent *dc_extent)
{
@@ -359,6 +358,36 @@ dc_extent_build(struct cxl_endpoint_decoder *cxled,
return dc_extent;
}
+int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+ struct cxl_dc_tag_group *group)
+{
+ struct device *dev = &cxlr->cxlr_dax->dev;
+ struct cxl_notify_data notify_data;
+ struct cxl_driver *driver;
+
+ dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid);
+
+ guard(device)(dev);
+
+ /*
+ * The lack of a driver indicates a notification has failed. No user
+ * space coordination was possible.
+ */
+ if (!dev->driver)
+ return 0;
+ driver = to_cxl_drv(dev->driver);
+ if (!driver->notify)
+ return 0;
+
+ notify_data = (struct cxl_notify_data) {
+ .event = event,
+ .group = group,
+ };
+
+ dev_dbg(dev, "Notify: type %d tag %pUb\n", event, &group->uuid);
+ return driver->notify(dev, &notify_data);
+}
+
/*
* Stage 4: insert @dc_extent into the pending tag group. All extents in
* one More-chain group share a UUID — enforced here as the group is
@@ -462,7 +491,7 @@ static void dc_extent_unregister(void *ext)
device_unregister(&dc_extent->dev);
}
-static void rm_tag_group(struct cxl_dc_tag_group *group)
+void rm_tag_group(struct cxl_dc_tag_group *group)
{
struct device *region_dev = &group->cxlr_dax->dev;
struct dc_extent *dc_extent;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 85959dee35ea..8071c1ed1b36 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1558,8 +1558,21 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
list_for_each_entry_safe(pos, tmp, &group, list)
delete_extent_node(pos);
} else {
- list_splice_tail_init(&group, &accepted);
- total_accepted += group_cnt;
+ rc = cxlr_notify_extent(tag_group->cxlr_dax->cxlr,
+ DCD_ADD_CAPACITY, tag_group);
+ if (rc) {
+ /*
+ * The dax-side notification failed; tear down the
+ * tag group and drop the extents so we do not
+ * mis-report acceptance to the device.
+ */
+ rm_tag_group(tag_group);
+ list_for_each_entry_safe(pos, tmp, &group, list)
+ delete_extent_node(pos);
+ } else {
+ list_splice_tail_init(&group, &accepted);
+ total_accepted += group_cnt;
+ }
}
mds->add_ctx.group = NULL;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a28e7b12a4a8..27e3046654e9 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -892,6 +892,11 @@ bool is_cxl_region(struct device *dev);
extern const struct bus_type cxl_bus_type;
+struct cxl_notify_data {
+ enum dc_event event;
+ struct cxl_dc_tag_group *group;
+};
+
/*
* Note, add_dport() is expressly for the cxl_port driver. TODO: investigate a
* type-safe driver model where probe()/remove() take the type of object implied
@@ -904,6 +909,7 @@ struct cxl_driver {
void (*remove)(struct device *dev);
struct cxl_dport *(*add_dport)(struct cxl_port *port,
struct device *dport_dev);
+ int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
struct device_driver drv;
int id;
};
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index b0c2162b5e37..a6ee59f2d8a1 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -186,6 +186,73 @@ static bool is_dynamic(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
}
+static void __dax_release_resource(struct dax_resource *dax_resource)
+{
+ struct dax_region *dax_region = dax_resource->region;
+
+ lockdep_assert_held_write(&dax_region_rwsem);
+ dev_dbg(dax_region->dev, "Extent release resource %pr\n",
+ dax_resource->res);
+ if (dax_resource->res)
+ __release_region(&dax_region->res, dax_resource->res->start,
+ resource_size(dax_resource->res));
+ dax_resource->res = NULL;
+}
+
+static void dax_release_resource(void *res)
+{
+ struct dax_resource *dax_resource = res;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+ __dax_release_resource(dax_resource);
+ kfree(dax_resource);
+}
+
+int dax_region_add_resource(struct dax_region *dax_region,
+ struct device *device,
+ resource_size_t start, resource_size_t length,
+ const uuid_t *tag, u16 seq_num)
+{
+ struct resource *new_resource;
+ int rc;
+
+ struct dax_resource *dax_resource __free(kfree) =
+ kzalloc(sizeof(*dax_resource), GFP_KERNEL);
+ if (!dax_resource)
+ return -ENOMEM;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+ new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
+ if (!new_resource) {
+ dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+ &start, &length);
+ return -ENOSPC;
+ }
+
+ dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
+ dax_resource->region = dax_region;
+ dax_resource->res = new_resource;
+ dax_resource->seq_num = seq_num;
+ if (tag)
+ uuid_copy(&dax_resource->uuid, tag);
+
+ /*
+ * open code devm_add_action_or_reset() to avoid recursive write lock
+ * of dax_region_rwsem in the error case.
+ */
+ rc = devm_add_action(device, dax_release_resource, dax_resource);
+ if (rc) {
+ __dax_release_resource(dax_resource);
+ return rc;
+ }
+
+ dev_set_drvdata(device, no_free_ptr(dax_resource));
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resource);
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -304,14 +371,25 @@ static struct device_attribute dev_attr_region_align =
static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
{
- resource_size_t size = resource_size(&dax_region->res);
+ resource_size_t size;
struct resource *res;
lockdep_assert_held(&dax_region_rwsem);
- if (is_dynamic(dax_region))
- return 0;
+ if (is_dynamic(dax_region)) {
+ /*
+ * Children of a dynamic region are extents, claimed
+ * all-or-nothing: an extent's resource is either unclaimed (no
+ * child) or fully consumed by exactly one dax device.
+ */
+ size = 0;
+ for_each_dax_region_resource(dax_region, res)
+ if (!res->child)
+ size += resource_size(res);
+ return size;
+ }
+ size = resource_size(&dax_region->res);
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -452,15 +530,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
static void trim_dev_dax_range(struct dev_dax *dev_dax)
{
int i = dev_dax->nr_range - 1;
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_region *dax_region = dev_dax->region;
+ struct resource *res = &dax_region->res;
lockdep_assert_held_write(&dax_region_rwsem);
dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);
- __release_region(&dax_region->res, range->start, range_len(range));
+ if (dev_range->dax_resource) {
+ res = dev_range->dax_resource->res;
+ dev_dbg(&dev_dax->dev, "Trim dc extent %pr\n", res);
+ }
+
+ __release_region(res, range->start, range_len(range));
+
+ if (dev_range->dax_resource)
+ dev_range->dax_resource->use_cnt--;
+
if (--dev_dax->nr_range == 0) {
kfree(dev_dax->ranges);
dev_dax->ranges = NULL;
@@ -644,11 +733,14 @@ static void dax_region_unregister(void *region)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags)
+ unsigned long flags, struct dax_dc_ops *dc_ops)
{
struct dax_region *dax_region;
int rc;
+ if (!dc_ops && (flags & IORESOURCE_DAX_DCD))
+ return NULL;
+
/*
* The DAX core assumes that it can store its private data in
* parent->driver_data. This WARN is a reminder / safeguard for
@@ -673,6 +765,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
dax_region->align = align;
dax_region->dev = parent;
dax_region->target_node = target_node;
+ dax_region->dc_ops = dc_ops;
ida_init(&dax_region->ida);
dax_region->res = (struct resource) {
.start = range->start,
@@ -861,7 +954,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
}
static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
- resource_size_t size)
+ resource_size_t size, struct dax_resource *dax_resource)
{
struct dax_region *dax_region = dev_dax->region;
struct resource *res = &dax_region->res;
@@ -902,6 +995,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
.start = alloc->start,
.end = alloc->end,
},
+ .dax_resource = dax_resource,
};
dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -1075,7 +1169,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
retry:
first = region_res->child;
if (!first)
- return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+ return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc, NULL);
rc = -ENOSPC;
for (res = first; res; res = res->sibling) {
@@ -1084,7 +1178,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
/* space at the beginning of the region */
if (res == first && res->start > dax_region->res.start) {
alloc = min(res->start - dax_region->res.start, to_alloc);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
+ rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc, NULL);
break;
}
@@ -1104,7 +1198,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
break;
}
- rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
+ rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc, NULL);
break;
}
if (rc)
@@ -1214,7 +1308,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
- rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+ rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc, NULL);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);
@@ -1506,7 +1600,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+ rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size, NULL);
if (rc)
goto err_range;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 6e739bfab932..7a115893a102 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -11,6 +11,7 @@ struct dev_dax;
struct resource;
struct dax_device;
struct dax_region;
+struct dax_dc_ops;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
@@ -19,7 +20,7 @@ struct dax_region;
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags);
+ unsigned long flags, struct dax_dc_ops *dc_ops);
struct dev_dax_data {
struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index f58fe992aa8d..690cf625e052 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,84 @@
#include "../cxl/cxl.h"
#include "bus.h"
+#include "dax-private.h"
+
+static int __cxl_dax_add_resource(struct dax_region *dax_region,
+ struct dc_extent *dc_extent)
+{
+ struct device *dev = &dc_extent->dev;
+ resource_size_t start, length;
+
+ start = dax_region->res.start + dc_extent->hpa_range.start;
+ length = range_len(&dc_extent->hpa_range);
+ return dax_region_add_resource(dax_region, dev, start, length,
+ &dc_extent->group->uuid,
+ dc_extent->seq_num);
+}
+
+static int cxl_dax_add_resource(struct device *dev, void *data)
+{
+ struct dax_region *dax_region = data;
+ struct dc_extent *dc_extent;
+
+ dc_extent = to_dc_extent(dev);
+ if (!dc_extent)
+ return 0;
+
+ dev_dbg(dax_region->dev, "Adding resource HPA %pra (%pUb)\n",
+ &dc_extent->hpa_range, &dc_extent->group->uuid);
+
+ return __cxl_dax_add_resource(dax_region, dc_extent);
+}
+
+static int cxl_dax_group_add(struct dax_region *dax_region,
+ struct cxl_dc_tag_group *group)
+{
+ struct dc_extent *dc_extent;
+ unsigned long index;
+ int rc;
+
+ xa_for_each(&group->dc_extents, index, dc_extent) {
+ rc = __cxl_dax_add_resource(dax_region, dc_extent);
+ if (rc)
+ return rc;
+ }
+ return 0;
+}
+
+/*
+ * RELEASE is still a stub here — the atomic dax_region_rm_resources API
+ * and its wire-up land in the next commit. An incoming RELEASE returns
+ * success and the cxl side proceeds to rm_tag_group(), which device-
+ * unregisters each dc_extent; the devm action armed by
+ * dax_region_add_resource() then tears down each dax_resource.
+ */
+static int cxl_dax_region_notify(struct device *dev,
+ struct cxl_notify_data *notify_data)
+{
+ struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct cxl_dc_tag_group *group = notify_data->group;
+
+ switch (notify_data->event) {
+ case DCD_ADD_CAPACITY:
+ return cxl_dax_group_add(dax_region, group);
+ case DCD_RELEASE_CAPACITY:
+ dev_dbg(&cxlr_dax->dev,
+ "DCD RELEASE notify (tag %pUb): no-op (stub)\n",
+ &group->uuid);
+ return 0;
+ case DCD_FORCED_CAPACITY_RELEASE:
+ default:
+ dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
+ notify_data->event);
+ return -ENXIO;
+ }
+}
+
+static struct dax_dc_ops dc_ops = {
+ .is_extent = is_dc_extent,
+};
static int cxl_dax_region_probe(struct device *dev)
{
@@ -25,15 +103,18 @@ static int cxl_dax_region_probe(struct device *dev)
flags = IORESOURCE_DAX_KMEM;
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, flags);
+ PMD_SIZE, flags, &dc_ops);
if (!dax_region)
return -ENOMEM;
- if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A) {
+ device_for_each_child(&cxlr_dax->dev, dax_region,
+ cxl_dax_add_resource);
/* Add empty seed dax device */
dev_size = 0;
- else
+ } else {
dev_size = range_len(&cxlr_dax->hpa_range);
+ }
data = (struct dev_dax_data) {
.dax_region = dax_region,
@@ -48,6 +129,7 @@ static int cxl_dax_region_probe(struct device *dev)
static struct cxl_driver cxl_dax_region_driver = {
.name = "cxl_dax_region",
.probe = cxl_dax_region_probe,
+ .notify = cxl_dax_region_notify,
.id = CXL_DEVICE_DAX_REGION,
.drv = {
.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index ee8f3af8387f..f2ae5918f94d 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -8,6 +8,7 @@
#include <linux/device.h>
#include <linux/cdev.h>
#include <linux/idr.h>
+#include <linux/uuid.h>
/* private routines between core files */
struct dax_device;
@@ -16,6 +17,14 @@ struct inode *dax_inode(struct dax_device *dax_dev);
int dax_bus_init(void);
void dax_bus_exit(void);
+/**
+ * struct dax_dc_ops - Operations for dc-backed regions
+ * @is_extent: return if the device is an extent
+ */
+struct dax_dc_ops {
+ bool (*is_extent)(struct device *dev);
+};
+
/**
* struct dax_region - mapping infrastructure for dax devices
* @id: kernel-wide unique region for a memory range
@@ -27,6 +36,7 @@ void dax_bus_exit(void);
* @res: resource tree to track instance allocations
* @seed: allow userspace to find the first unbound seed device
* @youngest: allow userspace to find the most recently created device
+ * @dc_ops: operations required for DC-backed regions
*/
struct dax_region {
int id;
@@ -38,6 +48,7 @@ struct dax_region {
struct resource res;
struct device *seed;
struct device *youngest;
+ struct dax_dc_ops *dc_ops;
};
/**
@@ -57,11 +68,13 @@ struct dax_mapping {
* @pgoff: page offset
* @range: resource-span
* @mapping: reference to the dax_mapping for this range
+ * @dax_resource: if not NULL; dax DC resource containing this range
*/
struct dev_dax_range {
unsigned long pgoff;
struct range range;
struct dax_mapping *mapping;
+ struct dax_resource *dax_resource;
};
/**
@@ -105,6 +118,42 @@ struct dev_dax {
*/
void run_dax(struct dax_device *dax_dev);
+/**
+ * struct dax_resource - For DC DAX regions; an active resource
+ * @region: dax_region this resource is in
+ * @res: resource
+ * @uuid: tag identifying the backing extent; zero uuid means untagged
+ * @seq_num: 1..n assembly-order index within the tag group; 0 for the
+ * untagged pool (uuid == 0). For extents from a sharable
+ * CXL DC partition this is the device-stamped shared_extn_seq
+ * (CXL 3.1 Table 8-51). For extents from a non-sharable
+ * partition the cxl layer fills it in event arrival order, so
+ * the dax layer can rely on a single 1..n dense invariant when
+ * it claims a tagged group in uuid_store().
+ * @use_cnt: count the number of uses of this resource
+ *
+ * Changes to the dax_region and the dax_resources within it are protected by
+ * dax_region_rwsem
+ *
+ * dax_resource's are not intended to be used outside the dax layer.
+ */
+struct dax_resource {
+ struct dax_region *region;
+ struct resource *res;
+ uuid_t uuid;
+ u16 seq_num;
+ unsigned int use_cnt;
+};
+
+/*
+ * Similar to run_dax() dax_region_add_resource() is exported but is not
+ * intended to be a generic operation outside the dax subsystem. It is only
+ * generic between the dax layer and the dax drivers.
+ */
+int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
+ resource_size_t start, resource_size_t length,
+ const uuid_t *tag, u16 seq_num);
+
static inline struct dev_dax *to_dev_dax(struct device *dev)
{
return container_of(dev, struct dev_dax, dev);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index af21f66bf872..be938c2a73f8 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
mri = dev->platform_data;
dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
- mri->target_node, PMD_SIZE, flags);
+ mri->target_node, PMD_SIZE, flags, NULL);
if (!dax_region)
return -ENOMEM;
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index bee93066a849..5b5be86768f3 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -53,7 +53,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
range.start += offset;
dax_region = alloc_dax_region(dev, region_id, &range,
nd_region->target_node, le32_to_cpu(pfn_sb->align),
- IORESOURCE_DAX_STATIC);
+ IORESOURCE_DAX_STATIC, NULL);
if (!dax_region)
return ERR_PTR(-ENOMEM);
--
2.43.0
* Re: [PATCH v10 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events
2026-05-23 9:43 ` [PATCH v10 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
@ 2026-05-28 23:41 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 23:41 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> When an extent is accepted/released, the CXL driver must notify
> the DAX driver to coordinate the management of resources. Define
> the .notify callback to the cxl_dax region driver to enable
> the coordination.
>
> Define struct dax_resource, a sub-resource of a DC dax_region
> representing the capacity of one dc_extent.
>
> When the cxl side onlines a tag group during a DC Add event,
> notify the DAX region to register a struct dax_resource for each
> extent under the dax_region's resource tree.
>
> The dax_resource model:
>
> * struct dax_resource (dax-private.h) — per-extent sub-resource of
> a DC dax_region: pointer back to its region, the kernel struct
> resource, the tag uuid, the per-allocation seq_num, and a use_cnt
> that lets a later commit refuse release of an in-use extent.
> * struct dev_dax_range gains a dax_resource back-pointer so a
> carved range remembers which extent it lives in.
>
> For now, dax_resources live under the dax_region and remain inaccessible
> to DAX devices. A later commit adds the support to specify a tag
> when creating a DAX device, which then allows dax_resources to be
> claimed by tag.
>
> Release is handled in the following commit.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: restructured from the original "Create resources on sparse DAX
> regions" commit]
> ---
> drivers/cxl/core/core.h | 10 ++++
> drivers/cxl/core/extent.c | 33 ++++++++++-
> drivers/cxl/core/mbox.c | 17 +++++-
> drivers/cxl/cxl.h | 6 ++
> drivers/dax/bus.c | 118 ++++++++++++++++++++++++++++++++++----
> drivers/dax/bus.h | 3 +-
> drivers/dax/cxl.c | 88 +++++++++++++++++++++++++++-
> drivers/dax/dax-private.h | 49 ++++++++++++++++
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> 10 files changed, 306 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 02b36728c22d..c28e357c5817 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -72,6 +72,9 @@ int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent,
> bool cxl_tag_already_committed(const uuid_t *tag);
> int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
> int online_tag_group(struct cxl_dc_tag_group *group);
> +void rm_tag_group(struct cxl_dc_tag_group *group);
> +int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> + struct cxl_dc_tag_group *group);
> #else
> static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
> const struct cxl_memdev *cxlmd, u64 dpa)
> @@ -96,6 +99,13 @@ static inline bool cxl_tag_already_committed(const uuid_t *tag)
> {
> return false;
> }
> +static inline void rm_tag_group(struct cxl_dc_tag_group *group) { }
> +static inline int cxlr_notify_extent(struct cxl_region *cxlr,
> + enum dc_event event,
> + struct cxl_dc_tag_group *group)
> +{
> + return 0;
> +}
> static inline
> struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
> struct cxl_endpoint_decoder **cxled)
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 34babfe032d1..3fc4b7292664 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -63,7 +63,6 @@ static const struct attribute_group dc_extent_attribute_group = {
>
> __ATTRIBUTE_GROUPS(dc_extent_attribute);
>
> -
> static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
> struct dc_extent *dc_extent)
> {
> @@ -359,6 +358,36 @@ dc_extent_build(struct cxl_endpoint_decoder *cxled,
> return dc_extent;
> }
>
> +int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> + struct cxl_dc_tag_group *group)
> +{
> + struct device *dev = &cxlr->cxlr_dax->dev;
> + struct cxl_notify_data notify_data;
> + struct cxl_driver *driver;
> +
> + dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid);
> +
> + guard(device)(dev);
> +
> + /*
> + * The lack of a driver indicates a notification has failed. No user
> + * space coordination was possible.
> + */
> + if (!dev->driver)
> + return 0;
> + driver = to_cxl_drv(dev->driver);
> + if (!driver->notify)
> + return 0;
> +
> + notify_data = (struct cxl_notify_data) {
> + .event = event,
> + .group = group,
> + };
> +
> + dev_dbg(dev, "Notify: type %d tag %pUb\n", event, &group->uuid);
> + return driver->notify(dev, &notify_data);
> +}
> +
> /*
> * Stage 4: insert @dc_extent into the pending tag group. All extents in
> * one More-chain group share a UUID — enforced here as the group is
> @@ -462,7 +491,7 @@ static void dc_extent_unregister(void *ext)
> device_unregister(&dc_extent->dev);
> }
>
> -static void rm_tag_group(struct cxl_dc_tag_group *group)
> +void rm_tag_group(struct cxl_dc_tag_group *group)
> {
> struct device *region_dev = &group->cxlr_dax->dev;
> struct dc_extent *dc_extent;
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 85959dee35ea..8071c1ed1b36 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1558,8 +1558,21 @@ static int cxl_add_pending(struct cxl_memdev_state *mds)
> list_for_each_entry_safe(pos, tmp, &group, list)
> delete_extent_node(pos);
> } else {
> - list_splice_tail_init(&group, &accepted);
> - total_accepted += group_cnt;
> + rc = cxlr_notify_extent(tag_group->cxlr_dax->cxlr,
> + DCD_ADD_CAPACITY, tag_group);
> + if (rc) {
> + /*
> + * The dax-side notification failed; tear down the
> + * tag group and drop the extents so we do not
> + * mis-report acceptance to the device.
> + */
> + rm_tag_group(tag_group);
This is called before cxl_send_dc_response() runs and the extents are accepted. The call path is
rm_tag_group() -> devm_release_action(dc_extent_unregister()) -> device_unregister(dc_extent->dev) -> dc_extent_release() -> cxled_release_extent() -> memdev_release_extent(), which issues CXL_MBOX_OP_RELEASE_DC on extents that have not been accepted yet. Maybe there needs to be state tracking so that the mailbox command is not issued for these extents and only the kernel structs get freed?
Speaking of which, cleanup_pending_dc_extent() also sends the release mailbox command. It probably shouldn't do that, right?
DJ
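
A minimal sketch of the state tracking suggested above, as a standalone model rather than kernel code. Every name here (enum ext_state, struct ext_model, send_release_dc(), ext_release()) is hypothetical and does not appear in the patch; send_release_dc() only counts calls and stands in for issuing CXL_MBOX_OP_RELEASE_DC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lifecycle state for one extent; not part of the patch. */
enum ext_state {
	EXT_PENDING,	/* built by the kernel, acceptance not yet reported */
	EXT_ACCEPTED,	/* acceptance was reported to the device */
};

/* Hypothetical stand-in for struct dc_extent; only the field used here. */
struct ext_model {
	enum ext_state state;
};

/* Counts how many release commands the model would have sent. */
static int release_cmds_sent;

/* Stand-in for issuing CXL_MBOX_OP_RELEASE_DC; it only counts calls. */
static void send_release_dc(struct ext_model *e)
{
	(void)e;
	release_cmds_sent++;
}

/*
 * Release path: the device is only told about extents it believes the
 * host holds. Pending extents skip the mailbox command, and only the
 * kernel-side bookkeeping would be freed for them.
 */
static void ext_release(struct ext_model *e)
{
	assert(e != NULL);

	if (e->state == EXT_ACCEPTED)
		send_release_dc(e);
	/* freeing of kernel structs would happen here in both cases */
}
```

With a flag like this, the error path after a failed dax notification could free the pending extents without the device ever seeing a release for capacity it had not yet been told was accepted.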
> + list_for_each_entry_safe(pos, tmp, &group, list)
> + delete_extent_node(pos);
> + } else {
> + list_splice_tail_init(&group, &accepted);
> + total_accepted += group_cnt;
> + }
> }
>
> mds->add_ctx.group = NULL;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index a28e7b12a4a8..27e3046654e9 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -892,6 +892,11 @@ bool is_cxl_region(struct device *dev);
>
> extern const struct bus_type cxl_bus_type;
>
> +struct cxl_notify_data {
> + enum dc_event event;
> + struct cxl_dc_tag_group *group;
> +};
> +
> /*
> * Note, add_dport() is expressly for the cxl_port driver. TODO: investigate a
> * type-safe driver model where probe()/remove() take the type of object implied
> @@ -904,6 +909,7 @@ struct cxl_driver {
> void (*remove)(struct device *dev);
> struct cxl_dport *(*add_dport)(struct cxl_port *port,
> struct device *dport_dev);
> + int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
> struct device_driver drv;
> int id;
> };
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index b0c2162b5e37..a6ee59f2d8a1 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -186,6 +186,73 @@ static bool is_dynamic(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_DCD) != 0;
> }
>
> +static void __dax_release_resource(struct dax_resource *dax_resource)
> +{
> + struct dax_region *dax_region = dax_resource->region;
> +
> + lockdep_assert_held_write(&dax_region_rwsem);
> + dev_dbg(dax_region->dev, "Extent release resource %pr\n",
> + dax_resource->res);
> + if (dax_resource->res)
> + __release_region(&dax_region->res, dax_resource->res->start,
> + resource_size(dax_resource->res));
> + dax_resource->res = NULL;
> +}
> +
> +static void dax_release_resource(void *res)
> +{
> + struct dax_resource *dax_resource = res;
> +
> + guard(rwsem_write)(&dax_region_rwsem);
> + __dax_release_resource(dax_resource);
> + kfree(dax_resource);
> +}
> +
> +int dax_region_add_resource(struct dax_region *dax_region,
> + struct device *device,
> + resource_size_t start, resource_size_t length,
> + const uuid_t *tag, u16 seq_num)
> +{
> + struct resource *new_resource;
> + int rc;
> +
> + struct dax_resource *dax_resource __free(kfree) =
> + kzalloc(sizeof(*dax_resource), GFP_KERNEL);
> + if (!dax_resource)
> + return -ENOMEM;
> +
> + guard(rwsem_write)(&dax_region_rwsem);
> +
> + dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> + new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
> + if (!new_resource) {
> + dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> + &start, &length);
> + return -ENOSPC;
> + }
> +
> + dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
> + dax_resource->region = dax_region;
> + dax_resource->res = new_resource;
> + dax_resource->seq_num = seq_num;
> + if (tag)
> + uuid_copy(&dax_resource->uuid, tag);
> +
> + /*
> + * open code devm_add_action_or_reset() to avoid recursive write lock
> + * of dax_region_rwsem in the error case.
> + */
> + rc = devm_add_action(device, dax_release_resource, dax_resource);
> + if (rc) {
> + __dax_release_resource(dax_resource);
> + return rc;
> + }
> +
> + dev_set_drvdata(device, no_free_ptr(dax_resource));
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> @@ -304,14 +371,25 @@ static struct device_attribute dev_attr_region_align =
>
> static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
> {
> - resource_size_t size = resource_size(&dax_region->res);
> + resource_size_t size;
> struct resource *res;
>
> lockdep_assert_held(&dax_region_rwsem);
>
> - if (is_dynamic(dax_region))
> - return 0;
> + if (is_dynamic(dax_region)) {
> + /*
> + * Children of a dynamic region are extents, claimed
> + * all-or-nothing: an extent's resource is either unclaimed (no
> + * child) or fully consumed by exactly one dax device.
> + */
> + size = 0;
> + for_each_dax_region_resource(dax_region, res)
> + if (!res->child)
> + size += resource_size(res);
> + return size;
> + }
>
> + size = resource_size(&dax_region->res);
> for_each_dax_region_resource(dax_region, res)
> size -= resource_size(res);
> return size;
> @@ -452,15 +530,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
> static void trim_dev_dax_range(struct dev_dax *dev_dax)
> {
> int i = dev_dax->nr_range - 1;
> - struct range *range = &dev_dax->ranges[i].range;
> + struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> + struct range *range = &dev_range->range;
> struct dax_region *dax_region = dev_dax->region;
> + struct resource *res = &dax_region->res;
>
> lockdep_assert_held_write(&dax_region_rwsem);
> dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
> (unsigned long long)range->start,
> (unsigned long long)range->end);
>
> - __release_region(&dax_region->res, range->start, range_len(range));
> + if (dev_range->dax_resource) {
> + res = dev_range->dax_resource->res;
> + dev_dbg(&dev_dax->dev, "Trim dc extent %pr\n", res);
> + }
> +
> + __release_region(res, range->start, range_len(range));
> +
> + if (dev_range->dax_resource)
> + dev_range->dax_resource->use_cnt--;
> +
> if (--dev_dax->nr_range == 0) {
> kfree(dev_dax->ranges);
> dev_dax->ranges = NULL;
> @@ -644,11 +733,14 @@ static void dax_region_unregister(void *region)
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> - unsigned long flags)
> + unsigned long flags, struct dax_dc_ops *dc_ops)
> {
> struct dax_region *dax_region;
> int rc;
>
> + if (!dc_ops && (flags & IORESOURCE_DAX_DCD))
> + return NULL;
> +
> /*
> * The DAX core assumes that it can store its private data in
> * parent->driver_data. This WARN is a reminder / safeguard for
> @@ -673,6 +765,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> dax_region->align = align;
> dax_region->dev = parent;
> dax_region->target_node = target_node;
> + dax_region->dc_ops = dc_ops;
> ida_init(&dax_region->ida);
> dax_region->res = (struct resource) {
> .start = range->start,
> @@ -861,7 +954,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
> }
>
> static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> - resource_size_t size)
> + resource_size_t size, struct dax_resource *dax_resource)
> {
> struct dax_region *dax_region = dev_dax->region;
> struct resource *res = &dax_region->res;
> @@ -902,6 +995,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> .start = alloc->start,
> .end = alloc->end,
> },
> + .dax_resource = dax_resource,
> };
>
> dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
> @@ -1075,7 +1169,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
> retry:
> first = region_res->child;
> if (!first)
> - return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
> + return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc, NULL);
>
> rc = -ENOSPC;
> for (res = first; res; res = res->sibling) {
> @@ -1084,7 +1178,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
> /* space at the beginning of the region */
> if (res == first && res->start > dax_region->res.start) {
> alloc = min(res->start - dax_region->res.start, to_alloc);
> - rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
> + rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc, NULL);
> break;
> }
>
> @@ -1104,7 +1198,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
> rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
> break;
> }
> - rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
> + rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc, NULL);
> break;
> }
> if (rc)
> @@ -1214,7 +1308,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>
> to_alloc = range_len(&r);
> if (alloc_is_aligned(dev_dax, to_alloc))
> - rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> + rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc, NULL);
> up_write(&dax_dev_rwsem);
> up_write(&dax_region_rwsem);
>
> @@ -1506,7 +1600,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
> device_initialize(dev);
> dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>
> - rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
> + rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size, NULL);
> if (rc)
> goto err_range;
>
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 6e739bfab932..7a115893a102 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -11,6 +11,7 @@ struct dev_dax;
> struct resource;
> struct dax_device;
> struct dax_region;
> +struct dax_dc_ops;
>
> /* dax bus specific ioresource flags */
> #define IORESOURCE_DAX_STATIC BIT(0)
> @@ -19,7 +20,7 @@ struct dax_region;
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> - unsigned long flags);
> + unsigned long flags, struct dax_dc_ops *dc_ops);
>
> struct dev_dax_data {
> struct dax_region *dax_region;
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index f58fe992aa8d..690cf625e052 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,84 @@
>
> #include "../cxl/cxl.h"
> #include "bus.h"
> +#include "dax-private.h"
> +
> +static int __cxl_dax_add_resource(struct dax_region *dax_region,
> + struct dc_extent *dc_extent)
> +{
> + struct device *dev = &dc_extent->dev;
> + resource_size_t start, length;
> +
> + start = dax_region->res.start + dc_extent->hpa_range.start;
> + length = range_len(&dc_extent->hpa_range);
> + return dax_region_add_resource(dax_region, dev, start, length,
> + &dc_extent->group->uuid,
> + dc_extent->seq_num);
> +}
> +
> +static int cxl_dax_add_resource(struct device *dev, void *data)
> +{
> + struct dax_region *dax_region = data;
> + struct dc_extent *dc_extent;
> +
> + dc_extent = to_dc_extent(dev);
> + if (!dc_extent)
> + return 0;
> +
> + dev_dbg(dax_region->dev, "Adding resource HPA %pra (%pUb)\n",
> + &dc_extent->hpa_range, &dc_extent->group->uuid);
> +
> + return __cxl_dax_add_resource(dax_region, dc_extent);
> +}
> +
> +static int cxl_dax_group_add(struct dax_region *dax_region,
> + struct cxl_dc_tag_group *group)
> +{
> + struct dc_extent *dc_extent;
> + unsigned long index;
> + int rc;
> +
> + xa_for_each(&group->dc_extents, index, dc_extent) {
> + rc = __cxl_dax_add_resource(dax_region, dc_extent);
> + if (rc)
> + return rc;
> + }
> + return 0;
> +}
> +
> +/*
> + * RELEASE is still a stub here — the atomic dax_region_rm_resources API
> + * and its wire-up land in the next commit. An incoming RELEASE returns
> + * success and the cxl side proceeds to rm_tag_group(), which device-
> + * unregisters each dc_extent; the devm action armed by
> + * dax_region_add_resource() then tears down each dax_resource.
> + */
> +static int cxl_dax_region_notify(struct device *dev,
> + struct cxl_notify_data *notify_data)
> +{
> + struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> + struct dax_region *dax_region = dev_get_drvdata(dev);
> + struct cxl_dc_tag_group *group = notify_data->group;
> +
> + switch (notify_data->event) {
> + case DCD_ADD_CAPACITY:
> + return cxl_dax_group_add(dax_region, group);
> + case DCD_RELEASE_CAPACITY:
> + dev_dbg(&cxlr_dax->dev,
> + "DCD RELEASE notify (tag %pUb): no-op (stub)\n",
> + &group->uuid);
> + return 0;
> + case DCD_FORCED_CAPACITY_RELEASE:
> + default:
> + dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> + notify_data->event);
> + return -ENXIO;
> + }
> +}
> +
> +static struct dax_dc_ops dc_ops = {
> + .is_extent = is_dc_extent,
> +};
>
> static int cxl_dax_region_probe(struct device *dev)
> {
> @@ -25,15 +103,18 @@ static int cxl_dax_region_probe(struct device *dev)
> flags = IORESOURCE_DAX_KMEM;
>
> dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> - PMD_SIZE, flags);
> + PMD_SIZE, flags, &dc_ops);
> if (!dax_region)
> return -ENOMEM;
>
> - if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A)
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A) {
> + device_for_each_child(&cxlr_dax->dev, dax_region,
> + cxl_dax_add_resource);
> /* Add empty seed dax device */
> dev_size = 0;
> - else
> + } else {
> dev_size = range_len(&cxlr_dax->hpa_range);
> + }
>
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> @@ -48,6 +129,7 @@ static int cxl_dax_region_probe(struct device *dev)
> static struct cxl_driver cxl_dax_region_driver = {
> .name = "cxl_dax_region",
> .probe = cxl_dax_region_probe,
> + .notify = cxl_dax_region_notify,
> .id = CXL_DEVICE_DAX_REGION,
> .drv = {
> .suppress_bind_attrs = true,
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index ee8f3af8387f..f2ae5918f94d 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -8,6 +8,7 @@
> #include <linux/device.h>
> #include <linux/cdev.h>
> #include <linux/idr.h>
> +#include <linux/uuid.h>
>
> /* private routines between core files */
> struct dax_device;
> @@ -16,6 +17,14 @@ struct inode *dax_inode(struct dax_device *dax_dev);
> int dax_bus_init(void);
> void dax_bus_exit(void);
>
> +/**
> + * struct dax_dc_ops - Operations for dc-backed regions
> + * @is_extent: return if the device is an extent
> + */
> +struct dax_dc_ops {
> + bool (*is_extent)(struct device *dev);
> +};
> +
> /**
> * struct dax_region - mapping infrastructure for dax devices
> * @id: kernel-wide unique region for a memory range
> @@ -27,6 +36,7 @@ void dax_bus_exit(void);
> * @res: resource tree to track instance allocations
> * @seed: allow userspace to find the first unbound seed device
> * @youngest: allow userspace to find the most recently created device
> + * @dc_ops: operations required for DC-backed regions
> */
> struct dax_region {
> int id;
> @@ -38,6 +48,7 @@ struct dax_region {
> struct resource res;
> struct device *seed;
> struct device *youngest;
> + struct dax_dc_ops *dc_ops;
> };
>
> /**
> @@ -57,11 +68,13 @@ struct dax_mapping {
> * @pgoff: page offset
> * @range: resource-span
> * @mapping: reference to the dax_mapping for this range
> + * @dax_resource: if not NULL; dax DC resource containing this range
> */
> struct dev_dax_range {
> unsigned long pgoff;
> struct range range;
> struct dax_mapping *mapping;
> + struct dax_resource *dax_resource;
> };
>
> /**
> @@ -105,6 +118,42 @@ struct dev_dax {
> */
> void run_dax(struct dax_device *dax_dev);
>
> +/**
> + * struct dax_resource - For DC DAX regions; an active resource
> + * @region: dax_region this resources is in
> + * @res: resource
> + * @uuid: tag identifying the backing extent; zero uuid means untagged
> + * @seq_num: 1..n assembly-order index within the tag group; 0 for the
> + * untagged pool (uuid == 0). For extents from a sharable
> + * CXL DC partition this is the device-stamped shared_extn_seq
> + * (CXL 3.1 Table 8-51). For extents from a non-sharable
> + * partition the cxl layer fills it in event arrival order, so
> + * the dax layer can rely on a single 1..n dense invariant when
> + * it claims a tagged group in uuid_store().
> + * @use_cnt: count the number of uses of this resource
> + *
> + * Changes to the dax_region and the dax_resources within it are protected by
> + * dax_region_rwsem
> + *
> + * dax_resource's are not intended to be used outside the dax layer.
> + */
> +struct dax_resource {
> + struct dax_region *region;
> + struct resource *res;
> + uuid_t uuid;
> + u16 seq_num;
> + unsigned int use_cnt;
> +};
> +
> +/*
> + * Similar to run_dax() dax_region_add_resource() is exported but is not
> + * intended to be a generic operation outside the dax subsystem. It is only
> + * generic between the dax layer and the dax drivers.
> + */
> +int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
> + resource_size_t start, resource_size_t length,
> + const uuid_t *tag, u16 seq_num);
> +
> static inline struct dev_dax *to_dev_dax(struct device *dev)
> {
> return container_of(dev, struct dev_dax, dev);
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index af21f66bf872..be938c2a73f8 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
>
> mri = dev->platform_data;
> dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> - mri->target_node, PMD_SIZE, flags);
> + mri->target_node, PMD_SIZE, flags, NULL);
> if (!dax_region)
> return -ENOMEM;
>
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index bee93066a849..5b5be86768f3 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -53,7 +53,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
> range.start += offset;
> dax_region = alloc_dax_region(dev, region_id, &range,
> nd_region->target_node, le32_to_cpu(pfn_sb->align),
> - IORESOURCE_DAX_STATIC);
> + IORESOURCE_DAX_STATIC, NULL);
> if (!dax_region)
> return ERR_PTR(-ENOMEM);
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 22/31] cxl + dax: Release dax_resources on DCD Release Capacity events
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (20 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 21/31] cxl + dax: Surface dax_resources on DCD Add Capacity events Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-28 23:53 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 23/31] dax/bus: Factor out dev dax resize logic Anisa Su
` (10 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
Implement the release path that mirrors the add path: when the
device asks for capacity back, the dax layer tears down the
per-extent resources for the whole tag group atomically.
If any extent in the group is still mapped by a dev_dax, the release
is refused with -EBUSY and no state changes; the cxl side then leaves
the tag group intact and the device retries.
Also add a rollback to the add path: if any per-extent registration
fails midway through a group, undo the ones already added so a
partial group never leaks into the dax region.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: split out from the original "Surface dc_extents" commit;
fills in the RELEASE half of the bridge, moves the cxl-side RELEASE
notify into this commit, and adds the rollback path to ADD.]
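
The refuse-all-or-none release described in the commit message can be sketched as a standalone two-pass model. This is an illustration under stated assumptions, not the kernel implementation: struct res_model and rm_all_or_none() are hypothetical names, the released flag stands in for __dax_release_resource(), and the caller is assumed to hold whatever lock keeps use_cnt stable across both passes (dax_region_rwsem in the patch).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dax_resource; only the fields used here. */
struct res_model {
	unsigned int use_cnt;	/* non-zero while a device maps the extent */
	int released;		/* set once the resource has been torn down */
};

/*
 * Remove every entry in @set or none of them. NULL slots model devices
 * with no resource attached and are skipped.
 */
static int rm_all_or_none(struct res_model **set, size_t n)
{
	size_t i;

	assert(set != NULL || n == 0);

	/* Pass 1: refuse before touching anything. */
	for (i = 0; i < n; i++)
		if (set[i] && set[i]->use_cnt)
			return -EBUSY;

	/* Pass 2: nothing is busy, so release each entry. */
	for (i = 0; i < n; i++)
		if (set[i])
			set[i]->released = 1;

	return 0;
}
```

Checking the whole set before releasing any member is what keeps a busy extent from leaving the tag group half torn down.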
---
drivers/cxl/core/extent.c | 13 +++++++++
drivers/dax/bus.c | 59 +++++++++++++++++++++++++++++++++++++++
drivers/dax/cxl.c | 54 +++++++++++++++++++++++++++--------
drivers/dax/dax-private.h | 8 ++++--
4 files changed, 120 insertions(+), 14 deletions(-)
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 3fc4b7292664..2c8edfe53c0a 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -532,6 +532,7 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
struct range dpa_range;
unsigned long idx;
uuid_t tag;
+ int rc;
dpa_range = (struct range) {
.start = start_dpa,
@@ -588,6 +589,18 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
return -EINVAL;
}
+ rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, group);
+ if (rc) {
+ /*
+ * dax layer refused (-EBUSY) or failed (-ENOMEM, etc.). Do
+ * not proceed to tear down the tag group — leave its
+ * dax_resources alive so we do not free them out from under
+ * live dev_dax ranges. The device will retry the release.
+ */
+ return 0;
+ }
+
+ /* Release the entire tag group */
rm_tag_group(group);
return 0;
}
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index a6ee59f2d8a1..6368bdfdf93a 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -253,6 +253,65 @@ int dax_region_add_resource(struct dax_region *dax_region,
}
EXPORT_SYMBOL_GPL(dax_region_add_resource);
+int dax_region_rm_resource(struct dax_region *dax_region,
+ struct device *dev)
+{
+ struct dax_resource *dax_resource;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource)
+ return 0;
+
+ if (dax_resource->use_cnt)
+ return -EBUSY;
+
+ /*
+ * release the resource under dax_region_rwsem to avoid races with
+ * users trying to use the extent
+ */
+ __dax_release_resource(dax_resource);
+ dev_set_drvdata(dev, NULL);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resource);
+
+/**
+ * dax_region_rm_resources - atomically remove a set of dax_resources.
+ *
+ * Walk @devs twice under dax_region_rwsem. First pass refuses the
+ * operation if any member's use_cnt is non-zero; second pass releases
+ * each. This gives refuse-all-or-none semantics across the set, which
+ * a tag group's atomic release relies on. Devices with no
+ * dax_resource attached are silently skipped.
+ */
+int dax_region_rm_resources(struct dax_region *dax_region,
+ struct device * const *devs, unsigned int n)
+{
+ unsigned int i;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ for (i = 0; i < n; i++) {
+ struct dax_resource *r = dev_get_drvdata(devs[i]);
+
+ if (r && r->use_cnt)
+ return -EBUSY;
+ }
+
+ for (i = 0; i < n; i++) {
+ struct dax_resource *r = dev_get_drvdata(devs[i]);
+
+ if (!r)
+ continue;
+ __dax_release_resource(r);
+ dev_set_drvdata(devs[i], NULL);
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resources);
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 690cf625e052..04b73315a8f2 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -44,19 +44,52 @@ static int cxl_dax_group_add(struct dax_region *dax_region,
xa_for_each(&group->dc_extents, index, dc_extent) {
rc = __cxl_dax_add_resource(dax_region, dc_extent);
- if (rc)
+ if (rc) {
+ /*
+ * Unwind every dax_resource already added for this
+ * group; one rm per owner suffices.
+ */
+ struct dc_extent *u;
+ unsigned long uidx;
+
+ xa_for_each(&group->dc_extents, uidx, u) {
+ if (u == dc_extent)
+ break;
+ dax_region_rm_resource(dax_region, &u->dev);
+ }
return rc;
+ }
}
return 0;
}
-/*
- * RELEASE is still a stub here — the atomic dax_region_rm_resources API
- * and its wire-up land in the next commit. An incoming RELEASE returns
- * success and the cxl side proceeds to rm_tag_group(), which device-
- * unregisters each dc_extent; the devm action armed by
- * dax_region_add_resource() then tears down each dax_resource.
- */
+static int cxl_dax_group_rm(struct dax_region *dax_region,
+ struct cxl_dc_tag_group *group)
+{
+ struct dc_extent *dc_extent;
+ struct device **devs;
+ unsigned long index;
+ unsigned int n = 0;
+ int rc;
+
+ if (!group->nr_extents)
+ return 0;
+
+ devs = kmalloc_array(group->nr_extents, sizeof(*devs), GFP_KERNEL);
+ if (!devs)
+ return -ENOMEM;
+
+ xa_for_each(&group->dc_extents, index, dc_extent) {
+ if (n == group->nr_extents)
+ break;
+ devs[n++] = &dc_extent->dev;
+ }
+
+ rc = dax_region_rm_resources(dax_region, devs, n);
+ kfree(devs);
+ return rc;
+}
+
static int cxl_dax_region_notify(struct device *dev,
struct cxl_notify_data *notify_data)
{
@@ -68,10 +101,7 @@ static int cxl_dax_region_notify(struct device *dev,
case DCD_ADD_CAPACITY:
return cxl_dax_group_add(dax_region, group);
case DCD_RELEASE_CAPACITY:
- dev_dbg(&cxlr_dax->dev,
- "DCD RELEASE notify (tag %pUb): no-op (stub)\n",
- &group->uuid);
- return 0;
+ return cxl_dax_group_rm(dax_region, group);
case DCD_FORCED_CAPACITY_RELEASE:
default:
dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index f2ae5918f94d..414813a6137f 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -146,13 +146,17 @@ struct dax_resource {
};
/*
- * Similar to run_dax() dax_region_add_resource() is exported but is not
- * intended to be a generic operation outside the dax subsystem. It is only
+ * Similar to run_dax() dax_region_{add,rm}_resource() are exported but are not
+ * intended to be generic operations outside the dax subsystem. They are only
* generic between the dax layer and the dax drivers.
*/
int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
resource_size_t start, resource_size_t length,
const uuid_t *tag, u16 seq_num);
+int dax_region_rm_resource(struct dax_region *dax_region,
+ struct device *dev);
+int dax_region_rm_resources(struct dax_region *dax_region,
+ struct device * const *devs, unsigned int n);
static inline struct dev_dax *to_dev_dax(struct device *dev)
{
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 22/31] cxl + dax: Release dax_resources on DCD Release Capacity events
2026-05-23 9:43 ` [PATCH v10 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
@ 2026-05-28 23:53 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-28 23:53 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> Implement the release path that mirrors the add path: when the
> device asks for capacity back, the dax layer tears down the
> per-extent resources for the whole tag group atomically.
>
> If any extent in the group is still mapped by a dev_dax, the release
> is refused with -EBUSY and no state changes; the cxl side then leaves
> the tag group intact and the device retries.
>
> Also add a rollback to the add path: if any per-extent registration
> fails midway through a group, undo the ones already added so a
> partial group never leaks into the dax region.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
Just a nit below
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
> Changes:
> [anisa: split out from the original "Surface dc_extents" commit;
> fills in the RELEASE half of the bridge, moves the cxl-side RELEASE
> notify into this commit, and adds the rollback path to ADD.]
> ---
> drivers/cxl/core/extent.c | 13 +++++++++
> drivers/dax/bus.c | 59 +++++++++++++++++++++++++++++++++++++++
> drivers/dax/cxl.c | 54 +++++++++++++++++++++++++++--------
> drivers/dax/dax-private.h | 8 ++++--
> 4 files changed, 120 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 3fc4b7292664..2c8edfe53c0a 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -532,6 +532,7 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> struct range dpa_range;
> unsigned long idx;
> uuid_t tag;
> + int rc;
>
> dpa_range = (struct range) {
> .start = start_dpa,
> @@ -588,6 +589,18 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> return -EINVAL;
> }
>
> + rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, group);
> + if (rc) {
> + /*
> + * dax layer refused (-EBUSY) or failed (-ENOMEM, etc.). Do
> + * not proceed to tear down the tag group — leave its
> + * dax_resources alive so we do not free them out from under
> + * live dev_dax ranges. The device will retry the release.
> + */
> + return 0;
> + }
> +
> + /* Release the entire tag group */
> rm_tag_group(group);
> return 0;
> }
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index a6ee59f2d8a1..6368bdfdf93a 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -253,6 +253,65 @@ int dax_region_add_resource(struct dax_region *dax_region,
> }
> EXPORT_SYMBOL_GPL(dax_region_add_resource);
>
> +int dax_region_rm_resource(struct dax_region *dax_region,
> + struct device *dev)
> +{
> + struct dax_resource *dax_resource;
> +
> + guard(rwsem_write)(&dax_region_rwsem);
> +
> + dax_resource = dev_get_drvdata(dev);
> + if (!dax_resource)
> + return 0;
> +
> + if (dax_resource->use_cnt)
> + return -EBUSY;
> +
> + /*
> + * release the resource under dax_region_rwsem to avoid races with
> + * users trying to use the extent
> + */
> + __dax_release_resource(dax_resource);
> + dev_set_drvdata(dev, NULL);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
No reason to export. Seems only used within DAX.
DJ
> +
> +/**
> + * dax_region_rm_resources - atomically remove a set of dax_resources.
> + *
> + * Walk @devs twice under dax_region_rwsem. First pass refuses the
> + * operation if any member's use_cnt is non-zero; second pass releases
> + * each. This gives refuse-all-or-none semantics across the set, which
> + * a tag group's atomic release relies on. Devices with no
> + * dax_resource attached are silently skipped.
> + */
> +int dax_region_rm_resources(struct dax_region *dax_region,
> + struct device * const *devs, unsigned int n)
> +{
> + unsigned int i;
> +
> + guard(rwsem_write)(&dax_region_rwsem);
> +
> + for (i = 0; i < n; i++) {
> + struct dax_resource *r = dev_get_drvdata(devs[i]);
> +
> + if (r && r->use_cnt)
> + return -EBUSY;
> + }
> +
> + for (i = 0; i < n; i++) {
> + struct dax_resource *r = dev_get_drvdata(devs[i]);
> +
> + if (!r)
> + continue;
> + __dax_release_resource(r);
> + dev_set_drvdata(devs[i], NULL);
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_rm_resources);
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 690cf625e052..04b73315a8f2 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -44,19 +44,52 @@ static int cxl_dax_group_add(struct dax_region *dax_region,
>
> xa_for_each(&group->dc_extents, index, dc_extent) {
> rc = __cxl_dax_add_resource(dax_region, dc_extent);
> - if (rc)
> + if (rc) {
> + /*
> + * Unwind every dax_resource already added for this
> + * group; one rm per owner suffices.
> + */
> + struct dc_extent *u;
> + unsigned long uidx;
> +
> + xa_for_each(&group->dc_extents, uidx, u) {
> + if (u == dc_extent)
> + break;
> + dax_region_rm_resource(dax_region, &u->dev);
> + }
> return rc;
> + }
> }
> return 0;
> }
>
> -/*
> - * RELEASE is still a stub here — the atomic dax_region_rm_resources API
> - * and its wire-up land in the next commit. An incoming RELEASE returns
> - * success and the cxl side proceeds to rm_tag_group(), which device-
> - * unregisters each dc_extent; the devm action armed by
> - * dax_region_add_resource() then tears down each dax_resource.
> - */
> +static int cxl_dax_group_rm(struct dax_region *dax_region,
> + struct cxl_dc_tag_group *group)
> +{
> + struct dc_extent *dc_extent;
> + struct device **devs;
> + unsigned long index;
> + unsigned int n = 0;
> + int rc;
> +
> + if (!group->nr_extents)
> + return 0;
> +
> + devs = kmalloc_array(group->nr_extents, sizeof(*devs), GFP_KERNEL);
> + if (!devs)
> + return -ENOMEM;
> +
> + xa_for_each(&group->dc_extents, index, dc_extent) {
> + if (n == group->nr_extents)
> + break;
> + devs[n++] = &dc_extent->dev;
> + }
> +
> + rc = dax_region_rm_resources(dax_region, devs, n);
> + kfree(devs);
> + return rc;
> +}
> +
> static int cxl_dax_region_notify(struct device *dev,
> struct cxl_notify_data *notify_data)
> {
> @@ -68,10 +101,7 @@ static int cxl_dax_region_notify(struct device *dev,
> case DCD_ADD_CAPACITY:
> return cxl_dax_group_add(dax_region, group);
> case DCD_RELEASE_CAPACITY:
> - dev_dbg(&cxlr_dax->dev,
> - "DCD RELEASE notify (tag %pUb): no-op (stub)\n",
> - &group->uuid);
> - return 0;
> + return cxl_dax_group_rm(dax_region, group);
> case DCD_FORCED_CAPACITY_RELEASE:
> default:
> dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index f2ae5918f94d..414813a6137f 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -146,13 +146,17 @@ struct dax_resource {
> };
>
> /*
> - * Similar to run_dax() dax_region_add_resource() is exported but is not
> - * intended to be a generic operation outside the dax subsystem. It is only
> + * Similar to run_dax() dax_region_{add,rm}_resource() are exported but are not
> + * intended to be generic operations outside the dax subsystem. They are only
> * generic between the dax layer and the dax drivers.
> */
> int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
> resource_size_t start, resource_size_t length,
> const uuid_t *tag, u16 seq_num);
> +int dax_region_rm_resource(struct dax_region *dax_region,
> + struct device *dev);
> +int dax_region_rm_resources(struct dax_region *dax_region,
> + struct device * const *devs, unsigned int n);
>
> static inline struct dev_dax *to_dev_dax(struct device *dev)
> {
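For readers following the refuse-all-or-none semantics described in the dax_region_rm_resources() kernel-doc above, here is a minimal userspace sketch of the same two-pass idea. It is not the kernel code: struct fake_resource and fake_rm_resources() are hypothetical names invented for illustration, and the sketch assumes the caller already holds whatever lock serializes use_cnt (the patch uses dax_region_rwsem for that).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dax_resource; not the kernel type. */
struct fake_resource {
	int use_cnt;	/* non-zero while something still maps the resource */
	int released;	/* set once the resource has been torn down */
};

/*
 * Remove every resource in @res, or none of them.
 * Returns 0 on success and -EBUSY if any member is still in use.
 * NULL entries model devices with no resource attached and are skipped.
 */
int fake_rm_resources(struct fake_resource **res, size_t n)
{
	size_t i;

	/* Pass 1: check only, change nothing. */
	for (i = 0; i < n; i++)
		if (res[i] && res[i]->use_cnt)
			return -EBUSY;

	/* Pass 2: every member is idle, so release them all. */
	for (i = 0; i < n; i++) {
		if (!res[i])
			continue;
		res[i]->released = 1;
		res[i] = NULL;
	}
	return 0;
}
```

Because the first pass never mutates state, a busy member anywhere in the set leaves the earlier members untouched, which is the property a tag group's atomic release depends on.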
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 23/31] dax/bus: Factor out dev dax resize logic
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (21 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 22/31] cxl + dax: Release dax_resources on DCD Release " Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-23 9:43 ` [PATCH v10 24/31] dax/bus: Add uuid sysfs attribute to dax devices Anisa Su
` (9 subsequent siblings)
32 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron
From: Ira Weiny <ira.weiny@intel.com>
Dynamic Capacity (DC) DAX regions back their dax devices with per-extent
resource children of the region, rather than carving from a single
contiguous dax_region->res. Allocating space for a DC dax device — on
initial uuid claim of its backing extents and on shrink-to-0 during
destroy — needs the same allocator the static case uses, but pointed at
a different parent resource.
Factor the body of dev_dax_resize() into __dev_dax_resize(parent, ...)
and add a dev_dax_resize_static() wrapper that passes dax_region->res
for static (non-DC) regions. alloc_dev_dax_range() gains the same
parent parameter so it can operate under either kind of parent.
No functional change.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[anisa: reword to drop the options-considered discussion and "sparse"
terminology; preserved in a later commit that realizes per-extent
resource children]
---
drivers/dax/bus.c | 131 ++++++++++++++++++++++++++++------------------
1 file changed, 81 insertions(+), 50 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 6368bdfdf93a..5c1b93890d30 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1012,11 +1012,10 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
return 0;
}
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
- resource_size_t size, struct dax_resource *dax_resource)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+ u64 start, resource_size_t size,
+ struct dax_resource *dax_resource)
{
- struct dax_region *dax_region = dev_dax->region;
- struct resource *res = &dax_region->res;
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
unsigned long pgoff = 0;
@@ -1034,14 +1033,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
return 0;
}
- alloc = __request_region(res, start, size, dev_name(dev), 0);
+ alloc = __request_region(parent, start, size, dev_name(dev), 0);
if (!alloc)
return -ENOMEM;
ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
* (dev_dax->nr_range + 1), GFP_KERNEL);
if (!ranges) {
- __release_region(res, alloc->start, resource_size(alloc));
+ __release_region(parent, alloc->start, resource_size(alloc));
return -ENOMEM;
}
@@ -1195,50 +1194,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
return true;
}
-static ssize_t dev_dax_resize(struct dax_region *dax_region,
- struct dev_dax *dev_dax, resource_size_t size)
+/**
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in
+ * @dev_dax: DAX device to be expanded
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
{
- resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
- resource_size_t dev_size = dev_dax_size(dev_dax);
- struct resource *region_res = &dax_region->res;
- struct device *dev = &dev_dax->dev;
struct resource *res, *first;
- resource_size_t alloc = 0;
int rc;
- if (dev->driver)
- return -EBUSY;
- if (size == dev_size)
- return 0;
- if (size > dev_size && size - dev_size > avail)
- return -ENOSPC;
- if (size < dev_size)
- return dev_dax_shrink(dev_dax, size);
-
- to_alloc = size - dev_size;
- if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
- "resize of %pa misaligned\n", &to_alloc))
- return -ENXIO;
-
- /*
- * Expand the device into the unused portion of the region. This
- * may involve adjusting the end of an existing resource, or
- * allocating a new resource.
- */
-retry:
- first = region_res->child;
- if (!first)
- return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc, NULL);
+ first = parent->child;
+ if (!first) {
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, to_alloc, NULL);
+ if (rc)
+ return rc;
+ return to_alloc;
+ }
- rc = -ENOSPC;
for (res = first; res; res = res->sibling) {
struct resource *next = res->sibling;
+ resource_size_t alloc;
/* space at the beginning of the region */
- if (res == first && res->start > dax_region->res.start) {
- alloc = min(res->start - dax_region->res.start, to_alloc);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc, NULL);
- break;
+ if (res == first && res->start > parent->start) {
+ alloc = min(res->start - parent->start, to_alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, alloc, NULL);
+ if (rc)
+ return rc;
+ return alloc;
}
alloc = 0;
@@ -1247,21 +1241,56 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
alloc = min(next->start - (res->end + 1), to_alloc);
/* space at the end of the region */
- if (!alloc && !next && res->end < region_res->end)
- alloc = min(region_res->end - res->end, to_alloc);
+ if (!alloc && !next && res->end < parent->end)
+ alloc = min(parent->end - res->end, to_alloc);
if (!alloc)
continue;
if (adjust_ok(dev_dax, res)) {
rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
- break;
+ if (rc)
+ return rc;
+ return alloc;
}
- rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc, NULL);
- break;
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc, NULL);
+ if (rc)
+ return rc;
+ return alloc;
}
- if (rc)
- return rc;
+
+ /* available was already calculated and should never be an issue */
+ dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+ return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+ struct dev_dax *dev_dax, resource_size_t size)
+{
+ resource_size_t avail = dax_region_avail_size(dax_region);
+ resource_size_t dev_size = dev_dax_size(dev_dax);
+ struct device *dev = &dev_dax->dev;
+ resource_size_t to_alloc;
+ ssize_t alloc;
+
+ if (dev->driver)
+ return -EBUSY;
+ if (size == dev_size)
+ return 0;
+ if (size > dev_size && size - dev_size > avail)
+ return -ENOSPC;
+ if (size < dev_size)
+ return dev_dax_shrink(dev_dax, size);
+
+ to_alloc = size - dev_size;
+ if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+ "resize of %pa misaligned\n", &to_alloc))
+ return -ENXIO;
+
+retry:
+ alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (alloc <= 0)
+ return alloc;
to_alloc -= alloc;
if (to_alloc)
goto retry;
@@ -1367,7 +1396,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
- rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc, NULL);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+ to_alloc, NULL);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);
@@ -1659,7 +1689,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size, NULL);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+ data->size, NULL);
if (rc)
goto err_range;
--
2.43.0
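The split between a per-call allocator and a retrying caller can be modelled in userspace. The sketch below is a simplification under stated assumptions: struct span, struct parent, insert_child(), resize_once() and resize_all() are hypothetical names, children live in a small sorted array instead of a resource tree, and a new child is always inserted rather than extending an existing one as adjust_dev_dax_range() may do. It also does not pre-check the free space the way dev_dax_resize() does. The per-call result is held in a signed type so that the `<= 0` test also catches negative errno values.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

#define MAX_CHILDREN 8

/* Inclusive range, in the spirit of struct resource start/end. */
struct span { unsigned long start, end; };

/* Hypothetical parent: busy children sorted by start, non-overlapping. */
struct parent {
	struct span range;
	struct span child[MAX_CHILDREN];
	size_t nr;
};

/* Insert a busy child of @len bytes at index @at, keeping the array sorted. */
int insert_child(struct parent *p, size_t at, unsigned long start,
		 unsigned long len)
{
	size_t i;

	if (p->nr == MAX_CHILDREN)
		return -ENOMEM;
	for (i = p->nr; i > at; i--)
		p->child[i] = p->child[i - 1];
	p->child[at].start = start;
	p->child[at].end = start + len - 1;
	p->nr++;
	return 0;
}

/* Claim up to @to_alloc bytes from the first gap; bytes claimed or -errno. */
ssize_t resize_once(struct parent *p, unsigned long to_alloc)
{
	unsigned long cursor = p->range.start;
	size_t i;

	if (!to_alloc)
		return 0;

	for (i = 0; i <= p->nr; i++) {
		/* The gap ends at the next child, or at the end of the parent. */
		unsigned long gap_end = i < p->nr ? p->child[i].start
						  : p->range.end + 1;
		unsigned long gap = gap_end - cursor;

		if (gap) {
			unsigned long take = gap < to_alloc ? gap : to_alloc;
			int rc = insert_child(p, i, cursor, take);

			return rc ? rc : (ssize_t)take;
		}
		if (i < p->nr)
			cursor = p->child[i].end + 1;
	}
	return -ENOSPC;
}

/* Retry until the whole request is satisfied or a call fails. */
ssize_t resize_all(struct parent *p, unsigned long to_alloc)
{
	unsigned long total = 0;

	while (to_alloc) {
		ssize_t got = resize_once(p, to_alloc);

		if (got <= 0)
			return got;
		to_alloc -= got;
		total += got;
	}
	return (ssize_t)total;
}
```

With busy children at [10,19] and [50,59] inside a [0,99] parent, a 35-byte request is served in two calls: 10 bytes from the leading gap, then 25 bytes starting at offset 20.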
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v10 24/31] dax/bus: Add uuid sysfs attribute to dax devices
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (22 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 23/31] dax/bus: Factor out dev dax resize logic Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 17:07 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation Anisa Su
` (8 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su
Introduce a read-write 'uuid' sysfs entry at
/sys/bus/dax/devices/daxX.Y/ with stubbed handlers: show returns "0"
and store returns -EOPNOTSUPP. A follow-on patch wires both
directions to dax_resource tracking.
Document the attribute in the dax sysfs ABI.
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Documentation/ABI/testing/sysfs-bus-dax | 18 ++++++++++++++++++
drivers/dax/bus.c | 14 ++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..23400824073b 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -59,6 +59,24 @@ Description:
backing device for this dax device, emit the CPU node
affinity for this device.
+What: /sys/bus/dax/devices/daxX.Y/uuid
+Date: May, 2026
+KernelVersion: v6.16
+Contact: nvdimm@lists.linux.dev
+Description:
+ (RW) On read, reports the uuid identifying the capacity
+ backing this dax device. A value of "0" indicates that the
+ device has no associated uuid — either it is not backed by
+ DCD capacity, or the backing extents are untagged.
+
+ Writes are accepted only on dax devices in sparse (DCD)
+ regions; writes to non-sparse devices return -EOPNOTSUPP.
+ Writing a non-null uuid claims every dax_resource in the
+ parent region whose tag matches the written uuid, consuming
+ any available capacity in each matching resource. Writing
+ "0" is shorthand for the null uuid and claims a single
+ untagged dax_resource.
+
What: /sys/bus/dax/devices/daxX.Y/target_node
Date: February, 2019
KernelVersion: v5.1
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 5c1b93890d30..1d6f82920be6 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1526,6 +1526,19 @@ static ssize_t numa_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(numa_node);
+static ssize_t uuid_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", 0);
+}
+
+static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return -EOPNOTSUPP;
+}
+static DEVICE_ATTR_RW(uuid);
+
static ssize_t memmap_on_memory_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1597,6 +1610,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
&dev_attr_memmap_on_memory.attr,
+ &dev_attr_uuid.attr,
NULL,
};
--
2.43.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 24/31] dax/bus: Add uuid sysfs attribute to dax devices
2026-05-23 9:43 ` [PATCH v10 24/31] dax/bus: Add uuid sysfs attribute to dax devices Anisa Su
@ 2026-05-29 17:07 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 17:07 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su
On 5/23/26 2:43 AM, Anisa Su wrote:
> Introduce a read-write 'uuid' sysfs entry at
> /sys/bus/dax/devices/daxX.Y/ with stubbed handlers: show returns "0"
> and store returns -EOPNOTSUPP. A follow-on patch wires both
> directions to dax_resource tracking.
>
> Document the attribute in the dax sysfs ABI.
>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
> ---
> Documentation/ABI/testing/sysfs-bus-dax | 18 ++++++++++++++++++
> drivers/dax/bus.c | 14 ++++++++++++++
> 2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
> index b34266bfae49..23400824073b 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -59,6 +59,24 @@ Description:
> backing device for this dax device, emit the CPU node
> affinity for this device.
>
> +What: /sys/bus/dax/devices/daxX.Y/uuid
> +Date: May, 2026
> +KernelVersion: v6.16
update
> +Contact: nvdimm@lists.linux.dev
> +Description:
> + (RW) On read, reports the uuid identifying the capacity
> + backing this dax device. A value of "0" indicates that the
> + device has no associated uuid — either it is not backed by
> + DCD capacity, or the backing extents are untagged.
> +
> + Writes are accepted only on dax devices in sparse (DCD)
> + regions; writes to non-sparse devices return -EOPNOTSUPP.
> + Writing a non-null uuid claims every dax_resource in the
> + parent region whose tag matches the written uuid, consuming
> + any available capacity in each matching resource. Writing
> + "0" is shorthand for the null uuid and claims a single
> + untagged dax_resource.
> +
> What: /sys/bus/dax/devices/daxX.Y/target_node
> Date: February, 2019
> KernelVersion: v5.1
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 5c1b93890d30..1d6f82920be6 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1526,6 +1526,19 @@ static ssize_t numa_node_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(numa_node);
>
> +static ssize_t uuid_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", 0);
Should it just emit null UUID instead of 0 to not screw up user apps?
return sysfs_emit(buf, "%pUb\n", &uuid_null);
> +}
> +
> +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + return -EOPNOTSUPP;
> +}
> +static DEVICE_ATTR_RW(uuid);
> +
> static ssize_t memmap_on_memory_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> @@ -1597,6 +1610,7 @@ static struct attribute *dev_dax_attributes[] = {
> &dev_attr_resource.attr,
> &dev_attr_numa_node.attr,
> &dev_attr_memmap_on_memory.attr,
> + &dev_attr_uuid.attr,
> NULL,
> };
>
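To illustrate why a fixed-width null UUID may be friendlier to userspace than a bare "0", here is a small userspace sketch that prints 16 bytes in the 8-4-4-4-12 layout that %pUb produces (bytes in stored order, lower-case hex). format_uuid() is a hypothetical helper written for this example, not a kernel or libuuid function.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format @uuid as 8-4-4-4-12 lower-case hex, bytes in stored order.
 * @out needs room for 36 characters plus the terminating NUL.
 * Returns what snprintf() returns: 36 when the buffer is large enough.
 */
int format_uuid(char *out, size_t len, const unsigned char uuid[16])
{
	return snprintf(out, len,
			"%02x%02x%02x%02x-%02x%02x-%02x%02x-"
			"%02x%02x-%02x%02x%02x%02x%02x%02x",
			uuid[0], uuid[1], uuid[2], uuid[3],
			uuid[4], uuid[5], uuid[6], uuid[7],
			uuid[8], uuid[9], uuid[10], uuid[11],
			uuid[12], uuid[13], uuid[14], uuid[15]);
}
```

A reader that always expects 36 characters can then parse the null and non-null cases with the same code path, instead of special-casing a one-character "0".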
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v10 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (23 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 24/31] dax/bus: Add uuid sysfs attribute to dax devices Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 17:16 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
` (7 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
A DC dax device's size is determined by the extents that back it, not by
the user. DCD extents are all-or-nothing, so partial shrink is just as
illegal as growing. Enforce that on the size and creation paths:
* size_store: any non-zero resize on a DC region returns -EOPNOTSUPP.
The sole exception is size=0, which daxctl destroy-device writes to
return every claimed extent to the region's available pool before
the device's name is written to the region's 'delete' attribute.
* __devm_create_dev_dax: a DC dax device must be created at size 0.
Non-zero data->size on a DC region returns -EINVAL with a clear
message.
The resize machinery (dev_dax_shrink, adjust_ok, dev_dax_resize_static,
dev_dax_resize) learns to walk the right parent — dax_region->res for
static regions, the dax_resource->res for DC regions claimed via
uuid_store — so shrink-to-0 correctly releases each extent's child
resource rather than the region's.
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: split out from the original "Surface dc_extents" commit;
DC-aware resize policy only.]
---
drivers/dax/bus.c | 46 +++++++++++++++++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1d6f82920be6..c030eb103ad0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1136,7 +1136,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
int i;
for (i = dev_dax->nr_range - 1; i >= 0; i--) {
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
struct resource *adjust = NULL, *res;
resource_size_t shrink;
@@ -1152,6 +1153,10 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
continue;
}
+ /*
+ * Partial shrink: forbidden on DC regions, so dev_range
+ * here must belong to a static device.
+ */
for_each_dax_region_resource(dax_region, res)
if (strcmp(res->name, dev_name(dev)) == 0
&& res->start == range->start) {
@@ -1195,19 +1200,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
}
/**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
*
* @parent: parent resource to allocate this range in
* @dev_dax: DAX device to be expanded
* @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_resource: for DC regions, the dax_resource whose res is @parent;
+ *                NULL for static regions
*
* Return the amount of space allocated or -ERRNO on failure
*/
-static ssize_t dev_dax_resize_static(struct resource *parent,
- struct dev_dax *dev_dax,
- resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc,
+ struct dax_resource *dax_resource)
{
struct resource *res, *first;
int rc;
@@ -1215,7 +1222,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
first = parent->child;
if (!first) {
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, to_alloc, NULL);
+ parent->start, to_alloc,
+ dax_resource);
if (rc)
return rc;
return to_alloc;
@@ -1229,7 +1237,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
if (res == first && res->start > parent->start) {
alloc = min(res->start - parent->start, to_alloc);
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, alloc, NULL);
+ parent->start, alloc,
+ dax_resource);
if (rc)
return rc;
return alloc;
@@ -1253,7 +1262,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return rc;
return alloc;
}
- rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc, NULL);
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+ dax_resource);
if (rc)
return rc;
return alloc;
@@ -1264,6 +1274,13 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return 0;
}
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
static ssize_t dev_dax_resize(struct dax_region *dax_region,
struct dev_dax *dev_dax, resource_size_t size)
{
@@ -1277,6 +1294,8 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return -EBUSY;
if (size == dev_size)
return 0;
+ if (size != 0 && is_dynamic(dax_region))
+ return -EOPNOTSUPP;
if (size > dev_size && size - dev_size > avail)
return -ENOSPC;
if (size < dev_size)
@@ -1288,7 +1307,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return -ENXIO;
retry:
- alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
if (alloc <= 0)
return alloc;
to_alloc -= alloc;
@@ -1674,6 +1693,11 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
struct device *dev;
int rc;
+ if (is_dynamic(dax_region) && data->size) {
+ dev_err(parent, "DC DAX region devices must be created initially with 0 size");
+ return ERR_PTR(-EINVAL);
+ }
+
dev_dax = kzalloc_obj(*dev_dax);
if (!dev_dax)
return ERR_PTR(-ENOMEM);
--
2.43.0
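The order of the checks in dev_dax_resize() after this patch decides which error userspace sees. The sketch below restates that order as a pure function for illustration only: resize_policy() and enum resize_action are hypothetical, the alignment check is omitted, and the inputs stand in for dev->driver, is_dynamic(), dev_dax_size() and dax_region_avail_size().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome codes; negative errno values signal refusal. */
enum resize_action { RESIZE_NOOP, RESIZE_SHRINK, RESIZE_GROW };

/*
 * Mirror the check order in dev_dax_resize():
 * bound device, unchanged size, DC policy, free space, shrink, grow.
 */
int resize_policy(bool bound, bool dynamic, unsigned long dev_size,
		  unsigned long size, unsigned long avail)
{
	if (bound)
		return -EBUSY;
	if (size == dev_size)
		return RESIZE_NOOP;
	/* On a DC region only shrink-to-0 is allowed. */
	if (size != 0 && dynamic)
		return -EOPNOTSUPP;
	if (size > dev_size && size - dev_size > avail)
		return -ENOSPC;
	if (size < dev_size)
		return RESIZE_SHRINK;
	return RESIZE_GROW;
}
```

One consequence of this ordering is that writing the current size to a DC dax device is a no-op that returns success, since the unchanged-size check runs before the DC policy check.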
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation
2026-05-23 9:43 ` [PATCH v10 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation Anisa Su
@ 2026-05-29 17:16 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 17:16 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> A DC dax device's size is determined by the extents that back it, not by
> the user. DCD extents are all-or-nothing, so partial shrink is just as
> illegal as growing. Enforce that on the size and creation paths:
>
> * size_store: any non-zero resize on a DC region returns -EOPNOTSUPP.
> The sole exception is size=0, which daxctl destroy-device writes to
> return every claimed extent to the region's available pool before
> the device's name is written to the region's 'delete' attribute.
> * __devm_create_dev_dax: a DC dax device must be created at size 0.
> Non-zero data->size on a DC region returns -EINVAL with a clear
> message.
>
> The resize machinery (dev_dax_shrink, adjust_ok, dev_dax_resize_static,
> dev_dax_resize) learns to walk the right parent — dax_region->res for
> static regions, the dax_resource->res for DC regions claimed via
> uuid_store — so shrink-to-0 correctly releases each extent's child
> resource rather than the region's.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Just a nit below.
>
> ---
> Changes:
> [anisa: split out from the original "Surface dc_extents" commit;
> DC-aware resize policy only.]
> ---
> drivers/dax/bus.c | 46 +++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 35 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1d6f82920be6..c030eb103ad0 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1136,7 +1136,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
> int i;
>
> for (i = dev_dax->nr_range - 1; i >= 0; i--) {
> - struct range *range = &dev_dax->ranges[i].range;
> + struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> + struct range *range = &dev_range->range;
> struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
> struct resource *adjust = NULL, *res;
> resource_size_t shrink;
> @@ -1152,6 +1153,10 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
> continue;
> }
>
> + /*
> + * Partial shrink: forbidden on DC regions, so dev_range
> + * here must belong to a static device.
> + */
> for_each_dax_region_resource(dax_region, res)
> if (strcmp(res->name, dev_name(dev)) == 0
> && res->start == range->start) {
> @@ -1195,19 +1200,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
> }
>
> /**
> - * dev_dax_resize_static - Expand the device into the unused portion of the
> - * region. This may involve adjusting the end of an existing resource, or
> - * allocating a new resource.
> + * __dev_dax_resize - Expand the device into the unused portion of the region.
> + * This may involve adjusting the end of an existing resource, or allocating a
> + * new resource.
> *
> * @parent: parent resource to allocate this range in
> * @dev_dax: DAX device to be expanded
> * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + * @dax_resource: for DC regions, the dax_resource whose res is @parent;
> + *                NULL for static regions
> *
> * Return the amount of space allocated or -ERRNO on failure
> */
> -static ssize_t dev_dax_resize_static(struct resource *parent,
> - struct dev_dax *dev_dax,
> - resource_size_t to_alloc)
> +static ssize_t __dev_dax_resize(struct resource *parent,
> + struct dev_dax *dev_dax,
> + resource_size_t to_alloc,
> + struct dax_resource *dax_resource)
> {
> struct resource *res, *first;
> int rc;
> @@ -1215,7 +1222,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
> first = parent->child;
> if (!first) {
> rc = alloc_dev_dax_range(parent, dev_dax,
> - parent->start, to_alloc, NULL);
> + parent->start, to_alloc,
> + dax_resource);
> if (rc)
> return rc;
> return to_alloc;
> @@ -1229,7 +1237,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
> if (res == first && res->start > parent->start) {
> alloc = min(res->start - parent->start, to_alloc);
> rc = alloc_dev_dax_range(parent, dev_dax,
> - parent->start, alloc, NULL);
> + parent->start, alloc,
> + dax_resource);
> if (rc)
> return rc;
> return alloc;
> @@ -1253,7 +1262,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
> return rc;
> return alloc;
> }
> - rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc, NULL);
> + rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
> + dax_resource);
> if (rc)
> return rc;
> return alloc;
> @@ -1264,6 +1274,13 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
> return 0;
> }
>
> +static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
> + struct dev_dax *dev_dax,
> + resource_size_t to_alloc)
> +{
> + return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
> +}
> +
> static ssize_t dev_dax_resize(struct dax_region *dax_region,
> struct dev_dax *dev_dax, resource_size_t size)
> {
> @@ -1277,6 +1294,8 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
> return -EBUSY;
> if (size == dev_size)
> return 0;
> + if (size != 0 && is_dynamic(dax_region))
> + return -EOPNOTSUPP;
> if (size > dev_size && size - dev_size > avail)
> return -ENOSPC;
> if (size < dev_size)
> @@ -1288,7 +1307,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
> return -ENXIO;
>
> retry:
> - alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> + alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
> if (alloc <= 0)
> return alloc;
> to_alloc -= alloc;
> @@ -1674,6 +1693,11 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
> struct device *dev;
> int rc;
>
> + if (is_dynamic(dax_region) && data->size) {
> + dev_err(parent, "DC DAX region devices must be created initially with 0 size");
Needs a trailing \n at the end of the message.
> + return ERR_PTR(-EINVAL);
> + }
> +
> dev_dax = kzalloc_obj(*dev_dax);
> if (!dev_dax)
> return ERR_PTR(-ENOMEM);
* [PATCH v10 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (24 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 25/31] dax/bus: Reject resize on DC dax devices and enforce 0-size creation Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 17:53 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 27/31] cxl/region: Read existing extents on region creation Anisa Su
` (6 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su, Ira Weiny
DC DAX regions are populated with dax_resource children that each carry a
backing tag uuid and a per-allocation sequence number (seq_num). Add the
userspace claim semantics that resolve those tagged groups into DAX
devices.
A DC region's seed dax device is created at 0-size on probe; userspace
populates it by writing to its 'uuid' attribute:
* A non-null UUID claims every dax_resource on this region whose tag
matches, in seq_num order via uuid_claim_tagged(). The match set
must form a dense 1..n sequence (no gap, no duplicate); the CXL
side maintains this invariant for both sharable allocations (where
the device stamps shared_extn_seq) and non-sharable allocations
(where cxl_add_pending assigns arrival-order seq). The resulting
DAX device's size equals the sum of every member extent's size.
* "0" claims a single untagged dax_resource via
uuid_claim_untagged(). Untagged extents are independent
allocations; collapsing several would aggregate unrelated capacity,
so each uuid="0" write consumes exactly one untagged resource.
* A write that matches no dax_resource returns -ENOENT; the device
stays at size 0.
uuid_show() reads back the backing tag uuid (or the null UUID for an
untagged claim). The attribute is read-only (0444) on non-DC dax
devices; writes to it on non-DC regions return -EOPNOTSUPP.
dev_dax_visible() makes the uuid attribute writable only on DC dax devices.
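As a rough userspace illustration of the tagged-claim rule above (not the
kernel code itself), the sketch below sorts a group of extents by sequence
number and accepts the group only if the numbers form a dense 1..n run.
struct ext, seq_cmp() and claim_size() are hypothetical names made up for
this example; only the sort-then-validate idea and the error codes mirror
the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tagged dax_resource: sequence number and size. */
struct ext {
	unsigned int seq_num;
	unsigned long long size;
};

/* Order extents by ascending seq_num before validating the sequence. */
static int seq_cmp(const void *a, const void *b)
{
	const struct ext *ea = a;
	const struct ext *eb = b;

	if (ea->seq_num < eb->seq_num)
		return -1;
	if (ea->seq_num > eb->seq_num)
		return 1;
	return 0;
}

/*
 * Sort @arr in place and check for a dense 1..n sequence.
 * Returns 0 and stores the summed size in @total on success,
 * -ENOENT for an empty group, -EINVAL on a gap or duplicate.
 */
static int claim_size(struct ext *arr, size_t n, unsigned long long *total)
{
	unsigned long long sum = 0;
	size_t i;

	assert(total != NULL);
	if (n == 0)
		return -ENOENT;

	qsort(arr, n, sizeof(*arr), seq_cmp);
	for (i = 0; i < n; i++) {
		/* Slot i must hold seq_num i + 1; anything else is a gap or dup. */
		if (arr[i].seq_num != i + 1)
			return -EINVAL;
		sum += arr[i].size;
	}
	*total = sum;
	return 0;
}
```

A group such as {3, 1, 2} is accepted after sorting, while {1, 3} (gap) and
{1, 1} (duplicate) are both refused as a whole, which is the all-or-nothing
behaviour described above.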
Based on an original patch by Navneet Singh.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: split out from the original "Surface dc_extents" commit;
userspace tag-claim semantics only.]
---
drivers/dax/bus.c | 260 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 256 insertions(+), 4 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index c030eb103ad0..1dccb3e5cd0f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -5,6 +5,7 @@
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/slab.h>
+#include <linux/sort.h>
#include <linux/dax.h>
#include <linux/io.h>
#include "dax-private.h"
@@ -1316,6 +1317,89 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return 0;
}
+/* DC extents are all-or-nothing: an extent is either free or fully claimed. */
+static bool dax_resource_in_use(const struct dax_resource *dax_resource)
+{
+ return dax_resource->use_cnt > 0;
+}
+
+struct dax_uuid_match {
+ const struct dax_region *dax_region;
+ const uuid_t *uuid;
+};
+
+static int find_uuid_extent(struct device *dev, const void *data)
+{
+ const struct dax_uuid_match *match = data;
+ struct dax_resource *dax_resource;
+
+ if (!match->dax_region->dc_ops->is_extent(dev))
+ return 0;
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource || dax_resource_in_use(dax_resource))
+ return 0;
+ return uuid_equal(&dax_resource->uuid, match->uuid);
+}
+
+struct dax_tag_collect {
+ const struct dax_region *dax_region;
+ const uuid_t *uuid;
+ struct dax_resource **arr;
+ unsigned int count;
+ unsigned int cap;
+};
+
+static int collect_uuid_extent(struct device *dev, void *data)
+{
+ struct dax_tag_collect *c = data;
+ struct dax_resource *dax_resource;
+
+ if (!c->dax_region->dc_ops->is_extent(dev))
+ return 0;
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource || dax_resource_in_use(dax_resource))
+ return 0;
+ if (!uuid_equal(&dax_resource->uuid, c->uuid))
+ return 0;
+
+ if (c->count == c->cap)
+ return -ENOSPC;
+ c->arr[c->count++] = dax_resource;
+ return 0;
+}
+
+static int count_uuid_extent(struct device *dev, void *data)
+{
+ struct dax_tag_collect *c = data;
+ struct dax_resource *dax_resource;
+
+ if (!c->dax_region->dc_ops->is_extent(dev))
+ return 0;
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource || dax_resource_in_use(dax_resource))
+ return 0;
+ if (!uuid_equal(&dax_resource->uuid, c->uuid))
+ return 0;
+
+ c->count++;
+ return 0;
+}
+
+static int dax_resource_seq_cmp(const void *a, const void *b)
+{
+ const struct dax_resource * const *pa = a;
+ const struct dax_resource * const *pb = b;
+
+ if ((*pa)->seq_num < (*pb)->seq_num)
+ return -1;
+ if ((*pa)->seq_num > (*pb)->seq_num)
+ return 1;
+ return 0;
+}
+
static ssize_t size_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t len)
{
@@ -1548,13 +1632,177 @@ static DEVICE_ATTR_RO(numa_node);
static ssize_t uuid_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- return sysfs_emit(buf, "%d\n", 0);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ int rc;
+
+ rc = down_read_interruptible(&dax_dev_rwsem);
+ if (rc)
+ return rc;
+
+ for (int i = 0; i < dev_dax->nr_range; i++) {
+ struct dax_resource *r = dev_dax->ranges[i].dax_resource;
+
+ if (r && !uuid_is_null(&r->uuid)) {
+ rc = sysfs_emit(buf, "%pUb\n", &r->uuid);
+ goto out;
+ }
+ }
+ rc = sysfs_emit(buf, "0\n");
+out:
+ up_read(&dax_dev_rwsem);
+ return rc;
+}
+
+static ssize_t uuid_claim_untagged(struct dax_region *dax_region,
+ struct dev_dax *dev_dax)
+{
+ struct dax_uuid_match match = {
+ .dax_region = dax_region,
+ .uuid = &uuid_null,
+ };
+ struct dax_resource *dax_resource;
+ resource_size_t to_alloc;
+ struct device *extent_dev;
+ ssize_t alloc;
+
+ extent_dev = device_find_child(dax_region->dev, &match,
+ find_uuid_extent);
+ if (!extent_dev)
+ return -ENOENT;
+
+ dax_resource = dev_get_drvdata(extent_dev);
+ to_alloc = resource_size(dax_resource->res);
+ alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc,
+ dax_resource);
+ put_device(extent_dev);
+ if (alloc < 0)
+ return alloc;
+ if (alloc == 0)
+ return -ENOENT;
+ dax_resource->use_cnt++;
+ return 0;
+}
+
+static ssize_t uuid_claim_tagged(struct dax_region *dax_region,
+ struct dev_dax *dev_dax, const uuid_t *uuid)
+{
+ struct dax_tag_collect c = {
+ .dax_region = dax_region,
+ .uuid = uuid,
+ };
+ unsigned int i;
+ ssize_t rc;
+
+ /* Two-pass: count, then collect into a sized array. */
+ device_for_each_child(dax_region->dev, &c, count_uuid_extent);
+ if (!c.count)
+ return -ENOENT;
+
+ c.arr = kmalloc_array(c.count, sizeof(*c.arr), GFP_KERNEL);
+ if (!c.arr)
+ return -ENOMEM;
+ c.cap = c.count;
+ c.count = 0;
+
+ rc = device_for_each_child(dax_region->dev, &c, collect_uuid_extent);
+ if (rc)
+ goto out;
+
+ sort(c.arr, c.count, sizeof(*c.arr), dax_resource_seq_cmp, NULL);
+
+ /*
+ * Tagged groups carry a dense 1..n @seq_num regardless of source
+ * (sharable: device-stamped; non-sharable: host-assigned in
+ * arrival order — see &struct dax_resource). A gap or
+ * out-of-range value here means an extent went missing on the
+ * cxl side (e.g. a per-extent failure in cxl_add_pending) or a
+ * cxl-side validation gap; in either case refuse the whole
+ * group rather than carve a partial allocation.
+ */
+ for (i = 0; i < c.count; i++) {
+ if (c.arr[i]->seq_num != i + 1) {
+ dev_WARN_ONCE(dax_region->dev, 1,
+ "tag %pUb seq invariant violated at slot %u (got %u)\n",
+ uuid, i, c.arr[i]->seq_num);
+ rc = -EINVAL;
+ goto out;
+ }
+ }
+
+ for (i = 0; i < c.count; i++) {
+ resource_size_t to_alloc = resource_size(c.arr[i]->res);
+ ssize_t alloc;
+
+ alloc = __dev_dax_resize(c.arr[i]->res, dev_dax, to_alloc,
+ c.arr[i]);
+ if (alloc < 0) {
+ rc = alloc;
+ goto rollback;
+ }
+ if (alloc == 0) {
+ rc = -ENOSPC;
+ goto rollback;
+ }
+ c.arr[i]->use_cnt++;
+ }
+ rc = 0;
+ goto out;
+
+rollback:
+ /*
+ * Partial failure: trim every range we added in this attempt.
+ * trim_dev_dax_range pops the most-recently-appended range from
+ * dev_dax->ranges[] and decrements its dax_resource->use_cnt, so
+ * looping until we have undone @i additions restores both
+ * dev_dax->ranges[] and the matched dax_resources' use_cnt.
+ */
+ while (i-- > 0)
+ trim_dev_dax_range(dev_dax);
+out:
+ kfree(c.arr);
+ return rc;
}
static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t len)
{
- return -EOPNOTSUPP;
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
+ uuid_t uuid;
+ ssize_t rc;
+
+ if (!is_dynamic(dax_region))
+ return -EOPNOTSUPP;
+
+ if (sysfs_streq(buf, "0"))
+ uuid_copy(&uuid, &uuid_null);
+ else {
+ rc = uuid_parse(buf, &uuid);
+ if (rc)
+ return rc;
+ }
+
+ rc = down_write_killable(&dax_region_rwsem);
+ if (rc)
+ return rc;
+ if (!dax_region->dev->driver) {
+ rc = -ENXIO;
+ goto err_region;
+ }
+ rc = down_write_killable(&dax_dev_rwsem);
+ if (rc)
+ goto err_region;
+
+ if (uuid_is_null(&uuid))
+ rc = uuid_claim_untagged(dax_region, dev_dax);
+ else
+ rc = uuid_claim_tagged(dax_region, dev_dax, &uuid);
+
+ up_write(&dax_dev_rwsem);
+err_region:
+ up_write(&dax_region_rwsem);
+
+ return rc < 0 ? rc : len;
}
static DEVICE_ATTR_RW(uuid);
@@ -1614,8 +1862,12 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
return 0;
- if ((a == &dev_attr_align.attr ||
- a == &dev_attr_size.attr) && is_static(dax_region))
+ if (a == &dev_attr_uuid.attr && !is_dynamic(dax_region))
+ return 0444;
+ if (a == &dev_attr_align.attr &&
+ (is_static(dax_region) || is_dynamic(dax_region)))
+ return 0444;
+ if (a == &dev_attr_size.attr && is_static(dax_region))
return 0444;
return a->mode;
}
--
2.43.0
* Re: [PATCH v10 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices
2026-05-23 9:43 ` [PATCH v10 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
@ 2026-05-29 17:53 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 17:53 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su, Ira Weiny
On 5/23/26 2:43 AM, Anisa Su wrote:
> DC DAX regions are populated with dax_resource children that each carry a
> backing tag uuid and a per-allocation sequence number (seq_num). Add the
> userspace claim semantics that resolve those tagged groups into DAX
> devices.
>
> A DC region's seed dax device is created at 0-size on probe; userspace
> populates it by writing to its 'uuid' attribute:
>
> * A non-null UUID claims every dax_resource on this region whose tag
> matches, in seq_num order via uuid_claim_tagged(). The match set
> must form a dense 1..n sequence (no gap, no duplicate); the CXL
> side maintains this invariant for both sharable allocations (where
> the device stamps shared_extn_seq) and non-sharable allocations
> (where cxl_add_pending assigns arrival-order seq). The resulting
> DAX device's size equals the sum of every member extent's size.
>
> * "0" claims a single untagged dax_resource via
> uuid_claim_untagged(). Untagged extents are independent
> allocations; collapsing several would aggregate unrelated capacity,
> so each uuid="0" write consumes exactly one untagged resource.
>
> * A write that matches no dax_resource returns -ENOENT; the device
> stays at size 0.
>
> uuid_show() reads back the backing tag uuid (or the null UUID for an
> untagged claim). The attribute is read-only (0444) on non-DC dax
> devices; writes to it on non-DC regions return -EOPNOTSUPP.
>
> dev_dax_visible() makes the uuid attribute writable only on DC dax devices.
>
> Based on an original patch by Navneet Singh.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: split out from the original "Surface dc_extents" commit;
> userspace tag-claim semantics only.]
> ---
> drivers/dax/bus.c | 260 +++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 256 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index c030eb103ad0..1dccb3e5cd0f 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -5,6 +5,7 @@
> #include <linux/mutex.h>
> #include <linux/list.h>
> #include <linux/slab.h>
> +#include <linux/sort.h>
> #include <linux/dax.h>
> #include <linux/io.h>
> #include "dax-private.h"
> @@ -1316,6 +1317,89 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
> return 0;
> }
>
> +/* DC extents are all-or-nothing: an extent is either free or fully claimed. */
> +static bool dax_resource_in_use(const struct dax_resource *dax_resource)
> +{
> + return dax_resource->use_cnt > 0;
> +}
> +
> +struct dax_uuid_match {
> + const struct dax_region *dax_region;
> + const uuid_t *uuid;
> +};
> +
> +static int find_uuid_extent(struct device *dev, const void *data)
> +{
> + const struct dax_uuid_match *match = data;
> + struct dax_resource *dax_resource;
> +
> + if (!match->dax_region->dc_ops->is_extent(dev))
> + return 0;
> +
> + dax_resource = dev_get_drvdata(dev);
> + if (!dax_resource || dax_resource_in_use(dax_resource))
> + return 0;
> + return uuid_equal(&dax_resource->uuid, match->uuid);
> +}
> +
> +struct dax_tag_collect {
> + const struct dax_region *dax_region;
> + const uuid_t *uuid;
> + struct dax_resource **arr;
> + unsigned int count;
> + unsigned int cap;
> +};
> +
> +static int collect_uuid_extent(struct device *dev, void *data)
> +{
> + struct dax_tag_collect *c = data;
> + struct dax_resource *dax_resource;
> +
> + if (!c->dax_region->dc_ops->is_extent(dev))
> + return 0;
> +
> + dax_resource = dev_get_drvdata(dev);
> + if (!dax_resource || dax_resource_in_use(dax_resource))
> + return 0;
> + if (!uuid_equal(&dax_resource->uuid, c->uuid))
> + return 0;
> +
> + if (c->count == c->cap)
> + return -ENOSPC;
> + c->arr[c->count++] = dax_resource;
> + return 0;
> +}
> +
> +static int count_uuid_extent(struct device *dev, void *data)
> +{
> + struct dax_tag_collect *c = data;
> + struct dax_resource *dax_resource;
> +
> + if (!c->dax_region->dc_ops->is_extent(dev))
> + return 0;
> +
> + dax_resource = dev_get_drvdata(dev);
> + if (!dax_resource || dax_resource_in_use(dax_resource))
> + return 0;
> + if (!uuid_equal(&dax_resource->uuid, c->uuid))
> + return 0;
> +
> + c->count++;
> + return 0;
> +}
> +
> +static int dax_resource_seq_cmp(const void *a, const void *b)
> +{
> + const struct dax_resource * const *pa = a;
> + const struct dax_resource * const *pb = b;
> +
> + if ((*pa)->seq_num < (*pb)->seq_num)
> + return -1;
> + if ((*pa)->seq_num > (*pb)->seq_num)
> + return 1;
> + return 0;
> +}
> +
> static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -1548,13 +1632,177 @@ static DEVICE_ATTR_RO(numa_node);
> static ssize_t uuid_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> - return sysfs_emit(buf, "%d\n", 0);
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + int rc;
> +
> + rc = down_read_interruptible(&dax_dev_rwsem);
Since we are here, may as well convert these to ACQUIRE() and be rid of the gotos
ACQUIRE(rwsem_read_intr, rwsem)(&dax_dev_rwsem);
if ((rc = ACQUIRE_ERR(rwsem_read_intr, &rwsem)))
...
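For readers unfamiliar with the scope-based pattern being suggested, here is
a userspace analogy only. It uses the GCC/Clang cleanup variable attribute,
the compiler feature that the kernel's scope-based cleanup helpers build on,
to release a lock on every return path without a goto. struct fake_lock,
fake_lock_acquire(), fake_unlock() and show_value() are hypothetical names;
this is not the kernel's ACQUIRE() API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical lock: counts holders so the example can be checked. */
struct fake_lock {
	int held;
	int fail_next;	/* when set, the next acquire reports -EINTR */
};

/* Try to take the lock; returns 0 on success or -EINTR. */
static int fake_lock_acquire(struct fake_lock *l)
{
	if (l->fail_next) {
		l->fail_next = 0;
		return -EINTR;
	}
	l->held++;
	return 0;
}

/* Cleanup handler: runs when the guard variable goes out of scope. */
static void fake_unlock(struct fake_lock **guard)
{
	if (*guard)
		(*guard)->held--;
}

/* Returns @value if the lock was taken, or a negative errno. */
static int show_value(struct fake_lock *l, int value)
{
	/* The guard stays NULL until the lock is really held. */
	struct fake_lock *guard __attribute__((cleanup(fake_unlock))) = NULL;
	int rc;

	assert(l != NULL);
	rc = fake_lock_acquire(l);
	if (rc)
		return rc;	/* nothing held, so cleanup is a no-op */
	guard = l;

	if (value < 0)
		return -EINVAL;	/* early return still drops the lock */
	return value;
}
```

Every return in show_value() leaves the lock released, so no out: label or
unlock call is needed on the error paths.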
> + if (rc)
> + return rc;
> +
> + for (int i = 0; i < dev_dax->nr_range; i++) {
> + struct dax_resource *r = dev_dax->ranges[i].dax_resource;
> +
> + if (r && !uuid_is_null(&r->uuid)) {
> + rc = sysfs_emit(buf, "%pUb\n", &r->uuid);
> + goto out;
> + }
> + }
> + rc = sysfs_emit(buf, "0\n");
As pointed out earlier, this should display the null UUID to be consistent.
> +out:
> + up_read(&dax_dev_rwsem);
> + return rc;
> +}
> +
> +static ssize_t uuid_claim_untagged(struct dax_region *dax_region,
> + struct dev_dax *dev_dax)
> +{
> + struct dax_uuid_match match = {
> + .dax_region = dax_region,
> + .uuid = &uuid_null,
> + };
> + struct dax_resource *dax_resource;
> + resource_size_t to_alloc;
> + struct device *extent_dev;
> + ssize_t alloc;
> +
> + extent_dev = device_find_child(dax_region->dev, &match,
> + find_uuid_extent);
> + if (!extent_dev)
> + return -ENOENT;
> +
> + dax_resource = dev_get_drvdata(extent_dev);
> + to_alloc = resource_size(dax_resource->res);
> + alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc,
> + dax_resource);
> + put_device(extent_dev);
> + if (alloc < 0)
> + return alloc;
> + if (alloc == 0)
> + return -ENOENT;
> + dax_resource->use_cnt++;
> + return 0;
> +}
> +
> +static ssize_t uuid_claim_tagged(struct dax_region *dax_region,
> + struct dev_dax *dev_dax, const uuid_t *uuid)
> +{
> + struct dax_tag_collect c = {
> + .dax_region = dax_region,
> + .uuid = uuid,
> + };
> + unsigned int i;
> + ssize_t rc;
> +
> + /* Two-pass: count, then collect into a sized array. */
> + device_for_each_child(dax_region->dev, &c, count_uuid_extent);
> + if (!c.count)
> + return -ENOENT;
> +
> + c.arr = kmalloc_array(c.count, sizeof(*c.arr), GFP_KERNEL);
> + if (!c.arr)
> + return -ENOMEM;
> + c.cap = c.count;
> + c.count = 0;
> +
> + rc = device_for_each_child(dax_region->dev, &c, collect_uuid_extent);
> + if (rc)
> + goto out;
> +
> + sort(c.arr, c.count, sizeof(*c.arr), dax_resource_seq_cmp, NULL);
> +
> + /*
> + * Tagged groups carry a dense 1..n @seq_num regardless of source
> + * (sharable: device-stamped; non-sharable: host-assigned in
> + * arrival order — see &struct dax_resource). A gap or
> + * out-of-range value here means an extent went missing on the
> + * cxl side (e.g. a per-extent failure in cxl_add_pending) or a
> + * cxl-side validation gap; in either case refuse the whole
> + * group rather than carve a partial allocation.
> + */
> + for (i = 0; i < c.count; i++) {
> + if (c.arr[i]->seq_num != i + 1) {
> + dev_WARN_ONCE(dax_region->dev, 1,
> + "tag %pUb seq invariant violated at slot %u (got %u)\n",
> + uuid, i, c.arr[i]->seq_num);
> + rc = -EINVAL;
> + goto out;
> + }
> + }
> +
> + for (i = 0; i < c.count; i++) {
> + resource_size_t to_alloc = resource_size(c.arr[i]->res);
> + ssize_t alloc;
> +
> + alloc = __dev_dax_resize(c.arr[i]->res, dev_dax, to_alloc,
> + c.arr[i]);
> + if (alloc < 0) {
> + rc = alloc;
> + goto rollback;
> + }
> + if (alloc == 0) {
> + rc = -ENOSPC;
> + goto rollback;
> + }
> + c.arr[i]->use_cnt++;
> + }
> + rc = 0;
> + goto out;
> +
> +rollback:
> + /*
> + * Partial failure: trim every range we added in this attempt.
> + * trim_dev_dax_range pops the most-recently-appended range from
> + * dev_dax->ranges[] and decrements its dax_resource->use_cnt, so
> + * looping until we have undone @i additions restores both
> + * dev_dax->ranges[] and the matched dax_resources' use_cnt.
> + */
> + while (i-- > 0)
> + trim_dev_dax_range(dev_dax);
> +out:
> + kfree(c.arr);
> + return rc;
> }
>
> static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> const char *buf, size_t len)
> {
> - return -EOPNOTSUPP;
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct dax_region *dax_region = dev_dax->region;
> + uuid_t uuid;
> + ssize_t rc;
> +
> + if (!is_dynamic(dax_region))
> + return -EOPNOTSUPP;
> +
> + if (sysfs_streq(buf, "0"))
> + uuid_copy(&uuid, &uuid_null);
> + else {
> + rc = uuid_parse(buf, &uuid);
> + if (rc)
> + return rc;
> + }
> +
> + rc = down_write_killable(&dax_region_rwsem);
> + if (rc)
> + return rc;
> + if (!dax_region->dev->driver) {
> + rc = -ENXIO;
> + goto err_region;
> + }
> + rc = down_write_killable(&dax_dev_rwsem);
Same comment about ACQUIRE() here.
> + if (rc)
> + goto err_region;
> +
Does this need to check whether the device is already claimed before proceeding with the claim? What happens if the uuid is written twice to this sysfs file?
DJ
> + if (uuid_is_null(&uuid))
> + rc = uuid_claim_untagged(dax_region, dev_dax);
> + else
> + rc = uuid_claim_tagged(dax_region, dev_dax, &uuid);
> +
> + up_write(&dax_dev_rwsem);
> +err_region:
> + up_write(&dax_region_rwsem);
> +
> + return rc < 0 ? rc : len;
> }
> static DEVICE_ATTR_RW(uuid);
>
> @@ -1614,8 +1862,12 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> return 0;
> if (a == &dev_attr_mapping.attr && is_dynamic(dax_region))
> return 0;
> - if ((a == &dev_attr_align.attr ||
> - a == &dev_attr_size.attr) && is_static(dax_region))
> + if (a == &dev_attr_uuid.attr && !is_dynamic(dax_region))
> + return 0444;
> + if (a == &dev_attr_align.attr &&
> + (is_static(dax_region) || is_dynamic(dax_region)))
> + return 0444;
> + if (a == &dev_attr_size.attr && is_static(dax_region))
> return 0444;
> return a->mode;
> }
* [PATCH v10 27/31] cxl/region: Read existing extents on region creation
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (25 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 26/31] dax/bus: Tag-aware uuid claim and show on DC dax devices Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 21:30 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
` (5 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni
From: Ira Weiny <ira.weiny@intel.com>
Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash. In this case it is expected
that the creation of a new region on top of a DC partition can read
those extents and surface them for continued use.
Once all endpoint decoders are part of a region and the region is being
realized, a read of the device's extent list can reveal these
previously accepted extents.
CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
this purpose. The call returns all the extents for all dynamic capacity
partitions. If the fabric manager is adding extents to any DCD
partition, the extent list for the recovered region may change. In this
case the query must retry. Upon retry the query could encounter extents
which were accepted on a previous list query. Adding such extents is
ignored without error because they are entirely within a previously
accepted extent. Instead warn on this case to allow for differentiating
bad devices from this normal condition.
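The retry rule described above can be modelled in plain C. This is a
simplified sketch under stated assumptions: struct fake_dev, read_page(),
read_all() and read_with_retry() are hypothetical, the "device" is an
in-memory struct, and a concurrent change is simulated by bumping the
generation number partway through a read. Only the shape (snapshot the
generation on the first page, return -EAGAIN on a mismatch, retry a bounded
number of times) follows the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_EXTENTS 4	/* hypothetical: extents returned per query */
#define MAX_RETRY 10

/* Hypothetical device model: a count, a generation, and scripted changes. */
struct fake_dev {
	unsigned int total;		/* extents currently on the device */
	unsigned int gen;		/* bumped whenever the list changes */
	unsigned int changes_left;	/* list changes still to inject */
};

/* One paged query starting at @index; reports the count and generation seen. */
static unsigned int read_page(struct fake_dev *d, unsigned int index,
			      unsigned int *total, unsigned int *gen)
{
	unsigned int left;

	/* Inject a concurrent change on any query past the first page. */
	if (index > 0 && d->changes_left) {
		d->changes_left--;
		d->gen++;
	}
	*total = d->total;
	*gen = d->gen;
	left = index < d->total ? d->total - index : 0;
	return left < PAGE_EXTENTS ? left : PAGE_EXTENTS;
}

/* Read the whole list once; -EAGAIN if it changed mid-read. */
static int read_all(struct fake_dev *d, unsigned int *nr_read)
{
	unsigned int index = 0, first_total = 0, first_gen = 0;
	int first = 1;

	assert(d != NULL && nr_read != NULL);
	do {
		unsigned int total, gen, n;

		n = read_page(d, index, &total, &gen);
		if (first) {
			/* Snapshot what the first page reported. */
			first_total = total;
			first_gen = gen;
			first = 0;
		}
		if (gen != first_gen || total != first_total)
			return -EAGAIN;
		index += n;
	} while (index < first_total);

	*nr_read = index;
	return 0;
}

/* Retry a changed read up to MAX_RETRY more times. */
static int read_with_retry(struct fake_dev *d, unsigned int *nr_read)
{
	int retry = MAX_RETRY;
	int rc;

	do {
		rc = read_all(d, nr_read);
	} while (rc == -EAGAIN && retry--);

	return rc;
}
```

Note that with the post-decrement in the loop condition, a list that keeps
changing is read up to 11 times in total (one initial attempt plus 10
retries) before -EAGAIN is returned to the caller.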
Latch any errors to be bubbled up to ensure notification to the user
even if individual errors are rate limited or otherwise ignored.
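The latching behaviour can be shown with a small standalone example.
process_one() and process_all() are hypothetical; the point is only that a
failure does not stop the loop, and the most recent failure is what the
caller gets back, as in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-item step: negative inputs fail with -EINVAL. */
static int process_one(int item, int *done)
{
	if (item < 0)
		return -EINVAL;
	(*done)++;
	return 0;
}

/*
 * Process every item even after a failure and return the most recent
 * error, or 0 if all succeeded. @done counts the successful items.
 */
static int process_all(const int *items, size_t n, int *done)
{
	int latched_rc = 0;
	size_t i;

	assert(items != NULL && done != NULL);
	*done = 0;
	for (i = 0; i < n; i++) {
		int rc = process_one(items[i], done);

		if (rc)
			latched_rc = rc;	/* remember, but keep going */
	}
	return latched_rc;
}
```

With the input {1, -1, 3} the second item fails, the third is still
processed, and the caller sees -EINVAL.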
The scan for existing extents races with the dax_cxl driver. This is
synchronized through the region device lock. Extents which are found
after the driver has loaded will surface through the normal notification
path while extents seen prior to the driver are read during driver load.
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/core/core.h | 1 +
drivers/cxl/core/mbox.c | 116 ++++++++++++++++++++++++++++++++++
drivers/cxl/core/region_dax.c | 27 ++++++++
drivers/cxl/cxlmem.h | 21 ++++++
4 files changed, 165 insertions(+)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index c28e357c5817..f5b05de5ed83 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -28,6 +28,7 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
#ifdef CONFIG_CXL_REGION
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 8071c1ed1b36..486110e1c03d 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -2083,6 +2083,122 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
}
EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
+/* Return -EAGAIN if the extent list changes while reading */
+static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+ u32 current_index, total_read, total_expected, initial_gen_num;
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_mbox_cmd mbox_cmd;
+ u32 max_extent_count;
+ int latched_rc = 0;
+ bool first = true;
+
+ struct cxl_mbox_get_extent_out *extents __free(kvfree) =
+ kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+ if (!extents)
+ return -ENOMEM;
+
+ total_read = 0;
+ current_index = 0;
+ total_expected = 0;
+ max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
+ sizeof(struct cxl_extent);
+ do {
+ u32 nr_returned, current_total, current_gen_num;
+ struct cxl_mbox_get_extent_in get_extent;
+ int rc;
+
+ get_extent = (struct cxl_mbox_get_extent_in) {
+ .extent_cnt = cpu_to_le32(max(max_extent_count,
+ total_expected - current_index)),
+ .start_extent_index = cpu_to_le32(current_index),
+ };
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+ .payload_in = &get_extent,
+ .size_in = sizeof(get_extent),
+ .size_out = cxl_mbox->payload_size,
+ .payload_out = extents,
+ .min_out = 1,
+ };
+
+ rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ /* Save initial data */
+ if (first) {
+ total_expected = le32_to_cpu(extents->total_extent_count);
+ initial_gen_num = le32_to_cpu(extents->generation_num);
+ first = false;
+ }
+
+ nr_returned = le32_to_cpu(extents->returned_extent_count);
+ total_read += nr_returned;
+ current_total = le32_to_cpu(extents->total_extent_count);
+ current_gen_num = le32_to_cpu(extents->generation_num);
+
+ dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
+ current_index, total_read - 1, current_total, current_gen_num);
+
+ if (current_gen_num != initial_gen_num || total_expected != current_total) {
+ dev_warn(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
+ current_gen_num, initial_gen_num,
+ total_expected, current_total);
+ return -EAGAIN;
+ }
+
+ for (int i = 0; i < nr_returned ; i++) {
+ struct cxl_extent *extent = &extents->extent[i];
+
+ dev_dbg(dev, "Processing extent %d/%d\n",
+ current_index + i, total_expected);
+
+ rc = add_to_pending_list(&mds->add_ctx.pending_extents,
+ extent);
+ if (rc) {
+ latched_rc = rc;
+ }
+ }
+
+ current_index += nr_returned;
+ } while (total_expected > total_read);
+
+ if (!latched_rc && !list_empty(&mds->add_ctx.pending_extents)) {
+ latched_rc = cxl_add_pending(mds);
+ }
+ clear_pending_extents(mds);
+
+ return latched_rc;
+}
+
+#define CXL_READ_EXTENT_LIST_RETRY 10
+
+/**
+ * cxl_process_extent_list() - Read existing extents
+ * @cxled: Endpoint decoder which is part of a region
+ *
+ * Issue the Get Dynamic Capacity Extent List command to the device
+ * and add existing extents if found.
+ *
+ * A retry of 10 is somewhat arbitrary, however, extent changes should be
+ * relatively rare while bringing up a region. So 10 should be plenty.
+ */
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+ int retry = CXL_READ_EXTENT_LIST_RETRY;
+ int rc;
+
+ do {
+ rc = __cxl_process_extent_list(cxled);
+ } while (rc == -EAGAIN && retry--);
+
+ return rc;
+}
+
static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
{
int i = info->nr_partitions;
diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
index 519e203c486a..e7a812e8b2e7 100644
--- a/drivers/cxl/core/region_dax.c
+++ b/drivers/cxl/core/region_dax.c
@@ -82,6 +82,26 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
device_unregister(&cxlr_dax->dev);
}
+static int cxlr_add_existing_extents(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int i, latched_rc = 0;
+
+ for (i = 0; i < p->nr_targets; i++) {
+ struct device *dev = &p->targets[i]->cxld.dev;
+ int rc;
+
+ rc = cxl_process_extent_list(p->targets[i]);
+ if (rc) {
+ dev_err(dev, "Existing extent processing failed %d\n",
+ rc);
+ latched_rc = rc;
+ }
+ }
+
+ return latched_rc;
+}
+
int devm_cxl_add_dax_region(struct cxl_region *cxlr)
{
struct device *dev;
@@ -110,6 +130,13 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
dev_name(dev));
+ if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A) {
+ rc = cxlr_add_existing_extents(cxlr);
+ if (rc)
+ dev_err(&cxlr->dev,
+ "Existing extent processing failed %d\n", rc);
+ }
+
return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
no_free_ptr(cxlr_dax));
}
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index d992cc9b7811..1ad3dc7e413c 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -564,6 +564,27 @@ struct cxl_mbox_dc_response {
} __packed extent_list[] __counted_by(extent_list_size);
} __packed;
+/*
+ * Get Dynamic Capacity Extent List; Input Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
+ */
+struct cxl_mbox_get_extent_in {
+ __le32 extent_cnt;
+ __le32 start_extent_index;
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Output Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
+ */
+struct cxl_mbox_get_extent_out {
+ __le32 returned_extent_count;
+ __le32 total_extent_count;
+ __le32 generation_num;
+ u8 rsvd[4];
+ struct cxl_extent extent[];
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
--
2.43.0
* Re: [PATCH v10 27/31] cxl/region: Read existing extents on region creation
2026-05-23 9:43 ` [PATCH v10 27/31] cxl/region: Read existing extents on region creation Anisa Su
@ 2026-05-29 21:30 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 21:30 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case it is expected
> that the creation of a new region on top of a DC partition can read
> those extents and surface them for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized, a read of the 'devices extent list' can reveal these
> previously accepted extents.
>
> CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
> this purpose. The call returns all the extents for all dynamic capacity
> partitions. If the fabric manager is adding extents to any DCD
> partition, the extent list for the recovered region may change. In this
> case the query must retry. Upon retry the query could encounter extents
> which were accepted on a previous list query. Adding such extents is
> ignored without error because they are entirely within a previous
> accepted extent. Instead warn on this case to allow for differentiating
> bad devices from this normal condition.
>
> Latch any errors to be bubbled up to ensure notification to the user
> even if individual errors are rate limited or otherwise ignored.
>
> The scan for existing extents races with the dax_cxl driver. This is
> synchronized through the region device lock. Extents which are found
> after the driver has loaded will surface through the normal notification
> path while extents seen prior to the driver are read during driver load.
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
> drivers/cxl/core/core.h | 1 +
> drivers/cxl/core/mbox.c | 116 ++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region_dax.c | 27 ++++++++
> drivers/cxl/cxlmem.h | 21 ++++++
> 4 files changed, 165 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index c28e357c5817..f5b05de5ed83 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -28,6 +28,7 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
> int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
>
> #ifdef CONFIG_CXL_REGION
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 8071c1ed1b36..486110e1c03d 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -2083,6 +2083,122 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
>
> +/* Return -EAGAIN if the extent list changes while reading */
> +static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> + u32 current_index, total_read, total_expected, initial_gen_num;
initial_gen_num should be initialized to something invalid?
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> + u32 max_extent_count;
> + int latched_rc = 0;
> + bool first = true;
> +
> + struct cxl_mbox_get_extent_out *extents __free(kvfree) =
> + kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
> + if (!extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + current_index = 0;
> + total_expected = 0;
> + max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
> + sizeof(struct cxl_extent);
> + do {
> + u32 nr_returned, current_total, current_gen_num;
> + struct cxl_mbox_get_extent_in get_extent;
> + int rc;
> +
> + get_extent = (struct cxl_mbox_get_extent_in) {
> + .extent_cnt = cpu_to_le32(max(max_extent_count,
> + total_expected - current_index)),
Should be min() here, right?
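A minimal standalone sketch of the clamping in question, not the driver code. The helper name extents_to_request() and its first_pass flag are hypothetical. It assumes the intent is to never request more extents than fit in one mailbox payload. In the quoted loop total_expected is still 0 before the first response arrives, so a bare min() would request zero extents on the first pass; the sketch handles that case explicitly.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of extents to request in one mailbox call.
 * max_per_payload is how many extent entries fit in one output payload.
 */
static uint32_t extents_to_request(uint32_t max_per_payload,
				   uint32_t total_expected,
				   uint32_t current_index,
				   bool first_pass)
{
	uint32_t remaining;

	/* The total is unknown until the first response has been read. */
	if (first_pass)
		return max_per_payload;

	remaining = total_expected - current_index;

	/* min(): never ask for more than one payload can carry. */
	return remaining < max_per_payload ? remaining : max_per_payload;
}
```

Whether a zero-count first request would be acceptable depends on device behavior, so treat the first_pass branch as one possible choice rather than a requirement.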
> + .start_extent_index = cpu_to_le32(current_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_extent,
> + .size_in = sizeof(get_extent),
> + .size_out = cxl_mbox->payload_size,
> + .payload_out = extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + /* Save initial data */
> + if (first) {
> + total_expected = le32_to_cpu(extents->total_extent_count);
> + initial_gen_num = le32_to_cpu(extents->generation_num);
> + first = false;
> + }
> +
> + nr_returned = le32_to_cpu(extents->returned_extent_count);
> + total_read += nr_returned;
> + current_total = le32_to_cpu(extents->total_extent_count);
> + current_gen_num = le32_to_cpu(extents->generation_num);
> +
> + dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
> + current_index, total_read - 1, current_total, current_gen_num);
> +
> + if (current_gen_num != initial_gen_num || total_expected != current_total) {
> + dev_warn(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
> + current_gen_num, initial_gen_num,
> + total_expected, current_total);
> + return -EAGAIN;
> + }
> +
> + for (int i = 0; i < nr_returned ; i++) {
> + struct cxl_extent *extent = &extents->extent[i];
> +
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + current_index + i, total_expected);
> +
Should probably hold the mds->add_ctx->lock before calling add_to_pending_list()? handle_add_event() holds the lock before calling. Maybe also add a lock assert in add_to_pending_list().
> + rc = add_to_pending_list(&mds->add_ctx.pending_extents,
> + extent);
> + if (rc) {
> + latched_rc = rc;
Is the intention here to report the last found error and not the first error?
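To make the first-versus-last distinction concrete, here is a small standalone sketch. The function names are hypothetical and the arrays stand in for the per-extent return codes; it is not taken from the patch.

```c
#include <stddef.h>

/* Report the first non-zero return code seen in rcs[0..n). */
static int latch_first_error(const int *rcs, size_t n)
{
	int latched = 0;

	for (size_t i = 0; i < n; i++)
		if (rcs[i] && !latched)
			latched = rcs[i];

	return latched;
}

/* Report the last non-zero return code, which is what the quoted loop does. */
static int latch_last_error(const int *rcs, size_t n)
{
	int latched = 0;

	for (size_t i = 0; i < n; i++)
		if (rcs[i])
			latched = rcs[i];

	return latched;
}
```

For a sequence such as { 0, -12, 0, -5 } the first variant reports -12 and the second reports -5, so the choice changes what the caller sees when several extents fail for different reasons.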
> + }
{} not needed if single line
> + }
> +
> + current_index += nr_returned;
> + } while (total_expected > total_read);
> +
> + if (!latched_rc && !list_empty(&mds->add_ctx.pending_extents)) {
> + latched_rc = cxl_add_pending(mds);
> + }
> + clear_pending_extents(mds);
> +
> + return latched_rc;
> +}
> +
> +#define CXL_READ_EXTENT_LIST_RETRY 10
> +
> +/**
> + * cxl_process_extent_list() - Read existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add existing extents if found.
> + *
> + * A retry of 10 is somewhat arbitrary, however, extent changes should be
> + * relatively rare while bringing up a region. So 10 should be plenty.
> + */
> +int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> + int retry = CXL_READ_EXTENT_LIST_RETRY;
> + int rc;
> +
> + do {
> + rc = __cxl_process_extent_list(cxled);
> + } while (rc == -EAGAIN && retry--);
I think this calls __cxl_process_extent_list() 11 times here: the initial attempt plus 10 retries.
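The count can be checked with a standalone sketch that mimics the loop shape above with a call that always fails with -EAGAIN. The function names are hypothetical; only the loop conditions mirror the quoted code.

```c
#include <errno.h>

/* Same loop shape as the quoted code: post-decrement in the condition. */
static int attempts_post_decrement(int budget)
{
	int retry = budget, attempts = 0, rc;

	do {
		attempts++;
		rc = -EAGAIN;	/* pretend the list changed every time */
	} while (rc == -EAGAIN && retry--);

	return attempts;
}

/* Alternative: pre-decrement, assuming budget >= 1. */
static int attempts_pre_decrement(int budget)
{
	int retry = budget, attempts = 0, rc;

	do {
		attempts++;
		rc = -EAGAIN;
	} while (rc == -EAGAIN && --retry);

	return attempts;
}
```

With a budget of 10 the post-decrement form makes 11 calls in total, while the pre-decrement form makes 10. Which one is wanted depends on whether the constant is meant to count retries or total attempts.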
> +
> + return rc;
> +}
> +
> static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum cxl_partition_mode mode)
> {
> int i = info->nr_partitions;
> diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
> index 519e203c486a..e7a812e8b2e7 100644
> --- a/drivers/cxl/core/region_dax.c
> +++ b/drivers/cxl/core/region_dax.c
> @@ -82,6 +82,26 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
> device_unregister(&cxlr_dax->dev);
> }
>
> +static int cxlr_add_existing_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i, latched_rc = 0;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + struct device *dev = &p->targets[i]->cxld.dev;
> + int rc;
> +
> + rc = cxl_process_extent_list(p->targets[i]);
> + if (rc) {
> + dev_err(dev, "Existing extent processing failed %d\n",
> + rc);
> + latched_rc = rc;
> + }
> + }
> +
> + return latched_rc;
> +}
> +
> int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> {
> struct device *dev;
> @@ -110,6 +130,13 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A) {
> + rc = cxlr_add_existing_extents(cxlr);
cxlr_add_existing_extents() -> cxl_process_extent_list() -> __cxl_process_extent_list() -> cxl_add_pending() -> CXL_MBOX_OP_ADD_DC_RESPONSE sent.
CXL r4.0 8.2.10.9.9.3:
Device shall report Invalid Physical Address if:
One or more extents in the updated extent list specify a DPA range that has already been added with a previous call to the Add Dynamic Capacity Response.
Haven't the existing extents already been added and acknowledged with ADD_DC_RESPONSE? For the add-existing-extents path it seems no response needs to be sent to the device, so that step can be skipped. Otherwise the software will receive an error from the device when sending ADD_DC_RESPONSE.
It would be good to get this tested on hw.
> + if (rc)
> + dev_err(&cxlr->dev,
> + "Existing extent processing failed %d\n", rc);
No return on error?
DJ
> + }
> +
> return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> no_free_ptr(cxlr_dax));
> }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index d992cc9b7811..1ad3dc7e413c 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -564,6 +564,27 @@ struct cxl_mbox_dc_response {
> } __packed extent_list[] __counted_by(extent_list_size);
> } __packed;
>
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_extent_in {
> + __le32 extent_cnt;
> + __le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_extent_out {
> + __le32 returned_extent_count;
> + __le32 total_extent_count;
> + __le32 generation_num;
> + u8 rsvd[4];
> + struct cxl_extent extent[];
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
* [PATCH v10 28/31] cxl/mem: Trace Dynamic capacity Event Record
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (26 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 27/31] cxl/region: Read existing extents on region creation Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 22:41 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
` (4 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni
From: Ira Weiny <ira.weiny@intel.com>
CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
User space can use trace events for debugging of DC capacity changes.
Add DC trace points to the trace log.
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/core/mbox.c | 5 ++++
drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 486110e1c03d..271f4556db85 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1030,6 +1030,11 @@ static void __cxl_event_trace_record(struct cxl_memdev *cxlmd,
ev_type = CXL_CPER_EVENT_MEM_MODULE;
else if (uuid_equal(uuid, &CXL_EVENT_MEM_SPARING_UUID))
ev_type = CXL_CPER_EVENT_MEM_SPARING;
+ else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
+/* FIXME still valid? */
+ trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
+ return;
+ }
cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a972e4ef1936..421e492d1b3f 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -1099,6 +1099,71 @@ TRACE_EVENT(cxl_poison,
)
);
+/*
+ * Dynamic Capacity Event Record - DER
+ *
+ * CXL rev 3.1 section 8.2.9.2.1.6 Table 8-50
+ */
+
+#define CXL_DC_ADD_CAPACITY 0x00
+#define CXL_DC_REL_CAPACITY 0x01
+#define CXL_DC_FORCED_REL_CAPACITY 0x02
+#define CXL_DC_REG_CONF_UPDATED 0x03
+#define show_dc_evt_type(type) __print_symbolic(type, \
+ { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
+ { CXL_DC_REL_CAPACITY, "Release capacity"}, \
+ { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
+ { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+ TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+ struct cxl_event_dcd *rec),
+
+ TP_ARGS(cxlmd, log, rec),
+
+ TP_STRUCT__entry(
+ CXL_EVT_TP_entry
+
+ /* Dynamic capacity Event */
+ __field(u8, event_type)
+ __field(u16, hostid)
+ __field(u8, partition_id)
+ __field(u64, dpa_start)
+ __field(u64, length)
+ __array(u8, uuid, UUID_SIZE)
+ __field(u16, sh_extent_seq)
+ ),
+
+ TP_fast_assign(
+ CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+ /* Dynamic_capacity Event */
+ __entry->event_type = rec->event_type;
+
+ /* DCD event record data */
+ __entry->hostid = le16_to_cpu(rec->host_id);
+ __entry->partition_id = rec->partition_index;
+ __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
+ __entry->length = le64_to_cpu(rec->extent.length);
+ memcpy(__entry->uuid, &rec->extent.uuid, UUID_SIZE);
+ __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
+ ),
+
+ CXL_EVT_TP_printk("event_type='%s' host_id='%d' partition_id='%d' " \
+ "starting_dpa=%llx length=%llx tag=%pU " \
+ "shared_extent_sequence=%d",
+ show_dc_evt_type(__entry->event_type),
+ __entry->hostid,
+ __entry->partition_id,
+ __entry->dpa_start,
+ __entry->length,
+ __entry->uuid,
+ __entry->sh_extent_seq
+ )
+);
+
#endif /* _CXL_EVENTS_H */
#define TRACE_INCLUDE_FILE trace
--
2.43.0
* Re: [PATCH v10 28/31] cxl/mem: Trace Dynamic capacity Event Record
2026-05-23 9:43 ` [PATCH v10 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
@ 2026-05-29 22:41 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 22:41 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny, Jonathan Cameron, Fan Ni
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> User space can use trace events for debugging of DC capacity changes.
>
> Add DC trace points to the trace log.
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
> drivers/cxl/core/mbox.c | 5 ++++
> drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 70 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 486110e1c03d..271f4556db85 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1030,6 +1030,11 @@ static void __cxl_event_trace_record(struct cxl_memdev *cxlmd,
> ev_type = CXL_CPER_EVENT_MEM_MODULE;
> else if (uuid_equal(uuid, &CXL_EVENT_MEM_SPARING_UUID))
> ev_type = CXL_CPER_EVENT_MEM_SPARING;
> + else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
> +/* FIXME still valid? */
Address or delete?
> + trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
> + return;
> + }
>
> cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
> }
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a972e4ef1936..421e492d1b3f 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -1099,6 +1099,71 @@ TRACE_EVENT(cxl_poison,
> )
> );
>
> +/*
> + * Dynamic Capacity Event Record - DER
> + *
> + * CXL rev 3.1 section 8.2.9.2.1.6 Table 8-50
Let's move it to 4.0
> + */
> +
> +#define CXL_DC_ADD_CAPACITY 0x00
> +#define CXL_DC_REL_CAPACITY 0x01
> +#define CXL_DC_FORCED_REL_CAPACITY 0x02
> +#define CXL_DC_REG_CONF_UPDATED 0x03
> +#define show_dc_evt_type(type) __print_symbolic(type, \
> + { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
> + { CXL_DC_REL_CAPACITY, "Release capacity"}, \
> + { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
> + { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> + TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> + struct cxl_event_dcd *rec),
> +
> + TP_ARGS(cxlmd, log, rec),
> +
> + TP_STRUCT__entry(
> + CXL_EVT_TP_entry
> +
> + /* Dynamic capacity Event */
> + __field(u8, event_type)
> + __field(u16, hostid)
> + __field(u8, partition_id)
> + __field(u64, dpa_start)
> + __field(u64, length)
> + __array(u8, uuid, UUID_SIZE)
> + __field(u16, sh_extent_seq)
> + ),
> +
> + TP_fast_assign(
> + CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> + /* Dynamic_capacity Event */
> + __entry->event_type = rec->event_type;
> +
> + /* DCD event record data */
> + __entry->hostid = le16_to_cpu(rec->host_id);
> + __entry->partition_id = rec->partition_index;
CXL r4.0 8.2.10.2.1.6 Table 8-229
A couple of issues:
1. This is not partition_index, it's updated_region_index.
2. It's only valid for events of type Region Configuration Updated. Otherwise we may be displaying garbage or 0.
So it needs a rename and also a check for validity. Better to fix it before rasdaemon starts picking it up.
DJ
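One way the validity check could look, as a standalone sketch rather than a proposed patch. The helper name and the 0xff "not valid" sentinel are assumptions for illustration; only the event type value 0x03 comes from the defines quoted above.

```c
#include <stdint.h>

/* Matches CXL_DC_REG_CONF_UPDATED in the quoted defines. */
#define DC_EVT_REGION_CONF_UPDATED	0x03
/* Hypothetical sentinel meaning "field not valid for this event type". */
#define DC_REGION_INDEX_INVALID		0xff

/*
 * Only pass the raw index through when the event type says it is
 * meaningful; otherwise record the sentinel instead of garbage or 0.
 */
static uint8_t traced_region_index(uint8_t event_type, uint8_t raw_index)
{
	if (event_type == DC_EVT_REGION_CONF_UPDATED)
		return raw_index;

	return DC_REGION_INDEX_INVALID;
}
```

An alternative would be to omit the field from the printed output for other event types; the sentinel approach just keeps the trace record layout fixed.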
> + __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
> + __entry->length = le64_to_cpu(rec->extent.length);
> + memcpy(__entry->uuid, &rec->extent.uuid, UUID_SIZE);
> + __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
> + ),
> +
> + CXL_EVT_TP_printk("event_type='%s' host_id='%d' partition_id='%d' " \
> + "starting_dpa=%llx length=%llx tag=%pU " \
> + "shared_extent_sequence=%d",
> + show_dc_evt_type(__entry->event_type),
> + __entry->hostid,
> + __entry->partition_id,
> + __entry->dpa_start,
> + __entry->length,
> + __entry->uuid,
> + __entry->sh_extent_seq
> + )
> +);
> +
> #endif /* _CXL_EVENTS_H */
>
> #define TRACE_INCLUDE_FILE trace
* [PATCH v10 29/31] tools/testing/cxl: Make event logs dynamic
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (27 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 28/31] cxl/mem: Trace Dynamic capacity Event Record Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 22:58 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
` (3 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Jonathan Cameron
From: Ira Weiny <ira.weiny@intel.com>
The event logs test was created as static arrays as an easy way to mock
events. Dynamic Capacity Device (DCD) test support requires events be
generated dynamically when extents are created or destroyed.
The current event log test has specific checks for the number of events
seen including log overflow.
Modify mock event logs to be dynamically allocated. Adjust array size
and mock event entry data to match the output expected by the existing
event test.
Use the static event data to create the dynamic events in the new logs
without inventing complex event injection for the previous tests.
Simplify log processing by using the event log array index as the
handle. Add a lock to manage concurrency required when user space is
allowed to control DCD extents.
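The handle-as-index scheme can be sketched as a small standalone ring, shown here only to illustrate the wrap-around and overflow behavior. The struct and function names are simplified stand-ins, and the sketch stores no event payloads; the constants mirror the ones in the diff below.

```c
#include <stdint.h>

#define EVENT_CNT_MAX		16
/* One extra slot because handle 0 is never used. */
#define EVENT_ARRAY_SIZE	(EVENT_CNT_MAX + 1)

/* Advance a handle through 1..EVENT_CNT_MAX, skipping 0 on wrap. */
static uint16_t inc_handle(uint16_t handle)
{
	handle = (handle + 1) % EVENT_ARRAY_SIZE;
	if (handle == 0)
		handle = 1;
	return handle;
}

struct ring {
	uint16_t current;	/* next handle to hand out on a read */
	uint16_t last;		/* next handle to store into */
	uint16_t nr_overflow;	/* adds rejected because the ring was full */
};

/* Returns 1 if the event was stored, 0 if it was counted as overflow. */
static int ring_add(struct ring *r)
{
	if (inc_handle(r->last) == r->current) {
		r->nr_overflow++;
		return 0;
	}
	r->last = inc_handle(r->last);
	return 1;
}

/* Try to add 'attempts' events; return how many were stored. */
static int fill_ring(struct ring *r, int attempts)
{
	int stored = 0;

	for (int i = 0; i < attempts; i++)
		stored += ring_add(r);

	return stored;
}
```

Starting from current = last = 1, this full check keeps one slot free to tell a full ring from an empty one, so the 16 usable handles hold at most 15 events before further adds are counted as overflow.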
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
tools/testing/cxl/test/mem.c | 265 +++++++++++++++++++++--------------
1 file changed, 161 insertions(+), 104 deletions(-)
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 271c7ad8cc32..fe1dadddd18e 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -142,18 +142,26 @@ static struct {
#define PASS_TRY_LIMIT 3
-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 16
+/* 1 extra slot to accommodate that handles can't be 0 */
+#define CXL_TEST_EVENT_ARRAY_SIZE (CXL_TEST_EVENT_CNT_MAX + 1)
/* Set a number of events to return at a time for simulation. */
#define CXL_TEST_EVENT_RET_MAX 4
+/*
+ * @last_handle: last handle (index) to have an entry stored
+ * @current_handle: current handle (index) to be returned to the user on get_event
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
struct mock_event_log {
- u16 clear_idx;
- u16 cur_idx;
- u16 nr_events;
+ u16 last_handle;
+ u16 current_handle;
u16 nr_overflow;
- u16 overflow_reset;
- struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+ rwlock_t lock;
+ struct cxl_event_record_raw *events[CXL_TEST_EVENT_ARRAY_SIZE];
};
struct mock_event_store {
@@ -194,56 +202,65 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
return &mdata->mes.mock_logs[log_type];
}
-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
- return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
- log->cur_idx = 0;
- log->clear_idx = 0;
- log->nr_overflow = log->overflow_reset;
-}
-
/* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
+static u16 event_inc_handle(u16 handle)
{
- return log->clear_idx + 1;
+ handle = (handle + 1) % CXL_TEST_EVENT_ARRAY_SIZE;
+ if (handle == 0)
+ handle = 1;
+ return handle;
}
-/* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
-{
- u16 cur_handle = log->cur_idx + 1;
-
- return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
- return log->cur_idx == log->nr_events;
-}
-
-static void mes_add_event(struct mock_event_store *mes,
+/* Add the event or free it on overflow */
+static void mes_add_event(struct cxl_mockmem_data *mdata,
enum cxl_event_log_type log_type,
struct cxl_event_record_raw *event)
{
+ struct device *dev = mdata->mds->cxlds.dev;
struct mock_event_log *log;
if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
return;
- log = &mes->mock_logs[log_type];
+ log = &mdata->mes.mock_logs[log_type];
+
+ guard(write_lock)(&log->lock);
- if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+ dev_dbg(dev, "Add log %d cur %d last %d\n",
+ log_type, log->current_handle, log->last_handle);
+
+ /* Check next buffer */
+ if (event_inc_handle(log->last_handle) == log->current_handle) {
log->nr_overflow++;
- log->overflow_reset = log->nr_overflow;
+ dev_dbg(dev, "Overflowing log %d nr %d\n",
+ log_type, log->nr_overflow);
+ devm_kfree(dev, event);
return;
}
- log->events[log->nr_events] = event;
- log->nr_events++;
+ dev_dbg(dev, "Log %d; handle %u\n", log_type, log->last_handle);
+ event->event.generic.hdr.handle = cpu_to_le16(log->last_handle);
+ log->events[log->last_handle] = event;
+ log->last_handle = event_inc_handle(log->last_handle);
+}
+
+static void mes_del_event(struct device *dev,
+ struct mock_event_log *log,
+ u16 handle)
+{
+ struct cxl_event_record_raw *record;
+
+ lockdep_assert(lockdep_is_held(&log->lock));
+
+ dev_dbg(dev, "Clearing event %u; record %u\n",
+ handle, log->current_handle);
+ record = log->events[handle];
+ if (!record)
+ dev_err(dev, "Mock event index %u empty?\n", handle);
+
+ log->events[handle] = NULL;
+ log->current_handle = event_inc_handle(log->current_handle);
+ devm_kfree(dev, record);
}
/*
@@ -257,6 +274,7 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
struct cxl_get_event_payload *pl;
struct mock_event_log *log;
int ret_limit;
+ u16 handle;
u8 log_type;
int i;
@@ -276,22 +294,31 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
memset(cmd->payload_out, 0, struct_size(pl, records, 0));
log = event_find_log(dev, log_type);
- if (!log || event_log_empty(log))
+ if (!log)
return 0;
pl = cmd->payload_out;
- for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
- memcpy(&pl->records[i], event_get_current(log),
- sizeof(pl->records[i]));
- pl->records[i].event.generic.hdr.handle =
- event_get_cur_event_handle(log);
- log->cur_idx++;
+ guard(read_lock)(&log->lock);
+
+ handle = log->current_handle;
+ dev_dbg(dev, "Get log %d handle %u last %u\n",
+ log_type, handle, log->last_handle);
+ for (i = 0; i < ret_limit && handle != log->last_handle;
+ i++, handle = event_inc_handle(handle)) {
+ struct cxl_event_record_raw *cur;
+
+ cur = log->events[handle];
+ dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+ log_type, le16_to_cpu(cur->event.generic.hdr.handle),
+ handle);
+ memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
+ pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
}
cmd->size_out = struct_size(pl, records, i);
pl->record_count = cpu_to_le16(i);
- if (!event_log_empty(log))
+ if (handle != log->last_handle)
pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
if (log->nr_overflow) {
@@ -313,8 +340,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
- struct mock_event_log *log;
u8 log_type = pl->event_log;
+ struct mock_event_log *log;
u16 handle;
int nr;
@@ -325,23 +352,20 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
if (!log)
return 0; /* No mock data in this log */
- /*
- * This check is technically not invalid per the specification AFAICS.
- * (The host could 'guess' handles and clear them in order).
- * However, this is not good behavior for the host so test it.
- */
- if (log->clear_idx + pl->nr_recs > log->cur_idx) {
- dev_err(dev,
- "Attempting to clear more events than returned!\n");
- return -EINVAL;
- }
+ guard(write_lock)(&log->lock);
/* Check handle order prior to clearing events */
- for (nr = 0, handle = event_get_clear_handle(log);
- nr < pl->nr_recs;
- nr++, handle++) {
+ handle = log->current_handle;
+ for (nr = 0; nr < pl->nr_recs && handle != log->last_handle;
+ nr++, handle = event_inc_handle(handle)) {
+
+ dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+ log_type, handle,
+ le16_to_cpu(pl->handles[nr]));
+
if (handle != le16_to_cpu(pl->handles[nr])) {
- dev_err(dev, "Clearing events out of order\n");
+ dev_err(dev, "Clearing events out of order %u %u\n",
+ handle, le16_to_cpu(pl->handles[nr]));
return -EINVAL;
}
}
@@ -350,25 +374,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
log->nr_overflow = 0;
/* Clear events */
- log->clear_idx += pl->nr_recs;
- return 0;
-}
-
-static void cxl_mock_event_trigger(struct device *dev)
-{
- struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
- struct mock_event_store *mes = &mdata->mes;
- int i;
-
- for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
- struct mock_event_log *log;
+ for (nr = 0; nr < pl->nr_recs; nr++)
+ mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
+ dev_dbg(dev, "Delete log %d cur %d last %d\n",
+ log_type, log->current_handle, log->last_handle);
- log = event_find_log(dev, i);
- if (log)
- event_reset_log(log);
- }
-
- cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+ return 0;
}
struct cxl_event_record_raw maint_needed = {
@@ -509,8 +520,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
return 0;
}
-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static void add_event_from_static(struct cxl_mockmem_data *mdata,
+ enum cxl_event_log_type log_type,
+ struct cxl_event_record_raw *raw)
{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_event_record_raw *rec;
+
+ rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
+ if (!rec) {
+ dev_err(dev, "Failed to alloc event for log\n");
+ return;
+ }
+ mes_add_event(mdata, log_type, rec);
+}
+
+static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
+{
+ struct mock_event_store *mes = &mdata->mes;
+ struct device *dev = mdata->mds->cxlds.dev;
+
put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK |
CXL_GMER_VALID_COMPONENT | CXL_GMER_VALID_COMPONENT_ID_FORMAT,
&gen_media.rec.media_hdr.validity_flags);
@@ -523,43 +553,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
put_unaligned_le16(CXL_MMER_VALID_COMPONENT | CXL_MMER_VALID_COMPONENT_ID_FORMAT,
&mem_module.rec.validity_flags);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_INFO);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&mem_module);
mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FAIL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
+ (struct cxl_event_record_raw *)&mem_module);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&mem_module);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
/* Overflow this log */
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FATAL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
(struct cxl_event_record_raw *)&dram);
mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
}
+static void cxl_mock_event_trigger(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct mock_event_store *mes = &mdata->mes;
+
+ cxl_mock_add_event_logs(mdata);
+ cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1684,6 +1731,14 @@ static void cxl_mock_test_feat_init(struct cxl_mockmem_data *mdata)
mdata->test_feat.data = cpu_to_le32(0xdeadbeef);
}
+static void init_event_log(struct mock_event_log *log)
+{
+ rwlock_init(&log->lock);
+ /* Handle can never be 0 use 1 based indexing for handle */
+ log->current_handle = 1;
+ log->last_handle = 1;
+}
+
static int cxl_mock_mem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -1767,7 +1822,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
dev_dbg(dev, "No CXL Features discovered\n");
- cxl_mock_add_event_logs(&mdata->mes);
+ for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+ init_event_log(&mdata->mes.mock_logs[i]);
+ cxl_mock_add_event_logs(mdata);
cxlmd = devm_cxl_add_memdev(cxlds, NULL);
if (IS_ERR(cxlmd))
--
2.43.0
* Re: [PATCH v10 29/31] tools/testing/cxl: Make event logs dynamic
2026-05-23 9:43 ` [PATCH v10 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
@ 2026-05-29 22:58 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 22:58 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny, Jonathan Cameron
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> The event logs test was created as static arrays as an easy way to mock
> events. Dynamic Capacity Device (DCD) test support requires events be
> generated dynamically when extents are created or destroyed.
>
> The current event log test has specific checks for the number of events
> seen including log overflow.
>
> Modify mock event logs to be dynamically allocated. Adjust array size
> and mock event entry data to match the output expected by the existing
> event test.
>
> Use the static event data to create the dynamic events in the new logs
> without inventing complex event injection for the previous tests.
>
> Simplify log processing by using the event log array index as the
> handle. Add a lock to manage concurrency required when user space is
> allowed to control DCD extents
>
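The handle scheme described in the quoted commit message (array index doubles as the handle, handle 0 is never used) can be sketched in userspace as below. This is only an illustration under stated assumptions, not the patch's code: the sketch_* names and the int payload are hypothetical, and locking is omitted. It also makes visible that a 17-slot array with one slot reserved for handle 0 and one kept empty to tell "full" from "empty" stores at most 15 events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes mirroring CXL_TEST_EVENT_CNT_MAX / _ARRAY_SIZE. */
#define SKETCH_CNT_MAX    16
#define SKETCH_ARRAY_SIZE (SKETCH_CNT_MAX + 1)

struct sketch_log {
	uint16_t current_handle;	/* next handle returned to the reader */
	uint16_t last_handle;		/* next free slot for the writer */
	uint16_t nr_overflow;		/* events dropped because the log was full */
	int events[SKETCH_ARRAY_SIZE];	/* slot 0 is never used */
};

/* Advance a handle, wrapping around and skipping the invalid handle 0. */
static uint16_t sketch_inc_handle(uint16_t handle)
{
	handle = (uint16_t)((handle + 1) % SKETCH_ARRAY_SIZE);
	return handle ? handle : 1;
}

static void sketch_log_init(struct sketch_log *log)
{
	log->current_handle = 1;
	log->last_handle = 1;
	log->nr_overflow = 0;
}

/* Store an event; returns false and counts an overflow when the log is full. */
static bool sketch_log_add(struct sketch_log *log, int event)
{
	assert(log->last_handle != 0);

	if (sketch_inc_handle(log->last_handle) == log->current_handle) {
		log->nr_overflow++;
		return false;
	}
	log->events[log->last_handle] = event;
	log->last_handle = sketch_inc_handle(log->last_handle);
	return true;
}

/* Remove the oldest event; returns false when the log is empty. */
static bool sketch_log_pop(struct sketch_log *log, int *event)
{
	if (log->current_handle == log->last_handle)
		return false;
	*event = log->events[log->current_handle];
	log->current_handle = sketch_inc_handle(log->current_handle);
	return true;
}
```

In this sketch the log is empty when current_handle equals last_handle and full when advancing last_handle would reach current_handle, which is the same pair of tests the patch uses in mock_get_event() and mes_add_event().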
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
> tools/testing/cxl/test/mem.c | 265 +++++++++++++++++++++--------------
> 1 file changed, 161 insertions(+), 104 deletions(-)
>
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 271c7ad8cc32..fe1dadddd18e 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -142,18 +142,26 @@ static struct {
>
> #define PASS_TRY_LIMIT 3
>
> -#define CXL_TEST_EVENT_CNT_MAX 15
> +#define CXL_TEST_EVENT_CNT_MAX 16
> +/* 1 extra slot to accommodate that handles can't be 0 */
> +#define CXL_TEST_EVENT_ARRAY_SIZE (CXL_TEST_EVENT_CNT_MAX + 1)
>
> /* Set a number of events to return at a time for simulation. */
> #define CXL_TEST_EVENT_RET_MAX 4
>
> +/*
> + * @last_handle: last handle (index) to have an entry stored
> + * @current_handle: current handle (index) to be returned to the user on get_event
> + * @nr_overflow: number of events added past the log size
> + * @lock: protect these state variables
> + * @events: array of pending events to be returned.
> + */
> struct mock_event_log {
> - u16 clear_idx;
> - u16 cur_idx;
> - u16 nr_events;
> + u16 last_handle;
> + u16 current_handle;
> u16 nr_overflow;
> - u16 overflow_reset;
> - struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
> + rwlock_t lock;
> + struct cxl_event_record_raw *events[CXL_TEST_EVENT_ARRAY_SIZE];
> };
>
> struct mock_event_store {
> @@ -194,56 +202,65 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
> return &mdata->mes.mock_logs[log_type];
> }
>
> -static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
> -{
> - return log->events[log->cur_idx];
> -}
> -
> -static void event_reset_log(struct mock_event_log *log)
> -{
> - log->cur_idx = 0;
> - log->clear_idx = 0;
> - log->nr_overflow = log->overflow_reset;
> -}
> -
> /* Handle can never be 0 use 1 based indexing for handle */
> -static u16 event_get_clear_handle(struct mock_event_log *log)
> +static u16 event_inc_handle(u16 handle)
> {
> - return log->clear_idx + 1;
> + handle = (handle + 1) % CXL_TEST_EVENT_ARRAY_SIZE;
> + if (handle == 0)
> + handle = 1;
> + return handle;
> }
>
> -/* Handle can never be 0 use 1 based indexing for handle */
> -static __le16 event_get_cur_event_handle(struct mock_event_log *log)
> -{
> - u16 cur_handle = log->cur_idx + 1;
> -
> - return cpu_to_le16(cur_handle);
> -}
> -
> -static bool event_log_empty(struct mock_event_log *log)
> -{
> - return log->cur_idx == log->nr_events;
> -}
> -
> -static void mes_add_event(struct mock_event_store *mes,
> +/* Add the event or free it on overflow */
> +static void mes_add_event(struct cxl_mockmem_data *mdata,
> enum cxl_event_log_type log_type,
> struct cxl_event_record_raw *event)
> {
> + struct device *dev = mdata->mds->cxlds.dev;
> struct mock_event_log *log;
>
> if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
> return;
>
> - log = &mes->mock_logs[log_type];
> + log = &mdata->mes.mock_logs[log_type];
> +
> + guard(write_lock)(&log->lock);
>
> - if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
> + dev_dbg(dev, "Add log %d cur %d last %d\n",
> + log_type, log->current_handle, log->last_handle);
> +
> + /* Check next buffer */
> + if (event_inc_handle(log->last_handle) == log->current_handle) {
> log->nr_overflow++;
> - log->overflow_reset = log->nr_overflow;
> + dev_dbg(dev, "Overflowing log %d nr %d\n",
> + log_type, log->nr_overflow);
> + devm_kfree(dev, event);
> return;
> }
>
> - log->events[log->nr_events] = event;
> - log->nr_events++;
> + dev_dbg(dev, "Log %d; handle %u\n", log_type, log->last_handle);
> + event->event.generic.hdr.handle = cpu_to_le16(log->last_handle);
> + log->events[log->last_handle] = event;
> + log->last_handle = event_inc_handle(log->last_handle);
> +}
> +
> +static void mes_del_event(struct device *dev,
> + struct mock_event_log *log,
> + u16 handle)
> +{
> + struct cxl_event_record_raw *record;
> +
> + lockdep_assert(lockdep_is_held(&log->lock));
Use lockdep_assert_held(&log->lock); here instead.
> +
> + dev_dbg(dev, "Clearing event %u; record %u\n",
> + handle, log->current_handle);
> + record = log->events[handle];
> + if (!record)
> + dev_err(dev, "Mock event index %u empty?\n", handle);
This reports an error but continues anyway?
> +
> + log->events[handle] = NULL;
> + log->current_handle = event_inc_handle(log->current_handle);
> + devm_kfree(dev, record);
> }
>
> /*
> @@ -257,6 +274,7 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> struct cxl_get_event_payload *pl;
> struct mock_event_log *log;
> int ret_limit;
> + u16 handle;
> u8 log_type;
> int i;
>
> @@ -276,22 +294,31 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> memset(cmd->payload_out, 0, struct_size(pl, records, 0));
>
> log = event_find_log(dev, log_type);
> - if (!log || event_log_empty(log))
> + if (!log)
> return 0;
>
> pl = cmd->payload_out;
>
> - for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
> - memcpy(&pl->records[i], event_get_current(log),
> - sizeof(pl->records[i]));
> - pl->records[i].event.generic.hdr.handle =
> - event_get_cur_event_handle(log);
> - log->cur_idx++;
> + guard(read_lock)(&log->lock);
> +
> + handle = log->current_handle;
> + dev_dbg(dev, "Get log %d handle %u last %u\n",
> + log_type, handle, log->last_handle);
> + for (i = 0; i < ret_limit && handle != log->last_handle;
> + i++, handle = event_inc_handle(handle)) {
> + struct cxl_event_record_raw *cur;
> +
> + cur = log->events[handle];
> + dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
> + log_type, le16_to_cpu(cur->event.generic.hdr.handle),
> + handle);
> + memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
> + pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
> }
>
> cmd->size_out = struct_size(pl, records, i);
> pl->record_count = cpu_to_le16(i);
> - if (!event_log_empty(log))
> + if (handle != log->last_handle)
> pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
>
> if (log->nr_overflow) {
> @@ -313,8 +340,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> {
> struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
> - struct mock_event_log *log;
> u8 log_type = pl->event_log;
> + struct mock_event_log *log;
> u16 handle;
> int nr;
>
> @@ -325,23 +352,20 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> if (!log)
> return 0; /* No mock data in this log */
>
> - /*
> - * This check is technically not invalid per the specification AFAICS.
> - * (The host could 'guess' handles and clear them in order).
> - * However, this is not good behavior for the host so test it.
> - */
> - if (log->clear_idx + pl->nr_recs > log->cur_idx) {
> - dev_err(dev,
> - "Attempting to clear more events than returned!\n");
> - return -EINVAL;
> - }
> + guard(write_lock)(&log->lock);
>
> /* Check handle order prior to clearing events */
> - for (nr = 0, handle = event_get_clear_handle(log);
> - nr < pl->nr_recs;
> - nr++, handle++) {
> + handle = log->current_handle;
> + for (nr = 0; nr < pl->nr_recs && handle != log->last_handle;
> + nr++, handle = event_inc_handle(handle)) {
> +
> + dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
> + log_type, handle,
> + le16_to_cpu(pl->handles[nr]));
> +
> if (handle != le16_to_cpu(pl->handles[nr])) {
> - dev_err(dev, "Clearing events out of order\n");
> + dev_err(dev, "Clearing events out of order %u %u\n",
> + handle, le16_to_cpu(pl->handles[nr]));
> return -EINVAL;
> }
> }
> @@ -350,25 +374,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> log->nr_overflow = 0;
>
> /* Clear events */
> - log->clear_idx += pl->nr_recs;
> - return 0;
> -}
> -
> -static void cxl_mock_event_trigger(struct device *dev)
> -{
> - struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> - struct mock_event_store *mes = &mdata->mes;
> - int i;
> -
> - for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
> - struct mock_event_log *log;
> + for (nr = 0; nr < pl->nr_recs; nr++)
> + mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
> + dev_dbg(dev, "Delete log %d cur %d last %d\n",
> + log_type, log->current_handle, log->last_handle);
>
> - log = event_find_log(dev, i);
> - if (log)
> - event_reset_log(log);
> - }
> -
> - cxl_mem_get_event_records(mdata->mds, mes->ev_status);
> + return 0;
> }
>
> struct cxl_event_record_raw maint_needed = {
> @@ -509,8 +520,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
> return 0;
> }
>
> -static void cxl_mock_add_event_logs(struct mock_event_store *mes)
> +/* Create a dynamically allocated event out of a statically defined event. */
> +static void add_event_from_static(struct cxl_mockmem_data *mdata,
> + enum cxl_event_log_type log_type,
> + struct cxl_event_record_raw *raw)
> {
> + struct device *dev = mdata->mds->cxlds.dev;
> + struct cxl_event_record_raw *rec;
> +
> + rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
> + if (!rec) {
> + dev_err(dev, "Failed to alloc event for log\n");
> + return;
Should we silently swallow the out-of-memory error here instead of just failing?
DJ
> + }
> + mes_add_event(mdata, log_type, rec);
> +}
> +
> +static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
> +{
> + struct mock_event_store *mes = &mdata->mes;
> + struct device *dev = mdata->mds->cxlds.dev;
> +
> put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK |
> CXL_GMER_VALID_COMPONENT | CXL_GMER_VALID_COMPONENT_ID_FORMAT,
> &gen_media.rec.media_hdr.validity_flags);
> @@ -523,43 +553,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
> put_unaligned_le16(CXL_MMER_VALID_COMPONENT | CXL_MMER_VALID_COMPONENT_ID_FORMAT,
> &mem_module.rec.validity_flags);
>
> - mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
> - mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> + dev_dbg(dev, "Generating fake event logs %d\n",
> + CXL_EVENT_TYPE_INFO);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
> (struct cxl_event_record_raw *)&gen_media);
> - mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> + add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
> (struct cxl_event_record_raw *)&mem_module);
> mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
>
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> + dev_dbg(dev, "Generating fake event logs %d\n",
> + CXL_EVENT_TYPE_FAIL);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> + (struct cxl_event_record_raw *)&mem_module);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> (struct cxl_event_record_raw *)&dram);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> (struct cxl_event_record_raw *)&gen_media);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> (struct cxl_event_record_raw *)&mem_module);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> (struct cxl_event_record_raw *)&dram);
> /* Overflow this log */
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
>
> - mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> - mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
> + dev_dbg(dev, "Generating fake event logs %d\n",
> + CXL_EVENT_TYPE_FATAL);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> + add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
> (struct cxl_event_record_raw *)&dram);
> mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
> }
>
> +static void cxl_mock_event_trigger(struct device *dev)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct mock_event_store *mes = &mdata->mes;
> +
> + cxl_mock_add_event_logs(mdata);
> + cxl_mem_get_event_records(mdata->mds, mes->ev_status);
> +}
> +
> static int mock_gsl(struct cxl_mbox_cmd *cmd)
> {
> if (cmd->size_out < sizeof(mock_gsl_payload))
> @@ -1684,6 +1731,14 @@ static void cxl_mock_test_feat_init(struct cxl_mockmem_data *mdata)
> mdata->test_feat.data = cpu_to_le32(0xdeadbeef);
> }
>
> +static void init_event_log(struct mock_event_log *log)
> +{
> + rwlock_init(&log->lock);
> + /* Handle can never be 0 use 1 based indexing for handle */
> + log->current_handle = 1;
> + log->last_handle = 1;
> +}
> +
> static int cxl_mock_mem_probe(struct platform_device *pdev)
> {
> struct device *dev = &pdev->dev;
> @@ -1767,7 +1822,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
> if (rc)
> dev_dbg(dev, "No CXL Features discovered\n");
>
> - cxl_mock_add_event_logs(&mdata->mes);
> + for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
> + init_event_log(&mdata->mes.mock_logs[i]);
> + cxl_mock_add_event_logs(mdata);
>
> cxlmd = devm_cxl_add_memdev(cxlds, NULL);
> if (IS_ERR(cxlmd))
* [PATCH v10 30/31] tools/testing/cxl: Add DC Regions to mock mem data
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (28 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 29/31] tools/testing/cxl: Make event logs dynamic Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-29 23:42 ` Dave Jiang
2026-05-23 9:43 ` [PATCH v10 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions Anisa Su
` (2 subsequent siblings)
32 siblings, 1 reply; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Ira Weiny, Anisa Su
From: Ira Weiny <ira.weiny@intel.com>
cxl_test provides a good way to ensure quick smoke and regression
testing. The complexity of Dynamic Capacity (DC) extent processing as
well as the complexity of DC-backed DAX regions can mostly be tested
through cxl_test. This includes management of DC regions and DAX
devices on those regions; the management of extent device lifetimes;
and the processing of DCD events.
The only missing functionality from this test is actual interrupt
processing.
Mock memory devices can easily mock DC information and manage fake
extent data.
Define mock_dc_partition information within the mock memory data. Add
sysfs entries on the mock device to inject and delete extents.
The inject format is <start>:<length>:<tag>:<more>[:<seq>] where <tag>
is a UUID string (or "" / "0" for the null UUID) and <seq> is an
optional shared_extn_seq value used for sharable-partition tests
(defaults to 0).
The delete format is <start>:<length>:<uuid>
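As an illustration of the inject format above, here is a minimal userspace sketch of how such a string could be split into fields. It is not the parser from this patch: parse_inject(), struct inject_req, and the treatment of <more> as a 0/1 flag are assumptions made for the example, and the tag is only length-checked rather than run through uuid_parse().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed inject request. */
struct inject_req {
	uint64_t start;
	uint64_t length;
	char tag[37];	/* 36-character UUID string plus NUL; "" is the null UUID */
	bool more;
	uint16_t seq;	/* optional fifth field, defaults to 0 */
};

/* Parse a whole field as an unsigned number (decimal, octal, or hex). */
static int parse_u64(const char *s, uint64_t *out)
{
	unsigned long long v;
	char *end;

	if (*s == '\0')
		return -EINVAL;
	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno || *end != '\0')
		return -EINVAL;
	*out = v;
	return 0;
}

/* Parse "<start>:<length>:<tag>:<more>[:<seq>]" into req. */
static int parse_inject(const char *input, struct inject_req *req)
{
	char buf[128];
	char *fields[5];
	const char *tag;
	size_t len;
	uint64_t val;
	char *p;
	int n = 0;

	assert(input && req);

	len = strlen(input);
	if (len >= sizeof(buf))
		return -EINVAL;
	memcpy(buf, input, len + 1);

	/* Drop the newline that a shell "echo" appends to the write. */
	while (len && (buf[len - 1] == '\n' || buf[len - 1] == ' '))
		buf[--len] = '\0';

	/* Split by hand so that an empty tag field is preserved. */
	p = buf;
	while (n < 5) {
		fields[n++] = p;
		p = strchr(p, ':');
		if (!p)
			break;
		*p++ = '\0';
	}
	if (p || n < 4)		/* too many or too few fields */
		return -EINVAL;

	if (parse_u64(fields[0], &req->start) ||
	    parse_u64(fields[1], &req->length))
		return -EINVAL;

	/* "" and "0" are shorthand for the null UUID. */
	tag = strcmp(fields[2], "0") == 0 ? "" : fields[2];
	if (tag[0] != '\0' && strlen(tag) != 36)
		return -EINVAL;
	strcpy(req->tag, tag);

	if (parse_u64(fields[3], &val) || val > 1)
		return -EINVAL;
	req->more = val != 0;

	req->seq = 0;
	if (n == 5) {
		if (parse_u64(fields[4], &val) || val > UINT16_MAX)
			return -EINVAL;
		req->seq = (uint16_t)val;
	}
	return 0;
}
```

Under these assumptions, "4096:512::1" describes an untagged extent with the more flag set, while appending ":3" to a tagged request supplies a shared_extn_seq of 3.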
Directly call the event irq callback to simulate irqs to process the
test extents.
Add DC mailbox commands to the CEL and implement those commands.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
Changes:
[anisa: add uuid + shared_extn_seq, align mock with kernel validators,
introduce a sharable-partition test fixture]
[anisa: replace "sparse" terminology with "DC" / "DC-backed"]
Carry a uuid_t and a u16 shared_extn_seq on each mock extent, parse
tags via uuid_parse() in the inject path and the pre-extent fixture,
and propagate both fields through log_dc_event() and
mock_get_dc_extent_list(). An optional 5th field in the inject
format supplies the shared_extn_seq for sharable-partition tests.
The delete format takes the uuid as its third field so release
events carry tag identity to the host.
Mock fixes required to satisfy the host-side validators:
- dsmad_handle starts at 0xFA, not 0xFADE. The Get Dynamic
Capacity Configuration response's DSMAD Handle field is 1 byte
per the CXL spec; the kernel rejects any handle with the upper
24 bits non-zero as a firmware-bug.
- dc_accept_extent() treats a re-accept of an already-accepted
extent as a successful no-op (look up dc_accepted_exts when the
sent xa lookup misses). The host replays accepts for pre-
injected extents on region creation; without this the existing-
extent ingest aborts with -ENOMEM.
- __dc_del_extent_store() runs strim() on the trailing uuid field
so the '\n' shell write tail doesn't cause parse_tag() to fall
through to uuid_parse() and -EINVAL.
- NUM_MOCK_DC_REGIONS reduced from 2 to 1. The host's
cxl_dev_dc_identify() surfaces partitions[0] only, so extents
seeded into a second mock partition land outside the registered
DC range; for tagged groups that also trips the partition-
equality gate and drops the whole group (including the in-range
member).
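The re-accept behaviour in the dc_accept_extent() item above can be illustrated with a small userspace model of the sent/accepted states. This is a sketch under stated assumptions, not the mock driver's code: it uses a fixed array instead of xarrays, the sketch_* names are hypothetical, and, unlike the patch, it returns -ENOENT for an unknown extent and bumps the generation counter only on a real transition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent states, standing in for the three xarrays. */
enum ext_state { EXT_FREE = 0, EXT_FM, EXT_SENT, EXT_ACCEPTED };

struct sketch_extent {
	uint64_t start;
	uint64_t length;
	enum ext_state state;
};

#define SKETCH_MAX_EXTENTS 8

struct sketch_dev {
	struct sketch_extent ext[SKETCH_MAX_EXTENTS];
	uint32_t generation;
};

/* Find the extent that starts at 'start' and is currently in state 'st'. */
static struct sketch_extent *sketch_find(struct sketch_dev *d, uint64_t start,
					 enum ext_state st)
{
	for (size_t i = 0; i < SKETCH_MAX_EXTENTS; i++)
		if (d->ext[i].state == st && d->ext[i].start == start)
			return &d->ext[i];
	return NULL;
}

/* Track a new extent in the given state. */
static int sketch_add(struct sketch_dev *d, uint64_t start, uint64_t length,
		      enum ext_state st)
{
	assert(st != EXT_FREE);

	if (sketch_find(d, start, st))
		return -EBUSY;
	for (size_t i = 0; i < SKETCH_MAX_EXTENTS; i++) {
		if (d->ext[i].state == EXT_FREE) {
			d->ext[i].start = start;
			d->ext[i].length = length;
			d->ext[i].state = st;
			return 0;
		}
	}
	return -ENOMEM;
}

/* Host accepts an extent: sent -> accepted, or a no-op if already accepted. */
static int sketch_accept(struct sketch_dev *d, uint64_t start, uint64_t length)
{
	struct sketch_extent *e = sketch_find(d, start, EXT_SENT);

	if (!e || e->length != length) {
		/* A replayed accept of an already-accepted extent succeeds. */
		e = sketch_find(d, start, EXT_ACCEPTED);
		if (e && e->length == length)
			return 0;
		return -ENOENT;
	}
	e->state = EXT_ACCEPTED;
	d->generation++;
	return 0;
}
```

The point of the second lookup is that a host replaying accepts for pre-injected extents sees success both times, while an accept for an extent that was never sent, or with a mismatched length, still fails.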
Sharable-partition test fixture:
- Stamp MOCK_DC_SHARABLE_SERIAL (0xDCDC) on the cxl_mem instance
at pdev->id == 0. The companion cxl_test driver checks this
serial in mock_cxl_endpoint_parse_cdat() and sets the DC
partition's perf.shareable on that memdev only — exposing both
sharable and non-sharable DC partitions from one cxl_test
module load so the userspace suite can exercise both regimes.
- Skip inject_prev_extents() on that one memdev: the pre-injected
extents are untagged / seq=0 and would be rejected as firmware-
bug by cxl_validate_extent() on a sharable partition, leaving
spurious noise in dmesg at probe.
---
tools/testing/cxl/test/cxl.c | 21 +
tools/testing/cxl/test/mem.c | 806 ++++++++++++++++++++++++++++++++++-
2 files changed, 826 insertions(+), 1 deletion(-)
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 418669927fb0..ac6060ede061 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -18,6 +18,15 @@ static int interleave_arithmetic;
static bool extended_linear_cache;
static bool fail_autoassemble;
+/*
+ * Mock serial sentinel. The cxl_mock_mem probe stamps this serial on
+ * exactly one platform device (cxl_mem with id 0); that single memdev's
+ * DC partition is marked sharable below in mock_cxl_endpoint_parse_cdat
+ * so the suite can exercise sharable-extent code paths without losing
+ * the non-sharable coverage on the other mock memdevs.
+ */
+#define MOCK_DC_SHARABLE_SERIAL 0xDCDCULL
+
#define FAKE_QTG_ID 42
#define NR_CXL_HOST_BRIDGES 2
@@ -1432,6 +1441,18 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
};
dpa_perf_setup(port, &range, perf);
+
+ /*
+ * The mock probe stamps MOCK_DC_SHARABLE_SERIAL onto exactly
+ * one cxl_mem instance; mark its DC partition sharable so
+ * cxl_validate_extent() routes shared-seq injects through
+ * the sharable regime. Every other memdev keeps its DC
+ * partition non-sharable so the existing untagged / seq=0
+ * tests still run on this kernel.
+ */
+ if (cxlds->part[i].mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
+ cxlds->serial == MOCK_DC_SHARABLE_SERIAL)
+ perf->shareable = true;
}
cxl_memdev_update_perf(cxlmd);
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index fe1dadddd18e..9cc97b718b5f 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -20,6 +20,7 @@
#define FW_SLOTS 3
#define DEV_SIZE SZ_2G
#define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
#define MOCK_INJECT_DEV_MAX 8
#define MOCK_INJECT_TEST_MAX 128
@@ -113,6 +114,22 @@ static struct cxl_cel_entry mock_cel[] = {
EFFECT(SECURITY_CHANGE_IMMEDIATE) |
EFFECT(BACKGROUND_OP)),
},
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
};
/* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -173,6 +190,16 @@ struct vendor_test_feat {
__le32 data;
} __packed;
+/*
+ * The kernel surfaces only the first DC partition reported by the
+ * device (cxl_dev_dc_identify() takes partitions[0] only), so any
+ * extents we pre-inject into a second mock partition end up rejected
+ * as "not in a valid DC partition" — and for tagged groups they also
+ * trip the partition-equality gate and drop the whole group (including
+ * the in-range member in DC0). Keep the mock at one DC partition.
+ */
+#define NUM_MOCK_DC_REGIONS 1
+
struct cxl_mockmem_data {
void *lsa;
void *fw;
@@ -191,6 +218,20 @@ struct cxl_mockmem_data {
unsigned long sanitize_timeout;
struct vendor_test_feat test_feat;
u8 shutdown_state;
+
+ struct cxl_dc_partition dc_partitions[NUM_MOCK_DC_REGIONS];
+ u32 dc_ext_generation;
+ struct mutex ext_lock;
+
+ /*
+ * Extents are in 1 of 3 states
+ * FM (sysfs added but not sent to the host yet)
+ * sent (sent to the host but not accepted)
+ * accepted (by the host)
+ */
+ struct xarray dc_fm_extents;
+ struct xarray dc_sent_extents;
+ struct xarray dc_accepted_exts;
};
static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -607,6 +648,229 @@ static void cxl_mock_event_trigger(struct device *dev)
cxl_mem_get_event_records(mdata->mds, mes->ev_status);
}
+struct cxl_extent_data {
+ u64 dpa_start;
+ u64 length;
+ uuid_t uuid;
+ u16 shared_extn_seq;
+ bool shared;
+};
+
+/*
+ * Parse a tag string into a uuid_t. Accepts the empty string and "0"
+ * as shorthand for the null UUID; anything else must be a UUID string
+ * uuid_parse() can understand.
+ */
+static int parse_tag(const char *tag, uuid_t *out)
+{
+ if (!tag || tag[0] == '\0' || strcmp(tag, "0") == 0) {
+ uuid_copy(out, &uuid_null);
+ return 0;
+ }
+ return uuid_parse(tag, out);
+}
+
+static int __devm_add_extent(struct device *dev, struct xarray *array,
+ u64 start, u64 length, const char *tag,
+ u16 shared_extn_seq, bool shared)
+{
+ struct cxl_extent_data *extent;
+ int rc;
+
+ extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ extent->dpa_start = start;
+ extent->length = length;
+ rc = parse_tag(tag, &extent->uuid);
+ if (rc) {
+ dev_err(dev, "Failed to parse tag '%s'\n", tag);
+ devm_kfree(dev, extent);
+ return rc;
+ }
+ extent->shared_extn_seq = shared_extn_seq;
+ extent->shared = shared;
+
+ if (xa_insert(array, start, extent, GFP_KERNEL)) {
+ devm_kfree(dev, extent);
+ dev_err(dev, "Failed xarray insert %#llx\n", start);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int devm_add_fm_extent(struct device *dev, u64 start, u64 length,
+ const char *tag, u16 shared_extn_seq, bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ guard(mutex)(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_fm_extents, start, length,
+ tag, shared_extn_seq, shared);
+}
+
+static int dc_accept_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ dev_dbg(dev, "Host accepting extent %#llx\n", start);
+ mdata->dc_ext_generation++;
+
+ lockdep_assert_held(&mdata->ext_lock);
+ ext = xa_load(&mdata->dc_sent_extents, start);
+ if (!ext || ext->length != length) {
+ /*
+ * The host may re-accept extents we already moved into the
+ * accepted xarray (e.g. pre-injected extents replayed on
+ * region creation). Treat that as a successful no-op so
+ * the existing-extent ingest path doesn't abort.
+ */
+ ext = xa_load(&mdata->dc_accepted_exts, start);
+ if (ext && ext->length == length)
+ return 0;
+ dev_err(dev, "Extent %#llx-%#llx not found\n",
+ start, start + length);
+ return -ENOMEM;
+ }
+ xa_erase(&mdata->dc_sent_extents, start);
+ return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+ struct cxl_mockmem_data *mdata = md;
+
+ xa_destroy(&mdata->dc_fm_extents);
+ xa_destroy(&mdata->dc_sent_extents);
+ xa_destroy(&mdata->dc_accepted_exts);
+}
+
+/* Pretend to have some previous accepted extents */
+struct pre_ext_info {
+ u64 offset;
+ u64 length;
+ const char *tag;
+} pre_ext_info[] = {
+ {
+ .offset = SZ_128M,
+ .length = SZ_64M,
+ .tag = "",
+ },
+ {
+ .offset = SZ_256M,
+ .length = SZ_64M,
+ .tag = "deadbeef-cafe-baad-f00d-fedcba987654",
+ },
+};
+
+static int devm_add_sent_extent(struct device *dev, u64 start, u64 length,
+ const char *tag, u16 shared_extn_seq, bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ lockdep_assert_held(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_sent_extents, start, length,
+ tag, shared_extn_seq, shared);
+}
+
+static int inject_prev_extents(struct device *dev, u64 base_dpa)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ int rc;
+
+ dev_dbg(dev, "Adding %ld pre-extents for testing\n",
+ ARRAY_SIZE(pre_ext_info));
+
+ guard(mutex)(&mdata->ext_lock);
+ for (int i = 0; i < ARRAY_SIZE(pre_ext_info); i++) {
+ u64 ext_dpa = base_dpa + pre_ext_info[i].offset;
+ u64 ext_len = pre_ext_info[i].length;
+
+ dev_dbg(dev, "Adding pre-extent DPA:%#llx LEN:%#llx tag:%s\n",
+ ext_dpa, ext_len, pre_ext_info[i].tag);
+
+ rc = devm_add_sent_extent(dev, ext_dpa, ext_len,
+ pre_ext_info[i].tag, 0, false);
+ if (rc) {
+ dev_err(dev, "Failed to add pre-extent DPA:%#llx LEN:%#llx; %d\n",
+ ext_dpa, ext_len, rc);
+ return rc;
+ }
+
+ rc = dc_accept_extent(dev, ext_dpa, ext_len);
+ if (rc)
+ return rc;
+ }
+ return 0;
+}
+
+static int cxl_mock_dc_partition_setup(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+ u32 dsmad_handle = 0xFA;
+ u64 decode_length = SZ_512M;
+ u64 block_size = SZ_512;
+ u64 length = SZ_512M;
+ int rc;
+
+ mutex_init(&mdata->ext_lock);
+ xa_init(&mdata->dc_fm_extents);
+ xa_init(&mdata->dc_sent_extents);
+ xa_init(&mdata->dc_accepted_exts);
+
+ rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+ if (rc)
+ return rc;
+
+ for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ struct cxl_dc_partition *part = &mdata->dc_partitions[i];
+
+ dev_dbg(dev, "Creating DC partition DC%d DPA:%#llx LEN:%#llx\n",
+ i, base_dpa, length);
+
+ part->base = cpu_to_le64(base_dpa);
+ part->decode_length = cpu_to_le64(decode_length /
+ CXL_CAPACITY_MULTIPLIER);
+ part->length = cpu_to_le64(length);
+ part->block_size = cpu_to_le64(block_size);
+ part->dsmad_handle = cpu_to_le32(dsmad_handle);
+ dsmad_handle++;
+
+ /*
+ * Skip pre-injection on the sharable mock memdev. The
+ * pre-injected extents are untagged / seq=0, which a
+ * sharable partition rejects as firmware-bug; leaving the
+ * sharable memdev with an empty DC partition is what its
+ * dedicated tests (test_shared_extent_inject and
+ * test_seq_integrity_gap in cxl-dcd.sh) expect anyway.
+ *
+ * The sharable fixture is the memdev at pdev->id == 0 —
+ * see the matching MOCK_DC_SHARABLE_SERIAL stamp in
+ * cxl_mock_mem_probe(). This relies on tools/testing/cxl
+ * always allocating a "cxl_mem" platform device with id 0
+ * as the first memdev; if that invariant ever breaks the
+ * sharable test fixture will land on the wrong device.
+ */
+ if (to_platform_device(dev)->id != 0) {
+ rc = inject_prev_extents(dev, base_dpa);
+ if (rc) {
+ dev_err(dev,
+ "Failed to add pre-extents for DC%d\n",
+ i);
+ return rc;
+ }
+ }
+
+ base_dpa += decode_length;
+ }
+
+ return 0;
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1582,6 +1846,193 @@ static int mock_get_supported_features(struct cxl_mockmem_data *mdata,
return 0;
}
+static int mock_get_dc_config(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u8 partition_requested, partition_start_idx, partition_ret_cnt;
+ struct cxl_mbox_get_dc_config_out *resp;
+ int i;
+
+ partition_requested = min(dc_config->partition_count, NUM_MOCK_DC_REGIONS);
+
+ if (cmd->size_out < struct_size(resp, partition, partition_requested))
+ return -EINVAL;
+
+ memset(cmd->payload_out, 0, cmd->size_out);
+ resp = cmd->payload_out;
+
+ partition_start_idx = dc_config->start_partition_index;
+ partition_ret_cnt = 0;
+ for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ if (i >= partition_start_idx) {
+ memcpy(&resp->partition[partition_ret_cnt],
+ &mdata->dc_partitions[i],
+ sizeof(resp->partition[partition_ret_cnt]));
+ partition_ret_cnt++;
+ }
+ }
+ resp->avail_partition_count = NUM_MOCK_DC_REGIONS;
+ resp->partitions_returned = i;
+
+ dev_dbg(dev, "Returning %d dc partitions\n", partition_ret_cnt);
+ return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_get_extent_in *get = cmd->payload_in;
+ u32 total_avail = 0, total_ret = 0;
+ struct cxl_extent_data *ext;
+ u32 ext_count, start_idx;
+ unsigned long i;
+
+ ext_count = le32_to_cpu(get->extent_cnt);
+ start_idx = le32_to_cpu(get->start_extent_index);
+
+ memset(resp, 0, sizeof(*resp));
+
+ guard(mutex)(&mdata->ext_lock);
+ /*
+ * Total available needs to be calculated and returned regardless of
+ * how many can actually be returned.
+ */
+ xa_for_each(&mdata->dc_accepted_exts, i, ext)
+ total_avail++;
+
+ if (start_idx > total_avail)
+ return -EINVAL;
+
+ xa_for_each(&mdata->dc_accepted_exts, i, ext) {
+ if (total_ret >= ext_count)
+ break;
+
+ if (total_ret >= start_idx) {
+ resp->extent[total_ret].start_dpa =
+ cpu_to_le64(ext->dpa_start);
+ resp->extent[total_ret].length =
+ cpu_to_le64(ext->length);
+ export_uuid(resp->extent[total_ret].uuid, &ext->uuid);
+ resp->extent[total_ret].shared_extn_seq =
+ cpu_to_le16(ext->shared_extn_seq);
+ total_ret++;
+ }
+ }
+
+ resp->returned_extent_count = cpu_to_le32(total_ret);
+ resp->total_extent_count = cpu_to_le32(total_avail);
+ resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
+
+ dev_dbg(dev, "Returning %d extents of %d total\n",
+ total_ret, total_avail);
+
+ return 0;
+}
+
+static void dc_clear_sent(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ lockdep_assert_held(&mdata->ext_lock);
+
+ /* Any extents not accepted must be cleared */
+ xa_for_each(&mdata->dc_sent_extents, index, ext) {
+ dev_dbg(dev, "Host rejected extent %#llx\n", ext->dpa_start);
+ xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
+ }
+}
+
+static int mock_add_dc_response(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ guard(mutex)(&mdata->ext_lock);
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+ int rc;
+
+ rc = dc_accept_extent(dev, start, length);
+ if (rc)
+ return rc;
+ }
+
+ dc_clear_sent(dev);
+ return 0;
+}
+
+static void dc_delete_extent(struct device *dev, unsigned long long start,
+ unsigned long long length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long end = start + length;
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
+
+ guard(mutex)(&mdata->ext_lock);
+ xa_for_each(&mdata->dc_fm_extents, index, ext) {
+ u64 extent_end = ext->dpa_start + ext->length;
+
+ /*
+ * Any extent which 'touches' the released delete range will be
+ * removed.
+ */
+ if ((start <= ext->dpa_start && ext->dpa_start < end) ||
+ (start <= extent_end && extent_end < end))
+ xa_erase(&mdata->dc_fm_extents, ext->dpa_start);
+ }
+
+ /*
+ * If the extent was accepted let it be for the host to drop
+ * later.
+ */
+}
+
+static int release_accepted_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = xa_load(&mdata->dc_accepted_exts, start);
+ if (!ext || ext->length != length) {
+ dev_err(dev, "Extent %#llx not in accepted state\n", start);
+ return -EINVAL;
+ }
+ xa_erase(&mdata->dc_accepted_exts, start);
+ mdata->dc_ext_generation++;
+
+ return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+
+ dev_dbg(dev, "Extent %#llx released by host\n", start);
+ release_accepted_extent(dev, start, length);
+ }
+
+ return 0;
+}
+
static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd)
{
@@ -1673,6 +2124,18 @@ static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
case CXL_MBOX_OP_GET_SUPPORTED_FEATURES:
rc = mock_get_supported_features(mdata, cmd);
break;
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ rc = mock_get_dc_config(dev, cmd);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ rc = mock_get_dc_extent_list(dev, cmd);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ rc = mock_add_dc_response(dev, cmd);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ rc = mock_dc_release(dev, cmd);
+ break;
case CXL_MBOX_OP_GET_FEATURE:
rc = mock_get_feature(mdata, cmd);
break;
@@ -1739,6 +2202,14 @@ static void init_event_log(struct mock_event_log *log)
log->last_handle = 1;
}
+/*
+ * Stamp this serial on a single mock cxl_mem instance so the
+ * companion cxl_test driver can find it and mark its DC partition
+ * sharable in mock_cxl_endpoint_parse_cdat(). Must match the value
+ * defined in tools/testing/cxl/test/cxl.c.
+ */
+#define MOCK_DC_SHARABLE_SERIAL 0xDCDCULL
+
static int cxl_mock_mem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -1758,6 +2229,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
return -ENOMEM;
dev_set_drvdata(dev, mdata);
+ rc = cxl_mock_dc_partition_setup(dev);
+ if (rc)
+ return rc;
+
mdata->lsa = vmalloc(LSA_SIZE);
if (!mdata->lsa)
return -ENOMEM;
@@ -1774,7 +2249,23 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;
- mds = cxl_memdev_state_create(dev, pdev->id + 1, 0);
+ {
+ u64 serial = pdev->id + 1;
+
+ /*
+ * Reserve the memdev at pdev->id == 0 as the sharable DC
+ * partition test fixture. This relies on tools/testing/cxl
+ * always allocating a "cxl_mem" platform device with id 0
+ * as the first memdev — currently true in cxl.c, but if
+ * the topology ever renumbers, the sharable serial will be
+ * stamped on the wrong device (or no device). Matched by
+ * the skip-pre-inject guard in cxl_mock_dc_partition_setup
+ * and by mock_cxl_endpoint_parse_cdat in cxl_test.
+ */
+ if (pdev->id == 0)
+ serial = MOCK_DC_SHARABLE_SERIAL;
+ mds = cxl_memdev_state_create(dev, serial, 0);
+ }
if (IS_ERR(mds))
return PTR_ERR(mds);
@@ -1814,6 +2305,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;
+ if (cxl_dcd_supported(mds))
+ cxl_configure_dcd(mds, &range_info);
+
rc = cxl_dpa_setup(cxlds, &range_info);
if (rc)
return rc;
@@ -1921,11 +2415,321 @@ static ssize_t sanitize_timeout_store(struct device *dev,
static DEVICE_ATTR_RW(sanitize_timeout);
+/* Return if the proposed extent would break the test code */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+ size_t new_len)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *extent;
+ size_t new_end, i;
+
+ if (!new_len)
+ return false;
+
+ new_end = new_start + new_len;
+
+ dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+ guard(mutex)(&mdata->ext_lock);
+ dev_dbg(dev, "Checking extents starts...\n");
+ xa_for_each(&mdata->dc_fm_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking sent extents starts...\n");
+ xa_for_each(&mdata->dc_sent_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking accepted extents starts...\n");
+ xa_for_each(&mdata->dc_accepted_exts, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ return true;
+}
+
+struct cxl_test_dcd {
+ uuid_t id;
+ struct cxl_event_dcd rec;
+} __packed;
+
+struct cxl_test_dcd dcd_event_rec_template = {
+ .id = CXL_EVENT_DC_EVENT_UUID,
+ .rec = {
+ .hdr = {
+ .length = sizeof(struct cxl_test_dcd),
+ },
+ },
+};
+
+static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
+ u64 start, u64 length, const char *tag_str,
+ u16 shared_extn_seq, bool more)
+{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_test_dcd *dcd_event;
+ uuid_t tag;
+ int rc;
+
+ dev_dbg(dev, "mock device log event %d\n", type);
+
+ dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
+ sizeof(*dcd_event), GFP_KERNEL);
+ if (!dcd_event)
+ return -ENOMEM;
+
+ dcd_event->rec.flags = 0;
+ if (more)
+ dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
+ dcd_event->rec.event_type = type;
+ dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
+ dcd_event->rec.extent.length = cpu_to_le64(length);
+ rc = parse_tag(tag_str, &tag);
+ if (rc) {
+ devm_kfree(dev, dcd_event);
+ return rc;
+ }
+ export_uuid(dcd_event->rec.extent.uuid, &tag);
+ dcd_event->rec.extent.shared_extn_seq = cpu_to_le16(shared_extn_seq);
+
+ mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
+ (struct cxl_event_record_raw *)dcd_event);
+
+ /* Fake the irq */
+ cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
+
+ return 0;
+}
+
+static void mark_extent_sent(struct device *dev, unsigned long long start)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = xa_erase(&mdata->dc_fm_extents, start);
+ if (xa_insert(&mdata->dc_sent_extents, ext->dpa_start, ext, GFP_KERNEL))
+ dev_err(dev, "Failed to mark extent %#llx sent\n", ext->dpa_start);
+}
+
+/*
+ * Format <start>:<length>:<tag>:<more_flag>
+ *
+ * start and length must be a multiple of the configured partition block size.
+ * Tag can be any string up to 16 bytes.
+ *
+ * Extents must be exclusive of other extents
+ *
+ * If the more flag is specified it is expected that an additional extent will
+ * be specified without the more flag to complete the test transaction with the
+ * host.
+ */
+static ssize_t __dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long start, length, more;
+ char *len_str, *uuid_str, *more_str, *seq_str;
+ u16 shared_extn_seq = 0;
+ size_t buf_len = count;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, buf_len, ':');
+ if (!len_str) {
+ dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ *len_str = '\0';
+ len_str += 1;
+ buf_len -= strlen(start_str);
+
+ uuid_str = strnchr(len_str, buf_len, ':');
+ if (!uuid_str) {
+ dev_err(dev, "Extent failed to find uuid_str: %s\n", len_str);
+ return -EINVAL;
+ }
+ *uuid_str = '\0';
+ uuid_str += 1;
+
+ more_str = strnchr(uuid_str, buf_len, ':');
+ if (!more_str) {
+ dev_err(dev, "Extent failed to find more_str: %s\n", uuid_str);
+ return -EINVAL;
+ }
+ *more_str = '\0';
+ more_str += 1;
+
+ /* Optional 5th field: shared_extn_seq. Absent -> 0. */
+ seq_str = strnchr(more_str, buf_len, ':');
+ if (seq_str) {
+ unsigned long long seq;
+
+ *seq_str = '\0';
+ seq_str += 1;
+ if (kstrtoull(seq_str, 0, &seq) || seq > U16_MAX) {
+ dev_err(dev, "Extent failed to parse seq: %s\n",
+ seq_str);
+ return -EINVAL;
+ }
+ shared_extn_seq = seq;
+ }
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(more_str, 0, &more)) {
+ dev_err(dev, "Extent failed to parse more: %s\n", more_str);
+ return -EINVAL;
+ }
+
+ if (!new_extent_valid(dev, start, length))
+ return -EINVAL;
+
+ rc = devm_add_fm_extent(dev, start, length, uuid_str, shared_extn_seq,
+ shared);
+ if (rc) {
+ dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
+ start, length, rc);
+ return rc;
+ }
+
+ mark_extent_sent(dev, start);
+ rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, uuid_str,
+ shared_extn_seq, more);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
+ return count;
+}
+
+static ssize_t dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_inject_extent_store(dev, attr, buf, count, false);
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t dc_inject_shared_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_inject_extent_store(dev, attr, buf, count, true);
+}
+static DEVICE_ATTR_WO(dc_inject_shared_extent);
+
+static ssize_t __dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ enum dc_event type)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long start, length;
+ char *len_str, *uuid_str;
+ size_t buf_len = count;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, buf_len, ':');
+ if (!len_str) {
+ dev_err(dev, "Failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+ *len_str = '\0';
+ len_str += 1;
+ buf_len -= strlen(start_str);
+
+ uuid_str = strnchr(len_str, buf_len, ':');
+ if (!uuid_str) {
+ dev_err(dev, "Failed to find uuid_str: %s\n", len_str);
+ return -EINVAL;
+ }
+ *uuid_str = '\0';
+ uuid_str += 1;
+ /*
+ * uuid_str is the trailing field; trim shell-added '\n' so
+ * parse_tag()/uuid_parse() see a clean string.
+ */
+ uuid_str = strim(uuid_str);
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ dc_delete_extent(dev, start, length);
+
+ if (type == DCD_FORCED_CAPACITY_RELEASE)
+ dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
+ start, length);
+
+ rc = log_dc_event(mdata, type, start, length, uuid_str, 0, false);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
+ return count;
+}
+
+/*
+ * Format <start>:<length>:<uuid>
+ */
+static ssize_t dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_RELEASE_CAPACITY);
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_FORCED_CAPACITY_RELEASE);
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
static struct attribute *cxl_mock_mem_attrs[] = {
&dev_attr_security_lock.attr,
&dev_attr_event_trigger.attr,
&dev_attr_fw_buf_checksum.attr,
&dev_attr_sanitize_timeout.attr,
+ &dev_attr_dc_inject_extent.attr,
+ &dev_attr_dc_inject_shared_extent.attr,
+ &dev_attr_dc_del_extent.attr,
+ &dev_attr_dc_force_del_extent.attr,
NULL
};
ATTRIBUTE_GROUPS(cxl_mock_mem);
--
2.43.0
* Re: [PATCH v10 30/31] tools/testing/cxl: Add DC Regions to mock mem data
2026-05-23 9:43 ` [PATCH v10 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
@ 2026-05-29 23:42 ` Dave Jiang
0 siblings, 0 replies; 71+ messages in thread
From: Dave Jiang @ 2026-05-29 23:42 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Ira Weiny, Anisa Su
On 5/23/26 2:43 AM, Anisa Su wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> cxl_test provides a good way to ensure quick smoke and regression
> testing. The complexity of Dynamic Capacity (DC) extent processing as
> well as the complexity of DC-backed DAX regions can mostly be tested
> through cxl_test. This includes management of DC regions and DAX
> devices on those regions; the management of extent device lifetimes;
> and the processing of DCD events.
>
> The only missing functionality from this test is actual interrupt
> processing.
>
> Mock memory devices can easily mock DC information and manage fake
> extent data.
>
> Define mock_dc_partition information within the mock memory data. Add
> sysfs entries on the mock device to inject and delete extents.
>
> The inject format is <start>:<length>:<tag>:<more>[:<seq>] where <tag>
> is a UUID string (or "" / "0" for the null UUID) and <seq> is an
> optional shared_extn_seq value used for sharable-partition tests
> (defaults to 0).
> The delete format is <start>:<length>:<uuid>
>
> Directly call the event irq callback to simulate irqs to process the
> test extents.
>
> Add DC mailbox commands to the CEL and implement those commands.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Anisa Su <anisa.su@samsung.com>
>
> ---
> Changes:
> [anisa: add uuid + shared_extn_seq, align mock with kernel validators,
> introduce a sharable-partition test fixture]
> [anisa: replace "sparse" terminology with "DC" / "DC-backed"]
>
> Carry a uuid_t and a u16 shared_extn_seq on each mock extent, parse
> tags via uuid_parse() in the inject path and the pre-extent fixture,
> and propagate both fields through log_dc_event() and
> mock_get_dc_extent_list(). An optional 5th field in the inject
> format supplies the shared_extn_seq for sharable-partition tests.
> The delete format takes the uuid as its third field so release
> events carry tag identity to the host.
>
> Mock fixes required to satisfy the host-side validators:
>
> - dsmad_handle starts at 0xFA, not 0xFADE. The Get Dynamic
> Capacity Configuration response's DSMAD Handle field is 1 byte
> per the CXL spec; the kernel rejects any handle with the upper
> 24 bits non-zero as a firmware-bug.
>
> - dc_accept_extent() treats a re-accept of an already-accepted
> extent as a successful no-op (look up dc_accepted_exts when the
> sent xa lookup misses). The host replays accepts for pre-
> injected extents on region creation; without this the existing-
> extent ingest aborts with -ENOMEM.
>
> - __dc_del_extent_store() runs strim() on the trailing uuid field
> so the '\n' shell write tail doesn't cause parse_tag() to fall
> through to uuid_parse() and -EINVAL.
>
> - NUM_MOCK_DC_REGIONS reduced from 2 to 1. The host's
> cxl_dev_dc_identify() surfaces partitions[0] only, so extents
> seeded into a second mock partition land outside the registered
> DC range; for tagged groups that also trips the partition-
> equality gate and drops the whole group (including the in-range
> member).
>
> Sharable-partition test fixture:
>
> - Stamp MOCK_DC_SHARABLE_SERIAL (0xDCDC) on the cxl_mem instance
> at pdev->id == 0. The companion cxl_test driver checks this
> serial in mock_cxl_endpoint_parse_cdat() and sets the DC
> partition's perf.shareable on that memdev only — exposing both
> sharable and non-sharable DC partitions from one cxl_test
> module load so the userspace suite can exercise both regimes.
>
> - Skip inject_prev_extents() on that one memdev: the pre-injected
> extents are untagged / seq=0 and would be rejected as firmware-
> bug by cxl_validate_extent() on a sharable partition, leaving
> spurious noise in dmesg at probe.
> ---
> tools/testing/cxl/test/cxl.c | 21 +
> tools/testing/cxl/test/mem.c | 806 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 826 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 418669927fb0..ac6060ede061 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -18,6 +18,15 @@ static int interleave_arithmetic;
> static bool extended_linear_cache;
> static bool fail_autoassemble;
>
> +/*
> + * Mock serial sentinel. The cxl_mock_mem probe stamps this serial on
> + * exactly one platform device (cxl_mem with id 0); that single memdev's
> + * DC partition is marked sharable below in mock_cxl_endpoint_parse_cdat
> + * so the suite can exercise sharable-extent code paths without losing
> + * the non-sharable coverage on the other mock memdevs.
> + */
> +#define MOCK_DC_SHARABLE_SERIAL 0xDCDCULL
This is defined in cxl.c and mem.c. Why not just put it in a shared header?
> +
> #define FAKE_QTG_ID 42
>
> #define NR_CXL_HOST_BRIDGES 2
> @@ -1432,6 +1441,18 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
> };
>
> dpa_perf_setup(port, &range, perf);
> +
> + /*
> + * The mock probe stamps MOCK_DC_SHARABLE_SERIAL onto exactly
> + * one cxl_mem instance; mark its DC partition sharable so
> + * cxl_validate_extent() routes shared-seq injects through
> + * the sharable regime. Every other memdev keeps its DC
> + * partition non-sharable so the existing untagged / seq=0
> + * tests still run on this kernel.
> + */
> + if (cxlds->part[i].mode == CXL_PARTMODE_DYNAMIC_RAM_A &&
> + cxlds->serial == MOCK_DC_SHARABLE_SERIAL)
> + perf->shareable = true;
> }
>
> cxl_memdev_update_perf(cxlmd);
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index fe1dadddd18e..9cc97b718b5f 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -20,6 +20,7 @@
> #define FW_SLOTS 3
> #define DEV_SIZE SZ_2G
> #define EFFECT(x) (1U << x)
> +#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
>
> #define MOCK_INJECT_DEV_MAX 8
> #define MOCK_INJECT_TEST_MAX 128
> @@ -113,6 +114,22 @@ static struct cxl_cel_entry mock_cel[] = {
> EFFECT(SECURITY_CHANGE_IMMEDIATE) |
> EFFECT(BACKGROUND_OP)),
> },
> + {
> + .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
> + .effect = CXL_CMD_EFFECT_NONE,
> + },
> + {
> + .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
> + .effect = CXL_CMD_EFFECT_NONE,
> + },
> + {
> + .opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
> + .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
> + },
> + {
> + .opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
> + .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
> + },
> };
>
> /* See CXL 2.0 Table 181 Get Health Info Output Payload */
> @@ -173,6 +190,16 @@ struct vendor_test_feat {
> __le32 data;
> } __packed;
>
> +/*
> + * The kernel surfaces only the first DC partition reported by the
> + * device (cxl_dev_dc_identify() takes partitions[0] only), so any
> + * extents we pre-inject into a second mock partition end up rejected
> + * as "not in a valid DC partition" — and for tagged groups they also
> + * trip the partition-equality gate and drop the whole group (including
> + * the in-range member in DC0). Keep the mock at one DC partition.
> + */
> +#define NUM_MOCK_DC_REGIONS 1
> +
> struct cxl_mockmem_data {
> void *lsa;
> void *fw;
> @@ -191,6 +218,20 @@ struct cxl_mockmem_data {
> unsigned long sanitize_timeout;
> struct vendor_test_feat test_feat;
> u8 shutdown_state;
> +
> + struct cxl_dc_partition dc_partitions[NUM_MOCK_DC_REGIONS];
> + u32 dc_ext_generation;
> + struct mutex ext_lock;
> +
> + /*
> + * Extents are in 1 of 3 states
> + * FM (sysfs added but not sent to the host yet)
> + * sent (sent to the host but not accepted)
> + * accepted (by the host)
> + */
> + struct xarray dc_fm_extents;
> + struct xarray dc_sent_extents;
> + struct xarray dc_accepted_exts;
> };
>
> static struct mock_event_log *event_find_log(struct device *dev, int log_type)
> @@ -607,6 +648,229 @@ static void cxl_mock_event_trigger(struct device *dev)
> cxl_mem_get_event_records(mdata->mds, mes->ev_status);
> }
>
> +struct cxl_extent_data {
> + u64 dpa_start;
> + u64 length;
> + uuid_t uuid;
> + u16 shared_extn_seq;
> + bool shared;
> +};
> +
> +/*
> + * Parse a tag string into a uuid_t. Accepts the empty string and "0"
> + * as shorthand for the null UUID; anything else must be a UUID string
> + * uuid_parse() can understand.
> + */
> +static int parse_tag(const char *tag, uuid_t *out)
> +{
> + if (!tag || tag[0] == '\0' || strcmp(tag, "0") == 0) {
> + uuid_copy(out, &uuid_null);
> + return 0;
> + }
> + return uuid_parse(tag, out);
> +}
> +
> +static int __devm_add_extent(struct device *dev, struct xarray *array,
> + u64 start, u64 length, const char *tag,
> + u16 shared_extn_seq, bool shared)
> +{
> + struct cxl_extent_data *extent;
> + int rc;
> +
> + extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> + if (!extent)
> + return -ENOMEM;
> +
> + extent->dpa_start = start;
> + extent->length = length;
> + rc = parse_tag(tag, &extent->uuid);
> + if (rc) {
> + dev_err(dev, "Failed to parse tag '%s'\n", tag);
> + devm_kfree(dev, extent);
> + return rc;
> + }
> + extent->shared_extn_seq = shared_extn_seq;
> + extent->shared = shared;
> +
> + if (xa_insert(array, start, extent, GFP_KERNEL)) {
> + devm_kfree(dev, extent);
> + dev_err(dev, "Failed xarry insert %#llx\n", start);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int devm_add_fm_extent(struct device *dev, u64 start, u64 length,
> + const char *tag, u16 shared_extn_seq, bool shared)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +
> + guard(mutex)(&mdata->ext_lock);
> + return __devm_add_extent(dev, &mdata->dc_fm_extents, start, length,
> + tag, shared_extn_seq, shared);
> +}
> +
> +static int dc_accept_extent(struct device *dev, u64 start, u64 length)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_extent_data *ext;
> +
> + dev_dbg(dev, "Host accepting extent %#llx\n", start);
> + mdata->dc_ext_generation++;
Should this only happen after the xa_load() result has been checked and the lookup succeeded?
> +
> + lockdep_assert_held(&mdata->ext_lock);
Maybe this should go above the generation increment.
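For illustration, a minimal userspace sketch of that ordering, assuming a simplified array-based model rather than the mock's xarrays: the generation number is bumped only once an extent actually moves from sent to accepted, and a re-accept stays a successful no-op. The model_* names and the -ENOENT return value are assumptions made for the sketch and are not part of the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one mock extent; not the kernel's struct. */
struct model_extent {
	uint64_t dpa_start;
	uint64_t length;
	bool sent;	/* offered to the host, not yet accepted */
	bool accepted;	/* accepted by the host */
};

/* Hypothetical stand-in for the per-device mock state. */
struct model_dev {
	struct model_extent ext[4];
	size_t nr_ext;
	uint32_t generation;
};

/* Look up an extent by exact start and length. */
static struct model_extent *model_find(struct model_dev *d, uint64_t start,
				       uint64_t length)
{
	for (size_t i = 0; i < d->nr_ext; i++)
		if (d->ext[i].dpa_start == start && d->ext[i].length == length)
			return &d->ext[i];
	return NULL;
}

/* Bump the generation only when an extent really changes state. */
static int model_accept_extent(struct model_dev *d, uint64_t start,
			       uint64_t length)
{
	struct model_extent *e = model_find(d, start, length);

	if (!e)
		return -ENOENT;	/* assumed error code for "not found" */
	if (e->accepted)
		return 0;	/* re-accept: successful no-op */
	if (!e->sent)
		return -ENOENT;

	e->sent = false;
	e->accepted = true;
	d->generation++;
	return 0;
}
```

With this shape a failed lookup or a replayed accept leaves the generation number untouched, so the host would only see it change when the accepted list itself changed.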
> + ext = xa_load(&mdata->dc_sent_extents, start);
> + if (!ext || ext->length != length) {
> + /*
> + * The host may re-accept extents we already moved into the
> + * accepted xarray (e.g. pre-injected extents replayed on
> + * region creation). Treat that as a successful no-op so
> + * the existing-extent ingest path doesn't abort.
> + */
> + ext = xa_load(&mdata->dc_accepted_exts, start);
> + if (ext && ext->length == length)
> + return 0;
> + dev_err(dev, "Extent %#llx-%#llx not found\n",
> + start, start + length);
> + return -ENOMEM;
> + }
> + xa_erase(&mdata->dc_sent_extents, start);
> + return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
> +}
> +
> +static void release_dc_ext(void *md)
> +{
> + struct cxl_mockmem_data *mdata = md;
> +
> + xa_destroy(&mdata->dc_fm_extents);
> + xa_destroy(&mdata->dc_sent_extents);
> + xa_destroy(&mdata->dc_accepted_exts);
> +}
> +
> +/* Pretend to have some previous accepted extents */
> +struct pre_ext_info {
> + u64 offset;
> + u64 length;
> + const char *tag;
> +} pre_ext_info[] = {
> + {
> + .offset = SZ_128M,
> + .length = SZ_64M,
> + .tag = "",
> + },
> + {
> + .offset = SZ_256M,
> + .length = SZ_64M,
> + .tag = "deadbeef-cafe-baad-f00d-fedcba987654",
> + },
> +};
> +
> +static int devm_add_sent_extent(struct device *dev, u64 start, u64 length,
> + const char *tag, u16 shared_extn_seq, bool shared)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +
> + lockdep_assert_held(&mdata->ext_lock);
> + return __devm_add_extent(dev, &mdata->dc_sent_extents, start, length,
> + tag, shared_extn_seq, shared);
> +}
> +
> +static int inject_prev_extents(struct device *dev, u64 base_dpa)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + int rc;
> +
> + dev_dbg(dev, "Adding %ld pre-extents for testing\n",
> + ARRAY_SIZE(pre_ext_info));
> +
> + guard(mutex)(&mdata->ext_lock);
> + for (int i = 0; i < ARRAY_SIZE(pre_ext_info); i++) {
> + u64 ext_dpa = base_dpa + pre_ext_info[i].offset;
> + u64 ext_len = pre_ext_info[i].length;
> +
> + dev_dbg(dev, "Adding pre-extent DPA:%#llx LEN:%#llx tag:%s\n",
> + ext_dpa, ext_len, pre_ext_info[i].tag);
> +
> + rc = devm_add_sent_extent(dev, ext_dpa, ext_len,
> + pre_ext_info[i].tag, 0, false);
> + if (rc) {
> + dev_err(dev, "Failed to add pre-extent DPA:%#llx LEN:%#llx; %d\n",
> + ext_dpa, ext_len, rc);
> + return rc;
> + }
> +
> + rc = dc_accept_extent(dev, ext_dpa, ext_len);
> + if (rc)
> + return rc;
> + }
> + return 0;
> +}
> +
> +static int cxl_mock_dc_partition_setup(struct device *dev)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
> + u32 dsmad_handle = 0xFA;
> + u64 decode_length = SZ_512M;
> + u64 block_size = SZ_512;
> + u64 length = SZ_512M;
> + int rc;
> +
> + mutex_init(&mdata->ext_lock);
> + xa_init(&mdata->dc_fm_extents);
> + xa_init(&mdata->dc_sent_extents);
> + xa_init(&mdata->dc_accepted_exts);
> +
> + rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
> + if (rc)
> + return rc;
> +
> + for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> + struct cxl_dc_partition *part = &mdata->dc_partitions[i];
> +
> + dev_dbg(dev, "Creating DC partition DC%d DPA:%#llx LEN:%#llx\n",
> + i, base_dpa, length);
> +
> + part->base = cpu_to_le64(base_dpa);
> + part->decode_length = cpu_to_le64(decode_length /
> + CXL_CAPACITY_MULTIPLIER);
> + part->length = cpu_to_le64(length);
> + part->block_size = cpu_to_le64(block_size);
> + part->dsmad_handle = cpu_to_le32(dsmad_handle);
> + dsmad_handle++;
> +
> + /*
> + * Skip pre-injection on the sharable mock memdev. The
> + * pre-injected extents are untagged / seq=0, which a
> + * sharable partition rejects as firmware-bug; leaving the
> + * sharable memdev with an empty DC partition is what its
> + * dedicated tests (test_shared_extent_inject and
> + * test_seq_integrity_gap in cxl-dcd.sh) expect anyway.
> + *
> + * The sharable fixture is the memdev at pdev->id == 0 —
> + * see the matching MOCK_DC_SHARABLE_SERIAL stamp in
> + * cxl_mock_mem_probe(). This relies on tools/testing/cxl
> + * always allocating a "cxl_mem" platform device with id 0
> + * as the first memdev; if that invariant ever breaks the
> + * sharable test fixture will land on the wrong device.
> + */
> + if (to_platform_device(dev)->id != 0) {
> + rc = inject_prev_extents(dev, base_dpa);
> + if (rc) {
> + dev_err(dev,
> + "Failed to add pre-extents for DC%d\n",
> + i);
> + return rc;
> + }
> + }
> +
> + base_dpa += decode_length;
> + }
> +
> + return 0;
> +}
> +
> static int mock_gsl(struct cxl_mbox_cmd *cmd)
> {
> if (cmd->size_out < sizeof(mock_gsl_payload))
> @@ -1582,6 +1846,193 @@ static int mock_get_supported_features(struct cxl_mockmem_data *mdata,
> return 0;
> }
>
> +static int mock_get_dc_config(struct device *dev,
> + struct cxl_mbox_cmd *cmd)
> +{
> + struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + u8 partition_requested, partition_start_idx, partition_ret_cnt;
> + struct cxl_mbox_get_dc_config_out *resp;
> + int i;
> +
> + partition_requested = min(dc_config->partition_count, NUM_MOCK_DC_REGIONS);
> +
> + if (cmd->size_out < struct_size(resp, partition, partition_requested))
> + return -EINVAL;
> +
> + memset(cmd->payload_out, 0, cmd->size_out);
> + resp = cmd->payload_out;
> +
> + partition_start_idx = dc_config->start_partition_index;
> + partition_ret_cnt = 0;
> + for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> + if (i >= partition_start_idx) {
Should there be a check for partition_requested and exit when reached?
> + memcpy(&resp->partition[partition_ret_cnt],
> + &mdata->dc_partitions[i],
> + sizeof(resp->partition[partition_ret_cnt]));
> + partition_ret_cnt++;
> + }
> + }
> + resp->avail_partition_count = NUM_MOCK_DC_REGIONS;
> + resp->partitions_returned = i;
partitions_returned always ends up as NUM_MOCK_DC_REGIONS because it is set from i rather than partition_ret_cnt. It never takes partition_start_idx into account.
> +
> + dev_dbg(dev, "Returning %d dc partitions\n", partition_ret_cnt);
> + return 0;
> +}
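To illustrate the two comments above on mock_get_dc_config(): a minimal standalone sketch, assuming the intent is to return at most partition_requested entries beginning at the start index and to report the number actually copied. This is a userspace model, not the mock code itself; MOCK_REGIONS, struct part and copy_partitions() are hypothetical names that do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_REGIONS 4	/* hypothetical stand-in for NUM_MOCK_DC_REGIONS */

struct part {
	uint64_t base;
	uint64_t length;
};

/*
 * Copy up to 'requested' partitions beginning at 'start_idx' and return
 * how many were copied, so the caller can report that count instead of
 * the loop index.
 */
static uint8_t copy_partitions(const struct part *src, struct part *dst,
			       uint8_t start_idx, uint8_t requested)
{
	uint8_t copied = 0;

	for (size_t i = start_idx; i < MOCK_REGIONS && copied < requested; i++)
		dst[copied++] = src[i];

	return copied;
}
```

With something like this, partitions_returned would be set from the returned count, and the loop stops once the requested number has been reached.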
> +
> +static int mock_get_dc_extent_list(struct device *dev,
> + struct cxl_mbox_cmd *cmd)
> +{
> + struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_mbox_get_extent_in *get = cmd->payload_in;
> + u32 total_avail = 0, total_ret = 0;
> + struct cxl_extent_data *ext;
> + u32 ext_count, start_idx;
> + unsigned long i;
> +
> + ext_count = le32_to_cpu(get->extent_cnt);
> + start_idx = le32_to_cpu(get->start_extent_index);
> +
> + memset(resp, 0, sizeof(*resp));
> +
> + guard(mutex)(&mdata->ext_lock);
> + /*
> + * Total available needs to be calculated and returned regardless of
> + * how many can actually be returned.
> + */
> + xa_for_each(&mdata->dc_accepted_exts, i, ext)
> + total_avail++;
> +
> + if (start_idx > total_avail)
> + return -EINVAL;
> +
> + xa_for_each(&mdata->dc_accepted_exts, i, ext) {
> + if (total_ret >= ext_count)
> + break;
> +
> + if (total_ret >= start_idx) {
In the case where start_idx > 0, I think we hit an infinite loop.
> + resp->extent[total_ret].start_dpa =
> + cpu_to_le64(ext->dpa_start);
> + resp->extent[total_ret].length =
> + cpu_to_le64(ext->length);
> + export_uuid(resp->extent[total_ret].uuid, &ext->uuid);
> + resp->extent[total_ret].shared_extn_seq =
> + cpu_to_le16(ext->shared_extn_seq);
> + total_ret++;
> + }
> + }
> +
> + resp->returned_extent_count = cpu_to_le32(total_ret);
> + resp->total_extent_count = cpu_to_le32(total_avail);
> + resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
> +
> + dev_dbg(dev, "Returning %d extents of %d total\n",
> + total_ret, total_avail);
> +
> + return 0;
> +}
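On the start_idx handling in mock_get_dc_extent_list(): if I read it right, total_ret serves as both the skip counter and the output index, so with start_idx > 0 the copy branch never runs and zero extents are returned. A caller that advances its start index by the returned count could then spin. A minimal standalone sketch of keeping the two counters separate follows. It is a userspace model in which the for loop stands in for xa_for_each(), and copy_extents() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Copy up to 'max_cnt' entries from 'avail', skipping the first
 * 'start_idx'. 'pos' counts entries visited and 'ret' counts entries
 * copied; keeping them separate is the point of the sketch.
 */
static uint32_t copy_extents(const uint64_t *avail, uint32_t total_avail,
			     uint32_t start_idx, uint32_t max_cnt,
			     uint64_t *out)
{
	uint32_t pos = 0, ret = 0;

	for (uint32_t i = 0; i < total_avail && ret < max_cnt; i++) {
		/* skip entries before the requested start index */
		if (pos++ < start_idx)
			continue;
		out[ret++] = avail[i];
	}

	return ret;
}
```

The output index then always starts at 0 regardless of start_idx, which also keeps resp->extent[] filled from its first slot.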
> +
> +static void dc_clear_sent(struct device *dev)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_extent_data *ext;
> + unsigned long index;
> +
> + lockdep_assert_held(&mdata->ext_lock);
> +
> + /* Any extents not accepted must be cleared */
> + xa_for_each(&mdata->dc_sent_extents, index, ext) {
> + dev_dbg(dev, "Host rejected extent %#llx\n", ext->dpa_start);
> + xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
> + }
> +}
> +
> +static int mock_add_dc_response(struct device *dev,
> + struct cxl_mbox_cmd *cmd)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_mbox_dc_response *req = cmd->payload_in;
> + u32 list_size = le32_to_cpu(req->extent_list_size);
> +
> + guard(mutex)(&mdata->ext_lock);
> + for (int i = 0; i < list_size; i++) {
> + u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
> + u64 length = le64_to_cpu(req->extent_list[i].length);
> + int rc;
> +
> + rc = dc_accept_extent(dev, start, length);
> + if (rc)
> + return rc;
> + }
The mock response seems to be missing an ordering check. CXL r4.0 8.2.10.9.9.3 requires ADD_DC_RESPONSE to be sent in the same order as the Add Capacity Event Records, or else Invalid Input is returned.
> +
> + dc_clear_sent(dev);
> + return 0;
> +}
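For the ordering check on mock_add_dc_response(), a rough standalone sketch, assuming the rule is that accepted extents must keep the relative order in which they were offered and that offered extents may be left out. This is a userspace model; response_in_sent_order() is a hypothetical helper, and the mock would need to remember the order in which extents were sent for something like it to work.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true when every DPA in 'resp' appears in 'sent' and in the same
 * relative order. Offered extents the host did not accept may be skipped.
 */
static bool response_in_sent_order(const uint64_t *sent, size_t nr_sent,
				   const uint64_t *resp, size_t nr_resp)
{
	size_t s = 0;

	for (size_t r = 0; r < nr_resp; r++) {
		/* advance through the sent list until this entry is found */
		while (s < nr_sent && sent[s] != resp[r])
			s++;
		if (s == nr_sent)
			return false;	/* unknown or out-of-order extent */
		s++;
	}

	return true;
}
```

A false result would map to returning -EINVAL from the mock before any extent is accepted.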
> +
> +static void dc_delete_extent(struct device *dev, unsigned long long start,
> + unsigned long long length)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + unsigned long long end = start + length;
> + struct cxl_extent_data *ext;
> + unsigned long index;
> +
> + dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
> +
> + guard(mutex)(&mdata->ext_lock);
> + xa_for_each(&mdata->dc_fm_extents, index, ext) {
> + u64 extent_end = ext->dpa_start + ext->length;
> +
> + /*
> + * Any extent which 'touches' the released delete range will be
> + * removed.
> + */
> + if ((start <= ext->dpa_start && ext->dpa_start < end) ||
> + (start <= extent_end && extent_end < end))
would this work?
if (start < extent_end && ext->dpa_start < end)
> + xa_erase(&mdata->dc_fm_extents, ext->dpa_start);
> + }
> +
> + /*
> + * If the extent was accepted let it be for the host to drop
> + * later.
> + */
> +}
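To back up the suggested condition in dc_delete_extent(), here is a small standalone comparison of the two predicates, treating both ranges as half-open [start, end). The helper names touches_orig() and overlaps() are made up for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Predicate as written in the patch; ext_end and end are exclusive. */
static bool touches_orig(uint64_t start, uint64_t end,
			 uint64_t ext_start, uint64_t ext_end)
{
	return (start <= ext_start && ext_start < end) ||
	       (start <= ext_end && ext_end < end);
}

/* Half-open interval overlap, as suggested above. */
static bool overlaps(uint64_t start, uint64_t end,
		     uint64_t ext_start, uint64_t ext_end)
{
	return start < ext_end && ext_start < end;
}
```

The patch version misses a release range that lies strictly inside an extent, and it matches an extent that merely ends where the release range begins. Whether that adjacency case is wanted depends on what 'touches' is meant to cover in the comment.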
> +
> +static int release_accepted_extent(struct device *dev, u64 start, u64 length)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_extent_data *ext;
> +
> + guard(mutex)(&mdata->ext_lock);
> + ext = xa_load(&mdata->dc_accepted_exts, start);
> + if (!ext || ext->length != length) {
> + dev_err(dev, "Extent %#llx not in accepted state\n", start);
> + return -EINVAL;
> + }
> + xa_erase(&mdata->dc_accepted_exts, start);
> + mdata->dc_ext_generation++;
> +
> + return 0;
> +}
> +
> +static int mock_dc_release(struct device *dev,
> + struct cxl_mbox_cmd *cmd)
> +{
> + struct cxl_mbox_dc_response *req = cmd->payload_in;
> + u32 list_size = le32_to_cpu(req->extent_list_size);
> +
> + for (int i = 0; i < list_size; i++) {
> + u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
> + u64 length = le64_to_cpu(req->extent_list[i].length);
> +
> + dev_dbg(dev, "Extent %#llx released by host\n", start);
> + release_accepted_extent(dev, start, length);
> + }
> +
> + return 0;
> +}
> +
> static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
> struct cxl_mbox_cmd *cmd)
> {
> @@ -1673,6 +2124,18 @@ static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
> case CXL_MBOX_OP_GET_SUPPORTED_FEATURES:
> rc = mock_get_supported_features(mdata, cmd);
> break;
> + case CXL_MBOX_OP_GET_DC_CONFIG:
> + rc = mock_get_dc_config(dev, cmd);
> + break;
> + case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> + rc = mock_get_dc_extent_list(dev, cmd);
> + break;
> + case CXL_MBOX_OP_ADD_DC_RESPONSE:
> + rc = mock_add_dc_response(dev, cmd);
> + break;
> + case CXL_MBOX_OP_RELEASE_DC:
> + rc = mock_dc_release(dev, cmd);
> + break;
> case CXL_MBOX_OP_GET_FEATURE:
> rc = mock_get_feature(mdata, cmd);
> break;
> @@ -1739,6 +2202,14 @@ static void init_event_log(struct mock_event_log *log)
> log->last_handle = 1;
> }
>
> +/*
> + * Stamp this serial on a single mock cxl_mem instance so the
> + * companion cxl_test driver can find it and mark its DC partition
> + * sharable in mock_cxl_endpoint_parse_cdat(). Must match the value
> + * defined in tools/testing/cxl/test/cxl.c.
> + */
> +#define MOCK_DC_SHARABLE_SERIAL 0xDCDCULL
> +
> static int cxl_mock_mem_probe(struct platform_device *pdev)
> {
> struct device *dev = &pdev->dev;
> @@ -1758,6 +2229,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
> return -ENOMEM;
> dev_set_drvdata(dev, mdata);
>
> + rc = cxl_mock_dc_partition_setup(dev);
> + if (rc)
> + return rc;
> +
> mdata->lsa = vmalloc(LSA_SIZE);
> if (!mdata->lsa)
> return -ENOMEM;
> @@ -1774,7 +2249,23 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
> if (rc)
> return rc;
>
> - mds = cxl_memdev_state_create(dev, pdev->id + 1, 0);
> + {
> + u64 serial = pdev->id + 1;
> +
> + /*
> + * Reserve the memdev at pdev->id == 0 as the sharable DC
> + * partition test fixture. This relies on tools/testing/cxl
> + * always allocating a "cxl_mem" platform device with id 0
> + * as the first memdev — currently true in cxl.c, but if
> + * the topology ever renumbers, the sharable serial will be
> + * stamped on the wrong device (or no device). Matched by
> + * the skip-pre-inject guard in cxl_mock_dc_partition_setup
> + * and by mock_cxl_endpoint_parse_cdat in cxl_test.
> + */
> + if (pdev->id == 0)
> + serial = MOCK_DC_SHARABLE_SERIAL;
> + mds = cxl_memdev_state_create(dev, serial, 0);
> + }
I would prefer not to have an inline anonymous block here. Just declare the variable at the top of the function.
> if (IS_ERR(mds))
> return PTR_ERR(mds);
>
> @@ -1814,6 +2305,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
> if (rc)
> return rc;
>
> + if (cxl_dcd_supported(mds))
> + cxl_configure_dcd(mds, &range_info);
> +
> rc = cxl_dpa_setup(cxlds, &range_info);
> if (rc)
> return rc;
> @@ -1921,11 +2415,321 @@ static ssize_t sanitize_timeout_store(struct device *dev,
>
> static DEVICE_ATTR_RW(sanitize_timeout);
>
> +/* Return if the proposed extent would break the test code */
> +static bool new_extent_valid(struct device *dev, size_t new_start,
> + size_t new_len)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_extent_data *extent;
> + size_t new_end, i;
> +
> + if (!new_len)
> + return false;
> +
> + new_end = new_start + new_len;
> +
> + dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
> +
> + guard(mutex)(&mdata->ext_lock);
> + dev_dbg(dev, "Checking extents starts...\n");
> + xa_for_each(&mdata->dc_fm_extents, i, extent) {
> + if (extent->dpa_start == new_start)
> + return false;
> + }
> +
> + dev_dbg(dev, "Checking sent extents starts...\n");
> + xa_for_each(&mdata->dc_sent_extents, i, extent) {
> + if (extent->dpa_start == new_start)
> + return false;
> + }
> +
> + dev_dbg(dev, "Checking accepted extents starts...\n");
> + xa_for_each(&mdata->dc_accepted_exts, i, extent) {
> + if (extent->dpa_start == new_start)
> + return false;
> + }
> +
> + return true;
> +}
> +
> +struct cxl_test_dcd {
> + uuid_t id;
> + struct cxl_event_dcd rec;
> +} __packed;
> +
> +struct cxl_test_dcd dcd_event_rec_template = {
> + .id = CXL_EVENT_DC_EVENT_UUID,
> + .rec = {
> + .hdr = {
> + .length = sizeof(struct cxl_test_dcd),
> + },
> + },
> +};
> +
> +static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
> + u64 start, u64 length, const char *tag_str,
> + u16 shared_extn_seq, bool more)
> +{
> + struct device *dev = mdata->mds->cxlds.dev;
> + struct cxl_test_dcd *dcd_event;
> + uuid_t tag;
> + int rc;
> +
> + dev_dbg(dev, "mock device log event %d\n", type);
> +
> + dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
> + sizeof(*dcd_event), GFP_KERNEL);
> + if (!dcd_event)
> + return -ENOMEM;
> +
> + dcd_event->rec.flags = 0;
> + if (more)
> + dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
> + dcd_event->rec.event_type = type;
> + dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
> + dcd_event->rec.extent.length = cpu_to_le64(length);
> + rc = parse_tag(tag_str, &tag);
> + if (rc) {
> + devm_kfree(dev, dcd_event);
> + return rc;
> + }
> + export_uuid(dcd_event->rec.extent.uuid, &tag);
> + dcd_event->rec.extent.shared_extn_seq = cpu_to_le16(shared_extn_seq);
> +
> + mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
> + (struct cxl_event_record_raw *)dcd_event);
> +
> + /* Fake the irq */
> + cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
> +
> + return 0;
> +}
> +
> +static void mark_extent_sent(struct device *dev, unsigned long long start)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + struct cxl_extent_data *ext;
> +
> + guard(mutex)(&mdata->ext_lock);
> + ext = xa_erase(&mdata->dc_fm_extents, start);
> + if (xa_insert(&mdata->dc_sent_extents, ext->dpa_start, ext, GFP_KERNEL))
> + dev_err(dev, "Failed to mark extent %#llx sent\n", ext->dpa_start);
Should it also clean up 'ext' since insert failed?
DJ
> +}
> +
> +/*
> + * Format <start>:<length>:<tag>:<more_flag>
> + *
> + * start and length must be a multiple of the configured partition block size.
> + * Tag can be any string up to 16 bytes.
> + *
> + * Extents must be exclusive of other extents
> + *
> + * If the more flag is specified it is expected that an additional extent will
> + * be specified without the more flag to complete the test transaction with the
> + * host.
> + */
> +static ssize_t __dc_inject_extent_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count,
> + bool shared)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + unsigned long long start, length, more;
> + char *len_str, *uuid_str, *more_str, *seq_str;
> + u16 shared_extn_seq = 0;
> + size_t buf_len = count;
> + int rc;
> +
> + char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
> + if (!start_str)
> + return -ENOMEM;
> +
> + len_str = strnchr(start_str, buf_len, ':');
> + if (!len_str) {
> + dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
> + return -EINVAL;
> + }
> +
> + *len_str = '\0';
> + len_str += 1;
> + buf_len -= strlen(start_str);
> +
> + uuid_str = strnchr(len_str, buf_len, ':');
> + if (!uuid_str) {
> + dev_err(dev, "Extent failed to find uuid_str: %s\n", len_str);
> + return -EINVAL;
> + }
> + *uuid_str = '\0';
> + uuid_str += 1;
> +
> + more_str = strnchr(uuid_str, buf_len, ':');
> + if (!more_str) {
> + dev_err(dev, "Extent failed to find more_str: %s\n", uuid_str);
> + return -EINVAL;
> + }
> + *more_str = '\0';
> + more_str += 1;
> +
> + /* Optional 5th field: shared_extn_seq. Absent -> 0. */
> + seq_str = strnchr(more_str, buf_len, ':');
> + if (seq_str) {
> + unsigned long long seq;
> +
> + *seq_str = '\0';
> + seq_str += 1;
> + if (kstrtoull(seq_str, 0, &seq) || seq > U16_MAX) {
> + dev_err(dev, "Extent failed to parse seq: %s\n",
> + seq_str);
> + return -EINVAL;
> + }
> + shared_extn_seq = seq;
> + }
> +
> + if (kstrtoull(start_str, 0, &start)) {
> + dev_err(dev, "Extent failed to parse start: %s\n", start_str);
> + return -EINVAL;
> + }
> +
> + if (kstrtoull(len_str, 0, &length)) {
> + dev_err(dev, "Extent failed to parse length: %s\n", len_str);
> + return -EINVAL;
> + }
> +
> + if (kstrtoull(more_str, 0, &more)) {
> + dev_err(dev, "Extent failed to parse more: %s\n", more_str);
> + return -EINVAL;
> + }
> +
> + if (!new_extent_valid(dev, start, length))
> + return -EINVAL;
> +
> + rc = devm_add_fm_extent(dev, start, length, uuid_str, shared_extn_seq,
> + shared);
> + if (rc) {
> + dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
> + start, length, rc);
> + return rc;
> + }
> +
> + mark_extent_sent(dev, start);
> + rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, uuid_str,
> + shared_extn_seq, more);
> + if (rc) {
> + dev_err(dev, "Failed to add event %d\n", rc);
> + return rc;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t dc_inject_extent_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return __dc_inject_extent_store(dev, attr, buf, count, false);
> +}
> +static DEVICE_ATTR_WO(dc_inject_extent);
> +
> +static ssize_t dc_inject_shared_extent_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return __dc_inject_extent_store(dev, attr, buf, count, true);
> +}
> +static DEVICE_ATTR_WO(dc_inject_shared_extent);
> +
> +static ssize_t __dc_del_extent_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count,
> + enum dc_event type)
> +{
> + struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> + unsigned long long start, length;
> + char *len_str, *uuid_str;
> + size_t buf_len = count;
> + int rc;
> +
> + char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
> + if (!start_str)
> + return -ENOMEM;
> +
> + len_str = strnchr(start_str, buf_len, ':');
> + if (!len_str) {
> + dev_err(dev, "Failed to find len_str: %s\n", start_str);
> + return -EINVAL;
> + }
> + *len_str = '\0';
> + len_str += 1;
> + buf_len -= strlen(start_str);
> +
> + uuid_str = strnchr(len_str, buf_len, ':');
> + if (!uuid_str) {
> + dev_err(dev, "Failed to find uuid_str: %s\n", len_str);
> + return -EINVAL;
> + }
> + *uuid_str = '\0';
> + uuid_str += 1;
> + /*
> + * uuid_str is the trailing field; trim shell-added '\n' so
> + * parse_tag()/uuid_parse() see a clean string.
> + */
> + uuid_str = strim(uuid_str);
> +
> + if (kstrtoull(start_str, 0, &start)) {
> + dev_err(dev, "Failed to parse start: %s\n", start_str);
> + return -EINVAL;
> + }
> +
> + if (kstrtoull(len_str, 0, &length)) {
> + dev_err(dev, "Failed to parse length: %s\n", len_str);
> + return -EINVAL;
> + }
> +
> + dc_delete_extent(dev, start, length);
> +
> + if (type == DCD_FORCED_CAPACITY_RELEASE)
> + dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
> + start, length);
> +
> + rc = log_dc_event(mdata, type, start, length, uuid_str, 0, false);
> + if (rc) {
> + dev_err(dev, "Failed to add event %d\n", rc);
> + return rc;
> + }
> +
> + return count;
> +}
> +
> +/*
> + * Format <start>:<length>:<uuid>
> + */
> +static ssize_t dc_del_extent_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return __dc_del_extent_store(dev, attr, buf, count,
> + DCD_RELEASE_CAPACITY);
> +}
> +static DEVICE_ATTR_WO(dc_del_extent);
> +
> +static ssize_t dc_force_del_extent_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return __dc_del_extent_store(dev, attr, buf, count,
> + DCD_FORCED_CAPACITY_RELEASE);
> +}
> +static DEVICE_ATTR_WO(dc_force_del_extent);
> +
> static struct attribute *cxl_mock_mem_attrs[] = {
> &dev_attr_security_lock.attr,
> &dev_attr_event_trigger.attr,
> &dev_attr_fw_buf_checksum.attr,
> &dev_attr_sanitize_timeout.attr,
> + &dev_attr_dc_inject_extent.attr,
> + &dev_attr_dc_inject_shared_extent.attr,
> + &dev_attr_dc_del_extent.attr,
> + &dev_attr_dc_force_del_extent.attr,
> NULL
> };
> ATTRIBUTE_GROUPS(cxl_mock_mem);
* [PATCH v10 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (29 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 30/31] tools/testing/cxl: Add DC Regions to mock mem data Anisa Su
@ 2026-05-23 9:43 ` Anisa Su
2026-05-27 18:51 ` [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Dave Jiang
2026-06-05 5:35 ` Alison Schofield
32 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-23 9:43 UTC (permalink / raw)
To: linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Dave Jiang, Vishal Verma, Ira Weiny, Alison Schofield,
John Groves, Gregory Price, Anisa Su
Extend the CXL and DAX driver-api documentation to cover Dynamic
Capacity Devices.
cxl-driver.rst gains a "Dynamic Capacity Extents" section describing
the conditions under which the CXL core accepts an offered extent
(per-extent: region resolution, full ED-range containment,
no-overlap, duplicate tolerance; per-tag-group: host-wide tag-uuid
uniqueness, sequence-number integrity, partition equality,
alignment) and the conditions under which a release request is
honoured (DPA-range containment in some member, tag match,
DAX-layer EBUSY deferral, whole-tag-group release). The host-wide
uniqueness gate is enforced by the cxl_tag_register registry in
drivers/cxl/core/extent.c. For sequence numbers the doc spells out
both regimes — device-stamped 1..n on sharable allocations and
host-assigned arrival-order 1..n (via cxl_add_pending's
logical_seq) on non-sharable allocations — and notes that the DAX
layer sees one unified 1..n dense invariant.
dax-driver.rst gains a "Dynamic Capacity (DC) Regions" section
that lays out the four-object layering device extent → dc_extent →
dax_resource → DAX device, with cardinalities: one tagged
allocation maps to one cxl_dc_tag_group containing N dc_extents and
N dax_resources, claimed into one DAX device with N range entries
in seq_num order; an untagged Add delivery becomes its own
single-member group. Each dc_extent carries its own hpa_range —
there is no aggregated bounding-box range across siblings.
Tag-based DAX device creation, DC-only sizing rules (no grow,
size=0 to destroy), and the uuid attribute semantics are documented
alongside.
Signed-off-by: Anisa Su <anisa.su@samsung.com>
---
.../driver-api/cxl/linux/cxl-driver.rst | 149 ++++++++++++++++
.../driver-api/cxl/linux/dax-driver.rst | 167 ++++++++++++++++++
2 files changed, 316 insertions(+)
diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
index dd6dd17dc536..cb08fc536da8 100644
--- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -619,6 +619,155 @@ from HPA to DPA. This is why they must be aware of the entire interleave set.
Linux does not support unbalanced interleave configurations. As a result, all
endpoints in an interleave set must have the same ways and granularity.
+Dynamic Capacity Extents
+========================
+
+A `Dynamic Capacity Device (DCD)` advertises capacity in `DC partitions`
+and surfaces individual chunks of that capacity to the host as `extents`.
+The device may add an extent at any time (a `pending add`) and may
+request that a previously accepted extent be released (a `pending
+release`). Each transition is mediated by a mailbox handshake whose
+state machine the CXL driver enforces in
+:code:`drivers/cxl/core/{mbox.c,extent.c}`.
+
+Extents that share a non-null tag form one logical allocation. Each
+surviving member becomes its own :code:`struct dc_extent` (per-extent
+sysfs device, per-extent HPA range); their containing tag group is an
+internal-only :code:`struct cxl_dc_tag_group` keyed by UUID with no
+sysfs identity. Each :code:`dc_extent` becomes one
+:code:`dax_resource` on the DAX side, and a tagged DAX device is built
+by claiming every :code:`dax_resource` that carries the tag.
+
+For DAX-side semantics — how accepted extents materialize into
+:code:`dax_resource` objects and DAX devices — see
+:doc:`dax-driver`.
+
+Accepting Extents
+-----------------
+Extents are made available to the host from the device through DC ADD events.
+Event records contain extents, which may be tagged or untagged, shared or
+not shared. Multiple event records can be chained together by the `More` flag.
+
+The unit of allocation is a `tag`. All extents
+sharing a tag form one allocation; the More flag is a delivery boundary
+only, meaning when the More chain ends, the host can assume that all extents
+have been collected for each tag.
+A tag may be the null UUID (an `untagged` allocation, valid in
+non-sharable regions) or a non-null UUID identifying a sharable or
+non-sharable allocation.
+
+When a `More`-terminated chain of pending adds closes, the driver
+processes the pending list one tag group at a time. A group is
+committed only if it passes every gate below; failing any gate drops
+the entire group with a firmware-bug warning, and the dropped extents
+do not appear in the :code:`ADD_DC_RESPONSE`. There is no
+partial-extent acceptance — either an offered extent is accepted whole
+or it is dropped whole.
+
+Per-extent gates (applied in :code:`cxl_add_extent`,
+:code:`drivers/cxl/core/extent.c`):
+
+* The extent's DPA range must resolve to a CXL region via
+ :code:`cxl_dpa_to_region()`. An extent with no owning region is
+ dropped; the device sees the omission from :code:`ADD_DC_RESPONSE`.
+* The extent's DPA range must be `fully contained` in the endpoint
+ decoder's DPA range. An extent that straddles the decoder boundary
+ is rejected with :code:`-ENXIO`; the driver never clips an extent to
+ fit.
+* The extent must not overlap an extent already present in the same
+ region. Overlap classification is done in
+ :code:`cxlr_dax_classify_extent()` using :code:`range_overlaps()`.
+ Exact duplicates of a previously-accepted range are tolerated —
+ accepting the same range twice is a no-op, which simplifies
+ probe-time scans of the device's existing accepted list.
+
+Per-group gates (applied in :code:`cxl_add_pending`,
+:code:`drivers/cxl/core/mbox.c`):
+
+* `Host-wide tag uniqueness`: a non-null tag must not already
+ correspond to a live :code:`cxl_dc_tag_group` anywhere on this host.
+ The orchestrator (FM) owns tag-UUID allocation per spec; the
+ registry in :code:`drivers/cxl/core/extent.c`
+ (:code:`cxl_tag_register` / :code:`cxl_tag_already_committed`)
+ catches firmware bugs and orchestrator misbehavior across every
+ region and memdev. Skipped for the null UUID, which has no
+ cross-chain identity.
+* `Sequence-number integrity`: every member must carry the wire
+ field :code:`shared_extn_seq == 0` (non-sharable allocation), or
+ the group's sorted sequence numbers must be exactly
+ :code:`1, 2, …, n` (sharable allocation). Mixed, gapped,
+ duplicate, or non-zero-but-not-starting-at-1 sets are rejected.
+* `Partition equality`: every tagged extent in the group must
+ resolve to the same DC partition. A single allocation cannot span
+ partitions because CDAT describes sharable / writable / coherency
+ attributes per-partition. Skipped for the null UUID.
+* `Alignment`: every extent's :code:`start_dpa` and :code:`length`
+ must be :code:`CXL_DCD_EXTENT_ALIGN`-aligned. Partial acceptance
+ of an aligned subset would leave an unusable DAX device, so the
+ group is dropped instead.
+
+Surviving extents are sorted by the wire field
+:code:`shared_extn_seq` — stable, so arrival order is preserved for
+the all-zero non-sharable case — and each becomes a
+:code:`dc_extent` inserted into a fresh :code:`cxl_dc_tag_group`
+keyed by the group's UUID. Each :code:`dc_extent` carries its own
+:code:`hpa_range`; the tag group itself has no aggregate range.
+
+As each surviving extent is attached the host assigns it a 1..n
+:code:`seq_num`: for sharable allocations this equals the
+device-stamped :code:`shared_extn_seq` directly; for non-sharable
+allocations the device sends :code:`shared_extn_seq == 0` and the
+host fills in the arrival-order position (see :code:`logical_seq` in
+:code:`cxl_add_pending`). The DAX layer enforces the same
+:code:`1..n` dense invariant in both cases.
+
+The tag group is then brought online via :code:`online_tag_group()`,
+which registers every member :code:`dc_extent` as an
+:code:`extentX.Y` child of :code:`cxlr_dax->dev`. The DAX layer is
+notified with :code:`DCD_ADD_CAPACITY`, and the accepted extents are
+spliced into the response list for a single :code:`ADD_DC_RESPONSE`
+mailbox command per More-chain.
+
+Releasing Extents
+-----------------
+
+A release may be initiated by the device (a pending release
+notification) or by the host (when destroying a DAX device or tearing
+down a region). Both paths converge on :code:`cxl_rm_extent`
+(:code:`drivers/cxl/core/extent.c`).
+
+Per-extent gates:
+
+* The DPA range must resolve to a CXL region. If it does not — for
+ example, an extent left over from a host crash that has not yet
+ been re-claimed, or a duplicate release racing region teardown —
+ the release is acknowledged via :code:`memdev_release_extent()` so
+ the device knows the host is not using the capacity, and the
+ operation returns :code:`-ENXIO`.
+* The DPA range must be `fully contained` in some member
+ :code:`dc_extent`'s :code:`dpa_range` on the region's
+ :code:`cxlr_dax`, and the tag (UUID) on that member's
+ :code:`cxl_dc_tag_group` must match the release request. Releases
+ are keyed by :code:`(DPA range, tag)` rather than by pointer
+ because the device, not the host, supplies the identity. A
+ request that matches no :code:`dc_extent` is rejected with
+ :code:`-EINVAL`.
+
+If those gates pass, the DAX layer is notified with
+:code:`DCD_RELEASE_CAPACITY` and consulted for permission to proceed.
+If the DAX layer returns :code:`-EBUSY` — the capacity is still mapped
+or otherwise in use — the release is deferred and
+:code:`cxl_rm_extent` returns success without unregistering anything.
+When the DAX layer ultimately grants release,
+:code:`rm_tag_group()` invalidates the backing memregion once for the
+whole group, then unregisters every member :code:`dc_extent` device,
+which cascades through the DAX layer to drop the corresponding
+:code:`dax_resource`\ s.
+
+The release path is always whole-tag-group: tagged allocations
+release atomically, and the kernel does not split a group in response
+to a sub-range release request.
+
Example Configurations
======================
.. toctree::
diff --git a/Documentation/driver-api/cxl/linux/dax-driver.rst b/Documentation/driver-api/cxl/linux/dax-driver.rst
index 10d953a2167b..07f08396f639 100644
--- a/Documentation/driver-api/cxl/linux/dax-driver.rst
+++ b/Documentation/driver-api/cxl/linux/dax-driver.rst
@@ -27,6 +27,173 @@ CXL capacity in the task's page tables.
Users wishing to manually handle allocation of CXL memory should use this
interface.
+Dynamic Capacity (DC) Regions
+=============================
+A region backed by a CXL `Dynamic Capacity Device (DCD)` is a `DC region`:
+its HPA window is fixed at probe time, but the DPA capacity that fills the
+window arrives and departs at runtime as the device offers and reclaims
+`extents`. DC regions are distinguished from static regions by the
+:code:`IORESOURCE_DAX_DCD` flag on the :code:`dax_region`.
+
+For the CXL-side rules governing when an offered extent is accepted or a
+release request is honoured, see :doc:`cxl-driver`. This section covers
+the DAX-side mapping between accepted extents and DAX devices.
+
+The Extent Layering Model
+-------------------------
+Four objects sit between the wire-level CXL extent and the
+user-visible DAX device. Understanding the cardinality between them
+is the key to the DC-region model.
+
+::
+
+ device extents dc_extent dax_resource DAX device
+ (CXL device) (CXL core) (DAX bus) (/dev/daxN.Y)
+ ------------- ------------- ------------- ------------
+ e1 ─┐ ┌─► dc_e1 ──► res_1 (seq=1) ──┐
+ e2 ─┼─── tag A ──► ┼─► dc_e2 ──► res_2 (seq=2) ──┼──► daxN.0
+ e3 ─┘ └─► dc_e3 ──► res_3 (seq=3) ──┘ (claimed by tag A,
+ size = Σ |e_i|)
+
+ e4 ─── tag B ────► dc_e4 ──► res_4 (seq=1) ────► daxN.1
+
+ e5 ─── null tag ─► dc_e5 ──► res_5 (seq=0) ────► daxN.2
+ e6 ─── null tag ─► dc_e6 ──► res_6 (seq=0) ────► daxN.3
+
+The CXL core groups extents sharing a non-null tag into a single
+:code:`cxl_dc_tag_group` (internal-only, no sysfs identity), but each
+member extent stays a distinct :code:`dc_extent` with its own HPA
+range. The DAX bridge creates one :code:`dax_resource` per
+:code:`dc_extent`, and userspace claims a DAX device by writing the
+tag's UUID to the seed device's :code:`uuid` attribute, which carves
+every matching :code:`dax_resource` (in :code:`seq_num` order) into
+the device's :code:`ranges[]` array.
+
+`Device extent`
+ The unit the CXL device delivers over the mailbox: a
+ :code:`(DPA, length, tag, shared_extn_seq)` tuple inside an
+ Add-Capacity event. The tag is either a non-null UUID (a
+ `tagged allocation`) or the null UUID (`untagged`).
+
+:code:`dc_extent`
+ The CXL core's per-extent object, one per surviving device extent.
+ Each :code:`dc_extent` is registered as its own :code:`extentX.Y`
+ sysfs device under :code:`cxlr_dax->dev` and carries its own
+ :code:`hpa_range` — there is no aggregated / bounding-box HPA
+ range across siblings. Members of one tag group point at a
+ shared :code:`cxl_dc_tag_group` (which holds the UUID and a
+ manual refcount on the surviving siblings) but otherwise exist as
+ independent kernel objects.
+
+ For a `non-null tag`, the host-wide tag-uniqueness gate
+ (:doc:`cxl-driver`) guarantees there is at most one
+ :code:`cxl_dc_tag_group` per UUID on the host, so the set of
+ :code:`dc_extent`\ s sharing that UUID is a single allocation.
+
+ For the `null tag` there is no cross-event identity — the spec is
+ silent on aggregating untagged extents across Add-Capacity events.
+ Each untagged device extent becomes its own :code:`dc_extent` in
+ its own single-member tag group; two untagged extents delivered
+ separately are two distinct allocations.
+
+:code:`dax_resource`
+ The DAX bus's per-extent view, one-to-one with :code:`dc_extent`.
+ When the CXL DAX driver receives a :code:`DCD_ADD_CAPACITY`
+ notification it iterates the tag group and calls
+ :code:`dax_region_add_resource()` once per member, creating one
+ :code:`dax_resource` per :code:`dc_extent`. Each
+ :code:`dax_resource` carries that member's HPA range, the tag
+ UUID (copied from :code:`dc_extent->group->uuid`), and a 1..n
+ :code:`seq_num` so :code:`uuid_claim_tagged` can carve the matched
+ set into the device's :code:`ranges[]` array in the right order
+ (see :code:`drivers/dax/bus.c`).
+
+`DAX device` (:code:`/dev/daxN.Y`)
+ Created by userspace claiming a set of :code:`dax_resource`\ s via
+ the :code:`uuid` sysfs attribute. Each DAX device corresponds to
+ exactly one allocation:
+
+ * A `tagged` DAX device is built from every :code:`dax_resource`
+ carrying the tag — one per :code:`dc_extent` in the allocation
+ — carved into the device's :code:`ranges[]` in :code:`seq_num`
+ order. Its size equals the sum of every member's size.
+ * An `untagged` DAX device is built from one untagged
+ :code:`dax_resource` and its size equals that one extent.
+
+So the end-to-end rule is: **one tagged allocation = one
+cxl_dc_tag_group = N dc_extents = N dax_resources = one DAX device
+with N range entries**. An untagged device extent becomes its own
+:code:`dc_extent` / :code:`dax_resource` / single-range DAX device,
+claimed one at a time.
+
+Release follows the same layering in reverse. When the CXL core
+calls :code:`rm_tag_group()` (after the device asks for release and
+the DAX layer consents), the DAX bridge collects every matching
+:code:`dax_resource` and removes them as a set via
+:code:`dax_region_rm_resources()`. The removal is all-or-nothing
+under :code:`dax_region_rwsem`: if any member is in use, the whole
+group stays. When removal commits, the HPA capacity returns to the
+region's free pool and any DAX device that had claimed it is left
+with no backing capacity. Userspace tears the DAX device down via
+:code:`daxctl destroy-device` (size=0, then write the device name to
+the region's :code:`delete` attribute).
+
+UUID-Based DAX Device Creation
+------------------------------
+A DAX device on a DC region is created by writing a UUID to the
+seed device's :code:`uuid` attribute
+(:code:`/sys/bus/dax/devices/daxN.Y/uuid`). The seed starts at
+size 0; writing :code:`uuid` is a `claim` operation that resolves
+the layering above and populates the device:
+
+* A `non-null UUID` claims `every` :code:`dax_resource` whose tag
+ matches. :code:`uuid_claim_tagged` (in
+ :code:`drivers/dax/bus.c`) collects them, sorts by
+ :code:`seq_num`, enforces the dense :code:`1..n` invariant, and
+ carves each via :code:`__dev_dax_resize` in :code:`seq_num` order
+ so the device's :code:`ranges[]` array is dense and ordered.
+ The resulting DAX device represents exactly the tagged
+ allocation: its size equals the sum of every member extent's
+ size.
+
+ The dense :code:`1..n` invariant is the unified rule the CXL
+ side maintains for both sharable and non-sharable allocations
+ (see :doc:`cxl-driver`); the match set has exactly one entry per
+ :code:`dc_extent` in the tag group.
+
+* The value :code:`"0"` is shorthand for the null UUID and claims
+ exactly `one` untagged :code:`dax_resource`. Untagged
+ :code:`dax_resource`\ s correspond to independent untagged
+ allocations; collapsing several into one device would aggregate
+ unrelated capacity, so each :code:`uuid` write consumes a single
+ untagged resource.
+
+* A write that matches no :code:`dax_resource` returns
+ :code:`-ENOENT` and the device remains at size 0.
+
+* Writes to the :code:`uuid` attribute on non-DC regions return
+ :code:`-EOPNOTSUPP`; the attribute itself is read-only (0444) on
+ non-DC devices.
+
+The device's size is determined entirely by the backing allocation:
+users do not choose a size on DC regions. Accordingly, the
+:code:`size` attribute on a DC DAX device rejects grow requests
+with :code:`-EOPNOTSUPP`. Writing :code:`0` is still permitted and is
+how :code:`daxctl destroy-device` returns each claimed extent to the
+region's available pool before the device's name is written to the
+region's :code:`delete` attribute.
+
+Reads of :code:`uuid` report the tag identifying the capacity
+backing the device:
+
+* For a non-null-UUID-claimed DC DAX device, :code:`uuid` reads
+ back the claimed UUID.
+* For a DC DAX device claimed via :code:`"0"`, or for any
+ non-DC DAX device, :code:`uuid` reads :code:`0`.
+
+See :code:`Documentation/ABI/testing/sysfs-bus-dax` for the
+authoritative attribute contracts.
+
kmem conversion
===============
The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
--
2.43.0
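
The tagged-claim rule documented in the patch above (collect every
dax_resource whose tag matches, sort by seq_num, enforce the dense 1..n
invariant, then carve each member so the device size is the sum of the
member sizes) can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not the code in
drivers/dax/bus.c: struct model_res, cmp_seq() and model_claim_tagged()
are hypothetical names, and returning -EINVAL for a non-dense sequence is
an assumption, since the documentation only specifies -ENOENT for the
no-match case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a dax_resource; not the kernel struct. */
struct model_res {
	uint8_t tag[16];	/* tag UUID; all zero would mean untagged */
	uint32_t seq_num;	/* position within the tag group, 1..n */
	uint64_t size;		/* bytes covered by this extent */
	int claimed;		/* nonzero once carved into a device */
};

/* qsort() comparator over an array of struct model_res pointers. */
static int cmp_seq(const void *a, const void *b)
{
	struct model_res *const *ra = a, *const *rb = b;

	if ((*ra)->seq_num < (*rb)->seq_num)
		return -1;
	return (*ra)->seq_num > (*rb)->seq_num;
}

/*
 * Model of a tagged claim. On success returns 0 and stores the summed
 * size of every matching resource in *size_out. Returns -ENOENT when
 * nothing matches. Returns -EINVAL (an assumption of this sketch) when
 * the matched seq_nums are not exactly 1..n.
 */
static int model_claim_tagged(struct model_res *res, size_t n,
			      const uint8_t tag[16], uint64_t *size_out)
{
	struct model_res **match;
	uint64_t total = 0;
	size_t i, nr = 0;

	assert(res != NULL || n == 0);

	match = calloc(n ? n : 1, sizeof(*match));
	if (!match)
		return -ENOMEM;

	/* Collect every unclaimed resource carrying the tag. */
	for (i = 0; i < n; i++)
		if (!res[i].claimed && !memcmp(res[i].tag, tag, 16))
			match[nr++] = &res[i];

	if (!nr) {
		free(match);
		return -ENOENT;
	}

	qsort(match, nr, sizeof(*match), cmp_seq);

	/* Enforce the dense 1..n invariant before touching any state. */
	for (i = 0; i < nr; i++) {
		if (match[i]->seq_num != i + 1) {
			free(match);
			return -EINVAL;
		}
	}

	/* Carve in seq_num order; the device size is the sum of members. */
	for (i = 0; i < nr; i++) {
		match[i]->claimed = 1;
		total += match[i]->size;
	}

	free(match);
	*size_out = total;
	return 0;
}
```

The invariant is checked before any resource is marked claimed, so a
failed claim leaves the model unchanged, which mirrors the documented
behaviour that the seed device stays at size 0 when a write fails.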
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (30 preceding siblings ...)
2026-05-23 9:43 ` [PATCH v10 31/31] Documentation/cxl: Document DCD extent handling and DC-backed DAX regions Anisa Su
@ 2026-05-27 18:51 ` Dave Jiang
2026-05-30 0:16 ` Anisa Su
2026-06-05 5:35 ` Alison Schofield
32 siblings, 1 reply; 71+ messages in thread
From: Dave Jiang @ 2026-05-27 18:51 UTC (permalink / raw)
To: Anisa Su, linux-cxl, linux-kernel
Cc: nvdimm, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
Vishal Verma, Ira Weiny, Alison Schofield, John Groves,
Gregory Price, Anisa Su
On 5/23/26 2:42 AM, Anisa Su wrote:
<-- snip -->
> Series Info
> =============
> The series builds on top of cxl-next with the famfs-v9 patchset
> applied.
Hi Anisa,
Just for future reference, I would prefer that you base off of latest Linus tags for future patch submissions. i.e. for this week it should be v7.1-rc5. I would prefer that it does not base on cxl/next unless instructed otherwise. Thankfully this series applies cleanly to v7.1-rc5 so it's not a concern. Thanks!
DJ
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-05-27 18:51 ` [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Dave Jiang
@ 2026-05-30 0:16 ` Anisa Su
0 siblings, 0 replies; 71+ messages in thread
From: Anisa Su @ 2026-05-30 0:16 UTC (permalink / raw)
To: Dave Jiang
Cc: Anisa Su, linux-cxl, linux-kernel, nvdimm, Dan Williams,
Jonathan Cameron, Davidlohr Bueso, Vishal Verma, Ira Weiny,
Alison Schofield, John Groves, Gregory Price
On Wed, May 27, 2026 at 11:51:13AM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:42 AM, Anisa Su wrote:
>
> <-- snip -->
>
> > Series Info
> > =============
> > The series builds on top of cxl-next with the famfs-v9 patchset
> > applied.
>
> Hi Anisa,
> Just for future reference, I would prefer that you base off of latest Linus tags for future patch submissions. i.e. for this week it should be v7.1-rc5. I would prefer that it does not base on cxl/next unless instructed otherwise. Thankfully this series applies cleanly to v7.1-rc5 so it's not a concern. Thanks!
>
> DJ
>
>
Noted, will remember for next time!
Thanks,
Anisa
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD)
2026-05-23 9:42 [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Anisa Su
` (31 preceding siblings ...)
2026-05-27 18:51 ` [PATCH v10 00/31] DCD: Add support for Dynamic Capacity Devices (DCD) Dave Jiang
@ 2026-06-05 5:35 ` Alison Schofield
32 siblings, 0 replies; 71+ messages in thread
From: Alison Schofield @ 2026-06-05 5:35 UTC (permalink / raw)
To: Anisa Su
Cc: linux-cxl, linux-kernel, nvdimm, Dan Williams, Jonathan Cameron,
Davidlohr Bueso, Dave Jiang, Vishal Verma, Ira Weiny, John Groves,
Gregory Price, Anisa Su
On Sat, May 23, 2026 at 02:42:54AM -0700, Anisa Su wrote:
> Table of Contents
> ==================
> - Use Case
> - LSFMM`26 Discussion
> - Updated Design Overview
> - DC Add
> - DC Release
> - Series Info
> - Changes from v9
> - CXL Layer Changes
> - DAX Layer Changes
> - Testing
> - References
snip
The cover letter is missing the list of patches, diffstat, and
base commit which are all usually at the tail end here.
I noticed because I went looking for the diffstat to see if there
might be a Kconfig change. Didn't find that info here, but very
happy that this seems to require no new config parameter.
Please bring those sections back in next rev.
I did fire it all up and will post my first findings in reply to
the ndctl patch.
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 71+ messages in thread