* [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
@ 2025-11-20 3:19 Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Smita Koralahalli
` (10 more replies)
0 siblings, 11 replies; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
This series aims to address long-standing conflicts between HMEM and
CXL when handling Soft Reserved memory ranges.
Reworked from Dan's patch:
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/patch/?id=ab70c6227ee6165a562c215d9dcb4a1c55620d5d
Previous work:
https://lore.kernel.org/all/20250715180407.47426-1-Smita.KoralahalliChannabasappa@amd.com/
Link to v3:
https://lore.kernel.org/all/20250930044757.214798-1-Smita.KoralahalliChannabasappa@amd.com
This series should be applied on top of:
"214291cbaace: acpi/hmat: Fix lockdep warning for hmem_register_resource()"
and is based on:
base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
I initially tried picking up the three probe ordering patches from v20/v21
of Type 2 support, but I hit a NULL pointer dereference in
devm_cxl_add_memdev() and a cyclic dependency when all the patches were
applied, so I left them out for now. With my current series rebased on
6.18-rc2 plus 214291cbaace, probe ordering behaves correctly on AMD systems
and I have verified the scenarios mentioned below. I can pull those three
patches back in for a future revision once the failures are sorted out.
Probe order patches of interest:
cxl/mem: refactor memdev allocation
cxl/mem: Arrange for always-synchronous memdev attach
cxl/port: Arrange for always synchronous endpoint attach
[1] Hotplug looks okay. After offlining the memory I can tear down the
regions and recreate them when CXL owns the entire Soft Reserved range,
since the Soft Reserved entry is gone from /proc/iomem. dax_cxl creates
dax devices and onlines the memory.
850000000-284fffffff : CXL Window 0
850000000-284fffffff : region0
850000000-284fffffff : dax0.0
850000000-284fffffff : System RAM (kmem)
[2] With CONFIG_CXL_REGION disabled, all the resources are handled by
HMEM. The Soft Reserved range shows up in /proc/iomem, no regions come up,
and dax devices are created by HMEM.
850000000-284fffffff : CXL Window 0
850000000-284fffffff : Soft Reserved
850000000-284fffffff : dax0.0
850000000-284fffffff : System RAM (kmem)
[3] Region assembly failures also behave okay and work the same as [2].
Before:
2850000000-484fffffff : Soft Reserved
2850000000-484fffffff : CXL Window 1
2850000000-484fffffff : dax4.0
2850000000-484fffffff : System RAM (kmem)
After tearing down dax4.0 and recreating it:
Logs:
[ 547.847764] unregister_dax_mapping: mapping0: unregister_dax_mapping
[ 547.855000] trim_dev_dax_range: dax dax4.0: delete range[0]: 0x2850000000:0x484fffffff
[ 622.474580] alloc_dev_dax_range: dax dax4.1: alloc range[0]: 0x0000002850000000:0x000000484fffffff
[ 752.766194] Fallback order for Node 0: 0 1
[ 752.766199] Fallback order for Node 1: 1 0
[ 752.766200] Built 2 zonelists, mobility grouping on. Total pages: 8096220
[ 752.783234] Policy zone: Normal
[ 752.808604] Demotion targets for Node 0: preferred: 1, fallback: 1
[ 752.815509] Demotion targets for Node 1: null
After:
2850000000-484fffffff : Soft Reserved
2850000000-484fffffff : CXL Window 1
2850000000-484fffffff : dax4.1
2850000000-484fffffff : System RAM (kmem)
[4] A small hack to tear down the fully assembled and probed region
(i.e. a region in the committed state) for range 850000000-284fffffff.
This is to test the region teardown path for regions which don't
fully cover the Soft Reserved range.
850000000-284fffffff : Soft Reserved
850000000-284fffffff : CXL Window 0
850000000-284fffffff : dax5.0
850000000-284fffffff : System RAM (kmem)
2850000000-484fffffff : CXL Window 1
2850000000-484fffffff : region1
2850000000-484fffffff : dax1.0
2850000000-484fffffff : System RAM (kmem)
4850000000-684fffffff : CXL Window 2
4850000000-684fffffff : region2
4850000000-684fffffff : dax2.0
4850000000-684fffffff : System RAM (kmem)
daxctl list -R -u
[
{
"path":"\/platform\/ACPI0017:00\/root0\/decoder0.1\/region1\/dax_region1",
"id":1,
"size":"128.00 GiB (137.44 GB)",
"align":2097152
},
{
"path":"\/platform\/hmem.5",
"id":5,
"size":"128.00 GiB (137.44 GB)",
"align":2097152
},
{
"path":"\/platform\/ACPI0017:00\/root0\/decoder0.2\/region2\/dax_region2",
"id":2,
"size":"128.00 GiB (137.44 GB)",
"align":2097152
}
]
I couldn't test multiple regions under the same Soft Reserved range,
with or without contiguous mapping, due to limited BIOS support. Hopefully
that works.
v4 updates:
- No changes to patches 1-3.
- New patches 4-7.
- handle_deferred_cxl() has been enhanced to handle the case where CXL
regions do not contiguously and fully cover Soft Reserved ranges.
- Support added to defer cxl_dax registration.
- Support added to tear down CXL regions.
v3 updates:
- Fixed two "From:" attributions.
v2 updates:
- Removed conditional check on CONFIG_EFI_SOFT_RESERVE as dax_hmem
depends on CONFIG_EFI_SOFT_RESERVE. (Zhijian)
- Added TODO note. (Zhijian)
- Included region_intersects_soft_reserve() inside CONFIG_EFI_SOFT_RESERVE
conditional check. (Zhijian)
- insert_resource_late() -> insert_resource_expand_to_fit() and
__insert_resource_expand_to_fit() replacement. (Boris)
- Fixed Co-developed and Signed-off by. (Dan)
- Combined 2/6 and 3/6 into a single patch. (Zhijian).
- Skip local variable in remove_soft_reserved. (Jonathan)
- Drop kfree with __free(). (Jonathan)
- return 0 -> return dev_add_action_or_reset(host...) (Jonathan)
- Dropped 6/6.
- Reviewed-by tags (Dave, Jonathan)
Dan Williams (4):
dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is
ready
dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved
ranges
dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL
windows
Smita Koralahalli (5):
cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with
cxl_regions_fully_map()
cxl/region: Add register_dax flag to control probe-time devdax setup
cxl/region, dax/hmem: Register devdax only when CXL owns Soft Reserved
span
cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft
Reserved
dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree
arch/x86/kernel/e820.c | 2 +-
drivers/cxl/acpi.c | 2 +-
drivers/cxl/core/region.c | 181 ++++++++++++++++++++++++++++++++++++--
drivers/cxl/cxl.h | 17 ++++
drivers/dax/Kconfig | 2 +
drivers/dax/hmem/device.c | 4 +-
drivers/dax/hmem/hmem.c | 137 ++++++++++++++++++++++++++---
include/linux/ioport.h | 13 ++-
kernel/resource.c | 92 ++++++++++++++++---
9 files changed, 415 insertions(+), 35 deletions(-)
base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
--
2.17.1
* [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-12-02 22:19 ` dan.j.williams
2025-12-02 23:31 ` Dave Jiang
2025-11-20 3:19 ` [PATCH v4 2/9] dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges Smita Koralahalli
` (9 subsequent siblings)
10 siblings, 2 replies; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
From: Dan Williams <dan.j.williams@intel.com>
Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
instead of the iomem_resource tree at boot. Delay publishing these ranges
into the iomem hierarchy until ownership is resolved and the HMEM path
is ready to consume them.
Publishing Soft Reserved ranges into iomem too early conflicts with CXL
hotplug and prevents region assembly when those ranges overlap CXL
windows.
Follow-up patches will reinsert Soft Reserved ranges into iomem after CXL
window publication is complete and HMEM is ready to claim the memory. This
provides a cleaner handoff between EFI-defined memory ranges and CXL
resource management without trimming or deleting resources later.
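A minimal usage sketch of the reworked interfaces (illustrative only; the
call sites below are the ones touched by this patch's hunks):

  /* e820__reserve_resources_late(): no explicit root, routed by res->desc */
  insert_resource_expand_to_fit(res);            /* Soft Reserved is parked */

  /* add_cxl_resources(): CXL windows still go straight into iomem */
  __insert_resource_expand_to_fit(&iomem_resource, new);

  /* dax_hmem walks the parked tree instead of iomem at init time */
  walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM,
                             0, -1, NULL, hmem_register_one);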
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
arch/x86/kernel/e820.c | 2 +-
drivers/cxl/acpi.c | 2 +-
drivers/dax/hmem/device.c | 4 +-
drivers/dax/hmem/hmem.c | 7 ++-
include/linux/ioport.h | 13 +++++-
kernel/resource.c | 92 +++++++++++++++++++++++++++++++++------
6 files changed, 100 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c3acbd26408b..c32f144f0e4a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1153,7 +1153,7 @@ void __init e820__reserve_resources_late(void)
res = e820_res;
for (i = 0; i < e820_table->nr_entries; i++) {
if (!res->parent && res->end)
- insert_resource_expand_to_fit(&iomem_resource, res);
+ insert_resource_expand_to_fit(res);
res++;
}
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index bd2e282ca93a..b37858f797be 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -847,7 +847,7 @@ static int add_cxl_resources(struct resource *cxl_res)
*/
cxl_set_public_resource(res, new);
- insert_resource_expand_to_fit(&iomem_resource, new);
+ __insert_resource_expand_to_fit(&iomem_resource, new);
next = res->sibling;
while (next && resource_overlaps(new, next)) {
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index f9e1a76a04a9..22732b729017 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -83,8 +83,8 @@ static __init int hmem_register_one(struct resource *res, void *data)
static __init int hmem_init(void)
{
- walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
- IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
+ walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0,
+ -1, NULL, hmem_register_one);
return 0;
}
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index c18451a37e4f..48f4642f4bb8 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -73,11 +73,14 @@ static int hmem_register_device(struct device *host, int target_nid,
return 0;
}
- rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
- IORES_DESC_SOFT_RESERVED);
+ rc = region_intersects_soft_reserve(res->start, resource_size(res),
+ IORESOURCE_MEM,
+ IORES_DESC_SOFT_RESERVED);
if (rc != REGION_INTERSECTS)
return 0;
+ /* TODO: Add Soft-Reserved memory back to iomem */
+
id = memregion_alloc(GFP_KERNEL);
if (id < 0) {
dev_err(host, "memregion allocation failure for %pr\n", res);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index e8b2d6aa4013..e20226870a81 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -232,6 +232,9 @@ struct resource_constraint {
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
+#ifdef CONFIG_EFI_SOFT_RESERVE
+extern struct resource soft_reserve_resource;
+#endif
extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
extern int request_resource(struct resource *root, struct resource *new);
@@ -242,7 +245,8 @@ extern void reserve_region_with_split(struct resource *root,
const char *name);
extern struct resource *insert_resource_conflict(struct resource *parent, struct resource *new);
extern int insert_resource(struct resource *parent, struct resource *new);
-extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
+extern void __insert_resource_expand_to_fit(struct resource *root, struct resource *new);
+extern void insert_resource_expand_to_fit(struct resource *new);
extern int remove_resource(struct resource *old);
extern void arch_remove_reservations(struct resource *avail);
extern int allocate_resource(struct resource *root, struct resource *new,
@@ -409,6 +413,13 @@ walk_system_ram_res_rev(u64 start, u64 end, void *arg,
extern int
walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
void *arg, int (*func)(struct resource *, void *));
+extern int
+walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
+ u64 start, u64 end, void *arg,
+ int (*func)(struct resource *, void *));
+extern int
+region_intersects_soft_reserve(resource_size_t start, size_t size,
+ unsigned long flags, unsigned long desc);
struct resource *devm_request_free_mem_region(struct device *dev,
struct resource *base, unsigned long size);
diff --git a/kernel/resource.c b/kernel/resource.c
index b9fa2a4ce089..208eaafcc681 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -321,13 +321,14 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
}
/**
- * find_next_iomem_res - Finds the lowest iomem resource that covers part of
- * [@start..@end].
+ * find_next_res - Finds the lowest resource that covers part of
+ * [@start..@end].
*
* If a resource is found, returns 0 and @*res is overwritten with the part
* of the resource that's within [@start..@end]; if none is found, returns
* -ENODEV. Returns -EINVAL for invalid parameters.
*
+ * @parent: resource tree root to search
* @start: start address of the resource searched for
* @end: end address of same resource
* @flags: flags which the resource must have
@@ -337,9 +338,9 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
* The caller must specify @start, @end, @flags, and @desc
* (which may be IORES_DESC_NONE).
*/
-static int find_next_iomem_res(resource_size_t start, resource_size_t end,
- unsigned long flags, unsigned long desc,
- struct resource *res)
+static int find_next_res(struct resource *parent, resource_size_t start,
+ resource_size_t end, unsigned long flags,
+ unsigned long desc, struct resource *res)
{
struct resource *p;
@@ -351,7 +352,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_lock(&resource_lock);
- for_each_resource(&iomem_resource, p, false) {
+ for_each_resource(parent, p, false) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -382,16 +383,23 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
return p ? 0 : -ENODEV;
}
-static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
- unsigned long flags, unsigned long desc,
- void *arg,
- int (*func)(struct resource *, void *))
+static int find_next_iomem_res(resource_size_t start, resource_size_t end,
+ unsigned long flags, unsigned long desc,
+ struct resource *res)
+{
+ return find_next_res(&iomem_resource, start, end, flags, desc, res);
+}
+
+static int walk_res_desc(struct resource *parent, resource_size_t start,
+ resource_size_t end, unsigned long flags,
+ unsigned long desc, void *arg,
+ int (*func)(struct resource *, void *))
{
struct resource res;
int ret = -EINVAL;
while (start < end &&
- !find_next_iomem_res(start, end, flags, desc, &res)) {
+ !find_next_res(parent, start, end, flags, desc, &res)) {
ret = (*func)(&res, arg);
if (ret)
break;
@@ -402,6 +410,15 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
return ret;
}
+static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
+ unsigned long flags, unsigned long desc,
+ void *arg,
+ int (*func)(struct resource *, void *))
+{
+ return walk_res_desc(&iomem_resource, start, end, flags, desc, arg, func);
+}
+
+
/**
* walk_iomem_res_desc - Walks through iomem resources and calls func()
* with matching resource ranges.
@@ -426,6 +443,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
}
EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
+#ifdef CONFIG_EFI_SOFT_RESERVE
+struct resource soft_reserve_resource = {
+ .name = "Soft Reserved",
+ .start = 0,
+ .end = -1,
+ .desc = IORES_DESC_SOFT_RESERVED,
+ .flags = IORESOURCE_MEM,
+};
+EXPORT_SYMBOL_GPL(soft_reserve_resource);
+
+int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
+ u64 start, u64 end, void *arg,
+ int (*func)(struct resource *, void *))
+{
+ return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
+ arg, func);
+}
+EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+#endif
+
/*
* This function calls the @func callback against all memory ranges of type
* System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
@@ -648,6 +685,22 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags,
}
EXPORT_SYMBOL_GPL(region_intersects);
+#ifdef CONFIG_EFI_SOFT_RESERVE
+int region_intersects_soft_reserve(resource_size_t start, size_t size,
+ unsigned long flags, unsigned long desc)
+{
+ int ret;
+
+ read_lock(&resource_lock);
+ ret = __region_intersects(&soft_reserve_resource, start, size, flags,
+ desc);
+ read_unlock(&resource_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(region_intersects_soft_reserve);
+#endif
+
void __weak arch_remove_reservations(struct resource *avail)
{
}
@@ -966,7 +1019,7 @@ EXPORT_SYMBOL_GPL(insert_resource);
* Insert a resource into the resource tree, possibly expanding it in order
* to make it encompass any conflicting resources.
*/
-void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
+void __insert_resource_expand_to_fit(struct resource *root, struct resource *new)
{
if (new->parent)
return;
@@ -997,7 +1050,20 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
* to use this interface. The former are built-in and only the latter,
* CXL, is a module.
*/
-EXPORT_SYMBOL_NS_GPL(insert_resource_expand_to_fit, "CXL");
+EXPORT_SYMBOL_NS_GPL(__insert_resource_expand_to_fit, "CXL");
+
+void insert_resource_expand_to_fit(struct resource *new)
+{
+ struct resource *root = &iomem_resource;
+
+#ifdef CONFIG_EFI_SOFT_RESERVE
+ if (new->desc == IORES_DESC_SOFT_RESERVED)
+ root = &soft_reserve_resource;
+#endif
+
+ __insert_resource_expand_to_fit(root, new);
+}
+EXPORT_SYMBOL_GPL(insert_resource_expand_to_fit);
/**
* remove_resource - Remove a resource in the resource tree
--
2.17.1
* [PATCH v4 2/9] dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 3/9] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL Smita Koralahalli
` (8 subsequent siblings)
10 siblings, 0 replies; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
From: Dan Williams <dan.j.williams@intel.com>
Ensure cxl_acpi has published CXL Window resources before HMEM walks Soft
Reserved ranges.
Replace MODULE_SOFTDEP("pre: cxl_acpi") with an explicit, synchronous
request_module("cxl_acpi"). MODULE_SOFTDEP() only guarantees eventual
loading, it does not enforce that the dependency has finished init
before the current module runs. This can cause HMEM to start before
cxl_acpi has populated the resource tree, breaking detection of overlaps
between Soft Reserved and CXL Windows.
Also, request cxl_pci before HMEM walks Soft Reserved ranges. Unlike
cxl_acpi, cxl_pci attach is asynchronous and creates dependent devices
that trigger further module loads. Asynchronous probe flushing
(wait_for_device_probe()) is added later in the series in a deferred
context before HMEM makes ownership decisions for Soft Reserved ranges.
Add an additional explicit Kconfig ordering so that CXL_ACPI and CXL_PCI
must be initialized before DEV_DAX_HMEM. This prevents HMEM from consuming
Soft Reserved ranges before CXL drivers have had a chance to claim them.
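For orientation, a condensed sketch of the intended ordering (the first
part is from the hunk below; the wait_for_device_probe() call arrives in
patch 4/9):

  /* dax_hmem_init(): synchronous module requests */
  if (IS_ENABLED(CONFIG_DEV_DAX_CXL)) {
          request_module("cxl_acpi");     /* publish CXL windows first */
          request_module("cxl_pci");      /* kick off async endpoint probes */
  }

  /* later, in deferred work, before ownership decisions: */
  wait_for_device_probe();                /* flush the async cxl_pci probes */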
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/Kconfig | 2 ++
drivers/dax/hmem/hmem.c | 17 ++++++++++-------
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d656e4c0eb84..3683bb3f2311 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -48,6 +48,8 @@ config DEV_DAX_CXL
tristate "CXL DAX: direct access to CXL RAM regions"
depends on CXL_BUS && CXL_REGION && DEV_DAX
default CXL_REGION && DEV_DAX
+ depends on CXL_ACPI >= DEV_DAX_HMEM
+ depends on CXL_PCI >= DEV_DAX_HMEM
help
CXL RAM regions are either mapped by platform-firmware
and published in the initial system-memory map as "System RAM", mapped
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 48f4642f4bb8..02e79c7adf75 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -141,6 +141,16 @@ static __init int dax_hmem_init(void)
{
int rc;
+ /*
+ * Ensure that cxl_acpi and cxl_pci have a chance to kick off
+ * CXL topology discovery at least once before scanning the
+ * iomem resource tree for IORES_DESC_CXL resources.
+ */
+ if (IS_ENABLED(CONFIG_DEV_DAX_CXL)) {
+ request_module("cxl_acpi");
+ request_module("cxl_pci");
+ }
+
rc = platform_driver_register(&dax_hmem_platform_driver);
if (rc)
return rc;
@@ -161,13 +171,6 @@ static __exit void dax_hmem_exit(void)
module_init(dax_hmem_init);
module_exit(dax_hmem_exit);
-/* Allow for CXL to define its own dax regions */
-#if IS_ENABLED(CONFIG_CXL_REGION)
-#if IS_MODULE(CONFIG_CXL_ACPI)
-MODULE_SOFTDEP("pre: cxl_acpi");
-#endif
-#endif
-
MODULE_ALIAS("platform:hmem*");
MODULE_ALIAS("platform:hmem_platform*");
MODULE_DESCRIPTION("HMEM DAX: direct access to 'specific purpose' memory");
--
2.17.1
* [PATCH v4 3/9] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 2/9] dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-12-02 23:32 ` Dave Jiang
2025-11-20 3:19 ` [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows Smita Koralahalli
` (7 subsequent siblings)
10 siblings, 1 reply; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
From: Dan Williams <dan.j.williams@intel.com>
Replace IS_ENABLED(CONFIG_CXL_REGION) with IS_ENABLED(CONFIG_DEV_DAX_CXL)
so that HMEM only defers Soft Reserved ranges when CXL DAX support is
enabled. This makes the coordination between HMEM and the CXL stack more
precise and prevents deferral in unrelated CXL configurations.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/dax/hmem/hmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 02e79c7adf75..c2c110b194e5 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -66,7 +66,7 @@ static int hmem_register_device(struct device *host, int target_nid,
long id;
int rc;
- if (IS_ENABLED(CONFIG_CXL_REGION) &&
+ if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {
dev_dbg(host, "deferring range to CXL: %pr\n", res);
--
2.17.1
* [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (2 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 3/9] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-12-02 22:37 ` dan.j.williams
2025-11-20 3:19 ` [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map() Smita Koralahalli
` (6 subsequent siblings)
10 siblings, 1 reply; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
From: Dan Williams <dan.j.williams@intel.com>
Defer handling of Soft Reserved ranges that intersect CXL windows at
probe time. Delay processing until after device discovery so that the
CXL stack can publish windows and assemble regions before HMEM claims
those address ranges.
Add a deferral path that schedules deferred work when HMEM detects a
Soft Reserved range intersecting a CXL window during probe. The deferred
work runs after probe completes and allows the CXL subsystem to finish
resource discovery and region setup before HMEM takes any action.
This change does not address region assembly failures. It only delays
HMEM handling to avoid prematurely claiming ranges that CXL may own.
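A condensed view of the deferral flow added below (names taken from the
hunks; the register/drop decision is fleshed out in patches 5-8):

  /*
   * dax_hmem_platform_probe()
   *   walk_hmem_resources(hmem_register_device)
   *     range intersects a CXL window, dax_cxl_mode == DAX_CXL_MODE_DEFER
   *       -> schedule_work(&work->work), return 0
   *
   * process_defer_work()
   *   wait_for_device_probe()                  (let cxl_acpi/cxl_pci settle)
   *   walk_hmem_resources(handle_deferred_cxl) (decides REGISTER vs DROP
   *                                              in later patches)
   */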
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/dax/hmem/hmem.c | 66 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 64 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index c2c110b194e5..f70a0688bd11 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -58,9 +58,21 @@ static void release_hmem(void *pdev)
platform_device_unregister(pdev);
}
+static enum dax_cxl_mode {
+ DAX_CXL_MODE_DEFER,
+ DAX_CXL_MODE_REGISTER,
+ DAX_CXL_MODE_DROP,
+} dax_cxl_mode;
+
+struct dax_defer_work {
+ struct platform_device *pdev;
+ struct work_struct work;
+};
+
static int hmem_register_device(struct device *host, int target_nid,
const struct resource *res)
{
+ struct dax_defer_work *work = dev_get_drvdata(host);
struct platform_device *pdev;
struct memregion_info info;
long id;
@@ -69,8 +81,18 @@ static int hmem_register_device(struct device *host, int target_nid,
if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {
- dev_dbg(host, "deferring range to CXL: %pr\n", res);
- return 0;
+ switch (dax_cxl_mode) {
+ case DAX_CXL_MODE_DEFER:
+ dev_dbg(host, "deferring range to CXL: %pr\n", res);
+ schedule_work(&work->work);
+ return 0;
+ case DAX_CXL_MODE_REGISTER:
+ dev_dbg(host, "registering CXL range: %pr\n", res);
+ break;
+ case DAX_CXL_MODE_DROP:
+ dev_dbg(host, "dropping CXL range: %pr\n", res);
+ return 0;
+ }
}
rc = region_intersects_soft_reserve(res->start, resource_size(res),
@@ -125,8 +147,48 @@ static int hmem_register_device(struct device *host, int target_nid,
return rc;
}
+static int handle_deferred_cxl(struct device *host, int target_nid,
+ const struct resource *res)
+{
+ /* TODO: Handle region assembly failures */
+ return 0;
+}
+
+static void process_defer_work(struct work_struct *_work)
+{
+ struct dax_defer_work *work = container_of(_work, typeof(*work), work);
+ struct platform_device *pdev = work->pdev;
+
+ /* relies on cxl_acpi and cxl_pci having had a chance to load */
+ wait_for_device_probe();
+
+ walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
+}
+
+static void kill_defer_work(void *_work)
+{
+ struct dax_defer_work *work = container_of(_work, typeof(*work), work);
+
+ cancel_work_sync(&work->work);
+ kfree(work);
+}
+
static int dax_hmem_platform_probe(struct platform_device *pdev)
{
+ struct dax_defer_work *work = kzalloc(sizeof(*work), GFP_KERNEL);
+ int rc;
+
+ if (!work)
+ return -ENOMEM;
+
+ work->pdev = pdev;
+ INIT_WORK(&work->work, process_defer_work);
+
+ rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, work);
+ if (rc)
+ return rc;
+
+ platform_set_drvdata(pdev, work);
return walk_hmem_resources(&pdev->dev, hmem_register_device);
}
--
2.17.1
* [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map()
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (3 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-12-03 3:50 ` dan.j.williams
2025-11-20 3:19 ` [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup Smita Koralahalli
` (5 subsequent siblings)
10 siblings, 1 reply; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Introduce cxl_regions_fully_map() to check whether CXL regions form a
single contiguous, non-overlapping cover of a given Soft Reserved range.
Use this helper to decide whether Soft Reserved memory overlapping CXL
regions should be owned by CXL or registered by HMEM.
If the span is fully covered by CXL regions, treat the Soft Reserved
range as owned by CXL and have HMEM skip registration. Otherwise, let HMEM
claim the range and register the corresponding devdax device for it.
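As a worked example (addresses made up for illustration), this mirrors how
the dax/hmem hunk below consumes the helper:

  resource_size_t start = 0x1050000000, end = 0x204fffffff;

  /*
   * If region0 spans 0x1050000000-0x184fffffff and region1 spans
   * 0x1850000000-0x204fffffff, they tile the Soft Reserved span with no
   * hole and no overlap, so cxl_regions_fully_map(start, end) returns
   * true. A gap, or a second region starting at the same position,
   * terminates the walk and the helper returns false.
   */
  if (cxl_regions_fully_map(start, end))
          dax_cxl_mode = DAX_CXL_MODE_DROP;       /* CXL keeps the span */
  else
          dax_cxl_mode = DAX_CXL_MODE_REGISTER;   /* hmem claims it */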
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/core/region.c | 80 +++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 6 +++
drivers/dax/hmem/hmem.c | 14 ++++++-
3 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index b06fee1978ba..94dbbd6b5513 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3749,6 +3749,86 @@ static int cxl_region_debugfs_poison_clear(void *data, u64 offset)
DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
cxl_region_debugfs_poison_clear, "%llx\n");
+static struct cxl_region *
+cxlr_overlapping_range(struct device *dev, resource_size_t s, resource_size_t e)
+{
+ struct cxl_region *cxlr;
+ struct resource *r;
+
+ if (!is_cxl_region(dev))
+ return NULL;
+
+ cxlr = to_cxl_region(dev);
+ r = cxlr->params.res;
+ if (!r)
+ return NULL;
+
+ if (r->start > e || r->end < s)
+ return NULL;
+
+ return cxlr;
+}
+
+struct cxl_range_ctx {
+ resource_size_t start;
+ resource_size_t end;
+ resource_size_t pos;
+ resource_size_t map_end;
+ bool found;
+};
+
+static int cxl_region_map_cb(struct device *dev, void *data)
+{
+ struct cxl_range_ctx *ctx = data;
+ struct cxl_region *cxlr;
+ struct resource *r;
+
+ cxlr = cxlr_overlapping_range(dev, ctx->pos, ctx->end);
+ if (!cxlr)
+ return 0;
+
+ r = cxlr->params.res;
+ if (r->start != ctx->pos)
+ return 0;
+
+ if (!ctx->found) {
+ ctx->found = true;
+ ctx->map_end = r->end;
+ return 0;
+ }
+
+ return 1;
+}
+
+bool cxl_regions_fully_map(resource_size_t start, resource_size_t end)
+{
+ resource_size_t pos = start;
+ int rc;
+
+ while (pos <= end) {
+ struct cxl_range_ctx ctx = {
+ .start = start,
+ .end = end,
+ .pos = pos,
+ .found = false,
+ };
+
+ rc = bus_for_each_dev(&cxl_bus_type, NULL, &ctx,
+ cxl_region_map_cb);
+
+ if (rc || !ctx.found || ctx.map_end > end)
+ return false;
+
+ if (ctx.map_end == end)
+ break;
+
+ pos = ctx.map_end + 1;
+ }
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(cxl_regions_fully_map);
+
static int cxl_region_can_probe(struct cxl_region *cxlr)
{
struct cxl_region_params *p = &cxlr->params;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 231ddccf8977..af78c9fd37f2 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -877,6 +877,7 @@ struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
+bool cxl_regions_fully_map(resource_size_t start, resource_size_t end);
#else
static inline bool is_cxl_pmem_region(struct device *dev)
{
@@ -899,6 +900,11 @@ static inline u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint,
{
return 0;
}
+static inline bool cxl_regions_fully_map(resource_size_t start,
+ resource_size_t end)
+{
+ return false;
+}
#endif
void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index f70a0688bd11..db4c46337ac3 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -3,6 +3,8 @@
#include <linux/memregion.h>
#include <linux/module.h>
#include <linux/dax.h>
+
+#include "../../cxl/cxl.h"
#include "../bus.h"
static bool region_idle;
@@ -150,7 +152,17 @@ static int hmem_register_device(struct device *host, int target_nid,
static int handle_deferred_cxl(struct device *host, int target_nid,
const struct resource *res)
{
- /* TODO: Handle region assembly failures */
+ if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+ IORES_DESC_CXL) != REGION_DISJOINT) {
+
+ if (cxl_regions_fully_map(res->start, res->end))
+ dax_cxl_mode = DAX_CXL_MODE_DROP;
+ else
+ dax_cxl_mode = DAX_CXL_MODE_REGISTER;
+
+ hmem_register_device(host, target_nid, res);
+ }
+
return 0;
}
--
2.17.1
* [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (4 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map() Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-11-20 18:17 ` Koralahalli Channabasappa, Smita
` (2 more replies)
2025-11-20 3:19 ` [PATCH v4 7/9] cxl/region, dax/hmem: Register cxl_dax only when CXL owns Soft Reserved span Smita Koralahalli
` (4 subsequent siblings)
10 siblings, 3 replies; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Stop creating cxl_dax during cxl_region_probe(). Early DAX registration
can online memory before ownership of Soft Reserved ranges is finalized.
This makes it difficult to tear down regions later when HMEM determines
that a region should not claim that range.
Introduce a register_dax flag in struct cxl_region_params and gate DAX
registration on this flag. Leave probe time registration disabled for
regions discovered during early CXL enumeration; set the flag only for
regions created dynamically at runtime to preserve existing behaviour.
This patch prepares the region code for later changes where cxl_dax
setup occurs from the HMEM path only after ownership arbitration
completes.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/core/region.c | 21 ++++++++++++++++-----
drivers/cxl/cxl.h | 1 +
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 94dbbd6b5513..c17cd8706b9d 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2540,9 +2540,11 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
int id,
enum cxl_partition_mode mode,
- enum cxl_decoder_type type)
+ enum cxl_decoder_type type,
+ bool register_dax)
{
struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
+ struct cxl_region_params *p;
struct cxl_region *cxlr;
struct device *dev;
int rc;
@@ -2553,6 +2555,9 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
cxlr->mode = mode;
cxlr->type = type;
+ p = &cxlr->params;
+ p->register_dax = register_dax;
+
dev = &cxlr->dev;
rc = dev_set_name(dev, "region%d", id);
if (rc)
@@ -2593,7 +2598,8 @@ static ssize_t create_ram_region_show(struct device *dev,
}
static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
- enum cxl_partition_mode mode, int id)
+ enum cxl_partition_mode mode, int id,
+ bool register_dax)
{
int rc;
@@ -2615,7 +2621,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
return ERR_PTR(-EBUSY);
}
- return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
+ return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM,
+ register_dax);
}
static ssize_t create_region_store(struct device *dev, const char *buf,
@@ -2629,7 +2636,7 @@ static ssize_t create_region_store(struct device *dev, const char *buf,
if (rc != 1)
return -EINVAL;
- cxlr = __create_region(cxlrd, mode, id);
+ cxlr = __create_region(cxlrd, mode, id, true);
if (IS_ERR(cxlr))
return PTR_ERR(cxlr);
@@ -3523,7 +3530,7 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
do {
cxlr = __create_region(cxlrd, cxlds->part[part].mode,
- atomic_read(&cxlrd->region_id));
+ atomic_read(&cxlrd->region_id), false);
} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
if (IS_ERR(cxlr)) {
@@ -3930,6 +3937,10 @@ static int cxl_region_probe(struct device *dev)
p->res->start, p->res->end, cxlr,
is_system_ram) > 0)
return 0;
+
+ if (!p->register_dax)
+ return 0;
+
return devm_cxl_add_dax_region(cxlr);
default:
dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index af78c9fd37f2..324220596890 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -495,6 +495,7 @@ struct cxl_region_params {
struct cxl_endpoint_decoder *targets[CXL_DECODER_MAX_INTERLEAVE];
int nr_targets;
resource_size_t cache_size;
+ bool register_dax;
};
enum cxl_partition_mode {
--
2.17.1
* [PATCH v4 7/9] cxl/region, dax/hmem: Register cxl_dax only when CXL owns Soft Reserved span
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (5 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved Smita Koralahalli
` (3 subsequent siblings)
10 siblings, 0 replies; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Register DAX from the HMEM path only after determining that CXL owns the
Soft Reserved range. This avoids onlining memory under CXL before ownership
is finalized and prevents failed teardown when HMEM must reclaim the range.
Introduce cxl_register_dax() to walk overlapping CXL regions and register
DAX from CXL only when cxl_regions_fully_map() confirms full coverage of
the span. If CXL does not own the span, skip cxl_dax setup and allow HMEM
to register DAX and online memory.
With probe time DAX creation already suppressed in the previous patch,
this change ensures that only the single owner (CXL or HMEM) performs
DAX/KMEM setup.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/core/region.c | 42 +++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 5 +++++
drivers/dax/hmem/hmem.c | 5 +++--
3 files changed, 50 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index c17cd8706b9d..38e7ec6a087b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3784,6 +3784,48 @@ struct cxl_range_ctx {
bool found;
};
+static void cxl_region_enable_dax(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int rc;
+
+ if (walk_iomem_res_desc(IORES_DESC_NONE,
+ IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+ p->res->start, p->res->end, cxlr,
+ is_system_ram) > 0)
+ return;
+
+ rc = devm_cxl_add_dax_region(cxlr);
+ if (rc)
+ dev_warn(&cxlr->dev, "failed to add DAX for %s: %d\n",
+ dev_name(&cxlr->dev), rc);
+}
+
+static int cxl_register_dax_cb(struct device *dev, void *data)
+{
+ struct cxl_range_ctx *ctx = data;
+ struct cxl_region *cxlr;
+
+ cxlr = cxlr_overlapping_range(dev, ctx->start, ctx->end);
+ if (!cxlr)
+ return 0;
+
+ if (cxlr->mode != CXL_PARTMODE_RAM)
+ return 0;
+
+ cxl_region_enable_dax(cxlr);
+
+ return 0;
+}
+
+void cxl_register_dax(resource_size_t start, resource_size_t end)
+{
+ struct cxl_range_ctx ctx = { .start = start, .end = end };
+
+ bus_for_each_dev(&cxl_bus_type, NULL, &ctx, cxl_register_dax_cb);
+}
+EXPORT_SYMBOL_GPL(cxl_register_dax);
+
static int cxl_region_map_cb(struct device *dev, void *data)
{
struct cxl_range_ctx *ctx = data;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 324220596890..414ddf6c35d7 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -879,6 +879,7 @@ int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
bool cxl_regions_fully_map(resource_size_t start, resource_size_t end);
+void cxl_register_dax(resource_size_t start, resource_size_t end);
#else
static inline bool is_cxl_pmem_region(struct device *dev)
{
@@ -906,6 +907,10 @@ static inline bool cxl_regions_fully_map(resource_size_t start,
{
return false;
}
+static inline void cxl_register_dax(resource_size_t start,
+ resource_size_t end)
+{
+}
#endif
void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index db4c46337ac3..b9312e0f2e62 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -155,9 +155,10 @@ static int handle_deferred_cxl(struct device *host, int target_nid,
if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {
- if (cxl_regions_fully_map(res->start, res->end))
+ if (cxl_regions_fully_map(res->start, res->end)) {
dax_cxl_mode = DAX_CXL_MODE_DROP;
- else
+ cxl_register_dax(res->start, res->end);
+ } else
dax_cxl_mode = DAX_CXL_MODE_REGISTER;
hmem_register_device(host, target_nid, res);
--
2.17.1
* [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (6 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 7/9] cxl/region, dax/hmem: Register cxl_dax only when CXL owns Soft Reserved span Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-12-04 0:50 ` dan.j.williams
2025-11-20 3:19 ` [PATCH v4 9/9] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree Smita Koralahalli
` (2 subsequent siblings)
10 siblings, 1 reply; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
If CXL regions do not fully cover a Soft Reserved span, HMEM takes
ownership. Tear down overlapping CXL regions before allowing HMEM to
register and online the memory.
Add cxl_region_teardown() to walk CXL regions overlapping a span and
unregister them via devm_release_action() and unregister_region().
Force the region state back to CXL_CONFIG_ACTIVE before unregistering to
prevent the teardown path from resetting decoders that HMEM still relies on
to create its dax device and online the memory.
Co-developed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/core/region.c | 38 ++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 5 +++++
drivers/dax/hmem/hmem.c | 4 +++-
3 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 38e7ec6a087b..266b24028df0 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3784,6 +3784,44 @@ struct cxl_range_ctx {
bool found;
};
+static int cxl_region_teardown_cb(struct device *dev, void *data)
+{
+ struct cxl_range_ctx *ctx = data;
+ struct cxl_root_decoder *cxlrd;
+ struct cxl_region_params *p;
+ struct cxl_region *cxlr;
+ struct cxl_port *port;
+
+ cxlr = cxlr_overlapping_range(dev, ctx->start, ctx->end);
+ if (!cxlr)
+ return 0;
+
+ cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
+ port = cxlrd_to_port(cxlrd);
+ p = &cxlr->params;
+
+ /* Force the region state back to CXL_CONFIG_ACTIVE so that
+ * unregister_region() does not run the full decoder reset path
+ * which would invalidate the decoder programming that HMEM
+ * relies on to create its DAX device and online the underlying
+ * memory.
+ */
+ scoped_guard(rwsem_write, &cxl_rwsem.region)
+ p->state = min(p->state, CXL_CONFIG_ACTIVE);
+
+ devm_release_action(port->uport_dev, unregister_region, cxlr);
+
+ return 0;
+}
+
+void cxl_region_teardown(resource_size_t start, resource_size_t end)
+{
+ struct cxl_range_ctx ctx = { .start = start, .end = end };
+
+ bus_for_each_dev(&cxl_bus_type, NULL, &ctx, cxl_region_teardown_cb);
+}
+EXPORT_SYMBOL_GPL(cxl_region_teardown);
+
static void cxl_region_enable_dax(struct cxl_region *cxlr)
{
struct cxl_region_params *p = &cxlr->params;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 414ddf6c35d7..a215a88ef59c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -880,6 +880,7 @@ struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
bool cxl_regions_fully_map(resource_size_t start, resource_size_t end);
void cxl_register_dax(resource_size_t start, resource_size_t end);
+void cxl_region_teardown(resource_size_t start, resource_size_t end);
#else
static inline bool is_cxl_pmem_region(struct device *dev)
{
@@ -911,6 +912,10 @@ static inline void cxl_register_dax(resource_size_t start,
resource_size_t end)
{
}
+static inline void cxl_region_teardown(resource_size_t start,
+ resource_size_t end)
+{
+}
#endif
void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index b9312e0f2e62..7d874ee169ac 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -158,8 +158,10 @@ static int handle_deferred_cxl(struct device *host, int target_nid,
if (cxl_regions_fully_map(res->start, res->end)) {
dax_cxl_mode = DAX_CXL_MODE_DROP;
cxl_register_dax(res->start, res->end);
- } else
+ } else {
dax_cxl_mode = DAX_CXL_MODE_REGISTER;
+ cxl_region_teardown(res->start, res->end);
+ }
hmem_register_device(host, target_nid, res);
}
--
2.17.1
* [PATCH v4 9/9] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (7 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved Smita Koralahalli
@ 2025-11-20 3:19 ` Smita Koralahalli
2025-12-04 0:54 ` dan.j.williams
2025-12-01 19:56 ` [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Alison Schofield
2025-12-02 6:41 ` dan.j.williams
10 siblings, 1 reply; 31+ messages in thread
From: Smita Koralahalli @ 2025-11-20 3:19 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Reworked from a patch by Alison Schofield <alison.schofield@intel.com>
Reintroduce Soft Reserved ranges into the iomem_resource tree for HMEM
to consume.
This restores visibility in /proc/iomem for ranges actively in use, while
avoiding the early-boot conflicts that occurred when Soft Reserved was
published into iomem before CXL window and region discovery.
Link: https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
Co-developed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Co-developed-by: Zhijian Li <lizhijian@fujitsu.com>
Signed-off-by: Zhijian Li <lizhijian@fujitsu.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
---
drivers/dax/hmem/hmem.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 7d874ee169ac..5f36b0374cf4 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -71,6 +71,34 @@ struct dax_defer_work {
struct work_struct work;
};
+static void remove_soft_reserved(void *r)
+{
+ remove_resource(r);
+ kfree(r);
+}
+
+static int add_soft_reserve_into_iomem(struct device *host,
+ const struct resource *res)
+{
+ struct resource *soft __free(kfree) =
+ kzalloc(sizeof(*soft), GFP_KERNEL);
+ int rc;
+
+ if (!soft)
+ return -ENOMEM;
+
+ *soft = DEFINE_RES_NAMED_DESC(res->start, (res->end - res->start + 1),
+ "Soft Reserved", IORESOURCE_MEM,
+ IORES_DESC_SOFT_RESERVED);
+
+ rc = insert_resource(&iomem_resource, soft);
+ if (rc)
+ return rc;
+
+ return devm_add_action_or_reset(host, remove_soft_reserved,
+ no_free_ptr(soft));
+}
+
static int hmem_register_device(struct device *host, int target_nid,
const struct resource *res)
{
@@ -103,7 +131,9 @@ static int hmem_register_device(struct device *host, int target_nid,
if (rc != REGION_INTERSECTS)
return 0;
- /* TODO: Add Soft-Reserved memory back to iomem */
+ rc = add_soft_reserve_into_iomem(host, res);
+ if (rc)
+ return rc;
id = memregion_alloc(GFP_KERNEL);
if (id < 0) {
--
2.17.1
* Re: [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup
2025-11-20 3:19 ` [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup Smita Koralahalli
@ 2025-11-20 18:17 ` Koralahalli Channabasappa, Smita
2025-11-20 20:21 ` kernel test robot
2025-12-04 0:22 ` dan.j.williams
2 siblings, 0 replies; 31+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-11-20 18:17 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
On 11/19/2025 7:19 PM, Smita Koralahalli wrote:
> Stop creating cxl_dax during cxl_region_probe(). Early DAX registration
> can online memory before ownership of Soft Reserved ranges is finalized.
> This makes it difficult to tear down regions later when HMEM determines
> that a region should not claim that range.
>
> Introduce a register_dax flag in struct cxl_region_params and gate DAX
> registration on this flag. Leave probe time registration disabled for
> regions discovered during early CXL enumeration; set the flag only for
> regions created dynamically at runtime to preserve existing behaviour.
>
> This patch prepares the region code for later changes where cxl_dax
> setup occurs from the HMEM path only after ownership arbitration
> completes.
>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> drivers/cxl/core/region.c | 21 ++++++++++++++++-----
> drivers/cxl/cxl.h | 1 +
> 2 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 94dbbd6b5513..c17cd8706b9d 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2540,9 +2540,11 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
> static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> int id,
> enum cxl_partition_mode mode,
> - enum cxl_decoder_type type)
> + enum cxl_decoder_type type,
> + bool register_dax)
> {
> struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> + struct cxl_region_params *p;
> struct cxl_region *cxlr;
> struct device *dev;
> int rc;
> @@ -2553,6 +2555,9 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> cxlr->mode = mode;
> cxlr->type = type;
>
> + p = &cxlr->params;
> + p->register_dax = register_dax;
> +
> dev = &cxlr->dev;
> rc = dev_set_name(dev, "region%d", id);
> if (rc)
> @@ -2593,7 +2598,8 @@ static ssize_t create_ram_region_show(struct device *dev,
> }
>
> static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> - enum cxl_partition_mode mode, int id)
> + enum cxl_partition_mode mode, int id,
> + bool register_dax)
> {
> int rc;
>
> @@ -2615,7 +2621,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> return ERR_PTR(-EBUSY);
> }
>
> - return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
> + return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM,
> + register_dax);
> }
>
> static ssize_t create_region_store(struct device *dev, const char *buf,
> @@ -2629,7 +2636,7 @@ static ssize_t create_region_store(struct device *dev, const char *buf,
> if (rc != 1)
> return -EINVAL;
>
> - cxlr = __create_region(cxlrd, mode, id);
> + cxlr = __create_region(cxlrd, mode, id, true);
> if (IS_ERR(cxlr))
> return PTR_ERR(cxlr);
>
> @@ -3523,7 +3530,7 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>
> do {
> cxlr = __create_region(cxlrd, cxlds->part[part].mode,
> - atomic_read(&cxlrd->region_id));
> + atomic_read(&cxlrd->region_id), false);
> } while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>
> if (IS_ERR(cxlr)) {
> @@ -3930,6 +3937,10 @@ static int cxl_region_probe(struct device *dev)
> p->res->start, p->res->end, cxlr,
> is_system_ram) > 0)
> return 0;
> +
> + if (!p->register_dax)
> + return 0;
Sorry, I missed this. It should continue registering DAX if HMEM is
disabled. I will fix this in v5 and add a comment here:
- if (!p->register_dax)
- return 0;
+ /*
+ * Only skip probe time DAX if HMEM will handle it
+ * later.
+ */
+ if (IS_ENABLED(CONFIG_DEV_DAX_HMEM) && !p->register_dax)
+ return 0;
> +
> return devm_cxl_add_dax_region(cxlr);
> default:
> dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index af78c9fd37f2..324220596890 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -495,6 +495,7 @@ struct cxl_region_params {
> struct cxl_endpoint_decoder *targets[CXL_DECODER_MAX_INTERLEAVE];
> int nr_targets;
> resource_size_t cache_size;
> + bool register_dax;
> };
>
> enum cxl_partition_mode {
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup
2025-11-20 3:19 ` [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup Smita Koralahalli
2025-11-20 18:17 ` Koralahalli Channabasappa, Smita
@ 2025-11-20 20:21 ` kernel test robot
2025-12-04 0:22 ` dan.j.williams
2 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2025-11-20 20:21 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: oe-kbuild-all, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Jonathan Cameron, Yazen Ghannam, Dave Jiang,
Davidlohr Bueso, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Yao Xingtao, Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham, Zhijian Li,
Borislav Petkov
Hi Smita,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 211ddde0823f1442e4ad052a2f30f050145ccada]
url: https://github.com/intel-lab-lkp/linux/commits/Smita-Koralahalli/dax-hmem-e820-resource-Defer-Soft-Reserved-insertion-until-hmem-is-ready/20251120-112457
base: 211ddde0823f1442e4ad052a2f30f050145ccada
patch link: https://lore.kernel.org/r/20251120031925.87762-7-Smita.KoralahalliChannabasappa%40amd.com
patch subject: [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup
config: sparc64-randconfig-6002-20251120 (https://download.01.org/0day-ci/archive/20251121/202511210343.c0vb4NRc-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 13.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511210343.c0vb4NRc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511210343.c0vb4NRc-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: drivers/cxl/core/region.c:2544 function parameter 'register_dax' not described in 'devm_cxl_add_region'
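(For illustration, a one-line addition to the existing kernel-doc block
above devm_cxl_add_region() would address this warning; the description
wording below is only a suggestion:

 * @register_dax: when true, register the DAX region at probe time
)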
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (8 preceding siblings ...)
2025-11-20 3:19 ` [PATCH v4 9/9] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree Smita Koralahalli
@ 2025-12-01 19:56 ` Alison Schofield
2025-12-03 13:35 ` Tomasz Wolski
2025-12-02 6:41 ` dan.j.williams
10 siblings, 1 reply; 31+ messages in thread
From: Alison Schofield @ 2025-12-01 19:56 UTC (permalink / raw)
To: Smita Koralahalli
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron,
Yazen Ghannam, Dave Jiang, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
On Thu, Nov 20, 2025 at 03:19:16AM +0000, Smita Koralahalli wrote:
> This series aims to address long-standing conflicts between HMEM and
> CXL when handling Soft Reserved memory ranges.
>
> Reworked from Dan's patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/patch/?id=ab70c6227ee6165a562c215d9dcb4a1c55620d5d
>
> Previous work:
> https://lore.kernel.org/all/20250715180407.47426-1-Smita.KoralahalliChannabasappa@amd.com/
>
> Link to v3:
> https://lore.kernel.org/all/20250930044757.214798-1-Smita.KoralahalliChannabasappa@amd.com
>
> This series should be applied on top of:
> "214291cbaace: acpi/hmat: Fix lockdep warning for hmem_register_resource()"
> and is based on:
> base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
>
> I initially tried picking up the three probe ordering patches from v20/v21
> of Type 2 support, but I hit a NULL pointer dereference in
> devm_cxl_add_memdev() and cycle dependency with all patches so I left
> them out for now. With my current series rebased on 6.18-rc2 plus
> 214291cbaace, probe ordering behaves correctly on AMD systems and I have
> verified the scenarios mentioned below. I can pull those three patches
> back in for a future revision once the failures are sorted out.
Hi Smita,
This is a regression from the v3 version for my hotplug test case.
I believe at least partially due to the omitted probe order patches.
I'm not clear why that 'dax18.0' still exists after region teardown.
Upon booting:
- Do not expect to see that Soft Reserved resource
68e80000000-8d37fffffff : CXL Window 9
68e80000000-70e7fffffff : region9
68e80000000-70e7fffffff : Soft Reserved
68e80000000-70e7fffffff : dax18.0
68e80000000-70e7fffffff : System RAM (kmem)
After region teardown:
- Do not expect to see that Soft Reserved resource
- Do not expect to see that DAX or kmem
68e80000000-8d37fffffff : CXL Window 9
68e80000000-70e7fffffff : Soft Reserved
68e80000000-70e7fffffff : dax18.0
68e80000000-70e7fffffff : System RAM (kmem)
Create the region anew:
- Here we see a new region and dax devices created in the
available space after the Soft Reserved. We don't want
that. We want to be able to recreate in that original
space of 68e80000000-70e7fffffff.
68e80000000-8d37fffffff : CXL Window 9
68e80000000-70e7fffffff : Soft Reserved
68e80000000-70e7fffffff : dax18.0
68e80000000-70e7fffffff : System RAM (kmem)
70e80000000-78e7fffffff : region9
70e80000000-78e7fffffff : dax9.0
70e80000000-78e7fffffff : System RAM (kmem)
-- Alison
>
> Probe order patches of interest:
> cxl/mem: refactor memdev allocation
> cxl/mem: Arrange for always-synchronous memdev attach
> cxl/port: Arrange for always synchronous endpoint attach
>
> [1] Hotplug looks okay. After offlining the memory I can tear down the
> regions and recreate it back if CXL owns entire SR range as Soft Reserved
> is gone. dax_cxl creates dax devices and onlines memory.
> 850000000-284fffffff : CXL Window 0
> 850000000-284fffffff : region0
> 850000000-284fffffff : dax0.0
> 850000000-284fffffff : System RAM (kmem)
>
> [2] With CONFIG_CXL_REGION disabled, all the resources are handled by
> HMEM. Soft Reserved range shows up in /proc/iomem, no regions come up
> and dax devices are created from HMEM.
> 850000000-284fffffff : CXL Window 0
> 850000000-284fffffff : Soft Reserved
> 850000000-284fffffff : dax0.0
> 850000000-284fffffff : System RAM (kmem)
>
> [3] Region assembly failures also behave okay and work same as [2].
>
> Before:
> 2850000000-484fffffff : Soft Reserved
> 2850000000-484fffffff : CXL Window 1
> 2850000000-484fffffff : dax4.0
> 2850000000-484fffffff : System RAM (kmem)
>
> After tearing down dax4.0 and creating it back:
>
> Logs:
> [ 547.847764] unregister_dax_mapping: mapping0: unregister_dax_mapping
> [ 547.855000] trim_dev_dax_range: dax dax4.0: delete range[0]: 0x2850000000:0x484fffffff
> [ 622.474580] alloc_dev_dax_range: dax dax4.1: alloc range[0]: 0x0000002850000000:0x000000484fffffff
> [ 752.766194] Fallback order for Node 0: 0 1
> [ 752.766199] Fallback order for Node 1: 1 0
> [ 752.766200] Built 2 zonelists, mobility grouping on. Total pages: 8096220
> [ 752.783234] Policy zone: Normal
> [ 752.808604] Demotion targets for Node 0: preferred: 1, fallback: 1
> [ 752.815509] Demotion targets for Node 1: null
>
> After:
> 2850000000-484fffffff : Soft Reserved
> 2850000000-484fffffff : CXL Window 1
> 2850000000-484fffffff : dax4.1
> 2850000000-484fffffff : System RAM (kmem)
>
> [4] A small hack to tear down the fully assembled and probed region
> (i.e region in committed state) for range 850000000-284fffffff.
> This is to test the region teardown path for regions which don't
> fully cover the Soft Reserved range.
>
> 850000000-284fffffff : Soft Reserved
> 850000000-284fffffff : CXL Window 0
> 850000000-284fffffff : dax5.0
> 850000000-284fffffff : System RAM (kmem)
> 2850000000-484fffffff : CXL Window 1
> 2850000000-484fffffff : region1
> 2850000000-484fffffff : dax1.0
> 2850000000-484fffffff : System RAM (kmem)
> .4850000000-684fffffff : CXL Window 2
> 4850000000-684fffffff : region2
> 4850000000-684fffffff : dax2.0
> 4850000000-684fffffff : System RAM (kmem)
>
> daxctl list -R -u
> [
> {
> "path":"\/platform\/ACPI0017:00\/root0\/decoder0.1\/region1\/dax_region1",
> "id":1,
> "size":"128.00 GiB (137.44 GB)",
> "align":2097152
> },
> {
> "path":"\/platform\/hmem.5",
> "id":5,
> "size":"128.00 GiB (137.44 GB)",
> "align":2097152
> },
> {
> "path":"\/platform\/ACPI0017:00\/root0\/decoder0.2\/region2\/dax_region2",
> "id":2,
> "size":"128.00 GiB (137.44 GB)",
> "align":2097152
> }
> ]
>
> I couldn't test multiple regions under same Soft Reserved range
> with/without contiguous mapping due to limiting BIOS support. Hopefully
> that works.
>
> v4 updates:
> - No changes patches 1-3.
> - New patches 4-7.
> - handle_deferred_cxl() has been enhanced to handle case where CXL
> regions do not contiguously and fully cover Soft Reserved ranges.
> - Support added to defer cxl_dax registration.
> - Support added to teardown cxl regions.
>
> v3 updates:
> - Fixed two "From".
>
> v2 updates:
> - Removed conditional check on CONFIG_EFI_SOFT_RESERVE as dax_hmem
> depends on CONFIG_EFI_SOFT_RESERVE. (Zhijian)
> - Added TODO note. (Zhijian)
> - Included region_intersects_soft_reserve() inside CONFIG_EFI_SOFT_RESERVE
> conditional check. (Zhijian)
> - insert_resource_late() -> insert_resource_expand_to_fit() and
> __insert_resource_expand_to_fit() replacement. (Boris)
> - Fixed Co-developed and Signed-off by. (Dan)
> - Combined 2/6 and 3/6 into a single patch. (Zhijian).
> - Skip local variable in remove_soft_reserved. (Jonathan)
> - Drop kfree with __free(). (Jonathan)
> - return 0 -> return dev_add_action_or_reset(host...) (Jonathan)
> - Dropped 6/6.
> - Reviewed-by tags (Dave, Jonathan)
>
> Dan Williams (4):
> dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is
> ready
> dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved
> ranges
> dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
> dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL
> windows
>
> Smita Koralahalli (5):
> cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with
> cxl_regions_fully_map()
> cxl/region: Add register_dax flag to control probe-time devdax setup
> cxl/region, dax/hmem: Register devdax only when CXL owns Soft Reserved
> span
> cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft
> Reserved
> dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree
>
> arch/x86/kernel/e820.c | 2 +-
> drivers/cxl/acpi.c | 2 +-
> drivers/cxl/core/region.c | 181 ++++++++++++++++++++++++++++++++++++--
> drivers/cxl/cxl.h | 17 ++++
> drivers/dax/Kconfig | 2 +
> drivers/dax/hmem/device.c | 4 +-
> drivers/dax/hmem/hmem.c | 137 ++++++++++++++++++++++++++---
> include/linux/ioport.h | 13 ++-
> kernel/resource.c | 92 ++++++++++++++++---
> 9 files changed, 415 insertions(+), 35 deletions(-)
>
> base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
` (9 preceding siblings ...)
2025-12-01 19:56 ` [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Alison Schofield
@ 2025-12-02 6:41 ` dan.j.williams
10 siblings, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2025-12-02 6:41 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
[..]
> I initially tried picking up the three probe ordering patches from v20/v21
> of Type 2 support, but I hit a NULL pointer dereference in
> devm_cxl_add_memdev() and cycle dependency with all patches so I left
> them out for now.
No, we need to get those baseline patches in; there is no ability to
detect a sync point for "all CXL devices present at boot have had a
chance to probe" without the synchronous registration changes.
I will push a branch with finished patches rather than RFC-quality ones,
and you can build from there. The order of the series should be Sync
Probe changes, DAX HMEM, protocol error series, Type-2.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
2025-11-20 3:19 ` [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Smita Koralahalli
@ 2025-12-02 22:19 ` dan.j.williams
2025-12-11 23:20 ` Koralahalli Channabasappa, Smita
2025-12-02 23:31 ` Dave Jiang
1 sibling, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2025-12-02 22:19 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
> From: Dan Williams <dan.j.williams@intel.com>
>
> Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
> instead of the iomem_resource tree at boot. Delay publishing these ranges
> into the iomem hierarchy until ownership is resolved and the HMEM path
> is ready to consume them.
>
> Publishing Soft Reserved ranges into iomem too early conflicts with CXL
> hotplug and prevents region assembly when those ranges overlap CXL
> windows.
>
> Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
> window publication is complete and HMEM is ready to claim the memory. This
> provides a cleaner handoff between EFI-defined memory ranges and CXL
> resource management without trimming or deleting resources later.
Please, when you modify a patch from an original, add your
Co-developed-by: and clarify what you changed.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> arch/x86/kernel/e820.c | 2 +-
> drivers/cxl/acpi.c | 2 +-
> drivers/dax/hmem/device.c | 4 +-
> drivers/dax/hmem/hmem.c | 7 ++-
> include/linux/ioport.h | 13 +++++-
> kernel/resource.c | 92 +++++++++++++++++++++++++++++++++------
> 6 files changed, 100 insertions(+), 20 deletions(-)
>
[..]
> @@ -426,6 +443,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
> }
> EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
>
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +struct resource soft_reserve_resource = {
> + .name = "Soft Reserved",
> + .start = 0,
> + .end = -1,
> + .desc = IORES_DESC_SOFT_RESERVED,
> + .flags = IORESOURCE_MEM,
> +};
> +EXPORT_SYMBOL_GPL(soft_reserve_resource);
It looks like one of the things you changed from my RFC was the addition
of walk_soft_reserve_res_desc() and region_intersects_soft_reserve().
With those APIs, not only does this symbol not need to be exported, it
can also be static / private to resource.c.
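(For illustration, a minimal sketch of that cleanup, reusing the code
from this patch; the extern declaration and EXPORT_SYMBOL_GPL() of
soft_reserve_resource would then go away:

/* kernel/resource.c -- sketch only: private to resource.c, no export */
static struct resource soft_reserve_resource = {
	.name	= "Soft Reserved",
	.start	= 0,
	.end	= -1,
	.desc	= IORES_DESC_SOFT_RESERVED,
	.flags	= IORESOURCE_MEM,
};

int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
			       u64 start, u64 end, void *arg,
			       int (*func)(struct resource *, void *))
{
	/* callers go through this accessor instead of the tree root itself */
	return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
			     arg, func);
}
EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
)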
> +
> +int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
> + u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
> + arg, func);
> +}
> +EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
> +#endif
> +
> /*
> * This function calls the @func callback against all memory ranges of type
> * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
> @@ -648,6 +685,22 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags,
> }
> EXPORT_SYMBOL_GPL(region_intersects);
>
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +int region_intersects_soft_reserve(resource_size_t start, size_t size,
> + unsigned long flags, unsigned long desc)
> +{
> + int ret;
> +
> + read_lock(&resource_lock);
> + ret = __region_intersects(&soft_reserve_resource, start, size, flags,
> + desc);
> + read_unlock(&resource_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(region_intersects_soft_reserve);
> +#endif
> +
> void __weak arch_remove_reservations(struct resource *avail)
> {
> }
> @@ -966,7 +1019,7 @@ EXPORT_SYMBOL_GPL(insert_resource);
> * Insert a resource into the resource tree, possibly expanding it in order
> * to make it encompass any conflicting resources.
> */
> -void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
> +void __insert_resource_expand_to_fit(struct resource *root, struct resource *new)
> {
> if (new->parent)
> return;
> @@ -997,7 +1050,20 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
> * to use this interface. The former are built-in and only the latter,
> * CXL, is a module.
> */
> -EXPORT_SYMBOL_NS_GPL(insert_resource_expand_to_fit, "CXL");
> +EXPORT_SYMBOL_NS_GPL(__insert_resource_expand_to_fit, "CXL");
> +
> +void insert_resource_expand_to_fit(struct resource *new)
> +{
> + struct resource *root = &iomem_resource;
> +
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> + if (new->desc == IORES_DESC_SOFT_RESERVED)
> + root = &soft_reserve_resource;
> +#endif
I cannot say I am entirely happy with this change: I would prefer to
avoid #ifdef in C, and I would prefer not to break the legacy semantics
of this function. That said, it meets the spirit of the original RFC
without introducing a new insert_resource_late(). I assume review
feedback requested this?
> + __insert_resource_expand_to_fit(root, new);
> +}
> +EXPORT_SYMBOL_GPL(insert_resource_expand_to_fit);
There are no consumers for this export, so it can be dropped.
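(For illustration, one way to address the #ifdef concern above; this is
a sketch only and assumes soft_reserve_resource is defined
unconditionally in resource.c so the IS_ENABLED() branch always
compiles:

void insert_resource_expand_to_fit(struct resource *new)
{
	struct resource *root = &iomem_resource;

	/* route boot-time Soft Reserved entries to the dedicated tree */
	if (IS_ENABLED(CONFIG_EFI_SOFT_RESERVE) &&
	    new->desc == IORES_DESC_SOFT_RESERVED)
		root = &soft_reserve_resource;

	__insert_resource_expand_to_fit(root, new);
}
)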
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows
2025-11-20 3:19 ` [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows Smita Koralahalli
@ 2025-12-02 22:37 ` dan.j.williams
2025-12-11 23:23 ` Koralahalli Channabasappa, Smita
0 siblings, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2025-12-02 22:37 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
> From: Dan Williams <dan.j.williams@intel.com>
>
> Defer handling of Soft Reserved ranges that intersect CXL windows at
> probe time. Delay processing until after device discovery so that the
> CXL stack can publish windows and assemble regions before HMEM claims
> those address ranges.
>
> Add a deferral path that schedules deferred work when HMEM detects a
> Soft Reserved range intersecting a CXL window during probe. The deferred
> work runs after probe completes and allows the CXL subsystem to finish
> resource discovery and region setup before HMEM takes any action.
>
> This change does not address region assembly failures. It only delays
> HMEM handling to avoid prematurely claiming ranges that CXL may own.
No, with the changes it just unconditionally disables dax_hmem in the
presence of CXL. I do not think these changes can stand alone. It
probably wants to be folded with patch 5 or something like that.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
2025-11-20 3:19 ` [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Smita Koralahalli
2025-12-02 22:19 ` dan.j.williams
@ 2025-12-02 23:31 ` Dave Jiang
1 sibling, 0 replies; 31+ messages in thread
From: Dave Jiang @ 2025-12-02 23:31 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
On 11/19/25 8:19 PM, Smita Koralahalli wrote:
> From: Dan Williams <dan.j.williams@intel.com>
>
> Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
> instead of the iomem_resource tree at boot. Delay publishing these ranges
> into the iomem hierarchy until ownership is resolved and the HMEM path
> is ready to consume them.
>
> Publishing Soft Reserved ranges into iomem too early conflicts with CXL
> hotplug and prevents region assembly when those ranges overlap CXL
> windows.
>
> Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
> window publication is complete and HMEM is ready to claim the memory. This
> provides a cleaner handoff between EFI-defined memory ranges and CXL
> resource management without trimming or deleting resources later.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
With changes requested from Dan,
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> arch/x86/kernel/e820.c | 2 +-
> drivers/cxl/acpi.c | 2 +-
> drivers/dax/hmem/device.c | 4 +-
> drivers/dax/hmem/hmem.c | 7 ++-
> include/linux/ioport.h | 13 +++++-
> kernel/resource.c | 92 +++++++++++++++++++++++++++++++++------
> 6 files changed, 100 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c3acbd26408b..c32f144f0e4a 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1153,7 +1153,7 @@ void __init e820__reserve_resources_late(void)
> res = e820_res;
> for (i = 0; i < e820_table->nr_entries; i++) {
> if (!res->parent && res->end)
> - insert_resource_expand_to_fit(&iomem_resource, res);
> + insert_resource_expand_to_fit(res);
> res++;
> }
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index bd2e282ca93a..b37858f797be 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -847,7 +847,7 @@ static int add_cxl_resources(struct resource *cxl_res)
> */
> cxl_set_public_resource(res, new);
>
> - insert_resource_expand_to_fit(&iomem_resource, new);
> + __insert_resource_expand_to_fit(&iomem_resource, new);
>
> next = res->sibling;
> while (next && resource_overlaps(new, next)) {
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index f9e1a76a04a9..22732b729017 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -83,8 +83,8 @@ static __init int hmem_register_one(struct resource *res, void *data)
>
> static __init int hmem_init(void)
> {
> - walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
> - IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
> + walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0,
> + -1, NULL, hmem_register_one);
> return 0;
> }
>
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index c18451a37e4f..48f4642f4bb8 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -73,11 +73,14 @@ static int hmem_register_device(struct device *host, int target_nid,
> return 0;
> }
>
> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> - IORES_DESC_SOFT_RESERVED);
> + rc = region_intersects_soft_reserve(res->start, resource_size(res),
> + IORESOURCE_MEM,
> + IORES_DESC_SOFT_RESERVED);
> if (rc != REGION_INTERSECTS)
> return 0;
>
> + /* TODO: Add Soft-Reserved memory back to iomem */
> +
> id = memregion_alloc(GFP_KERNEL);
> if (id < 0) {
> dev_err(host, "memregion allocation failure for %pr\n", res);
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index e8b2d6aa4013..e20226870a81 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -232,6 +232,9 @@ struct resource_constraint {
> /* PC/ISA/whatever - the normal PC address spaces: IO and memory */
> extern struct resource ioport_resource;
> extern struct resource iomem_resource;
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +extern struct resource soft_reserve_resource;
> +#endif
>
> extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
> extern int request_resource(struct resource *root, struct resource *new);
> @@ -242,7 +245,8 @@ extern void reserve_region_with_split(struct resource *root,
> const char *name);
> extern struct resource *insert_resource_conflict(struct resource *parent, struct resource *new);
> extern int insert_resource(struct resource *parent, struct resource *new);
> -extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
> +extern void __insert_resource_expand_to_fit(struct resource *root, struct resource *new);
> +extern void insert_resource_expand_to_fit(struct resource *new);
> extern int remove_resource(struct resource *old);
> extern void arch_remove_reservations(struct resource *avail);
> extern int allocate_resource(struct resource *root, struct resource *new,
> @@ -409,6 +413,13 @@ walk_system_ram_res_rev(u64 start, u64 end, void *arg,
> extern int
> walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
> void *arg, int (*func)(struct resource *, void *));
> +extern int
> +walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
> + u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *));
> +extern int
> +region_intersects_soft_reserve(resource_size_t start, size_t size,
> + unsigned long flags, unsigned long desc);
>
> struct resource *devm_request_free_mem_region(struct device *dev,
> struct resource *base, unsigned long size);
> diff --git a/kernel/resource.c b/kernel/resource.c
> index b9fa2a4ce089..208eaafcc681 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -321,13 +321,14 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
> }
>
> /**
> - * find_next_iomem_res - Finds the lowest iomem resource that covers part of
> - * [@start..@end].
> + * find_next_res - Finds the lowest resource that covers part of
> + * [@start..@end].
> *
> * If a resource is found, returns 0 and @*res is overwritten with the part
> * of the resource that's within [@start..@end]; if none is found, returns
> * -ENODEV. Returns -EINVAL for invalid parameters.
> *
> + * @parent: resource tree root to search
> * @start: start address of the resource searched for
> * @end: end address of same resource
> * @flags: flags which the resource must have
> @@ -337,9 +338,9 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
> * The caller must specify @start, @end, @flags, and @desc
> * (which may be IORES_DESC_NONE).
> */
> -static int find_next_iomem_res(resource_size_t start, resource_size_t end,
> - unsigned long flags, unsigned long desc,
> - struct resource *res)
> +static int find_next_res(struct resource *parent, resource_size_t start,
> + resource_size_t end, unsigned long flags,
> + unsigned long desc, struct resource *res)
> {
> struct resource *p;
>
> @@ -351,7 +352,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
>
> read_lock(&resource_lock);
>
> - for_each_resource(&iomem_resource, p, false) {
> + for_each_resource(parent, p, false) {
> /* If we passed the resource we are looking for, stop */
> if (p->start > end) {
> p = NULL;
> @@ -382,16 +383,23 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
> return p ? 0 : -ENODEV;
> }
>
> -static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
> - unsigned long flags, unsigned long desc,
> - void *arg,
> - int (*func)(struct resource *, void *))
> +static int find_next_iomem_res(resource_size_t start, resource_size_t end,
> + unsigned long flags, unsigned long desc,
> + struct resource *res)
> +{
> + return find_next_res(&iomem_resource, start, end, flags, desc, res);
> +}
> +
> +static int walk_res_desc(struct resource *parent, resource_size_t start,
> + resource_size_t end, unsigned long flags,
> + unsigned long desc, void *arg,
> + int (*func)(struct resource *, void *))
> {
> struct resource res;
> int ret = -EINVAL;
>
> while (start < end &&
> - !find_next_iomem_res(start, end, flags, desc, &res)) {
> + !find_next_res(parent, start, end, flags, desc, &res)) {
> ret = (*func)(&res, arg);
> if (ret)
> break;
> @@ -402,6 +410,15 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
> return ret;
> }
>
> +static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
> + unsigned long flags, unsigned long desc,
> + void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + return walk_res_desc(&iomem_resource, start, end, flags, desc, arg, func);
> +}
> +
> +
> /**
> * walk_iomem_res_desc - Walks through iomem resources and calls func()
> * with matching resource ranges.
> @@ -426,6 +443,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
> }
> EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
>
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +struct resource soft_reserve_resource = {
> + .name = "Soft Reserved",
> + .start = 0,
> + .end = -1,
> + .desc = IORES_DESC_SOFT_RESERVED,
> + .flags = IORESOURCE_MEM,
> +};
> +EXPORT_SYMBOL_GPL(soft_reserve_resource);
> +
> +int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
> + u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
> + arg, func);
> +}
> +EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
> +#endif
> +
> /*
> * This function calls the @func callback against all memory ranges of type
> * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
> @@ -648,6 +685,22 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags,
> }
> EXPORT_SYMBOL_GPL(region_intersects);
>
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +int region_intersects_soft_reserve(resource_size_t start, size_t size,
> + unsigned long flags, unsigned long desc)
> +{
> + int ret;
> +
> + read_lock(&resource_lock);
> + ret = __region_intersects(&soft_reserve_resource, start, size, flags,
> + desc);
> + read_unlock(&resource_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(region_intersects_soft_reserve);
> +#endif
> +
> void __weak arch_remove_reservations(struct resource *avail)
> {
> }
> @@ -966,7 +1019,7 @@ EXPORT_SYMBOL_GPL(insert_resource);
> * Insert a resource into the resource tree, possibly expanding it in order
> * to make it encompass any conflicting resources.
> */
> -void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
> +void __insert_resource_expand_to_fit(struct resource *root, struct resource *new)
> {
> if (new->parent)
> return;
> @@ -997,7 +1050,20 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
> * to use this interface. The former are built-in and only the latter,
> * CXL, is a module.
> */
> -EXPORT_SYMBOL_NS_GPL(insert_resource_expand_to_fit, "CXL");
> +EXPORT_SYMBOL_NS_GPL(__insert_resource_expand_to_fit, "CXL");
> +
> +void insert_resource_expand_to_fit(struct resource *new)
> +{
> + struct resource *root = &iomem_resource;
> +
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> + if (new->desc == IORES_DESC_SOFT_RESERVED)
> + root = &soft_reserve_resource;
> +#endif
> +
> + __insert_resource_expand_to_fit(root, new);
> +}
> +EXPORT_SYMBOL_GPL(insert_resource_expand_to_fit);
>
> /**
> * remove_resource - Remove a resource in the resource tree
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 3/9] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
2025-11-20 3:19 ` [PATCH v4 3/9] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL Smita Koralahalli
@ 2025-12-02 23:32 ` Dave Jiang
0 siblings, 0 replies; 31+ messages in thread
From: Dave Jiang @ 2025-12-02 23:32 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
On 11/19/25 8:19 PM, Smita Koralahalli wrote:
> From: Dan Williams <dan.j.williams@intel.com>
>
> Replace IS_ENABLED(CONFIG_CXL_REGION) with IS_ENABLED(CONFIG_DEV_DAX_CXL)
> so that HMEM only defers Soft Reserved ranges when CXL DAX support is
> enabled. This makes the coordination between HMEM and the CXL stack more
> precise and prevents deferral in unrelated CXL configurations.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> drivers/dax/hmem/hmem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 02e79c7adf75..c2c110b194e5 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -66,7 +66,7 @@ static int hmem_register_device(struct device *host, int target_nid,
> long id;
> int rc;
>
> - if (IS_ENABLED(CONFIG_CXL_REGION) &&
> + if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
> region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> IORES_DESC_CXL) != REGION_DISJOINT) {
> dev_dbg(host, "deferring range to CXL: %pr\n", res);
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map()
2025-11-20 3:19 ` [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map() Smita Koralahalli
@ 2025-12-03 3:50 ` dan.j.williams
2025-12-11 23:42 ` Koralahalli Channabasappa, Smita
0 siblings, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2025-12-03 3:50 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
> Introduce cxl_regions_fully_map() to check whether CXL regions form a
> single contiguous, non-overlapping cover of a given Soft Reserved range.
>
> Use this helper to decide whether Soft Reserved memory overlapping CXL
> regions should be owned by CXL or registered by HMEM.
>
> If the span is fully covered by CXL regions, treat the Soft Reserved
> range as owned by CXL and have HMEM skip registration. Else, let HMEM
> claim the range and register the corresponding devdax for it.
This all feels a bit too custom when helpers like resource_contains()
exist.
Also remember that the default list of soft-reserved ranges that dax
grabs is filtered by the ACPI HMAT. So while there is a chance that one
EFI memory map entry spans multiple CXL regions, there is a lower chance
that a single ACPI HMAT range spans multiple CXL regions.
I think it is fair for Linux to be simple and require that an algorithm
of:
cxl_contains_soft_reserve()
    for_each_cxl_intersecting_hmem_resource()
        found = false
        for_each_region()
            if (resource_contains(cxl_region_resource, hmem_resource))
                found = true
        if (!found)
            return false
    return true
...should be good enough; otherwise fall back to pure hmem operation and
do not worry about the corner cases.
If Linux really needs to understand that ACPI HMAT ranges may span
multiple CXL regions then I would want to understand more what is
driving that configuration.
Btw, I do not see a:
guard(rwsem_read)(&cxl_rwsem.region)
...anywhere in the proposed patch. That needs to be held to be sure the
region's resource settings are not changed out from underneath you. This
should probably also check that the region is in the committed state,
because it may still be racing with regions under creation post
wait_for_device_probe().
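(For illustration only, a rough sketch that combines the
resource_contains() walk suggested above with the locking and
commit-state checks; it is not the series' implementation,
for_each_cxl_region() is a hypothetical iterator, and the walk reuses
the walk_soft_reserve_res_desc() helper from patch 1:

/* Sketch only: assumes a hypothetical for_each_cxl_region() iterator. */
static int covered_by_committed_region(struct resource *res, void *arg)
{
	struct cxl_region *cxlr;
	bool found = false;

	/* keep region resources stable while comparing spans */
	guard(rwsem_read)(&cxl_rwsem.region);

	for_each_cxl_region(cxlr) {
		struct cxl_region_params *p = &cxlr->params;

		/* only fully committed regions count as owning the span */
		if (p->state != CXL_CONFIG_COMMIT || !p->res)
			continue;
		if (resource_contains(p->res, res)) {
			found = true;
			break;
		}
	}

	/* a non-zero return stops the walk: "not fully covered" */
	return found ? 0 : 1;
}

static bool cxl_contains_soft_reserve(resource_size_t start, resource_size_t end)
{
	/* every Soft Reserved piece in [start, end] must sit inside a region */
	return walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED,
					  IORESOURCE_MEM, start, end, NULL,
					  covered_by_committed_region) == 0;
}
)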
> void cxl_endpoint_parse_cdat(struct cxl_port *port);
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index f70a0688bd11..db4c46337ac3 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -3,6 +3,8 @@
> #include <linux/memregion.h>
> #include <linux/module.h>
> #include <linux/dax.h>
> +
> +#include "../../cxl/cxl.h"
> #include "../bus.h"
>
> static bool region_idle;
> @@ -150,7 +152,17 @@ static int hmem_register_device(struct device *host, int target_nid,
> static int handle_deferred_cxl(struct device *host, int target_nid,
> const struct resource *res)
> {
> - /* TODO: Handle region assembly failures */
> + if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> + IORES_DESC_CXL) != REGION_DISJOINT) {
> +
> + if (cxl_regions_fully_map(res->start, res->end))
> + dax_cxl_mode = DAX_CXL_MODE_DROP;
> + else
> + dax_cxl_mode = DAX_CXL_MODE_REGISTER;
> +
> + hmem_register_device(host, target_nid, res);
> + }
> +
I think there is enough content to just create the new
cxl_contains_soft_reserve() ABI, and then hook up handle_deferred_cxl in
a follow-on patch.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-12-01 19:56 ` [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Alison Schofield
@ 2025-12-03 13:35 ` Tomasz Wolski
2025-12-03 22:05 ` dan.j.williams
0 siblings, 1 reply; 31+ messages in thread
From: Tomasz Wolski @ 2025-12-03 13:35 UTC (permalink / raw)
To: alison.schofield
Cc: Smita.KoralahalliChannabasappa, ardb, benjamin.cheatham, bp,
dan.j.williams, dave.jiang, dave, gregkh, huang.ying.caritas,
ira.weiny, jack, jeff.johnson, jonathan.cameron, len.brown,
linux-cxl, linux-fsdevel, linux-kernel, linux-pm, lizhijian,
ming.li, nathan.fontenot, nvdimm, pavel, peterz, rafael, rrichter,
terry.bowman, vishal.l.verma, willy, yaoxt.fnst, yazen.ghannam
>> This series aims to address long-standing conflicts between HMEM and
>> CXL when handling Soft Reserved memory ranges.
>>
>> Reworked from Dan's patch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/patch/?id=ab70c6227ee6165a562c215d9dcb4a1c55620d5d
>>
>> Previous work:
>> https://lore.kernel.org/all/20250715180407.47426-1-Smita.KoralahalliChannabasappa@amd.com/
>>
>> Link to v3:
>> https://lore.kernel.org/all/20250930044757.214798-1-Smita.KoralahalliChannabasappa@amd.com
>>
>> This series should be applied on top of:
>> "214291cbaace: acpi/hmat: Fix lockdep warning for hmem_register_resource()"
>> and is based on:
>> base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
>>
>> I initially tried picking up the three probe ordering patches from v20/v21
>> of Type 2 support, but I hit a NULL pointer dereference in
>> devm_cxl_add_memdev() and cycle dependency with all patches so I left
>> them out for now. With my current series rebased on 6.18-rc2 plus
>> 214291cbaace, probe ordering behaves correctly on AMD systems and I have
>> verified the scenarios mentioned below. I can pull those three patches
>> back in for a future revision once the failures are sorted out.
>
>Hi Smita,
>
>This is a regression from the v3 version for my hotplug test case.
>I believe at least partially due to the omitted probe order patches.
>I'm not clear why that 'dax18.0' still exists after region teardown.
>
>Upon booting:
>- Do not expect to see that Soft Reserved resource
>
>68e80000000-8d37fffffff : CXL Window 9
> 68e80000000-70e7fffffff : region9
> 68e80000000-70e7fffffff : Soft Reserved
> 68e80000000-70e7fffffff : dax18.0
> 68e80000000-70e7fffffff : System RAM (kmem)
>
>After region teardown:
>- Do not expect to see that Soft Reserved resource
>- Do not expect to see that DAX or kmem
>
>68e80000000-8d37fffffff : CXL Window 9
> 68e80000000-70e7fffffff : Soft Reserved
> 68e80000000-70e7fffffff : dax18.0
> 68e80000000-70e7fffffff : System RAM (kmem)
>
>Create the region anew:
>- Here we see a new region and dax devices created in the
>available space after the Soft Reserved. We don't want
>that. We want to be able to recreate in that original
>space of 68e80000000-70e7fffffff.
>
>68e80000000-8d37fffffff : CXL Window 9
> 68e80000000-70e7fffffff : Soft Reserved
> 68e80000000-70e7fffffff : dax18.0
> 68e80000000-70e7fffffff : System RAM (kmem)
> 70e80000000-78e7fffffff : region9
> 70e80000000-78e7fffffff : dax9.0
> 70e80000000-78e7fffffff : System RAM (kmem)
>
>
>-- Alison
Hello Smita, Alison
I did some testing and came across issues with probe order, so I applied the
three patches mentioned by Smita plus a fix for the NULL dereference.
I noticed issues in scenarios 3.1 and 4 below, but maybe they are related to
the test setup:
[1] QEMU: 1 CFMWS + Host-bridge + 1 CXL device
Soft Reserved is not seen in the iomem tree:
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000b8fffffff] soft reserved
== region teardown
a90000000-b8fffffff : CXL Window 0
// no dax devices
== region recreate
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
== booted with no PCI attached
a90000000-b8fffffff : Soft Reserved
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : dax1.0
a90000000-b8fffffff : System RAM (kmem)
== ..and hot plug via QEMU terminal => is the following iomem tree expected?
a90000000-b8fffffff : Soft Reserved
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax1.0
a90000000-b8fffffff : System RAM (kmem)
kernel: [ 129.820136][ T65] cxl_acpi ACPI0017:00: decoder0.0: created region0
..
kernel: [ 129.827126][ T65] cxl_region region0: [mem 0xa90000000-0xb8fffffff flags 0x200] has System RAM: [mem 0xa90000000-0xb8fffffff flags 0x83000200]
[1.1] QEMU: 1 CFMWS + Host-bridge + 1 CXL device
Region is smaller than SR - hmem claims the space
a90000000-bcfffffff : Soft Reserved
a90000000-bcfffffff : CXL Window 0
a90000000-bcfffffff : dax1.0
a90000000-bcfffffff : System RAM (kmem)
[2] QEMU: 1 CFMWS + Host-bridge + 2 CXL devices
kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000c8fffffff] soft reserved
a90000000-c8fffffff : CXL Window 0
a90000000-b8fffffff : region1
a90000000-b8fffffff : dax1.0
a90000000-b8fffffff : System RAM (kmem)
b90000000-c8fffffff : region0
b90000000-c8fffffff : dax0.0
b90000000-c8fffffff : System RAM (kmem)
== region1 teardown
a90000000-c8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
== recreate region1 - created in correct address range
a90000000-c8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
b90000000-c8fffffff : region1
b90000000-c8fffffff : dax1.0
b90000000-c8fffffff : System RAM (kmem)
[2.1] QEMU: 1 CFMWS + Host-bridge + 2 CXL devices
Region is smaller than SR - hmem claims the whole space
kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000ccfffffff] soft reserved
a90000000-ccfffffff : Soft Reserved
a90000000-ccfffffff : CXL Window 0
a90000000-ccfffffff : dax1.0
a90000000-ccfffffff : System RAM (kmem)
[3] QEMU: 2 CFMWS + Host-bridge + 2 CXL devices
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
b90000000-c8fffffff : CXL Window 1
b90000000-c8fffffff : region1
b90000000-c8fffffff : dax1.0
b90000000-c8fffffff : System RAM (kmem)
== Tearing down region 1
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
b90000000-c8fffffff : CXL Window 1
== Recreate region 1
a90000000-b8fffffff : CXL Window 0
a90000000-b8fffffff : region0
a90000000-b8fffffff : dax0.0
a90000000-b8fffffff : System RAM (kmem)
b90000000-c8fffffff : CXL Window 1
b90000000-c8fffffff : region1
b90000000-c8fffffff : dax1.0
b90000000-c8fffffff : System RAM (kmem)
[3.1] QEMU: 2 CFMWS + Host-bridge + 2 CXL devices
Region does not span the whole CXL Window - hmem should claim the whole space, but kmem failed with -EBUSY
a90000000-ccfffffff : Soft Reserved
a90000000-bcfffffff : CXL Window 0
bd0000000-ccfffffff : CXL Window 1
kernel: [ 24.598310][ T543] cxl_acpi ACPI0017:00: decoder0.0 added to root0
kernel: [ 24.598645][ T543] cxl_acpi ACPI0017:00: decode range: node: 1 range [0xa90000000 - 0xbcfffffff]
kernel: [ 24.599673][ T543] cxl_acpi ACPI0017:00: decoder0.1 added to root0
kernel: [ 24.599939][ T543] cxl_acpi ACPI0017:00: decode range: node: 2 range [0xbd0000000 - 0xccfffffff]
kernel: [ 24.630549][ T543] cxl_acpi ACPI0017:00: root0: add: nvdimm-bridge0
kernel: [ 24.692068][ T70] cxl_pci 0000:0e:00.0: mem0:decoder2.0 no CXL window for range 0xb90000000:0xc8fffffff
kernel: [ 24.722976][ T69] cxl_region region0: config state: 0
kernel: [ 24.724446][ T69] cxl_acpi ACPI0017:00: decoder0.0: created region0
kernel: [ 24.725023][ T69] cxl_pci 0000:0d:00.0: mem1:decoder3.0: __construct_region region0 res: [mem 0xa90000000-0xb8fffffff flags 0x200] iw: 1 ig: 256
kernel: [ 24.727230][ T69] cxl_mem mem1: decoder:decoder3.0 parent:0000:0d:00.0 port:endpoint3 range:0xa90000000-0xb8fffffff pos:0
kernel: [ 24.728660][ T69] cxl region0: region sort successful
kernel: [ 24.729627][ T69] cxl region0: mem1:endpoint3 decoder3.0 add: mem1:decoder3.0 @ 0 next: none nr_eps:1 nr_targets: 1
kernel: [ 24.730566][ T69] cxl region0: pci0000:0c:port1 decoder1.0 add: mem1:decoder3.0 @ 0 next: mem1 nr_eps: 1 nr_targets: 1
kernel: [ 24.731445][ T69] cxl region0: pci0000:0c:port1 iw: 1 ig: 256
kernel: [ 24.731791][ T69] cxl region0: pci0000:0c:port1 target[0] = 0000:0c:00.0 for mem1:decoder3.0 @ 0
kernel: [ 24.807234][ T519] hmem_platform hmem_platform.0: deferring range to CXL: [mem 0xa90000000-0xccfffffff flags 0x80000200]
kernel: [ 24.903542][ T99] hmem_platform hmem_platform.0: registering CXL range: [mem 0xa90000000-0xccfffffff flags 0x80000200]
kernel: [ 25.043776][ T530] kmem dax2.0: mapping0: 0xa90000000-0xccfffffff could not reserve region
kernel: [ 25.044553][ T530] kmem dax2.0: probe with driver kmem failed with error -16
[4] Physical machine: 2 CFMWS + Host-bridge + 2 CXL devices
kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft reserved
2070000000-606fffffff : CXL Window 0
2070000000-606fffffff : region0
2070000000-606fffffff : dax0.0
2070000000-606fffffff : System RAM (kmem)
6070000000-a06fffffff : CXL Window 1
6070000000-a06fffffff : region1
6070000000-a06fffffff : dax1.0
6070000000-a06fffffff : System RAM (kmem)
kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft reserved
== region 1 teardown and unplug (the unplug was done via unbind/remove in /sys/bus/pci/devices)
2070000000-606fffffff : CXL Window 0
2070000000-606fffffff : region0
2070000000-606fffffff : dax0.0
2070000000-606fffffff : System RAM (kmem)
6070000000-a06fffffff : CXL Window 1
== plug - after PCI rescan, hmem cannot be created
6070000000-a06fffffff : CXL Window 1
6070000000-a06fffffff : region1
kernel: cxl_region region1: config state: 0
kernel: cxl_acpi ACPI0017:00: decoder0.1: created region1
kernel: cxl_pci 0000:04:00.0: mem1:decoder10.0: __construct_region region1 res: [mem 0x6070000000-0xa06fffffff flags 0x200] iw: 1 ig: 4096
kernel: cxl_mem mem1: decoder:decoder10.0 parent:0000:04:00.0 port:endpoint10 range:0x6070000000-0xa06fffffff pos:0
kernel: cxl region1: region sort successful
kernel: cxl region1: mem1:endpoint10 decoder10.0 add: mem1:decoder10.0 @ 0 next: none nr_eps: 1 nr_targets: 1
kernel: cxl region1: pci0000:00:port2 decoder2.1 add: mem1:decoder10.0 @ 0 next: mem1 nr_eps: 1 nr_targets: 1
kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets expected iw: 1 ig: 4096 [mem 0x6070000000-0xa06fffffff flags 0x200]
kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets got iw: 1 ig: 256 state: disabled 0x6070000000:0xa06fffffff
kernel: cxl_port endpoint10: failed to attach decoder10.0 to region1: -6
Thanks,
Tomasz
>>
>> Probe order patches of interest:
>> cxl/mem: refactor memdev allocation
>> cxl/mem: Arrange for always-synchronous memdev attach
>> cxl/port: Arrange for always synchronous endpoint attach
>>
>> [1] Hotplug looks okay. After offlining the memory I can tear down the
>> regions and recreate it back if CXL owns entire SR range as Soft Reserved
>> is gone. dax_cxl creates dax devices and onlines memory.
>> 850000000-284fffffff : CXL Window 0
>> 850000000-284fffffff : region0
>> 850000000-284fffffff : dax0.0
>> 850000000-284fffffff : System RAM (kmem)
>>
>> [2] With CONFIG_CXL_REGION disabled, all the resources are handled by
>> HMEM. Soft Reserved range shows up in /proc/iomem, no regions come up
>> and dax devices are created from HMEM.
>> 850000000-284fffffff : CXL Window 0
>> 850000000-284fffffff : Soft Reserved
>> 850000000-284fffffff : dax0.0
>> 850000000-284fffffff : System RAM (kmem)
>>
>> [3] Region assembly failures also behave okay and work same as [2].
>>
>> Before:
>> 2850000000-484fffffff : Soft Reserved
>> 2850000000-484fffffff : CXL Window 1
>> 2850000000-484fffffff : dax4.0
>> 2850000000-484fffffff : System RAM (kmem)
>>
>> After tearing down dax4.0 and creating it back:
>>
>> Logs:
>> [ 547.847764] unregister_dax_mapping: mapping0: unregister_dax_mapping
>> [ 547.855000] trim_dev_dax_range: dax dax4.0: delete range[0]: 0x2850000000:0x484fffffff
>> [ 622.474580] alloc_dev_dax_range: dax dax4.1: alloc range[0]: 0x0000002850000000:0x000000484fffffff
>> [ 752.766194] Fallback order for Node 0: 0 1
>> [ 752.766199] Fallback order for Node 1: 1 0
>> [ 752.766200] Built 2 zonelists, mobility grouping on. Total pages: 8096220
>> [ 752.783234] Policy zone: Normal
>> [ 752.808604] Demotion targets for Node 0: preferred: 1, fallback: 1
>> [ 752.815509] Demotion targets for Node 1: null
>>
>> After:
>> 2850000000-484fffffff : Soft Reserved
>> 2850000000-484fffffff : CXL Window 1
>> 2850000000-484fffffff : dax4.1
>> 2850000000-484fffffff : System RAM (kmem)
>>
>> [4] A small hack to tear down the fully assembled and probed region
>> (i.e region in committed state) for range 850000000-284fffffff.
>> This is to test the region teardown path for regions which don't
>> fully cover the Soft Reserved range.
>>
>> 850000000-284fffffff : Soft Reserved
>> 850000000-284fffffff : CXL Window 0
>> 850000000-284fffffff : dax5.0
>> 850000000-284fffffff : System RAM (kmem)
>> 2850000000-484fffffff : CXL Window 1
>> 2850000000-484fffffff : region1
>> 2850000000-484fffffff : dax1.0
>> 2850000000-484fffffff : System RAM (kmem)
>> .4850000000-684fffffff : CXL Window 2
>> 4850000000-684fffffff : region2
>> 4850000000-684fffffff : dax2.0
>> 4850000000-684fffffff : System RAM (kmem)
>>
>> daxctl list -R -u
>> [
>> {
>> "path":"\/platform\/ACPI0017:00\/root0\/decoder0.1\/region1\/dax_region1",
>> "id":1,
>> "size":"128.00 GiB (137.44 GB)",
>> "align":2097152
>> },
>> {
>> "path":"\/platform\/hmem.5",
>> "id":5,
>> "size":"128.00 GiB (137.44 GB)",
>> "align":2097152
>> },
>> {
>> "path":"\/platform\/ACPI0017:00\/root0\/decoder0.2\/region2\/dax_region2",
>> "id":2,
>> "size":"128.00 GiB (137.44 GB)",
>> "align":2097152
>> }
>> ]
>>
>> I couldn't test multiple regions under same Soft Reserved range
>> with/without contiguous mapping due to limiting BIOS support. Hopefully
>> that works.
>>
>> v4 updates:
>> - No changes patches 1-3.
>> - New patches 4-7.
>> - handle_deferred_cxl() has been enhanced to handle case where CXL
>> regions do not contiguously and fully cover Soft Reserved ranges.
>> - Support added to defer cxl_dax registration.
>> - Support added to teardown cxl regions.
>>
>> v3 updates:
>> - Fixed two "From".
>>
>> v2 updates:
>> - Removed conditional check on CONFIG_EFI_SOFT_RESERVE as dax_hmem
>> depends on CONFIG_EFI_SOFT_RESERVE. (Zhijian)
>> - Added TODO note. (Zhijian)
>> - Included region_intersects_soft_reserve() inside CONFIG_EFI_SOFT_RESERVE
>> conditional check. (Zhijian)
>> - insert_resource_late() -> insert_resource_expand_to_fit() and
>> __insert_resource_expand_to_fit() replacement. (Boris)
>> - Fixed Co-developed and Signed-off by. (Dan)
>> - Combined 2/6 and 3/6 into a single patch. (Zhijian).
>> - Skip local variable in remove_soft_reserved. (Jonathan)
>> - Drop kfree with __free(). (Jonathan)
>> - return 0 -> return dev_add_action_or_reset(host...) (Jonathan)
>> - Dropped 6/6.
>> - Reviewed-by tags (Dave, Jonathan)
>>
>> Dan Williams (4):
>> dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is
>> ready
>> dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved
>> ranges
>> dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
>> dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL
>> windows
>>
>> Smita Koralahalli (5):
>> cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with
>> cxl_regions_fully_map()
>> cxl/region: Add register_dax flag to control probe-time devdax setup
>> cxl/region, dax/hmem: Register devdax only when CXL owns Soft Reserved
>> span
>> cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft
>> Reserved
>> dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree
>>
>> arch/x86/kernel/e820.c | 2 +-
>> drivers/cxl/acpi.c | 2 +-
>> drivers/cxl/core/region.c | 181 ++++++++++++++++++++++++++++++++++++--
>> drivers/cxl/cxl.h | 17 ++++
>> drivers/dax/Kconfig | 2 +
>> drivers/dax/hmem/device.c | 4 +-
>> drivers/dax/hmem/hmem.c | 137 ++++++++++++++++++++++++++---
>> include/linux/ioport.h | 13 ++-
>> kernel/resource.c | 92 ++++++++++++++++---
>> 9 files changed, 415 insertions(+), 35 deletions(-)
>>
>> base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
>> --
>> 2.17.1
>>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-12-03 13:35 ` Tomasz Wolski
@ 2025-12-03 22:05 ` dan.j.williams
2025-12-05 2:54 ` Yasunori Gotou (Fujitsu)
0 siblings, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2025-12-03 22:05 UTC (permalink / raw)
To: Tomasz Wolski, alison.schofield
Cc: Smita.KoralahalliChannabasappa, ardb, benjamin.cheatham, bp,
dan.j.williams, dave.jiang, dave, gregkh, huang.ying.caritas,
ira.weiny, jack, jeff.johnson, jonathan.cameron, len.brown,
linux-cxl, linux-fsdevel, linux-kernel, linux-pm, lizhijian,
ming.li, nathan.fontenot, nvdimm, pavel, peterz, rafael, rrichter,
terry.bowman, vishal.l.verma, willy, yaoxt.fnst, yazen.ghannam
Tomasz Wolski wrote:
[..]
>
> Hello Smita, Alison
>
> I did some testing and came across issues with probe order, so I applied the
> three patches mentioned by Smita plus a fix for the NULL dereference.
> I noticed issues in scenarios 3.1 and 4 below, but maybe they are related to
> the test setup:
BTW, thanks for all these tests, it helps!
> [1] QEMU: 1 CFMWS + Host-bridge + 1 CXL device
> Soft Reserved is not seen in the iomem tree:
>
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
>
> kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000b8fffffff] soft reserved
>
> == region teardown
> a90000000-b8fffffff : CXL Window 0
> // no dax devices
>
> == region recreate
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
>
> == booted with no PCI attached
> a90000000-b8fffffff : Soft Reserved
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : dax1.0
> a90000000-b8fffffff : System RAM (kmem)
So this is the expected behavior with the proposal that if the device and
memory are present at boot, but the driver is disabled or fails to
assemble, then the solution falls back to "region-less" dax.
> == ..and hot plug via QEMU terminal => is the following iomem tree expected?
> a90000000-b8fffffff : Soft Reserved
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax1.0
> a90000000-b8fffffff : System RAM (kmem)
Unless I am missing something, this looks like a bug in the test because
if you are truly hot-adding the device after boot, then the BIOS would
have had no reason/chance to create that Soft Reserved entry. The
assumption is that the presence of Soft Reserved always implies that it
was mapping physical hardware that was present at boot.
> kernel: [ 129.820136][ T65] cxl_acpi ACPI0017:00: decoder0.0: created region0
> ..
> kernel: [ 129.827126][ T65] cxl_region region0: [mem 0xa90000000-0xb8fffffff flags 0x200] has System RAM: [mem 0xa90000000-0xb8fffffff flags 0x83000200]
>
> [1.1] QEMU: 1 CFMWS + Host-bridge + 1 CXL device
> Region is smaller than SR - hmem claims the space
The expectation is that a configuration like this is out of scope.
Unless CXL fully covers a Soft Reserved entry it must assume that the
platform is doing something custom / special and disable the CXL
subsystem in its entirety.
The simplifying hope is that there is always a 1:1 correlation between
CXL Region and ACPI SRAT/HMAT range entries such that a Soft Reserved
resource is never misaligned to a CXL region.
Hmm, this might highlight a gap in the implementation. I think we need
to make sure that drivers/acpi/numa/hmat.c::alloc_memory_target()
injects boundaries into soft_reserve_resource. I.e. I think a BIOS might
create a merged EFI memory map entry that spans multiple SRAT/HMAT
ranges in the same proximity domain.
It would be lovely to require BIOS to bound their descriptions on CXL
region boundaries.
> a90000000-bcfffffff : Soft Reserved
> a90000000-bcfffffff : CXL Window 0
> a90000000-bcfffffff : dax1.0
> a90000000-bcfffffff : System RAM (kmem)
>
> [2] QEMU: 1 CFMWS + Host-bridge + 2 CXL devices
>
> kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000c8fffffff] soft reserved
>
> a90000000-c8fffffff : CXL Window 0
> a90000000-b8fffffff : region1
> a90000000-b8fffffff : dax1.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : region0
> b90000000-c8fffffff : dax0.0
> b90000000-c8fffffff : System RAM (kmem)
Wait, you have a CXL region that partially overlaps a Soft Reserved
range? That does not look like a configuration the subsystem could ever
support; it should fall back to disabling CXL.
> == region1 teardown
> a90000000-c8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
>
> == recreate region1 - created in correct address range
>
> a90000000-c8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : region1
> b90000000-c8fffffff : dax1.0
> b90000000-c8fffffff : System RAM (kmem)
>
> [2.1] QEMU: 1 CFMWS + Host-bridge + 2 CXL devices
> Region is smaller than SR - hmem claims the whole space
>
> kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000ccfffffff] soft reserved
>
> a90000000-ccfffffff : Soft Reserved
> a90000000-ccfffffff : CXL Window 0
> a90000000-ccfffffff : dax1.0
> a90000000-ccfffffff : System RAM (kmem)
>
> [3] QEMU: 2 CFMWS + Host-bridge + 2 CXL devices
>
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : CXL Window 1
> b90000000-c8fffffff : region1
> b90000000-c8fffffff : dax1.0
> b90000000-c8fffffff : System RAM (kmem)
>
> == Tearing down region 1
>
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : CXL Window 1
>
> == Recreate region 1
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : CXL Window 1
> b90000000-c8fffffff : region1
> b90000000-c8fffffff : dax1.0
> b90000000-c8fffffff : System RAM (kmem)
>
> [3.1] QEMU: 2 CFMWS + Host-bridge + 2 CXL devices
> Region does not span whole CXL Window - hmem should claim the whole space, but kmem failed with EBUSY
>
> a90000000-ccfffffff : Soft Reserved
> a90000000-bcfffffff : CXL Window 0
> bd0000000-ccfffffff : CXL Window 1
Again, we do not expect that a real world BIOS would ever present this.
It might be the case that there is a single EFI entry that covers
a90000000-ccfffffff, but the expectation is that SRAT would have
separate entries for a90000000-bcfffffff and bd0000000-ccfffffff so that
everything lines up.
For simplicity I want the fallback to be all or nothing because either
there is full confidence that the CXL Subsystem understands the
configuration, or there is zero confidence. Leave no room for complex
"partial assembly" configurations to debug.
[..]
>
> [4] Physical machine: 2 CFMWS + Host-bridge + 2 CXL devices
>
> kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft reserved
>
> 2070000000-606fffffff : CXL Window 0
> 2070000000-606fffffff : region0
> 2070000000-606fffffff : dax0.0
> 2070000000-606fffffff : System RAM (kmem)
> 6070000000-a06fffffff : CXL Window 1
> 6070000000-a06fffffff : region1
> 6070000000-a06fffffff : dax1.0
> 6070000000-a06fffffff : System RAM (kmem)
Ok, so a real world machine that creates a merged
0x0000002070000000-0x000000a06fffffff range. Can you confirm that the
SRAT has separate entries for those ranges? Otherwise, need to rethink
how to keep this fallback algorithm simple and predictable.
> kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft reserved
>
> == region 1 teardown and unplug (the unplug was done via unbind/remove in /sys/bus/pci/devices)
Note that you need to explicitly destroy the region for the physical
removal case. Otherwise, decoders stay committed throughout the
hierarchy. Simple unbind / PCI device removal does not manage CXL
decoders.
>
> 2070000000-606fffffff : CXL Window 0
> 2070000000-606fffffff : region0
> 2070000000-606fffffff : dax0.0
> 2070000000-606fffffff : System RAM (kmem)
> 6070000000-a06fffffff : CXL Window 1
>
> == plug - after PCI rescan cannot create hmem
> 6070000000-a06fffffff : CXL Window 1
> 6070000000-a06fffffff : region1
>
> kernel: cxl_region region1: config state: 0
> kernel: cxl_acpi ACPI0017:00: decoder0.1: created region1
> kernel: cxl_pci 0000:04:00.0: mem1:decoder10.0: __construct_region region1 res: [mem 0x6070000000-0xa06fffffff flags 0x200] iw: 1 ig: 4096
> kernel: cxl_mem mem1: decoder:decoder10.0 parent:0000:04:00.0 port:endpoint10 range:0x6070000000-0xa06fffffff pos:0
> kernel: cxl region1: region sort successful
> kernel: cxl region1: mem1:endpoint10 decoder10.0 add: mem1:decoder10.0 @ 0 next: none nr_eps: 1 nr_targets: 1
> kernel: cxl region1: pci0000:00:port2 decoder2.1 add: mem1:decoder10.0 @ 0 next: mem1 nr_eps: 1 nr_targets: 1
> kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets expected iw: 1 ig: 4096 [mem 0x6070000000-0xa06fffffff flags 0x200]
> kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets got iw: 1 ig: 256 state: disabled 0x6070000000:0xa06fffffff
Did the device get reset in the process? This looks like decoders
bounced in an inconsistent fashion from unplug to replug and
autodiscovery.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup
2025-11-20 3:19 ` [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup Smita Koralahalli
2025-11-20 18:17 ` Koralahalli Channabasappa, Smita
2025-11-20 20:21 ` kernel test robot
@ 2025-12-04 0:22 ` dan.j.williams
2025-12-12 19:59 ` Koralahalli Channabasappa, Smita
2 siblings, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2025-12-04 0:22 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
> Stop creating cxl_dax during cxl_region_probe(). Early DAX registration
> can online memory before ownership of Soft Reserved ranges is finalized.
> This makes it difficult to tear down regions later when HMEM determines
> that a region should not claim that range.
>
> Introduce a register_dax flag in struct cxl_region_params and gate DAX
> registration on this flag. Leave probe time registration disabled for
> regions discovered during early CXL enumeration; set the flag only for
> regions created dynamically at runtime to preserve existing behaviour.
>
> This patch prepares the region code for later changes where cxl_dax
> setup occurs from the HMEM path only after ownership arbitration
> completes.
This seems backwards to me. The dax subsystem knows when it wants to
move ahead with CXL or not, dax_cxl_mode is that indicator. So, just
share that variable with drivers/dax/cxl.c, arrange for
cxl_dax_region_probe() to fail while waiting for initial CXL probing to
succeed.
Once that point is reached move dax_cxl_mode to DAX_CXL_MODE_DROP, which
means drop the hmem alias, and go with the real-deal CXL region. Rescan
the dax-bus to retry cxl_dax_region_probe(). No need to bother 'struct
cxl_region' with a 'dax' flag, it just registers per normal and lets the
dax-subsystem handle accepting / rejecting.
Now, we do need a mechanism from dax-to-cxl to trigger region removal in
the DAX_CXL_MODE_REGISTER case (proceed with the hmem registration), but
that is separate from blocking the attachment of dax to CXL regions.
Keep all that complexity local to dax.
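In other words, something like this shape for the probe gate, reusing the
dax_cxl_mode values already in hmem.c (a rough sketch, not a tested patch):

static int cxl_dax_region_probe(struct device *dev)
{
        /* mode is shared with dax_hmem; arbitration there picks the winner */
        switch (dax_cxl_mode) {
        case DAX_CXL_MODE_DEFER:
                /* initial CXL probing has not finished yet, retry later */
                return -EPROBE_DEFER;
        case DAX_CXL_MODE_REGISTER:
                /* hmem keeps this span, reject the CXL-backed devdax */
                return -ENODEV;
        case DAX_CXL_MODE_DROP:
        default:
                break;
        }

        /* ... existing cxl_dax_region to dax_region registration ... */
        return 0;
}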
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved
2025-11-20 3:19 ` [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved Smita Koralahalli
@ 2025-12-04 0:50 ` dan.j.williams
0 siblings, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2025-12-04 0:50 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
> If CXL regions do not fully cover a Soft Reserved span, HMEM takes
> ownership. Tear down overlapping CXL regions before allowing HMEM to
> register and online the memory.
>
> Add cxl_region_teardown() to walk CXL regions overlapping a span and
> unregister them via devm_release_action() and unregister_region().
>
> Force the region state back to CXL_CONFIG_ACTIVE before unregistering to
> prevent the teardown path from resetting decoders HMEM still relies on
> to create its dax and online memory.
>
> Co-developed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> drivers/cxl/core/region.c | 38 ++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxl.h | 5 +++++
> drivers/dax/hmem/hmem.c | 4 +++-
> 3 files changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 38e7ec6a087b..266b24028df0 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3784,6 +3784,44 @@ struct cxl_range_ctx {
> bool found;
> };
>
> +static int cxl_region_teardown_cb(struct device *dev, void *data)
> +{
> + struct cxl_range_ctx *ctx = data;
> + struct cxl_root_decoder *cxlrd;
> + struct cxl_region_params *p;
> + struct cxl_region *cxlr;
> + struct cxl_port *port;
> +
> + cxlr = cxlr_overlapping_range(dev, ctx->start, ctx->end);
> + if (!cxlr)
> + return 0;
> +
> + cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> + port = cxlrd_to_port(cxlrd);
> + p = &cxlr->params;
> +
> + /* Force the region state back to CXL_CONFIG_ACTIVE so that
Minor, and moot given the follow-on comments below, but please keep a
consistent comment style and lead with a /*, i.e.:
/*
 * Force the region...
> + * unregister_region() does not run the full decoder reset path
> + * which would invalidate the decoder programming that HMEM
> + * relies on to create its DAX device and online the underlying
> + * memory.
> + */
> + scoped_guard(rwsem_write, &cxl_rwsem.region)
> + p->state = min(p->state, CXL_CONFIG_ACTIVE);
I think the thickness of the above comment belies that this is too much
of a layering violation and likely to cause problems. For minimizing the
mental load of analyzing future bug reports, I want all regions gone
when any handshake with the platform firmware and dax-hmem occurs. When
that happens it may mean destroying regions that were dynamically
created while waiting the wait_for_initial_probe() to timeout, who
knows. The simple policy is "CXL subsystem understands everything, or
touches nothing."
For this reset determination, what I think makes more sense, and is
generally useful for shutting down CXL even outside of the hmem deferral
trickery, is to always record whether decoders were idle or not at the
time of region creation. In fact we already have that flag, it is called
CXL_REGION_F_AUTO.
If CXL_REGION_F_AUTO is still set at detach_target() time, it means that
we are giving up on auto-assembly and leaving the decoders alone.
If the administrator actually wants to destroy and reclaim that
physical address space then they need to forcefully de-commit that
auto-assembled region via the @commit sysfs attribute. So that means
commit_store() needs to clear CXL_REGION_F_AUTO to get the decoder reset
to happen.
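Something along these lines in commit_store(), trimmed of the existing
locking and decode-reset plumbing (a sketch of the idea, not the actual
function body):

static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
                            const char *buf, size_t len)
{
        struct cxl_region *cxlr = to_cxl_region(dev);
        bool commit;
        int rc;

        rc = kstrtobool(buf, &commit);
        if (rc)
                return rc;

        /*
         * An explicit de-commit means the administrator is giving up on the
         * auto-assembled region, so drop CXL_REGION_F_AUTO and let the
         * decoder reset path actually run on teardown.
         */
        if (!commit)
                clear_bit(CXL_REGION_F_AUTO, &cxlr->flags);

        /* ... existing commit / reset handling continues here ... */
        return len;
}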
[..]
> void cxl_endpoint_parse_cdat(struct cxl_port *port);
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index b9312e0f2e62..7d874ee169ac 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -158,8 +158,10 @@ static int handle_deferred_cxl(struct device *host, int target_nid,
> if (cxl_regions_fully_map(res->start, res->end)) {
> dax_cxl_mode = DAX_CXL_MODE_DROP;
> cxl_register_dax(res->start, res->end);
> - } else
> + } else {
> dax_cxl_mode = DAX_CXL_MODE_REGISTER;
> + cxl_region_teardown(res->start, res->end);
> + }
Like I alluded to above, I am not on board with making a range-by-range
decision on teardown. The check for "all clear" vs "abort" should be a
global event before proceeding with either allowing cxl_region instances
to attach or all of them get destroyed. Recall that if
cxl_dax_region_probe() is globally rejecting all cxl_dax_region devices
until dax_cxl_mode moves to DAX_CXL_MODE_DROP then it keeps a consistent
behavior of all regions attach or none attach.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 9/9] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree
2025-11-20 3:19 ` [PATCH v4 9/9] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree Smita Koralahalli
@ 2025-12-04 0:54 ` dan.j.williams
0 siblings, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2025-12-04 0:54 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Jonathan Cameron, Yazen Ghannam, Dave Jiang, Davidlohr Bueso,
Matthew Wilcox, Jan Kara, Rafael J . Wysocki, Len Brown,
Pavel Machek, Li Ming, Jeff Johnson, Ying Huang, Yao Xingtao,
Peter Zijlstra, Greg KH, Nathan Fontenot, Terry Bowman,
Robert Richter, Benjamin Cheatham, Zhijian Li, Borislav Petkov,
Ard Biesheuvel
Smita Koralahalli wrote:
> Reworked from a patch by Alison Schofield <alison.schofield@intel.com>
>
> Reintroduce Soft Reserved range into the iomem_resource tree for HMEM
> to consume.
>
> This restores visibility in /proc/iomem for ranges actively in use, while
> avoiding the early-boot conflicts that occurred when Soft Reserved was
> published into iomem before CXL window and region discovery.
>
> Link: https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
> Co-developed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Co-developed-by: Zhijian Li <lizhijian@fujitsu.com>
> Signed-off-by: Zhijian Li <lizhijian@fujitsu.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Looks good to me:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Thanks for all the work on this, Smita, and Alison too; you have flushed
out many issues and helped me through my blind spots.
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-12-03 22:05 ` dan.j.williams
@ 2025-12-05 2:54 ` Yasunori Gotou (Fujitsu)
2025-12-05 23:04 ` Tomasz Wolski
2025-12-06 0:11 ` dan.j.williams
0 siblings, 2 replies; 31+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2025-12-05 2:54 UTC (permalink / raw)
To: 'dan.j.williams@intel.com', Tomasz Wolski (Fujitsu),
alison.schofield@intel.com
Cc: Smita.KoralahalliChannabasappa@amd.com, ardb@kernel.org,
benjamin.cheatham@amd.com, bp@alien8.de, dave.jiang@intel.com,
dave@stgolabs.net, gregkh@linuxfoundation.org,
huang.ying.caritas@gmail.com, ira.weiny@intel.com, jack@suse.cz,
jeff.johnson@oss.qualcomm.com, jonathan.cameron@huawei.com,
len.brown@intel.com, linux-cxl@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-pm@vger.kernel.org, Zhijian Li (Fujitsu),
ming.li@zohomail.com, nathan.fontenot@amd.com,
nvdimm@lists.linux.dev, pavel@kernel.org, peterz@infradead.org,
rafael@kernel.org, rrichter@amd.com, terry.bowman@amd.com,
vishal.l.verma@intel.com, willy@infradead.org,
Xingtao Yao (Fujitsu), yazen.ghannam@amd.com
Hi,
Just one comment.
> > [4] Physical machine: 2 CFMWS + Host-bridge + 2 CXL devices
> >
> > kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft
> > reserved
> >
> > 2070000000-606fffffff : CXL Window 0
> > 2070000000-606fffffff : region0
> > 2070000000-606fffffff : dax0.0
> > 2070000000-606fffffff : System RAM (kmem)
> > 6070000000-a06fffffff : CXL Window 1
> > 6070000000-a06fffffff : region1
> > 6070000000-a06fffffff : dax1.0
> > 6070000000-a06fffffff : System RAM (kmem)
>
> Ok, so a real world machine that creates a merged
> 0x0000002070000000-0x000000a06fffffff range. Can you confirm that the SRAT
> has separate entries for those ranges? Otherwise, need to rethink how to keep
> this fallback algorithm simple and predictable.
>
> > kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft
> > reserved
> >
> > == region 1 teardown and unplug (the unplug was done via unbind/remove
> > in /sys/bus/pci/devices)
>
> Note that you need to explicitly destroy the region for the physical removal case.
> Otherwise, decoders stay committed throughout the hierarchy. Simple unbind /
> PCI device removal does not manage CXL decoders.
>
> >
> > 2070000000-606fffffff : CXL Window 0
> > 2070000000-606fffffff : region0
> > 2070000000-606fffffff : dax0.0
> > 2070000000-606fffffff : System RAM (kmem)
> > 6070000000-a06fffffff : CXL Window 1
> >
> > == plug - after PCI rescan cannot create hmem
> > 6070000000-a06fffffff : CXL Window 1
> > 6070000000-a06fffffff : region1
> >
> > kernel: cxl_region region1: config state: 0
> > kernel: cxl_acpi ACPI0017:00: decoder0.1: created region1
> > kernel: cxl_pci 0000:04:00.0: mem1:decoder10.0: __construct_region
> > region1 res: [mem 0x6070000000-0xa06fffffff flags 0x200] iw: 1 ig:
> > 4096
> > kernel: cxl_mem mem1: decoder:decoder10.0 parent:0000:04:00.0
> > port:endpoint10 range:0x6070000000-0xa06fffffff pos:0
> > kernel: cxl region1: region sort successful
> > kernel: cxl region1: mem1:endpoint10 decoder10.0 add: mem1:decoder10.0
> > @ 0 next: none nr_eps: 1 nr_targets: 1
> > kernel: cxl region1: pci0000:00:port2 decoder2.1 add: mem1:decoder10.0
> > @ 0 next: mem1 nr_eps: 1 nr_targets: 1
> > kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets expected
> > iw: 1 ig: 4096 [mem 0x6070000000-0xa06fffffff flags 0x200]
> > kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets got iw: 1
> > ig: 256 state: disabled 0x6070000000:0xa06fffffff
>
> Did the device get reset in the process? This looks like decoders bounced in an
> inconsistent fashion from unplug to replug and autodiscovery.
You are correct.
This environment does not support actual PCIe hotplug.
Even if we perform PCIe hotplug emulation by manipulating sysfs, some CXL Decoder registers,
which have read-only attributes, are not initialized.
I confirmed about a month and a half ago that this was causing the hot-add process to fail.
I suspect that such registers must be initialized by the hardware when a hot-add occurs.
I should have informed Wolski-san about this in advance. My apologies.
Thank you,
---
Yasunori Goto
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-12-05 2:54 ` Yasunori Gotou (Fujitsu)
@ 2025-12-05 23:04 ` Tomasz Wolski
2025-12-06 0:11 ` dan.j.williams
1 sibling, 0 replies; 31+ messages in thread
From: Tomasz Wolski @ 2025-12-05 23:04 UTC (permalink / raw)
To: y-goto
Cc: Smita.KoralahalliChannabasappa, alison.schofield, ardb,
benjamin.cheatham, bp, dan.j.williams, dave.jiang, dave, gregkh,
huang.ying.caritas, ira.weiny, jack, jeff.johnson,
jonathan.cameron, len.brown, linux-cxl, linux-fsdevel,
linux-kernel, linux-pm, lizhijian, ming.li, nathan.fontenot,
nvdimm, pavel, peterz, rafael, rrichter, terry.bowman,
tomasz.wolski, vishal.l.verma, willy, yaoxt.fnst, yazen.ghannam
Hello Dan & Gotou-san,
Many thanks for your remarks on the test cases.
For the QEMU tests I used a modified QEMU and SeaBIOS, therefore some 'strange' cases
are only testable on virtual setups - thanks for making it clear which configurations
are supported in real life.
>>
>> [4] Physical machine: 2 CFMWS + Host-bridge + 2 CXL devices
>>
>> kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft
>> reserved
>>
>> 2070000000-606fffffff : CXL Window 0
>> 2070000000-606fffffff : region0
>> 2070000000-606fffffff : dax0.0
>> 2070000000-606fffffff : System RAM (kmem)
>> 6070000000-a06fffffff : CXL Window 1
>> 6070000000-a06fffffff : region1
>> 6070000000-a06fffffff : dax1.0
>> 6070000000-a06fffffff : System RAM (kmem)
>
> Ok, so a real world machine that creates a merged 0x0000002070000000-0x000000a06fffffff
> range. Can you confirm that the SRAT has separate entries for those ranges?
> Otherwise, need to rethink how to keep this fallback algorithm simple and predictable.
I looked into the syslogs and I see the SRAT has separate entries:
[ 0.005128] [ T0] ACPI: SRAT: Node 2 PXM 2 [mem 0x2070000000-0x606fffffff] hotplug
[ 0.005129] [ T0] ACPI: SRAT: Node 2 PXM 2 [mem 0x6070000000-0xa06fffffff] hotplug
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM
2025-12-05 2:54 ` Yasunori Gotou (Fujitsu)
2025-12-05 23:04 ` Tomasz Wolski
@ 2025-12-06 0:11 ` dan.j.williams
1 sibling, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2025-12-06 0:11 UTC (permalink / raw)
To: Yasunori Gotou (Fujitsu), 'dan.j.williams@intel.com',
Tomasz Wolski (Fujitsu), alison.schofield@intel.com
Cc: Smita.KoralahalliChannabasappa@amd.com, ardb@kernel.org,
benjamin.cheatham@amd.com, bp@alien8.de, dave.jiang@intel.com,
dave@stgolabs.net, gregkh@linuxfoundation.org,
huang.ying.caritas@gmail.com, ira.weiny@intel.com, jack@suse.cz,
jeff.johnson@oss.qualcomm.com, jonathan.cameron@huawei.com,
len.brown@intel.com, linux-cxl@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-pm@vger.kernel.org, Zhijian Li (Fujitsu),
ming.li@zohomail.com, nathan.fontenot@amd.com,
nvdimm@lists.linux.dev, pavel@kernel.org, peterz@infradead.org,
rafael@kernel.org, rrichter@amd.com, terry.bowman@amd.com,
vishal.l.verma@intel.com, willy@infradead.org,
Xingtao Yao (Fujitsu), yazen.ghannam@amd.com
Yasunori Gotou (Fujitsu) wrote:
[..]
> > > == plug - after PCI rescan cannot create hmem
> > > 6070000000-a06fffffff : CXL Window 1
> > > 6070000000-a06fffffff : region1
> > >
> > > kernel: cxl_region region1: config state: 0
> > > kernel: cxl_acpi ACPI0017:00: decoder0.1: created region1
> > > kernel: cxl_pci 0000:04:00.0: mem1:decoder10.0: __construct_region
> > > region1 res: [mem 0x6070000000-0xa06fffffff flags 0x200] iw: 1 ig:
> > > 4096
> > > kernel: cxl_mem mem1: decoder:decoder10.0 parent:0000:04:00.0
> > > port:endpoint10 range:0x6070000000-0xa06fffffff pos:0
> > > kernel: cxl region1: region sort successful
> > > kernel: cxl region1: mem1:endpoint10 decoder10.0 add: mem1:decoder10.0
> > > @ 0 next: none nr_eps: 1 nr_targets: 1
> > > kernel: cxl region1: pci0000:00:port2 decoder2.1 add: mem1:decoder10.0
> > > @ 0 next: mem1 nr_eps: 1 nr_targets: 1
> > > kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets expected
> > > iw: 1 ig: 4096 [mem 0x6070000000-0xa06fffffff flags 0x200]
> > > kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets got iw: 1
> > > ig: 256 state: disabled 0x6070000000:0xa06fffffff
> >
> > Did the device get reset in the process? This looks like decoders bounced in an
> > inconsistent fashion from unplug to replug and autodiscovery.
>
> You are correct.
> This environment does not support actual PCIe hotplug.
> Even if we perform PCIe hotplug emulation by manipulating sysfs, some CXL Decoder registers,
> which have read-only attributes, are not initialized.
> I confirmed about a month and a half ago that this was causing the hot-add process to fail.
> I suspect that such registers must be initialized by the hardware when a hot-add occurs.
>
> I should have informed Wolski-san about this in advance. My apologies.
No worries, just wanted to understand what was happening, thanks for
confirming.
However, this does raise an important issue that tooling could solve. If
you are committed to unplugging a device and the decoders are locked
then tooling should probably arrange for a secondary bus reset to unlock
and disable those decoders. Otherwise, the kernel might have a hard time
guaranteeing that a removed device restores at the exact address it had
previously, especially when there is free CFMWS capacity.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
2025-12-02 22:19 ` dan.j.williams
@ 2025-12-11 23:20 ` Koralahalli Channabasappa, Smita
0 siblings, 0 replies; 31+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-12-11 23:20 UTC (permalink / raw)
To: dan.j.williams, Smita Koralahalli, linux-cxl, linux-kernel,
nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Jonathan Cameron,
Yazen Ghannam, Dave Jiang, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
Hi,
Sorry for the delay here. I was on vacation. Responses inline.
On 12/2/2025 2:19 PM, dan.j.williams@intel.com wrote:
> Smita Koralahalli wrote:
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
>> instead of the iomem_resource tree at boot. Delay publishing these ranges
>> into the iomem hierarchy until ownership is resolved and the HMEM path
>> is ready to consume them.
>>
>> Publishing Soft Reserved ranges into iomem too early conflicts with CXL
>> hotplug and prevents region assembly when those ranges overlap CXL
>> windows.
>>
>> Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
>> window publication is complete and HMEM is ready to claim the memory. This
>> provides a cleaner handoff between EFI-defined memory ranges and CXL
>> resource management without trimming or deleting resources later.
>
> Please, when you modify a patch from an original, add your
> Co-developed-by: and clarify what you changed.
Thanks Dan. Yeah, this was a bit of a gray area for me. I had the
impression or remember reading somewhere that Co-developed-by tags are
typically added only when the modifications are substantial, so I didn’t
include it initially. I will add the Co-developed-by: line.
>
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>> arch/x86/kernel/e820.c | 2 +-
>> drivers/cxl/acpi.c | 2 +-
>> drivers/dax/hmem/device.c | 4 +-
>> drivers/dax/hmem/hmem.c | 7 ++-
>> include/linux/ioport.h | 13 +++++-
>> kernel/resource.c | 92 +++++++++++++++++++++++++++++++++------
>> 6 files changed, 100 insertions(+), 20 deletions(-)
>>
> [..]
>> @@ -426,6 +443,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
>> }
>> EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
>>
>> +#ifdef CONFIG_EFI_SOFT_RESERVE
>> +struct resource soft_reserve_resource = {
>> + .name = "Soft Reserved",
>> + .start = 0,
>> + .end = -1,
>> + .desc = IORES_DESC_SOFT_RESERVED,
>> + .flags = IORESOURCE_MEM,
>> +};
>> +EXPORT_SYMBOL_GPL(soft_reserve_resource);
>
> It looks like one of the things you changed from my RFC was the addition
> of walk_soft_reserve_res_desc() and region_intersects_soft_reserve().
> With those APIs not only does this symbol not need to be exported, but
> it also can be static / private to resource.c.
I remember these helpers were introduced in your RFC but I think they
weren't yet defined. With them in place, agreed there’s no need to
export soft_reserve_resource. Will fix this in the next revision.
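Concretely, I will keep the tree file-local in kernel/resource.c and only
keep the walk/intersect helpers exported, i.e. (sketch of the intended
change):

/* kernel/resource.c: no EXPORT_SYMBOL, private to this file */
static struct resource soft_reserve_resource = {
        .name  = "Soft Reserved",
        .start = 0,
        .end   = -1,
        .desc  = IORES_DESC_SOFT_RESERVED,
        .flags = IORESOURCE_MEM,
};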
>
>> +
>> +int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
>> + u64 start, u64 end, void *arg,
>> + int (*func)(struct resource *, void *))
>> +{
>> + return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
>> + arg, func);
>> +}
>> +EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
>> +#endif
>> +
>> /*
>> * This function calls the @func callback against all memory ranges of type
>> * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
>> @@ -648,6 +685,22 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags,
>> }
>> EXPORT_SYMBOL_GPL(region_intersects);
>>
>> +#ifdef CONFIG_EFI_SOFT_RESERVE
>> +int region_intersects_soft_reserve(resource_size_t start, size_t size,
>> + unsigned long flags, unsigned long desc)
>> +{
>> + int ret;
>> +
>> + read_lock(&resource_lock);
>> + ret = __region_intersects(&soft_reserve_resource, start, size, flags,
>> + desc);
>> + read_unlock(&resource_lock);
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(region_intersects_soft_reserve);
>> +#endif
>> +
>> void __weak arch_remove_reservations(struct resource *avail)
>> {
>> }
>> @@ -966,7 +1019,7 @@ EXPORT_SYMBOL_GPL(insert_resource);
>> * Insert a resource into the resource tree, possibly expanding it in order
>> * to make it encompass any conflicting resources.
>> */
>> -void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
>> +void __insert_resource_expand_to_fit(struct resource *root, struct resource *new)
>> {
>> if (new->parent)
>> return;
>> @@ -997,7 +1050,20 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
>> * to use this interface. The former are built-in and only the latter,
>> * CXL, is a module.
>> */
>> -EXPORT_SYMBOL_NS_GPL(insert_resource_expand_to_fit, "CXL");
>> +EXPORT_SYMBOL_NS_GPL(__insert_resource_expand_to_fit, "CXL");
>> +
>> +void insert_resource_expand_to_fit(struct resource *new)
>> +{
>> + struct resource *root = &iomem_resource;
>> +
>> +#ifdef CONFIG_EFI_SOFT_RESERVE
>> + if (new->desc == IORES_DESC_SOFT_RESERVED)
>> + root = &soft_reserve_resource;
>> +#endif
>
> I cannot say I am entirely happy with this change; I would prefer to
> avoid ifdefs in C code, and I would prefer not to break the legacy semantics
> of this function, but it meets the spirit of the original RFC without
> introducing a new insert_resource_late(). I assume review feedback
> requested this?
Yeah, here:
https://lore.kernel.org/all/20250909161210.GBaMBR2rN8h6eT9JHe@fat_crate.local/
>
>> + __insert_resource_expand_to_fit(root, new);
>> +}
>> +EXPORT_SYMBOL_GPL(insert_resource_expand_to_fit);
>
> There are no consumers for this export, so it can be dropped.
Okay.
Thanks
Smita
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows
2025-12-02 22:37 ` dan.j.williams
@ 2025-12-11 23:23 ` Koralahalli Channabasappa, Smita
0 siblings, 0 replies; 31+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-12-11 23:23 UTC (permalink / raw)
To: dan.j.williams, Smita Koralahalli, linux-cxl, linux-kernel,
nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Jonathan Cameron,
Yazen Ghannam, Dave Jiang, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
On 12/2/2025 2:37 PM, dan.j.williams@intel.com wrote:
> Smita Koralahalli wrote:
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Defer handling of Soft Reserved ranges that intersect CXL windows at
>> probe time. Delay processing until after device discovery so that the
>> CXL stack can publish windows and assemble regions before HMEM claims
>> those address ranges.
>>
>> Add a deferral path that schedules deferred work when HMEM detects a
>> Soft Reserved range intersecting a CXL window during probe. The deferred
>> work runs after probe completes and allows the CXL subsystem to finish
>> resource discovery and region setup before HMEM takes any action.
>>
>> This change does not address region assembly failures. It only delays
>> HMEM handling to avoid prematurely claiming ranges that CXL may own.
>
> No, with the changes it just unconditionally disables dax_hmem in the
> presence of CXL. I do not think these changes can stand alone. It
> probably wants to be folded with patch 5 or something like that.
Sure, will include this with changes suggested in Patch 5.
Thanks
Smita
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map()
2025-12-03 3:50 ` dan.j.williams
@ 2025-12-11 23:42 ` Koralahalli Channabasappa, Smita
0 siblings, 0 replies; 31+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-12-11 23:42 UTC (permalink / raw)
To: dan.j.williams, Smita Koralahalli, linux-cxl, linux-kernel,
nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Jonathan Cameron,
Yazen Ghannam, Dave Jiang, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
On 12/2/2025 7:50 PM, dan.j.williams@intel.com wrote:
> Smita Koralahalli wrote:
>> Introduce cxl_regions_fully_map() to check whether CXL regions form a
>> single contiguous, non-overlapping cover of a given Soft Reserved range.
>>
>> Use this helper to decide whether Soft Reserved memory overlapping CXL
>> regions should be owned by CXL or registered by HMEM.
>>
>> If the span is fully covered by CXL regions, treat the Soft Reserved
>> range as owned by CXL and have HMEM skip registration. Else, let HMEM
>> claim the range and register the corresponding devdax for it.
>
> This all feels a bit too custom when helpers like resource_contains()
> exist.
>
> Also remember that the default list of soft-reserved ranges that dax
> grabs is filtered by the ACPI HMAT. So while there is a chance that one
> EFI memory map entry spans multiple CXL regions, there is a lower chance
> that a single ACPI HMAT range spans multiple CXL regions.
>
> I think it is fair for Linux to be simple and require that an algorithm
> of:
>
> cxl_contains_soft_reserve()
>     for_each_cxl_intersecting_hmem_resource()
>         found = false
>         for_each_region()
>             if (resource_contains(cxl_region_resource, hmem_resource))
>                 found = true
>         if (!found)
>             return false
>     return true
>
> ...should be good enough, otherwise fall back to pure hmem operation, and
> do not worry about the corner cases.
>
> If Linux really needs to understand that ACPI HMAT ranges may span
> multiple CXL regions then I would want to understand more what is
> driving that configuration.
I was trying to handle a case like Tomasz's setup in [2], where a single
Soft Reserved span and CFMWS cover two CXL regions:
kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000c8fffffff] soft reserved
a90000000-c8fffffff : CXL Window 0
a90000000-b8fffffff : region1
b90000000-c8fffffff : region0
…so I ended up with the more generic cxl_regions_fully_map() walker. I
missed the detail that the HMAT-filtered Soft Reserved ranges we
actually act on are much less likely to span multiple regions, and on
AMD platforms we effectively have a 1:1 mapping. I'm fine with
simplifying this per your suggestion.
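Roughly, the per-range containment test from your pseudocode would reduce
to a callback like this (a sketch only; sr_map_ctx and region_contains_cb
are illustrative names, and the outer walk over the intersecting hmem
resources plus the locking you mention below would wrap it):

struct sr_map_ctx {
        struct resource *hmem_res;
        bool found;
};

static int region_contains_cb(struct device *dev, void *data)
{
        struct sr_map_ctx *ctx = data;
        struct cxl_region *cxlr;

        if (!is_cxl_region(dev))
                return 0;

        cxlr = to_cxl_region(dev);
        /* a region whose resource fully contains the hmem resource wins */
        if (cxlr->params.res &&
            resource_contains(cxlr->params.res, ctx->hmem_res))
                ctx->found = true;

        return 0;
}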
>
> Btw, I do not see a:
>
> guard(rwsem_read)(&cxl_rwsem.region)
>
> ...anywhere in the proposed patch. That needs to be held to be sure the
> region's resource settings are not changed out from underneath you. This
> should probably also be checking that the region is in the commit state
> because it may still be racing regions under creation post
> wait_for_device_probe().
Sure, I will add this.
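i.e. roughly this fragment inside cxl_regions_fully_map() (sketch; p here
stands for the region's params within the walk):

        /* pin region parameters while we inspect them */
        guard(rwsem_read)(&cxl_rwsem.region);
        ...
        /* per region: only trust fully committed configurations */
        if (p->state != CXL_CONFIG_COMMIT)
                continue;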
>
>> void cxl_endpoint_parse_cdat(struct cxl_port *port);
>> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
>> index f70a0688bd11..db4c46337ac3 100644
>> --- a/drivers/dax/hmem/hmem.c
>> +++ b/drivers/dax/hmem/hmem.c
>> @@ -3,6 +3,8 @@
>> #include <linux/memregion.h>
>> #include <linux/module.h>
>> #include <linux/dax.h>
>> +
>> +#include "../../cxl/cxl.h"
>> #include "../bus.h"
>>
>> static bool region_idle;
>> @@ -150,7 +152,17 @@ static int hmem_register_device(struct device *host, int target_nid,
>> static int handle_deferred_cxl(struct device *host, int target_nid,
>> const struct resource *res)
>> {
>> - /* TODO: Handle region assembly failures */
>> + if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> + IORES_DESC_CXL) != REGION_DISJOINT) {
>> +
>> + if (cxl_regions_fully_map(res->start, res->end))
>> + dax_cxl_mode = DAX_CXL_MODE_DROP;
>> + else
>> + dax_cxl_mode = DAX_CXL_MODE_REGISTER;
>> +
>> + hmem_register_device(host, target_nid, res);
>> + }
>> +
>
> I think there is enough content to just create the new
> cxl_contains_soft_reserve() ABI, and then hookup handle_deferred_cxl in
> a follow-on patch.
Okay.
Thanks
Smita
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup
2025-12-04 0:22 ` dan.j.williams
@ 2025-12-12 19:59 ` Koralahalli Channabasappa, Smita
0 siblings, 0 replies; 31+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-12-12 19:59 UTC (permalink / raw)
To: dan.j.williams, Smita Koralahalli, linux-cxl, linux-kernel,
nvdimm, linux-fsdevel, linux-pm
Cc: Alison Schofield, Vishal Verma, Ira Weiny, Jonathan Cameron,
Yazen Ghannam, Dave Jiang, Davidlohr Bueso, Matthew Wilcox,
Jan Kara, Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
Zhijian Li, Borislav Petkov, Ard Biesheuvel
On 12/3/2025 4:22 PM, dan.j.williams@intel.com wrote:
> Smita Koralahalli wrote:
>> Stop creating cxl_dax during cxl_region_probe(). Early DAX registration
>> can online memory before ownership of Soft Reserved ranges is finalized.
>> This makes it difficult to tear down regions later when HMEM determines
>> that a region should not claim that range.
>>
>> Introduce a register_dax flag in struct cxl_region_params and gate DAX
>> registration on this flag. Leave probe time registration disabled for
>> regions discovered during early CXL enumeration; set the flag only for
>> regions created dynamically at runtime to preserve existing behaviour.
>>
>> This patch prepares the region code for later changes where cxl_dax
>> setup occurs from the HMEM path only after ownership arbitration
>> completes.
>
> This seems backwards to me. The dax subsystem knows when it wants to
> move ahead with CXL or not, dax_cxl_mode is that indicator. So, just
> share that variable with drivers/dax/cxl.c, arrange for
> cxl_dax_region_probe() to fail while waiting for initial CXL probing to
> succeed.
>
> Once that point is reached move dax_cxl_mode to DAX_CXL_MODE_DROP, which
> means drop the hmem alias, and go with the real-deal CXL region. Rescan
> the dax-bus to retry cxl_dax_region_probe(). No need to bother 'struct
> cxl_region' with a 'dax' flag, it just registers per normal and lets the
> dax-subsystem handle accepting / rejecting.
>
> Now, we do need a mechanism from dax-to-cxl to trigger region removal in
> the DAX_CXL_MODE_REGISTER case (proceed with the hmem registration), but
> that is separate from blocking the attachment of dax to CXL regions.
> Keep all that complexity local to dax.
Okay. To make sure I'm aligned with your suggestion, it should be
something like the below in cxl_dax_region_probe():
        switch (dax_cxl_mode) {
        case DAX_CXL_MODE_DEFER:
                return -EPROBE_DEFER;
        case DAX_CXL_MODE_REGISTER:
                return -ENODEV;
        case DAX_CXL_MODE_DROP:
        default:
                break;
        }
Then in the HMEM path, if the SR span is fully covered I will switch to
DAX_CXL_MODE_DROP and trigger a rescan.
Something like:
        if (cxl_regions_fully_map(res->start, res->end)) {
                dax_cxl_mode = DAX_CXL_MODE_DROP;
                bus_rescan_devices(&cxl_bus_type);
        } else {
                dax_cxl_mode = DAX_CXL_MODE_REGISTER;
                cxl_region_teardown(res->start, res->end);
        }
        hmem_register_device(host, target_nid, res);
cxl_regions_fully_map() will include changes as suggested in Patch 5.
Thanks
Smita
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2025-12-12 20:00 UTC | newest]
Thread overview: 31+ messages
2025-11-20 3:19 [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 1/9] dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Smita Koralahalli
2025-12-02 22:19 ` dan.j.williams
2025-12-11 23:20 ` Koralahalli Channabasappa, Smita
2025-12-02 23:31 ` Dave Jiang
2025-11-20 3:19 ` [PATCH v4 2/9] dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 3/9] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL Smita Koralahalli
2025-12-02 23:32 ` Dave Jiang
2025-11-20 3:19 ` [PATCH v4 4/9] dax/hmem: Defer handling of Soft Reserved ranges that overlap CXL windows Smita Koralahalli
2025-12-02 22:37 ` dan.j.williams
2025-12-11 23:23 ` Koralahalli Channabasappa, Smita
2025-11-20 3:19 ` [PATCH v4 5/9] cxl/region, dax/hmem: Arbitrate Soft Reserved ownership with cxl_regions_fully_map() Smita Koralahalli
2025-12-03 3:50 ` dan.j.williams
2025-12-11 23:42 ` Koralahalli Channabasappa, Smita
2025-11-20 3:19 ` [PATCH v4 6/9] cxl/region: Add register_dax flag to defer DAX setup Smita Koralahalli
2025-11-20 18:17 ` Koralahalli Channabasappa, Smita
2025-11-20 20:21 ` kernel test robot
2025-12-04 0:22 ` dan.j.williams
2025-12-12 19:59 ` Koralahalli Channabasappa, Smita
2025-11-20 3:19 ` [PATCH v4 7/9] cxl/region, dax/hmem: Register cxl_dax only when CXL owns Soft Reserved span Smita Koralahalli
2025-11-20 3:19 ` [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved Smita Koralahalli
2025-12-04 0:50 ` dan.j.williams
2025-11-20 3:19 ` [PATCH v4 9/9] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree Smita Koralahalli
2025-12-04 0:54 ` dan.j.williams
2025-12-01 19:56 ` [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Alison Schofield
2025-12-03 13:35 ` Tomasz Wolski
2025-12-03 22:05 ` dan.j.williams
2025-12-05 2:54 ` Yasunori Gotou (Fujitsu)
2025-12-05 23:04 ` Tomasz Wolski
2025-12-06 0:11 ` dan.j.williams
2025-12-02 6:41 ` dan.j.williams