* [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
@ 2025-07-15 18:04 Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration Smita Koralahalli
` (9 more replies)
0 siblings, 10 replies; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
This series introduces the ability to manage SOFT RESERVED iomem
resources, enabling the CXL driver to remove any portions that
intersect with created CXL regions.
Leaving SOFT RESERVED entries in place can result in failures during
device hotplug, such as CXL hotplug, because the address range remains
reserved and unavailable for reuse even after region teardown.
To address this, the CXL driver now uses a background worker that waits
for the cxl_mem driver probe to complete before scanning for intersecting
resources. The driver then walks the created CXL regions and trims any
intersections with SOFT RESERVED resources in the iomem tree.
The following scenarios have been tested:
Example 1: Exact alignment, soft reserved is a child of the region
|---------- "Soft Reserved" -----------|
|-------------- "Region #" ------------|
Before:
1050000000-304fffffff : CXL Window 0
1050000000-304fffffff : region0
1050000000-304fffffff : Soft Reserved
1080000000-2fffffffff : dax0.0
1080000000-2fffffffff : System RAM (kmem)
After:
1050000000-304fffffff : CXL Window 0
1050000000-304fffffff : region0
1080000000-2fffffffff : dax0.0
1080000000-2fffffffff : System RAM (kmem)
Example 2: Start and/or end aligned and soft reserved spans multiple
regions
|----------- "Soft Reserved" -----------|
|-------- "Region #" -------|
or
|----------- "Soft Reserved" -----------|
|-------- "Region #" -------|
Before:
850000000-684fffffff : Soft Reserved
850000000-284fffffff : CXL Window 0
850000000-284fffffff : region3
850000000-284fffffff : dax0.0
850000000-284fffffff : System RAM (kmem)
2850000000-484fffffff : CXL Window 1
2850000000-484fffffff : region4
2850000000-484fffffff : dax1.0
2850000000-484fffffff : System RAM (kmem)
4850000000-684fffffff : CXL Window 2
4850000000-684fffffff : region5
4850000000-684fffffff : dax2.0
4850000000-684fffffff : System RAM (kmem)
After:
850000000-284fffffff : CXL Window 0
850000000-284fffffff : region3
850000000-284fffffff : dax0.0
850000000-284fffffff : System RAM (kmem)
2850000000-484fffffff : CXL Window 1
2850000000-484fffffff : region4
2850000000-484fffffff : dax1.0
2850000000-484fffffff : System RAM (kmem)
4850000000-684fffffff : CXL Window 2
4850000000-684fffffff : region5
4850000000-684fffffff : dax2.0
4850000000-684fffffff : System RAM (kmem)
Example 3: No alignment
|---------- "Soft Reserved" ----------|
|---- "Region #" ----|
Before:
00000000-3050000ffd : Soft Reserved
..
..
1050000000-304fffffff : CXL Window 0
1050000000-304fffffff : region1
1080000000-2fffffffff : dax0.0
1080000000-2fffffffff : System RAM (kmem)
After:
00000000-104fffffff : Soft Reserved
..
..
1050000000-304fffffff : CXL Window 0
1050000000-304fffffff : region1
1080000000-2fffffffff : dax0.0
1080000000-2fffffffff : System RAM (kmem)
3050000000-3050000ffd : Soft Reserved
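For illustration only, and not the actual kernel implementation added in
patch 4, the three cases above can be sketched as a small self-contained
userspace program; all names below are made up for the example:

#include <stdio.h>

struct range {
	unsigned long long start, end;	/* inclusive bounds */
};

static void trim_soft_reserved(struct range soft, struct range region)
{
	if (soft.end < region.start || soft.start > region.end)
		return;	/* no intersection, nothing to trim */

	if (soft.start >= region.start && soft.end <= region.end) {
		/* Example 1: fully covered by the region, drop it entirely */
		printf("remove %llx-%llx\n", soft.start, soft.end);
		return;
	}

	/*
	 * Examples 2 and 3: keep the non-overlapping tail(s) as new
	 * Soft Reserved entries, then remove the original.
	 */
	if (soft.start < region.start)
		printf("keep   %llx-%llx\n", soft.start, region.start - 1);
	if (soft.end > region.end)
		printf("keep   %llx-%llx\n", region.end + 1, soft.end);
	printf("remove %llx-%llx\n", soft.start, soft.end);
}

int main(void)
{
	/* Example 3 above: Soft Reserved 0-3050000ffd vs. region1 */
	struct range soft   = { 0x0ULL,          0x3050000ffdULL };
	struct range region = { 0x1050000000ULL, 0x304fffffffULL };

	/* prints: keep 0-104fffffff, keep 3050000000-3050000ffd,
	 * remove 0-3050000ffd */
	trim_soft_reserved(soft, region);
	return 0;
}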
Link to v4:
https://lore.kernel.org/linux-cxl/20250603221949.53272-1-Smita.KoralahalliChannabasappa@amd.com
v5 updates:
- Handle cases where the CXL driver loads early, even before the HMEM
driver is initialized.
- Introduce callback functions to resolve dependencies.
- Rename suspend.c to probe_state.c.
- Refactor cxl_acpi_probe() to use a single exit path.
- Update the commit description to justify cxl_mem_active() usage.
- Change from kmalloc -> kzalloc in add_soft_reserved().
- Change from goto to if else blocks inside remove_soft_reserved().
- DEFINE_RES_MEM_NAMED -> DEFINE_RES_NAMED_DESC.
- Comments for flags inside remove_soft_reserved().
- Add resource_lock inside normalize_resource().
- bus_find_next_device -> bus_find_device.
- Skip DAX consumption of soft reserves inside hmat with
CONFIG_CXL_ACPI checks.
v4 updates:
- Split first patch into 4 smaller patches.
- Correct the logic for cxl_pci_loaded() and cxl_mem_active() to return
false by default instead of true.
- Cleanup cxl_wait_for_pci_mem() to remove config checks for cxl_pci
and cxl_mem.
- Fix multiple bugs and build issues, including correcting
walk_iomem_res_desc() usage and alignment calculations.
v3 updates:
- Remove the srmem resource tree from kernel/resource.c; it is no longer
needed in the current implementation. All SOFT RESERVE resources are now
placed on the iomem resource tree.
- Remove the no longer needed SOFT_RESERVED_MANAGED kernel config option.
- Add the 'nid' parameter back to hmem_register_resource().
- Remove the no longer used soft reserve notification chain (introduced
in v2). The dax driver is now notified of SOFT RESERVED resources by
the CXL driver.
v2 updates:
- Add config option SOFT_RESERVE_MANAGED to control use of the
separate srmem resource tree at boot.
- Only add SOFT RESERVE resources to the soft reserve tree during
boot; they go to the iomem resource tree after boot.
- Remove the resource trimming code in the previous patch to re-use
the existing code in kernel/resource.c
- Add functionality for the cxl acpi driver to wait for the cxl PCI
and mem drivers to load.
Smita Koralahalli (7):
cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX
registration
cxl/core: Rename suspend.c to probe_state.c and remove
CONFIG_CXL_SUSPEND
cxl/acpi: Add background worker to coordinate with cxl_mem probe
completion
cxl/region: Introduce SOFT RESERVED resource removal on region
teardown
dax/hmem: Save the DAX HMEM platform device pointer
dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until
after CXL region creation
dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads
late
drivers/acpi/numa/hmat.c | 4 +
drivers/cxl/Kconfig | 4 -
drivers/cxl/acpi.c | 50 +++++--
drivers/cxl/core/Makefile | 2 +-
drivers/cxl/core/{suspend.c => probe_state.c} | 10 +-
drivers/cxl/core/region.c | 135 ++++++++++++++++++
drivers/cxl/cxl.h | 4 +
drivers/cxl/cxlmem.h | 9 --
drivers/dax/hmem/Makefile | 1 +
drivers/dax/hmem/device.c | 62 ++++----
drivers/dax/hmem/hmem.c | 14 +-
drivers/dax/hmem/hmem_notify.c | 29 ++++
include/linux/dax.h | 7 +-
include/linux/ioport.h | 1 +
include/linux/pm.h | 7 -
kernel/resource.c | 34 +++++
16 files changed, 307 insertions(+), 66 deletions(-)
rename drivers/cxl/core/{suspend.c => probe_state.c} (62%)
create mode 100644 drivers/dax/hmem/hmem_notify.c
--
2.17.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-22 21:04 ` dan.j.williams
2025-07-15 18:04 ` [PATCH v5 2/7] cxl/core: Rename suspend.c to probe_state.c and remove CONFIG_CXL_SUSPEND Smita Koralahalli
` (8 subsequent siblings)
9 siblings, 1 reply; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Refactor cxl_acpi_probe() to use a single exit path so that the fallback
DAX registration can be scheduled regardless of probe success or failure.
With CONFIG_CXL_ACPI enabled, future patches will bypass DAX device
registration via the HMAT and hmem drivers. To avoid missing DAX
registration for SOFT RESERVED regions, the fallback path must be
triggered regardless of probe outcome.
No functional changes.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/acpi.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index a1a99ec3f12c..ca06d5acdf8f 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -825,7 +825,7 @@ static int pair_cxl_resource(struct device *dev, void *data)
static int cxl_acpi_probe(struct platform_device *pdev)
{
- int rc;
+ int rc = 0;
struct resource *cxl_res;
struct cxl_root *cxl_root;
struct cxl_port *root_port;
@@ -837,7 +837,7 @@ static int cxl_acpi_probe(struct platform_device *pdev)
rc = devm_add_action_or_reset(&pdev->dev, cxl_acpi_lock_reset_class,
&pdev->dev);
if (rc)
- return rc;
+ goto out;
cxl_res = devm_kzalloc(host, sizeof(*cxl_res), GFP_KERNEL);
if (!cxl_res)
@@ -848,18 +848,20 @@ static int cxl_acpi_probe(struct platform_device *pdev)
cxl_res->flags = IORESOURCE_MEM;
cxl_root = devm_cxl_add_root(host, &acpi_root_ops);
- if (IS_ERR(cxl_root))
- return PTR_ERR(cxl_root);
+ if (IS_ERR(cxl_root)) {
+ rc = PTR_ERR(cxl_root);
+ goto out;
+ }
root_port = &cxl_root->port;
rc = bus_for_each_dev(adev->dev.bus, NULL, root_port,
add_host_bridge_dport);
if (rc < 0)
- return rc;
+ goto out;
rc = devm_add_action_or_reset(host, remove_cxl_resources, cxl_res);
if (rc)
- return rc;
+ goto out;
ctx = (struct cxl_cfmws_context) {
.dev = host,
@@ -867,12 +869,14 @@ static int cxl_acpi_probe(struct platform_device *pdev)
.cxl_res = cxl_res,
};
rc = acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, cxl_parse_cfmws, &ctx);
- if (rc < 0)
- return -ENXIO;
+ if (rc < 0) {
+ rc = -ENXIO;
+ goto out;
+ }
rc = add_cxl_resources(cxl_res);
if (rc)
- return rc;
+ goto out;
/*
* Populate the root decoders with their related iomem resource,
@@ -887,17 +891,19 @@ static int cxl_acpi_probe(struct platform_device *pdev)
rc = bus_for_each_dev(adev->dev.bus, NULL, root_port,
add_host_bridge_uport);
if (rc < 0)
- return rc;
+ goto out;
if (IS_ENABLED(CONFIG_CXL_PMEM))
rc = device_for_each_child(&root_port->dev, root_port,
add_root_nvdimm_bridge);
if (rc < 0)
- return rc;
+ goto out;
/* In case PCI is scanned before ACPI re-trigger memdev attach */
cxl_bus_rescan();
- return 0;
+
+out:
+ return rc;
}
static const struct acpi_device_id cxl_acpi_ids[] = {
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v5 2/7] cxl/core: Rename suspend.c to probe_state.c and remove CONFIG_CXL_SUSPEND
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-22 21:44 ` dan.j.williams
2025-07-15 18:04 ` [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion Smita Koralahalli
` (7 subsequent siblings)
9 siblings, 1 reply; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
The cxl_mem_active_inc()/dec() and cxl_mem_active() helpers were initially
introduced to coordinate suspend/resume behavior. However, upcoming
changes will reuse these helpers to track cxl_mem_probe() activity during
SOFT RESERVED region handling.
To reflect this broader purpose, rename suspend.c to probe_state.c and
remove the CONFIG_CXL_SUSPEND Kconfig option. These helpers are now always
built into the CXL core subsystem.
This allows drivers such as cxl_acpi to coordinate with cxl_mem for
region setup and hotplug handling.
Co-developed-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Signed-off-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
While these helpers are no longer specific to suspend, they couldn't be
moved into files like memdev.c or mem.c, as those are built as modules.
The problem is that cxl_mem_active() is invoked by core kernel components
such as kernel/power/suspend.c and hibernate.c, which are built into
vmlinux. If the helpers were moved into a module, it would result in
unresolved symbol errors as symbols are not guaranteed to be available.
One option would be to force memdev.o to be built-in, but that introduces
unnecessary constraints, since it includes broader device management
logic. Instead, I have renamed it to probe_state.c.
---
drivers/cxl/Kconfig | 4 ----
drivers/cxl/core/Makefile | 2 +-
drivers/cxl/core/{suspend.c => probe_state.c} | 5 ++++-
drivers/cxl/cxlmem.h | 9 ---------
include/linux/pm.h | 7 -------
5 files changed, 5 insertions(+), 22 deletions(-)
rename drivers/cxl/core/{suspend.c => probe_state.c} (83%)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 48b7314afdb8..d407d2c96a7a 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -189,10 +189,6 @@ config CXL_PORT
default CXL_BUS
tristate
-config CXL_SUSPEND
- def_bool y
- depends on SUSPEND && CXL_MEM
-
config CXL_REGION
bool "CXL: Region Support"
default CXL_BUS
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 79e2ef81fde8..0fa7aa530de4 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_CXL_BUS) += cxl_core.o
-obj-$(CONFIG_CXL_SUSPEND) += suspend.o
+obj-y += probe_state.o
ccflags-y += -I$(srctree)/drivers/cxl
CFLAGS_trace.o = -DTRACE_INCLUDE_PATH=. -I$(src)
diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/probe_state.c
similarity index 83%
rename from drivers/cxl/core/suspend.c
rename to drivers/cxl/core/probe_state.c
index 29aa5cc5e565..5ba4b4de0e33 100644
--- a/drivers/cxl/core/suspend.c
+++ b/drivers/cxl/core/probe_state.c
@@ -8,7 +8,10 @@ static atomic_t mem_active;
bool cxl_mem_active(void)
{
- return atomic_read(&mem_active) != 0;
+ if (IS_ENABLED(CONFIG_CXL_MEM))
+ return atomic_read(&mem_active) != 0;
+
+ return false;
}
void cxl_mem_active_inc(void)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 551b0ba2caa1..86e43475a1e1 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -883,17 +883,8 @@ static inline void devm_cxl_memdev_edac_release(struct cxl_memdev *cxlmd)
{ return; }
#endif
-#ifdef CONFIG_CXL_SUSPEND
void cxl_mem_active_inc(void);
void cxl_mem_active_dec(void);
-#else
-static inline void cxl_mem_active_inc(void)
-{
-}
-static inline void cxl_mem_active_dec(void)
-{
-}
-#endif
int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd);
diff --git a/include/linux/pm.h b/include/linux/pm.h
index f0bd8fbae4f2..415928e0b6ca 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -35,14 +35,7 @@ static inline void pm_vt_switch_unregister(struct device *dev)
}
#endif /* CONFIG_VT_CONSOLE_SLEEP */
-#ifdef CONFIG_CXL_SUSPEND
bool cxl_mem_active(void);
-#else
-static inline bool cxl_mem_active(void)
-{
- return false;
-}
-#endif
/*
* Device power management
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 2/7] cxl/core: Rename suspend.c to probe_state.c and remove CONFIG_CXL_SUSPEND Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-17 0:24 ` Dave Jiang
2025-07-23 7:31 ` dan.j.williams
2025-07-15 18:04 ` [PATCH v5 4/7] cxl/region: Introduce SOFT RESERVED resource removal on region teardown Smita Koralahalli
` (6 subsequent siblings)
9 siblings, 2 replies; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Introduce a background worker in cxl_acpi to delay SOFT RESERVE handling
until the cxl_mem driver has probed at least one device. This coordination
ensures that DAX registration or fallback handling for soft-reserved
regions is not triggered prematurely.
The worker waits on cxl_wait_queue, which is signaled via
cxl_mem_active_inc() during cxl_mem_probe(). Once at least one memory
device probe is confirmed, the worker invokes wait_for_device_probe()
to allow the rest of the CXL device hierarchy to complete initialization.
Additionally, it handles initialization order issues where
cxl_acpi_probe() may complete before other drivers such as cxl_port or
cxl_mem have loaded, especially when cxl_acpi and cxl_port are built-in
and cxl_mem is a loadable module. In such cases, using only
wait_for_device_probe() is insufficient, as it may return before all
relevant probes are registered.
While region creation happens in cxl_port_probe(), waiting on
cxl_mem_active() is sufficient because cxl_mem_probe() can only succeed
after the port hierarchy is in place. Furthermore, since cxl_mem depends
on cxl_pci, this also guarantees that cxl_pci has loaded by the time the
wait completes.
As the cxl_mem_active() infrastructure already exists for tracking probe
activity, cxl_acpi can use it without introducing new coordination
mechanisms.
Co-developed-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Signed-off-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/acpi.c | 18 ++++++++++++++++++
drivers/cxl/core/probe_state.c | 5 +++++
drivers/cxl/cxl.h | 2 ++
3 files changed, 25 insertions(+)
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index ca06d5acdf8f..3a27289e669b 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -823,6 +823,20 @@ static int pair_cxl_resource(struct device *dev, void *data)
return 0;
}
+static void cxl_softreserv_mem_work_fn(struct work_struct *work)
+{
+ if (!wait_event_timeout(cxl_wait_queue, cxl_mem_active(), 30 * HZ))
+ pr_debug("Timeout waiting for cxl_mem probing");
+
+ wait_for_device_probe();
+}
+static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
+
+static void cxl_softreserv_mem_update(void)
+{
+ schedule_work(&cxl_sr_work);
+}
+
static int cxl_acpi_probe(struct platform_device *pdev)
{
int rc = 0;
@@ -903,6 +917,9 @@ static int cxl_acpi_probe(struct platform_device *pdev)
cxl_bus_rescan();
out:
+ /* Update SOFT RESERVE resources that intersect with CXL regions */
+ cxl_softreserv_mem_update();
+
return rc;
}
@@ -934,6 +951,7 @@ static int __init cxl_acpi_init(void)
static void __exit cxl_acpi_exit(void)
{
+ cancel_work_sync(&cxl_sr_work);
platform_driver_unregister(&cxl_acpi_driver);
cxl_bus_drain();
}
diff --git a/drivers/cxl/core/probe_state.c b/drivers/cxl/core/probe_state.c
index 5ba4b4de0e33..3089b2698b32 100644
--- a/drivers/cxl/core/probe_state.c
+++ b/drivers/cxl/core/probe_state.c
@@ -2,9 +2,12 @@
/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
#include <linux/atomic.h>
#include <linux/export.h>
+#include <linux/wait.h>
#include "cxlmem.h"
static atomic_t mem_active;
+DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);
+EXPORT_SYMBOL_NS_GPL(cxl_wait_queue, "CXL");
bool cxl_mem_active(void)
{
@@ -13,10 +16,12 @@ bool cxl_mem_active(void)
return false;
}
+EXPORT_SYMBOL_NS_GPL(cxl_mem_active, "CXL");
void cxl_mem_active_inc(void)
{
atomic_inc(&mem_active);
+ wake_up(&cxl_wait_queue);
}
EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL");
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3f1695c96abc..3117136f0208 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -903,6 +903,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
+extern wait_queue_head_t cxl_wait_queue;
+
/*
* Unit test builds overrides this to __weak, find the 'strong' version
* of these symbols in tools/testing/cxl/.
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v5 4/7] cxl/region: Introduce SOFT RESERVED resource removal on region teardown
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (2 preceding siblings ...)
2025-07-15 18:04 ` [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-17 0:42 ` Dave Jiang
2025-07-15 18:04 ` [PATCH v5 5/7] dax/hmem: Save the DAX HMEM platform device pointer Smita Koralahalli
` (5 subsequent siblings)
9 siblings, 1 reply; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Reworked from a patch by Alison Schofield <alison.schofield@intel.com>
Previously, when CXL regions were created through autodiscovery and their
resources overlapped with SOFT RESERVED ranges, the soft reserved resource
remained in place after region teardown. This left the HPA range
unavailable for reuse even after the region was destroyed.
Enhance the logic to reliably remove SOFT RESERVED resources associated
with a region, regardless of alignment or hierarchy in the iomem tree.
Link: https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
Co-developed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/cxl/acpi.c | 2 +
drivers/cxl/core/region.c | 124 ++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 2 +
include/linux/ioport.h | 1 +
kernel/resource.c | 34 +++++++++++
5 files changed, 163 insertions(+)
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 3a27289e669b..9eb8a9587dee 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -829,6 +829,8 @@ static void cxl_softreserv_mem_work_fn(struct work_struct *work)
pr_debug("Timeout waiting for cxl_mem probing");
wait_for_device_probe();
+
+ cxl_region_softreserv_update();
}
static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 6e5e1460068d..95951a1f1cab 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3486,6 +3486,130 @@ int cxl_add_to_region(struct cxl_endpoint_decoder *cxled)
}
EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, "CXL");
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+ unsigned long flags)
+{
+ struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+ int rc;
+
+ if (!res)
+ return -ENOMEM;
+
+ *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+ flags | IORESOURCE_MEM,
+ IORES_DESC_SOFT_RESERVED);
+
+ rc = insert_resource(&iomem_resource, res);
+ if (rc) {
+ kfree(res);
+ return rc;
+ }
+
+ return 0;
+}
+
+static void remove_soft_reserved(struct cxl_region *cxlr, struct resource *soft,
+ resource_size_t start, resource_size_t end)
+{
+ struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
+ resource_size_t new_start, new_end;
+ int rc;
+
+ guard(mutex)(&cxlrd->range_lock);
+
+ if (soft->start == start && soft->end == end) {
+ /*
+ * Exact alignment at both start and end. The entire region is
+ * removed below.
+ */
+
+ } else if (soft->start == start || soft->end == end) {
+ /* Aligns at either resource start or end */
+ if (soft->start == start) {
+ new_start = end + 1;
+ new_end = soft->end;
+ } else {
+ new_start = soft->start;
+ new_end = start - 1;
+ }
+
+ /*
+ * Reuse original flags as the trimmed portion retains the same
+ * memory type and access characteristics.
+ */
+ rc = add_soft_reserved(new_start, new_end - new_start + 1,
+ soft->flags);
+ if (rc)
+ dev_warn(&cxlr->dev,
+ "cannot add new soft reserved resource at %pa\n",
+ &new_start);
+
+ } else {
+ /* No alignment - Split into two new soft reserved regions */
+ new_start = soft->start;
+ new_end = soft->end;
+
+ rc = add_soft_reserved(new_start, start - new_start,
+ soft->flags);
+ if (rc)
+ dev_warn(&cxlr->dev,
+ "cannot add new soft reserved resource at %pa\n",
+ &new_start);
+
+ rc = add_soft_reserved(end + 1, new_end - end, soft->flags);
+ if (rc)
+ dev_warn(&cxlr->dev,
+ "cannot add new soft reserved resource at %pa + 1\n",
+ &end);
+ }
+
+ rc = remove_resource(soft);
+ if (rc)
+ dev_warn(&cxlr->dev, "cannot remove soft reserved resource %pr\n",
+ soft);
+}
+
+static int __cxl_region_softreserv_update(struct resource *soft,
+ void *_cxlr)
+{
+ struct cxl_region *cxlr = _cxlr;
+ struct resource *res = cxlr->params.res;
+
+ /* Skip non-intersecting soft-reserved regions */
+ if (soft->end < res->start || soft->start > res->end)
+ return 0;
+
+ soft = normalize_resource(soft);
+ if (!soft)
+ return -EINVAL;
+
+ remove_soft_reserved(cxlr, soft, res->start, res->end);
+
+ return 0;
+}
+
+static int cxl_region_softreserv_update_cb(struct device *dev, void *data)
+{
+ struct cxl_region *cxlr;
+
+ if (!is_cxl_region(dev))
+ return 0;
+
+ cxlr = to_cxl_region(dev);
+
+ walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0, -1,
+ cxlr, __cxl_region_softreserv_update);
+
+ return 0;
+}
+
+void cxl_region_softreserv_update(void)
+{
+ bus_for_each_dev(&cxl_bus_type, NULL, NULL,
+ cxl_region_softreserv_update_cb);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_region_softreserv_update, "CXL");
+
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa)
{
struct cxl_region_ref *iter;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3117136f0208..9f173467e497 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -862,6 +862,7 @@ struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
+void cxl_region_softreserv_update(void);
#else
static inline bool is_cxl_pmem_region(struct device *dev)
{
@@ -884,6 +885,7 @@ static inline u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint,
{
return 0;
}
+static inline void cxl_region_softreserv_update(void) { }
#endif
void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index e8b2d6aa4013..8693e095d32b 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -233,6 +233,7 @@ struct resource_constraint {
extern struct resource ioport_resource;
extern struct resource iomem_resource;
+extern struct resource *normalize_resource(struct resource *res);
extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
extern int request_resource(struct resource *root, struct resource *new);
extern int release_resource(struct resource *new);
diff --git a/kernel/resource.c b/kernel/resource.c
index 8d3e6ed0bdc1..3d8dc2a59cb2 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -50,6 +50,40 @@ EXPORT_SYMBOL(iomem_resource);
static DEFINE_RWLOCK(resource_lock);
+/*
+ * normalize_resource
+ *
+ * The walk_iomem_res_desc() returns a copy of a resource, not a reference
+ * to the actual resource in the iomem_resource tree. As a result,
+ * __release_resource() which relies on pointer equality will fail.
+ *
+ * This helper walks the children of the resource's parent to find and
+ * return the original resource pointer that matches the given resource's
+ * start and end addresses.
+ *
+ * Return: Pointer to the matching original resource in iomem_resource, or
+ * NULL if not found or invalid input.
+ */
+struct resource *normalize_resource(struct resource *res)
+{
+ if (!res || !res->parent)
+ return NULL;
+
+ read_lock(&resource_lock);
+ for (struct resource *res_iter = res->parent->child; res_iter != NULL;
+ res_iter = res_iter->sibling) {
+ if ((res_iter->start == res->start) &&
+ (res_iter->end == res->end)) {
+ read_unlock(&resource_lock);
+ return res_iter;
+ }
+ }
+
+ read_unlock(&resource_lock);
+ return NULL;
+}
+EXPORT_SYMBOL_NS_GPL(normalize_resource, "CXL");
+
/*
* Return the next node of @p in pre-order tree traversal. If
* @skip_children is true, skip the descendant nodes of @p in
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v5 5/7] dax/hmem: Save the DAX HMEM platform device pointer
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (3 preceding siblings ...)
2025-07-15 18:04 ` [PATCH v5 4/7] cxl/region: Introduce SOFT RESERVED resource removal on region teardown Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 6/7] dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until after CXL region creation Smita Koralahalli
` (4 subsequent siblings)
9 siblings, 0 replies; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
From: Nathan Fontenot <nathan.fontenot@amd.com>
To enable registration of HMEM devices for SOFT RESERVED regions after
the DAX HMEM device is initialized, this patch saves a reference to the
DAX HMEM platform device.
This saved pointer will be used in a follow-up patch to allow late
registration of SOFT RESERVED memory ranges. It also enables
simplification of walk_hmem_resources() by removing the need to
pass a struct device argument.
There are no functional changes.
Co-developed-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Signed-off-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/dax/hmem/device.c | 4 ++--
drivers/dax/hmem/hmem.c | 9 ++++++---
include/linux/dax.h | 5 ++---
3 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index f9e1a76a04a9..59ad44761191 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -17,14 +17,14 @@ static struct resource hmem_active = {
.flags = IORESOURCE_MEM,
};
-int walk_hmem_resources(struct device *host, walk_hmem_fn fn)
+int walk_hmem_resources(walk_hmem_fn fn)
{
struct resource *res;
int rc = 0;
mutex_lock(&hmem_resource_lock);
for (res = hmem_active.child; res; res = res->sibling) {
- rc = fn(host, (int) res->desc, res);
+ rc = fn((int) res->desc, res);
if (rc)
break;
}
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491..3aedef5f1be1 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -9,6 +9,8 @@
static bool region_idle;
module_param_named(region_idle, region_idle, bool, 0644);
+static struct platform_device *dax_hmem_pdev;
+
static int dax_hmem_probe(struct platform_device *pdev)
{
unsigned long flags = IORESOURCE_DAX_KMEM;
@@ -59,9 +61,9 @@ static void release_hmem(void *pdev)
platform_device_unregister(pdev);
}
-static int hmem_register_device(struct device *host, int target_nid,
- const struct resource *res)
+static int hmem_register_device(int target_nid, const struct resource *res)
{
+ struct device *host = &dax_hmem_pdev->dev;
struct platform_device *pdev;
struct memregion_info info;
long id;
@@ -125,7 +127,8 @@ static int hmem_register_device(struct device *host, int target_nid,
static int dax_hmem_platform_probe(struct platform_device *pdev)
{
- return walk_hmem_resources(&pdev->dev, hmem_register_device);
+ dax_hmem_pdev = pdev;
+ return walk_hmem_resources(hmem_register_device);
}
static struct platform_driver dax_hmem_platform_driver = {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index dcc9fcdf14e4..a4ad3708ea35 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -305,7 +305,6 @@ static inline void hmem_register_resource(int target_nid, struct resource *r)
}
#endif
-typedef int (*walk_hmem_fn)(struct device *dev, int target_nid,
- const struct resource *res);
-int walk_hmem_resources(struct device *dev, walk_hmem_fn fn);
+typedef int (*walk_hmem_fn)(int target_nid, const struct resource *res);
+int walk_hmem_resources(walk_hmem_fn fn);
#endif
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v5 6/7] dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until after CXL region creation
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (4 preceding siblings ...)
2025-07-15 18:04 ` [PATCH v5 5/7] dax/hmem: Save the DAX HMEM platform device pointer Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 7/7] dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads late Smita Koralahalli
` (3 subsequent siblings)
9 siblings, 0 replies; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Introduce a fallback registration mechanism in the DAX HMEM driver to
enable deferred registration of SOFT RESERVED regions. This allows
coordination with the CXL subsystem to avoid conflicts during CXL region
setup.
When CONFIG_CXL_ACPI is enabled, the DAX HMEM driver and HMAT skip
walking SOFT RESERVED resources. Instead, the DAX driver provides a
fallback registration mechanism via hmem_register_fallback_handler()
and hmem_fallback_register_device().
The CXL driver invokes hmem_fallback_register_device() after trimming soft
reserves to register any remaining SOFT RESERVED regions that are not
consumed by CXL. This ensures that the DAX driver does not consume
memory ranges that are intended to be part of CXL regions.
Co-developed-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Signed-off-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/acpi/numa/hmat.c | 4 ++++
drivers/cxl/core/region.c | 11 +++++++++
drivers/dax/hmem/Makefile | 1 +
drivers/dax/hmem/device.c | 43 +++++++++++++++++-----------------
drivers/dax/hmem/hmem.c | 6 +++++
drivers/dax/hmem/hmem_notify.c | 27 +++++++++++++++++++++
include/linux/dax.h | 2 ++
7 files changed, 73 insertions(+), 21 deletions(-)
create mode 100644 drivers/dax/hmem/hmem_notify.c
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 9d9052258e92..8883fd4a229b 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -901,6 +901,10 @@ static void hmat_register_target_devices(struct memory_target *target)
if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
return;
+ /* Allow CXL to manage the dax devices if enabled */
+ if (IS_ENABLED(CONFIG_CXL_ACPI))
+ return;
+
for (res = target->memregions.child; res; res = res->sibling) {
int target_nid = pxm_to_node(target->memory_pxm);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 95951a1f1cab..b1fa38e0b987 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -10,6 +10,7 @@
#include <linux/sort.h>
#include <linux/idr.h>
#include <linux/memory-tiers.h>
+#include <linux/dax.h>
#include <cxlmem.h>
#include <cxl.h>
#include "core.h"
@@ -3603,10 +3604,20 @@ static int cxl_region_softreserv_update_cb(struct device *dev, void *data)
return 0;
}
+static int cxl_softreserv_mem_register(struct resource *res, void *unused)
+{
+ hmem_fallback_register_device(phys_to_target_node(res->start), res);
+ return 0;
+}
+
void cxl_region_softreserv_update(void)
{
bus_for_each_dev(&cxl_bus_type, NULL, NULL,
cxl_region_softreserv_update_cb);
+
+ /* Now register any remaining SOFT RESERVES with DAX */
+ walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM,
+ 0, -1, NULL, cxl_softreserv_mem_register);
}
EXPORT_SYMBOL_NS_GPL(cxl_region_softreserv_update, "CXL");
diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
index d4c4cd6bccd7..aa8742e20408 100644
--- a/drivers/dax/hmem/Makefile
+++ b/drivers/dax/hmem/Makefile
@@ -2,6 +2,7 @@
# device_hmem.o deliberately precedes dax_hmem.o for initcall ordering
obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
+obj-y += hmem_notify.o
device_hmem-y := device.o
dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index 59ad44761191..cc1ed7bbdb1a 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -8,7 +8,6 @@
static bool nohmem;
module_param_named(disable, nohmem, bool, 0444);
-static bool platform_initialized;
static DEFINE_MUTEX(hmem_resource_lock);
static struct resource hmem_active = {
.name = "HMEM devices",
@@ -35,9 +34,7 @@ EXPORT_SYMBOL_GPL(walk_hmem_resources);
static void __hmem_register_resource(int target_nid, struct resource *res)
{
- struct platform_device *pdev;
struct resource *new;
- int rc;
new = __request_region(&hmem_active, res->start, resource_size(res), "",
0);
@@ -47,21 +44,6 @@ static void __hmem_register_resource(int target_nid, struct resource *res)
}
new->desc = target_nid;
-
- if (platform_initialized)
- return;
-
- pdev = platform_device_alloc("hmem_platform", 0);
- if (!pdev) {
- pr_err_once("failed to register device-dax hmem_platform device\n");
- return;
- }
-
- rc = platform_device_add(pdev);
- if (rc)
- platform_device_put(pdev);
- else
- platform_initialized = true;
}
void hmem_register_resource(int target_nid, struct resource *res)
@@ -83,9 +65,28 @@ static __init int hmem_register_one(struct resource *res, void *data)
static __init int hmem_init(void)
{
- walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
- IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
- return 0;
+ struct platform_device *pdev;
+ int rc;
+
+ if (!IS_ENABLED(CONFIG_CXL_ACPI)) {
+ walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
+ IORESOURCE_MEM, 0, -1, NULL,
+ hmem_register_one);
+ }
+
+ pdev = platform_device_alloc("hmem_platform", 0);
+ if (!pdev) {
+ pr_err("failed to register device-dax hmem_platform device\n");
+ return -1;
+ }
+
+ rc = platform_device_add(pdev);
+ if (rc) {
+ pr_err("failed to add device-dax hmem_platform device\n");
+ platform_device_put(pdev);
+ }
+
+ return rc;
}
/*
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 3aedef5f1be1..16873ae0a53b 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -128,6 +128,12 @@ static int hmem_register_device(int target_nid, const struct resource *res)
static int dax_hmem_platform_probe(struct platform_device *pdev)
{
dax_hmem_pdev = pdev;
+
+ if (IS_ENABLED(CONFIG_CXL_ACPI)) {
+ hmem_register_fallback_handler(hmem_register_device);
+ return 0;
+ }
+
return walk_hmem_resources(hmem_register_device);
}
diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
new file mode 100644
index 000000000000..1b366ffbda66
--- /dev/null
+++ b/drivers/dax/hmem/hmem_notify.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/spinlock.h>
+#include <linux/dax.h>
+
+static walk_hmem_fn hmem_fallback_fn;
+static DEFINE_SPINLOCK(hmem_notify_lock);
+
+void hmem_register_fallback_handler(walk_hmem_fn hmem_fn)
+{
+ guard(spinlock_irqsave)(&hmem_notify_lock);
+ hmem_fallback_fn = hmem_fn;
+}
+EXPORT_SYMBOL_GPL(hmem_register_fallback_handler);
+
+void hmem_fallback_register_device(int target_nid, const struct resource *res)
+{
+ walk_hmem_fn hmem_fn;
+
+ guard(spinlock)(&hmem_notify_lock);
+ hmem_fn = hmem_fallback_fn;
+
+ if (hmem_fn)
+ hmem_fn(target_nid, res);
+}
+EXPORT_SYMBOL_GPL(hmem_fallback_register_device);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a4ad3708ea35..069ded715e5a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -307,4 +307,6 @@ static inline void hmem_register_resource(int target_nid, struct resource *r)
typedef int (*walk_hmem_fn)(int target_nid, const struct resource *res);
int walk_hmem_resources(walk_hmem_fn fn);
+void hmem_register_fallback_handler(walk_hmem_fn hmem_fn);
+void hmem_fallback_register_device(int target_nid, const struct resource *res);
#endif
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v5 7/7] dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads late
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (5 preceding siblings ...)
2025-07-15 18:04 ` [PATCH v5 6/7] dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until after CXL region creation Smita Koralahalli
@ 2025-07-15 18:04 ` Smita Koralahalli
2025-07-15 21:07 ` [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Alison Schofield
` (2 subsequent siblings)
9 siblings, 0 replies; 38+ messages in thread
From: Smita Koralahalli @ 2025-07-15 18:04 UTC (permalink / raw)
To: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
After CXL completes trimming SOFT RESERVED ranges that intersect with CXL
regions, it invokes hmem_fallback_register_device() to register any
leftover ranges. If this occurs before the DAX HMEM driver has
initialized, the call becomes a no-op and those resources are lost.
To prevent this, store fallback-registered resources in a separate
deferred tree (hmem_deferred_active). When the DAX HMEM driver is
initialized, it walks this deferred list to properly register DAX
devices.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
drivers/dax/hmem/device.c | 17 +++++++++++++----
drivers/dax/hmem/hmem.c | 1 -
drivers/dax/hmem/hmem_notify.c | 2 ++
3 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index cc1ed7bbdb1a..41c5886a30d1 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -16,13 +16,21 @@ static struct resource hmem_active = {
.flags = IORESOURCE_MEM,
};
+static struct resource hmem_deferred_active = {
+ .name = "Deferred HMEM devices",
+ .start = 0,
+ .end = -1,
+ .flags = IORESOURCE_MEM,
+};
+static struct resource *hmem_resource_root = &hmem_active;
+
int walk_hmem_resources(walk_hmem_fn fn)
{
struct resource *res;
int rc = 0;
mutex_lock(&hmem_resource_lock);
- for (res = hmem_active.child; res; res = res->sibling) {
+ for (res = hmem_resource_root->child; res; res = res->sibling) {
rc = fn((int) res->desc, res);
if (rc)
break;
@@ -36,8 +44,8 @@ static void __hmem_register_resource(int target_nid, struct resource *res)
{
struct resource *new;
- new = __request_region(&hmem_active, res->start, resource_size(res), "",
- 0);
+ new = __request_region(hmem_resource_root, res->start,
+ resource_size(res), "", 0);
if (!new) {
pr_debug("hmem range %pr already active\n", res);
return;
@@ -72,7 +80,8 @@ static __init int hmem_init(void)
walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
IORESOURCE_MEM, 0, -1, NULL,
hmem_register_one);
- }
+ } else
+ hmem_resource_root = &hmem_deferred_active;
pdev = platform_device_alloc("hmem_platform", 0);
if (!pdev) {
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 16873ae0a53b..76a381c274a8 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -131,7 +131,6 @@ static int dax_hmem_platform_probe(struct platform_device *pdev)
if (IS_ENABLED(CONFIG_CXL_ACPI)) {
hmem_register_fallback_handler(hmem_register_device);
- return 0;
}
return walk_hmem_resources(hmem_register_device);
diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
index 1b366ffbda66..6c276c5bd51d 100644
--- a/drivers/dax/hmem/hmem_notify.c
+++ b/drivers/dax/hmem/hmem_notify.c
@@ -23,5 +23,7 @@ void hmem_fallback_register_device(int target_nid, const struct resource *res)
if (hmem_fn)
hmem_fn(target_nid, res);
+ else
+ hmem_register_resource(target_nid, (struct resource *)res);
}
EXPORT_SYMBOL_GPL(hmem_fallback_register_device);
--
2.17.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (6 preceding siblings ...)
2025-07-15 18:04 ` [PATCH v5 7/7] dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads late Smita Koralahalli
@ 2025-07-15 21:07 ` Alison Schofield
2025-07-16 6:01 ` Koralahalli Channabasappa, Smita
2025-07-21 7:38 ` Zhijian Li (Fujitsu)
2025-07-22 20:07 ` dan.j.williams
9 siblings, 1 reply; 38+ messages in thread
From: Alison Schofield @ 2025-07-15 21:07 UTC (permalink / raw)
To: Smita Koralahalli
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.
Hi Smita,
This set applied cleanly to today's cxl-next but fails as appended
below, before region probe.
BTW - there were sparse warnings in the build that look related:
CHECK drivers/dax/hmem/hmem_notify.c
drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
This isn't all of the logs; I trimmed them. Let me know if you need more
or other info to reproduce.
[ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
[ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
[ 53.653540] preempt_count: 1, expected: 0
[ 53.653554] RCU nest depth: 0, expected: 0
[ 53.653568] 3 locks held by kworker/46:1/1875:
[ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
[ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
[ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
[ 53.653598] Preemption disabled at:
[ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
[ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
[ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[ 53.653648] Call Trace:
[ 53.653649] <TASK>
[ 53.653652] dump_stack_lvl+0xa8/0xd0
[ 53.653658] dump_stack+0x14/0x20
[ 53.653659] __might_resched+0x1ae/0x2d0
[ 53.653666] __might_sleep+0x48/0x70
[ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
[ 53.653674] ? __devm_add_action+0x3d/0x160
[ 53.653685] ? __pfx_devm_action_release+0x10/0x10
[ 53.653688] __devres_alloc_node+0x4a/0x90
[ 53.653689] ? __devres_alloc_node+0x4a/0x90
[ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
[ 53.653693] __devm_add_action+0x3d/0x160
[ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
[ 53.653700] hmem_fallback_register_device+0x37/0x60
[ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.653739] walk_iomem_res_desc+0x55/0xb0
[ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.653768] process_one_work+0x1fa/0x630
[ 53.653774] worker_thread+0x1b2/0x360
[ 53.653777] kthread+0x128/0x250
[ 53.653781] ? __pfx_worker_thread+0x10/0x10
[ 53.653784] ? __pfx_kthread+0x10/0x10
[ 53.653786] ret_from_fork+0x139/0x1e0
[ 53.653790] ? __pfx_kthread+0x10/0x10
[ 53.653792] ret_from_fork_asm+0x1a/0x30
[ 53.653801] </TASK>
[ 53.654193] =============================
[ 53.654203] [ BUG: Invalid wait context ]
[ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
[ 53.654623] -----------------------------
[ 53.654785] kworker/46:1/1875 is trying to lock:
[ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
[ 53.655115] other info that might help us debug this:
[ 53.655273] context-{5:5}
[ 53.655428] 3 locks held by kworker/46:1/1875:
[ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
[ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
[ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
[ 53.656062] stack backtrace:
[ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
[ 53.656227] Tainted: [W]=WARN
[ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[ 53.656232] Call Trace:
[ 53.656232] <TASK>
[ 53.656234] dump_stack_lvl+0x85/0xd0
[ 53.656238] dump_stack+0x14/0x20
[ 53.656239] __lock_acquire+0xaf4/0x2200
[ 53.656246] lock_acquire+0xd8/0x300
[ 53.656248] ? kernfs_add_one+0x34/0x390
[ 53.656252] ? __might_resched+0x208/0x2d0
[ 53.656257] down_write+0x44/0xe0
[ 53.656262] ? kernfs_add_one+0x34/0x390
[ 53.656263] kernfs_add_one+0x34/0x390
[ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
[ 53.656268] sysfs_create_dir_ns+0x74/0xd0
[ 53.656270] kobject_add_internal+0xb1/0x2f0
[ 53.656273] kobject_add+0x7d/0xf0
[ 53.656275] ? get_device_parent+0x28/0x1e0
[ 53.656280] ? __pfx_klist_children_get+0x10/0x10
[ 53.656282] device_add+0x124/0x8b0
[ 53.656285] ? dev_set_name+0x56/0x70
[ 53.656287] platform_device_add+0x102/0x260
[ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
[ 53.656291] hmem_fallback_register_device+0x37/0x60
[ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.656323] walk_iomem_res_desc+0x55/0xb0
[ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.656346] process_one_work+0x1fa/0x630
[ 53.656350] worker_thread+0x1b2/0x360
[ 53.656352] kthread+0x128/0x250
[ 53.656354] ? __pfx_worker_thread+0x10/0x10
[ 53.656356] ? __pfx_kthread+0x10/0x10
[ 53.656357] ret_from_fork+0x139/0x1e0
[ 53.656360] ? __pfx_kthread+0x10/0x10
[ 53.656361] ret_from_fork_asm+0x1a/0x30
[ 53.656366] </TASK>
[ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
[ 53.663552] schedule+0x4a/0x160
[ 53.663553] schedule_timeout+0x10a/0x120
[ 53.663555] ? debug_smp_processor_id+0x1b/0x30
[ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
[ 53.663558] __wait_for_common+0xb9/0x1c0
[ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
[ 53.663561] wait_for_completion+0x28/0x30
[ 53.663562] __synchronize_srcu+0xbf/0x180
[ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
[ 53.663571] ? i2c_repstart+0x30/0x80
[ 53.663576] synchronize_srcu+0x46/0x120
[ 53.663577] kill_dax+0x47/0x70
[ 53.663580] __devm_create_dev_dax+0x112/0x470
[ 53.663582] devm_create_dev_dax+0x26/0x50
[ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
[ 53.663585] platform_probe+0x61/0xd0
[ 53.663589] really_probe+0xe2/0x390
[ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
[ 53.663593] __driver_probe_device+0x7e/0x160
[ 53.663594] driver_probe_device+0x23/0xa0
[ 53.663596] __device_attach_driver+0x92/0x120
[ 53.663597] bus_for_each_drv+0x8c/0xf0
[ 53.663599] __device_attach+0xc2/0x1f0
[ 53.663601] device_initial_probe+0x17/0x20
[ 53.663603] bus_probe_device+0xa8/0xb0
[ 53.663604] device_add+0x687/0x8b0
[ 53.663607] ? dev_set_name+0x56/0x70
[ 53.663609] platform_device_add+0x102/0x260
[ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
[ 53.663612] hmem_fallback_register_device+0x37/0x60
[ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.663637] walk_iomem_res_desc+0x55/0xb0
[ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.663658] process_one_work+0x1fa/0x630
[ 53.663662] worker_thread+0x1b2/0x360
[ 53.663664] kthread+0x128/0x250
[ 53.663666] ? __pfx_worker_thread+0x10/0x10
[ 53.663668] ? __pfx_kthread+0x10/0x10
[ 53.663670] ret_from_fork+0x139/0x1e0
[ 53.663672] ? __pfx_kthread+0x10/0x10
[ 53.663673] ret_from_fork_asm+0x1a/0x30
[ 53.663677] </TASK>
[ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
[ 53.700264] INFO: lockdep is turned off.
[ 53.701315] Preemption disabled at:
[ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
[ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
[ 53.701633] Tainted: [W]=WARN
[ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[ 53.701638] Call Trace:
[ 53.701638] <TASK>
[ 53.701640] dump_stack_lvl+0xa8/0xd0
[ 53.701644] dump_stack+0x14/0x20
[ 53.701645] __schedule_bug+0xa2/0xd0
[ 53.701649] __schedule+0xe6f/0x10d0
[ 53.701652] ? debug_smp_processor_id+0x1b/0x30
[ 53.701655] ? lock_release+0x1e6/0x2b0
[ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
[ 53.701661] schedule+0x4a/0x160
[ 53.701662] schedule_timeout+0x10a/0x120
[ 53.701664] ? debug_smp_processor_id+0x1b/0x30
[ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
[ 53.701667] __wait_for_common+0xb9/0x1c0
[ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
[ 53.701670] wait_for_completion+0x28/0x30
[ 53.701671] __synchronize_srcu+0xbf/0x180
[ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
[ 53.701682] ? i2c_repstart+0x30/0x80
[ 53.701685] synchronize_srcu+0x46/0x120
[ 53.701687] kill_dax+0x47/0x70
[ 53.701689] __devm_create_dev_dax+0x112/0x470
[ 53.701691] devm_create_dev_dax+0x26/0x50
[ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
[ 53.701695] platform_probe+0x61/0xd0
[ 53.701698] really_probe+0xe2/0x390
[ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
[ 53.701701] __driver_probe_device+0x7e/0x160
[ 53.701703] driver_probe_device+0x23/0xa0
[ 53.701704] __device_attach_driver+0x92/0x120
[ 53.701706] bus_for_each_drv+0x8c/0xf0
[ 53.701708] __device_attach+0xc2/0x1f0
[ 53.701710] device_initial_probe+0x17/0x20
[ 53.701711] bus_probe_device+0xa8/0xb0
[ 53.701712] device_add+0x687/0x8b0
[ 53.701715] ? dev_set_name+0x56/0x70
[ 53.701717] platform_device_add+0x102/0x260
[ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
[ 53.701720] hmem_fallback_register_device+0x37/0x60
[ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.701734] walk_iomem_res_desc+0x55/0xb0
[ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.701756] process_one_work+0x1fa/0x630
[ 53.701760] worker_thread+0x1b2/0x360
[ 53.701762] kthread+0x128/0x250
[ 53.701765] ? __pfx_worker_thread+0x10/0x10
[ 53.701766] ? __pfx_kthread+0x10/0x10
[ 53.701768] ret_from_fork+0x139/0x1e0
[ 53.701771] ? __pfx_kthread+0x10/0x10
[ 53.701772] ret_from_fork_asm+0x1a/0x30
[ 53.701777] </TASK>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-15 21:07 ` [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Alison Schofield
@ 2025-07-16 6:01 ` Koralahalli Channabasappa, Smita
2025-07-16 20:20 ` Alison Schofield
2025-07-23 15:24 ` dan.j.williams
0 siblings, 2 replies; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-16 6:01 UTC (permalink / raw)
To: Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
Hi Alison,
On 7/15/2025 2:07 PM, Alison Schofield wrote:
> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>> This series introduces the ability to manage SOFT RESERVED iomem
>> resources, enabling the CXL driver to remove any portions that
>> intersect with created CXL regions.
>
> Hi Smita,
>
> This set applied cleanly to todays cxl-next but fails like appended
> before region probe.
>
> BTW - there were sparse warnings in the build that look related:
> CHECK drivers/dax/hmem/hmem_notify.c
> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
Thanks for pointing out this bug. I failed to release the spinlock before
calling hmem_register_device(), which internally calls
platform_device_add() and can sleep. The following fix addresses that
bug. I’ll incorporate it into v6:
diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
index 6c276c5bd51d..8f411f3fe7bd 100644
--- a/drivers/dax/hmem/hmem_notify.c
+++ b/drivers/dax/hmem/hmem_notify.c
@@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid,
const struct resource *res)
{
walk_hmem_fn hmem_fn;
- guard(spinlock)(&hmem_notify_lock);
+ spin_lock(&hmem_notify_lock);
hmem_fn = hmem_fallback_fn;
+ spin_unlock(&hmem_notify_lock);
if (hmem_fn)
hmem_fn(target_nid, res);
--
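(For completeness, a minimal alternative sketch of the same fix using
scoped_guard() from <linux/cleanup.h>; it keeps the guard style but still
drops hmem_notify_lock before the callback runs. Not the actual v6 code:)

	void hmem_fallback_register_device(int target_nid,
					   const struct resource *res)
	{
		walk_hmem_fn hmem_fn;

		/* hold hmem_notify_lock only while sampling the callback */
		scoped_guard(spinlock, &hmem_notify_lock)
			hmem_fn = hmem_fallback_fn;

		/* call outside the lock; the handler may sleep */
		if (hmem_fn)
			hmem_fn(target_nid, res);
	}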
As for the log:
[ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting
for cxl_mem probing
I’m still analyzing that. Here's my thought process so far.
- This occurs when cxl_acpi_probe() runs significantly earlier than
cxl_mem_probe(), so CXL region creation (which happens in
cxl_port_endpoint_probe()) may or may not have completed by the time
trimming is attempted.
- Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port (see the
small sketch after this list). This does guarantee load order when all
components are built as modules. So even if the timeout occurs and
cxl_mem_probe() hasn’t run within the wait window, MODULE_SOFTDEP
ensures that cxl_port is loaded before both cxl_acpi and cxl_mem in
modular configurations. As a result, region creation is eventually
guaranteed, and wait_for_device_probe() will succeed once the relevant
probes complete.
- However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
before cxl_port_probe() even begins, which can cause
wait_for_device_probe() to return prematurely and trigger the timeout.
- In my local setup, I observed that a 30-second timeout was generally
sufficient to catch this race, allowing cxl_port_probe() to run while
cxl_acpi_probe() is still active. Since we cannot mix built-in and
modular components (i.e., have cxl_acpi=y and cxl_port=m), the timeout
serves as a best-effort mechanism. After the timeout,
wait_for_device_probe() ensures cxl_port_probe() has completed before
trimming proceeds, making the logic good enough for most boot-time races.
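(For reference, the soft dependency itself is just a module tag -- a
minimal sketch, assuming both cxl_acpi and cxl_mem carry the usual
declaration:)

	/* in cxl_acpi and cxl_mem: ask modprobe to load cxl_port first */
	MODULE_SOFTDEP("pre: cxl_port");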
One possible improvement I’m considering is to schedule delayed work
(via queue_delayed_work()) from cxl_acpi_probe(). This deferred work
could wait slightly longer for cxl_mem_probe() to complete (which itself
softdeps on cxl_port) before initiating the soft reserve trimming.
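Something along these lines (illustrative names only, assuming the
existing work function can be reused; a sketch, not the actual patch):

	#include <linux/workqueue.h>

	static void cxl_softreserv_mem_work_fn(struct work_struct *work)
	{
		/* existing logic: wait for cxl_mem probing, then trim */
	}

	static DECLARE_DELAYED_WORK(cxl_sr_dwork, cxl_softreserv_mem_work_fn);

	/* queued from cxl_acpi_probe() instead of immediate work */
	static void cxl_softreserv_schedule_trim(void)
	{
		/* give cxl_port/cxl_mem probing extra time before trimming */
		queue_delayed_work(system_wq, &cxl_sr_dwork, 30 * HZ);
	}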
That said, I'm still evaluating better options to more robustly
coordinate probe ordering between cxl_acpi, cxl_port, cxl_mem and
cxl_region and looking for suggestions here.
Thanks
Smita
>
>
> This isn't all the logs, I trimmed. Let me know if you need more or
> other info to reproduce.
>
> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
> [ 53.653540] preempt_count: 1, expected: 0
> [ 53.653554] RCU nest depth: 0, expected: 0
> [ 53.653568] 3 locks held by kworker/46:1/1875:
> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> [ 53.653598] Preemption disabled at:
> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> [ 53.653648] Call Trace:
> [ 53.653649] <TASK>
> [ 53.653652] dump_stack_lvl+0xa8/0xd0
> [ 53.653658] dump_stack+0x14/0x20
> [ 53.653659] __might_resched+0x1ae/0x2d0
> [ 53.653666] __might_sleep+0x48/0x70
> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
> [ 53.653674] ? __devm_add_action+0x3d/0x160
> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
> [ 53.653688] __devres_alloc_node+0x4a/0x90
> [ 53.653689] ? __devres_alloc_node+0x4a/0x90
> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
> [ 53.653693] __devm_add_action+0x3d/0x160
> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
> [ 53.653700] hmem_fallback_register_device+0x37/0x60
> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [ 53.653739] walk_iomem_res_desc+0x55/0xb0
> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
> [ 53.653768] process_one_work+0x1fa/0x630
> [ 53.653774] worker_thread+0x1b2/0x360
> [ 53.653777] kthread+0x128/0x250
> [ 53.653781] ? __pfx_worker_thread+0x10/0x10
> [ 53.653784] ? __pfx_kthread+0x10/0x10
> [ 53.653786] ret_from_fork+0x139/0x1e0
> [ 53.653790] ? __pfx_kthread+0x10/0x10
> [ 53.653792] ret_from_fork_asm+0x1a/0x30
> [ 53.653801] </TASK>
>
> [ 53.654193] =============================
> [ 53.654203] [ BUG: Invalid wait context ]
> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
> [ 53.654623] -----------------------------
> [ 53.654785] kworker/46:1/1875 is trying to lock:
> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
> [ 53.655115] other info that might help us debug this:
> [ 53.655273] context-{5:5}
> [ 53.655428] 3 locks held by kworker/46:1/1875:
> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> [ 53.656062] stack backtrace:
> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> [ 53.656227] Tainted: [W]=WARN
> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> [ 53.656232] Call Trace:
> [ 53.656232] <TASK>
> [ 53.656234] dump_stack_lvl+0x85/0xd0
> [ 53.656238] dump_stack+0x14/0x20
> [ 53.656239] __lock_acquire+0xaf4/0x2200
> [ 53.656246] lock_acquire+0xd8/0x300
> [ 53.656248] ? kernfs_add_one+0x34/0x390
> [ 53.656252] ? __might_resched+0x208/0x2d0
> [ 53.656257] down_write+0x44/0xe0
> [ 53.656262] ? kernfs_add_one+0x34/0x390
> [ 53.656263] kernfs_add_one+0x34/0x390
> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
> [ 53.656270] kobject_add_internal+0xb1/0x2f0
> [ 53.656273] kobject_add+0x7d/0xf0
> [ 53.656275] ? get_device_parent+0x28/0x1e0
> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
> [ 53.656282] device_add+0x124/0x8b0
> [ 53.656285] ? dev_set_name+0x56/0x70
> [ 53.656287] platform_device_add+0x102/0x260
> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
> [ 53.656291] hmem_fallback_register_device+0x37/0x60
> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [ 53.656323] walk_iomem_res_desc+0x55/0xb0
> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
> [ 53.656346] process_one_work+0x1fa/0x630
> [ 53.656350] worker_thread+0x1b2/0x360
> [ 53.656352] kthread+0x128/0x250
> [ 53.656354] ? __pfx_worker_thread+0x10/0x10
> [ 53.656356] ? __pfx_kthread+0x10/0x10
> [ 53.656357] ret_from_fork+0x139/0x1e0
> [ 53.656360] ? __pfx_kthread+0x10/0x10
> [ 53.656361] ret_from_fork_asm+0x1a/0x30
> [ 53.656366] </TASK>
> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> [ 53.663552] schedule+0x4a/0x160
> [ 53.663553] schedule_timeout+0x10a/0x120
> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
> [ 53.663558] __wait_for_common+0xb9/0x1c0
> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
> [ 53.663561] wait_for_completion+0x28/0x30
> [ 53.663562] __synchronize_srcu+0xbf/0x180
> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
> [ 53.663571] ? i2c_repstart+0x30/0x80
> [ 53.663576] synchronize_srcu+0x46/0x120
> [ 53.663577] kill_dax+0x47/0x70
> [ 53.663580] __devm_create_dev_dax+0x112/0x470
> [ 53.663582] devm_create_dev_dax+0x26/0x50
> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
> [ 53.663585] platform_probe+0x61/0xd0
> [ 53.663589] really_probe+0xe2/0x390
> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
> [ 53.663593] __driver_probe_device+0x7e/0x160
> [ 53.663594] driver_probe_device+0x23/0xa0
> [ 53.663596] __device_attach_driver+0x92/0x120
> [ 53.663597] bus_for_each_drv+0x8c/0xf0
> [ 53.663599] __device_attach+0xc2/0x1f0
> [ 53.663601] device_initial_probe+0x17/0x20
> [ 53.663603] bus_probe_device+0xa8/0xb0
> [ 53.663604] device_add+0x687/0x8b0
> [ 53.663607] ? dev_set_name+0x56/0x70
> [ 53.663609] platform_device_add+0x102/0x260
> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
> [ 53.663612] hmem_fallback_register_device+0x37/0x60
> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [ 53.663637] walk_iomem_res_desc+0x55/0xb0
> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
> [ 53.663658] process_one_work+0x1fa/0x630
> [ 53.663662] worker_thread+0x1b2/0x360
> [ 53.663664] kthread+0x128/0x250
> [ 53.663666] ? __pfx_worker_thread+0x10/0x10
> [ 53.663668] ? __pfx_kthread+0x10/0x10
> [ 53.663670] ret_from_fork+0x139/0x1e0
> [ 53.663672] ? __pfx_kthread+0x10/0x10
> [ 53.663673] ret_from_fork_asm+0x1a/0x30
> [ 53.663677] </TASK>
> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> [ 53.700264] INFO: lockdep is turned off.
> [ 53.701315] Preemption disabled at:
> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> [ 53.701633] Tainted: [W]=WARN
> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> [ 53.701638] Call Trace:
> [ 53.701638] <TASK>
> [ 53.701640] dump_stack_lvl+0xa8/0xd0
> [ 53.701644] dump_stack+0x14/0x20
> [ 53.701645] __schedule_bug+0xa2/0xd0
> [ 53.701649] __schedule+0xe6f/0x10d0
> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
> [ 53.701655] ? lock_release+0x1e6/0x2b0
> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
> [ 53.701661] schedule+0x4a/0x160
> [ 53.701662] schedule_timeout+0x10a/0x120
> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
> [ 53.701667] __wait_for_common+0xb9/0x1c0
> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
> [ 53.701670] wait_for_completion+0x28/0x30
> [ 53.701671] __synchronize_srcu+0xbf/0x180
> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
> [ 53.701682] ? i2c_repstart+0x30/0x80
> [ 53.701685] synchronize_srcu+0x46/0x120
> [ 53.701687] kill_dax+0x47/0x70
> [ 53.701689] __devm_create_dev_dax+0x112/0x470
> [ 53.701691] devm_create_dev_dax+0x26/0x50
> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
> [ 53.701695] platform_probe+0x61/0xd0
> [ 53.701698] really_probe+0xe2/0x390
> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
> [ 53.701701] __driver_probe_device+0x7e/0x160
> [ 53.701703] driver_probe_device+0x23/0xa0
> [ 53.701704] __device_attach_driver+0x92/0x120
> [ 53.701706] bus_for_each_drv+0x8c/0xf0
> [ 53.701708] __device_attach+0xc2/0x1f0
> [ 53.701710] device_initial_probe+0x17/0x20
> [ 53.701711] bus_probe_device+0xa8/0xb0
> [ 53.701712] device_add+0x687/0x8b0
> [ 53.701715] ? dev_set_name+0x56/0x70
> [ 53.701717] platform_device_add+0x102/0x260
> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
> [ 53.701720] hmem_fallback_register_device+0x37/0x60
> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [ 53.701734] walk_iomem_res_desc+0x55/0xb0
> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
> [ 53.701756] process_one_work+0x1fa/0x630
> [ 53.701760] worker_thread+0x1b2/0x360
> [ 53.701762] kthread+0x128/0x250
> [ 53.701765] ? __pfx_worker_thread+0x10/0x10
> [ 53.701766] ? __pfx_kthread+0x10/0x10
> [ 53.701768] ret_from_fork+0x139/0x1e0
> [ 53.701771] ? __pfx_kthread+0x10/0x10
> [ 53.701772] ret_from_fork_asm+0x1a/0x30
> [ 53.701777] </TASK>
>
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-16 6:01 ` Koralahalli Channabasappa, Smita
@ 2025-07-16 20:20 ` Alison Schofield
2025-07-16 21:29 ` Koralahalli Channabasappa, Smita
2025-07-23 15:24 ` dan.j.williams
1 sibling, 1 reply; 38+ messages in thread
From: Alison Schofield @ 2025-07-16 20:20 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
> Hi Alison,
>
> On 7/15/2025 2:07 PM, Alison Schofield wrote:
> > On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> > > This series introduces the ability to manage SOFT RESERVED iomem
> > > resources, enabling the CXL driver to remove any portions that
> > > intersect with created CXL regions.
> >
> > Hi Smita,
> >
> > This set applied cleanly to todays cxl-next but fails like appended
> > before region probe.
> >
> > BTW - there were sparse warnings in the build that look related:
> > CHECK drivers/dax/hmem/hmem_notify.c
> > drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> > drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>
> Thanks for pointing this bug. I failed to release the spinlock before
> calling hmem_register_device(), which internally calls platform_device_add()
> and can sleep. The following fix addresses that bug. I’ll incorporate this
> into v6:
>
> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
> index 6c276c5bd51d..8f411f3fe7bd 100644
> --- a/drivers/dax/hmem/hmem_notify.c
> +++ b/drivers/dax/hmem/hmem_notify.c
> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
> struct resource *res)
> {
> walk_hmem_fn hmem_fn;
>
> - guard(spinlock)(&hmem_notify_lock);
> + spin_lock(&hmem_notify_lock);
> hmem_fn = hmem_fallback_fn;
> + spin_unlock(&hmem_notify_lock);
>
> if (hmem_fn)
> hmem_fn(target_nid, res);
> --
Hi Smita, Adding the above got me past that, and doubling the timeout
below stopped that from happening. After that, I haven't had time to
trace, so I'll just dump on you for now:
In /proc/iomem
Here, we see a region resource, but no CXL Window, no dax, and no
actual region, not even a disabled one, is available.
c080000000-c47fffffff : region0
And here, no CXL Window, no region, just a Soft Reserved:
68e80000000-70e7fffffff : Soft Reserved
68e80000000-70e7fffffff : dax1.0
68e80000000-70e7fffffff : System RAM (kmem)
I haven't yet walked through the v4 to v5 changes, so I'll do that next.
>
> As for the log:
> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
> cxl_mem probing
>
> I’m still analyzing that. Here's what was my thought process so far.
>
> - This occurs when cxl_acpi_probe() runs significantly earlier than
> cxl_mem_probe(), so CXL region creation (which happens in
> cxl_port_endpoint_probe()) may or may not have completed by the time
> trimming is attempted.
>
> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
> guarantee load order when all components are built as modules. So even if
> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
> cxl_mem in modular configurations. As a result, region creation is
> eventually guaranteed, and wait_for_device_probe() will succeed once the
> relevant probes complete.
>
> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
> to return prematurely and trigger the timeout.
>
> - In my local setup, I observed that a 30-second timeout was generally
> sufficient to catch this race, allowing cxl_port_probe() to load while
> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
> cxl_port_probe() has completed before trimming proceeds, making the logic
> good enough to most boot-time races.
>
> One possible improvement I’m considering is to schedule a
> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
> cxl_port) before initiating the soft reserve trimming.
>
> That said, I'm still evaluating better options to more robustly coordinate
> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
> looking for suggestions here.
>
> Thanks
> Smita
>
> >
> >
> > This isn't all the logs, I trimmed. Let me know if you need more or
> > other info to reproduce.
> >
> > [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
> > [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
> > [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
> > [ 53.653540] preempt_count: 1, expected: 0
> > [ 53.653554] RCU nest depth: 0, expected: 0
> > [ 53.653568] 3 locks held by kworker/46:1/1875:
> > [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> > [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> > [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> > [ 53.653598] Preemption disabled at:
> > [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> > [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > [ 53.653648] Call Trace:
> > [ 53.653649] <TASK>
> > [ 53.653652] dump_stack_lvl+0xa8/0xd0
> > [ 53.653658] dump_stack+0x14/0x20
> > [ 53.653659] __might_resched+0x1ae/0x2d0
> > [ 53.653666] __might_sleep+0x48/0x70
> > [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
> > [ 53.653674] ? __devm_add_action+0x3d/0x160
> > [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
> > [ 53.653688] __devres_alloc_node+0x4a/0x90
> > [ 53.653689] ? __devres_alloc_node+0x4a/0x90
> > [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
> > [ 53.653693] __devm_add_action+0x3d/0x160
> > [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
> > [ 53.653700] hmem_fallback_register_device+0x37/0x60
> > [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > [ 53.653739] walk_iomem_res_desc+0x55/0xb0
> > [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 53.653768] process_one_work+0x1fa/0x630
> > [ 53.653774] worker_thread+0x1b2/0x360
> > [ 53.653777] kthread+0x128/0x250
> > [ 53.653781] ? __pfx_worker_thread+0x10/0x10
> > [ 53.653784] ? __pfx_kthread+0x10/0x10
> > [ 53.653786] ret_from_fork+0x139/0x1e0
> > [ 53.653790] ? __pfx_kthread+0x10/0x10
> > [ 53.653792] ret_from_fork_asm+0x1a/0x30
> > [ 53.653801] </TASK>
> >
> > [ 53.654193] =============================
> > [ 53.654203] [ BUG: Invalid wait context ]
> > [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
> > [ 53.654623] -----------------------------
> > [ 53.654785] kworker/46:1/1875 is trying to lock:
> > [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
> > [ 53.655115] other info that might help us debug this:
> > [ 53.655273] context-{5:5}
> > [ 53.655428] 3 locks held by kworker/46:1/1875:
> > [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> > [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> > [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> > [ 53.656062] stack backtrace:
> > [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > [ 53.656227] Tainted: [W]=WARN
> > [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > [ 53.656232] Call Trace:
> > [ 53.656232] <TASK>
> > [ 53.656234] dump_stack_lvl+0x85/0xd0
> > [ 53.656238] dump_stack+0x14/0x20
> > [ 53.656239] __lock_acquire+0xaf4/0x2200
> > [ 53.656246] lock_acquire+0xd8/0x300
> > [ 53.656248] ? kernfs_add_one+0x34/0x390
> > [ 53.656252] ? __might_resched+0x208/0x2d0
> > [ 53.656257] down_write+0x44/0xe0
> > [ 53.656262] ? kernfs_add_one+0x34/0x390
> > [ 53.656263] kernfs_add_one+0x34/0x390
> > [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
> > [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
> > [ 53.656270] kobject_add_internal+0xb1/0x2f0
> > [ 53.656273] kobject_add+0x7d/0xf0
> > [ 53.656275] ? get_device_parent+0x28/0x1e0
> > [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
> > [ 53.656282] device_add+0x124/0x8b0
> > [ 53.656285] ? dev_set_name+0x56/0x70
> > [ 53.656287] platform_device_add+0x102/0x260
> > [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
> > [ 53.656291] hmem_fallback_register_device+0x37/0x60
> > [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > [ 53.656323] walk_iomem_res_desc+0x55/0xb0
> > [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 53.656346] process_one_work+0x1fa/0x630
> > [ 53.656350] worker_thread+0x1b2/0x360
> > [ 53.656352] kthread+0x128/0x250
> > [ 53.656354] ? __pfx_worker_thread+0x10/0x10
> > [ 53.656356] ? __pfx_kthread+0x10/0x10
> > [ 53.656357] ret_from_fork+0x139/0x1e0
> > [ 53.656360] ? __pfx_kthread+0x10/0x10
> > [ 53.656361] ret_from_fork_asm+0x1a/0x30
> > [ 53.656366] </TASK>
> > [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> > [ 53.663552] schedule+0x4a/0x160
> > [ 53.663553] schedule_timeout+0x10a/0x120
> > [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
> > [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
> > [ 53.663558] __wait_for_common+0xb9/0x1c0
> > [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
> > [ 53.663561] wait_for_completion+0x28/0x30
> > [ 53.663562] __synchronize_srcu+0xbf/0x180
> > [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
> > [ 53.663571] ? i2c_repstart+0x30/0x80
> > [ 53.663576] synchronize_srcu+0x46/0x120
> > [ 53.663577] kill_dax+0x47/0x70
> > [ 53.663580] __devm_create_dev_dax+0x112/0x470
> > [ 53.663582] devm_create_dev_dax+0x26/0x50
> > [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
> > [ 53.663585] platform_probe+0x61/0xd0
> > [ 53.663589] really_probe+0xe2/0x390
> > [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
> > [ 53.663593] __driver_probe_device+0x7e/0x160
> > [ 53.663594] driver_probe_device+0x23/0xa0
> > [ 53.663596] __device_attach_driver+0x92/0x120
> > [ 53.663597] bus_for_each_drv+0x8c/0xf0
> > [ 53.663599] __device_attach+0xc2/0x1f0
> > [ 53.663601] device_initial_probe+0x17/0x20
> > [ 53.663603] bus_probe_device+0xa8/0xb0
> > [ 53.663604] device_add+0x687/0x8b0
> > [ 53.663607] ? dev_set_name+0x56/0x70
> > [ 53.663609] platform_device_add+0x102/0x260
> > [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
> > [ 53.663612] hmem_fallback_register_device+0x37/0x60
> > [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > [ 53.663637] walk_iomem_res_desc+0x55/0xb0
> > [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 53.663658] process_one_work+0x1fa/0x630
> > [ 53.663662] worker_thread+0x1b2/0x360
> > [ 53.663664] kthread+0x128/0x250
> > [ 53.663666] ? __pfx_worker_thread+0x10/0x10
> > [ 53.663668] ? __pfx_kthread+0x10/0x10
> > [ 53.663670] ret_from_fork+0x139/0x1e0
> > [ 53.663672] ? __pfx_kthread+0x10/0x10
> > [ 53.663673] ret_from_fork_asm+0x1a/0x30
> > [ 53.663677] </TASK>
> > [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> > [ 53.700264] INFO: lockdep is turned off.
> > [ 53.701315] Preemption disabled at:
> > [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> > [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > [ 53.701633] Tainted: [W]=WARN
> > [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > [ 53.701638] Call Trace:
> > [ 53.701638] <TASK>
> > [ 53.701640] dump_stack_lvl+0xa8/0xd0
> > [ 53.701644] dump_stack+0x14/0x20
> > [ 53.701645] __schedule_bug+0xa2/0xd0
> > [ 53.701649] __schedule+0xe6f/0x10d0
> > [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
> > [ 53.701655] ? lock_release+0x1e6/0x2b0
> > [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
> > [ 53.701661] schedule+0x4a/0x160
> > [ 53.701662] schedule_timeout+0x10a/0x120
> > [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
> > [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
> > [ 53.701667] __wait_for_common+0xb9/0x1c0
> > [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
> > [ 53.701670] wait_for_completion+0x28/0x30
> > [ 53.701671] __synchronize_srcu+0xbf/0x180
> > [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
> > [ 53.701682] ? i2c_repstart+0x30/0x80
> > [ 53.701685] synchronize_srcu+0x46/0x120
> > [ 53.701687] kill_dax+0x47/0x70
> > [ 53.701689] __devm_create_dev_dax+0x112/0x470
> > [ 53.701691] devm_create_dev_dax+0x26/0x50
> > [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
> > [ 53.701695] platform_probe+0x61/0xd0
> > [ 53.701698] really_probe+0xe2/0x390
> > [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
> > [ 53.701701] __driver_probe_device+0x7e/0x160
> > [ 53.701703] driver_probe_device+0x23/0xa0
> > [ 53.701704] __device_attach_driver+0x92/0x120
> > [ 53.701706] bus_for_each_drv+0x8c/0xf0
> > [ 53.701708] __device_attach+0xc2/0x1f0
> > [ 53.701710] device_initial_probe+0x17/0x20
> > [ 53.701711] bus_probe_device+0xa8/0xb0
> > [ 53.701712] device_add+0x687/0x8b0
> > [ 53.701715] ? dev_set_name+0x56/0x70
> > [ 53.701717] platform_device_add+0x102/0x260
> > [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
> > [ 53.701720] hmem_fallback_register_device+0x37/0x60
> > [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > [ 53.701734] walk_iomem_res_desc+0x55/0xb0
> > [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 53.701756] process_one_work+0x1fa/0x630
> > [ 53.701760] worker_thread+0x1b2/0x360
> > [ 53.701762] kthread+0x128/0x250
> > [ 53.701765] ? __pfx_worker_thread+0x10/0x10
> > [ 53.701766] ? __pfx_kthread+0x10/0x10
> > [ 53.701768] ret_from_fork+0x139/0x1e0
> > [ 53.701771] ? __pfx_kthread+0x10/0x10
> > [ 53.701772] ret_from_fork_asm+0x1a/0x30
> > [ 53.701777] </TASK>
> >
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-16 20:20 ` Alison Schofield
@ 2025-07-16 21:29 ` Koralahalli Channabasappa, Smita
2025-07-16 23:48 ` Alison Schofield
0 siblings, 1 reply; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-16 21:29 UTC (permalink / raw)
To: Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/16/2025 1:20 PM, Alison Schofield wrote:
> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>> Hi Alison,
>>
>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>> resources, enabling the CXL driver to remove any portions that
>>>> intersect with created CXL regions.
>>>
>>> Hi Smita,
>>>
>>> This set applied cleanly to todays cxl-next but fails like appended
>>> before region probe.
>>>
>>> BTW - there were sparse warnings in the build that look related:
>>> CHECK drivers/dax/hmem/hmem_notify.c
>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>
>> Thanks for pointing this bug. I failed to release the spinlock before
>> calling hmem_register_device(), which internally calls platform_device_add()
>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>> into v6:
>>
>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>> index 6c276c5bd51d..8f411f3fe7bd 100644
>> --- a/drivers/dax/hmem/hmem_notify.c
>> +++ b/drivers/dax/hmem/hmem_notify.c
>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>> struct resource *res)
>> {
>> walk_hmem_fn hmem_fn;
>>
>> - guard(spinlock)(&hmem_notify_lock);
>> + spin_lock(&hmem_notify_lock);
>> hmem_fn = hmem_fallback_fn;
>> + spin_unlock(&hmem_notify_lock);
>>
>> if (hmem_fn)
>> hmem_fn(target_nid, res);
>> --
>
> Hi Smita, Adding the above got me past that, and doubling the timeout
> below stopped that from happening. After that, I haven't had time to
> trace so, I'll just dump on you for now:
>
> In /proc/iomem
> Here, we see a regions resource, no CXL Window, and no dax, and no
> actual region, not even disabled, is available.
> c080000000-c47fffffff : region0
>
> And, here no CXL Window, no region, and a soft reserved.
> 68e80000000-70e7fffffff : Soft Reserved
> 68e80000000-70e7fffffff : dax1.0
> 68e80000000-70e7fffffff : System RAM (kmem)
>
> I haven't yet walked through the v4 to v5 changes so I'll do that next.
Hi Alison,
To help me better understand the current behavior, could you share more
about your platform configuration? Specifically, are there two memory
cards involved, one at c080000000 (which appears as region0) and another
at 68e80000000 (which is falling back to kmem via dax1.0)? Additionally,
how are the Soft Reserved ranges laid out on your system for these
cards? I'm trying to understand the "before" state of the resources,
i.e., prior to the trimming applied by my patches.
Also, do you think it's feasible to change the direction of the soft
reserve trimming, that is, defer it until after CXL region or memdev
creation is complete? In that case the trimming would happen afterwards,
but inline with the existing region or memdev creation path. This might
simplify the flow by removing the need for wait_event_timeout(),
wait_for_device_probe() and the workqueue logic inside cxl_acpi_probe().
(As a side note, I experimented with changing cxl_acpi_init() to a
late_initcall() and observed that it consistently avoided probe ordering
issues in my setup.
Additional note: I realized that even when cxl_acpi_probe() fails, the
fallback DAX registration path (via cxl_softreserv_mem_update()) still
waits on cxl_mem_active() and wait_for_device_probe(). I plan to address
this in v6 by immediately triggering fallback DAX registration
(hmem_register_device()) when the ACPI probe fails, instead of waiting.)
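For reference, the late_initcall() experiment was essentially (a minimal
sketch, assuming cxl_acpi_init() is currently registered via
module_init()):

	-module_init(cxl_acpi_init);
	+late_initcall(cxl_acpi_init);

With CONFIG_CXL_ACPI=y this only moves the initcall to the late level;
for a modular build, late_initcall() maps back to module_init(), so
nothing changes there.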
Thanks
Smita
>
>>
>> As for the log:
>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>> cxl_mem probing
>>
>> I’m still analyzing that. Here's what was my thought process so far.
>>
>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>> cxl_mem_probe(), so CXL region creation (which happens in
>> cxl_port_endpoint_probe()) may or may not have completed by the time
>> trimming is attempted.
>>
>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>> guarantee load order when all components are built as modules. So even if
>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>> cxl_mem in modular configurations. As a result, region creation is
>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>> relevant probes complete.
>>
>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>> to return prematurely and trigger the timeout.
>>
>> - In my local setup, I observed that a 30-second timeout was generally
>> sufficient to catch this race, allowing cxl_port_probe() to load while
>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>> cxl_port_probe() has completed before trimming proceeds, making the logic
>> good enough to most boot-time races.
>>
>> One possible improvement I’m considering is to schedule a
>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>> cxl_port) before initiating the soft reserve trimming.
>>
>> That said, I'm still evaluating better options to more robustly coordinate
>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>> looking for suggestions here.
>>
>> Thanks
>> Smita
>>
>>>
>>>
>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>> other info to reproduce.
>>>
>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>> [ 53.653540] preempt_count: 1, expected: 0
>>> [ 53.653554] RCU nest depth: 0, expected: 0
>>> [ 53.653568] 3 locks held by kworker/46:1/1875:
>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>> [ 53.653598] Preemption disabled at:
>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>> [ 53.653648] Call Trace:
>>> [ 53.653649] <TASK>
>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0
>>> [ 53.653658] dump_stack+0x14/0x20
>>> [ 53.653659] __might_resched+0x1ae/0x2d0
>>> [ 53.653666] __might_sleep+0x48/0x70
>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
>>> [ 53.653674] ? __devm_add_action+0x3d/0x160
>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
>>> [ 53.653688] __devres_alloc_node+0x4a/0x90
>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90
>>> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>> [ 53.653693] __devm_add_action+0x3d/0x160
>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60
>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0
>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
>>> [ 53.653768] process_one_work+0x1fa/0x630
>>> [ 53.653774] worker_thread+0x1b2/0x360
>>> [ 53.653777] kthread+0x128/0x250
>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10
>>> [ 53.653784] ? __pfx_kthread+0x10/0x10
>>> [ 53.653786] ret_from_fork+0x139/0x1e0
>>> [ 53.653790] ? __pfx_kthread+0x10/0x10
>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30
>>> [ 53.653801] </TASK>
>>>
>>> [ 53.654193] =============================
>>> [ 53.654203] [ BUG: Invalid wait context ]
>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
>>> [ 53.654623] -----------------------------
>>> [ 53.654785] kworker/46:1/1875 is trying to lock:
>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>> [ 53.655115] other info that might help us debug this:
>>> [ 53.655273] context-{5:5}
>>> [ 53.655428] 3 locks held by kworker/46:1/1875:
>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>> [ 53.656062] stack backtrace:
>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>> [ 53.656227] Tainted: [W]=WARN
>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>> [ 53.656232] Call Trace:
>>> [ 53.656232] <TASK>
>>> [ 53.656234] dump_stack_lvl+0x85/0xd0
>>> [ 53.656238] dump_stack+0x14/0x20
>>> [ 53.656239] __lock_acquire+0xaf4/0x2200
>>> [ 53.656246] lock_acquire+0xd8/0x300
>>> [ 53.656248] ? kernfs_add_one+0x34/0x390
>>> [ 53.656252] ? __might_resched+0x208/0x2d0
>>> [ 53.656257] down_write+0x44/0xe0
>>> [ 53.656262] ? kernfs_add_one+0x34/0x390
>>> [ 53.656263] kernfs_add_one+0x34/0x390
>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0
>>> [ 53.656273] kobject_add+0x7d/0xf0
>>> [ 53.656275] ? get_device_parent+0x28/0x1e0
>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
>>> [ 53.656282] device_add+0x124/0x8b0
>>> [ 53.656285] ? dev_set_name+0x56/0x70
>>> [ 53.656287] platform_device_add+0x102/0x260
>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60
>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0
>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
>>> [ 53.656346] process_one_work+0x1fa/0x630
>>> [ 53.656350] worker_thread+0x1b2/0x360
>>> [ 53.656352] kthread+0x128/0x250
>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10
>>> [ 53.656356] ? __pfx_kthread+0x10/0x10
>>> [ 53.656357] ret_from_fork+0x139/0x1e0
>>> [ 53.656360] ? __pfx_kthread+0x10/0x10
>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30
>>> [ 53.656366] </TASK>
>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>> [ 53.663552] schedule+0x4a/0x160
>>> [ 53.663553] schedule_timeout+0x10a/0x120
>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
>>> [ 53.663558] __wait_for_common+0xb9/0x1c0
>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
>>> [ 53.663561] wait_for_completion+0x28/0x30
>>> [ 53.663562] __synchronize_srcu+0xbf/0x180
>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
>>> [ 53.663571] ? i2c_repstart+0x30/0x80
>>> [ 53.663576] synchronize_srcu+0x46/0x120
>>> [ 53.663577] kill_dax+0x47/0x70
>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470
>>> [ 53.663582] devm_create_dev_dax+0x26/0x50
>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>> [ 53.663585] platform_probe+0x61/0xd0
>>> [ 53.663589] really_probe+0xe2/0x390
>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
>>> [ 53.663593] __driver_probe_device+0x7e/0x160
>>> [ 53.663594] driver_probe_device+0x23/0xa0
>>> [ 53.663596] __device_attach_driver+0x92/0x120
>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0
>>> [ 53.663599] __device_attach+0xc2/0x1f0
>>> [ 53.663601] device_initial_probe+0x17/0x20
>>> [ 53.663603] bus_probe_device+0xa8/0xb0
>>> [ 53.663604] device_add+0x687/0x8b0
>>> [ 53.663607] ? dev_set_name+0x56/0x70
>>> [ 53.663609] platform_device_add+0x102/0x260
>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60
>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0
>>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
>>> [ 53.663658] process_one_work+0x1fa/0x630
>>> [ 53.663662] worker_thread+0x1b2/0x360
>>> [ 53.663664] kthread+0x128/0x250
>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10
>>> [ 53.663668] ? __pfx_kthread+0x10/0x10
>>> [ 53.663670] ret_from_fork+0x139/0x1e0
>>> [ 53.663672] ? __pfx_kthread+0x10/0x10
>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30
>>> [ 53.663677] </TASK>
>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>> [ 53.700264] INFO: lockdep is turned off.
>>> [ 53.701315] Preemption disabled at:
>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>> [ 53.701633] Tainted: [W]=WARN
>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>> [ 53.701638] Call Trace:
>>> [ 53.701638] <TASK>
>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0
>>> [ 53.701644] dump_stack+0x14/0x20
>>> [ 53.701645] __schedule_bug+0xa2/0xd0
>>> [ 53.701649] __schedule+0xe6f/0x10d0
>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
>>> [ 53.701655] ? lock_release+0x1e6/0x2b0
>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
>>> [ 53.701661] schedule+0x4a/0x160
>>> [ 53.701662] schedule_timeout+0x10a/0x120
>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
>>> [ 53.701667] __wait_for_common+0xb9/0x1c0
>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
>>> [ 53.701670] wait_for_completion+0x28/0x30
>>> [ 53.701671] __synchronize_srcu+0xbf/0x180
>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
>>> [ 53.701682] ? i2c_repstart+0x30/0x80
>>> [ 53.701685] synchronize_srcu+0x46/0x120
>>> [ 53.701687] kill_dax+0x47/0x70
>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470
>>> [ 53.701691] devm_create_dev_dax+0x26/0x50
>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>> [ 53.701695] platform_probe+0x61/0xd0
>>> [ 53.701698] really_probe+0xe2/0x390
>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
>>> [ 53.701701] __driver_probe_device+0x7e/0x160
>>> [ 53.701703] driver_probe_device+0x23/0xa0
>>> [ 53.701704] __device_attach_driver+0x92/0x120
>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0
>>> [ 53.701708] __device_attach+0xc2/0x1f0
>>> [ 53.701710] device_initial_probe+0x17/0x20
>>> [ 53.701711] bus_probe_device+0xa8/0xb0
>>> [ 53.701712] device_add+0x687/0x8b0
>>> [ 53.701715] ? dev_set_name+0x56/0x70
>>> [ 53.701717] platform_device_add+0x102/0x260
>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60
>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0
>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
>>> [ 53.701756] process_one_work+0x1fa/0x630
>>> [ 53.701760] worker_thread+0x1b2/0x360
>>> [ 53.701762] kthread+0x128/0x250
>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10
>>> [ 53.701766] ? __pfx_kthread+0x10/0x10
>>> [ 53.701768] ret_from_fork+0x139/0x1e0
>>> [ 53.701771] ? __pfx_kthread+0x10/0x10
>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30
>>> [ 53.701777] </TASK>
>>>
>>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-16 21:29 ` Koralahalli Channabasappa, Smita
@ 2025-07-16 23:48 ` Alison Schofield
2025-07-17 17:58 ` Koralahalli Channabasappa, Smita
0 siblings, 1 reply; 38+ messages in thread
From: Alison Schofield @ 2025-07-16 23:48 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
> On 7/16/2025 1:20 PM, Alison Schofield wrote:
> > On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
> > > Hi Alison,
> > >
> > > On 7/15/2025 2:07 PM, Alison Schofield wrote:
> > > > On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> > > > > This series introduces the ability to manage SOFT RESERVED iomem
> > > > > resources, enabling the CXL driver to remove any portions that
> > > > > intersect with created CXL regions.
> > > >
> > > > Hi Smita,
> > > >
> > > > This set applied cleanly to todays cxl-next but fails like appended
> > > > before region probe.
> > > >
> > > > BTW - there were sparse warnings in the build that look related:
> > > > CHECK drivers/dax/hmem/hmem_notify.c
> > > > drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> > > > drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
> > >
> > > Thanks for pointing this bug. I failed to release the spinlock before
> > > calling hmem_register_device(), which internally calls platform_device_add()
> > > and can sleep. The following fix addresses that bug. I’ll incorporate this
> > > into v6:
> > >
> > > diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
> > > index 6c276c5bd51d..8f411f3fe7bd 100644
> > > --- a/drivers/dax/hmem/hmem_notify.c
> > > +++ b/drivers/dax/hmem/hmem_notify.c
> > > @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
> > > struct resource *res)
> > > {
> > > walk_hmem_fn hmem_fn;
> > >
> > > - guard(spinlock)(&hmem_notify_lock);
> > > + spin_lock(&hmem_notify_lock);
> > > hmem_fn = hmem_fallback_fn;
> > > + spin_unlock(&hmem_notify_lock);
> > >
> > > if (hmem_fn)
> > > hmem_fn(target_nid, res);
> > > --
> >
> > Hi Smita, Adding the above got me past that, and doubling the timeout
> > below stopped that from happening. After that, I haven't had time to
> > trace so, I'll just dump on you for now:
> >
> > In /proc/iomem
> > Here, we see a regions resource, no CXL Window, and no dax, and no
> > actual region, not even disabled, is available.
> > c080000000-c47fffffff : region0
> >
> > And, here no CXL Window, no region, and a soft reserved.
> > 68e80000000-70e7fffffff : Soft Reserved
> > 68e80000000-70e7fffffff : dax1.0
> > 68e80000000-70e7fffffff : System RAM (kmem)
> >
> > I haven't yet walked through the v4 to v5 changes so I'll do that next.
>
> Hi Alison,
>
> To help better understand the current behavior, could you share more about
> your platform configuration? specifically, are there two memory cards
> involved? One at c080000000 (which appears as region0) and another at
> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
> are the Soft Reserved ranges laid out on your system for these cards? I'm
> trying to understand the "before" state of the resources i.e, prior to
> trimming applied by my patches.
Here are the soft reserveds -
[] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
[] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
And this is what we expect -
c080000000-17dbfffffff : CXL Window 0
c080000000-c47fffffff : region2
c080000000-c47fffffff : dax0.0
c080000000-c47fffffff : System RAM (kmem)
68e80000000-8d37fffffff : CXL Window 1
68e80000000-70e7fffffff : region5
68e80000000-70e7fffffff : dax1.0
68e80000000-70e7fffffff : System RAM (kmem)
And, like in the previous message, in v5 we get -
c080000000-c47fffffff : region0
68e80000000-70e7fffffff : Soft Reserved
68e80000000-70e7fffffff : dax1.0
68e80000000-70e7fffffff : System RAM (kmem)
In v4, we 'almost' had what we expect, except that the HMEM driver
created those dax devices out of Soft Reserveds before the region driver
could do the same.
>
> Also, do you think it's feasible to change the direction of the soft reserve
> trimming, that is, defer it until after CXL region or memdev creation is
> complete? In this case it would be trimmed after but inline the existing
> region or memdev creation. This might simplify the flow by removing the need
> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
> inside cxl_acpi_probe().
Yes, that aligns with my simple thinking. There's the trimming after a region
is successfully created, and it seems that could simply be called at the end
of *that* region's creation.
Then there's the round-up of all the unused Soft Reserveds, and that has
to wait until after all regions are created, i.e. all endpoints have arrived
and we've given up all hope of creating another region in that space.
That's the timing challenge.
-- Alison
>
> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
> and observed that it consistently avoided probe ordering issues in my setup.
>
> Additional note: I realized that even when cxl_acpi_probe() fails, the
> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
> v6 by immediately triggering fallback DAX registration
> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>
> Thanks
> Smita
>
> >
> > >
> > > As for the log:
> > > [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
> > > cxl_mem probing
> > >
> > > I’m still analyzing that. Here's what was my thought process so far.
> > >
> > > - This occurs when cxl_acpi_probe() runs significantly earlier than
> > > cxl_mem_probe(), so CXL region creation (which happens in
> > > cxl_port_endpoint_probe()) may or may not have completed by the time
> > > trimming is attempted.
> > >
> > > - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
> > > guarantee load order when all components are built as modules. So even if
> > > the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
> > > MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
> > > cxl_mem in modular configurations. As a result, region creation is
> > > eventually guaranteed, and wait_for_device_probe() will succeed once the
> > > relevant probes complete.
> > >
> > > - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
> > > guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
> > > before cxl_port_probe() even begins, which can cause wait_for_device_probe()
> > > to return prematurely and trigger the timeout.
> > >
> > > - In my local setup, I observed that a 30-second timeout was generally
> > > sufficient to catch this race, allowing cxl_port_probe() to load while
> > > cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
> > > components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
> > > best-effort mechanism. After the timeout, wait_for_device_probe() ensures
> > > cxl_port_probe() has completed before trimming proceeds, making the logic
> > > good enough for most boot-time races.
> > >
> > > One possible improvement I’m considering is to schedule a
> > > delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
> > > slightly longer for cxl_mem_probe() to complete (which itself softdeps on
> > > cxl_port) before initiating the soft reserve trimming.
> > >
> > > That said, I'm still evaluating better options to more robustly coordinate
> > > probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
> > > looking for suggestions here.
> > >
> > > Thanks
> > > Smita
> > >
> > > >
> > > >
> > > > This isn't all the logs, I trimmed. Let me know if you need more or
> > > > other info to reproduce.
> > > >
> > > > [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
> > > > [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
> > > > [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
> > > > [ 53.653540] preempt_count: 1, expected: 0
> > > > [ 53.653554] RCU nest depth: 0, expected: 0
> > > > [ 53.653568] 3 locks held by kworker/46:1/1875:
> > > > [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> > > > [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> > > > [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> > > > [ 53.653598] Preemption disabled at:
> > > > [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> > > > [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > > > [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > > > [ 53.653648] Call Trace:
> > > > [ 53.653649] <TASK>
> > > > [ 53.653652] dump_stack_lvl+0xa8/0xd0
> > > > [ 53.653658] dump_stack+0x14/0x20
> > > > [ 53.653659] __might_resched+0x1ae/0x2d0
> > > > [ 53.653666] __might_sleep+0x48/0x70
> > > > [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
> > > > [ 53.653674] ? __devm_add_action+0x3d/0x160
> > > > [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
> > > > [ 53.653688] __devres_alloc_node+0x4a/0x90
> > > > [ 53.653689] ? __devres_alloc_node+0x4a/0x90
> > > > [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
> > > > [ 53.653693] __devm_add_action+0x3d/0x160
> > > > [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
> > > > [ 53.653700] hmem_fallback_register_device+0x37/0x60
> > > > [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [ 53.653739] walk_iomem_res_desc+0x55/0xb0
> > > > [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [ 53.653768] process_one_work+0x1fa/0x630
> > > > [ 53.653774] worker_thread+0x1b2/0x360
> > > > [ 53.653777] kthread+0x128/0x250
> > > > [ 53.653781] ? __pfx_worker_thread+0x10/0x10
> > > > [ 53.653784] ? __pfx_kthread+0x10/0x10
> > > > [ 53.653786] ret_from_fork+0x139/0x1e0
> > > > [ 53.653790] ? __pfx_kthread+0x10/0x10
> > > > [ 53.653792] ret_from_fork_asm+0x1a/0x30
> > > > [ 53.653801] </TASK>
> > > >
> > > > [ 53.654193] =============================
> > > > [ 53.654203] [ BUG: Invalid wait context ]
> > > > [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
> > > > [ 53.654623] -----------------------------
> > > > [ 53.654785] kworker/46:1/1875 is trying to lock:
> > > > [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
> > > > [ 53.655115] other info that might help us debug this:
> > > > [ 53.655273] context-{5:5}
> > > > [ 53.655428] 3 locks held by kworker/46:1/1875:
> > > > [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> > > > [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> > > > [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> > > > [ 53.656062] stack backtrace:
> > > > [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > > > [ 53.656227] Tainted: [W]=WARN
> > > > [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > > > [ 53.656232] Call Trace:
> > > > [ 53.656232] <TASK>
> > > > [ 53.656234] dump_stack_lvl+0x85/0xd0
> > > > [ 53.656238] dump_stack+0x14/0x20
> > > > [ 53.656239] __lock_acquire+0xaf4/0x2200
> > > > [ 53.656246] lock_acquire+0xd8/0x300
> > > > [ 53.656248] ? kernfs_add_one+0x34/0x390
> > > > [ 53.656252] ? __might_resched+0x208/0x2d0
> > > > [ 53.656257] down_write+0x44/0xe0
> > > > [ 53.656262] ? kernfs_add_one+0x34/0x390
> > > > [ 53.656263] kernfs_add_one+0x34/0x390
> > > > [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
> > > > [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
> > > > [ 53.656270] kobject_add_internal+0xb1/0x2f0
> > > > [ 53.656273] kobject_add+0x7d/0xf0
> > > > [ 53.656275] ? get_device_parent+0x28/0x1e0
> > > > [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
> > > > [ 53.656282] device_add+0x124/0x8b0
> > > > [ 53.656285] ? dev_set_name+0x56/0x70
> > > > [ 53.656287] platform_device_add+0x102/0x260
> > > > [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
> > > > [ 53.656291] hmem_fallback_register_device+0x37/0x60
> > > > [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [ 53.656323] walk_iomem_res_desc+0x55/0xb0
> > > > [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [ 53.656346] process_one_work+0x1fa/0x630
> > > > [ 53.656350] worker_thread+0x1b2/0x360
> > > > [ 53.656352] kthread+0x128/0x250
> > > > [ 53.656354] ? __pfx_worker_thread+0x10/0x10
> > > > [ 53.656356] ? __pfx_kthread+0x10/0x10
> > > > [ 53.656357] ret_from_fork+0x139/0x1e0
> > > > [ 53.656360] ? __pfx_kthread+0x10/0x10
> > > > [ 53.656361] ret_from_fork_asm+0x1a/0x30
> > > > [ 53.656366] </TASK>
> > > > [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> > > > [ 53.663552] schedule+0x4a/0x160
> > > > [ 53.663553] schedule_timeout+0x10a/0x120
> > > > [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
> > > > [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
> > > > [ 53.663558] __wait_for_common+0xb9/0x1c0
> > > > [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
> > > > [ 53.663561] wait_for_completion+0x28/0x30
> > > > [ 53.663562] __synchronize_srcu+0xbf/0x180
> > > > [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
> > > > [ 53.663571] ? i2c_repstart+0x30/0x80
> > > > [ 53.663576] synchronize_srcu+0x46/0x120
> > > > [ 53.663577] kill_dax+0x47/0x70
> > > > [ 53.663580] __devm_create_dev_dax+0x112/0x470
> > > > [ 53.663582] devm_create_dev_dax+0x26/0x50
> > > > [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
> > > > [ 53.663585] platform_probe+0x61/0xd0
> > > > [ 53.663589] really_probe+0xe2/0x390
> > > > [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
> > > > [ 53.663593] __driver_probe_device+0x7e/0x160
> > > > [ 53.663594] driver_probe_device+0x23/0xa0
> > > > [ 53.663596] __device_attach_driver+0x92/0x120
> > > > [ 53.663597] bus_for_each_drv+0x8c/0xf0
> > > > [ 53.663599] __device_attach+0xc2/0x1f0
> > > > [ 53.663601] device_initial_probe+0x17/0x20
> > > > [ 53.663603] bus_probe_device+0xa8/0xb0
> > > > [ 53.663604] device_add+0x687/0x8b0
> > > > [ 53.663607] ? dev_set_name+0x56/0x70
> > > > [ 53.663609] platform_device_add+0x102/0x260
> > > > [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
> > > > [ 53.663612] hmem_fallback_register_device+0x37/0x60
> > > > [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [ 53.663637] walk_iomem_res_desc+0x55/0xb0
> > > > [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [ 53.663658] process_one_work+0x1fa/0x630
> > > > [ 53.663662] worker_thread+0x1b2/0x360
> > > > [ 53.663664] kthread+0x128/0x250
> > > > [ 53.663666] ? __pfx_worker_thread+0x10/0x10
> > > > [ 53.663668] ? __pfx_kthread+0x10/0x10
> > > > [ 53.663670] ret_from_fork+0x139/0x1e0
> > > > [ 53.663672] ? __pfx_kthread+0x10/0x10
> > > > [ 53.663673] ret_from_fork_asm+0x1a/0x30
> > > > [ 53.663677] </TASK>
> > > > [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> > > > [ 53.700264] INFO: lockdep is turned off.
> > > > [ 53.701315] Preemption disabled at:
> > > > [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> > > > [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > > > [ 53.701633] Tainted: [W]=WARN
> > > > [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > > > [ 53.701638] Call Trace:
> > > > [ 53.701638] <TASK>
> > > > [ 53.701640] dump_stack_lvl+0xa8/0xd0
> > > > [ 53.701644] dump_stack+0x14/0x20
> > > > [ 53.701645] __schedule_bug+0xa2/0xd0
> > > > [ 53.701649] __schedule+0xe6f/0x10d0
> > > > [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
> > > > [ 53.701655] ? lock_release+0x1e6/0x2b0
> > > > [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
> > > > [ 53.701661] schedule+0x4a/0x160
> > > > [ 53.701662] schedule_timeout+0x10a/0x120
> > > > [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
> > > > [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
> > > > [ 53.701667] __wait_for_common+0xb9/0x1c0
> > > > [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
> > > > [ 53.701670] wait_for_completion+0x28/0x30
> > > > [ 53.701671] __synchronize_srcu+0xbf/0x180
> > > > [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
> > > > [ 53.701682] ? i2c_repstart+0x30/0x80
> > > > [ 53.701685] synchronize_srcu+0x46/0x120
> > > > [ 53.701687] kill_dax+0x47/0x70
> > > > [ 53.701689] __devm_create_dev_dax+0x112/0x470
> > > > [ 53.701691] devm_create_dev_dax+0x26/0x50
> > > > [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
> > > > [ 53.701695] platform_probe+0x61/0xd0
> > > > [ 53.701698] really_probe+0xe2/0x390
> > > > [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
> > > > [ 53.701701] __driver_probe_device+0x7e/0x160
> > > > [ 53.701703] driver_probe_device+0x23/0xa0
> > > > [ 53.701704] __device_attach_driver+0x92/0x120
> > > > [ 53.701706] bus_for_each_drv+0x8c/0xf0
> > > > [ 53.701708] __device_attach+0xc2/0x1f0
> > > > [ 53.701710] device_initial_probe+0x17/0x20
> > > > [ 53.701711] bus_probe_device+0xa8/0xb0
> > > > [ 53.701712] device_add+0x687/0x8b0
> > > > [ 53.701715] ? dev_set_name+0x56/0x70
> > > > [ 53.701717] platform_device_add+0x102/0x260
> > > > [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
> > > > [ 53.701720] hmem_fallback_register_device+0x37/0x60
> > > > [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [ 53.701734] walk_iomem_res_desc+0x55/0xb0
> > > > [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [ 53.701756] process_one_work+0x1fa/0x630
> > > > [ 53.701760] worker_thread+0x1b2/0x360
> > > > [ 53.701762] kthread+0x128/0x250
> > > > [ 53.701765] ? __pfx_worker_thread+0x10/0x10
> > > > [ 53.701766] ? __pfx_kthread+0x10/0x10
> > > > [ 53.701768] ret_from_fork+0x139/0x1e0
> > > > [ 53.701771] ? __pfx_kthread+0x10/0x10
> > > > [ 53.701772] ret_from_fork_asm+0x1a/0x30
> > > > [ 53.701777] </TASK>
> > > >
> > >
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-15 18:04 ` [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion Smita Koralahalli
@ 2025-07-17 0:24 ` Dave Jiang
2025-07-23 7:31 ` dan.j.williams
1 sibling, 0 replies; 38+ messages in thread
From: Dave Jiang @ 2025-07-17 0:24 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Alison Schofield, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/15/25 11:04 AM, Smita Koralahalli wrote:
> Introduce a background worker in cxl_acpi to delay SOFT RESERVE handling
> until the cxl_mem driver has probed at least one device. This coordination
> ensures that DAX registration or fallback handling for soft-reserved
> regions is not triggered prematurely.
>
> The worker waits on cxl_wait_queue, which is signaled via
> cxl_mem_active_inc() during cxl_mem_probe(). Once at least one memory
> device probe is confirmed, the worker invokes wait_for_device_probe()
> to allow the rest of the CXL device hierarchy to complete initialization.
>
> Additionally, it also handles initialization order issues where
> cxl_acpi_probe() may complete before other drivers such as cxl_port or
> cxl_mem have loaded, especially when cxl_acpi and cxl_port are built-in
> and cxl_mem is a loadable module. In such cases, using only
> wait_for_device_probe() is insufficient, as it may return before all
> relevant probes are registered.
>
> While region creation happens in cxl_port_probe(), waiting on
> cxl_mem_active() would be sufficient as cxl_mem_probe() can only succeed
> after the port hierarchy is in place. Furthermore, since cxl_mem depends
> on cxl_pci, this also guarantees that cxl_pci has loaded by the time the
> wait completes.
>
> As cxl_mem_active() infrastructure already exists for tracking probe
> activity, cxl_acpi can use it without introducing new coordination
> mechanisms.
>
> Co-developed-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
> Signed-off-by: Nathan Fontenot <Nathan.Fontenot@amd.com>
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> drivers/cxl/acpi.c | 18 ++++++++++++++++++
> drivers/cxl/core/probe_state.c | 5 +++++
> drivers/cxl/cxl.h | 2 ++
> 3 files changed, 25 insertions(+)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index ca06d5acdf8f..3a27289e669b 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -823,6 +823,20 @@ static int pair_cxl_resource(struct device *dev, void *data)
> return 0;
> }
>
> +static void cxl_softreserv_mem_work_fn(struct work_struct *work)
> +{
> + if (!wait_event_timeout(cxl_wait_queue, cxl_mem_active(), 30 * HZ))
> + pr_debug("Timeout waiting for cxl_mem probing");
> +
> + wait_for_device_probe();
> +}
> +static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
> +
> +static void cxl_softreserv_mem_update(void)
> +{
> + schedule_work(&cxl_sr_work);
> +}
> +
> static int cxl_acpi_probe(struct platform_device *pdev)
> {
> int rc = 0;
> @@ -903,6 +917,9 @@ static int cxl_acpi_probe(struct platform_device *pdev)
> cxl_bus_rescan();
>
> out:
> + /* Update SOFT RESERVE resources that intersect with CXL regions */
> + cxl_softreserv_mem_update();
Can you please squash 1/7 with this patch since both are fairly small? Otherwise it leaves the reviewer wondering what the changes in 1/7 would result in.
DJ
> +
> return rc;
> }
>
> @@ -934,6 +951,7 @@ static int __init cxl_acpi_init(void)
>
> static void __exit cxl_acpi_exit(void)
> {
> + cancel_work_sync(&cxl_sr_work);
> platform_driver_unregister(&cxl_acpi_driver);
> cxl_bus_drain();
> }
> diff --git a/drivers/cxl/core/probe_state.c b/drivers/cxl/core/probe_state.c
> index 5ba4b4de0e33..3089b2698b32 100644
> --- a/drivers/cxl/core/probe_state.c
> +++ b/drivers/cxl/core/probe_state.c
> @@ -2,9 +2,12 @@
> /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> #include <linux/atomic.h>
> #include <linux/export.h>
> +#include <linux/wait.h>
> #include "cxlmem.h"
>
> static atomic_t mem_active;
> +DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);
> +EXPORT_SYMBOL_NS_GPL(cxl_wait_queue, "CXL");
>
> bool cxl_mem_active(void)
> {
> @@ -13,10 +16,12 @@ bool cxl_mem_active(void)
>
> return false;
> }
> +EXPORT_SYMBOL_NS_GPL(cxl_mem_active, "CXL");
>
> void cxl_mem_active_inc(void)
> {
> atomic_inc(&mem_active);
> + wake_up(&cxl_wait_queue);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL");
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 3f1695c96abc..3117136f0208 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -903,6 +903,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
>
> bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
>
> +extern wait_queue_head_t cxl_wait_queue;
> +
> /*
> * Unit test builds overrides this to __weak, find the 'strong' version
> * of these symbols in tools/testing/cxl/.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 4/7] cxl/region: Introduce SOFT RESERVED resource removal on region teardown
2025-07-15 18:04 ` [PATCH v5 4/7] cxl/region: Introduce SOFT RESERVED resource removal on region teardown Smita Koralahalli
@ 2025-07-17 0:42 ` Dave Jiang
0 siblings, 0 replies; 38+ messages in thread
From: Dave Jiang @ 2025-07-17 0:42 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Alison Schofield, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/15/25 11:04 AM, Smita Koralahalli wrote:
> Reworked from a patch by Alison Schofield <alison.schofield@intel.com>
>
> Previously, when CXL regions were created through autodiscovery and their
> resources overlapped with SOFT RESERVED ranges, the soft reserved resource
> remained in place after region teardown. This left the HPA range
> unavailable for reuse even after the region was destroyed.
>
> Enhance the logic to reliably remove SOFT RESERVED resources associated
> with a region, regardless of alignment or hierarchy in the iomem tree.
>
> Link: https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
> Co-developed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> drivers/cxl/acpi.c | 2 +
> drivers/cxl/core/region.c | 124 ++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxl.h | 2 +
> include/linux/ioport.h | 1 +
> kernel/resource.c | 34 +++++++++++
> 5 files changed, 163 insertions(+)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 3a27289e669b..9eb8a9587dee 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -829,6 +829,8 @@ static void cxl_softreserv_mem_work_fn(struct work_struct *work)
> pr_debug("Timeout waiting for cxl_mem probing");
>
> wait_for_device_probe();
> +
> + cxl_region_softreserv_update();
> }
> static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 6e5e1460068d..95951a1f1cab 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3486,6 +3486,130 @@ int cxl_add_to_region(struct cxl_endpoint_decoder *cxled)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, "CXL");
>
> +static int add_soft_reserved(resource_size_t start, resource_size_t len,
> + unsigned long flags)
> +{
> + struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
> + int rc;
> +
> + if (!res)
> + return -ENOMEM;
> +
> + *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
> + flags | IORESOURCE_MEM,
> + IORES_DESC_SOFT_RESERVED);
> +
> + rc = insert_resource(&iomem_resource, res);
> + if (rc) {
> + kfree(res);
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +static void remove_soft_reserved(struct cxl_region *cxlr, struct resource *soft,
> + resource_size_t start, resource_size_t end)
> +{
> + struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> + resource_size_t new_start, new_end;
> + int rc;
> +
> + guard(mutex)(&cxlrd->range_lock);
> +
> + if (soft->start == start && soft->end == end) {
> + /*
> + * Exact alignment at both start and end. The entire region is
> + * removed below.
> + */
> +
> + } else if (soft->start == start || soft->end == end) {
> + /* Aligns at either resource start or end */
> + if (soft->start == start) {
> + new_start = end + 1;
> + new_end = soft->end;
> + } else {
> + new_start = soft->start;
> + new_end = start - 1;
> + }
> +
> + /*
> + * Reuse original flags as the trimmed portion retains the same
> + * memory type and access characteristics.
> + */
> + rc = add_soft_reserved(new_start, new_end - new_start + 1,
> + soft->flags);
> + if (rc)
> + dev_warn(&cxlr->dev,
> + "cannot add new soft reserved resource at %pa\n",
> + &new_start);
> +
> + } else {
> + /* No alignment - Split into two new soft reserved regions */
> + new_start = soft->start;
> + new_end = soft->end;
> +
> + rc = add_soft_reserved(new_start, start - new_start,
> + soft->flags);
> + if (rc)
> + dev_warn(&cxlr->dev,
> + "cannot add new soft reserved resource at %pa\n",
> + &new_start);
> +
> + rc = add_soft_reserved(end + 1, new_end - end, soft->flags);
> + if (rc)
> + dev_warn(&cxlr->dev,
> + "cannot add new soft reserved resource at %pa + 1\n",
> + &end);
> + }
> +
> + rc = remove_resource(soft);
> + if (rc)
> + dev_warn(&cxlr->dev, "cannot remove soft reserved resource %pr\n",
> + soft);
> +}
> +
> +static int __cxl_region_softreserv_update(struct resource *soft,
> + void *_cxlr)
> +{
> + struct cxl_region *cxlr = _cxlr;
> + struct resource *res = cxlr->params.res;
> +
> + /* Skip non-intersecting soft-reserved regions */
> + if (soft->end < res->start || soft->start > res->end)
> + return 0;
> +
> + soft = normalize_resource(soft);
> + if (!soft)
> + return -EINVAL;
> +
> + remove_soft_reserved(cxlr, soft, res->start, res->end);
> +
> + return 0;
> +}
> +
> +static int cxl_region_softreserv_update_cb(struct device *dev, void *data)
> +{
> + struct cxl_region *cxlr;
> +
> + if (!is_cxl_region(dev))
> + return 0;
> +
> + cxlr = to_cxl_region(dev);
> +
> + walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0, -1,
> + cxlr, __cxl_region_softreserv_update);
No checking return value of walk_iomem_res_desc()?
> +
> + return 0;
> +}
> +
> +void cxl_region_softreserv_update(void)
> +{
> + bus_for_each_dev(&cxl_bus_type, NULL, NULL,
> + cxl_region_softreserv_update_cb);
No checking return value of bus_for_each_dev()? Is it ok to ignore all errors?
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_region_softreserv_update, "CXL");
> +
> u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa)
> {
> struct cxl_region_ref *iter;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 3117136f0208..9f173467e497 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -862,6 +862,7 @@ struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
> int cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
> struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
> u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
> +void cxl_region_softreserv_update(void);
> #else
> static inline bool is_cxl_pmem_region(struct device *dev)
> {
> @@ -884,6 +885,7 @@ static inline u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint,
> {
> return 0;
> }
> +static inline void cxl_region_softreserv_update(void) { }
> #endif
>
> void cxl_endpoint_parse_cdat(struct cxl_port *port);
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index e8b2d6aa4013..8693e095d32b 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -233,6 +233,7 @@ struct resource_constraint {
> extern struct resource ioport_resource;
> extern struct resource iomem_resource;
>
> +extern struct resource *normalize_resource(struct resource *res);
> extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
> extern int request_resource(struct resource *root, struct resource *new);
> extern int release_resource(struct resource *new);
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 8d3e6ed0bdc1..3d8dc2a59cb2 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -50,6 +50,40 @@ EXPORT_SYMBOL(iomem_resource);
>
> static DEFINE_RWLOCK(resource_lock);
>
> +/*
> + * normalize_resource
> + *
> + * The walk_iomem_res_desc() returns a copy of a resource, not a reference
> + * to the actual resource in the iomem_resource tree. As a result,
> + * __release_resource() which relies on pointer equality will fail.
> + *
> + * This helper walks the children of the resource's parent to find and
> + * return the original resource pointer that matches the given resource's
> + * start and end addresses.
> + *
> + * Return: Pointer to the matching original resource in iomem_resource, or
> + * NULL if not found or invalid input.
> + */
> +struct resource *normalize_resource(struct resource *res)
> +{
> + if (!res || !res->parent)
> + return NULL;
> +
> + read_lock(&resource_lock);
May as well go with below for consistency:
guard(read_lock)(&resource_lock);
DJ
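For example, the whole helper could then reduce to something like this (untested), with both explicit read_unlock() exit paths going away:

struct resource *normalize_resource(struct resource *res)
{
	if (!res || !res->parent)
		return NULL;

	/* dropped automatically on every return path */
	guard(read_lock)(&resource_lock);

	for (struct resource *res_iter = res->parent->child; res_iter;
	     res_iter = res_iter->sibling) {
		if (res_iter->start == res->start && res_iter->end == res->end)
			return res_iter;
	}

	return NULL;
}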
> + for (struct resource *res_iter = res->parent->child; res_iter != NULL;
> + res_iter = res_iter->sibling) {
> + if ((res_iter->start == res->start) &&
> + (res_iter->end == res->end)) {
> + read_unlock(&resource_lock);
> + return res_iter;
> + }
> + }
> +
> + read_unlock(&resource_lock);
> + return NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(normalize_resource, "CXL");
> +
> /*
> * Return the next node of @p in pre-order tree traversal. If
> * @skip_children is true, skip the descendant nodes of @p in
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-16 23:48 ` Alison Schofield
@ 2025-07-17 17:58 ` Koralahalli Channabasappa, Smita
2025-07-17 19:06 ` Dave Jiang
0 siblings, 1 reply; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-17 17:58 UTC (permalink / raw)
To: Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/16/2025 4:48 PM, Alison Schofield wrote:
> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>> Hi Alison,
>>>>
>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>> intersect with created CXL regions.
>>>>>
>>>>> Hi Smita,
>>>>>
>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>> before region probe.
>>>>>
>>>>> BTW - there were sparse warnings in the build that look related:
>>>>> CHECK drivers/dax/hmem/hmem_notify.c
>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>
>>>> Thanks for pointing this bug. I failed to release the spinlock before
>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>> into v6:
>>>>
>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>> struct resource *res)
>>>> {
>>>> walk_hmem_fn hmem_fn;
>>>>
>>>> - guard(spinlock)(&hmem_notify_lock);
>>>> + spin_lock(&hmem_notify_lock);
>>>> hmem_fn = hmem_fallback_fn;
>>>> + spin_unlock(&hmem_notify_lock);
>>>>
>>>> if (hmem_fn)
>>>> hmem_fn(target_nid, res);
>>>> --
>>>
>>> Hi Smita, Adding the above got me past that, and doubling the timeout
>>> below stopped that from happening. After that, I haven't had time to
>>> trace so, I'll just dump on you for now:
>>>
>>> In /proc/iomem
>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>> actual region, not even disabled, is available.
>>> c080000000-c47fffffff : region0
>>>
>>> And, here no CXL Window, no region, and a soft reserved.
>>> 68e80000000-70e7fffffff : Soft Reserved
>>> 68e80000000-70e7fffffff : dax1.0
>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>
>> Hi Alison,
>>
>> To help better understand the current behavior, could you share more about
>> your platform configuration? specifically, are there two memory cards
>> involved? One at c080000000 (which appears as region0) and another at
>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>> trying to understand the "before" state of the resources i.e, prior to
>> trimming applied by my patches.
>
> Here are the soft reserveds -
> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>
> And this is what we expect -
>
> c080000000-17dbfffffff : CXL Window 0
> c080000000-c47fffffff : region2
> c080000000-c47fffffff : dax0.0
> c080000000-c47fffffff : System RAM (kmem)
>
>
> 68e80000000-8d37fffffff : CXL Window 1
> 68e80000000-70e7fffffff : region5
> 68e80000000-70e7fffffff : dax1.0
> 68e80000000-70e7fffffff : System RAM (kmem)
>
> And, like in the prev message, in v5 we get -
>
> c080000000-c47fffffff : region0
>
> 68e80000000-70e7fffffff : Soft Reserved
> 68e80000000-70e7fffffff : dax1.0
> 68e80000000-70e7fffffff : System RAM (kmem)
>
>
> In v4, we 'almost' had what we expect, except that the HMEM driver
> created those dax devices out of Soft Reserveds before the region driver
> could do the same.
>
Yeah, the only part I’m uncertain about in v5 is scheduling the fallback
work from the failure path of cxl_acpi_probe(). That doesn’t feel like
the right place to do it, and I suspect it might be contributing to the
unexpected behavior.
v4 had most of the necessary pieces in place, but it didn’t handle
situations well when the driver load order didn’t go as expected.
Even if we modify v4 to avoid triggering hmem_register_device() directly
from cxl_acpi_probe() (which helps avoid unresolved symbol errors when
cxl_acpi_probe() loads too early), and instead rely only on dax_hmem to
pick up Soft Reserved regions after cxl_acpi creates regions, we still
run into timing issues.
Specifically, there's no guarantee that hmem_register_device() will
correctly skip the following check if the region state isn't fully
ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using
late_initcall() (which I tried):
if (IS_ENABLED(CONFIG_CXL_REGION) &&
region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {..
At this point, I’m running out of ideas on how to reliably coordinate
this.. :(
Thanks
Smita
>>
>> Also, do you think it's feasible to change the direction of the soft reserve
>> trimming, that is, defer it until after CXL region or memdev creation is
>> complete? In this case it would be trimmed after but inline the existing
>> region or memdev creation. This might simplify the flow by removing the need
>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>> inside cxl_acpi_probe().
>
> Yes that aligns with my simple thinking. There's the trimming after a region
> is successfully created, and it seems that could simply be called at the end
> of *that* region creation.
>
> Then, there's the round up of all the unused Soft Reserveds, and that has
> to wait until after all regions are created, ie. all endpoints have arrived
> and we've given up all hope of creating another region in that space.
> That's the timing challenge.
>
> -- Alison
>
>>
>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>> and observed that it consistently avoided probe ordering issues in my setup.
>>
>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>> v6 by immediately triggering fallback DAX registration
>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>
>> Thanks
>> Smita
>>
>>>
>>>>
>>>> As for the log:
>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>> cxl_mem probing
>>>>
>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>
>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>> trimming is attempted.
>>>>
>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>> guarantee load order when all components are built as modules. So even if
>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>> cxl_mem in modular configurations. As a result, region creation is
>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>> relevant probes complete.
>>>>
>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>> to return prematurely and trigger the timeout.
>>>>
>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>> good enough for most boot-time races.
>>>>
>>>> One possible improvement I’m considering is to schedule a
>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>> cxl_port) before initiating the soft reserve trimming.
>>>>
>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>> looking for suggestions here.
>>>>
>>>> Thanks
>>>> Smita
>>>>
>>>>>
>>>>>
>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>> other info to reproduce.
>>>>>
>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>> [ 53.653540] preempt_count: 1, expected: 0
>>>>> [ 53.653554] RCU nest depth: 0, expected: 0
>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875:
>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>> [ 53.653598] Preemption disabled at:
>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>> [ 53.653648] Call Trace:
>>>>> [ 53.653649] <TASK>
>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0
>>>>> [ 53.653658] dump_stack+0x14/0x20
>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0
>>>>> [ 53.653666] __might_sleep+0x48/0x70
>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160
>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90
>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90
>>>>> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>> [ 53.653693] __devm_add_action+0x3d/0x160
>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60
>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0
>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [ 53.653768] process_one_work+0x1fa/0x630
>>>>> [ 53.653774] worker_thread+0x1b2/0x360
>>>>> [ 53.653777] kthread+0x128/0x250
>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10
>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0
>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30
>>>>> [ 53.653801] </TASK>
>>>>>
>>>>> [ 53.654193] =============================
>>>>> [ 53.654203] [ BUG: Invalid wait context ]
>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
>>>>> [ 53.654623] -----------------------------
>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock:
>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>> [ 53.655115] other info that might help us debug this:
>>>>> [ 53.655273] context-{5:5}
>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875:
>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>> [ 53.656062] stack backtrace:
>>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>> [ 53.656227] Tainted: [W]=WARN
>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>> [ 53.656232] Call Trace:
>>>>> [ 53.656232] <TASK>
>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0
>>>>> [ 53.656238] dump_stack+0x14/0x20
>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200
>>>>> [ 53.656246] lock_acquire+0xd8/0x300
>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390
>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0
>>>>> [ 53.656257] down_write+0x44/0xe0
>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390
>>>>> [ 53.656263] kernfs_add_one+0x34/0x390
>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0
>>>>> [ 53.656273] kobject_add+0x7d/0xf0
>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0
>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
>>>>> [ 53.656282] device_add+0x124/0x8b0
>>>>> [ 53.656285] ? dev_set_name+0x56/0x70
>>>>> [ 53.656287] platform_device_add+0x102/0x260
>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60
>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0
>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [ 53.656346] process_one_work+0x1fa/0x630
>>>>> [ 53.656350] worker_thread+0x1b2/0x360
>>>>> [ 53.656352] kthread+0x128/0x250
>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10
>>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0
>>>>> [ 53.656360] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30
>>>>> [ 53.656366] </TASK>
>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>> [ 53.663552] schedule+0x4a/0x160
>>>>> [ 53.663553] schedule_timeout+0x10a/0x120
>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0
>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
>>>>> [ 53.663561] wait_for_completion+0x28/0x30
>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180
>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80
>>>>> [ 53.663576] synchronize_srcu+0x46/0x120
>>>>> [ 53.663577] kill_dax+0x47/0x70
>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470
>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50
>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>> [ 53.663585] platform_probe+0x61/0xd0
>>>>> [ 53.663589] really_probe+0xe2/0x390
>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160
>>>>> [ 53.663594] driver_probe_device+0x23/0xa0
>>>>> [ 53.663596] __device_attach_driver+0x92/0x120
>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0
>>>>> [ 53.663599] __device_attach+0xc2/0x1f0
>>>>> [ 53.663601] device_initial_probe+0x17/0x20
>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0
>>>>> [ 53.663604] device_add+0x687/0x8b0
>>>>> [ 53.663607] ? dev_set_name+0x56/0x70
>>>>> [ 53.663609] platform_device_add+0x102/0x260
>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60
>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0
>>>>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [ 53.663658] process_one_work+0x1fa/0x630
>>>>> [ 53.663662] worker_thread+0x1b2/0x360
>>>>> [ 53.663664] kthread+0x128/0x250
>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10
>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0
>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30
>>>>> [ 53.663677] </TASK>
>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>> [ 53.700264] INFO: lockdep is turned off.
>>>>> [ 53.701315] Preemption disabled at:
>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>> [ 53.701633] Tainted: [W]=WARN
>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>> [ 53.701638] Call Trace:
>>>>> [ 53.701638] <TASK>
>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0
>>>>> [ 53.701644] dump_stack+0x14/0x20
>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0
>>>>> [ 53.701649] __schedule+0xe6f/0x10d0
>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0
>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
>>>>> [ 53.701661] schedule+0x4a/0x160
>>>>> [ 53.701662] schedule_timeout+0x10a/0x120
>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0
>>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
>>>>> [ 53.701670] wait_for_completion+0x28/0x30
>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180
>>>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80
>>>>> [ 53.701685] synchronize_srcu+0x46/0x120
>>>>> [ 53.701687] kill_dax+0x47/0x70
>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470
>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50
>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>> [ 53.701695] platform_probe+0x61/0xd0
>>>>> [ 53.701698] really_probe+0xe2/0x390
>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160
>>>>> [ 53.701703] driver_probe_device+0x23/0xa0
>>>>> [ 53.701704] __device_attach_driver+0x92/0x120
>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0
>>>>> [ 53.701708] __device_attach+0xc2/0x1f0
>>>>> [ 53.701710] device_initial_probe+0x17/0x20
>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0
>>>>> [ 53.701712] device_add+0x687/0x8b0
>>>>> [ 53.701715] ? dev_set_name+0x56/0x70
>>>>> [ 53.701717] platform_device_add+0x102/0x260
>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60
>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0
>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [ 53.701756] process_one_work+0x1fa/0x630
>>>>> [ 53.701760] worker_thread+0x1b2/0x360
>>>>> [ 53.701762] kthread+0x128/0x250
>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10
>>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0
>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10
>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30
>>>>> [ 53.701777] </TASK>
>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-17 17:58 ` Koralahalli Channabasappa, Smita
@ 2025-07-17 19:06 ` Dave Jiang
2025-07-17 23:20 ` Koralahalli Channabasappa, Smita
0 siblings, 1 reply; 38+ messages in thread
From: Dave Jiang @ 2025-07-17 19:06 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Vishal Verma, Ira Weiny,
Dan Williams, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Yao Xingtao, Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>
>
> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>> Hi Alison,
>>>>>
>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>> intersect with created CXL regions.
>>>>>>
>>>>>> Hi Smita,
>>>>>>
>>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>>> before region probe.
>>>>>>
>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>> CHECK drivers/dax/hmem/hmem_notify.c
>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>
>>>>> Thanks for pointing this bug. I failed to release the spinlock before
>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>> into v6:
>>>>>
>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>> struct resource *res)
>>>>> {
>>>>> walk_hmem_fn hmem_fn;
>>>>>
>>>>> - guard(spinlock)(&hmem_notify_lock);
>>>>> + spin_lock(&hmem_notify_lock);
>>>>> hmem_fn = hmem_fallback_fn;
>>>>> + spin_unlock(&hmem_notify_lock);
>>>>>
>>>>> if (hmem_fn)
>>>>> hmem_fn(target_nid, res);
>>>>> --
>>>>
>>>> Hi Smita, Adding the above got me past that, and doubling the timeout
>>>> below stopped that from happening. After that, I haven't had time to
>>>> trace so, I'll just dump on you for now:
>>>>
>>>> In /proc/iomem
>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>> actual region, not even disabled, is available.
>>>> c080000000-c47fffffff : region0
>>>>
>>>> And, here no CXL Window, no region, and a soft reserved.
>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>> 68e80000000-70e7fffffff : dax1.0
>>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>
>>> Hi Alison,
>>>
>>> To help better understand the current behavior, could you share more about
>>> your platform configuration? specifically, are there two memory cards
>>> involved? One at c080000000 (which appears as region0) and another at
>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>> trying to understand the "before" state of the resources i.e, prior to
>>> trimming applied by my patches.
>>
>> Here are the soft reserveds -
>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>
>> And this is what we expect -
>>
>> c080000000-17dbfffffff : CXL Window 0
>> c080000000-c47fffffff : region2
>> c080000000-c47fffffff : dax0.0
>> c080000000-c47fffffff : System RAM (kmem)
>>
>>
>> 68e80000000-8d37fffffff : CXL Window 1
>> 68e80000000-70e7fffffff : region5
>> 68e80000000-70e7fffffff : dax1.0
>> 68e80000000-70e7fffffff : System RAM (kmem)
>>
>> And, like in the prev message, in v5 we get -
>>
>> c080000000-c47fffffff : region0
>>
>> 68e80000000-70e7fffffff : Soft Reserved
>> 68e80000000-70e7fffffff : dax1.0
>> 68e80000000-70e7fffffff : System RAM (kmem)
>>
>>
>> In v4, we 'almost' had what we expect, except that the HMEM driver
>> created those dax devices out of Soft Reserveds before the region driver
>> could do the same.
>>
>
> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>
> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>
> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe() (which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early), and instead rely only on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues.
>
> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
>
> if (IS_ENABLED(CONFIG_CXL_REGION) &&
> region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
>
> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
>
> Thanks
> Smita
>
>>>
>>> Also, do you think it's feasible to change the direction of the soft reserve
>>> trimming, that is, defer it until after CXL region or memdev creation is
>>> complete? In this case it would be trimmed after but inline the existing
>>> region or memdev creation. This might simplify the flow by removing the need
>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>> inside cxl_acpi_probe().
>>
>> Yes that aligns with my simple thinking. There's the trimming after a region
>> is successfully created, and it seems that could simply be called at the end
>> of *that* region creation.
>>
>> Then, there's the round up of all the unused Soft Reserveds, and that has
>> to wait until after all regions are created, ie. all endpoints have arrived
>> and we've given up all hope of creating another region in that space.
>> That's the timing challenge.
>>
>> -- Alison
>>
>>>
>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>
>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>> v6 by immediately triggering fallback DAX registration
>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>
>>> Thanks
>>> Smita
>>>
>>>>
>>>>>
>>>>> As for the log:
>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>> cxl_mem probing
>>>>>
>>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>>
>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>> trimming is attempted.
>>>>>
>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>> guarantee load order when all components are built as modules. So even if
>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>> relevant probes complete.
>>>>>
>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>> to return prematurely and trigger the timeout.
>>>>>
>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>> good enough to most boot-time races.
>>>>>
>>>>> One possible improvement I’m considering is to schedule a
>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>
>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>> looking for suggestions here.
Hi Smita,
Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters protected by a lock. You have a delayed workqueue that checks these counters. If counter1 is 0, go back to sleep and check again later at a reasonable interval. Every time a memdev endpoint starts its probe, increment counter1 and counter2 atomically. Every time a probe completes successfully, decrement counter2. When you reach the condition 'if (counter1 && counter2 == 0)', I think you can start soft reserve discovery.
A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer forward every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.
I think neither one will require special ordering of the modules being loaded.
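A minimal sketch of the first idea (names are made up for illustration and plain atomics stand in for the lock-protected counters; this is not code from the posted series):

  /* sketch only -- hypothetical names, not from the series */
  static atomic_t memdev_started;   /* counter1: endpoint probes seen */
  static atomic_t memdev_inflight;  /* counter2: probes not yet finished */
  static struct delayed_work sr_scan_work;

  static void memdev_probe_begin(void)     /* at the top of endpoint probe */
  {
          atomic_inc(&memdev_started);
          atomic_inc(&memdev_inflight);
  }

  static void memdev_probe_end(void)       /* when the probe completes */
  {
          atomic_dec(&memdev_inflight);
  }

  static void sr_scan_fn(struct work_struct *work)
  {
          if (atomic_read(&memdev_started) &&
              atomic_read(&memdev_inflight) == 0) {
                  /* endpoints have settled: do soft reserve discovery */
                  return;
          }
          /* not settled yet, check again later */
          schedule_delayed_work(&sr_scan_work, msecs_to_jiffies(5000));
  }

  /* INIT_DELAYED_WORK(&sr_scan_work, sr_scan_fn) and the first
   * schedule_delayed_work() happen at module init (not shown). */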
DJ
>>>>>
>>>>> Thanks
>>>>> Smita
>>>>>
>>>>>>
>>>>>>
>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>> other info to reproduce.
>>>>>>
>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>>> [ 53.653540] preempt_count: 1, expected: 0
>>>>>> [ 53.653554] RCU nest depth: 0, expected: 0
>>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875:
>>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>> [ 53.653598] Preemption disabled at:
>>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>> [ 53.653648] Call Trace:
>>>>>> [ 53.653649] <TASK>
>>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0
>>>>>> [ 53.653658] dump_stack+0x14/0x20
>>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0
>>>>>> [ 53.653666] __might_sleep+0x48/0x70
>>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160
>>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
>>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90
>>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90
>>>>>> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>>> [ 53.653693] __devm_add_action+0x3d/0x160
>>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
>>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60
>>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0
>>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [ 53.653768] process_one_work+0x1fa/0x630
>>>>>> [ 53.653774] worker_thread+0x1b2/0x360
>>>>>> [ 53.653777] kthread+0x128/0x250
>>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10
>>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0
>>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30
>>>>>> [ 53.653801] </TASK>
>>>>>>
>>>>>> [ 53.654193] =============================
>>>>>> [ 53.654203] [ BUG: Invalid wait context ]
>>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
>>>>>> [ 53.654623] -----------------------------
>>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock:
>>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>>> [ 53.655115] other info that might help us debug this:
>>>>>> [ 53.655273] context-{5:5}
>>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875:
>>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>> [ 53.656062] stack backtrace:
>>>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>> [ 53.656227] Tainted: [W]=WARN
>>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>> [ 53.656232] Call Trace:
>>>>>> [ 53.656232] <TASK>
>>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0
>>>>>> [ 53.656238] dump_stack+0x14/0x20
>>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200
>>>>>> [ 53.656246] lock_acquire+0xd8/0x300
>>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390
>>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0
>>>>>> [ 53.656257] down_write+0x44/0xe0
>>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390
>>>>>> [ 53.656263] kernfs_add_one+0x34/0x390
>>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
>>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
>>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0
>>>>>> [ 53.656273] kobject_add+0x7d/0xf0
>>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0
>>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
>>>>>> [ 53.656282] device_add+0x124/0x8b0
>>>>>> [ 53.656285] ? dev_set_name+0x56/0x70
>>>>>> [ 53.656287] platform_device_add+0x102/0x260
>>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60
>>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0
>>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [ 53.656346] process_one_work+0x1fa/0x630
>>>>>> [ 53.656350] worker_thread+0x1b2/0x360
>>>>>> [ 53.656352] kthread+0x128/0x250
>>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10
>>>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0
>>>>>> [ 53.656360] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30
>>>>>> [ 53.656366] </TASK>
>>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>> [ 53.663552] schedule+0x4a/0x160
>>>>>> [ 53.663553] schedule_timeout+0x10a/0x120
>>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
>>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
>>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0
>>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
>>>>>> [ 53.663561] wait_for_completion+0x28/0x30
>>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180
>>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80
>>>>>> [ 53.663576] synchronize_srcu+0x46/0x120
>>>>>> [ 53.663577] kill_dax+0x47/0x70
>>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470
>>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50
>>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>> [ 53.663585] platform_probe+0x61/0xd0
>>>>>> [ 53.663589] really_probe+0xe2/0x390
>>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
>>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160
>>>>>> [ 53.663594] driver_probe_device+0x23/0xa0
>>>>>> [ 53.663596] __device_attach_driver+0x92/0x120
>>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0
>>>>>> [ 53.663599] __device_attach+0xc2/0x1f0
>>>>>> [ 53.663601] device_initial_probe+0x17/0x20
>>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0
>>>>>> [ 53.663604] device_add+0x687/0x8b0
>>>>>> [ 53.663607] ? dev_set_name+0x56/0x70
>>>>>> [ 53.663609] platform_device_add+0x102/0x260
>>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60
>>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0
>>>>>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [ 53.663658] process_one_work+0x1fa/0x630
>>>>>> [ 53.663662] worker_thread+0x1b2/0x360
>>>>>> [ 53.663664] kthread+0x128/0x250
>>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10
>>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0
>>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30
>>>>>> [ 53.663677] </TASK>
>>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>> [ 53.700264] INFO: lockdep is turned off.
>>>>>> [ 53.701315] Preemption disabled at:
>>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>> [ 53.701633] Tainted: [W]=WARN
>>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>> [ 53.701638] Call Trace:
>>>>>> [ 53.701638] <TASK>
>>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0
>>>>>> [ 53.701644] dump_stack+0x14/0x20
>>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0
>>>>>> [ 53.701649] __schedule+0xe6f/0x10d0
>>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
>>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0
>>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
>>>>>> [ 53.701661] schedule+0x4a/0x160
>>>>>> [ 53.701662] schedule_timeout+0x10a/0x120
>>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
>>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
>>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0
>>>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
>>>>>> [ 53.701670] wait_for_completion+0x28/0x30
>>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180
>>>>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80
>>>>>> [ 53.701685] synchronize_srcu+0x46/0x120
>>>>>> [ 53.701687] kill_dax+0x47/0x70
>>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470
>>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50
>>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>> [ 53.701695] platform_probe+0x61/0xd0
>>>>>> [ 53.701698] really_probe+0xe2/0x390
>>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
>>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160
>>>>>> [ 53.701703] driver_probe_device+0x23/0xa0
>>>>>> [ 53.701704] __device_attach_driver+0x92/0x120
>>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0
>>>>>> [ 53.701708] __device_attach+0xc2/0x1f0
>>>>>> [ 53.701710] device_initial_probe+0x17/0x20
>>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0
>>>>>> [ 53.701712] device_add+0x687/0x8b0
>>>>>> [ 53.701715] ? dev_set_name+0x56/0x70
>>>>>> [ 53.701717] platform_device_add+0x102/0x260
>>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60
>>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0
>>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [ 53.701756] process_one_work+0x1fa/0x630
>>>>>> [ 53.701760] worker_thread+0x1b2/0x360
>>>>>> [ 53.701762] kthread+0x128/0x250
>>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10
>>>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0
>>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10
>>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30
>>>>>> [ 53.701777] </TASK>
>>>>>>
>>>>>
>>>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-17 19:06 ` Dave Jiang
@ 2025-07-17 23:20 ` Koralahalli Channabasappa, Smita
2025-07-17 23:30 ` Dave Jiang
0 siblings, 1 reply; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-17 23:20 UTC (permalink / raw)
To: Dave Jiang, Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Vishal Verma, Ira Weiny,
Dan Williams, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Yao Xingtao, Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/17/2025 12:06 PM, Dave Jiang wrote:
>
>
> On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>>
>>
>> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>>> Hi Alison,
>>>>>>
>>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>>> intersect with created CXL regions.
>>>>>>>
>>>>>>> Hi Smita,
>>>>>>>
>>>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>>>> before region probe.
>>>>>>>
>>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>> CHECK drivers/dax/hmem/hmem_notify.c
>>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>>
>>>>>> Thanks for pointing this bug. I failed to release the spinlock before
>>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>>> into v6:
>>>>>>
>>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>>> struct resource *res)
>>>>>> {
>>>>>> walk_hmem_fn hmem_fn;
>>>>>>
>>>>>> - guard(spinlock)(&hmem_notify_lock);
>>>>>> + spin_lock(&hmem_notify_lock);
>>>>>> hmem_fn = hmem_fallback_fn;
>>>>>> + spin_unlock(&hmem_notify_lock);
>>>>>>
>>>>>> if (hmem_fn)
>>>>>> hmem_fn(target_nid, res);
>>>>>> --
>>>>>
>>>>> Hi Smita, Adding the above got me past that, and doubling the timeout
>>>>> below stopped that from happening. After that, I haven't had time to
>>>>> trace so, I'll just dump on you for now:
>>>>>
>>>>> In /proc/iomem
>>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>>> actual region, not even disabled, is available.
>>>>> c080000000-c47fffffff : region0
>>>>>
>>>>> And, here no CXL Window, no region, and a soft reserved.
>>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>> 68e80000000-70e7fffffff : dax1.0
>>>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>>>
>>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>>
>>>> Hi Alison,
>>>>
>>>> To help better understand the current behavior, could you share more about
>>>> your platform configuration? specifically, are there two memory cards
>>>> involved? One at c080000000 (which appears as region0) and another at
>>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>>> trying to understand the "before" state of the resources i.e, prior to
>>>> trimming applied by my patches.
>>>
>>> Here are the soft reserveds -
>>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>>
>>> And this is what we expect -
>>>
>>> c080000000-17dbfffffff : CXL Window 0
>>> c080000000-c47fffffff : region2
>>> c080000000-c47fffffff : dax0.0
>>> c080000000-c47fffffff : System RAM (kmem)
>>>
>>>
>>> 68e80000000-8d37fffffff : CXL Window 1
>>> 68e80000000-70e7fffffff : region5
>>> 68e80000000-70e7fffffff : dax1.0
>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>> And, like in prev message, iv v5 we get -
>>>
>>> c080000000-c47fffffff : region0
>>>
>>> 68e80000000-70e7fffffff : Soft Reserved
>>> 68e80000000-70e7fffffff : dax1.0
>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>>
>>> In v4, we 'almost' had what we expect, except that the HMEM driver
>>> created those dax devices our of Soft Reserveds before region driver
>>> could do same.
>>>
>>
>> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>>
>> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>>
>> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe() which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues..
>>
>> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
>>
>> if (IS_ENABLED(CONFIG_CXL_REGION) &&
>> region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
>>
>> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
>>
>> Thanks
>> Smita
>>
>>>>
>>>> Also, do you think it's feasible to change the direction of the soft reserve
>>>> trimming, that is, defer it until after CXL region or memdev creation is
>>>> complete? In this case it would be trimmed after but inline the existing
>>>> region or memdev creation. This might simplify the flow by removing the need
>>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>>> inside cxl_acpi_probe().
>>>
>>> Yes that aligns with my simple thinking. There's the trimming after a region
>>> is successfully created, and it seems that could simply be called at the end
>>> of *that* region creation.
>>>
>>> Then, there's the round up of all the unused Soft Reserveds, and that has
>>> to wait until after all regions are created, ie. all endpoints have arrived
>>> and we've given up all hope of creating another region in that space.
>>> That's the timing challenge.
>>>
>>> -- Alison
>>>
>>>>
>>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>>
>>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>>> v6 by immediately triggering fallback DAX registration
>>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>>
>>>> Thanks
>>>> Smita
>>>>
>>>>>
>>>>>>
>>>>>> As for the log:
>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>>> cxl_mem probing
>>>>>>
>>>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>>>
>>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>>> trimming is attempted.
>>>>>>
>>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>>> guarantee load order when all components are built as modules. So even if
>>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>>> relevant probes complete.
>>>>>>
>>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>>> to return prematurely and trigger the timeout.
>>>>>>
>>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>>> good enough to most boot-time races.
>>>>>>
>>>>>> One possible improvement I’m considering is to schedule a
>>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>>
>>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>>> looking for suggestions here.
>
> Hi Smita,
> Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters protected by a lock. You have a delayed workqueue that checks these counters. If counter1 is 0, go back to sleep and check again later at a reasonable interval. Every time a memdev endpoint starts its probe, increment counter1 and counter2 atomically. Every time a probe completes successfully, decrement counter2. When you reach the condition 'if (counter1 && counter2 == 0)', I think you can start soft reserve discovery.
>
> A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer forward every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.
>
> I think neither one will require special ordering of the modules being loaded.
>
> DJ
I think we might need both: the counters and a settling timer, to
coordinate Soft Reserved trimming and DAX registration.
Here's the rough flow I'm thinking of (a code sketch follows the list).
Let me know if you see flaws in this approach.
1. cxl_acpi_probe() schedules cxl_softreserv_work_fn() and exits early.
This work item is responsible for trimming leftover Soft Reserved memory
ranges once all cxl_mem devices have finished probing.
2. A delayed work is initialized for the settle timer:
INIT_DELAYED_WORK(&cxl_probe_settle_work, cxl_probe_settle_fn);
3. In cxl_mem_probe():
- Increment counter2 (memdevs in progress).
- Increment counter1 (memdevs discovered).
- On probe completion (success or failure), decrement counter2.
- After each probe, re-arm the settle timer to extend the quiet
period if more devices arrive (this might not hold; I'm not sure
whether cxl_mem devices can come in too late).
mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);
- Call wake_up(&cxl_softreserv_waitq); after each probe to notify
listeners.
4. The settle timer callback (cxl_probe_settle_fn()) runs when no new
devices have probed for a while (30s)
timer_expired = true;
wake_up(&cxl_softreserv_waitq);
5. In cxl_softreserv_work_fn()
wait_event(cxl_softreserv_waitq,
atomic_read(&cxl_mem_counter1) > 0 &&
atomic_read(&cxl_mem_counter2) == 0 &&
atomic_read(&timer_expired));
6. Once unblocked, cxl_softreserv_work_fn() trims Soft Reserved regions
via cxl_region_softreserv_update().
(We do not perform any DAX fallback here, as we don't want to end up
with unresolved symbols when DAX_HMEM loads too late.)
7. Separately, dax_hmem_platform_probe() runs independently on module
load, but also blocks on the same wait_event() condition if
CONFIG_CXL_ACPI is enabled. Once the condition is satisfied, it invokes
hmem_register_device() to register leftover Soft Reserved memory.
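Here is the sketch mentioned above, covering steps 2-5
(cxl_mem_probe_enter()/cxl_mem_probe_exit() are placeholder helpers
only meant to show where the hooks would sit in cxl_mem_probe(); none
of this is final code):

  static DECLARE_WAIT_QUEUE_HEAD(cxl_softreserv_waitq);
  static atomic_t cxl_mem_counter1;   /* memdevs discovered */
  static atomic_t cxl_mem_counter2;   /* memdev probes in progress */
  static atomic_t timer_expired;

  static void cxl_probe_settle_fn(struct work_struct *work)
  {
          /* no new memdev probe for the whole quiet period */
          atomic_set(&timer_expired, 1);
          wake_up(&cxl_softreserv_waitq);
  }
  static DECLARE_DELAYED_WORK(cxl_probe_settle_work, cxl_probe_settle_fn);

  static void cxl_mem_probe_enter(void)   /* top of cxl_mem_probe() */
  {
          atomic_inc(&cxl_mem_counter1);
          atomic_inc(&cxl_mem_counter2);
  }

  static void cxl_mem_probe_exit(void)    /* probe exit, success or failure */
  {
          atomic_dec(&cxl_mem_counter2);
          /* extend the quiet period while devices keep arriving */
          mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);
          wake_up(&cxl_softreserv_waitq);
  }

  /* shared by cxl_softreserv_work_fn() and dax_hmem_platform_probe() */
  static void cxl_softreserv_wait_settled(void)
  {
          wait_event(cxl_softreserv_waitq,
                     atomic_read(&cxl_mem_counter1) > 0 &&
                     atomic_read(&cxl_mem_counter2) == 0 &&
                     atomic_read(&timer_expired));
  }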
Thanks
Smita
>
>>>>>>
>>>>>> Thanks
>>>>>> Smita
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>>> other info to reproduce.
>>>>>>>
>>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>>>> [ 53.653540] preempt_count: 1, expected: 0
>>>>>>> [ 53.653554] RCU nest depth: 0, expected: 0
>>>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875:
>>>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>> [ 53.653598] Preemption disabled at:
>>>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>> [ 53.653648] Call Trace:
>>>>>>> [ 53.653649] <TASK>
>>>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0
>>>>>>> [ 53.653658] dump_stack+0x14/0x20
>>>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0
>>>>>>> [ 53.653666] __might_sleep+0x48/0x70
>>>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160
>>>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
>>>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90
>>>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90
>>>>>>> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>>>> [ 53.653693] __devm_add_action+0x3d/0x160
>>>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
>>>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60
>>>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0
>>>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [ 53.653768] process_one_work+0x1fa/0x630
>>>>>>> [ 53.653774] worker_thread+0x1b2/0x360
>>>>>>> [ 53.653777] kthread+0x128/0x250
>>>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10
>>>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0
>>>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30
>>>>>>> [ 53.653801] </TASK>
>>>>>>>
>>>>>>> [ 53.654193] =============================
>>>>>>> [ 53.654203] [ BUG: Invalid wait context ]
>>>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
>>>>>>> [ 53.654623] -----------------------------
>>>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock:
>>>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>>>> [ 53.655115] other info that might help us debug this:
>>>>>>> [ 53.655273] context-{5:5}
>>>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875:
>>>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>> [ 53.656062] stack backtrace:
>>>>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>> [ 53.656227] Tainted: [W]=WARN
>>>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>> [ 53.656232] Call Trace:
>>>>>>> [ 53.656232] <TASK>
>>>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0
>>>>>>> [ 53.656238] dump_stack+0x14/0x20
>>>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200
>>>>>>> [ 53.656246] lock_acquire+0xd8/0x300
>>>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390
>>>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0
>>>>>>> [ 53.656257] down_write+0x44/0xe0
>>>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390
>>>>>>> [ 53.656263] kernfs_add_one+0x34/0x390
>>>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
>>>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
>>>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0
>>>>>>> [ 53.656273] kobject_add+0x7d/0xf0
>>>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0
>>>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
>>>>>>> [ 53.656282] device_add+0x124/0x8b0
>>>>>>> [ 53.656285] ? dev_set_name+0x56/0x70
>>>>>>> [ 53.656287] platform_device_add+0x102/0x260
>>>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60
>>>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0
>>>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [ 53.656346] process_one_work+0x1fa/0x630
>>>>>>> [ 53.656350] worker_thread+0x1b2/0x360
>>>>>>> [ 53.656352] kthread+0x128/0x250
>>>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10
>>>>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0
>>>>>>> [ 53.656360] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30
>>>>>>> [ 53.656366] </TASK>
>>>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>> [ 53.663552] schedule+0x4a/0x160
>>>>>>> [ 53.663553] schedule_timeout+0x10a/0x120
>>>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
>>>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
>>>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0
>>>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
>>>>>>> [ 53.663561] wait_for_completion+0x28/0x30
>>>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180
>>>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80
>>>>>>> [ 53.663576] synchronize_srcu+0x46/0x120
>>>>>>> [ 53.663577] kill_dax+0x47/0x70
>>>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470
>>>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50
>>>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>> [ 53.663585] platform_probe+0x61/0xd0
>>>>>>> [ 53.663589] really_probe+0xe2/0x390
>>>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
>>>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160
>>>>>>> [ 53.663594] driver_probe_device+0x23/0xa0
>>>>>>> [ 53.663596] __device_attach_driver+0x92/0x120
>>>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0
>>>>>>> [ 53.663599] __device_attach+0xc2/0x1f0
>>>>>>> [ 53.663601] device_initial_probe+0x17/0x20
>>>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0
>>>>>>> [ 53.663604] device_add+0x687/0x8b0
>>>>>>> [ 53.663607] ? dev_set_name+0x56/0x70
>>>>>>> [ 53.663609] platform_device_add+0x102/0x260
>>>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60
>>>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0
>>>>>>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [ 53.663658] process_one_work+0x1fa/0x630
>>>>>>> [ 53.663662] worker_thread+0x1b2/0x360
>>>>>>> [ 53.663664] kthread+0x128/0x250
>>>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10
>>>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0
>>>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30
>>>>>>> [ 53.663677] </TASK>
>>>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>> [ 53.700264] INFO: lockdep is turned off.
>>>>>>> [ 53.701315] Preemption disabled at:
>>>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>> [ 53.701633] Tainted: [W]=WARN
>>>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>> [ 53.701638] Call Trace:
>>>>>>> [ 53.701638] <TASK>
>>>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0
>>>>>>> [ 53.701644] dump_stack+0x14/0x20
>>>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0
>>>>>>> [ 53.701649] __schedule+0xe6f/0x10d0
>>>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
>>>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0
>>>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
>>>>>>> [ 53.701661] schedule+0x4a/0x160
>>>>>>> [ 53.701662] schedule_timeout+0x10a/0x120
>>>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
>>>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
>>>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0
>>>>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
>>>>>>> [ 53.701670] wait_for_completion+0x28/0x30
>>>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180
>>>>>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80
>>>>>>> [ 53.701685] synchronize_srcu+0x46/0x120
>>>>>>> [ 53.701687] kill_dax+0x47/0x70
>>>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470
>>>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50
>>>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>> [ 53.701695] platform_probe+0x61/0xd0
>>>>>>> [ 53.701698] really_probe+0xe2/0x390
>>>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
>>>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160
>>>>>>> [ 53.701703] driver_probe_device+0x23/0xa0
>>>>>>> [ 53.701704] __device_attach_driver+0x92/0x120
>>>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0
>>>>>>> [ 53.701708] __device_attach+0xc2/0x1f0
>>>>>>> [ 53.701710] device_initial_probe+0x17/0x20
>>>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0
>>>>>>> [ 53.701712] device_add+0x687/0x8b0
>>>>>>> [ 53.701715] ? dev_set_name+0x56/0x70
>>>>>>> [ 53.701717] platform_device_add+0x102/0x260
>>>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60
>>>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0
>>>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [ 53.701756] process_one_work+0x1fa/0x630
>>>>>>> [ 53.701760] worker_thread+0x1b2/0x360
>>>>>>> [ 53.701762] kthread+0x128/0x250
>>>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10
>>>>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0
>>>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10
>>>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30
>>>>>>> [ 53.701777] </TASK>
>>>>>>>
>>>>>>
>>>>
>>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-17 23:20 ` Koralahalli Channabasappa, Smita
@ 2025-07-17 23:30 ` Dave Jiang
0 siblings, 0 replies; 38+ messages in thread
From: Dave Jiang @ 2025-07-17 23:30 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Vishal Verma, Ira Weiny,
Dan Williams, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Yao Xingtao, Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On 7/17/25 4:20 PM, Koralahalli Channabasappa, Smita wrote:
> On 7/17/2025 12:06 PM, Dave Jiang wrote:
>>
>>
>> On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>>>
>>>
>>> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>>>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>>>> Hi Alison,
>>>>>>>
>>>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>>>> intersect with created CXL regions.
>>>>>>>>
>>>>>>>> Hi Smita,
>>>>>>>>
>>>>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>>>>> before region probe.
>>>>>>>>
>>>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>>> CHECK drivers/dax/hmem/hmem_notify.c
>>>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>>>
>>>>>>> Thanks for pointing this bug. I failed to release the spinlock before
>>>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>>>> into v6:
>>>>>>>
>>>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>>>> struct resource *res)
>>>>>>> {
>>>>>>> walk_hmem_fn hmem_fn;
>>>>>>>
>>>>>>> - guard(spinlock)(&hmem_notify_lock);
>>>>>>> + spin_lock(&hmem_notify_lock);
>>>>>>> hmem_fn = hmem_fallback_fn;
>>>>>>> + spin_unlock(&hmem_notify_lock);
>>>>>>>
>>>>>>> if (hmem_fn)
>>>>>>> hmem_fn(target_nid, res);
>>>>>>> --
>>>>>>
>>>>>> Hi Smita, Adding the above got me past that, and doubling the timeout
>>>>>> below stopped that from happening. After that, I haven't had time to
>>>>>> trace so, I'll just dump on you for now:
>>>>>>
>>>>>> In /proc/iomem
>>>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>>>> actual region, not even disabled, is available.
>>>>>> c080000000-c47fffffff : region0
>>>>>>
>>>>>> And, here no CXL Window, no region, and a soft reserved.
>>>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>>> 68e80000000-70e7fffffff : dax1.0
>>>>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>>>>
>>>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>>>
>>>>> Hi Alison,
>>>>>
>>>>> To help better understand the current behavior, could you share more about
>>>>> your platform configuration? specifically, are there two memory cards
>>>>> involved? One at c080000000 (which appears as region0) and another at
>>>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>>>> trying to understand the "before" state of the resources i.e, prior to
>>>>> trimming applied by my patches.
>>>>
>>>> Here are the soft reserveds -
>>>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>>>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>>>
>>>> And this is what we expect -
>>>>
>>>> c080000000-17dbfffffff : CXL Window 0
>>>> c080000000-c47fffffff : region2
>>>> c080000000-c47fffffff : dax0.0
>>>> c080000000-c47fffffff : System RAM (kmem)
>>>>
>>>>
>>>> 68e80000000-8d37fffffff : CXL Window 1
>>>> 68e80000000-70e7fffffff : region5
>>>> 68e80000000-70e7fffffff : dax1.0
>>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>> And, like in prev message, iv v5 we get -
>>>>
>>>> c080000000-c47fffffff : region0
>>>>
>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>> 68e80000000-70e7fffffff : dax1.0
>>>> 68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>>
>>>> In v4, we 'almost' had what we expect, except that the HMEM driver
>>>> created those dax devices our of Soft Reserveds before region driver
>>>> could do same.
>>>>
>>>
>>> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>>>
>>> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>>>
>>> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe() which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues..
>>>
>>> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
>>>
>>> if (IS_ENABLED(CONFIG_CXL_REGION) &&
>>> region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
>>>
>>> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
>>>
>>> Thanks
>>> Smita
>>>
>>>>>
>>>>> Also, do you think it's feasible to change the direction of the soft reserve
>>>>> trimming, that is, defer it until after CXL region or memdev creation is
>>>>> complete? In this case it would be trimmed after but inline the existing
>>>>> region or memdev creation. This might simplify the flow by removing the need
>>>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>>>> inside cxl_acpi_probe().
>>>>
>>>> Yes that aligns with my simple thinking. There's the trimming after a region
>>>> is successfully created, and it seems that could simply be called at the end
>>>> of *that* region creation.
>>>>
>>>> Then, there's the round up of all the unused Soft Reserveds, and that has
>>>> to wait until after all regions are created, ie. all endpoints have arrived
>>>> and we've given up all hope of creating another region in that space.
>>>> That's the timing challenge.
>>>>
>>>> -- Alison
>>>>
>>>>>
>>>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>>>
>>>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>>>> v6 by immediately triggering fallback DAX registration
>>>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>>>
>>>>> Thanks
>>>>> Smita
>>>>>
>>>>>>
>>>>>>>
>>>>>>> As for the log:
>>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>>>> cxl_mem probing
>>>>>>>
>>>>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>>>>
>>>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>>>> trimming is attempted.
>>>>>>>
>>>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>>>> guarantee load order when all components are built as modules. So even if
>>>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>>>> relevant probes complete.
>>>>>>>
>>>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>>>> to return prematurely and trigger the timeout.
>>>>>>>
>>>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>>>> good enough to most boot-time races.
>>>>>>>
>>>>>>> One possible improvement I’m considering is to schedule a
>>>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>>>
>>>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>>>> looking for suggestions here.
>>
>> Hi Smita,
>> Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters protected by a lock. You have a delayed workqueue that checks these counters. If counter1 is 0, go back to sleep and check again later at a reasonable interval. Every time a memdev endpoint starts its probe, increment counter1 and counter2 atomically. Every time a probe completes successfully, decrement counter2. When you reach the condition 'if (counter1 && counter2 == 0)', I think you can start soft reserve discovery.
>>
>> A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer forward every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.
>>
>> I think neither one will require special ordering of the modules being loaded.
>>
>> DJ
>
> I think we might need both, the counters and a settling timer to coordinate Soft Reserved trimming and DAX registration.
>
> Here's the rough flow I'm thinking of. Let me know the flaws in this approach.
Seems reasonable to me. Don't forget to cancel the timer if your condition is met and you are woken up early by a probe() finishing. It really is a best-effort way of dealing with the situation.
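For example, once the waiter is unblocked (sketch, reusing the hypothetical names from the plan above), the pending settle work can be cancelled so it does not fire again later:

  /* in cxl_softreserv_work_fn(), after wait_event() returns */
  cancel_delayed_work_sync(&cxl_probe_settle_work);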
DJ
>
> 1. cxl_acpi_probe() schedules cxl_softreserv_work_fn() and exits early.
> This work item is responsible for trimming leftover Soft Reserved memory ranges once all cxl_mem devices have finished probing.
>
> 2. A delayed work is initialized for the settle timer:
>
> INIT_DELAYED_WORK(&cxl_probe_settle_work, cxl_probe_settle_fn);
>
> 3. In cxl_mem_probe():
> - Increment counter2 (memdevs in progress).
> - Increment counter1 (memdevs discovered).
> - On probe completion (success or failure), decrement counter2.
> - After each probe, re-arm the settle timer to extend the quiet
> period if more devices arrive (this might not hold; I'm not sure
> whether cxl_mem devices can come in too late).
> mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);
> - Call wake_up(&cxl_softreserv_waitq); after each probe to notify
> listeners.
>
> 4. The settle timer callback (cxl_probe_settle_fn()) runs when no new devices have probed for a while (30s)
> timer_expired = true;
> wake_up(&cxl_softreserv_waitq);
>
> 5. In cxl_softreserv_work_fn()
> wait_event(cxl_softreserv_waitq,
> atomic_read(&cxl_mem_counter1) > 0 &&
> atomic_read(&cxl_mem_counter2) == 0 &&
> atomic_read(&timer_expired));
>
> 6. Once unblocked, cxl_softreserv_work_fn() trims Soft Reserved regions via cxl_region_softreserv_update().
> (We do not perform any DAX fallback here, as we don't want to end up with unresolved symbols when DAX_HMEM loads too late.)
>
> 7. Separately, dax_hmem_platform_probe() runs independently on module load, but also blocks on the same wait_event() condition if CONFIG_CXL_ACPI is enabled. Once the condition is satisfied, it invokes hmem_register_device() to register leftover Soft Reserved memory.
>
> Thanks
> Smita
>
>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Smita
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>>>> other info to reproduce.
>>>>>>>>
>>>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>>>>> [ 53.653540] preempt_count: 1, expected: 0
>>>>>>>> [ 53.653554] RCU nest depth: 0, expected: 0
>>>>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875:
>>>>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>>> [ 53.653598] Preemption disabled at:
>>>>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>>> [ 53.653648] Call Trace:
>>>>>>>> [ 53.653649] <TASK>
>>>>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0
>>>>>>>> [ 53.653658] dump_stack+0x14/0x20
>>>>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0
>>>>>>>> [ 53.653666] __might_sleep+0x48/0x70
>>>>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160
>>>>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10
>>>>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90
>>>>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90
>>>>>>>> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>>>>> [ 53.653693] __devm_add_action+0x3d/0x160
>>>>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem]
>>>>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60
>>>>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [ 53.653768] process_one_work+0x1fa/0x630
>>>>>>>> [ 53.653774] worker_thread+0x1b2/0x360
>>>>>>>> [ 53.653777] kthread+0x128/0x250
>>>>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0
>>>>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30
>>>>>>>> [ 53.653801] </TASK>
>>>>>>>>
>>>>>>>> [ 53.654193] =============================
>>>>>>>> [ 53.654203] [ BUG: Invalid wait context ]
>>>>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
>>>>>>>> [ 53.654623] -----------------------------
>>>>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock:
>>>>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>>>>> [ 53.655115] other info that might help us debug this:
>>>>>>>> [ 53.655273] context-{5:5}
>>>>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875:
>>>>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>>> [ 53.656062] stack backtrace:
>>>>>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>>> [ 53.656227] Tainted: [W]=WARN
>>>>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>>> [ 53.656232] Call Trace:
>>>>>>>> [ 53.656232] <TASK>
>>>>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0
>>>>>>>> [ 53.656238] dump_stack+0x14/0x20
>>>>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200
>>>>>>>> [ 53.656246] lock_acquire+0xd8/0x300
>>>>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390
>>>>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0
>>>>>>>> [ 53.656257] down_write+0x44/0xe0
>>>>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390
>>>>>>>> [ 53.656263] kernfs_add_one+0x34/0x390
>>>>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0
>>>>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0
>>>>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0
>>>>>>>> [ 53.656273] kobject_add+0x7d/0xf0
>>>>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0
>>>>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10
>>>>>>>> [ 53.656282] device_add+0x124/0x8b0
>>>>>>>> [ 53.656285] ? dev_set_name+0x56/0x70
>>>>>>>> [ 53.656287] platform_device_add+0x102/0x260
>>>>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60
>>>>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [ 53.656346] process_one_work+0x1fa/0x630
>>>>>>>> [ 53.656350] worker_thread+0x1b2/0x360
>>>>>>>> [ 53.656352] kthread+0x128/0x250
>>>>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0
>>>>>>>> [ 53.656360] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30
>>>>>>>> [ 53.656366] </TASK>
>>>>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>>> [ 53.663552] schedule+0x4a/0x160
>>>>>>>> [ 53.663553] schedule_timeout+0x10a/0x120
>>>>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30
>>>>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0
>>>>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0
>>>>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10
>>>>>>>> [ 53.663561] wait_for_completion+0x28/0x30
>>>>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180
>>>>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80
>>>>>>>> [ 53.663576] synchronize_srcu+0x46/0x120
>>>>>>>> [ 53.663577] kill_dax+0x47/0x70
>>>>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470
>>>>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50
>>>>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>>> [ 53.663585] platform_probe+0x61/0xd0
>>>>>>>> [ 53.663589] really_probe+0xe2/0x390
>>>>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10
>>>>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160
>>>>>>>> [ 53.663594] driver_probe_device+0x23/0xa0
>>>>>>>> [ 53.663596] __device_attach_driver+0x92/0x120
>>>>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0
>>>>>>>> [ 53.663599] __device_attach+0xc2/0x1f0
>>>>>>>> [ 53.663601] device_initial_probe+0x17/0x20
>>>>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0
>>>>>>>> [ 53.663604] device_add+0x687/0x8b0
>>>>>>>> [ 53.663607] ? dev_set_name+0x56/0x70
>>>>>>>> [ 53.663609] platform_device_add+0x102/0x260
>>>>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60
>>>>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [ 53.663658] process_one_work+0x1fa/0x630
>>>>>>>> [ 53.663662] worker_thread+0x1b2/0x360
>>>>>>>> [ 53.663664] kthread+0x128/0x250
>>>>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0
>>>>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30
>>>>>>>> [ 53.663677] </TASK>
>>>>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>>> [ 53.700264] INFO: lockdep is turned off.
>>>>>>>> [ 53.701315] Preemption disabled at:
>>>>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>>> [ 53.701633] Tainted: [W]=WARN
>>>>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>>> [ 53.701638] Call Trace:
>>>>>>>> [ 53.701638] <TASK>
>>>>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0
>>>>>>>> [ 53.701644] dump_stack+0x14/0x20
>>>>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0
>>>>>>>> [ 53.701649] __schedule+0xe6f/0x10d0
>>>>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30
>>>>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0
>>>>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0
>>>>>>>> [ 53.701661] schedule+0x4a/0x160
>>>>>>>> [ 53.701662] schedule_timeout+0x10a/0x120
>>>>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30
>>>>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0
>>>>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0
>>>>>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10
>>>>>>>> [ 53.701670] wait_for_completion+0x28/0x30
>>>>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180
>>>>>>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80
>>>>>>>> [ 53.701685] synchronize_srcu+0x46/0x120
>>>>>>>> [ 53.701687] kill_dax+0x47/0x70
>>>>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470
>>>>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50
>>>>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>>> [ 53.701695] platform_probe+0x61/0xd0
>>>>>>>> [ 53.701698] really_probe+0xe2/0x390
>>>>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10
>>>>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160
>>>>>>>> [ 53.701703] driver_probe_device+0x23/0xa0
>>>>>>>> [ 53.701704] __device_attach_driver+0x92/0x120
>>>>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0
>>>>>>>> [ 53.701708] __device_attach+0xc2/0x1f0
>>>>>>>> [ 53.701710] device_initial_probe+0x17/0x20
>>>>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0
>>>>>>>> [ 53.701712] device_add+0x687/0x8b0
>>>>>>>> [ 53.701715] ? dev_set_name+0x56/0x70
>>>>>>>> [ 53.701717] platform_device_add+0x102/0x260
>>>>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60
>>>>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [ 53.701756] process_one_work+0x1fa/0x630
>>>>>>>> [ 53.701760] worker_thread+0x1b2/0x360
>>>>>>>> [ 53.701762] kthread+0x128/0x250
>>>>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0
>>>>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10
>>>>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30
>>>>>>>> [ 53.701777] </TASK>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (7 preceding siblings ...)
2025-07-15 21:07 ` [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Alison Schofield
@ 2025-07-21 7:38 ` Zhijian Li (Fujitsu)
2025-07-22 20:07 ` dan.j.williams
9 siblings, 0 replies; 38+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-07-21 7:38 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Xingtao Yao (Fujitsu), Peter Zijlstra,
Greg KH, Nathan Fontenot, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati
Smita,
I have not yet completed all of my local test patterns. Nonetheless, in addition to the issues highlighted by Alison, I have also encountered some regressions.
Based on your conversation with Alison, it appears you have decided to do a refactor. Thus, I intend to pause testing of this version until the updated iteration is available.
Here is what I have verified thus far (kernel built on cxl/next 20250718):
A) No Soft reserved (BIOS did not expose EFI_SPECIAL_PURPOSE)
- A.1 Decoder not committed (default QEMU emulation)
Before:
```
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
5c0001128-5c00011b7 : port1
5d0000000-6cfffffff : CXL Window 0
6d0000000-7cfffffff : CXL Window 1
7000000000-700000ffff : PCI Bus 0000:0c
7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
7000010000-700001ffff : 0000:0e:00.0
7000011080-70000110d7 : mem0
```
After (CXL window is absent):
```
fed00000-fed003ff : PNP0103:00
fed1c000-fed1ffff : Reserved
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
7000000000-700000ffff : PCI Bus 0000:0c
7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
7000010000-700001ffff : 0000:0e:00.0
7000020000-703fffffff : PCI Bus 0000:00
```
- A.2 Decoder is committed
Before:
```
100000000-27fffffff : System RAM
5c0001128-5c00011b7 : port1
5d0000000-6cfffffff : CXL Window 0
5d0000000-6cfffffff : region0
5d0000000-6cfffffff : dax0.0
5d0000000-6cfffffff : System RAM (kmem)
7000000000-700000ffff : PCI Bus 0000:0c
7000000000-700000ffff : 0000:0c:00.0
```
After (CXL window is absent):
```
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
7000000000-700000ffff : PCI Bus 0000:0c
7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
7000010000-700001ffff : 0000:0e:00.0
7000020000-703fffffff : PCI Bus 0000:00
```
B) EFI_SPECIAL_PURPOSE is set
- B.1 Decoder not committed
Before:
```
5d0000000-7cfffffff : Soft Reserved
5d0000000-6cfffffff : CXL Window 0
6d0000000-7cfffffff : CXL Window 1
```
After (fallback to hmem):
```
5d0000000-7cfffffff : Soft Reserved
5d0000000-7cfffffff : dax0.0
5d0000000-7cfffffff : System RAM (kmem)
```
- B.2 Decoder is committed
Before:
```
5d0000000-6cfffffff : CXL Window 0
5d0000000-6cfffffff : region0
5d0000000-6cfffffff : Soft Reserved
5d0000000-6cfffffff : dax0.0
5d0000000-6cfffffff : System RAM (kmem)
```
After (fallback to hmem):
```
5d0000000-6cfffffff : Soft Reserved
5d0000000-6cfffffff : dax0.0
5d0000000-6cfffffff : System RAM (kmem)
```
Thanks
Zhijian
On 16/07/2025 02:04, Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.
>
> The current approach of leaving SOFT RESERVED entries as is can result
> in failures during device hotplug such as CXL because the address range
> remains reserved and unavailable for reuse even after region teardown.
>
> To address this, the CXL driver now uses a background worker that waits
> for cxl_mem driver probe to complete before scanning for intersecting
> resources. Then the driver walks through created CXL regions to trim any
> intersections with SOFT RESERVED resources in the iomem tree.
>
> The following scenarios have been tested:
>
> Example 1: Exact alignment, soft reserved is a child of the region
>
> |---------- "Soft Reserved" -----------|
> |-------------- "Region #" ------------|
>
> Before:
> 1050000000-304fffffff : CXL Window 0
> 1050000000-304fffffff : region0
> 1050000000-304fffffff : Soft Reserved
> 1080000000-2fffffffff : dax0.0
> 1080000000-2fffffffff : System RAM (kmem)
>
> After:
> 1050000000-304fffffff : CXL Window 0
> 1050000000-304fffffff : region0
> 1080000000-2fffffffff : dax0.0
> 1080000000-2fffffffff : System RAM (kmem)
>
> Example 2: Start and/or end aligned and soft reserved spans multiple
> regions
> |----------- "Soft Reserved" -----------|
> |-------- "Region #" -------|
> or
> |----------- "Soft Reserved" -----------|
> |-------- "Region #" -------|
>
> Before:
> 850000000-684fffffff : Soft Reserved
> 850000000-284fffffff : CXL Window 0
> 850000000-284fffffff : region3
> 850000000-284fffffff : dax0.0
> 850000000-284fffffff : System RAM (kmem)
> 2850000000-484fffffff : CXL Window 1
> 2850000000-484fffffff : region4
> 2850000000-484fffffff : dax1.0
> 2850000000-484fffffff : System RAM (kmem)
> 4850000000-684fffffff : CXL Window 2
> 4850000000-684fffffff : region5
> 4850000000-684fffffff : dax2.0
> 4850000000-684fffffff : System RAM (kmem)
>
> After:
> 850000000-284fffffff : CXL Window 0
> 850000000-284fffffff : region3
> 850000000-284fffffff : dax0.0
> 850000000-284fffffff : System RAM (kmem)
> 2850000000-484fffffff : CXL Window 1
> 2850000000-484fffffff : region4
> 2850000000-484fffffff : dax1.0
> 2850000000-484fffffff : System RAM (kmem)
> 4850000000-684fffffff : CXL Window 2
> 4850000000-684fffffff : region5
> 4850000000-684fffffff : dax2.0
> 4850000000-684fffffff : System RAM (kmem)
>
> Example 3: No alignment
> |---------- "Soft Reserved" ----------|
> |---- "Region #" ----|
>
> Before:
> 00000000-3050000ffd : Soft Reserved
> ..
> ..
> 1050000000-304fffffff : CXL Window 0
> 1050000000-304fffffff : region1
> 1080000000-2fffffffff : dax0.0
> 1080000000-2fffffffff : System RAM (kmem)
>
> After:
> 00000000-104fffffff : Soft Reserved
> ..
> ..
> 1050000000-304fffffff : CXL Window 0
> 1050000000-304fffffff : region1
> 1080000000-2fffffffff : dax0.0
> 1080000000-2fffffffff : System RAM (kmem)
> 3050000000-3050000ffd : Soft Reserved
>
> Link to v4:
> https://lore.kernel.org/linux-cxl/20250603221949.53272-1-Smita.KoralahalliChannabasappa@amd.com
>
> v5 updates:
> - Handled cases where CXL driver loads early even before HMEM driver is
> initialized.
> - Introduced callback functions to resolve dependencies.
> - Rename suspend.c to probe_state.c.
> - Refactor cxl_acpi_probe() to use a single exit path.
> - Commit description update to justify cxl_mem_active() usage.
> - Change from kmalloc -> kzalloc in add_soft_reserved().
> - Change from goto to if else blocks inside remove_soft_reserved().
> - DEFINE_RES_MEM_NAMED -> DEFINE_RES_NAMED_DESC.
> - Comments for flags inside remove_soft_reserved().
> - Add resource_lock inside normalize_resource().
> - bus_find_next_device -> bus_find_device.
> - Skip DAX consumption of soft reserves inside hmat with
> CONFIG_CXL_ACPI checks.
>
> v4 updates:
> - Split first patch into 4 smaller patches.
> - Correct the logic for cxl_pci_loaded() and cxl_mem_active() to return
> false at default instead of true.
> - Cleanup cxl_wait_for_pci_mem() to remove config checks for cxl_pci
> and cxl_mem.
> - Fixed multiple bugs and build issues which includes correcting
> walk_iomem_resc_desc() and calculations of alignments.
>
> v3 updates:
> - Remove srmem resource tree from kernel/resource.c, this is no longer
> needed in the current implementation. All SOFT RESERVE resources now
> put on the iomem resource tree.
> - Remove the no longer needed SOFT_RESERVED_MANAGED kernel config option.
> - Add the 'nid' parameter back to hmem_register_resource();
> - Remove the no longer used soft reserve notification chain (introduced
> in v2). The dax driver is now notified of SOFT RESERVED resources by
> the CXL driver.
>
> v2 updates:
> - Add config option SOFT_RESERVE_MANAGED to control use of the
> separate srmem resource tree at boot.
> - Only add SOFT RESERVE resources to the soft reserve tree during
> boot, they go to the iomem resource tree after boot.
> - Remove the resource trimming code in the previous patch to re-use
> the existing code in kernel/resource.c
> - Add functionality for the cxl acpi driver to wait for the cxl PCI
> and mem drivers to load.
>
> Smita Koralahalli (7):
> cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX
> registration
> cxl/core: Rename suspend.c to probe_state.c and remove
> CONFIG_CXL_SUSPEND
> cxl/acpi: Add background worker to coordinate with cxl_mem probe
> completion
> cxl/region: Introduce SOFT RESERVED resource removal on region
> teardown
> dax/hmem: Save the DAX HMEM platform device pointer
> dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until
> after CXL region creation
> dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads
> late
>
> drivers/acpi/numa/hmat.c | 4 +
> drivers/cxl/Kconfig | 4 -
> drivers/cxl/acpi.c | 50 +++++--
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/{suspend.c => probe_state.c} | 10 +-
> drivers/cxl/core/region.c | 135 ++++++++++++++++++
> drivers/cxl/cxl.h | 4 +
> drivers/cxl/cxlmem.h | 9 --
> drivers/dax/hmem/Makefile | 1 +
> drivers/dax/hmem/device.c | 62 ++++----
> drivers/dax/hmem/hmem.c | 14 +-
> drivers/dax/hmem/hmem_notify.c | 29 ++++
> include/linux/dax.h | 7 +-
> include/linux/ioport.h | 1 +
> include/linux/pm.h | 7 -
> kernel/resource.c | 34 +++++
> 16 files changed, 307 insertions(+), 66 deletions(-)
> rename drivers/cxl/core/{suspend.c => probe_state.c} (62%)
> create mode 100644 drivers/dax/hmem/hmem_notify.c
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
` (8 preceding siblings ...)
2025-07-21 7:38 ` Zhijian Li (Fujitsu)
@ 2025-07-22 20:07 ` dan.j.williams
9 siblings, 0 replies; 38+ messages in thread
From: dan.j.williams @ 2025-07-22 20:07 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.
>
> The current approach of leaving SOFT RESERVED entries as is can result
> in failures during device hotplug such as CXL because the address range
> remains reserved and unavailable for reuse even after region teardown.
I will go through the patches, but the main concern here is not hotplug,
it is region assembly failure.
We have a constant drip of surprising platform behaviors that trip up
the driver, leaving memory stranded. Specifically, device-dax defers to
CXL to assemble the region representing the soft-reserve range, CXL
fails to complete that assembly because it is confused by the platform,
and the end user wonders why their platform BIOS sees memory capacity
that Linux does not.
So the priority order of solutions needed here is:
1/ Fix all shipping platform "quirks", try to prevent new ones from
being created. I.e. ideally, long term, Linux does not need a
soft-reserve fallback and just always ignores Soft Reserve in
CXL Windows because the CXL subsystem will handle it.
2/ In the near term foreseeable future, for all yet to be solved or yet
to be discovered platform quirks, provide a device-dax fallback to
recover baseline device-dax behavior (equivalent to putting cxl_acpi on
a modprobe deny-list).
3/ For hotplug, remove the conflicting resource.
> To address this, the CXL driver now uses a background worker that waits
> for cxl_mem driver probe to complete before scanning for intersecting
> resources. Then the driver walks through created CXL regions to trim any
> intersections with SOFT RESERVED resources in the iomem tree.
The precision of this gives me pause. I think it is fine to make this
more coarse because any mismatch between Soft Reserve and a CXL Window
resource should be cause to give up on the CXL side.
If a Soft Reserve range straddles a CXL window and "System RAM", give up
on trying to use the CXL driver on that system.
If CXL does not completely cover a soft-reserve region, give up on trying
to use the CXL driver on that system.
Effectively anytime we detect unexpected platform shenanigans it is
likely indicating missing understanding in the Linux driver.
> The following scenarios have been tested:
Nice! Appreciate you including the test case results.
[..]
> Example 3: No alignment
> |---------- "Soft Reserved" ----------|
> |---- "Region #" ----|
Per above, the CXL subsystem should completely give up in this scenario. The
BIOS said that all of the range is Conventional memory and CXL is only
creating a region for part of it. Somebody is wrong. Given that
non-CXL-aware OSes would try to use the entirety of the Soft Reserved
region, this scenario is "disable CXL, it clearly does not
understand this platform".
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration
2025-07-15 18:04 ` [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration Smita Koralahalli
@ 2025-07-22 21:04 ` dan.j.williams
2025-07-23 0:45 ` Alison Schofield
0 siblings, 1 reply; 38+ messages in thread
From: dan.j.williams @ 2025-07-22 21:04 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Smita Koralahalli wrote:
> Refactor cxl_acpi_probe() to use a single exit path so that the fallback
> DAX registration can be scheduled regardless of probe success or failure.
I do not understand why cxl_acpi needs to be responsible for this,
especially in the cxl_acpi_probe() failure path. Certainly if
cxl_acpi_probe() fails, that is a strong signal to give up on the CXL
subsystem altogether and fall back to vanilla DAX discovery exclusively.
Now, maybe the need for this becomes clearer in follow-on patches.
However, I would have expected that DAX, which currently arranges for
CXL to load first, would just flush CXL discovery and make a decision about
whether to proceed with Soft Reserved or not.
Something like:
DAX CXL
Scan CXL Windows. Fail on any window
parsing failures
Launch a work item to flush PCI
discovery and give a reasonable amount of
time for cxl_pci and cxl_mem to quiesce
<assumes CXL Windows are discovered
by virtue of initcall order or
MODULE_SOFTDEP("pre: cxl_acpi")>
Calls a CXL flush routine to await probe
completion (will always be racy)
Evaluates if all Soft Reserve has
cxl_region coverage
if yes: skip publishing CXL intersecting
Soft Reserve range in iomem, let dax_cxl
attach to the cxl_region devices
if no: decline the already published
cxl_dax_regions, notify cxl_acpi to
shutdown. Install Soft Reserved in iomem
and create dax_hmem devices for the
ranges per usual.
Something like the above puts all the onus on device-dax to decide if
CXL is meeting expectations. CXL is only responsible for flagging when it
thinks it has successfully completed init. If device-dax disagrees with
what CXL has done it can tear down the world without ever attaching
'struct cxl_dax_region'. The success/fail is an "all or nothing"
proposition. Either CXL understands everything or the user needs to
work with their hardware vendor to fix whatever is giving the CXL driver
indigestion.
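A minimal sketch of what that coarse, all-or-nothing evaluation might look
like on the device-dax side (purely illustrative, not part of this series;
it uses an IORES_DESC_CXL intersection as a stand-in for "has CXL region
coverage", and the helper names are made up):

```c
/*
 * Hypothetical sketch: walk every Soft Reserved range and require that
 * the CXL subsystem claims all of it. Anything that comes back
 * REGION_DISJOINT or REGION_MIXED fails the whole evaluation, i.e.
 * "give up on CXL, fall back to plain dax_hmem".
 */
static int __check_sr_covered(struct resource *res, void *data)
{
	bool *all_covered = data;

	if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) != REGION_INTERSECTS)
		*all_covered = false;
	return 0;
}

static bool soft_reserved_fully_claimed_by_cxl(void)
{
	bool all_covered = true;

	walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0, -1,
			    &all_covered, __check_sr_covered);
	return all_covered;
}
```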
It needs to be coarse and simple because, longer term, the expectation is
that Soft Reserved stops going to System RAM by default and instead
becomes an isolated memory pool that requires opt-in. In many ways the
current behavior is optimized for hardware validation, not applications.
> With CONFIG_CXL_ACPI enabled, future patches will bypass DAX device
> registration via the HMAT and hmem drivers. To avoid missing DAX
> registration for SOFT RESERVED regions, the fallback path must be
> triggered regardless of probe outcome.
>
> No functional changes.
A comment below in case something like this patch moves forward:
>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> drivers/cxl/acpi.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index a1a99ec3f12c..ca06d5acdf8f 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -825,7 +825,7 @@ static int pair_cxl_resource(struct device *dev, void *data)
>
> static int cxl_acpi_probe(struct platform_device *pdev)
> {
> - int rc;
> + int rc = 0;
> struct resource *cxl_res;
> struct cxl_root *cxl_root;
> struct cxl_port *root_port;
> @@ -837,7 +837,7 @@ static int cxl_acpi_probe(struct platform_device *pdev)
> rc = devm_add_action_or_reset(&pdev->dev, cxl_acpi_lock_reset_class,
> &pdev->dev);
> if (rc)
> - return rc;
> + goto out;
No new goto, please. With cleanup.h the momentum is towards eliminating
goto. If you need to do something like this, just wrap the function:
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index a1a99ec3f12c..b50d3aa45ad5 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -823,7 +823,7 @@ static int pair_cxl_resource(struct device *dev, void *data)
return 0;
}
-static int cxl_acpi_probe(struct platform_device *pdev)
+static int __cxl_acpi_probe(struct platform_device *pdev)
{
int rc;
struct resource *cxl_res;
@@ -900,6 +900,15 @@ static int cxl_acpi_probe(struct platform_device *pdev)
return 0;
}
+static int cxl_acpi_probe(struct platform_device *pdev)
+{
+ int rc = __cxl_acpi_probe(pdev);
+
+ /* do something */
+
+ return rc;
+}
+
static const struct acpi_device_id cxl_acpi_ids[] = {
{ "ACPI0017" },
{ },
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v5 2/7] cxl/core: Rename suspend.c to probe_state.c and remove CONFIG_CXL_SUSPEND
2025-07-15 18:04 ` [PATCH v5 2/7] cxl/core: Rename suspend.c to probe_state.c and remove CONFIG_CXL_SUSPEND Smita Koralahalli
@ 2025-07-22 21:44 ` dan.j.williams
0 siblings, 0 replies; 38+ messages in thread
From: dan.j.williams @ 2025-07-22 21:44 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Smita Koralahalli wrote:
> The cxl_mem_active_inc()/dec() and cxl_mem_active() helpers were initially
> introduced to coordinate suspend/resume behavior. However, upcoming
> changes will reuse these helpers to track cxl_mem_probe() activity during
> SOFT RESERVED region handling.
>
> To reflect this broader purpose, rename suspend.c to probe_state.c and
> remove CONFIG_CXL_SUSPEND Kconfig option. These helpers are now always
> built into the CXL core subsystem.
>
> This allows drivers like cxl_acpi to coordinate with cxl_mem for
> region setup and hotplug handling.
I struggle to see how blocking suspend in the presence of CXL memory has
anything to do with soft-reserve handling? Maybe the story is in a
follow-on patch.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration
2025-07-22 21:04 ` dan.j.williams
@ 2025-07-23 0:45 ` Alison Schofield
2025-07-23 7:34 ` dan.j.williams
0 siblings, 1 reply; 38+ messages in thread
From: Alison Schofield @ 2025-07-23 0:45 UTC (permalink / raw)
To: dan.j.williams
Cc: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
On Tue, Jul 22, 2025 at 02:04:00PM -0700, Dan Williams wrote:
> Smita Koralahalli wrote:
> > Refactor cxl_acpi_probe() to use a single exit path so that the fallback
> > DAX registration can be scheduled regardless of probe success or failure.
>
> I do not understand why cxl_acpi needs to be responsible for this,
> especially in the cxl_acpi_probe() failure path. Certainly if
> cxl_acpi_probe() fails, that is a strong signal to give up on the CXL
> subsystem altogether and fall back to vanilla DAX discovery exclusively.
>
> Now, maybe the need for this becomes clearer in follow-on patches.
> However, I would have expected that DAX, which currently arranges for
> CXL to load first, would just flush CXL discovery and make a decision about
> whether to proceed with Soft Reserved or not.
>
> Something like:
>
> DAX CXL
> Scan CXL Windows. Fail on any window
> parsing failures
>
> Launch a work item to flush PCI
> discovery and give a reasonable amount of
> time for cxl_pci and cxl_mem to quiesce
>
> <assumes CXL Windows are discovered
> by virtue of initcall order or
> MODULE_SOFTDEP("pre: cxl_acpi")>
>
> Calls a CXL flush routine to await probe
> completion (will always be racy)
>
> Evaluates if all Soft Reserve has
> cxl_region coverage
>
> if yes: skip publishing CXL intersecting
> Soft Reserve range in iomem, let dax_cxl
> attach to the cxl_region devices
>
> if no: decline the already published
> cxl_dax_regions, notify cxl_acpi to
> shutdown. Install Soft Reserved in iomem
> and create dax_hmem devices for the
> ranges per usual.
This is super coarse. If the CXL region driver sets up 99 regions with
exact matching SR ranges and there are no CXL Windows with unused SR,
then we have a YES!
But if after those 99 successful assemblies, we get one errant window
with a Soft Reserved for which a region never assembles, it's a hard NO.
DAX declines, i.e. tears down the 99 dax_regions and cxl_regions.
Obviously, this is different from the current approach that aimed to
pick up completely unused SRs and the trimmings from SRs that exceeded
region size, and offer them to DAX too.
I'm cringing a bit at the fact that one bad apple (like a cxl device
that doesn't show up for its region) means no CXL devices get managed.
Probably asking the obvious question here. This is what 'we' want,
right?
>
> Something like the above puts all the onus on device-dax to decide if
> CXL is meeting expectations. CXL is only responsible flagging when it
> thinks it has successfully completed init. If device-dax disagrees with
> what CXL has done it can tear down the world without ever attaching
> 'struct cxl_dax_region'. The success/fail is an "all or nothing"
> proposition. Either CXL understands everything or the user needs to
> work with their hardware vendor to fix whatever is giving the CXL driver
> indigestion.
>
> It needs to be coarse and simple because longer term the expectation is
> the Soft Reserved stops going to System RAM by default and instead
> becomes an isolated memory pool that requires opt-in. In many ways the
> current behavior is optimized for hardware validation not applications.
>
> > With CONFIG_CXL_ACPI enabled, future patches will bypass DAX device
> > registration via the HMAT and hmem drivers. To avoid missing DAX
> > registration for SOFT RESERVED regions, the fallback path must be
> > triggered regardless of probe outcome.
> >
> > No functional changes.
>
> A comment below in case something like this patch moves forward:
>
> >
> > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> > ---
> > drivers/cxl/acpi.c | 30 ++++++++++++++++++------------
> > 1 file changed, 18 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> > index a1a99ec3f12c..ca06d5acdf8f 100644
> > --- a/drivers/cxl/acpi.c
> > +++ b/drivers/cxl/acpi.c
> > @@ -825,7 +825,7 @@ static int pair_cxl_resource(struct device *dev, void *data)
> >
> > static int cxl_acpi_probe(struct platform_device *pdev)
> > {
> > - int rc;
> > + int rc = 0;
> > struct resource *cxl_res;
> > struct cxl_root *cxl_root;
> > struct cxl_port *root_port;
> > @@ -837,7 +837,7 @@ static int cxl_acpi_probe(struct platform_device *pdev)
> > rc = devm_add_action_or_reset(&pdev->dev, cxl_acpi_lock_reset_class,
> > &pdev->dev);
> > if (rc)
> > - return rc;
> > + goto out;
>
> No new goto, please. With cleanup.h the momentum is towards eliminating
> goto. If you need to do something like this, just wrap the function:
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index a1a99ec3f12c..b50d3aa45ad5 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -823,7 +823,7 @@ static int pair_cxl_resource(struct device *dev, void *data)
> return 0;
> }
>
> -static int cxl_acpi_probe(struct platform_device *pdev)
> +static int __cxl_acpi_probe(struct platform_device *pdev)
> {
> int rc;
> struct resource *cxl_res;
> @@ -900,6 +900,15 @@ static int cxl_acpi_probe(struct platform_device *pdev)
> return 0;
> }
>
> +static int cxl_acpi_probe(struct platform_device *pdev)
> +{
> + int rc = __cxl_acpi_probe(pdev);
> +
> + /* do something */
> +
> + return rc;
> +}
> +
> static const struct acpi_device_id cxl_acpi_ids[] = {
> { "ACPI0017" },
> { },
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-15 18:04 ` [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion Smita Koralahalli
2025-07-17 0:24 ` Dave Jiang
@ 2025-07-23 7:31 ` dan.j.williams
2025-07-23 16:13 ` dan.j.williams
2025-07-29 15:48 ` Koralahalli Channabasappa, Smita
1 sibling, 2 replies; 38+ messages in thread
From: dan.j.williams @ 2025-07-23 7:31 UTC (permalink / raw)
To: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
Smita Koralahalli wrote:
> Introduce a background worker in cxl_acpi to delay SOFT RESERVE handling
> until the cxl_mem driver has probed at least one device. This coordination
> ensures that DAX registration or fallback handling for soft-reserved
> regions is not triggered prematurely.
>
> The worker waits on cxl_wait_queue, which is signaled via
> cxl_mem_active_inc() during cxl_mem_probe(). Once at least one memory
> device probe is confirmed, the worker invokes wait_for_device_probe()
> to allow the rest of the CXL device hierarchy to complete initialization.
>
> Additionally, it also handles initialization order issues where
> cxl_acpi_probe() may complete before other drivers such as cxl_port or
> cxl_mem have loaded, especially when cxl_acpi and cxl_port are built-in
> and cxl_mem is a loadable module. In such cases, using only
> wait_for_device_probe() is insufficient, as it may return before all
> relevant probes are registered.
Right, but that problem is not solved by this, which still leaves open the
decision of when to give up on this mechanism, and this mechanism does
not tell you when follow-on probe work is complete.
> While region creation happens in cxl_port_probe(), waiting on
> cxl_mem_active() would be sufficient as cxl_mem_probe() can only succeed
> after the port hierarchy is in place. Furthermore, since cxl_mem depends
> on cxl_pci, this also guarantees that cxl_pci has loaded by the time the
> wait completes.
>
> As cxl_mem_active() infrastructure already exists for tracking probe
> activity, cxl_acpi can use it without introducing new coordination
> mechanisms.
I appreciate the instinct to not add anything new, but the module
loading problem is solvable.
If the goal is: "I want to give device-dax a point at which it can make
a go / no-go decision about whether the CXL subsystem has properly
assembled all CXL regions implied by Soft Reserved intersecting with
CXL Windows." Then that is something like the below, only lightly tested
and likely regresses the non-CXL case.
-- 8< --
From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Tue, 22 Jul 2025 16:11:08 -0700
Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
CXL and dax_hmem fight over how "Soft Reserved" (EFI Specific Purpose Memory)
resources are published in the iomem resource tree. The entry blocks some
CXL hotplug flows, and CXL blocks dax_hmem from publishing the memory in
the event that CXL fails to parse the platform configuration.
Towards resolving this conflict (the non-RFC version
of this patch should split these into separate patches):
1/ Defer publishing "Soft Reserved" entries in the iomem resource tree
until the consumer, dax_hmem, is ready to use them.
2/ Fix detection of "Soft Reserved" vs "CXL Window" resource overlaps by
switching from MODULE_SOFTDEP() to request_module() for making sure that
cxl_acpi has had a chance to publish "CXL Window" resources.
3/ Add cxl_pci to the list of modules that need to have had a chance to
scan boot devices such that wait_device_probe() flushes initial CXL
topology discovery.
4/ Add a workqueue that delays consideration of "Soft Reserved" that
overlaps CXL so that the CXL subsystem can complete all of its region
assembly.
For RFC purposes this only solves the reliability of the DAX_CXL_MODE_DROP
case. DAX_CXL_MODE_REGISTER support can follow to shutdown CXL in favor of
vanilla DAX devices as an emergency fallback for platform configuration
quirks and bugs.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
arch/x86/kernel/e820.c | 2 +-
drivers/dax/hmem/device.c | 4 +-
drivers/dax/hmem/hmem.c | 94 +++++++++++++++++++++++++++++++++------
include/linux/ioport.h | 25 +++++++++++
kernel/resource.c | 58 +++++++++++++++++++-----
5 files changed, 156 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c3acbd26408b..aef1ff2cabda 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1153,7 +1153,7 @@ void __init e820__reserve_resources_late(void)
res = e820_res;
for (i = 0; i < e820_table->nr_entries; i++) {
if (!res->parent && res->end)
- insert_resource_expand_to_fit(&iomem_resource, res);
+ insert_resource_late(res);
res++;
}
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index f9e1a76a04a9..22732b729017 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -83,8 +83,8 @@ static __init int hmem_register_one(struct resource *res, void *data)
static __init int hmem_init(void)
{
- walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
- IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
+ walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0,
+ -1, NULL, hmem_register_one);
return 0;
}
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491..0916478e3817 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -59,9 +59,45 @@ static void release_hmem(void *pdev)
platform_device_unregister(pdev);
}
+static enum dax_cxl_mode {
+ DAX_CXL_MODE_DEFER,
+ DAX_CXL_MODE_REGISTER,
+ DAX_CXL_MODE_DROP,
+} dax_cxl_mode;
+
+static int handle_deferred_cxl(struct device *host, int target_nid,
+ const struct resource *res)
+{
+ if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+ IORES_DESC_CXL) != REGION_DISJOINT) {
+ if (dax_cxl_mode == DAX_CXL_MODE_DROP)
+ dev_dbg(host, "dropping CXL range: %pr\n", res);
+ }
+ return 0;
+}
+
+struct dax_defer_work {
+ struct platform_device *pdev;
+ struct work_struct work;
+};
+
+static void process_defer_work(struct work_struct *_work)
+{
+ struct dax_defer_work *work = container_of(_work, typeof(*work), work);
+ struct platform_device *pdev = work->pdev;
+
+ /* relies on cxl_acpi and cxl_pci having had a chance to load */
+ wait_for_device_probe();
+
+ dax_cxl_mode = DAX_CXL_MODE_DROP;
+
+ walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
+}
+
static int hmem_register_device(struct device *host, int target_nid,
const struct resource *res)
{
+ struct dax_defer_work *work = dev_get_drvdata(host);
struct platform_device *pdev;
struct memregion_info info;
long id;
@@ -70,14 +106,21 @@ static int hmem_register_device(struct device *host, int target_nid,
if (IS_ENABLED(CONFIG_CXL_REGION) &&
region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {
- dev_dbg(host, "deferring range to CXL: %pr\n", res);
- return 0;
+ switch (dax_cxl_mode) {
+ case DAX_CXL_MODE_DEFER:
+ dev_dbg(host, "deferring range to CXL: %pr\n", res);
+ schedule_work(&work->work);
+ return 0;
+ case DAX_CXL_MODE_REGISTER:
+ dev_dbg(host, "registering CXL range: %pr\n", res);
+ break;
+ case DAX_CXL_MODE_DROP:
+ dev_dbg(host, "dropping CXL range: %pr\n", res);
+ return 0;
+ }
}
- rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
- IORES_DESC_SOFT_RESERVED);
- if (rc != REGION_INTERSECTS)
- return 0;
+ /* TODO: insert "Soft Reserved" into iomem here */
id = memregion_alloc(GFP_KERNEL);
if (id < 0) {
@@ -123,8 +166,30 @@ static int hmem_register_device(struct device *host, int target_nid,
return rc;
}
+static void kill_defer_work(void *_work)
+{
+ struct dax_defer_work *work = container_of(_work, typeof(*work), work);
+
+ cancel_work_sync(&work->work);
+ kfree(work);
+}
+
static int dax_hmem_platform_probe(struct platform_device *pdev)
{
+ struct dax_defer_work *work = kzalloc(sizeof(*work), GFP_KERNEL);
+ int rc;
+
+ if (!work)
+ return -ENOMEM;
+
+ work->pdev = pdev;
+ INIT_WORK(&work->work, process_defer_work);
+
+ rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, work);
+ if (rc)
+ return rc;
+
+ platform_set_drvdata(pdev, work);
return walk_hmem_resources(&pdev->dev, hmem_register_device);
}
@@ -139,6 +204,16 @@ static __init int dax_hmem_init(void)
{
int rc;
+ /*
+ * Ensure that cxl_acpi and cxl_pci have a chance to kick off
+ * CXL topology discovery at least once before scanning the
+ * iomem resource tree for IORES_DESC_CXL resources.
+ */
+ if (IS_ENABLED(CONFIG_CXL_REGION)) {
+ request_module("cxl_acpi");
+ request_module("cxl_pci");
+ }
+
rc = platform_driver_register(&dax_hmem_platform_driver);
if (rc)
return rc;
@@ -159,13 +234,6 @@ static __exit void dax_hmem_exit(void)
module_init(dax_hmem_init);
module_exit(dax_hmem_exit);
-/* Allow for CXL to define its own dax regions */
-#if IS_ENABLED(CONFIG_CXL_REGION)
-#if IS_MODULE(CONFIG_CXL_ACPI)
-MODULE_SOFTDEP("pre: cxl_acpi");
-#endif
-#endif
-
MODULE_ALIAS("platform:hmem*");
MODULE_ALIAS("platform:hmem_platform*");
MODULE_DESCRIPTION("HMEM DAX: direct access to 'specific purpose' memory");
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index e8b2d6aa4013..4fc6ab518c24 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -232,6 +232,9 @@ struct resource_constraint {
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
+#ifdef CONFIG_EFI_SOFT_RESERVE
+extern struct resource soft_reserve_resource;
+#endif
extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
extern int request_resource(struct resource *root, struct resource *new);
@@ -255,6 +258,22 @@ int adjust_resource(struct resource *res, resource_size_t start,
resource_size_t size);
resource_size_t resource_alignment(struct resource *res);
+
+#ifdef CONFIG_EFI_SOFT_RESERVE
+static inline void insert_resource_late(struct resource *new)
+{
+ if (new->desc == IORES_DESC_SOFT_RESERVED)
+ insert_resource_expand_to_fit(&soft_reserve_resource, new);
+ else
+ insert_resource_expand_to_fit(&iomem_resource, new);
+}
+#else
+static inline void insert_resource_late(struct resource *new)
+{
+ insert_resource_expand_to_fit(&iomem_resource, new);
+}
+#endif
+
/**
* resource_set_size - Calculate resource end address from size and start
* @res: Resource descriptor
@@ -409,6 +428,12 @@ walk_system_ram_res_rev(u64 start, u64 end, void *arg,
extern int
walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
void *arg, int (*func)(struct resource *, void *));
+int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
+ u64 start, u64 end, void *arg,
+ int (*func)(struct resource *, void *));
+int region_intersects_soft_reserve(struct resource *root, resource_size_t start,
+ size_t size, unsigned long flags,
+ unsigned long desc);
struct resource *devm_request_free_mem_region(struct device *dev,
struct resource *base, unsigned long size);
diff --git a/kernel/resource.c b/kernel/resource.c
index 8d3e6ed0bdc1..fd90990c31c6 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -321,8 +321,8 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
}
/**
- * find_next_iomem_res - Finds the lowest iomem resource that covers part of
- * [@start..@end].
+ * find_next_res - Finds the lowest resource that covers part of
+ * [@start..@end].
*
* If a resource is found, returns 0 and @*res is overwritten with the part
* of the resource that's within [@start..@end]; if none is found, returns
@@ -337,9 +337,9 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
* The caller must specify @start, @end, @flags, and @desc
* (which may be IORES_DESC_NONE).
*/
-static int find_next_iomem_res(resource_size_t start, resource_size_t end,
- unsigned long flags, unsigned long desc,
- struct resource *res)
+static int find_next_res(struct resource *parent, resource_size_t start,
+ resource_size_t end, unsigned long flags,
+ unsigned long desc, struct resource *res)
{
struct resource *p;
@@ -351,7 +351,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_lock(&resource_lock);
- for_each_resource(&iomem_resource, p, false) {
+ for_each_resource(parent, p, false) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -382,16 +382,23 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
return p ? 0 : -ENODEV;
}
-static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
- unsigned long flags, unsigned long desc,
- void *arg,
- int (*func)(struct resource *, void *))
+static int find_next_iomem_res(resource_size_t start, resource_size_t end,
+ unsigned long flags, unsigned long desc,
+ struct resource *res)
+{
+ return find_next_res(&iomem_resource, start, end, flags, desc, res);
+}
+
+static int walk_res_desc(struct resource *parent, resource_size_t start,
+ resource_size_t end, unsigned long flags,
+ unsigned long desc, void *arg,
+ int (*func)(struct resource *, void *))
{
struct resource res;
int ret = -EINVAL;
while (start < end &&
- !find_next_iomem_res(start, end, flags, desc, &res)) {
+ !find_next_res(parent, start, end, flags, desc, &res)) {
ret = (*func)(&res, arg);
if (ret)
break;
@@ -402,6 +409,15 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
return ret;
}
+static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
+ unsigned long flags, unsigned long desc,
+ void *arg,
+ int (*func)(struct resource *, void *))
+{
+ return walk_res_desc(&iomem_resource, start, end, flags, desc, arg, func);
+}
+
+
/**
* walk_iomem_res_desc - Walks through iomem resources and calls func()
* with matching resource ranges.
@@ -426,6 +442,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
}
EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
+#ifdef CONFIG_EFI_SOFT_RESERVE
+struct resource soft_reserve_resource = {
+ .name = "Soft Reserved",
+ .start = 0,
+ .end = -1,
+ .desc = IORES_DESC_SOFT_RESERVED,
+ .flags = IORESOURCE_MEM,
+};
+EXPORT_SYMBOL_GPL(soft_reserve_resource);
+
+int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
+ u64 start, u64 end, void *arg,
+ int (*func)(struct resource *, void *))
+{
+ return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
+ arg, func);
+}
+EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+#endif
+
/*
* This function calls the @func callback against all memory ranges of type
* System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration
2025-07-23 0:45 ` Alison Schofield
@ 2025-07-23 7:34 ` dan.j.williams
0 siblings, 0 replies; 38+ messages in thread
From: dan.j.williams @ 2025-07-23 7:34 UTC (permalink / raw)
To: Alison Schofield, dan.j.williams
Cc: Smita Koralahalli, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm, Davidlohr Bueso, Jonathan Cameron, Dave Jiang,
Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
Alison Schofield wrote:
> On Tue, Jul 22, 2025 at 02:04:00PM -0700, Dan Williams wrote:
> > Smita Koralahalli wrote:
> > > Refactor cxl_acpi_probe() to use a single exit path so that the fallback
> > > DAX registration can be scheduled regardless of probe success or failure.
> >
> > I do not understand why cxl_acpi needs to be responsible for this,
> > especially in the cxl_acpi_probe() failure path. Certainly if
> > cxl_acpi_probe() fails, that is a strong signal to give up on the CXL
>> subsystem altogether and fall back to vanilla DAX discovery exclusively.
> >
> > Now, maybe the need for this becomes clearer in follow-on patches.
> > However, I would have expected that DAX, which currently arranges for
>> CXL to load first, would just flush CXL discovery and make a decision about
>> whether to proceed with Soft Reserved or not.
> >
> > Something like:
> >
> > DAX CXL
> > Scan CXL Windows. Fail on any window
> > parsing failures
> >
> > Launch a work item to flush PCI
> > discovery and give a reaonable amount of
> > time for cxl_pci and cxl_mem to quiesce
> >
> > <assumes CXL Windows are discovered
> > by virtue of initcall order or
> > MODULE_SOFTDEP("pre: cxl_acpi")>
> >
> > Calls a CXL flush routine to await probe
> > completion (will always be racy)
> >
> > Evaluates if all Soft Reserve has
> > cxl_region coverage
> >
> > if yes: skip publishing CXL intersecting
> > Soft Reserve range in iomem, let dax_cxl
> > attach to the cxl_region devices
> >
> > if no: decline the already published
> > cxl_dax_regions, notify cxl_acpi to
> > shutdown. Install Soft Reserved in iomem
> > and create dax_hmem devices for the
> > ranges per usual.
>
> This is super coarse. If the CXL region driver sets up 99 regions with
> exact matching SR ranges and there are no CXL Windows with unused SR,
> then we have a YES!
>
> But if after those 99 successful assemblies, we get one errant window
> with a Soft Reserved for which a region never assembles, it's a hard NO.
> DAX declines, ie teardowns the 99 dax_regions and cxl_regions.
Exactly.
> Obviously, this is different from the current approach that aimed to
> pick up completely unused SRs and the trimmings from SRs that exceeded
> region size, and offer them to DAX too.
>
> I'm cringing a bit at the fact that one bad apple (like a cxl device
> that doesn't show up for its region) means no CXL devices get managed.
Think of the linear extended cache case where if you just look at the
endpoint CXL decoders you get half of the capacity and none of the
warning that the DDR side of the cache cannot be managed. Consider the
other address translation error cases where endpoint decoders do not
tell the full story.
Now think of the contract of what the CXL driver offers. It says, "hey
if you change this configuration I have high confidence that I have
managed everything associated with this region throughout the whole
platform config", or "if an error happens anywhere within a CXL window
you can rely on my identification of the system components to suspect".
> Probably asking the obvious question here. This is what 'we' want,
> right?
So, in the end it is not a fair fight. It is a handful of Linux
developers vs multiple platform BIOS teams per vendor and per hardware
generation. The amount of Linux-tripping, specification-stretching
innovation is eye-watering.
The downside of being drastic and coarse is outweighed by the benefit.
The status quo is that end users get nothing (stranded memory), or
low-confidence partial configs with latent RAS problems.
The coarse policy is a spec-compliant, safe fallback that gets end users
access to the memory the BIOS/firmware promised. It buys time for the
platform vendor to get Linux fixed, or even better, these Linux
conniptions catch problems earlier in the platform qualification cycle.
The alternative fine-grained policy hopes that the memory the CXL driver
gives up on was not critical or indicative of a deeper misunderstanding
of the platform. I do not want to debug hopeful guesses on top of
undefined breakage.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling
2025-07-16 6:01 ` Koralahalli Channabasappa, Smita
2025-07-16 20:20 ` Alison Schofield
@ 2025-07-23 15:24 ` dan.j.williams
1 sibling, 0 replies; 38+ messages in thread
From: dan.j.williams @ 2025-07-23 15:24 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, Alison Schofield
Cc: linux-cxl, linux-kernel, nvdimm, linux-fsdevel, linux-pm,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
Koralahalli Channabasappa, Smita wrote:
[..]
> That said, I'm still evaluating better options to more robustly
> coordinate probe ordering between cxl_acpi, cxl_port, cxl_mem and
> cxl_region and looking for suggestions here.
I never quite understood the arguments around why
wait_for_device_probe() does not work, but I did find a bug in my prior
thinking on the way towards this RFC [1]. The misunderstanding was that
MODULE_SOFTDEP() only guarantees that the module gets loaded eventually,
but it does not guarantee that the softdep has completed init before the
caller performs its own init.
It works sometimes, and that is probably what misled me about that
contract. request_module() is synchronous. With that in place I now see
that wait_for_device_probe() does the right thing. It flushes cxl_pci
attach for devices present at boot, and all follow-on probe work gets
flushed as well.
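Condensed, the ordering being relied on looks roughly like this (names
follow the RFC above; this is an illustrative sketch rather than the final
code, and in the RFC the wait actually happens from the deferred work item):

```c
/*
 * Sketch: request_module() is synchronous, so cxl_acpi/cxl_pci have had
 * a chance to register and kick off probing before dax_hmem continues.
 * wait_for_device_probe() then flushes the boot-time probe work, giving
 * a stable point to inspect what CXL did with the Soft Reserved ranges.
 */
static __init int dax_hmem_init_ordering(void)
{
	if (IS_ENABLED(CONFIG_CXL_REGION)) {
		request_module("cxl_acpi");	/* publish "CXL Window" resources */
		request_module("cxl_pci");	/* start endpoint discovery */
	}

	wait_for_device_probe();	/* flush probe work for boot devices */
	return 0;
}
```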
With that in hand the RFC now has a stable quiesce point to walk the CXL
topology and make decisions. The RFC is effectively a fix for platforms
where CXL loses the MODULE_SOFTDEP() race.
[1]: http://lore.kernel.org/68808fb4e4cbf_137e6b100cc@dwillia2-xfh.jf.intel.com.notmuch
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-23 7:31 ` dan.j.williams
@ 2025-07-23 16:13 ` dan.j.williams
2025-08-05 3:58 ` Zhijian Li (Fujitsu)
2025-07-29 15:48 ` Koralahalli Channabasappa, Smita
1 sibling, 1 reply; 38+ messages in thread
From: dan.j.williams @ 2025-07-23 16:13 UTC (permalink / raw)
To: dan.j.williams, Smita Koralahalli, linux-cxl, linux-kernel,
nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Dan Williams, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Smita Koralahalli, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati, Zhijian Li
dan.j.williams@ wrote:
[..]
> If the goal is: "I want to give device-dax a point at which it can make
> a go / no-go decision about whether the CXL subsystem has properly
> assembled all CXL regions implied by Soft Reserved intersecting with
> CXL Windows." Then that is something like the below, only lightly tested
> and likely regresses the non-CXL case.
>
> -- 8< --
> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Tue, 22 Jul 2025 16:11:08 -0700
> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
Likely needs this incremental change to prevent DEV_DAX_HMEM from being
built-in when CXL is not. This still leaves the awkward scenario of CXL
enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
safely fails in devdax only / fallback mode, but something to
investigate when respinning on top of this.
-- 8< --
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d656e4c0eb84..3683bb3f2311 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -48,6 +48,8 @@ config DEV_DAX_CXL
tristate "CXL DAX: direct access to CXL RAM regions"
depends on CXL_BUS && CXL_REGION && DEV_DAX
default CXL_REGION && DEV_DAX
+ depends on CXL_ACPI >= DEV_DAX_HMEM
+ depends on CXL_PCI >= DEV_DAX_HMEM
help
CXL RAM regions are either mapped by platform-firmware
and published in the initial system-memory map as "System RAM", mapped
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 0916478e3817..8bcd104111a8 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -103,7 +103,7 @@ static int hmem_register_device(struct device *host, int target_nid,
long id;
int rc;
- if (IS_ENABLED(CONFIG_CXL_REGION) &&
+ if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {
switch (dax_cxl_mode) {
@@ -209,7 +209,7 @@ static __init int dax_hmem_init(void)
* CXL topology discovery at least once before scanning the
* iomem resource tree for IORES_DESC_CXL resources.
*/
- if (IS_ENABLED(CONFIG_CXL_REGION)) {
+ if (IS_ENABLED(CONFIG_DEV_DAX_CXL)) {
request_module("cxl_acpi");
request_module("cxl_pci");
}
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-23 7:31 ` dan.j.williams
2025-07-23 16:13 ` dan.j.williams
@ 2025-07-29 15:48 ` Koralahalli Channabasappa, Smita
2025-07-30 16:09 ` dan.j.williams
1 sibling, 1 reply; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-29 15:48 UTC (permalink / raw)
To: dan.j.williams, linux-cxl, linux-kernel, nvdimm, linux-fsdevel,
linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
Hi Dan,
On 7/23/2025 12:31 AM, dan.j.williams@intel.com wrote:
> Smita Koralahalli wrote:
>> Introduce a background worker in cxl_acpi to delay SOFT RESERVE handling
>> until the cxl_mem driver has probed at least one device. This coordination
>> ensures that DAX registration or fallback handling for soft-reserved
>> regions is not triggered prematurely.
>>
>> The worker waits on cxl_wait_queue, which is signaled via
>> cxl_mem_active_inc() during cxl_mem_probe(). Once at least one memory
>> device probe is confirmed, the worker invokes wait_for_device_probe()
>> to allow the rest of the CXL device hierarchy to complete initialization.
>>
>> Additionally, it also handles initialization order issues where
>> cxl_acpi_probe() may complete before other drivers such as cxl_port or
>> cxl_mem have loaded, especially when cxl_acpi and cxl_port are built-in
>> and cxl_mem is a loadable module. In such cases, using only
>> wait_for_device_probe() is insufficient, as it may return before all
>> relevant probes are registered.
>
> Right, but that problem is not solved by this which still leaves the
> decision on when to give up on this mechanism, and this mechanism does
> not tell you when follow-on probe work is complete.
>
>> While region creation happens in cxl_port_probe(), waiting on
>> cxl_mem_active() would be sufficient as cxl_mem_probe() can only succeed
>> after the port hierarchy is in place. Furthermore, since cxl_mem depends
>> on cxl_pci, this also guarantees that cxl_pci has loaded by the time the
>> wait completes.
>>
>> As cxl_mem_active() infrastructure already exists for tracking probe
>> activity, cxl_acpi can use it without introducing new coordination
>> mechanisms.
>
> I appreciate the instinct to not add anything new, but the module
> loading problem is solvable.
>
> If the goal is: "I want to give device-dax a point at which it can make
> a go / no-go decision about whether the CXL subsystem has properly
> assembled all CXL regions implied by Soft Reserved intersecting with
> CXL Windows." Then that is something like the below, only lightly tested
> and likely regresses the non-CXL case.
>
> -- 8< --
> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Tue, 22 Jul 2025 16:11:08 -0700
> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>
> CXL and dax_hmem fight over "Soft Reserved" (EFI Specific Purpose Memory)
> resources published in the iomem resource tree. The entry blocks some
> CXL hotplug flows, and CXL blocks dax_hmem from publishing the memory in
> the event that CXL fails to parse the platform configuration.
>
> Towards resolving this conflict: (the non-RFC version
> of this patch should split these into separate patches):
>
> 1/ Defer publishing "Soft Reserved" entries in the iomem resource tree
> until the consumer, dax_hmem, is ready to use them.
>
> 2/ Fix detection of "Soft Reserved" vs "CXL Window" resource overlaps by
> switching from MODULE_SOFTDEP() to request_module() for making sure that
> cxl_acpi has had a chance to publish "CXL Window" resources.
>
> 3/ Add cxl_pci to the list of modules that need to have had a chance to
> scan boot devices such that wait_for_device_probe() flushes initial CXL
> topology discovery.
>
> 4/ Add a workqueue that delays consideration of "Soft Reserved" that
> overlaps CXL so that the CXL subsystem can complete all of its region
> assembly.
>
> For RFC purposes this only solves the reliability of the DAX_CXL_MODE_DROP
> case. DAX_CXL_MODE_REGISTER support can follow to shutdown CXL in favor of
> vanilla DAX devices as an emergency fallback for platform configuration
> quirks and bugs.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> arch/x86/kernel/e820.c | 2 +-
> drivers/dax/hmem/device.c | 4 +-
> drivers/dax/hmem/hmem.c | 94 +++++++++++++++++++++++++++++++++------
> include/linux/ioport.h | 25 +++++++++++
> kernel/resource.c | 58 +++++++++++++++++++-----
> 5 files changed, 156 insertions(+), 27 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c3acbd26408b..aef1ff2cabda 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1153,7 +1153,7 @@ void __init e820__reserve_resources_late(void)
> res = e820_res;
> for (i = 0; i < e820_table->nr_entries; i++) {
> if (!res->parent && res->end)
> - insert_resource_expand_to_fit(&iomem_resource, res);
> + insert_resource_late(res);
> res++;
> }
>
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index f9e1a76a04a9..22732b729017 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -83,8 +83,8 @@ static __init int hmem_register_one(struct resource *res, void *data)
>
> static __init int hmem_init(void)
> {
> - walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
> - IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
> + walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0,
> + -1, NULL, hmem_register_one);
> return 0;
> }
>
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5e7c53f18491..0916478e3817 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -59,9 +59,45 @@ static void release_hmem(void *pdev)
> platform_device_unregister(pdev);
> }
>
> +static enum dax_cxl_mode {
> + DAX_CXL_MODE_DEFER,
> + DAX_CXL_MODE_REGISTER,
> + DAX_CXL_MODE_DROP,
> +} dax_cxl_mode;
> +
> +static int handle_deferred_cxl(struct device *host, int target_nid,
> + const struct resource *res)
> +{
> + if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> + IORES_DESC_CXL) != REGION_DISJOINT) {
> + if (dax_cxl_mode == DAX_CXL_MODE_DROP)
> + dev_dbg(host, "dropping CXL range: %pr\n", res);
> + }
> + return 0;
> +}
> +
> +struct dax_defer_work {
> + struct platform_device *pdev;
> + struct work_struct work;
> +};
> +
> +static void process_defer_work(struct work_struct *_work)
> +{
> + struct dax_defer_work *work = container_of(_work, typeof(*work), work);
> + struct platform_device *pdev = work->pdev;
> +
> + /* relies on cxl_acpi and cxl_pci having had a chance to load */
> + wait_for_device_probe();
> +
> + dax_cxl_mode = DAX_CXL_MODE_DROP;
> +
> + walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
> +}
> +
> static int hmem_register_device(struct device *host, int target_nid,
> const struct resource *res)
> {
> + struct dax_defer_work *work = dev_get_drvdata(host);
> struct platform_device *pdev;
> struct memregion_info info;
> long id;
> @@ -70,14 +106,21 @@ static int hmem_register_device(struct device *host, int target_nid,
> if (IS_ENABLED(CONFIG_CXL_REGION) &&
> region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> IORES_DESC_CXL) != REGION_DISJOINT) {
I may be wrong here, but could this check fail? While request_module()
ensures that cxl_acpi and cxl_pci are requested for loading, it does not
guarantee that either has completed initialization or that region
enumeration (i.e add_cxl_resources()) has finished by the time we reach
this check.
We also haven't called wait_for_device_probe() at this point, which is
typically used to block until all pending device probes are complete.
Thanks
Smita
> - dev_dbg(host, "deferring range to CXL: %pr\n", res);
> - return 0;
> + switch (dax_cxl_mode) {
> + case DAX_CXL_MODE_DEFER:
> + dev_dbg(host, "deferring range to CXL: %pr\n", res);
> + schedule_work(&work->work);
> + return 0;
> + case DAX_CXL_MODE_REGISTER:
> + dev_dbg(host, "registering CXL range: %pr\n", res);
> + break;
> + case DAX_CXL_MODE_DROP:
> + dev_dbg(host, "dropping CXL range: %pr\n", res);
> + return 0;
> + }
> }
>
> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> - IORES_DESC_SOFT_RESERVED);
> - if (rc != REGION_INTERSECTS)
> - return 0;
> + /* TODO: insert "Soft Reserved" into iomem here */
>
> id = memregion_alloc(GFP_KERNEL);
> if (id < 0) {
> @@ -123,8 +166,30 @@ static int hmem_register_device(struct device *host, int target_nid,
> return rc;
> }
>
> +static void kill_defer_work(void *_work)
> +{
> + struct dax_defer_work *work = container_of(_work, typeof(*work), work);
> +
> + cancel_work_sync(&work->work);
> + kfree(work);
> +}
> +
> static int dax_hmem_platform_probe(struct platform_device *pdev)
> {
> + struct dax_defer_work *work = kzalloc(sizeof(*work), GFP_KERNEL);
> + int rc;
> +
> + if (!work)
> + return -ENOMEM;
> +
> + work->pdev = pdev;
> + INIT_WORK(&work->work, process_defer_work);
> +
> + rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, work);
> + if (rc)
> + return rc;
> +
> + platform_set_drvdata(pdev, work);
> return walk_hmem_resources(&pdev->dev, hmem_register_device);
> }
>
> @@ -139,6 +204,16 @@ static __init int dax_hmem_init(void)
> {
> int rc;
>
> + /*
> + * Ensure that cxl_acpi and cxl_pci have a chance to kick off
> + * CXL topology discovery at least once before scanning the
> + * iomem resource tree for IORES_DESC_CXL resources.
> + */
> + if (IS_ENABLED(CONFIG_CXL_REGION)) {
> + request_module("cxl_acpi");
> + request_module("cxl_pci");
> + }
> +
> rc = platform_driver_register(&dax_hmem_platform_driver);
> if (rc)
> return rc;
> @@ -159,13 +234,6 @@ static __exit void dax_hmem_exit(void)
> module_init(dax_hmem_init);
> module_exit(dax_hmem_exit);
>
> -/* Allow for CXL to define its own dax regions */
> -#if IS_ENABLED(CONFIG_CXL_REGION)
> -#if IS_MODULE(CONFIG_CXL_ACPI)
> -MODULE_SOFTDEP("pre: cxl_acpi");
> -#endif
> -#endif
> -
> MODULE_ALIAS("platform:hmem*");
> MODULE_ALIAS("platform:hmem_platform*");
> MODULE_DESCRIPTION("HMEM DAX: direct access to 'specific purpose' memory");
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index e8b2d6aa4013..4fc6ab518c24 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -232,6 +232,9 @@ struct resource_constraint {
> /* PC/ISA/whatever - the normal PC address spaces: IO and memory */
> extern struct resource ioport_resource;
> extern struct resource iomem_resource;
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +extern struct resource soft_reserve_resource;
> +#endif
>
> extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
> extern int request_resource(struct resource *root, struct resource *new);
> @@ -255,6 +258,22 @@ int adjust_resource(struct resource *res, resource_size_t start,
> resource_size_t size);
> resource_size_t resource_alignment(struct resource *res);
>
> +
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +static inline void insert_resource_late(struct resource *new)
> +{
> + if (new->desc == IORES_DESC_SOFT_RESERVED)
> + insert_resource_expand_to_fit(&soft_reserve_resource, new);
> + else
> + insert_resource_expand_to_fit(&iomem_resource, new);
> +}
> +#else
> +static inline void insert_resource_late(struct resource *new)
> +{
> + insert_resource_expand_to_fit(&iomem_resource, new);
> +}
> +#endif
> +
> /**
> * resource_set_size - Calculate resource end address from size and start
> * @res: Resource descriptor
> @@ -409,6 +428,12 @@ walk_system_ram_res_rev(u64 start, u64 end, void *arg,
> extern int
> walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
> void *arg, int (*func)(struct resource *, void *));
> +int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
> + u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *));
> +int region_intersects_soft_reserve(struct resource *root, resource_size_t start,
> + size_t size, unsigned long flags,
> + unsigned long desc);
>
> struct resource *devm_request_free_mem_region(struct device *dev,
> struct resource *base, unsigned long size);
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 8d3e6ed0bdc1..fd90990c31c6 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -321,8 +321,8 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
> }
>
> /**
> - * find_next_iomem_res - Finds the lowest iomem resource that covers part of
> - * [@start..@end].
> + * find_next_res - Finds the lowest resource that covers part of
> + * [@start..@end].
> *
> * If a resource is found, returns 0 and @*res is overwritten with the part
> * of the resource that's within [@start..@end]; if none is found, returns
> @@ -337,9 +337,9 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
> * The caller must specify @start, @end, @flags, and @desc
> * (which may be IORES_DESC_NONE).
> */
> -static int find_next_iomem_res(resource_size_t start, resource_size_t end,
> - unsigned long flags, unsigned long desc,
> - struct resource *res)
> +static int find_next_res(struct resource *parent, resource_size_t start,
> + resource_size_t end, unsigned long flags,
> + unsigned long desc, struct resource *res)
> {
> struct resource *p;
>
> @@ -351,7 +351,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
>
> read_lock(&resource_lock);
>
> - for_each_resource(&iomem_resource, p, false) {
> + for_each_resource(parent, p, false) {
> /* If we passed the resource we are looking for, stop */
> if (p->start > end) {
> p = NULL;
> @@ -382,16 +382,23 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
> return p ? 0 : -ENODEV;
> }
>
> -static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
> - unsigned long flags, unsigned long desc,
> - void *arg,
> - int (*func)(struct resource *, void *))
> +static int find_next_iomem_res(resource_size_t start, resource_size_t end,
> + unsigned long flags, unsigned long desc,
> + struct resource *res)
> +{
> + return find_next_res(&iomem_resource, start, end, flags, desc, res);
> +}
> +
> +static int walk_res_desc(struct resource *parent, resource_size_t start,
> + resource_size_t end, unsigned long flags,
> + unsigned long desc, void *arg,
> + int (*func)(struct resource *, void *))
> {
> struct resource res;
> int ret = -EINVAL;
>
> while (start < end &&
> - !find_next_iomem_res(start, end, flags, desc, &res)) {
> + !find_next_res(parent, start, end, flags, desc, &res)) {
> ret = (*func)(&res, arg);
> if (ret)
> break;
> @@ -402,6 +409,15 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
> return ret;
> }
>
> +static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
> + unsigned long flags, unsigned long desc,
> + void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + return walk_res_desc(&iomem_resource, start, end, flags, desc, arg, func);
> +}
> +
> +
> /**
> * walk_iomem_res_desc - Walks through iomem resources and calls func()
> * with matching resource ranges.
> @@ -426,6 +442,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
> }
> EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
>
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> +struct resource soft_reserve_resource = {
> + .name = "Soft Reserved",
> + .start = 0,
> + .end = -1,
> + .desc = IORES_DESC_SOFT_RESERVED,
> + .flags = IORESOURCE_MEM,
> +};
> +EXPORT_SYMBOL_GPL(soft_reserve_resource);
> +
> +int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
> + u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
> + arg, func);
> +}
> +EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
> +#endif
> +
> /*
> * This function calls the @func callback against all memory ranges of type
> * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-29 15:48 ` Koralahalli Channabasappa, Smita
@ 2025-07-30 16:09 ` dan.j.williams
0 siblings, 0 replies; 38+ messages in thread
From: dan.j.williams @ 2025-07-30 16:09 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, dan.j.williams, linux-cxl,
linux-kernel, nvdimm, linux-fsdevel, linux-pm
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Yao Xingtao, Peter Zijlstra, Greg KH,
Nathan Fontenot, Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Zhijian Li
Koralahalli Channabasappa, Smita wrote:
[..]
> > static int hmem_register_device(struct device *host, int target_nid,
> > const struct resource *res)
> > {
> > + struct dax_defer_work *work = dev_get_drvdata(host);
> > struct platform_device *pdev;
> > struct memregion_info info;
> > long id;
> > @@ -70,14 +106,21 @@ static int hmem_register_device(struct device *host, int target_nid,
> > if (IS_ENABLED(CONFIG_CXL_REGION) &&
> > region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> > IORES_DESC_CXL) != REGION_DISJOINT) {
>
> I may be wrong here, but could this check fail?
It can fail, but for the case where ACPI0017 is present and CXL windows
exist, the failure cases would only be the extreme ones like OOM killer.
> While request_module() ensures that cxl_acpi and cxl_pci are requested
> for loading, it does not guarantee that either has completed
> initialization or that region enumeration (i.e add_cxl_resources())
> has finished by the time we reach this check.
No, outside of someone doing something silly like passing
"driver_async_probe=cxl_acpi" on the kernel command line,
request_module() will complete synchronously (btw, we should close that
possibility off with PROBE_FORCE_SYNCHRONOUS).
When request_module() returns, module_init() for the requested module
will have completed. ACPI devices will have been enumerated by this
point, so cxl_acpi_probe() will have also run by the time module_init()
completes.
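As a side note, a minimal sketch of closing off the async-probe hole with
PROBE_FORCE_SYNCHRONOUS (illustrative only, using a placeholder driver
name; not from the posted patches):
```c
#include <linux/module.h>
#include <linux/platform_device.h>

static int example_probe(struct platform_device *pdev)
{
	/* normal probe work */
	return 0;
}

static struct platform_driver example_driver = {
	.probe = example_probe,
	.driver = {
		.name = "example",
		/*
		 * Force synchronous probing so callers of
		 * request_module() can assume probe has completed on
		 * return, even if "driver_async_probe=" is passed on
		 * the kernel command line.
		 */
		.probe_type = PROBE_FORCE_SYNCHRONOUS,
	},
};
module_platform_driver(example_driver);
MODULE_DESCRIPTION("PROBE_FORCE_SYNCHRONOUS sketch");
MODULE_LICENSE("GPL");
```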
> We also haven't called wait_for_device_probe() at this point, which is
> typically used to block until all pending device probes are complete.
wait_for_device_probe() is only needed for async probing, deferred
probing, and dependent device probing. cxl_acpi is none of those cases.
ACPI devices are always enumerated before userspace is up, so the
initial driver attach can always be assumed to have completed in
module_init context.
wait_for_device_probe() is needed for cxl_pci attach because cxl_pci
attach is async and it creates dependent devices that fire off their own
module requests.
As I noted in the changelog, MODULE_SOFTDEP() is not reliable for
ordering, but request_module() is. We could go so far as to use symbol
dependencies to require module loading to succeed, but I don't think
that is needed here.
See that approach in the for-6.18/cxl-probe-order RFC branch for cxl_mem
and cxl_port:
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.18/cxl-probe-order
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-07-23 16:13 ` dan.j.williams
@ 2025-08-05 3:58 ` Zhijian Li (Fujitsu)
2025-08-20 23:14 ` Alison Schofield
0 siblings, 1 reply; 38+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-08-05 3:58 UTC (permalink / raw)
To: dan.j.williams@intel.com, Smita Koralahalli,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
linux-pm@vger.kernel.org
Cc: Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Xingtao Yao (Fujitsu), Peter Zijlstra,
Greg KH, Nathan Fontenot, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati
Hi Dan and Smita,
On 24/07/2025 00:13, dan.j.williams@intel.com wrote:
> dan.j.williams@ wrote:
> [..]
>> If the goal is: "I want to give device-dax a point at which it can make
>> a go / no-go decision about whether the CXL subsystem has properly
>> assembled all CXL regions implied by Soft Reserved intersecting with
>> CXL Windows." Then that is something like the below, only lightly tested
>> and likely regresses the non-CXL case.
>>
>> -- 8< --
>> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>> From: Dan Williams <dan.j.williams@intel.com>
>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>
> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
> built-in when CXL is not. This still leaves the awkward scenario of CXL
> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
> safely fails in devdax only / fallback mode, but something to
> investigate when respinning on top of this.
>
Thank you for your RFC; I find your proposal remarkably compelling, as it adeptly addresses the issues I am currently facing.
To begin with, I still encountered several issues with your patch (considering the patch is at the RFC stage, I think it is already quite commendable):
1. Some resources described by SRAT are wrongly identified as System RAM (kmem), such as the following: 200000000-5bffffff.
```
200000000-5bffffff : dax6.0
200000000-5bffffff : System RAM (kmem)
5c0001128-5c00011b7 : port1
5d0000000-64ffffff : CXL Window 0
5d0000000-64ffffff : region0
5d0000000-64ffffff : dax0.0
5d0000000-64ffffff : System RAM (kmem)
680000000-e7ffffff : PCI Bus 0000:00
[root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
[ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064ffffff] soft reserved
[ 0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bffffff] hotplug
```
2. Triggers dev_warn and dev_err:
```
[root@rdma-server ~]# journalctl -p err -p warning --dmesg
...snip...
Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17ffffff could not reserve region
Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
```
3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
On failure:
```
100000000-27ffffff : System RAM
5c0001128-5c00011b7 : port1
5c0011128-5c00111b7 : port2
5d0000000-6cffffff : CXL Window 0
6d0000000-7cffffff : CXL Window 1
7000000000-700000ffff : PCI Bus 0000:0c
7000000000-700000ffff : 0000:0c:00.0
7000001080-70000010d7 : mem1
```
On success:
```
5d0000000-7cffffff : dax0.0
5d0000000-7cffffff : System RAM (kmem)
5d0000000-6cffffff : CXL Window 0
6d0000000-7cffffff : CXL Window 1
```
In terms of issues 1 and 2, this arises because hmem_register_device() attempts to register resources of all "HMEM devices," whereas we only need to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO will address this.
```
- rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
- IORES_DESC_SOFT_RESERVED);
- if (rc != REGION_INTERSECTS)
- return 0;
+ /* TODO: insert "Soft Reserved" into iomem here */
```
Regarding issue 3 (which exists in the current situation), this could be because it cannot ensure that dax_hmem_probe() executes prior to cxl_acpi_probe() when CXL_REGION is disabled.
I am pleased that you have pushed the patch to the cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its integration into the upstream during the v6.18 merge window.
Besides the current TODO, you also mentioned that this RFC PATCH must be further subdivided into several patches, so there remains significant work to be done.
If my understanding is correct, you will personally continue to push this patch forward, right?
Smita,
Do you have any additional thoughts on this proposal from your side?
Thanks
Zhijian
> -- 8< --
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d656e4c0eb84..3683bb3f2311 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -48,6 +48,8 @@ config DEV_DAX_CXL
> tristate "CXL DAX: direct access to CXL RAM regions"
> depends on CXL_BUS && CXL_REGION && DEV_DAX
> default CXL_REGION && DEV_DAX
> + depends on CXL_ACPI >= DEV_DAX_HMEM
> + depends on CXL_PCI >= DEV_DAX_HMEM
> help
> CXL RAM regions are either mapped by platform-firmware
> and published in the initial system-memory map as "System RAM", mapped
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 0916478e3817..8bcd104111a8 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -103,7 +103,7 @@ static int hmem_register_device(struct device *host, int target_nid,
> long id;
> int rc;
>
> - if (IS_ENABLED(CONFIG_CXL_REGION) &&
> + if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
> region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> IORES_DESC_CXL) != REGION_DISJOINT) {
> switch (dax_cxl_mode) {
> @@ -209,7 +209,7 @@ static __init int dax_hmem_init(void)
> * CXL topology discovery at least once before scanning the
> * iomem resource tree for IORES_DESC_CXL resources.
> */
> - if (IS_ENABLED(CONFIG_CXL_REGION)) {
> + if (IS_ENABLED(CONFIG_DEV_DAX_CXL)) {
> request_module("cxl_acpi");
> request_module("cxl_pci");
> }
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-05 3:58 ` Zhijian Li (Fujitsu)
@ 2025-08-20 23:14 ` Alison Schofield
2025-08-21 2:30 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 38+ messages in thread
From: Alison Schofield @ 2025-08-20 23:14 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: dan.j.williams@intel.com, Smita Koralahalli,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
linux-pm@vger.kernel.org, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Xingtao Yao (Fujitsu), Peter Zijlstra,
Greg KH, Nathan Fontenot, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati
On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
> Hi Dan and Smita,
>
>
> On 24/07/2025 00:13, dan.j.williams@intel.com wrote:
> > dan.j.williams@ wrote:
> > [..]
> >> If the goal is: "I want to give device-dax a point at which it can make
> >> a go / no-go decision about whether the CXL subsystem has properly
> >> assembled all CXL regions implied by Soft Reserved intersecting with
> >> CXL Windows." Then that is something like the below, only lightly tested
> >> and likely regresses the non-CXL case.
> >>
> >> -- 8< --
> >> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
> >> From: Dan Williams <dan.j.williams@intel.com>
> >> Date: Tue, 22 Jul 2025 16:11:08 -0700
> >> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
> >
> > Likely needs this incremental change to prevent DEV_DAX_HMEM from being
> > built-in when CXL is not. This still leaves the awkward scenario of CXL
> > enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
> > safely fails in devdax only / fallback mode, but something to
> > investigate when respinning on top of this.
> >
>
> Thank you for your RFC; I find your proposal remarkably compelling, as it adeptly addresses the issues I am currently facing.
>
>
> To begin with, I still encountered several issues with your patch (considering the patch is at the RFC stage, I think it is already quite commendable):
Hi Zhijian,
Like you, I tried this RFC out. It resolved the issue of soft reserved
resources preventing teardown and replacement of a region in place.
I looked at the issues you found, and have some questions and comments
included below.
>
> 1. Some resources described by SRAT are wrongly identified as System RAM (kmem), such as the following: 200000000-5bffffff.
>
> ```
> 200000000-5bffffff : dax6.0
> 200000000-5bffffff : System RAM (kmem)
> 5c0001128-5c00011b7 : port1
> 5d0000000-64ffffff : CXL Window 0
> 5d0000000-64ffffff : region0
> 5d0000000-64ffffff : dax0.0
> 5d0000000-64ffffff : System RAM (kmem)
> 680000000-e7ffffff : PCI Bus 0000:00
>
> [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
> [ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
> [ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064ffffff] soft reserved
> [ 0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bffffff] hotplug
> ```
Is that range also labelled as soft reserved?
I ask, because I'm trying to draw a parallel between our test platforms.
I see -
[] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
.
.
[] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
.
.
[] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
/proc/iomem - as expected
24080000000-5f77fffffff : CXL Window 0
24080000000-4407fffffff : region0
24080000000-4407fffffff : dax0.0
24080000000-4407fffffff : System RAM (kmem)
I'm also seeing this message:
[] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff] conflicts with [mem 0x24080000000-0x4407fffffff]
>
> 2. Triggers dev_warn and dev_err:
>
> ```
> [root@rdma-server ~]# journalctl -p err -p warning --dmesg
> ...snip...
> Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
> Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
> Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17ffffff could not reserve region
> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
I see the kmem dax messages also. It seems the kmem probe is going after
every range (except hotplug) in the SRAT, and failing.
> ```
>
> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
Haven't tested !CXL_REGION yet.
>
> On failure:
>
> ```
> 100000000-27ffffff : System RAM
> 5c0001128-5c00011b7 : port1
> 5c0011128-5c00111b7 : port2
> 5d0000000-6cffffff : CXL Window 0
> 6d0000000-7cffffff : CXL Window 1
> 7000000000-700000ffff : PCI Bus 0000:0c
> 7000000000-700000ffff : 0000:0c:00.0
> 7000001080-70000010d7 : mem1
> ```
>
> On success:
>
> ```
> 5d0000000-7cffffff : dax0.0
> 5d0000000-7cffffff : System RAM (kmem)
> 5d0000000-6cffffff : CXL Window 0
> 6d0000000-7cffffff : CXL Window 1
> ```
>
> In terms of issues 1 and 2, this arises because hmem_register_device() attempts to register resources of all "HMEM devices," whereas we only need to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO will address this.
>
> ```
> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> - IORES_DESC_SOFT_RESERVED);
> - if (rc != REGION_INTERSECTS)
> - return 0;
> + /* TODO: insert "Soft Reserved" into iomem here */
> ```
Above makes sense.
I'll probably wait for an update from Smita to test again, but if you
or Smita have anything you want me to try out on my hardwware in the
meantime, let me know.
-- Alison
>
> Regarding issue 3 (which exists in the current situation), this could be because it cannot ensure that dax_hmem_probe() executes prior to cxl_acpi_probe() when CXL_REGION is disabled.
>
> I am pleased that you have pushed the patch to the cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its integration into the upstream during the v6.18 merge window.
> Besides the current TODO, you also mentioned that this RFC PATCH must be further subdivided into several patches, so there remains significant work to be done.
> If my understanding is correct, you will personally continue to push this patch forward, right?
>
>
> Smita,
>
> Do you have any additional thoughts on this proposal from your side?
>
>
> Thanks
> Zhijian
>
snip
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-20 23:14 ` Alison Schofield
@ 2025-08-21 2:30 ` Zhijian Li (Fujitsu)
2025-08-22 3:56 ` Koralahalli Channabasappa, Smita
0 siblings, 1 reply; 38+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-08-21 2:30 UTC (permalink / raw)
To: Alison Schofield
Cc: dan.j.williams@intel.com, Smita Koralahalli,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
linux-pm@vger.kernel.org, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Vishal Verma, Ira Weiny, Matthew Wilcox, Jan Kara,
Rafael J . Wysocki, Len Brown, Pavel Machek, Li Ming,
Jeff Johnson, Ying Huang, Xingtao Yao (Fujitsu), Peter Zijlstra,
Greg KH, Nathan Fontenot, Terry Bowman, Robert Richter,
Benjamin Cheatham, PradeepVineshReddy Kodamati,
Yasunori Gotou (Fujitsu)
On 21/08/2025 07:14, Alison Schofield wrote:
> On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
>> Hi Dan and Smita,
>>
>>
>> On 24/07/2025 00:13, dan.j.williams@intel.com wrote:
>>> dan.j.williams@ wrote:
>>> [..]
>>>> If the goal is: "I want to give device-dax a point at which it can make
>>>> a go / no-go decision about whether the CXL subsystem has properly
>>>> assembled all CXL regions implied by Soft Reserved intersecting with
>>>> CXL Windows." Then that is something like the below, only lightly tested
>>>> and likely regresses the non-CXL case.
>>>>
>>>> -- 8< --
>>>> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>>>> From: Dan Williams <dan.j.williams@intel.com>
>>>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>>>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>>>
>>> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
>>> built-in when CXL is not. This still leaves the awkward scenario of CXL
>>> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
>>> safely fails in devdax only / fallback mode, but something to
>>> investigate when respinning on top of this.
>>>
>>
>> Thank you for your RFC; I find your proposal remarkably compelling, as it adeptly addresses the issues I am currently facing.
>>
>>
>> To begin with, I still encountered several issues with your patch (considering the patch is at the RFC stage, I think it is already quite commendable):
>
> Hi Zhijian,
>
> Like you, I tried this RFC out. It resolved the issue of soft reserved
> resources preventing teardown and replacement of a region in place.
>
> I looked at the issues you found, and have some questions and comments
> included below.
>
>>
>> 1. Some resources described by SRAT are wrongly identified as System RAM (kmem), such as the following: 200000000-5bffffff.
>>
>> ```
>> 200000000-5bffffff : dax6.0
>> 200000000-5bffffff : System RAM (kmem)
>> 5c0001128-5c00011b7 : port1
>> 5d0000000-64ffffff : CXL Window 0
>> 5d0000000-64ffffff : region0
>> 5d0000000-64ffffff : dax0.0
>> 5d0000000-64ffffff : System RAM (kmem)
>> 680000000-e7ffffff : PCI Bus 0000:00
>>
>> [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
>> [ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
>> [ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
>> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064ffffff] soft reserved
>> [ 0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bffffff] hotplug
>> ```
>
> Is that range also labelled as soft reserved?
> I ask, because I'm trying to draw a parallel between our test platforms.
No, it's not a soft reserved range. This can be simulated simply with QEMU using the `maxmem=192G` option (see the full QEMU command line below).
In my environment, `0x200000000-0x5bffffff` is something like [DRAM_END + 1, DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE]
DRAM_END: end of the installed DRAM in Node 3
This range is reserved for DRAM hot-add. In my case, it is registered into 'HMEM devices' by a call to hmem_register_resource() in the HMAT code (drivers/acpi/numa/hmat.c):
893 static void hmat_register_target_devices(struct memory_target *target)
894 {
895 struct resource *res;
896
897 /*
898 * Do not bother creating devices if no driver is available to
899 * consume them.
900 */
901 if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
902 return;
903
904 for (res = target->memregions.child; res; res = res->sibling) {
905 int target_nid = pxm_to_node(target->memory_pxm);
906
907 hmem_register_resource(target_nid, res);
908 }
909 }
$ dmesg | grep -i -e soft -e hotplug -e Node
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 conc
[ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
[ 0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[ 0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[ 0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[ 0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[ 0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[ 0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[ 0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[ 0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[ 0.086077] Movable zone start for each node
[ 0.087054] Early memory node ranges
[ 0.087890] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.089264] node 0: [mem 0x0000000000100000-0x000000007ffdefff]
[ 0.090631] node 1: [mem 0x0000000100000000-0x000000017fffffff]
[ 0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[ 0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[ 0.095164] Initmem setup node 2 as memoryless
[ 0.096281] Initmem setup node 3 as memoryless
[ 0.097397] Initmem setup node 4 as memoryless
[ 0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[ 0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs
=================================
Please note that this is a modified QEMU.
/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
-name guest-rdma-server -nographic -boot c \
-m size=6G,slots=2,maxmem=19922944k \
-hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
-object memory-backend-memfd,share=on,size=2G,id=m0 \
-object memory-backend-memfd,share=on,size=2G,id=m1 \
-numa node,nodeid=0,cpus=0-1,memdev=m0 \
-numa node,nodeid=1,cpus=2-3,memdev=m1 \
-smp 4,sockets=2,cores=2 \
-device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
-device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
-device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
-device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
-object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
-M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
-nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
-bios /home/lizhijian/seabios/out/bios.bin \
-object memory-backend-memfd,share=on,size=1G,id=m2 \
-object memory-backend-memfd,share=on,size=1G,id=m3 \
-numa node,memdev=m2,nodeid=2 \
-numa node,memdev=m3,nodeid=3 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=21 \
-numa dist,src=0,dst=3,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa dist,src=1,dst=2,val=21 \
-numa dist,src=1,dst=3,val=21 \
-numa dist,src=2,dst=0,val=21 \
-numa dist,src=2,dst=1,val=21 \
-numa dist,src=2,dst=2,val=10 \
-numa dist,src=2,dst=3,val=21 \
-numa dist,src=3,dst=0,val=21 \
-numa dist,src=3,dst=1,val=21 \
-numa dist,src=3,dst=2,val=21 \
-numa dist,src=3,dst=3,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M
> I see -
>
> [] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
>
> /proc/iomem - as expected
> 24080000000-5f77fffffff : CXL Window 0
> 24080000000-4407fffffff : region0
> 24080000000-4407fffffff : dax0.0
> 24080000000-4407fffffff : System RAM (kmem)
>
>
> I'm also seeing this message:
> [] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff] conflicts with [mem 0x24080000000-0x4407fffffff]
>
>>
>> 2. Triggers dev_warn and dev_err:
>>
>> ```
>> [root@rdma-server ~]# journalctl -p err -p warning --dmesg
>> ...snip...
>> Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17ffffff could not reserve region
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
>
> I see the kmem dax messages also. It seems the kmem probe is going after
> every range (except hotplug) in the SRAT, and failing.
Yes, that's true, because the current RFC removed the code that filters out non-soft-reserved resources. As a result, it tries to register dax/kmem for all of them, while some of them have already been marked busy in iomem_resource.
>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> - IORES_DESC_SOFT_RESERVED);
>> - if (rc != REGION_INTERSECTS)
>> - return 0;
This is another example on my real *CXL HOST*:
Aug 19 17:59:05 kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measuremen>
Aug 19 17:59:09 kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Aug 19 17:59:09 kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax2.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax3.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax4.0: probe with driver kmem failed with error -16
Aug 19 17:59:19 kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27 kernel: block nvme1n1: No UUID available providing old NGUID
lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 /proc/iomem
6fffb000-8fffffff : Reserved
100000000-10000ffff : Reserved
106ccc0000-106fffffff : Reserved
This issue can be resolved by re-introducing soft_reserved_region_intersects(...), I guess.
>
>> ```
>>
>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
>
> Haven't tested !CXL_REGION yet.
>
>>
>> On failure:
>>
>> ```
>> 100000000-27ffffff : System RAM
>> 5c0001128-5c00011b7 : port1
>> 5c0011128-5c00111b7 : port2
>> 5d0000000-6cffffff : CXL Window 0
>> 6d0000000-7cffffff : CXL Window 1
>> 7000000000-700000ffff : PCI Bus 0000:0c
>> 7000000000-700000ffff : 0000:0c:00.0
>> 7000001080-70000010d7 : mem1
>> ```
>>
>> On success:
>>
>> ```
>> 5d0000000-7cffffff : dax0.0
>> 5d0000000-7cffffff : System RAM (kmem)
>> 5d0000000-6cffffff : CXL Window 0
>> 6d0000000-7cffffff : CXL Window 1
>> ```
>>
>> In terms of issues 1 and 2, this arises because hmem_register_device() attempts to register resources of all "HMEM devices," whereas we only need to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO will address this.
>>
>> ```
>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> - IORES_DESC_SOFT_RESERVED);
>> - if (rc != REGION_INTERSECTS)
>> - return 0;
>> + /* TODO: insert "Soft Reserved" into iomem here */
>> ```
>
> Above makes sense.
I think the subroutine add_soft_reserved() in your previous patchset [1] is able to cover this TODO.
>
> I'll probably wait for an update from Smita to test again, but if you
> or Smita have anything you want me to try out on my hardware in the
> meantime, let me know.
>
Here is my local fixup based on Dan's RFC; it can resolve issues 1 and 2.
-- 8< --
commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
Author: Li Zhijian <lizhijian@fujitsu.com>
Date: Fri Aug 20 11:07:15 2025 +0800
Fix probe-order TODO
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 754115da86cc..965ffc622136 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
}
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+ unsigned long flags)
+{
+ struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+ int rc;
+
+ if (!res)
+ return -ENOMEM;
+
+ *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+ flags | IORESOURCE_MEM,
+ IORES_DESC_SOFT_RESERVED);
+
+ rc = insert_resource(&iomem_resource, res);
+ if (rc)
+ kfree(res);
+
+ return rc;
+}
+
static int hmem_register_device(struct device *host, int target_nid,
const struct resource *res)
{
@@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
long id;
int rc;
+ if (soft_reserve_res_intersects(res->start, resource_size(res),
+ IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
+ return 0;
+
if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
}
}
- /* TODO: insert "Soft Reserved" into iomem here */
+ /*
+ * This is a verified Soft Reserved region that CXL is not claiming (or
+ * is being overridden). Add it to the main iomem tree so it can be
+ * properly reserved by the DAX driver.
+ */
+ rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+ if (rc) {
+ dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+ res, rc);
+ return rc;
+ }
id = memregion_alloc(GFP_KERNEL);
if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+ unsigned long desc);
/* Support for virtually mapped pages */
struct page *vmalloc_to_page(const void *addr);
unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
arg, func);
}
EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+ size_t size, unsigned long flags,
+ unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
+ unsigned long desc)
+{
+ int ret;
+
+ read_lock(&resource_lock);
+ ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+ read_unlock(&resource_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
#endif
/*
[1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
> -- Alison
>
>
>>
>> Regarding issue 3 (which exists in the current situation), this could be because it cannot ensure that dax_hmem_probe() executes prior to cxl_acpi_probe() when CXL_REGION is disabled.
>>
>> I am pleased that you have pushed the patch to the cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its integration into the upstream during the v6.18 merge window.
>> Besides the current TODO, you also mentioned that this RFC PATCH must be further subdivided into several patches, so there remains significant work to be done.
>> If my understanding is correct, you will personally continue to push this patch forward, right?
>>
>>
>> Smita,
>>
>> Do you have any additional thoughts on this proposal from your side?
>>
>>
>> Thanks
>> Zhijian
>>
> snip
>
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-21 2:30 ` Zhijian Li (Fujitsu)
@ 2025-08-22 3:56 ` Koralahalli Channabasappa, Smita
2025-08-25 7:50 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-08-22 3:56 UTC (permalink / raw)
To: Zhijian Li (Fujitsu), Alison Schofield
Cc: dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-pm@vger.kernel.org,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Xingtao Yao (Fujitsu), Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Yasunori Gotou (Fujitsu)
On 8/20/2025 7:30 PM, Zhijian Li (Fujitsu) wrote:
>
>
> On 21/08/2025 07:14, Alison Schofield wrote:
>> On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
>>> Hi Dan and Smita,
>>>
>>>
>>> On 24/07/2025 00:13, dan.j.williams@intel.com wrote:
>>>> dan.j.williams@ wrote:
>>>> [..]
>>>>> If the goal is: "I want to give device-dax a point at which it can make
>>>>> a go / no-go decision about whether the CXL subsystem has properly
>>>>> assembled all CXL regions implied by Soft Reserved intersecting with
>>>>> CXL Windows." Then that is something like the below, only lightly tested
>>>>> and likely regresses the non-CXL case.
>>>>>
>>>>> -- 8< --
>>>>> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>>>>> From: Dan Williams <dan.j.williams@intel.com>
>>>>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>>>>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>>>>
>>>> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
>>>> built-in when CXL is not. This still leaves the awkward scenario of CXL
>>>> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
>>>> safely fails in devdax only / fallback mode, but something to
>>>> investigate when respinning on top of this.
>>>>
>>>
>>> Thank you for your RFC; I find your proposal remarkably compelling, as it adeptly addresses the issues I am currently facing.
>>>
>>>
>>> To begin with, I still encountered several issues with your patch (considering the patch is at the RFC stage, I think it is already quite commendable):
>>
>> Hi Zhijian,
>>
>> Like you, I tried this RFC out. It resolved the issue of soft reserved
>> resources preventing teardown and replacement of a region in place.
>>
>> I looked at the issues you found, and have some questions and comments
>> included below.
>>
>>>
>>> 1. Some resources described by SRAT are wrongly identified as System RAM (kmem), such as the following: 200000000-5bffffff.
>>>
>>> ```
>>> 200000000-5bffffff : dax6.0
>>> 200000000-5bffffff : System RAM (kmem)
>>> 5c0001128-5c00011b7 : port1
>>> 5d0000000-64ffffff : CXL Window 0
>>> 5d0000000-64ffffff : region0
>>> 5d0000000-64ffffff : dax0.0
>>> 5d0000000-64ffffff : System RAM (kmem)
>>> 680000000-e7ffffff : PCI Bus 0000:00
>>>
>>> [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
>>> [ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
>>> [ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
>>> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
>>> [ 0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
>>> ```
>>
>> Is that range also labelled as soft reserved?
>> I ask, because I'm trying to draw a parallel between our test platforms.
>
> No, it's not a soft reserved range. This can simply be simulated with QEMU using the `maxmem=192G` option (see the full QEMU command line below).
> In my environment, `0x200000000-0x5bfffffff` is something like [DRAM_END + 1, DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE]
> DRAM_END: end of the installed DRAM in Node 3
>
> This range is reserved for DRAM hot-add. In my case, it will be registered into 'HMEM devices' by calling hmem_register_resource() in HMAT (drivers/acpi/numa/hmat.c):
>
> static void hmat_register_target_devices(struct memory_target *target)
> {
> 	struct resource *res;
>
> 	/*
> 	 * Do not bother creating devices if no driver is available to
> 	 * consume them.
> 	 */
> 	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
> 		return;
>
> 	for (res = target->memregions.child; res; res = res->sibling) {
> 		int target_nid = pxm_to_node(target->memory_pxm);
>
> 		hmem_register_resource(target_nid, res);
> 	}
> }
>
>
> $ dmesg | grep -i -e soft -e hotplug -e Node
> [ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 conc
> [ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
> [ 0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [ 0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
> [ 0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
> [ 0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
> [ 0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
> [ 0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
> [ 0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
> [ 0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
> [ 0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
> [ 0.086077] Movable zone start for each node
> [ 0.087054] Early memory node ranges
> [ 0.087890] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.089264] node 0: [mem 0x0000000000100000-0x000000007ffdefff]
> [ 0.090631] node 1: [mem 0x0000000100000000-0x000000017fffffff]
> [ 0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
> [ 0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
> [ 0.095164] Initmem setup node 2 as memoryless
> [ 0.096281] Initmem setup node 3 as memoryless
> [ 0.097397] Initmem setup node 4 as memoryless
> [ 0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
> [ 0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
> [ 0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
> [ 0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs
>
> =================================
>
> Please note that this is a modified QEMU.
>
> /home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
> -name guest-rdma-server -nographic -boot c \
> -m size=6G,slots=2,maxmem=19922944k \
> -hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
> -object memory-backend-memfd,share=on,size=2G,id=m0 \
> -object memory-backend-memfd,share=on,size=2G,id=m1 \
> -numa node,nodeid=0,cpus=0-1,memdev=m0 \
> -numa node,nodeid=1,cpus=2-3,memdev=m1 \
> -smp 4,sockets=2,cores=2 \
> -device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
> -device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
> -device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
> -device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
> -object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
> -M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
> -nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
> -bios /home/lizhijian/seabios/out/bios.bin \
> -object memory-backend-memfd,share=on,size=1G,id=m2 \
> -object memory-backend-memfd,share=on,size=1G,id=m3 \
> -numa node,memdev=m2,nodeid=2 \
> -numa node,memdev=m3,nodeid=3 \
> -numa dist,src=0,dst=0,val=10 \
> -numa dist,src=0,dst=1,val=21 \
> -numa dist,src=0,dst=2,val=21 \
> -numa dist,src=0,dst=3,val=21 \
> -numa dist,src=1,dst=0,val=21 \
> -numa dist,src=1,dst=1,val=10 \
> -numa dist,src=1,dst=2,val=21 \
> -numa dist,src=1,dst=3,val=21 \
> -numa dist,src=2,dst=0,val=21 \
> -numa dist,src=2,dst=1,val=21 \
> -numa dist,src=2,dst=2,val=10 \
> -numa dist,src=2,dst=3,val=21 \
> -numa dist,src=3,dst=0,val=21 \
> -numa dist,src=3,dst=1,val=21 \
> -numa dist,src=3,dst=2,val=21 \
> -numa dist,src=3,dst=3,val=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
> -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
> -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
> -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
> -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
> -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
> -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
> -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
> -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
> -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
> -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
> -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
> -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M
>
>
>
>> I see -
>>
>> [] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
>> .
>> .
>> [] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
>> .
>> .
>> [] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
>>
>> /proc/iomem - as expected
>> 24080000000-5f77fffffff : CXL Window 0
>> 24080000000-4407fffffff : region0
>> 24080000000-4407fffffff : dax0.0
>> 24080000000-4407fffffff : System RAM (kmem)
>>
>>
>> I'm also seeing this message:
>> [] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff] conflicts with [mem 0x24080000000-0x4407fffffff]
>>
>>>
>>> 2. Triggers dev_warn and dev_err:
>>>
>>> ```
>>> [root@rdma-server ~]# journalctl -p err -p warning --dmesg
>>> ...snip...
>>> Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
>>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
>>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
>>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17ffffff could not reserve region
>>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
>>
>> I see the kmem dax messages also. It seems the kmem probe is going after
>> every range (except hotplug) in the SRAT, and failing.
>
> Yes, that's true, because the current RFC removed the code that filters out non-soft-reserved resources. As a result, it will try to register dax/kmem for all of them, while some of them have already been marked as busy in iomem_resource.
>
>>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>>> - IORES_DESC_SOFT_RESERVED);
>>> - if (rc != REGION_INTERSECTS)
>>> - return 0;
>
>
> This is another example on my real *CXL HOST*:
> Aug 19 17:59:05 kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measuremen>
> Aug 19 17:59:09 kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
> Aug 19 17:59:09 kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not reserve region
> Aug 19 17:59:09 kernel: kmem dax2.0: probe with driver kmem failed with error -16
> Aug 19 17:59:09 kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could not reserve region
> Aug 19 17:59:09 kernel: kmem dax3.0: probe with driver kmem failed with error -16
> Aug 19 17:59:09 kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could not reserve region
> Aug 19 17:59:09 kernel: kmem dax4.0: probe with driver kmem failed with error -16
> Aug 19 17:59:19 kernel: nvme nvme0: using unchecked data buffer
> Aug 19 18:36:27 kernel: block nvme1n1: No UUID available providing old NGUID
> lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 /proc/iomem
> 6fffb000-8fffffff : Reserved
> 100000000-10000ffff : Reserved
> 106ccc0000-106fffffff : Reserved
>
>
> This issue can be resolved by re-introducing soft_reserved_region_intersects(...), I guess.
>
>
>
>>
>>> ```
>>>
>>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
>>
>> Haven't tested !CXL_REGION yet.
When CXL_REGION is disabled, DEV_DAX_CXL will also be disabled. So
dax_hmem should handle it. I was able to fall back to dax_hmem. But let
me know if I'm missing something.
config DEV_DAX_CXL
tristate "CXL DAX: direct access to CXL RAM regions"
depends on CXL_BUS && CXL_REGION && DEV_DAX
..
>>
>>>
>>> On failure:
>>>
>>> ```
>>> 100000000-27fffffff : System RAM
>>> 5c0001128-5c00011b7 : port1
>>> 5c0011128-5c00111b7 : port2
>>> 5d0000000-6cfffffff : CXL Window 0
>>> 6d0000000-7cfffffff : CXL Window 1
>>> 7000000000-700000ffff : PCI Bus 0000:0c
>>> 7000000000-700000ffff : 0000:0c:00.0
>>> 7000001080-70000010d7 : mem1
>>> ```
>>>
>>> On success:
>>>
>>> ```
>>> 5d0000000-7cfffffff : dax0.0
>>> 5d0000000-7cfffffff : System RAM (kmem)
>>> 5d0000000-6cfffffff : CXL Window 0
>>> 6d0000000-7cfffffff : CXL Window 1
>>> ```
>>>
>>> In terms of issues 1 and 2, this arises because hmem_register_device() attempts to register resources of all "HMEM devices," whereas we only need to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO will address this.
>>>
>>> ```
>>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>>> - IORES_DESC_SOFT_RESERVED);
>>> - if (rc != REGION_INTERSECTS)
>>> - return 0;
>>> + /* TODO: insert "Soft Reserved" into iomem here */
>>> ```
>>
>> Above makes sense.
>
> I think the subroutine add_soft_reserved() in your previous patchset[1] is able to cover this TODO.
>
>>
>> I'll probably wait for an update from Smita to test again, but if you
>> or Smita have anything you want me to try out on my hardware in the
>> meantime, let me know.
>>
>
> Here is my local fixup based on Dan's RFC; it can resolve issues 1 and 2.
I almost have the same approach :) Sorry, I missed adding your
"Signed-off-by".. Will include for next revision..
>
>
> -- 8< --
> commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
> Author: Li Zhijian <lizhijian@fujitsu.com>
> Date: Fri Aug 20 11:07:15 2025 +0800
>
> Fix probe-order TODO
>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 754115da86cc..965ffc622136 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
> walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
> }
>
> +static int add_soft_reserved(resource_size_t start, resource_size_t len,
> + unsigned long flags)
> +{
> + struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
> + int rc;
> +
> + if (!res)
> + return -ENOMEM;
> +
> + *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
> + flags | IORESOURCE_MEM,
> + IORES_DESC_SOFT_RESERVED);
> +
> + rc = insert_resource(&iomem_resource, res);
> + if (rc)
> + kfree(res);
> +
> + return rc;
> +}
> +
> static int hmem_register_device(struct device *host, int target_nid,
> const struct resource *res)
> {
> @@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
> long id;
> int rc;
>
> + if (soft_reserve_res_intersects(res->start, resource_size(res),
> + IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
> + return 0;
> +
Should also handle the case where CONFIG_EFI_SOFT_RESERVE is not enabled..
Thanks
Smita
> if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
> region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> IORES_DESC_CXL) != REGION_DISJOINT) {
> @@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
> }
> }
>
> - /* TODO: insert "Soft Reserved" into iomem here */
> + /*
> + * This is a verified Soft Reserved region that CXL is not claiming (or
> + * is being overridden). Add it to the main iomem tree so it can be
> + * properly reserved by the DAX driver.
> + */
> + rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
> + if (rc) {
> + dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
> + res, rc);
> + return rc;
> + }
>
> id = memregion_alloc(GFP_KERNEL);
> if (id < 0) {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 349f0d9aad22..eca5956c444b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1069,6 +1069,8 @@ enum {
> int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
> unsigned long desc);
>
> +int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
> + unsigned long desc);
> /* Support for virtually mapped pages */
> struct page *vmalloc_to_page(const void *addr);
> unsigned long vmalloc_to_pfn(const void *addr);
> diff --git a/kernel/resource.c b/kernel/resource.c
> index b8eac6af2fad..a34b76cf690a 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
> arg, func);
> }
> EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
> +
> +static int __region_intersects(struct resource *parent, resource_size_t start,
> + size_t size, unsigned long flags,
> + unsigned long desc);
> +int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
> + unsigned long desc)
> +{
> + int ret;
> +
> + read_lock(&resource_lock);
> + ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
> + read_unlock(&resource_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
> #endif
>
> /*
>
>
>
> [1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
>
>
>> -- Alison
>>
>>
>>>
>>> Regarding issue 3 (which exists in the current situation), this could be because it cannot ensure that dax_hmem_probe() executes prior to cxl_acpi_probe() when CXL_REGION is disabled.
>>>
>>> I am pleased that you have pushed the patch to the cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its integration into the upstream during the v6.18 merge window.
>>> Besides the current TODO, you also mentioned that this RFC PATCH must be further subdivided into several patches, so there remains significant work to be done.
>>> If my understanding is correct, you would be personally continuing to push forward this patch, right?
>>>
>>>
>>> Smita,
>>>
>>> Do you have any additional thoughts on this proposal from your side?
>>>
>>>
>>> Thanks
>>> Zhijian
>>>
>> snip
>>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-22 3:56 ` Koralahalli Channabasappa, Smita
@ 2025-08-25 7:50 ` Zhijian Li (Fujitsu)
2025-08-27 6:30 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 38+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-08-25 7:50 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, Alison Schofield
Cc: dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-pm@vger.kernel.org,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Xingtao Yao (Fujitsu), Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Yasunori Gotou (Fujitsu)
On 22/08/2025 11:56, Koralahalli Channabasappa, Smita wrote:
>>
>>>
>>>> ```
>>>>
>>>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
>>>
>>> Haven't tested !CXL_REGION yet.
>
> When CXL_REGION is disabled, DEV_DAX_CXL will also be disabled. So dax_hmem should handle it.
Yes, falling back to dax_hmem/kmem is the result we expect.
I haven't figured out the root cause of the issue yet, but I can tell you that in my QEMU environment,
there is currently a certain probability that it cannot fall back to dax_hmem/kmem.
Upon its failure, I observed the following warnings and errors (with my local fixup kernel).
[ 12.203254] kmem dax0.0: mapping0: 0x5d0000000-0x7cfffffff could not reserve region
[ 12.203437] kmem dax0.0: probe with driver kmem failed with error -16
> I was able to fallback to dax_hmem. But let me know if I'm missing something.
>
> config DEV_DAX_CXL
> tristate "CXL DAX: direct access to CXL RAM regions"
> depends on CXL_BUS && CXL_REGION && DEV_DAX
> ..
>
>>>
>>>> On failure:
>>>> ```
>>>> 100000000-27ffffff : System RAM
>>>> 5c0001128-5c00011b7 : port1
>>>> 5c0011128-5c00111b7 : port2
>>>> 5d0000000-6cffffff : CXL Window 0
>>>> 6d0000000-7cffffff : CXL Window 1
>>>> 7000000000-700000ffff : PCI Bus 0000:0c
>>>> 7000000000-700000ffff : 0000:0c:00.0
>>>> 7000001080-70000010d7 : mem1
>>>> ```
>>>>
>>>> On success:
>>>> ```
>>>> 5d0000000-7cffffff : dax0.0
>>>> 5d0000000-7cffffff : System RAM (kmem)
>>>> 5d0000000-6cffffff : CXL Window 0
>>>> 6d0000000-7cffffff : CXL Window 1
>>>> ```
>>>>
>>>> In term of issues 1 and 2, this arises because hmem_register_device() attempts to register resources of all "HMEM devices," whereas we only need to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO will address this.
>>>>
>>>> ```
>>>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>>>> - IORES_DESC_SOFT_RESERVED);
>>>> - if (rc != REGION_INTERSECTS)
>>>> - return 0;
>>>> + /* TODO: insert "Soft Reserved" into iomem here */
>>>> ```
>>>
>>> Above makes sense.
>>
>> I think the subroutine add_soft_reserved() in your previous patchset[1] are able to cover this TODO
>>
>>>
>>> I'll probably wait for an update from Smita to test again, but if you
>>> or Smita have anything you want me to try out on my hardwware in the
>>> meantime, let me know.
>>>
>>
>> Here is my local fixup based on Dan's RFC, it can resovle issue 1 and 2.
>
> I almost have the same approach 🙂 Sorry, I missed adding your
> "Signed-off-by".. Will include for next revision..
Never mind.
Glad to see your V6; I will test it and take a look soon.
>
>>
>>
>> -- 8< --
>> commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
>> Author: Li Zhijian <lizhijian@fujitsu.com>
>> Date: Fri Aug 20 11:07:15 2025 +0800
>>
>> Fix probe-order TODO
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>
>> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
>> index 754115da86cc..965ffc622136 100644
>> --- a/drivers/dax/hmem/hmem.c
>> +++ b/drivers/dax/hmem/hmem.c
>> @@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
>> walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
>> }
>> +static int add_soft_reserved(resource_size_t start, resource_size_t len,
>> + unsigned long flags)
>> +{
>> + struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
>> + int rc;
>> +
>> + if (!res)
>> + return -ENOMEM;
>> +
>> + *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
>> + flags | IORESOURCE_MEM,
>> + IORES_DESC_SOFT_RESERVED);
>> +
>> + rc = insert_resource(&iomem_resource, res);
>> + if (rc)
>> + kfree(res);
>> +
>> + return rc;
>> +}
>> +
>> static int hmem_register_device(struct device *host, int target_nid,
>> const struct resource *res)
>> {
>> @@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
>> long id;
>> int rc;
>>
>> + if (soft_reserve_res_intersects(res->start, resource_size(res),
>> + IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
>> + return 0;
>> +
>
> Should also handle CONFIG_EFI_SOFT_RESERVE not enabled case..
I think it’s unnecessary. For !CONFIG_EFI_SOFT_RESERVE, it will return directly because soft_reserve_res_intersects() will always return REGION_DISJOINT.
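To illustrate (just a sketch of my assumption, not something from the posted fixup): if the helper ends up compiled out for !CONFIG_EFI_SOFT_RESERVE, a header stub along these lines would preserve the same behaviour without an extra check in hmem_register_device():

```c
/*
 * Sketch only (my assumption, not part of the posted fixup): a
 * !CONFIG_EFI_SOFT_RESERVE stub so callers can use the helper
 * unconditionally. With no Soft Reserved tree there is nothing to
 * intersect, so hmem_register_device() simply returns early.
 */
#ifdef CONFIG_EFI_SOFT_RESERVE
int soft_reserve_res_intersects(resource_size_t start, size_t size,
				unsigned long flags, unsigned long desc);
#else
static inline int soft_reserve_res_intersects(resource_size_t start,
					      size_t size, unsigned long flags,
					      unsigned long desc)
{
	return REGION_DISJOINT;
}
#endif
```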
Thanks
Zhijian
>
>
> Thanks
> Smita
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-25 7:50 ` Zhijian Li (Fujitsu)
@ 2025-08-27 6:30 ` Zhijian Li (Fujitsu)
2025-08-28 23:21 ` Koralahalli Channabasappa, Smita
0 siblings, 1 reply; 38+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-08-27 6:30 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, Alison Schofield
Cc: dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-pm@vger.kernel.org,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Xingtao Yao (Fujitsu), Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Yasunori Gotou (Fujitsu)
All,
I have confirmed that in the !CXL_REGION configuration, the same environment may fail to fall back to hmem. (Your new patch cannot resolve this issue.)
In my environment:
- There are two CXL memory devices corresponding to:
```
5d0000000-6cfffffff : CXL Window 0
6d0000000-7cfffffff : CXL Window 1
```
- E820 table contains a 'soft reserved' entry:
```
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x00000007cfffffff] soft reserved
```
However, since my ACPI SRAT doesn't describe the CXL memory devices (this is the key point), `acpi/hmat.c` won't allocate memory targets for them. That prevents the call chain:
```c
hmat_register_target_devices() // for each SRAT-described target
-> hmem_register_resource()
-> insert entry into "HMEM devices" resource
```
Therefore, for a successful fallback to hmem in this environment, `dax_hmem.ko` and `kmem.ko` must request resources BEFORE `cxl_acpi.ko` inserts 'CXL Window X'.
However, the kernel cannot guarantee this initialization order.
When cxl_acpi runs before dax_hmem/kmem:
```
(built-in) CXL_REGION=n
driver/dax/hmem/device.c cxl_acpi.ko dax_hmem.ko kmem.ko
(1) Add entry '5d0000000-7cfffffff'
(2) Traverse "HMEM devices"
Insert to iomem:
5d0000000-7cfffffff : Soft Reserved
(3) Insert CXL Window 0/1
/proc/iomem shows:
5d0000000-7cfffffff : Soft Reserved
5d0000000-6cfffffff : CXL Window 0
6d0000000-7cfffffff : CXL Window 1
(4) Create dax device
(5) request_mem_region() fails
for 5d0000000-7cfffffff
Reason: Children of 'Soft Reserved'
(CXL Windows 0/1) don't cover full range
```
---------------------
In another environment of mine, where the ACPI SRAT has separate entries per CXL device:
1. `acpi/hmat.c` inserts two entries into "HMEM devices":
- 5d0000000-6cfffffff
- 6d0000000-7cfffffff
2. Regardless of module order, dax/kmem requests per-device resources, resulting in:
```
5d0000000-7cfffffff : Soft Reserved
5d0000000-6cfffffff : CXL Window 0
5d0000000-6cfffffff : dax0.0
5d0000000-6cfffffff : System RAM (kmem)
6d0000000-7cfffffff : CXL Window 1
6d0000000-7cfffffff : dax1.0
6d0000000-7cfffffff : System RAM (kmem)
```
Thanks,
Zhijian
On 25/08/2025 15:50, Li Zhijian wrote:
>
>
> On 22/08/2025 11:56, Koralahalli Channabasappa, Smita wrote:
>>>
>>>>
>>>>> ```
>>>>>
>>>>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
>>>>
>>>> Haven't tested !CXL_REGION yet.
>>
>> When CXL_REGION is disabled, DEV_DAX_CXL will also be disabled. So dax_hmem should handle it.
>
> Yes, falling back to dax_hmem/kmem is the result we expect.
> I haven't figured out the root cause of the issue yet, but I can tell you that in my QEMU environment,
> there is currently a certain probability that it cannot fall back to dax_hmem/kmem.
>
> Upon its failure, I observed the following warnings and errors (with my local fixup kernel).
> [ 12.203254] kmem dax0.0: mapping0: 0x5d0000000-0x7cfffffff could not reserve region
> [ 12.203437] kmem dax0.0: probe with driver kmem failed with error -16
>
>
>
>> I was able to fallback to dax_hmem. But let me know if I'm missing something.
>>
>> config DEV_DAX_CXL
>> tristate "CXL DAX: direct access to CXL RAM regions"
>> depends on CXL_BUS && CXL_REGION && DEV_DAX
>> ..
>>
>>>>
>>>>> On failure:
>>>>> ```
>>>>> 100000000-27ffffff : System RAM
>>>>> 5c0001128-5c00011b7 : port1
>>>>> 5c0011128-5c00111b7 : port2
>>>>> 5d0000000-6cffffff : CXL Window 0
>>>>> 6d0000000-7cffffff : CXL Window 1
>>>>> 7000000000-700000ffff : PCI Bus 0000:0c
>>>>> 7000000000-700000ffff : 0000:0c:00.0
>>>>> 7000001080-70000010d7 : mem1
>>>>> ```
>>>>>
>>>>> On success:
>>>>> ```
>>>>> 5d0000000-7cffffff : dax0.0
>>>>> 5d0000000-7cffffff : System RAM (kmem)
>>>>> 5d0000000-6cffffff : CXL Window 0
>>>>> 6d0000000-7cffffff : CXL Window 1
>>>>> ```
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-27 6:30 ` Zhijian Li (Fujitsu)
@ 2025-08-28 23:21 ` Koralahalli Channabasappa, Smita
2025-09-01 2:46 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 38+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-08-28 23:21 UTC (permalink / raw)
To: Zhijian Li (Fujitsu), Alison Schofield
Cc: dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-pm@vger.kernel.org,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Xingtao Yao (Fujitsu), Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Yasunori Gotou (Fujitsu)
Hi Zhijian,
On 8/26/2025 11:30 PM, Zhijian Li (Fujitsu) wrote:
> All,
>
>
> I have confirmed that in the !CXL_REGION configuration, the same environment may fail to fall back to hmem.(Your new patch cannot resolve this issue)
>
> In my environment:
> - There are two CXL memory devices corresponding to:
> ```
> 5d0000000-6cffffff : CXL Window 0
> 6d0000000-7cffffff : CXL Window 1
> ```
> - E820 table contains a 'soft reserved' entry:
> ```
> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x00000007cfffffff] soft reserved
> ```
>
> However, since my ACPI SRAT doesn't describe the CXL memory devices (the point), `acpi/hmat.c` won't allocate memory targets for them. This prevents the call chain:
> ```c
> hmat_register_target_devices() // for each SRAT-described target
> -> hmem_register_resource()
> -> insert entry into "HMEM devices" resource
> ```
>
> Therefore, for successful fallback to hmem in this environment: `dax_hmem.ko` and `kmem.ko` must request resources BEFORE `cxl_acpi.ko` inserts 'CXL Window X'
>
> However the kernel cannot guarantee this initialization order.
>
> When cxl_acpi runs before dax_kmem/kmem:
> ```
> (built-in) CXL_REGION=n
> driver/dax/hmem/device.c cxl_acpi.ko dax_hmem.ko kmem.ko
>
> (1) Add entry '15d0000000-7cfffffff'
> (2) Traverse "HMEM devices"
> Insert to iomem:
> 5d0000000-7cffffff : Soft Reserved
>
> (3) Insert CXL Window 0/1
> /proc/iomem shows:
> 5d0000000-7cffffff : Soft Reserved
> 5d0000000-6cffffff : CXL Window 0
> 6d0000000-7cffffff : CXL Window 1
>
> (4) Create dax device
> (5) request_mem_region() fails
> for 5d0000000-7cffffff
> Reason: Children of 'Soft Reserved'
> (CXL Windows 0/1) don't cover full range
> ```
>
Thanks for confirming the failure point. I was thinking of two possible
ways forward here, and I would like to get feedback from others:
[1] Teach dax_hmem to split when the parent claim fails:
If __request_region() fails for the top-level Soft Reserved range
because IORES_DESC_CXL children already exist, dax_hmem could iterate
those windows and register each one individually. The downside is that
it adds some complexity and feels a bit like papering over the fact that
CXL should eventually own all of this memory. As Dan mentioned, the
long-term plan is for Linux to not need the soft-reserve fallback at
all, and simply ignore Soft Reserve for CXL Windows because the CXL
subsystem will handle it.
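To make [1] a bit more concrete, here is a rough, untested sketch; hmem_register_split_by_cxl() and register_one_window() are purely illustrative names (not code from this series), and the walk over IORES_DESC_CXL children stands in for whatever per-range registration hmem_register_device() does today:

static int register_one_window(struct resource *cxl_win, void *data)
{
	struct device *host = data;

	/* per-window stand-in for what hmem_register_device() does today */
	dev_info(host, "registering split Soft Reserved range %pr\n", cxl_win);
	return 0;
}

static int hmem_register_split_by_cxl(struct device *host,
				      const struct resource *res)
{
	/* walk the IORES_DESC_CXL windows that overlap the Soft Reserved span */
	return walk_iomem_res_desc(IORES_DESC_CXL, IORESOURCE_MEM,
				   res->start, res->end, host,
				   register_one_window);
}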
[2] Always unconditionally load CXL early..
Call request_module("cxl_acpi"); request_module("cxl_pci"); from
dax_hmem_init() (without the IS_ENABLED(CONFIG_DEV_DAX_CXL) guard). If
those are y/m, they’ll be present; if n, it’s a no-op. Then in
hmem_register_device() drop the IS_ENABLED(CONFIG_DEV_DAX_CXL) gate and do:
	if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) != REGION_DISJOINT)
		/* defer to CXL */;
and defer to CXL if windows are present. This makes Soft Reserved
unavailable once CXL Windows have been discovered, even if CXL_REGION is
disabled. That aligns better with the idea that “CXL should win”
whenever a window is visible (This also needs to be considered alongside
patch 6/6 in my series.)
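For [2], the shape of the change would be roughly the sketch below; the driver and function names are only meant to show where the calls would go, and error handling is omitted:

static int __init dax_hmem_init(void)
{
	/* harmless no-ops when cxl_acpi/cxl_pci are built-in or not configured */
	request_module("cxl_acpi");
	request_module("cxl_pci");

	return platform_driver_register(&dax_hmem_driver); /* as today */
}

static int hmem_register_device(struct device *host, int target_nid,
				const struct resource *res)
{
	/* defer to CXL whenever a CXL Window intersects this range */
	if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) != REGION_DISJOINT)
		return 0;

	/* ... existing Soft Reserved registration continues here ... */
	return 0;
}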
With CXL_REGION=n there would be no devdax and no kmem for that range;
/proc/iomem would show only the windows, something like below:
850000000-284fffffff : CXL Window 0
2850000000-484fffffff : CXL Window 1
4850000000-684fffffff : CXL Window 2
That means the memory is left unclaimed/unavailable.. (no System RAM, no
/dev/dax). Is that acceptable when CXL_REGION is disabled?
Thanks
Smita
> ---------------------
> In my another environment where ACPI SRAT has separate entries per CXL device:
> 1. `acpi/hmat.c` inserts two entries into "HMEM devices":
> - 5d0000000-6cffffff
> - 6d0000000-7cffffff
>
> 2. Regardless of module order, dax/kmem requests per-device resources, resulting in:
> ```
> 5d0000000-7cffffff : Soft Reserved
> 5d0000000-6cffffff : CXL Window 0
> 5d0000000-6cffffff : dax0.0
> 5d0000000-6cffffff : System RAM (kmem)
> 6d0000000-7cffffff : CXL Window 1
> 6d0000000-7cffffff : dax1.0
> 6d0000000-7cffffff : System RAM (kmem)
> ```
>
> Thanks,
> Zhijian
>
>
> On 25/08/2025 15:50, Li Zhijian wrote:
>>
>>
>> On 22/08/2025 11:56, Koralahalli Channabasappa, Smita wrote:
>>>>
>>>>>
>>>>>> ```
>>>>>>
>>>>>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
>>>>>
>>>>> Haven't tested !CXL_REGION yet.
>>>
>>> When CXL_REGION is disabled, DEV_DAX_CXL will also be disabled. So dax_hmem should handle it.
>>
>> Yes, falling back to dax_hmem/kmem is the result we expect.
>> I haven't figured out the root cause of the issue yet, but I can tell you that in my QEMU environment,
>> there is currently a certain probability that it cannot fall back to dax_hmem/kmem.
>>
>> Upon its failure, I observed the following warnings and errors (with my local fixup kernel).
>> [ 12.203254] kmem dax0.0: mapping0: 0x5d0000000-0x7cfffffff could not reserve region
>> [ 12.203437] kmem dax0.0: probe with driver kmem failed with error -16
>>
>>
>>
>>> I was able to fallback to dax_hmem. But let me know if I'm missing something.
>>>
>>> config DEV_DAX_CXL
>>> tristate "CXL DAX: direct access to CXL RAM regions"
>>> depends on CXL_BUS && CXL_REGION && DEV_DAX
>>> ..
>>>
>>>>>
>>>>>> On failure:
>>>>>> ```
>>>>>> 100000000-27ffffff : System RAM
>>>>>> 5c0001128-5c00011b7 : port1
>>>>>> 5c0011128-5c00111b7 : port2
>>>>>> 5d0000000-6cffffff : CXL Window 0
>>>>>> 6d0000000-7cffffff : CXL Window 1
>>>>>> 7000000000-700000ffff : PCI Bus 0000:0c
>>>>>> 7000000000-700000ffff : 0000:0c:00.0
>>>>>> 7000001080-70000010d7 : mem1
>>>>>> ```
>>>>>>
>>>>>> On success:
>>>>>> ```
>>>>>> 5d0000000-7cffffff : dax0.0
>>>>>> 5d0000000-7cffffff : System RAM (kmem)
>>>>>> 5d0000000-6cffffff : CXL Window 0
>>>>>> 6d0000000-7cffffff : CXL Window 1
>>>>>> ```
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion
2025-08-28 23:21 ` Koralahalli Channabasappa, Smita
@ 2025-09-01 2:46 ` Zhijian Li (Fujitsu)
0 siblings, 0 replies; 38+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-01 2:46 UTC (permalink / raw)
To: Koralahalli Channabasappa, Smita, Alison Schofield
Cc: dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-pm@vger.kernel.org,
Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Vishal Verma,
Ira Weiny, Matthew Wilcox, Jan Kara, Rafael J . Wysocki,
Len Brown, Pavel Machek, Li Ming, Jeff Johnson, Ying Huang,
Xingtao Yao (Fujitsu), Peter Zijlstra, Greg KH, Nathan Fontenot,
Terry Bowman, Robert Richter, Benjamin Cheatham,
PradeepVineshReddy Kodamati, Yasunori Gotou (Fujitsu)
On 29/08/2025 07:21, Koralahalli Channabasappa, Smita wrote:
> Hi Zhijian,
>
> On 8/26/2025 11:30 PM, Zhijian Li (Fujitsu) wrote:
>> All,
>>
>>
>> I have confirmed that in the !CXL_REGION configuration, the same environment may fail to fall back to hmem.(Your new patch cannot resolve this issue)
>>
>> In my environment:
>> - There are two CXL memory devices corresponding to:
>> ```
>> 5d0000000-6cffffff : CXL Window 0
>> 6d0000000-7cffffff : CXL Window 1
>> ```
>> - E820 table contains a 'soft reserved' entry:
>> ```
>> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x00000007cfffffff] soft reserved
>> ```
>>
>> However, since my ACPI SRAT doesn't describe the CXL memory devices (the point), `acpi/hmat.c` won't allocate memory targets for them. This prevents the call chain:
>> ```c
>> hmat_register_target_devices() // for each SRAT-described target
>> -> hmem_register_resource()
>> -> insert entry into "HMEM devices" resource
>> ```
>>
>> Therefore, for successful fallback to hmem in this environment: `dax_hmem.ko` and `kmem.ko` must request resources BEFORE `cxl_acpi.ko` inserts 'CXL Window X'
>>
>> However the kernel cannot guarantee this initialization order.
>>
>> When cxl_acpi runs before dax_kmem/kmem:
>> ```
>> (built-in) CXL_REGION=n
>> driver/dax/hmem/device.c cxl_acpi.ko dax_hmem.ko kmem.ko
>>
>> (1) Add entry '15d0000000-7cfffffff'
>> (2) Traverse "HMEM devices"
>> Insert to iomem:
>> 5d0000000-7cffffff : Soft Reserved
>>
>> (3) Insert CXL Window 0/1
>> /proc/iomem shows:
>> 5d0000000-7cffffff : Soft Reserved
>> 5d0000000-6cffffff : CXL Window 0
>> 6d0000000-7cffffff : CXL Window 1
>>
>> (4) Create dax device
>> (5) request_mem_region() fails
>> for 5d0000000-7cffffff
>> Reason: Children of 'Soft Reserved'
>> (CXL Windows 0/1) don't cover full range
>> ```
>>
>
> Thanks for confirming the failure point. I was thinking of two possible ways forward here, and I would like to get feedback from others:
>
> [1] Teach dax_hmem to split when the parent claim fails:
> If __request_region() fails for the top-level Soft Reserved range because IORES_DESC_CXL children already exist, dax_hmem could iterate those windows and register each one individually. The downside is that it adds some complexity and feels a bit like papering over the fact that CXL should eventually own all of this memory.
I tried the change below to ensure kmem runs first; it seemed to work.

 static int __init cxl_acpi_init(void)
 {
+	if (!IS_ENABLED(CONFIG_DEV_DAX_CXL) && IS_ENABLED(CONFIG_DEV_DAX_KMEM)) {
+		/* fall back to dax_hmem,kmem */
+		request_module("kmem");
+	}
 	return platform_driver_register(&cxl_acpi_driver);
 }
> As Dan mentioned, the long-term plan is for Linux to not need the soft-reserve fallback at all, and simply ignore Soft Reserve for CXL Windows because the CXL subsystem will handle it.
The current CXL_REGION Kconfig help text states:
"Otherwise, platform-firmware managed CXL is enabled by being placed in the system address map and does not need a driver."
I think this implies that a fallback to dax_hmem/kmem is still required for such cases.
Of course, I personally agree with this 'long-term plan'.
>
> [2] Always unconditionally load CXL early..
> Call request_module("cxl_acpi"); request_module("cxl_pci"); from dax_hmem_init() (without the IS_ENABLED(CONFIG_DEV_DAX_CXL) guard). If those are y/m, they’ll be present; if n, it’s a no-op. Then in hmem_register_device() drop the IS_ENABLED(CONFIG_DEV_DAX_CXL) gate and do:
>
> if (region_intersects(res->start, resource_size(res),
> IORESOURCE_MEM, IORES_DESC_CXL) !=REGION_DISJOINT)
> /* defer to CXL */;
>
> and defer to CXL if windows are present. This makes Soft Reserved unavailable once CXL Windows have been discovered, even if CXL_REGION is disabled. That aligns better with the idea that “CXL should win” whenever a window is visible (This also needs to be considered alongside patch 6/6 in my series.)
>
> With CXL_REGION=n there would be no devdax and no kmem for that range; proc/iomem would show only the windows something like below
>
> 850000000-284fffffff : CXL Window 0
> 2850000000-484fffffff : CXL Window 1
> 4850000000-684fffffff : CXL Window 2
>
> That means the memory is left unclaimed/unavailable.. (no System RAM, no /dev/dax). Is that acceptable when CXL_REGION is disabled?
Regarding option [2] (unconditionally loading CXL early):
This approach conflicts with the CXL_REGION Kconfig description mentioned above.
---
To refocus on the original issue (the inability to recreate regions after destruction when CXL Windows overlap with Soft Reserved):
I believe your patch series "[PATCH 0/6] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL" effectively addresses this problem.
As for the pre-existing issues with !CXL_REGION and the unimplemented DAX_CXL_MODE_REGISTER, I suggest deferring them for now.
They need not be resolved within this patch set, as we should prioritize the initial problem.
Thanks
Zhijian
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2025-09-01 2:47 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-15 18:04 [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 1/7] cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX registration Smita Koralahalli
2025-07-22 21:04 ` dan.j.williams
2025-07-23 0:45 ` Alison Schofield
2025-07-23 7:34 ` dan.j.williams
2025-07-15 18:04 ` [PATCH v5 2/7] cxl/core: Rename suspend.c to probe_state.c and remove CONFIG_CXL_SUSPEND Smita Koralahalli
2025-07-22 21:44 ` dan.j.williams
2025-07-15 18:04 ` [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with cxl_mem probe completion Smita Koralahalli
2025-07-17 0:24 ` Dave Jiang
2025-07-23 7:31 ` dan.j.williams
2025-07-23 16:13 ` dan.j.williams
2025-08-05 3:58 ` Zhijian Li (Fujitsu)
2025-08-20 23:14 ` Alison Schofield
2025-08-21 2:30 ` Zhijian Li (Fujitsu)
2025-08-22 3:56 ` Koralahalli Channabasappa, Smita
2025-08-25 7:50 ` Zhijian Li (Fujitsu)
2025-08-27 6:30 ` Zhijian Li (Fujitsu)
2025-08-28 23:21 ` Koralahalli Channabasappa, Smita
2025-09-01 2:46 ` Zhijian Li (Fujitsu)
2025-07-29 15:48 ` Koralahalli Channabasappa, Smita
2025-07-30 16:09 ` dan.j.williams
2025-07-15 18:04 ` [PATCH v5 4/7] cxl/region: Introduce SOFT RESERVED resource removal on region teardown Smita Koralahalli
2025-07-17 0:42 ` Dave Jiang
2025-07-15 18:04 ` [PATCH v5 5/7] dax/hmem: Save the DAX HMEM platform device pointer Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 6/7] dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until after CXL region creation Smita Koralahalli
2025-07-15 18:04 ` [PATCH v5 7/7] dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads late Smita Koralahalli
2025-07-15 21:07 ` [PATCH v5 0/7] Add managed SOFT RESERVE resource handling Alison Schofield
2025-07-16 6:01 ` Koralahalli Channabasappa, Smita
2025-07-16 20:20 ` Alison Schofield
2025-07-16 21:29 ` Koralahalli Channabasappa, Smita
2025-07-16 23:48 ` Alison Schofield
2025-07-17 17:58 ` Koralahalli Channabasappa, Smita
2025-07-17 19:06 ` Dave Jiang
2025-07-17 23:20 ` Koralahalli Channabasappa, Smita
2025-07-17 23:30 ` Dave Jiang
2025-07-23 15:24 ` dan.j.williams
2025-07-21 7:38 ` Zhijian Li (Fujitsu)
2025-07-22 20:07 ` dan.j.williams