* [PATCH v5 1/9] mm/memory: add memory_block_aligned_range() helper
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 14:57 ` [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
` (8 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
Memory hotplug operations require ranges aligned to memory block
boundaries. Computing such a range is generic to hotplug, not specific
to any one driver.
Add memory_block_aligned_range() as a common helper in <linux/memory.h>
that aligns the start address up and end address down to memory block
boundaries.
Update dax/kmem to use this helper.
Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
drivers/dax/kmem.c | 4 +---
include/linux/memory.h | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+), 3 deletions(-)
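For illustration only (not part of the patch): a minimal userspace sketch of the arithmetic the commit message describes, rounding the start up and the inclusive end down to a block boundary. The `struct range` stand-in, the names `block_aligned_range()` and `range_is_usable()`, and the explicit `block_size` parameter are assumptions made for this example; the kernel helper queries memory_block_size_bytes() instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct range; end is inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: shrink @in to the block-aligned range it contains.
 * @block_size must be a non-zero power of two.
 */
static struct range block_aligned_range(const struct range *in,
					uint64_t block_size)
{
	struct range out;

	assert(block_size && (block_size & (block_size - 1)) == 0);

	/* Round the start up to the next block boundary. */
	out.start = (in->start + block_size - 1) & ~(block_size - 1);
	/* Round the exclusive end down, then convert back to inclusive. */
	out.end = ((in->end + 1) & ~(block_size - 1)) - 1;

	return out;
}

/* Mirrors the caller-side check the kernel-doc asks for. */
static bool range_is_usable(const struct range *r)
{
	return r->start < r->end;
}
```

With a 128 MiB block size, [0x1000, 0x17fffffff] shrinks to [0x8000000, 0x17fffffff], while [0x1000, 0x7ffffff] comes back with start > end and should be rejected by the caller.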
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a18e2b968e4d..592171ec10f4 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -33,9 +33,7 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
struct dev_dax_range *dax_range = &dev_dax->ranges[i];
struct range *range = &dax_range->range;
- /* memory-block align the hotplug range */
- r->start = ALIGN(range->start, memory_block_size_bytes());
- r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+ *r = memory_block_aligned_range(range);
if (r->start >= r->end) {
r->start = range->start;
r->end = range->end;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 463dc02f6cff..9f5ef0309f77 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -20,6 +20,7 @@
#include <linux/compiler.h>
#include <linux/mutex.h>
#include <linux/memory_hotplug.h>
+#include <linux/range.h>
#define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
@@ -100,6 +101,27 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
unsigned long memory_block_size_bytes(void);
int set_memory_block_size_order(unsigned int order);
+/**
+ * memory_block_aligned_range - align a physical address range to memory blocks
+ * @range: the input range to align
+ *
+ * Aligns the start address up and the end address down to memory block
+ * boundaries. This is required for memory hotplug operations which must
+ * operate on memory-block aligned ranges.
+ *
+ * Returns the aligned range. Callers should check that the returned
+ * range is valid (aligned.start < aligned.end) before using it.
+ */
+static inline struct range memory_block_aligned_range(const struct range *range)
+{
+ struct range aligned;
+
+ aligned.start = ALIGN(range->start, memory_block_size_bytes());
+ aligned.end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+
+ return aligned;
+}
+
struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-06-24 14:57 ` [PATCH v5 1/9] mm/memory: add memory_block_aligned_range() helper Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 16:28 ` Gupta, Pankaj
2026-06-24 14:57 ` [PATCH v5 3/9] mm/memory_hotplug: export mhp_get_default_online_type Gregory Price
` (7 subsequent siblings)
9 siblings, 1 reply; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
Modify online_memory_block() to accept the online type through its arg
parameter rather than calling mhp_get_default_online_type() internally.
This prepares for allowing callers to specify explicit online types.
Update the caller in add_memory_resource() to pass the default online
type via a local variable.
No functional change.
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/memory_hotplug.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
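For illustration only (not part of the patch): a minimal userspace sketch of handing a value to a per-block callback through its `void *arg`, which is the pattern this patch switches to. The `walk_blocks()` walker, the reduced `struct memory_block`, and the callback name are hypothetical stand-ins; the enum mirrors the kernel's MMOP_ names, with the numeric values assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's enum mmop; values are assumed for the sketch. */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/* Reduced stand-in: only the field the callback writes. */
struct memory_block {
	enum mmop online_type;
};

typedef int (*walk_fn)(struct memory_block *mem, void *arg);

/* Hypothetical walker: call @fn on each block, stop on the first error. */
static int walk_blocks(struct memory_block *blocks, size_t n, void *arg,
		       walk_fn fn)
{
	for (size_t i = 0; i < n; i++) {
		int rc = fn(&blocks[i], arg);

		if (rc)
			return rc;
	}
	return 0;
}

/* The callback no longer looks up a global default; the caller decides. */
static int set_online_type(struct memory_block *mem, void *arg)
{
	const enum mmop *online_type = arg;

	assert(online_type != NULL);
	mem->online_type = *online_type;
	return 0;
}
```

The caller keeps the online type in a local variable and passes its address, so the pointer only needs to stay valid for the duration of the walk.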
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7ac19fab2263..6833208cc17c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1337,7 +1337,9 @@ static int check_hotplug_memory_range(u64 start, u64 size)
static int online_memory_block(struct memory_block *mem, void *arg)
{
- mem->online_type = mhp_get_default_online_type();
+ enum mmop *online_type = arg;
+
+ mem->online_type = *online_type;
return device_online(&mem->dev);
}
@@ -1494,6 +1496,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+ enum mmop online_type = mhp_get_default_online_type();
enum memblock_flags memblock_flags = MEMBLOCK_NONE;
struct memory_group *group = NULL;
u64 start, size;
@@ -1582,7 +1585,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* online pages if requested */
if (mhp_get_default_online_type() != MMOP_OFFLINE)
- walk_memory_blocks(start, size, NULL, online_memory_block);
+ walk_memory_blocks(start, size, &online_type,
+ online_memory_block);
return ret;
error:
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg
2026-06-24 14:57 ` [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
@ 2026-06-24 16:28 ` Gupta, Pankaj
0 siblings, 0 replies; 13+ messages in thread
From: Gupta, Pankaj @ 2026-06-24 16:28 UTC (permalink / raw)
To: Gregory Price, linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
ira.weiny, apopple
> Modify online_memory_block() to accept the online type through its arg
> parameter rather than calling mhp_get_default_online_type() internally.
>
> This prepares for allowing callers to specify explicit online types.
>
> Update the caller in add_memory_resource() to pass the default online
> type via a local variable.
>
> No functional change.
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> mm/memory_hotplug.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7ac19fab2263..6833208cc17c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,7 +1337,9 @@ static int check_hotplug_memory_range(u64 start, u64 size)
>
> static int online_memory_block(struct memory_block *mem, void *arg)
> {
> - mem->online_type = mhp_get_default_online_type();
> + enum mmop *online_type = arg;
> +
> + mem->online_type = *online_type;
> return device_online(&mem->dev);
> }
>
> @@ -1494,6 +1496,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
> int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> {
> struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + enum mmop online_type = mhp_get_default_online_type();
> enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> struct memory_group *group = NULL;
> u64 start, size;
> @@ -1582,7 +1585,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>
> /* online pages if requested */
> if (mhp_get_default_online_type() != MMOP_OFFLINE)
> - walk_memory_blocks(start, size, NULL, online_memory_block);
> + walk_memory_blocks(start, size, &online_type,
> + online_memory_block);
>
> return ret;
> error:
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 3/9] mm/memory_hotplug: export mhp_get_default_online_type
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-06-24 14:57 ` [PATCH v5 1/9] mm/memory: add memory_block_aligned_range() helper Gregory Price
2026-06-24 14:57 ` [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 14:57 ` [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
` (6 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
Drivers that pass hotplug policy down to DAX need the MMOP_* symbols
and mhp_get_default_online_type() for their hotplug use cases.
Some drivers (cxl) commingle their hotplug and devdax use cases in the
same driver code and choose the dax_kmem path as the default driver
path, which makes it difficult to require hotplug as a prerequisite for
building the overall driver (doing so may break non-hotplug use cases).
Export mhp_get_default_online_type() for modular users, and add a stub
so these drivers still build, and can still use the DAX use case, when
hotplug is disabled.
When hotplug is compiled out, the stub simply returns MMOP_OFFLINE, as
that is non-destructive. The real function can never return -1 either,
so the stub can be declared with the same 'enum mmop' return type.
Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
include/linux/memory_hotplug.h | 2 ++
mm/memory_hotplug.c | 1 +
2 files changed, 3 insertions(+)
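For illustration only (not part of the patch): a minimal userspace sketch of the stub pattern described above, where a header supplies a harmless inline fallback when a feature is compiled out so that callers build either way. `DEMO_MEMORY_HOTPLUG` and the `demo_` function names are hypothetical stand-ins for CONFIG_MEMORY_HOTPLUG and the kernel functions; the enum values are assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's enum mmop; values are assumed for the sketch. */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

#ifdef DEMO_MEMORY_HOTPLUG
/* Feature built in: the real implementation lives in another file. */
enum mmop demo_get_default_online_type(void);
#else
/* Feature compiled out: report the non-destructive answer. */
static inline enum mmop demo_get_default_online_type(void)
{
	return MMOP_OFFLINE;
}
#endif

/* Hypothetical driver path that compiles in both configurations. */
static bool demo_driver_wants_auto_online(void)
{
	enum mmop type = demo_get_default_online_type();

	/* The getter only returns enumerators, never an error code. */
	assert(type <= MMOP_ONLINE_MOVABLE);
	return type != MMOP_OFFLINE;
}
```

Because the stub shares the real function's return type, the driver code needs no #ifdef of its own.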
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7c9d66729c60..f059025f8f8b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -316,6 +316,8 @@ extern struct zone *zone_for_pfn_range(enum mmop online_type,
extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
struct mhp_params *params);
void arch_remove_linear_mapping(u64 start, u64 size);
+#else
+static inline enum mmop mhp_get_default_online_type(void) { return MMOP_OFFLINE; }
#endif /* CONFIG_MEMORY_HOTPLUG */
#endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6833208cc17c..494257054095 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -239,6 +239,7 @@ enum mmop mhp_get_default_online_type(void)
return mhp_default_online_type;
}
+EXPORT_SYMBOL_GPL(mhp_get_default_online_type);
void mhp_set_default_online_type(enum mmop online_type)
{
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (2 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 3/9] mm/memory_hotplug: export mhp_get_default_online_type Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 16:41 ` Gupta, Pankaj
2026-06-24 14:57 ` [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges() Gregory Price
` (5 subsequent siblings)
9 siblings, 1 reply; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
Existing callers of add_memory_driver_managed() cannot select the
preferred online type (ZONE_NORMAL vs ZONE_MOVABLE), which forces them
to hot-add memory as offline blocks and then follow up by onlining each
memory block individually.
Most drivers prefer the system default, but the CXL driver wants to
plumb a preferred policy through the dax kmem driver.
Refactor APIs to add a new interface which allows the dax kmem module
to select a preferred policy.
Overriding the configured auto-online policy is only safe for known
in-tree modules, where we know the override reflects a different,
user-requested policy. We do not want arbitrary out-of-tree drivers
silently overriding the system-wide onlining policy, so restrict the
new interface to the kmem module using EXPORT_SYMBOL_FOR_MODULES()
rather than a plain EXPORT_SYMBOL_GPL(). Other in-tree modules (e.g.
cxl_core) can be added to the allowed list as the need arises.
Refactor add_memory_driver_managed, extract __add_memory_driver_managed
- Add proper kernel-doc for add_memory_driver_managed while refactoring
- New helper accepts an explicit online_type.
- New helper validates online_type is between OFFLINE and ONLINE_MOVABLE
Refactor: add_memory_resource, extract __add_memory_resource
- new helper accepts an explicit online_type
Original APIs now explicitly pass the system-default to new helpers.
No functional change for existing users.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory_hotplug.h | 3 ++
mm/memory_hotplug.c | 61 +++++++++++++++++++++++++++++-----
2 files changed, 56 insertions(+), 8 deletions(-)
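For illustration only (not part of the patch): a minimal userspace sketch of the refactoring shape described above, where an explicit-argument helper validates the online type and the original entry point becomes a thin wrapper that passes the system default. All `demo_` names and the recorded-state variables are hypothetical; the real functions also take a node, range, resource name and flags, which are omitted here.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's enum mmop; values are assumed for the sketch. */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/* Stand-in for the system-wide default policy. */
static enum mmop demo_default_online_type = MMOP_OFFLINE;

/* Records what the innermost helper was asked to do, for inspection. */
static int demo_last_online_type = -1;

/* Innermost helper: trusts that its caller validated @online_type. */
static int demo_add_resource(int online_type)
{
	assert(online_type >= MMOP_OFFLINE &&
	       online_type <= MMOP_ONLINE_MOVABLE);
	demo_last_online_type = online_type;
	return 0;
}

/* New interface: the caller chooses the online type, which is range-checked. */
static int demo_add_managed_explicit(int online_type)
{
	if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
		return -EINVAL;

	return demo_add_resource(online_type);
}

/* Original interface: unchanged for callers, now passes the default. */
static int demo_add_managed(void)
{
	return demo_add_managed_explicit(demo_default_online_type);
}
```

Existing callers keep calling the wrapper and see no change; only a caller that needs a different policy uses the explicit variant.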
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f059025f8f8b..d3edeb80aadb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -294,6 +294,9 @@ extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
extern int add_memory_resource(int nid, struct resource *resource,
mhp_t mhp_flags);
+int __add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name, mhp_t mhp_flags,
+ enum mmop online_type);
extern int add_memory_driver_managed(int nid, u64 start, u64 size,
const char *resource_name,
mhp_t mhp_flags);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 494257054095..a66346def504 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1494,10 +1494,10 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
*
* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
*/
-int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
+static int __add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags,
+ enum mmop online_type)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
- enum mmop online_type = mhp_get_default_online_type();
enum memblock_flags memblock_flags = MEMBLOCK_NONE;
struct memory_group *group = NULL;
u64 start, size;
@@ -1585,7 +1585,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
merge_system_ram_resource(res);
/* online pages if requested */
- if (mhp_get_default_online_type() != MMOP_OFFLINE)
+ if (online_type != MMOP_OFFLINE)
walk_memory_blocks(start, size, &online_type,
online_memory_block);
@@ -1603,7 +1603,13 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
return ret;
}
-/* requires device_hotplug_lock, see add_memory_resource() */
+int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
+{
+ return __add_memory_resource(nid, res, mhp_flags,
+ mhp_get_default_online_type());
+}
+
+/* requires device_hotplug_lock, see __add_memory_resource() */
int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
{
struct resource *res;
@@ -1631,7 +1637,15 @@ int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
}
EXPORT_SYMBOL_GPL(add_memory);
-/*
+/**
+ * __add_memory_driver_managed - add driver-managed memory with explicit online_type
+ * @nid: NUMA node ID where the memory will be added
+ * @start: Start physical address of the memory range
+ * @size: Size of the memory range in bytes
+ * @resource_name: Resource name in format "System RAM ($DRIVER)"
+ * @mhp_flags: Memory hotplug flags
+ * @online_type: Auto-Online behavior (offline, online, kernel, movable)
+ *
* Add special, driver-managed memory to the system as system RAM. Such
* memory is not exposed via the raw firmware-provided memmap as system
* RAM, instead, it is detected and added by a driver - during cold boot,
@@ -1639,6 +1653,7 @@ EXPORT_SYMBOL_GPL(add_memory);
*
* Reasons why this memory should not be used for the initial memmap of a
* kexec kernel or for placing kexec images:
+ *
* - The booting kernel is in charge of determining how this memory will be
* used (e.g., use persistent memory as system RAM)
* - Coordination with a hypervisor is required before this memory
@@ -1651,9 +1666,12 @@ EXPORT_SYMBOL_GPL(add_memory);
*
* The resource_name (visible via /proc/iomem) has to have the format
* "System RAM ($DRIVER)".
+ *
+ * Return: 0 on success, negative error code on failure.
*/
-int add_memory_driver_managed(int nid, u64 start, u64 size,
- const char *resource_name, mhp_t mhp_flags)
+int __add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name, mhp_t mhp_flags,
+ enum mmop online_type)
{
struct resource *res;
int rc;
@@ -1663,6 +1681,9 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
resource_name[strlen(resource_name) - 1] != ')')
return -EINVAL;
+ if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
+ return -EINVAL;
+
lock_device_hotplug();
res = register_memory_resource(start, size, resource_name);
@@ -1671,7 +1692,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
goto out_unlock;
}
- rc = add_memory_resource(nid, res, mhp_flags);
+ rc = __add_memory_resource(nid, res, mhp_flags, online_type);
if (rc < 0)
release_memory_resource(res);
@@ -1679,6 +1700,30 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
unlock_device_hotplug();
return rc;
}
+EXPORT_SYMBOL_FOR_MODULES(__add_memory_driver_managed, "kmem");
+
+/**
+ * add_memory_driver_managed - add driver-managed memory
+ * @nid: NUMA node ID where the memory will be added
+ * @start: Start physical address of the memory range
+ * @size: Size of the memory range in bytes
+ * @resource_name: Resource name in format "System RAM ($DRIVER)"
+ * @mhp_flags: Memory hotplug flags
+ *
+ * Add driver-managed memory with the system default online type set by
+ * build config or kernel boot parameter.
+ *
+ * See __add_memory_driver_managed for more details.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name, mhp_t mhp_flags)
+{
+ return __add_memory_driver_managed(nid, start, size, resource_name,
+ mhp_flags,
+ mhp_get_default_online_type());
+}
EXPORT_SYMBOL_GPL(add_memory_driver_managed);
/*
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg
2026-06-24 14:57 ` [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
@ 2026-06-24 16:41 ` Gupta, Pankaj
0 siblings, 0 replies; 13+ messages in thread
From: Gupta, Pankaj @ 2026-06-24 16:41 UTC (permalink / raw)
To: Gregory Price, linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
ira.weiny, apopple
> Existing callers of add_memory_driver_managed() cannot select the
> preferred online type (ZONE_NORMAL vs ZONE_MOVABLE), which forces them
> to hot-add memory as offline blocks and then follow up by onlining each
> memory block individually.
>
> Most drivers prefer the system default, but the CXL driver wants to
> plumb a preferred policy through the dax kmem driver.
>
> Refactor APIs to add a new interface which allows the dax kmem module
> to select a preferred policy.
>
> Overriding the configured auto-online policy is only safe for known
> in-tree modules, where we know the override reflects a different,
> user-requested policy. We do not want arbitrary out-of-tree drivers
> silently overriding the system-wide onlining policy, so restrict the
> new interface to the kmem module using EXPORT_SYMBOL_FOR_MODULES()
> rather than a plain EXPORT_SYMBOL_GPL(). Other in-tree modules (e.g.
> cxl_core) can be added to the allowed list as the need arises.
>
> Refactor add_memory_driver_managed, extract __add_memory_driver_managed
> - Add proper kernel-doc for add_memory_driver_managed while refactoring
> - New helper accepts an explicit online_type.
> - New helper validates online_type is between OFFLINE and ONLINE_MOVABLE
>
> Refactor: add_memory_resource, extract __add_memory_resource
> - new helper accepts an explicit online_type
>
> Original APIs now explicitly pass the system-default to new helpers.
>
> No functional change for existing users.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> include/linux/memory_hotplug.h | 3 ++
> mm/memory_hotplug.c | 61 +++++++++++++++++++++++++++++-----
> 2 files changed, 56 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f059025f8f8b..d3edeb80aadb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -294,6 +294,9 @@ extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> extern int add_memory_resource(int nid, struct resource *resource,
> mhp_t mhp_flags);
> +int __add_memory_driver_managed(int nid, u64 start, u64 size,
> + const char *resource_name, mhp_t mhp_flags,
> + enum mmop online_type);
> extern int add_memory_driver_managed(int nid, u64 start, u64 size,
> const char *resource_name,
> mhp_t mhp_flags);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 494257054095..a66346def504 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1494,10 +1494,10 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
> *
> * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
> */
> -int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> +static int __add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags,
> + enum mmop online_type)
> {
> struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> - enum mmop online_type = mhp_get_default_online_type();
> enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> struct memory_group *group = NULL;
> u64 start, size;
> @@ -1585,7 +1585,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> merge_system_ram_resource(res);
>
> /* online pages if requested */
> - if (mhp_get_default_online_type() != MMOP_OFFLINE)
> + if (online_type != MMOP_OFFLINE)
> walk_memory_blocks(start, size, &online_type,
> online_memory_block);
>
> @@ -1603,7 +1603,13 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> return ret;
> }
>
> -/* requires device_hotplug_lock, see add_memory_resource() */
> +int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> +{
> + return __add_memory_resource(nid, res, mhp_flags,
> + mhp_get_default_online_type());
> +}
> +
> +/* requires device_hotplug_lock, see __add_memory_resource() */
> int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
> {
> struct resource *res;
> @@ -1631,7 +1637,15 @@ int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
> }
> EXPORT_SYMBOL_GPL(add_memory);
>
> -/*
> +/**
> + * __add_memory_driver_managed - add driver-managed memory with explicit online_type
> + * @nid: NUMA node ID where the memory will be added
> + * @start: Start physical address of the memory range
> + * @size: Size of the memory range in bytes
> + * @resource_name: Resource name in format "System RAM ($DRIVER)"
> + * @mhp_flags: Memory hotplug flags
> + * @online_type: Auto-Online behavior (offline, online, kernel, movable)
> + *
> * Add special, driver-managed memory to the system as system RAM. Such
> * memory is not exposed via the raw firmware-provided memmap as system
> * RAM, instead, it is detected and added by a driver - during cold boot,
> @@ -1639,6 +1653,7 @@ EXPORT_SYMBOL_GPL(add_memory);
> *
> * Reasons why this memory should not be used for the initial memmap of a
> * kexec kernel or for placing kexec images:
> + *
> * - The booting kernel is in charge of determining how this memory will be
> * used (e.g., use persistent memory as system RAM)
> * - Coordination with a hypervisor is required before this memory
> @@ -1651,9 +1666,12 @@ EXPORT_SYMBOL_GPL(add_memory);
> *
> * The resource_name (visible via /proc/iomem) has to have the format
> * "System RAM ($DRIVER)".
> + *
> + * Return: 0 on success, negative error code on failure.
> */
> -int add_memory_driver_managed(int nid, u64 start, u64 size,
> - const char *resource_name, mhp_t mhp_flags)
> +int __add_memory_driver_managed(int nid, u64 start, u64 size,
> + const char *resource_name, mhp_t mhp_flags,
> + enum mmop online_type)
> {
> struct resource *res;
> int rc;
> @@ -1663,6 +1681,9 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
> resource_name[strlen(resource_name) - 1] != ')')
> return -EINVAL;
>
> + if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
> + return -EINVAL;
> +
> lock_device_hotplug();
>
> res = register_memory_resource(start, size, resource_name);
> @@ -1671,7 +1692,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
> goto out_unlock;
> }
>
> - rc = add_memory_resource(nid, res, mhp_flags);
> + rc = __add_memory_resource(nid, res, mhp_flags, online_type);
> if (rc < 0)
> release_memory_resource(res);
>
> @@ -1679,6 +1700,30 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
> unlock_device_hotplug();
> return rc;
> }
> +EXPORT_SYMBOL_FOR_MODULES(__add_memory_driver_managed, "kmem");
> +
> +/**
> + * add_memory_driver_managed - add driver-managed memory
> + * @nid: NUMA node ID where the memory will be added
> + * @start: Start physical address of the memory range
> + * @size: Size of the memory range in bytes
> + * @resource_name: Resource name in format "System RAM ($DRIVER)"
> + * @mhp_flags: Memory hotplug flags
> + *
> + * Add driver-managed memory with the system default online type set by
> + * build config or kernel boot parameter.
> + *
> + * See __add_memory_driver_managed for more details.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int add_memory_driver_managed(int nid, u64 start, u64 size,
> + const char *resource_name, mhp_t mhp_flags)
> +{
> + return __add_memory_driver_managed(nid, start, size, resource_name,
> + mhp_flags,
> + mhp_get_default_online_type());
> +}
> EXPORT_SYMBOL_GPL(add_memory_driver_managed);
>
> /*
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges()
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (3 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 14:57 ` [PATCH v5 6/9] dax: plumb hotplug online_type through dax Gregory Price
` (4 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
offline_and_remove_memory() handles a single contiguous range.
Callers that manage a device composed of several ranges (dax/kmem)
currently have to call it in a loop, which gives up atomicity.
In addition to pushing rollback logic into the driver, the lack
of atomicity creates a race condition between system daemons trying
to manage the same resource:
- Manager 1: Offlines memory blocks, then (after a window) removes device.
- Manager 2: In that window, detects offline memory blocks, re-onlines them.
Add offline_and_remove_memory_ranges(), which takes an array of ranges
and processes them as one operation under a single lock_device_hotplug():
- Phase 1 offlines every block of every range.
- Phase 2 removes the ranges only if all ranges are offline.
- If any offline fails, the whole operation is reverted.
This gives callers all-or-nothing semantics for the offline step, so a
failed or interrupted unplug leaves the device in a consistent state.
This also resolves the battling managers race - the second manager's
operation simply fails when the block is destroyed / cannot be onlined.
offline_and_remove_memory() becomes a thin wrapper that passes its single
range to the new helper, so the offline/rollback logic lives in one place.
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory_hotplug.h | 7 +++
mm/memory_hotplug.c | 94 ++++++++++++++++++++++++----------
2 files changed, 74 insertions(+), 27 deletions(-)
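For illustration only (not part of the patch): a minimal userspace sketch of the two-phase, all-or-nothing flow described above. It offlines every block first, restores the previous state if any block refuses, and removes blocks only once all are offline. The `struct block` model, its `busy` flag, and the function name are hypothetical; locking is omitted, whereas the kernel version runs under lock_device_hotplug().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a memory block. */
struct block {
	bool online;	/* currently online */
	bool present;	/* not yet removed */
	bool busy;	/* offlining this block fails */
};

static int offline_and_remove_all(struct block *blocks, size_t n)
{
	bool *was_online;
	size_t i;

	if (!blocks || n == 0)
		return -EINVAL;

	/* Zero-initialised: blocks we never touch are skipped on rollback. */
	was_online = calloc(n, sizeof(*was_online));
	if (!was_online)
		return -ENOMEM;

	/* Phase 1: offline every block, remembering its previous state. */
	for (i = 0; i < n; i++) {
		if (blocks[i].busy)
			break;
		was_online[i] = blocks[i].online;
		blocks[i].online = false;
	}

	if (i < n) {
		/* Rollback: re-online only what this call took offline. */
		for (i = 0; i < n; i++)
			if (was_online[i])
				blocks[i].online = true;
		free(was_online);
		return -EBUSY;
	}

	/* Phase 2: everything is offline, so removal is safe. */
	for (i = 0; i < n; i++) {
		assert(!blocks[i].online);
		blocks[i].present = false;
	}

	free(was_online);
	return 0;
}
```

On failure nothing has been removed and every block is back in the state it started in, which is the property a driver managing a multi-range device would otherwise have to implement itself.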
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d3edeb80aadb..7f1da7c428dc 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -267,6 +267,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
extern int remove_memory(u64 start, u64 size);
extern void __remove_memory(u64 start, u64 size);
extern int offline_and_remove_memory(u64 start, u64 size);
+int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges);
#else
static inline void try_offline_node(int nid) {}
@@ -283,6 +284,12 @@ static inline int remove_memory(u64 start, u64 size)
}
static inline void __remove_memory(u64 start, u64 size) {}
+
+static inline int offline_and_remove_memory_ranges(const struct range *ranges,
+ int nr_ranges)
+{
+ return -EBUSY;
+}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a66346def504..7d56e0c6ede0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2429,58 +2429,98 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg)
*/
int offline_and_remove_memory(u64 start, u64 size)
{
- const unsigned long mb_count = size / memory_block_size_bytes();
+ struct range range = { .start = start, .end = start + size - 1 };
+
+ return offline_and_remove_memory_ranges(&range, 1);
+}
+EXPORT_SYMBOL_GPL(offline_and_remove_memory);
+
+/**
+ * offline_and_remove_memory_ranges - offline and remove multiple memory ranges
+ * @ranges: array of physical address ranges to offline and remove
+ * @nr_ranges: number of entries in @ranges
+ *
+ * Offline and remove several memory ranges as one operation, serialized
+ * against other hotplug operations by a single lock_device_hotplug().
+ *
+ * This offlines all ranges before removing any of them. If offlining any
+ * range fails, the entire process is reverted and nothing is removed.
+ * This provides a fully atomic semantic for unplugging an entire device.
+ *
+ * Each range must be memory-block aligned in start and size.
+ *
+ * Return: 0 on success, negative errno otherwise. On failure no range has
+ * been removed.
+ */
+int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges)
+{
+ unsigned long mb_total = 0;
uint8_t *online_types, *tmp;
- int rc;
+ int i, rc = 0;
- if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
- !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
+ if (!ranges || nr_ranges <= 0)
return -EINVAL;
+ for (i = 0; i < nr_ranges; i++) {
+ u64 start = ranges[i].start;
+ u64 size = range_len(&ranges[i]);
+
+ if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
+ !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
+ return -EINVAL;
+ mb_total += size / memory_block_size_bytes();
+ }
+
/*
- * We'll remember the old online type of each memory block, so we can
- * try to revert whatever we did when offlining one memory block fails
- * after offlining some others succeeded.
+ * Remember the old online type of every memory block across all ranges,
+ * so we can revert if offlining a later block fails. All entries start
+ * as MMOP_OFFLINE so blocks we never touched are skipped on rollback.
*/
- online_types = kmalloc_array(mb_count, sizeof(*online_types),
+ online_types = kmalloc_array(mb_total, sizeof(*online_types),
GFP_KERNEL);
if (!online_types)
return -ENOMEM;
- /*
- * Initialize all states to MMOP_OFFLINE, so when we abort processing in
- * try_offline_memory_block(), we'll skip all unprocessed blocks in
- * try_reonline_memory_block().
- */
- memset(online_types, MMOP_OFFLINE, mb_count);
+ memset(online_types, MMOP_OFFLINE, mb_total);
lock_device_hotplug();
+ /* Phase 1: offline every block in every range. */
tmp = online_types;
- rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
+ for (i = 0; i < nr_ranges; i++) {
+ rc = walk_memory_blocks(ranges[i].start, range_len(&ranges[i]),
+ &tmp, try_offline_memory_block);
+ if (rc)
+ break;
+ }
/*
- * In case we succeeded to offline all memory, remove it.
- * This cannot fail as it cannot get onlined in the meantime.
+ * Phase 2: Remove each range. This essentially cannot fail as we hold
+ * the hotplug lock. WARN if that assumption is ever broken.
*/
if (!rc) {
- rc = try_remove_memory(start, size);
- if (rc)
- pr_err("%s: Failed to remove memory: %d", __func__, rc);
+ for (i = 0; i < nr_ranges; i++) {
+ rc = try_remove_memory(ranges[i].start,
+ range_len(&ranges[i]));
+ if (WARN_ON_ONCE(rc)) {
+ pr_err("%s: Failed to remove memory: %d",
+ __func__, rc);
+ break;
+ }
+ }
}
- /*
- * Rollback what we did. While memory onlining might theoretically fail
- * (nacked by a notifier), it barely ever happens.
- */
+ /* On failure, roll back. Blocks that were already offline are skipped. */
if (rc) {
tmp = online_types;
- walk_memory_blocks(start, size, &tmp,
- try_reonline_memory_block);
+ for (i = 0; i < nr_ranges; i++)
+ walk_memory_blocks(ranges[i].start,
+ range_len(&ranges[i]), &tmp,
+ try_reonline_memory_block);
}
unlock_device_hotplug();
kfree(online_types);
return rc;
}
-EXPORT_SYMBOL_GPL(offline_and_remove_memory);
+EXPORT_SYMBOL_GPL(offline_and_remove_memory_ranges);
#endif /* CONFIG_MEMORY_HOTREMOVE */
--
2.54.0
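Note for readers: the offline-everything-then-remove flow with rollback
can be sketched in userspace as below. This is only a rough model, not
the kernel implementation. struct blk, enum blk_state, the pinned flag,
blk_try_offline() and offline_and_remove_all() are all hypothetical
names invented for the illustration; the real code walks memory blocks
under lock_device_hotplug() and records MMOP_* online types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory block; not a kernel type. */
enum blk_state { BLK_OFFLINE, BLK_ONLINE, BLK_REMOVED };

struct blk {
	enum blk_state state;
	bool pinned;	/* test knob: simulates a block that refuses to offline */
};

/* Try to offline one block, recording its previous state for rollback. */
static int blk_try_offline(struct blk *b, enum blk_state *saved)
{
	if (b->pinned)
		return -1;
	*saved = b->state;
	b->state = BLK_OFFLINE;
	return 0;
}

/*
 * Offline every block first, then remove them all. If any offline step
 * fails, blocks that were online before are restored and nothing is removed.
 * @saved must have room for @n entries.
 */
static int offline_and_remove_all(struct blk *blks, size_t n,
				  enum blk_state *saved)
{
	size_t i;
	int rc = 0;

	assert(blks && saved);

	/* Untouched entries default to "was offline" so rollback skips them. */
	for (i = 0; i < n; i++)
		saved[i] = BLK_OFFLINE;

	/* Phase 1: offline everything, stopping at the first failure. */
	for (i = 0; i < n; i++) {
		rc = blk_try_offline(&blks[i], &saved[i]);
		if (rc)
			break;
	}

	if (rc) {
		/* Rollback: re-online only what this call took offline. */
		for (i = 0; i < n; i++)
			if (saved[i] == BLK_ONLINE)
				blks[i].state = BLK_ONLINE;
		return rc;
	}

	/* Phase 2: every block is offline, so removal is unconditional here. */
	for (i = 0; i < n; i++)
		blks[i].state = BLK_REMOVED;
	return 0;
}
```

The point of initializing the saved array to "offline" is the same as
in the patch: a block the walk never reached, or one that was already
offline, is left alone on rollback.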
* [PATCH v5 6/9] dax: plumb hotplug online_type through dax
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (4 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges() Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 14:57 ` [PATCH v5 7/9] dax/kmem: extract hotplug/hotremove helper functions Gregory Price
` (3 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
There is no way for drivers leveraging dax_kmem to plumb through a
preferred auto-online policy - the system default policy is forced.
Add an 'enum mmop' field to the DAX device creation path so drivers
can specify an auto-online policy when using the kmem driver.
Capturing the system default at device creation time would break the
ABI: the system default can change later, but the device would keep
the value that was statically assigned when it was created.
To resolve this, add DAX_ONLINE_DEFAULT, which keeps devices on the
current behavior while providing a clean way to override it.
No behavioural change for existing callers (still the system default).
Signed-off-by: Gregory Price <gourry@gourry.net>
---
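Illustration only, not part of the commit: a small userspace sketch of
why the sentinel is resolved at bind time rather than stored at
creation time. Everything here is a hypothetical model; enum
online_type, ONLINE_DEFAULT, system_default, struct device_model and
resolve_online_type() are made-up stand-ins for enum mmop,
DAX_ONLINE_DEFAULT, the system policy, struct dev_dax and the lookup
done in dev_dax_kmem_probe().

```c
#include <assert.h>

/* Hypothetical mirror of the online types, for illustration only. */
enum online_type {
	T_OFFLINE = 0,
	T_ONLINE,
	T_ONLINE_KERNEL,
	T_ONLINE_MOVABLE,
};

/* Sentinel: "no explicit policy, ask the system when binding". */
#define ONLINE_DEFAULT (-1)

/* Models the system-wide default, which may change at any time. */
static int system_default = T_OFFLINE;

struct device_model {
	int online_type;	/* explicit policy or ONLINE_DEFAULT */
};

/* Resolve the effective policy at bind time, not at creation time. */
static int resolve_online_type(const struct device_model *d)
{
	int type = d->online_type;

	if (type == ONLINE_DEFAULT)
		type = system_default;

	assert(type >= T_OFFLINE && type <= T_ONLINE_MOVABLE);
	return type;
}
```

A device created with ONLINE_DEFAULT follows whatever the system
default is when it binds, while a device created with an explicit type
ignores later changes to the default.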
drivers/dax/bus.c | 3 +++
drivers/dax/bus.h | 9 +++++++++
drivers/dax/cxl.c | 1 +
drivers/dax/dax-private.h | 4 ++++
drivers/dax/hmem/hmem.c | 1 +
drivers/dax/kmem.c | 11 +++++++++--
drivers/dax/pmem.c | 1 +
7 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 492573b47f66..4a03b323b003 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */
#include <linux/memremap.h>
+#include <linux/memory_hotplug.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/list.h>
@@ -394,6 +395,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
.size = 0,
.id = -1,
.memmap_on_memory = false,
+ .online_type = DAX_ONLINE_DEFAULT,
};
struct dev_dax *dev_dax = __devm_create_dev_dax(&data);
@@ -1527,6 +1529,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
ida_init(&dev_dax->ida);
dev_dax->memmap_on_memory = data->memmap_on_memory;
+ dev_dax->online_type = data->online_type;
inode = dax_inode(dax_dev);
dev->devt = inode->i_rdev;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 5909171a4428..f3c9dae5de6b 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -3,6 +3,7 @@
#ifndef __DAX_BUS_H__
#define __DAX_BUS_H__
#include <linux/device.h>
+#include <linux/memory_hotplug.h>
#include <linux/platform_device.h>
#include <linux/range.h>
#include <linux/workqueue.h>
@@ -16,6 +17,13 @@ struct dax_region;
#define IORESOURCE_DAX_STATIC BIT(0)
#define IORESOURCE_DAX_KMEM BIT(1)
+/*
+ * online_type sentinel: the device was created without an explicit online
+ * policy, so the system default is resolved when the kmem driver binds
+ * (not at device-creation time, which would freeze a stale policy).
+ */
+#define DAX_ONLINE_DEFAULT (-1)
+
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
unsigned long flags);
@@ -26,6 +34,7 @@ struct dev_dax_data {
resource_size_t size;
int id;
bool memmap_on_memory;
+ enum mmop online_type;
};
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 3ab39b77843d..1a7ec6212213 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -27,6 +27,7 @@ static int cxl_dax_region_probe(struct device *dev)
.id = -1,
.size = range_len(&cxlr_dax->hpa_range),
.memmap_on_memory = true,
+ .online_type = DAX_ONLINE_DEFAULT,
};
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c..ccd77965fe3e 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -8,6 +8,7 @@
#include <linux/device.h>
#include <linux/cdev.h>
#include <linux/idr.h>
+#include <linux/memory_hotplug.h>
/* private routines between core files */
struct dax_device;
@@ -79,6 +80,8 @@ struct dev_dax_range {
* @dev: device core
* @pgmap: pgmap for memmap setup / lifetime (driver owned)
* @memmap_on_memory: allow kmem to put the memmap in the memory
+ * @online_type: MMOP_* online type for memory hotplug, or DAX_ONLINE_DEFAULT
+ * to resolve the system default policy when kmem binds
* @nr_range: size of @ranges
* @ranges: range tuples of memory used
*/
@@ -95,6 +98,7 @@ struct dev_dax {
struct device dev;
struct dev_pagemap *pgmap;
bool memmap_on_memory;
+ enum mmop online_type;
int nr_range;
struct dev_dax_range *ranges;
};
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index af21f66bf872..2de3bc925172 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -37,6 +37,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
.id = -1,
.size = region_idle ? 0 : range_len(&mri->range),
.memmap_on_memory = false,
+ .online_type = DAX_ONLINE_DEFAULT,
};
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 592171ec10f4..0a184c0878dd 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -72,6 +72,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
int i, rc, mapped = 0;
mhp_t mhp_flags;
int numa_node;
+ int online_type;
int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
/*
@@ -132,6 +133,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
goto err_reg_mgid;
data->mgid = rc;
+ /* Resolve system default at bind time in case it changed */
+ online_type = dev_dax->online_type;
+ if (online_type == DAX_ONLINE_DEFAULT)
+ online_type = mhp_get_default_online_type();
+
for (i = 0; i < dev_dax->nr_range; i++) {
struct resource *res;
struct range range;
@@ -172,8 +178,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
* Ensure that future kexec'd kernels will not treat
* this as RAM automatically.
*/
- rc = add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, mhp_flags);
+ rc = __add_memory_driver_managed(data->mgid, range.start,
+ range_len(&range), kmem_name, mhp_flags,
+ online_type);
if (rc) {
dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index bee93066a849..e7adace69195 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -63,6 +63,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
.pgmap = &pgmap,
.size = range_len(&range),
.memmap_on_memory = false,
+ .online_type = DAX_ONLINE_DEFAULT,
};
return devm_create_dev_dax(&data);
--
2.54.0
* [PATCH v5 7/9] dax/kmem: extract hotplug/hotremove helper functions
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (5 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 6/9] dax: plumb hotplug online_type through dax Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 14:57 ` [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug Gregory Price
` (2 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
Refactor kmem _probe() and _remove() by extracting the init, cleanup,
hotplug, and hot-remove logic into separate helper functions:
- dax_kmem_init_resources: inits IO_RESOURCE w/ request_mem_region
- dax_kmem_cleanup_resources: cleans up initialized IO_RESOURCE
- dax_kmem_do_hotplug: adds the reserved ranges as system memory
- dax_kmem_do_hotremove: handles memory removal and resource cleanup
This is a pure refactoring with no functional change. The helpers will
enable future extensions to support more granular control over memory
hotplug operations.
We need to split hotplug/hotunplug and init/cleanup in order to have the
resources available for hot-add. Otherwise, when probe occurs, the dax
devices are never added to sysfs because the resources are never
registered.
Detaching hotunplug/cleanup allows us to re-use the hotunplug code
without destroying the underlying resources.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
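Illustration only, not part of the commit: a userspace sketch of the
probe flow after the split, assuming a fixed number of ranges. All
names (struct kmem_model, NR_RANGES, the add_fails test knob and the
model_*() functions) are hypothetical and only mimic the shape of
dax_kmem_init_resources(), dax_kmem_do_hotplug() and
dax_kmem_cleanup_resources(); no real reservation or hot-add happens.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_RANGES 3

struct kmem_model {
	bool reserved[NR_RANGES];	/* models data->res[i] != NULL */
	bool added[NR_RANGES];		/* models memory that was hot-added */
	bool add_fails[NR_RANGES];	/* test knob: force an add failure */
};

/* Step 1: reserve every range that is not reserved yet. */
static int model_init_resources(struct kmem_model *m)
{
	int i, mapped = 0;

	for (i = 0; i < NR_RANGES; i++) {
		if (m->reserved[i])
			continue;
		m->reserved[i] = true;
		mapped++;
	}
	return mapped;
}

/* Step 2: add each range; a failed add drops that range's reservation. */
static int model_do_hotplug(struct kmem_model *m)
{
	int i, added = 0;

	for (i = 0; i < NR_RANGES; i++) {
		if (m->add_fails[i]) {
			m->reserved[i] = false;
			if (added)
				continue;	/* partial success is kept */
			return -1;
		}
		m->added[i] = true;
		added++;
	}
	return added;
}

/* Undo step 1 for whatever is still reserved. */
static void model_cleanup_resources(struct kmem_model *m)
{
	int i;

	for (i = 0; i < NR_RANGES; i++)
		m->reserved[i] = false;
}

/* Probe: init, then hotplug, unwinding the reservations on failure. */
static int model_probe(struct kmem_model *m)
{
	int rc;

	assert(m);

	rc = model_init_resources(m);
	if (rc < 0)
		return rc;

	rc = model_do_hotplug(m);
	if (rc < 0) {
		model_cleanup_resources(m);
		return rc;
	}
	return 0;
}
```

Keeping reservation and hot-add as separate steps is what lets a later
caller repeat the hot-add or hot-remove step without tearing down the
reservations.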
drivers/dax/kmem.c | 316 ++++++++++++++++++++++++++++++---------------
1 file changed, 214 insertions(+), 102 deletions(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 0a184c0878dd..a45e50def537 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -63,14 +63,195 @@ static void kmem_put_memory_types(void)
mt_put_memory_types(&kmem_memory_types);
}
+/**
+ * dax_kmem_do_hotplug - hotplug memory for dax kmem device
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Hotplugs all ranges in the dev_dax region as system memory.
+ *
+ * Returns the number of successfully mapped ranges, or negative error.
+ */
+static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data,
+ int online_type)
+{
+ struct device *dev = &dev_dax->dev;
+ int i, rc, onlined = 0;
+ mhp_t mhp_flags;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range range;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ mhp_flags = MHP_NID_IS_MGID;
+ if (dev_dax->memmap_on_memory)
+ mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+
+ /*
+ * Ensure that future kexec'd kernels will not treat
+ * this as RAM automatically.
+ */
+ rc = __add_memory_driver_managed(data->mgid, range.start,
+ range_len(&range), kmem_name, mhp_flags,
+ online_type);
+
+ if (rc) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
+ i, range.start, range.end);
+ /*
+ * Release the reservation for the range that failed to
+ * add so a later hotremove does not try to remove memory
+ * that was never added.
+ */
+ if (data->res[i]) {
+ remove_resource(data->res[i]);
+ kfree(data->res[i]);
+ data->res[i] = NULL;
+ }
+ if (onlined)
+ continue;
+ return rc;
+ }
+ onlined++;
+ }
+
+ return onlined;
+}
+
+/**
+ * dax_kmem_init_resources - create memory regions for dax kmem
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Reserves a memory region for each range in the dev_dax region.
+ *
+ * Returns the number of successfully reserved ranges, or negative error.
+ */
+static int dax_kmem_init_resources(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ struct device *dev = &dev_dax->dev;
+ int i, rc, mapped = 0;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct resource *res;
+ struct range range;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ /* Skip ranges already added */
+ if (data->res[i])
+ continue;
+
+ /* Region is permanently reserved if hotremove fails. */
+ res = request_mem_region(range.start, range_len(&range),
+ data->res_name);
+ if (!res) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
+ i, range.start, range.end);
+ /*
+ * Once some memory has been onlined we can't
+ * assume that it can be un-onlined safely.
+ */
+ if (mapped)
+ continue;
+ return -EBUSY;
+ }
+ data->res[i] = res;
+ /*
+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
+ * so that add_memory() can add a child resource. Do not
+ * inherit flags from the parent since it may set new flags
+ * unknown to us that will break add_memory() below.
+ */
+ res->flags = IORESOURCE_SYSTEM_RAM;
+ mapped++;
+ }
+ return mapped;
+}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * dax_kmem_do_hotremove - hot-remove memory for dax kmem device
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Removes all ranges in the dev_dax region.
+ *
+ * Returns the number of successfully removed ranges.
+ */
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ struct device *dev = &dev_dax->dev;
+ int i, success = 0;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range range;
+ int rc;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ /* range was never added during probe, count as removed */
+ if (!data->res[i]) {
+ success++;
+ continue;
+ }
+
+ rc = remove_memory(range.start, range_len(&range));
+ if (rc == 0) {
+ /* Release the resource for the successfully removed range */
+ remove_resource(data->res[i]);
+ kfree(data->res[i]);
+ data->res[i] = NULL;
+ success++;
+ continue;
+ }
+ any_hotremove_failed = true;
+ dev_err(dev, "mapping%d: %#llx-%#llx hotremove failed\n",
+ i, range.start, range.end);
+ }
+
+ return success;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+/**
+ * dax_kmem_cleanup_resources - remove the dax memory resources
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Removes all resources in the dev_dax region.
+ */
+static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ if (!data->res[i])
+ continue;
+ remove_resource(data->res[i]);
+ kfree(data->res[i]);
+ data->res[i] = NULL;
+ }
+}
+
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
unsigned long total_len = 0, orig_len = 0;
struct dax_kmem_data *data;
struct memory_dev_type *mtype;
- int i, rc, mapped = 0;
- mhp_t mhp_flags;
+ int i, rc;
int numa_node;
int online_type;
int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
@@ -133,73 +314,27 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
goto err_reg_mgid;
data->mgid = rc;
+ dev_set_drvdata(dev, data);
+
+ rc = dax_kmem_init_resources(dev_dax, data);
+ if (rc < 0)
+ goto err_resources;
+
/* Resolve system default at bind time in case it changed */
online_type = dev_dax->online_type;
if (online_type == DAX_ONLINE_DEFAULT)
online_type = mhp_get_default_online_type();
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct resource *res;
- struct range range;
-
- rc = dax_kmem_range(dev_dax, i, &range);
- if (rc)
- continue;
-
- /* Region is permanently reserved if hotremove fails. */
- res = request_mem_region(range.start, range_len(&range), data->res_name);
- if (!res) {
- dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
- i, range.start, range.end);
- /*
- * Once some memory has been onlined we can't
- * assume that it can be un-onlined safely.
- */
- if (mapped)
- continue;
- rc = -EBUSY;
- goto err_request_mem;
- }
- data->res[i] = res;
-
- /*
- * Set flags appropriate for System RAM. Leave ..._BUSY clear
- * so that add_memory() can add a child resource. Do not
- * inherit flags from the parent since it may set new flags
- * unknown to us that will break add_memory() below.
- */
- res->flags = IORESOURCE_SYSTEM_RAM;
-
- mhp_flags = MHP_NID_IS_MGID;
- if (dev_dax->memmap_on_memory)
- mhp_flags |= MHP_MEMMAP_ON_MEMORY;
-
- /*
- * Ensure that future kexec'd kernels will not treat
- * this as RAM automatically.
- */
- rc = __add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, mhp_flags,
- online_type);
-
- if (rc) {
- dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
- i, range.start, range.end);
- remove_resource(res);
- kfree(res);
- data->res[i] = NULL;
- if (mapped)
- continue;
- goto err_request_mem;
- }
- mapped++;
- }
-
- dev_set_drvdata(dev, data);
+ rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+ if (rc < 0)
+ goto err_hotplug;
return 0;
-err_request_mem:
+err_hotplug:
+ dax_kmem_cleanup_resources(dev_dax, data);
+err_resources:
+ dev_set_drvdata(dev, NULL);
memory_group_unregister(data->mgid);
err_reg_mgid:
kfree(data->res_name);
@@ -213,7 +348,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
#ifdef CONFIG_MEMORY_HOTREMOVE
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
- int i, success = 0;
+ int success;
int node = dev_dax->target_node;
struct device *dev = &dev_dax->dev;
struct dax_kmem_data *data = dev_get_drvdata(dev);
@@ -224,48 +359,25 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
* there is no way to hotremove this memory until reboot because device
* unbind will succeed even if we return failure.
*/
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct range range;
- int rc;
-
- rc = dax_kmem_range(dev_dax, i, &range);
- if (rc)
- continue;
-
- /* range was never added during probe */
- if (!data->res[i]) {
- success++;
- continue;
- }
-
- rc = remove_memory(range.start, range_len(&range));
- if (rc == 0) {
- remove_resource(data->res[i]);
- kfree(data->res[i]);
- data->res[i] = NULL;
- success++;
- continue;
- }
- any_hotremove_failed = true;
- dev_err(dev,
- "mapping%d: %#llx-%#llx cannot be hotremoved until the next reboot\n",
- i, range.start, range.end);
+ success = dax_kmem_do_hotremove(dev_dax, data);
+ if (success < dev_dax->nr_range) {
+ dev_err(dev, "Hotplug regions stuck online until reboot\n");
+ return;
}
- if (success >= dev_dax->nr_range) {
- memory_group_unregister(data->mgid);
- kfree(data->res_name);
- kfree(data);
- dev_set_drvdata(dev, NULL);
- /*
- * Clear the memtype association on successful unplug.
- * If not, we have memory blocks left which can be
- * offlined/onlined later. We need to keep memory_dev_type
- * for that. This implies this reference will be around
- * till next reboot.
- */
- clear_node_memory_type(node, NULL);
- }
+ dax_kmem_cleanup_resources(dev_dax, data);
+ memory_group_unregister(data->mgid);
+ kfree(data->res_name);
+ kfree(data);
+ dev_set_drvdata(dev, NULL);
+ /*
+ * Clear the memtype association on successful unplug.
+ * If not, we have memory blocks left which can be
+ * offlined/onlined later. We need to keep memory_dev_type
+ * for that. This implies this reference will be around
+ * till next reboot.
+ */
+ clear_node_memory_type(node, NULL);
}
#else
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
--
2.54.0
* [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (6 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 7/9] dax/kmem: extract hotplug/hotremove helper functions Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 14:57 ` [PATCH v5 9/9] selftests/dax: add dax/kmem hotplug sysfs regression test Gregory Price
2026-06-24 18:59 ` [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple,
Hannes Reinecke
There is no atomic mechanism to offline and remove an entire
multi-block DAX kmem device. This is presently done in two steps:
1. offline all
2. remove all
This creates a race condition: another entity operating directly
on the memory blocks between the two steps can cause hot-unplug to
fail or unbind to deadlock.
Add a new 'state' sysfs attribute that enables an atomic whole-device
hotplug operation across its entire memory region.
daxX.Y/state mirrors the per-block memoryX/state ABI:
- [offline, online, online_kernel, online_movable]
- "unplugged" is added specifically for daxX.Y/state
The valid writable states include:
- "unplugged": memory blocks are not present
- "online": memory is online, zone chosen by the kernel
- "online_kernel": memory is online in ZONE_NORMAL
- "online_movable": memory is online in ZONE_MOVABLE
Valid transitions:
- unplugged -> online[_kernel|_movable]
- online[_kernel|_movable] -> unplugged
- offline -> unplugged
A device can only be onlined from "unplugged", so it must be returned
there before being onlined into a different state.
For backwards compatibility the memory blocks are always created at
probe - existing tools expect them to be present after kmem binds.
"offline" is therefore a reportable state but is not writable: it only
arises from the legacy auto_online_blocks=offline policy. Onlining
such a device through this attribute requires unplugging it first;
this is meant to push drivers that create DAX devices to set a default.
Unplug is atomic across the whole device: dax_kmem_do_hotremove()
collects every added range and offlines/removes them in one operation.
Either the operation succeeds or is entirely rolled back.
Unbind Note:
We used to call remove_memory() during unbind, which would fire a
BUG() if any of the memory blocks were online at that time. We lift
this into a WARN in the cleanup routine and don't attempt hotremove
unless ->state is DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.
The memory of an offline dax device is removed on unbind as before.
If the device is online at unbind, the resources are leaked (as before),
but unbind no longer deadlocks when a memory region is impossible to
hotremove.
Suggested-by: Hannes Reinecke <hare@suse.de>
Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
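Illustration only, not part of the commit: the transition rules listed
above can be modelled as a small userspace check. This is a sketch
under assumptions; enum kmem_state and check_transition() are
hypothetical names, the ST_* numbering is chosen for the example, and
the function only answers "is this write permitted", it does not
model a hotremove that is allowed but then fails.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical state values for the model. */
enum kmem_state {
	ST_UNPLUGGED = -1,
	ST_OFFLINE = 0,
	ST_ONLINE,
	ST_ONLINE_KERNEL,
	ST_ONLINE_MOVABLE,
};

/*
 * Decide whether writing @req is permitted while the device is in @cur.
 * Returns 0 when permitted (including a no-op), negative errno otherwise.
 */
static int check_transition(int cur, int req)
{
	assert(cur >= ST_UNPLUGGED && cur <= ST_ONLINE_MOVABLE);

	/* Unknown values and "offline" are rejected when the write is parsed. */
	if (req < ST_UNPLUGGED || req > ST_ONLINE_MOVABLE || req == ST_OFFLINE)
		return -EINVAL;

	/* Already in the requested state: nothing to do. */
	if (cur == req)
		return 0;

	/* Unplug is permitted from offline or any online state. */
	if (req == ST_UNPLUGGED)
		return 0;

	/* Onlining is only permitted from the unplugged state. */
	if (cur != ST_UNPLUGGED)
		return -EBUSY;

	return 0;
}
```

For example, going from "online" to "online_movable" has to be done as
two writes, "unplugged" first and then "online_movable", because the
direct transition is rejected with -EBUSY in this model.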
Documentation/ABI/testing/sysfs-bus-dax | 26 +++
drivers/base/memory.c | 9 +
drivers/dax/kmem.c | 224 ++++++++++++++++++++----
include/linux/memory_hotplug.h | 1 +
4 files changed, 224 insertions(+), 36 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..2dcad1e9dad0 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -151,3 +151,29 @@ Description:
memmap_on_memory parameter for memory_hotplug. This is
typically set on the kernel command line -
memory_hotplug.memmap_on_memory set to 'true' or 'force'."
+
+What: /sys/bus/dax/devices/daxX.Y/state
+Date: June, 2026
+KernelVersion: v6.21
+Contact: nvdimm@lists.linux.dev
+Description:
+ (RW) Controls the state of the memory region.
+ Applies to all memory blocks associated with the device.
+ Only applies to dax_kmem devices.
+
+ Reading returns the current state; the writable states mirror
+ the per-block /sys/devices/system/memory/memoryX/state ABI::
+
+ "unplugged": memory blocks are not present
+ "online": memory is online, zone chosen by the kernel
+ "online_kernel": memory is online in ZONE_NORMAL
+ "online_movable": memory is online in ZONE_MOVABLE
+
+ "offline" (memory blocks are present but offline) may also be
+ reported - this happens when the device is bound while the
+ auto_online_blocks policy is "offline". It cannot be written,
+ as it's not useful and creates device destruction races.
+
+ A device can only be onlined from the "unplugged" state, so a
+ device must be returned to "unplugged" before it can be onlined
+ into a different state.
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b318344426fa..3a2f69d3af7b 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -46,6 +46,15 @@ int mhp_online_type_from_str(const char *str)
}
return -EINVAL;
}
+EXPORT_SYMBOL_GPL(mhp_online_type_from_str);
+
+const char *mhp_online_type_to_str(int online_type)
+{
+ if (online_type < 0 || online_type >= (int)ARRAY_SIZE(online_type_to_str))
+ return NULL;
+ return online_type_to_str[online_type];
+}
+EXPORT_SYMBOL_GPL(mhp_online_type_to_str);
#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a45e50def537..340486586d82 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -42,9 +42,15 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
return 0;
}
+#define DAX_KMEM_UNPLUGGED (-1)
+
struct dax_kmem_data {
const char *res_name;
int mgid;
+ int numa_node;
+ struct dev_dax *dev_dax;
+ int state;
+ struct mutex lock; /* protects hotplug state transitions */
struct resource *res[];
};
@@ -63,12 +69,22 @@ static void kmem_put_memory_types(void)
mt_put_memory_types(&kmem_memory_types);
}
+/* True for the online states a kmem dax device can hold. */
+static bool dax_kmem_state_is_online(int state)
+{
+ return state == MMOP_ONLINE ||
+ state == MMOP_ONLINE_KERNEL ||
+ state == MMOP_ONLINE_MOVABLE;
+}
+
/**
* dax_kmem_do_hotplug - hotplug memory for dax kmem device
* @dev_dax: the dev_dax instance
* @data: the dax_kmem_data structure with resource tracking
+ * @online_type: the online policy to use for the memory blocks
*
- * Hotplugs all ranges in the dev_dax region as system memory.
+ * Hotplugs all ranges in the dev_dax region as system memory with the
+ * provided online policy (offline, online, online_movable, online_kernel).
*
* Returns the number of successfully mapped ranges, or negative error.
*/
@@ -77,9 +93,15 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
int online_type)
{
struct device *dev = &dev_dax->dev;
- int i, rc, onlined = 0;
+ int i, rc, added = 0;
mhp_t mhp_flags;
+ if (dax_kmem_state_is_online(data->state))
+ return -EINVAL;
+
+ if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
+ return -EINVAL;
+
for (i = 0; i < dev_dax->nr_range; i++) {
struct range range;
@@ -112,14 +134,14 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
kfree(data->res[i]);
data->res[i] = NULL;
}
- if (onlined)
+ if (added)
continue;
return rc;
}
- onlined++;
+ added++;
}
- return onlined;
+ return added;
}
/**
@@ -182,45 +204,64 @@ static int dax_kmem_init_resources(struct dev_dax *dev_dax,
* @dev_dax: the dev_dax instance
* @data: the dax_kmem_data structure with resource tracking
*
- * Removes all ranges in the dev_dax region.
+ * Offlines and removes every currently-added range in the dev_dax region
+ * atomically: either all ranges are offlined and removed, or none are and
+ * the device is returned to its prior state.
*
- * Returns the number of successfully removed ranges.
+ * Returns 0 on success, or a negative errno on failure.
*/
static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
struct dax_kmem_data *data)
{
struct device *dev = &dev_dax->dev;
- int i, success = 0;
+ struct range *ranges;
+ int i, nr_ranges = 0, rc;
+
+ ranges = kmalloc_array(dev_dax->nr_range, sizeof(*ranges), GFP_KERNEL);
+ if (!ranges)
+ return -ENOMEM;
+ /* Collect the ranges that were actually added during probe. */
for (i = 0; i < dev_dax->nr_range; i++) {
struct range range;
- int rc;
- rc = dax_kmem_range(dev_dax, i, &range);
- if (rc)
+ if (!data->res[i])
continue;
-
- /* range was never added during probe, count as removed */
- if (!data->res[i]) {
- success++;
+ if (dax_kmem_range(dev_dax, i, &range))
continue;
- }
+ ranges[nr_ranges++] = range;
+ }
- rc = remove_memory(range.start, range_len(&range));
- if (rc == 0) {
- /* Release the resource for the successfully removed range */
- remove_resource(data->res[i]);
- kfree(data->res[i]);
- data->res[i] = NULL;
- success++;
- continue;
- }
+ /* Nothing added means nothing to remove. */
+ if (!nr_ranges) {
+ kfree(ranges);
+ return 0;
+ }
+
+ rc = offline_and_remove_memory_ranges(ranges, nr_ranges);
+ kfree(ranges);
+ if (rc) {
any_hotremove_failed = true;
- dev_err(dev, "mapping%d: %#llx-%#llx hotremove failed\n",
- i, range.start, range.end);
+ dev_err(dev, "hotremove failed, device left online: %d\n", rc);
+ return rc;
}
- return success;
+ /* All ranges removed; release the reserved resources. */
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ if (!data->res[i])
+ continue;
+ remove_resource(data->res[i]);
+ kfree(data->res[i]);
+ data->res[i] = NULL;
+ }
+
+ return 0;
+}
+#else
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ return -EBUSY;
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -236,6 +277,18 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
{
int i;
+ /*
+ * If the device unbind occurs before memory is hotremoved, we can never
+ * remove the memory (requires reboot). Attempting an offline operation
+ * here may cause deadlock and a failure to finish the unbind.
+ *
+ * Note: This leaks the resources.
+ */
+ if (WARN(((data->state != DAX_KMEM_UNPLUGGED) &&
+ (data->state != MMOP_OFFLINE)),
+ "Hotplug memory regions stuck online until reboot"))
+ return;
+
for (i = 0; i < dev_dax->nr_range; i++) {
if (!data->res[i])
continue;
@@ -245,6 +298,85 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
}
}
+static int dax_kmem_parse_state(const char *buf)
+{
+ int online_type;
+
+ /* "unplugged" is kmem-specific - the rest map to MMOP_ */
+ if (sysfs_streq(buf, "unplugged"))
+ return DAX_KMEM_UNPLUGGED;
+
+ online_type = mhp_online_type_from_str(buf);
+ /* Disallow "offline": it's not useful and creates race conditions */
+ if (online_type == MMOP_OFFLINE)
+ return -EINVAL;
+ return online_type;
+}
+
+static ssize_t state_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dax_kmem_data *data = dev_get_drvdata(dev);
+ const char *state_str;
+
+ if (!data)
+ return -ENXIO;
+
+ if (data->state == DAX_KMEM_UNPLUGGED)
+ state_str = "unplugged";
+ else
+ state_str = mhp_online_type_to_str(data->state);
+
+ return sysfs_emit(buf, "%s\n", state_str ?: "unknown");
+}
+
+static ssize_t state_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_kmem_data *data = dev_get_drvdata(dev);
+ int online_type;
+ int rc;
+
+ if (!data)
+ return -ENXIO;
+
+ online_type = dax_kmem_parse_state(buf);
+ if (online_type < DAX_KMEM_UNPLUGGED)
+ return online_type;
+
+ guard(mutex)(&data->lock);
+
+ /* Already in requested state */
+ if (data->state == online_type)
+ return len;
+
+ if (online_type == DAX_KMEM_UNPLUGGED) {
+ rc = dax_kmem_do_hotremove(dev_dax, data);
+ if (rc)
+ return rc;
+ data->state = DAX_KMEM_UNPLUGGED;
+ return len;
+ }
+
+ /* Onlining is only allowed from the unplugged state. */
+ if (data->state != DAX_KMEM_UNPLUGGED)
+ return -EBUSY;
+
+ /* Re-acquire resources if previously unplugged, otherwise no-op */
+ rc = dax_kmem_init_resources(dev_dax, data);
+ if (rc < 0)
+ return rc;
+
+ rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+ if (rc < 0)
+ return rc;
+
+ data->state = online_type;
+ return len;
+}
+static DEVICE_ATTR_RW(state);
+
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
@@ -313,6 +445,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
if (rc < 0)
goto err_reg_mgid;
data->mgid = rc;
+ data->numa_node = numa_node;
+ data->dev_dax = dev_dax;
+ data->state = DAX_KMEM_UNPLUGGED;
+ mutex_init(&data->lock);
dev_set_drvdata(dev, data);
@@ -325,9 +461,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
if (online_type == DAX_ONLINE_DEFAULT)
online_type = mhp_get_default_online_type();
+ /* Always create blocks for backward compatibility, even if offline */
rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
if (rc < 0)
goto err_hotplug;
+ data->state = online_type;
+
+ rc = device_create_file(dev, &dev_attr_state);
+ if (rc)
+ dev_warn(dev, "failed to create state sysfs entry\n");
return 0;
@@ -348,20 +490,26 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
#ifdef CONFIG_MEMORY_HOTREMOVE
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
- int success;
int node = dev_dax->target_node;
struct device *dev = &dev_dax->dev;
struct dax_kmem_data *data = dev_get_drvdata(dev);
+ device_remove_file(dev, &dev_attr_state);
/*
- * We have one shot for removing memory, if some memory blocks were not
- * offline prior to calling this function remove_memory() will fail, and
- * there is no way to hotremove this memory until reboot because device
- * unbind will succeed even if we return failure.
+ * Online memory cannot safely be removed (offlining during unbind can
+ * deadlock a task as unbind cannot be interrupted). Unfortunately we
+ * have to leak all of [resources, memory group, @data, memtype], until
+ * the next reboot - and the memory will stay online until then.
+ *
+ * Offline blocks are removed on unbind, but may leak on failure.
*/
- success = dax_kmem_do_hotremove(dev_dax, data);
- if (success < dev_dax->nr_range) {
- dev_err(dev, "Hotplug regions stuck online until reboot\n");
+ if (dax_kmem_state_is_online(data->state)) {
+ dev_warn(dev, "Hotplug regions stuck online until reboot\n");
+ any_hotremove_failed = true;
+ return;
+ } else if (data->state == MMOP_OFFLINE &&
+ dax_kmem_do_hotremove(dev_dax, data)) {
+ dev_warn(dev, "Unplug failed, resources leaked until reboot\n");
return;
}
@@ -382,6 +530,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
#else
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
+ struct device *dev = &dev_dax->dev;
+
+ device_remove_file(dev, &dev_attr_state);
+
/*
* Without hotremove purposely leak the request_mem_region() for the
* device-dax range and return '0' to ->remove() attempts. The removal
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7f1da7c428dc..46c796570692 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -127,6 +127,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
extern u64 max_mem_size;
extern int mhp_online_type_from_str(const char *str);
+const char *mhp_online_type_to_str(int online_type);
/* If movable_node boot option specified */
extern bool movable_node_enabled;
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 9/9] selftests/dax: add dax/kmem hotplug sysfs regression test
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (7 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug Gregory Price
@ 2026-06-24 14:57 ` Gregory Price
2026-06-24 18:59 ` [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 14:57 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, gourry, alison.schofield,
Smita.KoralahalliChannabasappa, ira.weiny, apopple
Add a kselftest for the dax/kmem whole-device "state" sysfs attribute
(/sys/bus/dax/devices/daxX.Y/state), which transitions a kmem-backed
dax device between "unplugged", "online" and "online_movable".
The kselftest also includes a test demonstrating that a forced unbind
while memory is online does not deadlock. This test is destructive: the
dax device cannot be rebound afterwards (until reboot).
Provisioning a devdax device and binding it to kmem requires daxctl/ndctl,
which is out of scope for an in-tree selftest, so the test discovers an
already kmem-bound dax device and SKIPs when none is present or when the
memory cannot be freed to reach a known baseline.
When a device is available it validates the interface contract:
- online / online_movable actually add memory (MemTotal grows),
- online is idempotent,
- switching between online types without unplug is rejected,
- unplug removes memory and the reported state is "unplugged",
- invalid input is rejected.
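As a reading aid only, the contract above can be sketched as a small
shell model. This is not the driver code: next_state is a hypothetical
helper, and it assumes the transition rules implied by state_store() in
patch 8/9 (unplug is accepted from any state, onlining only from
"unplugged", "offline" and unknown strings are rejected). Runtime
failures such as memory being in use are not modeled.

```shell
#!/bin/bash
# Hypothetical model of the "state" attribute contract (illustration only).
# next_state CURRENT REQUESTED
#   Prints the resulting state. Returns non-zero when the write would be
#   rejected, in which case the printed state is unchanged.
next_state() {
    local cur=$1 req=$2

    # Only these strings are accepted; "offline" and anything else is invalid.
    case "$req" in
    unplugged|online|online_kernel|online_movable) ;;
    *) echo "$cur"; return 1 ;;
    esac

    # Writing the current state again is an idempotent no-op.
    if [ "$req" = "$cur" ]; then
        echo "$cur"
        return 0
    fi

    # Unplug is accepted from any other state.
    if [ "$req" = unplugged ]; then
        echo unplugged
        return 0
    fi

    # Onlining is only allowed from the unplugged state.
    if [ "$cur" = unplugged ]; then
        echo "$req"
        return 0
    fi

    # Switching between online types without an unplug is rejected.
    echo "$cur"
    return 1
}
```

For instance, `next_state online online_movable` returns non-zero and
echoes "online", mirroring the "switching between online types without
unplug is rejected" item.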
One specific regression test:
online -> unplug -> online_movable -> unplug
Re-online must re-reserve per-range resources so subsequent unplug
actually offlines and removes instead of silently reporting success
while the memory stays online.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/dax/Makefile | 6 +
tools/testing/selftests/dax/config | 4 +
.../testing/selftests/dax/dax-kmem-hotplug.sh | 207 ++++++++++++++++++
tools/testing/selftests/dax/settings | 1 +
5 files changed, 219 insertions(+)
create mode 100644 tools/testing/selftests/dax/Makefile
create mode 100644 tools/testing/selftests/dax/config
create mode 100755 tools/testing/selftests/dax/dax-kmem-hotplug.sh
create mode 100644 tools/testing/selftests/dax/settings
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..8c2b4f97619c 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@ TARGETS += core
TARGETS += cpufreq
TARGETS += cpu-hotplug
TARGETS += damon
+TARGETS += dax
TARGETS += devices/error_logs
TARGETS += devices/probe
TARGETS += dmabuf-heaps
diff --git a/tools/testing/selftests/dax/Makefile b/tools/testing/selftests/dax/Makefile
new file mode 100644
index 000000000000..25a4f3d73a5b
--- /dev/null
+++ b/tools/testing/selftests/dax/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+TEST_PROGS := dax-kmem-hotplug.sh
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dax/config b/tools/testing/selftests/dax/config
new file mode 100644
index 000000000000..4c9aaeb6ceb4
--- /dev/null
+++ b/tools/testing/selftests/dax/config
@@ -0,0 +1,4 @@
+CONFIG_DEV_DAX=m
+CONFIG_DEV_DAX_KMEM=m
+CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
diff --git a/tools/testing/selftests/dax/dax-kmem-hotplug.sh b/tools/testing/selftests/dax/dax-kmem-hotplug.sh
new file mode 100755
index 000000000000..803bbd5a6409
--- /dev/null
+++ b/tools/testing/selftests/dax/dax-kmem-hotplug.sh
@@ -0,0 +1,207 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Exercise the dax/kmem "state" sysfs attribute:
+# /sys/bus/dax/devices/daxX.Y/state -> unplugged | online | online_movable
+#
+# The test needs a dax device already bound to the kmem driver.
+# If no suitable device is found the tests SKIP.
+#
+# A dax device can be provisioned with the memmap= boot param, e.g.:
+# memmap=2G!4G
+#
+# then, in the booted system:
+#
+# ndctl create-namespace -m devdax -e namespace0.0 -f
+# daxctl reconfigure-device -N -m system-ram dax0.0 # bind kmem
+# ./dax-kmem-hotplug.sh
+
+# shellcheck disable=SC1091
+DIR="$(dirname "$(readlink -f "$0")")"
+. "$DIR"/../kselftest/ktap_helpers.sh
+
+DAX_BASE=/sys/bus/dax/devices
+
+memtotal_kb() { awk '/^MemTotal:/ {print $2}' /proc/meminfo; }
+get_state() { cat "$HP" 2>/dev/null; }
+# set_state STATE -- write a state to the state attribute; returns the
+# write's exit status (0 = accepted by the kernel)
+set_state() { echo "$1" > "$HP" 2>/dev/null; }
+
+find_kmem_dax() {
+ local d drv
+ for d in "$DAX_BASE"/dax*; do
+ [ -e "$d/state" ] || continue
+ drv=$(readlink "$d/driver" 2>/dev/null)
+ [ "$(basename "${drv:-}")" = kmem ] || continue
+ basename "$d"
+ return 0
+ done
+ return 1
+}
+
+ktap_print_header
+
+if [ "$UID" != 0 ]; then
+ ktap_skip_all "must be run as root"
+ exit "$KSFT_SKIP"
+fi
+
+DAX=$(find_kmem_dax)
+if [ -z "$DAX" ]; then
+ ktap_skip_all "no kmem-bound dax device with a state attribute"
+ exit "$KSFT_SKIP"
+fi
+HP=$DAX_BASE/$DAX/state
+ORIG=$(get_state)
+
+# A failure to reach the baseline is environmental (memory in use), not an
+# interface failure, so skip rather than fail.
+set_state unplugged; rc=$?
+if [ "$rc" != 0 ] || [ "$(get_state)" != unplugged ]; then
+ ktap_skip_all "$DAX: cannot reach 'unplugged' baseline (memory in use?)"
+ [ -n "$ORIG" ] && set_state "$ORIG"
+ exit "$KSFT_SKIP"
+fi
+mt_unplugged=$(memtotal_kb)
+
+DRV=/sys/bus/dax/drivers/kmem
+AOB=/sys/devices/system/memory/auto_online_blocks
+
+ktap_print_msg "using $DAX (initial state was: $ORIG)"
+ktap_set_plan 11
+
+set_state online; rc=$?
+mt_online=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = online ] && [ "$mt_online" -gt "$mt_unplugged" ]; then
+ ktap_test_pass "online: state=online, MemTotal $mt_unplugged -> $mt_online kB"
+else
+ ktap_test_fail "online: rc=$rc state=$(get_state) MemTotal $mt_unplugged -> $mt_online"
+fi
+
+set_state online; rc=$?
+if [ "$rc" = 0 ] && [ "$(get_state)" = online ]; then
+ ktap_test_pass "online idempotent"
+else
+ ktap_test_fail "online idempotent: rc=$rc state=$(get_state)"
+fi
+
+set_state online_movable; rc=$?
+if [ "$rc" != 0 ] && [ "$(get_state)" = online ]; then
+ ktap_test_pass "reject online_movable without intervening unplug"
+else
+ ktap_test_fail "online->online_movable not rejected: rc=$rc state=$(get_state)"
+fi
+
+set_state unplugged; rc=$?
+mt=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = unplugged ] && [ "$mt" -lt "$mt_online" ]; then
+ ktap_test_pass "unplug from online: MemTotal $mt_online -> $mt kB"
+else
+ ktap_test_fail "unplug from online: rc=$rc state=$(get_state) MemTotal $mt_online -> $mt"
+fi
+
+set_state online_movable; rc=$?
+mt_movable=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = online_movable ] && [ "$mt_movable" -gt "$mt_unplugged" ]; then
+ ktap_test_pass "online_movable after unplug: MemTotal $mt_unplugged -> $mt_movable kB"
+else
+ ktap_test_fail "online_movable after unplug: rc=$rc state=$(get_state) MemTotal=$mt_movable"
+fi
+
+# The online -> unplug -> online_movable -> unplug cycle once regressed:
+# a re-online failed to re-reserve the per-range resources, so the final unplug
+# reported success while leaving the memory online. Assert it is really freed.
+set_state unplugged; rc=$?
+mt=$(memtotal_kb)
+if [ "$rc" != 0 ]; then
+ ktap_test_skip "unplug from movable not accepted (memory in use?) rc=$rc"
+elif [ "$(get_state)" = unplugged ] && [ "$mt" -lt "$mt_movable" ]; then
+ ktap_test_pass "unplug from online_movable removed memory: $mt_movable -> $mt kB"
+else
+ ktap_test_fail "unplug from movable reported success but memory remained: state=$(get_state) MemTotal $mt_movable -> $mt"
+fi
+
+set_state online_kernel; rc=$?
+mt=$(memtotal_kb)
+if [ "$rc" = 0 ] && [ "$(get_state)" = online_kernel ] && [ "$mt" -gt "$mt_unplugged" ]; then
+ ktap_test_pass "online_kernel: MemTotal $mt_unplugged -> $mt kB"
+else
+ ktap_test_fail "online_kernel: rc=$rc state=$(get_state) MemTotal=$mt"
+fi
+set_state unplugged
+
+before=$(get_state)
+set_state bogus_state; rc=$?
+if [ "$rc" != 0 ] && [ "$(get_state)" = "$before" ]; then
+ ktap_test_pass "reject invalid state string"
+else
+ ktap_test_fail "invalid state not rejected: rc=$rc state=$(get_state)"
+fi
+
+# Run several online/unplug cycles and require that each one adds/removes memory
+set_state unplugged
+cycle_ok=1; fail_i=0
+for i in 1 2 3; do
+ if ! set_state online; then cycle_ok=0; fail_i=$i; break; fi
+ on=$(memtotal_kb)
+ if ! set_state unplugged; then cycle_ok=0; fail_i=$i; break; fi
+ off=$(memtotal_kb)
+ if [ "$on" -le "$mt_unplugged" ] || [ "$off" -ge "$on" ]; then
+ cycle_ok=0; fail_i=$i; break
+ fi
+done
+if [ "$cycle_ok" = 1 ]; then
+ ktap_test_pass "online/unplug cycle re-acquires resources (3x: memory added and freed each time)"
+else
+ ktap_test_fail "online/unplug cycle regressed at iteration $fail_i (on=$on off=$off baseline=$mt_unplugged)"
+fi
+
+# Change the system default online policy while the device is unbound, and show
+# that the new default policy is applied on the next bind.
+set_state unplugged
+if [ -w "$AOB" ] && [ -w "$DRV/unbind" ] && [ -w "$DRV/bind" ]; then
+ orig_aob=$(cat "$AOB")
+ echo "$DAX" > "$DRV/unbind" 2>/dev/null
+ echo offline > "$AOB" 2>/dev/null
+ echo "$DAX" > "$DRV/bind" 2>/dev/null
+ sleep 1
+ st=$(get_state)
+ echo "$orig_aob" > "$AOB" 2>/dev/null # restore system policy
+ if [ "$st" = offline ]; then
+ ktap_test_pass "online policy resolved at bind: auto_online_blocks=offline -> state=offline"
+ else
+ ktap_test_fail "bind-time policy not honored: state=$st (expected offline)"
+ fi
+ set_state unplugged 2>/dev/null
+else
+ ktap_test_skip "auto_online_blocks or driver bind/unbind not writable"
+fi
+
+[ -n "$ORIG" ] && set_state "$ORIG"
+
+# DESTRUCTIVE: unbinding the driver while memory is online causes the resources
+# to leak - but the unbind should not deadlock. Instead the driver leaks them
+# with a single "stuck online" warning. This leaves the memory online and the
+# device unbound until reboot, so it runs last.
+set_state unplugged; set_state online
+if [ "$(get_state)" = online ] && [ -w "$DRV/unbind" ]; then
+ mt_on=$(memtotal_kb)
+ dmesg -C 2>/dev/null
+ echo "$DAX" > "$DRV/unbind" 2>/dev/null
+ mt_after=$(memtotal_kb)
+ # The leaked "System RAM (kmem)" regions stay in the iomem tree; reading
+ # their names dereferences res_name, which a buggy unbind already freed.
+ # Walk /proc/iomem to provoke that use-after-free (caught by KASAN).
+ cat /proc/iomem > /dev/null 2>&1
+ splat=$(dmesg 2>/dev/null | grep -ciE "KASAN|BUG:|use-after-free|general protection|Oops|refcount_t")
+ if [ "$splat" = 0 ] && [ "$mt_after" -ge "$mt_on" ]; then
+ ktap_test_pass "unbind while online: memory left online, no UAF/oops (MemTotal $mt_on -> $mt_after kB)"
+ else
+ ktap_test_fail "unbind while online regressed: splat=$splat MemTotal $mt_on -> $mt_after kB"
+ fi
+else
+ ktap_test_skip "could not online device for unbind-while-online test"
+fi
+
+ktap_finished
diff --git a/tools/testing/selftests/dax/settings b/tools/testing/selftests/dax/settings
new file mode 100644
index 000000000000..ba4d85f74cd6
--- /dev/null
+++ b/tools/testing/selftests/dax/settings
@@ -0,0 +1 @@
+timeout=90
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (8 preceding siblings ...)
2026-06-24 14:57 ` [PATCH v5 9/9] selftests/dax: add dax/kmem hotplug sysfs regression test Gregory Price
@ 2026-06-24 18:59 ` Gregory Price
9 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-24 18:59 UTC (permalink / raw)
To: linux-mm, nvdimm
Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
ira.weiny, apopple
On Wed, Jun 24, 2026 at 10:57:35AM -0400, Gregory Price wrote:
>... snip ...
Disregard, there are a few unaddressed Sashiko comments, I'm just going
to respin this. Will wait until after the merge window closes for v6.
The rough shape of things should still hold w/ prior feedback.
~Gregory
^ permalink raw reply [flat|nested] 13+ messages in thread