* [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs
@ 2026-03-21 15:03 Gregory Price
2026-03-21 15:03 ` [PATCH 1/8] mm/memory-tiers: consolidate memory type dedup into mt_get_memory_type() Gregory Price
` (8 more replies)
0 siblings, 9 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:03 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
The dax kmem driver currently onlines memory during probe using the
system default policy, with no way to control or query the region state
at runtime - other than by inspecting the state of individual blocks.
Offlining and removing an entire region requires operating on individual
memory blocks, creating race conditions where external entities can
interfere between the offline and remove steps.
The problem was raised in the LPC 2025 device memory sessions
(https://lpc.events/event/19/contributions/2016/), where it was noted
that the non-atomic interface for dax hotplug is causing issues in some
distributions that have competing userland controllers interfering with
each other.
This series adds a sysfs "hotplug" attribute for atomic whole-device
hotplug control, along with the mm and dax plumbing to support it.
The first five patches prepare the mm and dax layers:
1. Consolidate memory-tier type deduplication into mt_get_memory_type(),
removing redundant per-driver infrastructure.
2. Add a memory_block_align_range() helper for hotplug range alignment.
3-5. Thread an explicit online_type through the memory hotplug and dax
paths, allowing drivers to specify a preferred auto-online policy
(ZONE_NORMAL vs ZONE_MOVABLE) instead of being forced to the
system default.
The last three patches build the dax/kmem feature:
6. Plumb online_type through the dax device creation path.
7. Extract hotplug/hotremove into helper functions to separate resource
lifecycle from memory onlining.
8. Add the "hotplug" sysfs attribute supporting three states:
- "unplug": memory blocks removed
- "online": online as normal system RAM
- "online_movable": online in ZONE_MOVABLE
Transitions are atomic across all ranges in the device. Backward
compatibility is preserved: probe still auto-onlines when the configured
policy matches the system default.
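As a rough illustration of how a userland tool might drive the new
attribute, here is a minimal user-space sketch. The helper name
dax_set_hotplug_state() is hypothetical, and the attribute location
(for example something like /sys/bus/dax/devices/dax0.0/hotplug) is an
assumption based on the ABI file this series touches, so the path is
taken as a parameter rather than hard-coded. Only the three state
strings come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* The three states named in the cover letter. */
static const char *const dax_hotplug_states[] = {
	"unplug", "online", "online_movable",
};

/*
 * Hypothetical helper: write @state to the sysfs attribute at @attr_path.
 * Returns 0 on success or a negative errno value.
 */
static int dax_set_hotplug_state(const char *attr_path, const char *state)
{
	size_t n = sizeof(dax_hotplug_states) / sizeof(dax_hotplug_states[0]);
	size_t i;
	FILE *f;
	int rc = 0;

	assert(attr_path && state);

	/* Reject anything that is not one of the documented states. */
	for (i = 0; i < n; i++)
		if (strcmp(state, dax_hotplug_states[i]) == 0)
			break;
	if (i == n)
		return -EINVAL;

	f = fopen(attr_path, "w");
	if (!f)
		return -errno;
	if (fputs(state, f) == EOF)
		rc = -EIO;
	/* Buffered writes reach the kernel at flush time, so check fclose(). */
	if (fclose(f) == EOF && rc == 0)
		rc = -errno;
	return rc;
}
```

A single write per device is the point of the interface: the tool never
has to walk individual memory blocks, so there is no window between an
offline step and a remove step for another controller to interfere.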
Specific notes for maintainers:
I downgraded a BUG() to a WARN() when unbind is called while the dax
device is not in an UNPLUGGED state. This is because the old pattern of
toggling individual memory blocks is still used by userland tools, and
will disconnect the `hotplug` value from the actual state of the overall
memory region.
Unless we move to deprecate per-block controls, we should just WARN()
instead of BUG() as an indicator that userland tools need to be updated
to use the new pattern (the old pattern is subject to race conditions).
The first two commits (memory-tier dedup and align_range helper) are
semi-unrelated cleanups that conflict with the changes made in the
refactoring commits. They are intended to be used for future cxl region
extensions, but if you prefer them to be dropped or submitted separately,
let me know.
This is technically v3, but the patch line has diverged considerably and
I've reworked the cover letter. Apologies for prior obtuseness.
Link: https://lore.kernel.org/all/20260114235022.3437787-1-gourry@gourry.net/
Gregory Price (8):
mm/memory-tiers: consolidate memory type dedup into
mt_get_memory_type()
mm/memory: add memory_block_align_range() helper
mm/memory_hotplug: pass online_type to online_memory_block() via arg
mm/memory_hotplug: export mhp_get_default_online_type
mm/memory_hotplug: add __add_memory_driver_managed() with online_type
arg
dax: plumb hotplug online_type through dax
dax/kmem: extract hotplug/hotremove helper functions
dax/kmem: add sysfs interface for atomic whole-device hotplug
Documentation/ABI/testing/sysfs-bus-dax | 17 +
drivers/dax/bus.c | 3 +
drivers/dax/bus.h | 2 +
drivers/dax/cxl.c | 1 +
drivers/dax/dax-private.h | 3 +
drivers/dax/hmem/hmem.c | 1 +
drivers/dax/kmem.c | 457 ++++++++++++++++++------
include/linux/memory-tiers.h | 34 +-
include/linux/memory.h | 22 ++
include/linux/memory_hotplug.h | 32 ++
mm/memory-tiers.c | 29 +-
mm/memory_hotplug.c | 67 +++-
12 files changed, 501 insertions(+), 167 deletions(-)
--
2.53.0
* [PATCH 1/8] mm/memory-tiers: consolidate memory type dedup into mt_get_memory_type()
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
@ 2026-03-21 15:03 ` Gregory Price
2026-03-21 15:03 ` [PATCH 2/8] mm/memory: add memory_block_align_range() helper Gregory Price
` (7 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:03 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
Replace per-driver memory type list infrastructure with a single
mt_get_memory_type(adist) that deduplicates against the global
default_memory_types list under memory_tier_lock.
The per-driver lists (mutex + list_head + find/put wrappers) provided
dedup within a single driver, but not across drivers or with the core.
Since the number of distinct adist values is bounded and types on
default_memory_types are never freed anyway, the per-driver cleanup
on module unload was not useful.
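The find-or-allocate deduplication described above can be sketched in
user space as follows. This is only an illustration under simplifying
assumptions: struct mem_type, global_types and get_mem_type() are
hypothetical stand-ins for struct memory_dev_type, default_memory_types
and mt_get_memory_type(), and the locking that the kernel does with
memory_tier_lock is omitted because the sketch is single-threaded.

```c
#include <assert.h>	/* for the usage checks shown after the block */
#include <stdlib.h>

/* Simplified stand-in for struct memory_dev_type. */
struct mem_type {
	int adistance;
	struct mem_type *next;
};

/* Stand-in for the global default_memory_types list. */
static struct mem_type *global_types;

/*
 * Return the existing type for @adist, or allocate and publish a new one.
 * Entries are never freed, mirroring the commit message's observation.
 * Returns NULL on allocation failure.
 */
static struct mem_type *get_mem_type(int adist)
{
	struct mem_type *t;

	for (t = global_types; t; t = t->next)
		if (t->adistance == adist)
			return t;

	t = malloc(sizeof(*t));
	if (!t)
		return NULL;
	t->adistance = adist;
	t->next = global_types;
	global_types = t;
	return t;
}
```

Two callers asking for the same abstract distance get the same object,
for instance assert(get_mem_type(100) == get_mem_type(100)), regardless
of which driver they belong to. That cross-driver sharing is what the
per-driver lists could not provide.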
Add MEMTIER_DEFAULT_LOWTIER_ADISTANCE to replace the default DAX
adistance, since it was really used as a stand-in for all kmem hotplugged
memory. This at least makes the default tier relationship clearer to
other drivers, which can now see where to place their memory relative to
the default lower tier.
Core changes:
- Add mt_get_memory_type() as the single exported entry point
- Drop most other interfaces - clear_node_memory_type() is now the
appropriate put function.
- Export MEMTIER_DEFAULT_LOWTIER_ADISTANCE
dax/kmem changes:
- Remove MEMTIER_DEFAULT_DAX_ADISTANCE, use MEMTIER_DEFAULT_LOWTIER_ADISTANCE
- Remove per-driver kmem_memory_type_lock/kmem_memory_types/wrappers
- Store mtype per-device in dax_kmem_data
- Pass data->mtype to clear_node_memory_type() instead of NULL
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/dax/kmem.c | 32 +++++---------------------------
include/linux/memory-tiers.h | 34 ++++++++++------------------------
mm/memory-tiers.c | 29 +++++++++++++----------------
3 files changed, 28 insertions(+), 67 deletions(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 2cc8749bc871..eb693a581961 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -16,13 +16,6 @@
#include "dax-private.h"
#include "bus.h"
-/*
- * Default abstract distance assigned to the NUMA node onlined
- * by DAX/kmem if the low level platform driver didn't initialize
- * one for this NUMA node.
- */
-#define MEMTIER_DEFAULT_DAX_ADISTANCE (MEMTIER_ADISTANCE_DRAM * 5)
-
/* Memory resource name used for add_memory_driver_managed(). */
static const char *kmem_name;
/* Set if any memory will remain added when the driver will be unloaded. */
@@ -47,24 +40,10 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
struct dax_kmem_data {
const char *res_name;
int mgid;
+ struct memory_dev_type *mtype;
struct resource *res[];
};
-static DEFINE_MUTEX(kmem_memory_type_lock);
-static LIST_HEAD(kmem_memory_types);
-
-static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
-{
- guard(mutex)(&kmem_memory_type_lock);
- return mt_find_alloc_memory_type(adist, &kmem_memory_types);
-}
-
-static void kmem_put_memory_types(void)
-{
- guard(mutex)(&kmem_memory_type_lock);
- mt_put_memory_types(&kmem_memory_types);
-}
-
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
@@ -74,7 +53,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
int i, rc, mapped = 0;
mhp_t mhp_flags;
int numa_node;
- int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
+ int adist = MEMTIER_DEFAULT_LOWTIER_ADISTANCE;
/*
* Ensure good NUMA information for the persistent memory.
@@ -90,7 +69,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
}
mt_calc_adistance(numa_node, &adist);
- mtype = kmem_find_alloc_memory_type(adist);
+ mtype = mt_get_memory_type(adist);
if (IS_ERR(mtype))
return PTR_ERR(mtype);
@@ -189,6 +168,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
}
mapped++;
}
+ data->mtype = mtype;
dev_set_drvdata(dev, data);
@@ -253,7 +233,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
* for that. This implies this reference will be around
* till next reboot.
*/
- clear_node_memory_type(node, NULL);
+ clear_node_memory_type(node, data->mtype);
}
}
#else
@@ -292,7 +272,6 @@ static int __init dax_kmem_init(void)
return rc;
error_dax_driver:
- kmem_put_memory_types();
kfree_const(kmem_name);
return rc;
}
@@ -302,7 +281,6 @@ static void __exit dax_kmem_exit(void)
dax_driver_unregister(&device_dax_kmem_driver);
if (!any_hotremove_failed)
kfree_const(kmem_name);
- kmem_put_memory_types();
}
MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 96987d9d95a8..70fbd3ad577f 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -20,11 +20,17 @@
*/
#define MEMTIER_ADISTANCE_DRAM ((4L * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+/*
+ * Default abstract distance assigned to non-DRAM memory if the platform
+ * driver didn't initialize one for this NUMA node.
+ */
+#define MEMTIER_DEFAULT_LOWTIER_ADISTANCE (MEMTIER_ADISTANCE_DRAM * 5)
+
struct memory_tier;
struct memory_dev_type {
/* list of memory types that are part of same tier as this type */
struct list_head tier_sibling;
- /* list of memory types that are managed by one driver */
+ /* memory types on global list */
struct list_head list;
/* abstract distance for this specific memory type */
int adistance;
@@ -39,8 +45,6 @@ struct access_coordinate;
extern bool numa_demotion_enabled;
extern struct memory_dev_type *default_dram_type;
extern nodemask_t default_dram_nodes;
-struct memory_dev_type *alloc_memory_type(int adistance);
-void put_memory_type(struct memory_dev_type *memtype);
void init_node_memory_type(int node, struct memory_dev_type *default_type);
void clear_node_memory_type(int node, struct memory_dev_type *memtype);
int register_mt_adistance_algorithm(struct notifier_block *nb);
@@ -49,9 +53,7 @@ int mt_calc_adistance(int node, int *adist);
int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
const char *source);
int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
-struct memory_dev_type *mt_find_alloc_memory_type(int adist,
- struct list_head *memory_types);
-void mt_put_memory_types(struct list_head *memory_types);
+struct memory_dev_type *mt_get_memory_type(int adist);
#ifdef CONFIG_MIGRATION
int next_demotion_node(int node, const nodemask_t *allowed_mask);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -78,18 +80,6 @@ static inline bool node_is_toptier(int node)
#define numa_demotion_enabled false
#define default_dram_type NULL
#define default_dram_nodes NODE_MASK_NONE
-/*
- * CONFIG_NUMA implementation returns non NULL error.
- */
-static inline struct memory_dev_type *alloc_memory_type(int adistance)
-{
- return NULL;
-}
-
-static inline void put_memory_type(struct memory_dev_type *memtype)
-{
-
-}
static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
{
@@ -142,14 +132,10 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis
return -EIO;
}
-static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
- struct list_head *memory_types)
+static inline struct memory_dev_type *mt_get_memory_type(int adist)
{
return NULL;
}
-
-static inline void mt_put_memory_types(struct list_head *memory_types)
-{
-}
#endif /* CONFIG_NUMA */
+
#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 986f809376eb..c8f032a75249 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -38,14 +38,17 @@ struct node_memory_type_map {
static DEFINE_MUTEX(memory_tier_lock);
static LIST_HEAD(memory_tiers);
/*
- * The list is used to store all memory types that are not created
- * by a device driver.
+ * The list is used to store all memory types, both auto-initialized
+ * and driver-requested. Drivers obtain types via mt_get_memory_type().
*/
static LIST_HEAD(default_memory_types);
static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
struct memory_dev_type *default_dram_type;
nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE;
+static struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+ struct list_head *memory_types);
+
static const struct bus_type memory_tier_subsys = {
.name = "memory_tiering",
.dev_name = "memory_tier",
@@ -621,7 +624,7 @@ static void release_memtype(struct kref *kref)
kfree(memtype);
}
-struct memory_dev_type *alloc_memory_type(int adistance)
+static struct memory_dev_type *alloc_memory_type(int adistance)
{
struct memory_dev_type *memtype;
@@ -635,13 +638,11 @@ struct memory_dev_type *alloc_memory_type(int adistance)
kref_init(&memtype->kref);
return memtype;
}
-EXPORT_SYMBOL_GPL(alloc_memory_type);
-void put_memory_type(struct memory_dev_type *memtype)
+static void put_memory_type(struct memory_dev_type *memtype)
{
kref_put(&memtype->kref, release_memtype);
}
-EXPORT_SYMBOL_GPL(put_memory_type);
void init_node_memory_type(int node, struct memory_dev_type *memtype)
{
@@ -670,7 +671,8 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
}
EXPORT_SYMBOL_GPL(clear_node_memory_type);
-struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+static struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+ struct list_head *memory_types)
{
struct memory_dev_type *mtype;
@@ -686,18 +688,13 @@ struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *m
return mtype;
}
-EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
-void mt_put_memory_types(struct list_head *memory_types)
+struct memory_dev_type *mt_get_memory_type(int adist)
{
- struct memory_dev_type *mtype, *mtn;
-
- list_for_each_entry_safe(mtype, mtn, memory_types, list) {
- list_del(&mtype->list);
- put_memory_type(mtype);
- }
+ guard(mutex)(&memory_tier_lock);
+ return mt_find_alloc_memory_type(adist, &default_memory_types);
}
-EXPORT_SYMBOL_GPL(mt_put_memory_types);
+EXPORT_SYMBOL_GPL(mt_get_memory_type);
/*
* This is invoked via `late_initcall()` to initialize memory tiers for
--
2.53.0
* [PATCH 2/8] mm/memory: add memory_block_align_range() helper
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-03-21 15:03 ` [PATCH 1/8] mm/memory-tiers: consolidate memory type dedup into mt_get_memory_type() Gregory Price
@ 2026-03-21 15:03 ` Gregory Price
2026-03-21 15:03 ` [PATCH 3/8] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
` (6 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:03 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
Memory hotplug operations require ranges aligned to memory block
boundaries. This is a generic operation for hotplug.
Add memory_block_align_range() as a common helper in <linux/memory.h>
that aligns the start address up and end address down to memory block
boundaries.
Update dax/kmem to use this helper.
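To make the alignment arithmetic concrete, here is a user-space sketch
of the same computation. The 128 MiB block size, the local struct range
and the helper names are assumptions chosen for illustration; in the
kernel the block size comes from memory_block_size_bytes() and the
macros are ALIGN() and ALIGN_DOWN().

```c
#include <assert.h>	/* for the usage checks shown after the block */
#include <stdint.h>

/* Local stand-in for the kernel's struct range; end is inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Assumed memory block size for this sketch (must be a power of two). */
#define SKETCH_BLOCK_SIZE (128ULL << 20)

static uint64_t sketch_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

static uint64_t sketch_align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Shrink @r inward so that it covers only whole memory blocks. */
static struct range sketch_block_align_range(const struct range *r)
{
	struct range aligned;

	aligned.start = sketch_align_up(r->start, SKETCH_BLOCK_SIZE);
	/* end is inclusive: align the exclusive end down, then step back. */
	aligned.end = sketch_align_down(r->end + 1, SKETCH_BLOCK_SIZE) - 1;

	return aligned;
}
```

With these assumptions, the range 0x100001000-0x11fffffff becomes
0x108000000-0x11fffffff. A range smaller than one block, such as
0x100001000-0x100001fff, yields start >= end, which is the case the
caller is expected to check for.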
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/dax/kmem.c | 4 +---
include/linux/memory.h | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index eb693a581961..798f389df992 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -26,9 +26,7 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
struct dev_dax_range *dax_range = &dev_dax->ranges[i];
struct range *range = &dax_range->range;
- /* memory-block align the hotplug range */
- r->start = ALIGN(range->start, memory_block_size_bytes());
- r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+ *r = memory_block_align_range(range);
if (r->start >= r->end) {
r->start = range->start;
r->end = range->end;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 5bb5599c6b2b..17cdf6ba3823 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -20,6 +20,7 @@
#include <linux/compiler.h>
#include <linux/mutex.h>
#include <linux/memory_hotplug.h>
+#include <linux/range.h>
#define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
@@ -100,6 +101,27 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
unsigned long memory_block_size_bytes(void);
int set_memory_block_size_order(unsigned int order);
+/**
+ * memory_block_align_range - align a physical address range to memory blocks
+ * @range: the input range to align
+ *
+ * Aligns the start address up and the end address down to memory block
+ * boundaries. This is required for memory hotplug operations which must
+ * operate on memory-block aligned ranges.
+ *
+ * Returns the aligned range. Callers should check that the returned
+ * range is valid (aligned.start < aligned.end) before using it.
+ */
+static inline struct range memory_block_align_range(const struct range *range)
+{
+ struct range aligned;
+
+ aligned.start = ALIGN(range->start, memory_block_size_bytes());
+ aligned.end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+
+ return aligned;
+}
+
struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
--
2.53.0
* [PATCH 3/8] mm/memory_hotplug: pass online_type to online_memory_block() via arg
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-03-21 15:03 ` [PATCH 1/8] mm/memory-tiers: consolidate memory type dedup into mt_get_memory_type() Gregory Price
2026-03-21 15:03 ` [PATCH 2/8] mm/memory: add memory_block_align_range() helper Gregory Price
@ 2026-03-21 15:03 ` Gregory Price
2026-03-21 15:04 ` [PATCH 4/8] mm/memory_hotplug: export mhp_get_default_online_type Gregory Price
` (5 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:03 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
Modify online_memory_block() to accept the online type through its arg
parameter rather than calling mhp_get_default_online_type() internally.
This prepares for allowing callers to specify explicit online types.
Update the caller in add_memory_resource() to pass the default online
type via a local variable.
No functional change.
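The pattern here is the usual one of passing context to a walk callback
through an opaque pointer. A small user-space sketch, assuming a
simplified struct mem_block and a hypothetical walk_blocks() in place of
struct memory_block and walk_memory_blocks():

```c
#include <assert.h>	/* for the usage checks shown after the block */
#include <stddef.h>

/* Same ordering as the kernel's enum mmop. */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/* Simplified stand-in for struct memory_block. */
struct mem_block {
	enum mmop online_type;
};

typedef int (*walk_fn)(struct mem_block *mem, void *arg);

/* Call @fn on each block, stopping at the first non-zero return. */
static int walk_blocks(struct mem_block *blocks, size_t n, void *arg,
		       walk_fn fn)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int rc = fn(&blocks[i], arg);

		if (rc)
			return rc;
	}
	return 0;
}

/* The callback reads the caller's choice from @arg instead of a global. */
static int online_block(struct mem_block *mem, void *arg)
{
	enum mmop *online_type = arg;

	mem->online_type = *online_type;
	return 0;
}
```

The caller decides the type once, stores it in a local, and passes its
address to the walk. Every block then receives the same value even if
the system default changes while the walk is in progress.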
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/memory_hotplug.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 86d3faf50453..282bf3d89613 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1338,7 +1338,9 @@ static int check_hotplug_memory_range(u64 start, u64 size)
static int online_memory_block(struct memory_block *mem, void *arg)
{
- mem->online_type = mhp_get_default_online_type();
+ enum mmop *online_type = arg;
+
+ mem->online_type = *online_type;
return device_online(&mem->dev);
}
@@ -1492,6 +1494,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+ enum mmop online_type = mhp_get_default_online_type();
enum memblock_flags memblock_flags = MEMBLOCK_NONE;
struct memory_group *group = NULL;
u64 start, size;
@@ -1580,7 +1583,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* online pages if requested */
if (mhp_get_default_online_type() != MMOP_OFFLINE)
- walk_memory_blocks(start, size, NULL, online_memory_block);
+ walk_memory_blocks(start, size, &online_type,
+ online_memory_block);
return ret;
error:
--
2.53.0
* [PATCH 4/8] mm/memory_hotplug: export mhp_get_default_online_type
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (2 preceding siblings ...)
2026-03-21 15:03 ` [PATCH 3/8] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
@ 2026-03-21 15:04 ` Gregory Price
2026-03-21 15:04 ` [PATCH 5/8] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
` (4 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:04 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
Drivers which may pass hotplug policy down to DAX need MMOP_ symbols
and the mhp_get_default_online_type function for hotplug use cases.
Some drivers (cxl) co-mingle their hotplug and devdax use-cases into
the same driver code, and chose the dax_kmem path as the default driver
path - making it difficult to require hotplug as a predicate to building
the overall driver (it may break other non-hotplug use-cases).
Export mhp_get_default_online_type function to allow these drivers to
build when hotplug is disabled and still use the DAX use case.
In the compiled-out case (CONFIG_MEMORY_HOTPLUG=n) we simply return
MMOP_OFFLINE as it's non-destructive. The internal function can never
return -1 either, so we choose this to allow for defining the function
with 'enum mmop'.
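The stub approach can be sketched outside the kernel as below. The
macro SKETCH_MEMORY_HOTPLUG and the sketch_ function names are
hypothetical stand-ins for CONFIG_MEMORY_HOTPLUG and the real symbols;
the point is only that callers compile and behave sensibly when the
feature is absent.

```c
#include <assert.h>	/* for the usage checks shown after the block */

/* Same ordering as the kernel's enum mmop. */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

#ifdef SKETCH_MEMORY_HOTPLUG
/* Feature enabled: the real implementation lives in another file. */
enum mmop sketch_get_default_online_type(void);
#else
/* Feature disabled: a safe, non-destructive answer. */
static inline enum mmop sketch_get_default_online_type(void)
{
	return MMOP_OFFLINE;
}
#endif

/* A caller that needs no #ifdef of its own. */
static int sketch_should_auto_online(void)
{
	return sketch_get_default_online_type() != MMOP_OFFLINE;
}
```

Built without the macro defined, sketch_should_auto_online() returns 0,
so a driver sharing code between hotplug and non-hotplug use cases
simply never tries to online anything.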
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory_hotplug.h | 29 +++++++++++++++++++++++++++++
mm/memory_hotplug.c | 1 +
2 files changed, 30 insertions(+)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e77ef3d7ff73..a8bcb36f93b8 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -6,6 +6,7 @@
#include <linux/spinlock.h>
#include <linux/notifier.h>
#include <linux/bug.h>
+#include <linux/errno.h>
struct page;
struct zone;
@@ -28,6 +29,27 @@ enum mmop {
MMOP_ONLINE_MOVABLE,
};
+/**
+ * mmop_to_str - convert memory online type to string
+ * @online_type: the MMOP_* value to convert
+ *
+ * Returns a string representation of the memory online type,
+ * suitable for sysfs output (includes trailing newline).
+ */
+static inline const char *mmop_to_str(enum mmop online_type)
+{
+ switch (online_type) {
+ case MMOP_ONLINE:
+ return "online\n";
+ case MMOP_ONLINE_KERNEL:
+ return "online_kernel\n";
+ case MMOP_ONLINE_MOVABLE:
+ return "online_movable\n";
+ default:
+ return "offline\n";
+ }
+}
+
#ifdef CONFIG_MEMORY_HOTPLUG
struct page *pfn_to_online_page(unsigned long pfn);
@@ -221,6 +243,11 @@ static inline bool mhp_supports_memmap_on_memory(void)
static inline void pgdat_kswapd_lock(pg_data_t *pgdat) {}
static inline void pgdat_kswapd_unlock(pg_data_t *pgdat) {}
static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat) {}
+
+static inline int mhp_online_type_from_str(const char *str)
+{
+ return -EOPNOTSUPP;
+}
#endif /* ! CONFIG_MEMORY_HOTPLUG */
/*
@@ -316,6 +343,8 @@ extern struct zone *zone_for_pfn_range(enum mmop online_type,
extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
struct mhp_params *params);
void arch_remove_linear_mapping(u64 start, u64 size);
+#else
+static inline enum mmop mhp_get_default_online_type(void) { return MMOP_OFFLINE; }
#endif /* CONFIG_MEMORY_HOTPLUG */
#endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 282bf3d89613..af9a6cb5a2f9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -240,6 +240,7 @@ enum mmop mhp_get_default_online_type(void)
return mhp_default_online_type;
}
+EXPORT_SYMBOL_GPL(mhp_get_default_online_type);
void mhp_set_default_online_type(enum mmop online_type)
{
--
2.53.0
* [PATCH 5/8] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (3 preceding siblings ...)
2026-03-21 15:04 ` [PATCH 4/8] mm/memory_hotplug: export mhp_get_default_online_type Gregory Price
@ 2026-03-21 15:04 ` Gregory Price
2026-03-21 15:04 ` [PATCH 6/8] dax: plumb hotplug online_type through dax Gregory Price
` (3 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:04 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
Existing callers of add_memory_driver_managed() cannot select the
preferred online type (ZONE_NORMAL vs ZONE_MOVABLE), requiring them to
hot-add memory as offline blocks and then follow up by onlining each
memory block individually.
Most drivers prefer the system default, but the CXL driver wants to
plumb a preferred policy through the dax kmem driver.
Refactor APIs to add a new interface which allows the dax kmem and
cxl_core modules to select a preferred policy. Only expose this
interface to those modules to avoid confusion among existing API users
and to limit usage in out-of-tree modules.
Refactor add_memory_driver_managed, extract __add_memory_driver_managed:
- Add proper kernel-doc for add_memory_driver_managed while refactoring
- New helper accepts an explicit online_type
- New helper validates online_type is between OFFLINE and ONLINE_MOVABLE
Refactor add_memory_resource, extract __add_memory_resource:
- New helper accepts an explicit online_type
Original APIs now explicitly pass the system-default to new helpers.
No functional change for existing users.
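The shape of the refactor, a validating helper that takes an explicit
type plus a thin wrapper that supplies the default, can be sketched in
user space like this. All sketch_ names are hypothetical, the "system
default" is a plain variable, and the helper takes an int so the range
check is meaningful in portable C. The real functions also register and
add the memory resource, which is left out here.

```c
#include <assert.h>	/* for the usage checks shown after the block */
#include <errno.h>

/* Same ordering as the kernel's enum mmop. */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/* Hypothetical stand-in for the system default policy. */
static enum mmop sketch_default_online_type = MMOP_ONLINE;

/* Records the type the "core" was asked to use, for inspection. */
static enum mmop sketch_last_online_type;

/* New-style helper: explicit type, validated before anything happens. */
static int sketch_add_memory_typed(int online_type)
{
	if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
		return -EINVAL;

	sketch_last_online_type = (enum mmop)online_type;
	return 0;
}

/* Old-style entry point: unchanged behavior, default passed explicitly. */
static int sketch_add_memory(void)
{
	return sketch_add_memory_typed(sketch_default_online_type);
}
```

Existing callers keep using the wrapper and see no difference, while a
caller with a preference, such as ZONE_MOVABLE, uses the helper directly
and gets -EINVAL back for an out-of-range value.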
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/memory_hotplug.h | 3 ++
mm/memory_hotplug.c | 60 +++++++++++++++++++++++++++++-----
2 files changed, 55 insertions(+), 8 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a8bcb36f93b8..1f19f08552ea 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -320,6 +320,9 @@ extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
extern int add_memory_resource(int nid, struct resource *resource,
mhp_t mhp_flags);
+int __add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name, mhp_t mhp_flags,
+ enum mmop online_type);
extern int add_memory_driver_managed(int nid, u64 start, u64 size,
const char *resource_name,
mhp_t mhp_flags);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index af9a6cb5a2f9..9081aad5078f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1492,10 +1492,10 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
*
* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
*/
-int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
+static int __add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags,
+ enum mmop online_type)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
- enum mmop online_type = mhp_get_default_online_type();
enum memblock_flags memblock_flags = MEMBLOCK_NONE;
struct memory_group *group = NULL;
u64 start, size;
@@ -1583,7 +1583,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
merge_system_ram_resource(res);
/* online pages if requested */
- if (mhp_get_default_online_type() != MMOP_OFFLINE)
+ if (online_type != MMOP_OFFLINE)
walk_memory_blocks(start, size, &online_type,
online_memory_block);
@@ -1601,7 +1601,13 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
return ret;
}
-/* requires device_hotplug_lock, see add_memory_resource() */
+int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
+{
+ return __add_memory_resource(nid, res, mhp_flags,
+ mhp_get_default_online_type());
+}
+
+/* requires device_hotplug_lock, see __add_memory_resource() */
int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
{
struct resource *res;
@@ -1629,7 +1635,15 @@ int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
}
EXPORT_SYMBOL_GPL(add_memory);
-/*
+/**
+ * __add_memory_driver_managed - add driver-managed memory with explicit online_type
+ * @nid: NUMA node ID where the memory will be added
+ * @start: Start physical address of the memory range
+ * @size: Size of the memory range in bytes
+ * @resource_name: Resource name in format "System RAM ($DRIVER)"
+ * @mhp_flags: Memory hotplug flags
+ * @online_type: Auto-Online behavior (offline, online, kernel, movable)
+ *
* Add special, driver-managed memory to the system as system RAM. Such
* memory is not exposed via the raw firmware-provided memmap as system
* RAM, instead, it is detected and added by a driver - during cold boot,
@@ -1649,9 +1663,12 @@ EXPORT_SYMBOL_GPL(add_memory);
*
* The resource_name (visible via /proc/iomem) has to have the format
* "System RAM ($DRIVER)".
+ *
+ * Return: 0 on success, negative error code on failure.
*/
-int add_memory_driver_managed(int nid, u64 start, u64 size,
- const char *resource_name, mhp_t mhp_flags)
+int __add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name, mhp_t mhp_flags,
+ enum mmop online_type)
{
struct resource *res;
int rc;
@@ -1661,6 +1678,9 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
resource_name[strlen(resource_name) - 1] != ')')
return -EINVAL;
+ if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
+ return -EINVAL;
+
lock_device_hotplug();
res = register_memory_resource(start, size, resource_name);
@@ -1669,7 +1689,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
goto out_unlock;
}
- rc = add_memory_resource(nid, res, mhp_flags);
+ rc = __add_memory_resource(nid, res, mhp_flags, online_type);
if (rc < 0)
release_memory_resource(res);
@@ -1677,6 +1697,30 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
unlock_device_hotplug();
return rc;
}
+EXPORT_SYMBOL_FOR_MODULES(__add_memory_driver_managed, "kmem,cxl_core");
+
+/**
+ * add_memory_driver_managed - add driver-managed memory
+ * @nid: NUMA node ID where the memory will be added
+ * @start: Start physical address of the memory range
+ * @size: Size of the memory range in bytes
+ * @resource_name: Resource name in format "System RAM ($DRIVER)"
+ * @mhp_flags: Memory hotplug flags
+ *
+ * Add driver-managed memory with the system default online type set by
+ * build config or kernel boot parameter.
+ *
+ * See __add_memory_driver_managed for more details.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name, mhp_t mhp_flags)
+{
+ return __add_memory_driver_managed(nid, start, size, resource_name,
+ mhp_flags,
+ mhp_get_default_online_type());
+}
EXPORT_SYMBOL_GPL(add_memory_driver_managed);
/*
--
2.53.0
* [PATCH 6/8] dax: plumb hotplug online_type through dax
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (4 preceding siblings ...)
2026-03-21 15:04 ` [PATCH 5/8] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
@ 2026-03-21 15:04 ` Gregory Price
2026-03-21 15:04 ` [PATCH 7/8] dax/kmem: extract hotplug/hotremove helper functions Gregory Price
` (2 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:04 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
There is no way for drivers leveraging dax_kmem to plumb through a
preferred auto-online policy - the system default policy is forced.
Add an 'enum mmop' field to the DAX device creation path to allow drivers
to specify an auto-online policy when using the kmem driver.
Current callers initialize online_type to mhp_get_default_online_type()
to retain backward compatibility and to make explicit to the drivers
what is actually happening underneath.
No functional changes to existing callers.
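
A minimal userspace sketch of the plumbing described above, under stated
assumptions: every sketch_* name and every SK_* value is hypothetical and
only models the shape of dev_dax_data, dev_dax and
mhp_get_default_online_type(). The range check mirrors the one added to
__add_memory_driver_managed() in patch 5; it is not part of the creation
path in this patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for enum mmop; the numeric values are assumptions. */
enum sketch_mmop {
	SK_OFFLINE = 0,
	SK_ONLINE = 1,
	SK_ONLINE_MOVABLE = 2,
};

/* Models the creation-time data a driver fills in. */
struct sketch_dax_data {
	int id;
	bool memmap_on_memory;
	enum sketch_mmop online_type;
};

/* Models the device, which keeps its own copy of the policy. */
struct sketch_dev_dax {
	bool memmap_on_memory;
	enum sketch_mmop online_type;
};

/* Assumed system default; in the kernel this comes from config or boot param. */
enum sketch_mmop sketch_default_online_type(void)
{
	return SK_ONLINE;
}

/* Range check in the style of the one in __add_memory_driver_managed(). */
bool sketch_online_type_valid(int online_type)
{
	return online_type >= SK_OFFLINE && online_type <= SK_ONLINE_MOVABLE;
}

/* Copy the caller's policy into the device at creation time. */
void sketch_create_dev_dax(const struct sketch_dax_data *data,
			   struct sketch_dev_dax *dev)
{
	dev->memmap_on_memory = data->memmap_on_memory;
	dev->online_type = data->online_type;
}
```

A caller that wants the pre-existing behaviour sets .online_type to the
default explicitly, which is what the converted callers in this patch do.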
Cc: David Hildenbrand <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/dax/bus.c | 3 +++
drivers/dax/bus.h | 2 ++
drivers/dax/cxl.c | 1 +
drivers/dax/dax-private.h | 3 +++
drivers/dax/hmem/hmem.c | 1 +
drivers/dax/kmem.c | 13 +++++++++++--
6 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index c94c09622516..2c6140dc9382 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */
#include <linux/memremap.h>
+#include <linux/memory_hotplug.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/list.h>
@@ -395,6 +396,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
.size = 0,
.id = -1,
.memmap_on_memory = false,
+ .online_type = mhp_get_default_online_type(),
};
struct dev_dax *dev_dax = __devm_create_dev_dax(&data);
@@ -1494,6 +1496,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
ida_init(&dev_dax->ida);
dev_dax->memmap_on_memory = data->memmap_on_memory;
+ dev_dax->online_type = data->online_type;
inode = dax_inode(dax_dev);
dev->devt = inode->i_rdev;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..f037cd8a2d51 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -3,6 +3,7 @@
#ifndef __DAX_BUS_H__
#define __DAX_BUS_H__
#include <linux/device.h>
+#include <linux/memory_hotplug.h>
#include <linux/range.h>
struct dev_dax;
@@ -24,6 +25,7 @@ struct dev_dax_data {
resource_size_t size;
int id;
bool memmap_on_memory;
+ enum mmop online_type;
};
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 13cd94d32ff7..d6fbec863361 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -27,6 +27,7 @@ static int cxl_dax_region_probe(struct device *dev)
.id = -1,
.size = range_len(&cxlr_dax->hpa_range),
.memmap_on_memory = true,
+ .online_type = mhp_get_default_online_type(),
};
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index c6ae27c982f4..734fb83f5eb4 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -8,6 +8,7 @@
#include <linux/device.h>
#include <linux/cdev.h>
#include <linux/idr.h>
+#include <linux/memory_hotplug.h>
/* private routines between core files */
struct dax_device;
@@ -77,6 +78,7 @@ struct dev_dax_range {
* @dev: device core
* @pgmap: pgmap for memmap setup / lifetime (driver owned)
* @memmap_on_memory: allow kmem to put the memmap in the memory
+ * @online_type: MMOP_* online type for memory hotplug
* @nr_range: size of @ranges
* @ranges: range tuples of memory used
*/
@@ -91,6 +93,7 @@ struct dev_dax {
struct device dev;
struct dev_pagemap *pgmap;
bool memmap_on_memory;
+ enum mmop online_type;
int nr_range;
struct dev_dax_range *ranges;
};
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1cf7c2a0ee1c..acbc574ced93 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -36,6 +36,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
.id = -1,
.size = region_idle ? 0 : range_len(&mri->range),
.memmap_on_memory = false,
+ .online_type = mhp_get_default_online_type(),
};
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 798f389df992..d4c34b2e3766 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -16,6 +16,11 @@
#include "dax-private.h"
#include "bus.h"
+/* Internal function exported only to kmem module */
+extern int __add_memory_driver_managed(int nid, u64 start, u64 size,
+ const char *resource_name,
+ mhp_t mhp_flags, enum mmop online_type);
+
/* Memory resource name used for add_memory_driver_managed(). */
static const char *kmem_name;
/* Set if any memory will remain added when the driver will be unloaded. */
@@ -49,6 +54,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
struct dax_kmem_data *data;
struct memory_dev_type *mtype;
int i, rc, mapped = 0;
+ enum mmop online_type;
mhp_t mhp_flags;
int numa_node;
int adist = MEMTIER_DEFAULT_LOWTIER_ADISTANCE;
@@ -111,6 +117,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
goto err_reg_mgid;
data->mgid = rc;
+ online_type = dev_dax->online_type;
+
for (i = 0; i < dev_dax->nr_range; i++) {
struct resource *res;
struct range range;
@@ -151,8 +159,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
* Ensure that future kexec'd kernels will not treat
* this as RAM automatically.
*/
- rc = add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, mhp_flags);
+ rc = __add_memory_driver_managed(data->mgid, range.start,
+ range_len(&range), kmem_name, mhp_flags,
+ online_type);
if (rc) {
dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
--
2.53.0
* [PATCH 7/8] dax/kmem: extract hotplug/hotremove helper functions
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (5 preceding siblings ...)
2026-03-21 15:04 ` [PATCH 6/8] dax: plumb hotplug online_type through dax Gregory Price
@ 2026-03-21 15:04 ` Gregory Price
2026-03-21 15:04 ` [PATCH 8/8] dax/kmem: add sysfs interface for atomic whole-device hotplug Gregory Price
2026-03-21 17:40 ` [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Andrew Morton
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:04 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
Refactor kmem _probe() and _remove() by extracting init, cleanup, hotplug,
and hot-remove logic into separate helper functions:
- dax_kmem_init_resources: inits IO_RESOURCE w/ request_mem_region
- dax_kmem_cleanup_resources: cleans up initialized IO_RESOURCE
- dax_kmem_do_hotplug: handles memory region reservation and adding
- dax_kmem_do_hotremove: handles memory removal and resource cleanup
This is a pure refactoring with no functional change. The helpers will
enable future extensions to support more granular control over memory
hotplug operations.
We need to split hotplug/remove and init/cleanup in order to have the
resources available for hot-add. Otherwise, when probe occurs, the dax
devices are never added to sysfs because the resources are never
registered.
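
The init and hotplug helpers in the diff below share one partial-failure
policy: an error on the first range aborts with that error, while an error
after at least one success is skipped so the ranges already handled stay in
place. A minimal userspace sketch of that loop shape, assuming a
hypothetical callback type (sketch_range_op) and a hypothetical test
callback (sketch_fail_on_index) that do not exist in the kernel:

```c
#include <stddef.h>

/* Hypothetical per-range operation: returns 0 on success, negative on error. */
typedef int (*sketch_range_op)(size_t index, void *ctx);

/*
 * Apply op to every range. Returns the number of ranges handled, or the
 * error code of the first range if nothing had succeeded before it failed.
 */
int sketch_apply_to_ranges(size_t nr_range, sketch_range_op op, void *ctx)
{
	int done = 0;

	for (size_t i = 0; i < nr_range; i++) {
		int rc = op(i, ctx);

		if (rc) {
			/* Earlier ranges succeeded: keep them, skip this one. */
			if (done)
				continue;
			return rc;
		}
		done++;
	}
	return done;
}

/* Example callback: fail with -16 (-EBUSY) on the index stored in ctx. */
int sketch_fail_on_index(size_t index, void *ctx)
{
	return index == *(size_t *)ctx ? -16 : 0;
}
```

One consequence of this shape is that a caller cannot tell from a positive
return value alone whether every range was handled; it has to compare the
count against nr_range, as dev_dax_kmem_remove() does for hot-remove.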
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/dax/kmem.c | 308 ++++++++++++++++++++++++++++++---------------
1 file changed, 210 insertions(+), 98 deletions(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index d4c34b2e3766..8be9286f0ea3 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -47,15 +47,189 @@ struct dax_kmem_data {
struct resource *res[];
};
+/**
+ * dax_kmem_do_hotplug - hotplug memory for dax kmem device
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Hotplugs all ranges in the dev_dax region as system memory.
+ *
+ * Returns the number of successfully mapped ranges, or negative error.
+ */
+static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data,
+ int online_type)
+{
+ struct device *dev = &dev_dax->dev;
+ int i, rc, onlined = 0;
+ mhp_t mhp_flags;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range range;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ mhp_flags = MHP_NID_IS_MGID;
+ if (dev_dax->memmap_on_memory)
+ mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+
+ /*
+ * Ensure that future kexec'd kernels will not treat
+ * this as RAM automatically.
+ */
+ rc = __add_memory_driver_managed(data->mgid, range.start,
+ range_len(&range), kmem_name, mhp_flags,
+ online_type);
+
+ if (rc) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
+ i, range.start, range.end);
+ if (onlined)
+ continue;
+ return rc;
+ }
+ onlined++;
+ }
+
+ return onlined;
+}
+
+/**
+ * dax_kmem_init_resources - create memory regions for dax kmem
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Initializes all the resources for the DAX device.
+ *
+ * Returns the number of successfully mapped ranges, or negative error.
+ */
+static int dax_kmem_init_resources(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ struct device *dev = &dev_dax->dev;
+ int i, rc, mapped = 0;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct resource *res;
+ struct range range;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ /* Skip ranges already added */
+ if (data->res[i])
+ continue;
+
+ /* Region is permanently reserved if hotremove fails. */
+ res = request_mem_region(range.start, range_len(&range),
+ data->res_name);
+ if (!res) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
+ i, range.start, range.end);
+ /*
+ * Once some memory has been onlined we can't
+ * assume that it can be un-onlined safely.
+ */
+ if (mapped)
+ continue;
+ return -EBUSY;
+ }
+ data->res[i] = res;
+ /*
+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
+ * so that add_memory() can add a child resource. Do not
+ * inherit flags from the parent since it may set new flags
+ * unknown to us that will break add_memory() below.
+ */
+ res->flags = IORESOURCE_SYSTEM_RAM;
+ mapped++;
+ }
+ return mapped;
+}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * dax_kmem_do_hotremove - hot-remove memory for dax kmem device
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Removes all ranges in the dev_dax region.
+ *
+ * Returns the number of successfully removed ranges.
+ */
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ struct device *dev = &dev_dax->dev;
+ int i, success = 0;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range range;
+ int rc;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ /* Skip ranges not currently added */
+ if (!data->res[i])
+ continue;
+
+ rc = remove_memory(range.start, range_len(&range));
+ if (rc == 0) {
+ /* Release the resource for the successfully removed range */
+ remove_resource(data->res[i]);
+ kfree(data->res[i]);
+ data->res[i] = NULL;
+ success++;
+ continue;
+ }
+ any_hotremove_failed = true;
+ dev_err(dev, "mapping%d: %#llx-%#llx hotremove failed\n",
+ i, range.start, range.end);
+ }
+
+ return success;
+}
+#else
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ return -EBUSY;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+/**
+ * dax_kmem_cleanup_resources - remove the dax memory resources
+ * @dev_dax: the dev_dax instance
+ * @data: the dax_kmem_data structure with resource tracking
+ *
+ * Removes all resources in the dev_dax region.
+ */
+static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
+ struct dax_kmem_data *data)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ if (!data->res[i])
+ continue;
+ remove_resource(data->res[i]);
+ kfree(data->res[i]);
+ data->res[i] = NULL;
+ }
+}
+
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
unsigned long total_len = 0, orig_len = 0;
struct dax_kmem_data *data;
struct memory_dev_type *mtype;
- int i, rc, mapped = 0;
- enum mmop online_type;
- mhp_t mhp_flags;
+ int i, rc;
int numa_node;
int adist = MEMTIER_DEFAULT_LOWTIER_ADISTANCE;
@@ -116,72 +290,27 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
if (rc < 0)
goto err_reg_mgid;
data->mgid = rc;
-
- online_type = dev_dax->online_type;
-
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct resource *res;
- struct range range;
-
- rc = dax_kmem_range(dev_dax, i, &range);
- if (rc)
- continue;
-
- /* Region is permanently reserved if hotremove fails. */
- res = request_mem_region(range.start, range_len(&range), data->res_name);
- if (!res) {
- dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
- i, range.start, range.end);
- /*
- * Once some memory has been onlined we can't
- * assume that it can be un-onlined safely.
- */
- if (mapped)
- continue;
- rc = -EBUSY;
- goto err_request_mem;
- }
- data->res[i] = res;
-
- /*
- * Set flags appropriate for System RAM. Leave ..._BUSY clear
- * so that add_memory() can add a child resource. Do not
- * inherit flags from the parent since it may set new flags
- * unknown to us that will break add_memory() below.
- */
- res->flags = IORESOURCE_SYSTEM_RAM;
-
- mhp_flags = MHP_NID_IS_MGID;
- if (dev_dax->memmap_on_memory)
- mhp_flags |= MHP_MEMMAP_ON_MEMORY;
-
- /*
- * Ensure that future kexec'd kernels will not treat
- * this as RAM automatically.
- */
- rc = __add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, mhp_flags,
- online_type);
-
- if (rc) {
- dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
- i, range.start, range.end);
- remove_resource(res);
- kfree(res);
- data->res[i] = NULL;
- if (mapped)
- continue;
- goto err_request_mem;
- }
- mapped++;
- }
data->mtype = mtype;
dev_set_drvdata(dev, data);
+ rc = dax_kmem_init_resources(dev_dax, data);
+ if (rc < 0)
+ goto err_resources;
+
+ /*
+ * Hotplug using the configured online type for this device.
+ */
+ rc = dax_kmem_do_hotplug(dev_dax, data, dev_dax->online_type);
+ if (rc < 0)
+ goto err_hotplug;
+
return 0;
-err_request_mem:
+err_hotplug:
+ dax_kmem_cleanup_resources(dev_dax, data);
+err_resources:
+ dev_set_drvdata(dev, NULL);
memory_group_unregister(data->mgid);
err_reg_mgid:
kfree(data->res_name);
@@ -195,7 +324,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
#ifdef CONFIG_MEMORY_HOTREMOVE
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
- int i, success = 0;
+ int success;
int node = dev_dax->target_node;
struct device *dev = &dev_dax->dev;
struct dax_kmem_data *data = dev_get_drvdata(dev);
@@ -206,42 +335,25 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
* there is no way to hotremove this memory until reboot because device
* unbind will succeed even if we return failure.
*/
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct range range;
- int rc;
-
- rc = dax_kmem_range(dev_dax, i, &range);
- if (rc)
- continue;
-
- rc = remove_memory(range.start, range_len(&range));
- if (rc == 0) {
- remove_resource(data->res[i]);
- kfree(data->res[i]);
- data->res[i] = NULL;
- success++;
- continue;
- }
- any_hotremove_failed = true;
- dev_err(dev,
- "mapping%d: %#llx-%#llx cannot be hotremoved until the next reboot\n",
- i, range.start, range.end);
+ success = dax_kmem_do_hotremove(dev_dax, data);
+ if (success < dev_dax->nr_range) {
+ dev_err(dev, "Hotplug regions stuck online until reboot\n");
+ return;
}
- if (success >= dev_dax->nr_range) {
- memory_group_unregister(data->mgid);
- kfree(data->res_name);
- kfree(data);
- dev_set_drvdata(dev, NULL);
- /*
- * Clear the memtype association on successful unplug.
- * If not, we have memory blocks left which can be
- * offlined/onlined later. We need to keep memory_dev_type
- * for that. This implies this reference will be around
- * till next reboot.
- */
- clear_node_memory_type(node, data->mtype);
- }
+ dax_kmem_cleanup_resources(dev_dax, data);
+ memory_group_unregister(data->mgid);
+ kfree(data->res_name);
+ kfree(data);
+ dev_set_drvdata(dev, NULL);
+ /*
+ * Clear the memtype association on successful unplug.
+ * If not, we have memory blocks left which can be
+ * offlined/onlined later. We need to keep memory_dev_type
+ * for that. This implies this reference will be around
+ * till next reboot.
+ */
+ clear_node_memory_type(node, data->mtype);
}
#else
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
--
2.53.0
* [PATCH 8/8] dax/kmem: add sysfs interface for atomic whole-device hotplug
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (6 preceding siblings ...)
2026-03-21 15:04 ` [PATCH 7/8] dax/kmem: extract hotplug/hotremove helper functions Gregory Price
@ 2026-03-21 15:04 ` Gregory Price
2026-03-21 17:40 ` [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Andrew Morton
8 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 15:04 UTC (permalink / raw)
To: linux-mm, vishal.l.verma, dave.jiang, akpm, david, osalvador
Cc: dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team, Hannes Reinecke
The dax kmem driver currently onlines memory automatically during
probe using the system's default online policy but provides no way
to control or query the entire region state at runtime.
Additionally, there is no atomic mechanism to offline and remove
the entire set of memory blocks together. Instead, this is presently
done in two steps: (offline all, remove all). This creates a race
condition where external entities can operate directly on the blocks
and cause hot-unplug to fail.
Add a new 'hotplug' sysfs attribute that allows userspace to control
and query the entire memory region state.
The interface supports the following states:
- "unplug": memory is offline and blocks are not present
- "online": memory is online as normal system RAM
- "online_movable": memory is online in ZONE_MOVABLE
Valid transitions:
- unplugged -> online
- unplugged -> online_movable
- online -> unplugged
- online_movable -> unplugged
"offline" (memory blocks exist but are offline by default) is not
supported because it's functionally equivalent to "unplugged" and
entices races between offlining and unplugging.
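
The transition rules above can be sketched as a small userspace state
check. This is an illustration under stated assumptions, not the driver
code: the sketch_* names and SK_* values are hypothetical stand-ins for
DAX_KMEM_UNPLUGGED and the MMOP_* constants, and plain strcmp() is used
where the driver uses sysfs_streq(), so a trailing newline is not accepted
here.

```c
#include <string.h>

/* Hypothetical stand-ins; the numeric values are assumptions. */
enum sketch_state {
	SK_UNPLUGGED = -1,
	SK_OFFLINE = 0,
	SK_ONLINE = 1,
	SK_ONLINE_MOVABLE = 2,
};

#define SK_EINVAL (-22)
#define SK_EBUSY  (-16)

/* Map a written argument to a target state, or SK_EINVAL if unknown. */
int sketch_parse_state(const char *arg)
{
	if (strcmp(arg, "unplug") == 0)
		return SK_UNPLUGGED;
	if (strcmp(arg, "online") == 0)
		return SK_ONLINE;
	if (strcmp(arg, "online_movable") == 0)
		return SK_ONLINE_MOVABLE;
	return SK_EINVAL;
}

/* Return 0 if moving from cur to target is permitted, SK_EBUSY otherwise. */
int sketch_check_transition(int cur, int target)
{
	/* Requesting the current state is a successful no-op. */
	if (cur == target)
		return 0;
	/* Unplugging is always a permitted request. */
	if (target == SK_UNPLUGGED)
		return 0;
	/* Switching between online types requires unplugging first. */
	if (cur == SK_ONLINE || cur == SK_ONLINE_MOVABLE)
		return SK_EBUSY;
	return 0;
}
```

Note that "offline" is deliberately not a parseable argument, matching the
paragraph above.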
Probe currently checks whether online_type matches
mhp_get_default_online_type() and, if so, calls dax_kmem_do_hotplug().
This creates memory blocks even though the device should nominally start
in an unplugged state. Doing so preserves userland backward compatibility
for existing tools that expect the memory blocks to be present after kmem
probe; the behavior can be deprecated over time.
As with any hot-remove mechanism, the removal can fail and if rollback
fails the system can be left in an inconsistent state.
Unbind Note:
We used to call remove_memory() during unbind, which would fire a
BUG() if any of the memory blocks were online at that time. We lift
this into a WARN in the cleanup routine and don't attempt hotremove
if ->state is not DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.
The resources are still leaked but this prevents deadlock on unbind
if a memory region happens to be impossible to hotremove.
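
A minimal userspace sketch of the unbind guard described in this note,
assuming hypothetical names throughout (sketch_kmem, sketch_cleanup, SK_*):
cleanup releases the tracked resources only when the state is unplugged or
offline, and otherwise leaves them in place, which models the intentional
leak. The kernel version emits a WARN at the point where this sketch
returns -1.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; the numeric values are assumptions. */
enum {
	SK_UNPLUGGED = -1,
	SK_OFFLINE = 0,
	SK_ONLINE = 1,
	SK_ONLINE_MOVABLE = 2,
};

/* Models the per-device tracking data: a state plus one slot per range. */
struct sketch_kmem {
	int state;
	size_t nr_range;
	void **res;
};

/*
 * Release all tracked resources. Returns the number released, or -1 when
 * the memory is still online and the resources are deliberately leaked.
 */
int sketch_cleanup(struct sketch_kmem *k)
{
	int released = 0;

	if (k->state != SK_UNPLUGGED && k->state != SK_OFFLINE)
		return -1;

	for (size_t i = 0; i < k->nr_range; i++) {
		if (!k->res[i])
			continue;
		free(k->res[i]);
		k->res[i] = NULL;
		released++;
	}
	return released;
}
```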
Suggested-by: Hannes Reinecke <hare@suse.de>
Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/ABI/testing/sysfs-bus-dax | 17 +++
drivers/dax/kmem.c | 164 +++++++++++++++++++++---
2 files changed, 161 insertions(+), 20 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..faf6f63a368c 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -151,3 +151,20 @@ Description:
memmap_on_memory parameter for memory_hotplug. This is
typically set on the kernel command line -
memory_hotplug.memmap_on_memory set to 'true' or 'force'."
+
+What: /sys/bus/dax/devices/daxX.Y/hotplug
+Date: January, 2026
+KernelVersion: v6.21
+Contact: nvdimm@lists.linux.dev
+Description:
+ (RW) Controls the hotplug state of the memory region.
+ Applies to all memory blocks associated with the device.
+ Only applies to dax_kmem devices.
+
+ States: [unplugged, online, online_movable]
+ Arguments:
+ "unplug": memory is offline and blocks are not present
+ "online": memory is online as normal system RAM
+ "online_movable": memory is online in ZONE_MOVABLE
+
+ A device must be unplugged before it can be onlined into a different state.
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 8be9286f0ea3..5dbd5b7862fd 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -40,10 +40,16 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
return 0;
}
+#define DAX_KMEM_UNPLUGGED (-1)
+
struct dax_kmem_data {
const char *res_name;
int mgid;
struct memory_dev_type *mtype;
+ int numa_node;
+ struct dev_dax *dev_dax;
+ int state;
+ struct mutex lock; /* protects hotplug state transitions */
struct resource *res[];
};
@@ -51,8 +57,10 @@ struct dax_kmem_data {
* dax_kmem_do_hotplug - hotplug memory for dax kmem device
* @dev_dax: the dev_dax instance
* @data: the dax_kmem_data structure with resource tracking
+ * @online_type: MMOP_ONLINE or MMOP_ONLINE_MOVABLE
*
- * Hotplugs all ranges in the dev_dax region as system memory.
+ * Hotplugs all ranges in the dev_dax region as system memory using
+ * the specified online type.
*
* Returns the number of successfully mapped ranges, or negative error.
*/
@@ -64,6 +72,12 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
int i, rc, onlined = 0;
mhp_t mhp_flags;
+ if (data->state == MMOP_ONLINE || data->state == MMOP_ONLINE_MOVABLE)
+ return -EINVAL;
+
+ if (online_type != MMOP_ONLINE && online_type != MMOP_ONLINE_MOVABLE)
+ return -EINVAL;
+
for (i = 0; i < dev_dax->nr_range; i++) {
struct range range;
@@ -156,9 +170,9 @@ static int dax_kmem_init_resources(struct dev_dax *dev_dax,
* @dev_dax: the dev_dax instance
* @data: the dax_kmem_data structure with resource tracking
*
- * Removes all ranges in the dev_dax region.
+ * Offlines and removes all ranges in the dev_dax region.
*
- * Returns the number of successfully removed ranges.
+ * Returns the number of successfully removed ranges, or negative error.
*/
static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
struct dax_kmem_data *data)
@@ -178,7 +192,7 @@ static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
if (!data->res[i])
continue;
- rc = remove_memory(range.start, range_len(&range));
+ rc = offline_and_remove_memory(range.start, range_len(&range));
if (rc == 0) {
/* Release the resource for the successfully removed range */
remove_resource(data->res[i]);
@@ -214,6 +228,20 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
{
int i;
+ /*
+ * If the device unbind occurs before memory is hotremoved, we can never
+ * remove the memory (requires reboot). Attempting an offline operation
+ * here may cause deadlock and a failure to finish the unbind.
+ *
+ * This WARN used to be a BUG called by remove_memory().
+ *
+ * Note: This leaks the resources.
+ */
+ if (WARN(((data->state != DAX_KMEM_UNPLUGGED) &&
+ (data->state != MMOP_OFFLINE)),
+ "Hotplug memory regions stuck online until reboot"))
+ return;
+
for (i = 0; i < dev_dax->nr_range; i++) {
if (!data->res[i])
continue;
@@ -223,6 +251,98 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
}
}
+static int dax_kmem_parse_state(const char *buf)
+{
+ if (sysfs_streq(buf, "unplug"))
+ return DAX_KMEM_UNPLUGGED;
+ if (sysfs_streq(buf, "online"))
+ return MMOP_ONLINE;
+ if (sysfs_streq(buf, "online_movable"))
+ return MMOP_ONLINE_MOVABLE;
+ return -EINVAL;
+}
+
+static ssize_t hotplug_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dax_kmem_data *data = dev_get_drvdata(dev);
+ const char *state_str;
+
+ if (!data)
+ return -ENXIO;
+
+ switch (data->state) {
+ case DAX_KMEM_UNPLUGGED:
+ state_str = "unplugged";
+ break;
+ case MMOP_OFFLINE:
+ state_str = "offline";
+ break;
+ case MMOP_ONLINE:
+ state_str = "online";
+ break;
+ case MMOP_ONLINE_MOVABLE:
+ state_str = "online_movable";
+ break;
+ default:
+ state_str = "unknown";
+ break;
+ }
+
+ return sysfs_emit(buf, "%s\n", state_str);
+}
+
+static ssize_t hotplug_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_kmem_data *data = dev_get_drvdata(dev);
+ int online_type;
+ int rc;
+
+ if (!data)
+ return -ENXIO;
+
+ online_type = dax_kmem_parse_state(buf);
+ if (online_type < DAX_KMEM_UNPLUGGED)
+ return online_type;
+
+ guard(mutex)(&data->lock);
+
+ /* Already in requested state */
+ if (data->state == online_type)
+ return len;
+
+ if (online_type == DAX_KMEM_UNPLUGGED) {
+ rc = dax_kmem_do_hotremove(dev_dax, data);
+ if (rc < 0) {
+ dev_warn(dev, "hotplug state is inconsistent\n");
+ return rc;
+ }
+ if (rc < dev_dax->nr_range)
+ dev_warn(dev, "partial hotremove: %d of %d ranges removed\n",
+ rc, dev_dax->nr_range);
+ else
+ data->state = DAX_KMEM_UNPLUGGED;
+ return len;
+ }
+
+ /*
+ * online_type is MMOP_ONLINE or MMOP_ONLINE_MOVABLE
+ * Cannot switch between online types without unplugging first
+ */
+ if (data->state == MMOP_ONLINE || data->state == MMOP_ONLINE_MOVABLE)
+ return -EBUSY;
+
+ rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+ if (rc < 0)
+ return rc;
+
+ data->state = online_type;
+ return len;
+}
+static DEVICE_ATTR_RW(hotplug);
+
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
@@ -291,6 +411,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
goto err_reg_mgid;
data->mgid = rc;
data->mtype = mtype;
+ data->numa_node = numa_node;
+ data->dev_dax = dev_dax;
+ data->state = DAX_KMEM_UNPLUGGED;
+ mutex_init(&data->lock);
dev_set_drvdata(dev, data);
@@ -301,9 +425,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
/*
* Hotplug using the configured online type for this device.
*/
- rc = dax_kmem_do_hotplug(dev_dax, data, dev_dax->online_type);
- if (rc < 0)
- goto err_hotplug;
+ if (dev_dax->online_type != MMOP_OFFLINE ||
+ dev_dax->online_type == mhp_get_default_online_type()) {
+ rc = dax_kmem_do_hotplug(dev_dax, data, dev_dax->online_type);
+ if (rc < 0)
+ goto err_hotplug;
+ data->state = dev_dax->online_type;
+ }
+
+ rc = device_create_file(dev, &dev_attr_hotplug);
+ if (rc)
+ dev_warn(dev, "failed to create hotplug sysfs entry\n");
return 0;
@@ -324,23 +456,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
#ifdef CONFIG_MEMORY_HOTREMOVE
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
- int success;
int node = dev_dax->target_node;
struct device *dev = &dev_dax->dev;
struct dax_kmem_data *data = dev_get_drvdata(dev);
- /*
- * We have one shot for removing memory, if some memory blocks were not
- * offline prior to calling this function remove_memory() will fail, and
- * there is no way to hotremove this memory until reboot because device
- * unbind will succeed even if we return failure.
- */
- success = dax_kmem_do_hotremove(dev_dax, data);
- if (success < dev_dax->nr_range) {
- dev_err(dev, "Hotplug regions stuck online until reboot\n");
- return;
- }
-
+ device_remove_file(dev, &dev_attr_hotplug);
dax_kmem_cleanup_resources(dev_dax, data);
memory_group_unregister(data->mgid);
kfree(data->res_name);
@@ -358,6 +478,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
#else
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
+ struct device *dev = &dev_dax->dev;
+
+ device_remove_file(dev, &dev_attr_hotplug);
+
/*
* Without hotremove purposely leak the request_mem_region() for the
* device-dax range and return '0' to ->remove() attempts. The removal
--
2.53.0
* Re: [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
` (7 preceding siblings ...)
2026-03-21 15:04 ` [PATCH 8/8] dax/kmem: add sysfs interface for atomic whole-device hotplug Gregory Price
@ 2026-03-21 17:40 ` Andrew Morton
2026-03-21 20:26 ` Gregory Price
8 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2026-03-21 17:40 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, vishal.l.verma, dave.jiang, david, osalvador,
dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
On Sat, 21 Mar 2026 11:03:56 -0400 Gregory Price <gourry@gourry.net> wrote:
> The dax kmem driver currently onlines memory during probe using the
> system default policy, with no way to control or query the region state
> at runtime - other than by inspecting the state of individual blocks.
>
> Offlining and removing an entire region requires operating on individual
> memory blocks, creating race conditions where external entities can
> interfere between the offline and remove steps.
>
> The problem was discussed specifically in the LPC2025 device memory
> sessions - https://lpc.events/event/19/contributions/2016/ - where
> it was discussed how the non-atomic interface for dax hotplug is causing
> issues in some distributions which have competing userland controllers
> that interfere with each other.
>
> This series adds a sysfs "hotplug" attribute for atomic whole-device
> hotplug control, along with the mm and dax plumbing to support it.
AI review (which hasn't completed at this time) has a lot to say:
https://sashiko.dev/#/patchset/20260321150404.3288786-1-gourry@gourry.net
* Re: [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs
2026-03-21 17:40 ` [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Andrew Morton
@ 2026-03-21 20:26 ` Gregory Price
0 siblings, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-03-21 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, vishal.l.verma, dave.jiang, david, osalvador,
dan.j.williams, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-kernel, nvdimm, linux-cxl, kernel-team
On Sat, Mar 21, 2026 at 10:40:21AM -0700, Andrew Morton wrote:
> On Sat, 21 Mar 2026 11:03:56 -0400 Gregory Price <gourry@gourry.net> wrote:
>
> > The dax kmem driver currently onlines memory during probe using the
> > system default policy, with no way to control or query the region state
> > at runtime - other than by inspecting the state of individual blocks.
> >
> > Offlining and removing an entire region requires operating on individual
> > memory blocks, creating race conditions where external entities can
> > interfere between the offline and remove steps.
> >
> > The problem was discussed specifically in the LPC2025 device memory
> > sessions - https://lpc.events/event/19/contributions/2016/ - where
> > it was discussed how the non-atomic interface for dax hotplug is causing
> > issues in some distributions which have competing userland controllers
> > that interfere with each other.
> >
> > This series adds a sysfs "hotplug" attribute for atomic whole-device
> > hotplug control, along with the mm and dax plumbing to support it.
>
> AI review (which hasn't completed at this time) has a lot to say:
> https://sashiko.dev/#/patchset/20260321150404.3288786-1-gourry@gourry.net
Looking at the results - I mucked up a UAF during the rebase that I
didn't catch during testing. Will clean that up.
I also just realized I left an extern in one of the patches that I
thought I had removed.
So I owe a respin on this in more ways than one.
But on the AI review comments for the non-trivial stuff:
---
Much of the remaining commentary is about either pre-existing race
conditions in the code, or design questions in the space of those race
conditions.
Specifically: userland can still try to twiddle the memoryN/state bits
while the dax device loops over non-contiguous regions.
I dropped this commit:
https://lore.kernel.org/all/20260114235022.3437787-6-gourry@gourry.net/
from the series, because the feedback here:
https://lore.kernel.org/linux-mm/d1938a63-839b-44a5-a68f-34ad290fef21@kernel.org/
suggested that offline_and_remove_memory() would resolve the race
condition problem - but the patch proposed actually solved two issues:
1) Inconsistent hotplug state issue (user is still using the old
   per-block offlining pattern)
2) The old offline pattern calling BUG() instead of WARN() when trying
   to unbind while things are still online.
But this raises the question: if the race condition in userland has
been around for many years, should it be considered a feature we should
not break - or on what time scale should we consider breaking it?
I don't know the answer; David will have to weigh in on that.
~Gregory
Thread overview: 11+ messages
2026-03-21 15:03 [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-03-21 15:03 ` [PATCH 1/8] mm/memory-tiers: consolidate memory type dedup into mt_get_memory_type() Gregory Price
2026-03-21 15:03 ` [PATCH 2/8] mm/memory: add memory_block_align_range() helper Gregory Price
2026-03-21 15:03 ` [PATCH 3/8] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
2026-03-21 15:04 ` [PATCH 4/8] mm/memory_hotplug: export mhp_get_default_online_type Gregory Price
2026-03-21 15:04 ` [PATCH 5/8] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
2026-03-21 15:04 ` [PATCH 6/8] dax: plumb hotplug online_type through dax Gregory Price
2026-03-21 15:04 ` [PATCH 7/8] dax/kmem: extract hotplug/hotremove helper functions Gregory Price
2026-03-21 15:04 ` [PATCH 8/8] dax/kmem: add sysfs interface for atomic whole-device hotplug Gregory Price
2026-03-21 17:40 ` [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs Andrew Morton
2026-03-21 20:26 ` Gregory Price