* [PATCH v4 0/3] Implement numa node notifier
@ 2025-06-03 11:08 Oscar Salvador
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Oscar Salvador @ 2025-06-03 11:08 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel, Oscar Salvador
A break from hugetlbfs :-)
v3 -> v4:
- Fix typos pointed out by Alok Tiwari
- Further cleanups suggested by Vlastimil
- Add RBs-by from Vlastimil
v2 -> v3:
- Add Suggested-by (David)
- Replace last N_NORMAL_MEMORY mention in slub (David)
- Replace the notifier for autoweight-mempolicy
- Fix build on !CONFIG_MEMORY_HOTPLUG
v1 -> v2:
- Remove status_change_nid_normal and the code that
deals with it (David & Vlastimil)
- Remove slab_mem_offline_callback (David & Vlastimil)
- Change the order of canceling the notifiers
in {online,offline}_pages (Vlastimil)
- Fix up a couple of whitespaces (Jonathan Cameron)
- Add RBs-by
The memory notifier is a mechanism that allows consumers to get notified
whenever memory gets onlined or offlined in the system.
Currently, there are 10 consumers of it, but 5 out of those 10 are only
interested in getting notifications when a numa node changes its memory
state, i.e: goes from memoryless to memory-aware or vice versa.
This means that they get notified on every {online,offline}_pages
operation even though the numa node might not have changed its state.
This is suboptimal, and we want to decouple numa node state changes from
memory state changes.
While we are doing this, remove status_change_nid_normal, as the only
current user (slub) does not really need it.
This allows us to further simplify and clean up the code.
The first patch gets rid of status_change_nid_normal in slub.
The second patch implements a numa node notifier that does just that, and
has those consumers register there, so they get notified only for the
events they are interested in.
The third patch replaces the 'status_change_nid{_normal}' fields within
memory_notify with a 'nid' field, as that is all the memory notifier
needs, and updates its only user (page_ext).
Consumers that are only interested in numa node state changes are:
- memory-tier
- slub
- cpuset
- hmat
- cxl
- autoweight-mempolicy
Oscar Salvador (3):
mm,slub: Do not special case N_NORMAL nodes for slab_nodes
mm,memory_hotplug: Implement numa node notifier
mm,memory_hotplug: Rename status_change_nid parameter in memory_notify
drivers/acpi/numa/hmat.c | 6 +-
drivers/base/node.c | 21 +++++
drivers/cxl/core/region.c | 14 ++--
drivers/cxl/cxl.h | 4 +-
include/linux/memory.h | 38 ++++++++-
kernel/cgroup/cpuset.c | 2 +-
mm/memory-tiers.c | 8 +-
mm/memory_hotplug.c | 161 +++++++++++++++++---------------------
mm/mempolicy.c | 8 +-
mm/page_ext.c | 12 +--
mm/slub.c | 45 ++---------
11 files changed, 158 insertions(+), 161 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-06-03 11:08 [PATCH v4 0/3] Implement numa node notifier Oscar Salvador
@ 2025-06-03 11:08 ` Oscar Salvador
2025-06-04 9:28 ` Harry Yoo
` (2 more replies)
2025-06-03 11:08 ` [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
2025-06-03 11:08 ` [PATCH v4 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify Oscar Salvador
2 siblings, 3 replies; 12+ messages in thread
From: Oscar Salvador @ 2025-06-03 11:08 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel, Oscar Salvador
Currently, slab_mem_going_online_callback() checks whether the node has
N_NORMAL memory in order to be set in slab_nodes.
While it is true that gettind rid of that enforcing would mean
ending up with movable nodes in slab_nodes, the memory waste that comes
with that is negligible.
So stop checking for status_change_nid_normal and just use status_change_nid
instead, which works for both types of memory.
Also, once we allocate the kmem_cache_node cache for the node in
slab_mem_going_online_callback(), we never deallocate it in
slab_mem_offline_callback() when the node goes memoryless, so we can just
get rid of it.
The side effects are that we will stop clearing the node from slab_nodes,
and also that newly created kmem caches after node hotremove will now allocate
their kmem_cache_node for the node(s) that were hotremoved, but these
should be negligible.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 34 +++-------------------------------
1 file changed, 3 insertions(+), 31 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index be8b09e09d30..f92b43d36adc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -447,7 +447,7 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
/*
* Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
- * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
+ * Corresponds to node_state[N_MEMORY], but can temporarily
* differ during memory hotplug/hotremove operations.
* Protected by slab_mutex.
*/
@@ -6160,36 +6160,12 @@ static int slab_mem_going_offline_callback(void *arg)
return 0;
}
-static void slab_mem_offline_callback(void *arg)
-{
- struct memory_notify *marg = arg;
- int offline_node;
-
- offline_node = marg->status_change_nid_normal;
-
- /*
- * If the node still has available memory. we need kmem_cache_node
- * for it yet.
- */
- if (offline_node < 0)
- return;
-
- mutex_lock(&slab_mutex);
- node_clear(offline_node, slab_nodes);
- /*
- * We no longer free kmem_cache_node structures here, as it would be
- * racy with all get_node() users, and infeasible to protect them with
- * slab_mutex.
- */
- mutex_unlock(&slab_mutex);
-}
-
static int slab_mem_going_online_callback(void *arg)
{
struct kmem_cache_node *n;
struct kmem_cache *s;
struct memory_notify *marg = arg;
- int nid = marg->status_change_nid_normal;
+ int nid = marg->status_change_nid;
int ret = 0;
/*
@@ -6247,10 +6223,6 @@ static int slab_memory_callback(struct notifier_block *self,
case MEM_GOING_OFFLINE:
ret = slab_mem_going_offline_callback(arg);
break;
- case MEM_OFFLINE:
- case MEM_CANCEL_ONLINE:
- slab_mem_offline_callback(arg);
- break;
case MEM_ONLINE:
case MEM_CANCEL_OFFLINE:
break;
@@ -6321,7 +6293,7 @@ void __init kmem_cache_init(void)
* Initialize the nodemask for which we will allocate per node
* structures. Here we don't need taking slab_mutex yet.
*/
- for_each_node_state(node, N_NORMAL_MEMORY)
+ for_each_node_state(node, N_MEMORY)
node_set(node, slab_nodes);
create_boot_cache(kmem_cache_node, "kmem_cache_node",
--
2.49.0
* [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier
2025-06-03 11:08 [PATCH v4 0/3] Implement numa node notifier Oscar Salvador
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
@ 2025-06-03 11:08 ` Oscar Salvador
2025-06-04 12:03 ` David Hildenbrand
2025-06-03 11:08 ` [PATCH v4 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify Oscar Salvador
2 siblings, 1 reply; 12+ messages in thread
From: Oscar Salvador @ 2025-06-03 11:08 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel, Oscar Salvador
There are at least six consumers of hotplug_memory_notifier that are really
only interested in whether a numa node changed its state, e.g: going from
being memory-aware to becoming memoryless, or vice versa.
Implement a notifier specific to numa node state changes, and have those
consumers that only care about numa node state changes use it.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
drivers/acpi/numa/hmat.c | 6 +-
drivers/base/node.c | 21 +++++
drivers/cxl/core/region.c | 14 ++--
drivers/cxl/cxl.h | 4 +-
include/linux/memory.h | 38 ++++++++-
kernel/cgroup/cpuset.c | 2 +-
mm/memory-tiers.c | 8 +-
mm/memory_hotplug.c | 161 +++++++++++++++++---------------------
mm/mempolicy.c | 8 +-
mm/slub.c | 13 ++-
10 files changed, 155 insertions(+), 120 deletions(-)
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 9d9052258e92..9ac82a767daf 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -962,10 +962,10 @@ static int hmat_callback(struct notifier_block *self,
unsigned long action, void *arg)
{
struct memory_target *target;
- struct memory_notify *mnb = arg;
+ struct node_notify *mnb = arg;
int pxm, nid = mnb->status_change_nid;
- if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
return NOTIFY_OK;
pxm = node_to_pxm(nid);
@@ -1118,7 +1118,7 @@ static __init int hmat_init(void)
hmat_register_targets();
/* Keep the table and structures if the notifier may use them */
- if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
+ if (hotplug_node_notifier(hmat_callback, HMAT_CALLBACK_PRI))
goto out_put;
if (!hmat_set_default_dram_perf())
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 25ab9ec14eb8..c5b0859d846d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -111,6 +111,27 @@ static const struct attribute_group *node_access_node_groups[] = {
NULL,
};
+#ifdef CONFIG_MEMORY_HOTPLUG
+static BLOCKING_NOTIFIER_HEAD(node_chain);
+
+int register_node_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&node_chain, nb);
+}
+EXPORT_SYMBOL(register_node_notifier);
+
+void unregister_node_notifier(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&node_chain, nb);
+}
+EXPORT_SYMBOL(unregister_node_notifier);
+
+int node_notify(unsigned long val, void *v)
+{
+ return blocking_notifier_call_chain(&node_chain, val, v);
+}
+#endif
+
static void node_remove_accesses(struct node *node)
{
struct node_access_nodes *c, *cnext;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index c3f4dc244df7..c43770d6834c 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2432,12 +2432,12 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
unsigned long action, void *arg)
{
struct cxl_region *cxlr = container_of(nb, struct cxl_region,
- memory_notifier);
- struct memory_notify *mnb = arg;
+ node_notifier);
+ struct node_notify *mnb = arg;
int nid = mnb->status_change_nid;
int region_nid;
- if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
return NOTIFY_DONE;
/*
@@ -3484,7 +3484,7 @@ static void shutdown_notifiers(void *_cxlr)
{
struct cxl_region *cxlr = _cxlr;
- unregister_memory_notifier(&cxlr->memory_notifier);
+ unregister_node_notifier(&cxlr->node_notifier);
unregister_mt_adistance_algorithm(&cxlr->adist_notifier);
}
@@ -3523,9 +3523,9 @@ static int cxl_region_probe(struct device *dev)
if (rc)
return rc;
- cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
- cxlr->memory_notifier.priority = CXL_CALLBACK_PRI;
- register_memory_notifier(&cxlr->memory_notifier);
+ cxlr->node_notifier.notifier_call = cxl_region_perf_attrs_callback;
+ cxlr->node_notifier.priority = CXL_CALLBACK_PRI;
+ register_node_notifier(&cxlr->node_notifier);
cxlr->adist_notifier.notifier_call = cxl_region_calculate_adistance;
cxlr->adist_notifier.priority = 100;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a9ab46eb0610..48ac02dee881 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -513,7 +513,7 @@ enum cxl_partition_mode {
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
- * @memory_notifier: notifier for setting the access coordinates to node
+ * @node_notifier: notifier for setting the access coordinates to node
* @adist_notifier: notifier for calculating the abstract distance of node
*/
struct cxl_region {
@@ -526,7 +526,7 @@ struct cxl_region {
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
- struct notifier_block memory_notifier;
+ struct notifier_block node_notifier;
struct notifier_block adist_notifier;
};
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 5ec4e6d209b9..8c5c88eaffb3 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
#define MEM_PREPARE_ONLINE (1<<6)
#define MEM_FINISH_OFFLINE (1<<7)
+/* These states are used for numa node notifiers */
+#define NODE_BECOMING_MEM_AWARE (1<<0)
+#define NODE_BECAME_MEM_AWARE (1<<1)
+#define NODE_BECOMING_MEMORYLESS (1<<2)
+#define NODE_BECAME_MEMORYLESS (1<<3)
+#define NODE_CANCEL_MEM_AWARE (1<<4)
+#define NODE_CANCEL_MEMORYLESS (1<<5)
+
struct memory_notify {
/*
* The altmap_start_pfn and altmap_nr_pages fields are designated for
@@ -109,7 +117,10 @@ struct memory_notify {
unsigned long altmap_nr_pages;
unsigned long start_pfn;
unsigned long nr_pages;
- int status_change_nid_normal;
+ int status_change_nid;
+};
+
+struct node_notify {
int status_change_nid;
};
@@ -157,15 +168,34 @@ static inline unsigned long memory_block_advised_max_size(void)
{
return 0;
}
+
+static inline int register_node_notifier(struct notifier_block *nb)
+{
+ return 0;
+}
+static inline void unregister_node_notifier(struct notifier_block *nb)
+{
+}
+static inline int node_notify(unsigned long val, void *v)
+{
+ return 0;
+}
+static inline int hotplug_node_notifier(notifier_fn_t fn, int pri)
+{
+ return 0;
+}
#else /* CONFIG_MEMORY_HOTPLUG */
extern int register_memory_notifier(struct notifier_block *nb);
+extern int register_node_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
+extern void unregister_node_notifier(struct notifier_block *nb);
int create_memory_block_devices(unsigned long start, unsigned long size,
struct vmem_altmap *altmap,
struct memory_group *group);
void remove_memory_block_devices(unsigned long start, unsigned long size);
extern void memory_dev_init(void);
extern int memory_notify(unsigned long val, void *v);
+extern int node_notify(unsigned long val, void *v);
extern struct memory_block *find_memory_block(unsigned long section_nr);
typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
extern int walk_memory_blocks(unsigned long start, unsigned long size,
@@ -185,6 +215,12 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
register_memory_notifier(&fn##_mem_nb); \
})
+#define hotplug_node_notifier(fn, pri) ({ \
+ static __meminitdata struct notifier_block fn##_node_nb =\
+ { .notifier_call = fn, .priority = pri };\
+ register_node_notifier(&fn##_node_nb); \
+})
+
#ifdef CONFIG_NUMA
void memory_block_add_nid(struct memory_block *mem, int nid,
enum meminit_context context);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 83639a12883d..66c84024f217 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,7 +4013,7 @@ void __init cpuset_init_smp(void)
cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
top_cpuset.effective_mems = node_states[N_MEMORY];
- hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
+ hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
BUG_ON(!cpuset_migrate_mm_wq);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..dfe6c28c8352 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -872,7 +872,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
unsigned long action, void *_arg)
{
struct memory_tier *memtier;
- struct memory_notify *arg = _arg;
+ struct node_notify *arg = _arg;
/*
* Only update the node migration order when a node is
@@ -882,13 +882,13 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
return notifier_from_errno(0);
switch (action) {
- case MEM_OFFLINE:
+ case NODE_BECAME_MEMORYLESS:
mutex_lock(&memory_tier_lock);
if (clear_node_memory_tier(arg->status_change_nid))
establish_demotion_targets();
mutex_unlock(&memory_tier_lock);
break;
- case MEM_ONLINE:
+ case NODE_BECAME_MEM_AWARE:
mutex_lock(&memory_tier_lock);
memtier = set_node_memory_tier(arg->status_change_nid);
if (!IS_ERR(memtier))
@@ -929,7 +929,7 @@ static int __init memory_tier_init(void)
nodes_and(default_dram_nodes, node_states[N_MEMORY],
node_states[N_CPU]);
- hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
+ hotplug_node_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
return 0;
}
subsys_initcall(memory_tier_init);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1caedbade5b..777c81cd2943 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -699,30 +699,6 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
online_mem_sections(start_pfn, end_pfn);
}
-/* check which state of node_states will be changed when online memory */
-static void node_states_check_changes_online(unsigned long nr_pages,
- struct zone *zone, struct memory_notify *arg)
-{
- int nid = zone_to_nid(zone);
-
- arg->status_change_nid = NUMA_NO_NODE;
- arg->status_change_nid_normal = NUMA_NO_NODE;
-
- if (!node_state(nid, N_MEMORY))
- arg->status_change_nid = nid;
- if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
- arg->status_change_nid_normal = nid;
-}
-
-static void node_states_set_node(int node, struct memory_notify *arg)
-{
- if (arg->status_change_nid_normal >= 0)
- node_set_state(node, N_NORMAL_MEMORY);
-
- if (arg->status_change_nid >= 0)
- node_set_state(node, N_MEMORY);
-}
-
static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages)
{
@@ -1177,7 +1153,9 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
int need_zonelists_rebuild = 0;
const int nid = zone_to_nid(zone);
int ret;
- struct memory_notify arg;
+ struct memory_notify mem_arg;
+ struct node_notify node_arg;
+ bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
/*
* {on,off}lining is constrained to full memory sections (or more
@@ -1194,11 +1172,22 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
/* associate pfn range with the zone */
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
- arg.start_pfn = pfn;
- arg.nr_pages = nr_pages;
- node_states_check_changes_online(nr_pages, zone, &arg);
+ node_arg.status_change_nid = NUMA_NO_NODE;
+ if (!node_state(nid, N_MEMORY)) {
+ /* Node is becoming memory aware. Notify consumers */
+ cancel_node_notifier_on_err = true;
+ node_arg.status_change_nid = nid;
+ ret = node_notify(NODE_BECOMING_MEM_AWARE, &node_arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto failed_addition;
+ }
- ret = memory_notify(MEM_GOING_ONLINE, &arg);
+ cancel_mem_notifier_on_err = true;
+ mem_arg.start_pfn = pfn;
+ mem_arg.nr_pages = nr_pages;
+ mem_arg.status_change_nid = node_arg.status_change_nid;
+ ret = memory_notify(MEM_GOING_ONLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret)
goto failed_addition;
@@ -1224,7 +1213,8 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
online_pages_range(pfn, nr_pages);
adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
- node_states_set_node(nid, &arg);
+ if (node_arg.status_change_nid >= 0)
+ node_set_state(nid, N_MEMORY);
if (need_zonelists_rebuild)
build_all_zonelists(NULL);
@@ -1245,16 +1235,26 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
kswapd_run(nid);
kcompactd_run(nid);
+ if (node_arg.status_change_nid >= 0)
+ /*
+ * Node went from memoryless to having memory. Notify interested
+ * consumers
+ */
+ node_notify(NODE_BECAME_MEM_AWARE, &node_arg);
+
writeback_set_ratelimit();
- memory_notify(MEM_ONLINE, &arg);
+ memory_notify(MEM_ONLINE, &mem_arg);
return 0;
failed_addition:
pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
(unsigned long long) pfn << PAGE_SHIFT,
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
- memory_notify(MEM_CANCEL_ONLINE, &arg);
+ if (cancel_mem_notifier_on_err)
+ memory_notify(MEM_CANCEL_ONLINE, &mem_arg);
+ if (cancel_node_notifier_on_err)
+ node_notify(NODE_CANCEL_MEM_AWARE, &node_arg);
remove_pfn_range_from_zone(zone, pfn, nr_pages);
return ret;
}
@@ -1886,54 +1886,6 @@ static int __init cmdline_parse_movable_node(char *p)
}
early_param("movable_node", cmdline_parse_movable_node);
-/* check which state of node_states will be changed when offline memory */
-static void node_states_check_changes_offline(unsigned long nr_pages,
- struct zone *zone, struct memory_notify *arg)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
- unsigned long present_pages = 0;
- enum zone_type zt;
-
- arg->status_change_nid = NUMA_NO_NODE;
- arg->status_change_nid_normal = NUMA_NO_NODE;
-
- /*
- * Check whether node_states[N_NORMAL_MEMORY] will be changed.
- * If the memory to be offline is within the range
- * [0..ZONE_NORMAL], and it is the last present memory there,
- * the zones in that range will become empty after the offlining,
- * thus we can determine that we need to clear the node from
- * node_states[N_NORMAL_MEMORY].
- */
- for (zt = 0; zt <= ZONE_NORMAL; zt++)
- present_pages += pgdat->node_zones[zt].present_pages;
- if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
- arg->status_change_nid_normal = zone_to_nid(zone);
-
- /*
- * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM
- * does not apply as we don't support 32bit.
- * Here we count the possible pages from ZONE_MOVABLE.
- * If after having accounted all the pages, we see that the nr_pages
- * to be offlined is over or equal to the accounted pages,
- * we know that the node will become empty, and so, we can clear
- * it for N_MEMORY as well.
- */
- present_pages += pgdat->node_zones[ZONE_MOVABLE].present_pages;
-
- if (nr_pages >= present_pages)
- arg->status_change_nid = zone_to_nid(zone);
-}
-
-static void node_states_clear_node(int node, struct memory_notify *arg)
-{
- if (arg->status_change_nid_normal >= 0)
- node_clear_state(node, N_NORMAL_MEMORY);
-
- if (arg->status_change_nid >= 0)
- node_clear_state(node, N_MEMORY);
-}
-
static int count_system_ram_pages_cb(unsigned long start_pfn,
unsigned long nr_pages, void *data)
{
@@ -1950,10 +1902,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group)
{
const unsigned long end_pfn = start_pfn + nr_pages;
- unsigned long pfn, managed_pages, system_ram_pages = 0;
+ unsigned long pfn, managed_pages, system_ram_pages = 0, present_pages = 0;
const int node = zone_to_nid(zone);
+ struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long flags;
- struct memory_notify arg;
+ struct memory_notify mem_arg;
+ struct node_notify node_arg;
+ bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
+ enum zone_type zt;
char *reason;
int ret;
@@ -2012,11 +1968,30 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
goto failed_removal_pcplists_disabled;
}
- arg.start_pfn = start_pfn;
- arg.nr_pages = nr_pages;
- node_states_check_changes_offline(nr_pages, zone, &arg);
+ /*
+ * Here we count the possible pages within the range [0..ZONE_MOVABLE].
+ * If after having accounted all the pages, we see that the nr_pages to
+ * be offlined is greater or equal to the accounted pages, we know that the
+ * node will become empty, and so, we can clear it for N_MEMORY.
+ */
+ node_arg.status_change_nid = NUMA_NO_NODE;
+ for (zt = 0; zt <= ZONE_MOVABLE; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+ if (nr_pages >= present_pages)
+ node_arg.status_change_nid = node;
+ if (node_arg.status_change_nid >= 0) {
+ cancel_node_notifier_on_err = true;
+ ret = node_notify(NODE_BECOMING_MEMORYLESS, &node_arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto failed_removal_isolated;
+ }
- ret = memory_notify(MEM_GOING_OFFLINE, &arg);
+ cancel_mem_notifier_on_err = true;
+ mem_arg.start_pfn = start_pfn;
+ mem_arg.nr_pages = nr_pages;
+ mem_arg.status_change_nid = node_arg.status_change_nid;
+ ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret) {
reason = "notifier failure";
@@ -2096,27 +2071,33 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
* Make sure to mark the node as memory-less before rebuilding the zone
* list. Otherwise this node would still appear in the fallback lists.
*/
- node_states_clear_node(node, &arg);
+ if (node_arg.status_change_nid >= 0)
+ node_clear_state(node, N_MEMORY);
if (!populated_zone(zone)) {
zone_pcp_reset(zone);
build_all_zonelists(NULL);
}
- if (arg.status_change_nid >= 0) {
+ if (node_arg.status_change_nid >= 0) {
kcompactd_stop(node);
kswapd_stop(node);
+ /* Node went memoryless. Notify interested consumers */
+ node_notify(NODE_BECAME_MEMORYLESS, &node_arg);
}
writeback_set_ratelimit();
- memory_notify(MEM_OFFLINE, &arg);
+ memory_notify(MEM_OFFLINE, &mem_arg);
remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
return 0;
failed_removal_isolated:
/* pushback to free area */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
- memory_notify(MEM_CANCEL_OFFLINE, &arg);
+ if (cancel_mem_notifier_on_err)
+ memory_notify(MEM_CANCEL_OFFLINE, &mem_arg);
+ if (cancel_node_notifier_on_err)
+ node_notify(NODE_CANCEL_MEMORYLESS, &node_arg);
failed_removal_pcplists_disabled:
lru_cache_enable();
zone_pcp_enable(zone);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 72fd72e156b1..3a7717e09506 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3793,20 +3793,20 @@ static int wi_node_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
int err;
- struct memory_notify *arg = data;
+ struct node_notify *arg = data;
int nid = arg->status_change_nid;
if (nid < 0)
return NOTIFY_OK;
switch (action) {
- case MEM_ONLINE:
+ case NODE_BECAME_MEM_AWARE:
err = sysfs_wi_node_add(nid);
if (err)
pr_err("failed to add sysfs for node%d during hotplug: %d\n",
nid, err);
break;
- case MEM_OFFLINE:
+ case NODE_BECAME_MEMORYLESS:
sysfs_wi_node_delete(nid);
break;
}
@@ -3845,7 +3845,7 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
}
}
- hotplug_memory_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
+ hotplug_node_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
return 0;
err_cleanup_kobj:
diff --git a/mm/slub.c b/mm/slub.c
index f92b43d36adc..78a70f31de8f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6164,8 +6164,8 @@ static int slab_mem_going_online_callback(void *arg)
{
struct kmem_cache_node *n;
struct kmem_cache *s;
- struct memory_notify *marg = arg;
- int nid = marg->status_change_nid;
+ struct node_notify *narg = arg;
+ int nid = narg->status_change_nid;
int ret = 0;
/*
@@ -6217,15 +6217,12 @@ static int slab_memory_callback(struct notifier_block *self,
int ret = 0;
switch (action) {
- case MEM_GOING_ONLINE:
+ case NODE_BECOMING_MEM_AWARE:
ret = slab_mem_going_online_callback(arg);
break;
- case MEM_GOING_OFFLINE:
+ case NODE_BECOMING_MEMORYLESS:
ret = slab_mem_going_offline_callback(arg);
break;
- case MEM_ONLINE:
- case MEM_CANCEL_OFFLINE:
- break;
}
if (ret)
ret = notifier_from_errno(ret);
@@ -6300,7 +6297,7 @@ void __init kmem_cache_init(void)
sizeof(struct kmem_cache_node),
SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
- hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
+ hotplug_node_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
/* Able to allocate the per node structures */
slab_state = PARTIAL;
--
2.49.0
* [PATCH v4 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify
2025-06-03 11:08 [PATCH v4 0/3] Implement numa node notifier Oscar Salvador
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
2025-06-03 11:08 ` [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
@ 2025-06-03 11:08 ` Oscar Salvador
2 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2025-06-03 11:08 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel, Oscar Salvador
The 'status_change_nid' field was used to track changes in the memory
state of a numa node, but that functionality has been decoupled from
memory_notify and moved to node_notify.
Current consumers of memory_notify are only interested in which node the
memory we are adding belongs to, so rename the current 'status_change_nid'
field to 'nid'.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
include/linux/memory.h | 2 +-
mm/memory_hotplug.c | 4 ++--
mm/page_ext.c | 12 +-----------
3 files changed, 4 insertions(+), 14 deletions(-)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 8c5c88eaffb3..3c482f189d14 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -117,7 +117,7 @@ struct memory_notify {
unsigned long altmap_nr_pages;
unsigned long start_pfn;
unsigned long nr_pages;
- int status_change_nid;
+ int nid;
};
struct node_notify {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 777c81cd2943..3a62d90d416d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1186,7 +1186,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
cancel_mem_notifier_on_err = true;
mem_arg.start_pfn = pfn;
mem_arg.nr_pages = nr_pages;
- mem_arg.status_change_nid = node_arg.status_change_nid;
+ mem_arg.nid = nid;
ret = memory_notify(MEM_GOING_ONLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret)
@@ -1990,7 +1990,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
cancel_mem_notifier_on_err = true;
mem_arg.start_pfn = start_pfn;
mem_arg.nr_pages = nr_pages;
- mem_arg.status_change_nid = node_arg.status_change_nid;
+ mem_arg.nid = node;
ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret) {
diff --git a/mm/page_ext.c b/mm/page_ext.c
index c351fdfe9e9a..477e6f24b7ab 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -378,16 +378,6 @@ static int __meminit online_page_ext(unsigned long start_pfn,
start = SECTION_ALIGN_DOWN(start_pfn);
end = SECTION_ALIGN_UP(start_pfn + nr_pages);
- if (nid == NUMA_NO_NODE) {
- /*
- * In this case, "nid" already exists and contains valid memory.
- * "start_pfn" passed to us is a pfn which is an arg for
- * online__pages(), and start_pfn should exist.
- */
- nid = pfn_to_nid(start_pfn);
- VM_BUG_ON(!node_online(nid));
- }
-
for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION)
fail = init_section_page_ext(pfn, nid);
if (!fail)
@@ -436,7 +426,7 @@ static int __meminit page_ext_callback(struct notifier_block *self,
switch (action) {
case MEM_GOING_ONLINE:
ret = online_page_ext(mn->start_pfn,
- mn->nr_pages, mn->status_change_nid);
+ mn->nr_pages, mn->nid);
break;
case MEM_OFFLINE:
offline_page_ext(mn->start_pfn,
--
2.49.0
* Re: [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
@ 2025-06-04 9:28 ` Harry Yoo
2025-06-04 11:33 ` David Hildenbrand
2025-06-04 12:16 ` Yunsheng Lin
2 siblings, 0 replies; 12+ messages in thread
From: Harry Yoo @ 2025-06-04 9:28 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Jonathan Cameron, Rakie Kim, Hyeonggon Yoo, linux-mm,
linux-kernel
On Tue, Jun 03, 2025 at 01:08:48PM +0200, Oscar Salvador wrote:
> Currently, slab_mem_going_going_callback() checks whether the node has
> N_NORMAL memory in order to be set in slab_nodes.
> While it is true that gettind rid of that enforcing would mean
nit: gettind -> getting
> ending up with movables nodes in slab_nodes, the memory waste that comes
> with that is negligible.
>
> So stop checking for status_change_nid_normal and just use status_change_nid
> instead which works for both types of memory.
>
> Also, once we allocate the kmem_cache_node cache for the node in
> slab_mem_online_callback(), we never deallocate it in
> slab_mem_off_callback() when the node goes memoryless, so we can just
> get rid of it.
>
> The side effects are that we will stop clearing the node from slab_nodes,
> and also that newly created kmem caches after node hotremove will now allocate
> their kmem_cache_node for the node(s) that was hotremoved, but these
> should be negligible.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
> ---
> mm/slub.c | 34 +++-------------------------------
> 1 file changed, 3 insertions(+), 31 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index be8b09e09d30..f92b43d36adc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -447,7 +447,7 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>
> /*
> * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
> - * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
> + * Corresponds to node_state[N_MEMORY], but can temporarily
> * differ during memory hotplug/hotremove operations.
> * Protected by slab_mutex.
> */
> @@ -6160,36 +6160,12 @@ static int slab_mem_going_offline_callback(void *arg)
> return 0;
> }
>
> -static void slab_mem_offline_callback(void *arg)
> -{
> - struct memory_notify *marg = arg;
> - int offline_node;
> -
> - offline_node = marg->status_change_nid_normal;
> -
> - /*
> - * If the node still has available memory. we need kmem_cache_node
> - * for it yet.
> - */
> - if (offline_node < 0)
> - return;
> -
> - mutex_lock(&slab_mutex);
> - node_clear(offline_node, slab_nodes);
> - /*
> - * We no longer free kmem_cache_node structures here, as it would be
> - * racy with all get_node() users, and infeasible to protect them with
> - * slab_mutex.
> - */
> - mutex_unlock(&slab_mutex);
> -}
> -
> static int slab_mem_going_online_callback(void *arg)
> {
> struct kmem_cache_node *n;
> struct kmem_cache *s;
> struct memory_notify *marg = arg;
> - int nid = marg->status_change_nid_normal;
> + int nid = marg->status_change_nid;
> int ret = 0;
>
> /*
> @@ -6247,10 +6223,6 @@ static int slab_memory_callback(struct notifier_block *self,
> case MEM_GOING_OFFLINE:
> ret = slab_mem_going_offline_callback(arg);
> break;
> - case MEM_OFFLINE:
> - case MEM_CANCEL_ONLINE:
> - slab_mem_offline_callback(arg);
> - break;
> case MEM_ONLINE:
> case MEM_CANCEL_OFFLINE:
> break;
> @@ -6321,7 +6293,7 @@ void __init kmem_cache_init(void)
> * Initialize the nodemask for which we will allocate per node
> * structures. Here we don't need taking slab_mutex yet.
> */
> - for_each_node_state(node, N_NORMAL_MEMORY)
> + for_each_node_state(node, N_MEMORY)
> node_set(node, slab_nodes);
>
> create_boot_cache(kmem_cache_node, "kmem_cache_node",
> --
> 2.49.0
>
>
* Re: [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
2025-06-04 9:28 ` Harry Yoo
@ 2025-06-04 11:33 ` David Hildenbrand
2025-06-04 12:16 ` Yunsheng Lin
2 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-06-04 11:33 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: Vlastimil Babka, Jonathan Cameron, Harry Yoo, Rakie Kim,
Hyeonggon Yoo, linux-mm, linux-kernel
On 03.06.25 13:08, Oscar Salvador wrote:
> Currently, slab_mem_going_going_callback() checks whether the node has
> N_NORMAL memory in order to be set in slab_nodes.
> While it is true that gettind rid of that enforcing would mean
> ending up with movables nodes in slab_nodes, the memory waste that comes
> with that is negligible.
>
> So stop checking for status_change_nid_normal and just use status_change_nid
> instead which works for both types of memory.
>
> Also, once we allocate the kmem_cache_node cache for the node in
> slab_mem_online_callback(), we never deallocate it in
> slab_mem_off_callback() when the node goes memoryless, so we can just
> get rid of it.
>
> The side effects are that we will stop clearing the node from slab_nodes,
> and also that newly created kmem caches after node hotremove will now allocate
> their kmem_cache_node for the node(s) that was hotremoved, but these
> should be negligible.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier
2025-06-03 11:08 ` [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
@ 2025-06-04 12:03 ` David Hildenbrand
2025-06-04 12:38 ` Oscar Salvador
0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-06-04 12:03 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: Vlastimil Babka, Jonathan Cameron, Harry Yoo, Rakie Kim,
Hyeonggon Yoo, linux-mm, linux-kernel
On 03.06.25 13:08, Oscar Salvador wrote:
> There are at least six consumers of hotplug_memory_notifier whose real
> interest is whether any numa node changed its state, e.g. going
> from being memory aware to becoming memoryless and vice versa.
>
> Implement a specific notifier for numa nodes when their state gets changed,
> and have those consumers that only care about numa node state changes use it.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> drivers/acpi/numa/hmat.c | 6 +-
> drivers/base/node.c | 21 +++++
> drivers/cxl/core/region.c | 14 ++--
> drivers/cxl/cxl.h | 4 +-
> include/linux/memory.h | 38 ++++++++-
> kernel/cgroup/cpuset.c | 2 +-
> mm/memory-tiers.c | 8 +-
> mm/memory_hotplug.c | 161 +++++++++++++++++---------------------
> mm/mempolicy.c | 8 +-
> mm/slub.c | 13 ++-
> 10 files changed, 155 insertions(+), 120 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 9d9052258e92..9ac82a767daf 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -962,10 +962,10 @@ static int hmat_callback(struct notifier_block *self,
> unsigned long action, void *arg)
> {
> struct memory_target *target;
> - struct memory_notify *mnb = arg;
> + struct node_notify *mnb = arg;
> int pxm, nid = mnb->status_change_nid;
>
> - if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> + if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
> return NOTIFY_OK;
>
> pxm = node_to_pxm(nid);
> @@ -1118,7 +1118,7 @@ static __init int hmat_init(void)
> hmat_register_targets();
>
> /* Keep the table and structures if the notifier may use them */
> - if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
> + if (hotplug_node_notifier(hmat_callback, HMAT_CALLBACK_PRI))
> goto out_put;
>
> if (!hmat_set_default_dram_perf())
> diff --git a/drivers/base/node.c b/drivers/base/node.c
[...]
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 5ec4e6d209b9..8c5c88eaffb3 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
> #define MEM_PREPARE_ONLINE (1<<6)
> #define MEM_FINISH_OFFLINE (1<<7)
>
> +/* These states are used for numa node notifiers */
> +#define NODE_BECOMING_MEM_AWARE (1<<0)
> +#define NODE_BECAME_MEM_AWARE (1<<1)
> +#define NODE_BECOMING_MEMORYLESS (1<<2)
> +#define NODE_BECAME_MEMORYLESS (1<<3)
> +#define NODE_CANCEL_MEM_AWARE (1<<4)
> +#define NODE_CANCEL_MEMORYLESS (1<<5)
Very nitpicky: MEM vs. MEMORY inconsistency. Also, I am not sure about
"MEMORYLESS vs. MEMORY AWARE" terminology (opposite of aware is not
less) and "BECOMING" vs. "CANCEL" ...
There must be something better ... but what is it. :)
NODE_ADDING_FIRST_MEMORY
NODE_ADDED_FIRST_MEMORY
NODE_CANCEL_ADDING_FIRST_MEMORY
NODE_REMOVING_LAST_MEMORY
NODE_REMOVED_LAST_MEMORY
NODE_CANCEL_REMOVING_LAST_MEMORY
Maybe something like that? I still don't quite like the "CANCEL" stuff.
NODE_ADDING_FIRST_MEMORY
NODE_ADDED_FIRST_MEMORY
NODE_NOT_ADDED_FIRST_MEMORY
NODE_REMOVING_LAST_MEMORY
NODE_REMOVED_LAST_MEMORY
NODE_NOT_REMOVED_LAST_MEMORY
Hm ...
> +
> struct memory_notify {
> /*
> * The altmap_start_pfn and altmap_nr_pages fields are designated for
> @@ -109,7 +117,10 @@ struct memory_notify {
> unsigned long altmap_nr_pages;
> unsigned long start_pfn;
> unsigned long nr_pages;
> - int status_change_nid_normal;
> + int status_change_nid;
> +};
Could/should that be a separate patch after patch #1 removed the last user?
Also, I think the sequence should be (this patch is getting hard to
review for me due to the size):
#1 existing patch 1
#2 remove status_change_nid_normal
#3 introduce node notifier
#4-#X: convert individual users to node notifier
#X+1: change status_change_nid to always just indicate the nid, renaming
it on the way (incl current patch #3)
> +
> +struct node_notify {
> int status_change_nid;
This should be called "nid" right from the start.
>
> @@ -157,15 +168,34 @@ static inline unsigned long memory_block_advised_max_size(void)
> {
> return 0;
> }
> +
[...]
> * {on,off}lining is constrained to full memory sections (or more
> @@ -1194,11 +1172,22 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
> /* associate pfn range with the zone */
> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
>
> - arg.start_pfn = pfn;
> - arg.nr_pages = nr_pages;
> - node_states_check_changes_online(nr_pages, zone, &arg);
> + node_arg.status_change_nid = NUMA_NO_NODE;
> + if (!node_state(nid, N_MEMORY)) {
> + /* Node is becoming memory aware. Notify consumers */
> + cancel_node_notifier_on_err = true;
> + node_arg.status_change_nid = nid;
> + ret = node_notify(NODE_BECOMING_MEM_AWARE, &node_arg);
> + ret = notifier_to_errno(ret);
> + if (ret)
> + goto failed_addition;
> + }
I assume without NUMA, that code would never trigger? I mean, the whole
notifier doesn't make sense without CONFIG_NUMA :)
--
Cheers,
David / dhildenb
* Re: [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
2025-06-04 9:28 ` Harry Yoo
2025-06-04 11:33 ` David Hildenbrand
@ 2025-06-04 12:16 ` Yunsheng Lin
2 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-06-04 12:16 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: David Hildenbrand, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel
On 2025/6/3 19:08, Oscar Salvador wrote:
> Currently, slab_mem_going_going_callback() checks whether the node has
slab_mem_going_going_callback() might be a typo here.
I suppose slab_mem_going_online_callback() might be the correct one here.
> N_NORMAL memory in order to be set in slab_nodes.
> While it is true that gettind rid of that enforcing would mean
> ending up with movables nodes in slab_nodes, the memory waste that comes
> with that is negligible.
>
> So stop checking for status_change_nid_normal and just use status_change_nid
> instead which works for both types of memory.
>
> Also, once we allocate the kmem_cache_node cache for the node in
^^
seems like a minor nit, an extra space in the above.
* Re: [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier
2025-06-04 12:03 ` David Hildenbrand
@ 2025-06-04 12:38 ` Oscar Salvador
2025-06-04 12:47 ` David Hildenbrand
0 siblings, 1 reply; 12+ messages in thread
From: Oscar Salvador @ 2025-06-04 12:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel
On Wed, Jun 04, 2025 at 02:03:23PM +0200, David Hildenbrand wrote:
> > diff --git a/include/linux/memory.h b/include/linux/memory.h
> > index 5ec4e6d209b9..8c5c88eaffb3 100644
> > --- a/include/linux/memory.h
> > +++ b/include/linux/memory.h
> > @@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
> > #define MEM_PREPARE_ONLINE (1<<6)
> > #define MEM_FINISH_OFFLINE (1<<7)
> > +/* These states are used for numa node notifiers */
> > +#define NODE_BECOMING_MEM_AWARE (1<<0)
> > +#define NODE_BECAME_MEM_AWARE (1<<1)
> > +#define NODE_BECOMING_MEMORYLESS (1<<2)
> > +#define NODE_BECAME_MEMORYLESS (1<<3)
> > +#define NODE_CANCEL_MEM_AWARE (1<<4)
> > +#define NODE_CANCEL_MEMORYLESS (1<<5)
>
> Very nitpicky: MEM vs. MEMORY inconsistency. Also, I am not sure about
> "MEMORYLESS vs. MEMORY AWARE" terminology (opposite of aware is not less)
> and "BECOMING" vs. "CANCEL" ...
Heh, that is why I'm not in the marketing field :-)
> There must be something better ... but what is it. :)
>
> NODE_ADDING_FIRST_MEMORY
> NODE_ADDED_FIRST_MEMORY
> NODE_CANCEL_ADDING_FIRST_MEMORY
>
> NODE_REMOVING_LAST_MEMORY
> NODE_REMOVED_LAST_MEMORY
> NODE_CANCEL_REMOVING_LAST_MEMORY
>
> Maybe something like that? I still don't quite like the "CANCEL" stuff.
>
> NODE_ADDING_FIRST_MEMORY
> NODE_ADDED_FIRST_MEMORY
> NODE_NOT_ADDED_FIRST_MEMORY
>
> NODE_REMOVING_LAST_MEMORY
> NODE_REMOVED_LAST_MEMORY
> NODE_NOT_REMOVED_LAST_MEMORY
If I were to pick one, I'd go with NODE_ADDING_FIRST_MEMORY/NODE_REMOVING_LAST_MEMORY.
I think those make it easier to grasp.
> Hm ...
>
> > +
> > struct memory_notify {
> > /*
> > * The altmap_start_pfn and altmap_nr_pages fields are designated for
> > @@ -109,7 +117,10 @@ struct memory_notify {
> > unsigned long altmap_nr_pages;
> > unsigned long start_pfn;
> > unsigned long nr_pages;
> > - int status_change_nid_normal;
> > + int status_change_nid;
> > +};
>
> Could/should that be a separate patch after patch #1 removed the last user?
>
> Also, I think the sequence should be (this patch is getting hard to review
> for me due to the size):
>
> #1 existing patch 1
> #2 remove status_change_nid_normal
> #3 introduce node notifier
> #4-#X: convert individual users to node notifier
> #X+1: change status_change_nid to always just indicate the nid, renaming
> it on the way (incl current patch #3)
When you say #4-#X, you mean a separate patch per converting user?
So, one for memtier, one for cxl, one for hmat, etc.?
> > +
> > +struct node_notify {
> > int status_change_nid;
>
> This should be called "nid" right from the start.
Copy that.
> > @@ -157,15 +168,34 @@ static inline unsigned long memory_block_advised_max_size(void)
> > {
> > return 0;
> > }
> > +
>
> [...]
>
> > * {on,off}lining is constrained to full memory sections (or more
> > @@ -1194,11 +1172,22 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
> > /* associate pfn range with the zone */
> > move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
> > - arg.start_pfn = pfn;
> > - arg.nr_pages = nr_pages;
> > - node_states_check_changes_online(nr_pages, zone, &arg);
> > + node_arg.status_change_nid = NUMA_NO_NODE;
> > + if (!node_state(nid, N_MEMORY)) {
> > + /* Node is becoming memory aware. Notify consumers */
> > + cancel_node_notifier_on_err = true;
> > + node_arg.status_change_nid = nid;
> > + ret = node_notify(NODE_BECOMING_MEM_AWARE, &node_arg);
> > + ret = notifier_to_errno(ret);
> > + if (ret)
> > + goto failed_addition;
> > + }
>
> I assume without NUMA, that code would never trigger? I mean, the whole
> notifier doesn't make sense without CONFIG_NUMA :)
So, glad you asked because it forced me to look again.
And I think we might have some divergences there, so I will sort it out
in the next respin.
Thanks for the feedback ;-)
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier
2025-06-04 12:38 ` Oscar Salvador
@ 2025-06-04 12:47 ` David Hildenbrand
2025-06-05 5:18 ` Oscar Salvador
0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-06-04 12:47 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel
On 04.06.25 14:38, Oscar Salvador wrote:
> On Wed, Jun 04, 2025 at 02:03:23PM +0200, David Hildenbrand wrote:
>>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>>> index 5ec4e6d209b9..8c5c88eaffb3 100644
>>> --- a/include/linux/memory.h
>>> +++ b/include/linux/memory.h
>>> @@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
>>> #define MEM_PREPARE_ONLINE (1<<6)
>>> #define MEM_FINISH_OFFLINE (1<<7)
>>> +/* These states are used for numa node notifiers */
>>> +#define NODE_BECOMING_MEM_AWARE (1<<0)
>>> +#define NODE_BECAME_MEM_AWARE (1<<1)
>>> +#define NODE_BECOMING_MEMORYLESS (1<<2)
>>> +#define NODE_BECAME_MEMORYLESS (1<<3)
>>> +#define NODE_CANCEL_MEM_AWARE (1<<4)
>>> +#define NODE_CANCEL_MEMORYLESS (1<<5)
>>
>> Very nitpicky: MEM vs. MEMORY inconsistency. Also, I am not sure about
>> "MEMORYLESS vs. MEMORY AWARE" terminology (opposite of aware is not less)
>> and "BECOMING" vs. "CANCEL" ...
>
> Heh, that is why I'm not in the marketing field :-)
>
>> There must be something better ... but what is it. :)
>>
>> NODE_ADDING_FIRST_MEMORY
>> NODE_ADDED_FIRST_MEMORY
>> NODE_CANCEL_ADDING_FIRST_MEMORY
>>
>> NODE_REMOVING_LAST_MEMORY
>> NODE_REMOVED_LAST_MEMORY
>> NODE_CANCEL_REMOVING_LAST_MEMORY
>>
>> Maybe something like that? I still don't quite like the "CANCEL" stuff.
>>
>> NODE_ADDING_FIRST_MEMORY
>> NODE_ADDED_FIRST_MEMORY
>> NODE_NOT_ADDED_FIRST_MEMORY
>>
>> NODE_REMOVING_LAST_MEMORY
>> NODE_REMOVED_LAST_MEMORY
>> NODE_NOT_REMOVED_LAST_MEMORY
>
> If I were to pick one, I'd go with NODE_ADDING_FIRST_MEMORY/NODE_REMOVING_LAST_MEMORY.
> I think those make it easier to grasp.
Just to clarify, these were the 3 notifiers each that belong together. I
was not sure about NODE_CANCEL_ADDING_FIRST_MEMORY vs.
NODE_NOT_ADDED_FIRST_MEMORY.
>
>
>> Hm ...
>>
>>> +
>>> struct memory_notify {
>>> /*
>>> * The altmap_start_pfn and altmap_nr_pages fields are designated for
>>> @@ -109,7 +117,10 @@ struct memory_notify {
>>> unsigned long altmap_nr_pages;
>>> unsigned long start_pfn;
>>> unsigned long nr_pages;
>>> - int status_change_nid_normal;
>>> + int status_change_nid;
>>> +};
>>
>> Could/should that be a separate patch after patch #1 removed the last user?
>>
>> Also, I think the sequence should be (this patch is getting hard to review
>> for me due to the size):
>>
>> #1 existing patch 1
>> #2 remove status_change_nid_normal
>> #3 introduce node notifier
>> #4-#X: convert individual users to node notifier
>> #X+1: change status_change_nid to always just indicate the nid, renaming
>> it on the way (incl current patch #3)
>
> When you say #4-#X, you mean a separate patch per converting user?
> So, one for memtier, one for cxl, one for hmat, etc.?
Yes.
--
Cheers,
David / dhildenb
* Re: [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier
2025-06-04 12:47 ` David Hildenbrand
@ 2025-06-05 5:18 ` Oscar Salvador
2025-06-05 8:12 ` David Hildenbrand
0 siblings, 1 reply; 12+ messages in thread
From: Oscar Salvador @ 2025-06-05 5:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel
On Wed, Jun 04, 2025 at 02:47:28PM +0200, David Hildenbrand wrote:
> Just to clarify, these were the 3 notifiers each that belong together. I was
> not sure about NODE_CANCEL_ADDING_FIRST_MEMORY vs.
> NODE_NOT_ADDED_FIRST_MEMORY.
I started working on the new respin and the moment came to make a
decision about this.
I think I'd go with NODE_CANCEL_ADDING_FIRST_MEMORY, for two reasons.
One is that the memory notifier also uses that terminology, so I'd use that
one for the node notifier to keep it consistent.
Someone could argue we are perpetuating a bad naming decision,
though :-).
The other reason is that to me, it sounds more natural as the way I see
it, we are canceling an ongoing operation (memory-hotplug).
Now, I can also see the point in NODE_NOT_ADDED, because the memory
could "not be added in the end", but at the end of the day only one can be picked :-D.
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier
2025-06-05 5:18 ` Oscar Salvador
@ 2025-06-05 8:12 ` David Hildenbrand
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-06-05 8:12 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Vlastimil Babka, Jonathan Cameron, Harry Yoo,
Rakie Kim, Hyeonggon Yoo, linux-mm, linux-kernel
On 05.06.25 07:18, Oscar Salvador wrote:
> On Wed, Jun 04, 2025 at 02:47:28PM +0200, David Hildenbrand wrote:
>> Just to clarify, these were the 3 notifiers each that belong together. I was
>> not sure about NODE_CANCEL_ADDING_FIRST_MEMORY vs.
>> NODE_NOT_ADDED_FIRST_MEMORY.
>
> I started working on the new respin and the moment came to make a
> decision about this.
> I think I'd go with NODE_CANCEL_ADDING_FIRST_MEMORY, for two reasons.
> One is that memory notifier also uses that therminology, so I'd use that
> one for the node notifier to keep it consistent.
> Someone could argue whether we are perpetuating a bad decision naming
> though :-).
Works for me :)
And yes, out of both options I provided, that is the better one.
--
Cheers,
David / dhildenb
2025-06-03 11:08 [PATCH v4 0/3] Implement numa node notifier Oscar Salvador
2025-06-03 11:08 ` [PATCH v4 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes Oscar Salvador
2025-06-04 9:28 ` Harry Yoo
2025-06-04 11:33 ` David Hildenbrand
2025-06-04 12:16 ` Yunsheng Lin
2025-06-03 11:08 ` [PATCH v4 2/3] mm,memory_hotplug: Implement numa node notifier Oscar Salvador
2025-06-04 12:03 ` David Hildenbrand
2025-06-04 12:38 ` Oscar Salvador
2025-06-04 12:47 ` David Hildenbrand
2025-06-05 5:18 ` Oscar Salvador
2025-06-05 8:12 ` David Hildenbrand
2025-06-03 11:08 ` [PATCH v4 3/3] mm,memory_hotplug: Rename status_change_nid parameter in memory_notify Oscar Salvador