* [RFC PATCH 0/3] mm/numa: reserve standby NUMA nodes for runtime claiming
@ 2026-06-10 1:45 Gregory Price
2026-06-10 1:45 ` [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter Gregory Price
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Gregory Price @ 2026-06-10 1:45 UTC
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
A NUMA node must be "possible" at __init time to be usable later; a node
that is not described at boot cannot be brought online afterwards.
For memory tiering or isolation it is sometimes desirable to spread
hotplug memory (CXL, GPU, virtio-mem, ...) across more nodes than
firmware describes. Additionally, some memory devices may provide
more than a single class of memory and need flexibility to redefine
the effective topology at runtime instead of depending on BIOS.
This series adds a way to reserve empty "standby" NUMA nodes at boot
so drivers can place hotplugged memory on distinct nodes later, at
runtime, without those nodes being described by BIOS.
Using the feature
=================
A standby node is an empty, offline-but-possible NUMA node: at boot it
has no memory and no CPUs. A driver claims one at runtime, brings
memory online on it, and releases it when done.
This series adds three ways to reserve standby nodes:
  - numa=standby=N
      Boot parameter. Reserve N extra empty nodes. Platform
      independent; works with or without ACPI.
  - CONFIG_ACPI_NUMA_STANDBY_NODES=N
      Reserve N extra empty nodes on ACPI systems (honoured only when
      firmware produces a usable NUMA configuration).
  - CONFIG_ACPI_NUMA_ADD_CFMWS_NODES=K
      Reserve K extra empty nodes per CXL Fixed Memory Window (CEDT
      CFMWS), for CXL topologies that want several nodes behind one
      window.
All three default to off (0 / unset).
Reserved nodes show up in /sys/devices/system/node/possible but not
.../online until a driver claims one and onlines memory on it.
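As a rough illustration of how the three sources combine, here is a
userspace C model (not kernel code). The helper names
standby_from_cmdline() and standby_total() are hypothetical. The clamp
of 16 mirrors numa_standby_cmdline() in patch 1, and the ACPI-derived
counts are only added when ACPI NUMA init succeeds, as in patches 2
and 3. Treating a negative request as zero is a simplification made by
the model.

```c
#include <assert.h>

/* Clamp applied to numa=standby=<N> by numa_standby_cmdline() in patch 1. */
#define STANDBY_CMDLINE_MAX 16

/* Hypothetical helper: models validation of the boot parameter.
 * The kernel parser rejects negative values with -EINVAL; the model
 * simply treats them as "no request". */
static int standby_from_cmdline(int requested)
{
	if (requested < 0)
		return 0;
	return requested > STANDBY_CMDLINE_MAX ? STANDBY_CMDLINE_MAX : requested;
}

/* Hypothetical helper: total standby nodes requested for one boot.
 *   cmdline_n    - value of numa=standby=<N>
 *   kconfig_n    - CONFIG_ACPI_NUMA_STANDBY_NODES
 *   kconfig_k    - CONFIG_ACPI_NUMA_ADD_CFMWS_NODES
 *   nr_cfmws     - number of CEDT CFMWS entries parsed
 *   acpi_numa_ok - nonzero if ACPI NUMA init produced a usable config
 */
int standby_total(int cmdline_n, int kconfig_n, int kconfig_k,
		  int nr_cfmws, int acpi_numa_ok)
{
	int total = standby_from_cmdline(cmdline_n);

	assert(kconfig_n >= 0 && kconfig_k >= 0 && nr_cfmws >= 0);

	/* ACPI sources are only requested on a successful ACPI NUMA init. */
	if (acpi_numa_ok)
		total += kconfig_n + kconfig_k * nr_cfmws;
	return total;
}
```

For example, numa=standby=2 with CONFIG_ACPI_NUMA_STANDBY_NODES=1,
CONFIG_ACPI_NUMA_ADD_CFMWS_NODES=2 and three CFMWS entries would request
2 + 1 + 2*3 = 9 nodes. The number actually created is still bounded by
the free node ids below MAX_NUMNODES.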
Testing
=======
Built and booted under QEMU (virtme-ng) across a matrix of boot
parameters and topologies:
- Each reservation source, individually and combined: reserved nodes
  appear as possible-but-offline with no memory, claim/release
  round-trips correctly, and node distances are sane.
  The CFMWS path was exercised with an emulated CXL Type-3 device
  presenting a CEDT/CFMWS.
- Fallback: when ACPI NUMA init does not produce a usable config,
  no standby nodes are reserved.
- NUMA emulation (numa=fake), which renumbers the node space:
  standby nodes are created only after the (possibly emulated)
  topology is final, so their ids can never alias emulated nodes.
  numa=fake boots cleanly with the feature enabled and behaves
  identically to a baseline kernel without this series.
  Tested with CONFIG_NUMA_EMU both enabled and disabled, and with
  and without numa=fake on the command line.
- Default-off builds behave identically to a baseline kernel.
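The "no aliasing" property in the numa=fake item follows from how ids
are picked: standby nodes take the lowest ids that are not already in
the parsed node mask once emulation is done. A minimal userspace sketch
of that selection, using a 64-bit mask as a stand-in for nodemask_t
(reserve_standby() and MODEL_MAX_NODES are hypothetical names, not part
of the series):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES 64	/* stand-in for MAX_NUMNODES */

/*
 * Reserve up to `total` node ids that are not yet set in *parsed.
 * Each reserved id is set in both *parsed (making it "possible") and
 * *standby (staging it for the pool). Returns the number reserved,
 * which is smaller than `total` if the id space runs out.
 */
int reserve_standby(uint64_t *parsed, uint64_t *standby, int total)
{
	int reserved = 0;

	assert(total >= 0);

	for (int node = 0; node < MODEL_MAX_NODES && reserved < total; node++) {
		uint64_t bit = UINT64_C(1) << node;

		if (*parsed & bit)
			continue;	/* id already used by a real or emulated node */
		*parsed |= bit;
		*standby |= bit;
		reserved++;
	}
	return reserved;
}
```

With four emulated nodes (mask 0xF) and two standby nodes requested, the
model reserves ids 4 and 5. If the parsed mask has holes, the lowest
free ids are used first, so standby ids are not necessarily contiguous.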
Gregory Price (3):
mm/numa: add exclusive node pool and numa=standby boot parameter
acpi/numa: add CONFIG_ACPI_NUMA_STANDBY_NODES
acpi/numa: add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES
.../admin-guide/kernel-parameters.txt | 8 ++
arch/x86/mm/numa.c | 2 +
drivers/acpi/numa/Kconfig | 35 ++++++
drivers/acpi/numa/srat.c | 14 ++-
drivers/base/arch_numa.c | 2 +
include/linux/numa.h | 14 +++
include/linux/numa_memblks.h | 3 +
mm/numa.c | 90 +++++++++++++
mm/numa_memblks.c | 118 +++++++++++++++++-
9 files changed, 284 insertions(+), 2 deletions(-)
--
2.54.0
* [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter
2026-06-10 1:45 [RFC PATCH 0/3] mm/numa: reserve standby NUMA nodes for runtime claiming Gregory Price
@ 2026-06-10 1:45 ` Gregory Price
2026-06-11 9:00 ` Mike Rapoport
2026-06-10 1:45 ` [RFC PATCH 2/3] acpi/numa: add CONFIG_ACPI_NUMA_STANDBY_NODES Gregory Price
2026-06-10 1:45 ` [RFC PATCH 3/3] acpi/numa: add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES Gregory Price
2 siblings, 1 reply; 5+ messages in thread
From: Gregory Price @ 2026-06-10 1:45 UTC
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
It is sometimes preferable to logically split hotplug memory capacity
across more nodes than BIOS describes at boot time.
However, nodes that are not described at __init time cannot be added
later on.
Add the core infrastructure for reserving empty "standby" NUMA nodes at
boot that drivers can claim at runtime.
Introduce an exclusive node pool with a runtime claim/release interface:
numa_request_exclusive_node() - claim a node from the pool
numa_release_exclusive_node() - return a node to the pool
This allows drivers to place hotplugged memory on distinct NUMA nodes
without requiring BIOS-assigned proximity domains.
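The claim/release semantics can be sketched in userspace as follows.
This is a single-threaded model, not the kernel implementation: the
pool is a 64-bit mask instead of a nodemask_t, there is no spinlock,
and pool_request(), pool_release(), MODEL_MAX_NODES and MODEL_NO_NODE
are hypothetical stand-ins for the real interface and for MAX_NUMNODES
and NUMA_NO_NODE.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES 64	/* stand-in for MAX_NUMNODES */
#define MODEL_NO_NODE   (-1)	/* stand-in for NUMA_NO_NODE */

/* Claim the lowest-numbered node still in the pool, or MODEL_NO_NODE. */
int pool_request(uint64_t *pool)
{
	for (int node = 0; node < MODEL_MAX_NODES; node++) {
		uint64_t bit = UINT64_C(1) << node;

		if (*pool & bit) {
			*pool &= ~bit;	/* caller now owns this node */
			return node;
		}
	}
	return MODEL_NO_NODE;
}

/* Return a previously claimed node to the pool. */
void pool_release(uint64_t *pool, int node)
{
	if (node == MODEL_NO_NODE)
		return;		/* releasing "no node" is a no-op */
	assert(node >= 0 && node < MODEL_MAX_NODES);
	*pool |= UINT64_C(1) << node;
}
```

A driver would claim a node before onlining memory on it and release it
after the memory is gone. Like the posted code, the model does not
check that a released id was originally handed out by the pool.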
Standby nodes are created after numa_emulation() has finalized the node
numbering. The count comes from the numa=standby=<N> boot parameter
plus any requests by init code via numa_request_standby_count().
Creating them post-emulation avoids perturbing the emulated node
numbering and keeps standby node ids from aliasing emulated nodes.
This also pushes off assigning node numbers until after all PXM mappings
have been created to keep the system view more consistent overall.
numa_init_standby_nodes() rebuilds the NUMA distance table (using the
same method as numa_emulation()) so that standby nodes have distance
entries. These entries may be reprogrammed later via numa_set_distance().
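The rebuild amounts to "save the old N x N table, allocate a larger one
filled with defaults, copy the old entries back". A userspace sketch of
that idea follows; grow_distance() is a hypothetical helper, malloc()
stands in for the memblock allocator, and the default fill (10 on the
diagonal, 20 elsewhere) assumes the kernel's usual LOCAL_DISTANCE and
REMOTE_DISTANCE values.

```c
#include <assert.h>
#include <stdlib.h>

#define LOCAL_DISTANCE  10	/* assumed default for node-to-self */
#define REMOTE_DISTANCE 20	/* assumed default for every other pair */

/*
 * Build a new_cnt x new_cnt distance table that preserves the
 * old_cnt x old_cnt entries of `old` and gives all new rows and
 * columns default distances. Returns NULL on allocation failure;
 * the caller frees the result.
 */
unsigned char *grow_distance(const unsigned char *old, int old_cnt,
			     int new_cnt)
{
	unsigned char *tbl;

	assert(old_cnt >= 0 && new_cnt >= old_cnt);

	tbl = malloc((size_t)new_cnt * new_cnt);
	if (!tbl)
		return NULL;

	/* Defaults first, as a freshly allocated table would have. */
	for (int i = 0; i < new_cnt; i++)
		for (int j = 0; j < new_cnt; j++)
			tbl[i * new_cnt + j] =
				(i == j) ? LOCAL_DISTANCE : REMOTE_DISTANCE;

	/* Then restore the distances that firmware (or emulation) provided. */
	for (int i = 0; i < old_cnt; i++)
		for (int j = 0; j < old_cnt; j++)
			tbl[i * new_cnt + j] = old[i * old_cnt + j];

	return tbl;
}
```

Growing a two-node table with a cross distance of 21 to three nodes
keeps the 21 between nodes 0 and 1 and leaves node 2 at the default
remote distance from both.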
As a result, it is also possible for drivers to use these standby nodes
to change memory tier membership and fallback ordering instead of being
tied down to what is described by BIOS / Firmware.
Additional Notes/Concerns:
1) Can we do dynamic addition of nodes?
Not trivially.
Some services utilize num_possible_nodes() as a static value to
calculate the amount of resources to use at runtime (bpf, md/raid5).
Example: futex_init uses num_possible_nodes() as part of its
hashsize calculation during __init.
2) Does this create phys_to_target_node() ambiguity?
No.
Every present user of phys_to_target_node() either uses it during
__init to set up associations, or after __init to associate a static
memory region and a node.
In neither case do these additional nodes create ambiguity.
We do at least add a comment to phys_to_target_node() to note that
this should only be used to determine the affiliation of a memory
region statically configured by BIOS (i.e. not hotplug memory).
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../admin-guide/kernel-parameters.txt | 8 ++
arch/x86/mm/numa.c | 2 +
drivers/base/arch_numa.c | 2 +
include/linux/numa.h | 14 +++
include/linux/numa_memblks.h | 3 +
mm/numa.c | 90 +++++++++++++
mm/numa_memblks.c | 118 +++++++++++++++++-
7 files changed, 236 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 23be2f64439c..5410498c97af 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4765,6 +4765,14 @@ Kernel parameters
numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or
soft-reserved memory partitioning.
+ numa=standby=<N>
+ [KNL, ARM64, RISCV, X86, EARLY]
+ Reserve N additional empty NUMA nodes at boot for
+ runtime claiming via numa_request_exclusive_node().
+ These nodes have no memory or CPU affinity. Drivers
+ can claim them to place hotplugged memory on distinct
+ NUMA nodes for memory tiering or isolation.
+
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
Allowed values are enable and disable
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 99d0a9332c14..e4798c43276b 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -33,6 +33,8 @@ static __init int numa_setup(char *opt)
numa_off = 1;
if (!strncmp(opt, "fake=", 5))
return numa_emu_cmdline(opt + 5);
+ if (!strncmp(opt, "standby=", 8))
+ return numa_standby_cmdline(opt + 8);
if (!strncmp(opt, "noacpi", 6))
disable_srat();
if (!strncmp(opt, "nohmat", 6))
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..8526be1da69a 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -28,6 +28,8 @@ static __init int numa_parse_early_param(char *opt)
numa_off = true;
if (!strncmp(opt, "fake=", 5))
return numa_emu_cmdline(opt + 5);
+ if (!strncmp(opt, "standby=", 8))
+ return numa_standby_cmdline(opt + 8);
return 0;
}
diff --git a/include/linux/numa.h b/include/linux/numa.h
index e6baaf6051bc..4621af407ec6 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -43,6 +43,11 @@ int phys_to_target_node(u64 start);
int numa_fill_memblks(u64 start, u64 end);
+void __init numa_add_standby_node(int node);
+void __init numa_commit_standby_nodes(void);
+int numa_request_exclusive_node(void);
+void numa_release_exclusive_node(int node);
+
#else /* !CONFIG_NUMA */
static inline int numa_nearest_node(int node, unsigned int state)
{
@@ -64,6 +69,15 @@ static inline int phys_to_target_node(u64 start)
}
static inline void alloc_offline_node_data(int nid) {}
+
+static inline void numa_add_standby_node(int node) { }
+static inline void numa_commit_standby_nodes(void) { }
+static inline int numa_request_exclusive_node(void)
+{
+ return NUMA_NO_NODE;
+}
+static inline void numa_release_exclusive_node(int node) { }
+
#endif
#define numa_map_to_online_node(node) numa_nearest_node(node, N_ONLINE)
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 991076cba7c5..7d7b8307e267 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -32,6 +32,9 @@ int __init numa_memblks_init(int (*init_func)(void),
extern int numa_distance_cnt;
+int __init numa_standby_cmdline(char *str);
+void __init numa_request_standby_count(int n);
+
#ifdef CONFIG_NUMA_EMU
extern int emu_nid_to_phys[MAX_NUMNODES];
int numa_emu_cmdline(char *str);
diff --git a/mm/numa.c b/mm/numa.c
index 7d5e06fe5bd4..9806cdf2f998 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -4,6 +4,8 @@
#include <linux/printk.h>
#include <linux/numa.h>
#include <linux/numa_memblks.h>
+#include <linux/spinlock.h>
+#include <linux/export.h>
struct pglist_data *node_data[MAX_NUMNODES];
EXPORT_SYMBOL(node_data);
@@ -59,3 +61,91 @@ int phys_to_target_node(u64 start)
}
EXPORT_SYMBOL_GPL(phys_to_target_node);
#endif
+
+/*
+ * Pool of exclusive NUMA nodes available for runtime claiming.
+ *
+ * Published by numa_commit_standby_nodes() from standby nodes staged
+ * during __init. Protected by exclusive_node_lock at runtime.
+ */
+static nodemask_t exclusive_nodes = NODE_MASK_NONE;
+static DEFINE_SPINLOCK(exclusive_node_lock);
+
+/*
+ * Standby node candidates staged during NUMA init. Committed to the exclusive
+ * pool by numa_commit_standby_nodes() once node_possible_map is finalized.
+ */
+static nodemask_t standby_candidates __initdata;
+
+/**
+ * numa_add_standby_node() - Stage a node as a standby pool candidate
+ * @node: Node ID created as an empty standby node during NUMA init
+ *
+ * Records @node as a candidate for the exclusive pool.
+ * Callers must also add @node to numa_nodes_parsed to mark it possible.
+ */
+void __init numa_add_standby_node(int node)
+{
+ node_set(node, standby_candidates);
+}
+
+/**
+ * numa_commit_standby_nodes() - Publish staged standby nodes to the pool
+ *
+ * Registers the staged candidates that are present in node_possible_map
+ * into the exclusive pool. Restricting to possible nodes keeps the pool a
+ * strict subset of node_possible_map, so a later claim can never return a
+ * node that was dropped (e.g. by a fallback init or NUMA emulation).
+ * Called once node_possible_map is final.
+ */
+void __init numa_commit_standby_nodes(void)
+{
+ nodes_and(exclusive_nodes, standby_candidates, node_possible_map);
+}
+
+/**
+ * numa_request_exclusive_node() - Claim an available exclusive NUMA node
+ *
+ * Exclusive nodes are empty NUMA nodes registered at boot via the standby
+ * node interfaces or standby= boot parameter.
+ *
+ * The caller takes exclusive ownership of the returned node and must
+ * release it with numa_release_exclusive_node() when no longer needed.
+ *
+ * Return: a NUMA node ID on success, %NUMA_NO_NODE if none available.
+ */
+int numa_request_exclusive_node(void)
+{
+ int node;
+
+ spin_lock(&exclusive_node_lock);
+ node = first_node(exclusive_nodes);
+ if (node < MAX_NUMNODES)
+ node_clear(node, exclusive_nodes);
+ else
+ node = NUMA_NO_NODE;
+ spin_unlock(&exclusive_node_lock);
+
+ return node;
+}
+EXPORT_SYMBOL_GPL(numa_request_exclusive_node);
+
+/**
+ * numa_release_exclusive_node() - Release a previously claimed exclusive node
+ * @node: Node ID previously returned by numa_request_exclusive_node()
+ *
+ * Returns the node to the exclusive pool.
+ */
+void numa_release_exclusive_node(int node)
+{
+ if (node == NUMA_NO_NODE)
+ return;
+
+ if (WARN_ON(node >= MAX_NUMNODES))
+ return;
+
+ spin_lock(&exclusive_node_lock);
+ node_set(node, exclusive_nodes);
+ spin_unlock(&exclusive_node_lock);
+}
+EXPORT_SYMBOL_GPL(numa_release_exclusive_node);
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 3c3c4eac3514..9ba243fd360e 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -6,6 +6,7 @@
#include <linux/memblock.h>
#include <linux/numa.h>
#include <linux/numa_memblks.h>
+#include <linux/topology.h>
#include <asm/numa.h>
@@ -442,6 +443,104 @@ static int __init numa_register_meminfo(struct numa_meminfo *mi)
return 0;
}
+static int numa_standby_nodes __initdata;
+static int numa_acpi_standby_nodes __initdata;
+
+int __init numa_standby_cmdline(char *str)
+{
+ int ret = kstrtoint(str, 0, &numa_standby_nodes);
+
+ if (ret || numa_standby_nodes < 0)
+ return -EINVAL;
+ numa_standby_nodes = min(numa_standby_nodes, 16);
+ return 0;
+}
+
+/**
+ * numa_request_standby_count() - Request standby nodes from NUMA init code
+ * @n: number of standby nodes to reserve
+ *
+ * Accumulated during NUMA init and added to the numa=standby=<N> request.
+ * The nodes are created later, once numa_emulation() has finalized the node
+ * numbering. Init code must add the count here instead of adding the nodes.
+ */
+void __init numa_request_standby_count(int n)
+{
+ numa_acpi_standby_nodes += n;
+}
+
+/**
+ * numa_init_standby_nodes() - Create standby nodes and rebuild distance table
+ *
+ * Called after numa_emulation() has finalized the node numbering.
+ * Creates requested empty standby nodes and rebuilds the NUMA distance
+ * table if it needs to grow to cover nodes added after SLIT parsing.
+ */
+static void __init numa_init_standby_nodes(void)
+{
+ int total = numa_standby_nodes + numa_acpi_standby_nodes;
+ nodemask_t available;
+ int i, j, max_node, old_cnt;
+ u8 *saved_dist = NULL;
+ size_t saved_size;
+ int registered = 0;
+
+ /* Create the requested standby nodes in numa_nodes_parsed */
+ if (total) {
+ nodes_complement(available, numa_nodes_parsed);
+ for (i = 0; i < total; i++) {
+ int node = first_node(available);
+
+ if (node >= MAX_NUMNODES)
+ break;
+ node_clear(node, available);
+ node_set(node, numa_nodes_parsed);
+ numa_add_standby_node(node);
+ pr_info("NUMA: standby node %d reserved\n", node);
+ registered++;
+ }
+ }
+ if (registered != total)
+ pr_warn("NUMA: error registering standby nodes\n");
+
+ /*
+ * If nodes were added after the distance table was allocated,
+ * rebuild the table so all nodes have distance entries.
+ * Standby nodes get REMOTE_DISTANCE by default.
+ */
+ old_cnt = numa_distance_cnt;
+ if (!old_cnt)
+ return;
+
+ max_node = 0;
+ for_each_node_mask(i, numa_nodes_parsed)
+ max_node = i;
+
+ if (max_node < old_cnt)
+ return;
+
+ saved_size = old_cnt * old_cnt * sizeof(u8);
+ saved_dist = memblock_alloc(saved_size, PAGE_SIZE);
+ if (!saved_dist) {
+ pr_warn("NUMA: standby nodes will use default distances\n");
+ return;
+ }
+
+ for (i = 0; i < old_cnt; i++)
+ for (j = 0; j < old_cnt; j++)
+ saved_dist[i * old_cnt + j] = node_distance(i, j);
+
+ /* Reset triggers reallocation on next numa_set_distance() */
+ numa_reset_distance();
+
+ /* Restore - first call reallocates sized for new numa_nodes_parsed */
+ for (i = 0; i < old_cnt; i++)
+ for (j = 0; j < old_cnt; j++)
+ numa_set_distance(i, j, saved_dist[i * old_cnt + j]);
+
+ memblock_free(saved_dist, saved_size);
+}
+
int __init numa_memblks_init(int (*init_func)(void),
bool memblock_force_top_down)
{
@@ -451,6 +550,7 @@ int __init numa_memblks_init(int (*init_func)(void),
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
nodes_clear(node_online_map);
+ numa_acpi_standby_nodes = 0;
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, max_addr, &memblock.memory, NUMA_NO_NODE));
WARN_ON(memblock_set_node(0, max_addr, &memblock.reserved,
@@ -479,8 +579,15 @@ int __init numa_memblks_init(int (*init_func)(void),
return ret;
numa_emulation(&numa_meminfo, numa_distance_cnt);
+ numa_init_standby_nodes();
+
+ ret = numa_register_meminfo(&numa_meminfo);
+ if (ret < 0)
+ return ret;
- return numa_register_meminfo(&numa_meminfo);
+ /* node_possible_map is final; publish standby nodes to the pool. */
+ numa_commit_standby_nodes();
+ return 0;
}
static int __init cmp_memblk(const void *a, const void *b)
@@ -567,6 +674,15 @@ static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
return NUMA_NO_NODE;
}
+/*
+ * These interfaces should only be used to acquire information about statically
+ * configured memory associations made at __init time.
+ *
+ * This interface should not be used to determine the node a struct page/folio
+ * lives in, as it is possible for memory hotplug to place those pages in
+ * different nodes than reported by this function.
+ */
+
int phys_to_target_node(u64 start)
{
int nid = meminfo_to_nid(&numa_meminfo, start);
--
2.54.0
* [RFC PATCH 2/3] acpi/numa: add CONFIG_ACPI_NUMA_STANDBY_NODES
2026-06-10 1:45 [RFC PATCH 0/3] mm/numa: reserve standby NUMA nodes for runtime claiming Gregory Price
2026-06-10 1:45 ` [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter Gregory Price
@ 2026-06-10 1:45 ` Gregory Price
2026-06-10 1:45 ` [RFC PATCH 3/3] acpi/numa: add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES Gregory Price
2 siblings, 0 replies; 5+ messages in thread
From: Gregory Price @ 2026-06-10 1:45 UTC
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
Some platforms want to reserve empty NUMA nodes at boot so that drivers
can later place hotplugged memory on distinct nodes for memory tiering
or isolation, without those nodes being described by BIOS.
Add CONFIG_ACPI_NUMA_STANDBY_NODES, a build-time count of empty nodes to
reserve on ACPI systems. The count is requested only when acpi_numa_init()
succeeds.
Deferring standby node creation until after NUMA emulation runs keeps
old numbering behaviors consistent for NUMA emulation users.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/acpi/numa/Kconfig | 15 +++++++++++++++
drivers/acpi/numa/srat.c | 3 +++
2 files changed, 18 insertions(+)
diff --git a/drivers/acpi/numa/Kconfig b/drivers/acpi/numa/Kconfig
index f33194d1e43f..ecf27bf45e5b 100644
--- a/drivers/acpi/numa/Kconfig
+++ b/drivers/acpi/numa/Kconfig
@@ -13,3 +13,18 @@ config ACPI_HMAT
register memory initiators with their targets, and export
performance attributes through the node's sysfs device if
provided.
+
+config ACPI_NUMA_STANDBY_NODES
+ int "Additional standby NUMA nodes for runtime claiming"
+ depends on ACPI_NUMA
+ range 0 16
+ default 0
+ help
+ Number of additional empty NUMA nodes to reserve at boot for
+ runtime claiming via numa_request_exclusive_node().
+
+ These nodes have no memory and no SRAT PXM association.
+ Drivers can claim them to place hotplugged memory on distinct
+ NUMA nodes for memory tiering or isolation purposes.
+
+ Set to 0 (default) to disable.
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 62d4a8df0b8c..d7b0e4ece610 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -664,6 +664,9 @@ int __init acpi_numa_init(void)
return cnt;
else if (!parsed_numa_memblks)
return -ENOENT;
+
+ /* Request any standby nodes (created after numa emulation) */
+ numa_request_standby_count(CONFIG_ACPI_NUMA_STANDBY_NODES);
return 0;
}
--
2.54.0
* [RFC PATCH 3/3] acpi/numa: add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES
2026-06-10 1:45 [RFC PATCH 0/3] mm/numa: reserve standby NUMA nodes for runtime claiming Gregory Price
2026-06-10 1:45 ` [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter Gregory Price
2026-06-10 1:45 ` [RFC PATCH 2/3] acpi/numa: add CONFIG_ACPI_NUMA_STANDBY_NODES Gregory Price
@ 2026-06-10 1:45 ` Gregory Price
2 siblings, 0 replies; 5+ messages in thread
From: Gregory Price @ 2026-06-10 1:45 UTC
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
CXL is intended to be a programmable topology, and a single CXL Fixed
Memory Window (CFMWS) may back memory that a driver wants to split across
multiple NUMA nodes for tiering or isolation.
Those nodes must exist at __init time to be usable later.
Add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES, the number of additional standby
NUMA nodes to reserve per CEDT CFMWS entry.
acpi_parse_cfmws() records the per-window count, which is folded into
the standby request on successful acpi_numa_init().
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/acpi/numa/Kconfig | 20 ++++++++++++++++++++
drivers/acpi/numa/srat.c | 13 +++++++++++--
2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/numa/Kconfig b/drivers/acpi/numa/Kconfig
index ecf27bf45e5b..65d7eb9a4022 100644
--- a/drivers/acpi/numa/Kconfig
+++ b/drivers/acpi/numa/Kconfig
@@ -14,6 +14,26 @@ config ACPI_HMAT
performance attributes through the node's sysfs device if
provided.
+config ACPI_NUMA_ADD_CFMWS_NODES
+ int "Additional standby NUMA nodes per CEDT CFMWS entry"
+ depends on ACPI_NUMA
+ range 0 4
+ default 0
+ help
+ Number of additional standby NUMA nodes to reserve per CEDT
+ CXL Fixed Memory Window Structure (CFMWS) entry.
+
+ By default ACPI reserves 1 NUMA node per unique PXM entry in
+ the SRAT, or 1 node for a CFMWS without SRAT mappings.
+
+ Setting this > 0 reserves additional standby nodes per CFMWS
+ that drivers can claim at runtime via
+ numa_request_exclusive_node(). This is useful for CXL drivers
+ that want to place memory on distinct NUMA nodes within the
+ same CXL Fixed Memory Window.
+
+ Set to 0 (default) to disable.
+
config ACPI_NUMA_STANDBY_NODES
int "Additional standby NUMA nodes for runtime claiming"
depends on ACPI_NUMA
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index d7b0e4ece610..6c54d5f0cf0a 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -354,6 +354,7 @@ static int __init acpi_parse_slit(struct acpi_table_header *table)
}
static int parsed_numa_memblks __initdata;
+static int cfmws_standby_count __initdata;
static int __init
acpi_parse_memory_affinity(union acpi_subtable_headers *header,
@@ -454,7 +455,7 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
* window.
*/
if (!numa_fill_memblks(start, end))
- return 0;
+ goto standby_nodes;
/* No SRAT description. Create a new node. */
node = acpi_map_pxm_to_node(*fake_pxm);
@@ -473,6 +474,11 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
/* Set the next available fake_pxm value */
(*fake_pxm)++;
+
+standby_nodes:
+ /* Request any standby nodes (created after numa_emulation runs) */
+ cfmws_standby_count += CONFIG_ACPI_NUMA_ADD_CFMWS_NODES;
+
return 0;
}
@@ -607,6 +613,8 @@ int __init acpi_numa_init(void)
if (acpi_disabled)
return -EINVAL;
+ cfmws_standby_count = 0;
+
/*
* Should not limit number with cpu num that is from NR_CPUS or nr_cpus=
* SRAT cpu entries could have different order with that in MADT.
@@ -666,7 +674,8 @@ int __init acpi_numa_init(void)
return -ENOENT;
/* Request any standby nodes (created after numa emulation) */
- numa_request_standby_count(CONFIG_ACPI_NUMA_STANDBY_NODES);
+ numa_request_standby_count(CONFIG_ACPI_NUMA_STANDBY_NODES +
+ cfmws_standby_count);
return 0;
}
--
2.54.0
* Re: [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter
2026-06-10 1:45 ` [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter Gregory Price
@ 2026-06-11 9:00 ` Mike Rapoport
0 siblings, 0 replies; 5+ messages in thread
From: Mike Rapoport @ 2026-06-11 9:00 UTC
To: Gregory Price
Cc: linux-mm, x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
dave.jiang, jic23, xueshuai, kai.huang
Hi,
On Tue, Jun 09, 2026 at 09:45:15PM -0400, Gregory Price wrote:
> It can be at times preferential to logically split up hotplug memory
> capacity into more nodes than are described by BIOS at boot time.
>
> However, if nodes are not described at __init time, they are not
> possible to add later on.
...
> 1) Can we do dynamic addition of nodes?
>
> Not Trivially
>
> Some services utilize num_possible_nodes() as a static value to
> calculate the amount of resources to use at runtime (bpf, md/raid5).
>
> Example: futex_init uses num_possible_nodes() as part of its
> hashsize calculation during __init.
AFAIU, we don't add the additional nodes for generic hotplug memory but
rather for exclusive use by drivers/applications that are aware of these
nodes.
Wouldn't adding them to possible nodes actually skew the calculation of the
resources by the services utilizing num_possible_nodes()?
With the futex_init() example, won't hashsize be scaled down too much
because we've added these special nodes to the possible mask?
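To put rough numbers on that, here is a userspace sketch. The formula
is an assumption about how futex_init() sizes its per-node hash (256
slots per possible CPU, divided by the number of possible nodes, at
least 4, rounded up to a power of two) and should be checked against
the tree; futex_hashsize_model() and roundup_pow2() are hypothetical
names.

```c
#include <assert.h>

/* Smallest power of two that is >= v (for v >= 1). */
static unsigned long roundup_pow2(unsigned long v)
{
	unsigned long p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Assumed model of the per-node futex hash size:
 * 256 * possible_cpus / possible_nodes, minimum 4, rounded up to a
 * power of two. Standby nodes count as possible nodes.
 */
unsigned long futex_hashsize_model(unsigned int cpus, unsigned int nodes)
{
	unsigned long size;

	assert(cpus > 0 && nodes > 0);

	size = 256UL * cpus / nodes;
	if (size < 4)
		size = 4;
	return roundup_pow2(size);
}
```

Under that assumption, a 64-CPU system with 2 firmware-described nodes
gets 8192 buckets per node, while the same system with 16 standby nodes
added (18 possible nodes) gets 1024, an eightfold reduction even though
the standby nodes have no CPUs.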
--
Sincerely yours,
Mike.