From: Rakie Kim <rakie.kim@sk.com>
To: akpm@linux-foundation.org
Cc: gourry@gourry.net, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
byungchul@sk.com, ying.huang@linux.alibaba.com,
apopple@nvidia.com, david@kernel.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com,
kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
rakie.kim@sk.com
Subject: [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality
Date: Mon, 16 Mar 2026 14:12:52 +0900
Message-ID: <20260316051258.246-5-rakie.kim@sk.com>
In-Reply-To: <20260316051258.246-1-rakie.kim@sk.com>
Flat weighted interleave applies one global weight vector regardless of
where a task runs. On multi-socket systems this ignores inter-socket
interconnect costs and can steer allocations to remote sockets even when
local capacity exists, degrading effective bandwidth and increasing
latency.
Consider a dual-socket system:

      node0                 node1
    +-------+             +-------+
    | CPU0  |-------------| CPU1  |
    +-------+             +-------+
    | DRAM0 |             | DRAM1 |
    +---+---+             +---+---+
        |                     |
    +---+---+             +---+---+
    | CXL0  |             | CXL1  |
    +-------+             +-------+
      node2                 node3
Local device capabilities (GB/s) versus cross-socket effective bandwidth
(rows: source node, columns: target node):

          0     1     2     3
    0   300   150   100    50
    1   150   300    50   100
A reasonable global weight vector reflecting device capabilities is:
node0=3 node1=3 node2=1 node3=1
However, applying it flat to all sources yields the effective map:

        0   1   2   3
    0   3   3   1   1
    1   3   3   1   1
This does not account for the interconnect penalty (e.g., node0->node1
drops 300->150, node0->node3 drops 100->50) and thus permits cross-socket
allocations that underutilize local bandwidth.
Make weighted interleave socket-aware. Before weights are applied, restrict
the candidate nodes to the current socket; only if no eligible local node
remains does the policy fall back to the wider set. The resulting effective
map becomes:

        0   1   2   3
    0   3   0   1   0
    1   0   3   0   1
Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
mm/mempolicy.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 90 insertions(+), 4 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a3f0fde6c626..541853ac08bc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -117,6 +117,7 @@
 #include <asm/tlb.h>
 #include <linux/uaccess.h>
 #include <linux/memory.h>
+#include <linux/memory-tiers.h>
 
 #include "internal.h"
@@ -2134,17 +2135,87 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+/**
+ * policy_resolve_package_nodes - Restrict policy nodes to the current package
+ * @policy: Target mempolicy whose user-selected nodes are in @policy->nodes.
+ * @mask: Output nodemask. On success, contains @policy->nodes limited to
+ *        the package that should be used for the allocation.
+ *
+ * This helper combines two constraints to decide where within a socket/package
+ * memory may be allocated:
+ *
+ * 1) The caller's package: derived via mp_get_package_nodes(numa_node_id()).
+ * 2) The user's preselected set @policy->nodes (cpusets/mempolicy).
+ *
+ * The function obtains the nodemask of the current CPU's package and
+ * intersects it with @policy->nodes. If the intersection is empty (e.g. the
+ * user excluded every node of the current package), it falls back to the
+ * first node in @policy->nodes, derives that node's package, and intersects
+ * again. If the fallback also yields an empty set, @mask stays empty and a
+ * non-zero error is returned.
+ *
+ * Examples (packages: P0={CPU:0, MEM:2}, P1={CPU:1, MEM:3}):
+ * - policy->nodes = {0,1,2,3}
+ *   on P0: mask = {0,2}; on P1: mask = {1,3}.
+ * - policy->nodes = {0,1,3}
+ *   on P0: mask = {0} (only node 0 from P0 is allowed).
+ * - policy->nodes = {1,2,3}
+ *   on P0: mask = {2} (only node 2 from P0 is allowed).
+ * - policy->nodes = {1,3}
+ *   on P0: (P0 nodes) & policy->nodes is empty -> fall back to node 1,
+ *   package(1)=P1, mask = {1,3}. (The user effectively opted out of P0.)
+ *
+ * Return:
+ * 0 on success with @mask set as above;
+ * -EINVAL if @policy or @mask is NULL, or if both intersections are empty;
+ * Propagated error from mp_get_package_nodes() on failure.
+ */
+static int policy_resolve_package_nodes(struct mempolicy *policy, nodemask_t *mask)
+{
+	int node, ret = 0;
+	nodemask_t package_mask;
+
+	if (!policy || !mask)
+		return -EINVAL;
+
+	nodes_clear(*mask);
+
+	node = numa_node_id();
+	ret = mp_get_package_nodes(node, &package_mask);
+	if (!ret) {
+		nodes_and(*mask, package_mask, policy->nodes);
+
+		if (nodes_empty(*mask)) {
+			node = first_node(policy->nodes);
+			ret = mp_get_package_nodes(node, &package_mask);
+			if (ret)
+				goto out;
+			nodes_and(*mask, package_mask, policy->nodes);
+			if (nodes_empty(*mask))
+				ret = -EINVAL;
+		}
+	}
+
+out:
+	return ret;
+}
+
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int node;
 	unsigned int cpuset_mems_cookie;
+	nodemask_t mask;
 
 retry:
 	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	node = current->il_prev;
-	if (!current->il_weight || !node_isset(node, policy->nodes)) {
-		node = next_node_in(node, policy->nodes);
+
+	if (policy_resolve_package_nodes(policy, &mask))
+		mask = policy->nodes;
+
+	if (!current->il_weight || !node_isset(node, mask)) {
+		node = next_node_in(node, mask);
 		if (read_mems_allowed_retry(cpuset_mems_cookie))
 			goto retry;
 		if (node == MAX_NUMNODES)
@@ -2237,6 +2308,21 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }
 
+static unsigned int read_once_policy_package_nodemask(struct mempolicy *pol,
+						      nodemask_t *mask)
+{
+	nodemask_t package_mask;
+
+	barrier();
+	if (policy_resolve_package_nodes(pol, &package_mask))
+		memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+	else
+		memcpy(mask, &package_mask, sizeof(nodemask_t));
+	barrier();
+
+	return nodes_weight(*mask);
+}
+
 static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	struct weighted_interleave_state *state;
@@ -2247,7 +2333,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	u8 weight;
 	int nid = 0;
 
-	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	nr_nodes = read_once_policy_package_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();
@@ -2691,7 +2777,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 	/* read the nodes onto the stack, retry if done during rebind */
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		nnodes = read_once_policy_nodemask(pol, &nodes);
+		nnodes = read_once_policy_package_nodemask(pol, &nodes);
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 
 	/* if the nodemask has become invalid, we cannot do anything */
--
2.34.1
Thread overview: 18+ messages
2026-03-16 5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
2026-03-18 12:22 ` Jonathan Cameron
2026-03-16 5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
2026-03-16 5:12 ` Rakie Kim [this message]
2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
2026-03-17 9:50 ` Rakie Kim
2026-03-16 15:19 ` Joshua Hahn
2026-03-16 19:45 ` Gregory Price
2026-03-17 11:50 ` Rakie Kim
2026-03-17 11:36 ` Rakie Kim
2026-03-18 12:02 ` Jonathan Cameron
2026-03-19 7:55 ` Rakie Kim
2026-03-20 16:56 ` Jonathan Cameron
2026-03-24 5:35 ` Rakie Kim
2026-03-25 12:33 ` Jonathan Cameron
2026-03-26 8:54 ` Rakie Kim